ripgrep crates/regex/src/config.rs: Code Companion

Reference code for the Regex Configuration lecture. Sections correspond to the lecture document.


Section 1: The Configuration Structure

/// Config represents the configuration of a regex matcher in this crate.
/// The configuration is itself a rough combination of the knobs found in
/// the `regex` crate itself, along with additional `grep-matcher` specific
/// options.
#[derive(Clone, Debug)]
pub(crate) struct Config {
    // Standard regex options
    pub(crate) case_insensitive: bool,
    pub(crate) case_smart: bool,           // ripgrep's intelligent case detection
    pub(crate) multi_line: bool,
    pub(crate) dot_matches_new_line: bool,
    pub(crate) swap_greed: bool,
    pub(crate) ignore_whitespace: bool,
    pub(crate) unicode: bool,
    pub(crate) octal: bool,

    // Resource limits
    pub(crate) size_limit: usize,
    pub(crate) dfa_size_limit: usize,
    pub(crate) nest_limit: u32,

    // ripgrep-specific options
    pub(crate) line_terminator: Option<LineTerminator>,
    pub(crate) ban: Option<u8>,            // byte to ban from matches
    pub(crate) crlf: bool,                 // Windows line ending support
    pub(crate) word: bool,                 // -w flag: word boundary matching
    pub(crate) fixed_strings: bool,        // -F flag: literal matching
    pub(crate) whole_line: bool,           // -x flag: full line matching
}

All fields use pub(crate) visibility, so code outside the crate cannot touch them directly; external callers set configuration through the builder API instead. The Clone and Debug derives let a configuration be duplicated (ConfiguredHIR in Section 5 stores its own Config by value) and printed while debugging.
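
The builder arrangement described above can be sketched in a few lines. This is a minimal illustration, not the crate's actual builder: MatcherBuilder, build_config, and the two-field Config are hypothetical stand-ins chosen to keep the example self-contained.

```rust
/// Simplified stand-in for the crate-private `Config` (only two fields kept).
#[derive(Clone, Debug, Default)]
pub struct Config {
    case_insensitive: bool,
    word: bool,
}

/// Hypothetical builder that owns a `Config` and exposes chainable setters.
#[derive(Clone, Debug, Default)]
pub struct MatcherBuilder {
    config: Config,
}

impl MatcherBuilder {
    pub fn new() -> MatcherBuilder {
        MatcherBuilder::default()
    }

    /// Each setter mutates the private config and returns `&mut Self`
    /// so that calls can be chained.
    pub fn case_insensitive(&mut self, yes: bool) -> &mut MatcherBuilder {
        self.config.case_insensitive = yes;
        self
    }

    pub fn word(&mut self, yes: bool) -> &mut MatcherBuilder {
        self.config.word = yes;
        self
    }

    /// Clone the accumulated config so the builder stays reusable.
    /// This is where the `Clone` derive earns its keep.
    pub fn build_config(&self) -> Config {
        self.config.clone()
    }
}
```

A caller would write something like `MatcherBuilder::new().case_insensitive(true).word(true).build_config()` and never name a Config field directly.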


Section 2: Default Values and Their Rationale

impl Default for Config {
    fn default() -> Config {
        Config {
            case_insensitive: false,
            case_smart: false,
            multi_line: false,
            dot_matches_new_line: false,
            swap_greed: false,
            ignore_whitespace: false,
            unicode: true,              // User-friendly default: Unicode support on
            octal: false,
            // These size limits are much bigger than what's in the regex
            // crate by default.
            size_limit: 100 * (1 << 20),       // 100 MB for NFA
            dfa_size_limit: 1000 * (1 << 20),  // 1 GB for lazy DFA cache
            nest_limit: 250,                    // Max parenthesis nesting depth
            line_terminator: None,
            ban: None,
            crlf: false,
            word: false,
            fixed_strings: false,
            whole_line: false,
        }
    }
}

The generous size limits (100MB NFA, 1GB DFA cache) reflect that ripgrep is an end-user tool where users expect to search complex patterns. The unicode: true default prioritizes correctness over performance.
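
The shift arithmetic behind those two limits is easy to check by hand. The sketch below spells it out; the constant and function names (MIB, SIZE_LIMIT, DFA_SIZE_LIMIT, as_mib) are made up for illustration and do not appear in the crate.

```rust
/// One mebibyte in bytes; `1 << 20` is 1,048,576.
const MIB: usize = 1 << 20;

/// The two byte budgets from the `Default` impl, as named constants.
const SIZE_LIMIT: usize = 100 * MIB; // 104,857,600 bytes
const DFA_SIZE_LIMIT: usize = 1000 * MIB; // 1,048,576,000 bytes

/// Render a byte count as whole mebibytes for display.
fn as_mib(bytes: usize) -> usize {
    bytes / MIB
}
```

Note that 1000 * (1 << 20) is 1000 MiB, slightly under a full GiB (1024 MiB); the "1 GB" label in the listing is a rounded figure.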


Section 3: Smart Case Detection

impl Config {
    /// Accounting for the `smart_case` config knob, return true if and only if
    /// this pattern should be matched case insensitively.
    fn is_case_insensitive(&self, analysis: &AstAnalysis) -> bool {
        // Explicit case-insensitive always wins
        if self.case_insensitive {
            return true;
        }
        // If smart case is off, don't do anything clever
        if !self.case_smart {
            return false;
        }
        // Smart case logic: case-insensitive if pattern has literals
        // but no uppercase characters
        // "foo" -> case insensitive (matches Foo, FOO)
        // "Foo" -> case sensitive (user explicitly used uppercase)
        analysis.any_literal() && !analysis.any_uppercase()
    }
}

The AstAnalysis parameter contains pre-computed information about the pattern's AST, avoiding repeated traversals. ConfiguredHIR::new (Section 6) calls this method while configuring the translator, so its result becomes the translator's case_insensitive setting.
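
The decision table is small enough to exercise in isolation. The sketch below is a simplification under a stated assumption: Analysis and from_plain_text are hypothetical, and they treat every character of the pattern as a literal, whereas the real AstAnalysis walks the parsed AST (so an escape such as \S would not count as an uppercase literal there).

```rust
/// Hypothetical stand-in for `AstAnalysis`: two facts about a pattern.
#[derive(Clone, Copy, Debug, Default)]
struct Analysis {
    any_literal: bool,
    any_uppercase: bool,
}

impl Analysis {
    /// Assumption: every character is a literal. The real analysis
    /// inspects the AST instead of the raw pattern text.
    fn from_plain_text(pattern: &str) -> Analysis {
        Analysis {
            any_literal: !pattern.is_empty(),
            any_uppercase: pattern.chars().any(char::is_uppercase),
        }
    }
}

/// Same three-step decision as `Config::is_case_insensitive`.
fn is_case_insensitive(
    case_insensitive: bool,
    case_smart: bool,
    analysis: Analysis,
) -> bool {
    if case_insensitive {
        return true; // explicit flag always wins
    }
    if !case_smart {
        return false; // nothing clever requested
    }
    analysis.any_literal && !analysis.any_uppercase
}
```

With smart case on, "foo" resolves to case-insensitive and "Foo" to case-sensitive; with the explicit flag on, both resolve to case-insensitive.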


Section 4: Fixed String Detection

impl Config {
    /// Returns whether the given patterns should be treated as "fixed strings"
    /// literals.
    fn is_fixed_strings<P: AsRef<str>>(&self, patterns: &[P]) -> bool {
        // Case transforms require full parsing pipeline
        if self.case_insensitive || self.case_smart {
            return false;
        }

        // User explicitly requested fixed strings (-F flag)
        if self.fixed_strings {
            // But bail if any literal contains a line terminator
            if let Some(lineterm) = self.line_terminator {
                for p in patterns.iter() {
                    if has_line_terminator(lineterm, p.as_ref()) {
                        return false;  // Let normal error handling catch this
                    }
                }
            }
            return true;
        }

        // Check if patterns are effectively literals (no metacharacters)
        for p in patterns.iter() {
            let p = p.as_ref();
            // Any metacharacter means we need full parsing
            if p.chars().any(regex_syntax::is_meta_character) {
                return false;
            }
            // Same line terminator check as above
            if let Some(lineterm) = self.line_terminator {
                if has_line_terminator(lineterm, p) {
                    return false;
                }
            }
        }
        true  // All patterns are plain literals
    }
}

The generic P: AsRef<str> bound allows this method to accept &[String], &[&str], or any other string-like slice. Whenever a pattern might not be a plain literal, the method returns false; that only forfeits the fast path, since the normal parsing pipeline still builds a correct matcher.
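
The final loop of the method can be approximated without any dependencies. In this sketch, is_meta and all_plain_literals are hypothetical names; is_meta is intended to mirror the character set of regex_syntax::is_meta_character, but treat the exact set as an assumption, and the line terminator is reduced to a single optional byte.

```rust
/// Approximation of `regex_syntax::is_meta_character`; the exact set of
/// characters is an assumption made for this sketch.
fn is_meta(c: char) -> bool {
    matches!(
        c,
        '\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']' | '{'
            | '}' | '^' | '$' | '#' | '&' | '-' | '~'
    )
}

/// True when every pattern is free of metacharacters and, if a terminator
/// byte is configured, free of that byte as well.
fn all_plain_literals<P: AsRef<str>>(
    patterns: &[P],
    line_terminator: Option<u8>,
) -> bool {
    patterns.iter().all(|p| {
        let p = p.as_ref();
        let has_meta = p.chars().any(is_meta);
        let has_term =
            line_terminator.map_or(false, |t| p.as_bytes().contains(&t));
        !has_meta && !has_term
    })
}
```

Because of the AsRef<str> bound, the same function accepts a slice of &str or a slice of String without conversion at the call site.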


Section 5: The ConfiguredHIR Type

/// A "configured" HIR expression, which is aware of the configuration which
/// produced this HIR.
#[derive(Clone, Debug)]
pub(crate) struct ConfiguredHIR {
    config: Config,
    hir: Hir,
}

impl ConfiguredHIR {
    /// Return a reference to the underlying configuration.
    pub(crate) fn config(&self) -> &Config {
        &self.config
    }

    /// Return a reference to the underlying HIR.
    pub(crate) fn hir(&self) -> &Hir {
        &self.hir
    }

    /// Turns this configured HIR into an equivalent one, but where it must
    /// match at word boundaries.
    pub(crate) fn into_word(self) -> ConfiguredHIR {
        // Consume self, wrap the HIR, return new ConfiguredHIR
        let hir = Hir::concat(vec![
            Hir::look(if self.config.unicode {
                hir::Look::WordStartHalfUnicode  // Unicode word boundary
            } else {
                hir::Look::WordStartHalfAscii    // ASCII word boundary
            }),
            self.hir,                            // Original pattern
            Hir::look(if self.config.unicode {
                hir::Look::WordEndHalfUnicode
            } else {
                hir::Look::WordEndHalfAscii
            }),
        ]);
        ConfiguredHIR { config: self.config, hir }
    }

    /// Turns this configured HIR into an equivalent one, but where it must
    /// match at the start and end of a line.
    pub(crate) fn into_whole_line(self) -> ConfiguredHIR {
        let line_anchor_start = Hir::look(self.line_anchor_start());
        let line_anchor_end = Hir::look(self.line_anchor_end());
        let hir =
            Hir::concat(vec![line_anchor_start, self.hir, line_anchor_end]);
        ConfiguredHIR { config: self.config, hir }
    }
}

The into_ methods take self by value (consuming ownership) and return a new ConfiguredHIR. Because the original Hir is moved into the new concatenation rather than cloned, the transformation copies nothing, and the compiler rejects any later use of the unwrapped value.
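
The same consume-and-rebuild shape can be shown with a toy expression type. Expr and Configured below are hypothetical stand-ins for Hir and ConfiguredHIR, kept deliberately tiny; only the ownership flow is meant to carry over.

```rust
/// Toy expression tree standing in for `Hir` (hypothetical).
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Literal(String),
    LineStart,
    LineEnd,
    Concat(Vec<Expr>),
}

/// Toy counterpart of `ConfiguredHIR`: one config flag plus an expression.
#[derive(Clone, Debug, PartialEq)]
struct Configured {
    crlf: bool,
    expr: Expr,
}

impl Configured {
    /// Consumes `self`, so the inner expression is moved into the new
    /// concatenation rather than cloned.
    fn into_whole_line(self) -> Configured {
        let expr =
            Expr::Concat(vec![Expr::LineStart, self.expr, Expr::LineEnd]);
        Configured { crlf: self.crlf, expr }
    }
}
```

After `let wrapped = original.into_whole_line();`, any further use of `original` is a compile error, which is the guarantee the by-value receiver buys.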


Section 6: Building HIR from Patterns

impl ConfiguredHIR {
    /// Parse the given patterns into a single HIR expression that represents
    /// an alternation of the patterns given.
    fn new<P: AsRef<str>>(
        config: Config,
        patterns: &[P],
    ) -> Result<ConfiguredHIR, Error> {
        let hir = if config.is_fixed_strings(patterns) {
            // Fast path: directly construct HIR from literal bytes
            let mut alts = vec![];
            for p in patterns.iter() {
                alts.push(Hir::literal(p.as_ref().as_bytes()));
            }
            log::debug!(
                "assembling HIR from {} fixed string literals",
                alts.len()
            );
            Hir::alternation(alts)
        } else {
            // Normal path: parse and translate through regex-syntax
            let mut alts = vec![];
            for p in patterns.iter() {
                alts.push(if config.fixed_strings {
                    // Escape metacharacters for -F with case transforms
                    format!("(?:{})", regex_syntax::escape(p.as_ref()))
                } else {
                    format!("(?:{})", p.as_ref())
                });
            }
            let pattern = alts.join("|");

            // Parse to AST
            let ast = ast::parse::ParserBuilder::new()
                .nest_limit(config.nest_limit)
                .octal(config.octal)
                .ignore_whitespace(config.ignore_whitespace)
                .build()
                .parse(&pattern)
                .map_err(Error::generic)?;

            // Analyze AST for smart case detection
            let analysis = AstAnalysis::from_ast(&ast);

            // Translate AST to HIR with configuration
            let mut hir = hir::translate::TranslatorBuilder::new()
                .utf8(false)  // Allow matching invalid UTF-8
                .case_insensitive(config.is_case_insensitive(&analysis))
                .multi_line(config.multi_line)
                .dot_matches_new_line(config.dot_matches_new_line)
                .crlf(config.crlf)
                .swap_greed(config.swap_greed)
                .unicode(config.unicode)
                .build()
                .translate(&pattern, &ast)
                .map_err(Error::generic)?;

            // Check for banned bytes (e.g., NUL)
            if let Some(byte) = config.ban {
                ban::check(&hir, byte)?;
            }

            // Strip line terminators from match if configured
            hir = match config.line_terminator {
                None => hir,
                Some(line_term) => strip_from_match(hir, line_term)?,
            };
            hir
        };
        Ok(ConfiguredHIR { config, hir })
    }
}

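The normal path shows the full journey from patterns to HIR: wrap each pattern in a non-capturing group and join them with |, parse to an AST, analyze for smart case, translate with the configuration, then apply post-processing (ban check and line terminator stripping). The fast path skips all of these steps and builds the alternation directly from the literal bytes.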
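
The pattern-joining step at the top of the normal path can be tried on its own. In this sketch, join_patterns and escape are hypothetical helpers; escape is a simplified substitute for regex_syntax::escape and assumes the same metacharacter set as regex_syntax::is_meta_character.

```rust
/// Simplified substitute for `regex_syntax::escape` (the character set is
/// an assumption): prefix each metacharacter with a backslash.
fn escape(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len());
    for c in pattern.chars() {
        let is_meta = matches!(
            c,
            '\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']'
                | '{' | '}' | '^' | '$' | '#' | '&' | '-' | '~'
        );
        if is_meta {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Wrap each pattern in a non-capturing group and join with `|`,
/// escaping first when fixed-string mode is on.
fn join_patterns<P: AsRef<str>>(patterns: &[P], fixed_strings: bool) -> String {
    let alts: Vec<String> = patterns
        .iter()
        .map(|p| {
            if fixed_strings {
                format!("(?:{})", escape(p.as_ref()))
            } else {
                format!("(?:{})", p.as_ref())
            }
        })
        .collect();
    alts.join("|")
}
```

The non-capturing group keeps each pattern self-contained: an inline flag such as (?i) at the start of one pattern stays scoped to its own group instead of leaking into the alternatives that follow.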


Section 7: Regex Compilation

impl ConfiguredHIR {
    /// Convert this HIR to a regex that can be used for matching.
    pub(crate) fn to_regex(&self) -> Result<Regex, Error> {
        let meta = Regex::config()
            .utf8_empty(false)
            .nfa_size_limit(Some(self.config.size_limit))
            // One-pass DFA gets extra room (10 MB)
            .onepass_size_limit(Some(10 * (1 << 20)))
            // Full DFA: small but larger than regex crate default
            .dfa_size_limit(Some(1 * (1 << 20)))
            .dfa_state_limit(Some(1_000))
            // Lazy DFA gets the big cache
            .hybrid_cache_capacity(self.config.dfa_size_limit);

        Regex::builder()
            .configure(meta)
            .build_from_hir(&self.hir)
            .map_err(Error::regex)
    }
}

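The configuration gives each regex engine strategy its own budget: the NFA uses config.size_limit, the one-pass DFA gets 10 MB, the full DFA is capped at 1 MB and 1,000 states, and the lazy DFA (hybrid) cache gets config.dfa_size_limit. The lazy DFA gets the largest allocation because it's ripgrep's primary matching strategy for most patterns.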


Section 8: Line Terminator Handling

impl ConfiguredHIR {
    /// Returns the line terminator configured on this expression.
    pub(crate) fn line_terminator(&self) -> Option<LineTerminator> {
        // Text anchors (\A, \z) disable fast line searching
        if self.hir.properties().look_set().contains_anchor_haystack() {
            None
        } else {
            self.config.line_terminator
        }
    }

    /// Returns the "start line" anchor for this configuration.
    fn line_anchor_start(&self) -> hir::Look {
        if self.config.crlf {
            hir::Look::StartCRLF   // Windows: handle \r\n
        } else {
            hir::Look::StartLF    // Unix: just \n
        }
    }

    /// Returns the "end line" anchor for this configuration.
    fn line_anchor_end(&self) -> hir::Look {
        if self.config.crlf {
            hir::Look::EndCRLF     // Windows: handle \r\n
        } else {
            hir::Look::EndLF      // Unix: just \n
        }
    }
}

/// Returns true if the given literal string contains any byte from the line
/// terminator given.
fn has_line_terminator(lineterm: LineTerminator, literal: &str) -> bool {
    if lineterm.is_crlf() {
        // CRLF mode: check for both \r and \n
        literal.as_bytes().iter().copied().any(|b| b == b'\r' || b == b'\n')
    } else {
        // Single byte terminator
        literal.as_bytes().iter().copied().any(|b| b == lineterm.as_byte())
    }
}

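The line_terminator() method demonstrates a subtle correctness concern: text anchors (\A and \z, as well as ^ and $ when multi-line mode is off) refer to the start and end of the whole haystack, which interacts poorly with line-by-line searching. When the HIR contains such an anchor, the method returns None to disable the line-oriented optimization.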
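
The byte check in has_line_terminator above can be reproduced with a small stand-in type. Terminator and has_terminator are hypothetical; the real LineTerminator type comes from the matcher crate and is queried through is_crlf() and as_byte() as shown in the listing.

```rust
/// Hypothetical stand-in for `LineTerminator`.
#[derive(Clone, Copy, Debug)]
enum Terminator {
    /// A single terminator byte, e.g. b'\n' or NUL.
    Byte(u8),
    /// CRLF mode: either b'\r' or b'\n' counts as part of the terminator.
    Crlf,
}

/// True if `literal` contains any byte belonging to the terminator.
fn has_terminator(term: Terminator, literal: &str) -> bool {
    match term {
        Terminator::Crlf => {
            literal.bytes().any(|b| b == b'\r' || b == b'\n')
        }
        Terminator::Byte(t) => literal.bytes().any(|b| b == t),
    }
}
```

In CRLF mode a lone \r is enough to disqualify a literal, while with a plain \n terminator the same literal passes.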


Quick Reference

Config Fields by Category

Category          Fields
Case handling     case_insensitive, case_smart
Line mode         multi_line, dot_matches_new_line, crlf
Pattern syntax    ignore_whitespace, unicode, octal, swap_greed
Resource limits   size_limit, dfa_size_limit, nest_limit
ripgrep features  line_terminator, ban, word, fixed_strings, whole_line

Key Methods

Method                            Purpose
Config::build_many()              Entry point: patterns → ConfiguredHIR
Config::is_case_insensitive()     Resolve smart case logic
Config::is_fixed_strings()        Detect fast path eligibility
ConfiguredHIR::new()              Full compilation pipeline
ConfiguredHIR::to_regex()         HIR → executable Regex
ConfiguredHIR::into_word()        Add word boundary assertions
ConfiguredHIR::into_whole_line()  Add line anchor assertions

Compilation Pipeline

patterns: &[P]
    ├─[fast path]────→ Hir::literal() for each ──→ Hir::alternation()
    └─[normal path]──→ AST parse ──→ AstAnalysis ──→ HIR translate
                                                          ├──→ ban check
                                                          └──→ strip line terms
    (either path)
         │
         ▼
    ConfiguredHIR
         │  to_regex()
         ▼
    Regex (matcher)