ripgrep crates/regex/src/config.rs: Code Companion
Reference code for the Regex Configuration lecture. Sections correspond to the lecture document.
Section 1: The Configuration Structure
/// Config represents the configuration of a regex matcher in this crate.
/// The configuration is itself a rough combination of the knobs found in
/// the `regex` crate itself, along with additional `grep-matcher` specific
/// options.
#[derive(Clone, Debug)]
pub(crate) struct Config {
    // Standard regex options
    pub(crate) case_insensitive: bool,
    pub(crate) case_smart: bool, // ripgrep's intelligent case detection
    pub(crate) multi_line: bool,
    pub(crate) dot_matches_new_line: bool,
    pub(crate) swap_greed: bool,
    pub(crate) ignore_whitespace: bool,
    pub(crate) unicode: bool,
    pub(crate) octal: bool,
    // Resource limits
    pub(crate) size_limit: usize,
    pub(crate) dfa_size_limit: usize,
    pub(crate) nest_limit: u32,
    // ripgrep-specific options
    pub(crate) line_terminator: Option<LineTerminator>,
    pub(crate) ban: Option<u8>, // byte to ban from matches
    pub(crate) crlf: bool, // Windows line ending support
    pub(crate) word: bool, // -w flag: word boundary matching
    pub(crate) fixed_strings: bool, // -F flag: literal matching
    pub(crate) whole_line: bool, // -x flag: full line matching
}
All fields use pub(crate) visibility, so code outside the crate cannot touch them directly; configuration is set through the builder API. The Clone derive lets a configuration be duplicated (ConfiguredHIR::new takes a Config by value and stores it), and Debug makes it printable for diagnostics.
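As a minimal sketch of the builder idea, assuming a cut-down two-field Config and a hypothetical MatcherBuilder type (neither is the crate's real definition), the setters below are the only way to change the private configuration, and snapshot shows where Clone comes in:

```rust
/// Simplified stand-in for the crate-private `Config` (two fields only).
#[derive(Clone, Debug, Default)]
struct Config {
    case_insensitive: bool,
    case_smart: bool,
}

/// Hypothetical builder: the only public way to change `Config` in this sketch.
#[derive(Clone, Debug, Default)]
pub struct MatcherBuilder {
    config: Config,
}

impl MatcherBuilder {
    pub fn new() -> MatcherBuilder {
        MatcherBuilder::default()
    }

    /// Each setter mutates the private config and returns `&mut Self`
    /// so that calls can be chained.
    pub fn case_insensitive(&mut self, yes: bool) -> &mut MatcherBuilder {
        self.config.case_insensitive = yes;
        self
    }

    pub fn case_smart(&mut self, yes: bool) -> &mut MatcherBuilder {
        self.config.case_smart = yes;
        self
    }

    /// Hand out an independent copy of the current configuration.
    /// This is the step that requires `Config: Clone`.
    fn snapshot(&self) -> Config {
        self.config.clone()
    }
}
```

Because snapshot clones, later setter calls on the builder do not affect configurations that were handed out earlier.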
Section 2: Default Values and Their Rationale
impl Default for Config {
    fn default() -> Config {
        Config {
            case_insensitive: false,
            case_smart: false,
            multi_line: false,
            dot_matches_new_line: false,
            swap_greed: false,
            ignore_whitespace: false,
            unicode: true, // User-friendly default: Unicode support on
            octal: false,
            // These size limits are much bigger than what's in the regex
            // crate by default.
            size_limit: 100 * (1 << 20), // 100 MB for NFA
            dfa_size_limit: 1000 * (1 << 20), // 1 GB for lazy DFA cache
            nest_limit: 250, // Max parenthesis nesting depth
            line_terminator: None,
            ban: None,
            crlf: false,
            word: false,
            fixed_strings: false,
            whole_line: false,
        }
    }
}
The generous size limits (100 MiB for the NFA and 1000 MiB, roughly 1 GB, for the lazy DFA cache) reflect that ripgrep is an end-user tool whose users expect complex patterns to work without tuning. The unicode: true default prioritizes correctness over performance.
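The shift expressions in the defaults are easy to misread, so here is the arithmetic spelled out. The constant names and the as_mib helper are illustrative only; the values are the ones from the listing above:

```rust
/// One mebibyte, written the same way as in the defaults: 2^20 bytes.
const MIB: usize = 1 << 20;

/// Default NFA size limit from the listing: 100 MiB.
const SIZE_LIMIT: usize = 100 * MIB;

/// Default lazy DFA cache limit from the listing: 1000 MiB (about 1 GB).
const DFA_SIZE_LIMIT: usize = 1000 * MIB;

/// Render a byte count as whole mebibytes (hypothetical helper).
fn as_mib(bytes: usize) -> usize {
    bytes / MIB
}
```

So size_limit is 104,857,600 bytes and dfa_size_limit is 1,048,576,000 bytes.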
Section 3: Smart Case Detection
impl Config {
    /// Accounting for the `smart_case` config knob, return true if and only if
    /// this pattern should be matched case insensitively.
    fn is_case_insensitive(&self, analysis: &AstAnalysis) -> bool {
        // Explicit case-insensitive always wins
        if self.case_insensitive {
            return true;
        }
        // If smart case is off, don't do anything clever
        if !self.case_smart {
            return false;
        }
        // Smart case logic: case-insensitive if pattern has literals
        // but no uppercase characters
        // "foo" -> case insensitive (matches Foo, FOO)
        // "Foo" -> case sensitive (user explicitly used uppercase)
        analysis.any_literal() && !analysis.any_uppercase()
    }
}
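The AstAnalysis parameter carries information computed once from the pattern's AST (whether it contains any literal, and whether any of them is uppercase), so this method needs no traversal of its own. ConfiguredHIR::new calls it while configuring the AST-to-HIR translator to settle the final case sensitivity setting.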
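The decision can be exercised in isolation with a stand-in for AstAnalysis. In the sketch below, the Analysis struct and its from_plain_pattern constructor are hypothetical: the constructor scans raw characters, whereas the real analysis walks the parsed AST and so treats escapes such as \W correctly.

```rust
/// Hypothetical stand-in for `AstAnalysis`: two precomputed facts.
#[derive(Clone, Copy, Debug)]
struct Analysis {
    any_literal: bool,
    any_uppercase: bool,
}

impl Analysis {
    /// Naive character scan; only suitable for patterns without escapes.
    fn from_plain_pattern(pattern: &str) -> Analysis {
        Analysis {
            any_literal: pattern.chars().any(|c| c.is_alphanumeric()),
            any_uppercase: pattern.chars().any(|c| c.is_uppercase()),
        }
    }
}

/// Same three-step decision as `Config::is_case_insensitive`, with the
/// two relevant config fields passed in as plain booleans.
fn is_case_insensitive(
    case_insensitive: bool,
    case_smart: bool,
    analysis: Analysis,
) -> bool {
    if case_insensitive {
        return true;
    }
    if !case_smart {
        return false;
    }
    analysis.any_literal && !analysis.any_uppercase
}
```

With smart case on, "foo" resolves to case-insensitive and "Foo" to case-sensitive; a pattern with no literals at all, such as ".*", stays case-sensitive.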
Section 4: Fixed String Detection
impl Config {
    /// Returns whether the given patterns should be treated as "fixed strings"
    /// literals.
    fn is_fixed_strings<P: AsRef<str>>(&self, patterns: &[P]) -> bool {
        // Case transforms require full parsing pipeline
        if self.case_insensitive || self.case_smart {
            return false;
        }
        // User explicitly requested fixed strings (-F flag)
        if self.fixed_strings {
            // But bail if any literal contains a line terminator
            if let Some(lineterm) = self.line_terminator {
                for p in patterns.iter() {
                    if has_line_terminator(lineterm, p.as_ref()) {
                        return false; // Let normal error handling catch this
                    }
                }
            }
            return true;
        }
        // Check if patterns are effectively literals (no metacharacters)
        for p in patterns.iter() {
            let p = p.as_ref();
            // Any metacharacter means we need full parsing
            if p.chars().any(regex_syntax::is_meta_character) {
                return false;
            }
            // Same line terminator check as above
            if let Some(lineterm) = self.line_terminator {
                if has_line_terminator(lineterm, p) {
                    return false;
                }
            }
        }
        true // All patterns are plain literals
    }
}
The generic P: AsRef<str> bound lets this method accept &[String], &[&str], or any other slice of string-like values. Whenever there is doubt, the method returns false: the full parsing path is always correct, while the fast path is only an optimization.
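The implicit-literal branch (the case where -F was not given) can be sketched on its own. Here is_meta is a hypothetical stand-in for regex_syntax::is_meta_character with an assumed character list, and the line terminator is simplified to a single optional byte:

```rust
/// Hypothetical stand-in for `regex_syntax::is_meta_character`.
/// The character list is an assumption made for this sketch.
fn is_meta(c: char) -> bool {
    matches!(
        c,
        '\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']'
            | '{' | '}' | '^' | '$' | '#' | '&' | '-' | '~'
    )
}

/// Returns true only if every pattern is a plain literal: no
/// metacharacters and, when a terminator is configured, no terminator byte.
fn all_plain_literals<P: AsRef<str>>(
    patterns: &[P],
    line_terminator: Option<u8>,
) -> bool {
    patterns.iter().all(|p| {
        let p = p.as_ref();
        let has_meta = p.chars().any(is_meta);
        let has_term =
            line_terminator.map_or(false, |t| p.as_bytes().contains(&t));
        !has_meta && !has_term
    })
}
```

Thanks to the AsRef<str> bound, the same function accepts both a slice of &str and a Vec<String> borrowed as a slice.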
Section 5: The ConfiguredHIR Type
/// A "configured" HIR expression, which is aware of the configuration which
/// produced this HIR.
#[derive(Clone, Debug)]
pub(crate) struct ConfiguredHIR {
    config: Config,
    hir: Hir,
}
impl ConfiguredHIR {
    /// Return a reference to the underlying configuration.
    pub(crate) fn config(&self) -> &Config {
        &self.config
    }
    /// Return a reference to the underlying HIR.
    pub(crate) fn hir(&self) -> &Hir {
        &self.hir
    }
    /// Turns this configured HIR into an equivalent one, but where it must
    /// match at word boundaries.
    pub(crate) fn into_word(self) -> ConfiguredHIR {
        // Consume self, wrap the HIR, return new ConfiguredHIR
        let hir = Hir::concat(vec![
            Hir::look(if self.config.unicode {
                hir::Look::WordStartHalfUnicode // Unicode word boundary
            } else {
                hir::Look::WordStartHalfAscii // ASCII word boundary
            }),
            self.hir, // Original pattern
            Hir::look(if self.config.unicode {
                hir::Look::WordEndHalfUnicode
            } else {
                hir::Look::WordEndHalfAscii
            }),
        ]);
        ConfiguredHIR { config: self.config, hir }
    }
    /// Turns this configured HIR into an equivalent one, but where it must
    /// match at the start and end of a line.
    pub(crate) fn into_whole_line(self) -> ConfiguredHIR {
        let line_anchor_start = Hir::look(self.line_anchor_start());
        let line_anchor_end = Hir::look(self.line_anchor_end());
        let hir =
            Hir::concat(vec![line_anchor_start, self.hir, line_anchor_end]);
        ConfiguredHIR { config: self.config, hir }
    }
}
The into_ methods take self by value (consuming ownership) and return a new ConfiguredHIR. This ownership-based transformation is idiomatic Rust: the original is explicitly moved into the modified version, so no stale copy remains usable afterwards.
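The consuming-transformation pattern does not depend on regex-syntax. The following sketch uses a toy Expr enum as a hypothetical stand-in for Hir and keeps only one config field, so it shows the ownership flow rather than real HIR construction:

```rust
/// Toy expression tree; a hypothetical stand-in for `regex_syntax::hir::Hir`.
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Literal(String),
    WordStart,
    WordEnd,
    Concat(Vec<Expr>),
}

/// Toy counterpart of `ConfiguredHIR`: a setting plus the expression.
#[derive(Clone, Debug, PartialEq)]
struct Configured {
    unicode: bool,
    expr: Expr,
}

impl Configured {
    /// Consumes `self`, moves the inner expression into a new
    /// concatenation, and carries the configuration over unchanged.
    fn into_word(self) -> Configured {
        let expr =
            Expr::Concat(vec![Expr::WordStart, self.expr, Expr::WordEnd]);
        Configured { unicode: self.unicode, expr }
    }
}
```

After calling into_word, the original binding is moved and the compiler rejects any further use of it, which is what keeps the wrapped and unwrapped forms from being confused.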
Section 6: Building HIR from Patterns
impl ConfiguredHIR {
    /// Parse the given patterns into a single HIR expression that represents
    /// an alternation of the patterns given.
    fn new<P: AsRef<str>>(
        config: Config,
        patterns: &[P],
    ) -> Result<ConfiguredHIR, Error> {
        let hir = if config.is_fixed_strings(patterns) {
            // Fast path: directly construct HIR from literal bytes
            let mut alts = vec![];
            for p in patterns.iter() {
                alts.push(Hir::literal(p.as_ref().as_bytes()));
            }
            log::debug!(
                "assembling HIR from {} fixed string literals",
                alts.len()
            );
            Hir::alternation(alts)
        } else {
            // Normal path: parse and translate through regex-syntax
            let mut alts = vec![];
            for p in patterns.iter() {
                alts.push(if config.fixed_strings {
                    // Escape metacharacters for -F with case transforms
                    format!("(?:{})", regex_syntax::escape(p.as_ref()))
                } else {
                    format!("(?:{})", p.as_ref())
                });
            }
            let pattern = alts.join("|");
            // Parse to AST
            let ast = ast::parse::ParserBuilder::new()
                .nest_limit(config.nest_limit)
                .octal(config.octal)
                .ignore_whitespace(config.ignore_whitespace)
                .build()
                .parse(&pattern)
                .map_err(Error::generic)?;
            // Analyze AST for smart case detection
            let analysis = AstAnalysis::from_ast(&ast);
            // Translate AST to HIR with configuration
            let mut hir = hir::translate::TranslatorBuilder::new()
                .utf8(false) // Allow matching invalid UTF-8
                .case_insensitive(config.is_case_insensitive(&analysis))
                .multi_line(config.multi_line)
                .dot_matches_new_line(config.dot_matches_new_line)
                .crlf(config.crlf)
                .swap_greed(config.swap_greed)
                .unicode(config.unicode)
                .build()
                .translate(&pattern, &ast)
                .map_err(Error::generic)?;
            // Check for banned bytes (e.g., NUL)
            if let Some(byte) = config.ban {
                ban::check(&hir, byte)?;
            }
            // Strip line terminators from match if configured
            hir = match config.line_terminator {
                None => hir,
                Some(line_term) => strip_from_match(hir, line_term)?,
            };
            hir
        };
        Ok(ConfiguredHIR { config, hir })
    }
}
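The normal path shows the full journey from patterns to HIR: wrap each pattern in a non-capturing group and join the groups with |, parse to an AST, analyze for smart case, translate with the configuration, then post-process (banned-byte check, line terminator stripping). The fast path skips all of these steps for pure literals and builds the alternation directly from the pattern bytes.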
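The joining step is plain string manipulation and can be tried without regex-syntax. In this sketch, escape is a hypothetical stand-in for regex_syntax::escape that handles only an assumed set of metacharacters, and join_patterns is an illustrative name for the loop at the top of the normal path:

```rust
/// Hypothetical stand-in for `regex_syntax::escape`: backslash-escape a
/// small, assumed set of metacharacters.
fn escape(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len());
    for c in pattern.chars() {
        if matches!(
            c,
            '\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']'
                | '{' | '}' | '^' | '$'
        ) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Wrap each pattern in a non-capturing group and join with `|`.
/// With `fixed_strings` set, each pattern is escaped first.
fn join_patterns<P: AsRef<str>>(patterns: &[P], fixed_strings: bool) -> String {
    let alts: Vec<String> = patterns
        .iter()
        .map(|p| {
            if fixed_strings {
                format!("(?:{})", escape(p.as_ref()))
            } else {
                format!("(?:{})", p.as_ref())
            }
        })
        .collect();
    alts.join("|")
}
```

One likely benefit of the (?:...) wrapper is scoping: an inline flag such as (?i) in one pattern stays inside its own group instead of leaking into the alternatives that follow it.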
Section 7: Regex Compilation
impl ConfiguredHIR {
    /// Convert this HIR to a regex that can be used for matching.
    pub(crate) fn to_regex(&self) -> Result<Regex, Error> {
        let meta = Regex::config()
            .utf8_empty(false)
            .nfa_size_limit(Some(self.config.size_limit))
            // One-pass DFA gets extra room (10 MB)
            .onepass_size_limit(Some(10 * (1 << 20)))
            // Full DFA: small but larger than regex crate default
            .dfa_size_limit(Some(1 * (1 << 20)))
            .dfa_state_limit(Some(1_000))
            // Lazy DFA gets the big cache
            .hybrid_cache_capacity(self.config.dfa_size_limit);
        Regex::builder()
            .configure(meta)
            .build_from_hir(&self.hir)
            .map_err(Error::regex)
    }
}
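The configuration sets a separate limit for each regex engine strategy. Only the NFA size limit and the lazy DFA cache capacity come from Config; the one-pass and full DFA limits are hard-coded in this method. The lazy DFA (hybrid) gets the largest allocation because it's ripgrep's primary matching strategy for most patterns.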
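To see which numbers are fixed and which follow the user's configuration, the limits can be collected into a plain struct. EngineLimits and engine_limits are hypothetical names used only for this sketch; the values mirror the to_regex listing above:

```rust
/// One mebibyte: 2^20 bytes.
const MIB: usize = 1 << 20;

/// Hypothetical summary of the limits passed to the regex engine.
#[derive(Debug, PartialEq)]
struct EngineLimits {
    nfa_size_limit: usize,
    onepass_size_limit: usize,
    dfa_size_limit: usize,
    dfa_state_limit: usize,
    hybrid_cache_capacity: usize,
}

/// Derive the engine limits from the two configurable values.
/// Everything else is a constant, as in the listing.
fn engine_limits(size_limit: usize, dfa_size_limit: usize) -> EngineLimits {
    EngineLimits {
        nfa_size_limit: size_limit,            // from Config::size_limit
        onepass_size_limit: 10 * MIB,          // fixed: 10 MiB
        dfa_size_limit: MIB,                   // fixed: 1 MiB
        dfa_state_limit: 1_000,                // fixed: 1,000 states
        hybrid_cache_capacity: dfa_size_limit, // from Config::dfa_size_limit
    }
}
```

Note the naming mismatch this makes visible: Config's dfa_size_limit feeds the lazy DFA cache capacity, not the full DFA's size limit.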
Section 8: Line Terminator Handling
impl ConfiguredHIR {
    /// Returns the line terminator configured on this expression.
    pub(crate) fn line_terminator(&self) -> Option<LineTerminator> {
        // Text anchors (\A, \z) disable fast line searching
        if self.hir.properties().look_set().contains_anchor_haystack() {
            None
        } else {
            self.config.line_terminator
        }
    }
    /// Returns the "start line" anchor for this configuration.
    fn line_anchor_start(&self) -> hir::Look {
        if self.config.crlf {
            hir::Look::StartCRLF // Windows: handle \r\n
        } else {
            hir::Look::StartLF // Unix: just \n
        }
    }
    /// Returns the "end line" anchor for this configuration.
    fn line_anchor_end(&self) -> hir::Look {
        if self.config.crlf {
            hir::Look::EndCRLF
        } else {
            hir::Look::EndLF
        }
    }
}
/// Returns true if the given literal string contains any byte from the line
/// terminator given.
fn has_line_terminator(lineterm: LineTerminator, literal: &str) -> bool {
    if lineterm.is_crlf() {
        // CRLF mode: check for both \r and \n
        literal.as_bytes().iter().copied().any(|b| b == b'\r' || b == b'\n')
    } else {
        // Single byte terminator
        literal.as_bytes().iter().copied().any(|b| b == lineterm.as_byte())
    }
}
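The line_terminator() method demonstrates a subtle correctness concern: text anchors (\A and \z, which is also what ^ and $ mean without multi-line mode) refer to the start and end of the whole haystack rather than of a line. Searching one line at a time would change what they match, so the method returns None to disable the line-oriented optimization in those cases.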
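Both checks in this section can be sketched without grep-matcher. LineTerm below is a hypothetical stand-in for LineTerminator, and the has_text_anchor flag stands in for the look-set query on the HIR:

```rust
/// Hypothetical stand-in for `grep_matcher::LineTerminator`.
#[derive(Clone, Copy, Debug, PartialEq)]
enum LineTerm {
    /// A single terminator byte, such as `\n` or NUL.
    Byte(u8),
    /// CRLF mode: both `\r` and `\n` count as terminator bytes.
    Crlf,
}

/// Returns true if `literal` contains any byte of the terminator.
fn has_line_terminator(lineterm: LineTerm, literal: &str) -> bool {
    match lineterm {
        LineTerm::Crlf => literal.bytes().any(|b| b == b'\r' || b == b'\n'),
        LineTerm::Byte(t) => literal.bytes().any(|b| b == t),
    }
}

/// Report the configured terminator only when the expression has no
/// text anchors; otherwise line-oriented searching is switched off.
fn effective_line_terminator(
    configured: Option<LineTerm>,
    has_text_anchor: bool,
) -> Option<LineTerm> {
    if has_text_anchor {
        None
    } else {
        configured
    }
}
```

In single-byte mode with \n as the terminator, a literal containing only \r is still acceptable, while in CRLF mode it is not.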
Quick Reference
Config Fields by Category
| Category | Fields |
|---|---|
| Case handling | case_insensitive, case_smart |
| Line mode | multi_line, dot_matches_new_line, crlf |
| Pattern syntax | ignore_whitespace, unicode, octal, swap_greed |
| Resource limits | size_limit, dfa_size_limit, nest_limit |
| ripgrep features | line_terminator, ban, word, fixed_strings, whole_line |
Key Methods
| Method | Purpose |
|---|---|
| Config::build_many() | Entry point: patterns → ConfiguredHIR |
| Config::is_case_insensitive() | Resolve smart case logic |
| Config::is_fixed_strings() | Detect fast path eligibility |
| ConfiguredHIR::new() | Full compilation pipeline |
| ConfiguredHIR::to_regex() | HIR → executable Regex |
| ConfiguredHIR::into_word() | Add word boundary assertions |
| ConfiguredHIR::into_whole_line() | Add line anchor assertions |