ripgrep crates/regex/src/matcher.rs: Code Companion¶
Reference code for the Implementing Matcher lecture. Sections correspond to the lecture document.
Section 1: The Builder Pattern Completed¶
/// A builder for constructing a `Matcher` using regular expressions.
///
/// This builder re-exports many of the same options found on the regex crate's
/// builder, in addition to a few other options such as smart case, word
/// matching and the ability to set a line terminator which may enable certain
/// types of optimizations.
#[derive(Clone, Debug)] // Clone enables reuse, Debug enables logging
pub struct RegexMatcherBuilder {
config: Config, // Delegates storage to Config struct
}
impl Default for RegexMatcherBuilder {
fn default() -> RegexMatcherBuilder {
RegexMatcherBuilder::new()
}
}
impl RegexMatcherBuilder {
/// Create a new builder for configuring a regex matcher.
pub fn new() -> RegexMatcherBuilder {
RegexMatcherBuilder { config: Config::default() }
}
}
The builder holds a single Config field, creating a clean separation between the fluent API (builder methods) and the actual configuration storage. The Default trait delegates to new(), ensuring consistent initialization.
Section 2: From Pattern to Matcher¶
/// Build a new matcher using the current configuration for the provided
/// pattern.
pub fn build(&self, pattern: &str) -> Result<RegexMatcher, Error> {
self.build_many(&[pattern]) // Single pattern delegates to multi-pattern
}
/// Build a new matcher using the current configuration for the provided
/// patterns. The resulting matcher behaves as if all of the patterns
/// given are joined together into a single alternation.
pub fn build_many<P: AsRef<str>>(
&self,
patterns: &[P],
) -> Result<RegexMatcher, Error> {
// Step 1: Build configured HIR from patterns
let mut chir = self.config.build_many(patterns)?;
// Step 2: Apply whole-line or word transformations
// 'whole_line' is a strict subset of 'word', so when it is enabled,
// we don't need to both with any specific to word matching.
if chir.config().whole_line {
chir = chir.into_whole_line();
} else if chir.config().word {
chir = chir.into_word();
}
// Step 3: Compile HIR to actual regex automaton
let regex = chir.to_regex()?;
log::trace!("final regex: {:?}", chir.hir().to_string());
// ... optimization setup continues in next section
}
The generic bound P: AsRef<str> allows passing &[&str], &[String], or any slice of string-like types. The into_whole_line() and into_word() methods consume and transform the HIR.
Section 3: The Fast-Line Optimization¶
pub fn build_many<P: AsRef<str>>(
&self,
patterns: &[P],
) -> Result<RegexMatcher, Error> {
// ... pattern compilation from previous section ...
let non_matching_bytes = chir.non_matching_bytes();
// If we can pick out some literals from the regex, then we might be
// able to build a faster regex that quickly identifies candidate
// matching lines. For example, for a regex like `\w+foo\w+`, we can
// look for `foo`, and when a match is found, look for the line
// containing `foo` and then run the original regex on only that line.
let fast_line_regex = InnerLiterals::new(&chir, ®ex).one_regex()?;
// We override the line terminator in case the configured HIR doesn't
// support it.
let mut config = self.config.clone();
config.line_terminator = chir.line_terminator();
// Construct the final matcher with all four components
Ok(RegexMatcher {
config, // Preserved for line_terminator queries
regex, // The main matching engine
fast_line_regex, // Optional optimization (may be None)
non_matching_bytes // Bytes that can never appear in matches
})
}
The Option<Regex> type for fast_line_regex makes the optimization's optionality explicit—callers must handle both cases.
Section 4: Builder Methods as Documentation¶
/// Whether to enable "smart case" or not.
///
/// When smart case is enabled, the builder will automatically enable
/// case insensitive matching based on how the pattern is written. Namely,
/// case insensitive mode is enabled when both of the following things
/// are true:
///
/// 1. The pattern contains at least one literal character.
/// 2. Of the literals in the pattern, none of them are considered to be
/// uppercase according to Unicode.
pub fn case_smart(&mut self, yes: bool) -> &mut RegexMatcherBuilder {
self.config.case_smart = yes;
self // Returns &mut Self for method chaining
}
/// Set an ASCII line terminator for the matcher.
///
/// The purpose of setting a line terminator is to enable a certain class
/// of optimizations that can make line oriented searching faster. Namely,
/// when a line terminator is enabled, then the builder will guarantee that
/// the resulting matcher will never be capable of producing a match that
/// contains the line terminator.
///
/// If the aforementioned guarantee cannot be made because of how the pattern
/// was written, then the builder will return an error. For example, the
/// pattern `a\nb` will result in an error since the `\n` cannot be easily
/// removed without changing the fundamental intent of the pattern.
pub fn line_terminator(
&mut self,
line_term: Option<u8>,
) -> &mut RegexMatcherBuilder {
self.config.line_terminator = line_term.map(LineTerminator::byte);
self
}
/// Require that all matches occur on word boundaries.
///
/// Enabling this option is subtly different than putting `\b` assertions
/// on both sides of your pattern. A `\b` assertion requires that one side
/// match a word character while the other match a non-word character.
/// This option merely requires that one side match a non-word character.
pub fn word(&mut self, yes: bool) -> &mut RegexMatcherBuilder {
self.config.word = yes;
self
}
Each builder method modifies the config and returns &mut Self, enabling fluent chaining like builder.case_smart(true).word(true).build(pattern).
Section 5: The RegexMatcher Structure¶
/// An implementation of the `Matcher` trait using Rust's standard regex
/// library.
#[derive(Clone, Debug)]
pub struct RegexMatcher {
/// The configuration specified by the caller.
config: Config,
/// The regular expression compiled from the pattern provided by the
/// caller.
regex: Regex, // regex_automata::meta::Regex
/// A regex that never reports false negatives but may report false
/// positives that is believed to be capable of being matched more quickly
/// than `regex`. Typically, this is a single literal or an alternation
/// of literals.
fast_line_regex: Option<Regex>,
/// A set of bytes that will never appear in a match.
non_matching_bytes: ByteSet,
}
impl RegexMatcher {
/// Create a new matcher from the given pattern using the default
/// configuration.
pub fn new(pattern: &str) -> Result<RegexMatcher, Error> {
RegexMatcherBuilder::new().build(pattern)
}
/// Create a new matcher from the given pattern using the default
/// configuration, but matches lines terminated by `\n`.
///
/// This is meant to be a convenience constructor for enabling
/// line-oriented optimizations.
pub fn new_line_matcher(pattern: &str) -> Result<RegexMatcher, Error> {
RegexMatcherBuilder::new().line_terminator(Some(b'\n')).build(pattern)
}
}
The four fields each serve distinct purposes: config for queries, regex for matching, fast_line_regex for optimization, and non_matching_bytes for skipping.
Section 6: Implementing the Matcher Trait¶
use grep_matcher::{
ByteSet, Captures, LineMatchKind, LineTerminator, Match, Matcher, NoError,
};
impl Matcher for RegexMatcher {
type Captures = RegexCaptures;
type Error = NoError; // Regex operations are infallible once compiled
#[inline]
fn find_at(
&self,
haystack: &[u8],
at: usize,
) -> Result<Option<Match>, NoError> {
// Create input spanning from 'at' to end of haystack
let input = Input::new(haystack).span(at..haystack.len());
// Convert regex_automata::Match to grep_matcher::Match
Ok(self.regex.find(input).map(|m| Match::new(m.start(), m.end())))
}
#[inline]
fn new_captures(&self) -> Result<RegexCaptures, NoError> {
Ok(RegexCaptures::new(self.regex.create_captures()))
}
#[inline]
fn capture_count(&self) -> usize {
self.regex.captures_len()
}
#[inline]
fn capture_index(&self, name: &str) -> Option<usize> {
// Look up named capture group in pattern 0
self.regex.group_info().to_index(PatternID::ZERO, name)
}
#[inline]
fn non_matching_bytes(&self) -> Option<&ByteSet> {
Some(&self.non_matching_bytes)
}
#[inline]
fn line_terminator(&self) -> Option<LineTerminator> {
self.config.line_terminator
}
}
The associated type Error = NoError indicates that regex operations cannot fail at runtime—all validation happens at compile time.
Section 7: The Fast-Line Search in Action¶
impl Matcher for RegexMatcher {
// ... other methods ...
#[inline]
fn find_candidate_line(
&self,
haystack: &[u8],
) -> Result<Option<LineMatchKind>, NoError> {
Ok(match self.fast_line_regex {
// Fast path: use optimized literal search
Some(ref regex) => {
let input = Input::new(haystack);
regex
.search_half(&input) // Only need match position, not full match
.map(|hm| LineMatchKind::Candidate(hm.offset()))
}
// Slow path: run full regex
None => {
self.shortest_match(haystack)?.map(LineMatchKind::Confirmed)
}
})
}
#[inline]
fn shortest_match_at(
&self,
haystack: &[u8],
at: usize,
) -> Result<Option<usize>, NoError> {
let input = Input::new(haystack).span(at..haystack.len());
// search_half returns just the end position, not the full match
Ok(self.regex.search_half(&input).map(|hm| hm.offset()))
}
}
LineMatchKind::Candidate indicates a potential match that needs verification; LineMatchKind::Confirmed indicates a definite match. The search_half method is faster than find because it only determines if a match exists and where it ends.
Section 8: Captures Implementation¶
use regex_automata::util::captures::Captures as AutomataCaptures;
/// Represents the match offsets of each capturing group in a match.
///
/// The first, or `0`th capture group, always corresponds to the entire match.
/// Note that not all capturing groups are guaranteed to be present in a match.
#[derive(Clone, Debug)]
pub struct RegexCaptures {
/// Where the captures are stored.
caps: AutomataCaptures,
}
impl Captures for RegexCaptures {
#[inline]
fn len(&self) -> usize {
self.caps.group_info().all_group_len()
}
#[inline]
fn get(&self, i: usize) -> Option<Match> {
// Convert from regex_automata's Span to grep_matcher's Match
self.caps.get_group(i).map(|sp| Match::new(sp.start, sp.end))
}
}
impl RegexCaptures {
#[inline]
pub(crate) fn new(caps: AutomataCaptures) -> RegexCaptures {
RegexCaptures { caps }
}
#[inline]
pub(crate) fn captures_mut(&mut self) -> &mut AutomataCaptures {
&mut self.caps
}
}
The pub(crate) visibility restricts new and captures_mut to the crate—external users interact through the Captures trait only.
Section 9: Iteration Implementation¶
impl Matcher for RegexMatcher {
#[inline]
fn try_find_iter<F, E>(
&self,
haystack: &[u8],
mut matched: F,
) -> Result<Result<(), E>, NoError>
where
F: FnMut(Match) -> Result<bool, E>,
{
for m in self.regex.find_iter(haystack) {
match matched(Match::new(m.start(), m.end())) {
Ok(true) => continue, // Keep searching
Ok(false) => return Ok(Ok(())), // Caller wants to stop
Err(err) => return Ok(Err(err)), // Caller encountered error
}
}
Ok(Ok(()))
}
#[inline]
fn captures_at(
&self,
haystack: &[u8],
at: usize,
caps: &mut RegexCaptures,
) -> Result<bool, NoError> {
let input = Input::new(haystack).span(at..haystack.len());
let caps = caps.captures_mut(); // Get mutable access to inner storage
self.regex.search_captures(&input, caps);
Ok(caps.is_match())
}
}
The Result<Result<(), E>, NoError> return type separates regex errors (outer Result, always Ok) from callback errors (inner Result). The callback returns bool to indicate whether iteration should continue.
Quick Reference¶
RegexMatcher Fields¶
| Field | Type | Purpose |
|---|---|---|
config |
Config |
Stores line terminator for queries |
regex |
Regex |
Main matching engine |
fast_line_regex |
Option<Regex> |
Literal-based optimization |
non_matching_bytes |
ByteSet |
Bytes that never match |
Key Matcher Trait Methods¶
| Method | Returns | Purpose |
|---|---|---|
find_at |
Option<Match> |
Find first match at offset |
shortest_match_at |
Option<usize> |
Find match end position only |
find_candidate_line |
Option<LineMatchKind> |
Fast pre-filtering |
captures_at |
bool |
Populate capture groups |
try_find_iter |
Result<(), E> |
Iterate with early exit |
LineMatchKind Variants¶
enum LineMatchKind {
Candidate(usize), // Potential match, needs verification
Confirmed(usize), // Definite match
}