The Matcher Trait: Code Companion¶
Reference code for the The Matcher Trait lecture. Sections correspond to the lecture document.
Section 1: What This File Does¶
/*!
This crate provides an interface for regular expressions, with a focus on line
oriented search. The purpose of this crate is to provide a low level matching
interface that permits any kind of substring or regex implementation to power
the search routines provided by the
[`grep-searcher`](https://docs.rs/grep-searcher)
crate.
The primary thing provided by this crate is the [`Matcher`] trait. The trait
defines an abstract interface for text search. It is robust enough to support
everything from basic substring search all the way to arbitrarily complex
regular expression implementations without sacrificing performance.
A key design decision made in this crate is the use of *internal iteration*,
or otherwise known as the "push" model of searching. In this paradigm,
implementations of the `Matcher` trait will drive search and execute callbacks
provided by the caller when a match is found.
*/
The module-level documentation establishes the crate's purpose and explicitly calls out the internal iteration design choice—a crucial architectural decision explained in the lecture.
Section 2: The Match Type - A Better Range¶
/// The type of a match.
///
/// Every `Match` is guaranteed to satisfy the invariant that `start <= end`.
#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
pub struct Match {
start: usize,
end: usize,
}
impl Match {
/// Create a new match.
///
/// # Panics
/// This function panics if `start > end`.
#[inline]
pub fn new(start: usize, end: usize) -> Match {
assert!(start <= end); // Invariant enforced at construction
Match { start, end }
}
/// Creates a zero width match at the given offset.
#[inline]
pub fn zero(offset: usize) -> Match {
Match { start: offset, end: offset }
}
/// Return a new match with the start offset replaced.
/// Maintains the invariant by asserting start <= end.
#[inline]
pub fn with_start(&self, start: usize) -> Match {
assert!(start <= self.end, "{} is not <= {}", start, self.end);
Match { start, ..*self } // Functional update syntax
}
/// Offset this match by the given amount and return a new match.
#[inline]
pub fn offset(&self, amount: usize) -> Match {
Match {
start: self.start.checked_add(amount).unwrap(),
end: self.end.checked_add(amount).unwrap(),
}
}
#[inline]
pub fn len(&self) -> usize {
self.end - self.start // Safe because start <= end
}
}
// Allows using Match directly as a slice index: &bytes[m]
impl std::ops::Index<Match> for [u8] {
type Output = [u8];
#[inline]
fn index(&self, index: Match) -> &[u8] {
&self[index.start..index.end]
}
}
impl std::ops::Index<Match> for str {
type Output = str;
#[inline]
fn index(&self, index: Match) -> &str {
&self[index.start..index.end]
}
}
Note the #[derive(Copy)]—this is significant because std::ops::Range<usize> does not implement Copy. The Index implementations allow seamless integration with Rust's slice syntax.
Section 3: Line Terminators and Cross-Platform Reality¶
/// A line terminator represents the end of a line.
#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
pub struct LineTerminator(LineTerminatorImp); // Public struct wraps private enum
#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
enum LineTerminatorImp {
/// Any single byte representing a line terminator.
Byte(u8),
/// A line terminator represented by `\r\n`.
CRLF,
}
impl LineTerminator {
/// Return a new single-byte line terminator.
#[inline]
pub fn byte(byte: u8) -> LineTerminator {
LineTerminator(LineTerminatorImp::Byte(byte))
}
/// Return a new CRLF line terminator.
#[inline]
pub fn crlf() -> LineTerminator {
LineTerminator(LineTerminatorImp::CRLF)
}
/// Returns this line terminator as a single byte.
///
/// If CRLF, returns `\n`. This simplifies line-oriented algorithms
/// that can treat `\n` as the significant byte in either mode.
#[inline]
pub fn as_byte(&self) -> u8 {
match self.0 {
LineTerminatorImp::Byte(byte) => byte,
LineTerminatorImp::CRLF => b'\n', // Pragmatic simplification
}
}
/// Returns the full byte sequence for this terminator.
#[inline]
pub fn as_bytes(&self) -> &[u8] {
match self.0 {
LineTerminatorImp::Byte(ref byte) => std::slice::from_ref(byte),
LineTerminatorImp::CRLF => &[b'\r', b'\n'],
}
}
}
impl Default for LineTerminator {
#[inline]
fn default() -> LineTerminator {
LineTerminator::byte(b'\n') // Unix-style default on all platforms
}
}
The private enum pattern (LineTerminatorImp) allows the internal representation to change without breaking API compatibility. Factory methods (byte(), crlf()) provide the only construction paths.
Section 4: ByteSet - Optimization Through Constraint¶
/// A set of bytes that can never appear in any match.
#[derive(Clone, Debug)]
pub struct ByteSet(BitSet);
#[derive(Clone, Copy)]
struct BitSet([u64; 4]); // 4 × 64 = 256 bits, one per possible byte value
impl std::fmt::Debug for BitSet {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
let mut fmtd = f.debug_set();
for b in 0..=255 {
if ByteSet(*self).contains(b) {
fmtd.entry(&b); // Only print bytes that are in the set
}
}
fmtd.finish()
}
}
impl ByteSet {
/// Create an empty set of bytes.
#[inline]
pub fn empty() -> ByteSet {
ByteSet(BitSet([0; 4]))
}
/// Create a full set (all 256 bytes present).
#[inline]
pub fn full() -> ByteSet {
ByteSet(BitSet([u64::MAX; 4]))
}
/// Add a byte to this set.
#[inline]
pub fn add(&mut self, byte: u8) {
let bucket = byte / 64; // Which u64 (0-3)
let bit = byte % 64; // Which bit within that u64
(self.0).0[usize::from(bucket)] |= 1 << bit;
}
/// Remove a byte from this set.
#[inline]
pub fn remove(&mut self, byte: u8) {
let bucket = byte / 64;
let bit = byte % 64;
(self.0).0[usize::from(bucket)] &= !(1 << bit); // Clear the bit
}
/// Return true if the byte is in this set.
#[inline]
pub fn contains(&self, byte: u8) -> bool {
let bucket = byte / 64;
let bit = byte % 64;
(self.0).0[usize::from(bucket)] & (1 << bit) > 0
}
}
The bit vector representation uses exactly 32 bytes (4 × 8 bytes) to represent any subset of the 256 possible byte values. All operations are O(1) with no branching except for the bucket selection.
Section 5: The Captures Trait - Abstracting Group Extraction¶
/// A trait that describes implementations of capturing groups.
pub trait Captures {
/// Return the total number of capturing groups.
fn len(&self) -> usize;
/// Return the capturing group match at the given index.
/// Index 0 is always the overall match.
fn get(&self, i: usize) -> Option<Match>;
/// Return the overall match (equivalent to get(0).unwrap()).
#[inline]
fn as_match(&self) -> Match {
self.get(0).unwrap()
}
/// Expands `$name` references in `replacement` to captured groups.
/// Default implementation delegates to the interpolate module.
#[inline]
fn interpolate<F>(
&self,
name_to_index: F,
haystack: &[u8],
replacement: &[u8],
dst: &mut Vec<u8>,
) where
F: FnMut(&str) -> Option<usize>,
{
interpolate(
replacement,
|i, dst| {
if let Some(range) = self.get(i) {
dst.extend(&haystack[range]); // Uses Match's Index impl
}
},
name_to_index,
dst,
)
}
}
/// NoCaptures provides an always-empty implementation of `Captures`.
/// Use this for matchers that don't support capturing groups.
#[derive(Clone, Debug)]
pub struct NoCaptures(()); // Zero-sized type
impl NoCaptures {
#[inline]
pub fn new() -> NoCaptures {
NoCaptures(())
}
}
impl Captures for NoCaptures {
#[inline]
fn len(&self) -> usize {
0 // No captures ever
}
#[inline]
fn get(&self, _: usize) -> Option<Match> {
None // No captures to get
}
}
NoCaptures is a null object—it satisfies the trait contract but provides no functionality. This allows matchers without capture support to still work with APIs that expect the Captures associated type.
Section 6: The NoError Type - When Failure Is Impossible¶
/// NoError provides an error type for matchers that never produce errors.
///
/// The Display impl panics—if you see this error, you have a bug.
#[derive(Debug, Eq, PartialEq)]
pub struct NoError(()); // Empty struct, cannot be constructed externally
impl std::error::Error for NoError {
fn description(&self) -> &str {
"no error"
}
}
impl std::fmt::Display for NoError {
fn fmt(&self, _: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
panic!("BUG for NoError: an impossible error occurred")
}
}
impl From<NoError> for std::io::Error {
fn from(_: NoError) -> std::io::Error {
panic!("BUG for NoError: an impossible error occurred")
}
}
The panicking implementations are intentional. NoError exists only to satisfy type system requirements—any actual attempt to use an error value indicates a programming mistake that should fail loudly.
Section 7: The Matcher Trait - Core Interface¶
/// A matcher defines an interface for regular expression implementations.
pub trait Matcher {
/// The concrete type of capturing groups.
/// Set to `NoCaptures` if not supported.
type Captures: Captures;
/// The error type. Use `NoError` for infallible matchers.
type Error: std::fmt::Display;
/// Returns the first match in `haystack` after position `at`.
/// This is the primary required method.
fn find_at(
&self,
haystack: &[u8],
at: usize,
) -> Result<Option<Match>, Self::Error>;
/// Creates an empty captures container.
/// The other required method.
fn new_captures(&self) -> Result<Self::Captures, Self::Error>;
/// Number of capture groups (0 if unsupported).
#[inline]
fn capture_count(&self) -> usize {
0 // Default: no capture support
}
/// Map capture group name to index.
#[inline]
fn capture_index(&self, _name: &str) -> Option<usize> {
None // Default: no named captures
}
/// Find first match starting from beginning.
#[inline]
fn find(&self, haystack: &[u8]) -> Result<Option<Match>, Self::Error> {
self.find_at(haystack, 0) // Delegates to required method
}
/// Check if any match exists (can be faster than find).
#[inline]
fn is_match(&self, haystack: &[u8]) -> Result<bool, Self::Error> {
self.is_match_at(haystack, 0)
}
#[inline]
fn is_match_at(
&self,
haystack: &[u8],
at: usize,
) -> Result<bool, Self::Error> {
Ok(self.shortest_match_at(haystack, at)?.is_some())
}
}
Only find_at and new_captures are required. All other methods have default implementations that build on these two primitives.
Section 8: Internal Iteration Methods¶
impl Matcher {
/// Iterate over all non-overlapping matches using internal iteration.
/// The callback returns false to stop iteration.
#[inline]
fn try_find_iter_at<F, E>(
&self,
haystack: &[u8],
at: usize,
mut matched: F,
) -> Result<Result<(), E>, Self::Error>
where
F: FnMut(Match) -> Result<bool, E>,
{
let mut last_end = at;
let mut last_match = None;
loop {
if last_end > haystack.len() {
return Ok(Ok(()));
}
let m = match self.find_at(haystack, last_end)? {
None => return Ok(Ok(())),
Some(m) => m,
};
// Handle empty matches specially to ensure progress
if m.start == m.end {
last_end = m.end + 1;
// Skip empty matches immediately following another match
if Some(m.end) == last_match {
continue;
}
} else {
last_end = m.end;
}
last_match = Some(m.end);
// Callback controls iteration: Ok(true) continues, Ok(false) stops
match matched(m) {
Ok(true) => continue,
Ok(false) => return Ok(Ok(())),
Err(err) => return Ok(Err(err)),
}
}
}
}
The nested Result<Result<(), E>, Self::Error> return type distinguishes between matcher errors (outer) and callback errors (inner). The empty match handling prevents infinite loops on patterns like a*.
Section 9: Blanket Implementation for References¶
/// Blanket implementation allows &M to be used wherever M: Matcher is expected.
impl<'a, M: Matcher> Matcher for &'a M {
type Captures = M::Captures;
type Error = M::Error;
#[inline]
fn find_at(
&self,
haystack: &[u8],
at: usize,
) -> Result<Option<Match>, Self::Error> {
(*self).find_at(haystack, at) // Delegate through the reference
}
#[inline]
fn new_captures(&self) -> Result<Self::Captures, Self::Error> {
(*self).new_captures()
}
// ... all other methods delegate similarly ...
#[inline]
fn non_matching_bytes(&self) -> Option<&ByteSet> {
(*self).non_matching_bytes()
}
#[inline]
fn line_terminator(&self) -> Option<LineTerminator> {
(*self).line_terminator()
}
}
This blanket impl means functions accepting impl Matcher or M: Matcher automatically work with borrowed matchers. The #[inline] hints help eliminate the indirection cost.
Quick Reference¶
Key Types¶
| Type | Purpose |
|---|---|
Match |
Byte offset range with start <= end invariant |
LineTerminator |
Single byte or CRLF line ending |
ByteSet |
256-bit set of bytes (for optimization hints) |
NoCaptures |
Null implementation of Captures trait |
NoError |
Placeholder error type for infallible matchers |
Matcher Trait Required Methods¶
fn find_at(&self, haystack: &[u8], at: usize) -> Result<Option<Match>, Self::Error>;
fn new_captures(&self) -> Result<Self::Captures, Self::Error>;
Matcher Associated Types¶
type Captures: Captures; // Use NoCaptures if unsupported
type Error: std::fmt::Display; // Use NoError if infallible
Optimization Hints¶
fn non_matching_bytes(&self) -> Option<&ByteSet>; // Bytes never in a match
fn line_terminator(&self) -> Option<LineTerminator>; // Line-oriented optimization
fn find_candidate_line(&self, haystack: &[u8]) -> Result<Option<LineMatchKind>, Self::Error>;