ripgrep crates/globset/src/lib.rs: Code Companion¶
Reference code for the Glob Pattern Matching lecture. Sections correspond to the lecture document.
Section 1: The Builder Pattern in Context¶
/// GlobSetBuilder builds a group of patterns that can be used to
/// simultaneously match a file path.
#[derive(Clone, Debug)]
pub struct GlobSetBuilder {
pats: Vec<Glob>,
}
impl GlobSetBuilder {
/// Create a new `GlobSetBuilder`. A `GlobSetBuilder` can be used to add new
/// patterns. Once all patterns have been added, `build` should be called
/// to produce a [`GlobSet`], which can then be used for matching.
pub fn new() -> GlobSetBuilder {
GlobSetBuilder { pats: vec![] }
}
/// Builds a new matcher from all of the glob patterns added so far.
///
/// Once a matcher is built, no new patterns can be added to it.
pub fn build(&self) -> Result<GlobSet, Error> {
GlobSet::new(self.pats.iter())
}
/// Add a new pattern to this set.
/// Returns &mut Self for method chaining
pub fn add(&mut self, pat: Glob) -> &mut GlobSetBuilder {
self.pats.push(pat);
self
}
}
impl GlobSet {
/// Alternative constructor that returns a GlobSetBuilder
/// Provides flexibility at call sites
#[inline]
pub fn builder() -> GlobSetBuilder {
GlobSetBuilder::new()
}
/// Create an empty `GlobSet`. An empty set matches nothing.
/// Note: const fn allows use in static/const contexts
#[inline]
pub const fn empty() -> GlobSet {
GlobSet { len: 0, strats: vec![] }
}
}
The add method returns &mut GlobSetBuilder to enable method chaining (e.g., builder.add(glob1).add(glob2)). The build method borrows &self, allowing the builder to be reused if needed.
Section 2: Error Design Philosophy¶
/// Represents an error that can occur when parsing a glob pattern.
#[derive(Clone, Debug, Eq, PartialEq)]
pub struct Error {
/// The original glob provided by the caller.
glob: Option<String>,
/// The kind of error.
kind: ErrorKind,
}
/// The kind of error that can occur when parsing a glob pattern.
#[derive(Clone, Debug, Eq, PartialEq)]
#[non_exhaustive] // Allows adding new variants in future versions
pub enum ErrorKind {
/// **DEPRECATED** - kept for backward compatibility
InvalidRecursive,
/// Occurs when a character class (e.g., `[abc]`) is not closed.
UnclosedClass,
/// Occurs when a range in a character (e.g., `[a-z]`) is invalid.
InvalidRange(char, char),
/// Occurs when a `}` is found without a matching `{`.
UnopenedAlternates,
/// Occurs when a `{` is found without a matching `}`.
UnclosedAlternates,
/// **DEPRECATED** - nested alternates now supported
NestedAlternates,
/// Occurs when an unescaped '\' is found at the end of a glob.
DanglingEscape,
/// An error associated with parsing or compiling a regex.
Regex(String),
}
impl ErrorKind {
fn description(&self) -> &str {
match *self {
// Note the helpful suggestion in error messages
ErrorKind::UnopenedAlternates => {
"unopened alternate group; missing '{' \
(maybe escape '}' with '[}]'?)"
}
ErrorKind::UnclosedAlternates => {
"unclosed alternate group; missing '}' \
(maybe escape '{' with '[{]'?)"
}
// Other variants...
ErrorKind::InvalidRange(_, _) => "invalid character range",
ErrorKind::UnclosedClass => {
"unclosed character class; missing ']'"
}
// ...
}
}
}
impl std::fmt::Display for Error {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self.glob {
// Include the problematic pattern in the error message
None => self.kind.fmt(f),
Some(ref glob) => {
write!(f, "error parsing glob '{}': {}", glob, self.kind)
}
}
}
}
The #[non_exhaustive] attribute forces downstream code to handle unknown variants with a wildcard arm, enabling library evolution without breaking changes.
Section 3: The Candidate Abstraction¶
/// A candidate path for matching.
///
/// All glob matching in this crate operates on `Candidate` values.
/// Constructing candidates has a very small cost associated with it, so
/// callers may find it beneficial to amortize that cost when matching a single
/// path against multiple globs or sets of globs.
#[derive(Clone)]
pub struct Candidate<'a> {
path: Cow<'a, [u8]>, // Full normalized path
basename: Cow<'a, [u8]>, // Filename without directory
ext: Cow<'a, [u8]>, // File extension (without dot)
}
impl<'a> Candidate<'a> {
/// Create a new candidate for matching from the given path.
pub fn new<P: AsRef<Path> + ?Sized>(path: &'a P) -> Candidate<'a> {
Self::from_cow(Vec::from_path_lossy(path.as_ref()))
}
fn from_cow(path: Cow<'a, [u8]>) -> Candidate<'a> {
// Normalize once, then extract derived values
let path = normalize_path(path);
let basename = file_name(&path).unwrap_or(Cow::Borrowed(B("")));
let ext = file_name_ext(&basename).unwrap_or(Cow::Borrowed(B("")));
Candidate { path, basename, ext }
}
// Efficient prefix/suffix extraction for Aho-Corasick matching
fn path_prefix(&self, max: usize) -> &[u8] {
if self.path.len() <= max { &*self.path } else { &self.path[..max] }
}
fn path_suffix(&self, max: usize) -> &[u8] {
if self.path.len() <= max {
&*self.path
} else {
&self.path[self.path.len() - max..]
}
}
}
Cow<'a, [u8]> (Copy-on-Write) allows the candidate to either borrow from the input path or own normalized data, avoiding allocation when the path is already in the correct form.
Section 4: Strategy-Based Matching Architecture¶
#[derive(Clone, Debug)]
enum GlobSetMatchStrategy {
Literal(LiteralStrategy), // Exact path match: "src/main.rs"
BasenameLiteral(BasenameLiteralStrategy), // Filename match: "Makefile"
Extension(ExtensionStrategy), // Extension match: "*.rs"
Prefix(PrefixStrategy), // Prefix match: "src/**"
Suffix(SuffixStrategy), // Suffix match: "**/*.rs"
RequiredExtension(RequiredExtensionStrategy), // Hybrid: ext + regex
Regex(RegexSetStrategy), // Full regex fallback
}
// Pattern categorization during GlobSet construction
pub fn new<I, G>(globs: I) -> Result<GlobSet, Error>
where
I: IntoIterator<Item = G>,
G: AsRef<Glob>,
{
// Initialize collectors for each strategy
let mut lits = LiteralStrategy::new();
let mut base_lits = BasenameLiteralStrategy::new();
let mut exts = ExtensionStrategy::new();
let mut prefixes = MultiStrategyBuilder::new();
let mut suffixes = MultiStrategyBuilder::new();
let mut required_exts = RequiredExtensionStrategyBuilder::new();
let mut regexes = MultiStrategyBuilder::new();
for (i, p) in it.enumerate() {
let p = p.as_ref();
// Route each pattern to its optimal strategy
match MatchStrategy::new(p) {
MatchStrategy::Literal(lit) => {
lits.add(i, lit);
}
MatchStrategy::BasenameLiteral(lit) => {
base_lits.add(i, lit);
}
MatchStrategy::Extension(ext) => {
exts.add(i, ext);
}
MatchStrategy::Prefix(prefix) => {
prefixes.add(i, prefix);
}
MatchStrategy::Suffix { suffix, component } => {
if component {
// Also add to literals for component matching
lits.add(i, suffix[1..].to_string());
}
suffixes.add(i, suffix);
}
MatchStrategy::RequiredExtension(ext) => {
required_exts.add(i, ext, p.regex().to_owned());
}
MatchStrategy::Regex => {
regexes.add(i, p.regex().to_owned());
}
}
}
// ... build strategies and return GlobSet
}
The MatchStrategy::new(p) call analyzes each glob pattern and determines which optimization strategy can handle it most efficiently.
Section 5: The Multi-Pattern Matching Engine¶
impl GlobSet {
/// Returns true if any glob in this set matches the path given.
/// Short-circuits on first match.
pub fn is_match_candidate(&self, path: &Candidate<'_>) -> bool {
if self.is_empty() {
return false;
}
for strat in &self.strats {
if strat.is_match(path) {
return true; // Early exit on first match
}
}
false
}
/// Returns true if all globs in this set match the path given.
/// Empty set returns true (vacuous truth).
pub fn matches_all_candidate(&self, path: &Candidate<'_>) -> bool {
for strat in &self.strats {
if !strat.is_match(path) {
return false; // Early exit on first non-match
}
}
true
}
/// Adds the sequence number of every glob pattern that matches.
/// Results are sorted and deduplicated.
pub fn matches_candidate_into(
&self,
path: &Candidate<'_>,
into: &mut Vec<usize>,
) {
into.clear(); // Always start fresh
if self.is_empty() {
return;
}
// Collect matches from all strategies
for strat in &self.strats {
strat.matches_into(path, into);
}
// Remove duplicates (pattern might match multiple strategies)
into.sort();
into.dedup();
}
}
The matches_into variant lets callers provide a pre-allocated vector, enabling efficient batch processing by reusing allocations across many path matches.
Section 6: Hash-Based Strategies¶
#[derive(Clone, Debug)]
struct LiteralStrategy(fnv::HashMap<Vec<u8>, Vec<usize>>);
impl LiteralStrategy {
fn new() -> LiteralStrategy {
LiteralStrategy(fnv::HashMap::default())
}
fn add(&mut self, global_index: usize, lit: String) {
// Multiple patterns can map to the same literal
self.0.entry(lit.into_bytes()).or_insert(vec![]).push(global_index);
}
fn is_match(&self, candidate: &Candidate<'_>) -> bool {
// O(1) hash lookup regardless of pattern count
self.0.contains_key(candidate.path.as_bytes())
}
#[inline(never)] // Hint to avoid inlining (code size optimization)
fn matches_into(
&self,
candidate: &Candidate<'_>,
matches: &mut Vec<usize>,
) {
if let Some(hits) = self.0.get(candidate.path.as_bytes()) {
matches.extend(hits);
}
}
}
#[derive(Clone, Debug)]
struct ExtensionStrategy(fnv::HashMap<Vec<u8>, Vec<usize>>);
impl ExtensionStrategy {
fn is_match(&self, candidate: &Candidate<'_>) -> bool {
if candidate.ext.is_empty() {
return false; // No extension means no match
}
self.0.contains_key(candidate.ext.as_bytes())
}
// ... similar structure to LiteralStrategy
}
#[derive(Clone, Debug)]
struct BasenameLiteralStrategy(fnv::HashMap<Vec<u8>, Vec<usize>>);
// ... same pattern, keyed by basename instead of full path
All three strategies use FNV (Fowler-Noll-Vo) hashing, optimized for small keys like file extensions and basenames.
Section 7: Prefix and Suffix Strategies¶
#[derive(Clone, Debug)]
struct PrefixStrategy {
matcher: AhoCorasick, // Multi-pattern string matching
map: Vec<usize>, // Pattern index to global index
longest: usize, // Maximum prefix length to check
}
impl PrefixStrategy {
fn is_match(&self, candidate: &Candidate<'_>) -> bool {
// Only check the relevant prefix of the path
let path = candidate.path_prefix(self.longest);
for m in self.matcher.find_overlapping_iter(path) {
if m.start() == 0 { // Must match from the beginning
return true;
}
}
false
}
fn matches_into(
&self,
candidate: &Candidate<'_>,
matches: &mut Vec<usize>,
) {
let path = candidate.path_prefix(self.longest);
for m in self.matcher.find_overlapping_iter(path) {
if m.start() == 0 {
matches.push(self.map[m.pattern()]);
}
}
}
}
#[derive(Clone, Debug)]
struct SuffixStrategy {
matcher: AhoCorasick,
map: Vec<usize>,
longest: usize,
}
impl SuffixStrategy {
fn is_match(&self, candidate: &Candidate<'_>) -> bool {
let path = candidate.path_suffix(self.longest);
for m in self.matcher.find_overlapping_iter(path) {
if m.end() == path.len() { // Must match at the end
return true;
}
}
false
}
}
The longest field enables an optimization: only examine the portion of the path that could possibly match, reducing work for long paths.
Section 8: RequiredExtension and Regex Strategies¶
#[derive(Clone, Debug)]
struct RequiredExtensionStrategy(fnv::HashMap<Vec<u8>, Vec<(usize, Regex)>>);
impl RequiredExtensionStrategy {
fn is_match(&self, candidate: &Candidate<'_>) -> bool {
if candidate.ext.is_empty() {
return false;
}
// First: O(1) extension lookup narrows candidates
match self.0.get(candidate.ext.as_bytes()) {
None => false,
Some(regexes) => {
// Then: run regex only on filtered candidates
for &(_, ref re) in regexes {
if re.is_match(candidate.path.as_bytes()) {
return true;
}
}
false
}
}
}
}
#[derive(Clone, Debug)]
struct RegexSetStrategy {
matcher: Regex,
map: Vec<usize>,
// Pool of PatternSets to avoid allocation per match
patset: Arc<Pool<PatternSet, PatternSetPoolFn>>,
}
type PatternSetPoolFn =
Box<dyn Fn() -> PatternSet + Send + Sync + UnwindSafe + RefUnwindSafe>;
impl RegexSetStrategy {
fn matches_into(
&self,
candidate: &Candidate<'_>,
matches: &mut Vec<usize>,
) {
let input = regex_automata::Input::new(candidate.path.as_bytes());
let mut patset = self.patset.get(); // Get from pool
patset.clear();
self.matcher.which_overlapping_matches(&input, &mut patset);
for i in patset.iter() {
matches.push(self.map[i]);
}
PoolGuard::put(patset); // Return to pool
}
}
The RequiredExtensionStrategy demonstrates a two-phase filter: cheap hash lookup first, expensive regex only if needed. The RegexSetStrategy uses object pooling to amortize allocation costs.
Quick Reference¶
Strategy Selection Summary¶
| Pattern Type | Example | Strategy | Lookup Complexity |
|---|---|---|---|
| Exact path | src/main.rs |
Literal | O(1) hash |
| Basename only | Makefile |
BasenameLiteral | O(1) hash |
| Extension | *.rs |
Extension | O(1) hash |
| Prefix | src/** |
Prefix (Aho-Corasick) | O(n) where n = prefix length |
| Suffix | **/*.rs |
Suffix (Aho-Corasick) | O(n) where n = suffix length |
| Ext + complex | src/**/*.rs |
RequiredExtension | O(1) + regex |
| Complex | {foo,bar}/** |
Regex | Full regex cost |
Key Types¶
// Primary API types
GlobSet // Immutable, optimized matcher for multiple patterns
GlobSetBuilder // Accumulates patterns before building
Candidate<'a> // Pre-processed path for efficient matching
// Strategy enum (internal)
enum GlobSetMatchStrategy {
Literal(LiteralStrategy),
BasenameLiteral(BasenameLiteralStrategy),
Extension(ExtensionStrategy),
Prefix(PrefixStrategy),
Suffix(SuffixStrategy),
RequiredExtension(RequiredExtensionStrategy),
Regex(RegexSetStrategy),
}
Matching Method Variants¶
| Method | Returns | Use Case |
|---|---|---|
is_match(path) |
bool |
Quick existence check |
is_match_candidate(candidate) |
bool |
Amortized existence check |
matches(path) |
Vec<usize> |
Get all matching pattern indices |
matches_candidate(candidate) |
Vec<usize> |
Amortized match collection |
matches_into(path, vec) |
() |
Reuse allocation |
matches_candidate_into(candidate, vec) |
() |
Full amortization |
matches_all(path) |
bool |
All patterns must match |