glob.rs: Glob Implementation¶

What This File Does¶

This file implements the core glob pattern parser and compiler that transforms shell-style patterns like *.rs or **/src/**/*.c into executable matchers. While the previous lesson covered lib.rs and how multiple globs get combined into efficient sets, this file focuses on the mechanics of individual patterns: how they're parsed, what optimizations can be extracted from them, and how they become regular expressions.

The architecture here is deceptively sophisticated. A simple pattern like **/*.rs doesn't just become a regex—the code analyzes its structure to determine the fastest matching strategy. Can we match by extension alone? By prefix? By suffix? These optimizations matter enormously when scanning millions of files, and they're all computed by examining the parsed token stream that this file produces.

Section 1: The Match Strategy Optimization¶

The most interesting design decision in this file appears right at the top: the MatchStrategy enum. Rather than treating every glob pattern as a regex that needs full pattern matching, the code identifies patterns that can be matched with simpler, faster operations.

Consider what happens when ripgrep encounters a .gitignore containing *.txt. That pattern could be handled by compiling it to a regex and running the regex engine on every file path. But there's a much faster approach: extract the extension .txt and perform a simple byte comparison against each file's extension. No regex engine needed, no backtracking, just a quick equality check.

The MatchStrategy enum captures seven different optimization categories. A Literal strategy means the entire path must match exactly—no wildcards at all. BasenameLiteral means we only need to check the filename portion. Extension is the .txt case. Prefix handles patterns that start with a literal path. Suffix handles patterns that end with a literal string. RequiredExtension is a partial optimization—we can quickly reject non-matching extensions before falling back to a full regex match.

This tiered approach reflects a key insight about real-world usage. Most ignore patterns in actual .gitignore files fall into simple categories. Optimizing for the common case while falling back to full regex power for complex patterns gives ripgrep both speed and correctness.

See: Companion Code Section 1

Section 2: The Glob Type and Builder Pattern¶

The Glob struct represents a successfully parsed pattern, and it follows a pattern we've seen throughout ripgrep: it stores both the original representation and the compiled form. The struct holds the original glob string, the generated regex string, the parsing options, and the internal token representation.

What's interesting here is the trait implementations. The PartialEq implementation only compares the glob string and options—not the regex or tokens. This is semantically correct because two globs with the same pattern and options should be equal regardless of their internal representation. The Hash implementation follows the same logic, which is essential for using globs as hash map keys.

The Debug implementation shows thoughtful attention to ergonomics. In normal debug mode, it shows just the glob pattern wrapped in a tuple struct format. But with alternate formatting via {:#?}, it expands to show all internal fields including the generated regex and tokens. This is useful during development when you need to see what the parser actually produced.

The GlobBuilder type follows the builder pattern we've examined in previous lessons on searcher/mod.rs and printer/standard.rs. It holds a reference to the pattern string and accumulates options before parsing. The lifetime 'a is tied to the input pattern, which means the builder borrows rather than owns the pattern text. This is efficient for the common case where you parse a pattern immediately.

See: Companion Code Section 2

Section 3: The Token Representation¶

Before any matching can happen, a glob pattern must be parsed into a structured representation. The Token enum captures all the possible elements that can appear in a glob pattern.

The basic tokens are straightforward: Literal holds a single character, Any represents ? (matches one character), and ZeroOrMore represents * (matches zero or more characters). But the recursive tokens are where things get interesting. RecursivePrefix represents **/ at the start of a pattern—it can match any number of directory components. RecursiveSuffix is /** at the end. RecursiveZeroOrMore is /**/ in the middle, which must match at least one slash.

The distinction between these recursive variants matters for optimization. A pattern starting with **/ only needs to match the basename of files, while /**/ requires at least one path separator. Getting these semantics wrong would cause incorrect matching behavior.

Character classes become Class tokens containing a negation flag and a vector of character ranges. The parser normalizes syntax like [a-z0-9] into a vector of tuples representing ranges. Alternates like {foo,bar} become nested Tokens structures, allowing for arbitrary nesting depth.

The Tokens wrapper type implements Deref and DerefMut to Vec<Token>, making it easy to work with as a vector while maintaining a distinct type for clarity.

See: Companion Code Section 3

Section 4: Strategy Extraction Methods¶

The Glob implementation includes a series of private methods that attempt to extract optimized matching strategies from the token stream. Each method returns Option<T>, returning None if the pattern is too complex for that optimization.

The literal method checks if the entire pattern consists only of literal characters. If case-insensitive matching is enabled, this optimization isn't available—case folding requires regex machinery. The method iterates through tokens, bailing out if it finds anything other than literals.

The ext method is more complex. It looks for patterns of the form *.ext or **/*.ext. The logic must account for several edge cases: the pattern might start with ** (recursive prefix), the * must be able to match path separators (otherwise *.c wouldn't match foo/bar.c), and the extension itself can't contain dots or slashes.

The prefix and suffix methods extract literal strings that must appear at the start or end of matching paths. These are particularly useful for patterns like build/** or **/*.test.js. The suffix method returns a tuple containing both the suffix string and a boolean indicating whether the suffix must match the entire path or can appear after a separator.

The basename_tokens method is crucial for the common **/filename pattern. When a pattern starts with **/ and the rest doesn't contain path separators, we only need to match against the file's basename. This method returns a slice of the original tokens rather than allocating a new vector—a small but meaningful efficiency.

See: Companion Code Section 4

Section 5: The Parser Structure¶

The Parser struct manages state during glob parsing. Its design reveals the complexity hidden behind seemingly simple glob syntax.

The parser tracks alternation nesting through two fields: alternates_stack marks indices where alternation groups begin, and branches holds the token sequences for each branch. When the parser encounters {, it pushes the current branch count onto the stack and starts a new branch. When it sees ,, it starts another branch at the same nesting level. When it hits }, it pops the stack and combines all branches since that point into an Alternates token.

The prev and cur fields track the previous and current characters. This two-character lookahead is essential for parsing **—the parser needs to know if a * is followed by another * to determine whether it's a recursive wildcard.

The found_unclosed_class field handles a subtle compatibility concern. Some glob implementations treat an unclosed [ as a literal character rather than an error. When allow_unclosed_class is enabled and we fail to find a closing ], we set this flag and never attempt to parse character classes again. Without this optimization, a pattern like [[[[[[[... would cause quadratic parsing time as we repeatedly scan forward looking for a ] that doesn't exist.

See: Companion Code Section 5

Section 6: Parsing Special Sequences¶

The parse_star method handles the most complex parsing in the file. A single * is straightforward, but ** has special semantics depending on context.

The method first checks if the next character is also *. If not, it simply pushes a ZeroOrMore token. But when we have **, the interpretation depends on position. At the start of a pattern (or after a separator), ** followed by a separator becomes RecursivePrefix. At the end of a pattern, **/ becomes RecursiveSuffix. In the middle with separators on both sides, it becomes RecursiveZeroOrMore.

The edge cases are subtle. What about ** at the start followed by a literal? It becomes two ZeroOrMore tokens, not a recursive pattern. What about a**b? Also two ZeroOrMore tokens—the recursive patterns only apply with proper separator context.

The parse_class method handles character classes with their own complexity. It tracks whether we're building a range (a-z) or collecting individual characters. The ] and - characters have special meaning at the start of a class—[]] matches a literal ], and [-a] includes a literal -. The method also implements the recovery logic for unclosed classes when that option is enabled.

See: Companion Code Section 6

Section 7: Regex Generation¶

Once parsing completes, the token stream must become a regular expression. The to_regex_with method on Tokens performs this translation.

The regex always starts with (?-u) to disable Unicode mode, since file paths are arbitrary bytes that may not be valid UTF-8. If case-insensitive matching is requested, (?i) follows. Then comes ^ to anchor at the start.

Each token type has a direct regex translation. Literal characters get escaped and inserted. Any becomes . (or [^/] if literal_separator is enabled). ZeroOrMore becomes .* (or [^/]*). The recursive patterns use non-capturing groups with alternation: RecursivePrefix becomes (?:/?|.*/) to match either nothing, a slash, or any path prefix.

The char_to_escaped_literal function handles the byte-level complexity of non-ASCII characters. Since the regex operates on bytes, multi-byte UTF-8 characters must be encoded as their individual bytes with proper escaping. A snowman character ☃ becomes \xe2\x98\x83 in the generated regex.

Character classes translate fairly directly, with each range becoming a regex character class. Alternates become non-capturing groups with | separators.

See: Companion Code Section 7

Section 8: The GlobBuilder Options¶

The builder pattern for Glob exposes five configuration options, each affecting parsing and matching semantics in specific ways.

The case_insensitive option adds (?i) to the generated regex, enabling case-insensitive matching. But it also disables all the extraction optimizations—when matching is case-insensitive, we can't do simple byte comparisons for extensions or literals.

The literal_separator option determines whether * and ? can match path separators. By default they can, which matches traditional glob behavior. But for strict path matching, enabling this option makes *.c only match files in the current directory, not foo/bar.c. This affects both regex generation (using [^/] instead of .) and which optimizations are available.

The backslash_escape option controls whether \ escapes special characters. On Unix, this is enabled by default—\* matches a literal asterisk. On Windows, where \ is a path separator, it's disabled by default. The implementation checks is_separator('\\') to determine the platform-appropriate default.

The empty_alternates option allows patterns like {,a} to match both empty string and a. By default, empty branches in alternates are filtered out during regex generation.

The allow_unclosed_class option enables compatibility mode where [ without a matching ] is treated as a literal rather than an error. This is useful for consuming patterns from sources that may have different parsing rules.

See: Companion Code Section 8

Section 9: The Matcher Types¶

The GlobMatcher struct wraps a compiled pattern for actual matching. It holds both the original Glob and a compiled Regex from the regex_automata crate.

The matching interface is simple: is_match takes anything that can be converted to a Path, and is_match_candidate takes a pre-computed Candidate. The Candidate type from lib.rs caches the path's bytes, basename, and extension, avoiding repeated computation when matching against multiple patterns.

For testing purposes, there's also GlobStrategic, which implements the optimized matching strategies. It's gated behind #[cfg(test)] because the public API always uses the regex matcher—the strategic matching is used in GlobSet from lib.rs, not in single-pattern matching. The test implementation ensures that strategic matching produces the same results as regex matching for all test cases.

The strategic matcher's is_match_candidate method shows the optimization logic in action. For Literal strategy, it's a simple byte comparison. For Extension, it compares the cached extension bytes. For Prefix and Suffix, it checks the appropriate ends of the path. For RequiredExtension, it checks the extension first as a quick rejection, then falls back to the regex.

See: Companion Code Section 9

Section 10: The Test Suite Philosophy¶

The test suite at the end of the file demonstrates a philosophy of exhaustive edge-case coverage. The macros define DSLs for different types of tests: syntax tests verify token parsing, error tests verify parse failures, regex tests verify generated patterns, and match tests verify actual matching behavior.

The syntax tests verify that patterns parse to expected token sequences. This is valuable for catching parser regressions and documenting expected behavior for edge cases like [-a-z-] (a dash at the start and end of a character class).

The match tests use both the standard regex matcher and the strategic matcher, ensuring they agree. They also test through a GlobSet to verify the optimization logic in the set builder.

The extraction tests verify the strategy detection methods. These are particularly important because incorrect strategy extraction leads to subtle matching bugs—patterns might match correctly in tests but fail on edge cases in the wild.

The options constants (CASEI, SLASHLIT, BSESC, etc.) allow tests to easily specify non-default configurations while keeping test definitions concise.

See: Companion Code Section 10

Key Takeaways¶

First, glob patterns benefit enormously from preprocessing optimizations. By analyzing pattern structure at compile time, ripgrep can often avoid regex matching entirely, using simple string operations instead.

Second, the builder pattern with method chaining provides a clean API for optional configuration while keeping the common case—default options—simple and efficient.

Third, cross-platform compatibility requires careful attention to path separator semantics. The backslash_escape option with its platform-dependent default shows how to handle behavioral differences between Unix and Windows.

Fourth, error recovery options like allow_unclosed_class should have conservative defaults but be available when compatibility matters more than strict parsing.

Fifth, testing strategy matters as much as implementation. Running both the optimized path and the fallback path in tests ensures they stay synchronized as the code evolves.

Sixth, the separation between parsing, optimization analysis, and regex generation keeps each concern focused and testable. The token representation serves as a clean intermediate form that all three phases can work with.

What to Read Next¶

How does the glob set combine multiple patterns efficiently? Return to crates/globset/src/lib.rs to see how MatchStrategy feeds into the multi-pattern optimization.

How does the ignore crate use these glob patterns? Read crates/ignore/src/gitignore.rs to see how .gitignore semantics layer on top of glob matching.

What other builder patterns exist in ripgrep? Compare with crates/searcher/src/searcher/mod.rs to see how the same pattern applies to search configuration.