ripgrep crates/regex/src/literal.rs: Literal Optimization¶
What This File Does¶
This file implements one of ripgrep's most important performance optimizations: extracting literal strings from regular expressions to enable fast pre-filtering. The core insight is deceptively simple—if a regex must match certain literal bytes to succeed, you can search for those bytes first using highly optimized SIMD routines, then only run the full regex engine on the matching lines. For a regex like \s+Sherlock\s+, the word "Sherlock" must appear for any match to occur, so ripgrep can scan gigabytes of text looking only for that literal, dramatically reducing the work the regex engine needs to do.
The implementation centers on an Extractor struct that traverses the regex's internal representation (the HIR, or High-level Intermediate Representation) and attempts to extract useful literals. This isn't a simple task—the extractor must handle concatenations, alternations, repetitions, and character classes, all while maintaining correctness about whether extracted literals are "exact" (the complete match) or "inexact" (just part of it). The file also includes sophisticated heuristics for deciding when extracted literals are actually helpful versus when they might slow things down.
Section 1: The Inner Literal Strategy¶
The InnerLiterals struct encapsulates the entire optimization strategy. Unlike prefix or suffix literal extraction (where you look for what the regex starts or ends with), inner literal extraction finds literals that must occur somewhere in any match. This is particularly powerful for line-oriented searching because of a crucial property: if you find an inner literal on a line, you can run the full regex on just that line rather than the entire file.
The validity of this optimization depends entirely on ripgrep's line-oriented search model. When you search line-by-line rather than across the entire file as one continuous string, finding an inner literal tells you exactly which line to test with the full regex. This is why the new constructor immediately bails out if no line terminator is configured—without line boundaries, you can't safely narrow down where to apply the full regex after finding a literal.
The constructor also includes important short-circuit logic. If the regex engine reports that it's already "accelerated" (meaning it has its own internal literal optimizations), ripgrep defers to the engine unless the pattern contains Unicode word boundaries. Unicode word boundaries are expensive enough that the regex engine might fall back to slower paths, making ripgrep's custom optimization worthwhile even when the engine claims to be accelerated.
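The bail-out logic described above can be summarized as a small decision function. This is a hypothetical sketch for illustration only; the function and parameter names are not ripgrep's actual API:

```rust
// Hypothetical sketch of the constructor's short-circuit logic; the names
// here are illustrative, not the real API.
fn should_try_inner_literals(
    has_line_terminator: bool,
    engine_accelerated: bool,
    has_unicode_word_boundary: bool,
) -> bool {
    // Without line boundaries there is no safe region to re-check with the
    // full regex after a literal hit, so bail out entirely.
    if !has_line_terminator {
        return false;
    }
    // If the engine is already accelerated, defer to it, unless Unicode
    // word boundaries might push it onto a slow path anyway.
    if engine_accelerated && !has_unicode_word_boundary {
        return false;
    }
    true
}

fn main() {
    assert!(!should_try_inner_literals(false, false, false));
    assert!(!should_try_inner_literals(true, true, false));
    assert!(should_try_inner_literals(true, true, true));
    assert!(should_try_inner_literals(true, false, false));
}
```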
See: Companion Code Section 1
Section 2: The Extractor Architecture¶
The Extractor struct is where the real work happens. It's designed as a recursive tree-walker that descends through the HIR and accumulates literal sequences at each node. The struct carries configuration limits that prevent the extraction process from exploding in time or space—you don't want to spend more time extracting literals than you'd save by using them.
Four limits govern the extraction: limit_class caps how many characters from a character class will be expanded into literals, limit_repeat controls how many iterations of a repetition will be unrolled, limit_literal_len prevents any single literal from growing too long, and limit_total bounds the total number of literals in the final sequence. These limits exist because literal extraction has combinatorial complexity—a regex like [abc][def][ghi] could expand to 27 different literals, and with longer character classes or deeper nesting, this explodes rapidly.
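To see the combinatorics concretely, here is a small sketch (not ripgrep's code) that counts how many literals a run of character classes would expand to, checked against a total budget like limit_total:

```rust
// Each character class multiplies the number of literal combinations, so
// the product must stay within a total budget or extraction is abandoned.
fn expansion_count(class_sizes: &[usize], limit_total: usize) -> Option<usize> {
    let mut total = 1usize;
    for &size in class_sizes {
        total = total.checked_mul(size)?;
        if total > limit_total {
            return None; // would blow past the budget: abandon extraction
        }
    }
    Some(total)
}

fn main() {
    // [abc][def][ghi]: 3 * 3 * 3 = 27 literals, within a budget of 64.
    assert_eq!(expansion_count(&[3, 3, 3], 64), Some(27));
    // Four classes of 10 characters each would need 10,000 literals.
    assert_eq!(expansion_count(&[10, 10, 10, 10], 64), None);
}
```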
The main entry point is extract_untagged, which calls the recursive extract method, then applies post-processing. The "untagged" name refers to stripping away the internal TSeq wrapper (a "tagged sequence" that tracks additional metadata) and returning just the raw Seq of literals. The post-processing includes optimization passes and a critical "is this actually good?" check that can throw away the entire result if the heuristics suggest the literals won't help.
See: Companion Code Section 2
Section 3: Handling Concatenation¶
The extract_concat method handles sequential patterns like abc or foo[a-z]bar. The fundamental operation here is the cross product: if you have literals ["a", "b"] from one part and ["x", "y"] from another, concatenating them yields ["ax", "ay", "bx", "by"]. This is where the combinatorial explosion comes from, and why limits are essential.
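The cross product itself is easy to sketch. This is a minimal illustration of the operation, not ripgrep's actual implementation, which works on byte literals with exactness tags:

```rust
// Every literal from the left side is joined with every literal from the
// right side, so the output size is the product of the input sizes.
fn cross(left: &[&str], right: &[&str]) -> Vec<String> {
    let mut out = Vec::with_capacity(left.len() * right.len());
    for l in left {
        for r in right {
            out.push(format!("{l}{r}"));
        }
    }
    out
}

fn main() {
    let combined = cross(&["a", "b"], &["x", "y"]);
    assert_eq!(combined, vec!["ax", "ay", "bx", "by"]);
}
```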
The implementation uses a clever optimization for patterns with "gaps"—places where literal extraction fails in the middle of a concatenation. Consider foo[a-z]+bar[a-z]+baz: the [a-z]+ parts produce no useful literals, but foo, bar, and baz are all perfectly good candidates. Rather than giving up when hitting an inexact sequence, the method tracks a prev candidate and uses the choose method to pick the best option among multiple possible literal sets.
This "keep the best so far" strategy allows the extractor to scan through an entire concatenation and identify the most useful literals, even when they're separated by unpredictable content. The is_really_good check enables short-circuiting: if we find a sequence that's clearly excellent (long literals, small count), we can stop looking and avoid wasting time on the rest of the pattern.
See: Companion Code Section 3
Section 4: Handling Alternation¶
Alternation (a|b|c) requires a different operation: union rather than cross product. If one branch matches "foo" and another matches "bar", the combined pattern can match either, so the literal set becomes ["foo", "bar"]. The extract_alternation method builds this union incrementally.
The short-circuit condition here is different from concatenation: once the sequence becomes infinite (meaning we've hit something we can't represent as finite literals), every subsequent union will also be infinite, so we can stop early. This prevents wasted work when one branch of an alternation contains something like .* that defeats literal extraction entirely.
The union operation in the Extractor includes a fascinating optimization for when the total would exceed limits. Rather than immediately giving up and returning an infinite sequence, it tries to trim the existing literals—keeping only their first few bytes. The magic number is 4 bytes, chosen specifically because downstream the literals might be fed to the Teddy algorithm, which can search for literals up to 4 bytes efficiently. By trimming rather than abandoning, the extractor preserves at least some filtering capability.
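The trim-instead-of-abandon idea can be sketched as follows. This is a simplified illustration (ASCII literals assumed so byte truncation is safe), not the real union code:

```rust
// When a union would exceed the literal budget, trim each literal to its
// first 4 bytes (a length the Teddy searcher handles efficiently) and
// deduplicate, rather than giving up immediately.
fn union_with_trim(
    mut acc: Vec<String>,
    branch: Vec<String>,
    limit_total: usize,
) -> Option<Vec<String>> {
    acc.extend(branch);
    if acc.len() <= limit_total {
        return Some(acc);
    }
    // ASCII assumed here; shortened prefixes often collapse together.
    for lit in acc.iter_mut() {
        lit.truncate(4);
    }
    acc.sort();
    acc.dedup();
    if acc.len() <= limit_total {
        Some(acc)
    } else {
        None // give up: treat the sequence as infinite
    }
}

fn main() {
    let acc = vec!["foothis".to_string(), "foothat".to_string()];
    let branch = vec!["fooother".to_string()];
    // Three literals exceed a budget of 2, but their 4-byte prefixes
    // collapse to two entries, preserving some filtering capability.
    let trimmed = union_with_trim(acc, branch, 2).unwrap();
    assert_eq!(trimmed, vec!["fooo".to_string(), "foot".to_string()]);
}
```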
See: Companion Code Section 4
Section 5: Handling Repetition¶
Repetitions are perhaps the trickiest case because the semantics depend heavily on the bounds. The extract_repetition method handles ?, *, +, {n}, and {n,m} quantifiers, each requiring different treatment to maintain correctness about exactness.
For optional patterns (?), both the present and absent cases are valid, so the result is a union. For zero-or-more (*), the empty match is always possible, making it difficult to extract useful required literals. For one-or-more (+), at least one occurrence must appear, so the sub-expression's literals become inexact (they mark the start of a match but not the whole thing).
Fixed repetitions like {3} are handled by unrolling: we cross-product the sub-expression with itself the specified number of times, up to the limit_repeat threshold. This correctly produces "aaa" from a{3}. Bounded ranges like {3,5} work similarly but immediately mark results as inexact since the actual number of repetitions is variable.
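Unrolling a fixed repetition is just repeated cross-multiplication. A minimal sketch of the idea, again not the real implementation:

```rust
// Cross the sub-expression's literals with themselves `n` times, bailing
// out (None = "infinite") if `n` exceeds the repeat limit.
fn unroll(lits: &[&str], n: usize, limit_repeat: usize) -> Option<Vec<String>> {
    if n > limit_repeat {
        return None; // too many iterations to unroll
    }
    let mut out = vec![String::new()];
    for _ in 0..n {
        let mut next = Vec::new();
        for prefix in &out {
            for lit in lits {
                next.push(format!("{prefix}{lit}"));
            }
        }
        out = next;
    }
    Some(out)
}

fn main() {
    // a{3} unrolls to the single literal "aaa".
    assert_eq!(unroll(&["a"], 3, 10), Some(vec!["aaa".to_string()]));
    // Past the limit, unrolling is abandoned.
    assert_eq!(unroll(&["a"], 5, 4), None);
}
```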
The handling of greediness (* vs *?) affects the order of alternatives: greedy puts the non-empty match first, lazy puts the empty match first. This matters for optimizations that might prefer one interpretation over another.
See: Companion Code Section 5
Section 6: The Tagged Sequence Wrapper¶
The TSeq type wraps the standard Seq from regex-syntax with an additional piece of state: the prefix boolean. This tracks whether the literals in the sequence are known to start at a predictable position in the match. When prefix is false, the literals could appear anywhere, which changes how they compose with other sequences.
This distinction matters for correctness during cross-product operations. If a sequence isn't a prefix, you can't simply concatenate it onto another sequence because you don't know where it starts. Instead, you must choose between the sequences rather than combining them. The make_not_prefix method marks a sequence when extraction continues past an inexact segment.
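How the prefix flag changes composition can be shown with a simplified sketch. The tie-breaking below is illustrative only; the real code uses the full choose heuristics:

```rust
// Simplified sketch of a tagged sequence: a literal set plus a flag
// recording whether the literals start at a predictable position.
#[derive(Debug, Clone, PartialEq)]
struct TSeq {
    lits: Vec<String>,
    prefix: bool,
}

fn concat(left: TSeq, right: TSeq) -> TSeq {
    if !right.prefix {
        // The right side starts at an unknown offset, so it can't be glued
        // on; pick one sequence (here: the longer shortest literal wins).
        let lmin = left.lits.iter().map(|l| l.len()).min().unwrap_or(0);
        let rmin = right.lits.iter().map(|l| l.len()).min().unwrap_or(0);
        return if lmin >= rmin { left } else { right };
    }
    // Both sides compose positionally: take the cross product.
    let lits = left
        .lits
        .iter()
        .flat_map(|l| right.lits.iter().map(move |r| format!("{l}{r}")))
        .collect();
    TSeq { lits, prefix: left.prefix }
}

fn main() {
    let foo = TSeq { lits: vec!["foo".to_string()], prefix: true };
    let xy = TSeq { lits: vec!["x".to_string(), "y".to_string()], prefix: true };
    assert_eq!(
        concat(foo.clone(), xy).lits,
        vec!["foox".to_string(), "fooy".to_string()]
    );

    // "bar" is not a prefix, so we choose rather than cross: the minimum
    // lengths tie at 3 and the left candidate wins the tie.
    let floating = TSeq { lits: vec!["bar".to_string()], prefix: false };
    assert_eq!(concat(foo, floating).lits, vec!["foo".to_string()]);
}
```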
The wrapper also adds quality-assessment methods that don't exist on the raw Seq type. The is_good and is_really_good methods implement heuristics about whether the extracted literals are actually useful. These consider minimum literal length and total count—a sequence with many very short literals might match so frequently that filtering provides no speedup.
See: Companion Code Section 6
Section 7: The Quality Heuristics¶
The quality heuristics deserve special attention because they're critical to the optimization's success. A prefilter that matches too often can actually slow down search—you spend time checking matches that rarely lead to real results.
The is_good method implements conservative thresholds. If any literal is "poisonous" (empty, or a single very common byte), the sequence is rejected. For short literals (length 1), the sequence must have at most 3 literals. For longer literals, the minimum length must be at least 2 and the total count at most 64. The is_really_good method uses stricter thresholds (minimum length 3, count at most 8) and triggers early short-circuiting.
The is_poisonous function identifies literals that would match far too frequently. Empty strings match everywhere, obviously. Single-byte literals are checked against a frequency rank—common bytes like spaces, tabs, and common ASCII letters score above 250 on a 0-255 scale and are considered poisonous. This ranking comes from typical byte frequency distributions in text files.
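The poison check itself is simple once a frequency rank is available. The rank function below is a hypothetical stand-in; the real code uses a precomputed byte-frequency table:

```rust
// Hypothetical stand-in for the precomputed frequency table: treat
// whitespace and a few very common letters as maximally frequent.
fn rank(byte: u8) -> u8 {
    match byte {
        b' ' | b'\t' | b'e' | b't' | b'a' => 255,
        _ => 100,
    }
}

// Empty literals match everywhere; single bytes ranked above 250 on the
// 0-255 scale (higher = more common) would match far too often.
fn is_poisonous(lit: &[u8]) -> bool {
    lit.is_empty() || (lit.len() == 1 && rank(lit[0]) > 250)
}

fn main() {
    assert!(is_poisonous(b""));
    assert!(is_poisonous(b" "));
    assert!(!is_poisonous(b"z"));
    assert!(!is_poisonous(b"Sherlock"));
}
```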
The choose method compares two candidate sequences when the extractor must pick one. It prefers finite over infinite, non-poisonous over poisonous, longer minimum literals over shorter, and fewer total literals over more. When all else is equal, it just picks the first one.
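That preference order can be sketched over a summary of each candidate's properties. This is an illustration of the comparison logic described above, not the actual method, which operates on the sequences themselves:

```rust
// Prefer finite over infinite, non-poisonous over poisonous, a longer
// shortest literal, then fewer literals; ties keep the first candidate.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    finite: bool,
    poisonous: bool,
    min_len: usize,
    count: usize,
}

fn choose(a: Candidate, b: Candidate) -> Candidate {
    if a.finite != b.finite {
        return if a.finite { a } else { b };
    }
    if a.poisonous != b.poisonous {
        return if !a.poisonous { a } else { b };
    }
    if a.min_len != b.min_len {
        return if a.min_len > b.min_len { a } else { b };
    }
    if b.count < a.count { b } else { a }
}

fn main() {
    // One literal of length 4 ("hiya") beats three literals of length 2
    // (["ab", "cd", "ef"]), matching the heuristics test later in this file.
    let hiya = Candidate { finite: true, poisonous: false, min_len: 4, count: 1 };
    let alts = Candidate { finite: true, poisonous: false, min_len: 2, count: 3 };
    assert_eq!(choose(hiya.clone(), alts), hiya);
}
```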
See: Companion Code Section 7
Section 8: Character Class Extraction¶
Character classes like [a-z] or [0-9] can sometimes be expanded into literal sets. The extract_class_unicode and extract_class_bytes methods handle this expansion, but only when the class is small enough. A class like [abc] becomes ["a", "b", "c"], which is useful. A class like [a-z] would expand to 26 literals, already past the default class limit, and [A-Za-z0-9] would be 62, far beyond it.
The limit_class threshold (default 10) controls when classes get expanded versus abandoned. The methods iterate through the class ranges, counting characters, and return an infinite sequence if the count exceeds the limit. This prevents pathological cases where a single character class would blow up the entire extraction.
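A simplified sketch of class expansion with the early bail-out, working on character ranges (the real code handles both Unicode and byte classes):

```rust
// Small classes become literal sets; anything over the limit is abandoned
// (None = "infinite"), with the count checked during iteration so a huge
// range never does full work.
fn expand_class(ranges: &[(char, char)], limit_class: usize) -> Option<Vec<String>> {
    let mut out = Vec::new();
    for &(lo, hi) in ranges {
        for c in lo..=hi {
            out.push(c.to_string());
            if out.len() > limit_class {
                return None; // bail out early, as the limit check does
            }
        }
    }
    Some(out)
}

fn main() {
    // [abc] expands; [a-z] (26 characters) exceeds a limit of 10.
    let abc = expand_class(&[('a', 'c')], 10).unwrap();
    assert_eq!(abc, vec!["a".to_string(), "b".to_string(), "c".to_string()]);
    assert_eq!(expand_class(&[('a', 'z')], 10), None);
}
```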
Unicode classes are particularly tricky because a single Unicode range can contain thousands of codepoints. The class_over_limit_unicode method counts efficiently using the range's len() method, bailing out early if the limit is exceeded partway through iteration.
See: Companion Code Section 8
Section 9: Building the Fast Regex¶
The one_regex method on InnerLiterals is the final step: converting the extracted literal sequence back into an actual regex that can be used for searching. This creates a simple alternation of literal patterns—if we extracted ["foo", "bar", "baz"], the resulting regex is essentially foo|bar|baz.
The returned regex is built with utf8_empty(false), which permits an empty match even at a position that would split a UTF-8 codepoint; this only matters if an empty literal slipped through, and the quality heuristics should have filtered those out. The regex is constructed by converting each literal back into an HIR node and combining them with Hir::alternation. This might seem circular (we started with a regex, extracted literals, and are now building another regex), but the resulting regex is dramatically simpler and will compile to much faster search code.
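To illustrate the shape of the result, here is a sketch using plain string escaping rather than the real HIR-based construction via Hir::alternation:

```rust
// Escape a literal so it matches itself when interpreted as a regex.
// Escaping every non-alphanumeric character is conservative but safe.
fn regex_escape(lit: &str) -> String {
    let mut out = String::new();
    for c in lit.chars() {
        if !c.is_alphanumeric() {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

// Join the extracted literals into a simple alternation pattern.
fn literals_to_pattern(lits: &[&str]) -> String {
    lits.iter()
        .map(|lit| regex_escape(lit))
        .collect::<Vec<_>>()
        .join("|")
}

fn main() {
    assert_eq!(literals_to_pattern(&["foo", "bar", "baz"]), "foo|bar|baz");
    assert_eq!(regex_escape("a.b"), "a\\.b");
}
```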
The Some/None return type reflects that extraction might not produce anything useful. If the sequence is infinite (represented as None from literals()) or empty, there's nothing to build. Callers must handle both cases, falling back to the original regex when no prefilter is available.
See: Companion Code Section 9
Section 10: Connecting to the Configuration Pattern¶
This file participates in ripgrep's broader Builder Pattern Ecosystem in a supporting role. The ConfiguredHIR type passed to InnerLiterals::new comes from the configuration infrastructure explored in earlier lessons. It bundles together the regex's HIR with the configuration options that affect how the regex should be matched.
The configuration dependency is most visible in the line terminator check. The chir.config().line_terminator access retrieves the configured line terminator (usually \n), and its absence disables the entire optimization. This is a concrete example of how configuration flows through the codebase—a command-line flag like --null-data (which uses \0 as the line terminator) changes this configuration, which in turn affects whether inner literal optimization is valid.
The logging statements throughout this file (using log::trace! and log::debug!) connect to ripgrep's broader observability story. When troubleshooting performance, you can enable trace logging to see exactly what literals were extracted, whether they passed quality checks, and what fast regex was built. This introspection is invaluable for understanding why a particular search is fast or slow.
See: Companion Code Section 10
Section 11: The Test Suite Philosophy¶
The comprehensive test suite at the end of the file serves as both verification and documentation. Each test encodes a specific expectation about how literal extraction should behave for a given pattern. Reading through these tests reveals the nuanced correctness requirements.
The tests use helper functions E and I to create exact and inexact literals concisely. The patterns range from simple literals (r"foo") through complex combinations of repetitions, alternations, and character classes. Many tests verify edge cases: empty patterns, impossible patterns (like [a&&b] which can never match), and patterns with unusual quantifier combinations.
The heuristics tests are particularly interesting because they verify the quality-based selection. The test r"[a-z]+(ab|cd|ef)[a-z]+hiya[a-z]+" demonstrates that the extractor should prefer "hiya" over ["ab", "cd", "ef"] because a single longer literal is better than three shorter ones. This kind of test documents intended behavior that might otherwise be opaque from the implementation alone.
See: Companion Code Section 11
Key Takeaways¶
- Inner literal extraction enables line-oriented pre-filtering: By finding strings that must appear in any match and searching for them with fast SIMD routines, ripgrep can skip vast portions of input that could never match.
- Correctness depends on search semantics: The optimization is only valid for line-by-line searching; without line boundaries, you can't safely narrow the regex search after finding a literal.
- Combinatorial limits prevent extraction from becoming expensive: Character class expansion and repetition unrolling can explode exponentially, so hard limits ensure extraction stays fast.
- Quality heuristics are essential: Not all literal sets are helpful; short literals or common bytes can match so frequently that filtering provides no benefit or even hurts performance.
- The "choose best" strategy handles gaps in patterns: When parts of a regex don't yield literals, the extractor tracks multiple candidates and selects the most useful one.
- Exactness tracking enables correct composition: Knowing whether a literal represents the complete match or just part of it is essential for correctly combining literals across concatenations and alternations.
What to Read Next¶
How does the regex matcher actually use these extracted literals? Look back at crates/regex/src/matcher.rs to see how InnerLiterals integrates with the matching infrastructure.
What's the HIR that this file analyzes? The regex_syntax::hir module defines the High-level Intermediate Representation that all these extraction methods traverse.
How does configuration reach this code? Review crates/regex/src/config.rs to understand the ConfiguredHIR type and how options like line terminators flow through the system.
Want to understand the Sink trait that receives matches found through this optimization? See crates/searcher/src/sink.rs from an earlier lesson.