Decompression Support: Code Companion

Reference code for the Decompression Support lecture. Sections correspond to the lecture document.


Section 1: The Two-Layer Builder Architecture

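// Imports used by the excerpts in this companion. The crate::process path
// assumes the excerpts sit next to the module that defines CommandReader.
use std::{
    ffi::{OsStr, OsString},
    fs::File,
    io,
    path::{Path, PathBuf},
    process::Command,
};

use globset::{Glob, GlobSet, GlobSetBuilder};

use crate::process::{CommandError, CommandReader, CommandReaderBuilder};

/// A builder for a matcher that determines which files get decompressed.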
#[derive(Clone, Debug)]
pub struct DecompressionMatcherBuilder {
    /// The commands for each matching glob.
    commands: Vec<DecompressionCommand>,
    /// Whether to include the default matching rules.
    defaults: bool,
}

/// A representation of a single command for decompressing data
/// out-of-process.
#[derive(Clone, Debug)]
struct DecompressionCommand {
    /// The glob that matches this command.
    glob: String,
    /// The command or binary name - stored as absolute path for security
    bin: PathBuf,
    /// The arguments to invoke with the command.
    args: Vec<OsString>,
}

/// Configures and builds a streaming reader for decompressing data.
#[derive(Clone, Debug, Default)]
pub struct DecompressionReaderBuilder {
    // The compiled matcher - expensive to build, cheap to query
    matcher: DecompressionMatcher,
    // Delegates process spawning details to CommandReaderBuilder
    command_builder: CommandReaderBuilder,
}

The two-layer architecture separates concerns: DecompressionMatcherBuilder handles glob-to-command mapping, while DecompressionReaderBuilder handles process execution. Note that DecompressionCommand stores the binary as a PathBuf (absolute path) rather than a program name string.


Section 2: Building the Matcher with Layered Defaults

pub fn build(&self) -> Result<DecompressionMatcher, CommandError> {
    // Conditionally load defaults based on configuration flag
    let defaults = if !self.defaults {
        vec![]
    } else {
        default_decompression_commands()
    };

    let mut glob_builder = GlobSetBuilder::new();
    let mut commands = vec![];

    // Chain defaults with user commands - user commands come AFTER,
    // so they take precedence when multiple globs match
    for decomp_cmd in defaults.iter().chain(&self.commands) {
        // Compile each glob pattern
        let glob = Glob::new(&decomp_cmd.glob).map_err(|err| {
            CommandError::io(io::Error::new(io::ErrorKind::Other, err))
        })?;
        glob_builder.add(glob);
        // Build parallel vector - index in globs matches index in commands
        commands.push(decomp_cmd.clone());
    }

    let globs = glob_builder.build().map_err(|err| {
        CommandError::io(io::Error::new(io::ErrorKind::Other, err))
    })?;

    Ok(DecompressionMatcher { globs, commands })
}

/// Toggle whether defaults are included
pub fn defaults(&mut self, yes: bool) -> &mut DecompressionMatcherBuilder {
    self.defaults = yes;
    self
}

The chain() call ensures user-provided commands are added after defaults. The parallel construction of globs and commands vectors maintains index correspondence for later lookups.
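
The ordering can be seen in isolation with a small sketch. It is not the real implementation: Rule and merge_rules are hypothetical names, and plain strings stand in for compiled globs. The point is only that defaults occupy the low indices and user rules the high ones, with both vectors kept the same length.

```rust
/// Hypothetical stand-in for DecompressionCommand: a glob and a program name.
#[derive(Clone, Debug, PartialEq)]
pub struct Rule {
    pub glob: String,
    pub program: String,
}

/// Merge defaults and user rules into two parallel vectors.
/// Defaults are visited first, so user rules always get the higher indices.
pub fn merge_rules(
    include_defaults: bool,
    defaults: &[Rule],
    user: &[Rule],
) -> (Vec<String>, Vec<Rule>) {
    // An empty slice plays the role of `vec![]` when defaults are disabled.
    let defaults: &[Rule] = if include_defaults { defaults } else { &[] };

    let mut globs = Vec::new();
    let mut commands = Vec::new();
    for rule in defaults.iter().chain(user) {
        // Push to both vectors in lockstep: globs[i] belongs to commands[i].
        globs.push(rule.glob.clone());
        commands.push(rule.clone());
    }
    (globs, commands)
}
```

With one default rule and one user rule for the same glob, the user rule ends up at index 1, which is what a later "last match wins" lookup relies on.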


Section 3: The Associate Methods and Security Considerations

/// Silent version - ignores failures (useful for optional tools)
pub fn associate<P, I, A>(
    &mut self,
    glob: &str,
    program: P,
    args: I,
) -> &mut DecompressionMatcherBuilder
where
    P: AsRef<OsStr>,
    I: IntoIterator<Item = A>,
    A: AsRef<OsStr>,
{
    // Deliberately ignore errors - if the binary doesn't exist,
    // the association is simply not added
    let _ = self.try_associate(glob, program, args);
    self
}

/// Explicit error handling version - returns Result
pub fn try_associate<P, I, A>(
    &mut self,
    glob: &str,
    program: P,
    args: I,
) -> Result<&mut DecompressionMatcherBuilder, CommandError>
where
    P: AsRef<OsStr>,
    I: IntoIterator<Item = A>,
    A: AsRef<OsStr>,
{
    let glob = glob.to_string();
    // SECURITY: Resolve to absolute path NOW, not at execution time
    // This prevents Windows CWD search vulnerability
    let bin = try_resolve_binary(Path::new(program.as_ref()))?;
    let args =
        args.into_iter().map(|a| a.as_ref().to_os_string()).collect();
    self.commands.push(DecompressionCommand { glob, bin, args });
    Ok(self)
}

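The try_resolve_binary() call at configuration time is the key security measure. When a program is spawned by bare name, Windows may search the current working directory, so a file with the tool's name in an untrusted directory could be run instead of the real tool. Resolving against PATH up front stores a full path, so no such search takes place when the command is later executed.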
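
The pairing of a silent and a fallible method can be sketched on its own. Everything here is hypothetical (RuleBuilder, rule_count, and a resolve function that only accepts absolute paths as a stand-in for the PATH search); it is meant to show the shape of the API, not the library's code.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical builder that stores (glob, resolved binary) pairs.
#[derive(Debug, Default)]
pub struct RuleBuilder {
    rules: Vec<(String, PathBuf)>,
}

impl RuleBuilder {
    /// Fallible variant: the caller learns why a rule was rejected.
    pub fn try_associate<P: AsRef<Path>>(
        &mut self,
        glob: &str,
        program: P,
    ) -> Result<&mut Self, String> {
        // Resolve now, at configuration time, and store the result.
        let bin = resolve(program.as_ref())?;
        self.rules.push((glob.to_string(), bin));
        Ok(self)
    }

    /// Silent variant: a failed resolution just means no rule is added.
    pub fn associate<P: AsRef<Path>>(&mut self, glob: &str, program: P) -> &mut Self {
        let _ = self.try_associate(glob, program);
        self
    }

    /// Number of rules that were actually stored.
    pub fn rule_count(&self) -> usize {
        self.rules.len()
    }
}

/// Stand-in resolver (an assumption for this sketch): absolute paths are
/// accepted as-is, anything else is treated as "not found".
fn resolve(program: &Path) -> Result<PathBuf, String> {
    if program.is_absolute() {
        Ok(program.to_path_buf())
    } else {
        Err(format!("{}: could not resolve", program.display()))
    }
}
```

The silent variant suits optional tools, where a missing binary is expected; the fallible variant suits rules the user asked for explicitly, where a typo in the program name should be reported.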


Section 4: Command Lookup and the GlobSet Integration

/// A matcher for determining how to decompress files.
#[derive(Clone, Debug)]
pub struct DecompressionMatcher {
    /// Optimized structure for matching against many globs simultaneously
    globs: GlobSet,
    /// Parallel vector - index corresponds to glob index
    commands: Vec<DecompressionCommand>,
}

impl DecompressionMatcher {
    /// Return a Command if we know how to decompress this file
    pub fn command<P: AsRef<Path>>(&self, path: P) -> Option<Command> {
        // matches() returns ALL matching glob indices
        // next_back() takes the LAST one (most recently added)
        if let Some(i) = self.globs.matches(path).into_iter().next_back() {
            let decomp_cmd = &self.commands[i];
            // Build the Command from our stored configuration
            let mut cmd = Command::new(&decomp_cmd.bin);
            cmd.args(&decomp_cmd.args);
            return Some(cmd);
        }
        None
    }

    /// Cheaper check - just need to know if ANY glob matches
    pub fn has_command<P: AsRef<Path>>(&self, path: P) -> bool {
        // is_match can short-circuit; matches() must find all
        self.globs.is_match(path)
    }
}

The next_back() call implements the precedence rule. GlobSet::matches() returns indices in the order globs were added, so taking the last match gives priority to user customizations over defaults.
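
A minimal sketch of the precedence rule, using only the standard library. As a simplification, patterns of the form "*.ext" are treated as suffix tests rather than compiled globs, and matching_indices and pick are hypothetical helper names. The program names in the usage are only illustrative.

```rust
/// Return the indices of every pattern that matches `name`, in ascending order.
/// Simplification: "*.ext" is handled as "name ends with .ext".
pub fn matching_indices(patterns: &[&str], name: &str) -> Vec<usize> {
    patterns
        .iter()
        .enumerate()
        .filter(|(_, pat)| name.ends_with(pat.trim_start_matches('*')))
        .map(|(i, _)| i)
        .collect()
}

/// Pick the program for `name`: the LAST matching pattern wins.
/// `patterns` and `programs` are parallel slices of equal length.
pub fn pick<'a>(patterns: &[&str], programs: &[&'a str], name: &str) -> Option<&'a str> {
    matching_indices(patterns, name)
        .into_iter()
        .next_back()
        .map(|i| programs[i])
}
```

With patterns ["*.gz", "*.xz", "*.gz"] and programs ["gzip", "xz", "pigz"], a lookup for "log.gz" matches indices 0 and 2 and selects the later entry, so a rule added after the defaults overrides them without the defaults having to be removed.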


Section 5: The DecompressionReaderBuilder and Graceful Degradation

impl DecompressionReaderBuilder {
    pub fn build<P: AsRef<Path>>(
        &self,
        path: P,
    ) -> Result<DecompressionReader, CommandError> {
        let path = path.as_ref();

        // Path 1: No matching command -> passthru reader
        let Some(mut cmd) = self.matcher.command(path) else {
            return DecompressionReader::new_passthru(path);
        };

        // Add the file path as final argument to the decompression command
        cmd.arg(path);

        match self.command_builder.build(&mut cmd) {
            // Path 2: Command spawned successfully
            Ok(cmd_reader) => Ok(DecompressionReader { rdr: Ok(cmd_reader) }),
            // Path 3: Spawn failed -> log and fallback to passthru
            Err(err) => {
                log::debug!(
                    "{}: error spawning command '{:?}': {} \
                     (falling back to uncompressed reader)",
                    path.display(),
                    cmd,
                    err,
                );
                DecompressionReader::new_passthru(path)
            }
        }
    }

    /// Control async stderr reading to prevent deadlocks
    pub fn async_stderr(
        &mut self,
        yes: bool,
    ) -> &mut DecompressionReaderBuilder {
        self.command_builder.async_stderr(yes);
        self
    }
}

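The three-path logic means a missing or broken decompression tool does not abort the search: the file is read as-is instead. Spawn errors are logged at debug level, so users can diagnose them while normal output stays clean. In the code above, the only error build() can return is a failure to open the file for the passthru reader.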
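
The same three paths can be sketched with std::process::Command directly. This is a simplified illustration under several assumptions: Outcome and read_contents are hypothetical names, the whole output is read into memory instead of being streamed, the child's exit status is ignored, and eprintln! stands in for log::debug!.

```rust
use std::fs;
use std::io::{self, Read};
use std::path::Path;
use std::process::{Command, Stdio};

/// Which path was taken. Returned only to make the sketch easy to inspect.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    NoRule,
    Decompressed,
    SpawnFailed,
}

/// Read `path`, decompressing through `program -d -c <path>` when a program
/// is given and can be spawned, and falling back to the raw bytes otherwise.
pub fn read_contents(path: &Path, program: Option<&str>) -> io::Result<(Outcome, Vec<u8>)> {
    // Path 1: no rule for this file, read it as-is.
    let Some(program) = program else {
        return Ok((Outcome::NoRule, fs::read(path)?));
    };

    let spawned = Command::new(program)
        .args(["-d", "-c"])
        .arg(path) // the file path goes last, as in build()
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::null())
        .spawn();

    match spawned {
        // Path 2: the tool started, so its stdout is the decompressed data.
        Ok(mut child) => {
            let mut data = Vec::new();
            if let Some(mut stdout) = child.stdout.take() {
                stdout.read_to_end(&mut data)?;
            }
            child.wait()?; // reap the child; exit status ignored in this sketch
            Ok((Outcome::Decompressed, data))
        }
        // Path 3: the tool could not be started, so report and read the raw file.
        Err(err) => {
            eprintln!(
                "{}: error spawning '{}': {} (falling back to raw file)",
                path.display(),
                program,
                err
            );
            Ok((Outcome::SpawnFailed, fs::read(path)?))
        }
    }
}
```

Note that the fallback only covers a failure to start the tool. A tool that starts and then fails on corrupt input is a separate case that this sketch does not handle.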


Section 6: The DecompressionReader and Unified I/O

/// A streaming reader for decompressing the contents of a file.
#[derive(Debug)]
pub struct DecompressionReader {
    // Result used as Either: Ok = decompressing, Err = raw file
    rdr: Result<CommandReader, File>,
}

impl DecompressionReader {
    /// Convenience constructor using default matcher
    pub fn new<P: AsRef<Path>>(
        path: P,
    ) -> Result<DecompressionReader, CommandError> {
        DecompressionReaderBuilder::new().build(path)
    }

    /// Create a reader that just passes through file contents
    fn new_passthru(path: &Path) -> Result<DecompressionReader, CommandError> {
        let file = File::open(path)?;
        // Err variant holds the File for passthru mode
        Ok(DecompressionReader { rdr: Err(file) })
    }

    /// Clean up child process resources
    pub fn close(&mut self) -> io::Result<()> {
        match self.rdr {
            Ok(ref mut rdr) => rdr.close(),
            Err(_) => Ok(()),  // File needs no special cleanup
        }
    }
}

/// Unified Read implementation - caller doesn't know which mode is active
impl io::Read for DecompressionReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        match self.rdr {
            Ok(ref mut rdr) => rdr.read(buf),   // Read from process stdout
            Err(ref mut rdr) => rdr.read(buf),  // Read from file directly
        }
    }
}

The Result<CommandReader, File> type is an unconventional use of Result as a sum type (like Either). Both variants implement io::Read, so the Read implementation simply delegates to whichever is present.
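
The delegation pattern can be tried without spawning anything by making the two reader types generic. EitherReader and its constructors are hypothetical names for this sketch; in-memory readers then stand in for the child process and the file.

```rust
use std::io::{self, Read};

/// Generic version of the pattern: Ok holds the "decompressing" reader,
/// Err holds the "passthru" reader. Neither variant means failure here.
pub struct EitherReader<A, B> {
    rdr: Result<A, B>,
}

impl<A, B> EitherReader<A, B> {
    /// Wrap a reader that produces decompressed data.
    pub fn decompressing(rdr: A) -> Self {
        EitherReader { rdr: Ok(rdr) }
    }

    /// Wrap a reader that produces the raw file contents.
    pub fn passthru(rdr: B) -> Self {
        EitherReader { rdr: Err(rdr) }
    }

    /// Report which mode is active.
    pub fn is_decompressing(&self) -> bool {
        self.rdr.is_ok()
    }
}

impl<A: Read, B: Read> Read for EitherReader<A, B> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Both arms make the same call; only the concrete type differs.
        match self.rdr {
            Ok(ref mut rdr) => rdr.read(buf),
            Err(ref mut rdr) => rdr.read(buf),
        }
    }
}
```

A dedicated two-variant enum would express the same thing with clearer names. Reusing Result avoids defining a new type, at the cost of an Err variant that does not indicate an error.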


Section 7: Binary Resolution and Default Commands

fn try_resolve_binary<P: AsRef<Path>>(
    prog: P,
) -> Result<PathBuf, CommandError> {
    use std::env;

    fn is_exe(path: &Path) -> bool {
        let Ok(md) = path.metadata() else { return false };
        !md.is_dir()
    }

    let prog = prog.as_ref();
    // Already absolute? Return as-is
    if prog.is_absolute() {
        return Ok(prog.to_path_buf());
    }

    // Search PATH environment variable
    let Some(syspaths) = env::var_os("PATH") else {
        let msg = "system PATH environment variable not found";
        return Err(CommandError::io(io::Error::new(
            io::ErrorKind::Other,
            msg,
        )));
    };

    for syspath in env::split_paths(&syspaths) {
        if syspath.as_os_str().is_empty() {
            continue;
        }
        let abs_prog = syspath.join(prog);
        if is_exe(&abs_prog) {
            return Ok(abs_prog.to_path_buf());
        }
        // Windows: try .com and .exe extensions
        if abs_prog.extension().is_none() {
            for extension in ["com", "exe"] {
                let abs_prog = abs_prog.with_extension(extension);
                if is_exe(&abs_prog) {
                    return Ok(abs_prog.to_path_buf());
                }
            }
        }
    }

    let msg = format!("{}: could not find executable in PATH", prog.display());
    return Err(CommandError::io(io::Error::new(io::ErrorKind::Other, msg)));
}

fn default_decompression_commands() -> Vec<DecompressionCommand> {
    const ARGS_GZIP: &[&str] = &["gzip", "-d", "-c"];
    const ARGS_BZIP: &[&str] = &["bzip2", "-d", "-c"];
    const ARGS_XZ: &[&str] = &["xz", "-d", "-c"];
    // ... more compression formats ...

    fn add(glob: &str, args: &[&str], cmds: &mut Vec<DecompressionCommand>) {
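        // Skip the rule (with a debug log) if the binary cannot be resolved.
        // resolve_binary is the public wrapper around try_resolve_binary
        // and is not shown in this section.
        let bin = match resolve_binary(Path::new(args[0])) {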
            Ok(bin) => bin,
            Err(err) => {
                log::debug!("{}", err);
                return;  // Don't add rule for missing tools
            }
        };
        cmds.push(DecompressionCommand {
            glob: glob.to_string(),
            bin,
            args: args.iter().skip(1)  // Skip program name
                .map(|s| OsStr::new(s).to_os_string())
                .collect(),
        });
    }

    let mut cmds = vec![];
    add("*.gz", ARGS_GZIP, &mut cmds);
    add("*.tgz", ARGS_GZIP, &mut cmds);
    // ... register all supported formats ...
    cmds
}

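Most of the default commands share the same two flags: -d to decompress and -c to write the result to stdout. The exceptions are visible in the Quick Reference table below (for example, uncompress takes only -c, and zstd adds -q). The add() helper skips any tool whose binary cannot be resolved and logs the reason at debug level, so the matcher only contains rules for tools that were found.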
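
The PATH walk is easier to experiment with when the search path is passed in rather than read from the environment. The following is a reduced sketch under that assumption: find_in is a hypothetical name, the ".com"/".exe" extension step is left out, and "not a directory" is the only executability check, as in is_exe() above.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

/// Search each directory in `search_path` (a PATH-style value) for `prog`
/// and return the first candidate that exists and is not a directory.
pub fn find_in(search_path: &OsStr, prog: &str) -> Option<PathBuf> {
    env::split_paths(search_path)
        // Empty entries are skipped, mirroring the loop above.
        .filter(|dir| !dir.as_os_str().is_empty())
        .map(|dir| dir.join(prog))
        .find(|candidate| {
            candidate
                .metadata()
                .map(|md| !md.is_dir())
                .unwrap_or(false)
        })
}
```

Calling find_in(&env::var_os("PATH").unwrap_or_default(), "gzip") would perform the lookup against the real environment; whether it returns Some depends on what is installed on the machine.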


Quick Reference

Supported Compression Formats

Extension      Command                   Notes
.gz, .tgz      gzip -d -c                Most common
.bz2, .tbz2    bzip2 -d -c               Higher compression
.xz, .txz      xz -d -c                  Modern, high ratio
.lz4           lz4 -d -c                 Fast decompression
.lzma          xz --format=lzma -d -c    Legacy format
.br            brotli -d -c              Web-optimized
.zst, .zstd    zstd -q -d -c             Fast, modern
.Z             uncompress -c             Unix compress

Key Types

// Builder for glob-to-command mapping
let matcher = DecompressionMatcherBuilder::new()
    .defaults(true)                    // Include built-in rules
    .associate("*.custom", "tool", ["-d"])  // Add custom rule
    .build()?;                         // -> DecompressionMatcher

// Compiled matcher (reuse for many files)
matcher.command(&path)                 // -> Option<Command>
matcher.has_command(&path)             // -> bool

// Reader builder (reuse matcher)
DecompressionReaderBuilder::new()
    .matcher(matcher)
    .async_stderr(true)
    .build(&path)?                     // -> DecompressionReader

// Convenience (builds matcher each time)
DecompressionReader::new(&path)?       // -> DecompressionReader

Graceful Degradation Flow

Path → Matcher lookup
    ┌────┴────┐
    │ No match │ → Passthru reader (raw file)
    └────┬────┘
    Spawn command
    ┌────┴────┐
    │  Failed  │ → Log debug, passthru reader
    └────┬────┘
    CommandReader (decompressing)