Decompression Support: Code Companion
Reference code for the Decompression Support lecture. Sections correspond to the lecture document.
Section 1: The Two-Layer Builder Architecture
/// A builder for a matcher that determines which files get decompressed.
#[derive(Clone, Debug)]
pub struct DecompressionMatcherBuilder {
/// The commands for each matching glob.
commands: Vec<DecompressionCommand>,
/// Whether to include the default matching rules.
defaults: bool,
}
/// A representation of a single command for decompressing data
/// out-of-process.
#[derive(Clone, Debug)]
struct DecompressionCommand {
/// The glob that matches this command.
glob: String,
/// The command or binary name - stored as absolute path for security
bin: PathBuf,
/// The arguments to invoke with the command.
args: Vec<OsString>,
}
/// Configures and builds a streaming reader for decompressing data.
#[derive(Clone, Debug, Default)]
pub struct DecompressionReaderBuilder {
// The compiled matcher - expensive to build, cheap to query
matcher: DecompressionMatcher,
// Delegates process spawning details to CommandReaderBuilder
command_builder: CommandReaderBuilder,
}
The two-layer architecture separates concerns: DecompressionMatcherBuilder handles glob-to-command mapping, while DecompressionReaderBuilder handles process execution. Note that DecompressionCommand stores the binary as a PathBuf (absolute path) rather than a program name string.
Section 2: Building the Matcher with Layered Defaults
pub fn build(&self) -> Result<DecompressionMatcher, CommandError> {
// Conditionally load defaults based on configuration flag
let defaults = if !self.defaults {
vec![]
} else {
default_decompression_commands()
};
let mut glob_builder = GlobSetBuilder::new();
let mut commands = vec![];
// Chain defaults with user commands - user commands come AFTER,
// so they take precedence when multiple globs match
for decomp_cmd in defaults.iter().chain(&self.commands) {
// Compile each glob pattern
let glob = Glob::new(&decomp_cmd.glob).map_err(|err| {
CommandError::io(io::Error::new(io::ErrorKind::Other, err))
})?;
glob_builder.add(glob);
// Build parallel vector - index in globs matches index in commands
commands.push(decomp_cmd.clone());
}
let globs = glob_builder.build().map_err(|err| {
CommandError::io(io::Error::new(io::ErrorKind::Other, err))
})?;
Ok(DecompressionMatcher { globs, commands })
}
/// Toggle whether defaults are included
pub fn defaults(&mut self, yes: bool) -> &mut DecompressionMatcherBuilder {
self.defaults = yes;
self
}
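The chain() call yields every default before any user-provided command, so user commands always receive the higher indices. Because each loop iteration adds one glob to glob_builder and pushes one entry to commands, the two stay in index correspondence: glob i always belongs to commands[i], which later lookups rely on.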
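The ordering and the parallel vectors can be sketched with the standard library alone. This is a minimal model, not the crate's code: Rule, ToyMatcher, rule() and build_matcher() are hypothetical names, and plain file suffixes stand in for compiled globs.

```rust
/// Hypothetical stand-in for a decompression rule: a file suffix
/// instead of a real glob, plus the program to run.
#[derive(Clone, Debug, PartialEq)]
struct Rule {
    suffix: String,
    program: String,
}

/// Toy matcher holding two parallel vectors: suffixes[i] belongs to rules[i].
#[derive(Debug)]
struct ToyMatcher {
    suffixes: Vec<String>,
    rules: Vec<Rule>,
}

/// Small constructor to keep the example readable.
fn rule(suffix: &str, program: &str) -> Rule {
    Rule { suffix: suffix.to_string(), program: program.to_string() }
}

/// Build the matcher: defaults first (if enabled), then user rules.
fn build_matcher(use_defaults: bool, user: &[Rule]) -> ToyMatcher {
    let defaults = if use_defaults {
        vec![rule(".gz", "gzip"), rule(".xz", "xz")]
    } else {
        vec![]
    };
    let mut suffixes = vec![];
    let mut rules = vec![];
    // chain() yields every default before any user rule, so user rules
    // always end up at the higher indices.
    for r in defaults.iter().chain(user) {
        suffixes.push(r.suffix.clone());
        rules.push(r.clone());
    }
    ToyMatcher { suffixes, rules }
}
```

Calling build_matcher(true, &[rule(".gz", "pigz")]) places the user's ".gz" rule at index 2, after both defaults, which is what lets a "last match wins" lookup prefer it.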
Section 3: The Associate Methods and Security Considerations
/// Silent version - ignores failures (useful for optional tools)
pub fn associate<P, I, A>(
&mut self,
glob: &str,
program: P,
args: I,
) -> &mut DecompressionMatcherBuilder
where
P: AsRef<OsStr>,
I: IntoIterator<Item = A>,
A: AsRef<OsStr>,
{
// Deliberately ignore errors - if the binary doesn't exist,
// the association is simply not added
let _ = self.try_associate(glob, program, args);
self
}
/// Explicit error handling version - returns Result
pub fn try_associate<P, I, A>(
&mut self,
glob: &str,
program: P,
args: I,
) -> Result<&mut DecompressionMatcherBuilder, CommandError>
where
P: AsRef<OsStr>,
I: IntoIterator<Item = A>,
A: AsRef<OsStr>,
{
let glob = glob.to_string();
// SECURITY: Resolve to absolute path NOW, not at execution time
// This prevents Windows CWD search vulnerability
let bin = try_resolve_binary(Path::new(program.as_ref()))?;
let args =
args.into_iter().map(|a| a.as_ref().to_os_string()).collect();
self.commands.push(DecompressionCommand { glob, bin, args });
Ok(self)
}
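The try_resolve_binary() call at configuration time is the key security measure. On Windows, launching a program by bare name can search the current working directory, so an executable planted next to the files being searched could run in place of the intended tool. Resolving against PATH when the association is created stores a full path instead, so executing the command later does not involve that search. The associate() wrapper discards the error from this step, which makes an unresolvable program equivalent to no association at all.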
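The relationship between the silent and the fallible method can be sketched as follows. This is an illustration under stated assumptions: ToyBuilder and its resolve() helper are hypothetical, and an explicit list of search directories stands in for the PATH variable so the behavior is easy to test.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical builder that records (pattern, resolved binary) pairs.
#[derive(Debug, Default)]
struct ToyBuilder {
    /// Directories to search; stands in for the PATH variable.
    search_dirs: Vec<PathBuf>,
    rules: Vec<(String, PathBuf)>,
}

impl ToyBuilder {
    /// Fallible version: resolve the program now and report failure.
    fn try_associate(
        &mut self,
        pattern: &str,
        program: &str,
    ) -> io::Result<&mut ToyBuilder> {
        let bin = self.resolve(Path::new(program))?;
        self.rules.push((pattern.to_string(), bin));
        Ok(self)
    }

    /// Silent version: an unresolvable program simply adds no rule.
    fn associate(&mut self, pattern: &str, program: &str) -> &mut ToyBuilder {
        let _ = self.try_associate(pattern, program);
        self
    }

    /// Return the first existing file named `program` in the search dirs.
    fn resolve(&self, program: &Path) -> io::Result<PathBuf> {
        self.search_dirs
            .iter()
            .map(|dir| dir.join(program))
            .find(|candidate| candidate.is_file())
            .ok_or_else(|| {
                io::Error::new(
                    io::ErrorKind::NotFound,
                    format!("{}: not found", program.display()),
                )
            })
    }
}
```

The stored value is the joined path, not the name the caller passed in, so whatever runs later is the file that was found at configuration time.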
Section 4: Command Lookup and the GlobSet Integration
/// A matcher for determining how to decompress files.
#[derive(Clone, Debug)]
pub struct DecompressionMatcher {
/// Optimized structure for matching against many globs simultaneously
globs: GlobSet,
/// Parallel vector - index corresponds to glob index
commands: Vec<DecompressionCommand>,
}
impl DecompressionMatcher {
/// Return a Command if we know how to decompress this file
pub fn command<P: AsRef<Path>>(&self, path: P) -> Option<Command> {
// matches() returns ALL matching glob indices
// next_back() takes the LAST one (most recently added)
if let Some(i) = self.globs.matches(path).into_iter().next_back() {
let decomp_cmd = &self.commands[i];
// Build the Command from our stored configuration
let mut cmd = Command::new(&decomp_cmd.bin);
cmd.args(&decomp_cmd.args);
return Some(cmd);
}
None
}
/// Cheaper check - just need to know if ANY glob matches
pub fn has_command<P: AsRef<Path>>(&self, path: P) -> bool {
// is_match can short-circuit; matches() must find all
self.globs.is_match(path)
}
}
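The next_back() call implements the precedence rule. GlobSet::matches() returns the indices of all matching globs in ascending order, which is the order the globs were added, so taking the last index gives user customizations priority over defaults. has_command() skips this work entirely: it only needs to know whether at least one glob matches.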
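A small model of the lookup makes the rule concrete. It is a sketch only: matching_indices(), pick_program() and has_program() are hypothetical helpers, and suffix tests stand in for glob matching.

```rust
/// Indices of every suffix that matches `path`, in insertion order.
/// This plays the role of `GlobSet::matches` in the toy model.
fn matching_indices(suffixes: &[&str], path: &str) -> Vec<usize> {
    suffixes
        .iter()
        .enumerate()
        .filter(|(_, suffix)| path.ends_with(**suffix))
        .map(|(i, _)| i)
        .collect()
}

/// Pick the program for `path`: the last matching rule wins.
/// Each rule is a (suffix, program) pair.
fn pick_program<'a>(rules: &[(&'a str, &'a str)], path: &str) -> Option<&'a str> {
    let suffixes: Vec<&str> = rules.iter().map(|(s, _)| *s).collect();
    matching_indices(&suffixes, path)
        .into_iter()
        .next_back() // highest index = most recently added rule
        .map(|i| rules[i].1)
}

/// Cheaper yes/no check: `any` stops at the first matching rule.
fn has_program(rules: &[(&str, &str)], path: &str) -> bool {
    rules.iter().any(|(suffix, _)| path.ends_with(*suffix))
}
```

With rules [(".gz", "gzip"), (".xz", "xz"), (".gz", "pigz")], a path ending in ".gz" matches indices 0 and 2, and pick_program returns "pigz" because index 2 was added last.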
Section 5: The DecompressionReaderBuilder and Graceful Degradation
impl DecompressionReaderBuilder {
pub fn build<P: AsRef<Path>>(
&self,
path: P,
) -> Result<DecompressionReader, CommandError> {
let path = path.as_ref();
// Path 1: No matching command -> passthru reader
let Some(mut cmd) = self.matcher.command(path) else {
return DecompressionReader::new_passthru(path);
};
// Add the file path as final argument to the decompression command
cmd.arg(path);
match self.command_builder.build(&mut cmd) {
// Path 2: Command spawned successfully
Ok(cmd_reader) => Ok(DecompressionReader { rdr: Ok(cmd_reader) }),
// Path 3: Spawn failed -> log and fallback to passthru
Err(err) => {
log::debug!(
"{}: error spawning command '{:?}': {} \
(falling back to uncompressed reader)",
path.display(),
cmd,
err,
);
DecompressionReader::new_passthru(path)
}
}
}
/// Control async stderr reading to prevent deadlocks
pub fn async_stderr(
&mut self,
yes: bool,
) -> &mut DecompressionReaderBuilder {
self.command_builder.async_stderr(yes);
self
}
}
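The three-path logic means a missing or failing decompression tool does not abort the search: the file is read as-is instead. The only error build() returns is a failure to open the file itself in passthru mode. Spawn errors are logged at debug level, allowing users to diagnose issues while keeping normal output clean.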
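The same three paths can be reproduced with std::process::Command directly. This is a minimal sketch, not the crate's implementation: Source, open_with_fallback() and read_all() are hypothetical names, the program is passed as a bare name rather than a resolved path, and eprintln! replaces the debug log.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;
use std::process::{Child, Command, Stdio};

/// Which kind of source ended up being used (names are illustrative).
enum Source {
    /// A child process is producing decompressed bytes on stdout.
    Process(Child),
    /// The file is read as-is.
    Passthru(File),
}

/// Open `path`, optionally through `program`. A spawn failure is not
/// an error here: it falls back to reading the raw file.
fn open_with_fallback(path: &Path, program: Option<&str>) -> io::Result<Source> {
    // Path 1: no command configured for this file.
    let Some(program) = program else {
        return Ok(Source::Passthru(File::open(path)?));
    };
    match Command::new(program).arg(path).stdout(Stdio::piped()).spawn() {
        // Path 2: the process started.
        Ok(child) => Ok(Source::Process(child)),
        // Path 3: spawning failed; report it and read the file directly.
        Err(err) => {
            eprintln!("{}: spawn failed: {} (reading raw)", path.display(), err);
            Ok(Source::Passthru(File::open(path)?))
        }
    }
}

/// Read everything from whichever source was chosen.
fn read_all(source: Source) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    match source {
        Source::Process(mut child) => {
            if let Some(mut out) = child.stdout.take() {
                out.read_to_end(&mut buf)?;
            }
            child.wait()?; // reap the child so it does not linger
        }
        Source::Passthru(mut file) => {
            file.read_to_end(&mut buf)?;
        }
    }
    Ok(buf)
}
```

The fallback trades correctness of content for availability: after a failed spawn the caller receives the compressed bytes, which is usually still preferable to skipping the file with a hard error.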
Section 6: The DecompressionReader and Unified I/O
/// A streaming reader for decompressing the contents of a file.
#[derive(Debug)]
pub struct DecompressionReader {
// Result used as Either: Ok = decompressing, Err = raw file
rdr: Result<CommandReader, File>,
}
impl DecompressionReader {
/// Convenience constructor using default matcher
pub fn new<P: AsRef<Path>>(
path: P,
) -> Result<DecompressionReader, CommandError> {
DecompressionReaderBuilder::new().build(path)
}
/// Create a reader that just passes through file contents
fn new_passthru(path: &Path) -> Result<DecompressionReader, CommandError> {
let file = File::open(path)?;
// Err variant holds the File for passthru mode
Ok(DecompressionReader { rdr: Err(file) })
}
/// Clean up child process resources
pub fn close(&mut self) -> io::Result<()> {
match self.rdr {
Ok(ref mut rdr) => rdr.close(),
Err(_) => Ok(()), // File needs no special cleanup
}
}
}
/// Unified Read implementation - caller doesn't know which mode is active
impl io::Read for DecompressionReader {
fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
match self.rdr {
Ok(ref mut rdr) => rdr.read(buf), // Read from process stdout
Err(ref mut rdr) => rdr.read(buf), // Read from file directly
}
}
}
The Result<CommandReader, File> type is an unconventional use of Result as a sum type (like Either). Both variants implement io::Read, so the Read implementation simply delegates to whichever is present.
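The delegation pattern works for any two types that implement io::Read. In the sketch below, EitherReader and slurp() are hypothetical; Cursor<Vec<u8>> stands in for the process reader and a byte slice stands in for the file, which are assumptions made only to keep the example self-contained.

```rust
use std::io::{self, Cursor, Read};

/// Toy reader: `Ok` holds one reader type, `Err` holds another.
/// Neither variant signals failure; Result is used as a two-way sum type.
struct EitherReader<'a> {
    rdr: Result<Cursor<Vec<u8>>, &'a [u8]>,
}

impl<'a> Read for EitherReader<'a> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        match self.rdr {
            Ok(ref mut rdr) => rdr.read(buf),  // first reader type
            Err(ref mut rdr) => rdr.read(buf), // second reader type
        }
    }
}

/// Drain any reader into a String without knowing its concrete type.
fn slurp<R: Read>(mut rdr: R) -> io::Result<String> {
    let mut out = String::new();
    rdr.read_to_string(&mut out)?;
    Ok(out)
}
```

A dedicated two-variant enum would express the same thing with clearer names; reusing Result avoids defining a new type at the cost of Err not meaning "error".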
Section 7: Binary Resolution and Default Commands
fn try_resolve_binary<P: AsRef<Path>>(
prog: P,
) -> Result<PathBuf, CommandError> {
use std::env;
fn is_exe(path: &Path) -> bool {
let Ok(md) = path.metadata() else { return false };
!md.is_dir()
}
let prog = prog.as_ref();
// Already absolute? Return as-is
if prog.is_absolute() {
return Ok(prog.to_path_buf());
}
// Search PATH environment variable
let Some(syspaths) = env::var_os("PATH") else {
let msg = "system PATH environment variable not found";
return Err(CommandError::io(io::Error::new(
io::ErrorKind::Other,
msg,
)));
};
for syspath in env::split_paths(&syspaths) {
if syspath.as_os_str().is_empty() {
continue;
}
let abs_prog = syspath.join(prog);
if is_exe(&abs_prog) {
return Ok(abs_prog.to_path_buf());
}
// Windows: try .com and .exe extensions
if abs_prog.extension().is_none() {
for extension in ["com", "exe"] {
let abs_prog = abs_prog.with_extension(extension);
if is_exe(&abs_prog) {
return Ok(abs_prog.to_path_buf());
}
}
}
}
let msg = format!("{}: could not find executable in PATH", prog.display());
return Err(CommandError::io(io::Error::new(io::ErrorKind::Other, msg)));
}
fn default_decompression_commands() -> Vec<DecompressionCommand> {
const ARGS_GZIP: &[&str] = &["gzip", "-d", "-c"];
const ARGS_BZIP: &[&str] = &["bzip2", "-d", "-c"];
const ARGS_XZ: &[&str] = &["xz", "-d", "-c"];
// ... more compression formats ...
fn add(glob: &str, args: &[&str], cmds: &mut Vec<DecompressionCommand>) {
// Silently skip if binary not found in PATH
let bin = match resolve_binary(Path::new(args[0])) {
Ok(bin) => bin,
Err(err) => {
log::debug!("{}", err);
return; // Don't add rule for missing tools
}
};
cmds.push(DecompressionCommand {
glob: glob.to_string(),
bin,
args: args.iter().skip(1) // Skip program name
.map(|s| OsStr::new(s).to_os_string())
.collect(),
});
}
let mut cmds = vec![];
add("*.gz", ARGS_GZIP, &mut cmds);
add("*.tgz", ARGS_GZIP, &mut cmds);
// ... register all supported formats ...
cmds
}
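Most of these tools share the -d -c flags: -d to decompress, -c to write to stdout (uncompress takes only -c, and zstd adds -q). Each constant lists the program name first, so add() resolves args[0] and stores the remaining elements as arguments. The helper logs at debug level and skips any tool whose binary cannot be resolved, so the matcher holds only commands whose binaries were found when it was built. Note that add() calls resolve_binary, which this excerpt does not define; only try_resolve_binary is shown.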
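How such a spec turns into an invocation can be checked with std::process::Command, which exposes the configured program and arguments without running anything. This is a sketch: command_for() and args_of() are hypothetical helpers, and the binary path in the usage note is an assumed example, not a value the library guarantees.

```rust
use std::ffi::OsStr;
use std::path::Path;
use std::process::Command;

/// Build a Command from an already resolved binary and a spec whose
/// first element is the program name. The name is skipped, the rest
/// become arguments, and the file path is appended last.
fn command_for(bin: &Path, spec: &[&str], file: &Path) -> Command {
    let mut cmd = Command::new(bin);
    cmd.args(spec.iter().skip(1)).arg(file);
    cmd
}

/// Collect the arguments of a Command for inspection.
fn args_of(cmd: &Command) -> Vec<&OsStr> {
    cmd.get_args().collect()
}
```

For example, command_for(Path::new("/usr/bin/gzip"), &["gzip", "-d", "-c"], Path::new("log.gz")) describes the invocation /usr/bin/gzip -d -c log.gz; nothing is spawned until the caller asks for it.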
Quick Reference
Supported Compression Formats
| Extension | Command | Notes |
|---|---|---|
| .gz, .tgz | gzip -d -c | Most common |
| .bz2, .tbz2 | bzip2 -d -c | Higher compression |
| .xz, .txz | xz -d -c | Modern, high ratio |
| .lz4 | lz4 -d -c | Fast decompression |
| .lzma | xz --format=lzma -d -c | Legacy format |
| .br | brotli -d -c | Web-optimized |
| .zst, .zstd | zstd -q -d -c | Fast, modern |
| .Z | uncompress -c | Unix compress |
Key Types
// Builder for glob-to-command mapping
DecompressionMatcherBuilder::new()
.defaults(true) // Include built-in rules
.associate("*.custom", "tool", ["-d"]) // Add custom rule
.build()? // -> DecompressionMatcher
// Compiled matcher (reuse for many files)
DecompressionMatcher::command(&path) // -> Option<Command>
DecompressionMatcher::has_command(&path) // -> bool
// Reader builder (reuse matcher)
DecompressionReaderBuilder::new()
.matcher(matcher)
.async_stderr(true)
.build(&path)? // -> DecompressionReader
// Convenience (builds matcher each time)
DecompressionReader::new(&path)? // -> DecompressionReader