Inside the Rust Compiler: From Source Code to Binary¶

This comprehensive guide explores the internal workings of the Rust compiler (rustc), from parsing source code to generating binaries. We'll examine the various intermediate representations (IRs) and learn how to interact with the compiler programmatically.

What is Rust?¶

Rust is defined by the Rust Language team as "a language empowering everyone to build reliable and efficient software." Key characteristics include:

Systems-level control: Explicit memory allocation control
Efficiency: No garbage collector, compiles to small binaries for embedded systems
Memory safety: Prevents buffer overflows, use-after-free, and dangling pointers
Thread safety: Data race freedom in concurrent code
Compile-time guarantees: All safety checks happen at compile time

Borrowing and Lifetime Semantics¶

Rust's safety guarantees come from its borrowing rules: - Shared data cannot be mutated: Multiple immutable references are allowed - Mutable data cannot be aliased: Only one mutable reference at a time - Lifetimes ensure validity: References cannot outlive the data they point to

The `unsafe` Keyword¶

The unsafe keyword allows you to: - Directly manipulate memory and raw pointers - Call unsafe functions - Access mutable static variables - Implement unsafe traits

When used appropriately (like in the standard library), unsafe enables powerful abstractions while maintaining safety at higher levels.

The Rust Compiler Architecture¶

The Rust compiler (rustc) transforms source code through multiple intermediate representations before generating a binary:

Rust Source Code
      ↓
Abstract Syntax Tree (AST)
      ↓
High-level IR (HIR)
      ↓
Typed HIR (THIR)
      ↓
Mid-level IR (MIR)
      ↓
Code Generation (LLVM/GCC/Cranelift)
      ↓
Target Binary

Exploring the Rust Compiler Repository¶

The main Rust compiler repository contains several important directories:

compiler/: Core compiler code organized into multiple crates
library/: Standard library and core types
src/: Bootstrap and build scripts
tests/: Comprehensive test suite

Building the Compiler¶

To build the Rust compiler from source:

# Initial setup
./x.py setup
# Choose option B for compiler development

# Build the compiler
./x.py build

# Incremental builds (much faster)
./x.py build --keep-stage 1

The build process creates multiple compiler stages due to bootstrapping: - Stage 0: Minimal compiler built with existing Rust toolchain - Stage 1: Full compiler built with Stage 0 - Stage 2: Self-hosted compiler (Stage 1 building itself)

Intermediate Representations in Detail¶

Abstract Syntax Tree (AST)¶

The AST is created through lexing and parsing:

// Source code
fn main() {
    println!("Hello, world!");
}

View the AST using:

rustc -Z unpretty=ast hello.rs

The AST represents the program structure as a tree, with nodes for functions, expressions, and statements.

High-level IR (HIR)¶

HIR is created by lowering and desugaring the AST: - Multiple loop constructs (for, while, loop) are normalized to loop - Control flow is simplified to match statements - The prelude is made explicit

# View HIR
rustc -Z unpretty=hir hello.rs

# View HIR tree structure
rustc -Z unpretty=hir-tree hello.rs

At the HIR level, the compiler performs: - Type inference: Determining types from context - Trait solving: Ensuring generic constraints can be satisfied - Type checking: Verifying type compatibility

Example of trait bounds:

fn log_elements<T: std::fmt::Display>(vec: Vec<T>) {
    for (i, element) in vec.iter().enumerate() {
        println!("{}: {}", i, element);
    }
}

Typed HIR (THIR)¶

THIR contains fully elaborated types and is used for: - Unsafe code analysis - Additional type-related checks

rustc -Z unpretty=thir-tree hello.rs

Mid-level IR (MIR)¶

MIR transforms the tree structure into a Control Flow Graph (CFG):

# View MIR (recommended flags)
rustc -Z unpretty=mir -Z mir-opt-level=0 hello.rs

MIR represents programs as: - Places in memory instead of variables - Basic blocks containing statements - Terminators that branch between blocks

Example MIR structure:

fn main() -> () {
    let mut _0: ();
    let _1: ();

    bb0: {
        _1 = const "Hello, world!";
        _2 = &_1;
        _3 = std::io::_print(_2);
        goto -> bb1;
    }

    bb1: {
        return;
    }
}

At the MIR level, the compiler performs: - Borrow checking: Ensuring memory safety rules - Drop elaboration: Inserting cleanup code - Lifetime analysis: Validating reference lifetimes

Interacting with the Compiler Programmatically¶

You can write Rust programs that call the compiler internally using the rustc_driver crate:

Basic Setup¶

#![feature(rustc_private)]

extern crate rustc_driver;
extern crate rustc_interface;

use rustc_driver::Compilation;
use rustc_interface::interface;

struct MyCallbacks;

impl rustc_driver::Callbacks for MyCallbacks {
    fn after_analysis<'tcx>(
        &mut self,
        compiler: &interface::Compiler,
        queries: &'tcx rustc_interface::Queries<'tcx>,
    ) -> Compilation {
        queries.global_ctxt().unwrap().peek_mut().enter(|tcx| {
            // Access compiler internals here
            println!("Analyzing crate: {}", tcx.crate_name(rustc_span::def_id::LOCAL_CRATE));
        });
        Compilation::Continue
    }
}

fn main() {
    let args: Vec<String> = std::env::args().collect();
    rustc_driver::RunCompiler::new(&args, &mut MyCallbacks).run().unwrap();
}

Callback Points¶

The compiler provides three main callback points:

after_crate_root_parsing: After creating the initial AST
after_expansion: After macro expansion and name resolution
after_analysis: After all type checking and borrow checking

Accessing Compiler Internals¶

With a typing context (TyCtxt), you can call many internal compiler functions:

// Get all function bodies in the crate
let body_owners = tcx.hir().body_owners().collect::<Vec<_>>();

// Pretty print MIR
for def_id in body_owners {
    rustc_middle::mir::pretty::write_mir_pretty(tcx, Some(def_id), &mut std::io::stdout()).unwrap();
}

Code Generation Backends¶

The final step transforms MIR into target code. Rust supports multiple backends:

LLVM (default): Mature, highly optimizing
GCC: Alternative with different optimization characteristics
Cranelift: Faster compilation, less optimization

Custom Code Generation¶

You can implement custom code generation by implementing the CodegenBackend trait:

impl CodegenBackend for MyBackend {
    fn codegen_crate(&self, tcx: TyCtxt, metadata: EncodedMetadata, need_metadata_module: bool) -> Box<dyn Any> {
        // Custom code generation logic
    }

    // ... other required methods
}

This enables targeting custom architectures or generating specialized output formats.

Debugging and Optimization¶

Compiler Flags for Exploration¶

Useful flags for examining compiler internals:

# List all unstable options
rustc -Z help

# Common debugging flags
rustc -Z unpretty=ast          # Show AST
rustc -Z unpretty=hir          # Show HIR  
rustc -Z unpretty=mir          # Show MIR
rustc --emit=llvm-ir           # Show LLVM IR
rustc --emit=asm               # Show assembly

Development Workflow¶

When working on the compiler:

Use the compiler profile: ./x.py setup → option B
Make changes to compiler source
Rebuild incrementally: ./x.py build --keep-stage 1 library
Test changes with the stage 1 compiler

This workflow reduces rebuild times from minutes to seconds.

Practical Applications¶

Understanding compiler internals enables:

Static analysis tools: Analyze code without executing it
Custom lints: Detect project-specific issues
Code generation: Generate code from high-level descriptions
Verification tools: Prove program properties
Educational tools: Understand how language features work

Conclusion¶

The Rust compiler is a sophisticated piece of software that transforms high-level Rust code through multiple intermediate representations, each optimized for different types of analysis. By understanding these internals and learning to interact with them programmatically, you can build powerful tools that leverage the compiler's deep understanding of Rust programs.

The compiler's architecture demonstrates the careful balance between providing high-level language features and maintaining the performance and safety guarantees that make Rust unique in the systems programming landscape.