Skip to content

Inside the Rust Compiler: From Source Code to Binary

This comprehensive guide explores the internal workings of the Rust compiler (rustc), from parsing source code to generating binaries. We'll examine the various intermediate representations (IRs) and learn how to interact with the compiler programmatically.

What is Rust?

Rust is defined by the Rust Language team as "a language empowering everyone to build reliable and efficient software." Key characteristics include:

  • Systems-level control: Explicit memory allocation control
  • Efficiency: No garbage collector, compiles to small binaries for embedded systems
  • Memory safety: Prevents buffer overflows, use-after-free, and dangling pointers
  • Thread safety: Data race freedom in concurrent code
  • Compile-time guarantees: All safety checks happen at compile time

Borrowing and Lifetime Semantics

Rust's safety guarantees come from its borrowing rules: - Shared data cannot be mutated: Multiple immutable references are allowed - Mutable data cannot be aliased: Only one mutable reference at a time - Lifetimes ensure validity: References cannot outlive the data they point to

The unsafe Keyword

The unsafe keyword allows you to: - Directly manipulate memory and raw pointers - Call unsafe functions - Access mutable static variables - Implement unsafe traits

When used appropriately (like in the standard library), unsafe enables powerful abstractions while maintaining safety at higher levels.

The Rust Compiler Architecture

The Rust compiler (rustc) transforms source code through multiple intermediate representations before generating a binary:

Rust Source Code
Abstract Syntax Tree (AST)
High-level IR (HIR)
Typed HIR (THIR)
Mid-level IR (MIR)
Code Generation (LLVM/GCC/Cranelift)
Target Binary

Exploring the Rust Compiler Repository

The main Rust compiler repository contains several important directories:

  • compiler/: Core compiler code organized into multiple crates
  • library/: Standard library and core types
  • src/: Bootstrap and build scripts
  • tests/: Comprehensive test suite

Building the Compiler

To build the Rust compiler from source:

# Initial setup
./x.py setup
# Choose option B for compiler development

# Build the compiler
./x.py build

# Incremental builds (much faster)
./x.py build --keep-stage 1

The build process creates multiple compiler stages due to bootstrapping: - Stage 0: Minimal compiler built with existing Rust toolchain - Stage 1: Full compiler built with Stage 0 - Stage 2: Self-hosted compiler (Stage 1 building itself)

Intermediate Representations in Detail

Abstract Syntax Tree (AST)

The AST is created through lexing and parsing:

// Source code
fn main() {
    println!("Hello, world!");
}

View the AST using:

rustc -Z unpretty=ast hello.rs

The AST represents the program structure as a tree, with nodes for functions, expressions, and statements.

High-level IR (HIR)

HIR is created by lowering and desugaring the AST: - Multiple loop constructs (for, while, loop) are normalized to loop - Control flow is simplified to match statements - The prelude is made explicit

# View HIR
rustc -Z unpretty=hir hello.rs

# View HIR tree structure
rustc -Z unpretty=hir-tree hello.rs

At the HIR level, the compiler performs: - Type inference: Determining types from context - Trait solving: Ensuring generic constraints can be satisfied - Type checking: Verifying type compatibility

Example of trait bounds:

fn log_elements<T: std::fmt::Display>(vec: Vec<T>) {
    for (i, element) in vec.iter().enumerate() {
        println!("{}: {}", i, element);
    }
}

Typed HIR (THIR)

THIR contains fully elaborated types and is used for: - Unsafe code analysis - Additional type-related checks

rustc -Z unpretty=thir-tree hello.rs

Mid-level IR (MIR)

MIR transforms the tree structure into a Control Flow Graph (CFG):

# View MIR (recommended flags)
rustc -Z unpretty=mir -Z mir-opt-level=0 hello.rs

MIR represents programs as: - Places in memory instead of variables - Basic blocks containing statements - Terminators that branch between blocks

Example MIR structure:

fn main() -> () {
    let mut _0: ();
    let _1: ();

    bb0: {
        _1 = const "Hello, world!";
        _2 = &_1;
        _3 = std::io::_print(_2);
        goto -> bb1;
    }

    bb1: {
        return;
    }
}

At the MIR level, the compiler performs: - Borrow checking: Ensuring memory safety rules - Drop elaboration: Inserting cleanup code - Lifetime analysis: Validating reference lifetimes

Interacting with the Compiler Programmatically

You can write Rust programs that call the compiler internally using the rustc_driver crate:

Basic Setup

#![feature(rustc_private)]

extern crate rustc_driver;
extern crate rustc_interface;

use rustc_driver::Compilation;
use rustc_interface::interface;

struct MyCallbacks;

impl rustc_driver::Callbacks for MyCallbacks {
    fn after_analysis<'tcx>(
        &mut self,
        compiler: &interface::Compiler,
        queries: &'tcx rustc_interface::Queries<'tcx>,
    ) -> Compilation {
        queries.global_ctxt().unwrap().peek_mut().enter(|tcx| {
            // Access compiler internals here
            println!("Analyzing crate: {}", tcx.crate_name(rustc_span::def_id::LOCAL_CRATE));
        });
        Compilation::Continue
    }
}

fn main() {
    let args: Vec<String> = std::env::args().collect();
    rustc_driver::RunCompiler::new(&args, &mut MyCallbacks).run().unwrap();
}

Callback Points

The compiler provides three main callback points:

  1. after_crate_root_parsing: After creating the initial AST
  2. after_expansion: After macro expansion and name resolution
  3. after_analysis: After all type checking and borrow checking

Accessing Compiler Internals

With a typing context (TyCtxt), you can call many internal compiler functions:

// Get all function bodies in the crate
let body_owners = tcx.hir().body_owners().collect::<Vec<_>>();

// Pretty print MIR
for def_id in body_owners {
    rustc_middle::mir::pretty::write_mir_pretty(tcx, Some(def_id), &mut std::io::stdout()).unwrap();
}

Code Generation Backends

The final step transforms MIR into target code. Rust supports multiple backends:

  • LLVM (default): Mature, highly optimizing
  • GCC: Alternative with different optimization characteristics
  • Cranelift: Faster compilation, less optimization

Custom Code Generation

You can implement custom code generation by implementing the CodegenBackend trait:

impl CodegenBackend for MyBackend {
    fn codegen_crate(&self, tcx: TyCtxt, metadata: EncodedMetadata, need_metadata_module: bool) -> Box<dyn Any> {
        // Custom code generation logic
    }

    // ... other required methods
}

This enables targeting custom architectures or generating specialized output formats.

Debugging and Optimization

Compiler Flags for Exploration

Useful flags for examining compiler internals:

# List all unstable options
rustc -Z help

# Common debugging flags
rustc -Z unpretty=ast          # Show AST
rustc -Z unpretty=hir          # Show HIR  
rustc -Z unpretty=mir          # Show MIR
rustc --emit=llvm-ir           # Show LLVM IR
rustc --emit=asm               # Show assembly

Development Workflow

When working on the compiler:

  1. Use the compiler profile: ./x.py setup → option B
  2. Make changes to compiler source
  3. Rebuild incrementally: ./x.py build --keep-stage 1 library
  4. Test changes with the stage 1 compiler

This workflow reduces rebuild times from minutes to seconds.

Practical Applications

Understanding compiler internals enables:

  • Static analysis tools: Analyze code without executing it
  • Custom lints: Detect project-specific issues
  • Code generation: Generate code from high-level descriptions
  • Verification tools: Prove program properties
  • Educational tools: Understand how language features work

Conclusion

The Rust compiler is a sophisticated piece of software that transforms high-level Rust code through multiple intermediate representations, each optimized for different types of analysis. By understanding these internals and learning to interact with them programmatically, you can build powerful tools that leverage the compiler's deep understanding of Rust programs.

The compiler's architecture demonstrates the careful balance between providing high-level language features and maintaining the performance and safety guarantees that make Rust unique in the systems programming landscape.