
From Source Code to Registers: Inside Keel's Bytecode VM

2026-02-04 · Keel Team · 15 min read
internals · compiler · vm · performance

There is a moment in every language project when you realize "parsing is working, the type checker is catching real bugs, but actually running the code... that part still needs to happen." For us, that moment arrived after we had a nice AST, a growing type system, and no execution engine to speak of. Just a tree-walking interpreter that got the job done the way a bicycle gets you across a continent: technically possible, not recommended.

This post is about what we built to replace it, what we learned along the way, and why a register-based VM turned out to be a surprisingly good fit for a functional language that is all about piping data through transformations.

The Interpreter We Outgrew

The first version of Keel's execution engine was a recursive AST walker. Parse the source into a tree, visit each node, compute the value. If the node is a let binding, evaluate the right side and store it. If it is a function call, evaluate the function, evaluate the arguments, apply. Straightforward, easy to reason about, and not fast.

The trouble was that Keel programs tend to look like this:

import List

[1, 2, 3, 4, 5]
    |> List.filter (|x| x > 2)
    |> List.map (|x| x * 10)
    |> List.sum

Every pipe stage means traversing tree nodes, following pointers across the heap, matching on AST variants. Do that a few hundred times in a tight loop and you are mostly paying the overhead of interpreting rather than the cost of the actual computation. The CPU cache is not your friend when every operation requires chasing pointers to scattered tree nodes.

So we needed bytecode. Flatten the tree into a linear sequence of instructions, then run those instructions in a tight loop. The question was what kind of bytecode.

Stack Machines vs. Register Machines

There are two basic designs, and people have strong opinions about both. We'll try to keep this balanced, which means we'll probably annoy everyone equally.

Stack-based VMs use an implicit operand stack. To add two numbers, you push them onto the stack, then execute ADD, which pops both and pushes the result. The JVM, CPython, and WebAssembly work this way.

PUSH 3        -- stack: [3]
PUSH 5        -- stack: [3, 5]
ADD           -- pops both, pushes 8. stack: [8]

Simple instructions, compact bytecode, easy to compile to. There is a good reason most VMs start here.

Register-based VMs use named slots. Each instruction says which registers to read from and where to put the result. Lua's VM (since 5.0) and Android's Dalvik use this approach.

LOAD R0, 3        -- R0 = 3
LOAD R1, 5        -- R1 = 5
ADD  R2, R0, R1   -- R2 = 8

Wider instructions, but fewer of them. That ADD is one dispatch cycle, not three.

Why We Chose Registers

For a language built around chaining transformations -- pipes threading results through functions, composition operators connecting stages, lambdas everywhere -- the number of instruction dispatches adds up quickly. Each dispatch has overhead: fetch the instruction, decode it, jump to the handler, touch memory. In a stack VM, a simple x + y where both values are already computed costs three instructions (push, push, add). In a register VM, it costs one: AddRegRegReg(R2, R0, R1).

When Lua switched from stack-based to register-based in version 5.0, the team reported significant speedups despite the wider instruction format. Fewer total instructions meant fewer dispatch cycles, and that was enough to compensate. That research gave us confidence. But honestly, the thing that sealed it was debugging. When something goes wrong at 11 PM and you are staring at bytecode dumps, knowing exactly which register holds which value is so much better than trying to mentally simulate a stack.

Was it obviously the right choice? No. Stack VMs are simpler to compile for, and the bytecode is more compact. We made a bet, measured the results, liked what we saw, and committed. That is about as rigorous as language design gets sometimes.

What a Register Actually Is

Here is the entire definition of a register in Keel:

pub struct Register(pub usize);

A wrapper around an index. The compiler allocates registers sequentially as it walks the AST -- need a temporary for an intermediate result? Bump the counter, take the next register. When compilation finishes, the compiler records the total count in bytecode.register_count, and the VM pre-allocates exactly that many slots at startup. No resizing, no guessing, no waste.
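The bump-the-counter scheme is simple enough to sketch in a few lines of Rust. This is a hypothetical illustration, not Keel's actual compiler code -- only the `Register(pub usize)` definition above is from the post; `RegisterAllocator` and its methods are invented names:

```rust
pub struct Register(pub usize);

// Hypothetical sketch of bump-counter register allocation.
struct RegisterAllocator {
    next: usize,
}

impl RegisterAllocator {
    fn new() -> Self {
        RegisterAllocator { next: 0 }
    }

    // Need a temporary? Hand out the next register and bump the counter.
    fn alloc(&mut self) -> Register {
        let r = Register(self.next);
        self.next += 1;
        r
    }

    // When compilation finishes, this count is what the VM uses to
    // pre-allocate its register slots -- no resizing, no guessing.
    fn register_count(&self) -> usize {
        self.next
    }
}

fn main() {
    let mut alloc = RegisterAllocator::new();
    let r0 = alloc.alloc();
    let r1 = alloc.alloc();
    println!("R{} R{} count={}", r0.0, r1.0, alloc.register_count());
}
```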

Each register holds a RegisterValue, which is a tagged union of everything Keel can represent at runtime:

Variant          | Keel Type        | What it stores
Int(i64)         | Int              | 64-bit signed integer
Float(f64)       | Float            | 64-bit float
Bool(bool)       | Bool             | Boolean
String(String)   | String           | String value
Char(char)       | Char             | Unicode character
Unit             | ()               | Void/empty
Tuple(usize)     | (a, b)           | Heap index
List(usize)      | [a]              | Heap index
Record(usize)    | { field = val }  | Heap index
Closure(...)     | functions        | Code + environment
Enum(...)        | custom types     | Variant tag + optional data

Scalars live directly in registers. Compound types -- tuples, lists, records -- live on the heap, and the register holds an index pointing to them. Access is O(1) either way: just index into an array. We tried fancier designs early on and kept coming back to this. Sometimes the boring answer really is the right one.
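In Rust terms, the table above corresponds to a tagged union along these lines. This is an illustrative sketch that mirrors the table, not Keel's exact definition:

```rust
// Sketch of the RegisterValue tagged union described in the table.
// Heap-backed variants (Tuple, List, Record) store only an index.
#[derive(Debug, Clone, PartialEq)]
enum RegisterValue {
    Int(i64),
    Float(f64),
    Bool(bool),
    String(String),
    Char(char),
    Unit,
    Tuple(usize),  // heap index
    List(usize),   // heap index
    Record(usize), // heap index
}

fn main() {
    // Scalars live directly in the register slot...
    let r0 = RegisterValue::Int(42);
    // ...while a compound value is just an index into the heap.
    let r1 = RegisterValue::List(7);
    println!("{:?} {:?}", r0, r1);
}
```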

Keel's Instruction Set

The compiler emits 47 instruction types, organized into five categories. Here are some representative examples:

Category        | Examples                                         | Purpose
Memory          | MovRegVal, LoadVarReg, InternString              | Moving data between registers, variables, constants
Arithmetic      | AddRegRegReg, MulRegRegReg, Negate, Not          | Math, string concat, logical negation
Comparison      | EqRegRegReg, LtRegRegReg, GeRegRegReg            | Relational operations producing booleans
Data Structures | AllocRecord, MakeList, TupleGet, Cons            | Creating and accessing compound types
Control Flow    | Call, Closure, Jump, JumpIfFalse, FunctionReturn | Function calls, branching, returns

Every arithmetic and comparison instruction uses three-address form: dest = src1 op src2. The instruction names are admittedly verbose -- AddRegRegReg is not winning any beauty contests -- but when you are reading a bytecode dump, clarity beats brevity.

Let's trace a real example. This Keel code:

let x = 3
let y = 5
x + y

Compiles to something like:

MovRegVal(R0, Int(3))       -- R0 = 3
MovRegVar("x", R0)          -- store R0 as variable "x"
MovRegVal(R1, Int(5))       -- R1 = 5
MovRegVar("y", R1)          -- store R1 as variable "y"
LoadVarReg(R2, "x")         -- R2 = load "x"
LoadVarReg(R3, "y")         -- R3 = load "y"
AddRegRegReg(R4, R2, R3)    -- R4 = R2 + R3 = 8

Every instruction is explicit about where data comes from and where it goes. No hidden state, no implicit stack to simulate in your head. As we discuss in Why Keel Is Written in Rust, these instruction categories are Rust enums, so adding a new variant means the compiler tells us every match statement that needs updating. It is like having a coworker who reads every line of your code and never forgets to ask about the edge cases.

The Execution Loop

The VM runs a classic fetch-decode-execute cycle. Fetch the instruction at the program counter, figure out its category, dispatch it to the right handler, advance the counter. Repeat until the program counter falls off the end of the instruction stream.

1. FETCH:    Read instruction at bytecode[pc]
2. DECODE:   Determine instruction category
3. DISPATCH: Route to handler (arithmetic, comparison, memory, ...)
4. EXECUTE:  Handler modifies VM state
5. ADVANCE:  Increment pc (unless jump/call changed it)
6. REPEAT:   Until pc >= bytecode.len()

The dispatch uses a Rust match on the instruction enum, which compiles down to a jump table. Each handler gets a reference to the instruction and does its work. The handle_arithmetic handler knows how to add integers, how to promote an Int to Float when you write 1 + 2.5, how to concatenate strings with ++. The handle_data_structures handler allocates records on the heap, retrieves tuple elements by index, prepends elements to lists with ::. Clean separation, and each handler stays small enough to reason about.
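The shape of that loop can be sketched with a toy three-instruction set. Keel's real enum has 47 variants and far more VM state; `Instr` and `run` here are illustrative names only:

```rust
// Minimal fetch-decode-execute sketch over a toy instruction set.
enum Instr {
    MovRegVal(usize, i64),             // regs[dest] = value
    AddRegRegReg(usize, usize, usize), // regs[dest] = regs[a] + regs[b]
    Jump(usize),                       // pc = target
}

fn run(bytecode: &[Instr], register_count: usize) -> Vec<i64> {
    let mut regs = vec![0i64; register_count]; // pre-allocated slots
    let mut pc = 0;
    while pc < bytecode.len() {
        // FETCH the instruction at pc; the match is DECODE + DISPATCH
        // and compiles down to a jump table.
        match &bytecode[pc] {
            Instr::MovRegVal(d, v) => {
                regs[*d] = *v;
                pc += 1;
            }
            Instr::AddRegRegReg(d, a, b) => {
                regs[*d] = regs[*a] + regs[*b];
                pc += 1;
            }
            Instr::Jump(target) => pc = *target, // jumps set pc directly
        }
    }
    regs
}

fn main() {
    let prog = vec![
        Instr::MovRegVal(0, 3),
        Instr::MovRegVal(1, 5),
        Instr::AddRegRegReg(2, 0, 1),
    ];
    println!("{:?}", run(&prog, 3)); // R2 ends up holding 8
}
```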

The Heap: Where Compound Data Lives

A record with ten fields does not fit in a register slot. Neither does a list with a thousand elements. Keel stores compound data on a heap, and registers hold indices into it.

When you write:

let person = { name = "Alice", age = 30 }
person.name

The compiler emits instructions to:

  1. Allocate a record on the heap (AllocRecord)
  2. Intern the field names as string indices (InternString)
  3. Store each field value (StoreNamedField)
  4. Put the heap index in a register

Later, person.name becomes a LoadNamedField instruction that looks up the field by its interned string index. The record data lives on the heap; the register is just a pointer. This means passing a record to a function copies a single integer (the heap index), not the entire record structure.

The heap uses a mark-sweep garbage collector. When the heap grows past a threshold (1024 slots) and the free list is empty, the VM traces from roots -- registers, variables, the call stack -- to mark everything reachable, then sweeps unmarked objects into a free list for reuse. Not sophisticated (a generational collector would handle long-lived objects better), but correct and predictable. For a language focused on data transformations rather than long-running servers, it does the job. We will revisit this when the use cases demand it.

One detail we are quietly proud of: the GC correctly traces nested structures. A list containing tuples containing records? All preserved, as long as the outermost container is reachable. It also follows references inside closure environments, so captured variables do not get collected out from under a function that is still waiting to be called. Getting that right was one of those "write a bunch of tests and keep fixing things until they all pass" afternoons.
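The nested-tracing behavior is the interesting part of the mark phase, and it fits in a short sketch. `HeapObject`, `Value`, and `mark` below are illustrative stand-ins, not Keel's actual types -- the point is that tracing follows heap indices *inside* objects, so interior structures stay alive through the outer reference:

```rust
use std::collections::HashSet;

// Toy heap objects: compound values whose elements may reference
// other heap slots by index.
enum HeapObject {
    List(Vec<Value>),
    Tuple(Vec<Value>),
}

#[derive(Clone, Copy)]
enum Value {
    Int(i64),
    Ref(usize), // heap index
}

// Mark phase: start from the roots, follow every reference, and
// record each reachable heap slot.
fn mark(heap: &[HeapObject], roots: &[Value]) -> HashSet<usize> {
    let mut marked = HashSet::new();
    let mut stack: Vec<usize> = roots
        .iter()
        .filter_map(|v| match v {
            Value::Ref(i) => Some(*i),
            _ => None,
        })
        .collect();
    while let Some(idx) = stack.pop() {
        if !marked.insert(idx) {
            continue; // already traced this slot
        }
        // Trace references inside the object: this is what keeps a
        // list-of-tuples alive through the outer container alone.
        let children = match &heap[idx] {
            HeapObject::List(items) | HeapObject::Tuple(items) => items,
        };
        for child in children {
            if let Value::Ref(i) = child {
                stack.push(*i);
            }
        }
    }
    marked
}

fn main() {
    // heap[0] is a root-reachable list referencing heap[1];
    // heap[2] is unreachable garbage the sweep would reclaim.
    let heap = vec![
        HeapObject::List(vec![Value::Ref(1)]),
        HeapObject::Tuple(vec![Value::Int(7)]),
        HeapObject::List(vec![]),
    ];
    let marked = mark(&heap, &[Value::Ref(0)]);
    println!("1 reachable: {}, 2 reachable: {}", marked.contains(&1), marked.contains(&2));
}
```

The sweep phase would then walk the heap and move every index not in `marked` onto the free list.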

String Interning: The Quiet Optimization

Every string in Keel -- variable names, field names, string literals -- goes through the string interner. The interner maps each unique string to an integer index. If you use the field name "name" in fifty different records, there is exactly one copy of that string in memory.

This turned out to be one of those changes where the implementation was straightforward but the payoff was surprisingly large:

  1. Memory: No duplicate strings. In a language where you process collections of records that all have the same field names, this matters more than you might think.
  2. Speed: Comparing two field names becomes comparing two integers, not walking character arrays. Record field lookup goes from string comparison to integer comparison.

The interner is shared between the compiler and the VM, so indices assigned at compile time are valid at runtime. The same interner that the parser uses for symbol names carries through all the way to execution -- one consistent string table from source code to final result.
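The core of an interner like this is a map from string to index plus a vector for the reverse direction. A minimal sketch, with invented names (`Interner`, `intern`, `resolve`) rather than Keel's actual API:

```rust
use std::collections::HashMap;

// Minimal string interner: each unique string gets exactly one
// index, and repeated interning returns the same index.
struct Interner {
    map: HashMap<String, usize>,
    strings: Vec<String>,
}

impl Interner {
    fn new() -> Self {
        Interner { map: HashMap::new(), strings: Vec::new() }
    }

    fn intern(&mut self, s: &str) -> usize {
        if let Some(&idx) = self.map.get(s) {
            return idx; // already interned: one copy in memory
        }
        let idx = self.strings.len();
        self.strings.push(s.to_string());
        self.map.insert(s.to_string(), idx);
        idx
    }

    fn resolve(&self, idx: usize) -> &str {
        &self.strings[idx]
    }
}

fn main() {
    let mut interner = Interner::new();
    let a = interner.intern("name");
    let b = interner.intern("name"); // same index, no new allocation
    // Field-name comparison is now an integer comparison:
    println!("{} {}", a == b, interner.resolve(a));
}
```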

Closures: Where Things Get Interesting

When you define a function:

fn add : Int -> Int -> Int
fn add x y = x + y
Try it

The compiler does not emit a "define function" instruction. It emits the function body as a range of instructions, then creates a Closure instruction wrapping that range:

0: LoadVarReg(R0, "arg0")     -- load first argument
1: LoadVarReg(R1, "arg1")     -- load second argument
2: AddRegRegReg(R2, R0, R1)   -- R2 = arg0 + arg1
3: FunctionReturn(R2)         -- return R2
4: Closure(R3, 0, 4, 2, 0)   -- closure: body=0..4, arity=2, provided=0

A closure carries its code range, its expected arity, and an environment of captured variables. This is how Keel implements automatic currying. Every function is a chain of single-argument functions. fn add : Int -> Int -> Int is really fn add : Int -> (Int -> Int). Apply one argument, and you get back a closure waiting for the next one:

let addFive = add 5   -- closure: arity=2, provided=1, env={arg0: 5}
addFive 3             -- fills arg1=3, fully applied, executes body
-- 8

The VM handles this transparently. When a closure is called, the VM saves the current state (program counter, variables, registers) onto the call stack, swaps in the closure's bytecode, runs it, and restores everything on return. It is a context switch, much like an operating system swapping between processes -- except entirely in userspace, and with considerably less screaming.
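The arity-versus-provided bookkeeping behind currying can be sketched in a few lines. This is an illustrative model, not Keel's implementation -- the real closure carries a code range and runs bytecode, where this toy takes a function pointer as the "body":

```rust
// Toy model of curried application: a closure tracks its arity and
// the arguments provided so far.
struct Closure {
    arity: usize,
    env: Vec<i64>, // arguments captured so far
}

enum ApplyResult {
    Partial(Closure), // under-applied: a new closure awaiting more args
    Done(i64),        // fully applied: the body ran
}

fn apply(mut c: Closure, arg: i64, body: fn(&[i64]) -> i64) -> ApplyResult {
    c.env.push(arg);
    if c.env.len() == c.arity {
        ApplyResult::Done(body(&c.env)) // saturated: execute the body
    } else {
        ApplyResult::Partial(c) // still waiting, hand back a closure
    }
}

fn main() {
    let add = |args: &[i64]| args[0] + args[1];
    let add_closure = Closure { arity: 2, env: vec![] };
    // `add 5` -> partial closure with env = [5]
    let add_five = match apply(add_closure, 5, add) {
        ApplyResult::Partial(c) => c,
        _ => unreachable!(),
    };
    // `addFive 3` -> fully applied, body runs
    if let ApplyResult::Done(v) = apply(add_five, 3, add) {
        println!("{}", v); // 8
    }
}
```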

Recursive functions work through a CaptureVar instruction that updates a closure to include itself in its own environment after it has been stored as a variable. This means factorial can refer to factorial in its own body:

module Math exposing (factorial)

    fn factorial : Int -> Int
    fn factorial n =
        if n <= 1 then 1
        else n * factorial (n - 1)

Math.factorial 10
-- 3628800

Getting partial application, currying, and recursion all working correctly on a bytecode VM was one of those milestones that made us unreasonably happy. If you have tried implementing currying on top of a flat instruction stream, you know why.

Native Functions in the Same Loop

The standard library -- List.map, String.toUpper, Math.abs, and the rest -- is implemented in Rust, but it plugs into the same VM machinery as user-defined code. Native functions receive RegisterValue arguments and return RegisterValue results. When List.filter needs to call the user's lambda, it uses call_closure -- the same mechanism the VM uses for any function call. No special cases, no separate calling convention.

This means pipe chains that mix native and user code work without friction:

import List
import String

["alice", "bob", "carol"]
    |> List.map (|name| String.toUpper name)
    |> List.filter (|name| String.length name > 3)
-- ["ALICE", "CAROL"]

The pipe operator itself is not a VM instruction. It is syntactic sugar that the parser desugars into nested function calls. a |> f |> g becomes g(f(a)). It looks trivial, but it makes data transformation code dramatically more readable, and it interacts beautifully with the register VM -- the result of each stage lands in a register, ready for the next stage to pick up.
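The desugaring itself is a small recursive rewrite. Here is a sketch over a toy AST -- `Expr` and `desugar` are invented for illustration, not Keel's parser types:

```rust
// Toy AST: just enough to show the pipe rewrite.
#[derive(Debug, PartialEq)]
enum Expr {
    Var(String),
    Call(Box<Expr>, Box<Expr>), // function, argument
    Pipe(Box<Expr>, Box<Expr>), // lhs |> rhs
}

// a |> f  =>  f(a), applied recursively so chains nest correctly.
fn desugar(e: Expr) -> Expr {
    match e {
        Expr::Pipe(lhs, rhs) => {
            Expr::Call(Box::new(desugar(*rhs)), Box::new(desugar(*lhs)))
        }
        Expr::Call(f, a) => Expr::Call(Box::new(desugar(*f)), Box::new(desugar(*a))),
        other => other,
    }
}

fn main() {
    // a |> f |> g
    let e = Expr::Pipe(
        Box::new(Expr::Pipe(
            Box::new(Expr::Var("a".into())),
            Box::new(Expr::Var("f".into())),
        )),
        Box::new(Expr::Var("g".into())),
    );
    // Desugars to g(f(a)).
    println!("{:?}", desugar(e));
}
```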

Connecting Types to Registers

This is where things connect back to the type system and the type checker. The compiler knows, at compile time, what type each register will hold. When you write 1 + 2, it emits AddRegRegReg knowing both operands are integers. When you write 1 + 2.5, it emits a float-promotion first. The types guide instruction selection, and the registers carry the results. It is a nice feedback loop that we did not fully plan but were very happy to discover.

Pattern matching -- which we wrote about in detail in the pattern matching post -- compiles down to a sequence of tag checks, jumps, and register loads. A case expression on an enum becomes:

GetEnumTag(R1, R0)                -- extract tag from enum in R0
MatchEnumVariant(R2, R0, 0)       -- does it match variant 0?
JumpIfFalse(R2, <next-branch>)    -- if not, try next branch
GetEnumData(R3, R0)               -- extract data payload
...                               -- branch body
Jump(<end>)                       -- skip remaining branches

Each branch is a linear sequence of instructions, connected by jumps. The type checker verifies exhaustiveness at compile time, so the VM does not need a fallback -- if you forget a variant, you find out before any code runs.

Benchmarking the Pipeline

One advantage of clean phase separation is that you can benchmark each phase independently. Keel's benchmark suite measures lexing, parsing, compilation, and execution separately. This means when we optimize something, we can see exactly which phase benefited. Was that improvement in parsing or in the VM? Did a compiler change actually speed up execution, or just rearrange the bytecode? The phase separation makes these questions answerable instead of "it seems faster, maybe."

What We Traded Off

Register-based VMs have wider instructions and a more complex compiler -- you need register allocation, which means tracking which register holds what, and debugging stale register values is a special kind of evening. The bytecode is less compact than a stack machine's, because every instruction names its operands explicitly.

We think the trade is worth it. Fewer instruction dispatches, explicit data flow, and an execution model that maps naturally to the functional style Keel encourages. But we are not going to pretend it was obvious from the start. We tried it, measured it, liked the results, and committed.

Try It Yourself

The entire compilation pipeline runs in a single call:

use keel_core::vm::vm_core::VM;

let result = VM::compile("1 + 2");
// result: [Int(3)]

If you want to see the bytecode, the time machine debugger lets you step through every instruction, inspect registers and heap state, and trace how source code maps to bytecode and back. It is the best way to understand what the VM is actually doing.

The source is on Codeberg. The instruction definitions live in compiler/instructions.rs, and each variant is documented with its arguments, behavior, and a concrete example. We wrote those docs for ourselves as much as for anyone else, because future-us has a terrible memory.