From Characters to Meaning: How Keel Turns Text Into Code

2026-01-16 · Keel Team · 15 min read
Tags: internals, parser, lexer, compiler

Here is a thing I underestimated: the distance between a string of characters and something a computer can act on is vast. When you type let x = 1 + 2 and Keel hands back 3, two entire translation passes have already happened before the VM runs a single instruction. The first turns characters into tokens. The second turns tokens into a tree. This post is about both of those passes -- the parts that worked on the first try, the parts that took weeks, and the part that nearly made me give up on significant whitespace entirely.

Why This Is Two Steps

You might wonder why we don't go straight from text to tree. The answer is the same reason you don't eat an entire apple in one bite: it's technically possible, but you'll regret it.

Separating lexing and parsing lets each phase focus on one job. The lexer worries about characters: which ones form numbers, which form keywords, where strings begin and end. The parser never thinks about whitespace or character sequences -- it sees structured tokens like Let, Int(42), and PipeRight, and arranges them into a syntax tree. This separation also means each phase can be tested, benchmarked, and debugged independently, which matters more than you'd think when something goes wrong at midnight.

Phase 1: The Lexer

The lexer scans characters from left to right and groups them into tokens. The word let becomes a Let keyword token. 42 becomes Int(42). "hello" becomes String("hello"). Simple enough in theory.

Keel's lexer is built with chumsky, a parser combinator library for Rust. Instead of writing a big state machine with a switch statement inside a while loop (the "serious" approach), we compose small parsers together. A float parser is an integer parser, then a dot, then another integer. A keyword parser tries "exposing" before "else" to avoid partial matches. It's surprisingly pleasant, like snapping Lego bricks together.

Here is what the lexer produces for a record literal:

{ name = "Alice", age = 30 }

Becomes:

LeftBrace, LowerIdent("name"), Assign, String("Alice"), Comma,
LowerIdent("age"), Assign, Int(30), RightBrace

And a lambda expression:

|x| x + 1

Becomes:

Pipe, LowerIdent("x"), Pipe, LowerIdent("x"), Plus, Int(1)

The | character pulls double duty -- it's the lambda delimiter and also the separator in case branches and or-patterns. The parser sorts out which meaning applies based on context. This felt clever when I designed it. It felt less clever when I was debugging the tenth ambiguity it caused.

Significant Whitespace: The Beautiful Headache

Keel uses indentation to determine block structure. No semicolons, no curly braces for blocks. I wanted code that looked clean:

fn factorial : Int -> Int
fn factorial n =
    if n <= 1 then 1
    else n * factorial (n - 1)

The price for that cleanliness is paid by the lexer, which has to understand indentation and emit virtual Indent and Dedent tokens. When a new line starts with more spaces than the current level, the lexer emits Indent. When it starts with fewer spaces, it emits Dedent. The parser never sees raw whitespace -- it sees these virtual tokens the same way it would see { and } in a brace-delimited language.

The lexer tracks indentation with a stack inside an IndentationState struct. When a line starts at a deeper level, the new level gets pushed onto the stack. When a line starts at a shallower level, levels get popped until the stack matches. Here is what the token stream looks like for a simple indentation change:

-- Source:
foo
  bar
    baz

-- Tokens:
LowerIdent("foo"), Newline, Indent(2), LowerIdent("bar"),
Newline, Indent(4), LowerIdent("baz"), Dedent(4), Dedent(2)

That trailing Dedent(4), Dedent(2) is the lexer cleaning up at EOF -- popping the stack back to zero so the parser sees a balanced structure. Getting this right was harder than I expected. The worst case is when indentation decreases at the same time as a new token appears:

module Math exposing (add)

    fn add : Int -> Int -> Int
    fn add x y = x + y

let result = Math.add 1 2

The module body is indented to 4 spaces. Then let appears at column 0. But let isn't whitespace -- the lexer sees a non-whitespace token at a lower indent level and has to emit the Dedent before emitting the Let token. This requires stashing the token, generating the dedents, and then emitting the stashed token afterward. The stored_token field in IndentationState exists for exactly this dance.

I spent more time on this indentation tracking than on any other single feature. Blank lines between indented blocks, comments at different indent levels, multiple dedents at end of file -- each one was its own special adventure. If someone tells you significant whitespace is "easy, the lexer just handles it," they haven't tried it. But I still think the result is worth it. Code that breathes matters.

Three Flavors of Comments

The lexer handles all three comment styles:

-- Line comment
{- Block comment spanning
   multiple lines -}
{-| Doc comment for tooling -}

Doc comments use {-| ... -} syntax -- the pipe character after {- is what distinguishes them from regular block comments. They stay in the token stream so the parser can attach them to declarations, which the LSP uses for hover info. Keeping comments around was a decision I'm glad I made early. Once the lexer throws them away, they're gone forever, and you end up with tooling that can never show documentation.

Graceful Recovery

One thing I appreciate about using chumsky for the lexer: when it hits an unrecognized character, it doesn't crash. It emits an Unknown token and keeps going. The character &, for instance, isn't valid in Keel:

let aaa& = 1

The lexer produces [Let, LowerIdent("aaa"), Unknown("&"), Assign, Int(1)] and reports an InvalidCharacter error with the exact byte offset. The parser can still try to make sense of the rest of the file. This matters a lot for editor tooling -- the LSP needs to provide diagnostics even when the code is half-typed.

Phase 2: The Parser

The parser takes the token stream and builds an abstract syntax tree (AST). This is where the grammar comes alive -- and where things get genuinely interesting.

Like the lexer, the parser is built with chumsky. The grammar is organized as four mutually recursive parsers:

Parser   Handles           Examples
expr     Expressions       1 + 2, if x then y else z, |x| x + 1
stmt     Statements        let x = 42, let { name, age } = person
decl     Declarations      fn add : Int -> Int, module M exposing (..)
block    Indented blocks   Function bodies, case arms, if-then-else branches

These are defined recursively -- an expression can contain a block (like a case expression with indented arms), a block contains AST nodes, an AST node can be a statement containing an expression. Chumsky's Recursive::declare() lets us set up this mutual recursion cleanly:

let mut expr = Recursive::declare();
let mut stmt = Recursive::declare();
let mut decl = Recursive::declare();
let mut block = Recursive::declare();
let mut ast_node = Recursive::declare();

expr.define(parse_ast_expr(expr.clone(), ast_node.clone()));
stmt.define(parse_ast_statement(expr.clone()));
decl.define(parse_ast_declaration(expr.clone(), block.clone()));
block.define(parse_ast_block(ast_node.clone()));

If you've ever tried to do mutual recursion by hand in a recursive descent parser, you know the tangle of forward declarations and function pointers that awaits you. This is one of those places where parser combinators genuinely shine.

The Tree: What Comes Out

Every piece of Keel code maps to an AstNode -- which is either a Declaration, an Expr, a Stmt, or a Comment. The Expr enum is the heart of it, and it's large. Int, Float, String, Boolean, Var, Binary, If, Case, Lambda, FunctionCall, List, Tuple, Record, ModuleAccess -- the list goes on.

A pipe chain like this:

import List

[1, 2, 3]
    |> List.map (|x| x * 2)
    |> List.sum

Gets parsed into a tree where |> is a Binary node with PipeRight as its operator. The left side is another Binary(PipeRight) node, and so on down the chain. Pipes aren't special syntax -- they're just binary operators with a specific precedence level. The VM desugars them into function application during compilation. Keeping them as ordinary operators keeps the parser simple, and I'm quietly satisfied with that decision.

Every AST node carries a source span -- byte offsets into the original source -- so error messages and the LSP can point to the exact character. Every node also carries any attached comments, so a comment like -- the total written above a let stays associated with that binding.

Patterns: The Rich Part

Patterns are where Keel's grammar gets dense. A pattern can be a wildcard (_), a variable (x), a literal (42), a constructor (Just), a tuple ((a, b)), a record ({ name, .. }), a list ([x, y]), a cons (x :: xs), an or-pattern (North | South), or an as-pattern (Just x as original). And they nest:

case data of
    Ok (Just x) -> x * 2
    Ok Nothing  -> 0
    Err msg     -> -1

The Pattern enum has 13 variants to represent all of this. Getting the precedence right between :: (cons) and :: (qualified enum access like Color::Red) took careful ordering in the parser. The rule is contextual: if the left side is a lowercase identifier, it's a cons pattern; if it's uppercase, it's a qualified name. This seems obvious in retrospect. It was not obvious at the time.

We go deep on what you can do with patterns in the pattern matching post. From the parser's perspective, the challenge is representing all these forms faithfully so the type checker can validate them and the compiler can emit correct bytecode.

Operator Precedence

Getting operator precedence right is one of those things that matters enormously and feels unglamorous. 1 + 2 * 3 must parse as 1 + (2 * 3), not (1 + 2) * 3. Power is right-associative: 2 ^ 3 ^ 4 is 2 ^ (3 ^ 4). Logical AND binds tighter than OR: True && False || True is (True && False) || True. The pipe operator |> sits below everything else so a + b |> f pipes a + b into f, not just b.

The test suite has specific cases for each of these:

-- Parsed as 1 + (2 * 3) = 7
1 + 2 * 3

-- Parsed as (1 + 2) < (3 + 4)
1 + 2 < 3 + 4

-- Parsed as (not True) && False
not True && False

Chumsky handles precedence through ordered choice combinators -- higher-precedence operators are tried first in the parser chain. It works, and the parser source is organized by expression type (parse_lambdas.rs, parse_case.rs, parse_records.rs, parse_pipes.rs) so you can jump straight to the part you care about.

Scope Tracking: Catching Mistakes Early

Here is something a little unusual: Keel's parser tracks scope as it goes. The ParserState includes a ScopeState that knows about declared variables, functions, modules, enums, and type aliases at each point in the parse. Scopes form a tree -- function bodies create child scopes, case arms create child scopes, and lookups walk up the parent chain.

Why do this during parsing instead of a later pass? Because it enables some genuinely useful things right at parse time:

Undeclared variable detection with fuzzy matching. If you reference usrName, the parser checks its scope, doesn't find it, and searches for similar names using a combination of Jaro-Winkler similarity and Levenshtein distance:

Error: Variable 'usrName' is not declared.
Did you mean 'userName'?

Common mistake recognition. If you type true instead of True, or null instead of Nothing, the parser recognizes the pattern from other languages and suggests the Keel equivalent:

Error: 'null' is not a Keel keyword
Hint: Use `Nothing` for absent values
Note: Keel uses Maybe types to represent optional values safely.

Same for match (use case), function/def (use fn), var/const (use let), class/struct (use type). If you reach for JavaScript-style record syntax:

{ name: "Alice" }

You get:

Error: JavaScript-style record syntax is not allowed
Hint: Use `=` instead of `:` for record field assignment.

Duplicate detection. Defining the same module twice is caught at parse time, right when the second definition is encountered.

Every one of these errors comes with a hint (what to do) and a note (why). This was a deliberate design choice, and honestly, writing 115 error variants with hints and notes was one of the most tedious parts of the whole project. But an error you can't understand is barely better than no error at all. As we describe in the type system post, those error messages are one of the things users notice first.

Memory: Arenas and Interning

Two implementation details worth mentioning for fellow language nerds.

Arena allocation. The parser stores symbols in a SymbolArena -- a flat Vec<Symbol> where references are 4-byte SymbolId indices. Lists of symbols use an 8-byte SymbolRange (start index plus length) instead of a 24-byte Vec. When you're parsing a file with hundreds of bindings, the difference between thousands of tiny heap allocations and one contiguous array shows up in both speed and memory fragmentation. As we discuss in Why Keel Is Written in Rust, this kind of memory layout control is one of the things that makes Rust a good fit.

String interning. Every identifier and string literal goes through the string interner. The name "userName" gets mapped to an integer index, and from that point on, every scope lookup and comparison uses the integer instead of the string. Comparing two integers is an order of magnitude faster than comparing two strings. The interner is shared between the parser, compiler, and VM, so indices assigned during parsing are valid all the way through execution.

The Full Pipeline

Putting it together, here is what happens when you call VM::compile("let x = 1 + 2"):

  1. Lexer tokenizes: [Let, LowerIdent("x"), Assign, Int(1), Plus, Int(2)]
  2. Parser builds AST: Stmt(Let { pattern: Var("x"), value: Binary(Add, Int(1), Int(2)) })
  3. Compiler emits bytecode: MovRegVal(R0, 1), MovRegVal(R1, 2), AddRegRegReg(R2, R0, R1), ...
  4. VM executes and returns: Int(3)

Each phase feeds cleanly into the next. The token_spans module can reconstruct byte offsets from tokens when the LSP or time machine debugger needs source mapping -- this reconstruction is shared between both tools so source positions are always consistent.

Why Parser Combinators?

I've been asked this a few times. A hand-written recursive descent parser would almost certainly be faster. So why chumsky?

Readability. A chumsky parser reads almost like the grammar itself. Adding a new expression form means adding a new combinator, not threading state through a web of mutually recursive functions. The parser directory has 30+ files, each focused on one syntax form. When something breaks, you know where to look.

Error recovery. Chumsky gives us error recovery with minimal effort. When the parser hits something unexpected, it can skip tokens and continue, collecting multiple errors in a single pass. The LSP depends on this -- it needs to provide completions and diagnostics for code that's actively being written and therefore actively broken.

The trade-off is performance. Combinator parsers carry overhead from closures and trait objects. For Keel's current use cases -- files of a few thousand lines -- this doesn't matter. The benchmark suite measures parsing independently, and it's fast enough. If we ever hit a wall, the token stream is well-defined enough to swap in a hand-written parser without changing anything downstream. Reversible decisions are good decisions.

History, Briefly

The parser has been rewritten more times than I'd like to admit. The first version was hand-rolled recursive descent -- the textbook approach, and honestly not a bad way to learn. But adding new syntax forms meant fighting the parser's own structure, and every error recovery fix broke two others. Moving to chumsky took about two weeks and was one of those rare moments where a tool change makes the whole project feel lighter. Not every rewrite is like that, but this one was.

What's Next

The parsing infrastructure is stable. The test suite has hundreds of cases exercising every parser path -- basic literals, nested patterns, deep indentation, mixed operators, edge cases for every syntax form. The next frontiers are:

  • File-based module loading: Currently, all code runs in a single file. Multi-file compilation will add import resolution between parsing and compilation.
  • Incremental parsing: For the LSP, re-parsing the entire file on every keystroke is wasteful. Partial re-parsing of just the changed region is the goal.

If you're curious about the internals, the source is on Codeberg. And if you want to see parsing in action, the time machine debugger lets you step through the token stream and AST construction token by token -- it's the best way to understand what's actually happening under the hood.