Cadey is coffee
<Cadey> Hello! Thank you for visiting my website. You seem to be using an ad-blocker. I understand why you do this, but I'd really appreciate if it you would turn it off for my website. These ads help pay for running the website and are done by Ethical Ads. I do not receive detailed analytics on the ads and from what I understand neither does Ethical Ads. If you don't want to disable your ad blocker, please consider donating on Patreon or sending some extra cash to xeiaso.eth or 0xeA223Ca8968Ca59e0Bc79Ba331c2F6f636A3fB82. It helps fund the website's hosting bills and pay for the expensive technical editor that I use for my longer articles. Thanks and be well!

Minicompiler: Lexing

Read time in minutes: 12

I've always wanted to make my own compiler. Compilers are an integral part of my day to day job and I use the fruits of them constantly. A while ago while I was browsing through the TempleOS source code I found MiniCompiler.HC in the ::/Demos/Lectures folder and I was a bit blown away. It implements a two phase compiler from simple math expressions to AMD64 bytecode (complete with bit-banging it to an array that the code later jumps to) and has a lot to teach about how compilers work. For those of you that don't have a TempleOS VM handy, here is a video of MiniCompiler.HC in action:

You put in a math expression, the compiler builds it and then spits out a bunch of assembly and runs it to return the result. In this series we are going to be creating an implementation of this compiler that targets WebAssembly. This compiler will be written in Rust and will use only the standard library for everything but the final bytecode compilation and execution phase. There is a lot going on here, so I expect this to be at least a three part series. The source code will be in Xe/minicompiler in case you want to read it in detail. Follow along and let's learn some Rust on the way!

Mara is hacker
<Mara> Compilers for languages like C are built on top of the fundamentals here, but they are much more complicated.

Description of the Language

This language uses normal infix math expressions on whole numbers. Here are a few examples:

  • 2 + 2
  • 420 * 69
  • (34 + 23) / 38 - 42
  • (((34 + 21) / 5) - 12) * 348

Ideally we should be able to nest the parentheses as deep as we want without any issues.

Looking at these values we can notice a few patterns that will make parsing this a lot easier:

  • There seems to be only 4 major parts to this language:
    • numbers
    • math operators
    • open parentheses
    • close parentheses
  • All of the math operators act identically and take two arguments
  • Each program is one line long and ends at the end of the line

Let's turn this description into Rust code:

Bringing in Rust

Make a new project called minicompiler with a command that looks something like this:

$ cargo new minicompiler

This will create a folder called minicompiler and a file called src/main.rs. Open that file in your editor and copy the following into it:

// src/main.rs

/// Mathematical operations that our compiler can do.
#[derive(Debug, Eq, PartialEq)]
enum Op {
    Mul,
    Div,
    Add,
    Sub,
}

/// All of the possible tokens for the compiler, this limits the compiler
/// to simple math expressions.
#[derive(Debug, Eq, PartialEq)]
enum Token {
    EOF,
    Number(i32),
    Operation(Op),
    LeftParen,
    RightParen,
}

Mara is hacker
<Mara> In compilers, "tokens" refer to the individual parts of the language you are working with. In this case every token represents every possible part of a program.

And then let's start a function that can turn a program string into a bunch of tokens:

// src/main.rs

fn lex(input: &str) -> Vec<Token> {
    todo!("implement this");
}

Mara is hmm
<Mara> Wait, what do you do about bad input such as things that are not math expressions? Shouldn't this function be able to fail?

You're right! Let's make a little error type that represents bad input. For creativity's sake let's call it BadInput:

// src/main.rs

use std::error::Error;
use std::fmt;

/// The error that gets returned on bad input. This only tells the user that it's
/// wrong because debug information is out of scope here. Sorry.
#[derive(Debug, Eq, PartialEq)]
struct BadInput;

// Errors need to be displayable.
impl fmt::Display for BadInput {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "something in your input is bad, good luck")
    }
}

// The default Error implementation will do here.
impl Error for BadInput {}

And then let's adjust the type of lex() to compensate for this:

// src/main.rs

fn lex(input: &str) -> Result<Vec<Token>, BadInput> {
    todo!("implement this");
}

So now that we have the function type we want, let's start implementing lex() by setting up the result and a loop over the characters in the input string:

// src/main.rs

fn lex(input: &str) -> Result<Vec<Token>, BadInput> {
    let mut result: Vec<Token> = Vec::new();
    
    for character in input.chars() {
        todo!("implement this");
    }

    Ok(result)
}

Looking at the examples from earlier we can start writing some boilerplate to turn characters into tokens:

// src/main.rs

// ...

for character in input.chars() {
    match character {
        // Skip whitespace
        ' ' => continue,

        // Ending characters
        ';' | '\n' => {
            result.push(Token::EOF);
            break;
        }

        // Math operations
        '*' => result.push(Token::Operation(Op::Mul)),
        '/' => result.push(Token::Operation(Op::Div)),
        '+' => result.push(Token::Operation(Op::Add)),
        '-' => result.push(Token::Operation(Op::Sub)),

        // Parentheses
        '(' => result.push(Token::LeftParen),
        ')' => result.push(Token::RightParen),

        // Numbers
        '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' => {
            todo!("implement number parsing")
        }

        // Everything else is bad input
        _ => return Err(BadInput),
    }
}

// ...

Mara is hmm
<Mara> Ugh, you're writing Token:: and Op:: a lot. Is there a way to simplify that?

Yes! enum variants can be shortened to their names with a use statement like this:

// src/main.rs

// ...

use Op::*;
use Token::*;

match character {
    // ...

    // Math operations
    '*' => result.push(Operation(Mul)),
    '/' => result.push(Operation(Div)),
    '+' => result.push(Operation(Add)),
    '-' => result.push(Operation(Sub)),

    // Parentheses
    '(' => result.push(LeftParen),
    ')' => result.push(RightParen),

    // ...
}
    
// ...

Which looks a lot better.

Mara is hacker
<Mara> You can use the use statement just about anywhere in your program. However to keep things flowing nicer, the use statement is right next to where it is needed in these examples.

Now we can get into the fun that is parsing numbers. When he wrote MiniCompiler, Terry Davis used an approach that is something like this (spacing added for readability):

case '0'...'9':
  i = 0;
  do {
    i = i * 10 + *src - '0';
    src++;
  } while ('0' <= *src <= '9');
  *num=i;

This sets an intermediate variable i to 0 and then consumes characters from the input string as long as they are between '0' and '9'. As a neat side effect of the numbers being input in base 10, you can conceptualize 40 as (4 * 10) + 2. So it multiplies the old digit by 10 and then adds the new digit to the resulting number. Our setup doesn't let us get that fancy as easily, however we can emulate it with a bit of stack manipulation according to these rules:

  • If result is empty, push this number to result and continue lexing the program
  • Pop the last item in result and save it as last
  • If last is a number, multiply that number by 10 and add the current number to it
  • Otherwise push the node back into result and push the current number to result as well

Translating these rules to Rust, we get this:

// src/main.rs

// ...

// Numbers
'0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' => {
    let num: i32 = (character as u8 - '0' as u8) as i32;
    if result.len() == 0 {
        result.push(Number(num));
        continue;
    }

    let last = result.pop().unwrap();

    match last {
        Number(i) => {
            result.push(Number((i * 10) + num));
        }
        _ => {
            result.push(last);
            result.push(Number(num));
        }
    }
}
            
// ...

Mara is hacker
<Mara> This is not the most robust number parsing code in the world, however it will suffice for now. Extra credit if you can identify the edge cases!

This should cover the tokens for the language. Let's write some tests to be sure everything is working the way we think it is!

Testing

Rust has a robust testing framework built into the standard library. We can use it here to make sure we are generating tokens correctly. Let's add the following to the bottom of main.rs:

#[cfg(test)] // tells the compiler to only build this code when tests are being run
mod tests {
    use super::{Op::*, Token::*, *};

    // registers the following function as a test function
    #[test]
    fn basic_lexing() {
        assert!(lex("420 + 69").is_ok());
        assert!(lex("tacos are tasty").is_err());

        assert_eq!(
            lex("420 + 69"),
            Ok(vec![Number(420), Operation(Add), Number(69)])
        );
        assert_eq!(
            lex("(30 + 560) / 4"),
            Ok(vec![
                LeftParen,
                Number(30),
                Operation(Add),
                Number(560),
                RightParen,
                Operation(Div),
                Number(4)
            ])
        );
    }
}

This test can and probably should be expanded on, but when we run cargo test:

$ cargo test
   Compiling minicompiler v0.1.0 (/home/cadey/code/Xe/minicompiler)

    Finished test [unoptimized + debuginfo] target(s) in 0.22s
     Running target/debug/deps/minicompiler-03cad314858b0419

running 1 test
test tests::basic_lexing ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

And hey presto! We verified that all of the parsing is working correctly. Those test cases should be sufficient to cover all of the functionality of the language.


This is it for part 1. We covered a lot today. Next time we are going to run a validation pass on the program, convert the infix expressions to reverse polish notation and then also get started on compiling that to WebAssembly. This has been fun so far and I hope you were able to learn from it.

Special thanks to the following people for reviewing this post:

  • Steven Weeks
  • sirpros
  • Leonora Tindall
  • Chetan Conikee
  • Pablo
  • boopstrap
  • ash2x3

This article was posted on M10 29 2020. Facts and circumstances may have changed since publication. Please contact me before jumping to conclusions if something seems wrong or unclear.

Series: rust

Tags: rust templeos compiler

This post was not WebMentioned yet. You could be the first!

The art for Mara was drawn by Selicre.

The art for Cadey was drawn by ArtZora Studios.

Some of the art for Aoi was drawn by @Sandra_Thomas01.