How to Build a WebAssembly Language for Fun and Profit

Written by courier | Published 2022/09/02
Tech Story Tags: wasm | webassembly | binaryen | integration | engineering | programming-languages | good-company | webassembly-language

TLDRThis is the third article in the Build a programming language series. We are going to use a popular WebAssembly compiler toolchain called binaryen. Binaryen does most of the heavy compiling tasks for us. The compile function will take our root AST (a `BlockNode`) and return a new `binaryen module. The final phase of our code generation is code generation. The return type of `generateFunctionMap` is a map where the key is the function name and the value is an object containing all the important information about the function.via the TL;DR App

The final phase of our compiler is code generation. This phase takes the AST and converts it to a set of executable instructions. In our case, WebAssembly. To accomplish this, we are going to use a popular WebAssembly compiler toolchain called binaryen.

Binaryen does most of the heavy compiling tasks for us. Tricky optimizations like dead-code removal, code folding, and more are all handled out of the box. Thanks to binaryen, we can make a powerful language with very little code.

This is the third article in the Build a programming language series. If you’re just starting out, start with the first post here before continuing on.

The Compile function

One of the trickiest tasks in code generation is what to call the code generation function. For this article, I have opted to call the main function compile. The compile function will take our root AST (a BlockNode) and return a new binaryen.Module. The binaryen.Module exposes functions that can optimize and emit wasm in various formats.

Here's the outline:

// src/compiler.mts
import binaryen from "binaryen";
import { AstNode, BlockNode, IdentifierNode } from "./types/index.mjs";

export const compile = (block: BlockNode): binaryen.Module => {
  // Creates a new binaryen module that our helper functions will fill in
  const mod = new binaryen.Module();

  // The function map is used to track all the functions and their types. More on this later
  const functionMap = generateFunctionMap(block);

  // This function registers all the standard library functions we'll include with our language.
  // This includes functions like add, subtract, etc.
  registerStandardFunctions(mod, functionMap);

  // This is where the magic happens. Because `BlockNode` is an expression, this
  // function can recursively compile every instruction in a wispy program file
  compileExpression({
    expression: block,
    mod,
    functionMap,
    parameters: new Map(),
  });

  // Finally, we return the binaryen module
  return mod;
};

Generating A Function Map

Next, we define the generateFunctionMap function. This function crawls the entire expression tree to find and register function definitions. It is important we do this before actually compiling the functions as some functions may call other functions before they've been defined.

The return type of generateFunctionMap is a map where the key is the function name and the value is an object containing all the important information about the function the compiler needs to know about. For now, all we need is returnType.

Here’s the return type definition:

// src/compiler.mts
type FunctionMap = Map<string, { returnType: number }>;

Now that we have defined our return type, we can define the actual function

// src/compiler.mts
const generateFunctionMap = (block: BlockNode): FunctionMap => {
  // Preview the first node (i.e. expression) of the block
  const firstNode = block.expressions[0];

  // If the first node is an identifier and the identifier is "fn", then we know this block represents a function definition.
  if (isNodeType(firstNode, "identifier") && firstNode.identifier === "fn") {
    // Grab the function identifier / name, and it's return type. This is the second expression of
    // the function definition, a typed identifier.
    const { identifier, returnType } = getFunctionIdentifier(block); // We'll define this next

    // Return the function map
    return new Map([
      [identifier, { returnType }],

      // It's possible this function may contain function definitions inside of it. So we
      // We put all the remaining expressions of the function into a new block and scan it
      // then we merge the resulting map with this one.
      ...generateFunctionMap({ type: "block", expressions: block.expressions.slice(3) }),
    ]);
  }

  // A block can contain multiple expressions. So, we must scan each one to see if it is a function
  // definition. The root `BlockNode` for instance, will almost always have multiple functions.
  return block.expressions.reduce((map, expression) => {
    // Only block expressions can be functions
    if (expression.type === "block") {
      return new Map([...map, ...generateFunctionMap(expression)]);
    }

    // We can ignore all other expression
    return map;
  }, new Map());
}; 

Onto getFunctionIdentifier we called earlier. This function is simple. It takes a BlockNode, ensures that the second identifier is a TypedIdentifierNode, and then returns the identifier and return type:

// src/compiler.mts
const getFunctionIdentifier = (block: BlockNode) => {
  // Grab the second expression
  const node = block.expressions[1];

  // Ensure the expression is a typed identifier
  if (!isNodeType(node, "typed-identifier")) {
    throw new Error("Expected typed function name");
  }

  return {
    identifier: node.identifier,

    // We have to map the return type to a type binaryen understands.
    returnType: mapBinaryenType(node.typeIdentifier),
  };
};

As noted in the getFunctionIdentifier function, Binaryen doesn't understand what type the string typeIdentifier is. To handle this we have to map our defined types to binaryen types. For now, we'll just support i32 and f32. Thankfully binaryen uses the same nomenclature we do. So the map function is pretty simple:

// src/compiler.mts
const mapBinaryenType = (typeIdentifier: string): binaryen.Type => {
  if (typeIdentifier === "i32") return binaryen.i32;
  if (typeIdentifier === "f32") return binaryen.f32;
  throw new Error(`Unsupported type ${typeIdentifier}`);
};

getFunctionIdentifier Made a call to a new function, isNodeType. This function is essentially the same concept as the isTokenType function we defined in the parsing article, only for ASTNode instead of Token.

Here's the definition:

// src/compiler.mts
export const isNodeType = <T extends AstNode["type"]>(
  item: unknown,
  type: T
): item is Extract<AstNode, { type: T }> => {
  return (
    // Ensure the type exists
    !!item &&
    // Ensure the type is an object
    typeof item === "object" &&
    // Cast the type as a record, so TypeScript doesn't get mad at us and then compare the
    // type field with the type parameter. If they are equal, we know the node is the
    // the type we were looking for.
    (item as Record<string, unknown>)["type"] === type;
  )
};

With the mapper finished we can start generating some code.

Compiling Expressions

The compileExpression function is where we really start to make use of binaryen to model the generated machine code. Because of the tree structure of an, ahem, abstract syntax tree, compileExpression is highly recursive. This is one of my favorite things about programming languages, their patterns tend to lend themselves to elegant recursive functions with high levels of code re-use.

Let's start with defining the parameters of compileExpression. We will need to pass the binaryen.Module and the functionMap we created earlier, the actual expression we are compiling, and any parameters of the function this expression may be a part of (if it's inside) of a function. When there are more than two parameters of a function it can be difficult to visually keep track of what is what. So I like to make it clear by grouping them together in an object. This enforces labeling the parameters on call and as a result, improves code readability.

Here's the interface of that object:

Now that we have the options for compileExpression defined, we can define the actual function. compileExpression takes CompileExpressionOpts as a parameter and returns a number. The job of this function is to take an expression and determine what type of expression it is; from there, it can pass the expression to another compiler function that can handle that specific type of expression.

Why return a number? When we build an expression with binaryen it returns a number as an identifier for that expression. This allows us to compile an expression ahead of time and then reference that expression later down the line.

Here's the definition:

// src/compiler.mts
const compileExpression = (opts: CompileExpressionOpts): number => {
  // Grab the expression and the binaryen module (mod) from the options.
  // The other fields are used by child function calls
  const { expression, mod } = opts;

  // Map the expression node to it's corresponding specific compiler
  if (isNodeType(expression, "block")) return compileBlock({ ...opts, expression });

  // Numbers are simple enough to compiler that we can just inline the compiler here.
  // They are represented as constants
  if (isNodeType(expression, "int")) return mod.i32.const(expression.value);
  if (isNodeType(expression, "float")) return mod.f32.const(expression.value);

  if (isNodeType(expression, "identifier")) return compileIdentifier({ ...opts, expression });

  // Throw a helpful error message if we don't recognize the expression
  throw new Error(`Unrecognized expression ${expression.type}`);
};

Let's define the compileBlock functions. Because this function is also compiling an expression, we can re-use the previously defined CompileExpressionOpts, but we'll narrow the expression field to the BlockNode type, since we know we are compiling a block by the time this function is called:


interface CompileBlockOpts extends CompileExpressionOpts {
  expression: BlockNode;
}

const compileBlock = (opts: CompileBlockOpts): number => {
  // We re-map the expression field to block here for clarity.
  const { expression: block, mod } = opts;

  // When a block has multiple expressions and the first one is an identifier, that means
  // the block is actually a function call.
  if (isNodeType(block.expressions[0], "identifier") && block.expressions.length > 1) {
    // If it is a function call, transfer responsibility to the `compileFunctionCall` function (defined next)
    return compileFunctionCall(opts);
  }

  // This is where the recursive beauty starts to show. Since every value of a block
  // is an expression, we can map each one back to the compileExpression function.
  const expressions = block.expressions.map((expression) => {
    return compileExpression({ ...opts, expression });
  });

  // Now we generate the machine code by calling the block function of binaryen
  // This function takes a block name, an array of compiled expressions, and a block return type.
  // Named blocks are mostly useful for looping constructs like `for` and `while`. In this
  // case we can pass null as we're not compiling a loop construct. Additionally, we can
  // pass `auto` as the type since binaryen is smart enough to determine the return type
  // of blocks automatically.
  return mod.block(null, expressions, binaryen.auto);
};

Note: If you're curious to see how looping works in binaryen / WebAssembly works, check out my blog post on the subject here. Spoiler alert, its pretty weird.

The last simple expression we'll compile in this section is the identifier expression. If compileExpression was passed a lone IdentifierNode it means that the expression evaluates to the value of the identifier. In wispy, we don't have variables and function identifiers are caught before they could've been passed here. That means the only thing IdentifierNode can resolve to is a parameter.

Here's the definition:

interface CompileIdentifierOpts extends CompileExpressionOpts {
  expression: IdentifierNode;
}

const compileIdentifier = (opts: CompileIdentifierOpts): number => {
  // We remap expression to node to keep our lines a little shorter
  const { expression: node, parameters, mod } = opts;

  // Since we know the identifier has to be a parameter, we look it up in our
  // parameter map. Don't worry, we'll define the parameter map in the next section
  const info = parameters.get(node.identifier);
  if (!info) {
    throw new Error(`Unrecognized identifier ${node.identifier}`);
  }

  // Finally, we use the local.get instruction to return the parameter value.
  // Binaryen needs to know the parameters index and type. We'll get into
  // the index when we define our parameter mapping function.
  return mod.local.get(info.index, info.type);
};

The final expression type left to compile is a function call. This is interesting enough to warrant its own section.

Compiling Function Calls

In wispy function calls are blocks with multiple expressions where the first expression is an identifier. The job of compileFunction is to determine which function is being called, what its parameters and return type are, and finally, building the call instruction with binaryen.

Here's the definition:

// src/compiler.mts

// Because function calls are blocks, we can re-use CompileBlockOpts
const compileFunctionCall = (opts: CompileBlockOpts): number => {
  const { expression, functionMap, mod } = opts;

  // The first expression of a function call is the functions identifier
  const identifierNode = expression.expressions[0];

  // Here we just ensure the identifierNode is *actually* an identifier. Otherwise we throw an error.
  if (!isNodeType(identifierNode, "identifier")) {
    throw new Error("Expected identifier when compiling function call");
  }

  // Next we create a reference to what the actual identifier is
  const identifier = identifierNode.identifier;

  // If the identifier is "fn", the function we are calling is the function to define functions!
  // That's right! Functions are created by another function. Pretty neat if you ask me.
  if (identifier === "fn") return compileFunction(opts); // We'll define this next

  // Ifs are special functions. They may or may not have an else block. Binaryen needs to know
  // if the else block exists at compile time, so we have a special if compiler for this.
  if (identifier === "if") return compileIf(opts); // We'll define this later

  // Every other function is either part of the standard library, or is defined
  // within the wispy code itself.
  const functionInfo = functionMap.get(identifier);
  if (!functionInfo) {
    throw new Error(`Function ${identifier} not found`);
  }

  const params = expression.expressions
    // Every other expression in the block are parameters to the function, so we compile them
    // and then pass them to the call
    .slice(1)
    .map((expression) => compileExpression({ ...opts, expression }));

  // Now we use binaryen to construct the call expression. The first parameter
  // is the functions identifier, the second are the compiled parameter expression,
  // and the third is the return type which has already been determined by generateFunctionMap
  return mod.call(identifier, params, functionInfo.returnType);
};

Let's define the compileIf function before we move onto the compileFunction... function.

1

// src/compiler.mts

const compileIf = (opts: CompileBlockOpts): number => {
  const { expression, mod } = opts;

  // The first expression, expression.expressions[0], is the "if" identifier, we don't need
  // to do anything with it since we already know we are compiling an if expression

  // The second expression is the if condition
  const conditionNode = expression.expressions[1];

  // The third expression is the ifTrueNode, it's what is executed if the conditionNode evaluates to
  // true
  const ifTrueNode = expression.expressions[2];

  // Finally the fourth expression (which may or may not exist) is what is executed if the condition
  // evaluates to false
  const ifFalseNode = expression.expressions[3];

  // Compile the condition expression
  const condition = compileExpression({ ...opts, expression: conditionNode });

  // Compile the ifTrue Expression
  const ifTrue = compileExpression({ ...opts, expression: ifTrueNode });

  // Check to see if the ifFalseNode exists, if it does, compile it, otherwise set ifFalse to undefined
  const ifFalse = ifFalseNode ? compileExpression({ ...opts, expression: ifFalseNode }) : undefined;

  // Finally we use binaryen to compile the if expression
  return mod.if(condition, ifTrue, ifFalse);
};

Compiling Function Definitions

Function definitions are a whole lot like function calls, so the function structure is pretty similar. We take CompileBlockOpts and return a number (the binaryen expression reference).

Here's the definition:

// src/compiler.mts

const compileFunction = (opts: CompileBlockOpts): number => {
  const { expression: block, mod } = opts;

  // We need to tell binaryen what the identifier and return type of the function is
  // Thankfully, we already wrote a function for that, getFunctionIdentifier. We
  // could also have just looked up this information with the functionMap, but
  // this is more fun.
  const { identifier, returnType } = getFunctionIdentifier(block);

  // Next we grab the function parameters. This is the third expression of the function
  const { parameters, parameterTypes } = getFunctionParameters(block); // Defined later

  // The rest of the expressions in the function are the functions block. So we create
  // a new BlockNode from the remaining expression.
  const body = compileBlock({
    ...opts,
    expression: {
      type: "block",
      expressions: block.expressions.slice(3),
    },

    // We need to pass the parameters of this function, so they can be referenced in child
    // expressions
    parameters,
  });

  // Now we register the function with binaryen. Binaryen takes the function identifier,
  // an array of parameter types (each item being the type of parameter in order),
  // the function's return type, a list of variable types (wispy doesn't have any, so we pass an empty array)
  // and finally the compiled body of the function.
  mod.addFunction(identifier, parameterTypes, returnType, [], body);

  // To make things easy we export every single function defined in a wispy file
  // so it can be called by the WebAssembly host.
  mod.addFunctionExport(identifier, identifier);

  // Because function definitions are *technically* expressions that can be a part of another function
  // body, we need to return an expression pointer. For this, we just return a nop (do nothing instruction),
  // to make things consistent.
  return mod.nop();
};

Now, let's define the getFunctionParameters function. This function takes the function BlockNode, that is, the entire unmodified function definition, and extracts its parameters. The function returns two values, parameters and parameterTypes.

The first returned value, parameters, is a map where the key is the parameter identifier, and the value is the information needed to access the parameter down the line within the function body.

The second returned value is an array of binaryen types. There is one type for each defined parameter, and they must remain in the order they are defined. This is because binaryen doesn't reference parameters by their names, instead it references them by the index of the array in which they are defined. Don't worry if this is confusing to you, the code should make things a little more clear. If you need, refer to the compileIdentifier definition, to get a better understanding of how this works in practice.

Here's the definition:

// src/compiler.mts

type ParameterMap = Map<string, { index: number; type: number }>;

const getFunctionParameters = (block: BlockNode) => {
  // The parameters are defined in the third expression of the function definition
  const node = block.expressions[2];

  // Check to make sure the third expression is a block
  if (!isNodeType(node, "block")) {
    throw new Error("Expected function parameters");
  }

  // Now we reduce the parameters into a parameter map, and a list of binaryen types
  const { parameters, types } = node.expressions.reduce(
    (prev, node, index) => {
      // First, ensure that the node is a typed-identifier. Every parameter must be a
      // typed identifier, therefore, every node in this reducer must be a typed identifier.
      if (!isNodeType(node, "typed-identifier")) {
        throw new Error("All parameters must be typed");
      }

      // Determine the correct binaryen type of the parameter
      const type = mapBinaryenType(node.typeIdentifier);

      // Add the parameter's type to the list of types we've defined so far
      const types = [type, ...prev.types];

      // Now add the parameter to the parameter map. We save the parameters index and type.
      // The index and type is used binaryen to access the parameter when it is used
      // later in the function body
      const parameters = new Map([[node.identifier, { index, type }], ...prev.parameters]);

      // Return updated parameters map and types array
      return {
        parameters,
        types,
      };
    },
    // Here we are setting the starting values for our reducer function and casting the default
    // type so typescript can correctly infer the `prev` parameter type
    { parameters: new Map(), types: [] } as {
      parameters: ParameterMap;
      types: number[];
    }
  );

  // Finally we return the parameter map and the parameterTypes
  return {
    parameters,

    // Note: parameterTypes is a number, instead of an array of numbers as you'd expect.
    // So we have to use binaryen.createType to create a new type that is referenced
    // the mod.addFunction function. This is one inconsistency with the binaryen API. Parameters
    // are defined as a number, and variables are defined as an array of numbers. I'm sure there
    // is a reason for this, but I don't know what that reason is.
    parameterTypes: binaryen.createType(types),
  };
};

Now all that's left is to define the standard library. This part of the code isn't super interesting. We are essentially just mapping primitive WebAssembly instructions to a name to be referenced within wispy.

Here's the definitions. The only important information is the name we are associating with each instruction:

// src/compiler.mts

const registerStandardFunctions = (mod: binaryen.Module, map: FunctionMap) => {
  const { i32, f32 } = binaryen;
  const { i32: i32m, f32: f32m } = mod;
  const common = { mod, map };
  registerLogicFunction({ name: "lt_i32", type: i32, operator: i32m.lt_s, ...common });
  registerLogicFunction({ name: "gt_i32", type: i32, operator: i32m.gt_s, ...common });
  registerLogicFunction({ name: "eq_i32", type: i32, operator: i32m.eq, ...common });
  registerLogicFunction({ name: "lt_f32", type: f32, operator: f32m.lt, ...common });
  registerLogicFunction({ name: "gt_f32", type: f32, operator: f32m.gt, ...common });
  registerLogicFunction({ name: "eq_f32", type: f32, operator: f32m.eq, ...common });
  registerMathFunction({ name: "add_i32", type: i32, operator: i32m.add, ...common });
  registerMathFunction({ name: "sub_i32", type: i32, operator: i32m.sub, ...common });
  registerMathFunction({ name: "mul_i32", type: i32, operator: i32m.mul, ...common });
  registerMathFunction({ name: "add_f32", type: f32, operator: f32m.add, ...common });
  registerMathFunction({ name: "sub_f32", type: f32, operator: f32m.sub, ...common });
  registerMathFunction({ name: "mul_f32", type: f32, operator: f32m.mul, ...common });
  registerMathFunction({ name: "div_f32", type: f32, operator: f32m.div, ...common });
};

const registerMathFunction = (opts: {
  mod: binaryen.Module;
  name: string;
  type: number;
  operator: (left: number, right: number) => number;
  map: FunctionMap;
}) => {
  const { mod, name, type, operator, map } = opts;
  return registerBinaryFunction({
    mod,
    name,
    paramType: type,
    returnType: type,
    operator,
    map,
  });
};

const registerLogicFunction = (opts: {
  mod: binaryen.Module;
  name: string;
  type: number;
  operator: (left: number, right: number) => number;
  map: FunctionMap;
}) => {
  const { mod, name, type, operator, map } = opts;
  return registerBinaryFunction({
    mod,
    name,
    paramType: type,
    returnType: binaryen.i32,
    operator,
    map,
  });
};

const registerBinaryFunction = (opts: {
  mod: binaryen.Module;
  name: string;
  paramType: number;
  returnType: number;
  operator: (left: number, right: number) => number;
  map: FunctionMap;
}) => {
  const { mod, name, paramType, returnType, operator, map } = opts;
  mod.addFunction(
    name,
    binaryen.createType([paramType, paramType]),
    returnType,
    [],
    mod.block(
      null,
      [operator(mod.local.get(0, paramType), mod.local.get(1, paramType))],
      binaryen.auto
    )
  );
  map.set(name, { returnType });
};

By now src/compiler.mts should look something like this. With that, our compiler is finished. It's time to execute some wispy!

Putting It All Together

Now that we have finished our compiler, we can finally run our code.

First, replace the contents of src/index.mts with this:

// src/index.mts

#!/usr/bin/env node
import { readFileSync } from "fs";
import { compile } from "./compiler.mjs";
import { lex } from "./lexer.mjs";
import { parse } from "./parser.mjs";

const file = process.argv[2];
const input = readFileSync(file, "utf8");

const tokens = lex(input);
const ast = parse(tokens);

// !! New !!
const mod = compile(ast);

// This is sneakily where the code gen is *actually* happening
const binary = mod.emitBinary();

// Use the standard WebAssembly API to convert the wasm binary to a compiled module
// our host NodeJS/v8 can use
const compiled = new WebAssembly.Module(binary);

// Build the instance, here you would add any external functions you might want to import into
// the WebAssembly module
const instance = new WebAssembly.Instance(compiled, {});

// Finally, run the main function and log the result. We have to cast instance.exports to any
// The standard TypeScript types appear to be wrong.
console.log((instance.exports as any).main());

Now build and run project:

npx tsc
wispy example.wispy

If all goes well (and you passed the number 15 to fib), you should see the number 610 in the output of your console. If so, you've done it, you've made a working WebAssembly language.

Congrats!

A full copy of the wispy language implementation is available under an open-source license at https://github.com/drew-y/wispy.


Written by courier | Courier simplifies triggering and sending notifications from your app with one API and drag and drop UI.
Published by HackerNoon on 2022/09/02