Creating a Scripting Language with ANTLR

Read Part 2 after this.

Computer science is an ever-evolving field. Whether on a small level or a large scale, it is always changing. When ENIAC, the first general-purpose computer was created 70 years ago, nobody could have imagined that we would eventually be able to carry powerful computing machines around in our pockets.

Imagine that the first computers looked like this monster!

As computer science has evolved, so has computer programming. One of the most obvious changes is the extreme improvement in the usability of syntax of programming languages. Languages like C and Javascript make it no longer necessary to move around cables and switches on physical machines like ENIAC’s programmers had to. Instead, we can type what loosely resembles algebra into a text editor and get computers to follow our instructions.

As new issues present themselves to developers, existing languages change themselves to accommodate new improvements. Sometimes, entirely new programming languages are born. Who knows? Yours just might be the next one. This tutorial will teach you the basics of compiler theory as we build a minimal scripting language that compiles to Javascript.

Our tool of choice is called ANTLR, “Another Tool for Language Recognition.” ANTLR was created by Professor Terrence Parr in 1989 as an advanced parser generation tool, and is now in its fourth version.

ANTLR takes grammar files (.g4) as input and generates code that parses input text according to that grammar. The tool supports code generation in Java, C#, Javascript and Python officially.

The compiler we are going to write is actually a transpiler, which takes code written in one language and outputs code in another language. An actual compiler would take input code and generate either machine code or code in a lower-level language.

This transpiler will be written in Javascript and output valid Javascript code based on input text in our language.

But before we can even design our language, we need to understand a bit about compiler theory.

Assume you have a block of code like this:

set name to “world”set one to 1

say "Hello, ${name}!"

And you want to generate:

var name = “world”;var one = 1;

alert("Hello, " + name + "!");

There are a few ways to take the input code and generate Javascript. One way would be pass the input through a series of regular expressions. However, for more complex languages, it becomes very complex and unmanageable to do compilation solely via regular expressions.

The preferred way is a three-step process that starts with something called scanning, or lexical analysis. A scanner (otherwise known as a lexer) enumerates each character in the input text and groups the characters into unit called lexemes. The concept of a lexeme is present in natural languages, as well as computer languages. Lexemes are very abstract, and are nothing more than a grouping of characters. In short, lexemes are strings. Lexers extract specific strings from input text.

The scanner compares each of the strings to predefined patterns (this is where it is suitable to use a regular expression or something similar), and categorizes each lexeme as a token. Tokens are indeed lexemes, just with semantic definition attached to them. For example, assume you have the keyword “if” stored as a lexeme. A lexer’s responsibility is to recognize this “if” lexeme as an “if keyword” token.

An important consideration to make that will save you a lot of time and head-scratching is that lexers should not contain any real logic, besides matching strings against predefined patterns.

That logic is the job of a parser. Syntax analysis, or parsing, is the step of compilation that recognizes specific sequences of tokens and organizes them into a structural representation of the input text. The code unit that performs this is called the parser. Commonly, parsers build trees, where there is one root context, which has multiple child contexts representing the recognized token sequences from the input code. If you have looked into compilers before, you might recognize this concept as an abstract syntax tree, abbreviated as AST. The AST is the most important part of the compilation process, as it is quite literally the compiler’s roadmap to the code it is processing.

Taking the code example above, an AST might look like this:

<root context><variable assignment statement>- target name is 'name'- value: <expression (string)> "world"

<variable assignment statement>  
    - target name is 'one'  
    - value: <expression (int}> 1

<function invocation statement>  
    - target function is 'say'  
    - arguments:  
        \* <expression (string)> "Hello, ${name}!"

The final step is known as semantic analysis, and is completely open-ended. You are free to do whatever you want with the generated AST, but in most cases, we walk the tree and output a representation of each node.

In short, we go from input code -> lexer -> token stream -> parser -> AST -> semantic analysis -> output.

ANTLR is important within this tutorial because it automatically does the scanning and parsing for us. As you can imagine, this dramatically shrinks the time needed to write a compiler (or any language-enabled program), as we only need to implement semantic analysis.

In Part 2 of this tutorial, we will learn how to turn an ANTLR grammar into a lexer and parser to be used in our code. Stay tuned!

Creating a Scripting Language with ANTLR — Part 1