Notes on Writing a Compiler

This document has some sketchy notes about what goes on inside a compiler; in particular, how the compiler I wrote works. I mainly wrote this to try and synthesize for myself what I learned about compiler design.

Note on terminology: I'll use the word "compiler" as a catch-all for compilers and interpreters.

Design Guidelines

I've tried to sacrifice performance wherever possible to improve comprehensibility. My main goal is to understand compiler organization at a medium to high level. I blew off any low-level details which obscure the design or code in the name of efficiency. Note that this tendency runs counter to the philosophy of your average compiler writer, for whom efficiency is a major goal. And often rightly so; if you're building a tool that others are going to use to build programs.But for me and my personal little compiler -- not that important.

When designing and writing the code, I've tried to follow good O-O practices to enhance modularity, maintainability, and readability.

Structure of a Compiler

Compilers are generally broken in two parts: the front end and the back end. The front end is responsible for parsing the input file and converting it to some internal representation. The back end is responsible for taking the internal representation and either writing it to disk (for a compiler) or executing it (for an interpreter).

This decomposition is obviously useful. To make a compiler work on different platforms, you just have to rewrite the back end. Alternatively, you can reuse a back end and create compilers for different languages. For the compiler I built, the source language isn't really adequate and needs to be changed. When I get around to doing this, it will just require modification of the parser; the back end stays the same.

The compiler I wrote has three major components:

A parser, to read in the input and then convert it to
Intermediate code, which is an object representation of the input file, and
The executor, which takes the intermediate code and uses it to produce screen output.

Parsing

Parsing has both a formal and an informal definition. Informally, parsing is the act of reading input and somehow making sense of it. We'll get to the formal definition later.

So -- the goal is to get a routine that reads in a source file and outputs intermediate code (to be defined).

Parsing is generally decomposed into two pieces: lexical analysis and, well, parsing (the formal definition).

Lexical Analysis

Also known as scanning, or tokenizing. An input file is presented to a computer program as a big arracy of bytes. For example, if a file consisted of the line

The quick brown fox jumped over the lazy dog.

then this shows up in code as an array with first element 'T', then 'h', 'e', ' ', and so on. Not what we want -- we'd like to have a stream of words: 'The', 'quick', 'brown', etc. So the first step in reading this file has got to be reorganizing the input from a stream of characters to a stream of words. Although instead of calling them "words", we'll use the more technical term of "token" ("lexeme" is another word meaning the same thing).

Another example -- take this line of code

x = 4.4 + 7;

A token is usually defined as having a type and an attribute. The type is generally an integer code, and the attribute a string or numeric value. For this example, there are six tokens. The first token would have type IDENTIFIER (variable) and attribute "x". The second token has type EQUALS and no attribute. The third, type NUMBER and attribute either the value 4.4 or the string "4.4".

Note also that although the source string includes spaces, those aren't present in the token output. That's another role of the lexical analyzer -- to remove whitespace, comments, and other extraneous material (if desired).

Lexical analysis is the process of breaking up the input into a sequence of tokens. There are three ways you can create lexical analysis code.

You can just hack it out -- for the above example, it's pretty easy. Read a character; if it's a letter then keep reading until you hit a space. That completes the token; store it in memory and start the next token. That's pretty simple, but things get much harder, for example when you want to read in floating-point numbers or input where some words are keywords and have to be handled specially.

Alternatively, you can specify your input language will be turned into tokens. Regular expressions are your friend here. For example, an English word can be characterized using the regex ( [a-z][A-z] )+ (a string of one or more letters).

Once you have this specification, there are public algorithms for creating the lexical analysis code accordingly. Better than that, there are tools that will automatically create lexical analyzer code for you, if you spec your language in a certain format. This is by far the easiest way to go. See the section on parser generators below.

Grammars

The lexical analyzer turns the source input into tokens, but it doesn't care about what the tokens mean. The standard methodology is to interpret the token stream in terms of a grammar. This section is going to be really sketchy, since it's too deep to cover in such a short space. I'll try to get the flavor across.

A grammar is a description of a language. Theoretical grammars have been heavily studied by linguists and computer scientists, but the amount of theory you actually need to know to build a compiler isn't bad.

The type of grammar computer languages usually use is called context-free, also called BNF (for Backus-Naur Form) grammars. I won't try to define them directly, but will illustrate the concept with an example.

The main construct for a context-free grammar is something called a production. Let's look at an example that describes expressions built by adding numbers:

expression -> expression + number | number

You can read the symbol "->" (an arrow) as "can be", and the symbol "|" as "or". In other words, this production is saying that an expression can be an expression followed by a + token, followed by a number token, or just a single number token. For example, this production defines the token string "1+2" as an expression; likewise "1+2+3", if you apply the production twice.

A more complex grammar, to define algebraic expressions built up using +,-,*, and /, and allowing parenthesizing, would be:


	expression -> expression - term | expression + term | term


	term -> term * factor | term / factor | factor


	factor -> number | (expression)

You can read these productions top-down or bottom-up. For example, if someone hands you the line

6-5

and asks you if it's an expression or not, a bottom up approach would be to start with the last production. 6 is a number, so it's a factor. 5 is likewise a factor. Now a term can be a factor according to the second production, so we have term - term. Now an expression can be a term, according to the first production, so we have expression - term, which is an expression according to the first production.

A top-down approach would be to look at the top production, which tells us all things that an expression can be. We know the line can't be expression + term since there's no '+'; so let's try expression - term. That will work if 6 is a term; using the second production, you can see that 6 can only be a term if it's a factor, which it is according to the last production.

Hopefully, that doesn't seem too involved. And that's pretty much all there is to it. My turtle geometry language is specified as a context-free grammar with about 30 productions; click here for a stylized representation (using "::=" instead of the "->" symbol).

Parsing

Now we can give the more specialized, formal definition for parsing, referred to above. If you have a context-free grammar, and a statement in the language of that grammar, then to parse that statement is to determine which productions are used to derive the statement.

In other words, when we produced the expressions that determine the statement 6 - 5 to be an expression, that was parsing.

Right now, parsing looks like an empty activity, since all it's about is figuring out productions to make a statement. In a compiler, you don't just parse a statement, but you also take action based on which productions were used. That's called translation.

Translation

One of the conceptual problems I had understanding compiler design is that the presentations I read described the process from front end to back end. That makes it hard to comprehend translation, when you don't know what you're translating into.

With my compiler, the language we'll be translating into is that of the intermediate code, which is described a little later. But the bottom line is that for each production in your grammar, you have rules for how to build intermediate code. So when the parser tells you that such and such productions were used for a statement, you invoke those rules to build intermediate code.

Parser Generators

There are well-known algorithms for parsing context-free grammars, but they're pretty hairy and you don't want to be implementing them on your own.

There are programs that let you specify a context-free grammar in a special format, and they'll take that format and produce parsing code for you. These programs are called parser generators, or compiler compilers. Combine this with automated lexical analyzer generation, and you have a powerful toolkit for creating compiler front ends.

The standard set of tools for generating C/C++ code are called lex (lexical analysis) and yacc (for Yet Another Compiler Compiler, a parser generator). The gnu freeware versions are called flex and bison.

The program I used is called JavaCC, which produces Java code that both does your lexical analysis and parsing. It's freely available from Metamata.

Another free program, which I haven't used, is called ANTLR, and it can produce either C++ or Java code. It is supposed to be a more powerful tool than JavaCC (meaning it can handle a wider variety of grammars). It's available at the ANTLR home.

I found JavaCC to be very useful and relatively easy to use. It comes with a wide variety of examples (including grammar files for Java, C, C++, HTML, and JavaCC). However, the documentation was a little scanty and I personally used lex and yacc to cut my teeth on the idea of using parser generator tools, then moved to JavaCC when I was familiar with them.

Intermediate Code

Intermediate code is some sort of internal (to the compiler) representation of the input file. Although, it's not completely obvious why you need intermediate code. It is possible to just read in a line of source code, execute it (or write its contents out to a binary file), read the next, and so on. Although possible, it's not actually that good an idea.

For several reasons: performance, modularity, and hassle.

For my program, I need to actually execute the input program. If you actually tie execution to each line of the user's source, then, for example, every time you go through a loop, you need to re-parse each statement in the loop, which is expensive. Furthermore, compiling or executing on a line-by-line basis only precludes any chance of optimizing the code.

Of course, I've already said that I'm not that interested in performance. I like intermediate code because it adds a layer of indirection between the front and back ends. With icode (I'll use that abbreviation for 'intermediate code' from now on), you decouple the two ends.

Better than that, icode gives you the flexibility of designing a language that is easy to compile or execute. The choice of the input language is usually not in your hands; it's generally written in a spec somewhere that you have to follow. With icode, you can create a design for technical merit (read: easy to program for; maintainable), and then just make the translation piece of the front end adapt.

I'll discuss my icode design after I talk a little about symbol tables.

Symbol Tables

I think of a computer language as being composed of fixed elements and user-defined "stuff". Fixed elements are keywords and punctuation symbols. Those are hard-coded, as it were, in the language; they always mean the same thing and you always specify them in the same format. The rest is up to the user -- variables, function names, and literals. A literal is a piece of data that's embedded in the code. For example, if you have a statement

x = 11;

then the number 11 is a literal in your code. Ditto for strings.

One term for all this user-defined stuff is symbol. Symbols differ from the fixed elements because they can mean different things in different programs. For example, you might have one program where you use the string "sum" as a variable, and in another program, it's a function.

The symbol table is sort of a master dictionary for symbols in a user program. Each user symbol (variable, function name, literal) has a symbol table entry which tells what sort of symbol it is and necessary information about that particular symbol. For example, for a variable, the entry can hold the type of the variable and information about its value. The implementation of a symbol table is usually some sort of data structure that supports fast lookups; often a hash table. But that's not important from a conceptual standpoint.

Symbol tables can get fairly complicated, especially if the user language allows user-defined data types (which I'm not dealing with).

I'll describe my symbol table setup. There are three types of symbol table entries: one for variables, one for subroutines, and one for literals. There is a global symbol table, which just holds an entry for each subroutine. Every subroutine has its own symbol table. This is useful because you can have different variables in different subroutines with the same name. It's easy to keep track of your variables if you just put their information in a symbol table local to their subroutine only.

The data stored by the different symbols is pretty small. A function symbol stores an address in the icode for the function definition and has its own symbol table object. A literal symbol stores the actual value of the literal. A variable symbol stores not the value of the variable, but something called an offset, which I'll describe when I get to frames.

For a simple language, you can get by with storing the value of a variable inside its symbol table entry. The only trouble with this is that it implies that you only have a single instance of this variable -- if a function is called recursively, then the value of a local variable is the same in all the invocations of that function (acting somewhat like a C static variable). To make recursion work, you have to make the value of the variable depend both on the symbol and on the particular invocation of its owning function.

My Icode

There are several popular options for intermediate code. The Dragon book discusses "three-address code", which is a sort of assembly language for an abstract machine. Pascal compilers often use a "P machine" which also has an assembly-like flavor.

All that's a bit too low-level for me. I used the code in a book by Ronald Mak as an example. I modified his approach by removing low-level indexing details and eliminating features I wasn't using, but the overally philosophy is quite similar.

Basically, my icode looks a lot like the token stream coming out of the lexical analyzer. The two biggest changes are that where there are identifiers (variables, functions) and literals in the user source code, the icode has symbols referring to those identifiers. Furthermore, all icode instructions are stored in an array and thus every icode instruction has a "line number", which is just the index into the array. Control structures use these line numbers for their branching instructions.

That's a rough overview; let's be more specific. I have an ICode class, which consists of the global symbol table and the code segment, which is an array of ICodeStatement objects.

An ICodeStatement represents a single statement or command of intermediate code. Each ICodeStatement has a type, basically referring to a token type, and optionally a symbol or a jump location. For example, the source code statement

x = y + 7;

would look like this in ICode:

The only really tricky part is control statements (IF, WHILE, FOR), which need to have branch instructions. For example, with an IF statement, the executor is going to evaluate the boolean expression referred to by the IF, and then it will jump execution to different points in the icode depending on whether the expression is true or not.

Of course, the description of ICode is always a little empty until you see how it's going to be executed. That's coming up in a little while.

I'm not overly impressed with this design for the icode. It requires the executor to be basically a parser, which seems like too much of a pain. It works, though.

Using the Parser Generator

I just want to take a second to explain how you would use a parser generator tool to build icode. The details vary of course, both on the icode and the exact tool you're using.

Let's keep working with the statement

x = y + 7;

In my compiler, this is going to get parsed into the production

assignment -> identifier = expression ; This is a production in my JavaCC grammar file, and I've annotated it to produce icode when this production is used in a parsing.

When the parser sees this assignment, it will take the following actions:

Look up the symbol table entry for the variable "x"; if there is none, create an entry in the owning function's symbol table.
Create an icode statement with that symbol, and append it onto the code segment.
Create an icode statement of type EQUALS and append it onto the code segment.
Look up the symbol table entry for the variable "y"; if there is none, create an entry in the owning function's symbol table.
Create an icode statement with that symbol, and append it onto the code segment.
Create an icode statement of type PLUS and append it onto the code segment.
... and so on.

I found the only really tricky part with using the parser generator to be with control statements. For example, say you're parsing the production

if_statement -> IF ( expression ) { statement_list }

Here, to tell the executor where to go when the expression is false, you need to put in the icode a jump instruction to the first statement after the IF block. However, you don't know what that statement is until you've generated all the icode for the IF. So the technique is to put in a blank icode statement for the jump, and then after you've processed the rest of the IF and know where you have to jump to, you go back and fix up the jump. This process is called backpatching.

Execution

At this point, I've described mechanisms to read in the input file, and translate it into icode. The last thing that my compiler needs to do is to execute that code (other compilers will normally write it out as an executable file).

This is where stuff really gets confusing. I was pretty messed up until I got all the different code levels straight in my head. You have:

Source code -- some user's text file that you're reading in as input. Written in the language that you're writing the compiler for.
icode -- a representation of the source code, usually as a set of objects/data structures in the implementation language. This code does not have a persistent existence; there's no such thing as an icode file. It only exists at run time.
icode implementation -- the implementation of the icode in your implementation language; exists statically.
executor code -- code in your implementation language to execute the run time icode objects.

It's not that bad once you think about it and get it all straight. I wasn't thinking clearly when I started out, and managed to get my head all screwed up. This was maybe the biggest conceptual hurdle I had to overcome before the whole compiler design made sense.

The executor has the following data: a run-time stack for expression evaluation, a frame stack to handle function evaluation, and a turtle object to handle turtle commands. We'll discuss each in turn. The executor also has lots and lots of logic to handle all the icode statements.

Let's look at the executor's data structures before going in to how it works.

The results stack

Basically, this is just a place to put temporary results. For example, if you're executing the expression

(3 + y) * 8

then what happens is the executor uses the grammar of the icode to break this into productions. As each sub-production is evaluated, the result is pushed on the stack, and then popped from the stack when needed later. In this example, the executor will, in turn, push 3 on the stack, push the value of y on the stack, pop both, add them, and push the result, push 8, pop 8 and the previous result, multiply them, and push the result. Whatever icode statement uses this expression will pop the result.

Frames

What I call a frame is also called a stack frame, or an activation record. Every time a subroutine gets invoked, a frame will be created. The frame stores the value of all the parameters used to call the subroutine, all the local variables in the subroutine, the return value (if any) of the subroutine, and the code location from which the subroutine was called. In a more extensive language, the frame will have more information; perhaps a reference back to the calling stack frame or a reference to global data.

So it's the frame that holds the actual values of source code variables. There's a fairly standard, elegant, implementation that's used to map variable symbols to frame values. The frame stores values of variables in an array. A variable symbol stores a constant index (called an offset) that is used by all frames to access the value. So, to look up the value of "x", you take the symbol for x, look up its offset, and ask the frame to return the value for that offset.

The executor keeps a stack of frames. Each time you invoke a subroutine, a new frame for that subroutine is created and pushed onto the stack. When the executor has finished processing the subroutine, it pops the frame off the stack, examines it to see where to resume execution, and pushes the return value (if there is one) onto the run-time stack.

The Turtle

The easiest part of the whole business. The turtle is just a Java class with one method for each turtle command (forward, left, etc). It keeps track of the points it has visited, and knows how to draw itself.

The turtle class is completely decoupled from the rest of the code; you could code up turtle geometry programs inside Java by creating a turtle object and just calling its methods.

Back to the Executor

So ... how does the executor work? Well, it starts off by loading up the main() routine, and then starts processing ICodeStatement objects. It parses them using the icode's grammar and, well, does what it has to do.

For example, take an example we've seen before, where the source code was

x = y + 7;

and the icode statements were:

The executor recognizes this as being an assignment statment, by the production

assignment -> identifier = expression ;

The executor calls its Assignment() subroutine, which

processes the first statement by grabbing the variable symbol for x
processes the second statement by doing nothing
processes the statements of the expression as described above, eventually pushing the value of y+7 on the evaluation stack.
looks up the address of the variable x in the active frame
sets the value of x in the frame equal to the top value on the evaluation stack, and pops that stack.

It's actually not that complicated, once you've managed to get all the levels of indirection straight.

Error Handling

I haven't talked much about error handling, which is critical for a successful compiler. I'm not talking about handling errors in the implementation. My take is that there's not a whole lot you can do there besides failing the compile process; since if your logic is busted, you absolutely don't want to present the user with an output that's incorrect.

What's vital is that you handle user errors, i.e. errors in the user's source code. Imagine how hard it would be to write a program if the only feedback you get for a compile error is that there is an error. You've written several hundred lines of code and the compiler tells you there's an error ... somewhere. Not too helpful.

At a minimum, the compiler needs to give the line number of the source code for all the errors it finds. My compiler will stop on the first error it finds and report a description of the error and the line number. This isn't sufficient for an industrial strength compiler; you want it to detect as many errors as possible each time you attempt a compile. Which is not an easy task if you think about it. I don't bother; I figure that as a user, I have to fix all the bugs anyways so whether I get them one at a time or all at once don't matter too much (so long as compile times are short).

Ok, so how are errors detected? Well, the parser generated by JavaCC will detect syntax errors. That is, if the user's code cannot be parsed by your grammar, JavaCC will throw an exception (which I catch to print out the error message and halt execution). Since my turtle geometry language isn't typed, most of the errors that can be caught at compile time will indeed be caught by JavaCC. There are a few errors that I'll catch manually, like calling a subroutine that hasn't been defined.

Besides syntax errors, the other main class of errors is run-time errors, involving variables not being intialized (which actually, is a language "feature" I should fix) or operators being applied to inappropriate inputs. When such happens, I print an error message and halt execution. The only subtle point is that I'm detecting errors in icode, yet I need to tell the user the line number in the source code. So what I did was to store the source line number in every ICodeStatement object. This is horribly inefficient -- but not so much that it matters. There are more efficient ways; see Mak's book.