Saturday 14 November 2015

Compiler 3 Part 8 - Finale

This will end up being the longest, and final, post in the series. So let's tuck in!

Testing and Everything Else

The scanner and most of the infrastructure stayed exactly as it was.

The parser and AST did see some changes. The parser, in particular, saw several bug fixes and simplifications. Some of the error checking that had been done during parsing has been passed on to the semantic analyzer. Similarly, anything related to type checking could be removed.

Also, the AST-related scoping code could be simplified and confined to the parser since it is discarded during semantic analysis.

Detecting “true” and “false” was added for the boolean type.

With those changes out of the way, I can say that if there is one area that Calc has truly lacked, it was in testing.

A compiler project should, in my opinion at least, be concerned with correctness. Correctness and consistency. The IR was one step in shoring up Calc in this area and testing is the other. Calc has always incorporated at least a little testing but it has never been as comprehensive as it should be.

Go provides great testing facilities right out of the box. The cover tool has also proven helpful in sussing out corner cases that need testing.

I think large strides have been made in making Calc a much more stable compiler.

That isn’t to say that it’s perfect. I’m sure within 5 minutes of playing with it someone will discover some bugs. Hopefully, those who find them will take the time to submit an issue so I can create tests and fix them.

This is the value of having an open source project.

The Future...

Calc has continued to evolve along with my own skills and I hope that I’ve helped someone, in some way, to further their own growth. Whether it be teaching someone how to, or how NOT to code, the end result remains the same. Someone, somewhere, has benefitted from my efforts.

Now that I’ve had a chance to regroup from the herculean effort of writing and implementing the Calc2 spec and teaching series, I feel like I can continue to produce more content.

I do hope to continue talking about compiler construction. In the future, I hope that by using a smaller format, like this one, I can produce more posts more quickly. I can certainly say that this third series has been much more enjoyable to write!

I have also been working on another side project to create a virtual machine with a complete software stack. It has an assembler, linker and virtual machine running its own bytecode. If there is interest, I’ll write some blog posts about it. You can find the project here.

I have concluded both previous series talking about future plans and I shall do no less today. In series two I had a list of things I wanted to implement and change. Let’s go through the list:

  • growable stack - well, this is null and void. Calc no longer manages its own stack.
  • simple garbage collector - I no longer plan on implementing a garbage collector myself. See below.
  • loops - not implemented
  • generated code optimizations - done! 
  • library/packages - not implemented
  • assembly - null and void. I don’t plan on generating assembly.
  • objects/structures - not implemented

Only one of the seven features I wanted to implement was actually completed. Of the remaining six features, three of them no longer make sense. The groundwork for the other three has been laid down. In particular, loops and structures only await language design since implementation should be almost trivial.

Incorporating the remaining three features, I now present you with an updated list:

  • #line directives for source-debug mapping - I think this is a very reasonable and easily obtainable goal. It will make debugging Calc with the GNU debugger easier.
  • logical && and || operators - are easily added with minimal work
  • importing libraries - thanks to the x/debug/elf package, importing libraries into code may not be unreasonable to achieve in the near future. The scope of adding imports is likely a series in and of itself.
  • garbage collection - Calc 2.x does not need garbage collection. It does not have pointers and it does no memory allocation. However, when it does, I will probably use the Boehm garbage collector rather than attempting to spin my own. Additional info can be found on Wikipedia. This feature will be far in the future so don't expect it any time soon.
  • structs, loops and arrays - with the new code generator, implementing structs and arrays on the back end ought to be trivial. Work on this has actually already begun and you can view some of the changes here. Sorry, the specification is not published publicly right now.
  • stack tracing - there isn’t much holding me back from implementing stack traces on crashes now that the new code generator is in place. Time and effort are all that remains.

And that, as they say, is that! Thank you for taking the time to read through this series! At some point I’d like to put everything into a PDF but that would be a large task since large parts of both series would need to be re-written entirely. In fact, I might even want to start again from scratch.


Until next time!

Friday 13 November 2015

Compiler 3 Part 7 - Code Generation

Here we are at the final step.

As mentioned in the introduction, I cut out a massive amount of code by offloading much of the work onto the IR.

So here is what has changed:

I introduced a small function to map Calc types to C types. This could probably exist in ir.Types, too, but I chose to keep it coupled with the code generator since that’s the only code that uses it.
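As a rough sketch of what such a mapping function might look like (this is not Calc's actual code; the name cType and the C type names it returns are assumptions made for illustration):

```go
package main

import "fmt"

// cType maps a Calc type name to a C type name.
// Both the function name and the C types chosen here are assumptions.
func cType(calcType string) (string, error) {
	switch calcType {
	case "int":
		return "int32_t", nil
	case "bool":
		return "bool", nil
	}
	return "", fmt.Errorf("unknown Calc type %q", calcType)
}

func main() {
	for _, t := range []string{"int", "bool"} {
		c, _ := cType(t)
		fmt.Printf("%s -> %s\n", t, c)
	}
}
```

Keeping the mapping in one small function means the rest of the generator never has to spell out a C type itself.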

Any object with an ID has its original name stripped and replaced by a generic variable name. These names start with an underscore and the lower-case letter “v” followed by the ID of the object.

Each binary operation is assigned to a new variable (a reason for why C99 is required). Don’t worry about this being wasteful. Even if you chose to output a chain of infix arithmetic (1 + 2 + … + N) the underlying assembly instructions usually take no more than two operands anyway. This is why using the above method works so well since it more closely matches the machine code.
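To make the naming scheme and the one-variable-per-operation idea concrete, here is a minimal sketch in Go. The helper names varName and emitBinary, and the use of int32_t, are hypothetical and not taken from the Calc source:

```go
package main

import "fmt"

// varName builds the generic name described above: "_v" plus the object's ID.
func varName(id int) string {
	return fmt.Sprintf("_v%d", id)
}

// emitBinary returns one C99 declaration that stores the result of a binary
// operation in a fresh variable. The operands are already rendered as C text:
// either a raw value or another _v name.
func emitBinary(id int, ctype, lhs, op, rhs string) string {
	return fmt.Sprintf("%s %s = %s %s %s;", ctype, varName(id), lhs, op, rhs)
}

func main() {
	// a * (b + c) / d, assuming a..d were tagged 1..4 and the operations 5..7.
	fmt.Println(emitBinary(5, "int32_t", varName(2), "+", varName(3)))
	fmt.Println(emitBinary(6, "int32_t", varName(1), "*", varName(5)))
	fmt.Println(emitBinary(7, "int32_t", varName(6), "/", varName(4)))
}
```

Because every result gets its own declaration at the point of use, the output relies on C99's mixed declarations and code.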

Note: I've spent a lot of time comparing the assembly generated from C to see how things work. This is a fun (am I sick?) and interesting exercise to try yourself.

Even if you didn't follow along with the last series, I encourage you to view the previous binary code generation function. 58 lines of confusing mess now pared down to a simple 5 line function. Perhaps more importantly, the generated code is easier to read and follow even though it’s not really intended for visual parsing.

Another important change is using C-style function calling and function declarations. This was made possible by the new IR and type system. With every object being assigned a valid type, we can easily create proper C function prototypes and definitions.

By utilizing the SSA-like code generation and the new IR it also becomes trivial to use the C-style calling convention. Types have already been checked, the number of arguments verified, and sub-expressions have been assigned IDs. Therefore, only raw values and object IDs are passed into the function.
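A small sketch of how prototypes and calls could be rendered once every parameter carries a type and an ID. The param struct and the prototype and call helpers are hypothetical, and keeping the function's source name in the output is an assumption:

```go
package main

import (
	"fmt"
	"strings"
)

// param is a hypothetical stand-in for a tagged, type-checked parameter.
type param struct {
	id    int
	ctype string
}

// prototype renders a C function prototype from a return type, a name and
// the already-typed parameters.
func prototype(ret, name string, params []param) string {
	parts := make([]string, len(params))
	for i, p := range params {
		parts[i] = fmt.Sprintf("%s _v%d", p.ctype, p.id)
	}
	if len(parts) == 0 {
		parts = []string{"void"} // C spelling of "takes no arguments"
	}
	return fmt.Sprintf("%s %s(%s);", ret, name, strings.Join(parts, ", "))
}

// call renders a call whose arguments are raw values or _v names.
func call(name string, args []string) string {
	return fmt.Sprintf("%s(%s)", name, strings.Join(args, ", "))
}

func main() {
	fmt.Println(prototype("int32_t", "add", []param{{1, "int32_t"}, {2, "int32_t"}}))
	fmt.Println(call("add", []string{"_v5", "3"}))
}
```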

All in all, fairly simple.

Tuesday 10 November 2015

Compiler 3 Part 6 - Tagging

This step is crucial to Calc’s code generator but may not exist at all in other compilers. Regardless, it makes code generation for Calc dead simple. Before I can get into the process, you do need a little background information.

In the introduction I mentioned something about 3AC and SSA. Make sure you check out the articles but the cut and dry is thus:

In SSA, each calculation is given a unique ID. These IDs replace variable names and other objects.

So, what does this mean to us?

Consider the following two example infix arithmetic operations:


  1. a + b + c + d 
  2. a * (b + c) / d 

These two operations could be translated into something like the following:

Example 1: a + b + c + d
r1 := a + b
r2 := r1 + c
r3 := r2 + d


Example 2: a * (b + c) / d
r1 := b + c
r2 := a * r1
r3 := r2 / d


As you can see, the result of each binary operation is assigned to a new, unique variable. While verbose, it is much more akin to assembly and easy for C compilers to optimize.

It also ensures that calculations are done in the correct order and removes the necessity of pushing and popping operands to the stack.
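The translation above can be sketched as a depth-first walk that emits one instruction per binary operation. The expr and lowerer types below are illustrative only and are not Calc's IR:

```go
package main

import "fmt"

// expr is a tiny expression tree: a leaf has a name, an inner node has an operator.
type expr struct {
	name        string // set for leaves such as "a"
	op          string // set for binary nodes such as "+"
	left, right *expr
}

// lowerer collects one instruction per binary operation.
type lowerer struct {
	next  int
	instr []string
}

// lower walks the tree depth first and returns the name holding the node's value.
func (l *lowerer) lower(e *expr) string {
	if e.op == "" {
		return e.name
	}
	lhs := l.lower(e.left)
	rhs := l.lower(e.right)
	l.next++
	dst := fmt.Sprintf("r%d", l.next)
	l.instr = append(l.instr, fmt.Sprintf("%s := %s %s %s", dst, lhs, e.op, rhs))
	return dst
}

func main() {
	leaf := func(n string) *expr { return &expr{name: n} }
	// a * (b + c) / d
	sum := &expr{op: "+", left: leaf("b"), right: leaf("c")}
	mul := &expr{op: "*", left: leaf("a"), right: sum}
	div := &expr{op: "/", left: mul, right: leaf("d")}

	var l lowerer
	l.lower(div)
	for _, s := range l.instr {
		fmt.Println(s)
	}
}
```

Because operands are lowered before the operation that uses them, the instructions come out in evaluation order without any explicit stack.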

Armed with this knowledge, we can now get on with what I call tagging.

Tagging is the process by which we attach these unique identifiers to each operation. Variables, parameters, binary operations and unary operations all need to be tagged.

Oddly, perhaps, even if statements need to get tagged and you may wonder why that is. Well, if statements in Calc are not statements. They’re expressions. Like if expressions in functional languages, if expressions in Calc return a value.

As part of escape analysis and type checking, the value of any branch in an if expression must be checked and tagged since its result may be used elsewhere in the code.

As you can see, folding constants can potentially save us time by reducing the number of operations needing an ID.

We can traverse the tree in any manner we choose provided that every ID is unique.
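A minimal tagging pass might look like the following. The node and tagger types are hypothetical, and the choice to leave constants untagged (since raw values can be emitted directly) is an assumption:

```go
package main

import "fmt"

// node is a simplified IR node; kind is one of
// "const", "var", "unary", "binary" or "if".
type node struct {
	kind string
	id   int // 0 means untagged
	kids []*node
}

// tagger hands out IDs from a single counter so no two nodes share one.
type tagger struct{ next int }

func (t *tagger) tag(n *node) {
	if n.id != 0 {
		return // already tagged, e.g. a variable referenced more than once
	}
	for _, k := range n.kids {
		t.tag(k)
	}
	if n.kind == "const" {
		return // assumption: raw values need no ID
	}
	t.next++
	n.id = t.next
}

func main() {
	// A hypothetical if expression: if a < 1 then a else -a
	a := &node{kind: "var"}
	one := &node{kind: "const"}
	cond := &node{kind: "binary", kids: []*node{a, one}}
	neg := &node{kind: "unary", kids: []*node{a}}
	ifExpr := &node{kind: "if", kids: []*node{cond, a, neg}}

	var t tagger
	t.tag(ifExpr)
	fmt.Println(a.id, cond.id, neg.id, ifExpr.id, one.id)
}
```

The single counter is what guarantees uniqueness; the order in which nodes are visited only changes which number each one receives.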

One more stop left to go. Code generation!

Monday 9 November 2015

Compiler 3 Part 5 - Constant Folding

The only optimization included in series 2 was constant folding. Unlike in the previous series, we’re now much better equipped to handle this optimization.

I feel that I should point out that most C compilers do this step, too. I was somewhat reluctant to keep it in at first. However, my reasoning for keeping it is two-fold: one, I think learning a bit about optimizations on an IR is a worthy lesson; and two, it can help make the next step a little quicker.

Once again, the IR comes to the rescue! In the first step of transforming the AST into the IR we created constants. These were objects representing actual values. This crucial step makes it much easier on us now.

Unary and binary operations are ideal candidates for folding. We have already done the work of verifying types and correctness so we can ignore all that now. When we check if a value is a certain type we can be confident that information is correct.

Looking at the code you can see that it’s pretty simple. Traverse the tree depth first.  If both operands of a binary object, or the single operand of a unary object, are values of the correct type we can fold them together. We then return the result as a new constant value and replace the previous object with it.

Moving back up the tree we repeat the process until we exhaust all the foldable values.
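The folding walk described above can be sketched as follows. The node types here are simplified stand-ins rather than Calc's IR, only integer operations are shown, and leaving a division by zero unfolded is an assumption:

```go
package main

import "fmt"

type node interface{}

type constant struct{ value int }
type variable struct{ name string }
type unary struct {
	op      string
	operand node
}
type binary struct {
	op          string
	left, right node
}

// fold walks depth first and replaces any operation whose operands are all
// constants with a single constant holding the result.
func fold(n node) node {
	switch n := n.(type) {
	case *unary:
		n.operand = fold(n.operand)
		if c, ok := n.operand.(*constant); ok && n.op == "-" {
			return &constant{-c.value}
		}
		return n
	case *binary:
		n.left, n.right = fold(n.left), fold(n.right)
		l, lok := n.left.(*constant)
		r, rok := n.right.(*constant)
		if !lok || !rok {
			return n
		}
		switch n.op {
		case "+":
			return &constant{l.value + r.value}
		case "-":
			return &constant{l.value - r.value}
		case "*":
			return &constant{l.value * r.value}
		case "/":
			if r.value != 0 { // leave division by zero for a later error
				return &constant{l.value / r.value}
			}
		}
		return n
	}
	return n
}

func main() {
	// (2 + 3) * x: only the left operand can be folded.
	tree := &binary{"*", &binary{"+", &constant{2}, &constant{3}}, &variable{"x"}}
	folded := fold(tree).(*binary)
	fmt.Println(folded.left.(*constant).value, folded.op, folded.right.(*variable).name)
}
```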

You can see the value in converting basic literals into constants when building the initial IR tree. It makes folding constants together much, much simpler.

Onward and upward!

Sunday 8 November 2015

Compiler 3 Part 4 - Type Checking

Whether you choose to produce assembly or an intermediate language, the job of the IR is to take us ever closer to generating actual output. We need to do the crucial step of type checking. Since Calc’s type system is strong and static it is necessary to ensure type safety before moving on.

Each struct in the IR tree has an underlying object. This object provides a Pos() function that maps the remaining constructs back into the original source code. This will be useful to us when we are doing type checking so we can provide well formed error messages.

We must traverse the tree and verify that expected types match. Not all objects currently have a type, like variable declarations using type inference. This means that we don’t actually know the type of the variable when parsing the source language since the type isn't explicitly declared but inferred from the object being assigned to the variable. 

Similarly, binary expressions do not have a defined type when parsing. Your language may allow adding (concatenating) strings, in which case an add binary operation on strings would have the type string. In the case of Calc, logical comparison operations make a binary expression type bool. Don't forget that most languages have different integer types so your binary expressions will need to take on and verify these types.

This requires we traverse the tree depth first to identify the types of certain structures prior to verifying their correctness.
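As an illustration, a depth-first pass can compute the type of a binary expression from its operands before checking it. The types and operator rules below are simplified assumptions and not Calc's actual rules:

```go
package main

import (
	"errors"
	"fmt"
)

type typ int

const (
	tInt typ = iota + 1
	tBool
)

// expr is a leaf with a known type, or a binary operation whose type
// has to be worked out from its operands.
type expr struct {
	op          string // empty for a leaf
	typ         typ    // known up front for leaves only
	left, right *expr
}

var errMismatch = errors.New("type mismatch")

// typeOf resolves operand types first, then derives and verifies the
// type of the operation itself.
func typeOf(e *expr) (typ, error) {
	if e.op == "" {
		return e.typ, nil
	}
	lt, err := typeOf(e.left)
	if err != nil {
		return 0, err
	}
	rt, err := typeOf(e.right)
	if err != nil {
		return 0, err
	}
	if lt != rt {
		return 0, errMismatch
	}
	switch e.op {
	case "==", "!=", "<", "<=", ">", ">=":
		return tBool, nil // comparisons always produce a bool
	case "+", "-", "*", "/", "%":
		if lt != tInt {
			return 0, errMismatch
		}
		return tInt, nil
	}
	return 0, fmt.Errorf("unknown operator %q", e.op)
}

func main() {
	one := &expr{typ: tInt}
	yes := &expr{typ: tBool}
	sum := &expr{op: "+", left: one, right: one}
	t, err := typeOf(&expr{op: "<", left: sum, right: one})
	fmt.Println(t == tBool, err)
	_, err = typeOf(&expr{op: "+", left: one, right: yes})
	fmt.Println(err)
}
```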

In addition to verifying type correctness, the type-checker also does some additional tests…
  • checks that function parameters match the number of arguments in function calls;
  • verifies that an identifier matches the right kind (variable vs. function names);
  • checks for undeclared variable and function names.

While not true of the current version of Calc, many languages have varying bit sizes of integer types. Somewhere during this process you will need to verify that a particular value fits within the bounds of a particular type. If the source has a value greater than 255 being stored into an unsigned 8 bit type, this is a problem. When and how you choose to handle this error is up to you.
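A range check of this kind could be sketched like so. The fits helper is hypothetical and assumes bit sizes of 8, 16 or 32:

```go
package main

import "fmt"

// fits reports whether v can be stored in an integer type of the given
// width. bits is assumed to be 8, 16 or 32.
func fits(v int64, bits uint, signed bool) bool {
	if signed {
		lo := -(int64(1) << (bits - 1))
		hi := int64(1)<<(bits-1) - 1
		return v >= lo && v <= hi
	}
	return v >= 0 && v <= int64(1)<<bits-1
}

func main() {
	fmt.Println(fits(255, 8, false)) // largest unsigned 8 bit value
	fmt.Println(fits(256, 8, false)) // one too many
}
```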

Type checking may also involve a level of escape analysis. You need to verify that any value being returned anywhere in the tree conforms to the parent type. In imperative languages, this means looking for statements like “return”. For functional languages, you need to search for terminal points like the "then" and "else" clauses in an "if" expression.
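For an expression-based language, the search for terminal points can be sketched as a recursive check over nested if expressions. The node type and checkResult helper are hypothetical, and every if is assumed to have both branches:

```go
package main

import "fmt"

// node is either a plain value with a type, or an if expression whose
// then and else branches are themselves nodes.
type node struct {
	typ       string
	then, els *node
}

// checkResult verifies that every terminal point under n has type want.
func checkResult(n *node, want string) error {
	if n.then != nil { // an if expression: check both branches
		if err := checkResult(n.then, want); err != nil {
			return err
		}
		return checkResult(n.els, want)
	}
	if n.typ != want {
		return fmt.Errorf("result has type %s, want %s", n.typ, want)
	}
	return nil
}

func main() {
	inner := &node{then: &node{typ: "int"}, els: &node{typ: "bool"}}
	outer := &node{then: &node{typ: "int"}, els: inner}
	fmt.Println(checkResult(outer, "int"))
}
```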

Once we are sure that everything is type safe and correct, we can move on.

Saturday 7 November 2015

Compiler 3 Part 3 - Intermediate Representation

After presenting Calc2 I came to discover that many parts of the compiler were quite fragile. The type checking was particularly bad, occasionally reporting type errors (false positives) when run over the exact same source code multiple times. 

That’s a serious problem.

I had assumed that using an intermediate representation was strictly for doing optimizations. While it certainly helps with, and may be crucial for, optimization, it can be useful for much, much more.

The IR used by Calc is nothing special. In fact, it looks very much like the AST and the objects it uses share many of the same names. So what gives?

If you look more closely, there are some crucial differences. For one, each object is represented a little differently. Second, there is much more in common between each object. In fact, there is a struct embedded into every one: object.

This struct gives me access to almost everything needed to perform various tasks, including: type checking, error reporting and code generation.
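Struct embedding is what makes this work in Go. The sketch below shows the idea; the fields of object, the position type and the Node interface are assumptions, and only the Pos() method is mentioned in this series:

```go
package main

import "fmt"

// position is a hypothetical source location.
type position struct{ line, col int }

// object holds what every IR node has in common.
type object struct {
	id  int
	pos position
	typ string
}

func (o *object) Pos() position { return o.pos }
func (o *object) Type() string  { return o.typ }

// Node is satisfied by any struct that embeds object.
type Node interface {
	Pos() position
	Type() string
}

type Constant struct {
	object
	Value int
}

type Binary struct {
	object
	Op       string
	Lhs, Rhs Node
}

func main() {
	lhs := &Constant{object{pos: position{3, 5}, typ: "int"}, 1}
	rhs := &Constant{object{pos: position{3, 9}, typ: "bool"}, 0}
	var n Node = &Binary{object{id: 1, pos: position{3, 7}}, "+", lhs, rhs}

	// Error reporting only needs the promoted methods.
	b := n.(*Binary)
	if b.Lhs.Type() != b.Rhs.Type() {
		p := b.Pos()
		fmt.Printf("%d:%d: mismatched types %s and %s\n",
			p.line, p.col, b.Lhs.Type(), b.Rhs.Type())
	}
}
```

The methods defined on object are promoted to every struct that embeds it, so type checking, error reporting and code generation can all go through one small interface.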

While the AST and IR may both be trees, and represent much of the same data, they represent two very distinct things. The AST, as the name implies, represents the actual syntax of the language. Using it, you can somewhat faithfully recreate the original source code. The IR, on the other hand, sits much closer to code generation and thereby represents the code we wish to output.

I look at the IR like this: it acts as a bridge between (intermediate) the syntax and code (representation) generation.

The first step is transforming the AST into IR. This is done with a simple tree walking algorithm, converting each node into the new form. Typically, the compiler starts with MakePackage. From there, you can follow the calls along to see how the tree is built.

There are a few things worth noting.

First, notice that the parameters of functions get separated into objects stored in a function's scope and only a string slice of parameter names is stored with the declaration. This will become important in the next few steps.

Second, we can also begin a simple optimization of converting literals, like number and boolean strings, into actual values called constants. This saves us the step of doing these conversions later on and provides an opportunity to do constant folding in a much more efficient manner.
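Converting literal text into constants could look roughly like this. The makeConstant helper is hypothetical and handles only decimal integers and the two boolean literals:

```go
package main

import (
	"fmt"
	"strconv"
)

// makeConstant converts the text of a literal into a Go value:
// a bool for "true" or "false", otherwise an int.
func makeConstant(lit string) (interface{}, error) {
	switch lit {
	case "true":
		return true, nil
	case "false":
		return false, nil
	}
	n, err := strconv.Atoi(lit)
	if err != nil {
		return nil, fmt.Errorf("invalid literal %q: %v", lit, err)
	}
	return n, nil
}

func main() {
	for _, lit := range []string{"42", "true", "4x2"} {
		v, err := makeConstant(lit)
		fmt.Println(v, err)
	}
}
```

Doing the conversion once, while building the IR, means later passes can work with real values instead of re-parsing strings.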

Once we’re done converting the AST into the IR tree, the real work can begin. Stay tuned!

Friday 6 November 2015

Compiler 3 Part 2 - C Runtime and Code Generation

Before diving into the real meat and potatoes of the changes that led to the runtime being (temporarily) removed, I thought I should discuss the motivations.

The runtime irked me greatly because I felt that I did not leverage the power of C nor the C compilers. There were many, many improvements needing to be made and as I began working on them I came to a realization.

The runtime, at least in its current form, is not very useful.

Unlike in a language like Go where the runtime serves a very real purpose (managing goroutines, the garbage collector, etc) the Calc C runtime merely tried to mimic what it’s like to program in assembly.

Of course, there are reasons to re-add a runtime further down the road. Runtime checks, like bounds checking, are one such reason. Stack tracing is another wonderful motivation to implement a runtime. Built-in functions are also common and useful.

I still maintain this was a worthy exercise, but long-term it just wasn't viable. The more I looked at the runtime the more I knew I could do better.

The largest hurdle was removing the stack. How was I to handle the call stack? As it turns out, it’s not too hard to use C-style function calls but it did require some changes that the series 2 code base just wasn't equipped to handle elegantly.

I also needed to push and pop data on and off the stack. So how could I rid myself of it?

I have been researching how to use C as an intermediate representation for many months. Generating C function calls and declarations proved difficult with the Calc2 compiler, and that was one of the motivations for using an assembly-like approach.

Ultimately, it became more and more clear that I was missing a step that would make my life a lot easier.

That secret, if you will, was intermediate representation.

After much research I ran into some potential solutions: three address code and static single assignment. I have been aware of these languages and others like them (Gimple and RTL) for a while.

Using them seemed complicated and I didn't think they were worth the effort of learning about. I wasn't writing an optimizing compiler, so what advantages did they provide that I would need?

While not using any of these representations directly I do take advantage of what they provide. More on that next post.

Wednesday 4 November 2015

Compiler 3 Part 1 - Introduction

I return to you once again with continued improvements and changes to the Calc programming language compiler. In the intervening months (almost a year!) since I completed the last series I have been busy with many other projects. However, I keep coming back to the compiler and in the last few months have been busy tinkering away.


I have taken a different route in this next iteration of the series and you could consider it a partial rewrite of the previous one. There are several knobbly sections in series two that I wasn't particularly happy with and in this third series I hope to address most, if not all, of them.


I also intend to make this third, and any potential future series, much shorter and narrower in scope. Please feel free to ask me questions as I will be glossing over some of the finer points.


So what has changed?


Well, the language itself has barely changed. The only new language feature is the boolean type. So, nothing too exciting there.


What did change was the entire back-end.


In series two, the code generator operated directly on the AST. It proved to be cumbersome and error prone. The compiler now uses an intermediate representation for the code generator that reduced the LOC from 544 to 255. Less than half the LOC while producing better, more easily optimized code.


That does not even include the removal of the runtime library!


Yes, the entire runtime has been removed which included many hundreds of lines of code being deleted. The Calc language went on a diet!


The net gain is a much better compiler and I want to share those efforts with you all.

Stay tuned!

Series 1 - Novice Compiler Design
Series 2 - Apprentice Compiler Design