Monday, 15 December 2014

Compiler 2 Part 8: Parser

After an annoyingly long time, I am finally getting the next installment of this series out.

The parser hasn't so much as changed as grown. If the abstract syntax tree is any indication, there is a lot of work that needs to be done in the parser!

Before we get to that though, I thought I’d take a quick moment to talk a little bit about the parser in general. I pointed out a few questions in Part 1 of the Calc 1 series I didn't understand when I first started to learn about compilers. I’ll try and answer a couple of them now.

Parsing


In the previous series, I dove right in to creating the parser but I didn't really explain much about parser technology.

To be blunt, I don’t think I'm the right person to teach you. I'm not an authority on parsing nor parser technology.

What I can do is point you in what I hope is the right direction and explain a little about this particular parser.

Top-Down, Predictive, Recursive Descent


The Calc parser is a top-down parser. It tries to figure out the overall structure prior to breaking down the smaller parts. In parsing a single file it resolves the file object, then broadly looks at global expressions like function declarations and so on until it gets down the smallest parts, usually a variable or basic literal.

It is also a recursive descent parser, which is a sub-type of a parser family called LL parsers. As the name suggests, it recursively descends from the top of the grammar tree down to the bottom. The parseGenExpr method is called recursively to determine the different types of expressions in the Calc language.

More specifically, it is a predictive recursive descent parser. I believe you would call it an infinite look-ahead, LL(*), parser, By virtue of how function parameters are parsed, the parser must continually look ahead an infinite number of tokens to discover the type of each parameter.

You don’t need to understand any of this to write a parser but I've always prescribed to the school of knowing WHY something works rather than just HOW it does it.

Let’s move on to look at the specific changes that happened in Calc 2. I'm going to go quite quickly and only focus on changes which may be non-obvious.

Parsing Files


Rather than take the source code of the file and parse it, ParseFile now takes a pointer to a token.FileSet. It now must go through the work of opening the file, reading the source, parsing it and adding the file to the fileset.

Parsing Directories


ParseDir fulfils the multiple source file requirement. It opens the supplied path, reads all the files in the directory and attempts to parse each file in turn.

In this incarnation, the calls to ParseFile happen in a linear fashion but is a perfect spot to use some of Go’s concurrency primitives to speed up parsing on a multi-core system. I chose not to implement it in Calc 2.

Parser Struct


A couple new fields have been added to the parser object. The listok field is used to determine if an ExpressionList is allowed when parsing expressions. More on that a bit later. The other to fields hold the current and top Scopes.

The curScope field is the most important to us. It tracks the current, active scope as the parser transitions between different expressions.

Initialization and Errors


Both addError and init were updated to accommodate the changes made to error facilities and new parser fields, respectively.

Scope


openScope creates a new scope using the current scope as the parent for the new scope.

Conversely, closeScope reverts back to the previous scope.

Assignment


Unfortunately, nothing too exciting is going on in parseAssignExpr, either. By virtue of look-ahead, we know to expect an ASSIGN token. The expect method is used to get the position to ensure nothing funny happened and to advance the scanner in one step.

Our grammar then tells us we expect an identifier so that’s what we attempt to parse. From there, we can receive any kind of expression from a basic literal to a binary expression or even a function call.

Finally, we expect to find a closing parenthesis and finish off the expression.

Basic Literals


A minor change was made to ensure the parser is always advanced after parsing the literal.

Calling Functions


Again, another small method. The first thing we expect to find is the name of the function being called so we parse an identifier.

This should be followed by a list of arguments for the call and capped off with a right parenthesis.

Note that types are not checked here nor are the number of arguments to the parameters in the function being called. That’s because the function may not be declared yet.

Calc 2 does not state that a function must be declared in lexical order as it would in C. Calc, like Go, let’s you declare a function anywhere in the source code and can therefore be called prior to it’s declaration.

Function Declarations


The fun really begins here.

At first, parseDeclExpr is pretty straight forward up until we hit the call to openScope. This scope is the scope of the function itself. The parameters of the function are unique to the function call and can’t be accessed outside of it. Same is true of any variables declared within the function body.

Once open, we parse the parameter list, adding any parameters found into the newly opened scope. The code for this will be looked at a little later.

Next, we require the return type.

Last, the body of the function is parsed via a call to a method called tryExprOrList, finally closing the whole thing of with another right parenthesis.

tryExprOrList will be discussed later but the purpose of this function is to determine if the body of the function is a simple expression or a list of expressions.

The object for the declaration is created as normal but you can see that an ast.Object is also created, taking the declaration as an argument.

After the function’s scope is closed, the object pointing to the function declaration is inserted into the current scope.

Expressions


The listok field of the parser comes into play here. If we’re allowed to parse an expression list we attempt to do so. If not, we continue on as normal.

The other new expressions that we need to parse have been added and their respective methods are called.

Expression List


parseExprList merely tries to parse a list of expressions and appends it to the list.

GenExpr


Very little changed here. Parsing an identifier as a basic expression has been added and you’ll note that listok has been assigned as false. That’s to ensure we don’t keep trying to parse a list.

File


The only thing that really changed here is that the ast.File now only has a Scope rather than an Expr field. A file must have at least one declaration in it. In the compiler stage, we’ll ensure that it’s a declaration called “main”.

Parameter Lists


If you recall in parseDeclExpr I said I’d come back to parsing the parameters. Well, here we are!

Each parameter needs a type. However, a type is the last thing in the list to be listed. So, we need to keep parsing identifiers until either a comma, signifying a new set of parameters with a different type will be declared, or a closing paren is found. After either has been found, we need to retroactively assign the final identifier, presumably the type, to the other identifiers.

Each parameter is then inserted into the function’s scope. While the function has a type it does not have a value. That will be taken care of when the function is called.

Unary Expressions


A unary expression is pretty quick and dirty. We take the operator, which we know to be a negation sign, and the value to be negated and wrap it up into a neat little package.

Variable Expressions


Parsing a var expression is actually a bit tricky. That’s because some of the elements are optional and some depend on what elements we've received.

If the token is an IDENT we parse an identifier. Easy. However, if the token is an LPAREN then we have an expression but a var expression may only take an AssignExpr as it’s first element. So, we forcibly parse the expression and if another type of expression has been used an error is generated.

If no value was parsed via an assignment then we must find an identifier with a type name. The type is optional if there is an assignment expression because the type can be inferred.

Last, we expect to find a closing paren.

When we insert the identifier into the current scope, we need to make sure it hasn't already been declared and generate an error if it has.

Expressions or Expression Lists?


This little method is really just a utility function. It’s pretty self-explanatory.

The reason I separated it out is to make it clear that we can have either an expression or a list. I found it helpful.

Done


And that’s about it!

Now, on to assembly! Wait...what?