Compilers

Pyrrhuloxia

Source: SearchNet Media from Tucson, Arizona, USA [CC BY 2.0](https://creativecommons.org/licenses/by/2.0)

Due on Friday, May 3rd at 11:59 PM. This is a compiler lab. If you have a partner, the two of you will complete this lab as a team. If you do not have a partner, you are effectively in a team by yourself.

If you are working with a partner, please be familiar with the Partner Etiquette guidelines. You and your partner share a single repository and, barring unusual circumstances, will receive the same grade. If you experience any difficulties in your partnership, please alert your instructor as soon as possible.

If you are working alone, please don’t hesitate to seek help if you get stuck. Since there are no ninjas for this course and you don’t have a partner, your instructor is the only interactive resource available to you and is happy to help you through any problems you might encounter.

In either case, be familiar with the academic integrity policy! You are permitted to discuss high-level details of your compiler project with other students, but you should never see or share anything that will be turned in for a grade. If you need help, please post on the Piazza forum or, if necessary, contact the instructor directly. Make sure to post privately if you are sharing code. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.

Overview

In each of the previous labs, the parser – the tool which translates text files to ASTs – has been provided for you. In this lab, you will develop a compiler called Pyrrhuloxia. The Pyrrhuloxia compiler is functionally identical to a Cardinal compiler: it has the same behavior and builds the same programs. The only difference is that Pyrrhuloxia will use a parser you have implemented rather than the one provided with the starter code.

Transitioning to Pyrrhuloxia

As usual, you will be switching to a branch named after the compiler and then merging your previous work on your compiler project. However, the compiler you merge into this branch is your choice. Every compiler following Cardinal supports the behavior of Cardinal; that is, every program that the Cardinal compiler can compile can also be compiled in any of our later compilers.

So you may pick any of those branches – cardinal, hoopoe, or anything in between – and merge it into pyrrhuloxia. You’ll likely want to pick a branch which you expect you can test efficiently. You should pick the latest such branch; only pick cardinal if you don’t believe that testing a later lab will work well. Here’s an example of a transition from Hoopoe to Pyrrhuloxia:

git fetch
git checkout pyrrhuloxia
git merge hoopoe

This merge will introduce some new files in src/language/:

LLLexer.ml: a file in which you will write a lexer
LLParser.ml: a file in which you will write a parser
LLParserTool.ml: a file which ties the lexer and parser together to create a general parsing tool

By default, these files do not have any effect on your compiler; they are not referenced anywhere. After the merge, you should be able to make your compiler and run your unit tests. These tests are running using the old parser.

Using the New Parser Code

To develop and test your own parser, you’ll need to modify the compiler to use it. The relevant code is in src/builder.ml in the build function definition. Find the lines

let ast =
  try
    ParserTool.parse filename contents
  with
  | ParserTool.ParseFailure msg -> raise (BuildFailure msg)
in

and change them to

let ast =
  try
    LLParserTool.parse filename contents
  with
  | LLParserTool.ParseFailure msg -> raise (BuildFailure msg)
in

replacing the occurrences of ParserTool with LLParserTool.

If you merged cardinal into your pyrrhuloxia branch (as opposed to a later lab), you’ll also need to change LLParser.ml so that the parse function at the bottom returns an expr rather than a program. You may also need to delete the original parser code (the lexer.mll, parser.mly, and parserTool.ml files) from your source tree.

Once you have finished, your compiler uses the new parser modules… which don’t work yet. As a result, most of your tests will all fail and every file you attempt to compile raises a parse error. This is good! You’re ready to begin developing your Pyrrhuloxia parser.

The Starter Code

The starter code for this lab provides you with a working parser for the simple grammar described below:

<expression> ::=
  | <integer>
  | (<expression>)
  | <expression> + <expression>

Note, however, that this grammar is left-recursive: the production <expression> + <expression> has, as its leftmost term, the same rule as the rule it is defining. The starter code defines a top-down parser, which cannot handle left-recursive grammars, so it instead parses this equivalent grammar:

<expression> ::=
  | <additiveExpression>

<additiveExpression> ::=
  | <primaryExpression> + <primaryExpression> ...
  | <primaryExpression>

<primaryExpression> ::=
  | <integer>
  | (<expression>)

The latter grammar is produced by removing left recursion from the former. This is always possible with context-free grammars, which are the sort of grammars we have been using in this class. Note that the ellipses (...) in the above grammar indicate a sequence of similar operations; 4 + 5 + (6) is a valid <additiveExpression>.

Understanding the Starter Code

At this point, you should read over the code in LLLexer.ml, LLParser.ml, and LLParserTool.ml. (You will, after all, have to modify this code pretty heavily to complete the lab.) Let’s begin by considering LLLexer.ml.

`LLLexer.ml`

The lex function in LLLexer.ml is the entry point for the lexer: it takes a string and transforms it into a token list. If lexing fails, a LexerError is raised instead. (The LexerError exception and the token type are both defined at the top of the file.) Other than these definitions, the file contains:

A helper routine discard_whitespace which removes whitespace from the start of the input.
Mini-tokenizer routines named e.g. tokenize_int. These routines each know how to recognize exactly one kind of token from the input; they’re no smarter than that. If one of these routines attempts to tokenize an input without that type of token at the front, it throws a LexerError.
A general routine tokenize_first_of which takes a list of mini-tokenizers and combines them into larger tokenizer. The result is a tokenizer which tries each of the smaller routines in order, using the results of the first one which is successful. This routine will catch the LexerErrors from the smaller routines and only throw a LexerError of its own if none of the smaller routines succeed.
A function tokenize which recognizes the first token of a given input, returning that token and the rest of the input. The tokenize function throws a LexerError if the input has no recognizable token on it. tokenize is implemented by combining tokenize_first_of with the mini-tokenizers elsewhere in the file.

Note that this lexer, just like most of the rest of our code, still does not use mutable variables! The input is represented as a char list and we can use List library functions to operate on it. Implementation tips are provided later in this write-up.

`LLParser.ml`

The parse function at the end of LLParser.ml is the entry point for our parser. Overall, the parser works much like the lexer does: it defines a series of mini-parsers (such as parse_int) which know how to recognize exactly one kind of AST and then combines them together using a function named parse_first_of. Each function takes a token list and either returns an expr (if it can recognize the expression in the input tokens) or throws a ParserError (if it doesn’t know how to handle the input). The parse_first_of function will try each of a series of smaller parsers in turn, keeping the result of the first one which succeeds.

One key difference in the parser is that all of these parsing functions are defined mutually recursively. Examining the grammar above, we see that the <expression> grammar is defined in terms of <additiveExpression>, which is defined in terms of <primaryExpression>, which is defined in terms of <expression>. Since the grammar is recursive in this way, our top-down parser will be as well. We use OCaml’s let rec ... and ... syntax to handle this. For instance, consider the following:

let rec f n =
  if n = 0 then [] else n :: g (n-1)

and g n =
  if n = 0 then [] else f (n - 1)

;;

If we call f 8 using the definitions above, we will get the list [8;6;4;2]. The f function calls the g function with odd numbers while g calls f with even numbers; f records the numbers it sees in its output while g does not. The body of f refers to g which, without the and keyword, would be unacceptable (since g appears after f). The and keyword allows both f and g to be defined simultaneously while the rec keyword allows them to refer to each other (and themselves).

The code in parse_additive_expr deserves some special attention. This code parses a single <primaryExpression> and then attempts to process as many sequences of + <primaryExpression> as it can. It then uses that sequence to build up a tree of binary additions (via fold_left). This sort of repetition is likely the most complex work you’ll do in this assignment, so make sure to understand this pattern before making significant changes to the code.

The Pyrrhuloxia Parser

In this assignment, we are writing a top-down parser to be compatible with Cardinal. Let’s take a moment to recall the Cardinal grammar. In comparison to more recent labs, Cardinal has no functions, declarations, tuples, or calls. The full Cardinal grammar appears below:

<expression> ::=
  | true
  | false
  | <integer>
  | after(<expression>)
  | before(<expression>)
  | print(<expression>)
  | isbool(<expression>)
  | isint(<expression>)
  | <expression> + <expression>
  | <expression> - <expression>
  | <expression> * <expression>
  | <expression> < <expression>
  | <expression> > <expression>
  | <expression> = <expression>
  | <expression> && <expression>
  | <expression> || <expression>
  | <identifier>
  | let <identifier> = <expression> in <expression>
  | ifnz <expression> then <expression> else <expression>
  | if <expression> then <expression> else <expression>
  | (<expression>)

As mentioned above, we will not be able to work with a left recursive grammar using a top-down approach. Instead, we will write a parser for the following (generally equivalent) grammar:

<expression> ::=
  | let <identifier> = <expression> in <expression>
  | if <expression> then <expression> else <expression>
  | ifnz <expression> then <expression> else <expression>
  | <orExpression>

<orExpression> ::=
  | <andExpression> || <andExpression> ...
  | <andExpression>

<andExpression> ::=
  | <comparisonExpression> && <comparisonExpression> ...
  | <comparisonExpression>

<comparisonExpression> ::=
  | <additiveExpression> <comparisonOperator> <additiveExpression> ...
  | <additiveExpression>

<comparisonOperator> ::=
  | <
  | >
  | =

<additiveExpression> ::=
  | <multiplicativeExpression> <additiveOperator> <multiplicativeExpression> ...
  | <multiplicativeExpression>

<additiveOperator> ::=
  | +
  | -

<multiplicativeExpression> ::=
  | <primaryExpression> * <primaryExpression> ...
  | <primaryExpression>

<primaryExpression> ::=
  | true
  | false
  | <integer>
  | - <integer>
  | <identifier>
  | after(<expression>)
  | before(<expression>)
  | print(<expression>)
  | isint(<expression>)
  | isbool(<expression>)
  | (<expression>)

The above grammar accepts the same programs but has been adjusted to avoid left recursion. Again: the ellipses (...) in the above grammar indicate a sequence of similar operations.

Pyrrhuloxia Tokens

In Pyrrhuloxia, we will adopt the same tokenization standards as Cardinal.

<integer> is a sequence of digit characters, optionally preceeded by a minus sign.
<identifier> starts with either an alphanumeric character or an underscore. Each remaining character of an <identifier> is either alphanumeric, a digit, or an underscore.

Implementing Your Parser

The provided starter code is yours to modify as you see fit; the layout and organization of your parser is up to you. You can use whatever approach or design you like as long as your submission is your own code. You are not permitted to use a parser generator to complete this assignment (although you are permitted to write your own parser combinators if you like).

Much of your parser code is likely to be quite similar to the starter code. It might take a little work for you to develop a comfortable pattern while writing your parser, but it’s also important to use your time effectively. To avoid time-costly trial and error or reinvention of the metaphorical wheel, here are a few tips:

Since we are tokenizing a char list and parsing a token list, please be mindful that the Batteries Included library includes many useful routines in the List module.
The Batteries Included library includes a fairly complete String module which you can use to manipulate the input text. You aren’t likely to need these functions often, but they’ll be helpful to have if the need arises. In particular, the following expressions might be helpful:
- String.to_list some_string: converts some_string into a char list
- String.of_list some_char_list: converts some_char_list into a string
- String.starts_with s x: determines if string s starts with string x
- String.get s i: gets the character of string s at index i
- String.lchop s: returns the string s without the first character
- String.left s n: returns the n leftmost characters of s as a string
- String.tail s n: returns all except the n leftmost characters of s as a string
For working with individual characters, there is also a Char module. Here are some helpful functions it contains:
- Char.is_whitespace c: determines if a character is whitespace
- Char.is_digit c: determines if a character is a digit
- Char.is_letter c: determines if a character is a letter
Note that this module uses the Latin-1 encoding of characters (which includes the ASCII characters). The UChar module handles Unicode characters, but we don’t need to support Unicode in our language for now.
Make sure to write your own helper functions if the standard libraries don’t contain what you need. Parser code is pretty repetitive, so you’ll get a chance to amortize on your investment! Common helper functions include:
- A function to tokenize an arbitrary keyword from input (which requires that the next character not be an identifier character);
- A function to tokenize an arbitrary symbol from input (which can accept any symbol so long as symbols are parsed in the right order);
- A function which, given a <foo> parser and a <bar> parser, will parse any sequence of <foo> <bar> <foo> ... and return that sequence in a list (for binary operators).
You don’t need to write all (or any) of these routines, but helper functions will save you time in the long run!
Consider writing your parser a little bit at a time. The starter code recognizes parentheticals, addition, and integers. You can, for instance, add the booleans true and false to both the lexer and the parser and then test that they work. You could follow up by adding && expressions or something similar. Don’t try to do everything at once; it’ll be much harder to find errors that way.
Remember that identifiers are allowed to contain letters, digits, and underscores. That is, sub-let is not a single token (because - is not an identifier character) but sub_let is (because it’s a single sequence of letters and underscores). Note that let can be part of a variable name; it’s only when let and similar keywords appear separated from other identifier characters that they are keywords.
Don’t try to handle negative numbers when lexing integers. Our grammar allows the user to write the minus sign as a separate token. This is somewhat peculiar as it means let x = - 4 in x is legal syntax. Allowing this, however, fixes subtle problems which arise in traditional functional language syntax (such as in OCaml or Falcon) and this keeps your Pyrrhuloxia parser consistent with Cardinal.
Whitespace is already handled by the starter code’s lexer (as part of take_tokens in the lex function). This means that each mini-lexer you write may assume that the input does not start with whitespace. Make sure you understand how this is working! Parsers operate on lists of tokens (and we have no whitespace tokens), so your parser code won’t have to worry about this.

Testing Your Parser

Because Pyrrhuloxia is syntax compatible with Cardinal and Cardinal is a sublanguage of every more recent lab, you can use your Cardinal-compatible unit test code to test your new parser. To write parser tests, however, you don’t need to run the programs through the entire compiler pipeline via the helper functions in testUtils.ml (such as test_success). Instead, we can just test the parser portion of the lab.

You can write a function in testUtils.ml which expects the parser to produce a particular result for a particular file. To do so, you’ll want to make use of a few OCaml routines:

The FileUtils module in your starter code has a function read_file which you can use to get the contents of a particular text file as a string. For instance, read_file "test_code/4.bird" will produce the string "4\n". You can use this to obtain the text of a file to give to your LLParserTool module.
Recall from our earliest labs that you can write unit tests using the "test_name" >:: fun _ -> ... syntax. Once you’ve created a test, it must be added to the list given to OUnit to be executed. You can write your own test utility (e.g. test_parse) which takes a filename and an AST and makes sure that the contents of that file parse to the provided AST.
To get good test results, you can use the ~printer and ~cmp optional arguments of the assert_equal function. For instance, you might use assert_equal ~printer:show_program ~cmp:equal_program expected_program my_program to ensure that two program ASTs match. The equal_program function will be used to decide if they are equal and show_prorgam will be used to display them if they are not.
Once you have written a test_parse function, you can invoke it with things like test_parse "test_code/4.run" (EInt 4) to be sure that your parser is working. This does mean that you have to write the AST out by hand, but there’s no avoiding that; after all, your parser is exactly the tool that makes it easier for the programmer to write an AST!

Summary

To complete this lab, you will

Write a Cardinal-compatible lexer for Pyrrhuloxia in LLLexer.ml
Write a Cardinal-compatible LL parser for Pyrrhuloxia in LLParser.ml

Once you’ve done this, you’ve written a compiler from beginning to end! Now that you’re more familiar with OCaml, you might be interested to glance over builder.ml and the other parts of the compiler you haven’t changed much. You’ll see that they contain code to open files and run subprograms (like nasm). These commands are basically identical to those from the Hatchling lab; there’s nothing magical going on there.

Congratulations on developing a complete compiler!

Submitting

To submit your lab, just commit and push your work. The most recent pushed commit as of the due date will be graded. For more information on when assignments are due, see the Policies page.

After each lab, you will complete a brief questionnaire found here. In most cases, the questionnaire should take less than a minute. This questionnaire is required and will be used as part of your participation grade.

If You Have Trouble…

…then please contact your instructor! Piazza is the preferred method, but you can reach out via e-mail as well. Good luck!