Pyrrhuloxia
Due on Friday, May 3rd at 11:59 PM. This is a compiler lab. If you have a partner, the two of you will complete this lab as a team. If you do not have a partner, you are effectively in a team by yourself.
If you are working with a partner, please be familiar with the Partner Etiquette guidelines. You and your partner share a single repository and, barring unusual circumstances, will receive the same grade. If you experience any difficulties in your partnership, please alert your instructor as soon as possible.
If you are working alone, please don’t hesitate to seek help if you get stuck. Since there are no ninjas for this course and you don’t have a partner, your instructor is the only interactive resource available to you and is happy to help you through any problems you might encounter.
In either case, be familiar with the academic integrity policy! You are permitted to discuss high-level details of your compiler project with other students, but you should never see or share anything that will be turned in for a grade. If you need help, please post on the Piazza forum or, if necessary, contact the instructor directly. Make sure to post privately if you are sharing code. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.
Overview
In each of the previous labs, the parser – the tool which translates text files to ASTs – has been provided for you. In this lab, you will develop a compiler called Pyrrhuloxia. The Pyrrhuloxia compiler is functionally identical to a Cardinal compiler: it has the same behavior and builds the same programs. The only difference is that Pyrrhuloxia will use a parser you have implemented rather than the one provided with the starter code.
Transitioning to Pyrrhuloxia
As usual, you will be switching to a branch named after the compiler and then merging your previous work on your compiler project. However, the compiler you merge into this branch is your choice. Every compiler following Cardinal supports the behavior of Cardinal; that is, every program that the Cardinal compiler can compile can also be compiled in any of our later compilers.
So you may pick any of those branches – cardinal
, hoopoe
, or anything in between – and merge it into pyrrhuloxia
. You’ll likely want to pick a branch which you expect you can test efficiently. You should pick the latest such branch; only pick cardinal
if you don’t believe that testing a later lab will work well. Here’s an example of a transition from Hoopoe to Pyrrhuloxia:
git fetch
git checkout pyrrhuloxia
git merge hoopoe
This merge will introduce some new files in src/language/
:
LLLexer.ml
: a file in which you will write a lexerLLParser.ml
: a file in which you will write a parserLLParserTool.ml
: a file which ties the lexer and parser together to create a general parsing tool
By default, these files do not have any effect on your compiler; they are not referenced anywhere. After the merge, you should be able to make
your compiler and run your unit tests. These tests are running using the old parser.
Using the New Parser Code
To develop and test your own parser, you’ll need to modify the compiler to use it. The relevant code is in src/builder.ml
in the build
function definition. Find the lines
let ast =
try
ParserTool.parse filename contents
with
| ParserTool.ParseFailure msg -> raise (BuildFailure msg)
in
and change them to
let ast =
try
LLParserTool.parse filename contents
with
| LLParserTool.ParseFailure msg -> raise (BuildFailure msg)
in
replacing the occurrences of ParserTool
with LLParserTool
.
If you merged cardinal
into your pyrrhuloxia
branch (as opposed to a later lab), you’ll also need to change LLParser.ml
so that the parse
function at the bottom returns an expr
rather than a program
. You may also need to delete the original parser code (the lexer.mll
, parser.mly
, and parserTool.ml
files) from your source tree.
Once you have finished, your compiler uses the new parser modules… which don’t work yet. As a result, most of your tests will all fail and every file you attempt to compile raises a parse error. This is good! You’re ready to begin developing your Pyrrhuloxia parser.
The Starter Code
The starter code for this lab provides you with a working parser for the simple grammar described below:
<expression> ::=
| <integer>
| (<expression>)
| <expression> + <expression>
Note, however, that this grammar is left-recursive: the production <expression> + <expression>
has, as its leftmost term, the same rule as the rule it is defining. The starter code defines a top-down parser, which cannot handle left-recursive grammars, so it instead parses this equivalent grammar:
<expression> ::=
| <additiveExpression>
<additiveExpression> ::=
| <primaryExpression> + <primaryExpression> ...
| <primaryExpression>
<primaryExpression> ::=
| <integer>
| (<expression>)
The latter grammar is produced by removing left recursion from the former. This is always possible with context-free grammars, which are the sort of grammars we have been using in this class. Note that the ellipses (...
) in the above grammar indicate a sequence of similar operations; 4 + 5 + (6)
is a valid <additiveExpression>
.
Understanding the Starter Code
At this point, you should read over the code in LLLexer.ml
, LLParser.ml
, and LLParserTool.ml
. (You will, after all, have to modify this code pretty heavily to complete the lab.) Let’s begin by considering LLLexer.ml
.
LLLexer.ml
The lex
function in LLLexer.ml
is the entry point for the lexer: it takes a string
and transforms it into a token list
. If lexing fails, a LexerError
is raised instead. (The LexerError
exception and the token
type are both defined at the top of the file.) Other than these definitions, the file contains:
- A helper routine
discard_whitespace
which removes whitespace from the start of the input. - Mini-tokenizer routines named e.g.
tokenize_int
. These routines each know how to recognize exactly one kind of token from the input; they’re no smarter than that. If one of these routines attempts to tokenize an input without that type of token at the front, it throws aLexerError
. - A general routine
tokenize_first_of
which takes a list of mini-tokenizers and combines them into larger tokenizer. The result is a tokenizer which tries each of the smaller routines in order, using the results of the first one which is successful. This routine will catch theLexerError
s from the smaller routines and only throw aLexerError
of its own if none of the smaller routines succeed. - A function
tokenize
which recognizes the first token of a given input, returning that token and the rest of the input. Thetokenize
function throws aLexerError
if the input has no recognizable token on it.tokenize
is implemented by combiningtokenize_first_of
with the mini-tokenizers elsewhere in the file.
Note that this lexer, just like most of the rest of our code, still does not use mutable variables! The input is represented as a char list
and we can use List
library functions to operate on it. Implementation tips are provided later in this write-up.
LLParser.ml
The parse
function at the end of LLParser.ml
is the entry point for our parser. Overall, the parser works much like the lexer does: it defines a series of mini-parsers (such as parse_int
) which know how to recognize exactly one kind of AST and then combines them together using a function named parse_first_of
. Each function takes a token list
and either returns an expr
(if it can recognize the expression in the input tokens) or throws a ParserError
(if it doesn’t know how to handle the input). The parse_first_of
function will try each of a series of smaller parsers in turn, keeping the result of the first one which succeeds.
One key difference in the parser is that all of these parsing functions are defined mutually recursively. Examining the grammar above, we see that the <expression>
grammar is defined in terms of <additiveExpression>
, which is defined in terms of <primaryExpression>
, which is defined in terms of <expression>
. Since the grammar is recursive in this way, our top-down parser will be as well. We use OCaml’s let rec ... and ...
syntax to handle this. For instance, consider the following:
let rec f n =
if n = 0 then [] else n :: g (n-1)
and g n =
if n = 0 then [] else f (n - 1)
;;
If we call f 8
using the definitions above, we will get the list [8;6;4;2]
. The f
function calls the g
function with odd numbers while g
calls f
with even numbers; f
records the numbers it sees in its output while g
does not. The body of f
refers to g
which, without the and
keyword, would be unacceptable (since g
appears after f
). The and
keyword allows both f
and g
to be defined simultaneously while the rec
keyword allows them to refer to each other (and themselves).
The code in parse_additive_expr
deserves some special attention. This code parses a single <primaryExpression>
and then attempts to process as many sequences of + <primaryExpression>
as it can. It then uses that sequence to build up a tree of binary additions (via fold_left
). This sort of repetition is likely the most complex work you’ll do in this assignment, so make sure to understand this pattern before making significant changes to the code.
The Pyrrhuloxia Parser
In this assignment, we are writing a top-down parser to be compatible with Cardinal. Let’s take a moment to recall the Cardinal grammar. In comparison to more recent labs, Cardinal has no functions, declarations, tuples, or calls. The full Cardinal grammar appears below:
<expression> ::=
| true
| false
| <integer>
| after(<expression>)
| before(<expression>)
| print(<expression>)
| isbool(<expression>)
| isint(<expression>)
| <expression> + <expression>
| <expression> - <expression>
| <expression> * <expression>
| <expression> < <expression>
| <expression> > <expression>
| <expression> = <expression>
| <expression> && <expression>
| <expression> || <expression>
| <identifier>
| let <identifier> = <expression> in <expression>
| ifnz <expression> then <expression> else <expression>
| if <expression> then <expression> else <expression>
| (<expression>)
As mentioned above, we will not be able to work with a left recursive grammar using a top-down approach. Instead, we will write a parser for the following (generally equivalent) grammar:
<expression> ::=
| let <identifier> = <expression> in <expression>
| if <expression> then <expression> else <expression>
| ifnz <expression> then <expression> else <expression>
| <orExpression>
<orExpression> ::=
| <andExpression> || <andExpression> ...
| <andExpression>
<andExpression> ::=
| <comparisonExpression> && <comparisonExpression> ...
| <comparisonExpression>
<comparisonExpression> ::=
| <additiveExpression> <comparisonOperator> <additiveExpression> ...
| <additiveExpression>
<comparisonOperator> ::=
| <
| >
| =
<additiveExpression> ::=
| <multiplicativeExpression> <additiveOperator> <multiplicativeExpression> ...
| <multiplicativeExpression>
<additiveOperator> ::=
| +
| -
<multiplicativeExpression> ::=
| <primaryExpression> * <primaryExpression> ...
| <primaryExpression>
<primaryExpression> ::=
| true
| false
| <integer>
| - <integer>
| <identifier>
| after(<expression>)
| before(<expression>)
| print(<expression>)
| isint(<expression>)
| isbool(<expression>)
| (<expression>)
The above grammar accepts the same programs but has been adjusted to avoid left recursion. Again: the ellipses (...
) in the above grammar indicate a sequence of similar operations.
Pyrrhuloxia Tokens
In Pyrrhuloxia, we will adopt the same tokenization standards as Cardinal.
<integer>
is a sequence of digit characters, optionally preceeded by a minus sign.<identifier>
starts with either an alphanumeric character or an underscore. Each remaining character of an<identifier>
is either alphanumeric, a digit, or an underscore.
Implementing Your Parser
The provided starter code is yours to modify as you see fit; the layout and organization of your parser is up to you. You can use whatever approach or design you like as long as your submission is your own code. You are not permitted to use a parser generator to complete this assignment (although you are permitted to write your own parser combinators if you like).
Much of your parser code is likely to be quite similar to the starter code. It might take a little work for you to develop a comfortable pattern while writing your parser, but it’s also important to use your time effectively. To avoid time-costly trial and error or reinvention of the metaphorical wheel, here are a few tips:
-
Since we are tokenizing a
char list
and parsing atoken list
, please be mindful that the Batteries Included library includes many useful routines in the List module. - The Batteries Included library includes a fairly complete String module which you can use to manipulate the input text. You aren’t likely to need these functions often, but they’ll be helpful to have if the need arises. In particular, the following expressions might be helpful:
String.to_list some_string
: convertssome_string
into achar list
String.of_list some_char_list
: convertssome_char_list
into astring
String.starts_with s x
: determines if strings
starts with stringx
String.get s i
: gets the character of strings
at indexi
String.lchop s
: returns the strings
without the first characterString.left s n
: returns then
leftmost characters ofs
as a stringString.tail s n
: returns all except then
leftmost characters ofs
as a string
- For working with individual characters, there is also a Char module. Here are some helpful functions it contains:
Char.is_whitespace c
: determines if a character is whitespaceChar.is_digit c
: determines if a character is a digitChar.is_letter c
: determines if a character is a letter
Note that this module uses the Latin-1 encoding of characters (which includes the ASCII characters). The
UChar
module handles Unicode characters, but we don’t need to support Unicode in our language for now. -
Make sure to write your own helper functions if the standard libraries don’t contain what you need. Parser code is pretty repetitive, so you’ll get a chance to amortize on your investment! Common helper functions include:
- A function to tokenize an arbitrary keyword from input (which requires that the next character not be an identifier character);
- A function to tokenize an arbitrary symbol from input (which can accept any symbol so long as symbols are parsed in the right order);
- A function which, given a
<foo>
parser and a<bar>
parser, will parse any sequence of<foo> <bar> <foo> ...
and return that sequence in a list (for binary operators).
You don’t need to write all (or any) of these routines, but helper functions will save you time in the long run!
-
Consider writing your parser a little bit at a time. The starter code recognizes parentheticals, addition, and integers. You can, for instance, add the booleans
true
andfalse
to both the lexer and the parser and then test that they work. You could follow up by adding&&
expressions or something similar. Don’t try to do everything at once; it’ll be much harder to find errors that way. -
Remember that identifiers are allowed to contain letters, digits, and underscores. That is,
sub-let
is not a single token (because-
is not an identifier character) butsub_let
is (because it’s a single sequence of letters and underscores). Note thatlet
can be part of a variable name; it’s only whenlet
and similar keywords appear separated from other identifier characters that they are keywords. -
Don’t try to handle negative numbers when lexing integers. Our grammar allows the user to write the minus sign as a separate token. This is somewhat peculiar as it means
let x = - 4 in x
is legal syntax. Allowing this, however, fixes subtle problems which arise in traditional functional language syntax (such as in OCaml or Falcon) and this keeps your Pyrrhuloxia parser consistent with Cardinal. - Whitespace is already handled by the starter code’s lexer (as part of
take_tokens
in thelex
function). This means that each mini-lexer you write may assume that the input does not start with whitespace. Make sure you understand how this is working! Parsers operate on lists of tokens (and we have no whitespace tokens), so your parser code won’t have to worry about this.
Testing Your Parser
Because Pyrrhuloxia is syntax compatible with Cardinal and Cardinal is a sublanguage of every more recent lab, you can use your Cardinal-compatible unit test code to test your new parser. To write parser tests, however, you don’t need to run the programs through the entire compiler pipeline via the helper functions in testUtils.ml
(such as test_success
). Instead, we can just test the parser portion of the lab.
You can write a function in testUtils.ml
which expects the parser to produce a particular result for a particular file. To do so, you’ll want to make use of a few OCaml routines:
-
The
FileUtils
module in your starter code has a functionread_file
which you can use to get the contents of a particular text file as a string. For instance,read_file "test_code/4.bird"
will produce the string"4\n"
. You can use this to obtain the text of a file to give to yourLLParserTool
module. -
Recall from our earliest labs that you can write unit tests using the
"test_name" >:: fun _ -> ...
syntax. Once you’ve created a test, it must be added to the list given to OUnit to be executed. You can write your own test utility (e.g.test_parse
) which takes a filename and an AST and makes sure that the contents of that file parse to the provided AST. -
To get good test results, you can use the
~printer
and~cmp
optional arguments of theassert_equal
function. For instance, you might useassert_equal ~printer:show_program ~cmp:equal_program expected_program my_program
to ensure that twoprogram
ASTs match. Theequal_program
function will be used to decide if they are equal andshow_prorgam
will be used to display them if they are not. -
Once you have written a
test_parse
function, you can invoke it with things liketest_parse "test_code/4.run" (EInt 4)
to be sure that your parser is working. This does mean that you have to write the AST out by hand, but there’s no avoiding that; after all, your parser is exactly the tool that makes it easier for the programmer to write an AST!
Summary
To complete this lab, you will
- Write a Cardinal-compatible lexer for Pyrrhuloxia in
LLLexer.ml
- Write a Cardinal-compatible LL parser for Pyrrhuloxia in
LLParser.ml
Once you’ve done this, you’ve written a compiler from beginning to end! Now that you’re more familiar with OCaml, you might be interested to glance over builder.ml
and the other parts of the compiler you haven’t changed much. You’ll see that they contain code to open files and run subprograms (like nasm
). These commands are basically identical to those from the Hatchling lab; there’s nothing magical going on there.
Congratulations on developing a complete compiler!
Submitting
To submit your lab, just commit and push your work. The most recent pushed commit as of the due date will be graded. For more information on when assignments are due, see the Policies page.
After each lab, you will complete a brief questionnaire found here. In most cases, the questionnaire should take less than a minute. This questionnaire is required and will be used as part of your participation grade.
If You Have Trouble…
…then please contact your instructor! Piazza is the preferred method, but you can reach out via e-mail as well. Good luck!