Hoop Snake
Due on Friday, April 28th at 11:59 PM. This is a compiler lab. If you have a partner, the two of you will complete this lab as a team. If you do not have a partner, you are effectively in a team by yourself.
If you are working with a partner, please be familiar with the Partner Etiquette guidelines. You and your partner share a single repository and, barring unusual circumstances, will receive the same grade. If you experience any difficulties in your partnership, please alert your instructor as soon as possible.
If you are working alone, please don’t hesitate to seek help if you get stuck. Since there are no ninjas for this course and you don’t have a partner, your instructor is the only interactive resource available to you and is happy to help you through any problems you might encounter.
In either case, be familiar with the academic integrity policy! You are permitted to discuss high-level details of your compiler project with other students, but you should never see or share anything that will be turned in for a grade. If you need help, please post on the Piazza forum or, if necessary, contact the instructor directly. Make sure to post privately if you are sharing code. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.
Overview
This lab contains two parts:
-
In the first part, we will extend the Garbage Snake compiler to include tail call optimization. We will call the resulting languge “Hoop Snake”. Once you have completed this task, you will be able to compile arbitrary long-running computations using recursion for loops!
-
In the second part, we will write an LL parser for a modified form of the Cobra language. The parser is the only component of the compiler we have not yet explored. Writing a Cobra parser will give you enough familiarity with parsing that, given time, you could produce a Hoop Snake parser on your own.
This lab write-up will proceed in the order given above. Strictly speaking, however, you can perform these parts in either order so long as you make the appropriate change to language.mk
to switch back and forth. Neither part of the lab relies upon the other.
Part I: Tail Call Optimization
When you git pull
, you’ll discover the usual updates. The most relevant for this part of the lab is the src/languages/hoopSnake
directory. As usual, change the LANGUAGE
variable in language.mk
from garbageSnake
to hoopSnake
. Then, run make clean
and make
. Your code will not compile, but it won’t take much effort to get it back up and running.
The Hoop Snake Language
The Hoop Snake language is syntactically identical to Garbage Snake; there are no new forms of expressions or new language features available to the programmer. In fact, the only difference is a change in the A-normalized abstract syntax! The CAppl
constructor has been changed from
| CAppl of expr * expr
to
| CAppl of expr * expr * bool
The additional boolean value indicates whether the CAppl
in question is in a tail position: that is, whether the function call is the very last thing that this expression will do.
The first thing you should do in implementing Hoop Snake is to modify your A-normalization code to set this boolean value to false
always. Then, modify compiler.ml
to match on (but ignore) this value. Finally, make
your unit tests and ensure that everything is still working.
Marking Tail Positions
Next, you need to mark tail positions in your AST. You can either do this during A-normalization or you can write a separate function of type a_program -> a_program
which marks the tail positions of every expression in the AST (and then call that function at the right point within your compiler). The particular approach you use does not matter as long as the tail position variable is being set correctly.
Tail Calls
Consider the following program:
def f x = x end
def g y = f y end
g 4
In Garbage Snake, which has no tail calls, the following sequence of events would occur:
- The main expression calls
g
, leaving the return pointer for the main expression on the stack. - The
g
function callsf
, leaving the return pointer forg
on the stack. - The
f
function copies its parameter intoeax
and returns to the point ing
where it was called. - The
g
function finishes and returns to the point in the main expression where it was called. - The main expression yields the value returned by
g
, which was4
.
Tail call optimization begins with the observation that there was no point in returning to the body of g
. We knew we would simply be returning its value, so the stack frame we’ve created for g
is just taking up space. With tail call optimization, we want to bring about the following sequence of events instead:
- The main expression calls
g
, leaving the return pointer for the main expression on the stack. - The
g
function tail callsf
. It modifies the stack by removing its own stack frame and replacing its own arguments with new arguments forf
. It then jumps tof
rather than callingf
. - The
f
function, unaware that this has happened, copies the parameter intoeax
. When it issues theret
instruction, the tail call byg
has changed its return pointer to return directly to the point in the main expression wereg
was called. - The main expression yields the value returned by
f
, which was4
.
This kind of optimization is especially important in functional languages, which make frequent and thorough use of recursion. With tail call optimization, recursive tail calls consume no additonal stack memory and so work much like while
loops in imperative languages. Without tail call optimization, long-running recursive loops will overflow the stack.
Implementing Tail Call Optimization
In tail call optimization, there are three main participants:
- The original caller. In the example above, this is the program’s main expression.
- The tail caller. In the example above, this is
g
. - The tail callee. In the example above, this is
f
.
Not every call can be optimized. For us to apply our tail call optimization strategy, we must know the following facts:
- The call is in a tail position. If this is not true, then we must return to the calling function afterwards, so we can’t just throw our stack frame away. Note that this information is available at compile time.
- The tail callee is a function with equal or fewer parameters as the tail caller. If we tail call, we return directly to the original caller, who will remove arguments from the stack equal to the number of parameters of the tail caller (since that is who the original caller had called). If the tail callee has more parameters than the tail caller, we won’t have enough room for the tail callee’s arguments and we will not perform the tail call. We will not know this information until runtime.
Since the tail call position is known at compile time, it is the easiest part to handle: for a CAppl
which is not in a tail position, we simply generate the same assembly as we would in Garbage Snake.
If the CAppl
is in a tail position, we need to generate code which might perform a tail call. Remember that a CAppl
refers to two values: a closure and an argument. We’ll have our compiler generate assembly code which uses the following algorithm:
- If the closure is not yet full, create a new closure which includes this argument. This is the same as in Foxsnake and Garbage Snake.
- If the closure is full, then
- If the number of parameters in the closure is greater than the number of parameters of this function, we perform a normal call (just as we would if the
CAppl
were not in a tail position). - Otherwise, we perform a tail call.
- Instead of pushing the arguments onto the stack, we replace our own arguments with them. (If we have more parameters than the target function, we must replace our other arguments with zeros.)
- We tear down our stack (in a fashion similar to how we would at the end of a function declaration), returning it to (almost) the same condition as it was in when the original caller called us.
- We
jump
to the function rather thancall
ing it.
- If the number of parameters in the closure is greater than the number of parameters of this function, we perform a normal call (just as we would if the
It may be helpful to get this working for tail callees with an equal number of parameters first and then extend your implementation to tail callees with fewer parameters.
Testing Tail Call Optimization
Testing a compiler optimization is somewhat different than testing the compilation of a language feature. Our tests so far have treated the program as a black box and just checked its output. But an optimization should, by its very nature, not change what task a program accomplishes: it should just make the program smarter about how to accomplish that task.
In this case, however, we do know of an observable change in behavior. In Garbage Snake, long-running recursive functions will run out of stack memory and cause a segmentation fault. We can test the Hoop Snake optimization by writing such programs and making sure they run correctly. Here are some examples of the sorts of tests you may want to run:
-
A very simplistic form of test is to ensure that your tail call optimization prevents a stack overflow. Code like the following may help with that:
def f x = if x = 1 then 1 else f (x-1) end f 10000000
Without tail call optimization, this program will try (and fail) to create ten million stack frames. With tail call optimization, it should run correctly.
-
Because tail call optimization manipulates the stack in a variety of ways, we have to make sure that parameters are copied correctly! We can do this by passing parameters into a long-running recursive function to make sure they can be used once we’ve reached the base case:
def f a x b = if x = 1 then (a,b) else f (a+1) (x-1) (b && b) end f 4 10000000 true
If tail call optimization damages the stack somehow, it’s pretty unlikely that this function will give the expected result without encountering a type error or segmentation fault!
-
We should also test that tail call optimization does the right thing when presented with a function that has too many parameters.
def f x y z = x + y + z end def g n = f n n n end g 4
The above code doesn’t test the tail call optimization directly. If the decision about parameters is being made incorrectly, though, then the above code will likely either corrupt a return pointer or use a return pointer in a calculation, which will be visible in the output of the program.
-
It’s a bit trickier to test whether we’re getting the benefits of tail call optimization for functions with fewer parameters. We can do this by creating two mutually recursive functions and having the function with more arguments consume a lot of stack memory, like so:
def f1 x = if x = 1 then 1 else f2 (x-1) false end def f2 x b = let mem1 = 1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1 in let mem2 = 1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1 in f1 x end f1 100000
Here,
f2
generates numerous variables in A-normalization, which causes hundreds of bytes of stack memory to be allocated for each call. In the call tof1
, however, thef2
stack frame should be replaced, which frees those hundreds of bytes. If your code is behaving properly, then this program will exit normally. If not, then you will likely get a segmentation fault when the greedy calls tof2
cause your program to use up all of its stack memory. -
Make sure to run plenty of code (like that from your unit tests) which includes non-tail calls. For instance, consider
def plus1 x = x + 1 end def plus2 x = plus1 (plus1 x) end plus2 4
Within
plus2
, there are two calls toplus1
. If the innerplus1
call is treated as a tail call (even though it’s not in a tail position), thenplus1
will return directly to the main expression and so the second call toplus1
will never occur. If the above code returns5
, it’s likely you’re tail calling when you shouldn’t be.
Part II: Parsing
The second part of this assignment focuses on writing an LL parser for Cobra. The original grammar for the Cobra language was
<expression> ::=
| true
| false
| <integer>
| <identifier>
| let <identifier> = <expression> in <expression>
| if <expression> then <expression> else <expression>
| inc(<expression>)
| dec(<expression>)
| print(<expression>)
| isbool(<expression>)
| isint(<expression>)
| <expression> + <expression>
| <expression> - <expression>
| <expression> * <expression>
| <expression> < <expression>
| <expression> > <expression>
| <expression> = <expression>
| <expression> && <expression>
| <expression> || <expression>
| (<expression>)
This grammar is left recursive, however, and so we can’t write an LL parser for it directly. Instead, we will write a parser for the following (equivalent) grammar:
<expr> ::=
| let <IDENTIFIER> = <expr> in <expr>
| if <expr> then <expr> else <expr>
| <orExpr>
<orExpr> ::=
| <andExpr>
| <andExpr> || <andExpr> ...
<andExpr> ::=
| <comparisonExpr>
| <comparisonExpr> && <comparisonExpr> ...
<comparisonExpr> ::=
| <additiveExpr>
| <additiveExpr> <comparisonOperator> <additiveExpr> ...
<comparisonOperator> ::=
| <
| >
| =
<additiveExpr> ::=
| <multiplicativeExpr>
| <multiplicativeExpr> <additiveOperator> <multiplicativeExpr> ...
<additiveOperator> ::=
| +
| -
<multiplicativeExpr> ::=
| <primaryExpr>
| <primaryExpr> * <primaryExpr> ...
<primaryExpr> ::=
| true
| false
| <INTEGER>
| - <INTEGER>
| <IDENTIFIER>
| inc(<expr>)
| dec(<expr>)
| print(<expr>)
| isint(<expr>)
| isbool(<expr>)
| (<expr>)
The above grammar accepts the same programs but has been adjusted to avoid left recursion.
Implementing an LL Parser
To begin writing your LL parser, you should create a new directory in the src/language
directory. This new directory will be a peer to the other parsers we have used throughout the course, such as adder
and hoopSnake
. You may name it whatever you like.
For your parser to be compatible with the rest of the code in the compiler, you should do the following:
- Copy
src/language/hoopSnake/expressions.ml
into your new language directory to make sure you still have the same AST definitions. - Create a new file in your language directory called
parserTool.ml
. This file is meant to be the entry point to the parser from the compiler. It must contain two things:- A function named
parse
of typestring -> string -> program
. The first string is the name of the file being parsed (for error messages, if you like) while the second is the contents of the program. You should parse the second argument and produce aprogram
AST from it. Since we’re only parsing Cobra, the list of declarations will always be empty; the hard part is parsing the mainexpr
of the program. For now, you can put afailwith "TODO"
here. - The declaration
exception ParseFailure of string
. If anything goes wrong while parsing, yourparse
function should raise aParseFailure
. Thestring
is the error message to print to the command line.
- A function named
After this, the layout and organization of the parser is up to you. You can use whatever approach or design you like as long as your submission is your own code. You are not permitted to use a parser generator to complete this assignment (although you are permitted to write your own parser combinators if you like).
Testing Your Parser
Because Cobra is a strict subset of Hoop Snake – that is, every Cobra program is a valid Hoop Snake program – you can use your Cobra-compatible unit tests to test your parser. This is as simple as commenting out those unit tests that use files from other languages. You can also write direct unit tests, passing your parse
function a string to get the actual value and creating an AST by hand to get the expected value. Remember that you can use show_program
or show_expr
to turn ASTs into strings.
Parser Development Tips
There are a few miscellaneous tips which may help in writing your parser.
- The Batteries Included library includes a fairly complete String module which you can use to manipulate the input text. In particular, the following expressions might be helpful:
String.to_list some_string
: convertssome_string
into achar list
String.of_list some_char_list
: convertssome_char_list
into astring
String.starts_with s x
: determines if strings
starts with stringx
String.get s i
: gets the character of strings
at indexi
String.lchop s
: returns the strings
without the first characterString.left s n
: returns then
leftmost characters ofs
as a stringString.tail s n
: returns all except then
leftmost characters ofs
as a string
- For working with individual characters, there is also a Char module. Here are some helpful functions it contains:
Char.is_whitespace c
: determines if a character is whitespaceChar.is_digit c
: determines if a character is a digitChar.is_letter c
: determines if a character is a letter
Note that this module uses the Latin-1 encoding of characters (which includes the ASCII characters). The
UChar
module handles Unicode characters, but we don’t need to support Unicode in our language for now. - Consider writing your parser a little bit at a time. You can start by implementing just
true
andfalse
. After they tokenize and parse correctly, you can add e.g. integer expressions. Working a little at a time is a lot easier than doing everything at once. - Remember to handle whitespace correctly! Your lexer will need to recognize whitespace characters – spaces (
' '
), tabs ('\t'
), and newlines ('\n'
) in particular – and ignore them. Once your parser has a list of tokens, it won’t have this problem. - Remember also that you must be out of tokens when you finish parsing the expression. You should consider handling this case in
parseTool.ml
rather than in your main expression parsing function, since you only want to check for the end of the token list at the very end of the parsing process.
Summary
To complete this assignment, you’ll need to
- Add tail call optimization to your compiler
- Write an LL parser for the given modified Cobra grammar
Once you’ve done this, you’ve written a compiler from beginning to end! Now that you’re more familiar with OCaml, you might be interested to glance over builder.ml
and the other parts of the compiler you haven’t changed much; you’ll see that they contain code to open files and run subprograms (like clang
), but that’s about it. Congratulations!
Submitting
To submit your lab, just commit and push your work. The most recent pushed commit as of the due date will be graded. For more information on when assignments are due, see the Policies page.
If You Have Trouble…
…then please contact your instructor! Piazza is the preferred method, but you can reach out via e-mail as well. Good luck!