Falcon
Due on Wednesday, April 3rd at 11:59 PM. This is a compiler lab. If you have a partner, the two of you will complete this lab as a team. If you do not have a partner, you are effectively in a team by yourself.
If you are working with a partner, please be familiar with the Partner Etiquette guidelines. You and your partner share a single repository and, barring unusual circumstances, will receive the same grade. If you experience any difficulties in your partnership, please alert your instructor as soon as possible.
If you are working alone, please don’t hesitate to seek help if you get stuck. Since there are no ninjas for this course and you don’t have a partner, your instructor is the only interactive resource available to you and is happy to help you through any problems you might encounter.
In either case, be familiar with the academic integrity policy! You are permitted to discuss high-level details of your compiler project with other students, but you should never see or share anything that will be turned in for a grade. If you need help, please post on the Piazza forum or, if necessary, contact the instructor directly. Make sure to post privately if you are sharing code. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.
Overview
In this lab, we will extend the Eagle compiler to a new language called Falcon. Falcon introduces first-class functions and partial application. Optionally, you may also add anonymous functions.
Transitioning to Falcon
As usual, you’ll run some Git commands to prepare for Falcon:
git fetch
git checkout falcon
git merge eagle
The typical changes appear in src/languages/
. You’ll also have an updated resources/printer.c
to accommodate function closures.
Once you’ve merged the eagle
branch into falcon
, your code will not compile! This is expected. Adding higher-order functions to a language is a big change, so it’s going to take some work to get back to a functioning compiler. We will explain this below.
The Falcon Language
The syntax of Falcon function declarations and function calls is incompatible with the Dove syntax and the Falcon abstract syntax reflects this. We first present the syntax and semantics of Falcon and then discuss how to update the Eagle compiler incrementally to accommodate these changes.
Concrete Syntax
Falcon’s concrete syntax includes all of the features of Eagle, but function declarations and calls are written differently. For clarity, we present the full Falcon syntax here:
<program> ::=
| <declaration_list> <expression>
| <expression>
<declaration_list> ::=
| <declaration> <declaration_list>
| <declaration>
<declaration> ::=
| def <identifier> <parameter_list> = <expression> end
<parameter_list> ::=
| <identifier> <parameter_list>
| <identifier>
<expression> ::=
| <integer>
| true
| false
| <identifier>
| let <identifier> = <expression> in <expression>
| if <expression> then <expression> else <expression>
| after(<expression>)
| before(<expression>)
| print(<expression>)
| isbool(<expression>)
| isint(<expression>)
| istuple(<expression>)
| <expression> + <expression>
| <expression> - <expression>
| <expression> * <expression>
| <expression> < <expression>
| <expression> > <expression>
| <expression> = <expression>
| <expression> && <expression>
| <expression> || <expression>
| <expression>[<expression>]
| <expression> <expression>
| (<expression>)
| (<expression>, <expression_list>)
<expression_list> ::=
| <expression> , <expression_list>
| <expression>
The notable changes from Eagle are:
- Function declarations no longer use parentheses or commas; they are written in a more OCaml-like style, with spaces separating the parameters. For instance,
def f x y = x - y end
is a valid function declaration. Note that, as in Dove, functions can only be defined at top level; Falcon still does not support nested function definitions. You’ll have the opportunity to add this behavior once you’ve completed the Falcon lab. - Function declarations require at least one parameter.
- Function calls are by juxtaposition as in OCaml:
f x
calls functionf
with argumentx
.
Abstract Syntax
As usual, you can see the definitions of expressions for Falcon in src/language/asts.ml
. Make sure to read over those definitions. In particular, note that ECall
is no longer present. Instead, we have EAppl
(for function “application”) which only carries two expressions. Multiple arguments are passed by chaining EAppl
constructors (e.g. EAppl(EAppl(EVar f, EVar x), EVar y)
for the code f x y
).
Semantics
Unlike our previous languages, function declarations in Falcon create new variables named after those functions. These variables are bound to closures which define the function and the arguments which have been passed to that function so far. For instance,
def f x y =
x + y
end
let g = f in
g 2 4
evaluates to 6
. The variable f
is bound to the closure of function f
with no arguments. These closures will be values in Falcon and, as such, can be assigned to other variables (such as g
). Arguments can then be applied to the closure to call the function.
Falcon supports partial application, so the code
def f x y =
x + y
end
let increment = f 1 in
(increment 3, increment 7)
evaluates to the tuple (4, 8)
; the inc
closure carries the argument 1
for parameter x
and this argument is used each time inc
is called in the tuple.
While we will represent closures in a fashion similar to tuples, closures are not tuples; they are a new kind of value. That is,
let identity = fun x -> x in
istuple(identity)
will evaluate to false
.
Errors
Since we have now introduced higher-order functions, some of the errors which were easy to recognize in Eagle have become extremely difficult to recognize in Falcon. You are no longer required to recognize the following compile-time errors:
Function f is not defined.
: Because functions will be defined as variables bound to closures, this error becomesUnbound variable f.
Function f is called with an incorrect number of arguments.
: Because functions are now data, it becomes impossible to determine if a function has been applied to all of its arguments or not. A type system would help to mitigate this by weakening the kinds of application we could perform. For Falcon, you are simply not required to report this kind of error.
You are also now responsible for an additional type of runtime error:
- If an operation expects a closure but receives something else, exit with error code
6
.
For instance, exit code 6
would result from the code (1 + 2) (3 + 4)
because the evaluation of the first expression, 3
, is not a function. This should be reported after evaluating both the function and the argument. For instance, the code 0 (1 + true)
would return exit code 1
(“integer expected”) and not exit code 6
(“closure expected”) because, as in previous labs, we will evaluate both subexpressions before checking their results. This runtime error should be reported in addition to the runtime errors from previous labs.
Implementing the Falcon Compiler
To begin your work on Falcon, you should get your code to a point where it compiles, even if this means losing some functionality. The following is recommended:
- Remove the arity check from well-formedness. The old code for
ECall
was more complex than the new code forEAppl
should be. - Comment out the compilation of
ECall
. You won’t be using that code in your Falcon compiler, but it’s a good idea to keep it around for reference for now. - Add a compilation case for
EAppl
which just runsfailwith "TODO: EAppl"
.
Once you have done this, all of your unit tests which don’t use function calls should be able to pass. You will eventually need to update the syntax of your test files to match the new syntax of Falcon.
Step 1: Closures
Once you’ve gotten your code to compile, you’re ready to add closures to your compiler. A closure must be able to store:
- The current number of arguments the closure contains.
- The number of arguments the function needs to be called.
- The memory address of the assembly code representing the function.
- The arguments currently contained by the closure.
We will lay closures out in memory in that order. As with tuples, we will always store closures in the heap (since they require several bytes of memory). To distinguish closures from tuples, we will set the first bit of the first word of the closure to 1
. Consider the following example:
(32 bits) | (32 bits) | (32 bits) | (32 bits) | (32 bits) |
---|---|---|---|---|
0x80000002 |
0x00000004 |
0x080484f8 |
0x7FFFFFFF |
0x0000001e |
This sequence of words represents a closure in the Falcon heap. They are:
- A word indicating that this is a closure which carries two arguments. We know that this is a closure because the highest bit is set; we know that there are two arguments because, after setting the highest bit to zero, the value is
0x00000002
. - A word indicating the number of arguments the function requires. In this case, the function requires four arguments before it can be called.
- A word indicating the memory address of the function to call. This is a raw memory address to machine instructions; it could be any value.
- A word for the first of the two stored arguments. In this case, that argument is
false
(in Falcon’s memory representation). - A word for the second of the two stored arguments. In this case, that argument is the integer
15
(in Falcon’s memory representation).
Creating closures for declared functions
The first step toward implementing closures in Falcon is to create a set of global variables containing the zero-argument closures for the functions declared by the program. We are creating global variables here for two reasons:
- It would be inefficient to create a new 0-argument closure every time we refer to a function (e.g.
f
). - By using global variables to represent 0-argument closures, we don’t have to write complicated code to handle variable compilation and
EAppl
compilation will be much more consistent.
In the Eagle lab, you created a single global label heap_cursor
to track heap memory. You should now create an additional label for the closure of each declared function in the program. For instance, for the program
def f x =
x + 1
end
def g x y =
x * y
end
f (g 2 3)
the data section of your assembly might look something like this:
section .data
align 4
heap_cursor:
dd 0
align 4
closure_of_f:
dd 0x80000000, 1, _f
align 4
closure_of_g:
dd 0x80000000, 2, _g
Note that these initial closures always carry zero arguments. The second word (e.g. 1
or 2
) indicates the number of parameters the function declares while the third word (e.g. _f
or _g
) is the label given to the user-declared function. The label, named by itself, refers to the memory address of the labeled code.
Checkpoint
Once you’ve completed the above steps, you should be able to write a program which declares a function but does not use it. Your compiler should work correctly on this program and produce a data section similar to the above.
Step 2: Binding Variables to Closures
Once you’ve declared the existence of these closures, your next step should be to have some means to refer to them. When a Falcon program refers to a function by its name (e.g. f
), you should be able to retrieve that closure. Our compiler’s environment currently only has the ability to refer to stack offsets, so we will have to make it a bit more sophisticated.
Flexible environments
First, change your environment so that it contains a mapping from strings to arguments (instead of a mapping from strings to integers). You will need to update environment.ml
and compiler.ml
to do this. When you’ve finished, the code should behave in exactly the same way as it did before; you can use your unit tests to confirm this. The objective here is to ensure that the environment continues to have the same functionality but its data structure is now more flexible.
Our next step will be to add bindings for function closures. You’ll likely want to add a function to environment.ml
to add an arbitrary string
-to-argument
binding to your environment to make this easier, but you don’t need to do it yet.
Bindings for closures
Next, we must ensure that each function declaration results in a single variable binding. So far, we’ve written our compiler to start with an empty environment in the main expression and an environment containing parameters for each function body expression. We will add to both of these environments a binding for each declared function. For instance, in the program
def f x =
x + 1
end
f 0
the main expression should be compiled with an environment which binds f
to its closure. For instance, the assembly code for evaluating f
in the above program might be
mov eax, closure_of_f + 1
Note the use of + 1
. The label closure_of_f
should be aligned to a 4 byte boundary, so the last two bits of its address will be 00
. If we add 1
to this, we get a value which conforms to the Falcon binary representation for a memory pointer!
To handle the above, you probably want to add a new assembly argument
to assemblyLanguage.ml
. The name ArgLabelLocationOffset
is recommended as it is consistent with the previous names we have chosen.
In addition to updating compiler.ml
with these new bindings, you will need to update wellformedness.ml
so that it knows that these variable names are legal. Make sure to update both files before proceeding.
Checkpoint
Although you cannot yet call these functions, it should now be possible for you to produce the closure as a value. Write a program which declares a function f
; then, allow the main expression for that program to be just f
. Using the new version of printer.c
you received as part of the starter code for Falcon, you should get a result that looks something like this:
<closure@080484f8>[0/1](?)
The above is the printer’s way of displaying a function closure whose function is at memory address 0x080484f8
. That function expects 1 argument but the closure has 0 arguments so far. The ?
represents the value which has not yet been provided to the closure.
Step 3: Growing Closures
Once we have established closures in the compiler’s environment, we should work toward growing them: creating new closures which carry additional arguments. The compilation of EAppl
should generate assembly code that follows approximately this algorithm:
- Evaluate both the left and right subexpressions.
- Check to make sure the first argument is a closure. Fail if it is not.
- Compare the number of arguments in the closure \(a\) to the number of parameters for the function \(p\).
- If \(a + 1 < p\), then we’re not yet ready to call the function.
- Create a copy of the closure using new heap memory.
- Copy the new argument (the result the second subexpression) to the end of the copy you just created.
- In that copy, increase the number of arguments by one to account for the new argument.
- Make sure to update
heap_cursor
to reflect the fact that this new memory has been allocated by your copy. - Produce in
eax
a Falcon pointer to the new closure.
- If \(a + 1 = p\), then this argument is the last argument to the function and we are ready to call it.
- Allocate enough memory on the stack for all of the arguments. Do this by modifying
esp
directly since we’re pushing a dynamically-determined number of elements. - Copy the arguments from the closure into the stack.
- Copy the last argument (the result of the second subexpression) onto the stack.
- Call the function. This is most easily accomplished by moving the function address into
eax
and then using the instructionACall("eax")
. Alternatively, you could make a new OCamlinstruction
constructor for calling registers. - Adjust
esp
to remove the arguments from the stack.
- Allocate enough memory on the stack for all of the arguments. Do this by modifying
The above algorithm handles the case of growing closures as well as the case of calling functions (which is step 4 below). For now, focus on the first part. For the second part – the case in which you would call the function – you can simply generate an appropriate label and no instructions.
Registers
The closure-growing algorithm can get a bit tricky. Feel free to use all four general purpose registers – eax
, ebx
, ecx
, and edx
– to make your generated code simpler. Note that, if you use ebx
, you must observe the C++ calling conventions regarding callee-saved registers.
Copying memory with rep movsd
Writing a loop to copy memory in x86 assembly could be a bit tricky and rather error-prone. Thankfully, the processor comes with an instruction which makes it easy to copy several words of memory from one location to another.
To use this instruction:
- Set the
esi
register to the source memory address containing the data - Set the
edi
register to the destination memory address to write the data - Set the
ecx
register to the count of the number of 4-byte words to copy - Issue the instruction
rep movsd
The rep movsd
instruction will proceed by copying one word from esi
to edi
, moving esi
and edi
one word forward in memory, decrementing ecx
, and continuing until ecx
is 0
. For instance, consider the following code:
mov esi, ebp
add esi, 8
mov edi, [heap_cursor]
mov ecx, 3
rep movsd
At the start of rep movsd
:
- The
esi
register contains the addressebp + 8
: the first parameter of the current function. - The
edi
register contains the address of the heap cursor. - The
ecx
register contains the value3
.
After rep movsd
is complete:
- 12 bytes of memory (3 words) starting at
ebp + 8
will be copied into the 12 bytes of memory starting atheap_cursor
- The
esi
register will contain the addressebp + 20
: 12 bytes from where it started. - The
edi
register will contain the addressheap_cursor + 12
: 12 bytes from where it started. - The
ecx
register will contain the value0
.
To use this instruction, you can simply add an appropriate ARepMovsd
constructor to your instruction
type in assemblyLanguage.ml
. You are not required to use rep movsd
, but it will likely make your code much cleaner and easier to debug. It’s also considerably more efficient for the CPU.
Checkpoint
Once you’ve finished the code which grows closures, you should be able to see the closures growing for functions with several parameters. Try writing a program which declares a function f
of six parameters. Then, call f
with an argument and see if the result is an appropriate closure. Call that closure with another argument to see if it gets copied in correctly as well.
Step 4: Calling Functions
Once you’ve managed to make your closures grow appropriately, move on to implementing the function-calling part of the algorithm above. Probably the trickiest part of this is to copy the arguments from the closure onto the stack in the right order, although this is easier than it seems.
Remember: arguments are pushed onto the stack in reverse order only because stack memory grows downwards; the first argument to a function is lowest in memory. The closures also keep their arguments in ascending memory order, so it is enough simply to copy those arguments from the closure onto the stack and then copy the last argument after them.
Checkpoint
At this point, your unit tests from Eagle should all pass! You’ll need to update the syntax of function declarations and function calls if you haven’t already done so. You’ll also need to change any tests that relied on zero-parameter functions. In those cases, just add a parameter that you don’t use and then call the function with a dummy value, like 0
.
What’s more: you should now be able to pass functions as arguments to other functions! Try writing some programs which make use of your new compiler features. For instance, you could write an ntimes
function that takes parameters f
, n
, and x
and calls f
on x
a total of n
times.
Miscellaneous details
Make sure to update the code for istuple
as well as the code that validates that an argument is a tuple for e.g. tuple indexing. The Eagle version of this code can simply check the tag bits of the argument, but Falcon uses the same tag bits for tuple pointers and closure pointers. You’ll need to check both the tag bits and the first bit of the heap value to determine if an argument is a tuple or not.
Completion!
At this point, we have added the two key features of Falcon: first-class functions and partial application. Write some unit tests to be confident of your implementation and then congratulate yourself! The rest of the lab is optional; only the Falcon compiler is required for a grade. If you have the time and resources, however, you may find the optional part of this assignment interesting and/or informative.
Finch: Anonymous Functions (optional!)
The Falcon compiler includes two key features found in most functional languages: first-class functions and partial application. However, it’s missing a third feature: anonymous functions. You are not required to implement anonymous functions for this lab (and no course credit will be provided for doing so), but this language feature is extremely useful and a compiler which supports it is more satisfying. :) If you decide to implement this compiler, first commit and push your work on Falcon.
Let’s develop a small extension to Falcon called Finch. You can begin by running the following Git commands. Note that these commands are a bit different from your usual lab starters. In that case, your instructor has already created the branch for you. In this case, you are creating a new Git branch yourself.
git branch finch
git checkout finch
git push -u origin finch
If, while developing Finch, you find a bug in your Falcon work, there is a section below to help you make your change to the Falcon compiler.
Syntax and Semantics
Anonymous functions can be supported simply by adding a new form of expression to the Finch grammar:
<expression> ::= ...
| fun <identifier> -> <expression>
This syntax has the same meaning as in OCaml. For instance, the code
let f = (fun x -> fun y -> x + y) in
let increment = f 1 in
(increment 3, increment 7)
evaluates to (4, 8)
. Again, note that the expression form of let
here is different from the declaration form of let
above. We won’t allow expressions such as let f x y = x + y in ...
. We could – this expression is just shorthand for the anonymous function syntax in the above example – but it’s not the interesting part of implementing anonymous functions.
Parsing and the AST
To allow anonymous functions to appear in our programs, we must modify the parser and the AST. First, open src/language/asts.ml
and add a constructor to the expr
data type:
| ELambda of string * expr
This “lambda” constructor will represent our anonymous functions: the string contains the name of the variable while the expression is the function’s body. (Anonymous functions are traditionally denoted in programming language theory using the Greek letter lambda: hence the name.)
Once you have added the ELambda
constructor, you will need to allow the parser to recognize anonymous function syntax and construct new ASTs for them. First, open src/language/lexer.mll
and add the following cases somewhere in the middle of your lexer:
| "fun" { FUN }
| "->" { ARROW }
These cases must appear above the definition of the identifier case; don’t place them at the bottom of the file. Next, open src/language/parser.mly
; we will add two lines. The first appears toward the top of the file. Between the last %token
declaration but before the first precedence (e.g. %left
or %right
) declaration, there is a blank line. Replace it with
%token FUN ARROW
%nonassoc fake_FUNCTION_BODY
The first line tells the parser that FUN
and ARROW
are tokens we will use; these match the tokens that the lexer is now producing. The second line tells the parser that we are creating a new non-associative parsing precedence which is less important that every other precedence (because it’s first in the list). Then, we can add the rule for parsing anonymous functions by adding the following to the top of the expr
cases:
| FUN IDENTIFIER ARROW expr { ELambda($2,$4) } %prec fake_FUNCTION_BODY
Checkpoint
Once you’ve made these changes, you should be able to make clean
, make tests
, and run ./tests.byte
. Everything that worked in Falcon should work in Finch, but you’ll get warnings about the unhandled ELambda
case.
Implementing Anonymous Functions via Closure Conversion
Rather than developing some clever code generation technique to handle ELambda
expressions, we’ll use the mathematician’s approach: reduce anonymous functions to a previously-solved problem. We know how to handle ASTs without ELambda
expressions. What if we could just get rid of them?
We will add closure conversion as a preprocessing step to the Finch compiler. That is, the steps of the compilation pipeline are:
- Parse the source code into an AST
- Perform well-formedness checking on the AST
- Closure-convert the AST to eliminate
ELambda
expressions - Compile the resulting
ELambda
-less AST as per Falcon
Our compile_expression
function will operate with the invariant that it is never given an AST with an ELambda
expression in it. You can add an ELambda
case to its primary match
expression that just runs failwith "ELambda in compilation"
or something similar. If closure conversion works correctly, we should never be able to reach that case.
Examples of Closure Conversion
We will convert anonymous functions like fun x -> x
into named functions (of the form DFunction
) by searching the AST for them, picking a fresh name to use, creating a declaration with that name, and replacing the anonymous function with a reference to that new name as a variable. For instance, we might transform
def twice f x =
f (f x)
end
twice (fun a -> a * 2) 5
into
def twice f x =
f (f x)
end
def $lambda$1 a =
a * 2
end
twice $lambda$1 5
Note here that, while $
is not a legal character for the Finch parser, you can put any string you want into the AST (provided that your output assembles). By using a character like $
which is illegal in Finch but legal in nasm
, we can be sure that the user never wrote a variable with that name.
Non-Local Variables
When we closure-convert anonymous functions, we must be careful to handle non-local variables correctly. A non-local variable is one which is used within the function but not defined by it. For instance, in the program
let x = 2 in
(fun y -> x + y) 4
the variable x
is non-local to the anonymous function. When we closure-convert this code, we will add the non-locals as parameters to the resulting declaration:
def $lambda$1 x y =
x + y
end
let x = 2 in
$lambda$1 x 4
To do this correctly, you will want to implement a function for calculating the free variables in an expression. The non-local variables of an anonymous function are the free variables of its body minus the parameter it declares.
The Closure Conversion Algorithm
You are recommended to develop your closure conversion algorithm in a separate module (e.g. closureConversion.ml
). The implementation of closure conversion will likely include the following subroutines:
-
A recursive function
free_vars
to calculate which variables are “free” in an expression. A free variable is one which is used but not bound. Even if an expression has no unbound variables, its subexpressions may. For instance, the expressionlet x = 4 in x + x
has no unbound variables but the subexpressionx + x
does. This function should probably take anexpr
and return a set of variable names. -
A recursive function
closure_convert_expression
which takes anexpr
and returns anexpr * declaration list
. This function will find eachELambda
in the provided expression, create a new declaration to replace it, and then replace theELambda
with a reference to that new declaration. You may usefresh_name
to generate names for these new functions. -
A function
closure_convert_declaration
of typedeclaration -> declaration * declaration list
. This function is not complex; it simply applies closure conversion to the body of a declared function. -
A function
closure_convert_program
of typeprogram -> program
. This function is also not complex. It callsclosure_convert_declaration
on each function declaration andclosure_convert_expression
on the main body of the program. Rather than returning the new declarations produced by closure conversion, it incorporates them into the program definition.
Once you have implemented this algorithm and called it from your compiler.ml
file, your Finch compiler should be complete!
Git and Branches
It’s possible that, during Finch development, you’ll realize that you’ve made a mistake in Falcon. This kind of problem is not uncommon in industry projects, so there’s a fairly straightforward way to handle this using Git.
First: do not fix the bug yet. If you’ve found a bug in your Falcon implementation, it should be fixed on the falcon
branch. We need to set your changes to the finch
branch aside and return to the point in time that the falcon
compiler was last changed. You may either commit your current changes to finch
or run git stash
to store them for later.
Once git status
tells you that your working directory is clean, you can run git checkout falcon
to return to the latest commit of the Falcon branch. Run make clean
to be sure that none of the leftover OCaml binaries are left in your workspace. Then, work on the bug fix the bug as you would’ve done if you hadn’t yet started work on Finch.
Once you’ve finished updating Falcon, commit and push your work. Then, run git checkout finch
followed by git merge falcon
. This will bring the fix that you’ve made into the finch
branch so you don’t have to fix it twice. If you ran git stash
earlier, run git stash pop
now. Note that this may result in a “merge conflict”: a case in which the tool cannot figure out how to automatically merge the bug fix with your Finch code. In that case, you will have to help Git by editing the conflicted files so that everything compiles again and then create a new commit to finish the merge. In either case, you should now have
- A
falcon
branch with none of the Finch work but with a fix for the bug - A
finch
branch with all of your previous work as well as the bug fix - Only one version of the bug fix (instead of one ad-hoc version for each branch)
- The ability to repeat this process if you find another Falcon bug!
Summary
To complete this assignment, you’ll need to
- Modify your compiler to include closures as a possible heap value
- Add empty closures for each declared function
- Implement closure growth and function calls using closures
- Update tuple-handling code to handle closures correctly
Optionally, you may also
- Complete the closure conversion routine provided in the starter code
- Use closure conversion to support anonymous functions
Submitting
To submit your lab, just commit and push your work. The most recent pushed commit on the appropriate branch as of the due date will be graded. For more information on when assignments are due, see the Policies page.
After each lab, you will complete a brief questionnaire found here. In most cases, the questionnaire should take less than a minute. This questionnaire is required and will be used as part of your participation grade.
If You Have Trouble…
…then please contact your instructor! Piazza is the preferred method, but you can reach out via e-mail as well. Good luck!