Cobra
Due on Monday, February 20th at 11:59 PM. This is a compiler lab. If you have a partner, the two of you will complete this lab as a team. If you do not have a partner, you are effectively in a team by yourself.
If you are working with a partner, please be familiar with the Partner Etiquette guidelines. You and your partner share a single repository and, barring unusual circumstances, will receive the same grade. If you experience any difficulties in your partnership, please alert your instructor as soon as possible.
If you are working alone, please don’t hesitate to seek help if you get stuck. Since there are no ninjas for this course and you don’t have a partner, your instructor is the only interactive resource available to you and is happy to help you through any problems you might encounter.
In either case, be familiar with the academic integrity policy! You are permitted to discuss high-level details of your compiler project with other students, but you should never see or share anything that will be turned in for a grade. If you need help, please post on the Piazza forum or, if necessary, contact the instructor directly. Make sure to post privately if you are sharing code. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.
Overview
This lab will extend the Boa compiler to new language called Cobra, which includes boolean values and operations as well as a printing primitive. We will also modify the compiler to use a memory layout which permits Cobra programs to check the types of values at runtime and ensures that types are not used incorrectly. Throughout the semester, these compiler extensions will use the same Git repository, just as your previous assignment did.
This page will:
- Discuss the syntax and semantics of Cobra
- Describe the obligations your code will have in calling and being called by C functions
- Give a reference for some relevant assembly instructions
- Suggest a strategy for approaching the assignment
Transitioning to Cobra
As with the previous lab, you will find some new files in your repository; make sure to git pull
. The new contents are:
src/languages/cobra/
, a directory containing the lexer, parser, and AST of the Cobra language. The surface-level and A-normalized ASTs of Cobra include the new language features.resources/printer.h
andresources/printer.c
, two new files that will be used by Cobra programs (in addition todriver.c
) in order to print Cobra values.resources/error.h
andresources/error.h
, which we will use for error handling in Cobra programs.
To create a compiler for Cobra, begin by opening language.mk
and changing the LANGUAGE
variable from boa
to cobra
. Then, run make clean
and make
. If everything goes well, your code will compile (with warnings).
Cobra is an extension of the Boa language, so everything that worked in Boa should still work now. You can verify this by running your unit tests now. Even if the compilation of your unit tests produces warnings, you should have no trouble getting your unit tests to pass (as long as they did before).
The Cobra Language
We begin by introducing the concrete syntax, the abstract syntax, and the semantics of the Cobra language. The primary syntactic differences are the presence of booleans and several new operators, including print
: an operator which will display any value.
Concrete Syntax
The concrete syntax of Cobra is:
<expression> ::=
| true
| false
| <integer>
| <identifier>
| let <identifier> = <expression> in <expression>
| if <expression> then <expression> else <expression>
| inc(<expression>)
| dec(<expression>)
| print(<expression>)
| isbool(<expression>)
| isint(<expression>)
| <expression> + <expression>
| <expression> - <expression>
| <expression> * <expression>
| <expression> < <expression>
| <expression> > <expression>
| <expression> = <expression>
| <expression> && <expression>
| <expression> || <expression>
| (<expression>)
Syntactic precedence still follows OCaml. In order from highest to lowest precedence, the operators are:
*
+
-
<
=
>
&&
||
if
-then
-else
andlet
.
Abstract Syntax
As with Boa, the abstract syntax comes in two parts: the surface syntax and the A-normal form. Examine src/language/cobra/expressions.ml
to see how the abstract grammar has changed.
Semantics
Cobra has two distinct types of values: integers and booleans. These values are distinct in that there is no situation in which any integer value (e.g. 4
) should be interchangeable with any boolean value (e.g. true
) and vice versa; that is, 4 < true
always results in a runtime error rather than producing a value. Enforcing this will require changes throughout the compiler.
Memory Representation
The Boa compiler only had one type – integers – and stored all of its values in 32-bit signed form. Conditionals in Boa simply treated 0
as false
and everything else as 1
. We will instead choose a memory layout that distinguishes these types of values:
true
will be represented with the 32-bit value0xFFFFFFFF
false
will be represented with the 32-bit value0x7FFFFFFF
- Integers will be represented using a 32-bit signed value twice as large as the integer value; that is,
5
would be represented as0x0000000A
(which is the 32-bit signed representation for10
). This is the same as we discussed in class.
Note that, in this representation, the rightmost bit of every integer is 0
and the rightmost bit of both booleans is 1
. We will discuss below a strategy for adjusting the compiler to use this representation.
Printing
A new unary operator (similar to inc
and dec
) called print
appears in Cobra’s syntax. This operator prints out a human-readable text representation of its argument and then returns that same value. The argument is in terms of our representation above, so we interpret that representation in the new resource file printer.c
. We use another C file here (similar to our driver.c
) so we can print messages without being concerned about the system-specific details of interacting with the user.
Note that this printing occurs independently from the printing of the final value once the program is finished. For instance, the code
let x = 4 in
let y = print(x) in
print(y + 6)
will print
4
10
10
The 4
is printed by the expression print(x)
; this expression also evaluates to 4
, so y
is bound to 4. This means that print(y + 6)
will print 10
and also evaluate to 10
. Since the overall expression evaluates to 10
, this is printed when the program terminates.
Errors
The semantics of Cobra also include error conditions, when we will terminate the program at runtime without producing a value. When an error arises, we will print a message and terminate the program with a non-zero exit code. (The implementation of this is discussed below.) The following conditions will produce an error:
- The arithmetic unary operators
add1
andsub1
should require that their arguments are integers. If this is not the case, we terminate with exit code1
. - The arithmetic binary operators
+
,-
,*
,<
, and>
should require that both of their arguments are integers. If either argument is not, we terminate with exit code1
. - The logical binary operators
&&
and||
should require that both of their arguments are booleans. If either argument is not, we terminate with exit code2
. - The condition in an
if
-then
-else
must be a boolean. If it is not, we terminate with exit code2
. - No arithmetic operation may trigger overflow. If this occurs, we terminate with exit code
3
.
In a fashion similar to the above, we will handle errors using the error.c
resource file. This file defines a C function which will print a message and immediately terminate the process with the provided error code.
As an example, the code
let x = 4 in
if x - 4 then 5 else 0
will terminate with exit code 2
because x - 4
evaluates to an integer and not a boolean. Implementing this functionality will temporarily break your unit tests. Don’t worry; they’ll still be useful and we’ll fix them before we’re finished!
Interacting with the World
The features of Cobra allow the programmer to interact with the rest of the system in a more meaningful way. To do this, you’ll need to know the C function calling conventions so your code can call (and be called by) C correctly. You’ll also need to use some new assembly instructions.
C Calling Conventions
Code that either calls or is called by a C function must conform to a set of conventions in order to be linked properly. Each call has a caller (the code making the function call) and a callee (the code of the function being called). The convention is this:
- Before calling a function, the caller must
- Push the function call arguments onto the stack.
- Use the
call
instruction to jump to the function’s code. This pushes the return instruction address onto the stack.
- At the very beginning of a call, the callee must
- Push
ebp
onto the stack to save it. - Copy
esp
intoebp
to form the new base of the stack. - Move
esp
to make room for all of the local variables in the function. (See below.)
- Push
- At the very end of the call, the callee must
- Leave the return value in
eax
. - Copy
ebp
intoesp
to restore the old value ofesp
. - Pop into
ebp
the old value of the stack’s base pointer. - Execute the
ret
instruction to pop and jump to the return address that was pushed when the call began.
- Leave the return value in
- After the call returns, the caller must
- Pop the function call arguments off of the stack.
There are also requirements for the caller and callee to protect the values of certain registers. For instance, if the caller cares about the value in eax
, it is the duty of the caller to save that value (since the callee is expected to overwrite it). We are not using any registers that require special handling right now, so we don’t have to worry about this.
Allocating Enough Stack Space
So far, we’ve been intentionally sloppy about how we access stack memory: we’ve just been using memory beyond the location stored in esp
. In the C calling convention, however, we are expected to move esp
to make room for our local variables. In particular, all of our local variables should live between ebp
and esp
.
In observing this calling convention, you can just change all of your esp
offsets to use ebp
instead. Then, to make sure that functions that you call do not use this memory, you must adjust esp
to account for all of your local variables. One way to address this is to write a special function to count the number of local variables used at one time; then, you can call this function from within your compile_program
function. Here’s a stub to get you started:
let rec count_vars_in_c_expression (ce : c_expr) : int =
failwith "TODO"
and count_vars_in_a_expression (ae : a_expr) : int =
failwith "TODO"
;;
New Assembly Constructors
Here are some constructors you’ll probably want to include in your assemblyLanguage.ml
file.
-
XPush of arg
andXPop of arg
The
push
andpop
assembly instructions will copy values to and from stack memory and adjustesp
accordingly. For instance,push eax
pushes the value ofeax
onto the stack and then movesesp
by four bytes. -
XShl of arg * arg
andXShr of arg * arg
The
shl
andshr
instructions shift an argument left or right by some number of bits. For instance,shl eax, 2
shifts theeax
register to the left by two bits. After this operation, the two lowest bits ofeax
will be zeros. -
XSal of arg * arg
andXSar of arg * arg
The
sal
andsar
instructions work likeshl
andshr
but are the “arithmetic shift” operations. This means that they preserve the sign bit, sosar eax, 1
will turn a-8
into-4
. Usingshr
instead would also shift the highest bit, resulting in a very large positive number. (Recall two’s complement!) -
XAnd of arg * arg
,XOr of arg * arg
, andXxor of arg * arg
The
and
,or
, andxor
operations take two arguments (similar to e.g.add
) and perform bitwise masking operations. -
XJo of string
andXJno of string
These jump instructions will jump to the provided label if overflow occurs (
jo
) or if it does not (jno
). -
XCall of string
The
call
instruction is used to invoke a subroutine. It pushes the address of the next instruction onto the stack and then jumps to the provided label (as incall printValue
. Theret
instruction does the opposite of this, popping the instruction address and jumping to it. -
XSized of string * argument
This constructor should be added to the
argument
type (and not theinstruction
type). In some cases, the assembler may complain that the size of an operation is ambiguous. It may, for instance, complain if you writemov [ebp-8], 0
because it’s not clear how big the
0
is intended to be; this may be a one-byte zero, a two-byte zero, or a four-byte zero. The intend ofXSized
is to allow you to specify this. You might, for instance, writemov [ebp-8], DWORD 0
to indicate that you want the
0
to be four bytes large. With theXSized
constructor, the corresponding abstract syntax would beXMov(XMemory(XAddressByRegisterOffset(EBP, -8)), XSized("DWORD", XConstant(0)))
If you wish, you could make a
type size = BYTE | WORD | DWORD
declaration and then use this type instead ofstring
in yourXSized
constructor (for type safety).
Implementing the Cobra Compiler
Here’s a strategy you can use to transition from Boa to Cobra incrementally, which will hopefully make the process easier.
-
As stated above, change
language.mk
to that it refers tocobra
rather thanboa
. Once you do this, you shouldmake clean
andmake tests
. You’ll get several compiler warnings, but your tests should run just fine (as long as they did when you finished Boa!). -
Now, change
driver.c
to use theprintValue
function instead ofprintf
. TheprintValue
function is in theprinter.c
file, so you’ll need to do two things:- Add
#include "printer.h"
todriver.c
, as is usual for C programs - Open
builder.ml
and add bothprinter.c
anderror.c
to the listruntime_code
. This list contains the names of all of the C source files that we use in our compiled programs.
Once you’ve finished making these changes, run your tests again. They will fail because your programs are using a signed 32-bit representation of integers but the
printValue
function expects to print a type-tagged integer. You’ll know everything is working if the printed values are half of what they should be. - Add
-
Change your compilation functions to generate values using the type-tagged representation discussed above: integers are bit-shifted to the left by one and have their lowest bit set to zero. Don’t worry about error checking or
if
-then
-else
yet. Once you’ve fixed this, all of your tests should pass. -
Write code for the AST nodes that are new to Cobra (in both the A-normalization and compilation functions) except for the code handling
print
. You should add and test features one at a time: add in booleans first and write a test to see if e.g. the programtrue
prints “true
”. Then, add in code for<
and then test to see if it works (with something very simple, like the program4 < 5
). Don’t updateif
-then
-else
yet; just get the new Cobra features (except forprint
) working correctly. -
Update
if
-then
-else
to require booleans instead of integers. When you do this, you’ll have to update any old unit tests that useif
-then
-else
as well. When you’re finished, all of your unit tests should pass. Again, don’t worry about error checking yet. -
Add the
ebp
register toassemblyLanguage.ml
. Change your compiler so that memory locations are offset fromebp
instead ofesp
and write code to make yoursnake_main
routine observe the C calling conventions. Currently,compile_program
usescompile_a_expression
to compile your code and then just adds aret
instruction to the end. You’ll need to changecompile_program
so that it adds the appropriate callee behavior instead of just oneret
instruction. This is when you’ll need to write acount_vars_in_a_expression
function or something similar. -
Write code for
print
expressions. This requires your code to observe the caller behavior of the C calling conventions. You’ll need to add the lineextern printValue
to the preamble of assembly generated incompile_to_assembly_code
so that you can use theprintValue
label even though it’s defined elsewhere. -
At this point, all of the language features are in place. Now add error checking to your compiler: each operation that may need to fail must perform the appropriate checks and call
stopWithError
as necessary. As above, you’ll need to addextern stopWithError
to your assembly preamble to be able to call this external function. Make sure to write tests to confirm the behavior of runtime errors as well. Note that not all of the error messages and return codes have been written intoerror.c
; you’ll need to add a couple.
That should cover all of the behaviors of Cobra. Make sure you test each of the steps before moving on to the next one. Success in writing software often depends upon breaking the task into small, approachable pieces and adding bit by bit to your application.
Testing the Cobra Compiler
The testing function test_runtime_failure
in testUtils.ml
makes it easy to create an OUnit test that compiles and runs a given source file and expects it to fail with a particular exit code. You can use this function to test intentionally incorrect code to make sure your compiled Cobra programs handle these errors correctly.
Summary
To complete this assignment, you’ll need to
- Change your compiler’s memory model to keep track of the types of values
- Update your compiler to use C calling conventions
- Extend your compiler to the new features of the Cobra language
- Add error-checking code to your compiled programs to gracefully handle runtime type errors
Submitting
To submit your lab, just commit and push your work. The most recent pushed commit as of the due date will be graded. For more information on when assignments are due, see the Policies page.
If You Have Trouble…
…then please contact your instructor! Piazza is the preferred method, but you can reach out via e-mail as well. Good luck!