Compilers

Cardinal

Source: GeoffClarke [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0), via Wikimedia Commons

Due on Wednesday, February 20th at 11:59 PM. This is a compiler lab. If you have a partner, the two of you will complete this lab as a team. If you do not have a partner, you are effectively in a team by yourself.

If you are working with a partner, please be familiar with the Partner Etiquette guidelines. You and your partner share a single repository and, barring unusual circumstances, will receive the same grade. If you experience any difficulties in your partnership, please alert your instructor as soon as possible.

If you are working alone, please don’t hesitate to seek help if you get stuck. Since there are no ninjas for this course and you don’t have a partner, your instructor is the only interactive resource available to you and is happy to help you through any problems you might encounter.

In either case, be familiar with the academic integrity policy! You are permitted to discuss high-level details of your compiler project with other students, but you should never see or share anything that will be turned in for a grade. If you need help, please post on the Piazza forum or, if necessary, contact the instructor directly. Make sure to post privately if you are sharing code. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.

Overview

This lab will extend the Bluebird compiler to new language called Cardinal. Cardinal includes boolean values and operations as well as a printing primitive. Unlike Bluebird, Cardinal has “real” booleans: false and 0 will be different values and Cardinal programs will check their types at runtime. Just as with your previous assignment, this assignment uses the same compiler repository; we will continue this throughout the rest of the course.

This page will:

Discuss the syntax and semantics of Cardinal
Describe the obligations your code will have in calling and being called by C functions
Give a reference for some relevant assembly instructions
Suggest a strategy for approaching the assignment

Transitioning to Cardinal

As with the previous lab, the starter files for Cardinal have been stored in a branch in your Git repository. To start work on Cardinal, you must first move to the Cardinal branch with

git fetch
git checkout cardinal

This will move you to a version of the compiler which contains Cardinal starter files and nothing else. Next, run

git merge bluebird

to merge your previous work – the Bluebird compiler – into this starter code.

Once you’ve done this, you’ll find some changes to your repository.

src/language has changed again; the ASTs and parser here have been updated to include new Cardinal language features.
resources/printer.h and resources/printer.c, two new files that will be used by Cardinal programs (in addition to driver.c) in order to print Cardinal values.
resources/error.h and resources/error.h, which we will use for error handling in Cardinal programs.

Make sure to run make clean and make to verify that your merge was successful. Upon doing this, you’ll probably get a lot of OCaml compiler warning messages – these are important and we’ll deal with them later – but your compiler should still be able to compile and run Bluebird programs. Cardinal is an extension of the Bluebird language, so everything that worked in Bluebird should still work now. You can verify this by running your unit tests.

The Cardinal Language

We begin by introducing the concrete syntax, the abstract syntax, and the semantics of the Cardinal language. The primary syntactic differences are the presence of booleans and several new operators, including print: an operator which will display any value.

Concrete Syntax

The concrete syntax of Cardinal is:

<expression> ::=
  | true
  | false
  | <integer>
  | after(<expression>)
  | before(<expression>)
  | print(<expression>)
  | isbool(<expression>)
  | isint(<expression>)
  | <expression> + <expression>
  | <expression> - <expression>
  | <expression> * <expression>
  | <expression> < <expression>
  | <expression> > <expression>
  | <expression> = <expression>
  | <expression> && <expression>
  | <expression> || <expression>
  | <identifier>
  | let <identifier> = <expression> in <expression>
  | ifnz <expression> then <expression> else <expression>
  | if <expression> then <expression> else <expression>
  | (<expression>)

Syntactic precedence still follows OCaml. In order from highest to lowest precedence, the operators are:

*
+ -
< = >
&&
||
if-then-else, ifnz-then-else, and let.

Abstract Syntax

As with previous languages, the file src/language/asts.ml contains an abstract syntax for the concrete syntax given above. No new ASTs were added, but new constructors have been introduced for each of the above syntactic forms.

Semantics

Cardinal has two distinct types of values: integers and booleans. These values are distinct in that there is no situation in which any integer value (e.g. 4) should be interchangeable with any boolean value (e.g. true) and vice versa; that is, 4 < true always results in a runtime error rather than producing a value. Enforcing this will require changes throughout the compiler.

The binary operators for Cardinal are monomorphic: each of them only works on one pair of types. The < operator, for instance, only works on pairs of integers, while the && operator only works on pairs of booleans. This is of particular relevance with the = operator, which only works on pairs of integers. The expression true = true does not evaluate to true; it produces a runtime error.

Note that Cardinal has both ifnz and if. The ifnz semantics are the same as in Bluebird: the “condition” is expected to be a number, a zero value executes the else branch, and any non-zero value executes the then branch. In Cardinal, the if expression works more like if-then-else in OCaml: the condition must be a boolean and not an integer. Using an integer with an if expression (or a boolean with an ifnz expression) will produce a runtime error – that is, crash the program – which we discuss below.

Memory Representation

In Bluebird, all values were integers stored in 32-bit signed form. Cardinal has two types – integers and booleans – and, to exhibit the behavior described above, our Cardinal programs must be able to distinguish between integers and booleans at runtime. We will do this by choosing a memory layout that makes it easy to distinguish between and operate on those values:

true will be represented with the 32-bit value 0xFFFFFFFF
false will be represented with the 32-bit value 0x7FFFFFFF
Integers will be represented using a 32-bit signed value twice as large as the integer value; that is, 5 would be represented as 0x0000000A (which is the 32-bit signed representation for 10). This is the same as we discussed in class.

Note that, in this representation, the rightmost bit of every integer is 0 and the rightmost bit of both booleans is 1. We will discuss below a strategy for adjusting the compiler to use this representation.

Printing

A new unary operator (similar to after and before) called print appears in Cardinal’s syntax. This operator prints out a human-readable text representation of its argument and then returns that same value. The argument is in terms of our representation above, so we interpret that representation in the new resource file printer.c. We use another C file here (similar to our driver.c) so we can print messages without being concerned about the system-specific details of interacting with the user.

Note that this printing occurs independently from the printing of the final value once the program is finished. For instance, the code

let x = 4 in
let y = print(x) in
print(y > 6)

will print

4
false
false

The 4 is printed by the expression print(x); this expression also evaluates to 4, so y is bound to 4. This means that print(y > 6) will print false and also evaluate to false. Since the overall expression evaluates to false, this is printed when the program terminates.

Errors

The semantics of Cardinal also include error conditions, when we will terminate the program at runtime without producing a value. When an error arises, we will print a message and terminate the program with a non-zero exit code. (The implementation of this is discussed below.) The following conditions will produce an error:

The arithmetic unary operators after and before should require that their arguments are integers. If this is not the case, we terminate with exit code 1.
The arithmetic binary operators +, -, *, <, and > should require that both of their arguments are integers. If either argument is not, we terminate with exit code 1.
The condition in an ifnz-then-else must be an integer. If it is not, we terminate with exit code 1.
The logical binary operators && and || should require that both of their arguments are booleans. If either argument is not, we terminate with exit code 2.
The condition in an if-then-else must be a boolean. If it is not, we terminate with exit code 2.
No arithmetic operation may trigger overflow. If this occurs, we terminate with exit code 3.

In general, we assign meanings to the exit codes of Cardinal programs:

0 indicates that the program exited normally.
1 indicates that an integer was expected but some other type was found.
2 indicates that a boolean was expected but some other type was found.
3 indicates that an arithmetic error such as integer overflow has occurred.

When error checking, you are encouraged to evaluate both sides of the expression before checking whether or not they have the right type. First: it’s likely that this form of checking will be easier than checking each argument as you calculate it. Second: most sophisticated languages behave in this manner to support more complex language features.

In a fashion similar to the above, we will handle errors using the error.c resource file. This file defines a C function which will print a message and immediately terminate the process with the provided error code.

As an example, the code

let x = 4 in
if x - 4 then 5 else 0

will terminate with exit code 2 because x - 4 evaluates to an integer and not a boolean.

Interacting with the World

The features of Cardinal allow the programmer to interact with the rest of the system in a more meaningful way. To do this, you’ll need to know the C function calling conventions so your code can call (and be called by) C correctly. You’ll also need to use some new assembly instructions.

C Calling Conventions

Code that either calls or is called by a C function must conform to a set of conventions in order to be linked properly. Each call has a caller (the code making the function call) and a callee (the code of the function being called). The convention is this:

Before calling a function, the caller must
1. Push the function call arguments onto the stack.
2. Use the call instruction to jump to the function’s code. This pushes the return instruction address onto the stack.
At the very beginning of a call, the callee must
1. Push ebp onto the stack to save it.
2. Copy esp into ebp to form the new base of the stack.
3. Move esp to make room for all of the local variables in the function. (See below.)
At the very end of the call, the callee must
1. Leave the return value in eax.
2. Copy ebp into esp to restore the old value of esp.
3. Pop into ebp the old value of the stack’s base pointer.
4. Execute the ret instruction to pop and jump to the return address that was pushed when the call began.
After the call returns, the caller must
1. Pop the function call arguments off of the stack.

There are also requirements for the caller and callee to protect the values of certain registers. For instance, if the caller cares about the value in eax, it is the duty of the caller to save that value (since the callee is expected to overwrite it). We are not using any registers that require special handling right now, so we don’t have to worry about this.

(Note: the above treatment of C calling conventions ignores one important property – so-called “callee- and caller-saved registers” – which we will ignore for now.)

Allocating Enough Stack Space

So far, we’ve been intentionally sloppy about how we access stack memory: we’ve just been using memory beyond the location stored in esp. In the C calling convention, however, we are expected to move esp to make room for our local variables. In particular, all of our local variables should live between ebp and esp.

In observing this calling convention, you can just change all of your esp offsets to use ebp instead. Then, to make sure that functions that you call do not use this memory, you must adjust esp to account for all of your local variables (including temporary variables). Probably the simplest way to handle this is to write a special function to count the number of variables in use at one time; then, you can call this function from within compile_program. Your function might look like this:

 ```ocaml
 let rec required_stack_memory_for_expression (e : expr) : int =
     match e with
     | EInt _ -> 0
     | EUnaryOp(_,e') -> required_stack_memory_for_expression e'
     | EBinaryOp(_,e1,e2) ->
       List.max [required_stack_memory_for_expression e1;
                 4 + required_stack_memory_for_expression e2;
                 8
                ]
     ...
 ;;
 ```

Note that the EBinaryOp case is a bit tricky and the correct answer depends on how you solved that problem in your compile_expression function. The above example is accurate for the simplest solution discussed in class:

Compile the first expression
Allocate a temporary variable and copy the result of the first expression into it
Compile the second expression
Allocate another temporary variable and copy the result of the second expression into it
Copy the first result back into a register and then operate on it using the second result

When using this algorithm to calculate the amount of space used by a binary operation, we take the largest of three numbers:

The amount of memory used when calculating the first expression
The amount of memory used when calculating the second expression plus the 4 bytes we are using to store the result of the first expression while the second expression is being calculated
The 8 bytes we used to store the results of both expressions while we are operating on them

New Assembly Constructors

Here are some constructors you’ll probably want to include in your assemblyLanguage.ml file.

APush of arg and APop of arg

The push and pop assembly instructions will copy values to and from stack memory and adjust esp accordingly. For instance, push eax pushes the value of eax onto the stack and then moves esp by four bytes.
AShl of arg * arg and AShr of arg * arg

The shl and shr instructions shift an argument left or right by some number of bits. For instance, shl eax, 2 shifts the eax register to the left by two bits. After this operation, the two lowest bits of eax will be zeros.
ASal of arg * arg and ASar of arg * arg

The sal and sar instructions work like shl and shr but are the “arithmetic shift” operations. This means that they preserve the sign bit, so sar eax, 1 will turn a -8 into -4. Using shr instead would also shift the highest bit, resulting in a very large positive number. (Recall two’s complement!)
AAnd of arg * arg, AOr of arg * arg, and Axor of arg * arg

The and, or, and xor operations take two arguments (similar to e.g. add) and perform bitwise masking operations.
AJo of string and AJno of string

These jump instructions will jump to the provided label if overflow occurs (jo) or if it does not (jno).
ACall of string

The call instruction is used to invoke a subroutine. It pushes the address of the next instruction onto the stack and then jumps to the provided label (as in call printValue. The ret instruction does the opposite of this, popping the instruction address and jumping to it.
ArgSized of string * argument

This constructor should be added to the argument type (and not the instruction type). In some cases, the assembler may complain that the size of an operation is ambiguous. It may, for instance, complain if you write
```
mov [ebp-8], 0
```
because it’s not clear how big the 0 is intended to be; this may be a one-byte zero, a two-byte zero, or a four-byte zero. The intention of ArgSized is to allow you to specify this. You might, for instance, write
```
mov [ebp-8], DWORD 0
```
to indicate that you want the 0 to be four bytes large. With the ArgSized constructor, the corresponding abstract syntax would be
```
AMov(ArgMemory(AddressByRegisterOffset(EBP, -8)), ArgSized("DWORD", ArgConstant(0)))
```
If you wish, you could make a type size = BYTE | WORD | DWORD declaration and then use this type instead of string in your ArgSized constructor (for type safety).

Implementing the Cardinal Compiler

Here’s a strategy you can use to transition from Bluebird to Cardinal incrementally, which will hopefully make the process easier.

As stated above, perform your git checkout and git merge operations. Then, immediately make clean, make tests, and run ./tests.byte. You’ll get several compiler warnings, but your tests should run just fine (as long as they did when you finished Bluebird!).
Now, change driver.c to use the printValue function instead of printf. The printValue function is in the printer.c file, so you’ll need to do two things:
- Add #include "printer.h" to driver.c, as is usual for C programs
- Open builder.ml and add both printer.c and error.c to the list runtime_code. This list contains the names of all of the C source files that we use in our compiled programs.
Once you’ve finished making these changes, run your tests again. They will fail because your programs are using a signed 32-bit representation of integers but the printValue function expects to print a type-tagged integer. You’ll know everything is working if most of the printed values are half of what they’re expected to be (due to how printValue interprets the int it is given).
Change your compilation functions to generate values using the type-tagged representation discussed above: integers are bit-shifted to the left by one and have their lowest bit set to zero. Don’t worry about error checking, printing, or if-then-else yet. Once you’ve fixed this, all of your tests should pass.
Write code for the AST nodes that are new to Cardinal except for the code handling print. You should add and test features one at a time: add in booleans first and write a test to see if e.g. the program true prints “true”. Then, add in code for < and then test to see if it works (with something very simple, like the program 4 < 5).
Add the ebp register to assemblyLanguage.ml. Change your compiler so that memory locations are offset from ebp instead of esp and write code to make your bird_main routine observe the C calling conventions. Currently, compile_program uses compile_expression to compile your code and then just adds a ret instruction to the end. You’ll need to change compile_program so that it adds the appropriate callee behavior instead of just one ret instruction. This is when you’ll need to calculate how much stack memory the expression requires.
Write code for print expressions. This requires your code to observe the caller behavior of the C calling conventions. You’ll need to add the line extern printValue to the preamble of assembly generated in compile_to_assembly_code so that you can use the printValue label even though it’s defined elsewhere.
At this point, all of the language features are in place. Now add error checking to your compiler: each operation that may need to fail must perform the appropriate checks and call stopWithError as necessary. As above, you’ll need to add extern stopWithError to your assembly preamble to be able to call this external function. Feel free to use the ebx register to simplify the process of error checking.

Make sure to write tests to confirm the behavior of runtime errors as well. Note that not all of the error messages and return codes have been written into error.c; you’ll need to add a couple.

That should cover all of the behaviors of Cardinal. Make sure you test each of the steps before moving on to the next one. Success in writing software often depends upon breaking the task into small, approachable pieces and adding bit by bit to your application.

Testing the Cardinal Compiler

Failing On Purpose

The testing function test_runtime_failure in testUtils.ml makes it easy to create an OUnit test that compiles and runs a given source file and expects it to fail with a particular exit code. You can use this function to test intentionally incorrect code to make sure your compiled Cardinal programs handle these errors correctly.

Debugging with `gdb`

You can use gdb to step-debug your compiled programs! You can start by running gdb from the command line just as you would for a C or C++ program:

$ gdb output/program.run

When you are greeted with the gdb prompt, you might want to enter the following commands:

set disassembly-flavor intel: This tells gdb to display assembly language in the Intel syntax we’ve been using in class rather than the AT&T syntax (which is the default).
layout asm: Displays the disassembly of your code.
layout regs: Displays the processor’s registers as you debug the program.
break bird_main: Stops execution upon entering bird_main. After you’ve issued this command, you can type run to run your program up to that point.

To step through your program, use si or stepi (not s or step); this will step one instruction at a time. You can likewise use ni or nexti to step past function calls and finish will allow you to leave a function call you are currently in.

Note that you can create a file in your home directory – ~/.gdbinit – that contains one gdb command per line. These commands will be run when gdb starts, which prevents you from having to reconfigure things every time you run a program. For instance, the following is a .gdbinit file used by your instructor:

set disassembly-flavor intel
layout regs

If you are using gdb in other classes, be aware that you can comment out the lines in .gdbinit with a prefixed #.

Summary

To complete this assignment, you’ll need to

Change your compiler’s memory model to keep track of the types of values
Update your compiler to use C calling conventions
Extend your compiler to the new features of the Cardinal language
Add error-checking code to your compiled programs to gracefully handle runtime type errors

Submitting

To submit your lab, just commit and push your work. The most recent pushed commit on the appropriate branch as of the due date will be graded. For more information on when assignments are due, see the Policies page.

After each lab, you will complete a brief questionnaire found here. In most cases, the questionnaire should take less than a minute. This questionnaire is required and will be used as part of your participation grade.

If You Have Trouble…

…then please contact your instructor! Piazza is the preferred method, but you can reach out via e-mail as well. Good luck!