Compilers

Gull

gull
Gull

Due on Thursday, April 21st at 11:59 PM. This is a compiler lab. If you have a partner, the two of you will complete this lab as a team. If you do not have a partner, you are effectively in a team by yourself.

If you are working with a partner, please be familiar with the Partner Etiquette guidelines. You and your partner share a single repository and, barring unusual circumstances, will receive the same grade. If you experience any difficulties in your partnership, please alert your instructor as soon as possible.

If you are working alone, please don’t hesitate to seek help if you get stuck. Since there are no ninjas for this course and you don’t have a partner, your instructor is the only interactive resource available to you and is happy to help you through any problems you might encounter.

In either case, be familiar with the academic integrity policy! You are permitted to discuss high-level details of your compiler project with other students, but you should never see or share anything that will be turned in for a grade. If you need help, please post on the course forum or, if necessary, contact the instructor directly. Make sure to post privately if you are sharing code. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.

Overview

In this lab, we will extend the Falcon compiler to a new language called Gull. Gull introduces mutation (the ability to change some variables or memory locations). Gull also introduces a garbage collector, a tool which automatically manages the allocation of memory in user programs. The Python runtime, for instance, has a garbage collector while the C runtime does not.

Transitioning to Gull

As usual, you’ll run some Git commands to prepare for Gull:

git fetch
git checkout gull
git merge falcon

Note that you may instead git merge finch if you prefer.

You’ll have some small changes in src/languages/ to accommodate mutation. You’ll also have two new resource files – resources/gc.h and resources/gc.c – which will contain the code for our garbage collector.

After merging, you should be able to make clean, make tests, and run tests.byte. You’ll get some warnings about the new AST constructors as we’ve had in previous labs, but all of your Falcon tests should compile.

The Gull Language

Gull is syntactically identical to Falcon except for the addition of an expression which can change the value a tuple element. The garbage collector will run automatically as our program runs and does not affect the syntax of the language.

Concrete Syntax

As the addition to Gull is very small, we show only the new syntactic form here:

<expression> ::=
  | <expression>[<expression>] := <expression>
  ...

All other syntax is the same as in Falcon.

Abstract Syntax

As usual, you can see the definitions of expressions for Gull in src/language/asts.ml. The only addition is the ESet constructor to accommodate the new expression above.

Semantics

The new expression is the first form of mutation in our language! The := syntax allows us to change the value in a particular position of a tuple; this update will be visible to anyone with a reference to the tuple. The update always returns the value which was assigned to the tuple (much like print). For instance,

let t = (1,2,3) in
let x = (t[0] := 5) in
t[0] + t[1] + x

evaluates to 12. First, a tuple (1,2,3) is allocated on the heap and a pointer to it is assigned to the variable t. Then, t[0] := 5 changes the first element of the tuple to a 5, so all references to it now point to heap memory containing (5,2,3). The assignment returns the 5, so x is bound to that value. Finally, t[0] + t[1] + x evaluates to 5 + 2 + 5: the t[0] value has changed to a 5, the t[1] value is still the 2 from before, and x was bound to the 5 produced by the update.

Note that this new update syntax allows us to create cycles in our memory graph, which leads to interesting scenarios for garbage collection! For instance, we can write

let t = (1,0) in
let x = (t[1] := t) in
t

to produce a tuple whose second element points to the tuple itself. Make sure to try cases like this in your garbage collector once basic functionality appears to be working!

Runtime Errors

Runtime errors for the tuple update syntax are the same as errors for the tuple access syntax:

Implementing Mutation

If you like, you can set the rest of this lab aside for now and implement the tuple mutation feature described above. This is not the main part of the lab; we’re only adding this feature to test how your future garbage collector reacts to cycles in data. Fortunately, you can use the code you’ve written for tuple access as a guide to implement this feature quickly. You’ll want to do so before you start on your garbage collector; whether you’d prefer to implement this feature first or read the rest of the lab first is up to you.

Supporting Garbage Collection

The bulk of the lab assignment is the implementation of a Mark-Compact garbage collection algorithm. To save us a great deal of tedium and frustration, we will write this garbage collector in C rather than x86-64 assembly. The new files resources/gc.h and resources/gc.c will be included in any program your compiler builds and you will make use of this code to automatically reclaim any memory that the program is no longer using.

This section of the lab describes the changes we will make to our compiler to support garbage collection. Later in this lab, we will describe the mark-compact algorithm we will use. We will also discuss an implementation plan for approaching the problem.

Cleaning Up the Stack

Our garbage collector will work by looking through the stack and heap for pointers to heap objects and then reclaiming memory used by heap objects for which we don’t have pointers. In order to do this correctly and safely, we need to be sure that the stack contains only valid heap pointers. If old pointers are left on the stack, they may prevent the reclamation of memory or even crash the program (if the garbage collector tries to use an old pointer value). For instance, consider this code:

let a =
  let x = 4 in
  let t = (1,2) in
  x
in
some_function a

In the above code, we may have the following stack memory assignments: a maps to [rbp-8], x maps to [rbp-8] (because a and x never exist at the same time), and t maps to [rbp-16]. This means that the pointer to the tuple (1,2) will be stored in [rbp-16] and then left there even after the variable t goes out of scope. The garbage collector will not be able to tell that t is no longer bound, so the pointer in [rbp-16] will prevent the tuple (1,2) from being collected.

A more extreme problem can arise due to the left-over pointers that previous function calls leave in stack memory: future function calls may allocate that stack memory for their local variables and then cause a GC call before that memory is used, causing the garbage collector to use a left-over (and possibly invalid) pointer during collection; this could lead to a segmentation fault!

We can solve this problem just by making sure there’s never any junk on the stack. To do this, we just need to zero out some memory.

The above steps allow us to maintain an invariant that any memory in the stack which is not actively in use will contain a zero. This way, our garbage collector will only consider valid pointers. There are a number of ways you might choose to zero out memory. One of them is an instruction, rep stosq, which works much like rep movsq but copies rcx 64-bit words of memory into the address in rdi from the rax register (rather than from the address in rsi). If you set rax to zero, you can use rep stosq to zero out arbitrary memory without writing a loop yourself.

Heap Layout

To avoid complexity, we will use an algorithm similar to the LISP2 variation of mark-compact. This requires us to have enough memory to store some garbage collection data about each object on the heap. To make sure we reserve enough memory, we will store this information in the heap itself!

We will modify tuples and closures so that the second word of memory is reserved for the garbage collector. Everything other than the first word moves to the right to make room. For instance, the tuple (2,5) will be represented as follows:

tag & size (64 bits) GC word (64 bits) 2 (64 bits) 5 (64 bits)
0x0000000000000002 0x000000000000000 0x0000000000000004 0x000000000000000a

As another example, the closure for a four parameter function which currently carries two arguments 6 and 8 might look like this:

tag & arg count (64 bits) GC word (64 bits) param count (64 bits) code address (64 bits) 6 (64 bits) 8 (64 bits)
0x8000000000000002 0x0000000000000000 0x0000000000000004 0x00000000080484f8 0x000000000000000c 0x0000000000000010

The GC word will always be 0x0000000000000000 except when the garbage collector is running.

Global Variables

The addition of tuples in Eagle prompted the addition of a global variable heap_cursor. This variable contains a pointer to the next free byte of heap memory. If we are to manage memory properly, we will need additional information. These variables will be defined within your assembly code, but they are declared (as extern) within resources/gc.c so our garbage collection code can make use of them. These variables are:

In order for the linker to connect your gc.c file’s extern declaration to the labels that you have created, your assembly file will need to declare those labels as global (much like we did with main function in the Hatchling lab and as we have done in every lab since with bird_main). Likewise, for your assembly program to call the gc function in gc.c, you’ll need to declare gc as an extern label in your assembly file.

Calling the Garbage Collector

So far, we have simply incremented heap_cursor whenever we needed more memory. If the program allocates enough memory to move the heap_cursor beyond the end of the heap, it simply crashes. In this lab, our program will instead call the garbage collector.

There are two places in the compiler which allocate more memory: the creation of a tuple and the creation of new closures. We will modify the assembly generated in those cases to

If the heap is full and gc is called, the gc function may change the value of heap_cursor and free up enough memory for the program to use. If not enough memory can be freed (because it’s actually all in use), the gc function will terminate the program with exit code 7, which we will take to mean “out of memory”. (We could instead choose to try to allocate a larger heap at this point. For this lab, we choose not to try to be that clever.)

Make sure to read the documentation in gc.h to understand how the gc function is expected to behave!

Mark-Compact Garbage Collection

We will now discuss the variation of mark-compact algorithm we will use for our compiler. This algorithm consists of five phases:

  1. Mark, which identifies all heap objects which are still reachable.
  2. Forward, which determines new memory locations for all reachable heap objects (but does not yet move them).
  3. Update, which changes all pointers on the heap to refer to the new memory locations (even though the objects haven’t moved yet).
  4. Compact, which moves all reachable heap objects into their new locations (so that the updated pointers are now correct).
  5. Unmark, which zeroes the GC words of reachable heap objects (to be ready for the next garbage collection pass).

There are more phases in this version of mark-compact than are strictly necessary, but this separation makes it easier to approach each phase one at a time.

Phase 1: Mark

The first phase of this algorithm requires us to identify which heap objects are reachable; any unreachable heap objects can be discarded. A heap object is reachable if

We can use a depth-first graph traversal algorithm similar to the depth-first search from CS35. Do not create your own stack data structure; instead, you can rely on the program call stack to perform your search recursively. Such a recursive approach has potential downsides in a language like C, but we’ll ignore them for this lab.

To find all of the pointers to our heap, we can just scan the stack from top to bottom. Between start_of_stack and end_of_stack, our stack has the following kinds of information on it:

On the other hand, our heap pointers have the following properties:

Non-pointer values will not end in the bits 01, so we can easily avoid e.g. treating a boolean like a heap pointer. Pointers into our stack and pointers to code will not be within the bounds of our heap, so we can use start_of_heap and heap_cursor to ignore those values as well.

Using these observations, we will perform a depth-first traversal starting with each heap pointer on the stack. We will mark the GC word of each object we reach during this traversal with a non-zero value (e.g. 0x0000000000000001). Since all GC words are 0x0000000000000000 at the start of garbage collection, any heap object which still has a zero GC word after the mark phase is unreachable and can be destroyed.

Phase 2: Forward

After we have identified reachable heap objects, we will use the GC words to pick new locations for them. To do this, we will use the following algorithm:

By the time we are finished forwarding, each reachable heap object’s GC word will contain a new heap address (in machine representation – it won’t have the 01 bits on the end). Our goal is to move all of the heap objects to that location, but we have to make sure that every pointer is updated as well!

Phase 3: Update

Before we move the heap objects, we want to update any pointers to them. If we moved the objects first, we’d have no memory-efficient way to determine how to transform old pointers to new pointers. Before we move the heap objects, though, each heap pointer points almost exactly to the new memory address!

For instance, if we have a heap pointer value 0x00000000017822a9, this refers to memory address 0x00000000017822a8 (after we strip the 01 from the end). Because every heap object’s GC word is eight bytes from the beginning of the object, the new pointer to replace 0x00000000017822a8 is at memory location 0x00000000017822b0. We can read that value, add 01, and use this result as the new pointer.

Since there may be heap pointers in the stack or within heap objects themselves, we must perform this update on all heap pointers we find in either of these places. Note that this is not a depth-first search! Instead, we can simply iterate over all stack values between start_of_stack and end_of_stack (non-recursively) and then iterate over all heap objects between start_of_heap and heap_cursor (as we did in the Forward Phase above).

Phase 4: Compact

Once we have updated all of the pointers, the task remains to move each heap object to its new location. Once we have done this, all of the new pointers we set in the Update Phase will point to the same data as they did before the garbage collection began. This phase is comparatively simple (if a little delicate): we must iterate over each object from start_of_heap to heap_cursor (as we did in previous phases), copying those objects into the memory location described by their GC words.

Phase 5: Unmark

The garbage collection algorithm relies upon unreachable objects having the value 0x0000000000000000 in their GC words. To make sure this is true in the future, we must now set the GC word of each heap object back to zero. We can use the same recursive depth-first traversal we used in the Mark Phase to do this; we could also simply iterate over the stack and the heap as we did in step 3 (only unmarking those objects which are still in allocated heap memory). Either approach is fine.

Tidying Up

At the end of this process, our gc function will need to perform a few tasks:

Implementing the Gull Compiler

This implementation task is challenging, but not for its complexity: the above garbage collection algorithm is described in five steps, each of which is a depth-first search or an iteration over the stack and heap. The challenging part of this lab lies in how fragile the algorithm’s invariants are and how many details there are to observe. When writing your garbage collector, try to move as slowly as possible! It takes a lot longer to find bugs in code like this, so it’s to your benefit to take extra time during implementation to try to create as few bugs as possible in your first draft.

Here’s a strategy you can use to guide your implementation:

  1. Implement the tuple update language feature (t[i] := v) if you haven’t done so already. Write some tests for it.

  2. Modify your compiler to zero out unused stack memory. This will not involve significant changes to compiler.ml. After you are finished, your unit tests should all pass.

  3. Modify your compiler to use the new heap layout that includes a GC word as the second word of each heap object. This will mostly involve editing the assembly code you generated for Eagle and Falcon to allocate space for the GC word and use updated memory addresses. Make sure to update print.c as well; the exercise of doing so will give you practice with the new heap layout. After you’re finished, all of your unit tests should run correctly. Make sure your unit tests are complete enough to find mistakes you might’ve made here: use asymmetric operations (like subtraction), use more than one parameter in a few functions, etc.

  4. Add the global variable declarations for start_of_heap, end_of_heap, start_of_stack, and end_of_stack to your generated assembly code. Add gc.c to the list of compiled files in builder.ml and make sure all of your tests still pass.

  5. Set the global variables properly. This will require you to add a new parameter to the bird_main function to represent the end of the heap as well as modify your bird_main assembly code to copy the argument into the end_of_heap variable. You should set start_of_heap, end_of_heap, heap_cursor, and start_of_stack at the beginning of bird_main. The end_of_stack variable will be set just before we call the garbage collector.

    You might also want to take this opportunity to set your heap to something small (like 32 words) so it’ll be easy to look at the whole heap and test the garbage collector as you write it.

    Once you’ve made these changes, make sure your tests still pass.

  6. Modify your code for ETuple to call the gc function whenever there isn’t enough heap memory to create a new tuple. (Remember that you’ll need to declare gc as an extern label for it to be accessible.) For now, leave gc as it is: whenever the garbage collector is called, your program will just stop with an “out of memory” error. Make sure to set end_of_stack before you push the argument for gc onto the stack. See below in this lab write-up for hints as to how to test your garbage collector. Your previous unit tests should all pass.

  7. Likewise modify your code for EAppl, calling the gc function whenever a new closure must be constructed but there isn’t enough memory to do so. Again, make sure to set end_of_stack and see below for some hints on testing. Your unit tests should still pass.

  8. Implement the mark-and-compact garbage collector. You should implement and then manually test each phase of the collector individually. It’s pretty hard to write automated tests for the garbage collector, so we’ll have to stick to manual tests until you have a working first draft, but you can use gdb and other debugging tools (see below) to determine if e.g. your Mark Phase is working before you write the other phases.

    You’ll want to write several helper functions for your garbage collector. It’s helpful, for instance, to have a function which determines the size of a heap object given a pointer to it. You could also write functions to perform each phase of garbage collection and even break the recursive phases down into more helpers. Don’t try to write your garbage collector as one giant C function!

  9. Write unit tests to verify that your garbage collector is working as intended. Cases you’ll want to test include:

    • Allocating and deallocating a lot of memory to make sure the garbage collector runs and frees up memory correctly.
    • Allocating and deallocating even more memory to make sure the garbage collector runs more than once in the execution of a given program (to verify that GC invariants are correctly re-established by the e.g. Unmark Phase).
    • Allocating a lot of memory but not deallocating it to make sure the garbage collector terminates the program with an “out of memory” error when appropriate.
    • Doing all of the above with both tuples and closures to make sure your code works for both cases.
    • Trigger garbage collection when at least one cycle exists in the heap.
    • Increasing your heap size in driver.c back up to at least 1048576 words to make sure your garbage collector works well on larger heaps.

    Again, see below for some hints about testing.

Once you’ve completed the above steps, you’ll have a garbage collector for your programming language’s runtime!

Testing the Garbage Collector with Gull Code

To test our garbage collector, we will need to write Gull code that uses up memory in specific patterns. In particular, we want to be able to either allocate and hold a lot of memory or be able to allocate and release a lot of memory; we also want to be able to do so with tuples, closures, or both. Here are some suggestions for how to do so.

To create both types of memory, you could write a program which mutually recurses between creating tuples and closures.

Debugging the Garbage Collector

The initial gc.c file includes a macro called debugf which will help you debug your mark-compact garbage collector. The debugf macro works just like the C printf function: it accepts a format string and a series of values and prints them. For instance, you can write

debugf(
  "Marking heap object at memory location %p: writing %d to address %p",
  heap_ptr, mark_value, gc_word_ptr);

The value of debugf over printf is that the gc.c file contains two C preprocessor macros (one of which is commented out) that control whether debugf does anything or not. You can have the second macro uncommented:

// This macro disables all debugf statements.  (They become no-ops.)
// #define debugf(fmt, ...) ;

// This macro enables all debugf statements.  (They become printf statements.)
#define debugf(fmt, ...) printf(fmt, ##__VA_ARGS__); fflush(stdout)

This will cause debugf to behave just like printf. This is useful because you can have print debugging as your garbage collector runs, but it will break all of your unit tests (since they don’t expect to get the debugf output). When you want to run your unit tests, use the first macro (as in the gc.c you are given in the starter code) like so:

// This macro disables all debugf statements.  (They become no-ops.)
#define debugf(fmt, ...) ;

// This macro enables all debugf statements.  (They become printf statements.)
// #define debugf(fmt, ...) printf(fmt, ##__VA_ARGS__); fflush(stdout)

This will cause every debugf call to do nothing. You don’t need to comment out those debugf calls or do anything to them; changing the macro as above is enough to silence all of this output. This way, you can switch to silent debugf to run your unit tests and then switch back to noisy debugf to debug your garbage collector.

Examining the Heap

gdb provides a few mechanisms for examining heap memory. While debugging gc or one of the helper functions you will write, you can use gdb commands like print/x start_of_heap to print the start_of_heap variable in hexadecimal. Sometimes, however, you want to see the entire heap. The provided gc.c file contains a function dump_heap which will show the entire contents of the heap (as described by start_of_heap and end_of_heap). This function uses debugf and so can be silenced using the macros above.

You may wish to consider dumping the heap before and after each phase of the garbage collector or even e.g. after you move each heap object. This combined with step debugging should provide you ample tools to understand what your garbage collection code is doing and why. As always, please ask if you have any questions or need help figuring out your bugs!

Some Notes About C

We are writing our garbage collector in C to avoid the error-prone and tedious nature of writing large amounts of assembly code. But while our assembly code has been able to treat every pointer as a simple number, C has a type system which assigns different meaning to pointer variables than to numeric values and which is insistent about keeping those types separate. This is a boon when it helps us avoid simple mistakes in our code (like assigning an integer to an integer pointer variable by accident) but a problem when we’re doing complex things with our own garbage collector (like assigning an integer to an integer pointer variable on purpose).

Here are some notes about C which may be helpful as you develop your garbage collector:

Summary

To complete this assignment, you’ll need to

The second part is the vast majority of the work.

Submitting

It is not enough just to push your code. Due dates are flexible, so it is necessary for you to take action to make it clear which commit you would like to have graded. We will use Git tags for this. A tag is a way of giving a human-readable name to a particular commit. Unlike a branch, however, tags are expected not to change once they are created.

When you are finished working on this compiler assignment, commit and push your work. Then, once you are sure that there are no additional changes you need to make, run the following commands:

$ git tag gull-submission
$ git push --tags

This will create a tag named gull-submission and push that tag to the Swarthmore GitHub Enterprise server. Your work on that tag will be graded.

In addition to pushing and tagging your work, you will need to fill out a brief questionnaire found here. In most cases, the questionnaire should take less than a minute. This questionnaire is required and will be used as part of your participation grade.

If You Have Trouble…

…then please contact your instructor! The course forum is the preferred method, but you can reach out via e-mail as well. Good luck!