Hatchling
This is a lab activity. It will not be evaluated for a grade, but work on this activity is required for participation. The purpose of this activity is to familiarize you with the process of building a compiler. We will do this by building an extremely simplistic (and not very useful) compilation tool. Let’s begin by cloning a Git repository:
git clone git@github.swarthmore.edu:cs75-s19/hatchling-<username>.git
If you inspect the contents of that repository, you’ll discover… nothing. There’s no code in it; just a .gitignore
and a .merlin
. This is intentional: we’re going to build a compiler from scratch. It is nice, though, to have a repository to help you keep track of your history.
The Big Picture
In this class, we have the task before us of building source code into a binary executable. The picture below illustrates the strategy that we will use.
The program that the user writes is the “program source”, which we translate into a data structure of some kind. We may perform a variety of transformations on that data structure before translating it into a series of assembly instructions, which we can then build into an object file (.o
) similar to those produced by clang
or gcc
.
We will rely upon existing tools to assemble our code into the .o
file as well as to translate the .o
into an executable. We will also rely upon existing tools to compile a “driver” for the program: the operating-system-specific part of the executable that gets us from the start of our process to the code that we generated. These steps are heavily system-specific and not as critical to this course.
This Activity
In this activity, the first two steps of the above diagram are much simplified:
- The “language” that we will compile contains only integers. Each source file will contain a single integer value and the compiled program will simply print that integer before terminating. “Parsing” is therefore just reading an integer from a file.
- We will not transform this integer in any way during the compilation process; the step named “transformations” above is a no-op.
With these assumptions, we can construct a compilation toolchain in the space of a single lab!
The Driver
In Computer Science, the term driver has taken on several different meanings. In this course, we use a somewhat less common form of the term: a driver is the topmost part of your program that gets everything set up so your code can run. Our driver will be a standard C program with a main
function; this allows us to rely on an existing compiler to do the system-specific operations necessary to get our process running. For this, the following driver.c
should be sufficient:
#include <stdio.h>
int hatchling_main() asm("hatchling_main");
int main(int argc, char** argv) {
int result = hatchling_main();
printf("%d\n", result);
return 0;
}
The hatchling_main
declaration tells the C compiler that it will eventually be linked with an assembly-level function named hatchling_main
. The asm
keyword ensures that the compiler will use the name hatchling_main
for this function at the assembler level.
We can compile this driver into a binary object file:
$ clang -g -c -o driver.o driver.c
Here, the -c
flag tells clang
to build a .o
file instead of an executable. We can think of the .o
file as a fragment of a program; it hasn’t yet been turned into a real program, but some amount of source code has been turned into machine code for us. The -g
flag instructs the compiler to keep debugging information, which is usually a handy thing to have around when you’re writing software.
If we try to compile the driver by itself, we get an error:
$ clang -g driver.o
driver.o: In function `main':
driver.c:6: undefined reference to `hatchling_main'
clang: error: linker command failed with exit code 1 (use -v to see invocation)
This is because hatchling_main
has not yet been defined. We need to develop a process for taking the “source code” (a file containing an integer) and translating it into machine code for hatchling_main
. We’ll do that next.
Assembly Language
Next, we are going to
- Write an assembly program to define
hatchling_main
- Link
hatchling_main
withdriver.c
to create an executable
Note that we are writing the assembly program ourselves; this is just to get a trial run together. Later, we will write our compiler in OCaml to generate this assembly code.
For this course, we will generate 32-bit x86 assembly using the so-called “Intel syntax”; you can use this guide by David Evans at the University of Virginia as an x86 assembly language reference. Here is a simple assembly program that defines a hatchling_main
function to return the value 37:
section .text
global hatchling_main
hatchling_main:
mov eax, 37
ret
Let’s review this code one line at a time:
section .text
: The following content (up until the nextsection
) is code.global hatchling_main
: The labelhatchling_main
should be visible to the rest of the world. This way, the linker can see that symbol in the resulting.o
file.hatchling_main:
: This defines the labelhatchling_main
. When other code jumps to the symbolhatchling_main
, that code is jumping here.mov eax, 37
: Put37
intoeax
, the register conventionally used for return values in C. Ourdriver.c
call tohatchling_main
will look ineax
after we return to find the result; this way, it finds our37
.ret
: Return to the caller by performing standard stack operations. The C compiler sets up the stack in a way thatret
causes us to return to the point right afterhatchling_main
was called.
We can put this file in example.s
to serve as an example assembly program for now; we’ll talk about generating it later. Let’s build this example.s
into a program object and link it into the wrapper to create an executable.
Building Our Program
There are a variety of assemblers – tools that translate assembly into machine-readable formats – that we could use. For this course, we will use nasm
, which is already installed on the department machines. We will use nasm
to turn our assembly code (.s
) into binary object files (.o
) so that we can link our assembly with our wrapper. We can use the following command to build our example above:
$ nasm -f elf32 -o example.o example.s
The -o
flag, like with gcc
or clang
, tells nasm
what output file to create. The -f
flag tells nasm
to build our file in a particular format; here, elf32
stands for the 32-bit Executable and Linkable Format (ELF) which is commonly used by Linux systems. (On Mac OS X, you can instead use -f macho
to build a binary object file that Mac OS X can use.) The particular format we pick is based on our operating system and we won’t go into too much detail about it here.
Once we have built example.o
, all that remains is to link it with the driver.o
we created from our wrapper. If we try to do that in a way that resembles compilation from CS35, we’ll get an error:
$ clang -g -o example example.o driver.o
/usr/bin/ld: i386 architecture of input file `example.o' is incompatible with i386:x86-64 output
clang: error: linker command failed with exit code 1 (use -v to see invocation)
This message is occurring because example.o
is in 32-bit executable format (from the elf32
format above) but driver.o
is in 64-bit executable format (because this is the default on a 64-bit machine with clang
or gcc
). We can confirm this by using the program file
to figure out the format of the driver.o
we are using:
$ file driver.o
driver.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
We need to build driver.o
in 32-bit format to make it compatible with example.o
:
$ clang -g -m32 -c -o driver.o driver.c
The -m32
flag here tells the compiler to build a 32-bit .o
file. Next, we can link our two 32-bit .o
files together:
$ clang -g -m32 -o example.run driver.o example.o
Finally, we can run our program!
$ ./example.run
37
We can confirm our understand of the files we’ve just created by examining them with the file
program:
$ file driver.o
driver.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped
$ file example.o
example.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped
$ file example.run
example.run: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=6eb54bf88ff34626c050f02f9240b5fc1b119357, not stripped
We’ve just used our wrapper and our example assembly code to build a binary executable! The only thing missing is that we had to write our assembly code by hand, but we want a compiler to write the code for us. We’ll do that next.
Hatchling: Our First Compiler
Rather than writing example.s
by hand, we want a compiler to generate it for us. For this activity, we’ll have the compiler read the file on standard input and write the file to standard output; that way, we don’t need to learn about OCaml’s I/O API yet. We can then instruct the command line to feed the source code into the compiler and capture what it prints.
We can start by writing this skeleton of compiler.ml
:
let compile (program : int) : string =
failwith "TODO"
;;
(* The following is essentially a way of writing a "main" in OCaml. *)
let () =
(* Get a line from standard input and assume that it is our "program". *)
let program_str = read_line () in
(* Convert the input to an integer. *)
let program = int_of_string program_str in
(* "Compile" the program into assembly. *)
let assembly = compile program in
(* Print the resulting assembly code. *)
print_endline assembly
;;
A more sophisticated compiler might read from one or more files instead of using standard input to get its program. Likewise, it might write to one or more files instead of printing the resulting assembly. It might even call the assembler directly instead of writing a file to speed up the process. The above code has the advantage of being much simpler.
In the above, the compile
function has been left unimplemented. Use what you know about OCaml strings to replace the failwith "TODO"
with an expression that generates the right assembly code for your program.
Once you’ve implemented your compile
function, we can test it. Create a file 19.hatchling
containing just the number 19
:
$ echo "19" > "19.hatchling"
Then, we can run our compiler. We use the <
operator in bash
to replace the standard keyboard input with the contents of a file (as if you’d copied the file and pasted it into your terminal):
$ ocaml compiler.ml < 19.hatchling
section .text
global hatchling_main
hatchling_main:
mov eax, 19
ret
We can capture that output to a file by using the >
operator and then assemble it using nasm
. Then, we can link the resulting program with our wrapper as we did before:
$ ocaml compiler.ml < 19.hatchling > 19.s
$ nasm -f elf32 -o 19.o 19.s
$ clang -g -m32 -c -o driver.o driver.c
$ clang -g -m32 -o 19.run driver.o 19.o
$ ./19.run
19
We can automate the above compilation process using a Makefile
. Note that Makefile
s use tabs and not spaces; if your editor is configured correctly, it should deal with this for you.
# The following line tells "make" that it should keep our intermediate files.
.SECONDARY:
# To make a .run file, we need the .o file. We link it with the wrapper.
# The symbol $@ is "the thing we're making", e.g. the .run file
# The symbol $< is "the first thing we need", e.g. the .o file
%.run: %.o driver.o
clang -g -m32 -o $@ driver.o $<
# To make a .o file, we need the .s file. We build it using the assembler.
%.o: %.s
nasm -f elf32 -o $@ $<
# To make a .s file, we need the .hatchling file. We generate it with the
# compiler.
%.s: %.hatchling
ocaml compiler.ml < $< > $@
# An exception to the .o rule above: we build driver.o from driver.c.
driver.o: driver.c
clang -g -m32 -c -o driver.o driver.c
# It's always good to know how to clean up.
.PHONY: clean
clean:
rm -f *.run
rm -f *.o
rm -f *.s
If your Makefile
is correct, you should be able to use it to run all of the necessary commands.
$ make clean
rm -f *.run
rm -f *.o
rm -f *.s
$ make 19.run
ocaml compiler.ml < 19.hatchling > 19.s
nasm -f elf32 -o 19.o 19.s
clang -g -m32 -c -o driver.o driver.c
clang -g -m32 -o 19.run driver.o 19.o
$ ./19.run
19
That’s a Compiler!
That’s it! We’ve just built a tool that will take an extraordinarily simple “program” (a document containing an integer) and transform it into something the computer can run. This is a compiler.
There are a few caveats, of course:
- This compiler handles an extremely simple language. We will discuss how to generate assembly from an incrementally larger set of features throughout the course.
- This compiler does not handle operating-system-specific details like the internals of the object files; we’ve left that to our assembler and linker. You’re welcome to read more about the Executable and Linkable Format, but we won’t talk a lot about it in this course. If you’re curious to see what is being generated, try using
objdump
on the.o
files (e.g.objdump -t 19.o
orobjdump -t driver.o
). - This compiler is using standard input and standard output to handle files. For our future compiler work in this course, we’ll read and write files. Since I/O is fairly uninteresting, however, most of this code will be provided for you so you can focus on the compilation part. Doing more complex things here is simply a matter of engineering.
In the meantime, you can try making changes to your compiler to see how it behaves. Can you change the input language in a useful way? Can you find a case in which this compiler does not work correctly? Remember to git commit
before trying to make changes!
Acknowledgements
Thanks to Abdulaziz Ghuloum for the structure of this activity and to Joe Politz for much of the specific content.