CS75 Lab 2: A Parser for C--

Due Dates
Checkpoint 1 (10%): Due: Tuesday March 3 before 11:30am
(you may not use a late day on the checkpoint)

Checkpoint 2 (0%): get a good amount of this done before spring break

Complete Project (90%): Due Thursday March 26 before 2am: (late Wednesday night)

Index
Project Part 2 FAQ
I'll post answers to questions I get about this assignment here:
Problem Introduction
Design and implement a Recursive Descent parser for C-- as follows:
  1. [Checkpoint 1: 10% of your grade] Convert the C-- grammar from project 1 to an LL(1) grammar. If you want to use latex to write up your solution, you can grab a copy of the grammar.tex file and a Makefile for building .dvi, .ps, .pdf versions here:
    /home/newhall/public/cs75/grammar_latex/

  2. [Checkpoint 2: 0% of your grade] Implement a recursive descent parser that recognizes the LL(1) grammar. Your parser will not explicitly build a parse tree but output grammar symbols corresponding to parse tree nodes as it parses.

  3. [Complete Program] Add abstract syntax tree (AST) building to your parser as it parses, and add error detection and recovery. Your final version of the parser will create an AST of the program as it parses. The AST will then be used by the code generator to generate MIPS code.

Checkpoint 2 is not a real checkpoint, but a suggestion for how to attack implementing full parser functionality (recursive descent parser, AST building, and some error recovery). I suggest that you break it up as I have it listed above and first build the recursive descent parser without creating an AST, and then add AST building functionality second. However, you certainly can also approach this problem by implementing both the recursive descent parser and AST building functionality at the same time.

I strongly encourage you to get all or most of the recursive descent parser coded before spring break, and then finish it up and add the AST building part after break.

Getting Started

Grab the starting point code and add it to your svn repository

One of you or your partner, should do the following:
  1. From you project subdirectory, copy over the setup2 file from my directory containing the project 2 starting point code:
    $ cd cs75/CS75_repositoryname/project
    $ cp /home/newhall/public/cs75/proj2_startingpt/setup2 .
    
  2. run setup2:
    $ ./setup2
    
    The setup2 script will copy over project 2 starting point files into the ast, parser, and includes subdirectories, call svn add to add these files to your repository, and then call svn to commit the new files.
  3. At this point your partner should be able to do an "svn update" to grab the project2 starting point files from your repository.

Starting Point Code Details

ast/: contains the abstract syntax tree data structure implementation that 
      your parser will use to build an AST of the parsed program
      
      you can type 'make depend' and 'make' to build a two test programs 

      * main.c: a test program that prints resulting AST to stdout
            contains examples of how to use the ast interface 

         to run:
        ./test_prog

      * nltk_main.c: a test program that prints resulting AST to an output file in nltk format.  

        to run: 
           ./test_nltk outputfile

        then:
            python -i outputfile
            >>>  ast.draw()

includes/ast.h:  the header file for the abstract syntax tree data structure
                 read its comments for more information about how to use it

includes/parser.h: the starting point for the parser header file
                   you will need to add more to this file

parser/parser.c:  the starting point for your parser implementation
                  this is where most of the code you write will go 

           run:  ./parser t.c-- 
            or:  ./parser t.c--  nltk_outfile

parser/main.c: contains code to initialize the AST, that calls a top-level
               parser routine, and that prints out the resulting AST and 
               prints out the nltk version of the AST to an output file if
               an output file is given as a command line argument.

               You will need to modify the print functions here for your 
               parser, but you may not need to modify much else here.

               type 'make depend' once, and then 'make' to build a parser 
               executable that does almost nothing.

               note: the nltk output file will not be in the correct format until
                     your AST has more than a single node

Incremental Implementation and Testing

To make your life much easier, you should follow good modular design practices, and use incremental implementation and testing. I suggest that you come up with individual, short, C-- test programs that test one specific part of your grammar, and use these in the first phase of testing your parser. You should also test your parser on larger C-- programs that contain multiple parts of the language, but this type of testing should be the second phase not the first.
Project Details:

Checkpoint 1: Converting Grammar to LL(1)

As you convert the C-- CFG to LL(1), be careful that you are not changing the language that is accepted by the grammar. There is one part of the C-- CFG that cannot be converted to LL(1). The problem occurs in some forms of expressions involving the assignment operator. The difficulty is in determining if the expression on the left-hand-side is a valid lvalue at parse time. For example, these are all valid C-- expressions:
  
    a[i];             // a[i] is not an lvalue here
    a[i] = x;         // a[i] is an lvalue here 
    x = 8 + (y = 6);  // y is an lvalue here: y = 6 
                      // and (y = 6) is an expression with a value of 6
but these are invalid C-- expressions:
  
    a[i] + b[i] = x;    // a[i] + b[i] is not a valid lvalue
    x = 8 + y = 6;      // 8 + y is not a valid lvalue 
                        // ( + has higher precedence than = ) 
Correctly handing these cases requires more than one token look-ahead or one or more token look-behind, neither which is allowed in a predictive parser. For now, the solution is to go ahead and translate the CFG to an almost equivalent LL(1) grammar. The resulting LL(1) grammar will be equivalent to the CFG for C-- with the exception that it will incorrectly accept strings like a[i] + b[j] = expr;. Later, when you implement the AST building part of this project, you can take care of ensuring that an expression appearing on the left-hand-side of an assignment statement is a valid lvalue, and if not generate an error message.

The precedence of C-- operators from lowest to highest is (your LL(1) grammar must handle precedence):

  1. assignment (=)
  2. or (||)
  3. and (&&)
  4. equal, not equal (==, !=)
  5. less than, less than or equal, greater than, greater than or equal (<, <=, >, >=)
  6. addition, subtraction (+, -)
  7. multiplication, division (*, /)
  8. not, unary minus (!, -)
  9. parentheses ()

Converting the CFG for C-- to an LL(1) grammar may be the most difficult part of the assignment. I strongly suggest first testing out your LL(1) grammar by hand to verify that its rules can correctly parse some very small C-- programs. Trying some very small examples with with zero, one, and two global variable declarations and one, and two very small functions. For example:

int x;
char y;

int main(int x, char y) {
  x = y;
}
In addition, I encourage you to compute FIRST and FOLLOW sets for all or part of your LL(1) grammar to prove that it is LL(1).

Checkpoint 2: Recursive Descent Parser

Once you have verified that your LL(1) grammar is in fact LL(1), and you have convinced yourself that it still parses C-- (you didn't modify the productions in such a way as to change the language it recognizes), then implement a recursive descent parser from your LL(1) grammar using the method we discussed in class.

You should make sure to test that your parser can correctly parse all valid C-- language constructs and combinations of C-- language constructs before you start adding error recovery code.

See the first part of the sample output below for an idea of what this part of your parser's output should look like.

Complete Program: AST and Error Recovery

Adding AST Building

Although your parser does not need to build a parse tree as it parses, it does need to build an AST as it parses. With the starting-point code is an implementation of an abstract syntax tree data structure that you can use to build your AST.

The AST should resemble a re-organization of a parse tree that would be produced by the original CFG parse of the program. For example, for the string x = 3 + 4; The CFG parse tree and its corresponding AST would look like:

parse tree:                             AST:
               stmt                              =
              /    \                           /   \
             expr   ;                         x     +
            / |  \                                  / \
          id  =   expr                             3   4 
         /       /  | \                             
        x      expr + expr                        
               |      | 
               3      4
Parts of the parse tree that are not necessary for generating code are removed from the AST (i.e. stmt, expr, and ; nodes), and operator nodes are internal nodes of the AST (vs. leaf nodes in the parse tree). The structure of the resulting AST represents the order of evaluation. Note: when you build an AST, you will be building it from the parse of the LL(1) grammar, which has a very different parse tree for this string, but the resulting AST should be the same as the one above.

The AST looks like a re-structured, cleaned-up parse tree, where many of the parse tree nodes that are necessary for parsing have been removed. In the above example, the AST has nodes corresponding to only terminals in the grammar. However, it is not the case that the AST will contain nodes only corresponding to terminals. For some parts of the grammar, the AST needs to contain nodes that do not correspond to terminals. For example, for the following C-- program fragment:

   int x
the code generator needs to do something very different if this is part of a global variable declaration, or a local or temporary variable declaration (at the beginning of a block), or a parameter. As a result, the AST needs to contain enough information about the context in which this appears, so that the code generator knows how to interpret it. For example, you may want to structure the AST like the following for these three different cases:
     global_decls                 block                     func_def
        /   ...                    /   ...              ... /     ...
     vardec                    vardeclist                paramlist
     /  \                       /     ...                  /    ... 
   int   x                   vardec                     vardec
                             /   \                      /   \
                           int    x                   int    x
As you think about what types of AST nodes you will need to represent the parsed program's structure, look at some of the original C-- grammar rules. For example, the production Program --> VarDeclList FunDeclList might tell me that in the AST, after the root node representing Program, I might want two child nodes, one for the list of global variable declarations, and one for the list of functions. The production FunDeclList --> FunDecl FunDeclList, tells me that the AST node for the list of functions will have n child nodes, one for each function declaration in the list. The production FunDecl ------> Type id ( ParamDecList ) Block , tells me that a function declaration node would have 4 child nodes, one for type, one for id, one for the parameter declaration list, one for block, and, so on.

You will need to add AST nodes corresponding to some of the terminals in your grammar, but not all (the AST's structure represents the correct parsing of the program that is directed by some grammar symbols, and thus these symbols no longer need to be represented in the AST. For example, in the statement x = 9 + 10; x, =, 9, +, and 10 are necessary to represent in the AST since the code generator will need to know these to generate correct code. The ; is not necessary to represent in the AST because the AST's structure will indicate where the assignment statement starts and ends. And, you will need to add AST nodes that represent the structure of the parsed program (as in the examples above).

Note: because your parser is not building the symbol table, the AST nodes need to contain enough information about names in the program so that the symbol table can be constructed during the code generation phase. For example, AST nodes corresponding to IDs, need to store the ID's lexeme and the source line number.

See the second part of the sample output below for an idea of what this part of your parser's output should look like.

Starting Point AST Code

Start by looking at the ast.h file in the includes subdirectory, and reading its comments about how to use the ast library functions. Next, take a look at the sample program, ast/main.c, to see an example of how to use some of the ast functions, and try building and running it. Note that the AST is printed out sideways, with all the children of a given node printed above the node (or if you tilt your head to the left, all of a node's children are printed below it and to its right).

Implementing AST building as part of the recursive descent parse

You can certainly approach this part of the assignment very formally, by first annotating your grammar with translation rules for constructing an AST, and then by implementing these rules in your parser to use syntax directed translation to build the AST as you parse. However, I think it is easier to do this less formally, but keep in mind that you are doing syntax directed translation, and use what you know about inherited and synthesized attributes to determine when you need to pass part of the AST to a parser function and when you need to return part of an AST from a parser function.

Since you are implementing a top-down parser, it makes sense to build the AST top-down as well. In most cases this will mean passing a pointer to the current parent ast_node to parsing functions that may create and add child nodes and may pass the child nodes to other parser functions. In some cases, you may need to pass more than one ast_node pointer, in others, the current parsing function may need to call other parsing functions that will build and return parts of the AST before the current parsing function knows where to add a new node. For example, you may be at a point in the parsing where you need to create a child node of a part of the AST tree that has not yet been created. In this case, you need to wait until this part has been created (and returned by parsing functions that this function calls) before this function can add a new node as a child.

The most complicated AST building part has to do with expressions. The problem is that the LL(1) grammar results in parse trees with a very different structure than their associated AST. As you think about how to handle this part, think about when in the parsing process you know that a node should be a parent or a child (this will help you decide if you should pass a new node to a called function as a new parent node, or if you need to call a function that will return an AST to which you will add a new node either as a parent or child).

Consider some of the following cases as you implement this part:

1 + 2 * 3;                // + is root of AST, 1 is lft child, * is rt child
1 * 2 + 3;                // + is root of AST, * is lft child, 3 is rt child 
1 + 2 + 3 + 4 + 5 + 6;    // the last + (from + 6) is root of AST
                          //  + (from + 5) is lft child, 6 is rt child
The best way to figure out what parts of the AST your parsing functions need to take as parameters or return is to draw the LL(1) parse tree for an example string in the grammar, and then think about how to "annotate" parts of the parse tree (i.e. points in the derivation) with AST building functionality. Think about answering questions such as: at what point do you know you need to create a new AST node? do you add it as a parent? child? to which node? when do you know where to add it? ...

Adding Error Detection and Recovery

Once your parser works, add some simple error recovery to your parser. You should implement simple panic and phrase-level recovery schemes (for a single missing or incorrect token); you do not have to implement complex error recovery, instead find simple cases of recovering from a single missing token during parsing. In addition, you should add error recovery to your lexer in cases where a missing or incorrect character(s) in the input can be easily handled (e.g. a missing '&' in the AND operator). When your compiler detects an error it will print out an error message, then either recover from the error and continue parsing or exit if it cannot recover. For example, after detecting the following error, the parser will print out an error message and continue parsing as if the missing RPAREN was there:
ID  LPAREN  ID      SEMI 
               ^^^^
                missing RPAREN in function call 
The error message might look something like:
Error: missing ) at line 34

You are welcome to try more complicated error recovery, but first try implementing some relatively easy cases. Remember that printing out the line number of the error message is important for the user to figure out where the error is.

Sample input and output
Your recursive descent parser will not actually construct a parse tree. Instead, it should display EVERY non-terminal in your LL(1) grammar as it is being processed and EVERY terminal when it is MATCHED.

After parsing is complete, your parser main program should print out the resulting AST to stdout, and print the AST to an output file in nltk format. note: do not use '(' or ')' as part of AST node names, as these are interpreted by nltk as child delimiters.

For example, when given the following program:

int y;
char ch;
int main() {
  int x;
  int y;
  x = 5 - 8 / 4;
  write x;
}

Parser output

the parser should output should look something like the following:
$ ./parser testprog.c-- outfile
# 
# Note: I've removed parts of my output that correspond to non-terminals in my LL(1)
# that are not from the original CFG.  I'm doing this because I want you to 
# convert the CFG to LL(1) without trying to see if your LL(1) is the same as 
# mine based on my sample output.
# 
# In your solution, you should print out EVERY non-terminal in your LL(1)
# grammar as it is parsed by your parser.
# 
program			      
MATCH: INT
MATCH: ID.y
MATCH: SEMI
MATCH: CHAR
MATCH: ID.ch
MATCH: SEMI
MATCH: INT	      
MATCH: ID.main	 	      
MATCH: LPAREN		      
ParamDeclList        # when your parser matches a token you should
MATCH: RPAREN        # print out:  "MATCH: TOKEN.attribute" 
MATCH: LBRACE        # in my grammar I actually have terminals for
Block                # each keyword (ex. IF, WHILE, ...), however
VarDecList           # I'm printing them out here as KEYWORD.attr
MATCH: INT           # you are welcome to do either
MATCH: ID.x
MATCH: SEMI
MATCH: INT 
MATCH: ID.y
MATCH: SEMI
VarDecList
StmtList
Stmt
Expr				
MATCH: ID.x
MATCH: ASSIGN
Expr
MATCH: NUM.5
MATCH: MINUS
MATCH: NUM.8
MATCH: DIV
MATCH: NUM.4
MATCH: SEMI
StmtList
Stmt
MATCH: KEYWORD.write
Expr
MATCH: ID.x
MATCH: SEMI
MATCH: RBRACE
FuncDecList
MATCH: DONE
**********************************************
AST:
                                                id:x
                                               /
                                        write
                                       /
                                                                integer:4
                                                               /
                                                                integer:8
                                                               /
                                                        / DIV
                                                       /
                                                        integer:5
                                                       /
                                                -
                                               /
                                                id:x
                                               /
                                        =
                                       /
                                stmtlist
                               /
                                                id:y
                                               /
                                                int
                                               /
                                        vardec
                                       /
                                                id:x
                                               /
                                                int
                                               /
                                        vardec
                                       /
                                vardeclist
                               /
                        BLOCK
                       /
                        PARAMLIST
                       /
                        id:main
                       /
                        int
                       /
                FUNC_DEF
               /
        FUNCLIST
       /
                id:ch
               /
                char
               /
        vardec
       /
                id:y
               /
                int
               /
        vardec
       /
PROGRAM

# your AST does not need to be identical to this one, particularly for the 
# nodes that do not correspond to terminals in the grammar, but for the 
# expressions it should be pretty much the same as this
If you run the parser with two command line arguments (one for input C-- file and the other for the AST output in nltk form), you can then run python to veiw the AST as follows:
$ python -i outfile
>>> ast.draw()

What to hand in

Checkpoint

In class, submit a hard copy of your LL(1) grammar. If there is a part of your grammar you want me to specifically check, please indicate which part(s) and include the steps you took in converting the CFG to LL(1).

You do not need to include with your checkpoint a copy of the steps you took to convert G to LL(1). However, if you do, it will be more likely that I will be able to find any errors in your conversion for you (and actually, you will be less likely to make them). I suggest that you compute FIRST and FOLLOW sets for parts of the grammar (like arithmetic expressions, and the rules with lists to see if your grammar is LL(1)).

Final Project

  1. A tar file to be submitted using cs75handin containing:
    1. a tar file of your project directory (i.e. all the source, header, and Makefile files I need to build and run your parser). You should do a 'make clean' before taring up your project directory, and make sure that all debug output is turned off (i.e. your parser's output should contain only what is shown in the sample output section).
    2. A README file with:
      1. Your name and your partner's name
      2. The number of late days you have used on this assignment
      3. The total number of late days you have used so far
    3. A sample of your parser's output on a good (and short) test C-- program that you devised. Annotate the output listing the parts (statement(s) of the C-- program) are being parsed at different points in the output. See my Unix documentation about script and dos2unix for recording terminal I/O to a file.
    4. A copy of the test C-- program for which you submitted sample output.

  2. A Project Report. You may submit this part on-line as part of your tar file or as a hardcopy in class. If you submit this on-line, then give it a name like proj2_report.appropriate_extention so that I can easily find it.
    Your project report should contain:
    1. A listing of your LL(1) grammar
    2. A description of your error recovery
    3. A description of any parts of the language that you were unable to parse