Checkpoint 2 (0%): get a good amount of this done before spring break
Complete Project (90%): Due Thursday March 26 before 2am: (late Wednesday night)
/home/newhall/public/cs75/grammar_latex/
Checkpoint 2 is not a real checkpoint, but a suggestion for how to attack implementing full parser functionality (recursive descent parser, AST building, and some error recovery). I suggest that you break it up as I have it listed above and first build the recursive descent parser without creating an AST, and then add AST building functionality second. However, you certainly can also approach this problem by implementing both the recursive descent parser and AST building functionality at the same time.
I strongly encourage you to get all or most of the recursive descent parser coded before spring break, and then finish it up and add the AST building part after break.
$ cd cs75/CS75_repositoryname/project $ cp /home/newhall/public/cs75/proj2_startingpt/setup2 .
$ ./setup2The setup2 script will copy over project 2 starting point files into the ast, parser, and includes subdirectories, call svn add to add these files to your repository, and then call svn to commit the new files.
ast/: contains the abstract syntax tree data structure implementation that your parser will use to build an AST of the parsed program you can type 'make depend' and 'make' to build a two test programs * main.c: a test program that prints resulting AST to stdout contains examples of how to use the ast interface to run: ./test_prog * nltk_main.c: a test program that prints resulting AST to an output file in nltk format. to run: ./test_nltk outputfile then: python -i outputfile >>> ast.draw() includes/ast.h: the header file for the abstract syntax tree data structure read its comments for more information about how to use it includes/parser.h: the starting point for the parser header file you will need to add more to this file parser/parser.c: the starting point for your parser implementation this is where most of the code you write will go run: ./parser t.c-- or: ./parser t.c-- nltk_outfile parser/main.c: contains code to initialize the AST, that calls a top-level parser routine, and that prints out the resulting AST and prints out the nltk version of the AST to an output file if an output file is given as a command line argument. You will need to modify the print functions here for your parser, but you may not need to modify much else here. type 'make depend' once, and then 'make' to build a parser executable that does almost nothing. note: the nltk output file will not be in the correct format until your AST has more than a single node
a[i]; // a[i] is not an lvalue here a[i] = x; // a[i] is an lvalue here x = 8 + (y = 6); // y is an lvalue here: y = 6 // and (y = 6) is an expression with a value of 6but these are invalid C-- expressions:
a[i] + b[i] = x; // a[i] + b[i] is not a valid lvalue x = 8 + y = 6; // 8 + y is not a valid lvalue // ( + has higher precedence than = )Correctly handing these cases requires more than one token look-ahead or one or more token look-behind, neither which is allowed in a predictive parser. For now, the solution is to go ahead and translate the CFG to an almost equivalent LL(1) grammar. The resulting LL(1) grammar will be equivalent to the CFG for C-- with the exception that it will incorrectly accept strings like
a[i] + b[j] = expr;
.
Later, when you implement the AST building part of this project, you
can take care of ensuring that an expression appearing on the left-hand-side
of an assignment statement is a valid lvalue, and if not generate an error
message.
The precedence of C-- operators from lowest to highest is (your LL(1) grammar must handle precedence):
Converting the CFG for C-- to an LL(1) grammar may be the most difficult part of the assignment. I strongly suggest first testing out your LL(1) grammar by hand to verify that its rules can correctly parse some very small C-- programs. Trying some very small examples with with zero, one, and two global variable declarations and one, and two very small functions. For example:
int x; char y; int main(int x, char y) { x = y; }In addition, I encourage you to compute FIRST and FOLLOW sets for all or part of your LL(1) grammar to prove that it is LL(1).
You should make sure to test that your parser can correctly parse all valid C-- language constructs and combinations of C-- language constructs before you start adding error recovery code.
See the first part of the sample output below for an idea of what this part of your parser's output should look like.
The AST should resemble a re-organization of a parse tree that would be produced by the original CFG parse of the program. For example, for the string x = 3 + 4; The CFG parse tree and its corresponding AST would look like:
parse tree: AST: stmt = / \ / \ expr ; x + / | \ / \ id = expr 3 4 / / | \ x expr + expr | | 3 4Parts of the parse tree that are not necessary for generating code are removed from the AST (i.e. stmt, expr, and ; nodes), and operator nodes are internal nodes of the AST (vs. leaf nodes in the parse tree). The structure of the resulting AST represents the order of evaluation. Note: when you build an AST, you will be building it from the parse of the LL(1) grammar, which has a very different parse tree for this string, but the resulting AST should be the same as the one above.
The AST looks like a re-structured, cleaned-up parse tree, where many of the parse tree nodes that are necessary for parsing have been removed. In the above example, the AST has nodes corresponding to only terminals in the grammar. However, it is not the case that the AST will contain nodes only corresponding to terminals. For some parts of the grammar, the AST needs to contain nodes that do not correspond to terminals. For example, for the following C-- program fragment:
int xthe code generator needs to do something very different if this is part of a global variable declaration, or a local or temporary variable declaration (at the beginning of a block), or a parameter. As a result, the AST needs to contain enough information about the context in which this appears, so that the code generator knows how to interpret it. For example, you may want to structure the AST like the following for these three different cases:
global_decls block func_def / ... / ... ... / ... vardec vardeclist paramlist / \ / ... / ... int x vardec vardec / \ / \ int x int xAs you think about what types of AST nodes you will need to represent the parsed program's structure, look at some of the original C-- grammar rules. For example, the production
Program --> VarDeclList FunDeclList
might tell me that in the AST, after the root node representing Program,
I might want two child nodes, one for the list of global variable declarations,
and one for the list of functions. The production
FunDeclList --> FunDecl FunDeclList
, tells me that the
AST node for the list of functions will have n child nodes, one for
each function declaration in the list. The production
FunDecl ------> Type id ( ParamDecList ) Block
, tells
me that a function declaration node would have 4 child nodes, one
for type, one for id, one for the parameter declaration list, one for block,
and, so on.
You will need to add AST nodes corresponding to some of the terminals in your grammar, but not all (the AST's structure represents the correct parsing of the program that is directed by some grammar symbols, and thus these symbols no longer need to be represented in the AST. For example, in the statement x = 9 + 10; x, =, 9, +, and 10 are necessary to represent in the AST since the code generator will need to know these to generate correct code. The ; is not necessary to represent in the AST because the AST's structure will indicate where the assignment statement starts and ends. And, you will need to add AST nodes that represent the structure of the parsed program (as in the examples above).
Note: because your parser is not building the symbol table, the AST nodes need to contain enough information about names in the program so that the symbol table can be constructed during the code generation phase. For example, AST nodes corresponding to IDs, need to store the ID's lexeme and the source line number.
See the second part of the sample output below for an idea of what this part of your parser's output should look like.
Since you are implementing a top-down parser, it makes sense to build the AST top-down as well. In most cases this will mean passing a pointer to the current parent ast_node to parsing functions that may create and add child nodes and may pass the child nodes to other parser functions. In some cases, you may need to pass more than one ast_node pointer, in others, the current parsing function may need to call other parsing functions that will build and return parts of the AST before the current parsing function knows where to add a new node. For example, you may be at a point in the parsing where you need to create a child node of a part of the AST tree that has not yet been created. In this case, you need to wait until this part has been created (and returned by parsing functions that this function calls) before this function can add a new node as a child.
The most complicated AST building part has to do with expressions. The problem is that the LL(1) grammar results in parse trees with a very different structure than their associated AST. As you think about how to handle this part, think about when in the parsing process you know that a node should be a parent or a child (this will help you decide if you should pass a new node to a called function as a new parent node, or if you need to call a function that will return an AST to which you will add a new node either as a parent or child).
Consider some of the following cases as you implement this part:
1 + 2 * 3; // + is root of AST, 1 is lft child, * is rt child 1 * 2 + 3; // + is root of AST, * is lft child, 3 is rt child 1 + 2 + 3 + 4 + 5 + 6; // the last + (from + 6) is root of AST // + (from + 5) is lft child, 6 is rt childThe best way to figure out what parts of the AST your parsing functions need to take as parameters or return is to draw the LL(1) parse tree for an example string in the grammar, and then think about how to "annotate" parts of the parse tree (i.e. points in the derivation) with AST building functionality. Think about answering questions such as: at what point do you know you need to create a new AST node? do you add it as a parent? child? to which node? when do you know where to add it? ...
ID LPAREN ID SEMI ^^^^ missing RPAREN in function callThe error message might look something like:
Error: missing ) at line 34
You are welcome to try more complicated error recovery, but first try implementing some relatively easy cases. Remember that printing out the line number of the error message is important for the user to figure out where the error is.
After parsing is complete, your parser main program should print out the resulting AST to stdout, and print the AST to an output file in nltk format. note: do not use '(' or ')' as part of AST node names, as these are interpreted by nltk as child delimiters.
For example, when given the following program:
int y; char ch; int main() { int x; int y; x = 5 - 8 / 4; write x; }
$ ./parser testprog.c-- outfile # # Note: I've removed parts of my output that correspond to non-terminals in my LL(1) # that are not from the original CFG. I'm doing this because I want you to # convert the CFG to LL(1) without trying to see if your LL(1) is the same as # mine based on my sample output. # # In your solution, you should print out EVERY non-terminal in your LL(1) # grammar as it is parsed by your parser. # program MATCH: INT MATCH: ID.y MATCH: SEMI MATCH: CHAR MATCH: ID.ch MATCH: SEMI MATCH: INT MATCH: ID.main MATCH: LPAREN ParamDeclList # when your parser matches a token you should MATCH: RPAREN # print out: "MATCH: TOKEN.attribute" MATCH: LBRACE # in my grammar I actually have terminals for Block # each keyword (ex. IF, WHILE, ...), however VarDecList # I'm printing them out here as KEYWORD.attr MATCH: INT # you are welcome to do either MATCH: ID.x MATCH: SEMI MATCH: INT MATCH: ID.y MATCH: SEMI VarDecList StmtList Stmt Expr MATCH: ID.x MATCH: ASSIGN Expr MATCH: NUM.5 MATCH: MINUS MATCH: NUM.8 MATCH: DIV MATCH: NUM.4 MATCH: SEMI StmtList Stmt MATCH: KEYWORD.write Expr MATCH: ID.x MATCH: SEMI MATCH: RBRACE FuncDecList MATCH: DONE ********************************************** AST: id:x / write / integer:4 / integer:8 / / DIV / integer:5 / - / id:x / = / stmtlist / id:y / int / vardec / id:x / int / vardec / vardeclist / BLOCK / PARAMLIST / id:main / int / FUNC_DEF / FUNCLIST / id:ch / char / vardec / id:y / int / vardec / PROGRAM # your AST does not need to be identical to this one, particularly for the # nodes that do not correspond to terminals in the grammar, but for the # expressions it should be pretty much the same as thisIf you run the parser with two command line arguments (one for input C-- file and the other for the AST output in nltk form), you can then run python to veiw the AST as follows:
$ python -i outfile >>> ast.draw()
You do not need to include with your checkpoint a copy of the steps you took to convert G to LL(1). However, if you do, it will be more likely that I will be able to find any errors in your conversion for you (and actually, you will be less likely to make them). I suggest that you compute FIRST and FOLLOW sets for parts of the grammar (like arithmetic expressions, and the rules with lists to see if your grammar is LL(1)).
proj2_report.appropriate_extention
so that I can easily find it.