Parsing Tips

(I don't know where this should be linked from. Currently only from my user page --Yossi 17:40, 27 October 2005 (EDT))

In the process of writing the parser for omath I've acquired some experience in writing grammar files. Although I'm by no means an expert, I hope that at the very least by writing these notes down, I will be able to make the parser better, and write other parsers more easily in the future. Perhaps someone else will also find it helpful.

I used JavaCC to create the parser. I'm not going to explain how JavaCC works; information can be found online in the JavaCC tutorial and FAQ. I'll be assuming that you understand the basics of writing a grammar file, including the tokenizer and the productions.

General Tips:


 * Make every production's job simple. If you need a complicated production, it is usually worthwhile to break it into several simple parts and then assemble them with one (or more) "parent" productions.
 * Every production should find something or fail. Do not have a production that can succeed without actually consuming a token.

I had a problem implementing this idea when I wanted a "train" of productions to terminate somehow, so I created a production that always fails by looking for a token defined as "[]" (the empty character set). This is useful when writing large grammar files via a program. My failing production needs a token definition, something like TOKEN: { <FALSE: []> }, and I write the production as void False() : {} { <FALSE> }, which always fails since nothing matches the token. One problem with this is that you can get unreachable code, and that is a compile error in Java. It doesn't know that <FALSE> will never be matched. I'm not sure how to get around this.


 * Following on from the previous point, if you want a production to be optional, you can always wrap the call in ( ... )?. This way, although the production itself must find something or fail, the calling production allows it to be optional. The grammar is much clearer this way.
 * Use parentheses. This is very important, especially because a code block is always attached to the last production before it. By using parentheses you can attach code to a whole choice. Also, the operator | (OR) has the lowest precedence: it separates the entire sequences on either side of it, not just the neighboring productions, so parentheses are needed to limit a choice to part of a sequence.
 * Productions can also take input. This can make parsing postfix operators much easier, by passing the production the terms that have been parsed so far and letting it modify them if it finds its token.
 * Initialization stuff should go in the first code block of the production. This includes the variable declaration.
 * To allow for areas with no parsing (or different parsing), like comments or character strings, you will need to change the lexical state. Here is an example (from the JavaCC FAQ) that changes the lexical state to accommodate comments:

    // When a "/*" is seen in the DEFAULT state, skip it and switch to the IN_COMMENT state
    MORE : { "/*" : IN_COMMENT }
    // When any other character is seen in the IN_COMMENT state, skip it.
    < IN_COMMENT > MORE : { < ~[] > }   // the "~[]" means "any character"
    // When a "*/" is seen in the IN_COMMENT state, skip it and switch back to the DEFAULT state
    < IN_COMMENT > SKIP : { "*/" : DEFAULT }
 * Code blocks in optional and multiple regexps are a little confusing at first:

    ( <X> )+ { code }

will match one or more occurrences of the token <X> and then perform code once. On the other hand,

    ( <X> { code } )+

will match one or more occurrences of <X> but perform code for every occurrence. Similarly, notice the difference between the next two lines:

    ( <X> )? { code }
    ( <X> { code } )?

In the first, code gets executed whether or not <X> is found. In the second, it is executed only if <X> is found.
 * When the JavaCC compiler warns that a LOOKAHEAD is at a non-choice location and will be ignored, beware! More than the LOOKAHEAD is ignored: JavaCC ignores everything in the production!
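
As a rough illustration of several of the tips above (simple productions assembled by a parent, an optional call wrapped in ( ... )?, declarations in the first code block, and a production that takes input to handle a postfix operator), here is a minimal sketch of a complete grammar file. The names TipsDemo, Input, Sign, Number and Factorial, and the tokens, are made up for this example and are not taken from the omath grammar.

```javacc
// Hypothetical demo grammar; every name here is illustrative only.
options { STATIC = false; }

PARSER_BEGIN(TipsDemo)
public class TipsDemo {
  public static void main(String[] args) throws ParseException {
    TipsDemo parser = new TipsDemo(new java.io.StringReader("-3!!"));
    System.out.println(parser.Input());
  }
}
PARSER_END(TipsDemo)

SKIP  : { " " | "\t" | "\n" | "\r" }
TOKEN : { < NUMBER : (["0"-"9"])+ > | < BANG : "!" > | < MINUS : "-" > }

// Parent production: assembles the simple productions below.
long Input() :
{
  // Declarations and initialization go in the first code block.
  long value;
  boolean negative = false;
}
{
  // Sign() must match or fail; the caller is what makes it optional.
  // The parentheses attach the action to the optional part only.
  ( Sign() { negative = true; } )?
  value = Number()
  // Postfix operator: hand over what has been parsed so far and
  // let the production modify it if it finds its token.
  ( value = Factorial(value) )*
  <EOF>
  { return negative ? -value : value; }
}

// Each small production consumes at least one token or fails.
void Sign() : {}
{
  <MINUS>
}

long Number() : { Token t; }
{
  t = <NUMBER> { return Long.parseLong(t.image); }
}

// Takes the operand parsed so far as input.
long Factorial(long operand) : {}
{
  <BANG>
  {
    long result = 1;
    for (long i = 2; i <= operand; i++) result *= i;
    return result;
  }
}
```

Running javacc on this file and compiling the generated classes should give a parser whose Input() production applies each "!" to the value accumulated so far. Passing the operand in keeps Factorial() a one-token production instead of forcing the parent to know about factorials.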

JJTree Stuff

 * The last code block "under a node" is performed after the closeNode function call. Therefore, if you want to do something before that you must add a {} (empty code block) as the last code block.
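
A small sketch of the empty-block trick, assuming a JJTree grammar in which a production Term() is defined elsewhere; the production name Sum and the node name Add are hypothetical. As I understand the JJTree documentation, only the final code block in a node scope runs after the node has been closed, so a trailing {} moves the real action in front of the close.

```javacc
// Hypothetical JJTree fragment; Term() is assumed to be defined elsewhere.
void Sum() #Add : {}
{
  Term() "+" Term()
  // No longer the last code block, so it runs before the node is closed:
  // the children are still on the stack, not yet attached to jjtThis.
  { System.out.println("about to close the Add node"); }
  // The empty block is now the final one and runs after the close.
  {}
}
```

Without the trailing {}, the println block would itself be the final block and would run only after the children had been attached to the node.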