Yacc2TT

With the TextTransformer project Yacc2TT.ttp Yacc grammars can be taken to a form which is suitable for LL parser generators. A tti file is produced which can be imported directly into the TextTransformer.

Introduction

There are a lot of grammars written for the parser generator "Yacc" and it would be nice if one could use them for the TextTransformer. However, a conversion of these grammars isn't trivial because TextTransformer uses the LL parsing algorithm and Yacc uses the LR parsing algorithm. These algorithms require a different logical construction of the grammars.
While both algorithms are reading the input from left to right, what the first 'L' in "LL" or "LR" stands for, the match with a grammar rule is tested in a different direction. The matching rule is already found in the LL parser after the recognition of the next token while the LR algorithm mostly needs several tokens, which are then evaluated backwards.

For a given token the LL parser only can find the matching rule, if the rule system (the grammar) is formulated in a manner that there are no alternatives. All alternatives must start with different, nonoverlapping token sets.

The steps which are required in accordance with this principle at conversion of the Yacc grammars, shall now be demonstrated. It is explained at the same time how these transformations are carried out in the TextTransformer.

The following Yacc rule is used as example:

direct_declarator
	: IDENTIFIER
	| '(' declarator ')'
	| direct_declarator '[' constant_expression ']'
	| direct_declarator '[' ']'
	| direct_declarator '(' parameter_type_list ')'
	| direct_declarator '(' identifier_list ')'
	| direct_declarator '(' ')'
	;

This definition is from a grammar C of Jeff Lee for the programming language C.

http://www.lysator.liu.se/c/ANSI-C-grammar-l.html

Representation of the rules by tree structures

The grammar of Yacc2TT is derived from "Appendix B: Yacc input syntax" of the Yacc manual. Besides the grammar there is the element in Yacc2TT:

mstrdnode m_nRules

This is a map in which a rule tree is stored to every rule name. The construction of the trees is carried out in the rule production and in the productions called by it. At first the tree for the above example looks like:

  T : IDENTIFIER	
	T : '(' 
	  NT : declarator 
	     T ')'
	NT : direct_declarator 
	   T : '[' 
	     NT : constant_expression 
	        T : ']'
	NT : direct_declarator 
	   T : '[' 
	     T : ']'
	NT : direct_declarator 
	   T : '(' 
	     NT : parameter_type_list 
	        T : ')'
	NT : direct_declarator 
	   T : '(' 
	     NT : identifier_list 
	        T : ')'
	NT : direct_declarator 
	   T : '(' 
	     T : ')'

"T" and "NT" are the labels for terminals and non-terminals. The tree was produced so that alternatives are expressed by sibbling nodes while child nodes represent the following-relation. These trees are tranformed step by step. Finally the LL-grammar can be produced from them.

LeftFactoring

The first step to the avoidance of alternatives with the same terminal beginners is to left factor common beginnings. This results in following tree:

  T : IDENTIFIER	
	T : '(' 
	  NT : declarator 
	     T ')'
	NT : direct_declarator 
	   T : '[' 
	     NT : constant_expression 
	        T : ']'
	      T : ']'
	   T : '(' 
	     NT : parameter_type_list 
	        T : ')'
	     NT : identifier_list 
	        T : ')'
	     T : ')'

Grouping

For the preparations for the important next step, the elimination of left recursions, alternatives which follow on a common predecessor are summarized to groups at first. Marking nodes are inserted in the tree with the labels "ALT_BEGIN" and "ALT_END". Later this bracketing also makes it easy to use the operators of the EBNF syntax of the TextTransformer, e.g. the repeat operator.

  T : IDENTIFIER	
	T : '(' 
	  NT : declarator 
	     T ')'
	NT : direct_declarator 
	   ALT_BEGIN
	   T : '[' 
	     ALT_BEGIN
	     NT : constant_expression 
	        T : ']'
	      T : ']'
	     ALT_END 
	   T : '(' 
	     ALT_BEGIN
	     NT : parameter_type_list 
	        T : ')'
	     NT : identifier_list 
	        T : ')'
	     T : ')'
	     ALT_END   
	   ALT_END

RemoveLeftRecursion

Left recursive rules like:

a = a C | B

are forbidden for LL-parsers. They would end up in infinite loops. However, the rule is logically equivalent with:

a = B C*

where '*' is the repeat operator. In Yacc2TT this remodelling is executed automatically and gives:

  ALT_BEGIN
  T : IDENTIFIER	
	T : '(' 
	 NT : declarator 
	    T ')'
	ALT_END    
	  OREP_BEGIN 
	    T : '[' 
	      ALT_BEGIN
	      NT : constant_expression 
	         T : ']'
	       T : ']'
	      ALT_END
	    T : '(' 
	      ALT_BEGIN
	      NT : parameter_type_list 
	         T : ')'
	      NT : identifier_list 
	         T : ')'
	      T : ')'
	      ALT_END
	  OREP_END

Removing empty alternatives

At rules which contain empty alternatives Yacc2TT still carries out another remodelling. E.g.

a | b |

became to:

( a | b )?

TextTransformer import file

Finally an output file is produced by the Yacc2TT project with the remodeled Yacc rules. It should be stored with the extension "tti" and then can be imported directly into the TextTransformer. The "direct_declarator" rule appears now as:

direct_declarator()
(>
  (
      IDENTIFIER
      "("
      declarator
      ")"
  )
  (
      "["
      (
          constant_expression
          "]"
        | "]"
      )
    | "("
      (
          parameter_type_list
          ")"
        | identifier_list
          ")"
        | ")"
      )
  )
<)

End

There still are possibilities of improving the rules. Particularly one problem isn't treated by Yacc2TT: although the individual rules have a form suitable for LL parser generators now, there still may be conflicts between the rules. The grammar tests of the TextTransformer show where there are such conflicts. They can relatively easily be removed by hand within the IDE.
With the C grammar is demonstrated how it works.

Last Update: 11.05.08

to the top