The programming language C

The TextTransformer project C.ttp parses C source texts. Type definitions are registered as well, so that the corresponding names are recognized in the code that follows them.

The source texts must not contain any preprocessor directives. If they do, the directives can first be removed or expanded with the Cpp preprocessor. This preprocessing is carried out automatically if the preprocessor is entered in the project options of C.ttp.

The following sections show how the project was produced by converting an existing Yacc program. The individual stages of the conversion are contained as backups in the zip file:

C_parser.zip

The generation of the C parser from a Yacc program

The starting point for the C grammar was the Yacc program:

http://www.lysator.liu.se/c/ANSI-C-grammar-l.html

1. mechanical steps of the conversion

With the Yacc2TT project, an import file for TextTransformer was produced from the Yacc grammar: Backup001.

Before this file was imported, the names of the literal tokens in it were replaced by the literals themselves: Backup002.

Only four tokens from the lex file remain. After the import, they have to be defined as regular expressions:

CONSTANT
IDENTIFIER
STRING_LITERAL
TYPE_NAME

For IDENTIFIER and STRING_LITERAL, the predefined TextTransformer definitions IS and STRING can be used. TYPE_NAME is defined as a placeholder token. CONSTANT stands for a number of expressions that describe the different numerical constants. In the TextTransformer project, CONSTANT is replaced by the production "constant", which consists of alternative tokens:

constant ::=
  FLOAT
| INT_CONSTANT_HEX
| INT_CONSTANT_OCT
| INT_CONSTANT_DEC
| INT_CONSTANT_CHAR
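
How five alternative tokens can share the work of the single lex token CONSTANT may be sketched with ordinary regular expressions. The following Python sketch is only an illustration: the patterns and the helper classify_constant are assumptions made for this example and are not the definitions stored in C.ttp.

```python
import re

# Hypothetical patterns, one per alternative of the "constant" production.
# They are tried in the listed order; each must match the whole text.
CONSTANT_TOKENS = [
    ("FLOAT", r"(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?[fFlL]?|\d+[eE][+-]?\d+[fFlL]?"),
    ("INT_CONSTANT_HEX", r"0[xX][0-9a-fA-F]+[uUlL]*"),
    ("INT_CONSTANT_OCT", r"0[0-7]*[uUlL]*"),
    ("INT_CONSTANT_DEC", r"[1-9]\d*[uUlL]*"),
    ("INT_CONSTANT_CHAR", r"L?'(?:\\.|[^\\'\n])+'"),
]


def classify_constant(text):
    """Return the name of the first token whose pattern matches all of text, else None."""
    for name, pattern in CONSTANT_TOKENS:
        if re.fullmatch(pattern, text):
            return name
    return None


if __name__ == "__main__":
    for sample in ["3.14", "1e10", "0x1F", "017", "42u", "'a'"]:
        print(sample, classify_constant(sample))
```

Because each pattern covers a disjoint set of texts, the order of the alternatives does not matter here; a real token set would have to be checked for overlaps.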

Instead of the comment function of the lex file, a corresponding comment production is set as an inclusion in the TextTransformer project options: Backup003.

The project can now be compiled. However, a shortcoming of the Yacc converter becomes apparent at this point: the individual rules are in a form that is correct for TextTransformer, but the relationships between the rules are not. The compiler therefore reports a number of conflicts between different production alternatives.

2. "mechanical" elimination of conflicts

A mechanical method of eliminating the conflicts is to replace the conflicting alternatives by IF-ELSE alternatives. For example,

external_declaration ::=
  function_definition
| declaration

can be replaced by

external_declaration ::=
IF( function_definition())
  function_definition
ELSE
  declaration
END

The same procedure is applied to:

statement
assignment_expression
parameter_declaration
unary_expression
cast_expression

The C parser now appears to work, but it is certainly not efficient.
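
Why the mechanical IF-ELSE solution costs time can be illustrated with a small sketch. The Python class below is a hypothetical stand-in and not TextTransformer code: it assumes that a look-ahead call such as function_definition() parses the input speculatively and then rewinds, and it uses two toy rules consisting of fixed token sequences instead of the real C productions.

```python
class Parser:
    """Toy parser over a list of token strings (an assumption of this sketch)."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.steps = 0  # counts accepted tokens, including speculative ones

    def accept(self, expected):
        """Consume one token if it equals expected; otherwise raise SyntaxError."""
        if self.pos < len(self.tokens) and self.tokens[self.pos] == expected:
            self.pos += 1
            self.steps += 1
            return
        raise SyntaxError(f"expected {expected!r} at position {self.pos}")

    def function_definition(self):
        for tok in ("int", "ID", "(", ")", "{", "}"):
            self.accept(tok)

    def declaration(self):
        for tok in ("int", "ID", "(", ")", ";"):
            self.accept(tok)

    def looks_like(self, rule):
        """Run rule speculatively, rewind the input, and report whether it succeeded."""
        start = self.pos
        try:
            rule()
            return True
        except SyntaxError:
            return False
        finally:
            self.pos = start

    def external_declaration(self):
        # Corresponds to: IF( function_definition()) function_definition ELSE declaration END
        if self.looks_like(self.function_definition):
            self.function_definition()
            return "function_definition"
        self.declaration()
        return "declaration"


if __name__ == "__main__":
    p = Parser(["int", "ID", "(", ")", "{", "}"])
    print(p.external_declaration(), p.steps)
```

In this sketch a function definition of six tokens is accepted twice, once for the test and once for real, and a declaration is only recognized after the speculative attempt has failed at its last token. The same duplication occurs at every production that was converted in this way.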

3. intelligent elimination of conflicts

Both within "unary_expression" and within "cast_expression" there is a conflict between "unary_expression" and an alternative that starts with '('. That "unary_expression" can start with '(' is ultimately due to the alternative:

"(" expression ")"

in "primary_expression". This alternative is therefore removed from "primary_expression" and inserted into "postfix_expression", which becomes:

    primary_expression postfix_expression_tail*
  | "(" expression ")" postfix_expression_tail*

with the auxiliary production:

postfix_expression_tail ::=
  "[" expression "]"
| "(" argument_expression_list? ")"
| "." IDENTIFIER
| "->" IDENTIFIER
| "++"
| "--"

By the same method, the following alternative is then moved out one level further (Backup005):

  | "(" expression ")" postfix_expression_tail*

The conflict in "assignment_expression" between "conditional_expression" and the alternative starting with "unary_expression" arises because "conditional_expression" itself starts with "unary_expression". This is not obvious, because it is hidden by the long chain:

conditional_expression
logical_or_expression
logical_and_expression
inclusive_or_expression
exclusive_or_expression
and_expression
equality_expression
relational_expression
shift_expression
additive_expression
multiplicative_expression
cast_expression
unary_expression

All members of this chain up to "multiplicative_expression" are constructed in the same manner: they start with the subordinate non-terminal, which is followed by an optional rest. "cast_expression" can therefore be pulled out of the chain.

First, "multiplicative_expression" can be replaced everywhere by

  cast_expression 
  multiplicative_expression_tail*
with

multiplicative_expression_tail ::=
(
    "*"
  | "/"
  | "%"      
)
cast_expression

"logical_or_expression" and the other members of the chain are now treated in the same way: Backup005.
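
The pattern "subordinate non-terminal followed by repeated tails" can also be sketched outside TextTransformer. The following Python sketch is an illustration only: make_level and parse_cast are hypothetical helpers, parse_cast accepts a single token as a stand-in for "cast_expression", and only three levels of the chain are built.

```python
def make_level(operators, parse_operand):
    """Build a parser for: operand ( operator operand )* with left-associative results."""

    def parse(tokens, pos):
        left, pos = parse_operand(tokens, pos)
        # Each iteration corresponds to one "..._tail" of the level.
        while pos < len(tokens) and tokens[pos] in operators:
            op = tokens[pos]
            right, pos = parse_operand(tokens, pos + 1)
            left = (op, left, right)
        return left, pos

    return parse


def parse_cast(tokens, pos):
    """Stand-in for cast_expression: accepts exactly one operand token."""
    if pos >= len(tokens):
        raise SyntaxError("operand expected")
    return tokens[pos], pos + 1


# Each level is built from the level below it, as in the chain of productions.
parse_multiplicative = make_level({"*", "/", "%"}, parse_cast)
parse_additive = make_level({"+", "-"}, parse_multiplicative)
parse_shift = make_level({"<<", ">>"}, parse_additive)


if __name__ == "__main__":
    tree, _ = parse_shift("a + b * c << d".split(), 0)
    print(tree)
```

Every level decides with a single token of look-ahead whether another tail follows, which is why this form no longer produces the conflicts of the left-recursive Yacc rules.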

Yacc has no operators such as '?' and '+'. Using these operators simplifies some rules: Backup006.

Now we can have a look at the warning messages:

initializer_list: LL(1) Warning: "," is the start and successor of a nullable structure

In "initializer" there is the alternative:

"{" initializer_list ","? "}"

However, the comma after "initializer_list" is never reached, because within that production every comma that follows an initializer starts a new iteration of the loop:

initializer_list ::=
initializer ( "," initializer )*

A solution for this is the TextTransformer "BREAK" symbol:

initializer_list ::=
initializer 
( 
  "," 
  (
      initializer 
    | BREAK
  )  
)*
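
The effect of this production corresponds roughly to the following loop. The Python sketch is only an analogy under simplifying assumptions: an initializer is a single token here, and the function names are invented for the example.

```python
def parse_initializer_list(tokens, pos=0):
    """Parse: initializer ( "," ( initializer | BREAK ) )* and return (items, next_pos)."""
    items = [tokens[pos]]  # first initializer (a single token in this sketch)
    pos += 1
    while pos < len(tokens) and tokens[pos] == ",":
        pos += 1  # consume the comma
        if pos < len(tokens) and tokens[pos] != "}":
            items.append(tokens[pos])  # another initializer follows
            pos += 1
        else:
            break  # trailing comma: leave the loop, like BREAK
    return items, pos


def parse_braced_initializer(tokens):
    """Parse: "{" initializer_list "}"; the list itself absorbs a trailing comma."""
    if not tokens or tokens[0] != "{":
        raise SyntaxError('expected "{"')
    items, pos = parse_initializer_list(tokens, 1)
    if pos >= len(tokens) or tokens[pos] != "}":
        raise SyntaxError('expected "}"')
    return items


if __name__ == "__main__":
    print(parse_braced_initializer("{ 1 , 2 , 3 , }".split()))
```

The list with and without a trailing comma yields the same items, so the enclosing rule no longer needs an optional comma of its own.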

The loop is now left as soon as the closing brace follows a comma. The optional comma in the alternative above therefore has to be removed:

"{" initializer_list "}"

The same applies to "parameter_type_list" and "parameter_list".

"external_declaration" is made LL(1) conformant, and "parameter_declaration" is remodeled so that a smaller look-ahead is needed: Backup007.

Recognition of type definitions

In C, names can be defined as identifiers of user-defined types, e.g.:

typedef const char* cp;

Such a name can then be used like a predefined type. This behavior can be reproduced in TextTransformer with "dynamic" tokens. For this reason the token TYPE_NAME was already defined as a placeholder token:

TYPE_NAME ::= {DYNAMIC}

The "typedef" token appears in the grammar as one of the alternatives of "storage_class_specifier". It is misplaced there: type definitions would be permitted within a "parameter_type_list", and a repetition of the "typedef" keyword would be allowed, as in:

typedef typedef const char* cp;

In addition, some auxiliary semantic code has to be written for the use of the dynamic token, and it is better to keep this code isolated. All previous remodellings preserved the formal equivalence with the original Yacc grammar. This time the equivalence is given up, and the new production "type_definition" is introduced:

type_definition ::=
{{
str sTypename;
}}
"typedef" 
declaration_specifiers?
type_declarator[sTypename]
{{ AddToken(sTypename, "TYPE", ScopeStr()); }} 
";"

The production "type_declarator", which is also new, is syntactically the same as the "declaration" production, but it contains the semantic code that provides the name for the type definition.

Code is also inserted in "struct_or_union_specifier" and in "enum_specifier" to register the names of the defined structures or enumerations.
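
The idea behind a dynamic token can be sketched as a set of names that grows while the text is processed. The following Python sketch is not how TextTransformer implements dynamic tokens. The class DynamicTokens, the function scan, and the small keyword set are assumptions for this example, and the sketch assumes simple declarators in which the new type name is the last identifier before the semicolon.

```python
import re

KEYWORDS = {"typedef", "const", "char", "int"}  # tiny subset, enough for the example


class DynamicTokens:
    """Hypothetical stand-in for the dynamic TYPE_NAME token: a growing set of names."""

    def __init__(self):
        self.type_names = set()

    def add_token(self, name):
        self.type_names.add(name)

    def classify(self, word):
        return "TYPE_NAME" if word in self.type_names else "IDENTIFIER"


def scan(source):
    """Tokenize a tiny C subset and register the names introduced by typedef."""
    dynamic = DynamicTokens()
    result = []
    in_typedef = False
    last_identifier = None
    for word in re.findall(r"[A-Za-z_]\w*|\S", source):
        if word in KEYWORDS:
            kind = word
            if word == "typedef":
                in_typedef = True
                last_identifier = None
        elif re.fullmatch(r"[A-Za-z_]\w*", word):
            kind = dynamic.classify(word)  # depends on the typedefs seen so far
            last_identifier = word
        else:
            kind = word
            if word == ";" and in_typedef:
                if last_identifier is not None:
                    dynamic.add_token(last_identifier)  # name is a type from now on
                in_typedef = False
        result.append((kind, word))
    return result


if __name__ == "__main__":
    for token in scan("typedef const char* cp; cp s;"):
        print(token)
```

The same word "cp" is classified as IDENTIFIER inside the type definition and as TYPE_NAME afterwards, which is the behavior the placeholder token is meant to provide.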

One last problem remains: types that are defined locally in a "compound_statement" are valid only within that scope. Therefore a corresponding scope is created on entering "compound_statement" and left again at its end.

compound_statement ::=
"{"  {{PushScope(ScopeStr() + ".local" + itos(m_iLocalScope++)); }}
( 
    declaration_list statement_list?
  | statement_list  
)?  {{PopScope(); }}
"}"

Likewise, an "external" scope is created in "translation_unit".
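
The interplay of scopes and registered type names can be sketched as a stack. The Python class below only mirrors the calls used in the productions above (PushScope, PopScope, ScopeStr, AddToken) under hypothetical names. How TextTransformer resolves dynamic tokens across scopes internally is not shown here, and the rule that names of all enclosing scopes remain visible is an assumption of the sketch.

```python
class ScopedTypeNames:
    """Hypothetical model of type names that are valid only within their scope."""

    def __init__(self):
        self.scopes = ["external"]  # stack of scope names, innermost last
        self.names = {"external": set()}
        self.local_counter = 0  # plays the role of m_iLocalScope

    def scope_str(self):
        return self.scopes[-1]

    def push_scope(self):
        # Mirrors: PushScope(ScopeStr() + ".local" + itos(m_iLocalScope++))
        name = f"{self.scope_str()}.local{self.local_counter}"
        self.local_counter += 1
        self.scopes.append(name)
        self.names[name] = set()

    def pop_scope(self):
        # Leaving the scope discards the type names registered in it.
        self.names.pop(self.scopes.pop())

    def add_token(self, type_name):
        self.names[self.scope_str()].add(type_name)

    def is_type_name(self, word):
        # Assumption: names of all enclosing scopes are visible.
        return any(word in self.names[scope] for scope in self.scopes)


if __name__ == "__main__":
    table = ScopedTypeNames()
    table.add_token("cp")       # typedef at file level
    table.push_scope()          # "{" of a compound_statement
    table.add_token("local_t")  # typedef inside the block
    print(table.scope_str(), table.is_type_name("local_t"))
    table.pop_scope()           # "}" of the compound_statement
    print(table.scope_str(), table.is_type_name("local_t"))
```

After the scope has been left, "local_t" is an ordinary identifier again, while "cp" from the external scope remains a type name.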
