NonFree


 

NonFree.ttp is the real project that I developed, before the IMP filter was first published, to adapt the filter to my own needs. Only my name has been changed in the project. All the previous projects can be used with the free version of the TextTransformer; NonFree.ttp cannot. It is nevertheless presented briefly here as a source of ideas, a "quarry".

 

The central rule is the text production. It consists essentially of a loop with alternative regular expressions for words, punctuation marks and the other printable characters.

 

(

    WORD

  | PUNCTUATION

  | SPECIAL

  | NBSP     {{ SetIsSpam(" : NBSP"); }} // protected blank

)*

 

WORD ::= [[:alnum:]Æ-Ïæ-ï€ÀÁÂÄÒÓÔÖÙÚÛÜßàáâäòóôöùúûü]+

PUNCTUATION ::= [[:punct:]]+

SPECIAL ::= [^[:alnum:][:punct:][:space:]Æ-Ïæ-ï€ÀÁÂÄÒÓÔÖÙÚÛÜßàáâäòóôöùúûü]+

NBSP ::= \xA0

 

This loop can parse any text completely, because together the expressions cover every ANSI character. (Blanks are skipped automatically, in accordance with the project options.)
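The completeness of the loop can be illustrated outside the TextTransformer. The following is only a minimal Python sketch, not the project's code: Python's `re` module has no POSIX classes, so ASCII letters and digits stand in for `[:alnum:]` and `string.punctuation` for `[:punct:]` (both assumptions), and the names `TOKEN` and `scan` are made up for this example. The list of reasons merely imitates what `SetIsSpam` records.

```python
import re
import string

# Extra letters copied from the WORD token above.
EXTRA = "Æ-Ïæ-ï€ÀÁÂÄÒÓÔÖÙÚÛÜßàáâäòóôöùúûü"
# Assumption: ASCII punctuation stands in for [:punct:].
PUNCT = re.escape(string.punctuation)
BLANKS = " \t\r\n\f\v"

# One alternative per token, mirroring the loop in the text production.
TOKEN = re.compile(
    rf"(?P<WORD>[A-Za-z0-9{EXTRA}]+)"
    rf"|(?P<PUNCTUATION>[{PUNCT}]+)"
    rf"|(?P<NBSP>\xA0)"
    rf"|(?P<SPECIAL>[^A-Za-z0-9{PUNCT} \t\r\n\f\v\xA0{EXTRA}]+)"
)


def scan(text):
    """Return (tokens, spam_reasons); ordinary blanks are skipped."""
    tokens, reasons, pos = [], [], 0
    while pos < len(text):
        if text[pos] in BLANKS:
            pos += 1
            continue
        m = TOKEN.match(text, pos)
        if m is None:
            # Cannot happen: SPECIAL catches whatever the others do not.
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append((m.lastgroup, m.group()))
        if m.lastgroup == "NBSP":
            reasons.append("NBSP")  # imitates SetIsSpam("NBSP")
        pos = m.end()
    return tokens, reasons
```

Because SPECIAL is defined as the complement of the other classes, every non-blank character belongs to exactly one alternative, so the loop never gets stuck.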

 

This loop is extended by additional alternatives that are candidates for the beginnings of spam phrases, e.g.:

 

(

    WORD

  | PUNCTUATION

  | SPECIAL

  | NBSP     {{ SetIsSpam("NBSP"); }} // protected blank

  | pricelist_if

)*

 

pricelist_if :

 

IF( pricelist() )

  pricelist         {{ SetIsSpam("price list");  }}

ELSE

  price  

END  

 

The pricelist_if production was chosen here because it uses an interesting feature of the TextTransformer, though one that isn't available in the free version. With the lines:

 

  IF( pricelist() )

    pricelist 

 

the parser is instructed to look ahead in the text. The text is parsed as a price list, and the spam attribute is set, only if a whole list of articles and prices actually follows. Otherwise the loop above continues at the current position. The advantage of the look-ahead is that the grammar does not have to account for every possible continuation of a recognized token just to keep the parser from aborting early.
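The idea behind the look-ahead can be sketched in Python as follows. This is only an illustration under stated assumptions, not the definition used in NonFree.ttp: the source does not show how `price` and `pricelist` are defined, so the patterns `PRICE` and `ITEM`, the threshold `MIN_ITEMS` and the function names are all hypothetical. A list entry is assumed to begin with a price, so that both branches start with the same token, which is the situation where a look-ahead is needed.

```python
import re

# Hypothetical token shapes; NonFree.ttp may define them differently.
PRICE = re.compile(r"\d+[.,]\d{2}")
ITEM = re.compile(r"\d+[.,]\d{2}[ \t]+\w+[ \t]*(?:\n|$)")
MIN_ITEMS = 3  # assumed minimum for "a whole list"


def match_pricelist(text, pos):
    """Look ahead: return the end offset of a price list at pos, or None."""
    count = 0
    while True:
        m = ITEM.match(text, pos)
        if m is None:
            break
        count += 1
        pos = m.end()
    return pos if count >= MIN_ITEMS else None


def pricelist_if(text, pos):
    """Imitate IF( pricelist() ) pricelist ELSE price END at offset pos.

    Returns (new_offset, spam_reason); spam_reason is None for a lone price.
    """
    end = match_pricelist(text, pos)  # nothing is consumed by the test
    if end is not None:
        return end, "price list"      # imitates SetIsSpam("price list")
    m = PRICE.match(text, pos)
    if m is None:
        raise ValueError(f"no price at offset {pos}")
    return m.end(), None
```

With these assumptions, `pricelist_if("9.99 is cheap", 0)` consumes only the price and reports no spam, while three consecutive "price article" lines are consumed as a whole and marked as a price list.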