A typical structure of a mail text

The IMP filter parses text. This means that a text is analyzed according to its structure and its components. A simple example briefly demonstrates the principle. The header of the e-mail and possible substructures with binary data are ignored on this introductory page. Only the plain text is analyzed, as it is displayed in an e-mail program.

This text can be analyzed in different ways, in the simplest case as a plain list of words. (A list is a structure too.) The following typical mail, however, suggests a more complex structure.

---------

Dear Heinz,

blah blah blah

Cordially yours

Fatty

-----------

A salutation is followed by a text, and the text is followed by a greeting. The salutation in turn is structured in itself and contains, among other things, a name. If this name can be found, it serves well as a criterion that a mail is not spam. If the addressee of the above mail is actually Heinz, then in all probability the mail isn't spam. It would be spam, however, if the salutation were, for example, "Dear Fatty". This would presumably be an advertisement for a slimming product.

In this case a word filter doesn't suffice to classify the mail as spam or not spam. Heinz could have a friend whose nickname is Fatty. To make the distinction it is therefore necessary to know in which position the name appears: Fatty in the salutation would be spam, but Fatty as signatory would not.
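The difference between a plain word filter and a position-aware check can be sketched in Python. This is only an illustration, not how the IMP filter or the TextTransformer works internally: the function names are hypothetical, and it assumes that the first non-empty line is the salutation and the last non-empty line is the signatory.

```python
import re

def nonempty_lines(body):
    """Return the stripped, non-empty lines of a mail body."""
    return [line.strip() for line in body.splitlines() if line.strip()]

def contains_word(body, word):
    """Plain word filter: True if the word occurs anywhere in the body."""
    return word.lower() in re.findall(r"\w+", body.lower())

def name_position(body, name):
    """Report where the name occurs: 'salutation', 'signatory', 'text' or None.

    Assumption: first non-empty line = salutation, last = signatory.
    """
    lines = nonempty_lines(body)
    if not lines:
        return None

    def has_name(line):
        return name.lower() in re.findall(r"\w+", line.lower())

    if has_name(lines[0]):
        return "salutation"
    if has_name(lines[-1]):
        return "signatory"
    if any(has_name(line) for line in lines[1:-1]):
        return "text"
    return None
```

For the sample mail above, contains_word finds "Fatty" just as it would in a mail beginning with "Dear Fatty", so the word list alone cannot separate the two cases. name_position reports "signatory" for the first and "salutation" for the second.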

With the TextTransformer you can analyze such a mail text. First, the structure of the mail above is described a little more abstractly:

mail ::= salutation text greeting

Salutation, text and greeting are text components, each of which has a structure of its own. A definition for the salutation could be:

salutation ::= "Dear" "Heinz"

That is, the words "Dear" and "Heinz" follow one another in a salutation.

However, this doesn't characterize a salutation sufficiently. "Hello Heinz" would be a correct salutation too. Therefore the above rule (technical term: production) could be generalized to:

salutation ::= ("Dear" | "Hello") "Heinz"

That is, in a salutation the word "Heinz" follows one of the words "Dear" or "Hello". This certainly still doesn't suffice, but it illustrates the principle. It works very much like the regular expressions that may be familiar to many users of the Spamihilator. In a TextTransformer project, rules describe the construction of a text much as regular expressions describe the construction of words. For example, the repetition operators known from regular expressions can be used for the text rule:

text ::= (WORD+ PUNCTUATION)*

The tokens WORD and PUNCTUATION can be described by "genuine" regular expressions.
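To make the relation between tokens and rules concrete, the salutation and text rules above can be imitated in a few lines of Python. This is a rough sketch and not TextTransformer syntax: the regular expressions chosen for WORD and PUNCTUATION are assumptions (the page names the tokens but does not define them), the function names are hypothetical, and tolerating a trailing comma after the salutation is an added convenience.

```python
import re

# Assumed token definitions; the page does not spell them out.
WORD = r"[A-Za-z]+"
PUNCTUATION = r"[.,;:!?]"
TOKEN = re.compile(rf"\s*(?:(?P<WORD>{WORD})|(?P<PUNCTUATION>{PUNCTUATION}))")

def tokenize(text):
    """Return (kind, value) pairs; raise ValueError on characters no token covers."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens

def is_salutation(line, name="Heinz"):
    """salutation ::= ("Dear" | "Hello") "Heinz"

    Assumption: one trailing PUNCTUATION token (e.g. the comma) is ignored.
    """
    tokens = tokenize(line)
    if tokens and tokens[-1][0] == "PUNCTUATION":
        tokens = tokens[:-1]
    values = [value for _, value in tokens]
    return len(values) == 2 and values[0] in ("Dear", "Hello") and values[1] == name

def is_text(text):
    """text ::= (WORD+ PUNCTUATION)*"""
    kinds = [kind for kind, _ in tokenize(text)]
    i = 0
    while i < len(kinds):
        start = i
        while i < len(kinds) and kinds[i] == "WORD":  # WORD+
            i += 1
        # At least one WORD is required, followed by exactly one PUNCTUATION.
        if i == start or i == len(kinds) or kinds[i] != "PUNCTUATION":
            return False
        i += 1
    return True
```

With these definitions is_salutation accepts "Dear Heinz," and "Hello Heinz" but rejects "Dear Fatty,". is_text accepts "blah blah blah." but rejects "blah blah blah" without closing punctuation, which shows that the text rule, like the salutation rule, is still too strict for real mails.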