The MIME.ttp project aims to provide as complete a parser for e-mails as possible. For example, the parser can serve as a base for your own extensions for recognizing spam, which can be used directly in the anti-spam software Spamihilator by means of the IMP filter plugin. It has turned out that the detailed MIME parser can fend off spam even without such an extension, since spam mails often do not really conform to the MIME standard. Software that prepares mails for the reader does not observe the standard in detail either, but tries to display every received e-mail as well as possible. It is precisely this tolerance that makes the spammers' work easier.
MIME is today's standard for the structure of e-mails. In contrast to the older RFC822 standard, which was designed only for the transmission of text, MIME also permits sending pictures, videos and other binary data. E-mails according to RFC822 simply consist of some header lines and the actual message text. In MIME multipart messages, however, the header lines can be followed by several textual sub-structures.
The MIME standard, the result of several years of development, is not specified completely in a single document. Instead, the parts of the specification are distributed over several historically grown documents, some of which correct each other. This might be one of the reasons why there is no MIME grammar for any other parser generator yet. As far as the author knows, the publicly available handwritten parsers also treat the header data only in simplified form. Another reason for the lack of a complete MIME grammar is presumably that only the TextTransformer is technically able to cope with the complex demands of a complete MIME parser.
Treatment of comments:
Comments are already defined in RFC822 as text in parentheses, '(' and ')'. According to the old standard they can occur after every token of the headers, e.g. even between the name of a field and the following colon. RFC2822 no longer allows this, but an e-mail parser nevertheless has to be able to read the old form. MIME.ttp handles comments with the "inclusion" feature of the TextTransformer program: a production that is set as an inclusion in the project options is checked after every token.
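The effect of skipping comments wherever they occur can be approximated outside the TextTransformer with a small stand-alone routine. The following Python sketch is not taken from MIME.ttp; the function name strip_comments is hypothetical, and the sketch assumes that comments may nest and that parentheses inside quoted strings or after a backslash do not delimit a comment.

```python
def strip_comments(value: str) -> str:
    """Remove parenthesized comments from a header value (illustrative sketch)."""
    out = []
    depth = 0          # current comment nesting level
    in_quotes = False  # True while inside a "quoted string"
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            # Quoted pair: keep both characters unless we are inside a comment.
            if depth == 0:
                out.append(value[i:i + 2])
            i += 2
            continue
        if depth == 0 and ch == '"':
            in_quotes = not in_quotes
            out.append(ch)
        elif not in_quotes and ch == "(":
            depth += 1                 # a comment opens (possibly nested)
        elif not in_quotes and ch == ")" and depth > 0:
            depth -= 1                 # a comment closes
        elif depth == 0:
            out.append(ch)             # ordinary text outside any comment
        i += 1
    return "".join(out)
```

Unlike an inclusion, which the parser checks after every token, this routine works on the raw text in a single pass; it is only meant to show which characters belong to a comment.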
Line folding:
For better readability, fields can be "folded" over several lines. To distinguish the continuation of a folded line from the beginning of a new field, each continuation line has to begin with at least one space or tab. Thus, the single line:
To: "Joe & J. Harvey" <ddd @Org>, JJV @ BBN
can be represented as:
To: "Joe & J. Harvey" <ddd @ Org>,
        JJV@BBN
These folds are recognized by the token "FWS" (folding white space) in MIME.ttp:
FWS ::= (\r\n[ \t]+)+
According to the longest-match rule of the TextTransformer, a match of this longer expression is preferred to the recognition of a simple line break. This is a simple kind of look-ahead at the token level.
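The preference for the longer match can be imitated with ordinary regular expressions. The following Python sketch is an illustration only and not part of MIME.ttp: the token names CRLF and TEXT and the functions tokenize and unfold are assumptions. Python's re module tries alternatives from left to right instead of choosing the longest match, so the FWS alternative is simply listed first.

```python
import re

# FWS is listed before CRLF so that a line break followed by white space
# is taken as one folding token rather than as a plain line break.
TOKEN = re.compile(
    r"(?P<FWS>(?:\r\n[ \t]+)+)|(?P<CRLF>\r\n)|(?P<TEXT>[^\r\n]+)"
)

def tokenize(raw: str):
    """Return (kind, text) pairs for a block of header lines."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(raw)]

def unfold(raw: str) -> str:
    """Remove the CRLF of each fold and keep the white space that followed it."""
    return re.sub(r"\r\n(?=[ \t])", "", raw)

headers = 'To: "Joe & J. Harvey" <ddd @ Org>,\r\n        JJV@BBN\r\nSubject: test\r\n'
for kind, text in tokenize(headers):
    print(kind, repr(text))
print(repr(unfold(headers)))
```

In this sketch the fold after the comma is reported as a single FWS token, while the line breaks that end the two fields are reported as CRLF.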
Boundaries:
The beginning and end of the parts of a multipart message are marked by special boundaries. The specific boundary strings are defined in the header data of the complete mail or of its parts. MIME.ttp creates dynamic tokens for these definitions:
BOUNDARY_BEGIN ::= {DYNAMIC}
BOUNDARY_END ::= {DYNAMIC}
If a boundary string occurs later in the text of the mail, it is then recognized by the corresponding dynamic token.
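The idea of a token whose pattern is only known at parse time can be sketched in Python as follows. This is an assumption-laden illustration, not the mechanism of the TextTransformer: the function boundary_patterns is hypothetical, and the sketch relies only on the MIME convention that a part begins with "--" plus the boundary string and that the multipart body ends with "--" plus the boundary string plus "--".

```python
import re

def boundary_patterns(content_type: str):
    """Build begin/end delimiter patterns from a Content-Type header value."""
    m = re.search(r'boundary\s*=\s*(?:"([^"]+)"|([^\s;]+))',
                  content_type, re.IGNORECASE)
    if m is None:
        return None                      # not a multipart header
    boundary = re.escape(m.group(1) or m.group(2))
    # The patterns are created at run time, like the dynamic tokens above.
    begin = re.compile(rf"^--{boundary}[ \t]*\r?$", re.MULTILINE)
    end = re.compile(rf"^--{boundary}--[ \t]*\r?$", re.MULTILINE)
    return begin, end

begin, end = boundary_patterns('multipart/mixed; boundary="XyZ"')
body = "preamble\r\n--XyZ\r\npart one\r\n--XyZ\r\npart two\r\n--XyZ--\r\n"
print(len(begin.findall(body)), len(end.findall(body)))
```

The boundary string is escaped before it is placed into the pattern, because boundaries may contain characters that have a special meaning in regular expressions.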
Conflict resolutions:
As far as the RFC documents contain formal specifications of the MIME grammar, these are largely formulated without regard to practical usability. For example, the following part of RFC2822 shows that both alternatives of "address" can start with "display-name", i.e. "phrase":
address = mailbox / group
mailbox = name-addr / addr-spec
name-addr = [display-name] angle-addr
angle-addr = [CFWS] "<" addr-spec ">" [CFWS] / obs-angle-addr
group = display-name ":" [mailbox-list / CFWS] ";" [CFWS]
display-name = phrase
mailbox-list = (mailbox *("," mailbox)) / obs-mbox-list
address-list = (address *("," address)) / obs-addr-list
To avoid unnecessary look-aheads, the MIME grammar for the TextTransformer was made LL(1)-conforming as far as possible. The "address" production above thus becomes:
phrase
(
  angle_addr                         // mailbox
| "@" FWS? domain                    // mailbox
| ":" FWS? mailbox_list? ";" FWS?    // group
)
| angle_addr                         // mailbox
(Note: "local_part" at the beginning of "addr-spec", which is not shown here, is understood as a special case of a generalized "phrase".)
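The benefit of factoring out the common "phrase" prefix can be illustrated with a small decision routine. The following Python sketch is a strong simplification and not the parser generated from MIME.ttp: the function classify_address and the pre-split token list are assumptions, a phrase is taken to be any run of tokens that are not special characters, and the routine only decides which alternative applies without parsing the rest.

```python
SPECIALS = {"<", ">", "@", ":", ";", ","}

def classify_address(tokens):
    """Decide between 'mailbox' and 'group' with one token of look-ahead."""
    i = 0
    # Consume the common prefix "phrase" (zero or more ordinary words).
    while i < len(tokens) and tokens[i] not in SPECIALS:
        i += 1
    nxt = tokens[i] if i < len(tokens) else None
    if nxt == "<":
        return "mailbox"      # phrase angle_addr, or angle_addr alone
    if i > 0 and nxt == "@":
        return "mailbox"      # phrase "@" domain
    if i > 0 and nxt == ":":
        return "group"        # phrase ":" mailbox_list? ";"
    raise ValueError(f"unexpected token {nxt!r} at position {i}")

print(classify_address(["Joe", "Harvey", "<", "ddd", "@", "Org", ">"]))
print(classify_address(["Team", ":", "ddd", "@", "Org", ";"]))
```

Because the phrase is consumed before the decision is made, the single token that follows it is enough to select the alternative, which is the point of the LL(1) form.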
There is a tolerant MIME parser at:
Attention: While the comment production is set as an inclusion in the project options of MIME.ttp, in Simple_MIME.ttp it is set in the local options of the respective productions.
BOUNDARY_END was moved.
----
Standard for ARPA Internet Text Messages
This standard specifies a syntax for text messages that are sent between computer users, within the framework of "electronic mail" messages. This standard supersedes the one specified in RFC 822, updating it to reflect current practice and incorporating incremental changes that were specified in other RFCs.
specifies the various headers used to describe the structure of MIME messages.
defines the general structure of the MIME media typing system and defines an initial set of media types.
describes extensions to RFC 822 to allow non-US-ASCII text data in Internet mail header fields.
specifies various IANA registration procedures for MIME-related facilities.
describes MIME conformance criteria as well as providing some illustrative examples of MIME message formats, acknowledgements, and the bibliography.