AllLinksAreSpam
AllLinksAreSpam is based on the HTMLText project; syntactically the two are identical. A simple semantic action has been added, however, which classifies a mail as spam as soon as a link is found in it. This is reasonable because almost every spam mail is a vehicle for links. If you allow links only in e-mails whose sender addresses are in the friend list, AllLinksAreSpam gives you an effective spam filter. In addition, this project demonstrates the advantage of the HTML option over the text option: the plain-text version of a mail often doesn't contain all the links.
The action is executed in the Link production:
NORMAL_LINK {{m_iResult = -1; }}
| "http://www.mydomain.com"
| "mailto:"? (
    EMAIL {{m_iResult = -1; }}
  | "myname@mydomain.com"
  )
NORMAL_LINK is a regular expression that describes the pattern of most links:
NORMAL_LINK ::= (http://|ftp://)?[^\r\n\t <>"@]+(\.[^\r\n\t <>"@]+)+
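To get a feeling for what this pattern accepts, it can be tried outside of TextTransformer. The following is only a minimal sketch in Python, assuming that Python's re module interprets the expression in the same way; the helper name is_normal_link is hypothetical and not part of the project.

```python
import re

# NORMAL_LINK as given in the text, transcribed for Python's re module
# (assumption: the regex dialects agree for this expression).
NORMAL_LINK = re.compile(r'(http://|ftp://)?[^\r\n\t <>"@]+(\.[^\r\n\t <>"@]+)+')


def is_normal_link(text: str) -> bool:
    """Return True if the whole string matches NORMAL_LINK (hypothetical helper)."""
    return NORMAL_LINK.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("http://www.example.com/index.html",
                      "www.example.com",
                      "example",
                      "myname@mydomain.com",
                      "3.14"):
        print(candidate, is_normal_link(candidate))
```

The pattern requires at least one dot and excludes the @ character, so e-mail addresses are left to the EMAIL token. It is deliberately broad: in this sketch a string such as 3.14 matches as well.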
EMAIL is a regular expression that describes the pattern of most e-mail addresses:
EMAIL ::= [\w\.-]+ \      // local part
          @ \
          ([\w-]+\.)+ \   // sub domains
          [a-zA-Z]{2,4}   // top level domain
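The three parts of the expression can be checked in isolation with a short Python sketch. It assumes that Python's re module treats the pattern like TextTransformer does; re.VERBOSE is used only to keep the comments next to the parts, and the helper name is_email is hypothetical.

```python
import re

# EMAIL as given in the text; re.VERBOSE ignores the layout whitespace
# and allows the per-part comments.
EMAIL = re.compile(
    r"""
    [\w\.-]+        # local part
    @
    ([\w-]+\.)+     # sub domains
    [a-zA-Z]{2,4}   # top level domain
    """,
    re.VERBOSE,
)


def is_email(text: str) -> bool:
    """Return True if the whole string matches EMAIL (hypothetical helper)."""
    return EMAIL.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("myname@mydomain.com",
                      "first.last@mail.example.org",
                      "www.example.com",
                      "user@example.museum"):
        print(candidate, is_email(candidate))
```

Because the top level domain is limited to two to four letters, an address ending in a longer domain such as .museum is not matched as a whole in this sketch, which fits the wording "most e-mail addresses".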
Your own addresses are a special case. Spammers sometimes copy them into a mail. But these addresses can also indicate that the mail is a reply from a sender whom you haven't included in your friend list yet. It is therefore a good idea to develop a regular expression which matches your own address only when it is quoted exactly the way you write it in your own mails. For example, the regular expression
MY_EMAIL ::= -+\r?\nmailto:myname@mydomain.com
would match the following notation:
----------------
mailto:myname@mydomain.com
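That the line break between the dashes and the address is essential can be verified with a small Python sketch. It assumes that Python's re module interprets MY_EMAIL like TextTransformer does; the helper name quotes_own_address and the sample mail bodies are made up for illustration.

```python
import re

# MY_EMAIL as given in the text: a run of dashes, a line break
# (with optional carriage return), then the mailto notation.
MY_EMAIL = re.compile(r"-+\r?\nmailto:myname@mydomain.com")


def quotes_own_address(body: str) -> bool:
    """Return True if the body contains the own address in signature form (hypothetical helper)."""
    return MY_EMAIL.search(body) is not None


if __name__ == "__main__":
    reply = "Thanks for your mail.\n----------------\nmailto:myname@mydomain.com\n"
    spam = "Write to mailto:myname@mydomain.com and win!"
    print(quotes_own_address(reply))
    print(quotes_own_address(spam))
```

An address that a spammer merely copies into running text lacks the preceding line of dashes, so it is not recognized as the quoted signature.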
The Link production could then be extended to:
NORMAL_LINK {{if(m_iResult != 1) m_iResult = -1; }}
| "mailto:"? (
    EMAIL {{if(m_iResult != 1) m_iResult = -1; }}
  | MY_EMAIL {{m_iResult = 1; }}
  )
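The effect of the guarded actions can be followed in a small simulation. This is only a sketch in Python, not the project's code: it replays the three semantic actions for a sequence of recognized token names. The function name classify and the start value 0 for m_iResult are assumptions, and 1 is read here as "own address was quoted, so do not mark the mail as spam".

```python
def classify(token_kinds, initial=0):
    """Replay the semantic actions of the extended Link production.

    token_kinds: names of the tokens in the order they were recognized.
    initial: assumed start value of m_iResult (not stated in the text).
    """
    result = initial
    for kind in token_kinds:
        if kind == "MY_EMAIL":
            # {{m_iResult = 1; }}
            result = 1
        elif kind in ("NORMAL_LINK", "EMAIL"):
            # {{if(m_iResult != 1) m_iResult = -1; }}
            if result != 1:
                result = -1
    return result


if __name__ == "__main__":
    print(classify([]))
    print(classify(["NORMAL_LINK", "EMAIL"]))
    print(classify(["NORMAL_LINK", "MY_EMAIL"]))
    print(classify(["MY_EMAIL", "NORMAL_LINK"]))
```

Once MY_EMAIL has been found the result stays at 1, regardless of whether other links occur before or after it in the mail.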