AllLinksAreSpam

AllLinksAreSpam is based on the HTMLText project; syntactically the two are identical. The only difference is a simple semantic action that classifies a mail as spam as soon as a link is found in it. This is reasonable because almost every spam mail is a vehicle for links. If links are allowed only in e-mails whose addresses are in the friend list, AllLinksAreSpam makes an effective spam filter. The project also demonstrates the advantage of the HTML option over the text option: the plain-text version of a mail often does not contain all the links.

The action is executed in the Link production:

NORMAL_LINK {{m_iResult = -1; }}
| "http://www.mydomain.com"
| "mailto:"?
  (
    EMAIL {{m_iResult = -1; }}
  | "myname@mydomain.com"
  )

NORMAL_LINK is a regular expression which describes the pattern of most links.

NORMAL_LINK ::=

(http://|ftp://)?[^\r\n\t <>"@]+(\.[^\r\n\t <>"@]+)+
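To see what this pattern accepts, it can be tried outside the parser. The following Python sketch is only an illustration: `find_link` is a hypothetical helper, and Python's `re.search` scans the whole text, which is not necessarily how the generated scanner applies the token.

```python
import re

# The NORMAL_LINK pattern from the text, unchanged.
NORMAL_LINK = re.compile(r'(http://|ftp://)?[^\r\n\t <>"@]+(\.[^\r\n\t <>"@]+)+')


def find_link(text):
    """Return the first substring matched by NORMAL_LINK, or None."""
    match = NORMAL_LINK.search(text)
    return match.group(0) if match else None


print(find_link("Visit http://www.example.com/page now"))
```

The pattern is deliberately loose: the protocol prefix is optional, so any two runs of allowed characters joined by a dot are accepted, including something like `readme.txt`.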

EMAIL is a regular expression which describes the pattern of most e-mail-addresses:

EMAIL ::=
[\w\.-]+        \ // local part
@               \
([\w-]+\.)+     \ // sub domains
[a-zA-Z]{2,4}     // top level domain
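The same expression can be tried in Python, where `re.VERBOSE` allows a similar commented, multi-line layout. This is a sketch for experimenting with the pattern, not part of the project; the sample addresses are made up.

```python
import re

# The EMAIL pattern from the text, written with re.VERBOSE so that each
# part can carry its own comment, as in the grammar notation.
EMAIL = re.compile(
    r"""
    [\w\.-]+         # local part
    @
    ([\w-]+\.)+      # sub domains
    [a-zA-Z]{2,4}    # top level domain
    """,
    re.VERBOSE,
)


def find_email(text):
    """Return the first e-mail address found in text, or None."""
    match = EMAIL.search(text)
    return match.group(0) if match else None


print(find_email("write to myname@mydomain.com today"))
```

Because the top level domain is limited to two to four letters, an address such as `user@example.museum` is not matched as a whole, which fits the statement that the expression describes most addresses rather than all of them.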

One's own addresses are a special case. Spammers sometimes copy them into the mail. However, they can also indicate that the mail is a reply from a sender who is not yet in your friend list.

It is a good idea to develop a regular expression that matches your own address only when it is quoted exactly the way you write it in your own mails. For example, the regular expression:

MY_EMAIL ::= -+\r?\nmailto:myname@mydomain.com

would match the following notation:

----------------
mailto:myname@mydomain.com
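A quick way to check such a signature pattern is to run it against a few sample texts. The Python sketch below uses the MY_EMAIL expression exactly as given above; `is_own_signature` is a hypothetical helper name.

```python
import re

# MY_EMAIL as given in the text: a line of dashes, one line break, then
# the mailto link. The dots are unescaped in the source and are kept so.
MY_EMAIL = re.compile(r"-+\r?\nmailto:myname@mydomain.com")


def is_own_signature(text):
    """True if the text contains the own address in the quoted signature form."""
    return MY_EMAIL.search(text) is not None


print(is_own_signature("Regards\n----------------\nmailto:myname@mydomain.com\n"))
```

The address alone, without the preceding line of dashes, does not match, so an address that a spammer merely copied into the text is not treated as a quoted signature.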

The Link production could then be extended to:

NORMAL_LINK {{if(m_iResult != 1) m_iResult = -1; }}
| "http://www.mydomain.com"
| "mailto:"?
  (
    EMAIL {{if(m_iResult != 1) m_iResult = -1; }}
  | MY_EMAIL {{m_iResult = 1; }}
  )
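The effect of these actions can be approximated in a few lines of Python. This is a rough sketch under several assumptions, not the project's implementation: the result starts at 0 (the text does not state the initial value), -1 is read as "spam" and 1 as "reply to one of my own mails", and a single combined regular expression stands in for the generated scanner. The names `classify`, `PATTERNS` and `TOKEN` are hypothetical.

```python
import re

# Alternatives in priority order; the patterns follow the text, with the
# placeholder address and domain used there.
PATTERNS = [
    ("my_email", r"-+\r?\nmailto:myname@mydomain\.com"),
    ("own_site", r"http://www\.mydomain\.com"),
    ("email", r"(?:mailto:)?[\w\.-]+@(?:[\w-]+\.)+[a-zA-Z]{2,4}"),
    ("link", r'(?:http://|ftp://)?[^\r\n\t <>"@]+(?:\.[^\r\n\t <>"@]+)+'),
]
TOKEN = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in PATTERNS))


def classify(mail_text):
    """Return -1 (spam), 1 (own signature quoted) or 0 (nothing found)."""
    result = 0  # assumed neutral start value
    for match in TOKEN.finditer(mail_text):
        kind = match.lastgroup
        if kind == "my_email":
            result = 1
        elif kind in ("email", "link") and result != 1:
            result = -1
        # "own_site" has no action and leaves the result unchanged
    return result


print(classify("Buy now at http://cheap.example.com"))
```

As in the production, a quoted own signature wins regardless of where it appears: once the result is 1, later links no longer reset it, and a result of -1 from an earlier link is overwritten.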