
HTMLText


The HTMLText parser copes with HTML code, with plain text, and with a mixture of the two. With one exception, it does not presuppose that the HTML code is well-formed, because under such an assumption the parser would frequently fail. The exception: if the token "<!DOCTYPE" is found, the parser assumes that a well-formed HTML section begins there.
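
The DOCTYPE rule can be illustrated outside of TextTransformer with a minimal Python sketch. It is not the project's grammar; the function name `split_at_doctype` and the exact, case-sensitive match on the token are assumptions made for illustration.

```python
DOCTYPE_TOKEN = "<!DOCTYPE"  # the token named in the text above


def split_at_doctype(text: str) -> tuple[str, str]:
    """Split text into (lenient_part, strict_part).

    lenient_part: everything before the first "<!DOCTYPE"; no assumption
                  about well-formed HTML is made for it.
    strict_part:  everything from "<!DOCTYPE" onwards; assumed to be
                  well-formed HTML. Empty if the token never occurs.
    """
    pos = text.find(DOCTYPE_TOKEN)
    if pos == -1:
        return text, ""
    return text[:pos], text[pos:]
```

A mail body without the token stays entirely in the lenient part, so a stray '<' in it is never forced into a tag interpretation.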

As long as well-formed HTML cannot be assumed, the free version of TextTransformer unfortunately cannot distinguish whether '<' or '>' is the beginning or end of an HTML tag or a less-than or greater-than sign. So the parser does not know whether it is inside or outside a tag. (With the standard version of TextTransformer, a look-ahead could be used to make this decision.)
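
To illustrate what a look-ahead buys here, the following Python sketch peeks at the characters after a '<' before deciding how to read it. It is not TextTransformer's look-ahead mechanism; the pattern `TAG_AHEAD` and the function `starts_tag` are hypothetical, and the heuristic is deliberately simple.

```python
import re

# Hypothetical heuristic: '<' opens a tag only if it is followed by an
# optional '/', a tag name starting with a letter, optional attribute
# text without further angle brackets, and a closing '>'.
TAG_AHEAD = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(\s[^<>]*)?/?>")


def starts_tag(text: str, pos: int) -> bool:
    """Return True if the '<' at text[pos] looks like the start of a tag."""
    return text[pos] == "<" and TAG_AHEAD.match(text, pos) is not None
```

Under this heuristic, `starts_tag("a < b", 2)` is false and `starts_tag("<b>bold</b>", 0)` is true. It can still be fooled, for example by "x<y and y>z", so it should be read as a sketch of the idea rather than a reliable test.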


To turn this project into a spam filter, it must be extended with test functions of your own. The TextToCheck production is provided as the place to call these functions:





| Link 


The TextToCheck production brings together the important text components: words, quotations, special characters, e-mail addresses and links.
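
How user-supplied test functions might hook into these five component classes can be sketched in Python. This is only an illustration under stated assumptions: the token patterns, the names `classify` and `spam_score`, and the idea of summing integer scores are all hypothetical and are not taken from the TextTransformer project.

```python
import re
from typing import Callable

# Hypothetical patterns for the five component classes listed above.
# Order matters: links and e-mail addresses are tried before plain words.
TOKEN_PATTERNS = [
    ("link", r"(?:https?://|www\.)\S+"),
    ("email", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    ("quotation", r'"[^"\n]*"'),
    ("word", r"\w+"),
    ("special", r"[^\w\s]"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_PATTERNS)
)


def classify(text: str) -> list[tuple[str, str]]:
    """Return (component_class, token) pairs in order of appearance."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


def spam_score(text: str, checks: dict[str, Callable[[str], int]]) -> int:
    """Call the user's check function for each token and sum the results.

    Component classes without a registered check contribute nothing.
    """
    return sum(
        checks[kind](token) for kind, token in classify(text) if kind in checks
    )


# Example test functions (assumptions): every link counts 2,
# the word "buy" counts 1.
example_checks = {
    "link": lambda token: 2,
    "word": lambda token: 1 if token.lower() == "buy" else 0,
}
```

Calling `spam_score('Buy "now" at www.example.com or mail a@b.com!', example_checks)` returns 3 with these example checks: 2 for the link and 1 for the word "Buy".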