Text2Html

The TextTransformer project "Text2Html" automatically transforms plain text files into HTML files. No special formatting statements are needed, so nobody can tell from the text that it is the source of an HTML document. Only the blanks in the text must be placed carefully: text sections have to be separated by blank lines, tables must be indented correctly and (new) hidden instructions can be coded with spaces.
The project is an example application for the TextTransformer and doesn't claim to treat all possible texts correctly. Incomplete, indented and nested structures may not be analyzed correctly. However, the project can be used for common texts: all pages of this web site were created with it. Everybody can carry out individual customizations and extensions with the TextTransformer program. To make this easier, the project is explained in detail below.

First, a generally intelligible description of how the text documents are analyzed is given.
A more specialized part, in which the technical details are explained, follows.

The project can be downloaded here: Text2Html.ttp
The source text for this HTML page is: Text2Html.txt

Plain text

Plain text is presumably still the most widespread way to store texts, mostly in the ASCII or ANSI character set. Shown on the screen, such texts look as if they had been typed on a typewriter. This is because the files contain nothing more than the binary codes of the characters which one can type on a typewriter. Plain text files contain no instructions for the use of different fonts and sizes or for the drawing of tables etc. Admittedly, almost every computer user now has complex word processing software which stores the text data in its own format. However, the simplicity and the independence from the software used are precisely the great advantages of plain text files. Texts are also much easier and thus faster to write if one doesn't have to pay attention to formatting.

HTML text

HTML is the file format in which web pages are stored. HTML pages look more attractive than plain texts, since headings are shown in bold characters, lists are indented and tables are framed.
HTML files are text files too. However, besides the pure text data they contain instructions for formatting the text.

Transformation of plain text files into HTML text

The transformation of a text into an HTML document is relatively simple if the text was prepared with word processing software: only the formatting instructions of the original text must be translated adequately. When plain text files are transformed into HTML text, however, one faces the problem that the original text contains no explicit formatting details.
In principle there are three solutions to this problem:

1. One adds special syntax elements to the original text which instruct a compiler how the HTML files shall look.

This procedure makes sense insofar as the complex possibilities which HTML offers can be reduced to a simplified syntax which suffices for one's own purposes. Wikipedia is an example of this. However, one would then have to learn this new syntax, one would be bound to it and one would disfigure the original document with it. This isn't the aim of this project. (The hidden instructions are an exception; see below.)

2. One prepares an HTML document as a simple copy of the original text, i.e. without formatting.

After all, this is already a first approach. Text files can often be shown in the browser directly, without any manipulation. Whoever would like to do better and has seen the source code of an HTML page before might think that the original text merely has to be enclosed in the tag pairs:

<HTML> </HTML> and <BODY> </BODY>:

<HTML> <BODY> original text </BODY> </HTML>

If you then store the new text with the extension ".html" instead of ".txt" and load it into a browser, you will notice, though, that all line breaks have been lost and that only the first of a sequence of blanks is shown. In addition, it's not guaranteed that special characters are represented correctly, and the characters by which the HTML tags are defined mix up the display completely.
Nevertheless, this possibility forms the basic framework for the HTML converter under item 3.

3. One develops a scheme to derive the formatting of the new text from the structure and contents of the original text.

This is the method used in the Text2Html project. Only where no formatting can be derived shall the text be represented as faithfully to the original as possible, as outlined under item 2.

But how can the formatting be derived from the text? The answer is already indicated under item 3: from structure and contents.

Text structure

You can recognize the structure of a text best if you look at it from a distance at which you can no longer read it. The structure results precisely from what is not text: from its gaps. Gaps result from line breaks, blank lines, blank characters and tabulators.

Initially, the Text2Html project takes the same perspective. A text shall be changed into an HTML document in such a way that the described structure not only remains unchanged but is strengthened. A chapter heading shall be shown somewhat larger than the other text and in boldface, and the regular pattern of a list or a table shall be accented by additional markings or lines.

Text content

A second criterion for the use of certain HTML elements arises from a closer analysis of the text content. For example, the underlining of links on an HTML page and the possibility of reaching another page by selecting a link are achieved in HTML by putting "<A HREF=" in front of the Internet address. Otherwise the browser which displays the HTML page is not able to recognize an Internet address as such. The TextTransformer is "more intelligent": it can recognize such addresses and prepare them for the browser.
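
As a rough illustration of this step, here is a small sketch in Python (not in the TextTransformer language). It is not the project's own link recognition: the pattern URL and the function link_addresses are assumptions made for this example, and the pattern is much simpler than a real address grammar.

```python
import re

# Assumed, simplified pattern: an address starts with http://, https:// or www.
URL = re.compile(r"(?:https?://|www\.)[^\s<>\"]+")

def link_addresses(text: str) -> str:
    """Wrap every recognized address in an <A HREF=...> element."""
    def wrap(match):
        address = match.group(0)
        # Trailing sentence punctuation is not part of the address.
        stripped = address.rstrip(".,;:!?")
        tail = address[len(stripped):]
        # Addresses written without a scheme get "http://" in the link target.
        target = stripped if stripped.startswith("http") else "http://" + stripped
        return f'<A HREF="{target}">{stripped}</A>{tail}'
    return URL.sub(wrap, text)
```

Text without an address passes through unchanged, so the function can be applied to every text part.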

Special characters

It was already mentioned above that special characters like 'ü' or 'é' are not guaranteed to be represented correctly. This depends on the respective browser and on the header of the HTML document. To ensure that these characters are shown correctly, they have to be replaced by their names. E.g. the name for the character 'ü' is "&uuml;" and the name for the character 'é' is "&eacute;". The German word for gap, "Lücke", then looks like this in the HTML text: "L&uuml;cke".
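
The replacement by names can be tried out with a few lines of Python. This sketch relies on the name table html.entities.codepoint2name from the Python standard library; the function to_entities is a name chosen for this example and is not part of the Text2Html project.

```python
from html.entities import codepoint2name

def to_entities(text: str) -> str:
    """Replace every non-ASCII character that has an HTML name by &name;."""
    result = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name:
            result.append(f"&{name};")
        else:
            # ASCII characters and characters without a name stay as they are.
            result.append(ch)
    return "".join(result)
```

With this, to_entities("Lücke") yields "L&uuml;cke".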

Parts of text

The complete text is considered as a sequence of parts of text, which are separated from each other by at least one blank line. Blank lines may contain blanks and tabulators, but no other characters.

Parts of text are headings, text sections, lists and tables. A part of text is only recognized if at least one empty line precedes it.
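
The division into parts of text can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the names BLANK_LINES and parts_of_text are made up for this example, and the special meaning of blank lines with exactly one or two blanks (the hidden instructions) is ignored here.

```python
import re

# A blank line may contain blanks and tabulators, but no other characters.
BLANK_LINES = re.compile(r"\n(?:[ \t]*\n)+")

def parts_of_text(text: str) -> list[str]:
    """Split a text into its parts at one or more blank lines."""
    text = text.replace("\r\n", "\n")
    # Leading blanks inside a part are kept, since indents matter for tables.
    return [part for part in BLANK_LINES.split(text) if part.strip()]
```

A single line break inside a part does not split it; only a line break followed by at least one blank line does.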

Headings

Headings are usually short, single-line texts. They have to be distinguished from other single-line texts by special features. A very prominent method, which was used in this document, is to underline headings. Another method would be to use only capital letters in headings. However, isolated, single, mostly short lines of text are frequently also read as headings when they don't end with a punctuation mark, since such a character usually marks the end of a sentence. In Text2Html a maximum title length of 75 characters is arbitrarily assumed. If an isolated line is shorter and doesn't end with a punctuation mark, it is interpreted as a heading.
The following kinds of headings can be seen in the browser in both forms: as they appear in HTML and in their original text form.
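
The rule for isolated lines can be written down in a few lines of Python. This is a sketch, not the project's "isTitle" production: the function name and the exact set of punctuation marks are assumptions, only the limit of 75 characters is taken from the text.

```python
MAX_TITLE_LENGTH = 75  # the arbitrary limit named in the text

def looks_like_heading(part: str) -> bool:
    """True for an isolated single line that is short and has no final punctuation."""
    lines = part.strip().splitlines()
    if len(lines) != 1:
        return False
    line = lines[0].strip()
    # Assumed set of punctuation marks that end a sentence or clause.
    return len(line) < MAX_TITLE_LENGTH and line[-1] not in ".:,;!?"
```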

This is not a heading.

First heading

A heading composed of normal text, not followed by a dot or a colon, looks in the plain text like:
First heading

SECOND HEADING

A capitalized heading looks in plain text like:
SECOND HEADING

Third Heading

A heading underlined with a single line looks in plain text like:

Third Heading
------------

Fourth Heading

A heading underlined with a double line looks in plain text like:

Fourth Heading
==============

Fifth Heading

A heading underlined with a line made of stars looks in plain text like:

Fifth Heading
*************

Sixth Heading 1

Sixth Heading 2

Sixth Heading 3

A heading enclosed in two lines looks in plain text like:

---------------
Sixth Heading 1
---------------

===============
Sixth Heading 2
===============

***************
Sixth Heading 3
***************
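
The underlined and enclosed forms shown above can be detected with a short check. The following Python sketch is an assumption about how such a test might look; the names RULE_CHARS, is_rule and ruled_heading do not come from the project.

```python
from typing import Optional

RULE_CHARS = "-=*"

def is_rule(line: str) -> bool:
    """True for a line that consists of one repeated rule character."""
    line = line.strip()
    return len(line) >= 3 and line[0] in RULE_CHARS and set(line) == {line[0]}

def ruled_heading(part: str) -> Optional[str]:
    """Return the title of an underlined or enclosed heading, else None."""
    lines = [line for line in part.splitlines() if line.strip()]
    if len(lines) == 2 and is_rule(lines[1]) and not is_rule(lines[0]):
        return lines[0].strip()          # title followed by a rule
    if (len(lines) == 3 and is_rule(lines[0]) and is_rule(lines[2])
            and not is_rule(lines[1])):
        return lines[1].strip()          # title between two rules
    return None
```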

Lists

Lists are consecutive text sections which are either numbered or each start with a hyphen. Numberings can begin with "1", "1.", "A" or "a".
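
A test for the start of a list might look like the following Python sketch. The pattern LIST_START and the function list_marker are assumptions made for this illustration and are not taken from the "isList" production.

```python
import re

# First entry of a list: a hyphen or the numbering start "1", "1.", "A" or "a",
# followed by at least one blank and the entry text.
LIST_START = re.compile(r"^[ \t]*(-|1\.?|A|a)[ \t]+\S")

def list_marker(first_line: str):
    """Return the marker that opens a list, or None if there is none."""
    match = LIST_START.match(first_line)
    return match.group(1) if match else None
```

Note that an ordinary sentence beginning with the word "A" also passes this test, so a real implementation would have to check in addition that the next entry continues the numbering.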

A list entry can extend over several lines. This is difficult to achieve in the plain text document, since a certain number of blanks must be inserted at the beginning of each continuation line.
list entry
list entry

A list entry can extend over several lines and it can contain the characters '1', "1.", 'A' and 'a', as long as a new line of text doesn't start with them. Nested lists aren't treated correctly in the current version of Text2Html.
list entry
list entry

list entry
list entry
list entry

Tables

The recognition and translation of tables is the most difficult part of the project, since tables vary widely: they may or may not have a table heading, single cells may remain empty, and the columns may be separated from each other by blanks or tabulators. There are other possibilities too. In principle, separate recognition and translation routines would have to be written for every possible table type.

The Text2Html project cannot recognize every conceivable table. Here are a couple of examples of tables which are recognized and translated:

1. "Perfect" table with header, not indented

First	Second	Third
1	2	3
4	5	6

2. "Perfect" table without header, not indented

First	Second	Third
1	2	3
4	5	6

3. "Perfect", indented table

    First	Second	Third
    1	2	3
    4	5	6

4. Indented table with empty fields and a space instead of a tabulator at one place

Type	Range of values	Default value (*)
bool	true/false (resp. 1/0)	false (resp. 0)
char	0-255	'\0'
int	-32768 - +32767	0
unsigned int	0 - 65565	0
double	-1.7E+308 - +1.7E+308 (15 digits)	0.0
str		""
node		node::npos
vector		empty
map		empty
cursor		--
function table		empty

Not only a "real" tabulator character '\t' is regarded as a tabulator, but also the character '|' or a sequence of blanks. Unfortunately, one cannot assume that a cell contains only contiguous characters and that the cells are correctly separated from each other by tabulators. In practice it frequently happens that, for reasons of space, two cells are separated by a single blank. The Text2Html project recognizes tables not only when they are constructed perfectly; it also tries to cope with deviations from the ideal structure. Nevertheless, certain minimum requirements must be met.

1. Every table line must contain at least two pieces of text which are separated by a tabulator. Optionally, the second line can also be a line which separates the table head from the rest of the table.

2. The columns must start at exactly the same positions in every line. However, not every cell must have content; only the first column must show a value in every line. Single cells can also be separated by a single blank, but no "real" tabulator '\t' may occur within a cell. It is important that there is at least one table line in which all fields are separated by an unambiguous tabulator.

Hidden instructions

At the beginning it was said that the texts shall not be provided with special formatting instructions before their automatic conversion to HTML. But the Text2Html project makes one exception: instructions can be hidden in the text by means of blanks.

Blanks at the end of a line or in a line without other characters are invisible in the text, so you can write them into the text without changing its appearance. So far only two uses are made of this possibility in the Text2Html project; perhaps there will be more such hidden instructions in the future. (A piece of advice: remove all superfluous blanks now from texts which shall also be processed with later versions of Text2Html.)

An empty line with exactly one white space instructs the program not to interpret the following line as a heading or as the beginning of a list or a table.

For example, it happens now and then that a line of text satisfies the criterion for a heading but isn't a heading. The following line is interpreted as a heading:

<HTML> <BODY> Original-Text </BODY> </HTML>

It is interpreted as simple text if an empty line with one white space is put in front of it:

<HTML> <BODY> Original-Text </BODY> </HTML>

An empty line with exactly two white spaces has a similar effect to the previous instruction. In addition, all white spaces in the following lines are printed, the special characters aren't translated and the boldface attribute is ignored. (For HTML experts: the following text is enclosed in the tags <pre> and </pre>.)

Compare the results for the same text:

A text with two white spaces in front looks exactly as it does in the original text:

E n g l i s h :  Hello  world 
G e r m a n :    Hallo  Welt

With one white space in front, the text is interpreted as a normal paragraph and boldface is recognized:

English: Hello world
German: Hallo Welt

Without white spaces in front, the text is interpreted as a table:

E n g l i s h :	Hello	world
G e r m a n :	Hallo	Welt
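
The two hidden instructions amount to counting the blanks in an otherwise empty line. A minimal Python sketch, in which the function name and the returned labels are assumptions made for this example:

```python
def hidden_instruction(blank_line: str) -> str:
    """Classify a blank line by the number of blanks it contains."""
    content = blank_line.rstrip("\r\n")
    if content == " ":
        return "plain"         # following part: no heading, list or table
    if content == "  ":
        return "preformatted"  # following part goes between <pre> and </pre>
    return "none"              # an ordinary blank line
```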

REMARKS TO THE IMPLEMENTATION

Remark: you can only understand the following notes if you already have some experience with the TextTransformer. The "Text2Html" project presents advanced techniques of the TextTransformer, which are only needed if the input is weakly structured.

The TextTransformer supports the development of efficient LL(1) parsers. The freely arranged text which shall be analyzed here doesn't satisfy the conditions of an LL(1) analysis. The extended abilities of the TextTransformer,

the use of productions for a look-ahead and
the possibility of calling sub-parsers,

enable it to cope with the task nevertheless. However, these means should be used as sparingly as possible, since they reduce the execution speed of the translation.

Project options

Since the analysis of the text requires the recognition of blank lines and indents, blanks may not be ignored. The standard setting of the project options is therefore modified so that no character is ignored.
Remark: In a number of productions (see below) this change is undone again by local options.

Start production

At the beginning and the end of the start production "Text2Html", the HTML frame is written with the functions "HtmlBegin" and "HtmlEnd". The core of the production divides the source text into alternating sections of blank lines and text.

An empty line is defined as:

EMPTY_LINE ::= (\r?\n){2,}   \ // linebreaks
               (          \
               [ \t]{3,}  \ // three or more whitespaces
               (\r?\n|\z) \
               )*

That means, EMPTY_LINE consists of at least two line breaks.

Single line breaks can occur within a text section (PartOfText) too. The end of a text section is clearly marked by "EMPTY_LINE", since this token matches a longer piece of text than a single line break. Its recognition is therefore preferred according to the preference rules of the TextTransformer. But text sections can be separated by an odd number of line breaks as well as by an even number. So a single line break must also be able to follow the double line break before a new text section starts.

EOL ::= \r?\n
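
To experiment with these two tokens outside the TextTransformer, they can be transcribed into Python regular expressions. This is a sketch under one assumption: the end-of-input anchor \z of the token definition is written as \Z, which is the spelling Python uses.

```python
import re

EMPTY_LINE = re.compile(r"""
    (?:\r?\n){2,}        # two or more line breaks
    (?:
        [ \t]{3,}        # three or more blanks or tabulators
        (?:\r?\n|\Z)     # followed by a line break or the end of the input
    )*
""", re.VERBOSE)

EOL = re.compile(r"\r?\n")
```

A line with only one or two blanks is not swallowed by this pattern: EMPTY_LINE.match("\n\n  \nText") stops after the two line breaks, so such a line remains in the text behind the match.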

PartOfText

After the empty lines, the program tests by a look-ahead which HTML element (Title, List, Table, Text) follows next.
The following table has a column for each HTML element. The first row names the productions which are used for the look-ahead for the corresponding element, and the second row lists the productions by which the text is further analyzed if the look-ahead was successful.

--	Title	List	Table	Text
look-ahead	isTitle	isList	isTable	--
production	Title	List	Table	Paragraph

It would be possible to use the same production for the look-ahead which is then responsible for processing the text. The look-ahead productions defined here reduce the look-ahead to the necessary minimum.

Since two text components can be separated by a single blank line, PartOfText may not consume any following line break. Productions which are used for the look-ahead may, however, test for the existence of a following blank line, since look-ahead productions don't consume any text.

Recognition of tables

a) summary of the recognition algorithm

First, a relatively general description of tables is used to test whether a table follows in the text. The production "isTable" checks whether there are at least two consecutive lines in the text which show a table-like structure. If this is the case, the complete table section is read and broken into rows. These rows are in turn broken down into text and non-text parts (the gaps). This yields a tree structure in which the rows of the table are the branches. If all rows/branches have a similar composition of text and gaps, the preparation of the HTML table is no problem. The method "IsPerfectTable" checks whether this is the case. If it isn't a "perfect" table, the tree must be analyzed more closely to try to determine the number of columns and the column positions.
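
The idea of comparing the composition of the rows can be sketched in Python. The functions row_shape and is_perfect_table below are hypothetical stand-ins, not the project's "IsPerfectTable" method; in particular, "similar composition" is simplified here to "exactly the same sequence of text and gap pieces".

```python
import re

# Same alternatives as the TABULATOR token: tabulators, '|' or two or more blanks.
TABULATOR = re.compile(r"\t+|\|+|[ ]{2,}")

def row_shape(row: str) -> list[str]:
    """Break a row into alternating 'text' and 'gap' pieces."""
    shape, pos = [], 0
    for gap in TABULATOR.finditer(row):
        if gap.start() > pos:
            shape.append("text")
        shape.append("gap")
        pos = gap.end()
    if pos < len(row):
        shape.append("text")
    return shape

def is_perfect_table(rows: list[str]) -> bool:
    """Simplified test: at least two rows, all with the same shape."""
    shapes = [row_shape(row) for row in rows]
    return len(rows) >= 2 and all(shape == shapes[0] for shape in shapes)
```

A row with an empty cell, such as "str", two tabulators and a value, has fewer pieces than a full row, so a table containing it is not "perfect" in this simplified sense.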

b) Tabulators

The columns of a table are separated from each other by tabulators. Here not only a tabulator character is regarded as a tabulator, but also the character '|' or a sequence of blanks. The token TABULATOR is defined exactly as follows:

TABULATOR ::= \t+|\|+|[ ]{2,}
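
The effect of this definition can be tried out in Python, whose regular expressions accept the same pattern. The function split_cells is an assumption made for this illustration.

```python
import re

TABULATOR = re.compile(r"\t+|\|+|[ ]{2,}")

def split_cells(row: str) -> list[str]:
    """Split a table row at every TABULATOR and trim the cells."""
    return [cell.strip() for cell in TABULATOR.split(row.strip())]
```

Because several tabulator characters in a row count as one TABULATOR, an empty cell disappears in such a simple split. This is one reason why the column positions have to be taken into account as well.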

Unfortunately, one cannot assume that table cells contain only contiguous characters and that the cells are correctly separated from each other by tabulators. In practice it frequently happens that, for reasons of space, two cells are separated by a single white space. Since single white spaces are also permitted within a table cell, the meaning of a white space depends on the respective context.

Recall the minimum requirements for tables mentioned above.

NSpace / NSpaceNPm

In the productions for the HTML elements, the text is at first taken to pieces only roughly: into the gaps and the "filled" places. The text parts which don't contain white space are recognized by the production "NSpace" (not space). "NSpaceNPm" (not space, not punctuation mark) is a variant of "NSpace" which is used to parse titles, where no punctuation marks are allowed.
The purpose of "NSpace" shall be explained using the example of the table parser.

Table parser

The structure specification for tables above doesn't permit designing a parser which directly recognizes the individual cells as such. Rather, the table rows are regarded as alternating sections of tabulators or blanks and of text which doesn't consist of tabulators or blanks.
If the table is "perfect" (see above), a row with a maximum number of "real" tabulators is chosen. This row is then taken as a scale for the other rows, for which it is assumed that their cells start at the line positions which immediately follow the tabulators in the sample row.
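
How a sample row can serve as a scale is shown by the following Python sketch. It rests on assumptions which are not taken from the project: the columns are aligned with blanks (not with tabulator characters), so that character positions are comparable between rows, and the functions column_starts and cut_row are invented for this example.

```python
import re

TABULATOR = re.compile(r"\t+|\|+|[ ]{2,}")

def column_starts(rows: list[str]) -> list[int]:
    """Column positions of the row with the most tabulators (the sample row)."""
    def starts(row: str) -> list[int]:
        body = row.rstrip()
        first = len(body) - len(body.lstrip())   # skip the indent
        return [first] + [m.end() for m in TABULATOR.finditer(body, first)]
    return max((starts(row) for row in rows), key=len)

def cut_row(row: str, starts: list[int]) -> list[str]:
    """Cut a row at the column positions of the sample row."""
    ends = starts[1:] + [len(row)]
    return [row[a:b].strip() for a, b in zip(starts, ends)]
```

Cutting at fixed positions copes both with an empty cell and with two cells which are separated by a single blank only.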

PrintNSpace

Before the non-empty text parts are output, they are taken to pieces once more by the sub-parser "PrintNSpace". The text is output in "PrintNSpace", but different text parts are treated differently. "PrintNSpace" consists of a number of alternatives, e.g. "Link" for the recognition of links and BLOCK for the recognition of spaced characters. "Link" creates HTML links and in the action of the BLOCK token the text is output justified. In the other alternatives the text is either output unchanged or the translation of the special characters mentioned above is carried out (in the actions of the tokens "HTML_SPECIAL_CHAR" and "HTML_KEY_CHAR").

Spaced characters

Spaced characters, i.e. the insertion of one blank between all the letters of a word, are translated as bold. Special characters like the apostrophe or a comma, which can appear within the spaced characters, are a problem: they aren't separated from the previous letter by a white space. In addition, a sentence can end with spaced characters, so a full stop, an exclamation mark or a question mark can immediately follow the last bold letter. In English, moreover, there is the word "a", which is always surrounded by white spaces.
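
The basic case can be sketched in Python. This is not the BLOCK token of the project: the pattern SPACED and the function bold_spaced are assumptions, and the sketch deliberately ignores the problem cases named above (punctuation attached to a letter) and would also turn a spaced formula like "x = a + b" into bold text.

```python
import re

# At least three single non-blank characters, separated by exactly one blank.
SPACED = re.compile(r"(?<!\S)\S(?: \S){2,}(?!\S)")

def bold_spaced(text: str) -> str:
    """Close up spaced characters and mark them as bold."""
    return SPACED.sub(
        lambda m: "<B>" + m.group(0).replace(" ", "") + "</B>", text)
```

Requiring at least three characters keeps an isolated word "a" in ordinary text from being treated as spaced.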

Translation of the special characters

For the translation of the special characters it would be possible to define a separate token for each of them, connected with an action for the output of its name, similar to the Atari example in the installation package of the TextTransformer. The Text2Html project takes another route: all special characters are combined into a set, and this set is used for the definition of a token:

HTML_SPECIAL_CHAR ::= [-\xa0-¿\"#$&'*+/<=>\\\^_|×÷]

Besides this, there is a further token for the characters which have a key meaning in HTML:

HTML_KEY_CHAR ::= [<>&"]

A table is defined on the interpreter page as an "mstrstr", which connects every character with its name. E.g.:

m_mSpecialChar["ü"] = "&uuml;";

If one of the characters from the set is recognized, it is translated by:

out << m_mSpecialChar[State.str()];
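
The same combination of a character set and a lookup table can be imitated in Python. The table SPECIAL_CHAR below is a hypothetical excerpt with six entries chosen for this example, not the project's table, and the names TOKEN and translate are assumptions as well.

```python
import re

# Hypothetical excerpt of the name table.
SPECIAL_CHAR = {
    "ü": "&uuml;", "é": "&eacute;",
    "<": "&lt;", ">": "&gt;", "&": "&amp;", '"': "&quot;",
}

# One token for the whole set of characters, built from the keys of the table.
TOKEN = re.compile("[" + re.escape("".join(SPECIAL_CHAR)) + "]")

def translate(text: str) -> str:
    """Replace every character of the set by its name from the table."""
    return TOKEN.sub(lambda m: SPECIAL_CHAR[m.group(0)], text)
```

Since the text is scanned only once, the '&' of a name which has just been written is not translated again.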

Remark: The table contains almost 100 translations, but of course it was produced from an available table by means of a short TextTransformer program, just like the character list. This took only 5 minutes.

Style and key words

When the HTML frame is written, the functions "PrintStyle" and "PrintKeyWords" are called. "PrintStyle" determines the font sizes and types used. "PrintKeyWords" produces an (invisible) list of key words for search engines. Every user should either adapt these functions to his own needs or comment them out in the following way:

//PrintStyle();
//PrintKeyWords();

End

I wish you a lot of fun and success with the Text2Html transformer. If somebody improves it, extends it or modifies it in any other way, I would be happy if he made the result available to other visitors of this home page too.

Last update: 22.12.07
