Free TextTransformer Projects : Wikipedia

Wikipedia

Wikipedia (http://www.wikipedia.org) is an online encyclopedia that consists of thousands of HTML pages. The remarkable thing about it is that every user can edit and extend the pages without any knowledge of HTML. Only a few very simple formatting rules are needed.

Possibilities of application

With the project presented here, web pages can be produced with the same rules, independently of Wikipedia, for example to write Wikipedia articles offline or to build a home page of one's own.

Restrictions

Although the basic syntax of Wikipedia is very simple, an encyclopedia of this size cannot be managed without many special instructions as well. For example, there are mutual links between hundreds of languages, categories and so on. These features are outside the scope of the project introduced here. Although the aim is to parse Wikipedia scripts as completely as possible, the processing of the parsed text does not go beyond what is required for a private home page.

Another problem is that the Wikipedia syntax is described in detail with many examples but is nevertheless not precisely defined, a disadvantage that arises from the fact that the Wikipedia developers did not use a parser generator. Sometimes you can find out what works and what does not only by trial and error. The TextTransformer project therefore remains a "work in progress". At its current stage the project can process an extensive manual page (which may no longer exist) in which all syntax elements are collected.

Implementation

There are primarily two types of Wikipedia elements:

1. Elements beginning with a repeated character.

Examples:

'' italic
[[ link
{{ variable
== section

2. Elements beginning at the start of a line.

Examples:

: indentation
* list entry
|- table row

To extract the actual text content between these elements, the text has to be recognized by a token that excludes the element markers. The token TEXT is therefore defined as

TEXT ::=
[h'"\[\]{}|!=<&*#;-] \// Characters, by which other tokens can begin
|[^\r\nh'"\[\]{}|!=<&*#;-]+ // all other characters can be arbitrarily repeated

TEXT recognizes only a single character if that character can also start a Wikipedia element. The elements themselves are still recognized, because longer matches are preferred to shorter ones. All other characters are recognized by the TEXT token in arbitrarily long runs. This makes the analysis considerably faster than extracting the text one character at a time.
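
The interplay between TEXT and the element tokens can be sketched in Python. This is only an illustration under assumptions, not the TextTransformer implementation: the token names (ITALIC, LINK_OPEN and so on), the small set of element tokens and the function tokenize are hypothetical; only the TEXT expression is taken from the definition above.

```python
import re

# Illustrative subset of element tokens; the names are hypothetical.
ELEMENTS = [
    ("ITALIC", r"''"),
    ("LINK_OPEN", r"\[\["),
    ("LINK_CLOSE", r"\]\]"),
    ("VAR_OPEN", r"\{\{"),
    ("SECTION", r"=="),
]

# TEXT as defined above: a single "special" character,
# or a run of characters that cannot start another token.
TEXT = r"""[h'"\[\]{}|!=<&*#;-]|[^\r\nh'"\[\]{}|!=<&*#;-]+"""

TOKENS = [(name, re.compile(rx)) for name, rx in ELEMENTS + [("TEXT", TEXT)]]


def tokenize(line):
    """Split one line into (token name, matched text) pairs.

    At every position all tokens are tried and the longest match wins,
    so "''" is recognized as ITALIC rather than as two TEXT tokens.
    """
    pos, result = 0, []
    while pos < len(line):
        best_name, best_match = None, None
        for name, rx in TOKENS:
            m = rx.match(line, pos)
            if m and (best_match is None or m.end() > best_match.end()):
                best_name, best_match = name, m
        if best_match is None:
            raise ValueError("no token matches at position %d" % pos)
        result.append((best_name, best_match.group()))
        pos = best_match.end()
    return result


print(tokenize("a ''b'' [[c]]"))
```

A single apostrophe, as in "it's", is still accepted: no element token matches there, so TEXT consumes it as one character.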

Line breaks are not recognized by TEXT. They do not have any influence on the layout of the text in Wikipedia either, except as an empty line separating two paragraphs.

The tokens of the elements which must be at the beginning of a line are defined by putting

(\r?\n)?\r?\n

in front of them. They are preferred to line breaks or empty lines, as their matches are longer.
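
A minimal Python sketch of this effect follows. It rests on assumptions: the token names LINE_BREAK, EMPTY_LINE, INDENT, LIST_ENTRY and TABLE_ROW are hypothetical, and TEXT is replaced by a simplified stand-in that matches anything up to the end of the line. Only the line-start prefix is taken from the text above.

```python
import re

BOL = r"(\r?\n)?\r?\n"  # the line-start prefix shown above

TOKENS = [(name, re.compile(rx)) for name, rx in [
    ("LINE_BREAK", r"\r?\n"),
    ("EMPTY_LINE", r"\r?\n\r?\n"),
    ("INDENT", BOL + r":"),
    ("LIST_ENTRY", BOL + r"\*"),
    ("TABLE_ROW", BOL + r"\|-"),
    ("TEXT", r"[^\r\n]+"),  # simplified stand-in for the real TEXT token
]]


def tokenize(text):
    """Return (token name, matched text) pairs, longest match first."""
    pos, result = 0, []
    while pos < len(text):
        best_name, best_match = None, None
        for name, rx in TOKENS:
            m = rx.match(text, pos)
            if m and (best_match is None or m.end() > best_match.end()):
                best_name, best_match = name, m
        if best_match is None:
            raise ValueError("no token matches at position %d" % pos)
        result.append((best_name, best_match.group()))
        pos = best_match.end()
    return result


for name, value in tokenize("intro\n* one\n\n* two\nplain"):
    print(name, repr(value))
```

In this sketch the line break before "* one" and the empty line before "* two" are both absorbed into LIST_ENTRY, because the prefixed token yields the longer match; only the break before "plain" remains a LINE_BREAK.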

Start-rules

With the start rule Wikipedia, Wikipedia scripts are transformed into simple HTML pages.
With the start rule Homepage, the HTML code is embedded in the HTML template "Homepage.tmpl", so that it appears in a frame for a complete web site.
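
The difference between the two start rules can be sketched in Python as follows. This is an assumption-laden illustration: the contents of "Homepage.tmpl" are not shown on this page, so the template text, its placeholders $title and $content, and the functions render_wikipedia and render_homepage are all hypothetical.

```python
from string import Template

# Hypothetical stand-in for "Homepage.tmpl"; the real template and its
# placeholder syntax are not shown here.
HOMEPAGE_TMPL = Template("""<html>
<head><title>$title</title></head>
<body>
<div id="content">
$content
</div>
</body>
</html>""")


def render_wikipedia(body_html):
    """Like the start rule Wikipedia: a bare, simple HTML page."""
    return "<html><body>\n" + body_html + "\n</body></html>"


def render_homepage(title, body_html):
    """Like the start rule Homepage: the same HTML inside a site frame."""
    return HOMEPAGE_TMPL.substitute(title=title, content=body_html)


converted = "<p><i>italic</i> text</p>"  # output of the wiki conversion
print(render_wikipedia(converted))
print(render_homepage("My home page", converted))
```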

Links

You can download the project here: Wikipedia.ttp
The text for this page is: Wikipedia.txt
An example, which can be converted: How_to_edit_a_page.txt.
Remark: If you save the target text with the extension ".html" and then double-click the file, it will be shown in your browser.
The converted document: How_to_edit_a_page.html

The project will be improved.

Last Update: 14.06.06
