<br>); however, if text on a webpage is separated by horizontal blank-line images (auto-wrapped without using any <br>), a long webpage containing several thousand words can be translated.
Google Translate, like other automatic translation tools, has its limitations. While it can help the reader understand the general content of a foreign-language text, it does not deliver accurate translations and does not produce publication-standard content. For example, it often translates words out of context, and it deliberately applies no grammatical rules, since its algorithms are based on statistical analysis rather than traditional rule-based analysis.
Google Translate is based on an approach called statistical machine translation, and more specifically on research by Franz-Josef Och, who won the DARPA contest for speed machine translation in 2003. Och is now the head of Google's machine translation department.
According to Och, a solid base for developing a usable statistical machine translation system for a new pair of languages from scratch would consist of a bilingual text corpus (or parallel collection) of more than a million words, and two monolingual corpora of more than a billion words each. Statistical models derived from this data are then used to translate between those languages.
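The division of labour between the two kinds of corpus can be illustrated with a toy sketch. It is not Google's system: the corpora below are invented, and the specific techniques (IBM Model 1 style expectation-maximization for word translation probabilities, and an add-one smoothed bigram language model) are assumptions chosen for illustration. The names `train_translation_model`, `train_bigram_model` and `translate` are hypothetical. The parallel corpus teaches the model which words correspond, and the monolingual corpus teaches it which word order looks fluent.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations

# Toy corpora invented for this sketch (source: German, target: English).
PARALLEL = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]
MONOLINGUAL = ["the house", "the book", "a book", "a house"]


def train_translation_model(pairs, iterations=10):
    """Estimate t(source_word | target_word) with IBM Model 1 style EM.

    Simplified: no NULL word and no convergence check.
    """
    pairs = [(src.split(), tgt.split()) for src, tgt in pairs]
    t = defaultdict(lambda: 1.0)  # uniform start; only ratios matter
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in pairs:
            for s in src:
                # E-step: share one unit of credit for s among the target words.
                norm = sum(t[(s, e)] for e in tgt)
                for e in tgt:
                    frac = t[(s, e)] / norm
                    count[(s, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts per target word.
        t = defaultdict(float, {(s, e): c / total[e] for (s, e), c in count.items()})
    return t


def train_bigram_model(sentences):
    """Return a function giving the add-one smoothed bigram log-probability."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        for prev, cur in zip(words, words[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1

    def log_prob(words):
        padded = ["<s>"] + list(words) + ["</s>"]
        return sum(
            math.log((bigrams[(p, c)] + 1) / (unigrams[p] + len(vocab)))
            for p, c in zip(padded, padded[1:])
        )

    return log_prob


def translate(sentence, t, log_prob, target_vocab):
    """Pick the likeliest target word per source word, then let the
    language model choose the most fluent ordering of those words."""
    chosen = [max(target_vocab, key=lambda e: t[(s, e)]) for s in sentence.split()]
    return " ".join(max(permutations(chosen), key=log_prob))


t = train_translation_model(PARALLEL)
log_prob = train_bigram_model(MONOLINGUAL)
vocab = sorted({w for _, tgt in PARALLEL for w in tgt.split()})
print(translate("ein haus", t, log_prob, vocab))
```

The sentence "ein haus" never appears in the toy parallel data. The sketch can still translate it because the word correspondences were learned from other sentence pairs. Trying every permutation is only feasible for very short sentences; a real system would need a far more efficient search.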
To acquire this huge amount of linguistic data, Google used United Nations documents. The same document is normally available in all six official UN languages, so Google now has a six-language corpus of 20 billion words' worth of human translations.
The availability of Arabic and Chinese as official UN languages is probably one of the reasons why Google Translate initially focused on the development of translation between English and those languages, and not, for example, Japanese and German, which are not official languages at the UN.
Google representatives have been very active at domestic conferences in the field in Japan, asking researchers to provide them with bilingual corpora.
The original beta version of Slovenian-English translation contained several pranks. For instance, Google Translate translated "Janša je lep" ("Janša is beautiful") into "Sanader is beautiful", "Slovenska obala" ("Slovenian coast") into "Croatian coast", "Ljubljana" into "rape", "Kranj" into "Miami" and "Koper" into "Chicago".