Google, Yahoo! BabelFish use math principles to translate documents online
Monday, February 21, 2011; 10:22 AM
Early one morning in 2007, Libby Casey was trying to do her laundry in a guesthouse in Reykjavik, Iceland. When she couldn't figure out how to use the washing machine, she opened up the instruction manual.
The guide was written in German, which Casey cannot read, so she typed bits of it into an Internet translation tool. "It occurs nobody endlschleudern, however, intercatapults" is one result she got. Stumped, she pressed some buttons and eventually managed to wash her clothes, in an elongated wash cycle that kept her pinned down for three hours.
Casey's quandary will come as no surprise to anyone who has tried to use a computer to translate text. For decades, machine translation was mostly useful if you were trying to be funny. But in the last few years, as anyone using Google Translate, Babel Fish or many other translation Web sites can tell you, things have changed dramatically. And all because of an effort begun in the 1980s to remove humans from the equation.
As the late Frederick Jelinek, who pioneered work on speech recognition at IBM in the 1970s, is widely quoted as saying: "Every time I fire a linguist, my translation improves." (He later denied putting it so harshly.)
Up to that point, researchers working on machine translation used linguistic models. By getting a computer to understand how a sentence worked grammatically in one language, the thought was, it would be possible to create a sentence meaning the same thing in another language. But the differing rules in different languages made it difficult.
Jelinek and his group at IBM argued that by using statistics and probability theory, instead of language rules, a computer could do a better job of converting one language into another. Translation, in their view, was as much a mathematical problem as a linguistic one.
The computer wouldn't understand the meaning of what it was translating, but by creating a huge database of words and sentences in different languages, the computer could be programmed to find the most common sentence constructions and alignment of words, and how these were likely to correspond between languages. (Warren Weaver, a mathematician at the Rockefeller Foundation, had first raised the idea of a statistical model for translation in a 1947 letter in which he wrote: "When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols.' ")
The IBM effort began with proceedings from the Canadian parliament, which were published in English and French. "A couple guys drove to Canada and left with two suitcases full of tapes that contained the proceedings," says Daniel Marcu, co-founder of Language Weaver, which in 2002 became the first start-up to use the new statistical techniques.
Jelinek's group began by using a computer to automatically align sentences in the French and English versions of the parliamentary documents. It did this by pairing sentences from the same point in the proceedings that were of roughly equal lengths. If an opening sentence in English was 20 words long but the French opening was two sentences of about 10 words, the computer would pair the English sentence with the two French ones. The IBM researchers then used statistical methods and deductions to identify sentence structures and groups of words that were most common in the paired sentences.
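The length-based pairing described above can be sketched in a few lines of Python. This is a simplified illustration, not the IBM group's actual program: the function name `align_by_length`, the `merge_penalty` parameter and the sample sentences are all assumptions made for the example. It considers only three groupings (one sentence to one, one to two, two to one) and picks the combination whose word counts match most closely.

```python
def align_by_length(src, tgt, merge_penalty=1):
    """Pair sentences from two texts by comparing word counts.

    Hypothetical sketch: allowed groupings are 1-1, 1-2 and 2-1.
    Returns a list of (src_sentences, tgt_sentences) tuples.
    """
    def words(sentences):
        return sum(len(s.split()) for s in sentences)

    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j] = cheapest way to pair the first i src and first j tgt sentences
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                # Penalize length mismatch, plus a small charge for merging
                mismatch = abs(words(src[i:ni]) - words(tgt[j:nj]))
                step = mismatch + (merge_penalty if di + dj > 2 else 0)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)

    if cost[n][m] == inf:
        raise ValueError("texts cannot be paired with 1-1, 1-2 or 2-1 groupings")

    # Walk the back-pointers from the end to recover the pairing
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs


english = [
    "The house will now resume debate on the budget and hear from members",
    "Thank you",
]
french = [
    "La chambre reprend le débat sur le budget",
    "Elle entendra maintenant les députés",
    "Merci",
]
for en, fr in align_by_length(english, french):
    print(en, "<->", fr)
```

In this made-up sample, the 13-word English sentence is paired with the two French sentences that together total 13 words, much as in the 20-word example above. A real system would need a more careful measure of mismatch than a raw difference in word counts.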
As researchers got hold of more documents and translations of them in different languages, the database of common words and groups of words grew, providing increasing accuracy and nuance. This is the essence of the system today.
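To give a feel for how a computer can learn word correspondences from paired sentences without understanding them, here is a toy Python sketch. It uses a textbook estimation procedure in the style of what the research literature calls IBM Model 1; the article does not say exactly which calculations Jelinek's group ran, so treat the method, the function names `train_word_table` and `best_translation`, and the three sample sentence pairs as illustrative assumptions.

```python
from collections import defaultdict


def train_word_table(pairs, iterations=10):
    """Estimate how likely each French word is to translate each English word.

    Hypothetical sketch: repeated re-estimation (expectation-maximization)
    over (english_sentence, french_sentence) pairs. Returns a dict keyed by
    (french_word, english_word).
    """
    pairs = [(e.lower().split(), f.lower().split()) for e, f in pairs]
    french_vocab = {f for _, fs in pairs for f in fs}
    # Start with no knowledge: every pairing is equally likely
    t = defaultdict(lambda: 1.0 / len(french_vocab))

    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for es, fs in pairs:
            for f in fs:
                # Share one "vote" for f among the English words in the pair,
                # in proportion to the current estimates
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    share = t[(f, e)] / norm
                    count[(f, e)] += share
                    total[e] += share
        # Turn the accumulated votes into new probabilities
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


def best_translation(t, english_word):
    """Return the French word with the highest estimated probability."""
    candidates = {f: p for (f, e), p in t.items() if e == english_word}
    return max(candidates, key=candidates.get)


sentence_pairs = [
    ("the house", "la maison"),
    ("the flower", "la fleur"),
    ("a flower", "une fleur"),
]
table = train_word_table(sentence_pairs)
for word in ("the", "house", "flower", "a"):
    print(word, "->", best_translation(table, word))
```

With only three sentence pairs, the program settles on "maison" for "house" and "fleur" for "flower" simply because those words keep turning up together. More paired text sharpens the estimates in the same way, which is the point the paragraph above makes about growing databases.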
Although the IBM group's initiative began more than 20 years ago, progress took time on three fronts. Computer scientists at IBM and elsewhere had to refine the techniques. Computers had to become powerful enough to manage the complexity of the many linguistic probabilities (such as multiword phrases and idioms). And databases had to grow large enough - billions of words in various languages - to provide translations nuanced enough to be usable. Producing usable translations is easier when dealing with closely related languages, such as French and Spanish, and with languages that have lots of translated documents with which to build a database. European languages do well in computer translations in part because the workings of the European Union must be published in the 23 "official and working languages" of the EU; these documents can then be used as raw data for researchers.
A major step in computer translation occurred in 2007 - around the time that Libby Casey was struggling with those Reykjavik washer instructions - when Google introduced the first free, statistically based translation software. (Other Web-based translation programs were still using the older linguistic rule-based systems.)