Google, Yahoo! Babel Fish use math principles to translate documents online

By Konstantin Kakaes
Special to The Washington Post
Monday, February 21, 2011; 10:22 AM

Early one morning in 2007, Libby Casey was trying to do her laundry in a guesthouse in Reykjavik, Iceland. When she couldn't figure out how to use the washing machine, she opened up the instruction manual.

The guide was written in German, which Casey cannot read, so she typed bits of it into an Internet translation tool. "It occurs nobody endlschleudern, however, intercatapults" is one result she got. Stumped, she pressed some buttons and eventually managed to wash her clothes, in an elongated wash cycle that kept her pinned down for three hours.

Casey's quandary will come as no surprise to anyone who has tried to use a computer to translate things. For decades, machine translation was mostly useful if you were trying to be funny. But in the last few years, as anyone using Google Translate, Babel Fish or many other translation Web sites can tell you, things have changed dramatically. And all because of an effort begun in the 1980s to remove humans from the equation.

As the late Frederick Jelinek, who pioneered work on speech recognition at IBM in the 1970s, is widely quoted as saying: "Every time I fire a linguist, my translation improves." (He later denied putting it so harshly.)

Up to that point, researchers working on machine translation used linguistic models. By getting a computer to understand how a sentence worked grammatically in one language, the thought was, it would be possible to create a sentence meaning the same thing in another language. But the differing rules in different languages made it difficult.

Jelinek and his group at IBM argued that by using statistics and probability theory, instead of language rules, a computer could do a better job of converting one language into another. Translation, they basically argued, was as much a mathematical problem as a linguistic one.

The computer wouldn't understand the meaning of what it was translating, but by creating a huge database of words and sentences in different languages, the computer could be programmed to find the most common sentence constructions and alignment of words, and how these were likely to correspond between languages. (Warren Weaver, a mathematician at the Rockefeller Foundation, had first raised the idea of a statistical model for translation in a 1947 letter in which he wrote: "When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols.' ")

The IBM effort began with proceedings from the Canadian parliament, which were published in English and French. "A couple guys drove to Canada and left with two suitcases full of tapes that contained the proceedings," says Daniel Marcu, co-founder of Language Weaver, which in 2002 became the first start-up to use the new statistical techniques.

Jelinek's group began by using a computer to automatically align sentences in the French and English versions of the parliamentary documents. It did this by pairing sentences from the same point in the proceedings that were of roughly equal lengths. If an opening sentence in English was 20 words long but the French opening was two sentences of about 10 words, the computer would pair the English sentence with the two French ones. The IBM researchers then used statistical methods and deductions to identify sentence structures and groups of words that were most common in the paired sentences.
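The length-based pairing described above can be illustrated with a short Python sketch. It is a simplified illustration, not the IBM group's actual code: the function name align_by_length, the choice to allow only one-to-one, one-to-two and two-to-one pairings, and the use of the raw word-count difference as the penalty are all assumptions made for this example.

```python
def align_by_length(src, tgt):
    """Pair sentences from two lists so that paired groups have similar word counts.

    Hypothetical helper for illustration. Returns a list of
    (source_group, target_group) tuples, each group a list of sentences.
    """
    def n_words(sentences):
        return sum(len(s.split()) for s in sentences)

    # Allowed groupings (source sentences, target sentences): an assumption.
    beads = [(1, 1), (1, 2), (2, 1)]
    inf = float("inf")
    n, m = len(src), len(tgt)

    # cost[i][j]: lowest total penalty for pairing the first i source
    # sentences with the first j target sentences.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                # Penalty: difference in word counts between the two groups.
                penalty = abs(n_words(src[i:ni]) - n_words(tgt[j:nj]))
                if cost[i][j] + penalty < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + penalty
                    back[ni][nj] = (i, j)

    if cost[n][m] == inf:
        raise ValueError("no alignment possible with the allowed groupings")

    # Walk the back-pointers from the end to recover the pairing.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    # Placeholder "sentences" built from dummy words, mirroring the
    # 20-word English opening and two 10-word French sentences above.
    english = [" ".join(["en"] * 20), " ".join(["en"] * 5)]
    french = [" ".join(["fr"] * 10), " ".join(["fr"] * 10), " ".join(["fr"] * 5)]
    for en_group, fr_group in align_by_length(english, french):
        print(len(en_group), "English sentence(s) <->", len(fr_group), "French sentence(s)")
```

Searching over the whole document at once, rather than pairing greedily from the top, keeps one mismatched sentence from throwing off every pairing that follows it.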

As researchers got hold of more documents and translations of them in different languages, the database of common words and groups of words grew, providing increasing accuracy and nuance. This is the essence of the system today.
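How paired sentences can yield word correspondences without any grammar rules can be sketched in a few lines of Python. This is a textbook-style simplification, assuming a basic expectation-maximization procedure over whole-sentence pairs; it is not a reconstruction of the IBM system or of any commercial service, and the function names train_word_table and best_translation and the three-sentence corpus are invented for illustration.

```python
from collections import defaultdict


def train_word_table(pairs, iterations=10):
    """Estimate how likely each French word is as a translation of each English word.

    pairs: list of (english_sentence, french_sentence) strings.
    Returns a dict mapping (french_word, english_word) to an estimated probability.
    """
    corpus = [(e.split(), f.split()) for e, f in pairs]
    french_vocab = {f for _, fs in corpus for f in fs}

    # Start with no knowledge: every French word is equally likely.
    uniform = 1.0 / len(french_vocab)
    table = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for es, fs in corpus:
            for f in fs:
                # Share one unit of "credit" for f among the English words
                # in the paired sentence, in proportion to current estimates.
                norm = sum(table[(f, e)] for e in es)
                for e in es:
                    share = table[(f, e)] / norm
                    count[(f, e)] += share
                    total[e] += share
        # Re-estimate the probabilities from the shared-out credit.
        for (f, e), c in count.items():
            table[(f, e)] = c / total[e]

    return dict(table)


def best_translation(table, english_word):
    """Return the French word with the highest estimated probability."""
    candidates = {f: p for (f, e), p in table.items() if e == english_word}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    toy_corpus = [
        ("the house", "la maison"),
        ("the flower", "la fleur"),
        ("a flower", "une fleur"),
    ]
    table = train_word_table(toy_corpus)
    for word in ("the", "house", "flower", "a"):
        print(word, "->", best_translation(table, word))
```

The program is never told that "house" means "maison." It reaches that pairing only because "la" is better explained by "the," which appears alongside it in two sentence pairs. With a larger collection of paired sentences, such estimates have more evidence to draw on.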

Although the IBM group's initiative began more than 20 years ago, it has taken time for computer scientists at IBM and elsewhere to refine those techniques, for computers to become powerful enough to manage the complexity of the many linguistic probabilities (such as multiword phrases and idioms) and for databases to grow large enough - billions of words in various languages - to provide translations nuanced enough to be usable. This is easier when dealing with closely related languages, such as French and Spanish, and with languages that have lots of translated documents with which to build a database. European languages do well in computer translations in part because the workings of the European Union must be published in the 23 "official and working languages" of the EU; these documents can then be used as raw data for researchers.

A major step in computer translation occurred in 2007 - around the time that Libby Casey was struggling with those Reykjavik washer instructions - when Google introduced the first free, statistically based translation software. (Other Web-based translation programs were still using the older linguistic rule-based systems.)

"Suddenly we see enormous progress in this technology because of Google's push," says Dimitris Sabatakakis, chief executive of Systran, one of the oldest computer translation companies. (Systran powered Google Translate until 2007 and is still the engine behind the widely known Yahoo! Babel Fish computer translation service, which now uses a hybrid system combining both statistical and linguistic models for translation.)

All this means that someone such as Michael Cavendish, a lawyer based in Jacksonville, Fla., can do human-rights work related to China. "Machine translation has been a godsend for someone like me who has trouble conversing in foreign languages, because I never got a chance to study them in depth," he said recently.

When Cavendish writes documents, e-mails or Twitter posts to communicate with dissidents and others in Chinese, he finds that a computer translation is pretty good - provided he keeps his English simple. So he doesn't go on about "ex post facto laws," he said, but simply says: "China arrested this man today for something that was legal yesterday."

After shunning linguistic systems for many years, the statistical translation mainstream is now again embracing grammar and other language-specific rules to capture some nuances and improve accuracy.

Experts say that improvements in translation systems are only going to continue as the databases they use grow larger and as computer scientists are better able to incorporate linguistic information. Soon, researchers say, there will be more and better "speech to speech" software, which will allow simultaneous translation in meetings, for instance. The Pentagon is particularly interested in giving deployed soldiers the ability to communicate with locals: One project is focusing on translations between English and Pashto, which is spoken in Afghanistan and Pakistan.

Even as the field rapidly evolves, though, the kinds of odd translations that Libby Casey encountered doing her laundry in Reykjavik are unlikely to vanish entirely - as Sandra Alboum recently found out. Alboum, who runs a translation company in Arlington, was perusing a manual for a half-million-dollar steel-manipulation machine that a client of hers had translated, using a computer, from German into English. "Do not step under floating burdens," it said.

She had to check the manual herself to figure out what was meant: "Do not stand under suspended loads."

Kakaes is a writer living in Washington.

© 2011 The Washington Post Company