19 November 2007

Translation out of English

If you look at MT papers published in the *ACL conferences and siblings, I imagine you'll find a cornucopia of results for translating into English (usually, from Chinese or Arabic, though sometimes from German or Spanish or French, if the corpus used is from EU or UN or de-news). The parenthetical options are usually just due to the availability of corpora for those languages. The former two are due to our friend DARPA's interest in translating from Chinese and Arabic into English. There are certainly groups out there who work on translation with other target languages; the ones that come most readily to mind are Microsoft (which wants to translate its product docs from English into a handful of "commercially viable" languages), and a group that works on translation into Hungarian, which seems to be quite a difficult proposition!

Maybe I'm stretching here, but I feel like we have a pretty good handle on translation into English at this point. Our beloved n-gram language models work beautifully on a language with such a fixed word order and (to a first approximation) no morphology. Between the phrase-based models that have dominated for a few years, and their hierarchical cousins (both with and without WSJ as input), I think we're doing a pretty good job on this task.

I think the state of affairs in translation out of English is much worse. In particular, I think it is worse for translation from a morphologically-poor language to a morphologically-rich language. (Yes, I will concede that, in comparison to English, Chinese is morphologically-poorer, but I think the difference is not particularly substantial.)

Why do I think this is an interesting problem? For one, I think it challenges a handful of preconceived notions about translation. For instance, my impression is that while language modeling is pretty darn good in English, it's pretty darn bad in languages with complex morphology. There was a JHU workshop in 2002 on speech recognition for Arabic, a large component of which was on language modeling. From their final report, "All these morphology-based language models yielded slight but consistent reductions in word error rate when combined with standard word-based language models." I don't want to belittle that work---it was fantastic. But one would hope for more than a slight reduction, given how badly word-based n-gram models work in Arabic.
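To make the sparsity intuition concrete, here is a toy sketch (not taken from the JHU report). It assumes a made-up lexicon where every word is one stem plus one suffix, and compares how many test tokens are unseen when the modeling unit is the whole word versus the morph. The names `stems`, `suffixes`, `oov_rate`, and the train/test split are all hypothetical, and the split is deliberately extreme.

```python
from itertools import product

def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)

# Hypothetical toy lexicon: 20 stems x 10 suffixes = 200 surface forms.
stems = [f"stem{i}" for i in range(20)]
suffixes = [f"+suf{j}" for j in range(10)]
pairs = list(product(range(20), range(10)))

# Artificial split: every stem and every suffix occurs in training,
# but no full stem+suffix combination is shared between train and test.
train = [(i, j) for i, j in pairs if (i + j) % 2 == 0]
test = [(i, j) for i, j in pairs if (i + j) % 2 == 1]

def as_words(items):
    """Treat each stem+suffix combination as one unanalyzed word."""
    return [stems[i] + suffixes[j] for i, j in items]

def as_morphs(items):
    """Split each word into its stem token and its suffix token."""
    return [m for i, j in items for m in (stems[i], suffixes[j])]

word_oov = oov_rate(as_words(train), as_words(test))
morph_oov = oov_rate(as_morphs(train), as_morphs(test))
print(f"word-level OOV rate:  {word_oov:.2f}")
print(f"morph-level OOV rate: {morph_oov:.2f}")
```

In this contrived setup every test word is unseen at the word level, while every test morph has been seen. Real languages are nowhere near this clean, and a lower unseen-token rate does not by itself guarantee a better language model, which is consistent with the "slight but consistent" gains quoted above.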

Second, one of my biggest pet peeves about MT (well, at least, why I think MT is easier than most people usually think of it) is that interpretation doesn't seem to be a key component. That is, the goal of MT is just to map from one language to another. It is still up to the human reading the (translated) document to do interpretation. One place where this (partially) falls down is when there is less information in the source language than you need to produce grammatical sentences in the target language. This is not a morphological issue per se (for instance, the degree to which context plays a role for interpretation in Japanese is significantly higher than in English---directly translated sentences would often not be interpretable in English), but really an issue of information-poor to information-rich translation. It just so happens that a lot of this information is often marked in morphology, which languages like English lack.

That said, there is at least one good reason why one might not work on translation out of English. For me, at least, I don't speak another language well enough to really figure out what's going on in translations (i.e., I would be bad at error analysis). The language other than English that I speak best is Japanese. But in Japanese I could probably only catch gross translation errors, nothing particularly subtle. Moreover, Japanese is not what I would call a morphologically rich language. I would imagine Japanese to English might actually be harder than the other way, due to the huge amount of dropping (pro-drop and then some) that goes on in Japanese.

If I spoke Arabic, I think English to Arabic translation would be quite interesting. Not only do we have huge amounts of data (just flip all our "Arabic to English" data :P), but Arabic has complex but well-studied morphology (even in the NLP literature). As cited above, there's been some progress in language modeling for Arabic, but I think it's far from solved. Finally, one big advantage of going out of English is that, if we wanted, we have a ton of very good tools we could throw at the source language: parsers, POS taggers, NE recognition, coreference systems, etc. Such things might be important in generating, e.g., gender and number morphemes. But alas, my Arabic is not quite up to par.

(p.s., I recognize that there's no reason English even has to be one of the languages; it's just that most of our parallel data includes English and it's a very widely spoken language and so it seems at least not unnatural to include it. Moreover, from the perspective of "information poor", it's pretty close to the top!)

14 comments:

Kevin said...

I think another reason for working on other translation pairs is commercial. Just look at the percentage increase of Chinese, Portuguese, and Arabic speakers on the Internet:

http://www.internetworldstats.com/stats7.htm

There's gotta be interest in translation from English to X, as well as translation pairs not involving English. Now, the latter is yet another intriguing MT research problem, i.e. should we do bridge translation (X1->English, English->X2), direct translation (X1->X2), or a combination?

Dave said...

In fact, another (much smaller) DARPA program, the TransTac speech-to-speech translation project, IS looking at translation from English to (Iraqi) Arabic. The motivation is to allow full 2-way communication between the parties.

Yes, the morphological complexity of Arabic does cause a problem, but mainly for the metrics, rather than the translation itself, it would seem. BLEU and TER both show much worse performance for E2A than A2E, but subjective Likert-scale evaluation shows the two translation directions as performing about the same in many evals.

hal said...

kevin --

that usage list is pretty cool... would have been great if they could have worked the % increase into the graph itself. but it's pretty amazing.

as for the direct versus bridge, i guess it would depend primarily on if you have actual direct parallel data. if you don't, you're pretty much hosed and have to do bridge. it's possible that even with some parallel data, it might be better to do a combination (i wouldn't find this surprising at all). i guess one question is whether you can do anything more interesting than just X1 -> N-best-E and then N-best-E -> N^2-best-X2 and then rerank.
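The N-best bridging scheme described above can be sketched roughly as follows. This is only an illustration under assumptions: `src_to_en` and `en_to_tgt` are hypothetical stand-ins for two real systems that each return an N-best list of (hypothesis, log-probability) pairs, the toy tables and scores are made up, and summing the two log-probabilities is just the simplest possible reranking rule.

```python
def pivot_translate(src, src_to_en, en_to_tgt, n=2):
    """Bridge X1 -> English -> X2 using N-best lists at both steps.

    Each target hypothesis is scored by the sum of the two log-probs,
    keeping the best-scoring English path that leads to it.
    """
    best = {}
    for en, lp_en in src_to_en(src)[:n]:
        for tgt, lp_tgt in en_to_tgt(en)[:n]:
            score = lp_en + lp_tgt
            if score > best.get(tgt, float("-inf")):
                best[tgt] = score
    # Rerank all N^2 candidates, best score first.
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy N-best tables with placeholder strings and made-up scores.
S2E = {"src": [("en_a", -0.2), ("en_b", -1.0)]}
E2T = {
    "en_a": [("tgt_1", -1.5), ("tgt_2", -1.6)],
    "en_b": [("tgt_2", -0.1), ("tgt_3", -2.0)],
}

ranked = pivot_translate("src", S2E.__getitem__, E2T.__getitem__)
one_best = pivot_translate("src", S2E.__getitem__, E2T.__getitem__, n=1)
print([tgt for tgt, _ in ranked])
print([tgt for tgt, _ in one_best])
```

With these made-up numbers, the 1-best chain commits to `en_a` and outputs `tgt_1`, while the reranked N-best version prefers `tgt_2` because the second English hypothesis leads to a much more confident target. Anything "more interesting" would have to go beyond this kind of score combination.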

dave --

i had forgotten about this project, but my sense is that speech to speech in the case of transtac is a very very limited domain and that classification-based and small rule-based systems actually do quite well. are the results you're quoting for E2A vs A2E on this domain, or is it for the more general text-to-text in (eg) news? if the former, then my guess is there isn't actually much "generation" going on, which may explain away some of the good performance.

the point about evaluation metrics is very true.

Dave said...

Hal,

A lot of people think that the TransTac domain is very narrow, but that's not true, at least not when compared with other speech systems like dialog systems. Quite a wide range of topics are covered within its purview, and the Arabic vocab is about 75K. All of the surviving systems in the program in fact use statistical machine translation as their primary mechanism.

Now granted, it is not as broad as news (GALE), nor as syntactically demanding in terms of complex clausal structure to get right, etc. I don't know how E2A translation would do for that domain. Probably it would not be totally awful, though.

Great blog by the way.

Oskar Kohonen said...

There has been some research in our lab on adapting n-grams for Finnish (and related languages), which is morphologically rich. The approach has been to segment the words automatically into unsupervised morpheme-like segments. This improves n-gram performance significantly for morphologically rich languages. The segmentation was also applied to statistical MT, for which Finnish is a difficult source and target. There, however, the scores did not improve that much. The first paper below seems to have applied the methods to Arabic as well.

Some papers in case you're interested:
http://www.cis.hut.fi/vsiivola/papers/creutz07naacl.pdf
http://www.cis.hut.fi/svirpioj/papers/virpioja07mtsummit.pdf

Philipp said...

You make a lot of valid points. The commercial interest in English-X may be even bigger than in X-English, but the funding situation in the US is different. Fortunately things are better here in the old world.

I have found in the translation of European languages that morphology is one of the main reasons why translation into a language is worse than translation out of it: generating morphology is much harder than translating it. I don't think this is just an artefact of the BLEU score.

Now it's time to plug my Europarl (2005) paper and the recent work on factored models...

Anonymous said...

Translation demo between English and the morphologically very rich Czech, see https://blackbird.ms.mff.cuni.cz/cgi-bin/bojar/mt_cgi.pl

mark said...

I would disagree with you on the quality of automatic translation into English...still not reliable enough. I do agree that it's pretty impressive what n-gram models have yielded, but we might be close to the limits on what we can do with it.
