19 July 2007

What's the Use of a Crummy Translation?

I'm currently visiting Microsoft Research Asia (in Beijing) for two weeks (thanks for having me, guys!). I speak basically no Chinese. I took one half of a semester about 6 years ago. I know much more Japanese; enough so that I can read signs that indicate direction, dates and times, but that's about it... the remainder is too divergent for me to make out at all (perhaps a native Japanese speaker would feel differently, but certainly not a gaijin like me).

My experience here has reminded me of a paper that Ken Church and Ed Hovy wrote almost 15 years ago now, "Good Applications for Crummy Machine Translation." I'm not sure how many people have read it recently, but it essentially makes the following point: MT should enter the user's world in small steps, only insofar as it is actually going to work. To say that MT quality has improved significantly in 15 years is probably already an understatement, but it is still certainly far from something that can even compare to the translation quality of a human, even in the original training domain.

That said, I think that maybe we are a bit too modest as a community. MT output is actually relatively readable these days, especially for relatively short input sentences. The fact that "real world companies" such as Google and LanguageWeaver seem to anticipate making a profit off of MT shows that at least a few crazies out there believe that it is likely to work well enough to be useful.

At this point, rather than gleefully shouting the glories of MT, I would like to point out the difference between the title of this post and the title of the Church/Hovy paper. I want to know what to do with a crummy translation. They want to know what to do with crummy machine translation. This brings me back to the beginning of this post: my brief experience in Beijing. (Discourse parsers: I challenge you to get that dependency link!)
  • This voucher can not be encashed and can be used for one sitting only.
  • The management reserves the right of explanation.
  • Office snack is forbidden to take away.
  • Fizzwater bottles please recycle.
The first two are at my hotel, which is quite upscale; the last two are here on the fridge at Microsoft. There are so many more examples, in subway stations, on the bus, on tourism brochures, in trains, at the airport; I could go on collecting these forever. The interesting thing is that I still know what they all mean, even though two of these use words that aren't even in my vocabulary (encashed and fizzwater), one is grammatical but semantically nonsensical (what are they explaining?) and one is missing an indirect object (but if it had one, it would be semantically meaningless). Yes, they're sometimes amusing and worth a short chuckle, but overall the important points get across: no cash value; you can be kicked out; don't steal snacks; recycle bottles.

The question I have to ask myself is: are these human translations really better than something a machine could produce? My guess is that machine translation outputs would be less entertaining, but I have a hard time imagining that they would be less comprehensible. I guess I want to know: if we're holding ourselves to the standard of a level of human translation, what level is this? Clearly it's not the average translation level that large tourism companies in China hold themselves to. Can we already beat these translations? If so, why don't we relish this fact?

11 comments:

  1. There is some very promising work being done on SMT for Asian languages; I wish I could say more. Needless to say, leaps and bounds are being made in the field. There are different ways of thinking about SMT, and Google's BLEU scores are just not quite there; there are ways to get much better translations using the same techniques. Obviously I don't know the internals of Google's SMT tech, but the differences run to a very fundamental level. I regret having to be so cryptic; I respect my agreements :) It does have something to do with this, though: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

  2. Let's see how Google Translate performs in these cases:

    此優惠券不可兌換現金,而且只可使用一次。
    This Coupon non-convertible cash, but can only be used once.

    管理層保留最終解釋權。
    Management retain the final power of interpretation.

    (When the concept of "final" is removed, as in the original sign:
    管理層保留解釋的權利。
    Management explained to retain the rights.)

    辦公室零食不可拿走。
    Office snack foods can not take away.

    請回收汽水瓶。
    Please recall the bottle.

    (In the word order of the original sign:
    汽水瓶請回收。
    Please bottle recycling.)

    Hmm. It works okay.

    This reminds me of the electronic English-Chinese dictionary that I used to have. It worked great for looking up words. It would be cool if it could do simple MT as well. Now some electronic dictionaries can already do simple MT, e.g. this one. I'm not sure about the technology underlying this or how well it performs, but I wonder what current statistical MT technologies can do in a hand-held device like this.

  3. John: SMT has long been overlooked as a viable MT method simply because of the prohibitively high cost of processing. I'm guessing that would be a limitation on embedded devices too. Hopefully not for long though.

    Disclaimer: I don't work for these guys, but I'm very interested in, and excited about, what they are doing.
    I got a little more info about the guys specialising in Asian SMT:
    1. They are currently working on over 100 high-quality Asian language pairs, with many more to come.
    2. They have support for domain-specific translations (thus resulting in a much higher quality translation).
    3. They will have a Thai beta translation done soon. Thai is an extremely hard language to translate because of its lack of spaces, periods and attention :)
    4. They are looking for computational linguists from Asian countries to help with pre- and post-processing of their translations. So if you have experience in the area, get in touch with me and I'll give you a contact. Or sign up for their mailing list at asiaonline.net.

  4. Machine versus people. Not sure if I get that. Whilst I agree with the concept of MT, I don't subscribe to its use commercially. Surely what we should all be looking at is contextual memory and the relationship of language to market, and of market language to brand or business or sector. Systems are already emerging that use artificial intelligence and are developing faster than MT. Ultimately the quality of our language as a communications medium is going to be the demonstrable proof that we can communicate with our fellow man or woman in his country and for his business.

  5. Excellent post, Hal. My company is the group that Dave Novakvic was referring to as developing Asian SMT systems, and what you present in this entry is exactly what we subscribe to.

    I previously worked as VP and Research Director for Gartner in Asia Pacific, and one of the messages I often tried to get across to people looking for perfection was that it often was not necessary. Good enough for the current task is what is necessary. Many businesses are using wikis, blogs and other tools internally that are far from perfect but usable and good enough for the purpose for which they are being used. Sure, they could be better and offer more, but they are good enough.

    MT is the same and the examples you pointed out are as good as any. MT has improved greatly and new techniques and new resources are making MT better every day.

    Some of the languages we are working on are difficult to deal with programmatically, such as Thai. Thai does not have spaces, punctuation, periods or anything useful to determine words, paragraphs, sentences etc. I have a sentence that runs 27 pages; try feeding that through an SMT system :) Research has been limited to date because the basic low-level technology needed just to reach the point where most other languages start is hard to master.

    We will have our first Thai system up for demo in 2-3 weeks from now. It will be far from perfect, but it will be better than any of the other limited Thai translation systems today based on rules and it will be "good enough." Sure, there will be improvements over time and we are already working on some, but there is also a point of diminishing returns.

    We are training on multiple domains with corpus sizes exceeding 10 million sentence pairs. It will be an interesting language to monitor to see how it stacks up in SMT and what the point of diminishing returns is for domains.

    As I am sure you are aware, gathering a corpus is no fun: it is time consuming and expensive. That point of being "good enough" is going to be key for us. At this stage, if key messages are clearly presented, even if the grammar is a little off, then that to me is "good enough".

    Future enhancements such as syntax trees and applying morphological data in some of the processes will likely give us greater quality than going too far on corpus size.

    BTW, my personal favorite from China is "Passage of deformed man". The Chinese read "wheelchair ramp" - that one was not quite "good enough"

    Regards

    Dion

  6. It's always nice to have a low baseline. A particularly useful baseline for speech systems is call center attendant performance. The attendants aren't dumb, they're just stressed from too little time to make a decision (typically under 20 seconds), too little training about the business logic (typically hundreds of destinations, with half a day training and a 3-ring binder), and too little experience (typically under six months).

    It really becomes a cost issue. How much are you willing to pay for a good translation? For the Chinese tour bus companies, not much. How much are companies willing to pay for telephone support? Again, not much.

    Just don't confuse baselines with toplines. People can do call routing and hotel sign translation at near 100% accuracy.

    The bigger issue is that all of NLP is crummy. 90% entity extraction precision means developing systems to deal with errors in 1 out of 10 low-level decisions. (Not to mention developing whole new systems to deal with the lack of recall.) 97% tagger accuracy still means one word in a sentence is likely wrong; not coincidentally, that word is likely to be the most discriminative one in the sentence (in the TF/IDF sense), such as a noun or adjective, rather than a function word like "the", which is (almost) always tagged correctly, padding the accuracy stats.
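
To make the arithmetic behind the 97% figure concrete, here is a minimal sketch. It assumes that tagging errors are independent across tokens, which the comment itself argues is not quite true (errors cluster on content words), so treat the numbers as a rough illustration only. The function names and the sentence lengths are made up for this example.

```python
def prob_sentence_has_error(token_accuracy: float, sentence_length: int) -> float:
    """Chance that at least one token in a sentence is mis-tagged.

    Assumes every token is tagged correctly with the same probability,
    independently of the others (a simplifying assumption).
    """
    return 1.0 - token_accuracy ** sentence_length


def expected_errors(token_accuracy: float, sentence_length: int) -> float:
    """Expected number of mis-tagged tokens in a sentence of the given length."""
    return (1.0 - token_accuracy) * sentence_length


if __name__ == "__main__":
    # Sentence lengths chosen only for illustration.
    for n in (10, 25, 33):
        p = prob_sentence_has_error(0.97, n)
        e = expected_errors(0.97, n)
        print(f"length {n:2d}: P(at least one error) = {p:.3f}, expected errors = {e:.2f}")
```

Under these assumptions, a 25-token sentence has a little over a 50% chance of containing at least one tagging error, and the expected number of errors reaches about one at roughly 33 tokens.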

  7. I love how SMT systems omit a negative every now and then, translating the phrase as the exact opposite of its original meaning. Idiomatic expressions and colloquialisms are often a great source of entertainment as well.

  8. Bob, do you think one should weight the accuracy calculation by, e.g., the TF/IDF score of each word, to get a more relevant accuracy measure?

  9. I'm not a big fan of complexly weighted utility metrics. If you can motivate one with a task, fair enough.

    Weighting by IDF would be similar to macro-average results for classifiers (metric is average over types, not tokens).
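
The token-versus-type distinction can be sketched as follows. This is only an illustration of averaging over tokens (micro) versus averaging over word types (macro); it is not an IDF-weighted metric, merely the analogous idea that rare words count as much as frequent ones. The toy data and function names are invented for the example.

```python
from collections import defaultdict


def micro_accuracy(results):
    """Average over tokens: every occurrence counts once."""
    return sum(correct for _, correct in results) / len(results)


def macro_accuracy(results):
    """Average over word types: each distinct word counts once,
    no matter how often it occurs."""
    per_type = defaultdict(list)
    for word, correct in results:
        per_type[word].append(correct)
    type_scores = [sum(flags) / len(flags) for flags in per_type.values()]
    return sum(type_scores) / len(type_scores)


# Hypothetical tagger output as (word, was_tagged_correctly) pairs:
# the frequent word "the" is always right, one rare word is wrong.
toy_results = [("the", True)] * 8 + [("fizzwater", False), ("voucher", True)]

if __name__ == "__main__":
    print(f"token-averaged (micro): {micro_accuracy(toy_results):.3f}")
    print(f"type-averaged  (macro): {macro_accuracy(toy_results):.3f}")
```

On this toy input the token average is 0.9 while the type average is about 0.667, which shows how frequent, easy words can pad a token-level score.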

    What you need will depend on task. For "needle in a haystack" kinds of text mining, you need good recall on items not in the training set. For "what do people think of the iPhone", it's much easier to get an answer because the signal is hugely redundant. I'd want high recall for the former and high precision for the latter, most likely.

  10. Some rumblings...

    Are those texts translations? What is a translation for you? The work of translation is more often than not underestimated, thought of as a trivial task, and it is frequently miserably paid. People start to react about what a translation is when something has to be done with the product apart from laughing.
    Few people would say something is a car if it never moves. People are too used to just taking a look at translations and giving up and trying to understand by themselves through context or else...until they have to deal with longer messages that cannot be guessed from the look of a machine.
    MT has been improving a lot. Still, many endeavors in this area would progress faster if people were humble enough to ask what AI people have so often failed to ask:
    what is our general theory of this? (in this case, a general theory of translation)
    The Turing test for AI was a bad premise. A parrot can talk and often fool, and yet few people would say it is very intelligent. The Turing test and the reluctance to think about what intelligence really is have led to lots of nice gadgets but too few advances for the effort put into AI.
    In the same way as many AI people in general have often failed to sit still for a moment and think of a theory of mind (remember On Intelligence, by Jeff Hawkins?), many people in NLP have failed to ask first what their theory of language is (and a theory of language needs to be more than a chosen formalism).

    What do we expect from MT? To surpass the work of people who are doing their best at a job they are not capable of, but for which most companies do not want to pay enough? Do we want the system to do some kind of understanding? (very high level, and then define understanding)
    Or do we take some middle path and decide to go for a system that can render the intended meaning of most of the sentences? Say 70, 80%?
    How robust? In what text fields?

    Andrés


    Crossminder
