I took part in a wonderful Dagstuhl workshop this past February on translating morphologically rich languages. (Yeah, I also don't really know why I was invited :P.) But many thanks to Alex, Kevin, Philipp, Helmut and Hans for inviting me. I had a realization during this workshop that I thought I'd share. It's obvious in retrospect, and perhaps in front-spect for many of you. Much of this came up in the discussion with Bonnie Webber, Marion Weller, Martin Volk, Marine Carpuat, Jörg Tiedemann and Maja Popovic, and Maja deserves much credit for her awesome error analysis tool that helped shed some light on German.
One thing you commonly think of when translating into a morphologically rich language is that there's stuff you're going to have to hallucinate. Really this isn't an issue of morphology per se, but just that this is one place where it's obvious. For instance, even going from English to French you'll have to hallucinate gender on your determiners (un versus une and le versus la) that's unmarked in English. Or when going from Japanese (which roughly combines present and future tenses into a single tense) to English, you'll have to hallucinate "will" at appropriate places.
An abstraction that I think was pretty widespread among the initial discussions in the workshop was that if you're going from language X to Y, there are basically two options:
- Phenomenon foo is explicit in Y but implicit in X, and therefore you'll have to hallucinate it (e.g., tense is explicit in English but not in Mandarin)
- Phenomenon bar is explicit in Y and also explicit in X, and so you can just copy it.
Okay, so you want examples.
An easy example is gender. I've been well assured that, for instance, French and Russian both have explicit gender. But just because some noun (e.g., moon/lune) is feminine in French doesn't mean it's also feminine in Russian. (In fact I think it's neuter.)
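A minimal sketch of why "just copy it" breaks down for gender, assuming a tiny hand-built lexicon. The names `GENDER` and `copy_accuracy` are hypothetical (not from any MT toolkit), and the five nouns are hand-picked, so the numbers illustrate the mismatch and say nothing about real corpus rates.

```python
# Toy lexicon: grammatical gender of the same concept in three languages.
# Hand-picked for illustration only; this is not a real linguistic resource.
GENDER = {
    "sun":   {"fr": "m", "de": "f", "ru": "n"},  # le soleil, die Sonne, солнце
    "book":  {"fr": "m", "de": "n", "ru": "f"},  # le livre, das Buch, книга
    "table": {"fr": "f", "de": "m", "ru": "m"},  # la table, der Tisch, стол
    "house": {"fr": "f", "de": "n", "ru": "m"},  # la maison, das Haus, дом
    "night": {"fr": "f", "de": "f", "ru": "f"},  # la nuit, die Nacht, ночь
}


def copy_accuracy(src: str, tgt: str) -> float:
    """Fraction of concepts whose gender survives a naive copy from src to tgt."""
    hits = sum(1 for genders in GENDER.values() if genders[src] == genders[tgt])
    return hits / len(GENDER)


if __name__ == "__main__":
    for src, tgt in [("fr", "ru"), ("fr", "de"), ("de", "ru")]:
        print(f"{src} -> {tgt}: copy is right {copy_accuracy(src, tgt):.0%} of the time")
```

Even though all three languages mark gender overtly, the copy strategy only works here when the genders happen to coincide, so the target gender still has to be predicted from the target noun itself.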
You might argue gender is a stupid thing to pick because it's essentially an artificial encoding of who-knows-what.
How about tense. That clearly has a semantic interpretation (did something happen in the past, the present or the future) and so if languages X and Y both express some particular tense, they must be consistent in how they do it.
Wrong. Now my memory is getting a bit shaky, but my recollection is that, for instance, in newswire text, it's very common in German to refer to things that have happened in the past in present tense. To English speakers this is a strange convention (we tend to refer to such things in past tense), but it doesn't have to be so. And of course English has its own idiosyncrasies: see the plight of the native German speaker who cannot understand English tense usage in (New Zealand) news articles.
Part of this is probably because tense, even in English, is a pretty slippery concept. We (native English speakers) have no problem using present (or progressive) tense to refer to things that happen in the present (John runs), the past (so yesterday I'm running to the store and a hamburger falls on my head!) or the future (my flight leaves at 8:00 tonight).
Another easy example is definiteness (thanks to Kevin Knight for this inspiration). Again, our high school English teachers tell us that "the" (+Definite) has to refer to something that's already been introduced into context. I just went to cnn.com, clicked on the very first link, and the first sentence is "The boss resigned under pressure and other Veterans Affairs managers are likely on the way out." OK, you could argue that "The boss" is already in the context of the US news media (this is an article about Shinseki), but it's nonetheless very common to see (English) entities introduced using "the", and the precise rules that govern this may or may not be consistent across other languages.
The long and short of this is: I like the fact that translation into morphologically rich languages makes us pay attention to linguistic divergence. But that doesn't mean that divergences aren't there even when languages express the same set of linguistically-named phenomena. Usage can vary dramatically, be it for conventions, socio-linguistic reasons, or other things that are hard to pin down. It's just that by focusing all our energy on a very particular convention (newswire, parliament), we can pretty easily learn these mappings because there's no variability. Add some variability and we're hosed, even for languages with the same set of (overt) markings.
11 comments:
Hi Hal,
your tense example is nice :) maybe it just means that beyond morphological divergence, what is needed here is an account of what discourse linguists call "aspect", which in most languages needs a proper treatment of the modalities brought by the connectors?
ex:
Si j'étais grand, je serais heureux
If I was tall, I would be happy
->
If I were tall, I'll be happy
In French, if + imperfect (past) => conditional
In English, if + subjunctive form => future
I remember, back in my literature degree (long time ago), we were asked to proof-edit a raw translation from Japanese to French and all these Japanese present tenses needed to be adapted so that French's "tense concordance" (concordance des temps) was respected. Tough, super tough.
So I'm not holding my breath to see this solved any time soon...
Djamé
I was tangentially involved with a paper presented at LREC this week that proposed a taxonomy for semantic/communicative functions of definiteness and did some annotation in a few languages. Even though I've spent a long time thinking about exactly these sorts of divergences, one of the things this really brought home was just how idiosyncratic grammaticalizations (the way semantic/communicative functions get mapped onto syntax) can be, and how quickly (diachronically speaking) morphemes can move around that space, picking up and discarding semantic functions.
I don't know if detailed semantic taxonomies like this are the right way to proceed (can't we just learn this stuff?), but doing the project to appreciate that there are lots of ways to slice the semantics of definiteness pie (or, in your German past tense example, the semantics of tense/aspect/evidentiality/modality pie) was worth it.
It goes deeper. When Michelle Dionisio and I were working on part-of-speech tagging for Tagalog, Judith Tonhauser was kind enough to keep asking us why we thought that 'verb' was a good name for the argument-taking predicate words of that language, and, if we did want to use that name, what claim we were making about them. You can't just assume some "universal tag set" and expect to get away without some pushback. Judith made us think about our not yet fully thought out reasons for wanting part-of-speech tags in the first place.
One fortunate feature, if Tagalog turns out to be untranslatable, is that no sensitive computational linguist will ever have to read the translations of the truly awful early twentieth century romance novels we were working with.
@Chris Brew, I may be misremembering this, but several years ago I saw an internal chart of MT system quality across a bunch of languages on a standard test set (from a company whose identity will remain undisclosed), and Tagalog was very near the top. So unless I've forgotten (a distinct possibility), there's either a lot of good data out there or maybe MT doesn't need to worry about verbs and nouns that much?
I recall reading about an experiment with transcribing spoken French in a noisy environment. (I can't find the reference.) The hypothesis was that the use of gender marking terms (le, la, un, une) would reduce the error rate in transcription, by distinguishing words that sound very similar. The experimental results supported the hypothesis. Perhaps many of the apparently arbitrary conventions of language are a form of error-correcting code. Gender and tense may be completely arbitrary conventions that reduce error in communication by adding redundancy. You wrote, "You might argue gender is a stupid thing to pick because it's essentially an artificial encoding of who-knows-what." I believe it is a natural (but arbitrary) encoding of redundant information for error correction.
@Djame: yeah, aspect is even weirder to me than tense :). Thanks for the example!
@ChrisBrew: totally agree, and I love the Tagalog example! Syntax != Semantics. (i.e., just because we call something the same thing doesn't mean it _is_ the same thing.)
@Peter: Wow, that's a really cool example, though in retrospect it's not too surprising. I recall Yoshua Bengio making a comment along the lines of: if there were some alien race who spoke a language that was optimally compressed and had no redundancy, then it would just look like a random string of zeros and ones and there would be nothing we could learn from it. I think that's part of what's going on here, too, with number. I wonder if anyone's looked at, for instance, if in languages with arbitrary genders, there's anti-correlation of gender between phonologically similar words... (This would be a counterpoint to the reference you can't find.)
"anti-correlation of gender between phonologically similar words" -- A quick search found many lists of French homophones. Quite a few of the noun homophones can be distinguished by gender, but I haven't calculated whether it is statistically significant.
Genders are random. In Russian a full moon is feminine, but a half moon is masculine. I can't get over the fact that girl is neuter in German.
@Itman
"-chen" und "-lein" machen alles klein. ("-chen" and "-lein" make everything small.)
Mädchen is the diminutive of "Magd" which is feminine. http://en.wiktionary.org/wiki/M%C3%A4dchen.
It just so happens that the diminutives "-chen" and "-lein" make things small and in the process normalize their (grammatical) gender (cf. Frau -> Fräulein).
Prepositions/postpositions/case systems are another great example of this phenomenon, in which languages carve up space (and time, causality, etc.) differently in their grammar. Witness how English clusters together different spatial scenes as "in" or "on"; these clusters are language-specific (Bowerman & Choi, 2003).
In terms of automatic disambiguation of the semantics of grammatical categories, there was a 2010 paper that tried to tease apart the semantics of English tense/aspect: Reichart & Rappoport, Tense sense disambiguation: a new syntactic polysemy task. Semantic classifiers have also been built for English prepositions, modality, and definiteness.
Small correction: moon (луна) is in fact feminine in Russian as well. Though the general point holds.
In English, the hardest parts for a foreign learner are the articles and prepositions. There are rules with dozens of cases, but exceptions to them are still common. One has to develop an intuition; there is no logic there...