There has been a trend for quite some time now toward developing algorithms and techniques to be applicable to a wide range of languages. Examples include parsing (witness the recent CoNLL challenge), machine translation, named entity recognition, etc. I know that in at least one or two of my own papers, I have claimed (without any experimental substantiation, of course :P) that there is no reason why the exact same system could not be run on languages other than English, provided a sufficient amount of labeled training data (and a native speaker who can deal with the annoying tokenization/normalization issues in the non-English language).
I get the feeling that a large part of the surge is blowback against older NLP systems, for which hundreds or thousands of human hours were put into writing language-specific grammars and rules and lexicons. The replacement idea is to spend those hundreds or thousands of hours annotating data, and then repeatedly reuse that data to solve different problems (or to try to come up with better solutions to an existing problem, despite the associated fears in doing so).
I think that, overall, this is a good trend. The problem I see is that it is potentially limiting. In order to develop a system that could plausibly be applied to (nearly) any language, one has to resort to features that are universal across all languages. This is fine, but for the most part the only universal features we know of that are reasonably computable are things like "language is made up of words and words are sort of semanticy units on their own" (of course, this misses a lot of compounds in German and is hard to do in Chinese, which lacks spaces) and "words sometimes have prefixes and suffixes and these are syntactically useful" (oops, Arabic has infixes) and "capitalization is often a good indicator of something proper-noun-like" (except in German, where many common nouns are capitalized, or Japanese, where there isn't case marking). These are sometimes compounded with features like "adjacent words carry semantic meaning." But all in all, these features are relatively weak from the perspective of "language understanding."
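To make the kinds of "universal" surface features described above concrete, here is a minimal sketch of such a feature extractor. The function names and the 3-character affix window are illustrative choices of mine, not taken from any particular system.

```python
# Minimal sketch of "language-universal" surface features.
# Names and the 3-character affix window are illustrative, not from any real system.
def surface_features(token):
    return {
        "prefix3": token[:3],                       # crude prefix cue
        "suffix3": token[-3:],                      # crude suffix cue
        "is_capitalized": token[:1].isupper(),      # proper-noun-ish cue
        "has_digit": any(ch.isdigit() for ch in token),
    }

def adjacency_features(tokens, i):
    # "Adjacent words carry semantic meaning": look one word left and right.
    left = tokens[i - 1] if i > 0 else "<s>"
    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {"prev_word": left, "next_word": right}
```

Features like these transfer weakly across many languages, but they inherit exactly the failure modes listed above: the capitalization cue misfires on German common nouns, the affix cues miss Arabic infixes, and the whole scheme presumes whitespace tokenization that Chinese does not provide.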
This distinction seems analogous to the "domain independent" versus "domain specific" one that we've also seen. If you are willing to limit yourself to a specific domain (e.g., counter-terrorism), you can probably do a pretty good job of reasonably deep understanding. On the other hand, if you want to work at the other end---applicable across all domains---there's little you can do, because you're better off going for shallow with complete coverage rather than deep but sparse. Where I think the domain-specific people have it right is that they actually do take advantage of being in a specific domain. When I work on a problem that's language specific (e.g., summarization or coreference), I've only seldom taken advantage of the fact that the language is English. Sure, for summarization I've occasionally made use of an English parser, and for coreference I've made use of mined data that's specific to English, but overall I treat English as pretty much "any old language." This would probably be fine if I then ran my system on Arabic and Chinese and Portuguese and showed that it worked. But I don't. This seems to tell me that I'm missing something: that I have not been clear about my goal. Maybe I should take a hint from the domain-specific people and decide which side of the language-independence divide I want to be on.
(The one counterargument that I will use to save face is that applying to other languages is often a lot of relatively needless work...you often have a pretty good idea of what's going to happen and I'd be surprised if people have strongly believed they've built something that's reasonably language independent and it turns out not to be.)
08 September 2006
Multilingual = Not Lingual at All?
Posted by hal at 9/08/2006 05:32:00 PM
12 comments:
and some also said,
statistical natural language processing is not language processing at all, only statistics :P
What Hal is calling "multilingual" I would call "portable". And there's not just language portability but topic (e.g. genomics vs. sports) and genre (e.g. newswire vs. e-mail) portability.
A truly multilingual app should be like a multilingual person -- able to handle multiple languages in one instance. We built a bilingual entity extractor as part of one of the TIDES surprise language evaluations that could detect entities in English, Hindi, or documents that contained a mixture of both. It didn't do language ID or segmentation, but rather just built one big model. The features tend to be local and estimated conditionally, so it barely hurt performance at all. Hindi and English were easy in some sense because the character sets differ but it's easy to do a consistent tokenization.
Interesting...do Hindi characters and English characters occupy completely different sections of Unicode? Would the same work for, e.g., German and English? Or Modern Standard Arabic and Iraqi Arabic?
Indeed, I think the portability issue is an important one. But I think there's a lot more consistency within a single language than between languages. Building models to exploit this should be vastly easier (in the sense that one can exploit more linguistically relevant features) than moving across languages.
Devanagari characters don't overlap with ASCII in Unicode. They are a mess in and of themselves.
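Concretely, the main Devanagari block sits at U+0900 through U+097F, disjoint from ASCII, so a per-character range check is enough to separate the two scripts. A minimal sketch (the function names are my own):

```python
def is_devanagari(ch):
    # Main Devanagari block is U+0900..U+097F (later extension blocks
    # exist in Unicode but are ignored in this sketch).
    return 0x0900 <= ord(ch) <= 0x097F

def script_of(token):
    """Crudely label a token as 'devanagari', 'ascii', or 'mixed'."""
    if token and all(is_devanagari(ch) for ch in token):
        return "devanagari"
    if all(ord(ch) < 0x80 for ch in token):
        return "ascii"
    return "mixed"
```

Note that this trick answers the question above in the negative for German and English: both use Latin script, so no code-point check can tell them apart.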
For dealing with something like German and English, it depends on the task. Obviously Google handles this blend for tasks like spell checking. Search also doesn't seem to present much of a problem. For entity detection or part-of-speech tagging or phrase chunking, if you lean heavily on capitalization features, you'll definitely smear the probabilities out.
A generic solution is to build a mixture model. For instance, that's how most speech recognizers deal with variant pronunciations -- each triphone is a mixture of Gaussians over the feature space (usually intensity at quantized frequency intervals, plus their first and second derivatives to handle pitch movement). Ordinary mixture models are the typical textbook application of EM. Or, in a supervised setting, you might know the mixture source (e.g. this article is sports news in English, this one's German business).
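As a toy illustration of the mixture-model idea, here is a minimal EM fit for a one-dimensional Gaussian mixture. This is a from-scratch sketch, not a speech front end; the initialization (means spread over the data range) and the variance floor are my own choices.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_gmm_1d(xs, k=2, iters=50):
    # Initialize means spread across the data range; unit variances; uniform weights.
    lo, hi = min(xs), max(xs)
    mus = [lo + (hi - lo) * j / (k - 1) for j in range(k)] if k > 1 else [lo]
    sigmas = [1.0] * k
    pis = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in xs:
            ws = [pis[j] * gaussian_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            z = sum(ws)
            resp.append([w / z for w in ws])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            pis[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = math.sqrt(max(var, 1e-6))  # floor to avoid collapse
    return pis, mus, sigmas
```

In the supervised case mentioned at the end of the paragraph, the E-step responsibilities are simply fixed to the known source labels, and EM reduces to one M-step per source.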
The main problem with such models is that they're surprised every 1/100th of a second that the person's still speaking with a Texas accent. In statistical terms, they seriously underestimate dependencies, much like a naive Bayes classifier, a local language model, or a locally featured tagger/chunker. That, in turn, makes posterior conditional probabilities (i.e. confidence) very difficult to estimate; just working through the math leads to estimates that are far too attenuated.
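The overconfidence point can be seen with a one-line calculation: under a naive independence assumption, every repeated observation contributes the same log-likelihood ratio, so the posterior log-odds grow linearly in the number of observations and the posterior probability saturates. A small sketch with illustrative numbers (the function name and the per-observation log-likelihood ratio are hypothetical):

```python
import math

def naive_posterior(n_obs, log_lr_per_obs, log_prior_odds=0.0):
    # Naive-independence posterior: each observation adds the same
    # log-likelihood ratio, ignoring dependence between observations.
    log_odds = log_prior_odds + n_obs * log_lr_per_obs
    return 1.0 / (1.0 + math.exp(-log_odds))
```

With a per-frame log-likelihood ratio of 0.2, one frame leaves the model barely persuaded (about 0.55), but a hundred frames push it past 0.999 -- even though a hundred frames of the same Texas accent carry far less than a hundred independent frames' worth of evidence.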