20 September 2007

Mark-up Always the Wrong Tree?

Almost a year ago I responded to a very interesting article in CL. The substance of the article is that we have to be careful when we annotate data lest we draw incorrect conclusions. In this post I'm going to take a more extreme position. It's not necessarily one I agree with 100%, but I think it's worth more than just a brief consideration.

Proposition: mark-up is always a bad idea.

That is: we should never be marking up data in ways that it's not "naturally" marked up. For instance, part-of-speech tagged data does not exist naturally. Parallel French-English data does. The crux of the argument is that if something is not a task that anyone performs naturally, then it's not a task worth computationalizing.

Here's why I think this is a reasonable position to take. In some sense, we're striving for machines that can do things that humans do. We have little to no external evidence that when humans (for instance) perform translation, they also perform part-of-speech tagging along the way. Moreover, as the CL article mentioned above nicely points out, it's very easy to confuse ourselves by using incorrect representations, or by being lazy about annotating. We may be happy to speculate that humans build up some sort of syntactic representation of sentences inside their heads (and, yes, there is some psychological evidence for something that might correlate with this). But the fact is, simply, that all we can observe are the inputs and outputs of some process (e.g., translation), and we should base all of our models on these observables.

Despite the fact that agreeing with this proposition makes much of my own work uninteresting (at least from the perspective of doing things with language), I find very few holes in the argument.

I think the first hole is just a "we're not there yet" issue. That is: in the ideal world, sure, I agree, but I don't think we yet have the technology to accomplish this.

The second hole, which is somewhat related, is that even if we had the technology, working on small problems based on perhaps-ill-conceived data will give us insight into important issues. For instance, many summarization people believe that coreference issues are a big problem. Sure, I can imagine an end-to-end summarization system that essentially treats coreference as a "latent variable" and never actually looks at hand-annotated coref data. On the other hand, I have no idea what this latent variable should look like, how it should be influenced, etc. The very process of working on these small problems (like "solving" coref on small annotated data sets) gives us an opportunity to better understand what goes into these problems.

The hole with the second hole :) is the following. If this is the only compelling reason to look at these sub-problems, then we should essentially stop working on them once we have a reasonable grasp. Not to be too hard on POS tagging, but I think we've pretty much established that we can do this task and we know more or less the major ins and outs. So we should stop doing it. (Similar arguments can be made for other tasks; e.g., NE tagging in English.)

The final hole is that I believe that there exist tasks that humans don't do simply because they're too big. And these are tasks that computers can do. If we can force some humans to do these tasks, maybe it would be worthwhile. But, to be honest, I can't think of any such thing off the top of my head. Maybe I'm just too closed-minded.

19 comments:

Fernando Pereira said...

I agree with your main point, and in fact I blogged on it a while ago ("Earning My Turns" entry of Feb 7, 2007). However, I disagree with your last paragraph. Search is the obvious example. People can't do it at Web scale at all, and not even that well at library scale. Yet, we can see search quality variations in the search engines we use.

Benoit said...

You are completely right. However, finding the right model that can be learned in an unsupervised way is a monumental task. IMO, any system that pretends to be able to process language has to integrate the parsing and the POS tagging and the morphological analysis and the anaphora resolution and the discourse analysis and the semantics, the logic, the epistemology, and everything else. ALL THESE THINGS AT THE SAME TIME, in the simplest, most predictive, all-encompassing model you can find. These things all depend on each other. You can't perform any of these tasks at human-level performance without processing information about the other tasks. A complete model needs to balance a bunch of hidden variables, learned all at the same time, so that they converge towards a maximum likelihood. This model can then be queried by integrating out whatever information you don't need.
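Schematically, Benoit's program might be written like this (my notation, a sketch only): train one model with all the analyses bundled into a latent variable, then integrate out whatever you don't need at query time.

```latex
% z ranges over all latent analyses (POS tags, parse, coref, discourse, ...)
\hat{\theta} = \arg\max_{\theta} \; \sum_{i} \log \sum_{z} p(x_i, z \mid \theta)
% at query time, integrate out the structure you don't need:
p(y \mid x) = \sum_{z} p(y \mid x, z) \, p(z \mid x, \hat{\theta})
```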

Would you be ready to risk 10 years of your life attempting to create such a model? Your research would not produce much publishable material before you reached a solution, and you might never reach one, but it would give you a chance at revolutionising your field. You might end up finding something useful: a theory of everything of language. A great undertaking, for sure, but how would your career be affected? If you are like most academics I have seen, your job depends on the number of publications you make. In no way does anybody take into account the fact that your research, although not productive in terms of publications, works towards a harder, much more respectable goal than all the others. They'd rather you produce small, useless, incremental results that inevitably lead to domain-specific performance maxima and that will never generalise to anything useful. They'd rather you play the publication game than work towards actually advancing science -- a path that might or might not give you any results in years, but that would be actually useful if it works.

It's funny, because language is structured according to Zipf's law, and thus any monkey with a computer can easily make systems that deal only with the 10% most common word phenomena and achieve 60% or more coverage and evaluated performance. The best systems, which achieve the 70%-80% mark, are actually only dealing with 20%-30% of the unique language features. No one tries to deal with the long tail. It's the big fraud of CL (and many other sciences), if you ask me. And people test their systems against null hypotheses! HA! As if being better by a margin of nothing is actually an achievement.
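Benoit's exact percentages are his own, but the head-heavy shape he's pointing at is easy to see in a toy simulation (a sketch, assuming an idealized Zipf distribution with exponent 1 over a made-up 10,000-type vocabulary):

```python
# Idealized Zipf's law: the r-th most frequent word type has
# probability proportional to 1/r. The vocabulary size is invented.
V = 10_000
weights = [1.0 / r for r in range(1, V + 1)]
total = sum(weights)

# Fraction of running text covered by the most frequent 10% of types.
coverage = sum(weights[: V // 10]) / total
print(f"top 10% of types cover {coverage:.0%} of all tokens")
```

Under this idealization a system handling only the head of the distribution already sees the large majority of tokens, which is exactly why raw coverage numbers flatter systems that ignore the tail.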

But everybody goes on with their lives and pretends that what they are doing is useful. It's like some kind of weird incestuous social phenomenon where everybody meets up at conferences and encourages each other in their incompetence, then goes back home and closes the vicious circle by only ever hiring people who play the same publication game.

hal said...

apparently, even in blog posts i've been scooped (and on my birthday, no less!). i sort-of agree about search.

but likening web search to library search... when i first went to the library, i used index cards. i sucked at it. a librarian helped me and i was more effective. a few years passed and those index cards were replaced by "virtual" index cards on a green-screen terminal. but the same phenomenon existed. the librarians were good searchers without the index cards or machines, but when using them as a *tool* they were much better.

fast forward to today... my mom can't find some information she needs online, so she tells me about it and, if i'm not busy, i try to find it using google. oftentimes i'm successful.

i liken myself to the librarian. the technology has changed (from index cards to inverted indices) but it's still a task that humans perform. and i don't think you can make a solid argument that "inverted indices" are too much technology and that librarians --- even before index cards --- had none: they at least had dewey decimal, which is akin to web hierarchies like DMOZ and Yahoo!

moreover, at least in domains smaller than the web, i think humans are still much better -- especially in scholarly search... i'd much rather get advice from a friend than use google scholar or citeseer or rexa or what have you.

----------------------------------

benoit: i think that the key issue is "hole 2" -- that what we learn in the process of these small incremental steps may (we hope) lead to a better understanding of how to do the monolithic thing.

i wonder if this actually implies the opposite of the zaenen article: that we really shouldn't worry too much about annotation! maybe if we just do something useful enough that we can learn a little bit, that's enough! since we're going to throw away the actual systems and do it all internally later, maybe it doesn't really matter.

Bob Carpenter said...

Why not be pragmatic about data annotation? If I can annotate data that'll help with a task, I'll do it.

Arguing about what's "natural" is a game for philosophers and theologians.

Since we're doing philosophy, let's have a thought experiment. Is rating products on a 1-5 scale on Amazon or Parker's rating wines on a 50-100 scale "natural"? Is the breakdown of concepts and entities into Wikipedia pages or products into Pricegrabber pages a "natural" markup? Is the New York Times's breakdown of the news into sections or IMDB's of movies into genres "natural"? Is the abstracting of a technical paper or the teaser before the local news "natural"?

We do quantization, individuation, categorization, and summarization in these cases for practical reasons.

Bob Carpenter said...

As to Benoit's second point, I would like to offer a counterexample to the claim that no one's working on the long tail problem. Addressing what Breck Baldwin likes to call "recall denial" is the main focus of our NLM-funded research. We're focusing on high-recall approaches through confidence-based extraction and search-like interfaces (vs. first best).

I wrote up my thoughts on the topic in blog entries titled Is 90% Entity Detection Good Enough? and High Precision and High Recall Entity Extraction.

Anonymous said...

I believe humans care little about syntax when they translate. They care about patterns and semantics. The reason why MT systems suck (and will continue to suck) is that the correct translation between unrelated languages (say, Russian and English) is often a paraphrase where only the semantics are preserved. Sure, some words may be translated one-to-one (most notably adjectives and nouns), but there's currently a huge gap between MT and human translation -- one that humans effortlessly bridge.

I also believe that language is only an I/O subsystem of the brain, and that the brain doesn't really use it for actual cognition most of the time, except when thinking about highly abstract, unintuitive contexts where formal logic is involved. Most of the time we think in patterns, (slowly) filling in logic where patterns don't really work.

From my experience, this applies to translation and to speaking a foreign language at an advanced level. People just learn mappings between patterns (i.e., map them to the same invariant representation) and do so until these mappings feel natural. Before this happens, they have to strain their brain to actually think about the words and sentences they're writing or saying. Afterwards, they don't give it a second thought, since there's no explicit logical step involved. They recognize a pattern in one language, get its semantics, and infer the same pattern in the other language according to the probabilistic mapping in their brain. When speaking, they do this inference directly from the semantics of what they want to say.

Anyway, I'm not really an expert. I'm just interested in the topic and I'm bilingual.

Brian Roark said...

Nobody knows for sure, but I'm going to go out on a limb and suggest that translators (the human kind) actually interpret (whatever that means) the string they are translating at some point in the process of translation. It may be possible that some form of syntactic generalization facilitates this mysterious interpretation -- dumb things like X was likely the killer because X came before the verb. (I just finished a Miss Marple mystery, so 'delicious death' is on my mind.) I will take a strong stance and say that the symbols VBD almost certainly play no part in the actual syntactic generalizations being made in such a process. Still, if I have a linguist-annotated corpus containing such a POS tag, it might facilitate some crude approximation to the kind of syntactic generalization that would assist me (as a human) in translating 'Delmira Montague coldcocked the lavender lady' even if half of the words are OOV. (Not enough coldcocking in Christie.) I'm with Bob on this one -- I don't really care whether my annotations match the 'real' generalizations that are playing a role, as long as they provide a certain amount of useful generalization for whatever application is of interest. Call me old school if you must, but I suspect there are some linguistic generalizations that can be exploited, somewhere somehow.
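A toy sketch of the crude approximation Brian describes (the lexicon and the suffix heuristic here are invented for illustration, not taken from any real tagger): an OOV word like 'coldcocked' can still be slotted into the "X came before the verb" generalization by backing off to its shape.

```python
# Invented toy lexicon and suffix heuristic, purely for illustration.
LEXICON = {"the": "DT", "lady": "NN", "montague": "NNP", "was": "VBD"}

def crude_tag(word):
    """Tag a word, backing off to a suffix guess when it is OOV."""
    w = word.lower()
    if w in LEXICON:
        return LEXICON[w]
    if w.endswith("ed"):      # unknown past-tense verb, e.g. 'coldcocked'
        return "VBD"
    return "NN"               # default guess: noun

sentence = "Delmira coldcocked the lavender lady".split()
print([crude_tag(w) for w in sentence])
# → ['NN', 'VBD', 'DT', 'NN', 'NN']
```

The point isn't that VBD is "real" -- it's that even this crude symbol lets a model notice that the OOV word before "the lavender lady" is behaving like a verb.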

BTW, what about word transcriptions for speech? Is that also an out-of-bounds annotation? Who's to say what phones were uttered or whether what we call words actually correspond to something in the speaker's mind?

Anonymous said...

>> X was likely the killer because X came before the verb

Heh, that'd break down pretty quickly in languages with free word order. I.e. "она его убила", "она убила его", "его убила она" and "убила его она" all mean "she killed him" in Russian. All four permutations are possible in speech, two of the four are likely in written prose. Yet Russians have no problem matching this to the corresponding semantic pattern(s). As a matter of fact, even if you permute words in English phrase, most of the time you can still make out the meaning. To go even further, you can also permute letters in words, except the first and last one, and you'll still understand what's being said, with some effort.

Mark Johnson said...

I guess I don't see a big difference between corpus annotation and other theoretical work we do. Done badly, it can lead us down the wrong path, but so can e.g., a misleading statistical model. I do think our community fails to recognize that corpora (even "natural" ones) incorporate theoretical assumptions, and if these assumptions turn out to be wrong (or, more likely, incomplete) then we may not learn much by trying to "model" the corpus.

Corpus annotation, developing new statistical models, or anything else we do is useful to the extent it helps us achieve our goals (which in our field could be either technological or scientific).

Chris said...

I'm not entirely convinced that humans do not do some POS tagging "along the way" while translating naturally. This is a tricky area of psycholinguistics, and in the very least, the jury is still out (and unfortunately, the philosophers and theologians of Bob's comment are of little help here; I'll take the psycholinguists).

Idan said...

IMHO, our task is to help humans get information from natural language. No one ever said that we need to mimic humans' internal/implicit language processing in order to achieve that. Our input is natural language, our output is natural language (sometimes), but that's it. If we think (prove/show/justify) that POS tagging or parsing or any other internal representation helps us better accomplish the task, then we should use these tools, until a better approach is discovered.
