20 September 2007

Mark-up Always the Wrong Tree?

Almost a year ago I responded to a very interesting article in CL. The substance of the article is that we have to be careful when we annotate data lest we draw incorrect conclusions. In this post I'm going to take a more extreme position. It's not necessarily one I agree with 100%, but I think it's worth more than just a brief consideration.

Proposition: mark-up is always a bad idea.

That is: we should never be marking up data in ways that it's not "naturally" marked up. For instance, part-of-speech tagged data does not exist naturally. Parallel French-English data does. The crux of the argument is that if something is not a task that anyone performs naturally, then it's not a task worth computationalizing.

Here's why I think this is a reasonable position to take. In some sense, we're striving for machines that can do things that humans do. We have little to no external evidence that when humans (for instance) perform translation, they also perform part-of-speech tagging along the way. Moreover, as the CL article mentioned above nicely points out, it's very easy to confuse ourselves by using incorrect representations, or by being lazy about annotating. We may be happy to speculate that humans build up some sort of syntactic representation of sentences inside their heads (and, yes, there is some psychological evidence for something that might correlate with this). But the fact is, simply, that all we can observe are the inputs and outputs of some processes (e.g., translation), and that we should base all of our models on these observables.

Despite the fact that agreeing with this proposition makes much of my own work uninteresting (at least from the perspective of doing things with language), I find very few holes in the argument.

I think the first hole is just a "we're not there yet" issue. That is: in the ideal world, sure, I agree, but I don't think we yet have the technology to accomplish this.

The second hole, which is somewhat related, is that even if we had the technology, working on small problems based on perhaps-ill-conceived data will give us insight into important issues. For instance, many summarization people believe that coreference issues are a big problem. Sure, I can imagine an end-to-end summarization system that essentially treats coreference as a "latent variable" and never actually looks at hand-annotated coref data. On the other hand, I have no idea what this latent variable should look like, how it should be influenced, etc. The very process of working on these small problems (like "solving" coref on small annotated data sets) gives us an opportunity to better understand what goes into these problems.
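To make this concrete, here is a minimal, made-up sketch (in Python; none of the names come from any real system) of what "coreference as a latent variable" could look like: the only supervision is a stand-in end-task score, and the coref clustering is just a hidden choice we search over, never a separately annotated target.

    # hypothetical sketch: coreference as a latent variable under end-task supervision
    import itertools

    mentions = ["Mrs. Smith", "the victim", "she"]

    def candidate_clusterings(mentions):
        """Enumerate toy coref clusterings: each mention either joins the
        previous entity or starts a new one."""
        for choices in itertools.product([0, 1], repeat=len(mentions) - 1):
            clusters, current = [], [mentions[0]]
            for mention, join in zip(mentions[1:], choices):
                if join:
                    current.append(mention)
                else:
                    clusters.append(current)
                    current = [mention]
            clusters.append(current)
            yield clusters

    def end_task_score(clusters):
        """Stand-in for the real signal (e.g. summary quality): here, just
        prefer clusterings with fewer distinct entities."""
        return 1.0 / len(clusters)

    # the latent clustering is chosen only by how well it serves the end task
    best = max(candidate_clusterings(mentions), key=end_task_score)
    print(best)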

The hole with the second hole :) is the following. If this is the only compelling reason to look at these sub-problems, then we should essentially stop working on them once we have a reasonable grasp. Not to be too hard on POS tagging, but I think we've pretty much established that we can do this task and we know more or less the major ins and outs. So we should stop doing it. (Similar arguments can be made for other tasks; e.g., NE tagging in English.)

The final hole is that I believe that there exist tasks that humans don't do simply because they're too big. And these are tasks that computers can do. If we can force some humans to do these tasks, maybe it would be worthwhile. But, to be honest, I can't think of any such thing off the top of my head. Maybe I'm just too closed-minded.

12 comments:

Fernando Pereira said...

I agree with your main point, and in fact I blogged on it a while ago ("Earning My Turns" entry of Feb 7, 2007). However, I disagree with your last paragraph. Search is the obvious example. People can't do it at Web scale at all, and not even that well at library scale. Yet, we can see search quality variations in the search engines we use.

Benoit Essiambre said...

You are completely right. However, finding the right model that can be learned in an unsupervised way is a monumental task. IMO any system that pretends to be able to process language has to integrate the parsing and the POS tagging and the morphological analysis and the anaphora resolution and the discourse analysis and the semantics, the logic, the epistemology and everything else. ALL THESE THINGS AT THE SAME TIME, in the simplest, most predictive, all-encompassing model you can find. These things all depend on each other. You can't perform any of these tasks at human-level performance without information about the other tasks. A complete model needs to balance a bunch of hidden variables together, learned all at the same time, so that they converge towards a maximum likelihood. This model can then be queried by integrating out whatever information you don't need.
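(A toy illustration, with invented numbers and no connection to any real system, of what "queried by integrating out whatever information you don't need" means in practice: one joint distribution over a hidden tag and an observed word, where every query is answered by summing out the variables you don't care about.)

    # toy joint model: p(tag) and p(word | tag), all numbers invented
    p_tag = {"NOUN": 0.6, "VERB": 0.4}
    p_word_given_tag = {
        "NOUN": {"dog": 0.7, "runs": 0.3},
        "VERB": {"dog": 0.1, "runs": 0.9},
    }

    def p_word(word):
        # query p(word) by integrating (summing) the hidden tag out of the joint
        return sum(p_tag[t] * p_word_given_tag[t][word] for t in p_tag)

    def p_tag_given_word(tag, word):
        # or condition on the observation instead: Bayes' rule over the same joint
        return p_tag[tag] * p_word_given_tag[tag][word] / p_word(word)

    print(p_word("dog"))                    # 0.46
    print(p_tag_given_word("NOUN", "dog"))  # about 0.91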

Would you be ready to risk 10 years of your life attempting to create such a model? Your research would not produce much publishable material before you reached a solution, and you might never reach one, but it would give you a chance at revolutionising your field. You might end up finding something useful: a theory of everything of language. A great undertaking, for sure, but how would your career be affected? If you are like most academics I have seen, your job depends on the amount of publications you make. In no way does anybody take into account the fact that your research, although not productive in terms of publications, works towards a harder, much more worthy goal than all the others. They'd rather you produce small, useless, incremental results that inevitably lead to domain-specific performance maxima and that will never generalise to anything useful. They'd rather you play the publication game than work towards actually advancing science, a path that might or might not give you any results in years, but that would be actually useful if it works.

It's funny, because language is structured according to Zipf's law, and thus any monkey with a computer can easily make systems that deal only with the 10% most common word phenomena and achieve 60% or more coverage and evaluated performance. The best systems, which achieve the 70%-80% mark, are actually only dealing with 20%-30% of the unique language features. No one tries to deal with the long tail. It's the big fraud of CL (and many other sciences) if you ask me. And people test their systems against null hypotheses! HA! As if being better by a margin of nothing is actually an achievement.
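(A back-of-the-envelope check of the Zipf point, with invented numbers rather than real corpus counts: if the r-th most frequent word type has frequency proportional to 1/r, a small slice of the vocabulary covers most of the tokens.)

    # Zipf coverage sketch: frequency of the rank-r type proportional to 1/r
    V = 100_000                                 # assumed vocabulary size
    freqs = [1.0 / r for r in range(1, V + 1)]
    total = sum(freqs)

    def coverage(top_fraction):
        """Fraction of tokens covered by the top `top_fraction` of word types."""
        k = int(top_fraction * V)
        return sum(freqs[:k]) / total

    for frac in (0.01, 0.10, 0.30):
        print(f"top {frac:.0%} of types covers about {coverage(frac):.0%} of tokens")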

But everybody goes on with their lives and pretends that what they are doing is useful. It's like some kind of weird incestuous social phenomenon where everybody meets up at conferences and encourages each other in their incompetence, then goes back home and closes the vicious circle by only ever hiring people who play the same publication game.

hal said...

apparently, even in blog posts i've been scooped (and on my birthday, no less!). i sort-of agree about search.

but likening web search to library search... when i first went to the library, i used index cards. i sucked at it. a librarian helped me and i was more effective. a few years passed and those index cards were replaced by "virtual" index cards on a green-screen terminal. but the same phenomenon existed. the librarians were good searchers without the index cards or machines, but when using them as a *tool* they were much better.

fast forward to today... my mom can't find some information she needs online, so she tells me about it and if i'm not busy, i try to find it using google. oftentimes i'm successful.

i liken myself to the librarian. the technology has changed (from index cards to inverted indices) but it's still a task that humans perform. and i don't think you can make a solid argument that "inverted indices" are too much technology and that librarians --- even before index cards --- had none: they at least had dewey decimal, which is akin to web hierarchies like DMOZ and Yahoo!

moreover, at least in domains smaller than the web, i think humans are still much better -- especially in scholarly search... i'd much rather get advice from a friend than use google scholar or citeseer or rexa or what have you.

----------------------------------

benoit: i think that the key issue is "hole 2" -- that what we learn in the process of these small incremental steps may (we hope) lead to a better understanding of how to do the monolithic thing.

i wonder if this actually implies the opposite of the zaenen article: that we really shouldn't worry too much about annotation! maybe if we just do something useful enough that we can learn a little bit, that's enough! since we're going to throw away the actual systems and do it all internally later, maybe it doesn't really matter.

Anonymous said...

Why not be pragmatic about data annotation? If I can annotate data that'll help with a task, I'll do it.

Arguing about what's "natural" is a game for philosophers and theologians.

Since we're doing philosophy, let's have a thought experiment. Is rating products on a 1-5 scale on Amazon or Parker's rating wines on a 50-100 scale "natural"? Is the breakdown of concepts and entities into Wikipedia pages or products into Pricegrabber pages a "natural" markup? Is the New York Times's breakdown of the news into sections or IMDB's of movies into genres "natural"? Is the abstracting of a technical paper or the teaser before the local news "natural"?

We do quantization, individuation, categorization, and summarization in these cases for practical reasons.

Anonymous said...

As to Benoit's second point, I would like to offer a counterexample to the claim that no one's working on the long tail problem. Addressing what Breck Baldwin likes to call "recall denial" is the main focus of our NLM-funded research. We're focusing on high-recall approaches through confidence-based extraction and search-like interfaces (vs. first best).

I wrote up my thoughts on the topic in blog entries titled Is 90% Entity Detection Good Enough? and High Precision and High Recall Entity Extraction.

Anonymous said...

I believe humans care little about syntax when they translate. They care about patterns and semantics. The reason MT systems suck (and will continue to suck) is that the correct translation between unrelated languages (say, Russian and English) is often a paraphrase where only the semantics are preserved. Sure, some words may be translated one to one (most notably adjectives and nouns), but there's currently a huge gap between MT and human translation - one that humans effortlessly bridge.

I also believe that language is only an I/O subsystem of the brain and that the brain doesn't really use it for the actual cognition most of the time, except when thinking about highly abstract, unintuitive contexts where formal logic is involved. Most of the time we think in patterns, (slowly) filling in logic where patterns don't really work. From my experience, this applies to translation and to speaking a foreign language at an advanced level. People just learn mappings between patterns (i.e. map them to the same invariant representation) and do so until these mappings feel natural. Before this happens they have to strain their brain to actually think about the words and sentences they're writing or saying. Afterwards, they don't give it a second thought, since there's no explicit logical step involved. They recognize a pattern in one language, get its semantics, and infer the same pattern in another language according to the probabilistic mapping in their brain. When speaking, they do this inference directly from the semantics of what they want to say.

Anyway, I'm not really an expert. I'm just interested in the topic and I'm bilingual.

Unknown said...

Nobody knows for sure, but I'm going to go out on a limb and suggest that translators (the human kind) actually interpret (whatever that means) the string they are translating at some point in the process of translation. It may be that some form of syntactic generalization facilitates this mysterious interpretation -- dumb things like X was likely the killer because X came before the verb. (I just finished a Miss Marple mystery, so 'delicious death' is on my mind.) I will take a strong stance and say that the symbols VBD almost certainly play no part in the actual syntactic generalizations being made in such a process. Still, if I have a linguist-annotated corpus containing such a POS tag, it might facilitate some crude approximation to the kind of syntactic generalization that would assist me (as a human) in translating 'Delmira Montague coldcocked the lavender lady' even if half of the words are OOV. (Not enough coldcocking in Christie.) I'm with Bob on this one -- I don't really care whether my annotations match the 'real' generalizations that are playing a role, as long as they provide a certain amount of useful generalization for whatever application is of interest. Call me old school if you must, but I suspect there are some linguistic generalizations that can be exploited, somewhere somehow.

BTW, what about word transcriptions for speech? Is that also an out-of-bounds annotation? Who's to say what phones were uttered or whether what we call words actually correspond to something in the speaker's mind?

Anonymous said...

>> X was likely the killer because X came before the verb

Heh, that'd break down pretty quickly in languages with free word order. I.e. "она его убила", "она убила его", "его убила она" and "убила его она" all mean "she killed him" in Russian. All four permutations are possible in speech; two of the four are likely in written prose. Yet Russians have no problem matching these to the corresponding semantic pattern(s). As a matter of fact, even if you permute the words in an English phrase, most of the time you can still make out the meaning. To go even further, you can also permute the letters in words, except the first and last one, and you'll still understand what's being said, with some effort.

Anonymous said...

I guess I don't see a big difference between corpus annotation and other theoretical work we do. Done badly, it can lead us down the wrong path, but so can e.g., a misleading statistical model. I do think our community fails to recognize that corpora (even "natural" ones) incorporate theoretical assumptions, and if these assumptions turn out to be wrong (or, more likely, incomplete) then we may not learn much by trying to "model" the corpus.

Corpus annotation, developing new statistical models, or anything else we do is useful to the extent it helps us achieve our goals (which in our field could be either technological or scientific).

Chris said...

I'm not entirely convinced that humans do not do some POS tagging "along the way" while translating naturally. This is a tricky area of psycholinguistics, and at the very least, the jury is still out (and unfortunately, the philosophers and theologians of Bob's comment are of little help here; I'll take the psycholinguists).

Idan said...

IMHO, our task is to help humans get information from natural languages. No one ever said that we need to follow humans' internal/implicit language processing in order to achieve that. Our input is natural languages, our output is natural languages (sometimes), but that's it. If we think (prove/show/justify) that POS tagging or parsing or any other internal representation helps us to better accomplish the task, then we should use these tools, until a better approach is discovered.
