08 May 2006

Making Humans Agree

The question of human agreement came up recently. As background, the common state of affairs in NLP when someone has decided to work on a new problem is to ask two (or more) humans to solve the problem and then find some way to measure the inter-annotator agreement. The idea is that if humans can't agree on a task, then it is not worth pursuing a computational solution. Typically this is an iterative process. First some guidelines are written, then some annotation is done, then more guidelines are written, etc. The hope is that this process converges after a small number of iterations. LDC has made a science out of this. At the very least, inter-annotator agreement serves as a useful upper bound on system performance (in theory a system could do better; in practice, it does not).
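To make "some way to measure the inter-annotator agreement" concrete, here is a minimal sketch of one common choice for two annotators, Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The choice of kappa is my assumption for illustration (nothing above commits to a particular measure), and the function name and toy part-of-speech labels are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    Illustrative sketch only: assumes categorical labels and exactly
    two annotators. The function name is hypothetical.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both annotators pick the same
    # label if each draws independently from their own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout;
        # kappa is undefined here, so report perfect agreement.
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Toy data (made up): two annotators tagging eight tokens as noun or verb.
annotator_1 = ["N", "N", "V", "V", "N", "V", "N", "N"]
annotator_2 = ["N", "V", "V", "V", "N", "V", "N", "V"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

On this toy data the annotators agree on 6 of 8 items (0.75), but chance alone would give about 0.47, so kappa is much lower than the raw agreement suggests. With more than two annotators, a multi-rater measure would be needed instead.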

There are two stances one can take on the annotator agreement problem, which are diametrically opposed. I will attempt to argue for both.

The first stance is: If humans do not agree, the problem is not worth working on. (I originally wrote "solving" instead of "working on," but I think the weakened form is more appropriate.) The claim here is that if a problem is so ill-specified that two humans perform it very differently, then when a system does it, we will have no way of knowing if it is doing a good job. If it mimics one of the humans, that's probably okay. But if it does something else, we have no idea if human N+1 might also do this, so we don't know if it's good or bad. Moreover, the task will be very difficult to learn, since the target concept is annotator-specific. We want to learn general rules, not rules specific to annotators.

The second stance is: If humans do agree, the problem is not worth working on. This is the more controversial statement: I have never heard anyone else make it. The argument here is that any problem that humans agree on is either uninteresting or has gone through so many iterations of definitions and redefinitions as to render it uninteresting. In this case, the annotation guidelines are essentially an algorithm that the annotator is executing. Making a machine execute a similar algorithm is not interesting because what it is mimicking is not a natural process, but something created by the designers. Even without multiple iterations of guidelines, high human agreement implies that the task being solved is too artificial: personalization is simply too important an issue when it comes to language problems.

My beliefs are somewhere in the middle. If humans cannot agree at all, perhaps the problem is too hard. But if they agree too much, or if the guidelines are overly strict, then the problem may have been sapped of its reality. One should also keep in mind that personalization issues can render agreement low; we see this all the time in summarization. I do not think this necessarily means that the task should not be worked on: it just means that, if it is worked on, it should be treated as a personalized task. Measuring agreement in such cases is something I know nothing about.


Anonymous said...

Hi, Hal and Others,

How about the following middle way: you take many humans and give them loose guidelines. Then: (1) you can do statistics on people's answers and see whether there is something they do agree upon reliably, and develop software to get that part right; (2) you don't over-train people, so you get at their relatively intuitive judgements.
It's costly (if you have to pay people), and the tools for analysing partial agreement need development, but I think this strategy might be getting at the middle ground you mentioned.
