08 May 2006

Making Humans Agree

The question of human agreement came up recently. As background, the common state of affairs in NLP when someone decides to work on a new problem is to ask two (or more) humans to solve the problem and then find some way to measure the inter-annotator agreement. The idea is that if humans can't agree on a task, it is not worth pursuing a computational solution. Typically this is an iterative process: first some guidelines are written, then some annotation is done, then more guidelines are written, and so on. The hope is that this process converges after a small number of iterations. LDC has made a science out of this. At the very least, inter-annotator agreement serves as a useful upper bound on system performance (in theory a system could do better; in practice, it does not).
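
As a concrete illustration of what "measure the inter-annotator agreement" usually amounts to, here is a minimal sketch of Cohen's kappa for two annotators assigning categorical labels to the same items; the function name and toy data are invented for illustration, not taken from any particular annotation effort.

    # Chance-corrected agreement between two annotators labeling the same items.
    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        # Observed agreement: fraction of items given the same label by both.
        p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Expected chance agreement, from each annotator's label distribution.
        counts_a, counts_b = Counter(labels_a), Counter(labels_b)
        p_exp = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
        return (p_obs - p_exp) / (1 - p_exp)

    # Toy example: two annotators tagging ten items as "pos" or "neg".
    a = ["pos", "pos", "neg", "pos", "neg", "neg", "pos", "pos", "neg", "pos"]
    b = ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "neg", "neg"]
    print(cohens_kappa(a, b))  # 1.0 = perfect agreement, 0.0 = chance level

A value near 1 plays the role of the "humans agree" upper bound described above; a value near 0 means the annotators are doing little better than chance.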

There are two diametrically opposed stances one can take on the annotator agreement problem. I will attempt to argue for both.

The first stance is: If humans do not agree, the problem is not worth working on. (I originally wrote "solving" instead of "working on," but I think the weakened form is more appropriate.) The claim here is that if a problem is so ill-specified that two humans perform it very differently, then when a system does it, we will have no way of knowing whether it is doing a good job. If it mimics one of the humans, that's probably okay. But if it does something else, we have no idea whether human N+1 might also do this, so we don't know if it's good or bad. Moreover, the task will be very difficult to learn, since the target concept is annotator-specific. We want to learn general rules, not rules specific to particular annotators.

The second stance is: If humans do agree, the problem is not worth working on. This is the more controversial statement: I have never heard anyone else make it. The argument here is that any problem humans agree on is either uninteresting or has gone through so many iterations of definition and redefinition as to render it uninteresting. In this case, the annotation guidelines are essentially an algorithm that the annotator is executing. Making a machine execute a similar algorithm is not interesting, because what it is mimicking is not a natural process but something created by the designers. Even without multiple iterations of guidelines, high human agreement implies that the task being solved is too artificial: personalization is simply too important an issue when it comes to language problems.

My beliefs are somewhere in the middle. If humans cannot agree at all, perhaps the problem is too hard. But if they agree too much, or if the guidelines are overly strict, then the problem may have been sapped of its reality. One should also keep in mind that personalization issues can drive agreement down; we see this all the time in summarization. I do not think this necessarily means the task should not be worked on: it just means that, if worked on, it should be treated as a personalized task. Measuring agreement in such cases is something I know nothing about.

2 comments:

Ross Gayler said...

I agree with your conclusion: "If humans cannot agree at all, perhaps the problem is too hard. But if they agree too much, or if the guidelines are overly strict, then the problem may have been sapped of its reality." The fact that I agree with it probably means that this comment is uninteresting.

I think the issue you have identified arises because NLP (generally) treats utterances as not being about anything.

I am interested in AI rather than NLP and I have accepted Rich Sutton's arguments for experience-oriented AI. He holds that everything revolves around the stream of actions and observations between an agent and its environment. What an agent knows about its environment is expressed as a set of predicted observations contingent on the actions it might take.

In this framework it is perfectly reasonable to consider an agent that has unique knowledge of its environment. The quality of that knowledge is assessed via the accuracy of its predictions. Even though different agents may have different knowledge, they may have a shared understanding that is mediated by the environment.

The example you give of no inter-annotator agreement seems to me to be the analog of the case where multiple agents use unique knowledge to interact in a shared world. The annotations don't mean anything in the world even though they may effectively mediate the agents' interactions with the world. If the annotations were directly about the world, you would presumably have some inter-annotator agreement (although at the cost of losing information about the uniqueness of the agent).

Beata Beigman Klebanov said...

Hi, Hal and others,

How about the following middle way: you take many humans and give them loose guidelines. Then (1) you can do statistics on people's answers and see whether there is something they agree upon reliably, and develop software to get that part right; and (2) you don't over-train people, so you get at their relatively intuitive judgements.
It's costly (if you have to pay people), and the tools for analysing partial agreement need development, but I think this strategy might get at the middle ground you mentioned.
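
One standard statistic along these lines is Fleiss' kappa, which extends chance-corrected agreement to many annotators. Below is a minimal sketch, assuming every item is labeled by the same number of annotators; the count matrix in the example is invented for illustration.

    # ratings[i][j] = number of annotators who put item i into category j.
    def fleiss_kappa(ratings):
        n_items = len(ratings)
        n_raters = sum(ratings[0])
        # Per-item agreement: how often pairs of annotators agree on that item.
        per_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                    for row in ratings]
        p_bar = sum(per_item) / n_items
        # Expected agreement from the overall category proportions.
        p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters)
                 for j in range(len(ratings[0]))]
        p_exp = sum(p * p for p in p_cat)
        return (p_bar - p_exp) / (1 - p_exp)

    # Toy example: four items, five annotators each, three categories.
    print(fleiss_kappa([[5, 0, 0], [2, 2, 1], [0, 4, 1], [3, 1, 1]]))

A statistic like this only counts exact categorical matches, though; analysing the partial agreement the comment asks about would need something more graded (e.g. a weighted variant of kappa).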
