I'm in the process of annotating some data (along with some students---thanks guys!). While this data isn't really in the context of an NLP application, the annotation process made me think of the following issue. I can annotate a lot more data if I'm less careful. Okay, so this is obvious. But it's something I hadn't really specifically thought of before.
So here's the issue. I have some fixed amount of time in which to annotate data (or some fixed amount of dollars). In this time, I can annotate N data points with a noise rate of eta_N. Presumably eta_N approaches one half (for a binary task) as N increases. In other words, as N increases, (1 - 2 eta_N) approaches zero. A standard result in PAC learning states that a lower bound on the number of examples required to achieve 1-epsilon accuracy with probability 1-delta, under a noise rate of eta_N and with VC-dimension h, is (h + log(1/delta)) / (epsilon (1 - 2 eta_N)^2).
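To get a feel for how fast that bound blows up as the noise rate climbs, here's a minimal sketch that just plugs numbers in; the values of h, epsilon, and delta are arbitrary illustrative choices, not from any real setup.

    # Lower bound on examples needed: (h + log(1/delta)) / (epsilon * (1 - 2*eta)^2).
    import math

    def pac_lower_bound(h, epsilon, delta, eta):
        """Evaluate the noisy-PAC sample-size lower bound quoted above."""
        return (h + math.log(1.0 / delta)) / (epsilon * (1.0 - 2.0 * eta) ** 2)

    for eta in [0.0, 0.05, 0.1, 0.2, 0.3, 0.4]:
        # h=100, epsilon=0.1, delta=0.05 are placeholder values
        print(eta, round(pac_lower_bound(100, 0.1, 0.05, eta)))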
This bound gives us some insight into the problem. It says that it is worth labeling more data (at higher noise) only if 1/(1 - 2 eta_N)^2 increases more slowly than N. So if we can label twice as much data and have the noise of that annotation increase by less than about 0.15, then we're doing well. (Well in the sense that we can keep the bound the same and shrink either epsilon or delta.)
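Where does 0.15 come from? Doubling N keeps the bound at least as good so long as (1 - 2 eta')^2 is no less than half of (1 - 2 eta)^2, i.e. eta' <= 1/2 - (1/2 - eta)/sqrt(2); starting from noise-free labels that works out to about 0.146. Here's that arithmetic as a quick sketch (the starting noise rates below are just examples):

    # Break-even noise rate when multiplying the amount of data by `factor`.
    # From the bound above: we need (1 - 2*eta_new)^2 >= (1 - 2*eta)^2 / factor.
    import math

    def break_even_noise(eta, factor=2.0):
        """Largest new noise rate such that `factor` times more data still
        keeps N * (1 - 2*eta)^2 from decreasing."""
        return 0.5 - (0.5 - eta) / math.sqrt(factor)

    print(break_even_noise(0.0))   # ~0.146: starting from noise-free labels
    print(break_even_noise(0.1))   # ~0.217: starting from 10% noise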
So how does this hold up in practice? Well, it's hard to tell exactly for real problems because running such experiments would be quite time-consuming. So here's a simulation. We have a binary classification problem with 100 features. The weight vector is random: the first 50 dimensions are Nor(0,0.2); the next 35 are Nor(0,5); the final 15 are Nor(m,1), where m is the weight of feature (id - 35), which introduces feature correlation. We vary the number of training examples and the noise rate. We always generate an equal number of positive and negative points. We train a logistic regression model with hyperparameters tuned on 1024 (noisy) dev points and evaluate on 1024 non-noisy test points. We do this ten times for each setting and average the results. Here's a picture of test error as a function of data set size and noise rate (a rough sketch of the simulation setup appears after the table below):
And here's the table of results (numbers are test error, so lower is better):
    N \ eta    0.00    0.01    0.02    0.05    0.10    0.20
         16   0.341   0.340   0.351   0.366   0.380   0.420
         32   0.283   0.276   0.295   0.307   0.327   0.363
         64   0.215   0.221   0.227   0.247   0.266   0.324
        128   0.141   0.148   0.164   0.194   0.223   0.272
        256   0.084   0.099   0.100   0.136   0.165   0.214
        512   0.038   0.061   0.065   0.087   0.113   0.164
       1024   0.023   0.034   0.044   0.059   0.079   0.123
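Since the code for the simulation isn't included here, the following is a rough sketch of how one might set it up. The interpretation of Nor(m,s) as mean/standard deviation, the standard-Gaussian inputs, the regularization grid, and the use of scikit-learn are all my assumptions, not the original setup.

    # Rough reconstruction of the simulation described above (not the original code).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def make_weights():
        w = np.empty(100)
        w[:50] = rng.normal(0.0, 0.2, size=50)     # 50 small, mostly uninformative weights
        w[50:85] = rng.normal(0.0, 5.0, size=35)   # 35 large, informative weights
        for i in range(85, 100):                   # last 15 correlated with earlier features
            w[i] = rng.normal(w[i - 35], 1.0)
        return w

    def sample(w, n, eta):
        """Draw n labeled points, then flip each label with probability eta."""
        X = rng.normal(size=(n, 100))
        y = (X @ w > 0).astype(int)                # balanced in expectation (exact balancing omitted)
        flips = rng.random(n) < eta
        y[flips] = 1 - y[flips]
        return X, y

    def run(n_train, eta, n_dev=1024, n_test=1024):
        w = make_weights()
        Xtr, ytr = sample(w, n_train, eta)
        Xdev, ydev = sample(w, n_dev, eta)         # noisy dev set for tuning
        Xte, yte = sample(w, n_test, 0.0)          # noise-free test set
        best = (1.0, 1.0)                          # (dev error, C)
        for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
            clf = LogisticRegression(C=C, max_iter=1000).fit(Xtr, ytr)
            best = min(best, (1.0 - clf.score(Xdev, ydev), C))
        clf = LogisticRegression(C=best[1], max_iter=1000).fit(Xtr, ytr)
        return 1.0 - clf.score(Xte, yte)           # test error on clean labels

    # e.g., average over ten runs for one cell of the table:
    print(np.mean([run(256, 0.05) for _ in range(10)]))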
The general trend here seems to be that if you don't have much data (N <= 256), then it's almost always better to get more data at a much higher noise rate (0.1 or 0.2 versus 0.0). Once you have a reasonable amount of data, it starts paying to be more noise-free. E.g., 256 examples with 0.05 noise is just about as good as 1024 examples with 0.2 noise. This roughly concurs with the theorem, at least in terms of the trends.
I think the take-home message that's perhaps worth keeping in mind is the following. If we only have a little time/money for annotation, we should probably annotate more data at a higher noise rate. Once we start getting more money, we should simultaneously be more careful and add more data, but not let one dominate the other.