Let's say we're learning a predictor f and it doesn't get 100% test accuracy. We can ask ourselves: why not. Typically, the reason I ask myself this question is because I want to get closer to 100% test accuracy and want to know how I should go about it. Should I add more labeled data (what should it look like) or should I add more features (what should they look like) or am I just hosed?
In order to try to get a handle on this, we can ask ourselves: where can error come from:
- Noise in the training data (which we have fit and is now screwing us up at test time)
- Noise in the test data (which is causing our otherwise correct predictions to look wrong)
- Insufficient representation (we just don't have the right/enough features)
- Insufficient examples (our training data wasn't sampled densely enough in some region)
I'm going to use (binary) text categorization as a working example because it's simple yet exhibits NLP-like properties (ever growing feature sets, sparse feature representations, etc.). So let's say we train a model (perhaps linear) on a bag of words feature set and apply it on test data. Say we get 90% accuracy on dev data. Now what can we do?
Let's take a single example that we misclassified. Perhaps take the "worst" (i.e., most misclassified in a margin sense), though I don't know that it matters. Which of the four reasons above tells us why this example was misclassified?
By looking at the example (yes, by hand!) we can probably tell if it's a problem due to issue (2: Noise in the test data). We might like to try to automate this ascertainment, but honestly I'm at a loss for how to do this. Let's say we decide that the problem isn't due to test set noise. Now what?
Let's consider the following approach. We are going to retrain our classifier several times (okay, this is expensive, but we can worry about this later). What we're going to do is add this misclassified dev point into the training data with a varying cost. The currently trained model we will call f(0), since it is based on giving this dev point a cost of zero. We can then train f(c) for a large range of costs c and let C be some reasonable upper bound (i.e., we guarantee that C is big enough that f(C) correctly classifies this point -- for any reasonable classifier, there should be such a finite C). Go ahead and replace "correctly classifies" with "classifies with a sufficiently large margin" if you'd prefer; I doubt it matters.
Now, we're going to look at two of these fs. We'll look at f(0) and f(c'), where c' is the minimal value of c such that this dev example becomes correctly classified. Let's say we now run these two fs on the training data. We know that f(0) will misclassify our "selected" test point and that f(c') will not. The big question is what do the fs do on the other points.
- Suppose that f(c') doesn't make any (many?) more mistakes than f(0). That is, they're basically the same, just now classifying our selected point correctly. This suggests that the problem is (3) or (4) above (features or examples).
- Suppose that f(c') makes many more mistakes than f(0). Now, we see that in order to get this selected test point correct, we have to pay by getting other things wrong (that we didn't before). This suggests that the problem is (1) or (3) above (noise or features). Importantly, it's not (4: examples).
Now, let's go back and do the same process of all of the dev points that were misclassified. What can happen?
- Almost all of the f(c')s make no more errors on other training points. Unless all of these erroneous dev points are markedly different from the training data (in which case we really have a domain adaptation problem), then this is almost certainly a feature issue (3).
- Almost all of the f(c')s make many more errors on the other training points, and the set of training points on which they make these errors is usually the same. Then this is almost certainly noisy training data (or noisy test data).
- Almost all of the f(c')s make many more errors on the other training points, but the set of training points on which they err is substantially different each time. Then this is almost certainly a feature issue (3).
- Mixed results: some f(c')s make more errors on training, some don't. This is harder to say, but my gut tells me that this is probably a (4: example) issue.
(If you're using a margin based classifier, you might not have to exactly retrain each time. Koby Crammer's passive aggressive algorithm will essentially give you a closed form solution for "closest (in L2) weight vector to the current weight vector such that a given point is correctly classified," which could save some computational effort.)
Note that I haven't actually tried this. I'd definitely be interested to, but wanted to think about it a bit before I shot off to implement something.