01 December 2008

Where did my (linear) model go wrong?

Back to research-related posts for a while :).

Let's say we're learning a predictor f and it doesn't get 100% test accuracy. We can ask ourselves: why not? Typically I ask this question because I want to get closer to 100% test accuracy and want to know how to go about it. Should I add more labeled data (and what should it look like), should I add more features (and what should they look like), or am I just hosed?

In order to get a handle on this, we can ask ourselves where error can come from:
  1. Noise in the training data (which we have fit and is now screwing us up at test time)
  2. Noise in the test data (which is causing our otherwise correct predictions to look wrong)
  3. Insufficient representation (we just don't have the right/enough features)
  4. Insufficient examples (our training data wasn't sampled densely enough in some region)
These are not mutually exclusive. For instance, maybe by reducing/changing the feature set (3), it would turn out that we are sampled densely enough in a region of interest (4).

I'm going to use (binary) text categorization as a working example because it's simple yet exhibits NLP-like properties (ever-growing feature sets, sparse feature representations, etc.). So let's say we train a model (perhaps linear) on a bag-of-words feature set and apply it to test data. Say we get 90% accuracy on dev data. Now what can we do?
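For concreteness, here's roughly what that setup looks like in code. This is a minimal sketch assuming scikit-learn and logistic regression as an (arbitrary) choice of linear model; the documents and labels are placeholders, not a real corpus.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Placeholder corpus: any binary-labeled collection of documents works here.
    train_texts = ["first training document ...", "second training document ..."]
    train_labels = [0, 1]
    dev_texts = ["a held-out dev document ..."]
    dev_labels = [1]

    vectorizer = CountVectorizer()                   # bag-of-words features
    X_train = vectorizer.fit_transform(train_texts)
    X_dev = vectorizer.transform(dev_texts)

    f0 = LogisticRegression().fit(X_train, train_labels)   # this will be f(0)
    print("dev accuracy:", f0.score(X_dev, dev_labels))    # e.g., the 90% above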

Let's take a single example that we misclassified. Perhaps take the "worst" (i.e., most misclassified in a margin sense), though I don't know that it matters. Which of the four reasons above tells us why this example was misclassified?

By looking at the example (yes, by hand!) we can probably tell if it's a problem due to issue (2: Noise in the test data). We might like to try to automate this ascertainment, but honestly I'm at a loss for how to do this. Let's say we decide that the problem isn't due to test set noise. Now what?

Let's consider the following approach. We are going to retrain our classifier several times (okay, this is expensive, but we can worry about this later). What we're going to do is add this misclassified dev point into the training data with a varying cost. The currently trained model we will call f(0), since it is based on giving this dev point a cost of zero. We can then train f(c) for a large range of costs c and let C be some reasonable upper bound (i.e., we guarantee that C is big enough that f(C) correctly classifies this point -- for any reasonable classifier, there should be such a finite C). Go ahead and replace "correctly classifies" with "classifies with a sufficiently large margin" if you'd prefer; I doubt it matters.
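Here's a rough sketch of how that sweep might be implemented, continuing the scikit-learn snippet above: the per-example cost c is realized as a sample weight on the selected dev point, and the cost grid and upper bound C = 1e4 are arbitrary choices, not anything principled.

    import numpy as np
    from scipy.sparse import vstack
    from sklearn.linear_model import LogisticRegression

    def train_fc(X_tr, y_tr, x_sel, y_sel, c):
        """Retrain with the selected dev point appended at cost c (c=0 recovers f(0))."""
        X_aug = vstack([X_tr, x_sel])
        y_aug = np.append(y_tr, y_sel)
        weights = np.append(np.ones(X_tr.shape[0]), c)
        return LogisticRegression().fit(X_aug, y_aug, sample_weight=weights)

    # Pick the "worst" misclassified dev point: largest margin on the wrong side.
    signs = np.where(np.array(dev_labels) == 1, 1.0, -1.0)
    margins = f0.decision_function(X_dev) * signs
    worst = int(np.argmin(margins))
    x_sel, y_sel = X_dev[worst], dev_labels[worst]

    # Sweep c upward and stop at the first cost c' that gets the point right.
    c_prime, f_cprime = None, None
    for c in np.logspace(-2, 4, 25):          # C = 1e4 as an arbitrary upper bound
        f_c = train_fc(X_train, train_labels, x_sel, y_sel, c)
        if f_c.predict(x_sel)[0] == y_sel:
            c_prime, f_cprime = c, f_c
            break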

Now, we're going to look at two of these fs. We'll look at f(0) and f(c'), where c' is the minimal value of c such that this dev example becomes correctly classified. Let's say we now run these two fs on the training data. We know that f(0) will misclassify our "selected" dev point and that f(c') will not. The big question is what the fs do on the other points.
  1. Suppose that f(c') doesn't make any (many?) more mistakes than f(0). That is, they're basically the same, just now classifying our selected point correctly. This suggests that the problem is (3) or (4) above (features or examples).
  2. Suppose that f(c') makes many more mistakes than f(0). Now, we see that in order to get this selected dev point correct, we have to pay by getting other things wrong (that we didn't get wrong before). This suggests that the problem is (1) or (3) above (noise or features). Importantly, it's not (4: examples).
At this point, we have separated causes (3 or 4) from (1 or 3), but haven't separated them completely.
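In code, the comparison is just a matter of running both models over the training set and seeing where they newly disagree with the labels (continuing the sketch above):

    # Which training points does each model get wrong?
    train_labels_arr = np.array(train_labels)
    err0 = set(np.flatnonzero(f0.predict(X_train) != train_labels_arr))
    errc = set(np.flatnonzero(f_cprime.predict(X_train) != train_labels_arr))

    new_mistakes = errc - err0    # points f(c') sacrifices that f(0) got right
    if not new_mistakes:
        print("no extra training errors: suspect features (3) or examples (4)")
    else:
        print("%d training points sacrificed: suspect noise (1) or features (3)"
              % len(new_mistakes))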

Now, let's go back and do the same process for all of the dev points that were misclassified. What can happen?
  1. Almost all of the f(c')s make no more errors on other training points. Unless all of these erroneous dev points are markedly different from the training data (in which case we really have a domain adaptation problem), this is almost certainly a feature issue (3).
  2. Almost all of the f(c')s make many more errors on the other training points, and the set of training points on which they make these errors is usually the same. Then this is almost certainly noisy training data (or noisy test data).
  3. Almost all of the f(c')s make many more errors on the other training points, but the set of training points on which they err is substantially different each time. Then this is almost certainly a feature issue (3).
  4. Mixed results: some f(c')s make more errors on training, some don't. This is harder to say, but my gut tells me that this is probably a (4: example) issue.
There's a lot of hedging going on here: "almost", "many", "different", etc. Maybe you could formalize these things, but I'd rather get the intuition right first.
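With that caveat, here's one way the full loop over misclassified dev points could look, building on the snippets above; where to draw the line for "many" errors or for the sacrificed sets being "usually the same" is left as an eyeball judgment rather than a formal threshold.

    # Repeat the cost sweep for every misclassified dev point and collect, for
    # each one, the set of training points that f(c') newly sacrifices.
    dev_labels_arr = np.array(dev_labels)
    misclassified = np.flatnonzero(f0.predict(X_dev) != dev_labels_arr)

    sacrificed = []
    for i in misclassified:
        for c in np.logspace(-2, 4, 25):
            f_c = train_fc(X_train, train_labels, X_dev[i], dev_labels_arr[i], c)
            if f_c.predict(X_dev[i])[0] == dev_labels_arr[i]:
                err_c = set(np.flatnonzero(f_c.predict(X_train) != train_labels_arr))
                sacrificed.append(err_c - err0)
                break

    costly = [s for s in sacrificed if s]
    print("%d of %d dev errors forced extra training errors" %
          (len(costly), len(sacrificed)))
    # Roughly: mostly empty sets -> features (3) or examples (4);
    # mostly non-empty and heavily overlapping -> noisy data (1);
    # mostly non-empty but different each time -> features (3);
    # a real mixture -> maybe an examples issue (4).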

(If you're using a margin-based classifier, you might not have to retrain exactly each time. Koby Crammer's passive-aggressive algorithm will essentially give you a closed-form solution for "closest (in L2) weight vector to the current weight vector such that a given point is correctly classified," which could save some computational effort.)
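For reference, the closed-form update I take this to refer to looks roughly like the following in the binary case; this is a sketch of the standard passive-aggressive step for a dense weight vector, not a worked-out version of the whole diagnostic.

    def pa_update(w, x, y):
        """One passive-aggressive step: the L2-closest weight vector to w that
        classifies (x, y) with margin at least 1. Here y is in {-1, +1} and
        w, x are dense numpy vectors."""
        loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss of the current w
        if loss == 0.0:
            return w                              # already has sufficient margin
        tau = loss / np.dot(x, x)                 # closed-form step size
        return w + tau * y * x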

Note that I haven't actually tried this. I'd definitely be interested to, but wanted to think about it a bit before I shot off to implement something.

6 comments:

  1. Thanks for a very interesting post, Hal! I think you might be on to something but I have some questions about the intuition you have. You wrote:

    "1. Suppose that f(c') doesn't make any (many?) more mistakes than f(0)... This suggests that the problem is (3) or (4) above (features or examples)."

    I created some visual examples to test this. Consider an example here (link to JPG file). Here, circles and crosses are training examples of two classes. Our linear classifier (vertical line at x=0) misclassifies the dev point (triangle).

    Now, as we increase c', the dev point gets correctly classified by moving the new classifier to a vertical line at x=0.7. This corresponds to the first scenario you mentioned, since the training points remain correctly classified.

    Suppose we had more examples: in this case I follow your claim that having more examples will fix the problem and make sure I get the classifier at x=0.7 in the first place. [Note, however, that these extra samples are not generated i.i.d. over the whole feature space. I'm not sure whether it makes more sense to consider a more standard probabilistic model of data generation.]

    What I don't follow is why can't this be a noisy training label problem?
    Suppose some of the crosses had been mislabeled (they should've been circles). If the noisy labels were fixed, you would get a correct classifier (at x=1.25 in this case). Again, note that I am assuming some biased way of generating noisy labels (in particular, my noisy labels are at the boundary). If I had assumed that labels are flipped randomly with some probability (as assumed in many statistical learning theory papers), then my counter-example doesn't hold.

    The point I'm trying to make is that it seems that you need to have some probabilistic model of data generation and noise generation in order to be sure that the claims are true. If one can arbitrarily generate data or create noisy labels, it seems that one can find counter-examples to your claim.

  2. By the way, a random thought: I wonder whether you can gain any information by looking at the distribution of c' for all dev points. Can we draw any relationship to boosting? I.e., does high c' have any relation to "harder to classify" points that get higher weights in boosting?

  3. Interesting. Thanks.

    Here's my take on this issue (focusing on NLP problems):

    If we consider the typical NLP problems in which linear models are being applied (broadly: local-context-based tagging (WSD/POS/B-I-O or whatever), parser action prediction, re-ranking of rich structures (trees, sequences), and the various text classification tasks), then clearly the problem is, by and large, (3) (do any of you really believe that our current feature representations for these tasks are suitable ones?), and when we try to augment the feature space we face intractability, or (4), or both.

    So when facing an NLP problem, I don't need[*] a fancy algorithm to tell me where the problem lies -- it's in the feature representation.

    The interesting question, in my view, is then "can we identify these test cases for which our features are not sufficient?".
    Note this is not the same as asking "can we identify in advance where our classifier will fail". Rather, it's about sacrificing recall in favor of precision, by allowing our linear classifier to say "I don't know" on some of the cases.
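    (As a concrete illustration of that reject option, and purely my own sketch rather than anything from the comment: only commit to a label when the linear score clears some confidence threshold, and abstain otherwise.)

      def predict_or_abstain(model, X, thresh=1.0):
          """Return the model's label only when |score| >= thresh; else abstain.
          The threshold trades recall for precision and is an arbitrary knob."""
          scores = model.decision_function(X)
          labels = model.predict(X)
          return [lab if abs(s) >= thresh else "I don't know"
                  for lab, s in zip(labels, scores)]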

    * Algorithms for identifying cases of (1) or (2), e.g. annotation errors/inconsistencies, are, of course, of interest. But for a different reason.
