01 December 2008

Where did my (linear) model go wrong?

Back to research-related posts for a while :).

Let's say we're learning a predictor f and it doesn't get 100% test accuracy. We can ask ourselves: why not. Typically, the reason I ask myself this question is because I want to get closer to 100% test accuracy and want to know how I should go about it. Should I add more labeled data (what should it look like) or should I add more features (what should they look like) or am I just hosed?

In order to try to get a handle on this, we can ask ourselves: where can error come from:
  1. Noise in the training data (which we have fit and is now screwing us up at test time)
  2. Noise in the test data (which is causing our otherwise correct predictions to look wrong)
  3. Insufficient representation (we just don't have the right/enough features)
  4. Insufficient examples (our training data wasn't sampled densely enough in some region)
These are not mutually exclusive. For instance, maybe be reducing/changing the feature set (3), it would turn out that we are sampled densely enough in a region of interest (4).

I'm going to use (binary) text categorization as a working example because it's simple yet exhibits NLP-like properties (ever growing feature sets, sparse feature representations, etc.). So let's say we train a model (perhaps linear) on a bag of words feature set and apply it on test data. Say we get 90% accuracy on dev data. Now what can we do?

Let's take a single example that we misclassified. Perhaps take the "worst" (i.e., most misclassified in a margin sense), though I don't know that it matters. Which of the four reasons above tells us why this example was misclassified?

By looking at the example (yes, by hand!) we can probably tell if it's a problem due to issue (2: Noise in the test data). We might like to try to automate this ascertainment, but honestly I'm at a loss for how to do this. Let's say we decide that the problem isn't due to test set noise. Now what?

Let's consider the following approach. We are going to retrain our classifier several times (okay, this is expensive, but we can worry about this later). What we're going to do is add this misclassified dev point into the training data with a varying cost. The currently trained model we will call f(0), since it is based on giving this dev point a cost of zero. We can then train f(c) for a large range of costs c and let C be some reasonable upper bound (i.e., we guarantee that C is big enough that f(C) correctly classifies this point -- for any reasonable classifier, there should be such a finite C). Go ahead and replace "correctly classifies" with "classifies with a sufficiently large margin" if you'd prefer; I doubt it matters.

Now, we're going to look at two of these fs. We'll look at f(0) and f(c'), where c' is the minimal value of c such that this dev example becomes correctly classified. Let's say we now run these two fs on the training data. We know that f(0) will misclassify our "selected" test point and that f(c') will not. The big question is what do the fs do on the other points.
  1. Suppose that f(c') doesn't make any (many?) more mistakes than f(0). That is, they're basically the same, just now classifying our selected point correctly. This suggests that the problem is (3) or (4) above (features or examples).
  2. Suppose that f(c') makes many more mistakes than f(0). Now, we see that in order to get this selected test point correct, we have to pay by getting other things wrong (that we didn't before). This suggests that the problem is (1) or (3) above (noise or features). Importantly, it's not (4: examples).
At this point, we have separated causes (3 or 4) from (1 or 4), but haven't separated them completely.

Now, let's go back and do the same process of all of the dev points that were misclassified. What can happen?
  1. Almost all of the f(c')s make no more errors on other training points. Unless all of these erroneous dev points are markedly different from the training data (in which case we really have a domain adaptation problem), then this is almost certainly a feature issue (3).
  2. Almost all of the f(c')s make many more errors on the other training points, and the set of training points on which they make these errors is usually the same. Then this is almost certainly noisy training data (or noisy test data).
  3. Almost all of the f(c')s make many more errors on the other training points, but the set of training points on which they err is substantially different each time. Then this is almost certainly a feature issue (3).
  4. Mixed results: some f(c')s make more errors on training, some don't. This is harder to say, but my gut tells me that this is probably a (4: example) issue.
There's a lot of hedging going on here: "almost", "many", "different", etc. Maybe you could formalize these things, but I'd rather get the intuition right first.

(If you're using a margin based classifier, you might not have to exactly retrain each time. Koby Crammer's passive aggressive algorithm will essentially give you a closed form solution for "closest (in L2) weight vector to the current weight vector such that a given point is correctly classified," which could save some computational effort.)

Note that I haven't actually tried this. I'd definitely be interested to, but wanted to think about it a bit before I shot off to implement something.


  1. Thanks for a very interesting post, Hal! I think you might be on to something but I have some questions about the intuition you have. You wrote:

    "1. Suppose that f(c') doesn't make any (many?) more mistakes than f(0)... This suggests that the problem is (3) or (4) above (features or examples)."

    I created some visual examples to test this. Consider an example here (link to JPG file). Here, circles and crosses are training examples of two classes. Our linear classifier (vertical line at x=0) misclassifies the dev point (triangle).

    Now, as we increase c', the dev point gets correctly classified by moving the new classifier to a vertical line at
    x=0.7. This corresponds to the first scenario you mentioned, since the training points remain correctly classified.

    Suppose we had
    more examples
    : in this case I follow your claim here that having more examples will fix the problem and make sure I get the classifier at x=0.7 in the first place. [Note, however, that these extra samples are not generated i.i.d over the whole feature space. I'm not sure whether it makes more sense to
    consider a more standard probabilistic model of data generation.]

    What I don't follow is why can't this be a noisy training label problem?
    Suppose some of the crosses had been mislabeled (they should've been circles) If the noisy labels were fixed, you get a correct classifier (at x=1.25 in this case). Again, note that I am assuming some biased way of generating noisy labels (in particular, my noisy labels are at the boundary). If I had assumed that labels are flipped randomly with some probability (as assume in many statistical learning theory papers), then my counter example doesn't hold.

    The point I'm trying to make is that it seems that you need to have some probabilistic model of data generation and noise generation in order to be sure that the claims are true. If one can arbitrarily generate data or create noisy labels, it seems that one can find counter-examples to your claim.

  2. By the way, a random thought: I wonder whether you can gain any information by looking at the distribution of c' for all dev points. Can we draw any relationship to boosting? i.e. does high c' have any relation to "harder to classify" points that get higher weights in boosting?

  3. Hello,

    A humble request...

    Do you, by any chance, happen to know who Secret Dubai (the blogger: secretdubai.blogspot.com) is?


  4. Interesting. Thanks.

    Here's my take on this issue (focusing on NLP problems):

    If we consider the typical NLP problems in which linear models are being applied (broadly, these would be: local-context based tagging (WSD/POS/B-I-O or whatever), parser action prediction, re-reanking of rich structures (trees, sequences), and the various text classification tasks), then clearly the problem is, by and large (3) (do any of you really believe that our current feature representations for these tasks are suitable ones?), and when we try to augment the feature space we face intractability, or (4), or both.

    So when facing an NLP problem, I don't need[*] a fancy algorithm to tell me where the problem lies -- it's in the feature representation.

    The interesting question, in my view, is then "can we identify these test cases for which our feature are not sufficient?".
    Note this is not the same as asking "can we identify in advance where our classifier will fail". Rather, it's about sacrificing recall in favor of precision, by allowing our linear classifier to say "I don't know" on some of the cases.

    * algorithms for identifying cases of (1) or (2), e.g. annotation errors/inconsistencies, are, ofcourse, of interest. But for a different reason.

  5. Other than the FDA approved weight loss pills, herbs and medicines prepared from herbal products also adequately assist you in triggering off weight loss. As you scan the herbal weight loss related pages of the website http://www.weight-loss-truths.com , you would come to known that herbal weight loss is possible through herbal products such as Hoodia, Dandelion, Pyruvate, Aloe Vera et al. However, before opting for herbal weight loss products, it is necessary for you to consult with a herbal practitioner.

  6. There are various online sources to provide you informative details on FDA approved weight loss pills as well as information on pills and tablets available in the pharmaceutical market that are capable of successfully relieving you from the grip of erectile dysfunction and depression. But the online pill destination http://www.pill-care.com is somewhat unique as it makes a whole array of information available to you, including details on the highly popular pills and tablets, testimonials as well as news on the latest tidbits of the pharmaceutical market. Log in to the website and get hold of relevant pill-care details.

  7. I always heard something from my neighbor that he sometimes goes to the internet bar to play the game which will use him some habbo credits,he usually can win a lot of habbo gold,then he let his friends all have some habbo coins,his friends thank him very much for introducing them the cheap habbo credits,they usually buy habbo gold together.

  8. With so many diet pills saturating the market today, trying to find the best and most effective diet pill can be very tedious. Chances are, we may end up choosing the wrong diet pill. Phentremine is a natural apetite suppressant and if you are the one who cannot control your cravings for food, then it is just the right medicine for you! http://www.phentermine-effects.com

  9. 酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花

  10. 艾葳酒店經紀公司提供專業的酒店經紀, 酒店上班小姐,八大行業,酒店兼職,傳播妹,或者想要打工兼差打工,兼差,八大行業,酒店兼職,想去酒店上班, 日式酒店,制服酒店,ktv酒店,禮服店,整天穿得水水漂漂的,還是想去制服店日領上班小姐,水水們如果想要擁有打工工作、晚上兼差工作兼差打工假日兼職兼職工作酒店兼差兼差打工兼差日領工作晚上兼差工作酒店工作酒店上班酒店打工兼職兼差兼差工作酒店上班等,想了解酒店相關工作特種行業內容,想兼職工作日領假日兼職兼差打工、或晚班兼職想擁有鋼琴酒吧又有保障的工作嗎???又可以現領請找專業又有保障的艾葳酒店經紀公司!

    一般的酒店經紀只會在水水們第一次上班和領薪水時出現而已,對水水們的上班安全一點保障都沒有!艾葳酒店經紀公司的水水們上班時全程媽咪作陪,不需擔心!只提供最優質的酒店上班,酒店上班,酒店打工環境、上班條件給水水們。心動嗎!? 趕快來填寫你的酒店上班履歷表

    等相關服務 幫您快速的實現您的夢想~!!

  11. Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..

    sesli sohbet
    sesli chat
    sesli sohbet sitesi
    sesli chat sitesi
    sesli sohpet
    kamerali sohbet
    kamerali chat
    webcam sohbet