01 December 2008

Where did my (linear) model go wrong?

Back to research-related posts for a while :).

Let's say we're learning a predictor f and it doesn't get 100% test accuracy. We can ask ourselves: why not? Typically, I ask myself this question because I want to get closer to 100% test accuracy and want to know how I should go about it. Should I add more labeled data (and what should it look like), should I add more features (and what should they look like), or am I just hosed?

In order to get a handle on this, we can ask ourselves: where can the error come from?

  1. Noise in the training data (which we have fit and is now screwing us up at test time)
  2. Noise in the test data (which is causing our otherwise correct predictions to look wrong)
  3. Insufficient representation (we just don't have the right/enough features)
  4. Insufficient examples (our training data wasn't sampled densely enough in some region)
These are not mutually exclusive. For instance, maybe by reducing or changing the feature set (3), it would turn out that we are sampled densely enough in a region of interest (4).

I'm going to use (binary) text categorization as a working example because it's simple yet exhibits NLP-like properties (ever-growing feature sets, sparse feature representations, etc.). So let's say we train a model (perhaps linear) on a bag-of-words feature set and apply it to dev data. Say we get 90% accuracy. Now what can we do?
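
To make the working example concrete, here is a minimal sketch of that baseline in Python; scikit-learn and the toy documents are my stand-ins, not part of the original setup.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy stand-ins for a real corpus; in practice these are your documents.
    train_texts = ["good movie", "great plot", "bad acting", "terrible film"]
    train_labels = np.array([1, 1, -1, -1])
    dev_texts = ["great movie", "terrible acting"]
    dev_labels = np.array([1, -1])

    vectorizer = CountVectorizer()                # bag-of-words features
    X_train = vectorizer.fit_transform(train_texts)
    X_dev = vectorizer.transform(dev_texts)

    f0 = LogisticRegression()                     # a linear model
    f0.fit(X_train, train_labels)
    print("dev accuracy:", f0.score(X_dev, dev_labels))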

Let's take a single example that we misclassified. Perhaps take the "worst" (i.e., most misclassified in a margin sense), though I don't know that it matters. Which of the four reasons above tells us why this example was misclassified?

By looking at the example (yes, by hand!) we can probably tell if it's a problem due to issue (2: Noise in the test data). We might like to try to automate this ascertainment, but honestly I'm at a loss for how to do this. Let's say we decide that the problem isn't due to test set noise. Now what?

Let's consider the following approach. We are going to retrain our classifier several times (okay, this is expensive, but we can worry about this later). What we're going to do is add this misclassified dev point into the training data with a varying cost. The currently trained model we will call f(0), since it is based on giving this dev point a cost of zero. We can then train f(c) for a large range of costs c and let C be some reasonable upper bound (i.e., we guarantee that C is big enough that f(C) correctly classifies this point -- for any reasonable classifier, there should be such a finite C). Go ahead and replace "correctly classifies" with "classifies with a sufficiently large margin" if you'd prefer; I doubt it matters.
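
Here is how that search for c' might look, continuing the sketch above. Using a per-example sample weight as the "cost" is my interpretation, and the geometric search only brackets the minimal c (a bisection pass could refine it).

    import numpy as np
    from scipy.sparse import vstack
    from sklearn.linear_model import LogisticRegression

    def find_c_prime(X_train, y_train, x_dev, y_dev, C_max=1e6):
        """Add the misclassified dev point to training with increasing cost c;
        return (f(0), f(c'), c') for the first cost at which it flips correct."""
        X_aug = vstack([X_train, x_dev])
        y_aug = np.append(y_train, y_dev)
        f0 = LogisticRegression().fit(X_train, y_train)  # f(0): dev point costs zero
        c = 1.0
        while c <= C_max:
            weights = np.ones(len(y_aug))
            weights[-1] = c                              # the dev point gets cost c
            fc = LogisticRegression().fit(X_aug, y_aug, sample_weight=weights)
            if fc.predict(x_dev)[0] == y_dev:
                return f0, fc, c
            c *= 2.0
        raise RuntimeError("no cost up to C_max corrected the dev point")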

Now, we're going to look at two of these fs. We'll look at f(0) and f(c'), where c' is the minimal value of c such that this dev example becomes correctly classified. Let's say we now run these two fs on the training data. We know that f(0) misclassifies our selected dev point and that f(c') does not. The big question is what the fs do on the other points.
  1. Suppose that f(c') doesn't make any (many?) more mistakes than f(0). That is, they're basically the same, just now classifying our selected point correctly. This suggests that the problem is (3) or (4) above (features or examples).
  2. Suppose that f(c') makes many more mistakes than f(0). Now, we see that in order to get this selected test point correct, we have to pay by getting other things wrong (that we didn't before). This suggests that the problem is (1) or (3) above (noise or features). Importantly, it's not (4: examples).
At this point, we have separated causes (3 or 4) from (1 or 3), but haven't separated them completely.
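
Measuring what f(c') costs us on the training data can be as simple as the following sketch; what counts as "many" extra mistakes is left as a knob.

    import numpy as np

    def new_training_errors(f0, fc, X_train, y_train):
        """Training points that f(c') gets wrong but f(0) got right."""
        wrong_before = set(np.flatnonzero(f0.predict(X_train) != y_train))
        wrong_after = set(np.flatnonzero(fc.predict(X_train) != y_train))
        return wrong_after - wrong_before

    # Roughly: an empty result points at (3) or (4); a large one at (1) or (3).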

Now, let's go back and do the same process for all of the dev points that were misclassified. What can happen?
  1. Almost all of the f(c')s make no more errors on other training points. Unless all of these erroneous dev points are markedly different from the training data (in which case we really have a domain adaptation problem), this is almost certainly a feature issue (3).
  2. Almost all of the f(c')s make many more errors on the other training points, and the set of training points on which they make these errors is usually the same. Then this is almost certainly noisy training data (or noisy test data).
  3. Almost all of the f(c')s make many more errors on the other training points, but the set of training points on which they err is substantially different each time. Then this is almost certainly a feature issue (3).
  4. Mixed results: some f(c')s make more errors on training, some don't. This is harder to say, but my gut tells me that this is probably a (4: example) issue.
There's a lot of hedging going on here: "almost", "many", "different", etc. Maybe you could formalize these things, but I'd rather get the intuition right first.
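
If you did want to mechanize the case analysis, one rough formalization is sketched below; the thresholds many and most are made-up knobs standing in for "many" and "almost all", not part of the proposal.

    import numpy as np

    def diagnose(error_sets, n_train, many=5, most=0.8):
        """error_sets: one set of newly misclassified training indices per
        misclassified dev point (from new_training_errors above)."""
        big = [s for s in error_sets if len(s) >= many]
        frac_big = len(big) / len(error_sets)
        if frac_big <= 1 - most:
            return "case 1: likely a feature issue (3), or domain mismatch"
        if frac_big >= most:
            counts = np.zeros(n_train)
            for s in big:
                counts[list(s)] += 1            # how often each point flips
            if (counts >= len(big) / 2).any():  # same points keep flipping?
                return "case 2: likely noisy labels (1)"
            return "case 3: likely a feature issue (3)"
        return "case 4: mixed; possibly an example-sparsity issue (4)"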

(If you're using a margin-based classifier, you might not have to fully retrain each time. Koby Crammer's passive-aggressive algorithm will essentially give you a closed-form solution for "the closest (in L2) weight vector to the current weight vector such that a given point is correctly classified," which could save some computational effort.)
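
For reference, here is my transcription of that passive-aggressive step (with the standard margin of 1); treating a single PA step as a surrogate for full retraining is an approximation, not something the PA paper promises.

    import numpy as np

    def pa_step(w, x, y):
        """Closest (in L2) weight vector to w that classifies (x, y) with
        margin at least 1: w' = w + tau * y * x, tau = hinge loss / ||x||^2."""
        loss = max(0.0, 1.0 - y * np.dot(w, x))
        tau = loss / np.dot(x, x)
        return w + tau * y * x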

Note that I haven't actually tried this. I'd definitely be interested to, but wanted to think about it a bit before I shot off to implement something.

Comments:

Kevin Duh said...

Thanks for a very interesting post, Hal! I think you might be on to something but I have some questions about the intuition you have. You wrote:

"1. Suppose that f(c') doesn't make any (many?) more mistakes than f(0)... This suggests that the problem is (3) or (4) above (features or examples)."

I created some visual examples to test this. Consider an example here (link to JPG file). Here, circles and crosses are training examples of two classes. Our linear classifier (vertical line at x=0) misclassifies the dev point (triangle).

Now, as we increase c', the dev point gets correctly classified by moving the new classifier to a vertical line at x=0.7. This corresponds to the first scenario you mentioned, since the training points remain correctly classified.

Suppose we had more examples: in this case I follow your claim that having more examples will fix the problem and make sure I get the classifier at x=0.7 in the first place. [Note, however, that these extra samples are not generated i.i.d. over the whole feature space. I'm not sure whether it makes more sense to consider a more standard probabilistic model of data generation.]

What I don't follow is why this can't be a noisy training label problem. Suppose some of the crosses had been mislabeled (they should've been circles). If the noisy labels were fixed, you would get a correct classifier (at x=1.25 in this case). Again, note that I am assuming some biased way of generating noisy labels (in particular, my noisy labels are at the boundary). If I had assumed that labels are flipped randomly with some probability (as assumed in many statistical learning theory papers), then my counterexample doesn't hold.

The point I'm trying to make is that it seems that you need to have some probabilistic model of data generation and noise generation in order to be sure that the claims are true. If one can arbitrarily generate data or create noisy labels, it seems that one can find counter-examples to your claim.

Kevin Duh said...

By the way, a random thought: I wonder whether you can gain any information by looking at the distribution of c' for all dev points. Can we draw any relationship to boosting? I.e., does a high c' have any relation to the "harder to classify" points that get higher weights in boosting?

Yoav said...

Interesting. Thanks.

Here's my take on this issue (focusing on NLP problems):

If we consider the typical NLP problems to which linear models are applied (broadly: local-context tagging (WSD/POS/B-I-O or whatever), parser action prediction, re-ranking of rich structures (trees, sequences), and the various text classification tasks), then clearly the problem is, by and large, (3) (do any of you really believe that our current feature representations for these tasks are suitable ones?), and when we try to augment the feature space we face intractability, or (4), or both.

So when facing an NLP problem, I don't need[*] a fancy algorithm to tell me where the problem lies -- it's in the feature representation.

The interesting question, in my view, is then: can we identify the test cases for which our features are not sufficient?
Note this is not the same as asking "can we identify in advance where our classifier will fail". Rather, it's about sacrificing recall in favor of precision, by allowing our linear classifier to say "I don't know" on some of the cases.
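
A minimal sketch of that reject option, assuming a linear model exposing a decision_function and an arbitrary confidence threshold:

    import numpy as np

    def predict_or_abstain(f, X, threshold=0.5):
        """Predict +1/-1 where |score| clears the threshold, and 0 ("I don't
        know") otherwise: trading recall for precision on answered cases."""
        scores = f.decision_function(X)
        preds = np.sign(scores)
        preds[np.abs(scores) < threshold] = 0
        return preds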

* Algorithms for identifying cases of (1) or (2), e.g. annotation errors/inconsistencies, are, of course, of interest. But for a different reason.
