20 April 2010

((A => B) and (not B)) => (not A)

I remember back in middle school or high school -- added uncertainty so as not to date myself too much -- I first learned of the existence of file compression software. We used pkzip, mostly. I remember one of the first things I did was: compress myfile.exe to myfile.exe.zip. Then compress that to myfile.exe.zip.zip. Then myfile.exe.zip.zip.zip. I cannot tell you how disappointed I was when, at one stage, the file size went up after "compression."

We read papers all the time that demonstrate something roughly of the form "if A then B." This happens most obviously the closer you get to theory (where such statements can be made precise), but it happens in non-theoretical work as well. The point of this post is: if you believe A => B, then you have to ask yourself: which do I believe more? A, or not B?
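To make that reading concrete: A => B is interchangeable with its contrapositive (not B) => (not A), so accepting the implication forces a choice between believing B and rejecting A. Here is that one-liner checked formally (a sketch in Lean, not anything from the original post):

-- Modus tollens: given A -> B and not B, conclude not A.
theorem modus_tollens {A B : Prop} (hAB : A → B) (hnB : ¬B) : ¬A :=
  fun hA => hnB (hAB hA)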

The simplest case is the compression case. Let's say a weak compressor is one that always reduces a (non-empty) file's size by one bit. A strong compressor is one that cuts the file down to one bit. I can easily prove to you that if you give me a weak compressor, I can turn it into a strong compressor by running it N-1 times on files of size N. Trivial, right? But what do you conclude from this? You're certainly not happy, I don't think. For what I've proved really is that weak compressors don't exist, not that strong compressors do. That is, you believe so strongly that a strong compressor is impossible that you must conclude from (weak) => (strong) that (weak) cannot possibly exist.
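Here is that construction written out, just to make the argument concrete (a minimal sketch; weak_compress is hypothetical, since it is exactly the thing the argument shows cannot exist):

def make_strong_compressor(weak_compress):
    # weak_compress: hypothetically shrinks any non-empty bit string by exactly one bit.
    def strong_compress(bits):
        # Apply the weak compressor N-1 times to an N-bit input.
        while len(bits) > 1:
            smaller = weak_compress(bits)
            assert len(smaller) == len(bits) - 1, "the 'weak' guarantee failed"
            bits = smaller
        return bits
    return strong_compress

The assert is where reality intrudes: no actual weak_compress can satisfy it on every input, which is the modus tollens reading of the construction.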

Let's take the most obvious example from machine learning: boosting. The basic result of boosting is that I can turn a "weak classifier" into a "strong classifier." A strong classifier is one that (almost always) does a really good job at classification. A weak classifier is one that (always always) does a not completely crappy job at classification. That is, a strong classifier will get you 99.9% accuracy (most of the time) while a weak classifier will only get you 50.1% accuracy (most of the time). Boosting works by repeatedly applying the weak classifier to modified data sets in order to get a strong classifier out.
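For concreteness, "repeatedly applying the weak classifier to modified data sets" looks roughly like the AdaBoost recipe below (a sketch, assuming labels in {-1, +1} and a hypothetical weak_learner(X, y, weights) that returns a classifier whose weighted error lies strictly between 0 and 1/2):

import numpy as np

def boost(X, y, weak_learner, rounds=50):
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)               # start with uniform example weights
    hypotheses, alphas = [], []
    for _ in range(rounds):
        h = weak_learner(X, y, w)          # weak classifier on the reweighted data
        pred = np.array([h(x) for x in X])
        err = w[pred != y].sum()           # weighted training error, assumed in (0, 1/2)
        alpha = 0.5 * np.log((1.0 - err) / err)
        w *= np.exp(-alpha * y * pred)     # upweight the examples h got wrong
        w /= w.sum()
        hypotheses.append(h)
        alphas.append(alpha)
    # the "strong" classifier is a weighted vote over the weak ones
    return lambda x: np.sign(sum(a * h(x) for a, h in zip(alphas, hypotheses)))

The standard guarantee is exactly the implication in this post: if every round's weighted error stays below 1/2 by some margin, the training error of the voted classifier drops exponentially in the number of rounds.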

In all the boosting papers, the results are presented as positive. That is: look, obviously we want strong classifiers. But weak classifiers look much more attainable. And voila, by boosting magic, we can turn the weak into the strong. This is an A => B setting: (existence of weak classifiers) => (existence of strong classifiers).

Sounds great, right? But let me ask you: do you believe more strongly that (weak classifiers exist) or more strongly that (strong classifiers do not exist)? For me, it's the latter. To some degree, no free lunch theorems apply here. This yields a totally different read of boosting results, namely: weak classifiers don't even exist!!!

More on the language side, one of the more classic experimental results we have is that if you give me a really good semantic representation, I can do an awesome job doing natural language generation from those semantics. In order to do translation, for instance, I just have to generate the semantics of a French sentence and then I can generate a near-perfect English translation. (French semantics) => (Perfect translations). But do I believe more strongly that we can get perfect French semantics, or that we cannot get perfect translations? Right now, almost certainly the latter.

My point is really one about critical reading. When you read a paper, things will be presented in one light. But that's often not the only light in which they can be interpreted. Apply your own interpretation.

12 comments:

Anonymous said...

My apologies in advance for my comments because I'm pretty sure I didn't get your point :) But in the boosting case, the statement A=>B itself is false. Isn't there a limit to getting a strong classifier from weak classifiers?

Igor said...

I have a hard time following the definitions. Does "most of the time" mean "on average"? Otherwise, I don't understand what it means to have accuracy of "50.1% (most of the time)".

Also, a weak classifier is defined as "one that (always always) does a not completely crappy job at classification".

Is that supposed to be "almost always" instead of "always always"?

Thanks.

bj said...

You need to modify this to say: the many, many different weak classifiers that, when combined, would produce a strong classifier do not all collectively and simultaneously exist. This doesn't mean that you can't find one (or a small number) of weak classifiers. I could create an unlimited number of weak classifiers if they were all identical, for example.

Suresh Venkatasubramanian said...

this sounds like the defn of an NP-hardness reduction :). I don't really believe I can solve SAT, so I MUST believe that this other problem is really hard.

hal said...

Well, I have to say that explaining the details of boosting wasn't really the point of this post (which is why I tried to be as hand-wavy as possible about it). But yes, it should be "almost always" (not "always always"). And of course it's not weak classifiers that I'm looking for, but a learning algorithm that's guaranteed to produce weak classifiers for (almost) any data set. The notion of almost is that it only has to do it (say) 90% of the time, and the other 10% of the time it can segfault for all I care. (Though note that you can boost this term, too.)

@suresh: I almost said something about complexity theory, which seems to kind of only have results of the form "if something almost certainly not true is true, then something else almost certainly not true is also true..." I think these are called something about pigs flying theorems or something...

Bob Carpenter said...

Your boosting and compression examples remind me of the (linear) speedup theorem. You can always improve the constant term of an algorithm's performance. (It doesn't work in practice, of course, because unfolding eventually hurts more than it helps due to program size and thus non-locality of instructions.)

Grae BG said...

I like your thinking. But I wonder...

The existence of weak classifiers implies the existence of strong classifiers only if we have assumed that the repeated usage of weak classification leads to strong classification. To be more mathematically precise, you have assumed that you can, through the repetition of weak classification, get arbitrarily close to strong classification. Under this assumption, you are correct, but I don't think that is a safe assumption to make.

A proof by contradiction in mathematics leads to showing an assumption cannot possibly be true; instead of concluding that weak classifiers don't exist - since they clearly do - we must simply question the assumption your implication is grounded upon.

Paul said...

I don't think you even need > 50% accuracy on the weak classifier, just better performance than a coin flip (not necessarily the same if the sizes of the categories are unequal). If I write a classifier that correctly identifies 10 examples it was trained on and flips a coin for everything else, with enough iterations we'll approach a case where every example has been used as a sample for some classifier. With enough additional iterations, the coin flips average out to 0, and you can detect faint deviations from the mean and figure out what the classifiers trained on that sample are reporting.

Absurdly large numbers of classifiers required. But it would work, wouldn't it? So when you don't like (not A) or (not B) you might just be interested in a more limited case.
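A quick simulation of the construction described above (just a sketch with made-up sizes: each classifier memorizes ten random examples and coin-flips the rest, and the average vote recovers the labels once there are enough classifiers):

import numpy as np

rng = np.random.default_rng(0)
n_examples, n_classifiers, memorized = 200, 20000, 10
y = rng.choice([-1, 1], size=n_examples)          # true labels

votes = np.zeros(n_examples)
for _ in range(n_classifiers):
    keep = rng.choice(n_examples, size=memorized, replace=False)
    pred = rng.choice([-1, 1], size=n_examples)   # coin flips everywhere...
    pred[keep] = y[keep]                          # ...except the memorized examples
    votes += pred

print("agreement with true labels:", np.mean(np.sign(votes) == y))
# the faint per-classifier signal (10 of 200 examples) survives the averaging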

Paul said...

Slightly more precisely, that algorithm would work on a finite set of classifiable values. If you're trying to classify the integers or reals, you're in no-free-lunch territory: even if you get 100% precision on some test set, generalizing that to performance on an infinite set will never come out to a measurable value above chance. So: (strong classifiers) in the bounded-set, unbounded-resources case, (no weak classifiers) in the unbounded set, and the other cases depend on how you bound the set and the resources.

Andy said...

Weak compression works for many but not all files. If it worked for all cases, then strong compression would be possible. Since weak compression does not work for all cases, the conditional is uninformative (denial of the antecedent).

Okay.

Now the other direction. Strong compression clearly doesn't exist. And yes, from this we can conclude by modus tollens that weak compression, as you defined it, does not exist.

However that's not a very useful definition of weak compression.

Weak compression works for many files.

That weak compression works for many files does not imply that strong compression works for all files.

So I think I have missed the point!

Anonymous said...

Andy: stop being logical. Relativism is the new cool. By demonstrating reality you only prove your unKooollNesssss dude.
