This is something that's bothered me for quite a while, and I don't know of a good answer. I used to think it was something that theory people didn't worry about, but then this exact issue was brought up by a reviewer of a theory-heavy paper that we have at NIPS this year (with Avishek Saha and Abhishek Kumar). There are (at least?) two issues with comparing bounds. The first is the obvious one: "these are both upper bounds, so what does it mean to compare them?" The second is the slightly less obvious one: "your empirical losses may be totally different." It's actually the second one that I want to talk about, but I have much less of a good internal feel for it.
Let's say that I'm considering two learning approaches. Say it's SVMs versus logistic regression. Both regularized. Or something. Doesn't really matter. At the end of the day, I'll have a bound that looks roughly like:
expected test error <= empirical training error + f( complexity / N)
Here, f is often "sqrt", but could really be any function. And N is the number of data points.
Between two algorithms, both "f" and "complexity" can vary. For instance, one might have a linear dependence on the dimensionality of the data (i.e., complexity looks like O(D), where D is dimensionality) and the other might have a superlinear dependence (e.g., O(D log D)). Or one might have a square root. Who knows. Sometimes there's an inf or sup hiding in there, too, for instance in a lot of the margin bounds.
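To make that concrete, the kind of comparison I have in mind looks schematically like this (the constants c1 and c2 are placeholders I'm inventing; only the shape of the dependence on D matters):

```latex
% Two illustrative bounds of the generic form above.
% c_1, c_2 are unspecified constants, D is dimensionality, N the number of examples.
\mathrm{err}(h) \;\le\; \widehat{\mathrm{err}}(h) + \sqrt{\frac{c_1 \, D}{N}}
\qquad \text{versus} \qquad
\mathrm{err}(h') \;\le\; \widehat{\mathrm{err}}(h') + \sqrt{\frac{c_2 \, D \log D}{N}}
```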
At the end of the day, we of course want to say "my algorithm is better than your algorithm." (What else is there in life?) The standard way to say this is that "my f(complexity / N) looks better than your f'(complexity' / N)."
Here's where two issues crop up. The first is that our bound is just an upper bound. For instance, Alice could come up to me and say "I'm thinking of a number between 1 and 10" and Bob could say "I'm thinking of a number between 1 and 100." Even though the bound is lower for Alice, it doesn't mean that Alice is actually thinking of a smaller number -- maybe Alice is thinking of 9 and Bob of 5. In this way, the bounds can be misleading.
My general approach with this issue is to squint, as I do for experimental results. I don't actually care about constant factors: I just care about things like "what does the dependence on D look like." Since D is usually huge for problems I care about, a linear or sublinear dependence on D looks really good to me. Beyond that I don't really care. I'm especially willing to ignore the finer differences when the proof techniques are quite similar. For instance, if they both use Rademacher complexities, then I'm more willing to compare them than if one uses Rademacher complexities and the other uses covering numbers. They somehow feel more comparable: I'm less likely to believe that the differences are due to the method of analysis.
(You can also get around this issue with some techniques, like Rademacher complexities, which give you both upper and lower bounds, but I don't think anyone really does that...)
The other issue I don't have as good a feeling for. The issue is that we're entirely ignoring the "empirical training error" question. In fact, this is often measured differently between different algorithms! For instance, for SVMs, the formal statement is more like "expected 0/1 loss on test <= empirical hinge loss on training + ..." Whereas for logistic regression, you might be comparing expected 0/1 loss with empirical log loss.
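Spelled out, the two statements being compared look roughly like this (schematic only; h(x) is the real-valued score and the complexity terms are as before):

```latex
% SVM-style statement: expected 0/1 test loss vs. empirical hinge loss
\mathbb{E}\left[\ell_{0/1}\right] \;\le\; \frac{1}{N}\sum_{i=1}^{N} \max\big(0,\, 1 - y_i h(x_i)\big) \;+\; f\!\left(\frac{\mathrm{complexity}}{N}\right)

% Logistic-regression-style statement: expected 0/1 test loss vs. empirical log loss
\mathbb{E}\left[\ell_{0/1}\right] \;\le\; \frac{1}{N}\sum_{i=1}^{N} \log\big(1 + e^{-y_i h(x_i)}\big) \;+\; f'\!\left(\frac{\mathrm{complexity}'}{N}\right)
```

The right-hand sides start from different empirical quantities, so even identical complexity terms wouldn't make the two bounds directly comparable.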
Now I really don't know what to do.
We ran into this issue because we were trying to compare some bounds between EasyAdapt and a simple model trained just on source data. The problem is that the source training error might be totally incomparable to the (source + target) training error.
But the issue is for sure more general. For instance, what if your training error is measured in squared error? Now this can be huge when hinge loss is still rather small. In fact, your squared error could be quadratically large in your hinge loss. Actually it could be arbitrarily larger, since hinge goes to zero for any sufficiently correct classification, but squared error does not. (Neither does log loss.)
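To see how far apart these quantities can get on the same predictions, here's a tiny numerical sketch (treating each surrogate as a function of the margin m = y*h(x), which is one common convention; the code is purely illustrative):

```python
import math

# Per-example surrogate losses as functions of the margin m = y * h(x).
def hinge(m):    return max(0.0, 1.0 - m)
def squared(m):  return (1.0 - m) ** 2                 # squared error on the margin
def logloss(m):  return math.log(1.0 + math.exp(-m))   # logistic / log loss

for m in [-1.0, 0.0, 0.5, 1.0, 2.0, 5.0]:
    print(f"margin={m:+.1f}  hinge={hinge(m):6.3f}  "
          f"squared={squared(m):6.3f}  log={logloss(m):6.3f}")
```

At m = 5 the hinge loss is exactly zero while the squared error is 16, and the log loss is small but still positive: the three "empirical training errors" simply aren't measuring the same thing.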
This worries me greatly, much more than the issue of comparing upper bounds.
Does this bother everyone, or is it just me? Is there a good way to think about this that gets you out of this conundrum?
This problem bothers me too. I tend not to conclude from it that log loss or squared loss is a little worse than hinge loss, although it is tempting. I am thinking this might have some relationship to the bias-variance trade-off. The residual of log loss might keep you away from high variance, which would mean the bound term for log loss might be smaller than that for hinge loss. But I don't know how to make that precise.
In addition, I think if the loss is smaller than 1, squared loss would yield a smaller value than hinge loss.
I think comparing classifiers based on upper bounds is not advisable in general. Furthermore, the claim that one classifier is better than another based on the bounds often has no merit. The bounds hold over all possible distributions, and for a particular problem many of those distributions are a priori impossible.
The usual complexity measures themselves do not ideally describe the actual complexity of a classifier, as the following example illustrates.
Assume I have a small training data set (x_i, y_i) in R^d drawn from a reasonably smooth continuous distribution p(x, y). Assume that I built a linear classifier under some loss (say an SVM). I then take all the training examples that are misclassified by the linear classifier and memorize them along with their labels.
For a test vector x, if x is within epsilon of a memorized training example I give it the label of the training example. Otherwise I use the linear classifier to obtain my prediction. I can make epsilon very very small and since the training examples will be in general position with probability one, the classification scheme is unambiguous.
My classification scheme will have zero error on any training set and therefore will have high complexity according to the usual complexity measures (VC, Rademacher, etc.). Yet it seems to me that my classifier has low complexity, because except for a very small region of the space the classification is essentially linear.
If I ignore the contribution of the memorized points (which only play a role for a set of vanishingly small probability), I have a linear classifier.
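In code, the construction is roughly the following sketch (w and b are the already-fitted linear classifier, eps is the tiny radius; the names and interface are just illustrative):

```python
import numpy as np

def fit_memorizer(X, y, w, b, eps=1e-8):
    # Memorize exactly the training points the linear rule misclassifies.
    margins = y * (X @ w + b)
    mistakes = margins <= 0
    return {"w": w, "b": b, "eps": eps, "mem_X": X[mistakes], "mem_y": y[mistakes]}

def predict(model, x):
    # If x is within eps of a memorized training point, copy that point's label.
    if len(model["mem_X"]) > 0:
        dists = np.linalg.norm(model["mem_X"] - x, axis=1)
        if dists.min() < model["eps"]:
            return model["mem_y"][dists.argmin()]
    # Otherwise fall back to the linear classifier.
    return np.sign(x @ model["w"] + model["b"])
```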
Now assume that someone else builds a quadratic classifier on the same training data (which possibly has a non-zero empirical error). Is it then true that I should expect lower generalization error for the quadratic classifier, just because my particular way of measuring complexity assigns lower complexity value to the quadratic classifier than my linear + memorization classifier?
There is a structural arbitrariness in bounds which is undesirable and unavoidable. This comes in two forms as you point out.
The first form is best illustrated by the Occam's razor bound (page 9), where it's clear that any prior can be plugged in. Furthermore, every choice of prior implies a different learning algorithm of the form "minimize the upper bound". The essential impact of this form of arbitrariness is that almost any kind of regularization or complexity control can be justified with a related bound.
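For reference, one loose (Hoeffding-style) form of that bound: for any prior P over a countable hypothesis set H, with probability at least 1 - delta over the draw of N training examples, simultaneously for all h in H,

```latex
\mathrm{err}(h) \;\le\; \widehat{\mathrm{err}}(h) \;+\; \sqrt{\frac{\ln\frac{1}{P(h)} + \ln\frac{1}{\delta}}{2N}}
```

(This is the loose square-root version rather than the tight binomial-tail statement, but it shows exactly where the arbitrary prior P enters.)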
The other form of arbitrariness is relativized guarantees. Instead of saying "the test error is less than X", we have "the test error is less than the training error + Y". Here, "training error" can be changed to "margin error", "hinge loss", or something else---the possibilities are infinite. There are some relationships between these possibilities---margin error bounds training error for example. But generally speaking, we have a large number of incomparable bounds as you point out.
The basic conclusion here is that bounds are a weak tool for comparing learning algorithms. It's only under the same "if" clause (including implicitly the relativized basis) that bounds are comparable. In practice, computational and representational constraints tend to be a more effective basis for choosing a learning algorithm for a specific problem.
Further work on methods for problem-independent learning algorithm choice and design is definitely worth thinking about. Broadly new approaches are required however---we can't just prove a new bound of the same form. Some small amount of progress is made with learning reductions, where we can at least reduce the relativized basis to the same value (0-1 loss).