This is something that's bothered me for quite a while, and I don't know of a good answer. I used to think it was something that theory people didn't worry about, but then this exact issue was brought up by a reviewer of a theory-heavy paper that we have at NIPS this year (with Avishek Saha and Abhishek Kumar). There are (at least?) two issues with comparing bounds, the first is the obvious "these are both upper bounds, what does it mean to compare them?" The second is the slightly less obvious "but your empirical losses may be totally different" issue. It's actually the second one that I want to talk about, but I have much less of a good internal feel about it.
Let's say that I'm considering two learning approaches. Say it's SVMs versus logistic regression. Both regularized. Or something. Doesn't really matter. At the end of the day, I'll have a bound that looks roughly like:
expected test error <= empirical training error + f( complexity / N)
Here, f is often "sqrt", but could really be any function. And N is the number of data points.
Between two algorithms, both "f" and "complexity" can vary. For instance, one might have a linear dependence on the dimensionality of the data (i.e., complexity looks like O(D), where D is dimensionality) and the other might have a superlinear dependence (eg., O(D log D)). Or one might have a square root. Who knows. Sometimes there's an inf or sup hiding in there, too, for instance in a lot of the margin bounds.
At the end of the day, we of course want to say "my algorithm is better than your algorithm." (What else is there in life?) The standard way to say this is that "my f(complexity / N) looks better than your f'(complexity' / N)."
Here's where two issues crop up. The first is that our bound is just an upper bound. For instance, Alice could come up to me and say "I'm thinking of a number between 1 and 10" and Bob could say "I'm thinking of a number between 1 and 100." Even though the bound is lower for Alice, it doesn't mean that Alice is actually thinking of a smaller number -- maybe Alice is thinking of 9 and Bob of 5. In this way, the bounds can be misleading.
My general approach with this issue is to squint, as I do for experimental results. I don't actually care about constant factors: I just care about things like "what does the dependence on D look like." Since D is usually huge for problems I care about, a linear or sublinear dependence on D looks really good to me. Beyond that I don't really care. I especially don't care if the proof techniques are quite similar. For instance, if they both use Rademacher complexities, then I'm more willing to compare them than if one uses Rademacher complexities and the other uses covering numbers. They somehow feel more comparable: I'm less likely to believe that the differences are due to the method of analysis.
(You can also get around this issue with some techniques, like Rademacher complexities, which give you both upper and lower bounds, but I don't think anyone really does that...)
The other issue I don't have as good a feeling for. The issue is that we're entirely ignoring the "empirical training error" question. In fact, this is often measured differently between different algorithms! For instance, for SVMs, the formal statement is more like "expected 0/1 loss on test <= empirical hinge loss on training + ..." Whereas for logistic regression, you might be comparing expected 0/1 loss with empirical log loss.
Now I really don't know what to do.
We ran into this issue because we were trying to compare some bounds between EasyAdapt and a simple model trained just on source data. The problem is that the source training error might be totally incomparable to the (source + target) training error.
But the issue is for sure more general. For instance, what if your training error is measured in squared error? Now this can be huge when hinge loss is still rather small. In fact, your squared error could be quadratically large in your hinge loss. Actually it could be arbitrarily larger, since hinge goes to zero for any sufficiently correct classification, but squared error does not. (Neither does log loss.)
This worries me greatly, much more than the issue of comparing upper bounds.
Does this bother everyone, or is it just me? Is there a good way to think about this that gets your out of this conundrum?