21 October 2010

Comparing Bounds

This is something that's bothered me for quite a while, and I don't know of a good answer.  I used to think it was something that theory people didn't worry about, but then this exact issue was brought up by a reviewer of a theory-heavy paper that we have at NIPS this year (with Avishek Saha and Abhishek Kumar).  There are (at least?) two issues with comparing bounds.  The first is the obvious one: "these are both upper bounds, so what does it mean to compare them?"  The second is the slightly less obvious one: "but your empirical losses may be totally different."  It's actually the second that I want to talk about, though I have much less of a good internal feel for it.

Let's say that I'm considering two learning approaches.  Say it's SVMs versus logistic regression.  Both regularized.  Or something.  Doesn't really matter.  At the end of the day, I'll have a bound that looks roughly like:

        expected test error <= empirical training error + f( complexity / N)

Here, f is often "sqrt", but could really be any function.  And N is the number of data points.

Between two algorithms, both "f" and "complexity" can vary.  For instance, one might have a linear dependence on the dimensionality of the data (i.e., complexity looks like O(D), where D is dimensionality) and the other might have a superlinear dependence (e.g., O(D log D)).  Or one might have a square root.  Who knows.  Sometimes there's an inf or sup hiding in there, too, for instance in a lot of the margin bounds.
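
Just to make the template concrete, here is one standard, generic instantiation -- not the bound from any particular paper, and with constants that vary from analysis to analysis: take linear predictors w with ||w|| <= B, inputs with ||x|| <= R, and a surrogate loss that is Lipschitz (with constant L) and bounded in [0,1].  A Rademacher-style argument then gives, with probability at least 1 - delta over the N training points:

        expected test loss <= empirical training loss
                                + 2 * L * B * R / sqrt(N)
                                + sqrt( log(1/delta) / (2N) )

Note that both sides here are measured with the same surrogate loss, which is exactly the issue that comes back below.  Note also that in this particular form the "complexity" enters through the norms B and R rather than through D; it's the VC-style analyses where the explicit O(D) or O(D log D) dependence tends to show up.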

At the end of the day, we of course want to say "my algorithm is better than your algorithm."  (What else is there in life?)  The standard way to say this is that "my f(complexity / N) looks better than your f'(complexity' / N)."

Here's where two issues crop up.  The first is that our bound is just an upper bound.  For instance, Alice could come up to me and say "I'm thinking of a number between 1 and 10" and Bob could say "I'm thinking of a number between 1 and 100."  Even though the bound is lower for Alice, it doesn't mean that Alice is actually thinking of a smaller number -- maybe Alice is thinking of 9 and Bob of 5.  In this way, the bounds can be misleading.

My general approach with this issue is to squint, as I do for experimental results.  I don't actually care about constant factors: I just care about things like "what does the dependence on D look like."  Since D is usually huge for problems I care about, a linear or sublinear dependence on D looks really good to me.  Beyond that I don't really care, and I care even less if the proof techniques are quite similar.  For instance, if both bounds use Rademacher complexities, then I'm more willing to compare them than if one uses Rademacher complexities and the other uses covering numbers.  The bounds somehow feel more comparable: I'm less likely to believe that the differences are due to the method of analysis.

(You can also get around this issue with some techniques, like Rademacher complexities, which give you both upper and lower bounds, but I don't think anyone really does that...)

The other issue I don't have as good a feeling for.  The issue is that we're entirely ignoring the "empirical training error" question.  In fact, this is often measured differently between different algorithms!  For instance, for SVMs, the formal statement is more like "expected 0/1 loss on test <= empirical hinge loss on training + ..."  Whereas for logistic regression, you might be comparing expected 0/1 loss with empirical log loss.
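
Just to see how differently those two right-hand sides can score the exact same training predictions, here's a toy computation with completely made-up margins (nothing to do with any real experiment):

```python
import numpy as np

# Completely made-up training margins y_i * f(x_i).
margins = np.array([-0.5, 0.2, 0.8, 1.5, 3.0])

zero_one = (margins <= 0).mean()                  # what both bounds talk about on the test side
hinge    = np.maximum(0.0, 1.0 - margins).mean()  # empirical term in the SVM-style bound
logloss  = np.log(1.0 + np.exp(-margins)).mean()  # empirical term in the logistic-style bound

print(f"0/1 error: {zero_one:.2f}")   # 0.20
print(f"hinge    : {hinge:.2f}")      # 0.50
print(f"log loss : {logloss:.2f}")    # 0.44
```

Same predictions, but the "empirical training error" each bound starts from is a different number measuring a different thing, so even identical complexity terms wouldn't make the right-hand sides comparable.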

Now I really don't know what to do.

We ran into this issue because we were trying to compare some bounds between EasyAdapt and a simple model trained just on source data.  The problem is that the source training error might be totally incomparable to the (source + target) training error.

But the issue is for sure more general.  For instance, what if your training error is measured in squared error?  Now this can be huge when hinge loss is still rather small.  In fact, your squared error could be quadratically large in your hinge loss.  Actually it could be arbitrarily larger, since hinge goes to zero for any sufficiently correct classification, but squared error does not.  (Neither does log loss.)
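
Here's the same kind of made-up computation, this time putting numbers on the hinge-versus-squared-error claim:

```python
import numpy as np

# Made-up margins: one badly misclassified point, one borderline point,
# and a few increasingly confident correct ones.
margins = np.array([-4.0, 0.0, 1.0, 5.0, 20.0])

hinge   = np.maximum(0.0, 1.0 - margins)    # [5, 1, 0, 0, 0]
squared = (1.0 - margins) ** 2              # [25, 1, 0, 16, 361]
logloss = np.log(1.0 + np.exp(-margins))    # [4.02, 0.69, 0.31, 0.007, ~0]

# On the misclassified point, squared error is exactly hinge squared (25 vs 5).
# On the confidently correct points, hinge is exactly zero while squared error
# keeps growing with the margin; log loss never quite reaches zero, but at
# least it decays instead of blowing up.
for m, h, s, l in zip(margins, hinge, squared, logloss):
    print(f"margin {m:6.1f}:  hinge {h:5.1f}  squared {s:6.1f}  log {l:.3f}")
```

So a model whose empirical squared error looks terrible can have zero hinge loss on the same data, which makes the two right-hand sides essentially incomparable.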

This worries me greatly, much more than the issue of comparing upper bounds.

Does this bother everyone, or is it just me?  Is there a good way to think about this that gets you out of this conundrum?

05 October 2010

My Giant Reviewing Error

I try to be a good reviewer, but like everything, reviewing is a learning process.  About five years ago, I was reviewing a journal paper and made an error.  I don't want to give up anonymity in this post, so I'm going to be vague in places that don't matter.

I was reviewing a paper, which I thought was overall pretty strong.  I thought there was an interesting connection to some paper by Alice Smith (not her real name) from the past few years and mentioned this in my review.  Not a connection that made the current paper irrelevant, but something the authors should probably talk about.  In the revision response, the authors said that they had tried to find Smith's paper but couldn't figure out which one I was talking about, and asked for a pointer.  I spent the next five hours looking for the reference and couldn't find it myself.  It turned out that I was actually thinking of a paper by Bob Jones, so I provided that citation.  But the Jones paper wasn't even as relevant as it had seemed when I wrote the review, so I apologized and told the authors they didn't really need to cover it that closely.

Now, you might be thinking to yourself: aha, now I know that Hal was the reviewer of my paper!  I remember that happening to me!

But, sadly, this is not true.  I get reviews like this all the time, and I feel it's one of the most irresponsible things a reviewer can do.  In fact, I don't think a single reviewing cycle has passed where I don't get a review like this.  The problem with such reviews is that they enable a reviewer to make whatever claim they want, without any expectation of having to back it up.  And the claims are usually wrong.  The reviewers aren't necessarily being mean (I wasn't trying to be mean), but sometimes they are.

Here are some of the most ridiculous cases I've seen.  I mention these just to show how often this problem occurs.  These are all on papers of mine.

  • One reviewer wrote "This idea is so obvious this must have been done before."  This is probably the most humorous example I've seen, but the reviewer was clearly serious.  And no, this was not in a review for one of the "frustratingly easy" papers.
  • In an NSF grant review for an educational proposal, we were informed by 4 of 7 reviewers (who each wrote about a paragraph) that our ideas had been done in SIGCSE several times.  Before submitting, we had skimmed/read the past 8 years of SIGCSE and could find nothing.  (Maybe it's true and we were just looking in the wrong place, but that still isn't helpful.)  In the end it strongly seemed that this was basically their way of saying "you are not one of us."
  • In a paper on technique X for task A, we were flatly told that it's well known that technique Y works better, with no citations.  The paper was rejected; we went and implemented Y, and found that it worked worse on task A.  We later found one paper saying that Y works better than X on task B, for B fairly different from A.
  • In another paper, we were told that what we were doing had been done before and in this case a citation was provided.  The citation was to one of our own papers, and it was quite different by any reasonable metric.  At least a citation was provided, but it was clear that the reviewer hadn't bothered reading it.
  • We were told that we missed an enormous amount of related work that could be found by a simple web search.  I've written such things in reviews myself, often saying something like "search for 'non-parametric Bayesian'" or the like.  But here, no keywords were provided.  It's entirely possible (especially when you move into a new domain) that you miss a large body of related work because you don't know how to find it: that's fine -- just tell me how to find it if you don't want to actually provide citations.
There are other examples I could cite from my own experience, but I think you get the idea.

I'm posting this not to gripe (though it's always fun to gripe about reviewing), but to try to draw attention to this problem.  It's really just an issue of laziness.  If I had bothered trying to look up a reference for Alice Smith's paper, I would have immediately realized I was wrong.  But I was lazy.  Luckily this didn't really adversely affect the acceptance of this paper (journals are useful in that way -- authors can push back -- and yes, I know you can push back in author responses too, but you really need two rounds to make it work in a case like this).

Ever since the experience above, I've tried really, really hard never to do this again.  And I would encourage future reviewers to resist the temptation to do it: you may find your memory isn't as good as you think.  I would also encourage area chairs and co-reviewers to push their colleagues to actually provide citations for otherwise unsubstantiated claims.