01 April 2010

Classification weirdness, regression simplicity

In the context of some work on multitask learning, we came to realize that classification is kind of weird. Or at least linear classification. It's not that it's weird in a way that we didn't already know: it's just a sort of law of unintended consequences.

If we're doing linear (binary) classification, we all know that changing the magnitude of the weight vector doesn't change the predictions. A standard exercise in a machine learning class might be to show that if your data is linearly separable, then for some models (for instance, unregularized logistic regression), the best solution is an infinite-norm weight vector pointing in the right direction.
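Here's a quick numpy sketch of both claims (a toy construction, nothing more): scaling w leaves every predicted label alone, while the unregularized logistic loss on separable data keeps shrinking as the norm grows.

```python
# Toy demonstration: predictions are scale-invariant, but on separable
# data the unregularized logistic loss keeps dropping as ||w|| grows.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
w = np.array([1.0, -2.0])
y = np.sign(X @ w)                     # separable by construction

def logistic_loss(w):
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

for c in [1.0, 5.0, 100.0]:
    same = np.all(np.sign(X @ (c * w)) == y)
    print(c, same, logistic_loss(c * w))
# labels are identical for every c, but the loss heads to 0, so the
# "best" w has infinite norm (in the right direction)
```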

This is definitely not true of (linear) regression. Taking a good (or even perfect) linear regressor and blowing up the weights by some constant will kill your performance. By adding a regularizer, what you're basically doing is just saying how big you want that norm to be.
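The same scaling experiment in the regression setting (again just a toy sketch): start from the exact weights and blow them up.

```python
# Toy demonstration: squared error is not scale-invariant, so scaling
# even the *perfect* regression weights destroys performance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
w = np.array([1.0, -2.0])
y = X @ w                              # noiseless: w is the perfect regressor

for c in [1.0, 2.0, 5.0]:
    mse = np.mean((X @ (c * w) - y) ** 2)
    print(c, mse)                      # 0 at c=1, then grows like (c-1)^2
```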

Of course, by regression I simply mean minimizing something like squared error and by classification I mean something like 0/1 loss or hinge loss or logistic loss or whatever.

I think this is stuff that we all know.

Where this can bite you in unexpected ways is the following. In lots of problems, like domain adaptation and multitask learning, you end up making assumptions roughly of the form "my weight vector for domain A should look like my weight vector for domain B," where "look like" is really the place where you get to be creative and define things however you see fit.

This is all well and good in the regression setting: a magnitude 5 weight means the same thing for domain A as for domain B. But not so in classification. Since you can arbitrarily scale your weight vectors and still get the same decision boundaries, a magnitude 5 weight kind of means nothing. Or at least it says more about the difficulty of the problem and how you chose to set your regularization parameter than about the task itself.

Perhaps we should be looking for definitions of "look like" that are insensitive to things like magnitude. Sure, you can always normalize all your weight vectors to unit norm before you co-regularize them, but that loses information as well.
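Just to make the trade-off concrete, here's a minimal sketch (my own toy construction, not a proposal) comparing the usual squared-distance co-regularizer against a cosine-based penalty that only sees direction:

```python
# Two candidate "look like" penalties for a pair of task weight vectors.
# Both are illustrative choices, not anything from real experiments.
import numpy as np

def sq_dist_penalty(w_a, w_b):
    # the usual co-regularizer: sensitive to magnitude
    return np.sum((w_a - w_b) ** 2)

def cosine_penalty(w_a, w_b):
    # 1 - cos(angle): depends only on direction, not on norm
    return 1.0 - (w_a @ w_b) / (np.linalg.norm(w_a) * np.linalg.norm(w_b))

w_easy = np.array([0.3, -0.6])    # low-norm solution (easy task)
w_hard = np.array([3.0, -6.0])    # same direction, 10x the norm (hard task)

print(sq_dist_penalty(w_easy, w_hard))   # large, despite identical directions
print(cosine_penalty(w_easy, w_hard))    # ~0: the norms are invisible
```

The squared-distance version charges a lot for two vectors that agree perfectly in direction; the cosine version treats them as identical, which is exactly the magnitude-insensitivity being asked for, and also exactly the information being thrown away.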

Perhaps this is a partial explanation of some negative transfer. One thing you see, when looking at the literature in DA and MTL, is that the tasks are typically of about the same difficulty. My expectation is that having two tasks that are highly related, but where one is way harder than the other, is going to lead to negative transfer. Why? Because the easy task will get low norm weights, and the hard task will get high norm weights. The high norm weights will pull the low norm weights toward them too much, leading to worse performance on the "easy" task. In a sense, we actually want the opposite to happen: if you have a really hard task, it shouldn't screw up everyone else that's easy! (Yes, I know that being Bayesian might help here, since you'd get a lot of uncertainty around those high norm weight vectors!)
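Here's a tiny tug-of-war along those lines (again a toy of my own: two 1-d quadratic task losses standing in for an easy task with a low-norm optimum and a hard task with a high-norm one, coupled by lam * (a - b)^2):

```python
# Toy negative-transfer illustration: couple an "easy" 1-d task
# (optimum a* = 0.5) to a "hard" one (optimum b* = 5.0) and watch the
# easy task's solution get dragged off as the coupling strengthens.
import numpy as np

a_star, b_star = 0.5, 5.0
alpha, beta = 1.0, 1.0            # per-task loss curvatures

for lam in [0.0, 0.5, 5.0]:
    # first-order conditions of alpha*(a-a*)^2 + beta*(b-b*)^2 + lam*(a-b)^2
    A = np.array([[alpha + lam, -lam],
                  [-lam, beta + lam]])
    rhs = np.array([alpha * a_star, beta * b_star])
    a, b = np.linalg.solve(A, rhs)
    print(lam, round(a, 3), round(b, 3), round(alpha * (a - a_star) ** 2, 3))
# as lam grows, a drifts toward b and the easy task's own loss climbs
```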

5 comments:

  1. I'm a little confused... a classifier can be rescaled and perform the same (measured on 0-1 loss), but when learning, the training criterion may care about the scale (0-1 loss doesn't care, log loss does). So it seems that a magnitude 5 weight does kind of mean something, if you're using something like log loss. What have I missed?

  2. I think I am missing something too. In regression, scaling the weights changes the slope. In classification, scaling the weights changes the slope too, but the zero level-set stays the same. How is it that if you make it insensitive to such changes you lose information? How does the magnitude of the weight vector relate to "hardness"?

  3. How about just looking at the amount of information provided by a certain feature, as considered by the model? If the model is probabilistic, all is well.

  4. @Chris: It definitely does depend on the loss function; in practice, since we always use convex upper bounds on 0/1 loss, I agree with you that it means something, even if it's not quite clear what it means :).

    @Sam: Hrm... I guess I maybe said more than I'm willing to really defend, but certainly for a hard-margin SVM the margin is 1/||w||, so the higher the norm, the smaller the margin, and hence the "harder" the problem. That's the intuition I was building on. I still vaguely stand behind it, since we often use the norm of w as a surrogate for complexity, which is a surrogate for difficulty.
