In the context of some work on multitask learning, we came to realize that classification is kind of weird. Or at least linear classification. It's not that it's weird in a way that we didn't already know: it's just sort of a law of unexpected consequences.
If we're doing linear (binary) classification, we all know that changing the magnitude of the weight vector doesn't change the predictions. A standard exercise in a machine learning class might be to show that if your data is linearly separable, then for some models (for instance, unregularized models), the best solution is usually an infinite norm weight vector that's pointing in the right direction.
This is definitely not true of (linear) regression. Taking a good (or even perfect) linear regressor and blowing up the weights by some constant will kill your performance. By adding a regularizer, what you're basically doing is just saying how big you want that norm to be.
Of course, by regression I simply mean minimizing something like squared error and by classification I mean something like 0/1 loss or hinge loss or logistic loss or whatever.
I think this is stuff that we all know.
Where this can bite you in unexpected ways is the following. In lots of problems, like domain adaptation and multitask learning, you end up making assumptions roughly of the form "my weight vector for domain A should look like my weight vector for domain B" where "look like" is really the place where you get to be creative and define things how you feel best.
This is all well and good in the regression setting. A magnitude 5 weight in regression means a magnitude 5 weight in regression. But not so in classification. Since you can arbitrarily scale your weight vectors and still get the same decision boundaries, a magnitude 5 weight kind of means nothing. Or at least it means something that has to do more with the difficulty of the problem and how you chose to set your regularization parameter, rather than something to do with the task itself.
Perhaps we should be looking for definitions of "look like" that are insensitive to things like magnitude. Sure you can always normalize all your weight vectors to unit norm before you co-regularize them, but that loses information as well.
Perhaps this is a partial explanation of some negative transfer. One thing that you see, when looking at the literature in DA and MTL, is that all the tasks are typically of about the same difficulty. My expectation is that if you have two tasks that are highly related, but one is way harder than the other, is going to lead to negative transfer. Why? Because the easy task will get low norm weights, and the hard task will get high norm weights. The high norm weights will pull the low norm weights toward them too much, leading to worse performance on the "easy" task. In a sense, we actually want the opposite to happen: if you have a really hard task, it shouldn't screw up everyone else that's easy! (Yes, I know that being Bayesian might help here since you'd get a lot of uncertainty around those high norm weight vectors!)