21 October 2009

Convex or not, my schizophrenic personality

Machine learning as a field has been very convex-happy for the past decade or so. So much so that in a tutorial on submodular optimization in ML (one of the best tutorials I've seen), the presenters said something along the lines of "submodularity will be for this decade what convexity was for the last decade." (Submodularity is cool and I'll post about it more in the future, but it's kind of a discrete analog of convexity. There's a NIPS workshop on the topic coming up.) This gives a sense of how important convexity has been.
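For anyone who hasn't seen the definition, submodularity is usually stated as diminishing returns, which is where the "discrete analog of convexity" intuition comes from:

\[
  f(A \cup \{x\}) - f(A) \;\ge\; f(B \cup \{x\}) - f(B)
  \qquad \text{for all } A \subseteq B \subseteq V \text{ and } x \in V \setminus B,
\]

i.e., adding an element to a smaller set gains you at least as much as adding it to a larger one.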

There's also a bit of an undercurrent of "convexity isn't so great" from other sides of the ML community (roughly from the neural nets folks); see, for instance, Yann LeCun's great and entertaining talk Who's Afraid of Non-convex Loss Functions.

There's a part of me that loves convexity. Not having to do random restarts, being assured of global convergence, etc., all sounds very nice. I use logistic regression/maxent for almost all of my classification needs, have never run a neural network, and have only occasionally used SVMs (though of course they are convex, too). When I teach ML (as I'm doing now), I make a big deal about convexity: it makes life easy in many ways.
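Just to make the "life is easy" point concrete, here's a tiny sketch (mine, purely illustrative, not anything from a paper): run plain gradient descent on an L2-regularized logistic loss from two wildly different starting points and you land on essentially the same weights, which is exactly why random restarts never enter the picture.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = (X @ true_w + 0.5 * rng.normal(size=200) > 0).astype(float)

def grad(w, lam=0.1):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))       # predicted probabilities
    return X.T @ (p - y) / len(y) + lam * w  # gradient of the regularized log-loss

def fit(w0, steps=5000, lr=0.5):
    w = w0.copy()
    for _ in range(steps):
        w -= lr * grad(w)
    return w

w_a = fit(rng.normal(size=5))
w_b = fit(10 * rng.normal(size=5))           # a very different starting point
print(np.max(np.abs(w_a - w_b)))             # essentially zero: same global optimum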

That said, almost none of my recent papers reflect this. In fact, in the structure compilation paper, we flat out say that non-linearity in the model (which leads to a non-convex loss function) is the major reason why CRFs outperform independent classifiers in structured prediction tasks! Moreover, whenever I start doing Bayesian stuff, usually solved with some form of MCMC, I've completely punted on everything convex. In a "voting with my feet" world, I couldn't care less about convexity! For the most part, if you're using EM or sampling or whatever, you don't care much about it either. Somehow we (fairly easily!) tolerate whatever negative effects there are of non-convex optimization.

I think one reason why such things don't bother us, as NLPers, as much as they bother the average machine learning person is that we are willing to invest some energy in intelligent initialization. This already puts us in a good part of the optimization landscape, and doing local hillclimbing from there is not such a big deal. A classic example is the "Klein and Manning" smart initializer for unsupervised parsing, where a small amount of human knowledge goes a long way beyond a random initializer.
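To give a flavor of how cheap that kind of knowledge can be, here's a rough sketch in the spirit of that initializer (the real Klein and Manning recipe has more to it; this just encodes the "prefer nearby heads" bias):

import numpy as np

def harmonic_init(sentence_length):
    """Initial P(dependent j | head i), favoring short dependencies."""
    n = sentence_length
    scores = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                scores[i, j] = 1.0 / abs(i - j)    # closer words get more mass
    return scores / scores.sum(axis=1, keepdims=True)

print(harmonic_init(5).round(2))   # each row: attachment distribution for one head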

Another style of initialization is the IBM alignment model style. IBM model 4 is, of course, highly non-convex and ridiculously difficult to optimize (the E step is intractable). So they do a smart initialization, using the output of model 3. Model 3, too, is highly non-convex (but not quite so much so), so they initialize it with model 2. And so on, down to model 1, which is actually convex and fairly easy to optimize. This sequencing from simple models to complex models also happens in some statistical analysis, where you first fit first-order effects and only later fit higher-order effects. The danger, of course, is that you end up on a bad hill to climb, but overall this generally appears to be a bigger win than starting somewhere in the middle of a random swamp. (Of course, later, Bob Moore had this cute argument that even though model 1 is convex, we don't actually ever optimize it to the global optimum, so doing clever initialization for model 1 is also a good idea!)
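Here's roughly what that sequencing looks like in code (a toy corpus and a stripped-down Model 1 without the NULL word; the richer model is only hinted at): the point is simply that the easy model's parameters become the hard model's starting point.

from collections import defaultdict

corpus = [(["the", "house"], ["la", "maison"]),
          (["the", "book"], ["le", "livre"]),
          (["a", "house"], ["une", "maison"])]

def train_model1(corpus, iterations=10):
    """EM for a stripped-down IBM Model 1: estimate t(f|e), ignoring positions."""
    e_vocab = {e for es, _ in corpus for e in es}
    t = defaultdict(lambda: 1.0 / len(e_vocab))       # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)                    # expected counts (E step)
        total = defaultdict(float)
        for es, fs in corpus:
            for f in fs:
                z = sum(t[(f, e)] for e in es)        # normalize over possible alignments
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():               # M step
            t[(f, e)] = c / total[e]
    return dict(t)

t1 = train_model1(corpus)

# A richer model (Model 2, 3, ...) would add distortion/fertility parameters;
# the sequential-initialization trick is just to start its lexical table here
# instead of at uniform or random values.
model2_lexical_table = dict(t1)
print(sorted(t1.items(), key=lambda kv: -kv[1])[:3])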

These two ideas, clever initialization and sequential initialization, seem like powerful techniques that I would like to see make their way into more complex models. For instance, in the original LDA paper, Dave Blei used an initialization that picks some random documents as seeds for topics. As far as I know, no one really does this anymore (does anyone know why: does it really not matter?), but as we keep building more and more complex models, and lose hope that our off-the-shelf optimizer (or sampler) is going to do anything reasonable, we're probably going to need to get back into this habit, and perhaps try to formalize it along the way.
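For completeness, here's roughly what that seeding looks like (illustrative code with a hypothetical helper name; the real version lives in Blei's lda-c): each topic's word distribution is initialized from the smoothed counts of a few randomly chosen documents, rather than from pure noise.

import numpy as np

def seeded_topic_init(doc_word_counts, num_topics, docs_per_topic=1,
                      smoothing=1.0, rng=None):
    """doc_word_counts: (num_docs, vocab_size) array of raw counts."""
    if rng is None:
        rng = np.random.default_rng()
    num_docs, vocab = doc_word_counts.shape
    beta = np.full((num_topics, vocab), smoothing)     # smoothed base counts
    for k in range(num_topics):
        seeds = rng.choice(num_docs, size=docs_per_topic, replace=False)
        beta[k] += doc_word_counts[seeds].sum(axis=0)  # seed topic k with those documents
    return beta / beta.sum(axis=1, keepdims=True)      # normalize to distributions

counts = np.random.default_rng(0).integers(0, 5, size=(20, 50))
topics = seeded_topic_init(counts, num_topics=4)
print(topics.shape)   # (4, 50): one word distribution per topic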

2 comments:

Anonymous said...

Maybe I can answer the question at the end about LDA. Because of the way variational inference for LDA is formulated as an optimization problem, you need to initialize with something that breaks the symmetry, and how you do this matters, which is why, AFAIK, Blei's lda-c distribution by default still uses the seeding method you mentioned.

A lot of people have lately moved towards using collapsed Gibbs sampling. Initialization seems to matter less here, and you needn't break any symmetries since the non-determinism will take care of that for you.

Mark Dredze said...

Convexity still matters, but we care less about it when doing things that are never going to be convex. For supervised learning, we can get nice convex optimization for basically anything we want, and so we are happy. Why deal with non-convexity?

For the more complex graphical models, could we ever hope to achieve convexity? No. If we could define a convex model, then we would. But since we can't learn these models without using non-convex objectives, we are willing to settle for non-convexity.

The interesting tension comes when you have reasonable convex approximations for non-convex objectives. There are a number of papers that need to make this decision: solve the non-convex objective, or exactly solve a convex approximation.

As we continue to look at these complex graphical models, the understanding of issues with non-convexity will become increasingly important.
