Machine learning as a field has been very convex-happy for the past decade or so. So much so that when I saw a tutorial on submodular optimization in ML (one of the best tutorials I've seen), they said something along the lines of "submodularity will be for this decade what convexity was for the last decade." (Submodularity is cool and I'll post about it more in the future, but it's kind of a discrete analog of convexity. There's a NIPS workshop on the topic coming up.) This gives a sense of how important convexity has been.

There's also a bit of an undercurrent of "convexity isn't so great" from other sides of the ML community (roughly from the neural nets folks); see, for instance, Yann LeCun's talk Who's Afraid of Non-convex Loss Functions, a great and entertaining talk.

There's a part of me that loves convexity. Not having to do random restarts, being assured of global convergence, etc., all sounds very nice. I use logistic regression/maxent for almost all of my classification needs, have never run a neural network, and have only occasionally used svms (though of course they are convex, too). When I teach ML (as I'm doing now), I make a bit deal about convexity: it makes life easy in many ways.

That said, almost none of my recent papers reflect this. In fact, in the structure compilation paper, we flat out say that non-linearity in the model (which leads to a non-convex loss function) is the major reason why CRFs outperform independent classifiers in structured prediction tasks! Moreover, whenever I start doing Bayesian stuff, usually solved with some form of MCMC, I've completely punted on everything convex. In a "voting with my feet" world, I could care less about convexity! For the most part, if you're using EM or sampling or whatever, you don't care much about it either. Somehow we (fairly easily!) tolerate whatever negative effects there are of non-convex optimization.

I think one reason why such things don't both us, as NLPers, as much as they bother the average machine learning person is that we are willing to invest some energy in intelligent initialization. This already puts us in a good part of the optimization landscape, and doing local hillclimbing from there is not such a big deal. A classic example is the "Klein and Manning" smart initializer for unsupervised parsing, where a small amount of human knowledge goes a long way above a random initializer.

Another style of initialization is the IBM alignment model style. IBM model 4 is, of course, highly non-convex and ridiculously difficult to optimize (the E step is intractable). So they do a smart initialization, using the output of model 3. Model 3, too, is highly non-convex (but not quite so much so), so they initialize with model 2. And so on, down to model 1, which is actually convex and fairly easy to optimize. This sequencing of simple models to complex models also happens in some statistical analysis, where you first fit first order effects and then later fit higher order effects. The danger, of course, is that you got to a bad hill to climb, but this overall generally appears to be a bigger win than starting somewhere in the middle of a random swamp. (Of course, later, Bob Moore had this cute argument that even though model 1 is convex, we don't actually ever optimize it to the global optimum, so doing clever initialization for model 1 is also a good idea!)

These two ideas: clever initialization, and sequential initialization, seem like powerful ideas that I would like to see make their way into more complex models. For instance, in the original LDA paper, Dave Blei used an initialization where they pick some random documents as seeds for topics. As far as I know, no one really does this anymore (does anyone know why: does it really not matter?), but as we keep building more and more complex models, and lose hope that our off the shelf optimizer (or sampler) is going to do anything reasonable, we're probably going to need to get back to this habit, perhaps trying to formalize it in the meantime.

Half-Exponential No More

14 hours ago