14 May 2016

A bad optimizer is not a good thing

A very popular style of research in NLP and ML is the math abstraction. You cast your learning problem as some sort of objective function that you want to optimize. Or, if you're feeling Bayesian, you write down a joint likelihood that you'll either sample from or, yes, turn into an objective function that you want to optimize. The optimizer is then typically considered a black box, aside from its hyperparameters which you often must tune.

This is a very attractive style of research and one that I've personally gotten a lot of leverage out of. (Though it's worth emphasizing for reviewer #2 out there that this is not the only way to solve problems, and that not having an explicit objective function that an optimizer is cranking on does not mean your problem formulation is bad.) The main advantage here is separation of concerns. I can worry about making a good objective function, and my optimization friends can worry about optimizing it. Of course it's never that simple, and we must consider what can possibly be optimized when making our objective functions (see, for instance, the obsession with convexity 10 years ago and the anti-obsession with convexity today).

At the end of the day, like any abstraction, it's not perfect, and we need to break it occasionally.

There is, however, something that's struck me as peculiar going back to the heyday of topic models 10 or so years ago, but which applies equally well now. This is a strange world in which we actually hope that our optimizer doesn't work (or s/optimizer/sampler/). To me, this seems like a bad thing to hope for. I think we need to acknowledge that for hard problems, our optimizers and samplers probably won't work, and we need to take that into account. But I don't think we should actively hope that they don't work: that just seems broken.

There are two places where I see this hope arise.

1. Occasionally you see objective functions that have trivial solutions that you're hoping won't be found. I've seen this several times in the context of multimodal learning. Suppose that you want to learn some sort of joint embedding of English words and Japanese words. I can write down some embedding functions f(e) and g(j) using whatever neural net magic I want. The outputs are vectors. Now, I want to have them supervise each other, and so I make an objective function that looks like ||A f(e) - B g(j)||^2 for some matrices A and B. The joint objective is to learn A, B, and all the parameters inside f and g. This seems reasonable, until you realize that simply setting A=B=0 provides not only a local but a global minimum of this objective function. Of course, since these things are non-convex, we often initialize randomly, and in most cases this global minimum actually isn't found (otherwise such approaches would never work). But this is just a broken way to formulate a problem, because I'm relying on the fact that my optimizer doesn't actually work. (There's a small sketch of this degenerate solution just after this list.)

2. Much more commonly, the hope of a broken optimizer comes from the heuristic of using "initialization" to build a better model. The general scheme here is that I have some prior knowledge that I want to inject into the system, and I do so by initializing my model's parameters informed by that knowledge. I then optimize. But again, here, we're hoping that the optimizer fails! If the optimizer were perfect, it should be insensitive to the initialization (modulo picking one of multiple global optima, but I don't think that's what's really going on here). You see the same thing in Bayesian land, wherein you basically hope that your sampler doesn't mix properly. Again, this is a case where we're relying on the fact that we don't think our optimizers will actually work.
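
To make #1 concrete, here is a minimal numpy sketch. Everything in it is made up (the encoders, the dimensions, the batch); the only point is that with A = B = 0 the loss is exactly zero no matter what f and g compute.

    import numpy as np

    rng = np.random.default_rng(0)
    d_e, d_j, d = 300, 200, 128            # made-up input/output dimensions

    W_e = rng.normal(size=(d_e, d))        # fixed random "encoder" weights
    W_j = rng.normal(size=(d_j, d))

    def f(e):                              # stand-in for the English encoder
        return np.tanh(e @ W_e)

    def g(j):                              # stand-in for the Japanese encoder
        return np.tanh(j @ W_j)

    A = np.zeros((d, d))                   # the trivial "solution"
    B = np.zeros((d, d))

    e = rng.normal(size=(32, d_e))         # a fake batch of English features
    j = rng.normal(size=(32, d_j))         # and the paired Japanese features

    loss = np.sum((f(e) @ A.T - g(j) @ B.T) ** 2)
    print(loss)                            # 0.0, regardless of what f and g do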

I think it's fair to say that #1 is pretty broken; for #2, there are a number of possible responses that are worth discussing briefly.

Response 1: Early stopping. We rarely run optimizers (or samplers) to completion and instead stop them early based on some held-out measure of generalization. For many optimizers, it's possible to show that early stopping is equivalent to regularization (I first saw this as a homework exercise in the old Bishop neural nets book, but it's been reinvented many times since then... and probably before then). You can alternatively pull out some online learning theory and argue that online updates are implicitly regularizing (toward zero). We can therefore conclude that initialization plus early stopping is essentially a form of regularization toward the initialization point. This is reasonable, but it would be nice to actually lay out the theory more solidly, in a way that's not specific to a particular (model, regularizer, optimizer) triple. (Maybe this has been done but I haven't seen it.)
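
Here's a quick numerical illustration of that claim on a toy least-squares problem (a sketch, not a proof; all the numbers are arbitrary): gradient descent started at θ₀ and stopped after T steps lands close to the minimizer of the same loss plus λ||θ-θ₀||², with earlier stopping corresponding to larger λ.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 50, 10
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    theta0 = rng.normal(size=d)            # the initialization ("prior knowledge")
    eta = 1e-3                             # step size

    def gd(steps):
        """Plain gradient descent on ||X theta - y||^2, started at theta0."""
        theta = theta0.copy()
        for _ in range(steps):
            theta -= eta * 2 * X.T @ (X @ theta - y)
        return theta

    def ridge_toward(lam):
        """Exact minimizer of ||X theta - y||^2 + lam * ||theta - theta0||^2."""
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * theta0)

    lams = np.logspace(-3, 3, 200)
    for steps in [10, 100, 1000, 10000]:
        th = gd(steps)
        gaps = [np.linalg.norm(th - ridge_toward(lam)) for lam in lams]
        print(f"{steps:>5} GD steps: ||theta - theta0|| = "
              f"{np.linalg.norm(th - theta0):.3f}; closest 'regularize toward "
              f"theta0' solution has lambda ~ {lams[int(np.argmin(gaps))]:.3g} "
              f"(distance {min(gaps):.3f})")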

Response 2: Initialization isn't to help the model, it's to help the optimizer. This is a pretty solid response. I think that when we do initialization, this is probably how we should think about it. We think the model is perfect (or at least good) but that the optimizer is going to have a hard time, so we start it somewhere reasonable. I think this perspective is hard to reconcile with the "initialization as prior knowledge view" (it's possible, but requires some mental acrobatics), but it's a reasonable view.

So what should we be doing? Well, if someone wants to prove something along the lines of Response 1, that would be nice :). Or alternatively just point me to an existing result!

More generally, though, I think the model vs optimization abstraction is quite useful. If we want to incorporate prior knowledge into the model, then some sort of regularization (or prior) seems like a very reasonable approach. (Yes, there are others.) If we want to try to help out our optimizer, then initialization may be a reasonable approach, but it really depends on the optimizer. One can of course do both: the "regularize toward X" approach together with initialization at (or near) X. (Though in my experience often regularizing toward X means that initializing at X isn't necessary because it already introduces a good bias and symmetry breaking.)

5 comments:

Chris said...

A third response might be curriculum learning (but maybe curriculum learning is a refinement of your initialization response). Certainly, the idea that success in learning (not just speed) should have some sort of dependence on the order in which training data is presented is not unreasonable. Of course, iterative "bad optimization" is an unsatisfying way to instantiate this (how should we analyze it?), but it's currently the only game in town, and it's not unreasonable that the real learners we're trying to imitate are, themselves, bad optimizers (I know I am). I guess it wouldn't surprise me if, even with a perfect optimizer, we still found that bad optimizers + curricula (or something like that) are a good idea practically, and perhaps for interesting reasons.

deadbeef said...

Probably a more sound alternative to "clever initialization" for doing some sort of transfer learning is a multi-channel architecture, where you provide several different embedding representations of your input sentences (e.g. random, word2vec, GloVe) and optionally leave some of them fixed. Does this sound ok?

Jason Eisner said...

Yup. What I always tell people: If you suspect that the answer is close to θ₀, then you should modify your objective function to regularize toward θ₀, typically by adding a multiple of ||θ-θ₀||². It's then certainly reasonable to initialize the optimizer at the regularizer's favorite point, θ = θ₀. But you shouldn't skip the regularization term and use only initialization to inject this prior knowledge, because that relies on a broken optimizer. A sufficiently good optimizer won't be sensitive to initialization!
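
In code, that recipe looks roughly like this (a toy sketch; the quadratic loss and the value of λ are placeholders): keep the penalty in the objective and also start at θ₀, so nothing depends on the optimizer being weak.

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 50, 10
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    theta0 = rng.normal(size=d)        # where prior knowledge says the answer is
    lam, eta = 1.0, 1e-3               # placeholder regularization strength, step size

    theta = theta0.copy()              # initialize at the regularizer's favorite point
    for _ in range(5000):
        # gradient of ||X theta - y||^2 + lam * ||theta - theta0||^2
        grad = 2 * X.T @ (X @ theta - y) + 2 * lam * (theta - theta0)
        theta -= eta * grad
    print(np.linalg.norm(theta - theta0))   # stays near theta0 unless the data pull it away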

Anonymous said...

Here is an old example of an optimization problem for which one hopes that the optimizer will not find the degenerate global maximum: a mixture of Gaussians....
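
To spell that example out: center one mixture component on a single data point and let its variance shrink toward zero; the likelihood grows without bound, so the global maximum is degenerate and we quietly hope the optimizer never finds it. A quick sketch (scipy for the Gaussian density):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    x = rng.normal(size=100)           # some 1-D data

    def loglik(sigma1):
        """Two-component mixture: component 1 sits exactly on x[0] with
        standard deviation sigma1; component 2 is a broad N(0, 1)."""
        dens = (0.5 * norm.pdf(x, loc=x[0], scale=sigma1)
                + 0.5 * norm.pdf(x, loc=0.0, scale=1.0))
        return np.sum(np.log(dens))

    for s in [1.0, 0.1, 1e-2, 1e-4, 1e-8]:
        print(f"sigma1 = {s:g}   log-likelihood = {loglik(s):.1f}")
    # the log-likelihood keeps growing as sigma1 -> 0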

Nikos said...

With respect to the initialization example, I don't think you can have a "black box" analysis: optimization methods are too different, e.g. gradient descent, which regularizes towards the initial point, vs. a bundle method (without line search), which will jump around all over the place until it proves that the optimum is not hiding in some corner. But for GD it's pretty straightforward to show the relationship between initialization plus early stopping and regularization towards the initial point (for some lambda that decays with the number of steps).

In the embedding example, you have two ways to fix the problem: (a) rely on a bad optimizer, or (b) fix your objective. In your example you can add constraints to your objective, inspired by approaches like PCA and CCA: by adding the requirement that the embedded English and Japanese representations have the identity as their second moment, the trivial solution disappears.
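
One way to read that fix in code (a sketch; the two batches below are just random stand-ins for A f(e) and B g(j)): whiten each view so its second moment is the identity. An all-zeros embedding can't satisfy that constraint, so the trivial minimum of the matching loss is gone.

    import numpy as np

    rng = np.random.default_rng(4)
    n, d = 256, 16
    U = rng.normal(size=(n, d))        # stand-in for a batch of A f(e)
    V = rng.normal(size=(n, d))        # stand-in for a batch of B g(j)

    def whiten(Z, eps=1e-6):
        """Map a batch so its second moment Z'Z/n is (approximately) the identity.
        An all-zero batch cannot be made to satisfy this constraint."""
        C = Z.T @ Z / len(Z) + eps * np.eye(Z.shape[1])
        evals, evecs = np.linalg.eigh(C)
        return Z @ evecs @ np.diag(evals ** -0.5) @ evecs.T

    Uw, Vw = whiten(U), whiten(V)
    print(np.allclose(Uw.T @ Uw / n, np.eye(d), atol=1e-3))   # True: constraint holds
    loss = np.sum((Uw - Vw) ** 2)      # the matching loss, now over constrained views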

The variational autoencoder does something similar (IIRC the objective includes a KL-divergence penalty from a standard multivariate Gaussian for the embedding).

The topic model example also has an underspecified objective. One way to fix it is to order the topics: the first topic is the most popular one. This is exactly the same trick people use with eigenvectors (where popularity is the eigenvalue). The "topics as tensor eigenvectors" view is proving helpful sometimes...