Comments on natural language processing blog: "A bad optimizer is not a good thing" (blog by hal, http://www.blogger.com/profile/02162908373916390369)

With respect to the initialization example, I don't think you can have a "black box" analysis: optimization methods are too different — e.g. gradient descent, which regularizes towards the initial point, vs. a bundle method (without line search), which will jump around all over the place until it proves that the optimum is not hiding in some corner. But for GD it's pretty straightforward to show the relationship between initialization with early stopping and regularization towards the initial point (for some lambda that decays with the number of steps).

In the embedding example, you have two ways to fix the problem: (a) rely on a bad optimizer, or (b) fix your objective. In your example you can add constraints to your objective inspired by approaches like PCA and CCA: by adding the requirement that the embedded English and Japanese have the identity as second moment, the trivial solution disappears.

The variational autoencoder does something similar (IIRC the objective includes a KL-divergence penalty from a standard multivariate Gaussian for the embedding).

The topic model example also has an underspecified objective. One way to fix it is to order the topics: the first topic is the most popular one. This is exactly the same trick people use with eigenvectors (where popularity is the eigenvalue).
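A minimal sketch of that ordering trick, using a small symmetric matrix I made up for illustration: the eigenvectors are only defined up to permutation, but sorting by eigenvalue ("popularity") pins down a unique order, just as sorting topics by popularity would.

```python
import numpy as np

# Hypothetical 2x2 example: eigenvectors of a symmetric matrix come in no
# inherent order, but sorting by eigenvalue -- the "popularity" -- removes
# the permutation ambiguity, exactly the trick suggested for topics.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
vals, vecs = np.linalg.eigh(A)        # eigh returns eigenvalues ascending
order = np.argsort(vals)[::-1]        # most "popular" eigenvector first
vals, vecs = vals[order], vecs[:, order]
```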
The "topics as tensor eigenvectors" view is proving helpful sometimes...
— Nikos (https://www.blogger.com/profile/11112961058824811801), 2016-05-18

Here is an old example of an optimization problem for which one hopes that the optimizer will not find the degenerate global maximum: a mixture of Gaussians.
— Anonymous, 2016-05-17

Yup. What I always tell people: if you suspect that the answer is close to θ₀, then you should modify your objective function to regularize toward θ₀, typically by adding a multiple of ||θ-θ₀||². It's then certainly reasonable to initialize the optimizer at the regularizer's favorite point, θ = θ₀. But you shouldn't skip the regularization term and use <i>only</i> initialization to inject this prior knowledge, because that relies on a broken optimizer. A sufficiently good optimizer won't be sensitive to initialization!
— Jason Eisner (http://cs.jhu.edu/~jason), 2016-05-15

Probably a sounder alternative to "clever initialization" for doing some sort of transfer learning is a multi-channel architecture, where you provide several different embedding representations of your input sentences (e.g. random, word2vec, GloVe) and possibly leave some of them fixed.
Does this sound ok?
— deadbeef (https://www.blogger.com/profile/15190248060191946422), 2016-05-14

A third response might be curriculum learning (though maybe curriculum learning is a refinement of your initialization response). Certainly, the idea that success in learning (not just speed) should have some sort of dependence on the order in which training data are presented is not unreasonable. Of course, iterative "bad optimization" is an unsatisfying way to instantiate this (how should we analyze it?), but it's currently the only game in town, and it's not unreasonable that the real learners we're trying to imitate are themselves bad optimizers (I know I am). I guess it wouldn't surprise me if, even with a perfect optimizer, we still found that bad optimizers plus curricula (or something like that) are a good idea practically, and perhaps for interesting reasons.
— Chris (https://www.blogger.com/profile/02873949286995651782), 2016-05-14
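To make the curriculum idea above concrete, here is a toy sketch — every detail is hypothetical (the synthetic data, the margin-based difficulty score, a single SGD pass) — of presenting training examples easy-to-hard instead of in random order:

```python
import numpy as np

# Toy curriculum sketch (all choices hypothetical): one pass of
# logistic-regression SGD over the data in easy-to-hard order, where
# "easy" means a large margin under the true weights, instead of the
# usual random order.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = (X @ w_true > 0).astype(float)        # noiseless labels

order = np.argsort(-np.abs(X @ w_true))   # easiest (largest margin) first
w = np.zeros(3)
for i in order:                           # one epoch in curriculum order
    p = 1.0 / (1.0 + np.exp(-X[i] @ w))   # predicted P(y=1)
    w += 0.1 * (y[i] - p) * X[i]          # SGD step on the log-likelihood
```

The point is only that `order` is a deliberate permutation of the data; a random learner would use `rng.permutation(len(X))` instead, and the "bad optimizer" question is whether the two passes end up in meaningfully different places.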