Comments on natural language processing blog: Hyperparameter search, Bayesian optimization and related topics

Comment, 2014-10-13:

Why not just fit the hyperparameters and parameters together in a hierarchical model? Basically, meta-analysis and what machine learning people call "adaptation" can both be cast as hierarchical modeling techniques. An example is Hal's "frustratingly easy" approach to adaptation, which Finkel and Manning later formulated as a standard hierarchical model.

There are usually two reasons not to fit the full hierarchical model. The first is computation: MCMC, which lets you do exactly the computation you want (take the posterior-mean point estimate, which minimizes expected squared error), is too slow for large data sets. The second is that straightforward optimization fails in the hierarchical setting, because the density grows without bound as the hierarchical variance shrinks to zero (or, as David MacKay aptly put it in his book, "EM goes boom").

What you can do instead is approximate with point estimates, also known as empirical Bayes (a huge misnomer: it's no more empirical than full Bayes in a hierarchical model). The standard approach is maximum marginal likelihood (MML), as in the lme4 package in R: marginalize out the low-level parameters, optimize the hyperparameters, then fix the hyperparameters and estimate the low-level parameters in a second pass.

We're working on doing this in full generality in Stan, using Laplace approximations for the marginalization, but we're not quite all the way there yet.
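The two-pass MML recipe described above is easy to see in a toy conjugate example. The sketch below is illustrative only (it is not Stan or lme4, and all the numbers are made up): a two-level normal model theta_j ~ Normal(mu, tau^2), y_ij ~ Normal(theta_j, sigma^2) with sigma known. Pass 1 integrates out theta_j and optimizes the hyperparameter tau; pass 2 fixes the hyperparameters and recovers the familiar shrinkage estimates of theta_j.

```python
import numpy as np

# Toy two-level normal model (illustrative; not Stan or lme4):
#   theta_j ~ Normal(mu, tau^2),  y_ij ~ Normal(theta_j, sigma^2), sigma known.
# MML / "empirical Bayes": (1) marginalize out theta_j and optimize the
# hyperparameters; (2) fix them and estimate theta_j in a second pass.

rng = np.random.default_rng(0)
sigma, n_groups, n_per = 1.0, 8, 20
true_theta = rng.normal(0.0, 2.0, n_groups)
y = true_theta[:, None] + rng.normal(0.0, sigma, (n_groups, n_per))

ybar = y.mean(axis=1)        # per-group sample means
s2 = sigma**2 / n_per        # sampling variance of each group mean

def neg_marginal_loglik(mu, tau):
    # After integrating out theta_j:  ybar_j ~ Normal(mu, tau^2 + sigma^2/n)
    v = tau**2 + s2
    return 0.5 * np.sum(np.log(2 * np.pi * v) + (ybar - mu) ** 2 / v)

# Pass 1: optimize the hyperparameters (a grid search stands in for a real
# optimizer). With equal group sizes the MML estimate of mu is the grand mean.
mu_hat = ybar.mean()
taus = np.linspace(1e-3, 5.0, 500)
tau_hat = taus[np.argmin([neg_marginal_loglik(mu_hat, t) for t in taus])]

# Pass 2: with hyperparameters fixed, the posterior mean of theta_j is a
# shrinkage of the group mean toward mu_hat.
shrink = tau_hat**2 / (tau_hat**2 + s2)
theta_hat = mu_hat + shrink * (ybar - mu_hat)
print("tau_hat =", tau_hat, " mean abs error =", np.abs(theta_hat - true_theta).mean())
```

The "EM goes boom" pathology is visible here too: if you instead jointly optimized over (theta, tau), the joint density is unbounded as tau goes to zero with every theta_j pinned at mu, so marginalizing first is what makes the optimization well posed.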
We're also working on black-box EP and VB approaches, which may provide better point estimates and better approximations to uncertainty than a simple Laplace estimate based on curvature at the mode.

There's a whole book by Brad Efron, the inventor of the bootstrap, on these kinds of techniques, with applications to large-scale genomics problems: <i>Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction</i>.

-- Bob Carpenter (http://mc-stan.org/)

Comment, 2014-10-11:

Thanks for discussing our work, Hal! You are insightful as always. I thought I might point out a few things in response to your comments. (Please forgive my gratuitous plugging of my own papers.)

The idea of learning across multiple data sets is something we find very useful; Kevin Swersky, Jasper Snoek, and I had a <a href="http://people.seas.harvard.edu/~jsnoek/nips2013transfer.pdf" rel="nofollow">paper on it at NIPS</a> last year. Andreas Krause also had <a href="http://las.ethz.ch/files/krause11contextual.pdf" rel="nofollow">a 2011 paper</a> that discusses this setting.

Parallelism is also something we take seriously; we have some good results on it in Section 3.3 of <a href="http://hips.seas.harvard.edu/files/snoek-bayesopt-nips-2012.pdf" rel="nofollow">our 2012 NIPS paper</a>. Again, Andreas Krause and colleagues have also contributed here, with their <a href="http://icml.cc/2012/papers/602.pdf" rel="nofollow">2012 ICML paper</a>.

Regarding the idea of leveraging problem structure to speed things up, you might like our <a href="http://arxiv.org/pdf/1406.3896.pdf" rel="nofollow">Freeze-Thaw Bayesian Optimization paper</a> on arXiv.
The idea is to predict what the final result of the inner-loop optimization will be, and to build that prediction into the exploration/exploitation strategy, so you can abandon, without finishing, jobs about which you are pessimistic.

For what it's worth, you might also worry about complicated constraints in your optimization that you can't identify a priori, e.g., "for some complicated hyperparameter regimes, my code just returns NaNs." For this, you might consider <a href="http://auai.org/uai2014/proceedings/individuals/107.pdf" rel="nofollow">Bayesian Optimization with Unknown Constraints</a> by my student Michael Gelbart, or <a href="http://jmlr.org/proceedings/papers/v32/gardner14.pdf" rel="nofollow">this paper by John Cunningham and colleagues</a>.

In a final gratuitous plug: the most current versions of Spearmint can be found <a href="https://github.com/HIPS/Spearmint" rel="nofollow">here</a>, and because Spearmint requires some overhead to set up and run, we've also been building a system for "Bayesian optimization as a web service" that you can sign up for at <a href="https://www.whetlab.com/" rel="nofollow">whetlab.com</a>.

-- Ryan Adams

Comment, 2014-10-10:

On the idea of meta-learning, there has been some recent work. There was a paper at last year's NIPS that used *multitask* GPs:
Multi-Task Bayesian Optimization: http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2013_5086.pdf

There was also a paper at this year's AISTATS:
Efficient Transfer Learning Method for Automatic Hyperparameter Tuning: http://www.cs.cmu.edu/~dyogatam/papers/yogatama+mann.aistats2014.pdf
And there was another paper that uses a somewhat simpler idea, reusing results from previous runs (on other, related data sets) to better initialize Bayesian optimization on a new data set:
Using Meta-Learning to Initialize Bayesian Optimization of Hyperparameters: http://ceur-ws.org/Vol-1201/MetaSel2014-complete.pdf#page=8

-- Piyush Rai
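For readers who want the baseline that all of the papers in this thread build on, here is a minimal single-task Bayesian optimization loop: a GP surrogate with an RBF kernel plus the expected-improvement acquisition, minimizing a toy 1-D objective. Everything below is a simplified illustration (it is not Spearmint or any of the systems above; the objective, kernel length-scale, grid, and iteration budget are arbitrary choices):

```python
import numpy as np
from math import erf, sqrt, pi

# Minimal single-task Bayesian optimization on a toy 1-D objective
# (illustrative only; not Spearmint). GP surrogate + expected improvement.

def f(x):                        # pretend "validation loss" we want to minimize
    return np.sin(3 * x) + 0.5 * x**2

def rbf(a, b, ls=0.5):           # RBF kernel; the length-scale is arbitrary
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def gp_posterior(X, y, Xs, jitter=1e-6):
    # Standard GP regression posterior mean and standard deviation at Xs.
    K = rbf(X, X) + jitter * np.eye(len(X))
    Ks = rbf(X, Xs)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ y
    var = 1.0 - np.sum(Ks * (Kinv @ Ks), axis=0)   # diag of posterior cov
    return mu, np.sqrt(np.maximum(var, 1e-12))

norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))
norm_pdf = lambda z: np.exp(-0.5 * z**2) / sqrt(2.0 * pi)

grid = np.linspace(-2.0, 2.0, 401)
X = np.array([-1.5, 0.0, 1.5])   # small initial design
y = f(X)
for _ in range(10):
    mu, sd = gp_posterior(X, y, grid)
    best = y.min()
    z = (best - mu) / sd
    ei = (best - mu) * norm_cdf(z) + sd * norm_pdf(z)  # EI for minimization
    ei[np.isin(grid, X)] = 0.0   # don't re-evaluate an exact repeat
    x_next = grid[np.argmax(ei)]
    X, y = np.append(X, x_next), np.append(y, f(x_next))
print("best x:", X[np.argmin(y)], "best loss:", y.min())
```

The papers discussed above each modify one piece of this loop: multi-task and meta-learning methods share or warm-start the surrogate across data sets, parallel methods pick batches of x_next at once, freeze-thaw extrapolates partially trained jobs, and constrained methods down-weight the acquisition where the objective may fail.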