10 October 2014

Hyperparameter search, Bayesian optimization and related topics

In terms of (importance divided-by glamour), hyperparameter (HP) search is probably pretty close to the top. We all hate finding hyperparameters. Default settings are usually good, but you're always left wondering: could I have done better? I like averaged perceptron for this reason (I believe Yoav Goldberg has also expressed this sentiment): no pesky hyperparameters.
But I want to take a much broader perspective on hyperparameters. We typically think of HPs as { regularization constant, learning rate, architecture } (where "architecture" can mean something like neural network structure, choice of kernel, etc. But I think it's a lot broader and can include things like feature engineering, or at least representation modifications. For instance, vw now supports a number of helpful NLP feature templates: suffix and prefix features (via --affix), spelling features (via --spelling), ngram features (--ngrams), quadratic and cubic features, etc. Picking the right incarnation of these thing is essentially a hyperparameter search process, very akin (IMO) to things like representation learning.

Once you're willing to accept all these things as HPs (and I think you should), something like "grid search", which works for tuning C and eta in your SVM, just doesn't seem to cut it anymore.

Enter the world of automatic HP tuning. Lots of this work, not surprisingly, comes out of the neural networks community, because HP search is a big deal there. Most of my information here comes via Hugo Larochelle, who has done a lot of great work. Some places to start looking:
  • Spearmint toolkit by Snoek, Larochelle, Swersky, Zemel and Adams (and a JMLR paper)
  • SMAC by Hutter, Hoos, Leyton-Brown, Murphy and Ramage (and a paper
Most of this work falls under the framework of "Bayesian Optimization." The idea comes from the space of derivative-free optimization, where a common strategy is to fit a response surface. Basically you have a bunch of hyperparameters to tune. For any setting of hyperparameters, you can observe some response. In ML land this is usually something like held-out accuracy. Now, fit a regression function that can map from hyperparameters to response. Do something that looks like active learning to explore this space, with a bias toward finding places with high response (high accuracy). In these examples, the function being fit is a Gaussian Process, which is super useful because it can provide realistic estimates of variance, which are useful in the active learning/experimental design.

I first learned about this stuff, not in the context of HP optimization, from Ilya Ryzhov, a faculty member in our business school who works on these topics -- I learned that in his world, exploration/exploitation is also called "learning versus earning" which I think it awesome :).

I would love if hyperparameter optimization were a black box, and Spearmint and SMAC are both great steps in this direction. There are a couple of things that I haven't seen (though like I said, I'm not hugely well read in this space) that I would love to see.
  • Learning across multiple datasets, akin to meta-learning. I imagine that if you give me a problem to do HP optimization on, and also give the same problem to an undergraduate, I would be much better. Presumably I've learned from past experience how to do this well.
  • More importantly, taking advantage of some of the structure of learning algorithms (rather than oblivious black box optimization) seems like it could be a big win. For instance, most learning algorithms have some notion of early stopping (# of passes over the data, tolerance for convergence, etc.). You can also of course run on subsets of the data (which is equivalent in many cases). If you assume heldout accuracy doesn't get worse over time (e.g., because you do early stopping) then you can think of this as a type of right-censoring. I.e., you run an experiment part of the way through and you know "ok if I kept running it might get better, but I know it's at least 85% accurate now." The SMAC folks had a nice paper on Bayesian optimization with censored data, but my (incomplete) understanding is that it doesn't quite capture this (very common, IMO) case. I should be willing to start and stop processes at various points and try to figure out where to invest more computation to get better estimates. I can presumably even estimate: if I ran another pass, how much better do I think it would get?
  • I also think the focus on "finding the best hyperparameters" is somewhat the wrong problem. We want to find the best parameters, period. Hyperparameters are a nuisance on their way to that end. For instance, related to the above bullet, I could imagine running a few passes with one setting of hyperparameters, and then doing some other work, and then going back and restarting that previous run with a different setting of hyperparameters (assuming the model being learned is such that "warm starting" makes sense, which is almost always the case--except maybe in some neural network settings).
     
  • Parallelization is a big deal. One of the reasons something akin to grid search is so attractive is that it's trivial to submit 20*20*20 jobs to my cluster and just wait for them to finish. Anything that's less friendly than doing this is not worth it. Again, the SMAC folks have worked on this. But I don't think the problem is solved.
Beyond these technical issues there's always the obnoxious issue of trust. Somehow I need to believe that I'm not better than these algorithms at tuning hyperparameters. I should be happy to just run them, preferably saying "okay, here are 120 cores, you have four hours -- go to town." And I should believe that it's better than I could do with equivalent time/resources by clever grid search. Or perhaps I should be able to encode my strategies in some way so that it can prove to me that it's better than me.

Overall, I'd love to see more work on this problem, especially work that doesn't focus on neural networks, but still takes advantage of the properties of machine learning algorithms that are not shared by all black-box derivative-free optimization tasks. In the mean time, from what I hear, SMAC and Spearmint are actually quite good. Would love to hear if any NLP people have played around with them!

03 October 2014

Machine learning is the new algorithms

When I was an undergrad, probably my favorite CS class I took was algorithms. I liked it (a) because my background was math so it was the closest match to what I knew and (b) because even though it was "theory," a lot of the stuff we learned was really relevant. Over time, it seemed like the area had distilled worthwhile algorithms from interesting-in-theory-but-you'll-never-actually use algorithms.

In fact, I think this is a large part of why most undergraduate CS degrees today require a course in algorithms. You have these very nice, clearly defined statements, and very elegant solutions to those statements that in most cases (at the UG level) are known to be optimal.

Fast forward N years.

My claim today---and I'm speaking really as an NLP person, which is how I self-identify---is that machine learning is the new core. Everything that algorithms was to computer science 15 years ago, machine learning is today. That's not to say it won't move in another 10 years, but that's how I see it.

Why?

For the most part, algorithms (especially as taught at th UG level) is the study of one thing: Given a perfect input, how do I most efficiently compute the optimal output.

The problem is the "perfect input" part.

All of my experience in the past N years has told me that you never have a perfect input, and that it's far far far more important to be able to synthesize information from a large number of sources and reason about it than it is to find the exact-right-solution to some problem that exists only to Plato.

Even within machine learning you see this effect. Lots of numerical analysis people have worked on good algorithms for getting that last little bit of precision out of optimization algorithms. Does it matter? Nope! Model specification, parameter tuning, features, and data matter infinitely more than that last little bit of precision. (In some fields, for instance, scientific computing, that last little bit of precision may matter. I don't know enough to know one way or the other.)

Let's play a thought game. Say you're an UG CS major. You graduate and get a job in CS (not grad school). Which are you more likely to use: (1) a weighted cost flow algorithm or (2) a perceptron/decision tree?

Clearly I think the answer is (2). And I loved flow algorithms when I was an undergrad and have actually spent since 2006 trying to figure out how I can use them for a problem I want to solve. No dice.

I would actually go further. Suppose you have a problem whose inputs are ill-specified (as they always are when dealing with data), and whose structure actually does look like a flow problem. There are two CS students trying to solve this problem. Akiko knows about machine learning but not flows; Bob knows about flows but not machine learning. Bob tries to massage his data by hand into the input to an optimal flow algorithm, and then solves it exactly. Akiko uses machine learning to get good edge weights and hacks together some greedy algorithm for flows, not even knowing it's called a flow. Who's solution works better? I would put almost any amount of money on Akiko.

Full disclosure: those who know about my research in structured prediction will recognize this as a recurring theme in my own research agenda: fancy algorithms always lose to better models.

There's another big difference between N years ago and today: almost every algorithm you could possibly care about (or learn about as an UG) is implemented in a library for any reasonable programming language. That's not to say that it's unimportant to know how things work in order to use them, but I would argue it's much less important in a field like algorithms whose knowledge is comparatively stable, versus a field like machine learning where things are still changing and there is no "one right answer" to the "machine learning problem." In a field that's still a bit of an art rather than a science, understanding how things work under the hood feels a lot more important. Quicksort, heaps, minimum spanning trees, ... these are all here to stay.
Okay, so now I've convinced myself that we should yank algorithms out as an UG requirement and replace it with machine learning.

But wait, I can hear my colleagues yelling, taking algorithms isn't about learning algorithms: it's about learning how to think! But that's also what I think is great about machine learning: the distance between theory and algorithms is actually usually quite small (I try to get this across at various points in CiML, to varying degrees of success). If the only point of an algorithms class (I've heard exactly this argument made about automata theory, for instance) is to teach students how to think, I think we could do much better.

Okay, so I've thrown down the gauntlet. Someone should come smack me with theirs :P!

Edit after some comments:

I think I probably wrote badly and as a result my main point got lost. I'll try to restate it here briefly and then I'll edit the main post.

Main point: I feel like for 15 years, algorithms has been at the heart of most of what computer science does. I feel like that coveted position has now changed to machine learning or, more generically, statistical reasoning. I feel this way because figuring out how to map a real world problem into something an "algorithm" can consume, especially when that needs statistical modeling of various inputs, is (IMO) a more important and harder problem than details about flow algorithms. 



EDIT #2 AFTER DISCUSSION ON SURESH'S BLOG:

let me give a concrete example that may actually be a real world example, but i don't know (though see this paper). that of path finding for taxis or cars. the world is a graph and given directed edge costs we can run dijkstra or whatever to find LEAST-TIME (shortest) paths. this is basically google maps/etc.

of course, we never know the true time to travel some segment. we might know it now, but by the time the driver gets to some road (5 or 10 minutes from now) the conditions may have changed. and of course we have historical data on traffic from which we can predict what the condition of the road will be like in 10 minutes.

so here, "foo" is a function that takes the time of data, historical traffic data, weather and whathaveyou, and maps it to edge costs.

"bar" is dijkstra's algorithm or whatever shortest path algorithm you like.

my claim is that if you really want to solve this problem, it's much more important to understand how to create foo than how to create bar. in particular, if i gave you a greedy or near greedy approach to bar, combined with a really good foo, i bet this would be significantly better than an optimal bar and a crappy foo.