31 August 2010

Online Learning Algorithms that Work Harder

It seems to be a general goal in practical online learning algorithm development to have the updates be very, very simple.  The perceptron is probably the simplest, involving just a few adds.  Winnow takes a few multiplies.  MIRA takes a bit more, but still nothing hugely complicated.  The same goes for stochastic gradient descent algorithms for, e.g., hinge loss.
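
To make "just a few adds" concrete, here is a minimal sketch of a perceptron update over sparse features (the dict-based representation and names are just for illustration):

    def perceptron_update(w, feats, y):
        """One perceptron step.  w and feats are dicts mapping feature
        names to values; y is +1 or -1.  Mutates w in place."""
        score = sum(w.get(f, 0.0) * v for f, v in feats.items())
        if y * score <= 0:                       # mistake (or zero margin)
            for f, v in feats.items():           # the whole update: a few adds
                w[f] = w.get(f, 0.0) + y * v
        return w

    w = {}
    perceptron_update(w, {"bias": 1.0, "word=excellent": 1.0}, +1)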

I think this maybe used to make sense.  I'm not sure that it makes sense any more.  In particular, I would be happier with online algorithms that do more work per data point, but require only one pass over the data.  There are really only two examples I know of: the StreamSVM work that my student Piyush did with me and Suresh, and the confidence-weighted work by Mark Dredze, Koby Crammer and Fernando Pereira (note that they maybe weren't trying to make a one-pass algorithm, but it does seem to work well in that setting).

Why do I feel this way?

Well, if you look even at standard classification tasks, you'll find that if you have a highly optimized, dual-threaded implementation of stochastic gradient descent, then your bottleneck becomes I/O, not learning.  This is what John Langford observed in his Vowpal Wabbit implementation.  He has to do multiple passes.  He deals with the I/O bottleneck by creating an I/O-friendly, proprietary version of the input file during the first pass, and then careening through it on subsequent passes.

In this case, basically what John is seeing is that I/O is too slow.  Or, phrased differently, learning is too fast :).  I never thought I'd say that, but I think it's true.  Especially when you consider that having two threads is a pretty low requirement these days; it would be nice to put 8 or 16 threads to good use.

But I think the problem is actually quite a bit more severe.  You can tell this by realizing that the idealized world in which binary classifier algorithms usually get developed is, well, idealized.  In particular, someone has already gone through the effort of computing all your features for you.  Even running something simple like a tokenizer, stemmer and stop word remover over documents takes a non-negligible amount of time (to convince yourself: run it over Gigaword and see how long it takes!), easily much longer than a silly perceptron update.

So in the real world, you're probably going to be computing your features and learning on the fly.  (Or at least that's what I always do.)  In which case, if you have a few threads computing features and one thread learning, your learning thread is always going to be stalling, waiting for features.
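
Here is roughly the setup I have in mind, sketched with Python's standard threading and queue machinery (the extractor and the update are stand-in toys, and in CPython you would really want processes rather than threads for CPU-bound extraction, but the shape is the same): with a slow extractor and a fast update, the learner spends most of its time blocked on the queue.

    import queue
    import threading

    def extract_features(doc):
        # stand-in for a slow pipeline: tokenization, stemming,
        # stop-word removal, ...; returns (features, label)
        toks = doc["text"].lower().split()
        return {"word=" + t: 1.0 for t in toks}, doc["label"]

    def update(w, feats, y):
        # stand-in for a cheap online update (perceptron-style)
        score = sum(w.get(f, 0.0) * v for f, v in feats.items())
        if y * score <= 0:
            for f, v in feats.items():
                w[f] = w.get(f, 0.0) + y * v

    def feature_worker(docs, q):
        for doc in docs:
            q.put(extract_features(doc))
        q.put(None)                              # sentinel: this worker is done

    def learn(docs, n_workers=4):
        q, w = queue.Queue(maxsize=1024), {}
        workers = [threading.Thread(target=feature_worker,
                                    args=(docs[i::n_workers], q))
                   for i in range(n_workers)]
        for t in workers:
            t.start()
        finished = 0
        while finished < n_workers:
            item = q.get()                       # the learner stalls right here
            if item is None:
                finished += 1
            else:
                update(w, *item)
        for t in workers:
            t.join()
        return w

    w = learn([{"text": "a great movie", "label": +1},
               {"text": "a terrible movie", "label": -1}])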

One way to partially circumvent this is to do a variant of what John does: create a big scratch file as you go and write everything to this file on the first pass, so you can just read from it on subsequent passes.  In fact, I believe this is what Ryan McDonald does in MSTParser (he can correct me in the comments if I'm wrong :P).  I've never tried this myself because I am lazy.  Plus, it adds unnecessary complexity to your code, requires you to chew up disk, and of course adds its own delays since you now have to be writing to disk (which gives you tons of seeks to go back to where you were reading from initially).
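
For concreteness, the scratch-file trick looks roughly like this (my own sketch, not what MSTParser or VW actually does): the first pass extracts features and appends them to a cache file, and later passes just deserialize.

    import os
    import pickle

    def examples(docs, extract, n_passes, cache_path="features.cache"):
        """Yield (features, label) pairs for n_passes passes over the data,
        running feature extraction only once and caching the results."""
        if not os.path.exists(cache_path):
            with open(cache_path, "wb") as f:
                for doc in docs:
                    ex = extract(doc)            # expensive, done exactly once
                    pickle.dump(ex, f)
                    yield ex
            n_passes -= 1
        # caveat: a stale cache left over from an earlier run
        # would silently be reused here
        for _ in range(n_passes):
            with open(cache_path, "rb") as f:
                while True:
                    try:
                        yield pickle.load(f)     # cheap deserialization
                    except EOFError:
                        break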

A similar problem crops up in structured problems.  Since you usually have to run inference to get a gradient, you end up spending way more time on your inference than your gradients.  (This is similar to the problems you run into when trying to parallelize the structured perceptron.)
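
In structured perceptron terms, the imbalance looks like this (a schematic sketch, with argmax_inference standing in for whatever decoder you use): the update is two sparse adds, but the argmax is a full decoding problem.

    def structured_perceptron_step(w, x, y_gold, argmax_inference, phi):
        """One structured perceptron step.  argmax_inference(w, x) stands in
        for your decoder (Viterbi, a chart parser, ...); phi(x, y) returns a
        sparse feature dict.  Nearly all of the time goes into the argmax."""
        y_hat = argmax_inference(w, x)           # expensive: full inference
        if y_hat != y_gold:
            for f, v in phi(x, y_gold).items():  # cheap: two sparse adds
                w[f] = w.get(f, 0.0) + v
            for f, v in phi(x, y_hat).items():
                w[f] = w.get(f, 0.0) - v
        return w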

Anyway, at the end of the day, I would probably be happier with an online algorithm that spent a little more energy per-example and required fewer passes; I hope someone will invent one for me!

10 comments:

Ryan McDonald said...

That is exactly what MSTParser does. I got this trick from Clark & Curran.

timv said...

I've definitely felt the I/O bottleneck before... the best solution is to get a really beefy 64-bit multi-core machine with tons of memory to run all your training on. This way you can leave everything in memory... no more worries!

Have you considered posing this as a question on MetaOptimize (http://metaoptimize.com/qa)?

John Langford said...

I've seen VW work pretty well in a single pass on some datasets---for example on Leon's SGD example. In other cases, I've seen up to 10 passes be useful.

VW's cache format isn't proprietary---it's all open source code. Look in cache.cc to see both the reader and the writer. The easiest way to take advantage of it is to overwrite the update in gd.cc and leave the rest of the code intact. If you run into difficulties, email me.

But, the real difficulty with vanilla gradient descent seems not to be the number of passes required to learn, but rather (a) normalizing the features so it can learn and (b) finding the right learning rates so that it learns quickly.

Nikos and I have been tinkering with the update rule this summer, and the current github version is a substantial improvement over the older gradient descent w.r.t. eliminating most of the need for (b).

I've seen two non-I/O-bound cases where the speed of learning really matters. The first is when you use the '-q' flag to specify a large number of implicit features via an outer product of existing features. The second is when you have a learning reduction that creates multiple examples per input example. In both cases, via judicious use of the hashing trick, you can make the learning core the bottleneck instead of the disk I/O.
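
Roughly what the hashing trick buys you there, sketched in Python rather than the actual C++ (names here are illustrative only): crossed features get hashed straight into a fixed-size index space, so no feature dictionary is needed and generating lots of them is cheap.

    import zlib

    def hash_features(named_feats, n_bits=18):
        """Hash (name, value) features into a fixed-size index space.
        A sketch of the idea, not VW's actual implementation."""
        size = 1 << n_bits
        hashed = {}
        for name, value in named_feats:
            idx = zlib.crc32(name.encode("utf-8")) % size
            hashed[idx] = hashed.get(idx, 0.0) + value   # collisions just add
        return hashed

    def quadratic(ns_a, ns_b):
        """Outer product of two feature namespaces (roughly what -q asks for)."""
        return [(fa + "^" + fb, va * vb) for fa, va in ns_a for fb, vb in ns_b]

    feats = hash_features(quadratic([("user=7", 1.0)], [("movie=42", 1.0)]))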

The disadvantages you mention for the cache aren't relevant in my experience. Via cached IO (also in VW in io.cc), you eliminate seek issues. Because the cache format is typically more concise, the disk space it uses is amortized by the original dataset's disk space. The real problem from a user's perspective seems to be the version control---I've seen too many cases of wasted work due to outdated cache files. In the case where you read the original from a file rather than stdin or a tcp port, a little timestamp checking might be helpful.

Mark Dredze said...

Hal- 100% agree.

While MSTParser does do that, it still spends a ton of time doing I/O on subsequent passes (I've done some performance analysis.)

The issue isn't that these algorithms are trying to be simple, it's that there is a limited amount you can do with one example at a time. I've always seen their simplicity as a side effect of not having much to work with. The motivation for CW was that if we had a bit more information (variance), we could do more on each update. Perhaps the way forward is to store more (a small constant amount) with the goal of doing more on each update. I've thought about this before; you could store the most "interesting" examples and then constantly be looking at them when you update a new example. Sort of like active learning where you get the labels anyway. I don't think I've ever had time to try this out, but I've been curious as to how it would do.
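
To give a flavor of what storing one extra number per feature buys you, here is a simplified diagonal variance-aware update in the spirit of CW/AROW (a sketch, not the exact update from the papers): confident, low-variance weights barely move, uncertain ones move a lot, and each update shrinks the variance of the features it touches.

    def variance_aware_update(w, sigma, feats, y, r=1.0):
        """w: weight means; sigma: per-feature variances (default 1.0);
        feats: sparse feature dict; y: +1 or -1.  Diagonal AROW-style sketch."""
        margin = sum(w.get(f, 0.0) * v for f, v in feats.items())
        confidence = sum(sigma.get(f, 1.0) * v * v for f, v in feats.items())
        loss = max(0.0, 1.0 - y * margin)
        if loss > 0.0:
            beta = 1.0 / (confidence + r)
            alpha = loss * beta
            for f, v in feats.items():
                s = sigma.get(f, 1.0)
                w[f] = w.get(f, 0.0) + alpha * y * s * v   # big step if uncertain
                sigma[f] = s - beta * s * s * v * v        # and become more certain
        return w, sigma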

Joseph Turian said...

I think a useful approach is to keep around instances for which the model was incorrect in a "focus" list. At each update, there is a certain probability that your update is performed over an instance in the focus list instead of a new instance from the stream. There might be some theory that describes whether an instance is likely to be noise and should hence be evicted from the focus list.

Collins + Roark, in their perceptron parser, do something similar: They found that inference was more expensive than updates. IIRC, if the perceptron update didn't make the gold standard parse have a better score than the previous argmax, they would keep that sentence in a list of current wrong answers. Then, they would periodically repeatedly iterate over the wrong answers until the perceptron updates fixed the scores.

Their technique was ad-hoc, and it would be useful to have some principle behind the correct way to maintain a focus list.
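
Concretely, I'm imagining something like the following sketch, where the FIFO eviction and the fixed revisit probability are just placeholders for whatever that missing principle would dictate:

    import random

    def focus_list_learner(stream, update, is_mistake,
                           focus_cap=100, p_focus=0.3, seed=0):
        """stream yields (features, label); update(w, feats, y) does one online
        step; is_mistake(w, feats, y) says whether the model got it wrong."""
        rng, focus, w = random.Random(seed), [], {}
        for feats, y in stream:
            # occasionally spend updates revisiting examples we got wrong
            while focus and rng.random() < p_focus:
                update(w, *rng.choice(focus))
            if is_mistake(w, feats, y):
                focus.append((feats, y))
                if len(focus) > focus_cap:
                    focus.pop(0)                 # FIFO eviction: a placeholder rule
            update(w, feats, y)
        return w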

@Ryan or @Hal: Could you talk more about MSTParser working on disk? Does it save previous parses, or what?

@timv: Although I endorse MetaOptimize, I don't think we should assume that we can always work in memory.

Ted Dunning ... apparently Bayesian said...

Another way to balance I/O against learning is to admit that we are going to make more than one pass through the data with different learning parameters. Once you say that, then you naturally move to one parsing/feature extracting thread and many learning threads.

In the Mahout SGD stuff I have been doing lately, I set up a pool of on-line cross-validating learners. Then, I set an epoch every few thousand training examples at which point I review the recent performance of each learner and replicate the successful ones and reap the losers using recorded step meta-mutation.

This has two big advantages:

1) I don't have to set an annealing schedule for the learning rate since the evolutionary algorithm handles that very nicely.

2) with a pool of 20 candidate learners each doing 5-way cross validation, I find it much easier to saturate 16 cores with learning. :-0
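
In outline, the epoch/replicate/reap loop looks something like this (a schematic of the idea rather than the actual Mahout code; the train/score/mutated learner interface is assumed):

    import random

    def evolve_pool(stream, make_learner, pool_size=20, epoch=5000, seed=0):
        """Keep a pool of online learners with different learning parameters.
        Every 'epoch' examples, keep the half with the best held-out score and
        refill the pool with mutated copies of the survivors."""
        rng = random.Random(seed)
        pool = [make_learner(rng) for _ in range(pool_size)]
        for i, (feats, y) in enumerate(stream, 1):
            for learner in pool:
                learner.train(feats, y)          # each learner cross-validates internally
            if i % epoch == 0:
                pool.sort(key=lambda l: l.score(), reverse=True)
                survivors = pool[: pool_size // 2]
                pool = survivors + [s.mutated(rng) for s in survivors]
        return max(pool, key=lambda l: l.score())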

Btw... I find that a simplified form of confidence weighted learning makes many of my problems converge to useful levels in less than one pass through my data.

hal said...

@timv: I think the amount of data we have grows faster than the amount of affordable memory.

@John: see @Mark :).

@Ted: I agree with, and had even thought of, the idea of just having different threads for different learning rates. I love your idea of taking this even further... is this written up anywhere?

@Mark, @Joseph: This is a really interesting idea. Perhaps someone should try it :). Our experience with the StreamSVM stuff was that using larger lookaheads was practically beneficial, but we had tons of negative theoretical results saying that it can't possibly help against a smart adversary, so long as your lookahead is bounded (say, polylog in the number of examples). This is true even for restricted adversaries, such as in the permuted-stream setting. Maybe a more online-learning-style analysis (regret, perhaps?) would be able to get something here, though.

Ted Dunning ... apparently Bayesian said...

@Hal,

The basic evolutionary algorithm is written up here: http://arxiv.org/abs/0803.3838

The adaptive online learner is embodied in the Apache Mahout class AdaptiveLogisticRegression. It uses the class EvolutionaryProcess to implement the adaptation and the CrossFoldLearner to do on-line estimation of AUC (for a binary target) or log likelihood (for a non-binary target).

There isn't a real writeup of the approach yet. Where would you suggest an implementation writeup go?

I would also like to invite anyone who would like to contribute to this work to pop in to the mahout development mailing list. The community is very welcoming of new contributions.

hal said...

@Ted: Maybe JMLR's open source software?

Joseph Turian said...

@Ted: This is an interesting approach. In deep learning, there are a ton of hyperparameters, and choosing them correctly is tricky.

There was a Snowbird abstract this year about doing exactly what you describe. Essentially, using genetic algorithms to choose your hyperparameters, and evolving good hyperparameters. This actually leads to better validation performance than choosing the hyperparameters at the beginning of training, since they can change during training. I can't find this abstract, even though I've looked for it several times since first seeing it.

Side-note: One of the reasons Google's prediction API is so smart is that they now have a lot of data for choosing their hyper-hyperparameters (i.e., the distributions from which they sample the hyperparameter mutations).