It seems to be a general goal in practical online learning algorithm development to have the updates be very, very simple. The perceptron is probably the simplest, and involves just a few adds. Winnow takes a few multiplies. MIRA takes a bit more, but still nothing hugely complicated. Same with stochastic gradient descent algorithms for, e.g., hinge loss.
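To make "just a few adds" concrete, here's a minimal sketch of a perceptron step (the function name and vector shapes are mine, not from any particular implementation): a dot product for the prediction and, on a mistake, one vector addition.

```python
import numpy as np

def perceptron_update(w, x, y):
    """One perceptron step on example (x, y) with y in {-1, +1}.

    The whole update is a dot product plus, on a mistake, a few adds --
    essentially free compared to reading and featurizing the example.
    """
    if y * np.dot(w, x) <= 0:  # mistake (or on the boundary)
        w = w + y * x          # additive update
    return w
```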
I think this maybe used to make sense. I'm not sure that it makes sense any more. In particular, I would be happier with online algorithms that do more work per data point, but require only one pass over the data. There are really only two examples I know of: the StreamSVM work that my student Piyush did with me and Suresh, and the confidence-weighted work by Mark Dredze, Koby Crammer and Fernando Pereira (note that they maybe weren't trying to make a one-pass algorithm, but it does seem to work well in that setting).
Why do I feel this way?
Well, if you look even at standard classification tasks, you'll find that if you have a highly optimized, dual-threaded implementation of stochastic gradient descent, then your bottleneck becomes I/O, not learning. This is what John Langford observed in his Vowpal Wabbit implementation. He has to do multiple passes. He deals with the I/O bottleneck by creating an I/O-friendly, proprietary version of the input file during the first pass, and then careening through it on subsequent passes.
In this case, basically what John is seeing is that I/O is too slow. Or, phrased differently, learning is too fast :). I never thought I'd say that, but I think it's true. Especially when you consider that just having two threads is a pretty low requirement these days, it would be nice to put 8 or 16 threads to good use.
But I think the problem is actually quite a bit more severe. You can tell this by realizing that the idealized world in which binary classification algorithms usually get developed is, well, idealized. In particular, someone has already gone through the effort of computing all your features for you. Even running something simple like a tokenizer, stemmer and stop-word remover over documents takes a non-negligible amount of time (to convince yourself: run it over Gigaword and see how long it takes!), easily much longer than a silly perceptron update.
So in the real world, you're probably going to be computing your features and learning on the fly. (Or at least that's what I always do.) In which case, if you have a few threads computing features and one thread learning, your learning thread is always going to be stalling, waiting for features.
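The feature-threads-feeding-a-learner-thread setup can be sketched with a shared queue; this is my own minimal illustration (the `featurize` stand-in and the sentinel convention are assumptions, not anyone's actual system). The point is that the learner spends most of its life blocked on `q.get()`, waiting for the slow featurizer.

```python
import queue
import threading

def featurize(doc):
    # stand-in for the slow part: tokenization, stemming, stop-word
    # removal, etc. -- here just a bag of lowercased tokens
    return {tok: 1 for tok in doc.lower().split()}

def producer(docs, q, n_consumers=1):
    # feature-computation thread(s): push featurized examples onto the queue
    for doc in docs:
        q.put(featurize(doc))
    for _ in range(n_consumers):
        q.put(None)  # sentinel: no more examples

def learner(q):
    # learning thread: the update is so cheap that this thread mostly
    # stalls on q.get(), waiting for features
    n_updates = 0
    while True:
        feats = q.get()
        if feats is None:
            break
        # a perceptron/SGD-style update on `feats` would go here
        n_updates += 1
    return n_updates
```

Usage: start `producer` in a thread and run `learner` in the main thread; with more featurizer threads than learner work, the queue is the whole story.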
One way to partially circumvent this is to do a variant of what John does: create a big scratch file as you go, write everything to this file on the first pass, and then just read from it on subsequent passes. In fact, I believe this is what Ryan McDonald does in MSTParser (he can correct me in the comments if I'm wrong :P). I've never tried this myself because I am lazy. Plus, it adds unnecessary complexity to your code, requires you to chew up disk, and of course adds its own delays, since you now have to be writing to disk (which gives you tons of seeks to get back to where you were reading from initially).
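For concreteness, the scratch-file trick might look like this (a hypothetical sketch of the general pattern, not MSTParser's or VW's actual code): featurize and cache on pass one, then replay the cache on later passes instead of recomputing features.

```python
import os
import pickle
import tempfile

def multipass_train(docs, featurize, update, n_passes=3):
    """First pass: featurize, learn, and cache features to a scratch file.
    Later passes: replay the cached features, skipping featurization."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with open(path, "wb") as f:
        for doc in docs:
            feats = featurize(doc)  # expensive, done once per example
            update(feats)
            pickle.dump(feats, f)   # cache for later passes
    for _ in range(n_passes - 1):
        with open(path, "rb") as f:
            while True:
                try:
                    update(pickle.load(f))  # cheap replay
                except EOFError:
                    break
    os.remove(path)
```

The cost is exactly what the paragraph above complains about: extra code, extra disk, and interleaved reads and writes on the first pass.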
A similar problem crops up in structured problems. Since you usually have to run inference to get a gradient, you end up spending way more time on your inference than your gradients. (This is similar to the problems you run into when trying to parallelize the structured perceptron.)
Anyway, at the end of the day, I would probably be happier with an online algorithm that spent a little more energy per example and required fewer passes; I hope someone will invent one for me!