natural language processing blog: ICML 2010 Retrospective

Just got back from Israel for ICML, which was a great experience: I'd wanted to go there for a while and this was a perfect opportunity. I'm very glad I spent some time afterwards out of Haifa, though.

Overall, I saw a lot of really good stuff. The usual caveats apply (I didn't see everything it's a biased sample, blah blah blah). Here are some things that stood out:

Structured Output Learning with Indirect Supervision (M.-W. Chang, V. Srikumar, D. Goldwasser, D. Roth). This was probably one of my favorite papers of the conference, even though I had learned some about the work when I visited UIUC a few months ago. Let's say you're trying to do word alignment, and you have a few labeled examples of alignments. But then you also have a bunch of parallel data. What can you do? You can turn the parallel data into a classification problem: are these two sentences translations of each other. You can pair random sentences to get negative examples. A very clever observation is basically that the weight vector for this binary classifier should point in the same direction as the weight vector for the (latent variable) structured problem! (Basically the binary classifier should say "yes" only when there exists an alignment that renders these good translations.) Tom Dietterich asked a question during Q/A: these binary classification problems seem very hard: is that bad? Ming-Wei reassured him that it wasn't. In thinking about it after the fact, I wonder if it is actually really importantant that they're hard: namely, if they were easy, then you could potentially answer the question without bothering to make up a reasonable alignment. I suspect this might be the case.

A Language-based Approach to Measuring Scholarly Impact (S. Gerrish, D. Blei). The idea here is that without using citation structure, you can model influence in large document collections. The basic idea is that when someone has a new idea, they often introduce new terminology to a field that wasn't there before. The important bit is that they don't change all of science, or even all of ACL: they only change what gets talked about in their particular sub-area (aka topic :P). It was asked during Q/A what would happen if you did use citations, and my guess based on my own small forays in this area is that the two sources would really reinforce eachother. That is, you might regularly cite the original EM even if your paper has almost nothing to do with it. (The example from the talk was then Penn Treebank paper: one that has a bajillion citations, but hasn't lexically affected how people talk about research.)

Hilbert Space Embeddings of Hidden Markov Models (L. Song, B. Boots, S. Saddiqi, G. Gordon, A. Smola). This received one of the best paper awards. While I definitely liked this paper, actually what I liked more what that it taught me something from COLT last year that I hadn't known (thanks to Percy Liang for giving me more details on this). That paper was A spectral algorithm for learning hidden Markov models (D. Hsu, S. Kakade, T. Zhang) and basically shows that you can use spectral decomposition techniques to "solve" the HMM problem. You create the matrix of observation pairs (A_ij = how many times did I see observation j follow observation i) and then do some processing and then a spectral decomposition and, voila, you get parameters to an HMM! In the case that the data was actually generated by an HMM, you get good performance and good guarantees. Unfortunately, if the data was not generated by an HMM, then the theory doesn't work and the practice does worse than EM. That's a big downer, since nothing is ever generated by the model we use, but it's a cool direction. At any rate, the current paper basically asks what happens if your observations are drawn from an RKHS, and then does an analysis. (Meta-comment: as was pointed out in the Q/A session, and then later to me privately, this has fairly strong connections to some stuff that's been done in Gaussian Process land recently.)

Forgetting Counts: Constant Memory Inference for a Dependent Hierarchical Pitman-Yor Process (N. Bartlett, D. Pfau, F. Wood). This paper shows that if you're building a hierarchical Pitman-Yor language model (think Kneser-Ney smoothing if that makes you feel more comfortable) in an online manner, then you should feel free to throw out entire restaurants as you go through the process. (A restaurant is just the set of counts for a given context.) You do this to maintain a maximum number of restaurants at any given time (it's a fixed memory algorithm). You can do this intelligently (via a heuristic) or just stupidly: pick them at random. Turns out it doesn't matter. The explanation is roughly that if it were important, and you threw it out, you'd see it again and it would get re-added. The chance that something that occurs a lot keeps getting picked to be thrown out is low. There's some connection to using approximate counting for language modeling, but the Bartlett et al. paper is being even stupider than we were being!

Learning efficiently with approximate inference via dual losses (O. Meshi, D. Sontag, T. Jaakkola, A. Globerson). Usually when you train structured models, you alternate between running inference (a maximization to find the most likely output for a given training instance) and running some optimization (a minimization to move your weight vector around to achieve lower loss). The observation here is that by taking the dual of the inference problem, you turn the maximization into a minimization. You now have a dual minimization, which you can solve simultaneously, meaning that when your weights are still crappy, you aren't wasting time finding perfect outputs. Moreover, you can "warm start" your inference for the next round. It's a very nice idea. I have to confess I was a bit disappointed by the experimental results, though: the gains weren't quite what I was hoping. However, most of the graphs they were using weren't very large, so maybe as yo move toward harder problems, the speed-ups will be more obvious.

Deep learning via Hessian-free optimization (J. Martens). Note that I neither saw this presentation nor read the paper (skimmed it!), but I talked with James about this over lunch one day. The "obvious" take away message is that you should read up on your optimization literature, and start using second order methods instead of your silly gradient methods (and don't store that giant Hessian: use efficient matrix-vector products). But the less obvious take away message is that some of the prevailing attitudes about optimizing deep belief networks may be wrong. For those who don't know, the usual deal is to train the networks layer by layer in an auto-encoder fashion, and then at the end apply back-propogation. The party line that I've already heard is that the layer-wise training is very important to getting the network near a "good" local optimum (whatever that means). But if James' story holds out, this seems to not be true: he doesn't do any clever initialization and still find good local optima!

A theoretical analysis of feature pooling in vision algorithms (Y.-L. Boureau, J. Ponce, Y. LeCun). Yes, that's right: a vision paper. Why should you read this paper? Here's the question they're asking: after you do some blah blah blah feature extraction stuff (specifically: Sift features), you get something that looks like a multiset of features (hrm.... sounds familiar). These are often turned into a histogram (basically taking averages) and sometimes just used as a bag: did I see this feature or not. (Sound familiar yet?) The analysis is: why should one of these be better and, in particular, why (in practice) do vision people see multiple regimes. Y-Lan et al. provide a simple, obviously broken, model (that assumes feature independence... okay, this has to sound familiar now) to look at the discriminability of these features (roughly the ration of between-class variances and overall variances) to see how these regimes work out. And they look basically how they do in practice (modulo one "advanced" model, which doesn't quite work out how they had hoped).

Some other papers that I liked, but don't want to write too much about:

Learning Programs: A Hierarchical Bayesian Approach (P. Liang, M. Jordan, D. Klein). Structured models over programs are very hard; this paper gives one approach to modeling them.
Budgeted Nonparametric Learning from Data Streams (R. Gomes, A. Krause). Shows that a clustering problem and a Gaussian process problem are submodular, goes from there.
Internal Rewards Mitigate Agent Boundedness (J. Sorg, S. Singh, R. Lewis). Exactly what the title says.
The Translation-invariant Wishart-Dirichlet Process for Clustering Distance Data (J. Vogt, S. Prabhakaran, T. Fuchs, V. Roth). Been wanting to do something like this for a while, but they did it better than I would have!
Sparse Gaussian Process Regression via L_1 Penalization (F. Yan, Y. Qi). Very interesting way to get sparsity in a GP basically by changing your approximating distribution.

Some papers that other people said they liked were: