I got back from NIPS on Sunday; sadly I got sick on the last day. Despite this, it was by far my most productive conference ever. Admittedly, I made a point to try to make it productive (being somewhat geographically isolated means that it's in my best interest), but even so, I had a fantastic time. There are several ideas that I plan on blogging on in the near future, but for now, I'll focus on the actual conference program. The standard disclaimer applies: I didn't see everything and just because something's not listed here doesn't mean I didn't enjoy it. I'll try to limit my commentary to papers that are relevant to the NLP audience, but I may stray. I'll conclude with a bit of my sense of what was in and out this year.
- Kristina Toutanova and Mark Johnson had a topic-model style paper for doing semisupervised POS tagging. There are two new things here. First, they actually model context (yay!) which we know is important in language. Second, they introduce a notion of a confusion class (what possible POS tags can a word get) and actually model this. This latter is something that makes sense in retrospect, but is not obvious to think of (IMO). They actually get good results (which is non-trivial for topic models in this context).
- Alex Bouchard-Cote and three more of the Berkeley gang had a paper on language change. If you saw their paper at ACL, they're attacking a pretty similar problem, by looking at phonological divergences between a handful of languages. I'd love to be able to reconcile their approach with what I've been working on---I use a huge database of typological knowledge; they use word forms...I'd love to use both, but I really like working with thousands of languages and it's hard to find corpora for all of these :).
- John Blitzer had a good paper (with other Penn folks) on learning bounds for domain adaptation. If you care at all about domain adaptation, you should read this paper.
- Alex Kulesza and Fernando Pereira had a great paper on what happens if you try to do structured learning when the underlying prediction (inference) algorithm is approximate. They show that things can go very wrong in widely different ways. This is a bit of a negative results paper, but it gives us (a) a strong sense that proceeding with a generic learning algorithm on top of an approximate inference algorithm is not okay and (b) a few ideas of what can actually go wrong.
- Yee Whye Teh, Kenichi Kurihara and Max Welling continue their quest toward collapsed variational models for all problems in the world by solving the HDP. (For those who don't know, you can think of this as LDA with a potentially infinite number of topics, but of course the real case is more interesting.)
- Ali Rahimi and Benjamin Recht presented a very cool analysis of kernel methods when your kernel is of the form K(x,z) = f(x-z). This is the case for, eg., the RBF kernel. This is something that I've wanted to try for a while, but to be honest my poor recollection of my analysis training left me a bit ill equipped. Essentially they apply a Fourier analysis to such kernels and then represent the kernel product K(x,z) as a dot product in a new input space, where kernelization is no longer required. The upshot is that you can now effectively do kernel learning in a linear model in primal space, which is very nice when the number of examples is large.
- David Heckerman gave a really nice invited talk on graphical models for HIV vaccine design. What he was particularly interested in was analyzing whether standard methods of determining correlation between events (eg., "do I have HIV given that I have some genomic signal?") work well when the data points are not independent, but rather belong to, eg., some genetic population. This is effectively what Lyle Campbell and I were doing in our ACL paper last year on typological features. Curing HIV might be a bit more of an interesting application, though :). Essentially what David and his colleagues do is to build a hierarchy on top of the data points and then let this explain some of the interactions and then figure out what isn't explained by this hierarchy. In the linguistic sense, this is effectively separating historical from areal divergences. I'll have to read up more on what they do, precisely.
My only serious quibble with the program this year is that I saw a handful of papers (and at least one that got an oral and one a spotlight) that really fell down in terms of evaluation. That is, these papers all proposed a new method for solving task X and then proceeded to solve it. The experiments, however, contained only comparisons to algorithms that did not have access to all the information that X had. I'll give an example of something that would fall in this category but I have not seen appear anywhere: I build a structured prediction algorithm and compare only against algorithms that make independent classification decisions. I'm not a huge quibbler over comparing against the absolute state of the art, but it really really irks me when there are no comparisons to algorithms that use the same information, especially when standard (read: they are in any reasonable machine learning textbook) algorithms exist.