12 December 2007

NIPS retrospective

I got back from NIPS on Sunday; sadly I got sick on the last day. Despite this, it was by far my most productive conference ever. Admittedly, I made a point to try to make it productive (being somewhat geographically isolated means that it's in my best interest), but even so, I had a fantastic time. There are several ideas that I plan on blogging on in the near future, but for now, I'll focus on the actual conference program. The standard disclaimer applies: I didn't see everything and just because something's not listed here doesn't mean I didn't enjoy it. I'll try to limit my commentary to papers that are relevant to the NLP audience, but I may stray. I'll conclude with a bit of my sense of what was in and out this year.

  1. Kristina Toutanova and Mark Johnson had a topic-model-style paper for doing semi-supervised POS tagging. There are two new things here. First, they actually model context (yay!), which we know is important in language. Second, they introduce a notion of a confusion class (the set of POS tags a word can possibly take) and actually model it. The latter makes sense in retrospect, but is not obvious to think of (IMO). They actually get good results (which is non-trivial for topic models in this context).
  2. Alex Bouchard-Cote and three more of the Berkeley gang had a paper on language change. If you saw their paper at ACL, this one attacks a pretty similar problem, looking at phonological divergences between a handful of languages. I'd love to be able to reconcile their approach with what I've been working on---I use a huge database of typological knowledge; they use word forms. I'd love to use both, but I really like working with thousands of languages, and it's hard to find corpora for all of them :).
  3. John Blitzer had a good paper (with other Penn folks) on learning bounds for domain adaptation. If you care at all about domain adaptation, you should read this paper.
  4. Alex Kulesza and Fernando Pereira had a great paper on what happens if you try to do structured learning when the underlying prediction (inference) algorithm is approximate. They show that things can go very wrong in widely different ways. This is a bit of a negative results paper, but it gives us (a) a strong sense that proceeding with a generic learning algorithm on top of an approximate inference algorithm is not okay and (b) a few ideas of what can actually go wrong.
  5. Yee Whye Teh, Kenichi Kurihara and Max Welling continue their quest toward collapsed variational models for all problems in the world by solving the HDP. (For those who don't know, you can think of this as LDA with a potentially infinite number of topics, but of course the real case is more interesting.)
  6. Ali Rahimi and Benjamin Recht presented a very cool analysis of kernel methods when your kernel has the form K(x,z) = f(x-z). This is the case for, e.g., the RBF kernel. This is something I've wanted to try for a while, but to be honest my poor recollection of my analysis training left me a bit ill-equipped. Essentially, they apply Fourier analysis to such kernels and represent the kernel product K(x,z) as a dot product in a new input space, where kernelization is no longer required. The upshot is that you can effectively do kernel learning in a linear model in the primal space, which is very nice when the number of examples is large.
  7. David Heckerman gave a really nice invited talk on graphical models for HIV vaccine design. He was particularly interested in whether standard methods of determining correlation between events (e.g., "do I have HIV given that I have some genomic signal?") work well when the data points are not independent, but rather belong to, e.g., some genetic population. This is effectively what Lyle Campbell and I were doing in our ACL paper last year on typological features. Curing HIV might be a bit more of an interesting application, though :). Essentially, David and his colleagues build a hierarchy on top of the data points, let this hierarchy explain some of the interactions, and then figure out what isn't explained by it. In the linguistic sense, this is effectively separating historical from areal divergences. I'll have to read up more on what they do, precisely.
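The random-feature idea in item 6 can be sketched concretely. This is a minimal illustration under my own choices (function name, feature count D, and the check at the end are mine, not from the paper): for the RBF kernel, the Fourier view says the kernel is an expectation of cosines over Gaussian-distributed frequencies, so averaging D random cosine features gives an explicit map whose dot products approximate the kernel:

```python
import numpy as np

def make_rff_map(d, D=5000, sigma=1.0, seed=0):
    """Return a random feature map z with z(x) @ z(y) ~= exp(-||x-y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / sigma, size=(d, D))  # frequencies sampled from the kernel's Fourier transform
    b = rng.uniform(0.0, 2 * np.pi, size=D)        # random phase offsets
    def z(X):
        # X: (n, d) -> (n, D) explicit features; no kernel trick needed downstream
        return np.sqrt(2.0 / D) * np.cos(X @ W + b)
    return z

# Check the approximation on a pair of random points
d = 5
z = make_rff_map(d)
x, y = np.random.default_rng(1).normal(size=(2, d))
exact = np.exp(-np.sum((x - y) ** 2) / 2.0)          # true RBF kernel value, sigma = 1
approx = (z(x[None, :]) @ z(y[None, :]).T).item()    # dot product of random features
```

With explicit features in hand, an ordinary linear learner (e.g., logistic regression or a linear SVM) in this D-dimensional space stands in for the kernelized one, which is exactly the large-n win described above.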
There were several themes that stuck out at NIPS this year, though they aren't represented in the above list. Probably the biggest one is recommender systems: just search the titles for "matrix" and you'll come up with a handful of such papers. I have to believe that this is spurred, at least in part, by the Netflix challenge. Another theme was deep belief networks, which even had a rogue workshop dedicated to them. I have a feeling we'll be seeing more and more of these things. A final theme was randomized algorithms, particularly as applied to large-scale learning. I think this is a great direction, especially at NIPS, where, historically, things have been less focused on the large scale.

My only serious quibble with the program this year is that I saw a handful of papers (at least one of which got an oral and one a spotlight) that really fell down in terms of evaluation. That is, these papers all proposed a new method for solving task X and then proceeded to solve it. The experiments, however, contained only comparisons to algorithms that did not have access to all the information the new method had. I'll give an example of something that would fall into this category, though I have not seen it appear anywhere: I build a structured prediction algorithm and compare only against algorithms that make independent classification decisions. I'm not a huge quibbler over comparing against the absolute state of the art, but it really, really irks me when there are no comparisons to algorithms that use the same information, especially when standard algorithms (read: ones in any reasonable machine learning textbook) exist.

10 comments:

m2 said...

Another theme this year lies at the intersection of approximate inference in graphical models and linear/convex programming. There were at least two orals here and I probably missed others as well:

P. Mudigonda, V. Kolmogorov, P. Torr
An Analysis of Convex Relaxations for MAP Estimation

D. Sontag, T. Jaakkola
New Outer Bounds on the Marginal Polytope

It is unclear to me if these would in the end lead to practical algorithms, but the developments are definitely theoretically very nice.

There were also a number of other interesting convex programming or approximate inference papers:

L. Song, A. Smola, K. Borgwardt, A. Gretton
Colored Maximum Variance Unfolding

E. Sudderth, M. Wainwright, A. Willsky
Loop Series and Bethe Variational Bounds in Attractive Graphical Models

Chris Brew said...

The confusion class idea has been in POS tagging since Julian Kupiec and Doug Cutting built their tagger (cf. http://citeseer.ist.psu.edu/cutting92practical.html). There it goes by the name of "ambiguity class". I am sure there are multiple effective ways of using this idea, including some new ones.

mgr said...

I am a bit naive and am confused and confounded by the confusion class. I thought there was no dearth of POS tagging methods. Why the additional confusion?

Yoav said...

The "confusion/ambiguity class" of the word to be tagged is (at least) implicitly modeled in any recent POS tagger.

The ambiguity classes of the context words are successfully used by the SVMTool tagger (http://www.lsi.upc.edu/~nlp/SVMTool/).

Recent works on Hebrew/Arabic morphological disambiguation make use of context ambiguity classes as well.

Chris said...

"it was by far my most productive conference ever."

Just wondering, did this change your 'Costs and Benefits' analysis of conferences?

hal said...

Thanks to all for the POS pointers... I'll have to think a bit more about the ambiguity class thing... I have a vague sense that there's a difference, but I have to think more... I noticed Mark J didn't comment yet; he'd probably know more :P.

Chris: well, I consider Canada to be domestic ;), at least in the sense that it's cheap to attend. So I don't think this is a counter-example ;).

Yoav said...
This comment has been removed by the author.
Yoav said...

re ambiguity classes: the innovation in the NIPS paper is in the reduced lexicon models, in which the unknown ambiguity class is inferred from orthographic features.

Chris Brew said...

Re Yoav's comment. The Xerox tagger does have a class guesser that assigns an ambiguity class to unknown words on the basis of orthographic features. This is a major contributor to its performance.
