07 June 2010

NAACL 2010 Retrospective

I just returned from NAACL 2010, which was simultaneously located in my home town of Los Angeles and located nowhere near my home town of Los Angeles. (That's me trying to deride downtown LA as being nothing like real LA.)

Overall I was pleased with the program. I saw a few talks that changed (a bit) how I think about some problems. There were only one or two talks I saw that made me wonder how "that paper" got in, which I think is an acceptable level. Of course I spend a great deal of time not at talks, but no longer feel bad about doing so.

On tutorials day, I saw Hoifung Poon's tutorial on Markov Logic Networks. I think Hoifung did a great job of targeting the tutorial at just the right audience, which probably wasn't exactly me (though I still quite enjoyed it myself). I won't try to describe MLNs, but my very brief summary is "language for compactly expressing complex factor graphs (or CRFs, if you prefer)." That's not exactly right, but I think it's pretty close. You can check back in a few months and see if there are going to be any upcoming "X, Y and Daume, 2011" papers using MLNs :P. At any rate, I think it's a topic worth knowing about, especially if you really just want to get a system up and running quickly. (I'm also interested in trying Andrew McCallum's Factorie system, which, to some degree, trades ease of use for added functionality. But honestly, I don't really have time to just try things these days: students have to do that for me.)

One of my favorite papers of the conference was one that I hadn't even planned to go see! It is Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables by Tetsuji Nakagawa, Kentaro Inui and Sadao Kurohashi. (I saw it basically because by the end of the conference, I was too lazy to switch rooms after the previous talk.) There are two things I really like about this paper. The first is that the type of sentiment they're going after is really broad. Example sentences included things that I'd love to look up, but apparently were only in the slides... but definitely more than "I love this movie." The example in the paper is "Tylenol prevents cancer," which is a nice positive case.

The basic idea is that some words give you sentiment. For instance, by itself, "cancer" is probably negative. But then some words flip polarity. Like "prevents." Or negation. Or other things. They set up a model based on sentence level annotations with latent variables for the "polarity" words and for the "flipping" words. The flipping words are allowed to flip any sentiment below them in the dependency tree. Cool idea! Of course, I have to nit-pick the paper a bit. It probably would be better to allow arguments/adjuncts to flip polarity, too. Otherwise, negation (which is usually a leaf) will never flip anything. And adjectives/adverbs can't flip either (e.g., going from "happy" to "barely happy"). But overall I liked the paper.
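
To make the flipping rule concrete, here is a minimal deterministic sketch. It is not the paper's model (which learns latent polarity and flipping variables in a CRF); the lexicons, the `Node` class and the `polarity` function are all hypothetical, chosen only to illustrate "a flipper reverses the sentiment below it in the dependency tree."

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical toy lexicons for illustration; not taken from the paper.
POLARITY = {"cancer": -1, "love": +1, "happy": +1}
FLIPPERS = {"prevents", "not", "barely"}


@dataclass
class Node:
    """A word in a dependency tree together with its dependents."""
    word: str
    children: List["Node"] = field(default_factory=list)


def polarity(node: Node) -> int:
    """Return +1, -1 or 0 for the subtree rooted at `node`."""
    # Combine the word's own lexicon polarity with its dependents' polarities.
    total = POLARITY.get(node.word, 0) + sum(polarity(c) for c in node.children)
    sign = (total > 0) - (total < 0)
    # A flipping word reverses whatever sentiment sits below it in the tree.
    return -sign if node.word in FLIPPERS else sign


# "Tylenol prevents cancer": the head "prevents" flips the negative "cancer".
tylenol = Node("prevents", [Node("Tylenol"), Node("cancer")])
# "not happy": "not" is a leaf under "happy", so it has nothing below it to flip.
not_happy = Node("happy", [Node("not")])
```

Under these assumptions `polarity(tylenol)` comes out positive, while `polarity(not_happy)` also stays positive. The second case is exactly the nit-pick above: a negation that attaches as a leaf never gets to flip its head.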

A second thing I learned is that XOR problems do exist in real life, which I had previously questioned. The answer came (pretty much unintentionally) from the paper The viability of web-derived polarity lexicons by Leonid Velikovich, Sasha Blair-Goldensohn, Kerry Hannan and Ryan McDonald. I won't talk much about this paper other than to say that if you have 4 billion web pages, you can get some pretty good sentimenty words, if you're careful to not blindly apply graph propagation. But at the end, they throw a meta classifier on the polarity classification task, whose features include things like (1) how many positive terms are in the text, (2) how many negative terms are in the text, (3) how many negations are in the text. Voila! XOR! (Because negation XORs terms.)
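
A small sketch of why this is XOR, under made-up data: four two-word "texts" described only by the three counts above plus a bias. The perceptron and the feature encoding are illustrative assumptions, not the meta classifier from the paper.

```python
def perceptron_separates(data, epochs=100):
    """Return True if a perceptron reaches an error-free pass over `data`."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            return True
    return False


# Features: (#positive terms, #negative terms, #negations, bias).
raw = [
    ((1, 0, 0, 1), +1),  # "good"
    ((1, 0, 1, 1), -1),  # "not good"
    ((0, 1, 0, 1), -1),  # "bad"
    ((0, 1, 1, 1), +1),  # "not bad"
]

# Same data plus two conjunction features: pos*negation and neg*negation.
conjoined = [((p, n, g, b, p * g, n * g), y) for (p, n, g, b), y in raw]

print(perceptron_separates(raw))        # no linear separator exists
print(perceptron_separates(conjoined))  # conjunctions make it separable
```

On the raw counts, "good" vs. "not good" requires a negative weight on negations while "bad" vs. "not bad" requires a positive one, so no linear classifier can get all four right. Adding the conjunctions (or using a nonlinear meta classifier) resolves it.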

I truly enjoyed Owen Rambow's poster on The Simple Truth about Dependency and Phrase Structure Representations: An Opinion Piece. If you've ever taken a class in mathematical logic, it is very easy for me to summarize this paper: parse trees (dependency or phrase structure) are your language, but unless you have a theory of that language (in the model-theoretic sense) then whatever you do is meaningless. In more lay terms: you can always push symbols around, but unless you tie a semantics to those symbols, you're really not doing anything. Take home message: pay attention to the meaning of your symbols!

In the category of "things everyone should know about", there was Painless unsupervised learning with features by Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero and Dan Klein. The idea is that you can replace your multinomials in an HMM (or other graphical model) with little maxent models. Do EM in this for unsupervised learning and you can throw in a bunch of extra features. I would have liked to have seen a comparison against naive Bayes with the extra features, but my prior belief is sufficiently strong that I'm willing to believe that it's helpful. The only sucky thing about this training regime is that training maxent models with (tens of) thousands of classes is pretty painful. Perhaps a reduction like tournaments or SECOC would help bring it down to a log factor.
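
As a rough sketch of what "little maxent models" means for an HMM emission distribution: instead of one free parameter per (tag, word) pair, each emission probability is a softmax over feature scores. The function names, the feature templates and the weight dictionary below are hypothetical, not the paper's actual setup.

```python
import math


def features(word, tag):
    """Hypothetical feature templates; the paper's features may differ."""
    return [
        ("word", word, tag),                      # identity: recovers a plain multinomial
        ("suffix2", word[-2:], tag),              # shared across words with the same ending
        ("capitalized", word[0].isupper(), tag),  # shared across capitalized words
    ]


def emission_probs(tag, vocab, weights, feature_fn=features):
    """p(word | tag) proportional to exp(sum of weights of active features)."""
    scores = {
        word: sum(weights.get(f, 0.0) for f in feature_fn(word, tag))
        for word in vocab
    }
    top = max(scores.values())  # subtract the max for numerical stability
    z = sum(math.exp(s - top) for s in scores.values())
    return {word: math.exp(s - top) / z for word, s in scores.items()}


vocab = ["running", "Paris", "dog"]
weights = {("capitalized", True, "NNP"): 2.0}  # made-up weight for illustration
probs = emission_probs("NNP", vocab, weights)
```

Note that the normalizer `z` sums over the whole vocabulary for every tag, which is where the pain with (tens of) thousands of classes comes from.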

I didn't see the presentation for From baby steps to leapfrog: How "Less is More" in unsupervised dependency parsing by Valentin Spitkovsky, Hiyan Alshawi and Dan Jurafsky, but I read it. The idea is that you can do better unsupervised dependency parsing by giving your learner progressively harder examples. I really really really tried to get something like this to work for unsearn, but nothing helped and most things hurt. (I only tried adding progressively longer sentences: other ideas, based on conversations with other folks, include looking at vocabulary size, part of speech (e.g., human babies learn words in a particular order), etc.) I'm thrilled it actually works.
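
The "progressively longer sentences" schedule can be sketched as a generic loop. This is only an outline under stated assumptions: `train` is a hypothetical callable (say, EM started from the given model), and the function name and signature are invented for illustration rather than taken from the paper.

```python
def length_curriculum(sentences, train, init_model, max_len=None):
    """Train in stages on sentences of length <= 1, <= 2, ..., <= max_len.

    `train(model, data)` is a hypothetical callable that returns a new model
    trained on `data`, starting from `model`.
    """
    if max_len is None:
        max_len = max(len(s) for s in sentences)
    model = init_model
    for limit in range(1, max_len + 1):
        subset = [s for s in sentences if len(s) <= limit]
        if subset:
            # Each stage is initialized from the previous stage's model.
            model = train(model, subset)
    return model


# Toy usage: the "model" is just a log of how many sentences each stage saw.
corpus = [["go"], ["dogs", "bark"], ["the", "dog", "barks"]]
stage_sizes = length_curriculum(corpus, lambda m, data: m + [len(data)], [])
```

Swapping the `len(s) <= limit` filter for a different difficulty measure (vocabulary size, part of speech) would give the other curricula mentioned above.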

Again, I didn't see Discriminative Learning over Constrained Latent Representations by Ming-Wei Chang, Dan Goldwasser, Dan Roth and Vivek Srikumar, but I learned about the work when I visited UIUC recently (thanks again for the invitation, Dan R.!). This paper does exactly what you would guess from the title: learns good discriminative models when you have complex latent structures that you know something about a priori.

I usually ask people at the end of conferences what papers they liked. Here are some papers that were spoken highly of by my fellow NAACLers. (This list is almost unadulterated: one person actually nominated one of the papers I thought shouldn't have gotten in, so I've left it off the list. Other than that, I think I've included everything that was specifically mentioned to me.)
  1. Optimal Parsing Strategies for Linear Context-Free Rewriting Systems by Daniel Gildea.

  2. Products of Random Latent Variable Grammars by Slav Petrov.

  3. Joint Parsing and Alignment with Weakly Synchronized Grammars by David Burkett, John Blitzer and Dan Klein.

  4. For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia by Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil and Lillian Lee.

  5. Type-Based MCMC by Percy Liang, Michael I. Jordan and Dan Klein.

I think I probably have two high level "complaints" about the program this year. First, I feel like we're seeing more and more "I downloaded blah blah blah data and trained a model using entirely standard features to predict something and it kind of worked" papers. I apologize if I've just described your paper, but these papers really rub me the wrong way. I feel like I just don't learn anything from them: we already know that machine learning works surprisingly well and I don't really need more evidence of that. Now, if my sentence described your paper, but your paper additionally had a really interesting analysis that helps us understand something about language, then you rock! Second, I saw a lot of presentations where speakers were somewhat embarrassingly unaware of very prominent very relevant prior work. (In none of these cases was the prior work my own: it was work that's much more famous.) Sometimes the papers were cited (and it was more of a "why didn't you compare against that" issue) but very frequently they were not. Obviously not everyone knows about all papers, but I recognized this even for papers that aren't even close to my area.

Okay, I just ranted, so let's end on a positive note. I'm leaving the conference knowing more than when I went, and I had fun at the same time. Often we complain about the obvious type I errors and not-so-obvious type II errors, but overall I found the program strong. Many thanks to the entire program committee for putting together an on-average very good set of papers, and many thanks to all of you for writing these papers!

10 comments:

  1. Thanks for the retrospective Hal. I had a lot of fun at NAACL this year. It was really well organized. Kudos to the local organizers. Unfortunately there were way too many parallel sessions and not enough posters, IMHO.

    I too enjoyed the dependency tree sentiment paper by Nakagawa et al. I suspect that it should be possible to extend their model to handle the case when modifiers are the negators.

    For the Berg-Kirkpatrick et al. paper I still don't understand why the direct gradient is SO much better empirically. They say that the likelihood is about the same in practice. Normally I would just say they got lucky, but the fact that they observe it on so many different tasks suggests that something is going on here. I think I may have already brought this up with Hal in L.A., but if anyone has any thoughts about this I would love to hear them.

    Another paper I liked was "Bayesian Inference for Finite-State Transducers" by Chiang et al. I would like to understand better why their averaging trick for model selection works amazingly well.

  2. Ryan: For the Berg-Kirkpatrick paper, I want to know that, too. It seems like magic. Maybe they should talk to their next-door-neighbor, Percy Liang, who is the expert at understanding the errors that unsupervised models make :). More seriously, though, I kind of suspect we might never get to know, given that the likelihoods are the same. That is, I've thought about it a bunch and I can't come up with any concrete hypothesis, much less one that's testable. If the likelihoods were better as well, I could formulate some stuff to test.

  3. I fully agree that more and more people are either not aware of related work/prior art or know about it but choose to ignore the topic, and it also bugs me.

    Last year, I decided to give this more weight when reviewing submissions as a result. Every author should recognize and acknowledge the most closely related paper they know of, and the first work or early seminal work in the area, as well as the most recent and best results they know. Furthermore, authors should demonstrate they are aware of what their own contribution is, i.e. how they are different.

  4. Excellent summary. I had the same comments about the CRF sentiment paper. I found myself sitting in the talk unplanned and I was very pleased. I had actually come up with a very similar model with a student earlier in the semester and it was/is on our TODO list. I'm glad to see that it works!

  5. Great NAACL summary, Hal! We've got a hypothesis about why the direct gradient approach outperforms EM. It has to do with coarse features and how they're used during the initial stages of learning.

    I wrote up a summary of the hypothesis and some new results that back it up, posted here. Come check it out, and let me know what you think.

  6. There ain't no reliability in XOR.

  7. Thanks for this summary about NAACL. I hadn't known anything about these conferences.
    Now I'll look over the papers. This NLP area is very interesting, especially the topic of sense in a document. It's fascinating.
