NAACL is over, and while the majority of the conference took place in hallways for me this time around (lots of old and new friends made their way to NYC), I did get a chance to see a few papers that I enjoyed. I'm short-listing some of them here (including workshop papers), though I encourage others to list those that they liked, since one person can achieve at most about 33% recall.
- Effective Self-Training for Parsing (McClosky, Charniak + Johnson). Train a parser A, run A over two million unlabeled sentences, then use A's output as additional training data for a new parser B; B does better on test data. This is a somewhat surprising result, and one with plenty of counterexamples in the literature, but these guys got it to work. The key was to use the Charniak + Johnson reranking parser as parser A, rather than the first-stage Charniak parser alone. This raises the question: why does self-training help with the reranked parser but not with the vanilla one? Unfortunately, the paper doesn't answer this question, though talking to Eugene, they are interested in ideas as to why.
One obvious idea is that the epsilon difference in performance between reranking and not reranking happens to fall at a phase transition. I find this unlikely. My guess (which I've suggested and they seem interested in exploring) is the following. The unreranked parser makes systematic errors, so self-training just reinforces these errors, leading to worse performance. In contrast, the reranked parser's errors have high variance (this is intuitively reasonable if you think about how reranking works), and therefore they come across as "noise" when you retrain on the large corpus. It's easy to verify this claim: simply remove enough training data from the reranked parser that its performance is comparable to the vanilla parser's, and see if you still get self-training improvements. If so, I may be right. One could then do a more exhaustive error analysis. Other people have suggested other explanations, which I think the Brown folks will also look into. I'm very curious.
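For concreteness, the self-training recipe itself is just a short loop. Here is a minimal sketch in which a toy most-frequent-tag "tagger" stands in for the parser; everything in it (`TinyTagger`, the data) is my own hypothetical illustration, not the authors' setup, which uses the full Charniak + Johnson reranking parser as parser A:

```python
# A minimal sketch of the self-training recipe, with a stand-in "parser":
# a toy tagger that memorizes the most frequent tag for each word.
from collections import Counter, defaultdict

class TinyTagger:
    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, labeled):            # labeled: list of (word, tag) pairs
        for word, tag in labeled:
            self.counts[word][tag] += 1

    def tag(self, word):                 # back off to the most common tag overall
        if word in self.counts:
            return self.counts[word].most_common(1)[0][0]
        total = Counter()
        for c in self.counts.values():
            total.update(c)
        return total.most_common(1)[0][0] if total else "N"

def self_train(seed, unlabeled):
    a = TinyTagger()                     # step 1: train parser A on gold data
    a.train(seed)
    auto = [(w, a.tag(w)) for w in unlabeled]   # step 2: run A on raw text
    b = TinyTagger()                     # step 3: train B on gold + A's output
    b.train(seed + auto)
    return b

seed = [("dog", "N"), ("runs", "V"), ("dog", "N")]
b = self_train(seed, ["dog", "cat", "runs"])
print(b.tag("cat"))   # "cat" was only ever seen with A's automatic tag
```

Whether B improves over A then hinges entirely on the quality (and, per my guess above, the error structure) of the automatic labels in step 2.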
- Prototype-driven Learning for Sequence Models (Haghighi + Klein). This received the best student paper award (congrats, Aria!). The idea is that we don't want to have to label entire sequences to do normal supervised learning. Instead, for each possible label, we list a small number of prototypical words for that label. They then use a constrained MRF, together with distributional similarity features, to learn a sequence labeling model. I think this idea is very interesting, and I am in general very interested in ways of using "not quite right" data to learn to solve problems.
My only concern with this approach is that it is quite strict to require that all occurrences of prototypical words get their associated tag. This is perhaps okay for something like POS tagging, but for more general problems, I feel that words are too ambiguous for this to work. (And, for completeness, since I asked this question at the talk, I would have liked to have seen at least one baseline model that made use of the prototypes, even in a naive way.)
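To make the prototype idea concrete, here is a toy caricature of it: prototype words are hard-assigned their label, and every other word gets the label of the prototype whose contexts it best matches in the raw corpus. This is my own simplification for illustration (the prototype lists and corpus are made up), not Haghighi + Klein's actual model, which is a constrained MRF with proper distributional similarity features:

```python
# Toy prototype-driven tagging: hard constraints on prototypes,
# context-overlap nearest-prototype labeling for everything else.
from collections import defaultdict

prototypes = {"DET": ["the", "a"], "N": ["dog", "cat"], "V": ["runs", "sleeps"]}

def context_vectors(sentences):
    """Map each word to its set of (direction, neighbor) context features."""
    ctx = defaultdict(set)
    for sent in sentences:
        for i, w in enumerate(sent):
            if i > 0:
                ctx[w].add(("L", sent[i - 1]))
            if i + 1 < len(sent):
                ctx[w].add(("R", sent[i + 1]))
    return ctx

def tag(word, ctx):
    proto_label = {w: lab for lab, ws in prototypes.items() for w in ws}
    if word in proto_label:              # hard constraint: prototypes keep their tag
        return proto_label[word]
    best_lab, best_overlap = None, -1    # otherwise: most context overlap wins
    for lab, ws in prototypes.items():
        overlap = max(len(ctx[word] & ctx[w]) for w in ws)
        if overlap > best_overlap:
            best_lab, best_overlap = lab, overlap
    return best_lab

corpus = [["the", "dog", "runs"], ["a", "cat", "sleeps"],
          ["the", "bird", "runs"], ["a", "bird", "sleeps"]]
ctx = context_vectors(corpus)
print(tag("bird", ctx))   # "bird" shares contexts with the N prototypes
```

The hard constraint in `tag` is exactly the strictness I worry about above: any word that happens to be on a prototype list can never receive a different label, no matter the context.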
- Quasi-Synchronous Grammars: Alignment by Soft Projection of Syntactic Dependencies (D. Smith + Eisner, SMT Workshop). This paper argues that the strict synchronicity assumptions made by SCFGs, for either projection or translation, are too strong (others, like Hwa, have argued the same). They analyze the sorts of transformations required to explain real parallel data (does a parent/child pair end up as a parent/child pair after translation, or is only sisterhood maintained?). They propose a translation model that can account for much more varied transformations than standard SCFGs. It's a workshop paper, and there's no decoder yet, but talking to David afterwards, they are actively working on one, and it may actually be safe to hold one's breath.
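The kind of analysis they do can be sketched as follows: given a source dependency tree, a target dependency tree, and a word alignment, classify what happens to each source parent/child edge after translation. The trees, alignment, and category names below are my own illustrative assumptions, not the paper's data or taxonomy:

```python
# Classify what each source parent-child dependency becomes on the target side.
def classify_edges(src_head, tgt_head, align):
    """src_head/tgt_head: child -> parent maps (None marks the root);
    align: source word -> target word."""
    out = {}
    for child, parent in src_head.items():
        if parent is None or child not in align or parent not in align:
            continue
        c, p = align[child], align[parent]
        if tgt_head.get(c) == p:
            out[(child, parent)] = "parent-child"        # dependency preserved
        elif tgt_head.get(p) == c:
            out[(child, parent)] = "head-swapped"        # direction reversed
        elif tgt_head.get(c) is not None and tgt_head.get(c) == tgt_head.get(p):
            out[(child, parent)] = "sisters"             # now share a parent
        else:
            out[(child, parent)] = "other"
    return out

# English "red wine" -> French "vin rouge": the adjective stays the child.
src = {"red": "wine", "wine": None}
tgt = {"rouge": "vin", "vin": None}
align = {"red": "rouge", "wine": "vin"}
print(classify_edges(src, tgt, align))
```

Counting how often real bitext falls into the non-"parent-child" buckets is, roughly, the empirical case against strict synchronicity.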
- Computational Challenges in Parsing by Classification (Turian + Melamed, CHPJISLP Workshop). This is a history-based parsing-as-classification model, in which one essentially parses in a bottom-up, sequential fashion. A lot of the paper is dedicated to how to scale up to training decision trees as classifiers (laziness, parallelization and sampling together provide a 100-fold speedup), but something I found extremely interesting (though not shocking once you hear it) is that right-to-left search outperformed left-to-right. The obvious explanation is that English is largely right-branching, so by searching right-to-left, most decisions remain strongly local. They achieve quite impressive results, and there's some obvious way in which this algorithm could be Searnified.
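The right-branching intuition can be seen in a trivial sketch (my own reading of the explanation, not Turian + Melamed's parser): if the true structure is fully right-branching, a right-to-left scan means every merge decision is between the current word and an already-complete constituent immediately to its right, so each decision is local.

```python
# Greedily build a fully right-branching tree by scanning right-to-left:
# each step merges one word with the single completed constituent so far.
def parse_right_to_left(words):
    tree = None
    for w in reversed(words):                       # right-to-left scan
        tree = w if tree is None else (w, tree)     # local merge decision
    return tree

print(parse_right_to_left(["the", "dog", "chased", "the", "cat"]))
# builds ('the', ('dog', ('chased', ('the', 'cat'))))
```

A left-to-right scan over the same structure would instead have to delay every attachment until the end of the sentence, which is the non-locality the paper's result suggests hurts.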