I'm going to try something new and daring this time. I will talk about papers I liked, but I will mention some things I think could be improved. I hope everyone finds this interesting and productive. As usual, I didn't see everything, didn't necessarily understand everything, and my errors are my fault. With that warning, here's my EMNLP 2014 list, sorted in anthology order.
- Identifying Argumentative Discourse Structures in Persuasive Essays (Christian Stab, Iryna Gurevych)
Full-on discourse parsing of rhetorical structure (e.g., RST) is really hard. In previous work, these authors created a corpus of essays annotated for (a) major claim, (b) claim, (c) premise and (d) none. (Prevalences of 5%, 23%, 55% and 17%, respectively.) This paper is about predicting that structure; in particular, both labeling spans (sentences?) with their rhetorical class (one of those four possibilities) and connecting them with binary support/non-support labels. Overall, they do quite well at both tasks: high to mid seventies in accuracy/F, respectively. They blame some of their errors on lack of coreference resolution. One question I had was: if you think of this annotation as a boiled-down version of full rhetorical structure, what are you missing? Only 17% of sentences aren't one of the four classes, but if those fall high up in a rhetorical tree, I could imagine they would cut off a lot more of the structure. That said, I really liked that the authors found a tractable subproblem of full discourse parsing and looked at it on its own.
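To make the two subtasks concrete, here is a minimal sketch of the setup as I understand it: span classification into the four categories, plus binary classification of candidate support relations between argumentative spans. The class and field names are mine, not the paper's.

```python
# Minimal sketch of the two subtasks; names are illustrative, not the paper's.
from dataclasses import dataclass
from itertools import permutations

LABELS = ["major_claim", "claim", "premise", "none"]

@dataclass
class Component:
    span_id: int
    text: str
    label: str = "none"   # subtask 1: assign one of LABELS to each span

@dataclass
class SupportEdge:
    source: int           # span doing the supporting
    target: int           # span being supported
    supports: bool        # subtask 2: binary support / non-support

def candidate_edges(components):
    """Enumerate ordered pairs of argumentative spans as candidate relations."""
    arg_spans = [c for c in components if c.label != "none"]
    for src, tgt in permutations(arg_spans, 2):
        yield SupportEdge(src.span_id, tgt.span_id, supports=False)
```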
- Aligning context-based statistical models of language with brain activity during reading (Leila Wehbe, Ashish Vaswani, Kevin Knight, Tom Mitchell)
When you read something, your brain lights up in certain ways. This paper is about finding predictive correlations between what you're reading (in particular, how a language model scores the incoming word) and how your brain lights up (via MEG imaging--the modality with good temporal resolution but bad spatial resolution). The basic result is that the hidden layer of a recurrent neural network language model does a pretty good job at the following prediction task: given an MEG reading, decide which of two words (the true one and a confounder) the person is reading. It does pretty well when readers are reading from Harry Potter: a big novelty here is that they use natural text rather than carefully laboratory-controlled text. This is a cool step in the line of connecting NLP with neurolinguistics. The "hypothesis" in this paper is couched in the language of the integration theory of language processing (i.e., when you hear a word, your brain has to spend energy to integrate it into your prior analysis) rather than the (in my relatively uneducated opinion) preferred model of prediction theory (i.e., your brain has already predicted the word and the work it does is correcting for errors in its predictions). I think I can reinterpret their results to be consistent with the predictive theory, but I'd like to not have to do that work myself :P. I think the hardest part about this work is the fact that it uses natural, rather than laboratory, text. In all previous experiments I know of, it's been very important to control for as much as you can about the context (history) and the word to be predicted. In fact, most work only varies the history and never varies the word. This is done because any two words vary in so many ways (frequency, spelling, phonology, length, etc.) that you do not want to have to correct for those. For instance, you can ask if a bilingual speaker represents "dog" and "chien" in the same way, but this is a silly question because of course they don't: at the very least, they sound different and are spelled differently. Anyway, back to this paper: in the experiment they control for word length on the pairwise task, but I think controlling for word probability (under the surprise/predictive hypothesis) would be stronger. Though really, as neat as it is to use natural text, I think it hurts more than it helps here.
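To give a feel for how such a pairwise evaluation might look, here is a rough sketch: fit a regression from per-word language-model features (e.g., the hidden state at that word) to the MEG signal, then, for a held-out reading, pick whichever of the two candidate words gives the closer predicted signal. The ridge regression and Euclidean distance below are my simplifications, not necessarily the paper's exact choices.

```python
# A simplified 2-alternative forced-choice evaluation (my approximation).
import numpy as np

def fit_ridge(X, Y, lam=1.0):
    """X: n_words x d_features (LM features), Y: n_words x d_sensors (MEG)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def pick_word(W, meg_obs, feat_true, feat_confounder):
    """Return 0 if the true word's predicted MEG is closer to the observation."""
    err_true = np.linalg.norm(meg_obs - feat_true @ W)
    err_conf = np.linalg.norm(meg_obs - feat_confounder @ W)
    return 0 if err_true <= err_conf else 1
```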
- Learning Abstract Concept Embeddings from Multi-Modal Data: Since You Probably Can't See What I Mean (Felix Hill, Anna Korhonen)
This is the ultimate "duh"-moment paper for me: basically we're all excited about embedding words and images into the same space, but lots of words aren't concrete and don't have visual counterparts (e.g., "flower" versus "encouragement"). They argue that concrete nouns are actually the rare category (72% of nouns and verbs are at least as abstract as "war"). The model they have for this is relatively simple: ignore abstract words. (Okay, it's a bit more than that, but that's the general idea.) When you do this correctly, you get a marked improvement in performance. One thing I kept wondering during the talk, especially since Felix kept using the word "concrete" to refer to something that's not actually a mix of water, aggregate and cement, was the whole issue of metaphor and, really, word sense. Unfortunately neither of these words seems to appear in the paper, so it's hard to say what's going on. I think it could be really cool to combine these ideas with those of understanding metaphor, or at least sense, though of course those are really, really hard problems!
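In case "ignore abstract words" sounds too glib, here is roughly the flavor of it: when building a multi-modal representation, only attach perceptual features to words whose concreteness rating clears some threshold; abstract words keep a purely linguistic vector (zero-padded so dimensions line up). The threshold, the padding scheme, and the concreteness scores are placeholders of mine, not the paper's model.

```python
# Only fuse visual features for sufficiently concrete words (my simplification).
import numpy as np

def fuse(word, ling_vec, vis_vec, concreteness, threshold=4.0):
    """concreteness: dict mapping word -> rating (e.g., on a 1-5 scale)."""
    if concreteness.get(word, 0.0) >= threshold:
        return np.concatenate([ling_vec, vis_vec])
    return np.concatenate([ling_vec, np.zeros_like(vis_vec)])
```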
- NaturalLI: Natural Logic Inference for Common Sense Reasoning (Gabor Angeli, Christopher D. Manning)
I haven't followed the natural logic stuff for a while, so my assessment of this paper is pretty uninformed, but this was perhaps my favorite paper at the conference: I'll be presenting it in reading group on Monday, and might post more notes after that. Basically the idea is that you have a knowledge base that contains sentences like "The cat ate a mouse" and you have a query like "No carnivores eat animals". Instead of trying to map these sentences to logical form, we're going to treat the text itself as a logical form and do sentence rewriting to try to prove one from the other. An important insight is that quantification changes the direction in which you can move and preserve entailment: for instance, if you say "All cats ..." then you can move down WordNet to "All housecats ..." but not up to "All mammals ..." ("No cats ..." patterns the same way), whereas if you say "Some cats ..." then you can only move up the hierarchy. They set up a model like this, cast it as a big search problem, and get remarkably good scores (89% precision, 94% recall) on the FraCaS textual entailment task, which is competitive with prior systems (though on a subset of the data, so not really comparable). Like I said, I really liked this paper; I could easily imagine trying to combine these ideas with stuff in the paraphrase database, especially once they have entailment directions annotated/predicted there.
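Here is a toy rendering of that monotonicity-licensed substitution, just for the subject position. The tiny taxonomy and the quantifier table are mine; the real system searches over WordNet-style mutation graphs with learned costs and MacCartney-style relation tracking rather than this hard-coded rule.

```python
# Toy monotonicity-licensed substitutions for the subject noun (illustrative).
HYPERNYM = {"housecat": "cat", "cat": "mammal", "mammal": "animal"}
HYPONYMS = {}
for child, parent in HYPERNYM.items():
    HYPONYMS.setdefault(parent, []).append(child)

DOWNWARD_SUBJECT = {"all", "no"}   # "All cats purr"  entails "All housecats purr"
UPWARD_SUBJECT   = {"some"}        # "Some cats purr" entails "Some mammals purr"

def licensed_subject_rewrites(quantifier, noun):
    """Substitutions for the subject noun that preserve entailment."""
    if quantifier in DOWNWARD_SUBJECT:
        return HYPONYMS.get(noun, [])          # specialize: move down the hierarchy
    if quantifier in UPWARD_SUBJECT:
        return [HYPERNYM[noun]] if noun in HYPERNYM else []   # generalize: move up
    return []
```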
- A Fast and Accurate Dependency Parser using Neural Networks (Danqi Chen, Christopher Manning)
The very brief summary of this paper is: take MaltParser, replace the SVM with a neural network, and you do well. Slightly more detailed: when MaltParser makes decisions, it looks at a few points on the stack and in the buffer. For each of these points, take the word and POS tag and map them to embeddings. You now have a half dozen embeddings (yes, we're embedding POS tags too). Also embed the dependency relations. Throw all of this through a network and predict the next transition. The result: 92% accuracy (basically tied with MST Parser, better than MaltParser by 2%) and about twice as fast as MaltParser. They also compare to their own implementation of MaltParser; perhaps I missed it in the talk, but I think they only reported the speed comparison against their implementation (they are 20 times faster than it) and not against the real MaltParser. In order to get this speed, they do a caching trick that remembers vector-matrix products for common words. This is very cool, and it's neat to see the embeddings of the POS tags and dependency relations: reminds me of Matsuzaki-style annotated labels. The one weakness here, I think, is that it's kind of unfair to compare a version of their own parser with significant engineering tricks (the caching made it 8-10 times faster, so without it, it's 5 times slower than MaltParser) to an un-tricked-out MaltParser: in particular, if you really care about speed, you're going to do feature hashing in MaltParser and never compute all the strings that make it so slow. If this were done in MaltParser, then the neural network version, even if everything were cached, would be doing strictly more computation. Anyway, all that says is that I'm more impressed with the accuracy results than the speed results: and they appear to be doing quite well here, at the very least without sacrificing speed. (And as Yoav Goldberg reminds me, parsing on GPU with this architecture could be a big win!)
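For concreteness, here is a stripped-down sketch of the scoring architecture as I understood it: look up embeddings for the words, POS tags, and arc labels at a handful of stack/buffer positions, concatenate them, and push them through one hidden layer (with the paper's cube nonlinearity) to score the candidate transitions. The dimensions, the 6+6+6 feature positions, and the random weights below are placeholders, not the paper's configuration.

```python
# Sketch of the transition-scoring network; sizes and features are illustrative.
import numpy as np

rng = np.random.default_rng(0)
D_EMB, N_FEATS, D_HID, N_TRANS = 50, 18, 200, 3   # 6 word + 6 POS + 6 label slots

E_word = rng.normal(size=(10_000, D_EMB))   # word embeddings
E_pos  = rng.normal(size=(50, D_EMB))       # POS-tag embeddings
E_dep  = rng.normal(size=(40, D_EMB))       # dependency-label embeddings
W1 = rng.normal(size=(D_HID, N_FEATS * D_EMB))
b1 = np.zeros(D_HID)
W2 = rng.normal(size=(N_TRANS, D_HID))

def score_transitions(word_ids, pos_ids, dep_ids):
    """Each argument lists 6 indices drawn from the current stack/buffer positions."""
    x = np.concatenate([E_word[word_ids].ravel(),
                        E_pos[pos_ids].ravel(),
                        E_dep[dep_ids].ravel()])
    h = (W1 @ x + b1) ** 3       # cube nonlinearity from the paper
    return W2 @ h                # unnormalized scores for SHIFT / LEFT-ARC / RIGHT-ARC
```

The caching trick then falls out of the fact that W1 @ x decomposes into a sum of per-position slices of W1 times the corresponding embedding, so those products can be precomputed for frequent tokens before the nonlinearity is applied.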
- Unsupervised Sentence Enhancement for Automatic Summarization (Jackie Chi Kit Cheung, Gerald Penn)
We all know how to do sentence fusion: take two sentences and stitch them together. The idea in this paper is to also enhance those sentences with little fragments taken from elsewhere. For instance, maybe you have a great sentence fusion but would like to add a PP taken from a related document: how can we do this? The model is basically one of extracting dependency triples, gluing them together and linearizing them. The biggest problem they face here is event coreference: given two sentences that contain the same (up to lemma) verb, is it the same event or not? They address event coref by looking at the participants of the predicate, though they acknowledge that more complex approaches might help further. Without the event coreference their approach hurts (over simple fusion), but with event coreference it helps. I would really have liked to see human judgments for both quality and grammaticality here: these things are notoriously hard to measure automatically, and I'm not sure we can really ascertain much without them. (Though yes, as the authors point out, human scores are notoriously hard to interpret too.)
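Roughly, I take the event-coreference part to amount to something like the check below: two verb mentions with the same lemma are only merged if their arguments look compatible. The argument-overlap test here is my simplification of whatever matching the paper actually uses.

```python
# Crude event-coreference check by predicate and argument overlap (my simplification).
def same_event(verb_lemma1, args1, verb_lemma2, args2):
    """args are dicts like {'nsubj': {'police'}, 'dobj': {'suspect'}}."""
    if verb_lemma1 != verb_lemma2:
        return False
    shared_roles = set(args1) & set(args2)
    return any(args1[r] & args2[r] for r in shared_roles)
```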
- What Can We Get From 1000 Tokens? A Case Study of Multilingual POS Tagging For Resource-Poor Languages (Long Duong, Trevor Cohn, Karin Verspoor, Steven Bird, Paul Cook)
This paper is about combining projection methods (via parallel data) with a small amount of labeling (in the target language) to correct errors or annotation mismatches of the projection. The primary comparison is against Oscar Täckström's work (which I have an affinity for, having been his examiner for his Ph.D. defense), and the claim in this paper is that they do better with less. They can do even better using dictionaries of the type that Oscar has. One thing that's surprising is that when they replace the maxent correction model with a CRF, things get worse. This seems wacky, but given the authors, I'm pretty confident that this isn't because they messed something up, which might be my default reaction :P. One obvious question is whether you can do projection and correction jointly, and I suspect someone will try this at some point. It could be cool to do it jointly not just over tasks but over languages in a multitask learning framework.
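A miniature version of the projection-then-correction recipe, under my own assumptions: tags projected across word alignments give noisy supervision, and the roughly 1000 gold-annotated target-language tokens train a model that maps (token, projected tag, context) to a corrected tag. The feature set and the scikit-learn logistic regression (standing in for maxent) are my choices, not the authors'.

```python
# Train a simple correction model on a small gold sample (illustrative setup).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(tokens, proj_tags, i):
    return {"word": tokens[i].lower(),
            "proj": proj_tags[i],                          # the projected (noisy) tag
            "prev_proj": proj_tags[i - 1] if i else "BOS",
            "suffix3": tokens[i][-3:].lower()}

def train_corrector(sentences):
    """sentences: list of (tokens, projected_tags, gold_tags) covering ~1000 tokens."""
    X, y = [], []
    for tokens, proj, gold in sentences:
        for i in range(len(tokens)):
            X.append(features(tokens, proj, i))
            y.append(gold[i])
    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)
    return vec, clf
```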
- Greed is Good if Randomized: New Inference for Dependency Parsing (Yuan Zhang, Tao Lei, Regina Barzilay, Tommi Jaakkola)
Local search is back! Let's do dependency parsing by first generating a random tree, then doing bottom-up reassignment of parents to improve it. You can prove that if you do this bottom up, then in one pass you can transform any tree into any other, and all intermediate steps will also be trees. This space has lots of local optima (one cool thing is that they can count the number of local optima in n-cubed time using a variant of Chu-Liu-Edmonds, at least for first-order models). To get around this they do a few hundred random restarts. One thing I really liked about this paper is that they went above and beyond what I might expect to see in a paper like this, really nailing the question of local optima, etc. They also use much more complex reranking features, which are available because at any step they have a full tree. The results are really good. I'm a bit surprised they need so many random restarts, which makes me worry about generalizing these results to more complex problems; that generalization is the most important thing, cuz I'm not sure we need more algorithms for dependency parsing. One cool thing is they get a natural anytime algorithm: you can stop the restarts at any point and you have some tree. This could be coupled in interesting ways with scheduling to handle cases where you need to parse one million sentences in some fixed amount of time: how much time do you spend on each? I was a bit surprised that they don't specifically train a model to make search decisions: they just train the model using their decoder and standard update strategies. It seems like something like Wick et al.'s SampleRank is just screaming to be used here.
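Here is a runnable caricature of the randomized greedy inference, under my own simplifications: the paper's version sweeps bottom-up and uses a trained scorer with rich features over the full tree, whereas the sketch below just reattaches words in index order, rejects cycles by brute force, and takes an arbitrary scoring function.

```python
# Randomized greedy hill-climbing over dependency trees (simplified sketch).
import random

def is_tree(heads):
    """heads[i] = head of token i, with index 0 acting as ROOT; reject cycles."""
    for i in range(1, len(heads)):
        seen, j = set(), i
        while j != 0:
            if j in seen:
                return False
            seen.add(j)
            j = heads[j]
    return True

def random_tree(n):
    """Rejection-sample a random head assignment that forms a tree (n includes ROOT)."""
    while True:
        heads = [0] + [random.randrange(0, n) for _ in range(1, n)]
        if is_tree(heads):
            return heads

def hill_climb(n, score, restarts=300):
    """score(heads) -> float, higher is better (stand-in for the trained model)."""
    best, best_s = None, float("-inf")
    for _ in range(restarts):
        heads = random_tree(n)
        improved = True
        while improved:
            improved = False
            for i in range(1, n):                 # try reattaching each word
                cur = score(heads)
                for h in range(n):
                    if h == i or h == heads[i]:
                        continue
                    old, heads[i] = heads[i], h
                    if is_tree(heads) and score(heads) > cur:
                        cur, improved = score(heads), True
                    else:
                        heads[i] = old            # revert non-improving move
        if score(heads) > best_s:
            best, best_s = list(heads), score(heads)
    return best
```

The anytime behavior falls out of keeping the best tree found so far: you can cut the restart loop off whenever your time budget runs out.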
Okay, so that's my list for EMNLP with mini-reviews. As I said before, these "reviews" are based 90% on the talk and 10% on skimming the papers later looking for keywords and examples, which means that there are almost certainly tons of errors. If someone notices an error and points it out, I'll just directly edit the post to fix it, with numerous apologies to the authors.
Acknowledging a small amount of bias, I also really like both of the student papers I was involved in at EMNLP. The first is about question answering, and the result is that we now beat humans at lots of QuizBowl questions. The second is about trying to learn policies for simultaneous interpretation: basically, when should we wait for more input? I put links below in case these sound interesting to you.
Please comment with your own favorites!
Re: "It seems like something like Wick et al.'s SampleRank is just screaming to be used here."
Regina's group did exactly this in an earlier work published at ACL-2014.
http://people.csail.mit.edu/regina/my_papers/inf14.pdf
Re: "One cool thing is they get a natural anytime algorithm"
If the inference problem is posed as an output-space search process, it will naturally have an anytime behavior. See the following paper for another example.
http://jair.org/papers/paper4212.html
Thank you for your comments. Actually, we were thinking about doing projection and correction jointly too.
Great summary, Hal!
I am also confused by the advantage of unstructured models over their structured counterparts in the cross-lingual part-of-speech tagging setting.
We have generally been able to get better results with structured models, in particular in the partially observed setting, where it seems the structure is able to help guide inference.
This year there were two (both great) papers reporting the opposite result. In addition to Duong et al., there was "Cross-Lingual Part-of-Speech Tagging through Ambiguous Learning" by Wisniewski et al., who use a history-based model with a logistic loss, where instead of marginalizing out the latent structure, they update towards the max label at each position. This strategy gave better results than the corresponding latent-variable CRF.
So there seems to be something going on here. Although, in the second case, I wonder if it's simply that MAP-style updates work better for this problem. Ryan and I observed similar results with latent-variable CRFs for sentiment analysis, where updating towards the max sequence rather than marginalizing sometimes gave better results.