15 November 2014

The myth of a strong baseline

I can probably count on my fingers the number of papers I've submitted for which a reviewer hasn't complained about a baseline in some way. I don't mean to imply that all of those complaints are invalid: many of them were 100% right on in ways that either I was lazy about or ways that I hadn't seen a priori.

In fact, I remember back in 2005 I visited MIT and gave a talk on what eventually became the BayeSum paper (incidentally, probably one of my favorite papers I've written, though according to friends not exactly the best written... drat). I was comparing in the talk against some baselines, but Regina Barzilay very rightfully asked me: do any of these baselines have access to the same information that my proposed approach does? At the time I gave this talk the answer was no. In the actual paper the answer is yes. I think Regina was really on to something here, and this one question asked in my talk has had a profound impact on how I think about evaluation since then. For that, I take a small time-out and say: Thank you, Regina.

Like all such influential comments, my interpretation of Regina's question has changed over time, and this post is essentially about my current thinking on this issue, and how it relates to this "does not compare against a strong enough baseline" reviewer critique that is basically a way to kill any paper.

If we're going to ask the question about whether an evaluation strategy is "good" or "bad" we have to ask ourselves why are we doing this evaluation thing in the first place. My answer always goes back to my prima facie question when I read/review papers: what did I learn from this paper? IMO, the goal of an evaluation should be to help isolate what I learned.

Let's go back to the BayeSum paper. There are two things I could have been trying to demonstrate in this paper. (A) I could have been trying to show that some new external source of knowledge is useful; (B) I could have been trying to show that some new technique is useful. In the case of BayeSum, the answer was more like (B), which is why Regina's comment was spot on. I was trying to claim the approach was good, but I hadn't disentangled the new approach from the data, and so you could have easily believed that the improvement over the baseline was due to the added source of data rather than the new technique.

In many cases it's not that cut and dry because a (B)-style new technique might enable the use of (A)-style new data in a way that wasn't possible before. In that case, the thing that an author might be trying to convince me of is that the combination is good. That's fine, but I still think it's worth disentangling these two sources of information as much as possible. This is often tough because in the current NLP atmosphere in which we're obsessed with shiny new techniques, it's not appealing to show that the new data gets you 90% of the gain and the new technique is only 10% on top of that. But this is another issue.

So evaluations are to help us learn something. Let's return now to the question of you didn't compare against a strong enough baseline. Aside from parroting what's been said many times in the past, what is the point of such a complaint, beyond Regina's challenge, which I hope I've made clear I agree with. The issue, as best I can understand it, is that it is based on the following logic:

  • Assumption: if your approach improves things against a strong baseline, then it will also improve against a weaker baseline, perhaps by more.
I'll note in passing that this is basically an assumption of submodularity of ideas.

And, like the title of this blog post suggest, I would like to put forth the idea that this assumption is often ridiculous, or at least that there's plentiful evidence against it.

I'm going to pick on machine translation now just to give a concrete example, but I don't think this phenomenon is limited to MT in any way. The basic story is I start with some MT system like Moses or cdec or whatever. I add some features to it and performance goes up. The claim is that if my baseline MT system wasn't already sufficiently strong, then any improvement I see from my proposed technique could just be solving a problem that's already been solved if I had just tuned Moses better.

There's truth to this claim, but there's also untruth. A famous recent example is the Devlin et al. neural network MT paper. Let me be clear: I think this paper is great and I 100% believe the results that they presented. I'm not attacking this paper in any way; I'm choosing it simply as a representative example. One of the results they show is some insane 8 bleu point gain over a very strong baseline. And I completely believe that this was a very strong baseline. And that the 8 bleu point improvement was real. And that everything is great.

Okay, so any paper that leads to an 8 bleu point gain over a very strong baseline is going to get reimplemented by lots of people, and this has happened. Has anyone else gotten an 8 bleu point gain? Not that I've heard. I've heard numbers along the lines of 1 to 2 bleu points, but it's very plausible I haven't heard the whole story.

So what's going on here?

The answer is simply that the assumption I wrote above is false. We've assumed that since they got 8 points on a strong baseline, I'll get at least as much on my baseline (which is likely weaker than theirs).

One problem is that "strong" isn't a total order. Different systems might get similar bleu scores, but this doesn't mean that they get them in the same way. Something like the neural networks stuff clearly solved a major problem in the BBN strong baseline system, but this major problem clearly wasn't as major of a problem in some other strong baseline systems.

Does this make the results in the Devlin paper any less impressive or important? No, of course not. I learned a lot from that paper. But one thing I didn't learn is "if you apply this approach to any system that's weaker than our strong baseline, you will get 8 bleu points." That's just not a claim that their results substantiate, and the only reason people seem to believe that this should be true is because of the faulty assumption above.

So does this mean that comparing to strong baselines is unimportant and everyone should go back to comparing their MT system against a word-for-word model 1 baseline?

Of course not. There are lots of ways to be better than such a baseline, and so "beating" it does not teach me anything. I always tell students not to get too pleased when they get state of the art performance on some standard task: someone else will beat them next year. If the only thing that I learn from their papers is that they win on task X, then next year there's nothing to learn from that paper. The paper has to teach me something else to have any sort of lasting effect: what is the generalizable knowledge.

The point is that an evaluation is not an end in itself. An evaluation is there to teach you something, or to substantiate something that I want to teach you. If I want to show you that X is important, then I should show you an experiment that isolates X to the best of my ability and demonstrates an improvement, preferably also with an error analysis that shows that what I claim my widget is doing is actually what it's doing.

01 November 2014

EMNLP 2014 paper list (with mini-reviews)

I'm going to try something new and daring this time. I will talk about papers I liked, but I will mention some things I think could be improved. I hope everyone finds this interesting and productive. As usual, I didn't see everything, didn't necessarily understand everything, and my errors are my fault. With that warning, here's my EMNLP 2014 list, sorted in anthology order.

  • Identifying Argumentative Discourse Structures in Persuasive Essays (Christian Stab, Iryna Gurevych)
    Full-on discourse parsing of rhetorical structure (e.g., RST) is really hard. In previous work, these authors created a corpus of essays annotated for (a) major claim, (b) claim, (c) premise and (d) none. (Prevalances of 5%, 23%, 55% and 17%, respectively.) This paper is about predicting that structure; in particular, both labeling spans (sentences?) as to their rhetorical class (of those four possibilities) and connecting them with binary support/non-support labels. Overall, they do quite well at both tasks: high to mid seventies in accuracy/F, respectively. They blame some of their errors on lack of coreference resolution. One question I had was: if you think of this annotation as a boiled down version of full rhetorical structure, what are you missing? Only 17% of sentences aren't one of the four classes, but if these fall high up in a rhetorical tree, I could imagine they would cut off a lot more of the structure. That said, I really liked that the authors found a tractable subproblem of full discourse parsing, and looking at it on its own.

  • Aligning context-based statistical models of language with brain activity during reading (Leila Wehbe, Ashish Vaswani, Kevin Knight, Tom Mitchell)
    When you read something, your brain lights up in certain ways. This paper is about finding predictive correlations between what you're reading (in particular, how a language model will score the incoming word) and how your brain will light up (via MEG imaging--this is the one with good time resolution but bad spacial resolution). The basic result is that the hidden layer in a recursive neural network language model does a pretty good job at the following prediction task: given an MEG reading, which of two words (the true one and a confounder) is the person reading. It does pretty well when readers are reading from Harry Potter: a big novelty here is that they use natural text rather than carefully laboratory controlled text. This is a cool step in the line of connecting NLP with neurolinguistics. The "hypothesis" in this paper is couched in the language of the integration theory of language processing (i.e., when you hear a word, your brain has to spend energy to integrate it into your prior analysis) rather than the (in my relatively uneducated opinion) preferred model of prediction theory (i.e., your brain has already predicted the word and the work it does is correcting for errors in its predictions). I think I can reinterpret their results to be consisted with the predictive theory, but I'd like to not have to do that work myself :P. I think the hardest part about this work is the fact that it uses natural, rather than laboratory, text. For all previous experiments I know, it's been very important to control for as much as you can about the context (history) and word to be predicted. In fact, most work only varies the history and never varies the word. This is done because any two words vary in so many ways (frequency, spelling, phonology, length, etc.) that you do not want to have to correct for those. For instance, you can ask if a bilingual speaker represents "dog" and "chien" in the same way, but this is a silly question because of course they don't: at the very least, they sound different and are spelled differently. Anyway, back to this paper, in the experiment they control for word length on the pairwise task, but I think controlling for word probability (under the surprise/predictive hypothesis) would be stronger. Though really as neat as it is to use natural text, I think it hurts more than it helps here.

  • Learning Abstract Concept Embeddings from Multi-Modal Data: Since You Probably Can't See What I Mean (Felix Hill, Anna Korhonen)
    This is the ultimate "duh" moment paper for me: basically we're all excited about embedding words and images into the same space, but lots of words aren't concrete and don't have visual counterparts (e.g., "flower" versus "encouragement"). They argue that concrete nouns are the rare category (72% of nouns and verbs are at least as abstract as "war"). The model they have for this is relatively simple: ignore abstract words. (Ok it's a bit more than that, but that's the general idea.) When you do this correctly, you get a marked improvement in performance. One thing I kept wondering during the talk, especially since Felix kept using the word "concrete" to refer to something that's not actually a mix of water, aggregate and cement, was the whole issue of metaphor and really word sense. Unfortunately neither of these words seem to appear in the paper, so it's hard to say what's going on. I think it could be really cool to combine these ideas with those of understanding metaphor or at least sense, though of course those are really really hard problems!

  • NaturalLI: Natural Logic Inference for Common Sense Reasoning (Gabor Angeli, Christopher D. Manning)
    I haven't followed the natural logic stuff for a while, so my assessment of this paper is pretty uninformed, but this was perhaps my favorite paper at the conference: I'll be presenting it in reading group on Monday, and might post more notes after that. Basically the idea is that you have a knowledge base that contains sentences like "The cat ate a mouse" and you have a query like "No carnivores eat animals". Instead of trying to map these sentences to logical form, we're going to treat the text itself as a logical form and do sentence rewriting to try to prove one from the other. An important insight is that quantification changes the direction of entailment that you can move, for instance if you say "All cats ..." then you can move down WordNet to "All housecats ..." but not up to "All mammals ..."; similarly if you say "No cats ..." then you can only move up the hierarchy. They set up a model like this, cast it as a big search problem, and get remarkably good scores (89% precision, 94% recall) on the FraCaS textual entailment task, which is competitive with the competition (though on a subset of the data, so not really comparable). Like I said, I really liked this paper; I could easily imagine trying to combine these ideas with stuff in the paraphrase database, especially once they have entailment directions annotated/predicted there.

  • A Fast and Accurate Dependency Parser using Neural Networks (Danqi Chen, Christopher Manning)
    The very brief summary of this paper is: take MaltParser and replace the SVM with a neural network and you do well. Slightly more detailed is that when MaltParser makes decisions, it looks at a few points on the stack and in the buffer. For each of these points, take the word and POS, map them to embeddings. You now have a half dozen embeddings (yes, we're embedding POS tags too). Also embed the dependency relation. Throw this through a network and make a prediction. The result 92% accuracy (basically tied with MST Parser, better than MaltParser by 2%) and about twice as fast as MaltParser. They also compare to their own implementation of MaltParser. Perhaps I missed it in the talk, but I think they only talked about their implementation there (they are 20 times faster) and not about real MaltParser. In order to get this speed, they do a caching trick that remembers vector-matrix products for common words. This is very cool and it's neat to see the embeddings of the POS and dependency relations: reminds me of Matsuzaki-style annotated labels. The one weakness here, I think, is that it's kind of unfair to compare a version of their own parser that with significant engineering tricks (the caching made it 8-10 times faster, so without it, it's 5 times slower than MaltParser) to an un-tricked-out MaltParser: in particular, if you really care about speed, you're going to do feature hashing in MaltParser and never compute all the strings that make it so slow. If this were done in MaltParser, then the neural network version, even if everything were cached, would be doing strictly more computation. Anyway, all that says is that I'm more impressed with the accuracy results than the speed results: and they appear to be doing quite well here, at the very least without sacrificing speed. (And as Yoav Goldberg reminds me, parsing on GPU with this architecture could be a big win!)

  • Unsupervised Sentence Enhancement for Automatic Summarization (Jackie Chi Kit Cheung, Gerald Penn)
    We all know how to do sentence fusion: take two sentences and stitch them together. The idea in this paper is to also enhance those sentences with little fragments taken from elsewhere. For instance, maybe you have a great sentence fusion but would like to add a PP taken from a related document: how can we do this? The model is basically one of extracting dependency triples, gluing them together and linearizing them. The biggest problem they face here is event coreference: given two sentences that contain the same (up to lemma) verb, is it the same event or not? They address event coref by looking at the participants of the predicate, though acknowledge more complex approaches might help further. Without the event coreference their approach hurts (over simple fusion) but with event coreference it helps. I would really have liked to have seen human judgments for both quality and grammaticality here: these things are notoriously hard to measure automatically and I'm not sure that we can really ascertain much without them. (Though yes, as the authors point out, human scores are notoriously hard to interpret too.)

  • What Can We Get From 1000 Tokens? A Case Study of Multilingual POS Tagging For Resource-Poor Languages (Long Duong, Trevor Cohn, Karin Verspoor, Steven Bird, Paul Cook)
    This paper is about a combination of projection methods (via parallel data) and small labeling (in the target language) to correct errors or annotation mismatches of the projection. The primary comparison is against Oscar Tackstrom's work (which I have an affinity for, having been his examiner for his Ph.D. defense) and the claim in this paper is that they do better with less. They can do even better using dictionaries of the type that Oscar has. One thing that's surprising is that when they replace the maxent correction model with a CRF things get worse. This seems wacky, but given the authors, I'm pretty confident that this isn't because they messed something up, which might be my default reaction :P. One obvious question is whether you can do projection and correction jointly, and I suspect someone will try this at some point. It could be cool to do it jointly not just over task but over languages in a multitask learning framework.

  • Greed is Good if Randomized: New Inference for Dependency Parsing (Yuan Zhang, Tao Lei, Regina Barzilay, Tommi Jaakkola)
    Local search is back! Let's do dependency parsing by first generating a random tree, then doing bottom-up reassignment of parents to improve it. You can prove that if you do this bottom up, then in one pass you can transform any tree to any other and all intermediate steps will also be trees. This space has lots of local optima (one cool thing is that in this paper they can count the number of local optima in n-cubed time using a variant of Chu-Liu-Edmonds, at least for first order models). To get around this they do a few hundred random restarts. One thing I really liked about this paper is that they went above and beyond what I might expect to see in a paper like this, really nailing the question of local optima, etc. They also use much more complex reranking features, which are available because at any step they have a full tree. The results are really good. I'm a bit surprised they need so many random restarts, which makes me worry about generalizing these results to more complex problems, which I think is the most important thing cuz I'm not sure we need more algorithms for dependency parsing. One cool thing is they get a natural anytime algorithm: you can stop the restarts at any point and you have some tree. This could be coupled in interesting ways with scheduling to handle cases where you need to parse one million sentences in some fixed amount of time: how much time do you spend on each? I was a bit surprised that they don't specifically train a model to make search decisions: they just train the model using their decoder and standard update strategies. It seems like something like Wick et al.'s SampleRank is just screaming to be used here.

Okay, so that's my list for EMNLP with mini-reviews. As I said before, these "reviews" are based 90% on the talk and 10% on skimming the papers later looking for keywords and examples, which means that there are almost certainly tons of errors. If someone notices an error and points it out, I'll just directly edit the post to fix it, with numerous apologies to the authors. Acknowledging a small amount of bias, I also really like both of the student papers I was involved in at EMNLP. The first is about question answering and the result is that we now beat humans at lots of QuizBowl questions. The second is about trying to learn policies for simultaneous interpretation: basically when should we wait for more input. I put links below if these sound interesting to you.
Please comment with your own favorites!