15 November 2014

The myth of a strong baseline

I can probably count on my fingers the number of papers I've submitted for which a reviewer hasn't complained about a baseline in some way. I don't mean to imply that all of those complaints are invalid: many of them were 100% right on in ways that either I was lazy about or ways that I hadn't seen a priori.

In fact, I remember back in 2005 I visited MIT and gave a talk on what eventually became the BayeSum paper (incidentally, probably one of my favorite papers I've written, though according to friends not exactly the best written... drat). I was comparing in the talk against some baselines, but Regina Barzilay very rightfully asked me: do any of these baselines have access to the same information that my proposed approach does? At the time I gave this talk the answer was no. In the actual paper the answer is yes. I think Regina was really on to something here, and this one question asked in my talk has had a profound impact on how I think about evaluation since then. For that, I take a small time-out and say: Thank you, Regina.

Like all such influential comments, my interpretation of Regina's question has changed over time, and this post is essentially about my current thinking on this issue, and how it relates to this "does not compare against a strong enough baseline" reviewer critique that is basically a way to kill any paper.

If we're going to ask the question about whether an evaluation strategy is "good" or "bad" we have to ask ourselves why are we doing this evaluation thing in the first place. My answer always goes back to my prima facie question when I read/review papers: what did I learn from this paper? IMO, the goal of an evaluation should be to help isolate what I learned.

Let's go back to the BayeSum paper. There are two things I could have been trying to demonstrate in this paper. (A) I could have been trying to show that some new external source of knowledge is useful; (B) I could have been trying to show that some new technique is useful. In the case of BayeSum, the answer was more like (B), which is why Regina's comment was spot on. I was trying to claim the approach was good, but I hadn't disentangled the new approach from the data, and so you could have easily believed that the improvement over the baseline was due to the added source of data rather than the new technique.

In many cases it's not that cut and dry because a (B)-style new technique might enable the use of (A)-style new data in a way that wasn't possible before. In that case, the thing that an author might be trying to convince me of is that the combination is good. That's fine, but I still think it's worth disentangling these two sources of information as much as possible. This is often tough because in the current NLP atmosphere in which we're obsessed with shiny new techniques, it's not appealing to show that the new data gets you 90% of the gain and the new technique is only 10% on top of that. But this is another issue.

So evaluations are to help us learn something. Let's return now to the question of you didn't compare against a strong enough baseline. Aside from parroting what's been said many times in the past, what is the point of such a complaint, beyond Regina's challenge, which I hope I've made clear I agree with. The issue, as best I can understand it, is that it is based on the following logic:

  • Assumption: if your approach improves things against a strong baseline, then it will also improve against a weaker baseline, perhaps by more.
I'll note in passing that this is basically an assumption of submodularity of ideas.

And, like the title of this blog post suggest, I would like to put forth the idea that this assumption is often ridiculous, or at least that there's plentiful evidence against it.

I'm going to pick on machine translation now just to give a concrete example, but I don't think this phenomenon is limited to MT in any way. The basic story is I start with some MT system like Moses or cdec or whatever. I add some features to it and performance goes up. The claim is that if my baseline MT system wasn't already sufficiently strong, then any improvement I see from my proposed technique could just be solving a problem that's already been solved if I had just tuned Moses better.

There's truth to this claim, but there's also untruth. A famous recent example is the Devlin et al. neural network MT paper. Let me be clear: I think this paper is great and I 100% believe the results that they presented. I'm not attacking this paper in any way; I'm choosing it simply as a representative example. One of the results they show is some insane 8 bleu point gain over a very strong baseline. And I completely believe that this was a very strong baseline. And that the 8 bleu point improvement was real. And that everything is great.

Okay, so any paper that leads to an 8 bleu point gain over a very strong baseline is going to get reimplemented by lots of people, and this has happened. Has anyone else gotten an 8 bleu point gain? Not that I've heard. I've heard numbers along the lines of 1 to 2 bleu points, but it's very plausible I haven't heard the whole story.

So what's going on here?

The answer is simply that the assumption I wrote above is false. We've assumed that since they got 8 points on a strong baseline, I'll get at least as much on my baseline (which is likely weaker than theirs).

One problem is that "strong" isn't a total order. Different systems might get similar bleu scores, but this doesn't mean that they get them in the same way. Something like the neural networks stuff clearly solved a major problem in the BBN strong baseline system, but this major problem clearly wasn't as major of a problem in some other strong baseline systems.

Does this make the results in the Devlin paper any less impressive or important? No, of course not. I learned a lot from that paper. But one thing I didn't learn is "if you apply this approach to any system that's weaker than our strong baseline, you will get 8 bleu points." That's just not a claim that their results substantiate, and the only reason people seem to believe that this should be true is because of the faulty assumption above.

So does this mean that comparing to strong baselines is unimportant and everyone should go back to comparing their MT system against a word-for-word model 1 baseline?

Of course not. There are lots of ways to be better than such a baseline, and so "beating" it does not teach me anything. I always tell students not to get too pleased when they get state of the art performance on some standard task: someone else will beat them next year. If the only thing that I learn from their papers is that they win on task X, then next year there's nothing to learn from that paper. The paper has to teach me something else to have any sort of lasting effect: what is the generalizable knowledge.

The point is that an evaluation is not an end in itself. An evaluation is there to teach you something, or to substantiate something that I want to teach you. If I want to show you that X is important, then I should show you an experiment that isolates X to the best of my ability and demonstrates an improvement, preferably also with an error analysis that shows that what I claim my widget is doing is actually what it's doing.

01 November 2014

EMNLP 2014 paper list (with mini-reviews)

I'm going to try something new and daring this time. I will talk about papers I liked, but I will mention some things I think could be improved. I hope everyone finds this interesting and productive. As usual, I didn't see everything, didn't necessarily understand everything, and my errors are my fault. With that warning, here's my EMNLP 2014 list, sorted in anthology order.

  • Identifying Argumentative Discourse Structures in Persuasive Essays (Christian Stab, Iryna Gurevych)
    Full-on discourse parsing of rhetorical structure (e.g., RST) is really hard. In previous work, these authors created a corpus of essays annotated for (a) major claim, (b) claim, (c) premise and (d) none. (Prevalances of 5%, 23%, 55% and 17%, respectively.) This paper is about predicting that structure; in particular, both labeling spans (sentences?) as to their rhetorical class (of those four possibilities) and connecting them with binary support/non-support labels. Overall, they do quite well at both tasks: high to mid seventies in accuracy/F, respectively. They blame some of their errors on lack of coreference resolution. One question I had was: if you think of this annotation as a boiled down version of full rhetorical structure, what are you missing? Only 17% of sentences aren't one of the four classes, but if these fall high up in a rhetorical tree, I could imagine they would cut off a lot more of the structure. That said, I really liked that the authors found a tractable subproblem of full discourse parsing, and looking at it on its own.

  • Aligning context-based statistical models of language with brain activity during reading (Leila Wehbe, Ashish Vaswani, Kevin Knight, Tom Mitchell)
    When you read something, your brain lights up in certain ways. This paper is about finding predictive correlations between what you're reading (in particular, how a language model will score the incoming word) and how your brain will light up (via MEG imaging--this is the one with good time resolution but bad spacial resolution). The basic result is that the hidden layer in a recursive neural network language model does a pretty good job at the following prediction task: given an MEG reading, which of two words (the true one and a confounder) is the person reading. It does pretty well when readers are reading from Harry Potter: a big novelty here is that they use natural text rather than carefully laboratory controlled text. This is a cool step in the line of connecting NLP with neurolinguistics. The "hypothesis" in this paper is couched in the language of the integration theory of language processing (i.e., when you hear a word, your brain has to spend energy to integrate it into your prior analysis) rather than the (in my relatively uneducated opinion) preferred model of prediction theory (i.e., your brain has already predicted the word and the work it does is correcting for errors in its predictions). I think I can reinterpret their results to be consisted with the predictive theory, but I'd like to not have to do that work myself :P. I think the hardest part about this work is the fact that it uses natural, rather than laboratory, text. For all previous experiments I know, it's been very important to control for as much as you can about the context (history) and word to be predicted. In fact, most work only varies the history and never varies the word. This is done because any two words vary in so many ways (frequency, spelling, phonology, length, etc.) that you do not want to have to correct for those. For instance, you can ask if a bilingual speaker represents "dog" and "chien" in the same way, but this is a silly question because of course they don't: at the very least, they sound different and are spelled differently. Anyway, back to this paper, in the experiment they control for word length on the pairwise task, but I think controlling for word probability (under the surprise/predictive hypothesis) would be stronger. Though really as neat as it is to use natural text, I think it hurts more than it helps here.

  • Learning Abstract Concept Embeddings from Multi-Modal Data: Since You Probably Can't See What I Mean (Felix Hill, Anna Korhonen)
    This is the ultimate "duh" moment paper for me: basically we're all excited about embedding words and images into the same space, but lots of words aren't concrete and don't have visual counterparts (e.g., "flower" versus "encouragement"). They argue that concrete nouns are the rare category (72% of nouns and verbs are at least as abstract as "war"). The model they have for this is relatively simple: ignore abstract words. (Ok it's a bit more than that, but that's the general idea.) When you do this correctly, you get a marked improvement in performance. One thing I kept wondering during the talk, especially since Felix kept using the word "concrete" to refer to something that's not actually a mix of water, aggregate and cement, was the whole issue of metaphor and really word sense. Unfortunately neither of these words seem to appear in the paper, so it's hard to say what's going on. I think it could be really cool to combine these ideas with those of understanding metaphor or at least sense, though of course those are really really hard problems!

  • NaturalLI: Natural Logic Inference for Common Sense Reasoning (Gabor Angeli, Christopher D. Manning)
    I haven't followed the natural logic stuff for a while, so my assessment of this paper is pretty uninformed, but this was perhaps my favorite paper at the conference: I'll be presenting it in reading group on Monday, and might post more notes after that. Basically the idea is that you have a knowledge base that contains sentences like "The cat ate a mouse" and you have a query like "No carnivores eat animals". Instead of trying to map these sentences to logical form, we're going to treat the text itself as a logical form and do sentence rewriting to try to prove one from the other. An important insight is that quantification changes the direction of entailment that you can move, for instance if you say "All cats ..." then you can move down WordNet to "All housecats ..." but not up to "All mammals ..."; similarly if you say "No cats ..." then you can only move up the hierarchy. They set up a model like this, cast it as a big search problem, and get remarkably good scores (89% precision, 94% recall) on the FraCaS textual entailment task, which is competitive with the competition (though on a subset of the data, so not really comparable). Like I said, I really liked this paper; I could easily imagine trying to combine these ideas with stuff in the paraphrase database, especially once they have entailment directions annotated/predicted there.

  • A Fast and Accurate Dependency Parser using Neural Networks (Danqi Chen, Christopher Manning)
    The very brief summary of this paper is: take MaltParser and replace the SVM with a neural network and you do well. Slightly more detailed is that when MaltParser makes decisions, it looks at a few points on the stack and in the buffer. For each of these points, take the word and POS, map them to embeddings. You now have a half dozen embeddings (yes, we're embedding POS tags too). Also embed the dependency relation. Throw this through a network and make a prediction. The result 92% accuracy (basically tied with MST Parser, better than MaltParser by 2%) and about twice as fast as MaltParser. They also compare to their own implementation of MaltParser. Perhaps I missed it in the talk, but I think they only talked about their implementation there (they are 20 times faster) and not about real MaltParser. In order to get this speed, they do a caching trick that remembers vector-matrix products for common words. This is very cool and it's neat to see the embeddings of the POS and dependency relations: reminds me of Matsuzaki-style annotated labels. The one weakness here, I think, is that it's kind of unfair to compare a version of their own parser that with significant engineering tricks (the caching made it 8-10 times faster, so without it, it's 5 times slower than MaltParser) to an un-tricked-out MaltParser: in particular, if you really care about speed, you're going to do feature hashing in MaltParser and never compute all the strings that make it so slow. If this were done in MaltParser, then the neural network version, even if everything were cached, would be doing strictly more computation. Anyway, all that says is that I'm more impressed with the accuracy results than the speed results: and they appear to be doing quite well here, at the very least without sacrificing speed. (And as Yoav Goldberg reminds me, parsing on GPU with this architecture could be a big win!)

  • Unsupervised Sentence Enhancement for Automatic Summarization (Jackie Chi Kit Cheung, Gerald Penn)
    We all know how to do sentence fusion: take two sentences and stitch them together. The idea in this paper is to also enhance those sentences with little fragments taken from elsewhere. For instance, maybe you have a great sentence fusion but would like to add a PP taken from a related document: how can we do this? The model is basically one of extracting dependency triples, gluing them together and linearizing them. The biggest problem they face here is event coreference: given two sentences that contain the same (up to lemma) verb, is it the same event or not? They address event coref by looking at the participants of the predicate, though acknowledge more complex approaches might help further. Without the event coreference their approach hurts (over simple fusion) but with event coreference it helps. I would really have liked to have seen human judgments for both quality and grammaticality here: these things are notoriously hard to measure automatically and I'm not sure that we can really ascertain much without them. (Though yes, as the authors point out, human scores are notoriously hard to interpret too.)

  • What Can We Get From 1000 Tokens? A Case Study of Multilingual POS Tagging For Resource-Poor Languages (Long Duong, Trevor Cohn, Karin Verspoor, Steven Bird, Paul Cook)
    This paper is about a combination of projection methods (via parallel data) and small labeling (in the target language) to correct errors or annotation mismatches of the projection. The primary comparison is against Oscar Tackstrom's work (which I have an affinity for, having been his examiner for his Ph.D. defense) and the claim in this paper is that they do better with less. They can do even better using dictionaries of the type that Oscar has. One thing that's surprising is that when they replace the maxent correction model with a CRF things get worse. This seems wacky, but given the authors, I'm pretty confident that this isn't because they messed something up, which might be my default reaction :P. One obvious question is whether you can do projection and correction jointly, and I suspect someone will try this at some point. It could be cool to do it jointly not just over task but over languages in a multitask learning framework.

  • Greed is Good if Randomized: New Inference for Dependency Parsing (Yuan Zhang, Tao Lei, Regina Barzilay, Tommi Jaakkola)
    Local search is back! Let's do dependency parsing by first generating a random tree, then doing bottom-up reassignment of parents to improve it. You can prove that if you do this bottom up, then in one pass you can transform any tree to any other and all intermediate steps will also be trees. This space has lots of local optima (one cool thing is that in this paper they can count the number of local optima in n-cubed time using a variant of Chu-Liu-Edmonds, at least for first order models). To get around this they do a few hundred random restarts. One thing I really liked about this paper is that they went above and beyond what I might expect to see in a paper like this, really nailing the question of local optima, etc. They also use much more complex reranking features, which are available because at any step they have a full tree. The results are really good. I'm a bit surprised they need so many random restarts, which makes me worry about generalizing these results to more complex problems, which I think is the most important thing cuz I'm not sure we need more algorithms for dependency parsing. One cool thing is they get a natural anytime algorithm: you can stop the restarts at any point and you have some tree. This could be coupled in interesting ways with scheduling to handle cases where you need to parse one million sentences in some fixed amount of time: how much time do you spend on each? I was a bit surprised that they don't specifically train a model to make search decisions: they just train the model using their decoder and standard update strategies. It seems like something like Wick et al.'s SampleRank is just screaming to be used here.

Okay, so that's my list for EMNLP with mini-reviews. As I said before, these "reviews" are based 90% on the talk and 10% on skimming the papers later looking for keywords and examples, which means that there are almost certainly tons of errors. If someone notices an error and points it out, I'll just directly edit the post to fix it, with numerous apologies to the authors. Acknowledging a small amount of bias, I also really like both of the student papers I was involved in at EMNLP. The first is about question answering and the result is that we now beat humans at lots of QuizBowl questions. The second is about trying to learn policies for simultaneous interpretation: basically when should we wait for more input. I put links below if these sound interesting to you.
Please comment with your own favorites!

10 October 2014

Hyperparameter search, Bayesian optimization and related topics

In terms of (importance divided-by glamour), hyperparameter (HP) search is probably pretty close to the top. We all hate finding hyperparameters. Default settings are usually good, but you're always left wondering: could I have done better? I like averaged perceptron for this reason (I believe Yoav Goldberg has also expressed this sentiment): no pesky hyperparameters.
But I want to take a much broader perspective on hyperparameters. We typically think of HPs as { regularization constant, learning rate, architecture } (where "architecture" can mean something like neural network structure, choice of kernel, etc. But I think it's a lot broader and can include things like feature engineering, or at least representation modifications. For instance, vw now supports a number of helpful NLP feature templates: suffix and prefix features (via --affix), spelling features (via --spelling), ngram features (--ngrams), quadratic and cubic features, etc. Picking the right incarnation of these thing is essentially a hyperparameter search process, very akin (IMO) to things like representation learning.

Once you're willing to accept all these things as HPs (and I think you should), something like "grid search", which works for tuning C and eta in your SVM, just doesn't seem to cut it anymore.

Enter the world of automatic HP tuning. Lots of this work, not surprisingly, comes out of the neural networks community, because HP search is a big deal there. Most of my information here comes via Hugo Larochelle, who has done a lot of great work. Some places to start looking:

  • Spearmint toolkit by Snoek, Larochelle, Swersky, Zemel and Adams (and a JMLR paper)
  • SMAC by Hutter, Hoos, Leyton-Brown, Murphy and Ramage (and a paper
Most of this work falls under the framework of "Bayesian Optimization." The idea comes from the space of derivative-free optimization, where a common strategy is to fit a response surface. Basically you have a bunch of hyperparameters to tune. For any setting of hyperparameters, you can observe some response. In ML land this is usually something like held-out accuracy. Now, fit a regression function that can map from hyperparameters to response. Do something that looks like active learning to explore this space, with a bias toward finding places with high response (high accuracy). In these examples, the function being fit is a Gaussian Process, which is super useful because it can provide realistic estimates of variance, which are useful in the active learning/experimental design.

I first learned about this stuff, not in the context of HP optimization, from Ilya Ryzhov, a faculty member in our business school who works on these topics -- I learned that in his world, exploration/exploitation is also called "learning versus earning" which I think it awesome :).

I would love if hyperparameter optimization were a black box, and Spearmint and SMAC are both great steps in this direction. There are a couple of things that I haven't seen (though like I said, I'm not hugely well read in this space) that I would love to see.
  • Learning across multiple datasets, akin to meta-learning. I imagine that if you give me a problem to do HP optimization on, and also give the same problem to an undergraduate, I would be much better. Presumably I've learned from past experience how to do this well.
  • More importantly, taking advantage of some of the structure of learning algorithms (rather than oblivious black box optimization) seems like it could be a big win. For instance, most learning algorithms have some notion of early stopping (# of passes over the data, tolerance for convergence, etc.). You can also of course run on subsets of the data (which is equivalent in many cases). If you assume heldout accuracy doesn't get worse over time (e.g., because you do early stopping) then you can think of this as a type of right-censoring. I.e., you run an experiment part of the way through and you know "ok if I kept running it might get better, but I know it's at least 85% accurate now." The SMAC folks had a nice paper on Bayesian optimization with censored data, but my (incomplete) understanding is that it doesn't quite capture this (very common, IMO) case. I should be willing to start and stop processes at various points and try to figure out where to invest more computation to get better estimates. I can presumably even estimate: if I ran another pass, how much better do I think it would get?
  • I also think the focus on "finding the best hyperparameters" is somewhat the wrong problem. We want to find the best parameters, period. Hyperparameters are a nuisance on their way to that end. For instance, related to the above bullet, I could imagine running a few passes with one setting of hyperparameters, and then doing some other work, and then going back and restarting that previous run with a different setting of hyperparameters (assuming the model being learned is such that "warm starting" makes sense, which is almost always the case--except maybe in some neural network settings).
     
  • Parallelization is a big deal. One of the reasons something akin to grid search is so attractive is that it's trivial to submit 20*20*20 jobs to my cluster and just wait for them to finish. Anything that's less friendly than doing this is not worth it. Again, the SMAC folks have worked on this. But I don't think the problem is solved.
Beyond these technical issues there's always the obnoxious issue of trust. Somehow I need to believe that I'm not better than these algorithms at tuning hyperparameters. I should be happy to just run them, preferably saying "okay, here are 120 cores, you have four hours -- go to town." And I should believe that it's better than I could do with equivalent time/resources by clever grid search. Or perhaps I should be able to encode my strategies in some way so that it can prove to me that it's better than me.

Overall, I'd love to see more work on this problem, especially work that doesn't focus on neural networks, but still takes advantage of the properties of machine learning algorithms that are not shared by all black-box derivative-free optimization tasks. In the mean time, from what I hear, SMAC and Spearmint are actually quite good. Would love to hear if any NLP people have played around with them!

03 October 2014

Machine learning is the new algorithms

When I was an undergrad, probably my favorite CS class I took was algorithms. I liked it (a) because my background was math so it was the closest match to what I knew and (b) because even though it was "theory," a lot of the stuff we learned was really relevant. Over time, it seemed like the area had distilled worthwhile algorithms from interesting-in-theory-but-you'll-never-actually use algorithms.

In fact, I think this is a large part of why most undergraduate CS degrees today require a course in algorithms. You have these very nice, clearly defined statements, and very elegant solutions to those statements that in most cases (at the UG level) are known to be optimal.

Fast forward N years.

My claim today---and I'm speaking really as an NLP person, which is how I self-identify---is that machine learning is the new core. Everything that algorithms was to computer science 15 years ago, machine learning is today. That's not to say it won't move in another 10 years, but that's how I see it.

Why?

For the most part, algorithms (especially as taught at th UG level) is the study of one thing: Given a perfect input, how do I most efficiently compute the optimal output.

The problem is the "perfect input" part.

All of my experience in the past N years has told me that you never have a perfect input, and that it's far far far more important to be able to synthesize information from a large number of sources and reason about it than it is to find the exact-right-solution to some problem that exists only to Plato.

Even within machine learning you see this effect. Lots of numerical analysis people have worked on good algorithms for getting that last little bit of precision out of optimization algorithms. Does it matter? Nope! Model specification, parameter tuning, features, and data matter infinitely more than that last little bit of precision. (In some fields, for instance, scientific computing, that last little bit of precision may matter. I don't know enough to know one way or the other.)

Let's play a thought game. Say you're an UG CS major. You graduate and get a job in CS (not grad school). Which are you more likely to use: (1) a weighted cost flow algorithm or (2) a perceptron/decision tree?

Clearly I think the answer is (2). And I loved flow algorithms when I was an undergrad and have actually spent since 2006 trying to figure out how I can use them for a problem I want to solve. No dice.

I would actually go further. Suppose you have a problem whose inputs are ill-specified (as they always are when dealing with data), and whose structure actually does look like a flow problem. There are two CS students trying to solve this problem. Akiko knows about machine learning but not flows; Bob knows about flows but not machine learning. Bob tries to massage his data by hand into the input to an optimal flow algorithm, and then solves it exactly. Akiko uses machine learning to get good edge weights and hacks together some greedy algorithm for flows, not even knowing it's called a flow. Who's solution works better? I would put almost any amount of money on Akiko.

Full disclosure: those who know about my research in structured prediction will recognize this as a recurring theme in my own research agenda: fancy algorithms always lose to better models.

There's another big difference between N years ago and today: almost every algorithm you could possibly care about (or learn about as an UG) is implemented in a library for any reasonable programming language. That's not to say that it's unimportant to know how things work in order to use them, but I would argue it's much less important in a field like algorithms whose knowledge is comparatively stable, versus a field like machine learning where things are still changing and there is no "one right answer" to the "machine learning problem." In a field that's still a bit of an art rather than a science, understanding how things work under the hood feels a lot more important. Quicksort, heaps, minimum spanning trees, ... these are all here to stay.
Okay, so now I've convinced myself that we should yank algorithms out as an UG requirement and replace it with machine learning.

But wait, I can hear my colleagues yelling, taking algorithms isn't about learning algorithms: it's about learning how to think! But that's also what I think is great about machine learning: the distance between theory and algorithms is actually usually quite small (I try to get this across at various points in CiML, to varying degrees of success). If the only point of an algorithms class (I've heard exactly this argument made about automata theory, for instance) is to teach students how to think, I think we could do much better.

Okay, so I've thrown down the gauntlet. Someone should come smack me with theirs :P!

Edit after some comments:

I think I probably wrote badly and as a result my main point got lost. I'll try to restate it here briefly and then I'll edit the main post.

Main point: I feel like for 15 years, algorithms has been at the heart of most of what computer science does. I feel like that coveted position has now changed to machine learning or, more generically, statistical reasoning. I feel this way because figuring out how to map a real world problem into something an "algorithm" can consume, especially when that needs statistical modeling of various inputs, is (IMO) a more important and harder problem than details about flow algorithms. 



EDIT #2 AFTER DISCUSSION ON SURESH'S BLOG:

let me give a concrete example that may actually be a real world example, but i don't know (though see this paper). that of path finding for taxis or cars. the world is a graph and given directed edge costs we can run dijkstra or whatever to find LEAST-TIME (shortest) paths. this is basically google maps/etc.

of course, we never know the true time to travel some segment. we might know it now, but by the time the driver gets to some road (5 or 10 minutes from now) the conditions may have changed. and of course we have historical data on traffic from which we can predict what the condition of the road will be like in 10 minutes.

so here, "foo" is a function that takes the time of data, historical traffic data, weather and whathaveyou, and maps it to edge costs.

"bar" is dijkstra's algorithm or whatever shortest path algorithm you like.

my claim is that if you really want to solve this problem, it's much more important to understand how to create foo than how to create bar. in particular, if i gave you a greedy or near greedy approach to bar, combined with a really good foo, i bet this would be significantly better than an optimal bar and a crappy foo.

27 September 2014

AMR: Not semantics, but close (? maybe ???)

Okay, necessary warning. I'm not a semanticist. I'm not even a linguist. Last time I took semantics was twelve years ago (sigh.)

Like a lot of people, I've been excited about AMR (the "Abstract Meaning Representation") recently. It's hard not to get excited. Semantics is all the rage. And there are those crazy people out there who think you can cram meaning of a sentence into a !#$* vector [1], so the part of me that likes Language likes anything that has interesting structure and calls itself "Meaning." I effluviated about AMR in the context of the (awesome) SemEval panel.

There is an LREC paper this year whose title is where I stole the title of this post from: Not an Interlingua, But Close: A Comparison of English AMRs to Chinese and Czech by Xue, Bojar, Hajič, Palmer, Urešová and Zhang. It's a great introduction to AMR and you should read it (at least skim).

What I guess I'm interested in discussing is not the question of whether AMR is a good interlingua but whether it's a semantic representation. Note that it doesn't claim this: it's not called ASR. But as semantics is the study of the relationship between signifiers and denotation, [Edit: it's a reasonable place to look; see Emily Bender's comment.] it's probably the closest we have.


We've spent some time looking at the data (dummy), to try to understand what is actually there. What surprised me was how un-semantics-y AMR tends to be. The conclusion I reached is that it might be much closer to a sort of D-structure (admittedly with some word sense disambiguation) than a semantics (sorry, I grew up during GB days and haven't caught on to the whole minimalism thing). And actually it's kind of dubious even as a D-structure....

Why do I say that? A handful of reasons; I'll give an example of some of them. All of these examples are from the 1274 training AMRs from The Little Prince.

Gapped agents in matrix clauses

Example: "... insisted the little prince , who wanted to help him ."

(i / insist-01
  :ARG0 (p / prince
          :mod (l / little)
          :ARG0-of (w / want-01
                     :ARG1 (h / help-01
                             :ARG1 (h2 / he))))
  :ARG1 (...))

Why does this surprise me? I would strongly have expected "p" (the Little Prince) to be the ARG0 of help. But in this representation, help doesn't have any ARG0.

You could make some argument that you could write a tool to transform such matrix clauses and re-insert the ARG0 of the matrix verb as the ARG0 of the subordinate verb, but this doesn't work in general. For instance, "I always want to rest" in the dataset also doesn't have "I" as an argument of "rest." Unfortunately, the rest-er in the propbank/verbnet definition of rest is (correctly) its ARG1/theme. So this deterministic mapping doesn't work -- if you did this the interpretation would be that "I always want to rest" means "I always want to cause-something-unknown to rest" which is clearly different.

Perhaps this is an annotation error? I found the same "error" in "And I knew that I could not bear the thought of never hearing that laughter any more" (the ARG0 of "hear" is missing) but there are a few cases where the subordinate clause does get an agent; namely:
  • She did not wish to go out into the world all rumpled...
    ("go" correctly gets "she" as the ARG0)
  • ...that she wished to appear...
    ("appear" gets "she" as the ARG0)

So I'm not sure what's going on here. At least it's inconsistent...

Noun-noun compounds

In a semantic representation, I would expect the interpretation of noun noun compounds to be disambiguated.

For instance, the string "only one ring of petals" is annotated as:

        (r / ring :quant 1
                  :mod (o / only)
                  :consist-of (p / petal)))

But the string "glass globe" is simply:

        (g / globe
                  :mod (g2 / glass)))

In other words, the "ring of petals" is a ring that consists of petals. But the "glass globe" is just a "globe" modified by "glass." We have no idea what sort of modification this is, though one could argue that it's the consist-of relation that was used for ring of petals.

As made poinant by Ewan Dunbar, disambiguating Noun-Noun compounds is important when translating into many other languages:



Possession

There are many types of possession, many of which can be turned into predicates:
  • Erin's student => the student Erin advises
  • Julio's paper => the paper Julio wrote
  • Smita's football team => the team Smita plays on; the team Smita owns; the team Smita roots for
  • Kenji's speech => the manner in which Kenji talks; the speech Kenji gave (these last are actually hard to predicatize given my inventory of English verbs)
(Thanks to Ani Nenkova for most of these examples.)

In AMR, it appears the rule is that apostrophe-s ('s) turns into :poss. You (appear to) get (student :poss Erin) and (paper :poss Julio) and (team :poss Smita) and (speech :poss Kenji).

This is a bit strange because the Norman genitive alternation ("of") of possession in English (as opposed to the Saxon genitive "'s") does not turn into :poss. For instance, "other side of the planet" becomes:

                     (s2 / side
                              :mod (o / other)
                              :part-of (p2 / planet))

Here, the "of" has been disambiguated into :part-of; in contrast, with "air of authority", we get:

         (a / air
            :domain (a2 / authority))

where the "of" has turned into ":domain". In fact, I cannot find any cases where there is a :poss and no "'s" (or his/her/etc...).

Now, you could argue that "planet's side" and "authority's air" sound at best poetic and at worst wrong. (Though I find "planet's other side" pretty acceptable.) But this is basically a property of English. But these are totally fine in Japanese with /no as the possessive marker (according first to my prior and then confirmed by my Japanese informant Alvin -- thanks Alvin :P). I'm guessing they're okay in Chinese too (with /de as the possessive marker), but I'm not positive.

Moreover, possession is more complicated that than in lots of languages that distinguish between alienable and inalienable possession. A classic example being "my aunt" versus "my car." My mother is inalienable because try as I might, I cannot sell, buy, exchange, etc., my aunt. My car is alienable because I can do these things. In lots of languages, (about half, according to WALS), there is more than one class of possession, and the two class (in)alienable distinction is the most common.

As an example (taken from that WALS link, originally due to Langdon (1970)): In Mesa Grande Diegueño (Yuman; California), inalienable nouns like mother (ətalʸ) take a simple prefix ʔ- ('my'), while house (ewa) takes the compound prefix ʔə-nʸ- as in:

    a. ʔ-ətalʸ
       1sg-mother
       ‘my mother’ 

    b. ʔə-nʸ-ewaː
       1sg-alienable-house
       ‘my house’ 

So you could argue that in this case it's purely a property of the possessed noun, and so even in Mesa Grande Digueño, you could say :poss and then disambiguate by the semantics of the possessee.


Wrap Up

I could continue to go on, but perhaps I've made my point. Which is not that AMR sucks or is uninteresting or anything like that. It's just that even if we can parse English into AMR, there's a long way to go before we can start doing semantic reasoning from it. And maybe along the way you learned something. I know I did.

I think what really drove it home for me that AMR is not so much of a semantic representation is the ease with which I could imagine writing a rule-based generator for AMR to English. Yes, the sentences would come out stodgy and kind of wrong, but given an AMR and doing an in-order traversal of the tree, I'm pretty sure I could generate some fairly reasonable sentences (famous last words, right?). I believe this is true even if you took the AMR, re-ified it into a Hobbs-esque flat form first. The first step would be un-reification, which would basically amount to choosing a root, and then going from there. Like all NLP papers written in the 80s, perhaps the devil is in the details, in which case I'll be happy to be proved wrong.

One interesting final point for me is that as established in the paper I stole this title from, AMR is actually a pretty reasonable interlingua. But it's a pretty crappy semantic representation (IMO). This sort of breaks the typical MT Vauquois triangle triangle. That alone is kind of interesting :).




[1] Due to Ray Mooney, saw on Twitter, but now can't find the picture of the slide any more.

31 July 2014

Reading group notes: point/counter-point on "predict models"

In our local summer reading group, I led the discussion of two papers that appeared in Baltimore last month:

I love handouts, so I made a handout for this one too. I paste below the handout. All good ideas are those of the respective authors; all errors and bad ideas are probably due to bad transcription on my  part.


Don't count, predict! A systematic comparison of context-counting
vs. context-predicting semantic vectors

Marco Baroni & Georgiana Dinu & Germ\'an Kruszewski

Motivation: these silly deep learning people keep writing papers but
don't compare to traditional distributional semantics models. So we
will.

Conclusion: okay, those people are actually right.

== Background ==

Distributional semantics = you know a word by the company it keeps

"Count models":
 * For each word type, collect context vectors
 * Context vectors look at n words on the left and right
     (with position info? together or separately?)
     varied in 2..5
 * Each type is represented by the bag of contexts in which it appears
 * Contexts are scored by PMI or LLR
     (gets rid of useful but frequent contexts)
 * We might reduce dimensionality to k in { 200, ..., 500 }
    - using either SVD or NNMF

"Predict models" (aka deep learning):
 * Assume a mapping from word type -> k-dim vector
 * Learn a model to predict any word token given the vectors
     of the n words to its left and right
     varied in {2,5}
 * Words are thrown out proportional to their frequency
    - makes things faster
    - reduces importance of frequent words like IDF
 * Vary k in { 200, ..., 500 }

CW (Collobert & Weston) models:
 * freely available online
 * 100 dimensional vectors trained for 2 months on wikipedia
 * predict a word with 5 words to the left and right
 * used extensively in other literature

== Tasks ==
 * Synonym detection from TOEFL
   - given "levied" choose from { imposed, believed, requested, correlated }
   - compute cosine of word representations
 * Concept categorization
   - cluster words like { helicopters, motorcyles} into one class
     and { dogs, elephants } into another
   - used off-the-shelf clustering algorithms
 * Selectional preferences
    - given a verb/noun pair say whether the verb selects for that noun
    - eg, "eat apples" versus "eat gravity"
    - for each verb, take 20 most strongly associated nouns, average
        their representations, and measure cosine similarity to that
 * Analogy
    - eg, brother:sister :: grandson:(?granddaughter)
    - find nearest neighbor of (brother - sister + grandson)

== Summary of results ==
                                     pred    tie   count    (win: >=5)
                                     wins           wins
 * tune parameters PER TASK:          10      4      0
 * tune parameters OVERALL:           10      3      1
 * worst parameters                   14      0      0
 * best from relatedness:             11      3      0

 - best count   model: window=2, PMI, no compression, 300k dimensions
 - best predict model: window=5, no hier softmax, neg sampling, 400 dim







Linguistic Regularities in Sparse and Explicit Word Representations
Omer Levy & Yoav Goldberg

Motivation: neural representations capture analogical reasoning well;
why does this happen?

Conclusion: we can just as well using tranditional distributed word
representations if we measure similarity in a "multiplicative" way

== Background ==

* neural language models produce representations that can answer analogy questions:
  - gender...   man:woman :: king:queen
  - speakers... france:french :: mexico:spanish
  - number...   apple:apples :: car:cars
* question: how much of this is a property of *embeddings* (ie dense, low-dimensional)
  - alternative is distributional similarity == bag of contexts (ala Baroni paper)

== Experiment 1 ==

* Mikolov (word2vec) computes similarity for solving a:b :: a*:b* by finding:
    arg max_{b*}  similarity(b*, a* - a + b)       (called 3CosAdd)
   aka similarity(queen, king - man + woman)
* they use cosine similarity: cos(u,v) = dot(u,v) / [ ||u||  ||v|| ]
* expanding out, you get:
    arg max_{b*}  cos(b*,b) + cos(b*,a*) - cos(b*,a)
  aka
    cos(queen,woman) + cos(queen,king) - cos(queen,man)

Results: on MSR & Google datasets, embeggings >> explicit (predict >> count)
         on SemEval, basically tied (closed vocabulary?)

* an alternative is:
    arg max_{b*}  similarity(b*-b, a*-a)           (called PairDirection)
  aka similarity(queen-woman, king-man)

Results: Much much worse on MSR, Google (open vocab), and better (though tied) on
SemEval. Perhaps because scale matters for open vocabulary? could have been tested
explicitly... (drat!)


== Experiment 2 ==

Looking at the expansion of 3CosAdd, it looks like a "noisy or" sort of operation:
 - b* should be close to b, should be close to a*, should be far from a...

What about using something more like noisy and:

            cos(b*,b)  cos(b*,a*)
  arg max  -----------------------
      b*       cos(b*,a) + eps

aka

            cos(queen,king)  cos(queen,woman)
  arg max  -----------------------------------
      b*          cos(queen,man) + eps

Results:


                      MSR       Google
  3CosAdd  Predict    54%       63%
           Count      29%       45%
  3CosMul  Predict    59%       67%
           Count      57%       68%

One against the other:

  Both correct    -- 54% of cases
  Both wrong      -- 24%
  Predict correct -- 11.1%
  Count correct   -- 11.6%

27 July 2014

Hello, World!

Okay, usually Hello World is the first program you learn to write in a new programming language. For fun, I've been collecting how to say hello world in different human languages, something remarkably difficult to search for (because of the overloading of the word "language"). I have 28. I'd like to make it to 280 :). If you have one (or more) to contribute, email me, post a comment, or tweet to me @haldaume3. And of course if you think any of these is wrong, please let me know that too.

     1 bar Servus Woid!
     2 ca  Hola Món!
     3 de  Hallo Welt!
     4 en  Hello World!
     5 eo  Saluton, Mondo!
     6 es  ¡Hola Mundo!
     7 eu  Kaixo, mundua!
     8 fi  Hei maailma!
     9 hu  Helló, világ!
    10 ia  Hallo, mundo!
    11 id  Halo dunia!
    12 ja  こんにちは世界
    13 lv  Sveika, pasaule!
    14 min Helo dunia!
    15 mk  Здраво свету!
    16 ms  Helo dunia!
    17 nn  Hallo verda!
    18 no  Hallo, verden!
    19 pt  Olá Mundo!
    20 sh  Zdravo svete!
    21 sl  Pozdravljen svet!
    22 sq  Njatjeta Botë!
    23 sr  Здраво свете!
    24 sv  Hej Världen!
    25 th  เฮลโลเวิลด์
    26 tr  Merhaba dünya!
    27 vi  Xin chào thế giới!
    28 zh  世界,你好!

05 July 2014

My ACL 2014 picks...

Usual caveats: didn't see all papers, blah blah blah. Also look for #acl14nlp on twitter -- lots of papers were mentioned there too!

  • A Tabular Method for Dynamic Oracles in Transition-Based Parsing; Yoav Goldberg, Francesco Sartorio, Giorgio Satta.
    Jaokim Nivre, Ryan McDonald and I tried searnifying MaltParser back in 2007 and never got it to work. Perhaps this is because we didn't have dynamic oracles and we thought that a silly approximate oracle would be good enough. Guess not. Yoav, Francesco and Giorgio have a nice technique for efficiently computing the best possible-to-achieve dependency parse given some prefix, possibly incorrect, parse.
  • Joint Incremental Disfluency Detection and Dependency Parsing; Matthew Honnibal, Mark Johnson
    The basic idea is to do shift-reduce dependency parsing, but allow for "rewinds" in the case of (predicted) disfluencies. I like that they didn't just go with the most obvious model and actually thought about how might be a good way to solve this problem. Basic idea is if you get "Please book a flight to Boston uh to Denver..." is that you parse "to Boston" like usual but then when you get to the "uh", you remove old arcs. You do it this way because detecting the disfluent segment ("to Boston") is much easier when you hit "uh" than when you hit "to Boston."
  • Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors; Marco Baroni; Georgiana Dinu; Germán Kruszewski
    This paper is summarized best by its own statement, which should win it the award for most honest paper ever: "...we set out to conduct this study because we were annoyed by the triumphalist overtones often surrounding [neural network embeddings], despite the almost complete lack of a proper comparison.... Our secret wish was to discover that it is all hype... Instead, we found that the [embeddings] are so good that, while the triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture."
  • Learning to Automatically Solve Algebra Word Problems ; Nate Kushman; Luke Zettlemoyer; Regina Barzilay; Yoav Artzi
    An algebra word problem is something like "I have twice as many dimes as nickles and have $2.53. How many nickles do I have." Of course usually they actually have an answer. They have a nice, fairly linguistically unstructured (i.e., no CCG) for mapping word problems to algebraic formulae and then solving those formulae. Code/data available.
  • Grounded Compositional Semantics for Finding and Describing Images with Sentences; Richard Socher, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng
    This is the follow-on work from Richard's NIPS workshop paper on text <-> images from this past NIPS. They fixed the main bug in that paper (the use of l2 error, which gives a trivial and uninteresting global optimal solution) and get nice results. If you're in the langvis space, worth a read, even if you don't like neural networks :).
  • From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions; Peter Young, Alice Lai, Micah Hodosh, Julia Hockenmaier
    I really like the "visual denotations" idea here. Basically you say something like "the set of worlds in which this sentence is true is the set of images in which this sentence is true (i.e., roughly the sentence is entailed by the image)." You can then measure similarity between sentences based on denotations.
  • Kneser-Ney Smoothing on Expected Counts; Hui Zhang; David Chiang
    I didn't actually see this talk or read the paper, but lots of people told me in hallways that this is a very nice result. Basically we like KN smoothing, but it only works for integral counts, which means it's hard to incorporate into something like EM, which produces fractional counts. This paper solves this problem.
  • Linguistic Structured Sparsity in Text Categorization; Dani Yogatama; Noah A. Smith
    Also didn't see this one, but skimmed the paper. The reason I really like this paper is because they took a well known technique in ML land (structured sparsity) and applied it to NLP, but in an interesting way. I.e., it wasn't just "apply X to Y" but rather find a very linguistically clever/interesting way of mapping X to a problem that we care about. Very cool work.
Overall I really liked the conference, thanks to everyone who helped put it together. I can't help but notice that about half of my picks above were actually TACL papers. I suspect this will be more and more true over time.

Please add comments with your favorite papers that I missed!

30 June 2014

Divergences passed through Bayes' rule

In a previous post's comments, we talked about Bayes rule and things like that. This got me wondering about the following question:

If we know p(A) and p(B|A), we can reconstruct p(A|B) perfectly by Bayes' rule. What if we only have estimates of p(A) and p(B|A)? How does the quality of the reconstruction of p(A|B) vary as a function of the quality of the estimates of the marginal and conditional?
I feel like there have to be results along these lines, but I was unable to find them. My next attempt was to prove something, which failed miserably after a few hours.  So, as a good empiricist and lazy(/bad) theorist, I designed a simple experiment.

Let A and B be binary variables. Let's generate a random joint distribution p(A,B), which has four cells for the four possible combinations of values of A and B. From this, we can directly compute the true marginal p(A) and the true conditionals p(B|A) and p(A|B).

Now, let's pick some "estimate" q(A) and q(B|A). You can think of these as a "noisy" version of p(A) and p(B|A). Given q(A) and q(B|A), we can compute an estimate a reconstructed joint distribution q(A,B) = q(A)q(B|A), as well as a reconstructed conditional distribution q(A|B) = q(A)q(B|A) / Z(q), where Z(q) is computed according to q. We can then compare q(A,B) to the true p(A,B) and q(A|B) to the true p(A|B) and measure how far they are.

At this point we have to decide what our measurement (divergence) function is. I tried three: variational distance (max absolute difference), l1 distance (sum absolute difference) and KL divergence. To be absolutely pedantic, I will define the versions of these that I used. First, the KL variants:
KL( p(A) || q(A) ) = sum_a p(a) log [ p(a) / q(a) ]
KL( p(A,B) || q(A,B) ) = sum_{a,b} p(a,b) log [ p(a,b) / q(a,b) ]
KL( p(A|B) || q(A|B) ) = sum_b p(b) KL( p(A|B=b) || q(A|B=b) )
Note that the direction is q from p (chosen because p is the "true" distribution) and that this also has the advantage that the conditional KL is based on p(B), which (in this case) is known exactly and is "correct."

By analogy, for l1 distance we have:
l1(p(A), q(A)) = sum_a |p(a) - q(a)|
l1(p(A,B), q(A,B)) = sum_{a,b} |p(a) - q(a)|
l1(p(A|B),q(A|B)) = sum_b p(b) l1(p(A|B=b), q(A|B=b))
Note that this last one might be slightly non-standard, but is parallel to the KL definition.

Similarly, for variational distance:
var(p(A), q(A)) = max_a |p(a) - q(a)|
var(p(A,B), q(A,B)) = max_{a,b} |p(a) - q(a)|
var(p(A|B),q(A|B)) = sum_b p(b) var(p(A|B=b), q(A|B=b))
Okay, so now for the experiment. First I generate a random (uniform) true joint distribution p(A,B). I then run through 1,000,000 possible q(A,B), where each of the three sufficient statistics are chosen from [0.01, 0.02, ... 0.99]. I then conditionalize and marginalize these in all the relevant ways and compute KL. Finally, I generate plots like the following very representative example for KL:
On the left column, we're inspecting the recovered joint distribution and in the right column the recovered conditional distribution. The top row shows: for different divergences of q(A) from p(A), and for different divergences of q(B|A) from p(B|A), how far is (left) the recovered joint q(A,B) from the true joint q(A,B), or how far is the (right) recovered conditional q(A|B) from the true conditional p(A|B). The middle row is the projection of this into two dimensions, focusing on the divergence in the marginal, and the bottom row is the projection onto the divergence in the conditional. The title shows what the true distribution is in the form [p(a,b) p(a,~b) ; p(~a,b) p(~a,~b)]. I chose this example because the joint has a correlation between a and b.

This example is fairly benign: as the approximations become worse, so do both of the recovered distributions, in a fairly linear way until a plateau. From the bottom row, you can see that it's more important to get the conditional right than the marginal (you can have a marginal that's quite far--eg., a KL of 1.5--and still get an almost perfect recovery of the conditional or joint, but this is not true for large differences in the conditional B|A.

One strange thing is that you often (for different true joints) see results that look like:
There's a very strange effect here, in which a larger kl on B|A can actually be better at the recovery of the conditional, while worse at the recovery of the joint.

 One can ask if this is an artifact of KL. So let's switch to L1 and variational for the first set of plots:

and variational:
So, in both L1 land and variational land, you can do better on the conditional by being worse on the (other) conditional.

For the example that gave rise to the weird KL results, we have the following for L1:
which shows almost an identical effect. For variational:
the effect is still the same.

Okay, so it's totally entirely possible (perhaps probable?) that there's a bug in my code. If you'd like to look, check out mykl.m and myklrun.m (yes, it's matlab). Let me know in the comments if there are bugs. If you'd like to look at more examples, check out all ten examples.

02 June 2014

Role models

During grad school, my advisor suggested I identify a recent grad who has been, to me, successful. I could then use him or her as a guide. I picked someone (he now knows who he is), and the exercise was useful: there are lots of ways to be successful in research land, and this helped me focus.

RST-relation=Topic-Shift.

I'm fairly serious about yoga. I've had a lot of instructors over the years and noticed a high correlation between InstructorILike and InstructorWhoIsMale. Initially I believed this was because male instructors pushed more, and that worked for me. Over time I realized that was not the full story.

I spent two weeks going to classes by instructors I hadn't had before to try to understand what variable(s) made the difference. I've believe now that a large part of the reason I like male instructors is precisely because they're male. A female instructor would do some crazy pose and my brain would immediately say "I could never do that." A male instructor would do the same pose and my brain would say "If he can do it, so can I." (I'd then try and fail several times, but never with a defeatist attitude.)

Topic-UnShift.

I've heard for a long time that having role models you can identify with is important. As someone who has in almost all of my life fit into the overwhelming majority (white male in tech/academia), it's been rare that I've had the opportunity to really feel this effect for myself. I try to believe things even if they haven't happened to me, but it's always better when you can empathize rather than sympathize and it's easier to empathize when you've actually been there.

The first time I remember feeling the effect of a role model "who looks like me" was  at the 1996 Olympics and Poul-Erik Høyer Larsen (Denmark) was the first European to ever win the badminton semi-finals; he then won the gold medal against Dong Jiong (China). (This sport is dominated by Indonesia, China and Malaysia.) Growing up in a particular part of Los Angeles and playing badminton as a kid, I was very much an outlier. Even though I'd never heard for Poul-Erik before (everyone knew who Jiong was), his win gave me something I could aspire to.

A few years ago I began broadcasting my support of the LGBT community, e.g., an HRC link on my web page and painting my laptop. Since then I've gotten emails from several people (mostly students) effectively asking why there aren't more/any LGBT role models in our community. You can interpret "community" meaning anything in the NLP/ML to CS to Science/Tech range. My answer: I don't know. It's hard to even know how large this community is because, unlike things like race and (binary) gender, it's not always outwardly inferrable (with noise). These issues effect tech is nuanced ways; see for instance an interview with the founder of Lesbians Who Tech or Queer in STEM for more.

This is all to say that having role models is important, and yes, it does matter who they are, where they came from, and what they look like. It mattered to the high school aged version of me, the grad school version of me, and the associate prof version of me. I'm not saying anything new here, but for our field to be healthy, we need a large number of successful people who can be role models for all sorts of students (and beyond). Token visibility is not enough because a single example of some particular label won't match with everyone who self-identifies with that label. The person I chose was, yes, a while male. There were plenty to choose from. But I chose him, and others would not have sufficed.

30 May 2014

Past tense is not past tense

I took part in a wonderful Dagstuhl workshop this past February on translating morphologically rich languages. (Yeah, I also don't really know why I was invited :P.) But many thanks to Alex, Kevin, Philipp, Helmut and Hans for inviting me. I had a realization during this workshop that I thought I'd share. It's obvious in retrospect, and perhaps in front-spect for many of you. Much of this came up in the discussion with Bonnie Webber, Marion Weller, Martin Volk, Marine Carpuat, Jörg Tiedemann and Maja Popovic, and Maja deserves much credit for her awesome error analysis tool that helped shed some light on German.

One thing you commonly think of when translating into a morphologically rich language is that there's stuff you're going to have to hallucinate. Really this isn't an issue of morphology per se, but just that this is one place where it's obvious. For instance, even going from English to French you'll have to hallucinate gender on your determiners (un versus une and le versus la) that's unmarked in English. Or when going from Japanese (which roughly combines present and future tenses into a single tense) to English, you'll have to hallucinate "will" at appropriate places.

An abstraction that I think was pretty widespread among the initial discussions in the workshop was that if you're going from language X to Y, there are basically two options:

  1. Phenomenon foo is explicit in Y but implicit in X, and therefore you'll have to hallucinate it (i.e., tense is explicit in English but not in Mandarin)
  2. Phenomenon bar is explicit in Y and also explicit in X, and so you can just copy it.
The problem is that (2) is just false, even for things that you think it might not be false for. That is to say: Just because two languages explicitly code for something that we give a consistent (linguistic) name to, doesn't mean that they code for it consistently.

Okay, so you want examples.

An easy example is gender. I've been well assured that, for instance, French and Russian both have explicit gender. But just because some noun (eg moon/lune) is feminine in French doesn't mean it's also feminine in Russian. (In fact I think it's neuter.)

You might argue gender is a stupid thing to pick because it's essentially an artificial encoding of who-knows-what.

How about tense. That clearly has a semantic interpretation (did something happen in the past, the present or the future) and so if languages X and Y both express some particular tense, they must be consistent in how they do it.

Wrong. Now my memory is getting a bit shaky, but my recollection is that, for instance, in newswire text, it's very common in German to refer to things that have happened in the past in present tense. To English speakers this is a strange convention (we tend to refer to such things in past tense), but it doesn't have to be so. And of course English has it's own idiosyncrasies: see the plight of the native German speaker who cannot understand English tense usage in (New Zealand) news articles.

Part of this is probably because tense, even in English, is a pretty slippery concept. We (native English speakers) have no problem using present (or progressive) tense to refer to things that happened in the present (John runs) the past (so yesterday I'm running to the store and a hamburger falls on my head!) or the future (my flight leaves at 8:00 tonight).

Another easy example is definiteness (thanks to Kevin Knight for this inspiration). Again, our high school English teachers tell us that "the" (+Definite) has to refer to something that's already been introduced into context. I just went to cnn.com, clicked on the very first link, and the first sentence is "The boss resigned under pressure and other Veterans Affairs managers are likely on the way out." Ok you could argue that "The boss" is already in the context of the US news media (this is an article about Shinseki) but it's nonetheless very common to see (English) entities introduced using "the" and the precise rules that govern this may or may not be consistent across other languages.

The long and short of this is: I like the fact that translation into morphologically rich languages makes us pay attention to linguistic divergence. But that doesn't mean that divergences aren't there even when languages express the same set of linguistically-named phenomena. Usage can vary dramatically, be it for conventions, socio-linguistic reasons, or other things that are hard to pin down. It's just that by focusing all our energy on a very particular convention (newswire, parliament), we can pretty easily learn these mappings because there's no variability. Add some variability and we're hosed, even for languages with the same set of (overt) markings.

16 May 2014

Perplexity versus error rate for language modeling

It's fair to say that perplexity is the de facto standard for evaluating language models. Perplexity comes under the usual attacks (what does it mean? does it correlate with something we care about? etc.), but here I want to attack it for a more pernicious reason: it locks us in to probabilistic models.

Background

Language modeling---or more specifically, history-based language modeling (as opposed to full sentence models)---is the task of predicting the next word in a text given the previous words. For instance, given the history "Mary likes her coffee with milk and", a good language model might predict "sugar" and a bad language model might predict "socks." (This is related to the notion of cloze probability.)

It's quite clear that there is no "right answer" to any of these prediction problem. As an extreme example, given the history " The", there are any number of possible words that could go next. There's just no way to know what the "right" answer is, whether you're a machine or a person.

This is probably the strongest justification for a perplexity-like measure. Since there's no "right" answer, we'll let our learned model propose a probability distribution over all possible next words. We say that this model is good if it assigns high probability to "sugar" and low probability to "socks."

Perplexity just measures the cross entropy between the empirical distribution (the distribution of things that actually appear) and the predicted distribution (what your model likes) and then divides by the number of words and exponentiates after throwing out unseen words.

The Issue

The issue here is that in order to compute perplexity, your model must produce a probability distribution. Historically we've liked probability distributions because they can be combined with other probability distributions according to the rules of probability (eg., Bayes rule or chain rule). Of course we threw that out a long time ago when we realized that combining things, for instance, in log-linear models, worked a lot better in practice, if you had a bit of data to tune the weights of the log-linear models.

So the issue in my mind is that there's plenty of good technology out there for making predictions that does not produce probability distributions. I think it's really unfortunate that non-probabilistic approaches don't get to play the language modeling game because they produce the "wrong" sort of output (according to the evaluation, but not according to the real world). I'm not saying there aren't good reasons to like probabilistic models, but just that alternatives are good. And right now those alternatives cannot compete. (For instance, Roark, Saraclar and Collins [2007] don't use perplexity at all and just go for word error rate of a speech recognizer around their perceptron-based language model.)

When I Ran Into This

I was curious about building a language model using vw, in the context of another project, and also to stress-test multiclass classification algorithms that scale well with respect to the number of classes.

As soon as I ran it, I discovered the issue: it produced results in the form of error rates. As I recall (it was a while ago), the error rate was somewhere in the 60s or 70s. I had absolutely no idea whether this was good or not. It seemed reasonable.

To get a sense of how standard language models fare, I decided to train a language model using srilm, and evaluate it according to error rate. To make my life easier, I just ran it on the WSJ portion of the Penn treebank. I used the first 48k sentences as train and the last 1208 sentences as test. I trained a 5gram Kneser-Ney smoothed language model and evaluated both perplexity and error rate (the latter required a bit of effort -- if anyone wants the scripts, let me know and I'll post them---but basically, I just take the LM's prediction to be the highest probability word given the context).

The language model I built had a perplexity (ppl1 in srilm) of 236.4, which seemed semi-reasonable, though of course pretty crappy. There was an OOV rate of 2.5% (ignored in the perplexity calculation).

The overall error rate for this model was 75.2%. This means that it was only guessing a quarter of words correct. (Note that this includes the 2.5% errors mandated by OOVs.)

I also tried another version, where all the model had to do was put the words in the right order. In other words, it knows ahead of time the set of words in the sentence and just has to pick between those 20, rather than between the full vocabulary (43k types). [This is maybe semi-reasonable for MT.] The error rate under this setting was 66.8%. Honestly I expected it would be a lot better.

Note that if you always guess ","---the most frequent type in this data---your error rate is 95.3%.

So why was it only moderately helpful (< 10% improvement) to tell the language model what the set of possible words was? Basically because the model was always guessing really high probability unigrams. Below are the top ten predicted words when the model made an error with their frequencies. They're basically all stop words. (This is in the unrestricted setting.)

     1    14722 ,
     2     1393 .
     3     1298 the
     4      512 and
     5      485 in
     6      439 of
     7      270 to
     8      163 ''
     9      157 a
    10      108 is
    11       54 's
    12       52 have
    13       49 said
    14       41 ``
    15       38


    16       38 for
    17       38 are
    18       34 %
    19       33 $
    20       31 be

The same list for the restricted setting is virtually identical, basically because most of these words are available in the average sentence:

     1    10194 ,
     2     5357 .
     3     1274 the
     4      274 of
     5      251 in
     6      232 to
     7      230 and
     8      193 a
     9       51


    10       36 for
    11       28 's
    12       25 ''
    13       24 said
    14       24 is
    15       21 that
    16       21 ``
    17       16 it
    18       15 from
    19       14 be
    20       13 by

Oh well.

Ok, so I don't have a "good" counter proposal to perplexity. Error rate certainly has many issues of its own. You could use IR-like metrics, like recall @ 10, mean average precision, etc., which are all questionable in their own ways.

I would just, in general, like if we could evaluate language models without having to be handcuffed by probabilities.

26 April 2014

An easy way to write less hurtful reviews: don't say "you"

I'll be honest: I've had my feelings hurt by scathing reviews more than a few times. In grad school I remember even crying over a review that I thought was particular pernicious. My skin has thickened a bit over time, though often in the not-so-helpful manner of dismissing reviews that I don't like as "they didn't get it," which defeats one of the two primary purposes of reviews in the first place (providing feedback; the other: making accept/reject decisions).

The thing that's hard to reconcile is that I really like most of the people in our community, and everyone I meet at least seems really friendly.

When doing mock reviews with grad students, I'll often tell them to keep in mind that there's a good chance that the author is, or later will be, a friend of theirs. It's possible to provide feedback to a friend in such a way that you don't hurt their feelings.

I've recently started doing something else (in addition to the above suggestion). I don't use the words "you" or "the authors" or even "I." The review of a scientific contribution is not about me and it's not about the authors. It's about the method, the experiments and the contribution. I see little reason why you need to mention anything related to the people involved. (One exception: "I" is often useful in hedging, like the previous sentence, which would be more forceful if I just said "There is little reason...") Perhaps we could even integrate this into START...

This is, of course, similar to the pop-psych advice of talking to loved ones about "actions" rather than "the person." For instance: "I hate you for spilling coffee and not cleaning it up" versus "I hate having coffee spilt on the floor." Or something. I'm sure others can come up with better examples.

My current approach is to write my review with this in mind, and then go back and search for all occurrences of my outlawed nouns, and rewrite these sentences. Often in the process of doing this, I become aware that in many of the cases what I've said really does sound like an attack, and with the very small edit this effect is removed or at least greatly reduced.

I realize I've now just given a pretty good signal for people reading reviews to see if they were written by me or not. Here's a solution: everyone should adopt this policy and then my reviews will no longer be so obvious.
But overall, I really think we should be nice to each other. Perhaps fewer people will depart from the field if they're not constantly battered down by harsh reviews, and then we'll all be better off.

14 April 2014

Waaaah! EMNLP six months late :)

Okay, so I've had this file called emnlp.txt sitting in my home directory since Oct 24 (last modification), and since I want to delete it, I figured I'd post it here first. I know this is super belated, but oh well, if anyone actually reads this blog any more, you're the first to know how I felt 6 months ago. I wonder if I would make the same calls today... :)

If you remember anything about EMNLP anymore and have your own opinions, please feel free to comment. It will also let me know if anyone reads here anymore :).

Happy Spring from DC!




16 September 2013

Active learning for positive examples

I have a colleague who wants to look through large amounts of (text) data for examples of a pretty rare phenomenon (maybe 1% positive class, at most). We have about 20 labeled positive examples and 20 labeled negative examples. The natural thing to do at this point is some sort of active learning.

But here's the thing. We have no need for a classifier. And we don't even care about being good at finding negative examples. All we care about is finding as many positive examples from a fixed corpus as possible.

That is to say: this is really a find-a-needle-in-a-haystack problem. The best discussion I've found of this is Section 4 of Inactive Learning, by Attenberg and Provost. I came across a couple other papers that basically do some sort of balancing to deal with the imbalanced data problem in active learning, but the Attenberg and Provost results suggest that even this is tricky to get right.

But I think this only partially addresses the problem. Here's why.

Let's say that my colleague (call her Alice) is willing to spend one hour of her life looking at examples. Alice estimates that she can look at about 500 examples in that time period (small caveat: some examples are larger than other and therefore slower, but let's ignore this for now). Alice's goal is to find as many positive examples in that one hour as possible. Here are some possible strategies Alice could deploy

  1. Look at 500 randomly selected examples, getting 5 (more) positive examples in expectation (likely somewhere between 2 and 12, with 95% confidence). This gives her a total of 25 positive examples
  2. Train a classifier on her 40 labeled examples, have it rank the entire set (minus those 40). Then look at the top 500 examples. If the learned model has a recall of r% in the top 500, then she should expect to get 5*r more examples, giving her a total of 20+5r examples. (Caveat: this is a bit dangerous, because with very few training examples, results can often be anti-correlated with the desired output and you get better performance by flipping the predicted label. There was a paper on this maybe 5-10 years ago but I can't find it... Maybe somewhat knows what I'm referring to.)
  3. Train a classifier, spend 30 minutes looking at 250 random examples, getting 2.5r more positive examples, then train another classifier, then spend 30 minutes looking at it's top ranked 250 examples. If the second classifier has a recall of r'% in the top 250, then she should expect to get another 2.5*r', and she'll have a total of 20+2.5r+2.5r' = 20 + 2.5(r+r') so long as r>r' this should be better.
  4. Taking this to the extreme, Alice could annotate one example at a time (subject to constraints on either the learning being very fast or Alice not actually being a human). As long as the recall-at-one of the classifiers learned is non-decreasing, this should (I think) be the optimal strategy.
So it seems that in a setting like this, what the classifier should optimize for is simply recall-at-one. In other words, it doesn't care about anything except that it's top ranked output is correct.

Okay, so why is this difficult? Basically because we don't have that much data. If the corpus that represents the haystack is huge, then we want to ensure that the top-one example from that haystack is a positive example. So in principle even if all we're trying to do is select between two different hypotheses (say our hypothesis class at each time step has cardinality two) then in principle we should use as much data as possible to evaluate these two hypotheses. In particular, looking at the recall-at-one on the subset of 40 labeled examples that we have is probably not representative, essentially because this loss function doesn't decompose nicely.

So what could we do instead? Well, we could pool the labeled and unlabeled examples and predict on those instead. Certainly if one of the positive labeled examples outranked everything else, that would be a good thing. But if truly 1% of the unlabeled dataset is positive, then this is not a necessary condition. In particular, the top-ranked positive labeled example could be as far as 1% down the list and we could still (in principle) be doing a good job.

Ok I'll admit: I really don't know how to do this. Maybe it's enough in most cases to optimize for zero-one loss (or a weighted version thereof) and just cross your fingers that the recall-at-one is good. But it does feel like there should be a more direct way to go about this problem.