## 12 July 2016

### Some picks from NAACL 2016

Usual caveats: didn't see all talks, didn't read all papers, there's lot of good stuff at NAACL that isn't listed here! That said, here are some papers I particularly liked at NAACL, with some comments. Please add comments with papers you liked!

#### Automatic Summarization of Student Course Feedback by Luo, Liu, Liu and Litman.

Anyone who has taught has suffered the following dilemma. You ask students for feedback throughout the course, and you have to provide free text because if you could anticipate their problems, you'd have addressed them already. But now you have hundreds of responses that you can't quickly read through. This paper provides a dataset of such responses, together with summaries that you actually have time to read through. The approach is a reasonably standard ILP formulation, with some additional machinery to deal with the fact that the data is very sparse. The idea is to essentially induce a "bigram similarity" as part of the summarization problem. I like this paper because the problem is great (I think NLP should really push in the direction of helping learners learn and teachers teach!), the dataset is nice (if a bit small) and the approach makes sense. And they actually do a human evaluation, even for a short paper! Hooray!

#### A Latent Variable Recurrent Neural Network for Discourse Relation Language Models by Ji, Haffari and Eisenstein.

This paper combines the nice rich hypothesis classes you get with the "anonymous" latent variables in neural networks with the modeling power that one gets from the ability to marginalize "structured" latent variables from classic graphical models land. This is applied to document language modeling, in which the latent variables are discourse relations (in the PDTB sense). The model works well both for language modeling and for predicting implicit discourse relations and dialogue acts. (And, as one should expect from a paper with Eisenstein as an author, there are statistical significance tests!)

#### Structured Prediction with Output Embeddings for Semantic Image Annotation by Quattoni, Ramisa, Madhyastha, Simo-Serra and Moreno-Noguer.

If you want to label images with complex structured outputs, like play(dog,grass), you quickly run in to sparsity problems on the output. The proposal here is to decompose the outputs into embeddings (kind of like Vivek's work and Jags' work) and learning a bilinear model of inputs/outputs in that space. In general, I think there's a lot to be done in interesting modeling of structured output spaces, and this paper gives a new set of techniques in this space.

#### Deconstructing Complex Search Tasks: A Bayesian Nonparametric Approach for Extracting Sub-tasks by Mehrotra, Bhattacharya and Yilmaz.

The problem considered here is the following: if I want to go to ACL, I need to register, book a flight, book a hotel, find some attractions to go to while skipping sessions, look up all the good vegan restaurants in Berlin (hah!), etc. My overall task is going to ACL, but there are a number of highly related but different subtasks. The challenge is to infer these subtasks from search logs, so that you can provide better search support. The model is a Chinese Restaurant Process with word embeddings used to measuring similarities. And look, another short paper with a human evaluation!

#### A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories by Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli and Allen.

This paper introduces the task of story cloze: given a story prefix, predict which of two endings is the "correct" (natural) one. Eg: "Jim got his first credit card in college. He didn't have a job so he bought everything on his card. After he graduated he amounted a \$10,000 debt. Jim realized that he was foolish to spend so much money." Then you have to decide on the right ending: "Jim decided to device a plan for repayment." (correct) versus "Jim decided to open another credit card." (incorrect). The data set was quite well constructed (lots of checks), and a large collection of baseline models are run for comparison. The data is available. I like this paper for the task and the very very careful data set collection. Since this was establishing baselines, I would really liked to have seen error bars on the results so we can have more reasonable future comparisons. I'm really tempted to try to annotate some of these with plot units, but that was really really painful the first time around; but I feel like that's a good explanatory theory for a lot of the examples shown in the paper.

#### Learning Global Features for Coreference Resolution by Wiseman, Rush and Shieber.

The basic idea here is to take a "local" coreference model, that makes decisions about assigning mentions to entities, and augment it with a "global" model that models the entire cluster of mentions. In past work, cluster-level features have been hard to define (e.g., what are the right features to extract over sets of arbitrary size?). What's the solution? Essentially, embed the clusters. I like this paper because I wanted to try to do something like this and Wiseman did it better than I would have. I think there's also an interesting interpretation of this work in terms of things like neural Turing machines/memory networks, in which we now have technology for learning to update memory (where "memory" here is the "clusters"). My only gripe is that, like most papers in the new wave of coreference, the older ML approaches have been somewhat forgotten (except for Vincent Ng who is unforgettable); I'm thinking in particular about the work on ACE, like that of Luo, Ittycheriah, Jing, Kambhatla and Roukos on Bell-tree coreference, which I don't think gets enough notice these days for introducing a lot of the ideas that make up modern coref.

#### Visual Storytelling by Huang, Ferraro, Mostafazadeh, Misra, Agrawal, Devlin, Girshick, He, Kohli, Batra, Zitnick, Parikh, Vanderwende, Galley and Mitchell.

The main observation in this paper is that how you caption images in a sequence is different from how you caption them in isolation. For instance, if you have a temporal sequence of images from, say, a wedding, the captions for them should tell a story. This paper primarily introduces a dataset and baseline approaches for addressing the problem in this dataset. I like this paper, even if you remove the image stuff, because it emphasizes the fact that stories have a different structure than just sequences of sentences. The image stuff gives a grounding that's interesting beyond that. One problem pointed out in the paper is that automatic metrics aren't great here, which is problematic, but not particularly surprising.

## 05 July 2016

### Rating the quality of reviews, after the fact

Groan groan groan reviewers are horrible people. Not you and me. Those other reviewers over there!

tldr: In general we actually don't think our reviews are that bad, though of course it's easy to remember the bad ones. Author perception of review quality is colored by, but not determined by, the overall accept/reject decision and/or the overall score that review gave to the paper.

NIPS did an experiment a bunch of years ago (can't dig it up any more, now it's an urban legend) where they asked reviewers at, I think, the time of author feedback, to rate the reviews. The anecdotal story was that there was almost perfect correlation between "this is a good review" and "this review gave my paper a high score." Of course this is not super surprising, even if you get rid of "emotions," because presumably I like my paper and so any review that doesn't like it is flawed.

For NAACL 2013, we did a similar experiment, but we asked authors for their responses several months after the fact (actually, even after the conference had taken place), at which point hopefully emotions had cooled a bit and they could look back at their reviews with a sort of fond recollection. We presented the contact author of each paper with the original review text for each of their reviews, but did not show them the original scores. We asked them on a standard Likert scale how helpful this review was, and how informative this review was.

Because this was long after the fact, response rate was of course not 100%, and it was also biased toward authors of papers that were accepted. We got responses from 128 authors on a total of 138 papers (some papers had same contact authors), covering a total of about 397 reviews (roughly one per paper, but some short papers only had two, and some papers had four).

All the plots below are restricted to this set of 138 papers, not to the full set of about 500.

First, let's get a sense of the data. Here are the overall results for this entire set of 138 papers:

(Note that the numbers add up to 397, not 138, because this is counting per-review not per-paper.) The first row shows the accept/reject ratio. Since NAACL 2013 had an acceptance rate between 25% and 30%, obviously survey results are biased toward accepted papers, but we still have a healthy response rate from rejected papers.

Overall, the vast majority (~80%) of reviews were considered both informative and helpful (score 4 or 5) according to the authors. So yes, we need to do something about the 20% of reviews that got a 1, 2 or 3 on the Likert scale, but we're actually not doing that horribly. (Modulo sample selection bias.) The papers themselves were considered overwhelmingly appropriate and clear. The overall score distribution matches (roughly) the overall score distribution for the entire conference.

Let's look at what happens if we look only at accepted or rejected paper:

Comparing these, we definitely see a bit of the NIPS effect. For accepted papers, the reviews were considered overwhelmingly informative and helpful (scores of 4 or 5 in 85% or more cases). However, for rejected papers, the reviews were still considered largely informative and helpful (~73% of cases were 4s and 5s). Not surprisingly, accepted papers fare quite well on the individual score metrics, in particular overall score (duh!).

We can alternatively condition the analysis on the overall paper score rather than the final accept/reject decision. Here's how that looks:

That's not substantially different.

So what makes the difference between good (informativeness 4 or 5) and bad (informativeness 1 or 2) reviews?

On average, the "good" reviews were about 15% longer than the bad reviews (on average 320 characters versus 280 characters).

Somewhat surprisingly, a linear classifier on bag of words data and distinguish with 90% accuracy "good" from "bad" reviews, but the features it gives high weight to are basically features that look like positive versus negative reviews, rather essentially exploiting of the correlation between informativeness and acceptance, rather than informativeness on its own.

## 30 June 2016

Last semester from grad ML, I totally revamped stuff and started using the awesome book Understanding Machine Learning by Shai Ben-David and Shai Shalev-Shwartz (I'll try to review this in a later post). Naturally, one of the things we talked about was subgradients and subgradient descent.

I imagine that for many of you, if I asked you to define a subgradient of a convex function f at x informally (let's stick in one dimension for now), you would say something like "it's any line that that makes contact with f at x and everywhere else lies below f." This is the definition given in UML, and the definition we always see in pictures, like the one in ciml:

Okay, so this is all great, and one of the fun things you can do is derive algorithms like (stochastic) subgradient descent, which involve picking any subgradient and then using that as if it were a gradient in a standard gradient descent procedure. Yeah, you run into some speed limits for optimization rates, but it basically works.

So then on the midterm I asked a question that was intended to be a freebie: give me the subderivatives of the ramp loss function. Ramp loss is like hinge loss (shown above), but where the negative part gets clamped at 1. Formally, it's ramp(x) = min(1, hinge(x)), where hinge(x) = max(0, 1-x).

Turns out this wasn't a freebie. The problem, obviously in retrospect, is that ramp is not convex, and therefore doesn't have subderivatives! Or at least it doesn't have subderivatives for any x less than zero, and it's (only) subderivative for x≥0 is zero.

Several students point this out, and several students just went through the motions and computed what-we-might-normally-call subderivatives anyway. And by "what-we-might-normally-call" I of course mean what I had originally intended, and what I would have done myself if I had been a student in this course, and also what pretty much any auto-diff toolkit would do.

And yet, it's clearly wrong according to the definition of subgradients that we all know and use everyday in our very non-convex neural networks, when we do things like relu or hardtanh units.

It turns out (not surprisingly) that subdifferentials of non-convex functions has been studied (extensively) since the 1970s. Murdukhovich and Shao give a brief history in their paper on Banach spaces. Unfortunately, I don't actually understand most of that paper (or its citations).

I did manage to find one set of slides that I could understand, though, by Adil Bagirov for a talk on Subgradient methods in nonsmooth nonconvexoptimization! Basically the idea that Adil proposes is that we can use a gradient-ified version of a quasisecant and most/some subgradient-like methods still go through and make sense with this generalized notion.

Why am I posting this? Because it caused my brain to reconfigure itself when I was forced to think about this by the very smart students in my class! Am I going to teach quasisecants in the future? Probably not. But I am going to explicitly point out that the standard definition of subgradient doesn't work for nonconvex functions (or, more specifically, it works, but you get an empty set in a lot of cases) and that there are generalizations but that I don't think we've really figured this all out (as a community).

If anyone has other pointers that I can use in the future, I'd love to see them!

## 24 June 2016

### Language bias and black sheep

Tolga Bolukbasi and colleagues recently posted an article about bias in what is learned with word2vec, on the standard Google News crawl (h/t Jack Clark). Essentially what they found is that word embeddings reflect stereotypes regarding gender (for instance, "nurse" is closer to "she" than "he" and "hero" is the reverse) and race ("black male" is closest to "assaulted" and "white male" to "entitled"). This is not hugely surprising, and it's nice to see it confirmed. The authors additionally present a method for removing those stereotypes with no cost (as measured with analogy tasks) to accuracy of the embeddings. This also shows up on twitter embeddings related to hate speech.

There have been a handful of reactions to this work, some questioning the core motivation, essentially variants of "if there are biases in the data, they're there for a reason, and removing them is removing important information." The authors give a nice example in the paper (web search; two identical web pages about CS; one mentions "John" and the other "Mary"; query for "computer science" ranks the "John" one higher because of embeddings; appeal to a not-universally-held-belief that this is bad).

I'd like to take a step back and argue the the problem is much deeper than this. The problem is that even though we all know that strong Sapir-Whorf is false, we seem to want it to be true for computational stuff.

At a narrow level, the issue here is the question of what does a word "mean." I don't think anyone would argue that "nurse" means "female" or that "computer scientist" means "male." And yet, these word embeddings, which claim to be capturing meaning, are clearly capturing this non-meaning-effect. So then the argument becomes one of "well ok, nurse doesn't mean female, but it is correlated in the real world."

Which leads us to the "black sheep problem." We like to think that language is a reflection of underlying truth, and so if a word embedding (or whatever) is extracted from language, then it reflects some underlying truth about the world. The problem is that even in the simplest cases, this is super false.

The "black sheep problem" is that if you were to try to guess what color most sheep were by looking and language data, it would be very difficult for you to conclude that they weren't almost all black. In English, "black sheep" outnumbers "white sheep" about 25:1 (many "black sheep"s are movie references); in French it's 3:1; in German it's 12:1. Some languages get it right; in Korean it's 1:1.5 in favor of white sheep. This happens with other pairs, too; for example "white cloud" versus "red cloud." In English, red cloud wins 1.1:1 (there's a famous Sioux named "Red Cloud"); in Korean, white cloud wins 1.2:1, but four-leaf clover wins 2:1 over three-leaf clover. [Thanks to Karl Stratos and Kota Yamaguchi for helping with the multilingual examples.]

This is all to say that co-occurance frequencies of words definitely do not reflect co-occurance frequencies of things in the real world. And the fact that the correlation can both both ways means that just trying to model a "default" as something that doesn't appear won't work. (Also, computer vision doesn't really help: there are many many pictures of black sheep out there because of photographer bias.)

We observed a related phenomena when working on plot units. We were trying to extract "patient polarity verbs" (this idea has now been expanded and renamed "implicit sentiment": a much better name). The idea is that we want to know what polarity verbs inflict on their arguments. If I "feed" you, is this good or bad for you? For me? If I "punch" you, likewise. We focused on patients because action verbs are almost always good for the agent.

In order to accomplish this, we started with a seed list of "do-good-ers" and "wrong-do-ers." For instance, "the devil" was a wrong do-er, and so we can extract things that the devil does, and assume that these are (on average) bad for their patients. The problem was the "do-good-ers" don't do good, or at least they don't do good in the news. One of our do-good-ers was "firefighter". Firefighters are awesome. Even stereotyped, this is arguably a very positive social good, heroic profession. But in the news, what do firefighters do? Bad things. Is this because most firefighters do bad things in the world? Of course not. It's because news is especially poignant when stereotypically good people do bad things.

This comes up in translation too, especially when looking at looking at domain adaptation effects. For instance, our usual example for French to English translation is that in Hansards, "enceinte" transates as "room" but in EMEA (medical domain), it translates as "pregnant." What does this have to do with things like gender bias? In Canadian Hansards, "merde" translates mostly as "shit" and sometimes as "crap." In movie subtitles, it's very frequently "fuck." (I suspect translation direction is a confounder here.) This is essentially a form of intensification (or detensification, depending on direction). It is not hard to imagine similar intensifications happening between racial descriptions and racial slurs, or between gender descriptions and sexist slurs, depending on where the data came from.

## 14 May 2016

### A bad optimizer is not a good thing

A very popular style of research in NLP and ML is the math abstraction. You cast your learning problem as some sort of objective function that you want to optimize. Or, if you're feeling Bayesian, you write down a joint likelihood that you'll either sample from or, yes, turn into an objective function that you want to optimize. The optimizer is then typically considered a black box, aside from its hyperparameters which you often must tune.

This is a very attractive style of research and one that I've personally gotten a lot of leverage out of. (Though it's worth emphasizing for reviewer #2 out there that this is not the only want to solve problems, and that not having an explicit objective function that an optimizer is cranking on does not mean your problem formulation is bad.) The main advantage here is separation of concerns. I can worry about making a good objective function, and my optimization friends can worry about optimizing it. Of course it's never that simple and we must consider what can possibly be optimized when making our objective functions (see, for instance, the obsession with convexity 10 years ago and the anti-obsession with convexity today).

At the end of the day, like any abstraction, it's not perfect, and we need to break it occasionally.

There is, however, something that's struck me as peculiar going back to the hayday of topic models in 10 or so years ago, but which applies equally well now. This is a strange world in which we actually hope that our optimizer doesn't work (or s/optimizer/sampler/). To me, this seems like a bad thing to hope for. I think we need to acknowledge that for hard problems, or optimizers and samplers probably won't work, and need to take that into account. But I don't think we should actively hope that they don't work: that just seems broken.

There are two places where I see this hope arise.

1. Occasionally you see objective functions that have trivial solutions that you're hoping won't be found. I've seen this several times in the context of multimodal learning. Suppose that you want to learn some sort of joint embedding of English words and Japanese words. I can write down some embedding functions f(e) and g(j) using whatever neural net magic I want. The outputs are vectors. Now, I want to have them supervise each other, and so I make an objective function that looks like ||A f(e) - B g(j)||^2 for some matrices A and B. The joint objective is to learn A, B, and all the parameters inside f and g. This seems reasonable, until you realize that simply setting A=B=0 will provide not only a local, but a global minimum of this objective function. Of course since these things are non-convex, we often initialize randomly. And in most cases, this global minimum actually isn't found (otherwise such approaches would never work). But this is just a broken way to formulate a problem because I'm relying on the fact that my optimizer doesn't actually work.

2. Much more commonly, the hope of a broken optimizer comes from the heuristic of using "initialization" to build a better model. The general scheme here is that I have some prior knowledge that I want to inject into the system, and I do so by initializing my model's parameters informed by that knowledge. I then optimize. But again, here, we're hoping that the optimizer fails! If the optimizer were perfect, it should be insensitive to the initialization (modulo picking one of multiple global optima, but I don't think that's what's really going on here). You see the same thing in Bayesian land, wherein you basically hope that your sampler doesn't mix properly. Again, this is a case where we're relying on the fact that we don't think our optimizers will actually work.

I think it's fair to say that #1 is pretty broken; I think there are a number of possible responses to #2 that are worth discussing briefly.

Response 1: Early stopping. We rarely run optimizers (or samplers) to completion and instead stop them early based on some held out measure of generalization. For many optimizers, it's possible to show that early stopping it equivalent to regularization (I first saw this as a homework exercise in the old Bishop neural nets book, but it's been reinvented many times since then... and probably before then). You can alternatively pull out some online learning theory and argue that online updates are implicitly regularizing (toward zero). We therefore can conclude that initialization plus early stopping is essentially a form of regularization toward the initialization point. This is reasonable, but it would be nice to actually lay out the theory more solidly in a way that's not specific to a particular model, regularizer, optimizer triple. (Maybe this has been done but I haven't seen it.)

Response 2: Initialization isn't to help the model, it's to help the optimizer. This is a pretty solid response. I think that when we do initialization, this is probably how we should think about it. We think the model is perfect (or at least good) but that the optimizer is going to have a hard time, so we start it somewhere reasonable. I think this perspective is hard to reconcile with the "initialization as prior knowledge view" (it's possible, but requires some mental acrobatics), but it's a reasonable view.

So what should we be doing? Well, if someone wants to prove something along the lines of #1 that would be nice :). Or alternatively just point me to an existing result!

More generally, though, I think the model vs optimization abstraction is quite useful. If we want to incorporate prior knowledge into the model, then some sort of regularization (or prior) seems like a very reasonable approach. (Yes, there are others.) If we want to try to help out our optimizer, then initialization may be a reasonable approach, but it really depends on the optimizer. One can of course do both: the "regularize toward X" approach together with initialization at (or near) X. (Though in my experience often regularizing toward X means that initializing at X isn't necessary because it already introduces a good bias and symmetry breaking.)

## 26 March 2016

### A dagger by any other name: scheduled sampling

Scheduled Sampling was at NIPS last year; the reviews are also online. (Note: I did not review this paper.) This is actually the third time I've tried to make my way through this paper, and to force myself to not give up again, I'm writing my impressions here. Given that this paper is about two things I know a fair amount about (imitation learning and neural networks), I kept getting frustrated at how hard it was for me to actually understand what was going on and how it related to things we've known for a long time. So this post is trying to ease entry for anyone else in that position.

What is the problem this paper is trying to solve?

Neural networks are now often used to generate text. As such, they are often trained to maximize a product of probabilities: the probability of the 5th word in the output given all previous words. Unfortunately, the distribution on which they are trained is typically the gold standard distribution of previous words. This means that they never learn to recover from their own mistakes: at test time, if they make an error (which they will), they could end up in some distribution they've never seen at training time and behave arbitrarily badly. Matti Kaariainen proved over ten years ago that you can construct worst case examples on which, even with vanishingly small probability of error, the sequence predictor makes T/2 mistakes on a length T binary sequence prediction problem. We've also known for about as long how to fix this (for models that are more general than neural networks), but more on that later.

How do they solve this problem?

On the ith minibatch, for each output word, they flip an (independent) coin. If it comes up heads, they use the true ith word when training (the training data one); if it comes up tails, they use the predicted ith word. If the coin probability is 1, then this amounts to the standard (inconsistent) training paradigm. If the coin probability is 0, then this amounts to always using the predicted words.

How does this relate to existing stuff?

To put this in terminology that I understand better, they're playing with the rollin distribution, adjusting it between "Ref" (when the probability is one) to "Learn" (when the probability is zero) and, as is natural, "Mix" when the probability is somewhere in the middle.

I think (but please correct me if I'm wrong!) that Searn was the first algorithm that considered mixture policies for roll-in. (Previous algorithms, like incremental perceptron, Lasso, etc., used either Ref or Learn.) Of course no one should be actually using Searn anymore (algorithms like DAgger, AggreVaTe and LOLS, complete dominate it as far as I can tell).

For what it's worth, I don't think the "related work" section is the paper is really particularly accurate. Searn was never a reinforcement learning algorithm (where do you get the oracle in RL?), and the paper completely dismisses DAgger, referring to it only as "a related idea" and highlights the fact that the "scheduled sampling" approach is online. (Side note, the sense of "online" used in the scheduled sampling paper is the "stochastic" sense.) Of course, DAgger can be trained online, and in fact that's how we do it in the vw implementation... and it works incredibly well! You can also do a straight-up online analysis.

The only real difference to DAgger is the exact form of the schedule. DAgger uses a schedule of something like P(use truth) = 0.99i, where i is the round/minibatch/whatever. Scheduled sampling considers two other rates, one linear and one inverse sigmoid. They're shown below, where the red curve is the DAgger schedule (Figure 2 from the scheduled sampling paper):

What are the results like?

There are basically two novelties in this paper: (1) applying DAgger to training neural networks, and (2) trying different schedules.

I would expect results that analyse the importance of these two things, but I only see a partial analysis of (1). In particular, the results show that training neural networks with DAgger is better than training them with supervised learning. This is nice to know, and is a nice replication of the results from Ross, Gordon and Bagnell. It would be also nice to see the other side: that going neural here is beneficial, but I suspect that's just taken for granted.

On (2), nothing is said! For two of the three experiments (image captioning and constituency parsing), the paper says that they use inverse sigmoid decay. For the third, the paper uses the linear decay schedule. Presumably the others performed worse, but I would really have appreciated knowing whether there's any benefit to doing something other than the DAgger schedule. And if there's benefit, how big?

In particular, the DAgger analysis works for any schedule that has the sum of probabilities over rounds go to zero has the number of rounds goes to infinity. This does not happen for the linear schedule (the sum goes to some lower bound, epsilon). I'm pretty sure it happens for the sigmoid schedule but didn't bother to verify. It would have been nice if the paper had verified this. At any rate, if I'm going to switch from something that I understand theoretically (like DAgger with exponential decay) to something that's heuristic, I would like to know how much empirical benefit I'm getting in exchange for losing my understanding of what's going on.

What's missing

One major difference to DAgger is that this paper makes a weird assumption that the correct next thing to do is independent of what was done in the past. For examples, suppose the gold standard output is "The little bird flew into the mailbox." Suppose that for the first word, the schedule uses the true output. We now have an output of "The". For the second word, maybe the schedule uses the predicted output. Maybe this is "bird". For any reasonable evaluation criteria, probably the best completion of "The bird" with respect to this gold standard is "The bird flew into the mailbox." But the approach proposed here would insist on producing "The bird bird flew into the mailbox."

This is the question of computing the oracle completion, or the "minimum cost action" at any given point. Yoav Goldberg refers to this as a "dynamic oracle." We know how to do this for many loss functions, and I'm surprised that (a) this isn't an issue in the experiments here, and (b) that it's not mentioned.

Reviewing the Reviews

The reviewers seem to mostly have liked the paper, which makes sense, it is interesting. I'm really really surprised that none of the reviewers took the authors to task on the connections to DAgger, and instead concentrated on Searn. I suspect that the reviewers were neural networks experts and not imitation learning experts, and given that the paper talked only about Searn (which at this point is an easy strawman), the reviews tended to focus on that.
The biggest surprise to me in the reviews was that comment that the paper had "comprehensive empirical testing," also echoed by reviewer 3. (Reviewer 2 was basically a no-op.) This paper had impressive experiments on big hard problems, but the experiments were anything but comprehensive from the perspective of understanding what is going on.

Some last thoughts

Let me say that although what's written above is critical, I of course don't dislike this work. It's something that needed to be done, and it's great someone did it. I wish that it were just done better and that the paper made some attempt at situating itself in the literature.

For comparison, take a gander at the recent arxiv paper on training an LSTM using imitation learning/learning-to-search technology by Ballesteros, Goldberg, Dyer and Smith. Not only does this paper get really impressive empirical results, they are also remarkably straightforward about connections to previous work.

## 27 December 2015

### NIPS 2015 Retrospective

There have already been lots of really nice overviews of NIPS 2015. I don't have much to add, largely because I missed almost every talk and a large fraction of the poster sessions.

Before beginning, I'm really happy and somewhat humbled to have been able to participate in the quizbowl demo, which won the best demo award: congrats to Mohit Iyyer, He He and Jordan Boyd-Graber for doing all the hard work!

Here are some things I did see, and that I liked, or that other people pointed me to as something I would probably like but I haven't actually seen.

• Logarithmic Time Online Multiclass prediction by Anna Choromanska and John Langford. This paper has been on arxiv for a while, and has been built into vw for a while also, so I've known about this stuff for probably a year now, but I still think it's great. The task is to build, in an online way, a binary tree for doing multiclass to binary reductions, in a way that gives you nice regret guarantees. The tree changes over time. One interesting property is that a single class can me represented in multiple leaves which gives added representational capacity to common/difficult classes, something most approaches lack.
• Probabilistic Line Searches for Stochastic Optimization by Maren Mahsereci and Philipp Hennig. I didn't see this paper, but I heard from several people that they liked it. The idea is that conventional optimization (eg BFGS) gets a lot of mileage out of line search, something we don't conventionally do in stochastic optimization (like sgd), but perhaps we should. It's on my reading list.
• Grammar as a Foreign Language by Orio Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever and Geoffrey Hinton. This paper thinks of parsing as the output of a sequence-to-sequence model, basically outputting a tree as a sequence of symbols. This doesn't work if you just train it on labeled data. Instead, you have to train a state of the art parser, run it on a huge pile of unlabeled data, and then train the seq2seq model to predict that. I like this because this is what Percy, Dan and I tried to get to work in the structured compilation paper, but could never make it work (probably because we lacked a sufficiently rich hypothesis class). It's basically more evidence, IMO, that seq2seq models are really good memorizers: I would be interested to see an analysis of how much generalization is actually learned (I suspect quite little).
• The Human Kernel by Andrew Wilson, Christoph Dann, Chris Lucas and Eric Xing. I like this paper in the same way I like Jerry Zhu's "machine teaching" work. It's about the interplay between human learning and machine learning. If you haven't read this paper, it's a fun read.
• Calibrated Structured Prediction by Volodymyr Kuleshov and Percy Liang. The observation in this paper is the confidence estimation is important for structured prediction problems, but faces its own challenges. They approach it from a calibration perspective and adjust the notions of calibration to be appropriate for structured problems and marginal inference.
On the not-so-great side, as Kate Crawford pointed out on twitter, the diversity level for one of the NIPS symposia was quite poor, and this goes for pretty much all of the events at NIPS (other symposia, workshops, program committee, organizing committee, etc.). This is not limited to NIPS: apparently, WiML is now slightly larger than AIStats and there were approximately as many men at WiML this year as there were women at AIStats. Nando recently discussed similar issues in his recent Reddit AMA. There's at least some awareness here, but awareness is not enough.

Overall, I enjoyed NIPS, especially catching up with old friends, doing a talking machines interview (should be posted early 2016), and seeing what's going on in a field whose size has increased dramatically since I started going eleven years ago!

## 30 October 2015

### How sausage got made once

When I was working on what turned into an old short paper (Markov Random Topic Fields) I decided it might be pedagogically interesting to keep a journal of what I was doing. This journal ended when I ran out of steam and I never got back to it. My whole original idea was, after the paper got published, post everything: the journal, the code, the paper, the paper reviews, etc. It's now been 6 years and that's not going to happen, but in case anyone finds it interesting, here in the report.

Anyway, here is the report. I'm posting this so that perhaps new students can see that things don't ever work the first time, faculty still have trouble getting their code to work, etc.

The progress of a research idea

============= DAY 1 =============

* Idea

Want to do a form of topic modeling, but where there is meta
information.  There are ways to do this, eg., Supervised LDA or
Dirichlet-Multinomial Regression.  These both operate on a *feature*
level.  For some tasks, it is more natural to operate over a graph.
Along these lines, there's Pachinko Allocation, but this posits a
graph over the vocabulary, not over the documents.  (Plus, it is a
DAG, which doesn't make sense for our application.)

Question: how can we augment a standard topic model (eg., LDA), with
an underlying graph, where we assume topics vary smoothly over the
graph?

* Technology

What technology exists for statistical modeling over graphs?  Sounds
like a Markov Random Field.  So let's marry topic models (LDA) with
MRFs, to give a "Topical Markov Random Field" (TMRF).

We think of LDA a generating documents by first choosing a topic
mixture \theta, and then choosing topics z=k for each word w, where w
is drawn from a multinomial \beta_k.

Where can a graph fit in this?  The first idea is to put an MRF over
\theta.

* MRF over \theta

If we have an MRF over theta, then two issues arise.  First, we almost
certainly can't collapse out theta as we might like.  Okay, we'll live
with that.

Second, from an MRF perspective, what do the potential functions look
like?

The simplest idea is to use pairwise potentials of the form e^{-dist},
where dist is the distance between two thetas that touch on the
graph.  What Distance metric should we use?  We could use
Bhattacharyya, Hellinger, Euclidean, LogitEuclidean, etc.  Let's start
with Hellinger.

What about a variance?  We could have lengths in the graph that are
either latent or known.  Let's say they're latent and our potentials
have the form e^{-dist/l}, where l is the length (so that if you're
far away, distance doesn't matter.

** Getting data together

We have about 1000 docs and three graphs over those docs.  We get them
in a reasonable format and then subsample about 400 of the docs.  We
do this both for speed and to make sure we don't overfit the model on
the data too much.

============= DAY 2 =============

** Implementation

We implement this with an option to use or not use graphs (so we can
tell if they're helping).  We collapse out \beta, but not \theta in
both cases, and we compute log posteriors.

We run first on some simple test data (testW) from HBC and find that
it seems to be doing something kind of reasonable.  We then run on
some real data and it puts everything in one cluster after about 20-30
Gibbs iterations.

Debugging: First, we turn off all graph stuff (sampling lengths) and
things are still broken.  Then we initialize optimally and things are
still broken.  Then we turn off resampling \theta and things are still
broken.  The problem is that I'm using the collapsed \beta incorrectly
when sampling z.  I fix it and things work as expected (i.e., not
everything goes in the same cluster).

** Evaluating

So now the code seems to be working, so we want to evaluate.  We run a
model with an without a graph (where the graph is something we expect
will help).  The posterior probabilities coming out of the two
different models are all over the place.

So we do the standard trick of holding out 20% of the data as "test"
data and then evaluating log likelihood on the test.  Here, we do the
"bad" thing and just use 20% of the words in each document (so that we
already have \theta for all the documents).  Not great, but easy to
implement.  This time, no bugs.

At this point, it's a pain to recompile for every configuration change
and we'd like to be able to run a bunch of configs simultaneously.  So
we add a simple command line interface.

In order to evaluate, we plot either posteriors or held-out
likelihoods (usually the latter) as a function of iteration using
xgraph (interacts nicely with the shell and I'm used to it).

Things now seem mixed.  There's very little difference when you're not
using a graph between sampling \theta from the true posterior versus
using an MH proposal (this is good for us, since we have to use MH).
There is also little difference between the baseline LDA model and the
MRF models.

We turn off sampling of the lengths and just fix them at one.  For the
three graphs we're trying, only one seems to be doing any better than
the baseline LDA model.

** Optimization

Now that we're running experiments, we find that things are taking way
too long to run.  So we do some optimization.  First, we cache the sum
of all \beta_k posteriors.  This helps.  Second, we flip \beta from
\beta_{k,w} to \beta_{w,k}, which we've heard helps.  It doesn't.  We
put it back the other way.

All the time is being spent in resample_z, so we waste a half day
trying to figure out if there's a way to only resample a subset of the
zs.  For instance, track how often they change and only resample those
that change a lot (probabilistically).  This hurts.  Resampling those
with high entropy hurts.  I think the issue is that there are three
types of zs.  (1) those that change a lot because they have high
uncertainty but are rare enough that they don't really matter, (2)
those that change a lot and do matter, (3) those that just don't
change very much.  Probably could do something intelligent, but have

In order to really evaluate speed, we add some code that prints out
timing.

We do one more optimization that's maybe not very common.  resample_z
loops over docs, then words, then topics.  For each word, the loop
over topics is to compute p(z=k).  But if the next word you loop over
is the same word (they are both "the"), then you don't need to
recompute all the p(z=k)s -- you can cache them.  We do this, and then
sort the words.  This gives about a 20% speedup with no loss in
performance (posterior or held-out likelihood).

** Evaluating again

Since we had some success fixing the lengths at 1, we try fixing them
at 20.  Now that one graph is doing noticably better than the baseline
and the other two slightly better.  We try 5 and 10 and 50 and nothing
much seems to happen.  20 seems like a bit of a sweet spot.

============= DAY 3 =============

** A more rigorous evaluation

Running with lengths fixed at 20 seems to work, but there's always
variance due to randomness (both in the sampling and in the 80/20
split) that we'd like to account for.

So we run 8 copies of each of the four models (8 just because we
happen to have an 8 core machine, so we can run them simultaneously).

Now, we require more complex graphing technology than just xgraph.
We'll probably eventually want to graph things in matlab, but for now
all we really care about it how things do over time.  So we write a
small perl script to extract scores every 50 iterations (we've
switched from 200 to 300 just to be safe) and show means and stddevs
for each of the models.

While we're waiting for this to run, we think about...

* MRF over z?

Our initial model which may or may not be doing so well (we're waiting
on some experiments) assumes an MRF over \th.  Maybe this is not the
best place to put it.  Can we put it over z instead?

Why would we want to do this?  There are some technological reasons:
(1) this makes the MRF discrete and we know much better how to deal
with discrete MRFs.  (2) we can get rid of the MH step (though this
doesn't seem to be screwing us up much).  (3) we no longer have the
arbitrary choice of which distance function to use.  There are also
some technological reasons *not* to do it: it seems like it would be
computationally much more burdensome.

But, let's think if this makes sense in the context of our
application.  We have a bunch of research papers and our graphs are
authorship, citations, time or venue.  These really do feel like
graphs over *documents* not *words*.

We could turn them in to graphs over words by, say, connecting all
identical terms across documents, encouraging them to share the same
topic.  This could probably be done efficiently by storing an inverted
index.  On the other hand, would this really capture much?  My gut
tells me that for a given word "foo", it's probably pretty rare that
"foo" is assigned to different topics in different documents.  (Or at
least just as rare as it being assigned to different topics in the
same document.)  Note that we could evaluate this: run simple LDA, and
compute the fraction of times a word is assigned it's non-majority
topic across all the data, versus just across one documents.  I
suspect the numbers would be pretty similar.

The extreme alternative would be to link all words, but this is just
going to be computationally infeasible.  Moreover, this begins to look
just like tying the \thetas.

So for now, we put this idea on the back burner...

* MRF over \theta, continued...

We're still waiting for these experiments to run (they're about half
of the way there).  In thinking about the graph over z, though, it did
occur to me that maybe you have to use far more topics than I've been
using to really reap any benefits here.  We begin running with 8, and
then bumped it up to 20.  But maybe we really need to run with a lot
more.

So, I log in to two other machines and run just one copy with 50, 100,
150 and 200 topics, just for vanilla LDA.  The larger ones will be
slow, but we'll just have to go do something else for a while...

============= DAY 4 =============

* MRF over \theta, re-continued...

Both experiments finish and we see that: (1) with 20 topics and
lengths fixed at 10, there is no difference between raw LDA and TMRF.
(2) More topics is better.  Even 200 topics doesn't seem to be
overfitting.  Going 20, 50, 100, 150, 200, we get hidden log
likelihoods of -1.43, -1.40, -1.38, -1.36, -1.36 (all e+06).  The
significance (from the first experiments) seems to be around .002
(e+06), so these changes (even the last, which differs by 0.005) seem
to be real.  Since we weren't able to overfit, we also run with 300
and 400, and wait some more...

...and they finish and still aren't overfitting.  We get -1.35 and
-1.35 respectively (differing by about 0.004, again significant!).

This is getting ridiculous -- is there a bug somewhere?  Everything
we've seen in LDA-like papers shows that you overfit when you have a
ton of topics.  Maybe this is because our documents are really long?

** Model debugging

One thing that could be going wrong is that when we hide 20% of the
words, and evaluate on that 20%, we're skewing the evaluation to favor
long documents.  But long documents are probably precisely those that
need/want lots of topics.  Our mean document length is 2300, but the
std is over 2500.  The shortest document has 349 words, the longest
has 37120.  So, instead of hiding 20%, we try hiding a fixed number,
which is 20% of the mean, or 460.

============= DAY 5 =============

At this point, we've done a large number of evals, both with 20% hid,
and 460 words/doc hid (actually, the min of this and doclen/2), and
50-1600 (at *2) topics.  We do actually see a tiny bit of overfitting
at 800 or 1600 documents.

Then we do something we should have done a long time ago: go back and
skim through some LDA papers.  We look at the BNJ 2003 JMLR paper.  We
see that one of the selling points of LDA over LSI is that
it *doesn't* overfit!  Aaargh!  No wonder we haven't been able to get
substantial overfitting.

However, we also notice something else: aside from dropping 50 stop
words (we've been dropping 100), on one data set they don't prune rare
words at all, and on the other they prune df=1 words only.  We've been
pruning df<=5 or <=10 words (can't remember which).  Perhaps what's
going on is that the reason the graphs aren't helping is just because
there vocabulary (around 3k) isn't big enough for them to make a
difference!

We recreate the text, pruning only the df=1 words.  This leads to a
vocab of about 10k (which means inference will be ~3 times slower).
We run at 50, 100, 200 and 400 and we actualy see a tiny bit of
overfitting (maybe) on 400.  We accidentally only ran 100 iterations,
so it's a bit hard to tell, but at the very least there's
no *improvement* for going from 200 topics to 400 topics.  Strangely
running on the new text is actually about 5-20% *faster*, despite the
larger vocabulary!

** Re-running with graphs

At this point, we're ready to try running with graphs again.  Despite
the fact that it's slow, we settle on 200 topics (this took about 3.5
hours without graphs, so we will be waiting a while).  We also want to
run for more iterations, just to see what's going to happen: we do 200
again.

And again there's not much difference.  One of the graphs seems to be
just barely one std above everyone else, but that's nothing to write

============= DAY 6 =============

* Abstracts only?

At this point, things are not looking so spectacular.  Perhaps the
problem is that the documents themselves are so big that there's
really not much uncertainty.  This is reflected, to some degree, by
the lack of variance in the predictive perplexities.

So we rebuild the data on abstracts only.  This makes running
significantly faster (yay).  We run 5 copies each of 10, 20, 40, 80,
160 and 320 topics.  40 is a clear winner.  80 and above overfit

Now, we turn on graphs and get the following results (5 runs):

40-top-nog.default      -69239.4 (86.2629700392932)
40-top-nog.auth         -68920.0 (111.863756418243)
40-top-nog.cite         -68976.4 (174.920839238783)
40-top-nog.year         -69174.2 (133.769204228776)

If we compare default (not graphs) to auth (best), we see that we get
a 2-3 std separation.  This is looking promising, FINALLY!  Also, if
we plot the results, it looks like auth and cite really dominate.
Year is fairly useless.

It suggests that, maybe, we just need more data to see more of a
difference.

* Getting More Data

There are two ways we could get more data.  First, we could crawl
more.  Second, we could switch over to, say, PubMed or ACM.  This
would work since we only need abstracts, not full texts.  These sites

============= DAY 7 =============

Okay, ACM is a pain.  And it's really not that complete, so we switch
over to CiteSeer (don't know why we didn't think of this before!).  We
seed with docs from acl, cl, emnlp, icml, jmlr, nips and uai.  We
notice that CiteSeer is apparently using some crappy PDF extractor (it
misses ligatures!  ugh!) but it's not worth (yet!) actually
these seeds, we run 10 iterations of reference crawling, eventually
ending up with just over 44k documents.  We extract a subset
comprising 9277 abstracts, and six graphs: author, booktitle,
citation, url, year and time (where you connect to year-1 and year+1).
The 9k out of 44k are those that (a) have >=100 characters
"reasonable" in the abstract and (b) have connections to at least 5
other docs in *all* the graphs.  (We're no longer favoring citations.)
We then begin the runs again....



## 05 October 2015

### A small observation for prepubs on arxiv

The question of how "traditional conference publication" should react to arxiv prepublications is raised quite frequently. I'm not particularly shy about the fact that I'm not a fan, but that's not what this post is about. This post is about data.

In any discussion about the "arxiv question," proponents of the idea typically cite the idea that by posting papers early on arxiv, they are able to get feedback from the community about their work. (See for example here, which at least tries to be balanced even if the phrasing is totally biased, for instance in the poll at the end :P.)

At any rate, the question I was curious about is: is this actually borne out in practice?

I did the following experiment. Arxiv nicely lets us see revision histories. So we can see, for instance, whether a paper that was placed on arxiv before notifications for the corresponding conference have gone out, is updated more than a paper that was placed on arxiv after notifications.

For NIPS papers that were first posted to arxiv after camera ready, 75% were never updated and 19% were updated once (on average, they were updated 0.36 times +- 0.713 std).

For papers that were first posted to arxiv before notification, all were updated at least once. The real question is: how many times were they updated between posting to arxiv and acceptance to the conference. The answer is that 82% were never updated during that period. Of course, all were updated at some point later (after the camera ready deadline), and 55% were updated only once, and another 18% were updated twice.

[Note: I only count updated that come at least two week after the first posting to arxiv because before is more likely to be typo fixing, rather than real feedback from the community.]

The sample size is small enough that I can actually look at all of the ones that were updated between posting to arxiv and notification. One of these seems was first posted in mid-Feb, updated twice is late March, and then again in Nov (acceptance) and Dec (camera ready). Another is very similar. Two were most likely posted the previous year when it was submitted to AIStats (the dates match up) and then updated when submitted to NIPS. Those were the only four, and two of them seem like a legit possible case of update due to community feedback.

As far as the question of "rapid impact on the field" this is harder to answer. I took a random sample of ten papers from each of the groups (prepub versus non-prepub) and got citation counts from google scholar. The median citation count was 10 for both sets. The average was slightly higher for the prepub set (15 versus 11, but with giant standard deviations of 12 and 16). Considering the prepub set has been out at least 6 months longer (this is NIPS 2013 and 2014 so this is a sizeable percentage), this is a pretty small difference. And it's a difference that might be attributable to other factors like "famous people are perhaps more likely to prepub" [actually it's not clear the data play this out; in a totally unscientific study of "does Hal think this person is famous" and a sample of 20 for each, it's even split 10/10 in both sets].

Anyway, I'm posting this because I've heard this argument many times and I've always felt it's a bit dubious. I've never seen data to back it up. This data suggests it's not true. If someone really believes this argument, it would be nice to see it backed up with data!

[Notes: I took only papers that were marked on arxiv as having appeared in NIPS, and which were first posted to arxiv in 2013 or 2014; this is 175 papers. I then hand-checked them all to exclude things like workshops or just submissions, and labeled them as to whether they appeared in 2013 or 2014.  That left a sample of papers. The rest of the data was extracted automatically from the arxiv abstract. The total number that was posted before notification (the prepub cases) is 22 (or 27%) and the total number that were posted after notification is 59 (or 73%). So the sample is indeed small. Not much I can do about that.]

## 11 September 2015

### How long'll it take to say that?

tl;dr: Given a string of text in some language you might want to know how long it would take to speak it. Here are some perl/python "one-liners" to estimate that. Currently supports English (US), German, French, Italian, Spanish and Japanese. The estimates are all in seconds.

Brave are those who read the source code. I promise it's not intentionally obfuscated -- there's just a lot of unicoding going on, and then copious use of the right-handed saturn operator in perl.

Why would you want code for this rather than a dictionary? Dictionaries are limited to their vocabulary, which is also sensitive to things like tokenization. Having a completely parametric solution seemed much more generalizable.

Why would I possibly want this?

We've been doing a bunch of work recently on simultaneous machine interpretation (aka "real time machine translation"). None of us is a speech person, either in the "recognition" or "synthesis" sense. This unfortunately means that to date, all of our models treat each word as "equally long" when training and, perhaps more importantly, evaluating models. For instance, if we want to measure the décalage (aka time-lag, aka ear-voice-span) between when a Japanese word is "heard" and the corresponding English translation is "spoken", we've been assuming all words take precisely 1 second to speak. This is obviously ridiculous.

One quick and dirty alternative is to use a text-to-speech (aka speech synthesis) system to synthesize a sentence and use its length as an estimate of how long it would take a person to speak it. This is a totally plausible approach since decent open source synthesis software exists (we use such software to crease these one-liners), but it's slow and bloated and can't be easily distributed. This would be more accurate, but I was after a quick and filthy solution.

How do these scripts work?

There are two functions: one for estimating the amount of time it would take to speak a single word (sayWord in the code), and another for estimating the amount of time it would take to speak a sentence (sayit in the code). I'll first describe how sayWord works; sayit is pretty straightforward.

sayWord works by extracting a bunch of features from the word to be spoken (each of these is a particular regular expression) and evaluating a linear function of the counts of the matches of those regular expressions. The process by which coefficients were generated is explained later. The features are things like: number of characters, number of vowels, number of consonant groups, number of non-letters, number of vowel-consonant switches, number of digits of various lengths, and whether the word starts or ends with a vowel; and then, for each "reasonable" unicode character for European languages, the count of those characters. Not all of these features appears for each language because I used l1 regularization to prune down the feature set.

[Note: Japanese is different because it uses a different character set. The feature structure is basically the same, but replace "consonant" with "kana" and "vowel" with "kanji" and "reasonable character" with "each of the 100 most frequent Japanese characters" in Japanese Wikipedia.]

sayit attempts to pronounce a sentence by pronouncing each word individually (via sayWord) and then rescaling the resulting estimate because we ... don't ... pause ... between ... words. This rescaling is linear, and again estimated from data (explained below).

How are the coefficients estimated for words?

Basically I take a vocabulary of the 50k-100k most frequent words for each language, use a speech synthesis program to say them (for the European languages, I used MaryTTS; for Japanese I used Open JTalk).

I then extract all the features mentioned above, create a regression problem regressing on the number of seconds it takes to speak (after removing quiet time around the word) and throwing in some l1 regularization with vw to make sure that it didn't use too many features. I optimized quantile loss (aka absolute value loss) rather than squared loss.

One thing I did not do was weight the words by their frequency. I could have done this and the resulting regression weights change a bit but not too much. At the end it spends a lot of energy making sure it estimates the speaking time of "a" and "the" and "an" correctly. Because short words are typically high frequency (and vice versa) this meant that it tended to underestimate all other words. And because my relatively frequencies were from Wikipedia, they might not match yours. Keeping a uniform distribution felt like a better solution.

I also tried not regularizing and also using things like character ngram features to get a better fit. In the end I didn't include this either. You can about halve the error rate by doing these things, but I felt like having simpler, smaller models made more sense here. After all, the thing we're regressing on (speaking time from a TTS system) is sort of an artificial benchmark anyway, so getting a bit lower error is not obviously that meaningful.

Overall the mean absolute errors in prediction are:

• German: 51ms
• English: 55ms
• French: 59ms
• Italian: 41ms
• Japanese: 33ms
If you were to just predict the median on each, you would get mean absolute errors in the 550ms (Japanese) to 980ms (English) range, so this is quite an improvement.

How are the coefficients estimated for sentences?

For sentences, I fit a model of the form:
• time = const + a * [summed time for words] + b * [# of words]
The three parameters (const, a and b) were estimated by having the TTS systems speak entire sentences (about 40k from each language, taken randomly from Wikipedia) and then using the total time from the TTS (unioned across all languages) and the corresponding features. These are simply hardcoded and shared across languages. You could of course do a bit better by doing this on a language-by-language basis, but I didn't do this again for simplicity.

Conclusion

Chances are no one except me wants this. But if you do, please feel free to use it. I'd appreciate some sort of credit though :). And if you really want the dictionaries, you can find them here: de, en-US, fr, it, ja.

Thanks to Graham Neubig for providing the tokenized Japanese Wikipedia text and pointing me to Open JTalk!

## 08 September 2015

### Overfitting

Many students, in response to their assigned reading for today's undergrad ML class, asked me for a formal definition of overfitting. The fact that I don't really have one irked me, especially this this basically is the problem in machine learning. This blog post is an extended version of the answer I gave in class.

Intuitively, overfitting is when your learned model is doing much better on training data than it will do on test data from the same distribution. This happens because your model learns to fit random idiosyncrasies in the training data rather than model the true underlying trends.

I then pointed to a recent high-profile paper by Cynthia Dwork and colleagues on Holdout Reuse (appeared behind a paywall in Science and something similar seems to be in NIPS too). This paper presents some really cool ideas that use machinery from differential privacy to allow one to effectively reuse the same heldout ("development") data multiple times, without worrying about overfitting the holdout. But that's not what I wanted to highlight; I wanted to highlight the formal definition Dwork et al. give for overfitting. They (roughly) say that a model has overfit the training data when its test error is larger than its training error (by at least some small amount).

I feel like this definition captures some of the essence of overfitting, but it also misses what I would call my "everyday experience" with using machine learning in practical, real world settings. And my everyday experience is that you training error is always lower than your test error. To confirm to myself that I wasn't misremembering my experiences, I ran a few experiments on some simple classification tasks. Below is a plot of train/dev/test error on Reuters RCV1:

Here, the x-axis is regularization strength (this is all with libsvm, C=2^-xvalue) and the y-axis is error rate. There are error bars on dev/test but they're so small you can't really see them.

The thing I want to highlight in this figure is that the best performance, measured by dev or test performance, occurs when the regularization is a bit below 0. But the point at which the gap between dev performance and training performance starts to inflate is much earlier, maybe around -4. The title of the figure tells us that the test error achieved when the training error drops below dev error is 9.48%; but when dev bottomw out the test error is 6.08%. That's leaving 33% relative error on the table.

RCV1 isn't unique. Below are the same plots for MNIST on several binary splits:

For all three splits, some easier, some harder, the relative improvement is always around 20-30% by not using the definition provided by Dwork et al.

Having somewhat convinced myself that I'm not crazy, the immediate question is: what is a better definition of overfitting that captures my everyday experience? I completely agree that the definition quoted above is a necessary condition, but somehow it does not seem sufficient.

Even though this was my experience, I felt a bit flummoxed because I had a hard time pointing out exactly what I felt was missing in the definition above.

Here's what I ended up settling on. Overfitting typically happens, as stated above, because there are random idiosyncrasies in the training data that don't generalize. This is a finite sample effect. If I give you a dataset with 1000 training examples, and I add 100 completely random features, at least one of these features is going to correlate somewhat with the training labels. Overfitting occurs when the model chooses to use one of these completely random features, which helps drive the training error down, but hurts at test time.

But there's a "throwing the baby out with the bathwater" effect going on here. Although there may be completely random features that accidentally correlate slightly with the true label, there are also weak but nevertheless useful features that correlate a small amount with the true label. On a finite sample it's going to be virtually impossible to distinguish between these two.

By adopting the "training error < test error" rubric for overfitting, we're forcing ourselves to throw out both types of features. This is a trade-off, and empirically at least, it appears that this is a losing trade-off. It seems better to keep around a few of these weak-but-useful features in exchange for the risk of also keeping around a few truly-useless-and-misleading features. If this weren't the case, I would have a hard time explaining my "everyday experience."

So do I have a better suggestion?

Not really.

One really nice property of the Dwork et al. definition is that it's testable for a single model. That is, I hand you a learned function h and you can run h on some training data and some dev data and you can tell me if it's overfit. I don't have to draw curves like those above. I don't even have to know anything about how the model was built, and I don't have to sweep hyperparameters (which may or may not be sweepable).

I really like all of those properties. And I believe that if we came up with an alternative definition of overfitting that obeys these properties, we can reuse much of the awesome work in the Dwork et al. paper.

But I also feel like the definition above is overly conservative and it's risky to use.

## 09 June 2015

### Some NAACL 2013 statistics on author response, review quality, etc.

NAACL 2015 has just passed, NAACL 2013 is long in the past.

One bonus of being a program chair is that you get to have fun with data. In this post I'd like to review two pieces of data, one related to author response and one related to review quality assessment.

tldr: Generally, I think author response is useless, except insofar as it can be cathartic to authors and thereby provide some small psychological benefit. And in general people don't seem that dissatisfied with their papers' reviews, and this is largely independent of the outcome of the paper (conditioned on selection bias of those who responded).

On Author Response

NAACL and other conferences have for some time allowed authors to respond to their reviewers, ostensibly to correct errors, but more often just to argue their position. Most people I've talked to who are in favor of author response say that it made the difference for one paper between it being rejected (pre-response) and accepted (post-response). Of course this is unknowable because decisions aren't made pre-response, so what these authors are reporting is their guess as to accept/reject before response versus actual accept after response. My own internal guesses of accept/reject are often off the mark, so you can imagine I don't find this argument particularly strong. To give a sense as to how hard this prediction is, here is a plot showing whether your paper got accepted or not as a function of the mean overall score.

Here, the x-axis is the cumulative distribution of papers (there were more rejects than accepts, so the length of these is normalized to percentage) and the y-axis is the mean overall score for the paper.

Focusing just on the solid lines (long papers), you can see that there were one or two papers with average scores of 3.6 that got rejected, and several papers with average scores of 3.0 that got accepted. If instead you look at "probability of accept" given mean overall score, basically what you see is papers with a score of 2.8 or lower are almost certainly rejected; with 3.9 or higher are almost certainly accepted, and around 3.2 it's an absolute toss-up.

Now, you might ask: who chose to respond? Perhaps people with clear outcomes elected not to respond at all. Here are the numbers for that:

This is only for long papers (it looks the same for short). Each dot is a paper and red are rejects, blue are accepts; open circles means no response. Definitely people with scores less than 3 responded less frequently than those with scores above 3, but even the top scoring papers all included some response. There were some "hopeful" papers with scores less than 2.0 that responded, even though it was certainly not useful.

But wait, you might say: perhaps those papers with average score or 1.33 gave a response and their final scores went up to 4.5? Ok first of all, these are final scores plotted above, not pre-response scores. But we certainly can look at how scores changed between the initial reviews and the final reviews. (Note this conflates two things: author response and reviewer discussion. We'll get back to that later.)

Here, we again have one dot per paper (these are randomly perturbed with small variance so they're all visible). Along the x-axis is the original score of this paper; along the y-axis is how much this score changed between the initial review and the final review. Overall the average absolute change across all papers is 0.1. The vast majority (87%) didn't change at all.

And note: scores were just about as likely to go down as up. (Several scores went down with no response, which I find troublesome, except in cases where reviewers specifically said "I'm giving benefit of the doubt, but need a clarification here.")

You could also say: well, maybe the scores didn't change much, but the review text did. Again, it's pretty rare. Out of 430 reviews (long papers only), 3 reviews decreased in length (~50 words), 46 increased by at most 100 words, and 49 increased by over 100 words (wow, awesome!). But for 80% the review didn't change at all.

Now, getting back to the question of "was this due to AR or to reviewer discussion", this is of course impossible to de-conflate, but we can look at a similar plot as a function of discussion rather than AR:

Here, a few results stand out. First, there are slightly more score changes as a function of discussion than AR. Most of those worrying dots below y=0 that didn't have a response seem to have gone down due to discussions. Moreover, a few scores that went up (on the low-scoring, almost-certainly-rejected papers) did so with no discussion, suggesting that AR did change those scores. But around the x=3.2 position (the true borderlines), almost all of these with score changes had discussion, though plenty had discussion and no score change. As opposed to the x=3.2 position in AR in which they all have responses, and still the majority of scores don't change.

An obvious response at this point is that we still don't know whether the final decision really changed as a function of response. This is a causality question we cannot perform the counterfactual for, so we're not going to get a solid answer. However, we can look at the following:

On the x-axis we have the final score of the paper, and on the y-axis we have (a smoothed version of) the probability that this paper is accepted.

The two nearly identical curves (red and blue) correspond to blue=all long papers and red=all long papers for which the authors submitted any response. In this setting, there really is almost no difference.

The slightly different curve (the black one) corresponds only to papers in which the reviewers engaged in a discussion. Around the critical point (score in the 3-3.5 range), discussion uniformly decreased odds of acceptance, which is of course not surprising for anyone who has ever participated in discussions.

You should take all this with a grain of salt, because all three of these curves are within one standard deviation of the blue curve (dotted lines).

Finally, there's of course the chance that response and discussion are highly correlated: that is, response should perhaps spur discussion. This is probably false (by the above plot), but just to drive the point home, here is the data:

The x-axis is the amount of discussion and the y-axis is the length of the response. See all those dots along x=0? Those are the ones for which the authors responded (in one case with a 2000 word essay!) and for which there was absolutely no discussion. And most of them are rejected papers.

On Review Usefulness

Some of you may recall that we did a post-conference survey of how much you liked the reviews you got. We truly did this at a later point in time because we didn't get our acts together earlier, but now I'll claim we did it at a later point in time so that authors were no longer emotionally distraught by their reviews.

We presented each corresponding author with their old reviews, but hid the scores from them (I doubt any/many went back to look). They were then asked, for each review in turn, how informative it was and how helpful it was. They could choose 0=not, 1=sort of, or 2=very.

The first hypothesis one might have (certainly I had it!) is that people "liked" reviews that were nice to them and "disliked" reviews that were not. (I've heard this called the craigslist effect: if the transaction is successful, everyone's happy; else, everyone's unhappy.) This actually appears not really to be true. Here are the results:
rejects      1.24  +-0.57     1.25 +-0.72
accepts      1.38  +-0.48     1.27 +-0.70
(+- is one standard deviation)

Basically authors of accepted papers didn't particularly find their reviews much more informative (+0.14) or helpful (+0.02) than those of rejected papers.

[Note: the sample size is 110 rejects and 86 accepts, so there was a higher response rate on accepts since the acceptance rate is about 25%.]

You might suspect, however, that how much an author likes a review is correlated with that reviewer's score rather than the overall decision. We can look at that too (recall authors didn't see the scores when completing this survey):
0            3.04   +- 0.89        3.23  +- 0.87
1            3.22   +- 0.91        3.15  +- 0.93
2            3.21   +- 0.97        3.22  +- 0.96
Here, the first column (rating) is how the author scored the review. The second column is the average overall score that a reviewer gave an author who gave them a score of zero. For example, for reviews that were rated informative=2, the average score for those reviews was 3.21. Again, basically there's no effect.

Just to feel good about ourselves, authors do, in general, seem not-too-unhappy with their reviews.

Here's a histogram of informativeness for rejected papers:
0: ##############                                 (15%)
1: ############################################   (44%)
2: ########################################       (41%)
Here's the same histogram for accepted papers:
0: ########                                       ( 8%)
1: ############################################## (46%)
2: ############################################## (46%)
The only real difference is that rejected papers did have twice as large a frequency of "informative=0", but the rest was pretty close.

(And in case you're wondering, it's basically the same for "helpfulness." As pointed out by a friend, the inability to separate different aspects of the same object is typical when people are asked for intuitive judgements that don't involve reasoning. 70% of responses gave the same score for helpfulness and informativeness and another 29% gave scores within 1. Only 0.8% said inf=2 help=0 and 0.6% said inf=0 and help=2.)

Overall

I think of complaining about review(s|ers) as a staple hallway discussion, sort of an attractor state if you run out of other things to talk about. And we all remember the worst reviews we have had and often forget about the good ones that really do help our work get better. As Lillian Lee said at NAACL this year, it's usually our own fault when reviewers don't understand our papers. And keep in mind that all these numbers have huge randomness associated with them, for instance related to the NIPS experiment.

Despite this, it actually seems that we aren't too dissatisfied with our reviews. Only 11% of reviews were considered uninformative and, while I think we should strive to get this number down, I don't think it's that horrible. Especially when I consider the fact that 19% of papers have an average score of 2.0 or less, which basically means they were submitted not standing a chance (for a wide variety of reasons).

In general I don't think author response has an effect on the outcome that makes it worth the time and energy it takes. It might make authors feel better, but this is temporary when their paper almost certainly enjoys the same fate after response as before. I do think discussion is useful, and I'd rather see us use more time for discussion but cutting out response. Perhaps by reducing the number of papers any given area chair has to deal with. We should realize, though, that discussion almost always serves to lessen the likelihood of acceptance, probably because it's easier to argue against a paper than for, and reviewers don't really have any incentive to stand up for a paper they like, leading to a veto-effect.

## 15 November 2014

### The myth of a strong baseline

I can probably count on my fingers the number of papers I've submitted for which a reviewer hasn't complained about a baseline in some way. I don't mean to imply that all of those complaints are invalid: many of them were 100% right on in ways that either I was lazy about or ways that I hadn't seen a priori.

In fact, I remember back in 2005 I visited MIT and gave a talk on what eventually became the BayeSum paper (incidentally, probably one of my favorite papers I've written, though according to friends not exactly the best written... drat). I was comparing in the talk against some baselines, but Regina Barzilay very rightfully asked me: do any of these baselines have access to the same information that my proposed approach does? At the time I gave this talk the answer was no. In the actual paper the answer is yes. I think Regina was really on to something here, and this one question asked in my talk has had a profound impact on how I think about evaluation since then. For that, I take a small time-out and say: Thank you, Regina.

Like all such influential comments, my interpretation of Regina's question has changed over time, and this post is essentially about my current thinking on this issue, and how it relates to this "does not compare against a strong enough baseline" reviewer critique that is basically a way to kill any paper.

If we're going to ask the question about whether an evaluation strategy is "good" or "bad" we have to ask ourselves why are we doing this evaluation thing in the first place. My answer always goes back to my prima facie question when I read/review papers: what did I learn from this paper? IMO, the goal of an evaluation should be to help isolate what I learned.

Let's go back to the BayeSum paper. There are two things I could have been trying to demonstrate in this paper. (A) I could have been trying to show that some new external source of knowledge is useful; (B) I could have been trying to show that some new technique is useful. In the case of BayeSum, the answer was more like (B), which is why Regina's comment was spot on. I was trying to claim the approach was good, but I hadn't disentangled the new approach from the data, and so you could have easily believed that the improvement over the baseline was due to the added source of data rather than the new technique.

In many cases it's not that cut and dry because a (B)-style new technique might enable the use of (A)-style new data in a way that wasn't possible before. In that case, the thing that an author might be trying to convince me of is that the combination is good. That's fine, but I still think it's worth disentangling these two sources of information as much as possible. This is often tough because in the current NLP atmosphere in which we're obsessed with shiny new techniques, it's not appealing to show that the new data gets you 90% of the gain and the new technique is only 10% on top of that. But this is another issue.

So evaluations are to help us learn something. Let's return now to the question of you didn't compare against a strong enough baseline. Aside from parroting what's been said many times in the past, what is the point of such a complaint, beyond Regina's challenge, which I hope I've made clear I agree with. The issue, as best I can understand it, is that it is based on the following logic:

• Assumption: if your approach improves things against a strong baseline, then it will also improve against a weaker baseline, perhaps by more.
I'll note in passing that this is basically an assumption of submodularity of ideas.

And, like the title of this blog post suggest, I would like to put forth the idea that this assumption is often ridiculous, or at least that there's plentiful evidence against it.

I'm going to pick on machine translation now just to give a concrete example, but I don't think this phenomenon is limited to MT in any way. The basic story is I start with some MT system like Moses or cdec or whatever. I add some features to it and performance goes up. The claim is that if my baseline MT system wasn't already sufficiently strong, then any improvement I see from my proposed technique could just be solving a problem that's already been solved if I had just tuned Moses better.

There's truth to this claim, but there's also untruth. A famous recent example is the Devlin et al. neural network MT paper. Let me be clear: I think this paper is great and I 100% believe the results that they presented. I'm not attacking this paper in any way; I'm choosing it simply as a representative example. One of the results they show is some insane 8 bleu point gain over a very strong baseline. And I completely believe that this was a very strong baseline. And that the 8 bleu point improvement was real. And that everything is great.

Okay, so any paper that leads to an 8 bleu point gain over a very strong baseline is going to get reimplemented by lots of people, and this has happened. Has anyone else gotten an 8 bleu point gain? Not that I've heard. I've heard numbers along the lines of 1 to 2 bleu points, but it's very plausible I haven't heard the whole story.

So what's going on here?

The answer is simply that the assumption I wrote above is false. We've assumed that since they got 8 points on a strong baseline, I'll get at least as much on my baseline (which is likely weaker than theirs).

One problem is that "strong" isn't a total order. Different systems might get similar bleu scores, but this doesn't mean that they get them in the same way. Something like the neural networks stuff clearly solved a major problem in the BBN strong baseline system, but this major problem clearly wasn't as major of a problem in some other strong baseline systems.

Does this make the results in the Devlin paper any less impressive or important? No, of course not. I learned a lot from that paper. But one thing I didn't learn is "if you apply this approach to any system that's weaker than our strong baseline, you will get 8 bleu points." That's just not a claim that their results substantiate, and the only reason people seem to believe that this should be true is because of the faulty assumption above.

So does this mean that comparing to strong baselines is unimportant and everyone should go back to comparing their MT system against a word-for-word model 1 baseline?

Of course not. There are lots of ways to be better than such a baseline, and so "beating" it does not teach me anything. I always tell students not to get too pleased when they get state of the art performance on some standard task: someone else will beat them next year. If the only thing that I learn from their papers is that they win on task X, then next year there's nothing to learn from that paper. The paper has to teach me something else to have any sort of lasting effect: what is the generalizable knowledge.

The point is that an evaluation is not an end in itself. An evaluation is there to teach you something, or to substantiate something that I want to teach you. If I want to show you that X is important, then I should show you an experiment that isolates X to the best of my ability and demonstrates an improvement, preferably also with an error analysis that shows that what I claim my widget is doing is actually what it's doing.