## 24 August 2016

### Debugging machine learning

I've been thinking, mostly in the context of teaching, about how to specifically teach debugging of machine learning. Personally I find it very helpful to break things down in terms of the usual error terms: Bayes error (how much error is there in the best possible classifier), approximation error (how much do you pay for restricting to some hypothesis class), estimation error (how much do you pay because you only have finite samples), optimization error (how much do you pay because you didn't find a global optimum to your optimization problem). I've generally found that trying to isolate errors to one of these pieces, and then debugging that piece in particular (eg., pick a better optimizer versus pick a better hypothesis class) has been useful.

For instance, my general debugging strategy involves steps like the following:

1. First, ensure that your optimizer isn't the problem. You can do this by adding "cheating" features -- a feature that correlates perfectly with the label. Make sure you can successfully overfit the training data. If not, this is probably either an optimizer problem or a too-small-sample problem.
2. Remove all the features except the cheating feature and make sure you can overfit then. Assuming that works, add feature back in incrementally (usually at an exponential rate). If at some point, things stop working, then probably you have too many features or too little data.
3. Remove the cheating features and make your hypothesis class much bigger; e.g., by adding lots of quadratic features. Make sure you can overfit. If you can't overfit, maybe you need a better hypothesis class.
4. Cut the amount of training data in half. We usually see test accuracy asymptote as the training data size increases, so if cutting the training data in half has a huge effect, you're not yet asymptoted and you might do better to get some more data.
The problem is that this normal breakdown of error terms comes from theory land, and, well, sometimes theory misses out on some stuff because of a particular abstraction that has been taken. Typically this abstraction has to do with the fact that the overall goal has already been broken down into an iid/PAC style learning problem, and so you end up unable to see some types of error because the abstraction hides them.

In an effort to try to understand this better, I tried to make a flow chart of sorts that encompasses all the various types of error I could think of that can sneak into a machine learning system. This is shown below:
I've tried to give some reasonable names to the steps (the left part of the box) and then give a grounded example in the context of ad placement (because it's easy to think about). I'll walk through the steps (1-11) and try to say something about what sort of error can arise at that step.
1. In the first step, we take our real world goal of increasing revenue for our company and decide to solve it by improving our ad displays. This immediately upper bounds how much increased revenue we can hope for because, well, maybe ads are the wrong thing to target. Maybe I would do better by building a better product. This is sort of a "business" decision, but it's perhaps the most important question you can ask: am I even going after the right things?
2. Once you have a real world mechanism (better ad placement) you need to turn it into a learning problem (or not). In this case, we've decided that the way we're going to do this is by trying to predict clickthrough, and then use those predictions to place better ads. Is clickthrough a good thing to use to predict increased revenue? This itself is an active research area. But once you decide that you're going to predict clickthrough, you suffer some loss because of a mismatch between that prediction task and the goal of better ad placement.
3. Now you have to collect some data. You might do this by logging interactions with a currently deployed system. This introduces all sorts of biases because the data you're collecting is not from the final system you want to deploy (the one you're building now), and you will pay for this in terms of distribution drift.
4. You cannot possibly log everything that the current system is doing, so you have to only log a subset of things. Perhaps you log queries, ads, and clicks. This now hides any information that you didn't log, for instance time of day or day of week might be relevant, user information might be relevant, etc. Again, this upper bounds your best possible revenue.
5. You then usually pick a data representation, for instance quadratic terms between a bag of words on the query side and a bag of words on the ad side, paired with a +/- on whether the user clicked or not. We're now getting into the position where we can start using theory words, but this is basically limited the best possible Bayes error. If you included more information, or represented it better, you might be able to get a lower Bayes error.
6. You also have to choose a hypothesis class. I might choose decision trees. This is where my approximation error comes from.
7. We have to pick some training data. The real world is basically never i.i.d., so any data we select is going to have some bias. It might not be identically distributed with the test data (because things change month to month, for instance). It might not be independent (because things don't change much second to second). You will pay for this.
8. You now train your model on this data, probably tuning hyperparameters too. This is your usual estimation error.
9. We now pick some test data on which to measure performance. Of course, this test data is only going to be representative of how well your system will do in the future if this data is so representative. In practice, it won't be, typically at least because of concept drift over time.
10. After we make predictions on this test data, we have to choose some method for evaluating success. We might use accuracy, f measure, area under the ROC curve, etc. The degree to which these measures correlate with what we really care about (ad revenue) is going to affect how well we're able to capture the overall task. If the measure anti-correlates, for instance, we'll head downhill rather than uphill.
(Minor note: although I put these in a specific order, that's not a prescriptive order, and many can be swapped. Also, of course there are lots of cycles and dependencies here as one continues to improve systems.)

Some of these things are active research areas. Things like sample selection bias/domain adaptation/covariate shift have to do with mismatch of train/test data. For instance, if I can overfit train but generalization is horrible, I'll often randomly shuffle train/test into a new split and see if generalization is better. If it is, there's probably an adaptation problem.

When people develop new evaluation metrics (like Bleu for machine translation), they try to look at things like #10 (correlation with some goal, perhaps not exactly the end goal). And standard theory and debugging (per above) covers some of this too.

I'm very curious if y'all have topics/tricks that you like that aren't mentioned here.

## 16 August 2016

### Feature (or architecture) ablation

I wrote my first (and only) coreference paper back in 2005. At the time, my goals were to (a) do well on coref, (b) integrate background knowledge (like "Bush" is "president") using simple techniques, and (c) try to figure out how important different (types of) features were for making coreference decisions.

For the last, there is a reasonably extensive feature-type ablation experiment using backward selection (which I trust far more than forward selection). After writing the paper, I had many internal dialogues about why experiments like that are interesting. I have had, over the years, a couple of answers:

1. The obvious answer is "it tells us something interesting about language." It would be nice if this were true, but I'm not totally sure it is, and it's definitely not true if one doesn't put a bunch more effort into it than I put into that 2005 paper. What can we say? Yeah, spelling is important. Knowledge is important. Syntax is hard to actualize. I don't know that we didn't already know these things before.
2. Engineering. Suppose someone wanted to build a similar system. They want to put their effort where it's most valuable, and so feature ablation experiments tell you where you're likely to get the most bang for the buck. In a sense, you can see these as a type of negative result. Which features actually aren't that important. In the 2005 paper, you could remove syntactic, semantic, and class-based features with zero performance degradation; and also get rid of pattern-based features with minor performance degradation. This saves a lot of effort because some of these are actually quite a pain to implement and/or are slow and/or require lots of external resources.
Today, I mostly lean toward the engineering answer, or at least that's what I want to use as a jumping off point here.

Now that we're partially allergic to feature engineering and prefer to replace it with architecture engineering, I think the charge is stronger, not weaker, to do ablation experiments. Does that thing really need to be a biLSTM? Would an RNN suffice? What about just averaged bag of word embeddings? Do you need two layers of attention there or would one suffice? Do you need attention at all? Does that layer really need to be that wide?

These are all easy questions to ablate and answer.

There's never going to be a crisp answer like "yes, if I cut my hidden state from 493 units to 492 units performance goes down the drain." Many things will be gradual, but not all.

Why do I think this is important? Precisely for reason #2 above, but about a bajillion times more so. Training these really complicated models with wide hidden units, bidirectional stuff, etc., is really slow. Really really slow. If you tell me I can be within 1% accuracy but can train 100 times faster, I'm going to do it. Sure, for a final test run I might crank everything up again (and then report that!) but for development, it's super useful to have a system you can train and evaluate efficiently.

Does this tell us anything interesting about language? Almost certainly not (or at least not without a huge amount of extra work). But it does make everyone's life better.

## 14 August 2016

### Some papers I liked at ACL 2016

A conference just ended, so it's that time of year! Here are some papers I liked with the usual caveats about recall.

Before I go to the list, let me say that I really really enjoyed ACL this year. I was completely on the fence about going, and basically decided to go only because of giving a talk at Repl4NLP, and wanted to attend the business meeting for the discussion of diversity in the ACL community, led by Joakim Nivre with an amazing report that he, Lyn Walker, Yejin Choi and Min-Yen Kan put together. (Likely I'll post more, separately, about both of these; for the latter, I tried to transcribe much of Joakim's presentation.)

All in all, I'm supremely glad I decided to go: it was probably my favorite conference in recent memory. This was not just because there were lots of great papers (there were!) but also because somehow it felt more like a large community conference than others I've attended recently. I'm not sure what made it like this, but I noticed it felt a lot less clique-y than NAACL, a lot more broad and interesting than ICML/NIPS (though that's probably because of my personal taste in research) and in general a lot friendlier. I don't know what the organizers did that managed this great combination, but it was great!

Okay, so on to the papers, sorted by ACL id.

P16-1009: Rico Sennrich; Barry Haddow; Alexandra Birch
Improving Neural Machine Translation Models with Monolingual Data

I like this paper because it has a nice solution to a problem I spent a year thinking about on-and-off and never came up with. The problem is: suppose that you're training a discriminative MT system (they're doing neural; that's essentially irrelevant). You usually have far more monolingual data than parallel data, which typically gets thrown away in neural systems because we have no idea how to incorporate it (other than as a feature, but that's blech). What they do here is, assuming you have translation systems in both directions, back translate your monolingual target-side data, and then use that faux-parallel-data to train your MT system on. Obvious question is: how much of the improvement in performance is due to language modeling versus due to some weird kind of reverse-self-training, but regardless the answer, this is a really cool (if somewhat computationally expensive) answer to a question that's been around for at least five years. Oh and it also works really well.

P16-1018: E.Dario Gutierrez; Ekaterina Shutova; Tyler Marghetis; Benjamin Bergen
Literal and Metaphorical Senses in Compositional Distributional Semantic Models
I didn't see this paper presented, but it was suggested to me at Monday's poster session. Suppose we're trying to learn representations of adjective/noun pairs, by modeling nouns as vectors and adjectives as matrices, evaluating on unseen pairs only. (Personally I don't love this style, but that's incidental to the main ideas in this paper.) This paper adjusts the adjective matrices depending on whether they're being used literally ("sweet candy") or metaphorically ("sweet dreams"). But then you can go further and posit that there's another matrix that can transform literal metaphors into metaphorical metaphors automatically, essentially implementing the Lakoff-style notion that there is great consistency in how metaphors are created.

P16-1030 [dataset]: Hannah Rashkin; Sameer Singh; Yejin Choi
Connotation Frames: A Data-Driven Investigation
This paper should win some sort of award for thoroughness. The idea is that in many frames ("The walrus pummelled the sea squirt") there is implied connotation/polarity/etc. on not only the agent (walrus) and theme (sea squirt) of the frame but also tells us something about the relationship between the writer/speaker and the agent/theme (the writer might be closer to the sea squirt in this example, versus s/pummelled/fought/). The connotation frame for pummelled collects all this information. This paper also describes an approach to prediction of these complex frames using nice structured models. Totally want to try this stuff on our old plotunits data, where we had a hard time getting even a much simpler type of representation (patient polarity verbs) to work!

P16-1152: Artem Sokolov; Julia Kreutzer; Christopher Lo; Stefan Riezler
Learning Structured Predictors from Bandit Feedback for Interactive NLP
This was perhaps my favorite paper of the conference because it's trying to do something new and hard and takes a nice approach. At a high level, suppose you're Facebook and you're trying to improve your translation system so you ask users to give 1 star to 5 star ratings. How can you use this to do better translation? This is basically the (structured) contextual bandit feedback learning problem. This paper approaches this from a dueling bandits perspective where I show you two translations and ask which is better. (Some of the authors had an earlier MT-Summit paper on the non-dueling problem which I imagine many people didn't see, but you should read it anyway.) The technical approach is basically probabilitic latent-variable models, optimized with gradient descent, with promising results. (I also like this because I've been thinking about similar structured bandit problems recently too :P.)

P16-1231: Daniel Andor; Chris Alberti; David Weiss; Aliaksei Severyn; Alessandro Presta; Kuzman Ganchev; Slav Petrov; Michael Collins
Globally Normalized Transition-Based Neural Networks
[EDIT 14 Aug 2:40p: I misunderstood from the talk and therefore the following is basically inaccurate. I'm leaving this description and paper here on the list because Yoav's comment will make no sense otherwise, but please understand that it's wrong and, I hate to say this, it does make the paper less exciting to me. The part that's wrong is struck-out below.] There's a theme in the past two years of basically repeating all the structured prediction stuff we did ten years ago on our new neural network technology. This paper is about using Collins & Roark-style incremental perceptron for transition-based dependency parsing on top of neural networks. The idea is that label-bias is perhaps still a problem for neural network dependency parsers, like their linear grandparents. Why do I like this? Because I think a lot of neural nets people would argue that this shouldn't be necessary: the network can do arbitrarily far lookahead into the future and therefore should be able to learn to avoid the label-bias problem. This paper shows that current techniques don't achieve that: there's a consistent win to be had by doing global normalization.

P16-2013: Marina Fomicheva; Lucia Specia
Reference Bias in Monolingual Machine Translation Evaluation
This paper shows pretty definitively that human evaluations against a reference translation are super biased toward the particular reference used (probably because evaluators are lazy and are basically doing ngram matching anyway -- a story I heard from MSR friends a while back). The paper also shows that this gets worse over time, presumably as evaluators get tireder.
P16-2096: Dirk Hovy; Shannon L. Spruit
The Social Impact of Natural Language Processing
This is a nice paper summarizing four issues that come up in ethics that also come up in NLP. I mostly liked this paper because it gave names to things I've thought about off and on, but didn't have a name for. In particular, they consider exclusion (hey my ASR system doesn't work on people with an accent, I guess they don't get a voice), overgeneralization (to what degree are our models effectively stereotyping more than they should), over- and under-exposure (hey lets all work on parsing because that's what everyone else is working on, which then makes parsing seem more important...just to pick a random example :P), and dual-use (I made something for good but XYZ organization used it for evil!). This is a position/discussion-starting paper, and I thought quite engaging.

Feel free to post papers you like in the comments!

## 05 August 2016

### Fast & easy baseline text categorization with vw

About a month ago, the paper Bag of Tricks for Efficient Text Categorization was posted to arxiv. I found it thanks to Yoav Goldberg's rather incisive tweet:

Yoav is basically referring to the fact that the paper is all about (a) hashing features and (b) bigrams and (c) a projection that doesn't totally make sense to me, which (a) vw does by default (b) requires "--ngrams 2" and (c) I don't totally understand I don't think is necessary. (See this tutorial for more on how to do NLP in VW.)

At the time, I said if they gave me the data, I'd run vw on it and report results. They were nice enough to share the data but I never got around to running it. The code for their technique ("fastText") was just released, which goaded me into finally doing something.

So my goal here was to try to tell, without tuning any parameters, how competitive a baseline vw is to the results from fastText with minimal effort.

Here are the results:

 fastText vw Dataset ng time acc time acc ag news 1 91.5 2s 91.9 ag news 2 3s 92.5 5s 92.3 amazon full 1 55.8 47s 53.6 amazon full 2 33s 60.2 69s 56.6 amazon polarity 1 91.2 46s 91.3 amazon polarity 2 52s 94.6 68s 94.2 dbpedia 1 98.1 8s 98.4 dbpedia 2 8s 98.6 17s 98.7 sogou news 1 93.9 25s 93.6 sogou news 2 36s 96.8 30s 96.9 yahoo answers 1 72.0 30s 70.6 yahoo answers 2 27s 72.3 48s 71.0 yelp full 1 60.4 16s 56.9 yelp full 2 18s 63.9 37s 60.0 yelp polarity 1 93.8 10s 93.6 yelp polarity 2 15s 95.7 20s 95.5

(Average accuracy for fastText is 83.2; for vw is 82.2.)

In terms of accuracy, the two are roughly on par. vw occasionally wins; when it does, it's usually by 0.1% to 0.5%. fastText wins a bit more often, and on one dataset it wins significantly (yelp full: winning by 3%-4%) and on one a bit less (yahoo answers, up by about 1.3%). But the numbers are pretty much in line, and could almost certainly be brought up for vw with a wee bit of hyperparameter tuning (namely the learning rate, which is tuned in fastText).

In terms of training time, fastText is maybe 30% faster on average, though these are such small datasets (eg 500k examples) that a difference of 52s versus 68s is not too significant. I also noticed that for most of the datasets, simply writing the model to disk for vw took a nontrivial amount of time. But wait, there's more. That 30% faster for fastText was run on 20 cores in parallel whereas the vw run did not use parallelized learning (vw runs two threads, one for I/O and one for learning).
That said, a major caveat on comparing the training times. They're run on different machines. I don't know what type of machine the fastText results were achieved on, but it was a parallel 20-core run. The vw experiments were run on a single core, one pass over the data, on a 3.1Ghz Core i5-2400. Yes, I could have hogwild-ed vw and gotten it faster but it really didn't seem worth it for datasets this small. And yes, I could've rerun fastText on my machine, but... what can I say? I'm lazy.

What did I do to get these vw numbers? Here's the entire training script:
% cat run.sh
#!/bin/bash
d=$1 for ngram in 1 2 ; do cat$d/train.csv | ./csv2vw.pl | \
time vowpal_wabbit/vowpalwabbit/vw --oaa cat $d/classes.txt | wc -l \ -b25 --ngram$ngram -f $d/model.$ngram
cat $d/test.csv | ./csv2vw.pl | \ time vowpal_wabbit/vowpalwabbit/vw -t -i$d/model.$ngram done  Basically the only flags to vw are (1) telling it to do multiclass classification with one-against-all, (2) telling it to use 25 bits (not tuned), and telling it to either use unigrams or bigrams. [Comparison note: this means vw is using 33m hash bins; fastText used 10m for unigram models and 100m for bigram models.] The only(*) data munging that occurs is in csv2vw.pl, which is a lightweight script for converting the data, lowercasing, and doing very minor tokenization: % cat csv2vw.pl #!/usr/bin/perl -w use strict; while (<>) { chomp; if (/^"*([0-9]+)"*,"(.+)"*$/) {
print $1 . ' | ';$_ = lc($2); s/","/ /g; s/""/"/g; s/([^a-z0-9 -\\]+)/$1 /g;
s/:/C/g;
s/\|/P/g;
print $_ . "\n"; } else { die "malformed line '$_'";
}
}

There are two exceptions where I did slightly more data munging. The datasets released for dbpedia and Soguo were not properly shuffled, which makes online learning hard. I preprocessed the training data by randomly shuffling it. This took 2.4s for dbpedia and 12s for Soguo.

[[[EDIT 2:20p 5 Aug 2016: Out of curiosity, I upped the number of bits that vw uses for the experiments to 27 (so that it's on par with the 100m used by fastText). This makes it take about 5 seconds longer to run (writing the model to disk is slower). Performance stays the same on: ag news, amazon polarity, dbpedia, sogou, and yelp polarity; and it goes up from from 53.6/56.6 to 55.0/58.8 on amazon full, from 70.6/71.0 to 71.1/71.6 on yahoo answers, from 56.9/60.0 to 58.5/61.6 on yelp full. This puts the vw average with more bits at 82.6, which is 0.6% behind the fastText average.]]]

Long story short... am I switching from vw to fastText? Probably not any time soon.

## 29 July 2016

### A quick comment on structured input vs structured output learning

When I think of structured input models, I typically think of things like kernels over discrete input spaces. For instance, the famous all-substrings kernel for which K(d1,d2) effectively counts the number of common substrings in two documents, without spending exponential time enumerating them all. Of course there are many more ways of thinking about structured inputs: tree-to-string machine translation has a tree structured input. RNNs (on the input side) are essentially structured input models for sequence structures.

When I think of structured output models, I typically think of things like CRFs, structured SVMs/M3Ns, multilabel predictors (those are borderline), various transition-based methods (eg., shift/reduce parsers), etc. Here, my internal model for the structure is essentially at prediction time: find a high scoring structure from this complicated discrete output space.

Perhaps this has been obvious to everyone-but-me for a decade, but I only recently came to the recognition that these are essentially the same, at least if you restrict the sort of models you're willing to consider. (In particular, if you ignore things like imitation learning/learning to search for a moment.)

In a pure structured input setting, you have some simple label space Y (let's assume it's the real numbers) and some complex input space X. Typically you want to learn a function f : X ➝ Y, which has low loss. In particular you want to minimize the expectation of loss(y, f(x)) over random draws of x,y. And the "interesting" thing is that x isn't just a vector, so you have to be clever.

In the pure structure output setting, in, for instance, the structured SVM/CRF setup, you have some input space X (which may or may not be structured) and some complex output space Y. As before, you want to learn a function f : X ➝ Y, which has low loss. However, in the most common setups, the way you accomplish this is that instead of directly learning f, you instead learn a scoring function s that scores x,y pairs based on how "good" that y is for the corresponding x. For a fixed scoring function s, you derive f according to the argmax rule: fs(x) := argmaxy s(x,y). In this way, you have effectively separated the learning problem (get a good s) from the structured problem (solve the argmax). [Whether this is good or not is up to debate; I'm personally on the "nay" side.] You then want to minimize something like the expectation of loss(y, argmaxy' s(x,y')) over random draws x,y.

The observation is that these two problems are essentially the same thing. That is, if you know how to do the structured input problem, then the structured output problem is essentially the same thing, as far as the learning problem goes. That is, if you can put structure in f(x) for structured input, you can just as well put structure in s(x,y) for structured output. Or, by example, if you can predict the fluency of an English sentence x as a structured input problem, you can predict the translation quality of a French/English sentence pair x,y in a structured output problem. This doesn't solve the argmax problem -- you have to do that separately -- but the underlying learning problem is essentially identical.

You see similar ideas being reborn these days with papers like David Belanger's ICML paper this year on energy networks. With this framework of think-of-structured-input-and-structured-output-as-the-same, basically what they're doing is building a structured score function that uses both the input and output simultaneously, and throwing these through a deep network. (Ok it's a bit more than that, but that's the cartoon.)

At any rate, maybe obvious to everyone but me, but I thought I'd write it down anyway :).

## 26 July 2016

### Decoding (neural?) representations

I remember back in grad school days some subset of the field was thinking about the following question. I train an unsupervised HMM on some language data to get something-like-part-of-speech tags out. And naturally the question arises: these tags that come out... what are they actually encoding?

At the time, there were essentially three ways of approaching this question that I knew about:

1. Do a head-to-head comparison, in which you build an offline matching between induced tags and "true" tags, and then evaluate the accuracy of that matching. This was the standard evaluation strategy for unsupervised POS tagging, but is really just trying to get at the question of: how correlated are the induced tags with what we hope comes out.
2. Take a system that expects true POS tags and give it induced POS tags instead (at both training and test time). See how much it suffers (if at all). Joshua Goodman told me a few times (though I can't find his paper on this) that word clusters were just as good as POS tags if your task was NER.
3. Do something like #2, but also give the system both POS tags and induced tags, and see if the POS tags give you anything above and beyond the induced tags.
Now, ten years later since we're in "everything old is new again" phase, we're going through the same exercises, but with word embeddings instead of induced tags. This makes things slightly more complicated because it means that we have to have mechanisms that deal with continuous representations rather than discrete representations, but you basically see the same ideas floating around.

In fact, of the above approaches, the only one that requires any modification is #1 because there's not an obvious way to do the matching. The alternative is to let a classifier do the matching, rather than an offline process. In particular, you take your embeddings, and then try to train a classifier that predicts POS tags from the embeddings directly. (Note: I claim this generalizes #1 because if you did this with discrete tags, the classifier would simply learn to do the matching that we used to compute "by hand" offline.) If your classifier can do a good job, then you're happy.

This approach naturally has flaws (all do), but I think it's worth thinking about this seriously. To do so, we have to take a step back and ask ourselves: what are we trying to do? Typically, it seems we want to make an argument that a system that was not (obviously) designed to encode some phenomenon (like POS tags) and was not trained (specifically) to predict that phenomenon has nonetheless managed to infer that structure. (We then typically go on to say something like "who needs no POS tags" even though we just demonstrated our belief that they're meaningful by evaluating them... but okay.)

As a first observation, there is an entire field of study dedicated to answering questions like this: (psycho)linguists. Admittedly they only answer questions like this in humans and not in machines, but if you've ever posed to yourself the question "do humans encode/represented phrase structures in their brains" and don't know the answer (or if you've never thought about this question!) then you should go talk to some linguists. More classical linguists would answer these questions with tests like, for instance, constituency tests or scoping tests. I like Colin Phillips' encyclopedia article on syntax for a gentle introduction (and is what I start with for syntax in intro NLP).

So, as a starting point for "has my system learned X" we might ask our linguist friends how they determine if a human has learned X. Some techniques are difficult to replicate in machine (e.g., eye movement experiments, though of course models that have something akin to alignment---or "attention" if you must---could be thought of as having something like eye movements, though I would be hesitant to take this analogy too far). But many are not, for instance behavioral experiments, analyzing errors, and, I hesitate somewhat to say it, grammaticality judgements.

My second comment has to do with the notion of "can these encodings be used to predict POS tags." Suppose the answer is "yes." What does that mean? Suppose the answer is "no."

In order to interpret the answer to these questions, we have to get a bit more formal. We're going to train a classifier to do something like "predict POS given embedding." Okay, so what hypothesis space does that classifier have access to? Perhaps you say it gets a linear hypothesis space, in which case I ask: if it fails, why is that useful? It just means that POS cannot be decoded linearly from this encoding. Perhaps you make the hypothesis space outrageously complicated, in which case I ask: if it succeeds, what does that tell us?

The reason I ask these questions is because I think it's useful to think about two extreme cases.
1. We know that we can embed 200k words in about 300 dimensions with nearly orthogonal vectors. This means that for all intents and purposes, if we wanted, we could consider ourselves to be working with a one-hot word representation. We know that, to some degree, POS tags are predictable from words, especially if we allow for complex hypothesis spaces. But this is uninteresting because by any reasonable account, this representation has not encoded anything interesting: it's just the output classifier that's doing something interesting. That is to say: if your test can do well on the raw words as input, then it's dubious as a test.
2. We also know that some things are just unpredictable. Suppose I had a representation that perfectly encoded everything I could possibly want. But then in the "last layer" it got run through some encryption protocol. All of the information is still there, so the representation in some sense "contains" the POS tags, but no classifier is going to be able to extract it. That is to say, just because the encoded isn't on the "surface" doesn't mean it's not there. Now, one could reasonably argue something like "well if the information is there in an impossible-to-decode format then it might as well not be there" but this slope gets slippery very quickly.
Currently, I much prefer to think about essentially the equivalent of "behavioral" experiments. For instance, if you're machine translating and want to know if your system can handle scoping, then give it a minimal pair to translate that only differs in the scoping of some negation. Or if you're interesting in knowing whether it knows about POS tags, perhaps look at errors in its output and see if they fall along POS categories.

EDIT 26 Jul 2016, 8:24p Eastern: It's unclear to a few people so clarification. I'm mostly not talking about type-level word embeddings above, but embeddings in context. At a type-level, you could imagine evaluating (1) on out of vocabular terms, which would be totally reasonable. I'm think more something like: the state of your biLSTM in a neural MT system. The issue is that if, for instance, this biLSTM can repredict the input (as in an autoencoder), then it could be that the POS tagger is doing all the work. See this conversation thread with Yoav Goldberg for some discussion.

## 12 July 2016

### Some picks from NAACL 2016

Usual caveats: didn't see all talks, didn't read all papers, there's lot of good stuff at NAACL that isn't listed here! That said, here are some papers I particularly liked at NAACL, with some comments. Please add comments with papers you liked!

#### Automatic Summarization of Student Course Feedback by Luo, Liu, Liu and Litman.

Anyone who has taught has suffered the following dilemma. You ask students for feedback throughout the course, and you have to provide free text because if you could anticipate their problems, you'd have addressed them already. But now you have hundreds of responses that you can't quickly read through. This paper provides a dataset of such responses, together with summaries that you actually have time to read through. The approach is a reasonably standard ILP formulation, with some additional machinery to deal with the fact that the data is very sparse. The idea is to essentially induce a "bigram similarity" as part of the summarization problem. I like this paper because the problem is great (I think NLP should really push in the direction of helping learners learn and teachers teach!), the dataset is nice (if a bit small) and the approach makes sense. And they actually do a human evaluation, even for a short paper! Hooray!

#### A Latent Variable Recurrent Neural Network for Discourse Relation Language Models by Ji, Haffari and Eisenstein.

This paper combines the nice rich hypothesis classes you get with the "anonymous" latent variables in neural networks with the modeling power that one gets from the ability to marginalize "structured" latent variables from classic graphical models land. This is applied to document language modeling, in which the latent variables are discourse relations (in the PDTB sense). The model works well both for language modeling and for predicting implicit discourse relations and dialogue acts. (And, as one should expect from a paper with Eisenstein as an author, there are statistical significance tests!)

#### Structured Prediction with Output Embeddings for Semantic Image Annotation by Quattoni, Ramisa, Madhyastha, Simo-Serra and Moreno-Noguer.

If you want to label images with complex structured outputs, like play(dog,grass), you quickly run in to sparsity problems on the output. The proposal here is to decompose the outputs into embeddings (kind of like Vivek's work and Jags' work) and learning a bilinear model of inputs/outputs in that space. In general, I think there's a lot to be done in interesting modeling of structured output spaces, and this paper gives a new set of techniques in this space.

#### Deconstructing Complex Search Tasks: A Bayesian Nonparametric Approach for Extracting Sub-tasks by Mehrotra, Bhattacharya and Yilmaz.

The problem considered here is the following: if I want to go to ACL, I need to register, book a flight, book a hotel, find some attractions to go to while skipping sessions, look up all the good vegan restaurants in Berlin (hah!), etc. My overall task is going to ACL, but there are a number of highly related but different subtasks. The challenge is to infer these subtasks from search logs, so that you can provide better search support. The model is a Chinese Restaurant Process with word embeddings used to measuring similarities. And look, another short paper with a human evaluation!

#### A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories by Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli and Allen.

This paper introduces the task of story cloze: given a story prefix, predict which of two endings is the "correct" (natural) one. Eg: "Jim got his first credit card in college. He didn't have a job so he bought everything on his card. After he graduated he amounted a \$10,000 debt. Jim realized that he was foolish to spend so much money." Then you have to decide on the right ending: "Jim decided to device a plan for repayment." (correct) versus "Jim decided to open another credit card." (incorrect). The data set was quite well constructed (lots of checks), and a large collection of baseline models are run for comparison. The data is available. I like this paper for the task and the very very careful data set collection. Since this was establishing baselines, I would really liked to have seen error bars on the results so we can have more reasonable future comparisons. I'm really tempted to try to annotate some of these with plot units, but that was really really painful the first time around; but I feel like that's a good explanatory theory for a lot of the examples shown in the paper.

#### Learning Global Features for Coreference Resolution by Wiseman, Rush and Shieber.

The basic idea here is to take a "local" coreference model, that makes decisions about assigning mentions to entities, and augment it with a "global" model that models the entire cluster of mentions. In past work, cluster-level features have been hard to define (e.g., what are the right features to extract over sets of arbitrary size?). What's the solution? Essentially, embed the clusters. I like this paper because I wanted to try to do something like this and Wiseman did it better than I would have. I think there's also an interesting interpretation of this work in terms of things like neural Turing machines/memory networks, in which we now have technology for learning to update memory (where "memory" here is the "clusters"). My only gripe is that, like most papers in the new wave of coreference, the older ML approaches have been somewhat forgotten (except for Vincent Ng who is unforgettable); I'm thinking in particular about the work on ACE, like that of Luo, Ittycheriah, Jing, Kambhatla and Roukos on Bell-tree coreference, which I don't think gets enough notice these days for introducing a lot of the ideas that make up modern coref.

#### Visual Storytelling by Huang, Ferraro, Mostafazadeh, Misra, Agrawal, Devlin, Girshick, He, Kohli, Batra, Zitnick, Parikh, Vanderwende, Galley and Mitchell.

The main observation in this paper is that how you caption images in a sequence is different from how you caption them in isolation. For instance, if you have a temporal sequence of images from, say, a wedding, the captions for them should tell a story. This paper primarily introduces a dataset and baseline approaches for addressing the problem in this dataset. I like this paper, even if you remove the image stuff, because it emphasizes the fact that stories have a different structure than just sequences of sentences. The image stuff gives a grounding that's interesting beyond that. One problem pointed out in the paper is that automatic metrics aren't great here, which is problematic, but not particularly surprising.

## 05 July 2016

### Rating the quality of reviews, after the fact

Groan groan groan reviewers are horrible people. Not you and me. Those other reviewers over there!

tldr: In general we actually don't think our reviews are that bad, though of course it's easy to remember the bad ones. Author perception of review quality is colored by, but not determined by, the overall accept/reject decision and/or the overall score that review gave to the paper.

NIPS did an experiment a bunch of years ago (can't dig it up any more, now it's an urban legend) where they asked reviewers at, I think, the time of author feedback, to rate the reviews. The anecdotal story was that there was almost perfect correlation between "this is a good review" and "this review gave my paper a high score." Of course this is not super surprising, even if you get rid of "emotions," because presumably I like my paper and so any review that doesn't like it is flawed.

For NAACL 2013, we did a similar experiment, but we asked authors for their responses several months after the fact (actually, even after the conference had taken place), at which point hopefully emotions had cooled a bit and they could look back at their reviews with a sort of fond recollection. We presented the contact author of each paper with the original review text for each of their reviews, but did not show them the original scores. We asked them on a standard Likert scale how helpful this review was, and how informative this review was.

Because this was long after the fact, response rate was of course not 100%, and it was also biased toward authors of papers that were accepted. We got responses from 128 authors on a total of 138 papers (some papers had same contact authors), covering a total of about 397 reviews (roughly one per paper, but some short papers only had two, and some papers had four).

All the plots below are restricted to this set of 138 papers, not to the full set of about 500.

First, let's get a sense of the data. Here are the overall results for this entire set of 138 papers:

(Note that the numbers add up to 397, not 138, because this is counting per-review not per-paper.) The first row shows the accept/reject ratio. Since NAACL 2013 had an acceptance rate between 25% and 30%, obviously survey results are biased toward accepted papers, but we still have a healthy response rate from rejected papers.

Overall, the vast majority (~80%) of reviews were considered both informative and helpful (score 4 or 5) according to the authors. So yes, we need to do something about the 20% of reviews that got a 1, 2 or 3 on the Likert scale, but we're actually not doing that horribly. (Modulo sample selection bias.) The papers themselves were considered overwhelmingly appropriate and clear. The overall score distribution matches (roughly) the overall score distribution for the entire conference.

Let's look at what happens if we look only at accepted or rejected paper:

Comparing these, we definitely see a bit of the NIPS effect. For accepted papers, the reviews were considered overwhelmingly informative and helpful (scores of 4 or 5 in 85% or more cases). However, for rejected papers, the reviews were still considered largely informative and helpful (~73% of cases were 4s and 5s). Not surprisingly, accepted papers fare quite well on the individual score metrics, in particular overall score (duh!).

We can alternatively condition the analysis on the overall paper score rather than the final accept/reject decision. Here's how that looks:

That's not substantially different.

So what makes the difference between good (informativeness 4 or 5) and bad (informativeness 1 or 2) reviews?

On average, the "good" reviews were about 15% longer than the bad reviews (on average 320 characters versus 280 characters).

Somewhat surprisingly, a linear classifier on bag of words data and distinguish with 90% accuracy "good" from "bad" reviews, but the features it gives high weight to are basically features that look like positive versus negative reviews, rather essentially exploiting of the correlation between informativeness and acceptance, rather than informativeness on its own.

## 30 June 2016

Last semester from grad ML, I totally revamped stuff and started using the awesome book Understanding Machine Learning by Shai Ben-David and Shai Shalev-Shwartz (I'll try to review this in a later post). Naturally, one of the things we talked about was subgradients and subgradient descent.

I imagine that for many of you, if I asked you to define a subgradient of a convex function f at x informally (let's stick in one dimension for now), you would say something like "it's any line that that makes contact with f at x and everywhere else lies below f." This is the definition given in UML, and the definition we always see in pictures, like the one in ciml:

Okay, so this is all great, and one of the fun things you can do is derive algorithms like (stochastic) subgradient descent, which involve picking any subgradient and then using that as if it were a gradient in a standard gradient descent procedure. Yeah, you run into some speed limits for optimization rates, but it basically works.

So then on the midterm I asked a question that was intended to be a freebie: give me the subderivatives of the ramp loss function. Ramp loss is like hinge loss (shown above), but where the negative part gets clamped at 1. Formally, it's ramp(x) = min(1, hinge(x)), where hinge(x) = max(0, 1-x).

Turns out this wasn't a freebie. The problem, obviously in retrospect, is that ramp is not convex, and therefore doesn't have subderivatives! Or at least it doesn't have subderivatives for any x less than zero, and it's (only) subderivative for x≥0 is zero.

Several students point this out, and several students just went through the motions and computed what-we-might-normally-call subderivatives anyway. And by "what-we-might-normally-call" I of course mean what I had originally intended, and what I would have done myself if I had been a student in this course, and also what pretty much any auto-diff toolkit would do.

And yet, it's clearly wrong according to the definition of subgradients that we all know and use everyday in our very non-convex neural networks, when we do things like relu or hardtanh units.

It turns out (not surprisingly) that subdifferentials of non-convex functions has been studied (extensively) since the 1970s. Murdukhovich and Shao give a brief history in their paper on Banach spaces. Unfortunately, I don't actually understand most of that paper (or its citations).

I did manage to find one set of slides that I could understand, though, by Adil Bagirov for a talk on Subgradient methods in nonsmooth nonconvexoptimization! Basically the idea that Adil proposes is that we can use a gradient-ified version of a quasisecant and most/some subgradient-like methods still go through and make sense with this generalized notion.

Why am I posting this? Because it caused my brain to reconfigure itself when I was forced to think about this by the very smart students in my class! Am I going to teach quasisecants in the future? Probably not. But I am going to explicitly point out that the standard definition of subgradient doesn't work for nonconvex functions (or, more specifically, it works, but you get an empty set in a lot of cases) and that there are generalizations but that I don't think we've really figured this all out (as a community).

If anyone has other pointers that I can use in the future, I'd love to see them!

## 24 June 2016

### Language bias and black sheep

Tolga Bolukbasi and colleagues recently posted an article about bias in what is learned with word2vec, on the standard Google News crawl (h/t Jack Clark). Essentially what they found is that word embeddings reflect stereotypes regarding gender (for instance, "nurse" is closer to "she" than "he" and "hero" is the reverse) and race ("black male" is closest to "assaulted" and "white male" to "entitled"). This is not hugely surprising, and it's nice to see it confirmed. The authors additionally present a method for removing those stereotypes with no cost (as measured with analogy tasks) to accuracy of the embeddings. This also shows up on twitter embeddings related to hate speech.

There have been a handful of reactions to this work, some questioning the core motivation, essentially variants of "if there are biases in the data, they're there for a reason, and removing them is removing important information." The authors give a nice example in the paper (web search; two identical web pages about CS; one mentions "John" and the other "Mary"; query for "computer science" ranks the "John" one higher because of embeddings; appeal to a not-universally-held-belief that this is bad).

I'd like to take a step back and argue the the problem is much deeper than this. The problem is that even though we all know that strong Sapir-Whorf is false, we seem to want it to be true for computational stuff.

At a narrow level, the issue here is the question of what does a word "mean." I don't think anyone would argue that "nurse" means "female" or that "computer scientist" means "male." And yet, these word embeddings, which claim to be capturing meaning, are clearly capturing this non-meaning-effect. So then the argument becomes one of "well ok, nurse doesn't mean female, but it is correlated in the real world."

Which leads us to the "black sheep problem." We like to think that language is a reflection of underlying truth, and so if a word embedding (or whatever) is extracted from language, then it reflects some underlying truth about the world. The problem is that even in the simplest cases, this is super false.

The "black sheep problem" is that if you were to try to guess what color most sheep were by looking and language data, it would be very difficult for you to conclude that they weren't almost all black. In English, "black sheep" outnumbers "white sheep" about 25:1 (many "black sheep"s are movie references); in French it's 3:1; in German it's 12:1. Some languages get it right; in Korean it's 1:1.5 in favor of white sheep. This happens with other pairs, too; for example "white cloud" versus "red cloud." In English, red cloud wins 1.1:1 (there's a famous Sioux named "Red Cloud"); in Korean, white cloud wins 1.2:1, but four-leaf clover wins 2:1 over three-leaf clover. [Thanks to Karl Stratos and Kota Yamaguchi for helping with the multilingual examples.]

This is all to say that co-occurance frequencies of words definitely do not reflect co-occurance frequencies of things in the real world. And the fact that the correlation can both both ways means that just trying to model a "default" as something that doesn't appear won't work. (Also, computer vision doesn't really help: there are many many pictures of black sheep out there because of photographer bias.)

We observed a related phenomena when working on plot units. We were trying to extract "patient polarity verbs" (this idea has now been expanded and renamed "implicit sentiment": a much better name). The idea is that we want to know what polarity verbs inflict on their arguments. If I "feed" you, is this good or bad for you? For me? If I "punch" you, likewise. We focused on patients because action verbs are almost always good for the agent.

In order to accomplish this, we started with a seed list of "do-good-ers" and "wrong-do-ers." For instance, "the devil" was a wrong do-er, and so we can extract things that the devil does, and assume that these are (on average) bad for their patients. The problem was the "do-good-ers" don't do good, or at least they don't do good in the news. One of our do-good-ers was "firefighter". Firefighters are awesome. Even stereotyped, this is arguably a very positive social good, heroic profession. But in the news, what do firefighters do? Bad things. Is this because most firefighters do bad things in the world? Of course not. It's because news is especially poignant when stereotypically good people do bad things.

This comes up in translation too, especially when looking at looking at domain adaptation effects. For instance, our usual example for French to English translation is that in Hansards, "enceinte" transates as "room" but in EMEA (medical domain), it translates as "pregnant." What does this have to do with things like gender bias? In Canadian Hansards, "merde" translates mostly as "shit" and sometimes as "crap." In movie subtitles, it's very frequently "fuck." (I suspect translation direction is a confounder here.) This is essentially a form of intensification (or detensification, depending on direction). It is not hard to imagine similar intensifications happening between racial descriptions and racial slurs, or between gender descriptions and sexist slurs, depending on where the data came from.

## 14 May 2016

### A bad optimizer is not a good thing

A very popular style of research in NLP and ML is the math abstraction. You cast your learning problem as some sort of objective function that you want to optimize. Or, if you're feeling Bayesian, you write down a joint likelihood that you'll either sample from or, yes, turn into an objective function that you want to optimize. The optimizer is then typically considered a black box, aside from its hyperparameters which you often must tune.

This is a very attractive style of research and one that I've personally gotten a lot of leverage out of. (Though it's worth emphasizing for reviewer #2 out there that this is not the only want to solve problems, and that not having an explicit objective function that an optimizer is cranking on does not mean your problem formulation is bad.) The main advantage here is separation of concerns. I can worry about making a good objective function, and my optimization friends can worry about optimizing it. Of course it's never that simple and we must consider what can possibly be optimized when making our objective functions (see, for instance, the obsession with convexity 10 years ago and the anti-obsession with convexity today).

At the end of the day, like any abstraction, it's not perfect, and we need to break it occasionally.

There is, however, something that's struck me as peculiar going back to the hayday of topic models in 10 or so years ago, but which applies equally well now. This is a strange world in which we actually hope that our optimizer doesn't work (or s/optimizer/sampler/). To me, this seems like a bad thing to hope for. I think we need to acknowledge that for hard problems, or optimizers and samplers probably won't work, and need to take that into account. But I don't think we should actively hope that they don't work: that just seems broken.

There are two places where I see this hope arise.

1. Occasionally you see objective functions that have trivial solutions that you're hoping won't be found. I've seen this several times in the context of multimodal learning. Suppose that you want to learn some sort of joint embedding of English words and Japanese words. I can write down some embedding functions f(e) and g(j) using whatever neural net magic I want. The outputs are vectors. Now, I want to have them supervise each other, and so I make an objective function that looks like ||A f(e) - B g(j)||^2 for some matrices A and B. The joint objective is to learn A, B, and all the parameters inside f and g. This seems reasonable, until you realize that simply setting A=B=0 will provide not only a local, but a global minimum of this objective function. Of course since these things are non-convex, we often initialize randomly. And in most cases, this global minimum actually isn't found (otherwise such approaches would never work). But this is just a broken way to formulate a problem because I'm relying on the fact that my optimizer doesn't actually work.

2. Much more commonly, the hope of a broken optimizer comes from the heuristic of using "initialization" to build a better model. The general scheme here is that I have some prior knowledge that I want to inject into the system, and I do so by initializing my model's parameters informed by that knowledge. I then optimize. But again, here, we're hoping that the optimizer fails! If the optimizer were perfect, it should be insensitive to the initialization (modulo picking one of multiple global optima, but I don't think that's what's really going on here). You see the same thing in Bayesian land, wherein you basically hope that your sampler doesn't mix properly. Again, this is a case where we're relying on the fact that we don't think our optimizers will actually work.

I think it's fair to say that #1 is pretty broken; I think there are a number of possible responses to #2 that are worth discussing briefly.

Response 1: Early stopping. We rarely run optimizers (or samplers) to completion and instead stop them early based on some held out measure of generalization. For many optimizers, it's possible to show that early stopping it equivalent to regularization (I first saw this as a homework exercise in the old Bishop neural nets book, but it's been reinvented many times since then... and probably before then). You can alternatively pull out some online learning theory and argue that online updates are implicitly regularizing (toward zero). We therefore can conclude that initialization plus early stopping is essentially a form of regularization toward the initialization point. This is reasonable, but it would be nice to actually lay out the theory more solidly in a way that's not specific to a particular model, regularizer, optimizer triple. (Maybe this has been done but I haven't seen it.)

Response 2: Initialization isn't to help the model, it's to help the optimizer. This is a pretty solid response. I think that when we do initialization, this is probably how we should think about it. We think the model is perfect (or at least good) but that the optimizer is going to have a hard time, so we start it somewhere reasonable. I think this perspective is hard to reconcile with the "initialization as prior knowledge view" (it's possible, but requires some mental acrobatics), but it's a reasonable view.

So what should we be doing? Well, if someone wants to prove something along the lines of #1 that would be nice :). Or alternatively just point me to an existing result!

More generally, though, I think the model vs optimization abstraction is quite useful. If we want to incorporate prior knowledge into the model, then some sort of regularization (or prior) seems like a very reasonable approach. (Yes, there are others.) If we want to try to help out our optimizer, then initialization may be a reasonable approach, but it really depends on the optimizer. One can of course do both: the "regularize toward X" approach together with initialization at (or near) X. (Though in my experience often regularizing toward X means that initializing at X isn't necessary because it already introduces a good bias and symmetry breaking.)

## 26 March 2016

### A dagger by any other name: scheduled sampling

Scheduled Sampling was at NIPS last year; the reviews are also online. (Note: I did not review this paper.) This is actually the third time I've tried to make my way through this paper, and to force myself to not give up again, I'm writing my impressions here. Given that this paper is about two things I know a fair amount about (imitation learning and neural networks), I kept getting frustrated at how hard it was for me to actually understand what was going on and how it related to things we've known for a long time. So this post is trying to ease entry for anyone else in that position.

What is the problem this paper is trying to solve?

Neural networks are now often used to generate text. As such, they are often trained to maximize a product of probabilities: the probability of the 5th word in the output given all previous words. Unfortunately, the distribution on which they are trained is typically the gold standard distribution of previous words. This means that they never learn to recover from their own mistakes: at test time, if they make an error (which they will), they could end up in some distribution they've never seen at training time and behave arbitrarily badly. Matti Kaariainen proved over ten years ago that you can construct worst case examples on which, even with vanishingly small probability of error, the sequence predictor makes T/2 mistakes on a length T binary sequence prediction problem. We've also known for about as long how to fix this (for models that are more general than neural networks), but more on that later.

How do they solve this problem?

On the ith minibatch, for each output word, they flip an (independent) coin. If it comes up heads, they use the true ith word when training (the training data one); if it comes up tails, they use the predicted ith word. If the coin probability is 1, then this amounts to the standard (inconsistent) training paradigm. If the coin probability is 0, then this amounts to always using the predicted words.

How does this relate to existing stuff?

To put this in terminology that I understand better, they're playing with the rollin distribution, adjusting it between "Ref" (when the probability is one) to "Learn" (when the probability is zero) and, as is natural, "Mix" when the probability is somewhere in the middle.

I think (but please correct me if I'm wrong!) that Searn was the first algorithm that considered mixture policies for roll-in. (Previous algorithms, like incremental perceptron, Lasso, etc., used either Ref or Learn.) Of course no one should be actually using Searn anymore (algorithms like DAgger, AggreVaTe and LOLS, complete dominate it as far as I can tell).

For what it's worth, I don't think the "related work" section is the paper is really particularly accurate. Searn was never a reinforcement learning algorithm (where do you get the oracle in RL?), and the paper completely dismisses DAgger, referring to it only as "a related idea" and highlights the fact that the "scheduled sampling" approach is online. (Side note, the sense of "online" used in the scheduled sampling paper is the "stochastic" sense.) Of course, DAgger can be trained online, and in fact that's how we do it in the vw implementation... and it works incredibly well! You can also do a straight-up online analysis.

The only real difference to DAgger is the exact form of the schedule. DAgger uses a schedule of something like P(use truth) = 0.99i, where i is the round/minibatch/whatever. Scheduled sampling considers two other rates, one linear and one inverse sigmoid. They're shown below, where the red curve is the DAgger schedule (Figure 2 from the scheduled sampling paper):

What are the results like?

There are basically two novelties in this paper: (1) applying DAgger to training neural networks, and (2) trying different schedules.

I would expect results that analyse the importance of these two things, but I only see a partial analysis of (1). In particular, the results show that training neural networks with DAgger is better than training them with supervised learning. This is nice to know, and is a nice replication of the results from Ross, Gordon and Bagnell. It would be also nice to see the other side: that going neural here is beneficial, but I suspect that's just taken for granted.

On (2), nothing is said! For two of the three experiments (image captioning and constituency parsing), the paper says that they use inverse sigmoid decay. For the third, the paper uses the linear decay schedule. Presumably the others performed worse, but I would really have appreciated knowing whether there's any benefit to doing something other than the DAgger schedule. And if there's benefit, how big?

In particular, the DAgger analysis works for any schedule that has the sum of probabilities over rounds go to zero has the number of rounds goes to infinity. This does not happen for the linear schedule (the sum goes to some lower bound, epsilon). I'm pretty sure it happens for the sigmoid schedule but didn't bother to verify. It would have been nice if the paper had verified this. At any rate, if I'm going to switch from something that I understand theoretically (like DAgger with exponential decay) to something that's heuristic, I would like to know how much empirical benefit I'm getting in exchange for losing my understanding of what's going on.

What's missing

One major difference to DAgger is that this paper makes a weird assumption that the correct next thing to do is independent of what was done in the past. For examples, suppose the gold standard output is "The little bird flew into the mailbox." Suppose that for the first word, the schedule uses the true output. We now have an output of "The". For the second word, maybe the schedule uses the predicted output. Maybe this is "bird". For any reasonable evaluation criteria, probably the best completion of "The bird" with respect to this gold standard is "The bird flew into the mailbox." But the approach proposed here would insist on producing "The bird bird flew into the mailbox."

This is the question of computing the oracle completion, or the "minimum cost action" at any given point. Yoav Goldberg refers to this as a "dynamic oracle." We know how to do this for many loss functions, and I'm surprised that (a) this isn't an issue in the experiments here, and (b) that it's not mentioned.

Reviewing the Reviews

The reviewers seem to mostly have liked the paper, which makes sense, it is interesting. I'm really really surprised that none of the reviewers took the authors to task on the connections to DAgger, and instead concentrated on Searn. I suspect that the reviewers were neural networks experts and not imitation learning experts, and given that the paper talked only about Searn (which at this point is an easy strawman), the reviews tended to focus on that.
The biggest surprise to me in the reviews was that comment that the paper had "comprehensive empirical testing," also echoed by reviewer 3. (Reviewer 2 was basically a no-op.) This paper had impressive experiments on big hard problems, but the experiments were anything but comprehensive from the perspective of understanding what is going on.

Some last thoughts

Let me say that although what's written above is critical, I of course don't dislike this work. It's something that needed to be done, and it's great someone did it. I wish that it were just done better and that the paper made some attempt at situating itself in the literature.

For comparison, take a gander at the recent arxiv paper on training an LSTM using imitation learning/learning-to-search technology by Ballesteros, Goldberg, Dyer and Smith. Not only does this paper get really impressive empirical results, they are also remarkably straightforward about connections to previous work.

## 27 December 2015

### NIPS 2015 Retrospective

There have already been lots of really nice overviews of NIPS 2015. I don't have much to add, largely because I missed almost every talk and a large fraction of the poster sessions.

Before beginning, I'm really happy and somewhat humbled to have been able to participate in the quizbowl demo, which won the best demo award: congrats to Mohit Iyyer, He He and Jordan Boyd-Graber for doing all the hard work!

Here are some things I did see, and that I liked, or that other people pointed me to as something I would probably like but I haven't actually seen.

• Logarithmic Time Online Multiclass prediction by Anna Choromanska and John Langford. This paper has been on arxiv for a while, and has been built into vw for a while also, so I've known about this stuff for probably a year now, but I still think it's great. The task is to build, in an online way, a binary tree for doing multiclass to binary reductions, in a way that gives you nice regret guarantees. The tree changes over time. One interesting property is that a single class can me represented in multiple leaves which gives added representational capacity to common/difficult classes, something most approaches lack.
• Probabilistic Line Searches for Stochastic Optimization by Maren Mahsereci and Philipp Hennig. I didn't see this paper, but I heard from several people that they liked it. The idea is that conventional optimization (eg BFGS) gets a lot of mileage out of line search, something we don't conventionally do in stochastic optimization (like sgd), but perhaps we should. It's on my reading list.
• Grammar as a Foreign Language by Orio Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever and Geoffrey Hinton. This paper thinks of parsing as the output of a sequence-to-sequence model, basically outputting a tree as a sequence of symbols. This doesn't work if you just train it on labeled data. Instead, you have to train a state of the art parser, run it on a huge pile of unlabeled data, and then train the seq2seq model to predict that. I like this because this is what Percy, Dan and I tried to get to work in the structured compilation paper, but could never make it work (probably because we lacked a sufficiently rich hypothesis class). It's basically more evidence, IMO, that seq2seq models are really good memorizers: I would be interested to see an analysis of how much generalization is actually learned (I suspect quite little).
• The Human Kernel by Andrew Wilson, Christoph Dann, Chris Lucas and Eric Xing. I like this paper in the same way I like Jerry Zhu's "machine teaching" work. It's about the interplay between human learning and machine learning. If you haven't read this paper, it's a fun read.
• Calibrated Structured Prediction by Volodymyr Kuleshov and Percy Liang. The observation in this paper is the confidence estimation is important for structured prediction problems, but faces its own challenges. They approach it from a calibration perspective and adjust the notions of calibration to be appropriate for structured problems and marginal inference.
On the not-so-great side, as Kate Crawford pointed out on twitter, the diversity level for one of the NIPS symposia was quite poor, and this goes for pretty much all of the events at NIPS (other symposia, workshops, program committee, organizing committee, etc.). This is not limited to NIPS: apparently, WiML is now slightly larger than AIStats and there were approximately as many men at WiML this year as there were women at AIStats. Nando recently discussed similar issues in his recent Reddit AMA. There's at least some awareness here, but awareness is not enough.

Overall, I enjoyed NIPS, especially catching up with old friends, doing a talking machines interview (should be posted early 2016), and seeing what's going on in a field whose size has increased dramatically since I started going eleven years ago!

## 30 October 2015

### How sausage got made once

When I was working on what turned into an old short paper (Markov Random Topic Fields) I decided it might be pedagogically interesting to keep a journal of what I was doing. This journal ended when I ran out of steam and I never got back to it. My whole original idea was, after the paper got published, post everything: the journal, the code, the paper, the paper reviews, etc. It's now been 6 years and that's not going to happen, but in case anyone finds it interesting, here in the report.

Anyway, here is the report. I'm posting this so that perhaps new students can see that things don't ever work the first time, faculty still have trouble getting their code to work, etc.

The progress of a research idea

============= DAY 1 =============

* Idea

Want to do a form of topic modeling, but where there is meta
information.  There are ways to do this, eg., Supervised LDA or
Dirichlet-Multinomial Regression.  These both operate on a *feature*
level.  For some tasks, it is more natural to operate over a graph.
Along these lines, there's Pachinko Allocation, but this posits a
graph over the vocabulary, not over the documents.  (Plus, it is a
DAG, which doesn't make sense for our application.)

Question: how can we augment a standard topic model (eg., LDA), with
an underlying graph, where we assume topics vary smoothly over the
graph?

* Technology

What technology exists for statistical modeling over graphs?  Sounds
like a Markov Random Field.  So let's marry topic models (LDA) with
MRFs, to give a "Topical Markov Random Field" (TMRF).

We think of LDA a generating documents by first choosing a topic
mixture \theta, and then choosing topics z=k for each word w, where w
is drawn from a multinomial \beta_k.

Where can a graph fit in this?  The first idea is to put an MRF over
\theta.

* MRF over \theta

If we have an MRF over theta, then two issues arise.  First, we almost
certainly can't collapse out theta as we might like.  Okay, we'll live
with that.

Second, from an MRF perspective, what do the potential functions look
like?

The simplest idea is to use pairwise potentials of the form e^{-dist},
where dist is the distance between two thetas that touch on the
graph.  What Distance metric should we use?  We could use
Bhattacharyya, Hellinger, Euclidean, LogitEuclidean, etc.  Let's start
with Hellinger.

What about a variance?  We could have lengths in the graph that are
either latent or known.  Let's say they're latent and our potentials
have the form e^{-dist/l}, where l is the length (so that if you're
far away, distance doesn't matter.

** Getting data together

We have about 1000 docs and three graphs over those docs.  We get them
in a reasonable format and then subsample about 400 of the docs.  We
do this both for speed and to make sure we don't overfit the model on
the data too much.

============= DAY 2 =============

** Implementation

We implement this with an option to use or not use graphs (so we can
tell if they're helping).  We collapse out \beta, but not \theta in
both cases, and we compute log posteriors.

We run first on some simple test data (testW) from HBC and find that
it seems to be doing something kind of reasonable.  We then run on
some real data and it puts everything in one cluster after about 20-30
Gibbs iterations.

Debugging: First, we turn off all graph stuff (sampling lengths) and
things are still broken.  Then we initialize optimally and things are
still broken.  Then we turn off resampling \theta and things are still
broken.  The problem is that I'm using the collapsed \beta incorrectly
when sampling z.  I fix it and things work as expected (i.e., not
everything goes in the same cluster).

** Evaluating

So now the code seems to be working, so we want to evaluate.  We run a
model with an without a graph (where the graph is something we expect
will help).  The posterior probabilities coming out of the two
different models are all over the place.

So we do the standard trick of holding out 20% of the data as "test"
data and then evaluating log likelihood on the test.  Here, we do the
"bad" thing and just use 20% of the words in each document (so that we
already have \theta for all the documents).  Not great, but easy to
implement.  This time, no bugs.

At this point, it's a pain to recompile for every configuration change
and we'd like to be able to run a bunch of configs simultaneously.  So
we add a simple command line interface.

In order to evaluate, we plot either posteriors or held-out
likelihoods (usually the latter) as a function of iteration using
xgraph (interacts nicely with the shell and I'm used to it).

Things now seem mixed.  There's very little difference when you're not
using a graph between sampling \theta from the true posterior versus
using an MH proposal (this is good for us, since we have to use MH).
There is also little difference between the baseline LDA model and the
MRF models.

We turn off sampling of the lengths and just fix them at one.  For the
three graphs we're trying, only one seems to be doing any better than
the baseline LDA model.

** Optimization

Now that we're running experiments, we find that things are taking way
too long to run.  So we do some optimization.  First, we cache the sum
of all \beta_k posteriors.  This helps.  Second, we flip \beta from
\beta_{k,w} to \beta_{w,k}, which we've heard helps.  It doesn't.  We
put it back the other way.

All the time is being spent in resample_z, so we waste a half day
trying to figure out if there's a way to only resample a subset of the
zs.  For instance, track how often they change and only resample those
that change a lot (probabilistically).  This hurts.  Resampling those
with high entropy hurts.  I think the issue is that there are three
types of zs.  (1) those that change a lot because they have high
uncertainty but are rare enough that they don't really matter, (2)
those that change a lot and do matter, (3) those that just don't
change very much.  Probably could do something intelligent, but have

In order to really evaluate speed, we add some code that prints out
timing.

We do one more optimization that's maybe not very common.  resample_z
loops over docs, then words, then topics.  For each word, the loop
over topics is to compute p(z=k).  But if the next word you loop over
is the same word (they are both "the"), then you don't need to
recompute all the p(z=k)s -- you can cache them.  We do this, and then
sort the words.  This gives about a 20% speedup with no loss in
performance (posterior or held-out likelihood).

** Evaluating again

Since we had some success fixing the lengths at 1, we try fixing them
at 20.  Now that one graph is doing noticably better than the baseline
and the other two slightly better.  We try 5 and 10 and 50 and nothing
much seems to happen.  20 seems like a bit of a sweet spot.

============= DAY 3 =============

** A more rigorous evaluation

Running with lengths fixed at 20 seems to work, but there's always
variance due to randomness (both in the sampling and in the 80/20
split) that we'd like to account for.

So we run 8 copies of each of the four models (8 just because we
happen to have an 8 core machine, so we can run them simultaneously).

Now, we require more complex graphing technology than just xgraph.
We'll probably eventually want to graph things in matlab, but for now
all we really care about it how things do over time.  So we write a
small perl script to extract scores every 50 iterations (we've
switched from 200 to 300 just to be safe) and show means and stddevs
for each of the models.

While we're waiting for this to run, we think about...

* MRF over z?

Our initial model which may or may not be doing so well (we're waiting
on some experiments) assumes an MRF over \th.  Maybe this is not the
best place to put it.  Can we put it over z instead?

Why would we want to do this?  There are some technological reasons:
(1) this makes the MRF discrete and we know much better how to deal
with discrete MRFs.  (2) we can get rid of the MH step (though this
doesn't seem to be screwing us up much).  (3) we no longer have the
arbitrary choice of which distance function to use.  There are also
some technological reasons *not* to do it: it seems like it would be
computationally much more burdensome.

But, let's think if this makes sense in the context of our
application.  We have a bunch of research papers and our graphs are
authorship, citations, time or venue.  These really do feel like
graphs over *documents* not *words*.

We could turn them in to graphs over words by, say, connecting all
identical terms across documents, encouraging them to share the same
topic.  This could probably be done efficiently by storing an inverted
index.  On the other hand, would this really capture much?  My gut
tells me that for a given word "foo", it's probably pretty rare that
"foo" is assigned to different topics in different documents.  (Or at
least just as rare as it being assigned to different topics in the
same document.)  Note that we could evaluate this: run simple LDA, and
compute the fraction of times a word is assigned it's non-majority
topic across all the data, versus just across one documents.  I
suspect the numbers would be pretty similar.

The extreme alternative would be to link all words, but this is just
going to be computationally infeasible.  Moreover, this begins to look
just like tying the \thetas.

So for now, we put this idea on the back burner...

* MRF over \theta, continued...

We're still waiting for these experiments to run (they're about half
of the way there).  In thinking about the graph over z, though, it did
occur to me that maybe you have to use far more topics than I've been
using to really reap any benefits here.  We begin running with 8, and
then bumped it up to 20.  But maybe we really need to run with a lot
more.

So, I log in to two other machines and run just one copy with 50, 100,
150 and 200 topics, just for vanilla LDA.  The larger ones will be
slow, but we'll just have to go do something else for a while...

============= DAY 4 =============

* MRF over \theta, re-continued...

Both experiments finish and we see that: (1) with 20 topics and
lengths fixed at 10, there is no difference between raw LDA and TMRF.
(2) More topics is better.  Even 200 topics doesn't seem to be
overfitting.  Going 20, 50, 100, 150, 200, we get hidden log
likelihoods of -1.43, -1.40, -1.38, -1.36, -1.36 (all e+06).  The
significance (from the first experiments) seems to be around .002
(e+06), so these changes (even the last, which differs by 0.005) seem
to be real.  Since we weren't able to overfit, we also run with 300
and 400, and wait some more...

...and they finish and still aren't overfitting.  We get -1.35 and
-1.35 respectively (differing by about 0.004, again significant!).

This is getting ridiculous -- is there a bug somewhere?  Everything
we've seen in LDA-like papers shows that you overfit when you have a
ton of topics.  Maybe this is because our documents are really long?

** Model debugging

One thing that could be going wrong is that when we hide 20% of the
words, and evaluate on that 20%, we're skewing the evaluation to favor
long documents.  But long documents are probably precisely those that
need/want lots of topics.  Our mean document length is 2300, but the
std is over 2500.  The shortest document has 349 words, the longest
has 37120.  So, instead of hiding 20%, we try hiding a fixed number,
which is 20% of the mean, or 460.

============= DAY 5 =============

At this point, we've done a large number of evals, both with 20% hid,
and 460 words/doc hid (actually, the min of this and doclen/2), and
50-1600 (at *2) topics.  We do actually see a tiny bit of overfitting
at 800 or 1600 documents.

Then we do something we should have done a long time ago: go back and
skim through some LDA papers.  We look at the BNJ 2003 JMLR paper.  We
see that one of the selling points of LDA over LSI is that
it *doesn't* overfit!  Aaargh!  No wonder we haven't been able to get
substantial overfitting.

However, we also notice something else: aside from dropping 50 stop
words (we've been dropping 100), on one data set they don't prune rare
words at all, and on the other they prune df=1 words only.  We've been
pruning df<=5 or <=10 words (can't remember which).  Perhaps what's
going on is that the reason the graphs aren't helping is just because
there vocabulary (around 3k) isn't big enough for them to make a
difference!

We recreate the text, pruning only the df=1 words.  This leads to a
vocab of about 10k (which means inference will be ~3 times slower).
We run at 50, 100, 200 and 400 and we actualy see a tiny bit of
overfitting (maybe) on 400.  We accidentally only ran 100 iterations,
so it's a bit hard to tell, but at the very least there's
no *improvement* for going from 200 topics to 400 topics.  Strangely
running on the new text is actually about 5-20% *faster*, despite the
larger vocabulary!

** Re-running with graphs

At this point, we're ready to try running with graphs again.  Despite
the fact that it's slow, we settle on 200 topics (this took about 3.5
hours without graphs, so we will be waiting a while).  We also want to
run for more iterations, just to see what's going to happen: we do 200
again.

And again there's not much difference.  One of the graphs seems to be
just barely one std above everyone else, but that's nothing to write

============= DAY 6 =============

* Abstracts only?

At this point, things are not looking so spectacular.  Perhaps the
problem is that the documents themselves are so big that there's
really not much uncertainty.  This is reflected, to some degree, by
the lack of variance in the predictive perplexities.

So we rebuild the data on abstracts only.  This makes running
significantly faster (yay).  We run 5 copies each of 10, 20, 40, 80,
160 and 320 topics.  40 is a clear winner.  80 and above overfit

Now, we turn on graphs and get the following results (5 runs):

40-top-nog.default      -69239.4 (86.2629700392932)
40-top-nog.auth         -68920.0 (111.863756418243)
40-top-nog.cite         -68976.4 (174.920839238783)
40-top-nog.year         -69174.2 (133.769204228776)

If we compare default (not graphs) to auth (best), we see that we get
a 2-3 std separation.  This is looking promising, FINALLY!  Also, if
we plot the results, it looks like auth and cite really dominate.
Year is fairly useless.

It suggests that, maybe, we just need more data to see more of a
difference.

* Getting More Data

There are two ways we could get more data.  First, we could crawl
more.  Second, we could switch over to, say, PubMed or ACM.  This
would work since we only need abstracts, not full texts.  These sites

============= DAY 7 =============

Okay, ACM is a pain.  And it's really not that complete, so we switch
over to CiteSeer (don't know why we didn't think of this before!).  We
seed with docs from acl, cl, emnlp, icml, jmlr, nips and uai.  We
notice that CiteSeer is apparently using some crappy PDF extractor (it
misses ligatures!  ugh!) but it's not worth (yet!) actually
these seeds, we run 10 iterations of reference crawling, eventually
ending up with just over 44k documents.  We extract a subset
comprising 9277 abstracts, and six graphs: author, booktitle,
citation, url, year and time (where you connect to year-1 and year+1).
The 9k out of 44k are those that (a) have >=100 characters
"reasonable" in the abstract and (b) have connections to at least 5
other docs in *all* the graphs.  (We're no longer favoring citations.)
We then begin the runs again....