14 May 2016

A bad optimizer is not a good thing

A very popular style of research in NLP and ML is the math abstraction. You cast your learning problem as some sort of objective function that you want to optimize. Or, if you're feeling Bayesian, you write down a joint likelihood that you'll either sample from or, yes, turn into an objective function that you want to optimize. The optimizer is then typically considered a black box, aside from its hyperparameters which you often must tune.

This is a very attractive style of research and one that I've personally gotten a lot of leverage out of. (Though it's worth emphasizing for reviewer #2 out there that this is not the only want to solve problems, and that not having an explicit objective function that an optimizer is cranking on does not mean your problem formulation is bad.) The main advantage here is separation of concerns. I can worry about making a good objective function, and my optimization friends can worry about optimizing it. Of course it's never that simple and we must consider what can possibly be optimized when making our objective functions (see, for instance, the obsession with convexity 10 years ago and the anti-obsession with convexity today).

At the end of the day, like any abstraction, it's not perfect, and we need to break it occasionally.

There is, however, something that's struck me as peculiar going back to the hayday of topic models in 10 or so years ago, but which applies equally well now. This is a strange world in which we actually hope that our optimizer doesn't work (or s/optimizer/sampler/). To me, this seems like a bad thing to hope for. I think we need to acknowledge that for hard problems, or optimizers and samplers probably won't work, and need to take that into account. But I don't think we should actively hope that they don't work: that just seems broken.

There are two places where I see this hope arise.

1. Occasionally you see objective functions that have trivial solutions that you're hoping won't be found. I've seen this several times in the context of multimodal learning. Suppose that you want to learn some sort of joint embedding of English words and Japanese words. I can write down some embedding functions f(e) and g(j) using whatever neural net magic I want. The outputs are vectors. Now, I want to have them supervise each other, and so I make an objective function that looks like ||A f(e) - B g(j)||^2 for some matrices A and B. The joint objective is to learn A, B, and all the parameters inside f and g. This seems reasonable, until you realize that simply setting A=B=0 will provide not only a local, but a global minimum of this objective function. Of course since these things are non-convex, we often initialize randomly. And in most cases, this global minimum actually isn't found (otherwise such approaches would never work). But this is just a broken way to formulate a problem because I'm relying on the fact that my optimizer doesn't actually work.

2. Much more commonly, the hope of a broken optimizer comes from the heuristic of using "initialization" to build a better model. The general scheme here is that I have some prior knowledge that I want to inject into the system, and I do so by initializing my model's parameters informed by that knowledge. I then optimize. But again, here, we're hoping that the optimizer fails! If the optimizer were perfect, it should be insensitive to the initialization (modulo picking one of multiple global optima, but I don't think that's what's really going on here). You see the same thing in Bayesian land, wherein you basically hope that your sampler doesn't mix properly. Again, this is a case where we're relying on the fact that we don't think our optimizers will actually work.

I think it's fair to say that #1 is pretty broken; I think there are a number of possible responses to #2 that are worth discussing briefly.

Response 1: Early stopping. We rarely run optimizers (or samplers) to completion and instead stop them early based on some held out measure of generalization. For many optimizers, it's possible to show that early stopping it equivalent to regularization (I first saw this as a homework exercise in the old Bishop neural nets book, but it's been reinvented many times since then... and probably before then). You can alternatively pull out some online learning theory and argue that online updates are implicitly regularizing (toward zero). We therefore can conclude that initialization plus early stopping is essentially a form of regularization toward the initialization point. This is reasonable, but it would be nice to actually lay out the theory more solidly in a way that's not specific to a particular model, regularizer, optimizer triple. (Maybe this has been done but I haven't seen it.)

Response 2: Initialization isn't to help the model, it's to help the optimizer. This is a pretty solid response. I think that when we do initialization, this is probably how we should think about it. We think the model is perfect (or at least good) but that the optimizer is going to have a hard time, so we start it somewhere reasonable. I think this perspective is hard to reconcile with the "initialization as prior knowledge view" (it's possible, but requires some mental acrobatics), but it's a reasonable view.

So what should we be doing? Well, if someone wants to prove something along the lines of #1 that would be nice :). Or alternatively just point me to an existing result!

More generally, though, I think the model vs optimization abstraction is quite useful. If we want to incorporate prior knowledge into the model, then some sort of regularization (or prior) seems like a very reasonable approach. (Yes, there are others.) If we want to try to help out our optimizer, then initialization may be a reasonable approach, but it really depends on the optimizer. One can of course do both: the "regularize toward X" approach together with initialization at (or near) X. (Though in my experience often regularizing toward X means that initializing at X isn't necessary because it already introduces a good bias and symmetry breaking.)

26 March 2016

A dagger by any other name: scheduled sampling

Scheduled Sampling was at NIPS last year; the reviews are also online. (Note: I did not review this paper.) This is actually the third time I've tried to make my way through this paper, and to force myself to not give up again, I'm writing my impressions here. Given that this paper is about two things I know a fair amount about (imitation learning and neural networks), I kept getting frustrated at how hard it was for me to actually understand what was going on and how it related to things we've known for a long time. So this post is trying to ease entry for anyone else in that position.

What is the problem this paper is trying to solve?

Neural networks are now often used to generate text. As such, they are often trained to maximize a product of probabilities: the probability of the 5th word in the output given all previous words. Unfortunately, the distribution on which they are trained is typically the gold standard distribution of previous words. This means that they never learn to recover from their own mistakes: at test time, if they make an error (which they will), they could end up in some distribution they've never seen at training time and behave arbitrarily badly. Matti Kaariainen proved over ten years ago that you can construct worst case examples on which, even with vanishingly small probability of error, the sequence predictor makes T/2 mistakes on a length T binary sequence prediction problem. We've also known for about as long how to fix this (for models that are more general than neural networks), but more on that later.

How do they solve this problem?

On the ith minibatch, for each output word, they flip an (independent) coin. If it comes up heads, they use the true ith word when training (the training data one); if it comes up tails, they use the predicted ith word. If the coin probability is 1, then this amounts to the standard (inconsistent) training paradigm. If the coin probability is 0, then this amounts to always using the predicted words.

How does this relate to existing stuff?

To put this in terminology that I understand better, they're playing with the rollin distribution, adjusting it between "Ref" (when the probability is one) to "Learn" (when the probability is zero) and, as is natural, "Mix" when the probability is somewhere in the middle.

I think (but please correct me if I'm wrong!) that Searn was the first algorithm that considered mixture policies for roll-in. (Previous algorithms, like incremental perceptron, Lasso, etc., used either Ref or Learn.) Of course no one should be actually using Searn anymore (algorithms like DAgger, AggreVaTe and LOLS, complete dominate it as far as I can tell).

For what it's worth, I don't think the "related work" section is the paper is really particularly accurate. Searn was never a reinforcement learning algorithm (where do you get the oracle in RL?), and the paper completely dismisses DAgger, referring to it only as "a related idea" and highlights the fact that the "scheduled sampling" approach is online. (Side note, the sense of "online" used in the scheduled sampling paper is the "stochastic" sense.) Of course, DAgger can be trained online, and in fact that's how we do it in the vw implementation... and it works incredibly well! You can also do a straight-up online analysis.

The only real difference to DAgger is the exact form of the schedule. DAgger uses a schedule of something like P(use truth) = 0.99i, where i is the round/minibatch/whatever. Scheduled sampling considers two other rates, one linear and one inverse sigmoid. They're shown below, where the red curve is the DAgger schedule (Figure 2 from the scheduled sampling paper):

What are the results like?

There are basically two novelties in this paper: (1) applying DAgger to training neural networks, and (2) trying different schedules.

I would expect results that analyse the importance of these two things, but I only see a partial analysis of (1). In particular, the results show that training neural networks with DAgger is better than training them with supervised learning. This is nice to know, and is a nice replication of the results from Ross, Gordon and Bagnell. It would be also nice to see the other side: that going neural here is beneficial, but I suspect that's just taken for granted.

On (2), nothing is said! For two of the three experiments (image captioning and constituency parsing), the paper says that they use inverse sigmoid decay. For the third, the paper uses the linear decay schedule. Presumably the others performed worse, but I would really have appreciated knowing whether there's any benefit to doing something other than the DAgger schedule. And if there's benefit, how big?

In particular, the DAgger analysis works for any schedule that has the sum of probabilities over rounds go to zero has the number of rounds goes to infinity. This does not happen for the linear schedule (the sum goes to some lower bound, epsilon). I'm pretty sure it happens for the sigmoid schedule but didn't bother to verify. It would have been nice if the paper had verified this. At any rate, if I'm going to switch from something that I understand theoretically (like DAgger with exponential decay) to something that's heuristic, I would like to know how much empirical benefit I'm getting in exchange for losing my understanding of what's going on.

What's missing

One major difference to DAgger is that this paper makes a weird assumption that the correct next thing to do is independent of what was done in the past. For examples, suppose the gold standard output is "The little bird flew into the mailbox." Suppose that for the first word, the schedule uses the true output. We now have an output of "The". For the second word, maybe the schedule uses the predicted output. Maybe this is "bird". For any reasonable evaluation criteria, probably the best completion of "The bird" with respect to this gold standard is "The bird flew into the mailbox." But the approach proposed here would insist on producing "The bird bird flew into the mailbox."

This is the question of computing the oracle completion, or the "minimum cost action" at any given point. Yoav Goldberg refers to this as a "dynamic oracle." We know how to do this for many loss functions, and I'm surprised that (a) this isn't an issue in the experiments here, and (b) that it's not mentioned.

Reviewing the Reviews

The reviewers seem to mostly have liked the paper, which makes sense, it is interesting. I'm really really surprised that none of the reviewers took the authors to task on the connections to DAgger, and instead concentrated on Searn. I suspect that the reviewers were neural networks experts and not imitation learning experts, and given that the paper talked only about Searn (which at this point is an easy strawman), the reviews tended to focus on that.
The biggest surprise to me in the reviews was that comment that the paper had "comprehensive empirical testing," also echoed by reviewer 3. (Reviewer 2 was basically a no-op.) This paper had impressive experiments on big hard problems, but the experiments were anything but comprehensive from the perspective of understanding what is going on.

Some last thoughts

Let me say that although what's written above is critical, I of course don't dislike this work. It's something that needed to be done, and it's great someone did it. I wish that it were just done better and that the paper made some attempt at situating itself in the literature.

For comparison, take a gander at the recent arxiv paper on training an LSTM using imitation learning/learning-to-search technology by Ballesteros, Goldberg, Dyer and Smith. Not only does this paper get really impressive empirical results, they are also remarkably straightforward about connections to previous work.

27 December 2015

NIPS 2015 Retrospective

There have already been lots of really nice overviews of NIPS 2015. I don't have much to add, largely because I missed almost every talk and a large fraction of the poster sessions.

Before beginning, I'm really happy and somewhat humbled to have been able to participate in the quizbowl demo, which won the best demo award: congrats to Mohit Iyyer, He He and Jordan Boyd-Graber for doing all the hard work!

Here are some things I did see, and that I liked, or that other people pointed me to as something I would probably like but I haven't actually seen.

  • Logarithmic Time Online Multiclass prediction by Anna Choromanska and John Langford. This paper has been on arxiv for a while, and has been built into vw for a while also, so I've known about this stuff for probably a year now, but I still think it's great. The task is to build, in an online way, a binary tree for doing multiclass to binary reductions, in a way that gives you nice regret guarantees. The tree changes over time. One interesting property is that a single class can me represented in multiple leaves which gives added representational capacity to common/difficult classes, something most approaches lack.
  • Probabilistic Line Searches for Stochastic Optimization by Maren Mahsereci and Philipp Hennig. I didn't see this paper, but I heard from several people that they liked it. The idea is that conventional optimization (eg BFGS) gets a lot of mileage out of line search, something we don't conventionally do in stochastic optimization (like sgd), but perhaps we should. It's on my reading list.
  • Grammar as a Foreign Language by Orio Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever and Geoffrey Hinton. This paper thinks of parsing as the output of a sequence-to-sequence model, basically outputting a tree as a sequence of symbols. This doesn't work if you just train it on labeled data. Instead, you have to train a state of the art parser, run it on a huge pile of unlabeled data, and then train the seq2seq model to predict that. I like this because this is what Percy, Dan and I tried to get to work in the structured compilation paper, but could never make it work (probably because we lacked a sufficiently rich hypothesis class). It's basically more evidence, IMO, that seq2seq models are really good memorizers: I would be interested to see an analysis of how much generalization is actually learned (I suspect quite little).
  • The Human Kernel by Andrew Wilson, Christoph Dann, Chris Lucas and Eric Xing. I like this paper in the same way I like Jerry Zhu's "machine teaching" work. It's about the interplay between human learning and machine learning. If you haven't read this paper, it's a fun read.
  • Calibrated Structured Prediction by Volodymyr Kuleshov and Percy Liang. The observation in this paper is the confidence estimation is important for structured prediction problems, but faces its own challenges. They approach it from a calibration perspective and adjust the notions of calibration to be appropriate for structured problems and marginal inference.
On the not-so-great side, as Kate Crawford pointed out on twitter, the diversity level for one of the NIPS symposia was quite poor, and this goes for pretty much all of the events at NIPS (other symposia, workshops, program committee, organizing committee, etc.). This is not limited to NIPS: apparently, WiML is now slightly larger than AIStats and there were approximately as many men at WiML this year as there were women at AIStats. Nando recently discussed similar issues in his recent Reddit AMA. There's at least some awareness here, but awareness is not enough.

Overall, I enjoyed NIPS, especially catching up with old friends, doing a talking machines interview (should be posted early 2016), and seeing what's going on in a field whose size has increased dramatically since I started going eleven years ago!

30 October 2015

How sausage got made once

When I was working on what turned into an old short paper (Markov Random Topic Fields) I decided it might be pedagogically interesting to keep a journal of what I was doing. This journal ended when I ran out of steam and I never got back to it. My whole original idea was, after the paper got published, post everything: the journal, the code, the paper, the paper reviews, etc. It's now been 6 years and that's not going to happen, but in case anyone finds it interesting, here in the report.

Anyway, here is the report. I'm posting this so that perhaps new students can see that things don't ever work the first time, faculty still have trouble getting their code to work, etc.

The progress of a research idea

============= DAY 1 =============

* Idea

Want to do a form of topic modeling, but where there is meta
information.  There are ways to do this, eg., Supervised LDA or
Dirichlet-Multinomial Regression.  These both operate on a *feature*
level.  For some tasks, it is more natural to operate over a graph.
Along these lines, there's Pachinko Allocation, but this posits a
graph over the vocabulary, not over the documents.  (Plus, it is a
DAG, which doesn't make sense for our application.)

Question: how can we augment a standard topic model (eg., LDA), with
an underlying graph, where we assume topics vary smoothly over the

* Technology

What technology exists for statistical modeling over graphs?  Sounds
like a Markov Random Field.  So let's marry topic models (LDA) with
MRFs, to give a "Topical Markov Random Field" (TMRF).

We think of LDA a generating documents by first choosing a topic
mixture \theta, and then choosing topics z=k for each word w, where w
is drawn from a multinomial \beta_k.

Where can a graph fit in this?  The first idea is to put an MRF over

* MRF over \theta

If we have an MRF over theta, then two issues arise.  First, we almost
certainly can't collapse out theta as we might like.  Okay, we'll live
with that.

Second, from an MRF perspective, what do the potential functions look

The simplest idea is to use pairwise potentials of the form e^{-dist},
where dist is the distance between two thetas that touch on the
graph.  What Distance metric should we use?  We could use
Bhattacharyya, Hellinger, Euclidean, LogitEuclidean, etc.  Let's start
with Hellinger.

What about a variance?  We could have lengths in the graph that are
either latent or known.  Let's say they're latent and our potentials
have the form e^{-dist/l}, where l is the length (so that if you're
far away, distance doesn't matter.

** Getting data together

We have about 1000 docs and three graphs over those docs.  We get them
in a reasonable format and then subsample about 400 of the docs.  We
do this both for speed and to make sure we don't overfit the model on
the data too much.

============= DAY 2 =============

** Implementation

We implement this with an option to use or not use graphs (so we can
tell if they're helping).  We collapse out \beta, but not \theta in
both cases, and we compute log posteriors.

We run first on some simple test data (testW) from HBC and find that
it seems to be doing something kind of reasonable.  We then run on
some real data and it puts everything in one cluster after about 20-30
Gibbs iterations.

Debugging: First, we turn off all graph stuff (sampling lengths) and
things are still broken.  Then we initialize optimally and things are
still broken.  Then we turn off resampling \theta and things are still
broken.  The problem is that I'm using the collapsed \beta incorrectly
when sampling z.  I fix it and things work as expected (i.e., not
everything goes in the same cluster).

** Evaluating

So now the code seems to be working, so we want to evaluate.  We run a
model with an without a graph (where the graph is something we expect
will help).  The posterior probabilities coming out of the two
different models are all over the place.

So we do the standard trick of holding out 20% of the data as "test"
data and then evaluating log likelihood on the test.  Here, we do the
"bad" thing and just use 20% of the words in each document (so that we
already have \theta for all the documents).  Not great, but easy to
implement.  This time, no bugs.

At this point, it's a pain to recompile for every configuration change
and we'd like to be able to run a bunch of configs simultaneously.  So
we add a simple command line interface.

In order to evaluate, we plot either posteriors or held-out
likelihoods (usually the latter) as a function of iteration using
xgraph (interacts nicely with the shell and I'm used to it).

Things now seem mixed.  There's very little difference when you're not
using a graph between sampling \theta from the true posterior versus
using an MH proposal (this is good for us, since we have to use MH).
There is also little difference between the baseline LDA model and the
MRF models.

We turn off sampling of the lengths and just fix them at one.  For the
three graphs we're trying, only one seems to be doing any better than
the baseline LDA model.

** Optimization

Now that we're running experiments, we find that things are taking way
too long to run.  So we do some optimization.  First, we cache the sum
of all \beta_k posteriors.  This helps.  Second, we flip \beta from
\beta_{k,w} to \beta_{w,k}, which we've heard helps.  It doesn't.  We
put it back the other way.  

All the time is being spent in resample_z, so we waste a half day
trying to figure out if there's a way to only resample a subset of the
zs.  For instance, track how often they change and only resample those
that change a lot (probabilistically).  This hurts.  Resampling those
with high entropy hurts.  I think the issue is that there are three
types of zs.  (1) those that change a lot because they have high
uncertainty but are rare enough that they don't really matter, (2)
those that change a lot and do matter, (3) those that just don't
change very much.  Probably could do something intelligent, but have
wasted too much time already.

In order to really evaluate speed, we add some code that prints out

We do one more optimization that's maybe not very common.  resample_z
loops over docs, then words, then topics.  For each word, the loop
over topics is to compute p(z=k).  But if the next word you loop over
is the same word (they are both "the"), then you don't need to
recompute all the p(z=k)s -- you can cache them.  We do this, and then
sort the words.  This gives about a 20% speedup with no loss in
performance (posterior or held-out likelihood).

** Evaluating again

Since we had some success fixing the lengths at 1, we try fixing them
at 20.  Now that one graph is doing noticably better than the baseline
and the other two slightly better.  We try 5 and 10 and 50 and nothing
much seems to happen.  20 seems like a bit of a sweet spot.

============= DAY 3 =============

** A more rigorous evaluation

Running with lengths fixed at 20 seems to work, but there's always
variance due to randomness (both in the sampling and in the 80/20
split) that we'd like to account for.

So we run 8 copies of each of the four models (8 just because we
happen to have an 8 core machine, so we can run them simultaneously).

Now, we require more complex graphing technology than just xgraph.
We'll probably eventually want to graph things in matlab, but for now
all we really care about it how things do over time.  So we write a
small perl script to extract scores every 50 iterations (we've
switched from 200 to 300 just to be safe) and show means and stddevs
for each of the models.

While we're waiting for this to run, we think about...

* MRF over z?

Our initial model which may or may not be doing so well (we're waiting
on some experiments) assumes an MRF over \th.  Maybe this is not the
best place to put it.  Can we put it over z instead?

Why would we want to do this?  There are some technological reasons:
(1) this makes the MRF discrete and we know much better how to deal
with discrete MRFs.  (2) we can get rid of the MH step (though this
doesn't seem to be screwing us up much).  (3) we no longer have the
arbitrary choice of which distance function to use.  There are also
some technological reasons *not* to do it: it seems like it would be
computationally much more burdensome.

But, let's think if this makes sense in the context of our
application.  We have a bunch of research papers and our graphs are
authorship, citations, time or venue.  These really do feel like
graphs over *documents* not *words*.

We could turn them in to graphs over words by, say, connecting all
identical terms across documents, encouraging them to share the same
topic.  This could probably be done efficiently by storing an inverted
index.  On the other hand, would this really capture much?  My gut
tells me that for a given word "foo", it's probably pretty rare that
"foo" is assigned to different topics in different documents.  (Or at
least just as rare as it being assigned to different topics in the
same document.)  Note that we could evaluate this: run simple LDA, and
compute the fraction of times a word is assigned it's non-majority
topic across all the data, versus just across one documents.  I
suspect the numbers would be pretty similar.

The extreme alternative would be to link all words, but this is just
going to be computationally infeasible.  Moreover, this begins to look
just like tying the \thetas.

So for now, we put this idea on the back burner...

* MRF over \theta, continued...

We're still waiting for these experiments to run (they're about half
of the way there).  In thinking about the graph over z, though, it did
occur to me that maybe you have to use far more topics than I've been
using to really reap any benefits here.  We begin running with 8, and
then bumped it up to 20.  But maybe we really need to run with a lot

So, I log in to two other machines and run just one copy with 50, 100,
150 and 200 topics, just for vanilla LDA.  The larger ones will be
slow, but we'll just have to go do something else for a while...

============= DAY 4 =============

* MRF over \theta, re-continued...

Both experiments finish and we see that: (1) with 20 topics and
lengths fixed at 10, there is no difference between raw LDA and TMRF.
(2) More topics is better.  Even 200 topics doesn't seem to be
overfitting.  Going 20, 50, 100, 150, 200, we get hidden log
likelihoods of -1.43, -1.40, -1.38, -1.36, -1.36 (all e+06).  The
significance (from the first experiments) seems to be around .002
(e+06), so these changes (even the last, which differs by 0.005) seem
to be real.  Since we weren't able to overfit, we also run with 300
and 400, and wait some more...

...and they finish and still aren't overfitting.  We get -1.35 and
-1.35 respectively (differing by about 0.004, again significant!).

This is getting ridiculous -- is there a bug somewhere?  Everything
we've seen in LDA-like papers shows that you overfit when you have a
ton of topics.  Maybe this is because our documents are really long?

** Model debugging

One thing that could be going wrong is that when we hide 20% of the
words, and evaluate on that 20%, we're skewing the evaluation to favor
long documents.  But long documents are probably precisely those that
need/want lots of topics.  Our mean document length is 2300, but the
std is over 2500.  The shortest document has 349 words, the longest
has 37120.  So, instead of hiding 20%, we try hiding a fixed number,
which is 20% of the mean, or 460.

============= DAY 5 =============

** Read the papers, stupid!

At this point, we've done a large number of evals, both with 20% hid,
and 460 words/doc hid (actually, the min of this and doclen/2), and
50-1600 (at *2) topics.  We do actually see a tiny bit of overfitting
at 800 or 1600 documents.

Then we do something we should have done a long time ago: go back and
skim through some LDA papers.  We look at the BNJ 2003 JMLR paper.  We
see that one of the selling points of LDA over LSI is that
it *doesn't* overfit!  Aaargh!  No wonder we haven't been able to get
substantial overfitting.

However, we also notice something else: aside from dropping 50 stop
words (we've been dropping 100), on one data set they don't prune rare
words at all, and on the other they prune df=1 words only.  We've been
pruning df<=5 or <=10 words (can't remember which).  Perhaps what's
going on is that the reason the graphs aren't helping is just because
there vocabulary (around 3k) isn't big enough for them to make a

We recreate the text, pruning only the df=1 words.  This leads to a
vocab of about 10k (which means inference will be ~3 times slower).
We run at 50, 100, 200 and 400 and we actualy see a tiny bit of
overfitting (maybe) on 400.  We accidentally only ran 100 iterations,
so it's a bit hard to tell, but at the very least there's
no *improvement* for going from 200 topics to 400 topics.  Strangely
(and I'd have to think about this more before I understand it),
running on the new text is actually about 5-20% *faster*, despite the
larger vocabulary!

** Re-running with graphs

At this point, we're ready to try running with graphs again.  Despite
the fact that it's slow, we settle on 200 topics (this took about 3.5
hours without graphs, so we will be waiting a while).  We also want to
run for more iterations, just to see what's going to happen: we do 200

And again there's not much difference.  One of the graphs seems to be
just barely one std above everyone else, but that's nothing to write
home about.

============= DAY 6 =============

* Abstracts only?

At this point, things are not looking so spectacular.  Perhaps the
problem is that the documents themselves are so big that there's
really not much uncertainty.  This is reflected, to some degree, by
the lack of variance in the predictive perplexities.

So we rebuild the data on abstracts only.  This makes running
significantly faster (yay).  We run 5 copies each of 10, 20, 40, 80,
160 and 320 topics.  40 is a clear winner.  80 and above overfit
fairly badly.

Now, we turn on graphs and get the following results (5 runs):

40-top-nog.default      -69239.4 (86.2629700392932)
40-top-nog.auth         -68920.0 (111.863756418243)
40-top-nog.cite         -68976.4 (174.920839238783)
40-top-nog.year         -69174.2 (133.769204228776)

If we compare default (not graphs) to auth (best), we see that we get
a 2-3 std separation.  This is looking promising, FINALLY!  Also, if
we plot the results, it looks like auth and cite really dominate.
Year is fairly useless.

It suggests that, maybe, we just need more data to see more of a

* Getting More Data

There are two ways we could get more data.  First, we could crawl
more.  Second, we could switch over to, say, PubMed or ACM.  This
would work since we only need abstracts, not full texts.  These sites
have nice interfaces, so we start downloading from ACM.

============= DAY 7 =============

Okay, ACM is a pain.  And it's really not that complete, so we switch
over to CiteSeer (don't know why we didn't think of this before!).  We
seed with docs from acl, cl, emnlp, icml, jmlr, nips and uai.  We
notice that CiteSeer is apparently using some crappy PDF extractor (it
misses ligatures!  ugh!) but it's not worth (yet!) actually
downloading the pdfs to do the extraction ourselves ala Braque.  From
these seeds, we run 10 iterations of reference crawling, eventually
ending up with just over 44k documents.  We extract a subset
comprising 9277 abstracts, and six graphs: author, booktitle,
citation, url, year and time (where you connect to year-1 and year+1).
The 9k out of 44k are those that (a) have >=100 characters
"reasonable" in the abstract and (b) have connections to at least 5
other docs in *all* the graphs.  (We're no longer favoring citations.)
We then begin the runs again....

20 October 2015

Results of NAACL 2015 post-conference survey are up


05 October 2015

A small observation for prepubs on arxiv

The question of how "traditional conference publication" should react to arxiv prepublications is raised quite frequently. I'm not particularly shy about the fact that I'm not a fan, but that's not what this post is about. This post is about data.

In any discussion about the "arxiv question," proponents of the idea typically cite the idea that by posting papers early on arxiv, they are able to get feedback from the community about their work. (See for example here, which at least tries to be balanced even if the phrasing is totally biased, for instance in the poll at the end :P.)

At any rate, the question I was curious about is: is this actually borne out in practice?

I did the following experiment. Arxiv nicely lets us see revision histories. So we can see, for instance, whether a paper that was placed on arxiv before notifications for the corresponding conference have gone out, is updated more than a paper that was placed on arxiv after notifications.

For NIPS papers that were first posted to arxiv after camera ready, 75% were never updated and 19% were updated once (on average, they were updated 0.36 times +- 0.713 std).

For papers that were first posted to arxiv before notification, all were updated at least once. The real question is: how many times were they updated between posting to arxiv and acceptance to the conference. The answer is that 82% were never updated during that period. Of course, all were updated at some point later (after the camera ready deadline), and 55% were updated only once, and another 18% were updated twice.

[Note: I only count updated that come at least two week after the first posting to arxiv because before is more likely to be typo fixing, rather than real feedback from the community.]

The sample size is small enough that I can actually look at all of the ones that were updated between posting to arxiv and notification. One of these seems was first posted in mid-Feb, updated twice is late March, and then again in Nov (acceptance) and Dec (camera ready). Another is very similar. Two were most likely posted the previous year when it was submitted to AIStats (the dates match up) and then updated when submitted to NIPS. Those were the only four, and two of them seem like a legit possible case of update due to community feedback.

As far as the question of "rapid impact on the field" this is harder to answer. I took a random sample of ten papers from each of the groups (prepub versus non-prepub) and got citation counts from google scholar. The median citation count was 10 for both sets. The average was slightly higher for the prepub set (15 versus 11, but with giant standard deviations of 12 and 16). Considering the prepub set has been out at least 6 months longer (this is NIPS 2013 and 2014 so this is a sizeable percentage), this is a pretty small difference. And it's a difference that might be attributable to other factors like "famous people are perhaps more likely to prepub" [actually it's not clear the data play this out; in a totally unscientific study of "does Hal think this person is famous" and a sample of 20 for each, it's even split 10/10 in both sets].

Anyway, I'm posting this because I've heard this argument many times and I've always felt it's a bit dubious. I've never seen data to back it up. This data suggests it's not true. If someone really believes this argument, it would be nice to see it backed up with data!

[Notes: I took only papers that were marked on arxiv as having appeared in NIPS, and which were first posted to arxiv in 2013 or 2014; this is 175 papers. I then hand-checked them all to exclude things like workshops or just submissions, and labeled them as to whether they appeared in 2013 or 2014.  That left a sample of papers. The rest of the data was extracted automatically from the arxiv abstract. The total number that was posted before notification (the prepub cases) is 22 (or 27%) and the total number that were posted after notification is 59 (or 73%). So the sample is indeed small. Not much I can do about that.]

11 September 2015

How long'll it take to say that?

tl;dr: Given a string of text in some language you might want to know how long it would take to speak it. Here are some perl/python "one-liners" to estimate that. Currently supports English (US), German, French, Italian, Spanish and Japanese. The estimates are all in seconds.

Brave are those who read the source code. I promise it's not intentionally obfuscated -- there's just a lot of unicoding going on, and then copious use of the right-handed saturn operator in perl.

Why would you want code for this rather than a dictionary? Dictionaries are limited to their vocabulary, which is also sensitive to things like tokenization. Having a completely parametric solution seemed much more generalizable.

Why would I possibly want this?

We've been doing a bunch of work recently on simultaneous machine interpretation (aka "real time machine translation"). None of us is a speech person, either in the "recognition" or "synthesis" sense. This unfortunately means that to date, all of our models treat each word as "equally long" when training and, perhaps more importantly, evaluating models. For instance, if we want to measure the décalage (aka time-lag, aka ear-voice-span) between when a Japanese word is "heard" and the corresponding English translation is "spoken", we've been assuming all words take precisely 1 second to speak. This is obviously ridiculous.

One quick and dirty alternative is to use a text-to-speech (aka speech synthesis) system to synthesize a sentence and use its length as an estimate of how long it would take a person to speak it. This is a totally plausible approach since decent open source synthesis software exists (we use such software to crease these one-liners), but it's slow and bloated and can't be easily distributed. This would be more accurate, but I was after a quick and filthy solution.

How do these scripts work?

There are two functions: one for estimating the amount of time it would take to speak a single word (sayWord in the code), and another for estimating the amount of time it would take to speak a sentence (sayit in the code). I'll first describe how sayWord works; sayit is pretty straightforward.

sayWord works by extracting a bunch of features from the word to be spoken (each of these is a particular regular expression) and evaluating a linear function of the counts of the matches of those regular expressions. The process by which coefficients were generated is explained later. The features are things like: number of characters, number of vowels, number of consonant groups, number of non-letters, number of vowel-consonant switches, number of digits of various lengths, and whether the word starts or ends with a vowel; and then, for each "reasonable" unicode character for European languages, the count of those characters. Not all of these features appears for each language because I used l1 regularization to prune down the feature set.

[Note: Japanese is different because it uses a different character set. The feature structure is basically the same, but replace "consonant" with "kana" and "vowel" with "kanji" and "reasonable character" with "each of the 100 most frequent Japanese characters" in Japanese Wikipedia.]

sayit attempts to pronounce a sentence by pronouncing each word individually (via sayWord) and then rescaling the resulting estimate because we ... don't ... pause ... between ... words. This rescaling is linear, and again estimated from data (explained below).

How are the coefficients estimated for words?

Basically I take a vocabulary of the 50k-100k most frequent words for each language, use a speech synthesis program to say them (for the European languages, I used MaryTTS; for Japanese I used Open JTalk).

I then extract all the features mentioned above, create a regression problem regressing on the number of seconds it takes to speak (after removing quiet time around the word) and throwing in some l1 regularization with vw to make sure that it didn't use too many features. I optimized quantile loss (aka absolute value loss) rather than squared loss.

One thing I did not do was weight the words by their frequency. I could have done this and the resulting regression weights change a bit but not too much. At the end it spends a lot of energy making sure it estimates the speaking time of "a" and "the" and "an" correctly. Because short words are typically high frequency (and vice versa) this meant that it tended to underestimate all other words. And because my relatively frequencies were from Wikipedia, they might not match yours. Keeping a uniform distribution felt like a better solution.

I also tried not regularizing and also using things like character ngram features to get a better fit. In the end I didn't include this either. You can about halve the error rate by doing these things, but I felt like having simpler, smaller models made more sense here. After all, the thing we're regressing on (speaking time from a TTS system) is sort of an artificial benchmark anyway, so getting a bit lower error is not obviously that meaningful.

Overall the mean absolute errors in prediction are:

  • German: 51ms
  • English: 55ms
  • French: 59ms
  • Italian: 41ms
  • Japanese: 33ms
If you were to just predict the median on each, you would get mean absolute errors in the 550ms (Japanese) to 980ms (English) range, so this is quite an improvement.

How are the coefficients estimated for sentences?

For sentences, I fit a model of the form:
  • time = const + a * [summed time for words] + b * [# of words]
The three parameters (const, a and b) were estimated by having the TTS systems speak entire sentences (about 40k from each language, taken randomly from Wikipedia) and then using the total time from the TTS (unioned across all languages) and the corresponding features. These are simply hardcoded and shared across languages. You could of course do a bit better by doing this on a language-by-language basis, but I didn't do this again for simplicity.


Chances are no one except me wants this. But if you do, please feel free to use it. I'd appreciate some sort of credit though :). And if you really want the dictionaries, you can find them here: de, en-US, fr, it, ja.

Thanks to Graham Neubig for providing the tokenized Japanese Wikipedia text and pointing me to Open JTalk!

08 September 2015


Many students, in response to their assigned reading for today's undergrad ML class, asked me for a formal definition of overfitting. The fact that I don't really have one irked me, especially this this basically is the problem in machine learning. This blog post is an extended version of the answer I gave in class.

Intuitively, overfitting is when your learned model is doing much better on training data than it will do on test data from the same distribution. This happens because your model learns to fit random idiosyncrasies in the training data rather than model the true underlying trends.

I then pointed to a recent high-profile paper by Cynthia Dwork and colleagues on Holdout Reuse (appeared behind a paywall in Science and something similar seems to be in NIPS too). This paper presents some really cool ideas that use machinery from differential privacy to allow one to effectively reuse the same heldout ("development") data multiple times, without worrying about overfitting the holdout. But that's not what I wanted to highlight; I wanted to highlight the formal definition Dwork et al. give for overfitting. They (roughly) say that a model has overfit the training data when its test error is larger than its training error (by at least some small amount).

I feel like this definition captures some of the essence of overfitting, but it also misses what I would call my "everyday experience" with using machine learning in practical, real world settings. And my everyday experience is that you training error is always lower than your test error. To confirm to myself that I wasn't misremembering my experiences, I ran a few experiments on some simple classification tasks. Below is a plot of train/dev/test error on Reuters RCV1:

Here, the x-axis is regularization strength (this is all with libsvm, C=2^-xvalue) and the y-axis is error rate. There are error bars on dev/test but they're so small you can't really see them.

The thing I want to highlight in this figure is that the best performance, measured by dev or test performance, occurs when the regularization is a bit below 0. But the point at which the gap between dev performance and training performance starts to inflate is much earlier, maybe around -4. The title of the figure tells us that the test error achieved when the training error drops below dev error is 9.48%; but when dev bottomw out the test error is 6.08%. That's leaving 33% relative error on the table.

RCV1 isn't unique. Below are the same plots for MNIST on several binary splits:

For all three splits, some easier, some harder, the relative improvement is always around 20-30% by not using the definition provided by Dwork et al.

Having somewhat convinced myself that I'm not crazy, the immediate question is: what is a better definition of overfitting that captures my everyday experience? I completely agree that the definition quoted above is a necessary condition, but somehow it does not seem sufficient.

Even though this was my experience, I felt a bit flummoxed because I had a hard time pointing out exactly what I felt was missing in the definition above.

Here's what I ended up settling on. Overfitting typically happens, as stated above, because there are random idiosyncrasies in the training data that don't generalize. This is a finite sample effect. If I give you a dataset with 1000 training examples, and I add 100 completely random features, at least one of these features is going to correlate somewhat with the training labels. Overfitting occurs when the model chooses to use one of these completely random features, which helps drive the training error down, but hurts at test time.

But there's a "throwing the baby out with the bathwater" effect going on here. Although there may be completely random features that accidentally correlate slightly with the true label, there are also weak but nevertheless useful features that correlate a small amount with the true label. On a finite sample it's going to be virtually impossible to distinguish between these two.

By adopting the "training error < test error" rubric for overfitting, we're forcing ourselves to throw out both types of features. This is a trade-off, and empirically at least, it appears that this is a losing trade-off. It seems better to keep around a few of these weak-but-useful features in exchange for the risk of also keeping around a few truly-useless-and-misleading features. If this weren't the case, I would have a hard time explaining my "everyday experience."

So do I have a better suggestion?

Not really.

One really nice property of the Dwork et al. definition is that it's testable for a single model. That is, I hand you a learned function h and you can run h on some training data and some dev data and you can tell me if it's overfit. I don't have to draw curves like those above. I don't even have to know anything about how the model was built, and I don't have to sweep hyperparameters (which may or may not be sweepable).

I really like all of those properties. And I believe that if we came up with an alternative definition of overfitting that obeys these properties, we can reuse much of the awesome work in the Dwork et al. paper.

But I also feel like the definition above is overly conservative and it's risky to use.

09 June 2015

Some NAACL 2013 statistics on author response, review quality, etc.

NAACL 2015 has just passed, NAACL 2013 is long in the past.

One bonus of being a program chair is that you get to have fun with data. In this post I'd like to review two pieces of data, one related to author response and one related to review quality assessment.

tldr: Generally, I think author response is useless, except insofar as it can be cathartic to authors and thereby provide some small psychological benefit. And in general people don't seem that dissatisfied with their papers' reviews, and this is largely independent of the outcome of the paper (conditioned on selection bias of those who responded).

On Author Response

NAACL and other conferences have for some time allowed authors to respond to their reviewers, ostensibly to correct errors, but more often just to argue their position. Most people I've talked to who are in favor of author response say that it made the difference for one paper between it being rejected (pre-response) and accepted (post-response). Of course this is unknowable because decisions aren't made pre-response, so what these authors are reporting is their guess as to accept/reject before response versus actual accept after response. My own internal guesses of accept/reject are often off the mark, so you can imagine I don't find this argument particularly strong. To give a sense as to how hard this prediction is, here is a plot showing whether your paper got accepted or not as a function of the mean overall score.

Here, the x-axis is the cumulative distribution of papers (there were more rejects than accepts, so the length of these is normalized to percentage) and the y-axis is the mean overall score for the paper.

Focusing just on the solid lines (long papers), you can see that there were one or two papers with average scores of 3.6 that got rejected, and several papers with average scores of 3.0 that got accepted. If instead you look at "probability of accept" given mean overall score, basically what you see is papers with a score of 2.8 or lower are almost certainly rejected; with 3.9 or higher are almost certainly accepted, and around 3.2 it's an absolute toss-up.

Now, you might ask: who chose to respond? Perhaps people with clear outcomes elected not to respond at all. Here are the numbers for that:

This is only for long papers (it looks the same for short). Each dot is a paper and red are rejects, blue are accepts; open circles means no response. Definitely people with scores less than 3 responded less frequently than those with scores above 3, but even the top scoring papers all included some response. There were some "hopeful" papers with scores less than 2.0 that responded, even though it was certainly not useful.

But wait, you might say: perhaps those papers with average score or 1.33 gave a response and their final scores went up to 4.5? Ok first of all, these are final scores plotted above, not pre-response scores. But we certainly can look at how scores changed between the initial reviews and the final reviews. (Note this conflates two things: author response and reviewer discussion. We'll get back to that later.)

Here, we again have one dot per paper (these are randomly perturbed with small variance so they're all visible). Along the x-axis is the original score of this paper; along the y-axis is how much this score changed between the initial review and the final review. Overall the average absolute change across all papers is 0.1. The vast majority (87%) didn't change at all.

And note: scores were just about as likely to go down as up. (Several scores went down with no response, which I find troublesome, except in cases where reviewers specifically said "I'm giving benefit of the doubt, but need a clarification here.")

You could also say: well, maybe the scores didn't change much, but the review text did. Again, it's pretty rare. Out of 430 reviews (long papers only), 3 reviews decreased in length (~50 words), 46 increased by at most 100 words, and 49 increased by over 100 words (wow, awesome!). But for 80% the review didn't change at all.

Now, getting back to the question of "was this due to AR or to reviewer discussion", this is of course impossible to de-conflate, but we can look at a similar plot as a function of discussion rather than AR:

Here, a few results stand out. First, there are slightly more score changes as a function of discussion than AR. Most of those worrying dots below y=0 that didn't have a response seem to have gone down due to discussions. Moreover, a few scores that went up (on the low-scoring, almost-certainly-rejected papers) did so with no discussion, suggesting that AR did change those scores. But around the x=3.2 position (the true borderlines), almost all of these with score changes had discussion, though plenty had discussion and no score change. As opposed to the x=3.2 position in AR in which they all have responses, and still the majority of scores don't change.

An obvious response at this point is that we still don't know whether the final decision really changed as a function of response. This is a causality question we cannot perform the counterfactual for, so we're not going to get a solid answer. However, we can look at the following:

On the x-axis we have the final score of the paper, and on the y-axis we have (a smoothed version of) the probability that this paper is accepted.

The two nearly identical curves (red and blue) correspond to blue=all long papers and red=all long papers for which the authors submitted any response. In this setting, there really is almost no difference.

The slightly different curve (the black one) corresponds only to papers in which the reviewers engaged in a discussion. Around the critical point (score in the 3-3.5 range), discussion uniformly decreased odds of acceptance, which is of course not surprising for anyone who has ever participated in discussions.

You should take all this with a grain of salt, because all three of these curves are within one standard deviation of the blue curve (dotted lines).

Finally, there's of course the chance that response and discussion are highly correlated: that is, response should perhaps spur discussion. This is probably false (by the above plot), but just to drive the point home, here is the data:

The x-axis is the amount of discussion and the y-axis is the length of the response. See all those dots along x=0? Those are the ones for which the authors responded (in one case with a 2000 word essay!) and for which there was absolutely no discussion. And most of them are rejected papers.

On Review Usefulness

Some of you may recall that we did a post-conference survey of how much you liked the reviews you got. We truly did this at a later point in time because we didn't get our acts together earlier, but now I'll claim we did it at a later point in time so that authors were no longer emotionally distraught by their reviews.

We presented each corresponding author with their old reviews, but hid the scores from them (I doubt any/many went back to look). They were then asked, for each review in turn, how informative it was and how helpful it was. They could choose 0=not, 1=sort of, or 2=very.

The first hypothesis one might have (certainly I had it!) is that people "liked" reviews that were nice to them and "disliked" reviews that were not. (I've heard this called the craigslist effect: if the transaction is successful, everyone's happy; else, everyone's unhappy.) This actually appears not really to be true. Here are the results:
           informativeness    helpfulness
rejects      1.24  +-0.57     1.25 +-0.72
accepts      1.38  +-0.48     1.27 +-0.70
(+- is one standard deviation)

Basically authors of accepted papers didn't particularly find their reviews much more informative (+0.14) or helpful (+0.02) than those of rejected papers.

[Note: the sample size is 110 rejects and 86 accepts, so there was a higher response rate on accepts since the acceptance rate is about 25%.]

You might suspect, however, that how much an author likes a review is correlated with that reviewer's score rather than the overall decision. We can look at that too (recall authors didn't see the scores when completing this survey):
rating      avg(informativeness)   avg(helpfulness)
  0            3.04   +- 0.89        3.23  +- 0.87
  1            3.22   +- 0.91        3.15  +- 0.93
  2            3.21   +- 0.97        3.22  +- 0.96
Here, the first column (rating) is how the author scored the review. The second column is the average overall score that a reviewer gave an author who gave them a score of zero. For example, for reviews that were rated informative=2, the average score for those reviews was 3.21. Again, basically there's no effect.

Just to feel good about ourselves, authors do, in general, seem not-too-unhappy with their reviews.

Here's a histogram of informativeness for rejected papers:
0: ##############                                 (15%)
1: ############################################   (44%)
2: ########################################       (41%)
Here's the same histogram for accepted papers:
0: ########                                       ( 8%)
1: ############################################## (46%)
2: ############################################## (46%)
The only real difference is that rejected papers did have twice as large a frequency of "informative=0", but the rest was pretty close.

(And in case you're wondering, it's basically the same for "helpfulness." As pointed out by a friend, the inability to separate different aspects of the same object is typical when people are asked for intuitive judgements that don't involve reasoning. 70% of responses gave the same score for helpfulness and informativeness and another 29% gave scores within 1. Only 0.8% said inf=2 help=0 and 0.6% said inf=0 and help=2.)


I think of complaining about review(s|ers) as a staple hallway discussion, sort of an attractor state if you run out of other things to talk about. And we all remember the worst reviews we have had and often forget about the good ones that really do help our work get better. As Lillian Lee said at NAACL this year, it's usually our own fault when reviewers don't understand our papers. And keep in mind that all these numbers have huge randomness associated with them, for instance related to the NIPS experiment.

Despite this, it actually seems that we aren't too dissatisfied with our reviews. Only 11% of reviews were considered uninformative and, while I think we should strive to get this number down, I don't think it's that horrible. Especially when I consider the fact that 19% of papers have an average score of 2.0 or less, which basically means they were submitted not standing a chance (for a wide variety of reasons).

In general I don't think author response has an effect on the outcome that makes it worth the time and energy it takes. It might make authors feel better, but this is temporary when their paper almost certainly enjoys the same fate after response as before. I do think discussion is useful, and I'd rather see us use more time for discussion but cutting out response. Perhaps by reducing the number of papers any given area chair has to deal with. We should realize, though, that discussion almost always serves to lessen the likelihood of acceptance, probably because it's easier to argue against a paper than for, and reviewers don't really have any incentive to stand up for a paper they like, leading to a veto-effect.

15 November 2014

The myth of a strong baseline

I can probably count on my fingers the number of papers I've submitted for which a reviewer hasn't complained about a baseline in some way. I don't mean to imply that all of those complaints are invalid: many of them were 100% right on in ways that either I was lazy about or ways that I hadn't seen a priori.

In fact, I remember back in 2005 I visited MIT and gave a talk on what eventually became the BayeSum paper (incidentally, probably one of my favorite papers I've written, though according to friends not exactly the best written... drat). I was comparing in the talk against some baselines, but Regina Barzilay very rightfully asked me: do any of these baselines have access to the same information that my proposed approach does? At the time I gave this talk the answer was no. In the actual paper the answer is yes. I think Regina was really on to something here, and this one question asked in my talk has had a profound impact on how I think about evaluation since then. For that, I take a small time-out and say: Thank you, Regina.

Like all such influential comments, my interpretation of Regina's question has changed over time, and this post is essentially about my current thinking on this issue, and how it relates to this "does not compare against a strong enough baseline" reviewer critique that is basically a way to kill any paper.

If we're going to ask the question about whether an evaluation strategy is "good" or "bad" we have to ask ourselves why are we doing this evaluation thing in the first place. My answer always goes back to my prima facie question when I read/review papers: what did I learn from this paper? IMO, the goal of an evaluation should be to help isolate what I learned.

Let's go back to the BayeSum paper. There are two things I could have been trying to demonstrate in this paper. (A) I could have been trying to show that some new external source of knowledge is useful; (B) I could have been trying to show that some new technique is useful. In the case of BayeSum, the answer was more like (B), which is why Regina's comment was spot on. I was trying to claim the approach was good, but I hadn't disentangled the new approach from the data, and so you could have easily believed that the improvement over the baseline was due to the added source of data rather than the new technique.

In many cases it's not that cut and dry because a (B)-style new technique might enable the use of (A)-style new data in a way that wasn't possible before. In that case, the thing that an author might be trying to convince me of is that the combination is good. That's fine, but I still think it's worth disentangling these two sources of information as much as possible. This is often tough because in the current NLP atmosphere in which we're obsessed with shiny new techniques, it's not appealing to show that the new data gets you 90% of the gain and the new technique is only 10% on top of that. But this is another issue.

So evaluations are to help us learn something. Let's return now to the question of you didn't compare against a strong enough baseline. Aside from parroting what's been said many times in the past, what is the point of such a complaint, beyond Regina's challenge, which I hope I've made clear I agree with. The issue, as best I can understand it, is that it is based on the following logic:

  • Assumption: if your approach improves things against a strong baseline, then it will also improve against a weaker baseline, perhaps by more.
I'll note in passing that this is basically an assumption of submodularity of ideas.

And, like the title of this blog post suggest, I would like to put forth the idea that this assumption is often ridiculous, or at least that there's plentiful evidence against it.

I'm going to pick on machine translation now just to give a concrete example, but I don't think this phenomenon is limited to MT in any way. The basic story is I start with some MT system like Moses or cdec or whatever. I add some features to it and performance goes up. The claim is that if my baseline MT system wasn't already sufficiently strong, then any improvement I see from my proposed technique could just be solving a problem that's already been solved if I had just tuned Moses better.

There's truth to this claim, but there's also untruth. A famous recent example is the Devlin et al. neural network MT paper. Let me be clear: I think this paper is great and I 100% believe the results that they presented. I'm not attacking this paper in any way; I'm choosing it simply as a representative example. One of the results they show is some insane 8 bleu point gain over a very strong baseline. And I completely believe that this was a very strong baseline. And that the 8 bleu point improvement was real. And that everything is great.

Okay, so any paper that leads to an 8 bleu point gain over a very strong baseline is going to get reimplemented by lots of people, and this has happened. Has anyone else gotten an 8 bleu point gain? Not that I've heard. I've heard numbers along the lines of 1 to 2 bleu points, but it's very plausible I haven't heard the whole story.

So what's going on here?

The answer is simply that the assumption I wrote above is false. We've assumed that since they got 8 points on a strong baseline, I'll get at least as much on my baseline (which is likely weaker than theirs).

One problem is that "strong" isn't a total order. Different systems might get similar bleu scores, but this doesn't mean that they get them in the same way. Something like the neural networks stuff clearly solved a major problem in the BBN strong baseline system, but this major problem clearly wasn't as major of a problem in some other strong baseline systems.

Does this make the results in the Devlin paper any less impressive or important? No, of course not. I learned a lot from that paper. But one thing I didn't learn is "if you apply this approach to any system that's weaker than our strong baseline, you will get 8 bleu points." That's just not a claim that their results substantiate, and the only reason people seem to believe that this should be true is because of the faulty assumption above.

So does this mean that comparing to strong baselines is unimportant and everyone should go back to comparing their MT system against a word-for-word model 1 baseline?

Of course not. There are lots of ways to be better than such a baseline, and so "beating" it does not teach me anything. I always tell students not to get too pleased when they get state of the art performance on some standard task: someone else will beat them next year. If the only thing that I learn from their papers is that they win on task X, then next year there's nothing to learn from that paper. The paper has to teach me something else to have any sort of lasting effect: what is the generalizable knowledge.

The point is that an evaluation is not an end in itself. An evaluation is there to teach you something, or to substantiate something that I want to teach you. If I want to show you that X is important, then I should show you an experiment that isolates X to the best of my ability and demonstrates an improvement, preferably also with an error analysis that shows that what I claim my widget is doing is actually what it's doing.

01 November 2014

EMNLP 2014 paper list (with mini-reviews)

I'm going to try something new and daring this time. I will talk about papers I liked, but I will mention some things I think could be improved. I hope everyone finds this interesting and productive. As usual, I didn't see everything, didn't necessarily understand everything, and my errors are my fault. With that warning, here's my EMNLP 2014 list, sorted in anthology order.

  • Identifying Argumentative Discourse Structures in Persuasive Essays (Christian Stab, Iryna Gurevych)
    Full-on discourse parsing of rhetorical structure (e.g., RST) is really hard. In previous work, these authors created a corpus of essays annotated for (a) major claim, (b) claim, (c) premise and (d) none. (Prevalances of 5%, 23%, 55% and 17%, respectively.) This paper is about predicting that structure; in particular, both labeling spans (sentences?) as to their rhetorical class (of those four possibilities) and connecting them with binary support/non-support labels. Overall, they do quite well at both tasks: high to mid seventies in accuracy/F, respectively. They blame some of their errors on lack of coreference resolution. One question I had was: if you think of this annotation as a boiled down version of full rhetorical structure, what are you missing? Only 17% of sentences aren't one of the four classes, but if these fall high up in a rhetorical tree, I could imagine they would cut off a lot more of the structure. That said, I really liked that the authors found a tractable subproblem of full discourse parsing, and looking at it on its own.

  • Aligning context-based statistical models of language with brain activity during reading (Leila Wehbe, Ashish Vaswani, Kevin Knight, Tom Mitchell)
    When you read something, your brain lights up in certain ways. This paper is about finding predictive correlations between what you're reading (in particular, how a language model will score the incoming word) and how your brain will light up (via MEG imaging--this is the one with good time resolution but bad spacial resolution). The basic result is that the hidden layer in a recursive neural network language model does a pretty good job at the following prediction task: given an MEG reading, which of two words (the true one and a confounder) is the person reading. It does pretty well when readers are reading from Harry Potter: a big novelty here is that they use natural text rather than carefully laboratory controlled text. This is a cool step in the line of connecting NLP with neurolinguistics. The "hypothesis" in this paper is couched in the language of the integration theory of language processing (i.e., when you hear a word, your brain has to spend energy to integrate it into your prior analysis) rather than the (in my relatively uneducated opinion) preferred model of prediction theory (i.e., your brain has already predicted the word and the work it does is correcting for errors in its predictions). I think I can reinterpret their results to be consisted with the predictive theory, but I'd like to not have to do that work myself :P. I think the hardest part about this work is the fact that it uses natural, rather than laboratory, text. For all previous experiments I know, it's been very important to control for as much as you can about the context (history) and word to be predicted. In fact, most work only varies the history and never varies the word. This is done because any two words vary in so many ways (frequency, spelling, phonology, length, etc.) that you do not want to have to correct for those. For instance, you can ask if a bilingual speaker represents "dog" and "chien" in the same way, but this is a silly question because of course they don't: at the very least, they sound different and are spelled differently. Anyway, back to this paper, in the experiment they control for word length on the pairwise task, but I think controlling for word probability (under the surprise/predictive hypothesis) would be stronger. Though really as neat as it is to use natural text, I think it hurts more than it helps here.

  • Learning Abstract Concept Embeddings from Multi-Modal Data: Since You Probably Can't See What I Mean (Felix Hill, Anna Korhonen)
    This is the ultimate "duh" moment paper for me: basically we're all excited about embedding words and images into the same space, but lots of words aren't concrete and don't have visual counterparts (e.g., "flower" versus "encouragement"). They argue that concrete nouns are the rare category (72% of nouns and verbs are at least as abstract as "war"). The model they have for this is relatively simple: ignore abstract words. (Ok it's a bit more than that, but that's the general idea.) When you do this correctly, you get a marked improvement in performance. One thing I kept wondering during the talk, especially since Felix kept using the word "concrete" to refer to something that's not actually a mix of water, aggregate and cement, was the whole issue of metaphor and really word sense. Unfortunately neither of these words seem to appear in the paper, so it's hard to say what's going on. I think it could be really cool to combine these ideas with those of understanding metaphor or at least sense, though of course those are really really hard problems!

  • NaturalLI: Natural Logic Inference for Common Sense Reasoning (Gabor Angeli, Christopher D. Manning)
    I haven't followed the natural logic stuff for a while, so my assessment of this paper is pretty uninformed, but this was perhaps my favorite paper at the conference: I'll be presenting it in reading group on Monday, and might post more notes after that. Basically the idea is that you have a knowledge base that contains sentences like "The cat ate a mouse" and you have a query like "No carnivores eat animals". Instead of trying to map these sentences to logical form, we're going to treat the text itself as a logical form and do sentence rewriting to try to prove one from the other. An important insight is that quantification changes the direction of entailment that you can move, for instance if you say "All cats ..." then you can move down WordNet to "All housecats ..." but not up to "All mammals ..."; similarly if you say "No cats ..." then you can only move up the hierarchy. They set up a model like this, cast it as a big search problem, and get remarkably good scores (89% precision, 94% recall) on the FraCaS textual entailment task, which is competitive with the competition (though on a subset of the data, so not really comparable). Like I said, I really liked this paper; I could easily imagine trying to combine these ideas with stuff in the paraphrase database, especially once they have entailment directions annotated/predicted there.

  • A Fast and Accurate Dependency Parser using Neural Networks (Danqi Chen, Christopher Manning)
    The very brief summary of this paper is: take MaltParser and replace the SVM with a neural network and you do well. Slightly more detailed is that when MaltParser makes decisions, it looks at a few points on the stack and in the buffer. For each of these points, take the word and POS, map them to embeddings. You now have a half dozen embeddings (yes, we're embedding POS tags too). Also embed the dependency relation. Throw this through a network and make a prediction. The result 92% accuracy (basically tied with MST Parser, better than MaltParser by 2%) and about twice as fast as MaltParser. They also compare to their own implementation of MaltParser. Perhaps I missed it in the talk, but I think they only talked about their implementation there (they are 20 times faster) and not about real MaltParser. In order to get this speed, they do a caching trick that remembers vector-matrix products for common words. This is very cool and it's neat to see the embeddings of the POS and dependency relations: reminds me of Matsuzaki-style annotated labels. The one weakness here, I think, is that it's kind of unfair to compare a version of their own parser that with significant engineering tricks (the caching made it 8-10 times faster, so without it, it's 5 times slower than MaltParser) to an un-tricked-out MaltParser: in particular, if you really care about speed, you're going to do feature hashing in MaltParser and never compute all the strings that make it so slow. If this were done in MaltParser, then the neural network version, even if everything were cached, would be doing strictly more computation. Anyway, all that says is that I'm more impressed with the accuracy results than the speed results: and they appear to be doing quite well here, at the very least without sacrificing speed. (And as Yoav Goldberg reminds me, parsing on GPU with this architecture could be a big win!)

  • Unsupervised Sentence Enhancement for Automatic Summarization (Jackie Chi Kit Cheung, Gerald Penn)
    We all know how to do sentence fusion: take two sentences and stitch them together. The idea in this paper is to also enhance those sentences with little fragments taken from elsewhere. For instance, maybe you have a great sentence fusion but would like to add a PP taken from a related document: how can we do this? The model is basically one of extracting dependency triples, gluing them together and linearizing them. The biggest problem they face here is event coreference: given two sentences that contain the same (up to lemma) verb, is it the same event or not? They address event coref by looking at the participants of the predicate, though acknowledge more complex approaches might help further. Without the event coreference their approach hurts (over simple fusion) but with event coreference it helps. I would really have liked to have seen human judgments for both quality and grammaticality here: these things are notoriously hard to measure automatically and I'm not sure that we can really ascertain much without them. (Though yes, as the authors point out, human scores are notoriously hard to interpret too.)

  • What Can We Get From 1000 Tokens? A Case Study of Multilingual POS Tagging For Resource-Poor Languages (Long Duong, Trevor Cohn, Karin Verspoor, Steven Bird, Paul Cook)
    This paper is about a combination of projection methods (via parallel data) and small labeling (in the target language) to correct errors or annotation mismatches of the projection. The primary comparison is against Oscar Tackstrom's work (which I have an affinity for, having been his examiner for his Ph.D. defense) and the claim in this paper is that they do better with less. They can do even better using dictionaries of the type that Oscar has. One thing that's surprising is that when they replace the maxent correction model with a CRF things get worse. This seems wacky, but given the authors, I'm pretty confident that this isn't because they messed something up, which might be my default reaction :P. One obvious question is whether you can do projection and correction jointly, and I suspect someone will try this at some point. It could be cool to do it jointly not just over task but over languages in a multitask learning framework.

  • Greed is Good if Randomized: New Inference for Dependency Parsing (Yuan Zhang, Tao Lei, Regina Barzilay, Tommi Jaakkola)
    Local search is back! Let's do dependency parsing by first generating a random tree, then doing bottom-up reassignment of parents to improve it. You can prove that if you do this bottom up, then in one pass you can transform any tree to any other and all intermediate steps will also be trees. This space has lots of local optima (one cool thing is that in this paper they can count the number of local optima in n-cubed time using a variant of Chu-Liu-Edmonds, at least for first order models). To get around this they do a few hundred random restarts. One thing I really liked about this paper is that they went above and beyond what I might expect to see in a paper like this, really nailing the question of local optima, etc. They also use much more complex reranking features, which are available because at any step they have a full tree. The results are really good. I'm a bit surprised they need so many random restarts, which makes me worry about generalizing these results to more complex problems, which I think is the most important thing cuz I'm not sure we need more algorithms for dependency parsing. One cool thing is they get a natural anytime algorithm: you can stop the restarts at any point and you have some tree. This could be coupled in interesting ways with scheduling to handle cases where you need to parse one million sentences in some fixed amount of time: how much time do you spend on each? I was a bit surprised that they don't specifically train a model to make search decisions: they just train the model using their decoder and standard update strategies. It seems like something like Wick et al.'s SampleRank is just screaming to be used here.

Okay, so that's my list for EMNLP with mini-reviews. As I said before, these "reviews" are based 90% on the talk and 10% on skimming the papers later looking for keywords and examples, which means that there are almost certainly tons of errors. If someone notices an error and points it out, I'll just directly edit the post to fix it, with numerous apologies to the authors. Acknowledging a small amount of bias, I also really like both of the student papers I was involved in at EMNLP. The first is about question answering and the result is that we now beat humans at lots of QuizBowl questions. The second is about trying to learn policies for simultaneous interpretation: basically when should we wait for more input. I put links below if these sound interesting to you.
Please comment with your own favorites!

10 October 2014

Hyperparameter search, Bayesian optimization and related topics

In terms of (importance divided-by glamour), hyperparameter (HP) search is probably pretty close to the top. We all hate finding hyperparameters. Default settings are usually good, but you're always left wondering: could I have done better? I like averaged perceptron for this reason (I believe Yoav Goldberg has also expressed this sentiment): no pesky hyperparameters.
But I want to take a much broader perspective on hyperparameters. We typically think of HPs as { regularization constant, learning rate, architecture } (where "architecture" can mean something like neural network structure, choice of kernel, etc. But I think it's a lot broader and can include things like feature engineering, or at least representation modifications. For instance, vw now supports a number of helpful NLP feature templates: suffix and prefix features (via --affix), spelling features (via --spelling), ngram features (--ngrams), quadratic and cubic features, etc. Picking the right incarnation of these thing is essentially a hyperparameter search process, very akin (IMO) to things like representation learning.

Once you're willing to accept all these things as HPs (and I think you should), something like "grid search", which works for tuning C and eta in your SVM, just doesn't seem to cut it anymore.

Enter the world of automatic HP tuning. Lots of this work, not surprisingly, comes out of the neural networks community, because HP search is a big deal there. Most of my information here comes via Hugo Larochelle, who has done a lot of great work. Some places to start looking:

  • Spearmint toolkit by Snoek, Larochelle, Swersky, Zemel and Adams (and a JMLR paper)
  • SMAC by Hutter, Hoos, Leyton-Brown, Murphy and Ramage (and a paper
Most of this work falls under the framework of "Bayesian Optimization." The idea comes from the space of derivative-free optimization, where a common strategy is to fit a response surface. Basically you have a bunch of hyperparameters to tune. For any setting of hyperparameters, you can observe some response. In ML land this is usually something like held-out accuracy. Now, fit a regression function that can map from hyperparameters to response. Do something that looks like active learning to explore this space, with a bias toward finding places with high response (high accuracy). In these examples, the function being fit is a Gaussian Process, which is super useful because it can provide realistic estimates of variance, which are useful in the active learning/experimental design.

I first learned about this stuff, not in the context of HP optimization, from Ilya Ryzhov, a faculty member in our business school who works on these topics -- I learned that in his world, exploration/exploitation is also called "learning versus earning" which I think it awesome :).

I would love if hyperparameter optimization were a black box, and Spearmint and SMAC are both great steps in this direction. There are a couple of things that I haven't seen (though like I said, I'm not hugely well read in this space) that I would love to see.
  • Learning across multiple datasets, akin to meta-learning. I imagine that if you give me a problem to do HP optimization on, and also give the same problem to an undergraduate, I would be much better. Presumably I've learned from past experience how to do this well.
  • More importantly, taking advantage of some of the structure of learning algorithms (rather than oblivious black box optimization) seems like it could be a big win. For instance, most learning algorithms have some notion of early stopping (# of passes over the data, tolerance for convergence, etc.). You can also of course run on subsets of the data (which is equivalent in many cases). If you assume heldout accuracy doesn't get worse over time (e.g., because you do early stopping) then you can think of this as a type of right-censoring. I.e., you run an experiment part of the way through and you know "ok if I kept running it might get better, but I know it's at least 85% accurate now." The SMAC folks had a nice paper on Bayesian optimization with censored data, but my (incomplete) understanding is that it doesn't quite capture this (very common, IMO) case. I should be willing to start and stop processes at various points and try to figure out where to invest more computation to get better estimates. I can presumably even estimate: if I ran another pass, how much better do I think it would get?
  • I also think the focus on "finding the best hyperparameters" is somewhat the wrong problem. We want to find the best parameters, period. Hyperparameters are a nuisance on their way to that end. For instance, related to the above bullet, I could imagine running a few passes with one setting of hyperparameters, and then doing some other work, and then going back and restarting that previous run with a different setting of hyperparameters (assuming the model being learned is such that "warm starting" makes sense, which is almost always the case--except maybe in some neural network settings).
  • Parallelization is a big deal. One of the reasons something akin to grid search is so attractive is that it's trivial to submit 20*20*20 jobs to my cluster and just wait for them to finish. Anything that's less friendly than doing this is not worth it. Again, the SMAC folks have worked on this. But I don't think the problem is solved.
Beyond these technical issues there's always the obnoxious issue of trust. Somehow I need to believe that I'm not better than these algorithms at tuning hyperparameters. I should be happy to just run them, preferably saying "okay, here are 120 cores, you have four hours -- go to town." And I should believe that it's better than I could do with equivalent time/resources by clever grid search. Or perhaps I should be able to encode my strategies in some way so that it can prove to me that it's better than me.

Overall, I'd love to see more work on this problem, especially work that doesn't focus on neural networks, but still takes advantage of the properties of machine learning algorithms that are not shared by all black-box derivative-free optimization tasks. In the mean time, from what I hear, SMAC and Spearmint are actually quite good. Would love to hear if any NLP people have played around with them!