It's really strange to look back now over the past ten to fifteen years and see a very small pendulum that no one really cares about swing around. I've spent the last ten years trying to convince you that structured prediction is RL; now I'm going to tell you that was a lie :).
Short Personal History
Back in 2005, John Langford, Daniel Marcu and I had a workshop paper at NIPS on relating structured prediction to reinforcement learning. Basically the message of that paper is: you can cast structured prediction as RL, then use off-the-shelf RL techniques like conservative policy iteration to solve it, and this works pretty well. This view arose, for me, mainly from Collins and Roark's incremental structured perceptron model, but I'm sure it dates back much further than that (almost certainly there's work from neural nets land in the 80s; I'd certainly appreciate pointers!). This led eventually to Searn, and then in the early 2010s, I went around a bunch of places (ACL, AAAI, ICML, etc.) espousing connections between structured prediction and (inverse) reinforcement learning (slides).
This kinda sorta caught on in a very small subcommunity.
One thing I should have realized, looking back ten years, is that you should not try to sell a hammer to hammer manufacturers, but rather to people with weird nails. I spent too much time and energy trying to convince the "traditional structured prediction" crowd (the CRF and M3N and SVMstruct folks) that the "RL" view of structured prediction was awesome. That was a losing strategy, unfortunately. (Although I learned a few nights ago at dinner that this might be changing now!)
But now everything has changed. To some degree, Ross, Gordon and Bagnell's DAgger is a better, more successful successor to Searn, and for years I had gone around telling everyone DAgger is my favorite algorithm ever (it consistently outperforms Searn in their---and my---experiments, is really really easy to implement, and has stronger-ish theory). And then DAgger (or, more precisely, DAD) gets renamed/rebranded as "scheduled sampling" (though you should read Marc'Aurelio's comment, which is very on point), and now these ideas are everywhere, particularly in sequence-to-sequence neural transduction models.
Nowadays
In the past year or two, there's been a flurry of work applying not just imitation learning algorithms like DAgger to neural models for structured prediction problems, but also just applying straight-up RL algorithms (like REINFORCE, policy gradient, or actor/critic) to them. The important point is that while people have tried to do things like neural CRFs, etc., the basic sequence-to-sequence style model naturally fits a search-based structured prediction (aka RL-ish) view.
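To make "straight-up RL on structured prediction" concrete, here is a minimal sketch of REINFORCE applied to sequence prediction: sample an output token by token, score the whole sequence with a task reward, and push up the log-likelihood of the sample in proportion to that reward. The toy featurizer and per-position-match reward below are assumptions made purely for illustration; they stand in for an encoder/decoder and a real task metric, and this is not any particular paper's method.

```python
# A minimal REINFORCE sketch for sequence prediction (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, LENGTH = 5, 8, 4
W = 0.01 * rng.standard_normal((DIM, VOCAB))   # policy parameters

def featurize(prefix):
    """Toy state features for the current prefix (hypothetical stand-in
    for an encoder/decoder hidden state)."""
    f = np.zeros(DIM)
    for i, tok in enumerate(prefix):
        f[(tok + i) % DIM] += 1.0
    return f

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reward(seq, target):
    """Sequence-level reward: fraction of positions matching the target."""
    return float(np.mean([s == t for s, t in zip(seq, target)]))

target = [1, 2, 3, 4]
for step in range(200):
    # Sample a full output sequence from the current policy.
    prefix, grads = [], []
    for _ in range(LENGTH):
        f = featurize(prefix)
        p = softmax(f @ W)
        a = rng.choice(VOCAB, p=p)
        # d/dW log p(a | prefix) for a softmax-linear policy.
        dlogp = np.outer(f, -p)
        dlogp[:, a] += f
        grads.append(dlogp)
        prefix.append(a)
    R = reward(prefix, target)
    # REINFORCE update: reward-weighted log-likelihood gradient ascent.
    W += 0.1 * R * sum(grads)

print("sampled:", prefix, "reward:", reward(prefix, target))
```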
But these tasks are not the same and, in fact, structured prediction is much simpler, and I think we need to develop algorithms that take that into account.
The biggest difference is that in (all, or at least almost all) structured prediction problems, conditioned on the input x, the world is known, deterministic, and therefore reversible and/or fully explorable (modulo limits of computation). This is generally not true in RL, and one of the biggest challenges in RL is that once you take an action, you cannot un-take that action, and you cannot try out other alternatives.
That is to say: computation aside, in structured prediction, conditioned on x, you can build out the entire search tree and do whatever the heck you want with it. (Of course "computation aside" makes no sense in an SP setting, because the whole difficulty of SP is computation.) In fact, one of the big advantages of things like CRFs is effectively that they do build out the entire search tree, at least implicitly, which is possible precisely because of the limited expressivity of their features.
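Here's a tiny sketch of what "the world is known and deterministic" buys you: because the transition function is just "append a token to the prefix," you can copy a state, try an action, back up, and (computation permitting) enumerate the whole tree. The toy vocabulary and scoring function below are assumptions for illustration only.

```python
# In structured prediction the "environment" is just (x, a deterministic,
# known transition function), so every state can be revisited and the full
# search tree can be built out if you're willing to pay for the computation.
VOCAB = ["a", "b", "c"]
LENGTH = 3

def transition(state, action):
    """Deterministic, known dynamics: append a token to the prefix.
    Nothing is lost; the old state is still available."""
    return state + (action,)

def enumerate_outputs(x):
    """Build out the full search tree (here, all len(VOCAB)**LENGTH leaves)."""
    frontier = [()]                      # root of the tree: the empty prefix
    for _ in range(LENGTH):
        frontier = [transition(s, a) for s in frontier for a in VOCAB]
    return frontier

def score(x, y):
    """Stand-in for any loss/reward you like; you get to query it on
    whatever parts of the tree you want."""
    return sum(1 for tok in y if tok in x)

x = "abba"
leaves = enumerate_outputs(x)
best = max(leaves, key=lambda y: score(x, y))
print(len(leaves), "complete outputs explored; best =", best)
```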
This observation is probably perfectly obvious to most NLP folks. In a sense, the semantic parsing crowd has been doing something reinforcement-learning-like for quite some time: you produce a semantic parse, run it against a database, and check whether you get the correct answer. If so, positive reward; if not, negative. But no one (as far as I know) produces just one parse: you produce a beam of a bunch of parses and try them all. This is definitely not something you can do in standard RL, but it is something you can do in structured prediction.
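A toy version of that semantic-parsing recipe, with a made-up three-entry database and hand-written candidate parses, just to show the shape of the loop: execute every candidate on the beam and reward the ones whose answer matches.

```python
# Execute-the-whole-beam reward, sketched with a toy database (illustrative).
DATABASE = {
    ("capital", "france"): "paris",
    ("capital", "italy"): "rome",
    ("population", "france"): 67_000_000,
}

def execute(parse):
    """Run a candidate logical form against the database; a failure just
    returns None instead of ending an episode, since we can try another."""
    return DATABASE.get(parse)

def rerank_by_execution(candidate_parses, gold_answer):
    """Try *all* candidates on the beam---something you cannot do once
    you've committed to an action in a standard RL environment."""
    scored = []
    for parse in candidate_parses:
        r = 1.0 if execute(parse) == gold_answer else 0.0
        scored.append((r, parse))
    return scored

beam = [("capital", "france"), ("capital", "italy"), ("population", "france")]
print(rerank_by_execution(beam, gold_answer="paris"))
# -> [(1.0, ('capital', 'france')), (0.0, ('capital', 'italy')), (0.0, ('population', 'france'))]
```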
In my mind, this was (and continues to be) one of the major weaknesses of the Searn/DAgger approach to structured prediction, and it continues to be a problem in applying standard RL algorithms to structured prediction. In a real sense, I think this is something that the incremental perceptron, broken LaSO and not-broken LaSO-BST, and seq2seqLaSO got right that the more RL-ish approaches got wrong. (This continues to be true in bandit structured prediction, which will get a separate post in the maybe-near future.) One approach that blends the two to some degree is Vieira and Eisner's recent paper that uses dynamic programming and change propagation within LOLS (which is effectively a variant of AggreVaTe, a follow-on to DAgger) to learn to prune (it's not obvious to me how to generalize this to other tasks yet).
Why the Gap?
I don't think this gap is an accident, and I think there are essentially two reasons it exists.
The first, as suggested above, is the question of computation. If you're willing to say "I don't care about computation," then you might as well put yourself in CRF world, where life is (relatively) easy, at least if you want to do analysis. If you do care about computation, then doing the "purely greedy" thing is very natural, and then you can say "well, I know I'm computationally efficient because I'm greedy, and now I can focus entirely on the statistical problem." Once you're willing to spend a bit more computation, you have to figure out how to "charge" yourself properly for that computation. That is, you enter a world where there's a trade-off between statistics and computation (though not in the usual sense), and it's not at all clear how to balance it. It's also hard to convince myself that it's better to spend ten times longer on a single structured example than to do ten different examples. This is a question I've been interested in since my dissertation (p. 44), but have made basically zero progress on. (I don't currently agree with the entirety of that passage---in particular, the complexity argument is somewhat broken---but I think the basic idea is right.)
Second, I think there's a bit of a looking-under-the-lamppost effect that's hard to ignore. Here, the lamppost is mainly a computational-efficiency lamppost, and secondarily a convenience lamppost. Greedy search is really fast. Even compared to beam search with a beam size of K, greedy is often much more than K times faster because you don't have the bookkeeping overhead. And it's way easier to implement greedy solutions than non-greedy ones, especially in neural land if you want things to be efficient on a GPU. And often toolkits have greedy already implemented for you. This obviously isn't unrecoverable, but it runs into the problem of: if I can do 50 sentences greedily on my GPU in the time it would take to do beam-10 search for one of them, is the computation really going to come out in my favor?
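For concreteness, here is roughly what the greedy-versus-beam comparison looks like, with a hypothetical local scoring function standing in for the (expensive) neural model call; the point is just that beam search multiplies those calls by K and adds sorting and hypothesis bookkeeping on top.

```python
# Greedy decoding vs. beam-K decoding, sketched with a toy scorer (not a real model).
import heapq
import math

VOCAB = list(range(6))
LENGTH = 5

def log_prob(prefix, token):
    """Hypothetical local model score; in practice this is the neural-network
    call that dominates the K-times-more work done by beam search."""
    return -abs(token - (len(prefix) % len(VOCAB))) - math.log(len(VOCAB))

def greedy_decode():
    # One scoring pass per step, no hypothesis bookkeeping.
    prefix = []
    for _ in range(LENGTH):
        prefix.append(max(VOCAB, key=lambda t: log_prob(prefix, t)))
    return prefix

def beam_decode(k=10):
    # K hypotheses to expand, score, and re-sort at every step.
    beam = [(0.0, [])]                           # (cumulative score, prefix)
    for _ in range(LENGTH):
        expansions = [
            (score + log_prob(prefix, t), prefix + [t])
            for score, prefix in beam
            for t in VOCAB
        ]
        beam = heapq.nlargest(k, expansions, key=lambda e: e[0])  # bookkeeping
    return beam[0][1]

print("greedy:", greedy_decode())
print("beam  :", beam_decode(k=10))
```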
As always, I'd appreciate pointers to work that I don't know about that addresses any of these challenges!