Short Personal History
Back in 2005, John Langford, Daniel Marcu and I had a workshop paper at NIPS on relating structured prediction to reinforcement learning. Basically the message of that paper is: you can cast structured prediction as RL, and then use off-the-shelf RL techniques like conservative policy iteration to solve it, and this works pretty well. This view arose, for me, main from Collins and Roark's incremental structured perceptron model, but I'm sure it dates back much further than that (almost certainly there's work from neural nets land in the 80s; I'd certainly appreciate pointers!). This led eventually to Searn, and then in the early 2010s, I went around a bunch of places (ACL, AAAI, ICML, etc.), espousing connections between structured prediction and (inverse) reinforcement learning (slides).
This kinda sorta caught on in a very small subcommunity.
One thing I should have realized looking back ten years is that you should not try to sell a hammer to hammer manufacturers, but rather to people with weird nails. I spent too much time and energy trying to convince the "traditional structured prediction" crowd (the CRF and M3N and SVMstruct folks) that the "RL" view of structured prediction was awesome. That was a losing strategy, unfortunately. (Although I learned a few nights ago at dinner that this might be changing now!)
But now everything has changed. To some degree, Ross, Gordon and Bagnell's DAgger is a successful, better successor to Searn, and for years I had gone around telling everyone DAgger is my favorite algorithm ever (it consistently outperforms Searn's in their---and my---experiments, and is really really easy to implement and has stronger-ish theory). And then DAgger (or more precisely DAD) gets renamed/rebranded as "scheduled sampling" (though you should read Marc'Aurelio's comment, which is very on point), and now these ideas are everywhere, particularly in sequence-to-sequence neural transduction models.
In the past year or two, there's been a flurry of work applying not just imitation learning algorithms like DAgger to neural models for structured prediction problems, but also just applying straight-up RL algorithms (like reinforce, policy gradient, or actor/critic) to them. The important point is that while people have tried to do things like neural CRFs, etc., the basic sequence-to-sequence style model naturally fits a search-based structured prediction (aka RL-ish) view.
But these tasks are not the same and, in fact, structured prediction is much simpler, and I think we need to develop algorithms that take that into account.
The biggest difference is that in (all or at least almost all) structured prediction problems, conditioned on the input x, the world is known, deterministic and therefore reversible and/or fully-explorable (modulo limits of computation). This is generally not true in RL, and one of biggest challenges in RL is that once you take an action, you cannot un-take that action, and you cannot try out other alternatives.
That is to say: computation aside, in structured prediction, conditioned on x, you can build out the entire search tree and do whatever the heck you want with it. (Of course "computation aside" makes no sense in a SP setting because the whole difficulty of SP is computation.) In fact, one of the big advantages to things like CRFs is effectively that they do build out the entire search tree, at least implicitly, which is possibly precisely because of the limited expressivity of features.
This observation is probably perfectly obvious to most NLP folks. In a sense, the semantic parsing crowd has been doing something reinforcement-learning like for quite some time. You produce a semantic parse, run it against a database, check if you get the correct answer or not. If so, positive reward; if not, negative. But no one (as far as I know) just produces one parse: you produce a beam of a bunch of parses and try them all. This is definitely not something you can do in standard RL, but it is something you can do in structured prediction.
In my mind, this was (and continues to be) one of the major weakness of the Searn/DAgger approach to structured prediction that continues to be a problem in applying standard RL algorithms to structured prediction. In a real sense, I think this is something that incremental perceptron, broken LaSO and not-broken LaSO-BST, and seq2seqLaSO got right that the more RL-ish approaches got wrong. (This continues to be true in bandit structured prediction, which will get a separate post in the maybe-near future.) One approach that blends the two to some degree is Vieira and Eisner's recent paper that uses dynamic programming and change propogation within LOLS (which is effectively a variant of AggreVaTe, a follow-on to DAgger) to learn to prune (it's not obvious to me how to generalize this to other tasks yet).
Why the Gap?
I don't think this gap is an accident, and I think there are essentially two reasons it exists.
First, as suggested above, is the question of computation. If you're willing to say "I don't care about computation" then you might as well put yourself in CRF world where life is (relatively) easy, at least if you want to do analysis. If you do care about computation, then doing the "purely greedy" thing is very natural and then you can say "well I know I'm computationally efficient because I'm greedy, and now I can focus entirely on the statistical problem." Once you're willing to spend a bit more computation, you have to figure out how to "charge" yourself properly for that computation. That is, you enter a world where there's a trade-off between statistics and computation (though not in the usual sense) and it's not at all clear how to balance that. It's also hard to convince myself that it's better to spend ten times longer on a single structured example than to do ten different examples. This is a question I've been interested in since my dissertation (p44) but have made basically zero progress on:
As always, I'd appreciate pointers to work that I don't know about that addresses any of these challenges!