Short Personal History
Back in 2005, John Langford, Daniel Marcu and I had a workshop paper at NIPS on relating structured prediction to reinforcement learning. Basically the message of that paper is: you can cast structured prediction as RL, and then use off-the-shelf RL techniques like conservative policy iteration to solve it, and this works pretty well. This view arose, for me, mainly from Collins and Roark's incremental structured perceptron model, but I'm sure it dates back much further than that (almost certainly there's work from neural nets land in the 80s; I'd certainly appreciate pointers!). This led eventually to Searn, and then in the early 2010s, I went around a bunch of places (ACL, AAAI, ICML, etc.), espousing connections between structured prediction and (inverse) reinforcement learning (slides).
This kinda sorta caught on in a very small subcommunity.
One thing I should have realized looking back ten years is that you should not try to sell a hammer to hammer manufacturers, but rather to people with weird nails. I spent too much time and energy trying to convince the "traditional structured prediction" crowd (the CRF and M3N and SVMstruct folks) that the "RL" view of structured prediction was awesome. That was a losing strategy, unfortunately. (Although I learned a few nights ago at dinner that this might be changing now!)
But now everything has changed. To some degree, Ross, Gordon and Bagnell's DAgger is a better, more successful successor to Searn, and for years I had gone around telling everyone DAgger is my favorite algorithm ever (it consistently outperforms Searn in their---and my---experiments, is really really easy to implement, and has stronger-ish theory). And then DAgger (or more precisely DAD) gets renamed/rebranded as "scheduled sampling" (though you should read Marc'Aurelio's comment, which is very on point), and now these ideas are everywhere, particularly in sequence-to-sequence neural transduction models.
Nowadays
In the past year or two, there's been a flurry of work applying not just imitation learning algorithms like DAgger to neural models for structured prediction problems, but also just applying straight-up RL algorithms (like REINFORCE, policy gradient, or actor-critic) to them. The important point is that while people have tried to do things like neural CRFs, etc., the basic sequence-to-sequence style model naturally fits a search-based structured prediction (aka RL-ish) view.
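To make the "straight-up RL on sequence models" idea concrete, here is a minimal REINFORCE-style sketch on a toy per-step policy. The parameterization, reward and baseline are all illustrative placeholders I made up for this post, not anything from a particular paper; the point is just the shape of the score-function update.

```python
# Minimal REINFORCE sketch for sequence prediction with a toy log-linear
# per-step policy over a tiny vocabulary. Everything here is illustrative.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, LENGTH = 5, 6
W = 0.01 * rng.standard_normal((VOCAB, VOCAB))   # P(next token | prev token) parameters
reference = rng.integers(0, VOCAB, size=LENGTH)  # pretend gold output sequence

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_episode():
    """Sample one output sequence; accumulate the gradient of its log-probability."""
    prev, grad, tokens = 0, np.zeros_like(W), []
    for _ in range(LENGTH):
        probs = softmax(W[prev])
        tok = rng.choice(VOCAB, p=probs)
        # d log p(tok | prev) / dW[prev] = onehot(tok) - probs
        grad[prev] += np.eye(VOCAB)[tok] - probs
        tokens.append(tok)
        prev = tok
    return np.array(tokens), grad

baseline, lr = 0.0, 0.1
for step in range(2000):
    tokens, grad_logp = sample_episode()
    reward = float((tokens == reference).sum())  # task reward, e.g. per-token match
    baseline = 0.95 * baseline + 0.05 * reward   # running-average baseline
    W += lr * (reward - baseline) * grad_logp    # score-function (REINFORCE) update
```

Note that the update only ever sees the one trajectory it sampled, which is exactly the property I want to pick on below.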
But these tasks are not the same and, in fact, structured prediction is much simpler, and I think we need to develop algorithms that take that into account.
The biggest difference is that in (all or at least almost all) structured prediction problems, conditioned on the input x, the world is known, deterministic and therefore reversible and/or fully-explorable (modulo limits of computation). This is generally not true in RL, and one of the biggest challenges in RL is that once you take an action, you cannot un-take that action, and you cannot try out other alternatives.
That is to say: computation aside, in structured prediction, conditioned on x, you can build out the entire search tree and do whatever the heck you want with it. (Of course "computation aside" makes no sense in an SP setting because the whole difficulty of SP is computation.) In fact, one of the big advantages to things like CRFs is effectively that they do build out the entire search tree, at least implicitly, which is possible precisely because of the limited expressivity of the features.
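Here is a toy sketch of what "known, deterministic and resettable conditioned on x" buys you: a structured prediction state is just (x, prefix of actions), so you can revisit any state, try every sibling action, or (computation aside) enumerate every complete output. The names here are made up for illustration, not any particular system's API.

```python
# Toy sketch: because transitions just append an action to a prefix,
# the whole search tree is available to us, unlike in general RL.
from itertools import product

def transition(state, action):
    """Deterministic transition: the next state is just the longer prefix."""
    x, prefix = state
    return (x, prefix + (action,))

def full_search_tree(x, actions, horizon):
    """Enumerate every complete output -- impossible in general RL,
    trivial (modulo computation) in structured prediction."""
    return [(x, seq) for seq in product(actions, repeat=horizon)]

def try_all_siblings(state, actions):
    """From any state we can 'un-take' an action and try all alternatives."""
    return [transition(state, a) for a in actions]

x = "the input sentence"
print(len(full_search_tree(x, actions=[0, 1, 2], horizon=4)))  # 3**4 = 81 complete outputs
print(try_all_siblings((x, (0, 1)), actions=[0, 1, 2]))        # all one-step deviations
```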
This observation is probably perfectly obvious to most NLP folks. In a sense, the semantic parsing crowd has been doing something reinforcement-learning like for quite some time. You produce a semantic parse, run it against a database, check if you get the correct answer or not. If so, positive reward; if not, negative. But no one (as far as I know) just produces one parse: you produce a beam of a bunch of parses and try them all. This is definitely not something you can do in standard RL, but it is something you can do in structured prediction.
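For concreteness, here is the shape of that semantic parsing loop. The `generate_beam` and `execute` functions and the toy database are hypothetical stand-ins for a real parser and a real database; the point is only that you get to score every parse on the beam, not just one sampled trajectory.

```python
# Sketch of the semantic-parsing setup: produce a beam of candidate parses,
# execute each against the database, reward the ones with the right answer.
def generate_beam(question, beam_size=10):
    """Stand-in for a learned parser returning its top-k candidate parses."""
    return [f"parse_{i}({question!r})" for i in range(beam_size)]

def execute(parse, database):
    """Stand-in for running a logical form against the database."""
    return database.get(parse)

def beam_rewards(question, gold_answer, database, beam_size=10):
    """Try *every* parse on the beam -- something plain RL, which commits to
    a single sampled trajectory, does not let you do."""
    rewards = []
    for parse in generate_beam(question, beam_size):
        answer = execute(parse, database)
        rewards.append((parse, 1.0 if answer == gold_answer else -1.0))
    return rewards

toy_db = {"parse_0('who wrote Hamlet')": "Shakespeare"}
print(beam_rewards("who wrote Hamlet", "Shakespeare", toy_db, beam_size=3))
```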
In my mind, this was (and continues to be) one of the major weaknesses of the Searn/DAgger approach to structured prediction, and it continues to be a problem in applying standard RL algorithms to structured prediction. In a real sense, I think this is something that incremental perceptron, broken LaSO and not-broken LaSO-BST, and seq2seqLaSO got right that the more RL-ish approaches got wrong. (This continues to be true in bandit structured prediction, which will get a separate post in the maybe-near future.) One approach that blends the two to some degree is Vieira and Eisner's recent paper that uses dynamic programming and change propagation within LOLS (which is effectively a variant of AggreVaTe, a follow-on to DAgger) to learn to prune (it's not obvious to me how to generalize this to other tasks yet).
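For readers who haven't seen the roll-in/roll-out structure behind LOLS-style learning to search, here is a hedged sketch of the data-collection step. The `learned_policy`, `reference_policy` and `loss` arguments are hypothetical stand-ins; this is the shape of the algorithm as I understand it, not any paper's exact implementation. Note how even this only explores one-step deviations from a single roll-in trajectory, which is the limited use of explorability I'm complaining about above.

```python
# Sketch of LOLS-style roll-in / roll-out data collection: roll in with the
# learned policy, try every one-step deviation, roll out with a mixture of
# the reference and learned policies, and record a cost-sensitive example.
import random

def rollout(state, first_action, policy, horizon, transition):
    """Take `first_action`, then follow `policy` to the end of the sequence."""
    state = transition(state, first_action)
    while len(state) < horizon:
        state = transition(state, policy(state))
    return state

def lols_example_costs(x, actions, learned_policy, reference_policy,
                       loss, horizon, beta=0.5):
    """Collect cost-sensitive examples along one roll-in trajectory -- possible
    only because, conditioned on x, the world is deterministic and resettable."""
    transition = lambda prefix, a: prefix + (a,)
    mixture = lambda s: reference_policy(s) if random.random() < beta else learned_policy(s)
    data, state = [], ()                    # roll-in starts from the empty prefix
    for t in range(horizon):
        costs = {a: loss(x, rollout(state, a, mixture, horizon, transition))
                 for a in actions}          # cost of each one-step deviation
        data.append((state, costs))         # cost-sensitive training example
        state = transition(state, learned_policy(state))  # continue the roll-in
    return data

# Toy usage: the gold output is all 1s, the loss is Hamming distance to it.
gold = (1, 1, 1, 1)
examples = lols_example_costs(
    x="some input", actions=[0, 1],
    learned_policy=lambda s: 0,               # deliberately bad learned policy
    reference_policy=lambda s: gold[len(s)],  # oracle continues the gold sequence
    loss=lambda x, y: sum(a != b for a, b in zip(y, gold)),
    horizon=len(gold))
```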
Why the Gap?
I don't think this gap is an accident, and I think there are essentially two reasons it exists.
First, as suggested above, is the question of computation. If you're willing to say "I don't care about computation" then you might as well put yourself in CRF world where life is (relatively) easy, at least if you want to do analysis. If you do care about computation, then doing the "purely greedy" thing is very natural and then you can say "well I know I'm computationally efficient because I'm greedy, and now I can focus entirely on the statistical problem." Once you're willing to spend a bit more computation, you have to figure out how to "charge" yourself properly for that computation. That is, you enter a world where there's a trade-off between statistics and computation (though not in the usual sense) and it's not at all clear how to balance that. It's also hard to convince myself that it's better to spend ten times longer on a single structured example than to do ten different examples. This is a question I've been interested in since my dissertation (p44) but have made basically zero progress on:
(Note that I don't currently agree with the entirety of this passage---in particular, the complexity argument is somewhat broken---but I think the basic idea is right.)
As always, I'd appreciate pointers to work that I don't know about that addresses any of these challenges!
One of the advantages of RL is that it does NOT build out the whole search tree, so does not require detailed symbolic representations of multiple alternatives at the same time. Neither the brain's neural hardware nor the various fancy-dan deep learning networks comfortably accommodate the representation of detailed symbolic alternatives, as far as we know. So it is well worth pursuing RL and similar methods in which the representation of prior context and current uncertainty is less necessarily cut-and-dried than in (say) CRFs.
I wonder if the argument that "RL does not build out the whole search tree" is true in practice. Most people perform epochs over the dataset and therefore reset the world to a start state. The agent can then execute a trajectory, different from what it tried earlier, and keep doing so to eventually explore the entire search space. In fact, papers such as Trust Region Policy Optimization (Figure 1, https://arxiv.org/pdf/1502.05477.pdf) perform multiple rollouts from the same state (to be fair, they do mention that it works in a simulation). Of course this won't hold in the real world: you cannot remake a glass jar that a robot has accidentally broken while exploring. Similarly, you cannot just reset to a start state that easily.
@Chris: thanks! I don't know enough about how the brain works to say something interesting, but this is cool to think about. In eye-tracking, people do look back, for instance, when they get garden pathed or whatever, which isn't necessarily maintaining multiple hypotheses, but maintaining some sort of uncertainty. (Like Jinho Choi's selective branching.)
@Dipendra: Agreed, good point. CPI also assumes you can reset (which is one reason we chose it), which is totally a fine assumption in SP. Even if you do ten passes over the data, though, you're still only trying a very very small subset of possible trajectories. But this definitely goes to the question of: would I rather expand more now, or do this sentence again later in a few hours?
There is a lot of psychological evidence about what happens during sentence processing, including evidence that there is some kind of representation of non-determinism. For example, there is Marslen-Wilson's Cohort Model (https://en.wikipedia.org/wiki/Cohort_model). But models like this are largely silent on details of the representations used, which computational models cannot be.
Hi Hal, an interesting blog post!
You might be interested in our recent paper which adapts the LOLS approach to training RNNs:
SEARNN: Training RNNs with Global-Local Losses
RĂ©mi Leblond, Jean-Baptiste Alayrac, Anton Osokin, Simon Lacoste-Julien
arXiv:1706.04499
It will be presented at the ICML 2017 Workshop on Deep Structured Prediction
(in case you are around...).