natural language processing blog: Whence your reward function?

09 December 2016

Whence your reward function?

I ran a grad seminar in reinforcement learning this past semester, which was a lot of fun and also gave me an opportunity to catch up on some stuff I'd been meaning to learn but hadn't had a chance to, and on old stuff I'd largely forgotten about. It's hard to believe, but my first RL paper was eleven years ago at a NIPS workshop, where Daniel Marcu, John Langford and I had a paper on reducing structured prediction to reinforcement learning, essentially by running Conservative Policy Iteration. (This work eventually became Searn.) Most of my own work in the RL space has focused on imitation learning/learning from demonstrations, but my students and I have recently been pushing more into straight-up reinforcement learning algorithms and applications and explanations (also see Tao Lei's awesome work in a similar explanations vein, and Tim Vieira's really nice TACL paper too).

Reinforcement learning has undergone a bit of a renaissance recently, largely due to the efficacy of its combination with good function approximation via deep neural networks. Even more arguably, this advance has been due to the increased availability of, and interest in, "interesting" simulated environments, mostly video games and typified by the Atari game collection. In much the same way that ImageNet made neural networks really work for computer vision (by being large, and capitalizing on the existence of GPUs), I think it's fair to say that these simulated environments have provided the same large-data setting for RL, which can also be combined with GPU power to build impressive solutions to many games.

In a real sense, many parts of the RL community are going all-in on the notion that learning to play games is a path toward broader AI. The usual refrain that I hear arguing against that approach is based on the quantity of data. The argument is roughly: if you actually want to build a robot that acts in the real world, you're not going to be able to simulate 10 million frames (the figure from the DeepMind paper, which is just under 8 days of real-time experience).

I think this is an issue, but I actually don't think it's the most substantial issue. I think the most substantial issue is the fact that game playing is a simulated environment and the reward function is generally crafted to make humans find the games fun, which usually means frequent small rewards that point you in the right direction. This is exactly where RL works well, and something that I'm not sure is a reasonable assumption in the real world.

Delayed reward is one of the hardest issues in RL, because (a) it means you have to do a lot of exploration and (b) you have a significant credit assignment problem. For instance, if you imagine a variant of (pick your favorite video game) where you only get a +1/-1 reward at the end of the game that says whether you won or lost, it becomes much much harder to learn, even if you play 10 million frames or 10 billion frames.
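To make the exploration half of this concrete, here's a minimal sketch on a toy problem of my own invention (a one-dimensional corridor, not any actual video game). A uniformly random policy walks left or right; the "dense" variant pays +1/-1 for every step toward or away from the goal, while the "sparse" variant pays +1 only on reaching the far end. The function names, corridor length and horizon are all illustrative assumptions. The quantity measured is simply the fraction of episodes in which the learner sees any nonzero reward at all.

```python
import random


def run_episode(length, horizon, sparse, rng):
    """Random walk on positions 0..length; returns the per-step rewards."""
    pos = 0
    rewards = []
    for _ in range(horizon):
        step = rng.choice((-1, 1))
        new_pos = min(max(pos + step, 0), length)  # walls at both ends
        if sparse:
            # Delayed reward: +1 only when the goal is reached.
            reward = 1.0 if new_pos == length else 0.0
        else:
            # Frequent reward: +1 toward the goal, -1 away, 0 against a wall.
            reward = float(new_pos - pos)
        rewards.append(reward)
        pos = new_pos
        if pos == length:
            break
    return rewards


def fraction_informative(length, horizon, sparse, episodes=2000, seed=0):
    """Fraction of random episodes that contain any nonzero reward."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(episodes):
        if any(r != 0.0 for r in run_episode(length, horizon, sparse, rng)):
            hits += 1
    return hits / episodes


if __name__ == "__main__":
    for sparse in (False, True):
        frac = fraction_informative(length=20, horizon=50, sparse=sparse)
        print("sparse" if sparse else "dense", frac)
```

With these (arbitrary) settings, nearly every dense-reward episode carries some learning signal, while only a small fraction of sparse-reward episodes do, and that is before you even get to the credit assignment problem.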

That's all to say: games are really nice settings for RL because there's a very well defined reward function and you typically get that reward very frequently. Neither of these things is going to be true in the real world, regardless of how much data you have.

At the end of the day, playing video games, while impressive, is really not that different from doing classification on synthetic data. Somehow it's better because the people doing the research were not those who invented the synthetic data, but games are still heavily designed, even recent games that you might play on your (insert whatever the current popular gaming system is): they are built in such a way that they are fun for their human players, which typically means increasing difficulty/complexity and a relatively regular reward function.

As we move toward systems that we expect to work in the real world (even if that is not embodied---I don't necessarily mean the difficulty of physical robots), it's less and less clear where the reward function comes from.

One option is to design a reward function. For complex behavior, I don't think we have any idea how to do this. There is the joke example in the R+N AI textbook where you give a vacuum cleaner a reward function for number of pieces of gunk picked up; the vacuum learns to pick up gunk, then drop it, then pick it up again, ad infinitum. It's a silly example, but I don't think we have much of an understanding of how to design reward functions for truly complex behaviors without significant risk of "unintended consequences." (To point a finger toward myself, we invented a reward function for simultaneous interpretation called Latency-Bleu a while ago, and six months later we realized there's a very simple way to game this metric. I was then disappointed that the models never learned that exploit.)
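Here's a minimal sketch of the vacuum example as a toy simulation. Everything in it (the `simulate` function, the two hand-written policies, the numbers of gunk pieces and time steps) is my own illustrative assumption, not anything from the textbook. The proxy reward counts pickups; what we actually care about is how much gunk is left on the floor.

```python
def simulate(policy, gunk=5, steps=20):
    """Run a toy vacuum; return (proxy_reward, gunk_left_on_floor)."""
    floor, held, reward = gunk, 0, 0
    for _ in range(steps):
        action = policy(floor, held)
        if action == "pickup" and floor > 0:
            floor -= 1
            held += 1
            reward += 1  # the designed reward: +1 per piece picked up
        elif action == "drop" and held > 0:
            held -= 1
            floor += 1  # dropping costs nothing under this reward
    return reward, floor


def honest(floor, held):
    """Pick up gunk until the floor is clean, then idle."""
    return "pickup" if floor > 0 else "idle"


def exploit(floor, held):
    """Pick up one piece, drop it, pick it up again, forever."""
    return "drop" if held > 0 else "pickup"


if __name__ == "__main__":
    print("honest ", simulate(honest))   # (5, 0)
    print("exploit", simulate(exploit))  # (10, 5)
```

The exploiting policy earns twice the reward of the honest one and leaves the floor exactly as dirty as it found it. Penalizing drops would patch this particular hole, but the broader worry is the holes you haven't thought of yet.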

This is one reason I've spent most of my RL effort on imitation learning (IL)-like things, typically where you can simulate an oracle. I've rarely seen an NLP problem that's been solved with RL where I haven't thought that it would have been much better and easier to just do IL. Of course IL has its own issues: it's not a panacea.

One thing I've been thinking about a lot recently is forms of implicit feedback. One cool paper in this area I learned about when I visited GATech a few weeks ago is Learning from Explanations using Sentiment and Advice in RL by Samantha Krening and colleagues. In this work they basically have a coach sitting on the side of an RL algorithm giving it advice, and use that to tailor things that I think of as more immediate reward. I generally think of this kind of like a baby. There's some built-in reward signal (it can't be turtles all the way down), but what we think of as a reward signal (like a friend saying "I really don't like that you did that") only turns into this true reward through a learned model that tells me that that's negative feedback. I'd love to see more work in the area of trying to figure out how to transform sparse and imperfect "true" reward signals into something that we can actually learn to optimize.
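As a very rough sketch of what "a model that turns feedback into reward" could look like, here's a toy version. To be clear, this is not the method from the Krening et al. paper: the lexicon below is hand-written and hypothetical, standing in for a model that would really have to be learned, and `feedback_to_reward` is a name I made up for illustration.

```python
# Hypothetical hand-written lexicon; a real system would learn this mapping.
LEXICON = {
    "good": 1.0,
    "great": 1.0,
    "nice": 0.5,
    "bad": -1.0,
    "stop": -1.0,
    "wrong": -1.0,
}


def feedback_to_reward(utterance, lexicon=LEXICON):
    """Map a coach's free-text comment to a scalar reward in [-1, 1].

    Averages the scores of the words found in the lexicon and returns
    0.0 when no known word appears (no usable signal).
    """
    words = [w.strip(".,!?\"'") for w in utterance.lower().split()]
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    for comment in ("Great, nice job!", "Stop, that is bad.", "Hello there"):
        print(repr(comment), feedback_to_reward(comment))
```

The interesting (and unsolved) part is everything this sketch hard-codes: where the mapping comes from, and how it gets tied back to whatever sparse built-in reward sits underneath it.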

3 comments:

Aravind Rajeswaran said...: Hi Hal, nice post. I wanted to make a few quick points:
1. I think one has to move away from a "general MDP" setting to produce better algorithms beyond a certain point. For instance, the type of methods that work best on simulated robotic tasks (GPS style approaches of Sergey Levine and Igor Mordatch) don't work on Atari games and vice versa. I think picking a problem that's reasonably appealing and solving it well in a general framework would help more than pushing general purpose algorithms, which at best can serve as benchmarks for comparison.

2. Part of the difficulty comes from a (lack of) shaped reward function, but it's predominantly the dynamics. Some tasks are naturally "forgiving" in that even flailing around makes progress (e.g. locomotion), but most interesting and useful tasks (e.g. manipulating objects) don't fall under this bucket.

3. Inverse RL is certainly appealing for certain tasks, but it is very limited in utility. It can at best work on tasks which humans can solve, and where there is a way to map the sensor-actuator system of a human to that of a robot (e.g. a human could solve a task in a way which is impossible for a robot due to its actuation capabilities). I think a better approach would be continuation tricks -- where a related easier problem is solved first, and its solution is reused for a harder task. For example, for walking tasks, one can initially load springs on the foot so that falling down is difficult and the agent can pick up on a relatively simple reward function. Similarly, for solving Go, one could start with a smaller Go board and expand it out.; 10 December, 2016 01:42
hal said...: @Aravind: Thanks for the comments! Quick replies:

1. I largely agree. I also think we've been bending over backward to fit high dim observations into an MDP framework when it's really not---states don't really exist. PSRs may be better here.

2. Agreed!

3. Yeah that's right; we were trying to address this somewhat in LOLS, but giving up global optimality. The continuation/multitask I think has to be the right answer; but I also worry that designing all these subtasks right is hard for people.; 12 December, 2016 07:30
Unknown said...: Many good points in this post! One of the things that AI tools such as planning and RL do is introduce a new form of programming in which we give goals or rewards rather than instructions. This is supposed to be a step forward, because we specify "what" instead of "how". But I think it is quite difficult, especially because we lack tools for debugging goals and rewards. My student Sean McGregor has a tool called MDPViz that helps visualize the behavior of the RL agent, and this can be useful for debugging.

On a separate note, Kshitij Judah's Ph.D. with Alan Fern dealt with a setup similar to the GATech work. From time to time the teacher looks at what the RL agent has been doing and gives it advice such as (a) in this state, this set of actions is good and this set is bad, (b) you don't even want to be in this state, and so on.
[Tom Dietterich]; 14 December, 2016 16:53
