Learning Policies For Learning Policies — Meta Reinforcement Learning (RL²) in Tensorflow

Written by awjuliani | Published 2017/01/25
Tech Story Tags: machine-learning | artificial-intelligence | deep-learning | neural-networks | robotics


Reinforcement Learning provides a framework for training agents to solve problems in the world. One of the limitations of these agents however is their inflexibility once trained. They are able to learn a policy to solve a specific problem (formalized as an MDP), but that learned policy is often useless in new problems, even relatively similar ones.

Imagine the simplest possible agent: one trained to solve a two-armed bandit task in which one arm always provides a positive reward, and the other arm always provides no reward. Using any RL algorithm such as Q-Learning or Policy Gradient, the agent can quickly learn to always choose the arm with the positive reward. At this point we might be tempted to say we’ve built an agent that can solve two-armed bandit problems. But have we really? What happens if we take our trained agent and give it a nearly identical bandit problem, except with the values of the arms switched? In this new setting, the agent will perform worse than chance, since it will simply pick whatever it believed to be the correct arm before. In the traditional RL paradigm, our only recourse would be to train a new agent on this new bandit, another new agent on the next new task, and so on.

What if this retraining wasn’t necessary though? What if we could have the agent learn a policy for learning new policies? Such an agent could be trained to solve not just a single bandit problem, but all similar bandits it may encounter in the future as well. This approach to learning policies that learn policies is called Meta Reinforcement Learning (Meta-RL), and it is one of the more exciting and promising recent developments in the field.

In Meta-RL, an RNN-based agent is trained to learn to exploit the structure of the problem dynamically. In doing so, it can learn to solve new problems without the need for retraining, simply by adjusting its hidden state. The original work describing Meta-RL was published by Jane Wang and her colleagues at DeepMind last year in their paper Learning to Reinforcement Learn. I highly recommend checking out that original article for insight into the development of the algorithm. As it turns out, this idea was also independently developed by a group at OpenAI and Berkeley and described in their recent paper RL²: Fast Reinforcement Learning via Slow Reinforcement Learning. Since I first became familiar with these concepts through the DeepMind paper, that is the one I followed in the design of my own implementation.

In this article I want to first describe how to augment the A3C algorithm with the capacity for Meta-RL. Next, I will show how the resulting agent can learn a meta-policy that solves whole families of MDPs without the need for retraining. The code for a Tensorflow implementation of the model, as well as each of the experiments, is available at this GitHub repo: Meta-RL.

Making a Meta Agent

The key ingredient in a Meta-RL system is a Recurrent Neural Network (RNN). The recurrent nature of this architecture is what allows the agent to learn a meta-policy, since it can adjust its output over time given new input. If that last bit sounds familiar, it is because adjusting behavior in response to new information is exactly what the traditional back-propagation training process of neural networks makes possible. Here, instead, we will train the RNN to adjust itself by altering its hidden state, without needing backprop or any other external weight update for every new task.

In order for this to happen, we need more than the standard State -> Action set-up for the network. In addition to observations x(t), we will feed the RNN the previous reward r(t-1) and the previous action a(t-1). With these three things, the network can associate previous state-action pairs with their rewards, and in doing so adjust its future actions accordingly. As a side note, we don’t need to provide the RNN with the previous observation x(t-1), since it has already seen it at the previous time-step! Adding r(t-1) and a(t-1) allows the agent to always have the full picture of its actions and their success in the task.

The general Meta-RL framework when adapted from A3C.

Fortunately, the A3C algorithm is already halfway to being Meta-RL ready. Since it comes with an RNN to begin with, we simply need to provide the additional r(t-1) and a(t-1) inputs to the network, and it will be able to learn to perform meta-policy learning!
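To make this concrete, below is a minimal sketch of how those extra inputs might be wired into an A3C-style recurrent network in Tensorflow. This is not the repo’s exact code: the layer sizes, tensor names, and TF 1.x-era API choices are illustrative assumptions.

```python
import tensorflow as tf

# Illustrative sizes only; the actual repo may use different values.
n_actions = 2     # e.g. a two-armed bandit
obs_size = 6      # flattened observation x(t); the pure bandit case has no x at all
lstm_units = 48

# The meta-enabling inputs: observation x(t), previous reward r(t-1),
# previous action a(t-1), plus the timestep input t used in the bandit tasks.
obs = tf.placeholder(tf.float32, [None, obs_size], name="x_t")
prev_reward = tf.placeholder(tf.float32, [None, 1], name="r_tm1")
prev_action = tf.placeholder(tf.int32, [None], name="a_tm1")
timestep = tf.placeholder(tf.float32, [None, 1], name="t")

# Concatenate everything into one per-step vector and unroll an LSTM over it.
prev_action_onehot = tf.one_hot(prev_action, n_actions, dtype=tf.float32)
rnn_in = tf.concat([obs, prev_reward, prev_action_onehot, timestep], axis=1)
rnn_in = tf.expand_dims(rnn_in, 0)                     # batch dimension of 1

cell = tf.contrib.rnn.BasicLSTMCell(lstm_units)
c_in = tf.placeholder(tf.float32, [1, lstm_units])     # LSTM state carried
h_in = tf.placeholder(tf.float32, [1, lstm_units])     # between time-steps
state_in = tf.contrib.rnn.LSTMStateTuple(c_in, h_in)
rnn_out, state_out = tf.nn.dynamic_rnn(cell, rnn_in, initial_state=state_in)
rnn_out = tf.reshape(rnn_out, [-1, lstm_units])

# Standard A3C heads: a policy over actions and a value estimate.
policy = tf.layers.dense(rnn_out, n_actions, activation=tf.nn.softmax)
value = tf.layers.dense(rnn_out, 1)
```

The rest of A3C (the actor-critic losses and the asynchronous workers) stays unchanged; the state_out tensor is simply carried from one step to the next, which is the hidden-state adjustment described above.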

Meta Experiments

Before describing the experiments I conducted, I want to talk about the kinds of scenarios in which Meta-RL works, and those in which it doesn’t. Meta-RL isn’t magic, and we can’t train agents on Pong and have them play Breakout without retraining (yet, at least). The key to success here is that we are learning a policy for learning a policy over a family of similar MDPs or bandits. So in the case of bandits, it is possible to learn a meta-policy for solving a family of unseen two-armed bandit problems. The same can be applied to contextual bandits, grid-worlds, 3D mazes, and more, so long as the underlying logic of the problem remains the same.

Dependent Bandits — The simplest way to test our agent is with the two-armed bandit problem described at the beginning of this article. In this case, there are no observations x. Instead, the authors add a timestep input t, which allows the network to maintain a notion of the time in the task. I used this additional t input for all of my experiments as well.

The nature of the bandit is as follows: a randomly chosen arm provides a reward of +1 with probability p, and otherwise provides a reward of 0. The other arm provides +1 reward with probability 1-p. In this way the two arms are inversely related to one another, and finding out something about one arm provides information about the other as well. For a human, an ideal strategy would be to try each arm a few times, and figure out which provides the +1 reward more often. This is essentially the Meta-Policy that we would like our agent to learn.
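The environment itself is only a few lines of Python. Here is a minimal sketch of such a dependent bandit, assuming p is drawn uniformly at random each episode; the class and method names are my own, not necessarily the repo’s.

```python
import numpy as np

class DependentBandit:
    """Two-armed bandit whose arms pay out with probabilities p and 1 - p."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Draw a new payout probability and randomly assign it to one arm.
        p = np.random.uniform()
        self.arm_probs = np.random.permutation([p, 1.0 - p])

    def pull(self, arm):
        # Reward is +1 with the chosen arm's probability, otherwise 0.
        return 1.0 if np.random.uniform() < self.arm_probs[arm] else 0.0
```

Calling reset() at the start of each episode is what produces a new, unseen bandit for the meta-agent to figure out.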

In the experiment, each episode consisted of 100 trials. After 20,000 training episodes, the meta-agent is able to quickly adjust its policy to unseen test-set bandit environments. As you can see below (right), after only a few trials the optimal arm is discovered and exploited. This is in contrast to the agent without the additional meta-enabling inputs (left), which acts randomly throughout the episode and accumulates little reward.

Trained meta-A3C agent performance on random test-set bandits without (left) and with (right) additional r(t-1) and a(t-1) inputs. Top numbers indicate reward probability. Blue indicates agent action selection. Green bars indicate cumulative reward over episode.
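It is worth emphasizing where this adaptation happens: entirely in the forward pass. A test-time rollout, assuming the tensors defined in the earlier network sketch and the bandit class above, might look roughly like this; there are no gradient updates anywhere, only the LSTM state and the fed-back action and reward changing from trial to trial.

```python
import numpy as np

def run_test_episode(sess, bandit, n_trials=100):
    """Roll out one episode with a trained meta-agent. No weights are updated;
    all adaptation comes from the LSTM state carried from trial to trial."""
    bandit.reset()
    c = np.zeros((1, lstm_units), np.float32)   # zero initial LSTM state
    h = np.zeros((1, lstm_units), np.float32)
    prev_a, prev_r, total_reward = 0, 0.0, 0.0
    for t in range(n_trials):
        pi, (c, h) = sess.run(
            [policy, state_out],
            feed_dict={obs: np.zeros((1, obs_size)),  # the pure bandit has no x(t)
                       prev_reward: [[prev_r]],
                       prev_action: [prev_a],
                       timestep: [[t]],
                       c_in: c,
                       h_in: h})
        prev_a = int(np.random.choice(n_actions, p=pi[0]))  # sample an arm
        prev_r = bandit.pull(prev_a)
        total_reward += prev_r
    return total_reward
```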

Rainbow Bandits — The next logical step when trying out meta-policy learning is to add in state inputs x, to allow for actions conditioned on observations. For this situation, I created a bandit in which the observed state is a 2-pixel image consisting of two random colors. One of the colors is set as “correct” at the beginning of the episode, and the corresponding arm will always provide a reward of +1. The other color then always corresponds to a reward of 0. For the duration of an episode (100 trials), the two colors are displayed in randomized order side-by-side each trial. Correspondingly, the arm that provides the reward changes with the color. This time, an ideal policy involves not just trying both arms, but learning that the reward follows the color, and discovering which of the two colors provides the reward. After 30,000 training episodes the Meta-RL agent learns a correct policy for solving these problems.
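One way such an environment could be written is sketched below; the color encoding (two random RGB pixels, flattened to a 6-vector) and the class interface are assumptions of mine, chosen to line up with the obs_size used in the earlier network sketch.

```python
import numpy as np

class RainbowBandit:
    """Two-armed bandit where the rewarding arm follows a 'correct' color."""

    def reset(self):
        # Draw two random colors for the episode; one of them is 'correct'.
        self.colors = np.random.uniform(size=(2, 3))   # two RGB pixels
        self.correct = np.random.randint(2)

    def observe(self):
        # Each trial the two colors are shown side by side in a random order.
        self.order = np.random.permutation(2)
        return self.colors[self.order].flatten()       # 2-pixel image as a 6-vector

    def pull(self, arm):
        # The rewarding arm is wherever the correct color happens to be shown.
        rewarded_arm = int(np.argmax(self.order == self.correct))
        return 1.0 if arm == rewarded_arm else 0.0
```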

Trained meta-A3C agent performance on random test-set rainbow bandits without (left) and with (right) additional r(t-1) and a(t-1) inputs. Top color indicates the reward-giving arm. Blue bar indicates agent action selection.

Interestingly, if we look at the performance curve during training for this task, we discover something unexpected (at least to me). Unlike most curves, in which there is a smooth continuous overall trend towards improved performance, here there is a single discrete jump. The agent goes from random performance to near perfect performance within the span of 5000 episodes or so.

There seems to be a discrete jump in performance as the agent’s behavior changes from random action to employing a successful strategy.

Intuitively, this makes sense for this task: either the agent is applying the strategy, in which case near-perfect performance is attained, or it isn’t, in which case performance stays at chance level. That in and of itself isn’t so remarkable, but what seems compelling to me is that the agent had been training for hours beforehand, slowly adjusting its weights, with no perceptible behavioral changes. Then, all of a sudden, things ‘clicked’ into place and an entirely new behavioral pattern emerged. This is similar to what psychologists find in the cognitive abilities of children as they develop. Certain cognitive skills such as object permanence appear to come about as discrete changes in development, yet they are underpinned by continuous changes in the wiring of the brain. These kinds of emergent behaviors are studied as part of a dynamic systems theory approach to cognition, and it is exciting to see similar phenomena come about within the context of RL agents!

Rainbow Gridworlds — For the final experiment, I wanted to test a meta-agent in a full MDP environment, and what better one than the gridworld used in the earlier tutorials? In this environment, the agent is represented by a blue square in a small grid, and can move up, down, left, or right through the 5x5 environment. There are randomly placed squares of one color which provide +1 reward upon contact, and an equal number of randomly placed squares of another color which provide 0 reward. As was the case in the rainbow bandit, the meaning of the colors is ambiguous: the reward-giving color is randomized each episode, so what would be a single MDP becomes a family of MDPs. The optimal strategy is to check one of the squares and discover whether it provides the reward. If so, squares of that color are the goals for the given episode; if not, the other color indicates the reward. Each episode consisted of 100 steps, and after 30,000 training episodes, the meta-agent learns an approximation to this optimal policy.
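A minimal sketch of that per-episode randomization is below. It only covers the reset and reward logic that turns one gridworld layout into a family of MDPs; movement and rendering are omitted, and all names and sizes are illustrative.

```python
import numpy as np

GRID_SIZE = 5
N_GOALS = 4   # illustrative number of goal squares per color

class RainbowGridworld:
    """5x5 grid whose reward-giving color is re-drawn every episode."""

    def reset(self):
        # Place the agent and 2 * N_GOALS goal squares on distinct random cells.
        cells = np.random.permutation(GRID_SIZE * GRID_SIZE)[:1 + 2 * N_GOALS]
        self.agent = cells[0]
        self.goals = {0: set(cells[1:1 + N_GOALS]),     # squares of color 0
                      1: set(cells[1 + N_GOALS:])}      # squares of color 1
        # Randomizing which color rewards turns one gridworld layout
        # into a whole family of MDPs.
        self.reward_color = np.random.randint(2)

    def touch(self, cell):
        # +1 for goals of the rewarding color, 0 for the other color.
        for color, squares in self.goals.items():
            if cell in squares:
                squares.discard(cell)
                return 1.0 if color == self.reward_color else 0.0
        return 0.0
```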

Meta-A3C agent performance on random test-set grid worlds without (left) and with (right) additional r(t-1) and a(t-1) inputs. Colored bar below grid indicates reward-giving goals, and is importantly not part of network input.

Beyond — In their paper, the DeepMind authors also discuss applying Meta-RL to learning 3D mazes, as well as to another 3D task designed to emulate cognitive experiments conducted with primates. The entire paper is a great read, and I recommend it to those interested in the finer details of the algorithm, as well as the exact nature of the experiments they conducted. I hope to see this technique applied to a variety of tasks in the future. A long-term goal of artificial intelligence is to train agents that can flexibly adapt to their environments and learn “on the go,” so to speak, in a safe and efficient way. Meta-RL/RL² is an exciting step toward that.

I hope this walkthrough and these experiments have given you some intuition about the power of Meta-RL, and where it is applicable. If you’d like to utilize Meta-RL in your own projects, feel free to fork or contribute to my GitHub repository!

If you’d like to follow my writing on Deep Learning, AI, and Cognitive Science, follow me on Medium @Arthur Juliani, or on Twitter @awjuliani.

Please consider donating to help support future tutorials, articles, and implementations. Any contribution is greatly appreciated!

