# Planning

Planning means simulating candidate sequences of actions in a model of the environment, and using the predicted outcomes to choose an action, before acting in the real environment.
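As a minimal sketch of this idea, the snippet below plans by random shooting: it simulates many random action sequences in a model and keeps the first action of the best one. The point-mass model `model_step`, the reward, and all parameter values are hypothetical stand-ins chosen for illustration, not part of any library.

```python
import numpy as np

def model_step(state, action):
    """Hypothetical 1-D point-mass model: state = (position, velocity)."""
    pos, vel = state
    vel = vel + 0.1 * action
    pos = pos + 0.1 * vel
    reward = -pos ** 2  # reward is highest at the origin
    return (pos, vel), reward

def plan_random_shooting(state, horizon=10, n_candidates=200, rng=None):
    """Simulate random action sequences in the model; return the best first action."""
    rng = np.random.default_rng(0) if rng is None else rng
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    returns = np.zeros(n_candidates)
    for i, actions in enumerate(candidates):
        s = state
        for a in actions:
            s, r = model_step(s, a)  # simulated step, the real environment is untouched
            returns[i] += r
    best = np.argmax(returns)
    return candidates[best, 0], returns[best]

first_action, predicted_return = plan_random_shooting(state=(1.0, 0.0))
```

Only `first_action` would be sent to the real environment; the rest of the simulated sequence is discarded.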

Concepts covered:
1. Cross entropy method (CEM)
2. Monte Carlo tree search (MCTS)
3. Probabilistic ensembles with trajectory sampling (PETS)

References:
- Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models
- Exploring Model-based Planning with Policy Networks
- Learning Latent Dynamics for Planning from Pixels

## Cross Entropy Method

The Cross Entropy Method (CEM) is a gradient-free optimization method commonly used for planning in model-based reinforcement learning.

CEM Algorithm
1. Initialise a Gaussian distribution $\mathcal{N}(\mu,\sigma)$ over the weights $\theta$ of the policy (e.g. a neural network).
2. Sample a batch of $n$ candidate weights $\theta$ from the Gaussian.
3. Evaluate all $n$ candidates, e.g. by running a trial with each and recording its total reward.
4. Select the top fraction of the candidates (the elite set) and compute the new $\mu$ and $\sigma$ from them to parameterise the next Gaussian distribution.
5. Repeat steps 2-4 until convergence.
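The steps above can be sketched on a toy problem that needs no environment. This is an illustrative sketch only: the function name `cem_maximize`, the quadratic objective, and the hyperparameter values are assumptions made for the example.

```python
import numpy as np

def cem_maximize(objective, dim, n_samples=50, elite_frac=0.2, n_iters=30, seed=0):
    """Maximise `objective` with the cross entropy method (diagonal Gaussian)."""
    rng = np.random.default_rng(seed)
    mean, stddev = np.zeros(dim), np.ones(dim)                       # step 1
    n_elite = max(1, int(n_samples * elite_frac))
    for _ in range(n_iters):                                         # step 5
        samples = rng.normal(mean, stddev, size=(n_samples, dim))    # step 2
        scores = np.array([objective(x) for x in samples])           # step 3
        elite = samples[np.argsort(scores)[-n_elite:]]               # step 4
        mean, stddev = elite.mean(axis=0), elite.std(axis=0)
    return mean, stddev

# Toy objective (an assumption for this sketch): highest score at `target`.
target = np.array([1.0, -0.5])
toy_mean, toy_stddev = cem_maximize(lambda x: -np.sum((x - target) ** 2), dim=2)
```

Because the new standard deviation is computed from the elite set alone, the distribution narrows every iteration. Practical implementations sometimes add a small noise floor to avoid collapsing too early.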

In [10]:
import numpy as np
import tensorflow_probability as tfp
tfd = tfp.distributions
import gym
import warnings
warnings.filterwarnings("ignore")

In [11]:
# RL Gym
env = gym.make('CartPole-v1')

# Initialisation
n = 10  # number of candidate policies sampled per iteration
top_k = 0.40  # fraction of candidates kept as the elite set for the next iteration
mean = np.zeros((5,2))  # shape = (4 observation weights + 1 bias, n_actions)
stddev = np.ones((5,2))  # same shape as mean

In [12]:
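def get_batch_weights(mean, stddev, n):
    """Sample n weight matrices from the current diagonal Gaussian."""
    mvn = tfd.MultivariateNormalDiag(
        loc=mean,
        scale_diag=stddev)
    return mvn.sample(n).numpy()  # shape = (n, 5, 2)

def policy(obs, weights):
    """Linear policy: rows 0-3 weight the observation, row 4 is the bias."""
    return np.argmax(obs @ weights[:4,:] + weights[4])

def run_trial(weights, render=False):
    """Run one episode and return its total reward."""
    # Uses the classic Gym API (gym < 0.26): reset() returns only the observation
    # and step() returns (obs, reward, done, info).
    obs = env.reset()
    done = False
    reward = 0
    while not done:
        a = policy(obs, weights)
        obs, r, done, _ = env.step(a)
        reward += r
        if render:
            env.render()
    return reward

def get_new_mean_stddev(rewards, batch_weights):
    """Refit the Gaussian to the top_k fraction of candidates with the highest reward."""
    idx = np.argsort(rewards)[::-1][:int(len(rewards) * top_k)]
    mean = np.mean(batch_weights[idx], axis=0)
    stddev = np.std(batch_weights[idx], axis=0)
    return mean, stddev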

In [13]:
for i in range(20):
    batch_weights = get_batch_weights(mean, stddev, n)
    rewards = [run_trial(weights) for weights in batch_weights]
    mean, stddev = get_new_mean_stddev(rewards, batch_weights)
    print(rewards)

[39.0, 10.0, 9.0, 15.0, 11.0, 17.0, 17.0, 8.0, 9.0, 46.0]
[13.0, 99.0, 8.0, 25.0, 10.0, 45.0, 21.0, 18.0, 35.0, 35.0]
[43.0, 66.0, 26.0, 43.0, 48.0, 51.0, 59.0, 50.0, 126.0, 40.0]
[37.0, 52.0, 38.0, 75.0, 72.0, 55.0, 156.0, 29.0, 83.0, 210.0]
[66.0, 46.0, 47.0, 110.0, 63.0, 45.0, 117.0, 25.0, 75.0, 67.0]
[141.0, 53.0, 126.0, 73.0, 118.0, 60.0, 82.0, 141.0, 164.0, 93.0]
[115.0, 117.0, 99.0, 126.0, 94.0, 198.0, 102.0, 208.0, 76.0, 136.0]
[146.0, 116.0, 206.0, 145.0, 103.0, 82.0, 132.0, 108.0, 96.0, 152.0]
[124.0, 135.0, 100.0, 123.0, 98.0, 182.0, 134.0, 166.0, 111.0, 121.0]
[146.0, 112.0, 94.0, 111.0, 144.0, 154.0, 100.0, 113.0, 127.0, 102.0]
[180.0, 105.0, 122.0, 107.0, 94.0, 165.0, 132.0, 97.0, 80.0, 188.0]
[170.0, 123.0, 135.0, 136.0, 99.0, 161.0, 123.0, 147.0, 135.0, 104.0]
[138.0, 98.0, 118.0, 103.0, 190.0, 96.0, 203.0, 106.0, 118.0, 105.0]
[122.0, 100.0, 108.0, 120.0, 191.0, 178.0, 145.0, 119.0, 94.0, 104.0]
[94.0, 134.0, 116.0, 143.0, 113.0, 163.0, 154.0, 207.0, 145.0, 147.0]
[215

In [14]:
mean, stddev

(array([[-0.69079407,  0.4850589 ],
        [ 0.10470737,  0.22073966],
        [-0.6316004 , -0.33848435],
        [-0.92844697,  1.94632246],
        [ 0.42116319,  0.45191136]]),
 array([[0.0044287 , 0.00021798],
        [0.0028252 , 0.00163548],
        [0.01741926, 0.00442157],
        [0.00040444, 0.00117998],
        [0.0003756 , 0.00224895]]))

In [15]:
# Draw one policy from the final distribution; with stddev this small it is close to the mean.
best_weights = get_batch_weights(mean, stddev, 1)[0]

In [16]:
run_trial(best_weights, render=False)

97.0

## Monte Carlo Tree Search

Upcoming

## Probabilistic Ensembles with Trajectory Sampling

Probabilistic ensembles with trajectory sampling (PETS) is a model-based reinforcement learning algorithm that combines probabilistic networks, which capture aleatoric uncertainty (noise inherent in the environment), with ensembles, which capture epistemic uncertainty (uncertainty from limited data).
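The two kinds of uncertainty can be separated with a small sketch. PETS uses neural networks that output a Gaussian mean and variance; here, as a loudly simplified stand-in, each ensemble member is a linear least-squares fit on a bootstrap resample with a single estimated noise variance. The data, the noise level, and the helper names `fit_member` and `predict` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "dynamics" data (assumed): next_state = 0.9 * state + Gaussian noise (std 0.1).
x = rng.uniform(-1.0, 1.0, size=200)
y = 0.9 * x + rng.normal(0.0, 0.1, size=200)

def fit_member(x, y, rng):
    """Fit one probabilistic model (Gaussian mean and variance) on a bootstrap resample."""
    idx = rng.integers(0, len(x), size=len(x))
    xb, yb = x[idx], y[idx]
    slope, intercept = np.polyfit(xb, yb, deg=1)
    noise_var = np.var(yb - (slope * xb + intercept))
    return slope, intercept, noise_var

ensemble = [fit_member(x, y, rng) for _ in range(5)]

def predict(ensemble, x_query):
    """Return the ensemble mean plus aleatoric and epistemic variance at x_query."""
    means = np.array([s * x_query + b for s, b, _ in ensemble])
    variances = np.array([v for _, _, v in ensemble])
    aleatoric = variances.mean()  # average noise variance predicted by the members
    epistemic = means.var()       # disagreement between the members' mean predictions
    return means.mean(), aleatoric, epistemic

mean_in, alea_in, epi_in = predict(ensemble, 0.5)     # inside the training range
mean_out, alea_out, epi_out = predict(ensemble, 5.0)  # far outside the training range
```

In this sketch the aleatoric term is the same everywhere because each member estimates one noise variance, while the epistemic term grows as the query moves away from the training data.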

PETS is used for model predictive control (MPC): at each time step it plans and optimizes a sequence of actions, executes only the first action, and then replans from the new state.

Instead of random shooting, which samples every candidate action sequence from a fixed distribution, PETS uses CEM to sample actions from a distribution fitted to the previous samples that yielded high reward.
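The following is a minimal sketch of CEM used as the planner inside an MPC loop. It is not the PETS implementation: a hand-written point-mass model replaces the learned probabilistic ensemble, there is no trajectory sampling, and the names `model_step`, `rollout_returns`, `cem_plan` and all hyperparameters are assumptions for illustration.

```python
import numpy as np

def model_step(pos, vel, action):
    """Hypothetical point-mass model standing in for a learned dynamics model."""
    vel = vel + 0.1 * action
    pos = pos + 0.1 * vel
    return pos, vel, -pos ** 2  # reward peaks at the origin

def rollout_returns(state, action_seqs):
    """Predicted return of each action sequence, evaluated in the model."""
    pos = np.full(len(action_seqs), state[0])
    vel = np.full(len(action_seqs), state[1])
    total = np.zeros(len(action_seqs))
    for t in range(action_seqs.shape[1]):
        pos, vel, reward = model_step(pos, vel, action_seqs[:, t])
        total += reward
    return total

def cem_plan(state, rng, horizon=15, n_samples=100, n_elite=10, n_iters=5):
    """Refit a Gaussian over action sequences to the elite samples; return its mean."""
    mean, stddev = np.zeros(horizon), np.ones(horizon)
    for _ in range(n_iters):
        seqs = np.clip(rng.normal(mean, stddev, size=(n_samples, horizon)), -1.0, 1.0)
        elite = seqs[np.argsort(rollout_returns(state, seqs))[-n_elite:]]
        mean, stddev = elite.mean(axis=0), elite.std(axis=0)
    return mean

# MPC loop: plan, execute only the first action, then replan from the new state.
rng = np.random.default_rng(0)
state = (1.0, 0.0)
for _ in range(50):
    plan = cem_plan(state, rng)
    pos, vel, _ = model_step(state[0], state[1], plan[0])
    state = (pos, vel)
```

Random shooting corresponds to `n_iters=1` with the best sample returned directly; the extra CEM iterations concentrate later samples around action sequences that already scored well.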

