Practical Reinforcement Learning with TensorFlow

Practical RL with TensorFlow
Illia Polosukhin, XIX.ai

Reinforcement Learning Problem

OpenAI Gym
- Library of environments: Control, Atari, Doom, etc.
- Same API for every environment
- Provides a way to share and compare results
https://gym.openai.com/
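
A minimal sketch of the shared interface, assuming the classic Gym convention where `reset()` returns an observation and `step(action)` returns `(observation, reward, done, info)`; newer releases of the library differ. `CoinFlipEnv` and `run_episode` are hypothetical stand-ins written for this example, not part of Gym. With the real library, the environment would come from `gym.make(...)` instead.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment exposing classic Gym-style reset/step methods."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0
        self.coin = 0

    def reset(self):
        # Start a new episode and return the first observation.
        self.steps = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin

    def step(self, action):
        # Reward 1.0 for guessing the current coin, then flip a new one.
        reward = 1.0 if action == self.coin else 0.0
        self.steps += 1
        self.coin = self.rng.randint(0, 1)
        done = self.steps >= self.episode_length
        return self.coin, reward, done, {}


def run_episode(env, policy):
    """Run one episode and return the total (undiscounted) reward."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(observation)
        observation, reward, done, _info = env.step(action)
        total_reward += reward
    return total_reward


# The observation reveals the coin, so echoing it back earns every reward.
print(run_episode(CoinFlipEnv(), policy=lambda obs: obs))  # 10.0
```

Because the loop only touches `reset` and `step`, the same `run_episode` would work for any environment that follows this interface.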

Markov Decision Process
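MDP $\langle S, A, P, R, \gamma \rangle$
- $S$: set of states
- $A$: set of actions
- $P(s' \mid s, a)$: probability of moving to state $s'$ after taking action $a$ in state $s$
- $R(s)$: reward function
- $\gamma$: discount factor
Trace: $\{\langle s_0, a_0, r_0 \rangle, \dots, \langle s_n, a_n, r_n \rangle\}$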
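
One way to make the tuple concrete is to write a tiny MDP as plain data and sample a trace from it. This is only a sketch: the two states, two actions, and all probabilities and rewards below are made up for illustration.

```python
import random

# Hypothetical two-state MDP; every number here is invented for illustration.
STATES = ["cool", "hot"]
ACTIONS = ["slow", "fast"]
# P[(s, a)] maps each next state s' to its transition probability.
P = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
    ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
    ("hot", "fast"): {"hot": 1.0},
}
R = {"cool": 1.0, "hot": -1.0}  # reward for being in each state
GAMMA = 0.9  # discount factor, used when returns are computed


def sample_trace(policy, start="cool", length=5, seed=0):
    """Roll out `policy` and return a list of (state, action, reward) tuples."""
    rng = random.Random(seed)
    state, trace = start, []
    for _ in range(length):
        action = policy(state)
        trace.append((state, action, R[state]))
        next_states = list(P[(state, action)])
        weights = [P[(state, action)][s] for s in next_states]
        state = rng.choices(next_states, weights=weights)[0]
    return trace


# Always choosing "slow" from "cool" never leaves "cool".
trace = sample_trace(lambda s: "slow")
print(trace)
```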

Definitions
- Return: total discounted reward from time $t$: $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$
- Policy: the agent's behavior
- Deterministic policy: $\pi(s) = a$
- Stochastic policy: $\pi(a \mid s) = P[A_t = a \mid S_t = s]$
- Value function: expected return starting from state $s$:
- State-value function: $V^\pi(s) = E_\pi[R_t \mid S_t = s]$
- Action-value function: $Q^\pi(s, a) = E_\pi[R_t \mid S_t = s, A_t = a]$
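
As a small worked example of these definitions, the sketch below computes a discounted return from a list of rewards and averages returns over episodes as a Monte Carlo estimate of the state value. The function names and the reward lists are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def monte_carlo_value(episodes, gamma):
    """Estimate the value of a state as the mean return over episodes starting there."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
print(monte_carlo_value([[1.0, 0.0], [0.0, 1.0]], gamma=0.5))  # (1.0 + 0.5) / 2 = 0.75
```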

Deep Q Learning
- Model-free, off-policy technique to learn the optimal $Q(s, a)$:
- $Q_{i+1}(s, a) \leftarrow Q_i(s, a) + \alpha \left( r + \gamma \max_{a'} Q_i(s', a') - Q_i(s, a) \right)$, where $r$ is the reward observed on the transition from $s$ to $s'$
- The optimal policy is then $\pi(s) = \arg\max_{a'} Q(s, a')$
- Requires exploration (ε-greedy) so that the agent tries a variety of transitions from each state.
- Take a random action with probability ε; start with a high ε and decay it to a low value as training progresses.
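- Deep Q Learning: approximate $Q(s, a)$ with a neural network $Q(s, a, \theta)$
- Do stochastic gradient descent on the squared temporal-difference error: $L(\theta_i) = E\left[ \left( r + \gamma \max_{a'} Q(s', a', \theta_{i-1}) - Q(s, a, \theta_i) \right)^2 \right]$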
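
The update rule and ε-greedy exploration can be sketched with a lookup table in place of the neural network. This is tabular Q-learning, not Deep Q Learning, on a hypothetical five-state chain; the environment, hyperparameters, and decay schedule are assumptions chosen only to keep the example small.

```python
import random


def step(state, action, n_states=5):
    """Hypothetical chain: action 1 moves right, 0 moves left; the last state is the goal."""
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    done = next_state == n_states - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=300, alpha=0.5, gamma=0.9, n_states=5, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    epsilon = 1.0  # start fully random
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # epsilon-greedy: random action with probability epsilon.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = max((0, 1), key=lambda a: q[state][a])
            next_state, reward, done = step(state, action, n_states)
            # Do not bootstrap from a terminal state.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
        epsilon = max(0.05, epsilon * 0.98)  # decay exploration, keep a small floor
    return q


q = q_learning()
greedy_policy = [max((0, 1), key=lambda a: q[s][a]) for s in range(4)]
print(greedy_policy)  # the greedy action in each non-terminal state
```

Replacing the table `q` with a parameterized function and minimizing the squared difference between `target` and the current estimate gives the deep variant.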

Run Optimization
Full example: https://github.com/ilblackdragon/tensorflow-rl/blob/master/examples/atari-rl.py

Monitored Session
- Handles the pitfalls of distributed training.
- Saves and restores checkpoints.
- Hooks are a general interface for injecting computation into the TensorFlow training loop.
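
The hook idea can be illustrated without TensorFlow. The sketch below is a pure-Python toy: `Hook`, `LossLogger`, and `run_training` are hypothetical names that only mimic the pattern of callbacks around each step, and they are not the TensorFlow API.

```python
class Hook:
    """Hypothetical base class: callbacks invoked around every training step."""

    def before_step(self, step):
        pass

    def after_step(self, step, loss):
        pass


class LossLogger(Hook):
    """Records the loss every `every_n` steps."""

    def __init__(self, every_n):
        self.every_n = every_n
        self.history = []

    def after_step(self, step, loss):
        if step % self.every_n == 0:
            self.history.append((step, loss))


def run_training(train_step, num_steps, hooks):
    """Toy loop that owns the iteration and calls each hook around `train_step`."""
    for step in range(num_steps):
        for hook in hooks:
            hook.before_step(step)
        loss = train_step(step)
        for hook in hooks:
            hook.after_step(step, loss)


logger = LossLogger(every_n=2)
run_training(train_step=lambda step: 1.0 / (step + 1), num_steps=5, hooks=[logger])
print(logger.history)  # [(0, 1.0), (2, 0.3333333333333333), (4, 0.2)]
```

The loop itself never changes; logging, checkpointing, or early stopping would each be added as another hook.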

Original Results on Atari Games
Mnih et al., 2013

Beating Human Level
Mnih et al., 2015

Policy Gradient
- Given a policy $\pi_\theta(a \mid s)$, find the $\theta$ that maximizes the expected return: $J(\theta) = \sum_s d^\pi(s) V^\pi(s)$
- In Deep RL, we approximate $\pi_\theta(a \mid s)$ with a neural network.
- Usually with a softmax layer on top to estimate the probability of each action.
- We can estimate $J(\theta)$ from samples of observed behavior: $\sum_{k=0}^{T} p_\theta(\tau_k \mid \pi) R(\tau_k)$
- Do stochastic gradient ascent using the update: $\theta_{i+1} = \theta_i + \alpha \frac{1}{T} \sum_{k=0}^{T} \nabla \log p_\theta(\tau_k \mid \pi) R(\tau_k)$
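
A minimal sketch of this update, assuming a softmax policy over a hypothetical two-armed bandit, so that each sampled trajectory is a single action and the gradient of its log-probability is the gradient of the log of the action's probability. The rewards, batch size, and step size are made-up values, and the parameters are a plain list rather than a neural network.

```python
import math
import random


def softmax(logits):
    """Turn logits into action probabilities (shifted by the max for stability)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards=(0.0, 1.0), iterations=200, batch=16, alpha=0.5, seed=0):
    """Policy gradient on a hypothetical bandit where every trajectory is one action."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for _ in range(iterations):
        probs = softmax(theta)
        grad = [0.0] * len(theta)
        for _ in range(batch):
            action = rng.choices(range(len(theta)), weights=probs)[0]
            ret = rewards[action]
            # For a softmax policy, d log pi(a) / d theta_j = 1[j == a] - pi(j).
            for j in range(len(theta)):
                grad[j] += ((1.0 if j == action else 0.0) - probs[j]) * ret
        # Gradient ascent step, averaged over the sampled trajectories.
        theta = [t + alpha * g / batch for t, g in zip(theta, grad)]
    return softmax(theta)


final_probs = reinforce_bandit()
print(final_probs)  # probability mass should shift toward the rewarded arm
```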

Async Advantage Actor-Critic (A3C)
- Asynchronous: uses multiple instances of the environment and network in parallel.
- Actor-Critic: uses both a policy (the actor) and an estimate of the value function (the critic).
- Advantage: estimates how much better or worse the outcome was than expected.
Image by Arthur Juliani
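
The advantage term can be sketched as the discounted return minus the critic's value estimate for each step of a rollout. The function below is an illustrative assumption about how that might be computed, with a bootstrap value standing in for the critic's estimate of the state reached after the rollout; the rewards and values are made up.

```python
def advantages(rewards, values, bootstrap_value, gamma):
    """Per-step advantage: bootstrapped discounted return minus the value estimate."""
    result = []
    ret = bootstrap_value
    # Walk the rollout backwards so each return reuses the one after it.
    for reward, value in zip(reversed(rewards), reversed(values)):
        ret = reward + gamma * ret
        result.append(ret - value)
    result.reverse()
    return result


# Hypothetical 3-step rollout that ends in a terminal state (bootstrap value 0).
print(advantages(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5],
                 bootstrap_value=0.0, gamma=0.5))
# [-0.25, 0.0, 0.5]
```

A positive advantage means the step turned out better than the critic expected, so the actor's update makes that action more likely; a negative one makes it less likely.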

A3C Results on Atari Games
Mnih et al., 2016

Practical use cases
- Robotics
- Finance
- Industrial optimization
- Predictive assistant

Illia Polosukhin
XIX.ai
@ilblackdragon, illia@xix.ai
Questions?
Full code will be available soon at
https://github.com/ilblackdragon/tensorflow-rl/
