How to Make Deep Reinforcement Learning less Data Hungry
Pieter Abbeel
Embodied Intelligence
UC Berkeley
Gradescope
OpenAI
AI for robotic automation
AI research
AI for grading homework and exams
AI research / advisory role
Reinforcement Learning (RL)
[Diagram: robot interacting with environment]

$\max_\theta \; \mathbb{E}\!\left[ \sum_{t=0}^{H} R(s_t) \;\middle|\; \pi_\theta \right]$
[Image credit: Sutton and Barto 1998]
Reinforcement Learning (RL)
[Diagram: robot interacting with environment]
Compared to supervised learning, additional challenges:
n Credit assignment
n Stability
n Exploration
[Image credit: Sutton and Barto 1998]
$\max_\theta \; \mathbb{E}\!\left[ \sum_{t=0}^{H} R(s_t) \;\middle|\; \pi_\theta \right]$
Deep RL Success Stories
DQN Mnih et al, NIPS 2013 / Nature 2015
MCTS Guo et al, NIPS 2014; TRPO Schulman, Levine, Moritz, Jordan, Abbeel, ICML 2015; A3C Mnih et al,
ICML 2016; Dueling DQN Wang et al ICML 2016; Double DQN van Hasselt et al, AAAI 2016; Prioritized
Experience Replay Schaul et al, ICLR 2016; Bootstrapped DQN Osband et al, 2016; Q-Ensembles Chen et al,
2017; Rainbow Hessel et al, 2017; …
Deep RL Success Stories
AlphaGo Silver et al, Nature 2015
AlphaGoZero Silver et al, Nature 2017
Tian et al, 2016; Maddison et al, 2014; Clark et al, 2015
AlphaZero Silver et al, 2017
n Super-human agent on Dota2 1v1, enabled by
n Reinforcement learning
n Self-play
n Enough computation
Deep RL Success Stories
[OpenAI, 2017]
Deep RL Success Stories
^ TRPO Schulman et al, 2015 + GAE Schulman et al, 2016
See also: DDPG Lillicrap et al 2015; SVG Heess et al, 2015; Q-Prop Gu et al, 2016; Scaling up ES Salimans et
al, 2017; PPO Schulman et al, 2017; Parkour Heess et al, 2017;
Deep RL Success Stories
Guided Policy Search Levine*, Finn*, Darrell, Abbeel, JMLR 2016
[Geng*, Zhang*, Bruce*, Caluwaerts, Vespignani, Sunspiral, Abbeel, Levine, ICRA 2017]
Mastery? Yes
Deep RL (DQN) vs. Human
Deep RL (DQN) score: 18.9 | Human score: 9.3
But how good is the learning?
Deep RL (DQN) vs. Human
Deep RL (DQN): score 18.9; experience measured in real time: ~40 days ("Slow")
Human: score 9.3; experience measured in real time: ~2 hours ("Fast")
Humans vs. DDQN
Black dots: human play
Blue curve: mean of human play
Blue dashed line: ‘expert’ human play
Red dashed lines:
DDQN after 10, 25, 200M frames
(~ 46, 115, 920 hours)
[Tsividis, Pouncy, Xu, Tenenbaum, Gershman, 2017]
Humans after 15 minutes tend to outperform DDQN after 115 hours
How to bridge this gap?
Outline
n Hindsight Experience Replay
n Meta-learning – learn-to-learn faster
On-Policy vs. Off-Policy
n On-Policy Algorithms:
n Can only perform update based on data collected under current policy
n Policy gradients, A3C, TRPO
n Atari: Most wall-clock efficient
n Off-Policy Algorithms:
n Can re-use all past data for updates (“experience replay”), e.g. by minimizing the squared TD error over transitions sampled from the buffer:
$\sum_{k \,\in\, \text{replay buffer}} \big( Q_\theta(s_k, a_k) - (r_k + \max_a Q_{\theta_{\mathrm{OLD}}}(s_{k+1}, a)) \big)^2$
n Q-learning, DQN, DDPG
n Atari: Most sample efficient
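To make the off-policy objective above concrete, here is a minimal Python sketch (my illustration on a made-up tabular problem, not code from the talk): transitions are sampled from a replay buffer and Q_theta is regressed toward the frozen-target bootstrap r + gamma * max_a Q_theta_OLD(s', a). The expression on the slide omits the discount factor; gamma is included here for generality.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 5, 3, 0.99, 0.1

Q = np.zeros((n_states, n_actions))        # Q_theta, being trained
Q_old = Q.copy()                           # Q_theta_OLD, frozen bootstrap target

# Replay buffer of (s, a, r, s') transitions collected under *any* past policy
# (random data here purely for illustration; in practice, the agent's history).
replay = [(rng.integers(n_states), rng.integers(n_actions),
           rng.normal(), rng.integers(n_states)) for _ in range(1000)]

def td_loss(batch):
    """Sum over the batch of squared TD errors, matching the expression above."""
    return sum((Q[s, a] - (r + gamma * Q_old[s2].max())) ** 2
               for s, a, r, s2 in batch)

# One update on a sampled minibatch: move Q_theta toward the frozen target.
batch = [replay[i] for i in rng.integers(len(replay), size=32)]
for s, a, r, s2 in batch:
    Q[s, a] += alpha * ((r + gamma * Q_old[s2].max()) - Q[s, a])
print("minibatch TD loss after update:", td_loss(batch))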
But Signal?
Hindsight Experience Replay
n Main idea: Get reward signal from any experience by simply assuming the goal equals whatever happened
n Representation learned: Universal Q-function Q(s, g, a)
n Replay buffer:
n Actual experience: $(s_t, g_t, a_t, r_t, s_{t+1})$
n Experience replay: $\sum_{k \,\in\, \text{replay buffer}} \big( Q_\theta(s_k, g_k, a_k) - (r_k + \max_a Q_{\theta_{\mathrm{OLD}}}(s_{k+1}, g_k, a)) \big)^2$
n Hindsight replay: $\sum_{k \,\in\, \text{replay buffer}} \big( Q_\theta(s_k, g'_k, a_k) - (r'_k + \max_a Q_{\theta_{\mathrm{OLD}}}(s_{k+1}, g'_k, a)) \big)^2$, where $g'_k$ is a goal actually achieved later in the same episode and $r'_k$ is the reward recomputed for that substituted goal
[Andrychowicz et al, 2017] [Related work: Cabi et al, 2017]
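As an illustration of how those hindsight transitions can be generated, here is a minimal sketch (my own, loosely following the paper's "future" relabeling strategy; the environment, reward threshold, and field names are assumptions): each stored transition is replayed once with its original goal and k extra times with goals taken from states actually achieved later in the same episode, recomputing the sparse reward for the substituted goal.

import numpy as np

def reward_fn(achieved, goal, tol=0.05):
    """Sparse goal-reaching reward: 0 if the goal is met, -1 otherwise."""
    return 0.0 if np.linalg.norm(achieved - goal) < tol else -1.0

def her_relabel(episode, k_future=4, rng=np.random.default_rng(0)):
    """episode: list of dicts with keys s, a, s_next, achieved, goal.
    Returns (s, g, a, r, s_next) tuples: the original transitions plus
    hindsight copies whose goal is a state achieved later in the episode."""
    out, T = [], len(episode)
    for t, tr in enumerate(episode):
        out.append((tr["s"], tr["goal"], tr["a"],
                    reward_fn(tr["achieved"], tr["goal"]), tr["s_next"]))
        for _ in range(k_future):                       # "future" strategy
            g_new = episode[rng.integers(t, T)]["achieved"]
            out.append((tr["s"], g_new, tr["a"],
                        reward_fn(tr["achieved"], g_new), tr["s_next"]))
    return out

# Toy 1-D episode: the agent never reaches the original goal 1.0, yet every
# hindsight copy whose goal matches the achieved state still gets reward 0.
ep = [dict(s=np.array([0.1 * t]), a=0, s_next=np.array([0.1 * (t + 1)]),
           achieved=np.array([0.1 * (t + 1)]), goal=np.array([1.0]))
      for t in range(5)]
print(len(her_relabel(ep)))   # 5 original + 5 * 4 hindsight transitions = 25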
Hindsight Experience Replay: Pushing, Sliding, Pick-and-Place
[Andrychowicz, Wolski, Ray, Schneider, Fong, Welinder, McGrew, Tobin, Abbeel, Zaremba, NIPS 2017]
Quantitative Evaluation
~3 s/episode → 1.5 days for 50 epochs, 6 days for 200 epochs
Outline
n Hindsight Experience Replay
n Meta-learning – learn-to-learn faster
Starting Observations
n TRPO, DQN, A3C, DDPG, PPO, Rainbow, … are fully general RL
algorithms
n i.e., for any environment that can be mathematically defined, these
algorithms are equally applicable
n Environments encountered in real world
= tiny, tiny subset of all environments that could be defined
(e.g. they all satisfy our universe’s physics)
Can we develop “fast” RL algorithms that take advantage of this?
Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
[Diagram: Agent ↔ Environment loop (actions a, states s, rewards r)]
Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
[Diagram: an RL Agent consists of an RL Algorithm and a Policy; the policy interacts with the Environment (actions a, observations o, rewards r)]
Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
[Diagram: a separate RL Agent (RL Algorithm + Policy) is trained in each environment: Environment A, Environment B, … (actions a, states s, rewards r)]
Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
[Diagram: as before, one RL Agent (RL Algorithm + Policy) per environment]
Traditional RL research:
• Human experts develop
the RL algorithm
• After many years, still
no RL algorithms nearly
as good as humans…
Alternative:
• Could we learn a better
RL algorithm?
• Or even learn a better
entire agent?
Meta-Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
"Fast” RL
Agent
Environment A
Meta RL
Algorithm
Environment B
…
Meta-training environments
Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
Meta-Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
[Diagram: the meta-trained “Fast” RL Agent is then deployed (actions a, rewards r, observations o) in testing environments F, G, H, … that were not part of the meta-training environments A, B, …]
Formalizing Learning to Reinforcement Learn
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
$\max_\theta \; \mathbb{E}_M \, \mathbb{E}_{\tau^{(k)}_M} \!\left[ \sum_{k=1}^{K} R(\tau^{(k)}_M) \;\middle|\; \mathrm{RLagent}_\theta \right]$
Formalizing Learning to Reinforcement Learn
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
$M$: sample MDP
$\tau^{(k)}_M$: k'th trajectory in MDP $M$

$\max_\theta \; \mathbb{E}_M \, \mathbb{E}_{\tau^{(k)}_M} \!\left[ \sum_{k=1}^{K} R(\tau^{(k)}_M) \;\middle|\; \mathrm{RLagent}_\theta \right]$

Meta-train:
$\max_\theta \sum_{M \in \mathcal{M}_{\mathrm{train}}} \mathbb{E}_{\tau^{(k)}_M} \!\left[ \sum_{k=1}^{K} R(\tau^{(k)}_M) \;\middle|\; \mathrm{RLagent}_\theta \right]$
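To ground the meta-train objective, here is a minimal, self-contained sketch (an illustration under assumptions, not the authors' setup): each training MDP is a 2-armed Bernoulli bandit, the "RLagent" is a stand-in epsilon-greedy learner that keeps its statistics across the K episodes it is given inside one MDP, and the objective is estimated by Monte Carlo.

import numpy as np

rng = np.random.default_rng(0)
K = 10                                        # episodes the agent gets per MDP

def sample_mdp():
    """An MDP here is just two hidden Bernoulli payout probabilities."""
    return rng.random(2)

def run_agent(p, eps=0.1):
    """Stand-in RLagent: epsilon-greedy, keeping its arm statistics across
    all K episodes inside this one MDP. Returns sum_{k=1..K} R(tau_k)."""
    pulls, wins, total = np.zeros(2), np.zeros(2), 0.0
    for _ in range(K):                        # each bandit episode = one pull
        if pulls.min() == 0 or rng.random() < eps:
            arm = int(rng.integers(2))
        else:
            arm = int(np.argmax(wins / pulls))
        r = float(rng.random() < p[arm])
        pulls[arm] += 1; wins[arm] += r; total += r
    return total

# Monte-Carlo estimate of: sum over training MDPs of E[ sum_k R(tau_k) | agent ]
train_mdps = [sample_mdp() for _ in range(20)]
objective = sum(np.mean([run_agent(p) for _ in range(50)]) for p in train_mdps)
print("meta-train objective estimate:", objective)

Meta-training would then adjust the agent's parameters (here just eps; in the paper, the RNN weights) with an outer "slow" RL algorithm to maximize this estimate.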
n RLagent = RNN = generic computation architecture
n different weights in the RNN means different RL algorithm and prior
n different activations in the RNN means different current policy
n meta-train objective can be optimized with an existing (slow) RL algorithm
Representing $\mathrm{RLagent}_\theta$
$\max_\theta \sum_{M \in \mathcal{M}_{\mathrm{train}}} \mathbb{E}_{\tau^{(k)}_M} \!\left[ \sum_{k=1}^{K} R(\tau^{(k)}_M) \;\middle|\; \mathrm{RLagent}_\theta \right]$
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
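A minimal sketch of that representation, assuming PyTorch and illustrative dimensions (not the authors' code): the GRU weights encode the learned RL algorithm and prior, the hidden activations encode the current policy, and the hidden state is reset only when switching to a new MDP, not between episodes within it.

import torch
import torch.nn as nn

class RNNAgent(nn.Module):
    def __init__(self, obs_dim=4, n_actions=3, hidden=64):
        super().__init__()
        self.n_actions = n_actions
        # Per-step input: observation, one-hot previous action,
        # previous reward, and an episode-boundary flag.
        self.rnn = nn.GRU(obs_dim + n_actions + 2, hidden, batch_first=True)
        self.pi = nn.Linear(hidden, n_actions)        # policy head (logits)

    def forward(self, obs, prev_action, prev_reward, done, h):
        a = torch.nn.functional.one_hot(prev_action, self.n_actions).float()
        x = torch.cat([obs, a, prev_reward, done], dim=-1).unsqueeze(1)
        out, h = self.rnn(x, h)        # h carries what has been learned in this MDP
        return self.pi(out.squeeze(1)), h

agent = RNNAgent()
h = torch.zeros(1, 1, 64)    # reset only when a *new MDP* starts, not per episode
logits, h = agent(torch.zeros(1, 4), torch.tensor([0]),
                  torch.zeros(1, 1), torch.zeros(1, 1), h)
action = torch.distributions.Categorical(logits=logits).sample()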
Evaluation: Multi-Armed Bandits
n Multi-Armed Bandits setting
n Each bandit has its own distribution over pay-outs
n Each episode = choose 1 bandit
n Good RL agent should explore bandits sufficiently, yet also exploit the good/best ones
n Provably (asymptotically) optimal RL algorithms have been invented by humans: Gittins index, UCB1, Thompson sampling, …
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
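For reference, here is a minimal sketch of one of the hand-designed baselines named above, Thompson sampling for Bernoulli bandits, with made-up payout probabilities: it maintains a Beta posterior per arm and pulls the arm whose sampled mean is highest.

import numpy as np

rng = np.random.default_rng(0)
p_true = np.array([0.3, 0.5, 0.7])           # hidden per-arm payout probabilities
a_post = np.ones(len(p_true))                # Beta posterior: successes + 1
b_post = np.ones(len(p_true))                # Beta posterior: failures + 1

total = 0.0
for t in range(500):
    arm = int(np.argmax(rng.beta(a_post, b_post)))   # sample a mean per arm, pick the best
    r = float(rng.random() < p_true[arm])
    a_post[arm] += r
    b_post[arm] += 1.0 - r
    total += r
print("average reward:", total / 500)        # approaches the best arm's 0.7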
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
Evaluation: Multi-Armed Bandits
We consider the Bayesian evaluation setting. Some of these prior works also have adversarial guarantees, which we don't consider here.
Evaluation: Visual Navigation
Agent’s view | Maze
Agent input: current image
Agent action: straight / 2 degrees left / 2 degrees right
Map just shown for our purposes, but not available to agent
Related work: Mirowski, et al, 2016; Jaderberg et al, 2016; Mnih et al, 2016; Wang et al, 2016
Evaluation: Visual Navigation
Before learning-to-learn | After learning-to-learn
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
Agent input: current image
Agent action: straight / 2 degrees left / 2 degrees right
Map just shown for our purposes, but not available to agent
n Simple Neural Attentive Meta-Learner (SNAIL)
[Mishra, Rohaninejad, Chen, Abbeel, 2017]
n Model-Agnostic Meta-Learning (MAML)
[Finn, Abbeel, Levine, 2017; Finn*, Yu*, Zhang, Abbeel, Levine 2017]
Other Meta RL Architectures
n Hindsight Experience Replay
n Meta-learning – learn-to-learn faster
Summary
n OpenAI Gym: https://gym.openai.com
n OpenAI baselines: https://github.com/openai/baselines
n rllab: https://github.com/rll/rllab
Want to Try Things Out?
n Deep RL Bootcamp [intense long weekend]
n 12 lectures by: Pieter Abbeel, Rocky Duan, Peter Chen, John Schulman,
Vlad Mnih, Chelsea Finn, Sergey Levine
n 4 hands-on labs
n https://sites.google.com/view/deep-rl-bootcamp/
n Deep RL Course [full semester]
n Originated by John Schulman, Sergey Levine, Chelsea Finn
n Latest offering (by Sergey Levine):
http://rll.berkeley.edu/deeprlcourse/
Want to learn more?
n Fast Reinforcement Learning
n Long Horizon Reasoning
n Taskability
n Lifelong Learning
n Leverage Simulation
n Maximize Signal Extracted
Many Exciting Directions
n Safe Learning
n Value Alignment
n Planning + Learning
n …
Thank you
@pabbeel
pabbeel@cs.berkeley.edu
