How to Make Deep Reinforcement Learning less Data Hungry
Pieter Abbeel
Embodied Intelligence
UC Berkeley
Gradescope
OpenAI
AI for robotic automation
AI research
AI for grading homework and exams
AI research / advisory role
Reinforcement Learning (RL)
Robot +
Environment
\max_\theta \; \mathbb{E}\!\left[\sum_{t=0}^{H} R(s_t) \;\middle|\; \pi_\theta\right]
[Image credit: Sutton and Barto 1998] Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
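The objective above can be estimated by Monte Carlo: roll the policy out for H steps and average the summed rewards. The toy chain MDP and the random policy below are illustrative stand-ins, not from the talk.

```python
import random

# Monte Carlo estimate of E[ sum_{t=0}^H R(s_t) | pi_theta ] on a toy
# 5-state chain; the environment and the (random) policy are illustrative.
random.seed(0)
H = 10

def reward(state):
    return 1.0 if state == 3 else 0.0      # payoff only in state 3

def policy(state):
    return random.choice([-1, +1])         # stand-in for a parameterized pi_theta

def rollout():
    s = 0
    total = reward(s)                      # t = 0 term
    for _ in range(H):
        s = max(0, min(4, s + policy(s)))  # clipped chain dynamics
        total += reward(s)
    return total

# Averaging many rollouts approximates the objective for this fixed policy;
# an RL algorithm would then adjust the policy to push this estimate up.
estimate = sum(rollout() for _ in range(1000)) / 1000
```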
Reinforcement Learning (RL)
Robot +
Environment
Compared to supervised learning, additional challenges:
n Credit assignment
n Stability
n Exploration
[Image credit: Sutton and Barto 1998]
\max_\theta \; \mathbb{E}\!\left[\sum_{t=0}^{H} R(s_t) \;\middle|\; \pi_\theta\right]
Deep RL Success Stories
DQN Mnih et al, NIPS 2013 / Nature 2015
MCTS Guo et al, NIPS 2014; TRPO Schulman, Levine, Moritz, Jordan, Abbeel, ICML 2015; A3C Mnih et al,
ICML 2016; Dueling DQN Wang et al ICML 2016; Double DQN van Hasselt et al, AAAI 2016; Prioritized
Experience Replay Schaul et al, ICLR 2016; Bootstrapped DQN Osband et al, 2016; Q-Ensembles Chen et al,
2017; Rainbow Hessel et al, 2017; …
Deep RL Success Stories
AlphaGo Silver et al, Nature 2016
AlphaGoZero Silver et al, Nature 2017
Tian et al, 2016; Maddison et al, 2014; Clark et al, 2015
AlphaZero Silver et al, 2017
n Super-human agent on Dota2 1v1, enabled by
n Reinforcement learning
n Self-play
n Enough computation
Deep RL Success Stories
[OpenAI, 2017]
Deep RL Success Stories
TRPO Schulman et al, 2015 + GAE Schulman et al, 2016
See also: DDPG Lillicrap et al 2015; SVG Heess et al, 2015; Q-Prop Gu et al, 2016; Scaling up ES Salimans et
al, 2017; PPO Schulman et al, 2017; Parkour Heess et al, 2017;
Deep RL Success Stories
Guided Policy Search Levine*, Finn*, Darrell, Abbeel, JMLR 2016
[Geng*, Zhang*, Bruce*, Caluwaerts, Vespignani, Sunspiral, Abbeel, Levine, ICRA 2017]
Mastery? Yes
Deep RL (DQN) vs. Human
Score: 18.9 Score: 9.3
But how good is the learning?
Deep RL (DQN) vs. Human
Score: 18.9 Score: 9.3
DQN: #Experience measured in real time: 40 days (“Slow”)
Human: #Experience measured in real time: 2 hours (“Fast”)
Humans vs. DDQN
Black dots: human play
Blue curve: mean of human play
Blue dashed line: ‘expert’ human play
Red dashed lines:
DDQN after 10, 25, 200M frames
(~ 46, 115, 920 hours)
[Tsividis, Pouncy, Xu, Tenenbaum, Gershman, 2017]
Humans after 15 minutes tend to outperform DDQN after 115 hours
How to bridge this gap?
Outline
n Hindsight Experience Replay
n Meta-learning – learn-to-learn faster
n On-Policy Algorithms:
n Can only perform update based on data collected under current policy
n Policy gradients, A3C, TRPO
n Atari: Most wall-clock efficient
n Off-Policy Algorithms:
n Can re-use all past data for updates (“experience replay”)
n Q learning, DQN, DDPG
n Atari: Most sample efficient
On-Policy vs. Off-Policy
\sum_{k \,\text{from replay buffer}} \left(Q_\theta(s_k, a_k) - \left(r_k + \max_a Q_{\theta_{\mathrm{OLD}}}(s_{k+1}, a)\right)\right)^2
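A minimal sketch of the off-policy update above, with a tabular Q standing in for the Q-network and a hand-made replay buffer; both are illustrative assumptions, not the talk's setup.

```python
import random
import numpy as np

# Off-policy TD update: q_old plays the role of the frozen parameters theta_OLD.
random.seed(0)
n_states, n_actions, gamma, lr = 5, 2, 0.99, 0.1
q = np.zeros((n_states, n_actions))
q_old = q.copy()

replay_buffer = [          # (s_k, a_k, r_k, s_{k+1}) from *any* past policy
    (0, 1, 1.0, 2),
    (2, 0, 0.0, 3),
    (3, 1, 5.0, 4),
]

for _ in range(200):
    s, a, r, s_next = random.choice(replay_buffer)  # k from replay buffer
    td_target = r + gamma * q_old[s_next].max()     # r_k + max_a Q_old(s_{k+1}, a)
    q[s, a] -= lr * (q[s, a] - td_target)           # descend the squared error
    q_old = q.copy()                                # sync target (DQN does this rarely)
```

Because the transitions can come from any past behavior, all of them stay reusable, which is why off-policy methods are the most sample-efficient on Atari.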
But Signal?
n Main idea: get reward signal from any experience by simply assuming the goal equals whatever happened
n Representation learned: universal Q-function Q(s, g, a)
n Replay buffer:
n Actual experience:
n Experience replay:
n Hindsight replay:
Hindsight Experience Replay
[Andrychowicz et al, 2017] [Related work: Cabi et al, 2017]
Actual experience: (s_t, g_t, a_t, r_t, s_{t+1})
Experience replay: \sum_{k \,\text{from replay buffer}} \left(Q_\theta(s_k, g_k, a_k) - \left(r_k + \max_a Q_{\theta_{\mathrm{OLD}}}(s_{k+1}, g_k, a)\right)\right)^2
Hindsight replay: \sum_{k \,\text{from replay buffer}} \left(Q_\theta(s_k, g'_k, a_k) - \left(r'_k + \max_a Q_{\theta_{\mathrm{OLD}}}(s_{k+1}, g'_k, a)\right)\right)^2
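The relabeling step can be sketched as below; the sparse `reward_fn` and the choice of the final state as the substitute goal g' are illustrative simplifications of the paper's relabeling strategies.

```python
# Hindsight relabeling: next to each actual transition with its intended goal g,
# store a copy whose goal g' is a state the episode actually reached.
def reward_fn(state, goal):
    return 1.0 if state == goal else 0.0   # sparse goal-reaching reward (illustrative)

def hindsight_transitions(trajectory, goal):
    """trajectory: list of (s_t, a_t, s_{t+1}); returns (s, g, a, r, s') tuples."""
    achieved = trajectory[-1][2]           # g' = final state reached (one simple choice)
    buffer = []
    for s, a, s_next in trajectory:
        buffer.append((s, goal, a, reward_fn(s_next, goal), s_next))          # actual
        buffer.append((s, achieved, a, reward_fn(s_next, achieved), s_next))  # hindsight
    return buffer

traj = [("s0", "a0", "s1"), ("s1", "a1", "s2")]
buf = hindsight_transitions(traj, goal="g")
# The hindsight copies carry reward signal even though goal "g" was never reached.
```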
Hindsight Experience Replay: Pushing, Sliding, Pick-and-Place
[Andrychowicz, Wolski, Ray, Schneider, Fong, Welinder, McGrew, Tobin, Abbeel, Zaremba, NIPS 2017]
Quantitative Evaluation
~3 s/episode → 1.5 days for 50 epochs, 6 days for 200 epochs
Outline
n Hindsight Experience Replay
n Meta-learning – learn-to-learn faster
Starting Observations
n TRPO, DQN, A3C, DDPG, PPO, Rainbow, … are fully general RL
algorithms
n i.e., for any environment that can be mathematically defined, these
algorithms are equally applicable
n Environments encountered in real world
= tiny, tiny subset of all environments that could be defined
(e.g. they all satisfy our universe’s physics)
Can we develop “fast” RL algorithms that take advantage of this?
Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
[Figure: Agent and Environment interaction loop (action a, state s, reward r)]
Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
[Figure: RL Agent = RL Algorithm + Policy, interacting with Environment (action a, observation o, reward r)]
Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
[Figure: one RL Agent (RL Algorithm + Policy) per environment, for Environments A, B, … (action a, state s, reward r)]
Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
[Figure: one RL Agent (RL Algorithm + Policy) per environment, for Environments A, B, … (action a, observation o, reward r)]
Traditional RL research:
n Human experts develop the RL algorithm
n After many years, still no RL algorithms nearly as good as humans…
Alternative:
n Could we learn a better RL algorithm?
n Or even learn a better entire agent?
Meta-Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
[Figure: a Meta RL Algorithm trained on meta-training Environments A, B, … outputs a “Fast” RL Agent]
Meta-Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
[Figure: the Meta RL Algorithm is trained on Environments A, B, …; the resulting “Fast” RL Agent is run on testing Environment F (action a, reward r, observation o)]
Meta-Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
[Figure: the Meta RL Algorithm is trained on Environments A, B, …; the resulting “Fast” RL Agent is run on testing Environment G (action a, reward r, observation o)]
Meta-Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
[Figure: the Meta RL Algorithm is trained on Environments A, B, …; the resulting “Fast” RL Agent is run on testing Environment H (action a, reward r, observation o)]
Formalizing Learning to Reinforcement Learn
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
\max_\theta \; \mathbb{E}_M \, \mathbb{E}_{\tau_M^{(k)}}\!\left[\sum_{k=1}^{K} R\big(\tau_M^{(k)}\big) \;\middle|\; \mathrm{RLagent}_\theta\right]
Formalizing Learning to Reinforcement Learn
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
M : sampled MDP
\tau_M^{(k)} : k-th trajectory in MDP M
\max_\theta \; \mathbb{E}_M \, \mathbb{E}_{\tau_M^{(k)}}\!\left[\sum_{k=1}^{K} R\big(\tau_M^{(k)}\big) \;\middle|\; \mathrm{RLagent}_\theta\right]
Meta-train:
\max_\theta \sum_{M \in \mathcal{M}_{\mathrm{train}}} \mathbb{E}_{\tau_M^{(k)}}\!\left[\sum_{k=1}^{K} R\big(\tau_M^{(k)}\big) \;\middle|\; \mathrm{RLagent}_\theta\right]
n RLagent = RNN = generic computation architecture
n different weights in the RNN means different RL algorithm and prior
n different activations in the RNN means different current policy
n meta-train objective can be optimized with an existing (slow) RL algorithm
Representing RLagent_θ
\max_\theta \sum_{M \in \mathcal{M}_{\mathrm{train}}} \mathbb{E}_{\tau_M^{(k)}}\!\left[\sum_{k=1}^{K} R\big(\tau_M^{(k)}\big) \;\middle|\; \mathrm{RLagent}_\theta\right]
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
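A schematic of the RNN agent described above. The layer sizes, the tanh recurrence, and the greedy action choice are illustrative assumptions, and the outer (slow) RL optimization of the weights is omitted; here theta is just random to show the data flow.

```python
import numpy as np

# The weights theta encode the learned "fast RL algorithm"; the hidden state h
# encodes everything learned so far inside the current MDP (the current policy).
rng = np.random.default_rng(0)
obs_dim, act_dim, hid_dim = 3, 2, 8
theta = {
    "W_in": rng.normal(size=(hid_dim, obs_dim + act_dim + 1)),  # +1 for reward
    "W_h": 0.1 * rng.normal(size=(hid_dim, hid_dim)),
    "W_out": rng.normal(size=(act_dim, hid_dim)),
}

def agent_step(h, obs, prev_action, prev_reward):
    # Feeding back the previous action and reward lets the hidden state
    # implement credit assignment and exploration internally.
    x = np.concatenate([obs, np.eye(act_dim)[prev_action], [prev_reward]])
    h = np.tanh(theta["W_in"] @ x + theta["W_h"] @ h)
    action = int(np.argmax(theta["W_out"] @ h))  # greedy, for the sketch
    return h, action

h = np.zeros(hid_dim)            # reset only when a new MDP is sampled
obs, action, reward = np.zeros(obs_dim), 0, 0.0
for _ in range(5):               # hidden state persists across timesteps/episodes
    h, action = agent_step(h, obs, action, reward)
```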
Evaluation: Multi-Armed Bandits
n Multi-Armed Bandits setting
n Each bandit has its own distribution over pay-outs
n Each episode = choose 1 bandit
n Good RL agent should explore bandits sufficiently, yet also exploit the good/best ones
n Provably (asymptotically) optimal RL algorithms have been invented by humans: Gittins index, UCB1, Thompson sampling, …
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
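One of the human-designed baselines named above, Thompson sampling for Bernoulli bandits, can be sketched as follows; the arm probabilities are illustrative.

```python
import numpy as np

# Thompson sampling: keep a Beta posterior per arm, draw one sample from
# each posterior, and pull the arm whose sample is largest.
rng = np.random.default_rng(0)
true_payouts = [0.3, 0.7]            # illustrative arm probabilities
successes = np.ones(2)               # Beta(1, 1) uniform priors
failures = np.ones(2)

for _ in range(500):
    samples = rng.beta(successes, failures)   # one posterior draw per arm
    arm = int(np.argmax(samples))             # explore/exploit via sampling
    win = rng.random() < true_payouts[arm]    # Bernoulli payout
    successes[arm] += win
    failures[arm] += 1 - win

pulls = successes + failures - 2              # posterior concentrates on the
best_arm = int(np.argmax(pulls))              # better arm, which gets pulled most
```

Posterior sampling balances exploration and exploitation automatically: an under-tried arm has a wide posterior and keeps getting sampled until the belief sharpens.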
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
Evaluation: Multi-Armed Bandits
We consider the Bayesian evaluation setting. Some of these prior works also have adversarial guarantees, which we don’t consider here.
Evaluation: Visual Navigation
Agent’s view Maze
Agent input: current image
Agent action: straight / 2 degrees left / 2 degrees right
Map shown for our purposes only; it is not available to the agent
Related work: Mirowski, et al, 2016; Jaderberg et al, 2016; Mnih et al, 2016; Wang et al, 2016
Evaluation: Visual Navigation
After learning-to-learn / Before learning-to-learn
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
Agent input: current image
Agent action: straight / 2 degrees left / 2 degrees right
Map shown for our purposes only; it is not available to the agent
n Simple Neural Attentive Meta-Learner (SNAIL)
[Mishra, Rohaninejad, Chen, Abbeel, 2017]
n Model-Agnostic Meta-Learning (MAML)
[Finn, Abbeel, Levine, 2017; Finn*, Yu*, Zhang, Abbeel, Levine 2017]
Other Meta RL Architectures
n Hindsight Experience Replay
n Meta-learning – learn-to-learn faster
Summary
n OpenAI Gym: https://gym.openai.com
n OpenAI baselines: https://github.com/openai/baselines
n rllab: https://github.com/rll/rllab
Want to Try Things Out?
n Deep RL Bootcamp [intense long weekend]
n 12 lectures by: Pieter Abbeel, Rocky Duan, Peter Chen, John Schulman,
Vlad Mnih, Chelsea Finn, Sergey Levine
n 4 hands-on labs
n https://sites.google.com/view/deep-rl-bootcamp/
n Deep RL Course [full semester]
n Originated by John Schulman, Sergey Levine, Chelsea Finn
n Latest offering (by Sergey Levine):
http://rll.berkeley.edu/deeprlcourse/
Want to learn more?
n Fast Reinforcement Learning
n Long Horizon Reasoning
n Taskability
n Lifelong Learning
n Leverage Simulation
n Maximize Signal Extracted
Many Exciting Directions
n Safe Learning
n Value Alignment
n Planning + Learning
n …
Thank you
@pabbeel
pabbeel@cs.berkeley.edu

 
AISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the EdgeAISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the Edge
 
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
 
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
 

Recently uploaded

PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 

Recently uploaded (20)

PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 

Deep Reinforcement Learning by Pieter

  • 1. How to Make Deep Reinforcement Learning less Data Hungry Pieter Abbeel Embodied Intelligence UC Berkeley Gradescope OpenAI AI for robotic automation AI research AI for grading homework and exams AI research / advisory role
  • 2. Reinforcement Learning (RL) Robot + Environment. Objective: max_θ E[ Σ_{t=0}^{H} R(s_t) | π_θ ] [Image credit: Sutton and Barto 1998] Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
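The objective max_θ E[ Σ_{t=0}^{H} R(s_t) | π_θ ] can be estimated by Monte Carlo rollouts. A minimal sketch on a hypothetical toy chain MDP (the environment and all names are ours, purely for illustration):

```python
def estimate_objective(reset, step, reward, policy, horizon=10, n_rollouts=1000):
    """Monte Carlo estimate of E[ sum_{t=0}^{H} R(s_t) | pi_theta ]."""
    total = 0.0
    for _ in range(n_rollouts):
        s = reset()
        ret = reward(s)                  # R(s_0)
        for _ in range(horizon):
            s = step(s, policy(s))       # environment transition under pi_theta
            ret += reward(s)
        total += ret
    return total / n_rollouts

# Hypothetical chain MDP: states 0..4, reward 1 only in state 4.
reset = lambda: 0
step = lambda s, a: min(4, max(0, s + a))
reward = lambda s: 1.0 if s == 4 else 0.0
policy = lambda s: 1                     # deterministic: always move right
print(estimate_objective(reset, step, reward, policy))  # → 7.0
```

Everything here is deterministic, so the estimate is exact; with a stochastic policy or environment the average converges as n_rollouts grows.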
  • 3. Reinforcement Learning (RL) Robot + Environment. Compared to supervised learning, additional challenges: credit assignment, stability, exploration. Objective: max_θ E[ Σ_{t=0}^{H} R(s_t) | π_θ ] [Image credit: Sutton and Barto 1998] Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 4. Deep RL Success Stories DQN Mnih et al, NIPS 2013 / Nature 2015 MCTS Guo et al, NIPS 2014; TRPO Schulman, Levine, Moritz, Jordan, Abbeel, ICML 2015; A3C Mnih et al, ICML 2016; Dueling DQN Wang et al, ICML 2016; Double DQN van Hasselt et al, AAAI 2016; Prioritized Experience Replay Schaul et al, ICLR 2016; Bootstrapped DQN Osband et al, 2016; Q-Ensembles Chen et al, 2017; Rainbow Hessel et al, 2017; … Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 5. Deep RL Success Stories AlphaGo Silver et al, Nature 2016 AlphaGoZero Silver et al, Nature 2017 Tian et al, 2016; Maddison et al, 2014; Clark et al, 2015 AlphaZero Silver et al, 2017
  • 6. Deep RL Success Stories: super-human agent on Dota 2 1v1, enabled by reinforcement learning, self-play, and enough computation. [OpenAI, 2017] Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 7. Deep RL Success Stories TRPO Schulman et al, 2015 + GAE Schulman et al, 2016 See also: DDPG Lillicrap et al, 2015; SVG Heess et al, 2015; Q-Prop Gu et al, 2016; Scaling up ES Salimans et al, 2017; PPO Schulman et al, 2017; Parkour Heess et al, 2017.
  • 8. Deep RL Success Stories Guided Policy Search Levine*, Finn*, Darrell, Abbeel, JMLR 2016 Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 9. [Geng*, Zhang*, Bruce*, Caluwaerts, Vespignani, Sunspiral, Abbeel, Levine, ICRA 2017] Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 10. Mastery? Yes. Deep RL (DQN) vs. Human: DQN score 18.9, human score 9.3. Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 11. But how good is the learning? Deep RL (DQN) vs. Human: DQN score 18.9, with experience measured in real time of 40 days ("slow"); human score 9.3, with experience measured in real time of 2 hours ("fast"). Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 12. Humans vs. DDQN Black dots: human play Blue curve: mean of human play Blue dashed line: ‘expert’ human play Red dashed lines: DDQN after 10, 25, 200M frames (~ 46, 115, 920 hours) [Tsividis, Pouncy, Xu, Tenenbaum, Gershman, 2017] Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope Humans after 15 minutes tend to outperform DDQN after 115 hours
  • 13. How to bridge this gap? Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 14. Outline: Hindsight Experience Replay; Meta-learning – learn-to-learn faster. Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 15. On-Policy vs. Off-Policy. On-policy algorithms (policy gradients, A3C, TRPO) can only perform updates based on data collected under the current policy; on Atari they are the most wall-clock efficient. Off-policy algorithms (Q-learning, DQN, DDPG) can re-use all past data for updates (“experience replay”); on Atari they are the most sample efficient. Off-policy update minimizes Σ_{k from replay buffer} ( Q_θ(s_k, a_k) − (r_k + max_a Q_θOLD(s_{k+1}, a)) )² Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
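The off-policy objective on slide 15, regressing Q_θ(s_k, a_k) toward r_k + max_a Q_θOLD(s_{k+1}, a) over minibatches sampled from a replay buffer, can be sketched in tabular form. This is a toy illustration (not the DQN implementation); we add a discount factor and a terminal flag, which the slide's one-line objective omits:

```python
import random
import numpy as np

def replay_q_update(Q, buffer, gamma=0.99, lr=0.1, batch_size=32):
    """One off-policy step: sample past transitions from the replay buffer and
    move Q(s, a) toward the target r + gamma * max_a' Q_old(s', a'),
    i.e. a gradient step on the squared error from the slide."""
    Q_old = Q.copy()                      # frozen copy, standing in for Q_thetaOLD
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    for s, a, r, s_next, done in batch:
        target = r if done else r + gamma * Q_old[s_next].max()
        Q[s, a] += lr * (target - Q[s, a])
    return Q

# Tiny usage: replay buffer full of one terminal transition (s=0, a=1, r=1).
Q = np.zeros((2, 2))
buffer = [(0, 1, 1.0, 1, True)] * 40
for _ in range(100):
    Q = replay_q_update(Q, buffer, lr=0.5)
```

Because the data comes from the buffer rather than the current policy, any past experience can be re-used, which is the source of the sample efficiency noted on the slide.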
  • 16. But Signal? Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 17. Hindsight Experience Replay. Main idea: get a reward signal from any experience by simply assuming the goal equals whatever happened. Representation learned: universal Q-function Q(s, g, a). Replay buffer stores actual experience (s_t, g_t, a_t, r_t, s_{t+1}). Experience replay minimizes Σ_{k from replay buffer} ( Q_θ(s_k, g_k, a_k) − (r_k + max_a Q_θOLD(s_{k+1}, g_k, a)) )². Hindsight replay minimizes Σ_{k from replay buffer} ( Q_θ(s_k, g′_k, a_k) − (r′_k + max_a Q_θOLD(s_{k+1}, g′_k, a)) )². [Andrychowicz et al, 2017] [Related work: Cabi et al, 2017] Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
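The hindsight relabeling step, storing each transition again with the goal g replaced by a goal g′ that was actually achieved and the reward recomputed, might look like the sketch below. We show only the simplest "final" strategy (HER also uses others, such as "future"); names are ours:

```python
def her_relabel(episode, reward_fn):
    """Hindsight relabeling, 'final' strategy: replay the episode as if the
    goal had been the state actually reached at the end, recomputing rewards.
    episode: list of (s, g, a, r, s_next) transitions."""
    achieved_goal = episode[-1][4]                    # final state reached
    relabeled = []
    for s, g, a, r, s_next in episode:
        r_new = reward_fn(s_next, achieved_goal)      # reward under the new goal
        relabeled.append((s, achieved_goal, a, r_new, s_next))
    return relabeled

# Hypothetical 1-D reaching task: goal 5 was never reached, episode ended at 3.
reward_fn = lambda s, g: 1.0 if s == g else 0.0
episode = [(0, 5, 1, 0.0, 1), (1, 5, 1, 0.0, 2), (2, 5, 1, 0.0, 3)]
extra = her_relabel(episode, reward_fn)   # last transition now carries reward 1.0
```

Both the original and the relabeled transitions go into the replay buffer, so even a completely unsuccessful episode yields reward signal for the universal Q-function.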
  • 18. Hindsight Experience Replay: Pushing, Sliding, Pick-and-Place Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope[Andrychowicz, Wolski, Ray, Schneider, Fong, Welinder, McGrew, Tobin, Abbeel, Zaremba, NIPS2017]
  • 19. Quantitative Evaluation ~3 s/episode → 1.5 days for 50 epochs, 6 days for 200 epochs Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 20. Outline: Hindsight Experience Replay; Meta-learning – learn-to-learn faster. Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 21. Starting Observations: TRPO, DQN, A3C, DDPG, PPO, Rainbow, … are fully general RL algorithms, i.e., for any environment that can be mathematically defined, these algorithms are equally applicable. Environments encountered in the real world are a tiny, tiny subset of all environments that could be defined (e.g., they all satisfy our universe's physics). Can we develop “fast” RL algorithms that take advantage of this? Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 22. Reinforcement Learning [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016] Agent Environment asr Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 23. Reinforcement Learning [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016] RL Algorithm Policy Environment aor RL Agent Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 24. Reinforcement Learning [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016] RL Algorithm Policy Environment A asr … RL Agent RL Algorithm Policy Environment B asr RL Agent Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 25. Reinforcement Learning [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016] … RL Algorithm Policy Environment A aor RL Algorithm Policy Environment B aor RL Agent RL Agent Traditional RL research: • Human experts develop the RL algorithm • After many years, still no RL algorithms nearly as good as humans… Alternative: • Could we learn a better RL algorithm? • Or even learn a better entire agent? Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 26. Meta-Reinforcement Learning [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016] "Fast” RL Agent Environment A Meta RL Algorithm Environment B … Meta-training environments Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 27. Meta-Reinforcement Learning [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016] "Fast” RL Agent Environment A ar,o Meta RL Algorithm Environment B … Environment F Meta-training environments Testing environments Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 28. Meta-Reinforcement Learning [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016] "Fast” RL Agent Environment A ar,o Meta RL Algorithm Environment B … Environment G Meta-training environments Testing environments Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 29. Meta-Reinforcement Learning [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016] "Fast” RL Agent Environment A ar,o Meta RL Algorithm Environment B … Environment H Meta-training environments Testing environments Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 30. Formalizing Learning to Reinforcement Learn [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016] max_θ E_M E_{τ_M^(k)} [ Σ_{k=1}^{K} R(τ_M^(k)) | RLagent_θ ] Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 31. Formalizing Learning to Reinforcement Learn [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016] M: sampled MDP; τ_M^(k): k’th trajectory in MDP M. Objective: max_θ E_M E_{τ_M^(k)} [ Σ_{k=1}^{K} R(τ_M^(k)) | RLagent_θ ] Meta-train: max_θ Σ_{M ∈ M_train} E_{τ_M^(k)} [ Σ_{k=1}^{K} R(τ_M^(k)) | RLagent_θ ] Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 32. Representing RLagent_θ: RLagent = RNN = generic computation architecture; different weights in the RNN mean a different RL algorithm and prior; different activations in the RNN mean a different current policy; the meta-train objective max_θ Σ_{M ∈ M_train} E_{τ_M^(k)} [ Σ_{k=1}^{K} R(τ_M^(k)) | RLagent_θ ] can be optimized with an existing (slow) RL algorithm. Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
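For the RNN to implement a "fast" RL algorithm in its activations, each step's input typically concatenates the current observation with the previous action, reward, and done flag, and the hidden state is carried across episodes within a trial. A minimal sketch of building those inputs (plain Python; the function and data layout are our illustration, not the paper's code):

```python
def meta_episode_inputs(trajectory, n_actions):
    """Build per-step RNN inputs for one trial: each step sees the current
    observation plus the previous action (one-hot), reward, and done flag.
    Carrying hidden state across episodes is what lets the trained weights
    encode an RL algorithm rather than a single fixed policy."""
    inputs = []
    prev_a, prev_r, prev_d = 0, 0.0, 1.0   # dummy 'previous step' at trial start
    for obs, a, r, done in trajectory:
        a_onehot = [1.0 if i == prev_a else 0.0 for i in range(n_actions)]
        inputs.append([obs] + a_onehot + [prev_r, prev_d])
        prev_a, prev_r, prev_d = a, r, float(done)
    return inputs

# Two steps of a hypothetical trajectory: (obs, action, reward, done).
steps = [(0.5, 1, 1.0, False), (0.2, 0, 0.0, True)]
xs = meta_episode_inputs(steps, n_actions=2)
```

Feeding reward and termination back in is what gives the agent the information an RL algorithm would need to explore, then exploit, within a single trial.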
  • 33. Evaluation: Multi-Armed Bandits. Multi-armed bandits setting: each bandit has its own distribution over pay-outs; each episode = choose 1 bandit. A good RL agent should explore bandits sufficiently, yet also exploit the good/best ones. Provably (asymptotically) optimal RL algorithms have been invented by humans: Gittins index, UCB1, Thompson sampling, … [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
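For reference, one of the hand-designed baselines named above, UCB1, fits in a few lines (the standard algorithm; variable names are ours):

```python
import math

def ucb1_choose(counts, means):
    """UCB1 arm selection: play every arm once, then pick the arm maximizing
    empirical mean + sqrt(2 * ln(total plays) / plays of that arm)."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm                               # untried arms first
    t = sum(counts)
    scores = [means[a] + math.sqrt(2.0 * math.log(t) / counts[a])
              for a in range(len(counts))]
    return max(range(len(counts)), key=lambda a: scores[a])
```

The bonus term shrinks as an arm is played more, which is exactly the explore-then-exploit trade-off the learned RL agent must discover on its own from the meta-training distribution.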
  • 34. [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] Evaluation: Multi-Armed Bandits. We consider the Bayesian evaluation setting. Some of these prior works also have adversarial guarantees, which we don’t consider here. Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 35. Evaluation: Visual Navigation Agent’s view Maze Agent input: current image Agent action: straight / 2 degrees left / 2 degrees right Map just shown for our purposes, but not available to agent Related work: Mirowski et al, 2016; Jaderberg et al, 2016; Mnih et al, 2016; Wang et al, 2016
  • 36. Evaluation: Visual Navigation. Before learning-to-learn / After learning-to-learn [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] Agent input: current image Agent action: straight / 2 degrees left / 2 degrees right Map just shown for our purposes, but not available to agent Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 37. Other Meta RL Architectures: Simple Neural Attentive Meta-Learner (SNAIL) [Mishra, Rohaninejad, Chen, Abbeel, 2017]; Model-Agnostic Meta-Learning (MAML) [Finn, Abbeel, Levine, 2017; Finn*, Yu*, Zhang, Abbeel, Levine, 2017] Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 38. Summary: Hindsight Experience Replay; Meta-learning – learn-to-learn faster. Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 39. Want to Try Things Out? OpenAI Gym: https://gym.openai.com OpenAI baselines: https://github.com/openai/baselines rllab: https://github.com/rll/rllab Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 40. Want to learn more? Deep RL Bootcamp [intense long weekend]: 12 lectures by Pieter Abbeel, Rocky Duan, Peter Chen, John Schulman, Vlad Mnih, Chelsea Finn, Sergey Levine; 4 hands-on labs; https://sites.google.com/view/deep-rl-bootcamp/ Deep RL Course [full semester]: originated by John Schulman, Sergey Levine, Chelsea Finn; latest offering (by Sergey Levine): http://rll.berkeley.edu/deeprlcourse/ Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 41. Many Exciting Directions: Fast Reinforcement Learning; Long Horizon Reasoning; Taskability; Lifelong Learning; Leverage Simulation; Maximize Signal Extracted; Safe Learning; Value Alignment; Planning + Learning; … Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 42. Thank you @pabbeel pabbeel@cs.berkeley.edu Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope