Continuous Deep Q-Learning with Model-based Acceleration
ICML 2016. S. Gu, T. Lillicrap, I. Sutskever, S. Levine.
Presenter : Hyemin Ahn
Introduction
 Another, and another, improved work on deep reinforcement learning.
 Tries to incorporate the advantages of model-free reinforcement learning && model-based reinforcement learning.
Results : Preview
Reinforcement Learning : overview
How can we formalize our behavior as an agent?
Reinforcement Learning : overview
At each time $t$, the agent receives an observation $x_t$ from the environment $E$.
Reinforcement Learning : overview
The agent takes an action $u_t \in \mathcal{U}$, and receives a scalar reward $r_t$.
Reinforcement Learning : overview
The agent chooses an action according to its current policy $\pi(u_t|x_t)$, which maps states to a probability distribution over actions.
Reinforcement Learning : overview
(MDP diagram: at each step the agent samples $u_t \sim \pi(u_t|x_t)$, receives $r(x_t, u_t)$, and the environment transitions according to $p(x_{t+1}|x_t, u_t)$.)

$R_t = \sum_{i=t}^{T} \gamma^{(i-t)} r(x_i, u_i)$ : cumulative sum of rewards over sequences ($\gamma \in [0,1]$ : discounting factor).

$Q^{\pi}(x_t, u_t) = \mathbb{E}[R_t | x_t, u_t]$ : state-action value function.

Objective of RL : find $\pi$ maximizing $\mathbb{E}[R_1]$!
Reinforcement Learning : overview
$Q^{\pi_{trinity}}(x_t, u_t) < Q^{\pi_{neo}}(x_t, u_t)$
Reinforcement Learning : overview
• From environment $E$:
  $x \in \mathcal{X}$ : state
  $u \in \mathcal{U}$ : action
• $\pi(u_t|x_t)$ : a policy defining the agent's behavior; it maps states to a probability distribution over actions.
• With $\mathcal{X}$, $\mathcal{U}$, and an initial state distribution $p(x_1)$, the agent experiences a transition to a new state sampled from the dynamics distribution $p(x_{t+1}|x_t, u_t)$.
• $R_t = \sum_{i=t}^{T} \gamma^{(i-t)} r(x_i, u_i)$ : the sum of future rewards with a discounting factor $\gamma \in [0,1]$.
• Objective of RL : learn a policy $\pi$ maximizing the expected return $\mathbb{E}[R_1]$ (see the sketch below).
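To make the return definition concrete, here is a minimal Python sketch (mine, not from the paper) of computing $R_t$ for every step of a finite rollout of rewards:

```python
import numpy as np

def all_returns(rewards, gamma=0.99):
    """R_t = r_t + gamma * R_{t+1}, computed in one backward pass."""
    R = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        R[t] = running
    return R

# Example: a three-step rollout of rewards.
print(all_returns([1.0, 0.0, 2.0], gamma=0.9))  # -> [2.62 1.8  2.  ]
```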
Reinforcement Learning : Model Free?
• Used when the system dynamics $p(x_{t+1}|x_t, u_t)$ are not known.
• We define the Q-function $Q^{\pi}(x_t, u_t)$, corresponding to a policy $\pi$, as the expected return from $x_t$ after taking $u_t$ and following $\pi$ thereafter.
• Q-learning learns a greedy deterministic policy, which corresponds to $\mu(x_t) = \arg\max_u Q(x_t, u)$.
• The learning objective is to minimize the Bellman error
  $L(\theta^Q) = \mathbb{E}\big[(Q(x_t, u_t|\theta^Q) - y_t)^2\big]$ with $y_t = r(x_t, u_t) + \gamma Q(x_{t+1}, \mu(x_{t+1}))$
  (see the tabular sketch below).
 $\beta$ : arbitrary exploration policy; $\rho^{\beta}$ : resulting state visitation frequency of the policy $\beta$.
 $\theta^Q$ : parameters of the Q-function.
 Assume that the target $y_t$ is fixed, i.e. $Q(x, \mu(x))$ inside $y_t$ is not differentiated through.
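As a concrete illustration, a minimal tabular Q-learning sketch (a simplification; the paper parameterizes $Q$ with a neural network) in which the greedy target $y_t$ is held fixed during the update:

```python
def q_learning_update(Q, x, u, r, x_next, n_actions, gamma=0.99, lr=0.1):
    """One step on the squared Bellman error (Q(x,u) - y_t)^2.

    Q is a dict mapping (state, action) -> value; y_t is treated as fixed.
    """
    # Greedy target: y_t = r + gamma * max_a Q(x', a).
    y = r + gamma * max(Q.get((x_next, a), 0.0) for a in range(n_actions))
    q = Q.get((x, u), 0.0)
    Q[(x, u)] = q + lr * (y - q)   # move Q(x,u) toward the fixed target
    return Q

# Usage: one transition (x=0, u=1, r=1.0, x'=2) with two discrete actions.
Q = q_learning_update({}, 0, 1, 1.0, 2, n_actions=2)
```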
Continuous Q-Learning with
Normalized Advantage Functions
How did the authors learn a parameterized Q-function with deep learning when the state-action domain is continuous?
They suggest using a neural network that separately outputs a value function term and an advantage term:
$Q(x, u|\theta^Q) = A(x, u|\theta^A) + V(x|\theta^V)$
$A(x, u|\theta^A) = -\frac{1}{2}\,(u - \mu(x|\theta^{\mu}))^T P(x|\theta^P)\,(u - \mu(x|\theta^{\mu}))$
where $V(x|\theta^V)$ is the value function and $A(x, u|\theta^A)$ is the advantage function of a given policy $\pi$.
 $P(x|\theta^P) = L(x|\theta^P)\, L(x|\theta^P)^T$ : state-dependent, positive-definite square matrix.
 $L(x|\theta^P)$ : lower-triangular matrix whose entries come from a linear output layer of a neural network.
Since the advantage term is a negative quadratic in $u$, the action that maximizes the Q-function is always given by $\mu(x|\theta^{\mu})$.
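A small NumPy sketch of this decomposition (function and variable names are mine; in the paper, $V$, $\mu$, and the entries of $L$ come from output heads of one shared network, with the diagonal of $L$ exponentiated for strict positive definiteness):

```python
import numpy as np

def naf_q_value(u, V, mu, L_entries):
    """Q(x,u) = V(x) - 0.5 (u - mu)^T P (u - mu), with P = L L^T."""
    dim = len(mu)
    L = np.zeros((dim, dim))
    L[np.tril_indices(dim)] = L_entries        # fill the lower triangle
    # Exponentiate the diagonal so P = L L^T is strictly positive definite.
    L[np.diag_indices(dim)] = np.exp(np.diag(L))
    P = L @ L.T
    d = u - mu
    advantage = -0.5 * d @ P @ d               # <= 0, and 0 exactly at u = mu
    return V + advantage                       # hence argmax_u Q(x,u) = mu

# Usage: 2-D action, so L has 3 free entries (lower triangle of a 2x2 matrix).
q = naf_q_value(u=np.array([0.3, -0.1]), V=1.0,
                mu=np.zeros(2), L_entries=np.array([0.1, 0.2, 0.3]))
```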
Continuous Q-Learning with
Normalized Advantage Functions
Trick : assume that we have a target network $Q'(x, u|\theta^{Q'})$, the SLOW-LEARNER, which is updated slowly toward the exploring network $Q(x, u|\theta^Q)$, the EXPLORER; transitions are stored in and sampled from a replay buffer $R$, the EXPERIENCE CONTAINER.
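A minimal sketch of these three pieces, assuming parameters are stored as plain arrays (names are illustrative, not the paper's code):

```python
import random
from collections import deque

buffer = deque(maxlen=100_000)                 # R, the experience container

def store(x, u, r, x_next):
    buffer.append((x, u, r, x_next))

def sample_minibatch(n=64):
    return random.sample(buffer, min(n, len(buffer)))

def soft_update(theta_target, theta, tau=0.001):
    """Slow-learner update: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1 - tau) * w_t for w_t, w in zip(theta_target, theta)]
```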
Accelerating Learning with Imagination Rollouts
 The sample complexity of model-free algorithms tends to be high when using high-dimensional function approximators.
 To reduce the sample complexity and accelerate the learning phase, how about using good exploratory behavior from trajectory optimization?
Accelerating Learning with Imagination Rollouts
 How about using good exploratory behavior from trajectory optimization?
(Diagram: a dynamics model $\mathcal{M}$ is fitted to real rollouts; an iLQG controller $\pi_t^{iLQG}$ planned under the model, mixed with the learned policy $\mu(x|\theta^{\mu})$, produces exploratory actions $u_t$ from states $x_t$; short "imagination" rollouts synthesized under the fitted model $f$ fill a fictional buffer $B_f$ alongside the real buffer $B_{old}$, and both feed the training of $Q(x, u|\theta^Q)$ against the target network $Q'(x, u|\theta^{Q'})$.)
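A rough sketch of the imagination-rollout idea under a fitted dynamics model $f$ (all names hypothetical):

```python
import numpy as np

def imagination_rollouts(f, reward_fn, policy, start_states, horizon=10, noise=0.1):
    """Generate synthetic transitions by rolling the fitted model f forward."""
    synthetic = []
    for x in start_states:                     # branch from real observed states
        for _ in range(horizon):
            u = policy(x) + noise * np.random.randn(len(policy(x)))
            x_next = f(x, u)                   # model prediction, not the real env
            synthetic.append((x, u, reward_fn(x, u), x_next))
            x = x_next
    return synthetic
```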
Experiment : Results
Editor's Notes
  1. Of these, the least novel are the value/advantage decomposition of Q(s,a) and the use of locally-adapted linear-Gaussian dynamics.
  2. But we don’t know the target…!
  3. But we don’t know the target…!
  4. But we don’t know the target…!