Continuous Control with
Deep Reinforcement Learning
ICLR 2016
Timothy P. Lillicrap, et al. (Google DeepMind)
Presenter: Hyemin Ahn
Introduction
 Another work combining Deep Learning + Reinforcement Learning from Google DeepMind!
 Extends their Deep Q-Network (DQN), which handles only discrete action spaces, to continuous action spaces.
Results : Preview
Reinforcement Learning : overview
Agent
How can we formalize our behavior?
Reinforcement Learning : overview
At each time step $t$, the agent receives an observation $x_t$ from the environment $E$.
Wow, so scare. Such gun. So many bullets. Nice suit btw.
Reinforcement Learning : overview
The agent takes an action $a_t \in \mathcal{A} \subseteq \mathbb{R}^N$, and receives a scalar reward $r_t$.
Reinforcement Learning : overview
For selecting the action, there is a policy $\pi: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})$, which maps states to a probability distribution over actions. (Diagram: $\pi(s_t)$ assigns probabilities to the candidate actions $a_1$ and $a_2$.)
Reinforcement Learning : overview
This interaction forms a Markov Decision Process (MDP): from state $s_1$ the policy $\pi$ selects action $a_1$, the agent receives reward $r(s_1, a_1)$, the environment transitions to $s_2 \sim p(s_2 \mid s_1, a_1)$, and so on.

$R_t = \sum_{i=t}^{T} \gamma^{\,i-t}\, r(s_i, a_i)$ : cumulative sum of discounted rewards over the sequence ($\gamma \in [0, 1]$ : discounting factor).

$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}[R_t \mid s_t, a_t]$ : state-action value function.

Objective of RL : find $\pi$ maximizing $\mathbb{E}_{\pi}(R_1)$!
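To make the return concrete, here is a minimal sketch in plain Python of the discounted sum $R_t$ defined above (the reward values are made up purely for illustration):

```python
# Discounted return R_t = sum_{i=t}^{T} gamma^(i-t) * r(s_i, a_i)
def discounted_return(rewards, gamma=0.99, t=0):
    """Sum the rewards from step t onward, discounted by gamma per step."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

rewards = [0.0, 0.0, 1.0, 0.0, 5.0]           # illustrative r(s_i, a_i) values
print(discounted_return(rewards, gamma=0.9))  # 0.9**2 * 1.0 + 0.9**4 * 5.0
```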
Reinforcement Learning : overview
$Q^{\pi_{\text{trinity}}}(s_t, a_t) < Q^{\pi_{\text{neo}}}(s_t, a_t)$ : under a better policy, the same state-action pair has a higher expected return.
Reinforcement Learning : overview
• From environment $E$:
  $x_t$ : observation
  $s_t \in \mathcal{S}$ : state.
• If $E$ is fully observed, $s_t = x_t$.
• $a_t \in \mathcal{A}$ : agent's action.
• $\pi: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})$ : a policy defining the agent's behavior; it maps states to a probability distribution over the actions.
• With $\mathcal{S}$, $\mathcal{A}$, an initial state distribution $p(s_1)$, transition dynamics $p(s_{t+1} \mid s_t, a_t)$, and reward function $r(s_t, a_t)$, the agent's behavior can be modeled as a Markov Decision Process (MDP).
• $R_t = \sum_{i=t}^{T} \gamma^{\,i-t}\, r(s_i, a_i)$ : the sum of discounted future rewards, with discounting factor $\gamma \in [0, 1]$.
• Objective of RL : learn a policy $\pi$ maximizing $\mathbb{E}_{\pi}(R_1)$. → For this, the state-action value function $Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}[R_t \mid s_t, a_t]$ is used.
Reinforcement Learning : Q-Learning

Instead of a stochastic policy $\pi: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})$, Q-learning finds the greedy deterministic policy $\mu: \mathcal{S} \rightarrow \mathcal{A}$, $\mu(s) = \arg\max_a Q(s, a)$.

The Bellman equation refers to the recursive relationship
$Q^{\pi}(s_t, a_t) = \mathbb{E}\big[r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}[\,Q^{\pi}(s_{t+1}, a_{t+1})\,]\big]$.

The inner expectation over the stochastic policy $\pi: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})$ makes this hard to compute, so let us think about a deterministic policy instead of the stochastic one:
$Q^{\mu}(s_t, a_t) = \mathbb{E}\big[r(s_t, a_t) + \gamma\, Q^{\mu}(s_{t+1}, \mu(s_{t+1}))\big]$.

Can we do this in a continuous action space?
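To see why the greedy policy is easy only for discrete actions, here is a minimal sketch (plain Python, with a hypothetical tabular Q stored as a dict): the $\arg\max$ is a trivial enumeration over a small action set, which is exactly the step that breaks down in a continuous action space.

```python
GAMMA = 0.99
ACTIONS = [0, 1, 2]                 # small, enumerable action set

def greedy_action(Q, s):
    # mu(s) = argmax_a Q(s, a): trivial when actions can be enumerated.
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def bellman_target(Q, r, s_next):
    # y = r + gamma * Q(s', mu(s')).  With continuous actions the argmax
    # becomes an optimization problem of its own, motivating a learned
    # deterministic actor mu(s | theta_mu) as in DDPG.
    return r + GAMMA * Q[(s_next, greedy_action(Q, s_next))]
```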
Reinforcement Learning : continuous space?
What DQN (Deep Q-Network) did was learn the Q-function with a neural network: with a function approximator parameterized by $\theta^Q$, model a network $Q(s, a \mid \theta^Q)$, and find the $\theta^Q$ minimizing the loss function
$L(\theta^Q) = \mathbb{E}\big[(Q(s_t, a_t \mid \theta^Q) - y_t)^2\big]$, where $y_t = r(s_t, a_t) + \gamma\, Q(s_{t+1}, \mu(s_{t+1}) \mid \theta^Q)$.

Problem 1: How can we know this real target value $y_t$?
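A minimal PyTorch-style sketch of this loss, assuming a hypothetical `critic` module that takes batched `(s, a)` tensors and a given target value `y` (Problem 1 is precisely that we do not yet know how to produce `y`):

```python
import torch.nn.functional as F

def critic_loss(critic, s, a, y):
    """L(theta_Q) = E[(Q(s, a | theta_Q) - y)^2] over a minibatch."""
    q = critic(s, a)          # Q(s, a | theta_Q), shape (batch, 1)
    return F.mse_loss(q, y)   # mean squared error against the target y
```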
Reinforcement Learning : continuous space?

How can we find the greedy action $\arg\max_a Q(s, a \mid \theta^Q)$ in a continuous action space?

Also model the policy with a network $\mu(s \mid \theta^\mu)$ with parameters $\theta^\mu$, and find a $\theta^\mu$ which maximizes $J(\theta^\mu) = \mathbb{E}\big[Q(s, \mu(s \mid \theta^\mu) \mid \theta^Q)\big]$.

Anyway, if we assume that we know $Q(s, a \mid \theta^Q)$, the gradient of the policy's performance can be defined as
$\nabla_{\theta^\mu} J \approx \mathbb{E}\big[\nabla_a Q(s, a \mid \theta^Q)\big|_{a = \mu(s)}\, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big]$
(Silver, David, et al. "Deterministic Policy Gradient Algorithms." ICML 2014).

Problem 2: How can we successfully explore this action space?
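In practice this gradient comes out of automatic differentiation. A sketch (PyTorch-style; `actor` and `critic` are hypothetical modules): maximizing $\mathbb{E}[Q(s, \mu(s))]$ is the same as minimizing its negative, and backpropagating through the critic into the actor yields exactly the chain-rule product above.

```python
def actor_loss(actor, critic, s):
    """Deterministic policy gradient via autograd: the gradient of
    -Q(s, mu(s | theta_mu)) w.r.t. theta_mu is
    -dQ/da |_{a = mu(s)} * dmu/dtheta_mu."""
    a = actor(s)                  # a = mu(s | theta_mu)
    return -critic(s, a).mean()   # ascend on Q  <=>  descend on -Q
```

Calling `.backward()` on this loss and stepping an optimizer over only the actor's parameters performs the policy update.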
Reinforcement Learning : continuous space?

Objective : learn $Q(s, a \mid \theta^Q)$ and $\mu(s \mid \theta^\mu)$ in a continuous action space!

Problem of $\mu(s \mid \theta^\mu)$ : how can we successfully explore this action space?
Problem of $Q(s, a \mid \theta^Q)$ : how can we know the real target value?

Both are neural networks, so the authors suggest using additional 'target networks' $\mu'(s \mid \theta^{\mu'})$ and $Q'(s, a \mid \theta^{Q'})$.
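A minimal sketch of what the target networks are in code (assuming `actor` and `critic` are already-built PyTorch modules with hypothetical names): they start as exact copies of the learned networks and, as the following slides show, are only allowed to change slowly.

```python
import copy

# Target networks mu'(s | theta_mu') and Q'(s, a | theta_Q')
target_actor = copy.deepcopy(actor)
target_critic = copy.deepcopy(critic)
for p in list(target_actor.parameters()) + list(target_critic.parameters()):
    p.requires_grad_(False)   # targets are never trained by backprop
```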
DDPG (Deep DPG) Algorithm

Our objective: learn the critic $Q(s, a \mid \theta^Q)$ and the actor $\mu(s \mid \theta^\mu)$.

Replay buffer: a finite-sized cache of transitions $(s_t, a_t, r_t, s_{t+1})$; the networks are updated on minibatches sampled from this cache.
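A minimal replay-buffer sketch in plain Python (the capacity and batch size are arbitrary illustrative values, not the paper's settings):

```python
import random
from collections import deque

class ReplayBuffer:
    """Finite-sized cache of transitions (s_t, a_t, r_t, s_{t+1})."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions drop out

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=64):
        # Uniform minibatch sampling breaks the temporal correlation
        # between consecutive transitions, as in DQN.
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next = zip(*batch)
        return s, a, r, s_next
```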
Target value for the critic: the explored reward plus the discounted sum of future rewards from the target policy (and target Q) network,
$y_t = r_t + \gamma\, Q'\big(s_{t+1}, \mu'(s_{t+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)$.

The target networks themselves are updated with soft updates, $\theta' \leftarrow \tau\, \theta + (1 - \tau)\, \theta'$ with $\tau \ll 1$, for avoiding divergence.
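A sketch of this soft target update (PyTorch-style; `net` and `target_net` are hypothetical module names):

```python
import torch

def soft_update(target_net, net, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta',  with tau << 1."""
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
```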
Assuming these target networks give the right, "real" target values, the learned networks are trained against them, and exploration is handled by perturbing the actor's output: $a_t = \mu(s_t \mid \theta^\mu) + \mathcal{N}_t$.
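Putting the pieces together, here is a condensed sketch of one DDPG update on an already-sampled, batched minibatch, plus the noisy action selection $a_t = \mu(s_t \mid \theta^\mu) + \mathcal{N}_t$ (Gaussian noise is used here for brevity; the paper uses an Ornstein-Uhlenbeck process). It reuses the hypothetical `soft_update` helper sketched above and omits details such as terminal-state masking.

```python
import torch
import torch.nn.functional as F

def explore_action(actor, s, noise_std=0.1):
    """a_t = mu(s_t | theta_mu) + N_t (illustrative Gaussian noise)."""
    with torch.no_grad():
        a = actor(s)
        return a + noise_std * torch.randn_like(a)

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, s, a, r, s_next,
                gamma=0.99, tau=0.001):
    # Critic: regress Q(s, a) onto the target-network value y.
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_opt.zero_grad()
    F.mse_loss(critic(s, a), y).backward()
    critic_opt.step()

    # Actor: deterministic policy gradient, i.e. maximize Q(s, mu(s)).
    actor_opt.zero_grad()
    (-critic(s, actor(s)).mean()).backward()
    actor_opt.step()

    # Target networks: slowly track the learned networks.
    soft_update(target_critic, critic, tau)
    soft_update(target_actor, actor, tau)
```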
Experiment : Results