Human-level Control
Through Deep
Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, et al.
Project Group: Qingyuan Feng, Jian Jin, Saad
Mahboob, Rui Wang
Acknowledgements
› This presentation is partially adapted from the presentations by:
› Dong-Kyoung Kye, available at: vi.snu.ac.kr/xe
› Jiang Guo, available at: http://ir.hit.edu.cn/~jguo/docs/notes/dqn-atari.pdf
Outline
› Motivation & Reinforcement Learning
› Model Description
– Q-Learning
– Q-network
– Training Q-network
– Innovations of the model
› Project outline
Motivation
› Previously, game-playing agents were highly specific to a single game
› Goal: create an AI agent capable of playing a wide range of games, one step closer to Jarvis or R2D2
https://www.youtube.com/watch?v=cqXbjyWrdSo https://twitter.com/r2d2__starwars
Reinforcement Learning
› Want to teach the agent to play games
› Supervised learning: would require expert players to play 100,000 times
› Reinforcement learning is the choice
Reinforcement Learning
› Categories of ML:
› RL mechanism
Reinforcement Learning
› Markov Decision Process:
Model Description
Model components
› States: $s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t$, where $x_t$ is the pixel values at time $t$
› Value function: discounted future reward
› Policy: $\pi$, mapping from state to action
› Goal: maximize value function (discounted future reward)
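The "discounted future reward" that the value function measures can be made concrete with a small sketch. The function name `discounted_return` and the discount factor `gamma` are illustrative assumptions; the slides only introduce a discount symbol later, in the Bellman equation.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Return R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...

    `rewards` holds the rewards from time t onward; `gamma` is an
    assumed discount factor in [0, 1].
    """
    total = 0.0
    # Work backwards so each step applies R = r + gamma * R_next.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three rewards with a strong discount: 1.0 + 0.5 * 0.0 + 0.25 * 2.0
example = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

A smaller `gamma` makes the agent short-sighted; a value close to 1 weighs distant rewards almost as much as immediate ones.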
Q-Learning
› Q function: maximum expected discounted future reward
$Q(s, a) = \max_{\pi} \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$
› The Q function represents the "quality" of a certain action in a given state.
› Iterative calculation: Bellman equation
$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$
› In practice, exact value iteration is infeasible
– The estimate is specific to each sequence s and action a, so it can't generalize
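As a rough illustration of the Bellman update in table form, the sketch below keeps one Q entry per (state, action) pair, which is exactly why it cannot generalize to unseen states. The learning rate `alpha`, the `done` flag, and all names are assumptions added for the example; they do not appear on the slide.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, int], float]


def q_learning_update(
    q: QTable,
    s: Hashable,
    a: int,
    r: float,
    s_next: Hashable,
    actions: Sequence[int],
    alpha: float = 0.1,   # assumed learning rate (not on the slide)
    gamma: float = 0.99,  # assumed discount factor
    done: bool = False,
) -> float:
    """Move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a')."""
    best_next = 0.0 if done else max(q[(s_next, a2)] for a2 in actions)
    target = r + gamma * best_next
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q[(s, a)]


q: QTable = defaultdict(float)  # unseen (state, action) pairs start at 0
q_learning_update(q, s="s0", a=1, r=1.0, s_next="s1",
                  actions=[0, 1], alpha=0.5, gamma=0.9)
```

With raw pixel histories as states, the table would need one entry per distinct screen sequence, which motivates the function approximator on the next slide.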
Q-network
› Use a function approximator to estimate the action-value function
› Neural network with weights $\theta$ as the approximator, called a Q-network
› Input/Output: the state is the network input; the outputs are the Q values of Action 1, Action 2, and Action 3 (one output per action)
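The input/output layout above (state in, one Q value per action out) can be sketched with a tiny fully connected network. This is only a stand-in under stated assumptions: the class `TinyQNetwork`, its layer sizes, and the random initialization are hypothetical, and it does not reproduce the convolutional network used for Atari frames.

```python
import random
from typing import List


def linear(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Compute W x + b for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v: List[float]) -> List[float]:
    return [max(0.0, z) for z in v]


class TinyQNetwork:
    """One hidden layer; the output vector has one Q value per action."""

    def __init__(self, state_dim: int, hidden_dim: int, n_actions: int, seed: int = 0) -> None:
        rng = random.Random(seed)

        def init(rows: int, cols: int) -> List[List[float]]:
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.w1, self.b1 = init(hidden_dim, state_dim), [0.0] * hidden_dim
        self.w2, self.b2 = init(n_actions, hidden_dim), [0.0] * n_actions

    def q_values(self, state: List[float]) -> List[float]:
        hidden = relu(linear(state, self.w1, self.b1))
        return linear(hidden, self.w2, self.b2)


net = TinyQNetwork(state_dim=4, hidden_dim=8, n_actions=3)
q = net.q_values([0.1, -0.2, 0.3, 0.0])
# A single forward pass scores every action, so the greedy action is an argmax.
greedy_action = max(range(len(q)), key=lambda i: q[i])
```

Producing all action values in one pass is the design point of this layout: choosing the best action does not require a separate forward pass per action.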
Deep Q-Network
Training Q-Network
› Loss function: mean squared error (MSE)
› Derivatives w.r.t. the weights:
› Using mini-batch SGD
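The equations for this slide did not survive extraction. A plausible reconstruction, following the standard DQN formulation, is sketched below; $y_i$ denotes the Q-learning target, and $\theta_i^-$ denotes the parameters used to compute it (the target-network parameters that the slides introduce later), which are held fixed when differentiating.

```latex
% MSE loss between the Q-network output and the Q-learning target
L_i(\theta_i) = \mathbb{E}_{(s, a, r, s')}\left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right],
\qquad
y_i = r + \gamma \max_{a'} Q(s', a'; \theta_i^-)

% Gradient with respect to the weights, treating y_i as a constant
\nabla_{\theta_i} L_i(\theta_i) =
  -2\, \mathbb{E}_{(s, a, r, s')}\left[ \left( y_i - Q(s, a; \theta_i) \right)
  \nabla_{\theta_i} Q(s, a; \theta_i) \right]
```

In mini-batch SGD the expectation is replaced by an average over the sampled transitions, and the constant factor is usually absorbed into the learning rate.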
Innovations: Experience Replay
› Break temporal correlations
› Better utilize experience
› Choose action $a_t$ according to an $\varepsilon$-greedy policy
– Choose the best action with probability $1 - \varepsilon$, a random action with probability $\varepsilon$
› Store transition $(s_t, a_t, r_t, s_{t+1})$ in replay memory $D$
› Sample a mini-batch of transitions $(s, a, r, s')$ from $D$
› Minimize MSE between Q-network and Q-learning targets
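The replay steps above can be sketched as a bounded buffer plus an ε-greedy action picker. The names `ReplayMemory` and `epsilon_greedy`, the capacity, and the toy transitions are assumptions for illustration, not the authors' implementation.

```python
import random
from collections import deque
from typing import Deque, List, Sequence, Tuple

# (s, a, r, s_next); states are left untyped on purpose
Transition = Tuple[object, int, float, object]


class ReplayMemory:
    """Fixed-size memory D; the oldest transitions are dropped when full."""

    def __init__(self, capacity: int) -> None:
        self.buffer: Deque[Transition] = deque(maxlen=capacity)

    def store(self, s: object, a: int, r: float, s_next: object) -> None:
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        # Uniform sampling without replacement breaks temporal correlations.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self) -> int:
        return len(self.buffer)


def epsilon_greedy(q_values: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """Random action with probability epsilon, otherwise the best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


rng = random.Random(0)
memory = ReplayMemory(capacity=3)
for t in range(5):  # five toy transitions; only the last three are kept
    memory.store(t, 0, 1.0, t + 1)
batch = memory.sample(2, rng)
```

Because each stored transition can be drawn many times, the same experience contributes to several weight updates, which is the "better utilize experience" point on the slide.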
Innovations: separate target network
› A separate target network with the same structure
› Compute Q-learning targets using less frequently updated parameters $\theta_i^-$ instead of the parameters $\theta_i$ of the training network
› Optimize MSE between Q-network and Q-learning targets:
› Periodically update $\theta_i^-$ to the values of $\theta_i$
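A minimal sketch of the target-network idea, assuming a toy linear Q function so the example stays self-contained. The helpers `make_q`, `td_targets`, and `sync_target` and the parameter dictionary are hypothetical; in a real agent the sync would run every fixed number of training steps.

```python
import copy
from typing import Callable, Dict, List, Sequence, Tuple

Params = Dict[str, List[float]]
Transition = Tuple[float, int, float, float]  # (s, a, r, s_next) with scalar states


def make_q(params: Params) -> Callable[[float], List[float]]:
    """Toy Q function: Q(s, a) = w[a] * s, one weight per action."""
    return lambda s: [w * s for w in params["w"]]


def td_targets(batch: Sequence[Transition],
               q_target: Callable[[float], List[float]],
               gamma: float = 0.99) -> List[float]:
    """y = r + gamma * max_a' Q(s', a'; theta^-), using the target parameters."""
    return [r + gamma * max(q_target(s_next)) for (_s, _a, r, s_next) in batch]


def sync_target(online: Params) -> Params:
    """Periodic update: copy theta into theta^- (a deep copy, not a reference)."""
    return copy.deepcopy(online)


online: Params = {"w": [1.0, 2.0]}
target = sync_target(online)
online["w"][1] = 5.0  # pretend a gradient step changed the online network

batch = [(0.0, 0, 1.0, 1.0)]
y_stale = td_targets(batch, make_q(target), gamma=0.5)  # still uses old weights
target = sync_target(online)
y_fresh = td_targets(batch, make_q(target), gamma=0.5)  # uses synced weights
```

Between syncs the targets stay fixed while the online weights move, so the regression target does not shift with every gradient step.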
Complete workflow
Results
Deep Continuous Control for
Self Driving Car Simulation
Self Driving Car
How to Handle Continuous Control?
How to Handle Continuous Control?
Name | Range (units) | Description
ob.angle | [-π, +π] | Angle between the car direction and the direction of the track axis
ob.track | (0, 200) (meters) | Vector of 19 range finder sensors: each sensor returns the distance between the track edge and the car within a range of 200 meters
ob.trackPos | (-∞, +∞) | Distance between the car and the track axis. The value is normalized w.r.t. the track width: it is 0 when the car is on the axis; values greater than 1 or less than -1 mean the car is outside of the track.
ob.speedX | (-∞, +∞) (km/h) | Speed of the car along the longitudinal axis of the car (good velocity)
ob.speedY | (-∞, +∞) (km/h) | Speed of the car along the transverse axis of the car
ob.speedZ | (-∞, +∞) (km/h) | Speed of the car along the Z-axis of the car
ob.wheelSpinVel | (0, +∞) (rad/s) | Vector of 4 sensors representing the rotation speed of the wheels
ob.rpm | (0, +∞) (rpm) | Number of rotations per minute of the car engine
How to Handle Continuous Control?
› DQN is designed for discrete outputs
› The continuous output is high-dimensional
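One way to see why DQN's one-output-per-action design struggles here is to count the outputs needed if each continuous control were discretized. The control names (steering, acceleration, brake) and the bin counts below are assumptions for illustration; the slides do not list the action space.

```python
def num_discrete_actions(bins_per_dimension: int, action_dims: int) -> int:
    """Number of joint actions when every dimension is cut into equal bins."""
    return bins_per_dimension ** action_dims


# Hypothetical driving controls: steering, acceleration, brake, 10 bins each.
driving_outputs = num_discrete_actions(bins_per_dimension=10, action_dims=3)

# Even a coarse 3-bin grid grows quickly with the number of dimensions.
seven_dim_outputs = num_discrete_actions(bins_per_dimension=3, action_dims=7)
```

The count grows exponentially with the number of action dimensions, which is the motivation for actor-critic methods that output the continuous action directly.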
Deep Deterministic Policy Gradient
Critic Network
Actor Network
Correlated Metric
Dueling Network
Attention Example
How to Choose Attention
› The attention model would reduce the dimensionality of features obtained from images
› Convolutional local features may be enough for the decisions
› Supervised signals are correlated with the environment the agent exploits or explores
Project Plan
› Utilize a CNN to get temporal information
› Add a probabilistic mixture model layer for convolutional features
› Develop two network architectures to process the temporal and convolutional features
DEEP REINFORCEMENT LEARNING
› Thank you!
