Deep Learning for Real-Time Atari
Game Play Using Offline Monte-Carlo
Tree Search Planning
Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard L. Lewis, Xiaoshi Wang. NIPS 2014.
Yu Kai Huang
Outline
● Main idea
● Monte-Carlo Tree Search
○ Selection
○ Expansion
○ Simulation
○ Backpropagation
● Experiment
○ Three methods
○ Visualization
Main idea
Main Idea
“We achieve this by introducing new methods for combining RL and DL that use
slow, off-line Monte Carlo tree search planning methods to generate training
data for a deep-learned classifier capable of state-of-the-art real-time play.”
Deep Q-learning Network
Image from https://arxiv.org/pdf/1312.5602.pdf
Sampling training data
● Experience Replay
● ε-greedy action selection
○ Exploration & Exploitation
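As a rough illustration of the two sampling ingredients above, the sketch below shows ε-greedy action selection and a fixed-size replay buffer. The names `epsilon_greedy` and `ReplayBuffer`, and the tuple layout of a transition, are assumptions made for this example; they are not taken from the DQN paper's code.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling mixes transitions from different points in time.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

With `epsilon=0.0` the agent always exploits; raising `epsilon` trades some reward for exploration.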
Sampling training data
● Off-line Monte Carlo tree search planning method
○ UCT-agent
Monte-Carlo Tree Search
MCTS
● The true value of an action can be approximated by averaging the outcomes of many
random simulations.
● These value estimates are then used to adjust the policy (strategy) towards a
best-first strategy.
Image from https://www.zhihu.com/question/39916945
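The first bullet can be made concrete with a small Monte Carlo estimate. This is only a sketch: `rollout_return` is a hypothetical one-step simulator with made-up win probabilities, standing in for a full game simulation.

```python
import random


def rollout_return(action, rng):
    """Hypothetical simulator: action 1 pays off more often than action 0."""
    win_probability = {0: 0.2, 1: 0.8}[action]
    return 1.0 if rng.random() < win_probability else 0.0


def estimate_action_values(actions, num_simulations, rng):
    """Approximate each action's value by the mean return of random simulations."""
    estimates = {}
    for action in actions:
        total = sum(rollout_return(action, rng) for _ in range(num_simulations))
        estimates[action] = total / num_simulations
    return estimates


rng = random.Random(0)
values = estimate_action_values([0, 1], num_simulations=2000, rng=rng)
best_action = max(values, key=values.get)  # greedy choice on the estimates
```

More simulations tighten the estimates around the true win probabilities, which is what lets the search favor the better action.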
MCTS
● Iteratively builds a partial search tree
● Each iteration
○ Select the most urgent node
■ Tree policy
■ Balances exploration and exploitation
○ Expand and simulate
■ Add a child node
■ Play out the default policy from it
○ Update the node statistics along the path
Image from http://www.yisongyue.com/courses/cs159/lectures/MCTS.pdf
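The iteration described on this slide can be sketched end to end on a toy problem. Everything specific here is an assumption for illustration: the two-move game and its `REWARDS` table are invented, the exploration constant `c=1.4` is an arbitrary choice, and recommending the most-visited root action is one common convention rather than something the slides specify.

```python
import math
import random

# Hypothetical two-move game: the state is the tuple of actions taken so far.
REWARDS = {(0, 0): 0.0, (0, 1): 0.2, (1, 0): 1.0, (1, 1): 0.6}
ACTIONS = (0, 1)


def is_terminal(state):
    return len(state) == 2


class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}  # action -> Node
        self.visits = 0
        self.total_reward = 0.0

    def untried_actions(self):
        if is_terminal(self.state):
            return []
        return [a for a in ACTIONS if a not in self.children]


def ucb(child, parent_visits, c=1.4):
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def mcts(root_state, iterations, rng):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Tree policy: descend while the node is fully expanded.
        while not is_terminal(node.state) and not node.untried_actions():
            node = max(node.children.values(), key=lambda ch: ucb(ch, node.visits))
        # 2. Expansion: add one child for an untried action.
        untried = node.untried_actions()
        if untried:
            action = rng.choice(untried)
            child = Node(node.state + (action,), parent=node)
            node.children[action] = child
            node = child
        # 3. Simulation: the default policy plays uniformly random actions.
        state = node.state
        while not is_terminal(state):
            state = state + (rng.choice(ACTIONS),)
        reward = REWARDS[state]
        # 4. Update statistics on the path back to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    # Recommend the most-visited action at the root.
    return max(root.children, key=lambda a: root.children[a].visits)


best_action = mcts((), iterations=500, rng=random.Random(0))
```

In this toy game the first action `1` leads to the higher rewards, so the search concentrates its visits there.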
MCTS - UCT
● Upper Confidence bounds for Trees
Image from https://www.researchgate.net/publication/220978338_Monte-Carlo_Tree_Search_A_New_Framework_for_Game_AI
MCTS - UCT
Selection
● Start at the root node
● Select a child according to the tree policy (UCB)
● Apply recursively to descend through the tree
○ Stop when an expandable node is reached
○ Expandable node
■ A node that is non-terminal and has unexplored children
Image from http://www.yisongyue.com/courses/cs159/lectures/MCTS.pdf
MCTS - UCT
Selection
● The UCB score of a child has two parts
○ Exploitation: the average value observed for the child
○ Exploration: a bonus that grows for children visited less often
Image from http://www.yisongyue.com/courses/cs159/lectures/MCTS.pdf
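A minimal sketch of the selection rule, assuming the standard UCB1 form (mean value plus a visit-count bonus). The function names, the `(total_reward, visits)` layout and the default constant `c = sqrt(2)` are choices made for this example.

```python
import math


def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: mean value (exploitation) plus a visit-count bonus (exploration)."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    exploitation = total_reward / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration


def select_child(children, parent_visits):
    """children maps action -> (total_reward, visits); return the best action."""
    return max(children, key=lambda a: ucb1(*children[a], parent_visits))


# "left" has the higher mean, but "right" has been tried far less often.
children = {"left": (9.0, 10), "right": (1.0, 2)}
chosen = select_child(children, parent_visits=12)
```

Here the exploration bonus outweighs the gap in mean value, so the rarely visited child is selected; a smaller `c` would shift the balance back towards exploitation.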
MCTS - UCT
Expansion
● Add one or more child nodes to the tree
○ Depends on which actions are available in the current position
○ How the new children are chosen depends on the tree policy
Image from http://www.yisongyue.com/courses/cs159/lectures/MCTS.pdf
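Expansion can be sketched as follows, assuming one child is added per iteration and the untried action is picked uniformly at random. The `Node` class and the `step` and `actions_for` callbacks are hypothetical helpers introduced only for this example.

```python
import random


class Node:
    """Minimal tree node; `legal_actions` lists the actions available in `state`."""

    def __init__(self, state, legal_actions, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}  # action -> Node
        self.untried = list(legal_actions)  # actions not yet in the tree
        self.visits = 0
        self.total_reward = 0.0


def expand(node, step, actions_for, rng):
    """Add one child for a randomly chosen untried action and return it."""
    action = node.untried.pop(rng.randrange(len(node.untried)))
    next_state = step(node.state, action)
    child = Node(next_state, actions_for(next_state), parent=node)
    node.children[action] = child
    return child
```

A node stops being expandable once `untried` is empty, at which point selection descends into its children instead.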
MCTS - UCT
Simulation
● Runs a simulation starting from the node reached by the selected path
● The default policy determines how the simulation is played
● The outcome of the simulation determines the node's value
Image from http://www.yisongyue.com/courses/cs159/lectures/MCTS.pdf
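A sketch of the simulation step, assuming a uniformly random default policy and a depth cap. The callbacks `step`, `is_terminal` and `legal_actions`, and the `max_depth` default, are hypothetical and stand in for a real game interface.

```python
import random


def rollout(state, step, is_terminal, legal_actions, rng, max_depth=100):
    """Play uniformly random actions from `state` and return the summed reward."""
    total_reward = 0.0
    for _ in range(max_depth):
        if is_terminal(state):
            break
        action = rng.choice(legal_actions(state))
        state, reward = step(state, action)  # step returns (next_state, reward)
        total_reward += reward
    return total_reward
```

The depth cap keeps each simulation bounded when a game could otherwise run for a very long time.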
MCTS - UCT
Backpropagation
● Moves backward through the saved path
● Value of a node
○ Represents the benefit of going down that path from its parent
● Values are updated according to the outcome of the simulation
○ Every node on the path is updated based on how the simulated game ends
Image from http://www.yisongyue.com/courses/cs159/lectures/MCTS.pdf
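The update can be sketched with parent links, assuming each node stores a visit count and a running reward total so that its value is their ratio. The `Node` class here is a hypothetical minimal version for illustration.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.visits = 0
        self.total_reward = 0.0

    @property
    def value(self):
        """Mean simulated return of paths through this node."""
        return self.total_reward / self.visits if self.visits else 0.0


def backpropagate(node, reward):
    """Walk parent links from the simulated node to the root, updating statistics."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent
```

Because every ancestor is updated, a node's value averages the outcomes of all simulations that passed through it.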
MCTS - UCT
Image from https://zhuanlan.zhihu.com/p/30458774
Experiment
Three Methods
● UCTtoRegression
○ The CNN is trained by regression on the action values computed by UCT.
● UCTtoClassification
○ The CNN is trained as a classifier, with the action chosen by UCT as the label.
● UCTtoClassification-Interleaved
○ The CNN is first trained on UCT data via classification.
○ The trained CNN then decides action choices while collecting further runs.
○ The CNN is then fine-tuned on the newly collected data.
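The difference between the regression and classification variants comes down to what is used as the training target. The sketch below assumes UCT yields one value estimate per action for each frame; the helper names, the squared-error and softmax cross-entropy losses, and the numbers are all illustrative assumptions, not the paper's exact training setup.

```python
import math


def regression_target(uct_action_values):
    """Regression-style target: the full vector of action values from UCT."""
    return list(uct_action_values)


def classification_label(uct_action_values):
    """Classification-style label: index of UCT's highest-valued action."""
    return max(range(len(uct_action_values)), key=lambda a: uct_action_values[a])


def squared_error(predicted_values, target_values):
    return sum((p - t) ** 2 for p, t in zip(predicted_values, target_values))


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example, computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


uct_values = [1.5, 4.0, 2.5]  # made-up UCT estimates for three actions
network_output = [1.0, 3.0, 2.0]  # made-up CNN outputs for the same frame

regression_loss = squared_error(network_output, regression_target(uct_values))
classification_loss = cross_entropy(network_output, classification_label(uct_values))
```

The classifier only has to rank UCT's chosen action first, whereas the regressor has to match every action value.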
CNN Architecture
Experimental Results
Visualization of the first-layer features
Reference
[1] Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning,
https://papers.nips.cc/paper/5421-deep-learning-for-real-time-atari-game-play-using-offline-monte-carlo-tree-search-planning
[2] Monte Carlo Tree Search and AlphaGo, Suraj Nair, Peter Kundzicz, Kevin An, Vansh Kumar,
http://www.yisongyue.com/courses/cs159/lectures/MCTS.pdf
[3] tobe: How to Learn Monte Carlo Tree Search (MCTS) (in Chinese), https://zhuanlan.zhihu.com/p/30458774
