Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning
Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard L. Lewis, Xiaoshi Wang. NIPS 2014.
Yu Kai Huang
Outline
● Main idea
● Monte-Carlo Tree Search
○ Selection
○ Expansion
○ Simulation
○ Backpropagation
● Experiment
○ Three methods
○ Visualization
Main idea
Main Idea
“We achieve this by introducing new methods for combining RL and DL that use
slow, off-line Monte Carlo tree search planning methods to generate training
data for a deep-learned classifier capable of state-of-the-art real-time play.”
Deep Q-Network (DQN)
Image from https://arxiv.org/pdf/1312.5602.pdf
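As a concrete reference, the cited paper's network takes four stacked 84×84 grayscale frames and applies two convolutional layers followed by a fully connected layer. A minimal sketch in PyTorch, with layer sizes taken from the 2013 DQN paper (the class name and code organization here are illustrative, not the authors'):

```python
import torch.nn as nn

class DQN(nn.Module):
    """Q-network from Mnih et al. 2013: 4 stacked 84x84 grayscale frames in,
    one Q-value per action out."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 16x20x20 -> 32x9x9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # fully connected hidden layer
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # linear output: Q-values
        )

    def forward(self, x):
        return self.net(x)
```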
Sampling training data
● Experience Replay
● ϵ-greedy action selection
○ Exploration & Exploitation
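DQN generates its own training data: every transition is stored in a replay buffer, and actions are chosen ϵ-greedily so the agent keeps exploring. A minimal sketch (buffer size and ϵ are placeholder values, not the paper's settings):

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)  # experience replay memory

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon take a random action (exploration);
    otherwise take the greedy action (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def store(transition):
    """Save a (state, action, reward, next_state, done) tuple."""
    replay_buffer.append(transition)

def sample_batch(batch_size=32):
    """Uniformly sample past transitions, decorrelating consecutive updates."""
    return random.sample(replay_buffer, batch_size)
```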
Sampling training data
● Off-line Monte Carlo tree search planning method
○ UCT-agent
Monte-Carlo Tree Search
MCTS
● The true value of an action can be approximated by running many random
simulations.
● These values can then be used to efficiently adjust the policy (strategy) toward a
best-first strategy.
Image from https://www.zhihu.com/question/39916945
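In symbols, the Monte-Carlo estimate of an action value is just the average return over N random simulations, which approaches the true value as N grows (G_i below denotes the return of the i-th simulation):

```latex
\hat{Q}(s, a) \;=\; \frac{1}{N} \sum_{i=1}^{N} G_i
\;\xrightarrow{\;N \to \infty\;}\; Q(s, a)
```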
MCTS
● Iteratively builds a partial search tree
● Each iteration (sketched in code below):
○ Find the most urgent node
■ Tree policy
■ Balances exploration/exploitation
○ Simulation
■ Add a child node
■ Default policy
○ Update node statistics
Image from http://www.yisongyue.com/courses/cs159/lectures/MCTS.pdf
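The whole iteration fits in a short loop. Below is an illustrative Python sketch (the `Node` class and the environment interface such as `state.legal_actions()` are assumptions, not the paper's code); the four steps are filled in on the next slides:

```python
class Node:
    """One state in the partial search tree."""
    def __init__(self, state, parent=None, action=None):
        self.state = state        # game state at this node
        self.parent = parent      # parent Node (None at the root)
        self.action = action      # action that led here from the parent
        self.children = []        # expanded child nodes
        self.untried = list(state.legal_actions())  # assumed env method
        self.visits = 0           # N(v): visit count
        self.value = 0.0          # Q(v): accumulated simulation reward

def mcts(root, n_iterations):
    for _ in range(n_iterations):
        leaf = select(root)           # 1. Selection: tree policy (UCB)
        child = expand(leaf)          # 2. Expansion: add one child node
        reward = simulate(child)      # 3. Simulation: default policy rollout
        backpropagate(child, reward)  # 4. Backpropagation: update statistics
    # Play the action of the most-visited child of the root.
    return max(root.children, key=lambda c: c.visits).action
```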
MCTS - UCT
● Upper Confidence Bounds applied to Trees
Image from https://www.researchgate.net/publication/220978338_Monte-Carlo_Tree_Search_A_New_Framework_for_Game_AI
MCTS - UCT
Selection
● Start at the root node
● Select a child according to the Tree Policy: UCB
● Apply recursively, descending through the tree
○ Stop when an expandable node is reached
○ Expandable
■ A node that is non-terminal and has unexplored children
Image from http://www.yisongyue.com/courses/cs159/lectures/MCTS.pdf
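The standard tree policy is UCB1: pick the child maximizing an upper confidence bound whose first term favors exploitation (high mean reward) and whose second term favors exploration (rarely visited children):

```latex
a^{*} = \operatorname*{arg\,max}_{j}\;
\underbrace{\bar{X}_{j}}_{\text{exploitation}}
+ \underbrace{c \sqrt{\frac{2 \ln n}{n_{j}}}}_{\text{exploration}}
```

Here \bar{X}_j is the mean reward of child j, n_j its visit count, n the parent's visit count, and c an exploration constant. Continuing the Python sketch:

```python
import math

def ucb1(parent, child, c=1.41):
    """UCB1 score: mean value plus an exploration bonus."""
    return (child.value / child.visits
            + c * math.sqrt(2 * math.log(parent.visits) / child.visits))

def select(node):
    """Descend via UCB1 until an expandable (or terminal) node is reached."""
    while not node.untried and node.children:  # fully expanded, non-terminal
        node = max(node.children, key=lambda ch: ucb1(node, ch))
    return node
```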
MCTS - UCT
Expansion
● Add one or more child nodes to the tree
○ Which nodes can be added depends on the actions available in the current position
○ How this is done depends on the Tree Policy
Image from http://www.yisongyue.com/courses/cs159/lectures/MCTS.pdf
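Continuing the sketch, expansion takes one untried action at the selected node and attaches the resulting state as a new child (`state.step` is an assumed environment method):

```python
import random

def expand(node):
    """Add one new child by trying a previously unexplored action."""
    if not node.untried:                 # terminal node: nothing to expand
        return node
    action = node.untried.pop(random.randrange(len(node.untried)))
    child = Node(node.state.step(action), parent=node, action=action)
    node.children.append(child)
    return child
```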
MCTS - UCT
Simulation
● Run a simulation from the newly added node
● The Default Policy determines how the simulation is run
● The outcome determines the value
Image from http://www.yisongyue.com/courses/cs159/lectures/MCTS.pdf
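In plain UCT the default policy is uniformly random: play random actions from the new node until the game ends (or a depth limit is hit) and return the outcome (`is_terminal` and `outcome` are assumed environment methods):

```python
import random

def simulate(node, max_depth=1000):
    """Roll out with the default (random) policy; return the outcome."""
    state, depth = node.state, 0
    while not state.is_terminal() and depth < max_depth:
        state = state.step(random.choice(state.legal_actions()))
        depth += 1
    return state.outcome()   # e.g., final score, or +1/-1 for win/loss
```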
MCTS - UCT
Backpropagation
● Move backward through the saved path
● Value of a node
○ Represents the benefit of going down that path from its parent
● Values are updated according to how the simulated game ends
Image from http://www.yisongyue.com/courses/cs159/lectures/MCTS.pdf
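Backpropagation simply walks the saved path back to the root, incrementing visit counts and accumulating the simulation outcome:

```python
def backpropagate(node, reward):
    """Update statistics along the path from the new node to the root."""
    while node is not None:
        node.visits += 1       # N(v) += 1
        node.value += reward   # Q(v) accumulates simulation outcomes
        node = node.parent
```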
MCTS - UCT
Image from https://zhuanlan.zhihu.com/p/30458774
Experiment
Three Methods
● UCTtoRegression
○ The UCT training data is used to train the CNN via regression.
● UCTtoClassification
○ The UCT training data is used to train the CNN via classification.
● UCTtoClassification-Interleaved
○ The UCT training data is used to train the CNN via classification.
○ The trained CNN is then used to choose actions while collecting further UCT runs.
○ The CNN is then fine-tuned on this new data.
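Put together, UCTtoClassification amounts to supervised learning on UCT-labeled frames. A schematic sketch reusing the `mcts` function above (names such as `state.frames()` and `cnn.fit` are illustrative; the paper's data collection and preprocessing details differ):

```python
def uct_to_classification(env, cnn, n_episodes, n_uct_iterations):
    """Train a CNN classifier to imitate the slow, offline UCT agent."""
    dataset = []
    for _ in range(n_episodes):
        state = env.reset()
        while not state.is_terminal():
            # Slow, offline step: run full UCT planning from this state.
            best_action = mcts(Node(state), n_uct_iterations)
            dataset.append((state.frames(), best_action))  # (input, label)
            state = state.step(best_action)
    cnn.fit(dataset)   # supervised step: cross-entropy on UCT's choices
    return cnn         # fast enough for real-time play at test time
```

The interleaved variant repeats this loop, using the partially trained CNN to choose the states visited during the next round of UCT data collection before fine-tuning.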
CNN Architecture
Experimental Results
Visualization of the first-layer features
Reference
[1] Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning, https://papers.nips.cc/paper/5421-deep-learning-for-real-time-atari-game-play-using-offline-monte-carlo-tree-search-planning
[2] Monte Carlo Tree Search and AlphaGo, Suraj Nair, Peter Kundzicz, Kevin An, Vansh Kumar,
http://www.yisongyue.com/courses/cs159/lectures/MCTS.pdf
[3] tobe: How to Learn Monte Carlo Tree Search (MCTS), https://zhuanlan.zhihu.com/p/30458774
