Temporal-difference search in computer Go
David Silver · Richard S. Sutton · Martin Müller
March 3, 2014
1 / 24
Three Major Sections
Section 3: Shape features (an almost exact duplicate of their previous paper
from 2004)
Section 2: TD search (simulation-based search methods for game play)
Section 4: Dyna-2 algorithm (introduces the concept of long- and short-term
memory in simulated search)
2 / 24
AI Game Playing
In AI, the most successful game-playing programs have followed these steps:
1st: Positions are evaluated by a linear combination of many features; the
position is broken down into small local components.
2nd: Weights are trained by TD learning and self-play.
Finally: The linear evaluation function is combined with a suitable search
algorithm to produce a high-performance player.
3 / 24
Sec 3.0: TD Learning with Local Shape Features
Approach: shapes are an important part of Go; master-level players think in
terms of general shapes.
The objective is to win the game of Go. Rewards: r = 1 if Black wins and
r = 0 if White wins.
The state s assigns one of three values to each intersection of the board:
empty, white, or black.
Local shape features l_i enumerate all possible configurations of every
square region of the board up to 3x3.
φ_i(s) = 1 if the board matches local shape l_i, and 0 otherwise.
The value function V^π(s) is the expected total reward from state s when
following policy π, i.e. the probability of winning.
The value function is approximated by a logistic-linear combination of shape
features. This is model-free learning: the value function is learned
directly, without a model of the game.
V(s) = σ(φ(s) · θ^LI + φ(s) · θ^LD), combining location-independent (LI) and
location-dependent (LD) weights, gives Black's winning probability from
state s (see the sketch after this slide).
Two-ply update: the TD(0) error is computed after both player and opponent
have moved, i.e. between V(s_t) and V(s_{t+2}).
Self-play: agent and opponent use the same policy π(s, a).
4 / 24
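The following minimal Python sketch (not the authors' RLGO implementation) shows one way the binary local shape features and the logistic-linear value function could be coded. The board encoding, the hashing of shapes into a fixed-size table, and names such as `shape_features` are illustrative assumptions, and a single weight vector `theta` stands in for the combined location-independent and location-dependent weights.

```python
import numpy as np

EMPTY, BLACK, WHITE = 0, 1, 2

def shape_features(board, max_shape=3, n_buckets=2**16):
    """Binary local shape features: one indicator per (region, configuration).

    Illustrative assumption: enumerate every square region up to
    max_shape x max_shape and hash the stone configuration inside it into a
    fixed-size table (the paper enumerates all shapes from 1x1 up to 3x3).
    """
    size = board.shape[0]
    phi = np.zeros(n_buckets)
    for k in range(1, max_shape + 1):
        for r in range(size - k + 1):
            for c in range(size - k + 1):
                patch = tuple(board[r:r + k, c:c + k].ravel())
                phi[hash((k, r, c, patch)) % n_buckets] = 1.0
    return phi

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def value(board, theta):
    """V(s) = sigma(phi(s) . theta): Black's estimated probability of winning."""
    return sigmoid(shape_features(board) @ theta)
```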
5 / 24
6 / 24
Sec 3.2.1: Training Procedure
Objective: train the weights of the value function V(s), updating them with
logistic TD(0) (see the sketch after this slide).
Initialize all weights to zero and run a million games of self-play to learn
the weights of the value function V(s).
Black and White select moves using an ε-greedy policy over the same value
function.
Self-play: agent and opponent use the same ε-greedy policy π(s, a).
Each game terminates when both players pass.
7 / 24
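A hedged sketch of the training procedure on this slide, reusing `shape_features`, `value`, and `BLACK` from the previous sketch. The environment interface (`reset`, `legal_moves`, `preview`, `play`, `board`, `to_play`, `is_over`, `black_won`) is a hypothetical stand-in for a Go engine, and batching the weight update at the end of each game is a simplification of the paper's online logistic TD(0).

```python
import numpy as np

def epsilon_greedy_move(env, theta, epsilon=0.1):
    """Both colours share one V(s): Black maximises it, White minimises it."""
    moves = env.legal_moves()
    if np.random.rand() < epsilon:
        return moves[np.random.randint(len(moves))]
    scores = [value(env.preview(m), theta) for m in moves]  # V(s') after move m
    best = np.argmax(scores) if env.to_play() == BLACK else np.argmin(scores)
    return moves[int(best)]

def train_self_play(make_env, n_features, n_games=1_000_000,
                    alpha=0.1, epsilon=0.1):
    """Self-play training with two-ply logistic TD(0) (weights start at zero).

    n_features should match the feature-table size used by shape_features.
    """
    theta = np.zeros(n_features)
    for _ in range(n_games):
        env = make_env()
        env.reset()
        history = []                                # positions after each move
        while not env.is_over():                    # game ends when both pass
            env.play(epsilon_greedy_move(env, theta, epsilon))
            history.append(env.board())             # assumed to return a copy
        r = 1.0 if env.black_won() else 0.0
        for t, s_t in enumerate(history):
            phi = shape_features(s_t)
            target = value(history[t + 2], theta) if t + 2 < len(history) else r
            delta = target - value(s_t, theta)      # two-ply TD(0) error
            theta += alpha * delta * phi / max(phi.sum(), 1.0)  # normalised step
    return theta
```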
Results with different sets of local shape features
8 / 24
Results with different sets of local shape features cont’d
9 / 24
Alpha Beta Search
AB search: during game play, a technique used to search the potential moves
from state s to its successor states s' (note: we are now using the learnt
value function V(s)); a sketch follows this slide.
For example, at depth 2, if it is White's move we consider all of White's
moves and Black's responses to each of those moves. We maintain alpha and
beta as the lower and upper bounds on the value of the move.
10 / 24
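A minimal, generic alpha-beta sketch (an illustration, not the authors' search code) in which leaf positions are scored with the learned V(s). `pos.legal_moves()` and `pos.play(move)` are an assumed interface; since V(s) estimates Black's winning probability, Black is the maximising player.

```python
def alphabeta(pos, depth, alpha, beta, maximizing, value_fn):
    """Depth-limited alpha-beta search using the learned value function V(s).

    pos is assumed to expose legal_moves() and play(move) -> new position;
    value_fn(pos) returns Black's estimated winning probability.
    """
    moves = pos.legal_moves()
    if depth == 0 or not moves:
        return value_fn(pos)
    if maximizing:                                   # Black to move
        best = float("-inf")
        for m in moves:
            best = max(best, alphabeta(pos.play(m), depth - 1,
                                       alpha, beta, False, value_fn))
            alpha = max(alpha, best)
            if alpha >= beta:                        # beta cutoff
                break
        return best
    best = float("inf")                              # White to move
    for m in moves:
        best = min(best, alphabeta(pos.play(m), depth - 1,
                                   alpha, beta, True, value_fn))
        beta = min(beta, best)
        if alpha >= beta:                            # alpha cutoff
            break
    return best
```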
AB Search Results
11 / 24
Section 4.0: Temporal-difference search
Idea: from the current state s_t there is always a subgame G' of the original
game G. Apply TD learning to G', using episodes of self-play that start from
the current state s_t.
Simulation-based search: the agent samples episodes of simulated experience
and updates its value function from that simulated experience.
Begin in state s_0. At each step u of the simulation, an action a_u is
selected according to a simulation policy, and a new state s_{u+1} and reward
r_{u+1} are generated by the MDP model. Repeat until a terminal state is
reached (see the sketch after this slide).
12 / 24
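A generic sketch of the simulation loop just described; `policy` and `model` are hypothetical callables standing in for the simulation policy and the MDP model.

```python
def simulate_episode(s0, policy, model, max_steps=10_000):
    """Sample one episode of simulated experience starting from state s0.

    policy(s) -> action; model(s, a) -> (next_state, reward, done).
    Returns the trajectory as (s, a, r, s') transitions, which the agent
    then uses to update its value function.
    """
    trajectory, s = [], s0
    for _ in range(max_steps):
        a = policy(s)
        s_next, r, done = model(s, a)
        trajectory.append((s, a, r, s_next))
        s = s_next
        if done:
            break
    return trajectory
```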
MCTS and TD search cont'd
TD search uses value function approximation on the current sub-game (our V(s)
from before). We can update all similar states, since the value function
approximates the value of any position s ∈ S.
MCTS must wait many time-steps until a final outcome is obtained, and that
outcome depends on all of the agent's decisions throughout the simulation.
TD search can bootstrap, as before, between subsequent states: it does not
need to wait until the end of the simulation to apply corrections from the
TD error (just as in TD learning).
MCTS is currently the best-known example of simulation-based search.
13 / 24
Linear TD Search
Linear TD search is an implementation of TD search in which the value
function is updated online. The agent simulates episodes of experience from
the current state by sampling its current policy π(s, a), the transition
model P^π_{ss'}, and the reward model R^π_{ss'} (note: P gives transition
probabilities and R is the reward function).
Linear TD search is applied to the sub-game at the current state. Instead of
using a search tree, the agent approximates the value function by a linear
combination:
Q_u(s, a) = φ(s, a) · θ_u
Q is the action value function: Q^π(s, a) = E_π[R_t | s_t = s, a_t = a]
θ is the weight vector, u is the current (simulation) time step, and φ(s, a)
is the feature vector representing state-action pairs.
After each step the agent updates the parameters by TD learning, using TD(λ)
(see the sketch after this slide).
14 / 24
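A hedged sketch of linear TD search with a Sarsa(λ)-style update of Q(s, a) = φ(s, a) · θ using accumulating eligibility traces. The `features` function and the `model.actions`/`model.step` interface are assumptions for illustration; the paper's exact update schedule and trace variant may differ.

```python
import numpy as np

def linear_td_search(s0, n_episodes, features, model, n_features,
                     alpha=0.01, gamma=1.0, lam=0.8, epsilon=0.1):
    """Simulate episodes from the current root s0 and update
    Q(s, a) = phi(s, a) . theta online with TD(lambda)."""
    theta = np.zeros(n_features)

    def q(s, a):
        return features(s, a) @ theta

    def pick(s):
        acts = model.actions(s)
        if np.random.rand() < epsilon:                 # epsilon-greedy policy
            return acts[np.random.randint(len(acts))]
        return max(acts, key=lambda a: q(s, a))

    for _ in range(n_episodes):
        s, a = s0, pick(s0)
        z = np.zeros(n_features)                       # eligibility trace
        done = False
        while not done:
            s_next, r, done = model.step(s, a)
            z = gamma * lam * z + features(s, a)       # accumulate eligibility
            if done:
                delta = r - q(s, a)                    # terminal TD error
            else:
                a_next = pick(s_next)
                delta = r + gamma * q(s_next, a_next) - q(s, a)
                s, a = s_next, a_next
            theta += alpha * delta * z                 # online weight update
    return theta
```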
TD Search Results
15 / 24
TD search in computer Go
In Section 3 the value of each shape was learned offline, using TD learning
by self-play. This is myopic, since each local shape is evaluated without
knowledge of the rest of the board.
The idea is to use local shape features in TD search. TD search can learn the
value of each feature in the context of the current board state, as discussed
previously. This allows the agent to focus on what works well now.
Issue: by starting simulations from the current position we break the
symmetries of the empty board, so the weight sharing (feature reduction)
based on those symmetries is lost.
16 / 24
Change to TD search
Recall that with the shape features of Section 3 we were learning the value
function V(s).
We modify the TD search algorithm to update the weights of this value
function.
Linear TD search is applied to the sub-game at the current state. Instead of
using a search tree, the agent approximates the value function by a linear
combination, with the normalised two-ply update (a small numeric example
follows this slide):
Δθ = α · φ(s_t) / ||φ(s_t)||² · (V(s_{t+2}) − V(s_t))
17 / 24
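For intuition, with illustrative numbers (not taken from the paper): if position s_t activates 12 binary shape features, α = 0.1, V(s_t) = 0.4 and V(s_{t+2}) = 0.7, then ||φ(s_t)||² = 12 and each active weight is nudged by 0.1 × (0.7 − 0.4) / 12 = 0.0025, so the normalisation keeps the step size stable regardless of how many features a position activates.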
Experiments with TD search
Ran tournaments of at least 200 games between different versions of RLGO.
Used the bayeselo program to calculate Elo ratings.
Recall that TD search is simulation-based (slide 13) and assumes no
handcrafted simulation policy: the simulation policy simply (ε-greedily)
maximises the action value in every state.
Fuego 0.1 provides a "vanilla" default policy. They use it to incorporate
some prior knowledge into the simulation policy (TD search itself assumes no
prior knowledge), switching to it every T moves (see the sketch after this
slide).
Switching policies every 2-8 moves resulted in a roughly 300-point Elo
improvement.
The results show the importance of what they call "temporality": focusing the
agent's resources on the current moment.
18 / 24
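A small sketch of the policy-switching idea. The alternation scheme and the names `learned_policy`/`default_policy` (and `pi_rlgo`/`pi_fuego` in the usage comment) are assumptions about how the interleaving might be arranged, not the exact scheme used in the experiments.

```python
def mixed_simulation_policy(step, T, learned_policy, default_policy):
    """Alternate between the learned epsilon-greedy policy and a handcrafted
    default (Fuego-style) playout policy, switching every T moves."""
    return learned_policy if (step // T) % 2 == 0 else default_policy

# e.g. inside a simulation loop (pi_rlgo and pi_fuego are hypothetical policies):
#     a = mixed_simulation_policy(u, 4, pi_rlgo, pi_fuego)(s)
```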
TD Results
19 / 24
TD Results
20 / 24
Dyna-2: Integrating short and long-term memories
Learning algorithms slowly extract knowledge from the complete history of
training data.
Search algorithms use and extend this knowledge, rapidly and online, so as to
evaluate local states more accurately.
Dyna-2 combines TD learning and TD search.
Sutton's earlier Dyna algorithm (1990) applied TD learning to both real and
simulated experience.
The key idea of Dyna-2 is to maintain two separate memories: a long-term
memory that is learnt from real experience, and a short-term memory that is
used during search and updated from simulated experience.
21 / 24
Dyna-2 cont’d
Define a short-term value function Q̄(s, a) and a long-term value function
Q(s, a):
Q(s, a) = φ(s, a) · θ
Q̄(s, a) = φ(s, a) · θ + φ̄(s, a) · θ̄
Q is the action value function Q^π(s, a) = E_π[R_t | s_t = s, a_t = a]; θ and
θ̄ are the long- and short-term weight vectors, and φ(s, a), φ̄(s, a) are the
corresponding feature vectors.
The short-term value function Q̄(s, a) uses both memories to approximate the
true value function (see the sketch after this slide).
Two-phase search: an alpha-beta search is performed after each TD search.
22 / 24
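A minimal sketch of the two memories, assuming (as in the formulas above) linear value functions over feature vectors; the function names are illustrative.

```python
import numpy as np

def q_long(phi, theta):
    """Long-term value Q(s, a) = phi(s, a) . theta, learned from real games."""
    return phi @ theta

def q_short(phi, phi_bar, theta, theta_bar):
    """Search value Q_bar(s, a) = phi . theta + phi_bar . theta_bar.

    theta is the long-term memory (updated by TD learning on real experience);
    theta_bar is the short-term memory (updated by TD search on simulated
    experience from the current position, and used only during search).
    """
    return phi @ theta + phi_bar @ theta_bar
```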
Dyna-2 results
23 / 24
Discussion
What is the significance of the 2nd author?
What are your thoughts on the overall performance of this algorithm?
Why didn’t they outperform modern MCTS methods?
Are there any other applications where this might be useful?
Did you think the paper did a good job explaining their approach?
Was it descriptive enough?
What feature of Go, as compared to chess, checkers, or backgammon, makes it
different as a reinforcement learning environment?
Is using only a 1x1 feature set of shapes equivalent to the notion of
"over-fitting"?
What is the advantage of the two-ply update versus a 1-ply update that they
referred to in Section 3.2? What is the trade-off as we go up to 6 ply?
24 / 24
