2. Three Major Sections
Section 3: Local shape features (almost an exact duplicate of a previous paper from 2004)
Section 4: TD search (simulation-based search methods for game play)
Dyna-2 algorithm (introducing the concept of long- and short-term memory in simulated search)
3. AI Game Playing
In AI, the most successful game-playing programs have followed these steps:
1st: Positions are evaluated by a linear combination of many features, with the position broken down into small local components.
2nd: The weights are trained by TD learning and self-play.
Finally: The linear evaluation function is combined with a suitable search algorithm to produce a high-performing player.
4. Sec 3.0: TD Learning with Local Shape Features
Approach: Shapes are an important part of Go; master-level players think in terms of general shapes.
The objective is to win the game of Go. Rewards: r = 1 if Black wins and r = 0 if White wins.
The state s assigns one of three possible values to each intersection of the board: empty, white, or black.
Local shape features l_i enumerate all possible configurations within every square region up to 3x3.
φ_i(s) = 1 if the board in state s matches local shape l_i, and 0 otherwise.
The value function V^π(s) is the expected total reward from state s when following policy π (i.e., Black's probability of winning).
The value function is approximated by a logistic-linear combination of shape features. Learning is model-free, since the value function is learned directly rather than via a model of the game.
V(s) = σ(φ(s) · θ^LI + φ(s) · θ^LD) is Black's winning probability from state s, where θ^LI and θ^LD are the location-independent and location-dependent weights.
Use a two-ply update: the TD(0) error is calculated after both the player and the opponent have made a move, i.e., between V(s_t) and V(s_{t+2}).
Using self-play: the agent and the opponent use the same policy π(s, a).
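A minimal sketch of this value function and the two-ply logistic TD(0) update (the learning rate and NumPy representation are illustrative assumptions, not the authors' code; a terminal position would use the reward r as the target instead of V(s_{t+2})):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def value(phi, theta):
    # V(s) = sigma(phi(s) . theta): Black's winning probability from s
    return sigmoid(phi @ theta)

def two_ply_td0(theta, phi_t, phi_t2, alpha=0.1):
    # TD(0) error computed after both players have moved: V(s_{t+2}) - V(s_t)
    delta = value(phi_t2, theta) - value(phi_t, theta)
    return theta + alpha * delta * phi_t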
7. Sec 3.2.1: Training Procedure
Objective: train the weights of the value function V(s), updating them with logistic TD(0).
Initialize all weights to zero and run a million games of self-play to find the weights of the value function V(s).
Black and White select moves using an ε-greedy policy over the same value function.
Using self-play: the agent and the opponent use the same ε-greedy policy π(s, a).
Each game terminates when both players pass.
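A sketch of this training loop under stated assumptions: initial_state, legal_moves, play, is_terminal, features, and num_features are hypothetical helpers for the Go mechanics, and value/two_ply_td0 come from the earlier sketch:

import numpy as np

def epsilon_greedy_move(state, theta, epsilon=0.1):
    moves = legal_moves(state)
    if np.random.rand() < epsilon:
        return moves[np.random.randint(len(moves))]   # explore
    # exploit: choose the move leading to the highest-valued successor
    return max(moves, key=lambda m: value(features(play(state, m)), theta))

def train(num_games=1_000_000, alpha=0.1):
    theta = np.zeros(num_features)                    # all weights start at zero
    for _ in range(num_games):
        state = initial_state()
        phis = [features(state)]
        while not is_terminal(state):                 # i.e., both players have passed
            state = play(state, epsilon_greedy_move(state, theta))
            phis.append(features(state))
            if len(phis) >= 3:                        # two plies available
                theta = two_ply_td0(theta, phis[-3], phis[-1], alpha)
    return theta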
10. Alpha-Beta Search
AB search: during game play, this technique searches the potential moves from state s to successor states s'. (Note: we are now using the learnt value function V(s).)
For example, at a depth of 2, if it is White's move we consider all of White's moves and all of Black's responses to those moves. We maintain alpha and beta as the lower and upper bounds on the value of the move.
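A compact sketch of depth-limited alpha-beta search using the learnt V(s) at the leaves (same hypothetical helpers as before; treating Black as the maximizing player and reading theta from an outer scope are assumptions):

from math import inf

def alphabeta(state, depth, alpha=-inf, beta=inf, maximizing=True):
    if depth == 0 or is_terminal(state):
        return value(features(state), theta)      # evaluate leaf with learnt V(s)
    if maximizing:                                # e.g., Black to move
        best = -inf
        for m in legal_moves(state):
            best = max(best, alphabeta(play(state, m), depth - 1, alpha, beta, False))
            alpha = max(alpha, best)              # raise the lower bound
            if alpha >= beta:                     # cutoff: opponent avoids this line
                break
        return best
    best = inf
    for m in legal_moves(state):
        best = min(best, alphabeta(play(state, m), depth - 1, alpha, beta, True))
        beta = min(beta, best)                    # lower the upper bound
        if alpha >= beta:
            break
    return best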
12. Section 4.0: Temporal-Difference Search
Idea: from the current state s_t there is always a subgame G_t of the original game G. Apply TD learning to G_t, using games of self-play that start from the current state s_t.
Simulation-based search: the agent samples episodes of simulated experience and updates its value function from that simulated experience.
Begin in state s_0. At each step u of the simulation, an action a_u is selected according to a simulation policy, and a new state s_{u+1} and reward r_{u+1} are generated by the MDP model. Repeat until a terminal state is reached.
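A sketch of sampling one simulated episode (simulation_policy, model, and is_terminal are assumed interfaces, not the paper's code):

def simulate_episode(s0, simulation_policy, model):
    # Sample one episode of simulated experience from the current state s0
    trajectory = []
    s = s0
    while not is_terminal(s):
        a = simulation_policy(s)           # e.g., epsilon-greedy over current values
        s_next, r = model.step(s, a)       # next state and reward from the MDP model
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory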
13. MCTS and TD Search cont'd
TD search uses value function approximation on the current subgame (our V(s) from before). We can update all similar states, since the value function approximates the value of any position s ∈ S.
MCTS must wait many time steps until a final outcome is obtained, and that outcome depends on all of the agent's decisions throughout the simulation. TD search can bootstrap, as before, between subsequent states: it does not need to wait until the end of the episode to make TD-error corrections (just as in TD learning).
MCTS is currently the best-known example of simulation-based search.
14. Linear TD Search
Linear TD search is an implementation of TD search where the value function is updated online. The agent simulates episodes of experience from the current state, sampling from its current policy π(s, a) and from the transition model P^a_{ss'} and reward model R^a_{ss'} (note: P gives the transition probabilities and R the reward function).
Linear TD search is applied to the subgame at the current state. Instead of using a search tree, the agent approximates the action value function by a linear combination:
Q_u(s, a) = φ(s, a) · θ_u
Q is the action value function, Q^π(s, a) = E[R_t | s_t = s, a_t = a]; θ_u are the weights at simulation step u, and φ(s, a) is a feature vector representing states and actions.
After each step, the agent updates the parameters by TD learning, using TD(λ).
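A sketch of a per-step linear TD(λ) update with an accumulating eligibility trace (a Sarsa-style reading of the update; α and λ values are illustrative):

def td_lambda_step(theta, z, phi_sa, phi_next_sa, reward, alpha=0.1, lam=0.8):
    # Q(s, a) = phi(s, a) . theta, updated after every simulated step;
    # pass a zero vector for phi_next_sa at terminal states
    delta = reward + phi_next_sa @ theta - phi_sa @ theta   # TD error (undiscounted)
    z = lam * z + phi_sa                                    # accumulating trace
    return theta + alpha * delta * z, z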
16. TD search in computer Go
In Section 3.0 the value of each shape was learned offline, using TD learning by self-play. This is considered myopic, since each shape is evaluated without knowledge of the entire board.
The idea is to use local shape features in TD search. TD search can learn the value of each feature in the context of the current board state, as discussed previously. This allows the agent to focus on what works well now.
Issue: by starting simulations from the current position we break the board's symmetries, so the weight sharing (feature reduction) based on those symmetries is lost.
17. Change to TD search
Remember that with the shapes in Section 3.0 we were learning the value function V(s).
We modify the TD search algorithm to update the weights of this value function.
TD search is applied to the subgame at the current state, and the weights are updated by a two-ply TD(0) step whose step size is normalized by the number of active features:
δθ = α (φ(s_t) / ||φ(s_t)||²) (V(s_{t+2}) − V(s_t))
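A sketch of this normalized update, assuming binary features so that ||φ(s_t)||² simply counts the active features (value comes from the earlier sketch):

def normalized_two_ply_td0(theta, phi_t, phi_t2, alpha=0.1):
    # Two-ply TD(0) step with the learning rate divided by the
    # number of active features in phi(s_t)
    delta = value(phi_t2, theta) - value(phi_t, theta)
    norm = phi_t @ phi_t                  # ||phi(s_t)||^2
    return theta + (alpha / norm) * delta * phi_t if norm > 0 else theta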
18. Experiments with TD search
Ran tournaments of at least 200 games between different versions of RLGO.
Used the bayeselo program to calculate Elo ratings.
Recall that for TD search we are doing simulated search (slide 13), and we need a simulation policy at each step u: the simulation policy selects the (approximately value-maximizing) action from every state in the MDP.
Fuego 0.1 supplies a "vanilla" handcrafted policy that they use as a default policy, in order to incorporate some prior knowledge into the simulation policy (TD search itself assumes no prior knowledge). They switch to this default policy every T moves, as sketched below.
Switching policies every 2-8 moves resulted in a roughly 300-point Elo improvement.
The results show the importance of what they call "temporality": focusing the agent's resources on the current moment.
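A sketch of one possible reading of this switching scheme, alternating between the learnt policy and the default policy in blocks of T moves (the block structure is an assumption):

def switching_policy(state, move_number, learned_policy, default_policy, T=4):
    # Alternate between the learnt simulation policy and the
    # handcrafted Fuego default policy every T moves
    if (move_number // T) % 2 == 0:
        return learned_policy(state)
    return default_policy(state)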
21. Dyna-2: Integrating short and long-term memories
Learning algorithms slowly extract knowledge from the complete history of training data.
Search algorithms use and extend this knowledge, rapidly and online, so as to evaluate local states more accurately.
Dyna-2 combines both TD learning and TD search.
Sutton had a previous algorithm, Dyna (1990), that applied TD learning to both real experience and simulated experience.
The key idea of Dyna-2 is to maintain two separate memories: a long-term memory that is learnt from real experience, and a short-term memory that is used during search and is updated from simulated experience.
22. Dyna-2 cont’d
Define a short-term value function Q̄(s, a) and a long-term value function Q(s, a):
Q(s, a) = φ(s, a) · θ
Q̄(s, a) = φ(s, a) · θ + φ̄(s, a) · θ̄
Q is the action value function, Q^π(s, a) = E[R_t | s_t = s, a_t = a]; θ and θ̄ are the long- and short-term weights, and φ(s, a) and φ̄(s, a) are the corresponding feature vectors representing states and actions.
The short-term value function, Q̄(s, a), uses both memories to approximate the true value function.
Two-phase search: an AB search is performed after each TD search.
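A sketch of the two memories as NumPy vectors (using shared feature vectors for both memories is a simplification; the paper allows each memory its own feature set):

def q_long(phi, theta):
    # Long-term memory: weights theta learnt from real experience
    return phi @ theta

def q_short(phi, phi_bar, theta, theta_bar):
    # Short-term memory: the long-term estimate plus a transient
    # correction theta_bar, updated from simulated experience during search
    return phi @ theta + phi_bar @ theta_bar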
24. Discussion
What is the significance of the 2nd author?
What are your thoughts on the overall performance of this algorithm?
Why didn’t they outperform modern MCTS methods?
Are there any other applications where this might be useful?
Did you think the paper did a good job explaining their approach?
Was it descriptive enough?
What feature of Go, as compared to chess, checkers, or backgammon, makes it different in the reinforcement learning setting?
Is using only a 1x1 feature set of shapes equivalent to the notion of "over-fitting"?
What is the advantage of the two-ply update versus the one-ply update that they referred to in Section 3.2? What is the trade-off as we go up to 6-ply?