(Alpha) Zero to Elo (with demo)

Outline
1. The math behind Go
2. From Crazy Stone -> AlphaGo
3. AlphaGo vs AlphaZero
4. Policy Iteration
5. Policy Improvement (Math alert!)
6. Policy Evaluation
7. The deep side of AlphaZero
8. Code and demo

"For a true AI isn't measured by the size of its tree, but by
the precision of its moves." Filottete

Go is constructive
Humans describe it as a more intuitive game
$10^{170}$ possible states
$10^{360}$ possible games for each starting state
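As a rough sanity check on the size of the state space, one can count every way of marking each point of a 19x19 board as empty, black, or white. This is only an upper bound, since it also counts illegal positions, and the names below are illustrative:

```python
# Each of the 19 * 19 = 361 points is empty, black, or white.
BOARD_POINTS = 19 * 19

# Upper bound on board configurations (includes illegal positions).
upper_bound = 3 ** BOARD_POINTS

# Number of decimal digits, i.e. the order of magnitude plus one.
digits = len(str(upper_bound))
print(f"3^{BOARD_POINTS} has {digits} digits")
```

The bound lands a couple of orders of magnitude above the number of possible states quoted on the slide, which is what one would expect once illegal positions are excluded.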

Adversarial
Fully observable
Deterministic

"The mystery of Go, the ancient game that computers still
can't win" - Wired 2014

AlphaGo Zero vs AlphaZero

Reinforcement Learning

The agent defines the part of the world that it wants to
explore
And it evaluates the goodness of its behaviors, based
on how much reward it is getting

and:
$\pi(a \mid s) = P(a \mid s) \quad \forall s \in S$
$v_\pi(s) = \mathbb{E}_\pi\left[\sum_t \gamma^t R_t \mid S_t\right]$
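The sum inside the expectation can be made concrete with a small sketch. It assumes a finished episode given as a plain list of rewards, and approximates the expectation by averaging over sampled episodes; the function names are hypothetical:

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * R_t for one episode, working backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def monte_carlo_value(episodes, gamma):
    """Estimate v(s) as the mean discounted return over episodes starting in s."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


# Two sampled episodes from the same start state (made-up rewards).
estimate = monte_carlo_value([[1, 1, 1], [0, 0, 0]], gamma=0.5)
```

In Go the reward is typically zero until the end of the game, so the return reduces to the final outcome, discounted or not.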

def value(state):
    """
    Black magic
    """
    return v  # v: estimate of the expected return from `state`

def policy(state):
    """
    White magic
    """
    return reasonable_actions  # actions worth considering in `state`

Policy Iteration

Policy Improvement

1. Plan in the future
2. Try new actions

Monte-Carlo Tree Search

MCTS is an algorithm that performs sampling-based
lookahead search.

With the backup operation we keep track of:
N(s,a) visit count
Q(s,a) mean action value
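A minimal sketch of how a backup step might maintain these two statistics. It assumes the tree is stored as a dictionary keyed by (state, action), that `value` is expressed from the point of view of the player who made the last move on the path, and that the sign flips at every ply because the game is adversarial. All names are hypothetical:

```python
def backup(stats, path, value):
    """Propagate a leaf evaluation back along the visited path.

    stats: dict mapping (state, action) -> (N, Q)
    path:  list of (state, action) pairs from the root to the leaf
    value: leaf evaluation, from the perspective of the last mover (assumption)
    """
    for state, action in reversed(path):
        n, q = stats.get((state, action), (0, 0.0))
        n += 1
        q += (value - q) / n  # incremental running mean
        stats[(state, action)] = (n, q)
        value = -value  # the opponent sees the opposite outcome


stats = {}
backup(stats, [("s0", "a"), ("s1", "b")], 1.0)
backup(stats, [("s0", "a")], 0.0)
```

Updating the mean incrementally avoids storing every simulated outcome for each edge.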

$Q(s, a) + c \, P(s, a) \, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}$
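The selection score above can be sketched in a few lines. This assumes the per-action statistics for one state are available as parallel lists; the function names and the default value of `c` are illustrative choices, not taken from the talk's code:

```python
import math


def puct_scores(q, p, n, c=1.0):
    """Score each action: Q(s,a) + c * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))."""
    sqrt_total = math.sqrt(sum(n))
    return [q_a + c * p_a * sqrt_total / (1 + n_a)
            for q_a, p_a, n_a in zip(q, p, n)]


def select_action(q, p, n, c=1.0):
    """Pick the index of the action with the highest score."""
    scores = puct_scores(q, p, n, c)
    return max(range(len(scores)), key=scores.__getitem__)


# Action 0 has the best mean value so far, but action 2 is unvisited
# and has a high prior, so its exploration bonus wins.
q = [0.5, 0.0, 0.0]
p = [0.2, 0.3, 0.5]
n = [3, 1, 0]
best = select_action(q, p, n)
```

The second term shrinks as an action is visited more often, so the search gradually shifts from following the prior to trusting the measured values.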

Policy Evaluation
Self Play

1. Clone yourself and fight!
2. As the Yous battle, observe the fight
3. Use those experiences to improve further

How is it implemented in Python?
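def play_against_yourself(game, player_mcts):
    ...
    training_samples = []  # (board, player id, action) tuples from this game
    terminal = False
    board = game.reset()
    while not terminal:
        # Our move, chosen by the MCTS player
        act = player_mcts.pick_move(board)
        # The game applies our move and returns the opponent's reply
        board, r, terminal, opp_act = game.step(act)
        # player_id and opp_id are assumed to be set in the elided setup above
        training_samples.append((board, player_id, act))
        training_samples.append((board, opp_id, opp_act))
    return training_samples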
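To try the self-play loop without the real Go environment, here is a toy sketch. `ToyNim` and `RandomPlayer` are hypothetical stand-ins that only mimic the `reset`, `step`, and `pick_move` interface used above; a random player replaces MCTS, and, unlike the slide, each sample stores the board as it was before the move was made:

```python
import random


class ToyNim:
    """Hypothetical game: players alternately take 1 or 2 stones; taking the last one wins."""

    def __init__(self, stones=7, seed=0):
        self.start = stones
        self.stones = stones
        self.rng = random.Random(seed)  # drives the built-in opponent

    def reset(self):
        self.stones = self.start
        return self.stones

    def step(self, act):
        """Apply our move, then the opponent's reply: (board, reward, terminal, opp_act)."""
        self.stones -= act
        if self.stones == 0:
            return self.stones, 1, True, None  # we took the last stone
        opp_act = self.rng.choice([m for m in (1, 2) if m <= self.stones])
        self.stones -= opp_act
        terminal = self.stones == 0
        return self.stones, (-1 if terminal else 0), terminal, opp_act


class RandomPlayer:
    """Hypothetical stand-in for the MCTS player: uniformly random legal move."""

    def __init__(self, seed=1):
        self.rng = random.Random(seed)

    def pick_move(self, board):
        return self.rng.choice([m for m in (1, 2) if m <= board])


def self_play(game, player, player_id=0, opp_id=1):
    samples = []
    board, terminal = game.reset(), False
    while not terminal:
        act = player.pick_move(board)
        samples.append((board, player_id, act))
        board_seen_by_opp = board - act  # specific to this toy game
        board, r, terminal, opp_act = game.step(act)
        if opp_act is not None:
            samples.append((board_seen_by_opp, opp_id, opp_act))
    return samples, r


samples, outcome = self_play(ToyNim(), RandomPlayer())
```

The final reward `outcome` is what a training step would later attach to every sample of the game as its target value.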

To the code!
main: https://gist.github.com/manuel-delverme/36f9fd220989903274c4badf83c0f880

The deeper side of RL

In AlphaZero we don't want to classify cats, we want to find the best
moves

The superstar of the network

Deep Learning - where are the layers? 1/523

Deep Learning - where are the layers? 2/523

it's-going-to-take-a-while 3/523

it's-going-to-take-a-while 4/523

lol joking/523
fast forwarding...

network heads/523

Loss function - what makes the model happy?
$(z - v(s))^2 - \pi^\top \log p + c \lVert\theta\rVert^2$
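A small numeric sketch of the three terms of this loss: squared value error, cross-entropy between the search policy and the network policy, and L2 weight regularization. It uses plain Python lists rather than any deep learning framework; the function name and the default value of `c` are assumptions for illustration:

```python
import math


def alphazero_loss(z, v, pi, p, theta, c=1e-4):
    """Loss for a single training sample.

    z:     game outcome from the current player's perspective
    v:     value predicted by the network for the state
    pi:    move probabilities produced by the search
    p:     move probabilities predicted by the network
    theta: flat list of network weights
    """
    value_loss = (z - v) ** 2
    # Skip zero-probability targets so log(0) never enters the sum.
    policy_loss = -sum(pi_a * math.log(p_a) for pi_a, p_a in zip(pi, p) if pi_a > 0)
    l2_penalty = c * sum(w * w for w in theta)
    return value_loss + policy_loss + l2_penalty


loss = alphazero_loss(z=1.0, v=0.5, pi=[1.0, 0.0], p=[0.5, 0.5], theta=[1.0, 2.0], c=0.1)
```

The value head is pulled towards the game outcome, the policy head towards the search probabilities, and the last term keeps the weights small.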

To the code!
train: https://gist.github.com/manuel-delverme/a1b6b93bd5b4d607920b045b039fcb98

Contacts
manuel.delverme@gmail.com
simone.totaro@gmail.com

Thank you!
github/mosc
