UNDERSTANDING ALPHA GO
How Deep Learning Made the Impossible Possible
ABOUT MYSELF
 M.Sc. in Computer Science, HUJI
 Research interests: Deep Learning in Computer Vision, NLP, and Reinforcement Learning.
 Also DL theory and other ML topics.
 Working at a DL start-up (Imubit)
 Contact: mangate@gmail.com
CREDITS
 Many slides were adapted from the following publicly
available slideshows:
 https://www.slideshare.net/ShaneSeungwhanMoon/how-
alphago-works
 https://www.slideshare.net/ckmarkohchang/alphago-in-depth
 https://www.slideshare.net/KarelHa1/alphago-mastering-the-
game-of-go-with-deep-neural-networks-and-tree-search
 Original AlphaGo article:
Silver, David, et al. "Mastering the game of Go with
deep neural networks and tree search." Nature 529.7587
(2016): 484-489.
Available here:
http://web.iitd.ac.in/~sumeet/Silver16.pdf
DEEP LEARNING IS CHANGING OUR LIVES
 Search engines (also for images and audio)
 Spam filters
 Recommender systems (Netflix, YouTube)
 Self-driving cars
 Cyber security (and physical security, via computer vision)
 Machine translation
 Speech-to-text, audio recognition
 Image recognition, smart shopping
 And more and more and more…
AI VERSUS HUMAN
 In 1997, a supercomputer called Deep Blue (IBM) defeated Garry
Kasparov.
 This was the first defeat of a reigning world chess champion
by a computer under tournament conditions.
AI VERSUS HUMAN
 In 2011 Watson, another IBM supercomputer, "crushed"
the two best players in Jeopardy!, a popular question-answering
TV show.
GO
 An ancient Chinese Game
(2,500 years old!)
 Despite its relatively simple
rules, Go is very complex,
even more so than chess.
 Winning at Go requires a
great deal of intuition, and beating
top humans was therefore considered
out of reach for computers for at least the next 30
years.
AI VERSUS HUMAN
 In 2016, AlphaGo, a computer program by
DeepMind (part of Google), played a five-game Go
match against Lee Sedol.
 Lee Sedol:
 Professional 9-dan (the highest ranking in Go), considered
among the top 3 players in the world.
 2nd in international titles.
 Won 97 out of 100 games
against European Go
champion Fan Hui.
AI VERSUS HUMAN
 “I’m confident that I can win, at least this time” – Lee Sedol
 AlphaGo won 4-1
 “I kind of felt powerless… misjudged the capabilities of
AlphaGo” – Lee Sedol
 How was this possible? Deep Learning.
AI IN GAME PLAYING
 Almost every game can be “simulated” with a tree search.
 A move is chosen if it has the best chance of ending in a victory.
AI IN GAMES
 More formally: an optimal value function V*(s)
determines the outcome of the game:
 From every board position (state=s)
 Under perfect play by all players.
 This is done by going over the tree containing
possible move sequences where:
 b is the game's breadth (number of legal moves in each
position)
 d is the game's depth (game length in moves)
 Tic-Tac-Toe: b ≈ 4, d ≈ 4
 Chess: b ≈ 35, d ≈ 80
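To make the tree-search idea concrete, here is a minimal sketch (not from the slides) of exhaustive search over a hypothetical game interface; the game object and its legal_moves / play / is_over / winner methods are illustrative placeholders. The cost grows roughly as b^d nodes.

def minimax(game, maximizing=True):
    """Exhaustively evaluate a position under perfect play by both sides."""
    if game.is_over():
        return game.winner()              # e.g. +1 if the first player won, -1 otherwise
    values = [minimax(game.play(move), not maximizing)
              for move in game.legal_moves()]   # b branches at every node
    return max(values) if maximizing else min(values)
# Roughly b**d positions are visited for breadth b and depth d:
# feasible for Tic-Tac-Toe, hopeless for chess, unthinkable for Go.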
TREE SEARCH IN GO
 However, in Go: b ≈ 250, d ≈ 150
 The search space is far larger than 10^100 (a googol).
 This is more than the number of atoms in the entire universe!
 Go is more complex than chess!
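A quick sanity check of those orders of magnitude in plain Python (the b and d values are the rough estimates from the slides):

chess = 35 ** 80        # naive chess search space
go = 250 ** 150         # naive Go search space
googol = 10 ** 100
print(len(str(chess)))  # about 124 digits
print(len(str(go)))     # about 360 digits
print(go > googol)      # True: the Go tree dwarfs a googol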
KEY: REDUCE THE SEARCH SPACE
 Reducing b (the space of possible actions)
KEY: REDUCE THE SEARCH SPACE
 Reducing d – Position evaluation ahead of time
 Instead of simulating all the way to the end:
Both reductions are done with Deep Learning.
SOME CONCEPTS
 Supervised Learning (classification)
 Given some data, predict a class (i.e. choose one option
out of a known set of options)
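As a toy illustration (not from the slides), here is what supervised classification looks like in code; scikit-learn is assumed to be installed and the data is made up:

from sklearn.linear_model import LogisticRegression

# Made-up data: four 2-D points, each labelled with class 0 or 1.
X = [[0.0, 0.2], [0.1, 0.9], [0.9, 0.1], [1.0, 0.8]]
y = [0, 0, 1, 1]          # label depends mostly on the first coordinate

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.95, 0.3]]))   # predicts one class out of the known set, e.g. [1]
# Regression (next slide) is the same setup with a real-valued target instead of a class.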
SOME CONCEPTS
 Supervised Learning (regression)
 Given some data, predict a real number
SOME CONCEPTS
 Reinforcement Learning
 Given a state (observation), perform actions that
lead toward a goal (e.g. winning a game)
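The reinforcement-learning loop can be sketched as below (a generic sketch, not AlphaGo-specific; env and policy are hypothetical placeholders with a Gym-like interface):

def run_episode(env, policy):
    """Interact with an environment until the episode ends and return the total reward."""
    state = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = policy(state)                   # the agent picks an action from the state
        state, reward, done = env.step(action)   # the environment responds
        total_reward += reward                   # in Go, the only reward is win/lose at the end
    return total_reward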
SOME CONCEPTS
 CNNs are able to learn abstract features of a given image
REDUCING ACTION CANDIDATES
 Done by learning to “imitate” expert moves
 Data: online expert games (160K games, ~30M moves).
 This is supervised classification (given a board position, predict
the expert's move out of all legal ones)
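A rough PyTorch sketch of such a move-prediction (policy) network; the layer sizes are illustrative and the paper's 48 input feature planes are collapsed into one plane here, so this is not the actual 13-layer architecture:

import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Outputs one logit per board point; softmax gives a probability for each of the 361 moves."""
    def __init__(self, planes=1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(planes, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=1),   # one score per intersection
        )

    def forward(self, board):                   # board: (batch, planes, 19, 19)
        return self.conv(board).flatten(1)      # logits: (batch, 361)

net = PolicyNet()
board = torch.zeros(1, 1, 19, 19)               # an empty board, as a placeholder input
probs = torch.softmax(net(board), dim=1)        # move probabilities
# Supervised training step: cross-entropy against the expert's move index (here a made-up 60).
loss = nn.functional.cross_entropy(net(board), torch.tensor([60]))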
REDUCING ACTION CANDIDATES
 This deep CNN achieved 55% test accuracy on predicting
expert moves.
 Imitators with no Deep Learning reached only 22% accuracy.
 A small improvement in accuracy leads to a big improvement in
playing ability.
ROLLOUT NETWORK
 Train an additional, smaller network
(pπ) for imitation.
 This network achieves only 24.2%
accuracy.
 It runs about 1,000 times faster (2 μs
versus 3 ms per move).
 This network is used for rollouts
(explained later).
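The key idea can be sketched as a linear softmax over simple hand-crafted pattern features instead of a deep CNN (a minimal illustration; the feature vector and its size are placeholders):

import numpy as np

class RolloutPolicy:
    """A tiny linear-softmax move predictor: much less accurate, but orders of magnitude faster."""
    def __init__(self, n_features, n_moves=361):
        self.W = np.zeros((n_features, n_moves))   # learned with the same imitation objective

    def move_probs(self, features):                 # features: (n_features,) for the current position
        logits = features @ self.W
        exp = np.exp(logits - logits.max())          # numerically stable softmax
        return exp / exp.sum()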
IMPROVING THE NETWORK
 Improve the imitator network through self-play
(reinforcement learning)
 An entire game is played and the parameters are
updated according to the result.
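In the spirit of that policy-gradient stage, a hedged sketch of one update: states and actions are assumed to hold the positions and moves of one self-played game, z is +1 for a win and -1 for a loss, and the network and optimizer come from the earlier sketch.

import torch

def reinforce_update(policy_net, optimizer, states, actions, z):
    """REINFORCE: raise the log-probability of every move in a won game, lower it in a lost one."""
    optimizer.zero_grad()
    logits = policy_net(states)                               # (T, 361) for the T positions played
    log_probs = torch.log_softmax(logits, dim=1)
    chosen = log_probs[torch.arange(len(actions)), actions]   # log pi(a_t | s_t) for the moves played
    loss = -(z * chosen).mean()                               # gradient ascent on z * log pi
    loss.backward()
    optimizer.step()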
IMPROVING THE NETWORK
 Keep generating better models by self-playing newer models
against older ones
 The final network also won 85% of games against the best Go
software (the model without self-play won only 11%)
 However, this model was eventually not used during the match
itself; it was used to generate the value function.
REDUCING SEARCH DEPTH - DATASET
 Self-play with the imitator (SL) model for a random
number of steps (0 to 450).
 Make one random move. This is the starting
position s.
 Self-play until the end of the game with the RL network
(latest model).
 If black won, z = 1; otherwise z = 0.
 Save (s,z) to the dataset.
 Generated 30M (s,z) pairs from 30M games.
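A sketch of that data-generation recipe (the Go environment env and the two policies are hypothetical placeholders; one (s, z) pair is kept per game so the examples stay uncorrelated):

import random

def generate_value_example(env, sl_policy, rl_policy, max_prefix=450):
    """Produce one (position, outcome) training pair for the value network."""
    state = env.reset()
    for _ in range(random.randint(0, max_prefix)):            # prefix played by the imitator (SL) policy
        state = env.play(state, sl_policy(state))
    state = env.play(state, random.choice(env.legal_moves(state)))  # one random move -> position s
    s = state
    while not env.is_over(state):                             # finish the game with the RL policy
        state = env.play(state, rl_policy(state))
    z = 1 if env.winner(state) == "black" else 0              # label: did black win?
    return s, z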
REDUCING SEARCH DEPTH –
VALUE FUNCTION
 A regression task: for a given position s, output a number between
0 and 1.
 Now, for each possible position we can have an evaluation of
how “good” it is for the black player.
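The value network can be sketched like the policy network but with a single sigmoid output and a regression loss (illustrative PyTorch, not the paper's exact architecture):

import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Maps a board position to a number in (0, 1): the estimated chance that black wins."""
    def __init__(self, planes=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(planes, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=1), nn.Flatten(),
        )
        self.head = nn.Linear(19 * 19, 1)

    def forward(self, board):                     # board: (batch, planes, 19, 19)
        return torch.sigmoid(self.head(self.features(board)))

# Trained by regression on the (s, z) dataset, e.g.:
#   loss = nn.functional.mse_loss(value_net(boards).squeeze(1), outcomes.float())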
REDUCING SEARCH SPACE
PUTTING IT ALL TOGETHER - MCTS
 During game time a method called Monte Carlo
Tree Search (MCTS) is applied.
 This method has 4 steps:
 Selection
 Expansion
 Evaluation
 Backup (update)
 For each move in the game this process is repeated
about 10K times.
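A high-level sketch of those four phases (the tree-node methods and the three networks are assumed placeholders; the real system runs this asynchronously across many CPUs and GPUs):

def mcts_move(root, policy_net, value_net, rollout_policy, n_simulations=10_000):
    """Run select / expand / evaluate / backup repeatedly, then play the most-visited move."""
    for _ in range(n_simulations):
        node, path = root, [root]
        while node.children:                                   # 1. Selection: descend by Q + u(P)
            node = node.select_child()
            path.append(node)
        node.expand(policy_net)                                # 2. Expansion: add children with priors P(s, a)
        value = node.evaluate(value_net, rollout_policy)       # 3. Evaluation: value net + fast rollout
        for visited in path:                                   # 4. Backup: update N and Q along the path
            visited.update(value)
    return max(root.children, key=lambda c: c.visit_count).move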
MCTS - SELECTION
 At each step we have a starting
position (the board at this point).
 An action is selected
using a combination of the imitator
network's prior P(s, a) and an action
value Q, which is set to 0 at the start.
 The prior is divided by the number of
times the state/action pair has been
visited, to encourage diversity (exploration).
u(P) ∝ P(s, a) / (1 + N(s, a))
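In code, the selection rule can be sketched as follows (a simplification of the rule in the paper; prior, visit_count, and q_value are placeholder fields on a tree node):

def select_child(children):
    """Pick the child maximizing Q(s, a) + u(s, a), where u(s, a) ~ P(s, a) / (1 + N(s, a))."""
    def score(child):
        u = child.prior / (1 + child.visit_count)   # exploration bonus from the imitator's prior
        return child.q_value + u                    # q_value starts at 0 before any simulation
    return max(children, key=score)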
MCTS - EXPANSION
 When building the tree, a
position can be expanded once
(creating new leaf nodes in the tree)
using the imitator network.
 This provides the new priors P(s, a),
and hence u(P), for the next searches.
MCTS - EVALUATION
 After simulating 3-4 steps
with the imitator network
we evaluate the board
position.
 This is done in two ways:
 The value network prediction.
 Using the smaller imitator (rollout)
network to self-play to the end
(rollout), and save the result
(1 for a black win, 0 for a white win)
 Both evaluations are combined
to give this board position a
number between 0 and 1.
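A sketch of how the two evaluations can be mixed into one leaf value (the paper weights them equally, which the constant lam = 0.5 reflects; the environment helpers are placeholders):

def rollout_to_end(state, rollout_policy, env):
    """Play the position out with the fast rollout policy and report who won."""
    while not env.is_over(state):
        state = env.play(state, rollout_policy(state))
    return 1 if env.winner(state) == "black" else 0

def evaluate_leaf(state, value_net, rollout_policy, env, lam=0.5):
    """Combine the value-network estimate with a rollout result into a number in [0, 1]."""
    v = value_net(state)                              # learned estimate of black's winning chance
    z = rollout_to_end(state, rollout_policy, env)    # 1 if black wins the rollout, else 0
    return (1 - lam) * v + lam * z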
MCTS – BACKUP (UPDATE)
 After the simulation we
update the tree.
 Update Q (which was
0 at the beginning) with
the value computed from
the value network and the
rollouts.
 Update N(s,a): Increase
by one for each
state/action pair visited.
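The backup step can be sketched as keeping a running mean of the leaf evaluations seen through each visited node (placeholder node fields again):

def backup(path, leaf_value):
    """Propagate one simulation's evaluation back up the visited path."""
    for node in path:
        node.visit_count += 1                                            # N(s, a) += 1
        node.q_value += (leaf_value - node.q_value) / node.visit_count   # Q becomes the mean of leaf values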
CHOOSING AN ACTION
 For each move during the game, MCTS is run about
10K times.
 In the end, the action that was visited the most
times from the root position (the current board) is
taken.
 Notes:
 Since this process is slow, they had to use the smaller
network for rollouts to keep it feasible (otherwise each
move would have taken several days to
compute).
 The imitator (SL) network was better than the RL network
at choosing the first actions, probably because
humans play a more diverse set of moves.
ALPHA GO WEAKNESSES
 In the 4th game, Lee Sedol steered the board into a
position that was not in AlphaGo's search tree,
causing the program to choose poor moves and
eventually lose the game.
 Most assumptions that hold for AlphaGo do not carry
over to real-life RL problems. See:
https://medium.com/@karpathy/alphago-in-context-
c47718cb95a5
RETIREMENT
 In May 2017 AlphaGo defeated Ke Jie, the world's top-ranked
player, 3-0.
 Google's DeepMind unit announced that this would be the last
competitive match the AI would play.
SUMMARY
 To this day, AlphaGo is considered one of the greatest AI
achievements in recent history.
 This achievement was made by combining Deep
Learning with standard methods (like MCTS) to "simplify"
the very complex game of Go.
 4 deep neural networks were used:
 3 almost identical Convolutional Neural Networks:
 An imitation (SL) network for action-space reduction.
 An RL network created through self-play, for generating the dataset
for the value network.
 A value network for search-depth reduction.
 1 small network for rollouts.
 Deep Learning keeps achieving amazing new goals
every day, and is one of the fastest growing fields in
both academia and industry.
QUESTIONS?
Thank you!