6. @rabbuhl#Devoxx #AISelfLearningGamePlaying
Game Playing: ML (50’s – 70’s)
When | Who, What | Year | Book/Article | Category
1950’s/60’s | Arthur Samuel invents alpha-beta pruning | 1959 | Some Studies in Machine Learning Using the Game of Checkers | Algorithm (early AI as tree search)
1950’s/60’s | Frank Rosenblatt invents the Perceptron | 1962 | Principles of Neurodynamics | Connectionist (neural net)
1970’s | Marvin Minsky (with Seymour Papert) criticizes Perceptrons (XOR) | 1969 | Perceptrons | Symbolist (expert systems); AI Winter except expert systems (until the 80’s)
7. Game Playing: ML (80’s)
When | Who, What | Year | Book/Article | Category
1980’s | Rumelhart, McClelland, and Hinton train multi-layer perceptrons with back-propagation, UCSD | 1986 | Parallel Distributed Processing | Connectionist (neural net); AI revived
1980’s | Christopher Watkins develops Q-learning | 1989 | Q-learning (PhD thesis) | Reinforcement learning
1990’s | Richard Sutton and Andrew Barto | 1998 | Reinforcement Learning: An Introduction | Reinforcement learning
8. Game Playing: ML (90’s)
When | Who, What | Year | Book/Article | Category
1990’s | RA, Mentor | 1991–1994 | Neural Nets and Tic-Tac-Toe | Neural network
1990’s | N. Schraudolph, HNC, UCSD PhD student | 1993 | Temporal Difference Learning and Go | Neural network
1990’s | Gerald Tesauro develops TD-Gammon | 1995 | Temporal Difference Learning and TD-Gammon | Connectionist (neural net) and reinforcement learning (TD-lambda)
1990’s | IBM Deep Blue chess-playing computer | 1997 | Deep Blue Overview, 1997 | Brute-force hardware; used alpha-beta minimax search
9. Game Playing: ML (00’s and beyond)
When | Who, What | Year | Book/Article | Category
2000’s | RA / JMentor | 2004 | Q-Learning and Tic-Tac-Toe | Reinforcement learning (Q-learning), QMiniMax
2010’s | IBM Watson | 2011 | Jeopardy! | Machine learning, natural language processing, information retrieval
2010’s | Facebook DeepFace facial recognition (humans 97.53% correct, DeepFace 97.25% correct) | 2014 | Facebook Creates Software That Matches Faces Almost as Well as You Do, 2014 | Connectionist, neural networks
2010’s | AlphaGo beats a top human player | 2016 | Mastering the Game of Go with Deep Neural Networks and Tree Search | RL, Monte Carlo tree search, machine learning, etc.
11. Machine Learning Basics
Basics:
• Training is done by presenting pairs of patterns to the network:
• Ki = {Ai, Bi}, i = 0, …, p − 1
• where
Ai = {Xi,0, …, Xi,n−1} is the input pattern
Bi = {Yi,0, …, Yi,m−1} is the desired output pattern
12. Machine Learning Basics
Basics:
• Training is done as follows:
1. Initialize the weights and thresholds
2. Present a training pair Ki to the network
3. Calculate the forward pass of the network
4. Compare the actual output with the desired output
5. Adapt the weights
6. Calculate the error for the training set
7. Repeat by going to step 2 (*)
(*) Training stops when the error for every training pair is less than 0.01 (so the network generalizes).
13. Machine Learning Basics
Example:
• For the XOR problem the network is:
• 2 inputs, 8 hidden, 1 output
• The training set (two inputs, then the target output) is defined as:
• 0.0 0.0 0.9
• 0.0 1.0 -0.9
• 1.0 0.0 -0.9
• 1.0 1.0 0.9
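The training loop and the XOR example above can be sketched in a few lines of Python. This is not the JMentor implementation; the 2-8-1 tanh network, the learning rate, and the epoch limit are illustrative assumptions. Note that the targets of ±0.9 rather than ±1.0 keep the tanh activation out of saturation.

```python
import math
import random

random.seed(0)

# XOR training set from the slide: two inputs -> one target in {-0.9, 0.9}
data = [([0.0, 0.0], 0.9), ([0.0, 1.0], -0.9),
        ([1.0, 0.0], -0.9), ([1.0, 1.0], 0.9)]

N_IN, N_HID = 2, 8
LR = 0.5  # step size (assumed value)

# Step 1: initialize weights and thresholds (index 0 of each row is the bias)
w_h = [[random.uniform(-0.5, 0.5) for _ in range(N_IN + 1)] for _ in range(N_HID)]
w_o = [random.uniform(-0.5, 0.5) for _ in range(N_HID + 1)]

def forward(x):
    # Step 3: forward pass through hidden layer and output neuron
    h = [math.tanh(w[0] + w[1] * x[0] + w[2] * x[1]) for w in w_h]
    o = math.tanh(w_o[0] + sum(w_o[j + 1] * h[j] for j in range(N_HID)))
    return h, o

for epoch in range(10000):
    worst = 0.0
    for x, t in data:                       # Step 2: present each training pair
        h, o = forward(x)
        err = t - o                         # Step 4: compare with desired output
        worst = max(worst, abs(err))
        d_o = err * (1 - o * o)             # tanh' = 1 - tanh^2
        d_h = [d_o * w_o[j + 1] * (1 - h[j] * h[j]) for j in range(N_HID)]
        # Step 5: adapt the weights (back-propagation)
        w_o[0] += LR * d_o
        for j in range(N_HID):
            w_o[j + 1] += LR * d_o * h[j]
            w_h[j][0] += LR * d_h[j]
            w_h[j][1] += LR * d_h[j] * x[0]
            w_h[j][2] += LR * d_h[j] * x[1]
    if worst < 0.01:                        # Step 6/7: stop when every pair is learned
        break

for x, t in data:
    _, o = forward(x)
    print(x, "->", round(o, 3), "(target", t, ")")
```

After training, the network's output has the correct sign for all four XOR patterns.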
16. TD Gammon / Reinforcement Learning
• Backgammon:
• Tesauro switched to reinforcement learning and self-learning to teach TD-Gammon to play at a world-class level.
• Version 2.1 used a heuristic 2-ply search in real time
• A 3-ply search would improve its playing ability further
17. Reinforcement Learning
RL differs from supervised learning, where learning is done from examples provided by a knowledgeable external supervisor.
RL instead attempts to learn from the agent’s own experience, and has four parts:
• Policy: defines the learning agent’s way of behaving at a given time,
• Reward function: defines the goal of the RL problem,
• Value function: defines what is good in the long run,
• Model: mimics the behavior of the environment
19. Tic-Tac-Toe / Q-Learning
Policy:
•Rule which tells the player which move to make for every
state of the game
Values:
•First, set up a table of numbers, one for each possible state of the game
•Each number is the latest estimate of the probability of winning from that state
20. Tic-Tac-Toe / Q-Learning
We play many games against our opponent:
•We examine states which result from each possible move
•We look up their current values in the table
Most of the time:
•We move greedily and select the move which has the highest
probability of winning
•However, sometimes we randomly select from other moves
21. Tic-Tac-Toe / Q-Learning
When we are playing:
• We adjust the state values using the temporal difference:
• V(s1) = V(s1) + alpha [V(s2) – V(s1)]
• s1 is the state before the greedy move
• s2 is the state after the move
• alpha is the step-size parameter, which controls the rate of learning
• Number of states for Tic-Tac-Toe: 3 ^ 9 = 19,683
• Number of states for Backgammon: about 10 ^ 20 = 100,000,000,000,000,000,000
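The value-table method described over the last three slides can be sketched in Python. This is a minimal illustration, not the JMentor code: the random opponent and the values of alpha and the exploration rate are my own assumptions. The learner plays X, keeps one number per visited state (its estimated probability of winning), moves greedily most of the time, explores occasionally, and applies the temporal-difference update along each game.

```python
import random

random.seed(1)

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
V = {}                       # value table: 9-char state string -> P(X wins)
ALPHA, EPSILON = 0.2, 0.1    # step size and exploration rate (assumed values)

def winner(s):
    for a, b, c in LINES:
        if s[a] != ' ' and s[a] == s[b] == s[c]:
            return s[a]
    return None

def value(s):
    w = winner(s)
    if w == 'X': return 1.0          # a win counts as 1
    if w == 'O': return 0.0          # a loss counts as 0
    if ' ' not in s: return 0.5      # a draw counts as 1/2
    return V.setdefault(s, 0.5)      # unseen states start at 0.5

def moves(s, mark):
    return [s[:i] + mark + s[i + 1:] for i in range(9) if s[i] == ' ']

def play_game(epsilon):
    s, turn, traj = ' ' * 9, 'X', [' ' * 9]
    while winner(s) is None and ' ' in s:
        opts = moves(s, turn)
        if turn == 'X' and random.random() >= epsilon:
            s = max(opts, key=value)     # greedy move on current values
        else:
            s = random.choice(opts)      # exploration / random opponent
        traj.append(s)
        turn = 'O' if turn == 'X' else 'X'
    # Temporal-difference updates: V(s1) = V(s1) + alpha [V(s2) - V(s1)]
    for s1, s2 in zip(traj, traj[1:]):
        if winner(s1) is None and ' ' in s1:
            V[s1] = value(s1) + ALPHA * (value(s2) - value(s1))
    return winner(s)

for _ in range(20000):                   # many games against the opponent
    play_game(EPSILON)

wins = sum(play_game(0.0) == 'X' for _ in range(1000))
print("greedy win rate vs random opponent:", wins / 1000)
```

Note that the table only ever stores states actually visited, far fewer than the 3^9 bound quoted above; for backgammon's ~10^20 states a table is hopeless, which is why TD-Gammon replaced it with a neural network.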
22. Tic-Tac-Toe / Demo
JMentor: https://github.com/richardabbuhl/jmentor
Download and build with Maven
• XOR problem: jmentor\jbackprop\out\jbackprop.bat -p xor.trn xor.xml, jbackprop.bat -p xor.trn -a xor.xml
• MinMax problem: jmentor\jbackprop\out\jbackprop.bat -p mm.trn mm.xml, jbackprop.bat -p mm.trn -a mm.xml
• Look at training set ttt.trn
• Training the network to learn TTT: run in a Bourne shell: qgosh.sh
• See the results: showbest.sh
• TTT GUI: jmentor\jtictactoegui\out\jtictactoegui.bat
• Play a game
23. AlphaGo
• Go is a game of profound complexity.
• There are more than 10^170 possible positions
• That’s more than the number of atoms in the universe, and more than a
googol times larger than chess.
AlphaGo: using machine learning to master the ancient game of go, Machine Learning 2016
24. AlphaGo
AlphaGo combines:
• Advanced Tree Search (monte-carlo tree search)
• Deep Neural Networks (neural networks and reinforcement learning)
Neural Networks:
• One neural network, the “policy network”, selects the next move to play,
• The other neural network, the “value network”, predicts the winner of the game
25. AlphaGo
Neural Networks (policy network):
•Trained using 30 million moves played by human experts
•It could then predict the human move 57% of the time
•The policy networks discovered new strategies by playing many games between the neural networks and improving them using reinforcement learning
•The improved policy networks, even without a tree search, can beat state-of-the-art Go programs that use enormous tree searches
26. AlphaGo
Neural Networks (value network):
•The policy networks were used to train the value networks, which were again improved using reinforcement learning
•The value networks can evaluate a Go position and estimate the eventual winner
27. AlphaGo
Monte-Carlo Tree Search (MCTS)
• AlphaGo combines the policy and value networks in an MCTS algorithm that selects actions by lookahead search,
• Evaluating the policy and value networks requires several orders of magnitude more computation than traditional search heuristics
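How the two networks plug into the search can be illustrated with the selection rule applied at each tree node. This is a simplified stand-in for AlphaGo's actual selection formula; the constant c, the data layout, and the numbers are assumptions for illustration. Each candidate move is scored as Q(s, a) + c · P(s, a) · sqrt(N(s)) / (1 + N(s, a)), where P(s, a) is the policy network's prior for the move and Q(s, a) is the mean evaluation (value network and rollouts) accumulated so far.

```python
import math

def select(children, c=1.0):
    """Pick the child maximizing exploitation (Q) plus a prior-weighted
    exploration bonus (U) that decays as the child is visited more."""
    total_visits = sum(ch["visits"] for ch in children)
    def score(ch):
        q = ch["value_sum"] / ch["visits"] if ch["visits"] else 0.0
        u = c * ch["prior"] * math.sqrt(total_visits) / (1 + ch["visits"])
        return q + u
    return max(children, key=score)

# A node with three candidate moves: the unvisited move with a reasonable
# prior receives a large exploration bonus and is selected first.
children = [
    {"prior": 0.6, "visits": 10, "value_sum": 6.0},
    {"prior": 0.3, "visits": 0,  "value_sum": 0.0},
    {"prior": 0.1, "visits": 2,  "value_sum": 0.2},
]
chosen = select(children)
print("selected prior:", chosen["prior"])
```

Once statistics accumulate, the Q term dominates and the search concentrates on the moves that both networks rate highly, which is why each node evaluation is so much more expensive than a traditional search heuristic.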
Mastering the game of Go with Deep Neural Networks and Tree Search
https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf
AlphaGo: using machine learning to master the ancient game of go, Machine Learning 2016
28. AlphaGo
Monte-Carlo Tree Search (MCTS)
• The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs.
• The distributed version of AlphaGo ran on multiple machines with 40 search threads, 1,202 CPUs, and 176 GPUs.
Mastering the game of Go with Deep Neural Networks and Tree Search
https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf
AlphaGo: using machine learning to master the ancient game of go, Machine Learning 2016
29. AlphaGo
For the match against Fan Hui, the researchers used:
• A larger network of computers spanning about 170 GPU cards and 1,200 standard processors, or CPUs.
• This larger computer network both trained the system and played the
actual game, drawing on the results of the training.
In a Huge Breakthrough, Google’s AI Beats a Top Player at the Game of Go, Wired 01.26.16
30. AlphaGo Zero
A new version of AlphaGo has emerged:
• AlphaGo learned from training data taken from hundreds of thousands of games played by human experts,
• AlphaGo Zero uses no training data; instead it learns by playing millions of games against itself, improving with each game.
• No human data is needed anymore.
AlphaGo Zero Shows Machines Can Become Superhuman Without Any Help, Intelligent Machines, Will Knight, October 18, 2017.
https://www.technologyreview.com/s/609141/alphago-zero-shows-machines-can-become-superhuman-without-any-help/
31. ML Links / Questions?
• Deeplearning4j: https://deeplearning4j.org/
• TensorFlow: https://www.tensorflow.org/
• Google Cloud Machine Learning: https://cloud.google.com/ml-engine/
• Azure Machine Learning: https://azure.microsoft.com/en-us/services/machine-learning/
• Amazon Machine Learning: https://aws.amazon.com/machine-learning/
• DeepMind: https://deepmind.com/blog/open-sourcing-deepmind-lab/