6. @rabbuhl#Devoxx #AISelfLearningGamePlaying
Game Playing: ML (50’s – 70’s)
When | Who, What | Year | Book/Article | Category
1950’s/60’s | Arthur Samuel invents alpha-beta pruning | 1959 | Some Studies in Machine Learning Using the Game of Checkers | Algorithm (early AI as tree search)
1950’s/60’s | Frank Rosenblatt invents the Perceptron | 1962 | Principles of Neurodynamics | Connectionist (neural net)
1970’s | Marvin Minsky (with Seymour Papert) criticizes Perceptrons (XOR) | 1969 | Perceptrons | Symbolist (expert systems); AI Winter except expert systems (until the 80’s)
7. Game Playing: ML (80’s)
When | Who, What | Year | Book/Article | Category
1980’s | Rumelhart, McClelland, and Hinton train multi-layer perceptrons with back-propagation, UCSD | 1986 | Parallel Distributed Processing | Connectionist (neural net); AI revived
1980’s | Christopher Watkins develops Q-learning | 1989 | Q-learning (PhD thesis) | Reinforcement learning
1990’s | Richard Sutton and Andrew Barto | 1998 | Reinforcement Learning: An Introduction | Reinforcement learning
8. Game Playing: ML (90’s)
When | Who, What | Year | Book/Article | Category
1990’s | RA, Mentor | 1991–1994 | Neural Nets and Tic-Tac-Toe | Neural network
1990’s | N. Schraudolph, HNC, UCSD PhD student | 1993 | Temporal Difference Learning and Go | Neural network
1990’s | Gerald Tesauro develops TD-Gammon | 1995 | Temporal Difference Learning and TD-Gammon | Connectionist (neural net) and reinforcement learning (TD-lambda)
1990’s | IBM Deep Blue chess-playing computer | 1997 | Deep Blue Overview, 1997 | Brute-force hardware; used alpha-beta minimax search
9. Game Playing: ML (00’s and beyond)
When | Who, What | Year | Book/Article | Category
2000’s | RA / JMentor | 2004 | Q-Learning and Tic-Tac-Toe | Reinforcement learning (Q-learning), QMiniMax
2010’s | IBM Watson | 2011 | Jeopardy! | Machine learning, natural language processing, information retrieval
2010’s | Facebook DeepFace facial recognition (humans 97.53% correct, DeepFace 97.25% correct) | 2014 | Facebook Creates Software That Matches Faces Almost as Well as You Do, 2014 | Connectionist, neural networks
2010’s | AlphaGo beats a top human player | 2016 | Mastering the Game of Go with Deep Neural Networks and Tree Search | RL, Monte Carlo tree search, machine learning, etc.
11. Machine Learning Basics
Basics:
• Training is done by presenting pairs of patterns to the network:
• Ki = {Ai, Bi}, i = 0, …, p − 1
• where
Ai = {Xi,0, …, Xi,n−1} is the input pattern
Bi = {Yi,0, …, Yi,m−1} is the desired output pattern
12. Machine Learning Basics
Basics:
• Training is done as follows:
1. Initialize the weights and thresholds
2. Present a training pair Ki to the network
3. Calculate the forward pass of the network
4. Compare the actual output with the desired output
5. Adapt the weights
6. Calculate the error for the training set
7. Repeat by going to step 2 (*)
(*) Training stops when the error for every training pair is less than 0.01 (so the network generalizes).
13. Machine Learning Basics
Example:
• For the XOR problem the network is:
• 2 inputs, 8 hidden, 1 output
• The training set (two inputs, then the target output) is defined as:
• 0.0 0.0 0.9
• 0.0 1.0 -0.9
• 1.0 0.0 -0.9
• 1.0 1.0 0.9
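The training loop and the XOR example above can be sketched in a few lines of Python. This is not the JMentor implementation; the 2-8-1 tanh network, the learning rate, and the epoch limit are illustrative assumptions. Note that the targets of ±0.9 rather than ±1.0 keep the tanh activation out of saturation.

```python
import math
import random

random.seed(0)

# XOR training set from the slide: two inputs -> one target in {-0.9, 0.9}
data = [([0.0, 0.0], 0.9), ([0.0, 1.0], -0.9),
        ([1.0, 0.0], -0.9), ([1.0, 1.0], 0.9)]

N_IN, N_HID = 2, 8
LR = 0.5  # step size (assumed value)

# Step 1: initialize weights and thresholds (index 0 of each row is the bias)
w_h = [[random.uniform(-0.5, 0.5) for _ in range(N_IN + 1)] for _ in range(N_HID)]
w_o = [random.uniform(-0.5, 0.5) for _ in range(N_HID + 1)]

def forward(x):
    # Step 3: forward pass through hidden layer and output neuron
    h = [math.tanh(w[0] + w[1] * x[0] + w[2] * x[1]) for w in w_h]
    o = math.tanh(w_o[0] + sum(w_o[j + 1] * h[j] for j in range(N_HID)))
    return h, o

for epoch in range(10000):
    worst = 0.0
    for x, t in data:                       # Step 2: present each training pair
        h, o = forward(x)
        err = t - o                         # Step 4: compare with desired output
        worst = max(worst, abs(err))
        d_o = err * (1 - o * o)             # tanh' = 1 - tanh^2
        d_h = [d_o * w_o[j + 1] * (1 - h[j] * h[j]) for j in range(N_HID)]
        # Step 5: adapt the weights (back-propagation)
        w_o[0] += LR * d_o
        for j in range(N_HID):
            w_o[j + 1] += LR * d_o * h[j]
            w_h[j][0] += LR * d_h[j]
            w_h[j][1] += LR * d_h[j] * x[0]
            w_h[j][2] += LR * d_h[j] * x[1]
    if worst < 0.01:                        # Step 6/7: stop when every pair is learned
        break

for x, t in data:
    _, o = forward(x)
    print(x, "->", round(o, 3), "(target", t, ")")
```

After training, the network's output has the correct sign for all four XOR patterns.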
16. TD Gammon / Reinforcement Learning
• Backgammon:
• Tesauro switched to reinforcement learning and self-learning to teach TD-Gammon to play at a world-class level.
• Version 2.1 used a heuristic 2-ply search in real time
• A 3-ply search would improve its playing ability further
17. Reinforcement Learning
RL differs from supervised learning, where learning is done from examples provided by a knowledgeable external supervisor.
RL instead attempts to learn from the agent’s own experience, and has four parts:
• Policy: defines the learning agent’s way of behaving at a given time,
• Reward function: defines the goal of the RL problem,
• Value function: defines what is good in the long run,
• Model: mimics the behavior of the environment
19. Tic-Tac-Toe / Q-Learning
Policy:
•Rule which tells the player which move to make for every
state of the game
Values:
•First, set up a table of numbers, one for each possible state of the game
•Each number is the latest estimate of the probability of winning from that state
20. Tic-Tac-Toe / Q-Learning
We play many games against our opponent:
•We examine states which result from each possible move
•We look up their current values in the table
Most of the time:
•We move greedily and select the move which has the highest
probability of winning
•However, sometimes we randomly select from other moves
21. Tic-Tac-Toe / Q-Learning
When we are playing:
• We adjust the state values using the temporal difference:
• V(s1) = V(s1) + alpha [V(s2) – V(s1)]
• s1 is the state before the greedy move
• s2 is the state after the move
• alpha is the step-size parameter, which controls the rate of learning
• Number of states for Tic-Tac-Toe: 3 ^ 9 = 19,683
• Number of states for Backgammon: about 10 ^ 20 = 100,000,000,000,000,000,000
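The value-table method described over the last three slides can be sketched in Python. This is a minimal illustration, not the JMentor code: the random opponent and the values of alpha and the exploration rate are my own assumptions. The learner plays X, keeps one number per visited state (its estimated probability of winning), moves greedily most of the time, explores occasionally, and applies the temporal-difference update along each game.

```python
import random

random.seed(1)

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
V = {}                       # value table: 9-char state string -> P(X wins)
ALPHA, EPSILON = 0.2, 0.1    # step size and exploration rate (assumed values)

def winner(s):
    for a, b, c in LINES:
        if s[a] != ' ' and s[a] == s[b] == s[c]:
            return s[a]
    return None

def value(s):
    w = winner(s)
    if w == 'X': return 1.0          # a win counts as 1
    if w == 'O': return 0.0          # a loss counts as 0
    if ' ' not in s: return 0.5      # a draw counts as 1/2
    return V.setdefault(s, 0.5)      # unseen states start at 0.5

def moves(s, mark):
    return [s[:i] + mark + s[i + 1:] for i in range(9) if s[i] == ' ']

def play_game(epsilon):
    s, turn, traj = ' ' * 9, 'X', [' ' * 9]
    while winner(s) is None and ' ' in s:
        opts = moves(s, turn)
        if turn == 'X' and random.random() >= epsilon:
            s = max(opts, key=value)     # greedy move on current values
        else:
            s = random.choice(opts)      # exploration / random opponent
        traj.append(s)
        turn = 'O' if turn == 'X' else 'X'
    # Temporal-difference updates: V(s1) = V(s1) + alpha [V(s2) - V(s1)]
    for s1, s2 in zip(traj, traj[1:]):
        if winner(s1) is None and ' ' in s1:
            V[s1] = value(s1) + ALPHA * (value(s2) - value(s1))
    return winner(s)

for _ in range(20000):                   # many games against the opponent
    play_game(EPSILON)

wins = sum(play_game(0.0) == 'X' for _ in range(1000))
print("greedy win rate vs random opponent:", wins / 1000)
```

Note that the table only ever stores states actually visited, far fewer than the 3^9 bound quoted above; for backgammon's ~10^20 states a table is hopeless, which is why TD-Gammon replaced it with a neural network.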
22. Tic-Tac-Toe / Demo
JMentor: https://github.com/richardabbuhl/jmentor
Download and build with Maven
• XOR problem: jmentor\jbackprop\out\jbackprop.bat -p xor.trn xor.xml, jbackprop.bat -p xor.trn -a xor.xml
• MinMax problem: jmentor\jbackprop\out\jbackprop.bat -p mm.trn mm.xml, jbackprop.bat -p mm.trn -a mm.xml
• Look at training set ttt.trn
• Training the network to learn TTT: run in a Bourne shell: qgosh.sh
• See the results: showbest.sh
• TTT GUI: jmentor\jtictactoegui\out\jtictactoegui.bat
• Play a game
23. AlphaGo
• Go is a game of profound complexity.
• There are more than 10^170 possible positions
• That’s more than the number of atoms in the universe, and more than a
googol times larger than chess.
AlphaGo: using machine learning to master the ancient game of go, Machine Learning 2016
24. AlphaGo
AlphaGo combines:
• Advanced Tree Search (monte-carlo tree search)
• Deep Neural Networks (neural networks and reinforcement learning)
Neural Networks:
• One neural network, the “policy network”, selects the next move to play,
• The other neural network, the “value network”, predicts the winner of the game
25. AlphaGo
Neural Networks (policy network):
•Trained using 30 million moves played by human experts
•It could then predict the human move 57% of the time
•The policy networks discovered new strategies by playing many games between the neural networks and improving them using reinforcement learning
•The improved policy networks, even without a tree search, can beat state-of-the-art Go programs that use enormous tree searches
26. AlphaGo
Neural Networks (value network):
•The policy networks were used to train the value networks, which were again improved using reinforcement learning
•The value networks can evaluate a Go position and estimate the eventual winner
27. AlphaGo
Monte-Carlo Tree Search (MCTS)
• AlphaGo combines the policy and value networks in an MCTS algorithm that selects actions by lookahead search,
• Evaluating the policy and value networks requires several orders of magnitude more computation than traditional search heuristics
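How the two networks plug into the search can be illustrated with the selection rule applied at each tree node. This is a simplified stand-in for AlphaGo's actual selection formula; the constant c, the data layout, and the numbers are assumptions for illustration. Each candidate move is scored as Q(s, a) + c · P(s, a) · sqrt(N(s)) / (1 + N(s, a)), where P(s, a) is the policy network's prior for the move and Q(s, a) is the mean evaluation (value network and rollouts) accumulated so far.

```python
import math

def select(children, c=1.0):
    """Pick the child maximizing exploitation (Q) plus a prior-weighted
    exploration bonus (U) that decays as the child is visited more."""
    total_visits = sum(ch["visits"] for ch in children)
    def score(ch):
        q = ch["value_sum"] / ch["visits"] if ch["visits"] else 0.0
        u = c * ch["prior"] * math.sqrt(total_visits) / (1 + ch["visits"])
        return q + u
    return max(children, key=score)

# A node with three candidate moves: the unvisited move with a reasonable
# prior receives a large exploration bonus and is selected first.
children = [
    {"prior": 0.6, "visits": 10, "value_sum": 6.0},
    {"prior": 0.3, "visits": 0,  "value_sum": 0.0},
    {"prior": 0.1, "visits": 2,  "value_sum": 0.2},
]
chosen = select(children)
print("selected prior:", chosen["prior"])
```

Once statistics accumulate, the Q term dominates and the search concentrates on the moves that both networks rate highly, which is why each node evaluation is so much more expensive than a traditional search heuristic.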
Mastering the game of Go with Deep Neural Networks and Tree Search
https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf
AlphaGo: using machine learning to master the ancient game of go, Machine Learning 2016
28. AlphaGo
Monte-Carlo Tree Search (MCTS)
• The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs.
• The distributed version of AlphaGo ran on multiple machines with 40 search threads, 1,202 CPUs, and 176 GPUs.
Mastering the game of Go with Deep Neural Networks and Tree Search
https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf
AlphaGo: using machine learning to master the ancient game of go, Machine Learning 2016
29. AlphaGo
For the match against Fan Hui, the researchers used:
• A larger network of computers spanning about 170 GPU cards and 1,200 standard processors, or CPUs.
• This larger computer network both trained the system and played the
actual game, drawing on the results of the training.
In a Huge Breakthrough, Google’s AI Beats a Top Player at the Game of Go, Wired 01.26.16
30. AlphaGo Zero
A new version of AlphaGo has emerged:
• AlphaGo learned from training data taken from hundreds of thousands of games played by human experts,
• AlphaGo Zero uses no training data; instead it learns by playing millions of games against itself, improving with each game.
• No human data is needed anymore.
AlphaGo Zero Shows Machines Can Become Superhuman Without Any Help, Intelligent Machines, Will Knight, October 18, 2017.
https://www.technologyreview.com/s/609141/alphago-zero-shows-machines-can-become-superhuman-without-any-help/
31. ML Links / Questions?
• Deeplearning4j: https://deeplearning4j.org/
• TensorFlow: https://www.tensorflow.org/
• Google Cloud Machine Learning: https://cloud.google.com/ml-engine/
• Azure Machine Learning: https://azure.microsoft.com/en-us/services/machine-learning/
• Amazon Machine Learning: https://aws.amazon.com/machine-learning/
• DeepMind: https://deepmind.com/blog/open-sourcing-deepmind-lab/