2. iSense Java Summit
1. Introduction – Richard
2. Preface
3. Self-playing ingredients
4. It’s All About The Data
5. Neural Networks
6. Reinforcement Learning
7. Monte-Carlo Tree Search
8. Questions?
Agenda
3. iSense Java Summit
• Java / Microservices back-end, Angular / Polymer front-end; Java, JavaScript (C, C++, Ada)
• Java / Machine Learning (Education/Interest), https://github.com/richardabbuhl/jmentor (back-propagation and reinforcement learning)
• Love the hype.
• What’s AI done for you?
Introduction – Richard
4. iSense Java Summit
Who is on the top right?
Who is on the bottom right?
What do they have in common with machine learning?
Preface
5. iSense Java Summit
Basic Ingredients
Machine Learning Game-Playing:
- Rules of the game
- Data
- Machine Learning Algorithm
- Search algorithm
6. iSense Java Summit
ML: It’s All About The Data
Early business value
• Customer databases
Current business value
• Big data
• Data warehouses
• Data lakes
Note: ETL (Extract, Transform, and Load)
7. iSense Java Summit
ML: It’s All About The Data
Data set of 30 million moves played by human experts (available at the KGS Go server)
8. iSense Java Summit
ML: It’s All About The Data
What’s wrong with the data?
Errors in Data
Missing Data
Skewed Data
Incomplete Data
AlphaGo: predicted human moves 57% of the time
9. iSense Java Summit
It’s All About The Data (Synthetic Data)
Machine learning algorithms need lots of data (AlphaGo: 30M+ moves; Tic-Tac-Toe: a small sample)
Big data is expensive?
Alternatives?
10. iSense Java Summit
Synthetic Data Set
Approach one: generate a data set
The design goal is to create realistic data
Easy or not?
11. iSense Java Summit
Synthetic Data Set / Self-Playing
Approach two:
Set Machine Learning weights to initial state
For N times:
    Play a game:
        Player one moves (either ML algorithm or random)
        Adjust weights
        Player two moves (either ML algorithm or random)
        Adjust weights
    until done
Random move = synthetic data (anneal the randomness over time); see the Java sketch below
AlphaGo Zero: no human data needed anymore (synthetic / self-playing)
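A minimal Java sketch of this self-play loop; Game, Move, and Model are hypothetical interfaces standing in for real game rules and a real learner (this is illustrative, not the jmentor API):

import java.util.List;
import java.util.Random;

// Hypothetical interfaces standing in for the game rules and the ML algorithm.
interface Move {}
interface Game {
    void reset();
    boolean isDone();
    List<Move> legalMoves();
    void play(Move m);
    Object state();
}
interface Model {
    void initializeWeights();
    Move bestMove(Object state, List<Move> legal);
    void adjustWeights(Object stateAfterMove);
}

public class SelfPlayTrainer {
    public static void train(Game game, Model model, int numGames) {
        Random rnd = new Random();
        model.initializeWeights();                           // weights to initial state
        for (int n = 0; n < numGames; n++) {
            // Anneal exploration over time: mostly random (synthetic) moves at
            // first, mostly moves chosen by the ML algorithm later on.
            double epsilon = 1.0 - (double) n / numGames;
            game.reset();
            while (!game.isDone()) {                         // players alternate until done
                List<Move> legal = game.legalMoves();
                Move move = rnd.nextDouble() < epsilon
                        ? legal.get(rnd.nextInt(legal.size()))   // random move
                        : model.bestMove(game.state(), legal);   // ML move
                game.play(move);
                model.adjustWeights(game.state());           // adjust weights after each move
            }
        }
    }
}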
12. iSense Java Summit
Basics:
• Most Deep Learning is based on back-propagation (1986), which is used to train a neural network to recognize patterns.
• Training is done by presenting pairs of pattern sets, an input Ai and a desired output Bi, to the network:
• Ki = {Ai, Bi}, i = 0, …, p − 1
• Where
• Ai = {Xi,0, …, Xi,n−1}
• Bi = {Yi,0, …, Yi,m−1}
Neural Networks
13. iSense Java Summit
Example:
• For the XOR problem the network is:
• 2 inputs, 8 hidden, 1 output
• A training set is defined as follows (shown as Java arrays below):
• 0.0 0.0 0.9
• 0.0 1.0 -0.9
• 1.0 0.0 -0.9
• 1.0 1.0 0.9
Neural Networks
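For concreteness, the training set above as Java arrays (values copied from the slide; the ±0.9 targets are a common choice for a tanh output unit, since they keep the unit out of its saturated region):

// XOR training set from the slide: two inputs and one target per pattern.
double[][] inputs = {
    {0.0, 0.0},
    {0.0, 1.0},
    {1.0, 0.0},
    {1.0, 1.0}
};
// Here 0.9 encodes XOR = false and -0.9 encodes XOR = true.
double[] targets = {0.9, -0.9, -0.9, 0.9};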
15. iSense Java Summit
Neural Networks
Basics:
• Implemented using a feed-forward multi-layer neural network
16. iSense Java Summit
Basics:
• Training is done as follows:
1. Initialize the weights and thresholds
2. Present training set Ki to the network
3. Calculate the forward pass of the network
4. Compare the network output with the desired output
5. Adapt the weights
6. Calculate the error for the training set
7. Repeat by going to step 2 (*)
(*) Training stops when the error for all training sets is less than 0.01, low enough that the network generalizes; a toy Java version of these steps follows below.
Neural Networks
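A toy Java version of steps 1-7 for the 2-8-1 XOR network, assuming tanh activations and online (per-pattern) weight updates; an illustrative sketch, not the jmentor implementation:

import java.util.Random;

public class XorBackprop {
    static final int IN = 2, HID = 8;
    static final double ALPHA = 0.5;   // learning rate

    public static void main(String[] args) {
        double[][] x = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        double[]   t = {0.9, -0.9, -0.9, 0.9};

        // 1. Initialize the weights and thresholds (biases).
        Random rnd = new Random(42);
        double[][] w1 = new double[HID][IN + 1];   // hidden weights, last column is bias
        double[]   w2 = new double[HID + 1];       // output weights, last entry is bias
        for (double[] row : w1)
            for (int j = 0; j < row.length; j++) row[j] = rnd.nextDouble() - 0.5;
        for (int j = 0; j < w2.length; j++) w2[j] = rnd.nextDouble() - 0.5;

        double error;
        int epochs = 0;
        do {
            error = 0.0;
            for (int p = 0; p < x.length; p++) {   // 2. present each training pattern
                // 3. Forward pass with tanh activations.
                double[] h = new double[HID];
                for (int i = 0; i < HID; i++) {
                    double sum = w1[i][IN];
                    for (int j = 0; j < IN; j++) sum += w1[i][j] * x[p][j];
                    h[i] = Math.tanh(sum);
                }
                double out = w2[HID];
                for (int i = 0; i < HID; i++) out += w2[i] * h[i];
                double y = Math.tanh(out);

                // 4. Compare with the desired output; 5. adapt the weights.
                double dOut = (t[p] - y) * (1 - y * y);            // tanh derivative
                for (int i = 0; i < HID; i++) {
                    double dHid = dOut * w2[i] * (1 - h[i] * h[i]);
                    w2[i] += ALPHA * dOut * h[i];
                    for (int j = 0; j < IN; j++) w1[i][j] += ALPHA * dHid * x[p][j];
                    w1[i][IN] += ALPHA * dHid;                     // hidden bias
                }
                w2[HID] += ALPHA * dOut;                           // output bias

                // 6. Accumulate the squared error for this pass.
                error += 0.5 * (t[p] - y) * (t[p] - y);
            }
        } while (error >= 0.01 && ++epochs < 100_000);  // 7. repeat from step 2
        System.out.println("Done after " + epochs + " epochs, error = " + error);
    }
}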
17. iSense Java Summit
Reinforcement Learning
The term Reinforcement Learning was originally coined by Minsky (1961).
If an action taken by a learning system is followed by a satisfactory state of affairs, then the tendency of the system to produce that particular action is strengthened or reinforced. Otherwise, the tendency of the system to produce that action is weakened.
(Sutton et al., 1991, Barto 1992)
18. iSense Java Summit
Reinforcement Learning
RL differs from supervised learning, where learning is done from examples provided by a knowledgeable external supervisor.
RL attempts to learn from its own experience and has four parts:
• Policy: defines the learning agent’s way of behaving at a given time,
• Reward function: defines the goal of the RL problem,
• Value function: defines what is good in the long run,
• Model: mimics the behavior of the environment
19. iSense Java Summit
Reinforcement Learning
Policy:
• A rule which tells the player which move to make for every state of the game
Values:
• First, set up a table of numbers, one for each state of the game
• Each number is the probability of winning from that state
20. iSense Java Summit
Reinforcement Learning
We play many games against our opponent:
• We examine states which result from each possible move
• We look up their current values in the table
Most of the time:
• We move greedily and select the move which has the highest probability of winning
• However, sometimes we randomly select from other moves
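A small Java sketch of this selection rule, assuming a hypothetical value table keyed by state and a helper that returns the state resulting from a move:

import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

public class MoveSelector {
    // Mostly greedy on the value table; occasionally a random exploratory move.
    public static <S, M> M select(List<M> legalMoves, Map<S, Double> values,
                                  Function<M, S> resultingState,
                                  double epsilon, Random rnd) {
        if (rnd.nextDouble() < epsilon) {
            return legalMoves.get(rnd.nextInt(legalMoves.size()));  // explore
        }
        M best = legalMoves.get(0);
        double bestValue = Double.NEGATIVE_INFINITY;
        for (M move : legalMoves) {
            // Look up the current value of the state this move would lead to;
            // unseen states default to 0.5 (an even chance of winning).
            double v = values.getOrDefault(resultingState.apply(move), 0.5);
            if (v > bestValue) { bestValue = v; best = move; }
        }
        return best;  // greedy: highest probability of winning
    }
}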
21. iSense Java Summit
Reinforcement Learning
When we are playing:
• We adjust the state values using the temporal-difference update (a Java sketch follows below):
• V(s1) ← V(s1) + α [V(s2) − V(s1)]
• s1 is the state before the greedy move
• s2 is the state after the move
• α (alpha) is the step-size parameter, which sets the rate of learning
Number of states for Tic-Tac-Toe: 3 ^ 9 = 19,683
Number of states for Backgammon: 10 ^ 20 = 100,000,000,000,000,000,000
https://github.com/suragnair/alpha-zero-general
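A minimal Java sketch of the temporal-difference update above, assuming states are keyed by a string encoding of the board:

import java.util.HashMap;
import java.util.Map;

// Table of state values plus the TD update; unseen states start at 0.5
// (an even chance of winning).
public class ValueTable {
    private final Map<String, Double> values = new HashMap<>();
    private final double alpha;   // step-size parameter (rate of learning)

    public ValueTable(double alpha) { this.alpha = alpha; }

    public double value(String state) {
        return values.getOrDefault(state, 0.5);
    }

    // V(s1) <- V(s1) + alpha * [V(s2) - V(s1)], applied after a greedy move s1 -> s2.
    public void update(String s1, String s2) {
        double v1 = value(s1);
        values.put(s1, v1 + alpha * (value(s2) - v1));
    }
}

With Tic-Tac-Toe’s 19,683 states such a table fits trivially in memory; Backgammon’s roughly 10^20 states are why a function approximator (a neural network) has to replace the table.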
22. iSense Java Summit
Monte-Carlo Tree Search
1. Selection
Starting at root node R, recursively select optimal child nodes (explained below) until a leaf node L is reached.
2. Expansion
If L is not a terminal node (i.e. it does not end the game), create one or more child nodes and select one, C.
3. Simulation
Run a simulated playout from C until a result is achieved.
4. Backpropagation
Update the current move sequence with the simulation result.
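A compact Java skeleton of the four phases, using the standard UCT formula for the selection step; GameState is a hypothetical interface standing in for real game rules:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical interface standing in for the actual game rules.
interface GameState {
    List<GameState> nextStates();            // one successor state per legal move
    boolean isTerminal();                    // does this state end the game?
    double result();                         // e.g. 1.0 = win, 0.0 = loss
    GameState playRandomlyToEnd(Random rnd); // random playout until the game ends
}

class Node {
    final GameState state;
    final Node parent;
    final List<Node> children = new ArrayList<>();
    int visits = 0;
    double wins = 0.0;

    Node(GameState state, Node parent) { this.state = state; this.parent = parent; }

    // UCT score: exploitation (wins/visits) plus an exploration bonus.
    Node bestChild(double c) {
        Node best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Node child : children) {
            double score = child.visits == 0
                    ? Double.POSITIVE_INFINITY   // always try unvisited children first
                    : child.wins / child.visits
                      + c * Math.sqrt(Math.log(visits) / child.visits);
            if (score > bestScore) { bestScore = score; best = child; }
        }
        return best;
    }
}

public class Mcts {
    private final Random rnd = new Random();

    public Node search(GameState rootState, int iterations) {
        Node root = new Node(rootState, null);
        for (int i = 0; i < iterations; i++) {
            // 1. Selection: from root R, descend via UCT until a leaf L is reached.
            Node node = root;
            while (!node.children.isEmpty()) node = node.bestChild(Math.sqrt(2));

            // 2. Expansion: if L is not terminal, create its children and pick one, C.
            if (!node.state.isTerminal()) {
                for (GameState s : node.state.nextStates()) node.children.add(new Node(s, node));
                node = node.children.get(rnd.nextInt(node.children.size()));
            }

            // 3. Simulation: run a random playout from C until a result is achieved.
            double result = node.state.isTerminal()
                    ? node.state.result()
                    : node.state.playRandomlyToEnd(rnd).result();

            // 4. Backpropagation: update the statistics along the selected move
            // sequence. (This sketch keeps one perspective; a two-player version
            // flips the result at alternating levels of the tree.)
            for (Node n = node; n != null; n = n.parent) {
                n.visits++;
                n.wins += result;
            }
        }
        // Final move choice: the most-visited child of the root.
        Node best = null;
        for (Node child : root.children)
            if (best == null || child.visits > best.visits) best = child;
        return best;
    }
}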
23. iSense Java Summit
Monte-Carlo Tree Search
Monte-Carlo Tree Search (MCTS)
• AlphaGo combines the policy and value networks in an MCTS algorithm that selects actions by lookahead search,
• Note: evaluating the policy and value networks requires several orders of magnitude more computation than traditional search heuristics