Even if AlphaGo’s victory over the world Go champion was dismissed as hype from a one-trick pony, AlphaZero’s ability to learn chess in four hours and beat the strongest chess engine with out-of-this-world techniques has silenced the harshest of critics. DeepMind has established a track record, on a trajectory to conquer ever more complex aspects of the human mind.
But really, how do they do it? While many aspects of their technology remain unpublished, for the most part they use common Machine Learning techniques that can be used to build intelligent agents. In this talk, we not only cover the tools and techniques but also build an agent to play and compete with humans. See if you can beat the machine!
/// Deep Learning basics
> A bunch of techniques that overcame the problems of the 1980s:
- Overfitting: Dropout layers
- Curse of Dimensionality: MOAR data!
- Better training and optimisation techniques
- GPUs and parallel computing to speed up training
> Multi-layer neural networks were described back in the 1950s
> Defined by the type of layers, number of units and activation functions
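A minimal sketch of those pieces in Keras (the library the talk uses later via keras-rl); layer sizes, the dropout rate and the shapes are arbitrary, for illustration:

# Illustrative only: layer types, unit counts, activation functions,
# and a Dropout layer to fight overfitting.
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(64, activation='relu', input_shape=(8,)),  # hidden layer: 64 units
    Dropout(0.5),                                    # randomly silence units during training
    Dense(64, activation='relu'),
    Dense(4, activation='linear'),                   # output layer
])
model.compile(optimizer='adam', loss='mse')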
/// 2013 - Atari
“We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.”
> DeepMind
/// Late 2017 - Chess
> DeepMind
Grandmaster Daniel King on AlphaZero’s game 10 against Stockfish:
https://www.youtube.com/watch?v=Lfkam_oLLM8
/// 1992
Gerald Tesauro - IBM
> TD-Gammon
> Using Temporal Difference Learning (TD-Lambda)
> Neural Networks
> Training using Self-Play
> Value Function
/// research grant
Rich Sutton & Andrew Barto
> “Goal seeking components for Adaptive Intelligence” 1977
> Cybernetics Center for Systems Neuroscience at the University of Massachusetts Amherst
> “Synthesis of Nonlinear Control Surfaces by a Layered Associative Network” 1981
/// Temporal Difference (TD)
TD error: δ = r + γ·v(s′) − v(s); if the error is zero => reward = v(s) − γ·v(s′)
where γ is the discount factor
“Predictive Reward Signal of Dopamine Neurons” - Wolfram Schultz, 1998
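Setting the TD error to zero gives exactly the identity above. A minimal tabular TD(0) update (the dict of state values and all names are mine, for illustration):

# Tabular TD(0): nudge v(s) toward the bootstrapped target r + gamma * v(s').
def td0_update(v, s, r, s_next, alpha=0.1, gamma=0.99):
    td_error = r + gamma * v[s_next] - v[s]  # zero error <=> reward = v(s) - gamma * v(s')
    v[s] += alpha * td_error                 # alpha is the learning rate
    return td_error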
/// Q-Learning
> A form of TD Learning
> Uses a Q function which returns a probability distribution for actions to be drawn from:
Action:       R    L    U    D    F    N
Probability: 0.1  0.2  0.5  0.1  0.0  0.1
> Explore vs Exploit (Greediness)
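A sketch of tabular Q-learning with ε-greedy exploration (names and defaults are mine; Q maps (state, action) pairs to values):

import random

def choose_action(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:                           # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))   # exploit (greedy)

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)  # TD error on Q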
/// Deep Q Network (DQN)
> Proposed by Atari paper (DeepMind) in 2013
> Uses a deep network to map state to action values, trained on the Q-learning error
> Double Q-learning variant (DeepMind 2015)
> Duelling Networks variant (DeepMind 2015)
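A sketch of the targets a DQN fits against, with the Double Q-learning variant commented in (network and variable names are mine, not DeepMind's code):

import numpy as np

# q_net and target_net: Keras models mapping a batch of states to Q-values.
# actions is an int array, dones a 0/1 array marking terminal transitions.
def dqn_targets(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    q = q_net.predict(states)                                    # (batch, n_actions)
    next_q = target_net.predict(next_states)
    target = rewards + gamma * (1 - dones) * next_q.max(axis=1)  # standard DQN
    # Double DQN: pick the action with the online net, score it with the target net
    # best = q_net.predict(next_states).argmax(axis=1)
    # target = rewards + gamma * (1 - dones) * next_q[np.arange(len(actions)), best]
    q[np.arange(len(actions)), actions] = target                 # Q-learning error drives the fit
    return q                                                     # then q_net.fit(states, q)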
/// meet lunar-lander!
> State: (8,)
> Action: (4,)
> Rewards:
- leg touchdown: +10
- crash: -100
- coming to rest: +100
- solved: 200 points
- main engine firing: -0.3 per frame
Part of OpenAI’s gym
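The agent/environment loop with a random policy (the classic pre-0.26 Gym API, matching the era of the talk):

import gym

env = gym.make('LunarLander-v2')
state = env.reset()                      # shape (8,): position, velocity, angle, leg contacts
done, total = False, 0.0
while not done:
    action = env.action_space.sample()   # one of 4 discrete actions
    state, reward, done, info = env.step(action)
    total += reward
print('episode reward:', total)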
/// keras-rl
> Based on OpenAI’s agent/environment interface
> Supports DQN (and its variants), CEM, SARSA and DDPG algorithms
> Upcoming: ACER, A2C/A3C, PPO, etc.
> Uses any Keras model as long as the input/output shapes match: “Bring Your Own Models”
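Putting it together for lunar-lander: a standard keras-rl DQN setup (hyperparameters are illustrative, not tuned):

import gym
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

env = gym.make('LunarLander-v2')
nb_actions = env.action_space.n

# "Bring Your Own Models": any Keras model works if the shapes match
model = Sequential([
    Flatten(input_shape=(1,) + env.observation_space.shape),
    Dense(64, activation='relu'),
    Dense(64, activation='relu'),
    Dense(nb_actions, activation='linear'),
])

agent = DQNAgent(model=model, nb_actions=nb_actions,
                 memory=SequentialMemory(limit=50000, window_length=1),
                 policy=EpsGreedyQPolicy(eps=0.1), nb_steps_warmup=1000)
agent.compile(Adam(lr=1e-3), metrics=['mae'])
agent.fit(env, nb_steps=50000, verbose=1)
agent.test(env, nb_episodes=5)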
/// hexagon
> Mainly a coding challenge (playhexagon.com)
> Danske Bank (Vidas)
> A round-based strategy game for 2 or more players: start with one cell and gradually occupy the board, or have more cells when time runs out.
/// hexagon - expansion
Transferring 70 resources from the seed cell to the adjacent neutral cell.
/// hexagon - increments
Maroon also transfers 70 resources from its seed cell to an adjacent neutral cell. All occupied cells get +1 resource unless they already have 100 or more resources.
/// hexagon - attack
Transferring 40 resources from the cell holding 58 to the adjacent enemy cell holding 16 leaves the attacker with 18 and the attacked cell with 40-16=24.
/// hexagon - boost
Transferring 50 resources from the cell holding 100 to the friendly cell holding 4 leaves the giver with 50 and the boosted cell with 4+50=54.
This helps the cell defend against the neighbouring enemy cells holding 20 and 25 resources.
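The three moves above reduce to simple arithmetic; a sketch under my reading of the slides, with cells as plain dicts (this is not the actual playhexagon.com engine):

def transfer(src, dst, amount, player):
    src['resources'] -= amount
    if dst['owner'] == player:                 # boost: resources simply add up, e.g. 4 + 50 = 54
        dst['resources'] += amount
    elif dst['owner'] is None:                 # expansion into a neutral cell
        dst['owner'] = player
        dst['resources'] = amount
    else:                                      # attack: the defender's resources absorb the hit
        remainder = amount - dst['resources']  # e.g. 40 - 16 = 24 captures the cell
        if remainder > 0:
            dst['owner'] = player
        dst['resources'] = abs(remainder)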
/// hexagon - what to do?
> Attack? From which cell to which cell?
> Reinforcements?
> How many resources?
/// hexagon - heuristics
# how promising is this cell as an attacker
self.attackPotential = (self.resources *
    math.sqrt(max(self.resources -
                  safeMin([n.resources for n in self.nonOwns]), 1)) /
    math.log(sum([n.resources for n in self.enemies], 1) + 1, 5))

# how suitable is a cell for receiving boost
self.boostFactor = (math.sqrt(sum((n.resources for n in self.enemies), 1)) *
    safeMax([n.resources for n in self.enemies], 1) /
    (self.resources + 1))

# how suitable is a cell for giving boost
def getGivingBoostSuitability(self):
    return ((self.depth + 1) * math.sqrt(self.resources + 1) *
            (1.7 if self.resources == 100 else 1))
/// hexagon - heuristics??
> Score functions are arbitrary: they do not necessarily represent the underlying mechanics of the game
> No easy way to learn the parameters, and testing all combinations is impossible
> When it does not work, it is hard to know which parameter to tune.
Got to be a better way…
self.attackPotential = (self.resources *
    math.sqrt(max(self.resources -
                  safeMin([n.resources for n in self.nonOwns]), 1)) /
    math.log(sum([n.resources for n in self.enemies], 1) + 1, 5))
/// hexagon - cell representation
> Own cells are represented by a positive integer (their resources), enemy cells by a negative integer, neutral cells by zero
> Feature extraction: for every cell extract
- sum/max/min of friendly cells
- sum/max/min of enemy cells
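A sketch of that per-cell feature extraction (the Cell type and the neighbour list are assumptions, for illustration):

from collections import namedtuple

Cell = namedtuple('Cell', 'resources')   # own > 0, enemy < 0, neutral == 0

def cell_features(cell, neighbours):
    friendly = [n.resources for n in neighbours if n.resources > 0]
    enemy = [-n.resources for n in neighbours if n.resources < 0]
    agg = lambda xs: (sum(xs), max(xs, default=0), min(xs, default=0))
    return (cell.resources, *agg(friendly), *agg(enemy))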
/// hexagon - board representation
> Flattened: array of cells
> 2D representation so that we can use a Convolutional Neural Network: Hexagon => Grid
10 -1 0 0 25 -43 -12 3 0 -9
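Shape-wise, the two representations look like this (the hexagon-to-grid mapping is game-specific; cells are laid out row by row here purely for illustration):

import numpy as np

cells = [10, -1, 0, 0, 25, -43, -12, 3, 0, -9]    # the example above
flat = np.array(cells, dtype=np.float32)          # input to a flat/dense model
grid = np.zeros((3, 4), dtype=np.float32)         # hypothetical grid big enough for 10 cells
grid.flat[:len(cells)] = cells
conv_input = grid[np.newaxis, :, :, np.newaxis]   # (batch, rows, cols, channels) for Conv2D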
/// hexagon - model
> Pure RL models: DQN, PPO, etc
> AlphaZero: Monte Carlo Tree Search (MCTS) + RL models
/// hexagon - decision tree
Hierarchy of models and game rules
> Centaur: replacing parts of the heuristic-based, man-made agent with machine learning
- selecting the attacking or boosting cell
- choosing attack/boost resources
/// hexagon - alphazero
> Cell representation: 1, -1 and 0 for friendly, enemy and neutral cells
> Board representation: Grid mapping of Hexagon
> Action representation: flattened board with 1 for cells that can attack or boost
> Deep Learning Model: choice of flat or Conv2D
> Resource quantization: actions include resource proportions
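Roughly, putting those representations together (board size, the mask rule and the quantization layout are my assumptions, not the actual hexagon_alphazero code):

import numpy as np

board = np.array([[1, -1, 0],
                  [0, 1, -1],
                  [0, 0, 1]], dtype=np.int8)       # 1 friendly, -1 enemy, 0 neutral

# action mask: 1 for (friendly) cells that may attack or boost this turn
mask = (board == 1).astype(np.float32).flatten()

# resource quantization: each cell expands into z actions,
# e.g. transfer 1/4, 2/4, 3/4 or 4/4 of its resources for z = 4
z = 4
action_space = np.repeat(mask, z)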
/// hexagon - alphazero
Training:
> default: python hexagon_alphazero train --radius 4
> model: python hexagon_alphazero train -m [f|cm|cam]
Testing:
> default: python hexagon_alphazero test -p fm -q a
> quantization: python hexagon_alphazero test -p fmz -q a -z 4
> rounds: python hexagon_alphazero test -p cmz -q a -x 200