>>> Building
Intelligent Agents
using
Deep Reinforcement Learning
@aliostad
Ali Kheyrollahi, ASOS
@aliostad
/// Do you has teh codez?
> Slides will be published - check @aliostad
> github page:
https://github.com/aliostad/hexagon-rl
@aliostad
twitter: @aliostad
email: the same handle @gmail.com
http://byterot.blogspot.com
Ali Kheyrollahi,
Solutions Architect at ASOS
@aliostad
@aliostad
/// Take-aways
> mini-history of Reinforcement Learning (RL)
> Basics
> Representations and models
> Putting it all together in Hexagon
@aliostad
/// Deep Learning
@aliostad
/// Deep Learning basics
> A bunch of techniques to overcome problems from the 80s:
- Overfitting: DropOut layers
- Curse of Dimensionality: MOAR data!
- Better training and optimisation techniques
- GPUs and parallel computing to speed up training
> Multi-layer neural networks were described back in the 1950s
> Type of layers, # of units and activation function
@aliostad
/// supervised learning
@aliostad
/// unsupervised learning
Clustering
GAN
word2vec
king + woman - man = queen
@aliostad
/// 2013 - Atari
“We apply our method to seven Atari 2600
games from the Arcade Learning
Environment, with no adjustment of the
architecture or learning algorithm. We find
that it outperforms all previous approaches
on six of the games and surpasses a human
expert on three of them.”
> Deep-Mind
@aliostad
/// 2015 - Go
> DeepMind
Live reactions to Move 37
@aliostad
/// 2016 - Doom
@aliostad
/// 2017 - Dota2
> OpenAI
@aliostad
/// Late 2017 - Chess
> DeepMind
Grandmaster Daniel King on
AlphaZero’s game 10 against Stockfish
https://www.youtube.com/watch?v=Lfkam_oLLM8
@aliostad
/// 1992
Gerald Tesauro - IBM
> TD-Gammon
> Using Temporal Difference
Learning TD-Lambda
> Neural Networks
> Training using Self-Play
> Value Function
@aliostad
/// harry klopf
Harry Klopf
Marvin Minsky
Alan Turing
@aliostad
/// research grant
Rich Sutton, Andrew Barto
> “Goal seeking components for Adaptive Intelligence” 1977
> Cybernetics Center for Systems Neuroscience
at the University of Massachusetts Amherst
> “Synthesis of Nonlinear Control Surfaces by a
Layered Associative Network” 1981
@aliostad
/// progress
1982
1998
@aliostad
/// reinforcement learning
> Neuroscience + Psychology + Control Theory
> Learning with a Critic
[Diagram: the AGENT sends an Action to the ENVIRONMENT, which returns an Observation (state) and a Reward]
@aliostad
/// markov decision process
Markov Decision Process - Wikipedia
@aliostad
/// Value Function v(s)
v(s1) = v(s0) - R
(without discounting, the value of the next state is the current state's value minus the reward R collected along the way)
@aliostad
/// Temporal Difference (TD)
TD error = reward + γv(s’) - v(s), where γ is the discount factor
if the error is zero => reward = v(s) - γv(s’)
Predictive Reward Signal of Dopamine Neurons
- Wolfram Schultz 1998
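A minimal sketch of the TD(0) update behind this formula (toy value table; step size and discount are illustrative):

def td0_update(v, s, reward, s_next, alpha=0.1, gamma=0.99):
    # TD(0): nudge v(s) toward reward + gamma * v(s_next).
    # The correction term is the TD error, which is zero exactly when
    # reward = v(s) - gamma * v(s_next), as on the slide.
    td_error = reward + gamma * v[s_next] - v[s]
    v[s] += alpha * td_error
    return td_error

# usage with a toy value table
v = {"s0": 0.0, "s1": 0.0}
td0_update(v, "s0", reward=1.0, s_next="s1")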
@aliostad
/// Monte-Carlo Tree Search (MCTS)
In MCTS,
γ is 1!
@aliostad
/// Q-Learning
> A form of TD Learning
> Uses Q Function which returns probability
distribution for actions to be drawn from
R L U D F N
0.1 0.2 0.5 0.1 0.0 0.1
Explore vs Exploit
(Greediness)
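A minimal sketch of tabular Q-learning with an epsilon-greedy explore/exploit policy, using the action names from the table above (hyperparameters are illustrative):

import random
from collections import defaultdict

ACTIONS = ["R", "L", "U", "D", "F", "N"]   # as in the table above
Q = defaultdict(float)                     # Q[(state, action)] -> value estimate

def choose_action(state, epsilon=0.1):
    # explore vs exploit: with probability epsilon act randomly, otherwise be greedy
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, alpha=0.1, gamma=0.99):
    # one-step Q-learning: move Q(s,a) toward the TD target r + gamma * max_a' Q(s',a')
    td_target = reward + gamma * max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])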
@aliostad
/// Deep Q Network (DQN)
> Proposed by Atari paper (DeepMind) in 2013
> Uses a Deep Network to map state to
action and uses Q-Learning error to train
> Double Q-learning variant (DeepMind 2015)
> Duelling Networks variant (DeepMind 2015)
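A sketch of the Q-learning error a DQN is trained on, assuming keras-style predict calls; the Double Q-learning tweak is noted in a comment (all names here are illustrative):

import numpy as np

def dqn_targets(model, target_model, states, actions, rewards, next_states, dones,
                gamma=0.99):
    q = model.predict(states)                   # online network: Q(s, .)
    q_next = target_model.predict(next_states)  # target network: Q(s', .)
    targets = q.copy()
    for i, a in enumerate(actions):
        # vanilla DQN takes the max over the target network's values;
        # Double DQN would instead take the argmax from the online network
        # and read that action's value from q_next
        best_next = np.max(q_next[i])
        targets[i, a] = rewards[i] + (0.0 if dones[i] else gamma * best_next)
    return targets  # the network is then trained towards these with an MSE/Huber loss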
@aliostad
/// meet lunar-lander!
> State: (8,)
> Action: (4,)
> Rewards:
- leg touchdown: +10
- crash: -100
- coming to rest: +100
- solved: 200 points
- main engine (per frame): -0.3
Part of OpenAI’s gym
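A minimal sketch of interacting with LunarLander-v2 through the classic gym API (a purely random agent, assuming gym with Box2D is installed):

import gym

env = gym.make("LunarLander-v2")
print(env.observation_space.shape)  # (8,) - the state described above
print(env.action_space.n)           # 4   - do nothing, left engine, main engine, right engine

obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()          # random actions, just to show the loop
    obs, reward, done, info = env.step(action)  # rewards accumulate as listed above
    total_reward += reward
print("episode reward:", total_reward)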
@aliostad
/// keras-rl
> based on OpenAI’s agent/environment interface
> Supports DQN (and its variants), CEM,
SARSA and DDPG algorithms
> Upcoming: ACER, A2C/A3C, PPO and other algorithms
> Works with any keras model as long as the input/output
shapes match: “Bring Your Own Models”
@aliostad
/// DQN in keras-rl - 1
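The original slide showed the code on screen; a minimal sketch along those lines, assuming gym and keras-rl are installed (layer sizes and hyperparameters are illustrative, not necessarily the ones used in the talk):

import gym
from keras.models import Sequential
from keras.layers import Flatten, Dense
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.memory import SequentialMemory
from rl.policy import EpsGreedyQPolicy

env = gym.make("LunarLander-v2")
nb_actions = env.action_space.n

# "Bring Your Own Models": any keras model works if its input/output shapes match
model = Sequential([
    Flatten(input_shape=(1,) + env.observation_space.shape),
    Dense(64, activation="relu"),
    Dense(64, activation="relu"),
    Dense(nb_actions, activation="linear"),
])

dqn = DQNAgent(model=model,
               nb_actions=nb_actions,
               memory=SequentialMemory(limit=50000, window_length=1),
               policy=EpsGreedyQPolicy(eps=0.1),
               nb_steps_warmup=1000,
               target_model_update=1e-2)
dqn.compile(Adam(lr=1e-3), metrics=["mae"])

dqn.fit(env, nb_steps=100000, visualize=False, verbose=1)
dqn.test(env, nb_episodes=5, visualize=True)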
@aliostad
/// DQN in keras-rl -2
[Network diagram: INPUT (state vector, e.g. [0.8, 0.9,…-0.3]) => FLATTEN => DENSE => DENSE => DENSE => OUTPUT (one unit per action, e.g. [0, 0, 1, 0])]
@aliostad
/// lunar-lander with DQN
@aliostad
/// hexagon
> Mainly a coding challenge (playhexagon.com)
> Danske Bank (Vidas)
> A round-based strategy game for 2 or more players:
each starts with one cell and tries to gradually occupy
the board, or to hold more cells than the others when
time runs out.
@aliostad
/// hexagon - start
@aliostad
/// hexagon - expansion
Transferring 70 resources
from the seed cell to the
adjacent neutral cell
@aliostad
/// hexagon - increments
Maroon also transfers 70
resources from its seed cell to
an adjacent neutral cell.
All occupied cells get +1
resource unless they have
100 or more resources.
+1
@aliostad
/// hexagon - attack
Transferring 40 resources from the cell having 58 to the
adjacent enemy cell having 16 results in the attacking
cell having 18 and the attacked cell 40 - 16 = 24.
@aliostad
/// hexagon - boost
Transferring 50 resources from the cell having 100 to the
friendly cell having 4 results in the giving cell having 50
and the boosted cell 4 + 50 = 54.
This helps the cell to protect
against neighbouring enemy
cells having 20 and 25
resources.
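Putting the expansion, attack and boost examples together, a minimal sketch of the transfer rule as shown on these slides (Cell is a hypothetical stand-in with owner and resources; cases the slides do not show, such as an attack that fails to overcome the defender, are left out):

def transfer(src, dst, amount, me):
    src.resources -= amount                     # the giving cell always pays the amount
    if dst.owner == me:
        dst.resources += amount                 # boost:  4 + 50 = 54
    elif dst.owner is None or amount > dst.resources:
        dst.resources = amount - dst.resources  # attack: 40 - 16 = 24 (a neutral cell has 0)
        dst.owner = me                          # the cell changes hands
    # an attack that does not exceed the defender's resources is not covered by the slides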
@aliostad
/// hexagon - neighbourhood
@aliostad
/// hexagon - gameplay
@aliostad
/// hexagon - strategies
- attack all the things!
- defend… build a wall
- flooding
@aliostad
/// hexagon - what to do?
> Attack? From which
cell to which cell?
> Reinforcements?
> How many resources?
@aliostad
/// hexagon - heuristics
self.attackPotential = (self.resources *
    math.sqrt(max(self.resources -
                  safeMin([n.resources for n in self.nonOwns]), 1)) /
    math.log(sum([n.resources for n in self.enemies], 1) + 1, 5))

# how suitable is a cell for receiving boost
self.boostFactor = (math.sqrt(sum((n.resources for n in self.enemies), 1)) *
    safeMax([n.resources for n in self.enemies], 1) /
    (self.resources + 1))

def getGivingBoostSuitability(self):
    return ((self.depth + 1) * math.sqrt(self.resources + 1) *
            (1.7 if self.resources == 100 else 1))
@aliostad
/// hexagon - heuristics
@aliostad
/// hexagon - heuristics??
> Score functions are arbitrary: they do not necessarily
represent the underlying mechanics of the game
> No easy way to learn the parameters, and testing
all combinations is impossible
> When it does not work, it is hard to
know which parameter to tune.
Got to be a better way…
self.attackPotential = (self.resources *
    math.sqrt(max(self.resources -
                  safeMin([n.resources for n in self.nonOwns]), 1)) /
    math.log(sum([n.resources for n in self.enemies], 1) + 1, 5))
@aliostad
/// hexagon - representation
@aliostad
/// hexagon - cell representation
> Own cells are represented by a positive integer (their
resources), enemy cells by a negative integer and
neutral cells by zero (see the sketch below)
> Feature extraction: for every cell extract
- sum/max/min friendly cells
- sum/max/min enemy cells
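A minimal sketch of the cell encoding and per-cell feature extraction described above, assuming each cell exposes owner, resources and a neighbours list (stand-ins, not the real hexagon-rl class names):

def cell_value(cell, me):
    # positive for own cells, negative for enemy cells, zero for neutral
    if cell.owner == me:
        return cell.resources
    if cell.owner is None:
        return 0
    return -cell.resources

def cell_features(cell, me):
    friendly = [n.resources for n in cell.neighbours if n.owner == me]
    enemy = [n.resources for n in cell.neighbours if n.owner not in (me, None)]
    return [sum(friendly), max(friendly, default=0), min(friendly, default=0),
            sum(enemy), max(enemy, default=0), min(enemy, default=0)]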
/// hexagon - board representation
> Flattened: array of cells
> 2D representation so that we can use a Convolutional
Neural Network: Hexagon => Grid
10 -1 0 0 25 -43 -12 3 0 -9
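A sketch of both encodings, assuming the board exposes its cells with axial coordinates q and r; the actual hexagon-to-grid mapping in the repo may differ:

import numpy as np

def flatten_board(board, me):
    # 1D representation: one signed integer per cell, as in the row above
    return np.array([cell_value(c, me) for c in board.cells], dtype=np.int32)

def board_to_grid(board, me, radius):
    # 2D representation for a Conv2D model: shift axial coordinates into a square grid
    size = 2 * radius + 1
    grid = np.zeros((size, size), dtype=np.int32)
    for c in board.cells:
        grid[c.q + radius, c.r + radius] = cell_value(c, me)
    return grid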
@aliostad
/// hexagon - model
> Pure RL models: DQN, PPO, etc
> AlphaZero: Monte Carlo Tree Search (MCTS) + RL
models
@aliostad
/// hexagon - decision tree
Hierarchy of Models and Game Rules
> Centaur:
Replacing parts of the heuristic-
based man-made agent with
machine-learning
- selecting attacker or boosting cell
- choosing attack/boost resources
@aliostad
/// hexagon - repo
> DQN
> AlphaZero (Monte Carlo Tree Search)
> DDPG
> PPO
@aliostad
/// hexagon - alphazero
> Cell representation: 1, -1 and 0 for friendly,
enemy and neutral cells.
> Board representation: Grid mapping of Hexagon
> Action representation: flattened board with 1 for
cells that can attack or boost.
> Deep Learning Model: choice of flat or Conv2D
> Resource quantization: actions include resource
proportions
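An illustrative sketch of the action representation above: a flattened board mask with 1 where one of my cells could attack or boost (can_attack_or_boost is a hypothetical helper, not the repo's actual API):

import numpy as np

def valid_action_mask(board, me):
    # 1 for cells that can act this round, 0 everywhere else
    return np.array([1 if c.owner == me and can_attack_or_boost(c, me) else 0
                     for c in board.cells], dtype=np.int8)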
@aliostad
/// hexagon - alphazero
Training
> default: python hexagon_alphazero train --radius 4
> model: python hexagon_alphazero train -m [f|cm|cam]
Testing
> default: python hexagon_alphazero test -p fm -q a
> quantization: python hexagon_alphazero -p fmz -a a -z 4
> rounds: python hexagon_alphazero -p cmz -q a -x 200
@aliostad
/// C O D E
&
D E M O
@aliostad
Automatic real-time road marking recognition
Hexagon Game: Winter Picture
Researchgate: Convolution Picture
Perceptron Video
AlphaGo vs Lee Sedol: Move 37
AI playing FPS
Hexagon Official Site
hexagon-rl github page
