Even if AlphaGo’s victory over the world Go champion was dismissed as hype from a one-trick pony, AlphaZero’s ability to learn chess in four hours and beat the strongest chess engine with out-of-this-world techniques has silenced the harshest of critics. DeepMind has established a track record, on a trajectory to conquer ever more complex aspects of the human mind.
But really, how do they do it? While many aspects of their technology remain unpublished, for the most part they use common Machine Learning techniques that can be used to build intelligent agents. In this talk, we not only cover the tools and techniques but also build an agent to play and compete with humans. See if you can beat the machine!
/// Deep Learning basics
> A bunch of techniques that overcame the problems of the 1980s:
- Overfitting: Dropout layers
- Curse of Dimensionality: MOAR data!
- Better training and optimisation techniques
- GPUs and parallel computing to speed up training
> Multi-layer neural networks were described back in the 1950s
> Defined by the type of layers, number of units and activation functions
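A minimal sketch of those pieces in Keras (the library the talk uses later via keras-rl); layer sizes, the dropout rate and the shapes are arbitrary, for illustration:

# Illustrative only: layer types, unit counts, activation functions,
# and a Dropout layer to fight overfitting.
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(64, activation='relu', input_shape=(8,)),  # hidden layer: 64 units
    Dropout(0.5),                                    # randomly silence units during training
    Dense(64, activation='relu'),
    Dense(4, activation='linear'),                   # output layer
])
model.compile(optimizer='adam', loss='mse')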
/// 2013 - Atari
“We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.”
> DeepMind
/// Late 2017 - Chess
> DeepMind
Grandmaster Daniel King on AlphaZero’s game 10 against Stockfish:
https://www.youtube.com/watch?v=Lfkam_oLLM8
/// 1992
Gerald Tesauro - IBM
> TD-Gammon
> Using Temporal Difference Learning (TD-Lambda)
> Neural Networks
> Training using Self-Play
> Value Function
/// research grant
Rich Sutton & Andrew Barto
> “Goal seeking components for Adaptive Intelligence” 1977
> Cybernetics Center for Systems Neuroscience at the University of Massachusetts Amherst
> “Synthesis of Nonlinear Control Surfaces by a Layered Associative Network” 1981
/// Temporal Difference (TD)
TD error: δ = r + γ·v(s′) − v(s); if the error is zero => reward = v(s) − γ·v(s′)
where γ is the discount factor
“Predictive Reward Signal of Dopamine Neurons” - Wolfram Schultz, 1998
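Setting the TD error to zero gives exactly the identity above. A minimal tabular TD(0) update (the dict of state values and all names are mine, for illustration):

# Tabular TD(0): nudge v(s) toward the bootstrapped target r + gamma * v(s').
def td0_update(v, s, r, s_next, alpha=0.1, gamma=0.99):
    td_error = r + gamma * v[s_next] - v[s]  # zero error <=> reward = v(s) - gamma * v(s')
    v[s] += alpha * td_error                 # alpha is the learning rate
    return td_error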
/// Q-Learning
> A form of TD Learning
> Uses a Q function which returns a probability distribution for actions to be drawn from:
Action:       R    L    U    D    F    N
Probability: 0.1  0.2  0.5  0.1  0.0  0.1
> Explore vs Exploit (Greediness)
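A sketch of tabular Q-learning with ε-greedy exploration (names and defaults are mine; Q maps (state, action) pairs to values):

import random

def choose_action(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:                           # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))   # exploit (greedy)

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)  # TD error on Q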
/// Deep Q Network (DQN)
> Proposed by Atari paper (DeepMind) in 2013
> Uses a deep network to map state to action values, trained on the Q-learning error
> Double Q-learning variant (DeepMind 2015)
> Duelling Networks variant (DeepMind 2015)
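A sketch of the targets a DQN fits against, with the Double Q-learning variant commented in (network and variable names are mine, not DeepMind's code):

import numpy as np

# q_net and target_net: Keras models mapping a batch of states to Q-values.
# actions is an int array, dones a 0/1 array marking terminal transitions.
def dqn_targets(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    q = q_net.predict(states)                                    # (batch, n_actions)
    next_q = target_net.predict(next_states)
    target = rewards + gamma * (1 - dones) * next_q.max(axis=1)  # standard DQN
    # Double DQN: pick the action with the online net, score it with the target net
    # best = q_net.predict(next_states).argmax(axis=1)
    # target = rewards + gamma * (1 - dones) * next_q[np.arange(len(actions)), best]
    q[np.arange(len(actions)), actions] = target                 # Q-learning error drives the fit
    return q                                                     # then q_net.fit(states, q)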
/// meet lunar-lander!
> State: (8,)
> Action: (4,)
> Rewards:
- leg touchdown: +10
- crash: -100
- coming to rest: +100
- solved: 200 points
- main engine firing: -0.3 per frame
Part of OpenAI’s gym
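The agent/environment loop with a random policy (the classic pre-0.26 Gym API, matching the era of the talk):

import gym

env = gym.make('LunarLander-v2')
state = env.reset()                      # shape (8,): position, velocity, angle, leg contacts
done, total = False, 0.0
while not done:
    action = env.action_space.sample()   # one of 4 discrete actions
    state, reward, done, info = env.step(action)
    total += reward
print('episode reward:', total)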
/// keras-rl
> Based on OpenAI’s agent/environment interface
> Supports DQN (and its variants), CEM, SARSA and DDPG algorithms
> Upcoming: ACER, A2C/A3C, PPO, etc.
> Uses any Keras model as long as the input/output shapes match: “Bring Your Own Models”
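Putting it together for lunar-lander: a standard keras-rl DQN setup (hyperparameters are illustrative, not tuned):

import gym
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

env = gym.make('LunarLander-v2')
nb_actions = env.action_space.n

# "Bring Your Own Models": any Keras model works if the shapes match
model = Sequential([
    Flatten(input_shape=(1,) + env.observation_space.shape),
    Dense(64, activation='relu'),
    Dense(64, activation='relu'),
    Dense(nb_actions, activation='linear'),
])

agent = DQNAgent(model=model, nb_actions=nb_actions,
                 memory=SequentialMemory(limit=50000, window_length=1),
                 policy=EpsGreedyQPolicy(eps=0.1), nb_steps_warmup=1000)
agent.compile(Adam(lr=1e-3), metrics=['mae'])
agent.fit(env, nb_steps=50000, verbose=1)
agent.test(env, nb_episodes=5)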
/// hexagon
> Mainly a coding challenge (playhexagon.com)
> Danske Bank (Vidas)
> A round-based strategy game for 2 or more players: start with one cell and gradually occupy the board, or have more cells when time runs out.
/// hexagon - expansion
Transferring 70 resources from the seed cell to the adjacent neutral cell.
/// hexagon - increments
Maroon also transfers 70 resources from its seed cell to an adjacent neutral cell. All occupied cells get +1 resource unless they already have 100 or more resources.
/// hexagon - attack
Transferring 40 resources from the cell holding 58 to the adjacent enemy cell holding 16 leaves the attacker with 18 and the attacked cell with 40-16=24.
/// hexagon - boost
Transferring 50 resources from the cell holding 100 to the friendly cell holding 4 leaves the giver with 50 and the boosted cell with 4+50=54.
This helps the cell defend against the neighbouring enemy cells holding 20 and 25 resources.
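The three moves above reduce to simple arithmetic; a sketch under my reading of the slides, with cells as plain dicts (this is not the actual playhexagon.com engine):

def transfer(src, dst, amount, player):
    src['resources'] -= amount
    if dst['owner'] == player:                 # boost: resources simply add up, e.g. 4 + 50 = 54
        dst['resources'] += amount
    elif dst['owner'] is None:                 # expansion into a neutral cell
        dst['owner'] = player
        dst['resources'] = amount
    else:                                      # attack: the defender's resources absorb the hit
        remainder = amount - dst['resources']  # e.g. 40 - 16 = 24 captures the cell
        if remainder > 0:
            dst['owner'] = player
        dst['resources'] = abs(remainder)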
/// hexagon - what to do?
> Attack? From which cell to which cell?
> Reinforcements?
> How many resources?
/// hexagon - heuristics
# how promising is this cell as an attacker
self.attackPotential = (self.resources *
    math.sqrt(max(self.resources -
                  safeMin([n.resources for n in self.nonOwns]), 1)) /
    math.log(sum([n.resources for n in self.enemies], 1) + 1, 5))

# how suitable is a cell for receiving boost
self.boostFactor = (math.sqrt(sum((n.resources for n in self.enemies), 1)) *
    safeMax([n.resources for n in self.enemies], 1) /
    (self.resources + 1))

# how suitable is a cell for giving boost
def getGivingBoostSuitability(self):
    return ((self.depth + 1) * math.sqrt(self.resources + 1) *
            (1.7 if self.resources == 100 else 1))
/// hexagon - heuristics??
> Score functions are arbitrary: they do not necessarily represent the underlying mechanics of the game
> No easy way to learn the parameters, and testing all combinations is impossible
> When it does not work, it is hard to know which parameter to tune.
Got to be a better way…
self.attackPotential = (self.resources *
    math.sqrt(max(self.resources -
                  safeMin([n.resources for n in self.nonOwns]), 1)) /
    math.log(sum([n.resources for n in self.enemies], 1) + 1, 5))
/// hexagon - cell representation
> Own cells are represented by a positive integer (their resources), enemy cells by a negative integer, neutral cells by zero
> Feature extraction: for every cell extract
- sum/max/min of friendly cells
- sum/max/min of enemy cells
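A sketch of that per-cell feature extraction (the Cell type and the neighbour list are assumptions, for illustration):

from collections import namedtuple

Cell = namedtuple('Cell', 'resources')   # own > 0, enemy < 0, neutral == 0

def cell_features(cell, neighbours):
    friendly = [n.resources for n in neighbours if n.resources > 0]
    enemy = [-n.resources for n in neighbours if n.resources < 0]
    agg = lambda xs: (sum(xs), max(xs, default=0), min(xs, default=0))
    return (cell.resources, *agg(friendly), *agg(enemy))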
/// hexagon - board representation
> Flattened: array of cells
> 2D representation so that we can use a Convolutional Neural Network: Hexagon => Grid
10 -1 0 0 25 -43 -12 3 0 -9
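Shape-wise, the two representations look like this (the hexagon-to-grid mapping is game-specific; cells are laid out row by row here purely for illustration):

import numpy as np

cells = [10, -1, 0, 0, 25, -43, -12, 3, 0, -9]    # the example above
flat = np.array(cells, dtype=np.float32)          # input to a flat/dense model
grid = np.zeros((3, 4), dtype=np.float32)         # hypothetical grid big enough for 10 cells
grid.flat[:len(cells)] = cells
conv_input = grid[np.newaxis, :, :, np.newaxis]   # (batch, rows, cols, channels) for Conv2D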
/// hexagon - model
> Pure RL models: DQN, PPO, etc
> AlphaZero: Monte Carlo Tree Search (MCTS) + RL models
/// hexagon - decision tree
Hierarchy of models and game rules
> Centaur: replacing parts of the heuristic-based, man-made agent with machine learning
- selecting the attacking or boosting cell
- choosing attack/boost resources
/// hexagon - alphazero
> Cell representation: 1, -1 and 0 for friendly, enemy and neutral cells
> Board representation: Grid mapping of Hexagon
> Action representation: flattened board with 1 for cells that can attack or boost
> Deep Learning Model: choice of flat or Conv2D
> Resource quantization: actions include resource proportions
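Roughly, putting those representations together (board size, the mask rule and the quantization layout are my assumptions, not the actual hexagon_alphazero code):

import numpy as np

board = np.array([[1, -1, 0],
                  [0, 1, -1],
                  [0, 0, 1]], dtype=np.int8)       # 1 friendly, -1 enemy, 0 neutral

# action mask: 1 for (friendly) cells that may attack or boost this turn
mask = (board == 1).astype(np.float32).flatten()

# resource quantization: each cell expands into z actions,
# e.g. transfer 1/4, 2/4, 3/4 or 4/4 of its resources for z = 4
z = 4
action_space = np.repeat(mask, z)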
/// hexagon - alphazero
Training:
> default: python hexagon_alphazero train --radius 4
> model: python hexagon_alphazero train -m [f|cm|cam]
Testing:
> default: python hexagon_alphazero test -p fm -q a
> quantization: python hexagon_alphazero test -p fmz -q a -z 4
> rounds: python hexagon_alphazero test -p cmz -q a -x 200