A Brief Survey of Reinforcement Learning
Short walk-through on building learning agents
Giancarlo Frison
December 12, 2018
Table of contents i
1. Introduction
2. Genetic Programming
3. Multi-armed Bandit
4. Markov Decision Process
5. Neural Networks in RL
gfrison 2018 1
Table of contents ii
6. Actor-Critic
7. Deep Deterministic Policy Gradient
8. Asgard
gfrison 2018 2
Introduction
What is Reinforcement Learning
[9]
Reinforcement learning is an area of machine learning concerned with how
software agents ought to take actions in an environment so as to maximize some
notion of cumulative reward.
gfrison 2018 3
Challenges in RL
RL vision is about creating systems that are capable of learning how to adapt in
the real world, solely based on trial-and-error. Challenges:
Reward signal
The optimal policy must be inferred by interacting with the environment. The only
learning signal the agent receives is the reward.
Credit assignment problem
Agents must deal with long-range time dependencies: often the consequences of
an action only materialise after many transitions of the environment [8].
Non-stationary environments
Especially in multi-agent environments, the same state st and action at may lead
to different outcomes.
gfrison 2018 4
Difference from other optimizations
There are major differences with respect to:
• Supervised learning
• Mathematical optimization
• Genetic programming
gfrison 2018 5
Genetic Programming
Genetic Programming
GP is a technique whereby computer programs are encoded as a set of genes that
are then modified using an evolutionary algorithm.
• A number of chromosomes are randomly created.
• Each chromosome is evaluated through a fitness function.
• The best ones are selected; the others are discarded.
• Selected chromosomes may be bred together for a new generation.
• Offspring are randomly mutated.
• Repeat until the score threshold is reached [3] (a minimal sketch of this loop follows below).
gfrison 2018 6
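As a rough illustration of the loop above, here is a minimal Python sketch; the real-valued chromosomes, the quadratic fitness function and all hyperparameters are hypothetical stand-ins, not the AST-based setup of the next slide.

    import random

    def fitness(chromosome):
        # hypothetical fitness: maximise -sum(x^2), optimum at the zero vector
        return -sum(x * x for x in chromosome)

    def evolve(pop_size=50, genes=5, generations=100, mutation_rate=0.1):
        population = [[random.uniform(-5, 5) for _ in range(genes)] for _ in range(pop_size)]
        for _ in range(generations):
            population.sort(key=fitness, reverse=True)    # evaluate through the fitness function
            survivors = population[:pop_size // 2]        # best ones are selected, others discarded
            offspring = []
            while len(survivors) + len(offspring) < pop_size:
                a, b = random.sample(survivors, 2)        # crossover between two selected parents
                cut = random.randrange(1, genes)
                child = a[:cut] + b[cut:]
                if random.random() < mutation_rate:       # offspring are randomly mutated
                    child[random.randrange(genes)] += random.gauss(0, 0.5)
                offspring.append(child)
            population = survivors + offspring
        return max(population, key=fitness)

    print(evolve())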
Crossover
Figure 1: The “breeding” is called crossover. The chromosome (in this case an AST) is
merged between two individuals in search of a better function.
gfrison 2018 7
Strengths on many local minima
GP may outperform gradient-based algorithms when the solution space has many
local minima.
Figure 2: Rastrigin function
gfrison 2018 8
Strengths on non-differentiable functions
f(x) is differentiable when it has a finite derivative everywhere in its domain.
Figure 3: f(x) = |x| has no derivative at x = 0.
GP is tolerant of latent non-differentiable functions.
gfrison 2018 9
Multi-armed Bandit
Multi-armed Bandit
The multi-armed bandit problem has been the subject of decades of intense
study in statistics, operations research, A/B testing, computer science, and
economics [12]
gfrison 2018 10
Q Value’s Action
Figure 4: Measures obtained after repeated pulls with 10 arms [8].
gfrison 2018 11
Q Value’s Action
Qn is the estimated value of its action after n selections.
Qn+1 = (R1 + R2 + ... + Rn) / n
A more scalable formula updates the average incrementally, with a small step size:
Qn+1 = Qn + (1/n)(Rn − Qn)
This is the general expression of the bandit algorithm, a recurrent pattern in RL (Target
can be taken as the reward R for now):
NewEstimate = OldEstimate + StepSize(Target − OldEstimate)
gfrison 2018 12
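A minimal sketch of the incremental update above, with hypothetical Bernoulli payout probabilities and purely random arm selection:

    import random

    true_means = [0.2, 0.5, 0.8]      # hypothetical payout probability of each arm
    Q = [0.0] * len(true_means)       # estimated value Qn per arm
    N = [0] * len(true_means)         # number of pulls per arm

    for t in range(10000):
        a = random.randrange(len(Q))  # pull an arm (uniformly at random, exploration only)
        reward = 1.0 if random.random() < true_means[a] else 0.0
        N[a] += 1
        # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate), StepSize = 1/n
        Q[a] += (1.0 / N[a]) * (reward - Q[a])

    print(Q)                          # the estimates approach the true means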
Gambler’s Dilemma
When pulled, an arm produces a random payout drawn independently of the past.
Because the distribution of payouts corresponding to each arm is not listed, the
player can learn it only by experimenting.
Exploitation
Earn more money by exploiting arms that yielded high payouts in the past.
Exploration
Exploring alternative arms may return higher payouts in the future.
There is a tradeoff between exploration and regret.
gfrison 2018 13
Markov Decision Process
Definition of MDP
Figure 5: 2015 DARPA Robotics Challenge [6]
Unlike bandits, an MDP formalizes decision making (policy π) over sequential
steps, aggregated into episodes.
gfrison 2018 14
Actions and States
Figure 6: Model representation of MDP [5]
MDP strives to find the best π for all possible states. In Markov processes, the
selected action depends only on the current state.
gfrison 2018 15
Bellman equation
In a multi-step process the total return G is the discounted sum of rewards, from
the start state to the terminal state:
Gπ = r0 + γ r1 + γ² r2 + ... + γⁿ rn
Bellman equation: the value of a state Vs is the averaged reward obtained in that
state plus the discounted sum of the values following π thereafter.
The goal of RL is to find the optimal π with the highest G:
π = argmax Σs (r + γ vπ(s))
gfrison 2018 16
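A short sketch of the discounted return G for one episode; the reward list and γ are hypothetical:

    def discounted_return(rewards, gamma=0.9):
        # G = r0 + gamma*r1 + gamma^2*r2 + ..., accumulated backwards
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        return g

    print(discounted_return([-1, -1, -1, 0]))   # e.g. a short Grid World episode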
Grid World
 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 
• A = up, down, right, left
• Terminal states are the
flagged boxes.
• Rt = −1 for all states.
The problem is to define the best π. The value function is computed by iterative
policy evaluation (a minimal sketch follows below).
gfrison 2018 17
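A minimal sketch of iterative policy evaluation for this Grid World, assuming the equiprobable random policy and γ = 1 (neither is stated on the slide); three sweeps give values matching the following slides up to rounding:

    N = 4
    terminals = {(0, 0), (3, 3)}                       # the flagged corner boxes
    actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]       # up, down, left, right
    V = [[0.0] * N for _ in range(N)]

    def step(r, c, dr, dc):
        # moves that would leave the grid keep the agent in place
        nr, nc = r + dr, c + dc
        return (nr, nc) if 0 <= nr < N and 0 <= nc < N else (r, c)

    for sweep in range(3):
        new_V = [[0.0] * N for _ in range(N)]
        for r in range(N):
            for c in range(N):
                if (r, c) in terminals:
                    continue
                # average over the four equiprobable actions, reward -1, gamma = 1
                new_V[r][c] = sum(0.25 * (-1.0 + V[nr][nc])
                                  for nr, nc in (step(r, c, dr, dc) for dr, dc in actions))
        V = new_V

    for row in V:
        print([round(v, 1) for v in row])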
Iteration 1
Calculated V1 Policy π1
0 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 0
 ← ? ?
↑ ? ? ?
? ? ? ↓
? ? → 
gfrison 2018 18
Iteration 2
Calculated V2 Policy π2
0 -1.7 -2 -2
-1.7 -2 -2 -2
-2 -2 -2 -1.7
-2 -2 -1.7 0
 ← ← ?
↑ ? ↓
↑ ? ↓
? → → 
gfrison 2018 19
Iteration 3
Calculated V3 Policy π3
0 -2.4 -2.9 -3
-2.4 -2.9 -3 -2.9
-2.9 -3 -2.9 -2.4
-3 -2.9 -2.4 0
 ← ←
↑ ↓
↑ ↓
→ → 
gfrison 2018 20
Temporal difference
Model-free learning
Learning comes only through raw experience (episodes), without prior knowledge
of the environment.
Bootstrapping
Value estimations are based partly on other learned estimations.
Bootstrapping helps to increase the stability of estimations.
V(st) = V(st) + α(rt+1 + γV(st+1) − V(st))
gfrison 2018 21
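A minimal sketch of this tabular TD(0) update; the states, values, α and γ are hypothetical:

    def td0_update(V, s, reward, s_next, alpha=0.1, gamma=0.9):
        # V(st) <- V(st) + alpha * (r_{t+1} + gamma * V(s_{t+1}) - V(st))
        V[s] += alpha * (reward + gamma * V[s_next] - V[s])
        return V

    V = {"A": 0.0, "B": 0.5}                     # hypothetical state values
    print(td0_update(V, "A", reward=1.0, s_next="B"))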
Neural Networks in RL
Beyond the Gridworld
In the Gridworld every state value vt is stored in a table. The approach lacks
scalability and is inherently limited to fairly low-dimensional problems.
Figure 7: The state vt might be a frame of a videogame [11]
gfrison 2018 22
Deep Reinforcement Learning
Deep learning enables RL to scale to decision-making problems that were
previously intractable, i.e., settings with high-dimensional state and action
spaces. [1]
• Universal function approximator Multilayer neural networks can approximate
any continuous function to any degree of accuracy [4].
• Representation learning Automatically discover the representations needed
for feature detection or classification from raw data, also known as feature
engineering [2].
Neural networks can learn the approximated value of Vs or Q(s, a) for any given
state/action space.
gfrison 2018 23
Discrete space
When the action space is limited by a set of options, (e.g. turn right, le t) the
output layer could be a set of neurons as the number of possible actions.
A1 A2 A3 A4
0.21 0.27 0.30 0.21
Figure 8: Output layer with softmax
The chosen action is the one with the highest Qt (A3).
gfrison 2018 24
Exploration in discrete space
The balance between exploration and exploitation is one of the most
challenging tasks in RL. The most common approach is ε-greedy:
π(s) = { random action, if ξ < ϵ
       { argmax Q(s, a), otherwise    (1)
With ε-greedy, at each time step, the agent selects a random action with a fixed
probability, 0 ≤ ϵ ≤ 1, instead of selecting the learned optimal action.
gfrison 2018 25
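A minimal sketch of ε-greedy action selection over discrete Q values; the values below are taken from Figure 8, and ε = 0.1 is an arbitrary choice:

    import random

    def epsilon_greedy(q_values, epsilon=0.1):
        # q_values: Q(s, a) for every discrete action in the current state
        if random.random() < epsilon:                 # explore: random action
            return random.randrange(len(q_values))
        return q_values.index(max(q_values))          # exploit: argmax Q(s, a)

    print(epsilon_greedy([0.21, 0.27, 0.30, 0.21]))   # mostly returns index 2 (A3)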
Exploration in discrete space
The ε parameter must be tuned and is fixed during training. In contrast,
Value-Difference Based Exploration adapts the exploration probability based on
fluctuations of confidence [10].
A1 A2 A3 A4
0.21 0.27 0.30 0.21
Figure 9: Low confidence, flat
likelihood
A1 A2 A3 A4
0.21 0.07 0.70 0.01
Figure 10: High confidence, sharp
likelihood
gfrison 2018 26
Continuous space
Some applications of RL require continuous action spaces defined by means of
continuous variables, e.g. position, velocity, torque.
Single Output
As in the DDPG algorithm, the NN yields only one output.
Probability distribution
The network yields a distribution. For convenience it is usually a normal
distribution defined by 2 outputs: μ and σ.
gfrison 2018 27
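A minimal sketch of sampling a continuous action from the two-output normal distribution; emitting log σ for positivity and clipping to a [-1, 1] action range are common conventions assumed here, not part of the slide:

    import math, random

    def sample_action(mu, log_sigma, low=-1.0, high=1.0):
        sigma = math.exp(log_sigma)            # keep sigma strictly positive
        action = random.gauss(mu, sigma)       # draw from N(mu, sigma)
        return max(low, min(high, action))     # clip to the valid action range

    print(sample_action(mu=0.3, log_sigma=-1.0))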
Deep Q-Network
The NN comes into play by approximating the state-action values Qt(s, a).
Double network
Besides the main network, a sibling (target) network is used in training for adjusting
the Q values; the target network is periodically aligned with the main network.
Priority experience replay
Past state/action/reward tuples are also kept for future iterations. Tuples with a
higher evaluation error are retained longer.
 First disruptive advance in Deep RL in 2015 by DeepMind [7].
gfrison 2018 28
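A rough sketch of the double-network idea, with tabular dictionaries standing in for the two networks; the state/action layout, the sync period and γ are hypothetical:

    import copy, random

    online_Q = {s: {a: random.random() for a in range(4)} for s in range(5)}
    target_Q = copy.deepcopy(online_Q)          # the sibling (target) network
    gamma, alpha = 0.99, 0.1

    for step in range(1, 1001):
        # pretend these came out of the (prioritized) replay buffer
        s, a = random.randrange(5), random.randrange(4)
        reward, s_next, done = random.random(), random.randrange(5), False
        # the TD target is computed with the frozen target network, not the online one
        y = reward if done else reward + gamma * max(target_Q[s_next].values())
        online_Q[s][a] += alpha * (y - online_Q[s][a])
        if step % 200 == 0:
            target_Q = copy.deepcopy(online_Q)  # periodically align target with main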
Actor-Critic
Policy-based algorithm
Actor-critic methods lie in the family of policy-based algorithms. Unlike
value-based ones - like DQN - the main network (actor) learns directly the action
aπ(s) to return for a given state s.
 Therefore there is no argmax Qπ(s, A) among all Qπ(s) for a given state
gfrison 2018 29
Actor-Critic
actor-critic defines an architecture based on 2 parts:
Actor
Main network which infers the action for a given state
(st → a)
Critic
Secondary network, used only during training, that evaluates the
actor output by returning the averaged value of the state:
st → V(s)
gfrison 2018 30
Advantage function for incremental learning
The advantage technique is used to give an evaluation of the last action compared
to the average of the other actions.
It answers the following question:
Is the last action more rewarding than the others in such a scenario?
Yes
 Action’s likelihood is encouraged. Gradients are pushed toward this action.
No
 Action’s likelihood is discouraged. Gradients are pushed away from this action.
gfrison 2018 31
Training
Figure 11: Actor-Critic interactions in an iterative training.
gfrison 2018 32
Actor training
• 1 Critic returns the averaged value vt of the state st.
• 2 Actor is trained considering st, at and advt = rt − vt.
gfrison 2018 33
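A minimal tabular sketch of these two steps for a single state: the critic tracks vt, the actor is a softmax over action preferences, and the advantage advt = rt − vt scales the policy-gradient step. The rewards, learning rates and log-likelihood gradient form are illustrative assumptions:

    import math, random

    n_actions = 3
    prefs = [0.0] * n_actions                  # actor: softmax preferences (logits)
    v = 0.0                                    # critic: value estimate of the single state

    def policy():
        exps = [math.exp(p) for p in prefs]
        z = sum(exps)
        return [e / z for e in exps]

    for _ in range(2000):
        probs = policy()
        a = random.choices(range(n_actions), weights=probs)[0]
        reward = [0.1, 0.5, 0.9][a] + random.gauss(0, 0.1)   # hypothetical rewards
        adv = reward - v                       # 1. critic baseline: adv = r - v
        v += 0.05 * adv                        #    critic moves toward the observed reward
        for i in range(n_actions):             # 2. actor: push gradients toward/away from a
            grad_log = (1.0 if i == a else 0.0) - probs[i]   # d log pi(a) / d prefs[i]
            prefs[i] += 0.1 * adv * grad_log

    print([round(p, 2) for p in policy()])     # probability mass shifts to the best action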
Deep Deterministic Policy Gradient
DDPG
Especially effective in continuous spaces. Derived from actor-critic and DQN, it
uses 4 networks.
Value/action networks
As in actor-critic, there is a distinction between the value and action networks.
Duplication of actor-critic networks
For stability those networks are duplicated (a sketch of the target update follows below).
Replay buffer
As in DQN, a history of past experiences is reused.
gfrison 2018 34
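The duplicated (target) networks are typically moved slowly toward the main ones; a short sketch of this soft update, where the weight lists and the smoothing coefficient τ are hypothetical:

    def soft_update(main_weights, target_weights, tau=0.001):
        # target <- tau * main + (1 - tau) * target, applied element-wise
        return [tau * m + (1.0 - tau) * t for m, t in zip(main_weights, target_weights)]

    print(soft_update([0.5, -0.2], [0.4, -0.1]))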
Gradient propagation
Unlike other techniques, DDPG improves the current π by applying the gradient
obtained during the training iteration from the value network to the policy
network.
Figure 12: Like with a pantograph, the gradient Vs is applied to the π network for a similar
effect, without knowing the value of π.
gfrison 2018 35
Asgard
Missing slides for Intellectual Property
gfrison 2018 35
Demo
gfrison 2018 35
References i
K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath.
A brief survey of deep reinforcement learning.
arXiv preprint arXiv:1708.05866, 2017.
Y. Bengio, A. Courville, and P. Vincent.
Representation learning: A review and new perspectives.
IEEE transactions on pattern analysis and machine intelligence,
35(8):1798–1828, 2013.
G. Frison.
First Steps on Evolutionary Systems.
2018.
https://labs.cx.sap.com/2018/07/02/first-steps-on-evolutionary-systems/.
gfrison 2018 36
References ii
K. Hornik, M. Stinchcombe, and H. White.
Multilayer feedforward networks are universal approximators.
Neural networks, 2(5):359–366, 1989.
J. Z. Kolter.
Markov Decision Process.
http://www.cs.cmu.edu/afs/cs/academic/class/15780-s16/www/slides/mdps.pdf.
J. Williams.
2015 DARPA Robotics Challenge.
Public domain.
gfrison 2018 37
References iii
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare,
A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al.
Human-level control through deep reinforcement learning.
Nature, 518(7540):529, 2015.
P. Montague.
Reinforcement Learning: An Introduction, by Sutton, R.S. and Barto, A.G.
Trends in Cognitive Sciences, 3(9):360, 1999.
Omanomanoman.
CC BY-SA 4.0, https://commons.wikimedia.org/.
gfrison 2018 38
References iv
M. Tokic and G. Palm.
Value-difference based exploration: adaptive control between
epsilon-greedy and softmax.
In Annual Conference on Artificial Intelligence, pages 335–346. Springer, 2011.
Valve Corporation.
A screenshot from the video game Dota 2.
Yamaguchi.
CC-BY-SA-3.0, http://creativecommons.org/licenses/by-sa/3.0/.
gfrison 2018 39