Adversarially Guided Actor-Critic,
Y. Flet-Berliac et al, 2021
옥찬호
utilForever@gmail.com
Prerequisites
• Markov Decision Process
• Policy Iteration
• Policy Gradient
• Actor-Critic
• Entropy
• KL Divergence
• PPO (Proximal Policy Optimization)
• GAN (Generative Adversarial Network)
Introduction
• Generalization and exploration in RL still represent key
challenges that leave most current methods ineffective.
• First, a battery of recent studies (Farebrother et al., 2018; Zhang et al.,
2018a; Song et al., 2020; Cobbe et al., 2020) indicates that current RL
methods fail to generalize correctly even when agents have been trained in a
diverse set of environments.
• Second, exploration has been extensively studied in RL; however, most hard-
exploration problems use the same environment for training and evaluation.
Introduction
• Hence, since a well-designed exploration strategy should
maximize the information received from a trajectory about an
environment, the exploration capabilities may not be
appropriately assessed if that information is memorized.
Introduction
• In this work, we propose Adversarially Guided Actor-Critic
(AGAC), which reconsiders the actor-critic framework by
introducing a third protagonist: the adversary.
• Its role is to predict the actor’s actions correctly. Meanwhile, the actor must
not only find the optimal actions to maximize the sum of expected returns,
but also counteract the predictions of the adversary.
• This formulation is lightly inspired by adversarial methods, specifically
generative adversarial networks (GANs) (Goodfellow et al., 2014).
Introduction
• Such a link between GANs and actor-critic methods has been
formalized by Pfau & Vinyals (2016); however, in the context
of a third protagonist, we draw a different analogy.
• The adversary can be interpreted as playing the role of a discriminator that
must predict the actions of the actor, and the actor can be considered as
playing the role of a generator that behaves to deceive the predictions of the
adversary.
• This approach has the advantage, as with GANs, that the optimization
procedure generates a diversity of meaningful data, corresponding to
sequences of actions in AGAC.
Introduction
• The contributions of this work are as follows:
• (i) we propose a novel actor-critic formulation inspired by adversarial learning (AGAC);
• (ii) we empirically analyze AGAC on key reinforcement learning aspects such as diversity, exploration, and stability;
• (iii) we demonstrate significant gains in performance on several sparse-reward, hard-exploration tasks, including procedurally-generated tasks.
Backgrounds and Notations
• Markov Decision Process (MDP)
𝑀 = (𝒮, 𝒜, 𝒫, 𝑅, 𝛾)
• 𝒮 : The state space
• 𝒜 : The action space
• 𝒫 : The transition kernel
• 𝑅 : The bounded reward function
• 𝛾 ∈ [0, 1) : The discount factor
Backgrounds and Notations
• Let 𝜋 denote a stochastic policy mapping states to distributions
over actions. We place ourselves in the infinite-horizon setting.
• i.e., we seek a policy that optimizes 𝐽(𝜋) = 𝔼𝜋[ Σ_{t=0}^∞ 𝛾^t 𝑟(𝑠_t, 𝑎_t) ].
• The value of a state is the quantity 𝑉^𝜋(𝑠) = 𝔼𝜋[ Σ_{t=0}^∞ 𝛾^t 𝑟(𝑠_t, 𝑎_t) | 𝑠_0 = 𝑠 ], and the value 𝑄^𝜋(𝑠, 𝑎) of performing action 𝑎 in state 𝑠 and then following policy 𝜋 is defined as: 𝑄^𝜋(𝑠, 𝑎) = 𝔼𝜋[ Σ_{t=0}^∞ 𝛾^t 𝑟(𝑠_t, 𝑎_t) | 𝑠_0 = 𝑠, 𝑎_0 = 𝑎 ].
• The advantage function, which quantifies how much an action 𝑎 is better than the average action in state 𝑠, is 𝐴^𝜋(𝑠, 𝑎) = 𝑄^𝜋(𝑠, 𝑎) − 𝑉^𝜋(𝑠).
• Finally, the entropy ℋ^𝜋 of a policy is calculated as: ℋ^𝜋(𝑠) = 𝔼_{𝜋(∙|𝑠)}[ −log 𝜋(∙|𝑠) ].
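As a concrete illustration (not from the paper), these quantities can be computed directly for a small discrete action space; the policy and Q-values below are hypothetical:

```python
import numpy as np

# Hypothetical policy over 4 discrete actions in some state s
pi = np.array([0.25, 0.25, 0.25, 0.25])

# Entropy: H^pi(s) = E_{a ~ pi(.|s)}[-log pi(a|s)]; equals log(4) for a uniform policy
entropy = -(pi * np.log(pi)).sum()

# Hypothetical Q-values for the same state. V is the policy-weighted average,
# and the advantage measures how much each action beats that average.
Q = np.array([1.0, 2.0, 3.0, 4.0])
V = (pi * Q).sum()
A = Q - V  # advantages average to zero under pi
```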
Backgrounds and Notations
• An actor-critic algorithm is composed of two main components:
a policy and a value predictor.
• In deep RL, both the policy and the value function are obtained via
parametric estimators: we denote 𝜃 and 𝜙 their respective parameters.
• The policy is updated via policy gradient, while the value is usually updated
via temporal difference or Monte Carlo rollouts.
Backgrounds and Notations
• For a sequence of transitions (𝑠_t, 𝑎_t, 𝑟_t, 𝑠_{t+1})_{t∈[0,N]}, we use the following policy gradient loss (including the commonly used entropic penalty):

ℒ_PG = −(1/𝑁) Σ_{t′=t}^{t+N} [ 𝐴_{t′} log 𝜋(𝑎_{t′} | 𝑠_{t′}, 𝜃) + 𝛼 ℋ^𝜋(𝑠_{t′}, 𝜃) ]

(the log-probability term is the actor's contribution, 𝐴_{t′} comes from the critic, and the last term is the entropy bonus)

where 𝛼 is the entropy coefficient and 𝐴_t is the generalized advantage estimator (Schulman et al., 2016) defined as:

𝐴_t = Σ_{t′=t}^{t+N} (𝛾𝜆)^{t′−t} ( 𝑟_{t′} + 𝛾 𝑉_{𝜙old}(𝑠_{t′+1}) − 𝑉_{𝜙old}(𝑠_{t′}) )

with 𝜆 a fixed hyperparameter and 𝑉_{𝜙old} the value function estimator at the previous optimization iteration.
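A minimal numpy sketch (assumed, not the authors' code) of this generalized advantage estimator over an N-step segment:

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """N-step generalized advantage estimates (Schulman et al., 2016).

    rewards    : r_t for t = 0..N-1
    values     : V_{phi_old}(s_t) for t = 0..N-1
    last_value : V_{phi_old}(s_N), used to bootstrap the final transition
    """
    values = np.append(values, last_value)
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD residuals
    advantages = np.zeros(len(rewards))
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * lam * acc  # A_t = sum_k (gamma*lam)^k * delta_{t+k}
        advantages[t] = acc
    return advantages
```

With 𝛾 = 𝜆 = 1 and a zero value function, 𝐴_t reduces to the undiscounted return-to-go.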
Backgrounds and Notations
• To estimate the value function, we solve the non-linear regression problem

minimize_𝜙 Σ_{t′=t}^{t+N} ( 𝑉_𝜙(𝑠_{t′}) − 𝑉̂_{t′} )²

where 𝑉̂_{t′} = 𝐴_{t′} + 𝑉_{𝜙old}(𝑠_{t′}).
Adversarially Guided AC
• To foster diversified behavior in its trajectories, AGAC
introduces a third protagonist to the actor-critic framework:
the adversary.
• The role of the adversary is to accurately predict the actor’s actions, by
minimizing the discrepancy between its action distribution 𝜋adv and the
distribution induced by the policy 𝜋.
• Meanwhile, in addition to finding the optimal actions to maximize the sum of
expected returns, the actor must also counteract the adversary’s predictions
by maximizing the discrepancy between 𝜋 and 𝜋adv.
Adversarially Guided AC
• This discrepancy, used as a form of exploration bonus, is
defined as the difference of action log-probabilities, whose
expectation is the Kullback-Leibler divergence:
𝐷_KL( 𝜋(∙|𝑠) ‖ 𝜋_adv(∙|𝑠) ) = 𝔼_{𝜋(∙|𝑠)}[ log 𝜋(∙|𝑠) − log 𝜋_adv(∙|𝑠) ]
• Formally, for each state-action pair 𝑠𝑡, 𝑎𝑡 in a trajectory, an
action-dependent bonus log 𝜋 (𝑎𝑡|𝑠𝑡) − log 𝜋adv 𝑎𝑡 𝑠𝑡 is added
to the advantage.
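A small numerical check (with made-up action distributions) that the per-action bonus averages, under 𝜋, to the KL divergence above:

```python
import numpy as np

pi     = np.array([0.7, 0.2, 0.1])  # actor's action distribution (hypothetical)
pi_adv = np.array([0.4, 0.4, 0.2])  # adversary's prediction (hypothetical)

# Action-dependent exploration bonus for each action a
bonus = np.log(pi) - np.log(pi_adv)

# Its expectation under pi is exactly D_KL(pi(.|s) || pi_adv(.|s))
expected_bonus = (pi * bonus).sum()
kl = (pi * np.log(pi / pi_adv)).sum()
```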
Adversarially Guided AC
• The value target of the critic is modified to include the action-independent equivalent, which is the KL-divergence 𝐷_KL( 𝜋(∙|𝑠) ‖ 𝜋_adv(∙|𝑠) ).
• In addition to the parameters 𝜃 (resp. 𝜃_old, the parameters of the policy at the previous iteration) and 𝜙 defined above (resp. 𝜙_old, those of the critic), we denote 𝜓 (resp. 𝜓_old) those of the adversary.
Adversarially Guided AC
• AGAC minimizes the following loss:

ℒ_AGAC = ℒ_PG + 𝛽_V ℒ_V + 𝛽_adv ℒ_adv

• In the new objective ℒ_PG = −(1/𝑁) Σ_{t=0}^N [ 𝐴_t^AGAC log 𝜋(𝑎_t | 𝑠_t, 𝜃) + 𝛼 ℋ^𝜋(𝑠_t, 𝜃) ], AGAC modifies 𝐴_t as:

𝐴_t^AGAC = 𝐴_t + 𝑐 ( log 𝜋(𝑎_t | 𝑠_t, 𝜃_old) − log 𝜋_adv(𝑎_t | 𝑠_t, 𝜓_old) )

with 𝑐 a varying hyperparameter that controls the dependence on the action log-probability difference.
To encourage exploration without preventing asymptotic stability, 𝑐 is linearly annealed during the course of training.
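In code, the annealed coefficient and the modified advantage could look like the sketch below; the schedule shape is linear as stated, but the constants are illustrative assumptions:

```python
import numpy as np

def anneal_c(step, total_steps, c0=4e-4):
    """Linearly anneal c from its initial value c0 down to 0 over training."""
    return c0 * (1.0 - step / total_steps)

# Modified advantage for one transition (all inputs hypothetical)
A_t = 0.5
log_pi_a, log_pi_adv_a = np.log(0.6), np.log(0.3)

c = anneal_c(step=0, total_steps=100_000)
A_agac = A_t + c * (log_pi_a - log_pi_adv_a)  # bonus > 0: the adversary was surprised
```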
Adversarially Guided AC
• ℒ_V is the objective function of the critic, defined as:

ℒ_V = (1/𝑁) Σ_{t=0}^N ( 𝑉_𝜙(𝑠_t) − ( 𝑉̂_t + 𝑐 𝐷_KL( 𝜋(∙|𝑠_t, 𝜃_old) ‖ 𝜋_adv(∙|𝑠_t, 𝜓_old) ) ) )²

• Finally, ℒ_adv is the objective function of the adversary:

ℒ_adv = (1/𝑁) Σ_{t=0}^N 𝐷_KL( 𝜋(∙|𝑠_t, 𝜃_old) ‖ 𝜋_adv(∙|𝑠_t, 𝜓) )

• These are the three equations that our method modifies in the traditional actor-critic framework. The terms 𝛽_V and 𝛽_adv are fixed hyperparameters.
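Putting the three objectives together on a toy batch (numpy arrays stand in for network outputs; all values are illustrative assumptions, and in practice each loss is differentiated only with respect to its own parameters 𝜃, 𝜙, or 𝜓):

```python
import numpy as np

N, n_actions = 4, 3
rng = np.random.default_rng(0)

pi      = rng.dirichlet(np.ones(n_actions), size=N)  # pi(.|s_t, theta_old)
pi_adv  = rng.dirichlet(np.ones(n_actions), size=N)  # pi_adv(.|s_t, psi)
V_phi   = rng.normal(size=N)                         # V_phi(s_t)
V_hat   = rng.normal(size=N)                         # value targets
actions = rng.integers(n_actions, size=N)
A_agac  = rng.normal(size=N)                         # modified advantages
c, alpha, beta_V, beta_adv = 0.1, 0.01, 0.5, 1.0

kl = (pi * np.log(pi / pi_adv)).sum(axis=1)          # D_KL per state
entropy = -(pi * np.log(pi)).sum(axis=1)
log_pi_a = np.log(pi[np.arange(N), actions])

L_PG   = -np.mean(A_agac * log_pi_a + alpha * entropy)  # policy loss
L_V    = np.mean((V_phi - (V_hat + c * kl)) ** 2)       # critic loss
L_adv  = np.mean(kl)                                    # adversary loss
L_AGAC = L_PG + beta_V * L_V + beta_adv * L_adv
```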
Adversarially Guided AC
• Under the proposed actor-critic formulation, the probability of
sampling an action is increased if the modified advantage is
positive, i.e.
• (i) the corresponding return is larger than the predicted value
• (ii) the action log-probability difference is large
• More precisely, our method favors transitions whose actions
were less accurately predicted than the average action, i.e.
log 𝜋(𝑎|𝑠) − log 𝜋_adv(𝑎|𝑠) ≥ 𝐷_KL( 𝜋(∙|𝑠) ‖ 𝜋_adv(∙|𝑠) ).
Adversarially Guided AC
• This is particularly visible for 𝜆 → 1, in which case the generalized advantage is 𝐴_t = 𝐺_t − 𝑉_{𝜙old}(𝑠_t), resulting in the appearance of both aforementioned mirrored terms in the modified advantage:

𝐴_t^AGAC = 𝐺_t − 𝑉̂_t^{𝜙old} + 𝑐 ( log 𝜋(𝑎_t|𝑠_t) − log 𝜋_adv(𝑎_t|𝑠_t) − 𝐷̂_KL^{𝜙old}( 𝜋(∙|𝑠) ‖ 𝜋_adv(∙|𝑠) ) )

with 𝐺_t the observed return, 𝑉̂_t^{𝜙old} the estimated return and 𝐷̂_KL^{𝜙old}( 𝜋(∙|𝑠) ‖ 𝜋_adv(∙|𝑠) ) the estimated KL-divergence.
Adversarially Guided AC
• To avoid instability, in practice the adversary is a separate
estimator, updated with a smaller learning rate than the actor.
This way, it represents a delayed and more steady version of
the actor’s policy, which prevents the agent from having to
constantly adapt or focus solely on fooling the adversary.
Building Motivation
• We provide an interpretation of AGAC by studying the
dynamics of attraction and repulsion between the actor and
the adversary.
• To simplify, we study the equivalent of AGAC in a policy
iteration (PI) scheme. PI being the dynamic programming
scheme underlying the standard actor-critic, we have reasons
to think that some of our findings translate to the original
AGAC algorithm.
Building Motivation
• In PI, the quantity of interest is the action-value, which AGAC
would modify as:
𝑄_{𝜋_k}^AGAC = 𝑄_{𝜋_k} + 𝑐 ( log 𝜋_k − log 𝜋_adv )

with 𝜋_k the policy at iteration 𝑘.
• Incorporating the entropic penalty, the new policy 𝜋_{k+1} verifies:

𝜋_{k+1} = argmax_𝜋 𝒥_PI(𝜋) = argmax_𝜋 𝔼_𝑠 𝔼_{𝑎∼𝜋(∙|𝑠)}[ 𝑄_{𝜋_k}^AGAC(𝑠, 𝑎) − 𝛼 log 𝜋(𝑎|𝑠) ]
Building Motivation
• We can rewrite this objective:
𝒥_PI(𝜋)
= 𝔼_𝑠 𝔼_{𝑎∼𝜋(∙|𝑠)}[ 𝑄_{𝜋_k}^AGAC(𝑠, 𝑎) − 𝛼 log 𝜋(𝑎|𝑠) ]
= 𝔼_𝑠 𝔼_{𝑎∼𝜋(∙|𝑠)}[ 𝑄_{𝜋_k}(𝑠, 𝑎) + 𝑐 ( log 𝜋_k(𝑎|𝑠) − log 𝜋_adv(𝑎|𝑠) ) − 𝛼 log 𝜋(𝑎|𝑠) ]
= 𝔼_𝑠 𝔼_{𝑎∼𝜋(∙|𝑠)}[ 𝑄_{𝜋_k}(𝑠, 𝑎) + 𝑐 ( log 𝜋_k(𝑎|𝑠) − log 𝜋(𝑎|𝑠) + log 𝜋(𝑎|𝑠) − log 𝜋_adv(𝑎|𝑠) ) − 𝛼 log 𝜋(𝑎|𝑠) ]
= 𝔼_𝑠[ 𝔼_{𝑎∼𝜋(∙|𝑠)}[ 𝑄_{𝜋_k}(𝑠, 𝑎) ] − 𝑐 𝐷_KL( 𝜋(∙|𝑠) ‖ 𝜋_k(∙|𝑠) ) + 𝑐 𝐷_KL( 𝜋(∙|𝑠) ‖ 𝜋_adv(∙|𝑠) ) + 𝛼 ℋ( 𝜋(∙|𝑠) ) ]

(𝜋_k is attractive; 𝜋_adv is repulsive; the entropy term enforces stochastic policies)
Building Motivation
• Thus, in the PI scheme, AGAC finds a policy that maximizes Q-
values, while at the same time remaining close to the current
policy and far from a mixture of the previous policies (i.e.,
𝜋𝑘−1, 𝜋𝑘−2, 𝜋𝑘−3, …).
• Note that we experimentally observe that our method
performs better with a smaller learning rate for the adversarial
network than that of the other networks, which could imply
that a stable repulsive term is beneficial.
Building Motivation
• This optimization problem is strongly concave in 𝜋 (thanks to
the entropy term), and is state-wise a Legendre-Fenchel
transform. Its solution is given by:
𝜋_{k+1} ∝ ( 𝜋_k / 𝜋_adv )^{𝑐/𝛼} exp( 𝑄_{𝜋_k} / 𝛼 )
• This result gives us some insight into the behavior of the objective
function. Notably, in our example, if 𝜋adv is fixed and 𝑐 = 𝛼, we
recover a KL-regularized PI scheme (Geist et al., 2019) with the
modified reward 𝑟 − 𝑐 log 𝜋adv.
Proof
𝜋_{k+1} = argmax_𝜋 𝒥_PI(𝜋) ∝ ( 𝜋_k / 𝜋_adv )^{𝑐/𝛼} exp( 𝑄_{𝜋_k} / 𝛼 )

with the objective function:

𝒥_PI(𝜋) = 𝔼_𝑠 𝔼_{𝑎∼𝜋(∙|𝑠)}[ 𝑄_{𝜋_k}(𝑠, 𝑎) + 𝑐 ( log 𝜋_k(𝑎|𝑠) − log 𝜋_adv(𝑎|𝑠) ) − 𝛼 log 𝜋(𝑎|𝑠) ]
Proof
• We first consider a simpler optimization problem:
argmax_𝜋 ⟨𝜋, 𝑄_{𝜋_k}⟩ + 𝛼 ℋ(𝜋)

whose solution is known (Vieillard et al., 2020a, Appendix A).
• The expression for the maximizer is the 𝛼-scaled softmax:

𝜋* = exp( 𝑄_{𝜋_k} / 𝛼 ) / ⟨1, exp( 𝑄_{𝜋_k} / 𝛼 )⟩
Proof
• We now turn towards the optimization problem of interest,
which we can rewrite as:
argmax_𝜋 ⟨𝜋, 𝑄_{𝜋_k} + 𝑐 ( log 𝜋_k − log 𝜋_adv )⟩ + 𝛼 ℋ(𝜋)

• By the simple change of variable 𝑄̃_{𝜋_k} = 𝑄_{𝜋_k} + 𝑐 ( log 𝜋_k − log 𝜋_adv ), we can reuse the previous solution (replacing 𝑄_{𝜋_k} by 𝑄̃_{𝜋_k}).
• With the simplification:

exp( ( 𝑄_{𝜋_k} + 𝑐 ( log 𝜋_k − log 𝜋_adv ) ) / 𝛼 ) = ( 𝜋_k / 𝜋_adv )^{𝑐/𝛼} exp( 𝑄_{𝜋_k} / 𝛼 )

the result follows.
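This chain of identities can be checked numerically on a single state with random toy values (the constants 𝑐 and 𝛼 below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5                                # number of actions
Q      = rng.normal(size=n)          # Q_{pi_k}(s, .)
pi_k   = rng.dirichlet(np.ones(n))
pi_adv = rng.dirichlet(np.ones(n))
c, alpha = 0.3, 0.7

# The simplification: exp((Q + c(log pi_k - log pi_adv)) / alpha)
#                   = (pi_k / pi_adv)^(c/alpha) * exp(Q / alpha)
lhs = np.exp((Q + c * (np.log(pi_k) - np.log(pi_adv))) / alpha)
rhs = (pi_k / pi_adv) ** (c / alpha) * np.exp(Q / alpha)

# Normalizing gives the maximizing policy (the alpha-scaled softmax of Q~)
pi_next = lhs / lhs.sum()

def J(pi):
    """Objective <pi, Q + c(log pi_k - log pi_adv)> + alpha * H(pi)."""
    return (pi * (Q + c * (np.log(pi_k) - np.log(pi_adv)))).sum() \
           - alpha * (pi * np.log(pi)).sum()
```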
Implementation
• In all of the experiments, we use PPO (Schulman et al., 2017)
as the base algorithm and build on it to incorporate our method.
Hence,
ℒ_PG = −(1/𝑁) Σ_{t′=t}^{t+N} min( ( 𝜋(𝑎_{t′}|𝑠_{t′}, 𝜃) / 𝜋(𝑎_{t′}|𝑠_{t′}, 𝜃_old) ) 𝐴_{t′}^AGAC, clip( 𝜋(𝑎_{t′}|𝑠_{t′}, 𝜃) / 𝜋(𝑎_{t′}|𝑠_{t′}, 𝜃_old), 1 − 𝜖, 1 + 𝜖 ) 𝐴_{t′}^AGAC )

with 𝐴_{t′}^AGAC given in the first equation, 𝑁 the temporal length considered for one update of parameters and 𝜖 the clipping parameter.
(For reference, in PPO: 𝐿^CLIP(𝜃) = 𝔼̂_t[ min( 𝑟_t(𝜃) 𝐴̂_t, clip( 𝑟_t(𝜃), 1 − 𝜖, 1 + 𝜖 ) 𝐴̂_t ) ].)
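A minimal numpy sketch of this clipped surrogate with the AGAC-modified advantage (not the authors' implementation; 𝜖 = 0.2 is a common PPO default, assumed here):

```python
import numpy as np

def ppo_clip_loss(log_pi, log_pi_old, A_agac, eps=0.2):
    """Clipped PPO policy loss using the AGAC-modified advantage."""
    ratio = np.exp(log_pi - log_pi_old)  # pi(a|s,theta) / pi(a|s,theta_old)
    unclipped = ratio * A_agac
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * A_agac
    return -np.mean(np.minimum(unclipped, clipped))
```

At 𝜃 = 𝜃_old the ratio is 1 and the loss reduces to −mean(𝐴^AGAC); the clip keeps large policy updates from being rewarded.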
Implementation
• Similar to RIDE (Raileanu & Rocktäschel, 2019), we also
discount PPO by episodic state visitation counts.
• The actor, critic and adversary use the convolutional architecture from the DQN Nature paper (Mnih et al., 2015) with different hidden sizes.
• The three neural networks are optimized using Adam (Kingma
& Ba, 2015). Our method does not use RNNs in its architecture;
instead, in all our experiments, we use frame stacking.
Implementation
• At each training step, we perform a stochastic optimization
step to minimize ℒAGAC using stop-gradient:
• 𝜃 ← Adam( 𝜃, ∇_𝜃 ℒ_PG, 𝜂_1 )
• 𝜙 ← Adam( 𝜙, ∇_𝜙 ℒ_V, 𝜂_1 )
• 𝜓 ← Adam( 𝜓, ∇_𝜓 ℒ_adv, 𝜂_2 )
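The effect of the two learning rates can be sketched with plain gradient steps (Adam replaced by vanilla gradient descent for brevity; the quadratic losses and 𝜂 values are illustrative assumptions):

```python
# Toy 1-D "parameters": the adversary psi chases the actor's parameter theta.
theta, psi = 0.0, 0.0
eta1, eta2 = 0.1, 0.01  # eta2 < eta1: the adversary is a slower, steadier copy

target = 1.0            # stand-in optimum for the actor's own objective
for _ in range(100):
    theta -= eta1 * 2.0 * (theta - target)  # actor step on its quadratic loss
    # adversary step toward a stop-gradient snapshot of the actor
    psi   -= eta2 * 2.0 * (psi - theta)
```

Because 𝜂_2 < 𝜂_1, 𝜓 lags behind 𝜃, giving the repulsive term a stable, delayed target.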
Experiments
• In this section, we describe our experimental study in which we
investigate:
• (i) whether the adversarial bonus alone (i.e., without the episodic state-visitation count) is sufficient to outperform other methods in VizDoom, a sparse-reward task with high-dimensional observations
• (ii) whether AGAC succeeds in partially-observable and procedurally-
generated environments with high sparsity in the rewards, compared to
other methods
Experiments
• In this section, we describe our experimental study in which we
investigate:
• (iii) how well AGAC is capable of exploring in environments without extrinsic
reward
• (iv) the training stability of our method
• In all of the experiments, lines show average performance and shaded areas represent one standard deviation.
Experiments
• Environments
Experiments
• Baselines
• RIDE (Raileanu & Rocktäschel, 2019)
• Count, a count-based exploration method (Bellemare et al., 2016b)
• RND (Burda et al., 2018)
• ICM (Pathak et al., 2017)
• AMIGo (Campero et al., 2021)
Experiments
• Adversarially-based Exploration (No episodic count)
Experiments
• Hard-Exploration Tasks with Partially-Observable Environments
Experiments
• Training Stability
Experiments
• Exploration in Reward-Free Environment
Experiments
• Visualizing Coverage and Diversity
Discussion
• This paper introduced AGAC, a modification to the traditional
actor-critic framework: an adversary network is added as a third
protagonist.
• The mechanics of AGAC have been discussed from a policy iteration
point of view, and we provided theoretical insight into the inner
workings of the proposed algorithm: the adversary forces the agent
to remain close to the current policy while moving away from the
previous ones. In a nutshell, the influence of the adversary makes
the actor conservatively diversified.
Thank you!
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
 

Adversarially Guided Actor-Critic, Y. Flet-Berliac et al, 2021

  • 1. Adversarially Guided Actor-Critic, Y. Flet-Berliac et al, 2021
    옥찬호 (Chanho Ok), utilForever@gmail.com
  • 2. Prerequisites
    • Markov Decision Process
    • Policy Iteration
    • Policy Gradient
    • Actor-Critic
    • Entropy
    • KL Divergence
    • PPO (Proximal Policy Optimization)
    • GAN (Generative Adversarial Network)
  • 3. Introduction
    • Generalization and exploration in RL still represent key challenges that leave most current methods ineffective.
    • First, a battery of recent studies (Farebrother et al., 2018; Zhang et al., 2018a; Song et al., 2020; Cobbe et al., 2020) indicates that current RL methods fail to generalize correctly even when agents have been trained in a diverse set of environments.
    • Second, exploration has been extensively studied in RL; however, most hard-exploration problems use the same environment for training and evaluation.
  • 4. Introduction
    • Hence, since a well-designed exploration strategy should maximize the information received from a trajectory about an environment, the exploration capabilities may not be appropriately assessed if that information is memorized.
  • 5. Introduction
    • In this work, we propose Adversarially Guided Actor-Critic (AGAC), which reconsiders the actor-critic framework by introducing a third protagonist: the adversary.
    • Its role is to predict the actor’s actions correctly. Meanwhile, the actor must not only find the optimal actions to maximize the sum of expected returns, but also counteract the predictions of the adversary.
    • This formulation is lightly inspired by adversarial methods, specifically generative adversarial networks (GANs) (Goodfellow et al., 2014).
  • 6. Introduction
    • Such a link between GANs and actor-critic methods has been formalized by Pfau & Vinyals (2016); however, in the context of a third protagonist, we draw a different analogy.
    • The adversary can be interpreted as playing the role of a discriminator that must predict the actions of the actor, and the actor can be considered as playing the role of a generator that behaves to deceive the predictions of the adversary.
    • This approach has the advantage, as with GANs, that the optimization procedure generates a diversity of meaningful data, corresponding to sequences of actions in AGAC.
  • 7. Introduction
    • The contributions of this work are as follows:
    • (i) we propose a novel actor-critic formulation inspired by adversarial learning (AGAC)
    • (ii) we empirically analyze AGAC on key reinforcement learning aspects such as diversity, exploration and stability
    • (iii) we demonstrate significant gains in performance on several sparse-reward hard-exploration tasks, including procedurally-generated tasks
  • 8. Backgrounds and Notations
    • Markov Decision Process (MDP): $M = (\mathcal{S}, \mathcal{A}, \mathcal{P}, R, \gamma)$
    • $\mathcal{S}$: the state space
    • $\mathcal{A}$: the action space
    • $\mathcal{P}$: the transition kernel
    • $R$: the bounded reward function
    • $\gamma \in [0, 1)$: the discount factor
  • 9. Backgrounds and Notations
    • Let $\pi$ denote a stochastic policy mapping states to distributions over actions. We place ourselves in the infinite-horizon setting, i.e., we seek a policy that optimizes $J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$.
    • The value of a state is the quantity $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s\right]$, and the value $Q^\pi(s, a)$ of performing action $a$ in state $s$ and then following policy $\pi$ is defined as $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$.
    • The advantage function, which quantifies how much an action $a$ is better than the average action in state $s$, is $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$.
    • Finally, the entropy $\mathcal{H}^\pi$ of a policy is calculated as $\mathcal{H}^\pi(s) = \mathbb{E}_{\pi(\cdot \mid s)}\left[-\log \pi(\cdot \mid s)\right]$.
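The definitions above are easy to check numerically for a discrete action space. The sketch below uses a hypothetical 3-action example (the distributions and Q-values are made up, not from the paper):

```python
import numpy as np

def entropy(pi):
    """Entropy H(pi) = E_{a~pi}[-log pi(a|s)] of a discrete action distribution."""
    pi = np.asarray(pi, dtype=float)
    return float(-np.sum(pi * np.log(pi)))

def advantage(q_values, pi):
    """A(s, a) = Q(s, a) - V(s), with V(s) = E_{a~pi}[Q(s, a)]."""
    q = np.asarray(q_values, dtype=float)
    v = float(np.dot(pi, q))  # state value is the policy-weighted mean of Q
    return q - v

# Hypothetical 3-action state.
pi = np.array([0.5, 0.3, 0.2])
q = np.array([1.0, 0.0, -1.0])
adv = advantage(q, pi)  # positive exactly for better-than-average actions
h = entropy(pi)         # strictly below log(3), the uniform-policy maximum
```

A quick sanity check of the definitions: the advantage has zero mean under the policy, since $\mathbb{E}_{a\sim\pi}[A^\pi(s,a)] = V^\pi(s) - V^\pi(s) = 0$.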
  • 10. Backgrounds and Notations
    • An actor-critic algorithm is composed of two main components: a policy and a value predictor.
    • In deep RL, both the policy and the value function are obtained via parametric estimators; we denote $\theta$ and $\phi$ their respective parameters.
    • The policy is updated via policy gradient, while the value is usually updated via temporal difference or Monte Carlo rollouts.
  • 11. Backgrounds and Notations
    • For a sequence of transitions $\{s_t, a_t, r_t, s_{t+1}\}_{t \in [0, N]}$, we use the following policy gradient loss (including the commonly used entropic penalty):
      $\mathcal{L}_{PG} = -\frac{1}{N} \sum_{t'=t}^{t+N} \left[ A_{t'} \log \pi(a_{t'} \mid s_{t'}; \theta) + \alpha \mathcal{H}^\pi(s_{t'}; \theta) \right]$
      where $\alpha$ is the entropy coefficient and $A_t$ is the generalized advantage estimator (Schulman et al., 2016) defined as:
      $A_t = \sum_{t'=t}^{t+N} (\gamma \lambda)^{t'-t} \left( r_{t'} + \gamma V_{\phi_{\mathrm{old}}}(s_{t'+1}) - V_{\phi_{\mathrm{old}}}(s_{t'}) \right)$
      with $\lambda$ a fixed hyperparameter and $V_{\phi_{\mathrm{old}}}$ the value function estimator at the previous optimization iteration. (The slide annotates the actor term, the critic's value estimates, and the entropy bonus.)
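The generalized advantage estimator above is usually computed as a backward recursion over the TD residuals $\delta_{t'} = r_{t'} + \gamma V(s_{t'+1}) - V(s_{t'})$. A minimal sketch, with illustrative rewards, values and hyperparameters (not the paper's):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimator over one rollout segment.

    `values` has length len(rewards) + 1: the extra entry bootstraps
    the value of the state after the last transition.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD residuals
    advantages = np.zeros_like(deltas)
    acc = 0.0
    for t in reversed(range(len(deltas))):  # A_t = delta_t + gamma*lam*A_{t+1}
        acc = deltas[t] + gamma * lam * acc
        advantages[t] = acc
    return advantages

# Illustrative 3-step segment with a single terminal reward.
adv = gae(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.6, 0.7, 0.0])
```

With $\gamma = \lambda = 1$ and a zero value function, the estimator reduces to the undiscounted return-to-go, which gives a quick correctness check.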
  • 12. Backgrounds and Notations
    • To estimate the value function, we solve the non-linear regression problem $\mathrm{minimize}_\phi \sum_{t'=t}^{t+N} \left( V_\phi(s_{t'}) - \hat{V}_{t'} \right)^2$ where $\hat{V}_{t'} = A_{t'} + V_{\phi_{\mathrm{old}}}(s_{t'})$.
  • 13. Adversarially Guided AC
    • To foster diversified behavior in its trajectories, AGAC introduces a third protagonist to the actor-critic framework: the adversary.
    • The role of the adversary is to accurately predict the actor’s actions, by minimizing the discrepancy between its action distribution $\pi_{\mathrm{adv}}$ and the distribution induced by the policy $\pi$.
    • Meanwhile, in addition to finding the optimal actions to maximize the sum of expected returns, the actor must also counteract the adversary’s predictions by maximizing the discrepancy between $\pi$ and $\pi_{\mathrm{adv}}$.
  • 14. Adversarially Guided AC
  • 15. Adversarially Guided AC
    • This discrepancy, used as a form of exploration bonus, is defined as the difference of action log-probabilities, whose expectation is the Kullback-Leibler divergence:
      $D_{\mathrm{KL}}\left( \pi(\cdot \mid s) \,\|\, \pi_{\mathrm{adv}}(\cdot \mid s) \right) = \mathbb{E}_{\pi(\cdot \mid s)}\left[ \log \pi(\cdot \mid s) - \log \pi_{\mathrm{adv}}(\cdot \mid s) \right]$
    • Formally, for each state-action pair $(s_t, a_t)$ in a trajectory, an action-dependent bonus $\log \pi(a_t \mid s_t) - \log \pi_{\mathrm{adv}}(a_t \mid s_t)$ is added to the advantage.
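The relation between the per-action bonus and the KL divergence can be verified directly: averaging the bonus over actions drawn from $\pi$ recovers $D_{\mathrm{KL}}(\pi \| \pi_{\mathrm{adv}})$. A sketch with toy 3-action distributions (illustrative values only):

```python
import numpy as np

def agac_bonus(log_pi, log_pi_adv):
    """Per-action exploration bonus: log pi(a|s) - log pi_adv(a|s)."""
    return log_pi - log_pi_adv

def kl(pi, pi_adv):
    """D_KL(pi || pi_adv) for discrete distributions."""
    pi, pi_adv = np.asarray(pi, dtype=float), np.asarray(pi_adv, dtype=float)
    return float(np.sum(pi * (np.log(pi) - np.log(pi_adv))))

# Toy actor and adversary distributions over 3 actions.
pi = np.array([0.7, 0.2, 0.1])
pi_adv = np.array([0.4, 0.4, 0.2])
bonuses = agac_bonus(np.log(pi), np.log(pi_adv))
# The expectation of the per-action bonus under pi is exactly the KL divergence.
expected_bonus = float(np.dot(pi, bonuses))
```

Note the bonus is largest for actions the adversary under-predicts relative to the actor, which is exactly what pushes the actor toward behavior the adversary has not yet learned to imitate.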
  • 16. Adversarially Guided AC
    • The value target of the critic is modified to include the action-independent equivalent, which is the KL divergence $D_{\mathrm{KL}}\left( \pi(\cdot \mid s) \,\|\, \pi_{\mathrm{adv}}(\cdot \mid s) \right)$.
    • In addition to the parameters $\theta$ (resp. $\theta_{\mathrm{old}}$, the parameters of the policy at the previous iteration) and $\phi$ defined above (resp. $\phi_{\mathrm{old}}$, those of the critic), we denote $\psi$ (resp. $\psi_{\mathrm{old}}$) those of the adversary.
  • 17. Adversarially Guided AC
    • AGAC minimizes the following loss: $\mathcal{L}_{\mathrm{AGAC}} = \mathcal{L}_{PG} + \beta_V \mathcal{L}_V + \beta_{\mathrm{adv}} \mathcal{L}_{\mathrm{adv}}$
    • In the new objective $\mathcal{L}_{PG} = -\frac{1}{N} \sum_{t=0}^{N} \left[ A_t^{\mathrm{AGAC}} \log \pi(a_t \mid s_t; \theta) + \alpha \mathcal{H}^\pi(s_t; \theta) \right]$, AGAC modifies $A_t$ as:
      $A_t^{\mathrm{AGAC}} = A_t + c \left( \log \pi(a_t \mid s_t; \theta_{\mathrm{old}}) - \log \pi_{\mathrm{adv}}(a_t \mid s_t; \psi_{\mathrm{old}}) \right)$
      where $c$ is a varying hyperparameter that controls the dependence on the action log-probability difference. To encourage exploration without preventing asymptotic stability, $c$ is linearly annealed during the course of training.
  • 18. Adversarially Guided AC
    • $\mathcal{L}_V$ is the objective function of the critic, defined as:
      $\mathcal{L}_V = \frac{1}{N} \sum_{t=0}^{N} \left( V_\phi(s_t) - \left( \hat{V}_t + c \, D_{\mathrm{KL}}\left( \pi(\cdot \mid s_t; \theta_{\mathrm{old}}) \,\|\, \pi_{\mathrm{adv}}(\cdot \mid s_t; \psi_{\mathrm{old}}) \right) \right) \right)^2$
    • Finally, $\mathcal{L}_{\mathrm{adv}}$ is the objective function of the adversary:
      $\mathcal{L}_{\mathrm{adv}} = \frac{1}{N} \sum_{t=0}^{N} D_{\mathrm{KL}}\left( \pi(\cdot \mid s_t; \theta_{\mathrm{old}}) \,\|\, \pi_{\mathrm{adv}}(\cdot \mid s_t; \psi) \right)$
    • These are the three equations that our method modifies in the traditional actor-critic framework. The terms $\beta_V$ and $\beta_{\mathrm{adv}}$ are fixed hyperparameters.
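The three loss terms can be sketched as a single forward computation over a batch. This is a simplified numpy sketch, not the paper's implementation: the hyperparameter values are illustrative, the entropy term is approximated by the sampled-action estimate, and in real training each network is updated only through its own gradients (stop-gradients on the other quantities):

```python
import numpy as np

def agac_losses(log_pi, log_pi_old, log_pi_adv, adv, values, value_targets,
                kl_old_adv, c=0.5, alpha=0.01, beta_v=0.5, beta_adv=1.0):
    """Forward computation of L_AGAC = L_PG + beta_v*L_V + beta_adv*L_adv."""
    # Modified advantage: A_t + c * (log pi_old - log pi_adv) at the sampled action.
    adv_agac = adv + c * (log_pi_old - log_pi_adv)
    # Policy-gradient loss with a sampled-action entropy estimate (-log pi).
    l_pg = -np.mean(adv_agac * log_pi + alpha * (-log_pi))
    # Critic regresses on the KL-augmented value target.
    l_v = np.mean((values - (value_targets + c * kl_old_adv)) ** 2)
    # Adversary minimizes the (precomputed here) KL from the old policy.
    l_adv = np.mean(kl_old_adv)
    return l_pg + beta_v * l_v + beta_adv * l_adv, (l_pg, l_v, l_adv)

# Illustrative batch of two timesteps (all numbers are made up).
lp = np.log(np.array([0.5, 0.5]))
total, (l_pg, l_v, l_adv) = agac_losses(
    log_pi=lp, log_pi_old=lp, log_pi_adv=np.log(np.array([0.25, 0.75])),
    adv=np.array([1.0, -1.0]), values=np.zeros(2),
    value_targets=np.ones(2), kl_old_adv=np.array([0.1, 0.1]))
```

Setting `c = 0` removes both the advantage bonus and the value-target augmentation, recovering the standard actor-critic losses.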
  • 19. Adversarially Guided AC
    • Under the proposed actor-critic formulation, the probability of sampling an action is increased if the modified advantage is positive, i.e.:
    • (i) the corresponding return is larger than the predicted value
    • (ii) the action log-probability difference is large
    • More precisely, our method favors transitions whose actions were less accurately predicted than the average action, i.e. $\log \pi(a \mid s) - \log \pi_{\mathrm{adv}}(a \mid s) \geq D_{\mathrm{KL}}\left( \pi(\cdot \mid s) \,\|\, \pi_{\mathrm{adv}}(\cdot \mid s) \right)$.
  • 20. Adversarially Guided AC
    • This is particularly visible for $\lambda \to 1$, in which case the generalized advantage is $A_t = G_t - V_{\phi_{\mathrm{old}}}(s_t)$, resulting in the appearance of both aforementioned mirrored terms in the modified advantage:
      $A_t^{\mathrm{AGAC}} = G_t - \hat{V}_t^{\phi_{\mathrm{old}}} + c \left( \log \pi(a_t \mid s_t) - \log \pi_{\mathrm{adv}}(a_t \mid s_t) - \hat{D}_{\mathrm{KL}}^{\phi_{\mathrm{old}}}\left( \pi(\cdot \mid s_t) \,\|\, \pi_{\mathrm{adv}}(\cdot \mid s_t) \right) \right)$
      with $G_t$ the observed return, $\hat{V}_t^{\phi_{\mathrm{old}}}$ the estimated return and $\hat{D}_{\mathrm{KL}}^{\phi_{\mathrm{old}}}\left( \pi(\cdot \mid s_t) \,\|\, \pi_{\mathrm{adv}}(\cdot \mid s_t) \right)$ the estimated KL divergence.
  • 21. Adversarially Guided AC
    • To avoid instability, in practice the adversary is a separate estimator, updated with a smaller learning rate than the actor. This way, it represents a delayed and more steady version of the actor’s policy, which prevents the agent from having to constantly adapt or focus solely on fooling the adversary.
  • 22. Building Motivation
    • We provide an interpretation of AGAC by studying the dynamics of attraction and repulsion between the actor and the adversary.
    • To simplify, we study the equivalent of AGAC in a policy iteration (PI) scheme. PI being the dynamic programming scheme underlying the standard actor-critic, we have reasons to think that some of our findings translate to the original AGAC algorithm.
  • 23. Building Motivation
    • In PI, the quantity of interest is the action-value, which AGAC would modify as $Q^{\pi_k}_{\mathrm{AGAC}} = Q^{\pi_k} + c \left( \log \pi_k - \log \pi_{\mathrm{adv}} \right)$, with $\pi_k$ the policy at iteration $k$.
    • Incorporating the entropic penalty, the new policy $\pi_{k+1}$ verifies:
      $\pi_{k+1} = \operatorname*{argmax}_\pi \mathcal{J}_{PI}(\pi) = \operatorname*{argmax}_\pi \mathbb{E}_s \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ Q^{\pi_k}_{\mathrm{AGAC}}(s, a) - \alpha \log \pi(a \mid s) \right]$
  • 24. Building Motivation
    • We can rewrite this objective:
      $\mathcal{J}_{PI}(\pi) = \mathbb{E}_s \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ Q^{\pi_k}_{\mathrm{AGAC}}(s, a) - \alpha \log \pi(a \mid s) \right]$
      $= \mathbb{E}_s \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ Q^{\pi_k}(s, a) + c \left( \log \pi_k(a \mid s) - \log \pi_{\mathrm{adv}}(a \mid s) \right) - \alpha \log \pi(a \mid s) \right]$
      $= \mathbb{E}_s \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ Q^{\pi_k}(s, a) + c \left( \log \pi_k(a \mid s) - \log \pi(a \mid s) + \log \pi(a \mid s) - \log \pi_{\mathrm{adv}}(a \mid s) \right) - \alpha \log \pi(a \mid s) \right]$
      $= \mathbb{E}_s\left[ \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ Q^{\pi_k}(s, a) \right] - c \, D_{\mathrm{KL}}\left( \pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s) \right) + c \, D_{\mathrm{KL}}\left( \pi(\cdot \mid s) \,\|\, \pi_{\mathrm{adv}}(\cdot \mid s) \right) + \alpha \mathcal{H}\left( \pi(\cdot \mid s) \right) \right]$
      The $-c \, D_{\mathrm{KL}}(\pi \,\|\, \pi_k)$ term makes $\pi_k$ attractive, the $+c \, D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{adv}})$ term makes $\pi_{\mathrm{adv}}$ repulsive, and the entropy term enforces stochastic policies.
  • 25. Building Motivation
    • Thus, in the PI scheme, AGAC finds a policy that maximizes Q-values, while at the same time remaining close to the current policy and far from a mixture of the previous policies (i.e., $\pi_{k-1}$, $\pi_{k-2}$, $\pi_{k-3}$, ...).
    • Note that we experimentally observe that our method performs better with a smaller learning rate for the adversarial network than that of the other networks, which could imply that a stable repulsive term is beneficial.
  • 26. Building Motivation
    • This optimization problem is strongly concave in $\pi$ (thanks to the entropy term), and is state-wise a Legendre-Fenchel transform. Its solution is given by:
      $\pi_{k+1} \propto \left( \frac{\pi_k}{\pi_{\mathrm{adv}}} \right)^{c/\alpha} \exp\left( \frac{Q^{\pi_k}}{\alpha} \right)$
    • This result gives us some insight into the behavior of the objective function. Notably, in our example, if $\pi_{\mathrm{adv}}$ is fixed and $c = \alpha$, we recover a KL-regularized PI scheme (Geist et al., 2019) with the modified reward $r - c \log \pi_{\mathrm{adv}}$.
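The closed-form solution above is just the $\alpha$-scaled softmax applied to the shifted action-value $\tilde{Q}^{\pi_k} = Q^{\pi_k} + c(\log \pi_k - \log \pi_{\mathrm{adv}})$, and the two forms can be checked against each other numerically. A sketch for a single state with three actions (all numbers illustrative):

```python
import numpy as np

def agac_pi_next(q, pi_k, pi_adv, c=0.1, alpha=0.05):
    """Closed form: pi_{k+1} ∝ (pi_k / pi_adv)^(c/alpha) * exp(Q / alpha)."""
    w = (pi_k / pi_adv) ** (c / alpha) * np.exp(q / alpha)
    return w / w.sum()

def softmax_shifted_q(q, pi_k, pi_adv, c=0.1, alpha=0.05):
    """Equivalent form via the change of variable Q~ = Q + c(log pi_k - log pi_adv)."""
    q_tilde = q + c * (np.log(pi_k) - np.log(pi_adv))
    w = np.exp(q_tilde / alpha)
    return w / w.sum()

q = np.array([1.0, 0.5, 0.2])
pi_k = np.array([0.5, 0.3, 0.2])
pi_adv = np.array([0.4, 0.4, 0.2])
# Both expressions define the same distribution over actions.
```

With `c = 0` the adversary drops out and the solution reduces to the plain $\alpha$-scaled softmax of Q-values, matching the simpler problem used in the proof.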
  • 27. Proof
    • We prove that $\pi_{k+1} = \operatorname*{argmax}_\pi \mathcal{J}_{PI}(\pi) \propto \left( \frac{\pi_k}{\pi_{\mathrm{adv}}} \right)^{c/\alpha} \exp\left( \frac{Q^{\pi_k}}{\alpha} \right)$, with the objective function:
      $\mathcal{J}_{PI}(\pi) = \mathbb{E}_s \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ Q^{\pi_k}(s, a) + c \left( \log \pi_k(a \mid s) - \log \pi_{\mathrm{adv}}(a \mid s) \right) - \alpha \log \pi(a \mid s) \right]$
  • 28. Proof
    • We first consider a simpler optimization problem: $\operatorname*{argmax}_\pi \langle \pi, Q^{\pi_k} \rangle + \alpha \mathcal{H}(\pi)$, whose solution is known (Vieillard et al., 2020a, Appendix A).
    • The expression for the maximizer is the $\alpha$-scaled softmax:
      $\pi^* = \frac{\exp\left( Q^{\pi_k} / \alpha \right)}{\left\langle 1, \exp\left( Q^{\pi_k} / \alpha \right) \right\rangle}$
  • 29. Proof
    • We now turn towards the optimization problem of interest, which we can rewrite as: $\operatorname*{argmax}_\pi \langle \pi, Q^{\pi_k} + c \left( \log \pi_k - \log \pi_{\mathrm{adv}} \right) \rangle + \alpha \mathcal{H}(\pi)$
    • By the simple change of variable $\tilde{Q}^{\pi_k} = Q^{\pi_k} + c \left( \log \pi_k - \log \pi_{\mathrm{adv}} \right)$, we can reuse the previous solution (replacing $Q^{\pi_k}$ by $\tilde{Q}^{\pi_k}$), with the simplification:
      $\exp\left( \frac{Q^{\pi_k} + c \left( \log \pi_k - \log \pi_{\mathrm{adv}} \right)}{\alpha} \right) = \left( \frac{\pi_k}{\pi_{\mathrm{adv}}} \right)^{c/\alpha} \exp\left( \frac{Q^{\pi_k}}{\alpha} \right)$
  • 30. Implementation
    • In all of the experiments, we use PPO (Schulman et al., 2017) as the base algorithm and build on it to incorporate our method. Hence,
      $\mathcal{L}_{PG} = -\frac{1}{N} \sum_{t'=t}^{t+N} \min\left( \frac{\pi(a_{t'} \mid s_{t'}; \theta)}{\pi(a_{t'} \mid s_{t'}; \theta_{\mathrm{old}})} A_{t'}^{\mathrm{AGAC}},\ \mathrm{clip}\left( \frac{\pi(a_{t'} \mid s_{t'}; \theta)}{\pi(a_{t'} \mid s_{t'}; \theta_{\mathrm{old}})},\ 1 - \epsilon,\ 1 + \epsilon \right) A_{t'}^{\mathrm{AGAC}} \right)$
      with $A_{t'}^{\mathrm{AGAC}}$ given above, $N$ the temporal length considered for one update of parameters and $\epsilon$ the clipping parameter.
    • For reference, the original PPO objective is $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( r_t(\theta) \hat{A}_t,\ \mathrm{clip}\left( r_t(\theta),\ 1 - \epsilon,\ 1 + \epsilon \right) \hat{A}_t \right) \right]$.
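The clipped surrogate above can be sketched in a few lines; AGAC only swaps the standard advantage for $A^{\mathrm{AGAC}}$ inside PPO's objective. The arrays below are illustrative placeholders for per-timestep log-probabilities and advantages:

```python
import numpy as np

def ppo_clip_loss(log_pi, log_pi_old, adv_agac, eps=0.2):
    """PPO clipped surrogate evaluated with the AGAC-modified advantage (sketch)."""
    ratio = np.exp(log_pi - log_pi_old)  # pi_theta(a|s) / pi_theta_old(a|s)
    unclipped = ratio * adv_agac
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv_agac
    # Pessimistic (elementwise minimum) bound, negated because we minimize.
    return -np.mean(np.minimum(unclipped, clipped))
```

When the new and old policies coincide, the ratio is 1 everywhere and the loss reduces to the negative mean advantage; when the ratio grows past $1+\epsilon$ on a positive-advantage action, the clipped term caps the incentive to move further.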
  • 31. Implementation
    • Similar to RIDE (Raileanu & Rocktäschel, 2019), we also discount PPO by episodic state visitation counts.
    • The actor, critic and adversary use the convolutional architecture of the DQN Nature paper (Mnih et al., 2015) with different hidden sizes.
    • The three neural networks are optimized using Adam (Kingma & Ba, 2015). Our method does not use RNNs in its architecture; instead, in all our experiments, we use frame stacking.
  • 32. Implementation
  • 33. Implementation
    • At each training step, we perform a stochastic optimization step to minimize $\mathcal{L}_{\mathrm{AGAC}}$ using stop-gradient:
    • $\theta \leftarrow \mathrm{Adam}(\theta, \nabla_\theta \mathcal{L}_{PG}, \eta_1)$
    • $\phi \leftarrow \mathrm{Adam}(\phi, \nabla_\phi \mathcal{L}_V, \eta_1)$
    • $\psi \leftarrow \mathrm{Adam}(\psi, \nabla_\psi \mathcal{L}_{\mathrm{adv}}, \eta_2)$
  • 34. Experiments
    • In this section, we describe our experimental study, in which we investigate:
    • (i) whether the adversarial bonus alone (i.e. without episodic state visitation counts) is sufficient to outperform other methods in VizDoom, a sparse-reward task with high-dimensional observations
    • (ii) whether AGAC succeeds in partially-observable and procedurally-generated environments with high sparsity in the rewards, compared to other methods
  • 35. Experiments
    • (iii) how well AGAC is capable of exploring in environments without extrinsic reward
    • (iv) the training stability of our method
    • In all of the experiments, lines are average performances and shaded areas represent one standard deviation.
  • 36. Experiments
    • Environments
  • 37. Experiments
    • Baselines
    • RIDE (Raileanu & Rocktäschel, 2019)
    • Count, as Count-Based Exploration (Bellemare et al., 2016b)
    • RND (Burda et al., 2018)
    • ICM (Pathak et al., 2017)
    • AMIGo (Campero et al., 2021)
  • 38. Experiments
    • Adversarially-based Exploration (No episodic count)
  • 39. Experiments
    • Hard-Exploration Tasks with Partially-Observable Environments
  • 40. Experiments
    • Training Stability
  • 41. Experiments
    • Exploration in Reward-Free Environment
  • 42. Experiments
    • Visualizing Coverage and Diversity
  • 43. Experiments
    • Visualizing Coverage and Diversity
  • 44. Discussion
    • This paper introduced AGAC, a modification to the traditional actor-critic framework: an adversary network is added as a third protagonist.
    • The mechanics of AGAC have been discussed from a policy iteration point of view, and we provided theoretical insight into the inner workings of the proposed algorithm: the adversary forces the agent to remain close to the current policy while moving away from the previous ones. In a nutshell, the influence of the adversary makes the actor conservatively diversified.