Multi-Reward Reinforcement Learning
Literature Survey
Younghyo Park1
1 Mechanical Engineering Department, Senior
Seoul National University
Check out the Notion version of this slide : https://bit.ly/3tnzD9F
Table of Contents
1. Preliminaries
2. Why does one use multi-reward?
3. Notable Papers
4. Paper Review / Summary
o innately multi-reward
o multi-reward to better understand the environment
o multi-reward for better performance
Preliminaries
 Multi-Objective Markov Decision Process (MOMDP)
o the reward function is no longer a scalar, but a vector.
o the value function for a stationary policy 𝜋 on state 𝑠 is also a vector.
o For a single-objective MDP (SOMDP), the ordering of value functions is complete
when the state is given.
o In contrast, in an MOMDP, even when the state is given, only a partial ordering is
possible. Determining the optimal policy requires further care.
⇒ one possible solution is to prioritize the rewards / objectives (covered later)
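To make the partial-ordering point concrete, here is a minimal Python sketch (my own illustration, not from the slides) of Pareto dominance between vector-valued value functions:
```python
# Minimal sketch: with vector-valued returns, two value vectors may be
# incomparable, so only a partial (Pareto) ordering exists.
import numpy as np

def pareto_dominates(v1, v2):
    """True if value vector v1 is at least as good as v2 in every objective
    and strictly better in at least one."""
    v1, v2 = np.asarray(v1), np.asarray(v2)
    return bool(np.all(v1 >= v2) and np.any(v1 > v2))

# Example: neither policy dominates the other -> no total ordering.
print(pareto_dominates([3.0, 1.0], [2.0, 2.0]))  # False
print(pareto_dominates([2.0, 2.0], [3.0, 1.0]))  # False
```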
Why does one use multi-reward?
After surveying the literature, I’ve noticed that the reinforcement learning community
deploys multi-reward architectures for the following reasons, mainly threefold.
1. The MDP problem is innately multi-reward
In this case, the MDP has to be designed as a multi-reward problem from
the beginning. (There is no single-reward alternative.)
o (possibly conflicting) multiple objectives / goals
ex) public transportation system using RL → multiple goals of “commute
time” and “fuel efficiency”
o rewards given by multiple users (experts)
ex) RL environment where human users/experts give the reward → one
user may decide to reward the agent with values twice as large as those
of another → the rewards are incomparable and cannot be naively converted
into a single-reward problem.
2. Implement multi-reward to better understand the environment
The MDP could be formulated as a single-reward problem. However,
constructing it as a multi-reward problem can give us more information about
the environment.
o generalizing the q-functions over multiple goals
ex) training the RL system to generalize across various tasks/goals →
learning a q-network which accepts the ‘goal state’ as an input
o exploring the environment using multiple agents
ex) multiple agents, acting based on different rewards, can give us more
information about the environment
3. Implement multi-reward for better performance
When the original single-objective MDP problem is quite cumbersome to
handle, we can ease the problem by splitting the single reward into multiple
(easier) rewards.
o sparse rewards
ex) a binary reward is given at the end of each episode only if the agent
achieves the goal → if the goal is hard to achieve, the reward can be
extremely sparse, and RL training can be problematic.
o complex rewards
ex) if the single-objective reward depends on too many state
components, learning the q-function might be difficult → split the single-
objective reward into simpler multi-objective rewards
Notable Papers
1. MDP problem is innately multi-reward
o A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf]
o Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]
o Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf]
o Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf]
2. Implement multi-reward to better understand the environment
o Horde: A scalable real-time architecture for learning knowledge from unsupervised
sensorimotor interaction [AAMAS 2011] [pdf]
o Universal Value Function Approximators [ICML 2015] [pdf]
3. Implement multi-reward for better performance
o Hindsight Experience Replay [NIPS 2017] [pdf]
o Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]
Paper Review / Summary – innately multi-reward (1)
A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf]
Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
 One might wonder,
“Why don’t we scalarize the vector-reward with some scalarization function?”
 Such a scalarization function 𝑓, parameterized by a weight vector 𝐰,
can convert the MOMDP problem into a SOMDP problem.
ex) one possible scalarization function is the linear one: 𝑓(𝐕𝜋, 𝐰) = 𝐰 ∙ 𝐕𝜋
 Once we scalarize the reward, typical SOMDP solutions can be applied.
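As a toy illustration of the linear scalarization above (my own sketch; the function name and the weight convention are assumptions, not from the paper):
```python
# Linear scalarization of a vector value estimate: f(V, w) = w . V
import numpy as np

def scalarize_linear(v_vec, w):
    """Weights are assumed non-negative and summing to 1."""
    w = np.asarray(w, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
    return float(np.dot(w, v_vec))

print(scalarize_linear([10.0, 2.0], [0.7, 0.3]))  # 7.6
```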
 Unfortunately, such conversion is not always possible / desirable.
1. Unknown weights scenario
o Weight has to be specified before learning → not always possible!
o User might prefer different priorities (weights) over time.
o But still, once the weight is specified and fixed, we can easily train and
use the decision-making process.
ex) public transportation system using RL
– weights (priority) between two rewards, each corresponding to commute
time and pollution cost, might fluctuate based on the price of oil.
– weight cannot be specified before training!
– once the weight is fixed, we can use the SOMDP algorithm.
2. Decision support scenario * not really our main concern
o The concept of scalarization itself may not be applicable from the
beginning → objective priorities may not be accurately quantifiable.
o Users may also have “fuzzy” preferences that defy meaningful
quantification.
o The MOMDP might require arbitrary human decisions during its operation.
ex) public transportation system using RL
– if the transportation system could be made more efficient by obstructing a
beautiful view, a human designer may not be able to quantify the loss,
or the priority regarding the loss of beauty.
3. Known weights scenario * rare case
o if the scalarization function 𝑓 is nonlinear, the resulting SOMDP problem
may not have additive returns.
o the optimal policy may be stochastic, and thus difficult to solve for.
 Thus, algorithmic solutions that specifically target the MOMDP case should be
developed. * Note that our main concern is the “weights”.
 A useful MOMDP solution should be able to
1. give an optimal policy for any arbitrary weight,
2. properly handle the case when the weights change over time (dynamic weights),
3. with just a single initial training run (no retraining required)
 Important Definitions and Terminologies
1. Undominated policies
o this set includes a (weight 𝐰, policy 𝜋) pair if the policy 𝜋 is optimal for that
weight 𝐰
o the undominated set contains redundant policies (some of the policies
contained in this set are not the only optimal policy for their weight 𝐰)
o what we want is a compact policy set that allows us to index a single
optimal policy for a given weight 𝐰.
2. Coverage Set
o the coverage set is a subset of the undominated policies
o it includes a single policy corresponding to every possible weight
o the authors call the process of obtaining the coverage set (from the
undominated set) a pruning process.
“even if we don’t know the weights a-priori, we are already eliminating the
redundant policies that we know that we aren’t going to use in the future.”
from) Multi-Objective Decision Making, Shimon Whiteson, Microsoft
https://www.youtube.com/watch?v=_zJ_cbg3TzY&feature=youtu.be
o For instance, assume that there are only two possible scalarizations (weights):
o Undominated set = {𝜋1, 𝜋2, 𝜋3}
o Coverage set = {𝜋1, 𝜋2} or {𝜋3, 𝜋2}
* figure: 𝜋1 and 𝜋3 are both optimal for the same weight, so only one of them is
needed (redundant); a policy that is optimal for neither scalarization (weight) does
not enter the undominated set at all.
3. Convex Hull
o the set of undominated policies for a linear scalarization function
4. Convex Coverage Set
o the coverage set for a linear scalarization function
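A rough sketch (my own illustration, not the paper’s algorithm) of how one could approximate a two-objective convex coverage set by sweeping linear weights:
```python
# Approximate the convex coverage set (CCS) for two objectives by checking
# which value vectors are optimal for some linear weight on a grid.
import numpy as np

def approx_ccs_2d(value_vectors, n_weights=1001):
    """Return indices of value vectors that win the linear scalarization
    w*v[0] + (1-w)*v[1] for at least one w in [0, 1] (grid approximation)."""
    V = np.asarray(value_vectors, dtype=float)   # (num_policies, 2)
    ws = np.linspace(0.0, 1.0, n_weights)
    W = np.stack([ws, 1.0 - ws], axis=1)          # (n_weights, 2)
    scalarized = W @ V.T                          # (n_weights, num_policies)
    return sorted(int(i) for i in set(np.argmax(scalarized, axis=1)))

# Policies 0 and 2 tie for the same weight -> only one of them is kept
# (the redundancy discussed above); the last policy never wins any weight.
print(approx_ccs_2d([[3.0, 1.0], [1.0, 3.0], [3.0, 1.0], [1.5, 1.5]]))  # [0, 1]
```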
 Important Visualizations
o Assume the reward/value vectors live in ℝ2: each policy’s value vector is a
dot in objective space.
o Scalarizing with a weight 𝐰 turns each policy into a line in weight space
(its scalarized value as a function of the weight).
* figure: each dot in objective space corresponds to a line in weight space
o Now, what’s the coverage set?
o For all weights, we should determine a single optimal policy (value function)
⇒ the upper surface in weight space.
o The blue dot / line is included in the undominated policy set, but not in the
coverage set. (redundant policy)
 Thus, our goal is to find the minimal coverage set: the upper surface (and its
corresponding optimal policies / value functions) in the weight space.
Paper Review / Summary – innately multi-reward (2)
Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]
Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
 This paper aims to learn an approximate coverage set of policies, each
represented by a neural network.
 One might ask, “Do I have to check the optimal policy for all possible weights to
find out the coverage set?” Fortunately, the answer is “No, you don’t have to.”
⇒ we may solve the SOMDP problem for only a few points in the weight
space to fully determine the coverage set.
 The authors use the concept of “Linear Support.”
 Linear Support Algorithm (Cheng, 1988)
http://www.pomdp.org/tutorial/cheng.html
1. First, pick an extremum point in the weight space.
2. For the chosen weight, solve the SOMDP and get the (vector-form) optimal
value function. That gives us a line in the weight space.
* you might want to store the corresponding optimal policy as well
3. Now, set the weights to the other extremum.
4. Solving the SOMDP, we get another (vector-form) optimal value function,
which again can be represented as a line in this plot.
5. We now have an intersection. We call this a “corner point”.
6. Find the corresponding (vector-form) optimal value function at this corner-
point weight.
7. We then can find new corner points.
8. Repeat the same process until you end up drawing a line you have
previously drawn.
(⇔ no new optimal value vector / optimal policy can be obtained)
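A simplified two-objective sketch of the linear-support loop above (my own illustration; `solve_somdp` is a hypothetical single-objective solver that returns the optimal value vector for a given scalarizing weight):
```python
import numpy as np

def linear_support_2d(solve_somdp, tol=1e-9):
    ccs = []                                                # optimal value vectors found so far
    queue = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]    # start at the extremum weights
    while queue:
        w = queue.pop()
        v = np.asarray(solve_somdp(w), dtype=float)
        if any(np.allclose(v, u, atol=tol) for u in ccs):
            continue                                        # same line as before -> nothing new here
        for u in ccs:                                       # new line -> new corner points
            d = u - v
            if abs(d[0] - d[1]) > tol:
                w1 = d[1] / (d[1] - d[0])                   # weight where the two lines intersect
                if 0.0 < w1 < 1.0:
                    queue.append(np.array([w1, 1.0 - w1]))
        ccs.append(v)
    return ccs
```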
 Optimal Linear Support Algorithm (OLS)
o The authors developed a slightly modified version of Linear Support (Cheng,
1988) to reduce the computational burden.
o For a large weight space, the original Linear Support algorithm might require
excessive iterations.
o OLS instead terminates the iteration when the maximum possible
improvement 𝛥 at the remaining corner weights falls below a pre-defined
threshold 𝜖.
* figure: 𝛥 is the maximum possible improvement; if it is smaller than the
pre-defined threshold 𝜖, the iteration terminates.
 OLS requires an SOMDP solver that can
1. give us a vectorized optimal value function
2. and the corresponding optimal policy
3. when weights are given.
 The authors use a DQN that outputs a matrix of Q-values (one row per action,
one column per objective).
* while the standard DQN tries to maximize the Q-value itself,
this DQN tries to maximize the scalarized Q-value for the explored corner weights.
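For illustration only (my sketch, not the paper’s code): greedy action selection from such a multi-objective Q-value matrix under the current corner weight:
```python
import numpy as np

def greedy_action(q_matrix, w):
    """q_matrix: shape (num_actions, num_objectives); w: scalarizing weight vector."""
    scalarized = q_matrix @ np.asarray(w)   # one scalar Q per action
    return int(np.argmax(scalarized))

q = np.array([[1.0, 0.2],    # action 0: good on objective 0
              [0.1, 1.5]])   # action 1: good on objective 1
print(greedy_action(q, [0.9, 0.1]))  # 0
print(greedy_action(q, [0.1, 0.9]))  # 1
```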
Paper Review / Summary – innately multi-reward (3)
Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf]
Axel Abels, Diederik M. Roijers, Tom Lenaerts, Ann Nowe, Denis Steckelmacher
 This paper specifically considers the case where the weights change over time.
 Main Contributions
1. propose a Conditioned Network (CN) → augmented version of DQN that
outputs weight-dependent multi-objective Q-vectors.
2. propose Diverse Experience Replay (DER) → way to efficiently train the
conditioned network, exploring both the weight space and state-action space.
 Conditioned Network (CN)
o The network structure itself is quite intuitive: it accepts the weight as an input.
o The main problem comes from the “episode generation” phase.
o During the episode generation phase,
1. to fully explore the action space, we might just take the 𝜖-greedy policy.
2. to fully explore the weight space, ... ?
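A minimal weight-conditioned Q-network sketch in PyTorch (my own illustration of the CN idea; the layer sizes and the scalarized action selection are assumptions, not the paper’s exact architecture):
```python
import torch
import torch.nn as nn

class ConditionedQNet(nn.Module):
    """Takes (state, weight) and outputs a multi-objective Q-vector per action."""
    def __init__(self, state_dim, n_actions, n_objectives, hidden=128):
        super().__init__()
        self.n_actions, self.n_objectives = n_actions, n_objectives
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_objectives, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions * n_objectives),
        )

    def forward(self, state, weight):
        x = torch.cat([state, weight], dim=-1)
        return self.net(x).view(-1, self.n_actions, self.n_objectives)

    def act(self, state, weight):
        with torch.no_grad():
            q = self.forward(state, weight)                   # (1, actions, objectives)
            scalar_q = (q * weight.unsqueeze(1)).sum(-1)      # (1, actions)
            return int(scalar_q.argmax(dim=-1))

net = ConditionedQNet(state_dim=4, n_actions=3, n_objectives=2)
print(net.act(torch.zeros(1, 4), torch.tensor([[0.5, 0.5]])))
```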
 Diverse Experience Replay (DER)
o a diverse buffer from which relevant experiences can be sampled for weight
vectors whose policies have not been executed recently.
o a method that reduces replay-buffer bias, making it possible to obtain
diverse multi-objective optimal value vectors.
* figure: replay-buffer contents without DER vs. with DER
Paper Review / Summary – innately multi-reward (4)
Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf]
Christian R. Shelton
 A slightly different approach: scalarization is not the way to go!
o The author claims, “creating a single reward value by combining the multiple
components can throw away vital information and can lead to incorrect
solutions.”
o Thus, the author suggests an algorithm that uses / interprets the multiple rewards
in vector form itself, without any type of scalarization.
 Instead of explicitly setting priorities between the multiple objectives, this paper tries
to find an ‘optimal balance’ between the multiple rewards.
 Real-life intuition: “when and how do we make an optimal decision between
(possibly conflicting) social agendas?” → VOTE!
 Policy Votes
o The author introduces the concept of voting.
o Instead of one-hot voting (like we humans do), each reward source can vote for
multiple options (actions), as long as its votes sum to one.
o Meanwhile, these multiple reward sources should vote not only on a single
state, but on multiple states. They might want to “distribute” their voting power
𝛼𝑠(𝑥) over multiple states.
o Now, the ballot counting: the final policy is determined by the votes from the
multiple reward sources.
o Note that 𝛼𝑠(𝑥) and 𝑣𝑠(𝑥, 𝑎) are all trainable parameters.
⇒ Each reward source will tune its own parameters to maximize its
expected reward.
o However, for each reward source, it might be unwise to entirely reveal its
true policy preference in the vote. → Keep in mind that the overall final policy
is also affected by the votes from the other reward sources.
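One plausible reading of the ballot counting (my own sketch; the paper’s exact combination and normalization may differ):
```python
# Each reward source s spreads voting power alpha[s, x] over states and casts
# per-state action votes votes[s, x, a]; the final policy pools and renormalizes.
import numpy as np

def combined_policy(alpha, votes):
    """alpha: (n_sources, n_states); votes: (n_sources, n_states, n_actions),
    each votes[s, x] summing to 1.  Returns pi of shape (n_states, n_actions)."""
    weighted = (alpha[:, :, None] * votes).sum(axis=0)       # (states, actions)
    return weighted / weighted.sum(axis=1, keepdims=True)    # renormalize per state

alpha = np.array([[0.8, 0.2],    # source 0 spends most of its voting power on state 0
                  [0.3, 0.7]])   # source 1 spends most of its voting power on state 1
votes = np.array([[[1.0, 0.0], [1.0, 0.0]],     # source 0 always votes for action 0
                  [[0.0, 1.0], [0.0, 1.0]]])    # source 1 always votes for action 1
print(combined_policy(alpha, votes))
```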
 Nash Equilibrium Problem
o We can now formulate the problem as a Nash-equilibrium situation from game
theory.
1. each reward source, one by one, finds the best vote to maximize its
reward.
2. simultaneously update the old solution with the new best responses
3. the iteration ends when all reward sources keep their previous votes, with the
same amount of reward. (Nash equilibrium)
o Since this is quite an old paper (before the popularity of deep RL), the author
uses an old-fashioned approach: formulate an estimator of the reward for a given
policy 𝜋.
* note: the estimator involves the KL-divergence between the true personal
preference 𝑝𝑠(𝑥, 𝑎) and the official policy 𝜋(𝑥, 𝑎)
Paper Review / Summary – better understand the env (1)
Horde: A scalable real-time architecture for learning …[AAMAS 2011] [pdf]
Richard S. Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski, Adam White
 This paper first gave the idea of implementing multiple rewards to better understand
the environment!
 The authors believe,
“knowledge about the environment is represented as a large number of
approximate value functions learned in parallel, each with its own policy,
pseudo-reward function, pseudo-termination function, and pseudo-terminal-
reward function.”
 They define a general value function (GVF), 𝑞(𝑠, 𝑎; 𝜋, 𝛾, 𝑟, 𝑧),
with four auxiliary functional inputs (the question functions): a policy 𝜋, a
pseudo-termination function 𝛾, a pseudo-reward function 𝑟, and a
pseudo-terminal-reward function 𝑧.
 Now, the authors propose the idea of the Horde architecture,
o consisting of an overall agent composed of many sub-agents (called demons)
o each demon is an independent RL agent responsible for learning one small
piece of knowledge about the environment
o demons try to approximate the GVF 𝑞 corresponding to their own question
functions (pseudo-rewards)
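A toy sketch of the Horde idea (my own illustration, not the paper’s architecture): many demons share one experience stream, each learning the value of its own pseudo-reward with a TD(0) update:
```python
import numpy as np

class Demon:
    """One demon: an independent learner with its own pseudo-reward."""
    def __init__(self, n_states, pseudo_reward, gamma=0.9, alpha=0.1):
        self.v = np.zeros(n_states)            # this demon's value estimates
        self.r, self.gamma, self.alpha = pseudo_reward, gamma, alpha

    def update(self, s, s_next):               # simple TD(0) update
        td = self.r(s, s_next) + self.gamma * self.v[s_next] - self.v[s]
        self.v[s] += self.alpha * td

# Each demon answers a different "question" about the same environment.
demons = [Demon(5, lambda s, s2: float(s2 == 4)),   # "how close am I to state 4?"
          Demon(5, lambda s, s2: float(s2 == 0))]   # "how close am I to state 0?"
for s, s2 in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 4)]:   # one shared experience stream
    for d in demons:
        d.update(s, s2)
```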
Paper Review / Summary – better understand the env (2)
Universal Value Function Approximators [ICML 2015] [pdf]
Tom Schaul, Dan Horgan, Karol Gregor, David Silver
 This paper tries to extend the idea of the general value function (GVF).
 Recall: What’s wrong with the typical value function 𝑉(𝑠)?
o it represents the utility of any state in achieving a single goal.
o no information can be extracted from this value function when we want to
achieve a different goal (or multiple goals).
 Recall: Sutton et al. (2011) tried to extend the value function to take an extra
(pseudo-) “goal” into account, for the purpose of learning more about the
surrounding environment.
o learn multiple value function approximators 𝑉𝑔(𝑠), each corresponding to a
different (pseudo-) goal 𝑔.
o each such value function represents a chunk of knowledge about the
environment → can be useful when we have to solve a different goal.
 The authors extend the idea of the general value function approximator to take both
the state 𝑠 and the goal 𝑔 as input: 𝑉(𝑠, 𝑔; 𝜃), parameterized by 𝜃.
 Instead of learning multiple value functions for some selected goal states, we are
learning a single universal value function approximator (UVFA) that can
generalize over all possible goals.
 However, training a UVFA can be a difficult task!
→ if naively trained, the agent will only see a small subset of the possible
combinations of states and goals (𝑠, 𝑔)
 Possible Architectures of UVFA
* figure: (left) simply concatenate the state and goal as a single input to one
network; (right) a two-stream architecture with separate state and goal
embeddings.
o it turns out the two-stream architecture works much better than simple
concatenation. * experiment details omitted
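A minimal PyTorch sketch of the two variants above (my own illustration; the layer sizes and the dot-product combination in the two-stream version are assumptions, not the paper’s exact layers):
```python
import torch
import torch.nn as nn

class ConcatUVFA(nn.Module):
    """V(s, g): concatenate state and goal, feed through one MLP."""
    def __init__(self, state_dim, goal_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + goal_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, s, g):
        return self.net(torch.cat([s, g], dim=-1))

class TwoStreamUVFA(nn.Module):
    """V(s, g) = <phi(s), psi(g)>: separate embeddings, combined by dot product."""
    def __init__(self, state_dim, goal_dim, embed=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(state_dim, embed), nn.ReLU(),
                                 nn.Linear(embed, embed))
        self.psi = nn.Sequential(nn.Linear(goal_dim, embed), nn.ReLU(),
                                 nn.Linear(embed, embed))

    def forward(self, s, g):
        return (self.phi(s) * self.psi(g)).sum(dim=-1, keepdim=True)
```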
 Training the UVFA * (training-procedure figure omitted)
 Results
* figure: (left) the UVFA is trained on the green-dotted goals; (right) value
functions are predicted for the pink-dotted goals. For 5 goals from the test set,
the value function learned by Horde (explicitly explored) is shown next to the
value function predicted by the UVFA.
Paper Review / Summary – better performance (1)
Hindsight Experience Replay [NIPS 2017] [pdf]
 Designing a reward function is important, but not easy.
o a common challenge of RL is to carefully engineer the reward function → it must
not only reflect the task at hand, but also be carefully shaped to guide policy
optimization
o the necessity of cost engineering limits the applicability of RL in the real
world, because it requires both RL expertise and domain-specific knowledge.
o not applicable in situations where we do not know what admissible behavior
may look like.
 Therefore, we need to develop algorithms which can learn from unshaped reward
signals (e.g. a binary signal indicating successful task completion)
Marcin Andrychowicz, Filip Wolski, .. , Wojciech Zaremba
 Meanwhile, humans can learn from both successful and failed attempts.
o if you’re learning to play hockey, for instance, you can definitely learn from the
experience of failure (the puck missed the net, slightly to the right) → you can
adjust your shot slightly to the left!
 On the other hand, robots learn nothing from this failure (zero reward).
 It is however possible to draw another conclusion: “this failed sequence of actions
would have been successful, and thus beneficial for the robot’s learning, if the net
had been placed further to the right!”
 The authors propose “Hindsight Experience Replay.”
 Hindsight Experience Replay (HER)
o after experiencing the episode 𝑠0, 𝑠1, ⋯ , 𝑠𝑇, we store in the replay buffer every
transition 𝑠𝑡 → 𝑠𝑡+1 not only with the original goal used for this episode, but
also with a subset of other goals.
o one possible choice of such goals is the state achieved at the final step of
each episode.
o implement the UVFA structure (concatenated version) to learn from these
multiple goals
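A minimal sketch of the relabeling step described above (my own illustration; the paper’s goal-sampling strategies and reward definition differ in detail):
```python
def her_relabel(episode, reward_fn, replay_buffer, original_goal):
    """episode: list of (s, a, s_next); reward_fn(s_next, goal) -> sparse 0/1 reward."""
    final_state = episode[-1][2]               # hindsight goal: where we actually ended up
    for s, a, s_next in episode:
        # store the transition with the original goal ...
        replay_buffer.append((s, a, reward_fn(s_next, original_goal), s_next, original_goal))
        # ... and again with the achieved final state treated as the goal
        replay_buffer.append((s, a, reward_fn(s_next, final_state), s_next, final_state))

# usage sketch on a toy grid episode
buffer = []
episode = [((0, 0), "right", (1, 0)), ((1, 0), "up", (1, 1))]
her_relabel(episode, lambda s, g: float(s == g), buffer, original_goal=(3, 3))
```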
 Results * (figure omitted)
Paper Review / Summary – better performance (2)
Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]
 This paper proposes a Hybrid Reward Architecture (HRA)
 Horde vs UVFA vs HRA
o Horde : learns multiple general value functions (GVFs), each corresponding
to a different reward function and other question functions, using multiple sub-
agents (a.k.a. demons)
o UVFA : generalizes the GVFs across different tasks and goals
o HRA : decomposes the reward function into 𝑛 different reward functions, with
the intent to solve multiple simple tasks rather than a single complex task.
Harm van Seijen, Joshua Romoff, Tavian Barnes, .. , Jeffrey Tsang
 Proposed Method
o Decompose the reward function 𝑅𝑒𝑛𝑣 into 𝑛 reward functions,
𝑅𝑒𝑛𝑣(𝑠, 𝑎) = Σ𝑘 𝑅𝑘(𝑠, 𝑎),
and train separate RL agents on each of these reward functions.
o Because agent 𝑘 has its own reward function, it also has its own Q-value
function, 𝑄𝑘. (In fact, the 𝑛 different DQN networks can share multiple
lower-level layers!)
o The combined network represents all Q-value functions at once:
𝑄𝐻𝑅𝐴(𝑠, 𝑎; 𝜃) = Σ𝑘 𝑄𝑘(𝑠, 𝑎; 𝜃𝑘)
o Loss function associated with HRA: the sum, over the 𝑛 heads, of the usual DQN
temporal-difference losses, each head 𝑘 using its own reward 𝑅𝑘 and its own
Q-value function 𝑄𝑘.
o Minimizing this loss gives the optimal parameters 𝜃⋆.
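A hedged sketch of that per-head loss (my reconstruction, assuming a standard DQN-style target per head; not the paper’s exact code):
```python
import torch

def hra_loss(q_heads, q_heads_next, actions, rewards, gamma=0.99):
    """q_heads:      (batch, n_heads, n_actions)  Q_k(s, a) for every head k
       q_heads_next: (batch, n_heads, n_actions)  Q_k(s', .) from a target network
       actions:      (batch,) actions taken; rewards: (batch, n_heads) R_k(s, a)."""
    batch = torch.arange(q_heads.shape[0])
    q_taken = q_heads[batch, :, actions]                    # (batch, n_heads)
    # each head bootstraps against the max of its OWN Q-values
    target = rewards + gamma * q_heads_next.max(dim=-1).values
    return ((target.detach() - q_taken) ** 2).sum(dim=1).mean()
```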
 HRA Architecture * (architecture figure omitted)
 Possible Variants
o not only decomposing the existing reward, but also adding pseudo-rewards.
 Results
* figure: comparison of the original DQN vs. HRA, and of the original HRA (just
decomposing) vs. the extended HRA (decomposing + adding pseudo-rewards)
For more details, check out the original papers.
1. MDP problem is innately multi-reward
o A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf]
o Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]
o Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf]
o Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf]
2. Implement multi-reward to better understand the environment
o Horde: A scalable real-time architecture for learning knowledge from unsupervised
sensorimotor interaction [AAMAS 2011] [pdf]
o Universal Value Function Approximators [ICML 2015] [pdf]
3. Implement multi-reward for better performance
o Hindsight Experience Replay [NIPS 2017] [pdf]
o Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]
Check out the Notion version of this slide : https://bit.ly/3tnzD9F

More Related Content

Similar to Multi reward literature_survey_younghyo_park

Investigating teachers' understanding of IMS Learning Design: Yes they can!
Investigating teachers' understanding of IMS Learning Design: Yes they can!Investigating teachers' understanding of IMS Learning Design: Yes they can!
Investigating teachers' understanding of IMS Learning Design: Yes they can!
Michael Derntl
 
WebOrganic Sharing 20130419
WebOrganic Sharing 20130419 WebOrganic Sharing 20130419
WebOrganic Sharing 20130419
Jeff Ng
 
Book Summary Report Example
Book Summary Report ExampleBook Summary Report Example
Book Summary Report ExampleEMBS2007
 
Review On In-Context Leaning.pptx
Review On In-Context Leaning.pptxReview On In-Context Leaning.pptx
Review On In-Context Leaning.pptx
wesleyshih4
 
Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...
Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...
Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...
SigOpt
 
GDG Community Day 2023 - Interpretable ML in production
GDG Community Day 2023 - Interpretable ML in productionGDG Community Day 2023 - Interpretable ML in production
GDG Community Day 2023 - Interpretable ML in production
SARADINDU SENGUPTA
 
Self-directed learning using Moodle
Self-directed learning using MoodleSelf-directed learning using Moodle
Self-directed learning using Moodle
NetSpot Pty Ltd
 
REINFORCEMENT LEARNING (reinforced through trial and error).pptx
REINFORCEMENT LEARNING (reinforced through trial and error).pptxREINFORCEMENT LEARNING (reinforced through trial and error).pptx
REINFORCEMENT LEARNING (reinforced through trial and error).pptx
archayacb21
 
Ev3 teachers guide web
Ev3 teachers guide webEv3 teachers guide web
Ev3 teachers guide web
Arif Budiman
 
Ev3 teachers guia
Ev3 teachers guiaEv3 teachers guia
Ev3 teachers guia
WebMD
 
A hybrid constructive algorithm incorporating teaching-learning based optimiz...
A hybrid constructive algorithm incorporating teaching-learning based optimiz...A hybrid constructive algorithm incorporating teaching-learning based optimiz...
A hybrid constructive algorithm incorporating teaching-learning based optimiz...
IJECEIAES
 
EAUC 2014 presentation on energy visualisation
EAUC 2014 presentation on energy visualisationEAUC 2014 presentation on energy visualisation
EAUC 2014 presentation on energy visualisation
Karl Letten
 
How energy visualisation can work for you
How energy visualisation can work for youHow energy visualisation can work for you
How energy visualisation can work for you
Nottingham Trent University
 
Conditional interval variables: A powerful concept for modeling and solving c...
Conditional interval variables: A powerful concept for modeling and solving c...Conditional interval variables: A powerful concept for modeling and solving c...
Conditional interval variables: A powerful concept for modeling and solving c...
Philippe Laborie
 
The Cost of Technology: Total Cost of Ownership and Value of Investment - Ri...
The Cost of Technology:  Total Cost of Ownership and Value of Investment - Ri...The Cost of Technology:  Total Cost of Ownership and Value of Investment - Ri...
The Cost of Technology: Total Cost of Ownership and Value of Investment - Ri...SchoolDude Editors
 
MOOCS@Work Working Group Session 4
MOOCS@Work Working Group Session 4MOOCS@Work Working Group Session 4
MOOCS@Work Working Group Session 4
LearningCafe
 
Learning cafe call moo cs wgmeet4_ver0.1
Learning cafe call moo cs wgmeet4_ver0.1Learning cafe call moo cs wgmeet4_ver0.1
Learning cafe call moo cs wgmeet4_ver0.1LearningCafe
 
Al-Ahliyya Amman University جامعة عمان األهلية.docx
Al-Ahliyya Amman University   جامعة عمان األهلية.docxAl-Ahliyya Amman University   جامعة عمان األهلية.docx
Al-Ahliyya Amman University جامعة عمان األهلية.docx
galerussel59292
 
User interface adaptation based on user feedback and machine learning
User interface adaptation based on user feedback and machine learningUser interface adaptation based on user feedback and machine learning
User interface adaptation based on user feedback and machine learning
Nesrine Mezhoudi
 
Multiobjective Firefly Algorithm for Continuous Optimization
Multiobjective Firefly Algorithm for Continuous Optimization Multiobjective Firefly Algorithm for Continuous Optimization
Multiobjective Firefly Algorithm for Continuous Optimization
Xin-She Yang
 

Similar to Multi reward literature_survey_younghyo_park (20)

Investigating teachers' understanding of IMS Learning Design: Yes they can!
Investigating teachers' understanding of IMS Learning Design: Yes they can!Investigating teachers' understanding of IMS Learning Design: Yes they can!
Investigating teachers' understanding of IMS Learning Design: Yes they can!
 
WebOrganic Sharing 20130419
WebOrganic Sharing 20130419 WebOrganic Sharing 20130419
WebOrganic Sharing 20130419
 
Book Summary Report Example
Book Summary Report ExampleBook Summary Report Example
Book Summary Report Example
 
Review On In-Context Leaning.pptx
Review On In-Context Leaning.pptxReview On In-Context Leaning.pptx
Review On In-Context Leaning.pptx
 
Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...
Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...
Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...
 
GDG Community Day 2023 - Interpretable ML in production
GDG Community Day 2023 - Interpretable ML in productionGDG Community Day 2023 - Interpretable ML in production
GDG Community Day 2023 - Interpretable ML in production
 
Self-directed learning using Moodle
Self-directed learning using MoodleSelf-directed learning using Moodle
Self-directed learning using Moodle
 
REINFORCEMENT LEARNING (reinforced through trial and error).pptx
REINFORCEMENT LEARNING (reinforced through trial and error).pptxREINFORCEMENT LEARNING (reinforced through trial and error).pptx
REINFORCEMENT LEARNING (reinforced through trial and error).pptx
 
Ev3 teachers guide web
Ev3 teachers guide webEv3 teachers guide web
Ev3 teachers guide web
 
Ev3 teachers guia
Ev3 teachers guiaEv3 teachers guia
Ev3 teachers guia
 
A hybrid constructive algorithm incorporating teaching-learning based optimiz...
A hybrid constructive algorithm incorporating teaching-learning based optimiz...A hybrid constructive algorithm incorporating teaching-learning based optimiz...
A hybrid constructive algorithm incorporating teaching-learning based optimiz...
 
EAUC 2014 presentation on energy visualisation
EAUC 2014 presentation on energy visualisationEAUC 2014 presentation on energy visualisation
EAUC 2014 presentation on energy visualisation
 
How energy visualisation can work for you
How energy visualisation can work for youHow energy visualisation can work for you
How energy visualisation can work for you
 
Conditional interval variables: A powerful concept for modeling and solving c...
Conditional interval variables: A powerful concept for modeling and solving c...Conditional interval variables: A powerful concept for modeling and solving c...
Conditional interval variables: A powerful concept for modeling and solving c...
 
The Cost of Technology: Total Cost of Ownership and Value of Investment - Ri...
The Cost of Technology:  Total Cost of Ownership and Value of Investment - Ri...The Cost of Technology:  Total Cost of Ownership and Value of Investment - Ri...
The Cost of Technology: Total Cost of Ownership and Value of Investment - Ri...
 
MOOCS@Work Working Group Session 4
MOOCS@Work Working Group Session 4MOOCS@Work Working Group Session 4
MOOCS@Work Working Group Session 4
 
Learning cafe call moo cs wgmeet4_ver0.1
Learning cafe call moo cs wgmeet4_ver0.1Learning cafe call moo cs wgmeet4_ver0.1
Learning cafe call moo cs wgmeet4_ver0.1
 
Al-Ahliyya Amman University جامعة عمان األهلية.docx
Al-Ahliyya Amman University   جامعة عمان األهلية.docxAl-Ahliyya Amman University   جامعة عمان األهلية.docx
Al-Ahliyya Amman University جامعة عمان األهلية.docx
 
User interface adaptation based on user feedback and machine learning
User interface adaptation based on user feedback and machine learningUser interface adaptation based on user feedback and machine learning
User interface adaptation based on user feedback and machine learning
 
Multiobjective Firefly Algorithm for Continuous Optimization
Multiobjective Firefly Algorithm for Continuous Optimization Multiobjective Firefly Algorithm for Continuous Optimization
Multiobjective Firefly Algorithm for Continuous Optimization
 

Recently uploaded

办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
eddie19851
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 

Recently uploaded (20)

办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 

Multi reward literature_survey_younghyo_park

  • 8. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 8 / 70 Why does one use multi-reward? After some literature survey, I’ve noticed that the reinforcement learning community deploys multi-reward architectures for the following reasons, mainly threefold. 3. Implement multi-reward for better performance When the original single-objective MDP problem is quite cumbersome to handle, we can ease the problem by splitting the single reward into multiple (easier) rewards. o sparse rewards ex) a binary reward is given at the end of each episode, indicating whether the agent achieved the goal → if the goal is hard to achieve, the reward can be extremely sparse, and RL training can be problematic. o complex rewards ex) If the single-objective reward depends on too many state components, learning the q-function might be difficult → split the single-objective reward into simpler multi-objective rewards
  • 9. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 9 / 70 Table of Contents 1. Preliminaries 2. Why does one use multi-reward? 3. Notable Papers 4. Paper Review / Summary o innately multi-reward o multi-reward to better understand the environment o multi-reward for better performance
  • 10. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 10 / 70 Notable Papers 1. MDP problem is innately multi-reward o A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] o Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf] o Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf] o Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf] 2. Implement multi-reward to better understand the environment o Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction [AAMAS 2011] [pdf] o Universal Value Function Approximators [ICML 2015] [pdf] 3. Implement multi-reward for better performance o Hindsight Experience Replay [NIPS 2017] [pdf] o Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]
  • 11. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 11 / 70 Table of Contents 1. Preliminaries 2. Why does one use multi-reward? 3. Notable Papers 4. Paper Review / Summary o innately multi-reward o multi-reward to better understand the environment o multi-reward for better performance
  • 12. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 12 / 70 A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf]  One might wonder, “Why don’t we scalarize the vector-reward with some scalarization function?”  Such a scalarization function, parameterized by a weight vector, can convert the MOMDP problem into a SOMDP problem. ex) one possible scalarization function is a linear function:  Once we scalarize the reward, a typical SOMDP solution can be applied. Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
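The linear scalarization the example refers to (its formula appears only as an image in the original slide) is conventionally written as below; this is the standard textbook form, not necessarily the slide’s exact notation:

```latex
f_{\mathbf{w}}\!\left(\mathbf{V}^{\pi}(s)\right)
  = \mathbf{w}^{\top}\mathbf{V}^{\pi}(s)
  = \sum_{i=1}^{d} w_i \, V_i^{\pi}(s),
\qquad w_i \ge 0, \quad \sum_{i=1}^{d} w_i = 1 .
```

Under a fixed weight vector 𝐰, the scalarized problem is an ordinary SOMDP and any standard solver applies.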
  • 13. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 13 / 70  Unfortunately, such a conversion is not always possible / desirable. 1. Unknown weights scenario o The weight has to be specified before learning → not always possible! o The user might prefer different priorities (weights) over time. o But still, once the weight is specified and fixed, we can easily train and use the decision-making process. A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] ex) public transportation system using RL – weights (priorities) between two rewards, each corresponding to commute time and pollution cost, might fluctuate based on the price of oil. – the weight cannot be specified before training! – once the weight is fixed, we can use an SOMDP algorithm. Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 14. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 14 / 70  Unfortunately, such a conversion is not always possible / desirable. 2. Decision support scenario * not really our main concern o The concept of scalarization itself may not be applicable from the beginning → objective priorities may not be accurately quantifiable. o Users may also have “fuzzy” preferences that defy meaningful quantification. o The MOMDP might require arbitrary human decisions during its operation. A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] ex) public transportation system using RL – if the transportation system could be made more efficient by obstructing a beautiful view, then a human designer may not be able to quantify the loss, or the priority regarding the loss of beauty. Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 15. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 15 / 70  Unfortunately, such a conversion is not always possible / desirable. 3. Known weights scenario * rare case o if the scalarization function 𝑓 is nonlinear, the resulting SOMDP problem may not have additive returns. o the optimal policy may be stochastic, making the problem difficult to solve. A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 16. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 16 / 70  Unfortunately, such a conversion is not always possible / desirable. 1. Unknown weights scenario 2. Decision support scenario 3. Known weights scenario  Thus, an algorithmic solution that specifically targets the MOMDP case should be developed. * Note that our main concern is the “weights”.  A useful MOMDP solution should be able to 1. give an optimal policy for any arbitrary weights, 2. properly handle the case when the weight changes over time (dynamic weights), 3. with just a single initial training run (no retraining required) A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 17. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 17 / 70  Important Definitions and Terminologies A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] 1. Undominated policies o this set includes a (weight, policy) pair if the policy 𝜋 is optimal for some weight o this set of undominated policies contains redundant policies (some of the policies contained in this set are not the only optimal policy for a given weight) o what we want is a compact policy set that allows us to index a single optimal policy for a given weight. Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 18. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 18 / 70  Important Definitions and Terminologies A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] 2. Coverage Set o the coverage set is a subset of the undominated policies o it includes a single policy corresponding to every possible weight o The authors call the process of obtaining the coverage set (from the undominated set) a pruning process. “even if we don’t know the weights a-priori, we are already eliminating the redundant policies that we know that we aren’t going to use in the future.” from) Multi-Objective Decision Making, Shimon Whiteson, Microsoft https://www.youtube.com/watch?v=_zJ_cbg3TzY&feature=youtu.be Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 19. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 19 / 70  Important Definitions and Terminologies A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] 2. Coverage Set o For instance, assume that there are only two possible scalarizations o Undominated set = {𝜋1, 𝜋2, 𝜋3} o Coverage set = {𝜋1, 𝜋2} or {𝜋3, 𝜋2} is not an optimal policy for either of the scalarizations (weights) two optimal policies exist (redundant) Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 20. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 20 / 70  Important Definitions and Terminologies A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] 3. Convex Hull o the undominated policies for a linear scalarization function 4. Convex Coverage Set o a coverage set for a linear scalarization function Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
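To make the convex coverage set concrete, here is a small NumPy sketch that prunes a finite set of candidate value vectors down to an (approximate) CCS by sweeping linear weights; the two-objective setting, the weight grid, and the function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def approximate_ccs(value_vectors, n_weights=1001):
    """Approximate convex coverage set for 2 objectives under linear scalarization.

    value_vectors : (n_policies, 2) array, one value vector per candidate policy.
    Returns the indices of the policies that are optimal for at least one
    sampled weight vector w = (w1, 1 - w1).
    """
    V = np.asarray(value_vectors, dtype=float)
    w1 = np.linspace(0.0, 1.0, n_weights)
    W = np.stack([w1, 1.0 - w1], axis=1)      # (n_weights, 2) sweep of the weight simplex
    scalarized = W @ V.T                      # (n_weights, n_policies)
    winners = scalarized.argmax(axis=1)       # best policy for each weight
    return sorted(set(winners.tolist()))

# toy example: the third vector never reaches the upper surface
print(approximate_ccs([[1.0, 0.0], [0.0, 1.0], [0.4, 0.4]]))   # -> [0, 1]
```

Sweeping a dense weight grid is only an approximation of the pruning step; the survey discusses exact methods, but the idea of keeping exactly the vectors that win for some weight is the same.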
  • 21. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 21 / 70  Important Visualizations o Assume an ℝ2 reward vector (dots in objective space) o Scalarized reward using a weight (lines in weight space) A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] each dot in objective space corresponds to a line in weight space Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 22. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 22 / 70  Important Visualizations o Now, what’s the coverage set? o For all weights, we should determine a single optimal policy (value function). A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] each dot in objective space corresponds to a line in weight space Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 23. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 23 / 70  Important Visualizations o Now, what’s the coverage set? o For all weights, we should determine a single optimal policy (value function) ⇒ the upper surface of the weight space A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] each dot in objective space corresponds to a line in weight space Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 24. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 24 / 70  Important Visualizations o Now, what’s the coverage set? o For all weights, we should determine a single optimal policy (value function) ⇒ the upper surface of the weight space o The blue dot / line is included in the undominated policy set, but not in the coverage set. (redundant policy) A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] each dot in objective space corresponds to a line in weight space Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 25. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 25 / 70  Important Visualizations  Thus, our goal is to find the minimal coverage set: the upper surface (and its corresponding optimal policies / value functions) in the weight space. A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] each dot in objective space corresponds to a line in weight space Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 26. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 26 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]  This paper aims to learn an approximate coverage set of policies, each represented by a neural network.  One might ask, “Do I have to check the optimal policy for all possible weights to find out the coverage set?” Fortunately, the answer is “No, you don’t have to.” ⇒ we may solve the SOMDP problem for only a few points in the weight space to fully determine the coverage set.  The authors use the concept of “Linear Support” Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 27. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 27 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]  Linear Support Algorithm (Cheng, 1988) http://www.pomdp.org/tutorial/cheng.html 1. First, pick an extremum point in the weight space. Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 28. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 28 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]  Linear Support Algorithm (Cheng, 1988) http://www.pomdp.org/tutorial/cheng.html 1. First, pick an extremum point in the weight space. 2. For the chosen weight, solve the SOMDP and get the (vector-form) optimal value function. That gives us a line in the weight space. * you might want to store the corresponding optimal policy as well Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 29. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 29 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]  Linear Support Algorithm (Cheng, 1988) http://www.pomdp.org/tutorial/cheng.html 3. Now, set the weights to the other extremum. Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 30. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 30 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]  Linear Support Algorithm (Cheng, 1988) http://www.pomdp.org/tutorial/cheng.html 3. Now, set the weights to the other extremum. 4. Solving the SOMDP, we get another (vector-form) optimal value function, which again can be represented as a line in this plot. Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 31. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 31 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]  Linear Support Algorithm (Cheng, 1988) http://www.pomdp.org/tutorial/cheng.html 3. Now, set the weights to the other extremum. 4. Solving the SOMDP, we get another (vector-form) optimal value function, which again can be represented as a line in this plot. 5. We now have an intersection. We call this “corner point”. Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 32. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 32 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]  Linear Support Algorithm (Cheng, 1988) http://www.pomdp.org/tutorial/cheng.html 6. Find the corresponding (vector-form) optimal value function at this corner point weight. Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 33. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 33 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf] http://www.pomdp.org/tutorial/cheng.html 6. Find the corresponding (vector-form) optimal value function at this corner point weight. 7. We then can find new corner points.  Linear Support Algorithm (Cheng, 1988) Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 34. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 34 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf] http://www.pomdp.org/tutorial/cheng.html 6. Find the corresponding (vector-form) optimal value function at this corner point weight. 7. We then can find new corner points. 8. Repeat the same process until you end up drawing the same line you have previously drawn. (⇔ no new optimal value vector / optimal policy can be obtained)  Linear Support Algorithm (Cheng, 1988) Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 35. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 35 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]  Optimal Linear Support Algorithm (OLS) o Authors developed a slightly modified version of Linear Support (Cheng, 1988) to reduce the computational burden. o For large weight space, original Linear Support algorithm might require excessive iterations. o However, optimal linear support algorithm (OLS) terminates the iteration when the maximum possible improvement 𝛥 is below the pre-defined threshold 𝜖. maximum possible improvement * if this is smaller than our pre- defined threshold 𝜖, we terminate the iteration. Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
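The corner-point search of slides 27–35 can be sketched for two objectives as below. This is a simplified sketch, not the authors’ algorithm: `solve_somdp` is a placeholder for the inner single-objective solver (for example, a DQN trained on the scalarized reward), and Δ here is measured after calling the solver at a corner, whereas OLS proper prioritizes corners by an upper bound on the possible improvement computed beforehand.

```python
import itertools

def corner_weight(va, vb):
    """w1 at which the scalarized values of va and vb intersect (2 objectives)."""
    denom = (va[0] - va[1]) - (vb[0] - vb[1])
    if abs(denom) < 1e-12:
        return None                               # parallel lines: no corner point
    w1 = (vb[1] - va[1]) / denom
    return w1 if 0.0 < w1 < 1.0 else None

def ols_two_objectives(solve_somdp, eps=1e-3):
    """solve_somdp((w1, w2)) should return the optimal value vector (a 2-tuple)
    for that fixed weight; it stands in for the single-objective RL solver."""

    def scal(v, w1):                              # linear scalarization w . v
        return w1 * v[0] + (1.0 - w1) * v[1]

    ccs = []                                      # value vectors found so far
    for w1 in (0.0, 1.0):                         # steps 1-4: the two extrema
        v = tuple(solve_somdp((w1, 1.0 - w1)))
        if v not in ccs:
            ccs.append(v)

    corners = []                                  # step 5: initial corner points
    for va, vb in itertools.combinations(ccs, 2):
        w1 = corner_weight(va, vb)
        if w1 is not None:
            corners.append(w1)

    while corners:                                # steps 6-8 plus the OLS threshold
        w1 = corners.pop()
        best_known = max(scal(v, w1) for v in ccs)
        v_new = tuple(solve_somdp((w1, 1.0 - w1)))
        delta = scal(v_new, w1) - best_known      # improvement found at this corner
        if delta > eps and v_new not in ccs:
            for v in ccs:                         # new corners against the new line
                w_new = corner_weight(v_new, v)
                if w_new is not None:
                    corners.append(w_new)
            ccs.append(v_new)
    return ccs
```

The loop stops exactly when no corner yields more than ε improvement, which mirrors the termination rule described on this slide.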
  • 36. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 36 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]  Optimal Linear Support Algorithm (OLS)  OLS requires an SOMDP solver that can 1. give us a vectorized optimal value function 2. and the corresponding optimal policy 3. when weights are given.  The authors use a DQN network whose output is a matrix of Q-values (one entry per action and per objective) * while the standard DQN tries to maximize the Q-value itself, this DQN tries to maximize the scalarized Q-value for the explored corner weights. Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
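A minimal PyTorch sketch of such a multi-objective DQN head and the scalarized greedy action selection; the layer sizes, names, and the flat observation vector are my assumptions for illustration, not the paper’s architecture.

```python
import torch
import torch.nn as nn

class MultiObjectiveDQN(nn.Module):
    """Q-network whose head outputs an |A| x d matrix of Q-values,
    one column per objective (d = number of reward components)."""

    def __init__(self, obs_dim, n_actions, n_objectives, hidden=128):
        super().__init__()
        self.n_actions, self.n_objectives = n_actions, n_objectives
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, n_actions * n_objectives)

    def forward(self, obs):                        # obs: (B, obs_dim)
        q = self.head(self.body(obs))
        return q.view(-1, self.n_actions, self.n_objectives)

def greedy_action(net, obs, w):
    """Pick the action maximizing the scalarized value w . Q(s, a).
    obs: 1-D observation tensor, w: weight tensor of shape (n_objectives,)."""
    with torch.no_grad():
        q = net(obs.unsqueeze(0))[0]               # (|A|, d)
        return int((q @ w).argmax())
```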
  • 37. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 37 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf] Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 38. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 38 / 70 Paper Review / Summary – innately multi-reward (3) Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf] Axel Abels, Diederik M. Roijers, Tom Lenaerts, Ann Nowe, Denis Steckelmacher  This paper specifically considers the case where the weights change over time.  Main Contributions 1. propose a Conditioned Network (CN) → augmented version of DQN that outputs weight-dependent multi-objective Q-vectors. 2. propose Diverse Experience Replay (DER) → way to efficiently train the conditioned network, exploring both the weight space and state-action space.
  • 39. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 39 / 70 Paper Review / Summary – innately multi-reward (3) Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf]  Conditioned Network (CN) o Network structure itself is quite intuitive: accepts weight as an input. o Main problem comes from the “episode generation” phase. o During the episode generation phase, 1. to fully explore the action space, we might just take the 𝜖-greedy policy. 2. to fully explore the weight space, ? Axel Abels, Diederik M. Roijers, Tom Lenaerts, Ann Nowe, Denis Steckelmacher
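For the network side of slide 39, a weight-conditioned Q-network can be sketched by simply appending the weight vector to the input; this is an illustrative sketch with assumed layer sizes, not necessarily the paper’s exact architecture, and it deliberately leaves the slide’s open question about exploring the weight space to the paper itself.

```python
import torch
import torch.nn as nn

class ConditionedNetwork(nn.Module):
    """Weight-conditioned multi-objective Q-network: takes (state, weight) and
    returns an |A| x d matrix of Q-values, so one network covers the whole
    weight simplex instead of a single fixed weight."""

    def __init__(self, obs_dim, n_actions, n_objectives, hidden=128):
        super().__init__()
        self.n_actions, self.n_objectives = n_actions, n_objectives
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_objectives, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions * n_objectives),
        )

    def forward(self, obs, w):                     # obs: (B, obs_dim), w: (B, d)
        x = torch.cat([obs, w], dim=-1)            # condition on the current weight
        return self.net(x).view(-1, self.n_actions, self.n_objectives)
```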
  • 40. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 40 / 70 Paper Review / Summary – innately multi-reward (3) Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf]  Diverse Experience Replay (DER) o a diverse buffer from which relevant experiences can be sampled for weight vectors whose policies have not been executed recently. o a method that reduces the replay-buffer bias, making it possible to obtain diverse multi-objective optimal vectors. (figure panels: Non-DER vs. DER) Axel Abels, Diederik M. Roijers, Tom Lenaerts, Ann Nowe, Denis Steckelmacher
  • 41. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 41 / 70 Paper Review / Summary – innately multi-reward (4) Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf]  Slightly Different Approach: scalarization is not the way to go! o The author claims, “creating a single reward value by combining the multiple components can throw away vital information and can lead to incorrect solutions.” o Thus, the author suggests an algorithm that uses / interprets the multiple rewards in vector form itself, without any type of scalarization.  Instead of explicitly setting the priority between multiple objectives, this paper tries to find the ‘optimal balance’ between multiple rewards.  Real-life intuition : “when and how do we make an optimal decision between (possibly conflicting) social agendas?” → VOTE! Christian R. Shelton
  • 42. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 42 / 70 Paper Review / Summary – innately multi-reward (4) Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf]  Policy Votes o The author introduces the concept of voting. o Instead of one-hot voting (like we do), each reward source can vote for multiple options (actions) as long as its votes sum to one. o Meanwhile, these multiple reward sources should vote not only for a single state, but for multiple states. They might want to “distribute” their voting power 𝛼𝑠(𝑥) over multiple states. Christian R. Shelton
  • 43. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 43 / 70 Paper Review / Summary – innately multi-reward (4) Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf]  Policy Votes o Now, the ballot counting: our final policy is determined by the votes from multiple reward sources. o Note that 𝛼𝑠(𝑥) and 𝑣𝑠(𝑥, 𝑎) are all trainable parameters. ⇒ Each reward source will tune its own parameters to maximize its expected reward. o However, for each reward source, it might be unwise to entirely reveal its true policy preference in the vote. → Keep in mind that the overall final policy is also affected by the votes from other reward sources. Christian R. Shelton
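As a concrete reading of the ballot counting above, here is a tiny NumPy sketch that combines per-source votes into one stochastic policy; the exact normalization is my interpretation of the voting scheme, not necessarily the paper’s precise formulation.

```python
import numpy as np

def ballot_policy(alpha, votes):
    """Combine per-source votes into a single stochastic policy.

    alpha[s, x]    : voting power source s spends on state x (each row sums to 1)
    votes[s, x, a] : source s's vote over actions in state x (sums to 1 over a)
    Returns pi[x, a], a distribution over actions for each state."""
    alpha, votes = np.asarray(alpha), np.asarray(votes)
    weighted = np.einsum('sx,sxa->xa', alpha, votes)   # power-weighted ballots
    return weighted / weighted.sum(axis=1, keepdims=True)
```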
  • 44. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 44 / 70 Paper Review / Summary – innately multi-reward (4) Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf]  Nash Equilibrium Problem o We can now formulate our problem as a Nash equilibrium problem from game theory. 1. each reward source, one by one, finds the best voting to maximize its reward. 2. simultaneously update the old solution with the new best response 3. the iteration ends when all reward sources stay at their previous vote, with the same amount of reward. (Nash equilibrium) o Since this is quite an old paper (before the popularity of deep RL), it uses an old-fashioned approach: formulating an estimate of the reward for a given policy 𝜋. Christian R. Shelton KL-divergence between the true personal preference 𝑝𝑠(𝑥, 𝑎) and the official policy 𝜋(𝑥, 𝑎)
  • 45. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 45 / 70 Table of Contents 1. Preliminaries 2. Why does one use multi-reward? 3. Notable Papers 4. Paper Review / Summary o innately multi-reward o multi-reward to better understand the environment o multi-reward for better performance
  • 46. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 46 / 70 Paper Review / Summary – better understand the env (1) Horde: A scalable real-time architecture for learning …[AAMAS 2011] [pdf]  This paper first gave an idea of implementing multiple-reward to better understand the environment!  Authors believe, Richard S. Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski, Adam White “knowledge about the environment is represented as a large number of approximate value functions learned in parallel, each with its own policy, pseudo-reward function, pseudo-termination function, and pseudo-terminal- reward function.”
  • 47. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 47 / 70 Paper Review / Summary – better understand the env (1) Horde: A scalable real-time architecture for learning …[AAMAS 2011] [pdf]  They define a general value function (GVF) with four auxiliary functional inputs (question functions).  Now, the authors propose the idea of the Horde architecture o consisting of an overall agent composed of many sub-agents (called demons) o each demon is an independent RL agent responsible for learning one small piece of knowledge about the environment o Demons try to approximate the GVF 𝑞, corresponding to their own question functions (pseudo-rewards) Richard S. Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski, Adam White
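A minimal sketch of what one demon could look like, assuming a linear value estimate and a plain TD(0) update for brevity; the paper itself uses the off-policy GQ(lambda) algorithm, and the function names here are placeholders, not the paper’s API.

```python
import numpy as np

class Demon:
    """One Horde 'demon': an independent learner with its own question functions.

    pseudo_reward(obs) and pseudo_gamma(obs) stand in for the pseudo-reward and
    pseudo-termination functions attached to this demon."""

    def __init__(self, n_features, pseudo_reward, pseudo_gamma, lr=0.1):
        self.w = np.zeros(n_features)          # linear value-function weights
        self.pseudo_reward = pseudo_reward
        self.pseudo_gamma = pseudo_gamma
        self.lr = lr

    def update(self, phi, phi_next, obs_next):
        """phi, phi_next: feature vectors of two consecutive observations."""
        v, v_next = self.w @ phi, self.w @ phi_next
        td_error = (self.pseudo_reward(obs_next)
                    + self.pseudo_gamma(obs_next) * v_next - v)
        self.w += self.lr * td_error * phi
```

A Horde agent is then essentially a list of such demons, all updated in parallel from the same stream of sensorimotor experience.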
  • 48. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 48 / 70 Paper Review / Summary – better understand the env (2) Universal Value Function Approximators [ICML 2015] [pdf] Tom Schaul, Dan Horgan, Karol Gregor, David Silver
  • 49. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 49 / 70 Paper Review / Summary – better understand the env (2) Universal Value Function Approximators [ICML 2015] [pdf]  This paper tries to extend the idea of the general value function (GVF).  Recall: What’s wrong with the typical value function 𝑉(𝑠)? o it represents the utility of any state in achieving a single goal. o No information can be extracted from this value function when we want to achieve a different goal, or multiple goals.  Recall: Sutton et al. (2011) tried to extend this value function to take an extra (pseudo-) “goal” into account, for the purpose of learning more about the surrounding environment. o learn multiple value function approximators 𝑉𝑔(𝑠), each corresponding to a different (pseudo-) goal 𝑔. o each such value function represents a chunk of knowledge about the environment → can be useful when we have to solve a different goal. Tom Schaul, Dan Horgan, Karol Gregor, David Silver
  • 50. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 50 / 70 Paper Review / Summary – better understand the env (2) Universal Value Function Approximators [ICML 2015] [pdf]  The authors extend the idea of the general value function approximator to take both the state 𝑠 and the goal 𝑔 as input, parameterized by 𝜃.  Instead of learning multiple value functions for some selected goal states, we learn a single universal value function approximator (UVFA) that can generalize over all possible goals.  However, training a UVFA can be a difficult task! → if naively trained, the agent will only see a small subset of the possible combinations of states and goals (𝑠, 𝑔) Tom Schaul, Dan Horgan, Karol Gregor, David Silver
  • 51. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 51 / 70 Paper Review / Summary – better understand the env (2) Universal Value Function Approximators [ICML 2015] [pdf]  Possible Architectures of UVFA Tom Schaul, Dan Horgan, Karol Gregor, David Silver simply concatenate! two-stream architecture
  • 52. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 52 / 70 Paper Review / Summary – better understand the env (2) Universal Value Function Approximators [ICML 2015] [pdf]  Possible Architectures of UVFA Tom Schaul, Dan Horgan, Karol Gregor, David Silver simply concatenate! two-stream architecture turns out, this is way better than the left one. * experiment details omitted
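A minimal PyTorch sketch of the two-stream idea from slide 52: separate embeddings for state and goal combined with a dot product. The layer sizes, the hidden widths, and the plain dot-product combination are assumptions for illustration, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class TwoStreamUVFA(nn.Module):
    """Two-stream UVFA sketch: V(s, g) = phi(s) . psi(g), where phi embeds the
    state and psi embeds the goal into a shared embedding space."""

    def __init__(self, obs_dim, goal_dim, embed_dim=64, hidden=128):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, embed_dim))
        self.psi = nn.Sequential(nn.Linear(goal_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, embed_dim))

    def forward(self, s, g):                   # s: (B, obs_dim), g: (B, goal_dim)
        return (self.phi(s) * self.psi(g)).sum(dim=-1)   # batched dot product
```

The concatenation variant on the left of the slide would simply feed torch.cat([s, g], dim=-1) into a single MLP.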
  • 53. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 53 / 70 Paper Review / Summary – better understand the env (2) Universal Value Function Approximators [ICML 2015] [pdf]  Training the UVFA Tom Schaul, Dan Horgan, Karol Gregor, David Silver
  • 54. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 54 / 70 Paper Review / Summary – better understand the env (2) Universal Value Function Approximators [ICML 2015] [pdf]  Results Tom Schaul, Dan Horgan, Karol Gregor, David Silver 1. Trained the UVFA for green-dotted goals (left) 2. Predicted the value function for pink-dotted goals (right) Value function learned by Horde (explicitly explored) * 5 goals from the test set Value function predicted by UVFA
  • 55. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 55 / 70 Table of Contents 1. Preliminaries 2. Why does one use multi-reward? 3. Notable Papers 4. Paper Review / Summary o innately multi-reward o multi-reward to better understand the environment o multi-reward for better performance
  • 56. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 56 / 70 Paper Review / Summary – better performance (1) Hindsight Experience Replay [NIPS 2017] [pdf]  Designing a reward function is important, but not easy. o a common challenge of RL is to carefully engineer the reward function → it must not only reflect the task at hand, but also be carefully shaped to guide the policy optimization o the necessity of cost engineering limits the applicability of RL in the real world, because it requires both RL expertise and domain-specific knowledge. o it is not applicable in situations where we do not know what admissible behavior may look like.  Therefore, we need to develop algorithms which can learn from unshaped reward signals (e.g. a binary signal indicating successful task completion) Marcin Andrychowicz, Filip Wolski, .. , Wojciech Zaremba
  • 57. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 57 / 70 Paper Review / Summary – better performance (1) Hindsight Experience Replay [NIPS 2017] [pdf]  Meanwhile, humans can learn from both succeeded and failed attempts. o if you’re learning to play hockey, for instance, you can definitely learn from the experience of failure (ball got out of the net, slightly to the right) → you can adjust your kick slightly to the left!  On the other hand, robots learn nothing from this failure (zero reward). Marcin Andrychowicz, Filip Wolski, .. , Wojciech Zaremba
  • 58. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 58 / 70 Paper Review / Summary – better performance (1) Hindsight Experience Replay [NIPS 2017] [pdf]  It is, however, possible to draw another conclusion: “this failed sequence of actions would be successful, and thus beneficial for the robot’s learning, if the net had been placed further to the right!”  The authors propose “Hindsight Experience Replay” Marcin Andrychowicz, Filip Wolski, .. , Wojciech Zaremba
  • 59. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 59 / 70 Paper Review / Summary – better performance (1) Hindsight Experience Replay [NIPS 2017] [pdf]  Hindsight Experience Replay (HER) o after experiencing the episode 𝑠0, 𝑠1, ⋯ , 𝑠𝑇, we store in the replay buffer every transition 𝑠𝑡 → 𝑠𝑡+1 not only with the original goal used for this episode, but also with a subset of other goals. o One possible choice of such goals can be… the state which is achieved at the final step of each episode. o Implement UVFA structure (concatenated version) to learn from these multiple goals Marcin Andrychowicz, Filip Wolski, .. , Wojciech Zaremba
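A minimal sketch of the relabeling step described on slide 59, using the “final” goal-selection strategy (replay each transition with the goal actually achieved at the end of the episode). The tuple layout and `reward_fn` are placeholders for whatever the surrounding training code uses, and in practice the achieved goal is usually extracted from the final state rather than being the raw state itself.

```python
def her_relabel(episode, reward_fn):
    """Minimal HER relabeling sketch with the 'final' goal-selection strategy.

    episode   : list of (s, a, s_next, goal) tuples from one rollout
    reward_fn : sparse reward, e.g. reward_fn(s_next, g) = 0.0 if g is reached
                at s_next and -1.0 otherwise"""
    achieved = episode[-1][2]                  # outcome actually reached at the end
    transitions = []
    for (s, a, s_next, g) in episode:
        # store the transition with the original goal ...
        transitions.append((s, a, reward_fn(s_next, g), s_next, g))
        # ... and again with the achieved outcome substituted as the goal
        transitions.append((s, a, reward_fn(s_next, achieved), s_next, achieved))
    return transitions
```

The relabeled copies give the goal-conditioned (UVFA-style) Q-network non-zero learning signal even when the original goal was never reached.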
  • 60. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 60 / 70 Paper Review / Summary – better performance (1) Hindsight Experience Replay [NIPS 2017] [pdf]  Hindsight Experience Replay (HER) Marcin Andrychowicz, Filip Wolski, .. , Wojciech Zaremba
  • 61. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 61 / 70 Paper Review / Summary – better performance (1) Hindsight Experience Replay [NIPS 2017] [pdf]  Results Marcin Andrychowicz, Filip Wolski, .. , Wojciech Zaremba
  • 62. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 62 / 70 Paper Review / Summary – better performance (2) Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]  This paper proposes a Hybrid Reward Architecture (HRA)  Horde vs UVFA vs HRA o Horde : learns multiple general value functions (GVFs), each corresponding to different reward functions and other question functions, using multiple sub-agents (a.k.a. demons) o UVFA : generalizes the GVFs across different tasks and goals o HRA : decomposes the reward function into 𝑛 different reward functions, with the intent of solving multiple simple tasks rather than a single complex task. Harm van Seijen, Joshua Romoff, Tavian Barnes, .. , Jeffrey Tsang
  • 63. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 63 / 70 Paper Review / Summary – better performance (2) Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]  Proposed Method o Decompose the reward function 𝑅𝑒𝑛𝑣 into 𝑛 reward functions and train separate RL agents on each of these reward functions. o Because agent 𝑘 has its own reward function, it also has its own Q-value function, 𝑄𝑘. (In fact, these 𝑛 different DQN networks can share multiple lower-level layers!) o A combined network then represents all Q-value functions Harm van Seijen, Joshua Romoff, Tavian Barnes, .. , Jeffrey Tsang
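A minimal PyTorch sketch of the head-sharing idea on slide 63: one shared body, n heads, and (here) an unweighted sum of head values for action selection. The layer sizes, names, and the aggregation rule are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class HRAQNetwork(nn.Module):
    """Hybrid Reward Architecture sketch: a shared body feeds n heads,
    where head k estimates Q_k for reward component k."""

    def __init__(self, obs_dim, n_actions, n_heads, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_heads)])

    def forward(self, obs):                          # obs: (B, obs_dim)
        z = self.body(obs)
        return torch.stack([head(z) for head in self.heads], dim=1)  # (B, n, |A|)

def hra_greedy_action(net, obs):
    """Aggregate the head values (here an unweighted sum) and act greedily."""
    with torch.no_grad():
        q = net(obs.unsqueeze(0))[0]                 # (n_heads, n_actions)
        return int(q.sum(dim=0).argmax())
```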
  • 64. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 64 / 70 Paper Review / Summary – better performance (2) Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]  Proposed Method o Loss function associated with HRA o To that end, we get an optimal weight 𝜃⋆ where Harm van Seijen, Joshua Romoff, Tavian Barnes, .. , Jeffrey Tsang
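The loss shown only as an image on this slide is, as far as I can reconstruct it from the paper, the usual per-head DQN loss summed over the reward components; treat the formulas below as a hedged reconstruction rather than a copy of the slide.

```latex
y_{k,i} = R_k(s_i, a_i, s'_i) + \gamma \max_{a'} Q_k(s'_i, a'; \theta^{-}),
\qquad
\mathcal{L}(\theta) = \mathbb{E}_i\!\left[ \sum_{k=1}^{n}
    \big( y_{k,i} - Q_k(s_i, a_i; \theta) \big)^2 \right],
\qquad
Q_{\mathrm{HRA}}(s, a; \theta) = \sum_{k=1}^{n} Q_k(s, a; \theta).
```

Minimizing this loss over all heads jointly yields the optimal shared parameters 𝜃⋆ that the slide refers to, with the aggregated value Q_HRA used for acting.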
  • 65. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 65 / 70 Paper Review / Summary – better performance (2) Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]  HRA Architecture  Possible Variants o not only decomposing the existing reward, but also adding a pseudo-reward. Harm van Seijen, Joshua Romoff, Tavian Barnes, .. , Jeffrey Tsang
  • 66. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 66 / 70 Paper Review / Summary – better performance (2) Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]  Results Harm van Seijen, Joshua Romoff, Tavian Barnes, .. , Jeffrey Tsang original DQN HRA original HRA (just decomposing) Extended HRA (Decomposing + adding pseudo-rewards)
  • 67. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 67 / 70 For more details, check out the original papers. 1. MDP problem is innately multi-reward o A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] o Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf] o Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf] o Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf] 2. Implement multi-reward to better understand the environment o Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction [AAMAS 2011] [pdf] o Universal Value Function Approximators [ICML 2015] [pdf] 3. Implement multi-reward for better performance o Hindsight Experience Replay [NIPS 2017] [pdf] o Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf] Check out the Notion version of this slide : https://bit.ly/3tnzD9F