Multi-Reward Reinforcement Learning
Literature Survey
Younghyo Park1
1 Mechanical Engineering Department, Senior
Seoul National University
Check out the Notion version of this slide : https://bit.ly/3tnzD9F
Table of Contents
1. Preliminaries
2. Why does one use multi-reward?
3. Notable Papers
4. Paper Review / Summary
o innately multi-reward
o multi-reward to better understand the environment
o multi-reward for better performance
Preliminaries
 Multi-Objective Markov Decision Process (MOMDP)
o the reward function is no longer a scalar, but a vector.
o the value function for a stationary policy 𝜋 on state 𝑠 is also a vector.
o For a single-objective MDP (SOMDP), the ordering of value functions is complete
when the state is given.
o In contrast, in an MOMDP, even when the state is given, only a partial ordering is
possible. Determining the optimal policy requires further care.
⇒ one possible solution is to prioritize the rewards / objectives (covered later)
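To make the partial-ordering point concrete, here is a minimal Python sketch (my own illustration, not from the slides) of Pareto dominance between vector-valued value functions:
```python
# Minimal sketch: with vector-valued returns, two value vectors may be
# incomparable, so only a partial (Pareto) ordering exists.
import numpy as np

def pareto_dominates(v1, v2):
    """True if value vector v1 is at least as good as v2 in every objective
    and strictly better in at least one."""
    v1, v2 = np.asarray(v1), np.asarray(v2)
    return bool(np.all(v1 >= v2) and np.any(v1 > v2))

# Example: neither policy dominates the other -> no total ordering.
print(pareto_dominates([3.0, 1.0], [2.0, 2.0]))  # False
print(pareto_dominates([2.0, 2.0], [3.0, 1.0]))  # False
```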
Why does one use multi-reward?
After surveying the literature, I’ve noticed that the reinforcement learning community
deploys multi-reward architectures for the following reasons, mainly threefold.
1. The MDP problem is innately multi-reward
In this case, the MDP has to be designed as a multi-reward problem from
the beginning. (There is no single-reward alternative.)
o (possibly conflicting) multiple objectives / goals
ex) public transportation system using RL → multiple goals of “commute
time” and “fuel efficiency”
o rewards given by multiple users (experts)
ex) RL environment where human users/experts give the reward → one
user may decide to reward the agent with values twice as large as those
of another → the rewards are incomparable and cannot be naively converted
into a single-reward problem.
2. Implement multi-reward to better understand the environment
The MDP could be formulated as a single-reward problem. However,
constructing it as a multi-reward problem can give us more information about
the environment.
o generalizing the q-functions over multiple goals
ex) training the RL system to generalize across various tasks/goals →
learning a q-network which accepts the ‘goal state’ as an input
o exploring the environment using multiple agents
ex) multiple agents, acting based on different rewards, can give us more
information about the environment
3. Implement multi-reward for better performance
When the original single-objective MDP problem is quite cumbersome to
handle, we can ease the problem by splitting the single reward into multiple
(easier) rewards.
o sparse rewards
ex) a binary reward is given at the end of each episode only if the agent
achieves the goal → if the goal is hard to achieve, the reward can be
extremely sparse, and RL training can be problematic.
o complex rewards
ex) if the single-objective reward depends on too many state
components, learning the q-function might be difficult → split the single-
objective reward into simpler multi-objective rewards
Notable Papers
1. MDP problem is innately multi-reward
o A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf]
o Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]
o Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf]
o Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf]
2. Implement multi-reward to better understand the environment
o Horde: A scalable real-time architecture for learning knowledge from unsupervised
sensorimotor interaction [AAMAS 2011] [pdf]
o Universal Value Function Approximators [ICML 2015] [pdf]
3. Implement multi-reward for better performance
o Hindsight Experience Replay [NIPS 2017] [pdf]
o Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]
Paper Review / Summary – innately multi-reward (1)
A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf]
Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
 One might wonder,
“Why don’t we scalarize the vector-reward with some scalarization function?”
 Such a scalarization function 𝑓, parameterized by a weight vector 𝐰,
can convert the MOMDP problem into a SOMDP problem.
ex) one possible scalarization function is the linear one: 𝑓(𝐕𝜋, 𝐰) = 𝐰 ∙ 𝐕𝜋
 Once we scalarize the reward, typical SOMDP solutions can be applied.
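As a toy illustration of the linear scalarization above (my own sketch; the function name and the weight convention are assumptions, not from the paper):
```python
# Linear scalarization of a vector value estimate: f(V, w) = w . V
import numpy as np

def scalarize_linear(v_vec, w):
    """Weights are assumed non-negative and summing to 1."""
    w = np.asarray(w, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
    return float(np.dot(w, v_vec))

print(scalarize_linear([10.0, 2.0], [0.7, 0.3]))  # 7.6
```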
 Unfortunately, such conversion is not always possible / desirable.
1. Unknown weights scenario
o Weight has to be specified before learning → not always possible!
o User might prefer different priorities (weights) over time.
o But still, once the weight is specified and fixed, we can easily train and
use the decision-making process.
ex) public transportation system using RL
– weights (priority) between two rewards, each corresponding to commute
time and pollution cost, might fluctuate based on the price of oil.
– weight cannot be specified before training!
– once the weight is fixed, we can use the SOMDP algorithm.
2. Decision support scenario * not really our main concern
o The concept of scalarization itself may not be applicable from the
beginning → objective priorities may not be accurately quantifiable.
o Users may also have “fuzzy” preferences that defy meaningful
quantification.
o The MOMDP might require arbitrary human decisions during its operation.
ex) public transportation system using RL
– if the transportation system could be made more efficient by obstructing a
beautiful view, a human designer may not be able to quantify the loss,
or the priority regarding the loss of beauty.
3. Known weights scenario * rare case
o if the scalarization function 𝑓 is nonlinear, the resulting SOMDP problem
may not have additive returns.
o the optimal policy may be stochastic, and thus difficult to solve for.
 Thus, algorithmic solutions that specifically target the MOMDP case should be
developed. * Note that our main concern is the “weights”.
 A useful MOMDP solution should be able to
1. give an optimal policy for any arbitrary weight,
2. properly handle the case when the weights change over time (dynamic weights),
3. with just a single initial training run (no retraining required)
 Important Definitions and Terminologies
1. Undominated policies
o this set includes a (weight 𝐰, policy 𝜋) pair if the policy 𝜋 is optimal for that
weight 𝐰
o the undominated set contains redundant policies (some of the policies
contained in this set are not the only optimal policy for their weight 𝐰)
o what we want is a compact policy set that allows us to index a single
optimal policy for a given weight 𝐰.
2. Coverage Set
o the coverage set is a subset of the undominated policies
o it includes a single policy corresponding to every possible weight
o the authors call the process of obtaining the coverage set (from the
undominated set) a pruning process.
“even if we don’t know the weights a-priori, we are already eliminating the
redundant policies that we know that we aren’t going to use in the future.”
from) Multi-Objective Decision Making, Shimon Whiteson, Microsoft
https://www.youtube.com/watch?v=_zJ_cbg3TzY&feature=youtu.be
o For instance, assume that there are only two possible scalarizations (weights):
o Undominated set = {𝜋1, 𝜋2, 𝜋3}
o Coverage set = {𝜋1, 𝜋2} or {𝜋3, 𝜋2}
* figure: 𝜋1 and 𝜋3 are both optimal for the same weight, so only one of them is
needed (redundant); a policy that is optimal for neither scalarization (weight) does
not enter the undominated set at all.
3. Convex Hull
o the set of undominated policies for a linear scalarization function
4. Convex Coverage Set
o the coverage set for a linear scalarization function
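A rough sketch (my own illustration, not the paper’s algorithm) of how one could approximate a two-objective convex coverage set by sweeping linear weights:
```python
# Approximate the convex coverage set (CCS) for two objectives by checking
# which value vectors are optimal for some linear weight on a grid.
import numpy as np

def approx_ccs_2d(value_vectors, n_weights=1001):
    """Return indices of value vectors that win the linear scalarization
    w*v[0] + (1-w)*v[1] for at least one w in [0, 1] (grid approximation)."""
    V = np.asarray(value_vectors, dtype=float)   # (num_policies, 2)
    ws = np.linspace(0.0, 1.0, n_weights)
    W = np.stack([ws, 1.0 - ws], axis=1)          # (n_weights, 2)
    scalarized = W @ V.T                          # (n_weights, num_policies)
    return sorted(int(i) for i in set(np.argmax(scalarized, axis=1)))

# Policies 0 and 2 tie for the same weight -> only one of them is kept
# (the redundancy discussed above); the last policy never wins any weight.
print(approx_ccs_2d([[3.0, 1.0], [1.0, 3.0], [3.0, 1.0], [1.5, 1.5]]))  # [0, 1]
```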
 Important Visualizations
o Assume the reward/value vectors live in ℝ2: each policy’s value vector is a
dot in objective space.
o Scalarizing with a weight 𝐰 turns each policy into a line in weight space
(its scalarized value as a function of the weight).
* figure: each dot in objective space corresponds to a line in weight space
o Now, what’s the coverage set?
o For all weights, we should determine a single optimal policy (value function)
⇒ the upper surface in weight space.
o The blue dot / line is included in the undominated policy set, but not in the
coverage set. (redundant policy)
 Thus, our goal is to find the minimal coverage set: the upper surface (and its
corresponding optimal policies / value functions) in the weight space.
Paper Review / Summary – innately multi-reward (2)
Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]
Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
 This paper aims to learn an approximate coverage set of policies, each
represented by a neural network.
 One might ask, “Do I have to check the optimal policy for all possible weights to
find out the coverage set?” Fortunately, the answer is “No, you don’t have to.”
⇒ we may solve the SOMDP problem for only a few points in the weight
space to fully determine the coverage set.
 The authors use the concept of “Linear Support.”
 Linear Support Algorithm (Cheng, 1988)
http://www.pomdp.org/tutorial/cheng.html
1. First, pick an extremum point in the weight space.
2. For the chosen weight, solve the SOMDP and get the (vector-form) optimal
value function. That gives us a line in the weight space.
* you might want to store the corresponding optimal policy as well
3. Now, set the weights to the other extremum.
4. Solving the SOMDP, we get another (vector-form) optimal value function,
which again can be represented as a line in this plot.
5. We now have an intersection. We call this a “corner point”.
6. Find the corresponding (vector-form) optimal value function at this corner-
point weight.
7. We then can find new corner points.
8. Repeat the same process until you end up drawing a line you have
previously drawn.
(⇔ no new optimal value vector / optimal policy can be obtained)
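A simplified two-objective sketch of the linear-support loop above (my own illustration; `solve_somdp` is a hypothetical single-objective solver that returns the optimal value vector for a given scalarizing weight):
```python
import numpy as np

def linear_support_2d(solve_somdp, tol=1e-9):
    ccs = []                                                # optimal value vectors found so far
    queue = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]    # start at the extremum weights
    while queue:
        w = queue.pop()
        v = np.asarray(solve_somdp(w), dtype=float)
        if any(np.allclose(v, u, atol=tol) for u in ccs):
            continue                                        # same line as before -> nothing new here
        for u in ccs:                                       # new line -> new corner points
            d = u - v
            if abs(d[0] - d[1]) > tol:
                w1 = d[1] / (d[1] - d[0])                   # weight where the two lines intersect
                if 0.0 < w1 < 1.0:
                    queue.append(np.array([w1, 1.0 - w1]))
        ccs.append(v)
    return ccs
```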
 Optimal Linear Support Algorithm (OLS)
o The authors developed a slightly modified version of Linear Support (Cheng,
1988) to reduce the computational burden.
o For a large weight space, the original Linear Support algorithm might require
excessive iterations.
o OLS instead terminates the iteration when the maximum possible
improvement 𝛥 at the remaining corner weights falls below a pre-defined
threshold 𝜖.
* figure: 𝛥 is the maximum possible improvement; if it is smaller than the
pre-defined threshold 𝜖, the iteration terminates.
 OLS requires an SOMDP solver that can
1. give us a vectorized optimal value function
2. and the corresponding optimal policy
3. when weights are given.
 The authors use a DQN that outputs a matrix of Q-values (one row per action,
one column per objective).
* while the standard DQN tries to maximize the Q-value itself,
this DQN tries to maximize the scalarized Q-value for the explored corner weights.
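For illustration only (my sketch, not the paper’s code): greedy action selection from such a multi-objective Q-value matrix under the current corner weight:
```python
import numpy as np

def greedy_action(q_matrix, w):
    """q_matrix: shape (num_actions, num_objectives); w: scalarizing weight vector."""
    scalarized = q_matrix @ np.asarray(w)   # one scalar Q per action
    return int(np.argmax(scalarized))

q = np.array([[1.0, 0.2],    # action 0: good on objective 0
              [0.1, 1.5]])   # action 1: good on objective 1
print(greedy_action(q, [0.9, 0.1]))  # 0
print(greedy_action(q, [0.1, 0.9]))  # 1
```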
Paper Review / Summary – innately multi-reward (3)
Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf]
Axel Abels, Diederik M. Roijers, Tom Lenaerts, Ann Nowe, Denis Steckelmacher
 This paper specifically considers the case where the weights change over time.
 Main Contributions
1. propose a Conditioned Network (CN) → augmented version of DQN that
outputs weight-dependent multi-objective Q-vectors.
2. propose Diverse Experience Replay (DER) → way to efficiently train the
conditioned network, exploring both the weight space and state-action space.
 Conditioned Network (CN)
o The network structure itself is quite intuitive: it accepts the weight as an input.
o The main problem comes from the “episode generation” phase.
o During the episode generation phase,
1. to fully explore the action space, we might just take the 𝜖-greedy policy.
2. to fully explore the weight space, ... ?
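A minimal weight-conditioned Q-network sketch in PyTorch (my own illustration of the CN idea; the layer sizes and the scalarized action selection are assumptions, not the paper’s exact architecture):
```python
import torch
import torch.nn as nn

class ConditionedQNet(nn.Module):
    """Takes (state, weight) and outputs a multi-objective Q-vector per action."""
    def __init__(self, state_dim, n_actions, n_objectives, hidden=128):
        super().__init__()
        self.n_actions, self.n_objectives = n_actions, n_objectives
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_objectives, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions * n_objectives),
        )

    def forward(self, state, weight):
        x = torch.cat([state, weight], dim=-1)
        return self.net(x).view(-1, self.n_actions, self.n_objectives)

    def act(self, state, weight):
        with torch.no_grad():
            q = self.forward(state, weight)                   # (1, actions, objectives)
            scalar_q = (q * weight.unsqueeze(1)).sum(-1)      # (1, actions)
            return int(scalar_q.argmax(dim=-1))

net = ConditionedQNet(state_dim=4, n_actions=3, n_objectives=2)
print(net.act(torch.zeros(1, 4), torch.tensor([[0.5, 0.5]])))
```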
 Diverse Experience Replay (DER)
o a diverse buffer from which relevant experiences can be sampled for weight
vectors whose policies have not been executed recently.
o a method that reduces replay-buffer bias, making it possible to obtain
diverse multi-objective optimal value vectors.
* figure: replay-buffer contents without DER vs. with DER
Paper Review / Summary – innately multi-reward (4)
Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf]
Christian R. Shelton
 A slightly different approach: scalarization is not the way to go!
o The author claims, “creating a single reward value by combining the multiple
components can throw away vital information and can lead to incorrect
solutions.”
o Thus, the author suggests an algorithm that uses / interprets the multiple rewards
in vector form itself, without any type of scalarization.
 Instead of explicitly setting priorities between the multiple objectives, this paper tries
to find an ‘optimal balance’ between the multiple rewards.
 Real-life intuition: “when and how do we make an optimal decision between
(possibly conflicting) social agendas?” → VOTE!
 Policy Votes
o The author introduces the concept of voting.
o Instead of one-hot voting (like we humans do), each reward source can vote for
multiple options (actions), as long as its votes sum to one.
o Meanwhile, these multiple reward sources should vote not only on a single
state, but on multiple states. They might want to “distribute” their voting power
𝛼𝑠(𝑥) over multiple states.
o Now, the ballot counting: the final policy is determined by the votes from the
multiple reward sources.
o Note that 𝛼𝑠(𝑥) and 𝑣𝑠(𝑥, 𝑎) are all trainable parameters.
⇒ Each reward source will tune its own parameters to maximize its
expected reward.
o However, for each reward source, it might be unwise to entirely reveal its
true policy preference in the vote. → Keep in mind that the overall final policy
is also affected by the votes from the other reward sources.
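One plausible reading of the ballot counting (my own sketch; the paper’s exact combination and normalization may differ):
```python
# Each reward source s spreads voting power alpha[s, x] over states and casts
# per-state action votes votes[s, x, a]; the final policy pools and renormalizes.
import numpy as np

def combined_policy(alpha, votes):
    """alpha: (n_sources, n_states); votes: (n_sources, n_states, n_actions),
    each votes[s, x] summing to 1.  Returns pi of shape (n_states, n_actions)."""
    weighted = (alpha[:, :, None] * votes).sum(axis=0)       # (states, actions)
    return weighted / weighted.sum(axis=1, keepdims=True)    # renormalize per state

alpha = np.array([[0.8, 0.2],    # source 0 spends most of its voting power on state 0
                  [0.3, 0.7]])   # source 1 spends most of its voting power on state 1
votes = np.array([[[1.0, 0.0], [1.0, 0.0]],     # source 0 always votes for action 0
                  [[0.0, 1.0], [0.0, 1.0]]])    # source 1 always votes for action 1
print(combined_policy(alpha, votes))
```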
 Nash Equilibrium Problem
o We can now formulate the problem as a Nash-equilibrium situation from game
theory.
1. each reward source, one by one, finds the best vote to maximize its
reward.
2. simultaneously update the old solution with the new best responses
3. the iteration ends when all reward sources keep their previous votes, with the
same amount of reward. (Nash equilibrium)
o Since this is quite an old paper (before the popularity of deep RL), the author
uses an old-fashioned approach: formulate an estimator of the reward for a given
policy 𝜋.
* note: the estimator involves the KL-divergence between the true personal
preference 𝑝𝑠(𝑥, 𝑎) and the official policy 𝜋(𝑥, 𝑎)
Paper Review / Summary – better understand the env (1)
Horde: A scalable real-time architecture for learning …[AAMAS 2011] [pdf]
Richard S. Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski, Adam White
 This paper first gave the idea of implementing multiple rewards to better understand
the environment!
 The authors believe,
“knowledge about the environment is represented as a large number of
approximate value functions learned in parallel, each with its own policy,
pseudo-reward function, pseudo-termination function, and pseudo-terminal-
reward function.”
 They define a general value function (GVF), 𝑞(𝑠, 𝑎; 𝜋, 𝛾, 𝑟, 𝑧),
with four auxiliary functional inputs (the question functions): a policy 𝜋, a
pseudo-termination function 𝛾, a pseudo-reward function 𝑟, and a
pseudo-terminal-reward function 𝑧.
 Now, the authors propose the idea of the Horde architecture,
o consisting of an overall agent composed of many sub-agents (called demons)
o each demon is an independent RL agent responsible for learning one small
piece of knowledge about the environment
o demons try to approximate the GVF 𝑞 corresponding to their own question
functions (pseudo-rewards)
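A toy sketch of the Horde idea (my own illustration, not the paper’s architecture): many demons share one experience stream, each learning the value of its own pseudo-reward with a TD(0) update:
```python
import numpy as np

class Demon:
    """One demon: an independent learner with its own pseudo-reward."""
    def __init__(self, n_states, pseudo_reward, gamma=0.9, alpha=0.1):
        self.v = np.zeros(n_states)            # this demon's value estimates
        self.r, self.gamma, self.alpha = pseudo_reward, gamma, alpha

    def update(self, s, s_next):               # simple TD(0) update
        td = self.r(s, s_next) + self.gamma * self.v[s_next] - self.v[s]
        self.v[s] += self.alpha * td

# Each demon answers a different "question" about the same environment.
demons = [Demon(5, lambda s, s2: float(s2 == 4)),   # "how close am I to state 4?"
          Demon(5, lambda s, s2: float(s2 == 0))]   # "how close am I to state 0?"
for s, s2 in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 4)]:   # one shared experience stream
    for d in demons:
        d.update(s, s2)
```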
Paper Review / Summary – better understand the env (2)
Universal Value Function Approximators [ICML 2015] [pdf]
Tom Schaul, Dan Horgan, Karol Gregor, David Silver
 This paper tries to extend the idea of the general value function (GVF).
 Recall: What’s wrong with the typical value function 𝑉(𝑠)?
o it represents the utility of any state in achieving a single goal.
o no information can be extracted from this value function when we want to
achieve a different goal (or multiple goals).
 Recall: Sutton et al. (2011) tried to extend the value function to take an extra
(pseudo-) “goal” into account, for the purpose of learning more about the
surrounding environment.
o learn multiple value function approximators 𝑉𝑔(𝑠), each corresponding to a
different (pseudo-) goal 𝑔.
o each such value function represents a chunk of knowledge about the
environment → can be useful when we have to solve a different goal.
 The authors extend the idea of the general value function approximator to take both
the state 𝑠 and the goal 𝑔 as input: 𝑉(𝑠, 𝑔; 𝜃), parameterized by 𝜃.
 Instead of learning multiple value functions for some selected goal states, we are
learning a single universal value function approximator (UVFA) that can
generalize over all possible goals.
 However, training a UVFA can be a difficult task!
→ if naively trained, the agent will only see a small subset of the possible
combinations of states and goals (𝑠, 𝑔)
 Possible Architectures of UVFA
* figure: (left) simply concatenate the state and goal as a single input to one
network; (right) a two-stream architecture with separate state and goal
embeddings.
o it turns out the two-stream architecture works much better than simple
concatenation. * experiment details omitted
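A minimal PyTorch sketch of the two variants above (my own illustration; the layer sizes and the dot-product combination in the two-stream version are assumptions, not the paper’s exact layers):
```python
import torch
import torch.nn as nn

class ConcatUVFA(nn.Module):
    """V(s, g): concatenate state and goal, feed through one MLP."""
    def __init__(self, state_dim, goal_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + goal_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, s, g):
        return self.net(torch.cat([s, g], dim=-1))

class TwoStreamUVFA(nn.Module):
    """V(s, g) = <phi(s), psi(g)>: separate embeddings, combined by dot product."""
    def __init__(self, state_dim, goal_dim, embed=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(state_dim, embed), nn.ReLU(),
                                 nn.Linear(embed, embed))
        self.psi = nn.Sequential(nn.Linear(goal_dim, embed), nn.ReLU(),
                                 nn.Linear(embed, embed))

    def forward(self, s, g):
        return (self.phi(s) * self.psi(g)).sum(dim=-1, keepdim=True)
```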
 Training the UVFA * (training-procedure figure omitted)
 Results
* figure: (left) the UVFA is trained on the green-dotted goals; (right) value
functions are predicted for the pink-dotted goals. For 5 goals from the test set,
the value function learned by Horde (explicitly explored) is shown next to the
value function predicted by the UVFA.
Paper Review / Summary – better performance (1)
Hindsight Experience Replay [NIPS 2017] [pdf]
 Designing a reward function is important, but not easy.
o a common challenge of RL is to carefully engineer the reward function → it must
not only reflect the task at hand, but also be carefully shaped to guide policy
optimization
o the necessity of cost engineering limits the applicability of RL in the real
world, because it requires both RL expertise and domain-specific knowledge.
o not applicable in situations where we do not know what admissible behavior
may look like.
 Therefore, we need to develop algorithms which can learn from unshaped reward
signals (e.g. a binary signal indicating successful task completion)
Marcin Andrychowicz, Filip Wolski, .. , Wojciech Zaremba
 Meanwhile, humans can learn from both successful and failed attempts.
o if you’re learning to play hockey, for instance, you can definitely learn from the
experience of failure (the puck missed the net, slightly to the right) → you can
adjust your shot slightly to the left!
 On the other hand, robots learn nothing from this failure (zero reward).
 It is however possible to draw another conclusion: “this failed sequence of actions
would have been successful, and thus beneficial for the robot’s learning, if the net
had been placed further to the right!”
 The authors propose “Hindsight Experience Replay.”
 Hindsight Experience Replay (HER)
o after experiencing the episode 𝑠0, 𝑠1, ⋯ , 𝑠𝑇, we store in the replay buffer every
transition 𝑠𝑡 → 𝑠𝑡+1 not only with the original goal used for this episode, but
also with a subset of other goals.
o one possible choice of such goals is the state achieved at the final step of
each episode.
o implement the UVFA structure (concatenated version) to learn from these
multiple goals
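A minimal sketch of the relabeling step described above (my own illustration; the paper’s goal-sampling strategies and reward definition differ in detail):
```python
def her_relabel(episode, reward_fn, replay_buffer, original_goal):
    """episode: list of (s, a, s_next); reward_fn(s_next, goal) -> sparse 0/1 reward."""
    final_state = episode[-1][2]               # hindsight goal: where we actually ended up
    for s, a, s_next in episode:
        # store the transition with the original goal ...
        replay_buffer.append((s, a, reward_fn(s_next, original_goal), s_next, original_goal))
        # ... and again with the achieved final state treated as the goal
        replay_buffer.append((s, a, reward_fn(s_next, final_state), s_next, final_state))

# usage sketch on a toy grid episode
buffer = []
episode = [((0, 0), "right", (1, 0)), ((1, 0), "up", (1, 1))]
her_relabel(episode, lambda s, g: float(s == g), buffer, original_goal=(3, 3))
```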
 Results * (figure omitted)
Paper Review / Summary – better performance (2)
Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]
 This paper proposes a Hybrid Reward Architecture (HRA)
 Horde vs UVFA vs HRA
o Horde : learns multiple general value functions (GVFs), each corresponding
to a different reward function and other question functions, using multiple sub-
agents (a.k.a. demons)
o UVFA : generalizes the GVFs across different tasks and goals
o HRA : decomposes the reward function into 𝑛 different reward functions, with
the intent to solve multiple simple tasks rather than a single complex task.
Harm van Seijen, Joshua Romoff, Tavian Barnes, .. , Jeffrey Tsang
 Proposed Method
o Decompose the reward function 𝑅𝑒𝑛𝑣 into 𝑛 reward functions,
𝑅𝑒𝑛𝑣(𝑠, 𝑎) = Σ𝑘 𝑅𝑘(𝑠, 𝑎),
and train separate RL agents on each of these reward functions.
o Because agent 𝑘 has its own reward function, it also has its own Q-value
function, 𝑄𝑘. (In fact, the 𝑛 different DQN networks can share multiple
lower-level layers!)
o The combined network represents all Q-value functions at once:
𝑄𝐻𝑅𝐴(𝑠, 𝑎; 𝜃) = Σ𝑘 𝑄𝑘(𝑠, 𝑎; 𝜃𝑘)
o Loss function associated with HRA: the sum, over the 𝑛 heads, of the usual DQN
temporal-difference losses, each head 𝑘 using its own reward 𝑅𝑘 and its own
Q-value function 𝑄𝑘.
o Minimizing this loss gives the optimal parameters 𝜃⋆.
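A hedged sketch of that per-head loss (my reconstruction, assuming a standard DQN-style target per head; not the paper’s exact code):
```python
import torch

def hra_loss(q_heads, q_heads_next, actions, rewards, gamma=0.99):
    """q_heads:      (batch, n_heads, n_actions)  Q_k(s, a) for every head k
       q_heads_next: (batch, n_heads, n_actions)  Q_k(s', .) from a target network
       actions:      (batch,) actions taken; rewards: (batch, n_heads) R_k(s, a)."""
    batch = torch.arange(q_heads.shape[0])
    q_taken = q_heads[batch, :, actions]                    # (batch, n_heads)
    # each head bootstraps against the max of its OWN Q-values
    target = rewards + gamma * q_heads_next.max(dim=-1).values
    return ((target.detach() - q_taken) ** 2).sum(dim=1).mean()
```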
 HRA Architecture * (architecture figure omitted)
 Possible Variants
o not only decomposing the existing reward, but also adding pseudo-rewards.
 Results
* figure: comparison of the original DQN vs. HRA, and of the original HRA (just
decomposing) vs. the extended HRA (decomposing + adding pseudo-rewards)
For more details, check out the original papers.
1. MDP problem is innately multi-reward
o A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf]
o Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]
o Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf]
o Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf]
2. Implement multi-reward to better understand the environment
o Horde: A scalable real-time architecture for learning knowledge from unsupervised
sensorimotor interaction [AAMAS 2011] [pdf]
o Universal Value Function Approximators [ICML 2015] [pdf]
3. Implement multi-reward for better performance
o Hindsight Experience Replay [NIPS 2017] [pdf]
o Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]
Check out the Notion version of this slide : https://bit.ly/3tnzD9F

More Related Content

Similar to Multi reward literature_survey_younghyo_park

Investigating teachers' understanding of IMS Learning Design: Yes they can!
Investigating teachers' understanding of IMS Learning Design: Yes they can!Investigating teachers' understanding of IMS Learning Design: Yes they can!
Investigating teachers' understanding of IMS Learning Design: Yes they can!
Michael Derntl
 
WebOrganic Sharing 20130419
WebOrganic Sharing 20130419 WebOrganic Sharing 20130419
WebOrganic Sharing 20130419
Jeff Ng
 
Book Summary Report Example
Book Summary Report ExampleBook Summary Report Example
Book Summary Report ExampleEMBS2007
 
Review On In-Context Leaning.pptx
Review On In-Context Leaning.pptxReview On In-Context Leaning.pptx
Review On In-Context Leaning.pptx
wesleyshih4
 
Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...
Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...
Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...
SigOpt
 
GDG Community Day 2023 - Interpretable ML in production
GDG Community Day 2023 - Interpretable ML in productionGDG Community Day 2023 - Interpretable ML in production
GDG Community Day 2023 - Interpretable ML in production
SARADINDU SENGUPTA
 
Self-directed learning using Moodle
Self-directed learning using MoodleSelf-directed learning using Moodle
Self-directed learning using Moodle
NetSpot Pty Ltd
 
REINFORCEMENT LEARNING (reinforced through trial and error).pptx
REINFORCEMENT LEARNING (reinforced through trial and error).pptxREINFORCEMENT LEARNING (reinforced through trial and error).pptx
REINFORCEMENT LEARNING (reinforced through trial and error).pptx
archayacb21
 
Ev3 teachers guide web
Ev3 teachers guide webEv3 teachers guide web
Ev3 teachers guide web
Arif Budiman
 
Ev3 teachers guia
Ev3 teachers guiaEv3 teachers guia
Ev3 teachers guia
WebMD
 
A hybrid constructive algorithm incorporating teaching-learning based optimiz...
A hybrid constructive algorithm incorporating teaching-learning based optimiz...A hybrid constructive algorithm incorporating teaching-learning based optimiz...
A hybrid constructive algorithm incorporating teaching-learning based optimiz...
IJECEIAES
 
EAUC 2014 presentation on energy visualisation
EAUC 2014 presentation on energy visualisationEAUC 2014 presentation on energy visualisation
EAUC 2014 presentation on energy visualisation
Karl Letten
 
How energy visualisation can work for you
How energy visualisation can work for youHow energy visualisation can work for you
How energy visualisation can work for you
Nottingham Trent University
 
Conditional interval variables: A powerful concept for modeling and solving c...
Conditional interval variables: A powerful concept for modeling and solving c...Conditional interval variables: A powerful concept for modeling and solving c...
Conditional interval variables: A powerful concept for modeling and solving c...
Philippe Laborie
 
The Cost of Technology: Total Cost of Ownership and Value of Investment - Ri...
The Cost of Technology:  Total Cost of Ownership and Value of Investment - Ri...The Cost of Technology:  Total Cost of Ownership and Value of Investment - Ri...
The Cost of Technology: Total Cost of Ownership and Value of Investment - Ri...SchoolDude Editors
 
MOOCS@Work Working Group Session 4
MOOCS@Work Working Group Session 4MOOCS@Work Working Group Session 4
MOOCS@Work Working Group Session 4
LearningCafe
 
Learning cafe call moo cs wgmeet4_ver0.1
Learning cafe call moo cs wgmeet4_ver0.1Learning cafe call moo cs wgmeet4_ver0.1
Learning cafe call moo cs wgmeet4_ver0.1LearningCafe
 
Al-Ahliyya Amman University جامعة عمان األهلية.docx
Al-Ahliyya Amman University   جامعة عمان األهلية.docxAl-Ahliyya Amman University   جامعة عمان األهلية.docx
Al-Ahliyya Amman University جامعة عمان األهلية.docx
galerussel59292
 
User interface adaptation based on user feedback and machine learning
User interface adaptation based on user feedback and machine learningUser interface adaptation based on user feedback and machine learning
User interface adaptation based on user feedback and machine learning
Nesrine Mezhoudi
 
Multiobjective Firefly Algorithm for Continuous Optimization
Multiobjective Firefly Algorithm for Continuous Optimization Multiobjective Firefly Algorithm for Continuous Optimization
Multiobjective Firefly Algorithm for Continuous Optimization
Xin-She Yang
 

Similar to Multi reward literature_survey_younghyo_park (20)

Investigating teachers' understanding of IMS Learning Design: Yes they can!
Investigating teachers' understanding of IMS Learning Design: Yes they can!Investigating teachers' understanding of IMS Learning Design: Yes they can!
Investigating teachers' understanding of IMS Learning Design: Yes they can!
 
WebOrganic Sharing 20130419
WebOrganic Sharing 20130419 WebOrganic Sharing 20130419
WebOrganic Sharing 20130419
 
Book Summary Report Example
Book Summary Report ExampleBook Summary Report Example
Book Summary Report Example
 
Review On In-Context Leaning.pptx
Review On In-Context Leaning.pptxReview On In-Context Leaning.pptx
Review On In-Context Leaning.pptx
 
Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...
Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...
Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...
 
GDG Community Day 2023 - Interpretable ML in production
GDG Community Day 2023 - Interpretable ML in productionGDG Community Day 2023 - Interpretable ML in production
GDG Community Day 2023 - Interpretable ML in production
 
Self-directed learning using Moodle
Self-directed learning using MoodleSelf-directed learning using Moodle
Self-directed learning using Moodle
 
REINFORCEMENT LEARNING (reinforced through trial and error).pptx
REINFORCEMENT LEARNING (reinforced through trial and error).pptxREINFORCEMENT LEARNING (reinforced through trial and error).pptx
REINFORCEMENT LEARNING (reinforced through trial and error).pptx
 
Ev3 teachers guide web
Ev3 teachers guide webEv3 teachers guide web
Ev3 teachers guide web
 
Ev3 teachers guia
Ev3 teachers guiaEv3 teachers guia
Ev3 teachers guia
 
A hybrid constructive algorithm incorporating teaching-learning based optimiz...
A hybrid constructive algorithm incorporating teaching-learning based optimiz...A hybrid constructive algorithm incorporating teaching-learning based optimiz...
A hybrid constructive algorithm incorporating teaching-learning based optimiz...
 
EAUC 2014 presentation on energy visualisation
EAUC 2014 presentation on energy visualisationEAUC 2014 presentation on energy visualisation
EAUC 2014 presentation on energy visualisation
 
How energy visualisation can work for you
How energy visualisation can work for youHow energy visualisation can work for you
How energy visualisation can work for you
 
Conditional interval variables: A powerful concept for modeling and solving c...
Conditional interval variables: A powerful concept for modeling and solving c...Conditional interval variables: A powerful concept for modeling and solving c...
Conditional interval variables: A powerful concept for modeling and solving c...
 
The Cost of Technology: Total Cost of Ownership and Value of Investment - Ri...
The Cost of Technology:  Total Cost of Ownership and Value of Investment - Ri...The Cost of Technology:  Total Cost of Ownership and Value of Investment - Ri...
The Cost of Technology: Total Cost of Ownership and Value of Investment - Ri...
 
MOOCS@Work Working Group Session 4
MOOCS@Work Working Group Session 4MOOCS@Work Working Group Session 4
MOOCS@Work Working Group Session 4
 
Learning cafe call moo cs wgmeet4_ver0.1
Learning cafe call moo cs wgmeet4_ver0.1Learning cafe call moo cs wgmeet4_ver0.1
Learning cafe call moo cs wgmeet4_ver0.1
 
Al-Ahliyya Amman University جامعة عمان األهلية.docx
Al-Ahliyya Amman University   جامعة عمان األهلية.docxAl-Ahliyya Amman University   جامعة عمان األهلية.docx
Al-Ahliyya Amman University جامعة عمان األهلية.docx
 
User interface adaptation based on user feedback and machine learning
User interface adaptation based on user feedback and machine learningUser interface adaptation based on user feedback and machine learning
User interface adaptation based on user feedback and machine learning
 
Multiobjective Firefly Algorithm for Continuous Optimization
Multiobjective Firefly Algorithm for Continuous Optimization Multiobjective Firefly Algorithm for Continuous Optimization
Multiobjective Firefly Algorithm for Continuous Optimization
 

Recently uploaded

办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
eddie19851
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 

Recently uploaded (20)

办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 

Multi reward literature_survey_younghyo_park

  • 8. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 8 / 70 Why does one use multi-reward? After some literature survey, I’ve noticed that the reinforcement learning community deploys multi-reward architectures for the following reasons, mainly threefold. 3. Implement multi-reward for better performance When the original single-objective MDP problem is quite cumbersome to handle, we can ease the problem by splitting the single reward into multiple (easier) rewards. o sparse rewards ex) a binary reward is given at the end of each episode, indicating whether the agent achieved the goal → if the goal is hard to achieve, the reward can be extremely sparse, and RL training can be problematic. o complex rewards ex) If the single-objective reward depends on too many state components, learning the q-function might be difficult → split the single-objective reward into simpler multi-objective rewards
  • 9. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 9 / 70 Table of Contents 1. Preliminaries 2. Why does one use multi-reward? 3. Notable Papers 4. Paper Review / Summary o innately multi-reward o multi-reward to better understand the environment o multi-reward for better performance
  • 10. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 10 / 70 Notable Papers 1. MDP problem is innately multi-reward o A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] o Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf] o Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf] o Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf] 2. Implement multi-reward to better understand the environment o Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction [AAMAS 2011] [pdf] o Universal Value Function Approximators [ICML 2015] [pdf] 3. Implement multi-reward for better performance o Hindsight Experience Replay [NIPS 2017] [pdf] o Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]
  • 11. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 11 / 70 Table of Contents 1. Preliminaries 2. Why does one use multi-reward? 3. Notable Papers 4. Paper Review / Summary o innately multi-reward o multi-reward to better understand the environment o multi-reward for better performance
  • 12. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 12 / 70 A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf]  One might wonder, “Why don’t we scalarize the vector-reward with some scalarization function?”  Such a scalarization function, parameterized by a weight vector, can convert the MOMDP problem into a SOMDP problem. ex) one possible scalarization function is a linear function:  Once we scalarize the reward, a typical SOMDP solution can be applied. Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
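The linear scalarization the example refers to (its formula appears only as an image in the original slide) is conventionally written as below; this is the standard textbook form, not necessarily the slide’s exact notation:

```latex
f_{\mathbf{w}}\!\left(\mathbf{V}^{\pi}(s)\right)
  = \mathbf{w}^{\top}\mathbf{V}^{\pi}(s)
  = \sum_{i=1}^{d} w_i \, V_i^{\pi}(s),
\qquad w_i \ge 0, \quad \sum_{i=1}^{d} w_i = 1 .
```

Under a fixed weight vector 𝐰, the scalarized problem is an ordinary SOMDP and any standard solver applies.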
  • 13. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 13 / 70  Unfortunately, such a conversion is not always possible / desirable. 1. Unknown weights scenario o The weight has to be specified before learning → not always possible! o The user might prefer different priorities (weights) over time. o But still, once the weight is specified and fixed, we can easily train and use the decision-making process. A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] ex) public transportation system using RL – weights (priorities) between two rewards, each corresponding to commute time and pollution cost, might fluctuate based on the price of oil. – the weight cannot be specified before training! – once the weight is fixed, we can use an SOMDP algorithm. Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 14. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 14 / 70  Unfortunately, such a conversion is not always possible / desirable. 2. Decision support scenario * not really our main concern o The concept of scalarization itself may not be applicable from the beginning → objective priorities may not be accurately quantifiable. o Users may also have “fuzzy” preferences that defy meaningful quantification. o The MOMDP might require arbitrary human decisions during its operation. A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] ex) public transportation system using RL – if the transportation system could be made more efficient by obstructing a beautiful view, then a human designer may not be able to quantify the loss, or the priority regarding the loss of beauty. Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 15. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 15 / 70  Unfortunately, such a conversion is not always possible / desirable. 3. Known weights scenario * rare case o if the scalarization function 𝑓 is nonlinear, the resulting SOMDP problem may not have additive returns. o the optimal policy may be stochastic, making the problem difficult to solve. A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 16. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 16 / 70  Unfortunately, such a conversion is not always possible / desirable. 1. Unknown weights scenario 2. Decision support scenario 3. Known weights scenario  Thus, an algorithmic solution that specifically targets the MOMDP case should be developed. * Note that our main concern is the “weights”.  A useful MOMDP solution should be able to 1. give an optimal policy for any arbitrary weights, 2. properly handle the case when the weight changes over time (dynamic weights), 3. with just a single initial training run (no retraining required) A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 17. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 17 / 70  Important Definitions and Terminologies A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] 1. Undominated policies o this set includes a (weight, policy) pair if the policy 𝜋 is optimal for some weight o this set of undominated policies contains redundant policies (some of the policies contained in this set are not the only optimal policy for a given weight) o what we want is a compact policy set that allows us to index a single optimal policy for a given weight. Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 18. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 18 / 70  Important Definitions and Terminologies A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] 2. Coverage Set o the coverage set is a subset of the undominated policies o it includes a single policy corresponding to every possible weight o The authors call the process of obtaining the coverage set (from the undominated set) a pruning process. “even if we don’t know the weights a-priori, we are already eliminating the redundant policies that we know that we aren’t going to use in the future.” from) Multi-Objective Decision Making, Shimon Whiteson, Microsoft https://www.youtube.com/watch?v=_zJ_cbg3TzY&feature=youtu.be Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 19. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 19 / 70  Important Definitions and Terminologies A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] 2. Coverage Set o For instance, assume that there are only two possible scalarizations o Undominated set = {𝜋1, 𝜋2, 𝜋3} o Coverage set = {𝜋1, 𝜋2} or {𝜋3, 𝜋2} is not an optimal policy for either of the scalarizations (weights) two optimal policies exist (redundant) Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 20. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 20 / 70  Important Definitions and Terminologies A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] 3. Convex Hull o the undominated policies for a linear scalarization function 4. Convex Coverage Set o a coverage set for a linear scalarization function Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
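To make the convex coverage set concrete, here is a small NumPy sketch that prunes a finite set of candidate value vectors down to an (approximate) CCS by sweeping linear weights; the two-objective setting, the weight grid, and the function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def approximate_ccs(value_vectors, n_weights=1001):
    """Approximate convex coverage set for 2 objectives under linear scalarization.

    value_vectors : (n_policies, 2) array, one value vector per candidate policy.
    Returns the indices of the policies that are optimal for at least one
    sampled weight vector w = (w1, 1 - w1).
    """
    V = np.asarray(value_vectors, dtype=float)
    w1 = np.linspace(0.0, 1.0, n_weights)
    W = np.stack([w1, 1.0 - w1], axis=1)      # (n_weights, 2) sweep of the weight simplex
    scalarized = W @ V.T                      # (n_weights, n_policies)
    winners = scalarized.argmax(axis=1)       # best policy for each weight
    return sorted(set(winners.tolist()))

# toy example: the third vector never reaches the upper surface
print(approximate_ccs([[1.0, 0.0], [0.0, 1.0], [0.4, 0.4]]))   # -> [0, 1]
```

Sweeping a dense weight grid is only an approximation of the pruning step; the survey discusses exact methods, but the idea of keeping exactly the vectors that win for some weight is the same.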
  • 21. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 21 / 70  Important Visualizations o Assume an ℝ2 reward vector (dots in objective space) o Scalarized reward using a weight (lines in weight space) A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] each dot in objective space corresponds to a line in weight space Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 22. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 22 / 70  Important Visualizations o Now, what’s the coverage set? o For all weights, we should determine a single optimal policy (value function). A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] each dot in objective space corresponds to a line in weight space Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 23. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 23 / 70  Important Visualizations o Now, what’s the coverage set? o For all weights, we should determine a single optimal policy (value function) ⇒ the upper surface of the weight space A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] each dot in objective space corresponds to a line in weight space Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 24. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 24 / 70  Important Visualizations o Now, what’s the coverage set? o For all weights, we should determine a single optimal policy (value function) ⇒ the upper surface of the weight space o The blue dot / line is included in the undominated policy set, but not in the coverage set. (redundant policy) A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] each dot in objective space corresponds to a line in weight space Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 25. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 25 / 70  Important Visualizations  Thus, our goal is to find the minimal coverage set: the upper surface (and its corresponding optimal policies / value functions) in the weight space. A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] each dot in objective space corresponds to a line in weight space Paper Review / Summary – innately multi-reward (1) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
  • 26. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 26 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]  This paper aims to learn an approximate coverage set of policies, each represented by a neural network.  One might ask, “Do I have to check the optimal policy for all possible weights to find out the coverage set?” Fortunately, the answer is “No, you don’t have to.” ⇒ we may solve the SOMDP problem for only a few points in the weight space to fully determine the coverage set.  The authors use the concept of “Linear Support” Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 27. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 27 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]  Linear Support Algorithm (Cheng, 1988) http://www.pomdp.org/tutorial/cheng.html 1. First, pick an extremum point in the weight space. Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 28. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 28 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]  Linear Support Algorithm (Cheng, 1988) http://www.pomdp.org/tutorial/cheng.html 1. First, pick an extremum point in the weight space. 2. For the chosen weight, solve the SOMDP and get the (vector-form) optimal value function. That gives us a line in the weight space. * you might want to store the corresponding optimal policy as well Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 29. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 29 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]  Linear Support Algorithm (Cheng, 1988) http://www.pomdp.org/tutorial/cheng.html 3. Now, set the weights to the other extremum. Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 30. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 30 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]  Linear Support Algorithm (Cheng, 1988) http://www.pomdp.org/tutorial/cheng.html 3. Now, set the weights to the other extremum. 4. Solving the SOMDP, we get another (vector-form) optimal value function, which again can be represented as a line in this plot. Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 31. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 31 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]  Linear Support Algorithm (Cheng, 1988) http://www.pomdp.org/tutorial/cheng.html 3. Now, set the weights to the other extremum. 4. Solving the SOMDP, we get another (vector-form) optimal value function, which again can be represented as a line in this plot. 5. We now have an intersection. We call this “corner point”. Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 32. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 32 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]  Linear Support Algorithm (Cheng, 1988) http://www.pomdp.org/tutorial/cheng.html 6. Find the corresponding (vector-form) optimal value function at this corner point weight. Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 33. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 33 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf] http://www.pomdp.org/tutorial/cheng.html 6. Find the corresponding (vector-form) optimal value function at this corner point weight. 7. We then can find new corner points.  Linear Support Algorithm (Cheng, 1988) Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 34. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 34 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf] http://www.pomdp.org/tutorial/cheng.html 6. Find the corresponding (vector-form) optimal value function at this corner point weight. 7. We then can find new corner points. 8. Repeat the same process until you end up drawing the same line you have previously drawn. (⇔ no new optimal value vector / optimal policy can be obtained)  Linear Support Algorithm (Cheng, 1988) Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 35. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 35 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]  Optimal Linear Support Algorithm (OLS) o Authors developed a slightly modified version of Linear Support (Cheng, 1988) to reduce the computational burden. o For large weight space, original Linear Support algorithm might require excessive iterations. o However, optimal linear support algorithm (OLS) terminates the iteration when the maximum possible improvement 𝛥 is below the pre-defined threshold 𝜖. maximum possible improvement * if this is smaller than our pre- defined threshold 𝜖, we terminate the iteration. Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
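The corner-point search of slides 27–35 can be sketched for two objectives as below. This is a simplified sketch, not the authors’ algorithm: `solve_somdp` is a placeholder for the inner single-objective solver (for example, a DQN trained on the scalarized reward), and Δ here is measured after calling the solver at a corner, whereas OLS proper prioritizes corners by an upper bound on the possible improvement computed beforehand.

```python
import itertools

def corner_weight(va, vb):
    """w1 at which the scalarized values of va and vb intersect (2 objectives)."""
    denom = (va[0] - va[1]) - (vb[0] - vb[1])
    if abs(denom) < 1e-12:
        return None                               # parallel lines: no corner point
    w1 = (vb[1] - va[1]) / denom
    return w1 if 0.0 < w1 < 1.0 else None

def ols_two_objectives(solve_somdp, eps=1e-3):
    """solve_somdp((w1, w2)) should return the optimal value vector (a 2-tuple)
    for that fixed weight; it stands in for the single-objective RL solver."""

    def scal(v, w1):                              # linear scalarization w . v
        return w1 * v[0] + (1.0 - w1) * v[1]

    ccs = []                                      # value vectors found so far
    for w1 in (0.0, 1.0):                         # steps 1-4: the two extrema
        v = tuple(solve_somdp((w1, 1.0 - w1)))
        if v not in ccs:
            ccs.append(v)

    corners = []                                  # step 5: initial corner points
    for va, vb in itertools.combinations(ccs, 2):
        w1 = corner_weight(va, vb)
        if w1 is not None:
            corners.append(w1)

    while corners:                                # steps 6-8 plus the OLS threshold
        w1 = corners.pop()
        best_known = max(scal(v, w1) for v in ccs)
        v_new = tuple(solve_somdp((w1, 1.0 - w1)))
        delta = scal(v_new, w1) - best_known      # improvement found at this corner
        if delta > eps and v_new not in ccs:
            for v in ccs:                         # new corners against the new line
                w_new = corner_weight(v_new, v)
                if w_new is not None:
                    corners.append(w_new)
            ccs.append(v_new)
    return ccs
```

The loop stops exactly when no corner yields more than ε improvement, which mirrors the termination rule described on this slide.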
  • 36. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 36 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]  Optimal Linear Support Algorithm (OLS)  OLS requires an SOMDP solver that can 1. give us a vectorized optimal value function 2. and the corresponding optimal policy 3. when weights are given.  The authors use a DQN network whose output is a matrix of Q-values (one entry per action and per objective) * while the standard DQN tries to maximize the Q-value itself, this DQN tries to maximize the scalarized Q-value for the explored corner weights. Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
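A minimal PyTorch sketch of such a multi-objective DQN head and the scalarized greedy action selection; the layer sizes, names, and the flat observation vector are my assumptions for illustration, not the paper’s architecture.

```python
import torch
import torch.nn as nn

class MultiObjectiveDQN(nn.Module):
    """Q-network whose head outputs an |A| x d matrix of Q-values,
    one column per objective (d = number of reward components)."""

    def __init__(self, obs_dim, n_actions, n_objectives, hidden=128):
        super().__init__()
        self.n_actions, self.n_objectives = n_actions, n_objectives
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, n_actions * n_objectives)

    def forward(self, obs):                        # obs: (B, obs_dim)
        q = self.head(self.body(obs))
        return q.view(-1, self.n_actions, self.n_objectives)

def greedy_action(net, obs, w):
    """Pick the action maximizing the scalarized value w . Q(s, a).
    obs: 1-D observation tensor, w: weight tensor of shape (n_objectives,)."""
    with torch.no_grad():
        q = net(obs.unsqueeze(0))[0]               # (|A|, d)
        return int((q @ w).argmax())
```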
  • 37. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 37 / 70 Paper Review / Summary – innately multi-reward (2) Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf] Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
  • 38. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 38 / 70 Paper Review / Summary – innately multi-reward (3) Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf] Axel Abels, Diederik M. Roijers, Tom Lenaerts, Ann Nowe, Denis Steckelmacher  This paper specifically considers the case where the weights change over time.  Main Contributions 1. propose a Conditioned Network (CN) → augmented version of DQN that outputs weight-dependent multi-objective Q-vectors. 2. propose Diverse Experience Replay (DER) → way to efficiently train the conditioned network, exploring both the weight space and state-action space.
  • 39. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 39 / 70 Paper Review / Summary – innately multi-reward (3) Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf]  Conditioned Network (CN) o Network structure itself is quite intuitive: accepts weight as an input. o Main problem comes from the “episode generation” phase. o During the episode generation phase, 1. to fully explore the action space, we might just take the 𝜖-greedy policy. 2. to fully explore the weight space, ? Axel Abels, Diederik M. Roijers, Tom Lenaerts, Ann Nowe, Denis Steckelmacher
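For the network side of slide 39, a weight-conditioned Q-network can be sketched by simply appending the weight vector to the input; this is an illustrative sketch with assumed layer sizes, not necessarily the paper’s exact architecture, and it deliberately leaves the slide’s open question about exploring the weight space to the paper itself.

```python
import torch
import torch.nn as nn

class ConditionedNetwork(nn.Module):
    """Weight-conditioned multi-objective Q-network: takes (state, weight) and
    returns an |A| x d matrix of Q-values, so one network covers the whole
    weight simplex instead of a single fixed weight."""

    def __init__(self, obs_dim, n_actions, n_objectives, hidden=128):
        super().__init__()
        self.n_actions, self.n_objectives = n_actions, n_objectives
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_objectives, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions * n_objectives),
        )

    def forward(self, obs, w):                     # obs: (B, obs_dim), w: (B, d)
        x = torch.cat([obs, w], dim=-1)            # condition on the current weight
        return self.net(x).view(-1, self.n_actions, self.n_objectives)
```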
  • 40. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 40 / 70 Paper Review / Summary – innately multi-reward (3) Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf]  Diverse Experience Replay (DER) o a diverse buffer from which relevant experiences can be sampled for weight vectors whose policies have not been executed recently. o a method that reduces the replay-buffer bias, making it possible to obtain diverse multi-objective optimal vectors. (figure panels: Non-DER vs. DER) Axel Abels, Diederik M. Roijers, Tom Lenaerts, Ann Nowe, Denis Steckelmacher
  • 41. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 41 / 70 Paper Review / Summary – innately multi-reward (4) Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf]  Slightly Different Approach: scalarization is not the way to go! o The author claims, “creating a single reward value by combining the multiple components can throw away vital information and can lead to incorrect solutions.” o Thus, the author suggests an algorithm that uses / interprets the multiple rewards in vector form itself, without any type of scalarization.  Instead of explicitly setting the priority between multiple objectives, this paper tries to find the ‘optimal balance’ between multiple rewards.  Real-life intuition : “when and how do we make an optimal decision between (possibly conflicting) social agendas?” → VOTE! Christian R. Shelton
  • 42. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 42 / 70 Paper Review / Summary – innately multi-reward (4) Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf]  Policy Votes o The author introduces the concept of voting. o Instead of one-hot voting (like we do), each reward source can vote for multiple options (actions) as long as its votes sum to one. o Meanwhile, these multiple reward sources should vote not only for a single state, but for multiple states. They might want to “distribute” their voting power 𝛼𝑠(𝑥) over multiple states. Christian R. Shelton
  • 43. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 43 / 70 Paper Review / Summary – innately multi-reward (4) Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf]  Policy Votes o Now, the ballot counting: our final policy is determined by the votes from multiple reward sources. o Note that 𝛼𝑠(𝑥) and 𝑣𝑠(𝑥, 𝑎) are all trainable parameters. ⇒ Each reward source will tune its own parameters to maximize its expected reward. o However, for each reward source, it might be unwise to entirely reveal its true policy preference in the vote. → Keep in mind that the overall final policy is also affected by the votes from other reward sources. Christian R. Shelton
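As a concrete reading of the ballot counting above, here is a tiny NumPy sketch that combines per-source votes into one stochastic policy; the exact normalization is my interpretation of the voting scheme, not necessarily the paper’s precise formulation.

```python
import numpy as np

def ballot_policy(alpha, votes):
    """Combine per-source votes into a single stochastic policy.

    alpha[s, x]    : voting power source s spends on state x (each row sums to 1)
    votes[s, x, a] : source s's vote over actions in state x (sums to 1 over a)
    Returns pi[x, a], a distribution over actions for each state."""
    alpha, votes = np.asarray(alpha), np.asarray(votes)
    weighted = np.einsum('sx,sxa->xa', alpha, votes)   # power-weighted ballots
    return weighted / weighted.sum(axis=1, keepdims=True)
```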
  • 44. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 44 / 70 Paper Review / Summary – innately multi-reward (4) Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf]  Nash Equilibrium Problem o We can now formulate our problem as a Nash equilibrium problem from game theory. 1. each reward source, one by one, finds the best voting to maximize its reward. 2. simultaneously update the old solution with the new best response 3. the iteration ends when all reward sources stay at their previous vote, with the same amount of reward. (Nash equilibrium) o Since this is quite an old paper (before the popularity of deep RL), it uses an old-fashioned approach: formulating an estimate of the reward for a given policy 𝜋. Christian R. Shelton KL-divergence between the true personal preference 𝑝𝑠(𝑥, 𝑎) and the official policy 𝜋(𝑥, 𝑎)
  • 45. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 45 / 70 Table of Contents 1. Preliminaries 2. Why does one use multi-reward? 3. Notable Papers 4. Paper Review / Summary o innately multi-reward o multi-reward to better understand the environment o multi-reward for better performance
  • 46. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 46 / 70 Paper Review / Summary – better understand the env (1) Horde: A scalable real-time architecture for learning …[AAMAS 2011] [pdf]  This paper first gave an idea of implementing multiple-reward to better understand the environment!  Authors believe, Richard S. Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski, Adam White “knowledge about the environment is represented as a large number of approximate value functions learned in parallel, each with its own policy, pseudo-reward function, pseudo-termination function, and pseudo-terminal- reward function.”
  • 47. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 47 / 70 Paper Review / Summary – better understand the env (1) Horde: A scalable real-time architecture for learning …[AAMAS 2011] [pdf]  They define a general value function (GVF) with four auxiliary functional inputs (question functions).  Now, the authors propose the idea of the Horde architecture o consisting of an overall agent composed of many sub-agents (called demons) o each demon is an independent RL agent responsible for learning one small piece of knowledge about the environment o Demons try to approximate the GVF 𝑞, corresponding to their own question functions (pseudo-rewards) Richard S. Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski, Adam White
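A minimal sketch of what one demon could look like, assuming a linear value estimate and a plain TD(0) update for brevity; the paper itself uses the off-policy GQ(lambda) algorithm, and the function names here are placeholders, not the paper’s API.

```python
import numpy as np

class Demon:
    """One Horde 'demon': an independent learner with its own question functions.

    pseudo_reward(obs) and pseudo_gamma(obs) stand in for the pseudo-reward and
    pseudo-termination functions attached to this demon."""

    def __init__(self, n_features, pseudo_reward, pseudo_gamma, lr=0.1):
        self.w = np.zeros(n_features)          # linear value-function weights
        self.pseudo_reward = pseudo_reward
        self.pseudo_gamma = pseudo_gamma
        self.lr = lr

    def update(self, phi, phi_next, obs_next):
        """phi, phi_next: feature vectors of two consecutive observations."""
        v, v_next = self.w @ phi, self.w @ phi_next
        td_error = (self.pseudo_reward(obs_next)
                    + self.pseudo_gamma(obs_next) * v_next - v)
        self.w += self.lr * td_error * phi
```

A Horde agent is then essentially a list of such demons, all updated in parallel from the same stream of sensorimotor experience.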
  • 48. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 48 / 70 Paper Review / Summary – better understand the env (2) Universal Value Function Approximators [ICML 2015] [pdf] Tom Schaul, Dan Horgan, Karol Gregor, David Silver
  • 49. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 49 / 70 Paper Review / Summary – better understand the env (2) Universal Value Function Approximators [ICML 2015] [pdf]  This paper tries to extend the idea of the general value function (GVF).  Recall: What’s wrong with the typical value function 𝑉(𝑠)? o it represents the utility of any state in achieving a single goal. o No information can be extracted from this value function when we want to achieve a different goal, or multiple goals.  Recall: Sutton et al. (2011) tried to extend this value function to take an extra (pseudo-) “goal” into account, for the purpose of learning more about the surrounding environment. o learn multiple value function approximators 𝑉𝑔(𝑠), each corresponding to a different (pseudo-) goal 𝑔. o each such value function represents a chunk of knowledge about the environment → can be useful when we have to solve a different goal. Tom Schaul, Dan Horgan, Karol Gregor, David Silver
  • 50. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 50 / 70 Paper Review / Summary – better understand the env (2) Universal Value Function Approximators [ICML 2015] [pdf]  The authors extend the idea of the general value function approximator to take both the state 𝑠 and the goal 𝑔 as input, parameterized by 𝜃.  Instead of learning multiple value functions for some selected goal states, we learn a single universal value function approximator (UVFA) that can generalize over all possible goals.  However, training a UVFA can be a difficult task! → if naively trained, the agent will only see a small subset of the possible combinations of states and goals (𝑠, 𝑔) Tom Schaul, Dan Horgan, Karol Gregor, David Silver
  • 51. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 51 / 70 Paper Review / Summary – better understand the env (2) Universal Value Function Approximators [ICML 2015] [pdf]  Possible Architectures of UVFA Tom Schaul, Dan Horgan, Karol Gregor, David Silver simply concatenate! two-stream architecture
  • 52. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 52 / 70 Paper Review / Summary – better understand the env (2) Universal Value Function Approximators [ICML 2015] [pdf]  Possible Architectures of UVFA Tom Schaul, Dan Horgan, Karol Gregor, David Silver simply concatenate! two-stream architecture turns out, this is way better than the left one. * experiment details omitted
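A minimal PyTorch sketch of the two-stream idea from slide 52: separate embeddings for state and goal combined with a dot product. The layer sizes, the hidden widths, and the plain dot-product combination are assumptions for illustration, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class TwoStreamUVFA(nn.Module):
    """Two-stream UVFA sketch: V(s, g) = phi(s) . psi(g), where phi embeds the
    state and psi embeds the goal into a shared embedding space."""

    def __init__(self, obs_dim, goal_dim, embed_dim=64, hidden=128):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, embed_dim))
        self.psi = nn.Sequential(nn.Linear(goal_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, embed_dim))

    def forward(self, s, g):                   # s: (B, obs_dim), g: (B, goal_dim)
        return (self.phi(s) * self.psi(g)).sum(dim=-1)   # batched dot product
```

The concatenation variant on the left of the slide would simply feed torch.cat([s, g], dim=-1) into a single MLP.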
  • 53. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 53 / 70 Paper Review / Summary – better understand the env (2) Universal Value Function Approximators [ICML 2015] [pdf]  Training the UVFA Tom Schaul, Dan Horgan, Karol Gregor, David Silver
  • 54. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 54 / 70 Paper Review / Summary – better understand the env (2) Universal Value Function Approximators [ICML 2015] [pdf]  Results Tom Schaul, Dan Horgan, Karol Gregor, David Silver 1. Trained the UVFA for green-dotted goals (left) 2. Predicted the value function for pink-dotted goals (right) Value function learned by Horde (explicitly explored) * 5 goals from the test set Value function predicted by UVFA
  • 55. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 55 / 70 Table of Contents 1. Preliminaries 2. Why does one use multi-reward? 3. Notable Papers 4. Paper Review / Summary o innately multi-reward o multi-reward to better understand the environment o multi-reward for better performance
  • 56. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 56 / 70 Paper Review / Summary – better performance (1) Hindsight Experience Replay [NIPS 2017] [pdf]  Designing a reward function is important, but not easy. o a common challenge of RL is to carefully engineer the reward function → it must not only reflect the task at hand, but also be carefully shaped to guide the policy optimization o the necessity of cost engineering limits the applicability of RL in the real world, because it requires both RL expertise and domain-specific knowledge. o it is not applicable in situations where we do not know what admissible behavior may look like.  Therefore, we need to develop algorithms which can learn from unshaped reward signals (e.g. a binary signal indicating successful task completion) Marcin Andrychowicz, Filip Wolski, .. , Wojciech Zaremba
  • 57. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 57 / 70 Paper Review / Summary – better performance (1) Hindsight Experience Replay [NIPS 2017] [pdf]  Meanwhile, humans can learn from both succeeded and failed attempts. o if you’re learning to play hockey, for instance, you can definitely learn from the experience of failure (ball got out of the net, slightly to the right) → you can adjust your kick slightly to the left!  On the other hand, robots learn nothing from this failure (zero reward). Marcin Andrychowicz, Filip Wolski, .. , Wojciech Zaremba
  • 58. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 58 / 70 Paper Review / Summary – better performance (1) Hindsight Experience Replay [NIPS 2017] [pdf]  It is, however, possible to draw another conclusion: “this failed sequence of actions would be successful, and thus beneficial for the robot’s learning, if the net had been placed further to the right!”  The authors propose “Hindsight Experience Replay” Marcin Andrychowicz, Filip Wolski, .. , Wojciech Zaremba
  • 59. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 59 / 70 Paper Review / Summary – better performance (1) Hindsight Experience Replay [NIPS 2017] [pdf]  Hindsight Experience Replay (HER) o after experiencing the episode 𝑠0, 𝑠1, ⋯ , 𝑠𝑇, we store in the replay buffer every transition 𝑠𝑡 → 𝑠𝑡+1 not only with the original goal used for this episode, but also with a subset of other goals. o One possible choice of such goals can be… the state which is achieved at the final step of each episode. o Implement UVFA structure (concatenated version) to learn from these multiple goals Marcin Andrychowicz, Filip Wolski, .. , Wojciech Zaremba
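A minimal sketch of the relabeling step described on slide 59, using the “final” goal-selection strategy (replay each transition with the goal actually achieved at the end of the episode). The tuple layout and `reward_fn` are placeholders for whatever the surrounding training code uses, and in practice the achieved goal is usually extracted from the final state rather than being the raw state itself.

```python
def her_relabel(episode, reward_fn):
    """Minimal HER relabeling sketch with the 'final' goal-selection strategy.

    episode   : list of (s, a, s_next, goal) tuples from one rollout
    reward_fn : sparse reward, e.g. reward_fn(s_next, g) = 0.0 if g is reached
                at s_next and -1.0 otherwise"""
    achieved = episode[-1][2]                  # outcome actually reached at the end
    transitions = []
    for (s, a, s_next, g) in episode:
        # store the transition with the original goal ...
        transitions.append((s, a, reward_fn(s_next, g), s_next, g))
        # ... and again with the achieved outcome substituted as the goal
        transitions.append((s, a, reward_fn(s_next, achieved), s_next, achieved))
    return transitions
```

The relabeled copies give the goal-conditioned (UVFA-style) Q-network non-zero learning signal even when the original goal was never reached.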
  • 60. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 60 / 70 Paper Review / Summary – better performance (1) Hindsight Experience Replay [NIPS 2017] [pdf]  Hindsight Experience Replay (HER) Marcin Andrychowicz, Filip Wolski, .. , Wojciech Zaremba
  • 61. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 61 / 70 Paper Review / Summary – better performance (1) Hindsight Experience Replay [NIPS 2017] [pdf]  Results Marcin Andrychowicz, Filip Wolski, .. , Wojciech Zaremba
  • 62. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 62 / 70 Paper Review / Summary – better performance (2) Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]  This paper proposes a Hybrid Reward Architecture (HRA)  Horde vs UVFA vs HRA o Horde : learns multiple general value functions (GVFs), each corresponding to different reward functions and other question functions, using multiple sub-agents (a.k.a. demons) o UVFA : generalizes the GVFs across different tasks and goals o HRA : decomposes the reward function into 𝑛 different reward functions, with the intent of solving multiple simple tasks rather than a single complex task. Harm van Seijen, Joshua Romoff, Tavian Barnes, .. , Jeffrey Tsang
  • 63. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 63 / 70 Paper Review / Summary – better performance (2) Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]  Proposed Method o Decompose the reward function 𝑅𝑒𝑛𝑣 into 𝑛 reward functions and train separate RL agents on each of these reward functions. o Because agent 𝑘 has its own reward function, it also has its own Q-value function, 𝑄𝑘. (In fact, these 𝑛 different DQN networks can share multiple lower-level layers!) o A combined network then represents all Q-value functions Harm van Seijen, Joshua Romoff, Tavian Barnes, .. , Jeffrey Tsang
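A minimal PyTorch sketch of the head-sharing idea on slide 63: one shared body, n heads, and (here) an unweighted sum of head values for action selection. The layer sizes, names, and the aggregation rule are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class HRAQNetwork(nn.Module):
    """Hybrid Reward Architecture sketch: a shared body feeds n heads,
    where head k estimates Q_k for reward component k."""

    def __init__(self, obs_dim, n_actions, n_heads, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_heads)])

    def forward(self, obs):                          # obs: (B, obs_dim)
        z = self.body(obs)
        return torch.stack([head(z) for head in self.heads], dim=1)  # (B, n, |A|)

def hra_greedy_action(net, obs):
    """Aggregate the head values (here an unweighted sum) and act greedily."""
    with torch.no_grad():
        q = net(obs.unsqueeze(0))[0]                 # (n_heads, n_actions)
        return int(q.sum(dim=0).argmax())
```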
  • 64. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 64 / 70 Paper Review / Summary – better performance (2) Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]  Proposed Method o Loss function associated with HRA o To that end, we get an optimal weight 𝜃⋆ where Harm van Seijen, Joshua Romoff, Tavian Barnes, .. , Jeffrey Tsang
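The loss shown only as an image on this slide is, as far as I can reconstruct it from the paper, the usual per-head DQN loss summed over the reward components; treat the formulas below as a hedged reconstruction rather than a copy of the slide.

```latex
y_{k,i} = R_k(s_i, a_i, s'_i) + \gamma \max_{a'} Q_k(s'_i, a'; \theta^{-}),
\qquad
\mathcal{L}(\theta) = \mathbb{E}_i\!\left[ \sum_{k=1}^{n}
    \big( y_{k,i} - Q_k(s_i, a_i; \theta) \big)^2 \right],
\qquad
Q_{\mathrm{HRA}}(s, a; \theta) = \sum_{k=1}^{n} Q_k(s, a; \theta).
```

Minimizing this loss over all heads jointly yields the optimal shared parameters 𝜃⋆ that the slide refers to, with the aggregated value Q_HRA used for acting.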
  • 65. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 65 / 70 Paper Review / Summary – better performance (2) Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]  HRA Architecture  Possible Variants o not only decomposing the existing reward, but also adding a pseudo-reward. Harm van Seijen, Joshua Romoff, Tavian Barnes, .. , Jeffrey Tsang
  • 66. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 66 / 70 Paper Review / Summary – better performance (2) Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]  Results Harm van Seijen, Joshua Romoff, Tavian Barnes, .. , Jeffrey Tsang original DQN HRA original HRA (just decomposing) Extended HRA (Decomposing + adding pseudo-rewards)
  • 67. Younghyo Park Multi-Reward Reinforcement Learning 2021 INRoL Internship …. 67 / 70 For more details, check out the original papers. 1. MDP problem is innately multi-reward o A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf] o Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf] o Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf] o Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf] 2. Implement multi-reward to better understand the environment o Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction [AAMAS 2011] [pdf] o Universal Value Function Approximators [ICML 2015] [pdf] 3. Implement multi-reward for better performance o Hindsight Experience Replay [NIPS 2017] [pdf] o Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf] Check out the Notion version of this slide : https://bit.ly/3tnzD9F