This document summarizes a literature survey on multi-reward reinforcement learning. It discusses the three main reasons researchers use multi-reward approaches: the problem is inherently multi-objective, extra rewards help the agent better understand the environment, or reward decomposition improves performance. Notable papers for each reason are outlined. The document also reviews work on multi-objective Markov decision processes, defining concepts such as undominated policies and coverage sets, and visualizing the relationship between objective space and weight space to determine the optimal policy for any given weights.
Multi-Reward Reinforcement Learning
Literature Survey
Younghyo Park
Mechanical Engineering Department (Senior), Seoul National University
Check out the Notion version of this slide : https://bit.ly/3tnzD9F
Table of Contents
1. Preliminaries
2. Why does one use multi-reward?
3. Notable Papers
4. Paper Review / Summary
o innately multi-reward
o multi-reward to better understand the environment
o multi-reward for better performance
Preliminaries
Multi-Objective Markov Decision Process (MOMDP)
o reward function is no longer a scalar, but a vector.
o the value function of a stationary policy 𝜋 at a state 𝑠 is also a vector.
o For a single-objective MDP (SOMDP), the ordering of value functions is complete once the state is given.
o In contrast, in a MOMDP, even when the state is given, only a partial ordering is possible. Determining the optimal policy requires further criteria.
⇒ one possible solution is to prioritize the rewards / objectives (covered later)
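To make the partial ordering concrete, here is a minimal numpy sketch with made-up two-objective value vectors: the two policies are incomparable element-wise, and only a weight vector makes them comparable.

```python
import numpy as np

# Vector-valued returns of two stationary policies at the same state
# (hypothetical numbers): objective 0 = commute time, objective 1 = fuel.
V_pi1 = np.array([3.0, 1.0])
V_pi2 = np.array([1.0, 4.0])

# Partial ordering: neither policy dominates the other element-wise.
print(np.all(V_pi1 >= V_pi2), np.all(V_pi2 >= V_pi1))  # False False

# A linear scalarization w^T V induces a complete ordering for each weight.
for w in (np.array([0.8, 0.2]), np.array([0.2, 0.8])):
    best = max((V_pi1, V_pi2), key=lambda V: w @ V)
    print(w, "->", best)  # the preferred policy flips with the weights
```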
Why does one use multi-reward?
From my literature survey, I’ve noticed that the reinforcement learning community deploys multi-reward architectures for mainly three reasons.
1. MDP problem is innately multi-reward
In this case, the MDP has to be designed as a multi-reward problem from the beginning. (There is no other choice.)
o (possibly conflicting) multiple objectives / goals
ex) public transportation system using RL → multiple goals of “commute time” and “fuel efficiency”
o reward given by multiple users (experts)
ex) an RL environment where human users/experts give the reward → one user may decide to reward the agent with values twice as large as those of another → the rewards are incomparable and cannot be naively converted into a single-reward problem.
2. Implement multi-reward to better understand the environment
The MDP could be formulated as a single-reward problem, but constructing it as a multi-reward problem can give us more information about the environment.
o generalizing the Q-functions over multiple goals
ex) training the RL system to generalize across various tasks/goals → learning a Q-network that accepts the ‘goal state’ as an input
o exploring the environment using multiple agents
ex) multiple agents, acting based on different rewards, can give us more information about the environment
3. Implement multi-reward for better performance
When the original single-objective MDP problem is too cumbersome to handle, we can ease the problem by splitting the single reward into multiple (easier) rewards.
o sparse rewards
ex) a binary reward is given at the end of each episode only if the agent achieved the goal → if the goal is hard to achieve, the reward can be extremely sparse, and RL training can be problematic.
o complex rewards
ex) if the single-objective reward depends on too many state components, learning the Q-function might be difficult → split the single-objective reward into simpler multi-objective rewards
Notable Papers
1. MDP problem is innately multi-reward
o A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf]
o Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]
o Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf]
o Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf]
2. Implement multi-reward to better understand the environment
o Horde: A scalable real-time architecture for learning knowledge from unsupervised
sensorimotor interaction [AAMAS 2011] [pdf]
o Universal Value Function Approximators [ICML 2015] [pdf]
3. Implement multi-reward for better performance
o Hindsight Experience Replay [NIPS 2017] [pdf]
o Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]
Paper Review / Summary – innately multi-reward (1)
A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf]
Diederik M. Roijers, Peter Vamplew, Shimon Whiteson
One might wonder,
“Why don’t we scalarize the vector-reward with some scalarization function?”
Such a scalarization function $f$, parameterized by a weight vector $\mathbf{w}$, might convert the MOMDP problem to a SOMDP problem: $V^{\pi}_{\mathbf{w}} = f(\mathbf{V}^{\pi}, \mathbf{w})$.
ex) one possible scalarization function is a linear one: $f(\mathbf{V}^{\pi}, \mathbf{w}) = \mathbf{w}^{\top}\mathbf{V}^{\pi}$
Once we scalarize the reward, a typical SOMDP solution can be applied.
Unfortunately, such conversion is not always possible / desirable.
1. Unknown weights scenario
o Weight has to be specified before learning → not always possible!
o User might prefer different priorities (weights) over time.
o But still, once the weight is specified and fixed, we can easily train and
use the decision-making process.
ex) public transportation system using RL
– weights (priority) between two rewards, each corresponding to commute
time and pollution cost, might fluctuate based on the price of oil.
– weight cannot be specified before training!
– once the weight is fixed, we can use the SOMDP algorithm.
2. Decision support scenario * not really our main concern
o The concept of scalarization itself may not be applicable from the beginning → objective priorities may not be accurately quantifiable.
o Users may also have “fuzzy” preferences that defy meaningful quantification.
o The MOMDP might require arbitrary human decisions during its operation.
ex) public transportation system using RL
– if the transportation system could be made more efficient by obstructing a beautiful view, a human designer may not be able to quantify the loss, or the priority regarding the loss of beauty.
3. Known weights scenario * rare case
o if the scalarization function 𝑓 is nonlinear, the resulting SOMDP problem
may not have additive returns.
o optimal policy may be stochastic, thus difficult to solve.
Because of these three scenarios (unknown weights, decision support, known weights), algorithmic solutions that specifically target the MOMDP case should be developed. * Note that our main concern is the “weights”.
A useful MOMDP solution should be able to
1. give an optimal policy for any arbitrary weights,
2. properly handle the case where the weights change over time (dynamic weights),
3. and do so after just a single initial training (no retraining required).
Important Definitions and Terminologies
1. Undominated policies
o this set contains a (weight 𝐰, policy 𝜋) pair if the policy 𝜋 is optimal for some weight 𝐰
o the undominated policies contain redundant policies (some policies in this set are not the only optimal policy for their weight 𝐰)
o what we want is a compact policy set that lets us index a single optimal policy for any given weight 𝐰.
2. Coverage Set
o a coverage set is a subset of the undominated policies
o it includes a single policy for every possible weight
o the authors call the process of obtaining the coverage set (from the undominated set) pruning:
“even if we don’t know the weights a-priori, we are already eliminating the
redundant policies that we know that we aren’t going to use in the future.”
from) Multi-Objective Decision Making, Shimon Whiteson, Microsoft
https://www.youtube.com/watch?v=_zJ_cbg3TzY&feature=youtu.be
o For instance, assume that there are only two possible scalarizations (weights).
o Undominated set = {𝜋1, 𝜋2, 𝜋3}
o Coverage set = {𝜋1, 𝜋2} or {𝜋3, 𝜋2}
* in the accompanying figure, any policy outside the undominated set is not optimal for either of the scalarizations (weights), while 𝜋1 and 𝜋3 are both optimal for the same weight; two optimal policies exist there, so one of them is redundant.
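As a rough sketch of this example (the value vectors below are made up so that 𝜋1 and 𝜋3 tie under the first weight, and a fourth policy 𝜋4 is added to show exclusion), the undominated set and a coverage set could be computed like this:

```python
import numpy as np

weights = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # two scalarizations
values = {"pi1": np.array([5.0, 1.0]),                    # hypothetical V-vectors
          "pi2": np.array([1.0, 5.0]),
          "pi3": np.array([5.0, 1.0]),                    # ties with pi1 on w0
          "pi4": np.array([2.0, 2.0])}                    # optimal for neither

# Undominated set: every policy that is optimal for at least one weight.
undominated = {name for name, V in values.items()
               if any(w @ V >= max(w @ U for U in values.values()) for w in weights)}
print(undominated)   # {'pi1', 'pi2', 'pi3'}  (pi4 is excluded)

# Coverage set: keep just one optimal policy per weight (pruning).
coverage = {max(values, key=lambda n: w @ values[n]) for w in weights}
print(coverage)      # {'pi1', 'pi2'} or {'pi3', 'pi2'}, depending on tie-breaking
```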
3. Convex Hull
o the set of undominated policies for a linear scalarization function
4. Convex Coverage Set
o a coverage set for a linear scalarization function
Important Visualizations
o Assume ℝ² reward (value) vectors: dots in objective space.
o Scalarizing a value vector with a weight 𝐰 turns it into a line over the weight space.
* each dot in objective space corresponds to a line in weight space
o Now, what’s the coverage set?
o For all weights, we should determine a single optimal policy (value function).
⇒ upper surface of weight space
o In the figure, the blue dot / line is included in the undominated policy set, but not in the coverage set (a redundant policy).
Thus, our goal is to find the minimal coverage set: the upper surface (and its corresponding optimal policies / value functions) in the weight space.
Paper Review / Summary – innately multi-reward (2)
Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]
Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
This paper aims to learn an approximate coverage set of policies, each represented by a neural network.
One might ask, “Do I have to check the optimal policy for all possible weights to find out the coverage set?” Fortunately, the answer is “No, you don’t have to.”
⇒ we may solve the SOMDP problem for only a few points in the weight space to fully determine the coverage set.
The authors use the concept of “Linear Support”.
Linear Support Algorithm (Cheng, 1988)
http://www.pomdp.org/tutorial/cheng.html
1. First, pick an extremum point in the weight space.
2. For the chosen weight, solve the SOMDP and get the (vector-form) optimal value function. That gives us a line in the weight space. * you might want to store the corresponding optimal policy as well
3. Now, set the weights to the other extremum.
4. Solving the SOMDP, we get another (vector-form) optimal value function, which again can be represented as a line in this plot.
5. We now have an intersection. We call this a “corner point”.
6. Find the corresponding (vector-form) optimal value function at this corner-point weight.
7. We can then find new corner points.
8. Repeat the same process until you end up drawing a line you have previously drawn. (⇔ no new optimal value vector / optimal policy can be obtained)
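Below is a minimal sketch of this loop for two objectives. `solve_somdp` is a hypothetical stand-in for a real single-objective solver; it just looks up made-up candidate value vectors.

```python
import numpy as np

CANDIDATES = [np.array(v) for v in ([4.0, 1.0], [3.0, 3.0], [1.0, 4.0], [2.0, 2.0])]

def solve_somdp(w):
    """Stand-in SOMDP solver: returns the value vector maximizing w^T V."""
    return max(CANDIDATES, key=lambda V: w @ V)

def corner_weights(found):
    """Corner points: intersections of the lines w1*V[0] + (1-w1)*V[1]."""
    corners = []
    for i, U in enumerate(found):
        for V in found[i + 1:]:
            d = (U[0] - U[1]) - (V[0] - V[1])
            if abs(d) > 1e-9:
                w1 = (V[1] - U[1]) / d
                if 0.0 < w1 < 1.0:
                    corners.append(np.array([w1, 1.0 - w1]))
    return corners

# steps 1-4: solve at the two extrema of the weight simplex
ccs = [solve_somdp(np.array([1.0, 0.0])), solve_somdp(np.array([0.0, 1.0]))]
# steps 5-8: repeatedly solve at corner points until nothing new appears
frontier = corner_weights(ccs)
while frontier:
    w = frontier.pop()
    V = solve_somdp(w)
    if not any(np.allclose(V, U) for U in ccs):   # a genuinely new line
        ccs.append(V)
        frontier = corner_weights(ccs)
print(ccs)   # approximate convex coverage set ([2, 2] never appears)
```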
Optimal Linear Support Algorithm (OLS)
o The authors developed a slightly modified version of Linear Support (Cheng, 1988) to reduce the computational burden.
o For a large weight space, the original Linear Support algorithm might require excessive iterations.
o OLS instead terminates the iteration when the maximum possible improvement 𝛥 at the remaining corner points falls below a pre-defined threshold 𝜖.
OLS requires an SOMDP solver that, when a weight is given, can
1. give us a vectorized optimal value function
2. and the corresponding optimal policy.
The authors use a DQN that outputs a matrix of Q-values: one multi-objective Q-vector per action.
* while the standard DQN tries to maximize the Q-value itself, this DQN tries to maximize the scalarized Q-value for the explored corner weights.
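Concretely, action selection with such a Q-matrix might look like the following sketch (the Q-values and shapes are assumed for illustration: one row per action, one column per objective):

```python
import numpy as np

Q = np.array([[1.0, 3.0],   # hypothetical Q-matrix for one state:
              [2.5, 0.5],   # one row per action,
              [2.0, 2.0]])  # one column per objective
w = np.array([0.6, 0.4])    # corner weight currently being explored

scalarized = Q @ w          # -> [1.8, 1.7, 2.0]
a = int(np.argmax(scalarized))
print(a)                    # greedy action under this weight (here: action 2)
```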
Paper Review / Summary – innately multi-reward (3)
Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf]
Axel Abels, Diederik M. Roijers, Tom Lenaerts, Ann Nowe, Denis Steckelmacher
This paper specifically considers the case where the weights change over time.
Main Contributions
1. propose a Conditioned Network (CN) → an augmented version of DQN that outputs weight-dependent multi-objective Q-vectors.
2. propose Diverse Experience Replay (DER) → a way to efficiently train the conditioned network, exploring both the weight space and the state-action space.
Conditioned Network (CN)
o The network structure itself is quite intuitive: it accepts the weight as an extra input (a sketch follows below).
o The main problem comes from the “episode generation” phase:
1. to fully explore the action space, we might just take an 𝜖-greedy policy;
2. but how do we fully explore the weight space?
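A minimal PyTorch sketch of such a weight-conditioned network; the layer sizes and the exact architecture here are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class ConditionedQNetwork(nn.Module):
    """Q-network that takes the scalarization weight w as an extra input
    and outputs a multi-objective Q-vector for every action."""
    def __init__(self, state_dim, n_actions, n_objectives, hidden=64):
        super().__init__()
        self.n_actions, self.n_objectives = n_actions, n_objectives
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_objectives, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions * n_objectives),
        )

    def forward(self, state, w):
        q = self.net(torch.cat([state, w], dim=-1))
        return q.view(-1, self.n_actions, self.n_objectives)

net = ConditionedQNetwork(state_dim=4, n_actions=3, n_objectives=2)
s, w = torch.randn(1, 4), torch.tensor([[0.7, 0.3]])
q = net(s, w)                                              # shape (1, 3, 2)
greedy = (q @ w.unsqueeze(-1)).squeeze(-1).argmax(dim=-1)  # scalarize, argmax
```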
Diverse Experience Replay (DER)
o a diverse buffer from which relevant experiences can be sampled for weight vectors whose policies have not been executed recently → this is how the weight space gets covered.
o a method that reduces replay-buffer bias, enabling the network to learn diverse multi-objective optimal vectors.
Paper Review / Summary – innately multi-reward (4)
Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf]
Christian R. Shelton
Slightly different approach: scalarization is not the way to go!
o The author claims, “creating a single reward value by combining the multiple components can throw away vital information and can lead to incorrect solutions.”
o Thus, the author suggests an algorithm that uses / interprets the multiple rewards in vector form itself, without any type of scalarization.
Instead of explicitly setting priorities between the multiple objectives, this paper tries to find the ‘optimal balance’ between the multiple rewards.
Real-life intuition: “when and how do we make an optimal decision between (possibly conflicting) social agendas?” → VOTE!
Policy Votes
o The author introduces the concept of voting.
o Instead of casting a one-hot vote (like we do in real elections), each reward source can vote for multiple options (actions), as long as its vote sums to one.
o Meanwhile, the reward sources should vote not only in a single state, but in multiple states. They might want to “distribute” their voting power 𝛼𝑠(𝑥) over multiple states.
o Now, the ballot counting: our final policy is determined by the votes from the multiple reward sources (a rough sketch follows below).
o Note that 𝛼𝑠(𝑥) and 𝑣𝑠(𝑥, 𝑎) are all trainable parameters.
⇒ Each reward source will tune its own parameters to maximize its expected reward.
o However, for each reward source, it might be unwise to entirely reveal its true policy preference in the vote. → Keep in mind that the overall final policy is also affected by the votes from the other reward sources.
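As a rough, hypothetical sketch of the ballot counting (the paper's exact combination rule differs in detail; here each source's vote is simply weighted by its voting power and normalized, with made-up numbers):

```python
import numpy as np

# Two reward sources voting over 3 actions in states x0, x1.
# alpha[s][x]: voting power source s spends on state x (sums to 1 per source).
alpha = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
# v[s][x][a]: source s's vote over actions in state x (sums to 1 per state).
v = np.array([[[0.9, 0.1, 0.0], [0.3, 0.3, 0.4]],
              [[0.2, 0.2, 0.6], [0.0, 0.5, 0.5]]])

# Count ballots: weight each source's vote by its voting power, normalize.
scores = np.einsum("sx,sxa->xa", alpha, v)
policy = scores / scores.sum(axis=1, keepdims=True)
print(policy)   # final stochastic policy per state
```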
Nash Equilibrium Problem
o We can now formulate our problem as a Nash-equilibrium situation from game theory:
1. each reward source, one by one, finds the best vote to maximize its reward.
2. simultaneously update the old solution with the new best responses.
3. the iteration ends when all reward sources stay at their previous vote, with the same amount of reward. (Nash equilibrium)
o Since this is quite an old paper (before the popularity of deep RL), it uses an old-fashioned approach: formulate an estimate of the reward for a given policy 𝜋.
* the estimate involves the KL-divergence between the true personal preference 𝑝𝑠(𝑥, 𝑎) and the official policy 𝜋(𝑥, 𝑎)
Paper Review / Summary – better understand the env (1)
Horde: A scalable real-time architecture for learning … [AAMAS 2011] [pdf]
Richard S. Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski, Adam White
This paper first gave the idea of implementing multiple rewards to better understand the environment! The authors believe,
“knowledge about the environment is represented as a large number of approximate value functions learned in parallel, each with its own policy, pseudo-reward function, pseudo-termination function, and pseudo-terminal-reward function.”
They define a general value function (GVF) with four auxiliary functional inputs (the question functions).
Now, the authors propose the Horde architecture:
o it consists of an overall agent composed of many sub-agents (called demons)
o each demon is an independent RL agent responsible for learning one small piece of knowledge about the environment
o demons try to approximate the GVF 𝑞 corresponding to their own question functions (pseudo-rewards); a toy sketch follows below
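A toy sketch of this idea, assuming a made-up random environment: many demons run tabular TD(0) on their own pseudo-rewards over one shared experience stream.

```python
import numpy as np

n_states, n_demons, gamma, lr = 5, 3, 0.9, 0.1
rng = np.random.default_rng(0)

# Each demon has its own pseudo-reward function over states.
pseudo_rewards = rng.uniform(0, 1, size=(n_demons, n_states))
V = np.zeros((n_demons, n_states))   # one value function per demon

s = 0
for _ in range(10_000):                       # one shared experience stream
    s_next = rng.integers(n_states)           # stand-in for real dynamics
    for d in range(n_demons):                 # every demon learns in parallel
        td = pseudo_rewards[d, s_next] + gamma * V[d, s_next] - V[d, s]
        V[d, s] += lr * td
    s = s_next

print(V)   # each row: one demon's piece of knowledge about the environment
```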
Paper Review / Summary – better understand the env (2)
Universal Value Function Approximators [ICML 2015] [pdf]
Tom Schaul, Dan Horgan, Karol Gregor, David Silver
This paper tries to extend the idea of the general value function (GVF).
Recall: what’s wrong with the typical value function 𝑉(𝑠)?
o it represents the utility of any state in achieving a single goal.
o no information can be extracted from it when we want to achieve a different goal, or multiple goals.
Recall: Sutton et al. (2011) tried to extend the value function to take an extra (pseudo) “goal” into account, for the purpose of learning more about the surrounding environment:
o learn multiple value function approximators 𝑉𝑔(𝑠), each corresponding to a different (pseudo-) goal 𝑔.
o each such value function represents a chunk of knowledge about the environment → it can be useful when we have to solve a different goal.
The authors extend the general value function approximator to take both the state 𝑠 and the goal 𝑔 as input, parameterized by 𝜃: 𝑉(𝑠, 𝑔; 𝜃).
Instead of learning multiple value functions for selected goal states, we learn a single universal value function approximator (UVFA) that can generalize over all possible goals.
However, training a UVFA can be a difficult task!
→ if naively trained, the agent will only see a small subset of the possible combinations of states and goals (𝑠, 𝑔)
Possible Architectures of UVFA
o simply concatenate the state 𝑠 and the goal 𝑔 into one input stream
o two-stream architecture: embed the state and the goal separately, then combine the two embeddings (a sketch follows below)
→ it turns out the two-stream architecture is way better than simple concatenation. * experiment details omitted
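A minimal PyTorch sketch of the two-stream variant, assuming the embeddings are combined by a dot product (one of the combination functions considered in the paper); all sizes here are arbitrary:

```python
import torch
import torch.nn as nn

class TwoStreamUVFA(nn.Module):
    """V(s, g) = phi(s) . psi(g): separate state and goal embeddings,
    combined by a dot product."""
    def __init__(self, state_dim, goal_dim, embed_dim=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                 nn.Linear(embed_dim, embed_dim))
        self.psi = nn.Sequential(nn.Linear(goal_dim, embed_dim), nn.ReLU(),
                                 nn.Linear(embed_dim, embed_dim))

    def forward(self, s, g):
        return (self.phi(s) * self.psi(g)).sum(dim=-1)   # dot product

uvfa = TwoStreamUVFA(state_dim=4, goal_dim=4)
value = uvfa(torch.randn(8, 4), torch.randn(8, 4))       # shape (8,)
```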
Training the UVFA * training details omitted
Results
1. Trained the UVFA on a set of training goals (green dots in the paper’s figure).
2. Predicted the value function for held-out test goals (pink dots).
* the figure compares, for 5 goals from the test set, the value function learned by Horde (explicitly explored) against the value function predicted by the UVFA
Paper Review / Summary – better performance (1)
Hindsight Experience Replay [NIPS 2017] [pdf]
Marcin Andrychowicz, Filip Wolski, .. , Wojciech Zaremba
Designing a reward function is important, but not easy.
o a common challenge in RL is to carefully engineer the reward function → it must not only reflect the task at hand, but also be carefully shaped to guide the policy optimization
o the necessity of such cost engineering limits the applicability of RL in the real world, because it requires both RL expertise and domain-specific knowledge
o RL is not applicable in situations where we do not know what admissible behavior may look like
Therefore, we need to develop algorithms which can learn from unshaped reward signals (e.g. a binary signal indicating successful task completion).
Meanwhile, humans can learn from both successful and failed attempts.
o if you’re learning to play hockey, for instance, you can definitely learn from the experience of failure (the ball went out of the net, slightly to the right) → you can adjust your kick slightly to the left!
On the other hand, robots learn nothing from this failure (zero reward).
It is however possible to draw another conclusion: “this failed sequence of actions would be successful, and thus beneficial for the robot’s learning, if the net had been placed further to the right!”
The authors propose “Hindsight Experience Replay”.
Hindsight Experience Replay (HER)
o after experiencing an episode 𝑠0, 𝑠1, ⋯, 𝑠𝑇, we store in the replay buffer every transition 𝑠𝑡 → 𝑠𝑡+1 not only with the original goal used for this episode, but also with a subset of other goals.
o one possible choice of such goals is the state achieved at the final step of each episode (sketched below).
o implement the UVFA structure (concatenated version) to learn from these multiple goals
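A minimal sketch of the relabeling step with the final-state goal choice; `reward_fn` and the buffer format here are simplified assumptions, not the paper's exact implementation:

```python
import numpy as np

def reward_fn(state, goal, eps=0.05):
    """Stand-in binary reward: 1 only if the state reaches the goal."""
    return float(np.linalg.norm(state - goal) < eps)

def her_store(buffer, episode, goal):
    """Store each transition with the original goal AND a hindsight goal."""
    hindsight_goal = episode[-1][1]              # state achieved at the end
    for s, s_next, a in episode:
        buffer.append((s, a, s_next, goal, reward_fn(s_next, goal)))
        # relabel: pretend the achieved final state was the goal all along
        buffer.append((s, a, s_next, hindsight_goal,
                       reward_fn(s_next, hindsight_goal)))

buffer = []
episode = [(np.zeros(2), np.array([0.3, 0.1]), 0),
           (np.array([0.3, 0.1]), np.array([0.5, 0.2]), 1)]
her_store(buffer, episode, goal=np.array([1.0, 1.0]))
# the relabeled copy of the last transition now carries reward 1.0
```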
Paper Review / Summary – better performance (2)
Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]
Harm van Seijen, Joshua Romoff, Tavian Barnes, .. , Jeffrey Tsang
This paper proposes a Hybrid Reward Architecture (HRA).
Horde vs UVFA vs HRA
o Horde: learns multiple general value functions (GVFs), each corresponding to different reward functions and other question functions, using multiple sub-agents (a.k.a. demons)
o UVFA: generalizes the GVFs across different tasks and goals
o HRA: decomposes the reward function into 𝑛 different reward functions, with the intent to solve multiple simple tasks rather than a single complex task.
Proposed Method
o Decompose the reward function 𝑅env into 𝑛 reward functions, $R_{\text{env}}(s,a) = \sum_{k=1}^{n} R_{k}(s,a)$, and train separate RL agents on each of these reward functions.
o Because agent 𝑘 has its own reward function, it also has its own Q-value function 𝑄𝑘. (In fact, the 𝑛 different DQN networks can share multiple lower-level layers!)
o A combined network represents all the Q-value functions at once (a sketch follows below).
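A minimal PyTorch sketch of such a combined network, assuming a shared torso with one Q-head per reward component (sizes arbitrary); action selection aggregates the heads:

```python
import torch
import torch.nn as nn

class HRANetwork(nn.Module):
    """One Q-head per decomposed reward R_k, sharing lower-level layers."""
    def __init__(self, state_dim, n_actions, n_rewards, hidden=64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            nn.Linear(hidden, n_actions) for _ in range(n_rewards))

    def forward(self, s):
        z = self.torso(s)
        return torch.stack([head(z) for head in self.heads], dim=1)  # (B, n, |A|)

net = HRANetwork(state_dim=4, n_actions=3, n_rewards=5)
q_k = net(torch.randn(1, 4))          # per-head Q-values, shape (1, 5, 3)
action = q_k.sum(dim=1).argmax(-1)    # act greedily on the aggregated Q
```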
o Loss function associated with HRA: the sum of the per-head DQN losses, $\mathcal{L}(\theta) = \mathbb{E}\Big[\sum_{k=1}^{n}\big(y_{k} - Q_{k}(s,a;\theta)\big)^{2}\Big]$ with targets $y_{k} = R_{k}(s,a) + \gamma \max_{a'} Q_{k}(s',a';\theta^{-})$.
o Minimizing this loss yields the optimal weights 𝜃⋆.
HRA Architecture * architecture figure omitted
Possible Variants
o not only decomposing the existing reward, but also adding pseudo-rewards.
Results
* the figures compare the original DQN against HRA, and the original HRA (just decomposing) against an extended HRA (decomposing + adding pseudo-rewards)
For more details, check out the original papers.
1. MDP problem is innately multi-reward
o A Survey of Multi-Objective Sequential Decision Making [JAIR 2013] [pdf]
o Multi-Objective Deep Reinforcement Learning [arXiv 2016] [pdf]
o Dynamic Weights in Multi-Objective Deep Reinforcement Learning [ICML 2019] [pdf]
o Balancing Multiple Sources of Reward in Reinforcement Learning [NIPS 2000] [pdf]
2. Implement multi-reward to better understand the environment
o Horde: A scalable real-time architecture for learning knowledge from unsupervised
sensorimotor interaction [AAMAS 2011] [pdf]
o Universal Value Function Approximators [ICML 2015] [pdf]
3. Implement multi-reward for better performance
o Hindsight Experience Replay [NIPS 2017] [pdf]
o Hybrid Reward Architecture for Reinforcement Learning [NIPS 2017] [pdf]
Check out the Notion version of this slide : https://bit.ly/3tnzD9F