1
Multi-Objective Deep Reinforcement Learning with Priority-based Socially Aware Mobile Robot Navigation Frameworks
Institute of Information Technology, Caugiay, Hanoi, Vietnam
Le Quy Don Technical University, Caugiay, Hanoi, Vietnam
Hanoi, Nov-2023
2
Outline
01 Introduction and Related work
02 Methodology
03 Experiment
04 Conclusion and Future work
3
Introduction and Related work
01
4
Socially aware robot navigation problem
● Social environment:
○ Dynamic and dense with moving obstacles (humans or other objects)
○ Non-communicating situations
● Socially aware robot navigation is how to control the robot to reach its goal:
○ Without colliding with obstacles
○ In a time-efficient manner
○ In a socially compliant manner
5
Deep Reinforcement Learning (DRL) approaches to socially aware robot navigation
● CADRL - Collision Avoidance in Pedestrian-Rich Environments With Deep Reinforcement Learning
● SARL - Crowd-aware robot navigation with attention-based deep reinforcement learning
6
Socially aware robot navigation is a multi-objective decision-making problem
● The robot must not only reach its destination but also adhere to social rules.
○ Each of these social rules can be considered an objective within the training process.
○ Their importance may vary depending on the context.
● Some recent works have attempted to extend robot navigation into a multi-objective problem. However:
○ They focused on relatively simple navigation spaces (grid worlds, no pedestrians, …).
7
Main contributions of our work
● Introducing a multi-objective framework designed to enhance existing single-objective navigation models, through the following three contributions:
○ (1) The development of a multi-objective robot navigation framework.
○ (2) A reward prediction model.
○ (3) Experiments that showcase the effectiveness of our framework within a crowded simulation environment.
8
Methodology
02
9
Typical Multi-objective Reinforcement Learning
● Similar to the single-objective RL framework, it relies on a Markov Decision Process (MDP): $(S, A, P, R, \gamma)$ (1)
● The only differences:
○ The environment issues a vector reward $\mathbf{r} = (r_1, \ldots, r_d)$ for $d$ objectives instead of a scalar reward $R$.
○ A utility function $u: \mathbb{R}^d \rightarrow \mathbb{R}$ maps the multi-objective reward to a scalar value in alignment with user-defined preferences: $r_t = u(\mathbf{r}_t)$ (a minimal scalarization sketch follows below).
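For illustration only (not part of the slides), a minimal sketch of the scalarization step, assuming a simple weighted-sum utility stands in for the user-defined preference mapping:

    import numpy as np

    # Vector reward for d = 3 hypothetical objectives: goal progress, comfort, time penalty
    r_vec = np.array([0.8, -0.25, -0.01])

    # A stand-in utility u: R^d -> R; here a fixed weighted sum encodes the
    # user-defined preferences described on this slide.
    weights = np.array([1.0, 0.5, 0.1])

    def u(r):
        """Scalarize a vector reward according to fixed preference weights."""
        return float(weights @ r)

    r_scalar = u(r_vec)  # the scalar reward r_t = u(r_t) fed to a standard RL agent
    print(r_scalar)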
10
User Preferences Modeling
● We propose the Objective Order $o = [o_1, \ldots, o_d] \in O$, which defines the priority of each objective. The first objective in $o$ ($o_1$) has the highest priority and the last ($o_d$) the lowest.
● The reward $r^i$ is preferable to $r^j$ under order $o$, written $r^i \succ_o r^j$, if
$\exists\, o_x \in o: r^i_x > r^j_x \ \text{and}\ \nexists\, o_y \in o: r^i_y < r^j_y \ \text{where } o_y \text{ has higher priority than } o_x$ (2)
(a small comparison sketch follows below).
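A minimal sketch of the comparison in Eq. (2) (our own illustration, not the authors' code), assuming `order` lists objective indices from highest to lowest priority:

    def preferred(r_i, r_j, order):
        """Return True if reward vector r_i is preferred to r_j under Objective Order `order`.

        r_i is preferred if it is strictly better on some objective while not being
        worse on any objective of higher priority (Eq. 2) -- i.e., the first objective,
        in priority order, on which the two vectors differ decides the comparison.
        """
        for idx in order:
            if r_i[idx] > r_j[idx]:
                return True   # better here, and not worse on any higher-priority objective
            if r_i[idx] < r_j[idx]:
                return False  # worse on a higher-priority objective
        return False          # identical on every objective: no strict preference

    # Hypothetical 3-objective example: 0 = collision avoidance, 1 = human comfort, 2 = speed
    order = [0, 1, 2]
    print(preferred([0.0, 1.0, -0.2], [0.0, 0.5, 0.1], order))  # True: comfort decides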
11
Proposed Reward Predictor with Objective Order
● Given the Objective Order $o$ and utility function $u$, the preference $\succ_o$ is defined in terms of state rewards and trajectory rewards:
○ For every state-reward vector $r = R(s, a)$:
$r_i \succ_o r_j \Leftrightarrow u(r_i) > u(r_j)$ (3)
○ For every trajectory reward $\xi = \{r^0, \ldots, r^t\}$:
$\xi_i \succ_o \xi_j \Leftrightarrow r_i^0 + \cdots + r_i^t \succ_o r_j^0 + \cdots + r_j^h$ (4)
$\Leftrightarrow u(r_i^0) + \cdots + u(r_i^t) > u(r_j^0) + \cdots + u(r_j^h)$
● As the utility function $u$ is unknown in most cases, we propose a Reward Predictor (denoted $f$) to approximate $u$.
● The Reward Predictor predicts scalar rewards from the state space $S$ instead of from the reward vector $R$.
12
Proposed Reward Predictor with Objective Order (cont.)
● Given the Objective Order $o$ and Reward Predictor $f$, the preference $\succ_o$ is defined in terms of state rewards and trajectory rewards:
○ For every state-reward vector $r = R(s, a)$:
$r_i^t \succ_o r_j^h \Leftrightarrow f(s_i^{t+1}) > f(s_j^{h+1})$ (5)
○ For every trajectory reward $\xi = \{r^0, \ldots, r^t\}$:
$\xi_i \succ_o \xi_j \Leftrightarrow r_i^0 + \cdots + r_i^t \succ_o r_j^0 + \cdots + r_j^h$ (6)
$\Leftrightarrow f(s_i^1) + \cdots + f(s_i^{t+1}) > f(s_j^1) + \cdots + f(s_j^{h+1})$
● In fitting $f$, we deploy a state loss and a trajectory loss to enforce these constraints (a minimal sketch of both losses follows below).
● Both losses utilize the cross-entropy loss.
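The slides do not give the loss formulas, so the following is a minimal PyTorch-style sketch of one way the state loss and trajectory loss could be written as cross-entropy over preference pairs (a Bradley-Terry-style formulation; the function names, shapes, and single-pair batching are our assumptions, not the authors' code):

    import torch
    import torch.nn.functional as F_nn  # aliased to avoid clashing with the framework's F

    def state_loss(f, s_i_next, s_j_next):
        """Cross-entropy over one state pair where state i's reward is preferred (Eq. 5)."""
        logits = torch.stack([f(s_i_next), f(s_j_next)]).view(1, 2)
        target = torch.zeros(1, dtype=torch.long)   # class 0 = the preferred member
        return F_nn.cross_entropy(logits, target)

    def trajectory_loss(f, next_states_i, next_states_j):
        """Cross-entropy over one trajectory pair where trajectory i is preferred (Eq. 6)."""
        ret_i = torch.stack([f(s) for s in next_states_i]).sum()  # summed predicted rewards
        ret_j = torch.stack([f(s) for s in next_states_j]).sum()
        logits = torch.stack([ret_i, ret_j]).view(1, 2)
        target = torch.zeros(1, dtype=torch.long)
        return F_nn.cross_entropy(logits, target)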
13
Proposed Reward Predictor
● The Embedding Module transforms the states of agents into high-dimensional vectors, facilitating the extraction of dynamic features.
● The Attention Module models human interactions and generates a context vector associated with each individual's observation.
● The Prediction Module forecasts the subsequent scalar reward based on the observed state, in conjunction with the provided context vector (see the module sketch below).
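For illustration only, a rough PyTorch sketch of how the three modules could be wired together; the layer sizes, the 13-dimensional joint state, and the mean-pooled observation summary are our own assumptions rather than the architecture reported in the paper:

    import torch
    import torch.nn as nn

    class RewardPredictor(nn.Module):
        """Embedding -> Attention -> Prediction, mirroring the three modules above."""

        def __init__(self, state_dim=13, embed_dim=64):
            super().__init__()
            # Embedding Module: lift each robot-human joint state to a high-dim vector
            self.embed = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU())
            # Attention Module: score each human's embedding to build a context vector
            self.attn_score = nn.Linear(embed_dim, 1)
            # Prediction Module: map observation summary + context to a scalar reward
            self.predict = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim),
                                         nn.ReLU(), nn.Linear(embed_dim, 1))

        def forward(self, states):
            # states: [num_humans, state_dim] joint robot-human states for one observation
            e = self.embed(states)                          # [num_humans, embed_dim]
            w = torch.softmax(self.attn_score(e), dim=0)    # attention weight per human
            context = (w * e).sum(dim=0, keepdim=True)      # [1, embed_dim] context vector
            pooled = e.mean(dim=0, keepdim=True)            # crude observation summary
            return self.predict(torch.cat([pooled, context], dim=1)).squeeze()  # scalar

    # Usage sketch: predicted scalar reward for one observation with 5 humans
    f = RewardPredictor()
    print(f(torch.randn(5, 13)))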
14
Proposed Reward Predictor with Objective Order (cont.)
15
Our proposed Multi-Objective Robot Navigation framework
● With the Reward Predictor $R = F(S \mid O)$, which predicts scalar rewards from observed states while satisfying the predefined Objective Order,
● we can convert a multi-objective RL framework into a single-objective one (a relabeling sketch follows below):
$[s_t, \mathbf{r}_t, s_{t+1}] \rightarrow [s_t,\ r_t = F(s_{t+1} \mid o),\ s_{t+1}]$
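In practice, this conversion can be a thin relabeling step applied to each collected transition. A minimal sketch, assuming a trained predictor f (with the Objective Order already baked in) and an ordinary single-objective training loop:

    def relabel_transition(f, s_t, r_vec_t, s_next):
        """Swap the vector reward for the predictor's scalar reward.

        The Objective Order is encoded in f at training time, so the downstream
        RL algorithm only ever sees ordinary scalar-reward transitions.
        """
        r_scalar = float(f(s_next))      # r_t = F(s_{t+1} | o)
        return (s_t, r_scalar, s_next)   # r_vec_t is dropped; any single-objective
                                         # learner (e.g., SARL) can train on this tuple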
16
Proposed Framework
17
Experiment
03
18
Experiment Setup
● Simulation environment (adopted from SARL):
○ Invisible robot
○ Holonomic kinematics
○ Number of humans: 5, 10, 15, 20
● Baseline: SARL [1]
● SARL within our framework: SARL_f
○ Predefined Objective Order:
○ Reward predictor: f
○ RL framework: SARL
● Training episodes: 20,000 (see the config sketch below)
[1] C. Chen, Y. Liu, S. Kreiss, and A. Alahi, "Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 6015-6022.
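For reference, the setup above could be captured in a small configuration like the following sketch; the parameter names are our assumptions, while the values are those listed on this slide:

    # Hypothetical layout; values from this slide, names are ours.
    experiment_config = {
        "environment": {
            "source": "SARL crowd simulation [1]",
            "robot_visible_to_humans": False,   # "invisible robot"
            "kinematics": "holonomic",
            "num_humans": [5, 10, 15, 20],      # one setting per evaluation scenario
        },
        "baseline": "SARL",
        "ours": "SARL_f",                       # SARL trained inside the proposed framework
        "reward_predictor": "f",
        "training_episodes": 20_000,
    }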
19
Quantitative Evaluation on 100 testing episodes
● SARL_f shows a significant improvement in minimizing the discomfort experienced by humans.
● SARL_f exhibits better generalization than SARL when facing unforeseen situations.
20
Reward predictor
● Over the course of the training process, we evaluate the predicted rewards of four fixed, randomly selected states, each representing one of four types (Success, Discomfort, Collision, and Other).
● The reward predictor effectively assigns distinct rewards to each type of state.
21
Qualitative Evaluation
● SARL_f intentionally chose a longer path to ensure the safety of humans.
22
Qualitative Evaluation (cont.)
● SARL_f tends to halt its motion and wait for humans to move before resuming its path, thereby reducing human discomfort.
23
Qualitative Evaluation (cont.)
● SARL_f successfully navigates to the goal in the 20-human setup, while SARL does not.
24
Conclusion and Future work
04
25
Conclusion and Future work
● Conclusion:
○ Our framework leverages a reward prediction model to convert reward vectors into scalar rewards that align with user preferences.
○ It eliminates the need for hand-crafted reward functions that rely on empirical experience.
○ It is fully compatible with existing RL frameworks.
● Future work:
○ Exploring the impact of different objective prioritizations in more depth.
○ Improving the training process of our framework in terms of both training duration and sample efficiency.


Editor's Notes

  • #5 The captivating possibilities of socially aware robot navigation have increasingly captured the attention of the research community. Before we delve into the intricacies of our work, let me provide a brief overview of some key challenges in this domain. Firstly, let's consider the social environment: Dynamic and Dense Movements: Navigating through social environments involves coping with dynamic and densely packed scenarios, where obstacles, whether humans or other objects, are in constant motion. Non-Communicative Situations: Adding to the complexity, there are instances where direct communication may not be possible. Robots must navigate effectively in these non-communicative situations, relying on their own awareness and decision-making capabilities. Moving on to the second set of challenges, socially aware robot navigation involves addressing how we control robots to reach their goals: Collision-Free Navigation: A primary concern is guiding the robot to its destination without colliding with obstacles. This requires advanced sensors and algorithms to detect and respond to moving elements in real-time, ensuring a safe and obstacle-free journey. Time-Efficiency: Time is of the essence in today's fast-paced world. Achieving socially aware navigation involves not just safety but also optimizing the robot's path to the goal in the most time-efficient manner. This demands intelligent algorithms that balance speed with precision. Social Compliance: Lastly, it's crucial for robots to navigate with social compliance, aligning with societal norms. This includes respecting personal space, adhering to established routes, and adapting behavior to different social contexts. In essence, the challenges of socially aware robot navigation encompass a dynamic and multifaceted landscape, calling for innovative solutions. As we embark on our exploration of this captivating field, these challenges serve as guideposts, steering our efforts towards a future where robots seamlessly navigate and interact within our social spaces.
  • #6 In recent times, there has been a notable surge in harnessing the power of Deep Reinforcement Learning (DRL) to achieve promising results in various domains. Notably, two remarkable works that have made significant strides in this area are: CADRL - Collision Avoidance in Pedestrian-Rich Environments With Deep Reinforcement Learning: This project focuses on addressing the critical challenge of collision avoidance in environments saturated with pedestrians. By employing Deep Reinforcement Learning, CADRL seeks to develop intelligent algorithms that enable robots to navigate through densely populated spaces, prioritizing safety and collision-free interactions with pedestrians. SARL - Crowd-aware Robot Navigation with Attention-based Deep Reinforcement Learning: SARL takes a unique approach by incorporating attention-based mechanisms into Deep Reinforcement Learning for robot navigation. The attention mechanism allows the robot to dynamically focus on relevant aspects of the crowded environment, enhancing its awareness and decision-making capabilities. This work aims to create robots that navigate through crowds with a heightened level of situational awareness and adaptability.
  • #7 We argue that socially aware robot navigation is fundamentally a multi-objective decision-making problem. The rationale behind this assertion lies in the realization that the robot's mission extends beyond the mere act of reaching its destination; it must also navigate through the environment while adhering to social rules. In the training process, each of these social rules can be identified and treated as a distinct objective. These objectives are not only numerous but also diverse, ranging from considerations of personal space to adhering to established routes or exhibiting socially compliant behavior. By framing these social rules as separate objectives, we acknowledge the multifaceted nature of the navigation task. Furthermore, the importance of these individual objectives can vary depending on the specific context in which the robot operates. For instance, the significance of maintaining personal space may differ in a crowded urban setting compared to a more spacious environment. Understanding and assigning contextual importance to these objectives is crucial for effective socially aware robot navigation. Recent research endeavors have sought to expand the scope of robot navigation by transforming it into a multi-objective problem. However, a noteworthy observation is that these efforts have predominantly centered around relatively straightforward navigation spaces, characterized by grid layouts and an absence of pedestrians. The focus on such simplified environments implies that the current body of research is yet to comprehensively address the challenges associated with multi-objective navigation in more intricate, real-world scenarios involving dynamic elements like pedestrians.
  • #20 SARL_f does not achieve the same success rate in reaching the goal as SARL in setups with 5 and 10 humans, but outperforms SARL when the environment gets crowded (with 15 and 20 humans)