1
Multi-Objective Deep Reinforcement Learning with Priority-based Socially Aware Mobile Robot Navigation Frameworks
Institute of Information Technology, Caugiay, Hanoi, Vietnam
Le Quy Don Technical University, Caugiay, Hanoi, Vietnam
Hanoi, Nov-2023
2
Outline
01 Introduction and Related work
02 Methodology
03 Experiment
04 Conclusion and Future work
3
Introduction and Related work
01
4
Socially aware robot navigation problem
● Social environment:
○ Dynamic and dense with moving obstacles (humans or other objects)
○ Non-communicating situations
● Socially aware robot navigation is how to control the robot to reach its goal:
○ Without colliding with obstacles
○ In a time-efficient manner
○ In a socially compliant manner
5
Deep Reinforcement Learning (DRL) approaches to socially aware robot navigation
● CADRL - Collision Avoidance in Pedestrian-Rich Environments With Deep Reinforcement Learning
● SARL - Crowd-aware robot navigation with attention-based deep reinforcement learning
6
Socially aware robot navigation is a multi-objective decision-making problem
● The robot must not only reach its destination but also adhere to social rules.
○ Each of these social rules can be considered an objective within the training process.
○ Their importance may vary depending on the context.
● Some recent works have attempted to extend robot navigation into a multi-objective problem. However:
○ They focused on relatively simple navigation spaces (grid worlds, no pedestrians, …).
7
Main contributions of our work
● Introducing a multi-objective framework designed to enhance existing single-objective navigation models, through the following three contributions:
○ (1) The development of a multi-objective robot navigation framework.
○ (2) A reward prediction model.
○ (3) Experiments that showcase the effectiveness of our framework within a crowded simulation environment.
8
Methodology
02
9
Typical Multi-objective Reinforcement Learning
● Similar to the single-objective RL framework, it relies on a Markov Decision Process (MDP): $(S, A, P, R, \gamma)$ (1)
● The only differences:
○ The environment issues a vector reward $\mathbf{r} = (r_1, \ldots, r_d)$ for $d$ objectives instead of a scalar reward $R$.
○ A utility function $u: \mathbb{R}^d \rightarrow \mathbb{R}$ maps the multi-objective reward to a scalar value in alignment with user-defined preferences: $r_t = u(\mathbf{r}_t)$ (a minimal scalarization sketch follows below).
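For illustration only (not part of the slides), a minimal sketch of the scalarization step, assuming a simple weighted-sum utility stands in for the user-defined preference mapping:

    import numpy as np

    # Vector reward for d = 3 hypothetical objectives: goal progress, comfort, time penalty
    r_vec = np.array([0.8, -0.25, -0.01])

    # A stand-in utility u: R^d -> R; here a fixed weighted sum encodes the
    # user-defined preferences described on this slide.
    weights = np.array([1.0, 0.5, 0.1])

    def u(r):
        """Scalarize a vector reward according to fixed preference weights."""
        return float(weights @ r)

    r_scalar = u(r_vec)  # the scalar reward r_t = u(r_t) fed to a standard RL agent
    print(r_scalar)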
10
User Preferences Modeling
● We propose the Objective Order $o = [o_1, \ldots, o_d] \in O$, which defines the priority of each objective. The first objective in $o$ ($o_1$) has the highest priority and the last ($o_d$) the lowest.
● The reward $r^i$ is preferable to $r^j$ under order $o$, written $r^i \succ_o r^j$, if
$\exists\, o_x \in o: r^i_x > r^j_x \ \text{and}\ \nexists\, o_y \in o: r^i_y < r^j_y \ \text{where } o_y \text{ has higher priority than } o_x$ (2)
(a small comparison sketch follows below).
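A minimal sketch of the comparison in Eq. (2) (our own illustration, not the authors' code), assuming `order` lists objective indices from highest to lowest priority:

    def preferred(r_i, r_j, order):
        """Return True if reward vector r_i is preferred to r_j under Objective Order `order`.

        r_i is preferred if it is strictly better on some objective while not being
        worse on any objective of higher priority (Eq. 2) -- i.e., the first objective,
        in priority order, on which the two vectors differ decides the comparison.
        """
        for idx in order:
            if r_i[idx] > r_j[idx]:
                return True   # better here, and not worse on any higher-priority objective
            if r_i[idx] < r_j[idx]:
                return False  # worse on a higher-priority objective
        return False          # identical on every objective: no strict preference

    # Hypothetical 3-objective example: 0 = collision avoidance, 1 = human comfort, 2 = speed
    order = [0, 1, 2]
    print(preferred([0.0, 1.0, -0.2], [0.0, 0.5, 0.1], order))  # True: comfort decides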
11
Proposed Reward Predictor with Objective Order
● Given the Objective Order $o$ and utility function $u$, the preference $\succ_o$ is defined in terms of state rewards and trajectory rewards:
○ For every state-reward vector $r = R(s, a)$:
$r_i \succ_o r_j \Leftrightarrow u(r_i) > u(r_j)$ (3)
○ For every trajectory reward $\xi = \{r^0, \ldots, r^t\}$:
$\xi_i \succ_o \xi_j \Leftrightarrow r_i^0 + \cdots + r_i^t \succ_o r_j^0 + \cdots + r_j^h$ (4)
$\Leftrightarrow u(r_i^0) + \cdots + u(r_i^t) > u(r_j^0) + \cdots + u(r_j^h)$
● As the utility function $u$ is unknown in most cases, we propose a Reward Predictor (denoted $f$) to approximate $u$.
● The Reward Predictor predicts scalar rewards from the state space $S$ instead of from the reward vector $R$.
12
Proposed Reward Predictor with Objective Order (cont.)
● Given the Objective Order $o$ and Reward Predictor $f$, the preference $\succ_o$ is defined in terms of state rewards and trajectory rewards:
○ For every state-reward vector $r = R(s, a)$:
$r_i^t \succ_o r_j^h \Leftrightarrow f(s_i^{t+1}) > f(s_j^{h+1})$ (5)
○ For every trajectory reward $\xi = \{r^0, \ldots, r^t\}$:
$\xi_i \succ_o \xi_j \Leftrightarrow r_i^0 + \cdots + r_i^t \succ_o r_j^0 + \cdots + r_j^h$ (6)
$\Leftrightarrow f(s_i^1) + \cdots + f(s_i^{t+1}) > f(s_j^1) + \cdots + f(s_j^{h+1})$
● In fitting $f$, we deploy a state loss and a trajectory loss to enforce these constraints (a minimal sketch of both losses follows below).
● Both losses utilize the cross-entropy loss.
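The slides do not give the loss formulas, so the following is a minimal PyTorch-style sketch of one way the state loss and trajectory loss could be written as cross-entropy over preference pairs (a Bradley-Terry-style formulation; the function names, shapes, and single-pair batching are our assumptions, not the authors' code):

    import torch
    import torch.nn.functional as F_nn  # aliased to avoid clashing with the framework's F

    def state_loss(f, s_i_next, s_j_next):
        """Cross-entropy over one state pair where state i's reward is preferred (Eq. 5)."""
        logits = torch.stack([f(s_i_next), f(s_j_next)]).view(1, 2)
        target = torch.zeros(1, dtype=torch.long)   # class 0 = the preferred member
        return F_nn.cross_entropy(logits, target)

    def trajectory_loss(f, next_states_i, next_states_j):
        """Cross-entropy over one trajectory pair where trajectory i is preferred (Eq. 6)."""
        ret_i = torch.stack([f(s) for s in next_states_i]).sum()  # summed predicted rewards
        ret_j = torch.stack([f(s) for s in next_states_j]).sum()
        logits = torch.stack([ret_i, ret_j]).view(1, 2)
        target = torch.zeros(1, dtype=torch.long)
        return F_nn.cross_entropy(logits, target)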
13
Proposed Reward Predictor
● The Embedding Module transforms the states of agents into high-dimensional vectors, facilitating the extraction of dynamic features.
● The Attention Module models human interactions and generates a context vector associated with each individual's observation.
● The Prediction Module forecasts the subsequent scalar reward based on the observed state, in conjunction with the provided context vector (see the module sketch below).
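For illustration only, a rough PyTorch sketch of how the three modules could be wired together; the layer sizes, the 13-dimensional joint state, and the mean-pooled observation summary are our own assumptions rather than the architecture reported in the paper:

    import torch
    import torch.nn as nn

    class RewardPredictor(nn.Module):
        """Embedding -> Attention -> Prediction, mirroring the three modules above."""

        def __init__(self, state_dim=13, embed_dim=64):
            super().__init__()
            # Embedding Module: lift each robot-human joint state to a high-dim vector
            self.embed = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU())
            # Attention Module: score each human's embedding to build a context vector
            self.attn_score = nn.Linear(embed_dim, 1)
            # Prediction Module: map observation summary + context to a scalar reward
            self.predict = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim),
                                         nn.ReLU(), nn.Linear(embed_dim, 1))

        def forward(self, states):
            # states: [num_humans, state_dim] joint robot-human states for one observation
            e = self.embed(states)                          # [num_humans, embed_dim]
            w = torch.softmax(self.attn_score(e), dim=0)    # attention weight per human
            context = (w * e).sum(dim=0, keepdim=True)      # [1, embed_dim] context vector
            pooled = e.mean(dim=0, keepdim=True)            # crude observation summary
            return self.predict(torch.cat([pooled, context], dim=1)).squeeze()  # scalar

    # Usage sketch: predicted scalar reward for one observation with 5 humans
    f = RewardPredictor()
    print(f(torch.randn(5, 13)))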
14
Proposed Reward Predictor with Objective Order (cont.)
15
Our proposed Multi-Objective Robot Navigation framework
● With the Reward Predictor $R = F(S \mid O)$, which predicts scalar rewards from observed states while satisfying the predefined Objective Order,
● we can convert a multi-objective RL framework into a single-objective one (a relabeling sketch follows below):
$[s_t, \mathbf{r}_t, s_{t+1}] \rightarrow [s_t,\ r_t = F(s_{t+1} \mid o),\ s_{t+1}]$
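In practice, this conversion can be a thin relabeling step applied to each collected transition. A minimal sketch, assuming a trained predictor f (with the Objective Order already baked in) and an ordinary single-objective training loop:

    def relabel_transition(f, s_t, r_vec_t, s_next):
        """Swap the vector reward for the predictor's scalar reward.

        The Objective Order is encoded in f at training time, so the downstream
        RL algorithm only ever sees ordinary scalar-reward transitions.
        """
        r_scalar = float(f(s_next))      # r_t = F(s_{t+1} | o)
        return (s_t, r_scalar, s_next)   # r_vec_t is dropped; any single-objective
                                         # learner (e.g., SARL) can train on this tuple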
16
Proposed Framework
17
Experiment
03
18
Experiment Setup
● Simulation environment (adopted from SARL):
○ Invisible robot
○ Holonomic kinematics
○ Number of humans: 5, 10, 15, 20
● Baseline: SARL [1]
● SARL within our framework: SARL_f
○ Predefined Objective Order:
○ Reward predictor: f
○ RL framework: SARL
● Training episodes: 20,000 (see the config sketch below)
[1] C. Chen, Y. Liu, S. Kreiss, and A. Alahi, "Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 6015-6022.
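For reference, the setup above could be captured in a small configuration like the following sketch; the parameter names are our assumptions, while the values are those listed on this slide:

    # Hypothetical layout; values from this slide, names are ours.
    experiment_config = {
        "environment": {
            "source": "SARL crowd simulation [1]",
            "robot_visible_to_humans": False,   # "invisible robot"
            "kinematics": "holonomic",
            "num_humans": [5, 10, 15, 20],      # one setting per evaluation scenario
        },
        "baseline": "SARL",
        "ours": "SARL_f",                       # SARL trained inside the proposed framework
        "reward_predictor": "f",
        "training_episodes": 20_000,
    }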
19
Quantitative Evaluation on 100 testing episodes
● SARL_f shows a significant improvement in minimizing the discomfort experienced by humans.
● SARL_f exhibits better generalization than SARL when facing unforeseen situations.
20
Reward predictor
● Over the course of the training process, we evaluate the predicted rewards of four fixed, randomly selected states, each representing one of four types (Success, Discomfort, Collision, and Other).
● The reward predictor effectively assigns distinct rewards to each type of state.
21
Qualitative Evaluation
● SARL_f intentionally chose a longer path to ensure the safety of humans.
22
Qualitative Evaluation (cont.)
● SARL_f tends to halt its motion and wait for humans to move before resuming its path, thereby reducing human discomfort.
23
Qualitative Evaluation (cont.)
● SARL_f successfully navigates to the goal in the 20-human setup, while SARL does not.
24
Conclusion and Future work
04
25
Conclusion and Future work
● Conclusion:
○ Our framework leverages a reward prediction model to convert reward vectors into scalar rewards that align with user preferences.
○ It eliminates the need for hand-crafted reward functions that rely on empirical experience.
○ It is fully compatible with existing RL frameworks.
● Future work:
○ Exploring the impact of different objective prioritizations in more depth.
○ Improving the training process of our framework in terms of both training duration and sample efficiency.


Editor's Notes

  • #5 The captivating possibilities of socially aware robot navigation have increasingly captured the attention of the research community. Before we delve into the intricacies of our work, let me provide a brief overview of some key challenges in this domain. Firstly, let's consider the social environment: Dynamic and Dense Movements: Navigating through social environments involves coping with dynamic and densely packed scenarios, where obstacles, whether humans or other objects, are in constant motion. Non-Communicative Situations: Adding to the complexity, there are instances where direct communication may not be possible. Robots must navigate effectively in these non-communicative situations, relying on their own awareness and decision-making capabilities. Moving on to the second set of challenges, socially aware robot navigation involves addressing how we control robots to reach their goals: Collision-Free Navigation: A primary concern is guiding the robot to its destination without colliding with obstacles. This requires advanced sensors and algorithms to detect and respond to moving elements in real-time, ensuring a safe and obstacle-free journey. Time-Efficiency: Time is of the essence in today's fast-paced world. Achieving socially aware navigation involves not just safety but also optimizing the robot's path to the goal in the most time-efficient manner. This demands intelligent algorithms that balance speed with precision. Social Compliance: Lastly, it's crucial for robots to navigate with social compliance, aligning with societal norms. This includes respecting personal space, adhering to established routes, and adapting behavior to different social contexts. In essence, the challenges of socially aware robot navigation encompass a dynamic and multifaceted landscape, calling for innovative solutions. As we embark on our exploration of this captivating field, these challenges serve as guideposts, steering our efforts towards a future where robots seamlessly navigate and interact within our social spaces.
  • #6 In recent times, there has been a notable surge in harnessing the power of Deep Reinforcement Learning (DRL) to achieve promising results in various domains. Notably, two remarkable works that have made significant strides in this area are: CADRL - Collision Avoidance in Pedestrian-Rich Environments With Deep Reinforcement Learning: This project focuses on addressing the critical challenge of collision avoidance in environments saturated with pedestrians. By employing Deep Reinforcement Learning, CADRL seeks to develop intelligent algorithms that enable robots to navigate through densely populated spaces, prioritizing safety and collision-free interactions with pedestrians. SARL - Crowd-aware Robot Navigation with Attention-based Deep Reinforcement Learning: SARL takes a unique approach by incorporating attention-based mechanisms into Deep Reinforcement Learning for robot navigation. The attention mechanism allows the robot to dynamically focus on relevant aspects of the crowded environment, enhancing its awareness and decision-making capabilities. This work aims to create robots that navigate through crowds with a heightened level of situational awareness and adaptability.
  • #7 We argue that socially aware robot navigation is fundamentally a multi-objective decision-making problem. The rationale behind this assertion lies in the realization that the robot's mission extends beyond the mere act of reaching its destination; it must also navigate through the environment while adhering to social rules. In the training process, each of these social rules can be identified and treated as a distinct objective. These objectives are not only numerous but also diverse, ranging from considerations of personal space to adhering to established routes or exhibiting socially compliant behavior. By framing these social rules as separate objectives, we acknowledge the multifaceted nature of the navigation task. Furthermore, the importance of these individual objectives can vary depending on the specific context in which the robot operates. For instance, the significance of maintaining personal space may differ in a crowded urban setting compared to a more spacious environment. Understanding and assigning contextual importance to these objectives is crucial for effective socially aware robot navigation. Recent research endeavors have sought to expand the scope of robot navigation by transforming it into a multi-objective problem. However, a noteworthy observation is that these efforts have predominantly centered around relatively straightforward navigation spaces, characterized by grid layouts and an absence of pedestrians. The focus on such simplified environments implies that the current body of research is yet to comprehensively address the challenges associated with multi-objective navigation in more intricate, real-world scenarios involving dynamic elements like pedestrians.
  • #20 SARL_f does not achieve the same success rate in reaching the goal as SARL in setups with 5 and 10 humans, but outperforms SARL when the environment gets crowded (with 15 and 20 humans)