I reviewed the PRM-RL paper.
PRM-RL (Probabilistic Roadmap-Reinforcement Learning) is a hierarchical method that combines sampling-based path planning with RL. It uses feature-based and deep neural net policies (DDPG) in continuous state and action spaces. In their experiments, the authors evaluate PRM-RL, both in simulation and on-robot, on two navigation tasks: end-to-end differential drive indoor navigation in office environments, and aerial cargo delivery in urban environments.
Outline
- Abstract
- Introduction
- Reinforcement Learning
- Methods
- Results
Thank you.
Character Controllers using Motion VAEs - Dongmin Lee
Title: Character Controllers using Motion VAEs
Proceeding: ACM Transactions on Graphics (TOG) (Proc. SIGGRAPH 2020)
Paper: https://dl.acm.org/doi/abs/10.1145/3386569.3392422
Video: https://www.youtube.com/watch?v=Zm3G9oqmQ4Y
Given example motions, how can we generalize these to produce new purposeful motions?
We take a two-step approach to this problem:
• A kinematic generative model based on an autoregressive conditional variational autoencoder, or motion VAE (MVAE)
• A learned controller that generates desired motions, trained with Deep Reinforcement Learning (Deep RL)
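To make the two steps concrete, here is a minimal sketch of the MVAE step in PyTorch. The layer sizes, pose dimensionality, and module names are illustrative assumptions, not the authors' implementation; it only shows the autoregressive conditional-VAE shape that an RL controller later drives through the latent variable.

import torch
import torch.nn as nn

class MotionVAE(nn.Module):
    """Illustrative autoregressive conditional VAE over poses."""
    def __init__(self, pose_dim=32, latent_dim=8, hidden=256):
        super().__init__()
        # Encoder q(z | previous pose, current pose)
        self.encoder = nn.Sequential(
            nn.Linear(2 * pose_dim, hidden), nn.ELU(),
            nn.Linear(hidden, 2 * latent_dim))       # mean and log-variance
        # Decoder p(current pose | z, previous pose)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + pose_dim, hidden), nn.ELU(),
            nn.Linear(hidden, pose_dim))

    def forward(self, prev_pose, cur_pose):
        mean, logvar = self.encoder(torch.cat([prev_pose, cur_pose], -1)).chunk(2, -1)
        z = mean + torch.randn_like(mean) * (0.5 * logvar).exp()   # reparameterization
        return self.decoder(torch.cat([z, prev_pose], -1)), mean, logvar

mvae = MotionVAE()
recon, mean, logvar = mvae(torch.zeros(1, 32), torch.zeros(1, 32))

At control time the decoder runs autoregressively: a policy, trained with deep RL in step two, chooses z each frame, and the frozen decoder turns (z, previous pose) into the next pose.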
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va... - Dongmin Lee
I reviewed the PEARL paper.
PEARL (Probabilistic Embeddings for Actor-critic RL) is an off-policy meta-RL algorithm that achieves both meta-training and adaptation efficiency. It performs probabilistic filtering of latent task variables with an encoder, which enables posterior sampling for structured and efficient exploration.
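As a rough illustration of what the probabilistic encoder and posterior sampling mean: each observed transition contributes an independent Gaussian factor over the latent task variable z, the factors are multiplied into a posterior, and z is sampled from that posterior to drive exploration. The sketch below assumes illustrative dimensions and is not PEARL's actual code.

import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, transition_dim=10, latent_dim=5, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))        # per-factor mean, log-variance

    def posterior(self, context):                      # context: (num_transitions, dim)
        mean, logvar = self.net(context).chunk(2, -1)
        prec = (-logvar).exp()                         # product of Gaussians:
        post_prec = prec.sum(0)                        # precisions add,
        post_mean = (mean * prec).sum(0) / post_prec   # means are precision-weighted
        return post_mean, 1.0 / post_prec

encoder = ContextEncoder()
mean, var = encoder.posterior(torch.randn(16, 10))
z = mean + var.sqrt() * torch.randn_like(mean)         # posterior sampling of the task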
Outline
- Abstract
- Introduction
- Probabilistic Latent Context
- Off-Policy Meta-Reinforcement Learning
- Experiments
Link: https://arxiv.org/abs/1903.08254
Thank you!
Exploration Strategies in Reinforcement Learning - Dongmin Lee
I presented "Exploration Strategies in Reinforcement Learning" at AI Robotics KR.
- Exploration strategies in RL
1. Epsilon-greedy (a minimal code sketch follows this list)
2. Optimism in the face of uncertainty
3. Thompson (posterior) sampling
4. Information theoretic exploration (e.g., Entropy Regularization in RL)
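A minimal sketch of the first strategy, with illustrative Q-values and epsilon:

import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

action = epsilon_greedy([0.2, 0.5, 0.1])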
Thank you.
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021 - Chris Ohk
At the RL paper review study group, I summarized and presented the paper Evolving Reinforcement Learning Algorithms. This paper designs a language for expressing the loss function of a value-based, model-free RL agent and proposes loss functions that are better optimized than the existing DQN. I hope it helps many people.
An overview of gradient descent optimization algorithms - Hakky St
These slides, prepared for a study meeting, summarize a paper on gradient descent.
I used the paper below, and it is a very good source of information on gradient descent.
https://arxiv.org/abs/1609.04747
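For a taste of what the paper covers, here are minimal NumPy sketches of three of the surveyed update rules; the step sizes and decay constants are common illustrative defaults, not recommendations from the paper.

import numpy as np

def sgd(theta, grad, lr=0.01):
    return theta - lr * grad                              # vanilla gradient step

def momentum(theta, grad, v, lr=0.01, gamma=0.9):
    v = gamma * v + lr * grad                             # accumulate velocity
    return theta - v, v

def adam(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad                          # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2                     # second-moment estimate
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)   # bias correction
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v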
We introduce a novel training procedure for policy gradient methods wherein episodic memory is used to optimize the hyperparameters of reinforcement learning algorithms on-the-fly.
Discusses the concept of language models in Natural Language Processing. N-gram models and Markov chains are covered, and smoothing techniques such as add-1 smoothing, interpolation, and discounting are addressed.
PR12-094: Model-Agnostic Meta-Learning for fast adaptation of deep networks - Taesu Kim
Paper review: "Model-Agnostic Meta-Learning for fast adaptation of deep networks" by C. Finn et al. (ICML 2017)
Presented at Tensorflow-KR paper review forum (#PR12) by Taesu Kim
Paper link: https://arxiv.org/abs/1703.03400
Video link: https://youtu.be/fxJXXKZb-ik (in Korean)
http://www.neosapience.com
FastCampus 2018 SLAM Workshop
You can find the code diagrams via the link below.
https://www.dropbox.com/sh/u76i5hzdecd4ey7/AADgs9XzXt6k1j971vyBrFTea?dl=0
Visual odometry & SLAM utilizing indoor structured environments - NAVER Engineering
Visual odometry (VO) and simultaneous localization and mapping (SLAM) are fundamental building blocks for various applications from autonomous vehicles to virtual and augmented reality (VR/AR).
To improve the accuracy and robustness of the VO & SLAM approaches, we exploit multiple lines and orthogonal planar features, such as walls, floors, and ceilings, common in man-made indoor environments.
We demonstrate the effectiveness of the proposed VO & SLAM algorithms through an extensive evaluation on a variety of RGB-D datasets and compare with other state-of-the-art methods.
I updated the previous slides.
Previous slides: https://www.slideshare.net/DongMinLee32/causal-confusion-in-imitation-learning-238882277
I reviewed the "Causal Confusion in Imitation Learning" paper.
Paper link: https://papers.nips.cc/paper/9343-causal-confusion-in-imitation-learning.pdf
- Abstract
Behavioral cloning reduces policy learning to supervised learning by training a discriminative model to predict expert actions given observations. Such discriminative models are non-causal: the training procedure is unaware of the causal structure of the interaction between the expert and the environment. We point out that ignoring causality is particularly damaging because of the distributional shift in imitation learning. In particular, it leads to a counter-intuitive “causal misidentification” phenomenon: access to more information can yield worse performance. We investigate how this problem arises, and propose a solution to combat it through targeted interventions—either environment interaction or expert queries—to determine the correct causal model. We show that causal misidentification occurs in several benchmark control domains as well as realistic driving settings, and validate our solution against DAgger and other baselines and ablations.
- Outline
1. Introduction
2. Causality and Causal Inference
3. Causality in Imitation Learning
4. Experiment Settings
5. Resolving Causal Misidentification
- Causal Graph-Parameterized Policy Learning
- Targeted Intervention
6. Experiments
Thank you!
I reviewed the "Causal Confusion in Imitation Learning" paper.
- Abstract
Behavioral cloning reduces policy learning to supervised learning by training a discriminative model to predict expert actions given observations. Such discriminative models are non-causal: the training procedure is unaware of the causal structure of the interaction between the expert and the environment. We point out that ignoring causality is particularly damaging because of the distributional shift in imitation learning. In particular, it leads to a counter-intuitive “causal misidentification” phenomenon: access to more information can yield worse performance. We investigate how this problem arises, and propose a solution to combat it through targeted interventions—either environment interaction or expert queries—to determine the correct causal model. We show that causal misidentification occurs in several benchmark control domains as well as realistic driving settings, and validate our solution against DAgger and other baselines and ablations.
- Outline
1. Introduction
2. Causality and Causal Inference
3. Causality in Imitation Learning
4. Experiments Setting
5. Resolving Causal Misidentification
- Causal Graph-Parameterized Policy Learning
- Targeted Intervention
6. Experiments
Link: https://papers.nips.cc/paper/9343-causal-confusion-in-imitation-learning.pdf
Thank you!
Maximum Entropy Reinforcement Learning (Stochastic Control) - Dongmin Lee
I reviewed the following papers.
- T. Haarnoja, et al., “Reinforcement Learning with Deep Energy-Based Policies", ICML 2017
- T. Haarnoja, et al., “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor", ICML 2018
- T. Haarnoja, et al., “Soft Actor-Critic Algorithms and Applications", arXiv preprint 2018
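The common thread of these three papers is the maximum entropy objective, which augments the expected return with a policy-entropy bonus weighted by a temperature α:

J(π) = Σ_t 𝔼_{(s_t, a_t) ~ ρ_π}[ r(s_t, a_t) + α H(π(·|s_t)) ]

Setting α = 0 recovers the standard RL objective; a larger α rewards more stochastic, exploratory policies.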
Thank you.
Hello. I am Dongmin Lee, the project manager of the "GAIL하자!" ("Let's do GAIL!") project at RL Korea. This material briefly introduces what we did over the course of four months.
Our project studied the theoretical foundations of papers on Inverse RL, one of the approaches to imitation learning, and implemented them in environments.
The list of related papers is as follows:
[1] AY. Ng, et al., "Algorithms for Inverse Reinforcement Learning", ICML 2000.
[2] P. Abbeel, et al., "Apprenticeship Learning via Inverse Reinforcement Learning", ICML 2004.
[3] ND. Ratliff, et al., "Maximum Margin Planning", ICML 2006.
[4] BD. Ziebart, et al., "Maximum Entropy Inverse Reinforcement Learning", AAAI 2008.
[5] J. Ho, et al., "Generative Adversarial Imitation Learning", NIPS 2016. (a sketch of its learning signal follows this list)
[6] XB. Peng, et al., "Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow", ICLR 2019.
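To give a flavor of the adversarial formulation in [5], here is a minimal sketch of the GAIL learning signal; the network shape and input dimension are illustrative assumptions, not the project's code.

import torch
import torch.nn as nn

# Discriminator D(s, a): trained to output 1 on expert pairs, 0 on policy pairs
discriminator = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

def gail_reward(state_action):
    """Reward the policy where it fools the discriminator: -log(1 - D(s, a))."""
    d = torch.sigmoid(discriminator(state_action))
    return -torch.log(1.0 - d + 1e-8)

r = gail_reward(torch.randn(1, 4))   # state-action pair of illustrative dimension 4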
The project deliverables are a blog summarizing the papers and a GitHub repository implementing them. The links are as follows:
- Blog: https://reinforcement-learning-kr.github.io/2019/01/22/0_lets-do-irl-guide/
- GitHub: https://github.com/reinforcement-learning-kr/lets-do-irl
Let's all do IRL together!
Thank you :)
Hello.
I am Dongmin Lee, and I participated in the PG Travel ("피지여행") project at RL Korea.
This is the material I presented at the 1st RLKorea Project Seminar on Saturday, August 25. Unfortunately, the videos do not play.
The seminar link is below.
1st RLKorea Project Seminar: https://www.facebook.com/groups/ReinforcementLearningKR/permalink/2024537701118793/?__tn__=H-R
To briefly introduce PG Travel: it is a project that reviews papers related to Policy Gradient, summarizes them on a blog, and implements and experiments with the code.
The blog and GitHub are as follows:
Blog: https://reinforcement-learning-kr.github.io/…/0_pg-travel-…/
GitHub: https://github.com/reinforcement-learning-kr/pg_travel
I hope many people find it helpful!
Thank you!!
Hello, I am Dongmin Lee. :)
These are the slides for "Safe Reinforcement Learning", presented at the Korea Aerospace Research Institute on August 9, 2018.
The table of contents is as follows:
1. Reinforcement Learning
2. Safe Reinforcement Learning
3. Optimization Criterion
4. Exploration Process
While continuing to study reinforcement learning, I came to feel that for many people to actually use it, it needs to be safer and faster, so I studied papers and other materials on this topic and presented them.
I hope it helps many people. Thank you!
Hello.
I am Dongmin Lee, and I presented on the topic "Safety First Reinforcement Learning (안.전.제.일. 강화학습)" at the 1st All-Together Deep Learning Conference.
The conference link is below:
https://tykimos.github.io/2018/06/28/ISS_1st_Deep_Learning_Conference_All_Together/
A rough outline is as follows:
1. What is Artificial Intelligence?
2. What is Reinforcement Learning?
3. What is Artificial General Intelligence?
4. Planning and Learning
5. Safe Reinforcement Learning
This material also explains the paper "Imagination-Augmented Agents for Deep Reinforcement Learning" in detail.
I hope many people find it helpful!
Planning and Learning with Tabular Methods - Dongmin Lee
Hello~!
This material was prepared by Dong-Min Lee.
(http://www.facebook.com/dongminleeai)
It reviews Chapter 8, "Planning and Learning with Tabular Methods", of Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.
The outline is:
1) Introduction
2) Models and Planning
3) Dyna: Integrating Planning, Acting, and Learning (a minimal Dyna-Q sketch follows this outline)
4) When the Model Is Wrong
5) Prioritized Sweeping
6) Expected vs. Sample Updates
7) Trajectory Sampling
8) Planning at Decision Time
9) Heuristic Search
10) Rollout Algorithms
11) Monte Carlo Tree Search
12) Summary
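As a small taste of topic 3, here is a compact, illustrative Dyna-Q step: one direct Q-learning update from a real transition, model learning, and n planning updates replayed from the learned table model. States, actions, and constants are made up for illustration.

import random
from collections import defaultdict

Q = defaultdict(float)                    # tabular action values
model = {}                                # (s, a) -> (r, s')
alpha, gamma, n_planning, actions = 0.1, 0.95, 10, [0, 1]

def q_update(s, a, r, s2):
    best_next = max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_q_step(s, a, r, s2):
    q_update(s, a, r, s2)                 # direct reinforcement learning
    model[(s, a)] = (r, s2)               # model learning
    for _ in range(n_planning):           # planning from simulated experience
        ps, pa = random.choice(list(model))
        pr, ps2 = model[(ps, pa)]
        q_update(ps, pa, pr, ps2)

dyna_q_step(s=0, a=1, r=1.0, s2=1)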
I'm happy to have reviewed the Sutton book~!!!
I hope everyone studying Reinforcement Learning finds this material helpful! :)
Thank you!
Hello~! :)
While studying the Sutton-Barto book, the classic textbook for Reinforcement Learning, I created a PPT about Multi-armed Bandits, covered in Chapter 2.
If there are any mistakes, I would appreciate your feedback.
Thank you.
Hello.
While studying reinforcement learning, I put together a PPT, "An Overview of Reinforcement Learning", for those encountering it for the first time.
I find reinforcement learning, which learns through trial and error just as animals do, quite fascinating within machine learning.
https://www.youtube.com/watch?v=PQtDTdDr8vs&feature=youtu.be
The link above is a video of Skinner's rat experiment.
Thank you.
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learning and Sampling-based Planning
1. PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learning and Sampling-based Planning
IEEE International Conference on Robotics and Automation (ICRA), 2018
Best Paper Award in Service Robotics
Aleksandra Faust et al.
Google Brain Robotics
Presented by Dongmin Lee
December 1, 2019
3. Abstract
PRM-RL (Probabilistic Roadmap-Reinforcement Learning):
• A hierarchical method for long-range navigation tasks
• Combines sampling-based path planning with RL
• Uses feature-based and deep neural net policies (DDPG) in continuous state and action spaces
Experiments: simulation and on-robot evaluation on two end-to-end navigation tasks
• Indoor (differential drive) navigation in office environments - selected
• Aerial cargo delivery in urban environments
5. Introduction
PRM-RL YouTube video
• https://bit.ly/34zCTmd
Traditional Motion Planning (or Path Planning)
• CS287 Advanced Robotics (Fall 2019), Lecture 9: Motion Planning
• https://people.eecs.berkeley.edu/~pabbeel/cs287-fa19/slides/Lec10-motion-planning.pdf
Probabilistic Roadmap (PRM) YouTube videos
• https://bit.ly/34rRKz0
• https://bit.ly/35Nb61Q
Rapidly-exploring Random Tree* (RRT*) YouTube videos
• https://bit.ly/2OXiocb
• https://bit.ly/2OQbUvM
6. Reinforcement Learning
RL provides a formalism for behaviors
• The problem of a goal-directed agent interacting with an uncertain environment
• Interaction → adaptation (feedback & decision)
7-9. Reinforcement Learning
What are the challenges of RL?
• Huge number of samples: millions
• Fast, stable learning
• Hyperparameter tuning
• Exploration
• Sparse reward signals, here due to long-range navigation → solved with hierarchical waypoints
• Safety / reliability
• Simulator
10. Introduction
So, what is the advantage of PRM-RL over traditional methods?
• In PRM-RL, an RL agent is trained to execute a local point-to-point task without knowledge of the topology, learning the task constraints.
• PRM-RL builds a roadmap using the RL agent instead of the traditional collision-free straight-line planner.
• Thus, the resulting long-range navigation planner combines the planning efficiency of a PRM with the robustness of an RL agent.
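To make the second bullet concrete, here is a minimal, runnable sketch of the roadmap-construction idea: candidate nodes are only connected when (stubbed) rollouts of the RL agent reach the neighbor reliably. The agent stub, node counts, and success threshold are illustrative stand-ins, not the paper's actual procedure.

import math
import random

def rl_agent_reaches(a, b):
    """Stub for 'roll out the RL policy from a toward b'; success
    probability decays with distance, standing in for real rollouts."""
    return random.random() < math.exp(-math.dist(a, b))

def build_roadmap(num_nodes=50, attempts=20, threshold=0.9):
    nodes = [(random.uniform(0, 5), random.uniform(0, 5)) for _ in range(num_nodes)]
    edges = []
    for i, a in enumerate(nodes):
        for j, b in enumerate(nodes[i + 1:], start=i + 1):
            successes = sum(rl_agent_reaches(a, b) for _ in range(attempts))
            if successes / attempts >= threshold:      # connect only reliable pairs
                edges.append((i, j))
    return nodes, edges

nodes, edges = build_roadmap()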
12. Methods
Three stages:
1. RL agent training
2. PRM construction (roadmap creation)
3. PRM-RL querying (roadmap querying)
13. Methods
1. RL agent training
Definitions
• S: robot's state space
• s: start state in state space S
• g: goal state in state space S
• C-space: the space of all possible robot configurations (the state space S is a superset of the C-space)
• C-free: the partition of C-space consisting of only collision-free configurations
• L(s): a task predicate that indicates whether the task constraints are satisfied
• p(s): the projection of a state space point onto C-space, which must belong to C-free
The task is completed when the system is sufficiently close to the goal state:
‖p(s) − p(g)‖ ≤ ε
Our goal is to find a transfer function:
s′ = f(s, a)
14. Methods
1. RL agent training
Markov Decision Process (MDP):
• S ⊂ ℝ^{d_S} is the state or observation space of the robot
• s = (g, o): the goal g in polar coordinates and LIDAR observations o
• A ⊂ ℝ^{d_A} is the space of all possible actions that the robot can perform
• a = (v_l, v_r) ∈ ℝ^2: a two-dimensional vector of wheel speeds
• P: S × A × S → [0, 1] is the state transition probability distribution. We assume a simplified black-box simulator, without knowing the full non-linear system dynamics
• R: S → ℝ is a scalar reward. We reward the agent for staying away from obstacles
Our goal is to find a policy π: S → A with π(s) = a: given an observed state s, it returns an action a that the agent should perform to maximize the long-term return:
π*(s) = argmax_π 𝔼[Σ_{t=0}^{∞} γ^t R(s_t)]
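As a tiny aside on the objective above, the discounted return can be accumulated backwards through a reward sequence; the rewards and γ here are made up:

def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g                 # G_t = r_t + gamma * G_{t+1}
    return g

print(discounted_return([0.0, 0.0, 1.0]))  # 0.9801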
16. Methods
1. RL agent training
Training with the DDPG algorithm for the indoor navigation task
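For orientation, here is a heavily simplified sketch of the two DDPG updates (critic regression toward a bootstrapped target, actor ascent on the critic). Network sizes, learning rates, and the observation dimension are illustrative; target networks and the replay buffer, which full DDPG needs, are omitted.

import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2                   # illustrative dimensions
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_step(s, a, r, s2, gamma=0.99):
    # Critic: regress Q(s, a) toward r + gamma * Q(s', pi(s'))
    with torch.no_grad():
        target = r + gamma * critic(torch.cat([s2, actor(s2)], -1))
    critic_loss = ((critic(torch.cat([s, a], -1)) - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: ascend the critic's estimate of Q(s, pi(s))
    actor_loss = -critic(torch.cat([s, actor(s)], -1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

batch = 32
ddpg_step(torch.randn(batch, obs_dim), torch.randn(batch, act_dim),
          torch.randn(batch, 1), torch.randn(batch, obs_dim))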
18. Methods
3. PRM-RL querying (roadmap querying)
Generate long-range trajectories
• We query the roadmap, which returns a list of waypoints to a higher-level planner.
• The higher-level planner then invokes the RL agent to produce a trajectory to the next waypoint.
• When the robot is within the waypoint's goal range, the higher-level planner replaces the goal with the next waypoint in the list.
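This querying stage can be summarized in a few lines; the step function stands in for one action of the trained RL agent, and the goal radius is an illustrative value, not the paper's setting.

import math

def follow_waypoints(step_toward, state, waypoints, goal_radius=0.3, max_steps=1000):
    """Drive through the waypoint list; switch goals once within range."""
    for waypoint in waypoints:
        for _ in range(max_steps):
            if math.dist(state, waypoint) <= goal_radius:
                break                                  # switch to the next waypoint
            state = step_toward(state, waypoint)       # one RL-agent step (stubbed)
    return state

# Toy stand-in agent that moves 10% of the way toward the waypoint each step:
agent = lambda s, w: (s[0] + 0.1 * (w[0] - s[0]), s[1] + 0.1 * (w[1] - s[1]))
final = follow_waypoints(agent, (0.0, 0.0), [(1.0, 1.0), (2.0, 0.0)])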
19. Results
Indoor navigation
1. Roadmap construction evaluation
2. Expected trajectory characteristics
3. Actual trajectory characteristics
4. Physical robot experiments
→ Each roadmap is evaluated on 100 randomly generated queries from the C-free.
20. Results
1. Roadmap construction evaluation
• Higher sampling density produces larger maps and more successful queries.
• The number of nodes in the map does not depend on the local planner, but the number of edges and collision checks do.
• Roadmaps built with the RL local planner are more densely connected, with 15% and 50% more edges.
• The RL agent can go around corners and small obstacles.
21. Results
2. Expected trajectory characteristics
• The RL agent does not require the robot to come to rest at the goal region, so the robot carries some inertia when the waypoint is switched. This causes some of the failures.
• The PRM-RL paths contain more waypoints, except in Building 3.
• Expected trajectory length and duration are longer for the RL agent.
22. Results
3. Actual trajectory characteristics
• We look at the query characteristics for successful versus unsuccessful queries.
• The RL agent produces a higher success rate than PRM-SL (PRM with a straight-line local planner).
• Successful trajectories have fewer waypoints than expected, which means that shorter queries are more likely to succeed.
23. Results
4. Physical robot experiments
• To transfer our approach to a real robot, we created a simple slalom-like environment with four obstacles.