Deep
Reinforcement
Learning for
Goal-directed
Visual
Navigation
Máté Kisantal
2
3
INTRODUCTION
4
Micro Aerial
Vehicles (MAVs)
● Attractive platform
– Low-cost
– Agile
● Requires an operator
● More autonomy needed
for safe, scalable and efficient commercial use
5
MAV Autonomous Flight
● Waypoint-based navigation
● MAV ‘blindly’ follows waypoints
6
Navigation in
cluttered
environments
● Navigation: goal-directed motion + obstacle avoidance
● Obstacle avoidance relies on information about the 3D
environment
● Goal-directed navigation is challenging
– MAVs need to perceive the environment
– React appropriately
7
Forest flight
8
Perceiving the environment: Sensors
● Drones are constrained:
– Low power
– Low weight
– Low cost
● Cameras
– Suitable for MAVs
– Indirect 3D information
9
10
Difficulties of Vision-based Navigation
● Need to understand the 3D structure of the scene
● Interpretation & integration of visual depth cues
– Perspective
– Occlusion
– Shadows
– etc.
11
Vision-based navigation
● Decoupling perception and control
● First: 3D reconstruction
● Then: path planning in the reconstructed scene
12
Integrating perception and control
● Mapping observations to control actions directly
● Bypassing exact state estimation
● Internal representation is not constrained
13
Efficient
Representation
14
DEFINING A
MODULE FOR
GOAL-DIRECTED
NAVIGATION
15
Vision-based
Neural Navigation
Module
● VNNM
● Provides goal-directed navigation capability for MAVs
● Directly from visual inputs
● Easy integration into existing autopilot software stacks
● Re-usability in different missions/contexts
16
VNNM in the
Control Hierarchy
● VNNM makes a trade-off between
– Flying directly towards the goal
– Avoiding obstacles based on
visual input
● ‘intelligent steering’ functionality
17
REINFORCEMENT
LEARNING
18
Basic RL setting
● Agent learns a policy by interacting with an
environment
● Selects actions, receives state and reward
● Goal: maximizing the cumulative reward
● No direct information about the correct action
– just evaluative feedback + exploration
– ‘trial and error learning’
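As a reference, the cumulative reward mentioned above is usually formalized as the discounted return, and the agent's objective is its expected value (standard RL notation, not specific to this work):

    G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1},
    \qquad
    J(\pi) = \mathbb{E}_{\pi}\!\left[ G_0 \right],
    \qquad \gamma \in [0, 1)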
19
Actor-Critic learning
● Actor:
– Learns the policy
– Selects actions
● Critic:
– Learns to predict future rewards
– Tells how good the actions are
– We use this information to get the gradient of the policy (see the update sketched below)
● Asynchronous Advantage Actor-Critic (A3C) algorithm
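For reference, the actor's policy-gradient term in A3C (following Mnih et al., 2016) uses a k-step return as the advantage estimate; this is the standard formulation, not a detail taken from the slides:

    \nabla_{\theta} \log \pi(a_t \mid s_t; \theta)\, A(s_t, a_t),
    \qquad
    A(s_t, a_t) = \sum_{i=0}^{k-1} \gamma^{i} r_{t+i} + \gamma^{k} V(s_{t+k}; \theta_v) - V(s_t; \theta_v)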
20
Everyday RL example – Cooking
● Actions:
– Controlling heat
– Adding ingredients (e.g. salt)
● Observations:
– Smell, consistency, etc.
● Rewards:
– Taste, compliments from others
● Maximizing rewards - becoming a better cook
● Difficulties:
– Delayed rewards
– Only evaluative feedback, the correct action is unknown
– Requires exploration (e.g. trying different spices)
– Terminal states (e.g. food burnt)
21
Formulating the RL
problem - Rewards
● Has to reflect the two competing objectives (an illustrative reward sketch follows below):
– Flying towards the goal
– Avoiding collisions
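One way to encode this trade-off as a per-step reward is to reward progress towards the goal and penalize collisions. The function below is a minimal sketch under that assumption; the reward shape and coefficients are illustrative, not the exact function used in this work.

```python
import numpy as np

def step_reward(prev_pos, pos, goal, collided,
                progress_coef=1.0, collision_penalty=10.0):
    """Illustrative reward balancing goal-seeking and obstacle avoidance.

    Coefficients and shaping are assumptions for illustration only.
    """
    # Progress towards the goal: reduction in Euclidean distance this step.
    progress = np.linalg.norm(prev_pos - goal) - np.linalg.norm(pos - goal)
    reward = progress_coef * progress
    # Large penalty on collision (the episode would typically also terminate).
    if collided:
        reward -= collision_penalty
    return reward
```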
22
Formulating the RL problem – Action space
● Considerations:
– Low-dimensional, discrete action space
– Has to be simple to track
● Constraining the velocity command:
– Horizontal motion only
– Constant speed
● Actions: heading acceleration commands (see the sketch below)
● Output: heading rate
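A minimal sketch of how a discrete heading-acceleration action could be integrated into a constant-speed, horizontal velocity command. The action values, rate limit and time step are assumptions for illustration; only the overall scheme (heading acceleration in, heading rate out) comes from the slide.

```python
import numpy as np

# Discrete heading-acceleration actions in rad/s^2 (values are illustrative).
ACTIONS = np.array([-1.0, 0.0, 1.0])

def update_command(heading, heading_rate, action_idx,
                   dt=0.1, speed=2.0, max_rate=1.0):
    """Integrate a heading-acceleration action into a velocity command."""
    # Integrate heading acceleration into a (saturated) heading rate.
    heading_rate = np.clip(heading_rate + ACTIONS[action_idx] * dt,
                           -max_rate, max_rate)
    heading = heading + heading_rate * dt
    # Constant-speed, horizontal-only velocity command in the world frame.
    v_cmd = speed * np.array([np.cos(heading), np.sin(heading)])
    return heading, heading_rate, v_cmd
```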
23
Formulating the RL problem – Observations
● 84 x 84 pixel image
● Goal direction
● Heading rate state
24
Need for an
efficient policy
representation
● State space is extremely high dimensional
– The image alone is over 21,000 color intensity values (84 x 84 x 3 = 21,168)
● Function approximation is needed
25
Deep Reinforcement Learning
● RL + neural networks (NNs) as function approximators
– Deep learning
– NNs can work directly with raw pixel inputs
26
RL for Robotics:
Challenges
● Training with RL on the real MAV is cumbersome
– Trial-and-error learning, dangerous exploration
– Excessive training times
● Using a virtual training environment
– Faster than real time, safe
– Controllable environment
● Introducing auxiliary learning tasks
– Additional loss signals, more efficient
representation learning
27
AUXILIARY
TRAINING
28
Difficulty of RL
with visual inputs
● High dimensional input
● Learning signal:
– Single scalar reward
– Possibly sparse and
delayed
● The agent needs to ‘learn to see’ based on the
reward function.
● Solution: incorporating other learning signals in the
training
29
Auxiliary training
● Parts of the network are
shared between the tasks
● Making use of the domain
information contained in the
learning signals of other tasks
● Tasks can benefit from the shared
representation if they rely on similar
information
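A common way to realize this is to train the shared parameters on a weighted sum of the main RL loss and the auxiliary losses; the weighting below is the usual formulation, the exact combination used in this work is not specified on the slides:

    \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{A3C}} + \lambda_{\text{aux}} \, \mathcal{L}_{\text{aux}}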
30
Auxiliary Depth Prediction Task
● Depth prediction is closely related to obstacle avoidance
● Simplifying the task:
– Classification into 8 depth bins
– Cropped, low-resolution depth image (see the target-binning sketch below)
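The target construction for such an auxiliary task could look like the sketch below: pool the dense depth image to a low resolution and quantize each cell into one of 8 bins. The grid size, maximum depth and bin edges are assumptions; the central crop mentioned on the slide is omitted for brevity.

```python
import numpy as np

def depth_targets(depth, n_bins=8, out_hw=(4, 16), max_depth=20.0):
    """Convert a dense depth image into coarse per-cell depth-bin labels.

    Grid size, bin edges and max depth are illustrative assumptions; the
    slide only specifies 8 depth bins on a cropped, low-resolution image.
    """
    h, w = depth.shape
    gh, gw = out_hw
    # Average-pool the depth image onto a low-resolution (gh x gw) grid.
    trimmed = depth[: h - h % gh, : w - w % gw]
    pooled = trimmed.reshape(gh, h // gh, gw, w // gw).mean(axis=(1, 3))
    # Quantize each cell into one of n_bins depth classes.
    edges = np.linspace(0.0, max_depth, n_bins + 1)[1:-1]
    labels = np.digitize(np.clip(pooled, 0.0, max_depth), edges)
    return labels  # integer targets for a cross-entropy auxiliary loss
```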
31
Neural Architecture
● VNNM module
● + auxiliary depth network
32
SIMULATED FOREST
ENVIRONMENT
33
Simulator fidelity
considerations
● Simulator may introduce domain discrepancy
● Physical fidelity
– VNNM operates on top of feedback control loops
– Simple motion model can be sufficient
● Visual fidelity
– More difficult to achieve
– Closing the ‘reality-gap’ with high quality graphics
– (Domain adaptation techniques)
34
Using a game
engine
● Making use of the efforts of the computer game
industry
● Capturing the complexity of the real environment
35
36
Simulator
properties
● 229 trees
● 100 x 100 m area
● Average collision-free
flight distance: 23.4 m
● Access to the simulator:
– UnrealCV open-source
plugin
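For context, interacting with the environment through the UnrealCV plugin could look roughly like the sketch below. Camera id 0 and the 'lit'/'depth' view modes follow the UnrealCV documentation, but the available commands and their return formats (raw bytes vs. a saved-file path) vary between plugin versions and project setups.

```python
import io
import numpy as np
from PIL import Image
from unrealcv import client  # module-level UnrealCV client

def grab_frame():
    """Fetch one RGB frame and one depth map from a running UnrealCV server."""
    if not client.isconnected():
        client.connect()
    # Request a PNG-encoded camera image and a numpy-encoded depth map.
    rgb_bytes = client.request('vget /camera/0/lit png')
    depth_bytes = client.request('vget /camera/0/depth npy')
    rgb = np.array(Image.open(io.BytesIO(rgb_bytes)))
    depth = np.load(io.BytesIO(depth_bytes))
    return rgb, depth
```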
37
TRAINING,
EVALUATION
AND RESULTS
38
Training – Learning
Curves
39
Training -
Progress
40
Training -
Progress
41
Successful Trial
https://www.youtube.com/watch?v=ud6PSePb3NE&index=4&list=UUR26roGXGSqacmJWSoP6yBg
42
Unsuccessful
Trial
https://www.youtube.com/watch?v=56XR6nu00Y0&index=3&list=UUR26roGXGSqacmJWSoP6yBg
43
Evaluation
● Randomly sampled goal and start locations (sampling sketched below)
● Start and goal separated by 60 meters
● Baseline policy: straight flying (SF)
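A minimal sketch of how such evaluation episodes could be drawn: rejection-sample a start point and a goal a fixed distance away inside the flight area. The 100 x 100 m area matches the simulator slide; the sampling procedure itself is an assumption for illustration.

```python
import numpy as np

def sample_episode(area=100.0, separation=60.0, rng=np.random):
    """Sample a random start/goal pair a fixed distance apart inside the area."""
    while True:
        start = rng.uniform(0.0, area, size=2)
        angle = rng.uniform(0.0, 2.0 * np.pi)
        goal = start + separation * np.array([np.cos(angle), np.sin(angle)])
        # Keep only goals that fall inside the square flight area.
        if np.all((goal >= 0.0) & (goal <= area)):
            return start, goal
```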
44
Quantitative evaluation

                        VNNM      SF (straight flying)
Success rate            43%       16%
Avg. flight distance    31.7 m    21.3 m
45
Qualitative evaluation I.
● Success:
– A
– B
● Failure:
– C
– D
46
Qualitative evaluation II.
● Changing goal directions
● Successful tracking
47
Depth Prediction
Evaluation
48
Depth Prediction
Evaluation
● Only close trees detected
● Directly in front of the agent
● Possible explanation:
– These are the most relevant
for obstacle avoidance!
– This capability is trained by
both the auxiliary and the
main task.
49
CONCLUSIONS
&
FURTHER WORK
50
Conclusions
● We introduced the VNNM, which can extend the autonomous capabilities of MAVs
● We demonstrated its capabilities in a realistic
virtual environment
● Obstacle avoidance performance did not meet our
initial expectations
● Training is not stable enough
51
Possible future
space
applications
Deep
Reinforcement
Learning for
Goal-directed
Visual
Navigation
Máté Kisantal
53
Next step: Domain
Adaptation
● Addressing the reality gap
● Examples:
– Domain randomization
– Domain adversarial training
● Training auxiliary tasks on real datasets
54
Domain-Adversarial Training of Neural Networks
(Ganin et al. 2016)
55
Domain Randomization for Transferring Deep Neural
Networks from Simulation to the Real World
(Tobin et al. 2017)
56
Photo Credits
● Slide 4: 3D Robotics
● Slide 6: Wikipedia, Peter Bond, Sebastian Kasten, Andreas Praefcke,
CC BY-SA 3.0
● Slide 8: Wikipedia, CC BY-SA 3.0, C-M
● Slide 9: Wikipedia, public domain
● Slide 10: Jordan Savoff
● Slide 14: Parrot Inc.
● Slide 22: Nebulux, Flickr
● Slide 38: @koola_ue4, Twitter
● Slide 38: Maxime Lafleur, IntentUAV


Editor's Notes

  • #19 No direct supervision, only evaluative feedback. The feedback can be delayed. The agent influences what future input it’s going to receive. Particularly flexible learning paradigm for learning interaction.