Deep parking:
an implementation of automatic parking
with deep reinforcement learning
Shintaro Shiba, Feb. 2016 - Dec. 2016
Engineering internship at Preferred Networks
Mentor: Abe-san, Fujita-san
1
About me
Shintaro Shiba
• Graduate student at the University of
Tokyo
– Major in neuroscience and animal behavior
• Part-time engineer (internship) at
Preferred Networks, Inc.
– Blog post URL:
https://research.preferred.jp/2017/03/deep-parking/
2
Contents
• Original Idea
• Background: DQN and Double-DQN
• Task definition
– Environment: car simulator
– Agents
1. Coordinate
2. Bird's-eye view
3. Subjective view
• Discussion
• Summary
3
Achievement
Figure: trajectory of the car agent, and the subjective views (0 deg, -120 deg, +120 deg) used as input for the DQN
4
Original Idea: DQN for parking
https://research.preferred.jp/2016/01/ces2016/
https://research.preferred.jp/2015/06/distributed-deep-reinforcement-learning/
Succeeded in driving smoothly with DQN
Input: 32 virtual sensors, 3 previous actions + Current speed and steering
Output: 9 actions
Can a car agent learn to park itself using camera images as input?
5
Reinforcement learning
Figure: the agent observes the state and reward from the environment and outputs an action; the learning algorithm updates the agent from this loop
6
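As a reference, here is a minimal sketch of this interaction loop in Python. The `env` and `agent` objects and their methods are hypothetical stand-ins for the car simulator and the DQN agent described later, not actual interfaces from this project.

```python
# Minimal sketch of the agent-environment loop (hypothetical env/agent interfaces).
def run_episode(env, agent, max_steps=500):
    state = env.reset()                                   # initial state from the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                         # agent chooses an action from the state
        next_state, reward, done = env.step(action)       # environment returns next state and reward
        agent.observe(state, action, reward, next_state, done)  # learning-algorithm update hook
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```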
DQN: Deep Q-Network
Volodymyr Mnih et al. 2015
Figure: the DQN training loop: for each episode, for each action step, the Q function is updated
7
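For reference, a sketch of the one-step DQN target from Mnih et al. 2015, written with NumPy. The `q_target_net` callable and the `batch` layout are assumed interfaces, not code from this project.

```python
import numpy as np

def dqn_targets(batch, q_target_net, gamma=0.97):
    """Standard DQN targets: r + gamma * max_a' Q_target(s', a') for non-terminal steps.

    `batch` is a dict of NumPy arrays ("rewards", "dones" as 0/1 floats, "next_states");
    `q_target_net(states)` is assumed to return an (N, n_actions) array of Q-values
    from the periodically copied target network.
    """
    q_next = q_target_net(batch["next_states"])           # (N, n_actions)
    max_q_next = q_next.max(axis=1)                       # greedy value of the next state
    return batch["rewards"] + gamma * (1.0 - batch["dones"]) * max_q_next
```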
Double DQN
Preventing overestimation of Q values
Hado van Hasselt et al. 2015
8
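Double DQN (van Hasselt et al. 2015) reduces overestimation by decoupling action selection from action evaluation: the online network picks the argmax action, the target network evaluates it. Sketch below, with the same assumed interfaces as the previous snippet.

```python
import numpy as np

def double_dqn_targets(batch, q_online_net, q_target_net, gamma=0.97):
    """Double DQN: select a' with the online network, evaluate it with the target network."""
    a_star = q_online_net(batch["next_states"]).argmax(axis=1)   # action selection (online net)
    q_next = q_target_net(batch["next_states"])                  # action evaluation (target net)
    q_eval = q_next[np.arange(len(a_star)), a_star]              # Q_target(s', argmax_a Q_online)
    return batch["rewards"] + gamma * (1.0 - batch["dones"]) * q_eval
```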
Reinforcement learning in this project
Figure: the same loop as before. Environment: the car simulator. Agent: a different sensor + neural network for each experiment. State = sensor input; the agent outputs an action and receives a reward.
9
Environment:
Car simulator
Forces acting on the car:
• Traction
• Air resistance
• Rolling resistance
• Centrifugal force
• Brake
• Cornering force
F = Ftraction + Faero + Frr + Fc + Fbrake + Fcf
10
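To illustrate how such a force model can drive a simulation step, here is a rough sketch of one longitudinal update. It covers only the traction, aerodynamic, rolling-resistance, and brake terms; all coefficients are made-up placeholders, not the values or formulas of the actual simulator.

```python
# Rough sketch of one longitudinal physics step; coefficients are illustrative placeholders only.
def step_speed(speed, throttle, brake, dt=0.05,
               mass=1200.0, c_drag=0.4257, c_rr=12.8, f_engine=4000.0, f_brake=6000.0):
    f_traction = throttle * f_engine               # driving force from the engine
    f_aero = -c_drag * speed * abs(speed)          # air resistance, opposes motion
    f_rr = -c_rr * speed                           # rolling resistance
    f_br = -f_brake * brake if speed > 0 else 0.0  # braking force while moving forward
    f_total = f_traction + f_aero + f_rr + f_br    # sum of longitudinal forces
    accel = f_total / mass                         # Newton's second law
    return speed + accel * dt                      # integrate speed forward by dt
```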
Common specifications:
state, action, reward
Input (States)
– Features specific to each agent + car speed, car steering
Output (Actions)
– 9: accelerate, decelerate, steer right, steer left, throw (do
nothing), accelerate + steer right, accelerate + steer left,
decelerate + steer right, decelerate + steer left
Reward
– +1 when the car is in the goal
– -1 when the car is out of the field
– 0.01 - 0.01 * distance_to_goal otherwise (changed afterward)
Goal
– Car inside the goal region; no other conditions such as car heading
Terminate
– Time up: 500 action steps (later changed to 450)
– Field out: the car leaves the field
11
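The original reward from this slide, written as a sketch. `in_goal`, `out_of_field`, and `distance_to_goal` are assumed helper functions, not code from the project; the values are the ones given above.

```python
# Sketch of the original reward from this slide (later modified, see slide 29).
# `in_goal`, `out_of_field`, and `distance_to_goal` are assumed helper functions.
def reward_v1(car, goal, field):
    if in_goal(car, goal):
        return 1.0                                        # reached the goal region
    if out_of_field(car, field):
        return -1.0                                       # left the field
    return 0.01 - 0.01 * distance_to_goal(car, goal)      # shaped per-step reward otherwise
```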
Common specifications:
hyperparameters
Maximum episodes: 50,000
Gamma: 0.97
Optimizer: RMSpropGraves
– lr=0.00015, alpha=0.95, momentum=0.95,
eps=0.01
– changed afterward: lr=0.00015, alpha=0.95,
momentum=0, eps=0.01
Batch size: 50 or 64
Epsilon: linearly annealed from 1.0 at the start to 0.1 at the end
12
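The epsilon schedule above (linear anneal from 1.0 to 0.1) as a sketch. The slide gives only the endpoints; the number of anneal steps is an assumed placeholder.

```python
def epsilon_at(step, start=1.0, end=0.1, anneal_steps=1_000_000):
    """Linearly annealed epsilon for epsilon-greedy exploration.

    Endpoints (1.0 -> 0.1) follow the slide; `anneal_steps` is an assumption.
    """
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)
```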
Agents
1. Coordinate
2. Bird’s-eye view
3. Subjective view
– Three cameras
– Four cameras
13
Coordinate agent
Input features
– Relative coordinates from the car to the goal
– Input shape: (2,), normalized
Figure: the car and the goal on the field, with an example relative coordinate of (80, 300)
14
Coordinate agent
Neural Network
– Fully-connected layers only (3 layers)
Figure: inputs are the coordinates (2) and car parameters (2); two hidden layers of 64 units each; output is the number of actions (9)
15
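A sketch of this fully-connected Q-network in Chainer (the framework behind PFN's DQN examples): input = relative coordinates (2) + car parameters (2), two hidden layers of 64 units, output = 9 actions. Layer widths follow the slide; the activation choice and the Chainer v2-style definition are assumptions.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class CoordinateQNet(chainer.Chain):
    """3 fully-connected layers: (2 coords + 2 car params) -> 64 -> 64 -> 9 actions."""
    def __init__(self, n_actions=9):
        super().__init__()
        with self.init_scope():
            self.fc1 = L.Linear(4, 64)
            self.fc2 = L.Linear(64, 64)
            self.fc3 = L.Linear(64, n_actions)

    def __call__(self, x):              # x: (batch, 4) normalized inputs
        h = F.relu(self.fc1(x))
        h = F.relu(self.fc2(h))
        return self.fc3(h)              # Q-value for each of the 9 actions
```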
Coordinate agent
Result
16
Bird’s-eye view agent
Input features
– Bird’s-eye image of the whole field
– Input size: 80 x 80, normalized
17
Bird’s-eye view agent
Neural Network
Figure: convolutional network taking the 80 x 80 bird's-eye image as input; layer sizes include 128, 192, 400, and 64 units, with the car parameters (2) also fed in; output = number of actions
18
Bird’s-eye view agent
Neural Network
Figure: the same network diagram as the previous slide
19
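Only some layer sizes of this diagram are recoverable (80 x 80 input; 128, 192, 400, 64), so the following Chainer sketch is purely illustrative: kernel sizes, strides, and where the two car parameters are concatenated are assumptions, not the project's actual architecture.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class BirdsEyeQNet(chainer.Chain):
    """Illustrative CNN: 80x80 image -> conv layers -> FC layers, with 2 car params appended."""
    def __init__(self, n_actions=9):
        super().__init__()
        with self.init_scope():
            self.conv1 = L.Convolution2D(1, 128, ksize=8, stride=4)    # 128/192 from the slide;
            self.conv2 = L.Convolution2D(128, 192, ksize=4, stride=2)  # kernel/stride are assumptions
            self.fc1 = L.Linear(None, 400)
            self.fc2 = L.Linear(400 + 2, 64)       # concatenate speed and steering here (assumed)
            self.out = L.Linear(64, n_actions)

    def __call__(self, image, car_params):         # image: (batch, 1, 80, 80), car_params: (batch, 2)
        h = F.relu(self.conv1(image))
        h = F.relu(self.conv2(h))
        h = F.relu(self.fc1(h))
        h = F.relu(self.fc2(F.concat((h, car_params), axis=1)))
        return self.out(h)
```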
Bird’s-eye view agent
Result: 18k episodes
20
Bird’s-eye view agent
Result: after 18k episodes?
But we had already spent about six months on this agent, so we moved on to the next one.
21
Subjective view agent
Input features
– N_of_camera subjective-view images from the car
– Number of cameras: three or four
– FoV = 120 deg per camera
Figure: example input images for the four-camera agent: front (+0 deg), right (+90 deg), back (+180 deg), left (+270 deg)
22
Subjective view agent
Neural Network
Figure: convolutional network taking the 80 x 80 camera images as input; layer sizes include 200 x 3, 400, 256, and 64 units, with the car parameters (2) also fed in; output = number of actions
23
Subjective view agent
Neural Network
Figure: the same network diagram as the previous slide
24
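One plausible way to feed several camera images into a single CNN is to stack them along the channel axis; the slides do not say whether the actual network used channel stacking or per-camera branches, so the NumPy sketch below is only an illustration, assuming 80 x 80 RGB frames.

```python
import numpy as np

def stack_camera_images(frames):
    """Stack N camera frames of shape (80, 80, 3) into one (N*3, 80, 80) CNN input.

    N = 3 or 4 cameras gives 9 or 12 channels. This channel-stacking layout is an
    assumption, not necessarily how the project's network consumed the images.
    """
    chw = [np.transpose(f, (2, 0, 1)) for f in frames]              # HWC -> CHW per camera
    return np.concatenate(chw, axis=0).astype(np.float32) / 255.0   # normalize to [0, 1]
```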
Subjective view agent
Problem
– Calculation time (GeForce GTX TITAN X)
• At first: 3 [min/ep] x 50k [ep] ≈ 100 days
• After Abe-san's review: 1.6 [min/ep] x 50k [ep] ≈ 55 days
– The bottleneck was copying and synchronization between the GPU and CPU
– Learning was interrupted whenever the DNN output diverged
– (Fortunately) the agent "learned" to reach the goal within ~10k episodes in some trials
– Memory usage
• DQN needs to store the last 1M inputs for experience replay
– 1M x (80 x 80 x 3 ch x 4 cameras)
• Workaround: save the images to disk and load them on each access (see the sketch below)
25
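A minimal sketch of that disk-backed workaround: replay frames are written as files and re-loaded when a minibatch is sampled. The file layout, class name, and capacity handling here are made up for illustration.

```python
import os
import numpy as np

class DiskReplayImages:
    """Minimal sketch of keeping replay-memory frames on disk instead of RAM."""
    def __init__(self, directory, capacity=1_000_000):
        self.directory = directory
        self.capacity = capacity
        self.count = 0
        os.makedirs(directory, exist_ok=True)

    def store(self, frame):
        idx = self.count % self.capacity                      # overwrite the oldest frame when full
        np.save(os.path.join(self.directory, f"{idx}.npy"), frame)
        self.count += 1
        return idx

    def load(self, idx):
        return np.load(os.path.join(self.directory, f"{idx}.npy"))
```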
Subjective view agent
Result: three cameras, 6k episodes
Figure: trajectory of the car agent, and the subjective views (0 deg, -120 deg, +120 deg) used as input for the DQN
26
Subjective view agent
Result: three cameras, 50k episodes
Seems to learn a "move anyway" policy? >> revise the reward setting
Does not seem able to reach the goal every time; only "easy" goals are achieved >> vary the task difficulty (curriculum)
Figure: goals are reached frequently in this region
27
Subjective view agent
Result: four cameras, 30k episodes
28
Modify reward
Previous
– +1 when the car is in the goal
– -1 when the car is out of the field
– 0.01 - 0.01 * distance_to_goal otherwise
New
– +1 - speed when the car is in the goal
• in order to stop the car
– -1 when the car is out of the field
– -0.005 otherwise (a small constant step penalty; see the sketch below)
29
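The modified reward as a sketch, mirroring the earlier one. The same helpers are assumed, and `car.speed` is an assumed attribute; whether the speed term needs scaling is not specified on the slide.

```python
# Sketch of the modified reward (this slide); helper functions are assumed as before.
def reward_v2(car, goal, field):
    if in_goal(car, goal):
        return 1.0 - car.speed      # subtract speed so the agent learns to stop inside the goal
    if out_of_field(car, field):
        return -1.0
    return -0.005                   # small constant step penalty instead of distance shaping
```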
Modify difficulty
Difficulty: initial car direction & position
– Constraint
• Car always starts near the middle of the field
• Car always starts facing toward the center, within ±π/4
– Curriculum (a sketch follows after this slide)
• Car direction: within ±(π/12)·n of facing the center, where n = curriculum level
• Criterion for advancing: mean reward of 0.6 over 100 episodes
Figure: the goal and the starting-direction ranges for curriculum levels n = 1 and n = 2
30
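A sketch of the curriculum described above: the initial heading is sampled within ±(π/12)·n of facing the field center, and the level advances once the mean reward over the last 100 episodes exceeds 0.6. Function names and the sampling details are illustrative.

```python
import math
import random

def sample_initial_direction(angle_to_center, n):
    """Curriculum level n: start heading within +/- (pi/12)*n of facing the field center."""
    half_range = (math.pi / 12.0) * n
    return angle_to_center + random.uniform(-half_range, half_range)

def should_advance(recent_rewards, threshold=0.6, window=100):
    """Advance to the next level when the mean reward over the last 100 episodes reaches 0.6."""
    if len(recent_rewards) < window:
        return False
    return sum(recent_rewards[-window:]) / window >= threshold
```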
Subjective view agent:
modifications
N cameras | Reward   | Difficulty | Learning result
3         | Default  | Default    | about 6k: o, 50k: x
3         | Modified | Default    | about 16k: o
3         | Modified | Constraint | ? (still learning)
3         | Modified | Curriculum | o (though only curriculum level 1 so far)
4         | Default  | Default    | x
4         | Modified | Curriculum | △ (not bad, but not yet successful at 6k)
31
Subjective view agent:
modifications
Curriculum + three cameras
At curriculum level 1: the advancement criterion needs to be modified
Figure: mean reward (0.0 to 1.0) and reward sum (0 to 500) plotted against episode number (0 to 20k)
32
Discussion
1. The initial settings included situations where the car cannot reach the goal
– e.g. starting headed toward the edge of the field
– This made learning unstable
2. Why was the coordinate agent successful despite the same kind of situations?
33
Discussion
3. Comparison between three and four cameras
– Considering success rate and execution time, three cameras are better
– Why was the four-camera agent not successful? Would it need several more trials?
4. DQN often diverged
– Roughly one run in three, as a subjective impression
• Slightly more often with four cameras
– The training data matters for learning
• Replay memory size, batch size
34
Discussion
5. Curriculum
– Ideally, the "difficulty of the task" should be quantified
• In this case, it may be roughly represented by the "bias of the distribution" of the selected actions (a sketch follows after this slide):
accelerate
decelerate
throw (do nothing)
steer right
steer left
accelerate + steer right
accelerate + steer left
decelerate + steer right
decelerate + steer left
Each action selected about equally often >> the car goes straight
A biased distribution of selected actions >> the car turns right/left
35
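One simple way to quantify that "bias of the distribution" is the normalized entropy of the selected actions: uniform selection (roughly going straight) gives zero bias, while a strongly skewed distribution (mostly turning) gives a value near one. This is only one possible metric, not something proposed on the slides.

```python
import math
from collections import Counter

def action_distribution_bias(actions, n_actions=9):
    """Return 1 - normalized entropy of the selected actions.

    0 = uniform (unbiased) selection; values near 1 = heavily biased toward a few
    actions. A possible proxy for the "difficulty" of an episode's maneuver.
    """
    counts = Counter(actions)
    total = len(actions)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return 1.0 - entropy / math.log(n_actions)
```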
Summary
• The car agent can park itself using subjective camera views, though learning is not always stable
• There is a trade-off between reward design and learning difficulty
– Simple reward: difficult to learn
• Try other algorithms such as A3C
– Complex reward: difficult to design
• Try other settings for distance_to_goal
36

Editor's Notes

  • #21 Should the learning rate be made even smaller? Consider A3C.
  • #33 Plot the mean rather than the raw line, or plot points. Consider TRPO.