Deep parking:
an implementation of automatic parking
with deep reinforcement learning
Shintaro Shiba, Feb. 2016 - Dec. 2016
Engineering internship at Preferred Networks
Mentor: Abe-san, Fujita-san
1
About me
Shintaro Shiba
• Graduate student at the University of
Tokyo
– Major in neuroscience and animal behavior
• Part-time engineer (internship) at
Preferred Networks, Inc.
– Blog post URL:
https://research.preferred.jp/2017/03/deep-parking/
2
Contents
• Original Idea
• Background: DQN and Double-DQN
• Task definition
– Environment: car simulator
– Agents
1. Coordinate
2. Bird's-eye view
3. Subjective view
• Discussion
• Summary
3
Achievement
Figure: trajectory of the car agent, and the subjective views (0 deg, -120 deg, +120 deg) used as input for the DQN
4
Original Idea: DQN for parking
https://research.preferred.jp/2016/01/ces2016/
https://research.preferred.jp/2015/06/distributed-deep-reinforcement-learning/
Succeeded in driving smoothly with DQN
Input: 32 virtual sensors, 3 previous actions + Current speed and steering
Output: 9 actions
Can a car agent learn to park itself using camera images as input?
5
Reinforcement learning
Figure: the agent observes the state and reward from the environment and outputs an action; the learning algorithm updates the agent from this loop
6
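As a reference, here is a minimal sketch of this interaction loop in Python. The `env` and `agent` objects and their methods are hypothetical stand-ins for the car simulator and the DQN agent described later, not actual interfaces from this project.

```python
# Minimal sketch of the agent-environment loop (hypothetical env/agent interfaces).
def run_episode(env, agent, max_steps=500):
    state = env.reset()                                   # initial state from the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                         # agent chooses an action from the state
        next_state, reward, done = env.step(action)       # environment returns next state and reward
        agent.observe(state, action, reward, next_state, done)  # learning-algorithm update hook
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```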
DQN: Deep Q-Network
Volodymyr Mnih et al. 2015
Figure: the DQN training loop: for each episode, for each action step, the Q function is updated
7
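For reference, a sketch of the one-step DQN target from Mnih et al. 2015, written with NumPy. The `q_target_net` callable and the `batch` layout are assumed interfaces, not code from this project.

```python
import numpy as np

def dqn_targets(batch, q_target_net, gamma=0.97):
    """Standard DQN targets: r + gamma * max_a' Q_target(s', a') for non-terminal steps.

    `batch` is a dict of NumPy arrays ("rewards", "dones" as 0/1 floats, "next_states");
    `q_target_net(states)` is assumed to return an (N, n_actions) array of Q-values
    from the periodically copied target network.
    """
    q_next = q_target_net(batch["next_states"])           # (N, n_actions)
    max_q_next = q_next.max(axis=1)                       # greedy value of the next state
    return batch["rewards"] + gamma * (1.0 - batch["dones"]) * max_q_next
```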
Double DQN
Preventing overestimation of Q values
Hado van Hasselt et al. 2015
8
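Double DQN (van Hasselt et al. 2015) reduces overestimation by decoupling action selection from action evaluation: the online network picks the argmax action, the target network evaluates it. Sketch below, with the same assumed interfaces as the previous snippet.

```python
import numpy as np

def double_dqn_targets(batch, q_online_net, q_target_net, gamma=0.97):
    """Double DQN: select a' with the online network, evaluate it with the target network."""
    a_star = q_online_net(batch["next_states"]).argmax(axis=1)   # action selection (online net)
    q_next = q_target_net(batch["next_states"])                  # action evaluation (target net)
    q_eval = q_next[np.arange(len(a_star)), a_star]              # Q_target(s', argmax_a Q_online)
    return batch["rewards"] + gamma * (1.0 - batch["dones"]) * q_eval
```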
Reinforcement learning in this project
Figure: the same loop as before. Environment: the car simulator. Agent: a different sensor + neural network for each experiment. State = sensor input; the agent outputs an action and receives a reward.
9
Environment:
Car simulator
Forces acting on the car:
• Traction
• Air resistance
• Rolling resistance
• Centrifugal force
• Brake
• Cornering force
F = Ftraction + Faero + Frr + Fc + Fbrake + Fcf
10
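To illustrate how such a force model can drive a simulation step, here is a rough sketch of one longitudinal update. It covers only the traction, aerodynamic, rolling-resistance, and brake terms; all coefficients are made-up placeholders, not the values or formulas of the actual simulator.

```python
# Rough sketch of one longitudinal physics step; coefficients are illustrative placeholders only.
def step_speed(speed, throttle, brake, dt=0.05,
               mass=1200.0, c_drag=0.4257, c_rr=12.8, f_engine=4000.0, f_brake=6000.0):
    f_traction = throttle * f_engine               # driving force from the engine
    f_aero = -c_drag * speed * abs(speed)          # air resistance, opposes motion
    f_rr = -c_rr * speed                           # rolling resistance
    f_br = -f_brake * brake if speed > 0 else 0.0  # braking force while moving forward
    f_total = f_traction + f_aero + f_rr + f_br    # sum of longitudinal forces
    accel = f_total / mass                         # Newton's second law
    return speed + accel * dt                      # integrate speed forward by dt
```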
Common specifications:
state, action, reward
Input (States)
– Features specific to each agent + car speed, car steering
Output (Actions)
– 9: accelerate, decelerate, steer right, steer left, throw (do
nothing), accelerate + steer right, accelerate + steer left,
decelerate + steer right, decelerate + steer left
Reward
– +1 when the car is in the goal
– -1 when the car is out of the field
– 0.01 - 0.01 * distance_to_goal otherwise (changed afterward)
Goal
– Car inside the goal region; no other conditions such as car heading
Terminate
– Time up: 500 action steps (later changed to 450)
– Field out: the car leaves the field
11
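The original reward from this slide, written as a sketch. `in_goal`, `out_of_field`, and `distance_to_goal` are assumed helper functions, not code from the project; the values are the ones given above.

```python
# Sketch of the original reward from this slide (later modified, see slide 29).
# `in_goal`, `out_of_field`, and `distance_to_goal` are assumed helper functions.
def reward_v1(car, goal, field):
    if in_goal(car, goal):
        return 1.0                                        # reached the goal region
    if out_of_field(car, field):
        return -1.0                                       # left the field
    return 0.01 - 0.01 * distance_to_goal(car, goal)      # shaped per-step reward otherwise
```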
Common specifications:
hyperparameters
Maximum episodes: 50,000
Gamma: 0.97
Optimizer: RMSpropGraves
– lr=0.00015, alpha=0.95, momentum=0.95,
eps=0.01
– changed afterward: lr=0.00015, alpha=0.95,
momentum=0, eps=0.01
Batch size: 50 or 64
Epsilon: linearly annealed from 1.0 at the start to 0.1 at the end
12
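The epsilon schedule above (linear anneal from 1.0 to 0.1) as a sketch. The slide gives only the endpoints; the number of anneal steps is an assumed placeholder.

```python
def epsilon_at(step, start=1.0, end=0.1, anneal_steps=1_000_000):
    """Linearly annealed epsilon for epsilon-greedy exploration.

    Endpoints (1.0 -> 0.1) follow the slide; `anneal_steps` is an assumption.
    """
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)
```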
Agents
1. Coordinate
2. Bird’s-eye view
3. Subjective view
– Three cameras
– Four cameras
13
Coordinate agent
Input features
– Relative coordinates from the car to the goal
– Input shape: (2,), normalized
Figure: the car and the goal on the field, with an example relative coordinate of (80, 300)
14
Coordinate agent
Neural Network
– Fully-connected layers only (3 layers)
Figure: inputs are the coordinates (2) and car parameters (2); two hidden layers of 64 units each; output is the number of actions (9)
15
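A sketch of this fully-connected Q-network in Chainer (the framework behind PFN's DQN examples): input = relative coordinates (2) + car parameters (2), two hidden layers of 64 units, output = 9 actions. Layer widths follow the slide; the activation choice and the Chainer v2-style definition are assumptions.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class CoordinateQNet(chainer.Chain):
    """3 fully-connected layers: (2 coords + 2 car params) -> 64 -> 64 -> 9 actions."""
    def __init__(self, n_actions=9):
        super().__init__()
        with self.init_scope():
            self.fc1 = L.Linear(4, 64)
            self.fc2 = L.Linear(64, 64)
            self.fc3 = L.Linear(64, n_actions)

    def __call__(self, x):              # x: (batch, 4) normalized inputs
        h = F.relu(self.fc1(x))
        h = F.relu(self.fc2(h))
        return self.fc3(h)              # Q-value for each of the 9 actions
```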
Coordinate agent
Result
16
Bird’s-eye view agent
Input features
– Bird’s-eye image of the whole field
– Input size: 80 x 80, normalized
17
Bird’s-eye view agent
Neural Network
Figure: convolutional network taking the 80 x 80 bird's-eye image as input; layer sizes include 128, 192, 400, and 64 units, with the car parameters (2) also fed in; output = number of actions
18
Bird’s-eye view agent
Neural Network
Figure: the same network diagram as the previous slide
19
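Only some layer sizes of this diagram are recoverable (80 x 80 input; 128, 192, 400, 64), so the following Chainer sketch is purely illustrative: kernel sizes, strides, and where the two car parameters are concatenated are assumptions, not the project's actual architecture.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class BirdsEyeQNet(chainer.Chain):
    """Illustrative CNN: 80x80 image -> conv layers -> FC layers, with 2 car params appended."""
    def __init__(self, n_actions=9):
        super().__init__()
        with self.init_scope():
            self.conv1 = L.Convolution2D(1, 128, ksize=8, stride=4)    # 128/192 from the slide;
            self.conv2 = L.Convolution2D(128, 192, ksize=4, stride=2)  # kernel/stride are assumptions
            self.fc1 = L.Linear(None, 400)
            self.fc2 = L.Linear(400 + 2, 64)       # concatenate speed and steering here (assumed)
            self.out = L.Linear(64, n_actions)

    def __call__(self, image, car_params):         # image: (batch, 1, 80, 80), car_params: (batch, 2)
        h = F.relu(self.conv1(image))
        h = F.relu(self.conv2(h))
        h = F.relu(self.fc1(h))
        h = F.relu(self.fc2(F.concat((h, car_params), axis=1)))
        return self.out(h)
```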
Bird’s-eye view agent
Result: 18k episodes
20
Bird’s-eye view agent
Result: after 18k episodes?
But we had already spent about six months on this agent, so we moved on to the next one.
21
Subjective view agent
Input features
– N_of_camera subjective-view images from the car
– Number of cameras: three or four
– FoV = 120 deg per camera
Figure: example input images for the four-camera agent: front (+0 deg), right (+90 deg), back (+180 deg), left (+270 deg)
22
Subjective view agent
Neural Network
Figure: convolutional network taking the 80 x 80 camera images as input; layer sizes include 200 x 3, 400, 256, and 64 units, with the car parameters (2) also fed in; output = number of actions
23
Subjective view agent
Neural Network
Figure: the same network diagram as the previous slide
24
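One plausible way to feed several camera images into a single CNN is to stack them along the channel axis; the slides do not say whether the actual network used channel stacking or per-camera branches, so the NumPy sketch below is only an illustration, assuming 80 x 80 RGB frames.

```python
import numpy as np

def stack_camera_images(frames):
    """Stack N camera frames of shape (80, 80, 3) into one (N*3, 80, 80) CNN input.

    N = 3 or 4 cameras gives 9 or 12 channels. This channel-stacking layout is an
    assumption, not necessarily how the project's network consumed the images.
    """
    chw = [np.transpose(f, (2, 0, 1)) for f in frames]              # HWC -> CHW per camera
    return np.concatenate(chw, axis=0).astype(np.float32) / 255.0   # normalize to [0, 1]
```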
Subjective view agent
Problem
– Calculation time (GeForce GTX TITAN X)
• At first: 3 [min/ep] x 50k [ep] ≈ 100 days
• After Abe-san's review: 1.6 [min/ep] x 50k [ep] ≈ 55 days
– The bottleneck was copying and synchronization between the GPU and CPU
– Learning was interrupted whenever the DNN output diverged
– (Fortunately) the agent "learned" to reach the goal within ~10k episodes in some trials
– Memory usage
• DQN needs to store the last 1M inputs for experience replay
– 1M x (80 x 80 x 3 ch x 4 cameras)
• Workaround: save the images to disk and load them on each access (see the sketch below)
25
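A minimal sketch of that disk-backed workaround: replay frames are written as files and re-loaded when a minibatch is sampled. The file layout, class name, and capacity handling here are made up for illustration.

```python
import os
import numpy as np

class DiskReplayImages:
    """Minimal sketch of keeping replay-memory frames on disk instead of RAM."""
    def __init__(self, directory, capacity=1_000_000):
        self.directory = directory
        self.capacity = capacity
        self.count = 0
        os.makedirs(directory, exist_ok=True)

    def store(self, frame):
        idx = self.count % self.capacity                      # overwrite the oldest frame when full
        np.save(os.path.join(self.directory, f"{idx}.npy"), frame)
        self.count += 1
        return idx

    def load(self, idx):
        return np.load(os.path.join(self.directory, f"{idx}.npy"))
```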
Subjective view agent
Result: three cameras, 6k episodes
Figure: trajectory of the car agent, and the subjective views (0 deg, -120 deg, +120 deg) used as input for the DQN
26
Subjective view agent
Result: three cameras, 50k episodes
Seems to learn a "move anyway" policy? >> revise the reward setting
Does not seem able to reach the goal every time; only "easy" goals are achieved >> vary the task difficulty (curriculum)
Figure: goals are reached frequently in this region
27
Subjective view agent
Result: four cameras, 30k episodes
28
Modify reward
Previous
– +1 when the car is in the goal
– -1 when the car is out of the field
– 0.01 - 0.01 * distance_to_goal otherwise
New
– +1 - speed when the car is in the goal
• in order to stop the car
– -1 when the car is out of the field
– -0.005 otherwise (a small constant step penalty; see the sketch below)
29
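The modified reward as a sketch, mirroring the earlier one. The same helpers are assumed, and `car.speed` is an assumed attribute; whether the speed term needs scaling is not specified on the slide.

```python
# Sketch of the modified reward (this slide); helper functions are assumed as before.
def reward_v2(car, goal, field):
    if in_goal(car, goal):
        return 1.0 - car.speed      # subtract speed so the agent learns to stop inside the goal
    if out_of_field(car, field):
        return -1.0
    return -0.005                   # small constant step penalty instead of distance shaping
```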
Modify difficulty
Difficulty: initial car direction & position
– Constraint
• Car always starts near the middle of the field
• Car always starts facing toward the center, within ±π/4
– Curriculum (a sketch follows after this slide)
• Car direction: within ±(π/12)·n of facing the center, where n = curriculum level
• Criterion for advancing: mean reward of 0.6 over 100 episodes
Figure: the goal and the starting-direction ranges for curriculum levels n = 1 and n = 2
30
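A sketch of the curriculum described above: the initial heading is sampled within ±(π/12)·n of facing the field center, and the level advances once the mean reward over the last 100 episodes exceeds 0.6. Function names and the sampling details are illustrative.

```python
import math
import random

def sample_initial_direction(angle_to_center, n):
    """Curriculum level n: start heading within +/- (pi/12)*n of facing the field center."""
    half_range = (math.pi / 12.0) * n
    return angle_to_center + random.uniform(-half_range, half_range)

def should_advance(recent_rewards, threshold=0.6, window=100):
    """Advance to the next level when the mean reward over the last 100 episodes reaches 0.6."""
    if len(recent_rewards) < window:
        return False
    return sum(recent_rewards[-window:]) / window >= threshold
```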
Subjective view agent:
modifications
N cameras | Reward   | Difficulty | Learning result
3         | Default  | Default    | about 6k: o, 50k: x
3         | Modified | Default    | about 16k: o
3         | Modified | Constraint | ? (still learning)
3         | Modified | Curriculum | o (though only curriculum level 1 so far)
4         | Default  | Default    | x
4         | Modified | Curriculum | △ (not bad, but not yet successful at 6k)
31
Subjective view agent:
modifications
Curriculum + three cameras
At curriculum level 1: the advancement criterion needs to be modified
Figure: mean reward (0.0 to 1.0) and reward sum (0 to 500) plotted against episode number (0 to 20k)
32
Discussion
1. The initial settings included situations where the car cannot reach the goal
– e.g. starting headed toward the edge of the field
– This made learning unstable
2. Why was the coordinate agent successful despite the same kind of situations?
33
Discussion
3. Comparison between three and four cameras
– Considering success rate and execution time, three cameras are better
– Why was the four-camera agent not successful? Would it need several more trials?
4. DQN often diverged
– Roughly one run in three, as a subjective impression
• Slightly more often with four cameras
– The training data matters for learning
• Replay memory size, batch size
34
Discussion
5. Curriculum
– Ideally, the "difficulty of the task" should be quantified
• In this case, it may be roughly represented by the "bias of the distribution" of the selected actions (a sketch follows after this slide):
accelerate
decelerate
throw (do nothing)
steer right
steer left
accelerate + steer right
accelerate + steer left
decelerate + steer right
decelerate + steer left
Each action selected about equally often >> the car goes straight
A biased distribution of selected actions >> the car turns right/left
35
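One simple way to quantify that "bias of the distribution" is the normalized entropy of the selected actions: uniform selection (roughly going straight) gives zero bias, while a strongly skewed distribution (mostly turning) gives a value near one. This is only one possible metric, not something proposed on the slides.

```python
import math
from collections import Counter

def action_distribution_bias(actions, n_actions=9):
    """Return 1 - normalized entropy of the selected actions.

    0 = uniform (unbiased) selection; values near 1 = heavily biased toward a few
    actions. A possible proxy for the "difficulty" of an episode's maneuver.
    """
    counts = Counter(actions)
    total = len(actions)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return 1.0 - entropy / math.log(n_actions)
```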
Summary
• The car agent can park itself using subjective camera views, though learning is not always stable
• There is a trade-off between reward design and learning difficulty
– Simple reward: difficult to learn
• Try other algorithms such as A3C
– Complex reward: difficult to design
• Try other settings for distance_to_goal
36

Editor's Notes

  • #21 Should the learning rate be made even smaller? Consider A3C.
  • #33 Plot the mean rather than the raw line, or plot points. Consider TRPO.