Imitation Learning for Autonomous Driving in TORCS
Final Report
Yasunori Kudo
Mitsuru Kusumoto, Yasuhiro Fujita
SP Team
Imitation Learning
Imitation learning is an approach to sequential prediction problems in which expert demonstrations of good behavior are used to learn a controller.
In standard reinforcement learning, agents need to explore the environment many times to obtain a good policy, but sample efficiency is crucial in real environments. Expert demonstrations may help with this issue.
Examples:
• Legged locomotion [Ratliff 2006]
• Outdoor navigation [Silver 2008]
• Car driving [Pomerleau 1989]
• Helicopter flight [Abbeel 2007]
DAgger: Dataset Aggregation
(Figure: the DAgger loop. Execute the current policy and query the expert; the new data, with steering labels from the expert, is aggregated with all previous data into one dataset, and a new policy is trained on it by supervised learning.)
Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.
DAgger: Dataset Aggregation
Algorithm 3.6.1 (DAGGER):

Initialize D ← ∅.
Initialize π̂_1 to any policy in Π.
for i = 1 to N do
    Let π_i = β_i π* + (1 − β_i) π̂_i.
    Sample T-step trajectories using π_i.
    Get dataset D_i = {(s, π*(s))} of states visited by π_i and actions given by the expert.
    Aggregate datasets: D ← D ∪ D_i.
    Train classifier π̂_{i+1} on D (or use an online learner to get π̂_{i+1} given the new data D_i).
end for
Return the best π̂_i on validation.

The algorithm described above with β_i = I(i = 1), for I the indicator function, is a special case that often performs best in practice.
DAgger: Dataset Aggregation
• Collect new trajectories with π̂_1 (steering labels from the expert)
• New dataset D_1' = {(s, π*(s))}
• Aggregate datasets: D_1 = D_0 ∪ D_1'
• Train π̂_2 on D_1

DAgger: Dataset Aggregation
• Collect new trajectories with π̂_2 (steering labels from the expert)
• New dataset D_2' = {(s, π*(s))}
• Aggregate datasets: D_2 = D_1 ∪ D_2'
• Train π̂_3 on D_2

Expert policy vs. predicted policy: mixing in the predicted policy avoids collecting only states visited under the expert policy.
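A minimal sketch of the DAgger loop above (the `env`, `expert_action`, and `policy` interfaces below are placeholders for illustration, not the project's actual code):

```python
import random

def dagger(env, expert_action, policy, n_iters=10, horizon=1000):
    """DAgger: iteratively aggregate expert-labelled states and retrain."""
    dataset = []                        # aggregated (state, expert action) pairs
    for i in range(n_iters):
        beta = 1.0 if i == 0 else 0.0   # beta_i = I(i = 1): follow the expert only on the first pass
        state = env.reset()
        for _ in range(horizon):
            # Roll out a mixture of the expert and the current learned policy.
            action = expert_action(state) if random.random() < beta else policy.predict(state)
            # Label every visited state with the expert's action.
            dataset.append((state, expert_action(state)))
            state, done = env.step(action)
            if done:
                break
        policy.fit(dataset)             # supervised learning on the aggregated dataset
    return policy
```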
Experiments
• Pendulum and Pong in OpenAI Gym
• We compared the performance of DAgger with a standard RL algorithm (REINFORCE).

Pendulum swing-up benchmark task (figure from "Reinforcement Learning In Continuous Time and Space", Kenji Doya, 2000):
State: (θ, θ̇), Reward: cosθ, Control: torque

Pong:
State: 80×80 binary image, Reward: win +1, lose -1
Experiments - REINFORCE
REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility

• Estimate the policy gradient:
  ∇_θ J(θ) = (1/M) Σ_{m=1}^{M} Σ_{t=1}^{T} ∇_θ log π(a_{m,t} | s_{m,t}; θ) ( Σ_{t'=t}^{T} γ^{t'-t} r_{m,t'} − b )
• Update the model parameters:
  θ_{i+1} = θ_i + α ∇_θ J(θ)

θ : model parameter
M : number of episodes
T : number of steps
γ : decay of reward
r : reward
b : baseline
π : policy
a : action
s : state

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229-256, 1992.
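A compact sketch of the REINFORCE update above (NumPy only; the linear softmax policy and the simple mean-return baseline are assumptions made for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_update(theta, episodes, alpha=0.01, gamma=0.99):
    """episodes: list of trajectories, each a list of (state, action, reward) tuples."""
    baseline = np.mean([sum(r for _, _, r in ep) for ep in episodes])   # b: mean episode return
    grad = np.zeros_like(theta)
    for ep in episodes:
        rewards = [r for _, _, r in ep]
        for t, (s, a, _) in enumerate(ep):
            # Discounted return from step t onward, minus the baseline.
            ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards))) - baseline
            probs = softmax(theta @ s)                                  # pi(a | s; theta)
            grad += np.outer(np.eye(len(probs))[a] - probs, s) * ret    # grad log pi(a|s) * (G_t - b)
    return theta + alpha * grad / len(episodes)                         # theta <- theta + alpha * grad J(theta)
```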
Experiments - Multi Agent
(Figure: three agents, each running its own environment (http://192.168.0.1/8080, http://192.168.0.2/8080, http://192.168.0.3/8080), collect experience in parallel; each sends a gradient to the shared model, and the updated model parameters are sent back to every agent.)
With 3 agents, training is about 3 times faster than with a single agent.
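A conceptual sketch of this data flow (the actual system distributes agents over HTTP; here `collect_gradient` is a stub and multiprocessing stands in for the separate machines):

```python
from multiprocessing import Pool
import numpy as np

def collect_gradient(args):
    """One agent: run its own environment and return a gradient estimate.
    (Stub: a real worker would roll out episodes and estimate grad J(theta).)"""
    theta, seed = args
    return np.zeros_like(theta)

def parallel_update(theta, n_agents=3, alpha=0.01):
    # Each agent computes a gradient against the same parameters in parallel;
    # the shared model is then updated with their average.
    with Pool(n_agents) as pool:
        grads = pool.map(collect_gradient, [(theta, i) for i in range(n_agents)])
    return theta + alpha * np.mean(grads, axis=0)
```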
Results - Pendulum
(Plot: learning curves, REINFORCE vs. DAgger.)
Network: 3-layer perceptron, 3 → 200 → 2
Input: (cosθ, sinθ, θ̇)
Output: one of two actions
DAgger needs fewer episodes until convergence!
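A sketch of a forward pass through such a 3-200-2 network (NumPy; the ReLU activation, the random initialization, and the softmax over the two actions are assumptions, since the slide only gives the layer sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(200, 3)), np.zeros(200)   # 3 inputs -> 200 hidden units
W2, b2 = rng.normal(scale=0.1, size=(2, 200)), np.zeros(2)     # 200 hidden -> 2 actions

def action_probs(obs):
    """obs = (cos(theta), sin(theta), theta_dot); returns probabilities over the 2 actions."""
    h = np.maximum(0.0, W1 @ obs + b1)       # hidden layer with ReLU
    logits = W2 @ h + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()

print(action_probs(np.array([1.0, 0.0, 0.0])))   # example observation
```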
Results - Pong
(Plot: learning curves, REINFORCE vs. DAgger.)
Network: 3-layer perceptron, 6400 → 200 → 2
Input: 6400-dimensional (80×80) vector, the frame difference S_{t+1} − S_t
Output: Up or Down
Validation accuracy: 97.04%
DAgger needs fewer episodes until convergence!
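A sketch of the input construction implied above (the crop, downsampling, and background colour values are assumptions; the slide only specifies an 80×80 binary image and the frame difference S_{t+1} − S_t):

```python
import numpy as np

def preprocess(frame):
    """Turn a raw 210x160x3 Pong frame into a flattened 80x80 binary vector (length 6400)."""
    img = frame[35:195]                     # crop to the playing field
    img = img[::2, ::2, 0]                  # downsample by 2, keep one colour channel -> 80x80
    binary = (img != 144) & (img != 109)    # erase the two background colours
    return binary.astype(np.float32).ravel()

def network_input(prev_frame, cur_frame):
    # The 6400-dimensional input is the difference of consecutive preprocessed frames.
    return preprocess(cur_frame) - preprocess(prev_frame)
```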
Application to TORCS
• Car driving simulator game
• Try to improve Yoshida-san's projects
• Train a policy only from the vision sensor

7 training tracks: track4, track7, track18, …
3 test tracks: track8, track12, track16

Pipeline: Imitation Learning (expert: hand-crafted AI) → Transfer Learning → Reinforcement Learning
(Figure: CNN policy network taking the two most recent frames x_{t-1} and x_t, each 3×64×64, as input.)

The policy uses either discrete or continuous actions:
Discrete actions
• Steering wheel: (-1, 0, 1)
• Whether to brake: (0, 1)
Continuous actions
• Steering wheel: -1 to 1
• Accel: 0 to 1
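A small sketch of the two action parameterizations above (the dictionary keys and the clipping are illustrative assumptions; the slides only give the value ranges):

```python
import numpy as np

def discrete_action(steer_index, brake_flag):
    """steer_index in {0, 1, 2} selects steering from (-1, 0, 1); brake_flag in {0, 1}."""
    return {"steer": (-1, 0, 1)[steer_index], "brake": brake_flag}

def continuous_action(steer, accel):
    """Steering in [-1, 1], acceleration in [0, 1]."""
    return {"steer": float(np.clip(steer, -1.0, 1.0)),
            "accel": float(np.clip(accel, 0.0, 1.0))}
```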
Results - DAgger in TORCS
(Plots: test-track progress with discrete and continuous actions; "Expert can reach" marks how far the expert gets.)
• DAgger works well in different environments (no overfitting!).
• The agent cannot surpass the performance of the expert: most places where the agent fails are where the expert fails.
• The expert cannot reach the goal on all test tracks.
• An agent with continuous actions gradually becomes worse...
Experiments - Transfer Learning
• Experiment 1 (single-play): RL for faster and safer driving
• Experiment 2 (self-play): RL for racing battle

Experiment 1
Environments: Track 0 and 16
Rewards:
  Out of the track ⇒ -1
  Every 400 (track 0) or 200 (track 8) steps ⇒ mean speed (≈ 0 to 2.2)

Experiment 2
Environment: Track 0
Rewards:
  Out of the track ⇒ -1
  Overtaken by the opponent ⇒ -1
  Overtake the opponent ⇒ mean speed (≈ 0 to 2.2)

(Figure: network input images; dimensions 64×64 and 32×32.)
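A hedged sketch of the Experiment 1 reward described above (only the reward rules come from the slide; the step bookkeeping and the `speeds` history are assumptions):

```python
def single_play_reward(step, off_track, speeds, interval=400):
    """interval is 400 on track 0 and 200 in the other setting described above."""
    if off_track:
        return -1.0                                   # leaving the track is penalized
    if step > 0 and step % interval == 0:
        return sum(speeds[-interval:]) / interval     # mean speed over the last interval (~0 to 2.2)
    return 0.0                                        # no reward on other steps
```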
Results - Experiment 1 (single-play)
(Plots: Track 0 (goal: 400 steps) and Track 16 (goal: 1600 steps); each shows the expert's score and a moving average of the agent's score.)
• Transfer learning works well with the REINFORCE algorithm.
• Better driving than the expert in terms of both speed and safety.
• A well-trained agent seems to control its speed by steering alone (no braking).
Results - Experiment 2 (self-play)
(Plots: three settings, each showing agent and opponent scores with a moving average.)
• vs. expert (opponent = expert): RL not to be overtaken
• self-play 1: RL to overtake
• self-play 2: RL not to be overtaken
Conclusion and Future Works
Conclusion
• DAgger works well in various environments such as TORCS.
• DAgger is very effective as pre-training before RL.
Future Works
• Does imitation learning as pre-training cause the agent to get stuck in local minima?
• Could multi-task learning (e.g., simultaneously predicting whether another car is to the left or right) help to train autonomous driving?
Appendix
(Plots: comparison of baselines (with vs. without a baseline) and comparison of pre-training by DAgger (with vs. without pre-training).)