This presentation gives a comprehensive overview of self-driving car technology. It begins by tracing the history of autonomous vehicles and then covers a range of methodologies, including imitation learning, the direct perception approach, end-to-end deep learning, and reinforcement learning. It also examines vehicle dynamics, vehicle control, odometry, SLAM (Simultaneous Localization and Mapping), road and lane detection, 3D reconstruction, motion planning, object detection, object tracking, and decision making and planning.
5. Contents
Goal: Develop an understanding of the capabilities and limitations of autonomous
driving solutions and gain a basic understanding of the entire system comprising
perception, planning and vehicle control. Training agents in simple environments.
I History of self-driving cars
I End-to-end learning for self-driving (imitation/reinforcement learning)
I Modular approaches to self-driving
I Perception (camera, lidar, radar)
I Localization (with visual and road maps)
I Navigation and path planning
I Vehicle models and control algorithms
6. Prerequisites
Linear Algebra:
I Vectors: x, y ∈ R^n
I Matrices: A, B ∈ R^{m×n}
I Operations: A^T, A^{-1}, Tr(A), det(A), A + B, AB, Ax, x^T y
I Norms: ‖x‖_1, ‖x‖_2, ‖x‖_∞, ‖A‖_F
I SVD: A = U D V^T
7. Prerequisites
Probability and Information Theory:
I Probability distributions: P(X = x)
I Marginal/conditional: p(x) = ∫ p(x, y) dy,  p(x, y) = p(x|y) p(y)
I Bayes rule: p(x|y) = p(y|x) p(x) / p(y)
I Conditional independence: x ⊥ y | z ⇔ p(x, y|z) = p(x|z) p(y|z)
I Expectation: E_{x∼p}[f(x)] = ∫ p(x) f(x) dx
I Variance: Var(f(x)) = E[(f(x) − E[f(x)])²]
I Distributions: Bernoulli, Categorical, Gaussian, Laplace
I Entropy: H(x), KL Divergence: D_KL(p‖q)
11. Road Fatalities in 2017
I USA: 32,700 | Germany: 3,300 | Worldwide: 1,300,000
I Main factors: speeding, intoxication, distraction, etc.
12. Benefits of Autonomous Driving
I Lower risk of accidents
I Provide mobility for elderly and people with disabilities
I In the US, 45% of people with disabilities still work
I Decrease pollution for a more healthy environment
I New ways of public transportation
I Car pooling
I Car sharing
I Reduce the number of cars (a car is parked 95% of the time)
15. Self-driving is Hard
Human performance: 1 fatality per 100 million miles
Error rate to improve on: 0.000001%
Challenges:
I Snow, heavy rain, night
I Unstructured roads, parking lots
I Pedestrians, erratic behavior
I Reflections, dynamics
I Rare and unseen events
I Merging, negotiating, reasoning
I Ethics: what is good behavior?
I Legal questions
http://theoatmeal.com/blog/google_self_driving_car
17. The Trolley Problem (1905)
Thought experiment:
I You observe a train that will kill 5 people on the rail tracks if it continues
I You have the option to pull a lever to redirect the train to another track
I However, the train will kill one (other) person on that alternate track
I What is your decision? What is the correct/ethical decision?
18. The MIT Moral Machine
http://moralmachine.mit.edu/
22. 1886: Benz Patent-Motorwagen Nummer 1
I Benz 954 cc single-cylinder four-stroke engine (500 watts)
I Weight: 100 kg (engine), 265 kg (total)
I Maximum speed: 16 km/h
I Consumption: 10 liters / 100 km (!)
I Construction based on the tricycle, many bicycle components
I 29.1.1886: patent filed
I 3.7.1886: first public test drive in Mannheim
I 2.11.1886: patent granted, but investors stayed skeptical
I First long-distance trip (106 km) by Bertha Benz in 1888 with Motorwagen
Nummer 3 (without the knowledge of her husband) fostered commercial interest
I First gas station: a pharmacy in Wiesloch near Heidelberg
25. 1925: Phantom Auto – “American Wonder” (Houdina Radio Control)
In the summer of 1925, Houdina’s driverless car, called the American Wonder, traveled along Broadway in New York
City—trailed by an operator in another vehicle—and down Fifth Avenue through heavy traffic. It turned corners, sped up,
slowed down and honked its horn. Unfortunately, the demonstration ended when the American Wonder crashed into
another vehicle filled with photographers documenting the event. (Discover Magazine)
https://www.discovermagazine.com/technology/the-driverless-car-era-began-more-than-90-years-ago
26. 1939: Futurama – New York World’s Fair
I Exhibit at the New York World's Fair in 1939, sponsored by General Motors
I Designed by Norman Bel Geddes: his vision of the world 20 years later (1960)
I Radio-controlled electric cars, electromagnetic field via circuits in the roadway
I #1 exhibition, very well received (Great Depression), prototypes by RCA and GM
https://www.youtube.com/watch?v=sClZqfnWqmc
30. 1960: RCA Labs’ Wire Controlled Car Aeromobile
https://spectrum.ieee.org/selfdriving-cars-were-just-around-the-cornerin-1960
31. 1970: Citroen DS19
I Steered by sensing magnetic cables in the road, up to 130 km/h
https://www.youtube.com/watch?v=MwdjM2Yx3gU
32. 1986: Navlab 1
I Vision-based navigation
Jochem, Pomerleau, Kumar and Armstrong: PANS: A Portable Navigation Platform. IV, 1995.
33. Navlab Overview
I Project at Carnegie Mellon University, USA
I 1986: Navlab 1: 5 computer racks (Warp supercomputer)
I 1988: First semi-autonomous drive at 20 mph
I 1990: Navlab 2: 6 mph offroad, 70 mph highway driving
I 1995: Navlab 5: "No Hands Across America" (2850 miles, 98% autonomy)
I PANS: Portable Advanced Navigation Support
I Compute: 50 MHz Sparc workstation (only 90 watts)
I Main focus: lane keeping (lateral but no longitudinal control, i.e., steering only)
I Position estimation: Differential GPS + Fibre Optic Gyroscope (IMU)
I Low-level control: HC11 microcontroller
Jochem, Pomerleau, Kumar and Armstrong: PANS: A Portable Navigation Platform. IV, 1995.
34. 1988: ALVINN
ALVINN: An Autonomous Land Vehicle in a Neural Network
I Forward-looking, vision based driving
I Fully connected neural network maps
road images to vehicle turn radius
I Directions discretized (45 bins)
I Trained on simulated road images
I Tested on unlined paths, lined city
streets and interstate highways
I 90 consecutive miles at up to 70 mph
Pomerleau: ALVINN: An Autonomous Land Vehicle in a Neural Network. NIPS, 1988.
36. 1995: AURORA
AURORA: Automotive Run-Off-Road Avoidance System
I Downward-looking camera (mounted at the side)
I Adjustable template correlation
I Tracks solid or dashed lane markings
I Shown to perform robustly even
when the markings are worn or their
appearance in the image is degraded
I Mainly tested as a lane departure
warning system ("time to crossing")
Chen, Jochem and Pomerleau: AURORA: A Vision-Based Roadway Departure Warning System. IROS, 1995.
37. 1986: VaMoRs – Bundeswehr Universität Munich
I Developed by Ernst Dickmanns in the context of EUREKA-Prometheus (€800 million)
(PROgraMme for a European Traffic of Highest Efficiency and Unprecedented Safety, 1987-1995)
I Demonstration to Daimler-Benz Research 1986 in Stuttgart
I Longitudinal and lateral guidance with lateral acceleration feedback
I Speed: 0 to 36 km/h
39. 1994: VAMP – Bundeswehr Universität Munich
I 2nd generation Transputer (60 processors), bifocal saccade vision, no GPS
I 1678 km autonomous ride from Munich to Odense, 95% autonomy (up to 158 km at a stretch)
I Autonomous driving speed record: 180 km/h (lane keeping)
I Convoy driving, automatic lane change (triggered by human)
40. 1992: Summary Paper by Dickmanns
Dickmanns and Mysliwetz: Recursive 3-D Road and Relative Ego-State Recognition. PAMI, 1992.
41. 1995: Invention of Adaptive Cruise Control (ACC)
I 1992: Lidar-based distance control by Mitsubishi (throttle control and downshifting)
I 1997: Laser adaptive cruise control by Toyota (throttle control and downshifting)
I 1999: Distronic radar-assisted ACC by Mercedes-Benz (S-Class), level 1 autonomy
42. 2000: First Technological Revolution: GPS, IMUs, Maps
I NAVSTAR GPS available with 1 meter accuracy, IMUs improve this to up to 5 cm
I Navigation systems and road maps become available
I Accurate self-localization and ego-motion estimation algorithms
43. 2004: Darpa Grand Challenge 1 (Limited to US Participants)
I 1st competition in the Mojave Desert along a 240 km route, $1 mio prize money
I No traffic, dirt roads, route given by GPS waypoints (2935 points, up to 4 per curve)
I None of the robot vehicles finished the route. CMU traveled the farthest,
completing 11.78 km of the course before hitting a rock.
44. 2005: Darpa Grand Challenge 2 (Limited to US Participants)
I 2nd competition in the Mojave Desert along a 212 km route, $2 mio prize money
I Five teams finished (Stanford team 1st in 6:54 h, CMU team 2nd in 7:05 h)
45. 2006: Park Shuttle Rotterdam
I 1800 meter route from metro station Kralingse Zoom to the Rivium business park
I One of the first truly driverless cars, but on a dedicated lane, localization via magnets
46. 2006: Second Technological Revolution: Lidars and High-res Sensors
I High-resolution Lidar
I Camera systems with increasing resolution
I Accurate 3D reconstruction, 3D detection and 3D localization
47. 2007: Darpa Urban Challenge (International Participants)
I 3rd competition at George Air Force Base, 96 km route, urban driving, $2 mio
I Rules: obey traffic laws, negotiate, avoid obstacles, merge into traffic
I 11 US teams received $1 mio funding for their research
I Winners: CMU's Boss 1st (4:10), Stanford's Junior 2nd (4:29). No non-US team among the winners.
48. 2009: Google starts working on Self-Driving Car
I Led by Sebastian Thrun, former director of the Stanford AI Lab and the Stanley team
I Others: Chris Urmson, Dmitri Dolgov, Mike Montemerlo, Anthony Levandowski
I Renamed "Waymo" in 2016 (Google spent $1 billion until 2015)
https://waymo.com/
49. 2010: VisLab Intercontinental Autonomous Challenge (VIAC)
I July 20 to October 28: 16,000 km trip from Parma, Italy to Shanghai, China
I The second vehicle automatically followed the route defined by the leader vehicle,
either visually or via GPS waypoints sent by the lead vehicle
Broggi, Medici, Zani, Coati and Panciroli: Autonomous vehicles control in the VisLab Intercontinental Autonomous Challenge. Annu. Rev. Control, 2012.
51. 2010: Pikes Peak Self-Driving Audi TTS
I Pikes Peak International Hill Climb (since 1916): 20 km, 1440 m of climb, summit at 4300 m
I Audi TTS completed the track in 27 min (human record in 2010: 10 min, now: 8 min)
https://www.youtube.com/watch?v=Arx8qWx9CFk
52. 2010: Stadtpilot (Technical University Braunschweig)
I Goal: geofenced inner-city driving based on laser scanners, cameras and HD maps
I Challenges: traffic lights, roundabouts, etc. Similar efforts by FU Berlin and others
Saust, Wille, Lichte and Maurer: Autonomous Vehicle Guidance on Braunschweig's inner ring road within the Stadtpilot Project. IV, 2011.
53. 2012: Third Technological Revolution: Deep Learning
I Representation learning boosts accuracy across tasks and benchmarks
Güler et al.: DensePose: Dense Human Pose Estimation In The Wild. CVPR, 2018.
54. 2012: Third Technological Revolution: New Benchmarks
Geiger, Lenz and Urtasun: Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. CVPR, 2012.
55. 2013: Mercedes Benz S500 Intelligent Drive
I Autonomous ride on the historic Bertha Benz route by Daimler R&D and KIT/FZI
I Novelty: close-to-production stereo cameras / radar (but requires HD maps)
Ziegler et al.: Making Bertha Drive - An Autonomous Journey on a Historic Route. IEEE Intell. Transp. Syst. Mag., 2014.
56. 2014: Mercedes S Class
Advanced ADAS (Level 2 Autonomy):
I Autonomous steering, lane keeping, acceleration/braking, collision avoidance,
driver fatigue monitoring, in city traffic and at highway speeds up to 200 km/h
57. 2014: Society of Automotive Engineers: SAE Levels of Autonomy
I Lateral control = steering, Longitudinal control = gas/brake
61. 2015: Tesla Model S Autopilot
Tesla Autopilot 2015 (Level 2 Autonomy):
I Lane keeping for limited-access highways (hands-off time: 30-120 seconds)
I Does not read traffic signals or signs, does not detect pedestrians/cyclists
64. 2018: Tesla Model X Autopilot: Fatal Accident 2
The National Transportation Safety Board (NTSB) said that four seconds before the 23
March crash on a highway in Silicon Valley, which killed Walter Huang, 38, the car
stopped following the path of a vehicle in front of it. Three seconds before the impact,
it sped up from 62mph to 70.8mph, and the car did not brake or steer away, the NTSB
said. After the fatal crash in the city of Mountain View, Tesla noted that the driver had
received multiple warnings to put his hands on the wheel and said he did not intervene
during the five seconds before the car hit the divider. But the NTSB report revealed that
these alerts were made more than 15 minutes before the crash. In the 60 seconds
prior to the collision, the driver also had his hands on the wheel on three separate
occasions, though not in the final six seconds, according to the agency. As the car
headed toward the barrier, there was no precrash braking or evasive steering
movement, the report added. The Guardian (June, 2018)
65. 2018: Waymo (former Google) announced Public Service
I In 2018: driving without a safety driver in a geofenced district of Phoenix
I By 2021: also in suburbs of Arizona, San Francisco and Mountain View
66. 2018: Nuro Last-mile Delivery
I Founded by two of the Google self-driving car engineers
67. Self-Driving Industry
I NVIDIA: Supplier of self-driving hardware and software
I Waabi: Startup by Raquel Urtasun (formerly Uber)
I Aurora: Startup by Chris Urmson (formerly CMU, Google, Waymo)
I Argo AI: Startup by Bryan Salesky (now Ford/Volkswagen)
I Zoox: Startup by Jesse Levinson (now Amazon)
I Cruise: Startup by Kyle Vogt (now General Motors)
I NuTonomy: Startup by Doug Parker (now Delphi/Aptiv)
I Efforts in China: Baidu Apollo, AutoX, Pony.AI
I Comma.ai: Custom open-source dashcam to retrofit any vehicle
I Wayve: Startup focusing on end-to-end self-driving
69. Business Models
Autonomous or nothing (Google, Apple, Uber)
I Very risky, only few companies can do this
I Long term goals
Introduce technology little by little (all car companies)
I Car industry is very conservative
I ADAS as intermediate goal
I Sharp transition: how to keep the driver engaged?
71. Summary
I Self-driving has a long history
I Highway lane-keeping of today was developed over 30 years ago
I Increased robustness ⇒ introduction of level 3 for highways in 2019
I Increased interest after DARPA challenge and new benchmarks (e.g., KITTI)
I Many claims about full self-driving (e.g., Elon Musk), but level 4/5 stays hard
I Waymo introduced first public service end of 2018 (with safety driver)
I Waymo/Tesla seem ahead of competition in full self-driving, but no winner yet
I But several setbacks (Uber, Tesla accidents)
I Most existing systems require laser scanners and HD maps (exception: Tesla)
I Driving as an engineering problem, quite different from human cognition
75. Autonomous Driving
[Diagram: sensory input → mapping function → steer / gas / brake]
Dominating Paradigms:
I Modular Pipelines
I End-to-End Learning (Imitation Learning, Reinforcement Learning)
I Direct Perception
76. Autonomous Driving: Modular Pipeline
[Diagram: sensory input → low-level perception → scene parsing → path planning → vehicle control → steer / gas / brake]
Examples:
I [Montemerlo et al., JFR 2008]
I [Urmson et al., JFR 2008]
I Waymo, Uber, Tesla, Zoox, ...
82. Autonomous Driving: Modular Pipeline
Pros:
I Small components, easy
to develop in parallel
I Interpretability
Cons:
I Piece-wise training (not jointly)
I Localization and planning
heavily rely on HD maps
HD maps: centimeter-precision lanes, markings, traffic lights/signs, human annotated
83. Autonomous Driving: Modular Pipeline
I Piece-wise training difficult: not all objects are equally important!
Ohn-Bar and Trivedi: Are All Objects Equal? Deep Spatio-Temporal Importance Prediction in Driving Videos. PR, 2017.
84. Autonomous Driving: Modular Pipeline
I HD maps are expensive to create (data collection and annotation effort)
https://www.geospatialworld.net/article/hd-maps-autonomous-vehicles/
85. Autonomous Driving: End-to-End Learning
[Diagram: sensory input → neural network → steer / gas / brake, trained via imitation or reinforcement learning]
Examples:
I [Pomerleau, NIPS 1989]
I [Bojarski, Arxiv 2016]
I [Codevilla et al., ICRA 2018]
86. Autonomous Driving: End-to-End Learning
Pros:
I End-to-end training
I Cheap annotations
Cons:
I Training / Generalization
I Interpretability
87. Autonomous Driving: Direct Perception
[Diagram: sensory input → neural network → intermediate representations → vehicle control → steer / gas / brake]
Examples:
I [Chen et al., ICCV 2015]
I [Sauer et al., CoRL 2018]
I [Behl et al., IROS 2020]
88. Autonomous Driving: Direct Perception
Pros:
I Compact Representation
I Interpretability
Cons:
I Control typically not learned jointly
I How to choose representations?
91. Linear Classification
Logistic Regression:
ŷ = σ(w^T x + w0),  σ(x) = 1 / (1 + e^{-x})
I Let x ∈ R^2
I Decision boundary: w^T x + w0 = 0
I Decide for class 1 ⇔ w^T x ≥ −w0
I Decide for class 0 ⇔ w^T x < −w0
I Which problems can we solve?
[Plot: sigmoid σ(x) with the decision boundary at σ = 0.5 separating class 0 from class 1]
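The decision rule above is easy to verify numerically. A minimal NumPy sketch; the weights w, w0 and the query point are purely illustrative, not values from the lecture:

```python
import numpy as np

def sigmoid(z):
    # logistic function sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, w0):
    # y_hat = sigma(w^T x + w0); decide class 1 iff w^T x >= -w0,
    # equivalently iff sigma(w^T x + w0) >= 0.5
    return sigmoid(np.dot(w, x) + w0)

w = np.array([1.0, -1.0])   # assumed example weights
w0 = 0.0
y = predict(np.array([2.0, 0.5]), w, w0)  # w^T x + w0 = 1.5 > 0 -> class 1
```

Since σ is monotonic, thresholding ŷ at 0.5 is the same as thresholding w^T x at −w0.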
96. Linear Classification
Linear classifier with non-linear features ψ:
decide class 1 ⇔ w^T ψ(x) ≥ −w0,  with ψ(x) = (x1, x2, x1·x2)
x1  x2  ψ1(x)  ψ2(x)  ψ3(x)  XOR
0   0   0      0      0      0
0   1   0      1      0      1
1   0   1      0      0      1
1   1   1      1      1      0
I Non-linear features allow a linear
classifier to solve non-linear
classification problems!
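The truth table above can be reproduced with a linear classifier on the feature map ψ(x) = (x1, x2, x1·x2). The weights and threshold below are one hand-picked solution, not values from the lecture:

```python
def psi(x1, x2):
    # non-linear feature map psi(x) = (x1, x2, x1*x2)
    return (x1, x2, x1 * x2)

def xor_classify(x1, x2, w=(1.0, 1.0, -2.0), threshold=0.5):
    # linear decision on psi(x): class 1 iff w . psi(x) >= threshold
    score = sum(wi * fi for wi, fi in zip(w, psi(x1, x2)))
    return 1 if score >= threshold else 0

# x1 + x2 - 2*x1*x2 equals XOR(x1, x2) on {0,1}^2
table = [xor_classify(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

In the original (x1, x2) space no single line separates the classes; the extra product feature makes them linearly separable.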
97. Representation Matters
[Figure: the same data shown in Cartesian coordinates (x, y) and in polar coordinates (r, θ)]
I But how to choose the transformation? Can be very hard in practice.
I Yet, this was the dominant approach until the 2000s (vision, speech, ...)
I In deep/representation learning we want to learn these transformations
99. Non-Linear Classification
XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2))
The above expression can be rewritten
as a program of logistic regressors:
h1 = σ(w_OR^T x + b_OR)
h2 = σ(w_NAND^T x + b_NAND)
ŷ = σ(w_AND^T h + b_AND)
Note that h(x) is a non-linear feature of x.
We call h(x) a hidden layer.
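The three logistic regressors can be instantiated so that each unit emulates its logic gate. The weight values below are an assumed hand-picked choice (large magnitudes push σ towards 0 or 1), not values from the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# saturating weights emulating OR, NAND and AND (assumed choice)
w_or,   b_or   = np.array([20.0, 20.0]),   -10.0
w_nand, b_nand = np.array([-20.0, -20.0]),  30.0
w_and,  b_and  = np.array([20.0, 20.0]),   -30.0

def xor(x):
    h1 = sigmoid(w_or @ x + b_or)      # hidden unit ~ OR(x1, x2)
    h2 = sigmoid(w_nand @ x + b_nand)  # hidden unit ~ NAND(x1, x2)
    h = np.array([h1, h2])             # hidden layer h(x)
    return sigmoid(w_and @ h + b_and)  # output ~ AND(h1, h2)
```

This is exactly a tiny two-layer network: the hidden layer h(x) is the learned (here: hand-set) non-linear feature of x.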
100. Multi-Layer Perceptrons
I MLPs are feedforward neural networks (no feedback connections)
I They compose several non-linear functions f(x) = ŷ(h3(h2(h1(x))))
where hi(·) are called hidden layers and ŷ(·) is the output layer
I The data specifies only the behavior of the output layer (thus the name "hidden")
I Each layer i comprises multiple neurons j which are implemented as affine
transformations (a^T x + b) followed by non-linear activation functions g:
h_ij = g(a_ij^T h_{i−1} + b_ij)
I Each neuron in each layer is fully connected to all neurons of the previous layer
I The overall length of the chain is the depth of the model ⇒ "Deep Learning"
101. MLP Network Architecture
[Diagram: input layer, three hidden layers, output layer; network depth = #computation layers = 4; layer width = #neurons in the layer]
I Neurons are grouped into layers, each neuron fully connected to all prev. ones
I Hidden layer h_i = g(A_i h_{i−1} + b_i) with activation function g(·) and weights A_i, b_i
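The layer recursion h_i = g(A_i h_{i−1} + b_i) can be sketched as a forward pass in NumPy; the layer sizes and random weights below are purely illustrative:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, layers, g=relu):
    # apply h_i = g(A_i h_{i-1} + b_i) layer by layer
    h = x
    for A, b in layers:
        h = g(A @ h + b)
    return h

rng = np.random.default_rng(0)
# a 2 -> 3 -> 1 network with random weights (illustrative only)
layers = [(rng.standard_normal((3, 2)), rng.standard_normal(3)),
          (rng.standard_normal((1, 3)), rng.standard_normal(1))]
y = mlp_forward(np.array([1.0, -1.0]), layers)
```

Depth is the number of (A, b) pairs in `layers`; width is the number of rows of each A.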
102. Deeper Models allow for more Complex Decisions
[Figure: 2D decision boundaries learned with 2, 5 and 15 hidden neurons]
https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
103. Output and Loss Functions
[Diagram: input layer → hidden layers → output layer → loss function ← target]
I The output layer is the last layer in a neural network and computes the output
I The loss function compares the result of the output layer to the target value(s)
I Choice of output layer and loss function depends on the task (discrete, continuous, ...)
104. Output Layer
I For classification problems, we use a sigmoid or softmax non-linearity
I For regression problems, we can directly return the value after the last layer
105. Loss Function
I For classification problems, we use the (binary) cross-entropy loss
I For regression problems, we can use the ℓ1 or ℓ2 loss
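The two standard loss choices can be written out directly; a minimal sketch using mean reduction over the batch (an assumed convention):

```python
import numpy as np

def bce(y_hat, y):
    # binary cross-entropy for classification, y in {0, 1}, y_hat in (0, 1)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def l2(y_hat, y):
    # squared (l2) loss for regression
    return np.mean((y_hat - y) ** 2)
```

A sigmoid/softmax output paired with cross-entropy, or a linear output paired with ℓ2, both yield simple well-behaved gradients.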
106. Activation Functions
I Hidden layer h_i = g(A_i h_{i−1} + b_i) with activation function g(·) and weights A_i, b_i
I The activation function is frequently applied element-wise to its input
I Activation functions must be non-linear to learn non-linear mappings
109. Convolutional Neural Networks
I Multi-layer perceptrons don’t scale to high-dimensional inputs
I ConvNets represent data in 3 dimensions: width, height, depth (= feature maps)
I ConvNets interleave discrete convolutions, non-linearities and pooling
I Key ideas: sparse interactions, parameter sharing, equivariant representation
110. Fully Connected vs. Convolutional Layers
[Figure: filter kernel sliding over the input feature map]
I Fully connected layer: #Weights = W × H × C_out × (W × H × C_in + 1)
I Convolutional layer: #Weights = C_out × (K × K × C_in + 1) ("weight sharing")
I With C_in input and C_out output channels, layer size W × H and kernel size K × K
I Convolutions are followed by non-linear activation functions (e.g., ReLU)
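Plugging illustrative sizes into the two weight-count formulas shows why convolutions scale; the 32×32 layer size and 3 → 16 channel counts below are assumed examples, not from the slide:

```python
def fc_weights(W, H, c_in, c_out):
    # fully connected layer on a W x H x C_in input:
    # W * H * C_out * (W * H * C_in + 1)
    return W * H * c_out * (W * H * c_in + 1)

def conv_weights(K, c_in, c_out):
    # K x K convolution: C_out * (K*K*C_in + 1);
    # weight sharing makes this independent of W and H
    return c_out * (K * K * c_in + 1)

fc = fc_weights(32, 32, 3, 16)    # over 50 million weights
conv = conv_weights(3, 3, 16)     # 448 weights
```

The convolutional count does not grow with image size, which is the core reason ConvNets scale to high-resolution inputs.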
117. Downsampling
I Downsampling reduces the spatial resolution (e.g., for image level predictions)
I Downsampling increases the receptive field (which pixels influence a neuron)
118. Pooling
I Typically, stride s = 2 and kernel size 2 × 2 ⇒ reduces spatial dimensions by 2
I Pooling has no parameters (typical pooling operations: max, min, mean)
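A 2×2 max-pooling step with stride 2 can be sketched via reshaping; a single-channel input is used for simplicity:

```python
import numpy as np

def max_pool2x2(x):
    # 2x2 max pooling with stride 2: halves both spatial dimensions,
    # has no learnable parameters
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 1., 0.],
              [0., 9., 0., 1.]])
y = max_pool2x2(x)  # [[4., 8.], [9., 1.]]
```

Swapping `max` for `mean` gives average pooling; either way the output keeps one value per 2×2 block.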
120. Fully Connected Layers
I Often, convolutional networks comprise fully connected layers at the end
122. Optimization
Optimization Problem (dataset X):
w* = argmin_w L(X, w)
Gradient Descent:
w^0 = w_init
w^{t+1} = w^t − η ∇_w L(X, w^t)
I The neural network loss L(X, w) is not convex, so we have to use gradient descent
I There exist multiple local minima, but we will find only one through optimization
I Good news: it is known that many local minima in deep networks are good ones
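The update rule w^{t+1} = w^t − η ∇_w L(w^t) can be sketched on a toy one-dimensional quadratic; the example loss L(w) = (w − 3)² is an assumed stand-in, not from the lecture:

```python
def gradient_descent(grad, w_init, lr=0.1, steps=100):
    # iterate w_{t+1} = w_t - eta * grad(w_t)
    w = w_init
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# minimize L(w) = (w - 3)^2 with gradient 2*(w - 3); minimum at w = 3
w_star = gradient_descent(lambda w: 2.0 * (w - 3.0), w_init=0.0)
```

On this convex toy problem the iterates contract towards the unique minimum; on a neural network loss the same rule only reaches some local minimum.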
123. Backpropagation
I Values are efficiently computed
forward, gradients backward
I Modularity: Each node must only
"know" how to compute gradients
wrt. its own arguments
I One forward/backward pass per data point:
∇_w L(X, w) = Σ_{i=1}^N ∇_w L(y_i, x_i, w)
[Diagram: the forward pass computes the loss, the backward pass computes the derivatives]
124. Gradient Descent
Algorithm:
1. Initialize weights w^0 and pick learning rate η
2. For all data points i ∈ {1, . . . , N} do:
2.1 Forward propagate x_i through the network to calculate the prediction ŷ_i
2.2 Backpropagate to obtain the gradient ∇_w L_i(w^t) ≡ ∇_w L(ŷ_i, y_i, w^t)
3. Update weights: w^{t+1} = w^t − η (1/N) Σ_i ∇_w L_i(w^t)
4. If the validation error decreases, go to step 2, otherwise stop
Challenges:
I Typically millions of parameters ⇒ dim(w) = 1 million or more
I Typically millions of training points ⇒ N = 1 million or more
I Becomes extremely expensive to compute and does not fit into memory
125. Stochastic Gradient Descent
Solution:
I The total loss over the entire training set can be expressed as an expectation:
(1/N) Σ_i L_i(w^t) = E_{i∼U{1,N}}[L_i(w^t)]
I This expectation can be approximated on a smaller subset of size B ≪ N:
E_{i∼U{1,N}}[L_i(w^t)] ≈ (1/B) Σ_b L_b(w^t)
I Thus, the gradient can also be approximated on this subset (= minibatch):
(1/N) Σ_i ∇_w L_i(w^t) ≈ (1/B) Σ_b ∇_w L_b(w^t)
126. Stochastic Gradient Descent
Algorithm:
1. Initialize weights w^0, pick learning rate η and minibatch size B
2. Draw a random minibatch {(x_1, y_1), . . . , (x_B, y_B)} ⊆ X (with B ≪ N)
3. For all minibatch elements b ∈ {1, . . . , B} do:
3.1 Forward propagate x_b through the network to calculate the prediction ŷ_b
3.2 Backpropagate to obtain the per-element gradient ∇_w L_b(w^t) ≡ ∇_w L(ŷ_b, y_b, w^t)
4. Update weights: w^{t+1} = w^t − η (1/B) Σ_b ∇_w L_b(w^t)
5. If the validation error decreases, go to step 2, otherwise stop
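The algorithm above, sketched for linear regression where the per-batch gradient is available in closed form; the problem sizes, learning rate and toy data are illustrative assumptions:

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.1, batch_size=8, epochs=200, seed=0):
    # fit y ~ X @ w by minibatch SGD on the l2 loss
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(epochs):
        idx = rng.permutation(N)                        # reshuffle each epoch
        for start in range(0, N, batch_size):
            b = idx[start:start + batch_size]           # minibatch indices
            grad = 2.0 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w = w - lr * grad                           # SGD update
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((64, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true                                          # noiseless toy targets
w_hat = sgd_linear_regression(X, y)
```

Each update uses only B = 8 of the N = 64 points, yet the iterates still converge to the full-batch solution.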
127. First-order Methods
There exist many variants:
I SGD
I SGD with Momentum
I SGD with Nesterov Momentum
I RMSprop
I Adagrad
I Adadelta
I Adam
I AdaMax
I NAdam
I AMSGrad
Adam is often the method of choice due to its robustness.
128. Learning Rate Schedules
[Plots: error (%) vs. iterations for plain-18/plain-34 and ResNet-18/ResNet-34, with step-wise learning rate drops]
I A fixed learning rate is too slow in the beginning and too fast in the end
I Exponential decay: η_t = η α^t
I Step decay: η ← 0.5 η (every K iterations)
He, Zhang, Ren and Sun: Deep Residual Learning for Image Recognition. CVPR, 2016.
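Both schedules are one-liners; the values of η0, α and K below are assumed examples:

```python
def exponential_decay(eta0, alpha, t):
    # eta_t = eta0 * alpha^t  (0 < alpha < 1)
    return eta0 * alpha ** t

def step_decay(eta0, K, t):
    # halve the learning rate every K iterations
    return eta0 * 0.5 ** (t // K)

lr_exp = exponential_decay(0.1, 0.9, 10)   # smooth decay
lr_step = step_decay(0.1, 10, 25)          # two halvings after 25 iterations
```

Step decay is what produces the characteristic error drops in the ResNet training curves referenced above.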
130. Capacity, Overfitting and Underfitting
[Plots: polynomial fits of degree M = 1, M = 3 and M = 9 to noisy observations of a ground-truth curve, evaluated on a held-out test set]
Capacity too low | Capacity about right | Capacity too high
I Underfitting: Model too simple, does not achieve low error on the training set
I Overfitting: Training error small, but test error (= generalization error) large
I Regularization: Take the model from the third regime (right) to the second regime (middle)
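The three regimes can be reproduced with polynomial fitting in the spirit of the figure; the sine ground truth, noise level and sample sizes are assumed choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
x_test = np.linspace(0.0, 1.0, 100)
y_test = np.sin(2 * np.pi * x_test)

def fit_and_errors(M):
    # fit a degree-M polynomial, return (train error, test error)
    coeffs = np.polyfit(x_train, y_train, M)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err
```

Raising M always drives the training error down (M = 9 interpolates all 10 noisy points), but the test error tells a different story: the high-capacity fit chases the noise.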
131. Early Stopping and Parameter Penalties
[Figure: contours of the unregularized objective and of the L2 regularizer]
Early stopping:
I Dashed: Trajectory taken by SGD
I The trajectory stops at w̃ before
reaching the minimum training error w*
L2 Regularization:
I Regularize the objective with an L2 penalty
I The penalty forces the minimum of the
regularized loss w̃ closer to the origin
132. Dropout
Idea:
I During training, set neurons to zero with probability µ (typically µ = 0.5)
I Each binary mask is one model, changes randomly with every training iteration
I Creates ensemble “on the fly” from a single network with shared parameters
Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov: Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
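A sketch of the training-time mask; note this uses the common "inverted dropout" rescaling by 1/(1 − µ), a variant of the paper's test-time scaling, so the expected activation matches at test time:

```python
import numpy as np

def dropout(h, mu=0.5, train=True, rng=None):
    # training: zero each neuron with probability mu, rescale the survivors
    # by 1/(1 - mu); test time: identity (inverted-dropout convention)
    if not train:
        return h
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(h.shape) >= mu   # one random binary mask = one "model"
    return h * mask / (1.0 - mu)
```

Because a fresh mask is drawn every iteration, training effectively averages over an exponential ensemble of thinned networks sharing one set of parameters.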
133. Data Augmentation
I Best way towards better generalization
is to train on more data
I However, data in practice often limited
I Goal of data augmentation: create
“fake” data from the existing data (on
the fly) and add it to the training set
I New data must preserve semantics
I Even simple operations like translation
or adding per-pixel noise often already
greatly improve generalization
I https://github.com/aleju/imgaug
137. Imitation Learning in a Nutshell
Hard coding policies is often difficult ⇒ Rather use a data-driven approach!
I Given: demonstrations or a demonstrator
I Goal: train a policy to mimic the expert's decisions
I Variants: behavior cloning (this lecture), inverse optimal control, ...
138. Formal Definition of Imitation Learning
I State: s ∈ S, may be partially observed (e.g., game screen)
I Action: a ∈ A, may be discrete or continuous (e.g., turn angle, speed)
I Policy: π_θ : S → A, we want to learn the policy parameters θ
I Optimal action: a* ∈ A, provided by the expert demonstrator
I Optimal policy: π* : S → A, provided by the expert demonstrator
I State dynamics: P(s_{i+1}|s_i, a_i), the simulator, typically not known to the policy
Often deterministic: s_{i+1} = T(s_i, a_i) (deterministic mapping)
I Rollout: Given s_0, sequentially execute a_i = π_θ(s_i) and sample s_{i+1} ∼ P(s_{i+1}|s_i, a_i);
this yields the trajectory τ = (s_0, a_0, s_1, a_1, . . . )
I Loss function: L(a*, a), the loss of action a given the optimal action a*
139. Formal Definition of Imitation Learning
General Imitation Learning:
argmin_θ E_{s∼P(s|π_θ)}[L(π*(s), π_θ(s))]
I State distribution P(s|π_θ) depends on the rollout
determined by the current policy π_θ
Behavior Cloning:
argmin_θ E_{(s*,a*)∼P*}[L(a*, π_θ(s*))]  (estimated by Σ_{i=1}^N L(a*_i, π_θ(s*_i)))
I State distribution P* provided by the expert
I Reduces to a supervised learning problem
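The behavior cloning objective is just supervised learning; it can be sketched with a linear policy trained by gradient descent on expert (s*, a*) pairs. The toy expert, state dimension and hyperparameters are assumed:

```python
import numpy as np

def train_behavior_cloning(states, expert_actions, lr=0.05, epochs=500):
    # linear policy pi_theta(s) = theta^T s, fit by regression on expert pairs
    N, D = states.shape
    theta = np.zeros(D)
    for _ in range(epochs):
        pred = states @ theta
        grad = 2.0 * states.T @ (pred - expert_actions) / N  # l2-loss gradient
        theta = theta - lr * grad
    return theta

# toy expert: steering proportional to the state features (assumed)
rng = np.random.default_rng(0)
states = rng.standard_normal((256, 3))
theta_expert = np.array([-0.8, 0.1, 0.0])
actions = states @ theta_expert
theta = train_behavior_cloning(states, actions)
```

Note that the states here are drawn from the expert distribution P*; nothing in this loop accounts for states the learned policy itself will visit, which is exactly the weakness discussed next.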
140. Challenges of Behavior Cloning
I Behavior cloning makes IID assumption
I Next state is sampled from states observed during expert demonstration
I Thus, the next state is sampled independently of the action predicted by the current policy
I What if πθ makes a mistake?
I Enters new states that haven’t been observed before
I New states not sampled from same (expert) distribution anymore
I Cannot recover, catastrophic failure in the worst case
I What can we do to overcome this train/test distribution mismatch?
57
141. DAgger
Data Aggregation (DAgger):
I Iteratively build a set of inputs that the final policy is likely to encounter based on
previous experience; query the expert for labels on these inputs and aggregate the dataset
I But can easily overfit to main mode of demonstrations
I High training variance (random initialization, order of data)
Ross, Gordon and Bagnell: A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS, 2011. 58
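A schematic of the DAgger loop on a toy 1-D stabilization task; the expert, dynamics and numbers below are made-up stand-ins, and only the roll-out / query-expert / aggregate / retrain structure follows the algorithm:

```python
import numpy as np

# Toy task: keep a scalar state near zero. The "expert" steers back to the
# center; the dynamics add small noise. Both are hypothetical placeholders.
rng = np.random.default_rng(1)
expert = lambda s: -0.8 * s                 # expert action
step = lambda s, a: s + a + 0.05 * rng.normal()

states, actions = [], []                    # aggregated dataset D
theta = 0.0                                 # linear policy a = theta * s
for it in range(10):
    s = 2.0                                 # start of a rollout
    for _ in range(20):                     # roll out the CURRENT policy ...
        states.append(s)
        actions.append(expert(s))           # ... but record the EXPERT's action
        s = step(s, theta * s)              # environment follows the policy
    X, Y = np.array(states), np.array(actions)
    theta = float(X @ Y / (X @ X))          # refit the policy on aggregated D
print(abs(theta + 0.8) < 1e-6)
```

Because the states come from the learner's own rollouts, the aggregated dataset covers exactly the distribution the policy will face, which is what mitigates the train/test mismatch of plain behavior cloning.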
142. DAgger with Critical States and Replay Buffer
Key Ideas:
1. Sample critical states from the collected on-policy data based on the
utility they provide to the learned policy in terms of driving behavior
2. Incorporate a replay buffer which progressively focuses on the high
uncertainty regions of the policy’s state distribution
Prakash, Behl, Ohn-Bar, Chitta and Geiger: Exploring Data Aggregation in Policy Learning for Vision-based Urban Autonomous Driving. CVPR, 2020. 59
144. ALVINN: An Autonomous Land Vehicle in a Neural Network
I 3-layer fully connected neural network
I 36k parameters
I Maps road images to turn radius
I Directions discretized (45 bins)
I Trained on simulated road images!
I Tested on unlined paths, lined city
streets and interstate highways
I 90 consecutive miles at up to 70 mph
Pomerleau: ALVINN: An Autonomous Land Vehicle in a Neural Network. NIPS, 1988. 61
145. ALVINN: An Autonomous Land Vehicle in a Neural Network
Pomerleau: ALVINN: An Autonomous Land Vehicle in a Neural Network. NIPS, 1988. 62
147. PilotNet: System Overview
I Data augmentation by 3 cameras and virtually shifted / rotated images assuming
the world is flat (homography), adjusting the steering angle appropriately
Bojarski et al.: End-to-End Learning for Self-Driving Cars. Arxiv, 2016. 64
148. PilotNet: Architecture
I Convolutional network (250k param)
I Input: YUV image representation
I 1 Normalization layer
I Not learned
I 5 Convolutional Layers
I 3 strided 5x5
I 2 non-strided 3x3
I 3 Fully connected Layers
I Output: turning radius
I Trained on 72h of driving
Bojarski et al.: End-to-End Learning for Self-Driving Cars. Arxiv, 2016. 65
150. VisualBackProp
I Central idea: find salient image regions that lead to high activations
I Forward pass, then iteratively scale-up activations
Bojarski et al.: VisualBackProp: Efficient Visualization of CNNs for Autonomous Driving. ICRA, 2018. 67
152. VisualBackProp
I Test whether shifting the salient objects affects the predicted turn radius more strongly than shifting other image regions
Bojarski et al.: VisualBackProp: Efficient Visualization of CNNs for Autonomous Driving. ICRA, 2018. 69
155. Conditional Imitation Learning
Idea:
I Condition controller on navigation command c ∈ {left,right,straight}
I High-level navigation command can be provided by consumer GPS, i.e.,
telling the vehicle to turn left/right or go straight at the next intersection
I This removes the task ambiguity induced by the environment
I State st: current image Action at: steering angle, acceleration
Codevilla, Müller, López, Koltun and Dosovitskiy: End-to-End Driving Via Conditional Imitation Learning. ICRA, 2018. 72
156. Comparison to Behavior Cloning
Behavior Cloning:
I Training Set: D = {(a∗_i, s∗_i)}_{i=1}^N
I Objective: argmin_θ Σ_{i=1}^N L(a∗_i, πθ(s∗_i))
I Assumption: ∃f(·) : ai = f(si)
Often violated in practice!
Conditional Imitation Learning:
I Training Set: D = {(a∗_i, s∗_i, c∗_i)}_{i=1}^N
I Objective: argmin_θ Σ_{i=1}^N L(a∗_i, πθ(s∗_i, c∗_i))
I Assumption: ∃f(·, ·) : ai = f(si, ci)
Better assumption!
Codevilla, Müller, López, Koltun and Dosovitskiy: End-to-End Driving Via Conditional Imitation Learning. ICRA, 2018. 73
157. Conditional Imitation Learning: Network Architecture
I This paper proposes two network architectures:
I (a) Extract features C(c) and concatenate with image features I(i)
I (b) Command c acts as switch between specialized submodules
I Measurements m capture additional information (here: speed of vehicle)
Codevilla, Müller, López, Koltun and Dosovitskiy: End-to-End Driving Via Conditional Imitation Learning. ICRA, 2018. 74
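Variant (b) above can be sketched as a command-gated forward pass: the navigation command selects one of several specialized output heads. The encoder and per-command head weights below are random placeholders standing in for trained networks:

```python
import numpy as np

# Sketch of the branched architecture: one output head per navigation command.
# All weights are random placeholders, not a trained model.
rng = np.random.default_rng(0)
COMMANDS = ["left", "straight", "right"]

W_feat = rng.normal(size=(64, 128))                       # shared image encoder
heads = {c: rng.normal(size=(128, 2)) for c in COMMANDS}  # one head per command

def policy(image_features, command):
    h = np.tanh(W_feat.T @ image_features)   # shared representation
    return h @ heads[command]                # command-selected branch -> action

x = rng.normal(size=64)                      # placeholder image features
a_left = policy(x, "left")
a_right = policy(x, "right")
print(a_left.shape, np.allclose(a_left, a_right))  # same input, different action
```

The same observation yields different actions depending on the command, which is how the conditioning removes the ambiguity at intersections.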
158. Conditional Imitation Learning: Noise Injection
I Temporally correlated noise injected into trajectories ⇒ drift (only 12 minutes)
I Record driver’s (=expert’s) corrective response ⇒ recover from drift
Codevilla, Müller, López, Koltun and Dosovitskiy: End-to-End Driving Via Conditional Imitation Learning. ICRA, 2018. 75
161. Neural Attention Fields
I An MLP iteratively compresses the high-dimensional input into a compact
representation ci (c ≠ navigation command) based on a BEV query location as input
I The model predicts waypoints and auxiliary semantics which aids learning
Chitta, Prakash and Geiger: Neural Attention Fields for End-to-End Autonomous Driving. ICCV, 2021. 78
162. Summary
Advantages of Imitation Learning:
I Easy to implement
I Cheap annotations (just driving while recording images and actions)
I Entire model trained end-to-end
I Conditioning removes ambiguity at intersections
Challenges of Imitation Learning:
I Behavior cloning makes an i.i.d. assumption which is violated in practice
I Direct mapping from images to control ⇒ No long term planning
I No memory (can’t remember speed signs, etc.)
I Mapping is difficult to interpret (“black box”), despite visualization techniques
79
163. Self-Driving Cars
Lecture 3 – Direct Perception
Robotics, Computer Vision, System Software
BE, MS, PhD (MMMTU, IISc, IIIT-Hyderabad)
Kumar Bipin
164. Agenda
3.1 Direct Perception
3.2 Conditional Affordance Learning
3.3 Visual Abstractions
3.4 Driving Policy Transfer
3.5 Online vs. Offline Evaluation
2
166. Approaches to Self-Driving
Modular Pipeline: Sensory Input → Low-level Perception → Scene Parsing →
Path Planning → Vehicle Control → Steer / Gas / Brake
+ Modular + Interpretable − Expert decisions − Piece-wise training
Imitation Learning / Reinforcement Learning: Sensory Input → Neural Network →
Steer / Gas / Brake
+ End-to-end + Simple − Generalization − Interpretability − Data 4
167. Direct Perception
Direct Perception: Sensory Input → Neural Network → Intermediate
Representations → Vehicle Control → Steer / Gas / Brake
Idea of Direct Perception:
I Hybrid model between imitation learning and modular pipelines
I Learn to predict interpretable low-dimensional intermediate representation
I Decouple perception from planning and control
I Allows to exploit classical controllers or learned controllers (or hybrids)
5
168. Direct Perception for Autonomous Driving
Affordances:
I Attributes of the environment which limit space of actions [Gibson, 1966]
I In this case: 13 affordances
Chen, Seff, Kornhauser and Xiao: Learning Affordance for Direct Perception in Autonomous Driving. ICCV, 2015. 6
169. Overview
I TORCS Simulator: Open source car racing game simulator
I Network: AlexNet (5 conv layers, 4 fully conn. layers), 13 output neurons
I Training: Affordance indicators trained with ℓ2 loss
Chen, Seff, Kornhauser and Xiao: Learning Affordance for Direct Perception in Autonomous Driving. ICCV, 2015. 7
170. Affordance Indicators and State Machine
Chen, Seff, Kornhauser and Xiao: Learning Affordance for Direct Perception in Autonomous Driving. ICCV, 2015. 8
171. Controller
Steering controller:
s = θ1(α − dc/w)
I s: steering command θ1: parameter
I α: relative orientation dc: distance to centerline w: road width
Speed controller: (“optimal velocity car following model”)
v = vmax (1 − exp (−θ2 dp − θ3))
I v: target velocity vmax: maximal velocity
I dp: distance to preceding car θ2,3: parameters
Chen, Seff, Kornhauser and Xiao: Learning Affordance for Direct Perception in Autonomous Driving. ICCV, 2015. 9
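The two controller equations above translate directly into code; the parameter values θ1, θ2, θ3, vmax and the example inputs below are made up for illustration:

```python
import math

# Direct transcription of the two controller equations on the slide.
def steering(alpha, d_c, w, theta1=1.0):
    # s = theta1 * (alpha - d_c / w)
    return theta1 * (alpha - d_c / w)

def target_speed(d_p, v_max=20.0, theta2=0.1, theta3=0.5):
    # v = v_max * (1 - exp(-theta2 * d_p - theta3))
    return v_max * (1.0 - math.exp(-theta2 * d_p - theta3))

# Small heading error, slightly right of center: gentle corrective steering.
print(round(steering(0.1, 0.5, 4.0), 3))
# Preceding car far away: target speed approaches v_max.
print(round(target_speed(30.0), 2))
```

The steering law corrects both the relative orientation α and the normalized centerline offset dc/w; the speed law smoothly slows down as the gap dp to the preceding car shrinks.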
172. TORCS Simulator
I TORCS: Open source car racing game http://torcs.sourceforge.net/
Chen, Seff, Kornhauser and Xiao: Learning Affordance for Direct Perception in Autonomous Driving. ICCV, 2015. 10
173. Results
Chen, Seff, Kornhauser and Xiao: Learning Affordance for Direct Perception in Autonomous Driving. ICCV, 2015. 11
174. Network Visualization
I Left: Averaged top 100 images activating a neuron in first fully connected layer
I Right: Maximal response of 4th conv. layer (note: focus on cars and markings)
Chen, Seff, Kornhauser and Xiao: Learning Affordance for Direct Perception in Autonomous Driving. ICCV, 2015. 12
177. Conditional Affordance Learning
Affordances (example values): Relative angle = 0.01 rad, Centerline distance = 0.15 m,
Red light = False, ...
Video Input + Directional Input → Neural Network → Affordances → Controller →
Control Commands (e.g., Brake = 0.0)
Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 15
178. CARLA Simulator
I Goal: drive from A to B as fast, safely and comfortably as possible
I Infractions:
I Driving on wrong lane
I Driving on sidewalk
I Running a red light
I Violating speed limit
I Colliding with vehicles
I Hitting other objects
Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 16
179. Affordances
Affordances:
I Distance to centerline
I Relative angle to road
I Distance to lead vehicle
I Speed signs
I Traffic lights
I Hazard stop
Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 17
180. Affordances
Affordances:
I Distance to centerline
I Relative angle to road
I Distance to lead vehicle
I Speed signs
I Traffic lights
I Hazard stop
[Figure: observation areas A1-A3 around the agent (looking l = 15 m ahead), relative
angle ψ and centerline distance d in local coordinates (x_local, y_local); a hazard stop
(for a pedestrian), a speed sign (30 km/h) and a traffic light are marked along the road]
Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 18
181. Overview
[Figure: CAL agent. CARLA provides the image and position; a high-level planner issues
the directional command; the perception stack (feature extractor over a memory of the
last N frames, feeding conditional and unconditional task blocks) predicts the
affordances; the controller (longitudinal and lateral control) turns affordances and
directional command into control commands for the agent in the environment]
Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 19
182. Controller
Longitudinal Control:
I Finite-state machine with states: cruising, following, over_limit, red_light, hazard_stop
I The state is selected by thresholding the affordances (hazard stop → hazard_stop,
red light → red_light, speed limit exceeded → over_limit, lead vehicle close →
following, otherwise cruising) and determines throttle and brake
I PID controller for cruising
I Car following model
Lateral Control:
I Stanley controller: δ(t) = ψ(t) + arctan(k x(t)/u(t))
I Damping term
Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 20
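The Stanley control law referenced above (without the damping term) can be sketched as follows; the gain k and the example numbers are illustrative:

```python
import math

# Stanley lateral controller: delta(t) = psi(t) + arctan(k * x(t) / u(t)),
# with psi the heading error, x the cross-track error, u the speed, k a gain.
def stanley(psi, x, u, k=2.5):
    return psi + math.atan2(k * x, u)

# Large cross-track error at low speed: the arctan term saturates near pi/2 ...
big = stanley(0.0, 5.0, 1.0)
# ... while at higher speed the same error yields a gentler correction.
fast = stanley(0.0, 5.0, 10.0)
print(big > fast)
```

Dividing by the speed u makes the correction softer at high speed, which avoids oscillations; a pure heading error with zero cross-track error is passed through unchanged.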
183. Parameter Learning
Perception Stack:
I Multi-task learning: single forward pass ⇒ fast learning and inference
I Dataset: random driving using controller operating on GT affordances
⇒ 240k images with GT affordances
I Loss functions:
I Discrete affordances: Class-weighted cross-entropy (CWCE)
I Continuous affordances: Mean absolute error (MAE)
I Optimized with ADAM (batch size 32)
Controller:
I Ziegler-Nichols
Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 21
184. Data Collection
Data Collection:
I Navigation based on ground-truth affordances with random directional inputs
Data Augmentation:
I No image flipping
I Color, contrast, brightness changes
I Gaussian blur and noise
I Provoke rear-end collisions
I Camera pose randomization (camera orientations φ1, φ2, φ3 = 0, laterally
shifted by d = 50 cm)
Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 22
185. Results
Success rate (%):

              Training conditions   New weather           New town              New town & weather
Task          MP   CIL  RL   CAL    MP   CIL  RL   CAL    MP   CIL  RL   CAL    MP   CIL  RL   CAL
Straight      98   95   89   100    100  98   86   100    92   97   74   93     50   80   68   94
One turn      82   89   34   97     95   90   16   96     61   59   12   82     50   48   20   72
Navigation    80   86   14   92     94   84   2    90     24   40   3    70     47   44   6    68
Nav. dynamic  77   83   7    83     89   82   2    82     24   38   2    64     44   42   4    64
Baselines:
I MP = Modular Pipeline [Dosovitskiy et al., CoRL 2017]
I CIL = Conditional Imitation Learning [Codevilla et al., ICRA 2018]
I RL = Reinforcement Learning A3C [Mnih et al., ICML 2016]
Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 23
197. Does Computer Vision Matter for Action?
Does Computer Vision Matter for Action?
I Analyze various intermediate representations:
segmentation, depth, normals, flow, albedo
I Intermediate representations improve results
I Consistent gains across simulations / tasks
I Depth and semantic segmentation provide the largest gains
I Better generalization performance
Zhou, Krähenbühl and Koltun: Does computer vision matter for action? Science Robotics, 2019. 29
198. Visual Abstractions
What is a good visual abstraction?
I Invariant (hide irrelevant variations from policy)
I Universal (applicable to wide range of scenarios)
I Data efficient (in terms of memory/computation)
I Label efficient (require little manual effort)
[Figure: the train/test generalization gap is smaller in representation space than in
pixel space. Figure credit: Alexander Sax]
Semantic segmentation:
I Encodes task-relevant knowledge (e.g. road is drivable) and priors (e.g., grouping)
I Can be processed with standard 2D convolutional policy networks
Disadvantage:
I Labelling time: ∼90 min for 1 Cityscapes image
Zhou, Krähenbühl and Koltun: Does computer vision matter for action? Science Robotics, 2019. 30
199. Label Efficient Visual Abstractions
Questions:
I What is the trade-off between annotation time and driving performance?
I Can selecting specific semantic classes ease policy learning?
I Are visual abstractions trained with few images competitive?
I Is fine-grained annotation important?
I Are visual abstractions able to reduce training variance?
Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efficient Visual Abstractions for Autonomous Driving. IROS, 2020. 31
200. Label Efficient Visual Abstractions
Model:
I Visual abstraction network aψ : s → r (state → abstraction)
I Control policy πθ : (r, c, v) → a (abstraction, command, velocity → action)
I Composing both yields a = πθ(aψ(s)) (state → action)
Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efficient Visual Abstractions for Autonomous Driving. IROS, 2020. 32
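The composition a = πθ(aψ(s)) can be sketched with placeholder networks; the random "segmentation" and the hand-written policy below merely stand in for the trained components:

```python
import numpy as np

# Two-stage model: abstraction network a_psi maps the image to a 6-class
# semantic map r, and the policy maps (r, command, velocity) to an action.
rng = np.random.default_rng(0)

def a_psi(image):
    # Placeholder for a trained segmentation network: random per-pixel logits.
    logits = rng.normal(size=image.shape[:2] + (6,))
    return logits.argmax(axis=-1)              # 6-class semantic map

def pi_theta(r, command, velocity):
    # Toy policy on the abstraction: slow down when many "vehicle" pixels
    # (class id 2 here is an arbitrary choice) are visible.
    density = (r == 2).mean()
    throttle = max(0.0, 0.5 - density - 0.01 * velocity)
    return np.array([command * 0.1, throttle])  # [steer, throttle]

image = rng.normal(size=(32, 32, 3))
action = pi_theta(a_psi(image), command=1, velocity=10.0)
print(action.shape)
```

Because the policy only ever sees the low-dimensional abstraction r, the two stages can be trained from the two separate datasets R and A described on the next slide.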
201. Label Efficient Visual Abstractions
Datasets:
I nr images annotated with semantic labels: R = {(si, ri)}_{i=1}^{nr}
I na images annotated with expert actions: A = {(si, ai)}_{i=1}^{na}
I We assume nr ≪ na (semantic labels are much more expensive to obtain)
Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efficient Visual Abstractions for Autonomous Driving. IROS, 2020. 32
202. Label Efficient Visual Abstractions
Training:
I Train visual abstraction network aψ(·) using the semantic dataset R
I Apply this network to obtain the control dataset Cψ = {(aψ(si), ai)}_{i=1}^{na}
I Train control policy πθ(·) using the control dataset Cψ
Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efficient Visual Abstractions for Autonomous Driving. IROS, 2020. 32
203. Control Policy
Model:
I CILRS [Codevilla et al., ICCV 2019]
Input:
I Visual abstraction r
I Navigational command c
I Vehicle velocity v
Output:
I Action/control â and velocity v̂
Loss:
I L = ||a − â||1 + λ ||v − v̂||1
Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efficient Visual Abstractions for Autonomous Driving. IROS, 2020. 33
204. Visual Abstractions
Privileged Segmentation (14 classes):
I Ground-truth semantic labels for 14 classes
I Upper bound for analysis
Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efficient Visual Abstractions for Autonomous Driving. IROS, 2020. 34
205. Visual Abstractions
Privileged Segmentation (6 classes):
I Ground-truth semantic labels for 2 stuff and 4 object classes
I Upper bound for analysis
Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efficient Visual Abstractions for Autonomous Driving. IROS, 2020. 34
206. Visual Abstractions
Inferred Segmentation (14 classes):
I Segmentation model trained on 14 classes
I ResNet and Feature Pyramid Network (FPN) with segmentation head
Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efficient Visual Abstractions for Autonomous Driving. IROS, 2020. 34
207. Visual Abstractions
Inferred Segmentation (6 classes):
I Segmentation model trained on 2 stuff and 4 object classes
I ResNet and Feature Pyramid Network (FPN) with segmentation head
Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efficient Visual Abstractions for Autonomous Driving. IROS, 2020. 34
208. Visual Abstractions
Hybrid Detection and Segmentation (6 classes):
I Segmentation model trained on 2 stuff classes: road, lane marking
I Object detection trained on 4 object classes: vehicle, pedestrian, traffic light (r/g)
Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efficient Visual Abstractions for Autonomous Driving. IROS, 2020. 34
209. Evaluation
Training Town Test Town
I CARLA 0.8.4 NoCrash benchmark
I Random start and end location
I Metric: Percentage of successfully completed episodes (success rate)
Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efficient Visual Abstractions for Autonomous Driving. IROS, 2020. 35
210. Traffic Density
Empty Regular Dense
I Difficulty varies with number of dynamic agents in the scene
I Empty: 0 Agents Regular: 65 Agents Dense: 220 Agents
Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efficient Visual Abstractions for Autonomous Driving. IROS, 2020. 36
211. Identifying Most Relevant Classes (Privileged)
I 14 classes: road, lane marking, vehicle, pedestrian, green light, red light, sidewalk,
building, fence, pole, vegetation, wall, traffic sign, other
I 7 classes: road, lane marking, vehicle, pedestrian, green light, red light, sidewalk
I 6 classes: road, lane marking, vehicle, pedestrian, green light, red light
I 5 classes: road, vehicle, pedestrian, green light, red light
Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efficient Visual Abstractions for Autonomous Driving. IROS, 2020. 37
212. Identifying Most Relevant Classes (Privileged)
[Figure: stacked bars of success / collision / timeout rates for 5, 6, 7 and 14
classes under Empty, Regular and Dense traffic, and overall]
I Moving from 14 to 6 classes does not hurt driving performance (on the contrary)
I Drastic performance drop when lane markings are removed
Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efficient Visual Abstractions for Autonomous Driving. IROS, 2020. 38
213. Identifying Most Relevant Classes (Privileged)
Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efficient Visual Abstractions for Autonomous Driving. IROS, 2020. 39
214. Identifying Most Relevant Classes (Inferred)
Success rate (%) with standard (inferred) vs. privileged (ground-truth) segmentation:

           Empty          Regular        Dense          Overall
Classes    Std.  Priv.    Std.  Priv.    Std.  Priv.    Std.  Priv.
6          83    100      64    76       30    26       59    67
14         74    86       56    72       19    24       50    61
I Small performance drop when using inferred segmentations
I 6-class representation consistently improves upon 14-class representation
I We use the 6-class representation for all following experiments
Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efficient Visual Abstractions for Autonomous Driving. IROS, 2020. 40
218. Driving Policy Transfer
Problem:
I Driving policies learned in simulation often do not transfer well to the real world
Idea:
I Encapsulate driving policy such that it is not directly exposed to raw perceptual
input or low-level control (input: semantic segmentation, output: waypoints)
I Allows for transferring driving policy without retraining or finetuning
Müller, Dosovitskiy, Ghanem and Koltun: Driving Policy Transfer via Modularity and Abstraction. CoRL, 2018. 44
219. Waypoint Representation
Representation:
I Input: Semantic segmentation (per pixel “road” vs. “non-road”)
I Output: 2 waypoints (distance to vehicle, relative angle wrt. vehicle heading)
I One waypoint suffices for steering; the second allows braking before turns
Müller, Dosovitskiy, Ghanem and Koltun: Driving Policy Transfer via Modularity and Abstraction. CoRL, 2018. 45
220. Results
Success Rate over 25 Navigation Trials
I Driving Policy: Conditional Imitation Learning (branched)
I Control: PID controller for lateral and longitudinal control
I Results: Full method generalizes best (“+” = with data augmentation)
Müller, Dosovitskiy, Ghanem and Koltun: Driving Policy Transfer via Modularity and Abstraction. CoRL, 2018. 46
223. Online vs. Offline Evaluation
I Online evaluation (i.e., using a real vehicle) is expensive and can be dangerous
I Offline evaluation on a pre-recorded validation dataset is cheap and easy
I Question: How predictive is offline evaluation (a) for the online task (b)?
I Empirical study using CIL on CARLA trained with MSE loss on steering angle
Codevilla, Lopez, Koltun and Dosovitskiy: On Offline Evaluation of Vision-based Driving Models. ECCV, 2018. 49
224. Online Metrics
I Success Rate: Percentage of routes successfully completed
I Average Completion: Average fraction of distance to goal covered
I Km per Infraction: Average driven distance between 2 infractions
Remark: The current CARLA metrics Infraction Score and Driving Score are not
considered in this work from 2018, but would likely lead to similar conclusions.
Codevilla, Lopez, Koltun and Dosovitskiy: On Offline Evaluation of Vision-based Driving Models. ECCV, 2018. 50
225. Offline Metrics
I a/â: true/predicted steering angle |V|: number of samples in the validation set
I v: speed δ(·): Kronecker delta function θ(·): Heaviside step function
I Q(x) ∈ {−1, 0, 1}: quantization with threshold σ:
Q(x) = −1 if x < −σ, Q(x) = 0 if −σ ≤ x < σ, Q(x) = +1 if x ≥ σ
Codevilla, Lopez, Koltun and Dosovitskiy: On Offline Evaluation of Vision-based Driving Models. ECCV, 2018. 51
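The quantized classification metric can be sketched as follows; the exact handling of the bin boundaries and the threshold value are assumptions:

```python
# Three-way quantization for the classification-style offline metric:
# steering values are binned into left / straight / right with threshold sigma,
# and a prediction counts as an error if it lands in a different bin.
def quantize(x, sigma):
    if x < -sigma:
        return -1          # left
    if x >= sigma:
        return 1           # right
    return 0               # straight

def quantized_error(a_true, a_pred, sigma=0.1):
    pairs = zip(a_true, a_pred)
    return sum(quantize(a, sigma) != quantize(b, sigma)
               for a, b in pairs) / len(a_true)

# One of the three (made-up) samples lands in a different bin.
err = quantized_error([0.3, -0.2, 0.05], [0.25, 0.02, -0.04])
print(err)
```

Unlike MSE, this metric ignores small deviations within a bin and only penalizes qualitatively wrong steering decisions, which is closer to what matters for online driving.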
226. Results: Online vs. Online
I Generalization performance (Town 2, new weather); circle radius encodes the training iteration
I 45 different models varying dataset size, augmentation, architecture, etc.
I Success rate correlates well with average completion and km per infraction
Codevilla, Lopez, Koltun and Dosovitskiy: On Offline Evaluation of Vision-based Driving Models. ECCV, 2018. 52
227. Results: Online vs. Offline
I None of the metrics correlates well; Mean Squared Error (MSE) performs worst
I Absolute steering error improves the correlation; speed weighting is not important
Codevilla, Lopez, Koltun and Dosovitskiy: On Offline Evaluation of Vision-based Driving Models. ECCV, 2018. 53
228. Results: Online vs. Offline
I Accumulating the error over time does not improve the correlation
I Quantized classification and thresholded relative error perform best
Codevilla, Lopez, Koltun and Dosovitskiy: On Offline Evaluation of Vision-based Driving Models. ECCV, 2018. 54
229. Case Study
I Model 1: Trained with a single camera and ℓ2 loss (= bad model)
I Model 2: Trained with three cameras and ℓ1 loss (= good model)
I Predictions of both models are noisy, but Model 1 occasionally makes very large
errors that lead to crashes, even though the average prediction error is similar
Codevilla, Lopez, Koltun and Dosovitskiy: On Offline Evaluation of Vision-based Driving Models. ECCV, 2018. 55
230. Case Study
I Model 1 crashes in every trial, while Model 2 drives successfully
I Illustrates the difficulty of using offline metrics for predicting online behavior
Codevilla, Lopez, Koltun and Dosovitskiy: On Offline Evaluation of Vision-based Driving Models. ECCV, 2018. 56
231. Summary
I Direct perception predicts intermediate representations
I Low-dimensional affordances or classic computer vision representations (e.g.,
semantic segmentation, depth) can be used as intermediate representations
I Decouples perception from planning and control
I Hybrid model between imitation learning and modular pipelines
I Direct methods are more interpretable as the representation can be inspected
I Effective visual abstractions can be learned using limited supervision
I Planning can also be decoupled from control for better transfer
I Offline metrics are not necessarily indicative of online driving performance
57
235. Reinforcement Learning
So far:
I Supervised learning, lots of expert demonstrations required
I Use of auxiliary, short-term loss functions
I Imitation learning: per-frame loss on action
I Direct perception: per-frame loss on affordance indicators
Now:
I Learning of models based on the loss that we actually care about, e.g.:
I Minimize time to target location
I Minimize number of collisions
I Minimize risk
I Maximize comfort
I etc.
Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017. 4
236. Types of Learning
Supervised Learning:
I Dataset: {(xi, yi)} (xi = data, yi = label) Goal: Learn mapping x → y
I Examples: Classification, regression, imitation learning, affordance learning, etc.
Unsupervised Learning:
I Dataset: {xi} (xi = data) Goal: Discover structure underlying the data
I Examples: Clustering, dimensionality reduction, feature learning, etc.
Reinforcement Learning:
I Agent interacting with environment which provides numeric reward signals
I Goal: Learn how to take actions in order to maximize reward
I Examples: Learning of manipulation or control tasks (everything that interacts)
Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017. 5
237. Introduction to Reinforcement Learning
Agent
Environment
State st Action at
Reward rt
Next state st+1
I Agent observes environment state st at time t
I Agent sends action at at time t to the environment
I Environment returns the reward rt and its new state st+1 to the agent
Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017. 6
238. Introduction to Reinforcement Learning
I Goal: Select actions to maximize total future reward
I Actions may have long term consequences
I Reward may be delayed, not instantaneous
I It may be better to sacrifice immediate reward to gain more long-term reward
I Examples:
I Financial investment (may take months to mature)
I Refuelling a helicopter (might prevent crash in several hours)
I Sacrificing a chess piece (might help winning chances in the future)
Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017. 7
239. Example: Cart Pole Balancing
I Objective: Balance pole on moving cart
I State: Angle, angular vel., position, vel.
I Action: Horizontal force applied to cart
I Reward: 1 if pole is upright at time t
https://gym.openai.com/envs/#classic_control 8
240. Example: Robot Locomotion
http://blog.openai.com/roboschool/
I Objective: Make robot move forward
I State: Position and angle of joints
I Action: Torques applied on joints
I Reward: 1 if upright forward moving
https://gym.openai.com/envs/#mujoco 9
241. Example: Atari Games
http://blog.openai.com/gym-retro/
I Objective: Maximize game score
I State: Raw pixels of screen (210x160)
I Action: Left, right, up, down
I Reward: Score increase/decrease at t
https://gym.openai.com/envs/#atari 10
243. Example: Self-Driving
I Objective: Lane Following
I State: Image (96x96)
I Action: Acceleration, Steering
I Reward: negative per frame, positive per visited track tile
https://gym.openai.com/envs/CarRacing-v0/ 12
245. Markov Decision Process
Markov Decision Process (MDP) models the environment and is defined by the tuple
(S, A, R, P, γ)
with
I S : set of possible states
I A: set of possible actions
I R(rt|st, at): distribution of current reward given (state,action) pair
I P(st+1|st, at): distribution over next state given (state,action) pair
I γ: discount factor (determines value of future rewards)
Almost all reinforcement learning problems can be formalized as MDPs
14
246. Markov Decision Process
Markov property: Current state completely characterizes state of the world
I A state st is Markov if and only if
P(st+1|st) = P(st+1|s1, ..., st)
I ”The future is independent of the past given the present”
I The state captures all relevant information from the history
I Once the state is known, the history may be thrown away
I The state is a sufficient statistic of the future
15
247. Markov Decision Process
Reinforcement learning loop:
I At time t = 0:
I Environment samples initial state s0 ∼ P(s0)
I Then, for t = 0 until done:
I Agent selects action at
I Environment samples reward rt ∼ R(rt|st, at)
I Environment samples next state st+1 ∼ P(st+1|st, at)
I Agent receives reward rt and next state st+1
Agent
Environment
at
st
rt
st+1
How do we select an action?
16
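The interaction loop above, instantiated for a toy hand-written environment with a random policy; everything here is a stand-in for a real simulator:

```python
import random

# Toy 1-D "reach the goal" environment: states 0..10, reward 1 at state 10.
class ToyEnv:
    def __init__(self):
        self.s = 0
    def step(self, a):
        # Environment samples next state and reward given (state, action).
        self.s = max(0, min(10, self.s + a))
        return (1 if self.s == 10 else 0), self.s

random.seed(0)
env, s, total = ToyEnv(), 0, 0
for t in range(100):
    a = random.choice([-1, +1])   # agent selects action a_t
    r, s = env.step(a)            # environment returns r_t and s_{t+1}
    total += r                    # agent accumulates reward
print(total >= 0)
```

A random policy answers the question "how do we select an action?" in the worst possible way; the following slides develop principled answers via value functions.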
248. Policy
A policy π is a function from S to A that specifies what action to take in each state:
I A policy fully defines the behavior of an agent
I Deterministic policy: a = π(s)
I Stochastic policy: π(a|s) = P(at = a|st = s)
Remark:
I MDP policies depend only on the current state and not the entire history
I However, the current state may include past observations
17
249. Policy
How do we learn a policy?
Imitation Learning: Learn a policy from expert demonstrations
I Expert demonstrations are provided
I Supervised learning problem
Reinforcement Learning: Learn a policy through trial-and-error
I No expert demonstrations given
I Agent discovers itself which actions maximize the expected future reward
I The agent interacts with the environment and obtains reward
I The agent discovers good actions and improves its policy π
18
250. Exploration vs. Exploitation
How do we discover good actions?
Answer: We need to explore the state/action space. Thus RL combines two tasks:
I Exploration: Try a novel action a in state s and observe the reward rt
I Discovers more information about the environment, but sacrifices total reward
I Game-playing example: Play a novel experimental move
I Exploitation: Use a previously discovered good action a
I Exploits known information to maximize reward, but sacrifices unexplored areas
I Game-playing example: Play the move you believe is best
Trade-off: It is important to explore and exploit simultaneously
19
251. Exploration vs. Exploitation
How to balance exploration and exploitation?
ε-greedy exploration algorithm:
I Try all possible actions with non-zero probability
I With probability ε choose an action at random (exploration)
I With probability 1 − ε choose the best action (exploitation)
I The greedy action is defined as the best action discovered so far
I ε is large initially and gradually annealed (= reduced) over time
20
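A minimal sketch of ε-greedy selection with annealing; the Q-values and the decay schedule are illustrative:

```python
import random

# epsilon-greedy action selection over a small table of Q-values.
def epsilon_greedy(q_values, epsilon, rng):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

rng = random.Random(0)
q = [0.1, 0.9, 0.3]      # action 1 is the (known) best action
epsilon = 1.0
greedy_picks = 0
for step in range(1000):
    a = epsilon_greedy(q, epsilon, rng)
    greedy_picks += (a == 1)
    epsilon = max(0.05, epsilon * 0.995)  # anneal epsilon, keep a small floor
print(greedy_picks > 500)
```

Early on the agent explores almost uniformly; as ε is annealed, exploitation of the best-known action dominates, while the floor of 0.05 keeps every action reachable.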
253. Value Functions
How good is a state?
The state-value function V^π at state st is the expected cumulative discounted reward
(rt ∼ R(rt|st, at)) when following policy π from state st:
V^π(st) = E[rt + γ rt+1 + γ² rt+2 + . . . | st, π] = E[Σ_{k≥0} γ^k rt+k | st, π]
I The discount factor γ < 1 is the value of future rewards at the current time t
I Weights immediate reward higher than future reward
(e.g., γ = 1/2 ⇒ γ^k = 1, 1/2, 1/4, 1/8, 1/16, . . . )
I Determines the agent's far/short-sightedness
I Avoids infinite returns in cyclic Markov processes
22
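The discounted sum inside the expectation can be computed directly for a finite reward sequence, illustrating how γ trades off immediate against future reward:

```python
# Discounted return sum_k gamma^k * r_{t+k} for a short reward sequence.
def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1, 1, 1, 1]
print(discounted_return(rewards, 0.5))  # 1 + 0.5 + 0.25 + 0.125 = 1.875
print(discounted_return(rewards, 1.0))  # undiscounted sum: 4.0
```

With γ = 1/2 the fourth reward contributes only 1/8 as much as the first, matching the γ^k sequence on the slide; γ = 1 weights all rewards equally.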
254. Value Functions
How good is a state-action pair?
The action-value function Q^π at state st and action at is the expected cumulative
discounted reward when taking action at in state st and then following the policy π:
Q^π(st, at) = E[Σ_{k≥0} γ^k rt+k | st, at, π]
I The discount factor γ ∈ [0, 1] is the value of future rewards at the current time t
I Weights immediate reward higher than future reward
(e.g., γ = 1/2 ⇒ γ^k = 1, 1/2, 1/4, 1/8, 1/16, . . . )
I Determines the agent's far/short-sightedness
I Avoids infinite returns in cyclic Markov processes
23
255. Optimal Value Functions
The optimal state-value function V∗(st) is the best V^π(st) over all policies π:
V∗(st) = max_π V^π(st),  V^π(st) = E[Σ_{k≥0} γ^k rt+k | st, π]
The optimal action-value function Q∗(st, at) is the best Q^π(st, at) over all policies π:
Q∗(st, at) = max_π Q^π(st, at),  Q^π(st, at) = E[Σ_{k≥0} γ^k rt+k | st, at, π]
I The optimal value functions specify the best possible performance in the MDP
I However, searching over all possible policies π is computationally intractable
24
256. Optimal Policy
If Q∗(st, at) were known, what would be the optimal policy?
π∗(st) = argmax_{a′∈A} Q∗(st, a′)
I Unfortunately, searching over all possible policies π is intractable in most cases
I Thus, determining Q∗(st, at) is hard in general (for most interesting problems)
I Let's have a look at a simple example where the optimal policy is easy to compute
25
257. A Simple Grid World Example
I Actions: right, left, up, down
I States: cells of the grid; two marked cells are terminal states
I Reward: r = −1 for each transition
Objective: Reach one of the terminal states in the least number of actions
I A penalty (negative reward) is given for every transition made
26
258. A Simple Grid World Example
[Figure: the grid world under a random policy (left) and under the optimal policy (right)]
I The arrows indicate equal probability of moving into each of the directions
27
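For a grid world of this kind the optimal policy is easy to compute with value iteration, i.e., by repeatedly applying the Bellman optimality backup. The sketch below assumes a 4×4 grid with terminal states in two opposite corners, reward r = −1 per transition and γ = 1; the grid size and terminal placement are illustrative assumptions, not taken from the figure.

```python
# Value iteration on a small grid world (terminal corners, r = -1, gamma = 1).
import numpy as np

N = 4
terminals = {(0, 0), (N - 1, N - 1)}
actions = [(0, 1), (0, -1), (-1, 0), (1, 0)]  # right, left, up, down

V = np.zeros((N, N))
for _ in range(100):  # apply the Bellman optimality backup until convergence
    V_new = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if (i, j) in terminals:
                continue  # terminal states keep value 0
            best = -np.inf
            for di, dj in actions:
                ni, nj = i + di, j + dj
                if not (0 <= ni < N and 0 <= nj < N):
                    ni, nj = i, j  # bumping into a wall leaves the state unchanged
                best = max(best, -1.0 + V[ni, nj])  # r = -1, gamma = 1
            V_new[i, j] = best
    if np.allclose(V, V_new):
        break
    V = V_new

print(V)  # V*(s) = -(number of steps to the nearest terminal state)
```

The greedy policy with respect to the converged V is exactly the optimal policy from the slide: each cell points toward its nearest terminal state.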
260. Bellman Optimality Equation
I The Bellman Optimality Equation is
named after Richard Ernest Bellman who
introduced dynamic programming in 1953
I Almost any problem which can be solved
using optimal control theory can be solved
via the appropriate Bellman equation
Richard Ernest Bellman
Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017. 29
261. Bellman Optimality Equation
The Bellman Optimality Equation (BOE) decomposes Q∗ as follows:

Q∗(st, at) = E[ rt + γ rt+1 + γ² rt+2 + . . . | st, at ]
           = E[ rt + γ max_{a′∈A} Q∗(st+1, a′) | st, at ]   (BOE)

This recursive formulation comprises two parts:
I Current reward: rt
I Discounted optimal action-value of successor: γ max_{a′∈A} Q∗(st+1, a′)
We want to determine Q∗(st, at). How can we solve the BOE?
I The BOE is non-linear (because of max-operator) ⇒ no closed form solution
I Several iterative methods have been proposed, most popular: Q-Learning
Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017. 30
262. Proof of the Bellman Optimality Equation
Proof of the Bellman Optimality Equation for the optimal action-value function Q∗:
Q∗(st, at) = E[ rt + γ rt+1 + γ² rt+2 + . . . | st, at ]
           = E[ Σ_{k≥0} γ^k rt+k | st, at ]
           = E[ rt + γ Σ_{k≥0} γ^k rt+k+1 | st, at ]
           = E[ rt + γ V ∗(st+1) | st, at ]
           = E[ rt + γ max_{a′} Q∗(st+1, a′) | st, at ]
Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017. 31
263. Bellman Optimality Equation
Why is it useful to solve the BOE?
I A greedy policy which chooses the action that maximizes the optimal
action-value function Q∗ or the optimal state-value function V ∗ takes
into account the reward consequences of all possible future behavior
I Via Q∗ and V ∗ the optimal expected long-term return is turned into a quantity
that is locally and immediately available for each state / state-action pair
I For V ∗, a one-step-ahead search yields the optimal actions
I Q∗ effectively caches the results of all one-step-ahead searches
Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017. 32
264. Q-Learning
Q-Learning: Iteratively solve for Q∗
Q∗(st, at) = E[ rt + γ max_{a′∈A} Q∗(st+1, a′) | st, at ]

by constructing an update sequence Q1, Q2, . . . using learning rate α:

Qi+1(st, at) ← (1 − α) Qi(st, at) + α ( rt + γ max_{a′∈A} Qi(st+1, a′) )
             = Qi(st, at) + α ( rt + γ max_{a′∈A} Qi(st+1, a′) − Qi(st, at) )

where the term in parentheses is the temporal difference (TD) error: the target
rt + γ max_{a′∈A} Qi(st+1, a′) minus the prediction Qi(st, at).

I Qi will converge to Q∗ as i → ∞. Note: the policy π is learned implicitly via the Q table!
Watkins and Dayan: Technical Note Q-Learning. Machine Learning, 1992. 33
265. Q-Learning
Implementation:
I Initialize Q table and initial state s0 randomly
I Repeat:
I Observe state st, choose action at according to an ε-greedy strategy
(Q-Learning is “off-policy” as the updated policy is different from the behavior policy)
I Observe reward rt and next state st+1
I Compute TD error: rt + γ max_{a′∈A} Qi(st+1, a′) − Qi(st, at)
I Update Q table
What’s the problem with using Q tables?
I Scalability: Tables don’t scale to high dimensional state/action spaces (e.g., GO)
I Solution: Use a function approximator (neural network) to represent Q(s, a)
Watkins and Dayan: Technical Note Q-Learning. Machine Learning, 1992. 34
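The implementation steps above fit in a few lines of Python. The environment used here is a toy chain MDP (states 0–4, terminal state 4, reward −1 per step) introduced purely for illustration; the values of α, γ and ε are likewise assumptions.

```python
# Tabular Q-learning with an eps-greedy behavior policy on a toy chain MDP.
import random

n_states, n_actions = 5, 2           # actions: 0 = left, 1 = right
gamma, alpha, eps = 0.9, 0.5, 0.1
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, -1.0, s_next == n_states - 1   # next state, reward, done

random.seed(0)
for _ in range(500):                 # episodes
    s, done = 0, False
    while not done:
        # eps-greedy: explore with probability eps, otherwise act greedily
        if random.random() < eps:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda a_: Q[s][a_])
        s_next, r, done = step(s, a)
        # TD target uses the max over next actions; zero for terminal states
        target = r + (0.0 if done else gamma * max(Q[s_next]))
        Q[s][a] += alpha * (target - Q[s][a])     # TD-error update
        s = s_next

greedy = [max(range(n_actions), key=lambda a_: Q[s_][a_]) for s_ in range(n_states - 1)]
print(greedy)  # [1, 1, 1, 1] -- the implicitly learned policy: always move right
```

Note that the greedy policy is read off the Q table at the end; it is never represented explicitly during learning.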
267. Deep Q-Learning
Use a deep neural network with weights θ as function approximator to estimate Q:
Q(s, a; θ) ≈ Q∗(s, a)

[Diagram: left — a network with weights θ maps a state-action pair (s, a) to a single
value Q(s, a; θ); right — a network with weights θ maps the state s to all action
values Q(s, a1; θ), . . . , Q(s, am; θ) in one forward pass]
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 36
268. Training the Q Network
Forward Pass:
Loss function is the mean-squared error in Q-values:
L(θ) = E[ ( rt + γ max_{a′} Q(st+1, a′; θ) − Q(st, at; θ) )² ]

where rt + γ max_{a′} Q(st+1, a′; θ) is the target and Q(st, at; θ) is the prediction.

Backward Pass:
Gradient update with respect to Q-function parameters θ:

∇θ L(θ) = ∇θ E[ ( rt + γ max_{a′} Q(st+1, a′; θ) − Q(st, at; θ) )² ]
Optimize objective end-to-end with stochastic gradient descent (SGD) using ∇θL(θ).
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 37
269. Experience Replay
To speed up training, we would like to train on mini-batches:
I Problem: Learning from consecutive samples is inefficient
I Reason: Strong correlations between consecutive samples
Experience replay stores agent’s experiences at each time-step
I Continually update a replay memory D with new experiences et = (st, at, rt, st+1)
I Train on samples (st, at, rt, st+1) ∼ U(D) drawn uniformly at random from D
I Breaks correlations between samples
I Improves data efficiency as each sample can be used multiple times
In practice, a circular replay memory of finite memory size is used.
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 38
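A circular replay memory as described above can be sketched with a bounded deque; the capacity and toy transitions are illustrative.

```python
# Minimal circular replay memory: store (s_t, a_t, r_t, s_{t+1}) tuples and
# sample uniform mini-batches to break correlations between samples.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity):
        # a deque with maxlen implements the finite circular buffer
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform sampling over stored experience
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=3)
for t in range(5):   # old experiences are evicted once capacity is reached
    memory.push(t, 0, -1.0, t + 1)
print(len(memory.buffer))  # 3 -- only the most recent transitions remain
```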
270. Fixed Q Targets
Problem: Non-stationary targets
I As the policy changes, so do our targets: rt + γ max_{a′} Q(st+1, a′; θ)
I This may lead to oscillation or divergence
Solution: Use fixed Q targets to stabilize training
I A target network Q with weights θ− is used to generate the targets:

L(θ) = E_{(st,at,rt,st+1)∼U(D)} [ ( rt + γ max_{a′} Q(st+1, a′; θ−) − Q(st, at; θ) )² ]
I Target network Q is only updated every C steps by cloning the Q-network
I Effect: Reduces oscillation of the policy by adding a delay
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 39
271. Putting it together
Deep Q-Learning using experience replay and fixed Q targets:
I Take action at according to an ε-greedy policy
I Store transition (st, at, rt, st+1) in replay memory D
I Sample random mini-batch of transitions (st, at, rt, st+1) from D
I Compute Q targets using old parameters θ−
I Optimize MSE between Q targets and Q network predictions

L(θ) = E_{st,at,rt,st+1∼D} [ ( rt + γ max_{a′} Q(st+1, a′; θ−) − Q(st, at; θ) )² ]
using stochastic gradient descent.
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 40
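One target-and-loss computation of this algorithm can be sketched with NumPy. For illustration only, the online and target networks are replaced by linear functions Q(s, ·) = W s; the weights, batch contents and all shapes are assumptions, not the architecture from the paper.

```python
# One DQN loss computation over a mini-batch, with a frozen target network.
import numpy as np

rng = np.random.default_rng(0)
n_actions, state_dim, batch = 3, 4, 5
W = rng.normal(size=(n_actions, state_dim))    # online weights (stand-in for theta)
W_target = W.copy()                            # frozen target weights theta^-

s = rng.normal(size=(batch, state_dim))        # states s_t from the mini-batch
a = rng.integers(0, n_actions, size=batch)     # actions a_t
r = -np.ones(batch)                            # rewards r_t
s_next = rng.normal(size=(batch, state_dim))   # successor states s_{t+1}
gamma = 0.99

# Q target: r_t + gamma * max_a' Q(s_{t+1}, a'; theta^-), via the frozen network
q_next = s_next @ W_target.T                   # shape (batch, n_actions)
target = r + gamma * q_next.max(axis=1)

# prediction: Q(s_t, a_t; theta), one value per sampled transition
pred = (s @ W.T)[np.arange(batch), a]

loss = np.mean((target - pred) ** 2)           # MSE to be minimized by SGD
print(round(float(loss), 3))
```

In a real implementation the gradient of this loss is taken only through the prediction term; the target is treated as a constant because θ− is frozen.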
272. Case Study: Playing Atari Games
[Figure: the agent-environment interaction loop, with Atari game frames as observations]
Objective: Complete the game with the highest score
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 41
273. Case Study: Playing Atari Games
Q(s, a; θ): Neural network with weights θ
FC-Out (Q values)
FC-256
32 4x4 conv, stride 2
16 8x8 conv, stride 2
Input: 84 × 84 × 4 stack of last 4 frames
(after grayscale conversion, downsampling, cropping)
Output: Q values for all (4 to 18) Atari actions
(efficient: single forward pass computes Q for all actions)
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 42
274. Case Study: Playing Atari Games
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 43
275. Deep Q-Learning Shortcomings
Deep Q-Learning suffers from several shortcomings:
I Long training times
I Uniform sampling from replay buffer ⇒ all transitions equally important
I Simplistic exploration strategy
I Action space is limited to a discrete set of actions
(otherwise, expensive test-time optimization required)
Various improvements over the original algorithm have been explored.
44
276. Deep Deterministic Policy Gradients
DDPG addresses the problem of continuous action spaces.
Problem: Finding a continuous action requires optimization at every timestep.
Solution: Use two networks, an actor (deterministic policy) and a critic.
[Diagram: the actor network with weights θµ maps state s to action µ(s; θµ);
the critic network with weights θQ maps s and a = µ(s; θµ) to Q(s, a; θQ)]
Lillicrap et al.: Continuous Control with Deep Reinforcement Learning. ICLR, 2016. 45
277. Deep Deterministic Policy Gradients
I Actor network with weights θµ estimates agent’s deterministic policy µ(s; θµ)
I Update deterministic policy µ(·) in direction that most improves Q
I Apply chain rule to the expected return (this is the policy gradient):
∇θµ E_{st,at,rt,st+1∼D} [ Q(st, µ(st; θµ); θQ) ] = E[ ∇at Q(st, at; θQ) · ∇θµ µ(st; θµ) ]
I Critic estimates value of current policy Q(s, a; θQ)
I Learned using the Bellman Optimality Equation as in Q Learning:
∇θQ E_{st,at,rt,st+1∼D} [ ( rt + γ Q(st+1, µ(st+1; θµ−); θQ−) − Q(st, at; θQ) )² ]
I Remark: No maximization over actions required as this step is now learned via µ(·)
Lillicrap et al.: Continuous Control with Deep Reinforcement Learning. ICLR, 2016. 46
278. Deep Deterministic Policy Gradients
Experience replay and target networks are again used to stabilize training:
I Replay memory D stores transition tuples (st, at, rt, st+1)
I Target networks are updated using “soft” target updates
I Weights are not directly copied but slowly adapted:

θQ− ← τ θQ + (1 − τ) θQ−
θµ− ← τ θµ + (1 − τ) θµ−

where 0 < τ < 1 controls the tradeoff between speed and stability of learning
Exploration is performed by adding noise N to the policy µ(s):

µ(s; θµ) + N
Lillicrap et al.: Continuous Control with Deep Reinforcement Learning. ICLR, 2016. 47
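The soft target update can be sketched as follows; the value of τ, the weight vectors and the number of updates are illustrative assumptions.

```python
# "Soft" target updates: slowly blend online weights into the target weights.
import numpy as np

tau = 0.001
theta = np.array([1.0, 2.0, 3.0])    # online network weights
theta_target = np.zeros(3)           # target network weights

for _ in range(1000):
    # theta^- <- tau * theta + (1 - tau) * theta^-
    theta_target = tau * theta + (1 - tau) * theta_target

print(theta_target)  # slowly approaches theta as updates accumulate
```

After 1000 updates the target weights have covered roughly 63% of the gap to the online weights (1 − (1 − τ)^1000 ≈ 0.63), illustrating why small τ delays but stabilizes learning.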
279. Prioritized Experience Replay
Prioritize experience to replay important transitions more frequently
I Priority δ is measured by magnitude of temporal difference (TD) error:
δ = rt + γ max_{a′} Q(st+1, a′; θQ−) − Q(st, at; θQ)
I TD error measures how “surprising” or unexpected the transition is
I Stochastic prioritization avoids overfitting due to lack of diversity
I Enables learning speed-up by a factor of 2 on Atari benchmarks
Schaul et al.: Prioritized Experience Replay. ICLR, 2016. 48
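Stochastic prioritization can be sketched by sampling transitions with probability proportional to a power of their TD-error magnitude. The exponent and the toy priorities below are assumptions; the actual method additionally uses importance-sampling weights to correct the induced bias.

```python
# Stochastic prioritized sampling: P(i) proportional to |delta_i|^alpha.
import random

random.seed(0)
td_errors = [0.1, 2.0, 0.5, 4.0]   # |delta| for four stored transitions
alpha = 0.6                         # interpolates uniform (0) and greedy (large)
prios = [e ** alpha for e in td_errors]
total = sum(prios)
probs = [p / total for p in prios]

counts = [0] * 4
for _ in range(10000):
    i = random.choices(range(4), weights=probs)[0]
    counts[i] += 1
print(counts)  # the transition with the largest TD error is replayed most often
```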
280. Learning to Drive in a Day
Real-world RL demo by Wayve:
I Deep Deterministic Policy Gradients
with Prioritized Experience Replay
I Input: Single monocular image
I Action: Steering and speed
I Reward: Distance traveled without
the safety driver taking control
(requires no maps / localization)
I 4 Conv layers, 2 FC layers
I Only 35 training episodes
Kendall, Hawke, Janz, Mazur, Reda, Allen, Lam, Bewley and Shah: Learning to Drive in a Day. ICRA, 2019. 49
281. Learning to Drive in a Day
Kendall, Hawke, Janz, Mazur, Reda, Allen, Lam, Bewley and Shah: Learning to Drive in a Day. ICRA, 2019. 50
283. Asynchronous Deep Reinforcement Learning
Execute multiple agents in separate environment instances:
I Each agent interacts with its own environment copy and collects experience
I Agents may use different exploration policies to maximize experience diversity
I Experience is not stored but directly used to update a shared global model
I Stabilizes training in similar way to experience replay by decorrelating samples
I Leads to reduction in training time roughly linear in the number of parallel agents
Mnih et al.: Asynchronous Methods for Deep Reinforcement Learning. ICML, 2016. 52
284. Bootstrapped DQN
Bootstrapping for efficient exploration:
I Approximate a distribution over Q values via K bootstrapped “heads”
I At the start of each epoch, a single head Qk is selected uniformly at random
I After training, all heads can be combined into a single ensemble policy

[Diagram: a shared network body with weights θshared processes state s; K heads with
weights θQ1, . . . , θQK output Q1, . . . , QK]
Osband et al.: Deep Exploration via Bootstrapped DQN. NIPS, 2016. 53
285. Double Q-Learning
Double Q-Learning
I Decouple Q function for selection and evaluation of actions
to avoid Q overestimation and stabilize training. Target:
DQN target:        rt + γ max_{a′} Q(st+1, a′; θ−)
Double DQN target: rt + γ Q(st+1, argmax_{a′} Q(st+1, a′; θ); θ−)

I Online network with weights θ is used to determine the greedy policy
I Target network with weights θ− is used to determine the corresponding action value
I Improves performance on Atari benchmarks
van Hasselt et al.: Deep Reinforcement Learning with Double Q-learning. AAAI, 2016. 54
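The difference between the two targets can be made concrete with a toy example; the Q-values, reward and γ below are illustrative assumptions.

```python
# Standard DQN target vs. Double DQN target: the online network selects the
# argmax action, the target network evaluates it.
import numpy as np

gamma, r = 0.99, -1.0
q_online = np.array([1.0, 5.0, 2.0])   # Q(s_{t+1}, . ; theta), online network
q_target = np.array([3.0, 0.5, 2.5])   # Q(s_{t+1}, . ; theta^-), target network

dqn_target = r + gamma * q_target.max()           # max taken under theta^-
a_star = int(q_online.argmax())                   # selection: online network
double_dqn_target = r + gamma * q_target[a_star]  # evaluation: target network

print(dqn_target, double_dqn_target)  # 1.97  -0.505
```

Because the action maximizing one network need not maximize the other, the Double DQN target is never larger than the DQN target, which counteracts the overestimation bias.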
286. Deep Recurrent Q-Learning
Add recurrency to a deep Q-network to handle partial observability of states:
FC-Out (Q-values)
LSTM
32 4x4 conv, stride 2
16 8x8 conv, stride 2
Replace fully-connected layer with recurrent LSTM layer
Hausknecht and Stone: Deep Recurrent Q-Learning for Partially Observable MDPs. AAAI, 2015 55
288. Summary
I Reinforcement learning learns through interaction with the environment
I The environment is typically modeled as a Markov Decision Process
I The goal of RL is to maximize the expected future reward
I Reinforcement learning requires trading off exploration and exploitation
I Q-Learning iteratively solves for the optimal action-value function
I The policy is learned implicitly via the Q table
I Deep Q-Learning scales to continuous/high-dimensional state spaces
I Deep Deterministic Policy Gradients scales to continuous action spaces
I Experience replay and target networks are necessary to stabilize training
57
293. Kinematics vs. Kinetics
Kinematics:
I Greek origin: “motion”, “moving”
I Describes motion of points and bodies
I Considers position, velocity, acceleration, ..
I Examples: Celestial bodies, particle
systems, robotic arm, human skeleton
Kinetics:
I Describes causes of motion
I Effects of forces/moments
I Newton’s laws, e.g., F = ma
6
294. Holonomic Constraints
Holonomic constraints are constraints on the configuration:
I Assume a particle in three dimensions (x, y, z) ∈ R3
I We can constrain the particle to the x/y plane via:

z = 0  ⇔  f(x, y, z) = 0 with f(x, y, z) = z
I Constraints of the form f(x, y, z) = 0 are called holonomic constraints
I They constrain the configuration space
I But the system can move freely in that space
I Controllable degrees of freedom equal total degrees of freedom (2)
7
295. Non-Holonomic Constraints
Non-Holonomic constraints are constraints on the velocity:
I Assume a vehicle that is parameterized by (x, y, ψ) ∈ R2 × [0, 2π]
I The 2D vehicle velocity is given by:
ẋ = v cos(ψ)
ẏ = v sin(ψ)
⇒ ẋ sin(ψ) − ẏ cos(ψ) = 0
I This non-holonomic constraint cannot be expressed in the form f(x, y, ψ) = 0
I The car cannot freely move in any direction (e.g., sideways)
I It constrains the velocity space, but not the configuration space
I Controllable degrees of freedom less than total degrees of freedom (2 vs. 3)
8
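The constraint can be verified numerically: for any speed v and heading ψ, velocities of the form ẋ = v cos(ψ), ẏ = v sin(ψ) satisfy ẋ sin(ψ) − ẏ cos(ψ) = 0. A minimal check (the sample values of v and ψ are arbitrary):

```python
# Numerical check of the non-holonomic constraint of the vehicle model.
import math

for v in (0.0, 1.0, 13.9):
    for psi in (0.0, 0.7, math.pi / 2, 4.0):
        x_dot = v * math.cos(psi)           # xdot = v cos(psi)
        y_dot = v * math.sin(psi)           # ydot = v sin(psi)
        residual = x_dot * math.sin(psi) - y_dot * math.cos(psi)
        assert abs(residual) < 1e-12        # constraint holds for every heading
print("constraint satisfied")
```

Conversely, a sideways velocity (ẋ, ẏ) not aligned with the heading ψ gives a non-zero residual, which is exactly the motion the constraint forbids.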
296. Holonomic vs. Non-Holonomic Systems
Holonomic Systems
I Constrain configuration space
I Can freely move in any direction
I Controllable degrees of freedom
equal to total degrees of freedom
I Constraints can be described by
f(x1, . . . , xN ) = 0
Example:
3D Particle
z = 0
x/y plane
Nonholonomic Systems
I Constrain velocity space
I Cannot freely move in any direction
I Controllable degrees of freedom less
than total degrees of freedom
I Constraints cannot be described by
f(x1, . . . , xN ) = 0
Example:
Car
ẋ sin(ψ) − ẏ cos(ψ) = 0
9
297. Holonomic vs. Non-Holonomic Systems
I A robot can be subject to both holonomic and non-holonomic constraints
I A car (rigid body in 3D) is kept on the ground by 3 holonomic constraints
I One additional non-holonomic constraint prevents sideways sliding
10
298. Coordinate Systems
[Figure: inertial frame, vehicle frame attached at the vehicle reference point, and
horizontal frame whose x/y axes are projections onto the horizontal plane]
I Inertial Frame: Fixed to earth with vertical Z-axis and X/Y horizontal plane
I Vehicle Frame: Attached to vehicle at fixed reference point; xv points towards
the front, yv to the side and zv to the top of the vehicle (ISO 8855)
I Horizontal Frame: Origin at vehicle reference point (like vehicle frame) but x-
and y-axes are projections of xv- and yv-axes onto the X/Y horizontal plane
11
299. Kinematics of a Point
The position rP (t) ∈ R3 of point P at time t ∈ R is given by 3 coordinates.
Velocity and acceleration are the first and second derivatives of the position rP (t).
rP (t) = ( x(t), y(t), z(t) )>
vP (t) = ṙP (t) = ( ẋ(t), ẏ(t), ż(t) )>
aP (t) = r̈P (t) = ( ẍ(t), ÿ(t), z̈(t) )>

[Figure: trajectory of point P]
12
300. Kinematics of a Rigid Body
A rigid body refers to a collection of infinitely many infinitesimally small mass points
which are rigidly connected, i.e., their relative position remains unchanged over time.
Its motion can be compactly described by the motion of an (arbitrary) reference point
C of the body plus the relative motion of all other points P with respect to C.
I C: Reference point fixed to rigid body
I P: Arbitrary point on rigid body
I ω: Angular velocity of rigid body
I Position: rP = rC + rCP
I Velocity: vP = vC + ω × rCP
I Due to rigidity, points P can only rotate wrt. C
I Thus a rigid body has 6 DoF (3 pos., 3 rot.)
13
301. Instantaneous Center of Rotation
At each time instance t ∈ R, there exists a particular reference point O (called the
instantaneous center of rotation) for which vO(t) = 0. Each point P of the rigid body
performs a pure rotation about O:
vP = vO + ω × rOP = ω × rOP
Example 1: Turning Wheel
I Wheel is completely lifted off the ground
I Wheel does not move in x or y direction
I Ang. vel. vector ω points into x/y plane
I Velocity of point P: vP = ωR with radius R
14
302. Instantaneous Center of Rotation
At each time instance t ∈ R, there exists a particular reference point O (called the
instantaneous center of rotation) for which vO(t) = 0. Each point P of the rigid body
performs a pure rotation about O:
vP = vO + ω × rOP = ω × rOP
Example 2: Rolling Wheel
I Wheel is rolling on the ground without slip
I Ground is fixed in x/y plane
I Ang. vel. vector ω points into x/y plane
I Velocity of point P: vP = 2ωR with radius R
14
304. Rigid Body Motion
Rotation Center
I Different points on the rigid body move along different circular trajectories
16
305. Kinematic Bicycle Model
I The kinematic bicycle model approximates the 4 wheels with 2 imaginary wheels
17
307. Kinematic Bicycle Model
[Figure: bicycle model showing the rotation center, wheelbase, vehicle/front-wheel/
back-wheel velocities, slip angle, front and back steering angles, center of gravity,
heading angle, turning radius and course angle]

Assumptions:
- Planar motion (no roll, no pitch)
- Low speed ⇒ no wheel slip
(wheel orientation = wheel velocity direction)
I The kinematic bicycle model approximates the 4 wheels with 2 imaginary wheels
17
308. Kinematic Bicycle Model
Model
[Figure: bicycle model with rotation center]
Motion Equations

Ẋ = v cos(ψ + β)
Ẏ = v sin(ψ + β)
ψ̇ = v cos(β) / (ℓf + ℓr) · (tan(δf ) − tan(δr))
β = tan−1( (ℓf tan(δr) + ℓr tan(δf )) / (ℓf + ℓr) )
(proof as exercise)
18
309. Kinematic Bicycle Model
Model
[Figure: bicycle model with rotation center]
Motion Equations

Ẋ = v cos(ψ + β)
Ẏ = v sin(ψ + β)
ψ̇ = v cos(β) / (ℓf + ℓr) · tan(δ)
β = tan−1( ℓr tan(δ) / (ℓf + ℓr) )
(only front steering)

Derivation:
tan δ = (ℓf + ℓr) / R0  ⇒  1/R0 = tan δ / (ℓf + ℓr)  ⇒  tan β = ℓr / R0 = ℓr tan δ / (ℓf + ℓr)
cos β = R0 / R  ⇒  1/R = cos β / R0  ⇒  ψ̇ = ω = v/R = v cos(β) / R0 = v cos(β) / (ℓf + ℓr) · tan(δ)
18
311. Kinematic Bicycle Model
Model
[Figure: bicycle model with rotation center]
Motion Equations

Xt+1 = Xt + v cos(ψt) ∆t
Yt+1 = Yt + v sin(ψt) ∆t
ψt+1 = ψt + v δ / (ℓf + ℓr) ∆t
(time discretized model)
19
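The time-discretized model can be rolled out with simple Euler steps; the axle distances, speed, steering angle and step size below are illustrative assumptions.

```python
# Euler rollout of the time-discretized kinematic bicycle model
# (constant speed v and front steering angle delta, small-angle heading update).
import math

lf, lr = 1.2, 1.6        # distances CoG -> front/rear axle [m]
v, delta = 5.0, 0.1      # speed [m/s], front steering angle [rad]
dt = 0.01                # integration step [s]

X, Y, psi = 0.0, 0.0, 0.0
for _ in range(1000):    # simulate 10 s
    X += v * math.cos(psi) * dt
    Y += v * math.sin(psi) * dt
    psi += v * delta / (lf + lr) * dt

print(round(X, 2), round(Y, 2), round(psi, 3))
```

With constant inputs the model traces a circular arc: the heading grows linearly (ψ̇ = vδ/(ℓf + ℓr) ≈ 0.179 rad/s here), and (X, Y) stays on a circle of radius v/ψ̇ up to Euler discretization error.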
312. Ackermann Steering Geometry
Front Steering Angles
Turning Radius
Rotation Center
Wheelbase
Track
I In practice, the left and right wheel steering angles must differ if wheel slip is to be avoided
I The combination of admissible steering angles is called the Ackermann steering geometry
I If the angles are small, the left/right steering wheel angles can be approximated:

δl = tan−1( L / (R + 0.5B) ) ≈ L / (R + 0.5B)
δr = tan−1( L / (R − 0.5B) ) ≈ L / (R − 0.5B)
20
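The two angles and their small-angle approximations can be computed directly; the wheelbase L, track B and turning radius R values below are illustrative assumptions.

```python
# Ackermann steering angles: the wheel on the smaller turning radius needs
# the larger steering angle; for small angles, atan(x) is close to x.
import math

L, B, R = 2.8, 1.6, 20.0                 # wheelbase, track, turning radius [m]
delta_l = math.atan(L / (R + 0.5 * B))   # wheel on the larger radius
delta_r = math.atan(L / (R - 0.5 * B))   # wheel on the smaller radius

approx_l = L / (R + 0.5 * B)             # small-angle approximation
approx_r = L / (R - 0.5 * B)
print(round(delta_l, 4), round(delta_r, 4))
print(round(approx_l, 4), round(approx_r, 4))
```

For a turning radius much larger than the track, the exact and approximate angles agree to within a fraction of a degree, which is why the linear approximation is common in simple steering models.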