
This presentation gives a theoretical overview of self-driving car technology. It begins by tracing the history of autonomous vehicles and then covers a range of methodologies, including imitation learning, the direct perception approach, end-to-end deep learning, and strategies employing reinforcement learning. It also examines Vehicle Dynamics, Vehicle Control, Odometry, SLAM (Simultaneous Localization and Mapping), Road and Lane Detection, 3D Reconstruction, Motion Planning, Object Detection, Object Tracking, and Decision Making and Planning.

- 1. Self-Driving Cars Lecture 1 – Introduction Robotics, Computer Vision, System Software BE, MS, PhD (MMMTU, IISc, IIIT-Hyderabad) Kumar Bipin
- 2. Self-Driving - A Human Dream 2
- 3. Agenda 1.1 Organization 1.2 Introduction 1.3 History of Self-Driving 5
- 5. Contents Goal: Develop an understanding of the capabilities and limitations of autonomous driving solutions and gain a basic understanding of the entire system comprising perception, planning and vehicle control. Training agents in simple environments. I History of self-driving cars I End-to-end learning for self-driving (imitation/reinforcement learning) I Modular approaches to self-driving I Perception (camera, lidar, radar) I Localization (with visual and road maps) I Navigation and path planning I Vehicle models and control algorithms 8
- 6. Prerequisites Linear Algebra: I Vectors: x, y ∈ ℝⁿ I Matrices: A, B ∈ ℝ^(m×n) I Operations: Aᵀ, A⁻¹, Tr(A), det(A), A + B, AB, Ax, xᵀy I Norms: ‖x‖₁, ‖x‖₂, ‖x‖∞, ‖A‖_F I SVD: A = UDVᵀ 44
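The norms and the SVD identity listed above can be checked numerically. A minimal sketch with NumPy; the matrix and vector values are illustrative choices, not from the slides:

```python
import numpy as np

# Hypothetical 2x3 matrix and 3-vector to illustrate the prerequisite operations.
A = np.array([[3.0, 0.0, 4.0],
              [0.0, 5.0, 0.0]])
x = np.array([1.0, -2.0, 2.0])

# Norms: ||x||_1, ||x||_2, ||x||_inf, ||A||_F
l1 = np.linalg.norm(x, 1)         # sum of absolute values -> 5.0
l2 = np.linalg.norm(x, 2)         # Euclidean length -> 3.0
linf = np.linalg.norm(x, np.inf)  # largest absolute entry -> 2.0
frob = np.linalg.norm(A, 'fro')   # Frobenius norm

# SVD: A = U D V^T (NumPy returns the singular values as a vector)
U, D, Vt = np.linalg.svd(A, full_matrices=False)
A_rec = U @ np.diag(D) @ Vt
assert np.allclose(A, A_rec)  # reconstruction recovers A exactly
```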
- 7. Prerequisites Probability and Information Theory: I Probability distributions: P(X = x) I Marginal/conditional: p(x) = ∫ p(x, y) dy, p(x, y) = p(x|y) p(y) I Bayes rule: p(x|y) = p(y|x) p(x) / p(y) I Conditional independence: x ⊥⊥ y | z ⇔ p(x, y|z) = p(x|z) p(y|z) I Expectation: E_{x∼p}[f(x)] = ∫ p(x) f(x) dx I Variance: Var(f(x)) = E[(f(x) − E[f(x)])²] I Distributions: Bernoulli, Categorical, Gaussian, Laplace I Entropy: H(x), KL divergence: D_KL(p‖q) 45
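The marginalization, Bayes-rule and expectation identities above can be verified on a small discrete example. A sketch assuming a made-up joint distribution over two binary variables:

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over two binary variables,
# used to verify marginalization, Bayes rule and expectation numerically.
p_xy = np.array([[0.1, 0.3],   # rows: x in {0, 1}
                 [0.2, 0.4]])  # cols: y in {0, 1}

p_x = p_xy.sum(axis=1)  # marginal p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)  # marginal p(y) = sum_x p(x, y)

# Conditionals and Bayes rule: p(x|y) = p(y|x) p(x) / p(y)
p_x_given_y = p_xy / p_y              # broadcast over columns
p_y_given_x = p_xy / p_x[:, None]    # broadcast over rows
bayes = p_y_given_x * p_x[:, None] / p_y
assert np.allclose(bayes, p_x_given_y)

# Expectation E[f(x)] = sum_x p(x) f(x), here with f(x) = x
E_x = (p_x * np.array([0, 1])).sum()
```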
- 8. Thank You! Looking forward to our discussions
- 11. Road Fatalities in 2017 I USA: 32,700, Germany: 3,300, World: 1,300,000 I Main factors: speeding, intoxication, distraction, etc. 49
- 12. Benefits of Autonomous Driving I Lower risk of accidents I Provide mobility for elderly people and people with disabilities I In the US, 45% of people with disabilities still work I Decrease pollution for a healthier environment I New ways of public transportation I Car pooling I Car sharing I Reduce the number of cars (95% of the time a car is parked) 50
- 14. Uber Fatal Accident (2018) 52
- 15. Self-driving is Hard Human performance: 1 fatality per 100 million miles Error rate to improve on: 0.000001% Challenges: I Snow, heavy rain, night I Unstructured roads, parking lots I Pedestrians, erratic behavior I Reflections, dynamics I Rare and unseen events I Merging, negotiating, reasoning I Ethics: what is good behavior? I Legal questions http://theoatmeal.com/blog/google_self_driving_car 53
- 17. The Trolley Problem (1905) Thought experiment: I You observe a train that will kill 5 people on the rail tracks if it continues I You have the option to pull a lever to redirect the train to another track I However, the train will kill one (other) person on that alternate track I What is your decision? What is the correct/ethical decision? 55
- 18. The MIT Moral Machine http://moralmachine.mit.edu/ 56
- 20. The Automobile
- 21. 1886: Benz Patent-Motorwagen Nummer 1 59
- 22. 1886: Benz Patent-Motorwagen Nummer 1 I Benz 954 cc single-cylinder four-stroke engine (500 watts) I Weight: 100 kg (engine), 265 kg (total) I Maximum speed: 16 km/h I Consumption: 10 liters / 100 km (!) I Construction based on the tricycle, with many bicycle components I 29.1.1886: patent filed I 3.7.1886: first public test drive in Mannheim I 2.11.1886: patent granted, but investors stayed skeptical I First long-distance trip (106 km) by Bertha Benz in 1888 with Motorwagen Nummer 3 (without her husband's knowledge) fostered commercial interest I First gas station: a pharmacy in Wiesloch near Heidelberg 59
- 25. 1925: Phantom Auto – “American Wonder” (Houdina Radio Control) In the summer of 1925, Houdina’s driverless car, called the American Wonder, traveled along Broadway in New York City—trailed by an operator in another vehicle—and down Fifth Avenue through heavy traffic. It turned corners, sped up, slowed down and honked its horn. Unfortunately, the demonstration ended when the American Wonder crashed into another vehicle filled with photographers documenting the event. (Discover Magazine) https://www.discovermagazine.com/technology/the-driverless-car-era-began-more-than-90-years-ago 61
- 26. 1939: Futurama – New York World’s Fair I Exhibit at the New York World’s Fair in 1939, sponsored by General Motors I Designed by Norman Bel Geddes: his vision of the world 20 years later (1960) I Radio-controlled electric cars, guided by an electromagnetic field from circuits embedded in the roadway I The fair’s #1 exhibit, very well received (Great Depression era), prototypes by RCA and GM https://www.youtube.com/watch?v=sClZqfnWqmc 62
- 27. 1956: General Motors Firebird II https://www.youtube.com/watch?v=cPOmuvFostY 63
- 30. 1960: RCA Labs’ Wire Controlled Car Aeromobile https://spectrum.ieee.org/selfdriving-cars-were-just-around-the-cornerin-1960 64
- 31. 1970: Citroen DS19 I Steered by sensing magnetic cables in the road, up to 130 km/h https://www.youtube.com/watch?v=MwdjM2Yx3gU 65
- 32. 1986: Navlab 1 I Vision-based navigation Jochem, Pomerleau, Kumar and Armstrong: PANS: A Portable Navigation Platform. IV, 1995. 66
- 33. Navlab Overview I Project at Carnegie Mellon University, USA I 1986: Navlab 1: 5 computer racks (Warp supercomputer) I 1988: First semi-autonomous drive at 20 mph I 1990: Navlab 2: 6 mph offroad, 70 mph highway driving I 1995: Navlab 5: “No Hands Across America” (2,850 miles, 98% autonomy) I PANS: Portable Advanced Navigation Support I Compute: 50 MHz Sparc workstation (only 90 watts) I Main focus: lane keeping (lateral but no longitudinal control, i.e., steering but no gas/brake) I Position estimation: Differential GPS + fibre-optic gyroscope (IMU) I Low-level control: HC11 microcontroller Jochem, Pomerleau, Kumar and Armstrong: PANS: A Portable Navigation Platform. IV, 1995. 67
- 34. 1988: ALVINN ALVINN: An Autonomous Land Vehicle in a Neural Network I Forward-looking, vision based driving I Fully connected neural network maps road images to vehicle turn radius I Directions discretized (45 bins) I Trained on simulated road images I Tested on unlined paths, lined city streets and interstate highways I 90 consecutive miles at up to 70 mph Pomerleau: ALVINN: An Autonomous Land Vehicle in a Neural Network. NIPS, 1988. 68
- 35. 1988: ALVINN Pomerleau: ALVINN: An Autonomous Land Vehicle in a Neural Network. NIPS, 1988. 69
- 36. 1995: AURORA AURORA: Automotive Run-Off-Road Avoidance System I Downward-looking camera (mounted at the side) I Adjustable template correlation I Tracks solid or dashed lane markings I Shown to perform robustly even when the markings are worn or their appearance in the image is degraded I Mainly tested as a lane departure warning system (“time to crossing”) Chen, Jochem and Pomerleau: AURORA: A Vision-Based Roadway Departure Warning System. IROS, 1995. 70
- 37. 1986: VaMoRs – Bundeswehr Universität Munich I Developed by Ernst Dickmanns in the context of EUREKA-Prometheus (€800 million) (PROgraMme for a European Traffic of Highest Efficiency and Unprecedented Safety, 1987–1995) I Demonstration to Daimler-Benz Research 1986 in Stuttgart I Longitudinal and lateral guidance with lateral acceleration feedback I Speed: 0 to 36 km/h 71
- 39. 1994: VaMP – Bundeswehr Universität Munich I 2nd generation: Transputer (60 processors), bifocal saccadic vision, no GPS I 1,678 km autonomous ride from Munich to Odense, 95% autonomy (up to 158 km at a stretch) I Autonomous driving speed record: 180 km/h (lane keeping) I Convoy driving, automatic lane change (triggered by human) 72
- 40. 1992: Summary Paper by Dickmanns Dickmanns and Mysliwetz: Recursive 3-D Road and Relative Ego-State Recognition. PAMI, 1992. 73
- 41. 1995: Invention of Adaptive Cruise Control (ACC) I 1992: Lidar-based distance control by Mitsubishi (throttle control and downshifting) I 1997: Laser adaptive cruise control by Toyota (throttle control and downshifting) I 1999: Distronic radar-assisted ACC by Mercedes-Benz (S-Class), level 1 autonomy 74
- 42. 2000: First Technological Revolution: GPS, IMUs, Maps I NAVSTAR GPS available with 1 meter accuracy, IMUs improve up to 5 cm I Navigation systems and road maps become available I Accurate self-localization and ego-motion estimation algorithms 75
- 43. 2004: DARPA Grand Challenge 1 (Limited to US Participants) I 1st competition in the Mojave Desert along a 240 km route, $1 million prize money I No traffic, dirt roads, driven by GPS waypoints (2,935 points, up to 4 per curve) I None of the robot vehicles finished the route; CMU traveled the farthest, completing 11.78 km of the course before hitting a rock 76
- 44. 2005: DARPA Grand Challenge 2 (Limited to US Participants) I 2nd competition in the Mojave Desert along a 212 km route, $2 million prize money I Five teams finished (Stanford team 1st in 6:54 h, CMU team 2nd in 7:05 h) 77
- 45. 2006: Park Shuttle Rotterdam I 1,800 meter route from metro station Kralingse Zoom to business park Rivium I One of the first truly driverless cars, but with a dedicated lane and localization via magnets 78
- 46. 2006: Second Technological Revolution: Lidars, High-res Sensors I High-resolution lidar I Camera systems with increasing resolution I Accurate 3D reconstruction, 3D detection, 3D localization 79
- 47. 2007: DARPA Urban Challenge (International Participants) I 3rd competition at George Air Force Base, 96 km route, urban driving, $2 million prize I Rules: obey traffic laws, negotiate, avoid obstacles, merge into traffic I 11 US teams received $1 million funding for their research I Winners: CMU 1st (4:10 h), Stanford’s Junior 2nd (4:29 h); no non-US team placed 80
- 48. 2009: Google starts working on Self-Driving Car I Led by Sebastian Thrun, former director of the Stanford AI Lab and the Stanley team I Others: Chris Urmson, Dmitri Dolgov, Mike Montemerlo, Anthony Levandowski I Renamed “Waymo” in 2016 (Google spent $1 billion through 2015) https://waymo.com/ 81
- 49. 2010: VisLab Intercontinental Autonomous Challenge (VIAC) I July 20 to October 28: 16,000 km trip from Parma, Italy to Shanghai, China I The second vehicle automatically followed the route defined by the leader vehicle, either visually or via GPS waypoints sent by the lead vehicle Broggi, Medici, Zani, Coati and Panciroli: Autonomous vehicles control in the VisLab Intercontinental Autonomous Challenge. Annu. Rev. Control, 2012. 82
- 51. 2010: Pikes Peak Self-Driving Audi TTS I Pikes Peak International Hill Climb (since 1916): 20 km, 1,440 m of elevation gain, summit at 4,300 m I The autonomous Audi TTS completed the track in 27 min (human record in 2010: 10 min, now: 8 min) https://www.youtube.com/watch?v=Arx8qWx9CFk 83
- 52. 2010: Stadtpilot (Technical University Braunschweig) I Goal: geofenced inner-city driving based on laser scanners, cameras and HD maps I Challenges: traffic lights, roundabouts, etc. Similar efforts by FU Berlin and others Saust, Wille, Lichte and Maurer: Autonomous Vehicle Guidance on Braunschweig’s inner ring road within the Stadtpilot Project. IV, 2011. 84
- 53. 2012: Third Technological Revolution: Deep Learning I Representation learning boosts accuracy across tasks and benchmarks Güler et al.: DensePose: Dense Human Pose Estimation In The Wild. CVPR, 2018. 85
- 54. 2012: Third Technological Revolution: New Benchmarks Geiger, Lenz and Urtasun: Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. CVPR, 2012. 86
- 55. 2013: Mercedes-Benz S500 Intelligent Drive I Autonomous ride on the historic Bertha Benz route by Daimler R&D and KIT/FZI I Novelty: close-to-production stereo cameras / radar (but requires HD maps) Ziegler et al.: Making Bertha Drive - An Autonomous Journey on a Historic Route. IEEE Intell. Transp. Syst. Mag., 2014. 87
- 56. 2014: Mercedes S-Class Advanced ADAS (Level 2 Autonomy): I Autonomous steering, lane keeping, acceleration/braking, collision avoidance, driver fatigue monitoring in city traffic and at highway speeds up to 200 km/h 88
- 57. 2014: Society of Automotive Engineers: SAE Levels of Autonomy I Lateral control = steering, Longitudinal control = gas/brake 89
- 58. Disengagements per 1,000 miles (California Dept. of Motor Vehicles, 2017) 90
- 59. 2015: Uber starts Self-Driving Research I Uber hires 50 robotics researchers and academics from CMU; the unit was shut down in 2020 91
- 60. 2016: OTTO I Self-driving truck company, bought by Uber for $625 million, later shut down 92
- 61. 2015: Tesla Model S Autopilot Tesla Autopilot 2015 (Level 2 Autonomy): I Lane keeping for limited-access highways (hands-off time: 30–120 seconds) I Does not read traffic signals or traffic signs, and does not detect pedestrians/cyclists 93
- 62. 2016: Tesla Model S Autopilot: Fatal Accident 1 94
- 63. 2018: Tesla Model X Autopilot: Fatal Accident 2 95
- 64. 2018: Tesla Model X Autopilot: Fatal Accident 2 The National Transportation Safety Board (NTSB) said that four seconds before the 23 March crash on a highway in Silicon Valley, which killed Walter Huang, 38, the car stopped following the path of a vehicle in front of it. Three seconds before the impact, it sped up from 62mph to 70.8mph, and the car did not brake or steer away, the NTSB said. After the fatal crash in the city of Mountain View, Tesla noted that the driver had received multiple warnings to put his hands on the wheel and said he did not intervene during the five seconds before the car hit the divider. But the NTSB report revealed that these alerts were made more than 15 minutes before the crash. In the 60 seconds prior to the collision, the driver also had his hands on the wheel on three separate occasions, though not in the final six seconds, according to the agency. As the car headed toward the barrier, there was no precrash braking or evasive steering movement, the report added. The Guardian (June, 2018) 96
- 65. 2018: Waymo (formerly Google) announced Public Service I In 2018: driving without safety driver in a geofenced district of Phoenix I By 2021: also in suburbs of Arizona, San Francisco and Mountain View 97
- 66. 2018: Nuro Last-mile Delivery I Founded by two of the Google self-driving car engineers 98
- 67. Self-Driving Industry I NVIDIA: Supplier of self-driving hardware and software I Waabi: Startup by Raquel Urtasun (formerly Uber) I Aurora: Startup by Chris Urmson (formerly CMU, Google, Waymo) I Argo AI: Startup by Bryan Salesky (now Ford/Volkswagen) I Zoox: Startup by Jesse Levinson (now Amazon) I Cruise: Startup by Kyle Vogt (now General Motors) I NuTonomy: Startup by Doug Parker (now Delphi/Aptiv) I Efforts in China: Baidu Apollo, AutoX, Pony.AI I Comma.ai: Custom open-source dashcam to retrofit any vehicle I Wayve: Startup focusing on end-to-end self-driving 99
- 69. Business Models Autonomous or nothing (Google, Apple, Uber): I Very risky, only a few companies can do this I Long-term goals Introduce technology little by little (all car companies): I The car industry is very conservative I ADAS as an intermediate goal I Sharp transition: how to keep the driver engaged? 101
- 70. Wild Predictions about the Future of Self-Driving 102
- 71. Summary I Self-driving has a long history I Today's highway lane-keeping was developed over 30 years ago I Increased robustness ⇒ introduction of level 3 for highways in 2019 I Increased interest after the DARPA challenges and new benchmarks (e.g., KITTI) I Many claims about full self-driving (e.g., Elon Musk), but level 4/5 remains hard I Waymo introduced the first public service at the end of 2018 (with safety driver) I Waymo/Tesla seem ahead of the competition in full self-driving, but there is no winner yet I But several setbacks (Uber, Tesla accidents) I Most existing systems require laser scanners and HD maps (exception: Tesla) I Driving as an engineering problem, quite different from human cognition 103
- 72. Self-Driving Cars Lecture 2 – Imitation Learning Robotics, Computer Vision, System Software BE, MS, PhD (MMMTU, IISc, IIIT-Hyderabad) Kumar Bipin
- 73. Agenda 2.1 Approaches to Self-Driving 2.2 Deep Learning Recap 2.3 Imitation Learning 2.4 Conditional Imitation Learning 2
- 75. Autonomous Driving Steer Gas Brake Sensory Input Mapping Function Dominating Paradigms: I Modular Pipelines I End-to-End Learning (Imitation Learning, Reinforcement Learning) I Direct Perception 4
- 76. Autonomous Driving: Modular Pipeline Steer Gas Brake Sensory Input Modular Pipeline Path Planning Vehicle Control Scene Parsing Low-level Perception Examples: I [Montemerlo et al., JFR 2008] I [Urmson et al., JFR 2008] I Waymo, Uber, Tesla, Zoox, ... 5
- 77. Autonomous Driving: Modular Pipeline Steer Gas Brake Sensory Input Modular Pipeline Path Planning Vehicle Control Scene Parsing Low-level Perception 6
- 82. Autonomous Driving: Modular Pipeline Steer Gas Brake Sensory Input Modular Pipeline Path Planning Vehicle Control Scene Parsing Low-level Perception Pros: I Small components, easy to develop in parallel I Interpretability Cons: I Piece-wise training (not jointly) I Localization and planning rely heavily on HD maps HD maps: centimeter-precision lanes, markings, traffic lights/signs, human annotated 7
- 83. Autonomous Driving: Modular Pipeline I Piece-wise training is difficult: not all objects are equally important! Ohn-Bar and Trivedi: Are All Objects Equal? Deep Spatio-Temporal Importance Prediction in Driving Videos. PR, 2017. 8
- 84. Autonomous Driving: Modular Pipeline I HD maps are expensive to create (data collection and annotation effort) https://www.geospatialworld.net/article/hd-maps-autonomous-vehicles/ 9
- 85. Autonomous Driving: End-to-End Learning Steer Gas Brake Sensory Input Imitation Learning / Reinforcement Learning Neural Network Examples: I [Pomerleau, NIPS 1989] I [Bojarski, Arxiv 2016] I [Codevilla et al., ICRA 2018] 10
- 86. Autonomous Driving: End-to-End Learning Steer Gas Brake Sensory Input Imitation Learning / Reinforcement Learning Neural Network Pros: I End-to-end training I Cheap annotations Cons: I Training / Generalization I Interpretability 10
- 87. Autonomous Driving: Direct Perception Steer Gas Brake Sensory Input Direct Perception Intermediate Representations Vehicle Control Neural Network Examples: I [Chen et al., ICCV 2015] I [Sauer et al., CoRL 2018] I [Behl et al., IROS 2020] 11
- 88. Autonomous Driving: Direct Perception Steer Gas Brake Sensory Input Direct Perception Intermediate Representations Vehicle Control Neural Network Pros: I Compact Representation I Interpretability Cons: I Control typically not learned jointly I How to choose representations? 11
- 90. Supervised Learning Input Output Model I Learning: Estimate parameters w from training data {(xᵢ, yᵢ)}ᵢ₌₁ᴺ I Inference: Make novel predictions: y = f_w(x) 13
- 91. Linear Classification Logistic Regression: ŷ = σ(wᵀx + w₀) with σ(z) = 1 / (1 + e⁻ᶻ) I Let x ∈ ℝ² I Decision boundary: wᵀx + w₀ = 0 I Decide for class 1 ⇔ wᵀx > −w₀ I Decide for class 0 ⇔ wᵀx < −w₀ I Which problems can we solve? [Figure: sigmoid σ(z) with the 0.5 decision boundary separating class 0 and class 1] 14
- 92. Linear Classification Linear Classifier: Class 1 ⇔ wᵀx > −w₀ OR(x1, x2): (0,0) → 0, (0,1) → 1, (1,0) → 1, (1,1) → 1 Solution: w = (1, 1)ᵀ, −w₀ = 0.5, i.e., class 1 ⇔ x1 + x2 > 0.5 15
- 93. Linear Classification Linear Classifier: Class 1 ⇔ wᵀx > −w₀ AND(x1, x2): (0,0) → 0, (0,1) → 0, (1,0) → 0, (1,1) → 1 Solution: w = (1, 1)ᵀ, −w₀ = 1.5 16
- 94. Linear Classification Linear Classifier: Class 1 ⇔ wᵀx > −w₀ NAND(x1, x2): (0,0) → 1, (0,1) → 1, (1,0) → 1, (1,1) → 0 Solution: w = (−1, −1)ᵀ, −w₀ = −1.5 17
- 95. Linear Classification Linear Classifier: Class 1 ⇔ wᵀx > −w₀ XOR(x1, x2): (0,0) → 0, (0,1) → 1, (1,0) → 1, (1,1) → 0 No solution: no choice of w and −w₀ separates the classes (not linearly separable) 18
- 96. Linear Classification Linear classifier with non-linear features ψ: ψ(x) = (x1, x2, x1x2), class 1 ⇔ wᵀψ(x) > −w₀ Feature table (x1, x2 → ψ1, ψ2, ψ3 | XOR): (0,0) → 0,0,0 | 0; (0,1) → 0,1,0 | 1; (1,0) → 1,0,0 | 1; (1,1) → 1,1,1 | 0 I Non-linear features allow a linear classifier to solve non-linear classification problems! 19
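This can be made concrete in a few lines. A minimal sketch with hand-picked weights (an illustrative solution, not given on the slides): on the feature map ψ(x) = (x1, x2, x1x2), the score x1 + x2 − 2·x1x2 with threshold 0.5 reproduces XOR:

```python
import numpy as np

# Linear classifier on the non-linear feature map psi(x) = (x1, x2, x1*x2).
# Weights are one hand-picked solution: score = x1 + x2 - 2*x1*x2 > 0.5.
def xor_classifier(x1, x2):
    psi = np.array([x1, x2, x1 * x2], dtype=float)
    w = np.array([1.0, 1.0, -2.0])
    return int(w @ psi > 0.5)

# The linear classifier on psi matches the XOR truth table exactly.
for (x1, x2), target in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    assert xor_classifier(x1, x2) == target
```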
- 97. Representation Matters [Figure: the same data shown in Cartesian coordinates (x, y) and in polar coordinates (r, θ) — linearly separable only in polar coordinates] I But how to choose the transformation? Can be very hard in practice. I Yet, this was the dominant approach until the 2000s (vision, speech, ..) I In deep/representation learning we want to learn these transformations 20
- 98. Non-Linear Classification Linear Classifier: Class 1 ⇔ wᵀx > −w₀ XOR(x1, x2): (0,0) → 0, (0,1) → 1, (1,0) → 1, (1,1) → 0 XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)) [Figure: the OR and NAND decision boundaries jointly carving out the XOR regions] 21
- 99. Non-Linear Classification XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)) The above expression can be rewritten as a program of logistic regressors: h1 = σ(w_ORᵀ x + w_OR,0), h2 = σ(w_NANDᵀ x + w_NAND,0), ŷ = σ(w_ANDᵀ h + w_AND,0) Note that h(x) is a non-linear feature of x. We call h(x) a hidden layer. 22
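The program of logistic regressors can be written out directly. A sketch with hand-set weights (scaled by 4 so the sigmoids saturate enough for the AND layer; this scaling is an illustrative choice, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)) as stacked logistic regressors.
# Weights scaled by 4 so the hidden sigmoids saturate near 0/1.
def xor_net(x):
    h1 = sigmoid(np.array([4.0, 4.0]) @ x - 2.0)    # OR regressor
    h2 = sigmoid(np.array([-4.0, -4.0]) @ x + 6.0)  # NAND regressor
    y = sigmoid(4.0 * h1 + 4.0 * h2 - 6.0)          # AND on hidden layer h
    return int(y > 0.5)

for x, target in [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]:
    assert xor_net(np.array(x, dtype=float)) == target
```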
- 100. Multi-Layer Perceptrons I MLPs are feedforward neural networks (no feedback connections) I They compose several non-linear functions f(x) = ŷ(h3(h2(h1(x)))) where hᵢ(·) are called hidden layers and ŷ(·) is the output layer I The data specifies only the behavior of the output layer (thus the name “hidden”) I Each layer i comprises multiple neurons j, implemented as affine transformations (aᵀx + b) followed by non-linear activation functions g: h_ij = g(a_ijᵀ h_{i−1} + b_ij) I Each neuron in each layer is fully connected to all neurons of the previous layer I The overall length of the chain is the depth of the model ⇒ “Deep Learning” 23
- 101. MLP Network Architecture Input Layer Hidden Layer 1 Hidden Layer 2 Hidden Layer 3 Output Layer Network Depth = #Computation Layers = 4 Layer Width = #Neurons in Layer I Neurons are grouped into layers, each neuron fully connected to all prev. ones I Hidden layer hi = g(Aihi−1 + bi) with activation function g(·) and weights Ai, bi 24
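The layer rule h_i = g(A_i h_{i−1} + b_i) above amounts to a short forward-pass loop. A minimal sketch with random weights; the layer widths (3 → 4 → 4 → 2) and the ReLU choice are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# Forward pass of a small MLP: h_i = g(A_i h_{i-1} + b_i) per hidden layer.
widths = [3, 4, 4, 2]  # input -> hidden -> hidden -> output (arbitrary choice)
params = [(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(widths[:-1], widths[1:])]

def forward(x):
    h = x
    for i, (A, b) in enumerate(params):
        z = A @ h + b
        # Hidden layers get the non-linearity; the output layer stays linear.
        h = z if i == len(params) - 1 else relu(z)
    return h

y = forward(rng.standard_normal(3))
assert y.shape == (2,)  # network depth 3 affine layers, output width 2
```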
- 102. Deeper Models allow for more Complex Decisions 2 Hidden Neurons 5 Hidden Neurons 15 Hidden Neurons https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html 25
- 103. Output and Loss Functions Input Layer Hidden Layer 1 Hidden Layer 2 Output Layer Loss Function Target I The output layer is the last layer in a neural network which computes the output I The loss function compares the result of the output layer to the target value(s) I Choice of output layer and loss function depends on task (discrete, continuous, ..) 26
- 104. Output Layer Input Layer Hidden Layer 1 Hidden Layer 2 Output Layer Loss Function Target I For classification problems, we use a sigmoid or softmax non-linearity I For regression problems, we can directly return the value after the last layer 27
- 105. Loss Function Input Layer Hidden Layer 1 Hidden Layer 2 Output Layer Loss Function Target I For classification problems, we use the (binary) cross-entropy loss I For regression problems, we can use the ℓ1 or ℓ2 loss 28
- 106. Activation Functions Input Layer Hidden Layer 1 Hidden Layer 2 Output Layer Loss Function Target I Hidden layer hi = g(Aihi−1 + bi) with activation function g(·) and weights Ai, bi I The activation function is frequently applied element-wise to its input I Activation functions must be non-linear to learn non-linear mappings 29
- 109. Convolutional Neural Networks I Multi-layer perceptrons don’t scale to high-dimensional inputs I ConvNets represent data in 3 dimensions: width, height, depth (= feature maps) I ConvNets interleave discrete convolutions, non-linearities and pooling I Key ideas: sparse interactions, parameter sharing, equivariant representation 32
- 110. Fully Connected vs. Convolutional Layers I Fully connected layer: #Weights = W × H × C_out × (W × H × C_in + 1) I Convolutional layer: #Weights = C_out × (K × K × C_in + 1) (“weight sharing”) I With C_in input and C_out output channels, layer size W × H and kernel size K × K I Convolutions are followed by non-linear activation functions (e.g., ReLU) 33
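Plugging illustrative numbers into the two weight-count formulas above makes the gap concrete (the layer size, channel counts and kernel size below are assumed values, not from the slides):

```python
# Parameter counts of a fully connected vs. a convolutional layer, using
# the formulas from the slide. Illustrative values: 32x32 layer, 3 input
# channels, 16 output channels, 3x3 kernel.
W, H, C_in, C_out, K = 32, 32, 3, 16, 3

fc_weights = W * H * C_out * (W * H * C_in + 1)  # one weight per in/out pair
conv_weights = C_out * (K * K * C_in + 1)        # shared K x K x C_in filters

assert conv_weights == 448
assert fc_weights == 50_348_032
print(f"FC: {fc_weights:,} weights vs. Conv: {conv_weights:,} weights")
```

Weight sharing shrinks the parameter count by five orders of magnitude here, which is why ConvNets scale to high-dimensional image inputs where MLPs do not.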
- 116. Padding Idea of Padding: I Add boundary of appropriate size with zeros (blue) around input tensor 34
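The zero-padding idea can be sketched with NumPy: for an odd kernel size K, a border of (K − 1) / 2 zeros keeps the spatial size unchanged after convolution (the 4×4 input below is an illustrative example):

```python
import numpy as np

# Zero padding so a 3x3 convolution preserves the spatial size W x H.
x = np.arange(16, dtype=float).reshape(4, 4)  # a 4x4 single-channel input
K = 3
pad = (K - 1) // 2
x_padded = np.pad(x, pad_width=pad, mode='constant', constant_values=0.0)

assert x_padded.shape == (6, 6)               # boundary of zeros added
assert x_padded[0, 0] == 0.0                  # zero border
assert np.allclose(x_padded[1:-1, 1:-1], x)   # original values intact
```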
- 117. Downsampling I Downsampling reduces the spatial resolution (e.g., for image level predictions) I Downsampling increases the receptive ﬁeld (which pixels inﬂuence a neuron) 35
- 118. Pooling I Typically, stride s = 2 and kernel size 2 × 2 ⇒ reduces spatial dimensions by 2 I Pooling has no parameters (typical pooling operations: max, min, mean) 36
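The 2×2 / stride-2 max-pooling case above can be written as a reshape trick. A minimal sketch on a made-up 4×4 feature map:

```python
import numpy as np

# 2x2 max pooling with stride 2: halves each spatial dimension and has
# no learnable parameters.
def max_pool_2x2(x):
    H, W = x.shape
    # Group pixels into non-overlapping 2x2 blocks, then take the block max.
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 1., 2., 3.],
              [1., 1., 4., 0.]])
y = max_pool_2x2(x)
assert y.shape == (2, 2)                       # spatial dims reduced by 2
assert np.allclose(y, [[4., 8.],
                       [9., 4.]])
```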
- 119. Pooling I Typically, stride s = 2 and kernel size 2 × 2 ⇒ reduces spatial dimensions by 2 I Pooling has no parameters (typical pooling operations: max, min, mean) 36
- 120. Fully Connected Layers I Often, convolutional networks comprise fully connected layers at the end 37
- 121. Optimization
- 122. Optimization Optimization Problem (dataset X): w* = argmin_w L(X, w) Gradient Descent: w⁰ = w_init, wᵗ⁺¹ = wᵗ − η ∇w L(X, wᵗ) I The neural network loss L(X, w) is not convex, so we resort to gradient descent I There exist multiple local minima, but optimization will find only one I Good news: many local minima in deep networks are known to be good ones 39
- 123. Backpropagation I Values are efficiently computed forward, gradients backward I Modularity: each node must only “know” how to compute gradients w.r.t. its own arguments I One forward/backward pass per data point: ∇w L(X, w) = Σᵢ₌₁ᴺ ∇w L(yᵢ, xᵢ, w) 40
- 124. Gradient Descent Algorithm: 1. Initialize weights w⁰ and pick learning rate η 2. For all data points i ∈ {1, ..., N} do: 2.1 Forward propagate xᵢ through the network to calculate prediction ŷᵢ 2.2 Backpropagate to obtain gradient ∇w Lᵢ(wᵗ) ≡ ∇w L(ŷᵢ, yᵢ, wᵗ) 3. Update weights: wᵗ⁺¹ = wᵗ − η (1/N) Σᵢ ∇w Lᵢ(wᵗ) 4. If validation error decreases, go to step 2, otherwise stop Challenges: I Typically millions of parameters ⇒ dim(w) = 1 million or more I Typically millions of training points ⇒ N = 1 million or more I Becomes extremely expensive to compute and does not fit into memory 41
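The update rule wᵗ⁺¹ = wᵗ − η ∇w L(wᵗ) can be sketched on a convex toy loss where the minimum is known, so convergence is easy to verify (the loss L(w) = ‖w − w*‖² and all constants are illustrative choices):

```python
import numpy as np

# Gradient descent on the convex toy loss L(w) = ||w - w_star||^2,
# whose unique minimum is w_star. grad L(w) = 2 (w - w_star).
w_star = np.array([2.0, -1.0])
grad = lambda w: 2.0 * (w - w_star)

w = np.zeros(2)   # w^0 = w_init
eta = 0.1         # learning rate
for _ in range(200):
    w = w - eta * grad(w)   # w^{t+1} = w^t - eta * grad L(w^t)

assert np.allclose(w, w_star, atol=1e-6)  # converged to the minimum
```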
- 125. Stochastic Gradient Descent Solution: I The total loss over the entire training set can be expressed as an expectation: (1/N) Σᵢ Lᵢ(wᵗ) = E_{i∼U{1,N}}[Lᵢ(wᵗ)] I This expectation can be approximated with a smaller subset B ≪ N of the data: E_{i∼U{1,N}}[Lᵢ(wᵗ)] ≈ (1/B) Σ_b L_b(wᵗ) I Thus, the gradient can also be approximated on this subset (= minibatch): (1/N) Σᵢ ∇w Lᵢ(wᵗ) ≈ (1/B) Σ_b ∇w L_b(wᵗ) 42
- 126. Stochastic Gradient Descent Algorithm: 1. Initialize weights w⁰, pick learning rate η and minibatch size B 2. Draw random minibatch {(x1, y1), ..., (xB, yB)} ⊆ X (with B ≪ N) 3. For all minibatch elements b ∈ {1, ..., B} do: 3.1 Forward propagate x_b through the network to calculate prediction ŷ_b 3.2 Backpropagate to obtain batch element gradient ∇w L_b(wᵗ) ≡ ∇w L(ŷ_b, y_b, wᵗ) 4. Update weights: wᵗ⁺¹ = wᵗ − η (1/B) Σ_b ∇w L_b(wᵗ) 5. If validation error decreases, go to step 2, otherwise stop 43
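The minibatch algorithm above can be sketched end-to-end on linear regression, a toy stand-in for a deep network (data size, batch size and learning rate below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Minibatch SGD for linear regression: each update uses a gradient
# estimated from B << N randomly drawn samples.
N, B, eta = 1000, 32, 0.05
w_true = np.array([1.5, -3.0])
X = rng.standard_normal((N, 2))
y = X @ w_true + 0.01 * rng.standard_normal(N)   # noisy targets

w = np.zeros(2)
for _ in range(500):
    idx = rng.choice(N, size=B, replace=False)   # draw a random minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / B * Xb.T @ (Xb @ w - yb)        # minibatch gradient of L2 loss
    w = w - eta * grad                           # SGD weight update

assert np.allclose(w, w_true, atol=0.05)  # recovered up to gradient noise
```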
- 127. First-order Methods There exist many variants: I SGD I SGD with Momentum I SGD with Nesterov Momentum I RMSprop I Adagrad I Adadelta I Adam I AdaMax I NAdam I AMSGrad Adam is often the method of choice due to its robustness. 44
- 128. Learning Rate Schedules [Figure: training error vs. iterations for plain-18/34 and ResNet-18/34] I A fixed learning rate is too slow in the beginning and too fast in the end I Exponential decay: ηt = η αᵗ I Step decay: η ← 0.5η (every K iterations) He, Zhang, Ren and Sun: Deep Residual Learning for Image Recognition. CVPR, 2016. 45
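The two decay rules above are one-liners. A sketch with illustrative hyperparameters (η₀ = 0.1, α = 0.95, halving every K = 10 iterations; these values are assumptions, not from the slides):

```python
# Two learning rate schedules from the slide, with illustrative hyperparameters.
eta0, alpha, K = 0.1, 0.95, 10

def exponential_decay(t):
    return eta0 * alpha ** t       # eta_t = eta_0 * alpha^t

def step_decay(t):
    return eta0 * 0.5 ** (t // K)  # eta halved every K iterations

assert exponential_decay(0) == eta0
assert abs(step_decay(25) - eta0 * 0.25) < 1e-12  # halved twice after 25 steps
assert exponential_decay(10) < exponential_decay(9)  # monotonically shrinking
```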
- 129. Regularization
- 130. Capacity, Overfitting and Underfitting [Figure: polynomial fits of degree M = 1, 3, 9 to noisy observations of a ground-truth curve — capacity too low / about right / too high] I Underfitting: Model too simple, does not achieve low error on training set I Overfitting: Training error small, but test error (= generalization error) large I Regularization: Take model from third regime (right) to second regime (middle) 47
- 131. Early Stopping and Parameter Penalties Unregularized Objective L2 Regularizer Early stopping: I Dashed: Trajectory taken by SGD I Trajectory stops at w̃ before reaching minimum training error w∗ L2 Regularization: I Regularize objective with L2 penalty I Penalty forces minimum of regularized loss w̃ closer to origin 48
- 132. Dropout Idea: I During training, set neurons to zero with probability µ (typically µ = 0.5) I Each binary mask is one model, changes randomly with every training iteration I Creates ensemble “on the ﬂy” from a single network with shared parameters Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov: Dropout: a simple way to prevent neural networks from overﬁtting. JMLR, 2014. 49
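A sketch of the training-time masking described above. This uses the common "inverted dropout" variant, which rescales the kept activations so that no change is needed at test time (the slides do not specify which variant):

```python
import numpy as np

def dropout(h, mu=0.5, train=True, rng=None):
    """Inverted dropout: zero each unit with probability mu during training and
    scale the survivors by 1/(1 - mu), so expected activations match test time."""
    if not train:
        return h  # test time: use the full network, no masking
    rng = rng or np.random.default_rng()
    mask = (rng.random(h.shape) >= mu).astype(h.dtype)  # one random binary sub-model
    return h * mask / (1.0 - mu)
```

Each call draws a fresh mask, so every training iteration effectively trains a different member of the implicit ensemble.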
- 133. Data Augmentation I Best way towards better generalization is to train on more data I However, data in practice often limited I Goal of data augmentation: create “fake” data from the existing data (on the ﬂy) and add it to the training set I New data must preserve semantics I Even simple operations like translation or adding per-pixel noise often already greatly improve generalization I https://github.com/aleju/imgaug 50
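A toy augmentation along the lines described above (translation plus per-pixel noise). The wrap-around shift and the parameter values are simplifications for illustration; libraries like imgaug offer far richer operations:

```python
import numpy as np

def augment(img, max_shift=4, noise_std=0.02, rng=None):
    """Create a 'fake' training image on the fly: random translation
    (wrap-around, for simplicity) plus per-pixel Gaussian noise."""
    rng = rng or np.random.default_rng()
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(img, (int(dy), int(dx)), axis=(0, 1))
    noisy = shifted + rng.normal(0.0, noise_std, img.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep a valid intensity range
```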
- 135. Imitation Learning: Manipulation Towards Imitation Learning of Dynamic Manipulation Tasks: A Framework to Learn from Failures 52
- 136. Imitation Learning: Car Racing Trainer (Human Driver) Trainee (Neural Network) 53
- 137. Imitation Learning in a Nutshell Hard coding policies is often difﬁcult ⇒ Rather use a data-driven approach! I Given: demonstrations or demonstrator I Goal: train a policy to mimic decision I Variants: behavior cloning (this lecture), inverse optimal control, ... 54
- 138. Formal Deﬁnition of Imitation Learning I State: s ∈ S may be partially observed (e.g., game screen) I Action: a ∈ A may be discrete or continuous (e.g., turn angle, speed) I Policy: πθ : S → A we want to learn the policy parameters θ I Optimal action: a∗ ∈ A provided by expert demonstrator I Optimal policy: π∗ : S → A provided by expert demonstrator I State dynamics: P(si+1|si, ai) simulator, typically not known to policy Often deterministic: si+1 = T(si, ai) deterministic mapping I Rollout: Given s0, sequentially execute ai = πθ(si) sample si+1 ∼ P(si+1|si, ai) yields trajectory τ = (s0, a0, s1, a1, . . . ) I Loss function: L(a∗, a) loss of action a given optimal action a∗ 55
- 139. Formal Deﬁnition of Imitation Learning General Imitation Learning: argminθ E s∼P(s|πθ) [L(π∗(s), πθ(s))] I State distribution P(s|πθ) depends on the rollout determined by the current policy πθ Behavior Cloning: argminθ E (s∗,a∗)∼P∗ [L(a∗, πθ(s∗))], empirically Σi=1..N L(a∗i, πθ(s∗i)) I State distribution P∗ provided by expert I Reduces to a supervised learning problem 56
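Because behavior cloning reduces to supervised learning, it can be sketched as ordinary regression on expert pairs (s∗, a∗). The linear policy class and all values below are purely illustrative:

```python
import numpy as np

def behavior_cloning(states, actions, lr=0.1, iters=500):
    """Fit a linear policy pi_theta(s) = s @ theta to expert pairs (s*, a*) by
    gradient descent on the summed squared-error loss over the expert dataset."""
    theta = np.zeros(states.shape[1])
    n = len(states)
    for _ in range(iters):
        grad = states.T @ (states @ theta - actions) / n  # mean gradient over P*
        theta -= lr * grad
    return theta

# Usage: if the expert itself is linear, cloning recovers it
rng = np.random.default_rng(0)
S = rng.normal(size=(200, 3))
theta_star = np.array([0.5, -1.0, 2.0])
theta_hat = behavior_cloning(S, S @ theta_star)
```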
- 140. Challenges of Behavior Cloning I Behavior cloning makes IID assumption I Next state is sampled from states observed during expert demonstration I Thus, next state is sampled independently from action predicted by current policy I What if πθ makes a mistake? I Enters new states that haven’t been observed before I New states not sampled from same (expert) distribution anymore I Cannot recover, catastrophic failure in the worst case I What can we do to overcome this train/test distribution mismatch? 57
- 141. DAgger Data Aggregation (DAgger): I Iteratively build a set of inputs that the ﬁnal policy is likely to encounter based on previous experience: roll out the current policy, query the expert for the correct actions on the visited states, and aggregate the labeled data I But can easily overﬁt to the main mode of the demonstrations I High training variance (random initialization, order of data) Ross, Gordon and Bagnell: A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS, 2011. 58
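A toy version of the DAgger loop on a one-dimensional problem. The scalar linear policy, least-squares refit and synthetic expert are all illustrative simplifications of the general algorithm:

```python
import numpy as np

def dagger(expert, env_step, init_state, rounds=5, horizon=20):
    """DAgger sketch for a scalar linear policy a = theta * s: roll out the
    current policy, label every visited state with the expert's action,
    aggregate, and refit by least squares."""
    states, actions = [init_state], [expert(init_state)]
    theta = 0.0
    for _ in range(rounds):
        S, A = np.array(states), np.array(actions)
        theta = float(S @ A / (S @ S + 1e-8))  # refit on the aggregate dataset
        s = init_state
        for _ in range(horizon):
            a = theta * s              # act with the *current* policy
            states.append(s)
            actions.append(expert(s))  # but store the *expert* label
            s = env_step(s, a)
    return theta

# Toy problem: expert steers with a = -0.5 * s, dynamics s' = s + a
theta = dagger(expert=lambda s: -0.5 * s,
               env_step=lambda s, a: s + a,
               init_state=1.0)
```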
- 142. DAgger with Critical States and Replay Buffer Key Ideas: 1. Sample critical states from the collected on-policy data based on the utility they provide to the learned policy in terms of driving behavior 2. Incorporate a replay buffer which progressively focuses on the high uncertainty regions of the policy’s state distribution Prakash, Behl, Ohn-bar, Chitta and Geiger: Exploring Data Aggregation in Policy Learning for Vision-based Urban Autonomous Driving. CVPR, 2020. 59
- 143. ALVINN: An Autonomous Land Vehicle in a Neural Network
- 144. ALVINN: An Autonomous Land Vehicle in a Neural Network I Fully connected 3 layer neural net I 36k parameters I Maps road images to turn radius I Directions discretized (45 bins) I Trained on simulated road images! I Tested on unlined paths, lined city streets and interstate highways I 90 consecutive miles at up to 70 mph Pomerleau: ALVINN: An Autonomous Land Vehicle in a Neural Network. NIPS, 1988. 61
- 145. ALVINN: An Autonomous Land Vehicle in a Neural Network Pomerleau: ALVINN: An Autonomous Land Vehicle in a Neural Network. NIPS, 1988. 62
- 146. PilotNet: End-to-End Learning for Self-Driving Cars
- 147. PilotNet: System Overview I Data augmentation by 3 cameras and virtually shifted / rotated images assuming the world is ﬂat (homography), adjusting the steering angle appropriately Bojarski et al.: End-to-End Learning for Self-Driving Cars. Arxiv, 2016. 64
- 148. PilotNet: Architecture I Convolutional network (250k param) I Input: YUV image representation I 1 Normalization layer I Not learned I 5 Convolutional Layers I 3 strided 5x5 I 2 non-strided 3x3 I 3 Fully connected Layers I Output: turning radius I Trained on 72h of driving Bojarski et al.: End-to-End Learning for Self-Driving Cars. Arxiv, 2016. 65
- 149. PilotNet: Video Bojarski et al.: End-to-End Learning for Self-Driving Cars. Arxiv, 2016. 66
- 150. VisualBackProp I Central idea: ﬁnd salient image regions that lead to high activations I Forward pass, then iteratively scale-up activations Bojarski et al.: VisualBackProp: Efﬁcient Visualization of CNNs for Autonomous Driving. ICRA, 2018. 67
- 151. VisualBackProp Bojarski et al.: VisualBackProp: Efﬁcient Visualization of CNNs for Autonomous Driving. ICRA, 2018. 68
- 152. VisualBackProp I Test if shift in salient objects affects predicted turn radius more strongly Bojarski et al.: VisualBackProp: Efﬁcient Visualization of CNNs for Autonomous Driving. ICRA, 2018. 69
- 154. Conditional Imitation Learning Codevilla, Müller, López, Koltun and Dosovitskiy: End-to-End Driving Via Conditional Imitation Learning. ICRA, 2018. 71
- 155. Conditional Imitation Learning Idea: I Condition the controller on a navigation command c ∈ {left, right, straight} I The high-level navigation command can be provided by a consumer GPS, i.e., telling the vehicle to turn left/right or go straight at the next intersection I This removes the task ambiguity induced by the environment I State st: current image; Action at: steering angle and acceleration Codevilla, Müller, López, Koltun and Dosovitskiy: End-to-End Driving Via Conditional Imitation Learning. ICRA, 2018. 72
- 156. Comparison to Behavior Cloning Behavior Cloning: I Training Set: D = {(a∗i, s∗i)}, i = 1, ..., N I Objective: argminθ Σi=1..N L(a∗i, πθ(s∗i)) I Assumption: ∃f(·) : ai = f(si) Often violated in practice! Conditional Imitation Learning: I Training Set: D = {(a∗i, s∗i, c∗i)}, i = 1, ..., N I Objective: argminθ Σi=1..N L(a∗i, πθ(s∗i, c∗i)) I Assumption: ∃f(·, ·) : ai = f(si, ci) Better assumption! Codevilla, Müller, López, Koltun and Dosovitskiy: End-to-End Driving Via Conditional Imitation Learning. ICRA, 2018. 73
- 157. Conditional Imitation Learning: Network Architecture I This paper proposes two network architectures: I (a) Extract features C(c) and concatenate with image features I(i) I (b) Command c acts as switch between specialized submodules I Measurements m capture additional information (here: speed of vehicle) Codevilla, Müller, López, Koltun and Dosovitskiy: End-to-End Driving Via Conditional Imitation Learning. ICRA, 2018. 74
- 158. Conditional Imitation Learning: Noise Injection I Temporally correlated noise injected into trajectories ⇒ drift (only 12 minutes) I Record driver’s (=expert’s) corrective response ⇒ recover from drift Codevilla, Müller, López, Koltun and Dosovitskiy: End-to-End Driving Via Conditional Imitation Learning. ICRA, 2018. 75
- 159. CARLA Simulator http://www.carla.org Codevilla, Müller, López, Koltun and Dosovitskiy: End-to-End Driving Via Conditional Imitation Learning. ICRA, 2018. 76
- 160. Conditional Imitation Learning Codevilla, Santana, Lopez and Gaidon: Exploring the Limitations of Behavior Cloning for Autonomous Driving. ICCV, 2019. 77
- 161. Neural Attention Fields I An MLP iteratively compresses the high-dimensional input into a compact representation ci (here c is not the navigation command) based on a BEV query location as input I The model predicts waypoints and auxiliary semantics which aid learning Chitta, Prakash and Geiger: NEAT: Neural Attention Fields for End-to-End Autonomous Driving. ICCV, 2021. 78
- 162. Summary Advantages of Imitation Learning: I Easy to implement I Cheap annotations (just driving while recording images and actions) I Entire model trained end-to-end I Conditioning removes ambiguity at intersections Challenges for Imitation Learning? I Behavior cloning uses IID assumption which is violated in practice I Direct mapping from images to control ⇒ No long term planning I No memory (can’t remember speed signs, etc.) I Mapping is difﬁcult to interpret (“black box”), despite visualization techniques 79
- 163. Self-Driving Cars Lecture 3 – Direct Perception Robotics, Computer Vision, System Software BE, MS, PhD (MMMTU, IISc, IIIT-Hyderabad) Kumar Bipin
- 164. Agenda 3.1 Direct Perception 3.2 Conditional Affordance Learning 3.3 Visual Abstractions 3.4 Driving Policy Transfer 3.5 Online vs. Ofﬂine Evaluation 2
- 166. Approaches to Self-Driving Modular Pipeline: Sensory Input → Low-level Perception → Scene Parsing → Path Planning → Vehicle Control → Steer/Gas/Brake (+ modular, + interpretable; − relies on expert decisions, − piece-wise training) Imitation Learning / Reinforcement Learning: Sensory Input → Neural Network → Steer/Gas/Brake (+ end-to-end, + simple; − generalization, − interpretability, − data hungry) 4
- 167. Direct Perception Steer Gas Brake Sensory Input Direct Perception Intermediate Representations Vehicle Control Neural Network Idea of Direct Perception: I Hybrid model between imitation learning and modular pipelines I Learn to predict interpretable low-dimensional intermediate representation I Decouple perception from planning and control I Allows to exploit classical controllers or learned controllers (or hybrids) 5
- 168. Direct Perception for Autonomous Driving Affordances: I Attributes of the environment which limit space of actions [Gibson, 1966] I In this case: 13 affordances Chen, Seff, Kornhauser and Xiao: Learning Affordance for Direct Perception in Autonomous Driving. ICCV, 2015. 6
- 169. Overview I TORCS Simulator: Open source car racing game simulator I Network: AlexNet (5 conv layers, 4 fully connected layers), 13 output neurons I Training: Affordance indicators trained with an ℓ2 loss Chen, Seff, Kornhauser and Xiao: Learning Affordance for Direct Perception in Autonomous Driving. ICCV, 2015. 7
- 170. Affordance Indicators and State Machine Chen, Seff, Kornhauser and Xiao: Learning Affordance for Direct Perception in Autonomous Driving. ICCV, 2015. 8
- 171. Controller Steering controller: s = θ1 (α − dc/w) I s: steering command; θ1: parameter I α: relative orientation; dc: distance to centerline; w: road width Speed controller (“optimal velocity car following model”): v = vmax (1 − exp(−θ2 dp − θ3)) I v: target velocity; vmax: maximal velocity I dp: distance to preceding car; θ2, θ3: parameters Chen, Seff, Kornhauser and Xiao: Learning Affordance for Direct Perception in Autonomous Driving. ICCV, 2015. 9
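The two controllers can be written down directly from the formulas above; the parameter values would have to be tuned as in the paper:

```python
import math

def steering(alpha, d_c, w, theta1):
    """Steering command s = theta1 * (alpha - d_c / w): correct the relative
    orientation alpha and the centerline offset d_c, normalized by road width w."""
    return theta1 * (alpha - d_c / w)

def target_speed(d_p, v_max, theta2, theta3):
    """Optimal-velocity car-following model v = v_max * (1 - exp(-theta2*d_p - theta3)):
    the target speed approaches v_max as the gap d_p to the preceding car grows."""
    return v_max * (1.0 - math.exp(-theta2 * d_p - theta3))
```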
- 172. TORCS Simulator I TORCS: Open source car racing game http://torcs.sourceforge.net/ Chen, Seff, Kornhauser and Xiao: Learning Affordance for Direct Perception in Autonomous Driving. ICCV, 2015. 10
- 173. Results Chen, Seff, Kornhauser and Xiao: Learning Affordance for Direct Perception in Autonomous Driving. ICCV, 2015. 11
- 174. Network Visualization I Left: Averaged top 100 images activating a neuron in ﬁrst fully connected layer I Right: Maximal response of 4th conv. layer (note: focus on cars and markings) Chen, Seff, Kornhauser and Xiao: Learning Affordance for Direct Perception in Autonomous Driving. ICCV, 2015. 12
- 176. How can we transfer this idea to cities?
- 177. Conditional Affordance Learning Video Input + Directional Input → Neural Network → Affordances (e.g., Relative angle = 0.01 rad, Centerline distance = 0.15 m, Red light = False, ...) → Controller → Control Commands (e.g., Brake = 0.0) Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 15
- 178. CARLA Simulator I Goal: drive from A to B as fast, safely and comfortably as possible I Infractions: I Driving on wrong lane I Driving on sidewalk I Running a red light I Violating speed limit I Colliding with vehicles I Hitting other objects Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 16
- 179. Affordances Affordances: I Distance to centerline I Relative angle to road I Distance to lead vehicle I Speed signs I Trafﬁc lights I Hazard stop Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 17
- 180. Affordances Affordances: I Distance to centerline I Relative angle to road I Distance to lead vehicle I Speed signs I Trafﬁc lights I Hazard stop [Figure: bird's-eye sketch of the agent with observation areas A1, A2, A3, relative angle ψ, centerline distance d, local coordinates, a lead vehicle within l = 15 m, a hazard stop (for a pedestrian), a 30 km/h speed sign and a traffic light] Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 18
- 181. Overview [Architecture figure: the CAL agent receives images from CARLA; a feature extractor with memory over the last N frames feeds unconditional and conditional task blocks that predict the affordances; a high-level planner provides the directional command; longitudinal and lateral controllers turn the affordances into control commands] Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 19
- 182. Controller Longitudinal Control I Finite-state machine with states: cruising, following, over_limit, red_light, hazard_stop I State selection from affordances: if hazard_stop == True → hazard_stop; elif red_light == True → red_light; elif speed exceeds the speed limit margin → over_limit; elif vehicle distance below following threshold → following; else → cruising I PID controller for cruising I Car following model Lateral Control I Stanley controller: δ(t) = ψ(t) + arctan(k x(t)/u(t)) I Damping term Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 20
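The Stanley lateral control law above translates directly into code. The small epsilon guarding against division by zero at standstill is an added implementation detail, and the damping term is omitted:

```python
import math

def stanley_steering(psi, x, u, k, eps=1e-6):
    """Stanley controller delta = psi + arctan(k * x / u): heading error psi
    plus a cross-track term (error x, gain k) that weakens as speed u grows."""
    return psi + math.atan2(k * x, u + eps)
```

On the path and aligned (psi = 0, x = 0) the command is zero; a positive cross-track error steers back toward the path.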
- 183. Parameter Learning Perception Stack: I Multi-task learning: single forward pass ⇒ fast learning and inference I Dataset: random driving using a controller operating on ground-truth affordances ⇒ 240k images with GT affordances I Loss functions: I Discrete affordances: Class-weighted cross-entropy (CWCE) I Continuous affordances: Mean absolute error (MAE) I Optimized with Adam (batch size 32) Controller: I Tuned with the Ziegler-Nichols method Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 21
- 184. Data Collection Data Collection: I Navigation based on true affordances and random inputs Data Augmentation: I No image ﬂipping I Color, contrast, brightness I Gaussian blur and noise I Provoke rear-end collisions I Camera pose randomization [Figure: camera orientations φ1, φ2, φ3 (= 0) and lateral offsets d = 50 cm] Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 22
- 185. Results (success rate in %, per condition the columns are MP/CIL/RL/CAL):
Task | Training conditions | New weather | New town | New town and new weather
Straight | 98/95/89/100 | 100/98/86/100 | 92/97/74/93 | 50/80/68/94
One turn | 82/89/34/97 | 95/90/16/96 | 61/59/12/82 | 50/48/20/72
Navigation | 80/86/14/92 | 94/84/2/90 | 24/40/3/70 | 47/44/6/68
Nav. dynamic | 77/83/7/83 | 89/82/2/82 | 24/38/2/64 | 44/42/4/64
Baselines: I MP = Modular Pipeline [Dosovitskiy et al., CoRL 2017] I CIL = Conditional Imitation Learning [Codevilla et al., ICRA 2018] I RL = Reinforcement Learning A3C [Mnih et al., ICML 2016] Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 23
- 186. Results Conditional Navigation Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 24
- 187. Results Speed Signs Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 24
- 188. Results Car Following Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 24
- 189. Results Hazard Stop Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 24
- 190. Attention Attention to Hazard Stop Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 25
- 191. Attention Attention to Red Light Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 25
- 192. Path Planning Optimal Path (green) vs. Traveled Path (red) Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 26
- 193. Failure Cases Hazard Stop: False Positive Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 27
- 194. Failure Cases Hazard Stop: False Negative Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 27
- 195. Failure Cases Red Light: False Positive Sauer, Savinov and Geiger: Conditional Affordance Learning for Driving in Urban Environments. CoRL, 2018. 27
- 197. Does Computer Vision Matter for Action? I Analyze various intermediate representations: segmentation, depth, normals, ﬂow, albedo I Intermediate representations improve results I Consistent gains across simulations / tasks I Depth and semantic segmentation provide the largest gains I Better generalization performance Zhou, Krähenbühl and Koltun: Does computer vision matter for action? Science Robotics, 2019. 29
- 198. Visual Abstractions What is a good visual abstraction? I Invariant (hide irrelevant variations from policy) I Universal (applicable to wide range of scenarios) I Data efﬁcient (in terms of memory/computation) I Label efﬁcient (require little manual effort) [Figure: train/test distributions in pixel space vs. representation space; credit: Alexander Sax] Semantic segmentation: I Encodes task-relevant knowledge (e.g., road is drivable) and priors (e.g., grouping) I Can be processed with standard 2D convolutional policy networks Disadvantage: I Labelling time: ∼90 min for one Cityscapes image Zhou, Krähenbühl and Koltun: Does computer vision matter for action? Science Robotics, 2019. 30
- 199. Label Efﬁcient Visual Abstractions Questions: I What is the trade-off between annotation time and driving performance? I Can selecting speciﬁc semantic classes ease policy learning? I Are visual abstractions trained with few images competitive? I Is ﬁne-grained annotation important? I Are visual abstractions able to reduce training variance? Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efﬁcient Visual Abstractions for Autonomous Driving. IROS, 2020. 31
- 200. Label Efﬁcient Visual Abstractions Model: I Visual abstraction network aψ : s → r (state → abstraction) I Control policy πθ : (r, c, v) → a (abstraction, command, velocity → action) I Composing both yields a = πθ(aψ(s), c, v) (state → action) Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efﬁcient Visual Abstractions for Autonomous Driving. IROS, 2020. 32
- 201. Label Efﬁcient Visual Abstractions Datasets: I nr images annotated with semantic labels: R = {(si, ri)}, i = 1, ..., nr I na images annotated with expert actions: A = {(si, ai)}, i = 1, ..., na I We assume nr ≪ na Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efﬁcient Visual Abstractions for Autonomous Driving. IROS, 2020. 32
- 202. Label Efﬁcient Visual Abstractions Training: I Train the visual abstraction network aψ(·) using semantic dataset R I Apply this network to obtain the control dataset Cψ = {(aψ(si), ai)}, i = 1, ..., na I Train the control policy πθ(·) using control dataset Cψ Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efﬁcient Visual Abstractions for Autonomous Driving. IROS, 2020. 32
- 203. Control Policy Model: I CILRS [Codevilla et al., ICCV 2019] Input: I Visual abstraction r I Navigational command c I Vehicle velocity v Output: I Action/control â and velocity v̂ Loss: I L = ||a − â||1 + λ ||v − v̂||1 Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efﬁcient Visual Abstractions for Autonomous Driving. IROS, 2020. 33
- 204. Visual Abstractions Privileged Segmentation (14 classes): I Ground-truth semantic labels for 14 classes I Upper bound for analysis Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efﬁcient Visual Abstractions for Autonomous Driving. IROS, 2020. 34
- 205. Visual Abstractions Privileged Segmentation (6 classes): I Ground-truth semantic labels for 2 stuff and 4 object classes I Upper bound for analysis Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efﬁcient Visual Abstractions for Autonomous Driving. IROS, 2020. 34
- 206. Visual Abstractions Inferred Segmentation (14 classes): I Segmentation model trained on 14 classes I ResNet and Feature Pyramid Network (FPN) with segmentation head Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efﬁcient Visual Abstractions for Autonomous Driving. IROS, 2020. 34
- 207. Visual Abstractions Inferred Segmentation (6 classes): I Segmentation model trained on 2 stuff and 4 object classes I ResNet and Feature Pyramid Network (FPN) with segmentation head Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efﬁcient Visual Abstractions for Autonomous Driving. IROS, 2020. 34
- 208. Visual Abstractions Hybrid Detection and Segmentation (6 classes): I Segmentation model trained on 2 stuff classes: road, lane marking I Object detection trained on 4 object classes: vehicle, pedestrian, trafﬁc light (r/g) Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efﬁcient Visual Abstractions for Autonomous Driving. IROS, 2020. 34
- 209. Evaluation [Figure: maps of the training town and the test town] I CARLA 0.8.4 NoCrash benchmark I Random start and end location I Metric: Percentage of successfully completed episodes (success rate) Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efﬁcient Visual Abstractions for Autonomous Driving. IROS, 2020. 35
- 210. Trafﬁc Density [Figure: example scenes for Empty, Regular and Dense trafﬁc] I Difﬁculty varies with number of dynamic agents in the scene I Empty: 0 agents; Regular: 65 agents; Dense: 220 agents Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efﬁcient Visual Abstractions for Autonomous Driving. IROS, 2020. 36
- 211. Identifying Most Relevant Classes (Privileged) I 14 classes: road, lane marking, vehicle, pedestrian, green light, red light, sidewalk, building, fence, pole, vegetation, wall, trafﬁc sign, other I 7 classes: road, lane marking, vehicle, pedestrian, green light, red light, sidewalk I 6 classes: road, lane marking, vehicle, pedestrian, green light, red light I 5 classes: road, vehicle, pedestrian, green light, red light (lane marking removed) Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efﬁcient Visual Abstractions for Autonomous Driving. IROS, 2020. 37
- 212. Identifying Most Relevant Classes (Privileged) [Bar charts: success, collision and timeout rates for 5, 6, 7 and 14 classes under Empty, Regular and Dense trafﬁc, plus overall] I Moving from 14 to 6 classes does not hurt driving performance (on the contrary) I Drastic performance drop when lane markings are removed Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efﬁcient Visual Abstractions for Autonomous Driving. IROS, 2020. 38
- 213. Identifying Most Relevant Classes (Privileged) Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efﬁcient Visual Abstractions for Autonomous Driving. IROS, 2020. 39
- 214. Identifying Most Relevant Classes (Inferred) [Bar charts, success rate for 6 vs. 14 classes (standard = inferred segmentation, privileged = ground truth): Empty 83/74 standard, 100/86 privileged; Regular 64/56, 76/72; Dense 30/19, 26/24; Overall 59/50, 67/61] I Small performance drop when using inferred segmentations I The 6-class representation consistently improves upon the 14-class representation I We use the 6-class representation for all following experiments Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efﬁcient Visual Abstractions for Autonomous Driving. IROS, 2020. 40
- 215. Hybrid Representation [Bar chart, success rate (Hybrid / Standard): Empty 82/89, Regular 70/67, Dense 25/22, Overall 58/59] I Performance of the hybrid representation matches standard segmentation I Annotation time (segmentation): ∼300 seconds per image and per class I Annotation time (hybrid): ∼20 seconds per image and per class Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efﬁcient Visual Abstractions for Autonomous Driving. IROS, 2020. 41
- 216. Summary Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efﬁcient Visual Abstractions for Autonomous Driving. IROS, 2020. 42
- 218. Driving Policy Transfer Problem: I Driving policies learned in simulation often do not transfer well to the real world Idea: I Encapsulate driving policy such that it is not directly exposed to raw perceptual input or low-level control (input: semantic segmentation, output: waypoints) I Allows for transferring driving policy without retraining or ﬁnetuning Müller, Dosovitskiy, Ghanem and Koltun: Driving Policy Transfer via Modularity and Abstraction. CoRL, 2018. 44
- 219. Waypoint Representation Representation: I Input: Semantic segmentation (per pixel “road” vs. “non-road”) I Output: 2 waypoints (distance to vehicle, relative angle w.r.t. vehicle heading) I One waypoint is sufﬁcient for steering; the second enables braking before turns Müller, Dosovitskiy, Ghanem and Koltun: Driving Policy Transfer via Modularity and Abstraction. CoRL, 2018. 45
- 220. Results Success Rate over 25 Navigation Trials I Driving Policy: Conditional Imitation Learning (branched) I Control: PID controller for lateral and longitudinal control I Results: Full method generalizes best (“+” = with data augmentation) Müller, Dosovitskiy, Ghanem and Koltun: Driving Policy Transfer via Modularity and Abstraction. CoRL, 2018. 46
- 221. Results Müller, Dosovitskiy, Ghanem and Koltun: Driving Policy Transfer via Modularity and Abstraction. CoRL, 2018. 47
- 222. 3.5 Online vs. Ofﬂine Evaluation
- 223. Online vs. Ofﬂine Evaluation I Online evaluation (i.e., using a real vehicle) is expensive and can be dangerous I Ofﬂine evaluation on a pre-recorded validation dataset is cheap and easy I Question: How predictive is ofﬂine evaluation (a) for the online task (b)? I Empirical study using CIL on CARLA trained with MSE loss on steering angle Codevilla, Lopez, Koltun and Dosovitskiy: On Ofﬂine Evaluation of Vision-based Driving Models. ECCV, 2018. 49
- 224. Online Metrics I Success Rate: Percentage of routes successfully completed I Average Completion: Average fraction of distance to goal covered I Km per Infraction: Average driven distance between 2 infractions Remark: The current CARLA metrics Infraction Score and Driving Score are not considered in this work from 2018, but would likely lead to similar conclusions. Codevilla, Lopez, Koltun and Dosovitskiy: On Ofﬂine Evaluation of Vision-based Driving Models. ECCV, 2018. 50
- 225. Ofﬂine Metrics I a/â: true/predicted steering angle; |V|: number of samples in the validation set I v: speed; δ(·): Kronecker delta function; θ(·): Heaviside step function I Q(x) ∈ {−1, 0, 1}: quantization, Q(x) = −1 if x < −σ, 0 if −σ ≤ x < σ, 1 if x ≥ σ Codevilla, Lopez, Koltun and Dosovitskiy: On Ofﬂine Evaluation of Vision-based Driving Models. ECCV, 2018. 51
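A sketch of two of the offline metrics, using the quantization Q defined above; the threshold value is a placeholder:

```python
import numpy as np

def quantize(x, sigma):
    """Q(x) = -1 if x < -sigma, 0 if -sigma <= x < sigma, 1 if x >= sigma."""
    return np.where(x >= sigma, 1, np.where(x < -sigma, -1, 0))

def mean_absolute_error(a, a_hat):
    """Average absolute steering error over the validation set."""
    return np.mean(np.abs(a - a_hat))

def quantized_classification_error(a, a_hat, sigma=0.01):
    """Fraction of samples whose quantized true and predicted steering disagree."""
    return np.mean(quantize(a, sigma) != quantize(a_hat, sigma))
```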
- 226. Results: Online vs. Online I Generalization performance (town 2, new weather); point radius encodes the training iteration I 45 different models varying dataset size, augmentation, architecture, etc. I Success rate correlates well with average completion and km per infraction Codevilla, Lopez, Koltun and Dosovitskiy: On Ofﬂine Evaluation of Vision-based Driving Models. ECCV, 2018. 52
- 227. Results: Online vs. Ofﬂine I None of the ofﬂine metrics correlates well with online performance; mean squared error (MSE) performs worst I Absolute steering error correlates better; speed weighting is not important Codevilla, Lopez, Koltun and Dosovitskiy: On Ofﬂine Evaluation of Vision-based Driving Models. ECCV, 2018. 53
- 228. Results: Online vs. Ofﬂine I Cumulating the error over time does not improve the correlation I Quantized classiﬁcation and thresholded relative error perform best Codevilla, Lopez, Koltun and Dosovitskiy: On Ofﬂine Evaluation of Vision-based Driving Models. ECCV, 2018. 54
- 229. Case Study I Model 1: Trained with a single camera and ℓ2 loss (= bad model) I Model 2: Trained with three cameras and ℓ1 loss (= good model) I Predictions of both models are noisy, but Model 1 occasionally predicts very large errors leading to crashes; the average prediction error of both models is similar Codevilla, Lopez, Koltun and Dosovitskiy: On Ofﬂine Evaluation of Vision-based Driving Models. ECCV, 2018. 55
- 230. Case Study I Model 1 crashes in every trial but model 2 can drive successfully I Illustrates the difﬁculty of using ofﬂine metrics for predicting online behavior Codevilla, Lopez, Koltun and Dosovitskiy: On Ofﬂine Evaluation of Vision-based Driving Models. ECCV, 2018. 56
- 231. Summary I Direct perception predicts intermediate representations I Low-dimensional affordances or classic computer vision representations (e.g., semantic segmentation, depth) can be used as intermediate representations I Decouples perception from planning and control I Hybrid model between imitation learning and modular pipelines I Direct methods are more interpretable as the representation can be inspected I Effective visual abstractions can be learned using limited supervision I Planning can also be decoupled from control for better transfer I Ofﬂine metrics are not necessarily indicative of online driving performance 57
- 232. Self-Driving Cars Lecture 4 – Reinforcement Learning Robotics, Computer Vision, System Software BE, MS, PhD (MMMTU, IISc, IIIT-Hyderabad) Kumar Bipin
- 233. Agenda 4.1 Markov Decision Processes 4.2 Bellman Optimality and Q-Learning 4.3 Deep Q-Learning 2
- 235. Reinforcement Learning So far: I Supervised learning, lots of expert demonstrations required I Use of auxiliary, short-term loss functions I Imitation learning: per-frame loss on action I Direct perception: per-frame loss on affordance indicators Now: I Learning of models based on the loss that we actually care about, e.g.: I Minimize time to target location I Minimize number of collisions I Minimize risk I Maximize comfort I etc. Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017. 4
- 236. Types of Learning Supervised Learning: I Dataset: {(xi, yi)} (xi = data, yi = label) Goal: Learn mapping x 7→ y I Examples: Classiﬁcation, regression, imitation learning, affordance learning, etc. Unsupervised Learning: I Dataset: {(xi)} (xi = data) Goal: Discover structure underlying data I Examples: Clustering, dimensionality reduction, feature learning, etc. Reinforcement Learning: I Agent interacting with environment which provides numeric reward signals I Goal: Learn how to take actions in order to maximize reward I Examples: Learning of manipulation or control tasks (everything that interacts) Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017. 5
- 237. Introduction to Reinforcement Learning [Figure: agent/environment interaction loop] I Agent observes environment state st at time t I Agent sends action at at time t to the environment I Environment returns the reward rt and its new state st+1 to the agent Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017. 6
- 238. Introduction to Reinforcement Learning I Goal: Select actions to maximize total future reward I Actions may have long term consequences I Reward may be delayed, not instantaneous I It may be better to sacriﬁce immediate reward to gain more long-term reward I Examples: I Financial investment (may take months to mature) I Refuelling a helicopter (might prevent crash in several hours) I Sacriﬁcing a chess piece (might help winning chances in the future) Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017. 7
- 239. Example: Cart Pole Balancing I Objective: Balance pole on moving cart I State: Angle, angular vel., position, vel. I Action: Horizontal force applied to cart I Reward: 1 if pole is upright at time t https://gym.openai.com/envs/#classic_control 8
- 240. Example: Robot Locomotion http://blog.openai.com/roboschool/ I Objective: Make robot move forward I State: Position and angle of joints I Action: Torques applied on joints I Reward: 1 if upright forward moving https://gym.openai.com/envs/#mujoco 9
- 241. Example: Atari Games http://blog.openai.com/gym-retro/ I Objective: Maximize game score I State: Raw pixels of screen (210x160) I Action: Left, right, up, down I Reward: Score increase/decrease at t https://gym.openai.com/envs/#atari 10
- 242. Example: Go www.deepmind.com/research/alphago/ I Objective: Winning the game I State: Position of all pieces I Action: Location of next piece I Reward: 1 if game won, 0 otherwise www.deepmind.com/research/alphago/ 11
- 243. Example: Self-Driving I Objective: Lane Following I State: Image (96x96) I Action: Acceleration, Steering I Reward: negative per frame, positive per visited track tile https://gym.openai.com/envs/CarRacing-v0/ 12
- 244. Reinforcement Learning: Overview Agent Environment Action at State st Reward rt Next state st+1 I How can we mathematically formalize the RL problem? 13
- 245. Markov Decision Process Markov Decision Process (MDP) models the environment and is deﬁned by the tuple (S, A, R, P, γ) with I S : set of possible states I A: set of possible actions I R(rt|st, at): distribution of current reward given (state,action) pair I P(st+1|st, at): distribution over next state given (state,action) pair I γ: discount factor (determines value of future rewards) Almost all reinforcement learning problems can be formalized as MDPs 14
- 246. Markov Decision Process Markov property: The current state completely characterizes the state of the world I A state st is Markov if and only if P(st+1|st) = P(st+1|s1, ..., st) I ”The future is independent of the past given the present” I The state captures all relevant information from the history I Once the state is known, the history may be thrown away I The state is a sufficient statistic of the future 15
- 247. Markov Decision Process Reinforcement learning loop: I At time t = 0: I Environment samples initial state s0 ∼ P(s0) I Then, for t = 0 until done: I Agent selects action at I Environment samples reward rt ∼ R(rt|st, at) I Environment samples next state st+1 ∼ P(st+1|st, at) I Agent receives reward rt and next state st+1 Agent Environment at st rt st+1 How do we select an action? 16
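The loop above can be sketched in a few lines of Python. The two-state environment, its rewards, and all names here are hypothetical, chosen only to illustrate the sample-act-observe cycle:

```python
import random

# Minimal sketch of the RL interaction loop on a toy two-state MDP.
# Environment dynamics and rewards are illustrative, not from the slides.

class ToyEnv:
    def __init__(self):
        self.state = 0  # environment samples initial state s0

    def step(self, action):
        # P(s_{t+1} | s_t, a_t): deterministic here for simplicity
        next_state = (self.state + action) % 2
        # R(r_t | s_t, a_t): reward 1 for landing in state 1, else 0
        reward = 1 if next_state == 1 else 0
        self.state = next_state
        return reward, next_state

env = ToyEnv()
total_reward = 0
for t in range(10):
    action = random.choice([0, 1])         # agent selects action a_t
    reward, next_state = env.step(action)  # env returns r_t and s_{t+1}
    total_reward += reward
```

The same reset/step shape is what OpenAI Gym environments expose, which is why the examples on the following slides plug into it directly.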
- 248. Policy A policy π is a function from S to A that speciﬁes what action to take in each state: I A policy fully deﬁnes the behavior of an agent I Deterministic policy: a = π(s) I Stochastic policy: π(a|s) = P(at = a|st = s) Remark: I MDP policies depend only on the current state and not the entire history I However, the current state may include past observations 17
- 249. Policy How do we learn a policy? Imitation Learning: Learn a policy from expert demonstrations I Expert demonstrations are provided I Supervised learning problem Reinforcement Learning: Learn a policy through trial-and-error I No expert demonstrations given I The agent discovers by itself which actions maximize the expected future reward I The agent interacts with the environment and obtains reward I The agent discovers good actions and improves its policy π 18
- 250. Exploration vs. Exploitation How do we discover good actions? Answer: We need to explore the state/action space. Thus RL combines two tasks: I Exploration: Try a novel action a in state s, observe reward rt I Discovers more information about the environment, but sacrifices total reward I Game-playing example: Play a novel experimental move I Exploitation: Use a previously discovered good action a I Exploits known information to maximize reward, but sacrifices unexplored areas I Game-playing example: Play the move you believe is best Trade-off: It is important to explore and exploit simultaneously 19
- 251. Exploration vs. Exploitation How to balance exploration and exploitation? ε-greedy exploration algorithm: I Try all possible actions with non-zero probability I With probability ε choose an action at random (exploration) I With probability 1 − ε choose the best action (exploitation) I The greedy action is defined as the best action discovered so far I ε is large initially and gradually annealed (=reduced) over time 20
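A minimal sketch of ε-greedy selection with annealing, assuming a tabular Q lookup `q[(state, action)]`; the decay schedule and all names are illustrative:

```python
import random

# epsilon-greedy action selection with gradual annealing (illustrative).

def epsilon_greedy(q, state, actions, epsilon):
    if random.random() < epsilon:   # with probability epsilon: explore
        return random.choice(actions)
    # with probability 1 - epsilon: exploit the best known action
    return max(actions, key=lambda a: q.get((state, a), 0.0))

def anneal(epsilon, decay=0.995, floor=0.05):
    # epsilon is large initially and gradually reduced over time
    return max(floor, epsilon * decay)
```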
- 252. 4.2 Bellman Optimality and Q-Learning
- 253. Value Functions How good is a state? The state-value function V π at state st is the expected cumulative discounted reward (rt ∼ R(rt|st, at)) when following policy π from state st: V π(st) = E[rt + γ rt+1 + γ² rt+2 + · · · | st, π] = E[Σk≥0 γ^k rt+k | st, π] I The discount factor γ < 1 is the value of future rewards at current time t I Weights immediate reward higher than future reward (e.g., γ = 1/2 ⇒ γ^k = 1, 1/2, 1/4, 1/8, 1/16, . . . ) I Determines the agent’s far/short-sightedness I Avoids infinite returns in cyclic Markov processes 22
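The discounting can be checked numerically: with γ = 1/2 and a toy sequence of five unit rewards, the return is 1 + 1/2 + 1/4 + 1/8 + 1/16:

```python
# Quick numeric check of the discounted return (toy reward sequence).

def discounted_return(rewards, gamma):
    # sum_{k>=0} gamma^k * r_{t+k}
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1, 1, 1, 1, 1], 0.5))  # 1.9375 = 1 + 1/2 + 1/4 + 1/8 + 1/16
```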
- 254. Value Functions How good is a state-action pair? The action-value function Qπ at state st and action at is the expected cumulative discounted reward when taking action at in state st and then following the policy π: Qπ(st, at) = E[Σk≥0 γ^k rt+k | st, at, π] I The discount factor γ ∈ [0, 1] is the value of future rewards at current time t I Weights immediate reward higher than future reward (e.g., γ = 1/2 ⇒ γ^k = 1, 1/2, 1/4, 1/8, 1/16, . . . ) I Determines the agent’s far/short-sightedness I Avoids infinite returns in cyclic Markov processes 23
- 255. Optimal Value Functions The optimal state-value function V ∗(st) is the best V π(st) over all policies π: V ∗(st) = max_π V π(st), with V π(st) = E[Σk≥0 γ^k rt+k | st, π] The optimal action-value function Q∗(st, at) is the best Qπ(st, at) over all policies π: Q∗(st, at) = max_π Qπ(st, at), with Qπ(st, at) = E[Σk≥0 γ^k rt+k | st, at, π] I The optimal value functions specify the best possible performance in the MDP I However, searching over all possible policies π is computationally intractable 24
- 256. Optimal Policy If Q∗(st, at) were known, what would be the optimal policy? π∗(st) = argmax_{a′∈A} Q∗(st, a′) I Unfortunately, searching over all possible policies π is intractable in most cases I Thus, determining Q∗(st, at) is hard in general (for most interesting problems) I Let’s have a look at a simple example where the optimal policy is easy to compute 25
- 257. A Simple Grid World Example Actions: {right, left, up, down} States: grid cells, with two terminal states (marked with ’?’) Reward: r = −1 for each transition Objective: Reach one of the terminal states (marked with ’?’) in the least number of actions I Penalty (negative reward) given for every transition made 26
- 258. A Simple Grid World Example [Figure: the grid world under the random policy vs. the optimal policy] I The arrows indicate equal probability of moving into each of the directions 27
- 259. Solving for the Optimal Policy
- 260. Bellman Optimality Equation I The Bellman Optimality Equation is named after Richard Ernest Bellman who introduced dynamic programming in 1953 I Almost any problem which can be solved using optimal control theory can be solved via the appropriate Bellman equation Richard Ernest Bellman Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017. 29
- 261. Bellman Optimality Equation The Bellman Optimality Equation (BOE) decomposes Q∗ as follows: Q∗(st, at) = E[rt + γ rt+1 + γ² rt+2 + · · · | st, at] = E[rt + γ max_{a′∈A} Q∗(st+1, a′) | st, at] This recursive formulation comprises two parts: I Current reward: rt I Discounted optimal action-value of the successor: γ max_{a′∈A} Q∗(st+1, a′) We want to determine Q∗(st, at). How can we solve the BOE? I The BOE is non-linear (because of the max-operator) ⇒ no closed form solution I Several iterative methods have been proposed, most popular: Q-Learning Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017. 30
- 262. Proof of the Bellman Optimality Equation Proof of the Bellman Optimality Equation for the optimal action-value function Q∗: Q∗(st, at) = E[rt + γ rt+1 + γ² rt+2 + · · · | st, at] = E[Σk≥0 γ^k rt+k | st, at] = E[rt + γ Σk≥0 γ^k rt+k+1 | st, at] = E[rt + γ V ∗(st+1) | st, at] = E[rt + γ max_{a′} Q∗(st+1, a′) | st, at] Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017. 31
- 263. Bellman Optimality Equation Why is it useful to solve the BOE? I A greedy policy which chooses the action that maximizes the optimal action-value function Q∗ or the optimal state-value function V ∗ takes into account the reward consequences of all possible future behavior I Via Q∗ and V ∗ the optimal expected long-term return is turned into a quantity that is locally and immediately available for each state / state-action pair I For V ∗, a one-step-ahead search yields the optimal actions I Q∗ effectively caches the results of all one-step-ahead searches Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017. 32
- 264. Q-Learning Q-Learning: Iteratively solve for Q∗ Q∗(st, at) = E[rt + γ max_{a′∈A} Q∗(st+1, a′) | st, at] by constructing an update sequence Q1, Q2, . . . using learning rate α: Qi+1(st, at) ← (1 − α) Qi(st, at) + α (rt + γ max_{a′∈A} Qi(st+1, a′)) = Qi(st, at) + α [(rt + γ max_{a′∈A} Qi(st+1, a′)) − Qi(st, at)], where the bracketed difference between the target rt + γ max_{a′∈A} Qi(st+1, a′) and the prediction Qi(st, at) is the temporal difference (TD) error I Qi will converge to Q∗ as i → ∞ Note: the policy π is learned implicitly via the Q table! Watkins and Dayan: Technical Note Q-Learning. Machine Learning, 1992. 33
- 265. Q-Learning Implementation: I Initialize Q table and initial state s0 randomly I Repeat: I Observe state st, choose action at according to the ε-greedy strategy (Q-Learning is “off-policy” as the updated policy is different from the behavior policy) I Observe reward rt and next state st+1 I Compute TD error: rt + γ max_{a′∈A} Qi(st+1, a′) − Qi(st, at) I Update Q table What’s the problem with using Q tables? I Scalability: Tables don’t scale to high dimensional state/action spaces (e.g., Go) I Solution: Use a function approximator (neural network) to represent Q(s, a) Watkins and Dayan: Technical Note Q-Learning. Machine Learning, 1992. 34
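The steps above can be sketched as tabular Q-Learning on a tiny corridor world, a hypothetical environment in the spirit of the grid world example (reward −1 per transition, one terminal state); all names and hyperparameters are illustrative:

```python
import random

# Tabular Q-Learning on a 1-D corridor: states 0..4, terminal state 4,
# actions left (-1) / right (+1), reward -1 per transition (illustrative).

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2):
    actions = [-1, +1]
    q = {(s, a): 0.0 for s in range(5) for a in actions}
    for _ in range(episodes):
        s = 0
        while s != 4:                      # 4 is terminal
            if random.random() < epsilon:  # epsilon-greedy behavior policy
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: q[(s, a_)])
            s_next = min(max(s + a, 0), 4)
            r = -1                         # -1 for each transition
            target = r if s_next == 4 else r + gamma * max(q[(s_next, a_)] for a_ in actions)
            q[(s, a)] += alpha * (target - q[(s, a)])  # TD update
            s = s_next
    return q

q = q_learning()
# the greedy policy read off the Q table should move right everywhere
print([max([-1, 1], key=lambda a: q[(s, a)]) for s in range(4)])
```

Note that the policy is never represented explicitly; it is recovered by taking the argmax over the learned Q table, exactly as the slide remarks.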
- 266. 4.3 Deep Q-Learning
- 267. Deep Q-Learning Use a deep neural network with weights θ as function approximator to estimate Q: Q(s, a; θ) ≈ Q∗(s, a) [Diagram: left — a network maps (s, a) to Q(s, a; θ); right — a network maps s to Q(s, a1; θ), . . . , Q(s, am; θ), one output per action] Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 36
- 268. Training the Q Network Forward Pass: Loss function is the mean-squared error in Q-values: L(θ) = E[(rt + γ max_{a′} Q(st+1, a′; θ) − Q(st, at; θ))²], where rt + γ max_{a′} Q(st+1, a′; θ) is the target and Q(st, at; θ) the prediction Backward Pass: Gradient update with respect to the Q-function parameters θ: ∇θ L(θ) = ∇θ E[(rt + γ max_{a′} Q(st+1, a′; θ) − Q(st, at; θ))²] Optimize the objective end-to-end with stochastic gradient descent (SGD) using ∇θ L(θ). Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 37
- 269. Experience Replay To speed up training we would like to train on mini-batches: I Problem: Learning from consecutive samples is inefficient I Reason: Strong correlations between consecutive samples Experience replay stores the agent’s experiences at each time-step I Continually update a replay memory D with new experiences et = (st, at, rt, st+1) I Train on samples (st, at, rt, st+1) ∼ U(D) drawn uniformly at random from D I Breaks correlations between samples I Improves data efficiency as each sample can be used multiple times In practice, a circular replay memory of finite size is used. Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 38
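A circular replay memory with uniform sampling fits in a few lines; capacity and batch size here are illustrative:

```python
import random
from collections import deque

# Minimal sketch of a circular experience replay memory.

class ReplayMemory:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))  # e_t = (s_t, a_t, r_t, s_{t+1})

    def sample(self, batch_size):
        # uniform sampling breaks correlations between consecutive samples
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.push(t, 0, 1.0, t + 1)
print(len(memory.buffer))  # 3: the circular memory keeps only the newest
```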
- 270. Fixed Q Targets Problem: Non-stationary targets I As the policy changes, so do our targets: rt + γ max_{a′} Q(st+1, a′; θ) I This may lead to oscillation or divergence Solution: Use fixed Q targets to stabilize training I A target network Q with weights θ− is used to generate the targets: L(θ) = E(st,at,rt,st+1)∼U(D)[(rt + γ max_{a′} Q(st+1, a′; θ−) − Q(st, at; θ))²] I The target network Q is only updated every C steps by cloning the Q-network I Effect: Reduces oscillation of the policy by adding a delay Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 39
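The periodic cloning can be sketched with weights represented as plain dicts; this is an illustration of the scheduling, not a real DQN:

```python
# Sketch of fixed Q targets: clone the online weights into the target
# network every C steps (all values illustrative).

def clone(weights):
    return dict(weights)          # target network gets a frozen copy

theta = {"w": 0.0}                # online network weights
theta_minus = clone(theta)        # target network weights (theta minus)
C = 4                             # clone every C steps

for step in range(1, 11):
    theta["w"] += 1.0             # pretend SGD updated the online network
    if step % C == 0:
        theta_minus = clone(theta)

print(theta["w"], theta_minus["w"])  # 10.0 8.0: the target lags behind
```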
- 271. Putting it together Deep Q-Learning using experience replay and fixed Q targets: I Take action at according to the ε-greedy policy I Store transition (st, at, rt, st+1) in replay memory D I Sample a random mini-batch of transitions (st, at, rt, st+1) from D I Compute Q targets using old parameters θ− I Optimize the MSE between Q targets and Q network predictions L(θ) = Est,at,rt,st+1∼D[(rt + γ max_{a′} Q(st+1, a′; θ−) − Q(st, at; θ))²] using stochastic gradient descent. Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 40
- 272. Case Study: Playing Atari Games Agent Environment Objective: Complete the game with the highest score Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 41
- 273. Case Study: Playing Atari Games Q(s, a; θ): Neural network with weights θ FC-Out (Q values) FC-256 32 4x4 conv, stride 2 16 8x8 conv, stride 2 Input: 84 × 84 × 4 stack of last 4 frames (after grayscale conversion, downsampling, cropping) Output: Q values for all (4 to 18) Atari actions (efficient: a single forward pass computes Q for all actions) Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 42
- 274. Case Study: Playing Atari Games Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 43
- 275. Deep Q-Learning Shortcomings Deep Q-Learning suffers from several shortcomings: I Long training times I Uniform sampling from replay buffer ⇒ all transitions equally important I Simplistic exploration strategy I Action space is limited to a discrete set of actions (otherwise, expensive test-time optimization required) Various improvements over the original algorithm have been explored. 44
- 276. Deep Deterministic Policy Gradients DDPG addresses the problem of continuous action spaces. Problem: Finding a continuous action requires optimization at every timestep. Solution: Use two networks, an actor (deterministic policy) and a critic. [Diagram: actor network maps s to µ(s; θµ); critic network maps (s, a = µ(s; θµ)) to Q(s, a; θQ)] Lillicrap et al.: Continuous Control with Deep Reinforcement Learning. ICLR, 2016. 45
- 277. Deep Deterministic Policy Gradients I Actor network with weights θµ estimates the agent’s deterministic policy µ(s; θµ) I Update the deterministic policy µ(·) in the direction that most improves Q I Apply the chain rule to the expected return (this is the policy gradient): ∇θµ Est,at,rt,st+1∼D[Q(st, µ(st; θµ); θQ)] = E[∇a Q(st, a; θQ)|a=µ(st;θµ) ∇θµ µ(st; θµ)] I Critic estimates the value of the current policy, Q(s, a; θQ) I Learned using the Bellman Optimality Equation as in Q-Learning, with critic gradient ∇θQ Est,at,rt,st+1∼D[(rt + γ Q(st+1, µ(st+1; θµ−); θQ−) − Q(st, at; θQ))²] I Remark: No maximization over actions is required as this step is now learned via µ(·) Lillicrap et al.: Continuous Control with Deep Reinforcement Learning. ICLR, 2016. 46
- 278. Deep Deterministic Policy Gradients Experience replay and target networks are again used to stabilize training: I Replay memory D stores transition tuples (st, at, rt, st+1) I Target networks are updated using “soft” target updates I Weights are not directly copied but slowly adapted: θQ− ← τ θQ + (1 − τ) θQ−, θµ− ← τ θµ + (1 − τ) θµ−, where 0 < τ < 1 controls the tradeoff between speed and stability of learning Exploration is performed by adding noise N to the policy: µ(s; θµ) + N Lillicrap et al.: Continuous Control with Deep Reinforcement Learning. ICLR, 2016. 47
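The soft update can be sketched with weights stored as lists of floats (illustrative representation); τ trades off tracking speed against stability:

```python
# Sketch of the "soft" target update used by DDPG (toy weight vectors).

def soft_update(target, online, tau=0.1):
    # theta_minus <- tau * theta + (1 - tau) * theta_minus
    return [tau * w + (1 - tau) * w_t for w, w_t in zip(online, target)]

theta = [1.0, 2.0]        # online weights
theta_minus = [0.0, 0.0]  # target weights
theta_minus = soft_update(theta_minus, theta, tau=0.5)
print(theta_minus)  # [0.5, 1.0]: the target slowly tracks the online net
```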
- 279. Prioritized Experience Replay Prioritize experience to replay important transitions more frequently I Priority is measured by the magnitude of the temporal difference (TD) error: δ = rt + γ max_{a′} Q(st+1, a′; θQ−) − Q(st, at; θQ) I The TD error measures how “surprising” or unexpected the transition is I Stochastic prioritization avoids overfitting due to lack of diversity I Enables a learning speed-up by a factor of 2 on Atari benchmarks Schaul et al.: Prioritized Experience Replay. ICLR, 2016. 48
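Priority-proportional (stochastic) sampling can be sketched as follows, assuming the priorities are |TD error| magnitudes; all values are illustrative:

```python
import random

# Sketch of stochastic prioritization: sampling probability is
# proportional to priority, with a small eps so every transition
# remains possible (avoids starving low-priority samples).

def sample_prioritized(transitions, priorities, batch_size, eps=1e-6):
    weights = [p + eps for p in priorities]
    return random.choices(transitions, weights=weights, k=batch_size)

transitions = ["t0", "t1", "t2"]
priorities = [0.1, 5.0, 0.1]      # t1 was the most "surprising" transition
batch = sample_prioritized(transitions, priorities, batch_size=4)
```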
- 280. Learning to Drive in a Day Real-world RL demo by Wayve: I Deep Deterministic Policy Gradients with Prioritized Experience Replay I Input: Single monocular image I Action: Steering and speed I Reward: Distance traveled without the safety driver taking control (requires no maps / localization) I 4 Conv layers, 2 FC layers I Only 35 training episodes Kendall, Hawke, Janz, Mazur, Reda, Allen, Lam, Bewley and Shah: Learning to Drive in a Day. ICRA, 2019. 49
- 281. Learning to Drive in a Day Kendall, Hawke, Janz, Mazur, Reda, Allen, Lam, Bewley and Shah: Learning to Drive in a Day. ICRA, 2019. 50
- 282. Other ﬂavors of Deep RL
- 283. Asynchronous Deep Reinforcement Learning Execute multiple agents in separate environment instances: I Each agent interacts with its own environment copy and collects experience I Agents may use different exploration policies to maximize experience diversity I Experience is not stored but directly used to update a shared global model I Stabilizes training in similar way to experience replay by decorrelating samples I Leads to reduction in training time roughly linear in the number of parallel agents Mnih et al.: Asynchronous Methods for Deep Reinforcement Learning. ICML, 2016. 52
- 284. Bootstrapped DQN Bootstrapping for efficient exploration: I Approximate a distribution over Q values via K bootstrapped ”heads” I At the start of each epoch, a single head Qk is selected uniformly at random I After training, all heads can be combined into a single ensemble policy [Diagram: shared torso θshared on input s feeding K heads Q1, . . . , QK with weights θQ1, . . . , θQK] Osband et al.: Deep Exploration via Bootstrapped DQN. NIPS, 2016. 53
- 285. Double Q-Learning Double Q-Learning I Decouple the Q function for selection and evaluation of actions to avoid Q overestimation and stabilize training. Target: DQN: rt + γ max_{a′} Q(st+1, a′; θ−) Double DQN: rt + γ Q(st+1, argmax_{a′} Q(st+1, a′; θ); θ−) I The online network with weights θ is used to determine the greedy policy I The target network with weights θ− is used to determine the corresponding action value I Improves performance on Atari benchmarks van Hasselt et al.: Deep Reinforcement Learning with Double Q-learning. AAAI, 2016. 54
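The two targets can be contrasted with tabular Q lookups (hypothetical values: the target network overestimates action "r", which Double DQN sidesteps because the online network selects the action):

```python
# DQN vs. Double DQN target computation on toy Q tables.

def dqn_target(r, gamma, q_target, s_next, actions):
    # target network both selects and evaluates the action
    return r + gamma * max(q_target[(s_next, a)] for a in actions)

def double_dqn_target(r, gamma, q_online, q_target, s_next, actions):
    # online network selects the action, target network evaluates it
    a_star = max(actions, key=lambda a: q_online[(s_next, a)])
    return r + gamma * q_target[(s_next, a_star)]

actions = ["l", "r"]
q_online = {("s1", "l"): 1.0, ("s1", "r"): 0.5}
q_target = {("s1", "l"): 0.2, ("s1", "r"): 2.0}  # overestimates "r"
print(dqn_target(0.0, 0.9, q_target, "s1", actions))                    # 1.8
print(double_dqn_target(0.0, 0.9, q_online, q_target, "s1", actions))   # 0.18
```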
- 286. Deep Recurrent Q-Learning Add recurrency to a deep Q-network to handle partial observability of states: FC-Out (Q-values) LSTM 32 4x4 conv, stride 2 16 8x8 conv, stride 2 Replace the fully-connected layer with a recurrent LSTM layer Hausknecht and Stone: Deep Recurrent Q-Learning for Partially Observable MDPs. AAAI, 2015. 55
- 288. Summary I Reinforcement learning learns through interaction with the environment I The environment is typically modeled as a Markov Decision Process I The goal of RL is to maximize the expected future reward I Reinforcement learning requires trading off exploration and exploitation I Q-Learning iteratively solves for the optimal action-value function I The policy is learned implicitly via the Q table I Deep Q-Learning scales to continuous/high-dimensional state spaces I Deep Deterministic Policy Gradients scales to continuous action spaces I Experience replay and target networks are necessary to stabilize training 57
- 289. Self-Driving Cars Lecture 5 – Vehicle Dynamics Robotics, Computer Vision, System Software BE, MS, PhD (MMMTU, IISc, IIIT-Hyderabad) Kumar Bipin
- 290. Agenda 5.1 Introduction 5.2 Kinematic Bicycle Model 5.3 Tire Models 5.4 Dynamic Bicycle Model 2
- 291. 5.1 Introduction
- 292. Electronic Stability Program Knowledge of vehicle dynamics enables accurate vehicle control 5
- 293. Kinematics vs. Kinetics Kinematics: I Greek origin: “motion”, “moving” I Describes motion of points and bodies I Considers position, velocity, acceleration, .. I Examples: Celestial bodies, particle systems, robotic arm, human skeleton Kinetics: I Describes causes of motion I Effects of forces/moments I Newton’s laws, e.g., F = ma 6
- 294. Holonomic Constraints Holonomic constraints are constraints on the configuration: I Assume a particle in three dimensions (x, y, z) ∈ R3 I We can constrain the particle to the x/y plane via: z = 0 ⇔ f(x, y, z) = 0 with f(x, y, z) = z I Constraints of the form f(x, y, z) = 0 are called holonomic constraints I They constrain the configuration space I But the system can move freely in that space I Controllable degrees of freedom equal total degrees of freedom (2) 7
- 295. Non-Holonomic Constraints Non-Holonomic constraints are constraints on the velocity: I Assume a vehicle that is parameterized by (x, y, ψ) ∈ R2 × [0, 2π] I The 2D vehicle velocity is given by: ẋ = v cos(ψ) ẏ = v sin(ψ) ⇒ ẋ sin(ψ) − ẏ cos(ψ) = 0 I This non-holonomic constraint cannot be expressed in the form f(x, y, ψ) = 0 I The car cannot freely move in any direction (e.g., sideways) I It constrains the velocity space, but not the conﬁguration space I Controllable degrees of freedom less than total degrees of freedom (2 vs. 3) 8
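The constraint can be verified numerically: for any heading ψ and speed v, the velocity (ẋ, ẏ) = (v cos ψ, v sin ψ) satisfies ẋ sin(ψ) − ẏ cos(ψ) = 0, i.e., there is no sideways velocity component:

```python
import math

# Numeric sanity check of the non-holonomic no-side-slip constraint.

def constraint(v, psi):
    x_dot = v * math.cos(psi)
    y_dot = v * math.sin(psi)
    return x_dot * math.sin(psi) - y_dot * math.cos(psi)

print(abs(constraint(5.0, 0.7)) < 1e-12)  # True for any v and psi
```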
- 296. Holonomic vs. Non-Holonomic Systems Holonomic Systems I Constrain conﬁguration space I Can freely move in any direction I Controllable degrees of freedom equal to total degrees of freedom I Constraints can be described by f(x1, . . . , xN ) = 0 Example: 3D Particle z = 0 x/y plane Nonholonomic Systems I Constrain velocity space I Cannot freely move in any direction I Controllable degrees of freedom less than total degrees of freedom I Constraints cannot be described by f(x1, . . . , xN ) = 0 Example: Car ẋ sin(ψ) − ẏ cos(ψ) = 0 9
- 297. Holonomic vs. Non-Holonomic Systems I A robot can be subject to both holonomic and non-holonomic constraints I A car (rigid body in 3D) is kept on the ground by 3 holonomic constraints I One additional non-holonomic constraint prevents sideways sliding 10
- 298. Coordinate Systems I Inertial Frame: Fixed to earth with vertical Z-axis and X/Y horizontal plane I Vehicle Frame: Attached to vehicle at a fixed reference point; xv points towards the front, yv to the side and zv to the top of the vehicle (ISO 8855) I Horizontal Frame: Origin at the vehicle reference point (like the vehicle frame) but the x- and y-axes are projections of the xv- and yv-axes onto the X/Y horizontal plane 11
- 299. Kinematics of a Point The position rP (t) ∈ R3 of point P at time t ∈ R is given by 3 coordinates. Velocity and acceleration are the first and second derivatives of the position rP (t): rP (t) = (x(t), y(t), z(t))ᵀ, vP (t) = ṙP (t) = (ẋ(t), ẏ(t), ż(t))ᵀ, aP (t) = r̈P (t) = (ẍ(t), ÿ(t), z̈(t))ᵀ [Figure: trajectory of point P] 12
- 300. Kinematics of a Rigid Body A rigid body refers to a collection of infinitely many infinitesimally small mass points which are rigidly connected, i.e., their relative position remains unchanged over time. Its motion can be compactly described by the motion of an (arbitrary) reference point C of the body plus the relative motion of all other points P with respect to C. I C: Reference point fixed to rigid body I P: Arbitrary point on rigid body I ω: Angular velocity of rigid body I Position: rP = rC + rCP I Velocity: vP = vC + ω × rCP I Due to rigidity, points P can only rotate wrt. C I Thus a rigid body has 6 DoF (3 pos., 3 rot.) 13
- 301. Instantaneous Center of Rotation At each time instance t ∈ R, there exists a particular reference point O (called the instantaneous center of rotation) for which vO(t) = 0. Each point P of the rigid body performs a pure rotation about O: vP = vO + ω × rOP = ω × rOP Example 1: Turning Wheel I Wheel is completely lifted off the ground I Wheel does not move in x or y direction I Ang. vel. vector ω points into x/y plane I Velocity of point P: vP = ωR with radius R 14
- 302. Instantaneous Center of Rotation At each time instance t ∈ R, there exists a particular reference point O (called the instantaneous center of rotation) for which vO(t) = 0. Each point P of the rigid body performs a pure rotation about O: vP = vO + ω × rOP = ω × rOP Example 2: Rolling Wheel I Wheel is rolling on the ground without slip I Ground is ﬁxed in x/y plane I Ang. vel. vector ω points into x/y plane I Velocity of point P: vP = 2ωR with radius R 14
- 304. Rigid Body Motion Rotation Center I Different points on the rigid body move along different circular trajectories 16
- 305. Kinematic Bicycle Model I The kinematic bicycle model approximates the 4 wheels with 2 imaginary wheels 17
- 307. Kinematic Bicycle Model Rotation Center Wheelbase Vehicle Velocity Front Wheel Velocity Back Wheel Velocity Slip Angle Front Steering Angle Center of Gravity Heading Angle Back Steering Angle Turning Radius Course Angle Assumptions: - Planar motion (no roll, no pitch) - Low speed = No wheel slip (wheel orientation = wheel velocity) I The kinematic bicycle model approximates the 4 wheels with 2 imaginary wheels 17
- 308. Kinematic Bicycle Model Model Rotation Center Motion Equations Ẋ = v cos(ψ + β), Ẏ = v sin(ψ + β), ψ̇ = (v cos(β) / (ℓf + ℓr)) (tan(δf) − tan(δr)), β = tan−1((ℓf tan(δr) + ℓr tan(δf)) / (ℓf + ℓr)) (proof as exercise) 18
- 309. Kinematic Bicycle Model Model Rotation Center Motion Equations Ẋ = v cos(ψ + β), Ẏ = v sin(ψ + β), ψ̇ = (v cos(β) / (ℓf + ℓr)) tan(δ), β = tan−1(ℓr tan(δ) / (ℓf + ℓr)) (only front steering) Derivation: tan δ = (ℓf + ℓr)/R0 ⇒ 1/R0 = tan δ / (ℓf + ℓr); tan β = ℓr/R0 = ℓr tan δ / (ℓf + ℓr); cos β = R0/R ⇒ 1/R = cos β / R0 ⇒ ψ̇ = ω = v/R = v cos(β)/R0 = (v cos(β) / (ℓf + ℓr)) tan(δ) 18
- 310. Kinematic Bicycle Model Model Rotation Center Motion Equations Ẋ = v cos(ψ), Ẏ = v sin(ψ), ψ̇ = v δ / (ℓf + ℓr) (assuming β and δ are very small) 19
- 311. Kinematic Bicycle Model Model Rotation Center Motion Equations Xt+1 = Xt + v cos(ψt) ∆t, Yt+1 = Yt + v sin(ψt) ∆t, ψt+1 = ψt + (v δ / (ℓf + ℓr)) ∆t (time discretized model) 19
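The time-discretized model integrates directly with a Euler loop; the wheelbase, inputs, and step size below are illustrative values:

```python
import math

# Euler integration of the small-angle kinematic bicycle model:
# x += v cos(psi) dt, y += v sin(psi) dt, psi += v*delta/(lf+lr) dt.

def simulate(v, delta, lf, lr, dt, steps):
    x, y, psi = 0.0, 0.0, 0.0
    for _ in range(steps):
        x += v * math.cos(psi) * dt
        y += v * math.sin(psi) * dt
        psi += v * delta / (lf + lr) * dt   # constant yaw rate for fixed inputs
    return x, y, psi

# drive 1 s with constant speed and steering; the path curves left
x, y, psi = simulate(v=5.0, delta=0.1, lf=1.2, lr=1.4, dt=0.01, steps=100)
```

With fixed inputs the heading grows linearly, ψ(T) = v δ T / (ℓf + ℓr), which gives a quick check on the integration.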
- 312. Ackermann Steering Geometry Front Steering Angles Turning Radius Rotation Center Wheelbase Track I In practice, the left and right wheel steering angles are not equal if there is no wheel slip I The combination of admissible steering angles is called the Ackermann steering geometry I If the angles are small, the left/right steering wheel angles can be approximated: δl ≈ tan−1(L / (R + 0.5B)) ≈ L / (R + 0.5B), δr ≈ tan−1(L / (R − 0.5B)) ≈ L / (R − 0.5B) 20
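The two angles can be computed for illustrative values of wheelbase L, turning radius R, and track width B (following the slide's assignment of δl to the larger radius R + 0.5B):

```python
import math

# Ackermann steering angles: each wheel steers about its own radius.

def ackermann_angles(L, R, B):
    delta_l = math.atan(L / (R + 0.5 * B))  # wheel at the larger radius
    delta_r = math.atan(L / (R - 0.5 * B))  # wheel at the smaller radius
    return delta_l, delta_r

dl, dr = ackermann_angles(L=2.6, R=10.0, B=1.6)
print(dl < dr)  # True: the wheel closer to the rotation center steers more
```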
- 313. Ackermann Steering Geometry Trapezoidal Geometry Left Turn Right Turn I In practice, this setup can be realized using a trapezoidal tie rod arrangement 21
- 314. 5.3 Tire Models
- 315. Kinematics is not enough .. Which assumption of our model is violated in this case? 23