
Challenges in Autonomous Vehicle Testing and Validation


Presented at SCAV 2017, Pittsburgh PA
Phil Koopman & Mike Wagner

Fully autonomous vehicles need ISO 26262 plus more:
• Doing enough testing is challenging. Even worse…
  – Machine learning systems are inherently brittle and lack “legibility”
• Challenges mapping to the traditional safety V model:
  – Training data is the de facto requirement + design
  – Non-determinism makes it difficult to do testing
• Potential solution elements:
  – Run-time safety monitors worry about safety, not “correctness”
  – Accelerated stress testing via fault injection finds defects otherwise missed in vehicle-level testing
  – Testing philosophy should include black swan events:
    • Off-nominal training & testing scenarios
    • Response to internal faults even if we’re not quite sure how they could happen

Reader comment: “All requirements allocated to Monitor...” *THIS* is how autonomous safety will be done. Thanks for publishing openly!

Challenges in Autonomous Vehicle Testing and Validation

1. CHALLENGES IN AUTONOMOUS VEHICLE TESTING AND VALIDATION
   Philip Koopman, Ph.D. and Michael Wagner
2. Overview: Fully Autonomous Vehicles Are Cool! But what about fleet deployment?
• Need V&V beyond just road tests
  – High ASIL assurance requires a whole lot of testing & some optimism
  – Nondeterministic algorithms yield non-repeatable tests
  – Machine-learning based autonomy is brittle and lacks “legibility”
• What breaks when mapping full autonomy to the safety V model?
  – Autonomy requirements/high-level design are implicit in training data
• Potential strategies for safer autonomous vehicle designs
  – Run-time safety monitors using traditional high-ASIL software
    • Safety invariants to put envelopes around adaptation & frequent updates
  – Accelerated stress testing via fault injection
Koopman & Wagner — https://en.wikipedia.org/wiki/Autonomous_car
3. 6/2016: Tesla “AutoPilot” Crash; 130M Operational Miles
https://www.nytimes.com/interactive/2016/07/01/business/inside-tesla-accident.html?_r=0
4. Validating High-ASIL Systems via Testing Is Challenging
Need to test for at least ~3x the crash rate to validate safety
• Hypothetical fleet deployment: New York Medallion Taxi Fleet [2014 NYC Taxi Fact Book]
  – 13,437 vehicles, average 70,000 miles/yr = 941M miles/year
  – 7 critical crashes in 2015 → 134M miles/critical crash (death or serious injury) [Fatal and Critical Injury data / Local Law 31 of 2014]
• Assume testing is representative and faults are random & independent
• Illustrative: how much testing to ensure the critical crash rate is at least as good as human drivers? (At least 3x the crash rate)

  Testing Miles   Confidence if NO critical crash seen
  122.8M          60%
  308.5M          90%
  401.4M          95%
  617.1M          99%

  – These are optimistic test lengths…
    • Assumes random independent arrivals
    • Is simulated driving accurate enough?
  – More likely takes 10x the crash rate
Using chi-square test from: http://reliabilityanalyticstoolkit.appspot.com/mtbf_test_calculator
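The zero-failure test lengths in the table above follow from the standard chi-square bound, which with zero observed failures reduces to an exponential formula. A minimal sketch (the function name is ours; the 134M-mile target is the human-driver baseline from the taxi-fleet data above):

```python
import math

def required_test_miles(miles_per_crash, confidence):
    """Crash-free test miles needed to show, at the given confidence,
    that the true rate is no worse than miles_per_crash.
    With zero failures, the chi-square bound reduces to
    T = -MTBF * ln(1 - confidence)."""
    return -miles_per_crash * math.log(1.0 - confidence)

# Target: human-driver baseline of ~134M miles per critical crash
for conf in (0.60, 0.90, 0.95, 0.99):
    miles = required_test_miles(134e6, conf)
    print(f"{conf:.0%}: {miles / 1e6:.1f}M miles")
```

This reproduces the 122.8M / 308.5M / 401.4M / 617.1M figures in the table.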
5. [Timeline, 1985–2010: NREC project history — DARPA SC-ALV, NASA Dante II, NASA Lunar Rover, ARPA Demo II, Auto Excavator, Auto Harvesting, DARPA Grand Challenge, DARPA LAGR, DARPA PerceptOR, DARPA UPI, Urban Challenge, Mars Rovers, Auto Forklift, Auto Haulage, Auto Spraying, Laser Paint Removal, Army FCS]
NREC: Faculty, staff, students; off-campus Robotics Institute facility, SCS
6. Robot Testing Is Difficult
• Test coverage over high-dimensional inputs
• Sensitivity to calibration
• Nonlinear behaviors
• Adaptive systems
• Validation of machine learning results
• Non-linear motion planning
7. Testing Non-Deterministic Algorithms: How Do You Test a Randomized Algorithm?
• Randomized path planner
  – Randomly generate paths
  – Pick the best path based on a fitness or goodness score
• Implications for testing:
  – If you can carefully control the random number generator, maybe you can reproduce behavior in unit test
  – At system level, generally sensitive to initial conditions
    • Can be essentially impossible to get test reproducibility in real systems
    • In practice, significant effort to force or “trick” the robot into displaying a behavior
https://youtu.be/E-IUAL-D9SY
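The reproducibility point above can be made concrete: if the planner takes its random number generator as an explicit dependency, a unit test can seed it and get identical plans on every run. A toy sketch (the planner and its cost function are hypothetical, purely for illustration):

```python
import random

def plan_path(rng, n_candidates=100):
    """Hypothetical randomized planner: sample candidate paths as
    parameter vectors, return the one with the lowest cost score."""
    candidates = [[rng.uniform(-1.0, 1.0) for _ in range(5)]
                  for _ in range(n_candidates)]
    return min(candidates, key=lambda p: sum(x * x for x in p))

# Injecting a seeded generator makes unit-level behavior repeatable;
# at system level, timing and initial conditions usually defeat this.
assert plan_path(random.Random(42)) == plan_path(random.Random(42))
```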
8. Getting Obvious Cases Covered Is Challenging
[Images: extreme contrast; no lane infrastructure; poor visibility; unusual obstacles; construction; water (note that it appears flat!)] [Wagner 2014]
9. But, Then There Is The Weird Stuff… (Weirder than any one person can imagine)
http://piximus.net/fun/funny-and-odd-things-spotted-on-the-road
http://blogs.reuters.com/photographers-blog/2012/11/26/house-in-the-middle-of-the-road/
http://edtech2.boisestate.edu/robertsona/506/images/buffalo.jpg
10. Machine Learning Might Be Brittle & Inscrutable. Legibility: can humans understand how ML works?
• Machine Learning “learns” from training data
  – Result is a weighted combination of “features”
• Commonly the weighting is inscrutable, or at least not intuitive
  – There is an unknown (significant?) chance results are brittle
    • E.g., accidental correlations in training data, sensitivity to noise
[Images: “Bus” / “Not a Bus” / magnified difference, from AlexNet; “Baseball” (Nguyen et al., 2015); “Dumbbell” (Mordvintsev et al., 2015)]
Szegedy, Christian, et al. “Intriguing properties of neural networks.” arXiv preprint arXiv:1312.6199 (2013).
11. Where Are the Requirements for Machine Learning? Machine Learning requirements are the training data
• The V model traces requirements to V&V
• Where are the requirements in a machine-learning based system?
  – The ML system is just a framework
  – The training data forms the de facto requirements
• How do you know the training data is “complete”?
  – Training data is safety critical
  – What if a moderately rare case isn’t trained?
    • It might not behave as you expect
    • People’s perception of “almost the same” does not necessarily predict ML responses!
[Diagram: V model — requirements, system, subsystem/component, program, and module specifications stepping down to source code; unit, program, subsystem/component, system integration, and acceptance tests stepping back up; verification, validation & traceability links with reviews at each step; “Cluster Analysis ? ?” marks where ML requirements would fit]
12. ** Toyota $1.4B class action suit
13. APD (Autonomous Platform Demonstrator): How did we make this scenario safe?
TARGET GVW: 8,500 kg; TARGET SPEED: 80 km/hr
Approved for Public Release. TACOM Case #20247, Date: 07 OCT 2009
14. The Autonomous Platform Demonstrator (APD) was the first UGV to use a Safety Monitor as part of its safety case. As a result, the U.S. Army approved APD for demonstrations involving soldier participation. U.S. Army cites high quality of APD safety case and turns to NREC to improve the safety of unmanned vehicles.
Approved for Public Release – Distribution is Unlimited (NREC case #: STAA-2012-10-17)
15. Run-Time Safety Monitors. Approach: Enforce Safety with a Monitor/Actuator Pair
(APD is the first unmanned vehicle to use the Safety Monitor. Unclassified: Distribution A. Approved for Public Release. TACOM Case #19281, Date: 20 OCT 2009)
• The “Actuator” is the ML-based software
  – Usually works
  – But might sometimes be unsafe
  – Actuator failures are drivability problems
• All safety requirements are allocated to the Monitor
  – Monitor performs a safety shutdown if unsafe outputs/state are detected
  – Monitor is non-ML software that enforces a safety “envelope”
• In practice, we’ve had significant success with this approach
  – E.g., over-speed shutdown on APD
  – Important point: need to be clever in defining what “safe” means to create monitors
  – Helps define testing pass/fail criteria too
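As a sketch of the monitor/actuator split above: the monitor is deliberately simple, non-ML code that passes a command only when it can show the command is inside the envelope, and otherwise forces a safety shutdown. The interface and envelope below are invented for illustration:

```python
import math

SAFE_STOP = 0.0  # safing command: bring the vehicle to a stop

def monitor(actuator_speed_cmd, envelope_max_speed):
    """Pass the (possibly untrusted, ML-generated) speed command
    through only when it is provably inside the safety envelope;
    otherwise perform a safety shutdown. Writing the check as
    'provably safe' means NaN, which fails every comparison,
    falls through to the shutdown path."""
    if math.isfinite(actuator_speed_cmd) and 0.0 <= actuator_speed_cmd <= envelope_max_speed:
        return actuator_speed_cmd
    return SAFE_STOP
```

Note the design choice: the check is phrased as "accept only what is provably safe" rather than "reject what is known bad," so exceptional values default to shutdown.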
16. Safing Missions To Reduce Redundancy Requirements: What Happens When Primary Autonomy Has a Fault?
• Can’t trust a sick system to act properly
  – With the safety monitor approach, the monitor/actuator pair shuts down
  – But you still need to get the car to a safe state
• Bad news: need automated recovery
  – If the driver drops out of the loop, you can’t just say “it’s your problem!”
• Good news: a short-duration recovery mission makes things easier
  – Cars only need a few seconds to get to the side of the road or stop in lane
  – Think of this as a “safing mission,” like diverting an aircraft
    • Easier reliability because there are only a few seconds for something else to fail
    • Easier requirements because it is a simple “stop vehicle” mission
• In general, can get much simpler, inexpensive safing autonomy
17. Generalized Safety Monitors: Run-time monitoring of a set of partial, semi-formal invariants
• Three types of invariants:
  – Safe invariants: “I’m safe when X”
    • Can under-approximate the safe region
  – Unsafe invariants: “I’m unsafe when Y”
  – Assumption invariants: “My safety proof assumptions have been violated when Z”
• Partial invariants are useful!
  – When testing to discover requirements
  – When testing to check quality
  – At runtime when you hit unknown unknowns
  – When runtime behavior is potentially unpredictable (e.g., adaptive systems, frequently updated systems)
  – To enforce assumptions: operating modes, sensor validity, etc.
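The three invariant types above might be expressed as predicates over observed system state, with the monitor shutting down whenever safety cannot be shown. The state fields and thresholds here are hypothetical:

```python
def safe_invariant(state):
    """'I'm safe when X' -- may under-approximate the safe region."""
    return state["speed"] <= state["speed_limit"]

def unsafe_invariant(state):
    """'I'm unsafe when Y'."""
    return state["obstacle_distance"] < state["stopping_distance"]

def assumption_invariant(state):
    """'My safety proof assumptions have been violated when Z'."""
    return not state["gps_valid"]

def monitor_step(state):
    """Shut down when demonstrably unsafe, when proof assumptions
    fail, or when safety cannot be shown."""
    if unsafe_invariant(state) or assumption_invariant(state):
        return "SHUTDOWN"
    return "OK" if safe_invariant(state) else "SHUTDOWN"
```

Because the invariants are partial, a state that satisfies none of them still defaults to shutdown, which is the conservative choice for a safety monitor.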
18. The Black Swan Meets Autonomous Vehicles. Suggested Philosophy for Testing Autonomous Vehicles:
• Some testing should look for proper functionality
  – But some testing should attempt to falsify a correctness hypothesis
• Much of vehicle autonomy is based on Machine Learning
  – ML is inductive learning… which is vulnerable to “black swan” failures
  – We’ve found safety monitors & robustness testing to be useful
Thousands of miles of “white swans”… make sure to fault inject some “black swans”
19. Accelerating the Search for Unknown Unknowns: Use Robustness Testing (SW Fault Injection) to Stress Test
• Apply combinations of valid & invalid parameters to interfaces:
  – Subroutine calls (e.g., null pointer passed to subroutine)
  – Data flows (e.g., NaN passed as floating-point input)
  – Subsystem interfaces (e.g., CAN messages corrupted on the fly)
  – System-level digital inputs (e.g., corrupted Lidar data sets)
• In our experience, robustness testing finds interesting bugs
  – You can think of it as a targeted, specialized form of fuzzing
• Results:
  – Finds functional defects in autonomous systems
    • Basic design faults, not just exception handling
    • Commonly finds defects missed in extensive field testing
  – Is capable of finding architectural defects
    • E.g., finds missing but necessary redundancy
20. Basic Idea of Scalable Robustness Testing
• Use a testing dictionary based on data types
• Random combinations of pre-selected dictionary values
• Both valid and exceptional values
• Caused task crashes and kernel panics on commercial desktop OSs; but what about on robots?
• Use robustness testing for stress + run-time monitoring as the pass/fail detector
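The dictionary idea can be sketched in a few lines: a table of valid and exceptional values per data type, combined across an interface's parameters, plus a crash oracle. The dictionary contents and function names below are illustrative, not ASTAA's actual tables:

```python
import itertools
import math

# Ballista-style dictionary of valid and exceptional values by type
DICTIONARY = {
    "float": [0.0, 1.0, -1.0, math.inf, -math.inf, math.nan, 1e308],
    "int":   [0, 1, -1, 2**31 - 1, -(2**31)],
}

def robustness_cases(param_types, limit=None):
    """Combinations of dictionary values for an interface whose
    parameter types are known from its specification."""
    pools = [DICTIONARY[t] for t in param_types]
    return list(itertools.islice(itertools.product(*pools), limit))

def run_case(target, args):
    """Minimal pass/fail oracle: the module under test must not
    crash; a run-time safety monitor would also check invariants."""
    try:
        target(*args)
        return "robust"
    except Exception:
        return "reproducible failure"
```

Because the dictionary is keyed by type, the same machinery scales across every interface in the system without per-interface test authoring.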
21. ASTAA: Automated Stress Testing of Autonomy Systems
[Diagram: a module under test maps an input space (valid inputs → specified behavior, “should work”; invalid inputs → undefined, “should return error”) onto a response space of robust operation, reproducible failure, or unreproducible failure]
ASTAA = Ballista Stress-Testing Tool + NREC Safety Monitor:
• Ballista Stress-Testing Tool: robustness testing of defined interfaces
  – Most test cases are exceptional
  – Test cases based on best-practice software testing methodology
  – Detects software hanging or crashing
  – Earlier work looked at stress-testing COTS operating systems and uncovered system-killer crash vulnerabilities in top-of-the-line commercial operating systems
• NREC Safety Monitor: monitors safety invariants at run-time
  – Designed as a run-time safety shutdown box for UAS applications
  – Independently senses system state to determine whether invariants are violated
  – Firewalls safety-criticality into a small, manageable subset of a complex UAS; prototype deployed on the Autonomous Platform Demonstrator (APD), a 9-ton UGV capable of reaching 80 km/hr
Approved for Public Release – Distribution is Unlimited (NREC case #: STAA-2012-10-17)
22. DISTRIBUTION A – NREC case number STAA-2013-10-02
23. ASTAA Stress Testing Fault Injection. Example: Ground Vehicle CAN interception
• CAN Interceptor
  – Isolates actuators from the ECU by splitting the CAN bus
  – Modifies J1939 status messages from by-wire controllers before forwarding to the ECU
  – Reads messages for invariant evaluation
• ASTAA Test Runner
  – Instructs the CAN interceptor about how to modify incoming CAN messages
  – Monitors invariants
• Next generation: Edge Case Research SWITCHBOARD
DISTRIBUTION A – NREC case number STAA-2013-10-02
24. But, Does ASTAA Find Problems?
Mature (6 years old) “RECBot” vehicle tested with the initial tool set
• No access to source code or design details; just the interface specification
• ASTAA elicited a speed-limit violation
[Plot: vehicle speed (m/s) vs. time (sec), RECbot speed-limit tests]
  – cmd = 1 m/s → no speed limit violation
  – cmd = 3 m/s → speed limit enforced
  – cmd = Inf → speed limit enforced
  – cmd = NaN → speed limit violated
• The speed-limit violation occurred when an exceptional input was sent as the speed command
Distribution Statement A - Approved for public release; distribution is unlimited. NAVAIR Public Affairs Office tracking number 2013-74, NREC internal case number STAA-2012-10-23
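The NaN result above is a classic comparison pitfall: every comparison with NaN is false, so a limiter written as "clamp if over the limit" never fires and passes NaN downstream, while Inf happens to be caught. A sketch of the failing pattern versus an accept-only-provably-safe check (the limit value is illustrative):

```python
import math

SPEED_LIMIT = 2.0  # m/s, illustrative

def clamp_buggy(cmd):
    """Common pattern that fails the NaN case: 'nan > limit' is
    False, so the clamp never fires and NaN flows downstream.
    Inf > limit is True, so Inf happens to be clamped."""
    return SPEED_LIMIT if cmd > SPEED_LIMIT else cmd

def clamp_robust(cmd):
    """Accept only values provably inside the envelope; anything
    else (NaN, Inf, negative) becomes a safe-stop command."""
    if math.isfinite(cmd) and 0.0 <= cmd <= SPEED_LIMIT:
        return cmd
    return 0.0

print(clamp_buggy(float("nan")))   # nan -- slips through the clamp
print(clamp_robust(float("nan")))  # 0.0 -- forced to safe stop
```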
25. Example Autonomous Vehicle Defects Found via Robustness Testing
• Improper handling of floating-point numbers:
  – Inf, NaN, limited precision
• Array indexing and allocation:
  – Images, point clouds, etc.
  – Segmentation faults due to arrays that are too small
  – Many forms of buffer overflow, especially dealing with complex data types
  – Large arrays and memory exhaustion
• Time:
  – Time flowing backwards, jumps
  – Not rejecting stale data
• Problems handling dynamic state:
  – For example, lists of perceived objects or command trajectories
  – Race conditions permit improper insertion or removal of items
  – Vulnerabilities in garbage collection allow memory to be exhausted or execution to be slowed down
DISTRIBUTION A – NREC case number STAA-2013-10-02
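The time-related defect classes above (backwards jumps, stale data) suggest a simple monotonicity-and-freshness guard on every incoming measurement. A sketch with invented field names and an illustrative freshness threshold:

```python
def accept_measurement(ts, last_ts, now, max_age=0.1):
    """Reject measurements whose timestamp goes backwards (or
    repeats), and measurements older than max_age seconds.
    Timestamps are in seconds; max_age is illustrative."""
    if ts <= last_ts:
        return False   # time flowing backwards, or duplicate
    if now - ts > max_age:
        return False   # stale data
    return True
```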
26. Conclusions: Fully autonomous vehicles need ISO 26262 plus more
• Doing enough testing is challenging. Even worse…
  – Machine learning systems are inherently brittle and lack “legibility”
• Challenges mapping to the traditional safety V model:
  – Training data is the de facto requirement + design
  – Non-determinism makes it difficult to do testing
• Potential solution elements:
  – Run-time safety monitors worry about safety, not “correctness”
  – Accelerated stress testing via fault injection finds defects otherwise missed in vehicle-level testing
  – Testing philosophy should include black swan events
    • Off-nominal training & testing scenarios
    • Response to internal faults even if we’re not quite sure how they could happen
27. Edge Case Research
• Formed in 2013 to develop technologies that make complex software more robust
• Deep experience in dependability, robotics, and testing
• Methodology, tools, training
• Clients in many markets: consumer electronics, automotive, defense, robotics, industrial controls, finance
