
Highly Autonomous Vehicle Validation


It's more than just road testing!
- Why a billion miles of testing might not be enough to ensure self-driving car safety.
- Why it's important to distinguish testing for requirements validation vs. testing for implementation validation.
- Why machine learning is the hard part of mapping autonomy validation to ISO 26262.



1. Highly Autonomous Vehicle Validation: It’s more than just road testing!
   Prof. Philip Koopman, © 2017 Edge Case Research LLC
2. How Do You Validate Autonomous Vehicles?
   - Self-driving cars are so cool! But also kind of scary.
   - Is a billion miles of testing enough? Perhaps not, even if you could afford it.
   - Will simulation really solve this?
   - What exactly are you validating? Requirements vs. implementation validation.
   - Can you map autonomy onto ISO 26262? Why machine learning is the hard part.
3. NREC: 30+ Years Of Cool Robots
   [Timeline, 1985–2010: NASA Dante II, NASA Lunar Rover, ARPA Demo II, DARPA SC-ALV, DARPA Grand Challenge, DARPA LAGR, DARPA PerceptOR, DARPA UPI, Urban Challenge, Mars Rovers, Army FCS, automated excavation/harvesting/forklift/haulage/spraying, laser paint removal; robot & AV safety, AHS safety.]
4. A Billion Miles of On-Road Testing?
   - Best-case testing scenario: e.g., 134M miles/mishap (fatal and critical injury NYC taxi data, NYC Local Law 31 of 2014).
   - Assumptions:
     - Random independent mishap arrivals during testing.
     - 95% confidence of >= 134M miles/mishap.
   - 401.4M miles of testing if no mishaps occur; more likely 1B+ miles to show "just as good as" humans.
   - Significant practical issues:
     - It is unlikely that software fails randomly.
     - Reset the testing meter if the software changes.
     - Reset the testing meter if the environment changes; it is hard to know which changes are "small" vs. "big".

   | # Mishaps in Testing | Total Testing Miles for 95% Confidence |
   |----------------------|----------------------------------------|
   | 0                    | 401M                                   |
   | 1                    | 636M                                   |
   | 2                    | 844M                                   |
   | 3                    | 1.039B                                 |
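The slide's table follows from treating mishaps as random independent (Poisson) arrivals, as stated in its assumptions. A minimal sketch that reproduces the numbers using the standard one-sided chi-square bound on a Poisson rate (requires SciPy):

```python
# Reproduce the slide's testing-miles table: total test mileage needed
# for 95% confidence that the true rate is at least 134M miles/mishap,
# assuming random independent (Poisson) mishap arrivals.
from scipy.stats import chi2

MILES_PER_MISHAP = 134e6   # NYC taxi benchmark from the slide
CONFIDENCE = 0.95

def required_test_miles(mishaps_observed: int) -> float:
    """One-sided chi-square lower bound on a Poisson rate: mileage at
    which observing this many mishaps still demonstrates at least
    MILES_PER_MISHAP at the chosen confidence level."""
    k = chi2.ppf(CONFIDENCE, 2 * mishaps_observed + 2) / 2
    return k * MILES_PER_MISHAP

for n in range(4):
    print(f"{n} mishaps: {required_test_miles(n)/1e6:.0f}M miles")
# -> 401M, 636M, 844M, 1039M, matching the table above
```

The zero-mishap case reduces to the familiar "rule of three": roughly 3x the target miles-per-mishap must be driven mishap-free.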
5. How About Designing Tests Instead?
   - Validate implementation: does the vehicle behave correctly?
     - Traffic scenarios (e.g., aggressive human driver behavior)
     - Sensor limitations (e.g., contrast, glare, clutter)
     - Adverse weather (e.g., snow, fog, rain)
     - Anticipated road conditions (e.g., flooding, obstacles)
   - Scalability is an issue: the scenario cross-product is big.
   - How long must you drive around to collect scenario elements? That depends on the distribution of how often they appear.
   [Images: extreme contrast, poor visibility, road obstacles, construction, water (appears flat!)]
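The "big scenario cross-product" above can be made concrete with a toy enumeration. The dimension names and values here are invented examples, not from the slide; the point is only how quickly a handful of dimensions multiplies:

```python
# Hypothetical illustration of the scenario cross-product: just four
# scenario dimensions with a few values each already yield hundreds of
# distinct test scenarios. Dimension values are invented placeholders.
from itertools import product

dimensions = {
    "traffic":  ["none", "light", "heavy", "aggressive driver"],
    "weather":  ["clear", "rain", "snow", "fog"],
    "lighting": ["day", "dusk", "night", "glare"],
    "road":     ["dry", "wet", "obstacle", "construction", "flooding"],
}

scenarios = list(product(*dimensions.values()))
print(len(scenarios))  # 4 * 4 * 4 * 5 = 320 combinations
```

Real test designs add many more dimensions (speeds, geometries, actor counts), so the product grows combinatorially, which is exactly the scalability issue the slide raises.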
6. Distribution of Scenario Elements Matters
   - Vehicle safety is limited by the arrival of novel "black swan" hazards.
   - Assume one novel hazard is seen per 100K miles of road testing:
     - Case 1: 100 novel hazard types @ 10M miles/hazard for each type.
     - Case 2: 100,000 novel hazard types @ 10B miles/hazard for each type.
     - Note: the US fleet averages about 8B+ miles/day.
   - [Figure: random independent (exponential) arrival rate vs. power-law (80/20 rule) arrival rate; the total area under each curve is the same, but the curves cross over at about 4,000 hours, leaving many infrequent scenarios in the power-law tail.]
   - You might not see some everyday hazards even after a billion miles of on-road testing.
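The two cases above can be worked through numerically. Assuming each hazard type arrives as an independent Poisson process (the same idealization the slide's exponential curve uses), both cases have the same overall rate of one novel hazard per 100K miles, yet a billion test miles covers them very differently:

```python
# Sketch of the slide's two hazard-distribution cases. Each hazard type
# is modeled as an independent Poisson arrival process; the question is
# how many distinct types a billion miles of testing ever encounters.
import math

TEST_MILES = 1e9  # a billion miles of on-road testing

def expected_types_seen(n_types: int, miles_per_type: float) -> float:
    """Expected number of distinct hazard types observed at least once."""
    p_seen = 1 - math.exp(-TEST_MILES / miles_per_type)
    return n_types * p_seen

# Case 1: 100 types @ 10M miles each -> essentially every type is seen.
case1 = expected_types_seen(100, 10e6)
# Case 2: 100,000 types @ 10B miles each -> ~90% of types are never seen.
case2 = expected_types_seen(100_000, 10e9)
print(f"Case 1: {case1:.1f} of 100 types seen")
print(f"Case 2: {case2:.0f} of 100,000 types seen")
```

In Case 2 roughly 90,000 hazard types remain unobserved after a billion miles, which is the slide's point: with a heavy-tailed distribution, mileage alone cannot exhaust the novelty.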
7. How Do We Map To the V Model?
   - Machine learning learns by example; there is no design to trace to testing.
     - The training data are de facto requirements.
     - We don't know whether the design is correct.
   - ML behavior has an inscrutable "design".
     - This is one facet of the "legibility" problem.
     - How do you trace an unknown design to tests?
   - "Black swans" depend on what the system has learned (we'll come back to this shortly).
   - Possible approach: trace tests back to a safety argument.
     - The left side of the V just represents safety functions/requirements, e.g., "doesn't hit people" rather than analyzing ML weights.
     - Use non-ML software to enforce those safety requirements.
8. Safety Envelopes to Mitigate ML Risks
   - Strategy: a non-ML safety checker enforces safety, and can be developed to ISO 26262.
   - Safety envelope:
     - Specify unsafe regions for safety; specify safe regions for functionality.
     - Deal with a complex boundary by under-approximating the safe region and over-approximating the unsafe region.
     - A transition into the unsafe region triggers the system safety response.
   - Partition the requirements:
     - Operation: functional requirements.
     - Failsafe: safety requirements (safety functions).
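A minimal sketch of the envelope idea above: rather than checking the true (complex) safe/unsafe boundary, the checker tests a conservative over-approximation of the unsafe region, so any approximation error errs toward triggering the safety response. The braking model and thresholds here are invented placeholders, not from the slide:

```python
# Safety-envelope sketch: over-approximate the unsafe region with a
# conservative stopping-distance rule plus margin, and trigger the
# system safety response on any transition into it.

def in_unsafe_envelope(speed_mps: float, gap_m: float) -> bool:
    """Over-approximated unsafe region: unsafe whenever the gap to the
    obstacle is less than a pessimistic stopping distance plus margin."""
    DECEL_MPS2 = 3.0   # assume weak braking (conservative placeholder)
    MARGIN_M = 10.0    # extra buffer further over-approximates "unsafe"
    stopping_distance = speed_mps ** 2 / (2 * DECEL_MPS2)
    return gap_m < stopping_distance + MARGIN_M

def checker_step(speed_mps: float, gap_m: float) -> str:
    # Entering the over-approximated unsafe region triggers safing,
    # regardless of what the ML "doer" commands.
    return "SAFETY_RESPONSE" if in_unsafe_envelope(speed_mps, gap_m) else "NORMAL"

print(checker_step(20.0, 120.0))  # 66.7m stop + 10m margin < 120m gap -> NORMAL
print(checker_step(20.0, 50.0))   # inside the over-approximated unsafe region
```

Because the unsafe region is over-approximated, this checker can produce false alarms but, within the model's assumptions, not missed detections, which is what makes it suitable as the high-integrity failsafe partition.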
9. Practical Application: Runtime Monitoring
   - How did we make this safe? A fail-safe safety gate via a high-ASIL checker.
     - The untrusted machine learning is the "doer"; the checker is a run-time monitor.
     - The checker enforces the safety envelope via Metric Temporal Logic safety invariants.
     - This works best for control functions.
   - Doer/checker pair: low-SIL ML "doer", high-SIL simple safety-envelope checker.
   - Fail-operational doer/checker for HAV: 2CASA dual-channel architecture.
     - Primary pair for normal autonomy; secondary pair for the safing mission.
   [Figure: target GVW 8,500 kg; target speed 80 km/hr. Approved for Public Release, TACOM Case #20247, 07 OCT 2009.]
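The doer/checker pattern above can be sketched as a thin wrapper: the untrusted ML channel proposes a command, and a simple high-integrity gate either passes it through or substitutes the safing response. The command format and the speed-limit invariant are invented for illustration (the 80 km/hr figure is the slide's target speed):

```python
# Doer/checker sketch: an untrusted ML "doer" proposes commands; a
# simple high-SIL checker enforces a safety invariant and overrides
# unsafe commands with a safing response.

SPEED_LIMIT_MPS = 22.2  # ~80 km/hr, the slide's target speed

def ml_doer(observation: dict) -> dict:
    # Stand-in for the low-SIL ML channel; its output is untrusted.
    return {"throttle": 0.9, "commanded_speed_mps": 30.0}

def safety_checker(command: dict) -> dict:
    """High-SIL gate: enforce the invariant 'commanded speed never
    exceeds the limit'; on violation, command the safing response."""
    if command["commanded_speed_mps"] > SPEED_LIMIT_MPS:
        return {"throttle": 0.0, "commanded_speed_mps": 0.0, "safing": True}
    return {**command, "safing": False}

actuated = safety_checker(ml_doer({"lidar": "..."}))
print(actuated["safing"])  # True: the checker overrode the unsafe command
```

Only the checker needs to be developed to a high ASIL, since the ML's output never reaches the actuators without passing the gate; a real monitor would evaluate Metric Temporal Logic invariants over time rather than a single instantaneous predicate.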
10. What About ML Legibility?
    - The legibility problem:
      - Did the system do the right thing for the right reason?
      - How can you tell what the system will do next?
    - "Black swans" are defined in the context of the machine's learning: if you don't know what it learned, what counts as "novel"?
    - Machine learning is brittle and inscrutable.
      - Proving the results of inductive learning is tough.
      - Surprises in what was learned lead to brittleness.
    - Problem: will the system tolerate noise? "Noise" is likely to affect ML systems differently than humans.
    - Problem: did it work for the correct reason? If you don't know why the system acted, was it a "test pass", or did it get lucky?
    - Statistically valid testing sets will increase the number of tests, but still leave doubt.
    [Figures: "Bus" vs. "Not a Bus" with magnified difference (Szegedy et al., 2013); learned "dumbbell" images (Mordvintsev et al., 2015).]
11. Robustness Stress Testing ("Noise")
    - Control software robustness testing:
      - 1990s: operating systems (Ballista).
      - 2010: HAVs and robots (ASTAA); a switchboard for SIL & HIL tests.
    - Machine learning robustness testing:
      - Sensor robustness (RIOT project): synthetic environment robustness testing, synthetic equipment faults, Gaussian blur.
    [Distribution A, NREC case number STAA-2013-10-02; NREC RIOT.]
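The Ballista-style approach mentioned above can be sketched in miniature: bombard an interface with exceptional inputs and record which ones make it crash rather than fail cleanly. The function under test here is a toy stand-in; a real harness like Ballista targets OS and component APIs:

```python
# Ballista-style robustness sketch: feed exceptional inputs to a
# component and record uncaught crashes as robustness failures.

def component_under_test(values):
    # Toy target with robustness bugs: assumes a non-empty list of numbers.
    return sum(values) / len(values)

EXCEPTIONAL_INPUTS = [[], [0], [1, 2, 3], None, [float("nan")], [1e308, 1e308]]

failures = []
for case in EXCEPTIONAL_INPUTS:
    try:
        component_under_test(case)
    except Exception as exc:           # crash/abort = robustness failure
        failures.append((case, type(exc).__name__))

for case, exc_name in failures:
    print(f"robustness failure on {case!r}: {exc_name}")
# -> the empty list (ZeroDivisionError) and None (TypeError) cases fail
```

The same pattern generalizes to the ML case on the slide: instead of malformed API arguments, the injected "exceptional inputs" become perturbed sensor data (Gaussian blur, synthetic equipment faults), and the pass criterion becomes stable, safe perception output.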
12. Design Machine Learning for Validation
    - Passing a test once might be by luck; it is difficult to infer causation given only correlation.
    - Make the ML system explain why:
      - Pre-define scenario-element "bins" (OEDR/ODD related): bins are the scenario elements that are present, and bins trace to test scenario design.
      - The ML says which bin(s) it thinks are in play. Did it understand which scenario it was in, or did it get lucky?
    - The system must pass the test for the right reason:
      - The ML is forced to learn bins for scenario elements; a separate ML goes from scenario elements to behavior.
    - Residual risks:
      - The ML lies about what it sees (but we can catch it in the act).
      - The ML black-box pair learns a covert communication channel.
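A minimal sketch of the bin-tracing idea above: each test scenario is traced to pre-defined scenario-element bins, and the ML must declare which bins it believes are in play, so a correct behavior with a wrong explanation is flagged rather than counted as a pass. The bin names and verdict labels are invented for illustration:

```python
# "Explain why" test-harness sketch: a test passes only if the ML both
# behaves correctly AND correctly identifies the scenario-element bins,
# so lucky passes are surfaced instead of silently counted.

SCENARIO_BINS = {"pedestrian", "crosswalk", "night", "rain", "occlusion"}

def evaluate(test_case: dict, ml_declared_bins: set, behavior_ok: bool) -> str:
    # Every scenario element in the test must trace to a pre-defined bin.
    assert test_case["bins"] <= SCENARIO_BINS, "untraced scenario element"
    if not behavior_ok:
        return "FAIL"
    if ml_declared_bins != test_case["bins"]:
        # Right behavior, wrong explanation: it may have gotten lucky.
        return "SUSPECT_PASS"
    return "PASS"

case = {"bins": {"pedestrian", "night"}}
print(evaluate(case, {"pedestrian", "night"}, True))  # PASS
print(evaluate(case, {"crosswalk"}, True))            # SUSPECT_PASS
```

The residual risks on the slide map directly onto this sketch: a "lying" ML would declare the right bins without using them, which is why the slide pairs this with a separate ML from bins to behavior.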
13. Simulation As Risk Reduction
    - Initial simulation set-up: use higher realism levels to validate simulation accuracy.
    - Simulation used for AV validation:
      - Push down to low-fidelity simulations for brute-force coverage.
      - Identify residual risks at each level (relevant simplifications, simulation assumptions); higher fidelity tests those assumptions.
    - Why are you simulating/testing?
      - On-road: to discover requirement gaps.
      - Mid-level: to mitigate the risks of believing simulations.
      - Reduced-order models / low level: to get coverage.
14. Key Principles of Pragmatic HAV Validation [ECR]
    - Traditional safety has its place:
      - Traditional functionality should follow ISO 26262.
      - Safety envelopes for ML control functions; ML perception simulation and testing.
    - Use the simulation "hammer" effectively:
      - Simulation/test for SOTIF safety risk reduction.
      - Robustness testing helps in understanding maturity.
    - Strategies for key risk areas:
      - Machine learning: require the ML to explain its actions (e.g., OEDR bins, ODD scenarios).
      - Operational concepts: detect ODD violations during test & deployment.
      - Requirements: identify gaps in safety requirements; continuous scenario improvement.
      - Safety methodology: test the assumptions made in a safety argument.
      - Societal & technical collaboration: industry consensus on understanding and continuously reducing residual risks via monitoring; how safe is safe enough?
15. Conclusions [General Motors]
    - A multi-prong approach is required for validation:
      - Rigorous engineering is needed beyond vehicle testing and simulations.
      - Account for the heavy-tail distribution of scenario elements: how does the system recognize that it is in a novel situation, and will it have a safe enough response to novelty?
    - Unique machine learning validation challenges:
      - How do we create "requirements" for an ML system?
      - How do we ensure that testing traces to the ML training data?
      - How do we ensure adequate requirements and testing coverage for the real world?
    - Promising approaches:
      - Safety monitor: let the ML optimize behavior while guarding against the unexpected.
      - Robustness testing: inject faults into system building blocks to uncover defects.
      - In progress: a comprehensive safety validation approach.