Srivastava KDD 2010 Presentation


Published on

This is a presentation I gave at KDD 2010.

  • Be the first to comment

  • Be the first to like this

Srivastava KDD 2010 Presentation

  1. 1. Discovering Precursors to Aviation Safety Incidents:from Massive Data to Actionable Information Ashok N. Srivastava Principal Investigator, IVHM ProjectGroup Lead, Intelligent Data Understanding
  2. 2. A Day in the Life of the National Airspace
  3. 3. Recent Safety Advances
  4. 4. Proactively Managing Risk IDENTIFY A Strategy forIMPLEMENT Safety EVALUATE Improvement FORMULATEReduce the accident rate by identifying and responding to precursor events before accidents occur.
  5. 5. Data Mining in Support of Global Operations 105 Data sharing Just culture / safety culture10-6 10-5 10-4 10-3 10-2 10-1 100 101 102 103 104 105 106
  6. 6. Data Arises from Molecular to Global ScalesMolecules Materials Sensors Software Engines Aircraft Vehicle Technology Operations 10-6 10-5 10-4 10-3 10-2 10-1 100 101 102 103 104 105 106
  7. 7. The Anatomy of an Aviation Safety Incident Safe Compromised States Anomolous Safe States or Accidents States Incident Transition Transition Transition Transition Transition Transition k-1 k k+1 n-2 n-1 n ...... State k -1 State k State k+1 State n -1 State n State n+1 Time Scenario = ( Context Behavior Outcome) From Irving Statler, Aviation Safety Monitoring and Modeling Project
  8. 8. Where are Precursors Found?SUBJECTIVE -- data continuum ---OBJECTIVE Accident Data Forensics What Happened? Discrete and Continuous Data Operational Data Discovering Causal Factors Why did it Happen? Operational Surveys Text Data Anecdotal Reports Predictions What will Happen Next? Projections From Irving Statler, Aviation Safety Monitoring and Modeling Project
  9. 9. Aviation Safety Report Excerpt JUST PRIOR TO TOUCHDOWN, LAX TWR TOLD US TO GO AROUND BECAUSE OF THE ACFT IN FRONT OF US. BOTH THE COPLT AND I, HOWEVER, UNDERSTOOD TWR TO SAY, CLRED TO LAND, ACFT ON THE RWY. SINCE THE ACFT IN FRONT OF US WAS CLR OF THE RWY AND WE BOTH MISUNDERSTOOD TWRS RADIO CALL AND CONSIDERED IT AN ADVISORY, WE LANDED...• Can answers why an incident happened• Over 100K reports• Multiple authors (multi lingual writers)• Inconsistent use of abbreviations and punctuation
  10. 10. Mining Heterogeneous Data is the Key• Primary Source: aircraft • Primary Source: Humans• Can answer what happened • Can answer why anin an aircraft during an Discrete Aviation Safety IncidentAviation Safety Incident happened Sample Text Report JUST PRIOR TO TOUCHDOWN, Continuous K(xi,xj) Textual LAX TWR TOLD US TO GO AROUND BECAUSE OF THE ACFT IN FRONT OF US.• Primary Source: ...Radar data• Can answer whathappened in the National Networks Massive Data:Airspace during Aviation Archives growing at 100K flights per month.Safety Incident
  11. 11. Mercer Kernel Based Approaches for Discrete, Continuous, Textual, and other SourcesMultiple Kernel Anomaly Detection (KDD 2010) 5 4 Unique 1    3 Linear Q    i j     K i , j  3 Decision  2 min 2 i, j    1 Boundary 0 0 2 4 6  Subject to: i  1 1 i   0,1, 1 0  i  , i l Talk at 5 pm today, Independence A
  12. 12. Gaussian Process Regression for Discrete and Continuous Sources (typically for small problems, see below)• Training data • Model building • data matrix of observations – n x d • Train hyperparameters on a sample of X • y vector of target data – n x 1 • Compute covariance matrix K (n x n)• Test data • Prediction • X* matrix of new observations – n* x d • Compute cross covariance matrix K* (n* x n) • Compute mean prediction on y* using• Covariance function • • Compute variance of prediction using• Goal • Predict y* corresponding to X*Algorithm Analysis• Storage Complexity: Storing covariancematrix O(n2)• Time Complexity: Computing matrixinversion O(n3) V. Tresp. Mixtures of Gaussian processes. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 654–660. MIT Press, 2001.
  13. 13. GP–V: Scaling to 3 million points 250 GP GP-V 200 Run time (seconds) Analysis on simulated data 150 100 50 0 2 3 4 5 6 7 10 10 10 10 10 10 Data points (log scale)• Input data dimension =5• Number of sample points = 3 million. New method under review at ICDM 2010 fordistributed implementation for massive data sets (K. Das and A. N. Srivastava).• Run time = time to build the model + time to evaluate 500 test points• Hyper parameters trained on 100 sample points L. Foster, A. Waagen, N.A. N. Srivastava, “Stable andRinsky, C. Gaussian M. J. Way, P. Gazis, and Aijaz, M. Hurley, A. Luis, J. Efficient Satyavolu,• Accuracy does not degrade with approximation Process Calculations,” Journal of Machine Learning Research, 10(Apr):857- -882, 2009.
  14. 14. How do we get the Word Out? DASHlinkdisseminate. collaborate. innovate. is a collaborativewebsite designed to promote:• Sustainability• Reproducibility• Dissemination• Community buildingUsers can create profiles• Share papers, upload anddownload opensource algorithms• Find NASA data sets.Coming Soon… DASHlink 2.0.
  15. 15. Real World Impact: Flight Operations at Southwest Airlines• NASA has open-sourced many of its key data mining algorithms for analysis of data from flight data recorders through DASHlink, our Web 2.0 portal for the world.• Southwest Airlines obtained copies of sequenceMiner and Orca, two advanced anomaly detection techniques.• Early results indicate that operationally significant events have been discovered by these • Southwest Airlines algorithms that would not be plans to incorporate triggered by their existing algorithms into daily methods. operations.
  16. 16. Mining Human Performance and Flight DataPhysiological Measurements Sample Text Report Luton, UK JUST PRIOR TO To EasyJet To NASA Data Mining Team TOUCHDOWN, • Near real-time • Daily data LAX TWR decision support • 300 GB flights per month • Understanding TOLD US TO • Physiology, text, cockpit, GO AROUND fatigue with an engines and flight parameters, objective flight path, network nformation. BECAUSE OF approach THE ACFT IN FRONT OF US. ... NASA Data Mining Lab (Mountain View, CA) Key question: How does fatigue effect aviation safety?
  17. 17. Data Mining in Scientific Domains • Aeronautical Systems – 100K flights per month • Earth and Space Science – Earth Observing System generates ~21 TB of data per week. – Ames simulations generating 1-5 TB per day • Exploration Systems – Space Shuttle and International Space station downlinks about 1.5GB per day. • Kepler Planetary Discovery Data Systems – 100K solar systems every 30 minutesResearchers world-wide can work with us in numerous domains through our new collaborative portal
  18. 18. Introducing NASA Earth Exchange (NEX) Collaborative Earth ScienceNEX provides a platform for researchers to share information and data derived from NASA’s Earthobserving satellites. Its goal is to encourage exploration and collaboration to create new ways to understand and improve life here on Earth. Public Release: August 2010
  19. 19. NEX website and DASHlink 2.0 Capabilities• Collaboration with NASA on Projects (and initiating the process for further collaboration with NEXresearchers on NASA supercomputing facilities).• Share Resources: Papers, algorithms, data sets etc.• Find and contact other researchers in your field via member profiles.• Learn about other cutting edge work at NASA and beyond.• Cross reference work on DASHlink and NEX.
  20. 20. NASA Earth Exchange Components
  21. 21. The Data Mining Team Group Members Funding SourcesAshok N. Srivastava • NASA Aeronautics Research MissionKanishka BhaduriKamalika Das Directorate- IVHM ProjectSantanu Das • NASA Engineering and Safety CenterElizabeth Foughty • Exploration Systems Mission DirectorateDave IversonRodney Martin Exploration Technology DevelopmentBryan Matthews Program, ISHM ProjectDawn McIntosh • Science Mission DirectorateNikunj OzaMark SchwabacherJohn StutzDavid Wolpert+ 7 summer students Team Members are NASA Employees, Contractors, and Students. •21