
DutchMLSchool. ML: A Technical Perspective


DutchMLSchool. Machine Learning: A Technical Perspective
TITLE AS IN SCHEDULE - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.

Published in: Data & Analytics

  1. 1st edition | July 8-11, 2019
  2. BigML, Inc #DutchMLSchool
     Machine Learning at 39: Les Même Choses et Plusieurs Changes
     Thomas G. Dietterich, Chief Scientist, BigML
  3. 1980: First Machine Learning Workshop
     • Carnegie-Mellon University
     • Organizers: Jaime Carbonell, Tom Mitchell, Ryszard Michalski
     • Attendees: ~30
     • Topics:
       • Exact learning
       • Compression
       • Supervised learning with noise-free labels
  4. Outline
     • Generalization
     • Feature Engineering
     • Explanation and Interpretability
     • Uncertainty Quantification
     • Run-Time Monitoring
     • Application-Specific Metrics
  5. Challenge #1: Generalization
  6. Decision Tree Method: ID3
     • Ross Quinlan introduced ID3
       • Decision tree learning algorithm
     • Goal: Compress chess endgame tables into simple decision rules
       • Ken Thompson had reverse-enumerated the winning positions for certain chess endgames, producing a large table of (board position, outcome) pairs
       • ID3 was applied to compress these into a more understandable representation
     • Notes:
       • No generalization, noise-free
       • Interpretability was important
     [Figure: chess position, "Win in 10" (Breda, 2006)]
  7. Today: Generalization is Key
     • Generalization for iid data
       • Assume training and runtime data are drawn from the same distribution
       • Strong theoretical guarantees
     • Generalization across domains
       • Causal Transportability
       • Domain-Adversarial Training
  8. Causal Transportability (Pearl & Bareinboim, 2011) (Subbaswamy et al., 2018)
     • Predicting Lung Cancer
       • T: Lung Cancer
       • C: Chest Pain
       • A: Patient is taking aspirin
       • K: Patient is a smoker (not observed)
       • S: The distribution of A may change between training and deployment (change of hospital)
     • Goal: Create a predictive model that does not depend on S
       • Guaranteed to generalize to a new hospital (assuming this causal model is correct)
  9. Graph Surgery Technique
     • Generate all models that can make 𝑇 independent of 𝑆
     • Evaluate each model on validation data
     • Keep the best model
     • Guaranteed to transport across hospitals provided that the causal diagram is correct
     • Encourages thinking ahead about possible changes at deployment time
  10. Domain Adversarial Training (Ganin, et al., JMLR 2016)
      • Given:
        • Training data points from two or more domains: $D_1$, $D_2$
        • $D_1$ points are labeled pairs $(x_i, y_i)$
        • $D_2$ points are unlabeled $x_j$
      • Training:
        • For $D_1$ points: Predict the correct label
        • For all points: Predict the domain (1 vs. 2)
        • Find weights that give accurate label predictions for $D_1$ and chance predictions for the domain
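
The training goal on this slide can be written as a saddle-point problem. What follows is a sketch in the spirit of Ganin et al. (2016), not a transcription of the slide. Here $D_1$ is the labeled domain and $D_2$ the unlabeled one. The symbols $G_f$ (shared feature extractor with weights $\theta_f$), $G_y$ (label predictor, $\theta_y$), $G_d$ (domain classifier, $\theta_d$), the losses $L_y$ and $L_d$, the domain indicator $d(x) \in \{1, 2\}$, and the trade-off weight $\lambda$ are introduced here for illustration, and normalization constants are omitted.

```latex
E(\theta_f, \theta_y, \theta_d)
  = \sum_{(x, y) \in D_1} L_y\bigl(G_y(G_f(x; \theta_f); \theta_y),\, y\bigr)
  \;-\; \lambda \sum_{x \in D_1 \cup D_2} L_d\bigl(G_d(G_f(x; \theta_f); \theta_d),\, d(x)\bigr)

(\hat\theta_f, \hat\theta_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat\theta_d),
\qquad
\hat\theta_d = \arg\max_{\theta_d} E(\hat\theta_f, \hat\theta_y, \theta_d)
```

The domain classifier tries to tell the domains apart (it maximizes $E$, that is, minimizes $L_d$), while the shared features are trained to make that task as hard as possible. At the saddle point the domain prediction is near chance, which is the condition stated on the slide.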
  11. Experiments
  12. Challenge #2: Feature Engineering
  13. Feature Engineering
      • In 1980, Quinlan carefully designed interpretable features with predictive power. This is still important today in most applications
      • Claim: Features should include meta-data definitions
        • “Numbers should never travel alone across the internet” (Mark Fox)
        • BigML Flatline language
        • SQL statements/procedures
        • Trifacta rules
      • Example:
        $$\text{Student\_Teacher\_Ratio}(\text{school}, \text{time}) = \frac{\bigl|\{\, s \mid \text{registered}(s, \text{school}, \text{time}) \,\}\bigr|}{\sum_{t} \text{FTE}(t, \text{school}, \text{time})}$$
  14. Importance of Feature Meta-Data
      • Allows data consumers to detect when the meaning of the feature has changed even when the feature name has not changed
      • Important for detecting data errors and debugging classifier failures
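
One way to let a definition travel with the numbers is to ship each feature together with its definition and a fingerprint of that definition. The sketch below is an illustration only: the `FeatureDefinition` class, the `fingerprint` method, and the `teaches` predicate in the second definition are hypothetical names, not part of Flatline, SQL, or Trifacta.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class FeatureDefinition:
    """A feature bundled with its human-readable definition (hypothetical schema)."""
    name: str
    definition: str                 # the meta-data that travels with the numbers
    compute: Callable[..., float]

    def fingerprint(self) -> str:
        # Hash of name + definition: changes whenever the stated meaning changes.
        text = f"{self.name}\n{self.definition}"
        return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def student_teacher_ratio(n_registered: int, teacher_ftes: Sequence[float]) -> float:
    """Number of registered students divided by the total teacher FTE."""
    total_fte = sum(teacher_ftes)
    if total_fte <= 0:
        raise ValueError("total teacher FTE must be positive")
    return n_registered / total_fte


ratio_v1 = FeatureDefinition(
    name="Student_Teacher_Ratio",
    definition="|{s : registered(s, school, time)}| / sum_t FTE(t, school, time)",
    compute=student_teacher_ratio,
)

# Same feature name, but the denominator now counts teachers instead of FTE.
ratio_v2 = FeatureDefinition(
    name="Student_Teacher_Ratio",
    definition="|{s : registered(s, school, time)}| / |{t : teaches(t, school, time)}|",
    compute=lambda n_registered, teachers: n_registered / len(teachers),
)

value = ratio_v1.compute(300, [1.0, 1.0, 0.5, 0.5, 1.0, 1.0])  # 300 students, 5.0 FTE
meaning_changed = ratio_v1.fingerprint() != ratio_v2.fingerprint()
```

A data consumer that stores the fingerprint alongside the training data can compare it at prediction time and flag a change of meaning even though the column name is unchanged.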
  15. Does Deep Learning Automate Feature Engineering? Yes and No
      • No: Deep learning applications still require careful data preparation
        • Image normalization, contrast enhancement, etc.
      • Yes: Deep learning can learn powerful intermediate representations
        • Before 2012: Manually-designed SIFT and HoG features for images combined with support vector machines or random forests
        • After 2012: Deep learning produces much better results
      [Figure: ImageNet (1000 classes) top-5 classification error (%), 2010-2014, before and after deep learning]
  16. Challenge #3: Explanation and Interpretability
  17. Interpretability and Explanation
      • 1980: Quinlan wanted interpretability because he expected people to memorize the learned decision tree
        • In practice, we needed to check whether the learning algorithm got the right answer
      • Today: Our highest-performing models (random forests, boosted trees, deep neural networks) are not interpretable
        • Interpretability and explanation are “hot topics” in ML research
  18. Explanation and Interpretability
      • Claim: Explanations should help the user perform some task
      • BigML has worked hard on visualization tools to provide interpretability
      • At Oregon State, we are developing explanation tools for reinforcement learning

      ML System                   | User        | Task
      Predictive Model            | ML Engineer | Find errors and holes in data
      Recommendation System       | End User    | Decide whether to follow the recommendation
      Predictive Model, RL Model  | ML Engineer | Acceptance testing: decide whether the delivered system is sufficiently accurate
  19. Challenge #4: Uncertainty Quantification
  20. Uncertainty Quantification
      • 1980: This issue was totally ignored
      • Today: Giving calibrated uncertainty estimates is important
      • Calibrated Probabilities:
        • When the classifier says “X belongs to class C with probability 0.94”, then it is correct 94% of the time
        • This is measured using a separate labeled “calibration set”
        • Can use “out of bag” training data in random forests
  21. Calibration
      • Some classifiers are always well-calibrated
        • Decision Trees
        • Random Forests
      • Others must be post-processed to achieve good calibration
        • Boosted trees
        • Support Vector Machines
        • Deep Neural Networks
  22. Measuring Calibration via a Reliability Diagram
      • Sort the predicted probabilities into bins 0.0-0.1, 0.1-0.2, etc.
      • For each bin, measure the average accuracy on the calibration data
      • Plot the accuracy for each bin
        • Should lie on the diagonal if well-calibrated
      • Example shows that Naïve Bayes is generally very optimistic
      [Figure: Reliability Diagram (Naïve Bayes; ADULT). Zadrozny & Elkan, 2002]
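
The binning procedure for a reliability diagram can be sketched in a few lines of plain Python. This is a minimal illustration for a binary classifier, assuming `probs` holds the predicted probability of the positive class and `labels` holds 0/1 outcomes from a calibration set. The function name `reliability_bins` and the toy data are made up for this example.

```python
from typing import List, Optional, Sequence, Tuple


def reliability_bins(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> List[Tuple[float, float, Optional[float], int]]:
    """Return (bin_low, bin_high, observed_positive_rate, count) for each bin."""
    positives = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        # A probability of exactly 1.0 is placed in the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        positives[b] += y
        counts[b] += 1
    return [
        (
            b / n_bins,
            (b + 1) / n_bins,
            positives[b] / counts[b] if counts[b] else None,  # None marks an empty bin
            counts[b],
        )
        for b in range(n_bins)
    ]


# Toy calibration set: an over-confident classifier.
probs = [0.95] * 4 + [0.05] * 4
labels = [1, 1, 0, 0, 0, 0, 0, 1]
bins = reliability_bins(probs, labels)
```

In this toy data the 0.9-1.0 bin has an observed positive rate of 0.5, far below the diagonal, which is the kind of optimism the slide attributes to Naïve Bayes.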
  23. Fitting a recalibration function
      • Fit an invertible curve to the reliability diagram
        • Often a sigmoid curve works well
      • Use this to convert the predicted values (on X axis) to calibrated values (Y axis)
      • Similar techniques can calibrate Naïve Bayes, Deep Nets, Boosted Trees, etc.
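
A sigmoid recalibration curve can be fitted with a small logistic regression. The sketch below is one possible reading of the slide, not the speaker's implementation: it assumes the curve has the form sigmoid(a * logit(p) + b), fits `a` and `b` by plain gradient descent on log loss, and uses a made-up calibration set in which extreme predictions are over-confident. The function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def logit(p: float, eps: float = 1e-6) -> float:
    p = min(max(p, eps), 1.0 - eps)  # keep away from 0 and 1
    return math.log(p / (1.0 - p))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_sigmoid(
    probs: Sequence[float], labels: Sequence[int], lr: float = 0.5, n_iter: int = 2000
) -> Tuple[float, float]:
    """Fit calibrated = sigmoid(a * logit(p) + b) by gradient descent on log loss."""
    xs = [logit(p) for p in probs]
    a, b = 1.0, 0.0  # start from the identity map
    n = len(xs)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, labels):
            err = sigmoid(a * x + b) - y
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def recalibrate(p: float, a: float, b: float) -> float:
    return sigmoid(a * logit(p) + b)


# Made-up calibration set: 10 examples per predicted value, with the listed
# number of positives. A prediction of 0.9 is right only 70% of the time.
probs: List[float] = []
labels: List[int] = []
for score, n_pos in [(0.1, 3), (0.3, 4), (0.5, 5), (0.7, 6), (0.9, 7)]:
    probs += [score] * 10
    labels += [1] * n_pos + [0] * (10 - n_pos)

a, b = fit_sigmoid(probs, labels)
calibrated_09 = recalibrate(0.9, a, b)
```

A fitted slope `a` below 1 shrinks extreme predictions toward 0.5, so the over-confident 0.9 is mapped to a value near the observed 70% rate. Because the fitted curve is monotone, the ranking of examples is unchanged.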
  24. Local vs. Global Calibration
      • Calibration compares predicted probability and expected accuracy globally, across the entire calibration data set
      • This may be misleading
        • A classifier that always reports a confidence of 0.95 could achieve 95% accuracy and perfect calibration by classifying 95% of the data set perfectly and the remaining 5% completely incorrectly
        • This 5% could be a specific customer segment
        • Within that segment, the classifier is actually very poorly calibrated because it outputs a confidence of 0.95 but is correct 0% of the time
      • Lesson: Calibration should be done separately for each customer segment or local group
        • Decision trees calibrate separately for each leaf of the tree, so they usually don’t exhibit this problem
      • It is always important to look at model accuracy by customer segments and other customer features (gender, race, region, age, etc.)
        • Example: Face recognition is less accurate on dark skin and on women, etc.
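
The 95% / 5% scenario from this slide is easy to reproduce. The sketch below builds that toy data set and compares mean confidence with accuracy, first globally and then per segment. The segment names and the helper functions are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (segment, predicted confidence, whether the prediction was correct)
Record = Tuple[str, float, bool]


def summarize(records: Iterable[Record]) -> Tuple[float, float, int]:
    """Return (mean confidence, accuracy, count) for a group of records."""
    records = list(records)
    n = len(records)
    mean_conf = sum(conf for _, conf, _ in records) / n
    accuracy = sum(1 for _, _, ok in records if ok) / n
    return mean_conf, accuracy, n


def summarize_by_segment(records: Iterable[Record]) -> Dict[str, Tuple[float, float, int]]:
    groups: Dict[str, List[Record]] = defaultdict(list)
    for rec in records:
        groups[rec[0]].append(rec)
    return {segment: summarize(group) for segment, group in groups.items()}


# 95% of the data is classified perfectly, 5% completely incorrectly,
# and the classifier always reports a confidence of 0.95.
records = [("majority", 0.95, True)] * 95 + [("minority", 0.95, False)] * 5

global_conf, global_acc, _ = summarize(records)
per_segment = summarize_by_segment(records)
```

Globally, confidence and accuracy agree at 0.95, so the classifier looks perfectly calibrated. Per segment, the minority group has accuracy 0.0 at the same reported confidence, which only shows up when the data is broken down by segment.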
  25. Challenge #5: Run-Time Monitoring
  26. Why Monitor?
      • Predictive models are only guaranteed to be accurate if run-time queries are drawn from the same distribution as the training data
      • Open Category Problem: Run-time data may involve new classes
        • New types of objects in computer vision
        • New classes of items (books, restaurants) in recommender systems
        • New diseases in medical systems
        • New types of fraud in supervised fraud detection
  27. How to Monitor?
      • Outlier Detection
        • Detect whether a new query $x$ is an outlier compared to the training data $x_1, \ldots, x_N$
      • Change Detection
        • Detect whether the data distribution has changed
        • Compare the $L$ most recent points $x_{t-L+1}, \ldots, x_t$ to the $L$ points before them, $x_{t-2L+1}, \ldots, x_{t-L}$. Do they come from different distributions?
  28. Anomaly Detection Benchmarking Study
      • Most anomaly detection (AD) papers only evaluate on a few datasets
      • Often proprietary or very easy (e.g., KDD 1999)
      • ML community needs a large and growing collection of public anomaly benchmarks
      [Emmott, Das, Dietterich, Fern, Wong, 2013; KDD ODD-2013]
      [Emmott, Das, Dietterich, Fern, Wong, 2016; arXiv 1503.01158v2]
  29. Algorithms
      • Density-Based Approaches
        • RKDE: Robust Kernel Density Estimation (Kim & Scott, 2008)
        • EGMM: Ensemble Gaussian Mixture Model (our group)
      • Quantile-Based Methods
        • OCSVM: One-class SVM (Schoelkopf, et al., 1999)
        • SVDD: Support Vector Data Description (Tax & Duin, 2004)
      • Neighbor-Based Methods
        • LOF: Local Outlier Factor (Breunig, et al., 2000)
        • ABOD: kNN Angle-Based Outlier Detector (Kriegel, et al., 2008)
      • Projection-Based Methods
        • IFOR: Isolation Forest (Liu, et al., 2008)
        • LODA: Lightweight Online Detector of Anomalies (Pevny, 2016)
  30. Algorithm Comparison
      [Figure: change in metric with respect to the control data set, by algorithm (iforest, egmm, lof, rkde, abod, loda, ocsvm, svdd); metrics: logit(AUC) and log(LIFT)]
      • Based on this study, BigML implemented Isolation Forest
  31. Open Category Detection [Liu, Garrepalli, Fern, Dietterich, ICML 2018]
      • Only make a prediction if the query $x$ has a low anomaly score
      • Liu, et al. 2018 showed how to set $\tau$ to guarantee detecting new category queries with high probability
      [Diagram: the query $x$ goes to an anomaly detector, which tests $A(x) > \tau$. If no, the classifier $f$, trained on examples $(x_i, y_i)$, outputs $y = f(x)$. If yes, the query is rejected.]
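
The reject-or-predict pipeline in the diagram can be sketched as a thin wrapper around any anomaly detector and any classifier. The example below uses deliberately simple stand-ins, which are assumptions of this sketch: the anomaly score is the distance to the nearest training point (not Isolation Forest), the classifier is 1-nearest-neighbor, and the threshold `tau` is chosen by hand rather than by the method of Liu et al. (2018).

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
REJECT = "reject"


def nearest(query: Point, train: Sequence[Tuple[Point, str]]) -> Tuple[float, str]:
    """Return (distance, label) of the training example closest to the query."""
    return min((math.dist(query, x), y) for x, y in train)


def predict_or_reject(query: Point, train: Sequence[Tuple[Point, str]], tau: float) -> str:
    # Stand-in anomaly score A(x): distance to the nearest training point.
    anomaly_score, label = nearest(query, train)
    if anomaly_score > tau:
        return REJECT      # the query looks unlike anything seen in training
    return label           # stand-in classifier f(x): 1-nearest-neighbor label


train: List[Tuple[Point, str]] = [
    ((0.0, 0.0), "cat"),
    ((0.0, 1.0), "cat"),
    ((5.0, 5.0), "dog"),
    ((5.0, 6.0), "dog"),
]

known_query = predict_or_reject((0.2, 0.1), train, tau=1.0)
novel_query = predict_or_reject((20.0, -20.0), train, tau=1.0)
```

The point of the design is that the classifier is never asked about queries that fall outside the region covered by its training data. The choice of `tau` trades missed novel-category queries against unnecessary rejections.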
  32. Change Detection
      • “Two Sample” test: given $S_1 \sim P_1$ and $S_2 \sim P_2$, is $P_1 = P_2$?
        • Method 1: Kernel two-sample test
        • Method 2: Train a classifier to distinguish between $S_1$ and $S_2$. Can it do better than random guessing?
      • At each time $t$, slide $S_1$ and $S_2$ one step forward in time (requires online methods)
      • An area of active research
      [Diagram: a data stream split into two adjacent windows, $S_1$ and $S_2$]
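
As an illustration of Method 1, the sketch below computes a kernel two-sample statistic (a biased estimate of the squared maximum mean discrepancy with an RBF kernel) for two windows of one-dimensional data and obtains a p-value by permutation. The bandwidth, the number of permutations, the window contents, and the function names are assumptions made for this example. The statistic is recomputed from scratch for each call, so this is a batch sketch rather than the online method the slide says a sliding window needs.

```python
import math
import random
from typing import Sequence


def rbf(a: float, b: float, bandwidth: float) -> float:
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mean_kernel(xs: Sequence[float], ys: Sequence[float], bandwidth: float) -> float:
    return sum(rbf(x, y, bandwidth) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd2(xs: Sequence[float], ys: Sequence[float], bandwidth: float = 1.0) -> float:
    """Biased estimate of the squared maximum mean discrepancy between two samples."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


def permutation_p_value(
    xs: Sequence[float], ys: Sequence[float],
    n_perm: int = 200, seed: int = 0, bandwidth: float = 1.0,
) -> float:
    """Fraction of random re-splits whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = mmd2(xs, ys, bandwidth)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[: len(xs)], pooled[len(xs):], bandwidth) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


older_window = [i / 10 for i in range(20)]         # values from 0.0 to 1.9
recent_window = [3.0 + i / 10 for i in range(20)]  # same shape, shifted by 3.0

p_shifted = permutation_p_value(older_window, recent_window)
```

A small p-value suggests that the two windows were not drawn from the same distribution. In a monitoring setting the test would be repeated as the windows slide forward, which raises multiple-testing questions that this sketch does not address.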
  33. Challenge #6: Evaluation
  34. Application-Specific Metrics are Important
      • Standard metrics for evaluating classifiers, such as F1 and AUC, were developed for machine learning research
      • Most applications require separate metrics
      • Example:
        • Financial fraud
        • Suppose we have 5 analysts and each analyst can examine 10 cases per day
        • Metric: Expected value of the top 50 alarms (value@50)
        • Incorporates the estimated value of each candidate fraud alarm
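
One plausible reading of value@k, for evaluation on labeled historical data, is the total amount of true fraud among the k alarms the model ranks highest. The sketch below implements that reading. The `Alarm` record, its fields, and the toy amounts are assumptions of this example, and other definitions (for instance, weighting by estimated rather than realized value) are equally possible.

```python
from typing import NamedTuple, Sequence


class Alarm(NamedTuple):
    score: float      # model's fraud score, used only for ranking
    is_fraud: bool    # ground truth, known for historical evaluation data
    amount: float     # value at stake in this case


def value_at_k(alarms: Sequence[Alarm], k: int) -> float:
    """Total value of true fraud among the k highest-scored alarms."""
    ranked = sorted(alarms, key=lambda alarm: alarm.score, reverse=True)
    return sum(alarm.amount for alarm in ranked[:k] if alarm.is_fraud)


# Daily review capacity from the slide: 5 analysts x 10 cases each.
capacity = 5 * 10

alarms = [
    Alarm(score=0.9, is_fraud=True, amount=100.0),
    Alarm(score=0.8, is_fraud=False, amount=5000.0),
    Alarm(score=0.7, is_fraud=True, amount=2500.0),
    Alarm(score=0.2, is_fraud=True, amount=9000.0),
]

value_top2 = value_at_k(alarms, 2)
value_top3 = value_at_k(alarms, 3)
value_at_capacity = value_at_k(alarms, capacity)
```

Unlike AUC, this metric depends on how many cases the analysts can actually review and on how much money each case represents, so a model that ranks a few large frauds highly can beat one with a better overall ranking.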
  35. More Examples
      • Open Category Detection:
        • Detect 99% of all open category queries
        • Metric: Precision at 99% recall
      • Obstacle Detection for Self-Driving Cars:
        • Detect 99.999% of all dangerous obstacles
        • Metric: Precision at 99.999% recall
      • Cancer Screening:
        • Must trade off false alarms versus missed alarms
        • Metric: Cost to patient (may vary from one patient to another)
        • AUC is a fairly good metric for this case
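
Precision at a fixed recall can be computed by walking down the ranked list until the required share of positives has been found. The following sketch assumes binary 0/1 labels and ignores tied scores. The function name and the toy data are made up for this example.

```python
from typing import Sequence


def precision_at_recall(
    scores: Sequence[float], labels: Sequence[int], target_recall: float
) -> float:
    """Precision at the highest score threshold whose recall reaches target_recall."""
    if not 0.0 < target_recall <= 1.0:
        raise ValueError("target_recall must be in (0, 1]")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    for n_flagged, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        if true_positives / n_pos >= target_recall:
            return true_positives / n_flagged
    raise RuntimeError("unreachable: recall is 1.0 after the last example")


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 1, 0, 1]

p_at_75 = precision_at_recall(scores, labels, 0.75)
p_at_99 = precision_at_recall(scores, labels, 0.99)
```

With only four positives, reaching 99% recall requires finding all of them, so the threshold drops to the bottom of the list and precision falls to 4 out of 6. This is why very high recall targets, such as the 99.999% on the slide, make precision the quantity to watch.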
  36. Summary
  37. Frontiers of Machine Learning and Applications
      • Generalization
        • Beyond iid: Causal transportability; Domain adaptation
      • Feature Engineering
        • Very important; Deep learning can discover useful intermediate features in some cases
      • Uncertainty Quantification
        • Probability Calibration
      • Run-time Monitoring
        • Anomaly Detection; Change Point Detection
      • Application-Specific Metrics