1st edition | July 8-11, 2019
BigML, Inc #DutchMLSchool
Machine Learning at 39:
Les Mêmes Choses et Plusieurs Changements (The Same Things and Several Changes)
Thomas G. Dietterich
Chief Scientist, BigML
1980: First Machine Learning Workshop
• Carnegie-Mellon University
• Organizers: Jaime Carbonell, Tom Mitchell, Ryszard Michalski
• Attendees: ~30
• Topics:
• Exact learning
• Compression
• Supervised learning with noise-free labels
• Generalization
• Feature Engineering
• Explanation and Interpretability
• Uncertainty Quantification
• Run-Time Monitoring
• Application-Specific Metrics
Outline
Challenge #1: Generalization
• Ross Quinlan introduced ID3
• Decision tree learning algorithm
• Goal: Compress chess endgame tables into simple decision
rules
• Ken Thompson had reverse-enumerated the winning positions for certain chess endgames → a large table of (board position, outcome) pairs
• ID3 was applied to compress these into a more
understandable representation
• Notes:
• No generalization required; noise-free labels
• Interpretability was important
Decision Tree Method: ID3
[Figure: decision tree for a chess endgame, labeled “Win in 10”; photo: Breda, 2006]
• Generalization for iid data
• Assume training and runtime data are drawn from the same
distribution
• Strong theoretical guarantees
• Generalization across domains
• Causal Transportability
• Domain-Adversarial Training
Today: Generalization is Key
• Predicting Lung Cancer
• T: Lung Cancer
• C: Chest Pain
• A: Patient is taking aspirin
• K: Patient is a smoker (not observed)
• S: The distribution of A may change between training and
deployment (change of hospital)
• Goal: Create a predictive model that does not depend
on S
• Guaranteed to generalize to new hospital (assuming this
causal model is correct)
Causal Transportability
(Pearl & Bareinboim, 2011)
(Subbaswamy et al., 2018)
• Generate all models that can make 𝑇
independent of 𝑆
• Evaluate each model on validation data
• Keep the best model
• Guaranteed to transport across hospitals
provided that the causal diagram is
correct
Graph Surgery Technique
Encourages thinking ahead about possible changes at deployment time
• Given:
• Training data points from two or more domains: D_1, D_2
• D_1 points are labeled pairs (x_i, y_i)
• D_2 points are unlabeled x_j
• Training:
• For D_1 points: Predict the correct label
• For all points: Predict the domain (1 vs. 2)
• Find weights that give accurate label predictions for D_1 and chance predictions for the domain
Domain-Adversarial Training (Ganin, et al., JMLR 2016)
Experiments
Challenge #2: Feature Engineering
• In 1980, Quinlan carefully designed interpretable
features with predictive power. This is still
important today in most applications
• Claim: Features should include meta-data
definitions
• “Numbers should never travel alone across the
internet” –Mark Fox
• BigML Flatline language
• SQL statements/procedures
• Trifacta rules
Feature Engineering
Example:
Student_Teacher_Ratio(school, time) =
  |{s : registered(s, school, time)}| / Σ_t FTE(t, school, time)
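The claim that numbers should travel with their definitions can be sketched in code. The class and function names below are hypothetical illustrations, not BigML Flatline or any specific tool:

```python
from dataclasses import dataclass

@dataclass
class Feature:
    """A feature value bundled with the meta-data that defines it."""
    name: str
    value: float
    definition: str   # how the value was computed, in readable form
    version: str      # bump this whenever the definition changes

def student_teacher_ratio(registered: int, fte_teachers: float) -> Feature:
    # Hypothetical helper: the ratio travels with its definition, so a
    # downstream consumer can detect when the definition (not just the
    # name) has changed.
    return Feature(
        name="Student_Teacher_Ratio",
        value=registered / fte_teachers,
        definition="|{s : registered(s, school, t)}| / sum_t FTE(t, school, t)",
        version="1.0",
    )

f = student_teacher_ratio(registered=480, fte_teachers=24.0)
print(f.name, f.value, f.version)  # Student_Teacher_Ratio 20.0 1.0
```

A consumer that caches the `version` field can raise an alert when the definition changes even though the feature name stays the same.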
• Allows data consumers to detect when the
meaning of the feature has changed even when
the feature name has not changed
• important for detecting data errors and debugging
classifier failures
Importance of Feature Meta-Data
• No: Deep learning applications still
require careful data preparation
• image normalization, contrast
enhancement, etc.
• Yes: Deep learning can learn
powerful intermediate
representations
• <2012: Manually-designed SIFT and
HoG features for images combined
with support vector machines or
random forests
• >2012: Deep learning produces much
better results
Does Deep Learning Automate Feature
Engineering? Yes and No
[Chart: ImageNet 1000-class Top-5 classification error (%), 2010–2014, before vs. after the switch to deep learning]
Challenge #3:
Explanation and Interpretability
• 1980: Quinlan wanted interpretability because he expected
people to memorize the learned decision tree
• In practice, we needed to check whether the learning algorithm got
the right answer
• Today: Our highest-performing models (random forests, boosted
trees, deep neural networks) are not interpretable
• Interpretability and explanation are “hot topics” in ML research
Interpretability and Explanation
• Claim: Explanations should help the user perform some task
• BigML has worked hard on visualization tools to provide interpretability
• At Oregon State, we are developing explanation tools for reinforcement
learning
Explanation and Interpretability
ML System                    User         Task
Predictive Model             ML Engineer  Find errors and holes in data
Recommendation System        End User     Decide whether to follow the recommendation
Predictive Model / RL Model  ML Engineer  Acceptance testing: decide whether the delivered system is sufficiently accurate
Challenge #4:
Uncertainty Quantification
• 1980: This issue was totally ignored
• Today: Giving calibrated uncertainty estimates is important
• Calibrated Probabilities:
• When the classifier says “X belongs to class C with probability 0.94”,
then it is correct 94% of the time
• This is measured using a separate labeled “calibration set”
• Can use “out of bag” training data in random forests
Uncertainty Quantification
• Some classifiers tend to be well-calibrated out of the box
• Decision Trees
• Random Forests
• Others must be post-processed to achieve good calibration
• Boosted trees
• Support Vector Machines
• Deep Neural Networks
Calibration
• Sort the predicted probabilities into
bins 0.0-0.1, 0.1-0.2, etc.
• For each bin, measure the average
accuracy on the calibration data
• Plot the accuracy for each bin
• Points should lie on the diagonal if the classifier is well-calibrated
• Example shows that Naïve Bayes is
generally very optimistic
Measuring Calibration via a Reliability Diagram
[Figure: Reliability diagram (Naïve Bayes; ADULT); Zadrozny & Elkan, 2002]
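The binning procedure above can be sketched in a few lines (a minimal version, omitting the per-bin confidence intervals a careful analysis would add):

```python
import numpy as np

def reliability_curve(probs, labels, n_bins=10):
    """Bin predicted probabilities and measure empirical accuracy per bin.

    probs: predicted probability of the positive class for each example
    labels: true 0/1 labels
    Returns (bin_centers, empirical_accuracy) for non-empty bins.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    centers, accuracy = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # include the right edge in the last bin
        mask = (probs >= lo) & ((probs < hi) | (hi == 1.0))
        if mask.any():
            centers.append((lo + hi) / 2)
            accuracy.append(labels[mask].mean())
    return centers, accuracy

# Toy example: a well-calibrated predictor puts each point on the
# diagonal (bin center ≈ empirical accuracy); this one is optimistic
# in the top bin (predicts 0.95, correct 75% of the time).
probs  = [0.05, 0.05, 0.95, 0.95, 0.95, 0.95]
labels = [0,    0,    1,    1,    1,    0   ]
print(reliability_curve(probs, labels, n_bins=10))
```

Plotting `accuracy` against `centers` gives the reliability diagram; deviation below the diagonal indicates over-confidence.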
• Fit an invertible curve to the
reliability diagram
• Often a sigmoid curve works well
• Use this to convert the predicted
values (on X axis) to calibrated
values (Y axis)
• Similar techniques can calibrate
Naïve Bayes, Deep Nets, Boosted
Trees, etc.
Fitting a recalibration function
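A minimal sketch of fitting a sigmoid recalibration curve (essentially Platt scaling, fit here by plain gradient descent on the log loss rather than a library optimizer):

```python
import numpy as np

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit p = 1 / (1 + exp(-(a*s + b))) to (score, label) pairs by
    gradient descent on the log loss. A sketch, not production code."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))
        grad = p - y                     # d(log loss)/d(logit)
        a -= lr * np.mean(grad * s)
        b -= lr * np.mean(grad)
    return a, b

def calibrate(score, a, b):
    """Map a raw classifier score to a calibrated probability."""
    return 1.0 / (1.0 + np.exp(-(a * score + b)))

# Toy usage: fit on held-out calibration scores, then apply at run time.
scores = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
labels = np.array([0, 0, 1, 1, 1, 1])
a, b = fit_platt(scores, labels)
print(calibrate(0.0, a, b))
```

The fitted curve is monotone, so it reorders no predictions; it only stretches or compresses the probability scale to match observed frequencies.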
• Calibration compares predicted probability and expected accuracy globally – across
the entire calibration data set
• This may be misleading
• A classifier could achieve 95% accuracy and perfect calibration by classifying 95% of the
data set perfectly and the remaining 5% completely incorrectly
• This 5% could be a specific customer segment
• Within that segment, the classifier is actually very poorly calibrated because it outputs a
confidence of 0.95 but is correct 0% of the time
• Lesson: Calibration should be done separately for each customer segment or local group
• Decision trees calibrate separately for each leaf of the tree, so they usually don’t exhibit this
problem
• It is always important to look at model accuracy by customer segments and other
customer features (gender, race, region, age, etc.)
• Example: Face recognition is less accurate on dark skin and on women, etc.
Local vs. Global Calibration
Challenge #5:
Run-Time Monitoring
• Predictive models are only guaranteed to be accurate if run-time
queries are drawn from the same distribution as the training data
• Open Category Problem: Run-time data may involve new classes
• New types of objects in computer vision
• New classes of items (books, restaurants) in recommender systems
• New diseases in medical systems
• New types of fraud in supervised fraud detection
Why Monitor?
• Outlier Detection
• Detect whether a new query x is an outlier compared to the training data x_1, …, x_N
• Change Detection
• Detect whether the data distribution has changed
• Compare the L most recent points x_{t-L+1}, …, x_t to the L points before them, x_{t-2L+1}, …, x_{t-L}. Do they come from different distributions?
How to Monitor?
• Most anomaly detection (AD) papers evaluate on only a few datasets
• Often proprietary or very easy (e.g., KDD 1999)
• ML community needs a large and growing collection of public
anomaly benchmarks
Anomaly Detection Benchmarking Study
[Emmott, Das, Dietterich, Fern, Wong, 2013; KDD ODD-2013]
[Emmott, Das, Dietterich, Fern, Wong. 2016; arXiv 1503.01158v2]
• Density-Based Approaches
• RKDE: Robust Kernel Density
Estimation (Kim & Scott, 2008)
• EGMM: Ensemble Gaussian Mixture
Model (our group)
• Quantile-Based Methods
• OCSVM: One-class SVM (Schoelkopf,
et al., 1999)
• SVDD: Support Vector Data
Description (Tax & Duin, 2004)
Algorithms
• Neighbor-Based Methods
• LOF: Local Outlier Factor (Breunig, et
al., 2000)
• ABOD: kNN Angle-Based Outlier
Detector (Kriegel, et al., 2008)
• Projection-Based Methods
• IFOR: Isolation Forest (Liu, et al., 2008)
• LODA: Lightweight Online Detector of
Anomalies (Pevny, 2016)
Algorithm Comparison
[Chart: change in logit(AUC) and log(LIFT) relative to control, by algorithm (iforest, egmm, lof, rkde, abod, loda, ocsvm, svdd) and dataset]
Based on this study,
BigML implemented
Isolation Forest
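As a usage sketch, scikit-learn's `IsolationForest` (assumed available; BigML's own implementation differs) can score queries against the training distribution:

```python
import numpy as np
from sklearn.ensemble import IsolationForest  # assumes scikit-learn is installed

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 2))   # nominal training data

forest = IsolationForest(n_estimators=100, random_state=0).fit(train)

# score_samples: higher = more normal, lower = more anomalous
inlier  = forest.score_samples([[0.0, 0.0]])[0]
outlier = forest.score_samples([[8.0, 8.0]])[0]
print(inlier > outlier)  # the distant point looks more anomalous
```

Isolation forests need no density estimate: points that are isolated by few random axis-aligned splits get high anomaly scores, which keeps training and scoring cheap.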
• Only make a prediction
if the query 𝑥 has a
low anomaly score
• Liu, et al. 2018
showed how to set 𝜏 to
guarantee detecting
new category queries
with high probability
Open Category Detection
[Diagram: a query x is first scored by an anomaly detector trained on the examples (x_i, y_i); if A(x) > 𝜏 the query is rejected, otherwise the classifier 𝑓 outputs ŷ = 𝑓(x)]
[Liu, Garrepalli, Fern, Dietterich, ICML 2018]
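The reject-option pipeline can be sketched as follows. The anomaly score and classifier here are toy stand-ins, and setting 𝜏 with the detection guarantee of Liu et al. requires their procedure, not the fixed value used below:

```python
import numpy as np

def predict_with_rejection(x, anomaly_score, classifier, tau):
    """Classify only queries whose anomaly score is below tau; reject the
    rest as possible members of an unseen (open) category."""
    if anomaly_score(x) > tau:
        return "reject"
    return classifier(x)

# Hypothetical components: distance from the training mean as an anomaly
# score, and a trivial sign-based classifier.
train = np.array([[-1.0], [-0.5], [0.5], [1.0]])
mean = train.mean(axis=0)
score = lambda x: float(np.linalg.norm(np.asarray(x) - mean))
clf = lambda x: "positive" if x[0] > 0 else "negative"

print(predict_with_rejection([0.8], score, clf, tau=2.0))   # positive
print(predict_with_rejection([9.0], score, clf, tau=2.0))   # reject
```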
• “Two-sample” test: S_1 ∼ P_1, S_2 ∼ P_2; is P_1 = P_2?
• Method 1: Kernel two-sample test
• Method 2: Train a classifier to distinguish between S_1 and S_2. Can it do better than random guessing?
• At each time 𝑡, slide S_1 and S_2 one step forward in time (requires online methods)
• An area of active research
Change Detection
[Diagram: a data stream with two adjacent sliding windows, S_1 (the earlier L points) and S_2 (the L most recent points)]
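Method 2, the classifier two-sample test, can be sketched with scikit-learn (assumed available). If a classifier separates the two windows much better than chance, the distributions differ; the 0.65 threshold below is an illustrative choice, not a calibrated significance level:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def classifier_two_sample_test(S1, S2, threshold=0.65):
    """Train a classifier to tell S1 from S2; flag a change if its
    cross-validated accuracy clearly beats chance (0.5)."""
    X = np.vstack([S1, S2])
    y = np.concatenate([np.zeros(len(S1)), np.ones(len(S2))])
    acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    return acc, acc > threshold

rng = np.random.default_rng(0)
same    = classifier_two_sample_test(rng.normal(0, 1, (200, 2)),
                                     rng.normal(0, 1, (200, 2)))
shifted = classifier_two_sample_test(rng.normal(0, 1, (200, 2)),
                                     rng.normal(3, 1, (200, 2)))
print(same[1], shifted[1])  # change detected only after the shift
```

In a streaming deployment the two windows slide forward each step, so the classifier must be retrained or updated online.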
Challenge #6:
Evaluation
• Standard metrics for evaluating classifiers, such as F1
and AUC, were developed for machine learning research
• Most applications require separate metrics
• Example:
• Financial fraud
• Suppose we have 5 analysts and each analyst can examine 10
cases per day
• Metric: Expected value of the top 50 alarms (value@50).
• Incorporates the estimated value of each candidate fraud alarm
Application-Specific Metrics are Important
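The value@50 metric described above can be sketched as follows. The alarm representation (probability, estimated loss) is a hypothetical simplification:

```python
def value_at_k(alarms, k=50):
    """Expected value of the top-k alarms (value@k).

    alarms: iterable of (probability_of_fraud, estimated_loss_if_fraud)
    Ranks alarms by expected value and sums the top k, i.e. the cases
    the analysts would actually examine.
    """
    expected = sorted((p * loss for p, loss in alarms), reverse=True)
    return sum(expected[:k])

# Toy example with 3 candidate alarms and capacity k=2:
alarms = [(0.9, 1000.0), (0.2, 10000.0), (0.8, 100.0)]
print(value_at_k(alarms, k=2))  # 2900.0  (2000 + 900)
```

Note how the metric encodes the operational constraint directly: with 5 analysts handling 10 cases per day each, only the top 50 alarms matter, so accuracy below rank 50 is irrelevant.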
• Open Category Detection:
• Detect 99% of all open category queries
• Metric: Precision at 99% recall
• Obstacle Detection for Self-Driving cars
• Detect 99.999% of all dangerous obstacles
• Metric: Precision at 99.999% recall
• Cancer Screening:
• Must trade off false alarms versus missed alarms
• Metric: Cost to patient (may vary from one patient to another)
• AUC is a fairly good metric for this case
More Examples
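Precision at a target recall, used in the first two examples above, can be computed directly from ranked scores; a minimal sketch:

```python
import numpy as np

def precision_at_recall(scores, labels, target_recall=0.99):
    """Precision at the deepest cutoff that still achieves target recall.

    scores: higher = more likely positive; labels: true 0/1.
    """
    order = np.argsort(-np.asarray(scores))          # rank by score, descending
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                           # true positives at each cutoff
    recall = tp / labels.sum()
    precision = tp / (np.arange(len(labels)) + 1)
    ok = recall >= target_recall
    return precision[ok][0] if ok.any() else 0.0

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1,   1,   0,   1,   0,   0]
print(precision_at_recall(scores, labels, target_recall=0.99))  # 0.75
```

The metric answers the operational question directly: if we must catch 99% (or 99.999%) of the positives, how many of the resulting alarms are real?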
Summary
• Generalization
• Beyond iid: Causal transportability; Domain adaptation
• Feature Engineering
• Very important; Deep learning can discover useful intermediate features in some cases
• Explanation and Interpretability
• Explanations should help the user perform some task
• Uncertainty Quantification
• Probability Calibration
• Run-time Monitoring
• Anomaly Detection; Change Point Detection
• Application-Specific Metrics
Frontiers of Machine Learning and Applications