1st edition | July 8-11, 2019
BigML, Inc #DutchMLSchool
Machine Learning at 39:
Les Mêmes Choses et Plusieurs Changements (The Same Things and Several Changes)
Thomas G. Dietterich
Chief Scientist, BigML
1980: First Machine Learning Workshop
• Carnegie-Mellon University
• Organizers: Jaime Carbonell, Tom Mitchell, Ryszard Michalski
• Attendees: ~30
• Topics:
• Exact learning
• Compression
• Supervised learning with noise-free labels
• Generalization
• Feature Engineering
• Explanation and Interpretability
• Uncertainty Quantification
• Run-Time Monitoring
• Application-Specific Metrics
Outline
Challenge #1: Generalization
• Ross Quinlan introduced ID3
• Decision tree learning algorithm
• Goal: Compress chess endgame tables into simple decision
rules
• Ken Thompson had reverse-enumerated the winning positions for certain chess endgames → a large table of (board position, outcome) pairs
• ID3 was applied to compress these into a more
understandable representation
• Notes:
• No generalization required; noise-free labels
• Interpretability was important
Decision Tree Method: ID3
[Figure: decision tree for a chess endgame, labeled “Win in 10”; photo: Breda, 2006]
• Generalization for iid data
• Assume training and runtime data are drawn from the same
distribution
• Strong theoretical guarantees
• Generalization across domains
• Causal Transportability
• Domain-Adversarial Training
Today: Generalization is Key
• Predicting Lung Cancer
• T: Lung Cancer
• C: Chest Pain
• A: Patient is taking aspirin
• K: Patient is a smoker (not observed)
• S: The distribution of A may change between training and
deployment (change of hospital)
• Goal: Create a predictive model that does not depend
on S
• Guaranteed to generalize to new hospital (assuming this
causal model is correct)
Causal Transportability
(Pearl & Bareinboim, 2011)
(Subbaswamy et al., 2018)
• Generate all models that can make 𝑇
independent of 𝑆
• Evaluate each model on validation data
• Keep the best model
• Guaranteed to transport across hospitals
provided that the causal diagram is
correct
Graph Surgery Technique
Encourages thinking ahead about possible changes at deployment time
• Given:
• Training data points from two or more domains: D_1, D_2
• D_1 points are labeled pairs (x_i, y_i)
• D_2 points are unlabeled x_j
• Training:
• For D_1 points: Predict the correct label
• For all points: Predict the domain (1 vs. 2)
• Find weights that give accurate label predictions for D_1 and chance predictions for the domain
Domain-Adversarial Training (Ganin, et al., JMLR 2016)
Experiments
Challenge #2: Feature Engineering
• In 1980, Quinlan carefully designed interpretable
features with predictive power. This is still
important today in most applications
• Claim: Features should include meta-data
definitions
• “Numbers should never travel alone across the
internet” –Mark Fox
• BigML Flatline language
• SQL statements/procedures
• Trifacta rules
Feature Engineering
Example:
Student_Teacher_Ratio(school, time) =
  |{s : registered(s, school, time)}| / Σ_t FTE(t, school, time)
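The claim that numbers should travel with their definitions can be sketched in code. The class and function names below are hypothetical illustrations, not BigML Flatline or any specific tool:

```python
from dataclasses import dataclass

@dataclass
class Feature:
    """A feature value bundled with the meta-data that defines it."""
    name: str
    value: float
    definition: str   # how the value was computed, in readable form
    version: str      # bump this whenever the definition changes

def student_teacher_ratio(registered: int, fte_teachers: float) -> Feature:
    # Hypothetical helper: the ratio travels with its definition, so a
    # downstream consumer can detect when the definition (not just the
    # name) has changed.
    return Feature(
        name="Student_Teacher_Ratio",
        value=registered / fte_teachers,
        definition="|{s : registered(s, school, t)}| / sum_t FTE(t, school, t)",
        version="1.0",
    )

f = student_teacher_ratio(registered=480, fte_teachers=24.0)
print(f.name, f.value, f.version)  # Student_Teacher_Ratio 20.0 1.0
```

A consumer that caches the `version` field can raise an alert when the definition changes even though the feature name stays the same.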
• Allows data consumers to detect when the
meaning of the feature has changed even when
the feature name has not changed
• important for detecting data errors and debugging
classifier failures
Importance of Feature Meta-Data
• No: Deep learning applications still
require careful data preparation
• image normalization, contrast
enhancement, etc.
• Yes: Deep learning can learn
powerful intermediate
representations
• <2012: Manually-designed SIFT and
HoG features for images combined
with support vector machines or
random forests
• >2012: Deep learning produces much
better results
Does Deep Learning Automate Feature
Engineering? Yes and No
[Chart: ImageNet 1000-class Top-5 classification error (%), 2010–2014, before vs. after the switch to deep learning]
Challenge #3:
Explanation and Interpretability
• 1980: Quinlan wanted interpretability because he expected
people to memorize the learned decision tree
• In practice, we needed to check whether the learning algorithm got
the right answer
• Today: Our highest-performing models (random forests, boosted
trees, deep neural networks) are not interpretable
• Interpretability and explanation are “hot topics” in ML research
Interpretability and Explanation
• Claim: Explanations should help the user perform some task
• BigML has worked hard on visualization tools to provide interpretability
• At Oregon State, we are developing explanation tools for reinforcement
learning
Explanation and Interpretability
ML System                    User         Task
Predictive Model             ML Engineer  Find errors and holes in data
Recommendation System        End User     Decide whether to follow the recommendation
Predictive Model / RL Model  ML Engineer  Acceptance testing: decide whether the delivered system is sufficiently accurate
Challenge #4:
Uncertainty Quantification
• 1980: This issue was totally ignored
• Today: Giving calibrated uncertainty estimates is important
• Calibrated Probabilities:
• When the classifier says “X belongs to class C with probability 0.94”,
then it is correct 94% of the time
• This is measured using a separate labeled “calibration set”
• Can use “out of bag” training data in random forests
Uncertainty Quantification
• Some classifiers tend to be well-calibrated out of the box
• Decision Trees
• Random Forests
• Others must be post-processed to achieve good calibration
• Boosted trees
• Support Vector Machines
• Deep Neural Networks
Calibration
• Sort the predicted probabilities into
bins 0.0-0.1, 0.1-0.2, etc.
• For each bin, measure the average
accuracy on the calibration data
• Plot the accuracy for each bin
• Points should lie on the diagonal if the classifier is well-calibrated
• Example shows that Naïve Bayes is
generally very optimistic
Measuring Calibration via a Reliability Diagram
[Figure: Reliability diagram (Naïve Bayes; ADULT); Zadrozny & Elkan, 2002]
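The binning procedure above can be sketched in a few lines (a minimal version, omitting the per-bin confidence intervals a careful analysis would add):

```python
import numpy as np

def reliability_curve(probs, labels, n_bins=10):
    """Bin predicted probabilities and measure empirical accuracy per bin.

    probs: predicted probability of the positive class for each example
    labels: true 0/1 labels
    Returns (bin_centers, empirical_accuracy) for non-empty bins.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    centers, accuracy = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # include the right edge in the last bin
        mask = (probs >= lo) & ((probs < hi) | (hi == 1.0))
        if mask.any():
            centers.append((lo + hi) / 2)
            accuracy.append(labels[mask].mean())
    return centers, accuracy

# Toy example: a well-calibrated predictor puts each point on the
# diagonal (bin center ≈ empirical accuracy); this one is optimistic
# in the top bin (predicts 0.95, correct 75% of the time).
probs  = [0.05, 0.05, 0.95, 0.95, 0.95, 0.95]
labels = [0,    0,    1,    1,    1,    0   ]
print(reliability_curve(probs, labels, n_bins=10))
```

Plotting `accuracy` against `centers` gives the reliability diagram; deviation below the diagonal indicates over-confidence.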
• Fit an invertible curve to the
reliability diagram
• Often a sigmoid curve works well
• Use this to convert the predicted
values (on X axis) to calibrated
values (Y axis)
• Similar techniques can calibrate
Naïve Bayes, Deep Nets, Boosted
Trees, etc.
Fitting a recalibration function
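A minimal sketch of fitting a sigmoid recalibration curve (essentially Platt scaling, fit here by plain gradient descent on the log loss rather than a library optimizer):

```python
import numpy as np

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit p = 1 / (1 + exp(-(a*s + b))) to (score, label) pairs by
    gradient descent on the log loss. A sketch, not production code."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))
        grad = p - y                     # d(log loss)/d(logit)
        a -= lr * np.mean(grad * s)
        b -= lr * np.mean(grad)
    return a, b

def calibrate(score, a, b):
    """Map a raw classifier score to a calibrated probability."""
    return 1.0 / (1.0 + np.exp(-(a * score + b)))

# Toy usage: fit on held-out calibration scores, then apply at run time.
scores = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
labels = np.array([0, 0, 1, 1, 1, 1])
a, b = fit_platt(scores, labels)
print(calibrate(0.0, a, b))
```

The fitted curve is monotone, so it reorders no predictions; it only stretches or compresses the probability scale to match observed frequencies.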
• Calibration compares predicted probability and expected accuracy globally – across
the entire calibration data set
• This may be misleading
• A classifier could achieve 95% accuracy and perfect calibration by classifying 95% of the
data set perfectly and the remaining 5% completely incorrectly
• This 5% could be a specific customer segment
• Within that segment, the classifier is actually very poorly calibrated because it outputs a
confidence of 0.95 but is correct 0% of the time
• Lesson: Calibration should be done separately for each customer segment or local group
• Decision trees calibrate separately for each leaf of the tree, so they usually don’t exhibit this
problem
• It is always important to look at model accuracy by customer segments and other
customer features (gender, race, region, age, etc.)
• Example: Face recognition is less accurate on dark skin and on women, etc.
Local vs. Global Calibration
Challenge #5:
Run-Time Monitoring
• Predictive models are only guaranteed to be accurate if run-time
queries are drawn from the same distribution as the training data
• Open Category Problem: Run-time data may involve new classes
• New types of objects in computer vision
• New classes of items (books, restaurants) in recommender systems
• New diseases in medical systems
• New types of fraud in supervised fraud detection
Why Monitor?
• Outlier Detection
• Detect whether a new query x is an outlier compared to the training data x_1, …, x_N
• Change Detection
• Detect whether the data distribution has changed
• Compare the L most recent points x_{t-L+1}, …, x_t to the L points before them, x_{t-2L+1}, …, x_{t-L}. Do they come from different distributions?
How to Monitor?
• Most anomaly detection (AD) papers evaluate on only a few datasets
• Often proprietary or very easy (e.g., KDD 1999)
• ML community needs a large and growing collection of public
anomaly benchmarks
Anomaly Detection Benchmarking Study
[Emmott, Das, Dietterich, Fern, Wong, 2013; KDD ODD-2013]
[Emmott, Das, Dietterich, Fern, Wong. 2016; arXiv 1503.01158v2]
• Density-Based Approaches
• RKDE: Robust Kernel Density
Estimation (Kim & Scott, 2008)
• EGMM: Ensemble Gaussian Mixture
Model (our group)
• Quantile-Based Methods
• OCSVM: One-class SVM (Schoelkopf,
et al., 1999)
• SVDD: Support Vector Data
Description (Tax & Duin, 2004)
Algorithms
• Neighbor-Based Methods
• LOF: Local Outlier Factor (Breunig, et
al., 2000)
• ABOD: kNN Angle-Based Outlier
Detector (Kriegel, et al., 2008)
• Projection-Based Methods
• IFOR: Isolation Forest (Liu, et al., 2008)
• LODA: Lightweight Online Detector of
Anomalies (Pevny, 2016)
Algorithm Comparison
[Chart: change in logit(AUC) and log(LIFT) relative to control, by algorithm (iforest, egmm, lof, rkde, abod, loda, ocsvm, svdd) and dataset]
Based on this study,
BigML implemented
Isolation Forest
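As a usage sketch, scikit-learn's `IsolationForest` (assumed available; BigML's own implementation differs) can score queries against the training distribution:

```python
import numpy as np
from sklearn.ensemble import IsolationForest  # assumes scikit-learn is installed

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 2))   # nominal training data

forest = IsolationForest(n_estimators=100, random_state=0).fit(train)

# score_samples: higher = more normal, lower = more anomalous
inlier  = forest.score_samples([[0.0, 0.0]])[0]
outlier = forest.score_samples([[8.0, 8.0]])[0]
print(inlier > outlier)  # the distant point looks more anomalous
```

Isolation forests need no density estimate: points that are isolated by few random axis-aligned splits get high anomaly scores, which keeps training and scoring cheap.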
• Only make a prediction
if the query 𝑥 has a
low anomaly score
• Liu, et al. 2018
showed how to set 𝜏 to
guarantee detecting
new category queries
with high probability
Open Category Detection
[Diagram: a query x is first scored by an anomaly detector trained on the examples (x_i, y_i); if A(x) > 𝜏 the query is rejected, otherwise the classifier 𝑓 outputs ŷ = 𝑓(x)]
[Liu, Garrepalli, Fern, Dietterich, ICML 2018]
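The reject-option pipeline can be sketched as follows. The anomaly score and classifier here are toy stand-ins, and setting 𝜏 with the detection guarantee of Liu et al. requires their procedure, not the fixed value used below:

```python
import numpy as np

def predict_with_rejection(x, anomaly_score, classifier, tau):
    """Classify only queries whose anomaly score is below tau; reject the
    rest as possible members of an unseen (open) category."""
    if anomaly_score(x) > tau:
        return "reject"
    return classifier(x)

# Hypothetical components: distance from the training mean as an anomaly
# score, and a trivial sign-based classifier.
train = np.array([[-1.0], [-0.5], [0.5], [1.0]])
mean = train.mean(axis=0)
score = lambda x: float(np.linalg.norm(np.asarray(x) - mean))
clf = lambda x: "positive" if x[0] > 0 else "negative"

print(predict_with_rejection([0.8], score, clf, tau=2.0))   # positive
print(predict_with_rejection([9.0], score, clf, tau=2.0))   # reject
```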
• “Two-sample” test: S_1 ∼ P_1, S_2 ∼ P_2; is P_1 = P_2?
• Method 1: Kernel two-sample test
• Method 2: Train a classifier to distinguish between S_1 and S_2. Can it do better than random guessing?
• At each time 𝑡, slide S_1 and S_2 one step forward in time (requires online methods)
• An area of active research
Change Detection
[Diagram: a data stream with two adjacent sliding windows, S_1 (the earlier L points) and S_2 (the L most recent points)]
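Method 2, the classifier two-sample test, can be sketched with scikit-learn (assumed available). If a classifier separates the two windows much better than chance, the distributions differ; the 0.65 threshold below is an illustrative choice, not a calibrated significance level:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def classifier_two_sample_test(S1, S2, threshold=0.65):
    """Train a classifier to tell S1 from S2; flag a change if its
    cross-validated accuracy clearly beats chance (0.5)."""
    X = np.vstack([S1, S2])
    y = np.concatenate([np.zeros(len(S1)), np.ones(len(S2))])
    acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    return acc, acc > threshold

rng = np.random.default_rng(0)
same    = classifier_two_sample_test(rng.normal(0, 1, (200, 2)),
                                     rng.normal(0, 1, (200, 2)))
shifted = classifier_two_sample_test(rng.normal(0, 1, (200, 2)),
                                     rng.normal(3, 1, (200, 2)))
print(same[1], shifted[1])  # change detected only after the shift
```

In a streaming deployment the two windows slide forward each step, so the classifier must be retrained or updated online.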
Challenge #6:
Evaluation
• Standard metrics for evaluating classifiers, such as F1
and AUC, were developed for machine learning research
• Most applications require separate metrics
• Example:
• Financial fraud
• Suppose we have 5 analysts and each analyst can examine 10
cases per day
• Metric: Expected value of the top 50 alarms (value@50).
• Incorporates the estimated value of each candidate fraud alarm
Application-Specific Metrics are Important
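The value@50 metric described above can be sketched as follows. The alarm representation (probability, estimated loss) is a hypothetical simplification:

```python
def value_at_k(alarms, k=50):
    """Expected value of the top-k alarms (value@k).

    alarms: iterable of (probability_of_fraud, estimated_loss_if_fraud)
    Ranks alarms by expected value and sums the top k, i.e. the cases
    the analysts would actually examine.
    """
    expected = sorted((p * loss for p, loss in alarms), reverse=True)
    return sum(expected[:k])

# Toy example with 3 candidate alarms and capacity k=2:
alarms = [(0.9, 1000.0), (0.2, 10000.0), (0.8, 100.0)]
print(value_at_k(alarms, k=2))  # 2900.0  (2000 + 900)
```

Note how the metric encodes the operational constraint directly: with 5 analysts handling 10 cases per day each, only the top 50 alarms matter, so accuracy below rank 50 is irrelevant.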
• Open Category Detection:
• Detect 99% of all open category queries
• Metric: Precision at 99% recall
• Obstacle Detection for Self-Driving cars
• Detect 99.999% of all dangerous obstacles
• Metric: Precision at 99.999% recall
• Cancer Screening:
• Must trade off false alarms versus missed alarms
• Metric: Cost to patient (may vary from one patient to another)
• AUC is a fairly good metric for this case
More Examples
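Precision at a target recall, used in the first two examples above, can be computed directly from ranked scores; a minimal sketch:

```python
import numpy as np

def precision_at_recall(scores, labels, target_recall=0.99):
    """Precision at the deepest cutoff that still achieves target recall.

    scores: higher = more likely positive; labels: true 0/1.
    """
    order = np.argsort(-np.asarray(scores))          # rank by score, descending
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                           # true positives at each cutoff
    recall = tp / labels.sum()
    precision = tp / (np.arange(len(labels)) + 1)
    ok = recall >= target_recall
    return precision[ok][0] if ok.any() else 0.0

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1,   1,   0,   1,   0,   0]
print(precision_at_recall(scores, labels, target_recall=0.99))  # 0.75
```

The metric answers the operational question directly: if we must catch 99% (or 99.999%) of the positives, how many of the resulting alarms are real?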
Summary
• Generalization
• Beyond iid: Causal transportability; Domain adaptation
• Feature Engineering
• Very important; Deep learning can discover useful intermediate features in some cases
• Explanation and Interpretability
• Explanations should help the user perform some task
• Uncertainty Quantification
• Probability Calibration
• Run-time Monitoring
• Anomaly Detection; Change Point Detection
• Application-Specific Metrics
Frontiers of Machine Learning and Applications