DutchMLSchool. Models, Evaluations, and Ensembles

BigML, Inc #DutchMLSchool
Supervised Learning I
Introduction to Machine Learning, Models, Evaluations and Ensembles
Poul Petersen
CIO, BigML, Inc

Machine Learning Motivation
Imagine:
• You are looking to buy a house
• Recently found a house you like
• Is the asking price fair?
What Next?

Machine Learning Motivation
Why not ask an expert?
• Experts can be rare / expensive
• Hard to validate experience:
• Experience with similar properties?
• Do they consider all relevant variables?
• Knowledge of market up to date?
• Hard to validate answer:
• How many times expert right / wrong?
• Probably can’t explain decision in detail
• Humans are not good at intuitive statistics

Data vs Expert
Replace the expert with data?
• Intuition: square footage relates to price.
• Collect data from past sales
SQFT  SOLD    PREDICT
2424  360000  400262
1785  307500  320195
1003  185000  222211
4135  600000  614651
1676  328500  306538
1012  247000  223339
3352  420000  516541
2825  435350  450508
PRICE = 125.3*SQFT + 96535
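A minimal sketch of how the PREDICT column follows from the line on this slide. The coefficients 125.3 and 96535 are copied from the slide as given, not re-fitted here; the names `sales` and `predict_price` are illustrative.

```python
# Past sales from the slide: (square footage, sold price).
sales = [
    (2424, 360000), (1785, 307500), (1003, 185000), (4135, 600000),
    (1676, 328500), (1012, 247000), (3352, 420000), (2825, 435350),
]

def predict_price(sqft):
    """Apply the slide's line: PRICE = 125.3*SQFT + 96535."""
    return 125.3 * sqft + 96535

# One row per sale: square footage, sold price, predicted price.
rows = [(sqft, sold, predict_price(sqft)) for sqft, sold in sales]

for sqft, sold, predicted in rows:
    # The last column is how far the line is from the real sale price.
    print(f"{sqft:>5} {sold:>7} {predicted:>9.0f} {predicted - sold:>+9.0f}")
```

The final column makes the point of the slide concrete: a single-variable line gets some houses roughly right and misses others by a wide margin.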

Data vs Expert
Replace the expert scorecard
• Experts can be rare / expensive
• Hard to validate experience:
• Experience with similar properties?
• Do they consider all relevant variables?
• Knowledge of market up to date?
• Hard to validate answer:
• How many times expert right / wrong?
• Probably can’t explain decision in detail
• Humans are not good at intuitive statistics

Data vs Expert
Replace the expert with data
• Intuition: square footage relates to price.
• Collect data from past sales
SQFT SOLD
2424 360000
1785 307500
1003 185000
4135 600000
1676 328500
1012 247000
3352 420000
2825 435350
PRICE = 125.3*SQFT + 96535

More Data!
SQFT  BEDS  BATHS  ADDRESS             LOCATION           LOT SIZE  YEAR BUILT  PARKING SPOTS  LATITUDE    LONGITUDE    SOLD
2424  4     3      1522 NW Jonquil     Timberhill SE 2nd  5227      1991        2              44,594828   -123,269328  360000
1785  3     2      7360 NW Valley Vw   Country Estates    25700     1979        2              44,643876   -123,238189  307500
1003  2     1      2620 NW Chinaberry  Tamarack Village   4792      1978        2              44,593704   -123,295424  185000
4135  5     3,5    4748 NW Veronica    Suncrest           6098      2004        3              44,5929659  -123,306916  600000
1676  3     2      2842 NW Monterey    Corvallis          8712      1975        2              44,5945279  -123,291523  328500
1012  3     1      2320 NW Highland    Corvallis          9583      1959        2              44,591476   -123,262841  247000
3352  4     3      1205 NW Ridgewood   Ridgewood 2        60113     1975        2              44,579439   -123,333888  420000
2825  3            411 NW 16th         Wilkins Addition   4792      1938        1              44,570883   -123,272113  435350
Uhhhh……..
• Can we still fit a line to 10 variables? (well, yes)
• Will fitting a line give good results? (unlikely)
• What about those text fields and categorical values?

Models

Mythical ML Model?
• High representational power
• Fitting a line is an example of low representational power
• Deep neural networks are an example of high representational power
• High Ease-of-use
• Easy to configure - relatively few parameters
• Easy to interpret - how are decisions made?
• Easy to put into production
• Ability to work with real-world data
• Mixed data types: numeric, categorical, text, etc
• Handle missing values
• Resilient to outliers
• There are actually hundreds of possible choices…

Decision Trees
Remember This?
Last Bill > $180 and Support Calls > 0

Decision Tree Demo #1

What Just Happened?
• We started with Housing data as a CSV from Redfin
• We uploaded the CSV to create a Source
• Then we created a Dataset from the Source and reviewed the
summary statistics
• With 1-click we built a Model which can predict home prices
based on all the housing features
• We explored the Model and used it to make a Prediction

Why Decision Trees
• Works for classification or regression

Why Decision Trees
• Easy to understand: splits are features and values
• Lightweight and super fast at prediction time

DT Predictions
Question 1 → Question 2 → Prediction
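A rough sketch of why prediction is lightweight: the tree is walked from the root, one question per node, until a leaf is reached. The tree below is hand-built and hypothetical; the field names, thresholds and prices are illustrative and not taken from the demo.

```python
# A hypothetical tree: each inner node asks "is field <= threshold?".
tree = {
    "field": "sqft", "threshold": 2000,
    "left": {"prediction": 250000},        # sqft <= 2000
    "right": {
        "field": "beds", "threshold": 3,
        "left": {"prediction": 400000},    # beds <= 3
        "right": {"prediction": 550000},   # beds > 3
    },
}

def predict(node, instance):
    """Walk from the root to a leaf, answering one question per node."""
    while "prediction" not in node:
        answer = instance[node["field"]] <= node["threshold"]
        node = node["left"] if answer else node["right"]
    return node["prediction"]

print(predict(tree, {"sqft": 2500, "beds": 4}))
```

Each prediction costs at most one comparison per level of the tree.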

Why Decision Trees
• Relatively parameter free
• Data can be messy
• Useless features are automatically ignored
• Works with un-normalized data
• Works with missing data at Training

Training with Missing
Reason Missing?
Loan Amount?

Why Decision Trees
• Works with missing data at Training & Prediction

Predictions with Missing
Question 1 → Missing? → Last Prediction

Predictions with Missing
Question 1 → Missing? → Skip → Question 2 + Question 3 → Avg Prediction
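The two strategies for a missing value at prediction time can be sketched as follows. This assumes every node stores its own prediction, and it averages the two branches without weighting; both are simplifying assumptions for illustration, and the tree itself is hypothetical.

```python
# Hypothetical tree; inner nodes also carry a prediction of their own.
tree = {
    "field": "sqft", "threshold": 2000, "prediction": 350000,
    "left": {"prediction": 250000},
    "right": {
        "field": "beds", "threshold": 3, "prediction": 475000,
        "left": {"prediction": 400000},
        "right": {"prediction": 550000},
    },
}

def is_leaf(node):
    return "field" not in node

def predict_last(node, instance):
    """'Last prediction': stop at the first question that cannot be answered."""
    while not is_leaf(node):
        value = instance.get(node["field"])
        if value is None:
            return node["prediction"]
        node = node["left"] if value <= node["threshold"] else node["right"]
    return node["prediction"]

def predict_average(node, instance):
    """'Skip': follow both branches of an unanswerable question and average."""
    if is_leaf(node):
        return node["prediction"]
    value = instance.get(node["field"])
    if value is None:
        left = predict_average(node["left"], instance)
        right = predict_average(node["right"], instance)
        return (left + right) / 2  # unweighted average (assumption)
    child = node["left"] if value <= node["threshold"] else node["right"]
    return predict_average(child, instance)

house = {"beds": 4}  # sqft is missing
print(predict_last(tree, house), predict_average(tree, house))
```

With `sqft` missing, the first strategy answers at the root, while the second still uses the known `beds` value further down the tree.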

Why Decision Trees
• Works with missing data at Training & Prediction
• Resilient to outliers
• High representational power
• Works easily with mixed data types

Data Types
numeric: 1, 2.0, 3, -5.4
categorical: true / false, yes / no, giraffe / zebra / ape, A / B / C
date-time: 2013-09-25 10:02 (formats: YYYY-MM-DD, HH:MM:SS)
• YEAR: 2013
• MONTH: September
• DAY-OF-MONTH: 25
• DAY-OF-WEEK: Wednesday (M-T-W-T-F-S-D)
• HOUR: 10
• MINUTE: 02
text: “Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.”
• “great” appears 2 times
• “afraid” appears 1 time
• “born” appears 1 time
items: bread, sugar, coffee, milk; ice cream, hot fudge
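As an illustration of the date-time expansion on this slide, a single timestamp can be broken into separate fields with the Python standard library. The function name `expand_datetime` and the dictionary keys are made up for this sketch.

```python
from datetime import datetime

def expand_datetime(text):
    """Split a 'YYYY-MM-DD HH:MM' timestamp into separate fields."""
    moment = datetime.strptime(text, "%Y-%m-%d %H:%M")
    return {
        "year": moment.year,
        "month": moment.strftime("%B"),        # month name
        "day-of-month": moment.day,
        "day-of-week": moment.strftime("%A"),  # weekday name
        "hour": moment.hour,
        "minute": moment.minute,
    }

fields = expand_datetime("2013-09-25 10:02")
for name, value in fields.items():
    print(f"{name}: {value}")
```

Each derived field can then be treated as an ordinary numeric or categorical input.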

Why Not Decision Trees
• Slightly prone to over-fitting. (what is that again?)

Learning Problems (fit)
Under-fitting Over-fitting
• Model fits too well; does not “generalize”
• Captures the noise or outliers of the data
• Change algorithm or filter outliers

• Slightly prone to over-fitting
• But we’ll fix this with ensembles
• Splitting prefers decision boundaries that are parallel
to feature axes

Splits Parallel to Axis
Ideal split… but not possible!

Splits Parallel to Axis
Will “discover” diagonal edge eventually

• Splitting prefers decision boundaries that are parallel to feature axes
• More data!
• Predictions outside training data can be problematic

Outlier Predictions

• Splitting prefers decision boundaries that are parallel to feature axes
• More data!
• We can catch this with model competence
• Can be sensitive to small changes in training data

Outlier Predictions

• Splitting prefers decision boundaries that are parallel to feature axes
• More data!
• We can catch this with model competence
• Can be sensitive to small changes in training data
• What other models can we try?
• And how will we know which one works best?

Evaluations

Easy Right?
INTL MIN  INTL CALLS  INTL CHARGE  CUST SERV CALLS  CHURN  PREDICT CHURN (Model Prediction)
8,7       4           2,35         1                False  False
11,2      5           3,02         0                False  True
12,7      6           3,43         4                True   True
9,1       5           2,46         0                False  False
11,2      2           3,02         1                False  False
12,3      5           3,32         3                False  False
13,1      6           3,54         4                False  False
5,4       9           1,46         4                True   False
13,8      4           3,73         1                False  False
Count up mistakes!
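Counting mistakes is a simple comparison of the two columns. A small sketch using the CHURN and PREDICT CHURN values from the table on this slide:

```python
# CHURN (actual) and PREDICT CHURN (model) columns, row by row.
actual    = [False, False, True, False, False, False, False, True, False]
predicted = [False, True,  True, False, False, False, False, False, False]

# A mistake is any row where the prediction differs from the actual value.
mistakes = sum(1 for a, p in zip(actual, predicted) if a != p)

print(f"{mistakes} mistakes out of {len(actual)} predictions")
```

Note that this count treats both kinds of mistake as equally bad, which is exactly what the following slides question.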

Mistakes can be Costly
FUN!
+ = DANGER!
Insight: Labeling a yield as a stop is not as bad as
labeling a stop as a yield… Need better metrics!

Evaluation Metrics
• Imagine we have a model that can predict a person’s dominant
hand, that is, for any individual it predicts left / right
• Define the positive class
• This selection is arbitrary
• It is the class you are interested in!
• The negative class is the “other” class (or others)
• For this example, we choose: left

Evaluation Metrics
• We choose the positive class: left
• True Positive (TP)
• We predicted left and the correct answer was left
• True Negative (TN)
• We predicted right and the correct answer was right
• False Positive (FP)
• We predicted left but the correct answer was right
• False Negative (FN)
• We predicted right but the correct answer was left

Evaluation Metrics
Remember…
True Positive: Correctly predicted the positive class
True Negative: Correctly predicted the negative class
False Positive: Incorrectly predicted the positive class
False Negative: Incorrectly predicted the negative class
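The four definitions translate directly into a counting routine. In this sketch the labels are hypothetical, chosen so that the counts come out as TP = 2, FP = 2, TN = 5, FN = 1; `confusion_counts` is an illustrative name.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, TN and FN for the chosen positive class."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            # Predicted positive: correct (TP) or incorrect (FP).
            counts["TP" if truth == positive else "FP"] += 1
        else:
            # Predicted negative: incorrect (FN) or correct (TN).
            counts["FN" if truth == positive else "TN"] += 1
    return counts

# Hypothetical population: 3 left handers and 7 right handers.
actual = ["left"] * 3 + ["right"] * 7
predicted = ["left", "left", "right"] + ["left", "left"] + ["right"] * 5

counts = confusion_counts(actual, predicted, positive="left")
print(counts)
```

Changing `positive` to "right" swaps the roles of the classes, which is why the choice of positive class has to be stated up front.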

Accuracy
Accuracy = (TP + TN) / Total
• “Percentage correct” - like an exam
• If Accuracy = 1 then no mistakes
• If Accuracy = 0 then all mistakes
• Intuitive but not always useful
• Watch out for unbalanced classes!
• Ex: 90% of people are right-handed and 10% are left
• A silly model which always predicts right handed is
90% accurate

Accuracy
Positive Class = Left, Negative Class = Right
Classified as Left Handed: TP = 0, FP = 0
Classified as Right Handed: TN = 7, FN = 3
Accuracy = (TP + TN) / Total = 70%
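A short sketch of the "silly model" on this slide: it always predicts right handed, scores 70% accuracy on 3 left handers and 7 right handers, and never finds a single left hander. Variable names are illustrative.

```python
# Population from the slide: 3 left handers, 7 right handers.
actual = ["left"] * 3 + ["right"] * 7

# The silly model ignores its input and always answers "right".
predicted = ["right"] * len(actual)

correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual)

# How many of the positive class (left) did the model actually find?
left_found = sum(1 for a, p in zip(actual, predicted) if a == "left" and p == "left")

print(f"accuracy = {accuracy:.0%}, left handers found = {left_found}")
```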

Precision
Precision = TP / (TP + FP)
• “accuracy” or “purity” of positive class
• How well you did separating the positive class from the
negative class
• If Precision = 1 then no FP.
• You may have missed some left handers, but of the
ones you identified, all are left handed. No mistakes.
• If Precision = 0 then no TP
• None of the left handers you identified are actually left
handed. All mistakes.

Precision
Positive Class = Left, Negative Class = Right
Classified as Left Handed: TP = 2, FP = 2
Classified as Right Handed: TN = 5, FN = 1
Precision = TP / (TP + FP) = 50%

Recall
Recall = TP / (TP + FN)
• Percentage of positive class correctly identified
• A measure of how well you identified all of the positive
class examples
• If Recall = 1 then no FN → All left handers identified
• There may be FP, so precision could be <1
• If Recall = 0 then no TP → No left handers identified

Recall
Positive Class = Left, Negative Class = Right
Classified as Left Handed: TP = 2, FP = 2
Classified as Right Handed: TN = 5, FN = 1
Recall = TP / (TP + FN) ≈ 67%
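Precision and recall for the worked example (TP = 2, FP = 2, TN = 5, FN = 1) can be checked with a few lines. Returning 0.0 when the denominator is zero is a convention assumed here, not something stated on the slides.

```python
def precision(tp, fp):
    """TP / (TP + FP); 0.0 if nothing was predicted positive (assumed convention)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """TP / (TP + FN); 0.0 if there are no positive examples (assumed convention)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

tp, fp, tn, fn = 2, 2, 5, 1
print(f"precision = {precision(tp, fp):.3f}")
print(f"recall    = {recall(tp, fn):.3f}")
```

Neither formula uses TN, which is why both stay informative when the negative class dominates.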

f-Measure
f-measure = 2 * Recall * Precision / (Recall + Precision)
• harmonic mean of Recall & Precision
• If f-measure = 1 then Recall == Precision == 1
• If Precision OR Recall is small then the f-measure is small
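A small sketch of the harmonic mean, using Precision = 0.5 and Recall = 2/3 from the left-handed example. Returning 0.0 when both inputs are zero is an assumed convention.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero, by assumption)."""
    if precision + recall == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

balanced = f_measure(0.5, 2 / 3)   # both values moderate
lopsided = f_measure(0.99, 0.01)   # one value tiny drags the score down

print(f"balanced = {balanced:.3f}, lopsided = {lopsided:.3f}")
```

The lopsided case shows why the harmonic mean is used: a simple average of 0.99 and 0.01 would be 0.5, while the f-measure stays close to the smaller value.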

Phi Coefficient
Phi = (TP*TN - FP*FN) / SQRT[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]
• Returns a value between -1 and 1
• = -1 then predictions are opposite reality
• = 0 then no correlation between predictions and reality
• = 1 then predictions are always correct
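The formula can be sketched directly. Returning 0.0 when the denominator is zero (for example, when a model never predicts one of the classes) is a common convention assumed here rather than something the slide specifies.

```python
import math

def phi(tp, fp, tn, fn):
    """Phi coefficient from the four confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention for an undefined ratio
    return (tp * tn - fp * fn) / denominator

print(f"worked example : {phi(tp=2, fp=2, tn=5, fn=1):.3f}")
print(f"always correct : {phi(tp=3, fp=0, tn=7, fn=0):.3f}")
print(f"always wrong   : {phi(tp=0, fp=7, tn=0, fn=3):.3f}")
print(f"always 'right' : {phi(tp=0, fp=0, tn=7, fn=3):.3f}")
```

The last line is the "silly model" that always predicts right handed: its accuracy is 70%, but under the convention above its phi is 0.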

Evaluations Demo #1

What Just Happened?
• Starting with the Diabetes Source, we created a Dataset and
then a Model.
• Using both the Model and the original Dataset, we created an
Evaluation.
• We reviewed the metrics provided by the Evaluation:
• Confusion Matrix
• Accuracy, Precision, Recall, f-measure and phi
• This Model seemed to perform really, really well…
Question: Can we trust this model?

Evaluation Danger!
• Never evaluate with the training data!
• Many models are able to “memorize” the training data
• This will result in overly optimistic evaluations!

“Memorizing” Training Data
Training
plasma glucose  bmi   diabetes pedigree  age  diabetes
148             33,6  0,627              50   TRUE
85              26,6  0,351              31   FALSE
183             23,3  0,672              32   TRUE
89              28,1  0,167              21   FALSE
137             43,1  2,288              33   TRUE
116             25,6  0,201              30   FALSE
78              31    0,248              26   TRUE
115             35,3  0,134              29   FALSE
197             30,5  0,158              53   TRUE
Evaluating
plasma glucose  bmi   diabetes pedigree  age  diabetes
148             33,6  0,627              50   ?
85              26,6  0,351              31   ?
• Exactly the same values!
• Who needs a model?
• What we want to know is how the model performs with values never seen at training:
124             22    0,107              46   ?

Evaluation Danger!
• If you only have one Dataset, use a train/test split

Train / Test Split
Train
plasma glucose  bmi   diabetes pedigree  age  diabetes
148             33,6  0,627              50   TRUE
183             23,3  0,672              32   TRUE
89              28,1  0,167              21   FALSE
78              31    0,248              26   TRUE
115             35,3  0,134              29   FALSE
197             30,5  0,158              53   TRUE
Test
plasma glucose  bmi   diabetes pedigree  age  diabetes
85              26,6  0,351              31   FALSE
137             43,1  2,288              33   TRUE
116             25,6  0,201              30   FALSE
• These instances were never seen at training time.
• Better evaluation of how the model will perform with “new” data
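A minimal sketch of a random train/test split using only the standard library. The function name, the one-third test fraction and the seed are illustrative choices; the rows are the nine diabetes instances shown earlier, written with decimal points.

```python
import random

# (plasma glucose, bmi, diabetes pedigree, age, diabetes)
rows = [
    (148, 33.6, 0.627, 50, True), (85, 26.6, 0.351, 31, False),
    (183, 23.3, 0.672, 32, True), (89, 28.1, 0.167, 21, False),
    (137, 43.1, 2.288, 33, True), (116, 25.6, 0.201, 30, False),
    (78, 31.0, 0.248, 26, True),  (115, 35.3, 0.134, 29, False),
    (197, 30.5, 0.158, 53, True),
]

def train_test_split(rows, test_fraction, seed):
    """Shuffle a copy of the rows, then hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(rows, test_fraction=1 / 3, seed=42)
print(len(train), "training rows,", len(test), "test rows")
```

The model is built from `train` only; `test` is touched once, at evaluation time.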

Evaluation Danger!
• Even a train/test split may not be enough!
• Might get a “lucky” split
• Solution is to repeat several times (formally to cross validate)
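One way to "repeat several times" is k-fold cross-validation: every row is used for testing exactly once. The sketch below only generates the row indices for each fold; `k_fold_splits` is an illustrative name, and training and scoring a model per fold is left out.

```python
import random

def k_fold_splits(n_rows, k, seed):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, like dealing cards.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(n_rows=10, k=5, seed=0))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```

Averaging the evaluation metric over the k folds reduces the chance that a single "lucky" split drives the result.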

Evaluations Demo #2

What Just Happened?
• Starting with the Diabetes Dataset we created a train/test split
• We built a Model using the train set and evaluated it with the
test set
• The scores were much worse than before, showing the danger
of evaluating with training data.
• Then we built several other models with different parameters
and used the evaluation comparison tool to see which
performed the best.
Question: Couldn’t we search for the best Model or parameters?
STAY TUNED

Evaluation
• Even a train/test split may not be enough!
• Might get a “lucky” split
• Solution is to repeat several times (formally to cross validate)
• Don’t forget that accuracy can be misleading!
• Mostly useless with unbalanced classes (left/right?)
• Use weighting, operating points, other tricks…

Operating Points
• The default probability threshold is 50%
• Changing the threshold can change the outcome for a
specific class
Rate    Payment   …  Actual Outcome  Probability PAID  Threshold @ 50%  Threshold @ 60%  Threshold @ 90%
8,4 %   US$456    …  PAID            95 %              PAID             PAID             PAID
9,6 %   US$134    …  PAID            87 %              PAID             PAID             DEFAULT
18 %    US$937    …  DEFAULT         36 %              DEFAULT          DEFAULT          DEFAULT
21 %    US$35     …  PAID            88 %              PAID             PAID             DEFAULT
17,5 %  US$1.044  …  DEFAULT         55 %              PAID             DEFAULT          DEFAULT
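The threshold columns can be reproduced by comparing each PAID probability with the chosen operating point. Treating a probability equal to the threshold as PAID is an assumption of this sketch; the table has no such tie.

```python
# (actual outcome, probability of PAID) for the five loans in the table.
loans = [("PAID", 0.95), ("PAID", 0.87), ("DEFAULT", 0.36),
         ("PAID", 0.88), ("DEFAULT", 0.55)]

def classify(probability_paid, threshold):
    """Predict PAID only when the model is at least `threshold` sure."""
    return "PAID" if probability_paid >= threshold else "DEFAULT"

outcomes = {
    threshold: [classify(p, threshold) for _, p in loans]
    for threshold in (0.5, 0.6, 0.9)
}

for threshold, predictions in outcomes.items():
    correct = sum(1 for (actual, _), pred in zip(loans, predictions) if actual == pred)
    print(f"threshold {threshold:.0%}: {predictions} ({correct} of {len(loans)} correct)")
```

Raising the threshold trades missed PAID loans for fewer missed DEFAULTs; which trade is better depends on the cost of each mistake.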

What about Regressions?
• No classes:
• Not possible to count mistakes: TP, FP, TN, FN
• Predicted values are numeric: error is the amount “off”
• actual 200, predict 180 = error 20
• Mean Absolute Error / Mean Squared Error
• Both are a measure of total error
• Note: value of the error is “unbounded”.
• When comparing models, lower values are “better”
• R-Squared Error
• Measure of how much better the model is than always
predicting the mean
• < 0 model is worse than the mean
• = 0 model is no better than the mean
• ➞ 1 model fits the data “perfectly”
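A compact sketch of the three regression metrics, written with plain Python. The `actual` and `predicted` values are made up for illustration, starting from the slide's "actual 200, predict 180" example.

```python
def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """1 - (model's squared error) / (squared error of always predicting the mean)."""
    mean = sum(actual) / len(actual)
    model_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    mean_error = sum((a - mean) ** 2 for a in actual)
    return 1 - model_error / mean_error

actual = [200, 300, 400]      # hypothetical true values
predicted = [180, 310, 390]   # hypothetical model output

print("MAE:", mean_absolute_error(actual, predicted))
print("MSE:", mean_squared_error(actual, predicted))
print("R^2:", r_squared(actual, predicted))
```

Predicting the mean for every row gives an R-squared of exactly 0, and a perfect model gives 1, matching the bullet points above.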

Evaluations Demo #3

What just happened?
• We split the Redfin data into training and test Datasets
• We created a Model and Evaluation
• We examined the Evaluation metrics
Wait - What about Time Series?

Dependent Data
Year  Pineapple Harvest
1986  50,74
1987  22,03
1988  50,69
1989  40,38
1990  29,80
1991  9,90
1992  73,93
1993  22,95
1994  139,09
1995  115,17
1996  193,88
1997  175,31
1998  223,41
1999  295,03
2000  450,53
[Figure: Pineapple Harvest, Tons (0 to 500) by Year (1986 to 2000), annotated with Trend and Error]

Dependent Data
Year  Pineapple Harvest
1986  139,09
1987  175,31
1988  9,91
1989  22,95
1990  450,53
1991  73,93
1992  40,38
1993  22,03
1994  295,03
1995  50,74
1996  29,8
1997  223,41
1998  115,17
1999  193,88
2000  50,69
[Figure: Pineapple Harvest, Tons (0 to 500) by Year (1986 to 2000), with the values rearranged]
Rearranging Disrupts Patterns

Random Train / Test Split
Train
plasma glucose  bmi   diabetes pedigree  age  diabetes
148             33,6  0,627              50   TRUE
183             23,3  0,672              32   TRUE
89              28,1  0,167              21   FALSE
78              31    0,248              26   TRUE
115             35,3  0,134              29   FALSE
197             30,5  0,158              53   TRUE
Test
plasma glucose  bmi   diabetes pedigree  age  diabetes
85              26,6  0,351              31   FALSE
137             43,1  2,288              33   TRUE
116             25,6  0,201              30   FALSE

Linear Train / Test Split
Train
Year  Pineapple Harvest
1986  50,74
1987  22,03
1988  50,69
1989  40,38
1990  29,80
1991  9,90
1992  73,93
1993  22,95
1994  139,09
1995  115,17
1996  193,88
Test
Year  Pineapple Harvest
1997  175,31
1998  223,41
1999  295,03
2000  450,53
Forecast → COMPARE
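For time-dependent data the split keeps the rows in order: the earliest years train the model and the latest years are held out. A minimal sketch with the pineapple data (decimal points instead of commas); `linear_split` and `n_test` are illustrative names.

```python
# (year, pineapple harvest in tons)
harvest = [
    (1986, 50.74), (1987, 22.03), (1988, 50.69), (1989, 40.38),
    (1990, 29.80), (1991, 9.90), (1992, 73.93), (1993, 22.95),
    (1994, 139.09), (1995, 115.17), (1996, 193.88), (1997, 175.31),
    (1998, 223.41), (1999, 295.03), (2000, 450.53),
]

def linear_split(rows, n_test):
    """Keep chronological order; hold out the last n_test rows for testing."""
    ordered = sorted(rows)  # sorts by year, the first element of each row
    return ordered[:-n_test], ordered[-n_test:]

train, test = linear_split(harvest, n_test=4)
print("train:", train[0][0], "to", train[-1][0])
print("test: ", test[0][0], "to", test[-1][0])
```

Because no shuffling takes place, the model never sees data from the future of the period it is asked to forecast.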

Ensembles

What is an Ensemble?
• Rather than build a single model…
• Combine the output of several typically “weaker” models into
a powerful ensemble…
• Q1: Why is this necessary?
• Q2: How do we build “weaker” models?
• Q3: How do we “combine” models?

No Model is Perfect
• A given ML algorithm may simply not be able to exactly
model the “real solution” of a particular dataset.
• Try to fit a line to a curve
• Even if the model is very capable, the “real solution” may be
elusive
• DT/NN can model any decision boundary with enough
training data, but finding the optimal solution is NP-hard
• Practical algorithms involve random processes and may
arrive at different, yet equally good, “solutions” depending
on the starting conditions, local optima, etc.
• If that wasn’t bad enough…

No Data is Perfect
• Not enough data!
• Always working with finite training data
• Therefore, every “model” is an approximation of the “real
solution” and there may be several good approximations.
• Anomalies / Outliers
• The model is trying to generalize from discrete training
data.
• Outliers can “skew” the model, by overfitting
• Mistakes in your data
• Does the model have to do everything for you?
• But really, there are always mistakes in your data

Ensemble Techniques
• Key Idea:
• By combining several good “models”, the combination
may be closer to the best possible “model”
• We want to ensure diversity. It’s not useful to use an
ensemble of 100 models that are all the same
• Training Data Tricks
• Build several models, each with only some of the data
• Introduce randomness directly into the algorithm
• Add training weights to “focus” the additional models on
the mistakes made
• Prediction Tricks
• Model the mistakes
• Model the output of several different algorithms

Simple Example - Fit a Line
Partition the data… then model each partition…
For predictions, use the model for the same partition
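A rough sketch of the partitioning idea: one line cannot follow a V-shaped curve, but one line per partition can. The data (y = |x|) and the boundary at x = 0 are hypothetical, chosen to keep the arithmetic easy to follow.

```python
def fit_line(points):
    """Ordinary least squares for y = slope*x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_partitions(points, boundary):
    """Partition the data at the boundary, then model each partition."""
    left = [p for p in points if p[0] <= boundary]
    right = [p for p in points if p[0] > boundary]
    return fit_line(left), fit_line(right)

def predict(models, boundary, x):
    """Use the model that belongs to the partition x falls in."""
    slope, intercept = models[0] if x <= boundary else models[1]
    return slope * x + intercept

points = [(x, abs(x)) for x in range(-5, 6)]  # hypothetical V-shaped data

single = fit_line(points)
models = fit_partitions(points, boundary=0)

print("single line at x=-3:", single[0] * -3 + single[1])
print("partitioned at x=-3:", predict(models, 0, -3))
```

The single line is nearly flat and misses both arms of the V, while the two partition models follow them closely.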

Decision Forest
DATASET → SAMPLE 1 → MODEL 1 → PREDICTION 1
DATASET → SAMPLE 2 → MODEL 2 → PREDICTION 2
DATASET → SAMPLE 3 → MODEL 3 → PREDICTION 3
DATASET → SAMPLE 4 → MODEL 4 → PREDICTION 4
PREDICTION 1, 2, 3, 4 → COMBINER → PREDICTION
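The sample, model, combine flow can be sketched with a deliberately tiny base model. Here each "model" is a one-split regression tree (a stump) on square footage, each sample is drawn with replacement, and the combiner is a plain average; all three are simplifying assumptions for illustration, not a description of the platform's implementation.

```python
import random
from statistics import mean

# (square footage, sold price) from the housing slides.
sales = [(2424, 360000), (1785, 307500), (1003, 185000), (4135, 600000),
         (1676, 328500), (1012, 247000), (3352, 420000), (2825, 435350)]

def train_stump(rows):
    """Fit a one-split regression tree by minimising squared error."""
    rows = sorted(rows)
    best = None
    for i in range(1, len(rows)):
        if rows[i][0] == rows[i - 1][0]:
            continue  # cannot split between identical x values
        threshold = (rows[i - 1][0] + rows[i][0]) / 2
        left = [y for _, y in rows[:i]]
        right = [y for _, y in rows[i:]]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every x in the sample was identical
        constant = mean(y for _, y in rows)
        return lambda x: constant
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def train_forest(rows, n_models, seed):
    """Build one model per random sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(rows) for _ in rows]
        models.append(train_stump(sample))
    return models

def combine(models, x):
    """Combiner: average the individual predictions."""
    return mean(model(x) for model in models)

forest = train_forest(sales, n_models=4, seed=0)
print("individual:", [model(2000) for model in forest])
print("combined:  ", combine(forest, 2000))
```

Because every model sees a slightly different sample, the individual predictions differ, and averaging them smooths out the quirks of any single sample.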

Random Decision Forest
DATASET → SAMPLE 1 → MODEL 1 → PREDICTION 1
DATASET → SAMPLE 2 → MODEL 2 → PREDICTION 2
DATASET → SAMPLE 3 → MODEL 3 → PREDICTION 3
DATASET → SAMPLE 4 → MODEL 4 → PREDICTION 4
SAMPLE 1
PREDICTION 1, 2, 3, 4 → COMBINER → PREDICTION

Boosting
ADDRESS            BEDS  BATHS  SQFT  LOT SIZE  YEAR BUILT  LATITUDE    LONGITUDE    LAST SALE PRICE  MODEL 1 PREDICTED SALE PRICE  ERROR
1522 NW Jonquil    4     3      2424  5227      1991        44,594828   -123,269328  360000           360750                        750
7360 NW Valley Vw  3     2      1785  25700     1979        44,643876   -123,238189  307500           306875                        -625
4748 NW Veronica   5     3,5    4135  6098      2004        44,5929659  -123,306916  600000           587500                        -12500
411 NW 16th        3            2825  4792      1938        44,570883   -123,272113  435350           435350                        0
ADDRESS            BEDS  BATHS  SQFT  LOT SIZE  YEAR BUILT  LATITUDE    LONGITUDE    ERROR  MODEL 2 PREDICTED ERROR
1522 NW Jonquil    4     3      2424  5227      1991        44,594828   -123,269328  750    750
7360 NW Valley Vw  3     2      1785  25700     1979        44,643876   -123,238189  625    625
4748 NW Veronica   5     3,5    4135  6098      2004        44,5929659  -123,306916  12500  12393,83333
411 NW 16th        3            2825  4792      1938        44,570883   -123,272113  0      6879,67857
"Hey Model 1, what do you predict is the sale price of this home?"
"Hey Model 2, how much error do you predict Model 1 just made?"
Why stop at one iteration?

Boosting
Iteration 1: DATASET → MODEL 1 → PREDICTION 1
Iteration 2: DATASET 2 → MODEL 2 → PREDICTION 2
Iteration 3: DATASET 3 → MODEL 3 → PREDICTION 3
Iteration 4: DATASET 4 → MODEL 4 → PREDICTION 4
etc…
PREDICTION 1 + 2 + 3 + 4 → SUM → PREDICTION
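The iterate-and-sum idea can be sketched as follows: each new model is trained on the error left over by the models before it, and the final prediction is the sum of all model outputs. The one-split tree on square footage and the residual defined as actual minus predicted are simplifying assumptions of this sketch, not the exact procedure behind the slides.

```python
from statistics import mean

# (square footage, sold price) from the housing slides.
sales = [(2424, 360000), (1785, 307500), (1003, 185000), (4135, 600000),
         (1676, 328500), (1012, 247000), (3352, 420000), (2825, 435350)]

def train_stump(rows):
    """Fit a one-split regression tree by minimising squared error."""
    rows = sorted(rows)
    best = None
    for i in range(1, len(rows)):
        if rows[i][0] == rows[i - 1][0]:
            continue
        threshold = (rows[i - 1][0] + rows[i][0]) / 2
        left = [y for _, y in rows[:i]]
        right = [y for _, y in rows[i:]]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(rows, n_iterations):
    """Train each model on the residual error of the models before it."""
    models = []
    residuals = list(rows)
    for _ in range(n_iterations):
        model = train_stump(residuals)
        models.append(model)
        # The next dataset: same inputs, target = what is still unexplained.
        residuals = [(x, y - model(x)) for x, y in residuals]
    return models

def predict(models, x):
    """Final prediction: the sum of every model's output."""
    return sum(model(x) for model in models)

def training_error(models, rows):
    return sum((y - predict(models, x)) ** 2 for x, y in rows)

one = boost(sales, n_iterations=1)
four = boost(sales, n_iterations=4)
print("squared error, 1 iteration: ", training_error(one, sales))
print("squared error, 4 iterations:", training_error(four, sales))
```

Each iteration can only shrink the error on the training data, which is also why boosting can overfit noisy data if it runs for too many iterations.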

Ensembles Demo #1

Which Ensemble Method
• The one that works best!
• Ok, but seriously. Did you evaluate?
• For "large" / "complex" datasets
• Use DF/RDF with deeper node threshold
• Even better, use Boosting with more iterations
• For "noisy" data
• Boosting may overfit
• RDF preferred
• For "wide" data
• Randomizing features (RDF) will be quicker
• For "easy" data
• A single model may be fine
• Bonus: also has the best interpretability!
• For classification with "large" number of classes
• Boosting will be slower
• For "general" data
• DF/RDF likely better than a single model or Boosting.
• Boosting will be slower since the models are processed serially

Ensembles Demo #2

Summary
• Models have shortcomings: ability to fit, NP-hard, etc
• Data has shortcomings: not enough, outliers, mistakes, etc
• Ensemble Techniques can improve on single models
• Sampling: partitioning, Decision Tree bagging
• Adding Randomness: RDF
• Modeling the Error: Boosting
• Modeling the Models: Stacking
• Guidelines for knowing which one might work best in a given
situation
