Evaluating Machine Learning Models – A Beginner’s Guide
Alice Zheng, Dato
September 15, 2015
My machine learning trajectory
Applied machine learning (data science) → building ML tools.
Shortage of experts and good tools.
Why machine learning?
Model data.
Make predictions.
Build intelligent applications.
Machine learning pipeline
[Pipeline diagram: raw data (e.g., "I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …") → features → models → predictions → deploy in production, built with GraphLab Create, Dato Distributed, and Dato Predictive Services.]
The ML Jargon Challenge
Typical machine learning paper
… semi-supervised model for large-scale learning from sparse data … sub-modular optimization for distributed computation … evaluated on real and synthetic datasets … performance exceeds state-of-the-art methods …
What it looks like to ML researchers
[Doge meme: "Such regularize! Much optimal. So sparsity. Wow! Amaze. Very scale."]
What it looks like to normal people
What it’s like in practice
[Diagram: question marks everywhere. Doesn't scale, brittle, hard to tune, doesn't solve my problem on my data.]
Achieve Machine Learning Zen
Why is evaluation important?
• So you know when you’ve succeeded
• So you know how much you’ve succeeded
• So you can decide when to stop
• So you can decide when to update the model
Basic questions for evaluation
• When to evaluate?
• What metric to use?
• On what data?
When to evaluate
[Diagram: a prototype model is trained and validated offline on historical data, producing training results and validation results; the deployed model is evaluated on live data, both via offline evaluation on live data and via online evaluation, which produces online evaluation results.]
Evaluation Metrics
Types of evaluation metric
• Training metric
• Validation metric
• Tracking metric
• Business metric
“But they may not match!” (Uh-oh Penguin)
Example: recommender system
• Given data on which users liked which items, recommend
other items to users
• Training metric
- How well is it predicting the preference score?
- Residual mean squared error: (actual – predicted)²
• Validation metric
- Does it rank known preferences correctly?
- Ranking loss
Example: recommender system
• Tracking metric
- Does it rank items correctly, especially for top items?
- Normalized Discounted Cumulative Gain (NDCG)
• Business metric
- Does it increase the amount of time the user spends on the
site/service?
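To make these metrics concrete, here is a small Python sketch (not from the original slides) of the per-example squared error used for the training metric and of NDCG for a ranked list; the function names and the toy relevance values are illustrative.

import math

def squared_error(actual, predicted):
    # Per-example squared error; averaging over examples gives the mean squared error.
    return (actual - predicted) ** 2

def dcg(relevances, k):
    # Discounted cumulative gain of the top-k items, in ranked order.
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(relevances, k):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal_dcg = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

print(squared_error(actual=4.0, predicted=3.2))   # ≈ 0.64
print(ndcg([3, 2, 3, 0, 1], k=5))                 # < 1.0 because the ranking is not ideal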
Dealing with metrics
• Many possible metrics at different
stages
• Defining the right metric is an art
- What’s useful? What’s feasible?
• Aligning the metrics will make
everyone happier
- Not always possible: cannot directly
train model to optimize for user
engagement
“Do the best you can!” (Okedokey Donkey)
Model Selection and Tuning
[Diagram: historical data is split into training data and validation data; model training on the training data produces a model and training results; evaluating the model on the validation data produces validation results, which drive hyperparameter tuning.]
Key questions for model selection
• What’s validation?
• What’s a hyperparameter and how do you tune it?
Model validation
• Measure generalization error
- How well the model works on new data
- “New” data = data not used during training
• Train on one dataset, validate on another
• Where to find “new” data for validation?
- Clever re-use of old data
Methods for simulating new data
• Hold-out validation: split the data into a training portion and a held-out validation portion.
• K-fold cross validation: partition the data into K folds (1, 2, 3, … K); each fold takes a turn as the validation set.
• Bootstrap resampling: generate resampled datasets by sampling from the data with replacement.
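A minimal Python sketch of these three splitting schemes (not from the slides), assuming NumPy and scikit-learn; the toy X and y arrays stand in for your own features and labels.

import numpy as np
from sklearn.model_selection import train_test_split, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 examples, 5 features (toy data)
y = rng.integers(0, 2, size=100)       # binary labels (toy data)

# Hold-out validation: one fixed train/validation split.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# K-fold cross validation: every example is validated on exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    X_tr, X_va = X[train_idx], X[val_idx]
    y_tr, y_va = y[train_idx], y[val_idx]
    # ...train on (X_tr, y_tr), evaluate on (X_va, y_va)...

# Bootstrap resampling: sample n examples with replacement to form a new dataset.
boot_idx = rng.choice(len(X), size=len(X), replace=True)
X_boot, y_boot = X[boot_idx], y[boot_idx]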
Hyperparameter tuning vs. model training
[Diagram: model training outputs the best model parameters; hyperparameter tuning outputs the best hyperparameters.]
Hyperparameters != model parameters
[Diagram: classification between two classes, plotted on Feature 1 vs. Feature 2. The decision boundary is determined by model parameters; how many features to use is a hyperparameter.]
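As a concrete illustration (not from the slides), in scikit-learn's LogisticRegression the regularization strength C is a hyperparameter that the practitioner sets before training, while the coefficients learned by fit() are the model parameters; the data here is made up.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                # two features (toy data)
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # two classes (toy labels)

clf = LogisticRegression(C=1.0)              # C, the regularization strength, is a hyperparameter
clf.fit(X, y)
print(clf.coef_, clf.intercept_)             # learned model parameters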
Why is hyperparameter tuning hard?
• Involves model training as a sub-process
- Can’t optimize directly
• Methods:
- Grid search
- Random search
- Smart search
• Gaussian processes/Bayesian optimization
• Random forests
• Derivative-free optimization
• Genetic algorithms
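A brief sketch of the first two methods in the list above, grid search and random search, using scikit-learn's search utilities (illustrative only; smart search would typically use a dedicated Bayesian-optimization library, and the toy data is made up).

import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                          # toy data
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Grid search: try every value in a fixed grid of hyperparameters.
grid = GridSearchCV(LogisticRegression(), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)

# Random search: sample hyperparameter values from a distribution.
rand = RandomizedSearchCV(LogisticRegression(), {"C": loguniform(1e-3, 1e3)},
                          n_iter=20, cv=5, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)            # best hyperparameters found by each search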
Online Evaluations
ML in production - 101
[Diagram: a model is batch-trained on historical data; in production it serves real-time predictions on live data, and feedback on those predictions flows back into the historical data.]
ML in production - 101
[Diagram: the same pipeline, but a second model (Model 2) is now also batch-trained on the historical data and serves real-time predictions on the live data.]
Why evaluate models online?
• Track real performance of model over time
• Decide which model to use when
Choosing between ML models
Strategy 1, A/B testing: select the best model and use it all the time.
[Diagram: Group A (2000 visits) is served Model 1 and sees a 10% CTR; Group B (2000 visits) is served Model 2 and sees a 30% CTR; afterwards, everybody gets Model 2.]
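Whether a CTR gap like the one in this example is real or just noise can be checked with a standard two-proportion z-test; the sketch below (not part of the slides) plugs in the example's numbers of 2000 visits per group at 10% vs. 30% CTR.

import math

def two_proportion_ztest(clicks_a, visits_a, clicks_b, visits_b):
    # Pooled two-proportion z-test for the difference in click-through rates.
    p_a, p_b = clicks_a / visits_a, clicks_b / visits_b
    p_pool = (clicks_a + clicks_b) / (visits_a + visits_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / visits_a + 1 / visits_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value under the standard normal
    return z, p_value

z, p = two_proportion_ztest(clicks_a=200, visits_a=2000, clicks_b=600, visits_b=2000)
print(z, p)   # large z, tiny p-value: the difference is very unlikely to be chance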
Choosing between ML models
A statistician walks into a casino…
[Diagram: three slot machines with pay-offs of $1:$1000, $1:$200, and $1:$500, played 85%, 10%, and 5% of the time, respectively. This is the multi-armed bandit setting.]
Choosing between ML models
A statistician walks into an ML production environment…
[Diagram: Model 1, Model 2, and Model 3 have pay-offs of $1:$1000, $1:$200, and $1:$500; Model 1 is used 85% of the time (exploitation), Model 2 10% of the time (exploration), and Model 3 5% of the time (exploration).]
MAB vs. A/B testing
Why MAB?
• Continuous optimization, “set and forget”
• Maximize overall reward
Why A/B test?
• Simple to understand
• Single winner
• Tricky to do right
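A minimal epsilon-greedy sketch of the bandit strategy described above (illustrative, not from the slides): most traffic is routed to the model that currently looks best, while a small fraction keeps exploring the alternatives. The "true" CTRs are made up and used only to simulate feedback.

import random

random.seed(0)
models = ["model_1", "model_2", "model_3"]
clicks = {m: 0 for m in models}     # observed rewards (e.g., clicks)
serves = {m: 1 for m in models}     # times each model was served (start at 1 to avoid /0)
epsilon = 0.15                      # fraction of traffic reserved for exploration

true_ctr = {"model_1": 0.30, "model_2": 0.10, "model_3": 0.20}   # hypothetical, for simulation

for _ in range(10000):
    if random.random() < epsilon:
        chosen = random.choice(models)                               # explore
    else:
        chosen = max(models, key=lambda m: clicks[m] / serves[m])    # exploit the current best
    serves[chosen] += 1
    clicks[chosen] += random.random() < true_ctr[chosen]             # simulated click feedback

print({m: round(clicks[m] / serves[m], 3) for m in models})   # estimated CTR per model
print({m: serves[m] for m in models})                         # most traffic flows to the best model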
That’s not all, folks!
Read the details
• Blog posts: http://blog.dato.com/topic/machine-learning-primer
• Report: http://oreil.ly/1L7dS4a
• Dato is hiring! jobs@dato.com
alicez@dato.com @RainyData


Editor's Notes

  • #5 Features sit between raw data and model. They can make or break an application.