Machine Learning Comparison and Evaluation
Contact Information
Ted Dunning, PhD
Chief Application Architect, MapR Technologies
Board Member, Apache Software Foundation
O’Reilly author
Email: tdunning@mapr.com, ted@apache.org
Twitter: @ted_dunning
Machine Learning Everywhere
Image courtesy of Mtell, used with permission. Images © Ellen Friedman.
[Diagram: raw input and features/profiles feed models m1, m2, m3; a decoy sits alongside the models and writes to an archive, and all models emit scores]
[Diagram: the same pipeline with a rendezvous stage that collects the scores from all models and selects the result to return]
[Diagram: the rendezvous pipeline again, now with metrics collected from the models and from the rendezvous stage]
Let’s talk about how the rendezvous architecture makes evaluation easier
Decoy Model in the Rendezvous Architecture
[Diagram: input fans out to the decoy, Model 2, and Model 3; the decoy writes to the archive while the real models emit scores]
• Looks like a server, but it just archives inputs
• Safe in a good streaming environment, less safe without good isolation
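As a rough sketch of the idea (the class name and the archive.send stream API are hypothetical, not MapR’s implementation), a decoy exposes the same scoring interface as a real model but only records what it receives:

import json, time

class DecoyModel:
    # Looks like a model, but never scores anything.
    # `archive` is any append-only sink, e.g. a producer for a
    # message-stream topic; good isolation matters because a slow
    # archive must never stall real scoring.
    def __init__(self, archive):
        self.archive = archive

    def score(self, request_id, features):
        # Record exactly the input every real model received ...
        self.archive.send(json.dumps(
            {"t": time.time(), "id": request_id, "features": features}))
        # ... and return no score; the rendezvous ignores the decoy.
        return None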
Other Data Collected in Rendezvous
• Request ID + Input data
• All output scores
• Evaluation latency
• Round trip latency
• Rendezvous choices
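A hedged sketch of what a single archived record might carry (field names are illustrative, not a fixed schema):

from dataclasses import dataclass

@dataclass
class RendezvousRecord:
    request_id: str        # joins input, scores, and final result
    input_data: dict       # raw input, as archived by the decoy
    scores: dict           # model name -> score, for every live model
    eval_latency_ms: dict  # model name -> time to produce its score
    round_trip_ms: float   # latency seen by the caller
    chosen_model: str      # which score the rendezvous returned

With everything keyed by request ID, offline comparison is just a join over the archive.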
Direct Model Comparison
• Don’t need ground truth to compare models at a gross level
• For uncalibrated models, score quantiles are useful
• For mature models, most results will be very similar
– Large differences from known good models cannot be good
• Ultimately, ground truth is important
– But only for cases where scores differ significantly
Direct Model Differencing
[Plots: raw scores from two models (left) and the corresponding Q-Q plot (right)]
• Scales may differ radically
• Quantiles correct the scaling
• Perfect match on high scores
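A minimal sketch of the quantile trick, assuming arrays of archived scores for two models (the function name is ours): comparing quantile against quantile removes calibration and scale differences, so what remains are the genuine disagreements.

import numpy as np

def qq_points(scores_a, scores_b, n=100):
    # Paired quantiles of two score distributions.  If the models
    # agree up to calibration, the points trace a smooth monotone
    # curve; kinks, especially in the high-score tail, localize
    # real disagreement worth checking against ground truth.
    p = np.linspace(0.0, 1.0, n)
    return np.quantile(scores_a, p), np.quantile(scores_b, p)

# Same ranking, radically different scale -> a clean straight line
a = np.random.normal(size=10_000)
qa, qb = qq_points(a, 2.5 * a + 1.0)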
Reject Inferencing
• Today’s model selects tomorrow’s training data
• Safe decisions often prevent data collection
– Fraud flag prevents the transaction
– Recommendation ranking has the same effect
• The model winds up confirming what it already knows
• Model comparison has the same problem
– Champion says reject, challenger says retain
Reject Inferencing Solution
• We must balance EXPLORATION
– Calling a bluff to look at ground truth
• Versus EXPLOITATION
– Doing what we think is right
• Exploration costs us because we make worse decisions
– But it can help make better decisions later
• Exploitation costs us because we don’t learn better answers
– But it is the best we know now
Multi-Armed Bandits
• Classic formulation for explore/exploit trade-offs
• Thompson sampling is a very good option
• Simple dithering may be good enough
• Key intuition is that we don’t need to perfectly characterize
losers … once we know they are losers, we don’t care
• A variant for ranking is also good for model evaluation
– Also used to rank Reddit comments
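For concreteness, a minimal Beta-Bernoulli Thompson sampling sketch for picking which model’s decision to act on; this is the textbook technique with binary outcomes assumed, not code from any rendezvous implementation.

import random

class ThompsonSampler:
    def __init__(self, models):
        # (successes, failures) per model; Beta(1, 1) priors.
        self.stats = {m: [1, 1] for m in models}

    def choose(self):
        # Draw a plausible success rate per model, act on the best.
        # Clear losers get sampled less and less, which is exactly
        # the "don't bother characterizing losers" intuition.
        draws = {m: random.betavariate(a, b)
                 for m, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, model, success):
        self.stats[model][0 if success else 1] += 1

sampler = ThompsonSampler(["m1", "m2", "m3"])
pick = sampler.choose()              # explore/exploit handled implicitly
sampler.update(pick, success=True)   # fold in ground truth when it arrives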
Some Warnings
• Bad models can be good explorers
• That can make other models look better
• Offline evaluation is fine, but you don’t know what would have
happened … real innovation has high error bars
• Where models all agree, we learn nothing
• In the end, it is differences that matter the most
Having complete and precise history is golden for offline comparisons
Allowing the rendezvous server to do Thompson sampling is even better
Change Detection
• Model comparison is all fine and good until the world changes
• And the world will change
• One of the most sensitive indicators is the score distribution of a known-good model
– T-digest is very effective for sketching distributions, especially in the tails
– Compare the current vs. historical distribution using Q-Q or a KS test (see the sketch below)
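A sketch of that comparison using SciPy’s two-sample Kolmogorov-Smirnov test on raw score samples; at scale you would compare t-digest sketches instead, but this in-memory version shows the shape of the check (the threshold is an assumption):

from scipy.stats import ks_2samp

def drifted(historical_scores, current_scores, alpha=0.001):
    # KS statistic = largest gap between the two empirical CDFs.
    # A tiny p-value on a known-good model's scores is an early
    # warning that the world (or the input pipeline) has changed.
    stat, p_value = ks_2samp(historical_scores, current_scores)
    return p_value < alpha, stat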
Analyzing Latencies
Hotel Room Latencies
• These are ping latencies from my hotel
• Looks pretty good, right?
• But what about longer term?
Sample pings (ms): 208.302, 198.571, 185.099, 191.258, 201.392, 214.738, 197.389, 187.749, 201.693, 186.762, 185.296, 186.390, 183.960, 188.060, 190.763
> mean(y$t[i])
[1] 198.6047
> sd(y$t[i])
[1] 71.43965
Not So Fast …
This is long-tailed land
You have to know the distribution of values
A single number is simply not enough
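To see why, here is a tiny simulation (synthetic numbers, not the hotel data): a long-tailed mixture whose mean and standard deviation look tame while the tail quantiles expose the stalls.

import numpy as np

rng = np.random.default_rng(42)
n = 100_000
# 99% quick responses near 190 ms, 1% stalls near 2 s (synthetic)
latencies = np.where(rng.random(n) < 0.99,
                     rng.normal(190, 8, n),
                     rng.normal(2000, 200, n))
print(f"mean={latencies.mean():.1f} ms  sd={latencies.std():.1f} ms")
print("p50/p99/p99.9 =", np.percentile(latencies, [50, 99, 99.9]).round(1))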
And this histogram is hard to read
Idea – Exponential Bins
• Suppose we want relative accuracy in measurement space
• Latencies are positive and only matter within a few percent
– 1.1 ms versus 1.0 ms
– 1100 ms versus 1000 ms
• We can cheat by using floating point representations
– Compute bin using magic
– Adjust bins slightly using more magic
– Count
FloatHistogram
• Assume all measurements fall in a known range
• Divide this range into power-of-2 sub-ranges
• Sub-divide each sub-range evenly into a fixed number of steps
• Relative error is bounded in measurement space
• Bin index can be computed using the FP representation (sketched below)!
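A minimal Python sketch of that trick using math.frexp, which reads the floating-point representation directly; the real FloatHistogram is Ted Dunning’s Java implementation, and the step count here is an illustrative assumption.

import math

def float_bin(x, min_value=1e-3, steps=4):
    # frexp gives x = m * 2**e with m in [0.5, 1): the exponent
    # selects the power-of-2 sub-range, the mantissa selects one of
    # `steps` even subdivisions inside it, so relative bin width
    # stays below roughly 1/steps everywhere.
    m, e = math.frexp(x / min_value)
    return e * steps + int((2 * m - 1) * steps)

assert float_bin(1.0) < float_bin(1.1) <= float_bin(2.0)  # monotone, coarse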
What about visualization?
[Plots from the original slides: “Can’t see small count bars”; “Good Results”; “Bad Results – 1% of measurements are 3x bigger”; “Uniform Bins”; “FloatHistogram Bins”; “With FloatHistogram” – with uniform bins the anomalous 1% is easy to miss, while FloatHistogram bins make it stand out]
Sign Up for Next Workshop in the MLL Series
by Ted Dunning, Chief Application Architect at MapR:
Machine Learning in the Enterprise:
How to do model management in production
http://bit.ly/mapr-machine-learning-logistics-series
Additional Resources
O’Reilly report by Ted Dunning & Ellen Friedman © March 2017
Read free courtesy of MapR:
https://mapr.com/geo-distribution-big-data-and-analytics/
O’Reilly book by Ted Dunning & Ellen Friedman, © March 2016
Read free courtesy of MapR:
https://mapr.com/streaming-architecture-using-apache-kafka-mapr-streams/
Additional Resources
O’Reilly book by Ted Dunning & Ellen Friedman, © June 2014
Read free courtesy of MapR:
https://mapr.com/practical-machine-learning-new-look-anomaly-detection/
O’Reilly book by Ellen Friedman & Ted Dunning, © February 2014
Read free courtesy of MapR:
https://mapr.com/practical-machine-learning/
Additional Resources
Blog post by Ellen Friedman, 8 Aug 2017, on the MapR blog:
https://mapr.com/blog/tensorflow-mxnet-caffe-h2o-which-ml-best/
Interview by Thor Olavsrud in CIO:
https://www.cio.com.au/article/630299/what-dataops-collaborative-cross-functional-analytics/?fp=16&fpid=1
Read more in the new book on model management:
New O’Reilly book by Ted Dunning & Ellen Friedman, © September 2017
Download free pdf courtesy of MapR:
https://mapr.com/ebook/machine-learning-logistics/
Please support women in tech – help build
girls’ dreams of what they can accomplish
© Ellen Friedman 2015. #womenintech #datawomen
Q&A
ENGAGE WITH US
@mapr
@ted_dunning
Maprtechnologies
tdunning@mapr.com
