Machine Learning Comparison and Evaluation
Contact Information
Ted Dunning, PhD
Chief Application Architect, MapR Technologies
Board Member, Apache Software Foundation
O’Reilly author
Email: tdunning@mapr.com, ted@apache.org
Twitter: @ted_dunning
Machine Learning Everywhere
Image courtesy of Mtell, used with permission. Images © Ellen Friedman.
[Diagram: raw input and features/profiles feed models m1, m2, m3; a decoy sits alongside the models and writes to an archive, and all models emit scores]
[Diagram: the same pipeline with a rendezvous stage that collects the scores from all models and selects the result to return]
[Diagram: the rendezvous pipeline again, now with metrics collected from the models and from the rendezvous stage]
Let’s talk about how the rendezvous architecture makes evaluation easier
Decoy Model in the Rendezvous Architecture
[Diagram: input fans out to the decoy, Model 2, and Model 3; the decoy writes to the archive while the real models emit scores]
• Looks like a server, but it just archives inputs
• Safe in a good streaming environment, less safe without good isolation
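As a rough sketch of the idea (the class name and the archive.send stream API are hypothetical, not MapR’s implementation), a decoy exposes the same scoring interface as a real model but only records what it receives:

import json, time

class DecoyModel:
    # Looks like a model, but never scores anything.
    # `archive` is any append-only sink, e.g. a producer for a
    # message-stream topic; good isolation matters because a slow
    # archive must never stall real scoring.
    def __init__(self, archive):
        self.archive = archive

    def score(self, request_id, features):
        # Record exactly the input every real model received ...
        self.archive.send(json.dumps(
            {"t": time.time(), "id": request_id, "features": features}))
        # ... and return no score; the rendezvous ignores the decoy.
        return None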
Other Data Collected in Rendezvous
• Request ID + Input data
• All output scores
• Evaluation latency
• Round trip latency
• Rendezvous choices
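A hedged sketch of what a single archived record might carry (field names are illustrative, not a fixed schema):

from dataclasses import dataclass

@dataclass
class RendezvousRecord:
    request_id: str        # joins input, scores, and final result
    input_data: dict       # raw input, as archived by the decoy
    scores: dict           # model name -> score, for every live model
    eval_latency_ms: dict  # model name -> time to produce its score
    round_trip_ms: float   # latency seen by the caller
    chosen_model: str      # which score the rendezvous returned

With everything keyed by request ID, offline comparison is just a join over the archive.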
Direct Model Comparison
• Don’t need ground truth to compare models at a gross level
• For uncalibrated models, score quantiles are useful
• For mature models, most results will be very similar
– Large differences from known good models cannot be good
• Ultimately, ground truth is important
– But only for cases where scores differ significantly
Direct Model Differencing
[Plots: raw scores from two models (left) and the corresponding Q-Q plot (right)]
• Scales may differ radically
• Quantiles correct the scaling
• Perfect match on high scores
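A minimal sketch of the quantile trick, assuming arrays of archived scores for two models (the function name is ours): comparing quantile against quantile removes calibration and scale differences, so what remains are the genuine disagreements.

import numpy as np

def qq_points(scores_a, scores_b, n=100):
    # Paired quantiles of two score distributions.  If the models
    # agree up to calibration, the points trace a smooth monotone
    # curve; kinks, especially in the high-score tail, localize
    # real disagreement worth checking against ground truth.
    p = np.linspace(0.0, 1.0, n)
    return np.quantile(scores_a, p), np.quantile(scores_b, p)

# Same ranking, radically different scale -> a clean straight line
a = np.random.normal(size=10_000)
qa, qb = qq_points(a, 2.5 * a + 1.0)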
Reject Inferencing
• Today’s model selects tomorrow’s training data
• Safe decisions often prevent data collection
– Fraud flag prevents the transaction
– Recommendation ranking has the same effect
• The model winds up confirming what it already knows
• Model comparison has the same problem
– Champion says reject, challenger says retain
Reject Inferencing Solution
• We must balance EXPLORATION
– Calling a bluff to look at ground truth
• Versus EXPLOITATION
– Doing what we think is right
• Exploration costs us because we make worse decisions
– But it can help make better decisions later
• Exploitation costs us because we don’t learn better answers
– But it is the best we know now
Multi-Armed Bandits
• Classic formulation for explore/exploit trade-offs
• Thompson sampling is a very good option
• Simple dithering may be good enough
• Key intuition is that we don’t need to perfectly characterize
losers … once we know they are losers, we don’t care
• A variant for ranking is also good for model evaluation
– Also used to rank Reddit comments
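For concreteness, a minimal Beta-Bernoulli Thompson sampling sketch for picking which model’s decision to act on; this is the textbook technique with binary outcomes assumed, not code from any rendezvous implementation.

import random

class ThompsonSampler:
    def __init__(self, models):
        # (successes, failures) per model; Beta(1, 1) priors.
        self.stats = {m: [1, 1] for m in models}

    def choose(self):
        # Draw a plausible success rate per model, act on the best.
        # Clear losers get sampled less and less, which is exactly
        # the "don't bother characterizing losers" intuition.
        draws = {m: random.betavariate(a, b)
                 for m, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, model, success):
        self.stats[model][0 if success else 1] += 1

sampler = ThompsonSampler(["m1", "m2", "m3"])
pick = sampler.choose()              # explore/exploit handled implicitly
sampler.update(pick, success=True)   # fold in ground truth when it arrives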
Some Warnings
• Bad models can be good explorers
• That can make other models look better
• Offline evaluation is fine, but you don’t know what would have
happened … real innovation has high error bars
• Where models all agree, we learn nothing
• In the end, it is differences that matter the most
Having complete and precise history is golden for offline comparisons
Allowing the rendezvous server to do Thompson sampling is even better
Change Detection
• Model comparison is all fine and good until the world changes
• And the world will change
• One of the most sensitive indicators is the score distribution of a known-good model
– T-digest is very effective for sketching distributions, especially in the tails
– Compare the current vs. historical distribution using Q-Q or a KS test (see the sketch below)
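A sketch of that comparison using SciPy’s two-sample Kolmogorov-Smirnov test on raw score samples; at scale you would compare t-digest sketches instead, but this in-memory version shows the shape of the check (the threshold is an assumption):

from scipy.stats import ks_2samp

def drifted(historical_scores, current_scores, alpha=0.001):
    # KS statistic = largest gap between the two empirical CDFs.
    # A tiny p-value on a known-good model's scores is an early
    # warning that the world (or the input pipeline) has changed.
    stat, p_value = ks_2samp(historical_scores, current_scores)
    return p_value < alpha, stat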
Analyzing Latencies
Hotel Room Latencies
• These are ping latencies from my hotel
• Looks pretty good, right?
• But what about longer term?
Sample pings (ms): 208.302, 198.571, 185.099, 191.258, 201.392, 214.738, 197.389, 187.749, 201.693, 186.762, 185.296, 186.390, 183.960, 188.060, 190.763
> mean(y$t[i])
[1] 198.6047
> sd(y$t[i])
[1] 71.43965
Not So Fast …
This is long-tailed land
You have to know the distribution of values
A single number is simply not enough
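To see why, here is a tiny simulation (synthetic numbers, not the hotel data): a long-tailed mixture whose mean and standard deviation look tame while the tail quantiles expose the stalls.

import numpy as np

rng = np.random.default_rng(42)
n = 100_000
# 99% quick responses near 190 ms, 1% stalls near 2 s (synthetic)
latencies = np.where(rng.random(n) < 0.99,
                     rng.normal(190, 8, n),
                     rng.normal(2000, 200, n))
print(f"mean={latencies.mean():.1f} ms  sd={latencies.std():.1f} ms")
print("p50/p99/p99.9 =", np.percentile(latencies, [50, 99, 99.9]).round(1))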
And this histogram is hard to read
Idea – Exponential Bins
• Suppose we want relative accuracy in measurement space
• Latencies are positive and only matter within a few percent
– 1.1 ms versus 1.0 ms
– 1100 ms versus 1000 ms
• We can cheat by using floating point representations
– Compute bin using magic
– Adjust bins slightly using more magic
– Count
FloatHistogram
• Assume all measurements fall in a known range
• Divide this range into power-of-2 sub-ranges
• Sub-divide each sub-range evenly into a fixed number of steps
• Relative error is bounded in measurement space
• Bin index can be computed using the FP representation (sketched below)!
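A minimal Python sketch of that trick using math.frexp, which reads the floating-point representation directly; the real FloatHistogram is Ted Dunning’s Java implementation, and the step count here is an illustrative assumption.

import math

def float_bin(x, min_value=1e-3, steps=4):
    # frexp gives x = m * 2**e with m in [0.5, 1): the exponent
    # selects the power-of-2 sub-range, the mantissa selects one of
    # `steps` even subdivisions inside it, so relative bin width
    # stays below roughly 1/steps everywhere.
    m, e = math.frexp(x / min_value)
    return e * steps + int((2 * m - 1) * steps)

assert float_bin(1.0) < float_bin(1.1) <= float_bin(2.0)  # monotone, coarse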
What about visualization?
[Plots from the original slides: “Can’t see small count bars”; “Good Results”; “Bad Results – 1% of measurements are 3x bigger”; “Uniform Bins”; “FloatHistogram Bins”; “With FloatHistogram” – with uniform bins the anomalous 1% is easy to miss, while FloatHistogram bins make it stand out]
Sign Up for Next Workshop in the MLL Series
by Ted Dunning, Chief Application Architect at MapR:
Machine Learning in the Enterprise:
How to do model management in production
http://bit.ly/mapr-machine-learning-logistics-series
Additional Resources
O’Reilly report by Ted Dunning & Ellen Friedman © March 2017
Read free courtesy of MapR:
https://mapr.com/geo-distribution-big-data-and-analytics/
O’Reilly book by Ted Dunning & Ellen Friedman, © March 2016
Read free courtesy of MapR:
https://mapr.com/streaming-architecture-using-apache-kafka-mapr-streams/
Additional Resources
O’Reilly book by Ted Dunning & Ellen Friedman, © June 2014
Read free courtesy of MapR:
https://mapr.com/practical-machine-learning-new-look-anomaly-detection/
O’Reilly book by Ellen Friedman & Ted Dunning, © February 2014
Read free courtesy of MapR:
https://mapr.com/practical-machine-learning/
Additional Resources
Blog post by Ellen Friedman, 8 Aug 2017, on the MapR blog:
https://mapr.com/blog/tensorflow-mxnet-caffe-h2o-which-ml-best/
Interview by Thor Olavsrud in CIO:
https://www.cio.com.au/article/630299/what-dataops-collaborative-cross-functional-analytics/?fp=16&fpid=1
Read more in the new book on model management:
New O’Reilly book by Ted Dunning & Ellen Friedman, © September 2017
Download free pdf courtesy of MapR:
https://mapr.com/ebook/machine-learning-logistics/
Please support women in tech – help build
girls’ dreams of what they can accomplish
© Ellen Friedman 2015. #womenintech #datawomen
Q&A
ENGAGE WITH US
@mapr
@ted_dunning
Maprtechnologies
tdunning@mapr.com
