© 2017 MapR Technologies 1
Machine Learning Success:
The Key to Easier Model Management
© 2017 MapR Technologies 2
Contact Information
Ellen Friedman, PhD
Principal Technologist, MapR Technologies
Committer Apache Drill & Apache Mahout projects
O’Reilly author
Email efriedman@mapr.com ellenf@apache.org
Twitter @Ellen_Friedman
© 2017 MapR Technologies 3
Machine Learning Everywhere
Image courtesy Mtell, used with permission. Images © Ellen Friedman.
© 2017 MapR Technologies 4
Traditional View
© 2017 MapR Technologies 5
Traditional View: This isn’t the whole story
© 2017 MapR Technologies 6
90% of the effort in successful
machine learning isn’t the
algorithm or the model…
It’s the logistics
© 2017 MapR Technologies 7
Why?
• Just getting the training data is hard
– Which data? How to make it accessible? Multiple sources!
– New kinds of observations force restarts
– Requires a ton of domain knowledge
• The myth of the unitary model
– You can’t train just one
– You will have dozens of models, likely hundreds or more
– Handoff to new versions is tricky

© 2017 MapR Technologies 8
What Machine Learning Tool is Best?
• Most successful groups keep several “favorite” machine
learning tools at hand
– No single tool is best in every situation
• The most important tool is a platform that supports logistics well
– Don’t have to do everything at the application level
– Lots of what matters can be handled at the platform level
• A good design can make a big difference
© 2017 MapR Technologies 9
Rendezvous Architecture
[Figure: rendezvous architecture. Labels: request, Input, Model 1, Model 2, Model 3, Scores, Rendezvous, Results, response.]
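A minimal sketch of the rendezvous idea, under stated assumptions: every model scores every request, all scores are kept, and a rendezvous step returns the score of the most preferred model that answered. The names `Rendezvous`, `handle`, `preference` and `scores_log` are hypothetical, and the models are called in-process here, whereas the figure has them communicating through streams.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical type: a model is any callable that maps features to a score.
Model = Callable[[dict], float]


@dataclass
class Rendezvous:
    models: Dict[str, Model]
    preference: List[str]  # most preferred model first; unlisted models are ignored
    scores_log: List[dict] = field(default_factory=list)

    def handle(self, request_id: str, features: dict) -> Optional[dict]:
        # Every model sees every request.
        scores: Dict[str, float] = {}
        for name, model in self.models.items():
            try:
                scores[name] = model(features)
            except Exception:
                # A failing model simply contributes no score.
                continue
        # Keep all scores so models can be compared later.
        self.scores_log.append({"request_id": request_id, "scores": scores})
        # Answer with the most preferred model that produced a score.
        for name in self.preference:
            if name in scores:
                return {"request_id": request_id, "model": name, "score": scores[name]}
        return None


def model_1(features: dict) -> float:
    return 0.2


def model_2(features: dict) -> float:
    return 0.7


def model_3(features: dict) -> float:
    raise RuntimeError("model unavailable")


rendezvous = Rendezvous(
    models={"model_1": model_1, "model_2": model_2, "model_3": model_3},
    preference=["model_3", "model_1"],
)
# model_3 fails, so the response falls back to model_1.
first = rendezvous.handle("req-1", {"amount": 42.0})

# Switching models is only a change in which scores are no longer ignored.
rendezvous.preference = ["model_2", "model_1"]
second = rendezvous.handle("req-2", {"amount": 42.0})
```

Because model_2 was already scoring every request, promoting it needs no redeployment in this sketch: only the preference list changes.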
© 2017 MapR Technologies 10
Rendezvous to the Rescue: Better ML Logistics
• Stream-1st architecture is a powerful approach with surprisingly
widespread advantages
– Innovative technologies are emerging for streaming data
• Microservices approach provides flexibility
– Streaming supports microservices (if done right)
• Containers remove surprises
– Predictable environment for running models
© 2017 MapR Technologies 11
Rendezvous: Mainly for Decisioning Type Systems
• Decisioning style machine learning
– Looking for a “right answer”
– Simpler than interactive machine learning (such as in a self-driving car)
• Examples include:
– Fraud detection
– Predictive analytics / market prediction
– Churn prediction (as in telecommunications)
– Yield optimization
– Deep learning in the form of speech or image recognition, in some cases
© 2017 MapR Technologies 12
Why Stream?
Munich surfing wave Image © 2017 Ellen Friedman
© 2017 MapR Technologies 13
Streaming data has value beyond
real-time insights
© 2017 MapR Technologies 14
Heart of Stream-1st Architecture: Message Transport
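[Figure: message transport between medical tests / medical test results and three classes of use cases labeled A, B and C. Other labels: real-time analytics, EMR, patient, facilities management, insurance audit.]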
The right messaging tool
supports multiple classes of use
cases (A, B, C in figure)
Image © 2016 Ted Dunning & Ellen Friedman from Chap 1 of O’Reilly book Streaming Architecture used with permission
© 2017 MapR Technologies 15
Stream Transport that Decouples Producers & Consumers
[Figure: three producers (P) and three consumers (C), decoupled by a stream. Labels: Transport (Kafka / MapR Streams), Processing.]
© 2017 MapR Technologies 16
MapR Streams in the MapR Converged Data Platform
[Figure: Enterprise Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams). Platform capabilities: Global Namespace, High Availability, Data Protection, Self-healing, Unified Security, Real-time, Multi-tenancy.]
• Helps build a global data fabric
• Multiple types of storage engineered into one technology
• Under the same security & administration
© 2017 MapR Technologies 17
With MapR, Geo-Distributed Data Appears Local
[Figure, built up over three slides. Labels: Data source, stream, stream, Consumer, Global Data Center, Regional Data Center.]
© 2017 MapR Technologies 20
Stream transport supports
microservices
© 2017 MapR Technologies 21
Stream-1st Architecture: Basis for MicroServices
Stream instead of database as the shared “truth”
[Figure. Labels: POS 1..n, Fraud detector, Last card use, Updater, Card analytics, Other card activity.]
Image © 2016 Ted Dunning & Ellen Friedman from Chap 6 of O’Reilly book Streaming Architecture used with permission
© 2017 MapR Technologies 22
Features of Good Streaming
• It is Persistent
– Messages stick around for other consumers
– Consumers don’t affect producers
– Consumer doesn’t have to be online when message arrives
• It is Performant
– You don’t have to worry if a stream can keep up
• It is Pervasive
– It is there whenever you need it, no need to deploy anything
– How much work is it to create a new file? Why harder for a stream?
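The persistence points above can be illustrated with a toy in-memory log. This is only a sketch under stated assumptions: `ToyStream`, `publish` and `poll` are hypothetical names, not the Kafka or MapR Streams API, and nothing here is durable or distributed. It shows messages staying in the log after they are read, and each consumer keeping its own read position.

```python
from collections import defaultdict
from typing import Any, Dict, List


class ToyStream:
    """In-memory stand-in for a persistent stream (hypothetical, for illustration)."""

    def __init__(self) -> None:
        # Messages are appended and never removed when read.
        self._log: List[Any] = []
        # One read position per consumer name.
        self._offsets: Dict[str, int] = defaultdict(int)

    def publish(self, message: Any) -> None:
        # Producers only append; they never wait on or know about consumers.
        self._log.append(message)

    def poll(self, consumer: str, max_messages: int = 10) -> List[Any]:
        # Each consumer advances only its own offset.
        start = self._offsets[consumer]
        batch = self._log[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch


stream = ToyStream()
for event in ("swipe-1", "swipe-2", "swipe-3"):
    stream.publish(event)

# The fraud detector reads first.
fraud_batch = stream.poll("fraud-detector")
# The analytics consumer was not online when the messages arrived,
# yet it still sees all of them.
analytics_batch = stream.poll("card-analytics")
# Nothing new has been published, so a second poll is empty.
fraud_again = stream.poll("fraud-detector")
```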
© 2017 MapR Technologies 23
Raw data is gold!
© 2017 MapR Technologies 24
Raw Data & Training Data Are Key to Success
[Figure. Labels: The world, request, Raw, Add external data, Database, Input, Model 1, Model 2, Model 3.]
Raw data may contain features you’ll want in future
© 2017 MapR Technologies 25
Quality & Reproducibility of Input Data Are Important!
• Recording raw-ish data is really a big deal
– Data as seen by a model is worth gold
– Data reconstructed later often has time-machine leaks
– Databases were made for updates; streams are safer
• Raw data is useful for non-ML cases as well (think flexibility)
• Decoy model records training data as seen by models under
development & evaluation
© 2017 MapR Technologies 26
Decoy Model in the Rendezvous Architecture
[Figure. Labels: Input, Decoy, Model 2, Model 3, Scores, Archive.]
• Looks like a server, but it just archives inputs
• Safe in a good streaming environment, less safe without good isolation
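A decoy can be sketched as something that exposes the same call as a model but only records what it was given. The class name `DecoyModel`, the `score` method and the JSON-lines archive format are assumptions made for illustration; the slides do not specify them.

```python
import io
import json
from typing import IO


class DecoyModel:
    """Looks like a model to its caller, but only archives the inputs it sees."""

    def __init__(self, archive: IO[str]) -> None:
        # Any writable text stream; a real archive would be durable storage.
        self._archive = archive

    def score(self, request_id: str, features: dict) -> None:
        # Record the input exactly as a model would have seen it.
        record = {"request_id": request_id, "features": features}
        self._archive.write(json.dumps(record, sort_keys=True) + "\n")
        # A decoy never produces a score.
        return None


archive = io.StringIO()
decoy = DecoyModel(archive)
decoy_result = decoy.score("req-1", {"amount": 42.0, "merchant": "m-17"})

# The archive can be replayed later as training data.
archived = [json.loads(line) for line in archive.getvalue().splitlines()]
```

Archiving at the point where models read their input avoids reconstructing that input later from a database that may have been updated in the meantime.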
© 2017 MapR Technologies 27
[Figure, built up over three slides. Labels: Raw, Input, Features / profiles, Decoy, Archive, m1, m2, m3, Scores, Rendezvous, Results, Metrics.]
© 2017 MapR Technologies 30
Models in production live in the
real world:
Conditions may (will) change
© 2017 MapR Technologies 31
How to Do Better – Deployment in Production
• Keep models running “in the wings”
– Don’t wait until conditions change to start building the next model
– Keep new models ready
• Hot hand-off
– With rendezvous: just stop ignoring the model of interest
• Deploy a canary server
– Keep an old model active as a reference
– If it was 90% correct, difference with any better model should be small
– Score distribution should be roughly constant
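The two canary checks above can be sketched as simple comparisons. This is an illustration under stated assumptions: the function names, the use of a mean absolute difference, and the 0.1 threshold are all hypothetical choices, not something the slides prescribe.

```python
from statistics import mean
from typing import Sequence


def mean_abs_difference(canary: Sequence[float], candidate: Sequence[float]) -> float:
    """Average gap between canary and candidate scores on the same requests."""
    if not canary or len(canary) != len(candidate):
        raise ValueError("need two non-empty score lists of equal length")
    return mean(abs(a - b) for a, b in zip(canary, candidate))


def mean_shift(reference_window: Sequence[float], current_window: Sequence[float]) -> float:
    """How far the canary's average score moved between two time windows."""
    if not reference_window or not current_window:
        raise ValueError("both windows must be non-empty")
    return abs(mean(current_window) - mean(reference_window))


def needs_review(canary: Sequence[float], candidate: Sequence[float],
                 threshold: float = 0.1) -> bool:
    # The threshold is an arbitrary illustrative value.
    return mean_abs_difference(canary, candidate) > threshold
```

A large gap between the canary and a newer model suggests a problem with the newer model; a shift in the canary's own scores over time suggests the input data has changed.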
© 2017 MapR Technologies 32
Advantages of Rendezvous Architecture
[Figure. Labels: Input, Real model, Canary, Decoy, Archive, ∆, Result.]
© 2017 MapR Technologies 33
DataOps: Brings Flexibility & Focus
• You don’t have to be a data scientist to contribute to machine learning
• Software engineers and developers play a role, but you need good data skills
© 2017 MapR Technologies 34
Example: Tensor Chicken
[Figure: workflow. Labels: Gather training data, Label training data, Labeled image files, Train model, Deploy model, Run the model, Update model.]
Deep learning project by
software engineer Ian Downard
(see blog + @tensorchicken)
© 2017 MapR Technologies 35
Rendezvous Architecture
[Figure: rendezvous architecture. Labels: request, Input, Model 1, Model 2, Model 3, Scores, Rendezvous, Results, response.]
© 2017 MapR Technologies 36
How to Do Better
• Data + the right question + domain knowledge matter!
• Prioritize – put serious effort into infrastructure
– DataOps requires more than just data science
• Persist – use streams to keep data around
• Measure – everything, and record it
• Meta-analyze – understand and see what is happening
• Containerize – make deployment repeatable, easy
• Oh… don’t forget to do some machine learning, too
© 2017 MapR Technologies 37
Sign Up for ML Logistics Workshop Series
Three deep-dive machine learning workshops
by Ted Dunning, Chief Applications Architect at MapR:
1. A New Architecture for Machine Learning Logistics: How to use
streaming, containers & a microservices design
2. Machine Learning Evaluation: How to do model-to-model comparisons
3. Machine Learning in the Enterprise: How to do model management in
production
http://bit.ly/mapr-machine-learning-logistics-series
© 2017 MapR Technologies 38
Additional Resources
O’Reilly report by Ted Dunning & Ellen Friedman © March 2017
Read free courtesy of MapR:
https://mapr.com/geo-distribution-big-data-and-analytics/
O’Reilly book by Ted Dunning & Ellen Friedman
© March 2016
Read free courtesy of MapR:
https://mapr.com/streaming-architecture-using-apache-kafka-mapr-streams/
© 2017 MapR Technologies 39
Additional Resources
O’Reilly book by Ted Dunning & Ellen Friedman
© June 2014
Read free courtesy of MapR:
https://mapr.com/practical-machine-learning-new-look-anomaly-detection/
O’Reilly book by Ellen Friedman & Ted Dunning
© February 2014
Read free courtesy of MapR:
https://mapr.com/practical-machine-learning/
© 2017 MapR Technologies 40
Additional Resources
by Ellen Friedman 8 Aug 2017 on MapR blog:
https://mapr.com/blog/tensorflow-mxnet-caffe-h2o-which-ml-best/
by Ted Dunning 13 Sept 2017 in InfoWorld:
https://www.infoworld.com/article/3223688/machine-learning/machine-learning-skills-for-software-engineers.html
© 2017 MapR Technologies 41
New book:
O’Reilly book by Ellen Friedman & Ted Dunning © Sept 2017
Pre-register for a free PDF copy of the book when it becomes
available 25th September, courtesy of MapR:
http://info.mapr.com/2017_Content_Machine-Learning-Logistics_eBook_Prereg_RegistrationPage.html
© 2017 MapR Technologies 42
Please support women in tech – help build
girls’ dreams of what they can accomplish
© Ellen Friedman 2015 #womenintech #datawomen
© 2017 MapR Technologies 43
Thank you!
© 2017 MapR Technologies 44
Q&A
ENGAGE WITH US
@mapr
Maprtechnologies
efriedman@mapr.com
@Ellen_Friedman
