Maintainable Machine Learning Products
ApacheCon Roadshow, Chicago 2019
Andrew Musselman
akm@apache.org
State of the Art in ML Development
So many tools
• Scikit-learn
• Spark MLlib
• Keras
• PyTorch
• DL4J
• Mahout
• MXNet
• SystemML
• PredictionIO
• #justRthings
• …
• Vendor solutions
• Kitchen sink
• Auto-magic
State of the Art in ML Development
And that is just for the ML pieces; also need:
• Data ingest
• Data engineering
• Plotting, charting
• UX/Publishing
• Sidecar functions:
• Search
• Model management/data versioning
• Monitoring/performance metrics
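The "model management/data versioning" sidecar above can start very small. The following is a minimal sketch, assuming a dataset fits in memory as a list of row dictionaries. The names `dataset_fingerprint` and `register_model` are hypothetical and do not come from any tool named in this deck.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Return a short, order-independent content hash for a list of row dicts."""
    # Serialize each row with sorted keys so key order does not change the hash,
    # then sort the serialized rows so row order does not change it either.
    encoded = sorted(json.dumps(row, sort_keys=True) for row in rows)
    digest = hashlib.sha256("\n".join(encoded).encode("utf-8")).hexdigest()
    return digest[:12]


def register_model(name, rows, metrics):
    """Bundle a model name with the data version and metrics it was built from."""
    # Hypothetical record layout: just enough to answer "which data made this?"
    return {
        "model": name,
        "data_version": dataset_fingerprint(rows),
        "metrics": metrics,
    }
```

Storing the fingerprint next to each trained model is one way to tell later whether two results came from the same data. A real setup would likely hash files or partitions rather than in-memory rows.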
State of the Art in ML Development
All to do a glorified regression or similar
About Me and Why I'm Here
Corp
• Chief Data Scientist at Accenture
• Senior Director at Lucidworks
• Chief Analytics Officer at A2Go
ASF
• Mahout 0.9 release
• Committer
• PMC member
• Chair
Corp/OSS
• Bootstrapped open-source contribution program at ACN
• Similar program to A2Go
Fun
• Adversarial Learning podcast
• Sailing, snowboarding, amateur radio (KI7KQA)
About Me and Why I'm Here
In the course of doing work I have seen some bad things
Motivation
Moving data through the assembly line* to production requires beating several bosses:
Ingest -> Clean and Transform -> Train/Test/Tweak -> Publish
* There is no "assembly line" the first several times
Zooming Out
Before a project begins, there are multiple other bosses to beat:
Have an Idea -> Design Solution -> Convince Team -> Prototype -> Get Priority -> Get Budget
Then
Why Projects Fail
Things can die at any stage, but most poignantly at the end, when it's "finished"
• Results/findings/"insights" need a total re-write or port to the "production" lang/infra
• E.g., a nice tidy model to predict customer behavior needs to be re-written in Java to run in the "web service farm"
• Add six months!
• Priority battles! Unproven ML/AI pet project less urgent than:
• Ongoing maintenance
• Shifted business priorities
The Best Reason Projects Fail
No established approach/workflow to incorporate results into existing infrastructure
ML/AI Has a Lot of Attention
In the face of these troubles, ML/AI is a stated priority of many, many, many orgs
• Leadership team: "we need an ML/AI story immediately; everyone is doing it and we are behind the competition" 🤔
• Countless teams: "we need sentiment analysis of [our medical records | social media about us | the stock market]" 😬
• "Can't machine learning fix this problem?" 🤔
• "Machine learning is a commodity now" 😂
ML/AI Has a Lot of Attention
Result: URGENCY + LARGE AND WRONG SCOPE
Combatting Urgency and Bad Scope
People minimize risk by:
• Hiring consultants
• Building it all from scratch
• Buying a vendor solution (and paying their professional services team to build all the hard parts)
• Researching/benchmarking/assembling some OSS libraries/frameworks
Sometimes People Do Dumb Things
"Let's migrate off this vendor and use an open source solution"
Vendor -> Apache
Sometimes People Do Dumb Things
"Dump the entire Teradata warehouse"
Sometimes People Do Dumb Things
"Into HDFS"
Sometimes People Do Dumb Things
"But let's not keep any of the metadata about the tables"
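To make the missing-metadata problem concrete, here is a minimal sketch of dumping a table together with a schema sidecar, so the data stays interpretable after it leaves the warehouse. It uses only the Python standard library. The function names, the metadata layout, and the example column types are assumptions for illustration, not a Teradata or HDFS API.

```python
import csv
import io
import json


def dump_table(name, columns, rows):
    """Return (csv_text, metadata_json) for one table.

    `columns` is a list of (column_name, type_name) pairs. The data and the
    schema are emitted together so neither travels without the other.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    metadata = {
        "table": name,
        "columns": [{"name": c, "type": t} for c, t in columns],
        "row_count": len(rows),
    }
    return buf.getvalue(), json.dumps(metadata, sort_keys=True)


def load_table(csv_text, metadata_json):
    """Rebuild rows as dicts using the column names stored in the sidecar."""
    meta = json.loads(metadata_json)
    names = [c["name"] for c in meta["columns"]]
    reader = csv.reader(io.StringIO(csv_text))
    return [dict(zip(names, row)) for row in reader]
```

Without the sidecar, the headerless CSV is just positional values. With it, column names, declared types, and an expected row count survive the move.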
Lies We Tell Ourselves
"Let's clean up all this legacy not-invented-here (NIH) code and move to that vendor solution"
NIH -> Vendor
Lies We Tell Ourselves
"The vendor says migration should take less than a month"
Lies We Tell Ourselves
"Our IT team says they can integrate the vendor solution next fiscal year"
Lies We Tell Ourselves
"Our summer intern says they think they can write all the connectors we need by September"
How to Choose Tools
People think about tech/infra decisions on a 1-D spectrum:
Vendor <-> NIH
NIH <-> OSS
But it's a multi-dimensional problem
How to Choose Tools
Trade-offs
• Vendor-heavy: $$, less control
• NIH-heavy: tribal knowledge +s and –s
• OSS-heavy: config and extend, hiring pool +s
How Not to Choose Tools
[Diagram: a real workflow. IPython NB, Python REPL, and Jupyter; AWS EC2, S3, DB, and cron; external data APIs; bash and curl; and a box labeled "?", linked by "Data in/out"]
Ideal Workflow
Each phase is decoupled, which makes it easier to maintain and to hire for specific skills
Ideal Workflow
• Encourage small, low-risk prototypes
• Promote the successes to real projects/features/apps
• Avoiding:
• Re-write
• IT Debate Club
• Budget Debate Club
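One way to picture a prototype that can be promoted without a re-write is a pipeline whose phases are plain functions with narrow inputs and outputs. This is a minimal sketch under assumptions: the phase names follow the Ingest, Clean and Transform, Train/Test/Tweak, Publish stages from the Motivation slide, while the cleaning rule, the input format, and `run_pipeline` are hypothetical. The model is deliberately a simple least-squares line.

```python
def ingest(raw_lines):
    """Parse 'x,y' text lines into float pairs."""
    return [tuple(float(v) for v in line.split(",")) for line in raw_lines]


def clean(pairs):
    """Drop rows with a negative x (a stand-in for a real cleaning rule)."""
    return [(x, y) for x, y in pairs if x >= 0]


def train(pairs):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def publish(model):
    """Return a scoring function; callers never see training internals."""
    return lambda x: model["slope"] * x + model["intercept"]


def run_pipeline(raw_lines, phases=(ingest, clean, train, publish)):
    """Feed the output of each phase into the next one."""
    result = raw_lines
    for phase in phases:
        result = phase(result)
    return result
```

Because each phase only depends on the shape of its input, a team could swap `clean` for a Spark job or `publish` for a model server without touching the other phases.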
The Six Moving Pieces of a Platform
Input Data
API layers: Load API Layer, Data API Layer, Results API Layer, Publish API Layer, Serve API Layer
(1) Data Engineering
(2) Analytic Jobs, Splits, Runs
(3) Output, Results, Performance
(4) Look and Feel, Scoring, Monitoring, Display of (1) - (3)
(5) All APIs
(6) Packaging (1) - (5) for deployment
Skill Sets:
(1) Spark, SQL, Python, Linux, Databases, Key-Value Stores
(2) Python, Spark, SQL
(3) Python, SQL
(4) React, HTML, JavaScript, CSS
(5) React, Python, Linux, Redis
(6) Docker, Chef, Jenkins/Travis
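As a rough illustration of what sits behind a "Serve API Layer" with scoring and monitoring, the sketch below wraps any scoring callable and keeps a few counters. The class name `ServeLayer`, its methods, and the metric names are hypothetical. A production setup would more likely use a model server and a metrics system than an in-process class.

```python
import time


class ServeLayer:
    """Wraps any scoring callable and records simple monitoring counters."""

    def __init__(self, score_fn):
        self._score_fn = score_fn
        self.request_count = 0
        self.error_count = 0
        self.total_seconds = 0.0

    def score(self, features):
        """Score one request, tracking count, errors, and elapsed time."""
        self.request_count += 1
        start = time.perf_counter()
        try:
            return self._score_fn(features)
        except Exception:
            # Count the failure, then let the caller see the original error.
            self.error_count += 1
            raise
        finally:
            self.total_seconds += time.perf_counter() - start

    def metrics(self):
        """Return a snapshot suitable for a dashboard or health check."""
        if self.request_count:
            mean = self.total_seconds / self.request_count
        else:
            mean = 0.0
        return {
            "requests": self.request_count,
            "errors": self.error_count,
            "mean_seconds": mean,
        }
```

The point of the wrapper is the boundary: the UI only needs `score` and `metrics`, so the model behind it can change without the front end changing.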
The Six Moving Pieces of a Platform
[The same diagram repeats on three more slides, labeled in turn: UI/UX, DevOps, Data Sci/ML]
Encourage/Enforce Good Behavior
• Central notebook repository (e.g., Apache Zeppelin)
• Quick dashboard prototyping (e.g., Apache Superset, Zeppelin)
• Use a model server (e.g., Apache PredictionIO)
• APIs for all stages
• Code reviews
• Unit and integration tests
• "Definition of done"
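For the "unit and integration tests" item, here is a minimal sketch of what a unit test for one small pipeline step could look like, using Python's standard `unittest` module. The function `split_train_test` and its deterministic tail split are assumptions made for illustration, not a method from the talk.

```python
import unittest


def split_train_test(rows, test_fraction=0.25):
    """Deterministically split rows: the tail of the list becomes the test set."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    cut = len(rows) - int(len(rows) * test_fraction)
    return rows[:cut], rows[cut:]


class SplitTrainTestTest(unittest.TestCase):
    def test_sizes_add_up(self):
        train, test = split_train_test(list(range(8)))
        self.assertEqual((len(train), len(test)), (6, 2))

    def test_no_row_lost_or_duplicated(self):
        rows = list(range(10))
        train, test = split_train_test(rows, test_fraction=0.3)
        self.assertEqual(train + test, rows)

    def test_rejects_bad_fraction(self):
        with self.assertRaises(ValueError):
            split_train_test([1, 2, 3], test_fraction=1.5)
```

Saved to a file, the tests can be run with `python -m unittest <file>`. A team's "definition of done" could then include "tests like these pass in CI" as one checkable item.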
Encourage/Enforce Good Behavior
Future State
Productivity at scale!
Getting Involved in Open Source
• Fix documentation problems as you're using it
• Fix bugs
• Add features
• Make it an internal team effort
• Grow skills
• Adapt the software to real-life demands
• Give back
Thank You
Q&A
