Maintainable Machine Learning Products

ApacheCon Roadshow, Chicago 2019
Andrew Musselman
akm@apache.org

State of the Art in ML Development
So many tools
• Scikit-learn
• Spark MLlib
• Keras
• PyTorch
• DL4J
• Mahout
• MXNet
• SystemML
• PredictionIO
• #justRthings
• …
• Vendor solutions
• Kitchen sink
• Auto-magic

And that is just for the ML pieces; also need:
• Data ingest
• Data engineering
• Plotting, charting
• UX/Publishing
• Sidecar functions:
• Search
• Model management/data versioning
• Monitoring/performance metrics

All to do a glorified regression or similar

About Me and Why I'm Here
Corp
• Chief Data Scientist at Accenture
• Senior Director at Lucidworks
• Chief Analytics Officer at A2Go
ASF
• Mahout 0.9 release
• Committer
• PMC member
• Chair
Corp/OSS
• Bootstrapped open-source
contribution program at ACN
• Similar program to A2Go
Fun
• Adversarial Learning podcast
• Sailing, snowboarding, amateur
radio (KI7KQA)

About Me and Why I'm Here
In the course of doing work I have seen some bad things

Motivation
Moving data through the assembly line* to production requires
beating several bosses:
* There is no "assembly line" the first several times
Ingest → Clean and Transform → Train/Test/Tweak → Publish

Zooming Out
Before a project begins, there are multiple other bosses to beat:
Have an Idea → Design Solution → Convince Team → Prototype → Get Priority → Get Budget
Then

Why Projects Fail
Things can die at any stage, but most poignantly at the end,
when it's "finished"
• Results/findings/"insights" need a total re-write or port to the "production" lang/infra
• E.g., a nice tidy model to predict customer behavior needs to be re-written in Java to run in the "web service farm"
• Add six months!
• Priority battles! Unproven ML/AI pet project less urgent than:
• Ongoing maintenance
• Shifted business priorities

The Best Reason Projects Fail
No established
approach/workflow to
incorporate results into
existing infrastructure

ML/AI Has a Lot of Attention
In the face of these troubles, ML/AI is a stated priority of many, many, many orgs
• Leadership team: "we need an ML/AI story immediately; everyone is
doing it and we are behind the competition" 🤔
• Countless teams: "we need sentiment analysis of [our medical
records | social media about us | the stock market]" 😬
• "Can't machine learning fix this problem?" 🤔
• "Machine learning is a commodity now" 😂

ML/AI Has a Lot of Attention
Result: URGENCY + LARGE AND WRONG SCOPE

Combatting Urgency and Bad Scope
People minimize risk by:
• Hiring consultants
• Building it all from scratch
• Buying a vendor solution (and paying their professional
services team to build all the hard parts)
• Researching/benchmarking/assembling some OSS
libraries/frameworks

Sometimes People Do Dumb Things
"Let's migrate off this vendor and use an open
source solution"
Vendor → Apache

"Dump the entire Teradata warehouse"

"Into HDFS"

"But let's not keep any of the metadata about the
tables"

Lies We Tell Ourselves
"Let's clean up all this legacy not invented here
(NIH) code and move to that vendor solution"
NIH → Vendor

"The vendor says migration should take less
than a month"

"Our IT team says they can integrate the vendor
solution next fiscal year"

"Our summer intern says they think they can
write all the connectors we need by September"

How to Choose Tools
People think about tech/infra decisions on a 1-D spectrum:
Vendor ↔ NIH
NIH ↔ OSS
But it's a multi-dimensional problem

How to Choose Tools
Trade-offs
• Vendor-heavy: $$, less control
• NIH-heavy: tribal knowledge +s and –s
• OSS-heavy: config and extend, hiring pool +s

How Not to Choose Tools
[Diagram: "A real workflow". Boxes: IPython NB, Python REPL, Jupyter; AWS EC2, S3, DB, cron; External Data APIs; bash and curl; "?". Arrows labeled "Data in/out" connect them.]

Ideal Workflow
Each phase is decoupled, so it is easier to maintain and easier to hire for by skill set

Ideal Workflow
• Encourage small, low-risk
prototypes
• Promote the successes to real
projects/features/apps
• Avoiding:
• Re-write
• IT Debate Club
• Budget Debate Club

The Six Moving Pieces of a Platform
Input: Input Data
API layers: Load API Layer, Data API Layer, Results API Layer, Publish API Layer, Serve API Layer
Pieces:
(1) Data Engineering
(2) Analytic Jobs, Splits, Runs
(3) Output, Results, Performance
(4) Look and Feel, Scoring, Monitoring, Display of (1) - (3)
(5) All APIs
(6) Packaging (1) - (5) for deployment
Skill Sets:
(1) Spark, SQL, Python, Linux, Databases, Key-Value Stores
(2) Python, Spark, SQL
(3) Python, SQL
(4) React, HTML, JavaScript, CSS
(5) React, Python, Linux, Redis
(6) Docker, Chef, Jenkins/Travis
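A minimal sketch of the decoupling idea, not the platform from the talk: each phase sits behind a small function interface, so a phase can be replaced or staffed independently. All names here (`Stage`, `load`, `analyze`, `report`, `run_pipeline`) are hypothetical, and the "analytic job" is a plain least-squares regression chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, float]


@dataclass
class Stage:
    """One decoupled phase, reachable only through its `run` callable."""
    name: str
    run: Callable[[object], object]


def load(raw_rows: List[str]) -> List[Record]:
    """(1) Data engineering: parse raw 'x,y' lines into records."""
    records = []
    for row in raw_rows:
        x, y = row.split(",")
        records.append({"x": float(x), "y": float(y)})
    return records


def analyze(records: List[Record]) -> Dict[str, float]:
    """(2) Analytic job: fit y = slope * x + intercept by least squares."""
    n = len(records)
    mean_x = sum(r["x"] for r in records) / n
    mean_y = sum(r["y"] for r in records) / n
    sxx = sum((r["x"] - mean_x) ** 2 for r in records)
    sxy = sum((r["x"] - mean_x) * (r["y"] - mean_y) for r in records)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def report(model: Dict[str, float]) -> str:
    """(3)/(4) Results and display: turn the fitted model into readable text."""
    return f"y = {model['slope']:.2f} * x + {model['intercept']:.2f}"


def run_pipeline(stages: List[Stage], data: object) -> object:
    """Pass the output of each stage to the next one, in order."""
    for stage in stages:
        data = stage.run(data)
    return data


# Hypothetical wiring; swapping one Stage does not touch the others.
PIPELINE = [Stage("load", load), Stage("analyze", analyze), Stage("report", report)]

if __name__ == "__main__":
    print(run_pipeline(PIPELINE, ["0,1", "1,3", "2,5"]))
```

In a real platform each `Stage` would more likely be a service or job behind an HTTP or queue API; plain functions keep the sketch self-contained.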

[Platform diagram repeated, labeled with the role: UI/UX]

[Platform diagram repeated, labeled with the role: DevOps]

[Platform diagram repeated, labeled with the role: Data Sci/ML]

Encourage/Enforce Good Behavior
• Central notebook repository (e.g., Apache Zeppelin)
• Quick dashboard prototyping (e.g., Apache Superset,
Zeppelin)
• Use a model server (e.g., Apache PredictionIO)
• APIs for all stages
• Code reviews
• Unit and integration tests
• "Definition of done"
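As one possible illustration of the "unit and integration tests" item, here is a small sketch using Python's standard `unittest` module. The `predict` function and the dict-based model format are assumptions made up for this example, not part of any tool named above.

```python
import unittest


def predict(model: dict, x: float) -> float:
    """Score one input with a linear model stored as a plain dict (hypothetical format)."""
    return model["slope"] * x + model["intercept"]


class PredictTest(unittest.TestCase):
    def setUp(self):
        # A tiny fixed model so expected values can be checked by hand.
        self.model = {"slope": 2.0, "intercept": 1.0}

    def test_known_point(self):
        self.assertAlmostEqual(predict(self.model, 3.0), 7.0)

    def test_intercept_at_zero(self):
        self.assertAlmostEqual(predict(self.model, 0.0), 1.0)

    def test_missing_parameter_fails_loudly(self):
        # A malformed model should raise instead of returning a silent default.
        with self.assertRaises(KeyError):
            predict({"slope": 2.0}, 1.0)


def run_tests() -> unittest.TestResult:
    """Run the suite programmatically, e.g. from a CI step."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PredictTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

A team's "definition of done" could then include something like "the suite passes in CI before a prototype is promoted", though the exact gate is a team decision.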

Future State
Productivity at scale!

Getting Involved in Open Source
• Fix documentation problems as you're using it
• Fix bugs
• Add features
• Make it an internal team effort
• Grow skills
• Adapt the software to real-life demands
• Give back
