5/15/2017
Continuous Delivery Principles
for Machine Learning
Rajesh Muppalla
rajesh@indix.com
@codingnirvana
About Me
• Co-Founder, Indix
• From Chennai
◊ 200 miles to the east (and north) of Bangalore
◊ S. Ramanujan (20th Century Mathematician), Sundar Pichai
◊ Three Seasons - Hot, Hotter, Hottest
• Previously
◊ ScaleByTheBay 2016 - Data Pipelines Panelist
◊ Microservices, Lambda Architecture
• Ex-Thoughtworks
◊ Tech Lead - Go-CD - an open source CI/CD Tool
About Indix
Six Business Critical Indexes
• People, Places, Documents, Businesses, Products, Connected Devices
Indix – the “Google Maps” of Products
• Google Maps enables businesses to build location-aware software: ~3.6 million websites use Google Maps
• Indix enables businesses to build product-aware software: Indix catalogs over 2.1 billion product offers
Data Pipeline @ Indix
• Crawling Pipeline: Brand & Retailer Websites → Crawl Seed → Crawl → Parse → Crawl Data
• ML Data Pipeline: Extract Attributes → Classify → Dedupe → Standardize → Match → Aggregate
• Feeds Pipeline: Brand & Retailer Feeds → Feed Data → Clean → Transform → Connect
• Indexing Pipeline: Analyze → Derive → Join → Real-Time Index
• Indix Product Catalog outputs: Customizable Feeds, Search & Analytics Index, API (Bulk & Synchronous), Product Data Transformation Service
E-Tailers & Marketplaces
Original Catalog
|           | Title         | Brand  | Color | Size |
| Product 1 | Running Shoes | Adidas | Blk   | 9    |
| Product 2 | Yoga Pants    |        | Black | 32   |
| Product 3 | Jacket        | TNF    | White |      |
Enriched Catalog
|           | Title         | Brand          | Color | Size | Material  |
| Product 1 | Running Shoes | Adidas         | Black | 9 US | Leather   |
| Product 2 | Yoga Pants    | Lululemon      | Black | 32"  | Polyester |
| Product 3 | Jacket        | The North Face | White |      | Leather   |
Ad Display & Exchange Platforms
• Advertisers - Standardize, Enrich and Augment Product Information for
better relevance
• Retailers - Enrich, Match and Normalize their catalog for better targeting of
native Ads
• Publishers - Classify and tag publisher site content
Data Scale @ Indix
• 2.1 Billion Product URLs
• 8 TB HTML Data Crawled Daily
• 1 Billion Unique Products
• 7,000 Categories
• 120 Billion Price Points
• 3,000 Sites
ML @ Indix
• Auto Parsers to detect and extract product content from web pages, using machine vision algorithms
• Predictive Scheduler for deciding re-crawl frequency using signals such as seasonality, product type, and store
• Multi-label classifier categorizing products into a hierarchical taxonomy using text information
• Inferring Product vs Listing vs Other pages using either URL patterns alone or page content
• Adaptive Crawlers that modify the crawl rate based on dynamic characteristics such as site traffic, number of products, and robots.txt settings
• Deep learning for categorizing products using product images
• Predicting which products are an exact match or similar products
• NER-based attribute extraction that mines text such as titles, descriptions, and specifications to build structured key:value attributes
• Fusion/Enrichment - an algorithm that learns from the data to build a golden product record from disparate sources
• Product Rank - an algorithm that combines signals such as product popularity, price, data quality, store popularity, and brand popularity into a dynamic relevance/rank score
• Recommendation engines that suggest tags where product information can be found on a web page
• Deep learning for extracting visual product attributes from product images
• NLG algorithms to generate product descriptions
• Product GPS - a universal product identifier built with machine learning algorithms, enabling search & discovery
ML @ Indix - Classification
ML @ Indix - Attribute Extraction
Machine Learning Workflow
• Define Business Objective → Pull and Acquire Data → Explore & Transform → Develop Model → Evaluate Model
• Meets Business Needs?
◊ Not Yet! → iterate, with a human in the loop
◊ Yes! → Build Production System → Deploy → Measure Metrics
Machine Learning Sandwich?*
* - https://techcrunch.com/2017/08/08/the-evolution-of-machine-learning/
• Pull and Acquire Data, Explore & Transform
• Develop Model, Model Evaluation & Validation (the middle)
• Build Production System, Deploy
The MEAT is not in the middle
Experts agree with us
D. Sculley, et al. Hidden technical debt in machine learning systems. In Neural Information Processing Systems (NIPS). 2015
Only a small fraction of real-world ML systems is composed of ML code, as shown by the
small black box in the middle. The required surrounding infrastructure is vast and complex.
Different Skillsets
• Data Pipelines: Pull and Acquire Data, Explore & Transform
• Model: Develop Model, Model Evaluation & Validation
• App: Build Production System, Deploy
Separate Talk
• Data Pipelines: Pull and Acquire Data, Explore & Transform
My Talk
• Focus of this talk: Develop Model, Model Evaluation & Validation, Build Production System, Deploy
• Not covered: Pull and Acquire Data, Explore & Transform
Pain Points
● A key employee on the team had to abruptly go on leave
○ Unable to reproduce the performance of an existing production model
■ Training data missing or unknown
■ Pre-processing scripts missing
■ Hyperparameters unknown
● It takes 3 months to productionize a model
■ Lots of glue code
■ Custom code developed every time
■ Frequent model updates take a long time
● Heterogeneous systems
■ E.g. sharing models and artifacts between Python and the JVM
Reality
● Confidence on the test set != confidence in production
■ Confidence in model performance on a sample set is not good enough
Déjà vu?
What is Continuous Delivery?
Continuous Delivery is a software engineering approach that aims at building, testing,
and releasing software faster and more frequently.
A straightforward and repeatable process is important for continuous delivery.
Principles from CD in ML
Principle #1
• Continuous Delivery for Software: Automation via CI + CD pipelines
• Continuous Delivery for Machine Learning: Automation of ML training, evaluation, and offline prediction pipelines
Training Pipelines
● Training pipelines are modelled like a build pipeline
● Customized Go-CD, an open source CI & CD tool
● Created plugins to help us with our ML workflows
Training Pipeline (3 Flavors)
Training Data → Pre-process Data (Spark Job) → Build Model (Python Script | Spark Job | Zeppelin Notebook) → Evaluate Model (Python)
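To make the "Evaluate Model (Python)" stage concrete, here is a minimal sketch of a script such a pipeline stage might run: load the built model, score it on a held-out set, archive the metrics, and fail the stage if the model is below par. The file names, environment variables, and threshold are illustrative assumptions, not Indix's actual setup.

```python
# evaluate_model.py - hypothetical "Evaluate Model (Python)" pipeline stage.
# Paths, env vars, and the accuracy threshold are illustrative assumptions.
import json
import os
import sys

import joblib
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

MODEL_PATH = os.environ.get("MODEL_PATH", "artifacts/model.pkl")
TEST_DATA = os.environ.get("TEST_DATA", "artifacts/test.parquet")
MIN_ACCURACY = float(os.environ.get("MIN_ACCURACY", "0.90"))

def main():
    model = joblib.load(MODEL_PATH)            # model produced by the "Build Model" stage
    test = pd.read_parquet(TEST_DATA)          # hold-out set from the pre-processing Spark job
    X, y = test.drop(columns=["label"]), test["label"]

    preds = model.predict(X)
    metrics = {
        "accuracy": accuracy_score(y, preds),
        "f1_weighted": f1_score(y, preds, average="weighted"),
    }

    # Persist metrics so the CI/CD server can archive them next to the model artifact.
    with open("artifacts/metrics.json", "w") as fh:
        json.dump(metrics, fh, indent=2)

    # Fail the stage if the model does not clear the bar, so a weak model
    # never reaches the publish/promote stages downstream.
    if metrics["accuracy"] < MIN_ACCURACY:
        print(f"Accuracy {metrics['accuracy']:.3f} below threshold {MIN_ACCURACY}")
        sys.exit(1)
    print("Evaluation passed:", metrics)

if __name__ == "__main__":
    main()
```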
Go-CD - Demo
Principle #2
• Continuous Delivery for Software: Source code and artifact repository for reproducibility
• Continuous Delivery for Machine Learning: Source code, data, and model repository for reproducibility
Model Repository
● Similar to an artifact repository like Maven, Ivy
○ Directory structure, versioning, publishing of models
● Has clients to publish models for the most commonly used frameworks
○ scikit-learn, Spark MLlib, Keras
● For a model,
○ Data
■ Stored in S3
■ In different formats: Parquet (Spark MLlib), Pickle (scikit-learn), HDF5 (Keras)
○ Metadata
■ Training/Validation/Test datasets
■ Hyper-parameters used
■ Evaluation metrics
Model Repository
Training Pipeline: Training Data → Pre-process Data (Spark Job) → Build Model (Python) → Evaluate Model (Python) → Publish Model
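As an illustration of the "Publish Model" stage, here is a minimal sketch of what a model-repository client could do: upload the serialized model to a versioned S3 prefix together with the metadata needed for reproducibility. The client name, bucket layout, and metadata fields are hypothetical; Indix's actual client is not shown in the talk.

```python
# publish_model.py - hypothetical sketch of a model-repository "publish" client.
# Bucket name, key layout, and metadata fields are illustrative assumptions that
# mirror the slide (model artifact in S3 + datasets, hyper-parameters, metrics).
import json
from datetime import datetime, timezone

import boto3

BUCKET = "example-model-repository"   # hypothetical S3 bucket

def publish_model(name: str, version: str, artifact_path: str, metadata: dict) -> str:
    """Upload a serialized model plus its metadata under a versioned prefix."""
    s3 = boto3.client("s3")
    prefix = f"{name}/{version}"

    # 1. The serialized model (Pickle for scikit-learn; Parquet or HDF5 for other frameworks).
    s3.upload_file(artifact_path, BUCKET, f"{prefix}/model.pkl")

    # 2. Metadata needed to reproduce the model later.
    metadata = {**metadata, "published_at": datetime.now(timezone.utc).isoformat()}
    s3.put_object(
        Bucket=BUCKET,
        Key=f"{prefix}/metadata.json",
        Body=json.dumps(metadata, indent=2).encode("utf-8"),
    )
    return f"s3://{BUCKET}/{prefix}"

if __name__ == "__main__":
    uri = publish_model(
        name="category-classifier",            # hypothetical model name
        version="1.4.2",
        artifact_path="artifacts/model.pkl",
        metadata={
            "training_data": "s3://example-data/train.parquet",
            "validation_data": "s3://example-data/valid.parquet",
            "test_data": "s3://example-data/test.parquet",
            "hyperparameters": {"max_depth": 12, "n_estimators": 200},
            "metrics": {"accuracy": 0.93, "f1_weighted": 0.91},
        },
    )
    print("Published model to", uri)
```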
Model Promotion
● Tagging the “latest good” version that needs to be deployed
● Not all models need/can be promoted
○ Experimental models
○ Models that fail the test set or performance/latency metrics
● Easy rollback - tag the “last good” version as the latest
Training Pipeline: Pre-process Data (Spark Job) → Build Model (Python) → Evaluate Model (Python) → Publish Model → Promote Model (Manual Stage)
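One way to realize the manual "Promote Model" stage is a small pointer object in the repository that names the latest good version; rollback is then just pointing back to an older version. The sketch below assumes the same hypothetical S3 layout as the publish example and is not Indix's actual scheme.

```python
# promote_model.py - hypothetical sketch of the manual "Promote Model" stage.
# Promotion is modelled as rewriting a small "LATEST" pointer object in the
# model repository; the bucket and key layout are assumptions for illustration.
import json

import boto3

BUCKET = "example-model-repository"   # hypothetical S3 bucket

def promote(name: str, version: str) -> None:
    """Tag `version` as the latest good version; rollback is promoting an older version."""
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=BUCKET,
        Key=f"{name}/LATEST.json",
        Body=json.dumps({"version": version}).encode("utf-8"),
    )

def latest(name: str) -> str:
    """Resolve the currently promoted version, e.g. when building the model container."""
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=BUCKET, Key=f"{name}/LATEST.json")["Body"].read()
    return json.loads(body)["version"]

if __name__ == "__main__":
    promote("category-classifier", "1.4.2")    # run after manual approval in the pipeline
    print("Now serving:", latest("category-classifier"))
    # Rollback: promote("category-classifier", "1.4.1")
```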
Principle #3
• Continuous Delivery for Software: Containers for µServices
• Continuous Delivery for Machine Learning: Model containers for model prediction µServices
Model Container
● Hosts a single model to be used for predictions
● Exposes an API for prediction and is “dockerized”
● Containers can be replicated to handle scale
● Two µServices
○ Scala
■ Handles pre-processing
○ Python
■ Loads the model and exposes predict on the model
■ Can also predict in batches for better throughput
■ Handles ensembles of models
○ The Scala µservice delegates the predict and predict_batch functions to the Python µservice
Model Container (inside a Docker Host)
• Scala µService: _preprocess(input), predict(input), predict_batch(inputs)
• Python µService: predict(input), predict_batch(inputs); holds one or more models
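The talk does not name the framework behind the Python µService; as an illustration only, here is a minimal Flask sketch whose endpoints mirror the predict and predict_batch functions shown above. The paths and environment variables are assumptions.

```python
# model_service.py - illustrative sketch of the Python prediction µService inside
# the model container. Flask is an assumption (the talk does not name the framework);
# the endpoints mirror the predict / predict_batch functions on the slide.
import os

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the promoted model once at startup; an ensemble would load several models here.
MODEL_PATH = os.environ.get("MODEL_PATH", "/models/model.pkl")
model = joblib.load(MODEL_PATH)

@app.route("/predict", methods=["POST"])
def predict():
    # Input is already pre-processed by the Scala µService before it reaches us.
    features = request.get_json()["features"]
    prediction = model.predict([features]).tolist()[0]
    return jsonify({"prediction": prediction})

@app.route("/predict_batch", methods=["POST"])
def predict_batch():
    # Predicting in batches amortizes per-call overhead and improves throughput.
    batch = request.get_json()["inputs"]
    predictions = model.predict(batch).tolist()
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", "8000")))
```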
Container Build Pipeline: Training Pipeline (Publish Model → Promote Model) → Create Docker Image (Docker) → Push to Docker Registry (Docker)
Model Deployment
● Two modes - Offline (Batch) and/or Online
● Offline Mode
○ Package model containers into an AMI (Amazon Machine Image)
○ Start the container as part of your Spark/Hadoop clusters on the Executors/Task Trackers
○ Within a job, call the local Scala service for prediction for each record
● Online Mode
○ Deploy the model containers into a Mesos + Marathon or a Kubernetes cluster
○ (Auto) scaling is managed by the cluster
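For the offline mode, here is a minimal PySpark sketch of a batch job calling the model container running locally on each executor host. The endpoint URL, port, payload shape, and S3 paths are assumptions for illustration.

```python
# offline_predict.py - illustrative sketch of the offline (batch) mode: each Spark
# executor calls the model container's local service for its partition of records.
# The endpoint, port, payload shape, and paths are assumptions, not Indix's actual setup.
import json
import urllib.request

from pyspark.sql import SparkSession

LOCAL_PREDICT_URL = "http://localhost:8080/predict_batch"   # hypothetical local µService

def predict_partition(rows):
    rows = list(rows)
    if not rows:
        return
    payload = json.dumps({"inputs": [row.asDict() for row in rows]}).encode("utf-8")
    req = urllib.request.Request(
        LOCAL_PREDICT_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        predictions = json.loads(resp.read())["predictions"]
    for row, prediction in zip(rows, predictions):
        yield {**row.asDict(), "prediction": prediction}

if __name__ == "__main__":
    spark = SparkSession.builder.appName("offline-prediction").getOrCreate()
    products = spark.read.parquet("s3://example-bucket/products.parquet")   # hypothetical path
    # mapPartitions batches records per partition, matching the container's predict_batch API.
    predicted = products.rdd.mapPartitions(predict_partition)
    spark.createDataFrame(predicted).write.parquet("s3://example-bucket/predictions.parquet")
```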
Principle #4
• Continuous Delivery for Software: A/B testing using canary releases
• Continuous Delivery for Machine Learning: A/B testing using request shadowing
Model A/B Testing
● We don’t use Multi-Armed Bandit (MAB) testing
○ Reason - the payout is not easily measurable, unlike CTR (for example)
● Instead we use the Request Shadowing pattern
○ Send the input to both the old and the new model, but serve output only from the old
○ Find deltas and do spot checking
● For offline, we only do deltas + spot checking
● We have built an in-house data-turking tool for spot checking
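Here is a minimal sketch of the Request Shadowing pattern described above: every request goes to both models, the old model's answer is served, and any disagreement is logged as a delta for spot checking (as in the examples that follow). The service URLs and logger name are hypothetical.

```python
# shadow_predict.py - illustrative sketch of the Request Shadowing pattern.
# Both the promoted (old) and candidate (new) model containers receive every
# request; only the old model's answer is served, and deltas are logged for
# spot checking. Service URLs and the delta log name are assumptions.
import json
import logging
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLD_MODEL_URL = "http://old-model:8000/predict"   # currently promoted model (served)
NEW_MODEL_URL = "http://new-model:8000/predict"   # candidate model (shadowed)

delta_log = logging.getLogger("shadow-deltas")
executor = ThreadPoolExecutor(max_workers=8)

def _call(url: str, features: dict):
    payload = json.dumps({"features": features}).encode("utf-8")
    req = urllib.request.Request(url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=2) as resp:
        return json.loads(resp.read())["prediction"]

def predict(features: dict):
    # Fire the shadow call asynchronously so it never adds latency to the served path.
    shadow_future = executor.submit(_call, NEW_MODEL_URL, features)
    served = _call(OLD_MODEL_URL, features)
    try:
        shadowed = shadow_future.result(timeout=2)
        if shadowed != served:
            # Deltas are collected and later spot-checked (e.g. with a data-turking tool).
            delta_log.info(json.dumps({"input": features, "old": served, "new": shadowed}))
    except Exception:
        delta_log.warning("shadow call failed", exc_info=True)
    return served
```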
Spot Checking Example 1
Spot Checking Example 2
Future Work
● Lots more to be done
○ Support deep learning-based models as a first-class solution
○ Model Repository visualization
○ Add more plugins to Go-CD to better support ML workflows natively
● Open Source
○ Model Serving Repository + Clients (WIP)
Indix & Open Source
● oss.indix.com
Thank You
Questions
