
Continuous delivery for machine learning


Lessons learnt and system built while solving the last mile problem in machine learning - taking models to production. Used for the talk at -

Published in: Software

  1. 1. 5/15/2017 Continuous Delivery Principles for Machine Learning Rajesh Muppalla @codingnirvana
  2. 2. About Me • Co-Founder, Indix • From Chennai ◊ 200 miles to the east (and north) of Bangalore ◊ S Ramanujan (19th Century Mathematician), Sundar Pichai ◊ Three Seasons - Hot, Hotter, Hottest • Previously ◊ ScaleByTheBay 2016 - Data Pipelines Panelist ◊ Microservices, Lambda Architecture • Ex-Thoughtworks ◊ Tech Lead - Go-CD - an open source CI/CD Tool
  3. 3. 5/15/2017 About Indix
  4. 4. Six Business Critical Indexes: People, Documents, Businesses, Places, Products, Connected Devices
  5. 5. Indix: the “Google Maps” of Products ● Google Maps: enabling businesses to build location-aware software. ~3.6 million websites use Google Maps ● Indix: enabling businesses to build product-aware software. Indix catalogs over 2.1 billion product offers
  6. 6. Data Pipeline @ Indix (diagram labels): Crawling Pipeline; Data Pipeline (ML): Aggregate, Match, Standardize, Extract Attributes, Classify, Dedupe, Parse; Crawl Data; Crawl; Seed; Brand & Retailer Websites; Feeds Pipeline: Transform, Clean, Connect; Feed Data; Brand & Retailer Feeds; Indix Product Catalog; Customizable Feeds; Search & Analytics Index; Indexing Pipeline; Real Time Index; Analyze, Derive, Join; API (Bulk & Synchronous); Product Data Transformation Service
  7. 7. E-Tailers & Marketplaces
     Original Catalog
       Product   | Title         | Brand  | Color | Size
       Product 1 | Running Shoes | Adidas | Blk   | 9
       Product 2 | Yoga Pants    |        | Black | 32
       Product 3 | Jacket        | TNF    | White |
     Enriched Catalog
       Product   | Title         | Brand          | Color | Size | Material
       Product 1 | Running Shoes | Adidas         | Black | 9 US | Leather
       Product 2 | Yoga Pants    | Lululemon      | Black | 32"  | Polyester
       Product 3 | Jacket        | The North Face | White |      | Leather
  8. 8. Ad Display & Exchange Platforms • Advertisers - Standardize, Enrich and Augment Product Information for better relevance • Retailers - Enrich, Match and Normalize their catalog for better targeting of native Ads • Publishers - Classify and tag publisher site content
  9. 9. Data Scale @ Indix ● 2.1 billion product URLs ● 8 TB of HTML data crawled daily ● 1B unique products ● 7,000 categories ● 120B price points ● 3,000 sites
  10. 10. ML @ Indix (3/31/16)
     ● Auto Parsers to detect and extract product content from web pages, using machine vision algorithms
     ● Predictive Scheduler for deciding re-crawl frequency using signals like seasonality, product type and store
     ● Multi-label classifier: categorizing products into a hierarchical taxonomy using text information
     ● Inferring Product vs Listing vs Other pages using either just URL patterns or page content
     ● Adaptive Crawlers that modify the crawl rate based on dynamic characteristics like site traffic, number of products and robots.txt settings
     ● Deep learning: categorizing products using product images
     ● Predicting which products are an exact match or similar products
     ● NER-based attribute extraction algorithm that mines text like titles, descriptions and specifications to build structured Key:Value attributes
     ● Fusion/Enrichment: an algorithm that uses the data to learn and build a golden product record from disparate sources
     ● Product Rank: an algorithm that uses multiple signals like product popularity, price, data quality, store popularity and brand popularity to build a dynamic relevance/rank score
     ● Recommendation Engines that suggest tags where product information can be found on a web page
     ● Deep learning: extracting visual product attributes using product images
     ● NLG algorithms to generate product descriptions
     ● Product GPS: a Universal Product Identifier built using machine learning algorithms, allowing Search & Discovery
  11. 11. ML @ Indix - Classification
  12. 12. ML @ Indix - Attribute Extraction
  13. 13. 5/15/2017 Machine Learning Workflow
  14. 14. Machine Learning Workflow ● Define Business Objective ● Explore & Transform ● Pull and Acquire Data ● Develop Model ● Evaluate Model ● Meets Business Needs? (Yes! / Not Yet!) ● Build Production System ● Deploy ● Measure Metrics ● Human in the Loop
  15. 15. Machine Learning Sandwich?* * - Explore & Transform Pull and Acquire Data Deploy Build Production System Develop Model Model Evaluation & Validation The MEAT is not in the middle
  16. 16. Experts agree with us ● “Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex.” D. Sculley, et al. Hidden Technical Debt in Machine Learning Systems. In Neural Information Processing Systems (NIPS), 2015.
  17. 17. Different Skillsets Explore & Transform Pull and Acquire Data Deploy Build Production System Develop Model Model Evaluation & Validation Data Pipelines App Model
  18. 18. Separate Talk Explore & Transform Pull and Acquire Data
  19. 19. My Talk Explore & Transform Pull and Acquire Data Deploy Build Production System Develop Model Model Evaluation & Validation Focus of this talk
  20. 20. Pain Points ● A key employee in the team had to abruptly go on leave ○ Unable to reproduce the performance of an existing production model ■ Training data missing or unknown ■ Pre-processing scripts missing ■ Hyperparameters unknown ● It takes 3 months to productionize a model ○ Lots of glue code ○ Custom code developed every time ○ Frequent updates to a model take a long time ● Heterogeneous systems ○ E.g. sharing stuff between Python and the JVM
  21. 21. Reality ● Confidence in Test Set != Confidence in Production ■ Confidence of model performance on a sample set not good enough
  22. 22. Déjà vu?
  23. 23. What is Continuous Delivery? Continuous Delivery is a software engineering approach that aims at building, testing and releasing software faster and more frequently. A straightforward and repeatable process is important for continuous delivery.
  24. 24. 5/15/2017 Principles from CD in ML
  25. 25. Principle #1 ● Continuous Delivery for Software: Automation via CI + CD pipelines ● Continuous Delivery for Machine Learning: Automation of ML Training, Evaluation and Offline Prediction Pipelines
  26. 26. Training Pipelines ● Training pipelines are modelled like a build pipeline ● Customized Go-CD, an open source CI & CD tool ● Created plugins to help us with our ML workflows ● Training Pipeline (3 flavors): Training Data → Pre-process Data (Spark Job) → Build Model (Python Script OR Spark Job OR Zeppelin Notebook) → Evaluate Model (Python)
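A minimal Python sketch of a training pipeline modelled as ordered build stages, as described on the slide above. Everything here is hypothetical: the stage functions, the toy keyword "model" and the `run_pipeline` helper only stand in for the Go-CD stages that run Spark jobs and Python scripts in the real system.

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]
Stage = Tuple[str, Callable[[Context], Context]]


def preprocess(ctx: Context) -> Context:
    # Stand-in for the Spark pre-processing job: normalise case and whitespace.
    rows = [(" ".join(t.lower().split()), label) for t, label in ctx["training_data"]]
    return {**ctx, "rows": rows}


def build_model(ctx: Context) -> Context:
    # Stand-in for the model build: a toy word -> label lookup table.
    model: Dict[str, str] = {}
    for title, label in ctx["rows"]:
        for word in title.split():
            model.setdefault(word, label)
    return {**ctx, "model": model}


def predict(model: Dict[str, str], title: str) -> str:
    # Return the label of the first known word, or "unknown".
    for word in title.lower().split():
        if word in model:
            return model[word]
    return "unknown"


def evaluate_model(ctx: Context) -> Context:
    # Accuracy of the toy model on the held-out test data.
    test = ctx["test_data"]
    correct = sum(predict(ctx["model"], t) == label for t, label in test)
    return {**ctx, "accuracy": correct / len(test)}


def run_pipeline(stages: List[Stage], ctx: Context) -> Context:
    # Run stages in order; an exception in any stage stops the run,
    # the way a failing stage stops a build pipeline.
    completed: List[str] = []
    for name, stage in stages:
        ctx = stage(ctx)
        completed.append(name)
    return {**ctx, "completed": completed}


TRAINING_PIPELINE: List[Stage] = [
    ("preprocess", preprocess),
    ("build_model", build_model),
    ("evaluate_model", evaluate_model),
]
```

Swapping the `build_model` entry for another callable mirrors the three build flavors on the slide without touching the other stages.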
  27. 27. Go-CD - Demo
  28. 28. Principle #2 ● Continuous Delivery for Software: Source Code and Artifact Repository for Reproducibility ● Continuous Delivery for Machine Learning: Source Code, Data and Model Repository for Reproducibility
  29. 29. Model Repository ● Similar to an artifact repository like Maven, Ivy ○ Directory structure, versioning, publishing of models ● Has clients to publish models for the most commonly used frameworks ○ scikit-learn, Spark MLlib, Keras ● For a model, ○ Data ■ Stored in S3 ■ In different formats ● Spark MLlib - Parquet, scikit-learn - Pickle, Keras - HDF5 ○ Metadata ■ Training/Validation/Test datasets ■ Hyper-parameters used ■ Evaluation metrics
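The repository described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Indix's implementation: it writes to a local directory instead of S3, and the `ModelRepository` class, the `<name>/<version>/` layout and the file names `model.bin` and `metadata.json` are all hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, Tuple


class ModelRepository:
    """Toy artifact repository: one directory per model, one per version."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def publish(self, name: str, model_bytes: bytes, metadata: Dict[str, Any]) -> int:
        # Versions are consecutive integers, starting at 1 for each model name.
        model_dir = self.root / name
        model_dir.mkdir(parents=True, exist_ok=True)
        existing = [int(p.name) for p in model_dir.iterdir() if p.name.isdigit()]
        version = max(existing, default=0) + 1
        version_dir = model_dir / str(version)
        version_dir.mkdir()
        # The serialized model (Pickle, HDF5, ...) is stored as opaque bytes.
        (version_dir / "model.bin").write_bytes(model_bytes)
        # Metadata keeps what is needed to reproduce the model later:
        # datasets, hyper-parameters and evaluation metrics.
        (version_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
        return version

    def load(self, name: str, version: int) -> Tuple[bytes, Dict[str, Any]]:
        version_dir = self.root / name / str(version)
        model_bytes = (version_dir / "model.bin").read_bytes()
        metadata = json.loads((version_dir / "metadata.json").read_text())
        return model_bytes, metadata
```

Storing the metadata next to the model is what addresses the pain point of unknown training data and hyperparameters: every published version carries its own record.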
  30. 30. Model Repository ● Training Pipeline: Training Data → Pre-process Data (Spark Job) → Build Model (Python) → Evaluate Model (Python) → Publish Model
  31. 31. Model Promotion ● Tagging the “latest good” version that needs to be deployed ● Not all models need/can be promoted ○ Experimental models ○ Models that fail the test set or performance/latency metrics ● Easy rollback - tag the “last good” version as the latest ● Training Pipeline: Training Data → Pre-process Data (Spark Job) → Build Model (Python) → Evaluate Model (Python) → Publish Model → Promote Model (Manual Stage)
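Promotion and rollback as described on the slide above can be sketched as a tag history per model. This is a hedged Python illustration: the `PromotionRegistry` class, its thresholds and its method names are assumptions made for the example, and in the talk promotion is a manual pipeline stage rather than an automatic check.

```python
from typing import Dict, List, Optional


class PromotionRegistry:
    """Tracks which published versions were tagged "latest good"."""

    def __init__(self, min_accuracy: float, max_latency_ms: float) -> None:
        self.min_accuracy = min_accuracy
        self.max_latency_ms = max_latency_ms
        self._history: Dict[str, List[int]] = {}

    def promote(self, name: str, version: int, accuracy: float, latency_ms: float) -> bool:
        # Models that miss the accuracy or latency bar are never promoted.
        if accuracy < self.min_accuracy or latency_ms > self.max_latency_ms:
            return False
        self._history.setdefault(name, []).append(version)
        return True

    def latest_good(self, name: str) -> Optional[int]:
        # The version that deployments should pick up, if any.
        history = self._history.get(name, [])
        return history[-1] if history else None

    def rollback(self, name: str) -> int:
        # Re-tag the previously promoted version as the latest good one.
        history = self._history.get(name, [])
        if len(history) < 2:
            raise ValueError("no earlier promoted version to roll back to")
        history.pop()
        return history[-1]
```

Because deployments only ever read `latest_good`, a rollback is a tag change and needs no rebuild of the model.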
  32. 32. Principle #3 ● Continuous Delivery for Software: Containers for µServices ● Continuous Delivery for Machine Learning: Model Containers for Model Prediction µServices
  33. 33. Model Container ● Hosts a single model to be used for predictions ● Exposes an API for prediction and is “dockerized” ● Containers can be replicated to handle scale ● Two µServices ○ Scala ■ Handles pre-processing ○ Python ■ Loads the model and exposes predict on the model ■ Can also predict in batches for better throughput ■ Handles ensembles of models ○ The Scala µService delegates the predict and predict_batch functions to the Python µService
  34. 34. Model Container ● Training Pipeline: Publish Model → Promote Model → Create Docker Image (Docker) → Push to Docker Registry (Docker) ● Docker Host ○ Scala µService: predict(input), predict_batch(inputs), _preprocess(input) ○ Python µService: predict(input), predict_batch(inputs), hosting one or more Models
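The split between the two µServices above can be sketched in-process in Python. This is only an illustration of the delegation pattern: the real front end is a Scala service talking to a Python service inside a Docker container, and the class names, the majority-vote ensemble and the whitespace/lower-casing pre-processing are assumptions made for the example.

```python
from collections import Counter
from typing import Callable, List, Sequence

Model = Callable[[str], str]


class PythonModelService:
    """Stands in for the Python µService: holds the model(s) and predicts."""

    def __init__(self, models: Sequence[Model]) -> None:
        if not models:
            raise ValueError("at least one model is required")
        self.models = list(models)

    def predict(self, text: str) -> str:
        # With several models, treat them as an ensemble and take a majority vote.
        votes = Counter(model(text) for model in self.models)
        return votes.most_common(1)[0][0]

    def predict_batch(self, texts: Sequence[str]) -> List[str]:
        # One call for many inputs, to amortise per-request overhead.
        return [self.predict(text) for text in texts]


class FrontendService:
    """Stands in for the Scala µService: pre-processes, then delegates."""

    def __init__(self, backend: PythonModelService) -> None:
        self.backend = backend

    def _preprocess(self, raw: str) -> str:
        # Hypothetical pre-processing: lower-case and collapse whitespace.
        return " ".join(raw.lower().split())

    def predict(self, raw: str) -> str:
        return self.backend.predict(self._preprocess(raw))

    def predict_batch(self, raws: Sequence[str]) -> List[str]:
        return self.backend.predict_batch([self._preprocess(r) for r in raws])
```

Keeping pre-processing in the front end means callers send raw input, while the Python side only ever sees normalised text.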
  35. 35. Model Deployment ● Two Modes - Offline (Batch) and/or Online ● Offline Mode ○ Package model containers into an AMI (Amazon Machine Image) ○ Start the container as part of your Spark/Hadoop clusters on the Executors/Task Trackers ○ Within a job call the local Scala Service for prediction for each record ● Online Mode ○ Deploy the model containers into a Mesos + Marathon or a Kubernetes cluster ○ (Auto) Scaling is managed by the cluster
  36. 36. Principle #4 ● Continuous Delivery for Software: A/B Testing Using Canary Releases ● Continuous Delivery for Machine Learning: A/B Testing Using Request Shadowing
  37. 37. Model A/B Testing ● We don’t use Multi-Armed Bandit (MAB) testing ○ Reason - Payout is not easily measurable, unlike CTR (for example) ● Instead we use the Request Shadowing pattern ○ Send input to both the old and the new model, but serve output only from the old one ○ Find deltas and do spot checking ● For Offline, we only do deltas + spot checking ● We have built an in-house data turking tool for spot checking
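A minimal Python sketch of the request shadowing pattern described above: every request goes to both models, only the old model's answer is served, and disagreements are recorded for spot checking. The `ShadowRouter` name and the shape of the recorded deltas are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Model = Callable[[str], str]


@dataclass
class ShadowRouter:
    old_model: Model
    new_model: Model
    # Each delta is (request, served output, shadow output or error).
    deltas: List[Tuple[str, str, str]] = field(default_factory=list)

    def predict(self, request: str) -> str:
        served = self.old_model(request)
        try:
            shadow = self.new_model(request)
        except Exception as exc:
            # A failing candidate must never affect the served response;
            # the failure is recorded as a delta instead.
            shadow = f"error: {exc!r}"
        if shadow != served:
            self.deltas.append((request, served, shadow))
        return served
```

The collected `deltas` are what would be sampled for manual spot checking before the new model is promoted.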
  38. 38. Spot Checking Example 1
  39. 39. Spot Checking Example 2
  40. 40. Future Work ● Lot more to be done ○ Support deep learning based models as a first class solution ○ Model Repository visualization ○ Add more plugins in Go-CD to better support ML workflows natively ● Open Source ○ Model Serving Repository + Clients (WIP)
  41. 41. Indix & Open Source ●
  42. 42. Thank You Questions