Jeremy Stanley, EVP/Data Scientist, Sailthru at MLconf NYC

Cost Effectively Scaling Machine Learning Systems in the Cloud

E-commerce and publishing clients use Sailthru to personalize billions of digital experiences for their customers weekly. Earlier this year, Sailthru launched Sightlines to allow clients to predict the future behavior of individual users. In this talk we cover how we scaled Sightlines cost-effectively in the cloud by combining inexpensive computing resources with an efficient architecture and an implementation that is easy to maintain and evolve.

To access computing resources cost-effectively, we use Amazon spot instances and Apache Mesos to pool together large quantities of CPU and memory. This approach can be orders of magnitude more cost-effective than traditional deployments, but it requires sophisticated automation and orchestration tools and a fine-grained, fault-tolerant application architecture.

Given cost-effective resources, the next challenge is to design the application to be efficient. Simple sampling and data pre-processing techniques significantly limit the computational requirements without adversely impacting model performance. Further, by controlling how often we run various components of the pipeline, we minimize cost while keeping models up to date.

The final challenge is to make such a system maintainable and easy to evolve. This includes removing single points of failure, automating infrastructure management, building distributed logging and monitoring capabilities, and running identical A/B production environments to enable aggressive, iterative changes to the code base and architecture in production.

We hope to demonstrate that the challenges faced in scaling a complex machine learning system in the cloud are at least as interesting as the science behind it, and to provide some insight into modern tools and methods for addressing these scalability challenges.

Published in: Technology

  1. Online, Offline, Mobile, Email, Social www.sailthru.com
     Cost Effectively Scaling Machine Learning Systems in the Cloud
     Agenda:
     ● Background on me, Sailthru & Sightlines (mercifully short)
     ● Cost effective resources in the AWS cloud
     ● Efficient(ish) application design
     ● Easy maintenance and evolution
     ● Machine learning details
  2. @jeremystan
     [Chart axes: Capitalism / Idealism, Indirect Value / Direct Value]
     Graduate student, Math (2000); Consultant, Finance (2005); CTO, Ad Tech (2010); Chief Data Scientist, Mar Tech (2015)
  3. Sailthru
  4. Sightlines
     Analytics: Segmentation, Forecasting
     Personalization: Recommendations, Discounting
     Optimization: Frequency, Channel
  5. Requirements
     1. ~5 million users per client
     2. JSON formatted user data, siloed across clients
     3. Predict varying outcomes: normal, poisson, binomial, quantile, ...
     4. Update models & predictions daily
     5. Only really care about predictive performance
     6. Scale to 1,000+ clients
  6. Our Cost Effective Scaling Strategy
     1. Get really cheap computing power
     2. Make it work really, really hard
     3. Optimize apps for ease of evolution
     4. Set up identical A/B environments
     Iterate aggressively based on data: ✓ Features ✓ Efficiency ✓ Scale
     [Figure: "JSON to Features" and "GBM in Memory", each labeled "Half our processing"; values shown: 10x, 3x, 0.6x, 0.5x, 9x, 1x, 0.2x]
  7. Cost Effective Resources in the AWS Cloud
  8. Cost Effective
     Instance: r3.8xlarge (32 vCPU, 244GB RAM)
     Resource Utilization: 30% (typical cloud); 10% (data center); 90% (highly efficient)
     Cost Per Hour: $2.80 (on demand); $1.76 (reserved 1yr); $1.05 (reserved 3yr); $0.28 (spot instance)
     Cloud $9.80; Data Center $10.50; Spot + Mesos + Relay $0.30
     30x more cost efficient! ($10.50 = $1.05 / 10%)
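
The parenthetical on the slide above ($10.50 = $1.05 / 10%) suggests that the comparison divides the hourly price by utilization, giving the cost of an hour of compute that is actually used. A minimal sketch of that arithmetic follows. The function name is hypothetical, and pairing the spot price with 90% utilization is an assumption read off the slide's layout rather than something the slide states.

```python
def cost_per_utilized_hour(price_per_hour: float, utilization: float) -> float:
    """Hourly price divided by the fraction of capacity actually used."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return price_per_hour / utilization


# Figures from the slide for one r3.8xlarge instance.
data_center = cost_per_utilized_hour(1.05, 0.10)  # reserved 3yr price at 10% utilization
spot = cost_per_utilized_hour(0.28, 0.90)         # assumed pairing: spot price at 90%

print(f"data center: ${data_center:.2f} per utilized hour")
print(f"spot:        ${spot:.2f} per utilized hour")
print(f"ratio:       {data_center / spot:.0f}x")
```

With these inputs the ratio lands in the low thirties, which is consistent with the slide's rounded "30x".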
  9. AWS Spot Instances
     [Chart labels: Your bid; What you pay; All instances died!]
  10. Mesos
      ● 81 “slaves”
      ● 4 availability zones
      ● 2 instance types
      ● 1,360 CPUs
      ● 10TB of RAM
      ● 94% utilized
      ● $11.90 per hour
      ● $104,244 per year
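
The yearly figure on the slide above follows from the hourly one. As a quick check, the sketch below multiplies the hourly cluster cost by the hours in a 365-day year and also derives a cost per CPU-hour; the per-CPU figure is not on the slide and is computed here only for illustration.

```python
HOURS_PER_YEAR = 24 * 365

cluster_cost_per_hour = 11.90  # from the slide
cpus = 1360                    # from the slide

annual_cost = cluster_cost_per_hour * HOURS_PER_YEAR
cost_per_cpu_hour = cluster_cost_per_hour / cpus  # derived, not stated on the slide

print(f"annual cost:  ${annual_cost:,.0f}")
print(f"per CPU-hour: ${cost_per_cpu_hour:.5f}")
```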
  11. Mesos + Marathon
      [Diagram: Zones 1-4 with Mesos Slaves (16 CPU and 8 CPU)]
  12. Mesos + Marathon
      [Diagram: Zones 1-4 with Mesos Slaves (16 CPU and 8 CPU); Mesos Master; Apps A, B, C; Queue Size]
      Applications must be:
      ● Distributed to be scheduled wherever Mesos wants
      ● Fine Grained to maximize utilization in Mesos
      ● Idempotent to handle duplicate runs in case network is partitioned
  13. Mesos + Marathon
      [Diagram: Zones 1-4 with Mesos Slaves (16 CPU and 8 CPU); Mesos Master; Apps A, B, C; Queue Size]
      [Chart: Available Mesos CPU Jiffies over Time, split into Idle and User]
      Doesn’t work for apps with highly variable load
  14. Mesos + Relay
      Relay.Mesos: Auto-scaler for distributed applications (github.com/sailthru/relay.mesos)
      ● Allocates resources based on queue size
      ● Wraps applications inside Mesos slaves
      ● Can significantly improve cluster utilization
      [Charts: Available Mesos CPU Jiffies over Time, split into Idle and User, Before Relay and After Relay]
      [Diagram: Apps A, B, C; Queue Size; Relay.Mesos; Mesos Master]
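
To make "allocates resources based on queue size" concrete, here is a minimal sketch of a queue-driven scaling rule. It is not Relay.Mesos's actual algorithm or API; the function names, the `tasks_per_worker` parameter, and the worker limits are all hypothetical.

```python
import math


def desired_workers(queue_size: int, tasks_per_worker: int,
                    min_workers: int = 0, max_workers: int = 100) -> int:
    """Workers needed to drain the queue, clamped to the allowed range."""
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    needed = math.ceil(queue_size / tasks_per_worker)
    return max(min_workers, min(max_workers, needed))


def scaling_delta(queue_size: int, running_workers: int,
                  tasks_per_worker: int, max_workers: int = 100) -> int:
    """Positive: start this many workers. Negative: stop this many."""
    target = desired_workers(queue_size, tasks_per_worker, max_workers=max_workers)
    return target - running_workers


# A backlog of 95 tasks with 4 workers running, each handling about 10 tasks.
print(scaling_delta(queue_size=95, running_workers=4, tasks_per_worker=10))
```

A rule of this kind releases workers when the queue is empty, which is one way a cluster could avoid holding idle CPU for applications with highly variable load.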
  15. Efficient(ish) Application Design
  16. Stolos: Distributed task dependency manager (github.com/sailthru/stolos)
      ● Directed acyclic graph
      ● Parameterizable templates
      ● Handles queueing
      ● Ensures idempotence
      Application Pipeline (simplified)
      [Diagram: Sailthru User API; Mongo; JSON; ETL; Assembly; GBMs; Analyze Models; Predict; Reports; Upload; Mongo]
      Actually much more complex:
      ● ~1,000 clients
      ● ~10 models
      ● ~10 steps
      ● ~100 sub-tasks
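
The idea of a task dependency manager built on a directed acyclic graph can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not Stolos's API. The task names come from the slide's pipeline diagram, but the edges between them are assumptions, and skipping already-completed tasks is a simplified stand-in for idempotence.

```python
from graphlib import TopologicalSorter

# Assumed dependencies: each task maps to the set of tasks it depends on.
PIPELINE = {
    "etl": set(),
    "assembly": {"etl"},
    "gbms": {"assembly"},
    "analyze_models": {"gbms"},
    "predict": {"gbms"},
    "reports": {"analyze_models"},
    "upload": {"predict"},
}


def run_pipeline(dag, run_task, completed):
    """Run tasks in dependency order, skipping any already completed."""
    executed = []
    for task in TopologicalSorter(dag).static_order():
        if task in completed:
            continue  # already done, so a repeated run does nothing for it
        run_task(task)
        completed.add(task)
        executed.append(task)
    return executed


completed = set()
first_run = run_pipeline(PIPELINE, lambda task: None, completed)
second_run = run_pipeline(PIPELINE, lambda task: None, completed)
print(first_run)
print(second_run)
```

The second call executes nothing because every task is already recorded as completed, which is the property that makes duplicate runs harmless.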
  17. Sampling Strategy
      [Diagram: Mongo; S3; JSON; Day 1; shard 1 to shard 1,000]
      JSON sharded on hash(user)
  18. Sampling Strategy
      [Diagram: Mongo; S3; JSON; Day 1 to Day N; shard 1 to shard 1,000]
  19. Sampling Strategy
      [Diagram: Mongo; S3; JSON; Day 1 to Day N; shard 1 to shard 1,000]
      Consistent 0.1% of data to a Mesos Slave CPU
  20. Sampling Strategy
      [Diagram: Mongo; S3; JSON; Day 1 to Day N; shard 1 to shard 1,000]
      Apps sample more as needed
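
Sharding on hash(user) into 1,000 shards means each shard holds roughly 0.1% of users, and the same users fall in the same shard every day. A minimal sketch of that idea follows. The use of MD5, the function names, and the "read the first k shards" rule are assumptions for illustration, not the talk's implementation.

```python
import hashlib

NUM_SHARDS = 1000


def shard_of(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable shard index for a user.

    hashlib is used because Python's built-in hash() of a str can differ
    between processes, which would break day-to-day consistency.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def sample_users(user_ids, num_shards_to_read: int):
    """Users whose shard index is below num_shards_to_read."""
    return {u for u in user_ids if shard_of(u) < num_shards_to_read}


users = [f"user-{i}" for i in range(100_000)]  # made-up user ids
small = sample_users(users, 1)    # one shard, about 0.1% of users
larger = sample_users(users, 10)  # ten shards, about 1% of users

print(len(small), len(larger), small <= larger)
```

Because a larger sample simply reads more shards, it always contains the smaller one, so an application that needs more data can extend its sample without discarding what it already processed.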
  21. User Profile JSON Data
  22. Each User Radically Different
      [Diagram labels: User; Feature; ???]
  23. Each User Radically Different
      [Diagram labels: User; Feature]
      tidyjson: Turn JSON into data frames (github.com/sailthru/tidyjson)
      ● Arbitrary JSON into R data.frames
      ● Guarantees deterministic structure
      ● Seamless with dplyr and %>%
  24. Why GBMs?
      ● Predict varying outcomes: normal, poisson, binomial, quantile, …
      ● Flexible enough to capture non-linearity & complex interactions: no need to feature engineer for each client
      ● Minimal number of hyper-parameters: depth, shrinkage, number of trees
      ● Robust to missing values: no need to impute
  25. Distributing a GBM
      α1 * tree 1 + α2 * tree 2 + α3 * tree 3 + … + αK * tree K
  26. Distributing a GBM
      α1 * tree 1 + α2 * tree 2 + α3 * tree 3 + … + αK * tree K
      [Diagram: Mesos Slaves in Zones 1-4]
      1. Across the sum: gives bagging, not boosting (iterative) => less accurate
  27. Distributing a GBM
      α1 * tree 1 + α2 * tree 2 + α3 * tree 3 + … + αK * tree K
      [Diagram: Mesos Slaves in Zones 1-4]
      1. Across the sum: gives bagging, not boosting (iterative) => less accurate
      2. Within each tree (Spark MLlib, H2O): a lot of overhead and coordination => not efficient for many small GBMs
  28. Distributing a GBM
      [Diagram: GBM 1 … GBM 50,000, each of the form α1 * tree 1 + α2 * tree 2 + α3 * tree 3 + … + αK * tree K; Mesos Slaves in Zones 1-4]
      1. Across the sum: gives bagging, not boosting (iterative) => less accurate
      2. Within each tree (Spark MLlib, H2O): a lot of overhead and coordination => not efficient for many small GBMs
      3. Across the GBMs: 50,000 GBMs to build => each can be built independently
      50,000 = 1,000 clients * 10 models * 5-fold CV ✓
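
The third option treats every (client, model, fold) combination as its own independent job. The sketch below enumerates those jobs and runs a few of them through a worker pool. `build_gbm` is a hypothetical placeholder, and a local thread pool stands in for the Mesos cluster purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

N_CLIENTS, N_MODELS, N_FOLDS = 1000, 10, 5  # counts from the slide


def gbm_tasks(n_clients=N_CLIENTS, n_models=N_MODELS, n_folds=N_FOLDS):
    """One (client, model, fold) tuple per GBM to build."""
    return product(range(n_clients), range(n_models), range(n_folds))


def build_gbm(task):
    """Placeholder for fitting one GBM; a real worker would load data and train."""
    client, model, fold = task
    return (client, model, fold, "built")


tasks = list(gbm_tasks())
print(len(tasks))

# No task depends on another, so any subset can run on any worker in any order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(build_gbm, tasks[:20]))
```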
  29. Grid Search
      α1 * tree 1 + α2 * tree 2 + α3 * tree 3 + … + αK * tree K
      For each client & model:
      1. Grid search over:
         a. Depth: size of trees
         b. Shrinkage: λ “learning rate” for {αi}
      2. Cross-validate for optimal # of trees
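
The two-step procedure on the slide above can be sketched as a loop over (depth, shrinkage) pairs, where each pair gets the tree count with the lowest cross-validated loss. This is a sketch under stated assumptions: `cv_loss_curve` is a hypothetical callable returning the mean CV loss after 1, 2, …, N trees, and `toy_cv_loss_curve` is a made-up curve used only so the example runs without a modeling library.

```python
from itertools import product


def grid_search(depths, shrinkages, cv_loss_curve):
    """Pick depth, shrinkage, and tree count with the lowest CV loss."""
    best = None
    for depth, shrinkage in product(depths, shrinkages):
        losses = cv_loss_curve(depth, shrinkage)  # losses[n - 1] is the loss with n trees
        n_trees = min(range(1, len(losses) + 1), key=lambda n: losses[n - 1])
        candidate = (losses[n_trees - 1], depth, shrinkage, n_trees)
        if best is None or candidate[0] < best[0]:
            best = candidate
    loss, depth, shrinkage, n_trees = best
    return {"depth": depth, "shrinkage": shrinkage, "n_trees": n_trees, "cv_loss": loss}


def toy_cv_loss_curve(depth, shrinkage, max_trees=200):
    """Made-up loss curve for illustration only; not real model output."""
    return [abs(n * shrinkage - 1.0) + 0.01 * abs(depth - 3) + shrinkage
            for n in range(1, max_trees + 1)]


best = grid_search(depths=[1, 3, 5], shrinkages=[0.1, 0.01],
                   cv_loss_curve=toy_cv_loss_curve)
print(best)
```

Keeping the tree count out of the grid reflects the slide's split: depth and shrinkage are searched explicitly, while the number of trees falls out of the cross-validation curve for each pair.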
  30. Easy Maintenance & Evolution
  31. Tools Used
      [Group labels: Batch Applications; Frameworks; Maintenance]
      R (Modeling); Python (ETL); AWS S3 (State); Zookeeper (Coordination); Spark (Map Reduce); Marathon (Running Apps); Mesos (Cluster Sharing); ELK (Log Mgmt); Consul (Discovery); Chef (Configuration Automation); Librato (Monitoring); Sensu (Alerting); Asgard (Auto Scaling); AWS Spot (Compute)
  32. How we Iterate
      [Diagram: environments A and B; Sailthru User API; Mongo; JSON]
  33. How we Iterate
      [Diagram: environments A and B; Sailthru User API; Mongo; JSON]
  34. How we Iterate
      [Diagram: environments A and B; Sailthru User API; Mongo; JSON]
      ● Tools ● Configuration ● Applications
      Versions: v1.0.0
  35. How we Iterate
      [Diagram: environments A and B; Sailthru User API; Mongo; JSON]
      ● Tools ● Configuration ● Applications
      Versions: v1.0.0, v1.0.1
  36. How we Iterate
      [Diagram: environments A and B; Sailthru User API; Mongo; JSON]
      ● Tools ● Configuration ● Applications
      Versions: v1.0.0, v1.0.1
  37. How we Iterate
      [Diagram: environments A and B; Sailthru User API; Mongo; JSON]
      ● Tools ● Configuration ● Applications
      Versions: v1.0.0, v1.0.1, v1.0.2
  38. How we Iterate
      [Diagram: environments A and B; Sailthru User API; Mongo; JSON]
      ● Tools ● Configuration ● Applications
      ✓ Check monitoring
      Versions: v1.0.0, v1.0.1, v1.0.2
  39. How we Iterate
      [Diagram: environments A and B; Sailthru User API; Mongo; JSON]
      ● Tools ● Configuration ● Applications
      ✓ Check monitoring ✓ Check logging
      Versions: v1.0.0, v1.0.1, v1.0.2
  40. How we Iterate
      [Diagram: environments A and B; Sailthru User API; Mongo; JSON]
      ● Tools ● Configuration ● Applications
      ✓ Check monitoring ✓ Check logging ✓ Check performance
      Versions: v1.0.0, v1.0.1, v1.0.2
  41. How we Iterate
      [Diagram: environments A and B; Sailthru User API; Mongo; JSON]
      ● Tools ● Configuration ● Applications
      ✓ Check monitoring ✓ Check logging ✓ Check performance
      Versions: v1.0.0, v1.0.1, v1.0.2
  42. How we Iterate
      [Diagram: environments A and B; Sailthru User API; Mongo; JSON]
      ● Tools ● Configuration ● Applications
      ✓ Check monitoring ✓ Check logging ✓ Check performance
      Versions: v1.0.0, v1.0.1, v1.0.2
  43. Thank You! Our team: Divyanshu Vats, Alex Gaudio, Andras Kerekes, Jeremy Stanley
