Scaling Online ML Predictions At DoorDash

DoorDash is a 3-sided marketplace that consists of Merchants, Consumers, and Dashers.
As DoorDash's business grows, the online ML prediction volume grows exponentially to support the various Machine Learning use cases, such as ETA predictions, Dasher assignments, personalized restaurant and menu item recommendations, and the ranking of a large volume of search queries.

The prediction service built to meet the above use cases now supports many dozens of models spanning different Machine Learning approaches, such as gradient boosting, neural networks, and rule-based models. The service supports more than 10 billion predictions every day, with a peak rate of more than 1 million predictions per second.
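
To put the two volume figures side by side, the short calculation below converts the daily total into an average per-second rate and compares it with the stated peak. Both inputs are the lower bounds quoted above ("more than"), so the result is only indicative; the variable names are illustrative.

```python
# Back-of-the-envelope comparison of the quoted daily volume and peak rate.
# Both figures are lower bounds from the abstract, so treat the ratio as rough.
SECONDS_PER_DAY = 24 * 60 * 60

daily_predictions = 10_000_000_000  # "more than 10 billion predictions every day"
peak_per_second = 1_000_000         # "peak rate of more than 1 million per second"

average_per_second = daily_predictions / SECONDS_PER_DAY
peak_to_average = peak_per_second / average_per_second

print(f"average: {average_per_second:,.0f} predictions/sec")  # average: 115,741 predictions/sec
print(f"peak/average ratio: {peak_to_average:.2f}")           # peak/average ratio: 8.64
```

Under these assumptions the peak is roughly nine times the daily average, which is the kind of gap that capacity planning for peak traffic has to absorb.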

In this session, we will share our journey of building and scaling our Machine Learning platform, and particularly the prediction service: the optimizations we experimented with, the lessons we learned, and the technical decisions and tradeoffs we made. We will also share how we measure success and how we set goals for the future. Finally, we will end by highlighting the challenges ahead of us in extending our Machine Learning platform to support the Data Scientist community and a wider set of use cases at DoorDash.

  1. 1. Scaling Online ML Predictions at DoorDash Hien Luu & Arbaz Khan ML Platform, DoorDash
  2. 2. Agenda ▪ DoorDash Marketplace ▪ ML @ DoorDash ▪ ML Platform Journey ▪ Scaling ML Online Predictions ▪ Lessons Learned & Future Work
  3. 3. DoorDash Marketplace
  4. 4. DoorDash Marketplace Our mission is to grow and empower local economies
  5. 5. DoorDash Marketplace Delivery & Pickup DashPass Subscription Convenience & Grocery
  6. 6. DoorDash Marketplace Logistics Platform
  7. 7. DoorDash Marketplace Three-sided Marketplace - Flywheel Effect
  8. 8. DoorDash Marketplace Three-sided Marketplace - Flywheel Effect
  9. 9. DoorDash Marketplace Three-sided Marketplace - Flywheel Effect
  10. 10. DoorDash Marketplace Flywheel Effect Merchant Dasher Consumer DoorDash Flywheel
  11. 11. Machine Learning @ DoorDash
  12. 12. Food Order Lifecycle Step 1 Creating Order Step 2 Order Checkout Step 3 Dispatching Order Step 4 Delivering Order
  13. 13. Creating Order Creating Order Step 1 Recommendation & Search Promotion
  14. 14. Order Checkout Recommendation Step 2 Order Checkout Fraud
  15. 15. Dispatching Order Step 3 Dispatching Order Merchants Dashers
  16. 16. Delivering Order Step 4 Delivering Order Consumers
  17. 17. Machine Learning Platform Journey
  18. 18. Machine Learning Platform Journey Centralized ML Platform DS/ML Cx DS/ML Mx DS/ML Lx DS/ML Fraud ….
  19. 19. Machine Learning Platform Journey ML Platform Pillars Feature Engineering Model Training Model Prediction Model Management ML Insights Think big, but start small
  20. 20. Machine Learning Platform Journey Feature Service Feature Store (Redis) Online Prediction Service Real-time Features Historical Features Feature Engineering Production Systems
  21. 21. Machine Learning Platform Journey Model Training Service Model Store Online Prediction Service Model Training & Management Python Python Python
  22. 22. Machine Learning Platform Journey Feature Store (Redis) Online Prediction Service Prediction Results Model Prediction Model Store Prediction Requests Load models Fetch features
  23. 23. Scaling Online Predictions
  24. 24. Scaling Online Predictions
  25. 25. Isn’t it the same as scaling any other microservice? Scaling Online Predictions
  26. 26. Larger payloads Heavier computations Production + Experiment traffic Near real-time auditing Challenges of scaling
  27. 27. Ok but what do you mean by scaling?
  28. 28. Users DSML Microservices Business vector created by macrovector - www.freepik.com
  29. 29. Users DSML Microservices Business vector created by macrovector - www.freepik.com - more models - new features - more experiments
  30. 30. Users DSML Microservices Business vector created by macrovector - www.freepik.com - higher request throughput - more uptime - more models - new features - more experiments
  31. 31. Four phases of scaling I: Happy Horizontal Scaling II: User Isolation III: Optimize to survive IV: Uh-oh Infrastructure
  32. 32. Phase I: Happy Horizontal Scaling
  33. 33. Prediction Store Prediction Microservice Feature Store Metrics Model Store Phase I: Happy Horizontal Scaling
  34. 34. peak predictions / sec
  35. 35. Use Case: Latency. Dasher Dispatch: 133.194 ms; delivery-predictors: 103.838 ms; Feed ranking: 33.146 ms; Item Recommendation: 28.631 ms; ETA prediction: 11.889 ms; Kitchen capacity: 2.163 ms
  36. 36. Phase II: User isolation
  37. 37. Strategies of User isolation
  38. 38. All in One Strategies of User isolation
  39. 39. All in One One service per model Strategies of User isolation
  40. 40. All in One One service per model Hybrid Strategies of User isolation
  41. 41. Is hybrid isolation always the way to go?
  42. 42. How far did hybrid isolation take your scaling attempts?
  43. 43. Hitting the limits peak predictions / sec Prediction Store Model Store Metrics Prediction Microservice Feature Store Prediction Microservice Feature Store Prediction Microservice Feature Store Model Store
  44. 44. Phase III: Optimize to survive
  45. 45. Hitting the limits peak predictions / sec Prediction Store Model Store Metrics Prediction Microservice Feature Store Prediction Microservice Feature Store Prediction Microservice Feature Store Model Store
  46. 46. Microservice optimizations Parameter tuning Runtime optimizations Load testing Latency profiling OBSERVE ITERATE
  47. 47. Feature Store optimizations Schema Redesign Benchmarking OBSERVE ITERATE
  48. 48. peak predictions / sec Prediction Store Model Store Metrics Prediction Microservice Feature Store V2 Prediction Microservice Feature Store Prediction Microservice Feature Store Model Store
  49. 49. peak predictions / sec Prediction Store Model Store Metrics Prediction Microservice Feature Store V2 Prediction Microservice Feature Store Prediction Microservice Feature Store Model Store
  50. 50. Phase IV: Uh-oh infrastructure!!
  51. 51. Stifled infrastructure - Splunk quota exceeded - Wavefront metrics limit breached - Blocked by Segment for sending “too many” events - High Service Discovery (Consul) CPU threatening a total outage
  52. 52. Stifled infrastructure - Splunk quota exceeded: Only essential and sampled logging - Wavefront metrics limit breached: Move to Prometheus from StatsD - Blocked by Segment for sending “too many” events: Use in-house Kafka streaming instead - High Service Discovery (Consul) CPU threatening a total outage: Beef up further to reduce number of discoverable pods
  53. 53. peak predictions / sec Prediction Store Model Store Metrics Prediction Microservice Feature Store V2 Prediction Microservice Feature Store Prediction Microservice Feature Store Prediction Microservice Feature Store V2 Model Store
  54. 54. Summary • Isolate use cases wherever possible • Scaling out will always either bust budgets or stop helping • Pen down infrastructure dependencies and implications
  55. 55. Lessons Learned • Happy path and less happy path • Customer obsession • Big vision, but build incrementally
  56. 56. Future Work ● More microservice optimizations ● Generalized model serving ○ NLP & Image recognition ● Unified prediction client
  57. 57. https://doordash.engineering/category/data-science-and-machine-learning Thank you
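
Slides 38 to 40 name three user-isolation strategies (All in One, One service per model, Hybrid) without spelling out how requests are routed. The sketch below is one possible reading, not the DoorDash implementation: it assumes "hybrid" means that selected models get a dedicated deployment while all others share one. The class name, shard names, and model IDs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class IsolationRouter:
    """Maps a model ID to the deployment (shard) assumed to serve it."""

    strategy: str  # "all_in_one", "per_model", or "hybrid"
    # Hybrid only (assumption): models that get their own deployment.
    dedicated: Dict[str, str] = field(default_factory=dict)
    shared_shard: str = "shared"

    def shard_for(self, model_id: str) -> str:
        if self.strategy == "all_in_one":
            # Every model is served by the same deployment.
            return self.shared_shard
        if self.strategy == "per_model":
            # Every model gets its own deployment.
            return f"svc-{model_id}"
        if self.strategy == "hybrid":
            # Listed models are isolated; the rest share one deployment.
            return self.dedicated.get(model_id, self.shared_shard)
        raise ValueError(f"unknown strategy: {self.strategy}")


router = IsolationRouter("hybrid", dedicated={"dasher_dispatch": "dispatch-shard"})
print(router.shard_for("dasher_dispatch"))   # dispatch-shard
print(router.shard_for("kitchen_capacity"))  # shared
```

Under this reading, the hybrid strategy trades the operational cost of one service per model against the noisy-neighbor risk of a single shared service.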
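
Slide 52 lists "only essential and sampled logging" as the response to the exceeded Splunk quota. As a minimal sketch of what that could look like with Python's standard `logging` module, the filter below always keeps warnings and errors and keeps only a random fraction of lower-severity records. The class name, the choice of WARNING as the "essential" threshold, and the 1% rate are assumptions, not details from the talk.

```python
import logging
import random


class SampledFilter(logging.Filter):
    """Keeps all WARNING+ records and a random sample of lower-level ones."""

    def __init__(self, sample_rate: float, seed=None):
        super().__init__()
        if not 0.0 <= sample_rate <= 1.0:
            raise ValueError("sample_rate must be between 0.0 and 1.0")
        self.sample_rate = sample_rate
        self._rng = random.Random(seed)  # seedable for reproducible tests

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # "essential" logs are never dropped (assumed threshold)
        # Keep roughly sample_rate of DEBUG/INFO records.
        return self._rng.random() < self.sample_rate


logger = logging.getLogger("prediction_service")  # hypothetical logger name
logger.addFilter(SampledFilter(sample_rate=0.01))
```

Because the filter is attached to the logger, records it drops never reach any handler, so they are not shipped to the log backend at all.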
