Scaling Machine Learning Systems up to Billions of Predictions per Day

Whether it's a linear regressor or a system of connected deep learning models, getting your models ready is half the battle. Did you design your machine learning system to survive the onslaught of visitors from your latest Reddit and Hacker News post? Or the influx of users shopping during Black Friday? Are you ready for a world filled with flaky networks, invalid data, and impatient users? In this talk you'll learn how to design and architect your machine learning systems for the harsh realities they will face. We will show you how we tackled these problems in a real, complex machine learning system at OLX and scaled it to serve up to billions of predictions per day, using software engineering principles while debunking the myth that Python code cannot scale.

Published in: Engineering

  1. 1. Scaling Machine Learning Systems Up to Billions of predictions per day
  2. 2. Carmine Paolino Senior Data Scientist at OLX @paolino github.com/crmne paolino.me
  3. 3. Software development process
  4. 4. Problem Software development process
  6. 6. Problem Code Software development process
  8. 8. “No no, I’m a serious developer”
  13. 13. …and I deploy it!
  16. 16. It works! 🥳
  17. 17. Are you really done?
  18. 18. Does feature complete mean production ready?
  19. 19. “There is a lot more to software development than just adding all the features.” (Michael T. Nygard, author of Release It!)
  25. 25. Software is often designed like a concept car Elegant! Sophisticated! Clever! Fragile… Requires constant maintenance…
  26. 26. Software engineering talks about what systems should do, not what they shouldn’t do.
  27. 27. This is a story about my concept car.
  28. 28. Listing quality
  32. 32. Predict the quality of all versions of all listings in OLX Europe and make the model interpretable
  33. 33. What is listing quality? Listing Quality combines Image Features (Light Quality, Sharpness) with Text and Categorical Features.
  35. 35. Light quality • Good Light: The surfaces of the subject are well lit; it's easy to distinguish details on them. • Bad Light: The surfaces of the subject are badly lit; it's hard to distinguish details on them.
  36. 36. Light quality • Talebi, Hossein, and Peyman Milanfar. “NIMA: Neural Image Assessment.” IEEE Transactions on Image Processing 27.8 (2018): 3998–4011. • Response time: 70 ms on CPU • Forked MXNet-Model-Server
  37. 37. What is listing quality? Listing Quality combines Image Features (Light Quality, Sharpness) with Text and Categorical Features.
  39. 39. Sharpness • Variance of Laplacian • Response time: 100 ms • asyncio for I/O bound code • Multiprocessing for CPU bound code
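A minimal sketch of the "variance of Laplacian" sharpness measure named on the slide. The slides do not show the actual implementation; this version uses plain Python lists and the 4-neighbour discrete Laplacian purely for illustration (real code would more likely use an image library), and the function name and the `flat` and `checker` test images are hypothetical.

```python
from statistics import pvariance


def variance_of_laplacian(gray):
    """Sharpness score for a 2-D list of pixel intensities (at least 3x3)."""
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # 4-neighbour discrete Laplacian: neighbours minus 4x the centre pixel.
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    # Blurry images have weak edges, so the Laplacian response varies little.
    return pvariance(responses)


# Hypothetical inputs: a flat image has no edges, a checkerboard has many.
flat = [[128] * 8 for _ in range(8)]
checker = [[255 * ((x + y) % 2) for x in range(8)] for y in range(8)]
```

A higher score means more edge detail; the threshold separating "sharp" from "blurry" has to be chosen per dataset and is not given in the slides.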
  40. 40. asyncio? • A library to write concurrent code using the async/await syntax • Perfect fit for I/O bound applications, like network applications • Can control subprocesses for CPU bound code
  41. 41. What’s wrong with Flask?
      variant              min      p50      p99    p99.9      max     mean  duration  requests
      aiohttp           163.27   247.72   352.75   404.59  1414.08   257.59     20.10     30702
      flask:gevent       85.02   945.17  6587.19  8177.32  8192.75  1207.66     20.08      7491
      flask:meinheld    124.99  2526.55  6753.13  6857.55  6857.55  3036.93     20.10       190
      flask:10          163.05  4419.11  4505.59  4659.46  4667.55  3880.05     20.05      1797
      flask:20          110.23  2368.20  3140.01  3434.39  3476.06  2163.02     20.09      3364
      flask:50          122.17   472.98  3978.68  8599.01  9845.94   541.13     20.10      4606
      flask:100         118.26   499.16  4428.77  8714.60  9987.37   556.77     20.10      4555
      flask:200         112.06   459.85  4493.61  8548.99  9683.27   527.02     20.10      4378
      flask:400         121.63   526.72  3195.23  8069.06  9686.35   580.54     20.06      4336
      flask:800         127.94   430.07  4503.95  8653.69  9722.19   514.47     20.09      4381
      flask:1000        184.76   732.21  1919.72  5323.73  7364.60   786.26     20.04      4121
  42. 42. It’s much easier than you think • I/O bound, runs in main process • CPU bound, runs in subprocesses • Subprocess pool size = number of cores • Each await gives back control to the event loop, enabling other coroutines to run.
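The code shown on this slide did not survive in the transcript. The pattern it describes can be sketched roughly as follows, assuming `loop.run_in_executor` with a `ProcessPoolExecutor`; `fetch_image`, `score_image`, and the fake pixel data are hypothetical stand-ins, not the speaker's code.

```python
import asyncio
import os
from concurrent.futures import ProcessPoolExecutor


def score_image(pixels):
    # CPU bound stand-in for a model; meant to run in a worker process.
    return sum(pixels) / len(pixels)


async def fetch_image(url):
    # I/O bound stand-in for an HTTP download; runs in the main process.
    await asyncio.sleep(0.01)  # this await hands control back to the event loop
    return [len(url), 2 * len(url), 3 * len(url)]


async def handle_request(executor, url):
    pixels = await fetch_image(url)
    loop = asyncio.get_running_loop()
    # Offload the CPU bound step so the event loop can serve other requests.
    return await loop.run_in_executor(executor, score_image, pixels)


async def main(executor, urls):
    return await asyncio.gather(*(handle_request(executor, u) for u in urls))


def run_with_process_pool(urls):
    # Pool size = number of cores, as stated on the slide.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        return asyncio.run(main(pool, urls))
```

Calling `run_with_process_pool(["a", "bb"])` from a script's `if __name__ == "__main__":` block runs the scoring in subprocesses; passing `None` as the executor falls back to asyncio's default thread pool, which is handy for quick experiments.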
  43. 43. What is listing quality? Listing Quality combines Image Features (Light Quality, Sharpness) with Text and Categorical Features.
  45. 45. Listing Quality • Asyncio + multiprocessing • Response time: 15 ms
  46. 46. What is listing quality? Listing Quality combines Image Features (Light Quality, Sharpness) with Text and Categorical Features.
  47. 47. + orchestrate all these models
  48. 48. Model orchestrator • asyncio • Total time: ~1.5 seconds per listing • Diagram: Model orchestrator, Light Quality, Sharpness, Listing Quality, Image store; calls labeled HTTP #1, HTTP #2, HTTP #2, HTTP #3
  49. 49. …deployed it…
  52. 52. …and it works! 🥳
  53. 53. Let’s enable it for all Europe!
  54. 54. Architecture diagram. Components: Input Data, Input and output Service, Model orchestrator, Light Quality, Sharpness, Listing Quality, Image store, Output Data. Labels: HTTP connections; “S3” and “Kinesis and S3” storage; one element marked “New”.
  55. 55. What happened? • Sending requests too fast • Multiplied the number of requests • Overwhelmed the software defined networking of Kubernetes, crashing the cluster • Overwhelmed the models
  60. 60. Almost all predictions failed.
  61. 61. What have we learned? • Going too fast can be a problem • Requests consume memory • Out of memory = Killed • Requests consume CPU time • Overcommitting CPU • Not enough power for software defined network takes down entire cluster
  62. 62. Solution: smooth out traffic using queues And don’t overcommit CPUs for CPU bound loads
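The idea of smoothing traffic with a queue can be illustrated in-process. This is only a sketch: it uses `asyncio.Queue` as a stand-in for a message queue, and the worker count, function names, and sleep-based "model call" are assumptions, not details from the talk.

```python
import asyncio


async def producer(queue, listing_ids):
    # A burst of work is enqueued immediately; nothing downstream is called yet.
    for listing_id in listing_ids:
        await queue.put(listing_id)


async def worker(queue, results):
    # Each worker drains the queue at its own pace, one message at a time.
    while True:
        listing_id = await queue.get()
        try:
            await asyncio.sleep(0.001)  # stand-in for calling the models
            results.append(listing_id)
        finally:
            queue.task_done()


async def run(listing_ids, n_workers=2):
    queue = asyncio.Queue()
    results = []
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(n_workers)]
    await producer(queue, listing_ids)
    await queue.join()  # wait until every queued message has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

However large the burst, the models only ever see `n_workers` requests at a time; the queue absorbs the difference between arrival rate and processing rate.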
  63. 63. Architecture diagram with queues. Components: Input Data, Input Reader, SQS Queue, Model orchestrator, Light Quality, Sharpness, Listing Quality, Image store, SQS Queue, Output Writer, Output Data. Labels: HTTP connections; “S3” and “Kinesis and S3” storage; one element marked “New”.
  64. 64. What happened? • Pulling messages too fast from queue • Overwhelmed the models
  67. 67. Solution: limit the number of messages being processed at once
  68. 68. … by using a semaphore Only MAX_CONCURRENT_MESSAGES can be inside the semaphore block at once The rest has to wait for a “free spot”
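The slide's code is missing from the transcript; a minimal sketch of the semaphore pattern it describes might look like this. `MAX_CONCURRENT_MESSAGES` is the name used on the slide, but its value here, the `stats` bookkeeping, and the sleep-based "model call" are illustrative assumptions.

```python
import asyncio

MAX_CONCURRENT_MESSAGES = 3  # hypothetical value; the slides do not give one


async def process_message(message, semaphore, stats):
    async with semaphore:
        # At most MAX_CONCURRENT_MESSAGES coroutines are inside this block at once;
        # the rest wait here for a free spot.
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # stand-in for calling the models
        stats["in_flight"] -= 1
    return message


async def process_all(messages):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_MESSAGES)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(
        *(process_message(m, semaphore, stats) for m in messages)
    )
    return results, stats["peak"]
```

The `peak` counter is only there to make the limit observable; it never exceeds `MAX_CONCURRENT_MESSAGES` no matter how many messages are submitted.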
  69. 69. It actually works now! 🎉
  70. 70. Until it didn’t…
  71. 71. What happened? • Idle waiting for 5 minutes per request • Sometimes fail • Cluster autoscaler decides it only needs a few pods
  75. 75. Solution: let it fail, fast. Using timeouts
  76. 76. … by using asyncio.wait_for After PREDICTION_TIMEOUT an exception is raised We set PREDICTION_TIMEOUT to be a few seconds
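Again the slide's code is not in the transcript. A minimal sketch of failing fast with `asyncio.wait_for` follows; `PREDICTION_TIMEOUT` is the slide's name, while its value (shortened here so the demo runs quickly), the `predict` stand-in, and returning `None` on timeout are assumptions.

```python
import asyncio

PREDICTION_TIMEOUT = 0.05  # seconds; hypothetical, the talk only says "a few seconds"


async def predict(delay):
    await asyncio.sleep(delay)  # stand-in for an HTTP call to a model
    return "ok"


async def predict_or_fail_fast(delay):
    try:
        # wait_for cancels the inner call and raises once the timeout expires.
        return await asyncio.wait_for(predict(delay), timeout=PREDICTION_TIMEOUT)
    except asyncio.TimeoutError:
        # Give up quickly instead of idling; how the message is recovered
        # afterwards is not shown in the slides.
        return None
```

With a timeout in place a stuck request costs at most `PREDICTION_TIMEOUT` instead of minutes, so workers stay busy rather than sitting idle on a dead call.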
  77. 77. Will it finally work?
  78. 78. Yes! 💪
  79. 79. Lesson #1: “Feature Complete” is not “Production Ready”
  80. 80. Lesson #2: Scalable doesn’t just mean Fast
  81. 81. Lesson #3: If you want to go fast, you need to drive smoothly
  82. 82. Lesson #4: Fail fast, recover later
  83. 83. Thank you! ☺ Questions? 🤔 @paolino github.com/crmne paolino.me
