This document summarizes the development of an image processing API and ML system built to process images in under 2 seconds while providing visibility, persistence, scalability, and extensibility. It describes splitting the API and the ML into separate services, reducing processing time from 4 seconds to 1 second through changes such as adopting Kafka, scaling the system, and troubleshooting performance spikes, message broker quirks, and database issues. Lessons learned include distinguishing CPU-bound from IO-bound tasks and the trade-offs of Go versus Python. The goal of under 2 seconds per request was eventually achieved.
10. Any issues there?
1. Request monitoring (?)
2. Hard to reason about GPU usage (varies from 3 to 7 GB)
3. ~5 GB RAM out of the box (how will it scale?)
4. Redis is in-memory storage (temporary)
5. Threading != Parallelism
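The "Threading != Parallelism" point can be shown with a small, hypothetical benchmark: under CPython's GIL, threads do not speed up CPU-bound work, they only interleave it (the task and numbers below are illustrative, not from the talk).

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_task(n: int) -> int:
    # Pure-Python CPU-bound loop; holds the GIL while computing.
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 1_000_000

start = time.perf_counter()
serial = [cpu_task(N) for _ in range(4)]
serial_time = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(cpu_task, [N] * 4))
threaded_time = time.perf_counter() - start

# Threads interleave but do not run Python bytecode in parallel,
# so the threaded version is not meaningfully faster.
print(f"serial:   {serial_time:.2f}s")
print(f"threaded: {threaded_time:.2f}s")
```

For CPU-bound model inference, this is why more threads in one process did not help.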
12. What if we split the API and the ML into separate services?
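A minimal sketch of that split, with an in-process queue standing in for the real broker: the API service only validates and enqueues, while a separate ML worker consumes and predicts (the names `api_submit` and `ml_worker` are illustrative, not from the talk).

```python
import queue
import threading
import uuid

jobs: "queue.Queue" = queue.Queue()   # stand-in for a message broker
results: dict = {}                    # stand-in for persistent storage

def api_submit(image_url: str) -> str:
    """API service: accept the request, enqueue it, return immediately."""
    request_id = str(uuid.uuid4())
    jobs.put({"id": request_id, "image_url": image_url})
    return request_id

def ml_worker() -> None:
    """ML service: consume jobs and run the (fake) model."""
    while True:
        job = jobs.get()
        if job is None:  # shutdown sentinel
            break
        results[job["id"]] = f"label-for:{job['image_url']}"
        jobs.task_done()

worker = threading.Thread(target=ml_worker)
worker.start()
rid = api_submit("https://example.com/cat.jpg")
jobs.join()           # wait until the worker has drained the queue
jobs.put(None)        # stop the worker
worker.join()
print(results[rid])
```

Either side can now be scaled and deployed independently, which is the point of the split.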
Part 2. A new hope
13. Requirements
Lalafo case
1. Single image/request processing time: < 2 seconds
2. Visibility
3. Persistence
4. Scalability
5. Make it extendable for new features
a. Price prediction
b. Similarity search
c. Segmentation
6. SDK friendly (well documented, tested, etc.)
16. Requirements
Lalafo case
1. Single image/request processing time: < 2 seconds
2. Visibility: decouple request and prediction
3. Persistence
4. Scalability
5. Make it extendable for new features
a. Price prediction
b. Similarity search
c. Segmentation
6. SDK friendly (well documented, tested, etc.)
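"Decouple request and prediction" maps to an asynchronous API contract: the submit endpoint returns a request id right away, and a separate status endpoint reports progress. A minimal sketch of that contract (the endpoint and field names are assumptions, and a dict stands in for the persistent store):

```python
import uuid

# In-memory stand-in for a persistent store (requirement 3).
store: dict = {}

def submit(image_url: str) -> dict:
    """POST /predictions: register the request, respond immediately."""
    request_id = str(uuid.uuid4())
    store[request_id] = {"status": "pending", "result": None}
    return {"request_id": request_id}

def complete(request_id: str, prediction: str) -> None:
    """Called by the ML worker once the prediction is ready."""
    store[request_id] = {"status": "done", "result": prediction}

def status(request_id: str) -> dict:
    """GET /predictions/<id>: the client polls for the outcome."""
    return store.get(request_id, {"status": "unknown", "result": None})

rid = submit("https://example.com/cat.jpg")["request_id"]
print(status(rid))   # {'status': 'pending', 'result': None}
complete(rid, "cat")
print(status(rid))   # {'status': 'done', 'result': 'cat'}
```

New features (price prediction, similarity search, segmentation) can reuse the same contract with different workers.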
17. 2 seconds per request
Part 3. The Empire Strikes Back
18. 4 seconds per request
[Architecture diagram: API and listeners, with stage latencies of 1 second and 0.01 sec]
53. Issues to solve
1. Occasional spikes in performance (GC, network latency)
2. Message broker quirks (Kafka rebalancing, offset management, etc.)
3. How to handle DB migrations
4. Something we are not aware of yet
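The Kafka offset point mostly comes down to when the consumer commits: committing only after successful processing gives at-least-once delivery across crashes and rebalances. A broker-free sketch of that rule (the `Consumer` class below is an in-memory stand-in, not the real Kafka client API):

```python
class Consumer:
    """In-memory stand-in for a partition consumer with manual commits."""
    def __init__(self, log):
        self.log = log
        self.committed = 0   # next offset to read after a restart

    def poll(self, offset):
        return self.log[offset] if offset < len(self.log) else None

    def commit(self, offset):
        self.committed = offset

def run(consumer, handle, start):
    offset = start
    while (msg := consumer.poll(offset)) is not None:
        handle(msg)               # may raise; offset is NOT yet committed
        offset += 1
        consumer.commit(offset)   # commit only after successful processing
    return offset

log = ["img-1", "img-2", "img-3"]
consumer = Consumer(log)
processed = []
crashed = [False]

def flaky(msg):
    # Fail once on img-2 to simulate a worker dying mid-batch.
    if msg == "img-2" and not crashed[0]:
        crashed[0] = True
        raise RuntimeError("worker died")
    processed.append(msg)

try:
    run(consumer, flaky, consumer.committed)
except RuntimeError:
    pass
# The restart resumes from the last committed offset: img-2 is
# redelivered, img-1 is not (at-least-once, nothing lost).
run(consumer, flaky, consumer.committed)
print(processed)  # ['img-1', 'img-2', 'img-3']
```

Handlers therefore need to be idempotent, since a crash between processing and commit causes redelivery.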
54. Lessons learnt
1. CPU-bound tasks != IO-bound tasks ¯\_(ツ)_/¯
2. High coupling - low cohesion
3. You need to know how to cook MongoDB
4. Go is not as obvious, nor as library-rich, as Python
5. Simple != Easy
6. Concurrency != Parallelism (obviously)
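Lessons 1 and 6 fit in one sketch: threads do help IO-bound work, because waiting releases the GIL, and that is concurrency (overlapped waiting) rather than CPU parallelism. A hypothetical timing example, with `time.sleep` standing in for a network call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def io_task(_):
    # Simulated network call: sleeping releases the GIL.
    time.sleep(0.2)

start = time.perf_counter()
for i in range(4):
    io_task(i)
serial_time = time.perf_counter() - start    # ~0.8s: waits add up

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(io_task, range(4)))
threaded_time = time.perf_counter() - start  # ~0.2s: waits overlap

print(f"serial: {serial_time:.2f}s, threaded: {threaded_time:.2f}s")
```

The same threads that are useless for CPU-bound inference (see lesson 1) make IO-bound API calls roughly N times faster, which is why classifying each task correctly mattered.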