Uklon in numbers
● 130+ Engineers
● 12 Product Teams
● 16M Android/iOS downloads
● 1.5M+ Riders DAU
● 200k+ Drivers DAU
● 30+ microservices
● 3 Countries
● 30 Cities
Uklon
RiderApp DriverApp
What is the report about?
How to reduce CPU consumption tenfold through stateful processing while ensuring high reliability
Agenda
● What solutions do our competitors employ?
● Scaling of stateful services
● Reliability of stateful services
● Workloads that make the stateless approach inefficient
● Basic concepts
Workloads that make the stateless approach inefficient
Several challenges within ride-hailing are:
1. Massive, frequent write operations are needed to track objects' current locations. Because drivers can move as fast as 20 meters per second, it is important to update their locations every second.
2. A k-nearest neighbour (kNN) query poses tremendous challenges compared to a simple Get query in a key-value data store such as Redis.
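The difference between a key lookup and a kNN query can be sketched with a small in-memory grid index. This is only an illustration, not Uklon's implementation: the `GridIndex` class, the cell size, the search radius, and the crude planar distance are all assumptions made for the example.

```python
import math
from collections import defaultdict

CELL_DEG = 0.01  # hypothetical cell size in degrees (roughly 1 km)


class GridIndex:
    """Toy in-memory spatial index: drivers bucketed by grid cell."""

    def __init__(self):
        self.positions = {}            # driver_id -> (lat, lon), plain key lookup
        self.cells = defaultdict(set)  # (cell_x, cell_y) -> driver ids

    @staticmethod
    def _cell(lat, lon):
        return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))

    def update(self, driver_id, lat, lon):
        """Frequent write: move the driver from its old cell to the new one."""
        old = self.positions.get(driver_id)
        if old is not None:
            self.cells[self._cell(*old)].discard(driver_id)
        self.positions[driver_id] = (lat, lon)
        self.cells[self._cell(lat, lon)].add(driver_id)

    def nearest(self, lat, lon, k, rings=2):
        """Return up to k driver ids found within `rings` cells of the query point."""
        cx, cy = self._cell(lat, lon)
        candidates = []
        for dx in range(-rings, rings + 1):
            for dy in range(-rings, rings + 1):
                for driver_id in self.cells.get((cx + dx, cy + dy), ()):
                    d_lat, d_lon = self.positions[driver_id]
                    # Crude planar distance in degrees, good enough for a sketch.
                    dist2 = (d_lat - lat) ** 2 + (d_lon - lon) ** 2
                    candidates.append((dist2, driver_id))
        candidates.sort()
        return [driver_id for _, driver_id in candidates[:k]]


index = GridIndex()
index.update(1, 50.4501, 30.5234)
index.update(2, 50.4510, 30.5240)
index.update(3, 50.5000, 30.6000)   # far away, outside the search radius
index.update(1, 50.4605, 30.5305)   # driver 1 moves
nearest_two = index.nearest(50.4505, 30.5236, k=2)
```

A remote key-value store answers `positions[driver_id]` cheaply, but the `nearest` query needs the cell index to live next to the data, which is what the stateful approach provides.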
Feature #1
Orders Dispatching
Find the best driver for the order
Feature #2
Orders Broadcasting
Streaming your order to many drivers
DriverApp
Feature #3
Batch dispatching
The Process of Order Dispatching with Batch Windows
[Figure: greedy algorithm (waits of 2 min and 9 min, total wait time = 11 min) vs. batching algorithm (waits of 4 min and 4 min, total wait time = 8 min)]
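The totals of 11 min and 8 min can be reproduced with a small sketch. The ETA matrix below is hypothetical, chosen only to be consistent with the numbers on the slide, and the brute-force search over permutations is a stand-in for whatever assignment algorithm the real dispatcher uses.

```python
from itertools import permutations

# Hypothetical pickup ETAs in minutes: eta[order][driver]
eta = {
    "order_1": {"driver_A": 2, "driver_B": 4},
    "order_2": {"driver_A": 4, "driver_B": 9},
}


def greedy(eta):
    """Give each order, in arrival order, the nearest driver that is still free."""
    free = set(next(iter(eta.values())))
    assignment = {}
    for order, etas in eta.items():
        driver = min(free, key=lambda d: etas[d])
        assignment[order] = driver
        free.remove(driver)
    return assignment


def batched(eta):
    """Collect orders in a window, then pick the assignment with the lowest total wait."""
    orders = list(eta)
    drivers = list(next(iter(eta.values())))
    best = min(
        permutations(drivers, len(orders)),
        key=lambda perm: sum(eta[o][d] for o, d in zip(orders, perm)),
    )
    return dict(zip(orders, best))


def total_wait(eta, assignment):
    return sum(eta[o][d] for o, d in assignment.items())


greedy_total = total_wait(eta, greedy(eta))    # 2 + 9 = 11
batched_total = total_wait(eta, batched(eta))  # 4 + 4 = 8
```

Brute force is factorial in the number of orders, so it only works for tiny batches; it is used here purely to make the comparison concrete.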
Feature #4
Driver ETA Tracker
Requirements:
1. Active orders: tens of thousands
2. Drivers send their location every 2-5 seconds
Other Workloads
1. Order offers. Find the best driver near you.
2. Order broadcasts. Fan out orders to multiple drivers.
3. Order chaining. Find the next order for the driver while they complete the current one.
4. Order batching (optimization). Reduce the total waiting time for all passengers.
5. Sector queue (airports, train stations).
6. Driver ETA tracking for accepted orders.
7. Matching the driver's GPS location to a map graph node.
Simplified Overview of
the Architecture
Stateful architectures
Open Problems
● Load balancing algorithms
● Scalability
○ Partitioning
○ Replication
● Fault tolerance and Cold start
Key concept
1. Local state is stored in in-memory KV structures.
2. The local state is restored from the durable log. In some cases, local state changes may have been checkpointed to a remote KV store (or to a separate Kafka topic).
3. Local state updates occur within a single thread: no concurrency, monotonic writes.
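The three points above can be sketched as follows. This is a minimal illustration under stated assumptions: the log is a plain Python list standing in for a Kafka partition, and the `DriverState` class and event fields are hypothetical.

```python
class DriverState:
    """In-memory KV state owned by exactly one processing thread."""

    def __init__(self):
        self.drivers = {}       # driver_id -> latest event
        self.last_offset = -1   # offset of the last applied log record

    def apply(self, offset, event):
        # Monotonic writes: records are applied strictly in log order,
        # and a record that was already applied is skipped.
        if offset <= self.last_offset:
            return
        if event["type"] == "removed":
            self.drivers.pop(event["driver"], None)
        else:
            self.drivers[event["driver"]] = event
        self.last_offset = offset


def restore(log):
    """Rebuild local state by replaying the durable log from the start."""
    state = DriverState()
    for offset, event in enumerate(log):
        state.apply(offset, event)
    return state


log = [
    {"type": "location", "driver": 1, "lat": 50.45, "lon": 30.52},
    {"type": "location", "driver": 2, "lat": 50.40, "lon": 30.60},
    {"type": "location", "driver": 1, "lat": 50.46, "lon": 30.53},
    {"type": "removed", "driver": 2},
]
state = restore(log)
```

Because only one thread calls `apply`, no locks are needed, and replaying the same log always produces the same state.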
NFR (Kyiv only)
Writes
1.1) 5000-10000 rps
1.2) 100-500 rps
Reads
2.1) 500 rps (each request handles 100-500 drivers)
2.2) Fetch 50000-200000 rows/sec (100-400 MB/sec)
Driver entity: 2 KB (50th percentile) / 13 KB (99th percentile)
Total size for 100K drivers = 200 MB
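The throughput and size figures follow from the median entity size. A quick check, assuming decimal units (1 KB = 1000 bytes) and that "100K" refers to driver entities:

```python
KB = 1_000        # decimal units, matching the round numbers on the slide
MB = 1_000_000

entity_p50_bytes = 2 * KB
drivers = 100_000

state_size_mb = drivers * entity_p50_bytes / MB      # total in-memory state
read_low_mb_s = 50_000 * entity_p50_bytes / MB       # lower bound of fetch rate
read_high_mb_s = 200_000 * entity_p50_bytes / MB     # upper bound of fetch rate
```

This gives 200 MB of state and 100-400 MB/sec of reads, which is the volume a remote KV store would have to serialize and send over the network.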
Key differences
Stateless (remote KV)
● Provides a GET/PUT/DELETE API
● High CPU cost due to marshalling and serialization
● Additional network latency
● Frequently necessitates additional local caching
Stateful (in-memory/local KV)
● Domain-specific API, e.g.:
○ Find nearest drivers
○ Calculate ETA
● Data locality
● Shared-nothing
Access patterns for in-memory KV
1. Key lookup
2. Index seek (Offers, Broadcast)
3. Full scans / range scans
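Each access pattern maps onto an ordinary in-memory structure. The sketch below is illustrative only; the record fields (`city`, `rating`) and the choice of a dict, a set-valued secondary index, and a sorted list are assumptions.

```python
import bisect

drivers = {  # primary store: driver_id -> record
    1: {"city": "kyiv", "rating": 4.9},
    2: {"city": "lviv", "rating": 4.7},
    3: {"city": "kyiv", "rating": 4.5},
}

# Secondary index kept alongside the primary store.
by_city = {}
for driver_id, rec in drivers.items():
    by_city.setdefault(rec["city"], set()).add(driver_id)

# Sorted (rating, driver_id) pairs support range scans via binary search.
by_rating = sorted((rec["rating"], driver_id) for driver_id, rec in drivers.items())

key_lookup = drivers[2]                       # 1. key lookup
index_seek = sorted(by_city["kyiv"])          # 2. index seek
lo = bisect.bisect_left(by_rating, (4.6,))    # 3. range scan: rating >= 4.6
range_scan = [driver_id for _, driver_id in by_rating[lo:]]
full_scan = [d for d, rec in drivers.items() if rec["rating"] > 4.0]
```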
Concept #1: Co-partitioning
Two topics are described as co-partitioned if:
1. Their keys have the same schema
2. They have the same number of partitions
3. Their producers use the same partitioner
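Why the three conditions matter can be shown with a toy partitioner. CRC32 is used here only as a stand-in hash (Kafka's default partitioner uses murmur2), and the partition count of 8 and the second topic are assumptions for the example.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in partitioner: CRC32 of the key modulo the partition count.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 8  # both topics must use the same count

driver_ids = ["12345", "67890", "24680"]

# Same key schema, same partition count, same partitioner on both topics.
locations_partitions = {d: partition_for(d, NUM_PARTITIONS) for d in driver_ids}
orders_partitions = {d: partition_for(d, NUM_PARTITIONS) for d in driver_ids}

co_partitioned = locations_partitions == orders_partitions

# With a different partition count the mapping is no longer guaranteed to match.
mismatched = {d: partition_for(d, 6) for d in driver_ids}
```

When the topics are co-partitioned, a consumer that owns partition N of both topics sees every event for a given driver locally, so no cross-partition lookup is needed.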
Concept #2: Re-keying partitions
● Related events are not co-partitioned
● Well-balanced partitions
● Re-keying can produce unbalanced partitions and, as a result, unbalanced consumers
● Achieves data locality for the consumer
Concept #3: Filtering + Enriching
DriverLocation {
"driver": 12345,
"latitude": 50.30846,
"longitude": 30.53419
}
DriverETA {
"driver": 12345,
"latitude": 50.30846,
"longitude": 30.53419,
"order": 98765,
"eta": "2 min"
}
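The transformation from DriverLocation to DriverETA can be sketched as a filter followed by an enrichment against local state. The `active_orders` map and the `estimate_eta_minutes` placeholder are hypothetical; a real tracker would compute the ETA from routing data.

```python
active_orders = {12345: 98765}  # local state: driver_id -> accepted order id


def estimate_eta_minutes(location, order_id):
    # Placeholder: a real tracker would consult a routing engine.
    return 2


def to_driver_eta(location):
    """Filter: drop drivers without an active order. Enrich: add order and ETA."""
    order_id = active_orders.get(location["driver"])
    if order_id is None:
        return None
    eta = estimate_eta_minutes(location, order_id)
    return {**location, "order": order_id, "eta": f"{eta} min"}


events = [
    {"driver": 12345, "latitude": 50.30846, "longitude": 30.53419},
    {"driver": 55555, "latitude": 50.40000, "longitude": 30.60000},
]
driver_etas = [e for e in map(to_driver_eta, events) if e is not None]
```

Only the first event survives the filter, since driver 55555 has no accepted order in the local state.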
How to scale?
Driver Dispatching
Scalability
Some sharding strategies:
1. Geospatial indexing (geohash, S2, H3)
2. city_id (region)
Consider the following points when you design a data partitioning scheme:
1. Minimize cross-partition data access operations
2. Minimize cross-partition joins
Partitioning by Region
Possible challenges:
● Downtime during rebalance: scale-out, rolling update
● Unbalanced load: the load from Kyiv is equivalent to the load from all other cities of Ukraine combined
Attempted fix:
Partitioning by Region + Replication
Replication:
● Standalone consumers
● No partition rebalancing
● No downtime
● Replication overhead is less than 0.1 CPU per pod
● Reduced requirements for cold recovery
Scalability
1. Scalability - adding Kafka partitions and deploying separate Shard-Instances for cities/countries
2. Elasticity - scale-out of consumers within a Shard
Reliability?
Replica synchronization
● State-based CRDT
● Last write wins (LWW)
● Optimistic replication (can become temporarily inconsistent)
● Strong Eventual Consistency (SEC)
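A last-write-wins register is one of the simplest state-based CRDTs and shows how replicas can converge without coordination. The sketch below is generic, not the talk's implementation; the field names and the use of the replica id as a tie-breaker are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LWWRegister:
    value: object
    timestamp: int   # e.g. event time in milliseconds
    replica: str     # tie-breaker so that merge is deterministic

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        """State-based merge: keep the entry with the larger (timestamp, replica)."""
        mine = (self.timestamp, self.replica)
        theirs = (other.timestamp, other.replica)
        return self if mine >= theirs else other


a = LWWRegister("free", 1000, "replica-a")
b = LWWRegister("busy", 1005, "replica-b")

merged_ab = a.merge(b)
merged_ba = b.merge(a)
```

The merge is commutative and idempotent, so replicas that have seen the same updates in any order end up with the same value, which is the strong eventual consistency property listed above.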
Problems with Replication Lag?
● Reading Your Own Writes
● Monotonic Reads
● Consistent Prefix Reads
Depends on your Domain
Fault tolerance with local state
1. Single infrastructure dependency: Kafka (a battle-tested streaming platform with high throughput, fault tolerance, and scalability).
2. When a task instance restarts, its local state is repopulated by reading its own Kafka log.
3. Yes, reading and repopulating will take some time.
Controlling State Size
1. Key-Based Retention
a. Aggressive topic compaction
b. Tombstones
2. Time-Based Retention
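Key-based retention can be illustrated with a toy compaction pass over a list of (key, value) records, where a `None` value acts as a tombstone. This models the idea only; in Kafka, compaction runs in the background on the broker and tombstones are kept for a configurable period before being dropped.

```python
def compact(log):
    """Keep only the latest record per key; a None value is a tombstone."""
    latest = {}
    for key, value in log:
        latest.pop(key, None)   # re-insert so that dict order follows the last write
        latest[key] = value
    return [(k, v) for k, v in latest.items() if v is not None]


log = [
    ("driver-1", {"status": "online"}),
    ("driver-2", {"status": "online"}),
    ("driver-1", {"status": "busy"}),
    ("driver-2", None),         # tombstone: driver went offline
]
compacted = compact(log)
```

After compaction only the latest state of driver-1 remains, so a restarting consumer replays one record instead of four.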
How long does it take to rebuild the state?
1. Driver state retention: 1 hour
2. Repopulate local state:
a. Read driver-state from the beginning of the topic: 400k msgs (8 partitions)
b. Read driver-locations starting from 'now - 5 sec'
3. You need to implement your own "live processing started" event
"Live processing started "dispatching.driver-summary-events [0]"
after 00:00:01.7875633 sec (50142 msgs)"
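One way to produce such a "live processing started" event is to capture the partition's end offset at startup and report once the replay reaches it. The sketch below simulates this with an in-memory list; the `rebuild` function, its parameters, and the record shape are hypothetical, and the end offset follows the convention of "offset of the next record to be written".

```python
import time


def rebuild(records, end_offset, apply):
    """Replay records until end_offset (captured at startup), then report timing."""
    started = time.monotonic()
    count = 0
    for offset, record in records:
        apply(record)
        count += 1
        if offset >= end_offset - 1:
            break  # caught up with the log as it was at startup
    elapsed = time.monotonic() - started
    return {"event": "live processing started", "msgs": count, "elapsed_sec": elapsed}


state = {}
log = [(i, {"driver": i % 100, "seq": i}) for i in range(1000)]
event = rebuild(
    iter(log),
    end_offset=len(log),
    apply=lambda r: state.__setitem__(r["driver"], r),
)
```

Until this event fires, the instance should not serve reads, because its local state is still behind the log.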
An SLA level of 99.998% uptime/availability results in the following period of allowed downtime/unavailability:
■ Daily: 1.7 s
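The daily budget is plain arithmetic on the availability target. A short check (the helper name is illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(availability_percent, period_seconds):
    return period_seconds * (1 - availability_percent / 100)


daily = allowed_downtime_seconds(99.998, SECONDS_PER_DAY)  # about 1.73 s
```

A state rebuild of roughly 1.79 s is therefore in the same range as the whole daily downtime budget at this SLA level, which is why rebuild time matters.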
Traffic Jams requirements
1. Reduce the cost of the Google Maps API
2. High rate of writes (20k online drivers)
3. Update traffic information every 5 min
Stateful processing
● Grouping messages by partition key
● Aggregating messages in hopping window
● MapReduce
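Grouping by key and aggregating in a hopping window can be sketched as below. The 5-minute window mirrors the traffic update interval from the requirements; the 1-minute hop, the `segment` key, and the event fields are assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_SEC = 300  # 5-minute window
HOP_SEC = 60      # hypothetical hop size


def window_starts(ts):
    """Start times of all hopping windows [start, start + WINDOW_SEC) that contain ts."""
    last = ts - ts % HOP_SEC
    first = last - WINDOW_SEC + HOP_SEC
    return range(max(first, 0), last + 1, HOP_SEC)


def average_speed(events):
    """Map: group by (road segment, window). Reduce: mean speed per group."""
    groups = defaultdict(list)
    for e in events:
        for start in window_starts(e["ts"]):
            groups[(e["segment"], start)].append(e["speed_kmh"])
    return {key: sum(v) / len(v) for key, v in groups.items()}


events = [
    {"segment": "A", "ts": 10, "speed_kmh": 40},
    {"segment": "A", "ts": 70, "speed_kmh": 20},
    {"segment": "B", "ts": 70, "speed_kmh": 60},
]
speeds = average_speed(events)
```

Because windows overlap, the event at `ts=70` contributes to both the window starting at 0 and the one starting at 60.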
Driver ETA Tracker
Similar workload using Redis
https://aws.amazon.com/blogs/database/optimize-redis-client-performance-for-amazon-elasticache/?utm_source=pocket_saves
○ Client: c5.4xlarge (16 vCPUs, 32 GiB)
○ Redis: 3 nodes, r6g.2xlarge (8 vCPUs, 64 GiB)
Resource Usage
Future work
Although the current design is simple, it leaves room to change key aspects:
○ Replication + Sharding
Lessons learned
1. Stateful is not always difficult
2. Simple and reliable solution
3. Easy to maintain
4. Much more resource-efficient: 2 vCPUs for all dispatching instead of a Redis cluster with 16-24 vCPUs
5. What about MS Orleans?
The Twelve-Factor App
Misleading
Space-based architecture?
https://www.amazon.com/_/dp/1492043451?smid=ATVPDKIKX0DER&_encoding=UTF8&tag=oreilly20-20
Contacts
Oleksandr Chumak
Solution Architect
https://www.linkedin.com/in/oleksandr-chumak-45967588/
facebook.com/achumak.dev

"Stateful app as an efficient way to build dispatching for riders and drivers", Oleksandr Chumak



 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

"Stateful app as an efficient way to build dispatching for riders and drivers", Oleksandr Chumak

  • 2. Uklon in numbers: 12 product teams, 130+ engineers, 16M Android/iOS downloads, 1.5M+ riders DAU, 200k+ drivers DAU, 30+ microservices, 3 countries, 30 cities
  • 5. What is the report about? How to reduce CPU consumption tenfold with stateful processing while ensuring high reliability
  • 6. What are the solutions employed by our competitors?
  • 7. Agenda: workloads that make the stateless approach inefficient; basic concepts; scaling of stateful services; reliability of stateful services
  • 8. Workloads that make the stateless approach inefficient
  • 9. Several challenges within ride-hailing are… 1. Massive, frequent write operations are needed to track the objects' current locations. As drivers can move as fast as 20 meters per second, it is important to update drivers' locations every second. 2. A K-nearest neighbour (kNN) query poses tremendous challenges, compared to a simple Get query, in a key-value data store such as Redis.
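To illustrate why a kNN query is cheap when driver locations live in process memory, here is a minimal sketch of a grid index. The class name `DriverIndex`, the cell size and the `rings` parameter are assumptions made for this example, not details from the talk; the search only looks at neighbouring cells, so it is an approximate kNN.

```python
import heapq
import math
from collections import defaultdict

CELL_DEG = 0.01  # hypothetical grid cell size in degrees (about 1 km of latitude)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


class DriverIndex:
    """In-memory grid index: cell -> {driver_id: (lat, lon)}."""

    def __init__(self):
        self.cells = defaultdict(dict)
        self.cell_of = {}

    @staticmethod
    def _cell(lat, lon):
        return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))

    def update(self, driver_id, lat, lon):
        """Apply one location update; a plain dict write, no network hop."""
        new_cell = self._cell(lat, lon)
        old_cell = self.cell_of.get(driver_id)
        if old_cell is not None and old_cell != new_cell:
            del self.cells[old_cell][driver_id]
        self.cells[new_cell][driver_id] = (lat, lon)
        self.cell_of[driver_id] = new_cell

    def nearest(self, lat, lon, k, rings=2):
        """Return up to k (distance_m, driver_id) pairs from nearby cells."""
        cx, cy = self._cell(lat, lon)
        candidates = []
        for dx in range(-rings, rings + 1):
            for dy in range(-rings, rings + 1):
                for driver_id, (dlat, dlon) in self.cells.get((cx + dx, cy + dy), {}).items():
                    candidates.append((haversine_m(lat, lon, dlat, dlon), driver_id))
        return heapq.nsmallest(k, candidates)
```

Both the write path (`update`) and the read path (`nearest`) touch only local dictionaries, which is the property a remote key-value store cannot offer for this kind of query.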
  • 10. Feature #1: Orders Dispatching. Find the best driver for the order.
  • 11. Feature #2: Orders Broadcasting. Streaming your order to many drivers (DriverApp).
  • 12. Feature #3: Batch dispatching. The process of order dispatching with batch windows. Greedy algorithm: 2 min + 9 min, total wait time = 11 min. Batching algorithm: 4 min + 4 min, total wait time = 8 min.
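The difference between the two totals can be reproduced with a small sketch. The ETA matrix below is a hypothetical arrangement chosen so that it yields the slide's numbers (2 + 9 = 11 and 4 + 4 = 8); the order and driver names are invented, and the brute-force search stands in for whatever batching algorithm is actually used.

```python
from itertools import permutations

# Hypothetical pickup ETAs in minutes: eta[order][driver].
eta = {
    "order_A": {"driver_1": 2, "driver_2": 4},
    "order_B": {"driver_1": 4, "driver_2": 9},
}


def greedy(eta):
    """Assign each order, in arrival sequence, to its nearest free driver."""
    free = set(next(iter(eta.values())))
    assignment = {}
    for order, row in eta.items():
        best = min(free, key=lambda d: row[d])
        assignment[order] = best
        free.remove(best)
    return assignment


def batched(eta):
    """Collect orders in a window, then pick the assignment with minimal total wait."""
    orders = list(eta)
    drivers = list(next(iter(eta.values())))
    best = min(
        permutations(drivers, len(orders)),
        key=lambda p: sum(eta[o][d] for o, d in zip(orders, p)),
    )
    return dict(zip(orders, best))


def total_wait(eta, assignment):
    return sum(eta[o][d] for o, d in assignment.items())


print(total_wait(eta, greedy(eta)))   # 11
print(total_wait(eta, batched(eta)))  # 8
```

Brute force over permutations is only practical for tiny batches; it is used here to keep the example short.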
  • 13. Feature #4: Driver ETA Tracker. Requirements: 1. Active orders: tens of thousands. 2. Drivers send their location every 2-5 seconds.
  • 14. Other workloads: 1. Order offers. Find the best driver near you. 2. Order broadcasts. Fan-out orders to multiple drivers. 3. Order chaining. Find the next order for the driver, while completing the current one. 4. Order batching (optimization). Reduce the total waiting time for all passengers. 5. Sector queue (airports, train stations). 6. Driver ETA tracking for accepted order. 7. Matching driver’s GPS location to map graph node.
  • 15. Simplified Overview of the Architecture Stateful
  • 16. Stateful architectures: open problems ● Load balancing algorithms ● Scalability ○ Partitioning ○ Replication ● Fault tolerance and cold start
  • 17. Key concept: 1. Local state is stored in in-memory KV structures. 2. The local state is restored from the durable log. In some cases, local state changes may have been checkpointed to a remote KV store (or to a separate Kafka topic). 3. Local state updates occur within a single thread: no concurrency, monotonic writes.
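The three points above can be sketched as follows. The log is modelled as a plain Python list of `(key, value)` records and the names `LocalState` and `restore` are assumptions for this example; a real service would read the records from a Kafka partition instead.

```python
from dataclasses import dataclass, field


@dataclass
class LocalState:
    """In-memory KV state owned by a single thread."""
    drivers: dict = field(default_factory=dict)
    offset: int = -1  # offset of the last applied log record

    def apply(self, offset, key, value):
        if offset <= self.offset:
            return  # already applied: writes stay monotonic
        if value is None:
            self.drivers.pop(key, None)  # a None value deletes the key
        else:
            self.drivers[key] = value
        self.offset = offset


def restore(log, checkpoint=None):
    """Rebuild state by replaying the log, optionally starting from a checkpoint."""
    if checkpoint is not None:
        state = LocalState(dict(checkpoint.drivers), checkpoint.offset)
    else:
        state = LocalState()
    for offset, (key, value) in enumerate(log):
        state.apply(offset, key, value)
    return state


log = [("d1", "free"), ("d2", "free"), ("d1", "busy"), ("d2", None)]
full = restore(log)
from_checkpoint = restore(log, checkpoint=restore(log[:2]))
print(full.drivers, full == from_checkpoint)
```

Replaying from a checkpoint yields the same state as replaying the whole log, because records at or below the checkpointed offset are skipped.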
  • 18. NFR (Kyiv only). Writes: 1.1) 5000-10000 rps; 1.2) 100-500 rps. Reads: 2.1) 500 rps (handle 100-500 drivers per request); 2.2) fetch 50000-200000 rows/sec (100-400 MB/sec). Driver entity size: 2 KB (50th percentile) / 13 KB (99th percentile); total size for 100K entities = 200 MB.
  • 19. Key differences. Stateless (remote KV): ● Provides a GET/PUT/DELETE API ● High CPU cost due to marshalling and serialization ● Additional network latency ● Frequently necessitates additional local caching. Stateful (in-memory/local KV): ● Domain-specific API, for example: ○ Find nearest drivers ○ Calculate ETA ● Data locality ● Shared-nothing
  • 20. Access patterns for in-memory KV: 1. Key lookup 2. Index seek (Offers, Broadcast) 3. All scans / Range scans
  • 22. Concept #1: Co-partitioning. Two topics are described as co-partitioned if: 1. Their keys have the same schema 2. They have the same number of partitions 3. Their producers use the same 'partitioner'
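A small sketch of why these three conditions matter: when two topics share the key schema, partition count and partitioner, a consumer that owns partition p sees every event for its drivers from both topics. The CRC32-based `partition_for` below is a stand-in chosen for this example, not Kafka's actual default partitioner, and the record contents are invented.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Illustrative partitioner: stable hash of the key modulo partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(records, num_partitions):
    """Distribute (key, value) records over partitions, like a producer would."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


driver_ids = [f"driver-{i}" for i in range(100)]
locations = route([(d, "location") for d in driver_ids], num_partitions=8)
states = route([(d, "state") for d in driver_ids], num_partitions=8)

# Same keys, same partition count, same partitioner: partition p of both
# topics holds exactly the same set of drivers.
for p in range(8):
    assert {k for k, _ in locations[p]} == {k for k, _ in states[p]}
```

If either topic had a different partition count or partitioner, the per-partition key sets would generally differ and a consumer would need cross-partition reads to join the two streams.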
  • 23. Concept #2: Re-keying partitions ● Related events are not co-partitioned ● Well-balanced partitions ● Re-keying can leave partitions unbalanced and, as a result, consumers too ● Achieving data locality for the consumer
  • 24. Concept #3: Filtering + Enriching. DriverLocation { "driver": 12345, "latitude": 50.30846, "longitude": 30.53419 } → DriverETA { "driver": 12345, "latitude": 50.30846, "longitude": 30.53419, "order": 98765, "eta": "2 min" }
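The filter-and-enrich step can be sketched as a single function over local state. The `active_orders` map, the `estimate_eta` callback and the stubbed "2 min" result are assumptions for this example; only the two message shapes come from the slide.

```python
def enrich(location, active_orders, estimate_eta):
    """Turn a DriverLocation into a DriverETA, or drop it.

    active_orders: local in-memory map driver_id -> order (hypothetical shape).
    estimate_eta:  callable (location, order) -> str, supplied by the caller.
    """
    order = active_orders.get(location["driver"])
    if order is None:
        return None  # filter: driver has no accepted order, nothing to track
    return {**location, "order": order["id"], "eta": estimate_eta(location, order)}


active_orders = {12345: {"id": 98765}}
location = {"driver": 12345, "latitude": 50.30846, "longitude": 30.53419}

# Stub estimator; a real one would use routing data.
driver_eta = enrich(location, active_orders, lambda loc, order: "2 min")
print(driver_eta)
```

Because the order lookup is a local dictionary read, each location update is filtered and enriched without a remote call.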
  • 25. How to scale? (Diagram: four Driver Dispatching instances)
  • 27. Some sharding strategies: 1. Geospatial indexing (geohash, S2, H3) 2. city_id (region). Consider the following points when you design a data partitioning scheme: 1. Minimize cross-partition data access operations 2. Minimize cross-partition joins
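As an illustration of the first strategy, the sketch below derives a shard from a geohash prefix, so that nearby drivers and orders usually land in the same partition. The geohash encoding follows the standard algorithm; `shard_for`, the prefix length and the CRC32 hash are assumptions made for this example.

```python
import zlib

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat, lon, precision=5):
    """Standard geohash: interleave longitude/latitude bisection bits, base32-encode."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    value, bit_count = 0, 0
    use_lon = True  # geohash starts with a longitude bit
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid
        else:
            value <<= 1
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base32 character
            chars.append(BASE32[value])
            value, bit_count = 0, 0
    return "".join(chars)


def shard_for(lat, lon, num_shards, prefix_len=4):
    """Hypothetical sharding rule: hash of the geohash prefix modulo shard count."""
    return zlib.crc32(geohash(lat, lon, prefix_len).encode("ascii")) % num_shards


print(geohash(57.64911, 10.40744, 3))  # u4p
```

Points close to a cell border can still fall into different shards, which is one source of the cross-partition access the slide advises minimizing.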
  • 28. Partitioning by Region. Possible challenges: ● Downtime during rebalance: scale-out, rolling update ● Unbalanced load: the load from Kyiv is equivalent to the load from all other cities of Ukraine combined
  • 29. Try to fix: Partitioning by Region + Replication. Replication: ● Standalone consumers ● No partition rebalance ● No downtime ● Replication overhead is less than 0.1 CPU per pod ● Reduced requirements for cold recovery
  • 30. Scalability: 1. Scalability: adding Kafka partitions and deploying separate shard instances for cities/countries 2. Elasticity: scale-out of consumers within a shard
  • 32. Replica synchronization ● State-based CRDT ● Last write wins (LWW) ● Optimistic replication (can become temporarily inconsistent) ● Strong Eventual Consistency (SEC)
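A minimal sketch of a state-based LWW map, assuming each entry is stored as a `(timestamp, replica_id, value)` tuple; the tuple layout and the tie-break on replica id are assumptions for this example. The merge is commutative, associative and idempotent, which is what lets replicas converge regardless of the order in which they exchange state.

```python
def merge(a, b):
    """Merge two LWW maps: for each key keep the entry with the greatest
    (timestamp, replica_id) pair."""
    merged = dict(a)
    for key, entry in b.items():
        if key not in merged or entry > merged[key]:
            merged[key] = entry
    return merged


# Two replicas that saw different updates for the same drivers.
replica_1 = {"d1": (10, "r1", "free"), "d2": (12, "r1", "busy")}
replica_2 = {"d1": (11, "r2", "busy"), "d3": (9, "r2", "free")}

converged = merge(replica_1, replica_2)
print(converged)
```

Until replicas have exchanged state they may disagree, which is the temporary inconsistency mentioned in the list above.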
  • 33. Problems with replication lag? Depends on your domain: ● Reading Your Own Writes ● Monotonic Reads ● Consistent Prefix Reads
  • 34. Fault tolerance with local state: 1. Single infrastructure dependency: Kafka (a battle-tested streaming platform with high throughput, fault tolerance, and scalability). 2. When a task instance restarts, local state is repopulated by reading its own Kafka log. 3. Yes, reading and repopulating will take some time.
  • 35. Controlling state size. How long does it take to rebuild the state? 1. Key-based retention: a. Aggressive topic compaction b. Tombstones 2. Time-based retention
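Key-based retention can be pictured with a toy compaction pass over a list of `(key, value)` records, where a `None` value plays the role of a tombstone. This is a simplified model written for this example: a real Kafka broker compacts in the background and keeps tombstones for a configurable period before dropping them.

```python
def compact(log):
    """Keep only the latest record per key and drop keys whose latest
    record is a tombstone (value None). Log order is preserved."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_, value) in survivors if value is not None]


log = [("d1", "a"), ("d2", "b"), ("d1", "c"), ("d2", None), ("d3", "e")]
print(compact(log))  # [('d1', 'c'), ('d3', 'e')]
```

The fewer records survive compaction, the fewer messages a restarting instance has to replay to rebuild its local state.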
  • 36. How long does it take to rebuild the state? 1. Driver state retention: 1 hour 2. Repopulate local state: a. Read driver-state from the beginning of the topic: 400k msg (8 partitions) b. Read driver-locations from 'now - 5 sec' 3. You need to implement your own event for "live processing started". Example log: "Live processing started "dispatching.driver-summary-events [0]" after 00:00:01.7875633 sec (50142 msgs)". An SLA level of 99.998% uptime/availability results in the following period of allowed downtime/unavailability: ■ Daily: 1.7 s
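The daily downtime figure follows directly from the availability percentage. A short check, with a function name chosen for this example:

```python
def allowed_downtime_seconds(availability_percent, period_seconds=86_400):
    """Downtime budget for a period (default: one day) at a given availability."""
    return period_seconds * (1 - availability_percent / 100)


daily_budget = allowed_downtime_seconds(99.998)
print(round(daily_budget, 3))  # 1.728
```

At 99.998% the budget is about 1.7 seconds per day, the same order of magnitude as the 1.79-second rebuild reported in the log line above.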
  • 37. Traffic Jams requirements: 1. Reduce the cost of Google Maps API 2. High rate of writes (20k online drivers) 3. Update traffic information every 5 min
  • 38. Stateful processing ● Grouping messages by partition key ● Aggregating messages in a hopping window ● MapReduce
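A sketch of grouping by key and aggregating in hopping windows, under the assumption that each message carries a key (here a hypothetical road segment id), an integer timestamp in seconds and a speed reading. The window size and hop are illustrative parameters, not values from the talk.

```python
from collections import defaultdict


def hopping_windows(ts, size, hop):
    """Start times of all windows [start, start + size) that contain ts."""
    start = (ts // hop) * hop
    starts = []
    while start > ts - size and start >= 0:
        starts.append(start)
        start -= hop
    return sorted(starts)


def aggregate(events, size=300, hop=60):
    """Average speed per (key, window_start); events are (key, ts, speed)."""
    sums = defaultdict(lambda: [0.0, 0])
    for key, ts, speed in events:
        for start in hopping_windows(ts, size, hop):
            acc = sums[(key, start)]
            acc[0] += speed
            acc[1] += 1
    return {k: total / count for k, (total, count) in sums.items()}


events = [("seg-1", 10, 20.0), ("seg-1", 70, 40.0)]
print(aggregate(events, size=120, hop=60))
```

Because windows overlap, one message contributes to several aggregates; in the example the second reading falls into both the window starting at 0 and the one starting at 60.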
  • 40. Similar workload using Redis https://aws.amazon.com/blogs/database/optimize-redis-client-performance-for-amazon-elasticache/?utm_source=pocket_saves ○ Client: c5.4xlarge (16 vCPUs, 32 GiB) ○ Redis: 3 nodes r6g.2xlarge (8 vCPUs, 64 GiB)
  • 42. Future work. Although the current design is simple, it allows flexibility to change key aspects: ○ Replication + Sharding
  • 43. Lessons learned: 1. Stateful is not always difficult 2. Simple and reliable solution 3. Easy to maintain 4. Much more efficient in terms of resources: 2 vCPUs for all dispatching instead of a Redis cluster with 16-24 vCPUs 5. What about MS Orleans?