Ho Nguyen
• Senior Software Engineer
• Technical Interests:
• Solution & code design
• Distributed systems
• Video/Image encoding
• Hobbies
• Movies & music
• Manga & anime (One Piece, Dragon Ball...)
• Coffee lover
Data-intensive problem
Ho Nguyen
Senior Software Engineer
Outline
• Simple problem
• When the data is big
• More problems
• Approaches
Simple problem
Program diagram
Complete code
Face Detection
When the data is big
How big is the data?
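• A data set of 2 billion records of unique URLs
• Assuming the previous program needs 2 seconds to complete one URL => Concurrency number = 0.5 URL/s
• Time to process the whole data set sequentially:
$$\frac{2 \times (2 \times 10^{9})}{3600 \times 24} \approx 46296 \text{ days} \approx 127 \text{ years}$$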
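The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch of the slide's calculation; the function name `sequential_days` is made up for illustration, and it assumes every URL takes exactly 2 seconds.

```python
SECONDS_PER_DAY = 3600 * 24


def sequential_days(total_urls: int, seconds_per_url: float) -> float:
    """Days needed when URLs are processed one after another."""
    return total_urls * seconds_per_url / SECONDS_PER_DAY


# 2 billion URLs at 2 seconds each, as assumed on the slide.
days = sequential_days(2_000_000_000, 2.0)
years = days / 365
print(f"{days:.0f} days, about {years:.0f} years")
```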
What is the concurrency number we need to
complete the dataset in X days?
What is the concurrency number we need?
• Goal: X = 7 days
• 2 billion URLs
• Current concurrency: 0.5 URL/s
$$\frac{2 \times 10^{9}}{X \times 3600 \times 24} = \frac{2 \times 10^{9}}{7 \times 3600 \times 24} \approx 3307 \text{ URLs/s}$$
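A small sketch of the same calculation, extended to estimate how many copies of the 0.5 URL/s program would be needed. The worker count assumes throughput scales linearly with the number of workers, which is an idealization and not a claim from the slides; the function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 3600 * 24


def required_rate(total_urls: int, days: float) -> float:
    """URLs per second needed to finish total_urls within the given days."""
    return total_urls / (days * SECONDS_PER_DAY)


def workers_needed(target_rate: float, rate_per_worker: float) -> int:
    """Workers needed, assuming perfectly linear scaling (an idealization)."""
    return math.ceil(target_rate / rate_per_worker)


rate = required_rate(2_000_000_000, 7)
workers = workers_needed(rate, 0.5)
print(f"{rate:.0f} URLs/s, {workers} workers at 0.5 URL/s each")
```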
How to increase concurrency?
• Optimize code performance
• Increase hardware resources (CPU, RAM, disk, network…), aka scale-up
• Scale-out
  • Cloning to multiple processes (X-axis)
  • Splitting by function (Y-axis)
  • Data partitioning (Z-axis)
Optimize code
• Pros
  • Most effective, but only if we find a bottleneck whose fix raises performance by about 661,300% (from 0.5 to about 3307 URLs/s)
  • Saves infrastructure cost
• Cons
  • Time-consuming and uncertain
Scale-up
• Pros
  • Easy to apply
• Cons
  • Takes time to find a suitable hardware configuration
  • Expensive and limited
  • Once we cannot scale up any further, we still need to optimize and redesign the code to take advantage of the hardware resources
Scale-out by cloning (X-Axis)
• Pros
  • Can use all hardware resources
  • Not limited by hardware
• Cons
  • More complex than scale-up
  • Concurrency problems
Node 1 Node 2 Node 3
Scale-out by Splitting (Y-Axis)
Review the workflow
Scale-out by Splitting (Y-Axis)
• Download and resize image using CPU
• Face detection on GPU is faster
Reference: https://sites.google.com/site/facedetectionongpu/
Scale-out by Splitting (Y-Axis)
[Diagram: three "Download and Process Image" instances and two "Face Detection" instances; X-axis: cloning, Y-axis: splitting]
Scale-out by Splitting (Y-Axis)
• Pros
  • Each service can run on the hardware that suits it (CPU for download and resize, GPU for face detection)
• Cons
  • Complex
  • Concurrency problems
Scale-out by data-partitioning (Z-Axis)
Data schema
ID URL Done
1 https://abc.com/image1.jpg 1
2 https://abc.com/image2.jpg 0
3 https://abc.com/image3.jpg 0
4 https://abc.com/image4.jpg 0
Scale-out by data-partitioning (Z-Axis)
ID URL Done
1 https://abc.com/image1.jpg 0
3 https://abc.com/image3.jpg 0
ID URL Done
2 https://abc.com/image2.jpg 0
4 https://abc.com/image4.jpg 0
Key hashing
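A minimal sketch of key-hash partitioning over the ID column. It uses `id % num_shards` as a stand-in hash function, which happens to reproduce the split shown above (IDs 1 and 3 in one partition, 2 and 4 in the other); the slides do not say which hash function is actually used, so treat `shard_for` as a hypothetical helper.

```python
def shard_for(record_id: int, num_shards: int) -> int:
    """Pick a shard from the record key; modulo is a simple stand-in hash."""
    return record_id % num_shards


records = {
    1: "https://abc.com/image1.jpg",
    2: "https://abc.com/image2.jpg",
    3: "https://abc.com/image3.jpg",
    4: "https://abc.com/image4.jpg",
}

# Group record IDs by the shard they hash to.
shards: dict[int, list[int]] = {}
for record_id in records:
    shards.setdefault(shard_for(record_id, 2), []).append(record_id)

print(shards)
```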
Scale-out by data-partitioning (Z-Axis)
ID URL Done
1 https://abc.com/image1.jpg 0
2 https://abc.com/image2.jpg 0
ID URL Done
3 https://abc.com/image3.jpg 0
4 https://abc.com/image4.jpg 0
Range-based
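A comparable sketch for range-based partitioning, assuming consecutive IDs that start at 1 and a fixed number of records per partition (both are assumptions for illustration; `range_shard` is a hypothetical helper).

```python
def range_shard(record_id: int, shard_size: int, first_id: int = 1) -> int:
    """Map consecutive IDs to shards in fixed-size ranges."""
    return (record_id - first_id) // shard_size


# With two records per shard, IDs 1-2 go to shard 0 and IDs 3-4 to shard 1.
assignment = {record_id: range_shard(record_id, 2) for record_id in (1, 2, 3, 4)}
print(assignment)
```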
Scale-out by data-partitioning (Z-Axis)
• Pros
  • Increases database performance
  • Reduces locking (partitions can be worked on without locking each other)
• Cons
  • Increases maintenance and infrastructure cost
  • Hard to scale automatically
Summary
• Skip the code optimization approach
• Skip the scale-up approach
• Focus on scale-out approaches
  • We can increase the number of processes/machines to increase the concurrency number
  • We can split into 2 services: Downloader and Face Detector
  • We may need data partitioning to optimize database performance
Current approach
High Concurrency
Problems
Race condition
• Cause
  • The same URL is processed twice or more
• Impact
  • Waste of resources
  • Data corruption
  • Fake concurrency: duplicated work inflates the apparent throughput
Race condition: How to solve?
• Distributed locks
  • Pros
    • N/A
  • Cons
    • Pessimistic locking hurts performance
    • Hard to apply because multiple nodes must be synchronized
    • Poor fault tolerance
• Data sharding
  • Pros
    • High performance because the load is shared (physical shards)
  • Cons
    • Hard to scale
    • Increases maintenance and infrastructure cost
• Queue/Worker
  • Pros
    • Easy to implement
    • Easy to scale
    • Good fault tolerance
    • Reusable communication
  • Cons
    • The load concentrates on the queue, so it can become a bottleneck
Race condition: root cause
The race condition only occurs between Downloaders.
=> If we find a way to give each downloader its own unique URLs, the race condition is solved for the whole system.
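One way to hand each downloader its own unique URLs is a shared queue, where every item is dequeued exactly once. The sketch below uses Python's in-process `queue.Queue` and threads purely as a stand-in for a real message queue and separate downloader services; `run_downloaders` is a hypothetical name and the download itself is left as a placeholder.

```python
import queue
import threading


def run_downloaders(urls: list[str], num_workers: int) -> list[tuple[int, str]]:
    """Distribute URLs to workers through a queue; each URL is taken once."""
    tasks: queue.Queue[str] = queue.Queue()
    for url in urls:
        tasks.put(url)

    processed: list[tuple[int, str]] = []  # (worker id, url) pairs
    lock = threading.Lock()

    def worker(worker_id: int) -> None:
        while True:
            try:
                url = tasks.get_nowait()  # atomic: no two workers get the same item
            except queue.Empty:
                return  # queue drained, worker exits
            # Placeholder for the real download + resize step.
            with lock:
                processed.append((worker_id, url))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


urls = [f"https://abc.com/image{i}.jpg" for i in range(1, 101)]
processed = run_downloaders(urls, num_workers=4)
handled = [url for _, url in processed]
print(len(handled), "URLs handled,", len(set(handled)), "unique")
```

Because the queue is filled before the workers start, a worker can safely stop as soon as it sees an empty queue; a long-running service would block on the queue instead.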
Fault Tolerance
• Faults
  • Network fault
    • Network interruption
    • IP blocking
  • Service crash
• Problems
  • Can data be lost?
  • Can the service restart and continue to work on the remaining tasks?
Fault Tolerance criteria
Given | When | Then
A service crashed | It restarted | No rework (continue on the remaining items only)
Downloader service is running | It crashes | No downloaded image should be lost
FaceDetector service is running | It crashes | No detection result should be lost
Downloader is downloading an image | A network error happens | Retry
Downloader retries downloading an image | The network error is IP blocking | Rotate the proxy to change the IP
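The last two criteria (retry on network errors, rotate the proxy on IP blocking) could look roughly like the sketch below. The exception classes, the `fetch(url, proxy)` callable, and the proxy names are all hypothetical placeholders, not part of any real library; a real downloader would plug in its own HTTP client and error detection.

```python
import itertools
from typing import Callable, Sequence


class NetworkError(Exception):
    """Hypothetical error for a transient network failure."""


class IPBlockedError(NetworkError):
    """Hypothetical error raised when the remote side blocks our IP."""


def download_with_retry(
    url: str,
    fetch: Callable[[str, str], bytes],
    proxies: Sequence[str],
    max_attempts: int = 5,
) -> bytes:
    """Retry a download, switching to the next proxy when the IP is blocked."""
    proxy_cycle = itertools.cycle(proxies)
    proxy = next(proxy_cycle)
    for _ in range(max_attempts):
        try:
            return fetch(url, proxy)
        except IPBlockedError:
            proxy = next(proxy_cycle)  # change the outgoing IP
        except NetworkError:
            pass  # transient error: retry with the same proxy
    raise NetworkError(f"giving up on {url} after {max_attempts} attempts")


# Fake fetch used only to exercise the retry logic: the first proxy is blocked.
calls: list[str] = []


def fake_fetch(url: str, proxy: str) -> bytes:
    calls.append(proxy)
    if proxy == "proxy-a":
        raise IPBlockedError(proxy)
    return b"image-bytes"


result = download_with_retry(
    "https://abc.com/image1.jpg", fake_fetch, ["proxy-a", "proxy-b"]
)
print(calls, result)
```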
Service communication
• How do the services communicate?
• Do we need a load balancer?
Service communication methods
Type | Method | Pros | Cons
Synchronous | HTTP | Familiar and simple to use | Needs a load balancer; tight coupling; blocks the thread while waiting for the response
Synchronous | RPC | Higher performance than HTTP | -
Asynchronous | Queue messaging (one-to-one) | High performance; failure isolation; acts as a load balancer; reduced coupling | Extra maintenance cost; queue may become a bottleneck
Asynchronous | Publish/Subscribe (one-to-many) | - | We only need one-to-one communication
Summary
• Find an approach that distributes unique URLs to the downloaders
• The approach should pass the fault tolerance criteria
• We can use the communication methods table to choose the final solution
High Concurrency
Approaches
Approach 1: Range based physical shard
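$n$: total URLs
$m$: number of partitions
$i \in [0 \ldots m-1]$: partition number
$k = \lfloor n / m \rfloor$: number of URLs in a partition
$\mathrm{start}_i = k \cdot i$
$$\mathrm{end}_i = \begin{cases} \mathrm{start}_i + k, & 0 \le i < m-1 \\ \mathrm{start}_i + k + (n \bmod m), & i = m-1 \end{cases}$$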
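The partition bounds above translate directly into code. This sketch assumes integer division for k and treats the end of each range as exclusive, so that consecutive partitions do not overlap; `partition_bounds` is a name chosen for illustration.

```python
def partition_bounds(n: int, m: int, i: int) -> tuple[int, int]:
    """Return (start, end) of partition i, with end exclusive.

    n: total URLs, m: number of partitions, i: partition number in [0, m-1].
    The last partition also takes the n mod m leftover URLs.
    """
    if not 0 <= i < m:
        raise ValueError("partition number out of range")
    k = n // m  # URLs per partition
    start = k * i
    end = start + k
    if i == m - 1:
        end += n % m
    return start, end


# Example: 10 URLs split into 3 partitions.
bounds = [partition_bounds(10, 3, i) for i in range(3)]
print(bounds)
```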
Approach 1: Range based physical shard
• Solves race condition: Solved
• Fault tolerance
  • No rework
  • Need to download the image again if a crash happens during face detection
  • A partition can be abandoned
• Communication types: HTTP/gRPC
• Notes
  • Pros
    • No locking at the database level
  • Cons
    • Takes time to prepare
    • Hard to scale out/adjust
    • Needs a load balancer
Approach 2: Logical shard
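$n$: total URLs
$m$: number of processes
$id \in [0 \ldots m-1]$: process id
$k = \lfloor n / m \rfloor$
$i$: index of the URL being processed within a process
$$i \in \begin{cases} [0 \ldots k-1], & id < m-1 \\ [0 \ldots k-1 + (n \bmod m)], & id = m-1 \end{cases}$$
$f(i)$: the id of the URL we need to pick
$\Rightarrow f(i) = id \cdot k + i$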
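A sketch of the logical-shard rule: each process computes the URL ids it owns from its own process id, so no two processes pick the same URL. It assumes integer division for k and zero-based URL ids (add 1 if the ID column starts at 1); `url_ids_for_process` is a hypothetical helper.

```python
from typing import Iterator


def url_ids_for_process(n: int, m: int, pid: int) -> Iterator[int]:
    """Yield f(i) = pid * k + i for every i owned by process pid.

    n: total URLs, m: number of processes, pid: process id in [0, m-1].
    The last process also takes the n mod m leftover URLs.
    """
    k = n // m
    count = k + (n % m if pid == m - 1 else 0)
    for i in range(count):
        yield pid * k + i


# Example: 10 URLs shared by 3 processes.
per_process = [list(url_ids_for_process(10, 3, pid)) for pid in range(3)]
all_ids = [url_id for ids in per_process for url_id in ids]
print(per_process)
```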
Approach 2: Logical shard
• Solves race condition: Solved
• Fault tolerance
  • No rework
  • Need to download the image again if a crash happens during face detection
  • A partition can be abandoned
• Communication types: HTTP/gRPC
• Notes
  • Pros
    • No locking at the database level
    • Simple implementation
  • Cons
    • Hard to scale out/adjust
    • High database throughput
    • Extra state to maintain: total URLs, current URL id,…
Approach 3: Queue/Worker x Logical Sharding
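$n$: total URLs
$m$: number of processes
$id \in [0 \ldots m-1]$: process id
$k = \lfloor n / m \rfloor$
$i$: index of the URL being processed within a process
$$i \in \begin{cases} [0 \ldots k-1], & id < m-1 \\ [0 \ldots k-1 + (n \bmod m)], & id = m-1 \end{cases}$$
$f(i)$: the id of the URL we need to pick
$\Rightarrow f(i) = id \cdot k + i$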
Approach 3: Queue/Worker x Logical Sharding
• Solves race condition: Solved
• Fault tolerance
  • No rework
  • Failure isolation
  • Nodes are replaceable
• Communication types: Messaging
• Notes
  • Pros
    • Easy to scale
    • Easy fault tolerance
    • Failure isolation
    • Asynchronous
  • Cons
    • Extra infrastructure
    • High throughput on the queue
END
Questions
• How do we measure and debug the services?
• What is the deployment process?
Q&A
THANK YOU FOR YOUR ATTENTION

More Related Content

What's hot

Grokking Techtalk #46: Lessons from years hacking and defending Vietnamese banks
Grokking Techtalk #46: Lessons from years hacking and defending Vietnamese banksGrokking Techtalk #46: Lessons from years hacking and defending Vietnamese banks
Grokking Techtalk #46: Lessons from years hacking and defending Vietnamese banks
Grokking VN
 
Scalable and Available, Patterns for Success
Scalable and Available, Patterns for SuccessScalable and Available, Patterns for Success
Scalable and Available, Patterns for Success
Derek Collison
 

What's hot (20)

Toi uu hoa he thong 30 trieu nguoi dung
Toi uu hoa he thong 30 trieu nguoi dungToi uu hoa he thong 30 trieu nguoi dung
Toi uu hoa he thong 30 trieu nguoi dung
 
Sapo Microservices Architecture
Sapo Microservices ArchitectureSapo Microservices Architecture
Sapo Microservices Architecture
 
Building Bizweb Microservices with Docker
Building Bizweb Microservices with DockerBuilding Bizweb Microservices with Docker
Building Bizweb Microservices with Docker
 
Go micro framework to build microservices
Go micro framework to build microservicesGo micro framework to build microservices
Go micro framework to build microservices
 
SOLID & Design Patterns
SOLID & Design PatternsSOLID & Design Patterns
SOLID & Design Patterns
 
Microservice - Up to 500k CCU
Microservice - Up to 500k CCUMicroservice - Up to 500k CCU
Microservice - Up to 500k CCU
 
Asynchronous processing in big system
Asynchronous processing in big systemAsynchronous processing in big system
Asynchronous processing in big system
 
Grokking Techtalk: Problem solving for sw engineers
Grokking Techtalk: Problem solving for sw engineersGrokking Techtalk: Problem solving for sw engineers
Grokking Techtalk: Problem solving for sw engineers
 
MongoDB World 2019: MongoDB Read Isolation: Making Your Reads Clean, Committe...
MongoDB World 2019: MongoDB Read Isolation: Making Your Reads Clean, Committe...MongoDB World 2019: MongoDB Read Isolation: Making Your Reads Clean, Committe...
MongoDB World 2019: MongoDB Read Isolation: Making Your Reads Clean, Committe...
 
Grokking Techtalk #46: Lessons from years hacking and defending Vietnamese banks
Grokking Techtalk #46: Lessons from years hacking and defending Vietnamese banksGrokking Techtalk #46: Lessons from years hacking and defending Vietnamese banks
Grokking Techtalk #46: Lessons from years hacking and defending Vietnamese banks
 
itlchn 20 - Kien truc he thong chung khoan - Phan 2
itlchn 20 - Kien truc he thong chung khoan - Phan 2itlchn 20 - Kien truc he thong chung khoan - Phan 2
itlchn 20 - Kien truc he thong chung khoan - Phan 2
 
Disaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache KafkaDisaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache Kafka
 
Kafka as an event store - is it good enough?
Kafka as an event store - is it good enough?Kafka as an event store - is it good enough?
Kafka as an event store - is it good enough?
 
Maven et industrialisation du logiciel
Maven et industrialisation du logicielMaven et industrialisation du logiciel
Maven et industrialisation du logiciel
 
Errant GTIDs breaking replication @ Percona Live 2019
Errant GTIDs breaking replication @ Percona Live 2019Errant GTIDs breaking replication @ Percona Live 2019
Errant GTIDs breaking replication @ Percona Live 2019
 
[135] 오픈소스 데이터베이스, 은행 서비스에 첫발을 내밀다.
[135] 오픈소스 데이터베이스, 은행 서비스에 첫발을 내밀다.[135] 오픈소스 데이터베이스, 은행 서비스에 첫발을 내밀다.
[135] 오픈소스 데이터베이스, 은행 서비스에 첫발을 내밀다.
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
 
Scalable and Available, Patterns for Success
Scalable and Available, Patterns for SuccessScalable and Available, Patterns for Success
Scalable and Available, Patterns for Success
 
Stability Patterns for Microservices
Stability Patterns for MicroservicesStability Patterns for Microservices
Stability Patterns for Microservices
 
Hexagonal Architecture.pdf
Hexagonal Architecture.pdfHexagonal Architecture.pdf
Hexagonal Architecture.pdf
 

Similar to Grokking Techtalk #37: Data intensive problem

Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...
Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...
Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...
Red_Hat_Storage
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Databricks
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
xlight
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Kristofferson A
 
AOUG_11Nov2016_Challenges_with_EBS12_2
AOUG_11Nov2016_Challenges_with_EBS12_2AOUG_11Nov2016_Challenges_with_EBS12_2
AOUG_11Nov2016_Challenges_with_EBS12_2
Sean Braymen
 
Rails Performance Tricks and Treats
Rails Performance Tricks and TreatsRails Performance Tricks and Treats
Rails Performance Tricks and Treats
Marshall Yount
 
High performance Infrastructure Oct 2013
High performance Infrastructure Oct 2013High performance Infrastructure Oct 2013
High performance Infrastructure Oct 2013
Server Density
 

Similar to Grokking Techtalk #37: Data intensive problem (20)

Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing Architecture
 
Алексей Ященко и Ярослав Волощук "False simplicity of front-end applications"
Алексей Ященко и Ярослав Волощук "False simplicity of front-end applications"Алексей Ященко и Ярослав Волощук "False simplicity of front-end applications"
Алексей Ященко и Ярослав Волощук "False simplicity of front-end applications"
 
Using Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.comUsing Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.com
 
Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...
Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...
Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
 
PAC 2019 virtual Mark Tomlinson
PAC 2019 virtual Mark TomlinsonPAC 2019 virtual Mark Tomlinson
PAC 2019 virtual Mark Tomlinson
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Offline capable web applications with Google Gears and Dojo Offline
Offline capable web applications with Google Gears and Dojo OfflineOffline capable web applications with Google Gears and Dojo Offline
Offline capable web applications with Google Gears and Dojo Offline
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
AOUG_11Nov2016_Challenges_with_EBS12_2
AOUG_11Nov2016_Challenges_with_EBS12_2AOUG_11Nov2016_Challenges_with_EBS12_2
AOUG_11Nov2016_Challenges_with_EBS12_2
 
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The SequelSilicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
 
Rails Performance Tricks and Treats
Rails Performance Tricks and TreatsRails Performance Tricks and Treats
Rails Performance Tricks and Treats
 
High performance Infrastructure Oct 2013
High performance Infrastructure Oct 2013High performance Infrastructure Oct 2013
High performance Infrastructure Oct 2013
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyser
 
Cognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & TricksCognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & Tricks
 
IMAGE PROCESSING
IMAGE PROCESSINGIMAGE PROCESSING
IMAGE PROCESSING
 

More from Grokking VN

Grokking Techtalk #45: First Principles Thinking
Grokking Techtalk #45: First Principles ThinkingGrokking Techtalk #45: First Principles Thinking
Grokking Techtalk #45: First Principles Thinking
Grokking VN
 
Grokking Techtalk #43: Payment gateway demystified
Grokking Techtalk #43: Payment gateway demystifiedGrokking Techtalk #43: Payment gateway demystified
Grokking Techtalk #43: Payment gateway demystified
Grokking VN
 
Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
 Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer... Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
Grokking VN
 

More from Grokking VN (20)

Grokking Techtalk #45: First Principles Thinking
Grokking Techtalk #45: First Principles ThinkingGrokking Techtalk #45: First Principles Thinking
Grokking Techtalk #45: First Principles Thinking
 
Grokking Techtalk #42: Engineering challenges on building data platform for M...
Grokking Techtalk #42: Engineering challenges on building data platform for M...Grokking Techtalk #42: Engineering challenges on building data platform for M...
Grokking Techtalk #42: Engineering challenges on building data platform for M...
 
Grokking Techtalk #43: Payment gateway demystified
Grokking Techtalk #43: Payment gateway demystifiedGrokking Techtalk #43: Payment gateway demystified
Grokking Techtalk #43: Payment gateway demystified
 
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platformGrokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform
 
Grokking Techtalk #39: Gossip protocol and applications
Grokking Techtalk #39: Gossip protocol and applicationsGrokking Techtalk #39: Gossip protocol and applications
Grokking Techtalk #39: Gossip protocol and applications
 
Grokking Techtalk #39: How to build an event driven architecture with Kafka ...
 Grokking Techtalk #39: How to build an event driven architecture with Kafka ... Grokking Techtalk #39: How to build an event driven architecture with Kafka ...
Grokking Techtalk #39: How to build an event driven architecture with Kafka ...
 
Grokking Techtalk #38: Escape Analysis in Go compiler
 Grokking Techtalk #38: Escape Analysis in Go compiler Grokking Techtalk #38: Escape Analysis in Go compiler
Grokking Techtalk #38: Escape Analysis in Go compiler
 
Grokking Techtalk #37: Software design and refactoring
 Grokking Techtalk #37: Software design and refactoring Grokking Techtalk #37: Software design and refactoring
Grokking Techtalk #37: Software design and refactoring
 
Grokking TechTalk #35: Efficient spellchecking
Grokking TechTalk #35: Efficient spellcheckingGrokking TechTalk #35: Efficient spellchecking
Grokking TechTalk #35: Efficient spellchecking
 
Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
 Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer... Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
 
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
 
Grokking TechTalk #31: Asynchronous Communications
Grokking TechTalk #31: Asynchronous CommunicationsGrokking TechTalk #31: Asynchronous Communications
Grokking TechTalk #31: Asynchronous Communications
 
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at ScaleGrokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
 
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
 
Grokking TechTalk #27: Optimal Binary Search Tree
Grokking TechTalk #27: Optimal Binary Search TreeGrokking TechTalk #27: Optimal Binary Search Tree
Grokking TechTalk #27: Optimal Binary Search Tree
 
Grokking TechTalk #26: Kotlin, Understand the Magic
Grokking TechTalk #26: Kotlin, Understand the MagicGrokking TechTalk #26: Kotlin, Understand the Magic
Grokking TechTalk #26: Kotlin, Understand the Magic
 
Grokking TechTalk #26: Compare ios and android platform
Grokking TechTalk #26: Compare ios and android platformGrokking TechTalk #26: Compare ios and android platform
Grokking TechTalk #26: Compare ios and android platform
 
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
 
Grokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocolsGrokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocols
 
Grokking TechTalk #21: Deep Learning in Computer Vision
Grokking TechTalk #21: Deep Learning in Computer VisionGrokking TechTalk #21: Deep Learning in Computer Vision
Grokking TechTalk #21: Deep Learning in Computer Vision
 

Recently uploaded

Artificial intelligence presentation2-171219131633.pdf
Artificial intelligence presentation2-171219131633.pdfArtificial intelligence presentation2-171219131633.pdf
Artificial intelligence presentation2-171219131633.pdf
Kira Dess
 
一比一原版(Griffith毕业证书)格里菲斯大学毕业证成绩单学位证书
一比一原版(Griffith毕业证书)格里菲斯大学毕业证成绩单学位证书一比一原版(Griffith毕业证书)格里菲斯大学毕业证成绩单学位证书
一比一原版(Griffith毕业证书)格里菲斯大学毕业证成绩单学位证书
c3384a92eb32
 
21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx
rahulmanepalli02
 

Recently uploaded (20)

Presentation on Slab, Beam, Column, and Foundation/Footing
Presentation on Slab,  Beam, Column, and Foundation/FootingPresentation on Slab,  Beam, Column, and Foundation/Footing
Presentation on Slab, Beam, Column, and Foundation/Footing
 
Basics of Relay for Engineering Students
Basics of Relay for Engineering StudentsBasics of Relay for Engineering Students
Basics of Relay for Engineering Students
 
Call for Papers - Journal of Electrical Systems (JES), E-ISSN: 1112-5209, ind...
Call for Papers - Journal of Electrical Systems (JES), E-ISSN: 1112-5209, ind...Call for Papers - Journal of Electrical Systems (JES), E-ISSN: 1112-5209, ind...
Call for Papers - Journal of Electrical Systems (JES), E-ISSN: 1112-5209, ind...
 
NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...
NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...
NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...
 
SLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptxSLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptx
 
Working Principle of Echo Sounder and Doppler Effect.pdf
Working Principle of Echo Sounder and Doppler Effect.pdfWorking Principle of Echo Sounder and Doppler Effect.pdf
Working Principle of Echo Sounder and Doppler Effect.pdf
 
8th International Conference on Soft Computing, Mathematics and Control (SMC ...
8th International Conference on Soft Computing, Mathematics and Control (SMC ...8th International Conference on Soft Computing, Mathematics and Control (SMC ...
8th International Conference on Soft Computing, Mathematics and Control (SMC ...
 
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdfInvolute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
 
Artificial intelligence presentation2-171219131633.pdf
Artificial intelligence presentation2-171219131633.pdfArtificial intelligence presentation2-171219131633.pdf
Artificial intelligence presentation2-171219131633.pdf
 
一比一原版(Griffith毕业证书)格里菲斯大学毕业证成绩单学位证书
一比一原版(Griffith毕业证书)格里菲斯大学毕业证成绩单学位证书一比一原版(Griffith毕业证书)格里菲斯大学毕业证成绩单学位证书
一比一原版(Griffith毕业证书)格里菲斯大学毕业证成绩单学位证书
 
Databricks Generative AI Fundamentals .pdf
Databricks Generative AI Fundamentals  .pdfDatabricks Generative AI Fundamentals  .pdf
Databricks Generative AI Fundamentals .pdf
 
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdfInstruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
 
Geometric constructions Engineering Drawing.pdf
Geometric constructions Engineering Drawing.pdfGeometric constructions Engineering Drawing.pdf
Geometric constructions Engineering Drawing.pdf
 
CLOUD COMPUTING SERVICES - Cloud Reference Modal
CLOUD COMPUTING SERVICES - Cloud Reference ModalCLOUD COMPUTING SERVICES - Cloud Reference Modal
CLOUD COMPUTING SERVICES - Cloud Reference Modal
 
Independent Solar-Powered Electric Vehicle Charging Station
Independent Solar-Powered Electric Vehicle Charging StationIndependent Solar-Powered Electric Vehicle Charging Station
Independent Solar-Powered Electric Vehicle Charging Station
 
Circuit Breakers for Engineering Students
Circuit Breakers for Engineering StudentsCircuit Breakers for Engineering Students
Circuit Breakers for Engineering Students
 
Filters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsFilters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility Applications
 
21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx
 
Fuzzy logic method-based stress detector with blood pressure and body tempera...
Fuzzy logic method-based stress detector with blood pressure and body tempera...Fuzzy logic method-based stress detector with blood pressure and body tempera...
Fuzzy logic method-based stress detector with blood pressure and body tempera...
 
Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)
 

Grokking Techtalk #37: Data intensive problem

  • 1. Ho Nguyen • Senior Software Engineer • Technical Interests: • Solution & code design • Distributed systems • Video/Image encoding • Hobbies • Movies & music • Manga & anime (One Piece, Dragon Ball...) • Coffee lover
  • 3. Outline • Simple problem • When the data is big • More problems • Approaches
  • 8.
  • 9. When the data is big
  • 10. How big is the data? • A data set of 2 billion records of unique URLs • Assuming the previous program needs 2 seconds to complete => Concurrency number = 0.5 URL/s 2 ∗ 2 ∗ 10𝑒8 3600 ∗ 24 = 46296 𝑑𝑎𝑦𝑠 ≈ 127(𝑦𝑒𝑎𝑟𝑠)
  • 11. What is the concurrency number we need to complete the dataset in X days?
  • 12. What is the concurrency number we need? • Goal: X=7 Days • 2 billions URLs • Current concurrency 0.5 URL/s. 2 ∗ 10𝑒8 X ∗ 3600 ∗ 24 = 2 ∗ 10𝑒8 7 ∗ 3600 ∗ 24 ≈ 3307 𝑈𝑅𝐿𝑠/𝑠
  • 13. How to increase concurrency? • Optimize code performance • Increase hardware resource (CPU, RAM, Disk, Network…) aka Scale- up • Scale-out • Cloning to multiple processes (X-Axis) • Splitting by functions (Y-Axis) • Data partitioning (Z-Axis)
  • 14. Optimize code • Pros • Most effective if we found a bottleneck that can increase performance to 661,300% • Save infrastructure cost • Cons • Time consuming and uncertain
  • 15. Scale-up • Pros • Easy to apply • Cons • Take time to find out the suitable hardware configuration • Expensive and limited • Still need to optimize code and redesign to take advantage of hardware resources when cannot scale-up
  • 16. Scale-out by cloning (X-Axis) • Pros • Can use all hardware resources • Not limited by hardware • Cons • More complex than scale-up • Concurrency problems Node 1 Node 2 Node 3
  • 17. Scale-out by Splitting (Y-Axis) Review the workflow
  • 18. Scale-out by Splitting (Y-Axis) • Download and resize image using CPU • Face detection on GPU is faster Reference: https://sites.google.com/site/facedetectionongpu/
  • 19. Scale-out by Splitting (Y-Axis) X-axis: Cloning Download and Process Image Download and Process Image Download and Process Image Face Detection Face Detection Y-axis:Splitting
  • 20. Scale-out by Splitting (Y-Axis) • Pro • Reuse the advantage of hardware • Cons • Complex • Concurrency problems
  • 21. Scale-out by data-partitioning (Z-Axis) Data schema ID URL Done 1 https://abc.com/image1.jpg 1 2 https://abc.com/image2.jpg 0 3 https://abc.com/image3.jpg 0 4 https://abc.com/image4.jpg 0
  • 22. Scale-out by data-partitioning (Z-Axis) ID URL Done 1 https://abc.com/image1.jpg 0 3 https://abc.com/image2.jpg 0 ID URL Done 2 https://abc.com/image2.jpg 0 4 https://abc.com/image4.jpg 0 Key hashing
  • 23. Scale-out by data-partitioning (Z-Axis) ID URL Done 1 https://abc.com/image1.jpg 0 2 https://abc.com/image2.jpg 0 ID URL Done 3 https://abc.com/image2.jpg 0 4 https://abc.com/image4.jpg 0 Range base
  • 24. Scale-out by data-partitioning (Z-Axis) • Pros • Increase database performance • Reduce locking/non-locking • Cons • Increase maintenance and infrastructure cost • Hard for automation scaling
  • 25. Summary • Skip the code optimization approach • Skip the scale-up approach • Focus on scale-out approaches • We can increase the number of processes/machines to increase the concurrency number • We can split into 2 services: Downloader and Face Detections • We may need data partition to optimize database performance
  • 28. Race condition • Cause • The same URL is processed twice or more • Impact • Waste of resources • Data corruption • Faking concurrency
  • 29. Race condition: How to solve?
    • Distributed locks
      • Pros: N/A
      • Cons: pessimistic locking hurts performance; hard to apply because multiple nodes must be synchronized; poor fault tolerance
    • Data sharding
      • Pro: high performance because the load is shared (physical shard)
      • Cons: hard to scale; increases maintenance and infrastructure cost
    • Queue/Worker
      • Pros: easy to implement; easy to scale; good fault tolerance; reusable communication
      • Con: the load concentrates on the queue, so it can become a bottleneck
  • 30. Race condition: root cause The race condition only occurs between Downloaders => If we find a way to hand each Downloader a distinct set of URLs, the race condition is solved for the whole system.
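A minimal sketch of that idea, assuming an in-process `queue.Queue` stands in for the shared task source and threads stand in for Downloader nodes. The function `run_downloaders` and the placeholder "download" step are hypothetical. Because each `get_nowait()` removes the item from the queue, no two workers can receive the same URL.

```python
import queue
import threading


def run_downloaders(urls, num_workers):
    """Hand every URL to exactly one worker and record who processed it."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    processed = []
    lock = threading.Lock()

    def downloader(worker_id):
        while True:
            try:
                url = tasks.get_nowait()  # removes the URL, so nobody else sees it
            except queue.Empty:
                return  # all URLs were queued up front, so empty means done
            # Placeholder for the real download and resize work.
            with lock:
                processed.append((worker_id, url))

    threads = [threading.Thread(target=downloader, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed


urls = [f"https://abc.com/image{i}.jpg" for i in range(1, 101)]
processed = run_downloaders(urls, num_workers=4)
print(len(processed), "URLs processed,", len({u for _, u in processed}), "unique")
```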
  • 31. Fault Tolerance • Faults • Network fault • Network interruption • IP Blocking • Service crash • Problems • Can data be lost? • Can the service restart and continue to work on remaining tasks?
  • 32. Fault Tolerance criteria (Given / When / Then)
    • Given a service crashed, when it restarts, then there is no rework (it continues with the remaining items only)
    • Given the Downloader service is running, when it crashes, then no downloaded image should be lost
    • Given the FaceDetector service is running, when it crashes, then no detection result should be lost
    • Given the Downloader is downloading an image, when a network error happens, then it retries
    • Given the Downloader retries a download, when the network error is IP blocking, then it should rotate proxies to change the IP
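The last two criteria (retry on network error, rotate the proxy on IP blocking) might look like the sketch below. Everything here is an assumption for illustration: the exception classes, the `fetch(url, proxy)` callable, and the proxy names are hypothetical and not taken from the original program.

```python
import itertools


class NetworkError(Exception):
    """Hypothetical: any transient network failure."""


class IPBlockedError(NetworkError):
    """Hypothetical: the remote side blocked our current IP."""


def download_with_retry(url, fetch, proxies, max_attempts=5):
    """Retry `fetch(url, proxy)`; switch to the next proxy when the IP is blocked."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    proxy_cycle = itertools.cycle(proxies)
    proxy = next(proxy_cycle)
    last_error = None
    for _ in range(max_attempts):
        try:
            return fetch(url, proxy)
        except IPBlockedError as err:
            last_error = err
            proxy = next(proxy_cycle)  # rotate the proxy to change the IP
        except NetworkError as err:
            last_error = err  # plain retry through the same proxy
    raise last_error


def fake_fetch(url, proxy):
    # Stand-in for a real HTTP call: the first proxy is always "blocked".
    if proxy == "proxy-a":
        raise IPBlockedError("blocked")
    return f"{url} via {proxy}"


result = download_with_retry("https://abc.com/image1.jpg", fake_fetch, ["proxy-a", "proxy-b"])
print(result)
```

A production version would probably add a delay between attempts; that is left out here to keep the sketch short.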
  • 33. Service communication • How do the services communicate? • Do we need a load balancer?
  • 34. Service communication methods
    • Synchronous
      • HTTP. Pros: familiar and simple to use. Cons: needs a load balancer; tight coupling; the calling thread blocks while waiting for the response
      • RPC. Pros: higher performance than HTTP
    • Asynchronous
      • Queue messaging (one-to-one). Pros: high performance; failure isolation; acts as a load balancer; reduced coupling. Cons: extra maintenance cost; the queue may become a bottleneck
      • Publish/Subscribe (one-to-many). Not needed here: we only need one-to-one communication
  • 35. Summary • Find an approach that distributes unique URLs to the downloaders. • The approach should pass the fault tolerance criteria • We can use the communication methods table to choose the final solution
  • 37. Approach 1: Range based physical shard
    $n$: total URLs
    $m$: number of partitions
    $i \in [0 \dots m-1]$: partition number
    $k = \lfloor n / m \rfloor$: number of URLs in a partition
    $\mathrm{start}_i = k \cdot i$
    $\mathrm{end}_i = \begin{cases} \mathrm{start}_i + k, & 0 \le i < m-1 \\ \mathrm{start}_i + k + (n \bmod m), & i = m-1 \end{cases}$
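The partition boundaries on this slide can be sketched as a small function. It assumes integer division for k and treats each range as half-open, [start, end); the name `partition_bounds` is hypothetical.

```python
def partition_bounds(n: int, m: int, i: int) -> tuple:
    """Return (start, end) of partition i, as a half-open range [start, end)."""
    if not 0 <= i < m:
        raise ValueError("partition number must be in [0, m - 1]")
    k = n // m            # URLs per partition
    start = k * i
    end = start + k
    if i == m - 1:
        end += n % m      # the last partition also takes the remainder
    return start, end


bounds = [partition_bounds(10, 3, i) for i in range(3)]
print(bounds)
```

With n = 10 and m = 3, the first two partitions hold 3 URLs each and the last one holds 4, so every URL belongs to exactly one partition.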
  • 38. Approach 1: Range based physical shard
    • Solves race condition: solved
    • Fault tolerance: no rework; the image must be downloaded again if a crash happens during face detection; a partition can be abandoned
    • Communication types: HTTP/gRPC
    • Notes
      • Pros: non-locking at the DB level
      • Cons: takes time to prepare; hard to scale out/adjust; needs a load balancer
  • 39. Approach 2: Logical shard
    $n$: total URLs
    $m$: number of processes
    $id \in [0 \dots m-1]$: process id
    $k = \lfloor n / m \rfloor$
    $i$: index of the URL being processed within a process
    $i \in \begin{cases} [0 \dots k-1], & id < m-1 \\ [0 \dots k-1+(n \bmod m)], & id = m-1 \end{cases}$
    $f(i)$: the id of the URL we need to pick $\Rightarrow f(i) = id \cdot k + i$
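A sketch of how one process could enumerate its own URL ids with f(i) = id * k + i. It assumes zero-based URL ids and integer division for k; the generator name `url_ids_for_process` is hypothetical.

```python
def url_ids_for_process(n: int, m: int, process_id: int):
    """Yield the URL ids owned by one process: f(i) = process_id * k + i."""
    k = n // m
    count = k
    if process_id == m - 1:
        count += n % m    # the last process also handles the remainder
    for i in range(count):
        yield process_id * k + i


assignment = {pid: list(url_ids_for_process(10, 3, pid)) for pid in range(3)}
print(assignment)
```

No shared lock is needed because the ranges never overlap, but every process has to know n, m, and its own id, which is the "extra state to maintain" mentioned for this approach.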
  • 40. Approach 2: Logical shard
    • Solves race condition: solved
    • Fault tolerance: no rework; the image must be downloaded again if a crash happens during face detection; a partition can be abandoned
    • Communication types: HTTP/gRPC
    • Notes
      • Pros: non-locking at the DB level; simple implementation
      • Cons: hard to scale out/adjust; high database throughput; extra state to maintain (total URLs, current URL id, …)
  • 41. Approach 3: Queue/Worker x Logical Sharding
    $n$: total URLs
    $m$: number of processes
    $id \in [0 \dots m-1]$: process id
    $k = \lfloor n / m \rfloor$
    $i$: index of the URL being processed within a process
    $i \in \begin{cases} [0 \dots k-1], & id < m-1 \\ [0 \dots k-1+(n \bmod m)], & id = m-1 \end{cases}$
    $f(i)$: the id of the URL we need to pick $\Rightarrow f(i) = id \cdot k + i$
  • 42. Approach 3: Queue/Worker x Logical Sharding
    • Solves race condition: solved
    • Fault tolerance: no rework; failure isolation; a node is replaceable
    • Communication types: messaging
    • Notes
      • Pros: easy to scale; easy fault tolerance; failure isolation; asynchronous
      • Cons: extra infrastructure; high throughput on the queue
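The queue/worker flow might be sketched as below. This is a toy, in-process model under loud assumptions: `queue.Queue` stands in for a real broker such as RabbitMQ or Kafka, threads stand in for nodes, the `fail_once` parameter simulates a worker failure, and the "download" and "detect" steps are placeholder strings. The function `run_pipeline` is hypothetical.

```python
import queue
import threading


def run_pipeline(urls, num_downloaders=3, num_detectors=2, fail_once=frozenset()):
    """Two-stage pipeline: download queue -> face detection queue."""
    download_q = queue.Queue()
    detect_q = queue.Queue()
    results = {}
    already_failed = set()
    lock = threading.Lock()

    def downloader():
        while True:
            url = download_q.get()
            try:
                if url is None:          # sentinel: no more work
                    return
                with lock:
                    simulate_crash = url in fail_once and url not in already_failed
                    if simulate_crash:
                        already_failed.add(url)
                if simulate_crash:
                    download_q.put(url)  # the task goes back for any worker to pick up
                else:
                    detect_q.put((url, f"image:{url}"))  # placeholder for the saved image
            finally:
                download_q.task_done()

    def detector():
        while True:
            item = detect_q.get()
            try:
                if item is None:
                    return
                url, image = item
                with lock:
                    results[url] = f"faces:{image}"  # placeholder for detection output
            finally:
                detect_q.task_done()

    workers = [threading.Thread(target=downloader) for _ in range(num_downloaders)]
    workers += [threading.Thread(target=detector) for _ in range(num_detectors)]
    for w in workers:
        w.start()
    for url in urls:                     # the delegator role: push URLs into the queue
        download_q.put(url)
    download_q.join()
    for _ in range(num_downloaders):
        download_q.put(None)
    detect_q.join()
    for _ in range(num_detectors):
        detect_q.put(None)
    for w in workers:
        w.join()
    return results


urls = [f"https://abc.com/image{i}.jpg" for i in range(1, 6)]
results = run_pipeline(urls, fail_once={urls[1]})
print(len(results), "of", len(urls), "URLs completed")
```

The requeued URL still ends up in `results`, which mirrors the "no rework, node is replaceable" criteria: a failed task is simply handled by whichever worker is free next.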
  • 43. END
  • 44. Questions • How do we measure and debug the services? • What is the deployment process?
  • 45. Q&A THANK YOU FOR YOUR ATTENTION

Editor's Notes

  1. Hello everyone. Shall we begin? Let me introduce myself: I'm Ho, a Senior Software Engineer at AXON. At Axon we "Write Code, Save Lives". On the technical side, I enjoy solution design and code design, and I also like studying video and image encoding. Outside of coding I'm also a "normal" person who likes listening to music, watching movies, reading manga, and watching anime.
  2. Let me ask: has anyone here ever had to think about optimizing code to make a program run faster? What made you need to optimize the code?
  3. Today I want to share a very simple problem that became quite interesting once the amount of data to process grew very large. It is one I ran into myself. The goal is to give you more perspectives for solving problems in your own work ;) I believe that choosing the right technology makes our solutions more optimal, but in this talk I will not focus on technology selection. I am not claiming that the solutions I present are the best ones.
  4. I received a request to write a program like this: take the URL of an image as input, download that image, process it, and find the positions of the faces in it. => To make it easier to follow, please look at the program diagram.
  5. I believe everyone here can write this program. => And here is the program's code.
  6. But some of you will say that the Face Detection part is fairly complex without machine learning knowledge. That's true, but fortunately face detection is a common problem and you can use an existing library such as OpenCV or TensorFlow. => And here is the code for the Face Detection part.
  7. Use existing face detection code.
  8. And here is the result.
  9. You can see that the original problem is quite simple, right? But that is the problem with just one URL. What if we have 2 billion URLs? How big is 2 billion URLs? How long would it take to process them all? Those are the questions I asked myself when my boss asked me to use the original program to process all 2 billion image URLs in an existing data set. => So let's analyze it together.
  10. 2 billion images to download. If the earlier program needs 2 seconds to finish, then I need 127 years to process the set of 2 billion images. If I went back and told my boss that I need 127 years to process the data set he gave me, what do you think would happen? T = S/V. 1 year = 365.25 days.
  11. The problem is: what concurrency number do we need in order to finish the set of 2 billion images within the number of days we want?
  12. We need to raise the concurrency number from 0.5 to 3307 URL/s, an increase of about 661,300%. V = S/T.
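The figures in notes 10 to 12 can be checked with a few lines of arithmetic. This sketch takes 2 billion as 2 × 10^9 URLs and uses the 2 seconds per URL and the 7-day goal stated on the slides; the variable names are only illustrative.

```python
TOTAL_URLS = 2_000_000_000
SECONDS_PER_URL = 2
SECONDS_PER_DAY = 3600 * 24
GOAL_DAYS = 7

# T = S / V: time needed when URLs are processed one after another.
sequential_days = TOTAL_URLS * SECONDS_PER_URL / SECONDS_PER_DAY
sequential_years = sequential_days / 365.25

# V = S / T: rate needed to finish within the goal.
required_rate = TOTAL_URLS / (GOAL_DAYS * SECONDS_PER_DAY)
current_rate = 1 / SECONDS_PER_URL
speedup = required_rate / current_rate

print(f"{sequential_days:.0f} days, about {sequential_years:.0f} years")
print(f"required rate: {required_rate:.0f} URL/s, speedup: about {speedup:.0f}x")
```

A speedup of roughly 6,600 times is the figure the talk rounds to about 661,300%.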
  13. In my experience there are 3 main ways to increase the concurrency number. There are many techniques, but in general they come down to three. Optimizing code can help you increase concurrency. Adding hardware, for example a faster CPU, faster disk reads and writes, or more RAM. In the past I often urged customers to increase the IOPS of the database server to speed things up. Scale-out, which has 3 methods: cloning to multiple nodes/processes, splitting by function, and partitioning the data for processing. We will analyze each one in turn.
  14. Optimizing code can give us extremely good results if we find an algorithm that is many times better. But we have to spend a lot of time and effort to find the place that needs optimizing.
  15. In the cloud computing era, you can have an extremely powerful server in just a few clicks. But it is very expensive (Azure Calculator) https://st.ht/M6rGb and you will hit the limit soon. I felt there was nothing interesting about this approach.
  16. Horizontal scale-out (X-axis), as I said, means taking an existing node and cloning it into many nodes that process data at the same time. With this approach you are not limited by hardware, and fault tolerance is high. But it is hard to deploy; race conditions are one example. For the current problem, how could a race condition happen?
  17. Vertical scale-out (Y-axis) means splitting one node that contains all the functions into several nodes, each with one function. The microservices architecture is based on this kind of split. We need to review the program's functions to understand the main ones and the benefit of separating them. Luckily my program's functions are very simple, so I quickly split it into 2 parts. Can anyone help me split it?
  18. By splitting into 2 groups, my system can take advantage of the strengths of the hardware.
  19. This is the diagram of how the nodes are split. At this point you can also see that, starting from a simple problem, we are now facing more complex ones. But it doesn't stop here. There are some problems left to solve.
  20. => We now have a basic approach, scale-out, but there are still some issues to consider.
  21. I call them "High Concurrency Problems" because the goal is to solve the issues that must be handled in order to increase the concurrency number.
  22. Multiple nodes process the same URL. The stored results can get mixed up. So how do we solve the race condition? How do we achieve the best performance? These are the questions to keep in mind when looking for a solution.
  23. To avoid handing out duplicate URLs, we have 3 solutions. Distributed locks: can be implemented with the Redis Set/Get NX option. Data sharding: a form of scale-out; we need to find an effective way to split the data, and it is costly in preparation and infrastructure. Queue: easy to implement and easy to scale. To save time, I will skip the Distributed Locks part.
  24. Fault tolerance is something I always think about when designing a system. Accounting for the failure cases that can occur helps you reduce the risks your system may face. For example: losing data, or redoing the work from the start.
  25. How do the services communicate with each other? Do we need a load balancer?
  26. Next we will look for ways to solve the problems above and find an approach.
  27. To arrive at a solution, I will analyze each of the problems from the previous part and how to solve it. First is the race condition.
  28. The cost of preparation and deployment is high. Hard to scale. Adding a new node: we can't assign a new node to an existing partition, so we need to re-shard the data and start again with a new number of partitions. Removing a node (a node crash): the partition of this node can be abandoned.
  30. The load is concentrated on a single database, so the database can become the bottleneck (we can address this by giving the database node more hardware resources). Hard to scale. Removing a node (or a node crash): we need to recover the node (or add a new one with the same id) if we don't want its partition to be abandoned. Adding a new node: restart all nodes in the system with the new "Number Of Processes" value. We also need to maintain the number of remaining URLs (or the last processed URL).
  32. The Delegator/Load Balancer is a service that fetches URLs from the URL database and pushes them into the Queue, which is consumed by Workers (download & process image). Queue: an abstraction; we can build this queue inside the Delegator/Load Balancer or use an open-source project like RabbitMQ/Kafka. Process (1..N): a Worker that consumes a URL from the Queue for processing, then pushes another queue item for the Face Detection phase. This approach has some advantages. We achieve fault tolerance and scalability naturally: when a worker fails, its URLs are handled by other workers, and when we add a worker, it consumes outstanding URLs from the queue. We can reuse the Queue system for Face Detection so that the result of "Download & Process image" is not wasted: the downloaded and processed image is saved to storage before the task item is put into the Face Detection queue. The implementation is the same for a single goroutine and for multiple goroutines in a node, which gives us flexibility.