Consistency and Availability
tradeoff in database clusters
Grokking Techtalk 40
About Me
● About Me
● Introduce Segmentation Platform
About Me
● At Grab for 2 years, currently working as the lead engineer of the Segmentation Platform project
● Leads the Research Database group in Grokking Lab
About Segmentation Platform (SegP)
● Technology
○ Programming languages: Go, Java, Scala
○ Batch processing (Spark, Scala)
○ Caching (Redis)
○ Message queues (SQS, Kafka)
○ Relational database (MySQL)
○ Non-relational databases (Cassandra, DynamoDB, Elasticsearch)
● Team's scope
○ Feature development: coordinate with business owners to develop a platform for segmentation, similar to segments.io but for internal users.
○ Batch data processing.
○ Real-time traffic: build and maintain gRPC APIs to serve online traffic.
What we'll discuss in this talk
● CAP theorem
● The cluster architecture of Redis, Elasticsearch, and Cassandra
● How the C-A tradeoff is reflected in their designs
CAP Theorem
Consistency
The system is considered consistent if client 2 receives v1 whenever its read request (2) happens after the write request (1).
[Diagram: the DB system currently holds v = v0. Client 1 sends (1) Update v=v1; Client 2 sends (2) Get v.]
Availability
For each request, the system has an algorithm, a defined sequence of steps, designed to handle that request.
If the system can't go through the algorithm designed for that request, it is considered "not available" to that client.
[Diagram, available case: (1) the client sends a request, (2) the system goes through the algorithm defined for this request, (3) the system returns 2xx or 4xx.]
[Diagram, unavailable case: (1) the client sends a request, (2) the system cannot go through the algorithm defined for this request, and the client gets a 500.]
Network partition
A network partition happens when some of the nodes cannot communicate properly with each other, so they believe that the others are offline.
For example, Node 1 cannot communicate with Node 2, so Node 1 thinks that Node 2 is offline. But Node 2 is still alive and still serving requests.
[Diagram: Node 1 and Node 2, with Client 1 and Client 2.]
CAP Theorem
A distributed database has three very desirable properties:
1. Tolerance towards network partitions
2. Consistency
3. Availability
The CAP theorem states: you can have at most two of these properties for any shared-data system.
[Diagram: Consistency, Availability, Partition tolerance.]
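The tradeoff can be made concrete with a toy two-node store. This is only a sketch under stated assumptions: the class names, the `mode` labels "CP" and "AP", and the `partitioned` flag are invented for illustration and do not correspond to any real database API.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.value = "v0"


class TwoNodeStore:
    """Toy store with two copies of one value (hypothetical, for illustration)."""

    def __init__(self, mode):
        self.mode = mode              # "CP" or "AP": labels used only in this sketch
        self.n1 = Node("node1")
        self.n2 = Node("node2")
        self.partitioned = False      # True simulates a network partition

    def write(self, node, value):
        peer = self.n2 if node is self.n1 else self.n1
        if self.partitioned:
            if self.mode == "CP":
                # Give up availability: refuse the write so the copies never diverge.
                raise RuntimeError("unavailable: cannot reach peer")
            # Give up consistency: accept the write locally, the peer stays stale.
            node.value = value
            return
        node.value = value
        peer.value = value

    def read(self, node):
        return node.value


cp = TwoNodeStore("CP")
cp.partitioned = True
try:
    cp.write(cp.n1, "v1")
    cp_write_ok = True
except RuntimeError:
    cp_write_ok = False   # the write was rejected during the partition

ap = TwoNodeStore("AP")
ap.partitioned = True
ap.write(ap.n1, "v1")     # accepted, but node2 never hears about it
```

In the CP run both nodes still agree on v0 but the write was refused; in the AP run the write succeeded but `ap.read(ap.n2)` still returns v0.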
Redis Cluster
What is Redis
- Stands for Remote Dictionary Server
- Is a fast, open-source, in-memory key-value data store for use as a database, cache, message broker, and queue.
- Delivers sub-millisecond response times, enabling millions of requests per second for real-time applications in gaming, ad tech, financial services, healthcare, and IoT.
- Popular choice for caching, session management, gaming, leaderboards, real-time analytics, geospatial, ride-hailing, chat/messaging, media streaming, and pub/sub apps.
Redis cluster - Multi-master
Each key is hashed to one of 16384 hash slots (Redis numbers them 0-16383; the diagram uses tokens 1-16384). Depending on the hash value, the key is read from and written to the node assigned that token.
[Diagram: the client has key -> value pairs 5 -> "ho chi minh" and 6 -> "ha noi", with key -> hash 5 -> 18 and 6 -> 8003. Redis node 1 owns tokens 1-8000 and stores 5 -> "ho chi minh"; Redis node 2 owns tokens 8001-16384 and stores 6 -> "ha noi".]
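A minimal sketch of the key-to-slot mapping, assuming the rule from the Redis Cluster specification (CRC16 of the key modulo 16384, using the XMODEM variant of CRC16). Hash tags (`{...}`) are deliberately left out, and the two-node slot ranges and the names `key_slot` and `node_for_slot` are hypothetical, chosen only to mirror the two-node picture on the slide.

```python
import binascii

HASH_SLOTS = 16384

# Hypothetical two-node layout: each node owns a contiguous range of slots.
SLOT_RANGES = {
    "redis-node-1": (0, 8191),
    "redis-node-2": (8192, 16383),
}


def key_slot(key: str) -> int:
    """Return the hash slot for a key (hash tags are not handled here)."""
    # binascii.crc_hqx with an initial value of 0 computes CRC-CCITT (XMODEM).
    return binascii.crc_hqx(key.encode("utf-8"), 0) % HASH_SLOTS


def node_for_slot(slot: int) -> str:
    """Find which node in SLOT_RANGES owns the given slot."""
    for node, (low, high) in SLOT_RANGES.items():
        if low <= slot <= high:
            return node
    raise ValueError(f"slot {slot} is not assigned to any node")


slot = key_slot("123456789")
owner = node_for_slot(slot)
print(slot, owner)
```

A client that knows the slot map can send each command straight to the owning node, which is why every master in the cluster can accept writes for its own share of the keys.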
Redis cluster - Master/Replica
Redis uses asynchronous replication, with asynchronous replica-to-master acknowledgements of the amount of data processed.
A master can have multiple replicas.
Clients write to the master but can read from replicas.
[Diagram: Client 1 sends a write command to the Redis master, which sends async updates to two Redis replicas; Client 2 sends a read command to a replica.]
Ref: https://redis.io/topics/replication
C-A tradeoff
Redis uses asynchronous replication by default, which means that, by default, it is AP.
If a network partition happens between the master and a replica, clients will see inconsistent data.
[Diagram: Client 1 sends a write command to the Redis master, which sends async updates to two Redis replicas; Client 2 sends a read command to a replica, which returns stale data.]
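The stale read can be reproduced with a small in-memory model. This is a sketch only: `Master`, `Replica`, `link_up`, and `in_flight` are hypothetical names, and the queue stands in for the replication stream rather than modelling the real Redis protocol.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.data = {}
        self.link_up = True          # False simulates a network partition
        self.in_flight = deque()     # updates sent by the master, not yet applied

    def deliver(self):
        """Apply queued updates, but only while the link to the master is up."""
        while self.link_up and self.in_flight:
            key, value = self.in_flight.popleft()
            self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class Master:
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas

    def write(self, key, value):
        self.data[key] = value
        for replica in self.replicas:
            replica.in_flight.append((key, value))
        # Asynchronous replication: acknowledge without waiting for replicas.
        return "OK"


replica = Replica()
master = Master([replica])

master.write("v", "v0")
replica.deliver()                 # replica catches up to v0

replica.link_up = False           # partition between master and replica
master.write("v", "v1")           # client 1 still gets "OK"
replica.deliver()                 # nothing is applied
stale = replica.read("v")         # client 2 reads the old value

replica.link_up = True            # partition heals
replica.deliver()
fresh = replica.read("v")
```

The write stays available throughout the partition, and the price is that `stale` is still v0 until the link comes back.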
Elasticsearch cluster
What is Elasticsearch
● Elasticsearch is a distributed, open source search and analytics engine for all types of data,
including textual, numerical, geospatial, structured, and unstructured.
● Elasticsearch is built on Apache Lucene and was first released in 2010 by Elasticsearch N.V.
(now known as Elastic).
Roles in Elasticsearch Cluster
● Master node: manages the overall operation of a cluster and keeps track of the cluster state.
● Data node: stores and searches data. Performs all data-related operations (indexing, searching, aggregating) on local shards.
● Coordinator node: delegates client requests to the shards on the data nodes, collects and aggregates the results into one final result, and sends this result back to the client.
[Diagram: Client 1 and Client 2, a coordinator node, a master node, and two data nodes.]
Primary and Replica shards
Steps for primary shards:
● Validate the incoming operation and reject it if structurally invalid.
● Execute the operation locally.
● Forward the operation to each replica in the current in-sync copies set.
● Once all replicas have successfully performed the operation and responded to the primary, the primary acknowledges the successful completion of the request to the client.
[Diagram: a document (user1, auth, a1, "login from homepage") goes from the coordination node to the destination node; shards p1, r11, r12, p2, r21, r22; the primary shards forward the operation to the replicas.]
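The four primary-shard steps can be sketched as a single function. This models only the simplified flow listed on the slide, not Elasticsearch's actual failure handling; `ShardCopy`, `primary_write`, and the returned status strings are hypothetical names.

```python
class ShardCopy:
    """One copy of a shard (primary or replica); hypothetical, for illustration."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable   # False simulates a network partition
        self.docs = {}

    def apply(self, doc_id, doc):
        if not self.reachable:
            raise ConnectionError(f"{self.name} is unreachable")
        self.docs[doc_id] = doc


def primary_write(primary, in_sync_replicas, doc_id, doc):
    # Step 1: validate the operation and reject it if structurally invalid.
    if not doc_id or not isinstance(doc, dict):
        return "rejected"
    # Step 2: execute the operation locally on the primary.
    primary.apply(doc_id, doc)
    # Step 3: forward the operation to each replica in the in-sync set.
    for replica in in_sync_replicas:
        try:
            replica.apply(doc_id, doc)
        except ConnectionError:
            # A replica did not confirm, so the write is not acknowledged.
            return "not acknowledged"
    # Step 4: every replica responded, so acknowledge to the client.
    return "acknowledged"


p1 = ShardCopy("p1")
r11 = ShardCopy("r11")
r12 = ShardCopy("r12")

ok = primary_write(p1, [r11, r12], "1", {"user": "user1", "event": "login"})
r12.reachable = False
blocked = primary_write(p1, [r11, r12], "2", {"user": "user1", "event": "logout"})
invalid = primary_write(p1, [r11, r12], "", {})
```

Because the acknowledgement waits for every in-sync replica, an unreachable replica turns into a write that the client never sees confirmed.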
C-A tradeoff
If a network partition happens, the primary shard cannot write to the replica shard, so the write cannot complete and the primary shard becomes unavailable.
By default, Elasticsearch is more CP.
[Diagram: destination node; shards p1, r11, r12, p2, r21, r22; the primary shards forward the operation to the replicas.]
Cassandra
What is Cassandra
Apache Cassandra is an open source, distributed NoSQL database that began internally at
Facebook and was released as an open-source project in July 2008.
Cassandra delivers continuous availability (zero downtime), high performance, and linear
scalability that modern applications require, while also offering operational simplicity and effortless
replication across data centers and geographies.
Cassandra Ring Cluster
- Each node is assigned a range of tokens.
- A client can connect to any node to write; that node becomes the coordinator node.
- Partition keys are hashed into a token. The coordinator uses the token to determine which node stores the data.
[Diagram: a ring of eight nodes with token ranges 1-20, 21-40, 41-60, 61-80, 81-100, 101-120, 121-140, and 141-160. A row (user1, auth, a1, "login from homepage") hashes to token 44; the coordinator node sends it to the destination node.]
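The lookup from token to owning node can be sketched with a sorted list and a binary search. The token ranges come from the slide's ring; the node names and `toy_token` are assumptions, and `toy_token` is only a stand-in hash, not Cassandra's real partitioner.

```python
import bisect
import zlib

# Upper bound of each node's token range, taken from the ring on the slide.
RANGE_ENDS = [20, 40, 60, 80, 100, 120, 140, 160]
# Hypothetical node names, one per range.
NODES = [f"node{i}" for i in range(1, len(RANGE_ENDS) + 1)]


def toy_token(partition_key: str) -> int:
    """Stand-in hash that maps a partition key to a token in 1..160."""
    return zlib.crc32(partition_key.encode("utf-8")) % RANGE_ENDS[-1] + 1


def owner(token: int) -> str:
    """Return the node whose token range contains the token."""
    if not 1 <= token <= RANGE_ENDS[-1]:
        raise ValueError(f"token {token} is outside the ring")
    # bisect_left finds the first range whose upper bound is >= token.
    return NODES[bisect.bisect_left(RANGE_ENDS, token)]


destination = owner(44)   # token 44 falls in the range 41-60
print(destination, owner(toy_token("user1")))
```

Any node can run this lookup, which is what lets whichever node the client connects to act as the coordinator.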
Replication Factor
- Replication Factor (RF) = the number of copies we want to store.
- The replication nodes are determined by the replication strategy.
- Simple strategy = the next two nodes on the ring are the replication nodes (for RF = 3).
[Diagram: the same ring of eight nodes with token ranges 1-20 through 141-160. A row (user1, auth, a1, "login from homepage") hashes to token 44; the coordinator node and the replication nodes are marked.]
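A minimal sketch of the simple strategy described above: start at the node that owns the token, then walk clockwise until RF nodes are collected. The ring layout is taken from the slide; `replicas_for` is a hypothetical helper, and nodes are identified here by their token range.

```python
import bisect

# Token ranges from the slide's ring: (1, 20), (21, 40), ..., (141, 160).
RANGES = [(start, start + 19) for start in range(1, 160, 20)]
RANGE_ENDS = [high for _, high in RANGES]


def replicas_for(token: int, rf: int = 3):
    """Return the token ranges of the rf nodes that store this token."""
    if not 1 <= token <= RANGE_ENDS[-1]:
        raise ValueError(f"token {token} is outside the ring")
    if not 1 <= rf <= len(RANGES):
        raise ValueError("rf must be between 1 and the number of nodes")
    first = bisect.bisect_left(RANGE_ENDS, token)
    # Take the owner plus the next rf - 1 nodes, wrapping around the ring.
    return [RANGES[(first + i) % len(RANGES)] for i in range(rf)]


print(replicas_for(44))    # owner 41-60, then the next two nodes
print(replicas_for(150))   # wraps from the last range back to the first
```

With RF = 3, token 44 lands on the nodes owning 41-60, 61-80, and 81-100, and a token near the end of the ring wraps around to the first ranges.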
Data Consistency
[Diagram: coordinator nodes C1 and C2; replica nodes A, B, and C. Client 1 writes data with token 44 and Client 2 reads data with token 44. Two replicas hold v2 and one still holds v1.]
- Client 1 connects to C1 to write. C1 writes the data to three nodes, but the write fails at node B.
- Client 2 connects to C2 to read the data.
What would happen?
Consistency Level (Write)
● One
○ Read: returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent.
○ Write: a write must be written to the commit log and memtable of at least one replica node.
● Quorum
○ Read: returns the record after a quorum of replicas has responded.
○ Write: a write must be written to the commit log and memtable on a quorum of replica nodes.
● All
○ Read: returns the record after all replicas have responded. The read operation will fail if a replica does not respond.
○ Write: a write must be written to the commit log and memtable on all replica nodes in the cluster for that partition.
Write with CL=ALL
[Diagram: Client 1 writes data with token 44 through C1 to replica nodes A, B, and C; two replicas hold v2 and one still holds v1.]
- All replicas succeeded -> success
- 1 replica failed -> failed
Result: Failed
Write with CL=QUORUM
[Diagram: Client 1 writes data with token 44 through C1 to replica nodes A, B, and C; two replicas hold v2 and one still holds v1.]
Quorum = floor(RF / 2) + 1 = 2 (with RF = 3)
- Two replicas succeeded -> success
- Fewer than two succeeded -> failed
Result: Success
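The outcome of a write at each consistency level depends only on how many replicas acknowledged it. A minimal sketch, assuming the usual quorum definition of floor(RF / 2) + 1; the function names are hypothetical and this is not the driver API.

```python
def quorum(rf: int) -> int:
    """Smallest number of replicas that forms a majority of rf."""
    return rf // 2 + 1


def required_acks(level: str, rf: int) -> int:
    """Number of replica acknowledgements a write needs at this level."""
    return {"ONE": 1, "QUORUM": quorum(rf), "ALL": rf}[level]


def write_succeeds(level: str, rf: int, acks: int) -> bool:
    """True if enough replicas acknowledged the write."""
    return acks >= required_acks(level, rf)


# The scenario from the slides: RF = 3 and the write fails at node B,
# so only two replicas acknowledge.
RF, ACKS = 3, 2
for level in ("ALL", "QUORUM", "ONE"):
    print(level, "success" if write_succeeds(level, RF, ACKS) else "failed")
```

With two acknowledgements out of three, the same write fails at ALL but succeeds at QUORUM and ONE.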
Consistency Level (Read)
Write=QUORUM, Read=One
[Diagram: Client 1 writes data with token 44 through C1 and Client 2 reads data with token 44 through C2; replica nodes A, B, and C; two replicas hold v2 and node B still holds v1.]
Potentially inconsistent read: if client 2 reads from node B, it will receive stale data.
W (Quorum) + R (1) -> eventually consistent
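Write=QUORUM, Read=QUORUM
[Diagram: Client 1 writes data with token 44 through C1 and Client 2 reads data with token 44 through C2; replica nodes A, B, and C; two replicas hold v2 and node B still holds v1.]
Any pair of replicas that a quorum read contacts includes at least one node holding v2, so the read always returns v2.
W (Quorum) + R (Quorum) -> consistent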
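The claim can be checked by brute force: enumerate every set of replicas a read might contact and see whether the newest version is always among them. This sketch assumes the coordinator returns the newest version it sees; the integer versions and the assignment of v2 to nodes A and C are illustrative assumptions based on the slides saying node B missed the write.

```python
from itertools import combinations

# Replica state after the quorum write: B missed the update.
# Versions are modelled as integers (1 = v1, 2 = v2).
REPLICAS = {"A": 2, "B": 1, "C": 2}


def always_sees_latest(replicas: dict, read_count: int) -> bool:
    """True if every possible set of read_count replicas contains the latest version."""
    latest = max(replicas.values())
    return all(
        max(replicas[node] for node in contacted) == latest
        for contacted in combinations(replicas, read_count)
    )


read_one = always_sees_latest(REPLICAS, 1)      # reading only B returns v1
read_quorum = always_sees_latest(REPLICAS, 2)   # every pair includes A or C
print(read_one, read_quorum)
```

Reading one replica can land on B alone, while any two replicas must overlap with the two that accepted the write.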
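Write=All, Read=One
[Diagram: Client 1 writes data with token 44 through C1 and Client 2 reads data with token 44 through C2; replica nodes A, B, and C.]
A write with CL=ALL succeeds only after every replica has stored v2, so a read from any single node, including node B, returns v2. If the write fails at node B, the write itself is reported as failed instead of being acknowledged.
W (All) + R (1) -> consistent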
Summary of Read and Write CL
WRITE    READ     Consistency    Read Availability   Write Availability
All      All      Consistent     Low                 Low
Quorum   All      Consistent     Low                 Medium
One      All      Consistent     Low                 High
All      Quorum   Consistent     Medium              Low
Quorum   Quorum   Consistent     Medium              Medium
One      Quorum   Inconsistent   Medium              High
All      One      Consistent     High                Low
Quorum   One      Inconsistent   High                Medium
One      One      Inconsistent   High                High
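Every row of this table follows from one rule: a read is guaranteed to see the latest write when the number of replicas written plus the number of replicas read exceeds RF. The sketch below regenerates the table under that rule for RF = 3; the Low/Medium/High labels are copied from the slide, and `summarize` is a hypothetical helper.

```python
RF = 3

# Replicas that must respond at each level, for RF = 3.
REQUIRED = {"All": RF, "Quorum": RF // 2 + 1, "One": 1}
# Availability labels as used on the slide: fewer required replicas, higher availability.
AVAILABILITY = {"All": "Low", "Quorum": "Medium", "One": "High"}


def summarize(write: str, read: str):
    """Return (consistency, read availability, write availability) for a CL pair."""
    overlaps = REQUIRED[write] + REQUIRED[read] > RF
    consistency = "Consistent" if overlaps else "Inconsistent"
    return consistency, AVAILABILITY[read], AVAILABILITY[write]


for read in ("All", "Quorum", "One"):
    for write in ("All", "Quorum", "One"):
        consistency, read_av, write_av = summarize(write, read)
        print(f"{write:<8} {read:<8} {consistency:<14} {read_av:<8} {write_av}")
```

Quorum writes with quorum reads give 2 + 2 > 3, which is the cheapest pairing that stays consistent while keeping both availabilities at Medium.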
Summary
● Redis: Availability > Consistency
● Cassandra: Tweakable availability and consistency
● Elasticsearch: Availability < Consistency
Q&A

Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster

  • 1. Consistency and Availability tradeoff in database clusters Grokking Techtalk 40
  • 2. About Me ● About Me ● Introduce Segmentation Platform 2
  • 3. About Me ● Joint Grab for 2 years, currently working as the lead engineer of Segmentation Platform project ● Lead the Research Database group in Grokking Lab 3
  • 4. About Segmentation Platform (SegP) ● Technology ○ Programming Languages: golang, java, scala ○ Batch processing (spark, scala), ○ Caching (redis), ○ Message queue (SQS, Kafka), ○ Relational database (MySQL), ○ Non-relational (Cassandra, DynamoDB, Elastic Search), ● Team's scope ○ Features development. Coordinate with business owners to develop a platform for segmentation. Similar to segments.io but for internal users. ○ Batch data processing. ○ Real-time traffic. Build and maintain grpc apis to serve online traffic 4
  • 5. What we'll discuss in this talk ● CAP theorem ● The cluster architect of Redis, Elastic Search, Cassandra ● How C-A tradeoff reflected in their designs 5
  • 7. Consistency The system is considered consistent if v1 is returned to client 2 if the read request (2) happened after the write request (1) Client 1 Client 2 DB System (1) Update v=v1 (2) Get v V value is currently v0 7
  • 8. Availability When 1 request is sent, one algorithm is being designed to handle that request which some steps. If the system can't go through the algorithm designed for that request, they're considered "not available" to that client. Client 1 Client 2 DB System 500 DB System (3) System return 2xx or 4xxx (1) Client send request (2) System went through the algorithm defined for this request (2) System cannot go through the algorithm defined for this request 8
  • 9. Network partition Network partition happened when some of the nodes cannot communicate properly to each other and they believe that the others was offline. For example, Node 1 cannot communicate with Node 2, hence Node 1 thought that Node 2 is offline. But Node 2 still alive, and still serve requests. Node 1 Node 2 Client 1 Client 2 9
  • 10. CAP Theorem A distributed database has three very desirable properties: 1. Tolerance towards Network Partition 2. Consistency 3. Availability The CAP theorem states: You can have at most two of these properties for any shared-data system Consistency Availability Partition tolerance 10
  • 12. What is Redis 12 - Stands for Remote Dictionary Server - Is a fast, open-source, in-memory key-value data store for use as a database, cache, message broker, and queue. - Delivers sub-millisecond response times enabling millions of requests per second for real- time applications in Gaming, Ad-Tech, Financial Services, Healthcare, and IoT. - Popular choice for caching, session management, gaming, leaderboards, real-time analytics, geospatial, ride-hailing, chat/messaging, media streaming, and pub/sub apps.
  • 13. Redis cluster - Multi-master Key is hashed into (1-16384). Depends on the hash value, the value will be read (and write into the node assigned that token accordingly.) Client Redis node 1 Redis node 2 key -> value 5 -> "ho chi minh" 6 -> "ha noi" token 1->8000 token 8001- >16384 key -> hash 5 -> 18 6 -> 8003 13 6 -> "ha noi" 5 -> "ho chi minh"
  • 14. Redis cluster - Master/Replica Redis uses asynchronous replication, with asynchronous replica-to-master acknowledges of the amount of data processed. A master can have multiple replicas. Client write to master, but can read from replica Client 1 Redis master Redis replica Redis replica Client 2 Write command async updates Read command Ref: https://redis.io/topics/replication async updates 14
  • 15. C-A tradeoff Redis uses asynchronous replication by default. Which means, by default, it's AP. If network partition happened between master and replica, we'll see inconsistent data. Client 1 Redis master Redis replica Redis replica Client 2 Write command async updates async updates Read command return stale data 15
  • 17. What is Elasticsearch
    ● Elasticsearch is a distributed, open source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured.
    ● Elasticsearch is built on Apache Lucene and was first released in 2010 by Elasticsearch N.V. (now known as Elastic).
  • 18. Roles in Elasticsearch Cluster
    ● Master node: manages the overall operation of a cluster and keeps track of the cluster state.
    ● Data node: stores and searches data. Performs all data-related operations (indexing, searching, aggregating) on local shards.
    ● Coordinator node: delegates client requests to the shards on the data nodes, collects and aggregates the results into one final result, and sends this result back to the client.
    [Diagram: Client 1 and Client 2 connect to the coordinator node, which talks to the master node and two data nodes.]
  • 19. Primary and Replica shards
    Steps for primary shards:
    ● Validate the incoming operation and reject it if it is structurally invalid.
    ● Execute the operation locally.
    ● Forward the operation to each replica in the current in-sync copies set.
    ● Once all replicas have successfully performed the operation and responded to the primary, the primary acknowledges the successful completion of the request to the client.
    [Diagram: a sample record (user1, auth, a1, login from homepage) goes from the coordination node to the destination node. Primary shards p1 and p2 forward the operation to their replicas r11, r12 and r21, r22.]
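The four steps above can be sketched as a toy model in Python. This is not Elasticsearch's implementation: `ShardCopy` and `primary_index` are hypothetical names, and treating an unreachable replica as an unacknowledged write is a simplification of the real failure handling.

```python
class ShardCopy:
    """Toy shard copy (primary or replica) holding documents in memory."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.docs = {}

    def apply(self, doc_id, doc):
        if not self.reachable:
            raise ConnectionError(f"{self.name} is unreachable")
        self.docs[doc_id] = doc


def primary_index(primary, in_sync_replicas, doc_id, doc):
    """Run one indexing operation through the primary's four steps."""
    # 1. Validate the operation and reject it if structurally invalid.
    if not doc_id or not isinstance(doc, dict):
        return {"acknowledged": False, "reason": "invalid operation"}
    # 2. Execute the operation locally on the primary.
    primary.apply(doc_id, doc)
    # 3. Forward the operation to each replica in the in-sync set.
    for replica in in_sync_replicas:
        try:
            replica.apply(doc_id, doc)
        except ConnectionError as err:
            # Simplification: the write is simply not acknowledged.
            return {"acknowledged": False, "reason": str(err)}
    # 4. All replicas responded, so acknowledge to the client.
    return {"acknowledged": True, "reason": None}
```

Because the acknowledgement waits for every in-sync replica, a replica that cannot be reached delays or blocks the write in this model, which is the consistency-over-availability behaviour the talk attributes to Elasticsearch.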
  • 20. C-A tradeoff
    If a network partition happens, the primary shard cannot write to its replica shards, so the primary shard becomes unavailable for writes. By default, Elasticsearch leans towards CP.
    [Diagram: on the destination node, primary shards p1 and p2 forward the operation to their replicas r11, r12 and r21, r22.]
  • 22. What is Cassandra
    Apache Cassandra is an open source, distributed NoSQL database that began internally at Facebook and was released as an open-source project in July 2008. Cassandra delivers continuous availability (zero downtime), high performance, and linear scalability that modern applications require, while also offering operational simplicity and effortless replication across data centers and geographies.
  • 23. Cassandra Ring Cluster
    - Each node is assigned a range of tokens.
    - A client can connect to any node to write; that node becomes the coordinator node.
    - Partition keys are hashed into a token. The coordinator uses the token to determine which node stores the data.
    [Diagram: a ring of eight nodes with token ranges 1-20, 21-40, 41-60, 61-80, 81-100, 101-120, 121-140, 141-160. A sample record (user1, auth, a1, login from homepage) hashes to token 44; the coordinator node routes it to the destination node.]
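A small sketch of the token lookup a coordinator performs, using the eight token ranges from the diagram. The node names and the `token_for` hash are assumptions made for illustration: the CRC32-based hash is only a stand-in that maps a partition key into the slide's 1-160 token space and is not the hash Cassandra uses.

```python
import bisect
import zlib

# Upper bound of each node's token range, taken from the diagram.
RANGE_ENDS = [20, 40, 60, 80, 100, 120, 140, 160]
NODES = [f"node-{i}" for i in range(1, 9)]  # hypothetical node names


def token_for(partition_key: str) -> int:
    """Stand-in hash that maps a partition key to a token in 1..160."""
    return zlib.crc32(partition_key.encode("utf-8")) % RANGE_ENDS[-1] + 1


def owner_of(token: int) -> str:
    """Return the node whose token range contains the given token."""
    if not 1 <= token <= RANGE_ENDS[-1]:
        raise ValueError(f"token {token} is outside the ring")
    return NODES[bisect.bisect_left(RANGE_ENDS, token)]
```

With these ranges, token 44 falls in 41-60, so `owner_of(44)` returns the third node on the ring.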
  • 24. Replication Factor
    - Replication Factor (RF) = the number of copies we want to store.
    - The replica nodes are determined by the replication strategy.
    - Simple strategy: the next RF - 1 nodes clockwise on the ring hold the other copies (the next two nodes when RF = 3).
    [Diagram: the same ring with token ranges 1-20 through 141-160. The sample record (user1, auth, a1, login from homepage) hashes to token 44; the coordinator node sends it to the owning node and its replication nodes.]
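Replica placement under the simple strategy can be sketched by walking the ring clockwise from the node that owns the token. The ring layout comes from the diagram; the node names are hypothetical.

```python
import bisect

# Upper bound of each node's token range, taken from the diagram.
RANGE_ENDS = [20, 40, 60, 80, 100, 120, 140, 160]
NODES = [f"node-{i}" for i in range(1, 9)]  # hypothetical node names


def replicas_for(token: int, rf: int) -> list:
    """Return the owner of the token plus the next rf - 1 nodes clockwise."""
    if not 1 <= token <= RANGE_ENDS[-1]:
        raise ValueError(f"token {token} is outside the ring")
    if not 1 <= rf <= len(NODES):
        raise ValueError("rf must be between 1 and the number of nodes")
    first = bisect.bisect_left(RANGE_ENDS, token)
    # The modulo wraps around the end of the ring.
    return [NODES[(first + i) % len(NODES)] for i in range(rf)]
```

For token 44 and RF = 3 this selects the nodes owning 41-60, 61-80, and 81-100; a token near the end of the ring wraps around to the first nodes.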
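  • 25. Data Consistency
    - Client 1 connects to coordinator C1 to write. C1 writes the data to three nodes, but the write fails at node B.
    - Client 2 connects to coordinator C2 to read the same data. What would happen?
    [Diagram: Client 1 writes data with token 44 through C1; Client 2 reads data with token 44 through C2. Of the replicas A, B, and C, two hold v2 and the one that missed the write still holds v1.]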
  • 26. Consistency Level (Write)
    ● One
      ○ Read: Returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent.
      ○ Write: A write must be written to the commit log and memtable of at least one replica node.
    ● Quorum
      ○ Read: Returns the record after a quorum of replicas has responded.
      ○ Write: A write must be written to the commit log and memtable on a quorum of replica nodes.
    ● All
      ○ Read: Returns the record after all replicas have responded. The read operation will fail if a replica does not respond.
      ○ Write: A write must be written to the commit log and memtable on all replica nodes in the cluster for that partition.
  • 27. Write with CL=ALL
    - All replicas succeed -> success
    - One replica fails -> failure
    Result: Failed
    [Diagram: Client 1 writes data with token 44 through C1; two replicas hold v2 and one still holds v1.]
  • 28. Write with CL=QUORUM
    Quorum = floor(RF / 2) + 1 = 2 (for RF = 3)
    - At least two replicas succeed -> success
    - Fewer than two replicas succeed -> failure
    Result: Success
    [Diagram: Client 1 writes data with token 44 through C1; two replicas hold v2 and one still holds v1.]
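The write outcomes on the last two slides follow from how many acknowledgements each consistency level requires. A minimal sketch, assuming a single data center and using Cassandra's quorum rule floor(RF / 2) + 1 (which gives 2 for RF = 3); the function names are made up for illustration.

```python
def quorum(rf: int) -> int:
    """Number of replicas that form a quorum: floor(rf / 2) + 1."""
    return rf // 2 + 1


def required_acks(level: str, rf: int) -> int:
    """Replica acknowledgements needed for a given consistency level."""
    return {"ONE": 1, "QUORUM": quorum(rf), "ALL": rf}[level]


def write_succeeds(level: str, rf: int, acks: int) -> bool:
    """A write succeeds when enough replicas acknowledged it."""
    return acks >= required_acks(level, rf)
```

With RF = 3 and one failed replica (two acknowledgements), `write_succeeds("ALL", 3, 2)` is false and `write_succeeds("QUORUM", 3, 2)` is true, matching the two slides.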
  • 29. Consistency Level (Read)
    ● One
      ○ Read: Returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent.
      ○ Write: A write must be written to the commit log and memtable of at least one replica node.
    ● Quorum
      ○ Read: Returns the record after a quorum of replicas has responded.
      ○ Write: A write must be written to the commit log and memtable on a quorum of replica nodes.
    ● All
      ○ Read: Returns the record after all replicas have responded. The read operation will fail if a replica does not respond.
      ○ Write: A write must be written to the commit log and memtable on all replica nodes in the cluster for that partition.
  • 30. Write=QUORUM, Read=One
    Potentially inconsistent read. If Client 2 reads from node B, it receives stale data.
    W (Quorum) + R (One) -> eventually consistent
    [Diagram: Client 1 writes data with token 44 through C1; Client 2 reads data with token 44 through C2; two replicas hold v2 and one still holds v1.]
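  • 31. Write=QUORUM, Read=QUORUM
    Any set of replicas that satisfies the read quorum includes at least one replica that holds v2, so the read always returns v2.
    W (Quorum) + R (Quorum) -> consistent
    [Diagram: Client 1 writes data with token 44 through C1; Client 2 reads data with token 44 through C2; two replicas hold v2 and one still holds v1.]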
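The overlap argument can be checked by enumerating every set of replicas a read might contact. In this sketch node B is assumed to be the replica that missed the write, each replica stores a (write timestamp, value) pair, and the read returns the value with the newest timestamp among the contacted replicas. The timestamps 1 and 2 are made up for illustration.

```python
from itertools import combinations

# (write timestamp, value) per replica after the quorum write; B missed it.
REPLICAS = {"A": (2, "v2"), "B": (1, "v1"), "C": (2, "v2")}


def possible_reads(replicas: dict, read_count: int) -> dict:
    """Map each possible set of contacted replicas to the value returned."""
    results = {}
    for contacted in combinations(sorted(replicas), read_count):
        # The newest timestamp among the contacted replicas wins.
        newest = max(replicas[name] for name in contacted)
        results[contacted] = newest[1]
    return results
```

`possible_reads(REPLICAS, 2)` returns v2 for all three pairs because every pair contains A or C, while `possible_reads(REPLICAS, 1)` contains a v1 entry for the read that lands on B.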
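  • 32. Write=All, Read=One
    A write with CL=ALL succeeds only when every replica holds v2, so after a successful write a read from any single replica returns the latest value. In the scenario shown, where one replica fails, the write itself is rejected.
    W (All) + R (One) -> consistent
    [Diagram: Client 1 writes data with token 44 through C1; Client 2 reads data with token 44 through C2; two replicas hold v2 and one still holds v1.]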
  • 33. Summary of Read and Write CL
    WRITE  | READ   | Consistency  | Read availability | Write availability
    All    | All    | Consistent   | Low               | Low
    Quorum | All    | Consistent   | Low               | Medium
    One    | All    | Consistent   | Low               | High
    All    | Quorum | Consistent   | Medium            | Low
    Quorum | Quorum | Consistent   | Medium            | Medium
    One    | Quorum | Inconsistent | Medium            | High
    All    | One    | Consistent   | High              | Low
    Quorum | One    | Inconsistent | High              | Medium
    One    | One    | Inconsistent | High              | High
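The consistency column of this table can be derived from one rule: reads see the latest write when the number of replicas written plus the number of replicas read exceeds RF, because the two sets must then overlap. A minimal sketch for RF = 3 in a single data center, with function names chosen for illustration:

```python
LEVELS = ("ALL", "QUORUM", "ONE")


def replicas_required(level: str, rf: int) -> int:
    """Replicas that must respond for a given consistency level."""
    return {"ALL": rf, "QUORUM": rf // 2 + 1, "ONE": 1}[level]


def is_consistent(write_level: str, read_level: str, rf: int = 3) -> bool:
    """True when the write set and the read set are guaranteed to overlap."""
    written = replicas_required(write_level, rf)
    read = replicas_required(read_level, rf)
    return written + read > rf


def summary(rf: int = 3) -> dict:
    """Consistency verdict for every (write, read) level combination."""
    return {(w, r): is_consistent(w, r, rf) for r in LEVELS for w in LEVELS}
```

For RF = 3 the only combinations that fail the rule are (One, Quorum), (Quorum, One), and (One, One), the same three rows the table marks as inconsistent.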
  • 34. Summary
    ● Redis: Availability > Consistency
    ● Cassandra: Tweakable availability and consistency
    ● Elasticsearch: Availability < Consistency