Cassandra
Strategies for Distributed Data Storage

I: Fat Clients are Expensive
II: Availability vs. Consistency
III: Strategies for Eventual Consistency


I: Fat Clients are Expensive

In the Beginning...
[Diagram: a web server exposes a thin data API backed by a single database. Simple: 1 web server, 1 database.]

Your Data Grows...
[Diagram: the web server's data API now spans two databases, user and item. Move tables to different DBs.]

A table grows too large...
[Diagram: the item table is split across DBs item 0, item 1, item 2, ... Shard the table by PK ranges: item 0 holds [0, 10k), item 1 holds [10k, 20k), item 2 holds [20k, 30k), and so on.]

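Once the table is sharded, the client-side data API has to know which shard owns a given primary key. A minimal sketch of that routing logic (hypothetical shard map and DSNs, not the deck's actual API) might look like this:

    # Client-side shard routing by primary-key range (hypothetical shard map).
    import bisect

    SHARD_STARTS = [0, 10_000, 20_000]                      # lower bound of each PK range
    SHARD_DSNS   = ["db://item0", "db://item1", "db://item2"]

    def shard_for(pk: int) -> str:
        """Return the DSN of the shard whose PK range contains pk."""
        idx = bisect.bisect_right(SHARD_STARTS, pk) - 1
        if idx < 0:
            raise ValueError(f"no shard for pk={pk}")
        return SHARD_DSNS[idx]

    assert shard_for(42) == "db://item0"
    assert shard_for(15_000) == "db://item1"

This is exactly the "fat client" logic the next slides complain about: every client language has to carry (and keep in sync) its own copy of it.
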
Problem:
Multiple Client Languages
[Diagram: python, ruby, and java clients each carry their own copy of the data API.]

Are there other trade-offs?


II: Availability vs. Consistency


Why consistency vs. availability?
CAP Theorem


CAP Theorem

You can have at most two of these properties in a shared-data system:
  Consistency
  Availability
  Partition-Tolerance

Problem:
Sharded DB Cluster Favors C over A.
[Diagram: the web tier's data API fans out to many DB shards; each shard is a single point of failure (SPOF) with no replication.]

Slightly better with master-slave replication...
[Diagram: the web tier's data API talks to sharded DBs, each with a master and a slave. Writes: still a SPOF, bottlenecked on the master. Reads: replicated across master and slave.]

Availability Arguments

Avoid SPOFs
Distribute Writes to All Nodes in Replica Set

Availability
Easy: Write
[Diagram: the coordinator writes value “x” to replica A; replicas B and C have not received it yet.]

Availability
Harder: Consistency Across Replicas
[Diagram: replicas A, B, and C must all converge on value “x”.]

So, how do we achieve consistency?


III: Strategies for Eventual Consistency


I: Write-Related Strategies
II: Read-Related Strategies


Write-Related Strategies

I: Hinted Hand-Off
II: Gossip

I: Hinted Hand-Off


Hinted Hand-Off
Problem

Write to an Unavailable Node

Hinted Hand-Off
Solution

1) “hinted” write to a live node
2) deliver hints when node is reachable

Hinted Hand-Off
Step 1: “hinted” write to a live node
part of replica set is available
[Diagram: the target replica A is dead, so the coordinator sends the “hinted” write to B, the nearest live replica.]

Hinted Hand-Off
Step 1: “hinted” write to a live node
all replica nodes unreachable
[Diagram: replicas A (the target), B, and C are all dead; the coordinator, as the closest node, stores the “hinted” write itself.]

Hinted Hand-Off
Step 2: deliver hints when node is reachable
[Diagram: the node holding “hinted” writes delivers them to the target replica once it becomes available again.]

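As a rough illustration of the two steps above (a toy in-memory model with hypothetical names, not Cassandra's implementation): the coordinator leaves hints for dead replicas on the nearest live node, or keeps them itself when the whole replica set is down, and the hints are replayed once the target is reachable.

    # Toy sketch of hinted hand-off; Node, coordinator_write and deliver_hints
    # are illustrative names only.
    from collections import defaultdict

    class Node:
        def __init__(self, name):
            self.name = name
            self.alive = True
            self.data = {}                      # key -> value
            self.hints = defaultdict(list)      # target node name -> [(key, value), ...]

        def write(self, key, value):
            self.data[key] = value

        def store_hint(self, target_name, key, value):
            self.hints[target_name].append((key, value))

    def coordinator_write(coordinator, replicas, key, value):
        """Step 1: write to live replicas; leave hints for the dead ones."""
        live = [r for r in replicas if r.alive]
        for r in live:
            r.write(key, value)
        hint_holder = live[0] if live else coordinator   # nearest live replica, else the coordinator
        for r in replicas:
            if not r.alive:
                hint_holder.store_hint(r.name, key, value)

    def deliver_hints(holder, target):
        """Step 2: replay stored hints once the target is reachable again."""
        if target.alive:
            for key, value in holder.hints.pop(target.name, []):
                target.write(key, value)

    a, b, c = Node("A"), Node("B"), Node("C")
    coord = Node("coord")
    a.alive = False
    coordinator_write(coord, [a, b, c], "k", "x")   # B now holds a hint for A
    a.alive = True
    deliver_hints(b, a)
    assert a.data["k"] == "x"
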
How does a node learn when another node is available?


II: Gossip

Gossip
Problem
Each node cannot scalably ping every other node.

    8 nodes:   8² =     64
  100 nodes: 100² = 10,000


Gossip
Solution

I: Anti-Entropy Gossip Protocol
II: Phi-Accrual Failure Detector

Gossip
Anti-Entropy Gossip Protocol
[Diagram: two nodes exchange state with each other.]

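A single gossip round can be sketched roughly like this (a toy model with hypothetical names, not Cassandra's wire protocol): each node periodically picks a random peer and the two merge their views newest-version-wins, so updates spread through the cluster in roughly logarithmic time.

    # Toy anti-entropy gossip round: each view maps node name -> heartbeat version.
    import random

    def gossip_round(local_view: dict, peers: list) -> None:
        peer_view = random.choice(peers)
        # exchange: both sides end up with the element-wise maximum (newest) version
        merged = {n: max(local_view.get(n, 0), peer_view.get(n, 0))
                  for n in set(local_view) | set(peer_view)}
        local_view.clear(); local_view.update(merged)
        peer_view.clear(); peer_view.update(merged)

    a = {"A": 5, "B": 2}
    b = {"B": 7, "C": 1}
    gossip_round(a, [b])
    assert a == b == {"A": 5, "B": 7, "C": 1}
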
Gossip
Phi-Accrual Failure Detector

Dynamically adjusts its “suspicion” level of another node,
based on inter-arrival times of gossip messages.

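A simplified sketch of the idea (assuming an exponential distribution over heartbeat inter-arrival times, as the speaker notes describe; not the full paper's implementation): the suspicion value phi grows the longer the next heartbeat is overdue relative to the recent arrival history.

    # Simplified phi-accrual failure detector sketch (hypothetical class name).
    import math
    from collections import deque

    class PhiAccrualDetector:
        def __init__(self, window=100):
            self.intervals = deque(maxlen=window)   # sliding window of inter-arrival times
            self.last_heartbeat = None

        def heartbeat(self, now: float) -> None:
            if self.last_heartbeat is not None:
                self.intervals.append(now - self.last_heartbeat)
            self.last_heartbeat = now

        def phi(self, now: float) -> float:
            """phi = -log10(P(next heartbeat arrives later than now)), exponential model."""
            if not self.intervals:
                return 0.0
            mean = sum(self.intervals) / len(self.intervals)
            elapsed = now - self.last_heartbeat
            p_later = math.exp(-elapsed / mean)     # P(arrival gap > elapsed)
            return -math.log10(max(p_later, 1e-300))

    d = PhiAccrualDetector()
    for t in (0, 1, 2, 3):            # heartbeats arriving once per second
        d.heartbeat(t)
    assert d.phi(3.5) < d.phi(10)     # suspicion grows as the heartbeat gets overdue
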
Read-Related Strategies

I: Read-Repair
II: Anti-Entropy Service


I: Read-Repair


Read-Repair
Problem

A Write Has Not Propagated to All Replicas


Read-Repair
Solution

Repair Outdated Replicas After Read

Read-Repair
Example

Quorum Read
Replication Factor: 3

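For reference, a quorum here is a simple majority of the replica set; with replication factor 3, quorum reads and quorum writes each touch 2 replicas, so any quorum read overlaps any quorum write in at least one replica (a small arithmetic sketch, not library code):

    # Quorum = majority of the replica set.
    def quorum(replication_factor: int) -> int:
        return replication_factor // 2 + 1

    rf = 3
    r = w = quorum(rf)          # 2 replicas for reads, 2 for writes
    assert r + w > rf           # every quorum read overlaps every quorum write
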
Read-Repair
Steps

1) do a digest-based read (done if all digests match)
2) otherwise, do a full read and repair the replicas

Read-Repair
Step 1: do digest-based read
one full read; other reads are digest
[Diagram: the coordinator sends a full read (F) to replica A and digest reads (D) to replicas B and C.]

Read-Repair
Step 1: do digest-based read
wait for 2 replies (where one is full read)
[Diagram: the coordinator has the full reply (F) from replica A and a digest (D) from replica B; replica C has not replied yet.]

Read-Repair
Step 1: do digest-based read
return value to client (if all digests match)
[Diagram: the coordinator checks that D == digest(F) and returns the value to the client.]

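The digest check can be sketched like this (replica stubs and helper names are hypothetical; the point is comparing digests against the one full value before answering the client):

    # Sketch of a digest-based quorum read.
    import hashlib

    def digest(value: str) -> str:
        return hashlib.md5(value.encode()).hexdigest()

    def digest_read(replicas: list, key: str, quorum: int):
        """replicas: list of dicts standing in for replica stores.
        Returns (value, consistent)."""
        full_value = replicas[0].get(key)                                        # one full read
        digests = [digest(replicas[i].get(key, "")) for i in range(1, quorum)]   # digest reads
        consistent = all(d == digest(full_value or "") for d in digests)
        return full_value, consistent

    a = {"k": "y"}; b = {"k": "y"}; c = {"k": "x"}      # C is stale but outside the quorum
    value, ok = digest_read([a, b, c], "k", quorum=2)
    assert ok and value == "y"                          # digests match: return to client
    # if a digest had mismatched, the coordinator would fall through to Step 2
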
Read-Repair
Step 2: do full read and repair replicas
full read from all replicas
[Diagram: the coordinator sends full reads (F) to replicas A, B, and C.]


Read-Repair
Step 2: do full read and repair replicas
wait for 2 replies
[Diagram: the coordinator has full replies (F) from replicas A and B; replica C has not replied yet.]

Read-Repair
Step 2: do full read and repair replicas
calculate newest value from replies

                 value   timestamp
    replica A:    “x”       t0
    replica B:    “y”       t1
    reconciled:   “y”       t1

Read-Repair
Step 2: do full read and repair replicas
return newest value to client
[Diagram: the coordinator returns the reconciled value to the client.]

Read-Repair
Step 2: do full read and repair replicas
calculate repair mutations for each replica

    diff(reconciled value, replica value) = repair mutation

    Repair for Replica A:  diff(“y” @ t1, “x” @ t0) = “y” @ t1
    Repair for Replica B:  diff(“y” @ t1, “y” @ t1) = null

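The reconcile-and-diff step can be sketched as follows (timestamped values as plain tuples and hypothetical helper names, not Cassandra's column reconciliation code): the newest timestamp wins, and each replica whose value differs from the winner gets the winner back as its repair mutation.

    # Sketch of read-repair reconciliation: newest timestamp wins.
    def reconcile(replies: dict) -> tuple:
        """replies: replica name -> (value, timestamp). Returns the newest pair."""
        return max(replies.values(), key=lambda vt: vt[1])

    def repair_mutations(replies: dict, reconciled: tuple) -> dict:
        """Replica name -> mutation to send (None when the replica is already current)."""
        return {name: (reconciled if vt != reconciled else None)
                for name, vt in replies.items()}

    replies = {"A": ("x", 0), "B": ("y", 1)}
    reconciled = reconcile(replies)                 # ("y", 1)
    assert repair_mutations(replies, reconciled) == {"A": ("y", 1), "B": None}
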
Read-Repair
Step 2: do full read and repair replicas
send repair mutation to each replica
[Diagram: the coordinator sends the repair mutation (R) to the outdated replica A.]


What about values that have not been read?

II: Anti-Entropy Service


Anti-Entropy Service
Problem

How to Repair Unread Values


Anti-Entropy Service
Solution

1) detect inconsistency via Merkle Trees
2) repair inconsistent data

Anti-Entropy Service
Merkle Tree
a tree where a node’s hash summarizes the hashes of its children
[Diagram: a binary tree with root A over interior nodes B and C, and leaves D, E, F, G. The root hash summarizes its children’s hashes, each interior node’s hash summarizes its children’s hashes, and each leaf hash is the hash of a data block.]

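A minimal Merkle tree over fixed data blocks might be built like this (a sketch with hypothetical function names, not Cassandra's token-range implementation):

    # Minimal Merkle tree: each leaf hashes a data block, each interior node
    # hashes the concatenation of its children's hashes.
    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def build_merkle(blocks: list) -> list:
        """Return the tree as a list of levels: leaves first, root level last."""
        level = [h(b) for b in blocks]
        levels = [level]
        while len(level) > 1:
            pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
            level = [h(b"".join(p)) for p in pairs]
            levels.append(level)
        return levels

    tree = build_merkle([b"d", b"e", b"f", b"g"])
    root = tree[-1][0]            # the root hash summarizes every block beneath it
    assert len(tree) == 3         # leaves, interior level, root
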
Anti-Entropy Service
Step 1: detect inconsistency
create Merkle Trees on all replicas
[Diagram: replica A creates its local Merkle Tree and requests Merkle Tree creation from replicas B and C.]


Anti-Entropy Service
Step 1: detect inconsistency
exchange Merkle Trees between replicas
[Diagram: replicas A, B, and C exchange their Merkle Trees across all replicas.]

Anti-Entropy Service
Step 1: detect inconsistency
compare local and remote Merkle Trees
[Diagram: Replica A compares its tree against Replica B’s, node by node; matching subtrees are skipped, and the mismatch is isolated to the subtree under leaf F.]

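Comparing two trees from the root down locates stale data without shipping the data itself: only where hashes diverge do the replicas descend further. A sketch, reusing the hypothetical build_merkle layout from the earlier block:

    # Walk two Merkle trees (as built by build_merkle above, perfect binary tree)
    # from the root down and return the indexes of leaves whose hashes disagree.
    def mismatching_leaves(tree_a: list, tree_b: list) -> list:
        top = len(tree_a) - 1
        def walk(level: int, idx: int) -> list:
            if tree_a[level][idx] == tree_b[level][idx]:
                return []                              # whole subtree matches: skip it
            if level == 0:
                return [idx]                           # stale leaf: repair this block
            return walk(level - 1, 2 * idx) + walk(level - 1, 2 * idx + 1)
        return walk(top, 0)

    replica_a = build_merkle([b"d", b"e", b"f", b"g"])
    replica_b = build_merkle([b"d", b"e", b"F", b"g"])  # the block under leaf F differs
    assert mismatching_leaves(replica_a, replica_b) == [2]
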
Anti-Entropy Service
Step 2: repair inconsistent data
send repair to remote replica
[Diagram: replica A sends replica B a repair for the data hashed by node F.]


Any Questions?

More Information

Cassandra Site:
http://cassandra.apache.org/

My email address:
kakugawa@gmail.com



Editor's Notes

  1. Kelvin Kakugawa, infrastructure engineer @ Digg, working on extending Cassandra (can talk about this more at the end of the session).
  2. 3 parts of my talk.
  3. Let's go through the journey of a typical web developer, so we can understand why certain properties of Cassandra may be attractive.
  4. Just a web server and a database; nothing special.
  5. So, your data starts growing. What do you do? Move your tables to different DB servers.
  6. OK, so now what happens when one table grows too large? Shard the DB cluster. Problem: the data access API just got fatter; the client now needs to know which shard to hit for a given read/write. Problem: you're pushing data-store-specific logic up into your client layer, which is not the best abstraction.
  7. The problem gets compounded with multiple client languages. What do you do? 1) Replicate the logic in all languages? 2) Write a C library with bindings for every language?
  8. [5m]
  9. Examples. Consistency: when you write a value to the cluster, will the next read return the most up-to-date value? Availability: if a subset of nodes goes down, are you still able to write or read a given key?
  10. So, let's think back to the sharded DB example. When you write to a shard, you'll get the most recent value on the next read. However, the shard is a SPOF, with no replication.
  11. Reads are now replicated; however, writes still have a SPOF and are bottlenecked on one server (you can't write to any node in the replica set).
  12. Avoid SPOFs: machines fail. Depending on your use case, it may be advantageous to be able to write to multiple nodes in the replica set. If you're read-bound, this probably doesn't matter; but if you're write-bound, it's important.
  13. So, how do we achieve availability? It's easy to think about writes: pretty straightforward, write to one of the replicas in the replica set.
  14. It's harder to propagate that write to the other nodes in the replica set; that part is non-trivial.
  15. [10m]
  16. Separate into 2 sections.
  17. First situation: part of the replica set is still available.
  18. Second situation: all nodes in the replica set are down. So, what happens? Let's first talk about the distinction between the coordinator and the nodes in the replica set. Basically, a client can talk to any node in the Cassandra cluster, and that node becomes the coordinator for that client, making the appropriate calls to the other nodes in the cluster that are part of the replica set for a given key. So, getting back: what happens when all of the replica nodes are down? In this case, the coordinator node is the closest node, so it'll write the hint locally.
  19. And, naturally, when a node with hinted writes learns that the target node is back up, it'll deliver the hinted writes it has for the target.
  20. Great for the virality of your product; bad for your network load.
  21. Gossip protocols (in general): randomly choose a node to exchange state with; the expectation is that updates spread in time logarithmic in the number of nodes in the cluster. Anti-entropy protocol: gossip information until it's made obsolete by newer info. Compare a rumor-mongering protocol, which only gossips state for a limited amount of time, such that the state change has likely been propagated to all nodes in the cluster. Note: it's important that Cassandra uses an anti-entropy protocol, because of the failure detector.
  22. Failure detector: acts as an oracle for the node. The node consults the FD, and the FD returns a suspicion level for whether a given node is up or down. The FD maintains a sliding window of the most recent heartbeats from a given node; the sliding window is used to estimate the arrival time of the next heartbeat. The distribution of past samples is used as an approximation of the probabilistic distribution of future heartbeat messages (Cassandra uses an exponential distribution). As the next heartbeat message takes longer and longer to arrive, the suspicion level that the node is down increases.
  23. Quorum consistency level: quorum = majority (here: 2). It requires a quorum write so that a quorum read will catch at least one node with the most recent value.
  24. Example: let's say we receive the Merkle trees from two different replicas. If the root node's hash from both trees matches, we can be reasonably sure that both replicas are consistent.
  25. Each node creates a Merkle tree, then exchanges it with the other replicas.
  26. From the initiating replica, replica A, we compare the Merkle tree received from replica B.
  27. Replica A will send the inconsistent data to replica B (note: replica B will compare the Merkle tree from A and send the same range of keys to A). Implementation detail: replica A actually creates a repair SSTable for replica B (that only includes the inconsistent keys) and then streams it over; replica B drops the streamed SSTable directly onto disk. Possible: talk about the relationship between Memtables and SSTables and how Cassandra writes/reads data.