SlideShare a Scribd company logo
1 of 37
Orbe: Scalable Causal Consistency Using
Dependency Matrices & Physical Clocks
Jiaqing Du, EPFL
Sameh Elnikety, Microsoft Research
Amitabha Roy, EPFL
Willy Zwaenepoel, EPFL
Key-Value Data Store API
• Read operation
– value = get( key )
• Write operation
– put( key, value)
• Read transaction
– <value1, value2, …> = mget ( key1, key2, … )
2
Partitioning
• Divide data set into several partitions.
• A server manages each partition.
3
Partition 1 Partition 2 Partition N…
• Data set is partitioned
Inside a Data Center
Partition 1
Application
Partition 2 Partition N…
client
Application Application
client client
…Application
tier
Data tier
4
Geo-Replication
Data Center E
Data Center B Data Center C
Data Center F
• Data close to end users
• Tolerates disasters
5
Data Center A
Scalable Causal Consistency in Orbe
• Partitioned and replicated data store
• Parallel asynchronous update propagation
• Efficient implementation of causal consistency
6
Partition 1 Partition 2 Partition N…
Partition 1 Partition 2 Partition N…
Replica A
Replica B
Consistency Models
• Strong consistency
– Total order on propagated updates
– High update latency, no partition tolerance
• Causal consistency
– Propagated updates are partially ordered
– Low update latency, partition tolerance
• Eventual consistency
– No order among propagated updates
– Low update latency, partition tolerance
7
• If A depends on B, then A appears after B.
Causal Consistency (1/3)
Photo
Comment: Great weather! Comment: Great weather!
Update
8
Propagation
Alice
Alice
• If A depends on B, then A appears after B.
Causal Consistency (2/3)
Photo
Comment: Nice photo! Comment: Nice photo!
Update
9
Propagation
Alice
Bob
Causal Consistency (3/3)
• Partitioned and replicated data stores
Partition 1 Partition 2 Partition N…
Partition 1 Partition 2 Partition N…
Replica A
Replica B
Client
Read(A) Read(B) Write(C, A+B)
Propagate (C)
How to guarantee A and B appear first?
10
Existing Solutions
• Version vectors
– Only work for purely replicated systems
• COPS [Lloyd’11]
– Explicit dependency tracking at client side
– Overhead is high under many workloads
• Our work
– Extends version vectors to dependency matrices
– Employs physical clocks for read-only transactions
– Keeps dependency metadata small and bounded
11
Outline
• DM protocol
• DM-Clock protocol
• Evaluation
• Conclusions
12
Dependency Matrix (DM)
• Represents dependencies of a state or a client session
• One integer per server
• An integer represents all dependencies from a partition
Partition 1 Partition 2
Replica A
Partition 1 Partition 2
Replica B
9 5
00DM
first 9 updates
first 5 updates
13
07
Partition 3
Partition 3
DM Protocol: Data Structures
Partition 1 of Replica A
Client
Dependency matrix
(DM)
14
0 0
0 0
0 0DM =
DM Protocol: Data Structures
Partition 1 of Replica A
Client
3 8
Dependency matrix
(DM)
Version vector
(VV)
15
0 0
0 0
0 0
VV =
DM =
DM Protocol: Data Structures
Partition 1 of Replica A
Client
3 8
Item A, rid = A, ut = 2, dm =
Item B, rid = B, ut = 5, dm =
Dependency matrix
(DM)
Version vector
(VV)
Update timestamp
(UT)
Source replica id
(RID)
16
0 0
0 0
0 0
1 4
0 0
0 0
0 5
0 0
1 0
VV =
DM =
DM Protocol: Read and Write
• Read item
– Client <-> server
– Includes read item in client DM
• Write item
– Client <-> server
– Associates client DM to updated item
– Resets client DM (transitivity of causality)
– Includes updated item in client DM
17
0 0
0 0
1 0
4 0
0 0
0 0
Replica A
Partition 1
(v, rid = A, ut = 4)
Example: Read and Write
Partition 2
Client
DM = DM = DM =
read(photo) write(comment, )
VV = [7, 0]
(ut = 1)
VV = [0, 0] VV = [1, 0]
18
Partition 3
VV = [0, 0]
0 0
0 0
0 0
4 0
0 0
0 0
DM Protocol: Update Propagation
• Propagate an update
– Server <-> server
– Asynchronous propagation
– Compares DM with VVs of local partitions
19
Replica A
Partition 1
Example: Update Propagation
Replica B
Partition 1
Partition 2
Partition 2
VV = [7, 0]
VV = [0, 0]
VV = [3, 0]
VV = [0, 0]
VV = [4, 0]
VV = [1, 0]
check dependency
VV = [1, 0]
20
Partition 3
VV = [0, 0]
Partition 3
VV = [0, 0]
replicate(comment, ut = 1, )
0 0
0 0
4 0
Complete and Nearest Dependencies
• Transitivity of causality
– If B depends on A, C depends on B, then C depends on A.
• Tracking nearest dependencies
– Reduces dependency metadata size
– Does not affect correctness
21
C: write
Comment 2
A: write
Photo
B: write
Comment 1
Complete
Dependencies
Nearest
Dependencies
DM Protocol: Benefits
• Keeps dependency metadata small and bounded
– Only tracks nearest dependencies by
resetting the client DM after each update
– Number of elements in a DM is fixed
– Utilizes sparse matrix encoding
22
Outline
• DM protocol
• DM-Clock protocol
• Evaluation
• Conclusions
23
Read Transaction on Causal Snapshot
24
Album: PublicBob 1
Photo
Bob 2
Album: Public
Only close friends!
Bob 3
Photo
Bob 4
Replica A Replica B
Album: Public
Photo
Album: Public
Only close friends!
Photo
Mom 1
Mom 2
• Provides causally consistent read-only transactions
• Requires loosely synchronized clocks (NTP)
• Data structures
DM-Clock Protocol (1/2)
Timestamps from
physical clocks
25
Partition 0
Client
3 8
Item A, rid = A, ut = 2, dm = , put = 27
Item B, rid = B, ut = 5, dm = , put = 35
0 0
0 0
0 0
1 4
0 0
0 0
0 5
0 0
1 0
VV =
DM = PDT = 0
DM-Clock Protocol (2/2)
• Still tracks nearest dependencies
• Read-only transaction
– Obtains snapshot timestamp from local physical clock
– Reads latest versions created “before” snapshot time
• A cut of the causal relationship graph
A0
B2
C1
B0
C0
D3
E0
snapshot timestamp
26
Outline
• DM protocol
• DM-Clock protocol
• Evaluation
• Conclusions
27
Evaluation
• Orbe
– A partitioned and replicated key-value store
– Implements the DM and DM-Clock protocols
• Experiment Setup
– A local cluster of 16 servers
– 120 ms update latency
28
Evaluation Questions
1. Does Orbe scale out?
2. How does Orbe compare to eventual consistency?
3. How does Orbe compare to COPS
29
Throughput over Num. of Partitions
30
Workload: Each client accesses two partitions.
Orbe scales out as the number of partitions increases.
Throughput over Varied Workloads
31
Workload: Each client accesses three partitions.
Orbe incurs relatively small overhead for tracking dependencies
under many workloads.
Orbe Metadata Percentage
32
Dependency metadata is relatively small and bounded.
Orbe Dependency Check Messages
33
The number of dependency check messages
is relatively small and bounded.
Orbe & COPS: Throughput over
Client Inter-Operation Delays
34
Workload: Each client accesses three partitions.
Orbe & COPS:
Number of Dependencies per Update
35
Orbe only tracks nearest dependencies when
supporting read-only transactions.
In the Paper
• Protocols
– Conflict detection
– Causal snapshot for read transaction
– Garbage collection
• Fault-tolerance and recovery
• Dependency cleaning optimization
• More experimental results
– Micro-benchmarks & latency distribution
– Benefits of dependency cleaning
36
Conclusions
• Orbe provides scalable causal consistency
– Partitioned and replicated data store
• DM protocol
– Dependency matrices
• DM-Clock protocol
– Dependency matrices + physical clocks
– Read-only transactions (causally consistency)
• Performance
– Scale out, low overhead, comparison to EC & COPS
37

More Related Content

What's hot

[Altibase] 8 replication part1 (overview)
[Altibase] 8 replication part1 (overview)[Altibase] 8 replication part1 (overview)
[Altibase] 8 replication part1 (overview)
altistory
 

What's hot (20)

Types of Load distributing algorithm in Distributed System
Types of Load distributing algorithm in Distributed SystemTypes of Load distributing algorithm in Distributed System
Types of Load distributing algorithm in Distributed System
 
2.communcation in distributed system
2.communcation in distributed system2.communcation in distributed system
2.communcation in distributed system
 
9 fault-tolerance
9 fault-tolerance9 fault-tolerance
9 fault-tolerance
 
Ranges & Cross-Entrance Consistency with OpenFlow
Ranges & Cross-Entrance Consistency with OpenFlowRanges & Cross-Entrance Consistency with OpenFlow
Ranges & Cross-Entrance Consistency with OpenFlow
 
Load balancing
Load balancingLoad balancing
Load balancing
 
Process Management-Process Migration
Process Management-Process MigrationProcess Management-Process Migration
Process Management-Process Migration
 
Distributed System Management
Distributed System ManagementDistributed System Management
Distributed System Management
 
Scheduling in distributed systems - Andrii Vozniuk
Scheduling in distributed systems - Andrii VozniukScheduling in distributed systems - Andrii Vozniuk
Scheduling in distributed systems - Andrii Vozniuk
 
A load balancing model based on cloud partitioning for the public cloud. ppt
A  load balancing model based on cloud partitioning for the public cloud. ppt A  load balancing model based on cloud partitioning for the public cloud. ppt
A load balancing model based on cloud partitioning for the public cloud. ppt
 
Load balancing
Load balancingLoad balancing
Load balancing
 
[EWiLi2016] Enabling power-awareness for the Xen Hypervisor
[EWiLi2016] Enabling power-awareness for the Xen Hypervisor[EWiLi2016] Enabling power-awareness for the Xen Hypervisor
[EWiLi2016] Enabling power-awareness for the Xen Hypervisor
 
10 replication
10 replication10 replication
10 replication
 
Inter-controller Traffic in ONOS Clusters for SDN Networks
Inter-controller Traffic in ONOS Clusters for SDN Networks Inter-controller Traffic in ONOS Clusters for SDN Networks
Inter-controller Traffic in ONOS Clusters for SDN Networks
 
Important tips on Router and SMTP mail routing
Important tips on Router and SMTP mail routingImportant tips on Router and SMTP mail routing
Important tips on Router and SMTP mail routing
 
Load balancing
Load balancingLoad balancing
Load balancing
 
Resource management
Resource managementResource management
Resource management
 
Load Balancing Server
Load Balancing ServerLoad Balancing Server
Load Balancing Server
 
The Role of Inter-Controller Traffic in SDN Controllers Placement
The Role of Inter-Controller Traffic in SDN Controllers PlacementThe Role of Inter-Controller Traffic in SDN Controllers Placement
The Role of Inter-Controller Traffic in SDN Controllers Placement
 
[Altibase] 8 replication part1 (overview)
[Altibase] 8 replication part1 (overview)[Altibase] 8 replication part1 (overview)
[Altibase] 8 replication part1 (overview)
 
Message passing in Distributed Computing Systems
Message passing in Distributed Computing SystemsMessage passing in Distributed Computing Systems
Message passing in Distributed Computing Systems
 

Similar to Orbe: Scalable Causal Consistency Using Dependency Matrices and Physical Clocks

SOC-CH3.pptSOC ProcessorsSOC Processors Used in SOC Used in SOC
SOC-CH3.pptSOC ProcessorsSOC Processors Used in SOC Used in SOCSOC-CH3.pptSOC ProcessorsSOC Processors Used in SOC Used in SOC
SOC-CH3.pptSOC ProcessorsSOC Processors Used in SOC Used in SOC
SnehaLatha68
 
MSF: Sync your Data On-Premises And To The Cloud - dotNetwork Gathering, Oct ...
MSF: Sync your Data On-Premises And To The Cloud - dotNetwork Gathering, Oct ...MSF: Sync your Data On-Premises And To The Cloud - dotNetwork Gathering, Oct ...
MSF: Sync your Data On-Premises And To The Cloud - dotNetwork Gathering, Oct ...
sameh samir
 

Similar to Orbe: Scalable Causal Consistency Using Dependency Matrices and Physical Clocks (20)

F14_Class1.pptx
F14_Class1.pptxF14_Class1.pptx
F14_Class1.pptx
 
Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Resilient Predictive Data Pipelines (GOTO Chicago 2016)Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Resilient Predictive Data Pipelines (GOTO Chicago 2016)
 
Case Study For Replication For PCMS
Case Study For Replication For PCMSCase Study For Replication For PCMS
Case Study For Replication For PCMS
 
Microservice - Up to 500k CCU
Microservice - Up to 500k CCUMicroservice - Up to 500k CCU
Microservice - Up to 500k CCU
 
HAProxyconf 2019 - Criteo - Transitioning from Ticketing to LBaaS
HAProxyconf 2019 - Criteo - Transitioning from Ticketing to LBaaSHAProxyconf 2019 - Criteo - Transitioning from Ticketing to LBaaS
HAProxyconf 2019 - Criteo - Transitioning from Ticketing to LBaaS
 
How we scaled Rudder to 10k, and the road to 50k
How we scaled Rudder to 10k, and the road to 50kHow we scaled Rudder to 10k, and the road to 50k
How we scaled Rudder to 10k, and the road to 50k
 
ECS19 - Ingo Gegenwarth - Running Exchange in large environment
ECS19 - Ingo Gegenwarth -  Running Exchangein large environmentECS19 - Ingo Gegenwarth -  Running Exchangein large environment
ECS19 - Ingo Gegenwarth - Running Exchange in large environment
 
Transport Layer Description By Varun Tiwari
Transport Layer Description By Varun TiwariTransport Layer Description By Varun Tiwari
Transport Layer Description By Varun Tiwari
 
SVCC-2014
SVCC-2014SVCC-2014
SVCC-2014
 
Kickstart your Kafka with Faker Data | Francesco Tisiot, Aiven.io
Kickstart your Kafka with Faker Data | Francesco Tisiot, Aiven.ioKickstart your Kafka with Faker Data | Francesco Tisiot, Aiven.io
Kickstart your Kafka with Faker Data | Francesco Tisiot, Aiven.io
 
The Power of Determinism in Database Systems
The Power of Determinism in Database SystemsThe Power of Determinism in Database Systems
The Power of Determinism in Database Systems
 
SOC-CH3.pptSOC ProcessorsSOC Processors Used in SOC Used in SOC
SOC-CH3.pptSOC ProcessorsSOC Processors Used in SOC Used in SOCSOC-CH3.pptSOC ProcessorsSOC Processors Used in SOC Used in SOC
SOC-CH3.pptSOC ProcessorsSOC Processors Used in SOC Used in SOC
 
Chapter - 1 Introduction to networking (3).ppt
Chapter - 1 Introduction to networking (3).pptChapter - 1 Introduction to networking (3).ppt
Chapter - 1 Introduction to networking (3).ppt
 
module10-rip (1).ppt
module10-rip (1).pptmodule10-rip (1).ppt
module10-rip (1).ppt
 
MSF: Sync your Data On-Premises And To The Cloud - dotNetwork Gathering, Oct ...
MSF: Sync your Data On-Premises And To The Cloud - dotNetwork Gathering, Oct ...MSF: Sync your Data On-Premises And To The Cloud - dotNetwork Gathering, Oct ...
MSF: Sync your Data On-Premises And To The Cloud - dotNetwork Gathering, Oct ...
 
Reduced network traffic
Reduced network trafficReduced network traffic
Reduced network traffic
 
Virtualization in 4-4 1-4 Data Center Network.
Virtualization in 4-4 1-4 Data Center Network.Virtualization in 4-4 1-4 Data Center Network.
Virtualization in 4-4 1-4 Data Center Network.
 
client server architecture
client server architecture client server architecture
client server architecture
 
Self-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policiesSelf-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policies
 
Introduction to Microservices Patterns
Introduction to Microservices PatternsIntroduction to Microservices Patterns
Introduction to Microservices Patterns
 

Recently uploaded

CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 

Recently uploaded (20)

Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%in Durban+277-882-255-28 abortion pills for sale in Durban
%in Durban+277-882-255-28 abortion pills for sale in Durban%in Durban+277-882-255-28 abortion pills for sale in Durban
%in Durban+277-882-255-28 abortion pills for sale in Durban
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 

Orbe: Scalable Causal Consistency Using Dependency Matrices and Physical Clocks

  • 1. Orbe: Scalable Causal Consistency Using Dependency Matrices & Physical Clocks Jiaqing Du, EPFL Sameh Elnikety, Microsoft Research Amitabha Roy, EPFL Willy Zwaenepoel, EPFL
  • 2. Key-Value Data Store API • Read operation – value = get( key ) • Write operation – put( key, value) • Read transaction – <value1, value2, …> = mget ( key1, key2, … ) 2
  • 3. Partitioning • Divide data set into several partitions. • A server manages each partition. 3 Partition 1 Partition 2 Partition N…
  • 4. • Data set is partitioned Inside a Data Center Partition 1 Application Partition 2 Partition N… client Application Application client client …Application tier Data tier 4
  • 5. Geo-Replication Data Center E Data Center B Data Center C Data Center F • Data close to end users • Tolerates disasters 5 Data Center A
  • 6. Scalable Causal Consistency in Orbe • Partitioned and replicated data store • Parallel asynchronous update propagation • Efficient implementation of causal consistency 6 Partition 1 Partition 2 Partition N… Partition 1 Partition 2 Partition N… Replica A Replica B
  • 7. Consistency Models • Strong consistency – Total order on propagated updates – High update latency, no partition tolerance • Causal consistency – Propagated updates are partially ordered – Low update latency, partition tolerance • Eventual consistency – No order among propagated updates – Low update latency, partition tolerance 7
  • 8. • If A depends on B, then A appears after B. Causal Consistency (1/3) Photo Comment: Great weather! Comment: Great weather! Update 8 Propagation Alice Alice
  • 9. • If A depends on B, then A appears after B. Causal Consistency (2/3) Photo Comment: Nice photo! Comment: Nice photo! Update 9 Propagation Alice Bob
  • 10. Causal Consistency (3/3) • Partitioned and replicated data stores Partition 1 Partition 2 Partition N… Partition 1 Partition 2 Partition N… Replica A Replica B Client Read(A) Read(B) Write(C, A+B) Propagate (C) How to guarantee A and B appear first? 10
  • 11. Existing Solutions • Version vectors – Only work for purely replicated systems • COPS [Lloyd’11] – Explicit dependency tracking at client side – Overhead is high under many workloads • Our work – Extends version vectors to dependency matrices – Employs physical clocks for read-only transactions – Keeps dependency metadata small and bounded 11
  • 12. Outline • DM protocol • DM-Clock protocol • Evaluation • Conclusions 12
  • 13. Dependency Matrix (DM) • Represents dependencies of a state or a client session • One integer per server • An integer represents all dependencies from a partition Partition 1 Partition 2 Replica A Partition 1 Partition 2 Replica B 9 5 00DM first 9 updates first 5 updates 13 07 Partition 3 Partition 3
  • 14. DM Protocol: Data Structures Partition 1 of Replica A Client Dependency matrix (DM) 14 0 0 0 0 0 0DM =
  • 15. DM Protocol: Data Structures Partition 1 of Replica A Client 3 8 Dependency matrix (DM) Version vector (VV) 15 0 0 0 0 0 0 VV = DM =
  • 16. DM Protocol: Data Structures Partition 1 of Replica A Client 3 8 Item A, rid = A, ut = 2, dm = Item B, rid = B, ut = 5, dm = Dependency matrix (DM) Version vector (VV) Update timestamp (UT) Source replica id (RID) 16 0 0 0 0 0 0 1 4 0 0 0 0 0 5 0 0 1 0 VV = DM =
  • 17. DM Protocol: Read and Write • Read item – Client <-> server – Includes read item in client DM • Write item – Client <-> server – Associates client DM to updated item – Resets client DM (transitivity of causality) – Includes updated item in client DM 17
  • 18. 0 0 0 0 1 0 4 0 0 0 0 0 Replica A Partition 1 (v, rid = A, ut = 4) Example: Read and Write Partition 2 Client DM = DM = DM = read(photo) write(comment, ) VV = [7, 0] (ut = 1) VV = [0, 0] VV = [1, 0] 18 Partition 3 VV = [0, 0] 0 0 0 0 0 0 4 0 0 0 0 0
  • 19. DM Protocol: Update Propagation • Propagate an update – Server <-> server – Asynchronous propagation – Compares DM with VVs of local partitions 19
  • 20. Replica A Partition 1 Example: Update Propagation Replica B Partition 1 Partition 2 Partition 2 VV = [7, 0] VV = [0, 0] VV = [3, 0] VV = [0, 0] VV = [4, 0] VV = [1, 0] check dependency VV = [1, 0] 20 Partition 3 VV = [0, 0] Partition 3 VV = [0, 0] replicate(comment, ut = 1, ) 0 0 0 0 4 0
  • 21. Complete and Nearest Dependencies • Transitivity of causality – If B depends on A, C depends on B, then C depends on A. • Tracking nearest dependencies – Reduces dependency metadata size – Does not affect correctness 21 C: write Comment 2 A: write Photo B: write Comment 1 Complete Dependencies Nearest Dependencies
  • 22. DM Protocol: Benefits • Keeps dependency metadata small and bounded – Only tracks nearest dependencies by resetting the client DM after each update – Number of elements in a DM is fixed – Utilizes sparse matrix encoding 22
  • 23. Outline • DM protocol • DM-Clock protocol • Evaluation • Conclusions 23
  • 24. Read Transaction on Causal Snapshot 24 Album: PublicBob 1 Photo Bob 2 Album: Public Only close friends! Bob 3 Photo Bob 4 Replica A Replica B Album: Public Photo Album: Public Only close friends! Photo Mom 1 Mom 2
  • 25. • Provides causally consistent read-only transactions • Requires loosely synchronized clocks (NTP) • Data structures DM-Clock Protocol (1/2) Timestamps from physical clocks 25 Partition 0 Client 3 8 Item A, rid = A, ut = 2, dm = , put = 27 Item B, rid = B, ut = 5, dm = , put = 35 0 0 0 0 0 0 1 4 0 0 0 0 0 5 0 0 1 0 VV = DM = PDT = 0
  • 26. DM-Clock Protocol (2/2) • Still tracks nearest dependencies • Read-only transaction – Obtains snapshot timestamp from local physical clock – Reads latest versions created “before” snapshot time • A cut of the causal relationship graph A0 B2 C1 B0 C0 D3 E0 snapshot timestamp 26
  • 27. Outline • DM protocol • DM-Clock protocol • Evaluation • Conclusions 27
  • 28. Evaluation • Orbe – A partitioned and replicated key-value store – Implements the DM and DM-Clock protocols • Experiment Setup – A local cluster of 16 servers – 120 ms update latency 28
  • 29. Evaluation Questions 1. Does Orbe scale out? 2. How does Orbe compare to eventual consistency? 3. How does Orbe compare to COPS 29
  • 30. Throughput over Num. of Partitions 30 Workload: Each client accesses two partitions. Orbe scales out as the number of partitions increases.
  • 31. Throughput over Varied Workloads 31 Workload: Each client accesses three partitions. Orbe incurs relatively small overhead for tracking dependencies under many workloads.
  • 32. Orbe Metadata Percentage 32 Dependency metadata is relatively small and bounded.
  • 33. Orbe Dependency Check Messages 33 The number of dependency check messages is relatively small and bounded.
  • 34. Orbe & COPS: Throughput over Client Inter-Operation Delays 34 Workload: Each client accesses three partitions.
  • 35. Orbe & COPS: Number of Dependencies per Update 35 Orbe only tracks nearest dependencies when supporting read-only transactions.
  • 36. In the Paper • Protocols – Conflict detection – Causal snapshot for read transaction – Garbage collection • Fault-tolerance and recovery • Dependency cleaning optimization • More experimental results – Micro-benchmarks & latency distribution – Benefits of dependency cleaning 36
  • 37. Conclusions • Orbe provides scalable causal consistency – Partitioned and replicated data store • DM protocol – Dependency matrices • DM-Clock protocol – Dependency matrices + physical clocks – Read-only transactions (causally consistency) • Performance – Scale out, low overhead, comparison to EC & COPS 37