Consistency, Availability, Partition: Make Your Choice

Andrea Giuliano
@bit_shark

D I S T R I B U T E D S Y S T E M S

W H AT A D I S T R I B U T E D S Y S T E M I S
“A distributed system is a software system in which
components located on networked computers communicate
and coordinate their actions by passing messages”

E X A M P L E S

R E P L I C AT I O N

R E P L I C AT E D S E R V I C E
P R O P E R T I E S
CONSISTENCY
AVAILABILITY

C O N S I S T E N C Y
The result of operations will be predictable

Strong consistency
all replicas return the same value for the same object
Weak consistency
different replicas can return different values for the same object

S T R O N G V S W E A K
Strong consistency
ACID databases: atomic, consistent, isolated, durable
Weak consistency
BASE databases: basically available, soft state, eventual consistency

E X A M P L E
put(price, 10)
get(price)
price = 10
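
A minimal sketch of the put/get example above, assuming a hypothetical in-memory `Replica` class and a `ReplicatedStore` wrapper (neither comes from the talk). With synchronous replication every replica has the write before `put` returns; with asynchronous replication a read from a lagging replica can still miss it.

```python
class Replica:
    """Hypothetical in-memory replica holding key/value pairs."""

    def __init__(self):
        self.data = {}


class ReplicatedStore:
    """Illustrative store with one primary (index 0) and any number of followers."""

    def __init__(self, replicas, synchronous):
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = []  # writes not yet applied to the followers

    def put(self, key, value):
        self.replicas[0].data[key] = value  # the primary always applies the write
        if self.synchronous:
            for follower in self.replicas[1:]:
                follower.data[key] = value
        else:
            self.pending.append((key, value))  # propagate later

    def get(self, key, replica_index):
        return self.replicas[replica_index].data.get(key)

    def propagate(self):
        """Apply queued writes to the followers (eventual convergence)."""
        for key, value in self.pending:
            for follower in self.replicas[1:]:
                follower.data[key] = value
        self.pending.clear()


strong = ReplicatedStore([Replica(), Replica()], synchronous=True)
strong.put("price", 10)
strong_read = strong.get("price", replica_index=1)   # 10 on every replica

weak = ReplicatedStore([Replica(), Replica()], synchronous=False)
weak.put("price", 10)
stale_read = weak.get("price", replica_index=1)      # None: the follower lags behind
weak.propagate()
converged_read = weak.get("price", replica_index=1)  # 10 after propagation
```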

E X A M P L E
A VA I L A B I L I T Y

PA R T I T I O N T O L E R A N C E
the system continues to operate even in the presence of partitions

PA R T I T I O N T O L E R A N C E
Network failure
nodes are split into groups on either side of a faulty network element (switch, backbone)
Process failure
the system is split into two groups: correct nodes and crashed nodes

C A P T H E O R E M
“Of three properties of shared-data systems
(data consistency, system availability and
tolerance to network partitions) only two can
be achieved at any given moment in time.”

T H E P R O O F
C A P T H E O R E M
[Figure: two partitions, each holding a replica with price = 0. At t1, put(price, 10) reaches partition 1. At t2, get(price) reaches partition 2, which either returns price = 0 (not consistent) or gives no response (not available).]
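
The argument can be replayed in a few lines. This is only a sketch: the `Node` class, the `"CP"`/`"AP"` mode labels and the `run` helper are assumptions made for illustration, not part of the talk. During a partition, a node either answers with what it has (and may be stale) or refuses to answer.

```python
class Unavailable(Exception):
    """Raised when a node refuses to answer during a partition."""


class Node:
    """Hypothetical replica; `mode` is "CP" or "AP" (labels chosen for this sketch)."""

    def __init__(self, mode):
        self.mode = mode
        self.data = {"price": 0}
        self.peer = None
        self.partitioned = False  # True when the peer cannot be reached

    def put(self, key, value):
        self.data[key] = value
        if not self.partitioned:
            self.peer.data[key] = value  # replicate while the link is up

    def get(self, key):
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot confirm the latest value")
        return self.data[key]


def run(mode, partitioned=True):
    n1, n2 = Node(mode), Node(mode)
    n1.peer, n2.peer = n2, n1
    n1.partitioned = n2.partitioned = partitioned
    n1.put("price", 10)            # t1: the write reaches partition 1
    try:
        value = n2.get("price")    # t2: the read reaches partition 2
    except Unavailable:
        return "not available"
    return "consistent" if value == 10 else "not consistent"


ap_outcome = run("AP")                          # "not consistent"
cp_outcome = run("CP")                          # "not available"
healthy_outcome = run("CP", partitioned=False)  # "consistent"
```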

C A P T H E O R E M
I N P R A C T I C E
CONSISTENCY + PARTITION TOLERANCE
➡ distributed databases
➡ distributed locking
➡ majority protocol
➡ active/passive replication
➡ quorum-based systems
BigTable

C A P T H E O R E M
AVAILABILITY + PARTITION TOLERANCE
➡ web caches
➡ stateless systems
➡ DNS
DynamoDB

C A P T H E O R E M
CONSISTENCY + AVAILABILITY
➡ single-site databases
➡ cluster databases
➡ LDAP

R E Q U I R E M E N T S
D Y N A M O
“customers should be able to view and add items
to their shopping cart even if disks are failing,
network routes are flapping, or data centers are
being destroyed by tornados.”

➡ reliable
➡ highly scalable
➡ always available

S I M P L E I N T E R FA C E
D Y N A M O
get(key)
locates the object replicas associated with the key and returns a
single object, or a list of objects with conflicting versions,
along with a context.
put(key, context, object)
determines where the replicas of the object should be
placed based on the associated key. The context
includes information such as the version of the object.

R E P L I C AT I O N : T H E C H O I C E
D Y N A M O
Synchronous replica coordination
‣ strong consistency
‣ at the cost of availability
Optimistic replication technique
‣ high availability
‣ conflicts are possible

C O N F L I C T S : W H E N
D Y N A M O
At write time
‣ writes may be rejected
At read time
‣ “always writable” datastore

C O N F L I C T S : W H O
D Y N A M O
The data store
‣ e.g. “last write wins” policy
The application
‣ resolution logic is part of the application's implementation
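
The two options can be contrasted with a small sketch. The `Version` type, its timestamp field and the cart contents are assumptions for illustration; merging carts by set union is one application-level policy that fits the shopping-cart requirement, not necessarily the one Dynamo's clients use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """Hypothetical versioned cart: a set of items plus a write timestamp."""
    items: frozenset
    timestamp: float


def last_write_wins(versions):
    """Data-store policy: keep the newest version, discard the others."""
    return max(versions, key=lambda v: v.timestamp)


def merge_carts(versions):
    """Application policy: union of all items, so no added item is lost."""
    merged = frozenset().union(*(v.items for v in versions))
    return Version(merged, max(v.timestamp for v in versions))


# Two conflicting versions of the same cart.
conflict = [
    Version(frozenset({"book"}), timestamp=1.0),
    Version(frozenset({"lamp"}), timestamp=2.0),
]
lww_cart = last_write_wins(conflict)   # only "lamp" survives
merged_cart = merge_carts(conflict)    # "book" and "lamp" both survive
```

Last write wins is simple but silently drops the older write; the application-side merge keeps both additions because it knows what a cart means.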

A R I N G T O R U L E T H E M A L L
D Y N A M O

PA R T I T I O N I N G : T H E R I N G
D Y N A M O
[Figure: nodes A, B, C, D, E, F and G placed on a ring; hashing the DATA key gives its position on the ring]

R E P L I C AT I O N
D Y N A M O
[Figure: the ring of nodes A, B, C, D, E, F and G, with the hashed DATA key placed on it]
N = 3: D will store keys in the ranges (A, B], (B, C], (C, D]
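
A minimal sketch of the ring, under simplifying assumptions: node positions are derived by hashing the node names with MD5, there are no virtual nodes, and the `Ring` class and `preference_list` name are chosen for this example. A key is stored on the first N nodes found walking clockwise from its hash.

```python
import bisect
import hashlib


def ring_position(name):
    """Map a string to a position on the ring (MD5-based in this sketch)."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, n_replicas=3):
        self.n_replicas = n_replicas
        self.points = sorted((ring_position(node), node) for node in nodes)
        self.positions = [pos for pos, _ in self.points]

    def preference_list(self, key):
        """First N nodes met walking clockwise from the key's position."""
        start = bisect.bisect_left(self.positions, ring_position(key))
        count = min(self.n_replicas, len(self.points))
        return [self.points[(start + i) % len(self.points)][1]
                for i in range(count)]


ring = Ring(["A", "B", "C", "D", "E", "F", "G"], n_replicas=3)
replicas = ring.preference_list("price")  # three consecutive nodes on the ring
```

Adding or removing a node only changes the ranges of its neighbours, which is the reason for using a ring rather than `hash(key) % number_of_nodes`.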

D ATA V E R S I O N I N G
D Y N A M O
put()
may return before the update has been propagated to
all replicas.
get()
subsequent get() may return an object that does not
have the latest update

R E C O N C I L I AT I O N
D Y N A M O
Syntactic reconciliation
‣ new version subsumes the previous
Semantic reconciliation
‣ conflicting versions of the same object

V E C T O R C L O C K
D Y N A M O

D Y N A M O
Definition
‣ list of (node, counter) pairs

D Y N A M O
D1 [Sx,1]: write handled by Sx
D2 [Sx,2]: write handled by Sx
D3 [Sx,2], [Sy,1]: write handled by Sy
D4 [Sx,2], [Sz,1]: write handled by Sz
D5 [Sx,3], [Sy,1], [Sz,1]: reconciled and written by Sx
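
The D1 to D5 sequence can be reproduced with a few helper functions. This is a sketch that represents a clock as a plain dict from node name to counter; the function names are chosen for this example.

```python
def descends(a, b):
    """True if clock `a` has seen everything in clock `b` (a >= b component-wise)."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())


def concurrent(a, b):
    """Neither clock descends from the other: the versions conflict."""
    return not descends(a, b) and not descends(b, a)


def increment(clock, node):
    """Clock for a new version written through `node`."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def merge(a, b):
    """Component-wise maximum, used when reconciling divergent versions."""
    return {node: max(a.get(node, 0), b.get(node, 0))
            for node in a.keys() | b.keys()}


d1 = increment({}, "Sx")              # {"Sx": 1}
d2 = increment(d1, "Sx")              # {"Sx": 2}
d3 = increment(d2, "Sy")              # {"Sx": 2, "Sy": 1}
d4 = increment(d2, "Sz")              # {"Sx": 2, "Sz": 1}
d5 = increment(merge(d3, d4), "Sx")   # {"Sx": 3, "Sy": 1, "Sz": 1}
```

D3 and D4 are concurrent (neither descends from the other), so they need semantic reconciliation; D2 is subsumed by both, which is the syntactic case.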

P U T ( ) A N D G E T ( )
D Y N A M O
R
‣ minimum number of nodes that must participate
in a successful read operation
W
‣ minimum number of nodes that must participate
in a successful write operation

P U T ( ) A N D G E T ( )
D Y N A M O
put()
‣ the coordinator generates the vector clock for the new version and
writes the new version locally
‣ the new version is sent to the N highest-ranked reachable nodes
‣ the write is successful if at least W-1 nodes respond
get()
‣ the coordinator requests all existing versions of the data
‣ the coordinator waits for R responses before returning the result
‣ the coordinator returns all the versions that are causally unrelated
‣ the divergent versions are reconciled and written back
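
A simplified sketch of the R/W quorum logic follows. It assumes a hypothetical `QuorumStore` with in-memory replicas, and it uses a plain integer version instead of a vector clock, so it returns only the newest version rather than all causally unrelated ones.

```python
class QuorumStore:
    """Toy coordinator over N in-memory replicas; `down` marks unreachable ones."""

    def __init__(self, n=3, r=2, w=2):
        self.n, self.r, self.w = n, r, w
        self.replicas = [{} for _ in range(n)]
        self.down = set()

    def _reachable(self):
        return [i for i in range(self.n) if i not in self.down]

    def put(self, key, value, version):
        acks = 0
        for i in self._reachable():
            self.replicas[i][key] = (version, value)
            acks += 1
        # In Dynamo the coordinator's local write is one of the W, which is why
        # it waits for W-1 other nodes; here all replicas are counted alike.
        return acks >= self.w

    def get(self, key):
        responders = self._reachable()
        if len(responders) < self.r:
            raise RuntimeError("fewer than R replicas answered")
        versions = [self.replicas[i][key] for i in responders
                    if key in self.replicas[i]]
        if not versions:
            return None
        return max(versions, key=lambda pair: pair[0])[1]  # newest version


store = QuorumStore(n=3, r=2, w=2)
store.down = {2}
write_ok = store.put("price", 10, version=1)  # replicas 0 and 1 acknowledge
store.down = {0}
value = store.get("price")                    # replicas 1 and 2 answer
```

With N = 3, R = 2 and W = 2 the read set and the write set always share at least one replica (R + W > N), which is why the read above still finds the value even though replica 2 missed the write.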

S L O P P Y Q U O R U M
D Y N A M O
[Figure: ring of nodes A, B, C, D, E, F and G]
N = 3

W H Y I S A P ?
D Y N A M O
‣ requests are served even if some replicas are not available
‣ if a node is down, the write is stored on another node
‣ consistency conflicts are resolved at read time or in the
background
‣ eventually, all the replicas converge
‣ concurrent read/write operations can make distinct clients
see distinct versions of the same key
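
The "stored on another node" point is the sloppy quorum. A minimal sketch, assuming the ring is given as an ordered list of node names and that the function name and the `(node, hint)` return shape are choices made for this example:

```python
def sloppy_preference_list(ring, start, n, down):
    """Pick the first `n` healthy nodes walking the ring from index `start`.

    Returns (node, hint) pairs; `hint` names the unreachable node a stand-in
    is covering for, or None when the node is one of the intended replicas.
    """
    intended = [ring[(start + i) % len(ring)] for i in range(n)]
    missing = [node for node in intended if node in down]
    chosen = [(node, None) for node in intended if node not in down]
    i = n
    while missing and i < len(ring):
        candidate = ring[(start + i) % len(ring)]
        if candidate not in down:
            chosen.append((candidate, missing.pop(0)))
        i += 1
    return chosen


ring = ["A", "B", "C", "D", "E", "F", "G"]
# A key whose intended replicas are B, C and D, while C is unreachable.
targets = sloppy_preference_list(ring, start=1, n=3, down={"C"})
# targets == [("B", None), ("D", None), ("E", "C")]
```

The hint lets the stand-in node (E here) hand the data back to C once C is reachable again, so the write is accepted now and the replicas converge later.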

G O O G L E B I G TA B L E
‣ scales to petabytes of data
‣ thousands of machines
‣ high availability
‣ high performance

D ATA M O D E L
‣ sparse, distributed, persistent multi-dimensional
sorted map
(row: string, column: string, time: int64) → string
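
To make the map signature concrete, here is a toy in-memory version. `TinyTable` and its methods are assumptions for illustration and are not Bigtable's API; the sketch only shows a map that is sorted by row key and keeps several timestamped versions per cell.

```python
import bisect


class TinyTable:
    """Toy sorted map keyed by (row, column, timestamp); not Bigtable's API."""

    def __init__(self):
        self.keys = []     # sorted list of (row, column, -timestamp)
        self.values = {}

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negated so newer versions sort first
        if key not in self.values:
            bisect.insort(self.keys, key)
        self.values[key] = value

    def get(self, row, column):
        """Return the most recent value of a cell, or None if it is empty."""
        i = bisect.bisect_left(self.keys, (row, column, float("-inf")))
        if i < len(self.keys) and self.keys[i][:2] == (row, column):
            return self.values[self.keys[i]]
        return None

    def scan(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        i = bisect.bisect_left(self.keys, (start_row,))
        while i < len(self.keys) and self.keys[i][0] < end_row:
            row, column, neg_ts = self.keys[i]
            yield row, column, -neg_ts, self.values[self.keys[i]]
            i += 1


table = TinyTable()
table.put("com.example", "contents:", 1, "<html>v1")
table.put("com.example", "contents:", 2, "<html>v2")
table.put("com.cnn.www", "language:", 1, "en")
latest = table.get("com.example", "contents:")              # "<html>v2"
rows = [row for row, *_ in table.scan("com.cnn", "com.d")]  # ["com.cnn.www"]
```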

R O W S
‣ row keys are arbitrary strings
‣ read/write operations on a single row are atomic
‣ data is maintained in lexicographic order by row key
‣ each row range is called a tablet
e.g. maps.google.com is stored under the row key com.google.maps
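
Reversing the hostname keeps pages of the same domain next to each other in the sorted row space. A short sketch, with `row_key` as a hypothetical helper name:

```python
def row_key(hostname, path=""):
    """Reverse the hostname components: maps.google.com -> com.google.maps."""
    return ".".join(reversed(hostname.split("."))) + path


hosts = ["maps.google.com", "www.cnn.com", "mail.google.com", "example.com"]
keys = sorted(row_key(h) for h in hosts)
# ["com.cnn.www", "com.example", "com.google.mail", "com.google.maps"]
```

Both google.com hosts end up adjacent, so a scan over the prefix `com.google.` touches one contiguous row range.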

C O L U M N S
‣ column keys are grouped into sets: column families
‣ a column family must be created before data can be
stored under any column key in that family
‣ column keys are named family:qualifier
‣ access control and both disk and memory
accounting are performed at the column-family level

T I M E S TA M P S
[Figure: the cell in row com.example, column contents: holds two versions of the page (<html>…), one at timestamp t1 and one at timestamp t2]

D ATA M O D E L : E X A M P L E
row key            language:   contents:                  anchor:cnnsi.com   anchor:mylook.ca
com.cnn.www        en          <!DOCTYPE html PUBLIC …    "cnn"              "cnn.com"
com.cnn.www/foo    en          <!DOCTYPE html PUBLIC …
com.example        en          <!DOCTYPE html PUBLIC …

Rows are sorted by row key; language:, contents: and anchor: are column families.

D I F F E R E N C E S W I T H R D B M S
RDBMS              BIGTABLE
query language     specific API
joins              no referential integrity
explicit sorting   sorting defined a priori in the column family

A R C H I T E C T U R E
Google File System (GFS)
‣ stores data files and logs
Google SSTable
‣ file format used to store BigTable data
Chubby
‣ highly available distributed lock service

C O M P O N E N T S
library
‣ linked into every client
one master server
‣ assigns tablets to tablet servers
‣ detects the addition and expiration of tablet servers
‣ balances tablet-server load
‣ garbage-collects files in GFS
‣ handles schema changes
many tablet servers
‣ each manages a set of tablets (typically 10 to 1,000)
‣ each handles read and write requests to its tablets
‣ each splits tablets that have grown too large

C O M P O N E N T S
[Figure: a client, the master server and three tablet servers; the arrows are labelled "metadata" and "read/write"]

S TA R T U P A N D G R O W T H
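[Figure: tablet location hierarchy. A Chubby file points to the root tablet (the 1st Metadata tablet), which points to the other metadata tablets, which in turn point to the user tablets (UserTable1 … UserTableN)]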

TA B L E T A S S I G N M E N T
tablet server
‣ when started, creates a uniquely named file in Chubby and acquires a lock on it
master
‣ grabs a unique master lock in Chubby
‣ scans Chubby to find live tablet servers
‣ asks each tablet server which tablets it already serves
‣ scans the Metadata table to learn the full set of tablets
‣ builds a set of unassigned tablets, eligible for future tablet
assignment
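
The last two master steps amount to a set difference: the tablets listed in the Metadata table minus the tablets that live servers already serve. A sketch under assumed data shapes (the function name, the tablet names and the server names are all hypothetical):

```python
def find_unassigned(metadata_tablets, server_reports):
    """Tablets listed in the Metadata table that no live server reports serving.

    `metadata_tablets`: iterable of tablet names (from the Metadata scan).
    `server_reports`: dict mapping each live tablet server to the tablets it serves.
    Both shapes are assumptions of this sketch.
    """
    assigned = set()
    for tablets in server_reports.values():
        assigned.update(tablets)
    return sorted(set(metadata_tablets) - assigned)


metadata = ["users-1", "users-2", "users-3", "orders-1"]
reports = {"ts-a": ["users-1"], "ts-b": ["users-3", "orders-1"]}
unassigned = find_unassigned(metadata, reports)  # ["users-2"]
```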

W H Y I S C P ?
‣ if the master dies, its services stop functioning
‣ if a tablet server dies, its tablets become unavailable
‣ if Chubby dies, BigTable cannot execute synchronization
operations or serve client requests
‣ Google File System is a CP system

$ W H O A M I
Andrea Giuliano
@bit_shark
www.andreagiuliano.it

joind.in/13224
Please rate the talk!

R E F E R E N C E S
G. DeCandia et al. “Dynamo: Amazon’s Highly Available Key-value Store”
F. Chang et al. “Bigtable: A Distributed Storage System for Structured Data”
Assets:
https://farm1.staticflickr.com/41/86744006_0026864df8_b_d.jpg
https://farm9.staticflickr.com/8305/7883634326_4e51a1a320_b_d.jpg
https://farm5.staticflickr.com/4145/4958650244_65b2eddffc_b_d.jpg
https://farm4.staticflickr.com/3677/10023456065_e54212c52e_b_d.jpg
https://farm4.staticflickr.com/3076/2871264822_261dafa44c_o_d.jpg
https://farm1.staticflickr.com/7/6111406_30005bdae5_b_d.jpg
https://farm4.staticflickr.com/3928/15416585502_92d5e608c7_b_d.jpg
https://farm8.staticflickr.com/7046/6873109431_d3b5199f7d_b_d.jpg
https://farm4.staticflickr.com/3007/2835755867_c530b0e0c6_o_d.jpg
https://farm3.staticflickr.com/2788/4202444169_2079db9580_o_d.jpg
https://farm1.staticflickr.com/55/129619657_907b480c7c_b_d.jpg
https://farm5.staticflickr.com/4046/4368269562_b3e05e3f06_b_d.jpg
https://farm8.staticflickr.com/7344/12137775834_d0cecc5004_k_d.jpg
https://farm5.staticflickr.com/4073/4895191036_1cb9b58d75_b_d.jpg
https://farm4.staticflickr.com/3144/3025249284_b77dec2d29_o_d.jpg
https://www.flickr.com/photos/avardwoolaver/7137096221
