Databases
Sargun Dhillon
@Sargun
What is a database?
A database is an organized collection of data
Applications
What are databases for?
Internet Applications
Experiencing explosive growth
Internet Traffic vs. Penetration
[Chart: IP Traffic (PB/mo), 0 to 40,000, and Global Penetration (%), 0 to 100, plotted over 2000-2012]
Number of Internet Users in 2012
Average Distance to Every Human
Extrapolating
We have not yet reached Peak “Web”, and we won’t see it for some time
Applications
How are they built?
Basic Application
Useful Application
Add Persistence
Scale Out
Scale Out with Correctness
What is a Transaction?
A Unit of Work
Transaction Scheduling
Concurrent Operations
Non-Conflicting Concurrency
Parallel Execution
ACID
ACID = Atomicity
A transaction executes in its entirety, or not at all
ACID = Consistency
Correctness: the database must preserve a set of
invariants
ACID = Isolation
Concurrent operations cannot observe each other’s intermediate state
ACID = Durability
Once a write is acknowledged, it survives failures
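These four properties are easiest to see with a concrete transaction. A minimal sketch using Python’s built-in sqlite3 (schema and amounts are illustrative): the transfer commits as a whole or not at all, and the CHECK constraint is an invariant the database itself enforces.

```python
import sqlite3

# Illustrative schema: a balance may never go negative (a database invariant).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY,"
             " balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('alice', 50), ('bob', 0)")
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 70 WHERE id = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 70 WHERE id = 'bob'")
except sqlite3.IntegrityError:
    pass  # atomicity: neither UPDATE is visible after the rollback

# Consistency held: alice still has 50, bob still has 0.
print(conn.execute("SELECT id, balance FROM accounts").fetchall())
```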
Lifecycle of a Transaction
Vertical Scalability
Moore’s Law can take us places
Biggest AWS Database
• vCPUs: 32
• Memory: 244 GB
• Storage: 3 TB
• IOPS: 30,000
• Networking: 10 Gigabit
• Resiliency: Multi-AZ
• SLA: 99.95%
• Backend: PostgreSQL
$141,052.66/yr
Scaling Beyond
Sharding?
Do we have a natural
sharding key?
Add a Coordinator?
Two-phase commit?
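A toy sketch of the two-phase commit idea (the Participant API here is hypothetical): the coordinator commits only if every participant votes yes, and a coordinator crash after the votes leaves prepared participants blocked, which is one reason the search below keeps going.

```python
# Toy two-phase commit coordinator; Participant is a hypothetical interface.
class Participant:
    def prepare(self, txn) -> bool:  # vote yes/no; must persist vote first
        ...
    def commit(self, txn): ...
    def abort(self, txn): ...

def two_phase_commit(coordinator_log: list, participants, txn) -> bool:
    # Phase 1: collect votes. Any "no" (or a timeout) aborts everyone.
    if not all(p.prepare(txn) for p in participants):
        for p in participants:
            p.abort(txn)
        return False
    # The decision must be logged before phase 2; if the coordinator
    # crashes here, prepared participants block until it recovers.
    coordinator_log.append(("commit", txn))
    for p in participants:  # Phase 2: tell everyone to commit.
        p.commit(txn)
    return True
```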
Three-phase commit?
Paxos?
Enhanced Three-phase commit?
Wat?
Egalitarian Paxos?
Do we really want to
run NxM databases?
Partial Availability
Failure detectors are
hard
Database Failure
Cascading App Failure
Recovery
Hotspots?
(The “Bieber” problem)
Scaling SSI databases
is a hard problem
What if we want
multi-datacenter?
No latency win for
mutable data
Must sacrifice recency
for latency win
Complex Routing
Semantics
Multi-master requires
at least 1 RTT
80ms+ writes!
“Average partition duration ranged from 6 minutes for
software-related failures to more than 8.2 hours for
hardware-related failures (median 2.7 and 32 minutes;
95th percentile of 19.9 minutes and 3.7 days,
respectively).”
-“The Network is Reliable” (Bailis & Kingsbury)
WANs Fail
Is there another way?
Into Riak
Design Requirements
Incremental Scalability
Must be able to add nodes for greater reliability or
throughput
High Availability
Must handle failures seamlessly and always
respond to operations
Efficiency
Meet stringent latency requirements
Implementation
“Experience at Amazon has shown
that data stores that provide
ACID guarantees tend to have
poor availability.”
-Dynamo: Amazon’s Highly Available Key-value Store
The Ring
A cluster is composed of a set of virtual nodes (vnodes)
The Ring
Virtual Node Placement
The Ring
Data Placement
Data Placement
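A simplified sketch of Dynamo-style placement, roughly what Riak does: SHA-1 maps each key onto a fixed ring of partitions, and the object lands on the next N vnodes, its preference list. The partition count and key format here are illustrative.

```python
import hashlib

RING_SIZE = 2 ** 160      # keys hash onto a 160-bit (SHA-1) ring
NUM_PARTITIONS = 64       # ring split into equal partitions, one per vnode
N = 3                     # replication factor

def preference_list(bucket: bytes, key: bytes) -> list[int]:
    """Hash the key onto the ring; store on the next N consecutive vnodes."""
    h = int.from_bytes(hashlib.sha1(bucket + b"/" + key).digest(), "big")
    first = h // (RING_SIZE // NUM_PARTITIONS)
    return [(first + i) % NUM_PARTITIONS for i in range(N)]

# Three adjacent vnode ids, e.g. [41, 42, 43]; adding nodes just
# reassigns partitions, which is what makes scaling incremental.
print(preference_list(b"users", b"sargun"))
```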
Fault Tolerance
Hinted Handoff
Fallback Virtual Nodes
Read Repair
Replicas
Partial Failure
Divergence
Read Repair
Active Anti-Entropy
Merkle Tree
Compare Trees
Repair Trees
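A toy version of the tree exchange (the bucket count and hashing scheme are made up for illustration): replicas compare roots first and descend only where hashes differ, so agreement costs one comparison and divergence is localized cheaply.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(kv: dict, buckets: int = 4):
    """Toy two-level Merkle tree: leaf i covers keys hashing to bucket i."""
    leaves = [b""] * buckets
    for k, v in sorted(kv.items()):
        i = int.from_bytes(h(k), "big") % buckets
        leaves[i] = h(leaves[i] + h(k) + h(v))
    root = h(b"".join(leaves))
    return root, leaves

def divergent_buckets(a, b) -> list:
    """Compare roots first; descend only into leaves that differ."""
    root_a, leaves_a = a
    root_b, leaves_b = b
    if root_a == root_b:
        return []  # replicas agree; nothing to exchange
    return [i for i, (x, y) in enumerate(zip(leaves_a, leaves_b)) if x != y]

a = build_tree({b"k1": b"v1", b"k2": b"v2"})
b = build_tree({b"k1": b"v1", b"k2": b"STALE"})
print(divergent_buckets(a, b))  # only the bucket holding k2 needs repair
```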
Fault Tolerance
• Read Repair
• Active Anti-Entropy
• Hinted Handoff
Eventual Consistency
CAP Theorem
“A shared-data system can have at most
two of the three following properties:
Consistency, Availability, and tolerance to
network Partitions.”
-Dr. Eric Brewer
On Consistency
• ACID Consistency: Any transaction or operation
will bring the database from one valid state to
another
• CAP Consistency: All nodes see the same data at
the same time (synchrony)
On Partition Tolerance
• The network will be allowed to lose arbitrarily many
messages sent from one node to another.
• Database systems, in order to be useful, must
communicate over the network
• Clients count
There is no such thing as
a 100% reliable network:
Can’t choose CA
http://codahale.com/you-cant-sacrifice-partition-tolerance
Very “AP”
Weak Consistency
“This is a specific form of weak
consistency; the storage system
guarantees that if no new
updates are made to the object,
eventually all accesses will
return the last updated value.”
Definition of “Eventual Consistency” from “Eventually
Consistent - Revisited” - Werner Vogels
Tunable CAP Controls
• R (Read Acks) tunable: Default Quorum
• W (Write Acks) tunable: Default Quorum
• PR (Primary Read Acks) tunable: Default 0
• PW (Primary Write Acks) tunable: Default 0
• N (replicas) tunable: Default 3
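Roughly how these tunables surface in Basho’s Python client (exact keyword arguments may vary by client version; the bucket and keys are invented for the example):

```python
import riak  # Basho's Python client for Riak

client = riak.RiakClient(pb_port=8087)
bucket = client.bucket("sessions")

# Write: demand 2 of N=3 replica acks, both from primary vnodes.
obj = bucket.new("user:42", data={"cart": []})
obj.store(w=2, pw=2)

# Fast, possibly stale read: a single replica ack suffices.
maybe_stale = bucket.get("user:42", r=1)

# Stronger read: with PR + PW > N (here 2 + 2 > 3) the read quorum
# must overlap the write quorum -- the property on the next slide.
fresh = bucket.get("user:42", pr=2)
```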
Strong Eventual Consistency
PW + PR > N
When primary write and read quorums overlap, every read contacts at least one replica holding the latest acknowledged write
How do you even use this?
Vector Clocks
• Extension of Lamport Clocks
• Used to detect cause and effect in distributed
systems
• Can determine concurrency of events, and
causality violations
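A minimal vector clock sketch, assuming one counter per actor: compare component-wise to decide whether one event descends from another, or whether the two are concurrent and need resolution.

```python
# Minimal vector clock: a dict mapping actor id -> event counter.
def increment(clock: dict, actor: str) -> dict:
    c = dict(clock)
    c[actor] = c.get(actor, 0) + 1
    return c

def descends(a: dict, b: dict) -> bool:
    """True if event A happened after (or is) event B."""
    return all(a.get(k, 0) >= n for k, n in b.items())

def concurrent(a: dict, b: dict) -> bool:
    """Neither descends from the other: conflicting siblings to resolve."""
    return not descends(a, b) and not descends(b, a)

a = increment({}, "client1")   # {'client1': 1}
b = increment(a, "client2")    # {'client1': 1, 'client2': 1}
c = increment(a, "client3")    # {'client1': 1, 'client3': 1}
assert descends(b, a) and concurrent(b, c)
```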
CRDTs
• CRDTs:
• Convergent Replicated Data Types
• Commutative Replicated Data Types
• Enable data structures that remain writeable on both sides of a partition,
and replay after the partition heals
• Enable distributed computation across monotonic functions
• Two Types:
• CvRDTs
• CmRDTs
CRDTs
CvRDTs
• State / value based CRDTs
• Minimal state
• Don’t require active garbage collection
Set CvRDT
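The simplest state-based example is a grow-only set (a sketch, not Riak’s implementation): merge is set union, which is commutative, associative, and idempotent, so replicas converge no matter how merges are ordered or repeated.

```python
# G-Set (grow-only set), the simplest CvRDT: state is a set, merge is union.
class GSet:
    def __init__(self):
        self.items = set()

    def add(self, x):                 # local update; full state ships to peers
        self.items.add(x)

    def merge(self, other: "GSet"):   # union: commutative, associative, idempotent
        self.items |= other.items

a, b = GSet(), GSet()
a.add("ad-1")            # applied on one side of a partition
b.add("ad-2")            # ...and on the other
a.merge(b); b.merge(a)   # after healing, both converge
assert a.items == b.items == {"ad-1", "ad-2"}
```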
CmRDTs
• Op / method based CRDTs
• Size grows monotonically
• Uses version vectors to determine order of
operations
Counter CmRDT
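A sketch of an op-based counter (the delivery plumbing is assumed away): increments commute, so replicas converge under any delivery order, provided each operation is applied exactly once, handled here by deduplicating on op id.

```python
# Op-based counter (CmRDT): replicas broadcast operations, not state.
class OpCounter:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.seq = 0         # local op sequence number
        self.seen = set()    # op ids already applied (exactly-once delivery)
        self.value = 0

    def increment(self, amount: int = 1):
        self.seq += 1
        op = (self.replica_id, self.seq, amount)
        self.apply(op)
        return op            # would be broadcast to peer replicas

    def apply(self, op):
        if op[:2] not in self.seen:
            self.seen.add(op[:2])
            self.value += op[2]

a, b = OpCounter("a"), OpCounter("b")
op1, op2 = a.increment(), b.increment(5)
a.apply(op2); b.apply(op1)   # deliver in either order
assert a.value == b.value == 6
```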
CRDTs in the Wild
• Sets
• Observe-remove set
• Grow-only sets
• Counters
• Grow-only counters
• PN-Counters
• Flags
• Maps
Data structures that are
CRDTs
• Probabilistic, convergent data structures
• HyperLogLog
• Bloom filter
• Co-recursive folding functions
• Maximum-counter
• Running Average
• Operational Transform
CRDTs
• Incredibly powerful primitive
• Useful not only for in-database manipulation, but
also for client-database interaction
• You can compose them and build your own
• Garbage collection is tricky
RAMP: Read Atomic
Multi-Partition
Transactions
Multikey Transaction
Potential Consistency Violation
Add Metadata
Uncommitted State
Committed State
Have your availability
and consistency too
RAMP
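A greatly simplified single-process sketch of the RAMP-Fast idea from Bailis et al. (the data structures are invented for illustration): each write carries a timestamp plus its sibling keys, letting a reader detect a fractured first-round read and repair it with a second round.

```python
versions = {}   # (key, ts) -> value: all installed versions
latest = {}     # key -> (ts, sibling keys): committed pointers

def write_txn(ts, kv):
    for k, v in kv.items():            # prepare: install the versions
        versions[(k, ts)] = v
    for k in kv:                       # commit: flip the latest pointers
        latest[k] = (ts, set(kv) - {k})

def read_txn(keys):
    first = {k: latest.get(k, (0, set())) for k in keys}
    result = {k: versions.get((k, ts)) for k, (ts, _) in first.items()}
    # Second round: if a sibling's transaction is newer than what we
    # read for some key, fetch the matching version by timestamp.
    for k, (ts, siblings) in first.items():
        for s in siblings & set(keys):
            if first[s][0] < ts:
                result[s] = versions[(s, ts)]
    return result

write_txn(1, {"x": "a", "y": "a"})
# Simulate a racing transaction caught mid-commit: versions installed,
# but only x's latest pointer flipped so far.
versions[("x", 2)], versions[("y", 2)] = "b", "b"
latest["x"] = (2, {"y"})
print(read_txn({"x", "y"}))  # {'x': 'b', 'y': 'b'} - repaired, not fractured
```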
Eventual Consistency
in the WAN
Low-latency
everywhere
Write Anywhere
Beat the speed of light
MDC Replication
Hybrid Topologies
Bidirectional Replication
Unidirectional Replication
Replication Hooks
Tied Writes
Hook on Replication
Replicate Hook Return Data
Build for WAN locality
Eventual Consistency
In Summary
Invariant            Operation    AP / CP
Specify unique ID    Any          CP
Generate unique ID   Any          AP
>                    INCREMENT    AP
>                    DECREMENT    CP
<                    INCREMENT    CP
<                    DECREMENT    AP
Secondary Index      Any          AP
Materialized View    Any          AP
AUTO_INCREMENT       INSERT       CP
Linearizability      CAS          CP
Operations Requiring
Weak Consistency
vs.
Strong Consistency
BASE not ACID
• Basically Available: There will be a response
per request (failure or success)
• Soft State: Any two reads against the system
may yield different data (when measured
against time)
• Eventually Consistent: The system will
eventually become consistent once all
failures have healed and time goes to infinity
Deploying Riak
AWS Deployment
• 6 x i2.4xlarge
• 732 GB of RAM
• 19 TB of storage
• 960,000 IOPS
• 96 vCPUs
• 3x replication
• 10 Gigabit networking
• 99.9999999997% availability
$74,790/yr
Real World Use Case
Ad Network
• Sell targeted ads with minimum latency
• Two datasets:
• Ads
• Users
Deployment
Overselling Ads is
Okay
Choose Random Ad
Based on Weight of
Outstanding Impressions
Batch System
Generates targeted ads in an offline process
Ad Graph
Ad Store
Initial Visit
Fetch All Ads
Choose Ad
Based upon weighted random
Decrement Value
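One way to implement this selection step (the function and names are hypothetical): weight each ad by its outstanding impression count, pick one, and decrement. Since overselling is acceptable, it is fine if the counts are slightly stale.

```python
import random

def choose_ad(ads: dict) -> str:
    """ads maps ad_id -> outstanding impressions; pick proportionally."""
    candidates = {a: n for a, n in ads.items() if n > 0}
    ids = list(candidates)
    weights = [candidates[a] for a in ids]
    return random.choices(ids, weights=weights, k=1)[0]

ads = {"ad-1": 1000, "ad-2": 1200, "ad-3": 1050}
pick = choose_ad(ads)
ads[pick] -= 1   # in Riak this could be a PN-counter decrement
```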
Test Model
• 50 actors
• 5 ads with inventory between 1,000 and 1,200
• Actors randomly get 1 to 3 choices per round
• Rounds continue until entire inventory is exhausted
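A sketch of this model (the shared stale snapshot per round is an assumption standing in for eventually consistent reads): because all 50 actors choose from the same stale counts, their decrements race and counters can overshoot below zero, which is the dip visible in the chart below.

```python
import random

random.seed(1)
inventory = {f"ad-{i}": random.randint(1000, 1200) for i in range(1, 6)}

rounds = 0
while any(n > 0 for n in inventory.values()):
    rounds += 1
    snapshot = dict(inventory)            # every actor reads a stale snapshot
    for _ in range(50):                   # 50 actors
        for _ in range(random.randint(1, 3)):   # 1 to 3 choices per round
            live = {a: n for a, n in snapshot.items() if n > 0}
            if not live:
                break
            ad = random.choices(list(live), weights=list(live.values()), k=1)[0]
            inventory[ad] -= 1            # racing decrements can go negative

print(rounds, inventory)
```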
Test Model
[Chart: Outstanding Impressions (-300 to 1,200) vs. Round Number (1 to 76) for Ad 1 through Ad 5; inventories drain toward zero and briefly dip negative due to overselling]
Garbage Collection
Uses secondary indexes in a batch process to delete
exhausted ads from user records
Ad Serving
• Requires batch generation of targets
• Requires external GC
• Allows for multi-datacenter operation
In Summary
Riak
Distributed
Fault-Tolerant
Scalable
Scalability
Processors
Toolchest
Why Distributed Databases?
Sargun Dhillon
@Sargun
