Databases
Sargun Dhillon
@Sargun
What is a database?
A database is an organized collection of data
Applications
What are databases for?
Internet Applications
Experiencing explosive growth
Internet Traffic vs. Penetration
[Chart: IP Traffic (PB/mo), 0 to 40,000, and Global Penetration (%), 0 to 100, plotted over 2000-2012]
Number of Internet Users in 2012
Average Distance to Every Human
Extrapolating
We have not yet reached Peak “Web”, and we won’t see it for some time
Applications
How are they built?
Basic Application
Useful Application
Add Persistence
Scale Out
Scale Out with Correctness
What is a Transaction?
A Unit of Work
Transaction Scheduling
Concurrent Operations
Non-Conflicting Concurrency
Parallel Execution
ACID
ACID = Atomicity
A transaction executes in its entirety, or not at all
ACID = Consistency
Correctness: the database must preserve a set of
invariants
ACID = Isolation
Concurrent operations cannot observe each other’s intermediate state
ACID = Durability
Once a write is acknowledged, it survives failures
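These four properties are easiest to see with a concrete transaction. A minimal sketch using Python’s built-in sqlite3 (schema and amounts are illustrative): the transfer commits as a whole or not at all, and the CHECK constraint is an invariant the database itself enforces.

```python
import sqlite3

# Illustrative schema: a balance may never go negative (a database invariant).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY,"
             " balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('alice', 50), ('bob', 0)")
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 70 WHERE id = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 70 WHERE id = 'bob'")
except sqlite3.IntegrityError:
    pass  # atomicity: neither UPDATE is visible after the rollback

# Consistency held: alice still has 50, bob still has 0.
print(conn.execute("SELECT id, balance FROM accounts").fetchall())
```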
Lifecycle of a Transaction
Vertical Scalability
Moore’s Law can take us places
Biggest AWS Database
• vCPUs: 32
• Memory: 244 GB
• Storage: 3 TB
• IOPS: 30,000
• Networking: 10 Gigabit
• Resiliency: Multi-AZ
• SLA: 99.95%
• Backend: PostgreSQL
$141,052.66/yr
Scaling Beyond
Sharding?
Do we have a natural
sharding key?
Add a Coordinator?
Two-phase commit?
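A toy sketch of the two-phase commit idea (the Participant API here is hypothetical): the coordinator commits only if every participant votes yes, and a coordinator crash after the votes leaves prepared participants blocked, which is one reason the search below keeps going.

```python
# Toy two-phase commit coordinator; Participant is a hypothetical interface.
class Participant:
    def prepare(self, txn) -> bool:  # vote yes/no; must persist vote first
        ...
    def commit(self, txn): ...
    def abort(self, txn): ...

def two_phase_commit(coordinator_log: list, participants, txn) -> bool:
    # Phase 1: collect votes. Any "no" (or a timeout) aborts everyone.
    if not all(p.prepare(txn) for p in participants):
        for p in participants:
            p.abort(txn)
        return False
    # The decision must be logged before phase 2; if the coordinator
    # crashes here, prepared participants block until it recovers.
    coordinator_log.append(("commit", txn))
    for p in participants:  # Phase 2: tell everyone to commit.
        p.commit(txn)
    return True
```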
Three-phase commit?
Paxos?
Enhanced Three-phase commit?
Wat?
Egalitarian Paxos?
Do we really want to
run NxM databases?
Partial Availability
Failure detectors are
hard
Database Failure
Cascading App Failure
Recovery
Hotspots?
(The “Bieber” problem)
Scaling SSI databases
is a hard problem
What if we want
multi-datacenter?
No latency win for
mutable data
Must sacrifice recency
for latency win
Complex Routing
Semantics
Multi-master requires
at least 1 RTT
80ms+ writes!
“Average partition duration ranged from 6 minutes for
software-related failures to more than 8.2 hours for
hardware-related failures (median 2.7 and 32 minutes;
95th percentile of 19.9 minutes and 3.7 days,
respectively).”
-“The Network is Reliable” (Bailis & Kingsbury)
WANs Fail
Is there another way?
Into Riak
Design Requirements
Incremental Scalability
Must be able to add nodes for greater reliability or
throughput
High Availability
Must handle failures seamlessly and always
respond to operations
Efficiency
Meet stringent latency requirements
Implementation
“Experience at Amazon has shown
that data stores that provide
ACID guarantees tend to have
poor availability.”
-Dynamo: Amazon’s Highly Available Key-value Store
The Ring
A cluster is composed of a set of virtual nodes (vnodes)
The Ring
Virtual Node Placement
The Ring
Data Placement
Data Placement
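A simplified sketch of Dynamo-style placement, roughly what Riak does: SHA-1 maps each key onto a fixed ring of partitions, and the object lands on the next N vnodes, its preference list. The partition count and key format here are illustrative.

```python
import hashlib

RING_SIZE = 2 ** 160      # keys hash onto a 160-bit (SHA-1) ring
NUM_PARTITIONS = 64       # ring split into equal partitions, one per vnode
N = 3                     # replication factor

def preference_list(bucket: bytes, key: bytes) -> list[int]:
    """Hash the key onto the ring; store on the next N consecutive vnodes."""
    h = int.from_bytes(hashlib.sha1(bucket + b"/" + key).digest(), "big")
    first = h // (RING_SIZE // NUM_PARTITIONS)
    return [(first + i) % NUM_PARTITIONS for i in range(N)]

# Three adjacent vnode ids, e.g. [41, 42, 43]; adding nodes just
# reassigns partitions, which is what makes scaling incremental.
print(preference_list(b"users", b"sargun"))
```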
Fault Tolerance
Hinted Handoff
Fallback Virtual Nodes
Read Repair
Replicas
Partial Failure
Divergence
Read Repair
Active Anti-Entropy
Merkle Tree
Compare Trees
Repair Trees
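A toy version of the tree exchange (the bucket count and hashing scheme are made up for illustration): replicas compare roots first and descend only where hashes differ, so agreement costs one comparison and divergence is localized cheaply.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(kv: dict, buckets: int = 4):
    """Toy two-level Merkle tree: leaf i covers keys hashing to bucket i."""
    leaves = [b""] * buckets
    for k, v in sorted(kv.items()):
        i = int.from_bytes(h(k), "big") % buckets
        leaves[i] = h(leaves[i] + h(k) + h(v))
    root = h(b"".join(leaves))
    return root, leaves

def divergent_buckets(a, b) -> list:
    """Compare roots first; descend only into leaves that differ."""
    root_a, leaves_a = a
    root_b, leaves_b = b
    if root_a == root_b:
        return []  # replicas agree; nothing to exchange
    return [i for i, (x, y) in enumerate(zip(leaves_a, leaves_b)) if x != y]

a = build_tree({b"k1": b"v1", b"k2": b"v2"})
b = build_tree({b"k1": b"v1", b"k2": b"STALE"})
print(divergent_buckets(a, b))  # only the bucket holding k2 needs repair
```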
Fault Tolerance
• Read Repair
• Active Anti-Entropy
• Hinted Handoff
Eventual Consistency
CAP Theorem
“A shared-data system can have at most
two of the three following properties:
Consistency, Availability, and tolerance to
network Partitions.”
-Dr. Eric Brewer
On Consistency
• ACID Consistency: Any transaction or operation
will bring the database from one valid state to
another
• CAP Consistency: All nodes see the same data at
the same time (synchrony)
On Partition Tolerance
• The network will be allowed to lose arbitrarily many
messages sent from one node to another.
• Database systems, in order to be useful, must
communicate over the network
• Clients count
There is no such thing as
a 100% reliable network:
Can’t choose CA
http://codahale.com/you-cant-sacrifice-partition-tolerance
Very “AP”
Weak Consistency
“This is a specific form of weak
consistency; the storage system
guarantees that if no new
updates are made to the object,
eventually all accesses will
return the last updated value.”
Definition of “Eventual Consistency” from “Eventually
Consistent - Revisited” - Werner Vogels
Tunable CAP Controls
• R (Read Acks) tunable: Default Quorum
• W (Write Acks) tunable: Default Quorum
• PR (Primary Read Acks) tunable: Default 0
• PW (Primary Write Acks) tunable: Default 0
• N (replicas) tunable: Default 3
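Roughly how these tunables surface in Basho’s Python client (exact keyword arguments may vary by client version; the bucket and keys are invented for the example):

```python
import riak  # Basho's Python client for Riak

client = riak.RiakClient(pb_port=8087)
bucket = client.bucket("sessions")

# Write: demand 2 of N=3 replica acks, both from primary vnodes.
obj = bucket.new("user:42", data={"cart": []})
obj.store(w=2, pw=2)

# Fast, possibly stale read: a single replica ack suffices.
maybe_stale = bucket.get("user:42", r=1)

# Stronger read: with PR + PW > N (here 2 + 2 > 3) the read quorum
# must overlap the write quorum -- the property on the next slide.
fresh = bucket.get("user:42", pr=2)
```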
Strong Eventual Consistency
PW + PR > N
When primary write and read quorums overlap, every read contacts at least one replica holding the latest acknowledged write
How do you even use this?
Vector Clocks
• Extension of Lamport Clocks
• Used to detect cause and effect in distributed
systems
• Can determine concurrency of events, and
causality violations
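A minimal vector clock sketch, assuming one counter per actor: compare component-wise to decide whether one event descends from another, or whether the two are concurrent and need resolution.

```python
# Minimal vector clock: a dict mapping actor id -> event counter.
def increment(clock: dict, actor: str) -> dict:
    c = dict(clock)
    c[actor] = c.get(actor, 0) + 1
    return c

def descends(a: dict, b: dict) -> bool:
    """True if event A happened after (or is) event B."""
    return all(a.get(k, 0) >= n for k, n in b.items())

def concurrent(a: dict, b: dict) -> bool:
    """Neither descends from the other: conflicting siblings to resolve."""
    return not descends(a, b) and not descends(b, a)

a = increment({}, "client1")   # {'client1': 1}
b = increment(a, "client2")    # {'client1': 1, 'client2': 1}
c = increment(a, "client3")    # {'client1': 1, 'client3': 1}
assert descends(b, a) and concurrent(b, c)
```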
CRDTs
• CRDTs:
• Convergent Replicated Data Types
• Commutative Replicated Data Types
• Enable data structures that remain writeable on both sides of a partition,
and replay after the partition heals
• Enable distributed computation across monotonic functions
• Two Types:
• CvRDTs
• CmRDTs
CRDTs
CvRDTs
• State / value based CRDTs
• Minimal state
• Don’t require active garbage collection
Set CvRDT
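The simplest state-based example is a grow-only set (a sketch, not Riak’s implementation): merge is set union, which is commutative, associative, and idempotent, so replicas converge no matter how merges are ordered or repeated.

```python
# G-Set (grow-only set), the simplest CvRDT: state is a set, merge is union.
class GSet:
    def __init__(self):
        self.items = set()

    def add(self, x):                 # local update; full state ships to peers
        self.items.add(x)

    def merge(self, other: "GSet"):   # union: commutative, associative, idempotent
        self.items |= other.items

a, b = GSet(), GSet()
a.add("ad-1")            # applied on one side of a partition
b.add("ad-2")            # ...and on the other
a.merge(b); b.merge(a)   # after healing, both converge
assert a.items == b.items == {"ad-1", "ad-2"}
```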
CmRDTs
• Op / method based CRDTs
• Size grows monotonically
• Uses version vectors to determine order of
operations
Counter CmRDT
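A sketch of an op-based counter (the delivery plumbing is assumed away): increments commute, so replicas converge under any delivery order, provided each operation is applied exactly once, handled here by deduplicating on op id.

```python
# Op-based counter (CmRDT): replicas broadcast operations, not state.
class OpCounter:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.seq = 0         # local op sequence number
        self.seen = set()    # op ids already applied (exactly-once delivery)
        self.value = 0

    def increment(self, amount: int = 1):
        self.seq += 1
        op = (self.replica_id, self.seq, amount)
        self.apply(op)
        return op            # would be broadcast to peer replicas

    def apply(self, op):
        if op[:2] not in self.seen:
            self.seen.add(op[:2])
            self.value += op[2]

a, b = OpCounter("a"), OpCounter("b")
op1, op2 = a.increment(), b.increment(5)
a.apply(op2); b.apply(op1)   # deliver in either order
assert a.value == b.value == 6
```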
CRDTs in the Wild
• Sets
• Observe-remove set
• Grow-only sets
• Counters
• Grow-only counters
• PN-Counters
• Flags
• Maps
Data structures that are
CRDTs
• Probabilistic, convergent data structures
• HyperLogLog
• Bloom filter
• Co-recursive folding functions
• Maximum-counter
• Running Average
• Operational Transform
CRDTs
• Incredibly powerful primitive
• Useful not only for in-database manipulation, but
also for client-database interaction
• You can compose them and build your own
• Garbage collection is tricky
RAMP: Read Atomic
Multi-Partition
Transactions
Multikey Transaction
Potential Consistency Violation
Add Metadata
Uncommitted State
Committed State
Have your availability
and consistency too
RAMP
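A greatly simplified single-process sketch of the RAMP-Fast idea from Bailis et al. (the data structures are invented for illustration): each write carries a timestamp plus its sibling keys, letting a reader detect a fractured first-round read and repair it with a second round.

```python
versions = {}   # (key, ts) -> value: all installed versions
latest = {}     # key -> (ts, sibling keys): committed pointers

def write_txn(ts, kv):
    for k, v in kv.items():            # prepare: install the versions
        versions[(k, ts)] = v
    for k in kv:                       # commit: flip the latest pointers
        latest[k] = (ts, set(kv) - {k})

def read_txn(keys):
    first = {k: latest.get(k, (0, set())) for k in keys}
    result = {k: versions.get((k, ts)) for k, (ts, _) in first.items()}
    # Second round: if a sibling's transaction is newer than what we
    # read for some key, fetch the matching version by timestamp.
    for k, (ts, siblings) in first.items():
        for s in siblings & set(keys):
            if first[s][0] < ts:
                result[s] = versions[(s, ts)]
    return result

write_txn(1, {"x": "a", "y": "a"})
# Simulate a racing transaction caught mid-commit: versions installed,
# but only x's latest pointer flipped so far.
versions[("x", 2)], versions[("y", 2)] = "b", "b"
latest["x"] = (2, {"y"})
print(read_txn({"x", "y"}))  # {'x': 'b', 'y': 'b'} - repaired, not fractured
```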
Eventual Consistency
in the WAN
Low-latency
everywhere
Write Anywhere
Beat the speed of light
MDC Replication
Hybrid Topologies
Bidirectional Replication
Unidirectional Replication
Replication Hooks
Tied Writes
Hook on Replication
Replicate Hook Return Data
Build for WAN locality
Eventual Consistency
In Summary
Invariant            Operation    AP / CP
Specify unique ID    Any          CP
Generate unique ID   Any          AP
>                    INCREMENT    AP
>                    DECREMENT    CP
<                    INCREMENT    CP
<                    DECREMENT    AP
Secondary Index      Any          AP
Materialized View    Any          AP
AUTO_INCREMENT       INSERT       CP
Linearizability      CAS          CP
Operations Requiring
Weak Consistency
vs.
Strong Consistency
BASE not ACID
• Basically Available: There will be a response
per request (failure or success)
• Soft State: Any two reads against the system
may yield different data (when measured
against time)
• Eventually Consistent: The system will
eventually become consistent once all
failures have healed and time goes to infinity
Deploying Riak
AWS Deployment
• 6 x i2.4xlarge
• 732 GB of RAM
• 19 TB of storage
• 960,000 IOPS
• 96 vCPUs
• 3x replication
• 10 Gigabit networking
• 99.9999999997% availability
$74,790/yr
Real World Use Case
Ad Network
• Sell targeted ads with minimum latency
• Two datasets:
• Ads
• Users
Deployment
Overselling Ads is
Okay
Choose Random Ad
Based on Weight of
Outstanding Impressions
Batch System
Generates targeted ads in an offline process
Ad Graph
Ad Store
Initial Visit
Fetch All Ads
Choose Ad
Based upon weighted random
Decrement Value
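One way to implement this selection step (the function and names are hypothetical): weight each ad by its outstanding impression count, pick one, and decrement. Since overselling is acceptable, it is fine if the counts are slightly stale.

```python
import random

def choose_ad(ads: dict) -> str:
    """ads maps ad_id -> outstanding impressions; pick proportionally."""
    candidates = {a: n for a, n in ads.items() if n > 0}
    ids = list(candidates)
    weights = [candidates[a] for a in ids]
    return random.choices(ids, weights=weights, k=1)[0]

ads = {"ad-1": 1000, "ad-2": 1200, "ad-3": 1050}
pick = choose_ad(ads)
ads[pick] -= 1   # in Riak this could be a PN-counter decrement
```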
Test Model
• 50 actors
• 5 ads with inventory between 1,000 and 1,200
• Actors randomly get 1 to 3 choices per round
• Rounds continue until entire inventory is exhausted
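A sketch of this model (the shared stale snapshot per round is an assumption standing in for eventually consistent reads): because all 50 actors choose from the same stale counts, their decrements race and counters can overshoot below zero, which is the dip visible in the chart below.

```python
import random

random.seed(1)
inventory = {f"ad-{i}": random.randint(1000, 1200) for i in range(1, 6)}

rounds = 0
while any(n > 0 for n in inventory.values()):
    rounds += 1
    snapshot = dict(inventory)            # every actor reads a stale snapshot
    for _ in range(50):                   # 50 actors
        for _ in range(random.randint(1, 3)):   # 1 to 3 choices per round
            live = {a: n for a, n in snapshot.items() if n > 0}
            if not live:
                break
            ad = random.choices(list(live), weights=list(live.values()), k=1)[0]
            inventory[ad] -= 1            # racing decrements can go negative

print(rounds, inventory)
```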
Test Model
[Chart: Outstanding Impressions (-300 to 1,200) vs. Round Number (1 to 76) for Ad 1 through Ad 5; inventories drain toward zero and briefly dip negative due to overselling]
Garbage Collection
Uses secondary indexes in a batch process to delete
exhausted ads from user records
Ad Serving
• Requires batch generation of targets
• Requires external GC
• Allows for multi-datacenter operation
In Summary
Riak
Distributed
Fault-Tolerant
Scalable
Scalability
Processors
Toolchest
Why Distributed Databases?
Sargun Dhillon
@Sargun
