1. THE INNER WORKINGS OF
AMAZON DYNAMO
Jonathan Lau
Nov 2013
Smokehouse Software | Jonathan Lau | jon@smokehousesoftware.com
2. MOTIVATION AND BIO
• Early-stage companies
• Build bigger systems
• Specialize in backend systems
3. DISTRIBUTE / CENTRALIZE
Distributed vs. Centralized:
• Data - Distributed: different data on each node; Centralized: one master copy
• Replicas - Distributed: replicate a smaller data set on each node; Centralized: replicate the master copy into read slaves
• Scaling - Distributed: data are sharded across the nodes by default; Centralized: extra work to shard
4. WHAT ABOUT NOSQL?
High performance solution != scaling
5. DYNAMO DESIGN
CONSIDERATION
• Distributed key-value store
• Incremental scalability - scale one node at a time
• Decentralized design - gossip-based protocol for membership and failure detection
• Symmetry - all the nodes have the same functionality
• Heterogeneity - the system will be deployed in an environment with huge variance in hardware and system performance
6. HIGH LEVEL CONCEPT
Distribute the data across N nodes in a ring
[Figure: nodes A through H arranged in a ring, serving get() and put(); a request for key "K", which is in [C, D), is routed to the node responsible for that range]
7. DYNAMO’S CHALLENGES
• Data partitioning
• N-1 replicas
• High availability for writes
• Handling temporary failures
• Recovering from permanent failures
• Membership and failure detection
8. PARTITIONING
• 128-bit MD5 hash
• Consistent hashing for key partitioning
• Virtual nodes help improve the load distribution
• A request can hit any of the nodes on the key's preference list (the coordinator)
[Figure: ring of nodes A, B, C, D; a request for key K in [B, C)]
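The partitioning scheme above can be sketched in a few lines. This is an illustrative toy, not Dynamo's actual code: each physical node gets several virtual nodes ("tokens") on a 128-bit MD5 ring, and a key is served by the first token clockwise from the key's hash.

```python
import bisect
import hashlib

def ring_position(s: str) -> int:
    # 128-bit MD5 hash, interpreted as a position on the ring
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes_per_node=8):
        # Sorted list of (ring position, physical node); each physical
        # node owns several virtual-node positions for smoother load.
        self.ring = []
        for node in nodes:
            for v in range(vnodes_per_node):
                self.ring.append((ring_position(f"{node}#{v}"), node))
        self.ring.sort()

    def node_for(self, key: str) -> str:
        # First token clockwise from the key's hash (wrapping around)
        pos = ring_position(key)
        i = bisect.bisect_right(self.ring, (pos,)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["A", "B", "C", "D"])
print(ring.node_for("K"))  # one of A/B/C/D, stable for a given key
```

Adding a node only inserts its tokens into the sorted list, so only the keys in the affected ranges move: this is the incremental scalability the design-considerations slide mentions.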
9. REPLICATION
• Replicas are stored on the N-1 successor nodes
• The nodes holding the replicas, together with the coordinator node, form the preference list
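A minimal sketch of building the preference list (names and helpers are assumptions for illustration, not Dynamo's API): the coordinator is the first node clockwise from the key's hash, and the replicas live on the next N-1 successor nodes.

```python
import bisect
import hashlib

def ring_pos(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def preference_list(key, nodes, n_replicas=3):
    # Order the physical nodes by their position on the hash ring
    ring = sorted(nodes, key=ring_pos)
    positions = [ring_pos(node) for node in ring]
    # Coordinator: first node clockwise from the key's hash
    start = bisect.bisect_right(positions, ring_pos(key)) % len(ring)
    # Coordinator plus its N-1 successors form the preference list
    return [ring[(start + i) % len(ring)] for i in range(n_replicas)]

print(preference_list("K", ["A", "B", "C", "D"]))  # coordinator + 2 replicas
```

With virtual nodes, a real implementation also has to skip tokens that map back to a physical node already in the list, so the replicas land on N-1 distinct machines.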
10. AVAILABLE FOR WRITES
• Accepts all writes, versioning each modification
• Tracks each modification and its base version with a vector clock
• Accepts every write along with its vector clock
• Conflicts are resolved by examining the vector clocks on the objects and reconciling during the read operation
• Consistency issues arise because of network or node failures
• The oldest vector clock entries are purged when the clock grows too large
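A minimal vector-clock sketch (illustrative, not Dynamo's code): each write records (node, counter) pairs, and clock `a` supersedes clock `b` when every counter in `a` is at least `b`'s. That ordering is what lets a read reconcile versions or surface a genuine conflict to the application.

```python
def increment(clock, node):
    # Return a new clock with `node`'s counter bumped by one
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    # True if the version with clock `a` supersedes the one with clock `b`
    return all(a.get(node, 0) >= count for node, count in b.items())

v1 = increment({}, "Sx")   # {"Sx": 1}
v2 = increment(v1, "Sx")   # {"Sx": 2} - a later write through node Sx
v3 = increment(v1, "Sy")   # {"Sx": 1, "Sy": 1} - a write through node Sy

print(descends(v2, v1))                    # True: v2 supersedes v1, drop v1
print(descends(v2, v3), descends(v3, v2))  # False False: concurrent versions
```

When neither clock descends from the other (v2 vs. v3 above), the versions are concurrent, and Dynamo returns both to the client to reconcile at read time, exactly as the bullet list describes.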
11. HANDLING TEMPORARY
FAILURES
• Trade-off between durability and availability
• Sloppy quorum - a write / read is only considered successful if the first N healthy nodes from the preference list return
• Hinted handoff - a write will be picked up by another replica when the designated coordinator node is down. The write picked up by that replica carries a hint about the intended recipient, so the state can be reconstructed later.
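A toy hinted-handoff sketch (the class and function names are assumptions, not Dynamo's code): when the intended replica is down, the write lands on another healthy node together with a hint naming the intended recipient; once that node recovers, the hints are replayed to it and deleted.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.store = {}   # key -> value
        self.hints = []   # (intended node name, key, value)

def write(key, value, intended, fallback):
    if intended.up:
        intended.store[key] = value
    else:
        # Sloppy quorum: the fallback accepts the write, keeping a hint
        # so the intended recipient's state can be reconstructed later.
        fallback.hints.append((intended.name, key, value))

def handoff(fallback, recovered):
    # Replay hints destined for the recovered node, keep the rest
    remaining = []
    for intended_name, key, value in fallback.hints:
        if intended_name == recovered.name and recovered.up:
            recovered.store[key] = value
        else:
            remaining.append((intended_name, key, value))
    fallback.hints = remaining

a, d = Node("A"), Node("D")
a.up = False
write("K", "v1", intended=a, fallback=d)  # D holds the write plus a hint
a.up = True
handoff(d, a)                             # hint replayed; A now has K
print(a.store)  # {'K': 'v1'}
```

This is the durability/availability trade-off in miniature: the write succeeds even while A is down, at the cost of the value living temporarily on a node outside the preference list.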
12. REPLICA SYNCHRONIZATION
• Dynamo uses Merkle trees to track hashes for the keys
• Passing only the root hash is enough to validate the synchronization state between replicas
• If a replica is deemed out of sync, the node can traverse down the tree to find the exact mismatched portion
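A minimal Merkle-tree sketch (illustrative only): leaves hash individual values, parents hash their children, and two replicas first compare root hashes, descending only when the roots differ. For brevity this version compares the leaf row directly once the roots mismatch; a real implementation walks down from the mismatched internal nodes.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.md5(data).digest()

def merkle_levels(leaf_values):
    """Build the tree bottom-up: levels[0] = leaf hashes, levels[-1] = [root]."""
    level = [h(v.encode()) for v in leaf_values]
    levels = [level]
    while len(level) > 1:
        # Pair up children; an odd trailing hash is carried up alone
        level = [h(level[i] + level[i + 1] if i + 1 < len(level) else level[i])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff_leaves(t1, t2):
    """Indices of leaf hashes that differ between two same-shape trees."""
    return [i for i, (a, b) in enumerate(zip(t1[0], t2[0])) if a != b]

replica1 = merkle_levels(["v_a", "v_b", "v_c", "v_d"])
replica2 = merkle_levels(["v_a", "v_X", "v_c", "v_d"])
print(replica1[-1] == replica2[-1])     # False: roots differ, out of sync
print(diff_leaves(replica1, replica2))  # [1] - only the second key needs repair
```

The payoff is the one on the slide: replicas that agree exchange a single root hash, and replicas that disagree transfer only the mismatched key ranges rather than the whole data set.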
13. NODE MEMBERSHIP
• Partition and placement information is propagated via a gossip protocol
• Each node is aware of the token ranges of its peers
• Seed nodes in the cluster speed up learning the membership and the key-range ownership of the ring
• Nodes are not really aware of each other until an actual delete happens
14. GET() AND PUT()
What happens during a read or write request?
15. GET() AND PUT()
• get() and put() are routed through either a generic load balancer or a partition-aware client library
• The top N nodes in the preference list for key K are the coordinators
• Requests basically go down the list, skipping over bad nodes
• Two configuration parameters, R and W, where R + W > N
16. MORE ON GET() AND PUT()
When a write happens:
• the coordinator generates a vector clock value
• it sends the new value, along with the vector clock, to the N highest-ranked reachable nodes
• if at least W-1 nodes respond, the write is considered successful
When a read happens:
• the coordinator sends a read request to the N highest-ranked reachable nodes
• it waits for R nodes to return, then returns the result to the client
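The quorum mechanics above can be simulated in a few lines (a toy sketch with assumed names, not Dynamo's implementation): a write succeeds once W of the N preference-list nodes acknowledge it, a read returns after R responses, and R + W > N guarantees the read and write sets overlap in at least one node.

```python
N, R, W = 3, 2, 2
assert R + W > N  # the overlap condition from the slide

replicas = [{}, {}, {}]        # stores of the N preference-list nodes
healthy = [True, True, False]  # node 2 is down

def put(key, value, version):
    # Send to all reachable nodes; succeed once W of them acknowledge
    acks = 0
    for i in range(N):
        if healthy[i]:
            replicas[i][key] = (value, version)
            acks += 1
    return acks >= W

def get(key):
    # Collect responses from reachable nodes; need at least R of them
    responses = [replicas[i][key] for i in range(N)
                 if healthy[i] and key in replicas[i]]
    if len(responses) < R:
        return None
    # Newest version wins (real Dynamo reconciles via vector clocks)
    return max(responses, key=lambda vv: vv[1])

print(put("K", "v1", version=1))  # True: 2 acks satisfy W
print(get("K"))                   # ('v1', 1)
```

With node 2 down, both the write (2 acks) and the read (2 responses) still succeed, which is the availability story the earlier slides build toward; shrinking R or W trades consistency for even lower latency.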
17. WHAT DOES IT ALL MEAN
How does all of this tie together?
18. WHAT DOES IT MEAN?
• Dynamo shards the data from day 1
• Replicas and redundancy are baked in from day 1
• The configuration parameters W and R have a huge effect on the trade-off between availability and durability, subject to W + R > N
• Consistency resolution at read time allows a more controlled conflict resolution strategy
19. HAPPY SCALING
Read the Dynamo design paper @ http://bit.ly/QeM8AC