Open source, high performance database




Data Distribution Theory

Will LaForest
Senior Director of 10gen Federal
will@10gen.com
@WLaForest




• I will be talking about distributing data across
  machines connected by a reliable network
• I will NOT be talking about on disk arrangements
  (well maybe a little)
• I will NOT be talking about replication
   – This has some overlap but in most respects can be
     considered orthogonal


• There is a ton of implementation minutiae from
  technology to technology that I will try to avoid

• Need to scale to handle more data and load
• Cost effective to scale horizontally by distributing
• Fundamentally limited by some resource
   –   Memory
   –   IO Capacity
   –   Disk
   –   CPU
• Lots of systems need to distribute
   –   Web servers/app servers
   –   File systems
   –   Databases
   –   Caches
• It's always been this way
   – From time to time people forget
        • Stateful MTS Objects
        • Stateful EJB

• Concurrent access to data is not as simple
   – We will set aside fencing/locking for another day
• RDBMS not built for distributed computing
   –   Not surprising since the theory is from 40 years ago
   –   Model works because joins are fast
   –   BUT generically efficient distributed joins are difficult
   –   Ditto for distributed transactions

• Given a data record what node do I store it on?
• Round robin / "random"
   –   Evenly distribute across a set of servers
   –   Doesn't take into account rebalancing
   –   Expiring a lot of data? Not too bad (MVCC, expiring cache)
   –   MarkLogic
• Hash based (not talking about the pipe)
   – Many search engines & caches
   – Amazon Dynamo (Cassandra*)
• Range based
   – BigTable (HBase)
   – MongoDB
• Distribute on the hash of some attribute
• Simple way is hash(att) mod N
   – What happens when N changes (we add a node)? See the
     sketch below
• The industry standard is consistent hashing
• Pros
   – Evenly distributes across nodes
   – Avoid hot spots
   – Great for high write throughput
• Cons
   – No data locality
   – Scattered reads on each node
   – Scatter gather on all queries
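A minimal Python sketch of the hash(att) mod N problem: when the node count grows from 3 to 4, most keys map to a different node and would have to be reshuffled. The key names and node counts here are illustrative.

    import hashlib

    def node_for(key, n_nodes):
        # Naive placement: hash the attribute and take it mod the node count.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return h % n_nodes

    keys = [f"user{i}" for i in range(10_000)]

    # How many keys land on a different node when the cluster grows from 3 to 4?
    moved = sum(node_for(k, 3) != node_for(k, 4) for k in keys)
    print(f"{moved / len(keys):.0%} of keys move")
    # Roughly 75% of keys are remapped with mod N; a consistent hash ring
    # would remap only about 1/N of them (~25% in this case).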
• [Figure: example hash ring]
   – Circles represent nodes, placed at hash(hostname)
   – Letters represent data points, placed at hash(attribute)
• What's wrong with this specific ring?
• Use a hash algorithm with an even distribution (like MD5 or SHA-1)
• Use multiple points or replicas on the hash ring
   – Instead of just hash("Host1"), use hash("Host1-1") .. hash("Host1-r")
     (sketched in code below)
• Running simulations you get a plot like the one on this slide (see the
  Tom White reference)
   – Based on 10 nodes and 10k points
• [Figure: log-log plot of the standard deviation of points per node
  versus the number of replicas per node]
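A minimal Python sketch of a consistent hash ring using the multiple-points-per-host idea, plus a rough re-run of the 10-node / 10k-point simulation. The HashRing class, host names, and replica counts are illustrative, not any particular product's API.

    import bisect
    import hashlib
    import statistics
    from collections import Counter

    def md5_int(s):
        # MD5 spreads points fairly evenly around the ring (any well-mixed hash works).
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, hosts, replicas=100):
            # Place each host at `replicas` points: hash("Host1-1") .. hash("Host1-r").
            self.points = sorted(
                (md5_int(f"{host}-{i}"), host)
                for host in hosts
                for i in range(replicas)
            )
            self.keys = [p for p, _ in self.points]

        def node_for(self, key):
            # Walk clockwise to the first host point past hash(key), wrapping around.
            idx = bisect.bisect(self.keys, md5_int(key)) % len(self.points)
            return self.points[idx][1]

    # Rough version of the simulation behind the plot: 10 nodes, 10k data points.
    hosts = [f"Host{i}" for i in range(10)]
    for replicas in (1, 5, 20, 100, 500):
        ring = HashRing(hosts, replicas)
        counts = Counter(ring.node_for(f"key{i}") for i in range(10_000))
        print(replicas, "replicas -> stdev", round(statistics.stdev(counts.values()), 1))

On a typical run the spread of points per node tightens sharply as the replica count grows, which is the shape of the plot the slide describes.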
• We have R replicas (points) on the ring for each node
• The hash ring could also be used to determine data replicas by
  using the same strategy with the data
• [Figure: hash ring with each of three nodes placed at R points,
  labeled 1-1 .. 1-R, 2-1 .. 2-R, 3-1 .. 3-R]
• Also known as sharding
• Distribute based upon an attribute (the key)
   – Or multiple keys (compound)
• Pros
   – Better for reads
   – Data locality so…
   – Querying/reads with shard attribute terms avoid scatter
   – Data can be arranged in contiguous blocks
– Unlike hash based, range queries on the key avoid scatter


• Cons
– Requires more consideration a priori
  – Pick the right shard key
  – Can develop hot spots
  – Leads to more data balancing activities
• Chunking can be done on many levels
  – BigTable breaks into tablets
  – MongoDB uses “chunks”




• Pick a key (or keys) to partition on
• Map the key space to the nodes
• Range-to-node mappings are adjusted to keep data as evenly
  distributed as possible
• In this example we are partitioning by Last Name (see the code
  sketch below)
• What happens if we partition by hash(attribute)?
• [Figure: the key space from -∞ to ∞ split at Abrams, Isaac,
  LaForest, Meyer, and Scheich, with each range assigned to a node]
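A minimal Python sketch of range partitioning on Last Name; the split points come from the figure above and the node names are made up.

    import bisect

    # Split points taken from the slide's Last Name figure; each one starts
    # a new range, and the last range runs on to +infinity.
    split_points = ["Abrams", "Isaac", "LaForest", "Meyer", "Scheich"]
    nodes = ["node0", "node1", "node2", "node3", "node4", "node5"]  # one per range

    def node_for(last_name):
        # Find which range last_name falls in and return the node that owns it.
        return nodes[bisect.bisect_right(split_points, last_name)]

    print(node_for("Aaron"))      # sorts before "Abrams"           -> node0
    print(node_for("Laxness"))    # between "LaForest" and "Meyer"  -> node3
    print(node_for("Zimmerman"))  # after "Scheich"                 -> node5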
• Use a key with a high cardinality
   – Sufficient granularity to your “chunks”
• What are your write vs. read requirements?
• Read and query king?
   1. Shard key should be something most of your queries use
   2. Also something that distributes reads evenly (avoiding
      read hotspots)
3. Read scaling can sometimes be accommodated by
   replication
• Write throughput biggest concern?
   1. You may want to consider partitioning on a hash
   2. Avoid hot spots
   3. What happens if you shard on a systematically increasing key?
      (see the sketch below)

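A minimal Python sketch of the systematically increasing key question: under range partitioning every new write lands in the last range (a hot spot), while hashing the same key spreads writes across nodes. The node count and key values are illustrative.

    import hashlib
    from collections import Counter

    NODES = 4

    def range_node(key_value, max_value):
        # Toy range partitioning: split the key space into NODES equal ranges.
        return min(int(key_value / max_value * NODES), NODES - 1)

    def hash_node(key_value):
        # Hash partitioning: spread keys by hash, ignoring their ordering.
        return int(hashlib.md5(str(key_value).encode()).hexdigest(), 16) % NODES

    # 10k inserts with a monotonically increasing key (think timestamp or counter).
    new_keys = range(100_000, 110_000)

    print("range:", dict(Counter(range_node(k, 110_000) for k in new_keys)))
    print("hash: ", dict(Counter(hash_node(k) for k in new_keys)))
    # Range partitioning sends every new write to the last range (a hot spot);
    # hash partitioning spreads the same writes across all four nodes.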
• Consistent Hashing and Random Trees (Karger et al., STOC 1997)
   – One of the original papers on consistent hashing
• Tom White: Consistent Hashing
   – Great blog post on consistent hashing




