Culprit: B+Tree index
• Good at sequential insert
• e.g. ObjectId, Sequence #, Timestamp
• Poor at random insert
• Indexes on randomly-distributed data
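As an illustration, here is a minimal sketch (assuming the bson package that ships with pymongo; uuid and time are from the standard library) contrasting the two kinds of keys:

    # Sketch: sequential-style vs. random-style index keys.
    import time, uuid
    from bson import ObjectId

    # Sequential keys: each new value sorts after the previous one,
    # so inserts always land at the right edge of the B+Tree.
    seq_keys = [ObjectId() for _ in range(3)]   # ObjectId begins with a timestamp
    ts_keys = [time.time() for _ in range(3)]   # timestamps also increase

    # Random keys: each insert can land anywhere in the key space,
    # touching a different leaf page every time.
    rand_keys = [uuid.uuid4().hex for _ in range(3)]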
Sequential vs. Random insert
[Diagram: two B+Trees, one receiving sequential inserts (1, 2, 3, ...) and one receiving random inserts (55, 75, 78, 1, 99, ...); the shaded working set is a single small region in the first tree and spans the whole tree in the second]
• Sequential insert ➔ Small working set ➔ Fits in RAM ➔ Sequential I/O (bandwidth-bound)
• Random insert ➔ Large working set ➔ Cannot fit in RAM ➔ Random I/O (IOPS-bound)
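To make the working-set difference concrete, here is a rough, self-contained simulation (not MongoDB code; the page size and window values are invented for illustration) that counts how many distinct leaf "pages" the most recent inserts touch:

    import random

    PAGE_SIZE = 128      # keys per leaf page: illustrative, not a real MongoDB value
    WINDOW = 10_000      # how many recent inserts must stay cached

    def hot_pages(keys):
        """Distinct leaf pages touched by the most recent WINDOW inserts,
        a rough proxy for the index working set that must stay in RAM."""
        recent = keys[-WINDOW:]
        return len({k // PAGE_SIZE for k in recent})

    sequential = list(range(100_000))                               # 1, 2, 3, ...
    scattered = [random.randrange(2**32) for _ in range(100_000)]   # random keys

    print(hot_pages(sequential))   # ~WINDOW / PAGE_SIZE pages, all at the right edge
    print(hot_pages(scattered))    # ~WINDOW pages: nearly one page per insert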
Sequential + hash key
• Coarse-grained sequential prefix
• e.g. Year-month + hash value
• 201210_24c3a5b9
[Diagram: B+Tree divided into month-prefixed regions 201208_*, 201209_*, 201210_*; new writes land only in the current month's region]
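A sketch of how such a key might be built (field names are hypothetical; any stable hash works, MD5 is only an example):

    import hashlib
    from datetime import datetime, timezone

    def coarse_sequential_key(doc_id: str, when: datetime) -> str:
        """Year-month prefix keeps writes clustered in one B+Tree region;
        the hash suffix spreads documents within that region."""
        prefix = when.strftime("%Y%m")                            # e.g. '201210'
        suffix = hashlib.md5(doc_id.encode()).hexdigest()[:8]     # e.g. '24c3a5b9'
        return f"{prefix}_{suffix}"

    print(coarse_sequential_key("user:42", datetime(2012, 10, 24, tzinfo=timezone.utc)))
    # -> '201210_<8 hex chars>'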
But what if...
[Diagram: the same B+Tree, but the working set now spans all of the month-prefixed regions (201208_*, 201209_*, 201210_*) and grows large again]
Sequential + hash key
• Can you predict data growth rate?
• Balancer not clever enough
• Only considers # of chunks
• Migration is slow during heavy writes
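For reference, a hedged pymongo sketch of sharding on such a key and pre-splitting the upcoming month's range by hand instead of waiting on the balancer (the mongos address, database, collection, and split points are all hypothetical):

    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos.example:27017")   # hypothetical mongos

    # Shard on the coarse sequential + hash key described above.
    client.admin.command("enableSharding", "mydb")
    client.admin.command("shardCollection", "mydb.events", key={"shard_key": 1})

    # Pre-split the next month's range up front, since the balancer only
    # counts chunks and migrates slowly while heavy writes are in flight.
    for boundary in ("201211_40000000", "201211_80000000", "201211_c0000000"):
        client.admin.command("split", "mydb.events", middle={"shard_key": boundary})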
Low-cardinality hash key
• Small portion of hash value
• e.g. A~Z, 00~FF
• Alleviates the B+Tree problem
• Local sequential access on a fixed # of parts
• # of parts per shard = cardinality / # of shards
[Diagram: one shard covering key range A ~ D; its B+Tree holds keys A A A B B B C C C, so inserts concentrate in a few localized regions]
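A sketch of such a key (again hypothetical: two hex characters of an MD5 digest give the 00~FF range mentioned on the next slide):

    import hashlib

    NUM_SHARDS = 8   # assumed cluster size

    def low_cardinality_key(doc_id: str) -> str:
        """Keep only two hex characters of the hash: 256 possible values,
        so each shard serves roughly 256 / NUM_SHARDS prefixes and inserts
        stay local to a fixed number of B+Tree regions."""
        return hashlib.md5(doc_id.encode()).hexdigest()[:2]

    print(low_cardinality_key("user:42"))   # e.g. 'a7'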
Low-cardinality hash key
• Limits the # of possible chunks
• e.g. 00 ~ FF ➔ 256 chunks
• Chunks grow past the 64 MB limit
• Balancing becomes difficult
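Rough arithmetic behind the problem (the 1 TB figure is only an assumed example):

    # With a two-hex-character key there can never be more than 16**2 = 256 chunks,
    # so chunk size grows with the collection instead of staying near 64 MB.
    total_data_mb = 1024 * 1024      # assume a 1 TB collection
    max_chunks = 16 ** 2             # '00' .. 'ff'
    print(total_data_mb / max_chunks, "MB per chunk")   # 4096.0 MB, far above 64 MB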
Lessons learned
• Know the performance impact of secondary indexes
• Choose the right shard key
• Test with large data sets
• Linear scalability is hard
• If you really need it, consider HBase or Cassandra
• Consider SSDs