Recently a new class of database technologies has emerged offering massively scalable distributed hash table functionality. Relative to more traditional relational database systems, these systems are simple to operate and capable of managing massive data sets. These characteristics come at a cost though: an impoverished query language that, in practice, can handle little more than exact-match lookups at scale.
This talk will explore the real world technical challenges we faced at SimpleGeo while building a web-scale spatial database on top of Apache Cassandra. Cassandra is a distributed database that falls into the broad category of second-generation systems described above. We chose Cassandra after carefully considering desirable database characteristics based on our prior experiences building large scale web applications. Cassandra offers operational simplicity, decentralized operations, no single points of failure, online load balancing and re-balancing, and linear horizontal scalability.
Unfortunately, Cassandra fell far short of providing the sort of sophisticated spatial queries we needed. We developed a short-term solution that was good enough for most use cases, but far from optimal. Long term, our challenge was to bridge the gap without compromising any of the desirable qualities that led us to choose Cassandra in the first place.
The result is a robust general purpose mechanism for overlaying sophisticated data structures on top of distributed hash tables. By overlaying a spatial tree, for example, we’re able to durably persist massive amounts of spatial data and service complex nearest-neighbor and multidimensional range queries across billions of rows fast enough for an online consumer-facing application. We continue to improve and evolve the system, but we’re eager to share what we’ve learned so far.
Working with Dimensional Data in Distributed Hash Tables
1. DIMENSIONAL DATA IN
DISTRIBUTED HASH TABLES
Mike Malone
STRANGE LOOP 2010
2. SIMPLEGEO
We originally began as a mobile
gaming startup, but quickly
discovered that the location services
and infrastructure needed to support
our ideas didn’t exist. So we took
matters into our own hands and
began building it ourselves.
Matt Galligan, CEO & co-founder
Joe Stump, CTO & co-founder
3. ABOUT ME
MIKE MALONE
INFRASTRUCTURE ENGINEER
mike@simplegeo.com
@mjmalone
For the last 10 months I’ve been developing a spatial
database at the core of SimpleGeo’s infrastructure.
5. CONFLICT
Integrity vs Availability
Locality vs Distributedness
Reductionism vs Emergence
6. DATABASE CANDIDATES
THE TRANSACTIONAL RELATIONAL DATABASES
They’re theoretically pure, well understood, and mostly
standardized behind a relatively clean abstraction
They provide robust contracts that make it easy to reason
about the structure and nature of the data they contain
They’re battle hardened, robust, durable, etc.
OTHER STRUCTURED STORAGE OPTIONS
“Plain I see you, western youths, see you tramping with the
foremost, Pioneers! O pioneers!” (Walt Whitman)
7. INTEGRITY
- vs -
AVAILABILITY
8. ACID
These terms are not formally defined - they’re a
framework, not mathematical axioms
ATOMICITY
Either all of a transaction’s actions are visible to another transaction, or none are
CONSISTENCY
Application-specific constraints must be met for transaction to succeed
ISOLATION
Two concurrent transactions will not see one another’s changes while “in flight”
DURABILITY
The updates made to the database in a committed transaction will be visible to
future transactions
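To make the contract concrete, here’s a minimal sketch using sqlite3 from the Python standard library (the bank-transfer schema is invented for illustration): either both updates commit together, or neither is ever visible.

import sqlite3

conn = sqlite3.connect("bank.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (name TEXT PRIMARY KEY, balance INT)")
conn.execute("INSERT OR REPLACE INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # Atomicity: these two updates form one logical transaction.
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
    conn.commit()    # Durability: once commit returns, the transfer survives a crash
except sqlite3.Error:
    conn.rollback()  # Atomicity again: on failure, neither update is visible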
9. ACID HELPS
ACID is a sort-of-formal contract that makes it
easy to reason about your data, and that’s good
IT DOES SOMETHING HARD FOR YOU
With ACID, you’re guaranteed to maintain a persistent global
state as long as you’ve defined proper constraints and your
logical transactions result in a valid system state
10. CAP THEOREM
At PODC 2000 Eric Brewer told us there were three
desirable DB characteristics. But we can only have two.
CONSISTENCY
Every node in the system contains the same data (e.g., replicas are
never out of date)
AVAILABILITY
Every request to a non-failing node in the system returns a response
PARTITION TOLERANCE
System properties (consistency and/or availability) hold even when
the system is partitioned and messages are lost
11. CAP THEOREM IN 30 SECONDS
[Diagram: CLIENT, SERVER, REPLICA]
12. CAP THEOREM IN 30 SECONDS
The client sends a write to the server.
13. CAP THEOREM IN 30 SECONDS
The server replicates the write to the replica.
14. CAP THEOREM IN 30 SECONDS
The replica acknowledges the write.
15. CAP THEOREM IN 30 SECONDS
The server accepts the write and acknowledges it to the client.
16. CAP THEOREM IN 30 SECONDS
If the replica fails and the server waits for its ack, the client gets no
response: UNAVAILABLE!
17. CAP THEOREM IN 30 SECONDS
If the replica fails and the server accepts the write anyway, the replica
is left stale: INCONSISTENT!
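A toy model of the choice in the diagrams above (the names are invented; this is not any real replication protocol): when the replica is unreachable, the server must pick which property to give up.

class Replica:
    def __init__(self, up=True):
        self.up, self.data = up, {}
    def replicate(self, key, value):
        if not self.up:
            raise ConnectionError("partitioned!")
        self.data[key] = value

def handle_write(server, replica, key, value, favor_availability):
    server[key] = value
    try:
        replica.replicate(key, value)  # fails during a partition
        return "ack"                   # happy path: both copies agree
    except ConnectionError:
        if favor_availability:
            return "ack"               # client proceeds, replica is stale: inconsistent
        return "error"                 # write refused until the replica returns: unavailable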
18. ACID HURTS
Certain aspects of ACID encourage (require?)
implementors to do “bad things”
Unfortunately, ANSI SQL’s definition of isolation...
relies in subtle ways on an assumption that a locking scheme is
used for concurrency control, as opposed to an optimistic or
multi-version concurrency scheme. This implies that the
proposed semantics are ill-defined.
Joseph M. Hellerstein and Michael Stonebraker
Anatomy of a Database System
19. BALANCE
IT’S A QUESTION OF VALUES
For traditional databases CAP consistency is the holy grail: it’s
maximized at the expense of availability and partition
tolerance
At scale, failures happen: when you’re doing something a
million times a second a one-in-a-million failure happens every
second
We’re witnessing the birth of a new religion...
• CAP consistency is a luxury that must be sacrificed at scale in order to
maintain availability when faced with failures
20. APACHE CASSANDRA
A DISTRIBUTED HASH TABLE WITH SOME TRICKS
Peer-to-peer gossip supports fault tolerance and decentralization
with a simple operational model
Random hash-based partitioning provides auto-scaling and
efficient online rebalancing - read and write throughput increases
linearly when new nodes are added
Pluggable replication strategy for multi-datacenter replication
making the system resilient to failure, even at the datacenter level
Tunable consistency allows us to match durability to the value of the
data being written
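A toy model of the idea (not the Cassandra API): with N replicas, waiting for W write acknowledgements and reading from R replicas guarantees the read set overlaps the write set whenever R + W > N.

import time

class Replica:
    """Last-write-wins storage of (value, timestamp) pairs."""
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = (value, time.time())
        return True
    def get(self, key):
        return self.data.get(key, (None, 0.0))

N, W, R = 3, 2, 2  # R + W > N: read and write quorums must overlap

def write(replicas, key, value):
    acks = sum(1 for r in replicas[:N] if r.put(key, value))
    if acks < W:  # favor consistency: fail the write rather than under-replicate
        raise IOError("quorum not reached")

def read(replicas, key):
    # The newest timestamp among R replicas must come from a replica
    # that acknowledged the last successful write.
    return max((r.get(key) for r in replicas[:R]), key=lambda v: v[1])[0]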
21. CASSANDRA
DATA MODEL
{
  “users”: {                                  ← column family
    “alice”: {                                ← key
      “city”: [“St. Louis”, 1287040737182],   ← columns (name, value, timestamp)
      “name”: [“Alice”, 1287080340940],
    },
    ...
  },
  “locations”: {
    ...
  },
  ...
}
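For the exact-match case, this model maps directly onto client calls. A sketch using the pycassa client of that era (the connection details and keyspace name are illustrative):

import pycassa

pool = pycassa.ConnectionPool('SimpleGeo', ['localhost:9160'])
users = pycassa.ColumnFamily(pool, 'users')

# Columns are (name, value) pairs; pycassa supplies the timestamps.
users.insert('alice', {'city': 'St. Louis', 'name': 'Alice'})
print(users.get('alice'))                     # exact-match lookup by key
print(users.get('alice', columns=['city']))   # or by key and column name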
22. DISTRIBUTED HASH TABLE
DESTROY LOCALITY (BY HASHING) TO ACHIEVE GOOD
BALANCE AND SIMPLIFY LINEAR SCALABILITY
[Diagram: the “users” data model from the previous slide, with keys
hashed onto a ring of tokens running from 0 to fffff; “alice” and
“bob” land at unrelated positions]
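A toy version of random partitioning (invented helpers, not Cassandra’s actual partitioner, though Cassandra’s random partitioner also uses MD5): hashing spreads keys uniformly around the ring, which balances load but throws locality away.

import hashlib
from bisect import bisect

def token(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

ring = sorted(token('node%d' % i) for i in range(4))  # four nodes' positions

def node_for(key):
    # Walk clockwise from the key's token to the next node on the ring.
    return bisect(ring, token(key)) % len(ring)

# 'alice' and 'alicf' differ by one character but hash to unrelated
# tokens: exact match still works, but nearby keys scatter everywhere.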
23. HASH TABLE
SUPPORTED QUERIES
EXACT MATCH ✓
RANGE ✗
PROXIMITY ✗
ANYTHING THAT’S NOT EXACT MATCH ✗
24. LOCALITY
- vs -
DISTRIBUTEDNESS
25. THE ORDER PRESERVING PARTITIONER
CASSANDRA’S PARTITIONING STRATEGY IS PLUGGABLE
Partitioner maps keys to nodes
Random partitioner destroys locality by hashing
Order preserving partitioner retains locality, storing
keys in natural lexicographical order around the ring
[Diagram: ring labeled a through z with keys “alice”, “bob”, and “sam”
stored in lexicographic order]
26. ORDER PRESERVING PARTITIONER
EXACT MATCH ✓
RANGE ✓ (on a single dimension)
PROXIMITY ?
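A toy illustration of why order preservation buys you range queries: with keys kept in sorted order, a one-dimensional range scan is just two binary searches.

import bisect

keys = sorted(['alice', 'bob', 'carol', 'dave', 'sam'])

def range_query(lo, hi):
    # Everything between the two cut points is in the range.
    return keys[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]

print(range_query('b', 'd'))  # ['bob', 'carol']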
27. SPATIAL DATA
IT’S INHERENTLY MULTIDIMENSIONAL
[Diagram: point x = (2, 2) plotted on a plane with axes labeled 1 and 2]
28. DIMENSIONALITY REDUCTION
WITH SPACE-FILLING CURVES
[Diagram: a 2×2 grid whose cells are visited in the order 1, 2, 3, 4]
29. Z-CURVE
SECOND ITERATION
[Diagram: the Z-curve recursed one level deeper, over a 4×4 grid]
30. Z-VALUE
[Diagram: point x falls in cell 14 along the Z-curve]
31. GEOHASH
SIMPLE TO COMPUTE
Interleave the bits of decimal coordinates
(equivalent to binary encoding of pre-order
traversal!)
Base32 encode the result
AWESOME CHARACTERISTICS
Arbitrary precision
Human readable
Sorts lexicographically
[Example: bits 01101 encode to the base32 character “e”]
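A sketch of the encoding (the standard geohash algorithm, not SimpleGeo-specific code): repeatedly halve the longitude and latitude ranges, emit one bit per halving, and base32-encode each group of five bits.

BASE32 = '0123456789bcdefghjkmnpqrstuvwxyz'

def geohash(lat, lon, precision=9):
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, chars = [], []
    even = True  # even-numbered bits refine longitude, odd refine latitude
    while len(chars) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        even = not even
        if len(bits) == 5:
            chars.append(BASE32[int(''.join(map(str, bits)), 2)])
            bits = []
    return ''.join(chars)

print(geohash(42.605, -5.603, 1))  # 'e': the 01101 example above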
34. SPATIAL DATA
STILL MULTIDIMENSIONAL
DIMENSIONALITY REDUCTION ISN’T PERFECT
Clients must
• Pre-process to compose multiple queries
• Post-process to filter and merge results
Degenerate cases can be bad, particularly for nearest-neighbor
queries
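A toy example of that pre- and post-processing (invented helpers; real clients decompose the box into several tighter ranges): a bounding-box query over Z-values of non-negative integer coordinates takes one range scan plus a filter, and a badly shaped box drags in many more candidates than matches.

def interleave(x, y, bits=16):
    # Morton/Z-value: bits of x and y alternate, x in the even positions.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def bbox_query(points, x0, y0, x1, y1):
    # Pre-process: the box's corners bound the Z-values of its contents.
    zmin, zmax = interleave(x0, y0), interleave(x1, y1)
    candidates = [p for p in points if zmin <= interleave(*p) <= zmax]
    # Post-process: the scan picks up false positives outside the box.
    return [(x, y) for (x, y) in candidates if x0 <= x <= x1 and y0 <= y <= y1]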
45. HELLO, DRAWING BOARD
SURVEY OF DISTRIBUTED P2P INDEXING
An overlay-dependent index works directly with nodes of the
peer-to-peer network, defining its own overlay
An over-DHT index overlays a more sophisticated data
structure on top of a peer-to-peer distributed hash table
46. ANOTHER LOOK AT POSTGIS
MIGHT WORK, BUT
The relational transaction management system (which we’d
want to change) and access methods (which we’d have to
change) are tightly coupled (necessarily?) to other parts of
the system
Could work at a higher level and treat PostGIS as a black box
• Now we’re back to implementing a peer-to-peer network with failure
recovery, fault detection, etc... and Cassandra already had all that.
• It’s probably clear by now that I think these problems are more
difficult than actually storing structured data on disk
47. LET’S TAKE A STEP BACK
48–51. EARTH
[Image sequence: four full-slide views of the Earth]
56. SPLITTING
IT’S PRETTY MUCH JUST A CONCURRENT TREE
Splitting shouldn’t lock the tree for reads or writes and failures
shouldn’t cause corruption
• Splits are optimistic, idempotent, and fail-forward
• Instead of locking, writes are replicated to the splitting node and the
relevant child[ren] while a split operation is taking place
• Cleanup occurs after the split is completed and all interested nodes are
aware that the split has occurred
• Cassandra writes are idempotent, so splits are too - if a split fails, it is
simply retried (see the sketch after this slide)
Split size is a tunable knob for balancing locality and distributedness
The other hard problem with concurrent trees is rebalancing - we
just don’t do it! (more on this later)
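A minimal sketch of the fail-forward split, assuming a one-dimensional key space (all names here are invented stand-ins, not SimpleGeo’s actual code):

from collections import defaultdict

SPLIT_SIZE = 4  # the tunable knob: locality vs. distributedness

class Store:
    """Stand-in for Cassandra: idempotent, last-write-wins puts."""
    def __init__(self):
        self.rows = defaultdict(dict)
    def put(self, node_id, key, value):
        self.rows[node_id][key] = value

class Node:
    def __init__(self, node_id, lo, hi):
        self.id, self.lo, self.hi = node_id, lo, hi
        self.children, self.splitting = [], False

def child_for(node, key):
    return next(c for c in node.children if c.lo <= key < c.hi)

def insert(store, node, key, value):
    if node.splitting:
        # No locks: while a split is in flight, the write goes to the
        # splitting node AND the relevant child, so neither misses it.
        store.put(node.id, key, value)
        store.put(child_for(node, key).id, key, value)
    elif not node.children:
        store.put(node.id, key, value)
        if len(store.rows[node.id]) > SPLIT_SIZE:
            split(store, node)
    else:
        insert(store, child_for(node, key), key, value)

def split(store, node):
    # Optimistic, idempotent, fail-forward: if we crash partway through,
    # retrying the whole split is safe because every step is an
    # idempotent write.
    mid = (node.lo + node.hi) / 2
    node.children = [Node(node.id + '0', node.lo, mid),
                     Node(node.id + '1', mid, node.hi)]
    node.splitting = True
    for key, value in list(store.rows[node.id].items()):
        store.put(child_for(node, key).id, key, value)
    node.splitting = False
    store.rows.pop(node.id, None)  # cleanup once the split is visible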
57. THE ROOT IS HOT
MIGHT BE A DEAL BREAKER
For a tree to be useful, it has to be traversed
• Typically, tree traversal starts at the root
• Root is the only discoverable node in our tree
Traversing through the root meant reading the root for every
read or write below it - unacceptable
• Lots of academic solutions - most promising was a skip graph, but
that required O(n log(n)) data - also unacceptable
• Minimum tree depth was proposed, but then you just get multiple hot
spots at your minimum-depth nodes
58. BACK TO THE BOOKS
LOTS OF ACADEMIC WORK ON THIS TOPIC
But academia is obsessed with provable, deterministic,
asymptotically optimal algorithms
And we only need something that is probably fast enough
most of the time (for some value of “probably” and “most of
the time”)
• And if the probably good enough algorithm is, you know... tractable...
one might even consider it qualitatively better!
59. REDUCTIONISM
- vs -
EMERGENCE
60. WE HAVE vs. WE WANT
[Diagram: the flat hash table we have beside the tree we want]
61. THINKING HOLISTICALLY
WE OBSERVED THAT
Once a node in the tree exists, it doesn’t go away
Node state may change, but that state only really matters
locally - thinking a node is a leaf when it really has children is
not fatal
SO... WHAT IF WE JUST CACHED NODES THAT
WERE OBSERVED IN THE SYSTEM!?
62. CACHE IT
STUPID SIMPLE SOLUTION
Keep an LRU cache of nodes that have been traversed
Start traversals at the most selective relevant node
If that node doesn’t satisfy you, traverse up the tree
Along with your result set, return a list of nodes that were
traversed so the caller can add them to its cache (sketched below)
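A sketch of that cache, stupid simple as advertised (the names are invented, not SimpleGeo’s client; OrderedDict plays the LRU):

from collections import OrderedDict

class NodeCache:
    def __init__(self, root, capacity=10000):
        # Maps a node's key prefix to the cached node; root has prefix ''.
        self.lru = OrderedDict({root.prefix: root})
        self.capacity = capacity

    def most_selective(self, key):
        # The longest cached prefix of the key is the deepest node we
        # know about; shorter prefixes walk back up toward the root.
        for i in range(len(key), -1, -1):
            node = self.lru.get(key[:i])
            if node is not None:
                self.lru.move_to_end(key[:i])  # refresh its LRU position
                return node

    def update(self, traversed):
        # Merge in the nodes a query reports back, evicting the oldest.
        for node in traversed:
            self.lru[node.prefix] = node
            self.lru.move_to_end(node.prefix)
        while len(self.lru) > self.capacity:
            self.lru.popitem(last=False)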
63–65. TRAVERSAL
NEAREST NEIGHBOR
[Diagram sequence: a nearest-neighbor search around query point x,
expanding outward through nearby points o until the nearest is found]
66. KEY CHARACTERISTICS
PERFORMANCE
Best case on the happy path (everything cached) has zero
read overhead
Worst case, with nothing cached, O(log(n)) read overhead
RE-BALANCING SEEMS UNNECESSARY!
Makes worst case more worser, but so far so good
67. DISTRIBUTED TREE
SUPPORTED QUERIES
EXACT MATCH ✓
RANGE ✓
PROXIMITY ✓
SOMETHING ELSE I HAVEN’T EVEN HEARD OF ✓
68. DISTRIBUTED TREE
SUPPORTED QUERIES
[The same list as the previous slide, stamped: MULTIPLE DIMENSIONS!]
69. PEACE
Integrity and Availability
Locality and Distributedness
Reductionism and Emergence
70. QUESTIONS?
MIKE MALONE
INFRASTRUCTURE ENGINEER
mike@simplegeo.com
@mjmalone