Working with Dimensional Data in Distributed Hash Tables

Recently a new class of database technologies has emerged offering massively scalable distributed hash table functionality. Relative to more traditional relational database systems, these systems are simple to operate and capable of managing massive data sets. These characteristics come at a cost, though: an impoverished query language that, in practice, can handle little more than exact-match lookups at scale.

This talk will explore the real world technical challenges we faced at SimpleGeo while building a web-scale spatial database on top of Apache Cassandra. Cassandra is a distributed database that falls into the broad category of second-generation systems described above. We chose Cassandra after carefully considering desirable database characteristics based on our prior experiences building large scale web applications. Cassandra offers operational simplicity, decentralized operations, no single points of failure, online load balancing and re-balancing, and linear horizontal scalability.

Unfortunately, Cassandra fell far short of providing the sort of sophisticated spatial queries we needed. We developed a short term solution that was good enough for most use cases, but far from optimal. Long term, our challenge was to bridge the gap without compromising any of the desirable qualities that led us to choose Cassandra in the first place.

The result is a robust general purpose mechanism for overlaying sophisticated data structures on top of distributed hash tables. By overlaying a spatial tree, for example, we’re able to durably persist massive amounts of spatial data and service complex nearest-neighbor and multidimensional range queries across billions of rows fast enough for an online consumer facing application. We continue to improve and evolve the system, but we’re eager to share what we’ve learned so far.

Transcript

  • 1. DIMENSIONAL DATA IN DISTRIBUTED HASH TABLES featuring Mike Malone STRANGE LOOP 2010 Monday, October 18, 2010
  • 2. SIMPLEGEO We originally began as a mobile gaming startup, but quickly discovered that the location services and infrastructure needed to support our ideas didn’t exist. So we took matters into our own hands and began building it ourselves. Matt Galligan CEO & co-founder Joe Stump CTO & co-founder STRANGE LOOP 2010 Monday, October 18, 2010
  • 3. ABOUT ME MIKE MALONE INFRASTRUCTURE ENGINEER mike@simplegeo.com @mjmalone For the last 10 months I’ve been developing a spatial database at the core of SimpleGeo’s infrastructure. STRANGE LOOP 2010 Monday, October 18, 2010
  • 4. REQUIREMENTS Multidimensional Efficiency Query Speed Complex Queries Decentralization Locality Spatial Performance Fault Tolerance Availability Distributedness GOALS Replication Reliability Simplicity Resilience Conceptual Integrity Scalability Operational Consistency Data Coherent Concurrency Query Volume Durability Linear STRANGE LOOP 2010 Monday, October 18, 2010
  • 5. CONFLICT Integrity vs Availability Locality vs Distributedness Reductionism vs Emergence STRANGE LOOP 2010 Monday, October 18, 2010
  • 6. DATABASE CANDIDATES THE TRANSACTIONAL RELATIONAL DATABASES They’re theoretically pure, well understood, and mostly standardized behind a relatively clean abstraction They provide robust contracts that make it easy to reason about the structure and nature of the data they contain They’re battle hardened, robust, durable, etc. OTHER STRUCTURED STORAGE OPTIONS Plain I see you, western youths, see you tramping with the foremost, Pioneers! O pioneers! STRANGE LOOP 2010 Monday, October 18, 2010
  • 7. INTEGRITY - vs - AVAILABILITY STRANGE LOOP 2010 Monday, October 18, 2010
  • 8. ACID These terms are not formally defined - they’re a framework, not mathematical axioms ATOMICITY Either all of a transaction’s actions are visible to another transaction, or none are CONSISTENCY Application-specific constraints must be met for transaction to succeed ISOLATION Two concurrent transactions will not see one another’s transactions while “in flight” DURABILITY The updates made to the database in a committed transaction will be visible to future transactions STRANGE LOOP 2010 Monday, October 18, 2010
  • 9. ACID HELPS ACID is a sort-of-formal contract that makes it easy to reason about your data, and that’s good IT DOES SOMETHING HARD FOR YOU With ACID, you’re guaranteed to maintain a persistent global state as long as you’ve defined proper constraints and your logical transactions result in a valid system state STRANGE LOOP 2010 Monday, October 18, 2010
  • 10. CAP THEOREM At PODC 2000 Eric Brewer told us there were three desirable DB characteristics. But we can only have two. CONSISTENCY Every node in the system contains the same data (e.g., replicas are never out of date) AVAILABILITY Every request to a non-failing node in the system returns a response PARTITION TOLERANCE System properties (consistency and/or availability) hold even when the system is partitioned and data is lost STRANGE LOOP 2010 Monday, October 18, 2010
  • 11. CAP THEOREM IN 30 SECONDS CLIENT SERVER REPLICA STRANGE LOOP 2010 Monday, October 18, 2010
  • 12. CAP THEOREM IN 30 SECONDS CLIENT SERVER REPLICA write STRANGE LOOP 2010 Monday, October 18, 2010
  • 13. CAP THEOREM IN 30 SECONDS CLIENT SERVER replicate REPLICA write STRANGE LOOP 2010 Monday, October 18, 2010
  • 14. CAP THEOREM IN 30 SECONDS CLIENT SERVER replicate REPLICA write ack STRANGE LOOP 2010 Monday, October 18, 2010
  • 15. CAP THEOREM IN 30 SECONDS CLIENT SERVER replicate REPLICA write accept ack STRANGE LOOP 2010 Monday, October 18, 2010
  • 16. CAP THEOREM IN 30 SECONDS CLIENT SERVER REPLICA write FAIL! UNAVAILABLE! STRANGE LOOP 2010 Monday, October 18, 2010
  • 17. CAP THEOREM IN 30 SECONDS CLIENT SERVER REPLICA write FAIL! accept INCONSISTENT! STRANGE LOOP 2010 Monday, October 18, 2010
  • 18. ACID HURTS Certain aspects of ACID encourage (require?) implementors to do “bad things” Unfortunately, ANSI SQL’s definition of isolation... relies in subtle ways on an assumption that a locking scheme is used for concurrency control, as opposed to an optimistic or multi-version concurrency scheme. This implies that the proposed semantics are ill-defined. Joseph M. Hellerstein and Michael Stonebraker Anatomy of a Database System STRANGE LOOP 2010 Monday, October 18, 2010
  • 19. BALANCE IT’S A QUESTION OF VALUES For traditional databases CAP consistency is the holy grail: it’s maximized at the expense of availability and partition tolerance At scale, failures happen: when you’re doing something a million times a second a one-in-a-million failure happens every second We’re witnessing the birth of a new religion... • CAP consistency is a luxury that must be sacrificed at scale in order to maintain availability when faced with failures STRANGE LOOP 2010 Monday, October 18, 2010
  • 20. APACHE CASSANDRA A DISTRIBUTED HASH TABLE WITH SOME TRICKS Peer-to-peer gossip supports fault tolerance and decentralization with a simple operational model Random hash-based partitioning provides auto-scaling and efficient online rebalancing - read and write throughput increases linearly when new nodes are added Pluggable replication strategy for multi-datacenter replication making the system resilient to failure, even at the datacenter level Tunable consistency allows us to adjust durability to match the value of the data being written STRANGE LOOP 2010 Monday, October 18, 2010
  • 21. CASSANDRA DATA MODEL { column family “users”: { key “alice”: { “city”: [“St. Louis”, 1287040737182], columns (name, value, timestamp) “name”: [“Alice”, 1287080340940], }, ... }, “locations”: { }, ... } STRANGE LOOP 2010 Monday, October 18, 2010
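The slide’s JSON sketch shows a row as a key plus a sparse map of (name, value, timestamp) columns inside a column family. For concreteness, here is a minimal sketch of writing and reading that “users” column family from Python, assuming the pycassa client and a hypothetical “SimpleGeo” keyspace; the talk itself doesn’t name a client library, so treat the calls as illustrative.

```python
# Illustrative only: assumes the pycassa client and a hypothetical
# "SimpleGeo" keyspace that already defines the "users" column family.
import pycassa

pool = pycassa.ConnectionPool('SimpleGeo', ['localhost:9160'])
users = pycassa.ColumnFamily(pool, 'users')

# A row is a key plus a sparse map of column name -> value; Cassandra attaches
# the timestamps shown on the slide. Consistency is tunable per operation
# (see the "tunable consistency" bullet on the previous slide).
users.insert('alice',
             {'city': 'St. Louis', 'name': 'Alice'},
             write_consistency_level=pycassa.ConsistencyLevel.QUORUM)

print(users.get('alice',
                read_consistency_level=pycassa.ConsistencyLevel.ONE))
```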
  • 22. DISTRIBUTED HASH TABLE DESTROY LOCALITY (BY HASHING) TO ACHIEVE GOOD BALANCE AND SIMPLIFY LINEAR SCALABILITY { column family “users”: { fffff 0 key “alice”: { “city”: [“St. Louis”, 1287040737182], columns “name”: [“Alice”, 1287080340940], }, ... }, } ... alice bob s3b 3e8 STRANGE LOOP 2010 Monday, October 18, 2010
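A toy sketch of the hash-based partitioning this slide describes: keys are hashed onto a token ring and ownership is decided by walking to the next node token. The node count and token positions below are made up; Cassandra’s RandomPartitioner similarly derives tokens from an MD5 digest.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2 ** 127
# Four evenly spaced node tokens, purely for illustration.
NODES = [(RING_SIZE * i // 4, "node-%d" % i) for i in range(4)]
TOKENS = [token for token, _ in NODES]

def key_token(key):
    """Hash a row key onto the ring, destroying any key ordering."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % RING_SIZE

def owner(key):
    """First node whose token is at or after the key's token, wrapping around."""
    return NODES[bisect_left(TOKENS, key_token(key)) % len(NODES)][1]

for key in ("alice", "bob", "sam"):
    # Lexicographically adjacent keys scatter to unrelated nodes.
    print(key, "->", owner(key))
```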
  • 23. HASH TABLE SUPPORTED QUERIES EXACT MATCH RANGE PROXIMITY ANYTHING THAT’S NOT EXACT MATCH STRANGE LOOP 2010 Monday, October 18, 2010
  • 24. LOCALITY - vs - DISTRIBUTEDNESS STRANGE LOOP 2010 Monday, October 18, 2010
  • 25. THE ORDER PRESERVING PARTITIONER CASSANDRA’S PARTITIONING STRATEGY IS PLUGGABLE Partitioner maps keys to nodes Random partitioner destroys locality by hashing Order preserving partitioner retains locality, storing keys in natural lexicographical order around ring z a alice a bob u h sam m STRANGE LOOP 2010 Monday, October 18, 2010
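For contrast with the random partitioner sketch above, an order-preserving partitioner keeps keys in lexicographic order around the ring, so a contiguous key range lives on a contiguous set of nodes (and a popular key range can overload one of them). Boundary tokens and node names below are arbitrary examples.

```python
from bisect import bisect_left

# Arbitrary boundary tokens: each node owns keys up to its boundary string.
BOUNDARIES = ["h", "m", "u", "zzzz"]
NODES = ["node-1", "node-2", "node-3", "node-4"]

def owner(key):
    return NODES[bisect_left(BOUNDARIES, key) % len(NODES)]

print(owner("alice"), owner("bob"))  # neighbors in key order share a node,
print(owner("sam"))                  # so range scans stay local, but hot
                                     # key ranges pile onto a single node
```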
  • 26. ORDER PRESERVING PARTITIONER EXACT MATCH RANGE On a single dimension ? PROXIMITY STRANGE LOOP 2010 Monday, October 18, 2010
  • 27. SPATIAL DATA IT’S INHERENTLY MULTIDIMENSIONAL 2 x 2, 2 1 1 2 STRANGE LOOP 2010 Monday, October 18, 2010
  • 28. DIMENSIONALITY REDUCTION WITH SPACE-FILLING CURVES 1 2 3 4 STRANGE LOOP 2010 Monday, October 18, 2010
  • 29. Z-CURVE SECOND ITERATION STRANGE LOOP 2010 Monday, October 18, 2010
  • 30. Z-VALUE 14 x STRANGE LOOP 2010 Monday, October 18, 2010
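A Z-value is just the bit interleaving (Morton code) of a cell’s grid coordinates. Below is a minimal sketch for the 4x4 grid on the slide; which axis takes the even bit positions is a convention, not something the slide specifies.

```python
def z_value(x, y, bits=2):
    """Interleave the bits of two grid coordinates into a single Z-value."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((y >> i) & 1)   # convention: y gets the higher bit
        z = (z << 1) | ((x >> i) & 1)   # of each interleaved pair
    return z

# Second iteration of the curve: a 4x4 grid (two bits per axis) numbers its
# cells 0..15 in Z order.
for y in range(4):
    print([z_value(x, y) for x in range(4)])
```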
  • 31. GEOHASH SIMPLE TO COMPUTE Interleave the bits of decimal coordinates (equivalent to binary encoding of pre-order traversal!) Base32 encode the result AWESOME CHARACTERISTICS Arbitrary precision Human readable Sorts lexicographically 01101 e STRANGE LOOP 2010 Monday, October 18, 2010
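A minimal geohash encoder following the recipe on the slide: bisect the latitude and longitude ranges, interleave the resulting bits, and base32 encode five bits per character. The precision parameter and the alphabet are the standard geohash conventions; the Moonrise Hotel coordinates come from the next slide.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # standard geohash alphabet

def geohash_encode(lat, lon, precision=8):
    """Interleave lat/lon bits (longitude first) and base32 encode the result."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits = []
    while len(bits) < precision * 5:
        # Even bit positions refine longitude, odd positions refine latitude.
        rng, val = (lon_range, lon) if len(bits) % 2 == 0 else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2.0
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
    chars = []
    for i in range(0, len(bits), 5):
        idx = 0
        for b in bits[i:i + 5]:
            idx = (idx << 1) | b
        chars.append(BASE32[idx])
    return "".join(chars)

# Arbitrary precision and lexicographic sorting: longer hashes nest inside
# shorter prefixes.
print(geohash_encode(38.6554420, -90.2992910))  # begins with "9yzg...",
                                                # matching the index key on
                                                # the next slide
```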
  • 32. DATA MODEL { “record-index”: { key <geohash>:<id> “9yzgcjn0:moonrise hotel”: { “”: [“”, 1287040737182], }, ... }, “records”: { “moonrise hotel”: { “latitude”: [“38.6554420”, 1287040737182], “longitude”: [“-90.2992910”, 1287040737182], ... } } } STRANGE LOOP 2010 Monday, October 18, 2010
  • 33. BOUNDING BOX E.G., MULTIDIMENSIONAL RANGE Gimme stuff in the bounding box! Gimme 2 to 3 1 2 3 4 Gimme 4 to 5 STRANGE LOOP 2010 Monday, October 18, 2010
  • 34. SPATIAL DATA STILL MULTIDIMENSIONAL DIMENSIONALITY REDUCTION ISN’T PERFECT Clients must • Pre-process to compose multiple queries • Post-process to filter and merge results Degenerate cases can be bad, particularly for nearest-neighbor queries STRANGE LOOP 2010 Monday, October 18, 2010
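A simplified sketch of the client-side pre- and post-processing this slide describes, reusing geohash_encode from above. range_scan stands in for an ordered key-range read against the record index (e.g. a Cassandra range query) and is hypothetical; a production covering of the bounding box would be tighter than the four corner prefixes used here.

```python
def bbox_query(range_scan, south, west, north, east, precision=5):
    # Pre-process: cover the box with the geohash prefixes of its corners,
    # composing one ordered range scan per distinct prefix.
    prefixes = {geohash_encode(lat, lon, precision)
                for lat in (south, north) for lon in (west, east)}
    candidates = []
    for prefix in sorted(prefixes):
        # [prefix, prefix + '~') captures every index key starting with prefix,
        # since '~' sorts after the geohash alphabet.
        candidates.extend(range_scan(prefix, prefix + "~"))

    # Post-process: a geohash cell can spill past the box edges, so filter on
    # the true coordinates stored with each record and merge the results.
    return [r for r in candidates
            if south <= r["latitude"] <= north and west <= r["longitude"] <= east]
```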
  • 35. Z-CURVE LOCALITY STRANGE LOOP 2010 Monday, October 18, 2010
  • 36. Z-CURVE LOCALITY x x STRANGE LOOP 2010 Monday, October 18, 2010
  • 37. Z-CURVE LOCALITY x x STRANGE LOOP 2010 Monday, October 18, 2010
  • 38. Z-CURVE LOCALITY x o o o x o o o o STRANGE LOOP 2010 Monday, October 18, 2010
  • 39. THE WORLD IS NOT BALANCED Credit: C. Mayhew & R. Simmon (NASA/GSFC), NOAA/NGDC, DMSP Digital Archive STRANGE LOOP 2010 Monday, October 18, 2010
  • 40. TOO MUCH LOCALITY 1 2 SAN FRANCISCO 3 4 STRANGE LOOP 2010 Monday, October 18, 2010
  • 41. TOO MUCH LOCALITY 1 2 SAN FRANCISCO 3 4 STRANGE LOOP 2010 Monday, October 18, 2010
  • 42. TOO MUCH LOCALITY 1 2 I’m sad. SAN FRANCISCO 3 4 STRANGE LOOP 2010 Monday, October 18, 2010
  • 43. TOO MUCH LOCALITY I’m bored. 1 2 I’m sad. Me too. SAN FRANCISCO 3 4 Let’s play xbox. STRANGE LOOP 2010 Monday, October 18, 2010
  • 44. A TURNING POINT STRANGE LOOP 2010 Monday, October 18, 2010
  • 45. HELLO, DRAWING BOARD SURVEY OF DISTRIBUTED P2P INDEXING An overlay-dependent index works directly with nodes of the peer-to-peer network, defining its own overlay An over-DHT index overlays a more sophisticated data structure on top of a peer-to-peer distributed hash table STRANGE LOOP 2010 Monday, October 18, 2010
  • 46. ANOTHER LOOK AT POSTGIS MIGHT WORK, BUT The relational transaction management system (which we’d want to change) and access methods (which we’d have to change) are tightly coupled (necessarily?) to other parts of the system Could work at a higher level and treat PostGIS as a black box • Now we’re back to implementing a peer-to-peer network with failure recovery, fault detection, etc... and Cassandra already had all that. • It’s probably clear by now that I think these problems are more difficult than actually storing structured data on disk STRANGE LOOP 2010 Monday, October 18, 2010
  • 47. LET’S TAKE A STEP BACK STRANGE LOOP 2010 Monday, October 18, 2010
  • 48. EARTH STRANGE LOOP 2010 Monday, October 18, 2010
  • 49. EARTH STRANGE LOOP 2010 Monday, October 18, 2010
  • 50. EARTH STRANGE LOOP 2010 Monday, October 18, 2010
  • 51. EARTH STRANGE LOOP 2010 Monday, October 18, 2010
  • 52. EARTH, TREE, RING STRANGE LOOP 2010 Monday, October 18, 2010
  • 53. DATA MODEL { “record-index”: { “layer-name:37.875, -90:40.25, -101.25”: { “38.6554420, -90.2992910:moonrise hotel”: [“”, 1287040737182], ... }, }, “record-index-meta”: { “layer-name:37.875, -90:40.25, -101.25”: { “split”: [“false”, 1287040737182], }, “layer-name:37.875, -90:42.265, -101.25”: { “split”: [“true”, 1287040737182], “child-left”: [“layer-name:37.875, -90:40.25, -101.25”, 1287040737182], “child-right”: [“layer-name:40.25, -90:42.265, -101.25”, 1287040737182], } } } STRANGE LOOP 2010 Monday, October 18, 2010
  • 54. DATA MODEL { “record-index”: { “layer-name:37.875, -90:40.25, -101.25”: { “38.6554420, -90.2992910:moonrise hotel”: [“”, 1287040737182], ... }, }, “record-index-meta”: { “layer-name:37.875, -90:40.25, -101.25”: { “split”: [“false”, 1287040737182], }, “layer-name:37.875, -90:42.265, -101.25”: { “split”: [“true”, 1287040737182], “child-left”: [“layer-name:37.875, -90:40.25, -101.25”, 1287040737182], “child-right”: [“layer-name:40.25, -90:42.265, -101.25”, 1287040737182], } } } STRANGE LOOP 2010 Monday, October 18, 2010
  • 55. DATA MODEL { “record-index”: { “layer-name:37.875, -90:40.25, -101.25”: { “38.6554420, -90.2992910:moonrise hotel”: [“”, 1287040737182], ... }, }, “record-index-meta”: { “layer-name:37.875, -90:40.25, -101.25”: { “split”: [“false”, 1287040737182], }, “layer-name:37.875, -90:42.265, -101.25”: { “split”: [“true”, 1287040737182], “child-left”: [“layer-name:37.875, -90:40.25, -101.25”, 1287040737182], “child-right”: [“layer-name:40.25, -90:42.265, -101.25”, 1287040737182], } } } STRANGE LOOP 2010 Monday, October 18, 2010
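The tree node’s row key encodes the node’s layer and bounding box directly, so children are addressable by name alone. Hypothetical helpers, mirroring the “layer:lat1, lng1:lat2, lng2” keys shown on the data-model slides:

```python
def node_key(layer, lat1, lng1, lat2, lng2):
    """Build a tree-node row key of the form shown on the slide."""
    return "%s:%s, %s:%s, %s" % (layer, lat1, lng1, lat2, lng2)

def parse_node_key(key):
    layer, corner1, corner2 = key.split(":")
    lat1, lng1 = (float(v) for v in corner1.split(","))
    lat2, lng2 = (float(v) for v in corner2.split(","))
    return layer, (lat1, lng1), (lat2, lng2)

key = node_key("layer-name", 37.875, -90, 40.25, -101.25)
print(key)                  # layer-name:37.875, -90:40.25, -101.25
print(parse_node_key(key))  # ('layer-name', (37.875, -90.0), (40.25, -101.25))
```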
  • 56. SPLITTING IT’S PRETTY MUCH JUST A CONCURRENT TREE Splitting shouldn’t lock the tree for reads or writes and failures shouldn’t cause corruption • Splits are optimistic, idempotent, and fail-forward • Instead of locking, writes are replicated to the splitting node and the relevant child[ren] while a split operation is taking place • Cleanup occurs after the split is completed and all interested nodes are aware that the split has occurred • Cassandra writes are idempotent, so splits are too - if a split fails, it is simply retried Split size: a tunable knob for balancing locality and distributedness The other hard problem with concurrent trees is rebalancing - we just don’t do it! (more on this later) STRANGE LOOP 2010 Monday, October 18, 2010
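A hedged sketch of the optimistic, fail-forward split described above. The index/meta objects stand in for the “record-index” and “record-index-meta” column families, and SPLIT_SIZE, the key handling, and the cleanup step are assumptions about details the slide leaves out; the load-bearing idea is that every step is an idempotent insert, so a failed split can simply be retried.

```python
SPLIT_SIZE = 1000   # tunable knob: locality vs. distributedness

def maybe_split(index, meta, node, left, right):
    """node/left/right are row keys; index and meta behave like column families."""
    records = index.get(node)                  # entries stored under this node
    if len(records) < SPLIT_SIZE:
        return

    # 1. Write the children first. Nothing references them yet, so a crash
    #    here leaves only harmless, re-creatable rows (fail-forward).
    keys = sorted(records)
    mid = keys[len(keys) // 2]
    index.insert(left,  {k: records[k] for k in keys if k < mid})
    index.insert(right, {k: records[k] for k in keys if k >= mid})

    # 2. Flip the parent's metadata. Writers that still see the old state keep
    #    writing to the parent; while the split is in flight, writes go to the
    #    parent and the relevant child, so readers never observe a gap.
    meta.insert(node, {"split": "true", "child-left": left, "child-right": right})

    # 3. Cleanup: once interested nodes have observed the split, the entries
    #    duplicated on the parent can be removed lazily.
    index.remove(node, columns=list(records))
```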
  • 57. THE ROOT IS HOT MIGHT BE A DEAL BREAKER For a tree to be useful, it has to be traversed • Typically, tree traversal starts at the root • Root is the only discoverable node in our tree Traversing through the root meant reading the root for every read or write below it - unacceptable • Lots of academic solutions - most promising was a skip graph, but that required O(n log(n)) data - also unacceptable • Minimum tree depth was proposed, but then you just get multiple hot-spots at your minimum depth nodes STRANGE LOOP 2010 Monday, October 18, 2010
  • 58. BACK TO THE BOOKS LOTS OF ACADEMIC WORK ON THIS TOPIC But academia is obsessed with provable, deterministic, asymptotically optimal algorithms And we only need something that is probably fast enough most of the time (for some value of “probably” and “most of the time”) • And if the probably good enough algorithm is, you know... tractable... one might even consider it qualitatively better! STRANGE LOOP 2010 Monday, October 18, 2010
  • 59. REDUCTIONISM - vs - EMERGENCE STRANGE LOOP 2010 Monday, October 18, 2010
  • 60. We have We want STRANGE LOOP 2010 Monday, October 18, 2010
  • 61. THINKING HOLISTICALLY WE OBSERVED THAT Once a node in the tree exists, it doesn’t go away Node state may change, but that state only really matters locally - thinking a node is a leaf when it really has children is not fatal SO... WHAT IF WE JUST CACHED NODES THAT WERE OBSERVED IN THE SYSTEM!? STRANGE LOOP 2010 Monday, October 18, 2010
  • 62. CACHE IT STUPID SIMPLE SOLUTION Keep an LRU cache of nodes that have been traversed Start traversals at the most selective relevant node If that node doesn’t satisfy you, traverse up the tree Along with your result set, return a list of nodes that were traversed so the caller can add them to its cache STRANGE LOOP 2010 Monday, October 18, 2010
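A minimal sketch of that client-side cache. The Node type (with key, contains() and area()) and the exact traversal protocol are stand-ins; the idea from the slide is only that traversals start at the most selective cached node, fall back toward the root when that node doesn’t satisfy the query, and that the server returns the nodes it touched so callers can warm their caches.

```python
from collections import OrderedDict

class NodeCache:
    """LRU cache of tree nodes observed in earlier query responses."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.nodes = OrderedDict()            # node key -> node, in LRU order

    def remember(self, traversed_nodes):
        # Each result set comes back with the list of nodes traversed server-side.
        for node in traversed_nodes:
            self.nodes[node.key] = node
            self.nodes.move_to_end(node.key)
        while len(self.nodes) > self.capacity:
            self.nodes.popitem(last=False)    # evict least recently used

    def start_node(self, query_box, root):
        # Most selective cached node that still covers the query region;
        # a stale leaf that has since split just costs one extra descent.
        best = root
        for node in self.nodes.values():
            if node.contains(query_box) and node.area() < best.area():
                best = node
        return best
```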
  • 63. TRAVERSAL NEAREST NEIGHBOR o o o xo STRANGE LOOP 2010 Monday, October 18, 2010
  • 64. TRAVERSAL NEAREST NEIGHBOR o o o xo STRANGE LOOP 2010 Monday, October 18, 2010
  • 65. TRAVERSAL NEAREST NEIGHBOR o o o xo STRANGE LOOP 2010 Monday, October 18, 2010
  • 66. KEY CHARACTERISTICS PERFORMANCE Best case on the happy path (everything cached) has zero read overhead Worst case, with nothing cached, O(log(n)) read overhead RE-BALANCING SEEMS UNNECESSARY! Makes worst case more worser, but so far so good STRANGE LOOP 2010 Monday, October 18, 2010
  • 67. DISTRIBUTED TREE SUPPORTED QUERIES EXACT MATCH RANGE PROXIMITY SOMETHING ELSE I HAVEN’T EVEN HEARD OF STRANGE LOOP 2010 Monday, October 18, 2010
  • 68. DISTRIBUTED TREE SUPPORTED QUERIES (IN MULTIPLE DIMENSIONS!) EXACT MATCH RANGE PROXIMITY SOMETHING ELSE I HAVEN’T EVEN HEARD OF STRANGE LOOP 2010 Monday, October 18, 2010
  • 69. PEACE Integrity and Availability Locality and Distributedness Reductionism and Emergence STRANGE LOOP 2010 Monday, October 18, 2010
  • 70. QUESTIONS? MIKE MALONE INFRASTRUCTURE ENGINEER mike@simplegeo.com @mjmalone STRANGE LOOP 2010 Monday, October 18, 2010
  • 71. STRANGE LOOP 2010 Monday, October 18, 2010