Scalable Data Storage Getting You Down? To The Cloud!

This was a three hour workshop given at the 2011 Web 2.0 Expo in San Francisco. Due to the length of the presentation and the number of presenters, portions of the slide deck may appear disjoint without the accompanying narrative.

Abstract: "The hype cycle is at a high for cloud computing, distributed “NoSQL” data storage, and high availability map-reducing eventually consistent distributed data processing frameworks everywhere. Back in the real world we know that these technologies aren’t a cure-all. But they’re not worthless, either. We’ll take a look behind the curtains and share some of our experiences working with these systems in production at SimpleGeo.

Our stack consists of Cassandra, HBase, Hadoop, Flume, node.js, rabbitmq, and Puppet. All running on Amazon EC2. Tying these technologies together has been a challenge, but the result really is worth the work. The rotten truth is that our ops guys still wake up in the middle of the night sometimes, and our engineers face new and novel challenges. Let us share what’s keeping us busy—the folks working in the wee hours of the morning—in the hopes that you won’t have to do so yourself."

  1. 1. SCALABLE DATA STORAGE GETTING YOU DOWN? TO THE CLOUD! Web 2.0 Expo SF 2011. Mike Malone, Mike Panchenko, Derek Smith, Paul Lathrop
  2. 2. THE CAST MIKE MALONE INFRASTRUCTURE ENGINEER @MJMALONE MIKE PANCHENKO INFRASTRUCTURE ENGINEER @MIHASYA DEREK SMITH INFRASTRUCTURE ENGINEER @DSMITTS PAUL LATHROP OPERATIONS @GREYTALYN
  3. 3. SIMPLEGEO We originally began as a mobile gaming startup, but quickly discovered that the location services and infrastructure needed to support our ideas didn’t exist. So we took matters into our own hands and began building it ourselves. Matt Galligan, CSO & co-founder; Joe Stump, CTO & co-founder
  4. 4. THE STACK [architecture diagram: www and gnop clients enter over HTTP through ELB to an auth/proxy layer and api servers; writes flow through queues to record storage on AWS RDS and index storage on Apache Cassandra; supporting services include a geocoder, reverse geocoder, GeoIP, and pushpin, spread across data centers]
  5. 5. BUT WHY NOT POSTGIS?
  6. 6. DATABASESWHAT ARE THEY GOOD FOR?DATA STORAGEDurably persist system stateCONSTRAINT MANAGEMENTEnforce data integrity constraintsEFFICIENT ACCESSOrganize data and implement access methods for efficientretrieval and summarization
  7. 7. DATA INDEPENDENCEData independence shields clients from the detailsof the storage system, and data structureLOGICAL DATA INDEPENDENCE Clients that operate on a subset of the attributes in a data set should not be affected later when new attributes are addedPHYSICAL DATA INDEPENDENCE Clients that interact with a logical schema remain the same despite physical data structure changes like • File organization • Compression • Indexing strategy
  8. 8. TRANSACTIONAL RELATIONALDATABASE SYSTEMSHIGH DEGREE OF DATA INDEPENDENCE Logical structure: SQL Data Definition Language Physical structure: Managed by the DBMSOTHER GOODIES They’re theoretically pure, well understood, and mostly standardized behind a relatively clean abstraction They provide robust contracts that make it easy to reason about the structure and nature of the data they contain They’re ubiquitous, battle hardened, robust, durable, etc.
  9. 9. ACIDThese terms are not formally defined - they’re aframework, not mathematical axiomsATOMICITY Either all of a transaction’s actions are visible to another transaction, or none areCONSISTENCY Application-specific constraints must be met for transaction to succeedISOLATION Two concurrent transactions will not see one another’s transactions while “in flight”DURABILITY The updates made to the database in a committed transaction will be visible to future transactions
  10. 10. ACID HELPSACID is a sort-of-formal contract that makes iteasy to reason about your data, and that’s goodIT DOES SOMETHING HARD FOR YOU With ACID, you’re guaranteed to maintain a persistent global state as long as you’ve defined proper constraints and your logical transactions result in a valid system state
  11. 11. CAP THEOREMAt PODC 2000 Eric Brewer told us there were threedesirable DB characteristics. But we can only have two.CONSISTENCY Every node in the system contains the same data (e.g., replicas are never out of date)AVAILABILITY Every request to a non-failing node in the system returns a responsePARTITION TOLERANCE System properties (consistency and/or availability) hold even when the system is partitioned and data is lost
  12. 12. CAP THEOREM IN 30 SECONDS CLIENT SERVER REPLICA
  13. 13. CAP THEOREM IN 30 SECONDS [the client sends a write to the server]
  14. 14. CAP THEOREM IN 30 SECONDS [the server replicates the write to the replica]
  15. 15. CAP THEOREM IN 30 SECONDS [the replica acks the write]
  16. 16. CAP THEOREM IN 30 SECONDS [the server accepts the write back to the client]
  17. 17. CAP THEOREM IN 30 SECONDS [replication FAILS and the write is denied: UNAVAILABLE!]
  18. 18. CAP THEOREM IN 30 SECONDS [replication FAILS but the write is accepted anyway: INCONSISTENT!]
  19. 19. ACID HURTSCertain aspects of ACID encourage (require?)implementors to do “bad things”Unfortunately, ANSI SQL’s definition of isolation... relies in subtle ways on an assumption that a locking scheme is used for concurrency control, as opposed to an optimistic or multi-version concurrency scheme. This implies that the proposed semantics are ill-defined. Joseph M. Hellerstein and Michael Stonebraker Anatomy of a Database System
  20. 20. BALANCEIT’S A QUESTION OF VALUES For traditional databases CAP consistency is the holy grail: it’s maximized at the expense of availability and partition tolerance At scale, failures happen: when you’re doing something a million times a second a one-in-a-million failure happens every second We’re witnessing the birth of a new religion... • CAP consistency is a luxury that must be sacrificed at scale in order to maintain availability when faced with failures
  21. 21. NETWORK INDEPENDENCEA distributed system must also manage thenetwork - if it doesn’t, the client has toCLIENT APPLICATIONS ARE LEFT TO HANDLE Partitioning data across multiple machines Working with loosely defined replication semantics Detecting, routing around, and correcting network and hardware failures
  22. 22. WHAT’S WRONGWITH MYSQL..?TRADITIONAL RELATIONAL DATABASES They are from an era (er, one of the eras) when Big Iron was the answer to scaling up In general, the network was not considered part of the systemNEXT GENERATION DATABASES Deconstructing, and decoupling the beast Trying to create a loosely coupled structured storage system • Something that the current generation of database systems never quite accomplished
  23. 23. UNDERSTANDING CASSANDRA
  24. 24. APACHE CASSANDRAA DISTRIBUTED STRUCTURED STORAGE SYSTEMEMPHASIZING Extremely large data sets High transaction volumes High value data that necessitates high availabilityTO USE CASSANDRA EFFECTIVELY IT HELPS TOUNDERSTAND WHAT’S GOING ON BEHIND THE SCENES
  25. 25. APACHE CASSANDRAA DISTRIBUTED HASH TABLE WITH SOME TRICKS Peer-to-peer architecture with no distinguished nodes, and therefore no single points of failure Gossip-based cluster management Generic distributed data placement strategy maps data to nodes • Pluggable partitioning • Pluggable replication strategy Quorum based consistency, tunable on a per-request basis Keys map to sparse, multi-dimensional sorted maps Append-only commit log and SSTables for efficient disk utilization
  26. 26. NETWORK MODELDYNAMO INSPIREDCONSISTENT HASHING Simple random partitioning mechanism for distribution Low fuss online rebalancing when operational requirements changeGOSSIP PROTOCOL Simple decentralized cluster configuration and fault detection Core protocol for determining cluster membership and providing resilience to partial system failure
  27. 27. CONSISTENT HASHING Addresses the shortcomings of modulo-based hashing: hash(alice) % 3 => 23 % 3 => 2 [three nodes: 1, 2, 3]
  28. 28. CONSISTENT HASHING With modulo hashing, a change in the number of nodes reshuffles the entire data set: hash(alice) % 4 => 23 % 4 => 3 [four nodes: 1, 2, 3, 4]
  29. 29. CONSISTENT HASHING Instead, the range of the hash function is mapped to a ring, with each node responsible for a segment: hash(alice) => 23 [nodes at ring positions 0, 42, 84]
  30. 30. CONSISTENT HASHING When nodes are added (or removed) most of the data mappings remain the same: hash(alice) => 23 [nodes at ring positions 0, 42, 64, 84]
  31. 31. CONSISTENT HASHING Rebalancing the ring requires a minimal amount of data shuffling: hash(alice) => 23 [nodes at ring positions 0, 32, 64, 96]
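
To make the ring concrete, here is a minimal consistent-hash ring in Python. It is a toy sketch (a 100-slot ring and made-up node names at the positions shown on the slides), not Cassandra's actual partitioner:

    import bisect
    import hashlib

    def ring_hash(key, ring_size=100):
        # Toy hash: squash the key onto a small ring so the numbers stay readable.
        return int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16) % ring_size

    class Ring(object):
        """Each node owns the arc of the ring that ends at its position."""
        def __init__(self, nodes):
            # nodes: {ring position -> node name}, e.g. {0: 'a', 42: 'b', 84: 'c'}
            self.nodes = nodes
            self.positions = sorted(nodes)

        def node_for(self, key):
            pos = ring_hash(key)
            # First node position at or past the key's position, wrapping around.
            i = bisect.bisect_left(self.positions, pos) % len(self.positions)
            return self.nodes[self.positions[i]]

    ring = Ring({0: 'node-a', 42: 'node-b', 84: 'node-c'})
    print(ring.node_for('alice'))

    # Adding node-d at 64 (slide 30) only remaps keys that hash into (42, 64];
    # everything else keeps its old owner.
    ring = Ring({0: 'node-a', 42: 'node-b', 64: 'node-d', 84: 'node-c'})
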
  32. 32. GOSSIPDISSEMINATES CLUSTER MEMBERSHIP ANDRELATED CONTROL STATE Gossip is initiated by an interval timer At each gossip tick a node will • Randomly select a live node in the cluster, sending it a gossip message • Attempt to contact cluster members that were previously marked as down If the gossip message is unacknowledged for some period of time (statistically adjusted based on the inter-arrival time of previous messages) the remote node is marked as down
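
A rough sketch of one gossip tick as described above. The fixed 10-second timeout stands in for Cassandra's statistically adjusted failure detector, and send is a hypothetical callback that returns True when the peer acknowledges:

    import random
    import time

    def gossip_tick(peers, send):
        # peers: {name -> {'up': bool, 'last_ack': float}}; send(name) -> bool (acked?)
        live = [p for p, s in peers.items() if s['up']]
        if live:
            target = random.choice(live)          # gossip with one random live node
            if send(target):
                peers[target]['last_ack'] = time.time()
        for name, s in peers.items():
            if not s['up'] and send(name):        # retry nodes previously marked down
                s['up'], s['last_ack'] = True, time.time()
            elif s['up'] and time.time() - s['last_ack'] > 10.0:
                s['up'] = False                   # no ack for too long: mark it down
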
  33. 33. REPLICATION REPLICATION FACTOR determines how many copies of each piece of data are created in the system: RF=3, hash(alice) => 23 [nodes at ring positions 0, 32, 64, 96]
  34. 34. CONSISTENCY MODEL DYNAMO INSPIRED QUORUM-BASED CONSISTENCY: write at W=2, read at R=2, hash(alice) => 23; when W + R > N a read quorum always overlaps a write quorum, so the result is consistent [nodes at ring positions 0, 32, 64, 96]
  35. 35. TUNABLE CONSISTENCYWRITES ZERO DON’T BOTHER WAITING FOR A RESPONSE ANY WAIT FOR SOME NODE (NOT NECESSARILY A REPLICA) TO RESPOND ONE WAIT FOR ONE REPLICA TO RESPOND QUORUM WAIT FOR A QUORUM (N/2+1) TO RESPOND ALL WAIT FOR ALL N REPLICAS TO RESPOND
  36. 36. TUNABLE CONSISTENCYREADS ONE WAIT FOR ONE REPLICA TO RESPOND QUORUM WAIT FOR A QUORUM (N/2+1) TO RESPOND ALL WAIT FOR ALL N REPLICAS TO RESPOND
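
These levels interact through the overlap rule on slide 34: as long as W + R > N, every read quorum intersects every write quorum, so a read sees the latest acknowledged write. A tiny illustration, not Cassandra code:

    def overlaps(n, w, r):
        # A read quorum and a write quorum must share at least one replica.
        return w + r > n

    print(overlaps(n=3, w=2, r=2))   # True:  QUORUM writes + QUORUM reads
    print(overlaps(n=3, w=1, r=1))   # False: fast, but a read can miss the latest write
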
  37. 37. CONSISTENCY MODEL DYNAMO INSPIRED READ REPAIR, HINTED HANDOFF, ANTI-ENTROPY [diagram: a W=2 write where one replica fails]
  38. 38. CONSISTENCY MODEL DYNAMO INSPIRED READ REPAIR Asynchronously checks replicas during reads and repairs any inconsistencies [diagram: W=2 write, then read + fix]
  39. 39. CONSISTENCY MODEL DYNAMO INSPIRED HINTED HANDOFF Sends failed writes to another node with a hint to re-replicate when the failed node returns [diagram: the write is handed to a stand-in replica]
  40. 40. CONSISTENCY MODEL DYNAMO INSPIRED HINTED HANDOFF Sends failed writes to another node with a hint to re-replicate when the failed node returns [diagram: when the failed node returns it is checked and repaired]
  41. 41. CONSISTENCY MODEL DYNAMO INSPIRED ANTI-ENTROPY Manual repair process where nodes generate Merkle trees (hash trees) to detect and repair data inconsistencies [diagram: repair]
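
A toy of the Merkle-tree comparison behind anti-entropy, assuming a power-of-two number of leaves for brevity. Real repair walks the trees top-down and only streams the key ranges whose hashes disagree:

    import hashlib

    def h(value):
        return hashlib.sha1(value.encode('utf-8')).hexdigest()

    def merkle_levels(leaves):
        # Level 0 is the leaf hashes; each level pairs up the one below; the last is the root.
        levels = [[h(x) for x in leaves]]
        while len(levels[-1]) > 1:
            prev = levels[-1]
            levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
        return levels

    replica_a = merkle_levels(['alice:v2', 'bob:v1', 'carol:v1', 'dave:v1'])
    replica_b = merkle_levels(['alice:v1', 'bob:v1', 'carol:v1', 'dave:v1'])

    if replica_a[-1] != replica_b[-1]:   # roots differ, so the replicas are out of sync
        mismatched = [i for i, (x, y) in enumerate(zip(replica_a[0], replica_b[0])) if x != y]
        print(mismatched)                # -> [0]; only alice's range needs to be repaired
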
  42. 42. DATA MODELBIGTABLE INSPIREDSPARSE MATRIX it’s a hash-map (associative array):a simple, versatile data structureSCHEMA-FREE data model, introduces new freedomand new responsibilitiesCOLUMN FAMILIES blend row-oriented and column-oriented structure, providing a high level mechanismfor clients to manage on-disk and inter-node datalocality
  43. 43. DATA MODELTERMINOLOGYKEYSPACE A named collection of column families(similar to a “database” in MySQL) you only need one andyou can mostly ignore itCOLUMN FAMILY A named mapping of keys to rowsROW A named sorted map of columns or supercolumnsCOLUMN A <name, value, timestamp> tripleSUPERCOLUMN A named collection of columns, forpeople who want to get fancy
  44. 44. DATA MODEL { “users”: { “alice”: { “city”: [“St. Louis”, 1287040737182], “name”: [“Alice”, 1287080340940] }, ... }, “locations”: { ... } } Here “users” is the column family, “alice” is the row key, and each column is a (name, value, timestamp) triple.
  45. 45. IT’S A DISTRIBUTED HASH TABLE WITH A TWIST... COLUMNS ARE STORED TOGETHER ON ONE NODE, IDENTIFIED BY <keyspace, key> { “users”: { “alice”: { “city”: [“St. Louis”, 1287040737182], “name”: [“Alice”, 1287080340940] }, ... } } [diagram: the rows for “alice” and “bob” each hash to a single position on the ring]
  46. 46. HASH TABLESUPPORTED QUERIES EXACT MATCH RANGE PROXIMITY ANYTHING THAT’S NOT EXACT MATCH
  47. 47. COLUMNS SUPPORTED QUERIES EXACT MATCH RANGE PROXIMITY { “users”: { “alice”: { “city”: [“St. Louis”, 1287040737182], “friend-1”: [“Bob”, 1287080340940], “friend-2”: [“Joe”, 1287080340940], “friend-3”: [“Meg”, 1287080340940], “name”: [“Alice”, 1287080340940] }, ... } } The “friend-*” columns sort together, so they can be fetched as a range.
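
As an illustration of what that buys you in client code, here is a sketch using the pycassa client. The keyspace and column family names are hypothetical; the point is that an exact column lookup and a slice over the sorted column names are both cheap, single-row operations:

    import pycassa

    pool = pycassa.ConnectionPool('Demo', ['localhost:9160'])   # hypothetical keyspace
    users = pycassa.ColumnFamily(pool, 'users')

    # The whole row lives together on one node, keyed by <keyspace, key>.
    users.insert('alice', {'city': 'St. Louis', 'friend-1': 'Bob', 'friend-2': 'Joe', 'friend-3': 'Meg'})

    print(users.get('alice', columns=['city']))                                   # exact match
    print(users.get('alice', column_start='friend-', column_finish='friend-~'))   # range over column names
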
  48. 48. LOG-STRUCTURED MERGEMEMTABLES are in memory data structures thatcontain newly written dataCOMMIT LOGS are append only files where newdata is durably writtenSSTABLES are serialized memtables, persisted todiskCOMPACTION periodically merges multiplememtables to improve system performance
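
A toy model of that write path, just to show how the moving parts fit together; it is nothing like Cassandra's actual on-disk format:

    class ToyStore(object):
        def __init__(self, flush_at=3):
            self.commit_log = []        # stand-in for the append-only log file
            self.memtable = {}          # newly written data, in memory
            self.sstables = []          # each flush produces an immutable sorted table
            self.flush_at = flush_at

        def write(self, key, value):
            self.commit_log.append((key, value))   # durable append first
            self.memtable[key] = value             # then the fast in-memory structure
            if len(self.memtable) >= self.flush_at:
                self.sstables.append(sorted(self.memtable.items()))   # serialize the memtable
                self.memtable = {}

        def compact(self):
            # Merge all sstables into one; the newest value for each key wins.
            merged = {}
            for table in self.sstables:
                merged.update(dict(table))
            self.sstables = [sorted(merged.items())]
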
  49. 49. CASSANDRA CONCEPTUAL SUMMARY... IT’S A DISTRIBUTED HASH TABLE Gossip based peer-to-peer “ring” with no distinguished nodes and no single point of failure; consistent hashing distributes the workload, and a simple replication strategy provides fault tolerance and improved throughput WITH TUNABLE CONSISTENCY Based on a quorum protocol to ensure consistency, and simple repair mechanisms to stay available during partial system failures AND A SIMPLE, SCHEMA-FREE DATA MODEL It’s just a key-value store whose values are multi-dimensional sorted maps
  50. 50. ADVANCED CASSANDRA - A case study -SPATIAL DATA IN A DHT
  51. 51. A FIRST PASSTHE ORDER PRESERVING PARTITIONERCASSANDRA’S PARTITIONINGSTRATEGY IS PLUGGABLE Partitioner maps keys to nodes Random partitioner destroys locality by hashing Order preserving partitioner retains locality, storing keys in natural lexicographical order around ring z a alice a bob u h sam m
  52. 52. ORDER PRESERVING PARTITIONER EXACT MATCH RANGE On a single dimension? PROXIMITY
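
With an order-preserving partitioner the ring is just the sorted key space, so a lexicographic key range maps to a contiguous set of nodes. A small sketch with made-up tokens matching the a/h/m/u/z labels on slide 51:

    import bisect

    tokens = [('a', 'node-1'), ('h', 'node-2'), ('m', 'node-3'), ('u', 'node-4'), ('z', 'node-5')]

    def nodes_for_range(start_key, end_key):
        # Each node owns the keys up to its token, so a lexicographic key range
        # touches only the nodes whose tokens bracket it.
        keys = [t for t, _ in tokens]
        lo = bisect.bisect_left(keys, start_key)
        hi = bisect.bisect_left(keys, end_key)
        return [name for _, name in tokens[lo:hi + 1]]

    print(nodes_for_range('alice', 'bob'))   # a contiguous arc of the ring -> a real range scan
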
  53. 53. SPATIAL DATAIT’S INHERENTLY MULTIDIMENSIONAL 2 x 2, 2 1 1 2
  54. 54. DIMENSIONALITY REDUCTIONWITH SPACE-FILLING CURVES 1 2 3 4
  55. 55. Z-CURVESECOND ITERATION
  56. 56. Z-VALUE 14 x
  57. 57. GEOHASHSIMPLE TO COMPUTE Interleave the bits of decimal coordinates (equivalent to binary encoding of pre-order traversal!) Base32 encode the resultAWESOME CHARACTERISTICS Arbitrary precision Human readable Sorts lexicographically 01101 e
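
A minimal sketch of the bit-interleaving step; a real geohash alternates longitude and latitude bits exactly like this and then base32-encodes the bit string (boundary handling omitted):

    def geohash_bits(lat, lon, bits=25):
        lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
        out = []
        for i in range(bits):
            # Even positions refine longitude, odd positions refine latitude.
            rng, val = (lon_range, lon) if i % 2 == 0 else (lat_range, lat)
            mid = (rng[0] + rng[1]) / 2
            if val > mid:
                out.append('1')
                rng[0] = mid
            else:
                out.append('0')
                rng[1] = mid
        return ''.join(out)   # base32-encode this string to get the familiar geohash

    print(geohash_bits(38.6554420, -90.2992910))   # nearby points share a long common prefix
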
  58. 58. DATA MODEL{ “record-index”: { key <geohash>:<id> “9yzgcjn0:moonrise hotel”: { “”: [“”, 1287040737182], }, ... }, “records”: { “moonrise hotel”: { “latitude”: [“38.6554420”, 1287040737182], “longitude”: [“-90.2992910”, 1287040737182], ... } }}
  59. 59. BOUNDING BOX E.G., MULTIDIMENSIONAL RANGE “Gimme stuff in the bounding box!” becomes two one-dimensional queries: “Gimme 2 to 3” and “Gimme 4 to 5” [quadrant diagram: 1, 2, 3, 4]
  60. 60. SPATIAL DATASTILL MULTIDIMENSIONALDIMENSIONALITY REDUCTION ISN’T PERFECT Clients must • Pre-process to compose multiple queries • Post-process to filter and merge results Degenerate cases can be bad, particularly for nearest-neighbor queries
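
To see that pre/post-processing burden concretely, here is a self-contained toy on a 4x4 grid of integer cells: the bounding box is turned into one coarse z-value range, the "index scan" returns extra candidates, and the client filters them out afterwards:

    def z_value(x, y, bits=2):
        # Interleave the bits of the coordinates (the Z-curve from slide 54).
        z = 0
        for i in range(bits):
            z |= ((x >> i) & 1) << (2 * i)
            z |= ((y >> i) & 1) << (2 * i + 1)
        return z

    def bbox_query(records, x0, y0, x1, y1):
        cells = [z_value(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)]
        lo, hi = min(cells), max(cells)                                  # pre-process: compose the 1-D range
        candidates = [r for r in records if lo <= z_value(*r) <= hi]     # the one-dimensional "index scan"
        return [r for r in candidates if x0 <= r[0] <= x1 and y0 <= r[1] <= y1]   # post-filter false positives

    records = [(1, 1), (3, 0), (0, 3), (2, 2), (3, 3)]
    print(bbox_query(records, 1, 1, 2, 2))   # (3, 0) and (0, 3) fall inside the z-range but get filtered out
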
  61. 61. Z-CURVE LOCALITY
  62. 62. Z-CURVE LOCALITY x x
  63. 63. Z-CURVE LOCALITY x x
  64. 64. Z-CURVE LOCALITY x o o o x o o o o
  65. 65. THE WORLDIS NOT BALANCED Credit: C. Mayhew & R. Simmon (NASA/GSFC), NOAA/NGDC, DMSP Digital Archive
  66. 66. TOO MUCH LOCALITY 1 2 SAN FRANCISCO 3 4
  67. 67. TOO MUCH LOCALITY 1 2 SAN FRANCISCO 3 4
  68. 68. TOO MUCH LOCALITY 1 2 I’m sad. SAN FRANCISCO 3 4
  69. 69. TOO MUCH LOCALITY SAN FRANCISCO [the node holding San Francisco: “I’m sad.” The idle nodes: “I’m bored.” “Me too.” “Let’s play Xbox.”]
  70. 70. A TURNING POINT
  71. 71. HELLO, DRAWING BOARDSURVEY OF DISTRIBUTED P2P INDEXING An overlay-dependent index works directly with nodes of the peer-to-peer network, defining its own overlay An over-DHT index overlays a more sophisticated data structure on top of a peer-to-peer distributed hash table
  72. 72. ANOTHER LOOK AT POSTGISMIGHT WORK, BUT The relational transaction management system (which we’d want to change) and access methods (which we’d have to change) are tightly coupled (necessarily?) to other parts of the system Could work at a higher level and treat PostGIS as a black box • Now we’re back to implementing a peer-to-peer network with failure recovery, fault detection, etc... and Cassandra already had all that. • It’s probably clear by now that I think these problems are more difficult than actually storing structured data on disk
  73. 73. LET’S TAKE A STEP BACK
  74. 74. EARTH
  75. 75. EARTH
  76. 76. EARTH
  77. 77. EARTH
  78. 78. EARTH, TREE, RING
  79. 79. DATA MODEL { “record-index”: { “layer-name:37.875, -90:40.25, -101.25”: { “38.6554420, -90.2992910:moonrise hotel”: [“”, 1287040737182], ... }, }, “record-index-meta”: { “layer-name:37.875, -90:40.25, -101.25”: { “split”: [“false”, 1287040737182], }, “layer-name:37.875, -90:42.265, -101.25”: { “split”: [“true”, 1287040737182], “child-left”: [“layer-name:37.875, -90:40.25, -101.25”, 1287040737182], “child-right”: [“layer-name:40.25, -90:42.265, -101.25”, 1287040737182], } } }
  80. 80. DATA MODEL (same as slide 79)
  81. 81. DATA MODEL (same as slide 79)
  82. 82. SPLITTINGIT’S PRETTY MUCH JUST A CONCURRENT TREE Splitting shouldn’t lock the tree for reads or writes and failures shouldn’t cause corruption • Splits are optimistic, idempotent, and fail-forward • Instead of locking, writes are replicated to the splitting node and the relevant child[ren] while a split operation is taking place • Cleanup occurs after the split is completed and all interested nodes are aware that the split has occurred • Cassandra writes are idempotent, so splits are too - if a split fails, it is simply be retried Split size: A Tunable knob for balancing locality and distributedness The other hard problem with concurrent trees is rebalancing - we just don’t do it! (more on this later)
  83. 83. THE ROOT IS HOT MIGHT BE A DEAL BREAKER For a tree to be useful, it has to be traversed • Typically, tree traversal starts at the root • Root is the only discoverable node in our tree Traversing through the root meant reading the root for every read or write below it - unacceptable • Lots of academic solutions - most promising was a skip graph, but that required O(n log(n)) data - also unacceptable • Minimum tree depth was proposed, but then you just get multiple hot-spots at your minimum depth nodes
  84. 84. BACK TO THE BOOKSLOTS OF ACADEMIC WORK ON THIS TOPIC But academia is obsessed with provable, deterministic, asymptotically optimal algorithms And we only need something that is probably fast enough most of the time (for some value of “probably” and “most of the time”) • And if the probably good enough algorithm is, you know... tractable... one might even consider it qualitatively better!
  85. 85. - THE ROOT -A HOT SPOT AND A SPOF
  86. 86. We have We want
  87. 87. THINKING HOLISTICALLYWE OBSERVED THAT Once a node in the tree exists, it doesn’t go away Node state may change, but that state only really matters locally - thinking a node is a leaf when it really has children is not fatalSO... WHAT IF WE JUST CACHED NODES THATWERE OBSERVED IN THE SYSTEM!?
  88. 88. CACHE ITSTUPID SIMPLE SOLUTION Keep an LRU cache of nodes that have been traversed Start traversals at the most selective relevant node If that node doesn’t satisfy you, traverse up the tree Along with your result set, return a list of nodes that were traversed so the caller can add them to its cache
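
A stripped-down sketch of the "start at the most selective cached node" rule, with a namedtuple standing in for a real index node; the bounding box values echo the layer on slide 79:

    from collections import namedtuple

    Node = namedtuple('Node', ['key', 'box'])   # box = (west, south, east, north)

    def covers(box, point):
        west, south, east, north = box
        return west <= point[0] <= east and south <= point[1] <= north

    def area(box):
        west, south, east, north = box
        return (east - west) * (north - south)

    def start_key(cache, point, root_key='root'):
        # Most selective (smallest) cached node covering the point, else the root.
        hits = [n for n in cache.values() if covers(n.box, point)]
        return min(hits, key=lambda n: area(n.box)).key if hits else root_key

    cache = {
        'root':    Node('root',    (-180.0, -90.0, 180.0, 90.0)),
        'midwest': Node('midwest', (-101.25, 37.875, -90.0, 40.25)),
    }
    print(start_key(cache, (-90.3, 38.65)))   # -> 'midwest'; the read never touches the root

If the chosen node turns out not to satisfy the query, the caller climbs toward the root and, as slide 88 says, caches every node it touched along the way.
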
  89. 89. TRAVERSALNEAREST NEIGHBOR o o o xo
  90. 90. TRAVERSALNEAREST NEIGHBOR o o o xo
  91. 91. TRAVERSALNEAREST NEIGHBOR o o o xo
  92. 92. KEY CHARACTERISTICSPERFORMANCE Best case on the happy path (everything cached) has zero read overhead Worst case, with nothing cached, O(log(n)) read overheadRE-BALANCING SEEMS UNNECESSARY! Makes worst case more worser, but so far so good
  93. 93. DISTRIBUTED TREESUPPORTED QUERIES EXACT MATCH RANGE PROXIMITY SOMETHING ELSE I HAVEN’T EVEN HEARD OF
  94. 94. DISTRIBUTED TREE SUPPORTED QUERIES EXACT MATCH RANGE PROXIMITY SOMETHING ELSE I HAVEN’T EVEN HEARD OF... in MULTIPLE DIMENSIONS!
  95. 95. LIFE OF A REQUEST
  96. 96. THE BIRDS ‘N THE BEES [request path diagram: ELB, redundant gates, redundant services, cass, redundant worker pools, index]
  97. 97. THE BIRDS ‘N THE BEES ELB gate service cass worker pool index
  98. 98. THE BIRDS ‘N THE BEES ELB (load balancing; AWS service) gate service cass worker pool index
  99. 99. THE BIRDS ‘N THE BEES ELB (load balancing; AWS service) gate (authentication; forwarding) service cass worker pool index
  100. 100. THE BIRDS ‘N THE BEES ELB (load balancing; AWS service) gate (authentication; forwarding) service (business logic - basic validation) cass worker pool index
  101. 101. THE BIRDS ‘N THE BEES ELB (load balancing; AWS service) gate (authentication; forwarding) service (business logic - basic validation) cass (record storage) worker pool index
  102. 102. THE BIRDS ‘N THE BEES ELB (load balancing; AWS service) gate (authentication; forwarding) service (business logic - basic validation) cass (record storage) worker pool (business logic - storage/indexing) index
  103. 103. THE BIRDS ‘N THE BEES ELB (load balancing; AWS service) gate (authentication; forwarding) service (business logic - basic validation) cass (record storage) worker pool (business logic - storage/indexing) index (awesome sauce for querying)
  104. 104. ELB•Traffic management •Control which AZs are serving traffic •Upgrades without downtime •Able to remove an AZ, upgrade, test, replace•API-level failure scenarios •Periodically runs healthchecks on nodes •Removes nodes that fail
  105. 105. GATE•Basic auth•HTTP proxy to specific services •Services are independent of one another •Auth is Decoupled from business logic•First line of defense •Very fast, very cheap •Keeps services from being overwhelmed by poorly authenticated requests
  106. 106. RABBITMQ •Decouple accepting writes from performing the heavy lifting •Don’t block the client while we write to db/index •Flexibility in the event of degradation further down the stack •Queues can hold a lot, and can keep accepting writes throughout an incident •Heterogeneous consumers - pass the same message through multiple code paths easily
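
A sketch of that decoupled write path with the pika client; the queue name and the 202 response are made up for the example:

    import json
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='record-writes', durable=True)

    def accept_write(record):
        # Publish and return immediately; a worker pool does the db/index writes later.
        channel.basic_publish(
            exchange='',
            routing_key='record-writes',
            body=json.dumps(record),
            properties=pika.BasicProperties(delivery_mode=2),   # persist the message
        )
        return 202   # "accepted": the client never waits on Cassandra or the index
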
  107. 107. AWS
  108. 108. EC2• Security groups• Static IPs• Choose your data center• Choose your instance type• On-demand vs. Reserved
  109. 109. ELASTIC BLOCK STORE (EBS) • Storage devices that can be anywhere from 1GB to 1TB • Snapshotting • Automatic replication • Mobility
  110. 110. ELASTIC LOAD BALANCING• Distribute incoming traffic• Automatic scaling• Health checks
  111. 111. SIMPLE STORAGE SERVICE• File sizes can be up to 5TBs• Unlimited amount of files• Individual read/write access credentials
  112. 112. RELATIONAL DATABASE SERVICE• MySQL in the cloud• Manages replication• Specify instance types based on need• Snapshots
  113. 113. CONFIGURATION MANAGEMENT
  114. 114. WHY IS THIS NECESSARY?• Easier DevOps integration• Reusable modules• Infrastructure as code• One word: WINNING
  115. 115. POSSIBLE OPTIONS
  116. 116. SAMPLE MANIFEST
    # /root/learning-manifests/apache2.pp
    package { 'apache2':
      ensure => present;
    }
    file { '/etc/apache2/apache2.conf':
      ensure => file,
      mode   => 600,
      notify => Service['apache2'],
      source => '/root/learning-manifests/apache2.conf',
    }
    service { 'apache2':
      ensure    => running,
      enable    => true,
      subscribe => File['/etc/apache2/apache2.conf'],
    }
  117. 117. AN EXAMPLE TERMINAL
  118. 118. CONTINUOUSINTEGRATION
  119. 119. GET ‘ER DONE• Revision control• Automate build process• Automate testing process• Automate deployment Local code changes should result in production deployments.
  120. 120. DON’T FORGET TO DEBIANIZE • All codebases must be debianized • If an open source project isn’t debianized yet, fork the repo and do it yourself! • Take the time to teach others • Debian directories can easily be reused after a simple search and replace
  121. 121. REPOMANHTTPS://GITHUB.COM/SYNACK/REPOMAN.GIT repoman upload myrepo sg-example.tar.gz repoman promote development/sg-example staging
  122. 122. HUDSON
  123. 123. MAINTAINING MULTIPLEENVIRONMENTS• Run unit tests in a development environment• Promote to staging• Run system tests in a staging environment• Run consumption tests in a staging environment• Promote to production Congratz, you have now just automated yourself out of a job.
  124. 124. THE MONEY MAKER Github Plugin
  125. 125. TYING IT ALL TOGETHER TERMINAL
  126. 126. FLUME Flume is a distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data. syslog on steroids
  127. 127. DATA-FLOWS
  128. 128. AGENTS Each physical host runs logical nodes, each with a source and a sink. For example: twitter_stream: twitter(“dsmitts”, “mypassword”, “url”) -> agentSink(35853); on i-192df98, tail: tail(“/var/log/nginx/access.log”) -> agentSink(35853); hdfs_writer: collectorSource(35853) -> collectorSink("hdfs://namenode.sg.com/bogus/", "logs")
  129. 129. RELIABILITY END-TO-END STORE ON FAILURE BEST EFFORT
  130. 130. GETTIN’ JIGGY WIT ITCustom Decorators
  131. 131. HOW DO WE USE IT?
  132. 132. PERSONAL EXPERIENCES • #flume • Automation was gnarly • It’s never a good day when Eclipse is involved • Resource hog (at first)
  133. 133. We’ buildg kick-s ols f vops  x,tpt,  csume da cnect  a locn MARCH 2011
