
Five Lessons in Distributed Databases

1. If it’s not SQL, it’s not a database.
2. It takes 5+ years to build a database.
3. Listen to your users.
4. Too much magic is a bad thing.
5. It’s the cloud, stupid.


  1. Five Lessons in Distributed Databases. Jonathan Ellis, CTO, DataStax. © DataStax, All Rights Reserved. Confidential.
  2. 1. If it’s not SQL, it’s not a database
  3. A brief history of NoSQL
     ● Early 2000s: people hit limits on vertical scaling, start sharding RDBMSes
     ● 2006, 2007: BigTable, Dynamo papers
     ● 2008-2010: Explosion of scale-out systems
       ○ Voldemort, Riak, Dynomite, FoundationDB, CouchDB
       ○ Cassandra, HBase, MongoDB
  4. One small problem
  5. Cassandra’s experience
     ● Thrift RPC “drivers” too low level
     ● Fragmented: Hector, Pelops, Astyanax
     ● Inconsistent across language ecosystems
  6. (image-only slide)
  7. Solution: CQL
     ● 2011: Cassandra 0.8 introduces CQL 1.0
     ● 2012: Cassandra 1.1 introduces CQL 3.0
     ● 2013: Cassandra 1.2 adds collections
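     To make the contrast with the Thrift-era drivers concrete, here is a small CQL sketch (the table and values are hypothetical illustrations, not from the deck) of the uniform schema/query surface every driver can target, including the collection types added in 1.2:

         -- hypothetical table for illustration
         CREATE TABLE users (
             user_id uuid PRIMARY KEY,
             name    text,
             emails  set<text>   -- collection type, added in Cassandra 1.2
         );

         INSERT INTO users (user_id, name, emails)
         VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'jbellis',
                 {'jbellis@example.com'});

         SELECT name, emails FROM users
         WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;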
  8. Today
     ● Cassandra: CQL
     ● CosmosDB: “SQL”
     ● Cloud Spanner: “SQL”
     ● Couchbase: N1QL
     ● HBase: Phoenix SQL (Java only)
     ● DynamoDB: REST/JSON
     ● MongoDB: BSON
  9. 2. It takes 5+ years to build a database
  10. Curt Monash:
     Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars. That’s if things go extremely well.
     Rule 2: You aren’t an exception to Rule 1.
  11. Aside: Mistakes I made starting DataStax
     ● Stayed at Rackspace too long
     ● Raised a $2.5M series A
     ● Waited a year to get serious about enterprise sales
     ● Changed the company name
     ● Brisk
  12. Examples (Curt)
     ● Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
     ● Mixed workload management is harder than you’re assuming it is.
     ● Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.
  13. Examples (Cassandra)
     ● Hinted handoff
     ● Repair
     ● Counters
     ● Paxos
     ● Test suite
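     For context (my gloss, not from the deck): the Paxos bullet refers to the consensus implementation behind CQL’s lightweight transactions, which took several releases to get right. A minimal sketch of what it enables, against the hypothetical users table above:

         -- Paxos-backed lightweight transaction (Cassandra 2.0+):
         -- the insert succeeds only if no row with this key exists yet
         INSERT INTO users (user_id, name)
         VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'jbellis')
         IF NOT EXISTS;

         -- conditional update, same machinery
         UPDATE users SET name = 'ellis'
         WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204
         IF name = 'jbellis';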
  14. Aside: Fallout (Jepsen at Scale)
     ● Ensemble - a set of clusters that is brought up/torn down each test
       ○ Server Cluster - Cassandra/DSE
       ○ Client Cluster - load generators
       ○ Observer Cluster - records live information from the clusters (OpsCenter/Graphite)
       ○ Controller - Fallout
     ● Workload - the guts of the test
       ○ Phases - run sequentially; each contains one or more modules that run in parallel for that phase
       ○ Checkers - run after all phases and verify the data emitted by modules
       ○ Artifact Checkers - run against collected artifacts to look for correctness problems
  15. A simple Fallout workload

         ensemble:
           server:
             node.count: 3
             provisioner:
               name: local
             configuration_manager:
               name: ccm
               properties:
                 cassandra.version: 3.0.0
           client: server   # use server cluster
         phases:
           - insert_workload:
               module: stress
               properties:
                 iterations: 1m
                 type: write
                 rf: 3
             gossip_updown:
               module: nodetool
               properties:
                 command: disablegossip
                 secondary.command: enablegossip
                 sleep.seconds: 10
                 sleep.randomize: 20
           - read_workload:
               module: stress
               properties:
                 iterations: 1m
                 type: read
         checkers:
           verify_success:
             checker: nofail

     1. Start a 3-node ccm cluster.
     2. Insert data while bringing gossip on the nodes up and down.
     3. Read and check the data.
     4. Verify none of the steps failed.
     Note: to move from ccm to EC2, we only need to change the ensemble section.
  16. 5-7 years?
     ● Cassandra became an Apache TLP in Feb 2010
     ● 3.0 released Fall 2015
     ● OSS is about adoption, not saving time/money
  17. 3. The customer is always right
  18. Example: sequential scans

         SELECT * FROM user_purchases WHERE purchase_date > 2000
  19. What’s wrong with this query? For 100,000 purchases, nothing. For 100,000,000 purchases, you’ll crash the server (in 2012).
  20. Solution (2012): ALLOW FILTERING

         SELECT * FROM user_purchases WHERE purchase_date > 2000 ALLOW FILTERING
  21. Better solution (2013): Paging
     ● Build the resultset incrementally and “page” it to the client
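     From the client side, paging is transparent. A minimal sketch using the DataStax Python driver (the contact point, keyspace name, and process() helper are hypothetical):

         from cassandra.cluster import Cluster
         from cassandra.query import SimpleStatement

         session = Cluster(['127.0.0.1']).connect('shop')  # hypothetical keyspace

         # fetch_size caps how many rows the server returns per round trip
         query = SimpleStatement(
             "SELECT * FROM user_purchases WHERE purchase_date > 2000 ALLOW FILTERING",
             fetch_size=1000)

         # the driver fetches the next page as the iterator advances, so
         # neither side ever materializes the full result set
         for row in session.execute(query):
             process(row)  # hypothetical per-row handler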
  22. Example: tombstones

         INSERT INTO foo VALUES (1254, …)
         DELETE FROM foo WHERE id = 1254
         …
         SELECT * FROM foo
  23. Solution (2013): thresholds in cassandra.yaml

         tombstone_warn_threshold: 1000
         tombstone_failure_threshold: 100000
  24. Better solution (???): It’s complicated
     ● Track repair status to get rid of GCGS (gc_grace_seconds)
     ● Bring time-to-repair from “days” to “hours”
     ● Optional: improve time-to-compaction
  25. Example: joins
     ● CQL doesn’t support joins
     ● People still use client-side joins instead of denormalizing
  26. Solution (2015-???): MV (materialized views)
     ● Make it easier to denormalize
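     A minimal sketch of the idea, assuming the hypothetical user_purchases table is keyed by user_id: a materialized view (Cassandra 3.0) maintains the denormalized copy server-side, instead of the application writing two tables itself:

         CREATE MATERIALIZED VIEW purchases_by_date AS
             SELECT * FROM user_purchases
             WHERE purchase_date IS NOT NULL AND user_id IS NOT NULL
             PRIMARY KEY (purchase_date, user_id);

         -- reads can now be served by the view’s own partition key
         SELECT * FROM purchases_by_date WHERE purchase_date = 2012;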
  27. Better solution (???): actually add joins
     ● Less controversial: shared-partition joins
     ● More controversial: cross-partition joins
     ● CosmosDB, Spanner
  28. A note on configurability
  29. 4. Too much magic is a bad thing
  30. Not (just) about vendors overpromising
     ● “Our database isn’t subject to the limits of the CAP theorem”
     ● “Our queue can guarantee exactly once delivery”
     ● “We’ll give you 99.99% uptime*”
  31. Magic can be bad even when it works
  32. Cloud Spanner analysis excerpt
     “Spanner’s architecture implies that writes will be significantly slower than reads due to the need to coordinate across multiple replicas and avoid overlapping time bounds, and that is what we see in the original 2012 Spanner paper. … Besides write performance in isolation, because Spanner uses pessimistic locking to achieve ACID, reads are locked out of rows (partitions?) that are in the process of being updated. Thus, write performance challenges can spread to causing problems with reads as well.”
  33. Cloud Spanner
  34. Auto-scaling in DynamoDB
     ● Request capacity tied to “partitions” [pp]
       ○ pp count = max(rc / 3000, wc / 1000, st / 10 GB)
     ● Subtle implication: capacity per pp decreases as storage volume increases
       ○ Non-uniform: pp request capacity is halved when a shard splits
     ● Subtle implication 2: bulk loads will wreck your planning
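     A minimal sketch of that rule of thumb in code (my own illustration, not an AWS API; rc and wc are provisioned read/write capacity units, st is storage):

         import math

         def partition_count(rc, wc, storage_gb):
             # illustrative helper: pp = max(rc / 3000, wc / 1000, storage / 10 GB)
             return max(math.ceil(rc / 3000),
                        math.ceil(wc / 1000),
                        math.ceil(storage_gb / 10))

         # the bulk-load scenario on the next slide: 55,000 WCU and 200 GB of items
         print(partition_count(0, 55000, 200))   # -> 55 partitions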
  35. “Best practices for tables”
     ● Bulk load 200M items = 200 GB
     ● Target 60 minutes = 55,000 write capacity = 55 pps
     ● Post-bulk-load steady state
     ● 1000 req/s = 2 req/pp = 2 req/(3.6M items)
     ● No way to reduce partition count
  36. Ravelin, 2017
     “You construct a table which uses a customer ID as partition key. You know your customer IDs are unique and should be uniformly distributed across nodes. Your business has millions of customers and no single customer can do so many actions so quickly that the individual could create a hot key. Under this key you are storing around 2KB of data. This sounds reasonable. This will not work at scale in DynamoDB.”
  37. How much magic is too much?
     ● Joins: apparently okay
     ● Auto-scaling: apparently also okay
     ● Automatic partitioning: not okay
     ● Really slow ACID: not okay (?)
     ● Why?
     ● How do we make the system more transparent without inflicting an unnecessary level of detail on the user?
  38. 5. It’s the cloud, stupid
  39. September 2011
  40. March 2012
  41. March 2012
  42. March 2012
  43. The cloud is here. Now what?
  44. Cloud-first architecture
     “The second trend will be the increased prevalence of shared-disk distributed DBMS. By ‘shared-disk’ I mean a DBMS that uses a distributed storage layer as its primary storage location, such as HDFS or Amazon’s EBS/S3 services. This separates the DBMS’s storage layer from its execution nodes. Contrast this with a shared-nothing DBMS architecture where each execution node maintains its own storage.”
  45. Cloud-first infrastructure
     ● What on-premises infrastructure can provide a cloud-like experience?
     ● Kubernetes?
     ● OpenStack?
  46. Cloud-first development
     ● Is a yearly (bi-yearly?) release process the right cadence for companies building cloud services?
  47. Cloud-first OSS
     ● What does OSS look like when you don’t work for the big three clouds?
     ● “Commons Clause” is an attempt to deal with this
       ○ (What about AGPL?)
  48. Summary
     1. If it’s not SQL, it’s not a database.
     2. It takes 5+ years to build a database.
     3. Listen to your users.
     4. Too much magic is a bad thing.
     5. It’s the cloud, stupid.
