What every developer should know about database scalability, PyCon 2010

  1. Database scalability – Jonathan Ellis / @spyced
  2. What scaling means • throughput (operations / s) per money
  3. “It scales, but” • It scales, but we have to keep adding people to our ops team • It scales, but when the NetApp runs out of capacity we're screwed • It scales, but it's not fast
  4. 2000 vs. 2010
  5. Problem 1 (diagram: Index, Data)
  6. Problem 2
  7. Problem 3
  8. Two kinds of operations • Reads • Writes
  9. Band-aids • (Not actually scaling your database)
  10. Problem 1: btrees suck • ~100 MB/s sequential writes • ~1.5 MB/s random writes – ~10ms to seek – ~5ms on expensive 15k rpm disks
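A quick sanity check of those numbers (a sketch in Python; the 16KB write size is an assumption, roughly one InnoDB page, not something the slide states):

    # ~10ms per seek means ~100 random I/Os per second.
    seek_time = 0.010              # seconds per random seek
    write_size = 16 * 1024         # bytes per random write (assumed)

    ios_per_second = 1 / seek_time                # ~100
    random_bytes_per_s = ios_per_second * write_size
    print(random_bytes_per_s / 1e6)               # ~1.6 MB/s, matching the slide's ~1.5

Sequential writes skip the seek entirely, which is where the ~100 MB/s figure comes from.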
  11. Magic bullets • SSD – ~100MB/s random writes (of 4KB) – ~$800 for 64GB • vs ~$200 for 2 TB rotational • Moore's law – a.k.a. the 37signals approach
  12. Caching • Memcached • Ehcache • etc. (diagram: cache in front of the DB)
  13. Partitioning a (read) cache
      i = hash(key) % len(servers)
      server = servers[i]
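Expanded into a runnable sketch (the server names are hypothetical), this also shows the trap that motivates the next slide: add one server and most keys hash to a different node, so the cache is effectively emptied.

    servers = ['cache1', 'cache2', 'cache3']

    def server_for(key, servers):
        i = hash(key) % len(servers)
        return servers[i]

    keys = ['user:%d' % n for n in range(1000)]
    before = dict((k, server_for(k, servers)) for k in keys)
    after = dict((k, server_for(k, servers + ['cache4'])) for k in keys)
    moved = sum(1 for k in keys if before[k] != after[k])
    print('%d of 1000 keys moved' % moved)   # roughly 750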
  14. The cold start problem
  15. Consistent hashing
  16. Consistent hashing • libketama for memcached – Multiple “virtual nodes” per machine for better load balancing
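A minimal consistent hash ring with virtual nodes (a sketch of the idea, not libketama's exact wire-compatible algorithm):

    import bisect
    import hashlib

    class Ring(object):
        def __init__(self, nodes, vnodes=100):
            # Each machine appears `vnodes` times on the ring.
            self.ring = sorted((self._hash('%s:%d' % (node, i)), node)
                               for node in nodes for i in range(vnodes))
            self.hashes = [h for h, node in self.ring]

        def _hash(self, s):
            return int(hashlib.md5(s.encode('utf-8')).hexdigest(), 16)

        def node_for(self, key):
            # Walk clockwise to the first virtual node at or past the key.
            i = bisect.bisect(self.hashes, self._hash(key)) % len(self.ring)
            return self.ring[i][1]

    ring = Ring(['cache1', 'cache2', 'cache3'])
    print(ring.node_for('user:42'))

Adding a machine now moves only about 1/N of the keys, and the virtual nodes spread that load across all the surviving machines.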
  17. Cache coherence strategies • Write-through • Explicit invalidation • Implicit invalidation
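The first two strategies side by side, as a sketch (`cache` and `db` stand for hypothetical memcached and database clients; none of this is from the slides):

    # Write-through: update the DB and the cache together, so reads
    # never see a stale value.
    def save_user(db, cache, user_id, data):
        db.execute('UPDATE users SET data=%s WHERE id=%s', (data, user_id))
        cache.set('user:%s' % user_id, data)

    # Explicit invalidation: drop the key and let the next read
    # repopulate it from the DB.
    def save_user2(db, cache, user_id, data):
        db.execute('UPDATE users SET data=%s WHERE id=%s', (data, user_id))
        cache.delete('user:%s' % user_id)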
  18. Cache set invalidation
      def get_cached_cart(cart, offset):
          key = 'cart:%s:%s' % (cart, offset)  # e.g. cart:13:10
          return get(key)

      def invalidate_cart(cart):
          ?  # each offset is cached under its own key; which keys do we delete?
  19. Set invalidation 2
      def get_cached_cart(cart, offset):
          prefix = get_cart_prefix(cart)
          key = '%s:%s' % (prefix, offset)
          return get(key)

      def invalidate_cart(cart):
          # drop the stored prefix; every old key becomes unreachable
          delete_cart_prefix(cart)
      http://www.aminus.org/blogs/index.php/2007/1
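A runnable version of the trick (a sketch using a plain dict in place of memcached; get_cached_cart and invalidate_cart follow the slide, the other helper names are assumed):

    import uuid

    cache = {}   # stand-in for a memcached client

    def get_cart_prefix(cart):
        # The prefix is itself cached; first access mints a new generation.
        pkey = 'cart_prefix:%s' % cart
        if pkey not in cache:
            cache[pkey] = uuid.uuid4().hex
        return cache[pkey]

    def delete_cart_prefix(cart):
        cache.pop('cart_prefix:%s' % cart, None)

    def put_cached_cart(cart, offset, page):
        cache['%s:%s' % (get_cart_prefix(cart), offset)] = page

    def get_cached_cart(cart, offset):
        return cache.get('%s:%s' % (get_cart_prefix(cart), offset))

    def invalidate_cart(cart):
        delete_cart_prefix(cart)   # one delete invalidates every offset

After invalidate_cart, the next get_cart_prefix mints a fresh prefix, so all the old per-offset keys simply stop matching and age out of the cache.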
  20. Pop quiz • Can caching reduce write load?
  21. Write-back caching
      foo.bar = 1   # absorbed by the cache
      foo.bar = 2   # absorbed by the cache
      foo.bar = 4   # absorbed by the cache
      UPDATE foos SET bar=4 WHERE id=3   # only the final value reaches the DB
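A sketch of the coalescing behaviour (`db` is a hypothetical DB-API-style client; a real write-back layer also needs flush-on-eviction and crash handling):

    class WriteBackCache(object):
        def __init__(self, db):
            self.db = db
            self.dirty = {}   # (table, id, column) -> latest value

        def set(self, table, row_id, column, value):
            # Overwrites any earlier pending write to the same cell.
            self.dirty[(table, row_id, column)] = value

        def flush(self):
            for (table, row_id, column), value in self.dirty.items():
                self.db.execute(
                    'UPDATE %s SET %s = %%s WHERE id = %%s' % (table, column),
                    (value, row_id))
            self.dirty.clear()

    # cache.set('foos', 3, 'bar', 1) ... set(..., 2) ... set(..., 4)
    # cache.flush() issues a single UPDATE foos SET bar=4 WHERE id=3

This is why write-back can reduce write load, and also why the next slide says Achtung: anything still sitting in `dirty` is lost if the cache node dies.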
  22. Achtung! • Commercial solutions – Terracotta • Roll your own – Memcached – ehcache – Zookeeper?
  23. Scaling reads
  24. Replication
  25. Types of replication • Master → slave – Master → slave → other slaves • Master ↔ master – multi-master
  26. Types of replication 2 • Synchronous • Asynchronous
  27. Synchronous master/slave • ?
  28. Synchronous multi-master • Synchronous = slow(er) • Complexity (e.g. 2PC, two-phase commit) • PGCluster • Oracle
  29. Asynchronous master/slave • Easiest • MySQL replication • Slony, Londiste, WAL shipping • Tungsten • MongoDB • Redis
  30. Asynchronous multi-master • Conflict resolution – O(N³) or O(N²) as you add nodes – http://research.microsoft.com/~gray/replicas.ps • Bucardo • MySQL Cluster • Tokyo Cabinet
  31. Achtung! • Asynchronous replication can lose data if the master fails
  32. “Architecture” • Primarily about how you cope with failure scenarios
  33. Replication does not scale writes
  34. Scaling writes
  35. Partitioning
  36. Partitioning • Sharding / horizontal / key • Functional / vertical
  37. Key-based partitioning • PK of “root” table controls destination – e.g. user id • Retains referential integrity
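As a sketch (the shard hosts are made up): route every query by the root key, and all of a user's rows land together.

    SHARDS = ['db1.example.com', 'db2.example.com',
              'db3.example.com', 'db4.example.com']

    def shard_for(user_id):
        # The root table's PK picks the node; blogs and comments for
        # this user live on the same shard, so their FKs stay local.
        return SHARDS[user_id % len(SHARDS)]

(Modulo is the simplest possible router; slide 40 covers better ways to pick the server.)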
  38. Example: blogger.com (diagram: Users, Blogs, Comments)
  39. Example: blogger.com (diagram: Users, Blogs, Comments split into Comments and Comments')
  40. What server gets each key? • Consistent hashing • Central db that knows what server owns a key – Makes adding machines easier (than consistent hashing) – Single point of failure & query bottleneck – MongoDB
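The central-directory approach, as a sketch (the dict stands in for a small metadata database; the names are made up):

    # key -> owning server; moving data is just a directory update.
    directory = {'user:1': 'db1.example.com',
                 'user:2': 'db2.example.com'}

    def server_for(key):
        return directory[key]

    def rebalance(key, new_server):
        # copy the data, then flip one entry -- no mass re-hashing
        directory[key] = new_server

Every lookup now consults the directory first, which is the single point of failure and query bottleneck the slide warns about (this is the role the slide assigns to MongoDB).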
  41. Vertical partitioning • Tables live on separate nodes • Often the table that was too big to keep with the others eventually gets too big for a single node, too
  42. Both (diagram combining vertical segments and sharding: Users1, Users2, Products1, Trans1, Trans2, Trans3)
  43. Partitioning
  44. Partitioning with replication
  45. What these have in common • Not application-transparent • Ad hoc • Error-prone • Manpower-intensive
  46. To summarize • Scaling reads sucks • Scaling writes sucks more
  47. Case in point
  48. NoSQL • Distributed (partitioned, scalable) – Cassandra – HBase – Voldemort • Not distributed – Neo4j – CouchDB – Redis
  49. * Distributed databases • Data is automatically partitioned • Transparent to application • Add capacity without downtime • Failure tolerant *Like Bigtable, not like Lotus Notes
  50. Two famous papers • Bigtable: A Distributed Storage System for Structured Data, 2006 • Dynamo: Amazon's Highly Available Key-value Store, 2007
  51. Two approaches • Bigtable: “How can we build a distributed database on top of GFS?” • Dynamo: “How can we build a distributed hash table appropriate for the data center?”
  52. Cassandra • Dynamo-style replication • Bigtable ColumnFamily data model • High-performance – Digg phasing out memcached • Open space, Saturday at 17:30
  53. Questions
