Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Database scalability




Jonathan Ellis / @spyced
operations / s
        = throughput




money
                         What scaling means
“It scales, but”
• It scales, but we have to keep adding
    people to our ops team

• It scales, but when the netapp runs...
2000   2010
Problem 1

            Index




             Data
Problem 2
Problem 3
Two kinds of operations
• Reads

• Writes
Band-aids
• (Not actually scaling your database)
Problem 1: btrees suck
• ~100 MB/s sequential writes

• ~1.5 MB/s random writes
    – ~10ms to seek
    – ~5ms on expensiv...
Magic bullets
• SSD
    – ~100MB/s random writes (of 4KB)
    – ~$800 for 64GB
        • vs ~$200 for 2 TB rotational

• M...
Caching
• Memcached

• Ehcache

• etc                   cache




                         DB
Partitioning a (read) cache
i = hash(key) % len(servers)

server = servers[i]
The cold start problem
Consistent hashing
Consistent hashing
• libketema for memcached
    – Multiple “virtual nodes” per machine for
       better load balancing
Cache coherence strategies
• Write-through

• Explicit invalidation

• Implicit invalidation
Cache set invalidation
def get_cached_cart(cart, offset):
  key = 'cart:%s:%s' % (cart, offset)
  return get(key) # cart:1...
Set invalidation 2
def get_cached_cart(cart, offset):
  prefix = get_cart_prefix(cart)
  key = '%s:%s' % (prefix, offset)
...
Pop quiz
• Can caching reduce write load?
Write-back caching

foo.bar = 1


foo.bar = 2


foo.bar = 4                                cache


              UPDATE fo...
Achtung!
• Commercial solutions
    – Terracotta

• Roll your own
    – Memcached
    – ehcache
    – Zookeeper?
Scaling reads
Replication
Types of replication
• Master → slave
    – Master → slave → other slaves

• Master ↔ master
    – multi-master
Types of replication 2
• Synchronous

• Asynchronous
Synchronous master/slave
• ?
Synchronous multi-master
• Synchronous = slow(er)

• Complexity (e.g. 2pc)

• PGCluster

• Oracle
Asynchronous master/slave
• Easiest

• MySQL replication

• Slony, Londiste, WAL shipping

• Tungsten

• MongoDB

• Redis
Asynchronous multi-master
• Conflict resolution
    – O(N3) or O(N2) as you add nodes
    – http://research.microsoft.com/...
Achtung!
• Asynchronous replication can lose data if
   the master fails
“Architecture”
• Primarily about how you cope with failure
    scenarios
Replication does not scale writes
Scaling writes
Partitioning
Partitioning
• sharding / horizontal / key

• Functional / vertical
Key based partitioning
• PK of “root” table controls destination
    – e.g. user id

• Retains referential integrity
Example: blogger.com

Users

Blogs

Comments
Example: blogger.com

Users

Blogs      Comments'


Comments
What server gets each key?
• Consistent hashing

• Central db that knows what server owns a
   key
    – Makes adding mach...
Vertical partitioning
• Tables on separate nodes

• Often a table that is too big to keep with
   the other tables, gets t...
Users2       Users1




                    Products1
                                Vertical segments
                  ...
Partitioning
Partitioning with replication
What these have in common
• Not application-transparent

• Ad hoc

• Error-prone

• Manpower-intensive
To summarize
• Scaling reads sucks

• Scaling writes sucks more
Case in point
NoSQL
• Distributed (partitioned, scalable)
    – Cassandra
    – HBase
    – Voldemort

• Not distributed
    – Neo4j
   ...
*
      Distributed databases
• Data is automatically partitioned

• Transparent to application

• Add capacity without do...
Two famous papers
• Bigtable: A distributed storage system for
    structured data, 2006

• Dynamo: amazon's highly availa...
Two approaches
• Bigtable: “How can we build a distributed
    database on top of GFS?”

• Dynamo: “How can we build a dis...
Cassandra
• Dynamo-style replication

• Bigtable ColumnFamily data model

• High-performance
    – Digg phasing out memcac...
Questions
What every developer should know about database scalability, PyCon 2010
What every developer should know about database scalability, PyCon 2010
Upcoming SlideShare
Loading in …5
×

What every developer should know about database scalability, PyCon 2010

10,803 views

Published on

  • Be the first to comment

What every developer should know about database scalability, PyCon 2010

  1. 1. Database scalability Jonathan Ellis / @spyced
  2. 2. operations / s = throughput money What scaling means
  3. 3. “It scales, but” • It scales, but we have to keep adding people to our ops team • It scales, but when the netapp runs out of capacity we're screwed • It scales, but it's not fast
  4. 4. 2000 2010
  5. 5. Problem 1 Index Data
  6. 6. Problem 2
  7. 7. Problem 3
  8. 8. Two kinds of operations • Reads • Writes
  9. 9. Band-aids • (Not actually scaling your database)
  10. 10. Problem 1: btrees suck • ~100 MB/s sequential writes • ~1.5 MB/s random writes – ~10ms to seek – ~5ms on expensive 15k rpm disks
  11. 11. Magic bullets • SSD – ~100MB/s random writes (of 4KB) – ~$800 for 64GB • vs ~$200 for 2 TB rotational • Moore's law – Aka the 37signals approach
  12. 12. Caching • Memcached • Ehcache • etc cache DB
  13. 13. Partitioning a (read) cache i = hash(key) % len(servers) server = servers[i]
  14. 14. The cold start problem
  15. 15. Consistent hashing
  16. 16. Consistent hashing • libketema for memcached – Multiple “virtual nodes” per machine for better load balancing
  17. 17. Cache coherence strategies • Write-through • Explicit invalidation • Implicit invalidation
  18. 18. Cache set invalidation def get_cached_cart(cart, offset): key = 'cart:%s:%s' % (cart, offset) return get(key) # cart:13:10 def invalidate_cart(cart): ?
  19. 19. Set invalidation 2 def get_cached_cart(cart, offset): prefix = get_cart_prefix(cart) key = '%s:%s' % (prefix, offset) return get(key) def invalidate_cart(cart): del(get_cart_prefix(cart)) http://www.aminus.org/blogs/index.php/2007/1
  20. 20. Pop quiz • Can caching reduce write load?
  21. 21. Write-back caching foo.bar = 1 foo.bar = 2 foo.bar = 4 cache UPDATE foos SET bar=4 WHERE id=3 DB
  22. 22. Achtung! • Commercial solutions – Terracotta • Roll your own – Memcached – ehcache – Zookeeper?
  23. 23. Scaling reads
  24. 24. Replication
  25. 25. Types of replication • Master → slave – Master → slave → other slaves • Master ↔ master – multi-master
  26. 26. Types of replication 2 • Synchronous • Asynchronous
  27. 27. Synchronous master/slave • ?
  28. 28. Synchronous multi-master • Synchronous = slow(er) • Complexity (e.g. 2pc) • PGCluster • Oracle
  29. 29. Asynchronous master/slave • Easiest • MySQL replication • Slony, Londiste, WAL shipping • Tungsten • MongoDB • Redis
  30. 30. Asynchronous multi-master • Conflict resolution – O(N3) or O(N2) as you add nodes – http://research.microsoft.com/~gray/replicas.ps • Bucardo • MySQL Cluster • Tokyo Cabinet
  31. 31. Achtung! • Asynchronous replication can lose data if the master fails
  32. 32. “Architecture” • Primarily about how you cope with failure scenarios
  33. 33. Replication does not scale writes
  34. 34. Scaling writes
  35. 35. Partitioning
  36. 36. Partitioning • sharding / horizontal / key • Functional / vertical
  37. 37. Key based partitioning • PK of “root” table controls destination – e.g. user id • Retains referential integrity
  38. 38. Example: blogger.com Users Blogs Comments
  39. 39. Example: blogger.com Users Blogs Comments' Comments
  40. 40. What server gets each key? • Consistent hashing • Central db that knows what server owns a key – Makes adding machines easier (than – Single point of failure & query bottleneck – MongoDB
  41. 41. Vertical partitioning • Tables on separate nodes • Often a table that is too big to keep with the other tables, gets too big for a single node
  42. 42. Users2 Users1 Products1 Vertical segments Both Trans3 Trans2 Trans1 Sharding
  43. 43. Partitioning
  44. 44. Partitioning with replication
  45. 45. What these have in common • Not application-transparent • Ad hoc • Error-prone • Manpower-intensive
  46. 46. To summarize • Scaling reads sucks • Scaling writes sucks more
  47. 47. Case in point
  48. 48. NoSQL • Distributed (partitioned, scalable) – Cassandra – HBase – Voldemort • Not distributed – Neo4j – CouchDB – Redis
  49. 49. * Distributed databases • Data is automatically partitioned • Transparent to application • Add capacity without downtime • Failure tolerant *Like Bigtable, not like Lotus Notes
  50. 50. Two famous papers • Bigtable: A distributed storage system for structured data, 2006 • Dynamo: amazon's highly available key- value store, 2007
  51. 51. Two approaches • Bigtable: “How can we build a distributed database on top of GFS?” • Dynamo: “How can we build a distributed hash table appropriate for the data center?”
  52. 52. Cassandra • Dynamo-style replication • Bigtable ColumnFamily data model • High-performance – Digg phasing out memcached • Open space, Saturday at 17:30
  53. 53. Questions

×