Database scalability

What every developer should know about database scalability (PyCon 2010)

Jonathan Ellis / @spyced
What scaling means

(chart: throughput, measured in operations / s, as a function of money spent)
“It scales, but”
• It scales, but we have to keep adding
    people to our ops team

• It scales, but when the netapp runs out
    of capacity we're screwed

• It scales, but it's not fast
2000 vs. 2010
Problem 1

(diagram: an index over the data)
Problem 2
Problem 3
Two kinds of operations
• Reads

• Writes
Band-aids
• (Not actually scaling your database)
Problem 1: btrees suck
• ~100 MB/s sequential writes

• ~1.5 MB/s random writes
    – ~10ms to seek
    – ~5ms on expensive 15k rpm disks
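To see where numbers like these come from, here is a rough back-of-the-envelope calculation; the ~16 KB per random write is an assumed figure, not from the slides:

# Why random writes are so much slower than sequential ones on a
# rotational disk: every random write pays for a seek.
seek_time_s = 0.010            # ~10 ms per seek
write_size_bytes = 16 * 1024   # assumed size of one random write

random_writes_per_s = 1 / seek_time_s                        # ~100/s
random_mb_per_s = random_writes_per_s * write_size_bytes / 1e6

print('~%d random writes/s' % random_writes_per_s)           # ~100
print('~%.1f MB/s random vs ~100 MB/s sequential' % random_mb_per_s)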
Magic bullets
• SSD
    – ~100MB/s random writes (of 4KB)
    – ~$800 for 64GB
        • vs ~$200 for 2 TB rotational

• Moore's law
    – Aka the 37signals approach
Caching
• Memcached

• Ehcache

• etc

(diagram: the cache sits in front of the DB)
Partitioning a (read) cache
i = hash(key) % len(servers)

server = servers[i]
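The modulo scheme above is simple, but adding a server remaps most keys, which is exactly what triggers the cold-start problem on the next slide. A quick demonstration (server names are made up):

def server_for(key, servers):
    # Same placement rule as above; any stable hash works for the demo
    return servers[hash(key) % len(servers)]

servers = ['cache1', 'cache2', 'cache3']
grown = servers + ['cache4']

keys = ['user:%d' % i for i in range(10000)]
moved = sum(1 for k in keys
            if server_for(k, servers) != server_for(k, grown))
print('%.0f%% of keys changed servers' % (100.0 * moved / len(keys)))  # ~75%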
The cold start problem
Consistent hashing
• libketama for memcached
    – Multiple “virtual nodes” per machine for
       better load balancing (see the sketch below)
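A minimal sketch of the idea, not libketama itself: hash each server onto a ring many times (the virtual nodes) and send each key to the first server point at or after the key's hash. Class and server names are illustrative.

import bisect
import hashlib

def ring_hash(value):
    # A stable hash onto a large integer ring (unlike Python's built-in hash)
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing(object):
    def __init__(self, servers, vnodes=100):
        # Each server gets `vnodes` points on the ring for smoother balance
        self.ring = sorted((ring_hash('%s#%d' % (s, i)), s)
                           for s in servers for i in range(vnodes))
        self.points = [point for point, _ in self.ring]

    def server_for(self, key):
        # First point clockwise from the key's hash, wrapping around the ring
        i = bisect.bisect(self.points, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(['cache1', 'cache2', 'cache3'])
print(ring.server_for('cart:13:10'))

Adding a fourth server to this ring only moves roughly a quarter of the keys, instead of the ~75% the modulo scheme moves.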
Cache coherence strategies
• Write-through

• Explicit invalidation

• Implicit invalidation
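For concreteness, this is roughly how the first two strategies above look in application code; cache and db are hypothetical handles (a memcached-style client and a database connection), not any particular library:

def save_user_write_through(db, cache, user_id, name):
    # Write-through: update the database, then refresh the cached copy
    db.execute('UPDATE users SET name=%s WHERE id=%s', (name, user_id))
    cache.set('user:%s' % user_id, name)

def save_user_explicit_invalidation(db, cache, user_id, name):
    # Explicit invalidation: update the database and drop the cached copy;
    # the next read misses and repopulates it from the DB
    db.execute('UPDATE users SET name=%s WHERE id=%s', (name, user_id))
    cache.delete('user:%s' % user_id)

(Implicit invalidation is the same bookkeeping, done for you by the data-access layer rather than by hand.)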
Cache set invalidation
def get_cached_cart(cart, offset):
  key = 'cart:%s:%s' % (cart, offset)
  return get(key) # cart:13:10

def invalidate_cart(cart):
  ?   # how do we find every cart:13:<offset> key?
Set invalidation 2
def get_cached_cart(cart, offset):
  prefix = get_cart_prefix(cart)
  key = '%s:%s' % (prefix, offset)
  return get(key)

def invalidate_cart(cart):
  del(get_cart_prefix(cart))

http://www.aminus.org/blogs/index.php/2007/1
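The slide leaves get_cart_prefix undefined; one common way to implement it is to keep a small "generation" value per cart in the cache itself, so deleting that one value orphans every cached page of the cart at once. A sketch, with `cache` standing in for a memcached-style client (hypothetical names throughout):

import random
import time

def get_cart_prefix(cache, cart):
    # The prefix embeds a value that changes on every invalidation
    gen = cache.get('cart_gen:%s' % cart)
    if gen is None:
        gen = '%d.%06d' % (time.time(), random.randrange(10**6))
        cache.set('cart_gen:%s' % cart, gen)
    return 'cart:%s:%s' % (cart, gen)

def get_cached_cart(cache, cart, offset):
    key = '%s:%s' % (get_cart_prefix(cache, cart), offset)
    return cache.get(key)

def invalidate_cart(cache, cart):
    # Deleting the generation key forces get_cart_prefix to mint a new one;
    # old cart:<id>:<old gen>:<offset> keys are never read again and simply
    # age out of the cache.
    cache.delete('cart_gen:%s' % cart)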
Pop quiz
• Can caching reduce write load?
Write-back caching
• Writes land in the cache first:
    foo.bar = 1
    foo.bar = 2
    foo.bar = 4

• The cache later flushes only the final state to the DB:
    UPDATE foos SET bar=4 WHERE id=3
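A toy version of the diagram above: the cache absorbs repeated writes and the flush issues a single UPDATE with the final value (execute_sql is a stand-in for a real database call):

class WriteBackCache(object):
    """Toy write-back cache: absorbs writes, flushes only final values."""

    def __init__(self, execute_sql):
        self.execute_sql = execute_sql
        self.dirty = {}              # (table, row_id) -> {column: value}

    def write(self, table, row_id, column, value):
        # Later writes to the same column simply overwrite earlier ones
        self.dirty.setdefault((table, row_id), {})[column] = value

    def flush(self):
        # One UPDATE per dirty row, no matter how many writes it absorbed
        for (table, row_id), columns in self.dirty.items():
            sets = ', '.join('%s=%r' % kv for kv in columns.items())
            self.execute_sql('UPDATE %s SET %s WHERE id=%s' % (table, sets, row_id))
        self.dirty.clear()

cache = WriteBackCache(execute_sql=print)
cache.write('foos', 3, 'bar', 1)
cache.write('foos', 3, 'bar', 2)
cache.write('foos', 3, 'bar', 4)
cache.flush()                        # UPDATE foos SET bar=4 WHERE id=3

The catch: writes still sitting in dirty when the cache node dies never reach the database.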
Achtung!
• Commercial solutions
    – Terracotta

• Roll your own
    – Memcached
    – ehcache
    – Zookeeper?
Scaling reads
Replication
Types of replication
• Master → slave
    – Master → slave → other slaves

• Master ↔ master
    – multi-master
Types of replication 2
• Synchronous

• Asynchronous
Synchronous master/slave
• ?
Synchronous multi-master
• Synchronous = slow(er)

• Complexity (e.g. 2pc)

• PGCluster

• Oracle
Asynchronous master/slave
• Easiest

• MySQL replication

• Slony, Londiste, WAL shipping

• Tungsten

• MongoDB

• Redis
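All of these give you one writable master plus read-only replicas, so scaling reads means routing queries in the application or a proxy. A hedged sketch of that routing; the connection objects are assumed to be DB-API style and the class name is illustrative:

import random

class ReplicatedPool(object):
    """Writes go to the master; reads are spread across the replicas."""

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = replicas

    def execute_write(self, sql, params=()):
        return self.master.cursor().execute(sql, params)

    def execute_read(self, sql, params=()):
        # Caveat: with asynchronous replication a replica can lag the
        # master, so reads may see slightly stale data.
        return random.choice(self.replicas).cursor().execute(sql, params)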
Asynchronous multi-master
• Conflict resolution
    – O(N³) or O(N²) as you add nodes
    – http://research.microsoft.com/~gray/replicas.ps

• Bucardo

• MySQL Cluster

• Tokyo Cabinet
Achtung!
• Asynchronous replication can lose data if
   the master fails
“Architecture”
• Primarily about how you cope with failure
    scenarios
Replication does not scale writes
Scaling writes
Partitioning
Partitioning
• sharding / horizontal / key

• Functional / vertical
Key based partitioning
• PK of “root” table controls destination
    – e.g. user id

• Retains referential integrity
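A minimal sketch of "the user id picks the shard"; the shard names and the hash are assumptions, and the fixed modulo placement glosses over the which-server-owns-what question a later slide addresses:

import hashlib

SHARDS = ['db1', 'db2', 'db3', 'db4']   # illustrative shard names

def shard_for_user(user_id):
    # All of a user's rows (their blogs, their comments, ...) land on the
    # same shard, so foreign keys within one user's data stay local.
    h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

print(shard_for_user(42))   # the same shard every time for user 42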
Example: blogger.com

(diagram: Users, Blogs, Comments tables)

Example: blogger.com

(diagram: the same tables partitioned by user, so each shard holds its own
slice of Comments, shown as Comments and Comments')
What server gets each key?
• Consistent hashing

• Central db that knows what server owns a
   key
    – Makes adding machines easier (than
       consistent hashing)
    – Single point of failure & query bottleneck
    – MongoDB
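One shape the "central db" option can take, with a plain dict standing in for the directory table; in practice clients cache these lookups so the directory does not become the query bottleneck noted above:

class ShardDirectory(object):
    """Toy directory mapping each key to the server that owns it."""

    def __init__(self):
        self.owner = {}                   # key -> server name

    def assign(self, key, server):
        self.owner[key] = server          # e.g. set once when a user signs up

    def server_for(self, key):
        return self.owner[key]

directory = ShardDirectory()
directory.assign('user:42', 'db3')
print(directory.server_for('user:42'))    # db3

# Moving a key to a new machine is a single directory update, which is why
# this makes adding machines easier than fixed hashing schemes.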
Vertical partitioning
• Tables on separate nodes

• Often a table that was too big to keep with
    the other tables eventually gets too big for
    a single node, too
(diagram: Users1/Users2 and Trans1/Trans2/Trans3 illustrate sharding,
Products1 a vertical segment, and the combination of both)
Partitioning
Partitioning with replication
What these have in common
• Not application-transparent

• Ad hoc

• Error-prone

• Manpower-intensive
To summarize
• Scaling reads sucks

• Scaling writes sucks more
Case in point
NoSQL
• Distributed (partitioned, scalable)
    – Cassandra
    – HBase
    – Voldemort

• Not distributed
    – Neo4j
    – CouchDB
    – Redis
Distributed databases*
• Data is automatically partitioned

• Transparent to application

• Add capacity without downtime

• Failure tolerant

* Like Bigtable, not like Lotus Notes
Two famous papers
• Bigtable: A distributed storage system for
    structured data, 2006

• Dynamo: Amazon's highly available
    key-value store, 2007
Two approaches
• Bigtable: “How can we build a distributed
    database on top of GFS?”

• Dynamo: “How can we build a distributed
    hash table appropriate for the data center?”
Cassandra
• Dynamo-style replication

• Bigtable ColumnFamily data model

• High-performance
    – Digg phasing out memcached

• Open space, Saturday at 17:30
Questions