Cassandra deep-dive @ NoSQLNow!
An introduction to Cassandra, including replication + partitioning options, data center awareness, the local storage model, and a data modelling example. Presented by Andrew Byde on 25th August 2011 at NoSQLNow! in San Jose, California.

1. Cassandra FTW
Andrew Byde, Principal Scientist
2. Menu
• Introduction
• Data model + storage architecture
• Partitioning + replication
• Consistency
• De-normalisation
3. History + design
4. History
• 2007: started at Facebook for inbox search
• July 2008: open-sourced by Facebook
• March 2009: Apache Incubator
• February 2010: Apache top-level project
• May 2011: version 0.8
5. What it’s good for
• Horizontal scalability
• No single point of failure -- symmetric
• Multi-data centre support
• Very high write workloads
• Tuneable consistency -- per operation
6. What it’s not so good for
• Transactions
• Read-heavy workloads
• Low-latency applications (compared to in-memory DBs)
7. Data model
8. Keyspaces and Column Families

SQL        Cassandra
--------   -------------
Database   Keyspace
Table      Column Family

(in both, rows are addressed by key and hold columns col_1, col_2, ...)
9. Column Family

rowkey: {
  column: value,
  column: value,
  ...
}

...every value is timestamped
10. Super Column Family

rowkey: {
  supercol: {
    column: value,
    column: value,
    ...
  }
  supercol: {
    column: value,
    column: value,
    ...
  }
}
11. Rows and columns -- [grid of rows row1..row7 against columns col1..col7; each row holds values (‘x’) in a different, sparse subset of the columns]
12. Reads
• get
• get_slice -- one row, some cols
  • name predicate
  • slice range
• multiget_slice -- multiple rows
• get_range_slices
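The four read operations above can be sketched over an in-memory model of a column family -- a dict mapping row key to a dict of columns, with column names treated as sorted, the way they are on disk. The row and column names below are illustrative, not from the slides:

```python
# One column family = {row_key: {col_name: value}}; columns sorted by name.

def get(cf, row, col):
    """get: one row, one column."""
    return cf[row][col]

def get_slice_names(cf, row, names):
    """get_slice with a name predicate: the named columns of one row."""
    wanted = set(names)
    return {c: v for c, v in cf[row].items() if c in wanted}

def get_slice_range(cf, row, start, finish):
    """get_slice with a slice range: columns between start and finish."""
    return {c: v for c, v in sorted(cf[row].items()) if start <= c <= finish}

def multiget_slice(cf, rows, names):
    """multiget_slice: the same name predicate applied to several rows."""
    return {r: get_slice_names(cf, r, names) for r in rows}

cf = {
    "row1": {"col2": "a", "col6": "b", "col7": "c"},
    "row2": {"col1": "d", "col2": "e", "col5": "f"},
}
```

get_range_slices works the same way as multiget_slice, except that the rows are selected by a key range rather than listed explicitly.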
13.-17. [grid illustrations of each read against the row/column grid of slide 11: get selects one cell; get_slice with a name predicate selects named columns of one row; get_slice with a slice range selects a contiguous column range of one row; multiget_slice selects the same columns across several rows; get_range_slices selects a column range across a range of rows]
18. Storage architecture
19. Data Layout: writes
• a key-value insert is appended to an on-disk, un-ordered commit log...
• ...and applied to an in-memory, (key,col)-sorted memtable
• on flush, the memtable is written out as an on-disk, (key,col)-sorted SSTable
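A minimal sketch of this write path, under the simplifying assumptions that the "files" are plain Python lists and that the memtable flushes at a tiny fixed size:

```python
commit_log = []        # on-disk, append-only, arrival order
memtable = {}          # in-memory; sorted by (key, col) at flush time
sstables = []          # on-disk, immutable, sorted runs

MEMTABLE_LIMIT = 2     # tiny limit so the example actually flushes

def insert(key, col, value):
    commit_log.append((key, col, value))   # durability first: log the write
    memtable[(key, col)] = value           # then update the memtable
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

def flush():
    global memtable
    # an "SSTable" here is just the memtable's contents in sorted order
    sstables.append(sorted(memtable.items()))
    memtable = {}

insert("row1", "col1", "a")
insert("row1", "col2", "b")   # second insert hits the limit and flushes
```

Because every write is an append (to the log) or an in-memory update, writes never seek -- which is why the deck can later claim that writes are very cheap.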
20. Data Layout: SSTables
Each SSTable carries a Bloom filter, an index, and the data itself.
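The per-SSTable Bloom filter lets a read skip tables that definitely don't contain the key. A toy version, with arbitrary sizes and MD5 standing in for whatever hash the real implementation uses:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.bits = [False] * m     # m-bit array
        self.m, self.k = m, k       # k independent hash positions per key

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # False means definitely absent; True means only "maybe present"
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("row1")
```

A "no" from the filter is always correct, so an SSTable answering "no" is skipped without touching its index or data; a "yes" may be a false positive, in which case the index lookup simply finds nothing.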
21.-22. Data Layout: reads -- [diagram: a read probes each SSTable’s Bloom filter; SSTables whose filter says the key is definitely absent are ruled out (X) and never read]
23. Distribution: Partitioning + Replication
24. Partitioning + Replication -- [diagram: a pair (k, v) arrives; which node should store it?]
25. Partitioning + Replication
• Partitioning data on to nodes
  • load balancing
  • row-based
• Replication
  • to protect against failure
  • better availability
26. Partitioning
• Random: take hash of row key
  • good for load balancing
  • bad for range queries
• Ordered: subdivide key space
  • bad for load balancing
  • good for range queries
• Or build your own...
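The contrast can be sketched with invented token math: each node owns a point on a 0..2**32 ring, and a row belongs to the first node whose token is at or past the row's token (wrapping around). CRC32 stands in for the real partitioner's hash, and the node names are made up:

```python
import bisect
import zlib

RING = 2 ** 32
node_tokens = sorted(zlib.crc32(n.encode()) % RING
                     for n in ["nodeA", "nodeB", "nodeC"])

def random_partition_token(row_key):
    # "Random" partitioner: hash the key. Balanced, but adjacent keys
    # land on unrelated nodes, so range queries scatter across the ring.
    return zlib.crc32(row_key.encode()) % RING

def ordered_partition_token(row_key):
    # "Ordered" partitioner: use the key's own bytes as the token.
    # Range queries stay local, but popular key ranges hot-spot one node.
    padded = (row_key.encode() + b"\x00" * 4)[:4]
    return int.from_bytes(padded, "big")

def owner(token):
    """First node token at or past this token, wrapping around the ring."""
    i = bisect.bisect_left(node_tokens, token) % len(node_tokens)
    return node_tokens[i]
```

With the ordered partitioner, key order and token order agree, which is exactly the property a range query needs and exactly what ruins load balancing.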
27.-29. Simple Replication -- [diagram: nodes arranged on a ‘ring’; (k, v) is written to its primary location, and extra copies are successors on the ring]
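Successor-based placement can be sketched in a few lines: the primary is the node owning the key's ring position, and the remaining copies go to the next replication_factor - 1 nodes clockwise. Node names and tokens are invented:

```python
import bisect

# (node, token) pairs on the ring, sorted by token
ring = sorted([("nodeA", 100), ("nodeB", 400), ("nodeC", 700)],
              key=lambda nt: nt[1])

def replicas(key_token, replication_factor=2):
    tokens = [t for _, t in ring]
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    # walk the ring clockwise: primary first, then its successors
    return [ring[(start + i) % len(ring)][0]
            for i in range(replication_factor)]
```

A key with token 500 lands past nodeB's token, so nodeC is its primary and nodeA holds the extra copy.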
30. Topology-aware Replication
• Snitch: maps node IP -> (data center, rack)
• EC2Snitch
  • region -> DC; availability_zone -> rack
• PropertyFileSnitch
  • configured from a file
31.-34. Topology-aware Replication -- [diagram: two data centers DC 1 and DC 2, each with racks r1 and r2; extra copies of (k, v) go to a different data center and are spread across racks within a data center]
35. Distribution: Consistency
36. Consistency Level
• How many replicas must respond in order to declare success
• W of N replicas must succeed for a write to succeed
  • writes carry a client-generated timestamp
• R of N replicas must succeed for a read to succeed
  • return the most recent value, by timestamp
• Tuneable per request
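The arithmetic behind tuning R and W: with N replicas, any read quorum of size R must intersect any write quorum of size W whenever R + W > N, so the read is guaranteed to see at least one replica holding the latest timestamped write. A sketch:

```python
def overlap_guaranteed(n, r, w):
    """True if every read quorum must intersect every write quorum."""
    return r + w > n

def quorum(n):
    """More than half of the replicas."""
    return n // 2 + 1
```

Writing and reading at quorum (e.g. R = W = 2 with N = 3) satisfies the inequality; R = W = 1 does not, trading that guarantee for latency.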
37. Consistency Level
• 1, 2, 3 responses
• Quorum (more than half)
• Quorum in the local data center
• Quorum in each data center
38. Maintaining consistency
• Read repair
• Hinted handoff
• Anti-entropy
39.-42. Read repair
• If the replicas disagree on a read, send the most recent data back
• [diagram: the user reads k; n1 answers (v, t1), n2 answers ‘not found!’, n3 answers (v’, t2); the newest value (v’, t2) is returned to the user and written back to the other replicas as (k, v’, t2)]
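The exchange above can be sketched directly: each replica returns its (value, timestamp) or nothing, the newest timestamp wins, and every replica that disagreed is overwritten. The node names follow the slides; the storage model is invented, and "v2" stands in for the slides' v′:

```python
replicas = {                      # replica -> {key: (value, timestamp)}
    "n1": {"k": ("v", 1)},        # stale
    "n2": {},                     # missed the write entirely
    "n3": {"k": ("v2", 2)},       # has the most recent write
}

def read_with_repair(key):
    answers = {n: data.get(key) for n, data in replicas.items()}
    # newest (value, timestamp) among the replicas that answered
    newest = max((a for a in answers.values() if a), key=lambda vt: vt[1])
    for n, a in answers.items():  # repair stale or missing replicas
        if a != newest:
            replicas[n][key] = newest
    return newest[0]
```

After one such read, all three replicas agree, which is why repeated reads converge even without anti-entropy.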
43. Hinted handoff
• When a node is unavailable...
• ...writes can be written to any node as a hint
• Delivered when the node comes back online
44. Anti-entropy
• Equivalent to ‘read repair all’
• Requires reading all data (woah)
  • (although only hashes are sent to calculate diffs)
• Manual process
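The "only hashes are sent" point can be sketched as follows: each node digests its key ranges, only the digests cross the network, and only ranges whose digests disagree are re-read and synchronised. Using one flat digest per range (rather than a tree of them) is a simplification:

```python
import hashlib

def range_digest(data, keys):
    """Hash a node's values for one key range; only this crosses the wire."""
    payload = repr(sorted((k, data.get(k)) for k in keys)).encode()
    return hashlib.sha256(payload).hexdigest()

def ranges_to_sync(local, remote, key_ranges):
    """Key ranges whose digests disagree and therefore need full transfer."""
    return [r for r in key_ranges
            if range_digest(local, r) != range_digest(remote, r)]
```

Matching digests prove (up to hash collisions) that a range is already in sync, so the expensive full-data transfer is limited to the ranges that actually differ.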
45. De-normalisation
46. De-normalisation
• Disk space is much cheaper than disk seeks
• Read at 100 MB/s; seek at 100 IO/s
• => copy data to avoid seeks
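Working through the slide's numbers: in one second the disk can either perform 100 random seeks or stream 100 MB sequentially, so each seek "costs" about 1 MB of forgone sequential reading. Copying data so a query reads one contiguous row is therefore a good trade almost regardless of how much space it wastes:

```python
SEQ_MB_PER_S = 100   # sequential read throughput, from the slide
SEEKS_PER_S = 100    # random IOs per second, from the slide

rows_by_seeking = SEEKS_PER_S * 1          # rows fetchable in 1 s by seeking
mb_sequential = SEQ_MB_PER_S * 1           # MB streamable in the same 1 s
mb_per_seek = SEQ_MB_PER_S / SEEKS_PER_S   # break-even: MB "paid" per seek
```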
47. Inbox query -- [diagram: user1 sends msg1, msg2, msg3 to users user2, user3, user4; Q: how to compute the inbox for user3?]
48. Data-centric model

m1: {
  sender: user1
  content: “Mary had a little lamb”
  recipients: user2, user3
}

• but how to do ‘recipients’ for Inbox?
• one-to-many is modelled by a join table
49. To join

m1: {
  sender: user1
  subject: “A rhyme”
  content: “Mary had a little lamb”
}
m2: {
  sender: user1
  subject: “colours”
  content: “Its fleece was white as snow”
}
m3: {
  sender: user1
  subject: “loyalty”
  content: “And everywhere that Mary went”
}

user2: { m1: true }
user3: { m1: true, m2: true }
user4: { m2: true, m3: true }
50. ...or not to join
• Joins are expensive, so de-normalise to trade off space for time
• We can have lots of columns, so think BIG:
  • make the message id a time-typed super-column
  • this makes get_slice an efficient way of searching for messages in a time window
51. Super Column Family

user2: {
  m1: {
    sender: user1
    subject: “A rhyme”
  }
}
user3: {
  m1: {
    sender: user1
    subject: “A rhyme”
  }
  m2: {
    sender: user1
    subject: “colours”
  }
}
...
52. De-normalisation + Cassandra
• have to write a copy of the record for each recipient ...but writes are very cheap
• get_slice fetches columns for a particular row, so gets received messages for a user
• on-disk column order is optimal for this query
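The denormalised inbox can be sketched end to end: sending a message writes one copy into each recipient's row (fan-out on write), so reading an inbox is a single-row slice. Plain sorted message ids stand in for the slides' time-typed super-columns:

```python
inbox = {}   # recipient -> {message_id: {sender, subject}}

def send(msg_id, sender, subject, recipients):
    for r in recipients:                 # one cheap write per recipient copy
        inbox.setdefault(r, {})[msg_id] = {
            "sender": sender, "subject": subject}

def inbox_slice(user, start, finish):
    """get_slice on the user's row: messages in an id/time window."""
    return {m: c for m, c in sorted(inbox.get(user, {}).items())
            if start <= m <= finish}

send("m1", "user1", "A rhyme", ["user2", "user3"])
send("m2", "user1", "colours", ["user3", "user4"])
```

Because a user's messages are adjacent columns in one row, and columns are stored in sorted order on disk, the inbox query is a single contiguous read -- the seek-avoiding layout the previous slides argued for.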
53. Conclusion
54. What it’s good for
• Horizontal scalability
• No single point of failure -- symmetric
• Multi-data centre support
• Very high write workloads
• Tuneable consistency -- per operation
55. Q?
