Cassandra FTW

Andrew Byde
Principal Scientist
Menu

• Introduction
• Data model + storage architecture
• Partitioning + replication
• Consistency
• De-normalisation
History + design
History

• 2007: Started at Facebook for inbox search
• July 2008: Open sourced by Facebook
• March 2009: Apache Incubator
• February 2010: Apache top-level project
• May 2011: Version 0.8
What it’s good for

• Horizontal scalability
• No single point of failure -- symmetric
• Multi-data centre support
• Very high write workloads
• Tuneable consistency -- per operation
What it’s not so good for

• Transactions
• Read-heavy workloads
• Low-latency applications
  • compared to in-memory DBs
Data model
Keyspaces and Column Families

  SQL          Cassandra
  Database     Keyspace
  Table        Column Family

[diagram: rows of (row key, columns) grouped into a column family inside a keyspace]
Column Family

rowkey: {
  column: value,
  column: value,
  ...
}

...every value is timestamped
Super Column Family

rowkey: {
  supercol: {
    column: value,
    column: value,
    ...
  },
  supercol: {
    column: value,
    column: value,
    ...
  }
}
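As a rough mental model (not Cassandra's actual data structures), the two layouts above are nested, sorted maps whose leaf values each carry a timestamp. A minimal Python sketch with invented row and column names:

    # Illustrative only: the column family and super column family models viewed
    # as nested maps in which every stored value carries a timestamp.
    import time

    # Column family: row key -> {column name -> (value, timestamp)}
    column_family = {
        "user3": {
            "name":  ("Charlie", time.time()),
            "email": ("c@example.com", time.time()),
        },
    }

    # Super column family: row key -> {super column -> {column -> (value, timestamp)}}
    super_column_family = {
        "user3": {
            "m1": {"sender": ("user1", time.time()), "subject": ("A rhyme", time.time())},
            "m2": {"sender": ("user1", time.time()), "subject": ("colours", time.time())},
        },
    }

    # Columns within a row are kept sorted by column name.
    print(sorted(column_family["user3"]))           # ['email', 'name']
    print(sorted(super_column_family["user3"]))     # ['m1', 'm2']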
Rows and columns
       col1   col2   col3   col4   col5   col6   col7
row1           x                    x      x
row2    x      x      x      x      x
row3           x      x             x      x      x
row4           x      x      x             x
row5           x             x      x      x
row6           x
row7    x      x             x
Reads

• get                -- one column of one row
• get_slice          -- one row, some columns
  • name predicate
  • slice range
• multiget_slice     -- multiple rows
• get_range_slices   -- a range of rows
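To make the operations concrete, here is an illustrative Python sketch over a plain in-memory map standing in for a column family -- not the real Thrift client API; the rows, columns and helper names are invented.

    # Illustrative model of the read operations over an in-memory column family:
    # row key -> {column -> value}. Not the real client API.
    cf = {
        "row1": {"col2": "x", "col5": "x", "col6": "x"},
        "row2": {"col1": "x", "col2": "x", "col3": "x", "col4": "x", "col5": "x"},
        "row3": {"col2": "x", "col3": "x", "col5": "x", "col6": "x", "col7": "x"},
    }

    def get(row, col):
        """One column of one row."""
        return cf[row][col]

    def get_slice_by_name(row, names):
        """One row, the named columns (name predicate)."""
        return {c: v for c, v in cf[row].items() if c in names}

    def get_slice_by_range(row, start, finish):
        """One row, a contiguous range of columns in sorted column order."""
        return {c: cf[row][c] for c in sorted(cf[row]) if start <= c <= finish}

    def multiget_slice(rows, names):
        """The same named columns across several rows."""
        return {r: get_slice_by_name(r, names) for r in rows}

    print(get("row1", "col5"))                               # x
    print(get_slice_by_range("row2", "col2", "col4"))        # col2..col4 of row2
    print(multiget_slice(["row1", "row3"], {"col2", "col6"}))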
get
[the same row/column grid as above, with a single cell -- one column of one row -- highlighted]
get_slice: name predicate
[the same grid, with a set of named columns within one row highlighted]
get_slice: slice range
[the same grid, with a contiguous range of columns within one row highlighted]
multiget_slice: name predicate

[the same grid, with the same named columns highlighted across several rows]
get_range_slices: slice range
[the same grid, with a column slice highlighted across a contiguous range of rows]
Storage architecture
Data Layout: writes

[diagram: a key-value insert is appended to an on-disk, un-ordered commit log and written into an in-memory, (key,col)-sorted memtable; full memtables are flushed to on-disk, (key,col)-sorted SSTables]
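A minimal single-node sketch of this write path, assuming an append-only commit log, a sorted in-memory memtable and sorted, immutable SSTables; class and threshold names are invented:

    # Sketch of the write path on one node: append to the commit log, insert into
    # the in-memory memtable, flush full memtables to immutable SSTables.
    import time

    class Node:
        def __init__(self, memtable_limit=4):
            self.commit_log = []          # on-disk, un-ordered, append-only (a list here)
            self.memtable = {}            # in-memory; sorted when flushed
            self.sstables = []            # on-disk, (key, col)-sorted, immutable
            self.memtable_limit = memtable_limit

        def insert(self, row, col, value):
            ts = time.time()
            self.commit_log.append((row, col, value, ts))     # durability first
            self.memtable[(row, col)] = (value, ts)           # then the in-memory table
            if len(self.memtable) >= self.memtable_limit:
                self.flush()

        def flush(self):
            # Write the whole memtable out as one (key, col)-sorted SSTable.
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    node = Node()
    for i in range(6):
        node.insert("row%d" % (i % 3), "col%d" % i, "x")
    print(len(node.sstables), "SSTable(s),", len(node.memtable), "memtable entries")   # 1, 2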
Data Layout: SSTables

[diagram: each SSTable consists of a Bloom filter, an index, and the data itself]
Data Layout: reads

[diagram, two slides: a read consults every SSTable in parallel; per-SSTable Bloom filters rule out SSTables that cannot contain the row (marked X), so most are never read]
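And a matching sketch of the read side: check the memtable and every SSTable, skip SSTables whose Bloom filter says the row cannot be there (faked here with a plain set), and keep the newest value per column by timestamp. Illustrative only:

    # Sketch: read one row by consulting the memtable and every SSTable, skipping
    # SSTables whose (faked) Bloom filter says the row is absent; the newest value
    # per column, by timestamp, wins.
    memtable = {("row1", "col5"): ("new", 2.0)}
    sstables = [
        # each SSTable: (bloom, sorted [((row, col), (value, timestamp)), ...]);
        # the Bloom filter is faked with a plain set of row keys
        ({"row1", "row2"}, [(("row1", "col2"), ("x", 1.0)), (("row1", "col5"), ("old", 1.0))]),
        ({"row3"}, [(("row3", "col7"), ("x", 1.0))]),
    ]

    def read_row(row):
        result = {}
        def merge(col, value, ts):
            if col not in result or ts > result[col][1]:
                result[col] = (value, ts)
        for (r, col), (value, ts) in memtable.items():
            if r == row:
                merge(col, value, ts)
        for bloom, data in sstables:
            if row not in bloom:                 # "definitely not here": skip the disk reads
                continue
            for (r, col), (value, ts) in data:
                if r == row:
                    merge(col, value, ts)
        return {col: v for col, (v, _ts) in result.items()}

    print(read_row("row1"))    # {'col2': 'x', 'col5': 'new'} -- memtable value wins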
Distribution: Partitioning + Replication
Partitioning + Replication

[diagram: given a key-value pair (k, v), which node(s) in the cluster should store it?]
Partitioning + Replication

• Partitioning data on to nodes
  • load balancing
  • row-based
• Replication
  • to protect against failure
  • better availability
Partitioning

• Random: take a hash of the row key
  • good for load balancing
  • bad for range queries
• Ordered: subdivide the key space
  • bad for load balancing
  • good for range queries
• Or build your own...
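A small sketch of the two partitioning strategies -- a hash-of-key ring for the random partitioner (the real RandomPartitioner is likewise MD5-based) and contiguous key ranges for the ordered one; tokens, ranges and node names are made up:

    # Sketch: random vs ordered partitioning of row keys onto nodes.
    import bisect, hashlib

    ring = [(2**125, "nodeA"), (2**126, "nodeB"), (2**127, "nodeC")]   # made-up tokens
    tokens = [t for t, _ in ring]

    def random_partition(row_key):
        """Random: hash the key, then find the owning token on the ring (wrapping)."""
        token = int(hashlib.md5(row_key.encode()).hexdigest(), 16)
        return ring[bisect.bisect_left(tokens, token) % len(ring)][1]

    ordered_ranges = [("g", "nodeA"), ("p", "nodeB"), ("~", "nodeC")]  # key-space subdivision

    def ordered_partition(row_key):
        """Ordered: nodes own contiguous key ranges, so range queries stay local."""
        for upper_bound, node in ordered_ranges:
            if row_key <= upper_bound:
                return node
        return ordered_ranges[-1][1]

    # the random choice depends on the MD5 hash; the ordered one is nodeC
    print(random_partition("user3"), ordered_partition("user3"))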
Simple Replication

[diagram, built up over three slides: nodes are arranged on a ‘ring’; (k, v) hashes to a primary location on the ring; extra copies are placed on the successor nodes on the ring]
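A sketch of this placement rule under the stated assumptions: hash the key onto the ring, take the first node at or after that position as the primary, and put the extra copies on its successors. Tokens and node names are invented:

    # Sketch: ring-based replica placement -- primary node plus successors on the ring.
    import bisect, hashlib

    ring = [(10, "n1"), (40, "n2"), (70, "n3"), (100, "n4")]    # (token, node), made up
    tokens = [t for t, _ in ring]

    def replicas(row_key, replication_factor=3, ring_size=128):
        token = int(hashlib.md5(row_key.encode()).hexdigest(), 16) % ring_size
        start = bisect.bisect_left(tokens, token) % len(ring)   # primary location
        # Extra copies are the successors, walking clockwise around the ring.
        return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]

    print(replicas("user3"))   # e.g. ['n2', 'n3', 'n4'], depending on the hash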
Topology-aware Replication

• Snitch: node IP → (DataCenter, rack)
• EC2Snitch
  • Region → DC; availability_zone → rack
• PropertyFileSnitch
  • configured from a file
Topology-aware Replication

[diagram, built up over four slides: two data centres, DC 1 and DC 2, each with racks r1 and r2; (k, v) is stored in DC 1, extra copies go to a different data centre, and copies are spread across racks within a data centre]
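A loose sketch of the idea (not Cassandra's NetworkTopologyStrategy code): a snitch-like table maps each node to a (data centre, rack) pair, and replicas are picked preferring data centres and racks not already used. The topology below is invented:

    # Sketch: pick replicas spread across data centres and racks.
    # The "snitch" maps node -> (data centre, rack); the topology is invented.
    snitch = {
        "10.0.1.1": ("DC1", "r1"), "10.0.1.2": ("DC1", "r2"),
        "10.0.2.1": ("DC2", "r1"), "10.0.2.2": ("DC2", "r2"),
    }

    def place(candidates, replication_factor=3):
        """Prefer nodes in data centres, then racks, that are not used yet."""
        chosen = []
        def diversity(node):
            dc, rack = snitch[node]
            used_dcs = {snitch[n][0] for n in chosen}
            used_racks = {snitch[n] for n in chosen}
            return (dc not in used_dcs, (dc, rack) not in used_racks)
        while candidates and len(chosen) < replication_factor:
            best = max(candidates, key=diversity)      # most "new" DC/rack first
            chosen.append(best)
            candidates = [n for n in candidates if n != best]
        return chosen

    print(place(["10.0.1.1", "10.0.1.2", "10.0.2.1", "10.0.2.2"]))
    # ['10.0.1.1', '10.0.2.1', '10.0.1.2'] -- a second DC first, then a second rack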
Distribution: Consistency
Consistency Level

• How many replicas must respond in order to declare success
• W of the N replicas must succeed for a write to succeed
  • write with a client-generated timestamp
• R of the N replicas must succeed for a read to succeed
  • return the most recent value, by timestamp
• Tuneable per request
Consistency Level

• 1, 2, 3 responses
• Quorum (more than half)
• Quorum in local data center
• Quorum in each data center
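A worked sketch of the arithmetic behind these levels: with N replicas, a write needs W acknowledgements and a read needs R responses, and whenever R + W > N the read set must overlap the write set -- which is what quorum reads and writes guarantee. The numbers are only an example:

    # Sketch: tuneable consistency arithmetic, per request.
    def quorum(n):
        return n // 2 + 1                      # "more than half"

    N = 3                                      # number of replicas
    for w, r, label in [(1, 1, "ONE / ONE"),
                        (quorum(N), quorum(N), "QUORUM / QUORUM"),
                        (N, 1, "ALL / ONE")]:
        overlap = (w + r) > N                  # read set guaranteed to meet the write set?
        print("%-16s W=%d R=%d -> overlapping read/write sets: %s" % (label, w, r, overlap))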
Maintaining consistency

• Read repair
• Hinted handoff
• Anti-entropy
Read repair

• If the replicas disagree on a read, send the most recent data back

[diagram, built up over four slides: a client reads key k from three replicas; n1 returns (v, t1), n2 returns “not found!”, n3 returns (v’, t2); the newest value v’ (timestamp t2) is returned to the user, and write (k, v’, t2) is sent to the out-of-date replicas]
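A small sketch of the resolution step, mirroring the replica responses in the diagram: pick the value with the newest timestamp, return it, and repair the replicas that were stale or missing it:

    # Sketch: resolve divergent replica responses by timestamp, then repair.
    responses = {                   # what each replica returned for key k
        "n1": ("v", 1),             # (value, timestamp t1)
        "n2": None,                 # not found!
        "n3": ("v'", 2),            # newest value, timestamp t2
    }

    # Pick the most recent value across all replicas that answered.
    winner_value, winner_ts = max((r for r in responses.values() if r is not None),
                                  key=lambda vt: vt[1])

    # Return the newest value to the client, and repair the out-of-date replicas.
    repairs = [node for node, r in responses.items() if r is None or r[1] < winner_ts]
    print("return to client:", winner_value)                                      # v'
    print("repair: write (k, %s, t%d) to %s" % (winner_value, winner_ts, repairs))  # n1 and n2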
Hinted handoff

• When a node is unavailable
• Writes can be stored on any other node as a hint
• Delivered when the node comes back online
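An illustrative sketch of the mechanism: while a replica is down, another node keeps the write as a hint keyed by the dead replica and replays it once that replica is seen again. Node names and structures are invented:

    # Sketch: keep hints for an unavailable replica, replay them when it recovers.
    from collections import defaultdict

    alive = {"n1": True, "n2": False, "n3": True}
    hints = defaultdict(list)                  # down replica -> [(key, value, timestamp), ...]

    def write(key, value, ts, replicas=("n1", "n2", "n3")):
        for node in replicas:
            if alive[node]:
                pass                                    # deliver the write to the live replica
            else:
                hints[node].append((key, value, ts))    # store a hint for later delivery

    def node_recovered(node):
        alive[node] = True
        return hints.pop(node, [])                      # hand these writes back to the replica

    write("k", "v", 1)
    print(node_recovered("n2"))                         # [('k', 'v', 1)]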
Anti-entropy

• Equivalent to ‘read repair all’
• Requires reading all data (woah)
  • (although only hashes are sent to calculate diffs)
• Manual process
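A toy sketch of the hash-comparison step: replicas exchange hashes of key ranges and only repair ranges whose hashes differ. Cassandra uses Merkle trees for this; the flat per-range hashing below is a deliberate simplification:

    # Sketch: compare per-range hashes between two replicas to find ranges to repair.
    import hashlib

    def range_hashes(data, ranges):
        """Hash each key range's contents; only the hashes cross the network."""
        out = {}
        for lo, hi in ranges:
            keys = sorted(k for k in data if lo <= k < hi)
            blob = "".join("%s=%s" % (k, data[k]) for k in keys)
            out[(lo, hi)] = hashlib.md5(blob.encode()).hexdigest()
        return out

    ranges = [("a", "m"), ("m", "z")]
    replica1 = {"apple": 1, "melon": 2}
    replica2 = {"apple": 1, "melon": 3}               # diverged in the second range

    h1, h2 = range_hashes(replica1, ranges), range_hashes(replica2, ranges)
    to_repair = [r for r in ranges if h1[r] != h2[r]]
    print(to_repair)                                   # [('m', 'z')]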
De-normalisation
De-normalisation

• Disk space is much cheaper than disk seeks
• Read at 100 MB/s, seek at 100 IO/s
• => copy data to avoid seeks
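Those two figures are worth turning into numbers; a quick back-of-the-envelope calculation under exactly those assumptions:

    # Back-of-the-envelope: fetch 1,000 scattered 1 KB records (one seek each)
    # versus reading them from a single contiguous, de-normalised row.
    records, record_kb = 1000, 1
    seek_time = records / 100.0                               # 100 seeks/s -> 10 s
    sequential_time = (records * record_kb) / (100 * 1024.0)  # 100 MB/s    -> ~0.01 s
    print("seek-bound: %.1f s   sequential: %.3f s" % (seek_time, sequential_time))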
Inbox query

[diagram: user1 sends msg1 to user2 and user3, msg2 to user3 and user4, msg3 to user4, ...]

Q? inbox for user3
Data-centric model

m1: {
  sender: user1
  content: “Mary had a little lamb”
  recipients: user2, user3
}

• but how to do ‘recipients’ for Inbox?
• one-to-many is modelled by a join table
To join

m1: {
  sender: user1
  subject: “A rhyme”
  content: “Mary had a little lamb”
}
m2: {
  sender: user1
  subject: “colours”
  content: “Its fleece was white as snow”
}
m3: {
  sender: user1
  subject: “loyalty”
  content: “And everywhere that Mary went”
}

user2: {
  m1: true
}
user3: {
  m1: true
  m2: true
}
user4: {
  m2: true
  m3: true
}
.. or not to join
• Joins are expensive, so de-normalise to trade
  off space for time
• We can have lots of columns, so think BIG:
• Make message id a time-typed super-column.
• This makes get_slice an efficient way of
  searching for messages in a time window
Super Column Family

user2: {
  m1: {
    sender: user1
    subject: “A rhyme”
  }
}
user3: {
  m1: {
    sender: user1
    subject: “A rhyme”
  }
  m2: {
    sender: user1
    subject: “colours”
  }
}
...
De-normalisation + Cassandra

• have to write a copy of the record for each recipient ... but writes are very cheap
• get_slice fetches columns for a particular row, so it gets the received messages for a user
• on-disk column order is optimal for this query
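A sketch of the de-normalised inbox in action, using the example data above: each recipient's row gets its own copy of the message header under a time-ordered column, so a single slice of one row answers "inbox for user3 in a time window". Timestamps and helper names are invented:

    # Sketch: de-normalised inbox. One row per user; message headers are stored
    # under (timestamp, message id) columns so the row is time-ordered on disk.
    inbox = {
        "user2": {(1000, "m1"): {"sender": "user1", "subject": "A rhyme"}},
        "user3": {(1000, "m1"): {"sender": "user1", "subject": "A rhyme"},
                  (1010, "m2"): {"sender": "user1", "subject": "colours"}},
    }

    def deliver(message_id, ts, sender, subject, recipients):
        """Writes fan out: one (cheap) copy of the header per recipient."""
        for user in recipients:
            inbox.setdefault(user, {})[(ts, message_id)] = {"sender": sender, "subject": subject}

    def inbox_slice(user, start_ts, end_ts):
        """The get_slice analogue: a contiguous, time-ordered slice of one row."""
        row = inbox.get(user, {})
        return [(ts, mid, row[(ts, mid)]) for ts, mid in sorted(row) if start_ts <= ts <= end_ts]

    deliver("m3", 1020, "user1", "loyalty", ["user4"])
    print(inbox_slice("user3", 1000, 1015))    # m1 and m2 only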
Conclusion
What it’s good for

• Horizontal scalability
• No single point of failure -- symmetric
• Multi-data centre support
• Very high write workloads
• Tuneable consistency -- per operation
Q?


Editor's Notes

  • #2 We provide Cassandra training and support and the Acunu Data Platform, high-performance storage software that incorporates Cassandra. Come and talk to us if you want to know more. We have an ebook to give away to those that want to dive into Cassandra details. You've probably heard about 'eventual consistency / scale out / de-norm'... I'm going to explain what they mean.
  • #11 But... tables have a fixed structure, described in a schema. Columns are much more flexible: no fixed schema in the RDBMS sense, little structure. Add a column whenever you want; you don't need the same columns in each row, etc.
  • #12 A two-level map. Everything in Cassandra has a timestamp, which is used to help with consistency. You might use your own timestamp as a key, but you don't normally do anything with the internal timestamps. (Of course this means your clocks need to be reasonably accurate, so you can tell people they need to use NTP.)
  • #13 A three-level map.
  • #14 A three-level map.
  • #16 Sparse. Up to 2 billion rows... but big rows are a problem (repair etc. is done per row). On a single node, data is sorted by row key.
  • #17 Queries are all key-based, i.e. the 'WHERE' is all on the key; the operations above differ in the SELECT.
  • #21 Note that the predicate is on NAME -- you can't do 'WHERE col3=x' with this.
  • #24 The memtable default is a skip list. Background compaction of SSTables. THE BENEFIT IS SEQUENTIAL WRITES.
  • #25 Data is sorted, key then value. Compactions are streaming, hence efficient.
  • #26 Reads go everywhere in parallel. Bloom filters are per row, so they help with get_slice but not multi-row range queries.
  • #28 Amazon Dynamo. Connect to any node in the cluster; nodes talk to one another using a p2p protocol called 'gossip' -- entirely symmetric.
  • #31, #32 Hash-ring based: keys are hashed; regions of the hash output space are claimed by nodes.
  • #42 PER REQUEST.
  • #50 Merkle trees.
  • #51 At scale you have to optimise for queries. De-normalisation is not specific to Cassandra.
  • #52 De-normalisation is not specific to Cassandra, but Cassandra is well suited to it because writes are relatively cheap and there is little infrastructure for queries.
  • #53 Get the inbox for user 3.
  • #55 An extra table holding recipient -> msg. You have to do a point query per message to show the inbox for a user.
  • #57 Note: content is not duplicated, only the subject -- the row would become too large. Columns need to be ordered by time, decreasing -- a custom comparator.