Cassandra: Two data centers and great performance
 


In this talk we describe the features of Cassandra that set it above the pack, and how to get the most out of them, depending on your application. In particular, we'll describe de-normalization, and detail how the algorithms behind Cassandra leverage awesome write speed to accelerate reads; and we'll explain how Cassandra achieves multi-datacenter support, tunable consistency and no single point of failure, to give a great solution for highly available systems.


Usage Rights

CC Attribution-NonCommercial-NoDerivs License

Presentation Transcript

  • Cassandra FTW. Andrew Byde, Principal Scientist
  • Menu • Introduction • Data model + storage architecture • Partitioning + replication • Consistency • De-normalisation
  • History + design
  • History • 2007: Started at Facebook for inbox search • July 2008: Open sourced by Facebook • March 2009: Apache Incubator • February 2010: Apache top-level project • May 2011: Version 0.8
  • What it's good for • Horizontal scalability • No single point of failure • Multi-data centre support • Very high write workloads • Tuneable consistency
  • What it's not so good for • Transactions • Read-heavy workloads • Low-latency applications (compared to in-memory DBs)
  • Data model
  • Keyspaces and Column Families: a SQL Database corresponds to a Cassandra Keyspace, and a SQL Table corresponds to a Column Family. Keyspaces & CFs have different sets of configuration settings
  • Column Family: key: { column: value, column: value, ... }
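To make the key: { column: value, ... } model concrete, here is a minimal sketch using pycassa, a Thrift-based Python client from the Cassandra 0.7/0.8 era. The keyspace name 'Mail', the column family 'Messages' and the host address are illustrative assumptions, not anything from the talk.

```python
# Minimal sketch of the Cassandra data model via pycassa (Thrift-era client).
# Keyspace/column family names and the host are illustrative assumptions.
import pycassa
from pycassa.system_manager import SystemManager, SIMPLE_STRATEGY

sys_mgr = SystemManager('localhost:9160')
sys_mgr.create_keyspace('Mail', SIMPLE_STRATEGY, {'replication_factor': '1'})
sys_mgr.create_column_family('Mail', 'Messages')
sys_mgr.close()

pool = pycassa.ConnectionPool('Mail', ['localhost:9160'])
messages = pycassa.ColumnFamily(pool, 'Messages')

# A row is a key mapping to a set of (column, value) pairs; rows in the same
# column family need not have the same columns.
messages.insert('m1', {'sender': 'user1', 'content': 'Mary had a little lamb'})
messages.insert('m2', {'sender': 'user1', 'subject': 'colours'})

print(messages.get('m1'))   # OrderedDict of that row's columns
```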
  • Rows and columns [diagram: a sparse grid of rows (row1..row7) and columns (col1..col7) in which only some cells hold a value]
  • Reads • get • get_slice: one row, some cols, selected by a name predicate or a slice range • multiget_slice: multiple rows • get_range_slices: a range of rows (a client-side sketch of these calls follows the examples below)
  • get [diagram: a single cell in the row/column grid]
  • get_slice: name predicate [diagram: a set of named columns within one row]
  • get_slice: slice range [diagram: a contiguous range of columns within one row]
  • multiget_slice: name predicate [diagram: the same named columns across several rows]
  • get_range_slices: slice range [diagram: a contiguous range of columns across a contiguous range of rows]
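Roughly how the read calls above map onto a Thrift-era client; a sketch assuming the pycassa `messages` column family from the earlier example, with made-up row and column names.

```python
# Sketch of the read operations against the 'messages' column family from above.
# Row keys and column names are made-up examples matching the slide diagrams.

# get: one row, one (or a few) named columns
messages.get('row2', columns=['col3'])

# get_slice with a name predicate: one row, a specific set of columns
messages.get('row2', columns=['col2', 'col5', 'col7'])

# get_slice with a slice range: one row, a contiguous range of columns
messages.get('row2', column_start='col2', column_finish='col5')

# multiget_slice: the same column predicate across several named rows
messages.multiget(['row2', 'row3', 'row5'], columns=['col2', 'col5'])

# get_range_slices: a range of rows (key ranges are only meaningful with an
# order-preserving partitioner; with the random partitioner rows come back in
# hash order)
for key, cols in messages.get_range(start='row2', finish='row5',
                                    column_start='col2', column_finish='col5'):
    print(key)
```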
  • Storage architecture
  • Data Layout: writes. A key-value insert is appended to an on-disk, un-ordered commit log and applied to an in-memory, (key, col)-sorted memtable; on flush, the memtable is written out as an on-disk, (key, col)-sorted SSTable
  • Data Layout: SSTables. Each SSTable consists of a Bloom filter, an index and the data
  • Data Layout: reads [diagram sequence: a read may have to consult the memtable and every SSTable; per-SSTable Bloom filters rule out (X) the SSTables that cannot contain the key]
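A toy model of the write and read paths just described: append to a commit log, buffer in a sorted memtable, flush to immutable sorted SSTables, and use a per-SSTable Bloom filter to skip files on reads. This is purely illustrative Python, not Cassandra's actual on-disk format; the "Bloom filter" here is an exact key set standing in for the real probabilistic structure.

```python
# Toy model of Cassandra's storage layout; illustrative only.
class ToyStore(object):
    def __init__(self, flush_threshold=4):
        self.commit_log = []        # on-disk, append-only, un-ordered
        self.memtable = {}          # in-memory, sorted when flushed
        self.sstables = []          # list of (key_set, sorted_rows), newest last
        self.flush_threshold = flush_threshold

    def write(self, key, column, value):
        self.commit_log.append((key, column, value))        # durability first
        self.memtable.setdefault(key, {})[column] = value   # fast in-memory update
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # An SSTable is an immutable, (key, col)-sorted file; here, a dict plus
        # a key set that plays the role of the Bloom filter.
        rows = {k: dict(sorted(v.items())) for k, v in sorted(self.memtable.items())}
        self.sstables.append((set(rows), rows))
        self.memtable = {}

    def read(self, key):
        result = dict(self.memtable.get(key, {}))
        # Check SSTables newest-first; a column already seen is the newer copy.
        for key_filter, rows in reversed(self.sstables):
            if key not in key_filter:      # "Bloom filter" says skip this file
                continue
            for col, val in rows.get(key, {}).items():
                result.setdefault(col, val)
        return result or None
```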
  • Distribution: Partitioning + Replication
  • Partitioning + Replication [diagram: which nodes should store a given key-value pair (k, v)?]
  • Partitioning + Replication • Partitioning data onto nodes • load balancing • row-based • Replication • to protect against failure • better availability
  • Partitioning • Random: take hash of row key • good for load balancing • bad for range queries • Ordered: subdivide key space • bad for load balancing • good for range queries • Or build your own...
  • Simple Replication [diagram: nodes arranged on a 'ring'; (k, v) is assigned a primary location on the ring, and extra copies go to the successor nodes on the ring]
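A sketch of the ring placement just described: hash the row key onto a token ring (as a random partitioner does), the primary replica is the first node at or after that token, and extra copies go to the following nodes. Node names, tokens and the replication factor are illustrative.

```python
# Toy token ring: random-partitioner-style placement with successor replicas.
# Node names, tokens and the replication factor are illustrative.
import bisect
import hashlib

RING_SIZE = 2 ** 127

def token(key):
    # Hash the row key onto the ring, as a random (hash) partitioner would.
    return int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16) % RING_SIZE

# (token, node) pairs sorted by their position on the ring.
nodes = sorted((token('node-%d' % i), 'node-%d' % i) for i in range(6))
tokens = [t for t, _ in nodes]

def replicas(key, replication_factor=3):
    """Primary = first node clockwise from the key's token; extras = successors."""
    start = bisect.bisect_right(tokens, token(key)) % len(nodes)
    return [nodes[(start + i) % len(nodes)][1] for i in range(replication_factor)]

print(replicas('user3'))   # e.g. ['node-2', 'node-4', 'node-0']
```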
  • Topology-aware Replication • Snitch: maps node IP to (DataCenter, rack) • EC2Snitch • Region becomes the DC; availability_zone becomes the rack • PropertyFileSnitch • Configured from a file (a sketch of that mapping follows the diagram below)
  • Topology-aware Replication [diagram sequence: two data centers, DC 1 and DC 2, each with racks r1 and r2; (k, v) lands in one data center, extra copies go to a different data center, and within a data center copies are spread across racks]
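The PropertyFileSnitch reads a topology file that maps node addresses to data-center and rack names (cassandra-topology.properties, with lines of the form IP=DC:RACK). The sketch below simply parses a string in that shape into the IP -> (DC, rack) mapping a topology-aware strategy needs; the addresses and names are made up.

```python
# Sketch: parsing a PropertyFileSnitch-style topology into ip -> (dc, rack).
# Addresses and names below are made up.
TOPOLOGY = """
# cassandra-topology.properties style: <ip>=<data center>:<rack>
192.168.1.10=DC1:r1
192.168.1.11=DC1:r2
192.168.2.10=DC2:r1
192.168.2.11=DC2:r2
default=DC1:r1
"""

def parse_topology(text):
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        ip, location = line.split('=', 1)
        dc, rack = location.split(':', 1)
        mapping[ip] = (dc, rack)
    return mapping

snitch = parse_topology(TOPOLOGY)
print(snitch['192.168.2.10'])   # ('DC2', 'r1')
```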
  • Distribution: Consistency
  • Consistency Level • How many replicas must respond in order to declare success • W of the N replicas must succeed for a write to succeed • writes carry a client-generated timestamp • R of the N replicas must succeed for a read to succeed • the most recent value, by timestamp, is returned
  • Consistency Level • 1, 2, 3 responses • Quorum (more than half) • Quorum in local data center • Quorum in each data center
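The practical rule behind these levels: if the read replica count R and the write replica count W overlap (R + W > N), every read is guaranteed to reach at least one replica holding the latest write. A small sketch of that arithmetic, assuming N = 3:

```python
# R + W > N guarantees a read quorum overlaps the last write quorum.
def sees_latest_write(r, w, n):
    return r + w > n

N = 3
print(sees_latest_write(2, 2, N))   # QUORUM read + QUORUM write -> True
print(sees_latest_write(1, 3, N))   # ONE read + ALL write       -> True
print(sees_latest_write(1, 1, N))   # ONE read + ONE write       -> False (may read stale data)
```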
  • Maintaining consistency • Read repair • Hinted handoff • Anti-entropy
  • Read repair • If the replicas disagree on read, send most recent data back [diagram sequence: a read of key k goes to replicas n1, n2, n3; n1 returns (v, t1), n2 returns "not found!", n3 returns (v', t2); the newest value wins and (k, v', t2) is written back to the stale replicas]
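A toy version of the resolution step in the diagram above: collect each replica's answer with its timestamp, pick the newest, and re-write it to the replicas that returned something older (or nothing). The replica names and the write-back callback are illustrative.

```python
# Toy read-repair resolution: newest timestamp wins, stale replicas get repaired.
def read_repair(responses, write_back):
    """responses: dict of replica -> (value, timestamp), or None if not found."""
    found = {r: vt for r, vt in responses.items() if vt is not None}
    if not found:
        return None
    newest_value, newest_ts = max(found.values(), key=lambda vt: vt[1])
    for replica, vt in responses.items():
        if vt is None or vt[1] < newest_ts:
            write_back(replica, newest_value, newest_ts)   # repair the stale replica
    return newest_value

def write_back(replica, value, ts):
    print('repair %s with (%s, %s)' % (replica, value, ts))

responses = {'n1': ('v', 1), 'n2': None, 'n3': ("v'", 2)}
print(read_repair(responses, write_back))   # repairs n1 and n2, returns "v'"
```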
  • Hinted handoff • When a node is unavailable • A write destined for it can be stored on another node as a hint • Delivered when the node comes back online
  • Anti-entropy • Equivalent to 'read repair all' • Requires reading all data (woah) • (Although only hashes are sent to calculate diffs) • Manual process
  • De-normalisation
  • De-normalisation • Disk space is much cheaper than disk seeks • Read at 100 MB/s, seek at 100 IO/s • => copy data to avoid seeks
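The back-of-envelope arithmetic behind "copy data to avoid seeks", using the figures on the slide (100 MB/s sequential, 100 seeks/s); the message count and row size are illustrative assumptions.

```python
# Slide figures: 100 MB/s sequential throughput vs 100 random seeks per second.
messages = 1000          # rows fetched for one inbox view (illustrative)
row_bytes = 1000         # ~1 KB per message header (illustrative)

seek_time = messages / 100.0                           # one seek per scattered row -> 10 s
scan_time = (messages * row_bytes) / (100 * 1e6)       # one contiguous 1 MB read   -> 0.01 s
print(seek_time, scan_time)
```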
  • Inbox [diagram: user1 sends messages msg1, msg2, msg3 to recipients user2, user3 and user4; each recipient's Inbox lists the messages addressed to them]
  • Data-centric model m1: { sender: user1, content: "Mary had a little lamb", recipients: user2, user3 } • but how to do 'recipients' for Inbox? • one-to-many modelled by a join table
  • To join: one CF holds the messages, e.g. m1: { sender: user1, subject: "A rhyme", content: "Mary had a little lamb" }, m2: { sender: user1, subject: "colours", content: "Its fleece was white as snow" }, m3: { sender: user1, subject: "loyalty", content: "And everywhere that Mary went" }; a join table maps recipients to message ids, e.g. user2: { m1: true }, user3: { m1: true, m2: true }, user4: { m2: true, m3: true }
  • ... or not to join • Joins are expensive, so de-normalise to trade off space for time • We can have lots of columns, so think BIG: • Make the message id a time-typed super-column • This makes get_slice an efficient way of searching for messages in a time window
  • Super Column Family user2: { m1: { sender: user1, subject: "A rhyme" } } user3: { m1: { sender: user1, subject: "A rhyme" }, m2: { sender: user1, subject: "colours" } } ...
  • De-normalisation + Cassandra • have to write a copy of the record for each recipient ... but writes are very cheap • get_slice fetches columns for a particular row, so gets received messages for a user • on-disk column order is optimal for this query
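A sketch of the denormalised inbox, again assuming pycassa: one super column family keyed by recipient, a TimeUUID super-column per message, and a copy of the message header written for every recipient, so that reading an inbox is a single row slice. The keyspace/CF names and users carry over from the earlier made-up examples.

```python
# Sketch of the denormalised Inbox as a super column family (pycassa, Thrift era).
# Names ('Mail', 'Inbox', the users) are illustrative, carried over from earlier sketches.
import uuid
import pycassa
from pycassa.system_manager import SystemManager, TIME_UUID_TYPE

sys_mgr = SystemManager('localhost:9160')
sys_mgr.create_column_family('Mail', 'Inbox', super=True,
                             comparator_type=TIME_UUID_TYPE)
sys_mgr.close()

pool = pycassa.ConnectionPool('Mail', ['localhost:9160'])
inbox = pycassa.ColumnFamily(pool, 'Inbox')

def send(sender, recipients, subject):
    msg_id = uuid.uuid1()   # time-based UUID: columns sort by send time on disk
    for user in recipients:
        # One copy per recipient: cheap writes traded for seek-free reads.
        inbox.insert(user, {msg_id: {'sender': sender, 'subject': subject}})

send('user1', ['user2', 'user3'], 'A rhyme')
send('user1', ['user3', 'user4'], 'colours')

# One get_slice on the recipient's row returns their messages in time order;
# column_reversed=True gives the most recent first.
print(inbox.get('user3', column_count=20, column_reversed=True))
```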
  • Conclusion
  • What it's good for • Horizontal scalability • No single point of failure • Multi-data centre support • Very high write workloads • Tuneable consistency
  • Q?