
Cassandra does what? Code Mania 2012


  1. CASSANDRA DOES WHAT? CODE MANIA 2012. Aaron Morton, Apache Cassandra Committer. @aaronmorton www.thelastpickle.com Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
  2. Cassandra?
  3. Cassandra is... Scalable
  4. Cassandra is... Distributed
  5. Cassandra is... Highly Available
  6. Cassandra uses... Column Families
  7. Cassandra is... Fast
  8. Cassandra is... Fun (Really.)
  9. Why Cassandra?
  10. Why Cassandra? Scale
  11. Why Cassandra? Operations
  12. Why Cassandra? Data Model
  13. Today. Cluster, Data Model, Node
  14. Cluster. Store the ‘foo’ row.
  15. Store ‘foo’. (Ring diagram: Node 1, Node 2, Node 3 and Node 4 each hold ‘foo’.)
  16. Cluster Capacity? Limited.
  17. Replication Factor (RF) specifies the number of row replicas.
  18. Everything is a copy. No Master Slave Replication.
  19. Store ‘foo’ with Replication Factor 3. (Ring diagram: Node 1, Node 2 and Node 3 hold ‘foo’; Node 4 does not.)
  20. Cluster Capacity? (Node Capacity × Number of Nodes) / Replication Factor
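
As a rough worked example of the capacity formula on slide 20, the sketch below divides raw capacity by the Replication Factor. The function name and the figures are illustrative assumptions, and it ignores real-world overheads such as free space needed for compaction.

```python
def usable_capacity(node_capacity_gb: float, node_count: int, replication_factor: int) -> float:
    """Cluster capacity = (node capacity x number of nodes) / replication factor."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return node_capacity_gb * node_count / replication_factor


# Hypothetical cluster: 6 nodes of 500 GB each, every row stored 3 times.
print(usable_capacity(500, 6, 3))
```

Adding nodes raises the numerator, which is why capacity scales with cluster size for a fixed RF.
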
  21. Scalable Capacity? ✓
  22. Consistent Hashing... Evenly map keys to nodes.
  23. Consistent Hashing... Minimise key movements when nodes join or leave.
  24. Partitioner... RandomPartitioner transforms Keys to Tokens using MD5. (Default Partitioner, there are others.)
  25. Keys and Tokens? (Diagram: a token line from 0 to 99, with key ‘fop’ at token 10 and key ‘foo’ at token 90.)
  26. 128 Bit Unsigned Integer Token. 170,141,183,460,469,231,731,687,303,715,884,105,728
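
A minimal sketch of turning a key into a token with MD5, as slide 24 describes. It assumes the token is the absolute value of the digest read as a signed big-endian integer, which is my understanding of how RandomPartitioner behaves; the function name is hypothetical. Under that assumption tokens fall between 0 and 2**127, the number shown on slide 26.

```python
import hashlib


def random_partitioner_token(key: bytes) -> int:
    """Map a row key to a token using MD5 (assumed RandomPartitioner behaviour)."""
    digest = hashlib.md5(key).digest()  # 16 bytes = 128 bits
    # Assumption: interpret the digest as a signed integer and take abs().
    return abs(int.from_bytes(digest, byteorder="big", signed=True))


for key in (b"foo", b"fop"):
    print(key, random_partitioner_token(key))
```

Similar keys such as 'foo' and 'fop' land on unrelated tokens, which is what spreads rows evenly around the ring.
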
  27. Token Ring. (Ring diagram: tokens run from 0 to 99 and wrap around; ‘fop’ sits at token 10 and ‘foo’ at token 90.)
  28. Partitioning... Assign a Token to each node. (initial_token)
  29. Token Ranges. (Ring diagram: Node 1 at token 0 with range 76-0, Node 2 at token 25 with range 1-25, Node 3 at token 50, Node 4 at token 75.)
  30. Token Ranges.
      Node   Token   Range From   Range To
      1      0       76           0
      2      25      1            25
      3      50      26           50
      4      75      51           75
  31. Locate Token Range. (Ring diagram: ‘foo’ at token 90 on the ring of Node 1 at token 0, Node 2 at token 25, Node 3 at token 50 and Node 4 at token 75.)
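
Locating the token range can be sketched as a binary search over the sorted node tokens. This uses the toy 0 to 99 ring from slides 29 to 31; the names `RING` and `primary_node` are illustrative, not Cassandra APIs.

```python
from bisect import bisect_left

# (token, node) pairs sorted by token, as on the Token Ranges slides.
RING = [(0, "Node 1"), (25, "Node 2"), (50, "Node 3"), (75, "Node 4")]
TOKENS = [token for token, _ in RING]


def primary_node(token: int) -> str:
    """Return the node whose range (previous token, own token] holds the token."""
    index = bisect_left(TOKENS, token)
    # Tokens above the highest node token wrap around to the first node.
    return RING[index % len(RING)][1]


print(primary_node(90))  # 'foo' falls in the 76-0 range
print(primary_node(10))  # 'fop' falls in the 1-25 range
```
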
  32. Replication Strategy selects Replication Factor number of nodes for a row.
  33. SimpleStrategy selects nodes by Token Order. (Non default, there are others.)
  34. SimpleStrategy with RF 3. (Ring diagram: ‘foo’ at token 90 on the ring of Node 1 at token 0, Node 2 at token 25, Node 3 at token 50 and Node 4 at token 75.)
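
Selecting replicas by token order can be sketched as: find the node that owns the token, then keep walking clockwise until RF nodes are collected. This is a simplified model on the toy ring from the slides; `simple_strategy_replicas` is a hypothetical name.

```python
from bisect import bisect_left


def simple_strategy_replicas(ring, token, replication_factor):
    """Pick replication_factor nodes in token order, starting at the range owner.

    ring is a list of (token, node) pairs sorted by token.
    """
    tokens = [node_token for node_token, _ in ring]
    start = bisect_left(tokens, token)
    count = min(replication_factor, len(ring))  # cannot have more replicas than nodes
    return [ring[(start + step) % len(ring)][1] for step in range(count)]


ring = [(0, "Node 1"), (25, "Node 2"), (50, "Node 3"), (75, "Node 4")]
print(simple_strategy_replicas(ring, 90, 3))
```

For ‘foo’ at token 90 this yields Nodes 1, 2 and 3, matching slide 19 where Node 4 holds no copy.
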
  35. NetworkTopologyStrategy uses a Replication Factor per Data Centre. (Default.)
  36. NetworkTopologyStrategy... Stripes replicas across racks.
  37. Multi DC Replication with RF 3 and RF 2. (Diagram: two rings, with ‘foo’ at token 90. West DC: Node 1 at token 0, Node 2 at token 25, Node 3 at token 50, Node 4 at token 75. East DC: Node 1 at token 1, Node 2 at token 26, Node 3 at token 51, Node 4 at token 76.)
  38. The Snitch knows which Data Centre and rack contains a Node.
  39. SimpleSnitch. Places all nodes in the same DC and rack. (Default, there are others.)
  40. PropertyFileSnitch. Places nodes in multiple DCs and racks using configuration. (There are others.)
  41. EC2Snitch. Places nodes in a DC using the AWS Region and a rack using Availability Zone. (There are others.)
  42. DynamicSnitch. Re-orders nodes according to their observed performance. (Wraps other snitch.)
  43. Clients connect to any node in the cluster.
  44. Coordinator handles a request for a client.
  45. The Client and the Coordinator. (Ring diagram: Nodes 1 to 4 at tokens 0, 25, 50 and 75, ‘foo’ at token 90, and a Client.)
  46. Nodes Gossip about other nodes.
  47. Gossip? Nodes share information with a small number of neighbours. Who share information with...
  48. Scalable Throughput? ✓
  49. Distributed? ✓
  50. Node Down (oh noes)
  51. Node Down. (Ring diagram: Nodes 1 to 4 at tokens 0, 25, 50 and 75, ‘foo’ at token 90, and a Client.)
  52. Client specified Consistency Level.
  53. Consistency Level... Any*, One, Two, Three,
  54. Consistency Level... QUORUM, LOCAL_QUORUM, EACH_QUORUM*
  55. Quorum? floor(RF / 2) + 1
  56. QUORUM at Replication Factor...
      Replication Factor   2 or 3   4 or 5   6 or 7
      QUORUM               2        3        4
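
The quorum formula from slide 55 is a one-liner; the sketch below reproduces the table on slide 56 and also shows how many replicas may be down while QUORUM can still be met. The function names are illustrative.

```python
def quorum(replication_factor: int) -> int:
    """floor(RF / 2) + 1, as on the Quorum slide."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be unavailable while a QUORUM request still succeeds."""
    return replication_factor - quorum(replication_factor)


for rf in range(2, 8):
    print(rf, quorum(rf), tolerated_failures(rf))
```

Note that RF 2 tolerates no failures at QUORUM, which is one reason odd replication factors are a common choice.
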
  57. UnavailableException
  58. TimedOutException
  59. Node Down with Hinted Handoff. (Ring diagram: ‘foo’ at token 90 is written to the available replicas, and a hint labelled ‘foo for #3’ is held for Node 3.)
  60. Cluster. Read the ‘foo’ row.
  61. Read ‘foo’. (Ring diagram: Nodes 1 to 4 at tokens 0, 25, 50 and 75, ‘foo’ at token 90, and a Client.)
  62. Consistency Level nodes must respond.
  63. Read ‘foo’ at QUORUM. (Ring diagram: Nodes 1 to 4 and a Client, with ‘foo’ at token 90 read from its replicas.)
  64. Consistency Level nodes must agree.
  65. Digests used to detect differences.
  66. Timestamps used to resolve differences.
  67. Differences in the ‘foo’ row.
      Column       Node 1                     Node 2                     Node 3
      purple       cromulent (timestamp 10)   cromulent (timestamp 10)   <missing>
      monkey       embiggens (timestamp 10)   embiggens (timestamp 10)   debigulator (timestamp 5)
      dishwasher   tomato (timestamp 10)      tomato (timestamp 10)      tomacco (timestamp 15)
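
Resolving the differences on slide 67 by timestamp can be sketched as a per-column "highest timestamp wins" merge. The data mirrors the slide; the `resolve` helper is hypothetical, and its handling of equal timestamps (first seen wins) is a simplification rather than Cassandra's actual tie-break rule.

```python
# Each node's view of the 'foo' row: column name -> (value, timestamp).
responses = {
    "Node 1": {"purple": ("cromulent", 10), "monkey": ("embiggens", 10), "dishwasher": ("tomato", 10)},
    "Node 2": {"purple": ("cromulent", 10), "monkey": ("embiggens", 10), "dishwasher": ("tomato", 10)},
    "Node 3": {"monkey": ("debigulator", 5), "dishwasher": ("tomacco", 15)},  # purple is missing
}


def resolve(node_responses):
    """Merge replica responses, keeping the newest version of every column."""
    merged = {}
    for columns in node_responses.values():
        for name, (value, timestamp) in columns.items():
            if name not in merged or timestamp > merged[name][1]:
                merged[name] = (value, timestamp)
    return merged


print(resolve(responses))
```

Node 3 wins for ‘dishwasher’ because its timestamp is newer, even though two nodes agree on the older value: agreement is by timestamp, not by majority vote.
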
  68. Consistent Read. (Two ring diagrams of Nodes 1 to 4 and a Client, with replicas labelled ‘cromulent’ and one labelled <empty>.)
  69. Read Repair is active on a fraction of requests. (10% by default)
  70. QUORUM with and without Read Repair. (Two ring diagrams, each of Nodes 1 to 4 and a Client.)
  71. I can haz Consistency? R + W > N (#Read Nodes + #Write Nodes > Replication Factor)
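
The R + W > N rule can be checked mechanically. The sketch below assumes RF 3 and maps a few Consistency Levels from the earlier slides to node counts; the function name is illustrative.

```python
def reads_see_latest_write(read_nodes: int, write_nodes: int, replication_factor: int) -> bool:
    """R + W > N: the read set and the write set must overlap in at least one replica."""
    return read_nodes + write_nodes > replication_factor


RF = 3
QUORUM = RF // 2 + 1  # 2 at RF 3

print(reads_see_latest_write(QUORUM, QUORUM, RF))  # QUORUM reads and QUORUM writes
print(reads_see_latest_write(1, 1, RF))            # ONE reads and ONE writes
print(reads_see_latest_write(1, 3, RF))            # ONE reads and THREE writes
```
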
  72. Anti Entropy... Hash key ranges on each node using Merkle Trees.
  73. Anti Entropy... Stream differences between nodes.
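
A toy illustration of the Merkle Tree idea from slides 72 and 73: hash rows into token-range buckets, hash pairs of buckets upward to a root, then compare two nodes' trees from the root down to find only the ranges that differ. Everything here (function names, a 0 to 99 ring, four buckets, SHA-256) is an assumption made for the sketch and not how Cassandra's repair is implemented in detail.

```python
import hashlib


def leaf_hashes(rows, bucket_count, ring_size=100):
    """Hash the rows falling into each equal-width token bucket.

    rows maps token -> value; bucket_count must be a power of two.
    """
    buckets = [hashlib.sha256() for _ in range(bucket_count)]
    for token in sorted(rows):
        index = token * bucket_count // ring_size
        buckets[index].update(f"{token}={rows[token]};".encode())
    return [bucket.digest() for bucket in buckets]


def build_tree(leaves):
    """Return the tree as a list of levels: leaves first, root last."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        below = levels[-1]
        levels.append([hashlib.sha256(below[i] + below[i + 1]).digest()
                       for i in range(0, len(below), 2)])
    return levels


def differing_buckets(tree_a, tree_b):
    """Walk down from the root, following only subtrees whose hashes differ."""
    top = len(tree_a) - 1
    suspects = [0] if tree_a[top][0] != tree_b[top][0] else []
    for level in range(top - 1, -1, -1):
        suspects = [child for parent in suspects
                    for child in (2 * parent, 2 * parent + 1)
                    if tree_a[level][child] != tree_b[level][child]]
    return suspects


node_a = build_tree(leaf_hashes({10: "fop-v1", 90: "foo-v1"}, 4))
node_b = build_tree(leaf_hashes({10: "fop-v1", 90: "foo-v2"}, 4))
print(differing_buckets(node_a, node_b))  # only the bucket holding token 90 needs streaming
```
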
  74. Highly Available? ✓
  75. Today. Cluster, Data Model, Node
  76. Data Model so far. Row Key: Column, Column, Column (Incomplete.)
  77. Data Model. (Diagram: a Keyspace contains Column Families; each Column Family holds rows, and each Row Key maps to a set of Columns. Excludes Super Columns.)
  78. Rows are the unit of replication.
  79. The Column Family is the unit of storage.
  80. Row and ColumnFamily are the unit of querying.
  81. API... Mutate
      # pycassa - Python
      >>> col_fam = pycassa.ColumnFamily(pool, 'ColumnFamily1')
      >>> col_fam.insert(row_key, {col_name: col_val})
  82. API... Mutate
      # Cassandra Query Language (CQL)
      INSERT INTO ColumnFamily1 (KEY, col_name) VALUES (row_key, col_value);
  83. API... Delete
      # pycassa - Python
      >>> col_fam.remove(row_key)
      >>> col_fam.remove(row_key, ['col_name'])
  84. API... Delete
      # Cassandra Query Language (CQL)
      DELETE FROM ColumnFamily1 WHERE key IN (row_key);
      DELETE col_name FROM ColumnFamily1 WHERE key = row_key;
  85. Batch Mutate saves on round trips. (It’s not a Tx.)
  86. API... Get, Multi-Get
      # pycassa - Python
      >>> col_fam.get(row_key)
      {col_name: col_val, col_name2: col_val2}
      >>> col_fam.multi_get([row_key], ['col_name'])
      {'row_key' : {col_name: col_val}}
  87. API... Get, Multi-Get
      # Cassandra Query Language (CQL)
      SELECT * FROM ColumnFamily1;
      SELECT col_name FROM ColumnFamily1 WHERE KEY IN ('row_key');
  88. API... Get Range*
      # pycassa - Python
      >>> col_fam.get_range(start=row_key)
      {row_key : {col_name: col_val},
       row_key50: {col_name: col_val},
       row_key2: {col_name: col_val}}
  89. API... Get Range*
      # Cassandra Query Language (CQL)
      SELECT * FROM ColumnFamily1 WHERE KEY >= 'row_key';
  90. Column Families? ✓
  91. Today. Cluster, Data Model, Node
  92. Optimised for Writes.
  93. Write path... Append to Write Ahead Log. (fsync every 10s by default, other options available)
  94. Write path... Merge Columns into Memtable. (Lock free, always in memory.)
  95. Write path... Done.
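
The two write-path steps on slides 93 to 95 can be modelled in a few lines: append the mutation to a log, then merge its columns into an in-memory table. `ToyNode` is a hypothetical stand-in that keeps the log in a Python list; it does not model fsync, locking or flushing.

```python
class ToyNode:
    """Toy model of the write path: commit log append, then memtable merge."""

    def __init__(self):
        self.commit_log = []  # stand-in for the append-only Write Ahead Log
        self.memtable = {}    # row key -> {column name: (value, timestamp)}

    def write(self, row_key, columns, timestamp):
        # Step 1: append to the Write Ahead Log so the write survives a restart.
        self.commit_log.append((row_key, dict(columns), timestamp))
        # Step 2: merge columns into the Memtable, newest timestamp wins.
        row = self.memtable.setdefault(row_key, {})
        for name, value in columns.items():
            current = row.get(name)
            if current is None or timestamp >= current[1]:
                row[name] = (value, timestamp)
        # Step 3: done. No data files are read or rewritten on the write path.


node = ToyNode()
node.write("foo", {"dishwasher": "tomato", "purple": "cromulent"}, timestamp=10)
node.write("foo", {"dishwasher": "tomacco"}, timestamp=15)
print(node.memtable["foo"])
```
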
  96. Fast for writes? ✓
  97. (Later.) Asynchronously flush Memtable to new files. (May be 10s or 100s of MB in size.)
  98. Data is stored in immutable SSTables. (Sorted String Table.)
  99. SSTable files. *-Data.db, *-Index.db, *-Filter.db (Also *-Statistics.db and *-Digest.sha1)
  100. SSTables. (Diagram: SSTables 1 to 5. Three of them hold columns for the ‘foo’ row: one holds dishwasher (ts 10): tomato and purple (ts 10): cromulent; one holds frink (ts 20): flayven and monkey (ts 10): embiggens; one holds dishwasher (ts 15): tomacco.)
  101. Read Path... Read columns from each SSTable, then merge results. (Roughly speaking.)
  102. Read Path... Use Bloom Filter to determine if a row key does not exist in an SSTable. (In memory)
  103. Bloom Filter says if a key is definitely not present, or present with a certain probability. (Default false positive rate is 0.0744%)
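
A minimal Bloom Filter sketch showing the "definitely not present, or probably present" behaviour from slide 103. The bit count, the number of hashes and the use of SHA-256 are arbitrary choices for illustration and say nothing about Cassandra's own filter implementation or its false positive rate.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, bit_count=1024, hash_count=3):
        self.bit_count = bit_count
        self.hash_count = hash_count
        self.bits = 0

    def _positions(self, key):
        # Derive hash_count bit positions by salting the key with a seed.
        for seed in range(self.hash_count):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bit_count

    def add(self, key):
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key):
        # False means definitely absent; True means present or a false positive.
        return all((self.bits >> position) & 1 for position in self._positions(key))


sstable_filter = BloomFilter()
sstable_filter.add("foo")
print(sstable_filter.might_contain("foo"))
```

A read can skip any SSTable whose filter answers False, so only the SSTables that may hold the row are touched.
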
  104. Read Path... Search for prior key in *-Index.db sample. (In memory)
  105. Read Path... Scan *-Index.db from prior key to find the search key and its *-Data.db offset. (On disk.)
  106. Read Path... Read *-Data.db from offset, all columns or specific pages. (Default 64KB page size.)
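
The index lookup on slides 104 and 105 can be sketched as a binary search over a sparse in-memory sample followed by a short forward scan of the full index. The keys, offsets and sampling interval below are invented for illustration, and entries are sorted by plain key for readability (an assumption; the real sort order depends on the partitioner).

```python
from bisect import bisect_right

# Stand-in for *-Index.db: (row key, offset into *-Data.db), sorted by key.
INDEX = [("apple", 0), ("banana", 120), ("cherry", 260), ("foo", 410),
         ("fop", 530), ("grape", 700), ("zebra", 910)]

# Stand-in for the in-memory index sample: every 3rd key and its index position.
SAMPLE_INTERVAL = 3
SAMPLE = [(INDEX[i][0], i) for i in range(0, len(INDEX), SAMPLE_INTERVAL)]
SAMPLE_KEYS = [key for key, _ in SAMPLE]


def find_data_offset(search_key):
    """Return the *-Data.db offset for search_key, or None if it is absent."""
    # In memory: find the last sampled key that is <= the search key.
    slot = bisect_right(SAMPLE_KEYS, search_key) - 1
    if slot < 0:
        return None  # search key sorts before every key in this SSTable
    # On disk: scan the index forward from that position.
    for key, offset in INDEX[SAMPLE[slot][1]:]:
        if key == search_key:
            return offset
        if key > search_key:
            break  # passed the place where the key would have been
    return None


print(find_data_offset("fop"))
```
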
  107. Read purple, monkey, dishwasher. (Diagram: each of SSTables 1 to 5 has a Bloom Filter and an Index Sample in memory, and its *-Index.db and *-Data.db files on disk. The *-Data.db files hold the ‘foo’ columns shown on slide 100.)
  108. Merge SSTables.
      Column       Versions found in SSTable 1, SSTable 2 and SSTable 4
      purple       cromulent (timestamp 10)
      monkey       embiggens (timestamp 10)
      dishwasher   tomato (timestamp 10), tomacco (timestamp 15)
  109. Key Cache caches row key position in *-Data.db file. (Removes up to 1 disk seek per SSTable.)
  110. Read with Key Cache. (Diagram: as slide 107, with a Key Cache added in memory for each of SSTables 1 to 5.)
  111. Row Cache caches entire row. (Removes all disk IO.)
  112. Read with Row Cache. (Diagram: as slide 110, with a Row Cache added in memory.)
  113. Fast for reads? ✓
  114. Tombstones ensure all replicas see a delete. (Purged after 10 days, configurable.)
  115. Merge SSTables with Tombstones.
      Column       Versions found in SSTable 1, SSTable 2 and SSTable 4
      purple       cromulent (timestamp 10), <tombstone> (timestamp 15)
      monkey       embiggens (timestamp 10)
      dishwasher   tomato (timestamp 10), tomacco (timestamp 15)
  116. Merge node response with Tombstones.
      Column       Node 1                     Node 2                     Node 3
      purple       cromulent (timestamp 10)   cromulent (timestamp 10)   <tombstone> (timestamp 15)
      monkey       embiggens (timestamp 10)   embiggens (timestamp 10)   debigulator (timestamp 5)
      dishwasher   tomato (timestamp 10)      tomato (timestamp 10)      tomacco (timestamp 15)
  117. Compaction merges truth from multiple SSTables into one SSTable with the same truth. (Manual and continuous background process.)
  118. Compaction.
      Column       Versions in SSTable 1, SSTable 2 and SSTable 4           New SSTable
      purple       cromulent (timestamp 10), <tombstone> (timestamp 15)     <tombstone> (timestamp 15)
      monkey       embiggens (timestamp 10)                                 embiggens (timestamp 10)
      dishwasher   tomato (timestamp 10), tomacco (timestamp 15)            tomacco (timestamp 15)
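
Compaction as shown on slide 118 can be sketched as the same "newest timestamp wins" merge, with a tombstone treated as just another version. The column data follows the slide; which SSTable holds the tombstone is my assumption, and `compact` is a hypothetical helper. The sketch keeps tombstones in the output, since slide 114 says they are only purged after a configurable period.

```python
TOMBSTONE = "<tombstone>"

# Column name -> (value, timestamp) for the 'foo' row in each input SSTable.
sstable_1 = {"dishwasher": ("tomato", 10), "purple": ("cromulent", 10)}
sstable_2 = {"monkey": ("embiggens", 10)}
sstable_4 = {"dishwasher": ("tomacco", 15), "purple": (TOMBSTONE, 15)}  # tombstone placement assumed


def compact(sstables):
    """Merge several SSTables into one, keeping the newest version per column."""
    merged = {}
    for sstable in sstables:
        for column, (value, timestamp) in sstable.items():
            if column not in merged or timestamp > merged[column][1]:
                merged[column] = (value, timestamp)
    # SSTables are sorted, so emit the columns in sorted order.
    return dict(sorted(merged.items()))


new_sstable = compact([sstable_1, sstable_2, sstable_4])
print(new_sstable)
```

The new SSTable answers reads exactly as the three inputs did, but with one file to consult instead of three.
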
  119. Today. Cluster, Data Model, Node
  120. Papers.
      • Cassandra - A Decentralized Structured Storage System (Lakshman et al).
      • Bigtable: A Distributed Storage System for Structured Data (Chang et al).
      • Dynamo: Amazon’s Highly Available Key-value Store (DeCandia et al).
      • Eventually Consistent (Werner Vogels).
      • Epidemic algorithms for replicated database maintenance (Demers et al).
      • Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services (Gilbert et al).
      • Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web (Karger et al).
      • The φ Accrual Failure Detector (Hayashibara et al).
  121. Aaron Morton, @aaronmorton, www.thelastpickle.com. Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
