Cassandra does what? Code Mania 2012
Published in: Technology, Business

  • Transcript

    • 1. CASSANDRA DOES WHAT? CODE MANIA 2012 Aaron Morton, Apache Cassandra Committer @aaronmorton www.thelastpickle.com Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
    • 2. Cassandra?
    • 3. Cassandra is... Scalable
    • 4. Cassandra is... Distributed
    • 5. Cassandra is... Highly Available
    • 6. Cassandra uses... Column Families
    • 7. Cassandra is... Fast
    • 8. Cassandra is... Fun (Really.)
    • 9. Why Cassandra?
    • 10. Why Cassandra? Scale
    • 11. Why Cassandra? Operations
    • 12. Why Cassandra? Data Model
    • 13. Today. Cluster, Data Model, Node.
    • 14. Cluster. Store the ‘foo’ row.
    • 15. Store ‘foo’. Node 1 - foo, Node 2 - foo, Node 3 - foo, Node 4 - foo.
    • 16. Cluster Capacity? Limited.
    • 17. Replication Factor specifies the number of row replicas. (RF)
    • 18. Everything is a copy. No Master Slave Replication.
    • 19. Store ‘foo’ with Replication Factor 3. Node 1 - foo, Node 2 - foo, Node 3 - foo, Node 4.
    • 20. Cluster Capacity? (Node Capacity x Number of Nodes) / Replication Factor
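The capacity formula on slide 20 can be written as a one-line function. This is only a sketch: the function name, the parameter names, and the choice of gigabytes as the unit are assumptions made for illustration, and it ignores overheads such as compaction headroom.

```python
def cluster_capacity(node_capacity_gb: float, num_nodes: int, replication_factor: int) -> float:
    """Usable capacity = (node capacity x number of nodes) / replication factor."""
    if num_nodes < 1 or replication_factor < 1:
        raise ValueError("num_nodes and replication_factor must be at least 1")
    return node_capacity_gb * num_nodes / replication_factor


# Four nodes of 300 GB each (hypothetical sizes) at Replication Factor 3.
print(cluster_capacity(300, 4, 3))
```

Adding nodes raises the result linearly, which is the sense in which slide 21 calls capacity scalable.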
    • 21. Scalable Capacity? ✓
    • 22. Consistent Hashing... Evenly map keys to nodes.
    • 23. Consistent Hashing... Minimise key movements when nodes join or leave.
    • 24. Partitioner... RandomPartitioner transforms Keys to Tokens using MD5. (Default Partitioner, there are others.)
    • 25. Keys and Tokens? Tokens run from 0 to 99. Key fop has token 10, key foo has token 90.
    • 26. 128 Bit Unsigned Integer Token. 170,141,183,460,469,231,731,687,303,715,884,105,728
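A minimal sketch of turning a key into a token with MD5, as slide 24 describes. The function name is hypothetical, and treating the digest as a signed big-endian integer and taking its absolute value is an assumption about how RandomPartitioner derives the number; the slides only say that MD5 is used.

```python
import hashlib

# The number shown on slide 26 is 2**127.
MAX_TOKEN = 2 ** 127


def random_partitioner_token(key: bytes) -> int:
    """Map a row key to a token in the range 0..2**127 using MD5."""
    digest = hashlib.md5(key).digest()  # 16 bytes = 128 bits
    # Assumption: interpret the digest as a signed integer and take abs().
    return abs(int.from_bytes(digest, "big", signed=True))


for key in (b"foo", b"fop"):
    print(key, random_partitioner_token(key))
```

Similar keys such as foo and fop land far apart on the ring, which is what spreads rows evenly over the nodes.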
    • 27. Token Ring. Tokens run from 0 to 99 around the ring. fop (token: 10), foo (token: 90).
    • 28. Partitioning... Assign a Token to each node. (initial_token)
    • 29. Token Ranges. Node 1 (token: 0, range 76-0), Node 2 (token: 25, range 1-25), Node 3 (token: 50), Node 4 (token: 75).
    • 30. Token Ranges. Node 1: token 0, range 76 to 0. Node 2: token 25, range 1 to 25. Node 3: token 50, range 26 to 50. Node 4: token 75, range 51 to 75.
    • 31. Locate Token Range. foo (token: 90) on the ring of Node 1 (token: 0), Node 2 (token: 25), Node 3 (token: 50), Node 4 (token: 75).
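Locating the token range for a key can be sketched with a binary search over the node tokens. The ring below uses the example tokens from slides 29 to 31; the function and variable names are made up for illustration.

```python
from bisect import bisect_left

# (token, node) pairs from the example ring, sorted by token.
RING = [(0, "Node 1"), (25, "Node 2"), (50, "Node 3"), (75, "Node 4")]


def primary_node(token: int, ring=RING) -> str:
    """Return the node whose range contains the token.

    A node owns the tokens after the previous node's token up to and
    including its own token; tokens past the last node wrap to the first.
    """
    tokens = [t for t, _ in ring]
    index = bisect_left(tokens, token) % len(ring)
    return ring[index][1]


print(primary_node(90))  # foo
print(primary_node(10))  # fop
```

With these tokens, foo (token 90) falls in the wrapping range 76 to 0 and fop (token 10) falls in the range 1 to 25.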
    • 32. Replication Strategy selects Replication Factor number of nodes for a row.
    • 33. SimpleStrategy selects nodes by Token Order. (Non default, there are others.)
    • 34. SimpleStrategy with RF 3. foo (token: 90) on the ring of Node 1 (token: 0), Node 2 (token: 25), Node 3 (token: 50), Node 4 (token: 75).
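Selecting replicas by token order can be sketched as: find the node that owns the token, then keep walking the ring until Replication Factor nodes have been collected. This is an illustration of the idea on slide 33, not Cassandra's implementation; all names are hypothetical.

```python
from bisect import bisect_left


def simple_strategy_replicas(token, ring, replication_factor):
    """Pick replicas by walking the ring in token order from the owning node.

    ring is a list of (token, node_name) pairs.
    """
    ring = sorted(ring)
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, token) % len(ring)
    count = min(replication_factor, len(ring))  # cannot have more replicas than nodes
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


ring = [(0, "Node 1"), (25, "Node 2"), (50, "Node 3"), (75, "Node 4")]
print(simple_strategy_replicas(90, ring, 3))
```

For foo (token 90) at RF 3 this selects Node 1, Node 2 and Node 3, matching the nodes that hold foo on slide 19.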
    • 35. NetworkTopologyStrategy uses a Replication Factor per Data Centre. (Default.)
    • 36. NetworkTopologyStrategy... Stripes replicas across racks.
    • 37. Multi DC Replication with RF 3 and RF 2. foo (token: 90). West DC: Node 1 (token: 0), Node 2 (token: 25), Node 3 (token: 50), Node 4 (token: 75). East DC: Node 1 (token: 1), Node 2 (token: 26), Node 3 (token: 51), Node 4 (token: 76).
    • 38. The Snitch knows which Data Centre and rack contains a Node.
    • 39. SimpleSnitch. Places all nodes in the same DC and rack. (Default, there are others.)
    • 40. PropertyFileSnitch. Places nodes in multiple DCs and racks using configuration. (There are others.)
    • 41. EC2Snitch. Places nodes in a DC using the AWS Region and a rack using Availability Zone. (There are others.)
    • 42. DynamicSnitch. Re-orders nodes according to their observed performance. (Wraps other snitch.)
    • 43. Clients connect to any node in the cluster.
    • 44. Coordinator handles a request for a client.
    • 45. The Client and the Coordinator. Client, foo (token: 90), and the ring of Node 1 (token: 0), Node 2 (token: 25), Node 3 (token: 50), Node 4 (token: 75).
    • 46. Nodes Gossip about other nodes.
    • 47. Gossip? Nodes share information with a small number of neighbours. Who share information with...
    • 48. Scalable Throughput? ✓
    • 49. Distributed? ✓
    • 50. Node Down (oh noes)
    • 51. Node Down. Client, foo (token: 90), and the ring of Node 1 (token: 0), Node 2 (token: 25), Node 3 (token: 50), Node 4 (token: 75).
    • 52. Client specified Consistency Level.
    • 53. Consistency Level... Any*, One, Two, Three,
    • 54. Consistency Level... QUORUM, LOCAL_QUORUM, EACH_QUORUM*
    • 55. Quorum? floor(RF / 2) + 1
    • 56. QUORUM at Replication Factor... RF 2 or 3: QUORUM 2. RF 4 or 5: QUORUM 3. RF 6 or 7: QUORUM 4.
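The quorum formula from slide 55 is small enough to check directly. The function names are illustrative; the second helper is an added convenience that shows how many replicas can be down while QUORUM is still reachable.

```python
def quorum(replication_factor: int) -> int:
    """QUORUM = floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that may be down while a QUORUM request can still succeed."""
    return replication_factor - quorum(replication_factor)


for rf in range(2, 8):
    print(f"RF {rf}: QUORUM {quorum(rf)}, tolerates {tolerated_failures(rf)} down")
```

Note that RF 2 gives a quorum of 2, so a QUORUM request at RF 2 cannot tolerate any replica being down.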
    • 57. UnavailableException
    • 58. TimedOutException
    • 59. Node Down with Hinted Handoff. Client, foo (token: 90). Node 1 - foo, Node 2 - foo, Node 3, Node 4 - foo for #3.
    • 60. Cluster. Read the ‘foo’ row.
    • 61. Read ‘foo’. Client, foo (token: 90), and the ring of Node 1 (token: 0), Node 2 (token: 25), Node 3 (token: 50), Node 4 (token: 75).
    • 62. Consistency Level nodes must respond.
    • 63. Read ‘foo’ at QUORUM. Client, foo (token: 90). Node 1 - foo, Node 2 - foo, Node 3, Node 4.
    • 64. Consistency Level nodes must agree.
    • 65. Digests used to detect differences.
    • 66. Timestamps used to resolve differences.
    • 67. Differences in the ‘foo’ row. Column purple: Node 1 cromulent (timestamp 10), Node 2 cromulent (timestamp 10), Node 3 <missing>. Column monkey: Node 1 embiggens (timestamp 10), Node 2 embiggens (timestamp 10), Node 3 debigulator (timestamp 5). Column dishwasher: Node 1 tomato (timestamp 10), Node 2 tomato (timestamp 10), Node 3 tomacco (timestamp 15).
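Resolving differences by timestamp can be sketched as a per-column merge in which the highest timestamp wins. The data below is the example from slide 67. The function name and the dictionary layout are assumptions for illustration, and on a timestamp tie this sketch simply keeps the first value seen, which is a simplification.

```python
def reconcile(responses):
    """Merge per-node responses column by column; the highest timestamp wins.

    responses maps node name -> {column name: (value, timestamp)}.
    Returns the merged row and the nodes whose copy differs from it.
    """
    merged = {}
    for columns in responses.values():
        for name, (value, timestamp) in columns.items():
            if name not in merged or timestamp > merged[name][1]:
                merged[name] = (value, timestamp)
    stale = sorted(node for node, columns in responses.items() if columns != merged)
    return merged, stale


responses = {
    "Node 1": {"purple": ("cromulent", 10), "monkey": ("embiggens", 10), "dishwasher": ("tomato", 10)},
    "Node 2": {"purple": ("cromulent", 10), "monkey": ("embiggens", 10), "dishwasher": ("tomato", 10)},
    "Node 3": {"monkey": ("debigulator", 5), "dishwasher": ("tomacco", 15)},
}
merged, stale = reconcile(responses)
print(merged)
print(stale)
```

In this example every node is out of date for at least one column: Node 1 and Node 2 hold the older dishwasher value, while Node 3 is missing purple and has an older monkey value.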
    • 68. Consistent Read. Node 1 Node 1 cromulent cromulent Node 4 Node 2 Node 4 Node 2 <empty> cromulent cromulent Client Client Node 3 Node 3
    • 69. Read Repair is active on a fraction of requests. (10% by default)
    • 70. QUORUM with and without Read Repair. (Two ring diagrams, each with Node 1, Node 2, Node 3, Node 4 and a Client.)
    • 71. I can haz Consistency? R + W > N (#Read Nodes + #Write Nodes > Replication Factor)
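The R + W > N rule can be checked for a few combinations of Consistency Level. This is a sketch with illustrative names; the levels are expressed as the number of replicas they require.

```python
def reads_see_latest_write(read_nodes: int, write_nodes: int, replication_factor: int) -> bool:
    """True when the read set and the write set must overlap: R + W > N."""
    return read_nodes + write_nodes > replication_factor


rf = 3
quorum = rf // 2 + 1
print("QUORUM read, QUORUM write:", reads_see_latest_write(quorum, quorum, rf))
print("ONE read, ONE write:", reads_see_latest_write(1, 1, rf))
print("ONE read, ALL write:", reads_see_latest_write(1, rf, rf))
```

When the inequality holds, at least one replica that answered the read also took part in the write, so the read can return the newest value.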
    • 72. Anti Entropy... Hash key ranges on each node using Merkle Trees.
    • 73. Anti Entropy... Stream differences between nodes.
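The two Anti Entropy slides can be illustrated with a toy hash tree: each node hashes its rows into token buckets, builds a tree of hashes over the buckets, and two nodes compare trees from the root down to find the buckets that need streaming. This is a simplified sketch, not Cassandra's Merkle tree code; the 0 to 99 token space, the bucket count and all names are assumptions.

```python
import hashlib


def leaf_hashes(rows, num_buckets=4, token_space=100):
    """Hash rows ({token: value}) into num_buckets equal token ranges.

    num_buckets must be a power of two so the tree is complete.
    """
    buckets = [hashlib.sha256() for _ in range(num_buckets)]
    for token in sorted(rows):
        index = token * num_buckets // token_space
        buckets[index].update(f"{token}={rows[token]};".encode())
    return [bucket.digest() for bucket in buckets]


def build_tree(leaves):
    """Return the tree as a list of levels, leaves first and root last."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        previous = levels[-1]
        levels.append([
            hashlib.sha256(previous[i] + previous[i + 1]).digest()
            for i in range(0, len(previous), 2)
        ])
    return levels


def differing_buckets(tree_a, tree_b):
    """Walk down from the root, descending only into subtrees that differ."""
    top = len(tree_a) - 1
    suspects = [0] if tree_a[top][0] != tree_b[top][0] else []
    for level in range(top - 1, -1, -1):
        suspects = [
            child
            for parent in suspects
            for child in (2 * parent, 2 * parent + 1)
            if tree_a[level][child] != tree_b[level][child]
        ]
    return suspects


node_a = {10: "fop-v1", 90: "foo-v1"}
node_b = {10: "fop-v1", 90: "foo-v2"}
tree_a = build_tree(leaf_hashes(node_a))
tree_b = build_tree(leaf_hashes(node_b))
print(differing_buckets(tree_a, tree_b))
```

Only the bucket covering token 90 differs here, so only that range would need to be streamed between the two nodes.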
    • 74. Highly Available? ✓
    • 75. Today. Cluster, Data Model, Node.
    • 76. Data Model so far. Row Key: Column Column Column (Incomplete.)
    • 77. Data Model. Keyspace Column Family Column Family Column Family Column Column Column Row Key: Column Column Column Column Column Column (Excludes Super Columns.)
    • 78. Rows are the unit of replication.
    • 79. The Column Family is the unit of storage.
    • 80. Row and ColumnFamily are the unit of querying.
    • 81. API... Mutate
          # pycassa - Python
          >>> import pycassa
          >>> col_fam = pycassa.ColumnFamily(pool, 'ColumnFamily1')
          >>> col_fam.insert(row_key, {col_name: col_val})
    • 82. API... Mutate
          # Cassandra Query Language (CQL)
          INSERT INTO ColumnFamily1 (KEY, col_name) VALUES (row_key, col_value);
    • 83. API... Delete
          # pycassa - Python
          >>> col_fam.remove(row_key)
          >>> col_fam.remove(row_key, ['col_name'])
    • 84. API... Delete
          # Cassandra Query Language (CQL)
          DELETE FROM ColumnFamily1 WHERE KEY IN (row_key);
          DELETE col_name FROM ColumnFamily1 WHERE KEY = row_key;
    • 85. Batch Mutate saves on round trips. (It’s not a Tx.)
    • 86. API... Get, Multi-Get
          # pycassa - Python
          >>> col_fam.get(row_key)
          {col_name: col_val, col_name2: col_val2}
          >>> col_fam.multi_get([row_key], ['col_name'])
          {'row_key': {col_name: col_val}}
    • 87. API... Get, Multi-Get
          # Cassandra Query Language (CQL)
          SELECT * FROM ColumnFamily1;
          SELECT col_name FROM ColumnFamily1 WHERE KEY IN ('row_key');
    • 88. API... Get Range*
          # pycassa - Python
          >>> col_fam.get_range(start=row_key)
          {row_key: {col_name: col_val},
           row_key50: {col_name: col_val},
           row_key2: {col_name: col_val}}
    • 89. API... Get Range*
          # Cassandra Query Language (CQL)
          SELECT * FROM ColumnFamily1 WHERE KEY >= 'row_key';
    • 90. Column Families? ✓
    • 91. Today. Cluster, Data Model, Node.
    • 92. Optimised for Writes.
    • 93. Write path... Append to Write Ahead Log. (fsync every 10s by default, other options available)
    • 94. Write path... Merge Columns into Memtable. (Lock free, always in memory.)
    • 95. Write path... Done.
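The three write path slides can be sketched as a toy node that appends each write to a log file and then merges the columns into an in-memory dictionary. This is an illustration only: the class, the JSON log format and the file name are invented, and the real commit log, its fsync policy and the lock-free Memtable are not modelled.

```python
import json
import os
import tempfile


class ToyNode:
    """Toy model of the write path: commit log append, then memtable merge."""

    def __init__(self, commit_log_path):
        self.commit_log = open(commit_log_path, "a", encoding="utf-8")
        self.memtable = {}  # row key -> {column name: (value, timestamp)}

    def write(self, row_key, columns, timestamp):
        # Step 1: append the mutation to the write ahead log.
        record = {"key": row_key, "columns": columns, "ts": timestamp}
        self.commit_log.write(json.dumps(record) + "\n")
        self.commit_log.flush()  # hand to the OS; periodic fsync is not modelled
        # Step 2: merge the columns into the memtable, newest timestamp wins.
        row = self.memtable.setdefault(row_key, {})
        for name, value in columns.items():
            current = row.get(name)
            if current is None or timestamp > current[1]:
                row[name] = (value, timestamp)
        # Step 3: done. No read or random disk IO was needed.

    def close(self):
        self.commit_log.close()


with tempfile.TemporaryDirectory() as directory:
    node = ToyNode(os.path.join(directory, "commitlog.txt"))
    node.write("foo", {"dishwasher": "tomato"}, timestamp=10)
    node.write("foo", {"dishwasher": "tomacco"}, timestamp=15)
    node.write("foo", {"dishwasher": "stale"}, timestamp=5)
    node.close()
    print(node.memtable)
```

The write with timestamp 5 is still appended to the log but does not replace the newer value in the memtable.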
    • 96. Fast for writes? ✓
    • 97. (Later.) Asynchronously flush Memtable to new files. (May be 10s or 100s of MB in size.)
    • 98. Data is stored in immutable SSTables. (Sorted String Table.)
    • 99. SSTable files. *-Data.db *-Index.db *-Filter.db (Also *-Statistics.db and *-Digest.sha1)
    • 100. SSTables. SSTable 1 to SSTable 5. Three of them hold part of the foo row; between them: dishwasher (ts 10): tomato, frink (ts 20): flayven, dishwasher (ts 15): tomacco, purple (ts 10): cromulent, monkey (ts 10): embiggens.
    • 101. Read Path... Read columns from each SSTable, then merge results. (Roughly speaking.)
    • 102. Read Path... Use Bloom Filter to determine if a row key does not exist in an SSTable. (In memory.)
    • 103. Bloom Filter says if a key is definitely not present, or present with a certain probability. (Default false positive rate is 0.0744%)
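A minimal Bloom filter sketch shows why the answer is either "definitely not present" or "probably present". The sizes, the use of SHA-256 to derive bit positions and all names are assumptions for illustration and do not reflect Cassandra's implementation or its default false positive rate.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter backed by a Python integer used as a bit set."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key):
        # False means definitely absent; True means present or a false positive.
        return all((self.bits >> position) & 1 for position in self._positions(key))


sstable_filter = BloomFilter()
sstable_filter.add("foo")
print(sstable_filter.might_contain("foo"))
print(sstable_filter.might_contain("fop"))
```

A key that was added always reports True. A key that was never added usually reports False, which lets the read skip that SSTable without touching disk.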
    • 104. Read Path... Search for prior key in *-Index.db sample. (In memory)
    • 105. Read Path... Scan *-Index.db from prior key to find the search key and its *-Data.db offset. (On disk.)
    • 106. Read Path... Read *-Data.db from offset, all columns or specific pages. (Default 64KB page size.)
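The index lookup on slides 104 and 105 can be sketched as a binary search over an in-memory sample followed by a short scan of the full index. This is a toy model: the full index is a Python list rather than a file, entries are ordered by plain key, and the names and the sampling interval are assumptions.

```python
from bisect import bisect_right


def find_data_offset(index_entries, sample_every, search_key):
    """Find the Data.db offset for search_key, or None if it is not indexed.

    index_entries is the sorted full index: a list of (key, data_offset).
    Only every sample_every-th entry is kept in the in-memory sample.
    """
    # In-memory sample: (key, position of that key in the full index).
    sample = [(index_entries[i][0], i) for i in range(0, len(index_entries), sample_every)]
    sample_keys = [key for key, _ in sample]
    # Search the sample for the closest key at or before the search key.
    position = bisect_right(sample_keys, search_key) - 1
    if position < 0:
        return None  # search key sorts before every indexed key
    # Scan the full index forward from the prior key.
    for key, offset in index_entries[sample[position][1]:]:
        if key == search_key:
            return offset
        if key > search_key:
            break
    return None


index = [("a", 0), ("c", 100), ("e", 250), ("g", 400), ("i", 520)]
print(find_data_offset(index, 2, "g"))
print(find_data_offset(index, 2, "f"))
```

The sample keeps memory use small, at the cost of scanning at most sample_every index entries on disk per lookup.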
    • 107. Read purple, monkey, dishwasher. Memory: a Bloom Filter and an Index Sample for each of the five SSTables. Disk: SSTable 1-Index.db to SSTable 5-Index.db and SSTable 1-Data.db to SSTable 5-Data.db. Three of the Data.db files hold part of the foo row; between them: dishwasher (ts 10): tomato, frink (ts 20): flayven, dishwasher (ts 15): tomacco, purple (ts 10): cromulent, monkey (ts 10): embiggens.
    • 108. Merge SSTables. Columns read from SSTable 1, SSTable 2 and SSTable 4. purple: cromulent (timestamp 10). monkey: embiggens (timestamp 10). dishwasher: tomato (timestamp 10) and tomacco (timestamp 15).
    • 109. Key Cache caches row key position in *-Data.db file. (Removes up to 1 disk seek per SSTable.)
    • 110. Read with Key Cache. Memory: a Bloom Filter, a Key Cache and an Index Sample for each of the five SSTables. Disk: SSTable 1-Index.db to SSTable 5-Index.db and SSTable 1-Data.db to SSTable 5-Data.db, with the foo row data from slide 100.
    • 111. Row Cache caches entire row. (Removes all disk IO.)
    • 112. Read with Row Cache. Memory: a Row Cache, plus a Bloom Filter, a Key Cache and an Index Sample for each of the five SSTables. Disk: SSTable 1-Index.db to SSTable 5-Index.db and SSTable 1-Data.db to SSTable 5-Data.db, with the foo row data from slide 100.
    • 113. Fast for reads? ✓
    • 114. Tombstones ensure all replicas see a delete. (Purged after 10 days, configurable.)
    • 115. Merge SSTables with Tombstones. Columns read from SSTable 1, SSTable 2 and SSTable 4. purple: cromulent (timestamp 10) and <tombstone> (timestamp 15). monkey: embiggens (timestamp 10). dishwasher: tomato (timestamp 10) and tomacco (timestamp 15).
    • 116. Merge node response with Tombstones. Column purple: Node 1 cromulent (timestamp 10), Node 2 cromulent (timestamp 10), Node 3 <tombstone> (timestamp 15). Column monkey: Node 1 embiggens (timestamp 10), Node 2 embiggens (timestamp 10), Node 3 debigulator (timestamp 5). Column dishwasher: Node 1 tomato (timestamp 10), Node 2 tomato (timestamp 10), Node 3 tomacco (timestamp 15).
    • 117. Compaction merges truth from multiple SSTables into one SSTable with the same truth. (Manual and continuous background process.)
    • 118. Compaction. Columns from SSTable 1, SSTable 2 and SSTable 4 merged into a New SSTable. purple: cromulent (timestamp 10) and <tombstone> (timestamp 15); New: <tombstone> (timestamp 15). monkey: embiggens (timestamp 10); New: embiggens (timestamp 10). dishwasher: tomato (timestamp 10) and tomacco (timestamp 15); New: tomacco (timestamp 15).
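Compaction with tombstones can be sketched as the same newest-timestamp-wins merge, with the tombstone treated as an ordinary value so that it shadows older data. The column values come from slide 118, but which SSTable holds which column is an assumption made for this example, as are all names. Purging old tombstones is not modelled.

```python
TOMBSTONE = "<tombstone>"


def compact(sstables):
    """Merge several SSTables into one; per column the newest timestamp wins.

    Each SSTable is {row key: {column name: (value, timestamp)}}.
    Tombstones are kept in the output so that they keep shadowing older
    values in SSTables that were not part of this compaction.
    """
    merged = {}
    for sstable in sstables:
        for row_key, columns in sstable.items():
            row = merged.setdefault(row_key, {})
            for name, (value, timestamp) in columns.items():
                if name not in row or timestamp > row[name][1]:
                    row[name] = (value, timestamp)
    return merged


# Assumed placement of the slide's columns across three SSTables.
older = {"foo": {"purple": ("cromulent", 10), "monkey": ("embiggens", 10), "dishwasher": ("tomato", 10)}}
delete = {"foo": {"purple": (TOMBSTONE, 15)}}
newer = {"foo": {"dishwasher": ("tomacco", 15)}}

new_sstable = compact([older, delete, newer])
print(new_sstable)
```

The result holds the same truth as the inputs in one place: purple is deleted as of timestamp 15, monkey is unchanged, and dishwasher has its newest value.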
    • 119. Today. Cluster, Data Model, Node.
    • 120. Papers.
          • Cassandra - A Decentralized Structured Storage System (Lakshman et al).
          • Bigtable: A Distributed Storage System for Structured Data (Chang et al).
          • Dynamo: Amazon’s Highly Available Key-value Store (DeCandia et al).
          • Eventually Consistent (Werner Vogels).
          • Epidemic algorithms for replicated database maintenance (Demers et al).
          • Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services (Gilbert et al).
          • Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web (Karger et al).
          • The φ Accrual Failure Detector (Hayashibara et al).
    • 121. Aaron Morton @aaronmorton www.thelastpickle.com Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License