
Cassandra Community Webinar: Back to Basics with CQL3

Cassandra is a distributed, massively scalable, fault-tolerant, columnar data store. If you need fast writes, the only thing faster than Cassandra is /dev/null! In this fast-paced presentation, we'll briefly describe big data and the area of big data that Cassandra is designed to fill. We will cover Cassandra's unique, every-node-the-same architecture. We will reveal Cassandra's internal data structure and explain just why Cassandra is so darned fast. Finally, we'll wrap up with a discussion of data modeling using the new standard protocol: CQL (Cassandra Query Language).

Published in: Technology


  1. 1. Back to Basics with CQL3. Matt Overstreet, OpenSource Connections
  3. 3. Outline
       • Overview
           o What is Big Data?
           o How does Cassandra fit?
       • Architecture
       • Data Modeling
       • Good At/Bad At
       • Using Cassandra
  4. 4. What is Big Data?
       • The three V’s (and a C): Velocity, Volume, Variety, Complexity
  5. 5. What is Big Data?
       • Brewer’s CAP theorem
           o Consistency - all nodes have same world view
           o Availability - requests can be serviced
           o Partition tolerance - network/machine failure
           o Can’t have all 3 -- Pick 2!
       • Examples
           o MySQL: Consistent, Available
           o HBase: Consistent, Partition Tolerant
           o Cassandra: Available, Partition Tolerant, and “Tunably Consistent”!
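One common way to read "tunably consistent": with N replicas of a row, a read that waits for R replicas is guaranteed to overlap the most recent write acknowledged by W replicas whenever R + W > N. The sketch below only checks that arithmetic; the function names are illustrative and do not come from the slides or from any Cassandra API.

```python
def quorum(n_replicas: int) -> int:
    """Smallest majority of n_replicas."""
    return n_replicas // 2 + 1


def overlaps_latest_write(n_replicas: int, write_acks: int, read_acks: int) -> bool:
    """Return True when every read set must intersect every write set."""
    if not (1 <= write_acks <= n_replicas and 1 <= read_acks <= n_replicas):
        raise ValueError("ack counts must be between 1 and the replica count")
    return read_acks + write_acks > n_replicas
```

With three replicas, quorum writes plus quorum reads overlap, while writing to one replica and reading from one replica does not, which is the availability-leaning end of the dial.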
  6. 6. What is Big Data?
       • Common theme: Denormalize everything!
           o What’s that?
               • JOIN all the tables in the database...
               • … well, not all the tables
           o Why?
               • You can shard the database at any point
               • All related data is co-located
       • What this means for you
           o No joins
           o No transactions - potential for inconsistency
           o Vastly simplified querying
           o No data-modeling -- instead, query-modeling
           o “Infinite and easy” scaling potential
  7. 7. How Does Cassandra Fit?
       • No single point of failure
       • Optimized for writes, still good with reads
       • Can decide between Consistency and Availability concerns
  8. 8. Outline
       • Overview
       • Architecture
           o Ring architecture
           o Data partitioning
           o Operations: Writes, Reads
       • Data Modeling
       • Good At/Bad At
       • Using Cassandra
  9. 9. Ring Architecture
       • No single point of failure
       • Nodes talk via gossip
       • Democratic - all nodes are equal
  10. 10. Data Partitioning: Original partitioning method.
  11. 11. Data Partitioning: Flexible partitioning with virtual nodes.
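The two partitioning slides can be illustrated with a toy token ring. This is a minimal sketch under stated assumptions: MD5 stands in for the partitioner's hash, and the `Ring` class, `token` helper, and `vnodes_per_node` parameter are hypothetical names, not Cassandra's implementation. Setting `vnodes_per_node=1` approximates the original one-token-per-node scheme; larger values approximate virtual nodes.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 used purely for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node owns several tokens ("virtual nodes").
        self._entries = sorted(
            (token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._tokens = [t for t, _ in self._entries]

    def owner(self, row_key: str) -> str:
        """Return the node owning the first token at or after the key's token."""
        idx = bisect.bisect_left(self._tokens, token(row_key))
        return self._entries[idx % len(self._entries)][1]

    def replicas(self, row_key: str, count: int):
        """Walk clockwise from the owner, collecting distinct physical nodes."""
        idx = bisect.bisect_left(self._tokens, token(row_key))
        found = []
        for step in range(len(self._entries)):
            node = self._entries[(idx + step) % len(self._entries)][1]
            if node not in found:
                found.append(node)
            if len(found) == count:
                break
        return found
```

Because every node holds many small token ranges, adding or removing a node moves small slices of data from many peers rather than one large range from a single neighbor.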
  12. 12. Operations: Writes. Requests are sent out to the nodes holding replicas.
  13. 13. Operations: Reads. The coordinator node reaches out to the relevant replicas.
  14. 14. Outline
        • Overview
        • Architecture
        • Data Modeling
            o Internals
            o Cassandra Query Language
            o Modeling Strategy
            o Example
        • Good At/Bad At
        • Using Cassandra
  16. 16. C* Data Model [Diagram: a Keyspace containing Column Families]
  21. 21. C* Data Model
        [Diagram: a row, identified by its Row Key, holds Columns; each Column has a Column Name, a Column Value (or Tombstone), a Timestamp, and a Time-to-live]
        ● Row Key, Column Name, Column Value have types
        ● Column Name has comparator
        ● Row Key has partitioner
        ● Rows can have any number of columns - even in same column family
        ● Rows can have many columns
        ● Column Values can be omitted
        ● Time-to-live is useful!
        ● Tombstones
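The column structure on this slide can be sketched as a small value type. This is an illustration only: the `Column` class and `reconcile` function are hypothetical names, timestamps are assumed to be microseconds since the epoch, and the tie-breaking rule (a tombstone wins on equal timestamps, otherwise the second argument) is a simplification chosen for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    name: str
    value: Optional[bytes]            # None marks a tombstone (deleted column)
    timestamp: int                    # writer-supplied, microseconds since epoch
    ttl_seconds: Optional[int] = None

    @property
    def is_tombstone(self) -> bool:
        return self.value is None

    def is_expired(self, now_micros: int) -> bool:
        """A column with a time-to-live disappears once the TTL has elapsed."""
        if self.ttl_seconds is None:
            return False
        return now_micros >= self.timestamp + self.ttl_seconds * 1_000_000


def reconcile(a: Column, b: Column) -> Column:
    """Pick the winning version of one column: the highest timestamp wins."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    return a if a.is_tombstone else b
```

Since the winner is decided by timestamp alone, a delete is just another write: a tombstone with a newer timestamp shadows older values until it is eventually purged.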
  22. 22. C* Data Model: Writes
        [Diagram: MemTable, CommitLog, Row Cache, Bloom Filter; four Key Cache/SSTable pairs]
        ● Append to CommitLog
        ● Insert into MemTable
        ● No read
        ● Very Fast!
        ● Blocks on CPU before I/O!
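The write path can be sketched as a toy in-memory store. Everything here is an assumption made for illustration: the `ToyStore` class, the `memtable_limit` flush trigger, and the use of Python lists and dicts in place of on-disk files. The sketch appends to the log first and then updates the memtable, and it never touches previously flushed data.

```python
class ToyStore:
    """Toy write path: append to a log, update an in-memory table, flush when full."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # stands in for the on-disk, append-only CommitLog
        self.memtable = {}     # (row_key, column_name) -> (timestamp, value)
        self.sstables = []     # immutable, sorted snapshots; newest last
        self.memtable_limit = memtable_limit

    def write(self, row_key, column_name, value, timestamp):
        # No disk read happens on this path: one append plus one in-memory update.
        self.commit_log.append((row_key, column_name, value, timestamp))
        key = (row_key, column_name)
        current = self.memtable.get(key)
        if current is None or timestamp >= current[0]:
            self.memtable[key] = (timestamp, value)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as one sorted, immutable table.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()   # flushed data no longer needs log replay
```

The write cost stays flat as data grows because nothing already flushed is ever rewritten in place; merging old tables is deferred to compaction.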
  25. 25. C* Data Model: Reads
        [Diagram: MemTable, CommitLog, Row Cache, Bloom Filter; four Key Cache/SSTable pairs]
        ● Get values from MemTable
        ● Get values from row cache if present
        ● Otherwise check bloom filter to find appropriate SSTables
        ● Check Key Cache for fast SSTable search
        ● Get values from SSTables
        ● Repopulate Row Cache
        ● Super-fast column retrieval
        ● Fast row slicing
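The read steps listed on this slide can be sketched as follows. This is a simplified model, not Cassandra's code: `BloomFilter`, `SSTable`, `merge_into`, and `read_row` are hypothetical names, the key cache is omitted, and the row cache here stores the merged on-disk columns for a row.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=256, hashes=3):
        self.size_bits, self.hashes, self.bits = size_bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    def __init__(self, rows):
        self.rows = dict(rows)   # row_key -> {column: (timestamp, value)}
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


def merge_into(target, columns):
    """Copy columns into target, keeping the newest timestamp per column."""
    for name, (ts, value) in columns.items():
        if name not in target or ts > target[name][0]:
            target[name] = (ts, value)


def read_row(row_key, memtable, row_cache, sstables):
    merged = dict(memtable.get(row_key, {}))         # values from the memtable
    if row_key in row_cache:                         # row cache hit
        merge_into(merged, row_cache[row_key])
    else:
        on_disk = {}
        for table in sstables:
            if table.bloom.might_contain(row_key):   # skip tables that cannot match
                merge_into(on_disk, table.rows.get(row_key, {}))
        row_cache[row_key] = on_disk                 # repopulate the row cache
        merge_into(merged, on_disk)
    return {name: value for name, (_, value) in merged.items()}
```

A false positive from the Bloom filter only costs one wasted lookup; a negative answer safely skips the table altogether.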
  32. 32. Internals: Twitter Example
        • 4 ColumnFamilies
            o followers
            o following
            o tweets
            o timeline
        • Nate follows Patricia
            o SET followers[Patricia][Nate] = '';
            o SET following[Nate][Patricia] = '';
            o storing data in column names (not values)
            o denormalized, redundant!
        • Get all Patricia’s followers
            o GET followers[Patricia]
            o => Nate,Eric,Scott,Matt,Doug,Kate
            o No JOIN!
  33. 33. Internals: Twitter Example
        • Nate tweets
            o SET tweets[Nate][2013-07-19 T 09:20] = “Wonderful morning. This coffee is great.”
            o SET tweets[Nate][2013-07-19 T 09:21] = “Oops, smoke is coming out of the SQL server!”
            o SET tweets[Nate][2013-07-19 T 09:51] = “Now my coffee is cold :-(”
        • Get Nate’s tweets
            o GET tweets[Nate]
            o …(what you’d expect)...
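The SET/GET notation in the Twitter example maps naturally onto nested dictionaries. The sketch below is a plain-Python analogy, not a Cassandra client: the `follow` and `get_columns` helpers are illustrative names, and sorting by column name stands in for the comparator.

```python
from collections import defaultdict

# column family -> row key -> {column name: value}
followers = defaultdict(dict)
following = defaultdict(dict)
tweets = defaultdict(dict)


def follow(user, target):
    """Record one "user follows target" fact in both column families."""
    followers[target][user] = ""   # the follower's name is the column name
    following[user][target] = ""   # redundant copy, keyed the other way


def get_columns(column_family, row_key):
    """Return a row's columns in column-name order, as a comparator would."""
    return sorted(column_family[row_key].items())


follow("Nate", "Patricia")
follow("Eric", "Patricia")
tweets["Nate"]["2013-07-19T09:21"] = "Oops, smoke is coming out of the SQL server!"
tweets["Nate"]["2013-07-19T09:20"] = "Wonderful morning. This coffee is great."
```

Each question ("who follows Patricia?", "whom does Nate follow?") is answered by reading one row, which is why the same fact is written twice. Timestamp column names also come back in chronological order regardless of insertion order.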
  37. 37. CQL (Cassandra Query Language)

        CREATE TABLE users (
          id timeuuid PRIMARY KEY,
          lastname varchar,
          firstname varchar,
          dateOfBirth timestamp
        );

        INSERT INTO users (id, lastname, firstname, dateofbirth)
        VALUES (now(), 'Berryman', 'John', '1975-09-15');

        UPDATE users SET firstname = 'John'
        WHERE id = f74c0b20-0862-11e3-8cf6-b74c10b01fc6;

        SELECT dateofbirth, firstname, lastname FROM users;

         dateofbirth              | firstname | lastname
        --------------------------+-----------+----------
         1975-09-15 00:00:00-0400 | John      | Berryman
  40. 40. The CQL/Cassandra Mapping

        CREATE TABLE employees (
          company text,
          name text,
          age int,
          role text,
          PRIMARY KEY (company, name)
        );

         company | name | age | role
        ---------+------+-----+------
         OSC     | eric |  38 | ceo
         OSC     | john |  37 | dev
         RKG     | anya |  29 | lead
         RKG     | ben  |  27 | dev
         RKG     | chad |  35 | ops

        Internal storage: one row per company, one column per name:field pair.

         Row key | Column name | Column value
        ---------+-------------+--------------
         OSC     | eric:age    | 38
         OSC     | eric:role   | ceo
         OSC     | john:age    | 37
         OSC     | john:role   | dev
         RKG     | anya:age    | 29
         RKG     | anya:role   | lead
         RKG     | ben:age     | 27
         RKG     | ben:role    | dev
         RKG     | chad:age    | 35
         RKG     | chad:role   | ops
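The mapping on this slide can be reproduced with a short transformation. This is a sketch of the idea only: `to_storage_rows` is a hypothetical helper, and joining the clustering value and column name with a colon mirrors the slide's notation rather than the actual on-disk encoding.

```python
def to_storage_rows(cql_rows, partition_key, clustering_key):
    """Group CQL rows by partition key and flatten the rest into composite column names."""
    storage = {}
    for row in cql_rows:
        wide_row = storage.setdefault(row[partition_key], {})
        prefix = row[clustering_key]
        for column, value in row.items():
            if column not in (partition_key, clustering_key):
                wide_row[f"{prefix}:{column}"] = value
    # Columns within a storage row are kept sorted by name.
    return {key: dict(sorted(cols.items())) for key, cols in storage.items()}


employees = [
    {"company": "OSC", "name": "eric", "age": 38, "role": "ceo"},
    {"company": "OSC", "name": "john", "age": 37, "role": "dev"},
    {"company": "RKG", "name": "anya", "age": 29, "role": "lead"},
]
layout = to_storage_rows(employees, "company", "name")
```

The first component of the primary key decides which storage row (and therefore which node) holds the data; the second decides the sort order inside that row, which is what makes per-company slices by name cheap.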
  41. 41. Modeling Strategy
        • Don’t think about the data structure
        • Do think of the questions you’ll ask
        • Consider efficient operations for Cassandra
            o Writing (4K writes per second per core)
            o Retrieving a row
            o Retrieving a row slice
            o Retrieving in natural order (which you control)
        • Write the data in the way you will query it
        • Disk space is cheap
        • Separate read-heavy and write-heavy tasks
            o Make wise use of caches
  42. 42. Modeling Strategy: Anti-Patterns
        • Read-then-write
        • Heavy deletes
            o Scatters dead columns throughout SSTables
            o Won’t be corrected until the first compaction after gc_grace_seconds (10 days by default)
        • Distributed queue
        • JOIN-like behavior
        • Super wide-row sneak attack (>2B columns)
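The heavy-deletes anti-pattern follows from how long tombstones stay on disk. The sketch below models only the timing rule from the slide: `purgeable` and `compact` are hypothetical helpers, times are plain seconds, and real compaction applies further conditions that are left out here.

```python
GC_GRACE_SECONDS = 10 * 24 * 60 * 60   # 864000 seconds, the 10-day value from the slide


def purgeable(deleted_at: int, compaction_time: int,
              gc_grace_seconds: int = GC_GRACE_SECONDS) -> bool:
    """True when a compaction running at compaction_time may drop the tombstone."""
    return compaction_time > deleted_at + gc_grace_seconds


def compact(columns, compaction_time):
    """Keep live columns and tombstones that are still inside the grace period.

    columns: iterable of (name, value, deleted_at); deleted_at is None for live data.
    """
    return [
        (name, value, deleted_at)
        for name, value, deleted_at in columns
        if deleted_at is None or not purgeable(deleted_at, compaction_time)
    ]
```

Until both conditions hold (the grace period has passed and a compaction has run), every read of the affected row still has to step over the dead columns.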
  43. 43. QUESTIONS?