Big Data Grows Up - A (re)introduction to Cassandra

For the last several years Cassandra has been the heavyweight in the NoSQL space. But its massive scalability was accompanied by a bare bones feature set, a substantial learning curve, and a Thrift-based RPC mechanism that left newbies bewildered by a sea of potential client libraries, all with their own fragmented semantics. Over the last year that’s all changed, culminating in the recently unveiled Cassandra 2.0. In this talk I’ll bring you up to speed on Cassandra Query Language, cursors, the new native libraries, lightweight transactions, virtual nodes, and loads of other new goodies. Whether you’re completely new to Cassandra or a seasoned veteran who wants the latest scoop, this talk has something for you.


  1. Big Data Grows Up - A (re)introduction to Cassandra
     Robbie Strickland
  2. Who am I?
     Robbie Strickland
     Software Development Manager, The Weather Channel
     rostrickland@gmail.com | @dont_use_twitter
  3. Who am I?
     ● Cassandra user/contributor since 2010
     ● … it was at release 0.5 back then
     ● 4 years? Oracle DBAs aren’t impressed
     ● Done lots of dumb stuff with Cassandra
     ● … and some really awesome stuff too
  4. Cassandra in 2010
  5. Cassandra in 2010
  6. Cassandra in 2014
  7. Why Cassandra? It’s fast:
     ● No locks
     ● Tunable consistency
     ● Sequential R/W
     ● Decentralized
  8. Why Cassandra? It scales (linearly):
     ● Multi data center
     ● No SPOF
     ● DHT
     ● Hadoop integration
  9. Why Cassandra? It’s fault tolerant:
     ● Automatic replication
     ● Masterless
     ● Failed nodes replaced with ease
  10. What’s different? … a lot in the last year (ish)
  11. What’s new?
      ● Virtual nodes
      ● O(n) data moved off-heap
      ● CQL3 (and defining schemas)
      ● Native protocol/driver
      ● Collections
      ● Lightweight transactions
      ● Compaction throttling that actually works
  12. What’s gone?
      ● Manual token management
      ● Supercolumns
      ● Thrift (if you use the native driver)
      ● Directly managing storage rows
  13. What’s still the same?
      ● Still not an RDBMS
      ● Still no joins (see above)
      ● Still no ad-hoc queries (see above again)
      ● Still requires a denormalized data model (^^)
      ● Still need to know what the heck you’re doing
  14. Token Management: linear scalability without the migraine
  15. The old way (ring diagram: cluster with no vnodes, nodes A–F)
      ● 1 token per node
      ● Assigned manually
      ● Adding nodes == reassignment of all tokens
      ● Node rebuild heavily taxes a few nodes
  16. … enter Vnodes (ring diagram: cluster with vnodes, nodes A–N)
      ● n tokens per node
      ● Assigned magically
      ● Adding nodes == painless
      ● Node rebuild distributed across many nodes
  17. Node rebuild without Vnodes
  18. Node rebuild with Vnodes
  19. Going Off-heap: because the JVM sometimes sucks
  20. Why go off-heap?
      ● GC overhead
      ● JVM no good with big heap sizes
      ● GC overhead
      ● GC overhead
      ● GC overhead
  21. O(n) data structures
      ● Row cache
      ● Bloom filters
      ● Compression offsets
      ● Partition summary
      … all these are moved off-heap
  22. New memory allocation (diagram: the partition key cache stays on the JVM heap; the row cache, Bloom filters, compression offsets, and partition summary move to the native heap)
  23. Death of a (Thrift) Salesman: or, how to build a killer data store without a crappy interface
  24. Reasons not to ditch Thrift
      ● Lots of client libraries still use it
      ● You finally got it installed
      ● You didn’t know there was another choice
      ● It sucks less than many alternatives
  25. … in spite of all those benefits, you really should ditch Thrift because:
      ● It requires your entire result set to fit into RAM on both client and server
      ● The native protocol is better, faster, and supports all the new features
      ● Thrift-based client libraries are always a step behind
      ● It’s going away eventually
  26. … and did I mention: it requires your entire result set to fit into RAM on both client and server!!!
  27. Requesting too much data
  28. Going Native (really catchy tag line here)
  29. Native protocol
      ● It’s binary, making it lighter weight
      ● It supports cursors (FTW!)
      ● It supports prepared statements
      ● Cluster awareness built-in
      ● Either synchronous or asynchronous ops
      ● Only supports CQL-based operations
      ● Can be used side-by-side with Thrift
  30. Native drivers from DataStax: Java, C#, Python
      … other community-supported drivers available
  31. Native query example
      val cluster = Cluster.builder()
        .addContactPoints(host1, host2, host3)
        .build()
      val session = cluster.connect()

      val insert = session.prepare(
        "INSERT INTO myKsp.myTable (myKey, col1, col2) VALUES (?,?,?)")
      val select = session.prepare(
        "SELECT * FROM myKsp.myTable WHERE myKey = ?")

      session.execute(insert.bind(myKey, col1, col2))
      val result = session.execute(select.bind(myKey))
  32. Wait, was that SQL?!! Or, how to make Cassandra more awesome while simultaneously irritating early adopters
  33. Introducing CQL3
      ● Because the first two attempts sucked
      ● Stands for “Cassandra Query Language”
      ● Looks a heck of a lot like SQL
      ● … but isn’t
      ● Substantially lowers the learning curve
      ● … but also makes it easier to screw up
      ● An abstraction over the storage rows
  34. Storage rows
      [default@unknown] create keyspace Library;
      [default@unknown] use Library;
      [default@Library] create column family Books
      ...   with comparator=UTF8Type
      ...   and key_validation_class=UTF8Type
      ...   and default_validation_class=UTF8Type;
      [default@Library] set Books['Patriot Games']['author'] = 'Tom Clancy';
      [default@Library] set Books['Patriot Games']['year'] = '1987';
      [default@Library] list Books;
      RowKey: Patriot Games
      => (name=author, value=Tom Clancy, timestamp=1393102991499000)
      => (name=year, value=1987, timestamp=1393103015955000)
  35. Storage rows - composites
      [default@Library] create column family Authors
      ...   with key_validation_class=UTF8Type
      ...   and comparator='CompositeType(LongType,UTF8Type,UTF8Type)'
      ...   and default_validation_class=UTF8Type;
      [default@Library] set Authors['Tom Clancy']['1987:Patriot Games:publisher'] = 'Putnam';
      [default@Library] set Authors['Tom Clancy']['1987:Patriot Games:ISBN'] = '0-399-13241-4';
      [default@Library] set Authors['Tom Clancy']['1993:Without Remorse:publisher'] = 'Putnam';
      [default@Library] set Authors['Tom Clancy']['1993:Without Remorse:ISBN'] = '0-399-13825-0';
      [default@Library] list Authors;
      RowKey: Tom Clancy
      => (name=1987:Patriot Games:ISBN, value=0-399-13241-4, timestamp=1393104011458000)
      => (name=1987:Patriot Games:publisher, value=Putnam, timestamp=1393103948577000)
      => (name=1993:Without Remorse:ISBN, value=0-399-13825-0, timestamp=1393104109214000)
      => (name=1993:Without Remorse:publisher, value=Putnam, timestamp=1393104083773000)
  36. CQL - simple intro
      cqlsh> CREATE KEYSPACE Library
         ... WITH REPLICATION = {'class':'SimpleStrategy', 'replication_factor':1};
      cqlsh> USE Library;
      cqlsh:library> CREATE TABLE Books (
                 ...   title varchar,
                 ...   author varchar,
                 ...   year int,
                 ...   PRIMARY KEY (title)
                 ... );
      cqlsh:library> INSERT INTO Books (title, author, year)
                 ... VALUES ('Patriot Games', 'Tom Clancy', 1987);
      cqlsh:library> INSERT INTO Books (title, author, year)
                 ... VALUES ('Without Remorse', 'Tom Clancy', 1993);
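Reading the rows back is just as SQL-like; a sketch of a point query against this Books table (not from the slides, output format approximated from cqlsh):

```sql
cqlsh:library> SELECT title, author, year FROM Books WHERE title = 'Patriot Games';

 title         | author     | year
---------------+------------+------
 Patriot Games | Tom Clancy | 1987
```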
  37. CQL - simple intro: storage rows
  38. CQL - composite key
      CREATE TABLE Authors (
        name varchar,
        year int,
        title varchar,
        publisher varchar,
        ISBN varchar,
        PRIMARY KEY (name, year, title)
      )
  39. CQL - composite key: storage rows
  40. Keys and Filters
      ● Ad hoc queries are NOT supported
      ● Query by key
      ● Key must include all potential filter columns
      ● Must include partition key in filter
      ● Subsequent filters must be in order
      ● Only last filter can be a range
  41. Example - Books table
      CREATE TABLE Books (
        title varchar,
        author varchar,
        year int,
        PRIMARY KEY (title)
      )
  42. Example - Books table
      CREATE TABLE Books (
        title varchar,
        author varchar,
        year int,
        PRIMARY KEY (author, title)
      )
  43. Example - Books table
      CREATE TABLE Books (
        title varchar,
        author varchar,
        year int,
        PRIMARY KEY (author, year)
      )
  44. Example - Books table
      CREATE TABLE Books (
        title varchar,
        author varchar,
        year int,
        PRIMARY KEY (year, author)
      )
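Applying the filter rules above to the PRIMARY KEY (author, year) variant, for example (hypothetical queries, not from the slides):

```sql
-- PRIMARY KEY (author, year): author is the partition key, year a clustering column
SELECT * FROM Books WHERE author = 'Tom Clancy';                   -- OK: partition key
SELECT * FROM Books WHERE author = 'Tom Clancy' AND year >= 1987;  -- OK: range on last filter
SELECT * FROM Books WHERE year = 1987;                             -- rejected: partition key missing
```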
  45. Secondary Indexes
      ● Allows query-by-value
      ● CREATE INDEX myIdx ON myTable (myCol)
      ● Works well on low cardinality fields
      ● Won’t scale for high cardinality fields
      ● Don’t overuse it -- not a quick fix for a bad data model
  46. Example - Books table
      CREATE TABLE Books (
        title varchar,
        author varchar,
        year int,
        PRIMARY KEY (author)
      )
      CREATE INDEX Books_year ON Books(year)
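With the Books_year index in place, a query can filter on the indexed value rather than the key (a sketch, not from the slides):

```sql
-- query-by-value via the secondary index
SELECT * FROM Books WHERE year = 1987;

-- query by partition key still works as before
SELECT * FROM Books WHERE author = 'Tom Clancy';
```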
  47. Composite Partition Keys
      ● PRIMARY KEY ((year, author), title)
      ● Creates a more granular shard key
      ● Can be useful to make certain queries more efficient, or to better distribute data
      ● Updates sharing a partition key are atomic and isolated
  48. Example - Books table
      CREATE TABLE Books (
        title varchar,
        author varchar,
        year int,
        PRIMARY KEY ((year, author), title)
      )
  49. Example - Books table
      CREATE TABLE Books (
        title varchar,
        author varchar,
        year int,
        PRIMARY KEY (year, author, title)
      )
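With the composite partition key ((year, author), title), both components together form the shard key, so both must appear in the filter (hypothetical queries, not from the slides):

```sql
SELECT * FROM Books WHERE year = 1987 AND author = 'Tom Clancy';  -- OK: full partition key
-- SELECT * FROM Books WHERE year = 1987;   -- rejected: partition key only partially restricted
```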
  50. Collections: denormalization done well
  51. Supported types
      ● Sets - ordered naturally
      ● Lists - ordered by index
      ● Maps - key/value pairs
  52. Caveats
      ● Max 64k items in a collection
      ● Max 64k size per item
      ● Collections are read in their entirety, so keep them small
  53. Sets
  54. Sets (storage: set name, item value)
  55. Lists
  56. Lists (storage: list name, ordering metadata, list item value)
  57. Maps
  58. Maps (storage: map name, key, value)
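The three collection types can be declared and updated like this (a sketch against a hypothetical Users table, not from the slides):

```sql
CREATE TABLE Users (
  login varchar PRIMARY KEY,
  emails set<varchar>,          -- ordered naturally
  favorites list<varchar>,      -- ordered by index
  books map<varchar, int>       -- key/value pairs
);

-- add to a set, append to a list, put a map entry
UPDATE Users SET emails = emails + {'rs@example.com'} WHERE login = 'rs_atl';
UPDATE Users SET favorites = favorites + ['Patriot Games'] WHERE login = 'rs_atl';
UPDATE Users SET books['Patriot Games'] = 1987 WHERE login = 'rs_atl';
```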
  59. TRON (tracing on)
  60. Using tracing
      ● In cqlsh, “tracing on”
      ● … enjoy!
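A minimal tracing session in cqlsh might look like this (a sketch; the trace layout varies by version):

```sql
cqlsh:library> TRACING ON;
cqlsh:library> SELECT * FROM Books WHERE author = 'Tom Clancy';
-- cqlsh then prints a request trace: each step (parsing the statement,
-- reading SSTables, merging results) with its elapsed microseconds
-- and the node that executed it
```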
  61. Example (trace output; timestamp 1393126200000)
  62. Antipattern
      CREATE TABLE WorkQueue (
        name varchar,
        time bigint,
        workItem varchar,
        PRIMARY KEY (name, time)
      )
      … do a bunch of inserts ...
      SELECT * FROM WorkQueue WHERE name='ToDo' ORDER BY time ASC;
      DELETE FROM WorkQueue WHERE name='ToDo' AND time=[some_time]
  63. Antipattern - enqueue
  64. Antipattern - dequeue
  65. Antipattern: 20k tombstones!! 13ms of 17ms spent reading tombstones
  66. Lightweight Transactions (no it’s not ACID)
  67. Primer
      ● Supports basic Compare-and-Set ops
      ● Provides linearizable consistency
      ● … aka serial isolation
      ● Uses “Paxos light” under the hood
      ● Still expensive -- four round trips!
      ● For most cases quorum reads/writes will be sufficient
  68. Usage
      INSERT INTO Users (login, name)
      VALUES ('rs_atl', 'Robbie Strickland')
      IF NOT EXISTS;

      UPDATE Users SET password='super_secure_password'
      WHERE login='rs_atl'
      IF reset_token='some_reset_token';
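When a conditional statement runs, Cassandra reports whether the compare-and-set condition held; a sketch of what cqlsh shows (column layout may vary by version):

```sql
cqlsh> INSERT INTO Users (login, name)
   ... VALUES ('rs_atl', 'Robbie Strickland') IF NOT EXISTS;
-- first run:  [applied] = True
-- second run: [applied] = False, and the existing row's values come back
--             so the client can see what it lost the race to
```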
  69. Other cool stuff
      ● Triggers (experimental)
      ● Batching multiple requests
      ● Leveled compaction
      ● Configuration via CQL
      ● Gossip-based rack/DC configuration
  70. Thank you!
      Robbie Strickland
      Software Development Manager, The Weather Channel
      rostrickland@gmail.com | @dont_use_twitter
