NYC* Jonathan Ellis Keynote: "Cassandra 1.2 + 2.0"

Jonathan Ellis, Apache Cassandra Project Chair & DataStax Co-Founder, presents Apache Cassandra 1.2 + 2.0.

Published in: Technology

Transcript of "NYC* Jonathan Ellis Keynote: "Cassandra 1.2 + 2.0""

  1. Cassandra 1.2 (and 2.0). Jonathan Ellis, Project Chair, Apache Cassandra; CTO, DataStax. @spyced
  2. ©2012 DataStax
  3. • Massively scalable • High performance • Reliable/Available
  4. [Figure: VLDB benchmark (RWS)]
  5. [Figure: Endpoint benchmark (RW)]
  8. 1.2 • Concurrent schema changes • Virtual nodes • “Fat node” support • JBOD improvements • Off-heap bloom filters, compression metadata • Parallel leveled compaction • Atomic batches • CQL3 • Collections • Data dictionary • Tracing
  9. Concurrent Schema Changes. [Diagram: two clients send schema changes to the same Cassandra cluster at once. One issues CREATE TABLE X; ... DROP TABLE X; while the other issues CREATE TABLE Y; ... DROP TABLE Y;]
  10-12. Virtual nodes. [Diagram: a ring without vnodes, divided into ranges A-F, beside a ring with vnodes, divided into ranges A-P]
  13. Node Rebuild without vnodes. [Diagram: Nodes 1-6 around a ring without vnodes; each node holds replicas of three of the ranges A-F]
  14. Node Rebuild with vnodes. [Diagram: Nodes 1-6 around a ring with vnodes; each node holds replicas of many small ranges drawn from A-P]
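The contrast between the two rebuild slides can be sketched in a few lines of Python. This is a simplified model, not Cassandra's implementation: it assumes random tokens, a replication factor of 3, and replicas placed on the next distinct nodes clockwise. The function names, the 12-node cluster, and the 256 tokens per node are illustrative assumptions.

```python
import random

def build_ring(nodes, tokens_per_node, seed=0):
    """Return a sorted list of (token, node) pairs with random integer tokens."""
    rng = random.Random(seed)
    tokens = rng.sample(range(2**32), len(nodes) * tokens_per_node)
    ring = []
    for i, node in enumerate(nodes):
        for token in tokens[i * tokens_per_node:(i + 1) * tokens_per_node]:
            ring.append((token, node))
    return sorted(ring)

def replicas_for_range(ring, index, rf):
    """Walk clockwise from ring[index], collecting the first rf distinct nodes."""
    owners = []
    i = index
    while len(owners) < rf:
        node = ring[i % len(ring)][1]
        if node not in owners:
            owners.append(node)
        i += 1
    return owners

def rebuild_sources(ring, failed, rf=3):
    """Nodes sharing at least one replicated range with the failed node."""
    sources = set()
    for index in range(len(ring)):
        owners = replicas_for_range(ring, index, rf)
        if failed in owners:
            sources.update(n for n in owners if n != failed)
    return sources

nodes = ["node%d" % n for n in range(1, 13)]  # hypothetical 12-node cluster
single_token = rebuild_sources(build_ring(nodes, 1), "node1")
with_vnodes = rebuild_sources(build_ring(nodes, 256), "node1")
print(len(single_token), len(with_vnodes))
```

With one token per node, only the four ring neighbours of the failed node hold its data, so only they can stream to the replacement. With many tokens per node, nearly every other node shares some range with it, so the rebuild load is spread across the cluster.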
  15. JBOD support. [Diagram: one Cassandra instance writing to four disks, HDD1-HDD4]
  16. JBOD support. [Diagram: the same instance and four disks, with one disk marked X]
  17. On-Heap/Off-Heap. [Diagram: inside the JVM's Java process, the on-heap region (Java heap) is managed by GC and the off-heap region (native memory) is not managed by GC]
  18. Moving O(n) structures off-heap • Row (partition) bloom filter: 1-2GB per billion rows • Compression metadata: ~20GB per TB compressed data • 1.2 targets 5-10TB of data per machine
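To see why these structures matter at the 5-10TB target, the slide's figures can be combined into a rough estimate. The sketch below uses only the per-billion-row and per-TB ratios from the slide; the function name and the example row count of 10 billion are assumptions made for illustration.

```python
def off_heap_estimate_gb(rows_billions, compressed_tb,
                         bloom_gb_per_billion=(1, 2),
                         metadata_gb_per_tb=20):
    """Return a (low, high) estimate in GB for bloom filters plus compression metadata."""
    metadata = compressed_tb * metadata_gb_per_tb
    low = rows_billions * bloom_gb_per_billion[0] + metadata
    high = rows_billions * bloom_gb_per_billion[1] + metadata
    return low, high

# Hypothetical node: 10 billion rows in 5 TB of compressed data.
print(off_heap_estimate_gb(10, 5))  # (110, 120)
```

Structures of that size would dominate a garbage-collected heap, which is the motivation the slide gives for moving them to native memory.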
  19-22. Batches. [Diagram: a client sends a batch to a coordinator node, which writes to three partition replicas]
  23. Batches. [Diagram: the same flow, with an X marking a failure at the coordinator node]
  24-27. Atomic batches. [Diagram: a client sends a batch to a coordinator node, which writes to three partition replicas and to a batchlog node]
  28-29. Atomic batches. [Diagram: the same flow, with an X marking a failure at the coordinator node; the batchlog node remains]
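The role of the batchlog node in these diagrams can be sketched as follows. This is a minimal single-process model under stated assumptions, not Cassandra's code: the class and function names are hypothetical, mutations are treated as idempotent key/value writes, and the coordinator failure is simulated with an exception.

```python
class Replica:
    """Holds the data for one partition replica."""
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value  # idempotent: replaying the same write is harmless


class BatchlogNode:
    """Keeps a copy of each batch until the coordinator confirms it was applied."""
    def __init__(self):
        self.pending = {}

    def store(self, batch_id, mutations):
        self.pending[batch_id] = list(mutations)

    def remove(self, batch_id):
        self.pending.pop(batch_id, None)

    def replay(self, replicas):
        """Re-apply every batch whose coordinator never confirmed completion."""
        for batch_id, mutations in list(self.pending.items()):
            for replica_name, key, value in mutations:
                replicas[replica_name].apply(key, value)
            del self.pending[batch_id]


def coordinate(batch_id, mutations, replicas, batchlog, fail_after=None):
    """Log the batch first, then apply it; fail_after simulates a coordinator crash."""
    batchlog.store(batch_id, mutations)
    for i, (replica_name, key, value) in enumerate(mutations):
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("coordinator failed mid-batch")
        replicas[replica_name].apply(key, value)
    batchlog.remove(batch_id)


replicas = {name: Replica() for name in ("r1", "r2", "r3")}
batchlog = BatchlogNode()
batch = [("r1", "k", 1), ("r2", "k", 1), ("r3", "k", 1)]
try:
    coordinate("batch-1", batch, replicas, batchlog, fail_after=1)
except RuntimeError:
    pass  # only r1 received its write before the simulated crash
partial = [sorted(r.data) for r in replicas.values()]
batchlog.replay(replicas)  # the batchlog node finishes the batch
```

Because the batch is logged before any replica is written, a coordinator failure leaves a record from which the remaining writes can be completed, which is the difference between slide 23 and slides 28-29.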
  30. CQL: You got SQL in my NoSQL!
          CREATE TABLE users (
              id uuid PRIMARY KEY,
              name text,
              state text,
              birth_date int
          );
          CREATE INDEX ON users(state);
          SELECT * FROM users WHERE state='Texas' AND birth_date > 1950;
  31. Strictly “realtime” focused • No joins • No subqueries • No aggregation functions* or GROUP BY • Strictly limited ORDER BY
  32. songs
          create column family songs
          with key_validation_class = UUIDType
          and comparator = UTF8Type -- cell names are strings
          and column_metadata = [
              {column_name: title, validation_class: UTF8Type},
              {column_name: album, validation_class: UTF8Type},
              {column_name: artist, validation_class: UTF8Type},
              {column_name: data, validation_class: BytesType}];

          a3e64f8f...  title: La Grange            artist: ZZ Top          album: Tres Hombres
          8a172618...  title: Moving in Stereo     artist: Fu Manchu       album: We Must Obey
          2b09185b...  title: Outside Woman Blues  artist: Back Door Slam  album: Roll Away
  33. CREATE TABLE songs (
              id uuid PRIMARY KEY,
              title text,
              artist text,
              album text,
              data blob
          );

          id           title                artist          album
          a3e64f8f...  La Grange            ZZ Top          Tres Hombres
          8a172618...  Moving in Stereo     Fu Manchu       We Must Obey
          2b09185b...  Outside Woman Blues  Back Door Slam  Roll Away
  34. song_tags
          create column family song_tags
          with key_validation_class = UUIDType
          and comparator = UTF8Type;

          a3e64f8f...  blues:   1973:
          8a172618...  covers:  2003:
  35. CREATE TABLE song_tags (
              id uuid,
              tag_name text,
              PRIMARY KEY (id, tag_name)
          );

          a3e64f8f...  blues:   1973:
          8a172618...  covers:  2003:

          id           tag_name
          a3e64f8f...  blues
          a3e64f8f...  1973
          8a172618...  covers
          8a172618...  2003
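The mapping on this slide, from one wide storage row per song to one CQL row per tag, can be illustrated with a short transposition. The sketch below is an assumption-laden model: the storage layout is represented as nested dictionaries, cell values are empty, and the function name is hypothetical. Cell order follows the slide rather than a real comparator.

```python
def cells_to_rows(storage, key_column, clustering_column):
    """Turn {partition_key: {cell_name: value}} into one row per cell."""
    rows = []
    for partition_key, cells in storage.items():
        for cell_name in cells:
            rows.append({key_column: partition_key, clustering_column: cell_name})
    return rows

# The two storage rows from the slide; each cell name is a tag with an empty value.
song_tags_storage = {
    "a3e64f8f...": {"blues": b"", "1973": b""},
    "8a172618...": {"covers": b"", "2003": b""},
}

rows = cells_to_rows(song_tags_storage, "id", "tag_name")
for row in rows:
    print(row["id"], row["tag_name"])
```

The first component of the PRIMARY KEY picks the storage row and the second becomes the cell name, so the four CQL rows are a view over two physical rows.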
  36-37. playlists
          create column family playlists
          with key_validation_class = UUIDType
          and comparator = CompositeType(UTF8Type, UTF8Type, UTF8Type)
          and default_validation_class = UUIDType;

          62c36092...  La Grange, ZZ Top, Tres Hombres: a3e64f8f...
                       Moving in S..., Fu Manchu, We Must O...: 8a172618...
                       Outside Wo..., Back Door ..., Roll Away: 2b09185b...
  38. CREATE TABLE playlists (
              id uuid,
              title text,
              album text,
              artist text,
              song_id uuid,
              PRIMARY KEY (id, title, album, artist)
          );

          62c36092...  La Grange, ZZ Top, Tres Hombres: a3e64f8f...
                       Moving in S..., Fu Manchu, We Must O...: 8a172618...
                       Outside Wo..., Back Door ..., Roll Away: 2b09185b...

          id           title             artist          album         song_id
          62c36092...  La Grange         ZZ Top          Tres Hombres  a3e64f8f...
          62c36092...  Moving in Stereo  Fu Manchu       We Must Obey  8a172618...
          62c36092...  Outside Wo...     Back Door Slam  Roll Away     2b09185b...
  39. Collections
          CREATE TABLE songs (
              id uuid PRIMARY KEY,
              title text,
              artist text,
              album text,
              tags set<text>,
              data blob
          );

          id           title                artist          album         tags
          a3e64f8f...  La Grange            ZZ Top          Tres Hombres  {blues, 1973}
          8a172618...  Moving in Stereo     Fu Manchu       We Must Obey  {covers, 2003}
          2b09185b...  Outside Woman Blues  Back Door Slam  Roll Away
  40-43. Data dictionary
          cqlsh:system> SELECT * FROM schema_keyspaces;

           keyspace_name | durable_writes | strategy_class | strategy_options
          ---------------+----------------+----------------+----------------------------
           keyspace1     | True           | SimpleStrategy | {"replication_factor":"1"}
           system        | True           | LocalStrategy  | {}
           system_traces | True           | SimpleStrategy | {"replication_factor":"1"}

          cqlsh:system> SELECT * FROM schema_columnfamilies WHERE keyspace_name='keyspace1' AND columnfamily_name='test';
          cqlsh:system> SELECT * FROM schema_columns WHERE keyspace_name='keyspace1' AND columnfamily_name='test';
  44. Data dictionary
          cqlsh:system> SELECT * FROM local;

          Result (one row, shown one column per line):
           key               | local
           bootstrapped      | COMPLETED
           cluster_name      | test
           cql_version       | 3.0.0
           data_center       | datacenter1
           gossip_generation | 1352846064
           partitioner       | org.apache.cassandra.dht.Murmur3Partitioner
           rack              | rack1
           release_version   | 1.2.0-beta2-SNAPSHOT
           ring_id           | 224c55d5-21b4-42b0-8969-afc0cc04e812
           thrift_version    | 19.35.0
           tokens            | {0}
           truncated_at      | null
  45. Data dictionary
          cqlsh:system> SELECT * FROM peers LIMIT 1;

          Result (one row, shown one column per line):
           peer            | 127.0.0.3
           data_center     | datacenter1
           rack            | rack1
           release_version | 1.2.0-beta2-SNAPSHOT
           ring_id         | f6782327-ef8e-41cf-87b9-2edc287b1ffe
           rpc_address     | 127.0.0.3
           schema_version  | 915ed888-ddd0-3448-860c-582f4eea1bc6
           tokens          | {6148914691236517204}
  46. Request tracing
          cqlsh:foo> INSERT INTO bar (i, j) VALUES (6, 2);
          Tracing session: 4ad36250-1eb4-11e2-0000-fe8ebeead9f9

           activity                            | timestamp    | source    | source_elapsed
          -------------------------------------+--------------+-----------+----------------
           Determining replicas for mutation   | 00:02:37,015 | 127.0.0.1 | 540
           Sending message to /127.0.0.2       | 00:02:37,015 | 127.0.0.1 | 779
           Message received from /127.0.0.1    | 00:02:37,016 | 127.0.0.2 | 63
           Applying mutation                   | 00:02:37,016 | 127.0.0.2 | 220
           Acquiring switchLock                | 00:02:37,016 | 127.0.0.2 | 250
           Appending to commitlog              | 00:02:37,016 | 127.0.0.2 | 277
           Adding to memtable                  | 00:02:37,016 | 127.0.0.2 | 378
           Enqueuing response to /127.0.0.1    | 00:02:37,016 | 127.0.0.2 | 710
           Sending message to /127.0.0.1       | 00:02:37,016 | 127.0.0.2 | 888
           Message received from /127.0.0.2    | 00:02:37,017 | 127.0.0.1 | 2334
           Processing response from /127.0.0.2 | 00:02:37,017 | 127.0.0.1 | 2550
  47-49. Tracing an antipattern
          CREATE TABLE queues (
              id text,
              created_at timeuuid,
              value blob,
              PRIMARY KEY (id, created_at)
          );

          id       created_at  value
          myqueue  3092e86f    9b0450d30de9
          myqueue  0867f47c    fc7aee5f6a66
          myqueue  5fc74be0    668fdb3a2196
  50. cqlsh:foo> SELECT * FROM queues WHERE id = 'myqueue' ORDER BY created_at LIMIT 1;
          Tracing session: 4ad36250-1eb4-11e2-0000-fe8ebeead9f9

           activity                                 | timestamp    | source    | source_elapsed
          ------------------------------------------+--------------+-----------+----------------
           execute_cql3_query                       | 19:31:05,650 | 127.0.0.1 | 0
           Sending message to /127.0.0.3            | 19:31:05,651 | 127.0.0.1 | 541
           Message received from /127.0.0.1         | 19:31:05,651 | 127.0.0.3 | 39
           Executing single-partition query         | 19:31:05,652 | 127.0.0.3 | 943
           Acquiring sstable references             | 19:31:05,652 | 127.0.0.3 | 973
           Merging memtable contents                | 19:31:05,652 | 127.0.0.3 | 1020
           Merging data from memtables and sstables | 19:31:05,652 | 127.0.0.3 | 1081
           Read 1 live cells and 100000 tombstoned  | 19:31:05,686 | 127.0.0.3 | 35072
           Enqueuing response to /127.0.0.1         | 19:31:05,687 | 127.0.0.3 | 35220
           Sending message to /127.0.0.1            | 19:31:05,687 | 127.0.0.3 | 35314
           Message received from /127.0.0.3         | 19:31:05,687 | 127.0.0.1 | 36908
           Processing response from /127.0.0.3      | 19:31:05,688 | 127.0.0.1 | 37650
           Request complete                         | 19:31:05,688 | 127.0.0.1 | 38047
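The "Read 1 live cells and 100000 tombstoned" line is the heart of the antipattern: a queue deletes from the front of a partition, and each delete leaves a marker that later reads must step over. A minimal Python sketch of that effect follows, assuming a single partition kept in `created_at` order; the class, its methods, and the integer timestamps are hypothetical stand-ins, not Cassandra's storage engine.

```python
TOMBSTONE = object()  # marker left behind by a delete


class QueuePartition:
    """One partition of the queues table, with cells kept in created_at order."""
    def __init__(self):
        self.cells = {}  # dicts preserve insertion order; enqueue in time order

    def enqueue(self, created_at, value):
        self.cells[created_at] = value

    def delete(self, created_at):
        self.cells[created_at] = TOMBSTONE  # the cell stays in place as a tombstone

    def read_first(self):
        """Return the oldest live cell and how many tombstones were scanned first."""
        tombstones = 0
        for created_at, value in self.cells.items():
            if value is TOMBSTONE:
                tombstones += 1
                continue
            return (created_at, value), tombstones
        return None, tombstones


queue = QueuePartition()
for t in range(100001):
    queue.enqueue(t, b"payload")
for t in range(100000):
    queue.delete(t)  # consume the first 100000 messages

first_live, scanned = queue.read_first()
print(first_live, scanned)
```

The read returns a single cell but does work proportional to the number of consumed messages, which matches the jump in `source_elapsed` at that step of the trace.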
  51. 2.0 • Eager retries • Improved compaction • Triggers • CAS (Compare-and-set) • More-efficient repair
  52-54. Eager retries. [Diagram: a client sends a request to a coordinator, which reads from three replicas that are 90%, 30%, and 40% busy]
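The slides give only the diagram, so the following is one possible reading of an eager retry, sketched under assumptions: the coordinator asks one replica first and, if no answer arrives within a chosen delay, sends the same request to a second replica and takes whichever answer comes first. The function name, the fixed delay, and the latency figures are all hypothetical.

```python
def read_latency_ms(primary_ms, backup_ms, retry_after_ms=None):
    """Latency seen by the client, with or without an eager retry to a backup replica."""
    if retry_after_ms is None or primary_ms <= retry_after_ms:
        return primary_ms  # the first replica answered before any retry was sent
    # The retry starts retry_after_ms late; the first response to arrive wins.
    return min(primary_ms, retry_after_ms + backup_ms)

# Hypothetical figures: the 90%-busy replica takes 200 ms, a less busy one 10 ms.
print(read_latency_ms(200, 10))                     # 200
print(read_latency_ms(200, 10, retry_after_ms=20))  # 30
```

The retry costs one extra request only when the first replica is slow, and it bounds the latency by the delay plus the faster replica's response time.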
  55. Improved compaction • Specialized strategy for append-only with TTL • Can we do any better for a general-purpose workload?
  57. Triggers
          CREATE TRIGGER foo BEFORE UPDATE ON users
          EXECUTE '/var/lib/cassandra/triggers/send_registration_email.jar'
  58. Triggers
          class MyTrigger implements ITrigger {
              public Collection<RowMutation> revise(ByteBuffer key, ColumnFamily update) {
                  ...
              }
          }
  59. CAS
          Session 1                          Session 2
          SELECT * FROM users                SELECT * FROM users
          WHERE username = 'jbellis'         WHERE username = 'jbellis'
          [empty resultset]                  [empty resultset]
          INSERT INTO users (...)            INSERT INTO users (...)
          VALUES ('jbellis', ...)            VALUES ('jbellis', ...)
  60. CAS • Locking does not solve this problem • 2PC does not solve this problem • Locking + 2PC does not solve this problem
  61. Paxos!
  62. Open questions • What do we call it? (Conditional write guarantee? Atomic conditional updates? Lightweight transactions?) • What syntax do we use for CQL?
          UPDATE USERS SET email = 'jonathan@datastax.com', ...
          WHERE username = 'jbellis'
          IF email = 'jbellis@datastax.com'
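The race on slide 59 and the conditional write proposed on slide 62 can be contrasted in a small sketch. It models only the semantics, in one process and with the two sessions interleaved by hand; making the check-and-write step indivisible across replicas is the part the Paxos slide refers to and is not modeled here. The function names and email addresses are hypothetical.

```python
def naive_register(users, username, profile, observed_empty):
    """Insert based on an earlier, possibly stale, 'SELECT returned nothing' result."""
    if observed_empty:
        users[username] = profile  # blind write: silently replaces a rival's insert
    return observed_empty


def conditional_register(users, username, profile):
    """Check and write as one step; report whether the write was applied."""
    if username in users:
        return False
    users[username] = profile
    return True


# Slide 59: both sessions read before either writes.
users = {}
s1_saw_empty = "jbellis" not in users
s2_saw_empty = "jbellis" not in users
s1_ok = naive_register(users, "jbellis", {"email": "one@example.com"}, s1_saw_empty)
s2_ok = naive_register(users, "jbellis", {"email": "two@example.com"}, s2_saw_empty)
naive_result = (s1_ok, s2_ok, users["jbellis"]["email"])

# With a conditional write, exactly one session is told it succeeded.
users = {}
c1_ok = conditional_register(users, "jbellis", {"email": "one@example.com"})
c2_ok = conditional_register(users, "jbellis", {"email": "two@example.com"})
cas_result = (c1_ok, c2_ok, users["jbellis"]["email"])
```

In the naive version both sessions believe they created the account and the second overwrites the first. In the conditional version the loser learns that its write was not applied, which is the guarantee the `IF` clause on slide 62 is meant to express.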
  63-73. More-efficient repair. [Diagram slides; no text beyond the title]
  74. Consequences • Repair won’t replace missing data due to hardware failure by default • Add --include-previously-repaired to force old-style full validation