NYC* Jonathan Ellis Keynote: "Cassandra 1.2 + 2.0"

Jonathan Ellis, Apache Cassandra Project Chair & DataStax Co-Founder, presents Apache Cassandra 1.2 + 2.0.

Transcript

  • 1. Cassandra 1.2 (and 2.0). Jonathan Ellis, Project Chair, Apache Cassandra; CTO, DataStax; @spyced
  • 2. ©2012 DataStax
  • 3. • Massively scalable • High performance • Reliable/Available
  • 4. VLDB benchmark (RWS)
  • 5. Endpoint benchmark (RW)
  • 8. 1.2 • Concurrent schema changes • Virtual nodes • “Fat node” support • JBOD improvements • Off-heap bloom filters, compression metadata • Parallel leveled compaction • Atomic batches • CQL3 • Collections • Data dictionary • Tracing
  • 9. Concurrent Schema Changes [Diagram: two clients change the schema of one Cassandra cluster at the same time. One runs CREATE TABLE X; ... DROP TABLE X; and the other runs CREATE TABLE Y; ... DROP TABLE Y;]
  • 10-12. Virtual nodes [Diagram, repeated across three slides: a ring without vnodes, divided into six ranges A-F, next to a ring with vnodes, divided into sixteen ranges A-P.]
  • 13. Node Rebuild without vnodes [Diagram: Nodes 1-6 around a ring without vnodes; each node holds three of the six ranges A-F.]
  • 14. Node Rebuild with vnodes [Diagram: Nodes 1-6 around a ring with vnodes; each node holds a scattered subset of the sixteen ranges A-P.]
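
The two rebuild diagrams can be compared with a small simulation. This is only a sketch under stated assumptions: replication factor 3, replicas placed on the next distinct nodes clockwise, and a hand-written vnode ownership order (real vnodes pick tokens randomly). The names `replicas`, `rebuild_sources`, `single_token_ring`, and `vnode_ring` are hypothetical and not part of Cassandra.

```python
def replicas(ring, start, rf):
    """Owner of the range at `start` plus the next distinct nodes clockwise."""
    found = []
    i = start
    while len(found) < rf:
        node = ring[i % len(ring)]
        if node not in found:
            found.append(node)
        i += 1
    return found


def rebuild_sources(ring, dead, rf=3):
    """Peers that hold a copy of at least one range the dead node also held."""
    sources = set()
    for start in range(len(ring)):
        holders = replicas(ring, start, rf)
        if dead in holders:
            sources.update(n for n in holders if n != dead)
    return sources


# One token per node: ownership order around the ring.
single_token_ring = [1, 2, 3, 4, 5, 6]
# Several tokens per node, hand-shuffled for this example (an assumption).
vnode_ring = [1, 2, 3, 4, 5, 6, 1, 3, 5, 2, 4, 6, 1, 4, 2, 5, 3, 6]

print(sorted(rebuild_sources(single_token_ring, dead=1)))  # [2, 3, 5, 6]
print(sorted(rebuild_sources(vnode_ring, dead=1)))         # [2, 3, 4, 5, 6]
```

In this toy ring, rebuilding node 1 without vnodes can only read from its four ring neighbours, while with vnodes every surviving node can contribute.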
  • 15. JBOD support [Diagram: one Cassandra instance writing to four disks, HDD1-HDD4.]
  • 16. JBOD support [Diagram: the same instance and disks, with one disk marked X.]
  • 17. On-Heap/Off-Heap [Diagram: a Java process (JVM) split into the Java heap (on-heap, managed by GC) and native memory (off-heap, not managed by GC).]
  • 18. Moving O(n) structures off-heap • Row (partition) bloom filter: 1-2GB per billion rows • Compression metadata: ~20GB per TB compressed data • 1.2 targets 5-10TB of data per machine
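
To get a feel for the figures on slide 18, the arithmetic can be written out as below. This is a rough sketch, assuming the per-row and per-TB figures quoted on the slide hold linearly; the function name `offheap_estimate_gb` and the example workload (5 billion rows, 5 TB) are hypothetical.

```python
def offheap_estimate_gb(rows_billion, compressed_tb,
                        bloom_gb_per_billion=(1, 2),
                        metadata_gb_per_tb=20):
    """Return a (low, high) estimate in GB for the two O(n) structures.

    Defaults are the figures quoted on the slide, not measured values.
    """
    metadata = compressed_tb * metadata_gb_per_tb
    low = rows_billion * bloom_gb_per_billion[0] + metadata
    high = rows_billion * bloom_gb_per_billion[1] + metadata
    return low, high


# Hypothetical node: 5 billion rows in 5 TB of compressed data.
print(offheap_estimate_gb(rows_billion=5, compressed_tb=5))  # (105, 110)
```

At the 5-10TB per machine that 1.2 targets, structures of this size would not fit comfortably in a garbage-collected heap, which is the motivation the slide gives for moving them off-heap.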
  • 19-22. Batches [Diagram, repeated across four slides: Client, Coordinator Node, and three Partition Replicas.]
  • 23. Batches [Diagram: the same layout, with an X marked at the Coordinator Node.]
  • 24-27. Atomic batches [Diagram, repeated across four slides: Client, Coordinator Node, three Partition Replicas, and a Batchlog Node.]
  • 28-29. Atomic batches [Diagram, repeated across two slides: the same layout, with an X marked at the Coordinator Node.]
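
The role of the batchlog node in these diagrams can be illustrated with a toy simulation. This is a sketch of the general idea only, assuming the coordinator records the batch before applying it and that replaying a mutation twice is harmless; `Replica`, `write_batch`, `replay_batchlog`, and the `fail_after` parameter are hypothetical names, not Cassandra APIs.

```python
import itertools

_batch_ids = itertools.count(1)


class Replica:
    """Stand-in for one partition replica."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


def write_batch(batch, replicas, batchlog, fail_after=None):
    """Log the batch, apply each mutation, then clear the log entry.

    `fail_after` simulates the coordinator dying after that many mutations.
    """
    batch_id = next(_batch_ids)
    batchlog[batch_id] = list(batch)  # copy kept on the batchlog node
    for n, (key, value) in enumerate(batch):
        if fail_after is not None and n >= fail_after:
            return batch_id  # coordinator died; the log entry stays behind
        replicas[key].apply(key, value)
    del batchlog[batch_id]  # everything applied; entry no longer needed
    return batch_id


def replay_batchlog(replicas, batchlog):
    """Re-apply every batch still in the log (writes here are idempotent)."""
    for batch_id in list(batchlog):
        for key, value in batchlog[batch_id]:
            replicas[key].apply(key, value)
        del batchlog[batch_id]


replicas = {"a": Replica(), "b": Replica(), "c": Replica()}
batchlog = {}
write_batch([("a", 1), ("b", 2), ("c", 3)], replicas, batchlog, fail_after=1)
print(replicas["b"].data, len(batchlog))  # {} 1
replay_batchlog(replicas, batchlog)
print(replicas["b"].data, len(batchlog))  # {'b': 2} 0
```

Without the batchlog (slides 19-23), a coordinator failure part-way through would leave only some partitions updated; with it, the remaining mutations can still be applied later.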
  • 30. CQL: You got SQL in my NoSQL!
      CREATE TABLE users (
        id uuid PRIMARY KEY,
        name text,
        state text,
        birth_date int
      );
      CREATE INDEX ON users(state);
      SELECT * FROM users WHERE state='Texas' AND birth_date > 1950;
  • 31. Strictly “realtime” focused • No joins • No subqueries • No aggregation functions* or GROUP BY • Strictly limited ORDER BY
  • 32. songs
      create column family songs
      with key_validation_class = UUIDType
      and comparator = UTF8Type -- cell names are strings
      and column_metadata = [
        {column_name: title, validation_class: UTF8Type},
        {column_name: album, validation_class: UTF8Type},
        {column_name: artist, validation_class: UTF8Type},
        {column_name: data, validation_class: BytesType}
      ];
      a3e64f8f... | title: La Grange | artist: ZZ Top | album: Tres Hombres
      8a172618... | title: Moving in Stereo | artist: Fu Manchu | album: We Must Obey
      2b09185b... | title: Outside Woman Blues | artist: Back Door Slam | album: Roll Away
  • 33. CREATE TABLE songs (
        id uuid PRIMARY KEY,
        title text,
        artist text,
        album text,
        data blob
      );
      id          | title               | artist         | album
      a3e64f8f... | La Grange           | ZZ Top         | Tres Hombres
      8a172618... | Moving in Stereo    | Fu Manchu      | We Must Obey
      2b09185b... | Outside Woman Blues | Back Door Slam | Roll Away
  • 34. song_tags
      create column family song_tags
      with key_validation_class = UUIDType
      and comparator = UTF8Type;
      a3e64f8f... | blues: | 1973:
      8a172618... | covers: | 2003:
  • 35. CREATE TABLE song_tags (
        id uuid,
        tag_name text,
        PRIMARY KEY (id, tag_name)
      );
      a3e64f8f... | blues: | 1973:
      8a172618... | covers: | 2003:
      id          | tag_name
      a3e64f8f... | blues
      a3e64f8f... | 1973
      8a172618... | covers
      8a172618... | 2003
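
Slide 35 shows the same data twice: once as wide storage rows and once as CQL rows. The transposition can be sketched as follows; this is an illustration of the mapping only, not how Cassandra implements it, and `cells_to_rows` and `storage` are hypothetical names.

```python
def cells_to_rows(partitions):
    """Turn {partition key: [cell names]} into (id, tag_name) rows."""
    return [(key, cell) for key, cells in partitions.items() for cell in cells]


# The two wide rows from the slide: each cell name is a tag, values are empty.
storage = {
    "a3e64f8f...": ["blues", "1973"],
    "8a172618...": ["covers", "2003"],
}

for row in cells_to_rows(storage):
    print(row)
```

Each cell in a storage row becomes one CQL row, with the partition key repeated as `id` and the cell name exposed as `tag_name`.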
  • 36-37. playlists
      create column family playlists
      with key_validation_class = UUIDType
      and comparator = CompositeType(UTF8Type, UTF8Type, UTF8Type)
      and default_validation_class = UUIDType;
      62c36092... | La Grange, ZZ Top, Tres Hombres: a3e64f8f... | Moving in S..., Fu Manchu, We Must O...: 8a172618... | Outside Wo..., Back Door ..., Roll Away: 2b09185b...
  • 38. CREATE TABLE playlists (
        id uuid,
        title text,
        album text,
        artist text,
        song_id uuid,
        PRIMARY KEY (id, title, album, artist)
      );
      62c36092... | La Grange, ZZ Top, Tres Hombres: a3e64f8f... | Moving in S..., Fu Manchu, We Must O...: 8a172618... | Outside Wo..., Back Door ..., Roll Away: 2b09185b...
      id          | title            | artist         | album        | song_id
      62c36092... | La Grange        | ZZ Top         | Tres Hombres | a3e64f8f...
      62c36092... | Moving in Stereo | Fu Manchu      | We Must Obey | 8a172618...
      62c36092... | Outside Wo...    | Back Door Slam | Roll Away    | 2b09185b...
  • 39. Collections
      CREATE TABLE songs (
        id uuid PRIMARY KEY,
        title text,
        artist text,
        album text,
        tags set<text>,
        data blob
      );
      id          | title               | artist         | album        | tags
      a3e64f8f... | La Grange           | ZZ Top         | Tres Hombres | {blues, 1973}
      8a172618... | Moving in Stereo    | Fu Manchu      | We Must Obey | {covers, 2003}
      2b09185b... | Outside Woman Blues | Back Door Slam | Roll Away    |
  • 40-43. Data dictionary
      cqlsh:system> SELECT * FROM schema_keyspaces;
       keyspace_name | durable_writes | strategy_class | strategy_options
      ---------------+----------------+----------------+----------------------------
       keyspace1     | True           | SimpleStrategy | {"replication_factor":"1"}
       system        | True           | LocalStrategy  | {}
       system_traces | True           | SimpleStrategy | {"replication_factor":"1"}
      cqlsh:system> SELECT * FROM schema_columnfamilies WHERE keyspace_name='keyspace1' AND columnfamily_name='test';
      cqlsh:system> SELECT * FROM schema_columns WHERE keyspace_name='keyspace1' AND columnfamily_name='test';
  • 44. Data dictionary
      cqlsh:system> SELECT * FROM local;
      key               | local
      bootstrapped      | COMPLETED
      cluster_name      | test
      cql_version       | 3.0.0
      data_center       | datacenter1
      gossip_generation | 1352846064
      partitioner       | org.apache.cassandra.dht.Murmur3Partitioner
      rack              | rack1
      release_version   | 1.2.0-beta2-SNAPSHOT
      ring_id           | 224c55d5-21b4-42b0-8969-afc0cc04e812
      thrift_version    | 19.35.0
      tokens            | {0}
      truncated_at      | null
  • 45. Data dictionary
      cqlsh:system> SELECT * FROM peers LIMIT 1;
      peer            | 127.0.0.3
      data_center     | datacenter1
      rack            | rack1
      release_version | 1.2.0-beta2-SNAPSHOT
      ring_id         | f6782327-ef8e-41cf-87b9-2edc287b1ffe
      rpc_address     | 127.0.0.3
      schema_version  | 915ed888-ddd0-3448-860c-582f4eea1bc6
      tokens          | {6148914691236517204}
  • 46. Request tracing
      cqlsh:foo> INSERT INTO bar (i, j) VALUES (6, 2);
      Tracing session: 4ad36250-1eb4-11e2-0000-fe8ebeead9f9
       activity                            | timestamp    | source    | source_elapsed
      -------------------------------------+--------------+-----------+----------------
       Determining replicas for mutation   | 00:02:37,015 | 127.0.0.1 | 540
       Sending message to /127.0.0.2       | 00:02:37,015 | 127.0.0.1 | 779
       Message received from /127.0.0.1    | 00:02:37,016 | 127.0.0.2 | 63
       Applying mutation                   | 00:02:37,016 | 127.0.0.2 | 220
       Acquiring switchLock                | 00:02:37,016 | 127.0.0.2 | 250
       Appending to commitlog              | 00:02:37,016 | 127.0.0.2 | 277
       Adding to memtable                  | 00:02:37,016 | 127.0.0.2 | 378
       Enqueuing response to /127.0.0.1    | 00:02:37,016 | 127.0.0.2 | 710
       Sending message to /127.0.0.1       | 00:02:37,016 | 127.0.0.2 | 888
       Message received from /127.0.0.2    | 00:02:37,017 | 127.0.0.1 | 2334
       Processing response from /127.0.0.2 | 00:02:37,017 | 127.0.0.1 | 2550
  • 47-49. Tracing an antipattern
      CREATE TABLE queues (
        id text,
        created_at timeuuid,
        value blob,
        PRIMARY KEY (id, created_at)
      );
      id      | created_at | value
      myqueue | 3092e86f   | 9b0450d30de9
      myqueue | 0867f47c   | fc7aee5f6a66
      myqueue | 5fc74be0   | 668fdb3a2196
  • 50. cqlsh:foo> SELECT * FROM queues WHERE id = 'myqueue' ORDER BY created_at LIMIT 1;
      Tracing session: 4ad36250-1eb4-11e2-0000-fe8ebeead9f9
       activity                                 | timestamp    | source    | source_elapsed
      ------------------------------------------+--------------+-----------+----------------
       execute_cql3_query                       | 19:31:05,650 | 127.0.0.1 | 0
       Sending message to /127.0.0.3            | 19:31:05,651 | 127.0.0.1 | 541
       Message received from /127.0.0.1         | 19:31:05,651 | 127.0.0.3 | 39
       Executing single-partition query         | 19:31:05,652 | 127.0.0.3 | 943
       Acquiring sstable references             | 19:31:05,652 | 127.0.0.3 | 973
       Merging memtable contents                | 19:31:05,652 | 127.0.0.3 | 1020
       Merging data from memtables and sstables | 19:31:05,652 | 127.0.0.3 | 1081
       Read 1 live cells and 100000 tombstoned  | 19:31:05,686 | 127.0.0.3 | 35072
       Enqueuing response to /127.0.0.1         | 19:31:05,687 | 127.0.0.3 | 35220
       Sending message to /127.0.0.1            | 19:31:05,687 | 127.0.0.3 | 35314
       Message received from /127.0.0.3         | 19:31:05,687 | 127.0.0.1 | 36908
       Processing response from /127.0.0.3      | 19:31:05,688 | 127.0.0.1 | 37650
       Request complete                         | 19:31:05,688 | 127.0.0.1 | 38047
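
The "Read 1 live cells and 100000 tombstoned" line in the trace is what makes the queue an antipattern. A minimal sketch of why, assuming deleted entries stay in the partition as tombstones ahead of the first live entry; `read_head` and the use of `None` as a tombstone marker are hypothetical simplifications.

```python
def read_head(partition):
    """Return the first live cell and how many tombstones were scanned first."""
    scanned = 0
    for created_at, value in partition:
        if value is None:  # None stands in for a tombstone
            scanned += 1
            continue
        return (created_at, value), scanned
    return None, scanned


# 100000 consumed (deleted) entries followed by one live entry.
partition = [(i, None) for i in range(100_000)] + [(100_000, b"payload")]
head, tombstones = read_head(partition)
print(head, tombstones)  # (100000, b'payload') 100000
```

A `LIMIT 1` read still has to walk past every tombstone that sorts before the first live cell, which matches the jump in `source_elapsed` from 1081 to 35072 in the trace.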
  • 51. 2.0 • Eager retries • Improved compaction • Triggers • CAS (Compare-and-set) • More-efficient repair
  • 52-54. Eager retries [Diagram, repeated across three slides: a Client and Coordinator with three replicas that are 90%, 30%, and 40% busy.]
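
The slides give only the diagram, so the following is one reading of "eager retries": if the first replica has not answered within some threshold, the coordinator also asks another replica and takes whichever answer arrives first. The sketch is a deterministic model of that idea, not Cassandra's implementation; `first_response_ms`, the latencies, and the 50 ms threshold are all hypothetical.

```python
def first_response_ms(latencies_ms, retry_after_ms):
    """Time until the coordinator has an answer.

    latencies_ms: response time of each replica, in the order tried.
    retry_after_ms: how long to wait before also asking the next replica.
    """
    finish_times = [
        attempt * retry_after_ms + latency
        for attempt, latency in enumerate(latencies_ms)
    ]
    return min(finish_times)


# Assume the 90% busy replica is tried first and is slow to answer.
print(first_response_ms([500, 20, 30], retry_after_ms=50))  # 70
# With no other replica to fall back on, the client waits the full 500 ms.
print(first_response_ms([500], retry_after_ms=50))          # 500
```

The trade-off in this model is extra read load on the second replica in exchange for cutting the tail latency caused by one busy node.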
  • 55. Improved compaction • Specialized strategy for append-only with TTL • Can we do any better for a general-purpose workload?
  • 57. Triggers
      CREATE TRIGGER foo BEFORE UPDATE ON users
      EXECUTE '/var/lib/cassandra/triggers/send_registration_email.jar'
  • 58. Triggers
      class MyTrigger implements ITrigger {
        public Collection<RowMutation> revise(ByteBuffer key, ColumnFamily update) {
          ...
        }
      }
  • 59. CAS
      Session 1                                       | Session 2
      SELECT * FROM users WHERE username = 'jbellis'  | SELECT * FROM users WHERE username = 'jbellis'
      [empty resultset]                               | [empty resultset]
      INSERT INTO users (...) VALUES ('jbellis', ...) | INSERT INTO users (...) VALUES ('jbellis', ...)
  • 60. CAS • Locking does not solve this problem • 2PC does not solve this problem • Locking + 2PC does not solve this problem
  • 61. Paxos!
  • 62. Open questions • What do we call it? Conditional write guarantee? Atomic conditional updates? Lightweight transactions? • What syntax do we use for CQL?
      UPDATE USERS SET email = 'jonathan@datastax.com', ... WHERE username = 'jbellis' IF email = 'jbellis@datastax.com'
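
The race on slide 59 and the conditional `IF` syntax on slide 62 can be contrasted with a single-process sketch of compare-and-set semantics. It assumes each method call runs as one indivisible step, which is exactly the guarantee a cluster needs Paxos to provide; the `Users` class and its method names are hypothetical and do not model replication.

```python
class Users:
    """Toy table where the check and the write happen in one step."""

    def __init__(self):
        self._rows = {}

    def insert_if_absent(self, username, email):
        """Insert only if no row exists; report whether the write applied."""
        if username in self._rows:
            return False
        self._rows[username] = email
        return True

    def update_if(self, username, expected_email, new_email):
        """Update only if the current email matches the expected one."""
        if self._rows.get(username) != expected_email:
            return False
        self._rows[username] = new_email
        return True


users = Users()
# Two sessions both try to register the same username.
print(users.insert_if_absent("jbellis", "jbellis@datastax.com"))  # True
print(users.insert_if_absent("jbellis", "other@example.com"))     # False
# Conditional update, mirroring the UPDATE ... IF statement on the slide.
print(users.update_if("jbellis", "jbellis@datastax.com", "jonathan@datastax.com"))  # True
print(users.update_if("jbellis", "jbellis@datastax.com", "again@example.com"))      # False
```

Unlike the separate SELECT and INSERT on slide 59, the second session is told its write did not apply instead of silently overwriting the first.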
  • 63-73. More-efficient repair [Diagram sequence across eleven slides.]
  • 74. Consequences • Repair won’t replace missing data due to hardware failure by default • Add --include-previously-repaired to force old-style full validation