Massively scalable NoSQLwith Apache Cassandra!Jonathan EllisProject Chair, Apache CassandraCTO, DataStax@spyced
Big data           Analytics        Realtime                       ?           (Hadoop)        (“NoSQL”)©2012 DataStax
Some Casandra users ©2012 DataStax
eBay                     Application/Use Case                     • Social Signals: like/want/own features for            ...
Time series data©2012 DataStax
Multi-datacenter support©2012 DataStax
Distributed counters©2012 DataStax
Hadoop support©2012 DataStax
Disney                     Application/Use Case                     • Meet the data management needs of user              ...
Multitenancy©2012 DataStax
Multitenancy©2012 DataStax
Enterprise search©2012 DataStax
SimpleReach                     Application/Use Case                     • SimpleReach tracks social actions for          ...
SourceNinja                     Application/Use Case                     • SourceNinja notifies you to performance,        ...
Netflix                     Application/Use Case                     • General purpose backend for large scale             ...
Optimized for SSD©2012 DataStax
Open source©2012 DataStax
Use case patterns  • Massively scalable  • High performance  • Reliable/Available©2012 DataStax
©2012 DataStax
reads/s            writes/s                                                                       35000                   ...
©2012 DataStax
Classic partitioning with SPOF                 partition 1   partition 2      partition 3   partition 4                   ...
Availability  • “High availability implies that a single fault will not bring            down your system. Not ‘we’ll reco...
Fully distributed, no SPOF                 client                          p3                                p6        p1 ...
Partitioning                  jim     age: 36   car: camaro   gender: M                 carol    age: 37   car: subaru   g...
Partitioning           Primary key determines placement*                  jim     age: 36   car: camaro   gender: M       ...
PK      MD5 Hash                  jim     5e02739678...                                             MD5* hash             ...
The “token ring”                 Node A   Node B                 Node D   Node C©2012 DataStax
Start            End                 A   0xc000000000..1 0x0000000000..0                 B   0x0000000000..1 0x4000000000....
Start            End                 A   0xc000000000..1 0x0000000000..0                 B   0x0000000000..1 0x4000000000....
Start            End                 A   0xc000000000..1 0x0000000000..0                 B   0x0000000000..1 0x4000000000....
Start            End                 A   0xc000000000..1 0x0000000000..0                 B   0x0000000000..1 0x4000000000....
Start            End                 A   0xc000000000..1 0x0000000000..0                 B   0x0000000000..1 0x4000000000....
Replication                                 Node A   Node B                                 Node D   Node C       carol   ...
Node A   Node B                                 Node D   Node C       carol     a9a0198010...©2012 DataStax
Node A   Node B                                 Node D   Node C       carol     a9a0198010...©2012 DataStax
Highlights • Adding capacity is application-transparent and requires            no downtime     •      No SPOF, not even t...
CQL: You got SQL in my NoSQL! CREATE TABLE users (    id uuid PRIMARY KEY,    name text,    state text,    birth_date int ...
Strictly “realtime” focused  • No joins  • No subqueries  • No aggregation functions* or GROUP BY  • ORDER BY?©2012 DataStax
Clustered data in in CFS©2012 DataStax
Clustered data in in CFS©2012 DataStax
Clustering in CQL3 CREATE TABLE sblocks (     block_id uuid,     subblock_id uuid,     data blob,                         ...
Collections CREATE TABLE users (    id uuid PRIMARY KEY,    name text,    state text,    birth_date int ); CREATE TABLE us...
Collections CREATE TABLE users (    id uuid PRIMARY KEY,    name text,    state text,                 X    birth_date int ...
Collections CREATE TABLE users (    id uuid PRIMARY KEY,    name text,    state text,    birth_date int,    email_addresse...
Big data           Analytics        Realtime                       ?           (Hadoop)        (“NoSQL”)©2012 DataStax
The evolution of Analytics                 Analytics + Realtime©2012 DataStax
The evolution of Analytics                             replication                 Analytics                 Realtime©2012...
The evolution of Analytics                 ETL©2012 DataStax
Big data           Analytics    Datastax     Realtime           (Hadoop)    Enterprise   (Cassandra)©2012 DataStax
©2012 DataStax
Better Hadoop than Hadoop  • “Vanilla” Hadoop           •     8+ services to setup, monitor, backup, and recover          ...
Enterprise search with Solr SELECT title FROM solr WHERE solr_query=title:natio*;  title ---------------------------------...
Managing & Monitoring Big Data ©2012 DataStax
Questions?     •      http://www.datastax.com/docs     •      http://www.datastax.com/dev/blog/whats-new-in-            ca...
Upcoming SlideShare
Loading in...5
×

Massively Scalable NoSQL with Apache Cassandra

4,637

Published on

Published in: Technology

Transcript of "Massively Scalable NoSQL with Apache Cassandra"

  1. 1. Massively scalable NoSQLwith Apache Cassandra!Jonathan EllisProject Chair, Apache CassandraCTO, DataStax@spyced
  2. 2. Big data Analytics Realtime ? (Hadoop) (“NoSQL”)©2012 DataStax
  3. 3. Some Casandra users ©2012 DataStax
  4. 4. eBay Application/Use Case • Social Signals: like/want/own features for eBay product and item pages • Hunch taste graph for eBay users and items • Many time series use cases Why Cassandra? • Multi-datacenter • Scalable • Write performance • Distributed counters • Hadoop support©2012 DataStax ACE
  5. 5. Time series data©2012 DataStax
  6. 6. Multi-datacenter support©2012 DataStax
  7. 7. Distributed counters©2012 DataStax
  8. 8. Hadoop support©2012 DataStax
  9. 9. Disney Application/Use Case • Meet the data management needs of user facing applications across The Walt Disney Company with a single platform Why Cassandra? • DataStax Enterprise can tackle real-time and search functions in the same cluster • Scalability • 24x7 uptime©2012 DataStax NDI
  10. 10. Multitenancy©2012 DataStax
  11. 11. Multitenancy©2012 DataStax
  12. 12. Enterprise search©2012 DataStax
  13. 13. SimpleReach Application/Use Case • SimpleReach tracks social actions for content creators, from Twitter and Facebook to Pinterest and Reddit, to deliver detailed insights and clear metrics around social behavior. Why Cassandra? • Very high velocity data ingest rate and large data volumes • Workload separation between realtime and batch applications©2012 DataStax NDE
  14. 14. SourceNinja Application/Use Case • SourceNinja notifies you to performance, security, and bug fixes for the software you depend on Why Cassandra? • Previous database system could not handle load; HBase has too many points of failure and was too slow • Fast real time capabilities, batch analytics on that data, and enterprise search©2012 DataStax RDE
  15. 15. Netflix Application/Use Case • General purpose backend for large scale highly available cloud based web services supporting Netflix Streaming Why Cassandra? • Highly available, highly robust and no schema change downtime • Highly scalable, optimized for SSD • Much lower cost than previous Oracle and SimpleDB implementations • Flexible data model • Ability to directly influence/implement OSS feature set • Supports local and wide area distributed operations, spanning US and Europe©2012 DataStax RCE
  16. 16. Optimized for SSD©2012 DataStax
  17. 17. Open source©2012 DataStax
  18. 18. Use case patterns • Massively scalable • High performance • Reliable/Available©2012 DataStax
  19. 19. ©2012 DataStax
  20. 20. reads/s writes/s 35000 30000 25000 20000 15000 10000 5000 Cassandra 0.6 0©2012 DataStax Cassandra 1.0
  21. 21. ©2012 DataStax
  22. 22. Classic partitioning with SPOF partition 1 partition 2 partition 3 partition 4 router client©2012 DataStax
  23. 23. Availability • “High availability implies that a single fault will not bring down your system. Not ‘we’ll recover quickly.’” -- Ben Coverston: DataStax • “The biggest problem with failover is that youre almost never using it until it really hurts. Its like backups that you never test.” -- Rick Branson: Instagram©2012 DataStax
  24. 24. Fully distributed, no SPOF client p3 p6 p1 p1 p1©2012 DataStax
  25. 25. Partitioning jim age: 36 car: camaro gender: M carol age: 37 car: subaru gender: F johnny age:12 gender: M suzy age:10 gender: F©2012 DataStax
  26. 26. Partitioning Primary key determines placement* jim age: 36 car: camaro gender: M carol age: 37 car: subaru gender: F johnny age:12 gender: M suzy age:10 gender: F©2012 DataStax
  27. 27. PK MD5 Hash jim 5e02739678... MD5* hash carol a9a0198010... operation yields a 128-bit number johnny f4eb27cea7... for keys of any size. suzy 78b421309e...©2012 DataStax
  28. 28. The “token ring” Node A Node B Node D Node C©2012 DataStax
  29. 29. Start End A 0xc000000000..1 0x0000000000..0 B 0x0000000000..1 0x4000000000..0 C 0x4000000000..1 0x8000000000..0 D 0x8000000000..1 0xc000000000..0 jim 5e02739678... carol a9a0198010... johnny f4eb27cea7... suzy 78b421309e...©2012 DataStax
  30. 30. Start End A 0xc000000000..1 0x0000000000..0 B 0x0000000000..1 0x4000000000..0 C 0x4000000000..1 0x8000000000..0 D 0x8000000000..1 0xc000000000..0 jim 5e02739678... carol a9a0198010... johnny f4eb27cea7... suzy 78b421309e...©2012 DataStax
  31. 31. Start End A 0xc000000000..1 0x0000000000..0 B 0x0000000000..1 0x4000000000..0 C 0x4000000000..1 0x8000000000..0 D 0x8000000000..1 0xc000000000..0 jim 5e02739678... carol a9a0198010... johnny f4eb27cea7... suzy 78b421309e...©2012 DataStax
  32. 32. Start End A 0xc000000000..1 0x0000000000..0 B 0x0000000000..1 0x4000000000..0 C 0x4000000000..1 0x8000000000..0 D 0x8000000000..1 0xc000000000..0 jim 5e02739678... carol a9a0198010... johnny f4eb27cea7... suzy 78b421309e...©2012 DataStax
  33. 33. Start End A 0xc000000000..1 0x0000000000..0 B 0x0000000000..1 0x4000000000..0 C 0x4000000000..1 0x8000000000..0 D 0x8000000000..1 0xc000000000..0 jim 5e02739678... carol a9a0198010... johnny f4eb27cea7... suzy 78b421309e...©2012 DataStax
  34. 34. Replication Node A Node B Node D Node C carol a9a0198010...©2012 DataStax
  35. 35. Node A Node B Node D Node C carol a9a0198010...©2012 DataStax
  36. 36. Node A Node B Node D Node C carol a9a0198010...©2012 DataStax
  37. 37. Highlights • Adding capacity is application-transparent and requires no downtime • No SPOF, not even temporarily • No “primary” replica • Configurable synchronous/asynchronous • Tolerates node failure; never have to restart replication “from scratch” • “Smart” replication avoids correlated failures©2012 DataStax
  38. 38. CQL: You got SQL in my NoSQL! CREATE TABLE users ( id uuid PRIMARY KEY, name text, state text, birth_date int ); CREATE INDEX ON users(state); SELECT * FROM users WHERE state=‘Texas’ AND birth_date > 1950;©2012 DataStax
  39. 39. Strictly “realtime” focused • No joins • No subqueries • No aggregation functions* or GROUP BY • ORDER BY?©2012 DataStax
  40. 40. Clustered data in in CFS©2012 DataStax
  41. 41. Clustered data in in CFS©2012 DataStax
  42. 42. Clustering in CQL3 CREATE TABLE sblocks (     block_id uuid,     subblock_id uuid,     data blob, block_id subblock_id data     PRIMARY KEY (block_id, subblock_id) Block1 subblock A data A ); Block1 subblock B data B ... ... ... Block2 subblock C data C Block2 subblock D data D ... ... ... Block3 subblock E data E Block3 subblock F data F ... ... ...©2012 DataStax
  43. 43. Collections CREATE TABLE users ( id uuid PRIMARY KEY, name text, state text, birth_date int ); CREATE TABLE users_addresses ( user_id uuid REFERENCES users, email text ); SELECT * FROM users NATURAL JOIN users_addresses;©2012 DataStax
  44. 44. Collections CREATE TABLE users ( id uuid PRIMARY KEY, name text, state text, X birth_date int ); CREATE TABLE users_addresses ( user_id uuid REFERENCES users, email text ); SELECT * FROM users NATURAL JOIN users_addresses;©2012 DataStax
  45. 45. Collections CREATE TABLE users ( id uuid PRIMARY KEY, name text, state text, birth_date int, email_addresses set<text> ); UPDATE users SET email_addresses = email_addresses + {‘jbellis@gmail.com’, ‘jbellis@datastax.com’};©2012 DataStax
  46. 46. Big data Analytics Realtime ? (Hadoop) (“NoSQL”)©2012 DataStax
  47. 47. The evolution of Analytics Analytics + Realtime©2012 DataStax
  48. 48. The evolution of Analytics replication Analytics Realtime©2012 DataStax
  49. 49. The evolution of Analytics ETL©2012 DataStax
  50. 50. Big data Analytics Datastax Realtime (Hadoop) Enterprise (Cassandra)©2012 DataStax
  51. 51. ©2012 DataStax
  52. 52. Better Hadoop than Hadoop • “Vanilla” Hadoop • 8+ services to setup, monitor, backup, and recover (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, Zookeeper, Region Server,...) • Single points of failure • Cant separate online and offline processing • DataStax Enterprise • Single, simplified component • Self-organizes based on workload • Peer to peer • JobTracker failover©2012 DataStax
  53. 53. Enterprise search with Solr SELECT title FROM solr WHERE solr_query=title:natio*; title -------------------------------------------------------------------------- Bolivia national football team 2002 List of French born footballers who have played for other national teams Lithuania national basketball team at Eurobasket 2009 Bolivia national football team 2000 Kenya national under-20 football team Bolivia national football team 1999 Israel mens national inline hockey team Bolivia national football team 2001©2012 DataStax
  54. 54. Managing & Monitoring Big Data ©2012 DataStax
  55. 55. Questions? • http://www.datastax.com/docs • http://www.datastax.com/dev/blog/whats-new-in- cassandra-1-1 • http://www.datastax.com/dev/blog/schema-in- cassandra-1-1 • http://www.datastax.com/products/enterprise©2012 DataStax
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×