Intro toCassandra  Tyler Hobbs
HistoryDynamo                     BigTable(clustering)               (data model)               Cassandra
Users
Clustering    Every node plays the same role    – No masters, slaves, or special nodes    – No single point of failure
Consistent Hashing           0     50          10     40          20           30
Consistent Hashing                      Key: “www.google.com”           0     50          10     40          20           30
Consistent Hashing                      Key: “www.google.com”           0                      md5(“www.google.com”)     5...
Consistent Hashing                      Key: “www.google.com”           0                      md5(“www.google.com”)     5...
Consistent Hashing                      Key: “www.google.com”           0                      md5(“www.google.com”)     5...
Consistent Hashing                        Key: “www.google.com”           0                        md5(“www.google.com”)  ...
Clustering    Client can talk to any node
ScalingRF = 2             0              50        10The node at50 owns thered portion             20                   30
ScalingRF = 2               0                50        10   Add a new    40        20   node at 40                     30
ScalingRF = 2               0                50        10   Add a new    40        20   node at 40                     30
Node FailuresRF = 2               0                50        10   Replicas                40        20                    ...
Node FailuresRF = 2               0                50        10   Replicas                40        20                    ...
Node FailuresRF = 2               0                50        10                40        20                     30
Consistency, Availability    Consistency    – Can I read stale data?    Availability    – Can I write/read at all?    T...
Consistency    N = Total number of replicas    R = Number of replicas read from    – (before the response is returned) ...
Consistency    N = Total number of replicas    R = Number of replicas read from    – (before the response is returned) ...
Consistency W + R > N gives strong consistency N=3 W=2 R=2 2 + 2 > 3 ==> strongly consistent
Consistency W + R > N gives strong consistency N=3 W=2 R=2 2 + 2 > 3 ==> strongly consistent Only 2 of the 3 replicas must...
Consistency    Tunable Consistency    – Specify N (Replication Factor) per data set    – Specify R, W per operation
Consistency    Tunable Consistency    – Specify N (Replication Factor) per data set    – Specify R, W per operation    – ...
Availability    Can tolerate the loss of:    – N – R replicas for reads    – N – W replicas for writes
CAP TheoremDuring node or network failure:          100%                                          Not                     ...
CAP TheoremDuring node or network failure:          100%                                                 Not              ...
Clustering    No single point of failure    Replication that works    Scales linearly    – 2x nodes = 2x performance   ...
Data Model    Comes from Google BigTable    Goals    – Minimize disk seeks    – High throughput    – Low latency    – Du...
Data Model    Keyspace    – A collection of Column Families    – Controls replication settings    Column Family    – Kin...
Column Families    Static    – Object data    – Similar to a table in a relational database    Dynamic    – Pre-calculat...
Static Column Families                   Users   zznate    password: *    name: Nate   driftx    password: *   name: Brand...
Dynamic Column Families    Rows    – Each row has a unique primary key    – Sorted list of (name, value) tuples       • L...
Dynamic Column Families                     Followingzznate    driftx:   thobbs:driftxthobbs    zznate:jbellis   driftx:  ...
Dynamic Column Families    Column Timestamps    – Each column (tuple) has a timestamp    – In the case of a collision, th...
Dynamic Column Families    Other Examples:    – Timeline of tweets by a user    – Timeline of tweets by all of the people...
The Data API    Two choices    – RPC-based API    – CQL       • Cassandra Query Language
Inserting Data INSERT INTO users (KEY, “name”, “age”)     VALUES (“thobbs”, “Tyler”, 24);
Updating Data Updates are the same as inserts: INSERT INTO users (KEY, “age”)     VALUES (“thobbs”, 34); Or UPDATE users S...
Fetching Data Whole row select: SELECT * FROM users WHERE KEY = “thobbs”;
Fetching Data Explicit column select: SELECT “name”, “age” FROM users     WHERE KEY = “thobbs”;
Fetching Data Get a slice of columns UPDATE letters SET 1=a, 2=b, 3=c, 4=d, 5=e     WHERE KEY = “key”; SELECT 1..3 FROM le...
Fetching Data Get a slice of columns SELECT FIRST 2 FROM letters WHERE KEY = “key”; Returns [(1, a), (2, b)] SELECT FIRST ...
Fetching Data Get a slice of columns SELECT 3.. FROM letters WHERE KEY = “key”; Returns [(3, c), (4, d), (5, e)] SELECT FI...
Deleting Data Delete a whole row: DELETE FROM users WHERE KEY = “thobbs”; Delete specific columns: DELETE “age” FROM users...
Secondary Indexes Builtin basic indexes CREATE INDEX ageIndex ON users (age); SELECT name FROM USERS     WHERE age = 24 AN...
Performance    Writes    – 10k – 30k per second per node    – Sub-millisecond latency    Reads    – 1k – 10k per second ...
Other Features    Distributed Counters    – Can support millions of high-volume counters    Excellent Multi-datacenter S...
What Cassandra Cant Do    Transactions    – Unless you use a distributed lock    – Atomicity, Isolation    – These arent ...
Not One-size-fits-all    Use alongside an RDBMS    – Use the RDBMS for highly-transactional or highly-      relational da...
Language Support    Good:    – Java    – Python    – Ruby    – PHP    – C#    Coming Soon:    – Everything else, now tha...
Questions?          Tyler Hobbs               @tylhobbs       tyler@datastax.com
Upcoming SlideShare
Loading in...5
×

Intro to Cassandra

2,528

Published on

An introduction to Apache Cassandra, covering the clustering model and the data model.

Presented by Tyler Hobbs at the October 2011 Austin NoSQL meetup.

Published in: Technology, News & Politics
0 Comments
4 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
2,528
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
79
Comments
0
Likes
4
Embeds 0
No embeds

No notes for slide

Intro to Cassandra

  1. 1. Intro toCassandra Tyler Hobbs
  2. 2. HistoryDynamo BigTable(clustering) (data model) Cassandra
  3. 3. Users
  4. 4. Clustering Every node plays the same role – No masters, slaves, or special nodes – No single point of failure
  5. 5. Consistent Hashing 0 50 10 40 20 30
  6. 6. Consistent Hashing Key: “www.google.com” 0 50 10 40 20 30
  7. 7. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30
  8. 8. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30
  9. 9. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30
  10. 10. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30 Replication Factor = 3
  11. 11. Clustering Client can talk to any node
  12. 12. ScalingRF = 2 0 50 10The node at50 owns thered portion 20 30
  13. 13. ScalingRF = 2 0 50 10 Add a new 40 20 node at 40 30
  14. 14. ScalingRF = 2 0 50 10 Add a new 40 20 node at 40 30
  15. 15. Node FailuresRF = 2 0 50 10 Replicas 40 20 30
  16. 16. Node FailuresRF = 2 0 50 10 Replicas 40 20 30
  17. 17. Node FailuresRF = 2 0 50 10 40 20 30
  18. 18. Consistency, Availability Consistency – Can I read stale data? Availability – Can I write/read at all? Tunable Consistency
  19. 19. Consistency N = Total number of replicas R = Number of replicas read from – (before the response is returned) W = Number of replicas written to – (before the write is considered a success)
  20. 20. Consistency N = Total number of replicas R = Number of replicas read from – (before the response is returned) W = Number of replicas written to – (before the write is considered a success) W + R > N gives strong consistency
  21. 21. Consistency W + R > N gives strong consistency N=3 W=2 R=2 2 + 2 > 3 ==> strongly consistent
  22. 22. Consistency W + R > N gives strong consistency N=3 W=2 R=2 2 + 2 > 3 ==> strongly consistent Only 2 of the 3 replicas must be available.
  23. 23. Consistency Tunable Consistency – Specify N (Replication Factor) per data set – Specify R, W per operation
  24. 24. Consistency Tunable Consistency – Specify N (Replication Factor) per data set – Specify R, W per operation – Quorum: N/2 + 1 • R = W = Quorum • Strong consistency • Tolerate the loss of N – Quorum replicas – R, W can also be 1 or N
  25. 25. Availability Can tolerate the loss of: – N – R replicas for reads – N – W replicas for writes
  26. 26. CAP TheoremDuring node or network failure: 100% Not Possible Availability Possible Consistency 100%
  27. 27. CAP TheoremDuring node or network failure: 100% Not Ca Possible ss an dr Availability a Possible Consistency 100%
  28. 28. Clustering No single point of failure Replication that works Scales linearly – 2x nodes = 2x performance • For both writes and reads – Up to 100s of nodes Operationally simple Multi-Datacenter Replication
  29. 29. Data Model Comes from Google BigTable Goals – Minimize disk seeks – High throughput – Low latency – Durable
  30. 30. Data Model Keyspace – A collection of Column Families – Controls replication settings Column Family – Kinda resembles a table
  31. 31. Column Families Static – Object data – Similar to a table in a relational database Dynamic – Pre-calculated query results – Materialized views
  32. 32. Static Column Families Users zznate password: * name: Nate driftx password: * name: Brandon thobbs password: * name: Tyler jbellis password: * name: Jonathan site: riptano.com
  33. 33. Dynamic Column Families Rows – Each row has a unique primary key – Sorted list of (name, value) tuples • Like a sorted map or dictionary – The (name, value) tuple is called a “column”
  34. 34. Dynamic Column Families Followingzznate driftx: thobbs:driftxthobbs zznate:jbellis driftx: mdennis: pcmanus thobbs: xedin: zznate
  35. 35. Dynamic Column Families Column Timestamps – Each column (tuple) has a timestamp – In the case of a collision, the latest timestamp wins – Client specifies timestamp with write – Writes are idempotent • Infinite retries allowed
  36. 36. Dynamic Column Families Other Examples: – Timeline of tweets by a user – Timeline of tweets by all of the people a user is following – List of comments sorted by score – List of friends grouped by state
  37. 37. The Data API Two choices – RPC-based API – CQL • Cassandra Query Language
  38. 38. Inserting Data INSERT INTO users (KEY, “name”, “age”) VALUES (“thobbs”, “Tyler”, 24);
  39. 39. Updating Data Updates are the same as inserts: INSERT INTO users (KEY, “age”) VALUES (“thobbs”, 34); Or UPDATE users SET “age” = 34 WHERE KEY = “thobbs”;
  40. 40. Fetching Data Whole row select: SELECT * FROM users WHERE KEY = “thobbs”;
  41. 41. Fetching Data Explicit column select: SELECT “name”, “age” FROM users WHERE KEY = “thobbs”;
  42. 42. Fetching Data Get a slice of columns UPDATE letters SET 1=a, 2=b, 3=c, 4=d, 5=e WHERE KEY = “key”; SELECT 1..3 FROM letters WHERE KEY = “key”; Returns [(1, a), (2, b), (3, c)]
  43. 43. Fetching Data Get a slice of columns SELECT FIRST 2 FROM letters WHERE KEY = “key”; Returns [(1, a), (2, b)] SELECT FIRST 2 REVERSED FROM letters WHERE KEY = “key”; Returns [(5, e), (4, d)]
  44. 44. Fetching Data Get a slice of columns SELECT 3.. FROM letters WHERE KEY = “key”; Returns [(3, c), (4, d), (5, e)] SELECT FIRST 2 REVERSED 4.. FROM letters WHERE KEY = “key”; Returns [(4, d), (3, c)]
  45. 45. Deleting Data Delete a whole row: DELETE FROM users WHERE KEY = “thobbs”; Delete specific columns: DELETE “age” FROM users WHERE KEY = “thobbs”;
  46. 46. Secondary Indexes Builtin basic indexes CREATE INDEX ageIndex ON users (age); SELECT name FROM USERS WHERE age = 24 AND state = “TX”;
  47. 47. Performance Writes – 10k – 30k per second per node – Sub-millisecond latency Reads – 1k – 10k per second per node – Depends on data set, caching – Usually 0.1 to 10ms latency
  48. 48. Other Features Distributed Counters – Can support millions of high-volume counters Excellent Multi-datacenter Support – Disaster recovery – Locality Hadoop Integration – Isolation of resources – Hive and Pig drivers Compression
  49. 49. What Cassandra Cant Do Transactions – Unless you use a distributed lock – Atomicity, Isolation – These arent needed as often as youd think Limited support for ad-hoc queries – Know what you want to do with the data
  50. 50. Not One-size-fits-all Use alongside an RDBMS – Use the RDBMS for highly-transactional or highly- relational data • Usually a small set of data – Let Cassandra scale to handle the rest
  51. 51. Language Support Good: – Java – Python – Ruby – PHP – C# Coming Soon: – Everything else, now that we have CQL
  52. 52. Questions? Tyler Hobbs @tylhobbs tyler@datastax.com
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×