Intro toCassandra  Tyler Hobbs
HistoryDynamo                        BigTable(clustering)                  (data model)               Inbox search        ...
Users
Clustering    Every node plays the same role    – No masters, slaves, or special nodes    – No single point of failure
Consistent Hashing           0     50          10     40          20           30
Consistent Hashing                      Key: “www.google.com”           0     50          10     40          20           30
Consistent Hashing                      Key: “www.google.com”           0                      md5(“www.google.com”)     5...
Consistent Hashing                      Key: “www.google.com”           0                      md5(“www.google.com”)     5...
Consistent Hashing                      Key: “www.google.com”           0                      md5(“www.google.com”)     5...
Consistent Hashing                        Key: “www.google.com”           0                        md5(“www.google.com”)  ...
Clustering    Client can talk to any node
ScalingRF = 2             0              50        10The node at50 owns thered portion             20                   30
ScalingRF = 2               0                50        10   Add a new    40        20   node at 40                     30
ScalingRF = 2               0                50        10   Add a new    40        20   node at 40                     30
Node FailuresRF = 2               0                50        10   Replicas                40        20                    ...
Node FailuresRF = 2               0                50        10   Replicas                40        20                    ...
Node FailuresRF = 2               0                50        10                40        20                     30
Consistency, Availability    Consistency    – Can I read stale data?    Availability    – Can I write/read at all?    T...
Consistency    N = Total number of replicas    R = Number of replicas read from    – (before the response is returned) ...
Consistency    N = Total number of replicas    R = Number of replicas read from    – (before the response is returned) ...
Consistency W + R > N gives strong consistency N=3 W=2 R=2 2 + 2 > 3 ==> strongly consistent
Consistency W + R > N gives strong consistency N=3 W=2 R=2 2 + 2 > 3 ==> strongly consistent Only 2 of the 3 replicas must...
Consistency    Tunable Consistency    – Specify N (Replication Factor) per data set    – Specify R, W per operation
Consistency    Tunable Consistency    – Specify N (Replication Factor) per data set    – Specify R, W per operation    – ...
Availability    Can tolerate the loss of:    – N – R replicas for reads    – N – W replicas for writes
CAP TheoremDuring node or network failure:          100%                                          Not                     ...
CAP TheoremDuring node or network failure:          100%                                                 Not              ...
Clustering    No single point of failure    Replication that works    Scales linearly    – 2x nodes = 2x performance   ...
Data Model    Comes from Google BigTable    Goals    – Commodity Hardware       • Spinning disks    – Handle data sets m...
Column Families    Static    – Object data    – Similar to a table in a relational database    Dynamic    – Precomputed ...
Static Column Families                   Users   zznate    password: *    name: Nate   driftx    password: *   name: Brand...
Dynamic Column Families    Rows    – Each row has a unique primary key    – Sorted list of (name, value) tuples       • L...
Dynamic Column Families                     Followingzznate    driftx:   thobbs:driftxthobbs    zznate:jbellis   driftx:  ...
Dynamic Column Families    Other Examples:    – Timeline of tweets by a user    – Timeline of tweets by all of the people...
The Data API    RPC-based API    – github.com/twitter/cassandra    CQL (Cassandra Query Language)    – code.google.com/a...
Inserting Data INSERT INTO users (KEY, “name”, “age”)     VALUES (“thobbs”, “Tyler”, 24);
Updating Data Updates are the same as inserts: INSERT INTO users (KEY, “age”)     VALUES (“thobbs”, 34); Or UPDATE users S...
Fetching Data Whole row select: SELECT * FROM users WHERE KEY = “thobbs”;
Fetching Data Explicit column select: SELECT “name”, “age” FROM users     WHERE KEY = “thobbs”;
Fetching Data Get a slice of columns UPDATE letters SET 1=a, 2=b, 3=c, 4=d, 5=e     WHERE KEY = “key”; SELECT 1..3 FROM le...
Fetching Data Get a slice of columns SELECT FIRST 2 FROM letters WHERE KEY = “key”; Returns [(1, a), (2, b)] SELECT FIRST ...
Fetching Data Get a slice of columns SELECT 3.. FROM letters WHERE KEY = “key”; Returns [(3, c), (4, d), (5, e)] SELECT FI...
Deleting Data Delete a whole row: DELETE FROM users WHERE KEY = “thobbs”; Delete specific columns: DELETE “age” FROM users...
Secondary Indexes Builtin basic indexes CREATE INDEX ageIndex ON users (age); SELECT name FROM USERS     WHERE age = 24 AN...
Performance    Writes    – 10k – 30k per second per node    – Sub-millisecond latency    Reads    – 1k – 20k per second ...
Other Features    Distributed Counters    – Can support millions of high-volume counters    Excellent Multi-datacenter S...
What Cassandra Cant Do    Transactions    – Unless you use a distributed lock    – Atomicity, Isolation    – These arent ...
Not One-size-fits-all    Use alongside an RDBMS
Problems you shouldnt solve with C*    Prototyping    Distributed Locking    Small datasets    – (When you dont need av...
The sweet spot for Cassandra    Large dataset, low latency queries    Simple to medium complexity queries    – Key/value...
The sweet spot for Cassandra    Social    – Texts, comments, check-ins, collaboration    Activity    – Feeds, timelines,...
ORMs    Poor integration    ORMs are not a natural fit for Cassandra    – In C*, we mainly care about queries, not objec...
Questions?          Tyler Hobbs               @tylhobbs       tyler@datastax.com
Upcoming SlideShare
Loading in …5
×

Cassandra for Ruby/Rails Devs

4,917 views

Published on

Published in: Business, Technology
0 Comments
8 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
4,917
On SlideShare
0
From Embeds
0
Number of Embeds
13
Actions
Shares
0
Downloads
35
Comments
0
Likes
8
Embeds 0
No embeds

No notes for slide

Cassandra for Ruby/Rails Devs

  1. 1. Intro toCassandra Tyler Hobbs
  2. 2. HistoryDynamo BigTable(clustering) (data model) Inbox search Cassandra
  3. 3. Users
  4. 4. Clustering Every node plays the same role – No masters, slaves, or special nodes – No single point of failure
  5. 5. Consistent Hashing 0 50 10 40 20 30
  6. 6. Consistent Hashing Key: “www.google.com” 0 50 10 40 20 30
  7. 7. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30
  8. 8. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30
  9. 9. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30
  10. 10. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30 Replication Factor = 3
  11. 11. Clustering Client can talk to any node
  12. 12. ScalingRF = 2 0 50 10The node at50 owns thered portion 20 30
  13. 13. ScalingRF = 2 0 50 10 Add a new 40 20 node at 40 30
  14. 14. ScalingRF = 2 0 50 10 Add a new 40 20 node at 40 30
  15. 15. Node FailuresRF = 2 0 50 10 Replicas 40 20 30
  16. 16. Node FailuresRF = 2 0 50 10 Replicas 40 20 30
  17. 17. Node FailuresRF = 2 0 50 10 40 20 30
  18. 18. Consistency, Availability Consistency – Can I read stale data? Availability – Can I write/read at all? Tunable Consistency
  19. 19. Consistency N = Total number of replicas R = Number of replicas read from – (before the response is returned) W = Number of replicas written to – (before the write is considered a success)
  20. 20. Consistency N = Total number of replicas R = Number of replicas read from – (before the response is returned) W = Number of replicas written to – (before the write is considered a success) W + R > N gives strong consistency
  21. 21. Consistency W + R > N gives strong consistency N=3 W=2 R=2 2 + 2 > 3 ==> strongly consistent
  22. 22. Consistency W + R > N gives strong consistency N=3 W=2 R=2 2 + 2 > 3 ==> strongly consistent Only 2 of the 3 replicas must be available.
  23. 23. Consistency Tunable Consistency – Specify N (Replication Factor) per data set – Specify R, W per operation
  24. 24. Consistency Tunable Consistency – Specify N (Replication Factor) per data set – Specify R, W per operation – Quorum: N/2 + 1 • R = W = Quorum • Strong consistency • Tolerate the loss of N – Quorum replicas – R, W can also be 1 or N
  25. 25. Availability Can tolerate the loss of: – N – R replicas for reads – N – W replicas for writes
  26. 26. CAP TheoremDuring node or network failure: 100% Not Possible Availability Possible Consistency 100%
  27. 27. CAP TheoremDuring node or network failure: 100% Not Ca Possible ss an dr Availability a Possible Consistency 100%
  28. 28. Clustering No single point of failure Replication that works Scales linearly – 2x nodes = 2x performance • For both reads and writes – Up to 100s of nodes – See “Netflix: 1 million writes/sec on AWS” Operationally simple Multi-Datacenter Replication
  29. 29. Data Model Comes from Google BigTable Goals – Commodity Hardware • Spinning disks – Handle data sets much larger than memory • Minimize disk seeks – High throughput – Low latency – Durable
  30. 30. Column Families Static – Object data – Similar to a table in a relational database Dynamic – Precomputed query results – Materialized views (these are just educational classifications)
  31. 31. Static Column Families Users zznate password: * name: Nate driftx password: * name: Brandon thobbs password: * name: Tyler jbellis password: * name: Jonathan site: riptano.com
  32. 32. Dynamic Column Families Rows – Each row has a unique primary key – Sorted list of (name, value) tuples • Like an ordered hash – The (name, value) tuple is called a “column”
  33. 33. Dynamic Column Families Followingzznate driftx: thobbs:driftxthobbs zznate:jbellis driftx: mdennis: pcmanus: thobbs: xedin: zznate:
  34. 34. Dynamic Column Families Other Examples: – Timeline of tweets by a user – Timeline of tweets by all of the people a user is following – List of comments sorted by score – List of friends grouped by state
  35. 35. The Data API RPC-based API – github.com/twitter/cassandra CQL (Cassandra Query Language) – code.google.com/a/apache-extras.org/p/cassandra-ruby/
  36. 36. Inserting Data INSERT INTO users (KEY, “name”, “age”) VALUES (“thobbs”, “Tyler”, 24);
  37. 37. Updating Data Updates are the same as inserts: INSERT INTO users (KEY, “age”) VALUES (“thobbs”, 34); Or UPDATE users SET “age” = 34 WHERE KEY = “thobbs”;
  38. 38. Fetching Data Whole row select: SELECT * FROM users WHERE KEY = “thobbs”;
  39. 39. Fetching Data Explicit column select: SELECT “name”, “age” FROM users WHERE KEY = “thobbs”;
  40. 40. Fetching Data Get a slice of columns UPDATE letters SET 1=a, 2=b, 3=c, 4=d, 5=e WHERE KEY = “key”; SELECT 1..3 FROM letters WHERE KEY = “key”; Returns [(1, a), (2, b), (3, c)]
  41. 41. Fetching Data Get a slice of columns SELECT FIRST 2 FROM letters WHERE KEY = “key”; Returns [(1, a), (2, b)] SELECT FIRST 2 REVERSED FROM letters WHERE KEY = “key”; Returns [(5, e), (4, d)]
  42. 42. Fetching Data Get a slice of columns SELECT 3.. FROM letters WHERE KEY = “key”; Returns [(3, c), (4, d), (5, e)] SELECT FIRST 2 REVERSED 4.. FROM letters WHERE KEY = “key”; Returns [(4, d), (3, c)]
  43. 43. Deleting Data Delete a whole row: DELETE FROM users WHERE KEY = “thobbs”; Delete specific columns: DELETE “age” FROM users WHERE KEY = “thobbs”;
  44. 44. Secondary Indexes Builtin basic indexes CREATE INDEX ageIndex ON users (age); SELECT name FROM USERS WHERE age = 24 AND state = “TX”;
  45. 45. Performance Writes – 10k – 30k per second per node – Sub-millisecond latency Reads – 1k – 20k per second per node (depends on data set, caching – 0.1 to 10ms latency
  46. 46. Other Features Distributed Counters – Can support millions of high-volume counters Excellent Multi-datacenter Support – Disaster recovery – Locality Hadoop Integration – Isolation of resources – Hive and Pig drivers Compression
  47. 47. What Cassandra Cant Do Transactions – Unless you use a distributed lock – Atomicity, Isolation – These arent needed as often as youd think Limited support for ad-hoc queries – Know what you want to do with the data
  48. 48. Not One-size-fits-all Use alongside an RDBMS
  49. 49. Problems you shouldnt solve with C* Prototyping Distributed Locking Small datasets – (When you dont need availability) Complex graph processing – Shallow graph queries work well, though Fundamentally highly relational/transactional data
  50. 50. The sweet spot for Cassandra Large dataset, low latency queries Simple to medium complexity queries – Key/value – Time series, ordered data – Lists, sets, maps High Availability
  51. 51. The sweet spot for Cassandra Social – Texts, comments, check-ins, collaboration Activity – Feeds, timelines, clickstreams, logs, sensor data Metrics – Performance data over time – CloudKick, DataStax OpsCenter Text Search – Inbox search at Facebook
  52. 52. ORMs Poor integration ORMs are not a natural fit for Cassandra – In C*, we mainly care about queries, not objects – Beyond simple K/V, abstraction breaks Suggestion: dont waste time with an ORM – C* will only be used for a specific subset of your data/queries – Use the C* API directly in your model
  53. 53. Questions? Tyler Hobbs @tylhobbs tyler@datastax.com

×