Cassandra for Ruby/Rails Devs

  1. Intro to Cassandra – Tyler Hobbs
  2. History – Dynamo (clustering) + BigTable (data model) → Cassandra, first used for Inbox search
  3. Users
  4. Clustering – Every node plays the same role: no masters, slaves, or special nodes; no single point of failure
  5. Consistent Hashing – nodes placed on a ring at tokens 0, 10, 20, 30, 40, 50 (diagram)
  6.–9. Consistent Hashing – the key "www.google.com" is hashed with md5 to token 14 and placed on the ring (diagram)
  10. Consistent Hashing – with Replication Factor = 3, the key is stored on three nodes of the ring (diagram)
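
The ring placement in those diagrams can be sketched in a few lines of Ruby. This is only an illustration, not Cassandra's actual partitioner: it assumes a toy ring of size 60 with the node tokens from the diagram, maps a key to the first node at or after the key's token, and walks clockwise for the remaining replicas.

    require 'digest/md5'

    NODE_TOKENS = [0, 10, 20, 30, 40, 50]  # node positions from the diagrams
    RING_SIZE   = 60                       # toy ring; the real token space is far larger

    # Hash the row key onto the ring (md5, as on the slides).
    def token_for(key)
      Digest::MD5.hexdigest(key).to_i(16) % RING_SIZE
    end

    # The primary replica is the first node at or after the key's token (wrapping
    # around to the start); further replicas are the next nodes clockwise.
    def replicas_for(key, replication_factor)
      start = NODE_TOKENS.index { |t| t >= token_for(key) } || 0
      replication_factor.times.map { |i| NODE_TOKENS[(start + i) % NODE_TOKENS.size] }
    end

    puts token_for("www.google.com")        # a token between 0 and 59
    p    replicas_for("www.google.com", 3)  # three node tokens, clockwise
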
  11. Clustering – A client can talk to any node
  12. Scaling – RF = 2: the node at token 50 owns the red portion of the ring (diagram)
  13.–14. Scaling – RF = 2: add a new node at token 40 (diagram)
  15.–17. Node Failures – RF = 2: when a node fails, the replicas on other nodes still hold its data (diagram)
  18. Consistency, Availability – Consistency: can I read stale data? Availability: can I read or write at all? Cassandra offers tunable consistency.
  19.–20. Consistency – N = total number of replicas; R = number of replicas read from (before the response is returned); W = number of replicas written to (before the write is considered a success). W + R > N gives strong consistency.
  21.–22. Consistency – W + R > N gives strong consistency. Example: N = 3, W = 2, R = 2; since 2 + 2 > 3 this is strongly consistent, and only 2 of the 3 replicas must be available.
  23.–24. Consistency – Tunable consistency: specify N (the Replication Factor) per data set and R, W per operation. Quorum = N/2 + 1; with R = W = Quorum you get strong consistency while tolerating the loss of N - Quorum replicas. R and W can also be 1 or N.
  25. Availability – Can tolerate the loss of N - R replicas for reads and N - W replicas for writes
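
As a quick sanity check of the arithmetic on the last few slides, here is a tiny Ruby helper that, for a given N, R, and W, reports the quorum size, whether the combination is strongly consistent, and how many replica losses reads and writes can tolerate. It is only the slides' arithmetic, nothing more.

    # N = replication factor, R = replicas read from, W = replicas written to
    def consistency_report(n, r, w)
      {
        quorum:               n / 2 + 1,
        strongly_consistent:  r + w > n,
        read_loss_tolerance:  n - r,   # replicas that may be down for reads
        write_loss_tolerance: n - w    # replicas that may be down for writes
      }
    end

    p consistency_report(3, 2, 2)
    # => {:quorum=>2, :strongly_consistent=>true, :read_loss_tolerance=>1, :write_loss_tolerance=>1}
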
  26.–27. CAP Theorem – During node or network failure, 100% consistency and 100% availability are not both possible; Cassandra occupies the possible region of the consistency/availability trade-off (diagram)
  28. Clustering – No single point of failure; replication that works; scales linearly (2x nodes = 2x performance, for both reads and writes, up to 100s of nodes; see "Netflix: 1 million writes/sec on AWS"); operationally simple; multi-datacenter replication
  29. Data Model – Comes from Google BigTable. Goals: commodity hardware (spinning disks); handle data sets much larger than memory (minimize disk seeks); high throughput; low latency; durable.
  30. Column Families – Static: object data, similar to a table in a relational database. Dynamic: precomputed query results, materialized views. (These are just educational classifications.)
  31. Static Column Families – Users: zznate (password: *, name: Nate); driftx (password: *, name: Brandon); thobbs (password: *, name: Tyler); jbellis (password: *, name: Jonathan, site: riptano.com)
  32. Dynamic Column Families – Rows: each row has a unique primary key and holds a sorted list of (name, value) tuples, like an ordered hash; a (name, value) tuple is called a "column"
  33. Dynamic Column Families – Following: zznate → driftx, thobbs; driftx → thobbs, zznate; jbellis → driftx, mdennis, pcmanus, thobbs, xedin, zznate (the column names are the followed users; the values are empty)
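
Since a dynamic row is just a sorted list of (name, value) pairs, a plain Ruby Hash (which preserves insertion order) is a reasonable mental model for the Following data above. This is only an illustration of the data shape, not client code.

    # Each row key maps to an ordered set of columns; here the column names are
    # the followed users and the values are empty, as on the slide.
    following = {
      "zznate"  => { "driftx" => "", "thobbs" => "" },
      "driftx"  => { "thobbs" => "", "zznate" => "" },
      "jbellis" => { "driftx" => "", "mdennis" => "", "pcmanus" => "",
                     "thobbs" => "", "xedin" => "", "zznate" => "" }
    }

    # A column slice is just a contiguous chunk of that ordered list:
    p following["jbellis"].keys.first(2)   # => ["driftx", "mdennis"]
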
  34. Dynamic Column Families – Other examples: timeline of tweets by a user; timeline of tweets by all of the people a user is following; list of comments sorted by score; list of friends grouped by state
  35. The Data API – RPC-based API: github.com/twitter/cassandra. CQL (Cassandra Query Language): code.google.com/a/apache-extras.org/p/cassandra-ruby/
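
From Ruby, the CQL on the following slides can be run through the cassandra-cql gem linked above. A minimal connection sketch follows; the constructor arguments, keyspace name, and ?-placeholder binding are assumptions, so check the gem's README for the exact API.

    require 'rubygems'
    require 'cassandra-cql'

    # Assumed API: point the client at any node (they are all equal) and pick a keyspace.
    db = CassandraCQL::Database.new('127.0.0.1:9160', :keyspace => 'demo')

    # Statements are plain CQL strings; values are bound from Ruby objects.
    db.execute("INSERT INTO users (KEY, name, age) VALUES (?, ?, ?)",
               "thobbs", "Tyler", 24)
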
  36. Inserting Data – INSERT INTO users (KEY, "name", "age") VALUES ("thobbs", "Tyler", 24);
  37. Updating Data – Updates are the same as inserts: INSERT INTO users (KEY, "age") VALUES ("thobbs", 34); or UPDATE users SET "age" = 34 WHERE KEY = "thobbs";
  38. Fetching Data – Whole-row select: SELECT * FROM users WHERE KEY = "thobbs";
  39. Fetching Data – Explicit column select: SELECT "name", "age" FROM users WHERE KEY = "thobbs";
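
Reading those rows back from Ruby might look like the sketch below; the result-iteration calls (fetch yielding rows, row.to_hash) are assumptions about the cassandra-cql gem and worth confirming against its documentation.

    result = db.execute("SELECT name, age FROM users WHERE KEY = ?", "thobbs")

    result.fetch do |row|       # assumed: fetch yields each returned row
      attrs = row.to_hash       # assumed: row converts to a name => value hash
      puts "#{attrs['name']} is #{attrs['age']}"
    end
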
  40. Fetching Data – Get a slice of columns: UPDATE letters SET 1=a, 2=b, 3=c, 4=d, 5=e WHERE KEY = "key"; SELECT 1..3 FROM letters WHERE KEY = "key"; returns [(1, a), (2, b), (3, c)]
  41. Fetching Data – Get a slice of columns: SELECT FIRST 2 FROM letters WHERE KEY = "key"; returns [(1, a), (2, b)]. SELECT FIRST 2 REVERSED FROM letters WHERE KEY = "key"; returns [(5, e), (4, d)]
  42. Fetching Data – Get a slice of columns: SELECT 3.. FROM letters WHERE KEY = "key"; returns [(3, c), (4, d), (5, e)]. SELECT FIRST 2 REVERSED 4.. FROM letters WHERE KEY = "key"; returns [(4, d), (3, c)]
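
Slice queries are what make time-ordered data cheap to read. As a sketch, a hypothetical user_timeline column family whose column names sort chronologically could be read newest-first from Ruby with the same FIRST/REVERSED syntax shown above (the column family name and layout are assumptions, not from the slides):

    # Hypothetical timeline row: column names sort by time, values are tweet ids,
    # so a reversed slice returns the most recent entries first.
    recent = db.execute("SELECT FIRST 10 REVERSED FROM user_timeline WHERE KEY = ?",
                        "thobbs")
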
  43. Deleting Data – Delete a whole row: DELETE FROM users WHERE KEY = "thobbs"; Delete specific columns: DELETE "age" FROM users WHERE KEY = "thobbs";
  44. Secondary Indexes – Built-in basic indexes: CREATE INDEX ageIndex ON users (age); SELECT name FROM users WHERE age = 24 AND state = "TX";
  45. Performance – Writes: 10k–30k per second per node, sub-millisecond latency. Reads: 1k–20k per second per node (depends on data set and caching), 0.1 to 10 ms latency.
  46. Other Features – Distributed counters (can support millions of high-volume counters); excellent multi-datacenter support (disaster recovery, locality); Hadoop integration (isolation of resources, Hive and Pig drivers); compression
  47. What Cassandra Can't Do – Transactions (unless you use a distributed lock): atomicity and isolation aren't needed as often as you'd think. Limited support for ad-hoc queries: know what you want to do with the data.
  48. Not One-Size-Fits-All – Use it alongside an RDBMS
  49. Problems you shouldn't solve with C* – Prototyping; distributed locking; small datasets (when you don't need availability); complex graph processing (shallow graph queries work well, though); fundamentally highly relational/transactional data
  50. The sweet spot for Cassandra – Large datasets with low-latency queries; simple-to-medium-complexity queries (key/value; time series and ordered data; lists, sets, maps); high availability
  51. The sweet spot for Cassandra – Social: texts, comments, check-ins, collaboration. Activity: feeds, timelines, clickstreams, logs, sensor data. Metrics: performance data over time (CloudKick, DataStax OpsCenter). Text search: Inbox search at Facebook.
  52. ORMs – Poor integration: ORMs are not a natural fit for Cassandra. In C* we mainly care about queries, not objects, and beyond simple K/V the abstraction breaks down. Suggestion: don't waste time with an ORM; C* will only be used for a specific subset of your data/queries, so use the C* API directly in your model.
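
In a Rails app this usually ends up as a small hand-rolled model that issues CQL directly instead of going through an ORM. The class and column family below are hypothetical; the point is only the shape of wrapping queries in plain Ruby.

    # A plain Ruby model that talks to Cassandra directly, no ORM layer.
    class UserProfile
      COLUMN_FAMILY = "users"   # hypothetical name

      def initialize(db)
        @db = db                # e.g. a CassandraCQL::Database connection
      end

      def save(key, name, age)
        @db.execute("INSERT INTO #{COLUMN_FAMILY} (KEY, name, age) VALUES (?, ?, ?)",
                    key, name, age)
      end

      def find(key)
        @db.execute("SELECT name, age FROM #{COLUMN_FAMILY} WHERE KEY = ?", key)
      end
    end
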
  53. Questions? – Tyler Hobbs, @tylhobbs, tyler@datastax.com
