Cassandra for Ruby/Rails Devs

Cassandra for Ruby/Rails Devs Presentation Transcript

  • 1. Intro to Cassandra (Tyler Hobbs)
  • 2. History: Dynamo (clustering) + BigTable (data model) → Cassandra, built for inbox search
  • 3. Users
  • 4. Clustering Every node plays the same role – No masters, slaves, or special nodes – No single point of failure
  • 5. Consistent Hashing (ring diagram: nodes at tokens 0, 10, 20, 30, 40, 50)
  • 6. Consistent Hashing Key: “www.google.com” (same ring)
  • 7. Consistent Hashing Key: “www.google.com”, md5(“www.google.com”) → 14 (same ring, with the key placed at position 14)
  • 10. Consistent Hashing Key: “www.google.com”, md5(“www.google.com”) → 14, Replication Factor = 3 (ring diagram: nodes at tokens 0, 10, 20, 30, 40, 50)
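The ring walk on the consistent hashing slides can be sketched in a few lines of Ruby. This is a toy model, not Cassandra's partitioner: the 60-position token space (`RING_SIZE`), the helper names `token_for` and `replicas_for_token`, and the "first node at or after the token, then the next nodes clockwise" placement rule are assumptions made for illustration.

```ruby
require "digest/md5"

# Toy ring from the slides: six nodes, identified by their tokens.
RING = [0, 10, 20, 30, 40, 50].freeze
RING_SIZE = 60 # assumed size of the toy token space

# Map a key to a position on the toy ring using MD5.
def token_for(key)
  Digest::MD5.hexdigest(key).to_i(16) % RING_SIZE
end

# Walk clockwise from the token: the first node whose token is >= the
# key's token takes the first replica, and the following nodes take the
# remaining ones. Wraps around to the start of the ring.
def replicas_for_token(token, rf, ring = RING)
  sorted = ring.sort
  start = sorted.index { |node| node >= token } || 0
  Array.new([rf, sorted.size].min) { |i| sorted[(start + i) % sorted.size] }
end

# The slides place the key at position 14 with a replication factor of 3.
p replicas_for_token(14, 3)
```

Under these assumptions a key at position 14 is stored on the nodes at 20, 30 and 40, and a key past the last token wraps around to the node at 0.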
  • 11. Clustering Client can talk to any node
  • 12. Scaling (RF = 2; ring diagram: nodes at tokens 0, 10, 20, 30, 50): the node at 50 owns the red portion
  • 13. Scaling (RF = 2; ring diagram: nodes at tokens 0, 10, 20, 30, 50): add a new node at 40
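A rough Ruby sketch of why adding a node is cheap: only the range next to the new token changes hands. It models primary ranges only (each node owns the span from the previous token up to its own) and ignores the extra ranges a node holds as a replica when RF = 2. The helper name `primary_ranges` is hypothetical.

```ruby
# Primary range per node: (previous token, own token], wrapping around
# at the start of the ring. Returns { token => [range_start, range_end] }.
def primary_ranges(tokens)
  sorted = tokens.sort
  sorted.each_with_index.map do |token, i|
    # For i == 0, sorted[-1] is the last token, which gives the wrap-around.
    [token, [sorted[i - 1], token]]
  end.to_h
end

before = primary_ranges([0, 10, 20, 30, 50])
after  = primary_ranges([0, 10, 20, 30, 40, 50])

# Nodes whose primary range differs once the node at 40 has joined.
changed = after.keys.reject { |token| before[token] == after[token] }
p changed
```

In this toy model only the new node at 40 and its neighbour at 50 are affected; every other node keeps its range.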
  • 15. Node Failures (RF = 2; ring diagram: nodes at tokens 0, 10, 20, 30, 40, 50, with the replicas labeled)
  • 17. Node Failures (RF = 2; ring diagram: nodes at tokens 0, 10, 20, 30, 40, 50)
  • 18. Consistency, Availability Consistency – Can I read stale data? Availability – Can I write/read at all? Tunable Consistency
  • 19. Consistency N = Total number of replicas R = Number of replicas read from – (before the response is returned) W = Number of replicas written to – (before the write is considered a success)
  • 20. Consistency N = Total number of replicas R = Number of replicas read from – (before the response is returned) W = Number of replicas written to – (before the write is considered a success) W + R > N gives strong consistency
  • 21. Consistency W + R > N gives strong consistency N=3 W=2 R=2 2 + 2 > 3 ==> strongly consistent
  • 22. Consistency W + R > N gives strong consistency N=3 W=2 R=2 2 + 2 > 3 ==> strongly consistent Only 2 of the 3 replicas must be available.
  • 23. Consistency Tunable Consistency – Specify N (Replication Factor) per data set – Specify R, W per operation
  • 24. Consistency Tunable Consistency – Specify N (Replication Factor) per data set – Specify R, W per operation – Quorum: N/2 + 1 • R = W = Quorum • Strong consistency • Tolerate the loss of N – Quorum replicas – R, W can also be 1 or N
  • 25. Availability Can tolerate the loss of: – N – R replicas for reads – N – W replicas for writes
  • 26. CAP Theorem. During node or network failure: (chart of Availability against Consistency, both axes running to 100%, with regions labeled “Possible” and “Not Possible”)
  • 27. CAP Theorem. During node or network failure: (same chart, with Cassandra marked on it)
  • 28. Clustering No single point of failure Replication that works Scales linearly – 2x nodes = 2x performance • For both reads and writes – Up to 100s of nodes – See “Netflix: 1 million writes/sec on AWS” Operationally simple Multi-Datacenter Replication
  • 29. Data Model Comes from Google BigTable Goals – Commodity Hardware • Spinning disks – Handle data sets much larger than memory • Minimize disk seeks – High throughput – Low latency – Durable
  • 30. Column Families Static – Object data – Similar to a table in a relational database Dynamic – Precomputed query results – Materialized views (these are just educational classifications)
  • 31. Static Column Families Users: zznate (password: *, name: Nate); driftx (password: *, name: Brandon); thobbs (password: *, name: Tyler); jbellis (password: *, name: Jonathan, site: riptano.com)
  • 32. Dynamic Column Families Rows – Each row has a unique primary key – Sorted list of (name, value) tuples • Like an ordered hash – The (name, value) tuple is called a “column”
  • 33. Dynamic Column Families Following: zznate (driftx:, thobbs:); driftx (no columns shown); thobbs (zznate:); jbellis (driftx:, mdennis:, pcmanus:, thobbs:, xedin:, zznate:)
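The "ordered hash" description of a row can be sketched as a small Ruby class that keeps its columns sorted by name on every write. `Row` and its methods are hypothetical names for illustration; this mirrors the idea on the slides, not Cassandra's storage engine.

```ruby
# One row of a dynamic column family: a list of [name, value] pairs
# that stays sorted by column name.
class Row
  def initialize
    @columns = []
  end

  # Insert or overwrite a column, keeping the list sorted by name.
  def set(name, value)
    i = @columns.bsearch_index { |(n, _)| n >= name } || @columns.size
    if i < @columns.size && @columns[i][0] == name
      @columns[i][1] = value
    else
      @columns.insert(i, [name, value])
    end
    self
  end

  # Column names in sorted order.
  def names
    @columns.map(&:first)
  end

  # Copy of the sorted [name, value] pairs.
  def columns
    @columns.map(&:dup)
  end
end

# The "Following" row for jbellis: column names are the followed users,
# values are left empty, as on the slide.
following = Row.new
%w[thobbs zznate driftx xedin mdennis pcmanus].each { |user| following.set(user, "") }
p following.names
```

Because the names come back sorted no matter the insertion order, a range of columns can be read as one contiguous run.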
  • 34. Dynamic Column Families Other Examples: – Timeline of tweets by a user – Timeline of tweets by all of the people a user is following – List of comments sorted by score – List of friends grouped by state
  • 35. The Data API RPC-based API – github.com/twitter/cassandra CQL (Cassandra Query Language) – code.google.com/a/apache-extras.org/p/cassandra-ruby/
  • 36. Inserting Data INSERT INTO users (KEY, “name”, “age”) VALUES (“thobbs”, “Tyler”, 24);
  • 37. Updating Data Updates are the same as inserts: INSERT INTO users (KEY, “age”) VALUES (“thobbs”, 34); Or UPDATE users SET “age” = 34 WHERE KEY = “thobbs”;
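"Updates are the same as inserts" means a write simply sets columns on a row and creates the row if it is missing. A minimal in-memory Ruby sketch of that behaviour follows; `ColumnFamily`, `upsert` and `get` are hypothetical stand-ins, not a real client API.

```ruby
# In-memory stand-in for a column family: row key => { column => value }.
class ColumnFamily
  def initialize
    @rows = {}
  end

  # Insert and update are the same operation: set the given columns,
  # creating the row if it does not exist, leaving other columns alone.
  def upsert(key, columns)
    (@rows[key] ||= {}).merge!(columns)
    self
  end

  # Copy of the row's columns, or an empty hash for an unknown key.
  def get(key)
    (@rows[key] || {}).dup
  end
end

users = ColumnFamily.new
users.upsert("thobbs", { "name" => "Tyler", "age" => 24 })
users.upsert("thobbs", { "age" => 34 }) # same call updates the age
p users.get("thobbs")
```

The second call overwrites only the "age" column; "name" is untouched.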
  • 38. Fetching Data Whole row select: SELECT * FROM users WHERE KEY = “thobbs”;
  • 39. Fetching Data Explicit column select: SELECT “name”, “age” FROM users WHERE KEY = “thobbs”;
  • 40. Fetching Data Get a slice of columns UPDATE letters SET 1=a, 2=b, 3=c, 4=d, 5=e WHERE KEY = “key”; SELECT 1..3 FROM letters WHERE KEY = “key”; Returns [(1, a), (2, b), (3, c)]
  • 41. Fetching Data Get a slice of columns SELECT FIRST 2 FROM letters WHERE KEY = “key”; Returns [(1, a), (2, b)] SELECT FIRST 2 REVERSED FROM letters WHERE KEY = “key”; Returns [(5, e), (4, d)]
  • 42. Fetching Data Get a slice of columns SELECT 3.. FROM letters WHERE KEY = “key”; Returns [(3, c), (4, d), (5, e)] SELECT FIRST 2 REVERSED 4.. FROM letters WHERE KEY = “key”; Returns [(4, d), (3, c)]
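The slice queries above can be modelled in Ruby over a sorted list of columns, which makes the roles of the range, FIRST and REVERSED easier to see. The function `slice` and its keyword names (`start`, `finish`, `limit`, `reversed`) are assumptions for this sketch. It treats the range start as the upper bound when reversed, which is what the slide's last example implies.

```ruby
# Columns of the "letters" row from the slides, sorted by column name.
LETTERS = [[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]].freeze

# start/finish bound the column names (nil means unbounded), limit caps
# the number of columns returned, reversed walks from the high end down.
def slice(columns, start: nil, finish: nil, limit: nil, reversed: false)
  ordered = reversed ? columns.reverse : columns
  selected = ordered.select do |name, _value|
    if reversed
      (start.nil? || name <= start) && (finish.nil? || name >= finish)
    else
      (start.nil? || name >= start) && (finish.nil? || name <= finish)
    end
  end
  limit ? selected.first(limit) : selected
end

p slice(LETTERS, start: 1, finish: 3)                  # SELECT 1..3
p slice(LETTERS, limit: 2)                             # SELECT FIRST 2
p slice(LETTERS, limit: 2, reversed: true)             # SELECT FIRST 2 REVERSED
p slice(LETTERS, start: 3)                             # SELECT 3..
p slice(LETTERS, start: 4, limit: 2, reversed: true)   # SELECT FIRST 2 REVERSED 4..
```

Each call corresponds to the query named in its comment and returns the same columns the slides list for it.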
  • 43. Deleting Data Delete a whole row: DELETE FROM users WHERE KEY = “thobbs”; Delete specific columns: DELETE “age” FROM users WHERE KEY = “thobbs”;
  • 44. Secondary Indexes Built-in basic indexes CREATE INDEX ageIndex ON users (age); SELECT name FROM USERS WHERE age = 24 AND state = “TX”;
  • 45. Performance Writes – 10k – 30k per second per node – Sub-millisecond latency Reads – 1k – 20k per second per node (depends on data set, caching – 0.1 to 10ms latency
  • 46. Other Features Distributed Counters – Can support millions of high-volume counters Excellent Multi-datacenter Support – Disaster recovery – Locality Hadoop Integration – Isolation of resources – Hive and Pig drivers Compression
  • 47. What Cassandra Can't Do Transactions – Unless you use a distributed lock – Atomicity, Isolation – These aren't needed as often as you'd think Limited support for ad-hoc queries – Know what you want to do with the data
  • 48. Not One-size-fits-all Use alongside an RDBMS
  • 49. Problems you shouldn't solve with C* Prototyping Distributed Locking Small datasets – (When you don't need availability) Complex graph processing – Shallow graph queries work well, though Fundamentally highly relational/transactional data
  • 50. The sweet spot for Cassandra Large dataset, low latency queries Simple to medium complexity queries – Key/value – Time series, ordered data – Lists, sets, maps High Availability
  • 51. The sweet spot for Cassandra Social – Texts, comments, check-ins, collaboration Activity – Feeds, timelines, clickstreams, logs, sensor data Metrics – Performance data over time – CloudKick, DataStax OpsCenter Text Search – Inbox search at Facebook
  • 52. ORMs Poor integration ORMs are not a natural fit for Cassandra – In C*, we mainly care about queries, not objects – Beyond simple K/V, abstraction breaks Suggestion: dont waste time with an ORM – C* will only be used for a specific subset of your data/queries – Use the C* API directly in your model
  • 53. Questions? Tyler Hobbs @tylhobbs tyler@datastax.com