Cassandra, Modeling and Availability at AMUG

A brief, high-level comparison of modeling between relational databases and Cassandra, followed by a brief description of how Cassandra achieves global availability.

Transcript

  • 1. Conceptual Modeling Differences From A RDBMS. Matthew F. Dennis, DataStax // @mdennis. Austin MySQL User Group, January 11, 2012
  • 2. Cassandra Is Not Relational: get out of the relational mindset when working with Cassandra (or really any NoSQL DB)
  • 3. Work Backwards From Queries: think in terms of queries, not in terms of normalizing the data; in fact, you often want to denormalize (already common in the data warehousing world, even in RDBMS)
  • 4. OK great, but how do I do that? Well, you need to know how Cassandra models data (e.g. Google BigTable): research.google.com/archive/bigtable-osdi06.pdf. Go Read It!
  • 5. In Cassandra:
    ➔ data is organized into Keyspaces (usually one per app)
    ➔ each Keyspace can have multiple Column Families
    ➔ each Column Family can have many Rows
    ➔ each Row has a Row Key and a variable number of Columns
    ➔ each Column consists of a Name, Value and Timestamp
  • 6. In Cassandra, Keyspaces:
    ➔ are similar in concept to a “database” in some RDBMSs
    ➔ are stored in separate directories on disk
    ➔ are usually one-to-one with applications
    ➔ are usually the administrative unit for things related to ops
    ➔ contain multiple column families
  • 7. In Cassandra, In Keyspaces, Column Families:
    ➔ are similar in concept to a “table” in most RDBMSs
    ➔ are stored in separate files on disk (many per CF)
    ➔ are usually approximately one-to-one with query type
    ➔ are usually the administrative unit for things related to your data
    ➔ can contain many (~a billion* per node) rows
    * for a good sized node (you can always add nodes)
  • 8. In Cassandra, In Keyspaces, In Column Families ...
  • 9-12. Example Column Family (the same three rows, highlighting Row Keys, Columns, Column Names and Column Values):
        thepaul → office: Austin | OS: OSX | twitter: thepaul0
        mdennis → office: UA | OS: Linux | twitter: mdennis
        thobbs  → office: Austin | twitter: tylhobbs
  • 13. Rows Are Randomly Ordered (if using the RandomPartitioner)
  • 14. Columns Are Ordered by Name (by a configurable comparator)
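    A minimal sketch, in plain Python, of the structure slides 5-14 describe (purely illustrative; this is not how Cassandra lays data out internally): a Keyspace holding a Column Family, rows addressed by Row Key, and columns that are name/value/timestamp entries kept ordered by name.

        # Keyspace -> Column Family -> Row Key -> {column name: (value, timestamp)}
        keyspace = {
            "users": {                                    # a Column Family
                "thepaul": {                              # Row Key -> columns
                    "OS":      ("OSX",      1326300000),
                    "office":  ("Austin",   1326300000),
                    "twitter": ("thepaul0", 1326300000),
                },
                "mdennis": {
                    "OS":      ("Linux",    1326300001),
                    "office":  ("UA",       1326300001),
                    "twitter": ("mdennis",  1326300001),
                },
                "thobbs": {                               # rows need not share the same columns
                    "office":  ("Austin",   1326300002),
                    "twitter": ("tylhobbs", 1326300002),
                },
            },
        }

        # Columns within a row are ordered by name (here, simply sorted on access).
        for name in sorted(keyspace["users"]["thepaul"]):
            value, ts = keyspace["users"]["thepaul"][name]
            print(name, value, ts)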
  • 15. Columns are ordered because doing so allows very efficient implementations of useful and common operations (e.g. merge joins)
  • 16. In particular, within a row I can find given columns by name very quickly (ordered names => log(n) binary search).
  • 17. More importantly, I can query for a slice of columns between a start and an end within a row (diagram: a row RK with columns ts0 … tsM … tsN, the slice bounded by start and end column names)
  • 18. Why does that matter? Because columns within a row aren't static!
  • 19. The Column Name Can Be Part of Your Data (see the sketch below):
        INTC → ts0: $25.20 | ts1: $25.25 | ...
        AMR  → ts0: $6.20 | ts9: $0.26 | ...
        CRDS → ts0: $1.05 | ts5: $6.82 | ...
    Columns Are Ordered by Name (in this case by a TimeUUID Comparator)
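    A small, hedged sketch of why the ordering matters for this pattern: if column names are timestamps kept in sorted order, a start/end slice within one row is just two binary searches plus a contiguous read. The data and the slice_columns helper below are invented for illustration.

        from bisect import bisect_left, bisect_right

        # One wide row per ticker: column name = tick timestamp, column value = price.
        ticks = {
            "INTC": [(0, 25.20), (1, 25.25), (2, 25.23)],
            "AMR":  [(0, 6.20), (9, 0.26)],
        }

        def slice_columns(row, start_ts, end_ts):
            """Return the columns whose names (timestamps) fall in [start_ts, end_ts]."""
            names = [name for name, _ in row]              # already sorted by name
            lo = bisect_left(names, start_ts)
            hi = bisect_right(names, end_ts)
            return row[lo:hi]                               # contiguous slice, no scan

        print(slice_columns(ticks["INTC"], 1, 2))           # [(1, 25.25), (2, 25.23)]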
  • 20. Turns Out That Pattern Comes Up A Lot:
    ➔ stock ticks
    ➔ event logs
    ➔ ad clicks/views
    ➔ sensor records
    ➔ access/error logs
    ➔ plane/truck/person/”entity” locations
    ➔ …
  • 21. OK, but I can do that in SQL. Not efficiently at scale, at least not easily ...
  • 22. How it Looks In a RDBMS: a single table (ticker, timestamp, bid, ask, ...) in which the rows I care about (e.g. every AMR tick) are interleaved with every other ticker's rows
  • 23. How it Looks In a RDBMS: consecutive AMR rows end up farther apart than your page size, so reading them back means disk seeks
  • 24. OK, but what about ...
    ➔ PostgreSQL Cluster Command?
    ➔ MySQL Cluster Indexes?
    ➔ Oracle Index Organized Tables?
    ➔ SQLServer Clustered Index?
  • 25. OK, but what about ...
    ➔ PostgreSQL Cluster Using? Meh ...
    ➔ MySQL [InnoDB] Cluster Indexes?
    ➔ Oracle Index Organized Table?
    ➔ SQLServer Clustered Index? (seriously, who uses SQLServer?!)
  • 26. The on-disk management of that clustering results in tons of IO … In the case of PostgreSQL:
    ➔ clustering is a one-time operation (implies you must periodically rewrite the entire table)
    ➔ new data is *not* written in clustered order (which is often the data you care most about)
  • 27. OK, so just partition the tables ...
  • 28. Not a bad idea, except in MySQL there is a limit of 1024 partitions and generally less if using NDB (you should probably still do it if using MySQL though) http://dev.mysql.com/doc/refman/5.5/en/partitioning-limitations.html
  • 29. OK fine, I agree storing data that is queried together on disk together is a good thing, but what's that have to do with modeling? (diagram: seek to the start of row RK, then read precisely my data from the column slice) ** more on some caveats later
  • 30. Well, that's what is meant by “work backwards from your queries” or “think in terms of queries” (NB: this concept, in general, applies to RDBMS at scale as well; it is not specific to Cassandra)
  • 31. An Example From Fraud Detection: to calculate risk it is common to need to know all the emails, destinations, origins, devices, locations, phone numbers, et cetera ever used for the account in question
  • 32. In a normalized model that usually translates to a table for each type of entity being tracked (diagram: separate id/name, id/device, id/dest, id/email and id/origin tables)
  • 33. The problem is that at scale that also means a disk seek for each one … (even for perfect IOTs et al., if across multiple tables)
    ➔ Previous emails? That's a seek …
    ➔ Previous devices? That's a seek …
    ➔ Previous destinations? That's a seek ...
  • 34. But In Cassandra I Store The Data I Query Together On Disk Together (remember, column names need not be static; a sketch follows below). Diagram: one row per account (acctX, acctY, acctZ) holding one column per entity ever used, e.g. Column Name email:cassandra@mailinator.com with Column Value dateEmailWasLastUsed
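    A hedged sketch of that single-wide-row layout in plain Python (the account IDs, column names, and the entities_for helper are invented for illustration): one row per account, one column per entity ever seen, with the column name encoding the entity and the column value holding when it was last used, so one row read answers every "what has this account used before?" question.

        # One row per account; column name = "<type>:<value>", column value = last-used date.
        risk_profile = {
            "acctX": {
                "dest:USA":                       "2012-01-09",
                "dev:0xb33f":                     "2012-01-02",
                "dev:0xdead":                     "2012-01-10",
                "email:cassandra@mailinator.com": "2012-01-11",
                "orig:Finland":                   "2011-12-30",
            },
        }

        def entities_for(account_id, prefix=None):
            """One row lookup returns everything; an optional name prefix
            ('email:', 'dev:', ...) narrows it to a single entity type."""
            row = risk_profile.get(account_id, {})
            return {name: last_used for name, last_used in sorted(row.items())
                    if prefix is None or name.startswith(prefix)}

        print(entities_for("acctX", "dev:"))   # {'dev:0xb33f': ..., 'dev:0xdead': ...}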
  • 35. Don't treat Cassandra (or any DB) as a black box:
    ➔ understand how your DBs (and data structures) work
    ➔ understand the building blocks they provide
    ➔ understand the work complexity (“big O”) of queries
    ➔ for data sets > memory, the goal is to minimize seeks
    ** on a related note, SSDs are awesome
  • 36. Q? (then brief intermission)
  • 37. Availability Has Many Levels:
    ➔ Component Failure (disk)
    ➔ Machine Failure (NIC, CPU, power supply)
    ➔ Site Failure (UPS, power grid, tornado)
    ➔ Political Failure (war, coup)
  • 38. The Common Theme In The Solutions? Replication
  • 39. Replication In Cassandra Follows The Dynamo Model: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html (Read It!)
  • 40. Every Node Has A Token in the range 0 to 2^127 (ring diagram: tokens t0 < t1 < t2 < t3 < 2^127)
  • 41. Row Key Determines Node(s): MD5(RK) => T (ring diagram: here t3 < T < 2^127)
  • 42. Row Key Determines Node: MD5(RK) => T, with t3 < T < 2^127, so the next node on the ring (t0) holds the First Replica
  • 43. Walk The Ring To Find Subsequent Replicas* (first replica at t0, second replica at the next node, t1; a sketch follows below) * by default
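    A rough sketch, in plain Python, of the placement rule slides 40-43 describe (the token values and node names are made up, and real Cassandra also folds in the replication strategy and data-center awareness): hash the row key onto the ring, the next token clockwise owns the first replica, and subsequent replicas come from walking the ring.

        import hashlib
        from bisect import bisect_left

        # Hypothetical ring: four nodes with tokens in 0 .. 2^127.
        RING = [(0, "node-t0"),
                (2**125, "node-t1"),
                (2**126, "node-t2"),
                (2**126 + 2**125, "node-t3")]

        def replicas_for(row_key, replication_factor=2):
            token = int(hashlib.md5(row_key.encode()).hexdigest(), 16) % 2**127  # MD5(RK) => T
            tokens = [t for t, _ in RING]
            start = bisect_left(tokens, token) % len(RING)   # first node with token >= T, wrapping
            return [RING[(start + i) % len(RING)][1] for i in range(replication_factor)]

        print(replicas_for("thepaul"))   # e.g. ['node-t1', 'node-t2']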
  • 44. Writes Happen In Parallel To All Replicas: the client sends the write (RK = ...) to a coordinator (not a master), which forwards it to the first and second replicas in parallel
  • 45. Some Or All Replicas Respond: the coordinator waits for ack(s) (“ok”) from the destination node(s)
  • 46. The Coordinator Responds To Client: once the ack(s) it was waiting for have arrived, it returns “ok” to the client (a sketch follows below)
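    A hedged, simplified sketch of the coordinator behavior on slides 44-46 (the replica objects and their .write method are stand-ins, not a real API): the write goes to every replica in parallel, and the coordinator answers the client as soon as enough acks arrive, while the remaining replicas keep applying the write.

        from concurrent.futures import ThreadPoolExecutor, as_completed

        def coordinate_write(replicas, row_key, columns, required_acks):
            # Fire the write at every replica in parallel (there is no master).
            pool = ThreadPoolExecutor(max_workers=len(replicas))
            futures = [pool.submit(r.write, row_key, columns) for r in replicas]
            acks = 0
            for f in as_completed(futures):
                if f.exception() is None:
                    acks += 1
                if acks >= required_acks:
                    pool.shutdown(wait=False)   # stragglers keep applying the write
                    return "ok"                 # respond to the client now
            pool.shutdown(wait=False)
            raise RuntimeError("only %d of %d required acks" % (acks, required_acks))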
  • 47. What Nodes Can Be A Coordinator? The coordinator for any given read or write is really just whatever node the client connected to for that request: any node, for any request, at any time
  • 48. How Many Replicas Does The Coordinator Wait For?
    ➔ configurable, per query (see the sketch below)
    ➔ ONE / QUORUM are the most common
    ➔ (more on this in a moment)
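    For concreteness, here is one way to set the consistency level per query using the DataStax Python driver and CQL (both post-date this talk; the keyspace and table names are made up, and a cluster is assumed to be reachable on localhost):

        from datetime import datetime
        from cassandra import ConsistencyLevel
        from cassandra.cluster import Cluster
        from cassandra.query import SimpleStatement

        session = Cluster(["127.0.0.1"]).connect("ticks_ks")

        fast_write = SimpleStatement(
            "INSERT INTO ticks (ticker, ts, bid) VALUES (%s, %s, %s)",
            consistency_level=ConsistencyLevel.ONE)        # ack from any single replica

        consistent_read = SimpleStatement(
            "SELECT ts, bid FROM ticks WHERE ticker = %s",
            consistency_level=ConsistencyLevel.QUORUM)     # majority => overlap => consistent

        session.execute(fast_write, ("AMR", datetime(2012, 1, 11, 9, 30), 6.20))
        rows = session.execute(consistent_read, ("AMR",))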
  • 49-50. Writing At CL.One: with one replica down (X), the coordinator waits for at least one node to ack (“ok”) before answering the client (eventually all nodes get the update)
  • 51-52. Reading At CL.One: the coordinator waits for at least one node, so you might read stale data (“old”)
  • 53-54. Writing At CL.Quorum: the coordinator waits for a majority of replicas to ack (“ok”), even with one replica down (eventually all nodes get the update)
  • 55-56. Reading At CL.Quorum: the coordinator waits for a majority of nodes (majority => overlap => consistent); here one replica is down, one returns current data and one returns “old” data, and the coordinator chooses the response to the client based on the client-supplied per-column timestamps
  • 57. Reading At CL.Quorum: the coordinator already has its response; Read Repair then updates the stale node(s) with the current data
  • 58. On A Side Note, A Lost Response (diagram: the replica's “ok” back to the coordinator is lost) ...
  • 59. ... Is The Same As A Lost Request (the write, RK = ..., never reaches the replica)* * In Regards To Meeting Consistency
  • 60. ... Which Is The Same As A Failed/Slow Node* * In Regards To Meeting Consistency
  • 61. In fact, it is actually impossible for the originator to reliably distinguish between the 3
  • 62. One More Important Piece: writes are idempotent ** except with the counter API, but if you want that it can be done
  • 63. Why is that important? It means we can replay/retry writes, even late and/or out of order, and get the same results (see the sketch below):
    ➔ after/during node failures
    ➔ after/during network partitions
    ➔ after/during upgrades
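    A toy sketch in plain Python (not Cassandra code) of why replaying is safe: each column write carries a client-supplied timestamp and the highest timestamp wins per column, so applying the same writes out of order, or more than once, converges to the same row.

        def apply_write(row, column, value, timestamp):
            # Last-write-wins on the client-supplied timestamp, per column.
            current = row.get(column)
            if current is None or timestamp > current[1]:
                row[column] = (value, timestamp)

        writes = [("office", "Austin", 10), ("office", "UA", 12), ("OS", "Linux", 11)]

        replica_a, replica_b = {}, {}
        for w in writes:                      # delivered in order, exactly once
            apply_write(replica_a, *w)
        for w in reversed(writes * 2):        # delivered twice, out of order
            apply_write(replica_b, *w)

        assert replica_a == replica_b         # both converge to the same columns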
  • 64. In other words, you can concurrently issue conflicting updates to two different nodes while those nodes have no communication between them
  • 65. Which is important because ...
  • 66. Availability Has Many Levels:
    ➔ Component Failure (disk)
    ➔ Machine Failure (NIC, CPU, power supply)
    ➔ Site Failure (UPS, power grid, tornado)
    ➔ Political Failure (war, coup)
  • 67. If you care about global availability you must serve reads and writes from multiple data centers. There is no way around this.
  • 68. Q? Conceptual Modeling Differences From A RDBMS. Matthew F. Dennis, DataStax // @mdennis
  • 69. A Brief Rant On Query Planners, Garbage Collectors, Virtual Memory, Automatic Transmissions and Data Structures
