Cassandra Data Model

Eben Hewitt's talk on Apache Cassandra's Data Model from Cassandra Summit in San Francisco.

Published in: Technology

  1. 1. the cassandra data model eben hewitt cassandra summit - san francisco 8.10.2010
  2. 3.
        • the data model
        • no sql?
        • design patterns
  3. 4. things we know about cassandra
        • it’s eventually consistent
        • it’s column-oriented
  4. 5. these things are wrong.
  5. 6. ?...
  6. 7. consistency (1)
        • eventual?
  7. 8. consistency (1)
        • tune-able?
  8. 9. PACELC
        • Daniel Abadi
        • if Partition
            • Trade some Consistency for Availability
        • Else (normal circumstances)
            • Balance Consistency & Latency
  9. 10. cassandra is consistent when read replica count + write replica count > replication factor
        • R + W > N
        • r = # of nodes consulted during a read op
        • w = # of replicas consulted during a write op
        • n = replication factor (# of replicas)
        • r + w > n = consistent
        • cassandra is consistent if you read & write at CL.QUORUM (once Q nodes are up); a good place to start
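The R + W > N rule on this slide is plain arithmetic, so it can be checked in a few lines. This is a minimal sketch: `quorum` and `is_consistent` are hypothetical helper names chosen for illustration, not part of any Cassandra client API.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1


def is_consistent(r: int, w: int, n: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n


n = 3                      # replication factor
q = quorum(n)              # 2 replicas form a quorum when n = 3
print(q, is_consistent(q, q, n))   # 2 True
print(is_consistent(1, 1, n))      # False: a read may miss the latest write
```

Reading and writing at quorum always satisfies the rule, because two majorities of the same replica set share at least one replica.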
  10. 11. cassandra is row-oriented (2)
        • each row is uniquely identifiable by key
        • rows group columns and super columns
  11. 12. column
        • atomic unit
        • name : value : timestamp
        • email : alison@foo.com : 12578123685
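The name : value : timestamp triple can be modelled as a small record. The sketch below assumes that the copy of a column with the newer timestamp wins when two versions are reconciled; `Column` and `reconcile` are hypothetical names, and the tie-breaking rule is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: bytes
    timestamp: int  # supplied by the client at write time


def reconcile(a: Column, b: Column) -> Column:
    """Pick the version with the newer timestamp (ties keep `a`, a simplification)."""
    if a.name != b.name:
        raise ValueError("can only reconcile two versions of the same column")
    return a if a.timestamp >= b.timestamp else b


old = Column("email", b"alison@foo.com", 12578123685)
new = Column("email", b"alison@bar.com", 12578123999)  # made-up later write
print(reconcile(old, new).value)  # b'alison@bar.com'
```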
  12. 13. column family
  13. 14. column family
        User {
          123 : { email: alison@foo.com,
                  img: },
          456 : { email: eben@bar.com,
                  username: The Situation }
        }
  14. 15. super column super columns group columns under a common name
  15. 16. super column family
  16. 17. about super column families
        • sub-column names inside a SCF are not indexed
            • top level columns (SCF Name) are always indexed
        • often used for denormalizing data from standard CFs
  17. 18. super column family
        PointOfInterest {
          key: 85255 {
            Phoenix Zoo { phone: 480-555-5555, desc: They have animals here. },
            Spring Training { phone: 623-333-3333, desc: Fun for baseball fans. },
          }, //end phx
          key: 10019 {
            Central Park { desc: Walk around. It's pretty. },
            Empire State Building { phone: 212-777-7777, desc: Great view from 102nd floor. }
          } //end nyc
        }
        diagram labels: key, super column family, super column, column, flexible schema
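The PointOfInterest layout maps naturally onto nested dictionaries: row key, then super column name, then sub-column name. This is an in-memory sketch of the shape only, not Cassandra client code; the helper names are hypothetical.

```python
# Row key -> super column name -> sub-column name -> value.
point_of_interest = {
    "85255": {
        "Phoenix Zoo": {"phone": "480-555-5555", "desc": "They have animals here."},
        "Spring Training": {"phone": "623-333-3333", "desc": "Fun for baseball fans."},
    },
    "10019": {
        "Central Park": {"desc": "Walk around. It's pretty."},
        "Empire State Building": {
            "phone": "212-777-7777",
            "desc": "Great view from 102nd floor.",
        },
    },
}


def super_column_names(row_key):
    """Names of the super columns stored under one row key, in sorted order."""
    return sorted(point_of_interest[row_key])


def sub_column(row_key, super_column, name, default=None):
    """One sub-column value; rows need not share the same sub-columns."""
    return point_of_interest[row_key][super_column].get(name, default)


print(super_column_names("10019"))                 # ['Central Park', 'Empire State Building']
print(sub_column("10019", "Central Park", "phone"))  # None: no phone column in this row
```

Central Park has no phone entry at all, which is the flexible schema the slide labels.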
  18. 19.
        • the data model
        • no sql?
        • design patterns
  19. 20. what about…?
        • SELECT WHERE
        • ORDER BY
        • JOIN ON
        • GROUP
  20. 21.
        • rdbms : domain-based model (what answers do I have?)
        • cassandra : query-based model (what questions do I have?)
  21. 22. Questions
        • Find hotels in a given area
        • Find information about a given hotel, such as its name and location
        • Find Points of Interest near a given hotel
        • Find an available room in a given date range
        • Find the rate and amenities for a room
        • Book the selected room by entering guest information
  22. 23. SELECT WHERE
        • <<cf>> USER
            • Key: UserID
            • Cols: username, email, birth date, city, state
        • To support a query like SELECT * FROM User WHERE city = ‘Austin’, create a new CF called UserCity:
        • <<cf>> USERCITY
            • Key: city
            • Cols: IDs of the users in that city.
        • Also uses the Valueless Column pattern.
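The User / UserCity pairing can be sketched with two in-memory dictionaries standing in for the two column families. The function names and sample values are hypothetical, and a real application would issue both writes through its Cassandra client.

```python
users = {}       # User CF: user id -> {column name: value}
user_city = {}   # UserCity CF: city -> {user id: empty value}


def insert_user(user_id, **columns):
    """Write the user row and keep the city index in step with it."""
    old = users.get(user_id)
    if old is not None and old.get("city") != columns.get("city"):
        # The user moved: drop the stale index entry.
        user_city.get(old.get("city"), {}).pop(user_id, None)
    users[user_id] = dict(columns)
    city = columns.get("city")
    if city is not None:
        user_city.setdefault(city, {})[user_id] = b""


def users_in_city(city):
    """Answer SELECT * FROM User WHERE city = ? using the index CF."""
    return [users[user_id] for user_id in sorted(user_city.get(city, {}))]


insert_user("123", username="alison", email="alison@foo.com", city="Austin", state="TX")
insert_user("456", username="eben", email="eben@bar.com", city="Dallas", state="TX")
insert_user("456", username="eben", email="eben@bar.com", city="Austin", state="TX")
print([u["username"] for u in users_in_city("Austin")])  # ['alison', 'eben']
```

The cost of the pattern is visible in `insert_user`: every write to the primary data also touches the index.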
  23. 24. SELECT WHERE pt 2
        • Use an aggregate row key
        • state:city: { user1, user2}
        • Get rows between TX: & TX; for all Texas users
        • Get rows between TX:Austin & TX:Austin1 for all Austin users
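The range trick works because ';' is the character immediately after ':' in ASCII, so every key that starts with "TX:" sorts between "TX:" and "TX;". The sketch below simulates key-ordered rows with a sorted list. It assumes an order-preserving layout and a half-open range (start inclusive, end exclusive); the sample rows are made up.

```python
import bisect

rows = {
    "AZ:Phoenix": ["user7"],
    "TX:Austin": ["user1", "user2"],
    "TX:Dallas": ["user3"],
    "UT:Provo": ["user9"],
}
sorted_keys = sorted(rows)  # stands in for rows stored in key order


def range_scan(start, end):
    """Keys k with start <= k < end, found by binary search."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]


print(range_scan("TX:", "TX;"))               # ['TX:Austin', 'TX:Dallas']
print(range_scan("TX:Austin", "TX:Austin1"))  # ['TX:Austin']
```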
  24. 25. ORDER BY
        • Rows
            • are placed according to their Partitioner:
                • Random: MD5 of key
                • Order-Preserving: actual key
            • are sorted by key, regardless of partitioner
        • Columns
            • are sorted according to CompareWith or CompareSubcolumnsWith
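The difference between the two partitioners can be sketched by computing a placement token for each key. This is a simplification for illustration: the token functions below are hypothetical, and the MD5 digest is treated as an unsigned integer, which may differ in detail from Cassandra's own token arithmetic.

```python
import hashlib


def random_token(key: str) -> int:
    """Placement token under a random partitioner: derived from the MD5 of the key."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def order_preserving_token(key: str) -> str:
    """Placement token under an order-preserving partitioner: the key itself."""
    return key


keys = ["TX:Austin", "TX:Dallas", "AZ:Phoenix"]
by_key = sorted(keys, key=order_preserving_token)
by_hash = sorted(keys, key=random_token)

print(by_key)   # ['AZ:Phoenix', 'TX:Austin', 'TX:Dallas']
print(by_hash)  # same keys, scattered by hash value
```

Hash placement spreads adjacent keys around the ring, which is why key range scans such as TX: to TX; need order-preserving placement.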
  25. 26. JOIN ON
        • “join” means “create a relationship”
            • rdbms : pay at runtime
            • cassandra : pay at design time
        • representing the relationship
            • rdbms : opaque, in query
            • cassandra : transparent, first-class citizen
  26. 27. GROUP
        • SELECT COUNT(*) from Hotel GROUP BY ZipCode
        • calculated column value
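One reading of "calculated column value" is that the aggregate is maintained at write time and stored like any other column, so the GROUP BY result becomes a lookup. The sketch below shows that idea with in-memory dictionaries; the names and sample hotels are hypothetical, and concurrent writers in a real cluster would need more care than this read-modify-write.

```python
from collections import defaultdict

hotels = {}                            # Hotel CF: hotel id -> columns
hotel_count_by_zip = defaultdict(int)  # calculated value, one per zip code


def insert_hotel(hotel_id, name, zip_code):
    """Store the hotel and update the per-zip count in the same step."""
    old = hotels.get(hotel_id)
    if old is not None:
        hotel_count_by_zip[old["zip"]] -= 1  # re-insert: undo the old count
    hotels[hotel_id] = {"name": name, "zip": zip_code}
    hotel_count_by_zip[zip_code] += 1


insert_hotel("h1", "Desert Inn", "85255")
insert_hotel("h2", "Cactus Lodge", "85255")
insert_hotel("h3", "Park View", "10019")
print(hotel_count_by_zip["85255"])  # 2
```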
  27. 28.
        • the data model
        • no sql?
        • design patterns
  28. 29. pattern 1: materialized view
  29. 30. materialized view
        • problem: You need to perform SELECT FROM WHERE queries.
        • solution: Create a new CF. Use the WHERE idea as the row key.
        • impacts: Must also write to the index every time you write to the primary CF. Or run as a job.
  30. 31. pattern 2: valueless column
  31. 32. valueless column
        • problem: Indexes require repeating columns from other column families.
        • solution: Treat the name of the column as the value. Use a byte[0] as the column ‘value’.
        • impacts: Only works with <= 2B columns in 0.7
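In a valueless column the column name carries the data and the value is an empty byte array. A minimal in-memory sketch, using `b""` as the Python analogue of the slide's byte[0]; the row and user ids are made up.

```python
EMPTY = b""  # stands in for byte[0]

# One index row, e.g. the UserCity row for a single city:
# column name = user id, column value = nothing.
index_row = {}
for user_id in ("user42", "user7", "user19"):
    index_row[user_id] = EMPTY

# Reading the index means reading the column names, never the values.
member_ids = sorted(index_row)
print(member_ids)                  # ['user19', 'user42', 'user7']
print(set(index_row.values()))     # {b''}
```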
  32. 33. pattern 3: composite key
  33. 34. composite key
        • problem: Keys must support references and uniqueness.
        • solution: Fuse multiple values with a separator.
        • impacts:
            • Can substitute for Super Column.
            • Use a Custom Comparator if necessary.
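Fusing values with a separator needs one guard: the separator must not appear inside any part, or the key cannot be split back unambiguously. A minimal sketch, with ':' as the separator (as in the state:city example) and hypothetical helper names.

```python
SEPARATOR = ":"


def make_key(*parts: str) -> str:
    """Fuse several values into one key, rejecting parts that contain the separator."""
    for part in parts:
        if SEPARATOR in part:
            raise ValueError(f"part {part!r} contains the separator {SEPARATOR!r}")
    return SEPARATOR.join(parts)


def split_key(key: str) -> list:
    """Recover the original parts from a composite key."""
    return key.split(SEPARATOR)


key = make_key("TX", "Austin")
print(key)             # TX:Austin
print(split_key(key))  # ['TX', 'Austin']
```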
  34. 35. pattern 4: semantic key
  35. 36. semantic key
        • problem: Keys must be unique. You use OPP.
        • solution: Use a key that is meaningful in your application.
        • impacts: Harder to get right. Can proliferate Indexes. Range scans over keys less efficient.
  36. 37. pattern 6: client clock sync
  37. 38. client clock sync
        • problem: You need to keep clocks on different clients synchronized to support read repair.
        • solution:
            • System.nanoTime() used in StorageProxy
            • NTP.
        • impacts: Consider geographic dispersion.
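Why clock sync matters can be shown with two writes and a skewed clock. The sketch assumes that conflicting versions are resolved by highest client-supplied timestamp; the function name and all numbers are made up for illustration.

```python
def last_write_wins(versions):
    """Return the (value, timestamp) pair with the highest timestamp."""
    return max(versions, key=lambda version: version[1])


# All times are in microseconds.
real_time_of_first_write = 1_000_000
real_time_of_second_write = 1_000_500   # 500 us later in real time

clock_skew_a = 0        # client A's clock is accurate
clock_skew_b = -2_000   # client B's clock runs 2 ms behind

write_a = ("alison@foo.com", real_time_of_first_write + clock_skew_a)
write_b = ("alison@bar.com", real_time_of_second_write + clock_skew_b)

winner = last_write_wins([write_a, write_b])
print(winner[0])  # alison@foo.com: the later write is lost because B's clock is slow
```

With both skews at zero the second write would win, which is the behaviour the clients expect.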
  38. 39. EXAMPLE
  39. 42. is cassandra a good fit?
        • very fast writes
        • need to be always writeable
        • lots of data
        • evolving schema
  40. 43. custom comparators
        • i want to compare with
            • float
            • lat/long
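The need for a float comparator shows up as soon as numeric column names are compared as text. The sketch below contrasts the two orderings in plain Python; `compare_as_float` is a hypothetical stand-in for a custom comparator, not Cassandra's comparator interface.

```python
from functools import cmp_to_key


def compare_as_float(a: str, b: str) -> int:
    """Three-way comparison of two column names by numeric value."""
    fa, fb = float(a), float(b)
    return (fa > fb) - (fa < fb)


names = ["10.5", "9.25", "100.0"]
print(sorted(names))                                    # ['10.5', '100.0', '9.25']
print(sorted(names, key=cmp_to_key(compare_as_float)))  # ['9.25', '10.5', '100.0']
```

A lat/long comparator would follow the same shape, parsing each name into a pair of numbers before comparing.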
  41. 44. choose keys carefully
        • key-based routing to find data
        • queries are executed by key
        • keys need to be easily discoverable
        • you rely on keys for referential integrity
  42. 45. new use cases
        • geographic data
        • weather data
        • rfid
        • travel schedules
        • data services in soa
        • hotel reservations
        • CEP
  43. 46. see also
        • http://dfeatherston.com/cassandra-adf-uiuc-su10.pdf
        • http://github.com/dietrichf/brireme
  44. 47.
        • @ebenhewitt
        • cassandraguide.com
