
Cassandra Data Model

Eben Hewitt's talk on Apache Cassandra's Data Model from Cassandra Summit in San Francisco.

Published in: Technology

  1. 1. the cassandra data model
       eben hewitt
       cassandra summit - san francisco, 8.10.2010
  2. 3.
       • the data model
       • no sql?
       • design patterns
  3. 4. things we know about cassandra
       • it’s eventually consistent
       • it’s column-oriented
  4. 5. these things are wrong.
  5. 6. ?...
  6. 7. (1) consistency
       • eventual?
  7. 8. (1) consistency
       • tune-able?
  8. 9. PACELC
       • Daniel Abadi
       • if Partition
           • Trade some Consistency for Availability
       • Else (normal circumstances)
           • Balance Consistency & Latency
  9. 10. cassandra is consistent when read replica count + write replica count > replication factor
       • R + W > N
       • r = # of nodes consulted during a read op
       • w = # of replicas consulted during a write op
       • n = # of replicas (the replication factor)
       • r + w > n means consistent
       • cassandra is consistent if you read & write at CL.QUORUM (once Q nodes are up); this is a good place to start
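The R + W > N rule from this slide is plain arithmetic, so it can be checked in a few lines. This is a minimal sketch in Python; the function names `quorum` and `is_consistent` are illustrative only and are not part of any Cassandra client API.

```python
def quorum(n: int) -> int:
    """Size of a majority of n replicas, which is what a quorum level requires."""
    return n // 2 + 1


def is_consistent(r: int, w: int, n: int) -> bool:
    """True when every read set must overlap every write set: R + W > N."""
    return r + w > n


n = 3                           # replication factor
q = quorum(n)                   # majority of 3 replicas
print(is_consistent(q, q, n))   # quorum reads + quorum writes
print(is_consistent(1, 1, n))   # one replica for reads and one for writes
print(is_consistent(1, n, n))   # read one, write all
```

Reading and writing at quorum satisfies the rule for any N, because two majorities of the same replica set always share at least one replica.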
  10. 11. (2) cassandra is row-oriented
        • each row is uniquely identifiable by key
        • rows group columns and super columns
  11. 12. column
        • atomic unit
        • name : value : timestamp
        • email : alison@foo.com : 12578123685
  12. 13. column family
  13. 14. column family
        User {
          123 : { email: alison@foo.com,
                  img: },
          456 : { email: eben@bar.com,
                  username: The Situation }
        }
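One way to picture the column and column family slides is as nested maps: a row key maps to a set of named columns, and each column carries a name, a value and a timestamp. The sketch below is an in-memory illustration only, not how Cassandra stores data. The `Column` class and `get` helper are hypothetical, the empty `img` column is left out, and the timestamps on row 456 are assumed (the slides give one timestamp, for alison's email).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """The atomic unit: name : value : timestamp."""
    name: str
    value: str
    timestamp: int


# A column family modeled as: row key -> {column name -> Column}.
user_cf = {
    "123": {"email": Column("email", "alison@foo.com", 12578123685)},
    "456": {
        # Timestamps below are assumed for illustration.
        "email": Column("email", "eben@bar.com", 12578123685),
        "username": Column("username", "The Situation", 12578123685),
    },
}


def get(cf, row_key, column_name):
    """Look up one column by row key and column name; None if absent."""
    return cf.get(row_key, {}).get(column_name)


# Rows in the same column family need not have the same columns.
print(sorted(user_cf["123"]))
print(sorted(user_cf["456"]))
print(get(user_cf, "123", "username"))
```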
  14. 15. super column
        super columns group columns under a common name
  15. 16. super column family
  16. 17. about super column families
        • sub-column names inside a SCF are not indexed
            • top level columns (SCF Name) are always indexed
        • often used for denormalizing data from standard CFs
  17. 18. super column family
        PointOfInterest {
          key: 85255 {
            Phoenix Zoo { phone: 480-555-5555, desc: They have animals here. },
            Spring Training { phone: 623-333-3333, desc: Fun for baseball fans. },
          }, //end phx
          key: 10019 {
            Central Park { desc: Walk around. It's pretty. },
            Empire State Building { phone: 212-777-7777, desc: Great view from 102nd floor. }
          } //end nyc
        }
        [diagram labels: key, column, super column, super column family, flexible schema]
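The PointOfInterest example adds one more level of nesting: row key, then super column name, then sub-columns. A minimal Python sketch of that shape, using the data from the slide; the helper names `super_columns` and `sub_column` are made up for illustration.

```python
# Super column family: row key -> super column name -> {sub-column name: value}
point_of_interest = {
    "85255": {
        "Phoenix Zoo": {"phone": "480-555-5555", "desc": "They have animals here."},
        "Spring Training": {"phone": "623-333-3333", "desc": "Fun for baseball fans."},
    },
    "10019": {
        "Central Park": {"desc": "Walk around. It's pretty."},
        "Empire State Building": {
            "phone": "212-777-7777",
            "desc": "Great view from 102nd floor.",
        },
    },
}


def super_columns(scf, row_key):
    """Names of the super columns in one row, in sorted order."""
    return sorted(scf.get(row_key, {}))


def sub_column(scf, row_key, super_name, name):
    """Value of one sub-column, or None when it is not present."""
    return scf.get(row_key, {}).get(super_name, {}).get(name)


print(super_columns(point_of_interest, "10019"))
# Flexible schema: Central Park simply has no phone sub-column.
print(sub_column(point_of_interest, "10019", "Central Park", "phone"))
```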
  18. 19.
        • the data model
        • no sql?
        • design patterns
  19. 20. what about…?
        • SELECT WHERE
        • ORDER BY
        • JOIN ON
        • GROUP
  20. 21.
        • rdbms : domain-based model. what answers do I have?
        • cassandra : query-based model. what questions do I have?
  21. 22. Questions
        • Find hotels in a given area
        • Find information about a given hotel, such as its name and location
        • Find Points of Interest near a given hotel
        • Find an available room in a given date range
        • Find the rate and amenities for a room
        • Book the selected room by entering guest information
  22. 23. SELECT WHERE
        <<cf>> USER
          Key: UserID
          Cols: username, email, birth date, city, state
        To support a query like SELECT * FROM User WHERE city = 'Austin', create a new CF called UserCity:
        <<cf>> USERCITY
          Key: city
          Cols: IDs of the users in that city.
        Also uses the Valueless Column pattern.
  23. 24. SELECT WHERE pt 2
        • Use an aggregate row key
        • state:city: { user1, user2 }
        • Get rows between TX: & TX; for all Texas users
        • Get rows between TX:Austin & TX:Austin1 for all Austin users
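The aggregate row key trick works because `;` is the character immediately after `:` in ASCII, so every key starting with `TX:` sorts between `TX:` and `TX;`. The sketch below simulates an ordered key scan with a sorted Python list. It assumes keys are stored in key order (as with an order-preserving partitioner) and uses a half-open range, which may differ from the range semantics of a real client. The sample rows other than the Austin users are invented.

```python
from bisect import bisect_left

# state:city -> user IDs; sample data is illustrative.
rows = {
    "CA:Fresno": ["user9"],
    "TX:Austin": ["user1", "user2"],
    "TX:Dallas": ["user3"],
    "TX:Houston": ["user4"],
}
sorted_keys = sorted(rows)


def key_range(start, end):
    """Return keys k with start <= k < end, in key order."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]


texas = key_range("TX:", "TX;")                 # all Texas rows
austin = key_range("TX:Austin", "TX:Austin1")   # just the Austin row
print(texas)
print(austin)
```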
  24. 25. ORDER BY
        • Rows
            • are placed according to their Partitioner:
                • Random: MD5 of key
                • Order-Preserving: actual key
            • are sorted by key, regardless of partitioner
        • Columns
            • are sorted according to CompareWith or CompareSubcolumnsWith
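To see why the partitioner matters for ordering, compare placing keys by an MD5-derived token with placing them by the key itself. This is a rough Python sketch: `random_token` only illustrates the idea of hashing the key with MD5 and is not the exact token computation Cassandra performs, and the sample keys are invented.

```python
import hashlib


def random_token(key: str) -> int:
    """Token derived from the MD5 digest of the key (random-style placement)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def order_preserving_token(key: str) -> str:
    """Order-preserving placement uses the actual key."""
    return key


keys = ["TX:Austin", "TX:Dallas", "CA:Fresno"]

by_key = sorted(keys, key=order_preserving_token)   # neighbours stay together
by_md5 = sorted(keys, key=random_token)             # order unrelated to the key text

print(by_key)
print(by_md5)
```

With hashed placement, adjacent keys scatter across the token space, which spreads load evenly but makes key range scans meaningless; placement by actual key keeps ranges contiguous.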
  25. 26. JOIN ON
        • “join” means “create a relationship”
            • rdbms : pay at runtime
            • cassandra : pay at design time
        • representing the relationship
            • rdbms : opaque, in query
            • cassandra : transparent, first-class citizen
  26. 27. GROUP
        • SELECT COUNT(*) FROM Hotel GROUP BY ZipCode
        • calculated column value
  27. 28.
        • the data model
        • no sql?
        • design patterns
  28. 29. 1. materialized view
  29. 30. materialized view
        • problem: You need to perform SELECT FROM WHERE queries.
        • solution: Create a new CF. Use the WHERE criterion as the row key.
        • impacts: You must also write to the index every time you write to the primary CF, or rebuild it with a periodic job.
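The materialized view pattern amounts to a second write on every update. Below is a minimal in-memory sketch in Python, reusing the User and UserCity column families from the SELECT WHERE slide. The functions `write_user` and `users_in` are hypothetical, and the dictionaries stand in for column families rather than for any client API. The empty byte string follows the valueless column pattern the slides mention.

```python
user_cf = {}        # primary CF: user ID -> columns
user_city_cf = {}   # view CF: city -> {user ID: b""}


def write_user(user_id, columns):
    """Write to the primary CF and keep the UserCity view in step."""
    old_city = user_cf.get(user_id, {}).get("city")
    new_city = columns.get("city")
    if old_city is not None and old_city != new_city:
        # The user moved: drop the stale entry from the old city's row.
        user_city_cf[old_city].pop(user_id, None)
    user_cf[user_id] = dict(columns)
    if new_city is not None:
        user_city_cf.setdefault(new_city, {})[user_id] = b""


def users_in(city):
    """Answer SELECT * FROM User WHERE city = ? with one row lookup."""
    return sorted(user_city_cf.get(city, {}))


write_user("123", {"username": "alison", "city": "Austin"})
write_user("456", {"username": "eben", "city": "Austin"})
write_user("456", {"username": "eben", "city": "Dallas"})
print(users_in("Austin"))
print(users_in("Dallas"))
```

The two writes here are not atomic, which is the cost the "impacts" bullet points at: the application owns keeping the view and the primary data in agreement.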
  30. 31. 2. valueless column
  31. 32. valueless column
        • problem: Indexes require repeating columns from other column families.
        • solution: Treat the name of the column as the value. Use a byte[0] as the column ‘value’.
        • impacts: Only works with <= 2B columns in 0.7
  32. 33. 3. composite key
  33. 34. composite key
        • problem: Keys must support references and uniqueness.
        • solution: Fuse multiple values with a separator.
        • impacts: Can substitute for Super Column. Use a Custom Comparator if necessary.
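Fusing values with a separator is simple, but the separator must never appear inside a part or the key can no longer be split unambiguously. A small Python sketch under that assumption; `SEPARATOR`, `make_key` and `split_key` are illustrative names, and `:` is chosen only because the slides use it in `state:city`.

```python
SEPARATOR = ":"


def make_key(*parts: str) -> str:
    """Fuse several values into one key, rejecting parts that contain the separator."""
    for part in parts:
        if SEPARATOR in part:
            raise ValueError(f"part {part!r} contains the separator {SEPARATOR!r}")
    return SEPARATOR.join(parts)


def split_key(key: str) -> list:
    """Recover the original parts from a composite key."""
    return key.split(SEPARATOR)


key = make_key("TX", "Austin")
print(key)
print(split_key(key))
```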
  34. 35. 4. semantic key
  35. 36. semantic key
        • problem: Keys must be unique. You use OPP (the order-preserving partitioner).
        • solution: Use a key that is meaningful in your application.
        • impacts: Harder to get right. Can proliferate Indexes. Range scans over keys less efficient.
  36. 37. 6. client clock sync
  37. 38. client clock sync
        • problem: You need to keep clocks on different clients synchronized to support read repair.
        • solution: System.nanoTime() used in StorageProxy. NTP.
        • impacts: Consider geographic dispersion.
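Clock sync matters because each column carries a timestamp and, when replicas disagree, the version with the newer timestamp is the one that survives. The Python sketch below illustrates that last-write-wins comparison. The `Column` tuple and `reconcile` function are hypothetical, the timestamps and values are invented, and keeping the first argument on a tie is an assumption of this sketch.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: str
    value: str
    timestamp: int   # supplied by the writer's clock


def reconcile(a: Column, b: Column) -> Column:
    """Keep the version with the newer timestamp (ties keep `a` in this sketch)."""
    return b if b.timestamp > a.timestamp else a


# Client A writes first, with a correct clock.
first = Column("email", "alison@foo.com", 1000)
# Client B writes afterwards in real time, but its clock runs slow.
second = Column("email", "alison@bar.com", 900)

winner = reconcile(first, second)
print(winner.value)
```

Here the later update loses because of the skewed clock, which is the failure that keeping clients on NTP is meant to prevent.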
  38. 39. EXAMPLE
  39. 42. is cassandra a good fit?
        • very fast writes
        • need to be always writeable
        • lots of data
        • evolving schema
  40. 43. custom comparators
        • i want to compare with
            • float
            • lat/long
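The case for a custom comparator shows up as soon as column names are numbers stored as text: sorting them as text gives a different order from sorting them as floats. In Cassandra the comparator is part of the column family configuration; this Python sketch only illustrates the ordering difference. The `lat_long` parser and the sample values are made up.

```python
names = ["10.5", "9.25", "100.0"]

as_text = sorted(names)               # character-by-character order
as_float = sorted(names, key=float)   # numeric order


def lat_long(name: str):
    """Parse a 'lat,long' column name into a tuple that sorts numerically."""
    lat, lon = name.split(",")
    return (float(lat), float(lon))


points = ["33.45,-112.07", "40.78,-73.97", "4.71,-74.07"]
by_position = sorted(points, key=lat_long)

print(as_text)
print(as_float)
print(by_position)
```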
  41. 44. choose keys carefully
        • key-based routing to find data
        • queries are executed by key
        • keys need to be easily discoverable
        • you rely on keys for referential integrity
  42. 45. new use cases
        • geographic data
        • weather data
        • rfid
        • travel schedules
        • data services in soa
        • hotel reservations
        • CEP
  43. 46. see also
        • http://dfeatherston.com/cassandra-adf-uiuc-su10.pdf
        • http://github.com/dietrichf/brireme
  44. 47.
        • @ebenhewitt
        • cassandraguide.com
