Cassandra Data Model

Eben Hewitt's talk on Apache Cassandra's Data Model from Cassandra Summit in San Francisco.

Published in: Technology

1. 1. the cassandra data model eben hewitt cassandra summit - san francisco 8.10.2010
2. 3. <ul><li>the data model </li></ul><ul><li>no sql? </li></ul><ul><li>design patterns </li></ul>
3. 4. things we know about cassandra <ul><li>it’s eventually consistent </li></ul><ul><li>it’s column-oriented </li></ul>
4. 5. these things are wrong.
5. 6. ?...
6. 7. consistency <ul><li>eventual? </li></ul>1
7. 8. consistency <ul><li>tune-able? </li></ul>1
8. 9. PACELC <ul><li>Daniel Abadi </li></ul><ul><li>if Partition </li></ul><ul><ul><li>Trade some Consistency for Availability </li></ul></ul><ul><li>Else (normal circumstances) </li></ul><ul><ul><li>Balance Consistency & Latency </li></ul></ul>
9. 10. cassandra is consistent when read replica count + write replica count > replication factor; cassandra is consistent if you read & write at CL.QUORUM (once Q nodes are up); good place to start <ul><li>R + W > N </li></ul>r = # of replicas consulted during read op; w = # of replicas consulted during write op; n = replication factor (number of replicas); r + w > n = consistent
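
The R + W > N rule on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names `is_consistent` and `quorum` are made up for this example and are not part of any Cassandra client API.

```python
def quorum(n: int) -> int:
    """Size of a majority of n replicas."""
    return n // 2 + 1


def is_consistent(r: int, w: int, n: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n


n = 3                # replication factor
q = quorum(n)        # 2 replicas form a quorum when n = 3

print(is_consistent(q, q, n))   # quorum reads and quorum writes
print(is_consistent(1, 1, n))   # read one replica, write one replica
```

With a replication factor of 3, quorum reads plus quorum writes give 2 + 2 > 3, so at least one replica consulted on a read has seen the latest write. Reading and writing a single replica gives 1 + 1, which is not greater than 3.
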
10. 11. cassandra is row-oriented <ul><li>each row is uniquely identifiable by key </li></ul><ul><li>rows group columns and super columns </li></ul>2
11. 12. column <ul><li>atomic unit </li></ul><ul><li>name : value : timestamp </li></ul>email : alison@foo.com : 12578123685
12. 13. column family
13. 14. <ul><li>User { </li></ul><ul><li>123 : { email: alison@foo.com, </li></ul><ul><li> img: }, </li></ul><ul><li>456 : { email: eben@bar.com, </li></ul><ul><li> username: The Situation} </li></ul><ul><li>} </li></ul>column family
14. 15. super column super columns group columns under a common name
15. 16. super column family
16. 17. about super column families <ul><li>sub-column names inside a SCF are not indexed </li></ul><ul><ul><li>top level columns (SCF Name) are always indexed </li></ul></ul><ul><li>often used for denormalizing data from standard CFs </li></ul>
17. 18. <ul><li>PointOfInterest { </li></ul><ul><li>key: 85255 { </li></ul><ul><li>Phoenix Zoo { phone: 480-555-5555, desc: They have animals here. }, </li></ul><ul><li> Spring Training { phone: 623-333-3333, desc: Fun for baseball fans. }, </li></ul><ul><li>}, //end phx </li></ul><ul><li>key: 10019 { </li></ul><ul><li> Central Park { desc: Walk around. It's pretty.} , </li></ul><ul><li> Empire State Building { phone: 212-777-7777, desc: Great view from </li></ul><ul><li> 102nd floor. } </li></ul><ul><li>} //end nyc </li></ul><ul><li>} </li></ul>diagram labels: key, column, super column, super column family, flexible schema
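
One way to picture the super column family above is as three levels of nested maps: row key, then super column name, then sub-column name. The sketch below models it with plain Python dictionaries. It is only a mental model, not how Cassandra stores data, and the helper `get_subcolumn` is a hypothetical name.

```python
# row key -> super column name -> sub-column name -> value
point_of_interest = {
    "85255": {
        "Phoenix Zoo": {"phone": "480-555-5555", "desc": "They have animals here."},
        "Spring Training": {"phone": "623-333-3333", "desc": "Fun for baseball fans."},
    },
    "10019": {
        # Flexible schema: this super column has no "phone" sub-column.
        "Central Park": {"desc": "Walk around. It's pretty."},
        "Empire State Building": {
            "phone": "212-777-7777",
            "desc": "Great view from 102nd floor.",
        },
    },
}


def get_subcolumn(scf, row_key, super_column, name, default=None):
    """Look up one sub-column, tolerating rows or columns that do not exist."""
    return scf.get(row_key, {}).get(super_column, {}).get(name, default)


print(get_subcolumn(point_of_interest, "85255", "Phoenix Zoo", "phone"))
print(get_subcolumn(point_of_interest, "10019", "Central Park", "phone"))
```

The second lookup returns the default because Central Park simply has no phone sub-column. No schema change is needed for rows to differ like this.
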
18. 19. <ul><li>the data model </li></ul><ul><li>no sql? </li></ul><ul><li>design patterns </li></ul>
19. 20. what about… <ul><li>SELECT WHERE </li></ul><ul><li>ORDER BY </li></ul><ul><li>JOIN ON </li></ul><ul><li>GROUP </li></ul>?
20. 21. rdbms : domain-based model what answers do I have? cassandra : query-based model what questions do I have?
21. 22. Questions <ul><li>• Find hotels in a given area </li></ul><ul><li>• Find information about a given hotel, such as its name and location </li></ul><ul><li>• Find Points of Interest near a given hotel </li></ul><ul><li>• Find an available room in a given date range </li></ul><ul><li>• Find the rate and amenities about a room </li></ul><ul><li>• Book the selected room by entering guest information </li></ul>
22. 23. SELECT WHERE <ul><li><<cf>> </li></ul><ul><li>USER </li></ul><ul><li>Key: UserID </li></ul><ul><li>Cols: username, email, birth date, city, state </li></ul><ul><li>  </li></ul><ul><li>To support a query like SELECT * FROM User WHERE city = ‘Austin’: </li></ul><ul><li>Create a new CF called UserCity: </li></ul><ul><li>  </li></ul><ul><li><<cf>> </li></ul><ul><li>USERCITY </li></ul><ul><li>Key: city </li></ul><ul><li>Cols: IDs of the users in that city. </li></ul><ul><li>Also uses the Valueless Column pattern. </li></ul>
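
The UserCity idea can be sketched with two in-memory maps standing in for the two column families. Everything here is an assumption made for illustration: the dictionaries, `put_user`, and `users_in_city` are hypothetical and do not correspond to a real client library. The empty bytes value mirrors the valueless column pattern mentioned on the slide.

```python
user_cf = {}        # User CF: UserID -> columns
user_city_cf = {}   # UserCity CF: city -> {UserID: b""}


def put_user(user_id, username, email, city, state):
    """Write the primary row and keep the city index in step with it."""
    old = user_cf.get(user_id)
    if old is not None and old["city"] != city:
        # The user moved: drop the stale index column.
        user_city_cf.get(old["city"], {}).pop(user_id, None)
    user_cf[user_id] = {
        "username": username, "email": email, "city": city, "state": state,
    }
    # The column name carries the data; the value is empty.
    user_city_cf.setdefault(city, {})[user_id] = b""


def users_in_city(city):
    """Rough equivalent of SELECT * FROM User WHERE city = ..."""
    return [user_cf[uid] for uid in sorted(user_city_cf.get(city, {}))]


put_user("123", "alison", "alison@foo.com", "Austin", "TX")
put_user("456", "eben", "eben@bar.com", "Austin", "TX")
put_user("456", "eben", "eben@bar.com", "Phoenix", "AZ")  # user 456 moves

print([u["username"] for u in users_in_city("Austin")])
```

The cost shows up in `put_user`: every write to the primary column family also has to touch the index.
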
23. 24. <ul><li>Use an aggregate row key </li></ul><ul><li>state:city: { user1, user2} </li></ul><ul><li>Get rows between TX: & TX; </li></ul><ul><li>for all Texas users </li></ul><ul><li>Get rows between TX:Austin & TX:Austin1 </li></ul><ul><li>for all Austin users </li></ul>SELECT WHERE pt 2
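
The range trick works because `:` sorts immediately before `;`, so every key that starts with `TX:` falls between the two bounds. The sketch below imitates a key range scan over rows kept in key order, which is the situation an order-preserving partitioner gives you. The sample rows and the `get_range` helper are invented for illustration.

```python
import bisect

# Aggregate row key "state:city" -> user IDs (sample data, made up)
rows = {
    "CA:Fresno": ["user7"],
    "TX:Austin": ["user1", "user2"],
    "TX:Dallas": ["user3"],
    "UT:Provo": ["user9"],
}
sorted_keys = sorted(rows)


def get_range(start, end):
    """Return row keys k with start <= k <= end, in key order."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_right(sorted_keys, end)
    return sorted_keys[lo:hi]


print(get_range("TX:", "TX;"))               # all Texas rows
print(get_range("TX:Austin", "TX:Austin1"))  # only the Austin row
```
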
24. 25. ORDER BY <ul><li>Rows </li></ul><ul><li>are placed according to their Partitioner: </li></ul><ul><li>Random: MD5 of key </li></ul><ul><li>Order-Preserving: actual key </li></ul><ul><li>are sorted by key, regardless of partitioner </li></ul>Columns are sorted according to CompareWith or CompareSubcolumnsWith
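
The difference between the two partitioners can be illustrated by the value each one uses to place a row. This is a simplified sketch: `random_token` hashes the key with MD5 as the slide describes, but it is not claimed to reproduce Cassandra's exact token computation, and both function names are hypothetical.

```python
import hashlib


def random_token(key: str) -> int:
    """Placement value derived from the MD5 digest of the key."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def order_preserving_token(key: str) -> str:
    """Placement value is the key itself, so key order is kept."""
    return key


keys = ["alice", "bob", "carol"]

by_opp = sorted(keys, key=order_preserving_token)
by_random = sorted(keys, key=random_token)

print(by_opp)      # same order as the keys themselves
print(by_random)   # order decided by the hash, unrelated to key order
```

Hashing spreads rows evenly around the ring but removes any meaningful key order, which is why key range scans like the aggregate-key example depend on the order-preserving partitioner.
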
25. 26. JOIN ON <ul><li>“join” means “create a relationship” </li></ul><ul><ul><li>rdbms : pay at runtime </li></ul></ul><ul><ul><li>cassandra : pay at design time </li></ul></ul><ul><li>representing the relationship </li></ul><ul><ul><li>rdbms : opaque, in query </li></ul></ul><ul><ul><li>cassandra : transparent, first-class citizen </li></ul></ul>
26. 27. GROUP <ul><li>SELECT COUNT(*) from Hotel GROUP BY ZipCode </li></ul><ul><li> calculated column value </li></ul>
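
A "calculated column value" can be read as: keep the aggregate up to date at write time instead of computing it at query time. The sketch below keeps a per-zip-code count next to the hotel rows. The dictionaries and the `add_hotel` function are hypothetical, and on a real cluster this kind of read-then-update would need care under concurrent writers.

```python
from collections import defaultdict

hotel_cf = {}                          # hotel ID -> columns
hotel_count_by_zip = defaultdict(int)  # zip code -> precomputed count


def add_hotel(hotel_id, name, zip_code):
    """Write the hotel row and adjust the stored counts to match."""
    previous = hotel_cf.get(hotel_id)
    if previous is not None:
        # Overwriting an existing hotel: undo its old contribution first.
        hotel_count_by_zip[previous["zip"]] -= 1
    hotel_cf[hotel_id] = {"name": name, "zip": zip_code}
    hotel_count_by_zip[zip_code] += 1


add_hotel("h1", "Cambria Suites", "85255")
add_hotel("h2", "Clarion Resort", "85255")
add_hotel("h3", "The Waldorf", "10019")

# Rough equivalent of SELECT COUNT(*) FROM Hotel GROUP BY ZipCode
print(dict(hotel_count_by_zip))
```
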
27. 28. <ul><li>the data model </li></ul><ul><li>no sql? </li></ul><ul><li>design patterns </li></ul>
28. 29. <ul><li>1. </li></ul>materialized view
29. 30. <ul><li>problem </li></ul><ul><li>You need to perform SELECT FROM WHERE queries. </li></ul><ul><li>solution </li></ul><ul><li>Create a new CF. Use the WHERE idea as the row key. </li></ul><ul><li>impacts </li></ul><ul><li>Must also write to the index every time you write to the primary CF. Or run as a job. </li></ul>materialized view
30. 31. <ul><li>2. </li></ul>valueless column
31. 32. <ul><li>problem </li></ul><ul><li>Indexes require repeating columns from other column families. </li></ul><ul><li>solution </li></ul><ul><li>Treat the name of the column as the value. Use a byte[0] as the column ‘value’. </li></ul><ul><li>impacts </li></ul><ul><li>Only works with <= 2B columns in 0.7 </li></ul>valueless column
32. 33. <ul><li>3. </li></ul>composite key
33. 34. <ul><li>problem </li></ul><ul><li>Keys must support references and uniqueness. </li></ul><ul><li>solution </li></ul><ul><li>Fuse multiple values with a separator. </li></ul><ul><li>impacts </li></ul><ul><li>Can substitute for Super Column. </li></ul><ul><li>Use a Custom Comparator if necessary. </li></ul>composite key
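
Fusing values with a separator takes only a pair of helpers. The sketch below assumes `:` as the separator and rejects parts that contain it, since an embedded separator would make the key ambiguous. `make_key` and `split_key` are names chosen for this example.

```python
SEPARATOR = ":"


def make_key(*parts: str) -> str:
    """Fuse several values into one composite key."""
    for part in parts:
        if SEPARATOR in part:
            raise ValueError(f"key part {part!r} contains the separator")
    return SEPARATOR.join(parts)


def split_key(key: str) -> list:
    """Recover the original values from a composite key."""
    return key.split(SEPARATOR)


key = make_key("TX", "Austin", "hotel42")
print(key)
print(split_key(key))
```

Because the fused key can carry what would otherwise be a super column name, a composite key can stand in for a super column, as the slide notes.
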
34. 35. <ul><li>4. </li></ul>semantic key
35. 36. <ul><li>problem </li></ul><ul><li> Keys must be unique. You use OPP. </li></ul><ul><li>solution </li></ul><ul><li>Use a key that is meaningful in your application. </li></ul><ul><li>impacts </li></ul><ul><li>Harder to get right. Can proliferate Indexes. Range scans over keys less efficient. </li></ul>semantic key
36. 37. <ul><li>6. </li></ul>client clock sync
37. 38. <ul><li>problem </li></ul><ul><li>You need to keep clocks on different clients synchronized to support read repair. </li></ul><ul><li>solution </li></ul><ul><li>System.nanoTime() used in StorageProxy </li></ul><ul><li>NTP. </li></ul><ul><li>impacts </li></ul><ul><li>Consider geographic dispersion. </li></ul>client clock sync
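
Clock sync matters because each column carries a client-supplied timestamp, and when replicas disagree the column with the higher timestamp wins. The sketch below shows a client with a fast clock causing an older write to beat a newer one. `Column` and `reconcile` are illustrative names, the timestamps are invented, and the tie-breaking rule here is an arbitrary assumption.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: str
    value: str
    timestamp: int  # supplied by the writing client, in microseconds


def reconcile(a: Column, b: Column) -> Column:
    """Keep the column with the higher timestamp (last write wins)."""
    return a if a.timestamp >= b.timestamp else b


# Client A writes first, but its clock runs 5 seconds fast.
first_write = Column("email", "old@foo.com", 100_000_000 + 5_000_000)
# Client B writes 2 seconds later with an accurate clock.
second_write = Column("email", "new@foo.com", 102_000_000)

winner = reconcile(first_write, second_write)
print(winner.value)  # the earlier write survives because of the skew
```
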
38. 39. EXAMPLE
39. 42. is cassandra a good fit? <ul><li>very fast writes </li></ul><ul><li>need to be always writeable </li></ul><ul><li>lots of data </li></ul><ul><li>evolving schema </li></ul>
40. 43. custom comparators <ul><li>i want to compare with </li></ul><ul><ul><li>float </li></ul></ul><ul><ul><li>lat/long </li></ul></ul>
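
The reason to want a float comparator becomes clear once numeric column names are sorted as text. The sketch below uses Python's `sorted` with a key function as a stand-in for a comparator. It is an analogy only and does not show how a comparator is written for Cassandra itself.

```python
column_names = ["10.5", "9.25", "100.0"]

# Comparing names as text sorts character by character.
as_text = sorted(column_names)

# Comparing names as floats gives the numeric order you probably want.
as_float = sorted(column_names, key=float)

print(as_text)
print(as_float)
```

As text, "9.25" sorts last because "9" is greater than "1". As floats, it sorts first.
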
41. 44. choose keys carefully <ul><li>key-based routing to find data </li></ul><ul><li>queries are executed by key </li></ul><ul><li>keys need to be easily discoverable </li></ul><ul><li>you rely on keys for referential integrity </li></ul>
42. 45. new use cases <ul><li>geographic data </li></ul><ul><li>weather data </li></ul><ul><li>rfid </li></ul><ul><li>travel schedules </li></ul><ul><li>data services in soa </li></ul><ul><li>hotel reservations </li></ul><ul><li>CEP </li></ul>