Cassandra Data Model
Eben Hewitt's talk on Apache Cassandra's Data Model from Cassandra Summit in San Francisco.

Cassandra Data Model Presentation Transcript

  • 1. the cassandra data model
    • eben hewitt
    • cassandra summit - san francisco 8.10.2010
  • 2.  
  • 3.
    • the data model
    • no sql?
    • design patterns
  • 4. things we know about cassandra
    • it’s eventually consistent
    • it’s column-oriented
  • 5. these things are wrong.
  • 6. ?...
  • 7. consistency
    • eventual?
  • 8. consistency
    • tune-able?
  • 9. PACELC
    • Daniel Abadi
    • if Partition
      • Trade some Consistency for Availability
    • Else (normal circumstances)
      • Balance Consistency & Latency
  • 10. cassandra is consistent when read replica count + write replica count > replication factor
    • R + W > N
      • R = # of nodes consulted during a read op
      • W = # of replicas consulted during a write op
      • N = replication factor (# of replicas)
      • R + W > N = consistent
    • cassandra is consistent if you read & write at CL.QUORUM (once Q nodes are up)
      • a good place to start
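The overlap rule on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper names `quorum` and `overlaps` are made up for this example and are not part of any Cassandra client API.

```python
def quorum(replication_factor):
    """Smallest majority of replicas: floor(N / 2) + 1."""
    return replication_factor // 2 + 1


def overlaps(r, w, n):
    """True when every read set must share at least one replica with every write set."""
    return r + w > n


n = 3
q = quorum(n)
print(q, overlaps(q, q, n))  # QUORUM reads with QUORUM writes: 2 True
print(overlaps(1, 1, n))     # ONE with ONE: False
print(overlaps(1, n, n))     # read ONE, write ALL: True
```

With N = 3, quorum reads and writes touch two replicas each, so any read set and any write set must intersect in at least one replica.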
  • 11. cassandra is row-oriented
    • each row is uniquely identifiable by key
    • rows group columns and super columns
  • 12. column
    • atomic unit
    • name : value : timestamp
    email : alison@foo.com : 12578123685
  • 13. column family
  • 14. column family
    • User {
    •   123 : { email: alison@foo.com, img: },
    •   456 : { email: eben@bar.com, username: The Situation }
    • }
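One rough way to picture a column family is as a map from row key to a map of named columns. The sketch below models that with plain Python dictionaries; the `Column` tuple and the `insert` helper are hypothetical names for this illustration, and the second and third timestamps are invented.

```python
from collections import namedtuple

# A column is the atomic unit: name, value, timestamp.
Column = namedtuple("Column", ["name", "value", "timestamp"])

# A column family maps row key -> {column name -> Column}.
user_cf = {}


def insert(cf, row_key, name, value, timestamp):
    """Store one column under the given row key."""
    cf.setdefault(row_key, {})[name] = Column(name, value, timestamp)


insert(user_cf, "123", "email", "alison@foo.com", 12578123685)
insert(user_cf, "456", "email", "eben@bar.com", 12578123686)      # illustrative timestamp
insert(user_cf, "456", "username", "The Situation", 12578123687)  # illustrative timestamp

# Rows in the same column family need not carry the same columns.
print(sorted(user_cf["123"]))  # ['email']
print(sorted(user_cf["456"]))  # ['email', 'username']
```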
  • 15. super column
    • super columns group columns under a common name
  • 16. super column family
  • 17. about super column families
    • sub-column names inside a SCF are not indexed
      • top level columns (SCF Name) are always indexed
    • often used for denormalizing data from standard CFs
  • 18.
    • PointOfInterest {
    • key: 85255 {
    • Phoenix Zoo { phone: 480-555-5555, desc: They have animals here. },
    • Spring Training { phone: 623-333-3333, desc: Fun for baseball fans. },
    • }, //end phx
    • key: 10019 {
    • Central Park { desc: Walk around. It's pretty.} ,
    • Empire State Building { phone: 212-777-7777, desc: Great view from 102nd floor. }
    • } //end nyc
    • }
    (diagram labels: key, column, super column, super column family, flexible schema)
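The PointOfInterest example adds one more level of nesting: row key, then super column name, then sub-column. A minimal sketch with nested dictionaries follows; the accessor `get_subcolumn` is a hypothetical helper, not a client call.

```python
# Super column family: row key -> super column name -> {sub-column name -> value}
point_of_interest = {
    "85255": {
        "Phoenix Zoo": {"phone": "480-555-5555", "desc": "They have animals here."},
        "Spring Training": {"phone": "623-333-3333", "desc": "Fun for baseball fans."},
    },
    "10019": {
        "Central Park": {"desc": "Walk around. It's pretty."},
        "Empire State Building": {
            "phone": "212-777-7777",
            "desc": "Great view from 102nd floor.",
        },
    },
}


def get_subcolumn(scf, row_key, super_column, name, default=None):
    """Look up one sub-column; sub-columns may be absent (flexible schema)."""
    return scf[row_key][super_column].get(name, default)


print(get_subcolumn(point_of_interest, "85255", "Phoenix Zoo", "phone"))
print(get_subcolumn(point_of_interest, "10019", "Central Park", "phone"))
```

Central Park has no phone sub-column, so the second lookup falls back to the default.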
  • 19.
    • the data model
    • no sql?
    • design patterns
  • 20. what about…
    • SELECT WHERE
    • ORDER BY
    • JOIN ON
    • GROUP
    ?
  • 21.
    • rdbms : domain-based model
      • what answers do I have?
    • cassandra : query-based model
      • what questions do I have?
  • 22. Questions
    • Find hotels in a given area
    • Find information about a given hotel, such as its name and location
    • Find Points of Interest near a given hotel
    • Find an available room in a given date range
    • Find the rate and amenities for a room
    • Book the selected room by entering guest information
  • 23. SELECT WHERE
    • <<cf>> USER
      • Key: UserID
      • Cols: username, email, birth date, city, state
    • To support a query like SELECT * FROM User WHERE city = 'Austin', create a new CF called UserCity:
    • <<cf>> USERCITY
      • Key: city
      • Cols: IDs of the users in that city.
      • Also uses the Valueless Column pattern.
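The User / UserCity pairing can be sketched as two maps that are written together. This is an in-memory illustration under stated assumptions, not client code: `save_user` and `users_in_city` are hypothetical helpers, and the empty byte string stands in for the valueless column.

```python
user = {}       # primary CF: user id -> columns
user_city = {}  # index CF: city -> {user id: empty value}

EMPTY = b""  # valueless column: the column name (user id) carries the data


def save_user(user_id, columns):
    """Write the primary row and keep the city index in step with it."""
    old = user.get(user_id)
    old_city = old.get("city") if old else None
    new_city = columns.get("city")
    if old_city is not None and old_city != new_city:
        user_city.get(old_city, {}).pop(user_id, None)
    user[user_id] = dict(columns)
    if new_city is not None:
        user_city.setdefault(new_city, {})[user_id] = EMPTY


def users_in_city(city):
    """Answer SELECT * FROM User WHERE city = ? through the index CF."""
    return [user[uid] for uid in sorted(user_city.get(city, {}))]


save_user("u1", {"username": "alison", "city": "Austin"})
save_user("u2", {"username": "eben", "city": "Phoenix"})
save_user("u2", {"username": "eben", "city": "Austin"})  # u2 moves to Austin
print([u["username"] for u in users_in_city("Austin")])  # ['alison', 'eben']
```

The cost of the pattern shows up in `save_user`: every write to the primary CF also touches the index.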
  • 24. SELECT WHERE pt 2
    • Use an aggregate row key
      • state:city: { user1, user2}
    • Get rows between TX: & TX; for all Texas users
    • Get rows between TX:Austin & TX:Austin1 for all Austin users
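The range trick works because ';' sorts immediately after ':' in ASCII, so every key starting with "TX:" falls between "TX:" and "TX;". The sketch below simulates a key range scan over sorted keys with `bisect`. It assumes keys are stored in key order (as with an order-preserving partitioner) and uses a half-open range; `key_range` is a hypothetical helper, and the city keys are sample data.

```python
import bisect


def key_range(sorted_keys, start, end):
    """Return keys k with start <= k < end from an already sorted list."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]


keys = sorted(["AZ:Phoenix", "TX:Austin", "TX:Dallas", "TX:Houston", "UT:Provo"])

print(key_range(keys, "TX:", "TX;"))               # ['TX:Austin', 'TX:Dallas', 'TX:Houston']
print(key_range(keys, "TX:Austin", "TX:Austin1"))  # ['TX:Austin']
```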
  • 25. ORDER BY
    • Rows
      • are placed according to their Partitioner:
        • Random: MD5 of key
        • Order-Preserving: actual key
      • are sorted by key, regardless of partitioner
    • Columns are sorted according to CompareWith or CompareSubcolumnsWith
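To see why the partitioner choice matters for ordering, compare how the two schemes would place the same keys. The sketch below is a simplification: it treats the MD5 digest as an unsigned integer, which is not exactly how Cassandra derives its tokens, and both function names are made up for the example.

```python
import hashlib


def random_token(key):
    """Placement token under a random scheme: an integer derived from MD5 of the key."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def order_preserving_token(key):
    """Placement token under an order-preserving scheme: the key itself."""
    return key


keys = ["alice", "bob", "carol", "dave"]

by_md5 = sorted(keys, key=random_token)
by_key = sorted(keys, key=order_preserving_token)

print(by_md5)  # placement order unrelated to the keys' natural order
print(by_key)  # ['alice', 'bob', 'carol', 'dave']
```

Hashing spreads adjacent keys around the ring, which helps balance load but makes key range scans meaningless; the order-preserving scheme keeps neighbours together.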
  • 26. JOIN ON
    • “join” means “create a relationship”
      • rdbms : pay at runtime
      • cassandra : pay at design time
    • representing the relationship
      • rdbms : opaque, in query
      • cassandra : transparent, first-class citizen
  • 27. GROUP
    • SELECT COUNT(*) from Hotel GROUP BY ZipCode
    •  calculated column value
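A calculated column value means the aggregate is maintained at write time rather than computed at read time. The single-process sketch below keeps a per-zip count alongside the Hotel rows. The names `add_hotel` and `hotel_count_by_zip` and the sample hotels are hypothetical, and a real cluster would also have to deal with concurrent updates, which this sketch ignores.

```python
hotel = {}               # primary CF: hotel id -> columns
hotel_count_by_zip = {}  # calculated values: zip code -> number of hotels


def add_hotel(hotel_id, name, zip_code):
    """Write a hotel row and adjust the precomputed count for its zip code."""
    previous = hotel.get(hotel_id)
    if previous is not None:
        # Overwriting an existing hotel: undo its old contribution first.
        hotel_count_by_zip[previous["zip"]] -= 1
    hotel[hotel_id] = {"name": name, "zip": zip_code}
    hotel_count_by_zip[zip_code] = hotel_count_by_zip.get(zip_code, 0) + 1


add_hotel("h1", "Desert Inn", "85255")
add_hotel("h2", "Cactus Lodge", "85255")
add_hotel("h3", "Park View", "10019")

# Equivalent of SELECT COUNT(*) FROM Hotel GROUP BY ZipCode, read directly.
print(hotel_count_by_zip)  # {'85255': 2, '10019': 1}
```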
  • 28.
    • the data model
    • no sql?
    • design patterns
  • 29. 1. materialized view
  • 30. materialized view
    • problem
      • You need to perform SELECT FROM WHERE queries.
    • solution
      • Create a new CF. Use the WHERE idea as the row key.
    • impacts
      • Must also write to the index every time you write to the primary CF. Or run as a job.
  • 31. 2. valueless column
  • 32. valueless column
    • problem
      • Indexes require repeating columns from other column families.
    • solution
      • Treat the name of the column as the value. Use a byte[0] as the column ‘value’.
    • impacts
      • Only works with <= 2B columns in 0.7
  • 33. 3. composite key
  • 34. composite key
    • problem
      • Keys must support references and uniqueness.
    • solution
      • Fuse multiple values with a separator.
    • impacts
      • Can substitute for Super Column.
      • Use a Custom Comparator if necessary.
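Fusing values with a separator is simple, but the separator must never appear inside a part or the key cannot be split back reliably. A minimal sketch follows, assuming ':' as the separator; `make_key` and `split_key` are hypothetical helpers and the sample parts are invented.

```python
SEPARATOR = ":"


def make_key(*parts):
    """Fuse several values into one key, rejecting parts that contain the separator."""
    for part in parts:
        if SEPARATOR in part:
            raise ValueError("part contains separator: %r" % part)
    return SEPARATOR.join(parts)


def split_key(key):
    """Recover the original parts from a composite key."""
    return key.split(SEPARATOR)


key = make_key("AZ", "Phoenix", "hotel42")
print(key)             # AZ:Phoenix:hotel42
print(split_key(key))  # ['AZ', 'Phoenix', 'hotel42']
```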
  • 35. 4. semantic key
  • 36. semantic key
    • problem
      • Keys must be unique. You use OPP.
    • solution
      • Use a key that is meaningful in your application.
    • impacts
      • Harder to get right. Can proliferate Indexes. Range scans over keys less efficient.
  • 37. 6. client clock sync
  • 38. client clock sync
    • problem
      • You need to keep clocks on different clients synchronized to support read repair.
    • solution
      • System.nanoTime() used in StorageProxy
      • NTP
    • impacts
      • Consider geographic dispersion.
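Clock sync matters because conflicting versions of a column are resolved by comparing the timestamps the clients supplied. The sketch below shows a later write losing because its client's clock runs slow. The `reconcile` helper, the tick units and the 5-tick skew are all assumptions for illustration, and the tie-breaking rule here is arbitrary.

```python
def reconcile(a, b):
    """Pick the (value, timestamp) pair with the higher timestamp."""
    return a if a[1] >= b[1] else b


true_time = 1000000  # arbitrary tick count at the moment client A writes

# Client A writes first, with an accurate clock.
write_a = ("old@foo.com", true_time)

# Client B writes 2 ticks later, but its clock is 5 ticks slow.
write_b = ("new@foo.com", true_time + 2 - 5)

winner = reconcile(write_a, write_b)
print(winner[0])  # old@foo.com: the later write is discarded
```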
  • 39. EXAMPLE
  • 40.  
  • 41.  
  • 42. is cassandra a good fit?
    • very fast writes
    • need to be always writeable
    • lots of data
    • evolving schema
  • 43. custom comparators
    • i want to compare with
      • float
      • lat/long
  • 44. choose keys carefully
    • key-based routing to find data
    • queries are executed by key
    • keys need to be easily discoverable
    • you rely on keys for referential integrity
  • 45. new use cases
    • geographic data
    • weather data
    • rfid
    • travel schedules
    • data services in soa
    • hotel reservations
    • CEP
  • 46. see also
    • http://dfeatherston.com/cassandra-adf-uiuc-su10.pdf
    • http://github.com/dietrichf/brireme
  • 47.
    • @ebenhewitt
    • cassandraguide.com