Cassandra EU 2012 - Data modelling workshop by Richard Low

A workshop on data modelling in Cassandra. Works through a messaging application example.

Published in: Technology, Business

  1. Data modelling workshop • Richard Low • rlow@acunu.com • @richardalow • Wednesday, 28 March 2012
  2. Outline • What is data modelling? • What do I need to know to come up with a model? • Options and available tools • Denormalisation • Example and demo: scalable messaging application
  3. What is data modelling?
  4. Data modelling • How you organise your data • Store all in one big value? • Store as columns in one row or lots of rows? • Use counters? • Can I avoid read-modify-write?
  5. Why care about it? • Performance • Ensure good load balancing • Disk usage • Future proofing
  6. Performance • Bad data model: do read-modify-write on a large column • Good data model: just overwrite updated data • Difference? Could be 100 ops/s vs. 100k ops/s: a 1000x improvement
  7. Performance • Cacheability • Ensure your cache isn’t polluted by uncacheable things • Cached reads are ~100x faster than uncached
  8. What do you need?
  9. Optimise for queries • Data model design starts with queries • What are the common queries?
  10. Workload • How many inserts? • How many reads? • Do inserts depend on current data? • Is data write-once?
  11. Sizes • How big are the values? • Are some ‘users’ bigger than others? • How cacheable is your data?
  12. How do I get this? • Back of the envelope calculation • Monitor existing solution • Prototype a solution
  13. Options and tools
  14. Keyspaces and Column Families • SQL database → Cassandra keyspace • SQL table → Cassandra column family • [Diagram: a column family drawn as rows, each with a row key followed by its columns]
  15. Options and tools • Rows • Columns • Supercolumns • Composite columns
  16. Rows and columns • [Figure: grid of rows row1 to row7 against columns col1 to col7; an x marks each column present in a row, and each row holds a different subset of the columns]
  17. Column options • Regular columns • Super columns: columns within columns • Composite columns: multi-dimensional column names
  18. Composite columns • alice: { m2: { Sender: bob, Subject: ‘paper!’, ... } } • bob: { m1: { Sender: alice, Subject: ‘rock?’, ... } } • charlie: { m1: { Sender: alice, Subject: ‘rock?’, ... }, m2: { Sender: bob, Subject: ‘paper!’, ... } }
  19. Tools • Counters: atomic inc and dec • Expiring columns: TTL • Secondary indexes: your WHERE clause
  20. Rows vs columns • Row key is the shard key • Need lots of rows for scalability • Don’t be afraid of large-ish rows • But don’t make them too big • Avoid range queries across rows, but use them within rows
  21. Range queries • Within a row: SELECT col3..col5 FROM Standard1 WHERE KEY=row1 • [Figure: row1 holding columns col1, col2, col5, col6, col8]
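
The within-row range query on this slide can be sketched in plain Python. This is only an illustration under assumptions: the dictionary `row1`, its values, and the helper `get_slice` are hypothetical stand-ins, not Cassandra's or Pycassa's API. The sketch relies on the columns of a row being kept sorted by name, so a range is one contiguous run.

```python
from bisect import bisect_left, bisect_right

# Hypothetical stand-in for one row; the column names match the slide's figure,
# the values are made up for illustration.
row1 = {"col1": "a", "col2": "b", "col5": "c", "col6": "d", "col8": "e"}

def get_slice(row, start, finish):
    """Return (name, value) pairs with start <= name <= finish, in name order."""
    names = sorted(row)                 # columns within a row are ordered by name
    lo = bisect_left(names, start)      # first column >= start
    hi = bisect_right(names, finish)    # one past the last column <= finish
    return [(name, row[name]) for name in names[lo:hi]]

result = get_slice(row1, "col3", "col5")
print(result)  # [('col5', 'c')]
```

Only col5 falls inside col3..col5 for this row, and it is found without touching the other columns.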
  22. Range queries • Across rows: SELECT * FROM table WHERE key > row2 LIMIT 2
  23. Range queries • SELECT * FROM table WHERE key > row2 LIMIT 2 • [Figure: rows row1 to row4 and the range selected by the query]
  24. Range queries • Range queries within rows (‘get_slice’) are fine • Avoid range queries across rows (‘get_range_slices’)
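
One reason range queries across rows are awkward, assuming a hash-based partitioner, is that rows are ordered by a hash of the key rather than by the key itself. The sketch below uses MD5 as the hash; `token` and `range_after` are hypothetical helpers, and the token computation is a simplification of what a real partitioner does.

```python
import hashlib

def token(key: str) -> int:
    """MD5 of the key as an integer (simplified stand-in for a partitioner token)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")

keys = ["row1", "row2", "row3", "row4"]
by_token = sorted(keys, key=token)   # the order in which rows would be scanned

def range_after(start_key, limit):
    """Rows whose token is greater than start_key's token, in token order."""
    start = token(start_key)
    return [k for k in by_token if token(k) > start][:limit]

print("key order:  ", keys)
print("token order:", by_token)
print("'key > row2 LIMIT 2' in token order:", range_after("row2", 2))
```

Running it shows that the rows returned after row2 depend on hash values, not on the alphabetical order of the keys.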
  25. Batching • There is overhead on each call • Batch inserts together, ideally within the same row • Reduce read ops by using large get_slice reads
  26. Denormalisation
  27. Denormalisation • Hard drive performance constraints: • Sequential IO at 100s of MB/s • Seeks at 100 IO/s • Avoid random IO
  28. Denormalisation • Store columns accessed at similar times near to each other • => put them in the same row • Involves copying • Copying isn’t bad: pre-flood disk prices were under $100 per TB
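
A rough worked example of why co-locating data matters, using the disk figures from the slides. The 100 MB/s rate is an assumed value at the low end of "100s of MB/s", and the read of ten 1 KB messages is an assumed workload for illustration.

```python
SEEKS_PER_SECOND = 100                      # from the slide: about 100 random IOs per second
SEQUENTIAL_BYTES_PER_SECOND = 100 * 10**6   # assumption: 100 MB/s sequential throughput
MESSAGE_BYTES = 1000                        # assumption: 1 KB per message
N_MESSAGES = 10                             # assumption: read ten messages

seek_time = 1 / SEEKS_PER_SECOND
transfer_time = N_MESSAGES * MESSAGE_BYTES / SEQUENTIAL_BYTES_PER_SECOND

scattered = N_MESSAGES * seek_time + transfer_time   # one seek per message
contiguous = seek_time + transfer_time               # one seek, then a sequential read

print(f"scattered:  {scattered * 1000:.1f} ms")   # scattered:  100.1 ms
print(f"contiguous: {contiguous * 1000:.1f} ms")  # contiguous: 10.1 ms
```

Under these assumptions the transfer time is negligible and the seek count dominates, which is the case for storing related columns in the same row.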
  29. Messaging Application
  30. Messaging application • Users can send messages to other users • Horizontally scalable • Expect users to send to lots of recipients
  31. Messaging • In an RDBMS we might have a table for: • Users • Messages (sender is unique) • Mappings, Message → Receiver
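  32. A relational model (example relational DB model) • Users: Id, username • Messages: Id, Subject, Content, Date, Sender_Id • Msg_Receipt: Id, Message_Id, User_Id, Is_read • Relationships (1 to ∞): Users.Id → Messages.Sender_Id, Users.Id → Msg_Receipt.User_Id, Messages.Id → Msg_Receipt.Message_Id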
  33. Querying • Most recent 10 messages sent by a user: SELECT * FROM Messages WHERE Messages.Sender_Id = <id> ORDER BY Messages.Date DESC LIMIT 10; • Most recent 10 messages received by a user: SELECT Messages.* FROM Messages, Msg_Receipt WHERE Msg_Receipt.User_Id = <id> AND Msg_Receipt.Message_Id = Messages.Id ORDER BY Messages.Date DESC LIMIT 10;
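
The two queries can be tried against an in-memory SQLite database. This is a sketch only: the column types, the integer dates, the sample rows, and the helpers `sent_by` and `received_by` are assumptions made for illustration, and only the subject is selected to keep the output short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users (Id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE Messages (Id INTEGER PRIMARY KEY, Subject TEXT, Content TEXT,
                       Date INTEGER, Sender_Id INTEGER REFERENCES Users(Id));
CREATE TABLE Msg_Receipt (Id INTEGER PRIMARY KEY,
                          Message_Id INTEGER REFERENCES Messages(Id),
                          User_Id INTEGER REFERENCES Users(Id),
                          Is_read INTEGER DEFAULT 0);
-- Sample data (assumed): alice sends 'rock?', then bob sends 'paper!'.
INSERT INTO Users VALUES (1, 'alice'), (2, 'bob'), (3, 'charlie');
INSERT INTO Messages VALUES (1, 'rock?', '', 100, 1), (2, 'paper!', '', 200, 2);
INSERT INTO Msg_Receipt (Message_Id, User_Id) VALUES (1, 2), (1, 3), (2, 1), (2, 3);
""")

def sent_by(user_id):
    """Most recent 10 messages sent by a user."""
    return conn.execute(
        "SELECT Subject FROM Messages WHERE Sender_Id = ? "
        "ORDER BY Date DESC LIMIT 10", (user_id,)).fetchall()

def received_by(user_id):
    """Most recent 10 messages received by a user (join through Msg_Receipt)."""
    return conn.execute(
        "SELECT Messages.Subject FROM Messages, Msg_Receipt "
        "WHERE Msg_Receipt.User_Id = ? AND Msg_Receipt.Message_Id = Messages.Id "
        "ORDER BY Messages.Date DESC LIMIT 10", (user_id,)).fetchall()

print(sent_by(1))      # [('rock?',)]
print(received_by(3))  # [('paper!',), ('rock?',)]
```

The received-messages query has to join two tables, so each message it returns may live at a different place on disk.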
  34. Under the hood • Msg_Receipt (id, msg_id, user_id): (0, 0, 0), (1, 3, 1), (2, 4, 2), (3, 6000, 0) • Messages (id, subject, ...): (0, a), (1, b), (2, c), (3, d), (4, e), ..., (6000, x)
  35. Under the hood • Normalisation => seeks • So denormalise • Hit capacity limit of one node quickly
  36. Back of the envelope... • 1 M users • Message size 1 KB • Each user has 5000 messages • => 5 TB data
  37. Back of the envelope... • Reading 10 messages => 10 seeks • If 10k active at once, need 100k seeks/s • => need 1000 disks • With 8 disks per node, RF 3, that’s 375 nodes
  38. Back of the envelope... • Denormalise: messages are immutable • Insert them into everyone’s inbox • Reading 10 messages is one seek • Paging is sequential • => 10x fewer nodes: 38 nodes now!
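
The back-of-the-envelope figures can be reproduced in a few lines. The sketch assumes decimal units (1 KB = 1000 bytes), 100 seeks/s per disk as on the denormalisation slide, one read of 10 messages per active user per second, and that the node count is the disk count divided by disks per node and multiplied by the replication factor, which is the reading that yields the slide's 375. The helper `nodes_needed` is hypothetical.

```python
import math

USERS = 1_000_000
MESSAGE_BYTES = 1_000        # 1 KB, decimal units (assumption)
MESSAGES_PER_USER = 5_000
ACTIVE_USERS = 10_000        # each assumed to issue one read per second
MESSAGES_PER_READ = 10
SEEKS_PER_DISK = 100         # seeks/s per disk
DISKS_PER_NODE = 8
REPLICATION_FACTOR = 3

data_tb = USERS * MESSAGE_BYTES * MESSAGES_PER_USER / 10**12

def nodes_needed(seeks_per_read):
    """Nodes required to serve the seek load, rounded up."""
    seeks_per_second = ACTIVE_USERS * seeks_per_read
    disks = seeks_per_second / SEEKS_PER_DISK
    return math.ceil(disks / DISKS_PER_NODE * REPLICATION_FACTOR)

normalised = nodes_needed(MESSAGES_PER_READ)  # one seek per message
denormalised = nodes_needed(1)                # one seek per inbox read
print(data_tb, normalised, denormalised)  # 5.0 375 38
```

The denormalised figure is 37.5 before rounding, which is where the slide's 38 nodes comes from.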
  39. In Cassandra • Use a row per user • Composite columns, with TimeUUID as ID • Gives time ordering on messages • Inserts go to all recipients
  40. Messaging example • Message m1: From: alice, To: bob, charlie, Subject: rock? • Rows after the insert: • alice: (no columns) • bob: m1 { sender: alice, subject: rock? } • charlie: m1 { sender: alice, subject: rock? }
  41. Messaging example • Message m2: From: bob, To: alice, charlie, Subject: paper! • Rows after the insert: • alice: m2 { sender: bob, subject: paper! } • bob: m1 { sender: alice, subject: rock? } • charlie: m1 { sender: alice, subject: rock? }, m2 { sender: bob, subject: paper! }
  42. Data • alice: { m2: { Sender: bob, Subject: ‘paper!’, ... } } • bob: { m1: { Sender: alice, Subject: ‘rock?’, ... } } • charlie: { m1: { Sender: alice, Subject: ‘rock?’, ... }, m2: { Sender: bob, Subject: ‘paper!’, ... } }
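
The row-per-user layout can be mimicked with nested dictionaries to show how a write fans out to every recipient. This is an in-memory sketch, not Pycassa code: `inboxes`, `send_message` and `list_messages` are hypothetical names, and Python's version 1 UUIDs stand in for TimeUUID column names.

```python
import uuid
from collections import defaultdict

# Hypothetical stand-in for a column family:
# row key (user) -> {time-based UUID column name -> message fields}.
inboxes = defaultdict(dict)

def send_message(sender, recipients, subject):
    """Write one copy of the message into every recipient's row."""
    message_id = uuid.uuid1()  # version 1 UUIDs embed a timestamp
    for recipient in recipients:
        inboxes[recipient][message_id] = {"sender": sender, "subject": subject}
    return message_id

def list_messages(user, limit=10):
    """Return up to `limit` messages, newest first by embedded timestamp."""
    columns = sorted(inboxes[user].items(), key=lambda kv: kv[0].time, reverse=True)
    return [fields for _, fields in columns[:limit]]

send_message("alice", ["bob", "charlie"], "rock?")
send_message("bob", ["alice", "charlie"], "paper!")
print(list_messages("charlie"))
```

After the two sends, each user's row holds the same messages as on the slide, and listing an inbox reads a single row.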
  43. Demo • Pycassa • Send message • List messages • Unread count
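
The demo code itself is not part of this transcript. As a guess at how an unread count could be kept, the sketch below pairs a per-user counter with a read flag per message. Everything in it is hypothetical: `unread`, `read_flags`, `deliver` and `mark_read` are illustrative names, and a plain Python `Counter` stands in for Cassandra counter columns.

```python
from collections import Counter

unread = Counter()   # hypothetical stand-in for a counter column family, keyed by user
read_flags = {}      # (user, message_id) -> True once the message has been read

def deliver(user, message_id):
    """Record a new unread message and increment the user's counter."""
    read_flags[(user, message_id)] = False
    unread[user] += 1

def mark_read(user, message_id):
    """Decrement the counter only the first time a message is read."""
    if not read_flags[(user, message_id)]:
        read_flags[(user, message_id)] = True
        unread[user] -= 1

deliver("charlie", "m1")
deliver("charlie", "m2")
mark_read("charlie", "m1")
mark_read("charlie", "m1")   # repeated read does not decrement again
print(unread["charlie"])     # 1
```

The guard in `mark_read` is itself a read before a write, so a real design would need to decide how to handle it.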
