
Advanced Data Modeling and Bitmap Indexes


  1. ADVANCED DATA MODELING AND BITMAP INDEXES Matt Stump mstump@kissmetrics.com Monday, May 6, 13
  2. WHO ARE YOUR Customers?
  3. WHERE DO THEY Hang Out?
  4. HOW SHOULD YOU Engage?
  5. What is User Experience?
  6. What is my Data?
  7. Form Follows Function
  8. Data Follows Queries
  9. Primary Key
     CREATE TABLE users (
       username text PRIMARY KEY,
       first_name text,
       last_name text,
       postal_code text,
       last_login timestamp
     );
     INSERT INTO users (username, first_name, last_name, postal_code, last_login)
       VALUES ('cstar', 'Cassandra', 'Database', '11111', '2013-04-04');
     SELECT first_name, last_name FROM users WHERE username = 'cstar';
  10. Primary Key
      RowKey  username  first_name  last_name  postal_code
      cstar   cstar     Cassandra   Database   11111
      user2   user2     Some        Guy        22222
  11. Secondary Index
      CREATE INDEX user_zipcode ON users(postal_code);
      11111  cstar
      22222  user2 user3 user456 ...
  12. Where Secondary Indexes Break
      1. High Cardinality Data
      2. Only one index per query
      3. Indexes are distributed
      4. Only some datatypes; no counters
      5. Range queries are expensive
  13. Roll Your Own Using Wide Rows
      RowKey  05/02/2012  02/01/2013  05/02/2013  ...
      user2   JSON        JSON        JSON        JSON
      All events for “user2” indexed by time
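
The wide-row layout on slide 13 can be approximated in memory as one row per user whose columns stay sorted by event time, so a time-range read is a contiguous slice. This is only a sketch under assumptions: `WideRow`, `insert` and `events_between` are hypothetical names, the JSON payloads are invented, and real Cassandra storage is not modeled.

```python
import bisect
from datetime import date


class WideRow:
    """One row per user; columns (event times) are kept in sorted order."""

    def __init__(self):
        self._times = []   # sorted column names (event dates)
        self._values = []  # JSON payloads, parallel to _times

    def insert(self, when, payload):
        # Find the position that keeps the columns sorted by time.
        i = bisect.bisect_right(self._times, when)
        self._times.insert(i, when)
        self._values.insert(i, payload)

    def events_between(self, start, end):
        """Return payloads with start <= time <= end, in time order."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_right(self._times, end)
        return self._values[lo:hi]


rows = {"user2": WideRow()}
rows["user2"].insert(date(2013, 5, 2), '{"event": "c"}')
rows["user2"].insert(date(2012, 5, 2), '{"event": "a"}')
rows["user2"].insert(date(2013, 2, 1), '{"event": "b"}')

# All of user2's events in 2013, read as one slice of a single row.
events_2013 = rows["user2"].events_between(date(2013, 1, 1), date(2013, 12, 31))
```

The slice only ever touches one row, so a question such as "which users had an event in 2013" would need a separate scan per user.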
  14. Limitations to Rolling Your Own
      1. Can’t query across rows
      2. Only some datatypes; no counters
      3. Requires lots of work in the application
      4. No complex queries
  15. What do I need?
  16. A Query Engine Wishlist
      1. High cardinality data; counters
      2. Complex queries, multiple clauses
      3. Results in < 500ms for billions of rows
      4. Sub-field searching; regex
      5. Range queries
  17. First Iteration: Ginormous String Sets
      11111  cstar
      22222  user2 user3 user456 ...
      11111  22222
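
A rough sketch of the string-set idea on slide 17, assuming one set of usernames per (field, value) pair. The `index` dictionary, the `add` helper and the `last_name` entry are hypothetical; the slides do not show how the sets were stored.

```python
from collections import defaultdict

# (field, value) -> set of usernames that have that value
index = defaultdict(set)


def add(username, field, value):
    index[(field, value)].add(username)


add("cstar", "postal_code", "11111")
add("user2", "postal_code", "22222")
add("user3", "postal_code", "22222")
add("user2", "last_name", "Guy")

# postal_code IN ('11111', '22222') is a set union.
in_either = index[("postal_code", "11111")] | index[("postal_code", "22222")]

# postal_code = '22222' AND last_name = 'Guy' is a set intersection.
both = index[("postal_code", "22222")] & index[("last_name", "Guy")]
```

Every member is stored as a full string, so the sets grow with both the number of users and the length of each username.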
  18. Bitmaps
  19. Bitmaps
  20. Bitmaps: How do they Work?
             0-7       8-15     16-23     24-31
      11111  11010011  1011011  1010000   00000000
      22222  00000000  0011011  00000000  00000000
  21. Bitmaps: Equality
             0-7       8-15     16-23     24-31
      11111  11010011  1011011  1010000   00000000
      22222  00000000  0011011  00000000  00000000
      SELECT * FROM users WHERE postal_code IN ('11111','22222');
                     0-7       8-15     16-23     24-31
      11111 & 22222  00000000  0011011  00000000  00000000
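
The bitmap idea on slides 20 and 21 can be sketched with Python integers, where bit N is set when row N has the value. Row numbers, the `last_name` bitmap and the helper names are assumptions for illustration, and bit 0 is the least significant bit here, which may differ from the layout drawn on the slide. Note that `&` is an intersection, which fits clauses joined by AND; a list such as `IN ('11111','22222')` would normally be a union, written `|`.

```python
def make_bitmap(row_ids):
    """Build a bitmap (a Python int) with one bit set per row id."""
    bitmap = 0
    for row_id in row_ids:
        bitmap |= 1 << row_id
    return bitmap


def row_ids(bitmap):
    """List the positions of the set bits, lowest first."""
    ids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids


# One bitmap per distinct value of each indexed field.
postal_code = {
    "11111": make_bitmap([0, 1, 3, 6, 7]),
    "22222": make_bitmap([10, 11, 13]),
}
last_name = {"Guy": make_bitmap([1, 10, 13, 20])}

# postal_code IN ('11111', '22222'): union of the two value bitmaps.
union = postal_code["11111"] | postal_code["22222"]

# postal_code = '22222' AND last_name = 'Guy': intersection.
intersection = postal_code["22222"] & last_name["Guy"]
```

Both operations cost one machine-level bitwise pass regardless of how long the usernames are, which is the main gain over string sets.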
  22. Bitmaps: Range, or How Do I Query Counters?
      Field   Value  0-7       8-15     16-23     24-31
      Event2  1      11010011  1011011  1010000   00000000
      Event2  4      00000000  0011011  00000000  00000000
             0-7       8-15     16-23     24-31
      1 & 4  00000000  0011011  00000000  00000000
      SELECT * FROM users WHERE Event2 > 0 AND Event2 < 5;
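
One way to read the range slide: keep one bitmap per observed counter value and combine the bitmaps whose value falls inside the range. The sketch below assumes each row has exactly one current value for `Event2` and merges the qualifying bitmaps with OR, since a row matches if its value is any one of the values in range. The sample bitmaps and the `range_query` name are hypothetical.

```python
# Hypothetical data: counter value -> bitmap of rows currently at that value.
# Each row appears in exactly one bitmap.
event2 = {
    1: 0b00001011,  # rows 0, 1, 3
    2: 0b00010000,  # row 4
    4: 0b01100000,  # rows 5, 6
    7: 0b10000100,  # rows 2, 7
}


def range_query(value_bitmaps, low, high):
    """Bitmap of rows whose counter value v satisfies low < v < high."""
    result = 0
    for value, bitmap in value_bitmaps.items():
        if low < value < high:
            result |= bitmap
    return result


# Event2 > 0 AND Event2 < 5
matches = range_query(event2, 0, 5)
```

The cost grows with the number of distinct values inside the range, so wide ranges over high-cardinality counters touch many bitmaps.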
  23. Trigrams; AKA You Promised REGEX
      Field      Value  0-7       8-15     16-23     24-31
      last_name  “foo”  11010011  1011011  1010000   00000000
      last_name  “bar”  00000000  0011011  00000000  00000000
                     0-7       8-15     16-23     24-31
      “foo” & “bar”  00000000  0011011  00000000  00000000
      SELECT * FROM users WHERE last_name ~= 'f.*bar';
      INSERT INTO users (username, first_name, last_name, postal_code, last_login)
        VALUES ('foobar82', 'johnny', 'foobar', '94110', '2013-04-04');
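
A minimal sketch of trigram-assisted regex search, in the spirit of the Google Code Search article listed under Resources: index every three-character substring as a bitmap, intersect the bitmaps of trigrams that any match must contain, then run the real regex only on the surviving rows. The function names and sample rows are hypothetical, and the required trigrams are passed in by hand because deriving them from an arbitrary regex is outside this sketch.

```python
import re
from collections import defaultdict


def trigrams(text):
    """All 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


trigram_index = defaultdict(int)  # trigram -> bitmap of row ids
values = {}                       # row id -> stored last_name


def index_last_name(row_id, last_name):
    values[row_id] = last_name
    for t in trigrams(last_name):
        trigram_index[t] |= 1 << row_id


def search(required_trigrams, pattern):
    """Filter by trigram bitmaps first, then confirm with the regex."""
    candidates = -1  # all bits set; Python ints have arbitrary precision
    for t in required_trigrams:
        candidates &= trigram_index.get(t, 0)
    regex = re.compile(pattern)
    return [
        row_id
        for row_id, name in sorted(values.items())
        if (candidates >> row_id) & 1 and regex.search(name)
    ]


for row_id, name in enumerate(["foobar", "barfoo", "smith", "fubar"]):
    index_last_name(row_id, name)

# Any match of 'f.*bar' must contain the trigram "bar".
hits = search(["bar"], r"f.*bar")
```

Row 1 ("barfoo") passes the trigram filter but fails the regex, which is why the final verification step is needed: the bitmaps only narrow the candidate set.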
  25. Not Everything is Roses and Honey
      1. Indexes can be huge
      2. Requires a read before write
      3. Requires synchronization
  26. Compression
  27. RLE Compression: How it Works
      Header  Fill, 11 blocks of 1s  Literal, 15 bits  Fill, 18 blocks of 0s  Literal, 15 bits
      1010    10000000001011         111010000100101   000000000010010        000000010000011
      Example taken from PWAH: http://www.sjvs.nl/?p=72
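
The fill/literal split on slide 27 can be illustrated with a simplified run-length scheme: cut the bitmap into 15-bit blocks, collapse consecutive all-zero or all-one blocks into a single fill entry with a count, and keep mixed blocks as literals. This is not the PWAH word layout itself (there is no packed header and no fixed word size); the tuple representation and function names are assumptions made for readability.

```python
BLOCK = 15  # bits per block, matching the 15-bit literals on the slide


def compress(bits):
    """bits: a string of '0'/'1'. Returns a list of fill and literal entries."""
    out = []  # ("fill", bit, block_count) or ("literal", block)
    for i in range(0, len(bits), BLOCK):
        block = bits[i:i + BLOCK]
        if block in ("0" * BLOCK, "1" * BLOCK):
            bit = block[0]
            # Extend the previous fill if it repeats the same bit.
            if out and out[-1][0] == "fill" and out[-1][1] == bit:
                out[-1] = ("fill", bit, out[-1][2] + 1)
            else:
                out.append(("fill", bit, 1))
        else:
            out.append(("literal", block))
    return out


def decompress(entries):
    parts = []
    for entry in entries:
        if entry[0] == "fill":
            parts.append(entry[1] * BLOCK * entry[2])
        else:
            parts.append(entry[1])
    return "".join(parts)


# The bitmap described on the slide: 11 blocks of 1s, a literal,
# 18 blocks of 0s, and another literal.
raw = "1" * BLOCK * 11 + "111010000100101" + "0" * BLOCK * 18 + "000000010000011"
entries = compress(raw)
```

Here 465 raw bits reduce to four entries, and the payoff depends entirely on how long the runs of identical blocks are.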
  28. Dealing with Read Before Write
      Partition Index Using a Ring
      { "product": 124, "user": 22, "event": "event2", "value": "Name=Jonathan+Doe&Age=23" }
      Apply Hash to User Configured Field
      hash(:product) = c62fb32eadd5a0fcceb1ddf2697e2345c604f451
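
A sketch of routing events around a ring by hashing a configured field. The slide does not name the hash function; SHA-1 is used here only because its 40-hex-digit digest has the same length as the value shown, and the way the field is serialized before hashing is also a guess. The `Ring` class and node names are hypothetical.

```python
import bisect
import hashlib


def ring_hash(value):
    """Map any value to a position on the ring (SHA-1 is an assumption)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16)


class Ring:
    def __init__(self, nodes):
        # Each node owns the arc of the ring that ends at its own hash.
        self._points = sorted((ring_hash(node), node) for node in nodes)
        self._keys = [point for point, _ in self._points]

    def owner(self, event, field="product"):
        """Return the node responsible for this event's partition field."""
        h = ring_hash(event[field])
        # First node clockwise from h, wrapping around at the end.
        i = bisect.bisect_right(self._keys, h) % len(self._keys)
        return self._points[i][1]


ring = Ring(["index-node-a", "index-node-b", "index-node-c"])
event = {"product": 124, "user": 22, "event": "event2",
         "value": "Name=Jonathan+Doe&Age=23"}
node = ring.owner(event)
```

Because every event for a given product lands on the same node, that node can read and update the product's bitmaps locally without coordinating with other nodes.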
  29. Ring Partitioning
      1. Solves read before write
      2. Solves synchronization issues
      3. Ensures index locality
      4. Easy to isolate big customers
      5. Index size is limited to the largest customer
  30. Sparse Indexes
              Offset 0x00       Offset 0x01       Offset 0xA0       Offset 0xF0
      Field1  0111010101101111  1001010100100101  0111010000100101  0111011100100101
      Only Store the Set Bits
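
A sparse bitmap along the lines of slide 30 might keep only the 16-bit words that contain at least one set bit, keyed by word offset. The dictionary representation and the `SparseBitmap` class are assumptions for illustration; the slides do not describe the on-disk format.

```python
WORD_BITS = 16  # matches the 16-bit words shown on the slide


class SparseBitmap:
    def __init__(self):
        self.words = {}  # word offset -> non-zero 16-bit word

    def set(self, row_id):
        offset, bit = divmod(row_id, WORD_BITS)
        self.words[offset] = self.words.get(offset, 0) | (1 << bit)

    def clear(self, row_id):
        offset, bit = divmod(row_id, WORD_BITS)
        word = self.words.get(offset, 0) & ~(1 << bit)
        if word:
            self.words[offset] = word
        else:
            # Drop words that became empty so only set bits cost space.
            self.words.pop(offset, None)

    def test(self, row_id):
        offset, bit = divmod(row_id, WORD_BITS)
        return bool(self.words.get(offset, 0) & (1 << bit))

    def intersect(self, other):
        out = SparseBitmap()
        # Only offsets present in both bitmaps can produce set bits.
        for offset in self.words.keys() & other.words.keys():
            word = self.words[offset] & other.words[offset]
            if word:
                out.words[offset] = word
        return out


a = SparseBitmap()
a.set(3)
a.set(0xA0 * WORD_BITS + 5)
b = SparseBitmap()
b.set(3)
b.set(0xF0 * WORD_BITS)
common = a.intersect(b)
```

Offsets between 0x01 and 0xA0 take no space at all here, which is the point of storing only the set bits.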
  31. Query & Indexing Engine
      The Whole Enchilada
      Queries and Events
  32. Goals
      1. Core query and index engine, wrapped
      2. Extensible events and queries via Lua
      3. Equality, range and REGEX queries
      4. Distributed, <500ms for billions of rows
      5. No single point of failure
  33. Resources
      Lots of Papers on Bitmap Compression
      http://www-users.cs.umn.edu/~kewu/annotated.html
      How Google Code Search Worked
      http://swtch.com/~rsc/regexp/regexp4.html
  34. GOT ANY Questions?
  35. Thanks
      Eric Tschetter of the Druid Project and Cassandra Devs for answering my questions
  36. THANK YOU! Matt Stump www.matthewstump.com @mattstump
