ADVANCED DATA MODELING AND BITMAP INDEXES
Matt Stump
mstump@kissmetrics.com
Monday, May 6, 13
Who Are Your Customers?
Where Do They Hang Out?
How Should You Engage?
What is User Experience?
What Is My Data?
Form Follows Function
Data Follows Queries
Primary Key
CREATE TABLE users (
    username text PRIMARY KEY,
    first_name text,
    last_name text,
    postal_code text,
    last_login timestamp);
INSERT INTO users
    (username, first_name, last_name, postal_code, last_login)
    VALUES ('cstar', 'Cassandra', 'Database', '11111', '2013-4-4');
SELECT first_name, last_name
    FROM users WHERE username = 'cstar';
Primary Key
RowKey  username  first_name  last_name  postal_code
cstar   cstar     Cassandra   Database   11111
user2   user2     Some        Guy        22222
Secondary Index
CREATE INDEX user_zipcode ON users(postal_code);
postal_code  usernames
11111        cstar
22222        user2, user3, user456, ...
Where Secondary Indexes Break
1. High cardinality data
2. Only one index per query
3. Indexes are distributed
4. Only some datatypes; no counters
5. Range queries are expensive
Roll Your Own Using Wide Rows
RowKey  05/02/2012  02/01/2013  05/02/2013  ...
user2   JSON        JSON        JSON        JSON
All events for “user2” indexed by time
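As a rough illustration of the wide-row layout above, the sketch below keeps one row per user with its columns sorted by event time, so a time-range read becomes a slice. A plain Python dict stands in for Cassandra, the class and method names are hypothetical, and timestamps are written as ISO strings (an assumption) so that string order matches time order.

```python
import bisect
import json
from collections import defaultdict


class WideRowStore:
    """In-memory stand-in for a wide row: row key -> columns sorted by timestamp."""

    def __init__(self):
        # row_key -> sorted list of (timestamp, json_text)
        self._rows = defaultdict(list)

    def insert(self, row_key, timestamp, event):
        # Keep columns ordered by timestamp, as a wide row orders its column names.
        bisect.insort(self._rows[row_key], (timestamp, json.dumps(event, sort_keys=True)))

    def slice(self, row_key, start, end):
        """Return the events of one row with start <= timestamp <= end."""
        columns = self._rows.get(row_key, [])
        lo = bisect.bisect_left(columns, (start, ""))
        hi = bisect.bisect_right(columns, (end, chr(0x10FFFF)))
        return [json.loads(text) for _, text in columns[lo:hi]]


store = WideRowStore()
store.insert("user2", "2013-05-02", {"event": "login"})
store.insert("user2", "2012-05-02", {"event": "signup"})
store.insert("user2", "2013-02-01", {"event": "purchase"})
events_2013 = store.slice("user2", "2013-01-01", "2013-12-31")
```

The slice only ever touches a single row key, which is the same restriction the next slide lists as "can't query across rows".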
Limitations to Rolling Your Own
1. Can’t query across rows
2. Only some datatypes; no counters
3. Requires lots of work in the application
4. No complex queries
What Do I Need?
A Query Engine Wishlist
1. High cardinality data; counters
2. Complex queries, multiple clauses
3. Results in < 500ms for billions of rows
4. Sub-field searching; regex
5. Range queries
First Iteration: Ginormous String Sets
11111  cstar
22222  user2, user3, user456, ...
(diagram: the sets for 11111 and 22222)
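A minimal sketch of the string-set idea, assuming one in-memory set of usernames per indexed value (the `index` dict and `add` helper are hypothetical names). Multi-clause queries then reduce to set union and intersection.

```python
from collections import defaultdict

# value -> set of usernames holding that value (one entry per indexed postal code)
index = defaultdict(set)


def add(username, postal_code):
    index[postal_code].add(username)


for name, code in [("cstar", "11111"), ("user2", "22222"),
                   ("user3", "22222"), ("user456", "22222")]:
    add(name, code)

either = index["11111"] | index["22222"]   # users in either postal code
both = index["11111"] & index["22222"]     # users in both (empty here)
```

Every member is stored as a full string, so the sets grow with both the number of users and the length of each username.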
Bitmaps
Bitmaps: How do they Work?
Value  0-7       8-15      16-23     24-31
11111  11010011  1011011   1010000   00000000
22222  00000000  0011011   00000000  00000000
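One way to picture the table above: each distinct value gets its own bitmap, and bit i is set when row i holds that value. The sketch below uses Python integers as bitmaps; the dense integer row ids and the `BitmapIndex` class are assumptions made for illustration.

```python
from collections import defaultdict


class BitmapIndex:
    """One bitmap (a Python int) per distinct value; bit i is set if row i has the value."""

    def __init__(self):
        self.bitmaps = defaultdict(int)

    def set(self, value, row_id):
        self.bitmaps[value] |= 1 << row_id

    def rows(self, bitmap):
        """Decode a bitmap back into the sorted list of row ids."""
        return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]


postal = BitmapIndex()
for row_id, code in enumerate(["11111", "22222", "22222", "11111"]):
    postal.set(code, row_id)

rows_11111 = postal.rows(postal.bitmaps["11111"])
rows_22222 = postal.rows(postal.bitmaps["22222"])
```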
Bitmaps: Equality
Value          0-7       8-15      16-23     24-31
11111          11010011  1011011   1010000   00000000
22222          00000000  0011011   00000000  00000000
SELECT * FROM users WHERE postal_code IN ('11111','22222');
Value          0-7       8-15      16-23     24-31
11111 & 22222  00000000  0011011   00000000  00000000
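The slide above draws the combination as `&` (intersection). As a sketch of how bitwise operations answer such queries, the block below treats `IN (...)` as a union (`|`) and uses `&` to combine clauses on different fields. That reading, the `active` bitmap, and the row numbering are assumptions, not taken from the talk.

```python
# Hypothetical bitmaps: bit i set means row i has that postal code.
bitmaps = {
    "11111": 0b1001,  # rows 0 and 3
    "22222": 0b0110,  # rows 1 and 2
}
active = 0b0101  # hypothetical second predicate: rows 0 and 2 are active users


def rows(bitmap):
    """Decode a bitmap into the sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]


in_either = bitmaps["11111"] | bitmaps["22222"]   # postal_code IN ('11111','22222')
in_11111_and_active = bitmaps["11111"] & active   # postal_code = '11111' AND active
```

Either operation is a single pass over machine words, regardless of how long the usernames behind each row id are.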
Bitmaps: Range, or How Do I Query Counters?
Field   Value  0-7       8-15      16-23     24-31
Event2  1      11010011  1011011   1010000   00000000
Event2  4      00000000  0011011   00000000  00000000
Value          0-7       8-15      16-23     24-31
1 & 4          00000000  0011011   00000000  00000000
SELECT * FROM users WHERE Event2 > 0 AND Event2 < 5;
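A hedged sketch of one way to answer a range query over a counter field: keep one bitmap per observed counter value and OR together the bitmaps whose value falls inside the range. The `event2` data and the `range_query` helper are hypothetical, and the talk may combine bitmaps differently.

```python
from functools import reduce
from operator import or_

# Hypothetical counter index for field Event2: counter value -> bitmap of rows.
event2 = {
    1: 0b00011,  # rows 0 and 1 have Event2 == 1
    4: 0b00100,  # row 2 has Event2 == 4
    7: 0b01000,  # row 3 has Event2 == 7
}


def range_query(index, low, high):
    """Bitmap of rows whose value v satisfies low < v < high."""
    matching = (bitmap for value, bitmap in index.items() if low < value < high)
    return reduce(or_, matching, 0)


result = range_query(event2, 0, 5)  # Event2 > 0 AND Event2 < 5
```

The cost grows with the number of distinct values inside the range, not with the number of rows.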
Trigrams; AKA You Promised REGEX
Field      Value  0-7       8-15      16-23     24-31
last_name  "foo"  11010011  1011011   1010000   00000000
last_name  "bar"  00000000  0011011   00000000  00000000
Value             0-7       8-15      16-23     24-31
"foo" & "bar"     00000000  0011011   00000000  00000000
SELECT * FROM users WHERE last_name ~= 'f.*bar';
INSERT INTO users
    (username, first_name, last_name, postal_code, last_login)
    VALUES ('foobar82', 'johnny', 'foobar', '94110', '2013-4-4');
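The trigram approach can be sketched as follows: index every 3-character substring of a field into its own bitmap, AND the bitmaps of the trigrams a pattern must contain, then run the real regex only on the surviving candidates. The `TrigramIndex` class is hypothetical, and the required trigrams are passed in by hand here; deriving them from a regex automatically is outside this sketch. The example pattern is `foo.*bar`, chosen because it guarantees both trigrams.

```python
import re
from collections import defaultdict


def trigrams(text):
    """All 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self):
        self.values = []                   # row id -> indexed string
        self.postings = defaultdict(int)   # trigram -> bitmap of row ids

    def add(self, value):
        row_id = len(self.values)
        self.values.append(value)
        for gram in trigrams(value):
            self.postings[gram] |= 1 << row_id

    def search(self, pattern, required):
        """AND the bitmaps of the required trigrams, then verify with the real regex."""
        candidates = (1 << len(self.values)) - 1
        for gram in required:
            candidates &= self.postings.get(gram, 0)
        regex = re.compile(pattern)
        return [v for i, v in enumerate(self.values)
                if candidates >> i & 1 and regex.search(v)]


names = TrigramIndex()
for last_name in ["foobar", "barfoo", "football", "database"]:
    names.add(last_name)

matches = names.search(r"foo.*bar", required=["foo", "bar"])
```

Here "barfoo" passes the bitmap filter because it contains both trigrams, and is only rejected by the final regex check, which is why the verification step cannot be skipped.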
Not Everything is Roses and Honey
1. Indexes can be huge
2. Requires a read before write
3. Requires synchronization
Compression
RLE Compression: How it Works
Partition              Bits
Header                 1010
Fill, 11 blocks of 1s  10000000001011
Literal, 15 bits       111010000100101
Fill, 18 blocks of 0s  000000000010010
Literal, 15 bits       000000010000011
Example taken from PWAH: http://www.sjvs.nl/?p=72
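To make the fill and literal idea concrete, here is a simplified sketch, not the PWAH wire format: the bitmap is cut into 15-bit blocks, runs of all-0 or all-1 blocks collapse into one fill token with a count, and any other block is kept as a literal. The token tuples and function names are assumptions for illustration.

```python
from itertools import groupby

BLOCK = 15  # bits per block, matching the 15-bit partitions in the slide


def compress(bits):
    """Encode a '0'/'1' string as ('fill', bit, n_blocks) and ('literal', block) tokens."""
    if len(bits) % BLOCK != 0:
        raise ValueError("this sketch assumes a whole number of blocks")
    blocks = [bits[i:i + BLOCK] for i in range(0, len(bits), BLOCK)]
    tokens = []
    for block, run in groupby(blocks):
        count = len(list(run))
        if block in ("0" * BLOCK, "1" * BLOCK):
            tokens.append(("fill", block[0], count))
        else:
            tokens.extend(("literal", block) for _ in range(count))
    return tokens


def decompress(tokens):
    """Expand the tokens back into the original bit string."""
    out = []
    for token in tokens:
        if token[0] == "fill":
            out.append(token[1] * BLOCK * token[2])
        else:
            out.append(token[1])
    return "".join(out)


bits = "1" * BLOCK * 11 + "111010000100101" + "0" * BLOCK * 18 + "000000010000011"
tokens = compress(bits)
```

The 465 input bits shrink to four tokens, mirroring the two fills and two literals in the slide's example.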
Dealing with Read Before Write
Partition Index Using a Ring
{
    "product": 124,
    "user": 22,
    "event": "event2",
    "value": "Name=Jonathan+Doe&Age=23"
}
Apply Hash to User-Configured Field
hash(:product) = c62fb32eadd5a0fcceb1ddf2697e2345c604f451
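A minimal sketch of routing events around a ring by hashing a configured field. The slide does not name the hash function; SHA-1 is assumed here only because the digest shown is 40 hex characters long. The node names and the `Ring` class are hypothetical.

```python
import bisect
import hashlib


def ring_position(value):
    """Map a field value to a 160-bit ring position (SHA-1 is an assumption here)."""
    return int(hashlib.sha1(str(value).encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # Each node owns the arc of the ring that ends at its own position.
        self._points = sorted((ring_position(n), n) for n in nodes)
        self._keys = [p for p, _ in self._points]

    def owner(self, event, field="product"):
        """Return the node responsible for this event's partition field."""
        position = ring_position(event[field])
        i = bisect.bisect_left(self._keys, position) % len(self._points)
        return self._points[i][1]


ring = Ring(["index-node-a", "index-node-b", "index-node-c"])
event = {"product": 124, "user": 22, "event": "event2",
         "value": "Name=Jonathan+Doe&Age=23"}
node = ring.owner(event)
```

Because every event for a given product lands on the same node, that node can hold the product's bitmaps and apply updates locally.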
Ring Partitioning
1. Solves read before write
2. Solves synchronization issues
3. Ensures index locality
4. Easy to isolate big customers
5. Index size is limited to the largest customer
Sparse Indexes
        Offset 0x00       Offset 0x01       Offset 0xA0       Offset 0xF0
Field1  0111010101101111  1001010100100101  0111010000100101  0111011100100101
Only Store the Set Bits
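One possible reading of the sparse layout above, as a sketch: store only the 16-bit words that contain at least one set bit, keyed by word offset, and skip everything else. The `SparseBitmap` class and the dict-based storage are assumptions, not the talk's implementation.

```python
WORD_BITS = 16  # word width matching the 16-bit blocks shown in the slide


class SparseBitmap:
    """Stores only the words that contain at least one set bit, keyed by word offset."""

    def __init__(self):
        self.words = {}  # word offset -> 16-bit integer

    def set(self, bit):
        offset, position = divmod(bit, WORD_BITS)
        self.words[offset] = self.words.get(offset, 0) | (1 << position)

    def test(self, bit):
        offset, position = divmod(bit, WORD_BITS)
        return bool(self.words.get(offset, 0) >> position & 1)

    def __and__(self, other):
        result = SparseBitmap()
        # Only offsets present on both sides can produce set bits.
        for offset in self.words.keys() & other.words.keys():
            word = self.words[offset] & other.words[offset]
            if word:  # drop words that became all zero
                result.words[offset] = word
        return result


a = SparseBitmap()
b = SparseBitmap()
for bit in (3, 17, 0xA0 * WORD_BITS + 5):
    a.set(bit)
for bit in (17, 0xF0 * WORD_BITS + 1):
    b.set(bit)
both = a & b
```

The intersection visits only offsets stored on both sides, so long empty stretches cost neither space nor time.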
The Whole Enchilada
(diagram labels: Queries and Events; Query & Indexing Engine)
Goals
1. Core query and index engine, wrapped
2. Extensible events and queries via Lua
3. Equality, range and REGEX queries
4. Distributed, <500ms for billions of rows
5. No single point of failure
Resources
Lots of Papers on Bitmap Compression
http://www-users.cs.umn.edu/~kewu/annotated.html
How Google Code Search Worked
http://swtch.com/~rsc/regexp/regexp4.html
Got Any Questions?
Thanks
Eric Tschetter of the Druid Project and the Cassandra devs, for answering my questions
Thank You!
Matt Stump
www.matthewstump.com
@mattstump
