Bit Vectors Siddhesh

Introduction to Bit Vectors

Published in: Technology

  1. 1. Processing Queries with Bit-Vector Indexes. Originally presented by Anand Deshpande.
  2. 2. Motivation <ul><li>Consider the following SQL query: </li></ul><ul><li>SELECT name, address FROM students WHERE Dept = ‘CSCI’ AND Hostel = ‘H2’ </li></ul><ul><li>To process this query, either </li></ul><ul><ul><li>use a complete scan, or </li></ul></ul><ul><ul><li>use an index </li></ul></ul>
  3. 3. Example students table. Columns: RID, Name, Hostel, Gender, Dept. Rows: 14 (RIDs 1 to 14). Names: Abhay Athavale, Bina Bajaj, Chinmay Chatterjee, David DeMillo, Era Edke, Frank Fernandez, Gauri Gaikwad, Hari Hate, Indira Irani, Jaya Joshi, Kader Khan, Leo Lobo, Meera Malik, Naresh Naik. Dept values: CS (6 rows), EE (3), ME (3), AE (2). Gender values: M (8), F (6). Hostel values: H1 (4), H2 (4), H3 (3), H4 (3).
  4. 4. Using an Index to Process Queries <ul><li>Find all records (rids) that match </li></ul><ul><ul><li>Dept = ‘CS’ </li></ul></ul><ul><ul><li>Hostel = ‘H2’ </li></ul></ul><ul><li>Intersect the two sets of rids </li></ul><ul><li>Given each rid, fetch the name and address </li></ul><ul><li>In the presence of an index: </li></ul><ul><ul><li>FIND takes logarithmic time, O(log(N)) </li></ul></ul><ul><ul><li>intersecting two sorted rid lists takes O(n1 + n2), where n1 and n2 are the lengths of the lists </li></ul></ul>
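The O(n1 + n2) intersection step can be sketched as a two-pointer merge over sorted rid lists. This is a minimal illustration, not the presenter's code; the function name `intersect_sorted` is made up here, and the two rid lists are the ones shown on the Processing Queries slide.

```python
def intersect_sorted(a, b):
    """Intersect two ascending rid lists in O(len(a) + len(b)) time."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # rid present in both lists: keep it and advance both cursors
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


dept_cs = [1, 5, 7, 8, 12, 14]   # rids with Dept = CS
hostel_h2 = [6, 8, 11, 12]       # rids with Hostel = H2
matching_rids = intersect_sorted(dept_cs, hostel_h2)
```

Each cursor only moves forward, so the loop does at most n1 + n2 steps.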
  5. 5. Processing Queries (on the example students table). Dept = CS: {1, 5, 7, 8, 12, 14}. Hostel = H2: {6, 8, 11, 12}. Dept = CS AND Hostel = H2: {8, 12}.
  6. 6. What does an Index do? <ul><li>Index provides a mapping from Value to a Set of Records (RIDs) </li></ul><ul><li>Given a value -- tell me records that have that value </li></ul><ul><li>Various kinds of indices </li></ul><ul><ul><li>B-Tree </li></ul></ul><ul><ul><li>Hash Index </li></ul></ul><ul><ul><li>R-Tree </li></ul></ul>
  7. 7. B-Tree Index for Department, mapping each value to its list of RIDs. AE: {4, 6}. CS: {1, 5, 7, 8, 12, 14}. EE: {2, 9, 11}. ME: {3, 10, 13}.
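The value-to-RID-list mapping can be illustrated with a sorted key array and binary search. This is only a stand-in for a B-Tree (an assumption made for brevity; a real B-Tree is a multi-level, page-oriented structure), and the names `keys`, `rid_lists` and `find` are hypothetical. The data is the Department index from the slide above.

```python
from bisect import bisect_left

# Sorted keys play the role of the B-Tree's ordered value structure.
keys = ["AE", "CS", "EE", "ME"]
rid_lists = [[4, 6], [1, 5, 7, 8, 12, 14], [2, 9, 11], [3, 10, 13]]


def find(value):
    """Return the RID list for value, or [] if the value is not indexed."""
    pos = bisect_left(keys, value)  # O(log n) search over the sorted keys
    if pos < len(keys) and keys[pos] == value:
        return rid_lists[pos]
    return []
```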
  8. 8. Index Architecture [Diagram: the students table (columns RID, Hostel, Gender, Dept) with value-based indexes (B-Tree), each pointing to lists of RIDs]
  9. 9. Selectivity of Domains <ul><li>A domain is strongly selective if the number of rids for each value is small </li></ul><ul><ul><li>example: primary key </li></ul></ul><ul><ul><li>only 1 rid for each value </li></ul></ul><ul><li>A domain is weakly selective if the number of rids for each value is large </li></ul><ul><ul><li>example: gender (male/female) </li></ul></ul><ul><ul><li>about 0.5 * table_size rids for each value </li></ul></ul>
  10. 10. Motivating Bit-Vectors <ul><li>In queries with constraints on many weakly selective domains, rid intersection costs dominate the cost equations. </li></ul><ul><li>AND-ing and OR-ing bit-vectors is an efficient alternative to intersecting and unioning rid sets </li></ul>
  11. 11. Bit-Vector Representation of Sets
      RID    1  2  3  4  5  6  7  8  9 10 11 12
      Score  5  3  7  4  3  2  4  6  2  0  5  4
      RIDs per score value: 0: {10}; 1: {}; 2: {6, 9}; 3: {2, 5}; 4: {4, 7, 12}; 5: {1, 11}; 6: {8}; 7: {3}
      Score between 4 and 6: S4 U S5 U S6, computed as V4 or V5 or V6
      [Bit matrix: one bit-vector per score value 0 to 7 with one bit per RID; a bit is 1 where the RID has that score]
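The set-to-bit-vector mapping and the range query can be sketched as follows. This is a minimal illustration that uses Python integers as bit-vectors, with bit (rid - 1) standing for RID rid; the helper names `build_equality_vectors` and `to_rids` are assumptions, and the scores are the ones from the slide.

```python
scores = [5, 3, 7, 4, 3, 2, 4, 6, 2, 0, 5, 4]  # score of RID 1..12


def build_equality_vectors(values, domain):
    """One int per domain value; bit (rid - 1) is set when the RID has that value."""
    vectors = {v: 0 for v in domain}
    for rid, value in enumerate(values, start=1):
        vectors[value] |= 1 << (rid - 1)
    return vectors


def to_rids(vector):
    """List the 1-based RIDs whose bits are set."""
    return [i + 1 for i in range(vector.bit_length()) if (vector >> i) & 1]


V = build_equality_vectors(scores, range(8))

# Score between 4 and 6: V4 or V5 or V6
between_4_and_6 = V[4] | V[5] | V[6]
```

With this encoding a range query ORs one vector per value in the range, so wide ranges touch many vectors.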
  12. 12. Range Encoded Bit-Vectors. Vk has a 1 for every RID with Score <= k, so V7 is all 1s.
      Score = 4: V4 AND NOT V3
      Score <= 4: V4
      Score >= 2 and Score <= 4: V4 AND NOT V1
      RID    1  2  3  4  5  6  7  8  9 10 11 12
      Score  5  3  7  4  3  2  4  6  2  0  5  4
      [Bit matrix: one range-encoded bit-vector per score value 0 to 7 with one bit per RID]
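A sketch of range encoding on the same score data, assuming the convention that vector k marks every RID with Score <= k (which is what "Score <= 4 is V4" implies). Python integers serve as bit-vectors, and the names `build_range_vectors` and `to_rids` are hypothetical. Because the vector for k - 1 is always a subset of the vector for k, removing it with AND NOT isolates the rows in between.

```python
scores = [5, 3, 7, 4, 3, 2, 4, 6, 2, 0, 5, 4]  # score of RID 1..12


def build_range_vectors(values, max_value):
    """R[k] has bit (rid - 1) set when the RID's value is <= k."""
    vectors = [0] * (max_value + 1)
    for rid, value in enumerate(values, start=1):
        for k in range(value, max_value + 1):
            vectors[k] |= 1 << (rid - 1)
    return vectors


def to_rids(vector):
    """List the 1-based RIDs whose bits are set."""
    return [i + 1 for i in range(vector.bit_length()) if (vector >> i) & 1]


R = build_range_vectors(scores, 7)

score_eq_4 = R[4] & ~R[3]     # Score = 4
score_le_4 = R[4]             # Score <= 4
score_2_to_4 = R[4] & ~R[1]   # Score >= 2 and Score <= 4
```

Any range predicate needs at most two vectors here, at the cost of denser vectors than the equality encoding.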
  13. 13. Bit-Vectors [Diagram: the students table (columns RID, Hostel, Gender, Dept) with one bit-vector per value of each indexed column. Dept: AE, CS, EE, ME. Hostel: H1, H2, H3, H4. Gender: M, F]
  14. 14. Merging RecordIds <ul><li>SELECT name, address FROM students WHERE Dept = ‘CS’ AND Hostel = ‘H2’ </li></ul><ul><li>Which is better? </li></ul><ul><ul><li>Bit-wise AND or record-id intersection </li></ul></ul><ul><li>It depends on how many records are in each set </li></ul><ul><ul><li>if the sets are very small, record-id intersection will be faster than bit-wise AND </li></ul></ul>
  15. 15. Processing Queries [Diagram: Dept = CSCI AND Hostel = H2 evaluated three ways: bit-vector with bit-vector giving a bit-vector; record-set with record-set giving a record-set; and bit-vector with record-set, with a convert step between the two representations]
  16. 16. Processing Queries <ul><li>Convert from bit-vector to record-ids and vice versa </li></ul><ul><li>For a given record-id, probe into the bit-vector </li></ul><ul><li>Fast counting of bits to get counts; extends to sum and average </li></ul><ul><li>Skip empty blocks </li></ul><ul><li>Deal with NULL values </li></ul>
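The conversion, probe and counting operations listed above can be sketched with Python integers as bit-vectors (bit rid - 1 stands for record-id rid). All four function names are hypothetical, and block skipping and NULL handling are left out of this sketch.

```python
def rids_to_bitvector(rids):
    """Set bit (rid - 1) for every record-id in rids."""
    vector = 0
    for rid in rids:
        vector |= 1 << (rid - 1)
    return vector


def bitvector_to_rids(vector):
    """Return the ascending list of record-ids whose bits are set."""
    rids = []
    position = 1
    while vector:
        if vector & 1:
            rids.append(position)
        vector >>= 1
        position += 1
    return rids


def probe(vector, rid):
    """True when the record-id is a member of the set."""
    return ((vector >> (rid - 1)) & 1) == 1


def count(vector):
    """Number of set bits, i.e. the size of the record set."""
    return bin(vector).count("1")


hostel_h2 = rids_to_bitvector([6, 8, 11, 12])
```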
  17. 17. N-way ANDs and ORs with an Early Exit Strategy. Example: Dept = CSCI AND Hostel = H2 AND (Age = 19 OR Age = 20 OR Age = 21 OR Age = 22)
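One way to read the early exit strategy: an N-way AND can stop as soon as the running result is empty, and an N-way OR can stop once every row is already selected. The sketch below applies that to whole vectors, using Python integers as bit-vectors; a real engine would more likely apply it block by block. The function names and the age rid sets are hypothetical, while the Dept and Hostel rid sets come from the earlier Processing Queries slide.

```python
def from_rids(rids):
    """Build a bit-vector with bit (rid - 1) set for each rid."""
    vector = 0
    for rid in rids:
        vector |= 1 << (rid - 1)
    return vector


def and_all(vectors):
    """AND the vectors together, stopping once no candidate row remains."""
    vectors = list(vectors)
    if not vectors:
        raise ValueError("need at least one bit-vector")
    result = vectors[0]
    for vector in vectors[1:]:
        if result == 0:
            break  # early exit: nothing can survive further ANDs
        result &= vector
    return result


def or_all(vectors, full_mask):
    """OR the vectors together, stopping once every row is selected."""
    result = 0
    for vector in vectors:
        result |= vector
        if result == full_mask:
            break  # early exit: further ORs cannot add rows
    return result


ALL_ROWS = (1 << 14) - 1
dept = from_rids([1, 5, 7, 8, 12, 14])
hostel = from_rids([6, 8, 11, 12])
# Hypothetical age vectors, made up for this example.
ages = [from_rids([2, 8]), from_rids([5, 12]), from_rids([1]), from_rids([14])]

result = and_all([dept, hostel, or_all(ages, ALL_ROWS)])
```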
  18. 18. Where are Bit-Vectors good? <ul><li>Equality predicates </li></ul><ul><ul><li>select * from customer where state = ‘CA’ </li></ul></ul><ul><li>AND predicates </li></ul><ul><ul><li>select * from customer where state = ‘CA’ and gender = ‘F’ </li></ul></ul><ul><li>OR predicates </li></ul><ul><ul><li>select * from customer where state = ‘CA’ or state is NULL </li></ul></ul><ul><li>Queries with negation </li></ul><ul><ul><li>select * from customer where state <> ‘CA’ and age between 30 and 40 </li></ul></ul>
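Negation needs a little care with bit-vectors: the complement must be limited to rows that exist, and under SQL semantics a NULL state does not satisfy state <> ‘CA’. The sketch below assumes a separate bit-vector marking NULLs; the 8-row table, the vectors and the function name `not_equal` are all hypothetical.

```python
N = 8                      # hypothetical table of 8 rows
ALL_ROWS = (1 << N) - 1    # mask with one bit per existing row

state_ca = 0b00100101      # rows 1, 3, 6 have state = 'CA'
state_null = 0b10000000    # row 8 has a NULL state


def not_equal(value_vector, null_vector, all_rows):
    """Rows whose value is known and differs from the indexed value."""
    return all_rows & ~value_vector & ~null_vector


state_not_ca = not_equal(state_ca, state_null, ALL_ROWS)   # rows 2, 4, 5, 7
state_ca_or_null = state_ca | state_null                   # rows 1, 3, 6, 8
```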
  19. 19. Aggregate Queries <ul><li>select count(*) from customer </li></ul><ul><li>select count(age) from customer where state = ‘CA’ </li></ul><ul><li>select state, count(*) from customer group by state </li></ul>
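These counts can be answered from the bit-vectors alone by counting set bits, without touching the rows. The sketch below uses a made-up 8-row customer table with two states and one NULL age; all names and data are hypothetical, and it assumes no deleted rows and no NULL states.

```python
def popcount(vector):
    """Number of set bits in the bit-vector."""
    return bin(vector).count("1")


N = 8  # hypothetical row count
state_vectors = {"CA": 0b00100101, "NY": 0b11011010}  # every row is CA or NY
age_null = 0b00000100                                  # row 3 has a NULL age

# select count(*) from customer
count_star = N

# select count(age) from customer where state = 'CA'
# count(age) skips NULL ages, so mask them out before counting.
count_age_ca = popcount(state_vectors["CA"] & ~age_null)

# select state, count(*) from customer group by state
group_counts = {state: popcount(v) for state, v in state_vectors.items()}
```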
  20. 20. Metrics for Indices <ul><li>Effective Access (for queries) </li></ul><ul><ul><li>point query -- age = 29 </li></ul></ul><ul><ul><li>one sided range -- age > 21 </li></ul></ul><ul><ul><li>range of values -- age between 21 and 28 </li></ul></ul><ul><li>Inserts, Deletes and Updates on records </li></ul><ul><li>Size of the Index </li></ul><ul><li>Block Inserts </li></ul><ul><li>Index Creation Time </li></ul>
  21. 21. Bit-Vector Indices <ul><li>The structure that maps value to record-id is the same </li></ul><ul><li>The Record List Area stores bit-vectors rather than record lists </li></ul>
  22. 22. Comparing Space Requirements <ul><li>Consider a table with N (1M) records </li></ul><ul><li>Consider an index on a domain with n (100) distinct values </li></ul><ul><li>The value-based index is identical in both cases; only the record list area differs </li></ul>
  23. 23. Calculating Space <ul><li>Record-Ids </li></ul><ul><ul><li>1 million record ids </li></ul></ul><ul><ul><li>32 bits * 1M records </li></ul></ul><ul><ul><li>32M bits </li></ul></ul><ul><ul><li>1M words </li></ul></ul><ul><ul><li>N words </li></ul></ul><ul><li>Per value </li></ul><ul><ul><li>100 values, (1,000,000/100 = 10,000) rids per value </li></ul></ul><ul><li>Bit-Vectors </li></ul><ul><ul><li>1 million bits per value </li></ul></ul><ul><ul><li>100 values </li></ul></ul><ul><ul><li>100 * 1,000,000 bits = 100M bits </li></ul></ul><ul><ul><li>~3.1M words (100M bits / 32) </li></ul></ul><ul><ul><li>N * n/32 words </li></ul></ul><ul><ul><li>n: number of distinct values </li></ul></ul><ul><ul><li>N: number of records </li></ul></ul>
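The arithmetic above can be checked directly. The sketch assumes 32-bit words and uncompressed bit-vectors, as on the slide; the variable names are chosen for this example.

```python
WORD_BITS = 32
N = 1_000_000   # number of records
n = 100         # number of distinct values in the indexed domain

# Record-id lists: every record appears in exactly one list, one word per rid.
rid_list_words = N

# Bit-vectors: one N-bit vector per distinct value.
bitvector_bits = n * N
bitvector_words = bitvector_bits // WORD_BITS

ratio = bitvector_words / rid_list_words
```

Here the bit-vectors take 3,125,000 words against 1,000,000 for the rid lists. Since the ratio is n / 32, uncompressed bit-vectors are the smaller of the two only when the domain has fewer than 32 distinct values.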
  24. 24. Can we do better? <ul><li>For small domains (fewer than 32 distinct values) bit-vectors are space efficient </li></ul><ul><li>For large domains, bit-vectors are sparse </li></ul><ul><li>For very large domains, record-id lists are the most compact representation </li></ul><ul><li>Small domains: bit-vectors. Medium domains: compressed bit-vectors. Large domains: record-ids </li></ul>
  25. 25. Handling Skew <ul><li>For many domains, a large portion of the records share a few distinct values </li></ul><ul><li>Even when the number of unique values is large, some domains are still candidates for bit-vectors </li></ul><ul><li>Compress bits to reduce space </li></ul><ul><li>Select the encoding strategy dynamically </li></ul>
  26. 26. Compression <ul><li>Compressing bit-streams of 1s </li></ul><ul><li>Run-length encoding </li></ul><ul><ul><li>111100001111 becomes (1:4:9:4): a run of four 1s starting at position 1 and a run of four 1s starting at position 9 </li></ul></ul><ul><ul><li>works well with long runs </li></ul></ul><ul><ul><li>for very large blocks of zeros, don’t store anything </li></ul></ul><ul><li>Operations must treat each run as a single object </li></ul>
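A minimal run-length sketch, assuming the (1:4:9:4) notation means (start, length) pairs for runs of 1s with 1-based positions. That reading fits the slide's example but is an interpretation, and the function names are hypothetical. Zeros are never stored; they are implied by the gaps between runs.

```python
def encode_runs(bits):
    """Return [(start, length), ...] for each run of '1's; positions are 1-based."""
    runs = []
    start = None
    for position, bit in enumerate(bits, start=1):
        if bit == "1" and start is None:
            start = position                         # a new run of 1s begins
        elif bit != "1" and start is not None:
            runs.append((start, position - start))   # the run ended just before here
            start = None
    if start is not None:
        runs.append((start, len(bits) - start + 1))  # run reaches the end of the stream
    return runs


def decode_runs(runs, length):
    """Rebuild the bit string of the given length from (start, length) runs."""
    bits = ["0"] * length
    for start, run_length in runs:
        for position in range(start, start + run_length):
            bits[position - 1] = "1"
    return "".join(bits)


runs = encode_runs("111100001111")
```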
  27. 27. Inserts/Deletes/Updates <ul><li>A delete or insert may require toggling a bit </li></ul><ul><li>However, if the number of rows increases, every bitmap needs to be extended </li></ul><ul><li>Alternative: map bits to blocks rather than to rows </li></ul><ul><ul><li>shrinks the bit-vector and sets more bits, so better compression is possible </li></ul></ul><ul><ul><li>results are no longer exact, so this can’t handle NOT, NULL, etc. </li></ul></ul>
  28. 28. Star Joins <ul><li>Use bitmap indices as access paths for the fact table. </li></ul><ul><li>Generate new subqueries as sources of keys to drive the bitmap accesses </li></ul><ul><li>OR the bitmaps for each dimension. Then AND the ORed bitmaps. </li></ul><ul><li>The new subqueries may obviate the need for some of the joins </li></ul><ul><li>Reject the transformed plan if another plan is better </li></ul><ul><li>[O’Neil and Graefe] </li></ul>
  29. 29. Star Join Queries <ul><li>Select c.Name, s.Price From Employee e, Product p, Customer c, Sales s Where e.EmpId = s.EmpId and p.ProductId = s.ProductId and c.CustomerId = s.CustomerId and c.Income > 100000 and p.Supplier = ‘IBM’ and e.Department = ‘Sales’ </li></ul>
  30. 30. Rewriting the Query <ul><li>Select c.Name, s.Price From Employee e, Product p, Customer c, Sales s Where e.EmpId = s.EmpId and p.ProductId = s.ProductId and c.CustomerId = s.CustomerId and c.Income > 100000 and p.Supplier = ‘IBM’ and e.Department = ‘Sales’ and s.ProductId in (select p.ProductId from Product p where p.Supplier = ‘IBM’) and s.EmpId in (select e.EmpId from Employee e where e.Department = ‘Sales’) and s.CustomerId in (select c.CustomerId from Customer c where Income > 100000) </li></ul>
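The "OR per dimension, then AND" step driven by the added subqueries can be sketched as follows. Everything here is hypothetical: a six-row Sales fact table, bitmap indexes keyed by foreign-key value (Python integers as bit-vectors, bit 0 for the first Sales row), and key lists standing in for the results of the three subqueries.

```python
# Hypothetical bitmap indexes on the Sales fact table:
# foreign-key value -> bit-vector over Sales rows.
sales_by_product = {10: 0b000011, 11: 0b001100, 12: 0b110000}
sales_by_employee = {1: 0b010101, 2: 0b101010}
sales_by_customer = {7: 0b000111, 8: 0b111000}


def or_keys(index, keys):
    """OR together the bitmaps of all qualifying keys of one dimension."""
    result = 0
    for key in keys:
        result |= index.get(key, 0)
    return result


def star_filter(dimension_filters):
    """AND the per-dimension ORed bitmaps; returns None if no filters are given."""
    result = None
    for index, keys in dimension_filters:
        vector = or_keys(index, keys)
        result = vector if result is None else result & vector
    return result


# Key lists as the subqueries might return them (made up for this example).
ibm_products = [10, 11]      # p.Supplier = 'IBM'
sales_employees = [1]        # e.Department = 'Sales'
rich_customers = [7]         # c.Income > 100000

matching_sales = star_filter([
    (sales_by_product, ibm_products),
    (sales_by_employee, sales_employees),
    (sales_by_customer, rich_customers),
])
```

Only the Sales rows left in `matching_sales` then need to be fetched and joined back to the dimension tables for the selected columns.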
  31. 31. References <ul><li>Donald Knuth, The Art of Computer Programming, Volume 3 (Sorting and Searching) </li></ul><ul><li>Red Brick Systems, white paper on Target Indices </li></ul><ul><li>Graefe, Query Processing Survey </li></ul><ul><li>Hakan Jakobsson, Oracle Bit-Vector Indices presentation </li></ul><ul><li>Graefe and O’Neil, SIGMOD Record </li></ul><ul><li>Chee-Yong Chan and Yannis Ioannidis, SIGMOD 1998 </li></ul>
