Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Successfully reported this slideshow.

Like this presentation? Why not share!

No Downloads

Total views

1,265

On SlideShare

0

From Embeds

0

Number of Embeds

1

Shares

0

Downloads

32

Comments

0

Likes

2

No embeds

No notes for slide

- 1. OLAP Indexes
- 2. What Is a Data Warehouse? • “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” – W. H. Inmon • Data warehousing: the process of constructing and using data warehouses Jian Pei: Data Mining -- OLAP Indexes 2
- 3. Index Requirements in OLAP • Data is read only – No insertion or deletion • Query types – Point query: look up one specific tuple (rare) – Range query: return the aggregate of a (large) set of tuples, with group by – Complex queries: need specific algorithms and index structures, will be discussed later Jian Pei: Data Mining -- OLAP Indexes 3
- 4. OLAP Query Example • In table (cust, gender, …), find the total number of male customers • Method 1: scan the table once • Method 2: build a B+ tree index on attribute gender, still need to access all tuples of male customers • Can we get the count without scanning many tuples? Jian Pei: Data Mining -- OLAP Indexes 4
- 5. Bitmap Index • For n tuples, a bitmap index has n bits and can be packed into ⎡n /8⎤ bytes and ⎡n /32⎤ words • From a bit to the row-id: the j-th bit of the p- th byte row-id = p*8 +j cust gender … Jack M … Cathy F … … … … Nancy F … 1 0 … 0 Jian Pei: Data Mining -- OLAP Indexes 5
- 6. Using Bitmap to Count • Shcount[] contains the number of bits in the entry subscript – shcount[01100101]=4 count = 0; for (i = 0; i < SHNUM; i++) count += shcount[B[i]]; Jian Pei: Data Mining -- OLAP Indexes 6
- 7. Advantages of Bitmap Index • Efficient in space • Ready for logic composition – C = C1 AND C2 – Bitmap operations can be used • Bitmap index only works for categorical data with low cardinality – Naively, we need 50 bits per entry to represent the state of a customer in US – How to represent a sale in dollars? Jian Pei: Data Mining -- OLAP Indexes 7
- 8. Bit-Sliced Index • A sale amount can be written as an integer number of pennies, and then represented as a binary number of N bits – 24 bits is good for up to $167,772.15, appropriate for many stores • A bit-sliced index is N bitmaps – Tuple j sets in bitmap k if the k-th bit in its binary representation is on – The space costs of bit-sliced index is the same as storing the data directly Jian Pei: Data Mining -- OLAP Indexes 8
- 9. Using Indexes SELECT SUM(sales) FROM Sales WHERE C; – Tuples satisfying C is identified by a bitmap B • Direct access to rows to calculate SUM: scan the whole table once • B+ tree: find the tuples from the tree • Projection index: only scan attribute sales • Bit-sliced index: get the sum from ∑(B AND Bk)*2k Jian Pei: Data Mining -- OLAP Indexes 9
- 10. Cost Comparison • Traditional value-list index (B+ tree) is costly in both I/O and CPU time – Not good for OLAP • Bit-sliced index is efficient in I/O • Other case studies in [O’Neil and Quass, SIGMOD’97] Jian Pei: Data Mining -- OLAP Indexes 10
- 11. Horizontal or Vertical Storage • A fact table for data warehousing is often fat – Tens of even hundreds of dimensions/attributes • A query is often about only a few attributes • Horizontal storage: tuples are stored one by one • Vertical storage: tuples are stored by attributes A1 A2 … A100 A1 A2 … A100 x1 x2 … x100 x1 x2 … x100 … … … … … … … … z1 z2 … z100 z1 z2 … z100 Jian Pei: Data Mining -- OLAP Indexes 11
- 12. Horizontal Versus Vertical • Find the information of tuple t – Typical in OLTP – Horizontal storage: get the whole tuple in one search – Vertical storage: search 100 lists • Find SUM(a100) GROUP BY {a22, a83} – Typical in OLAP – Horizontal storage (no index): search all tuples O(100n), where n is the number of tuples – Vertical storage: search 3 lists O(3n), 3% of the horizontal storage method • Projection index: vertical storage Jian Pei: Data Mining -- OLAP Indexes 12
- 13. Outline • OLAP and data warehousing: concepts • Indexing for OLAP operations • Data cube computation – Computing complete data cubes – Compression Jian Pei: Data Mining -- OLAP Indexes 13
- 14. MOLAP Date 2Qtr t 1Qtr 3Qtr 4Qtr sum uc TV od PC U.S.A Pr VCR Country sum Canada Mexico sum Jian Pei: Data Mining -- OLAP Indexes 14
- 15. Pros and Cons • Easy to implement • Fast retrieval • Many entries may be empty if data is sparse • Space costly Jian Pei: Data Mining -- OLAP Indexes 15
- 16. ROLAP – Data Cube in Table • A multi-dimensional database Base table Dimensions Measure Store Product Season Sales Dimensions Measure S1 P1 Spring 6 Store Product Season AVG(Sales) S1 P2 Spring 12 S1 P1 Spring 6 S2 P1 Fall 9 S1 P2 Spring 12 S2 P1 Fall 9 S1 * Spring 9 … … … … * * * 9 Cubing Jian Pei: Data Mining -- OLAP Indexes 16
- 17. Pros and Cons • Compact in space • Using mature relational technology • Overhead in query answering Jian Pei: Data Mining -- OLAP Indexes 17
- 18. HOLAP • Hybrid OLAP • Store detail data in ROLAP – Save space • Store aggregates in MOLAP – Fast query answering Jian Pei: Data Mining -- OLAP Indexes 18
- 19. Cube Computation • Given a base table B(D1, …, Dn, M), compute the set of all aggregates in the data cube using aggregate function f() • D1, …, Dn are n dimensions and M is a measure Jian Pei: Data Mining -- OLAP Indexes 19
- 20. Aggregate Cells • A cell c=(c1, …, cn) is called an aggregate cell, provided each ci is either in Di or ALL – Generally, we assume that each tuple in B has distinct dimension values, i.e., (D1, …, Dn) is a key in B – Value ALL is also conventionally written as * • Cell c is a base cell if (c, m) is a tuple in the base table • A cell c is not empty if there are some base cells match all non-ALL attribute values in c Jian Pei: Data Mining -- OLAP Indexes 20
- 21. Example • (S1, ALL, Spring), (S2, P1, Fall) and (S1, ALL, Winter) are aggregate cells • (S2, P1, Fall) is a base cell • (S1, ALL, Winter) is empty • (S1, ALL, Spring) = (S1, *, Spring) Base table Dimensions Measure Store Product Season Sales S1 P1 Spring 6 S1 P2 Spring 12 S2 P1 Fall 9 Jian Pei: Data Mining -- OLAP Indexes 21
- 22. Cuboid Lattice • All aggregate cells belong to the same group by form a cuboid (Store,Product,Season) (Store,Product,*) (Store,*,Season) (*,Product,Season) Roll up Drill down (Store,*,*) (*,Product,*) (*,*,Season) (*,*,*) Jian Pei: Data Mining -- OLAP Indexes 22
- 23. Cube (Cell) Lattice Tuples in the base table (S1,P1,s):6 (S1,P2,s):12 (S2,P1,f):9 (S1,*,s):9 (S1,P1,*):6 (*,P1,s):6 (S1,P2,*):12 (*,P2,s):12 (S2,*,f):9 (S2,P1,*) (*,P1,f):9 (S1,*,*):9 (*,*,s):9 (*,P1,*):7.5 (*,P2,*):12 (*,*,f):9 (S2,*,*):9 (*,*,*):9 Jian Pei: Data Mining -- OLAP Indexes 23
- 24. A Naïve Approach • Set one counter for every aggregate cell • Scan the base table once, compute all aggregate cells • Time complexity O(n), where n is the number of tuples in the base table • Space overhead O(π(|Di|+1)) – 10 dimensions, each dimension has cardinality 10, (1110 – 1010) = 15,937,424,601 counters are needed! Jian Pei: Data Mining -- OLAP Indexes 24
- 25. Real Data Sets Are Often Sparse • Many aggregate cells are empty • Example: 10 dimensions, each dimension has cardinality 10 – 1010 possible base cells – Even a base table has 1 billion tuples, still at least 90% of the possible base cells are empty! Jian Pei: Data Mining -- OLAP Indexes 25
- 26. Multiway Aggregate Computation • Compute ABC • Compute AB, AC, and ABC BC from ABC • Compute A, B, and C AB AC from AB, AC, and BC BC • Compute ALL from A, C A B, C B ALL Jian Pei: Data Mining -- OLAP Indexes 26
- 27. Main Memory Allocation • Do we really need to allocate main memory for AB, AC, and BC, respectively? – The cuboids still can be very large • Observation: A, B, and C can be computed from any two of AB, AC, BC ABC AB AC BC A B C ALL Jian Pei: Data Mining -- OLAP Indexes 27
- 28. Multi-way Array Aggregation (1) C c3 61 c2 45 62 63 64 46 47 48 c1 29 30 31 32 c0 B13 14 15 16 60 b3 44 B b2 28 56 9 40 24 52 b1 5 36 20 b0 1 2 3 4 a0 a1 a2 a3 A Jian Pei: Data Mining -- OLAP Indexes 28
- 29. Multi-way Array Aggregation (2) C c3 61 c2 45 62 63 64 46 47 48 c1 29 30 31 32 c0 B13 14 15 16 60 b3 44 B b2 28 56 9 40 24 52 b1 5 36 20 b0 1 2 3 4 a0 a1 a2 a3 A Jian Pei: Data Mining -- OLAP Indexes 29
- 30. Ideas • Consider a base table (A, B, C), where |A|=10, |B|=100, |C|=1000 Order A-B-C C-B-A (A,B,C) 1 1 (A,B,*) 1 10*100 (A,*,C) 1000 10 ABC (*,B,C) 100*1000 1 (A,*,*) 1 10 AB AC BC (*,B,*) 100 100 (*,*,C) 1000 1 A B C (*,*,*) 1 1 Total 102,104 1,124 ALL Jian Pei: Data Mining -- OLAP Indexes 30
- 31. The Array-based Approach • Sort dimensions smartly – Keep the smaller aggregate planes in main memory – Output aggregates in larger planes once they are computed, and then reuse the counters • How to make it work when the main memory usage is still too large? – Details in [Zhao, Deshpande and Naughton, SIGMOD 97] Jian Pei: Data Mining -- OLAP Indexes 31
- 32. Summary • Indexes for OLAP – Bitmap index – Vertical storage • Data cube computation – the multiway array approach • Can you find some situations where bitmap index and vertical storage are useful? Jian Pei: Data Mining -- OLAP Indexes 32

No public clipboards found for this slide

×
### Save the most important slides with Clipping

Clipping is a handy way to collect and organize the most important slides from a presentation. You can keep your great finds in clipboards organized around topics.

Be the first to comment