- 1. qCube: Efficient integration of range query operators over a high dimension data cube Rodrigo Rocha Silva Doctorate Student Prof. Dr. Celso Massaki Hirata Advisor Prof. Dr. Joubert de Castro Lima Co-Advisor ITA – INSTITUTO TECNOLÓGICO DE AERONÁUTICA Electronic Engineering and Computer Science Division - EEC/I Department of Computer Science Brazil
- 2. Goal: Present a new cube approach designed for high-dimensional range queries. The approach, named Query Cube (qCube), implements the Equal, Not Equal, Greater or Less than, Some, Between and Similar range query operators and the Distinct, Sub-cube and Top-k Similar inquire query operators.
  Wednesday, October 02, 2012. 28º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
- 3. Topics
  - Motivation
  - Data Cube
  - Related Work
  - Query Cube (qCube)
  - Experiments
  - Results
  - Conclusions
- 4. Motivation: Users need to view data in a tangible way, such as reports, cross tables and histograms.
- 5. Motivation: Suppose a decision-making process requires the following information:
  - "What is the variance of the impact of journal research papers written by women, using months {1, 3, 5, 7, 11}, year 2012 and ages from 25 to 40 years? Return results for all countries."
  - "The average temperatures above 30 degrees Celsius on the weekends of leap years in the last 200 years."
- 6. Data Cube: A data cube, introduced by Gray et al., 1996, is a generalization of the group-by operator over all possible combinations of dimensions, with aggregates at various granularities.
- 7. Data Cube: A data cube has exponential complexity with respect to the number of dimensions. For an input with d dimensions, the output has 2^d cuboids.
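To make the exponential growth concrete, the following minimal sketch enumerates the cuboids (group-by subsets) of a list of dimensions. The function name `cuboids` and the dimension names are illustrative assumptions, and hierarchies are ignored.

```python
from itertools import combinations


def cuboids(dimensions):
    """Return every subset of the dimensions; each subset is one cuboid (group-by)."""
    result = []
    for k in range(len(dimensions) + 1):
        result.extend(combinations(dimensions, k))
    return result


dims = ["A", "B", "C"]
all_cuboids = cuboids(dims)
# With d = 3 there are 2**3 = 8 cuboids, from the apex () to the base (A, B, C).
print(len(all_cuboids))
```

Adding a single dimension doubles the number of cuboids, which is why full materialization becomes impractical for high-dimensional relations.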
- 8. Data Cube: Hierarchies (figure; labels shown: Year, Day, Hour, Department, Discipline)
- 9. Data Cube: full cube example with COUNT

  Base relation R (11 tuples):

  | A | B | C | COUNT |
  |---|---|---|---|
  | a1 | b1 | c1 | 1 |
  | a3 | b3 | c2 | 1 |
  | a2 | b3 | c2 | 1 |
  | a3 | b1 | c1 | 1 |
  | a2 | b1 | c1 | 1 |
  | a2 | b2 | c2 | 1 |
  | a1 | b1 | c2 | 1 |
  | a2 | b2 | c1 | 1 |
  | a3 | b1 | c2 | 1 |
  | a1 | b3 | c2 | 1 |
  | a2 | b1 | c2 | 1 |

  Full 3-D cube (38 tuples: the 11 base tuples above plus the 27 aggregated tuples below):

  | A | B | C | COUNT |
  |---|---|---|---|
  | * | * | * | 11 |
  | a1 | * | * | 3 |
  | a2 | * | * | 5 |
  | a3 | * | * | 3 |
  | * | b1 | * | 6 |
  | * | b2 | * | 2 |
  | * | b3 | * | 3 |
  | * | * | c1 | 4 |
  | * | * | c2 | 7 |
  | a1 | b1 | * | 2 |
  | a1 | b3 | * | 1 |
  | a2 | b1 | * | 2 |
  | a2 | b2 | * | 2 |
  | a2 | b3 | * | 1 |
  | a3 | b1 | * | 2 |
  | a3 | b3 | * | 1 |
  | a1 | * | c1 | 1 |
  | a1 | * | c2 | 2 |
  | a2 | * | c1 | 2 |
  | a2 | * | c2 | 3 |
  | a3 | * | c1 | 1 |
  | a3 | * | c2 | 2 |
  | * | b1 | c1 | 3 |
  | * | b1 | c2 | 3 |
  | * | b2 | c1 | 1 |
  | * | b2 | c2 | 1 |
  | * | b3 | c2 | 3 |
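The counts on this slide can be reproduced with a short brute-force sketch. It is not a production cubing algorithm: every base tuple simply contributes to each of the 2^3 cells obtained by keeping or aggregating each dimension. The name `full_cube` is an assumption made for illustration.

```python
from collections import Counter
from itertools import product

# Base relation R from the slide (11 tuples, each counted once).
R = [
    ("a1", "b1", "c1"), ("a3", "b3", "c2"), ("a2", "b3", "c2"),
    ("a3", "b1", "c1"), ("a2", "b1", "c1"), ("a2", "b2", "c2"),
    ("a1", "b1", "c2"), ("a2", "b2", "c1"), ("a3", "b1", "c2"),
    ("a1", "b3", "c2"), ("a2", "b1", "c2"),
]


def full_cube(relation):
    """Count every cell of the full cube; '*' marks an aggregated dimension."""
    cube = Counter()
    n_dims = len(relation[0])
    for row in relation:
        # Each tuple feeds 2**n_dims cells: keep or aggregate each dimension.
        for mask in product([False, True], repeat=n_dims):
            cell = tuple("*" if aggregated else value
                         for value, aggregated in zip(row, mask))
            cube[cell] += 1
    return cube


cube = full_cube(R)
print(len(cube))               # number of non-empty cells in the full cube
print(cube[("*", "*", "*")])   # apex cell
print(cube[("*", "b1", "*")])  # all tuples with B = b1
```

Only non-empty cells are stored, which is why the result has 38 cells rather than every possible combination of values.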
- 10. Related Work – Frag-Cubing Approach
  - Partitions the data vertically
  - Reduces a high-dimensional cube into a set of lower-dimensional cubes
  - Lossless reduction
  - Offers tradeoffs between the amount of pre-processing and the speed of online computation
  - Source: Han and Kamber, Data Mining: Concepts and Techniques
- 11. Related Work – Frag-Cubing Example
  - Let the cube aggregation function be count.

  | tid | A | B | C | D | E |
  |---|---|---|---|---|---|
  | 1 | a1 | b1 | c1 | d1 | e1 |
  | 2 | a1 | b2 | c1 | d2 | e1 |
  | 3 | a1 | b2 | c1 | d1 | e2 |
  | 4 | a2 | b1 | c1 | d1 | e2 |
  | 5 | a2 | b1 | c1 | d1 | e3 |

  - Divide the 5 dimensions into 2 shell fragments: (A, B, C) and (D, E).
  - Source: Han and Kamber, Data Mining: Concepts and Techniques
- 12. Related Work – Frag-Cubing 1-D Inverted Indices
  - Build a traditional inverted index or RID list.

  | Attribute value | TID list | List size |
  |---|---|---|
  | a1 | 1, 2, 3 | 3 |
  | a2 | 4, 5 | 2 |
  | b1 | 1, 4, 5 | 3 |
  | b2 | 2, 3 | 2 |
  | c1 | 1, 2, 3, 4, 5 | 5 |
  | d1 | 1, 3, 4, 5 | 4 |
  | d2 | 2 | 1 |
  | e1 | 1, 2 | 2 |
  | e2 | 3, 4 | 2 |
  | e3 | 5 | 1 |

  - Source: Han and Kamber, Data Mining: Concepts and Techniques
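The inverted index on this slide can be derived from the five-tuple base table of the Frag-Cubing example. The sketch below is a minimal illustration; the name `build_inverted_index` is an assumption, and it relies on attribute values being unique across dimensions (a1, b1, ...), as they are in the example.

```python
from collections import defaultdict

# Base table from the Frag-Cubing example: tid -> (A, B, C, D, E).
BASE = {
    1: ("a1", "b1", "c1", "d1", "e1"),
    2: ("a1", "b2", "c1", "d2", "e1"),
    3: ("a1", "b2", "c1", "d1", "e2"),
    4: ("a2", "b1", "c1", "d1", "e2"),
    5: ("a2", "b1", "c1", "d1", "e3"),
}


def build_inverted_index(base):
    """Map each attribute value to the sorted list of tids that contain it."""
    index = defaultdict(list)
    for tid in sorted(base):
        for value in base[tid]:
            index[value].append(tid)
    return dict(index)


index = build_inverted_index(BASE)
for value in sorted(index):
    # Prints: attribute value, TID list, list size.
    print(value, index[value], len(index[value]))
```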
- 13. Related Work – Frag-Cubing Approach
  - Generalize the 1-D inverted indices to multi-dimensional ones in the data cube sense.
  - Compute all cuboids for data cubes ABC and DE while retaining the inverted indices.
  - For example, shell fragment cube ABC contains 7 cuboids: A, B, C; AB, AC, BC; ABC.
  - This completes the offline computation stage.

  | Cell | Intersection | TID list | List size |
  |---|---|---|---|
  | a1 b1 | {1, 2, 3} ∩ {1, 4, 5} | {1} | 1 |
  | a1 b2 | {1, 2, 3} ∩ {2, 3} | {2, 3} | 2 |
  | a2 b1 | {4, 5} ∩ {1, 4, 5} | {4, 5} | 2 |
  | a2 b2 | {4, 5} ∩ {2, 3} | ∅ | 0 |

  - Source: Han and Kamber, Data Mining: Concepts and Techniques
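The AB cuboid in the table above follows from pairwise intersections of the 1-D TID lists. A minimal sketch, assuming the lists are held as Python sets and using the illustrative name `cuboid_2d`:

```python
from itertools import product

# 1-D TID lists for dimensions A and B, taken from the inverted-index example.
tids_a = {"a1": {1, 2, 3}, "a2": {4, 5}}
tids_b = {"b1": {1, 4, 5}, "b2": {2, 3}}


def cuboid_2d(first, second):
    """Intersect every pair of 1-D TID lists to obtain the 2-D cuboid cells."""
    return {
        (v1, v2): first[v1] & second[v2]
        for v1, v2 in product(first, second)
    }


ab = cuboid_2d(tids_a, tids_b)
for cell, tids in ab.items():
    # Prints: cell, TID list, list size (an empty list means an empty cell).
    print(cell, sorted(tids), len(tids))
```

Higher-dimensional cuboids such as ABC can be built the same way, by intersecting an AB cell with a C list.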
- 14. Related Work – Frag-Cubing Measure Table
  - If measures other than count are present, store them in an ID_measure table separate from the shell fragments.

  | tid | count | sum |
  |---|---|---|
  | 1 | 5 | 70 |
  | 2 | 3 | 10 |
  | 3 | 8 | 20 |
  | 4 | 5 | 40 |
  | 5 | 2 | 30 |

  - Source: Han and Kamber, Data Mining: Concepts and Techniques
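Once a cell's TID list is known, its measures can be aggregated by looking each tid up in the ID_measure table. The sketch below assumes the table fits in a dictionary and derives the average as sum divided by count; the function name `aggregate` is illustrative.

```python
# ID_measure table from the slide: tid -> (count, sum).
ID_MEASURE = {1: (5, 70), 2: (3, 10), 3: (8, 20), 4: (5, 40), 5: (2, 30)}


def aggregate(tid_list, measures=ID_MEASURE):
    """Aggregate count, sum and the derived average over a cell's TID list."""
    total_count = sum(measures[tid][0] for tid in tid_list)
    total_sum = sum(measures[tid][1] for tid in tid_list)
    average = total_sum / total_count if total_count else None
    return total_count, total_sum, average


# Cell (a1, b2) of the AB cuboid has TID list {2, 3}.
count, total, average = aggregate([2, 3])
print(count, total, average)
```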
- 15. Related Work – Frag-Cubing Query
  - Given the fragment cubes, process a query as follows:
    1. Divide the query into fragments, matching the shell fragments.
    2. Fetch the corresponding TID list for each fragment from the fragment cube.
    3. Intersect the TID lists from each fragment to construct the instantiated base table.
    4. Compute the data cube using the base table with any cubing algorithm.
  - Source: Han and Kamber, Data Mining: Concepts and Techniques
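Steps 1 to 3 can be sketched as follows, using the five-tuple example relation and the fragments (A, B, C) and (D, E). This is a simplification under stated assumptions: a real fragment cube would return a precomputed TID list, whereas `fragment_tids` here recomputes it by scanning, and step 4 (cubing the instantiated base table) is left out. All function names are illustrative.

```python
# Base table from the Frag-Cubing example: tid -> (A, B, C, D, E).
BASE = {
    1: ("a1", "b1", "c1", "d1", "e1"),
    2: ("a1", "b2", "c1", "d2", "e1"),
    3: ("a1", "b2", "c1", "d1", "e2"),
    4: ("a2", "b1", "c1", "d1", "e2"),
    5: ("a2", "b1", "c1", "d1", "e3"),
}
# Dimension positions of the two shell fragments (A, B, C) and (D, E).
FRAGMENTS = [(0, 1, 2), (3, 4)]


def fragment_tids(fragment, query):
    """TID list of the query restricted to one fragment ('*' matches anything).

    Stand-in for a lookup in the precomputed fragment cube.
    """
    return {
        tid for tid, row in BASE.items()
        if all(query[d] in ("*", row[d]) for d in fragment)
    }


def answer(query):
    # Steps 1-2: split the query by fragment and fetch one TID list per fragment.
    lists = [fragment_tids(fragment, query) for fragment in FRAGMENTS]
    # Step 3: intersect the lists to obtain the instantiated base table.
    tids = set.intersection(*lists)
    return {tid: BASE[tid] for tid in sorted(tids)}


table = answer(("a1", "*", "*", "d1", "*"))
print(table)
```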
- 16. Related Work – Frag-Cubing Approach (figure: dimensions A B C D E F G H I J K L M N …, from which an instantiated base table is built by online computation). Source: Han and Kamber, Data Mining: Concepts and Techniques
- 17. qCube Approach
  - Implements a set of tuple identifiers per dimension attribute, similar to Frag-Cubing.
  - Therefore, qCube can answer point queries using intersections of tuple identifier sets, and range queries using unions plus intersections, regardless of the measure function type.
  - Frag-Cubing implements only point queries and some inquire queries.
  - There is no Frag-Cubing solution for queries like: "What is the variance of the impact of journal research papers written by women, using months {1, 3, 5, 7, 11}, year 2012 and ages from 25 to 40 years? Return results for all countries."
- 18. qCube Approach
  - Implements the range query operators over a high-dimensional data cube:
    - Equal
    - Not Equal
    - Greater or Less than
    - Some
    - Between
    - Similar
  - Also implements the inquire query operators:
    - Distinct
    - Sub-cube
    - Top-k Similar
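One way to read "unions plus intersections" is that a range operator unions the TID sets of every attribute value it accepts, and the per-dimension results are then intersected. The sketch below illustrates that reading only; it is not the qCube implementation. The indices `AGE` and `SEX`, the function names and the inclusive bounds of `between` are assumptions, and the Similar operator is omitted because it needs a similarity function that the slides do not define.

```python
# Hypothetical inverted indices: attribute value -> set of tuple identifiers.
AGE = {22: {1, 4}, 25: {2}, 31: {3, 6}, 40: {5}, 57: {7}}
SEX = {"women": {1, 2, 3, 5}, "men": {4, 6, 7}}


def union_where(index, predicate):
    """Union the TID sets of every attribute value accepted by the predicate."""
    result = set()
    for value, tids in index.items():
        if predicate(value):
            result |= tids
    return result


def equal(index, v):
    return set(index.get(v, ()))


def not_equal(index, v):
    return union_where(index, lambda x: x != v)


def greater_than(index, v):
    return union_where(index, lambda x: x > v)


def less_than(index, v):
    return union_where(index, lambda x: x < v)


def between(index, low, high):
    # Inclusive bounds are an assumption of this sketch.
    return union_where(index, lambda x: low <= x <= high)


def some(index, values):
    return union_where(index, lambda x: x in values)


# Point operator on one dimension intersected with a range operator on another.
answer = equal(SEX, "women") & between(AGE, 25, 40)
print(sorted(answer))
```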
- 19. qCube Architecture (figure)
- 20. qCube Computation

  Base relation:

  | TID | A | B | C | D | E |
  |---|---|---|---|---|---|
  | 1 | a1 | b1 | c1 | d1 | e1 |
  | 2 | a2 | b2 | c2 | d2 | e2 |
  | 3 | a1 | b1 | c1 | d1 | e1 |
  | 4 | a3 | b3 | c3 | d2 | e2 |
  | 5 | a1 | b1 | c4 | d1 | e2 |
  | 6 | a5 | b5 | c5 | d2 | e2 |

  Measures (function per column):

  | tid | M1 (Variance) | M2 (Count) | M3 (Average) | M4 (Skewness) | M5 (Standard deviation) |
  |---|---|---|---|---|---|
  | 1 | 2.56 | 1 | 10 | 1 | 877686769698 |
  | 2 | 3.14 | 1 | 20 | 0 | 7986676867.99 |
  | 3 | 2.45 | 1 | 10 | 1 | -7878789.8777 |
  | 4 | 6.7 | 1 | 11 | 1 | -99974333.23 |
  | 5 | 9 | 1 | 3 | 1 | 100045.655 |
  | 6 | 1 | 1 | 1 | 1 | 1 |

  Resulting TID lists:

  | Attribute value | TID list |
  |---|---|
  | a1 | 1, 3, 5 |
  | a2 | 2 |
  | a3 | 4 |
  | a5 | 6 |
  | b1 | 1, 3, 5 |
  | b2 | 2 |
  | b3 | 4 |
  | b5 | 6 |
  | c1 | 1, 3 |
  | c2 | 2 |
  | c3 | 4 |
  | c4 | 5 |
  | c5 | 6 |
  | d1 | 1, 3, 5 |
  | d2 | 2, 4, 6 |
  | e1 | 1, 3 |
  | e2 | 2, 4, 5, 6 |
- 21. qCube Update: uses the same algorithm as qCube Computation.
- 22. qCube Update

  Base relation before the update:

  | TID | A | B | C | D | E |
  |---|---|---|---|---|---|
  | 1 | a1 | b1 | c1 | d1 | e1 |
  | 2 | a2 | b2 | c2 | d2 | e2 |
  | 3 | a1 | b1 | c1 | d1 | e1 |
  | 4 | a3 | b3 | c3 | d2 | e2 |
  | 5 | a1 | b1 | c4 | d1 | e2 |
  | 6 | a5 | b5 | c5 | d2 | e2 |

  Update tuples (tid 5 is modified; tids 7, 8 and 9 are new; F is a new dimension):

  | tid | A | B | C | D | E | F |
  |---|---|---|---|---|---|---|
  | 5 | a3 | b1 | c4 | d1 | e2 | |
  | 7 | a3 | b2 | c3 | d3 | e3 | |
  | 8 | a2 | b3 | c4 | d3 | e2 | f1 |
  | 9 | a5 | b5 | c5 | d1 | e1 | f2 |

  TID lists after the update:

  | Attribute value | TID list |
  |---|---|
  | a1 | 1, 3 |
  | a2 | 2, 8 |
  | a3 | 4, 5, 7 |
  | a5 | 6, 9 |
  | b1 | 1, 3, 5 |
  | b2 | 2, 7 |
  | b3 | 4, 8 |
  | b5 | 6, 9 |
  | c1 | 1, 3 |
  | c2 | 2 |
  | c3 | 4, 7 |
  | c4 | 5, 8 |
  | c5 | 6, 9 |
  | d1 | 1, 3, 5, 9 |
  | d2 | 2, 4, 6 |
  | d3 | 7, 8 |
  | e1 | 1, 3, 9 |
  | e2 | 2, 4, 5, 6, 8 |
  | e3 | 7 |
  | f1 | 8 |
  | f2 | 9 |
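The update on this slide can be reproduced by feeding both the initial relation and the update tuples through one routine, in the spirit of update reusing the computation algorithm. This is a sketch under assumptions: `apply_rows` and `row` are hypothetical helpers, the previous values of each tuple are kept in memory so a modified tuple can be removed from its old TID lists, and F is assigned to tids 8 and 9 as the resulting lists f1 and f2 indicate.

```python
from collections import defaultdict


def row(*values):
    """Build a tuple as {dimension: value}; missing trailing dimensions are omitted."""
    return dict(zip("ABCDEF", values))


def apply_rows(index, rows, new_rows):
    """Insert or overwrite tuples and keep the TID lists consistent."""
    for tid, new_row in new_rows.items():
        # If the tid already exists, drop it from the lists of its old values.
        for value in rows.get(tid, {}).values():
            index[value].discard(tid)
        rows[tid] = new_row
        for value in new_row.values():
            index[value].add(tid)
    return index


initial = {
    1: row("a1", "b1", "c1", "d1", "e1"),
    2: row("a2", "b2", "c2", "d2", "e2"),
    3: row("a1", "b1", "c1", "d1", "e1"),
    4: row("a3", "b3", "c3", "d2", "e2"),
    5: row("a1", "b1", "c4", "d1", "e2"),
    6: row("a5", "b5", "c5", "d2", "e2"),
}
update = {
    5: row("a3", "b1", "c4", "d1", "e2"),
    7: row("a3", "b2", "c3", "d3", "e3"),
    8: row("a2", "b3", "c4", "d3", "e2", "f1"),
    9: row("a5", "b5", "c5", "d1", "e1", "f2"),
}

index, rows = defaultdict(set), {}
apply_rows(index, rows, initial)   # computation
apply_rows(index, rows, update)    # update, same routine
for value in sorted(index):
    print(value, sorted(index[value]))
```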
- 23. qCube Query

  Point query pQ = a1:*:*:*:e1, evaluated over the TID lists:

  | Attribute value | TID list |
  |---|---|
  | a1 | 1, 3 |
  | a2 | 2, 8 |
  | a3 | 4, 5, 7 |
  | a5 | 6, 9 |
  | b1 | 1, 3, 5 |
  | b2 | 2, 7 |
  | b3 | 4, 8 |
  | b5 | 6, 9 |
  | c1 | 1, 3 |
  | c2 | 2 |
  | c3 | 4, 7 |
  | c4 | 5, 8 |
  | c5 | 6, 9 |
  | d1 | 1, 3, 5, 9 |
  | d2 | 2, 4, 6 |
  | d3 | 7, 8 |
  | e1 | 1, 3, 9 |
  | e2 | 2, 4, 5, 6, 8 |
  | e3 | 7 |
  | f1 | 8 |
  | f2 | 9 |
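The point query pQ can be evaluated by intersecting the TID lists of its instantiated values. The sketch below assumes the colon-separated query syntax shown on the slide, with `*` meaning "any value"; `point_query` is an illustrative name, and the a3 list uses 4, 5, 7 as on the qCube Update slide.

```python
# TID lists after the update: attribute value -> set of tuple identifiers.
INDEX = {
    "a1": {1, 3}, "a2": {2, 8}, "a3": {4, 5, 7}, "a5": {6, 9},
    "b1": {1, 3, 5}, "b2": {2, 7}, "b3": {4, 8}, "b5": {6, 9},
    "c1": {1, 3}, "c2": {2}, "c3": {4, 7}, "c4": {5, 8}, "c5": {6, 9},
    "d1": {1, 3, 5, 9}, "d2": {2, 4, 6}, "d3": {7, 8},
    "e1": {1, 3, 9}, "e2": {2, 4, 5, 6, 8}, "e3": {7},
    "f1": {8}, "f2": {9},
}


def point_query(index, query):
    """Intersect the TID lists of every instantiated value in 'v1:v2:...'."""
    values = [v for v in query.split(":") if v != "*"]
    if not values:
        # No restriction at all: every tuple qualifies.
        return set().union(*index.values())
    result = set(index.get(values[0], ()))
    for value in values[1:]:
        result &= index.get(value, set())
    return result


print(sorted(point_query(INDEX, "a1:*:*:*:e1")))
```

For pQ the lists of a1 and e1 are intersected, so the answer is the set of tuples holding both values.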
- 24. qCube Range and Inquire Query
  - rOp = (greater than + less than + between + some + different + similar x (fv1 … fvn))
  - iOp = (sub-cube + distinct + top-k similar x (fv1 … fvn))
  - qCube rearranges the sub-queries of Q in order to improve query response times.
  - As a result of Q we have qR = (TID1, TID2 … TIDk), where TIDi is the i-th tuple identifier of relation R.
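The slides do not spell out the reordering policy in detail. One plausible reading, sketched below as an assumption rather than as the qCube algorithm, is to intersect the sub-query results from the smallest TID list to the largest, so that intermediate results shrink early and the loop can stop as soon as the result is empty. The name `intersect_smallest_first` and the sample TID sets are hypothetical.

```python
def intersect_smallest_first(tid_lists):
    """Intersect sub-query TID sets in ascending order of size; return sorted qR."""
    if not tid_lists:
        return []
    ordered = sorted(tid_lists, key=len)
    result = set(ordered[0])
    for tids in ordered[1:]:
        if not result:
            break  # nothing can match any more; skip the remaining lists
        result &= tids
    return sorted(result)


# Hypothetical results of three sub-queries over the same relation.
sub_queries = [
    {1, 2, 3, 4, 5, 6},   # a frequent point value
    {2, 4, 6},            # a range operator, already unioned
    {4, 6, 8},            # an infrequent point value
]
qR = intersect_smallest_first(sub_queries)
print(qR)
```

Because set intersection is commutative, the order changes only the cost of the evaluation, never the resulting qR.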
- 25. qCube Query - example (figure)
- 26. qCube Query - example
  - Q: "What is the variance of the impact of journal research papers written by women, using months {1, 3, 5, 7, 11}, year 2012 and ages from 25 to 40 years? Return results for all countries."
  - The point queries in Q are (sex = women, paperType = journal, year = 2012).
  - The range queries (month = (1, 3, 5, 7, 11), age <> 25-40) are also sorted according to their cardinalities.
  - Q also contains one inquire query (country = distinct).
- 27. qCube Query - example (figure)
- 28. Experiments
  - We tested the qCube Computation and Query algorithms against the Frag-Cubing algorithm used in [Li et al. 2004].
  - The qCube algorithms were coded in 64-bit Java.
  - Frag-Cubing is a free and open-source C++ application (http://illimine.cs.uiuc.edu/).
  - The synthetic base relations were created using the data generator provided by the IlliMine project.
  - IlliMine is an open-source project that provides various approaches for data mining and machine learning; Frag-Cubing is part of it.
  - We ran the algorithms on a machine with two Intel Xeon six-core processors (2.4 GHz per core, 12 MB cache) and 128 GB of DDR3 1333 MHz RAM.
  - The system runs 64-bit Windows Server 2008, High Performance version.
- 29. Results - Performance Evaluation of Point Queries and Skewed Relations (T: tuples, C: cardinality, D: dimensions, S: skew)
  - Figure: response time per query over 100 trials, T = 10^7, C = 5000, D = 30, S = 0
  - Figure: response time per query over 100 trials, T = 10^7, C = 5000, D = 30, S = 2.5
- 30. Results - Performance Evaluation of Range Query Operators and Skewed Relations
  - Figure: response time of queries with one infrequent point operator, T = 10^7, C = 5000, D = 30, S = 2.5
- 31. Results - Performance Evaluation of Inquire Operators and Skewed Relations
  - Figure: response time of queries with inquire operators, T = 10^7, C = 5000, D = 30, S = 2.5
- 32. Results - Runtime and Memory Consumption (figure)
- 33. Conclusions
  - qCube has linear runtime and memory consumption, similar to Frag-Cubing.
  - It implements the Not Equal, Greater or Less than, Some, Between and Similar range query operators and the Distinct, Sub-cube and Top-k Similar inquire query operators.
  - Compared with Frag-Cubing, qCube is faster at answering point queries and inquire queries with sub-cube operators.
  - It introduces a different cube representation with fewer empty cells than Frag-Cubing.
  - Frag-Cubing cannot answer queries with two sub-cube operators over a data cube with 10^7 tuples, C = 5000, D = 30 and S = 2.5.
- 34. Conclusions: interesting research directions to further extend qCube
  - First, we must experiment with holistic measures. Update and computation experiments with many holistic measures are a hard problem.
  - TID lists can become huge, so memory consumption and intersection costs can become impractical; we must therefore find an efficient way to partition TIDs with fast data retrieval.
  - Multicore and multicomputer versions of qCube must be implemented.
  - qCube must be improved to answer top-k queries combined with range, point and inquire queries.
  - Experiments with high-dimensional text cubes must be made to evaluate qCube, especially its computation of text measures.
- 35. Acknowledgements

