Rodrigo Rocha Silva presented a paper on H-Frag, a hybrid memory approach for computing data cubes on high-dimensional relations. H-Frag extends the frag-cubing approach by storing portions of the computed data cube in external memory instead of relying solely on main memory. This allows for larger data cubes to be computed. H-Frag selects which fragments to store in main memory based on attribute value frequencies and dimension cardinalities. It generates inverted indexes of attribute values and stores low frequency values and their tid lists in external memory in order to compute cubes that exceed main memory capacity.
A Hybrid Memory Data Cube Approach for High Dimension Relations
1. H-Frag:
A Hybrid Memory Data Cube Approach for High Dimension
Relations
Rodrigo Rocha Silva
Doctoral Student
Prof. Dr. Celso Massaki Hirata
Advisor
Prof. Dr. Joubert de Castro Lima
Co-Advisor
ITA – AERONAUTICS INSTITUTE OF TECHNOLOGY
Electronic Engineering and Computer Science Division - EEC/I
Department of Computer Science
Brazil
2. Monday, April 27, 2015
H-Frag: A Hybrid Memory Data Cube Approach for High Dimension Relations
17th International Conference on Enterprise Information Systems - Rodrigo Rocha Silva
What is H-Frag?
H-Frag is a method for data cube computation that extends the Frag-Cubing approach, enabling the computation of massive data cubes by making use of external memory rather than relying on main memory only.
3.
Topics
– Motivation;
– Data Cube;
– Frag-Cubing;
– H-Frag approach;
– Experiments;
– Results;
– Conclusions;
4.
Motivation
Users need to view data in a tangible way. Reports, cross tables and dashboards are usually the most used tools for visualizing data.
5.
Motivation
• Approaches that use inverted indexes, such as Frag-Cubing, are considered efficient in terms of runtime and main memory usage for massive data cube computation and querying.
• Approaches that use main memory only are limited when the data cube size exceeds the main memory capacity.
6.
Data Cube
Allows the materialization of all or some cells or tuples of a cube, which is represented by measures and dimensions.
A data cube has exponential complexity in its runtime and storage space when the number of dimensions increases linearly.
For an input with size d, the output has size 2^d.
7.
Data Cube
[Figure: example dimension hierarchies; labels: Subjects, Department, Year, Hour, Day, Year]
A dimension may contain a hierarchical relation between two or
more members.
The individual members of a dimension may be hierarchically
related to each other.
8.
Base Relation R – 11 tuples
A B C COUNT
a1 b1 c1 1
a3 b3 c2 1
a2 b3 c2 1
a3 b1 c1 1
a2 b1 c1 1
a2 b2 c2 1
a1 b1 c2 1
a2 b2 c1 1
a3 b1 c2 1
a1 b3 c2 1
a2 b1 c2 1
A B C COUNT
* * * 11
a1 * * 3
a2 * * 5
a3 * * 3
* b1 * 6
* b2 * 2
* b3 * 3
* * c1 4
* * c2 7
a1 b1 * 2
a1 b3 * 1
a2 b1 * 2
a2 b2 * 2
a2 b3 * 1
a3 b1 * 2
a3 b3 * 1
a1 * c1 1
a1 * c2 2
a2 * c1 2
a2 * c2 3
a3 * c1 1
a3 * c2 2
* b1 c1 3
* b1 c2 3
FULL 3D CUBE
A B C COUNT
* b2 c1 1
* b2 c2 1
* b3 c2 3
a1 b1 c1 1
a3 b3 c2 1
a2 b3 c2 1
a3 b1 c1 1
a2 b1 c1 1
a2 b2 c2 1
a1 b1 c2 1
a2 b2 c1 1
a3 b1 c2 1
a1 b3 c2 1
a2 b1 c2 1
Total: 38 aggregations
Data Cube
Construction of a complete data cube is an exponential problem.
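The count of 38 aggregations can be checked with a small sketch. It is only an illustration in Python, not the algorithm used by Frag-Cubing or H-Frag; the function name `full_cube` is hypothetical. Each tuple contributes to 2^d cells, because every dimension is either kept or generalized to `*`.

```python
from collections import Counter
from itertools import product

# Base relation R from the slide: 11 tuples over dimensions A, B, C.
R = [
    ("a1", "b1", "c1"), ("a3", "b3", "c2"), ("a2", "b3", "c2"),
    ("a3", "b1", "c1"), ("a2", "b1", "c1"), ("a2", "b2", "c2"),
    ("a1", "b1", "c2"), ("a2", "b2", "c1"), ("a3", "b1", "c2"),
    ("a1", "b3", "c2"), ("a2", "b1", "c2"),
]

def full_cube(rows):
    """Return {cell: COUNT} for every non-empty cell of the full cube."""
    cube = Counter()
    d = len(rows[0])
    for row in rows:
        # Every dimension is either kept or replaced by '*': 2^d cells per tuple.
        for mask in product((False, True), repeat=d):
            cell = tuple(v if keep else "*" for v, keep in zip(row, mask))
            cube[cell] += 1
    return cube

cube = full_cube(R)
print(len(cube))              # number of aggregated cells
print(cube[("*", "*", "*")])  # apex cell
```

The naive enumeration makes the exponential cost visible: the inner loop runs 2^d times per tuple.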
9.
Related Work – Frag-Cubing Approach
• Splits data vertically;
• Reduces a high-dimensional cube into cuboids of lower dimension;
• Offers tradeoffs between the data cube computation runtime and the pre-processing of aggregations;
[Figure: dimensions A, B, C, D, E, F, … split into fragment cubes ABC and DEF]
From book Han and Kamber: Data Mining Concepts and Techniques
10.
For a 5-dimensional relation, two shell fragments can be built: (A, B, C) and (D, E)
tid A B C D E
1 a1 b1 c1 d1 e1
2 a1 b2 c1 d2 e1
3 a1 b2 c1 d1 e2
4 a2 b1 c1 d1 e2
5 a2 b1 c1 d1 e3
Related Work – Frag-Cubing Example
From book Han and Kamber: Data Mining Concepts and Techniques
11.
Related Work – Frag-Cubing 1-D Inverted Indexes
• Build a traditional inverted index or RID list
Attribute Value   TID List      List Size
a1                1 2 3         3
a2                4 5           2
b1                1 4 5         3
b2                2 3           2
c1                1 2 3 4 5     5
d1                1 3 4 5       4
d2                2             1
e1                1 2           2
e2                3 4           2
e3                5             1
From book Han and Kamber: Data Mining Concepts and Techniques
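As an illustration, the inverted index above can be rebuilt from the 5-tuple example relation with a few lines of Python. This is a minimal sketch; the name `build_inverted_index` is hypothetical and not taken from the Frag-Cubing code.

```python
from collections import defaultdict

# (tid, A, B, C, D, E) rows of the 5-dimensional example relation.
ROWS = [
    (1, "a1", "b1", "c1", "d1", "e1"),
    (2, "a1", "b2", "c1", "d2", "e1"),
    (3, "a1", "b2", "c1", "d1", "e2"),
    (4, "a2", "b1", "c1", "d1", "e2"),
    (5, "a2", "b1", "c1", "d1", "e3"),
]

def build_inverted_index(rows):
    """Map each attribute value to the sorted list of tids where it occurs."""
    index = defaultdict(list)
    for tid, *values in rows:
        for value in values:
            index[value].append(tid)
    return dict(index)

index = build_inverted_index(ROWS)
for value in sorted(index):
    print(value, index[value], len(index[value]))
```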
12.
Related Work – Frag-Cubing Approach
Generalize the 1-D inverted indexes to multi-dimensional ones in the data cube sense: compute all cuboids for data cubes ABC and DE while retaining the inverted indexes.
For example, shell fragment cube ABC contains 7 cuboids:
A, B, C
AB, AC, BC
ABC
Cell     Intersection          TID List   List Size
a1 b1    {1 2 3} ∩ {1 4 5}     {1}        1
a1 b2    {1 2 3} ∩ {2 3}       {2 3}      2
a2 b1    {4 5} ∩ {1 4 5}       {4 5}      2
a2 b2    {4 5} ∩ {2 3}         {}         0
From book Han and Kamber: Data Mining Concepts and Techniques
This completes the offline computation stage.
Frag-Cubing proposes to compute only the cuboids of a given fragment during the data cube computation.
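The AB cuboid in the table above follows from intersecting the 1-D tid-lists. A minimal Python sketch, with hypothetical helper names `intersect` and `cuboid`:

```python
# 1-D tid-lists taken from the inverted-index slide.
A = {"a1": [1, 2, 3], "a2": [4, 5]}
B = {"b1": [1, 4, 5], "b2": [2, 3]}

def intersect(xs, ys):
    """Intersect two sorted tid-lists, keeping the result sorted."""
    ys_set = set(ys)
    return [t for t in xs if t in ys_set]

def cuboid(dim1, dim2):
    """Build the 2-D cuboid: one tid-list per pair of attribute values."""
    return {
        (v1, v2): intersect(t1, t2)
        for v1, t1 in dim1.items()
        for v2, t2 in dim2.items()
    }

ab = cuboid(A, B)
for cell, tids in ab.items():
    print(cell, tids, len(tids))
```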
13.
• If measures other than count are present, store them in an ID_measure table separate from the shell fragments
tid count sum
1 5 70
2 3 10
3 8 20
4 5 40
5 2 30
Related Work – Frag-Cubing Measure Table
From book Han and Kamber: Data Mining Concepts and Techniques
14.
Related Work – Frag-Cubing Query
Once the data cube is computed into fragments, the query process follows these steps:
• Divide the query into fragments;
• Fetch the corresponding tid-list for each fragment from the fragment cube;
• Intersect the tid-lists from each fragment in order to construct an instantiated base table;
• Online computation: compute the data cube using the base table with any cubing algorithm.
From book Han and Kamber: Data Mining Concepts and Techniques
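The query steps can be sketched in Python using the tid-lists and the ID_measure table from the previous slides. The function name `point_query` is hypothetical; the sketch assumes the query instantiates at least one attribute value, and it aggregates count and sum by summing the per-tid entries of the measure table.

```python
# 1-D tid-lists of fragments (A, B, C) and (D, E) from the example.
INDEX = {
    "a1": [1, 2, 3], "a2": [4, 5],
    "b1": [1, 4, 5], "b2": [2, 3],
    "c1": [1, 2, 3, 4, 5],
    "d1": [1, 3, 4, 5], "d2": [2],
    "e1": [1, 2], "e2": [3, 4], "e3": [5],
}
# ID_measure table: tid -> (count, sum).
MEASURES = {1: (5, 70), 2: (3, 10), 3: (8, 20), 4: (5, 40), 5: (2, 30)}

def point_query(values):
    """Intersect the tid-lists of the instantiated values, then aggregate."""
    tids = set(INDEX[values[0]])
    for value in values[1:]:
        tids &= set(INDEX[value])
    tids = sorted(tids)
    count = sum(MEASURES[t][0] for t in tids)
    total = sum(MEASURES[t][1] for t in tids)
    return tids, count, total

# a1 comes from fragment (A, B, C), e1 from fragment (D, E).
print(point_query(["a1", "e1"]))
```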
15.
H-Frag Approach
• H-Frag implements a set of tuple identifiers per dimension attribute, similar to Frag-Cubing;
• H-Frag allows larger cubes by using external memory to store some of the computed cubes, rather than relying on main memory only.
The main challenge of using external memory is to define the criteria for selecting which fragments of the cube should be in main memory.
H-Frag selects the fragments of the cube to be stored in main memory according to the attribute value frequencies and the dimension cardinalities.
16.
H-Frag Architecture
17.
H-Frag Computation
First, the computation component scans the Entry Relation completely in order to obtain the frequency of each attribute value for each dimension.
Then the average frequency is obtained, and attribute values with frequencies lower than the average are marked in order to be stored in external memory.
[Figure: the computation component scans the Entry Relation, obtains the frequencies of the attribute values, and marks attribute values to be stored in External Memory]
18.
tid A B C M1 M2
1 a1 b1 c1 1.5 1
2 a2 b2 c2 2.5 1
3 a2 b2 c2 2 3
4 a3 b3 c2 78.5 2
5 a1 b1 c1 100 5
6 a2 b1 c2 102.5 4
7 a3 b1 c1 100 2
8 a1 b3 c2 22.5 3
9 a1 b3 c2 13.89 8
The frequency of each attribute value is: fa1=4, fa2=3, fa3=2,
fb1=4, fb2=2, fb3=3, fc1=3 and fc2=6.
H-Frag Computation – Example
19.
H-Frag Computation – First Step
- 3 is the average frequency in dimensions A and B;
- In dimension C, the average frequency is 4.5 (let's consider 4).
fa1=4, fa2=3, fa3=2; -> (4+3+2)/3 = 3;
fb1=4, fb2=2, fb3=3; -> (4+2+3)/3 = 3;
fc1=3 and fc2=6; -> (3+6)/2 = 4.5
External Memory: a3, b2, b3 and c1
The attribute values a3, b2, b3 and c1 are marked to be stored in external memory, because they are below the average.
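The first step can be sketched in Python as follows. The function name `mark_for_external` and the `inclusive` flag are assumptions made for illustration. The slides state "lower than the average", yet the list on this slide includes b3 (frequency 3, equal to the average) while the external-memory tables on later slides also include a2, so the exact comparison is left as a parameter rather than presented as the authors' rule.

```python
from collections import Counter

# (tid, A, B, C) rows of the example relation (measures omitted).
ROWS = [
    (1, "a1", "b1", "c1"), (2, "a2", "b2", "c2"), (3, "a2", "b2", "c2"),
    (4, "a3", "b3", "c2"), (5, "a1", "b1", "c1"), (6, "a2", "b1", "c2"),
    (7, "a3", "b1", "c1"), (8, "a1", "b3", "c2"), (9, "a1", "b3", "c2"),
]

def mark_for_external(rows, n_dims=3, inclusive=True):
    """Return (average frequency per dimension, values marked for external memory).

    inclusive=True marks values at or below the average (an assumption);
    inclusive=False marks only values strictly below it.
    """
    averages, marked = [], set()
    for d in range(1, n_dims + 1):
        freq = Counter(row[d] for row in rows)
        avg = sum(freq.values()) / len(freq)
        averages.append(avg)
        for value, f in freq.items():
            if f < avg or (inclusive and f == avg):
                marked.add(value)
    return averages, marked

averages, marked = mark_for_external(ROWS)
print(averages, sorted(marked))
```

With `inclusive=True` the marked set matches the external-memory tables shown in the worked example (a2, a3, b2, b3, c1); with `inclusive=False` it is the strict reading of the text (a3, b2, c1).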
20.
H-Frag Computation – Second Step
The Entry Relation is scanned a second time by the computation component in order to select the attribute values to be stored in external memory;
Each selected attribute value and its tid-list are stored in external memory;
H-Frag splits the Entry Relation into complementary portions defined by the user, with several tuples in each portion;
A single attribute value can have several complementary tid-lists in external memory, since RAM can get full.
[Figure: the computation component scans the Entry Relation to select the attribute values to be stored, and writes each attribute value and its tid-list to External Memory]
21.
H-Frag Computation – Second Step
• In order to avoid attribute values with a low number of tids in external memory, H-Frag defines an occurrence percentage for each attribute value inside a portion.
Entry Relation
1 a1 b1 c1 d1 e1 f1 g1 h1
2 a1 b2 c2 d1 e2 f2 g2 h2
3 a3 b8 c3 d3 e4 f5 g6 h7
4 a5 b6 c5 d5 e4 f5 g5 h6
5 a9 b9 c9 d9 e9 f9 g9 h9
6 a9 b4 c4 d4 e3 f4 g4 h4
7 a5 b7 c7 d7 e7 f7 g7 h7
8 a7 b7 c7 d7 e7 f7 g7 h7
If the portion size equals 4:
first portion (tids 1 to 4): the tid-lists of a1 and d1, and of e4 and f5, are stored;
second portion (tids 5 to 8): the tid-list of a9 is stored; . . .
Each attribute value that occurs in at least 50% of the tuples of a portion will have its tid-list stored in external memory.
Each portion should fit fully in main memory.
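The portion rule can be illustrated with a short Python sketch over the Entry Relation above. It is a simplification under stated assumptions: it ignores the first-step marking of low-frequency values and the handling of values that stay below the threshold, and the name `sublists_per_portion` is hypothetical.

```python
from collections import defaultdict

# (tid, A..H) rows of the Entry Relation from the slide.
ROWS = [
    (1, "a1", "b1", "c1", "d1", "e1", "f1", "g1", "h1"),
    (2, "a1", "b2", "c2", "d1", "e2", "f2", "g2", "h2"),
    (3, "a3", "b8", "c3", "d3", "e4", "f5", "g6", "h7"),
    (4, "a5", "b6", "c5", "d5", "e4", "f5", "g5", "h6"),
    (5, "a9", "b9", "c9", "d9", "e9", "f9", "g9", "h9"),
    (6, "a9", "b4", "c4", "d4", "e3", "f4", "g4", "h4"),
    (7, "a5", "b7", "c7", "d7", "e7", "f7", "g7", "h7"),
    (8, "a7", "b7", "c7", "d7", "e7", "f7", "g7", "h7"),
]

def sublists_per_portion(rows, portion_size, threshold=0.5):
    """For each portion, return the tid-sublists that reach the threshold."""
    stored = []
    for start in range(0, len(rows), portion_size):
        portion = rows[start:start + portion_size]
        tids = defaultdict(list)
        for tid, *values in portion:
            for value in values:
                tids[value].append(tid)
        # Keep values occurring in at least `threshold` of the portion's tuples.
        stored.append({v: t for v, t in tids.items()
                       if len(t) >= threshold * len(portion)})
    return stored

stored = sublists_per_portion(ROWS, portion_size=4)
print(stored[0])
```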
22.
H-Frag Computation – Second Step
If 80% of the available working memory is being used, all the tid-lists of the processed attribute values and all measure values are stored in external memory.
[Figure: working memory with 20% free; the tid-lists of a1, a2, b1, c1 and c2 are all stored]
This way, H-Frag eliminates the problem that arises when many attribute values are below 50% of a portion, which can happen in relations with high cardinality and low skew.
23.
H-Frag Computation – Measure Values
The measure values are grouped by portions;
Each group of measure values is identified by a tid interval or range;
This way, H-Frag generates only a few files with the measure values.
For example, when a portion of 10 tuples whose initial tid equals 1 is processed, a file with measure values identified as 1_10 is generated.
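The naming of measure groups by tid range can be sketched as below. The function `measure_groups` is hypothetical and keeps the groups in a dictionary instead of writing files; only the `first_last` identifier format comes from the slide.

```python
def measure_groups(measures, portion_size):
    """Group (tid, m1, m2, ...) rows by portion; key each group by its tid range."""
    groups = {}
    for start in range(0, len(measures), portion_size):
        chunk = measures[start:start + portion_size]
        group_id = f"{chunk[0][0]}_{chunk[-1][0]}"  # first tid, last tid
        groups[group_id] = chunk
    return groups

# Ten tuples with tids 1..10 and a single placeholder measure value each.
rows = [(tid, 0.0) for tid in range(1, 11)]
print(list(measure_groups(rows, portion_size=10)))
```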
24.
H-Frag Computation – Third Step
The relation is scanned a third time. This scan generates a map with the most frequent attribute values of the relation and their tid-lists.
This map is kept in main memory.
[Figure: the Entry Relation is scanned to build the map with the tid-lists of the most frequent attribute values in Main Memory]
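A minimal Python sketch of the third step, using the 9-tuple example relation of the H-Frag Computation example. It assumes that "top frequent" means a frequency strictly above the dimension's average, which reproduces the main-memory map shown in the worked example; the name `frequent_map` is hypothetical, and the sketch uses a single pass per dimension rather than a separate scan.

```python
from collections import defaultdict

# (tid, A, B, C) rows of the example relation (measures omitted).
ROWS = [
    (1, "a1", "b1", "c1"), (2, "a2", "b2", "c2"), (3, "a2", "b2", "c2"),
    (4, "a3", "b3", "c2"), (5, "a1", "b1", "c1"), (6, "a2", "b1", "c2"),
    (7, "a3", "b1", "c1"), (8, "a1", "b3", "c2"), (9, "a1", "b3", "c2"),
]

def frequent_map(rows, n_dims=3):
    """Map each above-average attribute value to its full tid-list."""
    main_memory = {}
    for d in range(1, n_dims + 1):
        tids = defaultdict(list)
        for row in rows:
            tids[row[d]].append(row[0])
        average = len(rows) / len(tids)  # average frequency in this dimension
        for value, tid_list in tids.items():
            if len(tid_list) > average:  # assumption: strictly above average
                main_memory[value] = tid_list
    return main_memory

main_memory = frequent_map(ROWS)
print(main_memory)
```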
25.
tid A B C M1 M2
1 a1 b1 c1 1.5 1
2 a2 b2 c2 2.5 1
3 a2 b2 c2 2 3
4 a3 b3 c2 78.5 2
5 a1 b1 c1 100 5
6 a2 b1 c2 102.5 4
7 a3 b1 c1 100 2
8 a1 b3 c2 22.5 3
9 a1 b3 c2 13.89 8
Example of the computing process given this relation
Recall that the frequencies of the attribute values are:
a1=4, a2=3, a3=2, b1=4, b2=2, b3=3, c1=3 and c2=6
26.
tid A B C M1 M2
1 a1 b1 c1 1.5 1
2 a2 b2 c2 2.5 1
3 a2 b2 c2 2 3
4 a3 b3 c2 78.5 2
5 a1 b1 c1 100 5
6 a2 b1 c2 102.5 4
7 a3 b1 c1 100 2
8 a1 b3 c2 22.5 3
9 a1 b3 c2 13.89 8
Example of the computing process
Attribute Values in External Memory
Attribute Value   Tids
a2                2, 3
a2                6
a3                4, 7
b2                2, 3
b3                4, 8
b3                9
c1                1, 5, 7
H-Frag stores a tid-sublist each time the attribute value is associated with 50% or more of the tids of the defined portion.
27.
Example of the computing process
In this example, let's consider a portion size of 2.
The attribute value a2, whose frequency is 3, will have one sublist stored with tids 2 and 3 and another sublist with tid 6.
28.
Example of the computing process
In this example, let's consider a portion size of 2.
The attribute value c1, whose frequency is 3, will have only one tid-list stored in external memory, with three tids.
29.
Example of the computing process
Frequent Attribute Values in Main Memory (the most frequent values)
Attribute Value   tids
a1                1, 5, 8, 9
b1                1, 5, 6, 7
c2                2, 3, 4, 6, 8, 9
Attribute Values in External Memory
Attribute Value   Tids
a2                2, 3
a2                6
a3                4, 7
b2                2, 3
b3                4, 8
b3                9
c1                1, 5, 7
Frequencies of the attribute values: fa1=4, fa2=3, fa3=2, fb1=4, fb2=2, fb3=3, fc1=3 and fc2=6.
31.
Example of the computing process
Measure Values Relation in External Memory
tids   M1      M2   Group ID
1      1.5     1    1_3
2      2.5     1    1_3
3      2       3    1_3
4      78.5    2    4_6
5      100     5    4_6
6      102.5   4    4_6
7      100     2    7_9
8      22.5    3    7_9
9      13.89   8    7_9
The group ID identifies the tid range of the processed tuples.
32.
H-Frag Update
Updates use the same algorithm as the H-Frag computation.
33.
H-Frag Update Relation: New Tuples
tid A B C M1 M2
1 a1 b1 c1 1.5 1
2 a2 b2 c2 2.5 1
3 a2 b2 c2 2 3
4 a3 b3 c2 78.5 2
5 a1 b1 c1 100 5
6 a2 b1 c2 102.5 4
7 a3 b1 c1 100 2
8 a1 b3 c2 22.5 3
9 a1 b3 c2 13.89 8
New tuples
tid A B C M1 M2
10 a4 b4 c4 3 7
11 a3 b3 c1 4.7 12
12 a1 b1 c2 5.5 6
External memory: new records are created
Attribute Value   tids
a2                2, 3
a2                6
a3                4, 7
a4                10
a3                11
b2                2, 3
b3                4, 8
b3                9
b3                11
b4                10
c1                1, 5, 7
c1                11
c4                10
Main memory: records are updated with the new tids, or are replaced by attribute values which become more frequent
Attribute Value   tids
a1                1, 5, 8, 9, 12
b1                1, 5, 6, 7, 12
c2                2, 3, 4, 6, 8, 9, 12
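The update with new tuples can be sketched in Python as follows. The function `apply_new_tuples` is hypothetical, and the sketch covers only the append path: it does not re-evaluate frequencies or swap values between main and external memory.

```python
# State after the initial computation (from the worked example).
main_memory = {
    "a1": [1, 5, 8, 9],
    "b1": [1, 5, 6, 7],
    "c2": [2, 3, 4, 6, 8, 9],
}
# External memory as a list of (attribute value, tid-sublist) records.
external = [
    ("a2", [2, 3]), ("a2", [6]), ("a3", [4, 7]), ("b2", [2, 3]),
    ("b3", [4, 8]), ("b3", [9]), ("c1", [1, 5, 7]),
]
NEW_TUPLES = [
    (10, "a4", "b4", "c4"),
    (11, "a3", "b3", "c1"),
    (12, "a1", "b1", "c2"),
]

def apply_new_tuples(main_memory, external, new_rows):
    """Append new tids in main memory or create new external records."""
    for tid, *values in new_rows:
        for value in values:
            if value in main_memory:
                main_memory[value].append(tid)   # update the resident tid-list
            else:
                external.append((value, [tid]))  # new record in external memory

apply_new_tuples(main_memory, external, NEW_TUPLES)
print(main_memory["a1"], len(external))
```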
34.
H-Frag Updates: attribute values are merged
Suppose that attribute values a2 and a3 are merged into attribute value a9.
The attribute value a9 will have the highest frequency and will replace attribute value a1 in main memory.
Therefore, the attribute value a1 will be stored in external memory.
External Memory (before the merge)
Attribute Value   Tids
a2                2, 3
a2                6
a3                4, 7
b2                2, 3
b3                4, 8
b3                9
c1                1, 5, 7
a2 + a3 = a9 : {2, 3, 6, 4, 7}
Main Memory (after the merge)
Attribute Value   tids
a9                2, 3, 6, 4, 7
b1                1, 5, 6, 7
c2                2, 3, 4, 6, 8, 9
External Memory (after the merge)
Attribute Value   Tids
a1                1, 5, 8, 9
b2                2, 3
b3                4, 8
b3                9
c1                1, 5, 7
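The merge example can be traced with a small Python sketch. The function `merge_values` and its `resident` parameter (the value of the same dimension currently held in main memory) are assumptions made for illustration, not part of the published algorithm.

```python
main_memory = {
    "a1": [1, 5, 8, 9],
    "b1": [1, 5, 6, 7],
    "c2": [2, 3, 4, 6, 8, 9],
}
# External memory: attribute value -> list of tid-sublists.
external = {
    "a2": [[2, 3], [6]], "a3": [[4, 7]], "b2": [[2, 3]],
    "b3": [[4, 8], [9]], "c1": [[1, 5, 7]],
}

def merge_values(main_memory, external, old_values, new_value, resident):
    """Merge external values into new_value; swap with resident if more frequent."""
    merged = []
    for value in old_values:
        for sublist in external.pop(value):
            merged.extend(sublist)
    if len(merged) > len(main_memory[resident]):
        # The merged value becomes the frequent one; demote the resident value.
        external[resident] = [main_memory.pop(resident)]
        main_memory[new_value] = merged
    else:
        external[new_value] = [merged]

merge_values(main_memory, external, ["a2", "a3"], "a9", resident="a1")
print(main_memory["a9"], external["a1"])
```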
35.
H-Frag Update: new dimensions and measures
tid A B C D M1 M2 M3
1 a1 b1 c1 d1 1.5 1 6
2 a2 b2 c2 d1 2.5 1 5.66
3 a2 b2 c2 d1 2 3 78.98
4 a3 b3 c2 d1 78.5 2 2.98
5 a1 b1 c1 d3 100 5 1.65
6 a2 b1 c2 d2 102.5 4 2.69
7 a3 b1 c1 d1 100 2 6.87
8 a1 b3 c2 d3 22.5 3 98.999
9 a1 b3 c2 d2 13.89 8 78.995
Frequent Attribute Values in Main Memory
Attribute Value   tids
a1                1, 5, 8, 9
b1                1, 5, 6, 7
c2                2, 3, 4, 6, 8, 9
d1                1, 2, 3, 4, 7
Attribute Values in External Memory
Attribute Value   tids
a2                2, 3
a2                6
a3                4, 7
b2                2, 3
b3                4, 8
b3                9
c1                1, 5, 7
d2                6, 9
d3                5, 8
Measure Values Relation in External Memory
tids   M1      M2   M3
1      1.5     1    6
2      2.5     1    5.66
3      2       3    78.98
4      78.5    2    2.98
5      100     5    1.65
6      102.5   4    2.69
7      100     2    6.87
8      22.5    3    98.999
9      13.89   8    78.995
The computing algorithm processes only the new dimensions and measures.
36.
H-Frag Range and Inquire Query
The H-Frag approach enables point queries and range queries:
rOp = (greater than + less than + between + some + different + similar x (fv1 … fvn))
It also allows inquire queries such as sub-cube and distinct:
iOp = (sub-cube + distinct + top-k similar x (fv1 … fvn))
37.
H-Frag Range and Inquire Query
In order to achieve better performance, H-Frag orders sub-cube queries so that it always starts with the queries that generate fewer intersections.
As a result of Q, we have qR=(TID1, TID2 … TIDk), where TIDi is the ith
tuple identifier of relation R.
38.
H-Frag Query
For each type of query, H-Frag checks whether each attribute value is stored in external memory.
After getting the tid-list of each attribute value that meets the user's query, it performs an intersection operation on those lists, and this intersection produces the result of the query.
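A minimal Python sketch of this lookup, using the main-memory map and external-memory sublists from the worked example. The dictionary `EXTERNAL` stands in for files on disk, and the helper names `tid_list` and `point_query` are hypothetical; the query is assumed to name at least one attribute value.

```python
MAIN = {
    "a1": [1, 5, 8, 9],
    "b1": [1, 5, 6, 7],
    "c2": [2, 3, 4, 6, 8, 9],
}
# Stand-in for external memory: attribute value -> complementary tid-sublists.
EXTERNAL = {
    "a2": [[2, 3], [6]], "a3": [[4, 7]], "b2": [[2, 3]],
    "b3": [[4, 8], [9]], "c1": [[1, 5, 7]],
}

def tid_list(value):
    """Fetch a tid-list from main memory, else rebuild it from external sublists."""
    if value in MAIN:
        return MAIN[value]
    return [tid for sublist in EXTERNAL.get(value, []) for tid in sublist]

def point_query(values):
    """Intersect the tid-lists of all queried values, shortest list first."""
    lists = sorted((tid_list(v) for v in values), key=len)
    result = set(lists[0])
    for tids in lists[1:]:
        result &= set(tids)
    return sorted(result)

print(point_query(["a1", "b3", "c2"]))
```

Starting from the shortest list keeps the intermediate sets small; that ordering is a design choice of this sketch.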
39.
H-Frag Query - example
q={?,?,c2}
A query like this, with two inquire operators, would be executed in SQL as follows (R stands for the base relation):
SELECT A, '*', 'c2', COUNT(A) FROM R WHERE C = 'c2'
GROUP BY 1,2,3 UNION
SELECT '*', B, 'c2', COUNT(B) FROM R WHERE C = 'c2'
GROUP BY 1,2,3 UNION
SELECT A, B, 'c2', COUNT(*) FROM R WHERE C = 'c2'
GROUP BY 1,2,3;
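The same sub-cube query can be traced in Python on the 9-tuple example relation used in the H-Frag computation slides. This sketch scans the rows directly rather than using tid-lists, and the function name `subcube_c` is hypothetical.

```python
from collections import Counter

# (A, B, C) values of the 9-tuple example relation, in tid order.
ROWS = [
    ("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a2", "b2", "c2"),
    ("a3", "b3", "c2"), ("a1", "b1", "c1"), ("a2", "b1", "c2"),
    ("a3", "b1", "c1"), ("a1", "b3", "c2"), ("a1", "b3", "c2"),
]

def subcube_c(rows, c_value):
    """Answer q = {?, ?, c_value}: counts for (A,*,c), (*,B,c) and (A,B,c)."""
    result = Counter()
    for a, b, c in rows:
        if c != c_value:
            continue
        result[(a, "*", c_value)] += 1
        result[("*", b, c_value)] += 1
        result[(a, b, c_value)] += 1
    return result

cells = subcube_c(ROWS, "c2")
for cell in sorted(cells):
    print(cell, cells[cell])
```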
40.
Experiments
• We evaluated the H-Frag computation and query algorithms against the Frag-Cubing algorithm used in [Li et al. 2004];
• The H-Frag algorithms were coded in 64-bit Java;
• Frag-Cubing is a free and open source C++ application (http://illimine.cs.uiuc.edu/);
• The synthetic base relations were created using the data generator provided by the IlliMine project;
• The IlliMine project is an open-source project that provides various approaches for data mining and machine learning;
• The Frag-Cubing approach is part of the IlliMine project;
• We ran the algorithms on two Intel Xeon six-core processors (2.4 GHz per core, 12 MB cache) with 128 GB of DDR3 1333 MHz RAM;
• The system runs 64-bit Windows Server 2008, High Performance version.
41.
Results - Performance Evaluation of Point Queries
Response time per query over 100 trials: T = 10^7; C = 10^4; D = 30; S = 0
On average, point queries were answered 3 times slower by the H-Frag approach than by the Frag-Cubing approach.
42.
Results - Performance Evaluation of Inquire Operators
Response time of queries with inquire operators: T = 10^7; C = 10^4; D = 30; S = 0.
• Queries with two inquire operators were answered by the H-Frag approach about 2.5 times slower than by the Frag-Cubing approach.
• The Frag-Cubing approach did not perform queries with three inquire operators due to memory overflow.
43.
Results - Computation of relations with different numbers of dimensions
T = 10^7; C = 10^4; D = 30; S = 0.
The runtime was linear in both approaches.
On average, the hybrid memory usage caused the H-Frag approach to consume 3 times less main memory.
44.
Results - Massive Data Cube
One relation with T = 10^9 tuples was computed by the H-Frag approach.
This experiment took 64 hours and consumed 126 GB of RAM.
Queries with five range operators, ten point operators, and one inquire operator were answered in less than 35 seconds.
Data cubes with a high number of tuples could not be computed by the Frag-Cubing approach using main memory only. This was demonstrated by trying to compute a base relation with 200 million tuples and 60 dimensions.
45.
Conclusions
• H-Frag has linear runtime and memory consumption, similar to Frag-Cubing;
• When compared to Frag-Cubing, H-Frag is faster to answer sub-cube queries;
• It introduces a different cube representation with fewer empty cells than Frag-Cubing;
• Frag-Cubing cannot answer queries with two sub-cube operators in a data cube with 200 million tuples, C = 10^4, D = 30 and S = 0;
• We had scenarios where the Frag-Cubing approach failed to compute the data cube due to lack of main memory.
46.
Conclusions
Interesting research directions to further extend H-Frag:
First, we must evaluate H-Frag with holistic measures.
Top-k queries are part of our interest, since inverted indexes are also useful for this type of problem.
Multicore and multicomputer versions of H-Frag must be implemented.
47.
Acknowledgements
Thank you very much
e-mail rrochas@gmail.com