A Hybrid Memory Data Cube Approach for High Dimension Relations

H-Frag: A Hybrid Memory Data Cube Approach for High Dimension Relations
Rodrigo Rocha Silva
Doctoral Student
Prof. Dr. Celso Massaki Hirata
Advisor
Prof. Dr. Joubert de Castro Lima
Co-Advisor
ITA – AERONAUTICS INSTITUTE OF TECHNOLOGY
Electronic Engineering and Computer Science Division - EEC/I
Department of Computer Science
Brazil

Monday, April 27, 2015
17th International Conference on Enterprise Information Systems - Rodrigo Rocha Silva
What is H-Frag?
H-Frag is a method for data cube computation that extends the
Frag-Cubing approach, enabling the computation of massive data cubes
by making use of external memory rather than relying on main
memory only.

Topics
– Motivation;
– Data Cube;
– Frag-Cubing;
– H-Frag approach;
– Experiments;
– Results;
– Conclusions;

Motivation
Users need to view data in a tangible way. Reports, cross
tables and dashboards are usually the most used tools for
visualizing data.

• Approaches that use inverted indexes, such as
Frag-Cubing, are considered efficient in terms of runtime
and main memory usage for massive data cube
computation and query.
• Approaches that use main memory only are limited
when the data cube size exceeds the main memory
capacity.
Motivation

A data cube has exponential complexity
in its runtime and storage space when
the number of dimensions increases
linearly.
Data Cube
For an input with size d, the
output has size 2^d.
Allows the materialization of all or some cells or tuples of a
cube, which is represented by measures and dimensions.

Data Cube
[Figure: example dimension hierarchies. Labels: Subjects, Department, Year; Hour, Day, Year]
A dimension may contain a hierarchical relation between two or
more members.
The individual members of a dimension may be hierarchically
related to each other.

Base Relation R – 11 tuples
A B C COUNT
a1 b1 c1 1
a3 b3 c2 1
a2 b3 c2 1
a3 b1 c1 1
a2 b1 c1 1
a2 b2 c2 1
a1 b1 c2 1
a2 b2 c1 1
a3 b1 c2 1
a1 b3 c2 1
a2 b1 c2 1
FULL 3D CUBE (27 aggregated cells + 11 base cells = 38 aggregations)
A B C COUNT
* * * 11
a1 * * 3
a2 * * 5
a3 * * 3
* b1 * 6
* b2 * 2
* b3 * 3
* * c1 4
* * c2 7
a1 b1 * 2
a1 b3 * 1
a2 b1 * 2
a2 b2 * 2
a2 b3 * 1
a3 b1 * 2
a3 b3 * 1
a1 * c1 1
a1 * c2 2
a2 * c1 2
a2 * c2 3
a3 * c1 1
a3 * c2 2
* b1 c1 3
* b1 c2 3
* b2 c1 1
* b2 c2 1
* b3 c2 3
a1 b1 c1 1
a3 b3 c2 1
a2 b3 c2 1
a3 b1 c1 1
a2 b1 c1 1
a2 b2 c2 1
a1 b1 c2 1
a2 b2 c1 1
a3 b1 c2 1
a1 b3 c2 1
a2 b1 c2 1
Data Cube
Construction of a complete data cube is
an exponential problem
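
The exponential growth can be checked on the base relation R shown above. The following is a minimal sketch, not the authors' implementation: it counts every cell of every cuboid by letting each tuple contribute to all 2^d subsets of its dimensions. The function name `full_cube` is hypothetical.

```python
from collections import Counter
from itertools import product

# Base relation R from the slides: 11 tuples over dimensions A, B, C.
R = [
    ("a1", "b1", "c1"), ("a3", "b3", "c2"), ("a2", "b3", "c2"),
    ("a3", "b1", "c1"), ("a2", "b1", "c1"), ("a2", "b2", "c2"),
    ("a1", "b1", "c2"), ("a2", "b2", "c1"), ("a3", "b1", "c2"),
    ("a1", "b3", "c2"), ("a2", "b1", "c2"),
]

def full_cube(relation):
    """Count every cell of every cuboid; '*' marks an aggregated dimension."""
    d = len(relation[0])
    cube = Counter()
    for row in relation:
        # Each tuple contributes to 2^d cells, one per subset of kept dimensions.
        for keep in product((True, False), repeat=d):
            cell = tuple(v if k else "*" for v, k in zip(row, keep))
            cube[cell] += 1
    return cube

cube = full_cube(R)
print(len(cube))              # number of non-empty cells in the full cube
print(cube[("*", "*", "*")])  # apex cell: total number of tuples
```

Every tuple touches 2^3 = 8 cells here; with 30 dimensions the same loop would touch 2^30 cells per tuple, which is why full materialization does not scale.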

Related Work – Frag-Cubing Approach
• Splits data vertically;
• Reduces high-dimensional cube into cuboids of lower
dimension;
• Offers tradeoffs between the data cube computation
runtime and the pre-processing of aggregations;
[Figure: dimensions A, B, C, D, E, F, … split into shell fragments: cube ABC and cube DEF]
From book Han and Kamber: Data Mining Concepts and Techniques

For a 5-dimension relation:
two shell fragments can be built: (A, B, C) and (D, E)
tid A B C D E
1 a1 b1 c1 d1 e1
2 a1 b2 c1 d2 e1
3 a1 b2 c1 d1 e2
4 a2 b1 c1 d1 e2
5 a2 b1 c1 d1 e3
Related Work – Frag-Cubing Example

• Build a traditional inverted index or RID list
Attribute Value | TID List | List Size
a1 | 1, 2, 3 | 3
a2 | 4, 5 | 2
b1 | 1, 4, 5 | 3
b2 | 2, 3 | 2
c1 | 1, 2, 3, 4, 5 | 5
d1 | 1, 3, 4, 5 | 4
d2 | 2 | 1
e1 | 1, 2 | 2
e2 | 3, 4 | 2
e3 | 5 | 1
Related Work – Frag-Cubing 1-D Inverted Indexes
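
The 1-D inverted index above can be rebuilt from the five-tuple relation with a single pass. This is a minimal sketch under the assumption that attribute values are unique across dimensions (as in the example, where a1, b1, ... carry their dimension in the name); `build_inverted_index` is a hypothetical helper name.

```python
from collections import defaultdict

# The 5-tuple example relation: tid -> (A, B, C, D, E).
RELATION = {
    1: ("a1", "b1", "c1", "d1", "e1"),
    2: ("a1", "b2", "c1", "d2", "e1"),
    3: ("a1", "b2", "c1", "d1", "e2"),
    4: ("a2", "b1", "c1", "d1", "e2"),
    5: ("a2", "b1", "c1", "d1", "e3"),
}

def build_inverted_index(relation):
    """Map each attribute value to the sorted list of tids that contain it."""
    index = defaultdict(list)
    for tid in sorted(relation):
        for value in relation[tid]:
            index[value].append(tid)
    return dict(index)

index = build_inverted_index(RELATION)
for value, tids in index.items():
    print(value, tids, len(tids))  # attribute value, TID list, list size
```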

Generalize the 1-D inverted indexes to multi-dimensional ones
in the data cube sense: compute all cuboids for data cubes
ABC and DE while retaining the inverted indexes.
For example, shell fragment cube ABC contains 7 cuboids:
A, B, C
AB, AC, BC
ABC
Cuboid AB:
Cell | Intersection | TID List | List Size
(a1, b1) | {1, 2, 3} ∩ {1, 4, 5} | 1 | 1
(a1, b2) | {1, 2, 3} ∩ {2, 3} | 2, 3 | 2
(a2, b1) | {4, 5} ∩ {1, 4, 5} | 4, 5 | 2
(a2, b2) | {4, 5} ∩ {2, 3} | (empty) | 0

Related Work – Frag-Cubing Approach
This completes the offline computation stage.
Frag-Cubing computes only the cuboids of each
fragment during the data cube computation.
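
As a minimal sketch of how a 2-D cuboid inside a fragment can be derived from the 1-D inverted indexes, the code below intersects the tid-lists of dimensions A and B from the example. The linear merge and the names `intersect` and `cuboid_2d` are illustrative choices, not taken from the Frag-Cubing source code.

```python
from itertools import product

# 1-D inverted indexes for dimensions A and B, copied from the example.
A_INDEX = {"a1": [1, 2, 3], "a2": [4, 5]}
B_INDEX = {"b1": [1, 4, 5], "b2": [2, 3]}

def intersect(left, right):
    """Intersect two sorted tid-lists with a linear merge."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out

def cuboid_2d(index_x, index_y):
    """Build every cell of a 2-D cuboid, keeping its tid-list."""
    return {
        (x, y): intersect(index_x[x], index_y[y])
        for x, y in product(index_x, index_y)
    }

cuboid_ab = cuboid_2d(A_INDEX, B_INDEX)
for cell, tids in cuboid_ab.items():
    print(cell, tids, len(tids))
```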

• If measures other than count are present, store them
in an ID_measure table, separate from the shell
fragments
tid count sum
1 5 70
2 3 10
3 8 20
4 5 40
5 2 30
Related Work – Frag-Cubing Measure Table

Once the data cube is computed into fragments, the query
process follows these steps:
• Divide the query into fragments;
• Fetch the corresponding tid-list for each fragment from the
fragment cube;
• Intersect the tid-lists from each fragment in order to construct an
instantiated base table;
• Compute the data cube using the base table with any
cubing algorithm.
Related Work – Frag-Cubing Query
[Figure: online computation over the instantiated base table]
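
The query steps can be sketched on the five-tuple example and its ID_measure table. This is a simplified illustration: it intersects 1-D tid-lists directly instead of fetching precomputed fragment cells, and it aggregates the measures of the resulting tids instead of running a full cubing algorithm on the base table. `point_query` is a hypothetical name.

```python
# Tid-lists from the 1-D inverted indexes of the example.
INDEX = {
    "a1": {1, 2, 3}, "a2": {4, 5},
    "b1": {1, 4, 5}, "b2": {2, 3},
    "c1": {1, 2, 3, 4, 5},
    "d1": {1, 3, 4, 5}, "d2": {2},
    "e1": {1, 2}, "e2": {3, 4}, "e3": {5},
}
# ID_measure table from the example: tid -> (count, sum).
ID_MEASURE = {1: (5, 70), 2: (3, 10), 3: (8, 20), 4: (5, 40), 5: (2, 30)}

def point_query(values):
    """Intersect the tid-lists of the instantiated values, then aggregate."""
    tid_sets = [INDEX[v] for v in values]
    # With no instantiated value, every tuple qualifies.
    tids = set.intersection(*tid_sets) if tid_sets else set(ID_MEASURE)
    total_count = sum(ID_MEASURE[t][0] for t in tids)
    total_sum = sum(ID_MEASURE[t][1] for t in tids)
    return sorted(tids), total_count, total_sum

# a2 and b1 come from fragment (A, B, C); d1 comes from fragment (D, E).
print(point_query(["a2", "b1", "d1"]))
```

For the query above, the intersection {4, 5} ∩ {1, 4, 5} ∩ {1, 3, 4, 5} leaves tids 4 and 5, whose measures are then looked up in the ID_measure table.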

H-Frag Approach
• Implements a set of tuple identifiers per dimension attribute,
similar to Frag-Cubing;
• H-Frag allows larger cubes by using external memory to store
some of the computed cubes, rather than relying on the main
memory only.
The main challenge of using external memory is
to define the criteria to select which fragments
of the cube should be in main memory.
H-Frag selects the fragments of the cube to be stored
in main memory according to the attribute value
frequencies and dimension cardinalities.

H-Frag Architecture

H-Frag Computation
First, the computation component scans the Entry Relation
completely in order to obtain the frequency of each attribute
value for each dimension.
Then, the average frequency is obtained, and attribute
values with frequencies lower than the average are marked
in order to be stored in the external memory.
[Figure: the computation component scans the Entry Relation, obtains the frequencies of the attribute values, and marks attribute values to be stored in the External Memory]

tid A B C M1 M2
1 a1 b1 c1 1.5 1
2 a2 b2 c2 2.5 1
3 a2 b2 c2 2 3
4 a3 b3 c2 78.5 2
5 a1 b1 c1 100 5
6 a2 b1 c2 102.5 4
7 a3 b1 c1 100 2
8 a1 b3 c2 22.5 3
9 a1 b3 c2 13.89 8
The frequency of each attribute value is: fa1=4, fa2=3, fa3=2,
fb1=4, fb2=2, fb3=3, fc1=3 and fc2=6.
H-Frag Computation – Example

H-Frag Computation – First Step
- The average frequency in dimensions A and B is 3;
- In dimension C, the average frequency is 4.5 (let's consider 4).
A: fa1=4, fa2=3, fa3=2 -> (4+3+2)/3 = 3
B: fb1=4, fb2=2, fb3=3 -> (4+2+3)/3 = 3
C: fc1=3, fc2=6 -> (3+6)/2 = 4.5
External Memory: a2, a3, b2, b3 and c1
The attribute values a2, a3, b2, b3 and c1 are marked to be stored in
the external memory, because their frequencies do not exceed the average.
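
The first scan can be sketched as follows on the nine-tuple example. One point is an assumption of this sketch: the external-memory tables shown later in the slides list a2 and b3, whose frequency equals the average, so values are marked here when their frequency is at or below the average of their dimension. `first_scan` is a hypothetical name.

```python
from collections import Counter

# Entry Relation from the example: tid -> (A, B, C). Measures are omitted.
ENTRY = {
    1: ("a1", "b1", "c1"), 2: ("a2", "b2", "c2"), 3: ("a2", "b2", "c2"),
    4: ("a3", "b3", "c2"), 5: ("a1", "b1", "c1"), 6: ("a2", "b1", "c2"),
    7: ("a3", "b1", "c1"), 8: ("a1", "b3", "c2"), 9: ("a1", "b3", "c2"),
}

def first_scan(entry):
    """Return per-dimension frequencies, averages, and the marked values."""
    dims = len(next(iter(entry.values())))
    freqs = [Counter() for _ in range(dims)]
    for row in entry.values():
        for d, value in enumerate(row):
            freqs[d][value] += 1
    averages = [sum(f.values()) / len(f) for f in freqs]
    # Assumption: "at or below the average" marks a value for external memory.
    marked = {
        value
        for f, avg in zip(freqs, averages)
        for value, n in f.items()
        if n <= avg
    }
    return freqs, averages, marked

freqs, averages, marked = first_scan(ENTRY)
print(averages)
print(sorted(marked))
```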

H-Frag Computation – Second Step
The Entry Relation is scanned a second time by the computation
component in order to select the attribute values to be stored in the
external memory;
Each attribute value and its tid-list are stored in the external memory;
H-Frag splits the Entry Relation into complementary portions defined by
the user, with several tuples in each portion.
A single attribute value can have several complementary tid-lists in external
memory, since RAM can get full.
[Figure: the computation component scans the Entry Relation to select the attribute values to be stored; each attribute value and its tid-list go to the External Memory]

• In order to avoid attribute values with a low number of tids in the
external memory, H-Frag defines an occurrence percentage for each
attribute value inside a portion.
Entry Relation (portion size equals 4)
tid A B C D E F G H
1 a1 b1 c1 d1 e1 f1 g1 h1
2 a1 b2 c2 d1 e2 f2 g2 h2
3 a3 b8 c3 d3 e4 f5 g6 h7
4 a5 b6 c5 d5 e4 f5 g5 h6
5 a9 b9 c9 d9 e9 f9 g9 h9
6 a9 b4 c4 d4 e3 f4 g4 h4
7 a5 b7 c7 d7 e7 f7 g7 h7
8 a7 b7 c7 d7 e7 f7 g7 h7
First portion (tids 1 to 4): a1 and d1 tid-lists stored; e4 and f5 tid-lists stored.
Second portion (tids 5 to 8): a9 tid-list stored; . . .
Each attribute value that occurs in 50% of the tuples of a portion,
counted over the tuples processed so far, will have its
tid-list stored in the external memory.
Each portion should fit entirely in the main memory.
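
The occurrence rule can be illustrated on the eight-tuple relation above. The sketch below simplifies the slide in two ways, both assumptions: it evaluates the 50% threshold once at the end of each portion rather than while tuples are being processed, and it ignores what happens to values that stay below the threshold. `portion_flushes` and its `ratio` parameter are hypothetical names.

```python
import math
from collections import defaultdict

# Eight-tuple Entry Relation from the slide: tid -> attribute values (A..H).
ENTRY = {
    1: "a1 b1 c1 d1 e1 f1 g1 h1".split(),
    2: "a1 b2 c2 d1 e2 f2 g2 h2".split(),
    3: "a3 b8 c3 d3 e4 f5 g6 h7".split(),
    4: "a5 b6 c5 d5 e4 f5 g5 h6".split(),
    5: "a9 b9 c9 d9 e9 f9 g9 h9".split(),
    6: "a9 b4 c4 d4 e3 f4 g4 h4".split(),
    7: "a5 b7 c7 d7 e7 f7 g7 h7".split(),
    8: "a7 b7 c7 d7 e7 f7 g7 h7".split(),
}

def portion_flushes(entry, portion_size, ratio=0.5):
    """Return, per portion, the tid-lists that reach the occurrence threshold."""
    threshold = math.ceil(ratio * portion_size)
    tids = sorted(entry)
    flushed = []
    for start in range(0, len(tids), portion_size):
        partial = defaultdict(list)  # tid-lists restricted to this portion
        for tid in tids[start:start + portion_size]:
            for value in entry[tid]:
                partial[value].append(tid)
        flushed.append(
            {v: t for v, t in partial.items() if len(t) >= threshold}
        )
    return flushed

first, second = portion_flushes(ENTRY, portion_size=4)
print(first)
print(second)
```

With a portion of 4 tuples the threshold is 2 occurrences, which reproduces the values named on the slide for the first portion and a9 for the second.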

If 80% of the available working memory is being used, all the
tid-lists of the processed attribute values and all measure
values are stored in the external memory.
[Figure: working memory holding the tid-lists a1 = {1, .. 4}, a2 = {2, .. 8}, b1 = {1, .. 3}, c1 = {3, .. 7}, c2 = {2, .. 4}; with 20% of the working memory left, all of them are stored]
This way, H-Frag eliminates the problem that arises when many
attribute values fall below 50% of a
portion, which can happen in relations
with high cardinality and low skew.

H-Frag Computation – Measure Values
The measure values are grouped by portions;
Each group of measure values is identified by a tid interval
or range;
This way, H-Frag generates only a few files with the measure
values.
For example, when a portion of 10 tuples whose
initial tid equals 1 is processed, a file with
measure values identified as 1_10 is generated.
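
The grouping of measure values by tid range can be sketched as below, using the M1 and M2 values of the nine-tuple example and a portion size of 3 chosen only for illustration. The sketch returns in-memory dictionaries keyed by the range identifier; writing each group to a file, as H-Frag does, is left out. `measure_groups` is a hypothetical name.

```python
# Measure values of the nine-tuple example: tid -> (M1, M2).
MEASURES = {
    1: (1.5, 1), 2: (2.5, 1), 3: (2, 3),
    4: (78.5, 2), 5: (100, 5), 6: (102.5, 4),
    7: (100, 2), 8: (22.5, 3), 9: (13.89, 8),
}

def measure_groups(measures, portion_size):
    """Split tid -> measures into groups named '<first tid>_<last tid>'."""
    tids = sorted(measures)
    groups = {}
    for start in range(0, len(tids), portion_size):
        chunk = tids[start:start + portion_size]
        group_id = f"{chunk[0]}_{chunk[-1]}"
        groups[group_id] = {tid: measures[tid] for tid in chunk}
    return groups

groups = measure_groups(MEASURES, portion_size=3)
print(list(groups))
```

Because the identifier encodes the tid range, a query only needs to open the groups whose range overlaps the tids it selected.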

H-Frag Computation – Third Step
The Entry Relation is scanned a third time. As a result, H-Frag generates
a map with the most frequent attribute values of the relation and
their tid-lists.
This map is kept in the main memory.
[Figure: the computation component scans the Entry Relation and builds, in the Main Memory, the map with the tid-lists of the most frequent attribute values]

tid A B C M1 M2
1 a1 b1 c1 1.5 1
2 a2 b2 c2 2.5 1
3 a2 b2 c2 2 3
4 a3 b3 c2 78.5 2
5 a1 b1 c1 100 5
6 a2 b1 c2 102.5 4
7 a3 b1 c1 100 2
8 a1 b3 c2 22.5 3
9 a1 b3 c2 13.89 8
Example of the computing process, given this relation.
Recall that the frequencies
of the attribute values are:
fa1=4, fa2=3, fa3=2, fb1=4, fb2=2,
fb3=3, fc1=3 and fc2=6

Example of the computing process
Attribute Values in External Memory
Attribute Value Tids
a2 2, 3
a2 6
a3 4, 7
b2 2, 3
b3 4, 8
b3 9
c1 1, 5, 7
H-Frag stores a tid-sublist each time the attribute value is associated with 50% or
more of the tids of the defined portion.

In this example, let's
consider a portion size
equal to 2.
The attribute value a2,
whose frequency is 3, will
have one sublist stored with
tids 2 and 3, and another
sublist with tid 6.

The attribute value c1, whose
frequency is 3, will have
only one tid-list stored in
the external memory, with
three tids.

Frequent Attribute Values in Main Memory
Attribute Value tids
a1 1, 5, 8, 9
b1 1, 5, 6, 7
c2 2, 3, 4, 6, 8, 9
Frequencies of the attribute values: fa1=4, fa2=3, fa3=2, fb1=4, fb2=2, fb3=3, fc1=3 and fc2=6.
The most frequent attribute values (a1, b1 and c2) are the ones kept in the main memory.


Measure Values Relation in External Memory
tids M1 M2 Group ID
1 1.5 1 1_3
2 2.5 1 1_3
3 2 3 1_3
4 78.5 2 4_6
5 100 5 4_6
6 102.5 4 4_6
7 100 2 7_9
8 22.5 3 7_9
9 13.89 8 7_9
The group ID identifies the tid range of the processed tuples.
H-Frag Update
Updates use the same H-Frag Computation algorithm.

H-Frag Update Relation: New Tuples
New tuples appended to the nine-tuple relation:
tid A B C M1 M2
10 a4 b4 c4 3 7
11 a3 b3 c1 4.7 12
12 a1 b1 c2 5.5 6
Main Memory (records are updated with the new tids, or are replaced by
attribute values which become more frequent):
a1 1, 5, 8, 9, 12
b1 1, 5, 6, 7, 12
c2 2, 3, 4, 6, 8, 9, 12
External Memory (new records are created for the new tuples):
a2 2, 3
a2 6
a3 4, 7
a4 10
a3 11
b2 2, 3
b3 4, 8
b3 9
b3 11
b4 10
c1 1, 5, 7
c1 11
c4 10
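
A minimal sketch of the update with new tuples, assuming the main-memory map and the external-memory records of the nine-tuple example as the starting state. It only covers the two cases visible on the slide: appending new tids to values already in main memory, and creating new external records for the other values. Replacing main-memory values that are overtaken in frequency is not modelled. `apply_new_tuples` is a hypothetical name, and a Python list stands in for the external memory.

```python
# State after the initial computation of the nine-tuple example.
main = {
    "a1": [1, 5, 8, 9],
    "b1": [1, 5, 6, 7],
    "c2": [2, 3, 4, 6, 8, 9],
}
# External memory as a list of (attribute value, tid-sublist) records.
external = [
    ("a2", [2, 3]), ("a2", [6]), ("a3", [4, 7]),
    ("b2", [2, 3]), ("b3", [4, 8]), ("b3", [9]),
    ("c1", [1, 5, 7]),
]

def apply_new_tuples(main, external, new_tuples):
    """Append new tids to main-memory lists or create external records."""
    for tid, values in new_tuples:
        for value in values:
            if value in main:
                main[value].append(tid)
            else:
                # A new complementary sublist; older sublists stay untouched.
                external.append((value, [tid]))

NEW_TUPLES = [
    (10, ("a4", "b4", "c4")),
    (11, ("a3", "b3", "c1")),
    (12, ("a1", "b1", "c2")),
]
apply_new_tuples(main, external, NEW_TUPLES)
print(main)
print(external)
```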

H-Frag Updates: attribute values are merged
Suppose that attribute values a2 and a3 are merged as attribute
value a9.
The attribute value a9 will have the highest frequency and will
replace attribute value a1 in the main memory.
Therefore, the attribute value a1 will be stored in external memory.
External Memory before the merge:
a2 2, 3
a2 6
a3 4, 7
b2 2, 3
b3 4, 8
b3 9
c1 1, 5, 7
a2 + a3 = a9 : {2, 3, 6, 4, 7}
Main Memory after the merge:
a9 2, 3, 6, 4, 7
b1 1, 5, 6, 7
c2 2, 3, 4, 6, 8, 9
External Memory after the merge:
a1 1, 5, 8, 9
b2 2, 3
b3 4, 8
b3 9
c1 1, 5, 7
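
The merge of a2 and a3 into a9 can be sketched as follows. The function name `merge_values` and its `resident` parameter (the value of the same dimension currently held in main memory) are hypothetical; the slides do not describe how H-Frag locates the value to displace.

```python
main = {
    "a1": [1, 5, 8, 9],
    "b1": [1, 5, 6, 7],
    "c2": [2, 3, 4, 6, 8, 9],
}
# External memory: attribute value -> list of complementary tid-sublists.
external = {
    "a2": [[2, 3], [6]], "a3": [[4, 7]],
    "b2": [[2, 3]], "b3": [[4, 8], [9]],
    "c1": [[1, 5, 7]],
}

def merge_values(main, external, old_values, new_value, resident):
    """Merge external values into new_value; swap it in if more frequent."""
    merged = [
        tid
        for old in old_values
        for sublist in external.pop(old)
        for tid in sublist
    ]
    if len(merged) > len(main[resident]):
        # The merged value outranks the resident one, so they trade places.
        external[resident] = [main.pop(resident)]
        main[new_value] = merged
    else:
        external[new_value] = [merged]
    return merged

merged = merge_values(main, external, ["a2", "a3"], "a9", resident="a1")
print(merged)
print(sorted(main))
print(sorted(external))
```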

H-Frag Update: new dimensions and measures
tid A B C D M1 M2 M3
1 a1 b1 c1 d1 1.5 1 6
2 a2 b2 c2 d1 2.5 1 5.66
3 a2 b2 c2 d1 2 3 78.98
4 a3 b3 c2 d1 78.5 2 2.98
5 a1 b1 c1 d3 100 5 1.65
6 a2 b1 c2 d2 102.5 4 2.69
7 a3 b1 c1 d1 100 2 6.87
8 a1 b3 c2 d3 22.5 3 98.999
9 a1 b3 c2 d2 13.89 8 78.995
Main Memory:
a1 1, 5, 8, 9
b1 1, 5, 6, 7
c2 2, 3, 4, 6, 8, 9
d1 1, 2, 3, 4, 7
External Memory:
a2 2, 3
a2 6
a3 4, 7
b2 2, 3
b3 4, 8
b3 9
c1 1, 5, 7
d2 6, 9
d3 5, 8
Measure Values Relation in External Memory:
tids M1 M2 M3
1 1.5 1 6
2 2.5 1 5.66
3 2 3 78.98
4 78.5 2 2.98
5 100 5 1.65
6 102.5 4 2.69
7 100 2 6.87
8 22.5 3 98.999
9 13.89 8 78.995
The computing algorithm processes only the new dimensions and
measures.

H-Frag Range and Inquire Query
The H-Frag approach enables point queries and range queries:
rOp = (greater than + less than + between + some + different +
similar x (fv1 … fvn))
It also allows inquire queries such as sub-cube and distinct:
iOp = (sub-cube + distinct + top-k similar x (fv1 … fvn))

H-Frag Range and Inquire Query
In order to achieve better performance, H-Frag orders sub-cube
queries so that it always starts with the queries that generate fewer
intersections.
As a result of a query Q, we have qR = (TID1, TID2 … TIDk), where TIDi is the i-th
tuple identifier of relation R.

H-Frag Query
For each type of query, H-Frag
checks whether each
attribute value is stored in the
external memory.
Once the tid-lists of
the attribute values that meet
the user's query are obtained, an
intersection of those
lists is performed, and this
produces the result of the
query.
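
A minimal sketch of this lookup-then-intersect flow on the nine-tuple example. Dictionaries stand in for the main-memory map and for the external storage, so no disk access is modelled. Intersecting the shortest list first is a choice made for this sketch only. The names `tid_list` and `answer` are hypothetical.

```python
# Most frequent attribute values, kept in main memory.
MAIN = {
    "a1": [1, 5, 8, 9],
    "b1": [1, 5, 6, 7],
    "c2": [2, 3, 4, 6, 8, 9],
}
# Remaining attribute values; each may have several complementary sublists.
EXTERNAL = {
    "a2": [[2, 3], [6]], "a3": [[4, 7]],
    "b2": [[2, 3]], "b3": [[4, 8], [9]],
    "c1": [[1, 5, 7]],
}

def tid_list(value):
    """Fetch the tids of a value from main memory or from external memory."""
    if value in MAIN:
        return set(MAIN[value])
    # In H-Frag this would be a read from external memory.
    return {tid for sublist in EXTERNAL.get(value, []) for tid in sublist}

def answer(values):
    """Intersect the tid-lists of all instantiated values of a point query."""
    lists = sorted((tid_list(v) for v in values), key=len)
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = result & other
    return sorted(result)

print(answer(["a2", "b2", "c2"]))
print(answer(["a1", "b3", "c2"]))
```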

H-Frag Query - example
q = {?, ?, c2}
A query like this, with two inquire operators,
would be executed in SQL as follows over relation R:
SELECT A, '*', 'c2', COUNT(*) FROM R WHERE C = 'c2'
GROUP BY A UNION
SELECT '*', B, 'c2', COUNT(*) FROM R WHERE C = 'c2'
GROUP BY B UNION
SELECT A, B, 'c2', COUNT(*) FROM R WHERE C = 'c2'
GROUP BY A, B;
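
To see what such a query returns, the sketch below runs a portable variant of the SQL on the nine-tuple example with Python's built-in `sqlite3` module. The table name R, the `tid` column, and the explicit GROUP BY columns are assumptions made for this sketch. H-Frag itself answers the query with tid-list intersections, not with a SQL engine.

```python
import sqlite3

# Nine-tuple example relation: (tid, A, B, C). Measures are omitted.
ROWS = [
    (1, "a1", "b1", "c1"), (2, "a2", "b2", "c2"), (3, "a2", "b2", "c2"),
    (4, "a3", "b3", "c2"), (5, "a1", "b1", "c1"), (6, "a2", "b1", "c2"),
    (7, "a3", "b1", "c1"), (8, "a1", "b3", "c2"), (9, "a1", "b3", "c2"),
]

QUERY = """
SELECT A, '*', 'c2', COUNT(*) FROM R WHERE C = 'c2' GROUP BY A
UNION
SELECT '*', B, 'c2', COUNT(*) FROM R WHERE C = 'c2' GROUP BY B
UNION
SELECT A, B, 'c2', COUNT(*) FROM R WHERE C = 'c2' GROUP BY A, B
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (tid INTEGER, A TEXT, B TEXT, C TEXT)")
conn.executemany("INSERT INTO R VALUES (?, ?, ?, ?)", ROWS)
result = conn.execute(QUERY).fetchall()
conn.close()

for row in result:
    print(row)
```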

Experiments
• We compared the H-Frag computation and query algorithms against the Frag-Cubing
algorithm used in [Li et al. 2004];
• The H-Frag algorithms were implemented in 64-bit Java;
• Frag-Cubing is a free and open-source C++ application (http://illimine.cs.uiuc.edu/);
• The synthetic base relations were created using the data generator provided by the
IlliMine project;
• The IlliMine project is an open-source project that provides various approaches for
data mining and machine learning;
• The Frag-Cubing approach is part of the IlliMine project;
• We ran the algorithms on a machine with two Intel Xeon six-core processors at 2.4 GHz per core,
12 MB of cache and 128 GB of DDR3 1333 MHz RAM;
• The system runs 64-bit Windows Server 2008, High Performance version.

Results - Performance Evaluation of Point Queries
Response time per query over 100 trials: T=10^7; C=10^4; D=30; S=0
On average, point queries were answered 3 times slower by
the H-Frag approach than by the Frag-Cubing
approach.

Results - Performance Evaluation of Inquire Operators
Response time of queries with inquire operators: T=10^7; C=10^4; D=30; S=0.
• Queries with two inquire operators were answered by the H-Frag approach
about 2.5 times slower than by the Frag-Cubing approach.
• The Frag-Cubing approach did not perform queries with three inquire
operators due to memory overflow.

Results: relations with different numbers of dimensions
were computed.
T=10^7; C=10^4; D=30; S=0.
The runtime was linear in
both approaches.
On average, the hybrid
memory usage caused the
H-Frag approach to
consume 3 times less main
memory.

Results - Massive Data Cube
One relation with T=10^9 tuples was computed by the H-Frag
approach.
This experiment took 64 hours and consumed 126 GB of RAM.
Queries with five range operators, ten point operators, and
one inquire operator were answered in less than 35 seconds.
Data cubes with a high number of tuples could not be
computed by the Frag-Cubing approach using the main
memory only. This was demonstrated by trying to compute a
base relation with 200 million tuples and 60 dimensions.

Conclusions
• H-Frag has linear runtime and memory consumption, similar to
Frag-Cubing;
• When compared to Frag-Cubing, H-Frag is faster to answer sub-cube
queries;
• It introduces a different cube representation with fewer empty cells
than Frag-Cubing;
• Frag-Cubing cannot answer queries with two sub-cube operators in a data
cube with 200 million tuples, C=10^4, D=30 and S=0;
• We had scenarios where the Frag-Cubing approach failed to compute the
data cube due to lack of main memory.

Conclusions
Interesting research directions to further extend H-Frag:
• First, we must evaluate H-Frag with holistic
measures.
• Top-k queries are part of our interest, since the inverted index
is also useful for this type of problem.
• Multicore and multicomputer versions of H-Frag must
be implemented.

Acknowledgements
Thank you very much
e-mail rrochas@gmail.com
