Chapter 1
Introduction
Data warehousing and data mining have both been popular technologies in recent years.
Data warehousing is an information infrastructure that stores and integrates different data
sources into a consistent repository; through OLAP (On-Line Analytical Processing) tools,
business managers can analyze these data from various perspectives to discover valuable
information for strategic decision making. Data mining, on the other hand, is the
exploration and analysis of data, automatically or semi-automatically, to discover
meaningful patterns and rules. From the business viewpoint, the integration of these two
technologies allows a corporation to understand its customers' behaviors and to use this
information to gain a competitive edge in the market. Among the various patterns of
interest to the data mining research community, the association rule has attracted great
attention recently. An association rule is a rule of the form A ⇒ B (sup = s%, conf = c%),
which reveals the co-occurrence of two itemsets A and B. An example is PC ⇒ Laser
Printer (sup = 30%, conf = 80%), which means that 30% of customers buy a PC and a
laser printer together, and 80% of the customers who buy a PC also buy a laser printer.
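To make the two measures concrete, the following sketch computes support and confidence over a small made-up transaction list (the data are illustrative, not drawn from this thesis):

```python
# Support and confidence of A => B over a toy transaction list
# (made-up data for illustration).
transactions = [
    {"PC", "Laser Printer"}, {"PC", "Laser Printer", "Scanner"},
    {"PC"}, {"PC", "Laser Printer"}, {"Scanner"},
    {"PC"}, {"PC", "Laser Printer"}, {"Laser Printer"},
    {"PC", "Laser Printer"}, {"PC"},
]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(a, b):
    """Fraction of transactions containing A that also contain B."""
    return support(a | b) / support(a)

# PC => Laser Printer
print(support({"PC", "Laser Printer"}))        # 0.5
print(confidence({"PC"}, {"Laser Printer"}))
```

Here the rule PC ⇒ Laser Printer has support 50% and confidence 62.5% over the toy data.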
Mining association rules from large databases is a data- and computation-intensive
task. To reduce the complexity of association mining, researchers have proposed the
concept of integrating data warehousing systems and association mining algorithms.
For example, the DBMiner system [22] developed by J. Han and his research team
adopts an OLAP-based association mining approach. A similar paradigm was presented
in [22].
The primary problem of the OLAP-based approach is that the OLAP data cube is not
feasible for on-line association mining: excessive effort is still required to complete
the task. As such, Lin et al. [15] proposed the concept of the OLAM (On-Line
Association Mining) cube, an extension of the Iceberg cube [3] used to store frequent
multidimensional itemsets. They also proposed a framework for an on-line
multidimensional association rule mining system, called OMARS, to provide users with an
environment to execute OLAP-like queries to mine association rules from data
warehouses efficiently.
This thesis is a step toward the implementation of OMARS. In particular, it is
concerned with the problem of selecting appropriate OLAM cubes to materialize and
store in OMARS. In accordance with the mining algorithms proposed for OMARS, we
also develop a suitable model for evaluating the cost of selecting data cubes to
materialize.
1.1 Contributions
The main contributions of this thesis are as follows:
1. We exploit the dependency between OLAM cubes with regard to association
queries, thereby devising the structure of the OLAM lattice.
2. We develop a model for evaluating the cost of answering association
queries using materialized OLAM cubes, which is a preliminary step for
OLAM cube selection.
3. We modify and implement some state-of-the-art heuristic algorithms, and
draw comparisons between these algorithms to evaluate their effectiveness.
1.2 Thesis Organization
This thesis is organized as follows. We describe past research and related work
on data warehousing and data mining technologies in Chapter 2. In Chapter 3,
we describe the OMARS framework briefly. Chapter 4 formulates our OLAM cube
selection problem. The cost model and algorithm analysis are described in Chapter 5.
Chapter 6 explains our algorithms, and Chapter 7 shows the experimental results of
this research. Finally, we conclude our work and point out some future
research directions in Chapter 8.
Chapter 2
Background and Related Work
2.1 Data Warehouse and OLAP
2.1.1 Data Warehouse
As coined by W. H. Inmon, the term "data warehouse" refers to a "subject-
oriented, integrated, time-variant and nonvolatile collection of data in support of
management's decision-making process" [11]. In this regard, a data warehouse is a
database dedicated to supporting decision making. According to the demands of
analysts, data coming from different databases are extracted and transformed into the
data warehouse. When users execute queries, the system only needs to search the data
warehouse instead of the source databases, thereby saving much query processing time.
A data warehouse system is composed of three primary parts:
1. The source databases in the back end: In the back end, the data are collected
from various sources, internal or external, legacy or operational, and any
change to these sources is continually monitored by several modules called
monitors/wrappers.
2. The data warehouse and data marts in the core: The reconciled data are
stored in the data warehouse and data marts, which are the central repositories
for the whole system.
3. The analysis tools in the front end: The analysis tools supported in the front
end are usually OLAP, query/tabulation tools, and data mining software.
The typical structure of a data warehouse is illustrated in Figure 2.1.
Figure 2.1. A typical architecture of data warehouse [11].
2.1.2 On-Line Analytical Processing (OLAP)
Although the data stored in a data warehouse have been cleaned, filtered, and
integrated, transforming them into useful strategic information still requires much
time owing to the massive amount of data stored in the warehouse. The concept of
On-Line Analytical Processing (OLAP) [4] refers to the process of creating
and managing multidimensional data for analysis and visualization. To provide fast
and multidimensional analysis of data in a data warehouse, the OLAP tool
precomputes aggregations over the data and organizes the results as a data cube
composed of several dimensions, each representing one of the user's analysis perspectives.
The typical operations provided by OLAP include roll-up, drill-down, slice and
dice, and pivot [8]. The roll-up operation performs aggregation on a data cube, either by
climbing up a concept hierarchy for a dimension or by dimension reduction. Drill-down
is the reverse of roll-up: it navigates from less detailed data to more detailed
data. The slice operation performs a selection on one dimension of the given cube,
resulting in a subcube, while the dice operation defines a subcube by performing a
selection on two or more dimensions. The pivot operation, also called rotate,
is a visualization operation that rotates the data axes in view in order to provide an
alternative presentation of the data. These OLAP operations are illustrated in Figure
2.2.
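The roll-up and slice operations described above can be sketched over a toy fact list; the data and helper names here are illustrative, not an actual OLAP implementation:

```python
from collections import defaultdict

# Toy fact data: (product, supplier, customer, sales) -- illustrative only.
facts = [
    ("P1", "S1", "C1", 30), ("P1", "S2", "C2", 10),
    ("P2", "S1", "C1", 20), ("P2", "S2", "C3", 25),
]

def roll_up(facts, dims):
    """Aggregate the measure over a subset of dimension positions
    (dimension reduction, one form of roll-up)."""
    agg = defaultdict(int)
    for *coords, measure in facts:
        key = tuple(coords[i] for i in dims)
        agg[key] += measure
    return dict(agg)

def slice_(facts, dim, value):
    """Select the sub-cube where dimension `dim` equals `value`."""
    return [f for f in facts if f[dim] == value]

# Roll up to the Product dimension:
print(roll_up(facts, dims=[0]))   # {('P1',): 40, ('P2',): 45}
# Slice on Supplier = S1:
print(slice_(facts, dim=1, value="S1"))
```

Dicing would simply filter on two or more dimensions instead of one.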
2.2 Data Warehouse Data Model
Because the data warehouse systems require a concise, subjectoriented schema
that facilitates online data analysis, the entityrelationship data model that is
generally used in relational database systems is not suitable for data warehouse
system. For this purpose, the most popular data model for a data warehouse is a
multidimensional data model. Two common relational models that facilitate
multidimensional analysis are star schema, and snowflake.
Figure 2.2. The typical operations of OLAP
2.2.1 Star Schema
The star schema, proposed by Kimball [12], is the most popular dimensional model
used in the data warehouse community. A star schema consists of a fact table and several
dimension tables. The fact table stores a list of foreign keys corresponding to the
dimension tables, together with the numeric measures of user interest. Each dimension
table contains a set of attributes. Moreover, the attributes within a dimension table may
form either a hierarchy (total order) or a lattice (partial order). An example of a star
schema is depicted in Figure 2.3, whose schema hierarchy is illustrated in Figure 2.4.
Figure 2.3. An example of star schema for sales
Figure 2.4. An example of schema hierarchy for sales star
2.2.2 Snowflake Data Model
The snowflake schema is a variant of the star schema model, where some
dimension tables are normalized, thereby further splitting the data into additional
individual and hierarchical tables. An example of snowflake data model is depicted in
Figure 2.5.
Figure 2.5. An example of snowflake schema for sales
The major difference between the snowflake schema and the star schema is that the
dimension tables of the snowflake model may be kept in normalized form to reduce
redundancy. This makes the snowflake schema easier to maintain and more economical
in storage space than the star schema. On the other hand, the star schema can
integrate schema hierarchies into a dimension table, thereby incurring no join
operation during hierarchical traversal of the dimensions. Hence, the star schema
is more popular than the snowflake schema.
2.3 Association Rule Mining
2.3.1 Association Rules
Association rule mining is one of the prominent activities in the data
mining community. The goal of association rule mining is to search for interesting
relationships among items in a given data set. For example, the observation that
customers who purchase diapers also tend to buy beer at the same time is represented
by the association rule below:
Diaper ⇒ Beer [sup = 2%, conf = 60%]
Rule support and confidence are two measures of rule interestingness. A support of
2% means that 2% of customers purchase diapers and beer together. A confidence of
60% means that 60% of the customers who purchase diapers also buy beer. Typically,
an association rule is considered interesting if it satisfies a minimum support threshold
and a minimum confidence threshold set by users or domain experts.
The process of association rule mining can be divided into two steps:
1. Frequent itemsets generation: In this step, all itemsets with support greater
than the minimum support threshold are first discovered.
2. Rule construction: After all frequent itemsets have been generated, rules
whose confidence is no less than the minimum confidence threshold are
constructed from them. These are the discovered association rules.
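The second step can be sketched as follows, assuming the first step has produced a map from frequent itemsets to support counts; the helper name and data are illustrative:

```python
from itertools import combinations

def construct_rules(frequent, minconf):
    """Step 2: build rules A => B from frequent itemsets.

    `frequent` maps frozenset itemsets to support counts, as produced by
    the frequent-itemset generation step (hypothetical input format)."""
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[lhs]   # conf(A => B) = sup(A u B) / sup(A)
                if conf >= minconf:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

freq = {frozenset({"PC"}): 10, frozenset({"Printer"}): 4,
        frozenset({"PC", "Printer"}): 3}
print(construct_rules(freq, minconf=0.7))  # [({'Printer'}, {'PC'}, 0.75)]
```

Note that every subset of a frequent itemset is itself frequent, so the lookup `frequent[lhs]` always succeeds.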
The most popular and influential association mining algorithm is Apriori [2],
which uses the a priori knowledge of frequent k-itemsets to generate candidate (k+1)-
itemsets. When the maximum length of the frequent itemsets is l, Apriori needs l passes
over the database. Since the Apriori algorithm spends much time generating the
candidate itemsets and counting the support of each itemset, many variant algorithms
have been proposed to improve the efficiency of the mining process.
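A minimal sketch of the Apriori level-wise scheme (candidate generation with subset pruning) might look like this; for simplicity it keeps all transactions in memory rather than scanning a database l times:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise Apriori sketch: returns {frozenset(itemset): support_count}.

    `transactions` is a list of sets; `minsup` is an absolute support count.
    """
    def count(cands):
        return {c: sum(1 for t in transactions if c <= t) for c in cands}

    # Pass 1: frequent 1-itemsets.
    items = {frozenset([i]) for t in transactions for i in t}
    level = {c: s for c, s in count(items).items() if s >= minsup}
    frequent = dict(level)
    k = 1
    while level:
        # Join frequent k-itemsets into candidate (k+1)-itemsets, then prune
        # any candidate with an infrequent k-subset (the Apriori property).
        cands = {a | b for a in level for b in level if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(s) in level for s in combinations(c, k))}
        level = {c: s for c, s in count(cands).items() if s >= minsup}
        frequent.update(level)
        k += 1
    return frequent

txns = [{"PC", "Printer"}, {"PC", "Printer", "Scanner"}, {"PC"}, {"Printer"}]
print(apriori(txns, minsup=2))
```

On the toy transactions above, {PC}, {Printer}, and {PC, Printer} come out frequent at minsup = 2, while {Scanner} does not.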
2.3.2 Multidimensional Association Rules
The concept of multidimensional association rules was first proposed by H. Zhu
[22] to describe associations among data values in a data warehouse, where the data
schema is composed of multiple dimensions and each dimension may contain many
attributes. Following the work in [22], we can divide multidimensional association
rules into three different types:
1. Inter-dimensional association rule: This is an association among a set of
dimensions. For example, suppose an OLAP cube is composed of three
dimensions, Product, Supplier, and Customer, whose data are listed in Table
2.1. An inter-dimensional association rule is:
Supplier ("Hong Kong"), Product ("Sport Wear") ⇒ Customer ("John")
2. Intra-dimensional association rule: This is an association among items
coming from one dimension. From Table 2.1, a possible intra-dimensional
association rule is:
Product ("Sport Wear") ⇒ Product ("Tents")
3. Hybrid association rule: This is an association among a set of dimensions
in which some items come from the same dimension. It can be regarded as a
combination of inter-dimensional and intra-dimensional associations.
According to Table 2.1, a hybrid association rule is:
Product ("Sport Wear"), Supplier ("Hong Kong") ⇒ Product ("Tents")
Table 2.1. A relational representation of an OLAP cube

Supplier    Product          Customer   Count
Hong Kong   Sport Wear       John       30
Hong Kong   Sport Wear       Mary       10
Hong Kong   Water Purifier   John       30
Mexico      Alert Devices    Peter      20
Mexico      Carry Bags       Peter      85
Mexico      Carry Bags       Bill       25
Mexico      Tents            Sue        25
Mexico      Tents            Mary       20
Seattle     Carry Bags       John       100
Seattle     Sport Wear       Peter      20
Seattle     Sport Wear       John       40
Seattle     Water Purifier   Bill       25
Tokyo       Carry Bags       Sue        10
Tokyo       Sport Wear       Bill       20
Tokyo       Tents            Sue        20
Tokyo       Alert Devices    John       20
2.4 Related Work
2.4.1 Data Cube
The concept of the data cube was first proposed by Gray et al. [6]; it allows
analysts to view the data stored in a data warehouse from various aspects and to employ
multidimensional analysis. Each cell in a data cube represents a measured value. For
example, consider a sales data cube with three dimensions, Product, Supplier, and
Customer, and one measure, Sales_total. This cube is depicted in Figure 2.6 and
can be expressed as a SQL query as follows:
SELECT Product, Supplier, Customer, SUM(Sales) AS Total_Sales
FROM Sales_Fact
GROUP BY Product, Supplier, Customer;
Figure 2.6. An example of a data cube
2.4.2 Cube Selection Problem
In order to accelerate query processing, it is important to select the most
suitable cubes to materialize. In general, there are three options for selecting the cubes
to materialize.
1. Materialize all data cubes: This option yields the lowest query time but
needs the largest storage space, because all the cubes have to be
materialized.
2. Materialize nothing: This option saves the most storage space but incurs
the largest query time, because no cube is materialized.
3. Materialize a fraction of the data cubes: This option selects a subset of the
data cubes to materialize. However, selecting the most suitable cubes to
materialize under a space constraint is difficult; indeed, it has been proved
to be an NP-hard problem [9].
According to the above discussion, the best option in terms of query time is to
materialize all data cubes. However, the space limit of the data warehouse prevents us
from doing so. On the other hand, if we materialize nothing, queries cost too much
time. Therefore, we should try to select the most suitable cubes to materialize even
though the problem is NP-hard. In the literature, there have been substantial
contributions to this problem, which can be classified into three main categories:
1. Heuristic methods: This category is mainly based on the greedy paradigm.
Harinarayan et al. [9] were the first to consider the problem of
materialized view selection for supporting multidimensional analysis in
OLAP. They proposed a lattice model and provided a greedy algorithm to
solve this problem. Gupta et al. [7] further extended their work to include
index selection. Ezeife [5] also considered the same problem but proposed
a uniform approach using a more detailed cost model. Shukla et al. [17]
proposed a modified greedy algorithm that selects cubes according to cube
size alone. Their algorithm was shown to produce solutions of the same
quality as Harinarayan's greedy method but is more efficient.
2. Exhaustive methods: The work in [19] assumed that all queries should be
answered solely by the materialized views, with or without rewriting the
users' queries. They modeled the problem as a state-space optimization
problem and provided exhaustive and heuristic algorithms without concern
for the storage constraint. Soutyrina and Fotouhi [18] proposed a dynamic
programming algorithm that can yield the optimal set of cubes.
3. Genetic methods: Some work has been devoted to applying genetic
algorithms to the view selection problem [10, 20, 21]. Following the
AND-OR view graph used in [7], Horng et al. [10] proposed a genetic algorithm
to select the appropriate set of views to minimize the query cost and view
maintenance cost. A similar genetic algorithm with a different repairing
scheme was proposed in [13], which uses a greedy repair method to correct
infeasible solutions instead of using a penalty function to punish their
fitness. Research has shown that the repair scheme deals with infeasible
solutions better than a penalty function does [16]. Rather than optimizing
the view selection from a given query processing plan, the work in [20, 21]
focuses on finding an optimal set of processing plans for multiple queries.
A solution in their genetic algorithms thus represents a set of processing
plans for the given queries.
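The greedy paradigm underlying the first category can be sketched as follows. The cost model here is a deliberate simplification for illustration (a query costs the size of the smallest selected cube that can answer it, falling back to the largest cube), not the thesis's model, and the inputs are hypothetical:

```python
def greedy_select(size, answerable, freq, space):
    """Greedily pick cubes maximizing query-cost reduction per unit space.

    size       : {cube: size}
    answerable : {query: [cubes that can answer it]}  (hypothetical input)
    freq       : {query: frequency}
    space      : total storage budget
    """
    def query_cost(selected):
        # Cost of a query = size of the smallest selected cube answering it;
        # an unanswerable query falls back to the largest cube (full scan).
        worst = max(size.values())
        total = 0.0
        for q, cands in answerable.items():
            best = min((size[c] for c in cands if c in selected), default=worst)
            total += freq[q] * best
        return total

    selected, used = set(), 0
    while True:
        base = query_cost(selected)
        best_cube, best_gain = None, 0.0
        for c in size:
            if c in selected or used + size[c] > space:
                continue
            gain = (base - query_cost(selected | {c})) / size[c]
            if gain > best_gain:
                best_cube, best_gain = c, gain
        if best_cube is None:          # nothing fits or no positive benefit
            return selected
        selected.add(best_cube)
        used += size[best_cube]

size = {"A": 10, "B": 4, "C": 4}
answerable = {"q1": ["A", "B"], "q2": ["A", "C"]}
freq = {"q1": 1, "q2": 1}
print(greedy_select(size, answerable, freq, space=8))  # selects B and C
```

Like the method of Harinarayan et al., this runs in polynomial time but gives no optimality guarantee beyond the greedy bound.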
Chapter 3
The OMARS Framework
In this chapter, we give a brief review of the OMARS framework, because
our research deals with the problem of how to select the most suitable OLAM cubes
to materialize in this system.
The OMARS framework, as illustrated in Figure 3.1, integrates the data warehouse,
on-line analytical processing, and the OLAM cube; its objective is to provide an
efficient and convenient platform that allows users to perform OLAP-like association
explorations. Through the OMARS system, users can perform multidimensional
association mining queries, interactively change the dimensions that comprise the
associations, and refine constraints such as minimum support and minimum
confidence. The functionality of each component is described in the following sections.
Figure 3.1. The OMARS framework [15].
3.1 OLAM Cube and Auxiliary Cube
The OLAM cube is a new concept proposed by Lin et al. [15]; it is used to store
the frequent itemsets with supports greater than or equal to a preset minimum
support, denoted prims. In this regard, the OLAM cube can be regarded as an
extension of the iceberg cube. The main difference is that the iceberg cube stores the
information of frequent itemsets derived from inter-dimensional associations, while the
OLAM cube is feasible for all three types of associations. When the minsup of a
user's query is greater than or equal to prims, the OLAM cube can accelerate the
process of mining association rules because it already stores the frequent itemsets with
supports no less than prims.
Although the OLAM cube can be used to generate association rules efficiently
when minsup is greater than or equal to prims, it fails in the situation where minsup is
lower than prims. To alleviate this problem, the OMARS system embraces another type
of data cube, called the auxiliary cube, which is used to store the infrequent itemsets of
length Kα, where Kα denotes the cutting level employed by the mining algorithm
CBWon used in OMARS.
3.2 Cube Manager
This component is responsible for three different tasks:
1. Cube selection: This refers to selecting the most appropriate cubes to
materialize, in order to minimize the query cost and/or maintenance cost
under the constraint of limited storage space.
2. Cube computation: This module deals with efficiently generating the set of
materialized cubes produced by the cube selection module.
3. Cube maintenance: This part concerns the problem of how to maintain the
materialized cubes when the data in the data warehouse are updated.
Our research in this thesis deals with the implementation of the cube
selection task of the Cube Manager. We will discuss this in the next chapter.
3.3 OLAM Mediator and OLAM Engine
The OLAM Engine is an interface between the OMARS system and the users. It
accepts users' queries and invokes the appropriate algorithm to mine
multidimensional association rules.
When the OLAM Engine receives a user's query, it analyzes the query and
forwards the relevant information to the OLAM Mediator, which then looks for the most
relevant cube and returns the result to the OLAM Engine. Here the most relevant cube
denotes the materialized OLAM cube that can answer the query at the smallest cost.
There are two possible outcomes of the search performed by the OLAM Mediator, and
each should be handled in a different way.
1. The OLAM Mediator finds the most relevant cube: In this case, the OLAM
Mediator has to further compare the minsup of the user's query to prims, and
handle the following two cases:
i. minsup ≥ prims: The discovered OLAM cube is capable of answering
the query. Return this cube to the OLAM Engine.
ii. minsup < prims: The discovered OLAM cube cannot answer the query
without the aid of the auxiliary cube. Return the OLAM cube and its
accompanying auxiliary cube to the OLAM Engine.
2. The OLAM Mediator cannot find such a cube: In this case, the OLAM Mediator has to
search the OLAP Cube repository to determine if there is an OLAP cube
whose data can be used to answer the query. If the answer is yes, return the
discovered OLAP cube to OLAM Engine; otherwise, notify OLAM Engine
to execute the mining procedure from the data warehouse afresh.
We will discuss the above cases in more detail and devise the cost evaluation
of each case in Chapter 5.
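The dispatch logic above can be sketched as follows; the helper names, query representation, and cube representation are assumptions for illustration, not OMARS APIs:

```python
def mediate(query, find_olam_cube, find_olap_cube, prims):
    """Sketch of the OLAM Mediator's dispatch decision.

    `find_olam_cube` / `find_olap_cube` return the most relevant cube or
    None (hypothetical lookup helpers). Returns a (plan, payload) pair
    telling the OLAM Engine how to proceed."""
    cube = find_olam_cube(query)
    if cube is not None:
        if query["minsup"] >= prims:
            return ("olam", cube)                      # case 1.i
        return ("olam+aux", (cube, cube["aux"]))       # case 1.ii
    olap = find_olap_cube(query)
    if olap is not None:
        return ("olap", olap)                          # case 2: OLAP cube found
    return ("mine_from_warehouse", None)               # case 2: mine afresh

q = {"minsup": 0.05}
c = {"name": "MCube", "aux": "auxiliary"}
print(mediate(q, lambda _: c, lambda _: None, prims=0.02))  # case 1.i applies
```

Each returned plan corresponds to one of the cases enumerated above.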
Chapter 4
Problem Formulation
In this chapter, we first elaborate on the correspondence between OLAM queries and
OLAM cubes and describe the concept of the OLAM lattice. After this, we define the
problem of OLAM cube selection.
4.1 OLAM Cube and OLAM Query
As described in Chapter 3, the OLAM cube is used to store frequent itemsets, aiming
at accelerating the process of mining association rules. To clarify the structure of the
OLAM cube and its relationship to multidimensional associations, we first
introduce a four-tuple mining meta-pattern to specify the form of a multidimensional
association query. The definition is as follows:
Definition 4.1. Suppose a star schema S contains a fact table and m dimension
tables {D1, D2, …, Dm}. Let T be a joined table from S composed of attributes
a1, a2, …, ak, such that for all ai, aj ∈ Attr(Dl), there is no hierarchical relation between
ai and aj, 1 ≤ i, j ≤ k, 1 ≤ l ≤ m. Here Attr(Dl) denotes the attribute set of dimension
table Dl. A meta-pattern of multidimensional associations from T is defined as follows:
MP: < tG, tM, ms, mc >,
where ms denotes the minimum support, mc the minimum confidence, tG the group of
transaction attributes, and tM the group of item attributes, with tG, tM ⊆ {a1, a2, …, ak}
and tG ∩ tM = ∅.
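A meta-pattern of Definition 4.1 could be represented as a simple record; the class and field values below are illustrative, not part of OMARS:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetaPattern:
    """The four-tuple <tG, tM, ms, mc> of Definition 4.1 (illustrative)."""
    tG: frozenset   # transaction attributes
    tM: frozenset   # item attributes
    ms: float       # minimum support
    mc: float       # minimum confidence

mp = MetaPattern(frozenset({"City"}), frozenset({"Category"}), 0.02, 0.6)
assert mp.tG.isdisjoint(mp.tM)   # the constraint tG ∩ tM = ∅ of Definition 4.1
```

The disjointness assertion mirrors the constraint tG ∩ tM = ∅ of the definition.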
The above meta-pattern specification of multidimensional association queries can
represent the three types of multidimensional association rules defined in [22]:
intra-association, inter-association, and hybrid association.
For example, consider a joined table T involving three dimensions from the star
schema in Figure 2.3. The content of T is shown in Table 4.1. If the item attribute set
tM consists of only one attribute, then the meta-pattern corresponds to an intra-
association.
Table 4.1. A joined table T from the star schema

Tid   City      Education     Date   Month   Product_ID   Category
1     Taipei    Bachelor      7/12   July    1            A
2     Taipei    High school   7/12   July    2            A
3     N.Y.      Master        7/18   July    1            A
4     Toronto   Master        8/2    Aug.    3            B
5     Seattle   Master        8/3    Aug.    4            B
6     N.Y.      High School   8/2    Aug.    1            A
7     Toronto   High School   7/4    July    1            A
8     Seattle   Bachelor      7/18   July    5            C
9     Taipei    Bachelor      8/2    Aug.    2            A
10    N.Y.      Bachelor      9/1    Sep.    3            B
For instance, let tG = {City}, tM = {Category}. We may have the following intra-
association rule:
(Category, “A”) ⇒ (Category, “B”) (sup = 40%, conf = 80%)
Note that to facilitate this mining task, the table T has to be, implicitly or
explicitly, transformed into a transaction table as follows:
City      Category
Taipei    A
N.Y.      A, B
Toronto   A, B
Seattle   B, C
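The implicit transformation above (grouping item values by the transaction attribute) can be sketched as follows, using the (City, Category) rows of Table 4.1:

```python
from collections import defaultdict

# (City, Category) pairs taken from the rows of Table 4.1, in Tid order.
rows = [
    ("Taipei", "A"), ("Taipei", "A"), ("N.Y.", "A"), ("Toronto", "B"),
    ("Seattle", "B"), ("N.Y.", "A"), ("Toronto", "A"), ("Seattle", "C"),
    ("Taipei", "A"), ("N.Y.", "B"),
]

def to_transactions(rows):
    """Group item values by the transaction attribute, one itemset each."""
    txns = defaultdict(set)
    for tg, tm in rows:
        txns[tg].add(tm)
    return dict(txns)

txns = to_transactions(rows)
assert txns["N.Y."] == {"A", "B"} and txns["Seattle"] == {"B", "C"}
```

The result reproduces the four-row transaction table shown above.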
On the other hand, if |tM| ≥ 2, then the resulting associations will be inter-
associations or hybrid associations. For example, let tG = ∅, tM = {Education, Month}.
We have an inter-association:
(Education, "Master") ⇒ (Month, "July") (sup = 40%, conf = 80%)
As with intra-association, the table T has to be transformed into the following form:
Tid Education Month
1 Bachelor July
2 High school July
3 Master July
4 Master Aug.
5 Master Aug.
6 High School Aug.
7 High School July
8 Bachelor July
9 Bachelor Aug.
10 Bachelor Sep.
Note that in this case, the transaction attribute is the same as in the original table T.
But if tG = {City}, we will have a hybrid association:
(Education, “Master”), (Month, “July”) ⇒
(Month, “Aug.”) (sup = 40%, conf = 80%)
For this case, the transformed table will be:
City Education Month
Taipei Bachelor, High School July, Aug.
N.Y. Master, High School, Bachelor July, Aug., Sep.
Toronto Master, High School Aug., July
Seattle Master, Bachelor Aug., July
After explaining the mining patterns, we now clarify the structure of the OLAM
cube.
Definition 4.2. Given a meta-pattern MP with transaction attribute set tG and item
attribute set tM, and a preset minsup, prims, the corresponding OLAM cube,
MCube(tG, tM), is the set of the frequent itemsets with supports no less than prims.
The following examples illustrate the corresponding OLAM cube for different
kinds of multidimensional association rules.
Example 4.1. An intra-dimensional OLAM cube: Let tG = {City}, tM = {Category},
and prims = 2. From Table 4.1, the resulting OLAM cube is shown in Table 4.2.
Table 4.2. An example of an intra OLAM cube expressed as a table

Category   Support
A          3
B          3
A, B       2
Example 4.2. An inter-dimensional OLAM cube: Let tG = ∅, tM = {Education,
Month}, and prims = 2. From Table 4.1, the resulting OLAM cube is shown in Table
4.3.
Table 4.3. An example inter-dimensional OLAM cube expressed as a table

Education     Month   Support
Bachelor      —       4
High school   —       3
Master        —       3
—             July    5
—             Aug.    4
Bachelor      July    2
High school   July    2
Master        Aug.    2
Example 4.3. A hybrid-dimensional OLAM cube: Let tG = {City}, tM = {Education,
Month}, and prims = 3. From Table 4.1, the resulting OLAM cube is shown in Table
4.4.
Table 4.4. An example hybrid-dimensional OLAM cube expressed as a table

Education     Month        Support
Bachelor      —            3
High school   —            3
Master        —            3
—             July         4
—             Aug.         4
—             July, Aug.   4
Bachelor      July         3
Bachelor      Aug.         3
High school   July         3
High school   Aug.         3
Master        July         3
Master        Aug.         3
Bachelor      July, Aug.   3
High school   July, Aug.   3
Master        July, Aug.   3
4.2 OLAM Lattice
In accordance with the definition of OLAM cube, we can generate all possible
OLAM cubes from the star schema, thereby forming an OLAM lattice. In order to
provide hierarchical navigation and multidimensional exploration, the OMARS
system [15] models the OLAM lattice as a threelayer structure. The first layer lattice
expresses the combination of all dimensions. The second layer further exploits inter
attribute combinations for each dimensional combination in the first layer lattice. The
third layer exploits all OLAM cubes corresponding to the metapatterns derived from
each subcube in the second layer. Note that the real OLAM cubes are stored in the
third layer.
For example, consider the star schema illustrated in Figure 2.3. The first-layer
lattice shown in Figure 4.1 is composed of eight possible dimensional combinations.
After constructing the first-layer lattice, we choose the node composed of the "customer"
and "time" dimensions and extend it to form the second-layer lattice shown in Figure
4.2. Each node of the second-layer lattice is constructed by attaching attributes
chosen from the selected dimensions. Finally, we extend the cube <(city, education),
(date)> to form the third-layer lattice shown in Figure 4.3. It can be observed that
there is one OLAM cube corresponding to an inter-association, (city, education, date);
three OLAM cubes corresponding to hybrid associations, (date*, city, education),
(*education, city, date), and (city*, education, date); and three cubes corresponding to
intra-associations, (education*, date*, city), (city*, date*, education), and (city*,
education*, date).
Note that (city*, education*, date*) is shown only to complete the lattice structure;
it is useless and will not be materialized.
Figure 4.1. The 1st-layer OLAM lattice for the example star schema in Figure 2.3
Figure 4.2. The 2nd-layer lattice derived from <customer, time> in the 1st layer
Figure 4.3. The 3rd-layer lattice derived from the subcube <(city, education), date> in the 2nd layer
Because the real OLAM cubes are stored in the third-layer lattice, we can mine
multidimensional association rules efficiently by materializing these OLAM cubes.
From the three-layer lattice, we discover an attribute dependency, defined as follows:

Proposition 4.1. Consider two OLAM cubes, MCube(tG1, tM1) and MCube(tG2, tM2).
If tG1 = tG2 and tM2 ⊆ tM1, then every itemset in MCube(tG2, tM2) must be a subset of
an itemset in MCube(tG1, tM1), and these two itemsets have the same support value.
Example 4.4. Consider the table T in Table 4.1. Let MCube(tG1, tM1) be the cube
illustrated in Table 4.4 and MCube(tG2, tM2) the one illustrated in Table 4.5. Hence
tG1 = tG2 = {City}, tM1 = {Education, Month}, tM2 = {Education}, and prims = 3. It can
be verified that every frequent itemset stored in MCube(tG2, tM2) is a subset of a
frequent itemset in MCube(tG1, tM1), and both itemsets have the same support value.

Table 4.5. An OLAM Cube

Education     Support
Bachelor      3
High school   3
Master        3
According to Proposition 4.1, we know there is a dependency between OLAM
cubes in the third-layer lattice, which is formalized below.

Definition 4.3. Consider two OLAM cubes, MCube(tG1, tM1) and MCube(tG2, tM2).
We say that MCube(tG2, tM2) is dependent upon MCube(tG1, tM1) if tG1 = tG2 and
tM2 ⊆ tM1, denoted MCube(tG2, tM2) ≤ MCube(tG1, tM1).

One important aspect of Definition 4.3 is that if MCube(tG2, tM2) ≤
MCube(tG1, tM1), then all multidimensional queries that can be answered via
MCube(tG2, tM2) can also be answered via MCube(tG1, tM1).
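The dependency test of Definition 4.3 amounts to a one-line predicate; representing a cube as a pair of attribute sets is an assumption for illustration:

```python
def depends_on(cube2, cube1):
    """True if MCube(tG2, tM2) is dependent upon MCube(tG1, tM1):
    same transaction attributes and tM2 a subset of tM1 (Definition 4.3)."""
    (tg2, tm2), (tg1, tm1) = cube2, cube1
    return tg2 == tg1 and tm2 <= tm1

c1 = (frozenset({"City"}), frozenset({"Education", "Month"}))  # Table 4.4's cube
c2 = (frozenset({"City"}), frozenset({"Education"}))           # Table 4.5's cube
print(depends_on(c2, c1))  # True
```

This mirrors Example 4.4: the cube of Table 4.5 is dependent upon that of Table 4.4, but not conversely.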
Furthermore, it should be noted that not all of the OLAM cubes derived in the
lattice have to be materialized and stored, because the concept hierarchies defined
over the attributes in the star schema make it possible to prune some redundant
cubes.
Consider an OLAM cube, MCube(tG, tM). We observe that there are two
different types of redundancy.

Proposition 4.2. Schema redundancy: Let ai, aj ∈ tG. If ai and aj are in the same
dimension and aj is an ancestor of ai, then MCube(tG, tM) is redundant with respect
to the cube MCube(tG − {aj}, tM).
Example 4.5. Consider the joined table in Table 4.1. Let tM = {Category}. The
resulting table obtained by grouping "Date" and "Month" as transaction attributes is
shown in Table 4.6. Note that this table has the same transactions as that obtained by
grouping "Date" alone as the transaction attribute, as shown in Table 4.7. Thus, the
resulting cube MCube({Date, Month}, {Category}) is the same as MCube({Date},
{Category}).
Table 4.6. The resulting table by grouping {Date, Month}
as transaction attributes for Table 4.1
Date Month Category
7/4 July A
7/12 July A
7/18 July A, C
8/2 Aug. A, B
8/3 Aug. B
9/1 Sep. B
Table 4.7. The resulting table by grouping {Date}
as transaction attribute for Table 4.1
Date Category
7/4 A
7/12 A
7/18 A, C
8/2 A, B
8/3 B
9/1 B
Proposition 4.3. Value redundancy: Let ai, aj ∈ tM. If ai and aj are in the same
dimension and aj is an ancestor of ai, then MCube(tG, tM) is a cube with value
redundancy.
Example 4.6. Consider the joined table in Table 4.1. Let tG = {City}, tM = {Date,
Month}, and prims = 2. The resulting OLAM cube is shown in Table 4.8. One can
observe that several tuples in this table are redundant patterns; the cube therefore
exhibits value redundancy. Note that when value redundancy holds, we must
prune the redundant patterns during the generation of frequent itemsets.
Table 4.8. The resulting OLAM cube MCube({City}, {Date, Month})
Date    Month        support
7/18    -            2
8/2     -            3
-       July         4
-       Aug.         4
7/18    July         2
7/18    Aug.         2
8/2     July         3
8/2     Aug.         3
-       July, Aug.   4
7/18    July, Aug.   2
8/2     July, Aug.   3
In addition to the above observations, we observe that an OLAM cube is useless if it satisfies the following property.
Proposition 4.4. Useless Property: Let ai ∈ tG and tM = {aj}. If ai, aj are in the same
dimension and aj is an ancestor of ai, then MCube(tG, tM) is a useless cube.
Example 4.7. Let tG = {City, Date} and tM = {Month}. The resulting table obtained from Table 4.1 by grouping {City, Date} as transactions is shown in Table 4.9. One can observe that the cardinality of every transaction is 1. Therefore, we cannot find any association rule from this table.
Table 4.9. The resulting table by grouping {City, Date} as transaction attribute for
Table 4.1
City      Date   Month
Toronto   7/4    July
Taipei    7/12   July
Taipei    8/2    Aug.
N.Y.      7/18   July
N.Y.      8/2    Aug.
Toronto   8/2    Aug.
Seattle   8/3    Aug.
N.Y.      9/1    Sep.
4.3 OLAM Cube Selection
We now proceed to give a formal definition of the OLAM cube selection
problem. To this end, we introduce symbols as shown in Table 4.10.
Assume that an OLAM lattice L contains n OLAM data cubes D = {d1, d2, ..., dn}, the set of user queries is Q = {q1, q2, ..., qm}, the set of query frequencies is F = {f_q1, f_q2, ..., f_qm}, and the space constraint is S. The OLAM cube selection problem is denoted as a five-tuple θ = {L, D, Q, F, S}. A solution to θ is a subset M of D that minimizes the cost function

  min Σ_{i=1}^{m} f_qi × E(qi, M)

subject to the constraint

  Σ_{d∈M} |d| ≤ S.
Table 4.10. The Symbol Table
Symbol      Definition
L           Lattice
D           Set of data cubes
dn          The nth data cube
Q           Set of user queries
qm          The mth user query
F           Set of user query frequencies
f_qi        Frequency of the ith query
S           Space constraint
M           Set of materialized cubes
E(qi, M)    The total time to respond to the ith query using the materialized cubes
Chapter 5
Evaluation of OLAM Query Cost
5.1 Query Evaluation Flow
As stated previously, the primary task of the OLAM Engine is to generate association rules according to users' queries. After receiving a query, the OLAM Engine analyzes it, transfers the necessary information to the OLAM Mediator, and then waits for the best-matching cube from the OLAM Mediator. When the OLAM Mediator receives the query information from the OLAM Engine, it looks for the best-matching cube. First, the OLAM Mediator searches for the required OLAM cube. If found, it further checks whether minsup ≥ prims; if so, it returns the found OLAM cube to the OLAM Engine, otherwise it returns the corresponding auxiliary cube of the found OLAM cube and notifies the OLAM Engine to perform association mining from the data warehouse with the aid of this auxiliary cube. On the other hand, if the OLAM Mediator cannot find any qualified OLAM cube to answer the user query, it notifies the OLAM Engine to perform association mining from the data warehouse afresh.
The above described procedure employed by OLAM Mediator is depicted in
Figure 5.1.
Figure 5.1. The flow diagram of OLAM query
It is worth mentioning that, for simplicity, we do not consider OLAP cubes in this study, although the OMARS system does take this kind of data cube into account in association mining.
In accordance with the work flow of OLAM Mediator and OLAM Engine, our
paradigm for evaluating OLAM query cost is shown below:
Procedure Evacost_OLAMQ(q)
begin
Let q = < tG, tM, minsup>;
found = OLAMQ_search(q, CQ);
if found = TRUE then
if prims ≤ minsup then
cost = the cost for evaluating query q using OLAM cube
CQ.Mcube; /*case 1*/
else
cost = the cost for evaluating query q using CQ.Mcube, auxiliary cube
CQ.XCube and data warehouse; /*case 2*/
end if
else
cost = the cost for evaluating query q using data warehouse; /*case 3*/
end if
return cost;
end
Figure 5.2. The procedure to compute the cost of user’s query
In summary, there are three different cases to be dealt with:
Case 1: evaluating the cost via the qualified OLAM cube.
Case 2: evaluating the cost via OLAM cube, auxiliary cube, and data
warehouse.
Case 3: evaluating the cost via data warehouse.
The cost complexity evaluation for each case will be elaborated in the following
sections. We end this section with the description of OLAMQ_search.
Procedure OLAMQ_search(q, CQ)
begin
  found = FALSE;
  if MCube(q.tG, q.tM) is materialized then
    CQ.MCube = MCube(q.tG, q.tM);
    CQ.XCube = XCube(q.tG, q.tM);
    found = TRUE;
  end if
  CurQ = φ;
  for each MCube in the OLAM lattice do
    if MCube is materialized and MCube.tG = q.tG and MCube.tM ⊇ q.tM
        and (MCube.tM ⊆ CurQ.tM or CurQ = φ) then
      CurQ = MCube;
  end for
  if CurQ ≠ φ then
    found = TRUE;
    CQ.MCube = CurQ;
    CQ.XCube = XCube(q.tG, CurQ.tM);
  end if
  return found;
end
Figure 5.3. Procedure OLAMQ_search
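The matching step of OLAMQ_search can be sketched as follows: among the materialized cubes, pick one with the same tG and the smallest tM that still covers the requested mining attributes. The data layout below (a list of frozenset pairs) is an assumption for illustration.

```python
def search(q_tG, q_tM, materialized):
    """materialized: list of (tG, tM) frozenset pairs of materialized cubes."""
    best = None
    for tG, tM in materialized:
        if tG == q_tG and tM >= q_tM:         # same tG, tM covers the request
            if best is None or tM < best[1]:  # prefer the smallest covering tM
                best = (tG, tM)
    return best

cubes = [
    (frozenset({"City"}), frozenset({"Education", "Date"})),
    (frozenset({"City"}), frozenset({"Education", "Date", "Category"})),
    (frozenset({"Date"}), frozenset({"City"})),
]
# q1 of Example 5.1: the exact cube is materialized and is returned.
print(search(frozenset({"City"}), frozenset({"Education", "Date"}), cubes) == cubes[0])  # True
# q3 of Example 5.1: no materialized cube covers {City, Education} for tG = {Date}.
print(search(frozenset({"Date"}), frozenset({"City", "Education"}), cubes))              # None
```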
Example 5.1. Suppose the OMARS system stores the following three materialized OLAM cubes: MCube(tG1, tM1), where tG1 = {City} and tM1 = {Education, Date}; MCube(tG2, tM2), where tG2 = {City} and tM2 = {Education, Date, Category}; and MCube(tG3, tM3), where tG3 = {Date}, tM3 = {City}, and prims = 3. We have three user queries q1, q2, and q3, where q1.tG = {City}, q1.tM = {Education, Date}, and q1.ms = 4; q2.tG = {City}, q2.tM = {Education, Date, Category}, and q2.ms = 2; q3.tG = {Date}, q3.tM = {City, Education}, and q3.ms = 3.
According to these three queries, we have the following three conditions:
1. When the user's query is q1, the condition corresponds to Case 1 described above. Because the corresponding OLAM cube can be found in the OMARS system, and the minsup of the query is higher than prims, we can use MCube(tG1, tM1) to respond to the query immediately.
2. When the user's query is q2, the condition corresponds to Case 2 described above. Because the minsup of the query is lower than prims, we need to utilize the corresponding auxiliary cube of the found OLAM cube MCube(tG2, tM2) and the data warehouse to answer q2.
3. When the user's query is q3, the condition corresponds to Case 3 described above. Because no matching OLAM cube can be found in the OMARS system, we must utilize the data warehouse to answer q3.
5.2 Cost Evaluation for Case 1
In this case, the OLAM cube returned from OLAM Mediator can be utilized to
respond users’ queries. The CBWon algorithm [15] is employed to mine association
rules. For convenience and facilitating the analysis, we replicate the CBWon algorithm
39
40.
in Figure 5.4. Because the qualified frequent itemsets have been stored in the found
OLAM cube, and minsup ≥ prims, there is no need to generate the frequent itemsets
via Apriorilike algorithm. All we have to do is scanning frequent itemsets in OLAM
cube and performing the association_gen procedure in Figure 5.7 to generate qualified
association rules.
Algorithm CBWon
Input: relevant cube MCube(tG, tM), minsup, and prims;
Output: the set of frequent itemsets F;
1  if minsup < prims then
2    AF = {X | sup(X) ≥ minsup, X ∈ Auxiliary Cube} ∪ {Y | Y ∈ MCube(tG, tM) and |Y| = Kα};
3    DF = Dwnsearchon(T, AF, Kα, minsup);
4    UF = Upsearch(AF, minsup);
5    F = DF ∪ UF;
6  else
7    F = {X | X ∈ MCube(tG, tM) and sup(X) ≥ minsup};
8  end if
9  return F;
Figure 5.4. Algorithm CBWon
Procedure Dwnsearchon
1  for i = 1 to |D| do
2    scan the ith transaction ti;
3    delete those items in ti that are not in AF;
4    for each subset X of ti with 2 ≤ |X| ≤ Kα do
5      sup(X)++;
6    end for
7  DF = {X | sup(X) ≥ minsup} ∪ AF;
Figure 5.5. Procedure Dwnsearchon
Procedure Upsearch
1  transform the horizontal data format T into t_id lists;
2  F_Kα = frequent Kα-itemsets;
3  k = Kα, Fk = F_Kα;
4  repeat
5    k++;
6    Ck = new candidate k-itemsets generated from Fk-1;
7    for each X ∈ Ck do
8      perform bit-vector intersection on X;
9      count the support of X;
10   end for
11   Fk = {X | sup(X) ≥ prims, X ∈ Ck};
12   UF = UF ∪ Fk;
13 until Fk = ∅
Figure 5.6. Procedure Upsearch
Procedure association_gen (F: set of all frequent itemsets; min_conf: minimum confidence threshold)
begin
  for each l ∈ F do
    generate P(l), the set of nonempty proper subsets of l;
    for each s ∈ P(l) do
      if support_count(l) / support_count(s) ≥ min_conf then
        output s ⇒ l - s;
end
Figure 5.7. Procedure association_gen
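A runnable sketch of this rule-generation step, assuming the support counts of all frequent itemsets are available in a dictionary; the item names and counts below are illustrative, not from the thesis.

```python
from itertools import combinations

def association_gen(support, min_conf):
    """For each frequent itemset l, emit s => l - s for every nonempty proper
    subset s whose confidence sup(l)/sup(s) reaches min_conf."""
    rules = []
    for l in support:
        for r in range(1, len(l)):              # nonempty proper subsets only
            for s in combinations(sorted(l), r):
                s = frozenset(s)
                if s in support and support[l] / support[s] >= min_conf:
                    rules.append((s, l - s))
    return rules

# Illustrative absolute support counts.
support = {
    frozenset({"PC"}): 50,
    frozenset({"Printer"}): 35,
    frozenset({"PC", "Printer"}): 30,
}
for s, c in association_gen(support, min_conf=0.8):
    print(set(s), "=>", set(c))   # only Printer => PC (conf 30/35 ~ 0.86) qualifies
```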
The cost thus can be divided into two parts:
1. Frequent itemset discovery: This involves searching the frequent itemsets stored in the OLAM cube whose support is no lower than the minsup of the user's query, which costs |DM|, where DM denotes the OLAM cube.
2. Rule generation: For each discovered frequent itemset, we construct all possible rules from it, compute their confidence, and keep those satisfying the minimum confidence.
The key point for the complexity analysis thus lies in the number of candidate
rules to be generated and inspected. Our first step toward this direction is to consider
the number of rules that can be generated from a frequent kitemset and all of its
subsets.
Lemma 1. The number of rules that can be constructed from a k-itemset is 2^k - 2.
Proof. Recall that each rule constructed from an itemset X has the form A ⇒ X - A, for A ⊂ X and A ≠ φ. Thus, the number of different A's determines the number of rules, which is

  Σ_{i=1}^{k-1} C(k, i) = 2^k - 2,

where C(k, i) denotes the binomial coefficient.
Lemma 2. For a k-itemset X, the total number of rules that can be generated from X and its subsets is 3^k - 2^{k+1} + 1.
Proof. From Lemma 1, we can derive

  Σ_{i=2}^{k} C(k, i)(2^i - 2)
  = Σ_{i=0}^{k} C(k, i)2^i - (1 + 2k) - 2(2^k - 1 - k)
  = 3^k - 1 - 2k - 2^{k+1} + 2 + 2k
  = 3^k - 2^{k+1} + 1.
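Both closed forms can be checked by brute force, counting rules A ⇒ X - A directly:

```python
from itertools import combinations

def rules_from(itemset):
    """Number of rules A => X - A with A a nonempty proper subset of X."""
    k = len(itemset)
    return sum(1 for r in range(1, k) for _ in combinations(itemset, r))

def rules_from_all_subsets(itemset):
    """Rules obtainable from X and every subset of X of size >= 2."""
    k = len(itemset)
    return sum(rules_from(sub)
               for size in range(2, k + 1)
               for sub in combinations(itemset, size))

for k in range(2, 8):
    X = tuple(range(k))
    assert rules_from(X) == 2**k - 2                           # Lemma 1
    assert rules_from_all_subsets(X) == 3**k - 2**(k + 1) + 1  # Lemma 2
print("Lemmas 1 and 2 hold for k = 2..7")
```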
Now, if we knew the set of maximal frequent itemsets, we could complete the analysis. Unfortunately, the exact set is unobtainable without a priori knowledge of the user's specified minsup. We thus resort to an estimation that takes prims in place of minsup. Then we apply sampling to obtain a random subset of the warehouse data, and we can either
1. compute the maximal frequent itemsets for each OLAM cube using any maximal pattern mining algorithm, or
2. apply the CBWoff algorithm to estimate Kα (the cutting level), compute the frequent itemsets with cardinality Kα, and regard these itemsets as the maximal frequent itemsets.
Let MF denote the set of maximal patterns. If the first approach is adopted, the computation spent on rule generation will be

  Σ_{X∈MF} (3^{|X|} - 2^{|X|+1} + 1),

or

  |F_Kα| × (3^{Kα} - 2^{Kα+1} + 1)

if the second approach is used. Here, for simplicity, we adopted the second approach. Finally, combining the cost of frequent itemset discovery and rule generation, and loosely bounding the per-itemset rule count by 3^{Kα}, we have

  |F_Kα| × 3^{Kα} + α × |DM|.
5.3 Cost Evaluation for Case 2
In this case, the algorithm CBWon illustrated in Figure 5.4 executes the "minsup < prims" branch of the "if" clause, which comprises three different steps.
1. Generate AF, i.e., F_Kα. This requires scanning the auxiliary cube and the OLAM cube. The cost is |DX| + |DM|, where DX denotes the auxiliary cube and DM denotes the OLAM cube.
2. Execute the procedure Dwnsearchon illustrated in Figure 5.5. Note that this procedure presumes the availability of the corresponding joined table and ignores the preprocessing step to generate it. To account for this task and simplify the discussion, we assume this cost is w and the table is T.
As illustrated in Figure 5.5, the Dwnsearchon procedure needs to scan all the transactions in the database. The I/O cost is α · |T|.
Next we estimate the cost of the most expensive step: counting itemset supports. Let l denote the average length of each transaction. This step costs

  |T| · (C(l, 2) + C(l, 3) + ... + C(l, Kα)),  or  |T| · Σ_{i=2}^{Kα} C(l, i)

in brief. Finally, the total cost consumed by the Dwnsearchon procedure equals

  α · |T| + |T| · Σ_{i=2}^{Kα} C(l, i).
3. Execute the procedure Upsearch illustrated in Figure 5.6. To minimize the I/O cost and avoid combinatorial decomposition, the Upsearch procedure first transforms the transaction data into a vertical data format called transaction-id lists, then utilizes this structure to count the supports of itemsets. The cost lies in three main steps.
(1) Data transformation. This requires an α · |T| data scan.
(2) Candidate generation. The dominant operation is the itemset join. If the largest itemset cardinality is Kmax, this task consumes at most

  Σ_{k=Kα+1}^{Kmax} C(|F_{k-1}|, 2).
(3) Counting candidate supports. For each k-itemset, counting involves k - 1 bit-vector intersections and one bit-vector accumulation. Summing this cost over all candidate itemsets, we have

  Σ_{i=Kα+1}^{Kmax} |Ci| × i × |T|.

Finally, the total cost for the procedure Upsearch is

  α × |T| + Σ_{i=Kα+1}^{Kmax} (C(|F_{i-1}|, 2) + |Ci| × i × |T|).
Combining all of the analysis, we have

  (|DX| + |DM| + 2α|T|) + |T| × Σ_{i=2}^{Kα} C(l, i) + Σ_{i=Kα+1}^{Kmax} (C(|F_{i-1}|, 2) + |Ci| × i × |T|) + |F_Kα| × 3^{Kα}.
5.4 Cost Evaluation for Case 3
In this case, we must generate the table T according to the user's query, which costs |D| × log|D|. After this, the CBWoff algorithm shown in Figure 5.8 is performed. It can be observed that, except for step 1, the steps employed by CBWoff are quite similar to those of CBWon in Case 2. Since step 1 costs

  α × |T| + Kα × |T|,

the total cost for this case is

  |D| × log|D| + 3α × |T| + Kα × |T| + |T| × Σ_{i=2}^{Kα} C(l, i) + Σ_{i=Kα+1}^{Kmax} (C(|F_{i-1}|, 2) + |Ci| × i × |T|) + |F_Kα| × 3^{Kα}.
Algorithm CBWoff(T, prims)
Input: table T and prims;
Output: the set of frequent itemsets F;
1  scan T to compute Kα and generate all frequent 1-itemsets F1;
2  DF = Dwnsearch(T, Kα, F1, prims);
3  UF = Upsearch(DF, prims);
4  return F = DF ∪ UF;
Figure 5.8. Algorithm CBWoff
Procedure Dwnsearch
1  for i = 1 to |D| do
2    scan the ith transaction ti;
3    delete the items in ti that are not in F1;
4    for each subset X of ti with 2 ≤ |X| ≤ Kα do
5      sup(X)++;
6    end for
7  store in the auxiliary cube all X with |X| = Kα and sup(X) < prims;
8  DF = {X | sup(X) ≥ prims};
Figure 5.9. Procedure Dwnsearch
To sum up, we list the cost functions for the three cases below:
Case 1: |F_Kα| × 3^{Kα} + α × |DM|.

Case 2: (|DX| + |DM| + 2α|T|) + |T| × Σ_{i=2}^{Kα} C(l, i) + Σ_{i=Kα+1}^{Kmax} (C(|F_{i-1}|, 2) + |Ci| × i × |T|) + |F_Kα| × 3^{Kα}.

Case 3: |D| × log|D| + 3α × |T| + Kα × |T| + |T| × Σ_{i=2}^{Kα} C(l, i) + Σ_{i=Kα+1}^{Kmax} (C(|F_{i-1}|, 2) + |Ci| × i × |T|) + |F_Kα| × 3^{Kα}.
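These cost functions can be transcribed directly into code; all parameters (sizes, candidate and frequent itemset counts) are supplied by the caller, so nothing is estimated here. With the d*ce parameters of Table 6.2 and a base relation of size 64, the Case 3 and Case 1 costs reproduce the values used in Table 6.3.

```python
from math import comb, log2

def case1(F_Ka, K_a, alpha, D_M):
    """Answer from a qualified OLAM cube: scan it, then generate rules."""
    return F_Ka * 3**K_a + alpha * D_M

def sum_counting(T, l, K_a):
    """|T| * sum_{i=2..Ka} C(l, i): support counting in Dwnsearch."""
    return T * sum(comb(l, i) for i in range(2, K_a + 1))

def sum_upsearch(T, K_a, K_max, C, F):
    """sum_{i=Ka+1..Kmax} (C(|F_{i-1}|, 2) + |C_i| * i * |T|)."""
    return sum(comb(F[i - 1], 2) + C[i] * i * T for i in range(K_a + 1, K_max + 1))

def case2(D_X, D_M, alpha, T, l, K_a, K_max, C, F, F_Ka):
    return (D_X + D_M + 2 * alpha * T + sum_counting(T, l, K_a)
            + sum_upsearch(T, K_a, K_max, C, F) + F_Ka * 3**K_a)

def case3(D, alpha, T, l, K_a, K_max, C, F, F_Ka):
    return (D * log2(D) + 3 * alpha * T + K_a * T + sum_counting(T, l, K_a)
            + sum_upsearch(T, K_a, K_max, C, F) + F_Ka * 3**K_a)

# d*ce parameters of Table 6.2: alpha=1, Ka=2, Kmax=4, |C3|=6, |C4|=4, l=8,
# |F2|=8, |F3|=5, DM=20, DX=15, |T|=30; base relation size |D|=64.
C, F = {3: 6, 4: 4}, {2: 8, 3: 5}
print(case3(64, 1, 30, 8, 2, 4, C, F, 8))   # 2504.0, as used in Table 6.3
print(case1(8, 2, 1, 20))                   # 92, as used in Table 6.3
```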
Chapter 6
OLAM Cube Selection Methods
In this chapter, we describe three typical heuristic algorithms proposed for the OLAP cube selection problem and elaborate how to modify and combine each of them with the cost models described in the last chapter so as to select the most suitable OLAM cubes. The methods are the forward greedy selection (FGS) method proposed by Harinarayan et al. [9], the pick-by-size (PBS) selection method proposed by Shukla et al. [17], and the backward greedy selection (BGS) method proposed by Lin and Kuo [13].
6.1 Forward Greedy Selection Method (FGS)
The forward greedy selection method was proposed by Harinarayan et al. [9]. A greedy algorithm chooses the locally optimal solution at each step under some constraint. For this purpose, we define a benefit function B(di, M) as follows:

  B(di, M) = (1/|di|) Σ_{q∈Q} (E(q, M) - E(q, M ∪ {di}))    (6.1)
We use this benefit function to compute the benefit of all unselected OLAM cubes, and the forward selection method materializes the most suitable OLAM cubes one by one, starting from the empty set, until no cube can be added. The forward selection method is described below:

Algorithm 1. Forward greedy selection (FGS)
Step 0. Let M = φ.
Step 1. While Σ_{d∈M} |d| < S, repeat Step 2 to Step 5.
Step 2. According to equation (6.1), calculate the benefit of all unselected OLAM cubes di, for 1 ≤ i ≤ n and di ∉ M.
Step 3. Select the OLAM cube with the maximal benefit according to the results of Step 2, and denote it dj.
Step 4. M ← M ∪ {dj}.
Step 5. Go to Step 1.
Figure 6.1. Forward Greedy Selection Method
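The loop above can be sketched as a higher-order routine that takes the benefit function of equation (6.1) as a parameter. The instance below is a toy one; the cube names, sizes, and the cost oracle E are assumptions for illustration, not from the thesis.

```python
def fgs(cands, size, S, B):
    """Greedily add the cube with maximal benefit until nothing else fits."""
    M = set()
    while True:
        fits = [d for d in cands - M
                if sum(size[c] for c in M) + size[d] <= S]
        if not fits:
            return M
        M.add(max(fits, key=lambda d: B(d, M)))

# Toy instance: each query is answered by the cheapest materialized cube that
# can serve it, else by the base relation (cost 64).
size = {"a": 20, "b": 10, "c": 15}
serves = {"q1": ["a"], "q2": ["b", "a"], "q3": ["c", "a"]}

def E(q, M):
    costs = [size[d] for d in serves[q] if d in M]
    return min(costs) if costs else 64

def B(d, M):
    """Benefit of adding d to M, per equation (6.1)."""
    return sum(E(q, M) - E(q, M | {d}) for q in serves) / size[d]

print(sorted(fgs({"a", "b", "c"}, size, 30, B)))   # ['a', 'b']
```

Note that the benefit of a cube changes as M grows (in the instance above, "b" is worth 5.4 on the first pass but only 1.0 once "a" is in M), which is why FGS re-evaluates all remaining candidates at every step.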
Example 6.1. Suppose that we select three attributes, city c, education e, and date d, from the sales star schema illustrated in Figure 2.3. Figure 6.2 depicts all possible OLAM cubes formed with these three attributes as well as their dependencies, where all OLAM cubes with the same transaction attributes tG are packed into a metacube. The dotted line between any two metacubes is for clarification purposes; it completes the lattice structure of metacubes in terms of tG. Note that, according to Proposition 4.1, the dependency exists only among OLAM cubes within the same metacube. For simplicity, let us consider how to select the most suitable OLAM cubes to materialize, under a space constraint, from the three OLAM cubes ced*, cd*, and ed*. The symbols used in this example are shown in Table 6.1, and the required parameter settings are shown in Table 6.2. Besides, we assume that the base relation size is 64 and prims is 3. Table 6.3 shows the first two selection steps using FGS.
Table 6.1. The symbols used in cost model
tG       the set of transaction attributes
tM       the set of mining attributes
α        I/O-to-computation ratio
Kα       the cardinality of the maximal frequent itemsets
Kmax     the cardinality of the largest itemset
|Ci|     number of candidate i-itemsets
l        average length of each transaction
|Fi|     number of frequent i-itemsets
DM       size of the OLAM cube
DX       size of the auxiliary cube
f        frequency of the OLAM cube
|T|      size of the table composed of the attributes tG ∪ tM
|D|      size of the base relation
D size of base relation
Table 6.2. The required parameter settings
subcubes  α  Kα  Kmax  |C3|  |C4|  l  |F2|  |F3|  DM  DX  |T|  minsup  f
d*ce      1  2   4     6     4     8  8     5     20  15  30   4       0.3
d*c       1  2   4     5     2     5  8     4     10  10  30   5       0.3
d*e       1  2   4     5     1     6  6     2     15  5   30   3       0.4
Figure 6.2. All possible OLAM cubes formed with city, education, and date
Table 6.3. The benefits of OLAM cubes in the first two selection steps by FGS
First selection:
- d*ce (influenced subcubes ced*, cd*, ed*): ((64*6 + 3*1*30 + 2*30 + 30*28 + 1058 + 8*3^2) - (8*3^2 + 1*20)) * (64 - 20) * (0.3 + 0.3 + 0.4) / 20 = 5306.4
- d*c (influenced subcube cd*): ((64*6 + 3*1*30 + 2*30 + 30*10 + 724 + 8*3^2) - (8*3^2 + 1*10)) * (64 - 10) * (0.3) / 10 = 2507.76
- d*e (influenced subcube ed*): ((64*6 + 3*1*30 + 2*30 + 30*15 + 586 + 6*3^2) - (6*3^2 + 1*15)) * (64 - 15) * (0.4) / 15 = 2031.87

Second selection (after d*ce is materialized):
- d*c (influenced subcube cd*): ((8*3^2 + 1*20) - (8*3^2 + 1*10)) * (20 - 10) * (0.3) / 10 = 3
- d*e (influenced subcube ed*): ((8*3^2 + 1*20) - (6*3^2 + 1*15)) * (20 - 15) * (0.4) / 15 = 3.067
6.2 Backward Greedy Selection Method (BGS)
The concept of the backward greedy selection method proposed by Lin and Kuo [13] is similar to that of the forward greedy selection method. The difference is that all OLAM cubes are selected at the beginning, and the selection proceeds by removing, step by step, the OLAM cube with the lowest detriment value, until the total size of all remaining OLAM cubes is smaller than the storage space. For this purpose, we define a detriment function P(di, M) as follows:

  P(di, M) = (1/|di|) Σ_{q∈Q} (E(q, M - {di}) - E(q, M))    (6.2)

Compared to BGS, the forward greedy selection algorithm can quickly find a set of data cubes when the storage space is noticeably smaller than the total size of the data cubes; but obviously, if the total cube size is not far above the storage space, FGS will need more computation than BGS. The backward greedy selection algorithm is described below:
Algorithm 2. Backward greedy selection (BGS)
Step 0. Let M ← D.
Step 1. While Σ_{d∈M} |d| > S, repeat Step 2 to Step 5.
Step 2. According to equation (6.2), calculate the detriment of all subcubes di, for 1 ≤ i ≤ n and di ∈ M.
Step 3. Select the OLAM cube with the minimum detriment value according to the results of Step 2, and denote it dj.
Step 4. M ← M - {dj}.
Step 5. Go to Step 1.
Figure 6.3. Backward Greedy Selection Method
Example 6.2. Consider Example 6.1 again. Suppose that all three cubes have been selected and the space constraint is 20. Besides, we assume that when ced* is not materialized, all queries that would be answered by ced* fall back to the base relation in the data warehouse. Table 6.4 shows the first two selection steps performed by the backward greedy selection method.
Table 6.4. The detriments of OLAM cubes in the first two selection steps by BGS

First selection:
- d*ce (influenced subcube ced*): ((64*6 + 3*1*30 + 2*30 + 30*28 + 1058 + 8*3^2) - (8*3^2 + 1*20)) * (64 - 20) * (0.3) / 20 = 1591.92
- d*c (influenced subcube cd*): ((8*3^2 + 1*20) - (8*3^2 + 1*10)) * (20 - 10) * (0.3) / 10 = 3
- d*e (influenced subcube ed*): ((8*3^2 + 1*20) - (6*3^2 + 1*15)) * (20 - 15) * (0.4) / 15 = 3.067

Second selection (after d*c is removed):
- d*ce (influenced subcubes ced*, cd*): ((64*6 + 3*1*30 + 2*30 + 30*28 + 1058 + 8*3^2) - (8*3^2 + 1*20)) * (64 - 20) * (0.3 + 0.3) / 20 = 3183.84
- d*e (influenced subcube ed*): ((8*3^2 + 1*20) - (6*3^2 + 1*15)) * (20 - 15) * (0.4) / 15 = 3.067
6.3 Pick by Size Selection Method (PBS)
The pick-by-size selection algorithm is an intuitive method proposed by Shukla et al. [17]. Its idea is to compute the sizes of all OLAM cubes and select the smallest OLAM cubes one by one until the storage constraint is exceeded. The pick-by-size selection algorithm is described below:
Algorithm 3. Pick by size selection (PBS)
Step 0. Sort all OLAM cubes by size.
Step 1. Let M = φ.
Step 2. While Σ_{d∈M} |d| < S, repeat Step 3 to Step 5.
Step 3. Select the smallest unselected OLAM cube, and denote it dj.
Step 4. M ← M ∪ {dj}.
Step 5. Go to Step 2.
Figure 6.4. Pick by Size Selection Method
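PBS reduces to a sort and a take-while; a minimal sketch with illustrative cube sizes (the names below are assumptions, not from the thesis):

```python
def pbs(size, S):
    """Take cubes smallest-first while the running total fits in S."""
    M, used = [], 0
    for d in sorted(size, key=size.get):
        if used + size[d] > S:
            break
        M.append(d)
        used += size[d]
    return M

size = {"ced*": 20, "cd*": 10, "ed*": 15}   # illustrative sizes
print(pbs(size, 30))   # ['cd*', 'ed*']: 10 + 15 fits, adding 20 would not
```

Unlike FGS and BGS, the selection here never consults the cost model, which is why PBS runs fastest but can pick cubes that save little query time.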
Chapter 7
Experimental Results
In this chapter, we describe our experiments and analysis. All experiments were performed on a machine with an Intel Celeron 1.2 GHz CPU and 512 MB RAM, running Microsoft Windows 2000 Server. The test data were generated from the Microsoft foodmart2000 database, from which we chose three dimensions: Customer, Time, and Product. Each dimension consists of two attributes: city c and education e for the Customer dimension; date d and month m for the Time dimension; and Product_ID p and Category a for the Product dimension. The schema hierarchies of the three dimensions are shown in Figure 2.4. Characteristics of the test data are shown in Table 7.1, in which we use the following notation:

  g<attribute_list>(R),

where g represents the "group by" operation and <attribute_list> denotes the grouping attributes.
From this test database, we generated all possible OLAM subcubes, which are detailed in Table 7.2. Note that we have filtered out those subcubes that satisfy schema redundancy and those whose corresponding transaction table composed of tG ∪ tM has fewer than 10000 tuples. After this filtering, we observed that all OLAM cubes can be grouped into seven classes distinguished by tG, as shown in Table 7.2. Each cube is represented by the first letters of its attributes, except that Category is abbreviated as "a". The symbol "-" means none, and the number below each set of mining attributes is the size of the corresponding subcube.
In our experiments, we consider two different combinations of frequencies of
subcubes: 1) all frequencies are the same; 2) randomly generated numbers between 0
and 1. Besides, we consider three different combinations of minsup: 1) all minsup are
lower than prims; 2) randomly generated minsup between 1% and 99%; 3) all minsup
are equal or greater than prims.
Finally, we assume eight different storage constraints, which are 10%, 20%,
30%, 40%, 50%, 60%, 70%, and 80% of the sum of all subcubes. Furthermore if none
of the subcubes that can be used to answer a query is materialized, we assume that the
base relation is invoked to answer this query.
Table 7.1. Data parameters of foodmart 2000
Parameters Value
D 86565
dom(City) 78
dom(Education) 5
dom(Date) 323
dom(Month) 12
dom(Product_ID) 1559
dom(Category) 45
g c, e, d(D) 10541
g c, p(D) 49483
g m, p(D) 18492
g c, e, a(D) 11576
g d, a(D) 12113
g c, d, a(D) 49390
g c, m, a(D) 22854
Table 7.2. All subcubes of the six attributes City, Education, Date, Month, Product_ID, and Category
tG    All possible mining attribute sets tM (sizes below each set)
ced mpa mp ma pa m p a
713 12 714 678 12 0 678
cp edma edm eda ema dma ed em ea
72 60 24 72 18 12 60 24
dm da ma e d m a
12 6 18 12 0 12 6
mp ceda ced cea cda eda ce cd ca
407 377 403 62 62 373 54 59
ed ea da c e d a
34 58 9 51 30 3 6
cea dmp dm dp mp d m p
3955 3955 14 3891 14 3891 0
da cemp cem cep cmp emp ce cd ca
860 860 792 87 99 792 87 75
ed ea da c e m p
99 31 12 75 31 12 0
cda emp em ep mp e m p
22 22 9 12 9 12 0
cma edp ed ep dp e d p
30 30 30 0 30 0 0
7.1 Comparison of FGS, BGS, and PBS for
minsup ≥ prims
We first compare the query cost of the three selection methods when the frequency of each subcube is random. The results are shown in Figure 7.1. According to Figure 7.1, it is obvious that FGS and BGS are significantly better than PBS, and there is an optimal cost-effective space around 50%. We also recorded the result when the frequency of each subcube is uniform, as illustrated in Figure 7.2. The phenomenon is similar to Figure 7.1, but the optimal cost-effective space is about 40% for FGS.
Figure 7.1. Comparing the query cost of FGS, BGS, and PBS with random frequency
when minsup ≥ prims.
Figure 7.2. Comparing the query cost of FGS, BGS, and PBS with uniform frequency
when minsup ≥ prims.
We also compare the efficiency of the forward, backward, and pick-by-size selection methods. Since the forward and backward greedy selection methods have a similar philosophy but differ in the direction of selection, we use execution time as the criterion. Besides, we also compared these two methods with PBS. The results are shown in Figure 7.3 and Figure 7.4. From these two figures, the selection time of forward greedy selection is less than that of backward greedy selection at first, but when the space limit is higher than 30%, the situation is reversed. The reason is that the forward greedy selection method selects subcubes starting from the empty set until no cubes can be added, while the backward greedy selection method proceeds in the opposite direction. The pick-by-size selection method consumes the least time because it need not select the subcube with the best benefit; it only chooses subcubes according to their size.
Figure 7.3. Comparing the selection time of FGS, BGS, and PBS with random
frequency when minsup ≥ prims.
Figure 7.4. Comparing the selection time of FGS, BGS, and PBS with uniform
frequency when minsup ≥ prims.
7.2 Comparison of FGS, BGS, and PBS for
minsup < prims
We first compare the query cost of FGS, BGS, and PBS. The results are shown in Figure 7.5 and Figure 7.6. Recall that when minsup < prims, the OMARS system must utilize the OLAM cube, the auxiliary cube, and the data warehouse to respond to a user's query. Obviously, this process requires more cost because no OLAM cube can respond to the query immediately. Thus, the query cost of FGS and BGS when minsup < prims is higher than that when minsup ≥ prims, no matter what the frequencies are. But the query cost of PBS when minsup < prims is similar to that when minsup ≥ prims, because PBS selects subcubes only according to their size.
Figure 7.5. Comparing the query cost of FGS, BGS, and PBS with random frequency
when minsup < prims.
Figure 7.6. Comparing the query cost of FGS, BGS, and PBS with uniform frequency
when minsup < prims.
We then compare the efficiency of these three methods. The results are shown in Figure 7.7 and Figure 7.8. It can be observed that the results are similar to the situation when minsup ≥ prims.
Figure 7.7. Comparing the selection time of FGS, BGS, and PBS with random
frequency when minsup < prims.
Figure 7.8. Comparing the selection time of FGS, BGS, and PBS with uniform
frequency when minsup < prims.
7.3 Comparison of FGS, BGS, and PBS for
random minsups
Generally, the support settings of different users' queries differ. To reflect this situation, we conducted another experiment with random minsups. Figure 7.9 shows the results when the frequencies of all subcubes are random. The forward and backward selection methods reach the optimal cost-effective point around 50%; once the space constraint goes beyond 50%, there is no further saving in query cost for forward greedy selection. In Figure 7.10, the results are similar to Figure 7.9, but the optimal cost-effective point for FGS is around 40%.
Figure 7.9. Comparing the query cost of FGS, BGS, and PBS with random frequency
for random minsups.
Figure 7.10. Comparing the query cost of FGS, BGS, and PBS with uniform
frequency for random minsups.
We then compare the selection time of FGS, BGS, and PBS. As shown in Figure
7.11, and Figure 7.12, the phenomenon is similar to the above two cases.
Figure 7.11. Comparing the selection time of FGS, BGS, and PBS with random frequency for random minsups.
Figure 7.12. Comparing the selection time of FGS, BGS, and PBS with uniform
frequency for random minsups.
Chapter 8
Conclusions and Future Works
8.1 Conclusions
In this thesis, we have considered the OLAM cube selection problem in the OMARS system and proposed cost models to evaluate the query cost. According to the user's association query, we divided the query evaluation into three cases and accordingly designed three different cost models. Using these cost models, we have modified three well-known heuristic algorithms, FGS, BGS, and PBS, to choose the most suitable OLAM cubes to materialize. We have also implemented these algorithms to evaluate their performance.
One thing needs to be clarified: although our cost models are based on the CBWon and CBWoff algorithms proposed by Lin et al. [15], most of the concepts are applicable to other algorithms as well, except that the cost functions need to be modified to conform to the new algorithms.
8.2 Future Works
As we pointed out at the beginning of this thesis, the main focus of this research is on OLAM cube selection in the OMARS system. Some issues need to be investigated in the near future:
1. Simultaneous OLAM cube and OLAP cube selection
In the OMARS system, there is another cube repository that stores OLAP cubes. In this thesis, we only consider the OLAM cube selection problem. One direction for future work is to consider OLAP cubes and OLAM cubes together and to design a suitable cost model to evaluate the query cost in this situation.
2. Cube maintenance
In real-world applications, data are updated as well as generated and need to be loaded into the data warehouse. This implies that the materialized cubes have to be updated to reflect the new situation. Another direction for future work is to design a suitable scheme to update the OLAM, OLAP, and auxiliary cubes in the OMARS system.
3. Other non-heuristic algorithms
In this thesis, we only consider the class of heuristic algorithms for selecting the most suitable OLAM cubes to materialize. Besides these methods, we will consider other, non-heuristic algorithms, such as genetic algorithms, the A* algorithm, or dynamic programming, to select the most suitable OLAM cubes to materialize.
References
[1] 林文揚、張耀升:“ 發式資料方體挑選方法之分析比較”啟 ，九十年全國計算
機會議論文集，頁 4758，2001。
[2] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th VLDB Conference, pp. 487-499, 1994.
[3] K.S. Beyer and R. Ramakrishnan, "Bottom-up computation of sparse and iceberg cubes," in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 359-370, 1999.
[4] S. Chaudhuri and U. Dayal, "An overview of data warehousing and OLAP technology," ACM SIGMOD Record, Vol. 26, pp. 65-74, 1997.
[5] C.I. Ezeife, "A uniform approach for selecting views and indexes in a data warehouse," in Proceedings of the International Database Engineering and Applications Symposium, pp. 151-160, 1997.
[6] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh, "Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals," in Proceedings of the International Conference on Data Engineering, pp. 152-159, 1996.
[7] H. Gupta, "Selection of views to materialize in a data warehouse," in Proceedings of the International Conference on Database Theory, pp. 98-112, 1997.
[8] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2000.
[9] V. Harinarayan, A. Rajaraman, and J.D. Ullman, "Implementing data cubes efficiently," in Proceedings of ACM SIGMOD, pp. 205-216, 1996.
[10] J.T. Horng, Y.J. Chang, B.J. Liu, and C.Y. Kao, "Materialized view selection using genetic algorithms in a data warehouse," in Proceedings of the World Congress on Evolutionary Computation, pp. 2221-2227, 1999.
[11] W.H. Inmon and C. Kelley, Rdb/VMS: Developing the Data Warehouse, QED Publishing Group, Boston, Massachusetts, 1993.
[12] R. Kimball, The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses, John Wiley & Sons, Inc., 1996.
[13] W.Y. Lin and I.C. Kuo, "OLAP data cubes configuration with genetic algorithms," in Proceedings of IEEE Systems, Man, and Cybernetics, pp. 1984-1989, 2000.
[14] W.Y. Lin, I.C. Kuo, and Y.S. Chang, "A genetic selection algorithm for OLAP data cube," in Proceedings of the 9th National Conference on Fuzzy Theory and Its Application, Taiwan, pp. 624-628, November 2001.
[15] W.Y. Lin, J.H. Su, and M.C. Tseng, "OMARS: The framework of an online multidimensional association rules mining system," in Proceedings of the 2nd International Conference on Electronic Business, Taipei, Taiwan, 2002.
[16] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, New York, 1994.
[17] A. Shukla, P.M. Deshpande, and J.F. Naughton, "Materialized view selection for multidimensional datasets," in Proceedings of the 24th VLDB Conference, New York, USA, pp. 488-499, 1998.
[18] E. Soutyrina and F. Fotouhi, "Optimal view selection for multidimensional database systems," in Proceedings of the International Database Engineering and Applications Symposium, pp. 309-318, 1997.
[19] D. Theodoratos and T. Sellis, "Data warehouse configuration," in Proceedings of the 23rd VLDB Conference, pp. 126-135, 1997.
[20] C. Zhang, X. Yao, and J. Yang, "Evolving materialized views in data warehouse," in Proceedings of the World Congress on Evolutionary Computation, pp. 823-829, 1999.
[21] C. Zhang and J. Yang, "Genetic algorithm for materialized view selection in data warehouse environments," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 116-125, 1999.
[22] H. Zhu, On-Line Analytical Mining of Association Rules, Simon Fraser University, December 1998.
OLAM Cube Selection in OMARS
Student: Min-Feng Wang
Advisor: Dr. Wen-Yang Lin
A Thesis
Submitted to the Department of Information Management
I-Shou University
in Partial Fulfillment of the Requirements
for the Master's Degree
in
Information Management
July 2003
Kaohsiung, Taiwan, Republic of China
OLAM Cube Selection in OMARS
Student: Min-Feng Wang Advisor: Wen-Yang Lin
Dept. of Information Management
I-Shou University
ABSTRACT
Mining association rules from large databases is a data- and computation-intensive task. To reduce the complexity of association mining, Lin et al. proposed the concept of the OLAM (On-Line Association Mining) cube, an extension of the Iceberg cube used to store frequent multidimensional itemsets. They also proposed a framework for an online multidimensional association rule mining system, called OMARS, which provides users an environment to execute OLAP-like queries to mine association rules from data warehouses efficiently.
This thesis is a step toward the implementation of OMARS. In particular, it addresses the problem of selecting appropriate OLAM cubes to materialize and store in OMARS. According to the mining algorithms proposed for OMARS, we develop models for evaluating the cost of answering association queries using materialized OLAM cubes, which is a preliminary step for OLAM cube selection. In addition, we modify and implement several state-of-the-art heuristic algorithms, and draw comparisons between them to evaluate their effectiveness.
Keywords: data mining, data warehouse, OMARS, OLAM cube, multidimensional
association rules, cube selection problem
Acknowledgement
For the completion of this thesis, I am most grateful to my advisor, Dr. Wen-Yang Lin, for his patient guidance, through which I learned how to do research and came to appreciate both its joys and its hardships. I especially thank him for taking time during the summer vacation to advise me in the final stage of writing. During these two years of graduate study, beyond the study of theory, I am also grateful that he gave me time to delve into the implementation of data warehousing and database systems.
Two years is a short time. I also thank Professors 洪宗貝, 錢炳全, 王學亮, and 林建宏 for their instruction during these two years, which broadened my horizons in many different ways. In addition, I thank the classmates who studied and discussed coursework with me, especially 思博, 文傑, and 欣龍, as well as my seniors, particularly 耀升 and 詠騏, and the junior students who helped me with my oral defense. Finally, I thank my family for their support and encouragement throughout these two years.
Contents
ABSTRACT
ACKNOWLEDGEMENT
List of Figures
FIGURE 3.1. THE OMARS FRAMEWORK [15]
FIGURE 4.2. THE 2ND LAYER LATTICE DERIVED FROM <CUSTOMER, TIME, > IN THE 1ST LAYER
FIGURE 5.6. PROCEDURE UPSEARCH
FIGURE 5.8. ALGORITHM CBWOFF
List of Tables
TABLE 4.8. THE RESULTING OLAM CUBE MCUBE({CITY}, {DATE, MONTH})