(Category, “A”) ⇒ (Category, “B”) (sup = 40%, conf = 80%)
Note that to facilitate this mining task, the table T has to be,...
(Education, “Master”), (Month, “July”) ⇒
(Month, “Aug.”) (sup = 40%, conf = 80%)
For this case, the transformed table will...
4.3.
Table 4.3. An example inter-dimensional OLAM cube expressed in table
Education Month Support
Bachelor
High school
Mas...
4.2 OLAM Lattice
In accordance with the definition of OLAM cube, we can generate all possible
OLAM cubes from the star sch...
Figure 4.1. The 1st layer OLAM lattice for the example star schema in Figure 2.3
Figure 4.2. The 2nd layer lattice derived ...
Figure 4.3. The 3rd layer lattice derived from the subcube <(city, education), date> in the 2nd layer
Because the real OL...
Example 4.4. Consider the table T in Table 4.1. Let MCube(tG1, tM1) be the cube
illustrated in Table 4.4 and MCube(tG2, ...
over the attributes in the star schema provide the possibility to prune some redundant
cubes.
Consider an OLAM cube, MCube...
Table 4.7. The resulting table by grouping {Date}
as transaction attribute for Table 4.1
Date Category
7/4 A
7/12 A
7/18 A...
Table 4.8. The resulting OLAM cube MCube({City}, {Date, Month})
Date   Month   support
7/18   -       2
8/2    -       3
-      July    4
-      Aug.    4
7/18...
Table 4.9. The resulting table by grouping {City, Date} as transaction attribute for
Table 4.1
City Date Month
Toronto
Tai...
Table 4.10. The Symbol Table
Symbol   Definition
L        Lattice
D        Set of data cubes
dn       The nth data cube
Q        Set of user queries
qm       The mth...
Chapter 5
Evaluation of OLAM Query Cost
5.1 Query Evaluation Flow
As stated previously, the primary task of OLAM Engine is...
Figure 5.1. The flow diagram of OLAM query
An important thing worth mentioning is that, for simplicity, we do not consider
...
Procedure Evacost_OLAMQ(q)
begin
  Let q = <tG, tM, minsup>;
  found = OLAMQ_search(q, CQ);
  if found = TRUE then
    if prims ≤ m...

Procedure OLAMQ_search(q, CQ)
begin
  found = FALSE;
  if MCube(q.tG, q.tM) is materialized then
    CQ.MCube = MCube(q.tG, q.tM...
tG3, tM3), where tG3 = {Date}, tM3 = {City}, and prims = 3. We have three users'
queries as follows: q1, q2, q3, where 1...
in Figure 5.4. Because the qualified frequent itemsets have been stored in the found
OLAM cube, and minsup ≥ prims, there ...
Procedure Dwnsearchon
1 for i=1 to |D| do
2 scan the i-th transaction ti;
3 delete those items in ti but not in AF;
4 for ...
13 until Fk = ∅
Figure 5.6. Procedure Upsearch
Procedure association_gen (F: set of all frequent itemsets; min_conf: minim...
Proof. Recall that each rule that can be constructed from an itemset X has the form
A ⇒ X − A, for A ⊂ X and A ≠ ∅. Thus, ...
frequent itemsets.
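The enumeration in the proof, every rule A ⇒ X − A with A a nonempty proper subset of X, yields 2^|X| − 2 candidate rules per itemset. A minimal sketch (the function name and the example itemset are invented for illustration):

```python
from itertools import combinations

def candidate_rules(itemset):
    """Yield every rule A => X - A with A a nonempty proper subset of X."""
    X = frozenset(itemset)
    for r in range(1, len(X)):
        for A in map(frozenset, combinations(X, r)):
            yield set(A), set(X - A)

rules = list(candidate_rules({"a", "b", "c"}))
print(len(rules))  # 2**3 - 2 = 6
```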
Let MF denote the set of maximal patterns. If the first approach is adopted, the
computation spent on ...
As illustrated in Figure 5.5, the Dwnsearchon procedure needs to scan all the
transactions in the database. The I/O cost i...
cost over all candidate itemsets, we have ∑i=Kα+1..kmax |Ci| × i × |T|.
Finally, the total cost for procedure Ups...
Algorithm CBWoff(T, prims)
Input: Table T and prims;
Output: The set of frequent itemsets F;
1 scan T to compute Kα and ge...
To sum up, we list the cost functions for the three cases below:
Case 1:
| |
| | 3 | |K
K MF Dα
α
α× + × .
Case 2: ( ) ( )...
Chapter 6
OLAM Cube Selection Methods
In this chapter, we describe three typical heuristic algorithms proposed for OLAP
cu...
We use our benefit function to compute the benefit of all unselected OLAM cubes,
and combine the forward selection method ...
used in this example are shown in Table 6.1, and the required parameter settings are
shown in Table 6.2. Besides, we assum...
Figure 6.2. All possible OLAM cubes formed with city, education, and date
Table 6.3. The benefits of OLAM cubes in the fir...
6.2 Backward Greedy Selection Method (BGS)
The concept of the backward greedy selection method proposed by Lin and Kuo
[13...
Algorithm 2. Backward Greedy Selection (BGS)
Step 0. Let M ← D;
Step 1. While ∑d∈M d > S, repeat Step 2 to Step 5.
Step 2....
Table 6.4. The benefits of OLAM cubes in the first two selection steps by BGS
First selection Second selection
subcubes In...
Algorithm 3. Pick by Size Selection (PBS)
Step 0. Sort all OLAM cubes by size;
Step 1. Let M = ∅.
Step 2. While ∑d∈M d < S, rep...
Chapter 7
Experimental Results
In this chapter, we describe our experiment and analysis. All experiments are
performed on ...
“a”. The symbol “-“ means none, and the number is the size of each subcube.
In our experiments, we consider two different ...
City, Education, Date, Month, Product_ID, Category
tG | All possible mining attributes tM
ced | mpa mp- m-a -pa m-- -p- --a
71...
phenomenon is similar to Figure 7.1, but the optimal cost-effective space is about
40% for FGS.
method. Since the forward and backward greedy selection methods have similar
philosophies but differ in the direction of ...
Figure 7.4. Comparing the selection time o...
Figure...
situation when minsup ≥ prims.
Figure 7....
7.3 Comparison of FGS, BGS, and PBS for random minsups
Generally, the support settings of different users’ queries are dif...
Figure 7.10. Comparing t...
frequency for random minsups.
Figure...
Chapter 8
Conclusions and Future Work
8.1 Conclusions
In this thesis, we have considered the OLAM cube selection problem ...
is on OLAM cube selection in the OMARS system. Some issues need to be
investigated in the near future:
1. OLAM c...
References
[1] 林文揚 and 張耀升, "An analytical comparison of heuristic data cube selection
methods" (in Chinese), Proceedings of the National Computer Symposium, pp. 47-58, 2001.
[2] R. Agrawal and R. Srikant, “Fast algorithms f...
KAUFMANN PUBLISHERS, 2000.
[9] V. Harinarayan, A. Rajaraman, and J.D. Ullman, “Implementing data cubes
efficiently,” in Pr...
Applications Symposium, pp. 309-318, 1997.
[19] D. Theodoratos and T. Sellis, “Data warehouse configuration,” in Proceedin...
義守大學 (I-Shou University)
資訊管理研究所 (Institute of Information Management)
碩士論文 (Master's Thesis)

OMARS 系統中線上關聯規則採掘資料方體之挑選
OLAM Cube Selection in OMARS

研究生：王敏峰 (Student: Min-Feng Wang)
指導教授：林文揚 博士 (Advisor: Dr. Wen-Yang Lin)

July 2003 (中華民國九十二年七月)
OMARS 系統中線上關聯規則採掘資料方體之挑選
(OLAM Cube Selection in OMARS)

Student: 王敏峰 (Min-Feng Wang)   Advisor: Dr. 林文揚 (Wen-Yang Lin)
Institute of Information Management, I-Shou University

Abstract

Mining association rules from large databases is a computation-intensive task. To
reduce the complexity of association rule mining, Lin et al. extended the iceberg
(Ice-berg) cube ... System, OMARS), which lets users issue OLAP-like queries to
mine association rules from the data warehouse rapidly.

The purpose of this thesis is to design a cost model based on the association rule
mining algorithms proposed for the OMARS system, and to use it to select the
on-line association rule mining ...
OLAM Cube Selection in OMARS
Student: Min-Feng Wang Advisor: Wen-Yang Lin
Dept. of Information Management
I-Shou University...
draw comparisons between these algorithms to evaluate their effectiveness.
Keywords: data mining, data warehouse, OMARS, O...
Two years is a short time. I would also like to thank Professors 洪宗貝, 錢炳全,
王學亮, and 林建宏 for their guidance during these two years, which broadened my
horizons in such a short period. I also thank the classmates who studied and
discussed coursework with me over these two years, especially 思博, 文傑, and
欣龍, as well as my senior schoolmates, particularly 耀升...
Contents
ABSTRACT  III
ACKNOWLEDGEMENT  ...

List of Figures
FIGURE 3.1. THE OMARS FRAMEWORK [15]  17
FIGURE 4.2. THE 2ND ...

List of Tables
TABLE 4.8. THE RESULTING OLAM CUBE MCUBE({CITY}, {DATE, MONTH})  ...
Chapter 1
Introduction

Data warehousing and data mining have both been popular technologies in recent years. Data warehousing is an information infrastructure that stores and integrates different data sources into a consistent repository; through OLAP (On-Line Analytical Processing) tools, business managers can analyze these data from various perspectives to discover valuable information for strategic decisions. Data mining, on the other hand, is the exploration and analysis of data, automatically or semi-automatically, to discover meaningful patterns and rules. From the business viewpoint, the integration of these two technologies allows a corporation to understand its customers' behaviors and to use this information to gain a competitive edge in the market.

Among the various patterns of interest to the data mining research community, the association rule has attracted great attention recently. An association rule is a rule of the form A ⇒ B (sup = s%, conf = c%), which reveals the co-occurrence of two itemsets A and B. An example is PC ⇒ Laser Printer (sup = 30%, conf = 80%), which means that 30% of customers buy a PC and a laser printer together, and 80% of those customers who buy a PC also get a laser printer.

Mining association rules from a large database is a data- and computation-intensive task. To reduce the complexity of association mining, researchers have proposed integrating data warehousing systems and association mining algorithms.
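The support and confidence figures in the PC ⇒ Laser Printer example can be checked mechanically. A minimal sketch with an invented five-transaction database:

```python
# Invented five-transaction database for the PC / Laser Printer example.
transactions = [
    {"PC", "Laser Printer"},
    {"PC", "Laser Printer", "Scanner"},
    {"PC"},
    {"Scanner"},
    {"PC", "Laser Printer"},
]

def support(itemset, db):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """support(lhs union rhs) / support(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"PC", "Laser Printer"}, transactions))                 # 0.6
print(round(confidence({"PC"}, {"Laser Printer"}, transactions), 2))  # 0.75
```

Here 3 of 5 transactions contain both items (support 60%), and 3 of the 4 PC-buying transactions also contain a laser printer (confidence 75%).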
For example, the DBMiner system [22] developed by J. Han and his research team adopts an OLAP-based association mining approach. A similar paradigm was presented in [22]. The primary problem of the OLAP-based approach is that the OLAP data cube is not feasible for on-line association mining; excessive effort is still required to complete the task. As such, Lin et al. [15] proposed the concept of the OLAM (On-Line Association Mining) cube, an extension of the iceberg cube [3] used to store frequent multidimensional itemsets. They also proposed a framework for an on-line multidimensional association rule mining system, called OMARS, to provide users an environment in which to execute OLAP-like queries that mine association rules from data warehouses efficiently.

This thesis is a companion toward the implementation of OMARS. In particular, it addresses the problem of selecting appropriate OLAM cubes to materialize and store in OMARS. In accordance with the mining algorithms proposed in OMARS, a suitable model to evaluate the cost of selecting data cubes to materialize is also developed.

1.1 Contributions

The main contributions of this thesis are as follows:
1. We exploit the dependency between OLAM cubes with regard to association queries, thereby devising the structure of the OLAM lattice.
2. We develop a model for evaluating the cost of answering association queries using materialized OLAM cubes, which is a preliminary step for OLAM cube selection.
3. We modify and implement some state-of-the-art heuristic algorithms, and
draw comparisons between these algorithms to evaluate their effectiveness.

1.2 Thesis Organization

This thesis is organized as follows. We describe past research and related work on data warehousing and data mining technologies in Chapter 2. In Chapter 3, we briefly describe the OMARS framework. Chapter 4 formulates our OLAM cube selection problem. The algorithm analysis and cost model are described in Chapter 5. Chapter 6 explains our algorithms, and Chapter 7 shows the experimental results of this research. Finally, we conclude our work and point out some future research directions in Chapter 8.
Chapter 2
Background and Related Work

2.1 Data Warehouse and OLAP

2.1.1 Data Warehouse

As coined by W. H. Inmon, the term "data warehouse" refers to a "subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision-making process" [11]. In this regard, a data warehouse is a database dedicated to decision support. According to the demands of analysts, data from different databases are extracted and transformed into the data warehouse. When users execute queries, the system only needs to search the data warehouse instead of the source databases, which saves users considerable query processing time.

A data warehouse system is composed of three primary parts:
1. The source databases in the back end: the data are collected from various sources, internal or external, legacy or operational, and any change to these sources is continually monitored by several modules called
monitors/wrappers.
2. The data warehouse and data marts in the core: the reconciled data are stored in the data warehouse and data marts, which are the central repository for the whole system.
3. The analysis tools in the front end: the analysis tools supported in the front end are usually OLAP, query/tabulation tools, and data mining software.

The typical structure of a data warehouse is illustrated in Figure 2.1.

Figure 2.1. A typical architecture of data warehouse [11].

2.1.2 On-Line Analytical Processing (OLAP)

Although the data stored in a data warehouse have been cleaned, filtered, and integrated, it still takes much time to transform the data into useful strategic information owing to the massive amount of data stored in the warehouse. The concept of On-Line Analytical Processing (OLAP) [4] refers to the process of creating and managing multidimensional data for analysis and visualization. To provide fast
and multidimensional analysis of data in a data warehouse, the OLAP tool precomputes aggregations over the data and organizes the result as a data cube composed of several dimensions, each representing one of the user's analysis perspectives.

The typical operations provided by OLAP include roll-up, drill-down, slice and dice, and pivot [8]. The roll-up operation performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. Drill-down is the reverse of roll-up: it navigates from less detailed data to more detailed data. The slice operation performs a selection on one dimension of the given cube, resulting in a subcube, while the dice operation defines a subcube by performing a selection on two or more dimensions. The pivot operation, also called rotate, is a visualization operation that rotates the data axes in the view to provide an alternative presentation of the data. These OLAP operations are illustrated in Figure 2.2.

2.2 Data Warehouse Data Model

Because data warehouse systems require a concise, subject-oriented schema that facilitates on-line data analysis, the entity-relationship data model generally used in relational database systems is not suitable for a data warehouse. For this purpose, the most popular data model for a data warehouse is a multidimensional data model. Two common relational models that facilitate multidimensional analysis are the star schema and the snowflake schema.
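Roll-up by dimension reduction can be illustrated with a tiny sketch. The dict-of-tuples cube, its values, and the function below are invented for the example, not how an OLAP server represents a cube:

```python
from collections import defaultdict

# A tiny (product, supplier) -> sales cube with invented values.
cube = {("P1", "S1"): 10, ("P1", "S2"): 5,
        ("P2", "S1"): 7,  ("P2", "S2"): 3}

def roll_up(cube, keep):
    """Roll-up by dimension reduction: sum out every axis not in `keep`."""
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)

# Climb from (product, supplier) cells to per-product totals.
print(roll_up(cube, keep=(0,)))  # {('P1',): 15, ('P2',): 10}
```

Drill-down is simply the reverse navigation, back to the finer-grained cells.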
Figure 2.2. The typical operations of OLAP

2.2.1 Star Schema

The star schema, proposed by Kimball [12], is the most popular dimensional model used in the data warehouse community. A star schema consists of a fact table and several dimension tables. The fact table stores a list of foreign keys, which correspond to the dimension tables, and numeric measures of user interest. Each dimension table contains a set of attributes. Moreover, the attributes within a dimension table may form either a hierarchy (total order) or a lattice (partial order). An example of a star schema is depicted in Figure 2.3, whose schema hierarchy is illustrated in Figure 2.4.
Figure 2.3. An example of star schema for sales

Figure 2.4. An example of schema hierarchy for sales star
2.2.2 Snowflake Data Model

The snowflake schema is a variant of the star schema, in which some dimension tables are normalized, thereby further splitting the data into additional individual and hierarchical tables. An example of the snowflake data model is depicted in Figure 2.5.

Figure 2.5. An example of snowflake schema for sales

The major difference between the snowflake schema and the star schema is that the dimension tables of the snowflake model may be kept in normalized form to reduce redundancies. This characteristic makes maintenance easier and saves more storage space than the star schema data model does. On the other hand, the star schema can integrate schema hierarchies into a dimension table, thereby incurring no join
operation during hierarchical traversal of the dimensions. Hence, the star schema data model is more popular than the snowflake schema data model.

2.3 Association Rule Mining

2.3.1 Association Rules

Association rule mining is one of the prominent activities in the data mining community. The goal of association rule mining is to search for interesting relationships among items in a given data set. For example, the information that customers who purchase diapers also tend to buy beer at the same time is represented in the association rule below:

Diaper ⇒ Beer [sup = 2%, conf = 60%]

Rule support and confidence are two measures of rule interestingness. A support of 2% means that 2% of customers purchase diapers and beer together. A confidence of 60% means that 60% of the customers who purchase diapers also buy beer. Typically, an association rule is considered interesting if it satisfies a minimum support threshold and a minimum confidence threshold set by users or domain experts.

The process of association rule mining can be divided into two steps:
1. Frequent itemset generation: all itemsets with support greater than the minimum support threshold are discovered.
2. Rule construction: after generating all frequent itemsets, rules whose confidence is no less than the minimum confidence threshold are constructed from these frequent itemsets.
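The second step can be sketched as follows; step 1 is assumed already done, and the itemset supports below are invented. The sketch enumerates each rule A ⇒ X − A from every frequent itemset X and keeps those meeting the confidence threshold:

```python
from itertools import combinations

# Step 1 (frequent itemset generation) is assumed done; supports are invented.
supports = {
    frozenset({"Diaper"}): 0.04,
    frozenset({"Beer"}): 0.05,
    frozenset({"Diaper", "Beer"}): 0.02,
}

def rules(supports, min_conf):
    """Step 2: keep each rule A => X - A whose confidence meets min_conf."""
    out = []
    for itemset, sup in supports.items():
        for r in range(1, len(itemset)):  # nonempty proper subsets only
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = sup / supports[lhs]
                if conf >= min_conf:
                    out.append((set(lhs), set(itemset - lhs), conf))
    return out

for lhs, rhs, conf in rules(supports, min_conf=0.5):
    print(lhs, "=>", rhs, f"conf={conf:.0%}")  # {'Diaper'} => {'Beer'} conf=50%
```

Note that Beer ⇒ Diaper is discarded: its confidence is 0.02/0.05 = 40%, below the threshold.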
The most popular and influential association mining algorithm is Apriori [2], which uses the a priori knowledge of frequent k-itemsets to generate candidate (k+1)-itemsets. When the maximum length of the frequent itemsets is l, Apriori needs l passes over the database. Since the Apriori algorithm spends much time generating the candidate itemsets and counting the support of each itemset, many variant algorithms have been proposed to improve the efficiency of the mining process.

2.3.2 Multi-dimensional Association Rules

The concept of multi-dimensional association rules was first proposed by H. Zhu [22] to describe associations between data values in a data warehouse, where the schema is composed of multiple dimensions and each dimension may contain many attributes. Following the work in [22], we can divide multi-dimensional association rules into three types:
1. Inter-dimensional association rule: an association among a set of dimensions. For example, suppose an OLAP cube is composed of three dimensions, Product, Supplier, and Customer, whose data are listed in Table 2.1. An inter-dimensional association rule is:
Supplier ("Hong Kong"), Product ("Sport Wear") ⇒ Customer ("John")
2. Intra-dimensional association rule: an association among items coming from one dimension. From Table 2.1, a possible intra-dimensional association rule is:
Product ("Sport Wear") ⇒ Product ("Tents")
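The level-wise search of Apriori can be sketched minimally as below. This is only an illustration of the join-and-prune idea, not the mining algorithm used in OMARS; minsup is given here as an absolute transaction count, and the toy database is invented:

```python
from itertools import combinations

def apriori(db, minsup):
    """Level-wise Apriori sketch; minsup is an absolute transaction count."""
    def keep_frequent(cands):
        return {c for c in cands if sum(c <= t for t in db) >= minsup}
    level = keep_frequent(frozenset({i}) for t in db for i in t)
    frequent, k = set(level), 1
    while level:
        # Join frequent k-itemsets into candidate (k+1)-itemsets, then prune
        # candidates having any infrequent k-subset (the Apriori property).
        cands = {a | b for a in level for b in level if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(s) in level for s in combinations(c, k))}
        level = keep_frequent(cands)
        frequent |= level
        k += 1
    return frequent

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
found = apriori(db, minsup=3)
print(len(found))  # 6: three frequent 1-itemsets and three frequent 2-itemsets
```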
3. Hybrid association rule: an association among a set of dimensions in which some items of the rule come from one dimension. It can be regarded as a combination of inter-dimensional and intra-dimensional associations. According to Table 2.1, a hybrid association rule is:
Product ("Sport Wear"), Supplier ("Hong Kong") ⇒ Product ("Tents")

Table 2.1. A relational representation of OLAP cube
Supplier   Product          Customer   Count
HongKong   Sport Wear       John       30
HongKong   Sport Wear       Mary       10
HongKong   Water Purifier   John       30
Mexico     Alert Devices    Peter      20
Mexico     Carry Bags       Peter      85
Mexico     Carry Bags       Bill       25
Mexico     Tents            Sue        25
Mexico     Tents            Mary       20
Seattle    Carry Bags       John       100
Seattle    Sport Wear       Peter      20
Seattle    Sport Wear       John       40
Seattle    Water Purifier   Bill       25
Tokyo      Carry Bags       Sue        10
Tokyo      Sport Wear       Bill       20
Tokyo      Tents            Sue        20
Tokyo      Alert Devices    John       20
2.4 Related Work

2.4.1 Data Cube

The concept of the data cube was first proposed by Gray et al. [6]; it allows analysts to view the data stored in a data warehouse from various aspects and to employ multidimensional analysis. Each cell in a data cube represents a measured value. For example, consider a sales data cube with three dimensions, Product, Supplier, and Customer, and one measure, Sales_total. This cube is depicted in Figure 2.6 and can be expressed as a SQL query as follows:

Select Product, Supplier, Customer, SUM(Sales) AS Total_Sales
From Sales_Fact
Group by Product, Supplier, Customer;
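The GROUP BY above is just a keyed sum. A small sketch over an invented Sales_Fact relation also shows how coarser cuboids of the same cube arise from dropping dimensions:

```python
from collections import defaultdict

# Invented rows of a Sales_Fact relation: (product, supplier, customer, sales).
facts = [("p1", "s1", "c1", 30), ("p1", "s1", "c2", 10),
         ("p2", "s1", "c1", 30), ("p2", "s2", "c1", 20)]

def group_by_sum(rows, dims):
    """SUM(sales) grouped by the dimension positions listed in `dims`."""
    out = defaultdict(int)
    for row in rows:
        out[tuple(row[d] for d in dims)] += row[3]
    return dict(out)

# The (Product, Supplier, Customer) cuboid of the SQL query above:
full = group_by_sum(facts, dims=(0, 1, 2))
# Coarser cuboids of the same data cube come from dropping dimensions:
print(group_by_sum(facts, dims=(0,)))  # {('p1',): 40, ('p2',): 50}
```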
Figure 2.6. An example of data cube

2.4.2 Cube Selection Problem

To accelerate query processing, it is important to select the most suitable cubes to materialize. In general, there are three options:
1. Materialize all data cubes: this method yields the lowest query time but needs the largest storage space, because all the cubes have to be materialized.
2. Materialize nothing: this method saves the most storage space but incurs the largest query time, because no cube is materialized.
3. Materialize a fraction of the data cubes: this method selects a subset of the data cubes to materialize. However, selecting the most suitable cubes to materialize under a space constraint is difficult; indeed, it has been proved to be an NP-hard problem [9].

According to the above discussion, the best option would be to materialize all data cubes, but the space limit of the data warehouse hinders us from doing so. On the other hand, if we materialize nothing, queries cost too much time. Therefore, we should try to select the most suitable cubes to materialize even though the problem is NP-hard. In the literature, there has been a substantial contribution to this problem, which can be classified into three main categories:
1. Heuristic methods: this category is mainly based on the greedy paradigm. Harinarayan et al. [9] were the first to consider the problem of materialized view selection for supporting multidimensional analysis in OLAP. They proposed a lattice model and provided a greedy algorithm to solve the problem. Gupta et al. [7] further extended their work to include index selection. Ezeife [5] also considered the same problem but proposed a uniform approach using a more detailed cost model. Shukla et al. [17] proposed a modified greedy algorithm that selects cubes according to cube size only. Their algorithm was shown to have the same quality as Harinarayan's greedy method while being more efficient.
2. Exhaustive methods: the work in [19] assumed that all queries should be answered solely by the materialized views, with or without rewriting the users' queries. The authors modeled the problem as a state-space optimization problem and provided exhaustive and heuristic algorithms without concern for the storage constraint. Soutyrina and Fotouhi [18] proposed a dynamic programming algorithm to solve the problem, which can yield the optimal set of cubes.
3. Genetic methods: some work has been devoted to applying genetic algorithms to the view selection problem [10, 20, 21]. Following the AND-OR view graph used in [7], Horng et al. [10] proposed a genetic algorithm to select the appropriate set of views to minimize the query cost and view maintenance cost. A similar genetic algorithm with a different repairing scheme is proposed in [13], which uses a greedy repair method to correct infeasible solutions instead of using a penalty function to punish their fitness. Research has shown that the repair scheme deals with infeasible solutions better than a penalty function does
[16]. Rather than optimizing the view selection from a given query processing plan, the work in [20, 21] focuses on finding an optimal set of processing plans for multiple queries. A solution in their genetic algorithm thus represents a set of processing plans for the given queries.
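To make the greedy paradigm of the first category concrete, here is a sketch in the spirit of Harinarayan et al.'s benefit-per-unit-space heuristic. The tiny lattice, the cube sizes, and the linear-scan cost notion are all invented for illustration and simplify the original formulation; `ancestors[c]` lists the cubes detailed enough to answer queries on c:

```python
def query_cost(cube, selected, sizes, ancestors):
    """Cost of answering a query on `cube` from the cheapest usable cube."""
    return min(sizes[c] for c in selected if c in ancestors[cube] or c == cube)

def greedy_select(sizes, ancestors, space):
    root = max(sizes, key=sizes.get)          # base cuboid: always materialized
    selected, used = {root}, sizes[root]
    while True:
        best, best_ratio = None, 0.0
        for cand in sizes:
            if cand in selected or used + sizes[cand] > space:
                continue
            gain = sum(query_cost(q, selected, sizes, ancestors)
                       - query_cost(q, selected | {cand}, sizes, ancestors)
                       for q in sizes)
            if gain / sizes[cand] > best_ratio:
                best, best_ratio = cand, gain / sizes[cand]
        if best is None:
            return selected
        selected.add(best)
        used += sizes[best]

# A four-cuboid lattice over (product, supplier, customer) groupings:
sizes = {"psc": 6, "ps": 4, "p": 2, "": 1}
ancestors = {"psc": set(), "ps": {"psc"}, "p": {"ps", "psc"},
             "": {"p", "ps", "psc"}}
print(sorted(greedy_select(sizes, ancestors, space=10)))  # ['', 'p', 'psc']
```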
Chapter 3
The OMARS Framework

In this chapter, we give a brief review of the OMARS framework, because our research deals with the problem of selecting the most suitable OLAM cubes to materialize in this system. The OMARS framework, as illustrated in Figure 3.1, integrates the data warehouse, on-line analytical processing, and the OLAM cube; its objective is to provide an efficient and convenient platform that allows users to perform OLAP-like association explorations. Through the OMARS system, users can pose multidimensional association mining queries, interactively change the dimensions that comprise the associations, and refine constraints such as minimum support and minimum confidence. The functionality of each component is described in the following sections.

Figure 3.1. The OMARS framework [15].
3.1 OLAM Cube and Auxiliary Cube

The OLAM cube is a new concept proposed by Lin et al. [15]; it stores the frequent itemsets whose supports are greater than or equal to a preset minimum support, denoted prims. In this regard, the OLAM cube can be regarded as an extension of the iceberg cube. The main difference is that the iceberg cube stores frequent itemsets derived from inter-dimensional associations, while the OLAM cube is feasible for all three kinds of associations. When the minsup of a user's query is greater than or equal to prims, the OLAM cube accelerates the mining of association rules, because it already stores the frequent itemsets with supports greater than or equal to prims.

Although the OLAM cube can be used to generate association rules efficiently when minsup is at least prims, it fails when minsup is lower than prims. To alleviate this problem, the OMARS system embraces another type of data cube, called the auxiliary cube, which stores the infrequent itemsets of length Kα, where Kα denotes the cutting level employed by the mining algorithm CBWon used in OMARS.

3.2 Cube Manager

This component is responsible for three different tasks:
1. Cube selection: how to select the most proper cubes to materialize, in order to minimize the query cost and/or maintenance cost under the constraint of limited storage space.
2. Cube computation: efficiently generating the set of materialized cubes produced by the cube selection
module.
3. Cube maintenance: how to maintain the materialized cubes when the data in the data warehouse are updated.

Our research in this thesis deals with the implementation of the cube selection task of the Cube Manager. We will discuss this in the next chapter.

3.3 OLAM Mediator and OLAM Engine

The OLAM Engine is the interface between the OMARS system and its users. It accepts users' queries and invokes the appropriate algorithm to mine multidimensional association rules. When the OLAM Engine receives a user's query, it analyzes the query and forwards the relevant information to the OLAM Mediator, which then looks for the most relevant cube and returns the result to the OLAM Engine. Here, the most relevant cube is the materialized OLAM cube that can answer the query at the smallest cost. There are two possible search results from the OLAM Mediator, each handled in a different way:
1. The OLAM Mediator finds the most relevant cube: in this case, the Mediator further compares the minsup of the user's query with prims and handles the two cases as follows:
i. minsup ≥ prims: the discovered OLAM cube is capable of answering the query. Return this cube to the OLAM Engine.
ii. minsup < prims: the discovered OLAM cube cannot answer the query without the aid of the auxiliary cube. Return the OLAM cube and its accompanying auxiliary cube to the OLAM Engine.
2. The OLAM Mediator cannot find such a cube: in this case, the Mediator has to
search the OLAP cube repository to determine whether there is an OLAP cube whose data can be used to answer the query. If so, return the discovered OLAP cube to the OLAM Engine; otherwise, notify the OLAM Engine to execute the mining procedure from the data warehouse afresh. We will discuss the above cases in more detail, together with the cost evaluation of each case, in Chapter 5.
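The Mediator's dispatch described above can be sketched as follows. The dict-based repositories, key shapes, and return tags are invented stand-ins for illustration, not the OMARS interfaces:

```python
# A sketch of the OLAM Mediator's case analysis; all names are stand-ins.

def mediate(query, olam_cubes, olap_cubes, prims):
    """Decide which materialized data can serve an association query."""
    key = (query["tG"], query["tM"])
    cube = olam_cubes.get(key)
    if cube is not None:
        if query["minsup"] >= prims:
            return ("olam", cube)       # case 1.i: the OLAM cube suffices
        return ("olam+aux", cube)       # case 1.ii: auxiliary cube also needed
    olap = olap_cubes.get(key)
    if olap is not None:
        return ("olap", olap)           # case 2: mine from an OLAP cube
    return ("warehouse", None)          # case 2: mine from the warehouse afresh

q = {"tG": ("City",), "tM": ("Category",), "minsup": 0.4}
print(mediate(q, {(("City",), ("Category",)): "MCube"}, {}, prims=0.3))
# ('olam', 'MCube')
```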
Chapter 4
Problem Formulation

In this chapter, we first elaborate the correspondence between OLAM queries and OLAM cubes, and describe the concept of the OLAM lattice. After this, we define the problem of OLAM cube selection.

4.1 OLAM Cube and OLAM Query

As described in Chapter 3, the OLAM cube is used to store frequent itemsets, aiming at accelerating the process of mining association rules. To clarify the structure of the OLAM cube and its relationship with multidimensional associations, we first introduce a four-tuple mining meta-pattern to specify the form of a multidimensional association query. The definition is as follows:

Definition 4.1. Suppose a star schema S contains a fact table and m dimension tables {D1, D2, ..., Dm}. Let T be a joined table from S composed of attributes a1, a2, ..., ar, such that for all ai, aj ∈ Attr(Dk) there is no hierarchical relation between ai and aj, 1 ≤ i, j ≤ r, 1 ≤ k ≤ m. Here Attr(Dk) denotes the attribute set of dimension table Dk. A meta-pattern of multidimensional associations from T is defined as follows:
MP: <tG, tM, ms, mc>,

where ms denotes the minimum support, mc the minimum confidence, tG the group of transaction attributes, and tM the group of item attributes, with tG, tM ⊆ {a1, a2, ..., ar} and tG ∩ tM = ∅.

The above meta-pattern specification of multidimensional association queries can express the three different kinds of multidimensional association rules defined in [22]: intra-association, inter-association, and hybrid association. For example, consider a joined table T involving three dimensions from the star schema in Figure 2.3. The content of T is shown in Table 4.1. If the item attribute set tM consists of only one attribute, then the meta-pattern corresponds to an intra-association.

Table 4.1. A joined table T from the star schema

Tid  City     Education    Date  Month  Product_ID  Category
1    Taipei   Bachelor     7/12  July   1           A
2    Taipei   High school  7/12  July   2           A
3    N.Y.     Master       7/18  July   1           A
4    Toronto  Master       8/2   Aug.   3           B
5    Seattle  Master       8/3   Aug.   4           B
6    N.Y.     High school  8/2   Aug.   1           A
7    Toronto  High school  7/4   July   1           A
8    Seattle  Bachelor     7/18  July   5           C
9    Taipei   Bachelor     8/2   Aug.   2           A
10   N.Y.     Bachelor     9/1   Sep.   3           B

For instance, let tG = {City} and tM = {Category}. We may have the following intra-association rule:
(Category, "A") ⇒ (Category, "B") (sup = 40%, conf = 80%)

Note that to facilitate this mining task, the table T has to be, implicitly or explicitly, transformed into a transaction table as follows:

City     Category
Taipei   A
N.Y.     A, B
Toronto  A, B
Seattle  B, C

On the other hand, if |tM| ≥ 2, then the resulting associations will be inter-associations or hybrid associations. For example, let tG = ∅ and tM = {Education, Month}. We have an inter-association:

(Education, "Master") ⇒ (Month, "July") (sup = 40%, conf = 80%)

As with intra-association, the table T has to be transformed into the following form:

Tid  Education    Month
1    Bachelor     July
2    High school  July
3    Master       July
4    Master       Aug.
5    Master       Aug.
6    High school  Aug.
7    High school  July
8    Bachelor     July
9    Bachelor     Aug.
10   Bachelor     Sep.

Note that in this case, the transaction attribute is the same as in the original table T. But if tG = {City}, we will have a hybrid association:
(Education, "Master"), (Month, "July") ⇒ (Month, "Aug.") (sup = 40%, conf = 80%)

For this case, the transformed table will be:

City     Education                      Month
Taipei   Bachelor, High school          July, Aug.
N.Y.     Master, High school, Bachelor  July, Aug., Sep.
Toronto  Master, High school            Aug., July
Seattle  Master, Bachelor               Aug., July

Having explained the mining patterns, we now clarify the structure of the OLAM cube.

Definition 4.2. Given a meta-pattern MP with transaction attribute set tG and item attribute set tM, and a presetting minsup, prims, the corresponding OLAM cube, MCube(tG, tM), is the set of the frequent itemsets with supports no less than prims.

The following examples illustrate the corresponding OLAM cubes for the different kinds of multidimensional association rules.

Example 4.1. An intra-dimensional OLAM cube: Let tG = {City}, tM = {Category}, and prims = 2. From Table 4.1, the resulting OLAM cube is shown in Table 4.2.

Table 4.2. An example of an intra OLAM cube expressed as a table

Category  Support
A         3
B         3
A, B      2

Example 4.2. An inter-dimensional OLAM cube: Let tG = ∅, tM = {Education, Month}, and prims = 2. From Table 4.1, the resulting OLAM cube is shown in Table
4.3.

Table 4.3. An example inter-dimensional OLAM cube expressed as a table

Education    Month  Support
Bachelor     -      4
High school  -      3
Master       -      3
-            July   5
-            Aug.   4
Bachelor     July   2
High school  July   2
Master       Aug.   2

Example 4.3. A hybrid-dimensional OLAM cube: Let tG = {City}, tM = {Education, Month}, and prims = 3. From Table 4.1, the resulting OLAM cube is shown in Table 4.4.

Table 4.4. An example hybrid-dimensional OLAM cube expressed as a table

Education    Month       Support
Bachelor     -           3
High school  -           3
Master       -           3
-            July        4
-            Aug.        4
-            July, Aug.  4
Bachelor     July        3
Bachelor     Aug.        3
High school  July        3
High school  Aug.        3
Master       July        3
Master       Aug.        3
Bachelor     July, Aug.  3
High school  July, Aug.  3
Master       July, Aug.  3
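To make Definition 4.2 and Examples 4.1-4.3 concrete, the following sketch first groups a joined table by tG into transactions (as in the transformed tables above), then enumerates every itemset whose support reaches prims. It is a brute-force illustration only; the function names are ours, and a real system would use an Apriori-style algorithm instead of full enumeration:

```python
from collections import defaultdict
from itertools import combinations

def to_transactions(rows, tG, tM):
    """Group rows (dicts) by the tG attributes; each transaction is the set
    of (attribute, value) items over tM seen in its group. An empty tG makes
    every row its own transaction."""
    if not tG:
        return [frozenset((a, r[a]) for a in tM) for r in rows]
    groups = defaultdict(set)
    for r in rows:
        key = tuple(r[a] for a in sorted(tG))
        groups[key] |= {(a, r[a]) for a in tM}
    return [frozenset(t) for t in groups.values()]

def olam_cube(transactions, prims):
    """Return {itemset: support} for every itemset with support >= prims."""
    items = sorted({i for t in transactions for i in t})
    cube = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = frozenset(cand)
            sup = sum(1 for t in transactions if s <= t)
            if sup >= prims:
                cube[s] = sup
    return cube
```

Running this on the City and Category columns of Table 4.1 with prims = 2 reproduces Table 4.2: {A} and {B} with support 3, and {A, B} with support 2.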
4.2 OLAM Lattice

In accordance with the definition of the OLAM cube, we can generate all possible OLAM cubes from the star schema, thereby forming an OLAM lattice. In order to provide hierarchical navigation and multidimensional exploration, the OMARS system [15] models the OLAM lattice as a three-layer structure. The first layer expresses the combinations of all dimensions. The second layer further exploits the inter-attribute combinations for each dimensional combination in the first layer. The third layer exploits all OLAM cubes corresponding to the meta-patterns derived from each subcube in the second layer. Note that the real OLAM cubes are stored in the third layer.

For example, consider the star schema illustrated in Figure 2.3. The first-layer lattice shown in Figure 4.1 is composed of eight possible dimensional combinations. After constructing the first-layer lattice, we choose the node composed of the "customer" and "time" dimensions, and extend it to form the second-layer lattice shown in Figure 4.2. Each node of the second-layer lattice is constructed by attaching attributes chosen from the selected dimensions. Finally, we extend the cube <(city, education), (date)> to form the third-layer lattice shown in Figure 4.3. It can be observed that there is one OLAM cube corresponding to an inter-association, (city, education, date); three OLAM cubes corresponding to hybrid associations, (date*, city, education), (education*, city, date), and (city*, education, date); and three cubes corresponding to intra-associations, (education*, date*, city), (city*, date*, education), and (city*, education*, date). Note that (city*, education*, date*) is shown only to complete the lattice structure; it is useless and will not be materialized.
Figure 4.1. The 1st layer OLAM lattice for the example star schema in Figure 2.3

Figure 4.2. The 2nd layer lattice derived from <customer, time, -> in the 1st layer
Figure 4.3. The 3rd layer lattice derived from the subcube <(city, education), date> in the 2nd layer

Because the real OLAM cubes are stored in the third-layer lattice, we can mine multidimensional association rules efficiently by materializing these OLAM cubes. From the three-layer lattice, we discover an attribute dependency, described as follows:

Proposition 4.1. Consider two OLAM cubes, MCube(tG1, tM1) and MCube(tG2, tM2). If tG1 = tG2 and tM2 ⊆ tM1, then every itemset in MCube(tG2, tM2) must be a subset of an itemset in MCube(tG1, tM1), and the two itemsets have the same support value.
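The dependency condition in Proposition 4.1 is purely syntactic, so it can be checked without touching any cube contents. A one-function sketch (our naming, with cubes represented as (tG, tM) pairs of frozensets):

```python
def depends_on(cube2, cube1):
    """True iff MCube(tG2, tM2) <= MCube(tG1, tM1): same transaction
    attributes, and tM2 a subset of tM1, so every query answerable by the
    second cube is also answerable by the first."""
    (tG1, tM1), (tG2, tM2) = cube1, cube2
    return tG1 == tG2 and tM2 <= tM1
```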
Example 4.4. Consider the table T in Table 4.1. Let MCube(tG1, tM1) be the cube illustrated in Table 4.4 and MCube(tG2, tM2) the one illustrated in Table 4.5. Hence tG1 = tG2 = {City}, tM1 = {Education, Month}, tM2 = {Education}, and prims = 3. It can be verified that every frequent itemset stored in MCube(tG2, tM2) is a subset of a frequent itemset in MCube(tG1, tM1), and both itemsets have the same support value.

Table 4.5. An OLAM cube

Education    Support
Bachelor     3
High school  3
Master       3

According to Proposition 4.1, we know there is a dependency between OLAM cubes in the third layer, which is formalized below.

Definition 4.3. Consider two OLAM cubes, MCube(tG1, tM1) and MCube(tG2, tM2). We say that MCube(tG2, tM2) is dependent upon MCube(tG1, tM1) if tG1 = tG2 and tM2 ⊆ tM1, denoted as MCube(tG2, tM2) ≤ MCube(tG1, tM1).

One important consequence of Definition 4.3 is that if MCube(tG2, tM2) ≤ MCube(tG1, tM1), then all multidimensional queries that can be answered via MCube(tG2, tM2) can also be answered via MCube(tG1, tM1).

Furthermore, it should be noticed that not all of the OLAM cubes derived in the lattice have to be materialized and stored, because the concept hierarchies defined
over the attributes in the star schema provide the possibility to prune some redundant cubes. Consider an OLAM cube MCube(tG, tM). We observe that there are two different types of redundancy.

Proposition 4.2. Schema redundancy: Let ai, aj ∈ tG. If ai and aj are in the same dimension and aj is an ancestor of ai, then MCube(tG, tM) is redundant with respect to MCube(tG − {aj}, tM).

Example 4.5. Consider the joined table in Table 4.1. Let tM = {Category}. The table resulting from grouping by "Date" and "Month" as transaction attributes is shown in Table 4.6. Note that this table has the same transactions as the one obtained by grouping by "Date" alone, as shown in Table 4.7. Thus, the resulting cube MCube({Date, Month}, {Category}) is the same as MCube({Date}, {Category}).

Table 4.6. The table resulting from grouping {Date, Month} as transaction attributes for Table 4.1

Date  Month  Category
7/4   July   A
7/12  July   A
7/18  July   A, C
8/2   Aug.   A, B
8/3   Aug.   B
9/1   Sep.   B
Table 4.7. The table resulting from grouping {Date} as the transaction attribute for Table 4.1

Date  Category
7/4   A
7/12  A
7/18  A, C
8/2   A, B
8/3   B
9/1   B

Proposition 4.3. Value redundancy: Let ai, aj ∈ tM. If ai and aj are in the same dimension and aj is an ancestor of ai, then MCube(tG, tM) is a cube with value redundancy.

Example 4.6. Consider the joined table in Table 4.1. Let tG = {City}, tM = {Date, Month}, and prims = 2. The resulting OLAM cube is shown in Table 4.8. One can observe that some tuples in this table, such as {7/18, July} and {8/2, Aug.}, are redundant patterns: the month value is implied by the date, so the pair has the same support as the date alone. Therefore, this cube exhibits value redundancy. Note that when value redundancy holds, we must prune the redundant patterns during the generation of frequent itemsets.
Table 4.8. The resulting OLAM cube MCube({City}, {Date, Month})

Date  Month       Support
7/18  -           2
8/2   -           3
-     July        4
-     Aug.        4
7/18  July        2
7/18  Aug.        2
8/2   July        3
8/2   Aug.        3
-     July, Aug.  4
7/18  July, Aug.  2
8/2   July, Aug.  3

In addition to the above observations, we note that an OLAM cube is useless if it satisfies the following property.

Proposition 4.4. Useless property: Let ai ∈ tG and tM = {aj}. If ai and aj are in the same dimension and aj is an ancestor of ai, then MCube(tG, tM) is a useless cube.

Example 4.7. Let tG = {City, Date} and tM = {Month}. The table resulting from Table 4.1 by grouping {City, Date} as transactions is shown in Table 4.9. One can observe that the cardinality of every transaction is 1. Therefore, we cannot find any association rule from this table.
Table 4.9. The table resulting from grouping {City, Date} as transaction attributes for Table 4.1

City     Date  Month
Toronto  7/4   July
Taipei   7/12  July
Taipei   8/2   Aug.
N.Y.     7/18  July
N.Y.     8/2   Aug.
Toronto  8/2   Aug.
Seattle  8/3   Aug.
N.Y.     9/1   Sep.

4.3 OLAM Cube Selection

We now proceed to give a formal definition of the OLAM cube selection problem. To this end, we introduce the symbols shown in Table 4.10. Assume that an OLAM lattice L contains n OLAM data cubes D = {d1, d2, ..., dn}, the set of user queries is Q = {q1, q2, ..., qm}, the set of query frequencies is F = {fq1, fq2, ..., fqm}, and the space constraint is S. The OLAM cube selection problem is denoted as a five-tuple θ = {L, D, Q, F, S}. A solution to θ is a subset M of D that minimizes the cost function

min Σ_{i=1}^{m} fqi × E(qi, M)

subject to the constraint Σ_{d∈M} |d| ≤ S.
Table 4.10. The symbol table

Symbol    Definition
L         lattice
D         set of data cubes
dn        the nth data cube
Q         set of user queries
qm        the mth user query
F         set of user query frequencies
fqi       frequency of the ith query
S         space constraint
M         set of materialized cubes
E(qi, M)  the total time to answer the ith query using the materialized views
Chapter 5
Evaluation of OLAM Query Cost

5.1 Query Evaluation Flow

As stated previously, the primary task of the OLAM Engine is to generate association rules according to users' queries. After receiving a query, the OLAM Engine analyzes it, transfers the necessary information to the OLAM Mediator, and then waits for the best matching cube from the OLAM Mediator. When the OLAM Mediator receives the query information from the OLAM Engine, it looks for the best matching cube. First, it searches for the required OLAM cube. If one is found, it further checks whether minsup ≥ prims; if so, it returns the found OLAM cube to the OLAM Engine; otherwise, it returns the corresponding auxiliary cube of the found OLAM cube and notifies the OLAM Engine to perform association mining from the data warehouse with the aid of this auxiliary cube. On the other hand, if the OLAM Mediator cannot find any qualified OLAM cube to answer the user's query, it notifies the OLAM Engine to perform association mining from the data warehouse afresh. The above procedure employed by the OLAM Mediator is depicted in Figure 5.1.
Figure 5.1. The flow diagram of an OLAM query

It is worth mentioning that, for simplicity, we do not consider OLAP cubes in this study, although the OMARS system did take this kind of data cube into account in association mining. In accordance with the work flow of the OLAM Mediator and OLAM Engine, our paradigm for evaluating the OLAM query cost is shown below:
Procedure Evacost_OLAMQ(q)
begin
  Let q = <tG, tM, minsup>;
  found = OLAMQ_search(q, CQ);
  if found = TRUE then
    if prims ≤ minsup then
      cost = the cost for evaluating query q using OLAM cube CQ.MCube;  /* case 1 */
    else
      cost = the cost for evaluating query q using CQ.MCube, auxiliary cube CQ.XCube, and the data warehouse;  /* case 2 */
    end if
  else
    cost = the cost for evaluating query q using the data warehouse;  /* case 3 */
  end if
  return cost;
end

Figure 5.2. The procedure to compute the cost of a user's query

In summary, there are three different cases to be dealt with:
- Case 1: evaluating the cost via the qualified OLAM cube.
- Case 2: evaluating the cost via the OLAM cube, the auxiliary cube, and the data warehouse.
- Case 3: evaluating the cost via the data warehouse.

The cost evaluation for each case will be elaborated in the following sections. We end this section with the description of OLAMQ_search.
Procedure OLAMQ_search(q, CQ)
begin
  found = FALSE;
  if MCube(q.tG, q.tM) is materialized then
    CQ.MCube = MCube(q.tG, q.tM);
    CQ.XCube = XCube(q.tG, q.tM);
    found = TRUE;
    return found;
  end if
  CurQ = φ;
  for each MCube in the OLAM lattice do
    if MCube is materialized and MCube.tG = q.tG and MCube.tM ⊇ q.tM
       and (MCube.tM ⊆ CurQ.tM or CurQ = φ) then
      CurQ = MCube;
  end for
  if CurQ ≠ φ then
    found = TRUE;
    CQ.MCube = CurQ;
    CQ.XCube = XCube(q.tG, CurQ.tM);
  end if
  return found;
end

Figure 5.3. Procedure OLAMQ_search

Example 5.1. Suppose the OMARS system stores the following three materialized OLAM cubes: MCube(tG1, tM1), where tG1 = {City} and tM1 = {Education, Date}; MCube(tG2, tM2), where tG2 = {City} and tM2 = {Education, Date, Category}; and MCube(
tG3, tM3), where tG3 = {Date} and tM3 = {City}, with prims = 3. We have three user queries q1, q2, q3, where
1. q1.tG = {City}, q1.tM = {Education, Date}, and q1.ms = 4;
2. q2.tG = {City}, q2.tM = {Education, Date, Category}, and q2.ms = 2;
3. q3.tG = {Date}, q3.tM = {City, Education}, and q3.ms = 3.

These three queries give rise to the following three conditions:
1. When the user's query is q1, the condition corresponds to Case 1 described above. Because the corresponding OLAM cube can be found in the OMARS system, and the minsup of the user's query is greater than prims, we can use MCube(tG1, tM1) to answer the query immediately.
2. When the user's query is q2, the condition corresponds to Case 2 described above. Because the minsup of the user's query is lower than prims, we need to utilize the corresponding auxiliary cube of the found OLAM cube MCube(tG2, tM2) and the data warehouse to answer query q2.
3. When the user's query is q3, the condition corresponds to Case 3 described above. Because no matching OLAM cube can be found in the OMARS system, we must utilize the data warehouse to answer query q3.

5.2 Cost Evaluation for Case 1

In this case, the OLAM cube returned by the OLAM Mediator can be utilized to answer the user's query. The CBWon algorithm [15] is employed to mine association rules. For convenience of analysis, we replicate the CBWon algorithm
in Figure 5.4. Because the qualified frequent itemsets have been stored in the found OLAM cube, and minsup ≥ prims, there is no need to generate the frequent itemsets via an Apriori-like algorithm. All we have to do is scan the frequent itemsets in the OLAM cube and perform the association_gen procedure in Figure 5.7 to generate the qualified association rules.

Algorithm CBWon
Input: relevant cube MCube(tG, tM), minsup and prims;
Output: the set of frequent itemsets F;
1  if minsup < prims then
2    AF = {X | sup(X) ≥ minsup, X ∈ auxiliary cube} ∪ {Y | Y ∈ MCube(tG, tM) and |Y| = Kα};
3    DF = Dwnsearchon(T, AF, Kα, minsup);
4    UF = Upsearch(AF, minsup);
5    F = DF ∪ UF;
6  else
7    F = {X | X ∈ MCube(tG, tM) and sup(X) ≥ minsup};
8  end if
9  return F;

Figure 5.4. Algorithm CBWon
Procedure Dwnsearchon
1  for i = 1 to |D| do
2    scan the i-th transaction ti;
3    delete those items in ti that are not in AF;
4    for each subset X of ti with 2 ≤ |X| ≤ Kα do
5      sup(X)++;
6    end for
7  DF = {X | sup(X) ≥ minsup} ∪ AF;

Figure 5.5. Procedure Dwnsearchon

Procedure Upsearch
1  transform the horizontal data format T into t_id lists;
2  F_Kα = the frequent Kα-itemsets;
3  k = Kα, Fk = F_Kα;
4  repeat
5    k++;
6    Ck = new candidate k-itemsets generated from Fk-1;
7    for each X ∈ Ck do
8      perform bit-vector intersection on X;
9      count the support of X;
10   end for
11   Fk = {X | sup(X) ≥ minsup, X ∈ Ck};
12   UF = UF ∪ Fk;
13 until Fk = ∅

Figure 5.6. Procedure Upsearch

Procedure association_gen(F: set of all frequent itemsets; min_conf: minimum confidence threshold)
begin
  for each l ∈ F do
    generate P(l), the set of nonempty proper subsets of l;
    for each s ∈ P(l) do
      if support_count(l) / support_count(s) ≥ min_conf then
        output s ⇒ l − s;
end

Figure 5.7. Procedure association_gen

The cost thus can be divided into two parts:
1. Frequent itemset discovery: This involves searching the frequent itemsets stored in the OLAM cube against the minsup of the user's query, which costs |DM|, where DM denotes the OLAM cube.
2. Rule generation: For each discovered frequent itemset, we construct all possible rules from it, compute the confidence, and keep those satisfying the minimum confidence. The key point for the complexity analysis thus lies in the number of candidate rules to be generated and inspected. Our first step in this direction is to consider the number of rules that can be generated from a frequent k-itemset and all of its subsets.

Lemma 1. The number of rules that can be constructed from a k-itemset is 2^k − 2.
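Lemma 1, and the closed form 3^k − 2^{k+1} + 1 derived for it in Lemma 2 that follows, can be spot-checked by brute-force enumeration. This throwaway sketch (our naming) simply counts antecedent choices:

```python
from itertools import combinations
from math import comb

def rules_from_itemset(k):
    """Count rules A => X - A over a fixed k-itemset X, where A is a
    nonempty proper subset of X (Lemma 1 predicts 2^k - 2)."""
    X = range(k)
    return sum(1 for i in range(1, k) for _ in combinations(X, i))

def rules_from_itemset_and_subsets(k):
    """Count rules obtainable from X and all of its subsets of size >= 2
    (Lemma 2 predicts 3^k - 2^(k+1) + 1)."""
    return sum(comb(k, size) * rules_from_itemset(size)
               for size in range(2, k + 1))
```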
Proof. Recall that each rule that can be constructed from an itemset X has the form A ⇒ X − A, for A ⊂ X and A ≠ ∅. Thus, the number of different A's determines the number of rules, which is

Σ_{i=1}^{k-1} C(k, i) = 2^k − 2.

Lemma 2. For a k-itemset X, the total number of rules that can be generated from X and its subsets is 3^k − 2^{k+1} + 1.

Proof. From Lemma 1, we can derive

Σ_{i=2}^{k} C(k, i) (2^i − 2)
= Σ_{i=2}^{k} C(k, i) 2^i − 2 Σ_{i=2}^{k} C(k, i)
= (3^k − 2k − 1) − 2 (2^k − k − 1)
= 3^k − 2^{k+1} + 1.

Now, if we knew the set of maximal frequent itemsets, we could complete the analysis. Unfortunately, the exact set is unobtainable without a priori knowledge of the user's specified minsup. We thus resort to an estimation that proceeds by taking prims in place of minsup. Then we apply sampling to obtain a random subset of the warehouse data, and we can either
1. compute the maximal frequent itemsets for each OLAM cube using any maximal pattern mining algorithm, or
2. apply the CBWoff algorithm to estimate Kα (the cutting level), compute the frequent itemsets with cardinality Kα, and regard these itemsets as the maximal
frequent itemsets.

Let MF denote the set of maximal patterns. If the first approach is adopted, the computation spent on rule generation will be

Σ_{X∈MF} (3^{|X|} − 2^{|X|+1} + 1),

or, if the second approach is used,

|F_Kα| × (3^{Kα} − 2^{Kα+1} + 1).

Here, for simplicity, we adopt the second approach. Finally, combining the cost of frequent itemset discovery and rule generation, and retaining the dominant 3^{Kα} term, we have

α × |DM| + |F_Kα| × 3^{Kα}.

5.3 Cost Evaluation for Case 2

In this case, algorithm CBWon in Figure 5.4 executes the "minsup < prims" branch of the "if" clause, which comprises three different steps.

1. Generate AF, i.e., F_Kα. This requires scanning the auxiliary cube and the OLAM cube. The cost is |DX| + |DM|, where DX denotes the auxiliary cube and DM denotes the OLAM cube.
2. Execute procedure Dwnsearchon illustrated in Figure 5.5. Note that this procedure presumes the availability of the corresponding joined table, and ignores the preprocessing step that generates it. To account for this and simplify the discussion, we assume this cost is w and the table is T.
As illustrated in Figure 5.5, the Dwnsearchon procedure needs to scan all the transactions in the database; the I/O cost is α × |T|. Next we estimate the cost of the most expensive step, counting itemset supports. Let l denote the average length of a transaction. This step costs

|T| × (C(l, 2) + C(l, 3) + ... + C(l, Kα)),

or |T| × Σ_{i=2}^{Kα} C(l, i) in brief. Finally, the total cost consumed by the Dwnsearchon procedure equals

α × |T| + |T| × Σ_{i=2}^{Kα} C(l, i).

3. Execute procedure Upsearch illustrated in Figure 5.6. To minimize the I/O cost and avoid combinatorial decomposition, the Upsearch procedure first transforms the transaction data into a vertical data format called transaction-id lists, then utilizes this structure to count the supports of itemsets. The cost lies in three main steps.
(1) Data transformation. This requires an α × |T| data scan.
(2) Candidate generation. The dominant operation is the itemset join. If the largest itemset cardinality is Kmax, this task consumes at most

Σ_{k=Kα+1}^{Kmax} C(|F_{k-1}|, 2).

(3) Counting candidate supports. For each k-itemset, counting involves k − 1 bit-vector intersections and one bit-vector accumulation. Summing this
cost over all candidate itemsets, we have

Σ_{i=Kα+1}^{Kmax} |Ci| × i × |T|.

Finally, the total cost for procedure Upsearch is

α × |T| + Σ_{i=Kα+1}^{Kmax} ( C(|F_{i-1}|, 2) + |Ci| × i × |T| ).

Combining all of the above analysis, we have

α × (|DX| + |DM| + 2|T|) + |T| × Σ_{i=2}^{Kα} C(l, i) + Σ_{i=Kα+1}^{Kmax} ( C(|F_{i-1}|, 2) + |Ci| × i × |T| ) + |F_Kα| × 3^{Kα}.

5.4 Cost Evaluation for Case 3

In this case, we must generate the table T according to the user's query, which costs |D| × log |D|. After this, the CBWoff algorithm shown in Figure 5.8 is performed. It can be observed that, except for step 1, the steps of CBWoff are quite similar to those of CBWon in Case 2. Since step 1 costs α × |T| + Kα × |T|, the total cost for this case is

|D| × log |D| + α × 3|T| + Kα × |T| + |T| × Σ_{i=2}^{Kα} C(l, i) + Σ_{i=Kα+1}^{Kmax} ( C(|F_{i-1}|, 2) + |Ci| × i × |T| ) + |F_Kα| × 3^{Kα}.
Algorithm CBWoff(T, prims)
Input: table T and prims;
Output: the set of frequent itemsets F;
1  scan T to compute Kα and generate all frequent 1-itemsets F1;
2  DF = Dwnsearch(T, Kα, F1, prims);
3  UF = Upsearch(DF, prims);
4  return F = DF ∪ UF;

Figure 5.8. Algorithm CBWoff

Procedure Dwnsearch
1  for i = 1 to |D| do
2    scan the i-th transaction ti;
3    delete the items in ti that are not in F1;
4    for each subset X of ti with 2 ≤ |X| ≤ Kα do
5      sup(X)++;
6    end for
7  store all X in the auxiliary cube for |X| = Kα and sup(X) < prims;
8  DF = {X | sup(X) ≥ prims};

Figure 5.9. Procedure Dwnsearch
To sum up, we list the cost functions for the three cases below:

Case 1: α × |DM| + |F_Kα| × 3^{Kα}.

Case 2: α × (|DX| + |DM| + 2|T|) + |T| × Σ_{i=2}^{Kα} C(l, i) + Σ_{i=Kα+1}^{Kmax} ( C(|F_{i-1}|, 2) + |Ci| × i × |T| ) + |F_Kα| × 3^{Kα}.

Case 3: |D| × log |D| + α × 3|T| + Kα × |T| + |T| × Σ_{i=2}^{Kα} C(l, i) + Σ_{i=Kα+1}^{Kmax} ( C(|F_{i-1}|, 2) + |Ci| × i × |T| ) + |F_Kα| × 3^{Kα}.
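As a quick sanity check, the Case 1 formula can be evaluated directly. Plugging in the parameters later used for cube d*ce in Example 6.1 of the next chapter (α = 1, Kα = 2, |F_Kα| = 8, |DM| = 20) gives 92, matching the term (8 × 3^2 + 1 × 20) that appears in that example's benefit computations. The function name is ours:

```python
def case1_cost(alpha, k_alpha, f_k, d_m):
    """Case 1 cost: alpha * |DM| + |F_Ka| * 3^Ka."""
    return alpha * d_m + f_k * 3 ** k_alpha
```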
Chapter 6
OLAM Cube Selection Methods

In this chapter, we describe three typical heuristic algorithms proposed for the OLAP cube selection problem, and elaborate how to combine the cost models of the last chapter with each method to select the most suitable OLAM cubes. The methods are the forward greedy selection (FGS) method proposed by Harinarayan et al. [9], the pick-by-size (PBS) selection method proposed by Shukla et al. [17], and the backward greedy selection (BGS) method proposed by Lin and Kuo [13].

6.1 Forward Greedy Selection Method (FGS)

The forward greedy selection method was proposed by Harinarayan et al. [9]. A greedy algorithm always chooses the locally optimal solution in each step under some constraint. For this purpose, we define a benefit function B(di, M) as follows:

B(di, M) = (1/|di|) Σ_{q∈Q} ( E(q, M) − E(q, M ∪ {di}) )    (6.1)
We use this benefit function to compute the benefit of all unselected OLAM cubes, and apply forward selection to materialize the most suitable OLAM cubes one by one, starting from the empty set, until no further cube can be added. The forward selection method is described below:

Algorithm 1. Forward Greedy Selection (FGS)
Step 0. Let M = φ.
Step 1. While Σ_{d∈M} |d| < S, repeat Step 2 to Step 5.
Step 2. According to equation (6.1), calculate the benefit of every unselected OLAM cube di, for 1 ≤ i ≤ n and di ∉ M.
Step 3. Select the OLAM cube with the maximal benefit according to the results of Step 2, and call it dj.
Step 4. M ← M ∪ {dj}.
Step 5. Go to Step 1.

Figure 6.1. Forward greedy selection method

Example 6.1. Suppose that we select three attributes, city c, education e, and date d, from the sales star schema illustrated in Figure 2.3. Figure 6.2 depicts all possible OLAM cubes formed with these three attributes as well as their dependencies, where all OLAM cubes with the same transaction attributes tG are packed into a metacube. The dotted lines between metacubes are for clarification only; they complete the lattice structure of metacubes in terms of tG. Note that according to Proposition 4.1, dependency exists only among OLAM cubes within the same metacube. For simplicity, let us consider how to select the most suitable OLAM cubes to materialize, under a space constraint, from the three OLAM cubes ced*, cd*, and ed*. The symbols
used in this example are shown in Table 6.1, and the required parameter settings are shown in Table 6.2. Besides, we assume that the base relation size is 64 and prims is 3. Table 6.3 shows the first two selection steps using FGS.

Table 6.1. The symbols used in the cost model

tG     the set of transaction attributes
tM     the set of mining attributes
α      I/O to computation ratio
Kα     the cardinality of the maximal frequent itemset
Kmax   the cardinality of the largest itemset
|Ci|   number of candidate i-itemsets
l      average length of each transaction
|Fi|   number of frequent i-itemsets
|DM|   size of the OLAM cube
|DX|   size of the auxiliary cube
f      query frequency of the OLAM cube
|T|    size of the table composed of the attributes tG ∪ tM
|D|    size of the base relation

Table 6.2. The required parameter settings

subcube  α  Kα  Kmax  |C3|  |C4|  l  |F2|  |F3|  |DM|  |DX|  |T|  minsup  f
d*ce     1  2   4     6     4     8  8     5     20    15    30   4       0.3
d*c      1  2   4     5     2     5  8     4     10    10    30   5       0.3
d*e      1  2   4     5     1     6  6     2     15    5     30   3       0.4
Figure 6.2. All possible OLAM cubes formed with city, education, and date

Table 6.3. The benefits of the OLAM cubes in the first two selection steps by FGS

First selection:
  d*ce (influences ced*, cd*, ed*):
    ((64×6 + 3×1×30 + 2×30 + 30×28 + 1058 + 8×3^2) − (8×3^2 + 1×20)) × (64 − 20) × (0.3 + 0.3 + 0.4) / 20 = 5306.4
  d*c (influences cd*):
    ((64×6 + 3×1×30 + 2×30 + 30×10 + 724 + 8×3^2) − (8×3^2 + 1×10)) × (64 − 10) × 0.3 / 10 = 2507.76
  d*e (influences ed*):
    ((64×6 + 3×1×30 + 2×30 + 30×15 + 586 + 6×3^2) − (6×3^2 + 1×15)) × (64 − 15) × 0.4 / 15 = 2031.87

Second selection (after d*ce is selected):
  d*c (influences cd*):
    ((8×3^2 + 1×20) − (8×3^2 + 1×10)) × (20 − 10) × 0.3 / 10 = 3
  d*e (influences ed*):
    ((8×3^2 + 1×20) − (6×3^2 + 1×15)) × (20 − 15) × 0.4 / 15 = 3.067
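The FGS procedure of Figure 6.1 is easy to state in code. This sketch (our naming) treats the per-cube benefit as a pluggable function standing in for equation (6.1):

```python
def forward_greedy_selection(cubes, sizes, benefit, space):
    """Greedily add the unselected cube with the largest benefit until no
    remaining cube fits within `space`.
    benefit(cube, M) plays the role of B(d_i, M) in equation (6.1)."""
    M, used = set(), 0
    while True:
        fits = [d for d in cubes if d not in M and used + sizes[d] <= space]
        if not fits:
            return M
        best = max(fits, key=lambda d: benefit(d, M))
        M.add(best)
        used += sizes[best]
```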
6.2 Backward Greedy Selection Method (BGS)

The concept of the backward greedy selection method proposed by Lin and Kuo [13] is similar to that of forward greedy selection. The difference is that all OLAM cubes are selected at the beginning, and the selection proceeds by removing, step by step, the OLAM cube with the lowest detriment value until the total size of all remaining OLAM cubes is smaller than the storage space. For this purpose, we define a detriment function P(di, M) as follows:

P(di, M) = (1/|di|) Σ_{q∈Q} ( E(q, M − {di}) − E(q, M) )    (6.2)

Compared with BGS, the forward greedy selection algorithm can quickly find a set of data cubes when the storage space is noticeably smaller than the total size of all data cubes. But obviously, if the total cube size is not far from the storage space, FGS will need more computation than BGS. The backward greedy selection algorithm is described below:
Algorithm 2. Backward Greedy Selection (BGS)
Step 0. Let M ← D.
Step 1. While Σ_{d∈M} |d| > S, repeat Step 2 to Step 5.
Step 2. According to equation (6.2), calculate the detriment of every subcube di, for 1 ≤ i ≤ n and di ∈ M.
Step 3. Select the OLAM cube with the minimum detriment value according to the results of Step 2, and call it dj.
Step 4. M ← M − {dj}.
Step 5. Go to Step 1.

Figure 6.3. Backward greedy selection method

Example 6.2. Consider Example 6.1 again. Suppose that all three cubes have been selected and the space constraint is 20. Besides, we assume that when ced* is not materialized, all queries which should be answered by ced* fall back to the base relation in the data warehouse. Table 6.4 shows the first two selection steps performed by the backward greedy selection method.
Table 6.4. The detriments of the OLAM cubes in the first two selection steps by BGS

First selection:
  d*ce (influences ced*):
    ((64×6 + 3×1×30 + 2×30 + 30×28 + 1058 + 8×3^2) − (8×3^2 + 1×20)) × (64 − 20) × 0.3 / 20 = 1591.92
  d*c (influences cd*):
    ((8×3^2 + 1×20) − (8×3^2 + 1×10)) × (20 − 10) × 0.3 / 10 = 3
  d*e (influences ed*):
    ((8×3^2 + 1×20) − (6×3^2 + 1×15)) × (20 − 15) × 0.4 / 15 = 3.067

Second selection (after d*c is removed):
  d*ce (influences ced*, cd*):
    ((64×6 + 3×1×30 + 2×30 + 30×28 + 1058 + 8×3^2) − (8×3^2 + 1×20)) × (64 − 20) × (0.3 + 0.3) / 20 = 3183.84
  d*e (influences ed*):
    ((8×3^2 + 1×20) − (6×3^2 + 1×15)) × (20 − 15) × 0.4 / 15 = 3.067

6.3 Pick by Size Selection Method (PBS)

The pick-by-size selection algorithm is an intuitive method proposed by Shukla et al. [17]. Its concept is to compute the sizes of all OLAM cubes and select the smallest ones one by one until the storage constraint would be exceeded. The pick-by-size selection algorithm is described below:
Algorithm 3. Pick by Size Selection (PBS)
Step 0. Sort all OLAM cubes by size;
Step 1. Let M = φ.
Step 2. While Σ_{d∈M} |d| < S, repeat Step 3 to Step 5.
Step 3. Select the smallest unselected OLAM cube, and denote it d_j.
Step 4. M ← M ∪ {d_j}.
Step 5. Go to Step 2.

Figure 6.4. Pick by Size Selection Method
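A minimal Python sketch of Algorithm 3 follows; the cube names and sizes in the example call are hypothetical. Note that, taken literally, the loop condition "while the total size is below S" can overshoot the limit with the last pick; this sketch instead stops before the limit would be exceeded.

```python
from typing import Dict, Set

def pick_by_size(sizes: Dict[str, float], limit: float) -> Set[str]:
    """Materialize the smallest OLAM cubes first (Step 0: sort by
    size) and stop once the next cube no longer fits in the limit."""
    selected: Set[str] = set()
    used = 0.0
    for cube in sorted(sizes, key=sizes.get):  # smallest first
        if used + sizes[cube] > limit:         # next cube would not fit
            break
        selected.add(cube)                     # Step 4: M <- M + {d_j}
        used += sizes[cube]
    return selected

print(sorted(pick_by_size({"a": 10.0, "b": 20.0, "c": 30.0}, 35.0)))  # ['a', 'b']
```

PBS needs no benefit or detriment computation, which is why its selection time is the lowest of the three methods in the experiments of Chapter 7.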
Chapter 7
Experimental Results

In this chapter, we describe our experiments and analysis. All experiments are performed on a machine with an Intel Celeron 1.2 GHz CPU and 512 MB RAM, running Microsoft Windows 2000 Server. The test data are generated from the Microsoft FoodMart 2000 database, from which we chose three dimensions: Customer, Time, and Product. Each dimension consists of two attributes: city c and education e for the Customer dimension; date d and month m for the Time dimension; and Product_ID p and Category a for the Product dimension. The schema hierarchy of the three dimensions is shown in Figure 2.4. Characteristics of the test data are shown in Table 7.1, where we use the notation g<attribute_list>(R): g denotes the "group by" operation, and <attribute_list> denotes the grouping attributes. From this test database, we generated all possible OLAM subcubes, which are detailed in Table 7.2. Note that we have filtered out those subcubes that exhibit schema redundancy and those whose corresponding transaction table, composed of tG ∪ tM, has fewer than 10000 tuples. After this filtering, we observed that all OLAM cubes can be grouped into seven classes distinguished by tG, as shown in Table 7.2. Each cube is represented by the first letter of its attributes, except that Category is abbreviated as
"a". The symbol "-" means none, and the number following each pattern is the size of that subcube. In our experiments, we consider two different distributions of subcube frequencies: 1) all frequencies are the same; 2) randomly generated numbers between 0 and 1. Besides, we consider three different settings of minsup: 1) all minsups are lower than prims; 2) randomly generated minsups between 1% and 99%; 3) all minsups are equal to or greater than prims. Finally, we assume eight different storage constraints: 10%, 20%, 30%, 40%, 50%, 60%, 70%, and 80% of the total size of all subcubes. Furthermore, if none of the subcubes that can answer a query is materialized, we assume that the base relation is invoked to answer the query.

Table 7.1. Data parameters of FoodMart 2000

Parameter            Value
|D|                  86565
|dom(City)|          78
|dom(Education)|     5
|dom(Date)|          323
|dom(Month)|         12
|dom(Product_ID)|    1559
|dom(Category)|      45
|g c,e,d(D)|         10541
|g c,p(D)|           49483
|g m,p(D)|           18492
|g c,e,a(D)|         11576
|g d,a(D)|           12113
|g c,d,a(D)|         49390
|g c,m,a(D)|         22854

Table 7.2. All subcubes of six attributes,
City, Education, Date, Month, Product_ID, Category

tG = ced:  mpa:713  mp-:12  m-a:714  -pa:678  m--:12  -p-:0  --a:678
tG = cp:   edma:72  edm-:60  ed-a:24  e-ma:72  -dma:18  ed--:12  e-m-:60  e--a:24  -dm-:12  -d-a:6  --ma:18  e---:12  -d--:0  --m-:12  ---a:6
tG = mp:   ceda:407  ced-:377  ce-a:403  c-da:62  -eda:62  ce--:373  c-d-:54  c--a:59  -ed-:34  -e-a:58  --da:9  c---:51  -e--:30  --d-:3  ---a:6
tG = cea:  dmp:3955  dm-:3955  d-p:14  -mp:3891  d--:14  -m-:3891  --p:0
tG = da:   cemp:860  cem-:860  ce-p:792  c-mp:87  -emp:99  ce--:792  c-d-:87  c--a:75  -ed-:99  -e-a:31  --da:12  c---:75  -e--:31  --m-:12  ---p:0
tG = cda:  emp:22  em-:22  e-p:9  -mp:12  e--:9  -m-:12  --p:0
tG = cma:  edp:30  ed-:30  e-p:30  -dp:0  e--:30  -d-:0  --p:0

(Each entry is a tM pattern followed by the size of the corresponding subcube.)

7.1 Comparison of FGS, BGS, and PBS for minsup ≥ prims

We first compare the query cost of the three selection methods when the frequency of each subcube is random. The results are shown in Figure 7.1, from which it is obvious that FGS and BGS are significantly better than PBS, and there is an optimal cost-effective space around 50%. We also recorded the result when the frequency of each subcube is uniform, as illustrated in Figure 7.2. The
phenomenon is similar to Figure 7.1, but the optimal cost-effective space is about 40% for FGS.

Figure 7.1. Comparing the query cost of FGS, BGS, and PBS with random frequency when minsup ≥ prims.

Figure 7.2. Comparing the query cost of FGS, BGS, and PBS with uniform frequency when minsup ≥ prims.

We also compare the efficiency of the forward, backward, and pick by size selection
methods. Since the forward and backward greedy selection methods follow a similar philosophy but differ in the direction of selection, we use execution time as the criterion. Besides, we also compared these two methods with PBS. The results are shown in Figure 7.3 and Figure 7.4. From these two figures, the selection time of forward greedy selection is less than that of backward greedy selection at first, but when the space limit is higher than 30%, the situation is reversed. The reason is that the forward greedy selection method selects subcubes starting from the empty set until no cube can be added, while the backward greedy selection method works in the opposite direction. The pick by size selection method consumes the least time because it need not select the most beneficial subcube; it only chooses subcubes according to their size.

Figure 7.3. Comparing the selection time of FGS, BGS, and PBS with random frequency when minsup ≥ prims.
Figure 7.4. Comparing the selection time of FGS, BGS, and PBS with uniform frequency when minsup ≥ prims.

7.2 Comparison of FGS, BGS, and PBS for minsup < prims

We first compare the query cost of FGS, BGS, and PBS. The results are shown in Figure 7.5 and Figure 7.6. Recall that, when minsup < prims, the OMARS system must utilize the OLAM cube, the auxiliary cube, and the data warehouse to answer a user's query. Obviously, this process requires more cost because no OLAM cube can answer the query immediately. Thus, the query cost of FGS and BGS when minsup < prims is higher than that when minsup ≥ prims, regardless of the frequencies. But the query cost of PBS when minsup < prims is similar to that when minsup ≥ prims, because PBS selects subcubes only according to their size.
Figure 7.5. Comparing the query cost of FGS, BGS, and PBS with random frequency when minsup < prims.

Figure 7.6. Comparing the query cost of FGS, BGS, and PBS with uniform frequency when minsup < prims.

We then compare the efficiency of these three methods. The results are shown in Figure 7.7 and Figure 7.8. It can be observed that the results are similar to the
situation when minsup ≥ prims.

Figure 7.7. Comparing the selection time of FGS, BGS, and PBS with random frequency when minsup < prims.

Figure 7.8. Comparing the selection time of FGS, BGS, and PBS with uniform frequency when minsup < prims.
7.3 Comparison of FGS, BGS, and PBS for random minsups

Generally, different users' queries have different support settings; to reflect this situation, we conduct another experiment in which the minsups are set randomly. Figure 7.9 shows the results when the frequencies of all subcubes are random. The forward and backward selection methods reach the optimal cost-effective point around 50%; once the space constraint goes beyond 50%, there is no further saving in query cost for forward greedy selection. In Figure 7.10, the results are similar to Figure 7.9, but the optimal cost-effective point for FGS is around 40%.

Figure 7.9. Comparing the query cost of FGS, BGS, and PBS with random frequency for random minsups.
Figure 7.10. Comparing the query cost of FGS, BGS, and PBS with uniform frequency for random minsups.

We then compare the selection time of FGS, BGS, and PBS. As shown in Figure 7.11 and Figure 7.12, the phenomenon is similar to the above two cases.

Figure 7.11. Comparing the selection time of FGS, BGS, and PBS with random
frequency for random minsups.

Figure 7.12. Comparing the selection time of FGS, BGS, and PBS with uniform frequency for random minsups.
Chapter 8
Conclusions and Future Works

8.1 Conclusions

In this thesis, we have considered the OLAM cube selection problem in the OMARS system and proposed a cost model to evaluate the query cost. According to users' association queries, we divided query evaluation into three cases and accordingly designed three different cost models. Based on these cost models, we modified three of the most well-known heuristic algorithms, FGS, BGS, and PBS, to choose the most suitable OLAM cubes to materialize. We also implemented these algorithms to evaluate their performance. One thing needs to be clarified: although our cost models are based on the CBWon and CBWoff algorithms proposed by Lin et al. [15], most of the concepts also apply to other algorithms, provided that the cost functions are modified to conform to them.

8.2 Future Works

As we pointed out in the beginning of this thesis, the main focus of this research
is on OLAM cube selection in the OMARS system. Several issues need to be investigated in the near future:

1. Simultaneous OLAM cube and OLAP cube selection
In the OMARS system, there is another cube repository that stores OLAP cubes. In this thesis, we considered only the OLAM cube selection problem. One future work is to combine OLAP cube and OLAM cube selection, and to design a suitable cost model to evaluate the query cost in this situation.

2. Cube maintenance
In real-world applications, new data are continually generated and need to be loaded into the data warehouse. This implies that the materialized cubes have to be updated to reflect the new situation. Another future work is to design a suitable scheme to update the OLAM, OLAP, and auxiliary cubes in the OMARS system.

3. Other non-heuristic algorithms
In this thesis, we considered only heuristic algorithms for selecting the most suitable OLAM cubes to materialize. Beyond these, we will consider non-heuristic approaches, such as genetic algorithms, the A* algorithm, or dynamic programming.
References

[1] W.Y. Lin and Y.S. Chang, "An analysis and comparison of heuristic data cube selection methods," in Proceedings of the National Computer Symposium, pp. 47-58, 2001. (in Chinese)
[2] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th VLDB Conference, pp. 487-499, 1994.
[3] K.S. Beyer and R. Ramakrishnan, "Bottom-up computation of sparse and iceberg cubes," in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 359-370, 1999.
[4] S. Chaudhuri and U. Dayal, "An overview of data warehousing and OLAP technology," ACM SIGMOD Record, Vol. 26, pp. 65-74, 1997.
[5] C.I. Ezeife, "A uniform approach for selecting views and indexes in a data warehouse," in Proceedings of the International Database Engineering and Applications Symposium, pp. 151-160, 1997.
[6] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh, "Data cube: a relational aggregation operator generalizing group-by, cross-tabs and subtotals," in Proceedings of the International Conference on Data Engineering, pp. 152-159, 1996.
[7] H. Gupta, "Selection of views to materialize in a data warehouse," in Proceedings of the International Conference on Database Theory, pp. 98-112, 1997.
[8] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan
Kaufmann Publishers, 2000.
[9] V. Harinarayan, A. Rajaraman, and J.D. Ullman, "Implementing data cubes efficiently," in Proceedings of ACM SIGMOD, pp. 205-216, 1996.
[10] J.-T. Horng, Y.-J. Chang, B.-J. Liu, and C.-Y. Kao, "Materialized view selection using genetic algorithms in a data warehouse," in Proceedings of the World Congress on Evolutionary Computation, pp. 2221-2227, 1999.
[11] W.H. Inmon and C. Kelley, Rdb/VMS: Developing the Data Warehouse, QED Publishing Group, Boston, Massachusetts, 1993.
[12] R. Kimball, The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses, John Wiley & Sons, Inc., 1996.
[13] W.Y. Lin and I.C. Kuo, "OLAP data cubes configuration with genetic algorithms," in Proceedings of IEEE Systems, Man, and Cybernetics, pp. 1984-1989, 2000.
[14] W.Y. Lin, I.C. Kuo, and Y.S. Chang, "A genetic selection algorithm for OLAP data cubes," in Proceedings of the 9th National Conference on Fuzzy Theory and Its Application, Taiwan, pp. 624-628, November 2001.
[15] W.Y. Lin, J.H. Su, and M.C. Tseng, "OMARS: The framework of an online multi-dimensional association rules mining system," in Proceedings of the 2nd International Conference on Electronic Business, Taipei, Taiwan, 2002.
[16] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, New York, 1994.
[17] A. Shukla, P.M. Deshpande, and J.F. Naughton, "Materialized view selection for multidimensional datasets," in Proceedings of the 24th VLDB Conference, New York, USA, pp. 488-499, 1998.
[18] E. Soutyrina and F. Fotouhi, "Optimal view selection for multidimensional database systems," in Proceedings of the International Database Engineering and
Applications Symposium, pp. 309-318, 1997.
[19] D. Theodoratos and T. Sellis, "Data warehouse configuration," in Proceedings of the 23rd VLDB Conference, pp. 126-135, 1997.
[20] C. Zhang, X. Yao, and J. Yang, "Evolving materialized views in a data warehouse," in Proceedings of the World Congress on Evolutionary Computation, pp. 823-829, 1999.
[21] C. Zhang and J. Yang, "Genetic algorithm for materialized view selection in data warehouse environments," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 116-125, 1999.
[22] H. Zhu, On-Line Analytical Mining of Association Rules, Master's thesis, Simon Fraser University, December 1998.
I-Shou University
Department of Information Management
Master's Thesis

OLAM Cube Selection in OMARS

Student: Min-Feng Wang
Advisor: Dr. Wen-Yang Lin
July 2003
OLAM Cube Selection in OMARS

Student: Min-Feng Wang
Advisor: Dr. Wen-Yang Lin

A Thesis
Submitted to the Department of Information Management
I-Shou University
in Partial Fulfillment of the Requirements
for the Master's Degree in Information Management

July 2003
Kaohsiung, Taiwan, Republic of China
OLAM Cube Selection in OMARS

Student: Min-Feng Wang
Advisor: Wen-Yang Lin
Dept. of Information Management, I-Shou University

ABSTRACT

Mining association rules from large databases is a data- and computation-intensive task. To reduce the complexity of association mining, Lin et al. proposed the concept of the OLAM (On-Line Association Mining) cube, an extension of the iceberg cube used to store frequent multidimensional itemsets. They also proposed a framework for an on-line multidimensional association rule mining system, called OMARS, which provides users an environment to execute OLAP-like queries to mine association rules from data warehouses efficiently. This thesis is a step toward the implementation of OMARS. In particular, it addresses the problem of selecting appropriate OLAM cubes to materialize and store in OMARS. According to the mining algorithms proposed in OMARS, we develop a model for evaluating the cost of answering association queries using materialized OLAM cubes, which is a preliminary step for OLAM cube selection. Besides, we modify and implement some state-of-the-art heuristic algorithms, and
draw comparisons between these algorithms to evaluate their effectiveness.

Keywords: data mining, data warehouse, OMARS, OLAM cube, multidimensional association rules, cube selection problem

Acknowledgement

For the completion of this thesis, I am most grateful to my advisor, Dr. Wen-Yang Lin, for his patient guidance, which taught me how to do research and let me experience both its joys and its hardships. I especially thank him for sparing time during the summer vacation to advise me in the final stage of writing. During these two years of graduate study, beyond the theoretical training, I am also grateful to him for giving me the time to delve into the practice of data warehouses and databases.
Two years pass quickly. I would also like to thank Professors 洪宗貝, 錢炳全, 王學亮, and 林建宏 for their instruction during these two years, which broadened my horizons in this short time. In addition, I thank the classmates who studied and discussed coursework with me, especially 思博, 文傑, and 欣龍, as well as the senior students, particularly 耀升 and 詠騏, and the junior students who helped me during my oral defense. Finally, I thank my family for their support and encouragement throughout these two years.
Contents

ABSTRACT ............................................................ III
ACKNOWLEDGEMENT ..................................................... IV
List of Figures

Figure 3.1. The OMARS framework [15] ................................ 17
Figure 4.2. The 2nd layer lattice derived from <Customer, Time, -> in the 1st layer ... 27
Figure 5.6. Procedure UpSearch ...................................... 42
Figure 5.8. Algorithm CBWoff ........................................ 47
List of Tables

Table 4.8. The resulting OLAM cube MCube({City}, {Date, Month}) ..... 32
