50120140503019

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 5, Issue 3, March (2014), pp. 164-173 © IAEME
164
MINING ASSOCIATION RULES FOR HIGH UTILITY ITEMSETS USING
UP GROWTH+ ALGORITHM FROM TRANSACTIONAL DATABASES
M. Venkatesh
PG Scholar, CSE Department, KCG College of Technology, Chennai 97
Dr. M. Krishnamurthy
HOD (ME.,), CSE Department, KCG College of Technology, Chennai 97
ABSTRACT
Utility mining emerges as an important topic in data mining field. High utility item sets
mining refers to importance or profitability of an item to users. Efficient mining of high utility
itemsets plays an important role in many real-time applications and is an important research issue in
data mining area. Number of Algorithms like apriori, FP Growth has been proposed in this area, but
they cause the problem of generating a large number of candidate itemsets. That will lead to high
requirement of space and time and hence the performance of mining will be less .It is not at all good
when the database contains transactions having long size or high utility itemsets which also having
long size. UP Growth+ algorithm is proposed in this paper for mining high utility itemsets. A UP
Tree data structure is used for storing the information about high utility itemset such that by using
only double scanning of database, candidate itemsets can be efficiently generate. These proposed
algorithms will cause the reduction of the number of candidates effectively and also reduces the
requirement for space and time, when a database contains large number of transactions.
Keywords: Frequent Mining, Utility Mining, UP Tree, UP Growth, Association Rule.
1. INTRODUCTION
Data mining is the process of analyzing data from different perspectives and summarizing it
into useful information that can be used to increase revenue, cuts costs, or both. It allows users to
analyze data from many different dimensions and summarize the relationships identified.
Technically, data mining is the process of finding correlations or patterns among dozens of fields in
large relational databases. It is the process that attempts to discover interesting patterns from an
abundance of raw data. It utilizes methods at the intersection of artificial intelligence, machine
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING &
TECHNOLOGY (IJCET)
ISSN 0976 – 6367(Print)
ISSN 0976 – 6375(Online)
Volume 5, Issue 3, March (2014), pp. 164-173
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2014): 8.5328 (Calculated by GISI)
www.jifactor.com
IJCET
© I A E M E

165
learning, statistics, and database systems. The overall goal of the data mining process is to extract
information from a data set and transform it into an understandable structure for further use.
Extensive studies have been proposed for mining frequent itemsets from the databases. In market
analysis, mining frequent itemsets from a transaction database refers to the discovery of the itemsets
which frequently appear together in the transactions. However the unit profits and purchased
quantities of items are not considered in the framework of frequent itemset mining. Hence, it cannot
satisfy the requirement of the user who is interested in discovering the itemsets with high sales
profits. In view of this, utility mining is used as an important topic in data mining for discovering the
high utility itemsets. Mining high utility itemsets from the databases refer to finding the itemsets
with high utilities. The basic meaning of utility is the interestedness/ importance/ profitability of
items to the users. An itemset is called a high utility itemset if its utility satisfies the user specified
threshold; otherwise, the itemset is called a low utility itemset. However, mining high utility itemsets
from the databases is not an easy task since the downward closure property cannot be applied here.
In other words, pruning search space for high utility itemset mining is difficult because a superset of
a low utility itemset may be a high utility itemset.
Table 1: Transaction Database
TID Transactions
T1 (A,2)(C,20)(D,2)
T2 (A,4)(C,12)(E,4)(G,10)
T3 (A,4)(B,4)(D,12)(E,6)(F,2)
T4 (B,8)(C,8)(D,4)(E,1)
T5 (B,2)(C,2)(E,1)(G,2)
An alternative approach for this problem is to enumerate all itemsets from the databases by
the principle of exhaustion. Obviously, this approach will encounter the large search space problem,
especially when databases contain lots of long transactions. Hence, how to effectively prune the
search space and efficiently capture all high utility itemsets with no miss is a big challenge in utility
mining. Existing methods often generate a huge set of high utility itemsets and the mining
performance is degraded consequently. The situation may become worse when the database contains
many long transactions. The huge number of potential high utility itemsets forms a challenging
problem to the mining performance since the higher processing cost is incurred with more potential
high utility itemsets are generated.
To address this issue, we propose a novel algorithm with a compact data structure for
discovering high utility itemsets. The major contributions of this work are described as follows:
An efficient algorithm, called UP-Growth+ (Utility Pattern Growth) and UP-Growth+
algorithm are proposed for discovering high utility itemsets. Correspondingly, a compact tree
structure, called UP-Tree is proposed to maintain the important information of the transaction
database related to the utility patterns. High utility itemsets are then generated from the UP-Tree
efficiently with only two scans of the database.

166
2. RELATED WORKS
Here we are discussing some basic definitions about utility of an item, utility of itemset in
transaction, utility of itemset in database and also related works and the problem involved in utility
mining and then related strategies are introduced.
2.1 Preliminary
A transaction database D contains a set of transactions, and each transaction has a unique
identifier called TID. Each item ip in transaction Td is associated with a quantity q (ip, Td), that is,
the purchased quantity of ip in Td.
Definition 1. Utility of an item ip in a transaction Td is denoted as u (ip, Td) and defined as pr
(ipxTd).
Definition 2. Utility of an itemset X in Td is denoted as U(X, Td) and defined as ∑ip⋴X∨X⊆Td
Definition 3. Utility of an itemset X in D is denoted as u(X) and ∑X Td Td Du(X, Td)
Definition 4. An itemset is called a high utility itemset if its utility is no less than a user-specified
minimum utility threshold which is denoted as min_util.
Table 2: Profit Table
Item A B C D E F G
Profit 5 2 1 2 3 1 1
From table 1 and 2
U ({A, T1}) =5×1=5
U ({AD, T1}) =u ({A, T1}) +u ({D, T1}) =5+2=7
U ({AD}) =u ({AD, T1}) +u ({AD, T3}) =7+17=24
U ({BD}) =u ({BD, T3}) +u ({BD, T4}) =16+18=34
If min_util is set to 30, BD is a high utility itemset; AD is a low utility itemset.
Extensive studies have been proposed for mining frequent itemsets from database. Apriori is
one of the best known algorithms for mining association rules from large databases. FP-Growth
algorithm achieves a better performance than Apriori-based algorithms since it finds frequent
itemsets without generating any candidate itemsets. However the importance of items to users is not
considered in the frequent itemset mining. Thus, the topic called weighted association rule mining
was brought to attention. Cai et al. first proposed the concept of weighted items and weighted
association rules. However, it does not have downward closure property. To address this problem,
Tao et al. proposed the concept of weighted downward closure property. By using transaction
weight, weighted support can not only reflect the importance of an itemset but also maintain the
downward closure property. Thus, the issue of high utility itemset mining is raised and many studies
have addressed this problem. Liu et al. proposed an algorithm named Two-Phase which is mainly
composed of two mining phases. In phase I, it employs an Apriori-based level-wise method to
generate High Utility Itemsets. In phase II, HTWUIs that are high utility itemsets are identified with
an additional database scan. Although two-phase algorithm reduces search space by using TWDC

167
property, it still generates too many candidates to obtain HTWUIs and requires multiple database
scans. To overcome this problem, Li et al. proposed an isolated items discarding strategy (IIDS) to
reduce the number of candidates. By pruning isolated items during level-wise search, the number of
candidate itemsets for HTWUIs in phase I can be reduced. However, this algorithm still scans
database for several times and uses a candidate generation-and-test scheme to find high utility
itemsets.
To efficiently generate HTWUIs in phase I and avoid scanning database too many times,
Ahmed et al. proposed a tree-based algorithm, named IHUP. A tree based structure called IHUP-
Tree is used to maintain the information about itemsets and their utilities. Each node of an IHUP-
Tree consists of an item name, a TWU value and a support count. IHUP algorithm has three steps: 1)
construction of IHUP-Tree, 2) generation of HTWUIs, and 3) identification of high utility itemsets.
In step 1, items in transactions are rearranged in a fixed order such as lexicographic order, support
descending order or TWU descending order. Then the rearranged transactions are inserted into an
IHUP-Tree. For each node in IHUP Tree, the first number beside item name is its TWU and the
second one is its support count. In step 2, HTWUIs are generated from the IHUP-Tree by applying
FP-Growth. Thus, HTWUIs in phase I can be found without generating any candidate for HTWUIs.
In step 3, high utility itemsets and their utilities are identified from the set of HTWUIs by scanning
the original database once. Although IHUP achieves a better performance than IIDS and Two-Phase,
it still produces too many HTWUIs in phase I. Note that IHUP and Two-Phase produce the same
number of HTWUIs in phase I since they both use TWU framework to overestimate itemsets’
utilities. However, this framework may produce too many HTWUIs in phase I since the
overestimated utility calculated by TWU is too large. Such a large number of HTWUIs will degrade
the mining performance in phase I substantially in terms of execution time and memory
consumption. Moreover, the number of HTWUIs in phase I also affects the performance of phase II
since the more HTWUIs the algorithm generates in phase I, the more execution time for identifying
high utility itemsets it requires in phase II. As stated above, the number of generated HTWUIs is a
critical issue for the performance of algorithms. Therefore, this study aims at reducing itemsets’
overestimated utilities and proposes several strategies. By applying the proposed strategies, the
number of generated candidates can be highly reduced in phase I and high utility itemsets can be
identified more efficiently in phase II.
3. PROPOSED METHODS
The major three steps of proposed methods are: 1) Double scan the database to construct the
global UP-Tree 2) Generate the potential high utility itemsets using UP Growth and UP Growth+
algorithm 3). From the set of PHUIs identify actual high utility item set.
3.1 Construction of UP Tree
After the original database is reorganized by removing the unpromising items and their
utilities from the database, UP-Tree is constructed. A compact tree structure, UP-Tree is used for
facilitate the mining performance and avoid scanning original database repeatedly. It will also
maintain the transaction information and high utility itemsets.
UP-Tree is constructed to facilitate the mining performance and avoid scanning original
database repeatedly, we use a compact tree structure, named UP-Tree (Utility Pattern Tree), to
maintain the information of transactions and high utility item sets are maintained in the UP-Tree. By
applying two strategies, the overestimated utilities stored in the nodes of global UP-Tree are
minimized. Two scan of database is required from the construction of UP-Tree.

168
Table 2: Reorganized Database
TID Transaction RTU
1 (C,1)(A,1)(D,1) 8
2 (C,6)(E,2)(A,2) 22
3 (C,1)(E,1)(A,1)(B,2)(D,6) 25
4 (C,3)((E,1)(B,4)(D,3) 20
5 (C,2)(E,1)(B,2) 9
Fig 3: Proposed Architecture
3.1.1 Discarding Global Unpromising Items
By using two scans of database it is possible to construct the global UP-Tree. TU of each
transaction is computed in the first scan, along with that TU of each transaction is also computed. An
item whose TWU is less than minimum utility threshold is said to be unpromising item. The
following figure will show how to remove unpromising items and their utilities from transactions and
TUs.
Suppose the minimum utility is 40, items having TWU< minimum utility will be discarded,
here unpromising items are F and G. After that Reorganized Transactions will be constructed with
their TU.
3.1.2 Discarding Global Node Utilities
Next step is to discarding global node utilities. Actually it is done by considering utilities of
descendant nodes during the construction of global UP-Tree.

169
The nodes having utility that are near to the root of a global UP-Tree are further reduced by
applying strategy DGN, It will be useful when the database contains large number of long
transactions in other ways if a transactions contains more items, more utilities can be discarded by
DGN.
Global UP-Tree with Strategies DGU and DGN
Construction of a global UP-Tree is done by using two database scans. Each transaction’s TU
is computed in the first scan; together with that, each 1- item’s TWU is also calculated. Thus, we can
collect promising items and unpromising items. DGU is applied finally taking all unpromising item.
After that the transactions are reorganized by pruning the unpromising items and sorting the
remaining promising items in a fixed order. Any ordering can be based on any scheme such as
alphabetic, support, or TWU order. Each transaction after the above reorganization is called a
reorganized transaction. Then a function Insert Reorganized Transaction is called to apply DGN
during constructing a global UP-Tree.
3.2 UP-Growth Mining Method
UP Growth algorithm uses two strategies namely Discarding Local Unpromising Items and
Decreasing Local Node Utilities to identify high utility itemsets from the UP Tree. By these two
strategies, overestimated utilities of itemsets can be decreased and thus the number of PHUIs can be
further reduced.
Subroutine: UP-Growth+ (Tx, Hx, X)
Input: A UP-Tree Tx, a header table Hx and an itemset X.
Output: All PHUIs in Tx.
Procedure UP-Growth+ (Tx, Hx, X)
(1) For each entry ai in Hx do
(2) Generate a PHUI Y=XUai;
(3) The estimate utility of Y is set as ai’s utility value in Hx;
(4) Construct Y’s conditional pattern base Y-CPB;
(5) Put local promising items in Y-CPB into Hy
(6) Apply strategy DLU to reduce path utilities of the paths;
(7) Apply strategy DLN and insert paths into Ty;
(8) If Ty null then call UP-Growth+ (Ty, Hy, Yy);
(9) End for
3.2.1 Discarding Local Unpromising Items (DLU)
It consists of three steps. First step is generating conditional pattern base, second step is
construct conditional trees and the final step is mine patterns from the conditional tree. Since actual
utilities are not maintained in global UP-tree, DGU, DGN strategies cannot be applied. To avoid this
problem items actual utilities of each transaction will be maintained in each node, but this will cause
inefficient utilization of memory. To avoid this problem DLU, DLN strategies are used. In both cases
minimum item utility table is used for keeping all promising item’s minimum item utilities. In order
to reduce local promising item’s utilities minimum item utilities can be used. For that estimated
value for each local unpromising item is subtracted from path utility of an extracted path which is
described by the following diagram.

170
3.2.2 Discarding Local Node utilities (DLN)
Information about the item below to the particular item will not be contain in the UP-Tree,
so that it is possible to discard the utilities of descendant nodes related to item. We use minimum
utilities of item to estimate discarded utilities because we don’t know the actual utilities of
descendant nodes.
3.3. UP Growth+ Mining Method
In UP Growth+ Algorithm, minimal node utility in each path is calculated to make the
estimated pruning values closer to the real utility values of the pruned itemsets in database. Minimal
Node Utility for each node in the global UP Tree is acquired during the construction of a global UP
Tree. Minimal Node Utility is described in each node of the global UP Tree.
By Discarding unpromising items and their Node Utilities (DNU) strategy Local unpromising
items and their estimated Node Utilities are discarded from the paths and path utilities of the
conditional Pattern bases. Path utility of p in the conditional Pattern base is recalculated.
By Decreasing Node utilities for the Nodes of UP Tree (DNN), the utility of items are further
reduced.
3.4. Association Rule Generation
Association Rule is generation based on the support count and minimum count threshold
value set by the users. Since the database is large and users concern about only those frequently
purchased items, usually thresholds of support and confidence are predefined by users to drop those
rules that are not so interesting or useful. In many cases, the algorithms generate an extremely large
number of association rules, often in thousands or even millions. It is nearly impossible for the end
users to comprehend or validate such large number of complex association rules. UP Growth+
algorithm produces only strong Association rules.
4. EXPERIMENTAL RESULTS
Performance of the proposed algorithm is evaluated in this section. The experiments were
performed on a 2.80 GHz Intel Pentium D Processor with 2 GB memory. The operating system is
Microsoft Windows 7. The algorithms are implemented in Java language. Synthetic data sets are
used in the experiments. Synthetic data sets were generated from the data generator. Parameter
descriptions are described as follows: D is the total number of transactions; T is the average size of
transactions; N is the number of distinct items; I is the average size of maximal potential frequent
itemsets.
For the proposed algorithms, Two methods UPT&UPG and UPT&UPG+ are designed that
are composed of the proposed methods UP-Tree(DGU and DGN) and UP-Growth (DLU and DLN)
and the proposed methods UP-Tree and UP-Growth+ (DNU, and DNN), respectively.
In phase II, these algorithms identify high utility itemsets by scanning the database which
contains no unpromising items.
By applying DGN strategy, the node utilities of the nodes in a global UP-Tree are effectively
decreased since the utilities of their descendants are discarded. By applying DLU strategy, local
unpromising items are removed from the paths of conditional pattern base and their minimum item
utilities are eliminated from the path utilities. By applying DLN, the node utilities of the nodes in a
local UP-Tree are decreased since they discard the minimum item utilities of their descendants.

171
Fig. 4.1: Reorganized Database
Figure 4.1 shows the reorganized database after removing the unpromising items and their
utilities. By applying strategy DGU, global unpromising items and their utilities are discarded from
the transactions and TUs.
Fig. 4.2: UP Tree
Fig 4.2 shows the UP Tree data structure after constructing the reorganized database.
UP+UPG means the PHUIs are generated from UP-Tree by applying UP-Growth+. In the above
algorithms, UP-Tree is constructed after removing the unpromising items from the original database
and it requires just two scan of database by scanning database twice. The items in a transaction are
rearranged in descending order of the global TWU during the construction of both UP-Trees.

172
Fig. 4.3: High Utility Itemsets
Fig 4.3 shows the Potential high utility itemsets which are mined from the conditional pattern bases.
Fig 4.4: Association Rule Generation
Figure 4.4 shows the Association Rules which satisfies both the minimum support count and
minimum confidence value for the High Utility Itemsets
5. CONCLUSION
UP Growth+ algorithm is used to discover high utility itemsets from transaction database. By
scanning the original database twice, the potential high utility itemsets are generated from the UP-
Tree. In this project four strategies namely Discarding Global Unpromising Items, Decreasing Global
Node Utilities, Discarding Local Unpromising Items, Decreasing Local Node Utilities have been
developed to decrease the estimated utility values and to improve the mining performance in utility
mining.
Potential High Utility Items are found in the experiment and synthetic datasets are used to
evaluate the performance of the UP Growth+ algorithm. The mining performance is improved since
both the search space and the number of candidates are reduced by the four strategies. Also
Association Rule is generated from the most profitable itemsets which satisfies the minimum support
count and minimum confidence.

173
REFERENCES
[1] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” Proc. 20th Int’l
Conf. Very Large Data Bases (VLDB), pp. 487-499, 1994.
[2] J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns without Candidate Generation,” Proc.
ACM-SIGMOD Int’l Conf. Management of Data, pp. 1-12, 2000.
[3] Ying Liu, Wei-keng Liao, Alok Choudhary, “A Fast High Utility Itemsets Mining
Algorithm,” Proc. Utility-Based Data Mining Workshop, 2005.
[4] Y.-C. Li, J.-S. Yeh, and C.-C. Chang, “Isolated Items Discarding Strategy for Discovering
High Utility Itemsets,” Data and Knowledge Eng., vol. 64, no. 1, pp. 198-217, Jan. 2008.
[5] C.F. Ahmed, S.K. Tanbeer, B.-S. Jeong and Y.-K. Lee, “Efficient Tree Structures for High
Utility Pattern Mining in Incremental Databases,” IEEE Trans. Knowledge and Data Eng.,
vol. 21, no. 12, pp. 1708-1721, Dec. 2009.
[6] Vincent S. Tseng, Bai-En Shie, Cheng-Wei Wu, and Philip S. Yu, “Efficient Algorithms for
Mining High Utility Itemsets from Transactional Databases” IEEE Trans. Knowledge and
Data Eng., VOL.25, NO. 8, AUG. 2013.
[7] V.S. Tseng, C.-W. Wu, B.-E. Shie, and P.S. Yu, “UP-Growth: An Efficient Algorithm for
High Utility Itemsets Mining,” Proc. 16th
ACM SIGKDD Conf. Knowledge Discovery and
Data Mining (KDD ’10), pp. 253-262, 2010.
[8] A. Erwin, R. P. Gopalan, and N. R. Achuthan. Efficient mining of high utility itemsets from
large datasets. In Proc. of PAKDD 2008, LNAI 5012, pp. 554-561.
[9] Vijay Arputharaj J and Dr.R.Manicka Chezian, “Data Mining with Human Genetics to
Enhance Gene Based Algorithm and DNA Database Security”, International Journal of
Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 176 - 181,
ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[10] M. Karthikeyan, M. Suriya Kumar and Dr. S. Karthikeyan, “A Literature Review on the Data
Mining and Information Security”, International Journal of Computer Engineering &
Technology (IJCET), Volume 3, Issue 1, 2012, pp. 141 - 146, ISSN Print: 0976 – 6367,
ISSN Online: 0976 – 6375.
[11] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. of
the ACM-SIGMOD Int'l Conf. on Management of Data, pp. 1-12, 2000.
[12] C.H. Cai, A.W.C. Fu, C.H. Cheng, and W.W. Kwong, “Mining Association Rules with
Weighted Items,” Proc. Int’l Database Eng. and Applications Symp. (IDEAS ’98), pp. 68-77,
1998.
[13] R. Manickam, D. Boominath and V. Bhuvaneswari, “An Analysis of Data Mining: Past,
Present and Future”, International Journal of Computer Engineering & Technology (IJCET),
Volume 3, Issue 1, 2012, pp. 1 - 9, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[14] Rajesh V. Argiddi and Sulabha S. Apte, “A Study of Association Rule Mining in Fragmented
Item-Sets for Prediction of Transactions Outcome in Stock Trading Systems”, International
Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 2, 2012,
pp. 478 - 486, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[15] R. Chan, Q. Yang, and Y. Shen, “Mining High Utility Itemsets,” Proc. IEEE Third Int’l Conf.
Data Mining, pp. 19-26, Nov. 2003.
[16] J. Han and Y. Fu, “Discovery of Multiple-Level Association Rules from Large Databases,”
Proc. 21th Int’l Conf. Very Large Data Bases, pp. 420-431, Sept. 1995.

50120140503019

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to 50120140503019

Similar to 50120140503019 (20)

More from IAEME Publication

More from IAEME Publication (20)

Recently uploaded

Recently uploaded (20)

50120140503019