International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 5, Issue 3, March (2014), pp. 164-173 © IAEME

MINING ASSOCIATION RULES FOR HIGH UTILITY ITEMSETS USING UP GROWTH+ ALGORITHM FROM TRANSACTIONAL DATABASES

M. Venkatesh
PG Scholar, CSE Department, KCG College of Technology, Chennai 97

Dr. M. Krishnamurthy
HOD (ME.,), CSE Department, KCG College of Technology, Chennai 97

ABSTRACT

Utility mining has emerged as an important topic in the data mining field. In high utility itemset mining, utility refers to the importance or profitability of an item to users. Efficient mining of high utility itemsets plays an important role in many real-time applications and is an important research issue in data mining. A number of algorithms, such as Apriori and FP-Growth, have been proposed in this area, but they generate a large number of candidate itemsets. This leads to high space and time requirements and therefore degrades mining performance, especially when the database contains long transactions or long high utility itemsets. This paper proposes the UP-Growth+ algorithm for mining high utility itemsets. A UP-Tree data structure stores the information about high utility itemsets so that candidate itemsets can be generated efficiently with only two scans of the database. The proposed algorithms effectively reduce the number of candidates, and with it the space and time required, when a database contains a large number of transactions.

Keywords: Frequent Mining, Utility Mining, UP Tree, UP Growth, Association Rule.

1. INTRODUCTION

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue, cut costs, or both.
It allows users to analyze data from many different dimensions and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. It is the process that attempts to discover interesting patterns from an abundance of raw data. It utilizes methods at the intersection of artificial intelligence, machine
learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

Extensive studies have been proposed for mining frequent itemsets from databases. In market analysis, mining frequent itemsets from a transaction database refers to discovering the itemsets that frequently appear together in the transactions. However, the unit profits and purchased quantities of items are not considered in the framework of frequent itemset mining. Hence, it cannot satisfy a user who is interested in discovering the itemsets with high sales profits. In view of this, utility mining has become an important topic in data mining for discovering high utility itemsets.

Mining high utility itemsets from databases refers to finding the itemsets with high utilities. The basic meaning of utility is the interestingness, importance, or profitability of items to the users. An itemset is called a high utility itemset if its utility is no less than a user-specified threshold; otherwise, it is called a low utility itemset. However, mining high utility itemsets is not an easy task because the downward closure property does not hold for utility. In other words, pruning the search space is difficult because a superset of a low utility itemset may be a high utility itemset.

Table 1: Transaction Database

TID   Transaction
T1    (A,2)(C,20)(D,2)
T2    (A,4)(C,12)(E,4)(G,10)
T3    (A,4)(B,4)(D,12)(E,6)(F,2)
T4    (B,8)(C,8)(D,4)(E,1)
T5    (B,2)(C,2)(E,1)(G,2)

An alternative approach to this problem is to enumerate all itemsets from the database by the principle of exhaustion.
Obviously, this approach runs into the large search space problem, especially when the database contains many long transactions. Hence, how to prune the search space effectively and still capture all high utility itemsets without missing any is a major challenge in utility mining. Existing methods often generate a huge set of potential high utility itemsets (PHUIs), and the mining performance degrades as a consequence. The situation becomes worse when the database contains many long transactions. The huge number of PHUIs is a challenge for mining performance because the more PHUIs are generated, the higher the processing cost.

To address this issue, we propose a novel algorithm with a compact data structure for discovering high utility itemsets. The major contributions of this work are as follows. Two efficient algorithms, UP-Growth (Utility Pattern Growth) and UP-Growth+, are proposed for discovering high utility itemsets. Correspondingly, a compact tree structure called UP-Tree is proposed to maintain the important information of the transaction database related to the utility patterns. High utility itemsets are then generated from the UP-Tree efficiently with only two scans of the database.
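To make the search-space problem described in this section concrete, the following minimal sketch enumerates every itemset exhaustively and shows that downward closure fails for utility. The toy profits and transactions are hypothetical and are not taken from the paper's tables; the utility of an itemset is computed as unit profit times purchased quantity, summed over the transactions that contain the whole itemset (the definitions given in Section 2.1). The function names `utility` and `high_utility_itemsets` are illustrative.

```python
from itertools import combinations

# Hypothetical toy data, chosen only for illustration.
PROFIT = {"a": 1, "b": 5, "c": 2}      # unit profit per item
DB = [                                  # each transaction maps item -> quantity
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"a": 1, "b": 3, "c": 4},
]


def utility(itemset, db=DB, profit=PROFIT):
    """Utility of `itemset`: profit x quantity of its items, summed over
    the transactions that contain every item of the itemset."""
    total = 0
    for t in db:
        if all(i in t for i in itemset):
            total += sum(profit[i] * t[i] for i in itemset)
    return total


def high_utility_itemsets(min_util, db=DB, profit=PROFIT):
    """Exhaustive enumeration: test all 2^n - 1 non-empty itemsets."""
    items = sorted({i for t in db for i in t})
    found = {}
    for k in range(1, len(items) + 1):
        for x in combinations(items, k):
            u = utility(x, db, profit)
            if u >= min_util:
                found[x] = u
    return found


if __name__ == "__main__":
    # ('a',) and ('b',) are both below the threshold, yet ('a', 'b') is not.
    print(utility(("a",)), utility(("b",)), high_utility_itemsets(26))
```

With a threshold of 26, neither single item qualifies, but their union does, so a low utility itemset cannot be used to prune its supersets. The loop also visits all 2^n - 1 itemsets, which is why exhaustive enumeration does not scale to long transactions.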
2. RELATED WORKS

This section gives the basic definitions of the utility of an item, the utility of an itemset in a transaction, and the utility of an itemset in a database. It then reviews related work, the problems involved in utility mining, and the strategies proposed to address them.

2.1 Preliminary

A transaction database D contains a set of transactions, and each transaction has a unique identifier called TID. Each item $i_p$ in transaction $T_d$ is associated with a quantity $q(i_p, T_d)$, that is, the purchased quantity of $i_p$ in $T_d$. Each item $i_p$ also has a unit profit $pr(i_p)$, listed in the profit table (Table 2).

Definition 1. The utility of an item $i_p$ in a transaction $T_d$ is denoted as $u(i_p, T_d)$ and defined as $u(i_p, T_d) = pr(i_p) \times q(i_p, T_d)$.

Definition 2. The utility of an itemset $X$ in $T_d$ is denoted as $u(X, T_d)$ and defined as $u(X, T_d) = \sum_{i_p \in X \wedge X \subseteq T_d} u(i_p, T_d)$.

Definition 3. The utility of an itemset $X$ in $D$ is denoted as $u(X)$ and defined as $u(X) = \sum_{X \subseteq T_d \wedge T_d \in D} u(X, T_d)$.

Definition 4. An itemset is called a high utility itemset if its utility is no less than a user-specified minimum utility threshold, denoted as min_util.

Table 2: Profit Table

Item     A   B   C   D   E   F   G
Profit   5   2   1   2   3   1   1

From Tables 1 and 2:

u({A}, T1) = 5 × 1 = 5
u({AD}, T1) = u({A}, T1) + u({D}, T1) = 5 + 2 = 7
u({AD}) = u({AD}, T1) + u({AD}, T3) = 7 + 17 = 24
u({BD}) = u({BD}, T3) + u({BD}, T4) = 16 + 18 = 34

If min_util is set to 30, BD is a high utility itemset and AD is a low utility itemset.

Extensive studies have been proposed for mining frequent itemsets from databases. Apriori is one of the best known algorithms for mining association rules from large databases. The FP-Growth algorithm achieves better performance than Apriori-based algorithms since it finds frequent itemsets without generating any candidate itemsets. However, the importance of items to users is not considered in frequent itemset mining. Thus, the topic of weighted association rule mining was brought to attention. Cai et al.
first proposed the concept of weighted items and weighted association rules. However, the weighted association rule framework does not have the downward closure property. To address this problem, Tao et al. proposed the concept of the weighted downward closure property. By using transaction weight, weighted support can not only reflect the importance of an itemset but also maintain the downward closure property.

Thus, the issue of high utility itemset mining was raised, and many studies have addressed it. Liu et al. proposed an algorithm named Two-Phase, which is composed of two mining phases. In phase I, it employs an Apriori-based level-wise method to generate high transaction-weighted utility itemsets (HTWUIs). In phase II, the HTWUIs that are actually high utility itemsets are identified with an additional database scan. Although the Two-Phase algorithm reduces the search space by using the transaction-weighted downward closure (TWDC)
property, it still generates too many candidates to obtain the HTWUIs and requires multiple database scans. To overcome this problem, Li et al. proposed an isolated items discarding strategy (IIDS) to reduce the number of candidates. By pruning isolated items during the level-wise search, the number of candidate itemsets for HTWUIs in phase I can be reduced. However, this algorithm still scans the database several times and uses a candidate generation-and-test scheme to find high utility itemsets.

To generate HTWUIs efficiently in phase I and avoid scanning the database too many times, Ahmed et al. proposed a tree-based algorithm named IHUP. A tree-based structure called IHUP-Tree is used to maintain the information about itemsets and their utilities. Each node of an IHUP-Tree consists of an item name, a TWU value, and a support count. The IHUP algorithm has three steps: 1) construction of the IHUP-Tree, 2) generation of HTWUIs, and 3) identification of high utility itemsets. In step 1, the items in each transaction are rearranged in a fixed order such as lexicographic order, support descending order, or TWU descending order. The rearranged transactions are then inserted into an IHUP-Tree. For each node in the IHUP-Tree, the first number beside the item name is its TWU and the second is its support count. In step 2, HTWUIs are generated from the IHUP-Tree by applying FP-Growth. Thus, HTWUIs in phase I can be found without generating any candidates for HTWUIs. In step 3, high utility itemsets and their utilities are identified from the set of HTWUIs by scanning the original database once. Although IHUP achieves better performance than IIDS and Two-Phase, it still produces too many HTWUIs in phase I.
Note that IHUP and Two-Phase produce the same number of HTWUIs in phase I since they both use the TWU framework to overestimate itemsets' utilities. This framework may produce too many HTWUIs in phase I because the overestimated utility calculated by TWU is too large. Such a large number of HTWUIs substantially degrades the mining performance in phase I in terms of execution time and memory consumption. Moreover, the number of HTWUIs in phase I also affects the performance of phase II: the more HTWUIs the algorithm generates in phase I, the more execution time it needs to identify high utility itemsets in phase II.

As stated above, the number of generated HTWUIs is a critical issue for the performance of these algorithms. Therefore, this study aims at reducing the overestimated utilities of itemsets and proposes several strategies to do so. By applying the proposed strategies, the number of candidates generated in phase I can be greatly reduced, and high utility itemsets can be identified more efficiently in phase II.

3. PROPOSED METHODS

The three major steps of the proposed methods are:
1) Scan the database twice to construct the global UP-Tree.
2) Generate the potential high utility itemsets (PHUIs) using the UP-Growth and UP-Growth+ algorithms.
3) Identify the actual high utility itemsets from the set of PHUIs.

3.1 Construction of UP Tree

The UP-Tree is constructed after the original database has been reorganized by removing the unpromising items and their utilities. To facilitate the mining performance and avoid scanning the original database repeatedly, we use a compact tree structure named UP-Tree (Utility Pattern Tree) to maintain the information of the transactions and of the high utility itemsets.
By applying two strategies, the overestimated utilities stored in the nodes of the global UP-Tree are decreased. Two scans of the database are required for the construction of the UP-Tree.
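The two scans can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the transaction-weighted utility (TWU) of an item is taken in its usual sense as the sum of the transaction utilities (TU) of the transactions containing it, and the function names are hypothetical. The quantities for items A to E and the profits are those of the reorganized database and profit tables; the quantities for F and G are assumed values, since they cannot be recovered from the reorganized table. The threshold of 40 is the one used in Section 3.1.1.

```python
PROFIT = {"A": 5, "B": 2, "C": 1, "D": 2, "E": 3, "F": 1, "G": 1}

# Quantities for A-E follow the reorganized database table;
# the quantities for F and G are assumptions made for this sketch.
DB = {
    1: {"A": 1, "C": 1, "D": 1},
    2: {"A": 2, "C": 6, "E": 2, "G": 5},
    3: {"A": 1, "B": 2, "C": 1, "D": 6, "E": 1, "F": 5},
    4: {"B": 4, "C": 3, "D": 3, "E": 1},
    5: {"B": 2, "C": 2, "E": 1, "G": 2},
}


def first_scan(db, profit=PROFIT):
    """Scan 1: transaction utility (TU) per transaction, TWU per item."""
    tu = {tid: sum(profit[i] * q for i, q in t.items()) for tid, t in db.items()}
    twu = {}
    for tid, t in db.items():
        for item in t:
            twu[item] = twu.get(item, 0) + tu[tid]
    return tu, twu


def reorganize(db, twu, min_util, profit=PROFIT):
    """Scan 2 (DGU): drop items with TWU < min_util, sort the rest by
    descending TWU, and recompute the reorganized transaction utility."""
    promising = {i for i, w in twu.items() if w >= min_util}
    reorganized = {}
    for tid, t in db.items():
        kept = sorted((i for i in t if i in promising), key=lambda i: (-twu[i], i))
        rtu = sum(profit[i] * t[i] for i in kept)
        reorganized[tid] = ([(i, t[i]) for i in kept], rtu)
    return reorganized


if __name__ == "__main__":
    tu, twu = first_scan(DB)
    for tid, (items, rtu) in reorganize(DB, twu, min_util=40).items():
        print(tid, items, rtu)
```

Under these assumptions F and G fall below the threshold and are removed, and the remaining items come out in the order C, E, A, B, D with RTU values 8, 22, 25, 20, and 9.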
Table 3: Reorganized Database

TID   Transaction                    RTU
1     (C,1)(A,1)(D,1)                8
2     (C,6)(E,2)(A,2)                22
3     (C,1)(E,1)(A,1)(B,2)(D,6)      25
4     (C,3)(E,1)(B,4)(D,3)           20
5     (C,2)(E,1)(B,2)                9

Fig 3: Proposed Architecture

3.1.1 Discarding Global Unpromising Items (DGU)

The global UP-Tree can be constructed with two scans of the database. In the first scan, the transaction utility (TU) of each transaction is computed, and at the same time the transaction-weighted utility (TWU) of each single item is accumulated. An item whose TWU is less than the minimum utility threshold is called an unpromising item. The following figure shows how unpromising items and their utilities are removed from the transactions and the TUs. Suppose the minimum utility is 40. Items with TWU below the minimum utility are discarded; here the unpromising items are F and G. The reorganized transactions are then constructed together with their reorganized transaction utilities (RTU).

3.1.2 Decreasing Global Node Utilities (DGN)

The next step is to decrease the global node utilities. This is done by discarding the utilities of descendant nodes from each node's utility during the construction of the global UP-Tree.
Applying strategy DGN further reduces the utilities of the nodes that are closer to the root of the global UP-Tree. DGN is especially useful when the database contains many long transactions: the more items a transaction contains, the more utilities DGN can discard.

Global UP-Tree with Strategies DGU and DGN

The global UP-Tree is constructed with two database scans. In the first scan, each transaction's TU is computed and each single item's TWU is accumulated. From these values the promising and unpromising items can be determined. DGU is then applied to all unpromising items: the transactions are reorganized by pruning the unpromising items and sorting the remaining promising items in a fixed order. The ordering can follow any scheme, such as alphabetic, support, or TWU order. Each transaction after this reorganization is called a reorganized transaction. A function Insert Reorganized Transaction is then called to apply DGN while constructing the global UP-Tree.

3.2 UP-Growth Mining Method

The UP-Growth algorithm uses two strategies, Discarding Local Unpromising Items (DLU) and Decreasing Local Node Utilities (DLN), to identify high utility itemsets from the UP-Tree. With these two strategies, the overestimated utilities of itemsets are decreased and the number of PHUIs is further reduced.

Subroutine: UP-Growth (Tx, Hx, X)
Input: A UP-Tree Tx, a header table Hx, and an itemset X.
Output: All PHUIs in Tx.
Procedure UP-Growth (Tx, Hx, X)
(1) For each entry ai in Hx do
(2)   Generate a PHUI Y = X ∪ ai;
(3)   Set the estimated utility of Y to ai's utility value in Hx;
(4)   Construct Y's conditional pattern base Y-CPB;
(5)   Put the local promising items in Y-CPB into Hy;
(6)   Apply strategy DLU to reduce the path utilities of the paths;
(7)   Apply strategy DLN and insert the paths into Ty;
(8)   If Ty ≠ null then call UP-Growth (Ty, Hy, Y);
(9) End for

3.2.1 Discarding Local Unpromising Items (DLU)

Mining consists of three steps: generating the conditional pattern base, constructing the conditional tree, and mining patterns from the conditional tree. Since the actual utilities of items are not maintained in the global UP-Tree, the DGU and DGN strategies cannot be applied directly to the conditional trees. One remedy would be to store the actual utility of every item of every transaction in each node, but this uses memory inefficiently. Instead, the DLU and DLN strategies are used. In both strategies, a minimum item utility table keeps the minimum item utility of every promising item, and these minimum item utilities serve as estimates of the utilities to be removed. For DLU, the estimated utility of each local unpromising item is subtracted from the path utility of each extracted path that contains it, as described by the following diagram.
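The DLU step described above can be sketched as follows. This is a hedged illustration, not the authors' code. It assumes that a conditional pattern base is a list of (path items, path utility, support count) triples, that an item is locally unpromising when the path utilities of the paths containing it sum to less than the minimum utility, and that the estimate subtracted for an unpromising item is its minimum item utility times the path's support count. The pattern base shown for item D is the one a UP-Tree built from the reorganized database table would plausibly yield; treat its values, and the function name `apply_dlu`, as illustrative.

```python
# Minimum item utilities derived from the reorganized database and profit tables:
# the smallest profit x quantity each item takes in any transaction.
MIU = {"A": 5, "B": 4, "C": 1, "D": 2, "E": 3}

# Illustrative conditional pattern base for item D:
# (items on the path, path utility, support count)
D_CPB = [
    (["C", "A"], 8, 1),
    (["C", "E", "A", "B"], 25, 1),
    (["C", "E", "B"], 20, 1),
]


def apply_dlu(cpb, miu, min_util):
    """Remove local unpromising items from each path and subtract their
    estimated utilities (minimum item utility x count) from the path utility."""
    # Local overestimate per item: sum of the utilities of paths containing it.
    local = {}
    for items, path_utility, _count in cpb:
        for item in items:
            local[item] = local.get(item, 0) + path_utility
    unpromising = {i for i, v in local.items() if v < min_util}

    reduced = []
    for items, path_utility, count in cpb:
        kept = [i for i in items if i not in unpromising]
        discount = sum(miu[i] * count for i in items if i in unpromising)
        if kept:  # a path with no promising items contributes nothing
            reduced.append((kept, path_utility - discount, count))
    return unpromising, reduced


if __name__ == "__main__":
    print(apply_dlu(D_CPB, MIU, min_util=40))
```

In this example only A is locally unpromising, so the two paths containing A each lose 5 from their path utility. A fuller version would also re-sort the remaining items of each path before inserting them into the conditional tree; that step is omitted here.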
3.2.2 Decreasing Local Node Utilities (DLN)

The conditional UP-Tree of an item contains no information about the items below that item, so the utilities of the descendant nodes related to the item can be discarded. Because the actual utilities of the descendant nodes are not known, the minimum item utilities are used to estimate the discarded utilities.

3.3. UP Growth+ Mining Method

In the UP-Growth+ algorithm, the minimal node utility in each path is used to bring the estimated pruning values closer to the real utility values of the pruned itemsets in the database. The minimal node utility of each node in the global UP-Tree is acquired during the construction of the tree and is recorded in that node. With the strategy of Discarding local unpromising items and their estimated Node Utilities (DNU), local unpromising items and their estimated node utilities are discarded from the paths and path utilities of the conditional pattern bases, and the path utility of each path p in the conditional pattern base is recalculated. With the strategy of Decreasing local Node utilities for the Nodes of the local UP-Tree (DNN), the utilities of items are further reduced.

3.4. Association Rule Generation

Association rules are generated based on the support count and the minimum threshold values set by the users. Since the database is large and users are concerned only with frequently purchased items, thresholds of support and confidence are usually predefined by users to drop the rules that are not interesting or useful. In many cases, algorithms generate an extremely large number of association rules, often thousands or even millions. It is nearly impossible for end users to comprehend or validate such a large number of complex association rules. In this work, only strong association rules, those that satisfy both thresholds, are produced from the itemsets found by UP-Growth+.

4. EXPERIMENTAL RESULTS

The performance of the proposed algorithm is evaluated in this section. The experiments were performed on a 2.80 GHz Intel Pentium D processor with 2 GB of memory running Microsoft Windows 7. The algorithms are implemented in Java. Synthetic data sets, produced by a data generator, are used in the experiments. The generator parameters are as follows: D is the total number of transactions; T is the average size of transactions; N is the number of distinct items; I is the average size of maximal potential frequent itemsets.

For the proposed algorithms, two methods, UPT&UPG and UPT&UPG+, are designed. The former combines UP-Tree (DGU and DGN) with UP-Growth (DLU and DLN), and the latter combines UP-Tree with UP-Growth+ (DNU and DNN). In phase II, these algorithms identify high utility itemsets by scanning the reorganized database, which contains no unpromising items. By applying the DGN strategy, the node utilities of the nodes in the global UP-Tree are effectively decreased since the utilities of their descendants are discarded. By applying the DLU strategy, local unpromising items are removed from the paths of the conditional pattern base and their minimum item utilities are subtracted from the path utilities. By applying DLN, the node utilities of the nodes in a local UP-Tree are decreased since the minimum item utilities of their descendants are discarded.
Fig. 4.1: Reorganized Database

Figure 4.1 shows the reorganized database after removing the unpromising items and their utilities. By applying strategy DGU, global unpromising items and their utilities are discarded from the transactions and TUs.

Fig. 4.2: UP Tree

Figure 4.2 shows the UP-Tree data structure constructed from the reorganized database. UP+UPG means the PHUIs are generated from the UP-Tree by applying UP-Growth+. In the above algorithms, the UP-Tree is constructed after removing the unpromising items from the original database, which requires only two scans of the database. The items in a transaction are rearranged in descending order of the global TWU during the construction of both UP-Trees.
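The tree construction summarized above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' Java implementation: each node keeps an item name, a support count, and a node utility; under DGN the utility added to a node is the utility of the transaction prefix ending at that node, that is, the reorganized transaction utility minus the utilities of the items after it. The header table and node links are omitted, and the names `Node` and `build_up_tree` are hypothetical. The input is the reorganized database table.

```python
PROFIT = {"A": 5, "B": 2, "C": 1, "D": 2, "E": 3}

# Reorganized transactions, items already in descending TWU order.
REORGANIZED = [
    [("C", 1), ("A", 1), ("D", 1)],
    [("C", 6), ("E", 2), ("A", 2)],
    [("C", 1), ("E", 1), ("A", 1), ("B", 2), ("D", 6)],
    [("C", 3), ("E", 1), ("B", 4), ("D", 3)],
    [("C", 2), ("E", 1), ("B", 2)],
]


class Node:
    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0        # number of transactions sharing this prefix
        self.nu = 0           # node utility (an overestimate)
        self.parent = parent
        self.children = {}


def insert_reorganized_transaction(root, transaction, profit):
    """Insert one reorganized transaction, applying DGN: each node receives
    only the utility of the items from the root down to itself."""
    node, prefix_utility = root, 0
    for item, quantity in transaction:
        prefix_utility += profit[item] * quantity
        child = node.children.get(item)
        if child is None:
            child = node.children[item] = Node(item, node)
        child.count += 1
        child.nu += prefix_utility
        node = child


def build_up_tree(transactions, profit=PROFIT):
    root = Node()
    for transaction in transactions:
        insert_reorganized_transaction(root, transaction, profit)
    return root


if __name__ == "__main__":
    c = build_up_tree(REORGANIZED).children["C"]
    print(c.item, c.count, c.nu)
```

For this input the root has a single child C with count 5, because every reorganized transaction starts with C, and its node utility is far smaller than the sum of the five RTU values, which is the effect DGN is meant to achieve.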
Fig. 4.3: High Utility Itemsets

Figure 4.3 shows the potential high utility itemsets mined from the conditional pattern bases.

Fig. 4.4: Association Rule Generation

Figure 4.4 shows the association rules that satisfy both the minimum support count and the minimum confidence value for the high utility itemsets.

5. CONCLUSION

The UP-Growth+ algorithm is used to discover high utility itemsets from a transaction database. By scanning the original database twice, the potential high utility itemsets are generated from the UP-Tree. In this project, four strategies, namely Discarding Global Unpromising Items, Decreasing Global Node Utilities, Discarding Local Unpromising Items, and Decreasing Local Node Utilities, have been developed to decrease the estimated utility values and to improve the mining performance in utility mining. Potential high utility itemsets are found in the experiments, and synthetic data sets are used to evaluate the performance of the UP-Growth+ algorithm. The mining performance is improved since both the search space and the number of candidates are reduced by the four strategies. In addition, association rules are generated from the most profitable itemsets that satisfy the minimum support count and the minimum confidence.
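The rule-generation step of Section 3.4 can be sketched as follows, using the standard definitions of support count and confidence. The transactions are the item sets of the reorganized database table; the input itemset {B, E}, the thresholds, and the function names are hypothetical choices for illustration, since the paper's figures with the actual itemsets and rules are not reproduced here.

```python
from itertools import combinations

# Item sets of the five reorganized transactions (quantities are not needed here).
TRANSACTIONS = [
    {"C", "A", "D"},
    {"C", "E", "A"},
    {"C", "E", "A", "B", "D"},
    {"C", "E", "B", "D"},
    {"C", "E", "B"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t)


def generate_rules(itemsets, transactions, min_sup, min_conf):
    """For each itemset X, emit rules lhs -> (X - lhs) whose support count
    and confidence both meet the user-defined thresholds."""
    rules = []
    for x in itemsets:
        x = frozenset(x)
        sup_x = support_count(x, transactions)
        if len(x) < 2 or sup_x == 0 or sup_x < min_sup:
            continue
        for k in range(1, len(x)):
            for lhs in combinations(sorted(x), k):
                confidence = sup_x / support_count(lhs, transactions)
                if confidence >= min_conf:
                    rhs = tuple(sorted(x - set(lhs)))
                    rules.append((lhs, rhs, sup_x, confidence))
    return rules


if __name__ == "__main__":
    for lhs, rhs, sup, conf in generate_rules([{"B", "E"}], TRANSACTIONS, 2, 0.8):
        print(lhs, "->", rhs, "support", sup, "confidence", conf)
```

With a minimum support count of 2 and a minimum confidence of 0.8, only B -> E is kept: every transaction containing B also contains E, whereas E appears in one transaction without B, so E -> B reaches a confidence of only 0.75.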
REFERENCES

[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int'l Conf. Very Large Data Bases (VLDB), pp. 487-499, 1994.
[2] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proc. ACM-SIGMOD Int'l Conf. Management of Data, pp. 1-12, 2000.
[3] Ying Liu, Wei-keng Liao, and Alok Choudhary, "A Fast High Utility Itemsets Mining Algorithm," Proc. Utility-Based Data Mining Workshop, 2005.
[4] Y.-C. Li, J.-S. Yeh, and C.-C. Chang, "Isolated Items Discarding Strategy for Discovering High Utility Itemsets," Data and Knowledge Eng., vol. 64, no. 1, pp. 198-217, Jan. 2008.
[5] C.F. Ahmed, S.K. Tanbeer, B.-S. Jeong, and Y.-K. Lee, "Efficient Tree Structures for High Utility Pattern Mining in Incremental Databases," IEEE Trans. Knowledge and Data Eng., vol. 21, no. 12, pp. 1708-1721, Dec. 2009.
[6] Vincent S. Tseng, Bai-En Shie, Cheng-Wei Wu, and Philip S. Yu, "Efficient Algorithms for Mining High Utility Itemsets from Transactional Databases," IEEE Trans. Knowledge and Data Eng., vol. 25, no. 8, Aug. 2013.
[7] V.S. Tseng, C.-W. Wu, B.-E. Shie, and P.S. Yu, "UP-Growth: An Efficient Algorithm for High Utility Itemsets Mining," Proc. 16th ACM SIGKDD Conf. Knowledge Discovery and Data Mining (KDD '10), pp. 253-262, 2010.
[8] A. Erwin, R.P. Gopalan, and N.R. Achuthan, "Efficient Mining of High Utility Itemsets from Large Datasets," Proc. PAKDD 2008, LNAI 5012, pp. 554-561.
[9] Vijay Arputharaj J and Dr. R. Manicka Chezian, "Data Mining with Human Genetics to Enhance Gene Based Algorithm and DNA Database Security," International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 176-181, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[10] M. Karthikeyan, M. Suriya Kumar, and Dr. S. Karthikeyan, "A Literature Review on the Data Mining and Information Security," International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 141-146, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[11] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proc. ACM-SIGMOD Int'l Conf. Management of Data, pp. 1-12, 2000.
[12] C.H. Cai, A.W.C. Fu, C.H. Cheng, and W.W. Kwong, "Mining Association Rules with Weighted Items," Proc. Int'l Database Eng. and Applications Symp. (IDEAS '98), pp. 68-77, 1998.
[13] R. Manickam, D. Boominath, and V. Bhuvaneswari, "An Analysis of Data Mining: Past, Present and Future," International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 1-9, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[14] Rajesh V. Argiddi and Sulabha S. Apte, "A Study of Association Rule Mining in Fragmented Item-Sets for Prediction of Transactions Outcome in Stock Trading Systems," International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 2, 2012, pp. 478-486, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[15] R. Chan, Q. Yang, and Y. Shen, "Mining High Utility Itemsets," Proc. IEEE Third Int'l Conf. Data Mining, pp. 19-26, Nov. 2003.
[16] J. Han and Y. Fu, "Discovery of Multiple-Level Association Rules from Large Databases," Proc. 21th Int'l Conf. Very Large Data Bases, pp. 420-431, Sept. 1995.