IMPROVING DATA MINING EFFICIENCY BY PREDICTIVE ITEMSETS

Tzung-Pei Hong, Dept. of Electrical Engineering, National Univ. of Kaohsiung, tphong@nuk.edu.tw
Chyan-Yuan Horng, Inst. of Info. Engineering, I-Shou University, hcy@ms4.url.com.tw
Shyue-Liang Wang, Dept. of Info. Management, National Univ. of Kaohsiung, slwang@nuk.edu.tw

ABSTRACT

In this paper, we propose a novel mining algorithm to improve the efficiency of finding large itemsets. The proposed algorithm is based on Denwattana and Getta's prediction concept and considers the data dependency in the given transactions. It aims at efficiently finding any p levels of large itemsets by scanning the database twice except for the first level. A new, reasonable estimation method is proposed to predict promising and non-promising candidate itemsets flexibly. The proposed approach estimates for each level a different support threshold, which is derived from a data dependency parameter. Also, it determines whether an item should be included in a promising candidate itemset directly from the supports of items, which are easily obtained in the phase of finding large 1-itemsets.

1. INTRODUCTION

Among the mining techniques proposed, finding association rules from transaction databases is the most commonly seen [1][3][5][8][9][10][11][13][14]. In the past, many algorithms for mining association rules from transactions were proposed, most of which were executed in level-wise processes. Denwattana and Getta proposed an interesting mining algorithm to reduce the number of database scans in the process of finding large itemsets [7]. Their approach first found the large itemsets of single items by one database scan. After that, it could find each p more levels of large itemsets by two database scans, where p could be arbitrarily assigned. Denwattana and Getta's approach used a heuristic estimation method to predict the possibly large itemsets. If the prediction was valid, the approach was efficient in finding the actually large itemsets. The estimation method was thus very critical to the efficiency of the approach.

In this paper, we propose a novel mining algorithm to improve the efficiency of finding large itemsets. Our approach is based on Denwattana and Getta's prediction concept and considers the data dependency in the given transactions. As Denwattana and Getta's approach did, our proposed mining algorithm aims at efficiently finding any p levels of large itemsets by scanning the database twice except for the first level. A new, reasonable estimation method is proposed to predict promising and non-promising candidate itemsets flexibly. Different from the approach in [7], our proposed approach estimates for each level a different support threshold, which is derived from a data dependency parameter. Also, it determines whether an item should be included in a promising candidate itemset directly from the supports of items, which are easily obtained in the phase of finding large 1-itemsets.

2. REVIEW OF MINING ASSOCIATION RULES

One application of data mining is to induce association rules from transaction data, such that the presence of certain items in a transaction implies the presence of certain other items. To achieve this purpose, Agrawal and his co-workers proposed several mining algorithms based on the concept of large itemsets to find association rules in transaction data [1][3][5]. They divided the mining process into two phases. In the first phase, candidate itemsets were generated and counted by scanning the transaction data. If the count of an itemset appearing in the transactions was larger than a pre-defined threshold value (called the minimum support), the itemset was considered a large itemset. Itemsets containing only one item were processed first. Large itemsets containing single items were then combined to form candidate itemsets containing two items. This process was repeated until all large itemsets had been found. In the second phase, association rules were induced from the large itemsets found in the first phase. All possible association combinations for each large itemset were formed, and those with calculated confidence values larger than a predefined threshold (called the minimum confidence) were output as association rules.
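To make the level-wise process concrete, a minimal sketch of the first phase is given below in Python. It is only an illustration of the idea reviewed above, not the implementation from [1][3][5]; the function name, the data representation (transactions as collections of items) and all identifiers are ours.

    from itertools import combinations

    def apriori_large_itemsets(transactions, min_support):
        """Level-wise search for large itemsets (phase one reviewed above)."""
        n = len(transactions)
        transactions = [set(t) for t in transactions]
        # Level 1: keep the single items whose support reaches min_support.
        items = {i for t in transactions for i in t}
        current = {frozenset([i]) for i in items
                   if sum(i in t for t in transactions) / n >= min_support}
        all_large = []
        r = 1
        while current:
            all_large.append(current)
            # Join: merge large r-itemsets sharing r-1 items into candidates.
            candidates = {a | b for a in current for b in current
                          if len(a | b) == r + 1}
            # Prune: every r-item subset of a candidate must itself be large.
            candidates = {c for c in candidates
                          if all(frozenset(s) in current
                                 for s in combinations(c, r))}
            # Scan: count each surviving candidate against the transactions.
            current = {c for c in candidates
                       if sum(c <= t for t in transactions) / n >= min_support}
            r += 1
        return all_large

Run on the transactions of Table 1 in Section 7 with a minimum support of 0.35, such a sketch should reproduce the final large itemsets listed there.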
3. REVIEW OF DENWATTANA AND GETTA'S APPROACH

Denwattana and Getta proposed an approach to find each p levels of large itemsets by two database scans [7]. Their approach partitioned the candidate itemsets into two parts: positive candidate itemsets (C+) and negative candidate itemsets (C−). Positive candidate itemsets were guessed to be large, and negative candidate itemsets were guessed to be small. Two parameters, called the m-element transaction threshold tt and the frequency threshold tf, were used to judge whether an item could compose a positive candidate itemset. For each integer j, j ≤ tt, the frequency of each item appearing in transactions with j items was found. If at least one frequency of an item was larger than or equal to the fixed frequency threshold tf, the item could be used to compose a positive candidate itemset.
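This item test can be sketched as follows (our own reading of [7]; only the parameter names tt and tf come from the text above, the rest is an assumed representation):

    def can_compose_positive(item, transactions, tt, tf):
        """The item passes if, for some transaction length j <= tt, its
        frequency among the transactions of exactly j items reaches the
        frequency threshold tf."""
        for j in range(1, tt + 1):
            frequency = sum(item in t for t in transactions if len(t) == j)
            if frequency >= tf:
                return True
        return False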
Their approach first found the large itemsets of single items by one database scan. After that, it could find each p more levels of large itemsets by two database scans, where p could be arbitrarily assigned. The approach then formed the positive candidate 2-itemsets C2+, each of which has its two items satisfying the above criteria. The remaining candidate 2-itemsets, not being in C2+, formed C2−. C3+ and C3− were then formed from only C2+ in a similar way. The positive candidate 2-itemsets which were subsets of the itemsets in C3+ were then removed from C2+. The same process was repeated until p levels of positive and negative candidate itemsets were formed. The database was then scanned to check whether the itemsets in the positive candidate itemsets were actually large and whether the itemsets in the negative candidate itemsets were actually small. The itemsets incorrectly guessed were then expanded and processed by a second database scan.

In this paper, we propose a novel mining algorithm based on Denwattana and Getta's prediction concept. A new, reasonable estimation method is proposed to predict promising and non-promising candidate itemsets flexibly. Different from the approach in [7], our proposed approach estimates for each level a different support threshold, which is derived from a data dependency parameter. Also, it determines whether an item should be included in a promising candidate itemset directly from the supports of items, which are easily obtained in the phase of finding large 1-itemsets.

4. NOTATION

The notation used in this paper is defined below.

n: the number of transactions;
m: the number of items;
α: the minimum support value;
β: the minimum confidence value;
Ai: the i-th item, 1 ≤ i ≤ m;
counti: the number of occurrences of Ai in the set of transactions;
supporti: the support of Ai, calculated as counti / n;
p: the number of levels to be processed in a phase;
r: the number of items in the itemsets currently being processed;
r': the number of items in the itemsets at the end of a phase;
w: the dependency parameter used to predict the possible support threshold (called the predicting minimum support) for an item to appear in an r-itemset;
Pr: the set of items predicted to appear in r-itemsets;
Lr: the set of large itemsets with r items;
Cr: the set of candidate itemsets with r items;
Cr+: the set of promising candidate itemsets in Cr, with each item existing in Pr;
Cr−: the set of non-promising candidate itemsets in Cr, with at least one item not existing in Pr;
NC+: the itemsets to be checked in the second database scan due to the incorrectly predicted promising itemsets in a phase;
NC−: the itemsets to be checked in the second database scan due to the incorrectly predicted non-promising itemsets in a phase.

5. THEORETICAL FOUNDATION

From a probabilistic viewpoint, items usually must have greater support values to be covered in large itemsets with more items. At one extreme, if total dependency relations exist in the transactions, an appearance of one item will certainly imply the appearance of another. In this case, the support thresholds for an item to appear in large itemsets with different numbers of items are the same. At the other extreme, if the items are totally independent, then the support thresholds for an item to appear in large itemsets with different numbers of items should be set at different values. In this case, the support threshold for an item to appear in a large itemset with r items can easily be derived as below.

Since all the r items in a large r-itemset must be large 1-itemsets, all the supports of the r items must be larger than or equal to the predefined minimum support α. Since the items are assumed totally independent, the support of the r-itemset is s1 × s2 × ... × sr, where si is the actual support of the i-th item in the itemset. If this r-itemset is large, its support must be larger than or equal to α. Thus:

s1 × s2 × ... × sr ≥ α.

If the predictive support threshold for an item to appear in a large r-itemset is αr, then:

s1 × s2 × ... × sr ≥ αr × αr × ... × αr ≥ α.

Thus:

(αr)^r ≥ α.

It implies:

αr ≥ α^(1/r).

Therefore, if the items are totally independent, the support threshold of an item should be expected to be α^(1/r) for it to be included in a large r-itemset.

Since transactions are seldom totally dependent or totally independent, a data dependency parameter w, ranging between 0 and 1, is then used to calculate the predictive support threshold of an item for appearing in a large r-itemset as:

wα + (1 − w)α^(1/r).

A larger w value represents a stronger item relationship existing in the transactions: w = 1 means total dependency among the transaction items and w = 0 means total independence. The proposed approach thus uses different predictive support thresholds for an item to be included in promising itemsets with different numbers of items.
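The threshold itself is a one-line computation. The helper below (our own naming; the paper's experiments were written in VB, so this Python form is purely illustrative) makes the role of w explicit:

    def predicting_minimum_support(alpha, w, r):
        """Predicting minimum support for an item to appear in a promising
        r-itemset: a weighted average of the totally dependent threshold
        (alpha) and the totally independent threshold (alpha ** (1 / r))."""
        return w * alpha + (1 - w) * alpha ** (1 / r)

With w = 1 the threshold stays at alpha for every level, and with w = 0 it rises toward 1 as r grows, matching the two extremes discussed above.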
6. THE PROPOSED MINING ALGORITHM

The proposed mining algorithm aims at efficiently finding any p levels of large itemsets by scanning the database twice except for the first level. The support of each item from the first database scan is directly used to predict whether the item will appear in an itemset. The proposed method uses a higher predicting minimum support for an item to be included in a promising itemset with more items. Itemsets with different numbers of items thus have different predicting minimum supports for an item. A predicting minimum support is calculated as a weighted average of the possible minimum supports for totally dependent data and for totally independent data, with the data dependency parameter, ranging between 0 and 1, used as the weight. A mining process similar to that proposed in [7] can then be adopted to find the p levels of large itemsets. The details of the proposed mining algorithm are described below.

The proposed mining algorithm:

INPUT: A set of n transactions with m items, a minimum support value α, a minimum confidence value β, a dependency parameter w, and a level number p.
OUTPUT: A set of association rules.

STEP 1: Calculate the number (counti) of occurrences of each item Ai in the set of transactions; set the support (supporti) of each item Ai as counti / n.
STEP 2: Check whether the support of each item Ai is larger than or equal to the predefined minimum support α. If the support of Ai is equal to or greater than α, put Ai in the set of large 1-itemsets L1.
STEP 3: Set r = 1, where r is used to represent the number of items in the itemsets currently being processed.
STEP 4: Set r' = 1, where r' is used to record the number of items at the end of a phase.
STEP 5: Set P1 = L1, where Pr is used to predict the items to be included in r-itemsets.
STEP 6: Generate the candidate set C(r+1) from Lr in a way similar to that in the Apriori algorithm [3]. That is, the algorithm first joins Lr and Lr, assuming that r−1 items in the two itemsets are the same and the other one is different. It then keeps in C(r+1) the itemsets which have all their sub-itemsets of r items existing in Lr.
STEP 7: Set r = r + 1.
STEP 8: Check whether the support of each item Ai in P(r−1) is larger than or equal to the predicting minimum support wα + (1 − w)α^(1/r) for being included in predicted r-itemsets. If the support of Ai is equal to or greater than the predicting minimum support, put Ai in the set of predicted large items (Pr) for r-itemsets.
STEP 9: Form the promising candidate itemsets Cr+ by choosing from Cr the itemsets with each item existing in Pr.
STEP 10: Set the non-promising candidate itemsets: Cr− = Cr − Cr+.
STEP 11: Set r = r + 1.
STEP 12: Generate the candidate set Cr from C(r−1)+ in a way similar to that in the Apriori algorithm [3]. That is, the algorithm first joins C(r−1)+ and C(r−1)+, assuming that r−2 items in the two itemsets are the same and the other one is different. It then keeps in Cr the itemsets which have all their sub-itemsets of r−1 items existing in C(r−1)+.
STEP 13: Check whether the support of each item Ai in P(r−1) is larger than or equal to the predicting minimum support wα + (1 − w)α^(1/r). If the support of Ai is equal to or greater than the predicting minimum support, put Ai in the set of predicted large items (Pr) for r-itemsets, and form the promising candidate itemsets Cr+ as in STEP 9.
STEP 14: Set the non-promising candidate itemsets: Cr− = Cr − Cr+.
STEP 15: Remove from C(r−1)+ the itemsets which are subsets of any itemset in Cr+.
STEP 16: Repeat STEPs 11 to 15 until r = r' + p.
STEP 17: Scan the database to check whether the promising candidate itemsets C(r'+1)+ to Cr+ are actually large and whether the non-promising candidate itemsets C(r'+1)− to Cr− are actually not large. Put the actually large itemsets in the corresponding sets L(r'+1) to Lr.
STEP 18: Find all the proper subsets with r'+1 to i items of each itemset which is not large in Ci+, r'+1 ≤ i ≤ r; keep the proper subsets which are not among the existing large itemsets; denote them as NC+.
STEP 19: Find all the proper supersets with i to r items of each itemset which is large in Ci−, r'+1 ≤ i ≤ r; the supersets must also have all their sub-itemsets of r' items existing in Lr' and cannot include any sub-itemset among the non-large itemsets in Ci+ and Ci− checked in STEP 17; denote them as NC−.
STEP 20: Scan the database to check whether the itemsets in NC+ and NC− are large; add the large itemsets to the corresponding sets L(r'+1) to Lr.
STEP 21: If Lr is not null, set r' = r' + p and go to STEP 5 for another phase; otherwise do the next step.
STEP 22: Add the non-redundant subsets of the large itemsets to the corresponding sets L2 to Lr.
STEP 23: Derive the association rules with confidence values larger than or equal to β from the large itemsets L2 to Lr.

After STEP 22, all the large itemsets for the transactions have been determined. The final association rules can then be derived in STEP 23 in the same way as other mining approaches use.
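To illustrate how STEPs 8 to 10 work from the item supports alone, without any database scan, the fragment below sketches the per-level partition of candidates (our own data representation and names, not the authors' code):

    def split_candidates(candidates, item_support, alpha, w, r):
        """Partition candidate r-itemsets into promising (Cr+) and
        non-promising (Cr-) sets using only the supports of single items."""
        threshold = w * alpha + (1 - w) * alpha ** (1 / r)
        # Pr: the items whose support reaches the predicting minimum support.
        predicted = {item for item, s in item_support.items() if s >= threshold}
        # Cr+: candidates whose every item lies in Pr; the rest form Cr-.
        promising = {c for c in candidates if c <= predicted}
        return promising, candidates - promising

In the example of the next section, with alpha = 0.35, w = 0.5 and r = 2, the support of item D (0.4) falls below the threshold 0.471, so every candidate containing D would land in the non-promising set.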
7. AN EXAMPLE

In this section, an example is given to illustrate the proposed data-mining algorithm. Assume the database shown in Table 1, which includes 10 transactions.

Table 1. A database used as an example

ID  Items
1   A B C D E F H
2   A B F G H
3   A B C E F
4   A B E H
5   A B C D E F
6   A B C E
7   B C D E F H
8   A C E F
9   A D E G H
10  A B C E F

Each transaction is composed of a transaction identifier and the items purchased. There are eight items to be purchased, respectively A, B, C, D, E, F, G and H. The proposed algorithm processes this set of transactions as follows. The count and support of each item are found by scanning the database. The results are shown in Table 2.

Table 2. The count and support of each item

Item  Count  Support
A     9      0.9
B     8      0.8
C     7      0.7
D     4      0.4
E     9      0.9
F     7      0.7
G     2      0.2
H     5      0.5

The support of each item is compared with the predefined minimum support value α. Assume α is set at 0.35 in this example. Since the supports of {A}, {B}, {C}, {D}, {E}, {F} and {H} are larger than or equal to 0.35, they are put in L1. Also, P1 is the same as L1, which is {A, B, C, D, E, F, H}. The candidate set C2 is then formed from L1. Assume the dependency parameter w is set at 0.5 in this example. The predicting minimum support value for an item to be in promising 2-itemsets is calculated as:

α' = 0.5 × 0.35 + (1 − 0.5) × 0.35^(1/2) = 0.471.

The support of each item in P1 is then compared with 0.471. Since the supports of {A}, {B}, {C}, {E}, {F} and {H} are larger than 0.471, P2 is {A, B, C, E, F, H}.
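These threshold values are easy to verify; the check below reproduces the level-2 value just computed and the level-3 value used below (plain arithmetic from the Section 5 formula):

    # Predicting minimum supports for alpha = 0.35 and w = 0.5.
    w, alpha = 0.5, 0.35
    print(round(w * alpha + (1 - w) * alpha ** (1 / 2), 3))  # 0.471 (r = 2)
    print(round(w * alpha + (1 - w) * alpha ** (1 / 3), 3))  # 0.527 (r = 3)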
The itemsets in C2 with each item existing in P2 are chosen to form the promising candidate itemsets C2+. For example, AB is in C2+ since both A and B are in P2. AD is, however, not in C2+ since D is not in P2, although A is in P2. C2+ is thus formed as follows:

C2+ = {AB, AC, AE, AF, AH, BC, BE, BF, BH, CE, CF, CH, EF, EH, FH}.

The non-promising candidate itemsets C2− are found as:

C2− = C2 − C2+ = {AD, BD, CD, DE, DF, DH}.

Similarly, the predicting minimum support value for an item to be in promising 3-itemsets is calculated as 0.527. C3+ and C3− are thus formed as follows:

C3+ = {ABC, ABE, ABF, ACE, ACF, AEF, BCE, BCF, BEF, CEF}.
C3− = C3 − C3+ = {ABH, ACH, AEH, AFH, BEH, BFH, CEH, CFH}.

Since all the itemsets in C2+ except AH, BH, CH and EH are subsets of itemsets in C3+, they are removed from C2+. Thus:

C2+ = {AH, BH, CH, EH}.

Similarly,

C4+ = {ABCE, ABCF, ABEF, ACEF, BCEF}.
C4− = C4 − C4+ = φ.

The itemsets in C3+ which are subsets of itemsets in C4+ are removed from C3+. Thus:

C3+ = φ.

The database is then scanned to check whether the promising candidate itemsets in C2+ to C4+ are actually large and whether the non-promising candidate itemsets in C2− to C4− are actually not large. The set of large 2-itemsets in C2+ is {AH, BH, EH} and in C2− is {DE}. AH, BH, EH and DE are then put in L2. The itemset CH in C2+ and the itemset DE in C2− are incorrectly predicted. Similarly, L3 is found to be φ and L4 is found to be {ABCE, ABCF, ABEF, ACEF, BCEF}. All the itemsets in C3+, C3−, C4+ and C4− are predicted correctly.

Next, the proper subsets of the itemsets incorrectly predicted in Ci+ are generated. Since only {CH} is incorrectly predicted in this example, its proper subsets with 2 to 4 items and not among the existing large itemsets are φ. Thus NC+ = φ.

Also, the proper supersets of the itemsets incorrectly predicted in Ci− are generated. Since only {DE} is incorrectly predicted in this example, its proper supersets with 3 items and not among the existing large itemsets are {ADE}, {BDE}, {CDE}, {FDE} and {HDE}. Since all of the above supersets contain at least one sub-itemset which is not large in C2− (from STEP 17), they are not possibly large. The proper supersets of {DE} with 4 items and not among the existing large itemsets are also not possibly large. Thus NC− = φ.

The database would then be scanned to find the large itemsets in NC+ and NC−. Since both NC+ and NC− are empty in this example, no scan is needed. The large itemsets L2 to L4 are then found as follows:

L2 = {AH, BH, EH, DE},
L3 = φ, and
L4 = {ABCE, ABCF, ABEF, ACEF, BCEF}.

Since L4 = {ABCE, ABCF, ABEF, ACEF, BCEF} is not null, the next phase is executed. STEPs 5 to 20 are then repeated for L5 to L7. The results are as follows:

L5 = {ABCEF}, L6 = φ, and L7 = φ.

The non-redundant subsets of the found large itemsets are added to the corresponding sets L2 to L5. The final large itemsets L2 to L5 are then as follows:

L2 = {AB, AC, AE, AF, AH, BC, BE, BF, BH, CE, CF, DE, EF, EH}.
L3 = {ABC, ABE, ABF, ACE, ACF, AEF, BCE, BCF, BEF, CEF}.
L4 = {ABCE, ABCF, ABEF, ACEF, BCEF}.
L5 = {ABCEF}.
8. EXPERIMENTAL RESULTS

The experiments were implemented in VB on a Pentium-IV 2.0 GHz personal computer. There were 8 items to be purchased, and 10000 transactions were run through both the proposed algorithm and the Apriori algorithm. The transactions were randomly generated, with each item having a different probability of appearing, and no item could be generated twice in a transaction.

Experiments were then made to compare the performance of the proposed approach and the Apriori approach, to show the effect of predictive itemsets. The relationships between execution time and minimum support for the proposed algorithm with w = 0.5 and for the Apriori algorithm are shown in Figure 1.

Figure 1. A comparison of the proposed algorithm with w = 0.5 and the Apriori algorithm: execution time (sec.) against minimum support.

From Figure 1, it is easily seen that the proposed algorithm is more efficient than the Apriori algorithm when the minimum support value lies below about 0.7. This is because when the minimum support values are quite large, the numbers of large itemsets become very small, and the time saved by pruning candidate itemsets in the proposed algorithm does not cover the additional overhead. The proposed algorithm is thus suitable for low or middle minimum support values.

9. CONCLUSIONS

In this paper, we have proposed a novel mining algorithm to improve the efficiency of finding large itemsets. The proposed algorithm can efficiently find any p levels of large itemsets by scanning the database twice except for the first level. The proposed approach estimates for each level a different support threshold, which is derived from a data dependency parameter. It determines whether an item should be included in a promising candidate itemset directly from the supports of items, which are easily obtained in the phase of finding large 1-itemsets. An example has also been given to illustrate the algorithm clearly. From the results of the proposed mining algorithm on the example, different data dependency parameter values will produce the same large itemsets, but different predictive effects. Thus, if the data dependency relationships in transactions can be well utilized, the proposed algorithm can help raise the performance of data mining. In the future, we will continue extending the proposed approach to finding sequential patterns.

ACKNOWLEDGEMENT

This research was supported by the National Science Council of the Republic of China under contract NSC91-2213-E-390-001.

REFERENCES

[1] R. Agrawal, T. Imielinski and A. Swami, "Mining association rules between sets of items in large databases," The ACM SIGMOD Conference, Washington DC, USA, pp. 207-216, 1993.
[2] R. Agrawal, T. Imielinski and A. Swami, "Database mining: a performance perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 6, pp. 914-925, 1993.
[3] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," The International Conference on Very Large Data Bases, pp. 487-499, 1994.
[4] R. Agrawal and R. Srikant, "Mining sequential patterns," The Eleventh IEEE International Conference on Data Engineering, pp. 3-14, 1995.
[5] R. Agrawal, R. Srikant and Q. Vu, "Mining association rules with item constraints," The Third International Conference on Knowledge Discovery in Databases and Data Mining, Newport Beach, California, pp. 67-73, 1997.
[6] M.S. Chen, J. Han and P.S. Yu, "Data mining: an overview from a database perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 866-883, 1996.
[7] N. Denwattana and J.R. Getta, "A parameterised algorithm for mining association rules," The Twelfth Australasian Database Conference, pp. 45-51, 2001.
[8] T. Fukuda, Y. Morimoto, S. Morishita and T. Tokuyama, "Mining optimized association rules for numeric attributes," The ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 182-191, 1996.
[9] J. Han and Y. Fu, "Discovery of multiple-level association rules from large databases," The Twenty-first International Conference on Very Large Data Bases, Zurich, Switzerland, pp. 420-431, 1995.
[10] H. Mannila, H. Toivonen and A.I. Verkamo, "Efficient algorithms for discovering association rules," The AAAI Workshop on Knowledge Discovery in Databases, pp. 181-192, 1994.
[11] J.S. Park, M.S. Chen and P.S. Yu, "Using a hash-based method with transaction trimming for mining association rules," IEEE Transactions on Knowledge and Data Engineering, Vol. 9, No. 5, pp. 812-825, 1997.
[12] L. Shen, H. Shen and L. Cheng, "New algorithms for efficient mining of association rules," The Seventh Symposium on the Frontiers of Massively Parallel Computation, pp. 234-241, 1999.
[13] R. Srikant and R. Agrawal, "Mining generalized association rules," The Twenty-first International Conference on Very Large Data Bases, Zurich, Switzerland, pp. 407-419, 1995.
[14] R. Srikant and R. Agrawal, "Mining quantitative association rules in large relational tables," The 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Canada, pp. 1-12, 1996.