A New Incremental Data Mining Algorithm Using Pre-large Itemsets*

Tzung-Pei Hong**
Department of Information Management, I-Shou University
Kaohsiung, 84008, Taiwan, R.O.C.
tphong@isu.edu.tw
http://www.nuk.edu.tw/tphong

Ching-Yao Wang
Institute of Computer and Information Science, National Chiao-Tung University
Hsinchu, 300, Taiwan, R.O.C.
cywang@cis.nctu.edu.tw

Yu-Hui Tao
Department of Information Management, I-Shou University
Kaohsiung, 84008, Taiwan, R.O.C.
ytao@isu.edu.tw

--------------------------------------
* This is a modified and expanded version of the paper "Incremental data mining based on two support thresholds," presented at The Fourth International Conference on Knowledge-Based Intelligent Engineering Systems & Allied Technologies, 2000, England.
** Corresponding author.
Abstract

Due to the increasing use of very large databases and data warehouses, mining useful information and helpful knowledge from transactions is evolving into an important research area. In the past, researchers usually assumed databases were static to simplify data mining problems. Thus, most of the classic algorithms proposed focused on batch mining and did not utilize previously mined information in incrementally growing databases. In real-world applications, however, developing a mining algorithm that can incrementally maintain discovered information as a database grows is quite important. In this paper, we propose the concept of pre-large itemsets and design a novel, efficient, incremental mining algorithm based on it. Pre-large itemsets are defined by a lower support threshold and an upper support threshold. They act as a gap to avoid itemsets moving directly from large to small and vice versa. The proposed algorithm does not need to rescan the original database until a certain number of new transactions has been inserted. As the database grows larger, the number of new transactions allowed before a rescan grows larger too.

Keywords: data mining, association rule, large itemset, pre-large itemset, incremental mining.
1. Introduction

Years of effort in data mining have produced a variety of efficient techniques. Depending on the type of databases processed, these mining approaches may be classified as working on transaction databases, temporal databases, relational databases, and multimedia databases, among others. Depending on the classes of knowledge derived, they may be classified as finding association rules, classification rules, clustering rules, and sequential patterns [4], among others. Among these, finding association rules in transaction databases is the most common data mining task [1][3][5][9][10][12][13][15][16].

In the past, many algorithms for mining association rules from transactions were proposed, most of which executed in a level-wise manner. That is, itemsets containing single items were processed first, then itemsets with two items, and so on, adding one more item each time until some criterion was met. These algorithms usually considered the database size static and focused on batch mining. In real-world applications, however, new records are continually inserted into databases, so designing a mining algorithm that can maintain association rules as a database grows is critically important.

When new records are added to databases, the original association rules may become invalid, or new implicitly valid rules may appear in the resulting updated databases [7][8][11][14][17]. In these situations, conventional batch-mining algorithms must re-process the entire updated databases to find the final association rules. Two drawbacks may exist for conventional batch-mining algorithms in maintaining
database knowledge:

(a) Nearly the same computation time as that spent in mining the original database is needed to cope with each set of new transactions. If the original database is large, much computation time is wasted in maintaining association rules whenever new transactions are generated.

(b) Information previously mined from the original database, such as large itemsets and association rules, provides no help in the maintenance process.

Cheung and his co-workers proposed an incremental mining algorithm, called FUP (Fast UPdate algorithm) [7], for incrementally maintaining mined association rules and avoiding the shortcomings mentioned above. The FUP algorithm modifies the Apriori mining algorithm [3] and adopts the pruning techniques used in the DHP (Direct Hashing and Pruning) algorithm [13]. It first calculates large itemsets mainly from the newly inserted transactions and compares them with the previous large itemsets from the original database. According to the comparison results, FUP determines whether re-scanning the original database is needed, thus saving some time in maintaining the association rules. Although the FUP algorithm can indeed improve mining performance for incrementally growing databases, the original database must still be scanned when necessary. In this paper, we therefore propose a new mining algorithm based on two support thresholds to further reduce the need for rescanning original databases. Since rescanning the database takes much computation time, the maintenance cost can thus be reduced by the proposed algorithm.
The remainder of this paper is organized as follows. The data mining process is introduced in Section 2. The maintenance of association rules is described in Section 3. The FUP algorithm is reviewed in Section 4. A new incremental mining algorithm is proposed in Section 5, where an example is also given to illustrate it. Conclusions are summarized in Section 6.

2. The Data Mining Process Using Association Rules

Data mining plays a central role in knowledge discovery. It involves applying specific algorithms to extract patterns or rules from data sets in a particular representation. Because data mining is important to KDD, many researchers in the database and machine-learning fields are interested in it, since it offers opportunities to discover useful information and important relevant patterns in large databases, thus helping decision-makers analyze data easily and make good decisions regarding the domains in question.

One application of data mining is to induce association rules from transaction data, such that the presence of certain items in a transaction implies the presence of certain other items. To achieve this purpose, Agrawal and his co-workers proposed several mining algorithms based on the concept of large itemsets to find association rules in transaction data [1][3][5]. They divided the mining process into two phases. In the first phase, candidate itemsets were generated and counted by scanning the transaction data. If the count of an itemset appearing in the transactions was larger than a pre-defined threshold value (called the minimum support), the itemset was considered a large itemset. Itemsets containing only one item were processed first.
Large itemsets containing only single items were then combined to form candidate itemsets containing two items. This process was repeated until all large itemsets had been found. In the second phase, association rules were induced from the large itemsets found in the first phase. All possible association combinations for each large itemset were formed, and those with calculated confidence values larger than a predefined threshold (called the minimum confidence) were output as association rules.

We may summarize the data mining process we focus on as follows:

1. Determine the user-specified thresholds, including the minimum support value and the minimum confidence value.
2. Find the large itemsets in an iterative way. The count of a large itemset must exceed or equal the minimum support value.
3. Utilize the large itemsets to generate association rules, whose confidence must exceed or equal the minimum confidence value.

Below, we use a simple example to illustrate the mining process. Suppose the database with five transactions shown in Table 1 is to be mined. The database has two features, transaction identification (TID) and transaction description (Items).

Table 1. An example of a transaction database

  TID   Items
  100   BE
  200   ABD
  300   AD
  400   BCE
  500   ABDE

Assume the user-specified minimum support and minimum confidence are 40% and 80%, respectively. The transaction database is first scanned to count the candidate
1-itemsets. The results are shown in Table 2.

Table 2. Candidate 1-itemsets

  Item   Count
  A      3
  B      4
  C      1
  D      3
  E      3

Since the counts of the items A, B, D and E are larger than or equal to 2 (5*40%), they are put into the set of large 1-itemsets. The candidate 2-itemsets are then formed from these large 1-itemsets, as shown in Table 3.

Table 3. Candidate 2-itemsets with counts

  Items   Count
  AB      2
  AD      3
  AE      1
  BD      2
  BE      3
  DE      1

AB, AD, BD and BE then form the set of large 2-itemsets. In a similar way, ABD can be found to be a large 3-itemset. Next, the large itemsets are used to generate association rules. According to the conditional probabilities, the possible association rules generated are shown in Table 4.
Table 4. Possible association rules

  Rule              Confidence
  IF AB, Then D     Count(ABD)/Count(AB) = 1
  IF AD, Then B     Count(ABD)/Count(AD) = 2/3
  IF BD, Then A     Count(ABD)/Count(BD) = 1
  IF A, Then B      Count(AB)/Count(A) = 2/3
  IF B, Then A      Count(AB)/Count(B) = 2/4
  IF A, Then D      Count(AD)/Count(A) = 1
  IF D, Then A      Count(AD)/Count(D) = 1
  IF B, Then D      Count(BD)/Count(B) = 2/4
  IF D, Then B      Count(BD)/Count(D) = 2/3
  IF B, Then E      Count(BE)/Count(B) = 3/4
  IF E, Then B      Count(BE)/Count(E) = 1

Since the user-specified minimum confidence is 80%, the final association rules are shown in Table 5.

Table 5. The final association rules for this example

  Rule              Confidence
  IF AB, Then D     Count(ABD)/Count(AB) = 1
  IF BD, Then A     Count(ABD)/Count(BD) = 1
  IF A, Then D      Count(AD)/Count(A) = 1
  IF D, Then A      Count(AD)/Count(D) = 1
  IF E, Then B      Count(BE)/Count(E) = 1
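To make the two-phase process concrete, the following short Python sketch reproduces the example above. It is only an illustration of level-wise large-itemset generation followed by rule induction under the stated thresholds, not the authors' implementation; the brute-force candidate join and all identifier names are our own.

    from itertools import combinations

    # Transactions from Table 1; user-specified thresholds: 40% support, 80% confidence.
    transactions = [set("BE"), set("ABD"), set("AD"), set("BCE"), set("ABDE")]
    min_sup, min_conf = 0.4, 0.8
    n = len(transactions)

    def count(itemset):
        """Number of transactions containing every item of the itemset."""
        return sum(1 for t in transactions if itemset <= t)

    # Phase 1: find all large itemsets level by level.
    items = sorted({i for t in transactions for i in t})
    large, k_sets = {}, [frozenset([i]) for i in items]
    while k_sets:
        current = {s: count(s) for s in k_sets if count(s) / n >= min_sup}
        large.update(current)
        # Join step (no Apriori pruning here): combine large k-itemsets into
        # candidate (k+1)-itemsets.
        keys = list(current)
        k_sets = {a | b for a in keys for b in keys if len(a | b) == len(a) + 1}

    # Phase 2: induce rules whose confidence meets the minimum confidence.
    rules = []
    for itemset, sup in large.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = sup / large[lhs]
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset - lhs), conf))

    for lhs, rhs, conf in rules:
        print(f"IF {sorted(lhs)} THEN {sorted(rhs)} (confidence {conf:.2f})")

Running this sketch recovers exactly the five rules of Table 5 (AB=>D, BD=>A, A=>D, D=>A and E=>B), each with confidence 1.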
3. Maintenance of Association Rules

In real-world applications, transaction databases grow over time, and the association rules mined from them must be re-evaluated, because new association rules may be generated and old association rules may become invalid when the entire updated databases are considered. Conventional batch-mining algorithms, such as Apriori [1] and DHP [13], solve this problem by re-processing the entire new databases when new transactions are inserted into the original databases. These algorithms do not, however, use previously mined information, and they require nearly the same computation time as they needed to mine the original databases. If new transactions appear often and the original databases are large, these algorithms are thus inefficient in maintaining association rules.

Considering an original database and newly inserted transactions, the following four cases (illustrated in Figure 1) may arise:

Case 1: An itemset is large in the original database and large in the newly inserted transactions.
Case 2: An itemset is large in the original database, but is not large in the newly inserted transactions.
Case 3: An itemset is not large in the original database, but is large in the newly inserted transactions.
Case 4: An itemset is not large in the original database and not large in the newly inserted transactions.

                                 New transactions
                             Large itemset   Small itemset
  Original   Large itemset   Case 1          Case 2
  database   Small itemset   Case 3          Case 4
Figure 1: Four cases arising from adding new transactions to existing databases

Since itemsets in Case 1 are large in both the original database and the new transactions, they will still be large after the weighted average of the counts. Similarly, itemsets in Case 4 will still be small after the new transactions are inserted. Cases 1 and 4 thus do not affect the final association rules. Case 2 may remove existing association rules, and Case 3 may add new association rules. A good rule-maintenance algorithm should thus accomplish the following:

1. Evaluate large itemsets in the original database and determine whether they are still large in the updated database;
2. Find out whether any small itemsets in the original database may become large in the updated database;
3. Seek itemsets that appear only in the newly inserted transactions and determine whether they are large in the updated database.

These tasks are accomplished by the FUP algorithm and by our proposed algorithm.

4. Review of the Fast Update Algorithm (FUP)

Cheung et al. proposed the FUP algorithm to incrementally maintain association rules when new transactions are inserted [7][8]. Using FUP, large itemsets and their counts from preceding runs are recorded for later use in maintenance. As new transactions are added, FUP first scans them to generate candidate 1-itemsets (only for these transactions), and then compares these itemsets with the previous ones. FUP
partitions the candidate 1-itemsets into two parts according to whether they are large in the original database. If a candidate 1-itemset from the newly inserted transactions is also among the large 1-itemsets of the original database, its new total count for the entire updated database can easily be calculated from its current count and its previous count, since all previous large itemsets with their counts are kept by FUP. Whether an originally large itemset is still large after the new transactions are inserted is determined from its support ratio, i.e., its total count over the total number of transactions. By contrast, if a candidate 1-itemset from the newly inserted transactions does not exist among the large 1-itemsets of the original database, one of two possibilities arises. If this candidate 1-itemset is not large for the new transactions, then it cannot be large for the entire updated database, and no action is necessary. If this candidate 1-itemset is large for the new transactions but not among the original large 1-itemsets, the original database must be re-scanned to determine whether the itemset is actually large for the entire updated database. Using the processing tactics mentioned above, FUP is thus able to find all large 1-itemsets for the entire updated database. After that, candidate 2-itemsets from the newly inserted transactions are formed and the same procedure is used to find all large 2-itemsets. This procedure is repeated until all large itemsets have been found.

Below, we use a simple example to illustrate the FUP algorithm. Suppose the database with eight transactions shown in Table 6 is to be mined. The minimum support threshold s is set at 50%.

Table 6. An original database with TID and Items

  TID   Items
  100   ACD
  200   BCE
  300   ABCE
  400   ABE
  500   ABE
  600   ACD
  700   BCDE
  800   BCE

Using a conventional mining algorithm such as the Apriori algorithm, all large itemsets with counts larger than or equal to 4 (8*50%) are found, as shown in Table 7. These large itemsets and their counts are retained by the FUP algorithm.

Table 7. All large itemsets from an original database with s=50%

  1 item   Count     2 items   Count     3 items   Count
  A        5         BC        4         BCE       4
  B        6         BE        6
  C        6         CE        4
  E        6

Next, assume the two new transactions shown in Table 8 appear.

Table 8. New transactions for the example

  TID    Items
  900    ABCD
  1000   DEF

The FUP algorithm processes them as follows. First, the final large 1-itemsets for the entire updated database are found. This process is shown in Figure 2. The same
process is then repeated until no new candidate itemsets are generated.
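The 1-itemset pass of FUP described above can be paraphrased in Python as follows. This is only a hedged sketch based on the description in this section, assuming the previous large 1-itemsets with their counts and the original transactions are available; the function and variable names are ours, not Cheung et al.'s.

    def fup_1itemsets(old_large, d, new_transactions, min_sup, original_db):
        """One FUP pass for 1-itemsets: combine the retained counts (old_large maps
        itemsets to counts over the d original transactions) with counts from the
        newly inserted transactions."""
        t = len(new_transactions)
        total = d + t
        # Count candidate 1-itemsets occurring in the new transactions only.
        new_counts = {}
        for trans in new_transactions:
            for item in trans:
                key = frozenset([item])
                new_counts[key] = new_counts.get(key, 0) + 1

        updated_large, rescan = {}, []
        for itemset, cnt in new_counts.items():
            if itemset in old_large:
                # Originally large: the new total count follows from the kept count.
                total_cnt = old_large[itemset] + cnt
                if total_cnt / total >= min_sup:
                    updated_large[itemset] = total_cnt
            elif cnt / t >= min_sup:
                # Case 3: large only in the new transactions; the original
                # database must be rescanned to obtain its true total count.
                rescan.append(itemset)
        # Originally large itemsets that never appear in the new transactions.
        for itemset, cnt in old_large.items():
            if itemset not in new_counts and cnt / total >= min_sup:
                updated_large[itemset] = cnt
        for itemset in rescan:
            total_cnt = sum(1 for trans in original_db if itemset <= set(trans)) \
                        + new_counts[itemset]
            if total_cnt / total >= min_sup:
                updated_large[itemset] = total_cnt
        return updated_large

    # Tables 6-8: original transactions, their large 1-itemsets, two new transactions.
    original_db = ["ACD", "BCE", "ABCE", "ABE", "ABE", "ACD", "BCDE", "BCE"]
    old_large = {frozenset(s): n for s, n in [("A", 5), ("B", 6), ("C", 6), ("E", 6)]}
    result = fup_1itemsets(old_large, 8, ["ABCD", "DEF"], 0.5, original_db)
    print({"".join(sorted(I)): n for I, n in result.items()})
    # {'A': 6, 'B': 7, 'C': 7, 'E': 7, 'D': 5}  (cf. Figure 2)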
[Flowchart omitted: the candidate 1-itemsets found in the two new transactions (A 1, B 1, C 1, D 2, E 1, F 1) are split into originally large (A, B, C, E) and originally small (D, F) itemsets; the new counts are added to the retained counts of the originally large itemsets (A 6, B 7, C 7, E 7); D, which is large for the new transactions, is checked by rescanning the original database (count 5); the final large 1-itemsets for the updated database are A 6, B 7, C 7, D 5, E 7.]
Figure 2: The FUP process of finding large 1-itemsets

A summary of the four cases and their FUP results is given in Table 9.

Table 9. Four cases and their FUP results

  Case (Original – New)    Result
  Case 1: Large – Large    Always large
  Case 2: Large – Small    Determined from existing information
  Case 3: Small – Large    Determined by rescanning the original database
  Case 4: Small – Small    Always small

FUP is thus able to handle Cases 1, 2 and 4 more efficiently than conventional batch-mining algorithms. It must, however, reprocess the original database to handle Case 3.

5. Maintenance of Association Rules Based on Pre-large Itemsets

Although the FUP algorithm focuses on the newly inserted transactions and thus saves much processing time in incrementally maintaining rules, it must still scan the original database to handle Case 3, in which a candidate itemset is large for the new transactions but is not recorded among the large itemsets already mined from the original database. This situation may occur frequently, especially when the number of new transactions is small. In an extreme situation, if only one new transaction is added each time, then all items in this transaction are large for it, since their support ratios are 100% in the new transaction. Thus, if Case 3 could be handled efficiently, the maintenance time could be further reduced.
5.1 Definition of Pre-large Itemsets

In this paper, we propose the concept of pre-large itemsets to solve the problem represented by Case 3. A pre-large itemset is not truly large, but promises to be large in the future. A lower support threshold and an upper support threshold are used to realize this concept. The upper support threshold is the same as the minimum support used in conventional mining algorithms: the support ratio of an itemset must reach the upper support threshold for the itemset to be considered large. The lower support threshold defines the lowest support ratio at which an itemset is treated as pre-large. An itemset with a support ratio below the lower threshold is thought of as small. Pre-large itemsets act like buffers in the incremental mining process and are used to reduce the movement of itemsets directly from large to small and vice versa.

Considering an original database and newly inserted transactions under the two support thresholds, an itemset may fall into one of the nine cases illustrated in Figure 3.

Figure 3: Nine cases arising from adding new transactions to existing databases

                                      New transactions
                              Large       Pre-large    Small
                              itemsets    itemsets     itemsets
  Original   Large itemsets   Case 1      Case 2       Case 3
  database   Pre-large        Case 4      Case 5       Case 6
             itemsets
             Small itemsets   Case 7      Case 8       Case 9
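In code, the two thresholds simply partition itemsets by their support ratios. The small helper below merely restates the definition; the "greater than or equal" boundary convention (cf. STEPs 5 and 6 of the algorithm in Section 5.4) and the function name are our own choices.

    def status(count, num_transactions, lower, upper):
        """Classify an itemset as 'large', 'pre-large' or 'small' by its support ratio."""
        support = count / num_transactions
        if support >= upper:
            return "large"
        if support >= lower:
            return "pre-large"
        return "small"

    # With Sl = 30% and Su = 50% over 8 transactions (the setting used in Section 5.5):
    print(status(6, 8, 0.3, 0.5))   # 'large'
    print(status(3, 8, 0.3, 0.5))   # 'pre-large' (3/8 = 37.5%)
    print(status(1, 8, 0.3, 0.5))   # 'small'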
Cases 1, 5, 6, 8 and 9 above will not affect the final association rules according to the weighted average of the counts. Cases 2 and 3 may remove existing association rules, and Cases 4 and 7 may add new association rules. If we retain all large and pre-large itemsets with their counts after each pass, then Cases 2, 3 and 4 can be handled easily. Also, in the maintenance phase, the ratio of new transactions to old transactions is usually very small, and this is more apparent as the database grows larger. An itemset in Case 7 cannot possibly be large for the entire updated database as long as the number of new transactions is small compared to the number of transactions in the original database. This point is proven below. A summary of the nine cases and their results is given in Table 10.

Table 10. Nine cases and their results

  Case (Original – New)           Result
  Case 1: Large – Large           Always large
  Case 2: Large – Pre-large       Large or pre-large, determined from existing information
  Case 3: Large – Small           Large, pre-large or small, determined from existing information
  Case 4: Pre-large – Large       Pre-large or large, determined from existing information
  Case 5: Pre-large – Pre-large   Always pre-large
  Case 6: Pre-large – Small       Pre-large or small, determined from existing information
  Case 7: Small – Large           Pre-large or small when the number of new transactions is small
  Case 8: Small – Pre-large       Small or pre-large
  Case 9: Small – Small           Always small

5.2 Notation

The notation used in this paper is defined below.
  D      : the original database;
  T      : the set of new transactions;
  U      : the entire updated database, i.e., D ∪ T;
  d      : the number of transactions in D;
  t      : the number of transactions in T;
  Sl     : the lower support threshold for pre-large itemsets;
  Su     : the upper support threshold for large itemsets, Su > Sl;
  LkD    : the set of large k-itemsets from D;
  LkT    : the set of large k-itemsets from T;
  LkU    : the set of large k-itemsets from U;
  PkD    : the set of pre-large k-itemsets from D;
  PkT    : the set of pre-large k-itemsets from T;
  PkU    : the set of pre-large k-itemsets from U;
  Ck     : the set of all candidate k-itemsets from T;
  I      : an itemset;
  SD(I)  : the number of occurrences of I in D;
  ST(I)  : the number of occurrences of I in T;
  SU(I)  : the number of occurrences of I in U.

5.3 Theoretical Foundation

As mentioned above, if the number of new transactions is small compared to the number of transactions in the original database, an itemset that is small (neither large nor pre-large) in the original database but is large in the newly inserted transactions cannot possibly be large for the entire updated database. This is proven in the
following theorem.

Theorem 1: Let Sl and Su be respectively the lower and the upper support thresholds, and let d and t be respectively the numbers of original and new transactions. If t ≤ (Su − Sl)d / (1 − Su), then an itemset that is small (neither large nor pre-large) in the original database but is large in the newly inserted transactions is not large for the entire updated database.

Proof: The following derivation can be obtained from the assumption:

  t ≤ (Su − Sl)d / (1 − Su)                      (1)
  ⇒ t(1 − Su) ≤ (Su − Sl)d
  ⇒ t − tSu ≤ dSu − dSl
  ⇒ t + dSl ≤ Su(d + t)
  ⇒ (t + dSl) / (d + t) ≤ Su.

If an itemset I is small (neither large nor pre-large) in the original database D, then its count SD(I) must be less than Sl*d; therefore, SD(I) < dSl. If I is large in the newly inserted transactions T, then:
  t ≥ ST(I) ≥ tSu.

The support ratio of I in the entire updated database U is SU(I)/(d + t), which can be further expanded to:

  SU(I)/(d + t) = (ST(I) + SD(I))/(d + t) < (t + dSl)/(d + t) ≤ Su.

I is thus not large for the entire updated database. This completes the proof.

Example 1: Assume d=100, Sl=50% and Su=60%. The number of new transactions within which the original database need not be scanned for rule maintenance is:

  (Su − Sl)d / (1 − Su) = (0.6 − 0.5)·100 / (1 − 0.6) = 25.

Thus, if the number of newly inserted transactions is equal to or less than 25, then such an itemset cannot be large for the entire updated database.

From Theorem 1, the number of new transactions that can be handled efficiently for Case 7 is determined by Sl, Su, and d. It can easily be seen from Formula (1) that if d grows larger, then t can grow larger too. Therefore, as the database grows, our proposed approach becomes increasingly efficient. This characteristic is especially useful for real-world applications.
From Theorem 1, the ratio of new transactions to previous transactions for which the proposed approach works can easily be derived as follows.

Corollary 1: Let r denote the ratio of the number of new transactions t to the number of old transactions d. If r ≤ (Su − Sl) / (1 − Su), then an itemset that is small (neither large nor pre-large) in the original database but is large in the newly inserted transactions cannot be large for the entire updated database.

Example 2: Assume Sl=50% and Su=60%. The ratio of new transactions to old transactions within which the original database need not be scanned for rule maintenance is:

  (Su − Sl) / (1 − Su) = (0.6 − 0.5) / (1 − 0.6) = 1/4.

Thus, if the number of newly inserted transactions is equal to or less than 1/4 of the number of original transactions, then such an itemset cannot be large for the entire updated database.

It is easily seen from Corollary 1 that if the range between Sl and Su is large, then the ratio r can also be large, meaning that the number of new transactions allowed will be large for a fixed d. However, a large range between Sl and Su will also create a large set of pre-large itemsets, which represents an additional overhead in maintenance.
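The bound of Theorem 1 and the ratio of Corollary 1 are straightforward to compute. The snippet below only re-derives the numbers in Examples 1 and 2; exact fractions are used to avoid floating-point rounding exactly at the boundary, and the function names are our own.

    import math
    from fractions import Fraction

    def safety_number(d, lower, upper):
        """Largest number of new transactions for which no rescan is needed (Theorem 1)."""
        return math.floor((upper - lower) * d / (1 - upper))

    def safety_ratio(lower, upper):
        """Largest ratio t/d for which no rescan is needed (Corollary 1)."""
        return (upper - lower) / (1 - upper)

    Sl, Su = Fraction(1, 2), Fraction(3, 5)        # 50% and 60%
    print(safety_number(100, Sl, Su))              # 25, as in Example 1
    print(safety_ratio(Sl, Su))                    # 1/4, as in Example 2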
5.4 Presentation of the Algorithm

In the proposed algorithm, the large and pre-large itemsets with their counts from preceding runs are recorded for later use in maintenance. As new transactions are added, the proposed algorithm first scans them to generate candidate 1-itemsets (only for these transactions), and then compares these itemsets with the previously retained large and pre-large 1-itemsets. It partitions the candidate 1-itemsets into three parts according to whether they are large, pre-large or small in the original database. If a candidate 1-itemset from the newly inserted transactions is also among the large or pre-large 1-itemsets of the original database, its new total count for the entire updated database can easily be calculated from its current count and its previous count, since all previous large and pre-large itemsets with their counts have been retained. Whether an originally large or pre-large itemset is still large or pre-large after the new transactions are inserted is determined from its new support ratio, derived as its total count over the total number of transactions. By contrast, if a candidate 1-itemset from the newly inserted transactions exists among neither the large nor the pre-large 1-itemsets of the original database, then it cannot be large for the entire updated database as long as the number of newly inserted transactions is within the safety threshold derived from Theorem 1. In this situation, no action is needed. When transactions are incrementally added and the total number of new transactions exceeds the safety threshold, the original database is re-scanned to determine the new large and pre-large itemsets in a way similar to that used by the FUP algorithm. The proposed algorithm can thus find all large 1-itemsets for the entire updated database. After that, candidate 2-itemsets from the newly inserted transactions are formed, and the same procedure is used to find all large 2-itemsets. This procedure is repeated until all large itemsets have been found. The details of the proposed maintenance
algorithm are described below. A variable, c, is used to record the number of new transactions since the last re-scan of the original database.

The proposed maintenance algorithm:

INPUT: A lower support threshold Sl, an upper support threshold Su, a set of large itemsets and pre-large itemsets in the original database consisting of (d+c) transactions, and a set of t new transactions.

OUTPUT: A set of final association rules for the updated database.

STEP 1: Calculate the safety number f of new transactions according to Theorem 1 as follows:

  f = ⌊(Su − Sl)d / (1 − Su)⌋.

STEP 2: Set k = 1, where k records the number of items in the itemsets currently being processed.

STEP 3: Find all candidate k-itemsets Ck and their counts from the new transactions.

STEP 4: Divide the candidate k-itemsets into three parts according to whether they are large, pre-large or small in the original database.

STEP 5: For each itemset I in the originally large k-itemsets LkD, do the following substeps:

  Substep 5-1: Set the new count SU(I) = ST(I) + SD(I).
  Substep 5-2: If SU(I)/(d+t+c) ≥ Su, then assign I as a large itemset, set SD(I) = SU(I) and keep I with SD(I);
               otherwise, if SU(I)/(d+t+c) ≥ Sl, then assign I as a pre-large itemset, set SD(I) = SU(I) and keep I with SD(I);
               otherwise, neglect I.
STEP 6: For each itemset I in the originally pre-large k-itemsets PkD, do the following substeps:

  Substep 6-1: Set the new count SU(I) = ST(I) + SD(I).
  Substep 6-2: If SU(I)/(d+t+c) ≥ Su, then assign I as a large itemset, set SD(I) = SU(I) and keep I with SD(I);
               otherwise, if SU(I)/(d+t+c) ≥ Sl, then assign I as a pre-large itemset, set SD(I) = SU(I) and keep I with SD(I);
               otherwise, neglect I.

STEP 7: For each itemset I in the candidate itemsets that is neither in the originally large itemsets LkD nor in the originally pre-large itemsets PkD, do the following substeps:

  Substep 7-1: If I is in the large itemsets LkT or pre-large itemsets PkT from the new transactions, then put it in the rescan-set R, which is used when rescanning in STEP 8 is necessary.
  Substep 7-2: If I is small for the new transactions, then do nothing.

STEP 8: If t + c ≤ f or R is null, then do nothing; otherwise, rescan the original database to determine whether the itemsets in the rescan-set R are large or pre-large.

STEP 9: Form the candidate (k+1)-itemsets Ck+1 from the finally large and pre-large k-itemsets (LkU ∪ PkU) that appear in the new transactions.

STEP 10: Set k = k+1.

STEP 11: Repeat STEPs 4 to 10 until no new large or pre-large itemsets are found.

STEP 12: Modify the association rules according to the modified large itemsets.

STEP 13: If t + c > f, then set d = d+t+c and set c = 0; otherwise, set c = t+c.

After STEP 13, the final association rules for the updated database have been determined.
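A compact Python sketch of the maintenance procedure is given below. It follows STEPs 1-11 and 13 as written above (rule generation in STEP 12 is omitted), under one reading of STEP 9 in which only itemsets occurring in the new transactions seed the next level's candidates; all function and variable names are ours, and the sketch is illustrative rather than the authors' implementation.

    import math

    def count_in(itemset, transactions):
        """Number of transactions that contain every item of `itemset`."""
        return sum(1 for trans in transactions if itemset <= trans)

    def maintain(large, prelarge, d, c, new_trans, original_db, Sl, Su):
        """One run of the pre-large maintenance procedure (STEPs 1-11 and 13).
        `large` and `prelarge` map itemsets (frozensets of any size) kept from the
        previous run to their counts over the d + c transactions already seen;
        `original_db` holds those transactions and is read only if a rescan is needed."""
        new_trans = [set(x) for x in new_trans]
        original_db = [set(x) for x in original_db]
        t = len(new_trans)
        f = math.floor((Su - Sl) * d / (1 - Su))      # STEP 1: safety number
        total = d + t + c
        new_large, new_prelarge = {}, {}
        kept, k = [], 1
        while True:
            # STEPs 3 and 9/10: candidate k-itemsets from the new transactions.
            if k == 1:
                cand = {frozenset([i]) for trans in new_trans for i in trans}
            else:
                cand = {a | b for a in kept for b in kept if len(a | b) == k}
            orig_k = {I: n for I, n in {**large, **prelarge}.items() if len(I) == k}
            if not cand and not orig_k:
                break                                 # STEP 11: nothing left to do
            level_large, level_pre, R = {}, {}, []
            for I, old_cnt in orig_k.items():         # STEPs 5 and 6
                cnt = old_cnt + count_in(I, new_trans)
                if cnt / total >= Su:
                    level_large[I] = cnt
                elif cnt / total >= Sl:
                    level_pre[I] = cnt
            for I in cand:                            # STEP 7: unrecorded candidates
                if I not in orig_k and count_in(I, new_trans) / t >= Sl:
                    R.append(I)                       # rescan-set
            if t + c > f and R:                       # STEP 8: rescan if unavoidable
                for I in R:
                    cnt = count_in(I, original_db) + count_in(I, new_trans)
                    if cnt / total >= Su:
                        level_large[I] = cnt
                    elif cnt / total >= Sl:
                        level_pre[I] = cnt
            new_large.update(level_large)
            new_prelarge.update(level_pre)
            # STEP 9 (one reading): only itemsets occurring in the new transactions
            # seed the next level's candidates.
            kept = [I for I in {**level_large, **level_pre}
                    if count_in(I, new_trans) > 0]
            k += 1                                    # STEP 10
        if t + c > f:                                 # STEP 13: bookkeeping of d and c
            d, c = d + t + c, 0
        else:
            c = t + c
        return new_large, new_prelarge, d, c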
5.5 An Example

In this section, an example is given to illustrate the proposed incremental data mining algorithm. Assume the initial data set includes the 8 transactions shown in Table 6. For Sl=30% and Su=50%, the sets of large itemsets and pre-large itemsets for the given data are shown in Tables 11 and 12, respectively.

Table 11. The large itemsets for the original database

  1 item   Count     2 items   Count     3 items   Count
  A        5         BC        4         BCE       4
  B        6         BE        6
  C        6         CE        4
  E        6

Table 12. The pre-large itemsets for the original database

  1 item   Count     2 items   Count     3 items   Count
  D        3         AB        3         ABE       3
                     AC        3
                     AE        3
                     CD        3

Assume the two new transactions shown in Table 13 are inserted after the initial data set is processed.

Table 13. Two new transactions
  TID    Items
  900    ABCD
  1000   DEF

The proposed incremental mining algorithm proceeds as follows. The variable c is initially set at 0.

STEP 1: The safety number f for new transactions is calculated as:

  f = ⌊(Su − Sl)d / (1 − Su)⌋ = ⌊(0.5 − 0.3)·8 / (1 − 0.5)⌋ = 3.

STEP 2: k is set to 1, where k records the number of items in the itemsets currently being processed.

STEP 3: All candidate 1-itemsets C1 and their counts from the two new transactions are found, as shown in Table 14.

Table 14. All candidate 1-itemsets with counts from the two new transactions

  Items   Count
  A       1
  B       1
  C       1
  D       2
  E       1
  F       1

STEP 4: From Table 14, all candidate 1-itemsets {A}{B}{C}{D}{E}{F} are divided into three parts, {A}{B}{C}{E}, {D}, and {F}, according to whether they are large, pre-large or small in the original database. The results are shown in Table 15, where the counts are only from the new transactions.
Table 15. Three partitions of all candidate 1-itemsets from the two new transactions

  Originally large    Originally pre-large    Originally small
  1-itemsets          1-itemsets              1-itemsets
  Items   Count       Items   Count           Items   Count
  A       1           D       2               F       1
  B       1
  C       1
  E       1

STEP 5: The following substeps are done for each of the originally large 1-itemsets {A}{B}{C}{E}:

  Substep 5-1: The total counts of the candidate 1-itemsets {A}{B}{C}{E} are calculated as ST(I) + SD(I). Table 16 shows the results.

Table 16. The total counts of {A}{B}{C}{E}

  Items   Count
  A       6
  B       7
  C       7
  E       7

  Substep 5-2: The new support ratios of {A}{B}{C}{E} are calculated. For example, the new support ratio of {A} is 6/(8+2+0) ≥ 0.5, so {A} is still a large itemset. In this example, {A}{B}{C}{E} are all large, and they are retained with their new counts in the large 1-itemsets for the entire updated database.

STEP 6: The following substeps are done for the itemset {D}, which is originally pre-large:
  Substep 6-1: The total count of the candidate 1-itemset {D} is calculated as ST(I) + SD(I) (= 5).
  Substep 6-2: The new support ratio of {D} is 5/(8+2+0) ≥ 0.5, so {D} becomes a large 1-itemset for the whole updated database. {D} with its new count is retained in the large 1-itemsets for the entire updated database.

STEP 7: Since the itemset {F}, which was originally neither large nor pre-large, is large for the new transactions, it is put in the rescan-set R, which is used when rescanning in STEP 8 is necessary.

STEP 8: Since t + c = 2+0 ≤ f (=3), rescanning the database is unnecessary, so nothing is done.

STEP 9: From STEPs 5, 6 and 7, the final large 1-itemsets and pre-large 1-itemsets for the entire updated database are {A}{B}{C}{D}{E}. All candidate 2-itemsets generated from them are shown in Table 17.

Table 17. All candidate 2-itemsets for the new transactions

  AB   AC   AD   AE   BC
  BD   BE   CD   CE   DE

STEP 10: k = k+1 = 2.

STEP 11: STEPs 4 to 10 are repeated to find large or pre-large 2-itemsets. The results are shown in Table 18.
Table 18. All large 2-itemsets and pre-large 2-itemsets for the whole updated database

  Large 2-itemsets     Pre-large 2-itemsets
  Items   Count        Items   Count
  BC      5            AB      4
  BE      6            AC      4
                       AE      3
                       CD      4
                       CE      4

Large or pre-large 3-itemsets are found in the same way. No large 3-itemsets are found in this example.

STEP 12: The association rules derived from the newly found large itemsets are:

  B ⇒ C (confidence = 5/7),
  C ⇒ B (confidence = 5/7),
  B ⇒ E (confidence = 6/7), and
  E ⇒ B (confidence = 6/7).

STEP 13: c = t+c = 2+0 = 2.

After STEP 13, the final association rules for the updated database can then be found. Note that the final value of c is 2 in this example and f − c = 1. This means that one more new transaction can be added without rescanning the original database. The whole process of finding large itemsets for this example is illustrated in Figures 4, 5 and 6.
[Flowchart omitted: the candidate 1-itemsets from the two new transactions are split into originally large ({A}{B}{C}{E}), originally pre-large ({D}) and unrecorded ({F}) itemsets; the new counts are added to the retained counts; since t+c = 2 < f = 3, {F} is simply inserted into the rescan-set, giving R = {F}; the final large 1-itemsets are A 6, B 7, C 7, D 5, E 7.]

Figure 4: Our process of finding large 1-itemsets
[Flowchart omitted: the candidate 2-itemsets from the two new transactions are split into originally large (BC, BE, CE), originally pre-large (AB, AC, AE, CD) and unrecorded (AD, BD, DE) itemsets; the new counts are added to the retained counts; since t+c = 2 < f = 3, the unrecorded itemsets are inserted into the rescan-set, giving R = {F, AD, BD, DE}; the final large 2-itemsets are BC 5 and BE 6, and the pre-large 2-itemsets are AB 4, AC 4, AE 3, CD 4, CE 4.]

Figure 5: Our process of finding large 2-itemsets and pre-large 2-itemsets
[Flowchart omitted: the candidate 3-itemsets from the two new transactions (ABC, ABE, ACE, BCE) are split into the originally large itemset BCE, the originally pre-large itemset ABE and the unrecorded itemset ABC; the new counts are added to the retained counts; since t+c = 2 < f = 3, ABC is inserted into the rescan-set, giving R = {F, AD, BD, DE, ABC}; the final 3-itemsets are the pre-large itemsets ABE 3 and BCE 4.]

Figure 6: Our process of finding large 3-itemsets and pre-large 3-itemsets

In Pass 1 of this example, the candidate 1-itemsets {D} and {F} can easily be processed by our proposed algorithm; in the FUP algorithm, however, they are processed by rescanning the whole database.
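Assuming the maintain sketch given after the algorithm listing in Section 5.4 is in scope, the numbers of this example can be reproduced directly; the data below are simply Tables 6, 11, 12 and 13 rewritten as Python literals.

    original_db = ["ACD", "BCE", "ABCE", "ABE", "ABE", "ACD", "BCDE", "BCE"]   # Table 6
    large = {frozenset(s): n for s, n in                                       # Table 11
             [("A", 5), ("B", 6), ("C", 6), ("E", 6),
              ("BC", 4), ("BE", 6), ("CE", 4), ("BCE", 4)]}
    prelarge = {frozenset(s): n for s, n in                                    # Table 12
                [("D", 3), ("AB", 3), ("AC", 3), ("AE", 3), ("CD", 3), ("ABE", 3)]}

    new_large, new_pre, d, c = maintain(large, prelarge, d=8, c=0,
                                        new_trans=["ABCD", "DEF"],             # Table 13
                                        original_db=original_db, Sl=0.3, Su=0.5)
    print(sorted("".join(sorted(I)) for I in new_large))
    # ['A', 'B', 'BC', 'BE', 'C', 'D', 'E']  -- the 1-itemsets A-E plus BC and BE
    print(d, c)
    # 8 2  -- c is now 2, so one more new transaction can be absorbed before f = 3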
6. Conclusions

In this paper, we have proposed the concept of pre-large itemsets and designed a novel, efficient, incremental mining algorithm based on it. Using two user-specified upper and lower support thresholds, the pre-large itemsets act as a gap to avoid small itemsets becoming large in the updated database as transactions are inserted. Our proposed algorithm also retains the following features of the FUP algorithm [7][14]:

1. It avoids re-computing large itemsets that have already been discovered.
2. It focuses on the newly inserted transactions, thus greatly reducing the number of candidate itemsets.
3. It uses a simple check to further filter the candidate itemsets in the inserted transactions.

Moreover, the proposed algorithm can effectively handle the cases in which itemsets are small in the original database but large in the newly inserted transactions, although it needs additional storage space to record the pre-large itemsets. Note that the FUP algorithm needs to rescan the database to handle such cases. The proposed algorithm does not require rescanning of the original database until a number of new transactions, determined by the two support thresholds and the size of the database, have been processed. As the database grows larger, the number of new transactions allowed before rescanning grows larger too. Therefore, as the database grows, our proposed approach becomes increasingly efficient. This characteristic is especially useful for real-world applications.

Acknowledgment
The authors would like to thank the anonymous referees for their very constructive comments. This research was supported by the National Science Council of the Republic of China under contract NSC 89-2213-E-214-056.

References

[1] R. Agrawal, T. Imielinski and A. Swami, "Mining association rules between sets of items in large databases," The ACM SIGMOD Conference, pp. 207-216, Washington DC, USA, 1993.
[2] R. Agrawal, T. Imielinski and A. Swami, "Database mining: a performance perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 6, pp. 914-925, 1993.
[3] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," The International Conference on Very Large Data Bases, pp. 487-499, 1994.
[4] R. Agrawal and R. Srikant, "Mining sequential patterns," The Eleventh IEEE International Conference on Data Engineering, pp. 3-14, 1995.
[5] R. Agrawal, R. Srikant and Q. Vu, "Mining association rules with item constraints," The Third International Conference on Knowledge Discovery in Databases and Data Mining, pp. 67-73, Newport Beach, California, 1997.
[6] M.S. Chen, J. Han and P.S. Yu, "Data mining: an overview from a database perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 866-883, 1996.
[7] D.W. Cheung, J. Han, V.T. Ng and C.Y. Wong, "Maintenance of discovered association rules in large databases: an incremental updating approach," The Twelfth IEEE International Conference on Data Engineering, pp. 106-114, 1996.
[8] D.W. Cheung, S.D. Lee and B. Kao, "A general incremental technique for maintaining discovered association rules," Proceedings of Database Systems for Advanced Applications, pp. 185-194, Melbourne, Australia, 1997.
[9] T. Fukuda, Y. Morimoto, S. Morishita and T. Tokuyama, "Mining optimized association rules for numeric attributes," The ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 182-191, 1996.
[10] J. Han and Y. Fu, "Discovery of multiple-level association rules from large databases," The Twenty-first International Conference on Very Large Data Bases, pp. 420-431, Zurich, Switzerland, 1995.
[11] M.Y. Lin and S.Y. Lee, "Incremental update on sequential patterns in large databases," The Tenth IEEE International Conference on Tools with Artificial Intelligence, pp. 24-31, 1998.
[12] H. Mannila, H. Toivonen and A.I. Verkamo, "Efficient algorithms for discovering association rules," The AAAI Workshop on Knowledge Discovery in Databases, pp. 181-192, 1994.
[13] J.S. Park, M.S. Chen and P.S. Yu, "Using a hash-based method with transaction trimming for mining association rules," IEEE Transactions on Knowledge and Data Engineering, Vol. 9, No. 5, pp. 812-825, 1997.
[14] N.L. Sarda and N.V. Srinivas, "An adaptive algorithm for incremental mining of association rules," The Ninth International Workshop on Database and Expert Systems, pp. 240-245, 1998.
[15] R. Srikant and R. Agrawal, "Mining generalized association rules," The Twenty-first International Conference on Very Large Data Bases, pp. 407-419, Zurich, Switzerland, 1995.
[16] R. Srikant and R. Agrawal, "Mining quantitative association rules in large relational tables," The 1996 ACM SIGMOD International Conference on Management of Data, pp. 1-12, Montreal, Canada, 1996.
[17] S. Zhang, "Aggregation and maintenance for database mining," Intelligent Data Analysis, Vol. 3, No. 6, pp. 475-490, 1999.
