Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Agrawal association rules


Published on

Published in: Education, Technology
  • Be the first to comment

Agrawal association rules

  1. 1. Fast Algorithms for Mining Association Rules Rakesh Agrawal Ramakrishnan Srikant IBM Almaden Research Center 650 Harry Road, San Jose, CA 95120Abstract tires and auto accessories also get automotive servicesWe consider the problem of discovering association rules done. Finding all such rules is valuable for cross-between items in a large database of sales transactions. marketing and attached mailing applications. OtherWe present two new algorithms for solving this problem applications include catalog design, add-on sales,that are fundamentally di erent from the known algo- store layout, and customer segmentation based onrithms. Empirical evaluation shows that these algorithms buying patterns. The databases involved in theseoutperform the known algorithms by factors ranging from applications are very large. It is imperative, therefore,three for small problems to more than an order of mag- to have fast algorithms for this task.nitude for large problems. We also show how the bestfeatures of the two proposed algorithms can be combined The following is a formal statement of the probleminto a hybrid algorithm, called AprioriHybrid. Scale-up 4]: Let I = fi1 i2 . . . im g be a set of literals,experiments show that AprioriHybrid scales linearly with called items. Let D be a set of transactions, wherethe number of transactions. AprioriHybrid also has ex- each transaction T is a set of items such that Tcellent scale-up properties with respect to the transaction I . Associated with each transaction is a uniquesize and the number of items in the database. identi er, called its TID. We say that a transaction T contains X, a set of some items in I , if X T .1 Introduction An association rule is an implication of the formProgress in bar-code technology has made it possi- X =) Y , where X I , Y I , and X Y = .ble for retail organizations to collect and store mas- The rule X =) Y holds in the transaction set D withsive amounts of sales data, referred to as the basket con dence c if c% of transactions in D that containdata. A record in such data typically consists of the X also contain Y . The rule X =) Y has support stransaction date and the items bought in the trans- in the transaction set D if s% of transactions in Daction. Successful organizations view such databases contain X Y . Our rules are somewhat more generalas important pieces of the marketing infrastructure. than in 4] in that we allow a consequent to have moreThey are interested in instituting information-driven than one processes, managed by database technol- Given a set of transactions D, the problem of min-ogy, that enable marketers to develop and implement ing association rules is to generate all association rulescustomized marketing programs and strategies 6]. that have support and con dence greater than the The problem of mining association rules over basket user-speci ed minimum support (called minsup ) anddata was introduced in 4]. An example of such a minimum con dence (called minconf ) respectively.rule might be that 98% of customers that purchase Our discussion is neutral with respect to the repre- Visiting from the Department of Computer Science, Uni- sentation of D. For example, D could be a data le,versity of Wisconsin, Madison. a relational table, or the result of a relational expres- Permission to copy without fee all or part of this material granted provided that the copies are not made or distributed An algorithm for nding all association rules,for direct commercial advantage, the VLDB copyright noticeand the title of the publication and its date appear, and notice henceforth referred to as the AIS algorithm, was pre-is given that copying is by permission of the Very Large Data sented in 4]. Another algorithm for this task, calledBase Endowment. To copy otherwise, or to republish, requires the SETM algorithm, has been proposed in 13]. Ina fee and/or special permission from the Endowment. this paper, we present two new algorithms, AprioriProceedings of the 20th VLDB Conference and AprioriTid, that di er fundamentally from theseSantiago, Chile, 1994 algorithms. We present experimental results showing
  2. 2. that the proposed algorithms always outperform the for an itemset is the number of transactions thatearlier algorithms. The performance gap is shown to contain the itemset. Itemsets with minimum sup-increase with problem size, and ranges from a fac- port are called large itemsets, and all others smalltor of three for small problems to more than an or- itemsets. In Section 2, we give new algorithms,der of magnitude for large problems. We then dis- Apriori and AprioriTid, for solving this problem.cuss how the best features of Apriori and Apriori-Tid can be combined into a hybrid algorithm, called 2. Use the large itemsets to generate the desired rules.AprioriHybrid. Experiments show that the Apriori- Here is a straightforward algorithm for this task.Hybrid has excellent scale-up properties, opening up For every large itemset l, nd all non-empty subsetsthe feasibility of mining association rules over very of l. For every such subset a, output a rule oflarge databases. the form a =) (l ; a) if the ratio of support(l) The problem of nding association rules falls within to support(a) is at least minconf. We need tothe purview of database mining 3] 12], also called consider all subsets of l to generate rules withknowledge discovery in databases 21]. Related, but multiple consequents. Due to lack of space, we donot directly applicable, work includes the induction not discuss this subproblem further, but refer theof classi cation rules 8] 11] 22], discovery of causal reader to 5] for a fast algorithm.rules 19], learning of logical de nitions 18], ttingof functions to data 15], and clustering 9] 10]. The In Section 3, we show the relative performanceclosest work in the machine learning literature is the of the proposed Apriori and AprioriTid algorithmsKID3 algorithm presented in 20]. If used for nding against the AIS 4] and SETM 13] algorithms.all association rules, this algorithm will make as many To make the paper self-contained, we include anpasses over the data as the number of combinations overview of the AIS and SETM algorithms in thisof items in the antecedent, which is exponentially section. We also describe how the Apriori andlarge. Related work in the database literature is AprioriTid algorithms can be combined into a hybridthe work on inferring functional dependencies from algorithm, AprioriHybrid, and demonstrate the scale-data 16]. Functional dependencies are rules requiring up properties of this algorithm. We conclude bystrict satisfaction. Consequently, having determined pointing out some related open problems in Section 4.a dependency X ! A, the algorithms in 16] considerany other dependency of the form X + Y ! A 2 Discovering Large Itemsetsredundant and do not generate it. The association Algorithms for discovering large itemsets make mul-rules we consider are probabilistic in nature. The tiple passes over the data. In the rst pass, we countpresence of a rule X ! A does not necessarily mean the support of individual items and determine whichthat X + Y ! A also holds because the latter may of them are large, i.e. have minimumsupport. In eachnot have minimumsupport. Similarly, the presence of subsequent pass, we start with a seed set of itemsetsrules X ! Y and Y ! Z does not necessarily mean found to be large in the previous pass. We use thisthat X ! Z holds because the latter may not have seed set for generating new potentially large itemsets,minimum con dence. called candidate itemsets, and count the actual sup- There has been work on quantifying the useful- port for these candidate itemsets during the pass overness" or interestingness" of a rule 20]. What is use- the data. At the end of the pass, we determine whichful or interesting is often application-dependent. The of the candidate itemsets are actually large, and theyneed for a human in the loop and providing tools to become the seed for the next pass. This process con-allow human guidance of the rule discovery process tinues until no new large itemsets are found.has been articulated, for example, in 7] 14]. We do The Apriori and AprioriTid algorithms we proposenot discuss these issues in this paper, except to point di er fundamentally from the AIS 4] and SETM 13]out that these are necessary features of a rule discov- algorithms in terms of which candidate itemsets areery system that may use our algorithms as the engine counted in a pass and in the way that those candidatesof the discovery process. are generated. In both the AIS and SETM algorithms,1.1 Problem Decomposition and Paper candidate itemsets are generated on-the- y during the Organization pass as data is being read. Speci cally, after readingThe problem of discovering all association rules can a transaction, it is determined which of the itemsetsbe decomposed into two subproblems 4]: found large in the previous pass are present in the transaction. New candidate itemsets are generated by1. Find all sets of items (itemsets) that have transac- extending these large itemsets with other items in the tion support above minimum support. The support transaction. However, as we will see, the disadvantage
  3. 3. is that this results in unnecessarily generating and Table 1: Notationcounting too many candidate itemsets that turn outto be small. -itemset An itemset having k items. The Apriori and AprioriTid algorithms generate k Set of large k-itemsetsthe candidate itemsets to be counted in a pass by Lk (those with minimum support).using only the itemsets found large in the previous Each member of this set has two elds:pass { without considering the transactions in the i) itemset and ii) support count.database. The basic intuition is that any subset Set of candidate k-itemsetsof a large itemset must be large. Therefore, the Ck (potentially large itemsets).candidate itemsets having k items can be generated Each member of this set has two elds:by joining large itemsets having k ; 1 items, and i) itemset and ii) support count.deleting those that contain any subset that is not Set of candidate k-itemsets when the TIDslarge. This procedure results in generation of a much Ck of the generating transactions are keptsmaller number of candidate itemsets. associated with the candidates. The AprioriTid algorithm has the additional prop-erty that the database is not used at all for count- given transaction t. Section 2.1.2 describes the subseting the support of candidate itemsets after the rst function used for this purpose. See 5] for a discussionpass. Rather, an encoding of the candidate itemsets of bu er management.used in the previous pass is employed for this purpose.In later passes, the size of this encoding can become 1) L1 = flarge 1-itemsetsgmuch smaller than the database, thus saving much 2) for ( k = 2 Lk;1 6= k++ ) do beginreading e ort. We will explain these points in more 3) Ck = apriori-gen(Lk;1 ) // New candidatesdetail when we describe the algorithms. 4) forall transactions t 2 D do beginNotation We assume that items in each transaction 5) Ct = subset(Ck , t) // Candidates contained in t 6) forall candidates c 2 Ct doare kept sorted in their lexicographic order. It is 7) c:count++straightforward to adapt these algorithms to the case 8) endwhere the database D is kept normalized and each 9) Lk = fc 2 Ck j c:count minsup gdatabase record is a <TID, item> pair, where TID isthe identi er of the corresponding transaction. 10) end 11) Answer = k Lk S We call the number of items in an itemset its size,and call an itemset of size k a k-itemset. Items within Figure 1: Algorithm Apriorian itemset are kept in lexicographic order. We usethe notation c 1] c 2] . . . c k] to represent a k-itemset c consisting of items c 1] c 2] . . .c k], where 2.1.1 Apriori Candidate Generationc 1] < c 2] < . . . < c k]. If c = X Y and Y The apriori-gen function takes as argument Lk;1,is an m-itemset, we also call Y an m-extension of the set of all large (k ; 1)-itemsets. It returns aX. Associated with each itemset is a count eld to superset of the set of all large k-itemsets. Thestore the support for this itemset. The count eld is function works as follows. 1 First, in the join step,initialized to zero when the itemset is rst created. we join Lk;1 with Lk;1: We summarize in Table 1 the notation used in the insert into kalgorithms. The set C k is used by AprioriTid and will C select .item1 , .item2 , ..., .itemk;1 , .itemk;1be further discussed when we describe this algorithm. p p p q from k;1 , k;1 L p L q2.1 Algorithm Apriori where .item1 = .item1 , . . ., .itemk;2 = .itemk;2 , p q p q .itemk;1 < q.itemk;1Figure 1 gives the Apriori algorithm. The rst pass pof the algorithm simply counts item occurrences to Next, in the prune step, we delete all itemsets c 2 Ckdetermine the large 1-itemsets. A subsequent pass, such that some (k ; 1)-subset of c is not in Lk;1:say pass k, consists of two phases. First, the large 1 Concurrent to our work, the following two-step candidateitemsets Lk;1 found in the (k ; 1)th pass are used to generation procedure has been proposed in 17]:generate the candidate itemsets Ck , using the apriori- 0 k =f 0j 0 2 k;1 j 0 j = ; 2ggen function described in Section 2.1.1. Next, the C X X X X L X X k 0 k = f 2 k j contains members of k ;1 gdatabase is scanned and the support of candidates in C X C X k L These two steps are similar to our join and prune stepsCk is counted. For fast counting, we need to e ciently respectively. However, in general, step 1 would produce adetermine the candidates in Ck that are contained in a superset of the candidates produced by our join step.
  4. 4. forall itemsets 2 k do c C later passes when the cost of counting and keeping in forall ( ; 1)-subsets of do k s c 0 memory additional Ck+1 ; Ck+1 candidates becomes if ( 62 k;1 ) then s L less than the cost of scanning the database. delete from k c C 2.1.2 Subset Function Candidate itemsets Ck are stored in a hash-tree. AExample Let L3 be ff1 2 3g, f1 2 4g, f1 3 4g, f1 node of the hash-tree either contains a list of itemsets3 5g, f2 3 4gg. After the join step, C4 will be ff1 2 3 (a leaf node) or a hash table (an interior node). In an4g, f1 3 4 5g g. The prune step will delete the itemset interior node, each bucket of the hash table points tof1 3 4 5g because the itemset f1 4 5g is not in L3 . another node. The root of the hash-tree is de ned toWe will then be left with only f1 2 3 4g in C4. be at depth 1. An interior node at depth d points to Contrast this candidate generation with the one nodes at depth d+1. Itemsets are stored in the leaves.used in the AIS and SETM algorithms. In pass k When we add an itemset c, we start from the root andof these algorithms, a database transaction t is read go down the tree until we reach a leaf. At an interiorand it is determined which of the large itemsets in node at depth d, we decide which branch to followLk;1 are present in t. Each of these large itemsets by applying a hash function to the dth item of thel is then extended with all those large items that itemset. All nodes are initially created as leaf nodes.are present in t and occur later in the lexicographic When the number of itemsets in a leaf node exceedsordering than any of the items in l. Continuing with a speci ed threshold, the leaf node is converted to anthe previous example, consider a transaction f1 2 interior node.3 4 5g. In the fourth pass, AIS and SETM will Starting from the root node, the subset functiongenerate two candidates, f1 2 3 4g and f1 2 3 5g, nds all the candidates contained in a transactionby extending the large itemset f1 2 3g. Similarly, an t as follows. If we are at a leaf, we nd which ofadditional three candidate itemsets will be generated the itemsets in the leaf are contained in t and addby extending the other large itemsets in L3 , leading references to them to the answer set. If we are at anto a total of 5 candidates for consideration in the interior node and we have reached it by hashing thefourth pass. Apriori, on the other hand, generates item i, we hash on each item that comes after i in tand counts only one itemset, f1 3 4 5g, because it and recursively apply this procedure to the node inconcludes a priori that the other combinations cannot the corresponding bucket. For the root node, we hashpossibly have minimum support. on every item in t. To see why the subset function returns the desiredCorrectness We need to show that Ck Lk . set of references, consider what happens at the rootClearly, any subset of a large itemset must also node. For any itemset c contained in transaction t, thehave minimum support. Hence, if we extended each rst item of c must be in t. At the root, by hashing onitemset in Lk;1 with all possible items and then every item in t, we ensure that we only ignore itemsetsdeleted all those whose (k ; 1)-subsets were not in that start with an item not in t. Similar argumentsLk;1, we would be left with a superset of the itemsets apply at lower depths. The only additional factor isin Lk . that, since the items in any itemset are ordered, if we The join is equivalent to extending Lk;1 with each reach the current node by hashing the item i, we onlyitem in the database and then deleting those itemsets need to consider the items in t that occur after i.for which the (k ;1)-itemset obtained by deleting the(k;1)th item is not in Lk;1 . The condition p.itemk;1 2.2 Algorithm AprioriTid< q.itemk;1 simply ensures that no duplicates are The AprioriTid algorithm, shown in Figure 2, alsogenerated. Thus, after the join step, Ck Lk . By uses the apriori-gen function (given in Section 2.1.1)similar reasoning, the prune step, where we delete to determine the candidate itemsets before the passfrom Ck all itemsets whose (k ; 1)-subsets are not in begins. The interesting feature of this algorithm isLk;1, also does not delete any itemset that could be that the database D is not used for counting supportin Lk . after the rst pass. Rather, the set C k is used for this purpose. Each member of the set C k isVariation: Counting Candidates of Multiple of the form < TID fXk g >, where each Xk is aSizes in One Pass Rather than counting only potentially large k-itemset present in the transactioncandidates of size k in the kth pass, we can also with identi er TID. For k = 1, C 1 corresponds to 0 0count the candidates Ck+1, where Ck+1 is generated the database D, although conceptually each item i 0from Ck , etc. Note that Ck+1 Ck+1 since Ck+1 is is replaced by the itemset fig. For k > 1, C k isgenerated from Lk . This variation can pay o in the generated by the algorithm (step 10). The member
  5. 5. of C k corresponding to transaction t is <t:T ID, we generate C4 using L3 , it turns out to be empty,fc 2 Ck jc contained in tg>. If a transaction does and we terminate.not contain any candidate k-itemset, then C k willnot have an entry for this transaction. Thus, the Database C1number of entries in C k may be smaller than the TID Items TID Set-of-Itemsetsnumber of transactions in the database, especially for 100 1 3 4 100 f f1g, f3g, f4g glarge values of k. In addition, for large values of k, 200 2 3 5 200 f f2g, f3g, f5g geach entry may be smaller than the corresponding 300 1 2 3 5 300 f f1g, f2g, f3g, f5g gtransaction because very few candidates may be 400 2 5 400 f f2g, f5g gcontained in the transaction. However, for small C2values for k, each entry may be larger than the L1 Itemset Supportcorresponding transaction because an entry in Ck Itemset Support f1 2g 1includes all candidate k-itemsets contained in the f1g 2 f1 3g 2transaction. f2g 3 f1 5g 1 In Section 2.2.1, we give the data structures used f3g 3 f2 3g 2to implement the algorithm. See 5] for a proof of f5g 3 f2 5g 3correctness and a discussion of bu er management. f3 5g 21) L1 = flarge 1-itemsetsg TID C2 Set-of-Itemsets L22) C 1 = database D 100 f f1 3g g Itemset Support3) for ( k = 2 Lk;1 6= k++ ) do begin 200 f f2 3g, f2 5g, f3 5g g f1 3g 24) Ck = apriori-gen(Lk;1 ) // New candidates 300 f f1 2g, f1 3g, f1 5g, f2 3g 25) C k = f2 3g, f2 5g, f3 5g g f2 5g 36) forall entries t 2 C k;1 do begin 400 f f2 5g g f3 5g 27) // determine candidate itemsets in Ck contained // in the transaction with identi er t.TID Ct = fc 2 Ck j (c ; c k ]) 2 t:set-of-itemsets ^ C3 (c ; c k ; 1]) 2 t.set-of-itemsetsg C3 Itemset Support TID Set-of-Itemsets8) forall candidates c 2 Ct do f2 3 5g 2 200 f f2 3 5g g9) c:count++ 300 f f2 3 5g g10) if (Ct 6= ) then C k += < t:TID Ct >11) end L312) Lk = fc 2 Ck j c:count minsup g Itemset Support f2 3 5g13) end S14) Answer = k Lk 2 Figure 3: Example Figure 2: Algorithm AprioriTid 2.2.1 Data Structures We assign each candidate itemset a unique number,Example Consider the database in Figure 3 and called its ID. Each set of candidate itemsets Ck is keptassume that minimum support is 2 transactions. in an array indexed by the IDs of the itemsets in Ck .Calling apriori-gen with L1 at step 4 gives the A member of C k is now of the form < TID fIDg >.candidate itemsets C2. In steps 6 through 10, we Each C k is stored in a sequential structure.count the support of candidates in C2 by iterating over The apriori-gen function generates a candidate k-the entries in C 1 and generate C 2. The rst entry in itemset ck by joining two large (k ; 1)-itemsets. WeC 1 is f f1g f3g f4g g, corresponding to transaction maintain two additional elds for each candidate100. The Ct at step 7 corresponding to this entry t itemset: i) generators and ii) extensions. Theis f f1 3g g, because f1 3g is a member of C2 and generators eld of a candidate itemset ck stores theboth (f1 3g - f1g) and (f1 3g - f3g) are members of IDs of the two large (k ; 1)-itemsets whose joint.set-of-itemsets. generated ck . The extensions eld of an itemset Calling apriori-gen with L2 gives C3 . Making a pass ck stores the IDs of all the (k + 1)-candidates thatover the data with C 2 and C3 generates C 3. Note that are extensions of ck . Thus, when a candidate ck isthere is no entry in C 3 for the transactions with TIDs 1 2 generated by joining lk;1 and lk;1, we save the IDs100 and 400, since they do not contain any of the 1 and l2 in the generators eld for ck . At the of lk;1 k;1itemsets in C3 . The candidate f2 3 5g in C3 turns same time, the ID of ck is added to the extensionsout to be large and is the only member of L3. When 1 eld of lk;1.
  6. 6. We now describe how Step 7 of Figure 2 is It thus generates and counts every candidate itemsetimplemented using the above data structures. Recall that the AIS algorithm generates. However, to use thethat the t.set-of-itemsets eld of an entry t in C k;1 standard SQL join operation for candidate generation,gives the IDs of all (k ; 1)-candidates contained in SETM separates candidate generation from counting.transaction t.TID. For each such candidate ck;1 the It saves a copy of the candidate itemset together withextensions eld gives Tk , the set of IDs of all the the TID of the generating transaction in a sequentialcandidate k-itemsets that are extensions of ck;1. For structure. At the end of the pass, the support counteach ck in Tk , the generators eld gives the IDs of of candidate itemsets is determined by sorting andthe two itemsets that generated ck . If these itemsets aggregating this sequential structure.are present in the entry for t.set-of-itemsets, we can SETM remembers the TIDs of the generatingconclude that ck is present in transaction t.TID, and transactions with the candidate itemsets. To avoidadd ck to Ct. needing a subset operation, it uses this information to determine the large itemsets contained in the3 Performance transaction read. Lk C k and is obtained by deletingTo assess the relative performance of the algorithms those candidates that do not have minimum support.for discovering large sets, we performed several Assuming that the database is sorted in TID order,experiments on an IBM RS/6000 530H workstation SETM can easily nd the large itemsets contained in awith a CPU clock rate of 33 MHz, 64 MB of main transaction in the next pass by sorting Lk on TID. Inmemory, and running AIX 3.2. The data resided in fact, it needs to visit every member of Lk only once inthe AIX le system and was stored on a 2GB SCSI the TID order, and the candidate generation can be3.5" drive, with measured sequential throughput of performed using the relational merge-join operationabout 2 MB/second. 13]. We rst give an overview of the AIS 4] and SETM The disadvantage of this approach is mainly due 13] algorithms against which we compare the per- to the size of candidate sets C k . For each candidateformance of the Apriori and AprioriTid algorithms. itemset, the candidate set now has as many entriesWe then describe the synthetic datasets used in the as the number of transactions in which the candidateperformance evaluation and show the performance re- itemset is present. Moreover, when we are ready tosults. Finally, we describe how the best performance count the support for candidate itemsets at the endfeatures of Apriori and AprioriTid can be combined of the pass, C k is in the wrong order and needs to beinto an AprioriHybrid algorithm and demonstrate its sorted on itemsets. After counting and pruning outscale-up properties. small candidate itemsets that do not have minimum support, the resulting set Lk needs another sort on3.1 The AIS Algorithm TID before it can be used for generating candidatesCandidate itemsets are generated and counted on- in the next pass.the- y as the database is scanned. After reading atransaction, it is determined which of the itemsets 3.3 Generation of Synthetic Datathat were found to be large in the previous pass are We generated synthetic transactions to evaluate thecontained in this transaction. New candidate itemsets performance of the algorithms over a large range ofare generated by extending these large itemsets with data characteristics. These transactions mimic theother items in the transaction. A large itemset l is transactions in the retailing environment. Our modelextended with only those items that are large and of the real" world is that people tend to buy setsoccur later in the lexicographic ordering of items than of items together. Each such set is potentially aany of the items in l. The candidates generated maximal large itemset. An example of such a setfrom a transaction are added to the set of candidate might be sheets, pillow case, comforter, and ru es.itemsets maintained for the pass, or the counts of However, some people may buy only some of thethe corresponding entries are increased if they were items from such a set. For instance, some peoplecreated by an earlier transaction. See 4] for further might buy only sheets and pillow case, and some onlydetails of the AIS algorithm. sheets. A transaction may contain more than one large itemset. For example, a customer might place an3.2 The SETM Algorithm order for a dress and jacket when ordering sheets andThe SETM algorithm 13] was motivated by the desire pillow cases, where the dress and jacket together formto use SQL to compute large itemsets. Like AIS, another large itemset. Transaction sizes are typicallythe SETM algorithm also generates candidates on- clustered around a mean and a few transactions havethe- y based on transactions read from the database. many items. Typical sizes of large itemsets are also
  7. 7. clustered around a mean, with a few large itemsets probability of picking the associated itemset.having a large number of items. To model the phenomenon that all the items in To create a dataset, our synthetic data generation a large itemset are not always bought together, weprogram takes the parameters shown in Table 2. assign each itemset in T a corruption level c. When adding an itemset to a transaction, we keep dropping Table 2: Parameters an item from the itemset as long as a uniformly distributed random number between 0 and 1 is less j j Number of transactions D than c. Thus for an itemset of size l, we will add l j j Average size of the transactions T items to the transaction 1 ; c of the time, l ; 1 items j j Average size of the maximal potentially I c(1 ; c) of the time, l ; 2 items c2(1 ; c) of the time, large itemsets etc. The corruption level for an itemset is xed and j j Number of maximal potentially large itemsets L is obtained from a normal distribution with mean 0.5 N Number of items and variance 0.1. We rst determine the size of the next transaction. We generated datasets by setting N = 1000 and jLjThe size is picked from a Poisson distribution with = 2000. We chose 3 values for jT j: 5, 10, and 20. Wemean equal to jT j. Note that if each item is chosen also chose 3 values for jI j: 2, 4, and 6. The number ofwith the same probability p, and there are N items, transactions was to set to 100,000 because, as we willthe expected number of items in a transaction is given see in Section 3.4, SETM could not be run for largerby a binomial distribution with parameters N and p, values. However, for our scale-up experiments, weand is approximated by a Poisson distribution with generated datasets with up to 10 million transactionsmean Np. (838MB for T20). Table 3 summarizes the dataset We then assign items to the transaction. Each parameter settings. For the same jT j and jDj values,transaction is assigned a series of potentially large the size of datasets in megabytes were roughly equalitemsets. If the large itemset on hand does not t in for the di erent values of jI j.the transaction, the itemset is put in the transactionanyway in half the cases, and the itemset is moved to Table 3: Parameter settingsthe next transaction the rest of the cases. Large itemsets are chosen from a set T of such Name jT j jI j jDj Size in Megabytesitemsets. The number of itemsets in T is set to T5.I2.D100K 5 2 100K 2.4jLj. There is an inverse relationship between jLj and T10.I2.D100K 10 2 100K 4.4 T10.I4.D100K 10 4 100Kthe average support for potentially large itemsets. T20.I2.D100K 20 2 100K 8.4An itemset in T is generated by rst picking the T20.I4.D100K 20 4 100Ksize of the itemset from a Poisson distribution with T20.I6.D100K 20 6 100Kmean equal to jI j. Items in the rst itemsetare chosen randomly. To model the phenomenon 3.4 Relative Performancethat large itemsets often have common items, somefraction of items in subsequent itemsets are chosen Figure 4 shows the execution times for the sixfrom the previous itemset generated. We use an synthetic datasets given in Table 3 for decreasingexponentially distributed random variable with mean values of minimum support. As the minimum supportequal to the correlation level to decide this fraction decreases, the execution times of all the algorithmsfor each itemset. The remaining items are picked at increase because of increases in the total number ofrandom. In the datasets used in the experiments, candidate and large itemsets.the correlation level was set to 0.5. We ran some For SETM, we have only plotted the executionexperiments with the correlation level set to 0.25 and times for the dataset T5.I2.D100K in Figure 4. The0.75 but did not nd much di erence in the nature of execution times for SETM for the two datasets withour performance results. an average transaction size of 10 are given in Table 4. Each itemset in T has a weight associated with We did not plot the execution times in Table 4it, which corresponds to the probability that this on the corresponding graphs because they are tooitemset will be picked. This weight is picked from large compared to the execution times of the otheran exponential distribution with unit mean, and is algorithms. For the three datasets with transactionthen normalized so that the sum of the weights for all sizes of 20, SETM took too long to execute andthe itemsets in T is 1. The next itemset to be put we aborted those runs as the trends were the transaction is chosen from T by tossing an jLj- Clearly, Apriori beats SETM by more than an ordersided weighted coin, where the weight for a side is the of magnitude for large datasets.
  8. 8. T5.I2.D100K T10.I2.D100K 80 160 SETM AIS 70 AIS 140 AprioriTid AprioriTid Apriori Apriori 60 120 50 100Time (sec) Time (sec) 40 80 30 60 20 40 10 20 0 0 2 1.5 1 0.75 0.5 0.33 0.25 2 1.5 1 0.75 0.5 0.33 0.25 Minimum Support Minimum Support T10.I4.D100K T20.I2.D100K 350 1000 AIS AIS AprioriTid 900 AprioriTid 300 Apriori Apriori 800 250 700 600Time (sec) Time (sec) 200 500 150 400 100 300 200 50 100 0 0 2 1.5 1 0.75 0.5 0.33 0.25 2 1.5 1 0.75 0.5 0.33 0.25 Minimum Support Minimum Support T20.I4.D100K T20.I6.D100K 1800 3500 AIS AIS 1600 AprioriTid AprioriTid Apriori 3000 Apriori 1400 2500 1200Time (sec) Time (sec) 1000 2000 800 1500 600 1000 400 500 200 0 0 2 1.5 1 0.75 0.5 0.33 0.25 2 1.5 1 0.75 0.5 0.33 0.25 Minimum Support Minimum Support Figure 4: Execution times
  9. 9. Table 4: Execution times in seconds for SETM SETM algorithm to perform poorly.2 This explains the jump in time for SETM in Table 4 when going Algorithm Minimum Support from 1.5% support to 1.0% support for datasets with 2.0% 1.5% 1.0% 0.75% 0.5% transaction size 10. The largest dataset in the scale- Dataset T10.I2.D100K up experiments for SETM in 13] was still small SETM 74 161 838 1262 1878 enough that C k could t in memory hence they did Apriori 4.4 5.3 11.0 14.5 15.3 not encounter this jump in execution time. Note that Dataset T10.I4.D100K for the same minimum support, the support count for SETM 41 91 659 929 1639 candidate itemsets increases linearly with the number Apriori 3.8 4.8 11.2 17.4 19.3 of transactions. Thus, as we increase the number of transactions for the same values of jT j and jI j, though Apriori beats AIS for all problem sizes, by factors the size of Ck does not change, the size of C k goes upranging from 2 for high minimum support to more linearly. Thus, for datasets with more transactions,than an order of magnitude for low levels of support. the performance gap between SETM and the otherAIS always did considerably better than SETM. For algorithms will become even larger.small problems, AprioriTid did about as well as The problem with AIS is that it generates too manyApriori, but performance degraded to about twice as candidates that later turn out to be small, causingslow for large problems. it to waste too much e ort. Apriori also counts too many small sets in the second pass (recall that C2 is3.5 Explanation of the Relative Performance really a cross-product of L1 with L1 ). However, thisTo explain these performance trends, we show in wastage decreases dramatically from the third passFigure 5 the sizes of the large and candidate sets in onward. Note that for the example in Figure 5, afterdi erent passes for the T10.I4.D100K dataset for the pass 3, almost every candidate itemset counted byminimum support of 0.75%. Note that the Y-axis in Apriori turns out to be a large set.this graph has a log scale. AprioriTid also has the problem of SETM that C k tends to be large. However, the apriori candidate 1e+07 generation used by AprioriTid generates signi cantly fewer candidates than the transaction-based candi- C-k-m (SETM) C-k-m (AprioriTid) date generation used by SETM. As a result, the C k of 1e+06 C-k (AIS, SETM) C-k (Apriori, AprioriTid) 100000 L-k AprioriTid has fewer entries than that of SETM. Apri- oriTid is also able to use a single word (ID) to store Number of Itemsets 10000 a candidate rather than requiring as many words as the number of items in the candidate.3 In addition, 1000 unlike SETM, AprioriTid does not have to sort C k . Thus, AprioriTid does not su er as much as SETM 100 from maintaining C k . 10 AprioriTid has the nice feature that it replaces a pass over the original dataset by a pass over the set 1 C k . Hence, AprioriTid is very e ective in later passes 1 2 3 4 5 6 7 when the size of C k becomes small compared to theFigure 5: Sizes of the large and candidate sets Pass Number 2 The cost of external sorting in SETM can be reduced(T10.I4.D100K, minsup = 0.75%) somewhat as follows. Before writing out entries in k to C The fundamental problem with the SETM algo- disk, we can sort them on itemsets using an internal sortingrithm is the size of its C k sets. Recall that the size of procedure, and write them as sorted runs. These sorted runs can then be merged to obtain support counts. However,the set C k is given by given the poor performance of SETM, we do not expect this X support-count(c): optimization to a ect the algorithm choice. 3 For SETM to use IDs, it would have to maintain two additional in-memory data structures: a hash table to nd candidate itemsets c out whether a candidate has been generated previously, and a mapping from the IDs to candidates. However, this wouldThus, the sets C k are roughly S times bigger than the destroy the set-oriented nature of the algorithm. Also, once wecorresponding Ck sets, where S is the average support have the hash table which gives us the IDs of candidates, wecount of the candidate itemsets. Unless the problem might as well count them at the same time and avoid the two external sorts. We experimentedwith this variant of SETM andsize is very small, the C k sets have to be written found that, while it did better than SETM, it still performedto disk, and externally sorted twice, causing the much worse than Apriori or AprioriTid.
  10. 10. size of the database. Thus, we nd that AprioriTid in Ck . From this, we estimate what the size of C kbeats Apriori when its C k sets can t in memory andthe distribution of the large itemsets has a long tail. P would have been if it had been generated. This size, in words, is ( candidates c 2 Ck support(c) +When C k doesnt t in memory, there is a jump in number of transactions). If C k in this pass was smallthe execution time for AprioriTid, such as when going enough to t in memory, and there were fewer largefrom 0.75% to 0.5% for datasets with transaction size candidates in the current pass than the previous pass,10 in Figure 4. In this region, Apriori starts beating we switch to AprioriTid. The latter condition is addedAprioriTid. to avoid switching when C k in the current pass ts in memory but C k in the next pass may not.3.6 Algorithm AprioriHybrid Switching from Apriori to AprioriTid does involveIt is not necessary to use the same algorithm in all the a cost. Assume that we decide to switch from Aprioripasses over data. Figure 6 shows the execution times to AprioriTid at the end of the kth pass. In thefor Apriori and AprioriTid for di erent passes over the (k + 1)th pass, after nding the candidate itemsetsdataset T10.I4.D100K. In the earlier passes, Apriori contained in a transaction, we will also have to adddoes better than AprioriTid. However, AprioriTid their IDs to C k+1 (see the description of AprioriTidbeats Apriori in later passes. We observed similar in Section 2.2). Thus there is an extra cost incurredrelative behavior for the other datasets, the reason in this pass relative to just running Apriori. It is onlyfor which is as follows. Apriori and AprioriTid in the (k +2)th pass that we actually start runninguse the same candidate generation procedure and AprioriTid. Thus, if there are no large (k+1)-itemsets,therefore count the same itemsets. In the later or no (k + 2)-candidates, we will incur the cost ofpasses, the number of candidate itemsets reduces switching without getting any of the savings of using(see the size of Ck for Apriori and AprioriTid in AprioriTid.Figure 5). However, Apriori still examines every Figure 7 shows the performance of AprioriHybridtransaction in the database. On the other hand, relative to Apriori and AprioriTid for three datasets.rather than scanning the database, AprioriTid scans AprioriHybrid performs better than Apriori in almostC k for obtaining support counts, and the size of C k all cases. For T10.I2.D100K with 1.5% support,has become smaller than the size of the database. AprioriHybrid does a little worse than Apriori sinceWhen the C k sets can t in memory, we do not even the pass in which the switch occurred was theincur the cost of writing them to disk. last pass AprioriHybrid thus incurred the cost of 14 switching without realizing the bene ts. In general, Apriori the advantage of AprioriHybrid over Apriori depends 12 AprioriTid on how the size of the C k set decline in the later passes. If C k remains large until nearly the end and 10 then has an abrupt drop, we will not gain much by using AprioriHybrid since we can use AprioriTid only Time (sec) 8 for a short period of time after the switch. This is 6 what happened with the T20.I6.D100K dataset. On the other hand, if there is a gradual decline in the 4 size of C k , AprioriTid can be used for a while after the switch, and a signi cant improvement can be obtained 2 in the execution time. 0 1 2 3 4 5 6 7 3.7 Scale-up ExperimentFigure 6: Per pass execution times of Apriori and Pass # Figure 8 shows how AprioriHybrid scales up as theAprioriTid (T10.I4.D100K, minsup = 0.75%) number of transactions is increased from 100,000 to 10 million transactions. We used the combinations Based on these observations, we can design a (T5.I2), (T10.I4), and (T20.I6) for the average sizeshybrid algorithm, which we call AprioriHybrid, that of transactions and itemsets respectively. All otheruses Apriori in the initial passes and switches to parameters were the same as for the data in Table 3.AprioriTid when it expects that the set C k at the The sizes of these datasets for 10 million transactionsend of the pass will t in memory. We use the were 239MB, 439MB and 838MB respectively. Thefollowing heuristic to estimate if C k would t in minimum support level was set to 0.75%. Thememory in the next pass. At the end of the execution times are normalized with respect to thecurrent pass, we have the counts of the candidates times for the 100,000 transaction datasets in the rst
  11. 11. graph and with respect to the 1 million transaction dataset in the second. As shown, the execution times scale quite linearly.T10.I2.D100K 40 12 AprioriTid 35 Apriori T20.I6 AprioriHybrid T10.I4 10 T5.I2 30 25 8 Time (sec) Relative Time 20 6 15 4 10 5 2 0 2 1.5 1 0.75 0.5 0.33 0.25 0 Minimum Support 100 250 500 750 1000T10.I4.D100K 14 Number of Transactions (in ’000s) 55 T20.I6 AprioriTid T10.I4 50 Apriori 12 T5.I2 45 AprioriHybrid 10 40 Relative Time 35 8 Time (sec) 30 25 6 20 4 15 10 2 5 0 0 1 2.5 5 7.5 10 2 1.5 1 0.75 0.5 0.33 0.25 Number of Transactions (in Millions) Minimum SupportT20.I6.D100K Figure 8: Number of transactions scale-up 700 AprioriTid 600 Apriori AprioriHybrid Next, we examined how AprioriHybrid scaled up with the number of items. We increased the num- 500 ber of items from 1000 to 10,000 for the three pa- rameter settings T5.I2.D100K, T10.I4.D100K and Time (sec) 400 T20.I6.D100K. All other parameters were the same as for the data in Table 3. We ran experiments for a minimum support at 0.75%, and obtained the results 300 200 shown in Figure 9. The execution times decreased a little since the average support for an item decreased 100 as we increased the number of items. This resulted in fewer large itemsets and, hence, faster execution 0 times. Finally, we investigated the scale-up as we increased 2 1.5 1 0.75 0.5 0.33 0.25 Minimum Support the average transaction size. The aim of this Figure 7: Execution times: AprioriHybrid experiment was to see how our data structures scaled with the transaction size, independent of other factors like the physical database size and the number of large itemsets. We kept the physical size of the
  12. 12. 45 30 T20.I6 500 40 T10.I4 750 T5.I2 25 1000 35 30 20 Time (sec) Time (sec) 25 15 20 15 10 10 5 5 0 0 1000 2500 5000 7500 10000 5 10 20 30 40 50 Number of Items Transaction Size Figure 9: Number of items scale-up Figure 10: Transaction size scale-updatabase roughly constant by keeping the product posed algorithms can be combined into a hybrid al-of the average transaction size and the number of gorithm, called AprioriHybrid, which then becomestransactions constant. The number of transactions the algorithm of choice for this problem. Scale-up ex-ranged from 200,000 for the database with an average periments showed that AprioriHybrid scales linearlytransaction size of 5 to 20,000 for the database with with the number of transactions. In addition, the ex-an average transaction size 50. Fixing the minimum ecution time decreases a little as the number of itemssupport as a percentage would have led to large in the database increases. As the average transactionincreases in the number of large itemsets as the size increases (while keeping the database size con-transaction size increased, since the probability of stant), the execution time increases only gradually.a itemset being present in a transaction is roughly These experiments demonstrate the feasibility of us-proportional to the transaction size. We therefore ing AprioriHybrid in real applications involving very xed the minimum support level in terms of the large databases.number of transactions. The results are shown in The algorithms presented in this paper have beenFigure 10. The numbers in the key (e.g. 500) refer implemented on several data repositories, includingto this minimum support. As shown, the execution the AIX le system, DB2/MVS, and DB2/6000.times increase with the transaction size, but only We have also tested these algorithms against realgradually. The main reason for the increase was that customer data, the details of which can be found inin spite of setting the minimum support in terms 5]. In the future, we plan to extend this work alongof the number of transactions, the number of large the following dimensions:itemsets increased with increasing transaction length. Multiple taxonomies (is-a hierarchies) over itemsA secondary reason was that nding the candidates are often available. An example of such apresent in a transaction took a little longer time. hierarchy is that a dish washer is a kitchen appliance is a heavy electric appliance, etc. We4 Conclusions and Future Work would like to be able to nd association rules thatWe presented two new algorithms, Apriori and Apri- use such hierarchies.oriTid, for discovering all signi cant association rules We did not consider the quantities of the itemsbetween items in a large database of transactions. bought in a transaction, which are useful for someWe compared these algorithms to the previously applications. Finding such rules needs furtherknown algorithms, the AIS 4] and SETM 13] algo- work.rithms. We presented experimental results, showingthat the proposed algorithms always outperform AIS The work reported in this paper has been doneand SETM. The performance gap increased with the in the context of the Quest project at the IBM Al-problem size, and ranged from a factor of three for maden Research Center. In Quest, we are exploringsmall problems to more than an order of magnitude the various aspects of the database mining problem.for large problems. Besides the problem of discovering association rules, We showed how the best features of the two pro- some other problems that we have looked into include
  13. 13. the enhancement of the database capability with clas- 10] D. H. Fisher. Knowledge acquisition via incre-si cation queries 2] and similarity queries over time mental conceptual clustering. Machine Learning,sequences 1]. We believe that database mining is an 2(2), 1987.important new application area for databases, com-bining commercial interest with intriguing research 11] J. Han, Y. Cai, and N. Cercone. Knowledgequestions. discovery in databases: An attribute oriented approach. In Proc. of the VLDB Conference, pages 547{559, Vancouver, British Columbia,Acknowledgment We wish to thank Mike Carey Canada, 1992.for his insightful comments and suggestions. 12] M. Holsheimer and A. Siebes. Data mining: TheReferences search for knowledge in databases. Technical Report CS-R9406, CWI, Netherlands, 1994. 1] R. Agrawal, C. Faloutsos, and A. Swami. Ef- cient similarity search in sequence databases. 13] M. Houtsma and A. Swami. Set-oriented mining In Proc. of the Fourth International Conference of association rules. Research Report RJ 9567, on Foundations of Data Organization and Algo- IBM Almaden Research Center, San Jose, Cali- rithms, Chicago, October 1993. fornia, October 1993. 2] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and 14] R. Krishnamurthy and T. Imielinski. Practi- A. Swami. An interval classi er for database tioner problems in need of database research: Re- mining applications. In Proc. of the VLDB search directions in knowledge discovery. SIG- Conference, pages 560{573, Vancouver, British MOD RECORD, 20(3):76{78, September 1991. Columbia, Canada, 1992. 15] P. Langley, H. Simon, G. Bradshaw, and 3] R. Agrawal, T. Imielinski, and A. Swami. J. Zytkow. Scienti c Discovery: Computational Database mining: A performance perspective. Explorations of the Creative Process. MIT Press, IEEE Transactions on Knowledge and Data En- 1987. gineering, 5(6):914{925, December 1993. Special 16] H. Mannila and K.-J. Raiha. Dependency Issue on Learning and Discovery in Knowledge- inference. In Proc. of the VLDB Conference, Based Databases. pages 155{158, Brighton, England, 1987. 4] R. Agrawal, T. Imielinski, and A. Swami. Mining 17] H. Mannila, H. Toivonen, and A. I. Verkamo. association rules between sets of items in large E cient algorithms for discovering association databases. In Proc. of the ACM SIGMOD Con- rules. In KDD-94: AAAI Workshop on Knowl- ference on Management of Data, Washington, edge Discovery in Databases, July 1994. D.C., May 1993. 18] S. Muggleton and C. Feng. E cient induction 5] R. Agrawal and R. Srikant. Fast algorithms for of logic programs. In S. Muggleton, editor, mining association rules in large databases. Re- Inductive Logic Programming. Academic Press, search Report RJ 9839, IBM Almaden Research 1992. Center, San Jose, California, June 1994. 19] J. Pearl. Probabilistic reasoning in intelligent 6] D. S. Associates. The new direct marketing. systems: Networks of plausible inference, 1992. Business One Irwin, Illinois, 1990. 20] G. Piatestsky-Shapiro. Discovery, analy- 7] R. Brachman et al. Integrated support for data sis, and presentation of strong rules. In archeology. In AAAI-93 Workshop on Knowledge G. Piatestsky-Shapiro, editor, Knowledge Dis- Discovery in Databases, July 1993. covery in Databases. AAAI/MIT Press, 1991. 8] L. Breiman, J. H. Friedman, R. A. Olshen, and 21] G. Piatestsky-Shapiro, editor. Knowledge Dis- C. J. Stone. Classi cation and Regression Trees. covery in Databases. AAAI/MIT Press, 1991. Wadsworth, Belmont, 1984. 22] J. R. Quinlan. C4.5: Programs for Machine 9] P. Cheeseman et al. Autoclass: A bayesian Learning. Morgan Kaufman, 1993. classi cation system. In 5th Intl Conf. on Machine Learning. Morgan Kaufman, June 1988.