ISSN: 2278 – 1323, International Journal of Advanced Research in Computer Engineering & Technology, Volume 1, Issue 5, July 2012. All Rights Reserved © 2012 IJARCET

Frequent Itemset Mining Technique in Data Mining

Sanjaydeep Singh Lodhi, Department of Computer Application (Software Systems), S.A.T.I (Govt. Autonomous College), Vidisha (M.P.), India
Premnarayan Arya, Asst. Prof., Dept. of CA (Software Systems), S.A.T.I (Govt. Autonomous College), Vidisha (M.P.), India
Dilip Vishwakarma, Department of Computer Application (Software Systems), S.A.T.I (Govt. Autonomous College), Vidisha (M.P.), India

Abstract

In computer science and data mining, Apriori is a classic algorithm for learning association rules. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of website visits). Frequent itemsets play an essential role in many data mining tasks that try to find interesting patterns in databases, such as association rules, correlations, sequences, episodes, classifiers and clusters, of which the mining of association rules is one of the most popular problems. In this paper, we take the classic Apriori algorithm and improve it quite significantly by introducing what we call a vertical sort. We then use a large dataset of web documents to contrast our performance against several state-of-the-art implementations and demonstrate not only equal efficiency with lower memory usage at all support thresholds, but also the ability to mine support thresholds as yet unattempted in the literature. We also indicate how we believe this work can be extended to achieve yet more impressive results. We have demonstrated that our implementation produces the same results, with the same performance, as the best of the state-of-the-art implementations. In particular, we have started with the classic algorithm for this problem and introduced a conceptually simple idea, sorting, the consequences of which have permitted us to outperform all of the available state-of-the-art implementations.

Keywords: Data Mining, Apriori Algorithm, Frequent Itemset Mining.

Introduction

Data mining is the automated process of extracting useful patterns from typically large quantities of data. Different types of patterns capture different types of structures in the data: a pattern may take the form of a rule, cluster, set, sequence, graph or tree, and each of these pattern types is able to express different structures and relationships present in the data. The problem of mining probabilistic frequent itemsets in uncertain or probabilistic databases includes mining the probability distribution of support. One challenge is to compute this probability distribution efficiently. The other challenge is to mine probabilistic frequent itemsets using this computation. An Apriori-style algorithm can be used for the itemset mining task. The first algorithm based on the FP-Growth idea that is able to handle probabilistic data solves the problem much faster than the Apriori-style approach. Finally, the problem can be solved in the GIM framework, leading to the GIM-PFIM algorithm. This improves the run time by orders of magnitude, reduces the space usage, and also allows the problem to be viewed more naturally as a vector-based problem. The vectors are real-valued probability vectors, and the search can benefit from subspace projections. The Generalized Interaction Mining (GIM) methods solve interaction mining problems at the abstract level. Since GIM leaves the semantics of the interactions, their interestingness measures and the space in which the interactions are to be mined as flexible components, it creates a layer of abstraction between a problem's definition/semantics and the algorithm used to solve it, allowing both to vary independently of each other. This was achieved by developing a consistent but general geometric computation model
based on vectors and vector-valued functions. For most problems, the space required is provably linear in the size of the data set. The run time is provably linear in the number of interactions that need to be examined. These properties allow it to outperform specialist algorithms when applied to specific interaction mining problems [1] and [3].

In this paper, we take the classic algorithm for the problem, Apriori, and improve it quite significantly by introducing what we call a vertical sort. We have demonstrated that our implementation produces the same results, with the same performance, as the best of the state-of-the-art implementations. But, whereas they exhaust their memory in order to decrease the support threshold, the memory utilization of our implementation remains relatively constant. We believe that this work opens up many avenues for yet more pronounced improvement.

Related Works on Frequent Itemset Mining

The approach proposed by Chui et al. computes the expected support of itemsets by summing all itemset probabilities in their U-Apriori algorithm. Later, they additionally proposed a probabilistic filter in order to prune candidates early. The UF-growth algorithm was proposed next. Like U-Apriori, UF-growth computes frequent itemsets by means of the expected support, but it uses the FP-tree approach in order to avoid expensive candidate generation. In contrast to our probabilistic approach, itemsets are considered frequent if the expected support exceeds minSup. The main drawback of this estimator is that information about the uncertainty of the expected support is lost; it ignores the number of possible worlds in which an itemset is frequent. Exact and sampling-based algorithms have been proposed to find likely frequent items in streaming probabilistic data; however, they do not consider itemsets with more than one item. The current state-of-the-art (and only) approach for probabilistic frequent itemset mining (PFIM) in uncertain databases uses an Apriori-like algorithm to mine all probabilistic frequent itemsets and the Poisson binomial recurrence to compute the support probability distribution function (SPDF). We provide a faster solution by proposing the first probabilistic frequent pattern growth approach (ProFP-Growth), thus avoiding expensive candidate generation and allowing us to perform PFIM on large databases. Furthermore, we use a more intuitive generating function method to compute the SPDF.

Existing approaches in the field of uncertain data management and mining can be categorized into a number of research directions. Most related to our work are the two categories "probabilistic databases" and "probabilistic query processing". The uncertainty model used in our approach is very close to the model used for probabilistic databases. A probabilistic database denotes a database composed of relations with uncertain tuples, where each tuple is associated with a probability denoting the likelihood that it exists in the relation. This model, called "tuple uncertainty", adopts the possible worlds semantics. A probabilistic database represents a set of possible "certain" database instances (worlds), where a database instance corresponds to a subset of uncertain tuples. Each instance (world) is associated with the probability that the world is "true". The probabilities reflect the probability distribution of all possible database instances. In the general model description, the possible worlds are constrained by rules that are defined on the tuples in order to incorporate object (tuple) correlations. The ULDB model, which is used in Trio, supports uncertain tuples with alternative instances, which are called x-tuples. Relations in ULDB are called x-relations, containing a set of x-tuples. Each x-tuple corresponds to a set of tuple instances which are assumed to be mutually exclusive, i.e. no more than one instance of an x-tuple can appear in a possible world instance at the same time. Probabilistic top-k query approaches are usually associated with uncertain databases using the tuple uncertainty model. The first approach able to solve probabilistic queries efficiently under tuple independence did so by means of dynamic programming techniques [1] and [4]. Recently, a novel approach was proposed to solve a wide class of queries with the same time complexity, but in a more elegant and also more powerful way, using generating functions.

The efficient data structures and techniques used in frequent itemset mining, such as TID-lists, the FP-tree (which adopts a prefix tree structure, as used in FP-growth) and the hyper-linked array based structure (as used in H-mine), can no longer be used as such directly on uncertain data. Therefore, recent work on frequent itemset mining in uncertain data that inherits the breadth-first and depth-first approaches from traditional frequent itemset mining adapts the data structures to the probabilistic model. U-Apriori is based on a level-wise algorithm and represents a baseline algorithm for mining frequent itemsets from uncertain datasets. Because of the generate-and-test strategy, level by level, the method does not scale well. UCP-Apriori is based on the decremental pruning technique, which consists in maintaining an upper bound of the support and decrementing it while
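The expected-support estimator used by U-Apriori and UF-growth can be made concrete with a small sketch. This is not code from the paper; it assumes the common attribute-uncertainty model in which each transaction assigns an independent existence probability to each of its items, so an itemset's probability in a transaction is the product of its items' probabilities, and its expected support is the sum of those products over all transactions.

```python
def expected_support(itemset, uncertain_db):
    """Expected support of `itemset` in an uncertain database.

    uncertain_db: list of transactions, each a dict mapping an item to
    its existence probability (items assumed independent).
    """
    total = 0.0
    for t in uncertain_db:
        p = 1.0
        for item in itemset:
            p *= t.get(item, 0.0)  # an absent item has probability 0
        total += p
    return total

db = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.4, "c": 1.0},
    {"a": 1.0, "b": 1.0},
]
print(expected_support({"a", "b"}, db))  # 0.45 + 0.0 + 1.0 = 1.45
```

As the surrounding text notes, this single number discards the shape of the support distribution over possible worlds, which is exactly the information that probabilistic frequent itemset mining tries to preserve.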
scanning the dataset. The itemset is pruned as soon as its most optimistic value falls below the threshold. This approach represents the state of the art for mining frequent patterns from uncertain data with a generate-and-prune strategy. UF-growth extends the FP-Growth algorithm. It is based on a UF-tree data structure (similar to the FP-tree). The difference from the construction of an FP-tree is that a transaction is merged with a child only if the same item and the same expected support exist in the transaction and in the child node, leading to a far lower compression ratio than in the original FP-tree. The improvements consist in discretizing the expected support, to avoid the huge number of different values, and in adapting the idea of the co-occurrence frequent itemset tree (COFI-tree): the UF-trees are built only for the first two levels, and the frequent patterns are then enumerated by traversing the tree and decreasing the occurrence counts.

Aggarwal et al. extended several existing classical frequent itemset mining algorithms for deterministic data sets and compared their relative performance in terms of efficiency and memory usage. The UH-mine algorithm, proposed in their paper, provides the best trade-offs. The algorithm is based on the pattern growth paradigm. The main difference from UF-growth is the data structure used, which is a hyperlinked array. The limitations of these existing methods are the ones inherited from the original methods. The size of the data affects the scalability of the level-wise generate-and-test techniques, and the pattern-growth techniques require a lot of memory for accommodating the dataset in data structures such as the FP-tree, especially when the transactions do not share many items. In the case of uncertain data, not only the items but also the existence probabilities have to be shared to achieve a good compression, which is often not the case [2] and [3].

Many itemset mining algorithms have been proposed since association rules were introduced. Most algorithms can be broadly classified into two groups: the item enumeration algorithms, which operate with a general-to-specific search strategy, and the row enumeration techniques, which find specific itemsets first. Broadly speaking, item enumeration algorithms are most effective for data sets where |T| >> |I|, while row enumeration algorithms are effective for data sets where |T| << |I|, such as micro-array data. Furthermore, a specific-to-general strategy can be useful for finding maximal frequent itemsets in dense databases. Item enumeration algorithms mine subsets of an itemset I0 before mining the more specific itemset I0. Only those itemsets for which all (or some) subsets are frequent are generated, making use of the anti-monotonic property of support. This property is also known as the Apriori property. Apriori-like algorithms search for frequent itemsets in a breadth-first generate-and-test manner, where all itemsets of length k are generated before those of length k+1. In the candidate generation step, possibly frequent (k+1)-itemsets are generated from the already mined k-itemsets prior to counting, making use of the Apriori property. This creates a high memory requirement, as all candidates must remain in main memory. For support counting (the test step), a pass is made over the database while checking whether the candidate itemsets occur in at least minSup transactions. This requires subset checking, a computationally expensive task, especially when the transaction width is high. While methods such as hash-trees or bitmaps have been used to accelerate this step, candidate checking remains expensive. Various improvements have been proposed to speed up aspects of Apriori. For example, a hash-based method for candidate generation reduces the number of candidates generated, thus saving considerable space and time, and the Trie data structure has been advocated for the subset checking step, improving on the hash tree approach. Apriori-style algorithms make multiple passes over the database, at least equal to the length of the longest frequent itemset, which incurs considerable I/O cost. GLIMIT does not perform candidate generation or subset enumeration, and generates itemsets in a depth-first fashion using a single pass over the transposed data set.

Frequent pattern growth (FP-Growth) type algorithms are often regarded as the fastest item enumeration algorithms. FP-Growth generates a compressed summary of the data set using two passes over a cross-referenced tree, the FP-tree, before mining itemsets by traversing the tree and recursively projecting the database into sub-databases using conditional FP-trees. In the first database pass, infrequent items are discarded, allowing some reduction in the database size. In the second pass, the complete FP-tree is built by collecting common prefixes of transactions and incrementing support counts at the nodes. At the same time, a header table and node links are maintained, which allow easy access to all nodes given an item. The mining step operates by following these node links and creating (projected) FP-trees that are conditional on the presence of an itemset, from which the support can be obtained by a traversal of the leaves. Like GLIMIT, it does not perform any candidate generation and mines the itemsets in a depth-first manner, while still mining all subsets of an itemset I0 before mining I0. FP-Growth is very fast at reading from the FP-tree, but the downside is that the FP-tree can become very large and is expensive to generate, so this investment does not always pay off.
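The breadth-first generate-and-test scheme described above can be sketched in a few lines of Python. This is a minimal illustration of classic Apriori (join step plus Apriori-property prune step), not an optimized implementation; `minsup` is assumed to be an absolute transaction count.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Minimal level-wise Apriori. Returns {frozenset: support}."""
    items = {i for t in transactions for i in t}
    freq = {}
    level = []
    for i in sorted(items):  # level 1: count single items
        s = sum(1 for t in transactions if i in t)
        if s >= minsup:
            f = frozenset([i])
            level.append(f)
            freq[f] = s
    k = 1
    while level:
        # generate (k+1)-candidates; keep only those whose k-subsets
        # are all frequent (the Apriori property)
        cands = set()
        for a in level:
            for b in level:
                u = a | b
                if len(u) == k + 1 and all(
                        frozenset(sub) in freq for sub in combinations(u, k)):
                    cands.add(u)
        # test step: one pass over the database per level
        next_level = []
        for c in cands:
            s = sum(1 for t in transactions if c <= set(t))
            if s >= minsup:
                next_level.append(c)
                freq[c] = s
        level = next_level
        k += 1
    return freq

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
freq = apriori(db, 2)
print(sorted((tuple(sorted(s)), c) for s, c in freq.items()))
# [(('a',), 3), (('a', 'b'), 2), (('a', 'c'), 2), (('b',), 3), (('b', 'c'), 2), (('c',), 3)]
```

Note how every candidate of size k+1 is held in memory before counting, and the whole database is rescanned at each level: exactly the memory and I/O costs the text attributes to Apriori-style algorithms.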
Further, the FP-tree may not fit into memory. In contrast, GLIMIT uses only as much space as is required and is based on vector operations rather than tree projections. It also uses a purely depth-first search [5] and [6].

Frequent Pattern Mining Useful in Building Predictive Models

Recent studies of pattern mining have given more attention to discovering patterns that are interesting, significant, discriminative and so forth, rather than simply frequent. Does this imply that frequent patterns are not useful anymore? The amount of available literature related to frequent pattern mining (FPM) reflects the reason for calling it a focused theme in data mining. Due to the evolution of modern information technology, collecting, combining, transferring and storing huge amounts of data can be done at very low cost. As a consequence, more sophisticated methods of mining associations from such data have been introduced, expanding the typical problem of analyzing the associations of commonly found items in a shopping list [3] into discovering interesting and useful patterns from large, complex and dense databases. A frequent pattern in a pattern database may be a set of items, a subsequence, a sub-tree or a sub-graph that appears across the database with a frequency not less than a pre-defined threshold value [2]. A frequent itemset is a set of items appearing together frequently, such as bread and butter, for example, in a transaction database [2]. A frequent sequential pattern may be a frequently seen subsequence in a database of ordered lists of items [4], while frequent sub-trees of a database of trees may represent the frequently linked web pages in a weblog database. Frequent sub-graphs in a graph database may be, for example, frequently appearing motifs in a chemoinformatics database. Such patterns discovered by FPM algorithms can be used for classification, clustering, finding association rules, data indexing and other data mining tasks.

Ever since the introduction of the first FPM algorithm by Agrawal et al., plenty of approaches related to the pattern mining problem have been added, including new algorithms as well as improvements to existing algorithms, making FPM a scalable and solvable problem. Studies dealing with the complexity, database, search space and scalability issues of FPM can be found in abundance. After about two decades since the concept emerged, with thousands of research publications accumulated in the literature, one may wonder whether the problem of FPM is solved at an acceptable level for most data mining tasks. On the other hand, there is a common criticism of FPM approaches for (1) producing an unacceptably large set of frequent patterns, which limits the usage of the patterns, (2) containing a large amount of redundant patterns, and (3) the fact that most of the frequent patterns are not significant, i.e., they do not discriminate between the classes [6, 14]. The arguments of these studies raise a fair question: are frequent patterns not useful anymore? In this paper we investigate, with the support of an empirical evaluation using a representative set of pattern mining algorithms, how far frequent patterns are useful in building predictive models.

Discovery of frequent itemsets is a two-step procedure: the search for frequent patterns and the support count. The Apriori principle, which was introduced and substantially improved later, limits the search space using the monotonicity property of the support of sets [5]. Methods using horizontal [3, 10] and vertical [12] layouts for organizing the transaction database support efficient scanning of the database. Brief descriptions of various studies related to improving efficiency, scalability, usage of computer memory and so forth in frequent itemset mining algorithms are given in [2] and [14]. Despite the optimizations introduced to improve the performance of frequent itemset mining algorithms, they often generate a substantially large number of frequent itemsets, which limits the possibility of using them meaningfully. To overcome this problem, the concepts of maximal frequent itemsets and closed frequent itemsets were proposed. A maximal frequent itemset is a frequent itemset I which is included in no other frequent itemset. A frequent itemset is called a closed frequent itemset if there does not exist a frequent superset that has the same support. The extra computational burden of determining whether a frequent itemset is closed or maximal is addressed by approaches such as keeping track of the transaction ID lists, using hash functions, or maintaining a frequent pattern tree similar to the FP-tree. Maximal frequent itemset mining approaches use techniques such as vertical bitmaps for the transaction ID list, or Apriori-based pruning of the search space.

Proposed Frequent Itemset Mining Techniques

The Apriori-based algorithms find frequent itemsets based upon an iterative bottom-up approach to generate candidate itemsets. Since the first proposal of association rules mining by R. Agrawal, much research has been done to make frequent itemset mining scalable and efficient. But there are still some deficiencies that Apriori-based algorithms suffer from, which include too many scans of the
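The definitions of closed and maximal frequent itemsets just given can be checked directly against a table of frequent itemsets and their supports. The following naive sketch (quadratic in the number of itemsets, purely illustrative, and nothing like the TID-list or tree-based techniques the text mentions) makes the distinction concrete:

```python
def closed_and_maximal(freq):
    """freq: dict mapping frozenset -> support.

    Maximal: no frequent proper superset exists.
    Closed:  no frequent proper superset has the same support.
    """
    closed, maximal = set(), set()
    for s, sup in freq.items():
        supersets = [t for t in freq if s < t]  # proper supersets
        if not supersets:
            maximal.add(s)
        if not any(freq[t] == sup for t in supersets):
            closed.add(s)
    return closed, maximal

freq = {
    frozenset("a"): 3,
    frozenset("b"): 3,
    frozenset("ab"): 3,  # same support as {a} and {b}: they are not closed
    frozenset("c"): 2,
}
closed, maximal = closed_and_maximal(freq)
print(closed == maximal == {frozenset("ab"), frozenset("c")})  # True
```

Every maximal itemset is closed, but not vice versa; here {a} and {b} are neither, because {a, b} is frequent with the same support.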
transaction database when seeking frequent itemsets, large numbers of candidate itemsets generated unnecessarily, and so on. The basis of our method is the classical Apriori algorithm. Our contributions are in providing novel scalable approaches for each building block. We start by counting the support of every item in the dataset and sorting the items in decreasing order of their frequencies. Next, we sort each transaction with respect to the frequency order of its items. We call this a horizontal sort. We also keep the generated candidate itemsets in horizontal sort. Furthermore, we are careful to generate the candidate itemsets in sorted order with respect to each other. We call this a vertical sort. When itemsets are both horizontally and vertically sorted, we call them fully sorted. As we show, generating sorted candidate itemsets (for any size k), both horizontally and vertically, is computationally free, and maintaining that sort order for all subsequent candidate and frequent itemsets requires careful implementation but no cost in execution time. This conceptually simple sorting idea has implications for every subsequent part of the algorithm. In particular, as we show, having transactions, candidates and frequent itemsets all adhering to the same sort order has the following advantages: candidates can be generated very efficiently; indices on lists of candidates can be generated at the same time as the candidates; groups of similar candidates can be compressed together and counted simultaneously; candidates can be compared to transactions in linear time; and better locality of data and cache-consciousness are achieved. Our particular choice of sort order (that is, sorting the items least frequent first) allows us, with minimal cost, to skip the candidate pruning phase entirely.

Effective Candidate Generation

Candidate generation is the important first step in each iteration of Apriori. Typically it has not been considered a bottleneck in the algorithm, and so most of the literature focuses on the support counting. However, it is worth pausing on it for a moment; we devote considerable attention to improving the efficiency of candidate generation, too. Let us consider generating candidates of an arbitrarily chosen size, k+1. We will assume that the frequent k-itemsets are sorted both horizontally and vertically. The (k−1)×(k−1) technique generates candidate (k+1)-itemsets by taking the union of frequent k-itemsets. If the first k−1 elements are identical for two distinct frequent k-itemsets, fi and fj, we call them near-equal and denote their near-equality by fi ≈ fj. Then, classically, every frequent itemset fi is compared to every fj, and the candidate fi ∪ fj is generated whenever fi ≈ fj. However, our method only ever needs to compare one frequent itemset, fi, to the one immediately following it, fi+1.

A crucial observation is that near-equality is transitive, because the equality of individual items is transitive. So, if fi ≈ fi+1, . . . , fi+m−2 ≈ fi+m−1, then we know that fi+j ≈ fi+k for all j, k < m. Recall also that the frequent k-itemsets are fully sorted (that is, both horizontally and vertically), so all those that are near-equal appear contiguously. This sorting, taken together with the transitivity of near-equality, is what our method exploits. In this way, we successfully generate all the candidates with a single pass over the list of frequent k-itemsets, as opposed to the classical nested-loop approach. Strictly speaking, it might seem that our processing of candidates effectively causes extra passes, but it can be shown using the Apriori principle that m is typically much less than the number of frequent itemsets. First, it remains to be shown that our one pass does not miss any potential candidates. Consider some candidate c = {i1, . . . , ik}. If it is a valid candidate, then by the Apriori principle, fi = {i1, . . . , ik−2, ik−1} and fj = {i1, . . . , ik−2, ik} are frequent. Then, because of the sort order that is required as a precondition, the only frequent itemsets that would appear between fi and fj are those that share the same (k−2)-prefix as they do. The method described above merges together all pairs of frequent itemsets that appear contiguously with the same (k−2)-prefix. Since this includes both fi and fj, c = fi ∪ fj must have been discovered.

Candidate Compression

Let us return to the concern of generating candidates from each group of m near-equal frequent k-itemsets. Since the candidates in each group share their first k−1 items in common, we need not repeat that information. As such, we can compress the candidates into a super-candidate. This new super-candidate still represents all the candidates, but takes up much less space in memory and on disk. More importantly, however, we can now count these candidates simultaneously. Suppose we wanted to extract the individual candidates from a super-candidate.
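The single-pass candidate generation over fully sorted frequent k-itemsets can be sketched as follows. This is an illustrative reconstruction, not the authors' code: itemsets are modeled as sorted tuples, so a near-equal run is any maximal contiguous block sharing its first k−1 items, and each run of tails plays the role of one super-candidate's suffix list.

```python
def generate_candidates(frequent_k):
    """frequent_k: fully sorted list of k-itemsets (as sorted tuples).

    Near-equal itemsets (identical first k-1 items) are contiguous in a
    fully sorted list, so one sequential pass suffices: each maximal run
    of near-equal itemsets yields all pairwise unions of its members,
    and the output list is itself fully sorted.
    """
    candidates = []
    start = 0
    n = len(frequent_k)
    for i in range(1, n + 1):
        # a run ends at the list's end or where the (k-1)-prefix changes
        if i == n or frequent_k[i][:-1] != frequent_k[start][:-1]:
            group = frequent_k[start:i]       # one near-equal run
            prefix = group[0][:-1]            # shared first k-1 items
            tails = [t[-1] for t in group]    # the differing last items
            for a in range(len(tails)):
                for b in range(a + 1, len(tails)):
                    candidates.append(prefix + (tails[a], tails[b]))
            start = i
    return candidates

f2 = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
print(generate_candidates(f2))
# [('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'c', 'd')]
```

Each `(prefix, tails)` pair is, in effect, the super-candidate of the next section: the prefix is stored once and the nested loop over tails enumerates the 2-item suffixes in sorted order.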
Ideally this will not be done at all, but it is necessary after support counting if at least one of the candidates is frequent, because the frequent candidates need to form a list of uncompressed frequent itemsets. Fortunately, this can be done quite easily. The candidates in a super-candidate c = (cw, cs) all share the same prefix: the first k−1 items of cs. They all have a suffix of size (k + 1) − (k − 1) = 2. By iterating in a nested loop over the last cw − k + 1 items of cs, we produce all possible suffixes in sorted order.

Consequence Candidate Generation

There is another nice consequence of generating sorted candidates in a single pass: we can efficiently build an index for retrieving them. In our implementation and in the following example, we build this index on the least frequent item of each candidate (k+1)-itemset. The structure is a simple two-dimensional array. Candidates of a particular size k+1 are stored in a sequential file, and this array provides offset information for that file. Because of the sort on the candidates, all those that begin with a given item i appear contiguously. The exact location in the file of the first such candidate is given by the ith element in the first row of the array. The ith element in the second row of the array indicates how many bytes are consumed by all (k+1)-candidates that begin with item i.

On the Precondition of Sorting

This is a very feasible requirement. The first candidates that are produced contain only two items. If one considers the list of frequent items, call it F1, then the candidate 2-itemsets are the entire cross-product F1 × F1. If we sort F1 first, then a standard nested loop will induce the order we want. That is to say, we can join the first item to the second, then the third, then the fourth, and so on until the end of the list. Then, we can join the second item to the third, the fourth, and so on as well. Continuing this pattern, one will produce the entire cross-product in fully sorted order. This initial sort is a cost we readily incur for the improvements it permits. After this stage, there are only two things we ever do: generate candidates and detect frequent itemsets by counting the support of the candidates. Because in the latter we only ever delete itemsets from the sorted list of candidates (never add or change them), the list of frequent itemsets retains the original sort order. Regarding the former, there is a nice consequence of generating candidates in our linear one-pass fashion: the set of candidates is itself sorted in the same order as the frequent itemsets from which they were derived. Recall that candidates are generated in groups of near-equal frequent k-itemsets. Because the frequent k-itemsets are already sorted, these groups, relative to each other, are too. As such, if the candidates are generated from a sequential scan of the frequent itemsets, they will inherit the sort order with respect to at least the first k−1 items. Then, only the ordering on the kth and (k+1)th items (those not shared among the members of the group) need be ensured. That two itemsets are near-equal can be equivalently stated as that the itemsets differ on only the kth item. So, by ignoring the shared items, we can consider a group of near-equal itemsets as just a list of single items. Since the itemsets were sorted and this new list is made of only those items which differentiated the itemsets, the new list inherits the sort order. Thus, we use exactly the same method as with F1 to ensure that each group of candidates is sorted on the last two items. Consequently, the entire list of candidate (k+1)-itemsets is fully sorted.

Candidate Pruning

When Apriori was first proposed, its performance was explained by its effective candidate generation. What makes the candidate generation so effective is its aggressive candidate pruning. We believe that this can be omitted entirely while still producing nearly the same set of candidates. Stated alternatively, after our particular method of candidate generation, there is little value in running a candidate pruning step. Recently, the probability that a candidate is generated has been shown to be largely dependent on its best testset, that is, the least frequent of its subsets. Classical Apriori has a very effective candidate generation technique because if any itemset c \ {ci} for 0 ≤ i ≤ k is infrequent, the candidate c = {c0, . . . , ck} is pruned from the search space. By the Apriori principle, the best testset is guaranteed to be included among these. However, if one routinely picks the best testset when first generating the candidate, then the pruning phase is redundant. In our method, on the other hand, we generate a candidate from two particular subsets, fk = c \ {ck} and fk−1 = c \ {ck−1}. If either of these happens to be the best testset, then there is little added value in a candidate pruning phase that checks the other k−2 size-k subsets of c. Because of our least-frequent-first sort order, f0 and f1 correspond exactly to the subsets missing the most frequent items of all those in c. We observed that usually either f0 or f1 is the best testset. We are also not especially concerned about generating a few extra candidates, because they will be indexed, compressed and counted simultaneously with the others; if we do not retain a considerable number of prunable candidates by not pruning, then we do not do especially much extra work in counting them anyway.
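The two-row offset array can be mimicked in memory with a map from each leading item to an (offset, count) pair over the sorted candidate list. This is a sketch under the assumption that candidates live in a Python list rather than in the sequential file (with byte offsets) that the text describes; the contiguity argument is the same either way.

```python
def build_first_item_index(candidates):
    """candidates: fully sorted list of (k+1)-itemsets (tuples).

    Because the list is fully sorted, all candidates sharing a first
    item are contiguous, so a single pass records, for each first item,
    where its block starts and how many candidates it holds -- the
    in-memory analogue of the paper's two-row array over a file.
    """
    index = {}
    for pos, c in enumerate(candidates):
        if c[0] not in index:
            index[c[0]] = [pos, 0]  # block start
        index[c[0]][1] += 1         # block length
    return index

cands = [("a", "b", "c"), ("a", "b", "d"), ("b", "c", "d")]
print(build_first_item_index(cands))  # {'a': [0, 2], 'b': [2, 1]}
```

Retrieving all candidates that begin with item i is then the slice `candidates[off : off + cnt]`, one contiguous block, with no tree traversal.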
Support Counting using Indexing

It was recognized quite early that A Priori would suffer a bottleneck in comparing the entire set of transactions to the entire set of candidates on every iteration of the algorithm. Consequently, most A Priori-based research has focused on trying to address this bottleneck, and certainly we need to address it as well. The standard approach is to build a prefix trie on all the candidates and then, for each transaction, check the trie for each of the k-itemsets present in the transaction. But this suffers two traumatic consequences on large datasets. First, if the set of candidates is large and not heavily overlapping, the trie will not fit in memory, and then the algorithm will thrash about exactly as do the other tree-based algorithms. Second, generating every possible itemset of size k from a transaction t = {t0, ..., tw−1} produces (w choose k) possibilities. Even after pruning infrequent items with a support threshold of 10%, w still ranges too high for this to be practical.

Instead, we again exploit the vertical sort of our candidates, using the index we built when we generated them. To process that same transaction t above, we consider each of the first w − k items in t. For each such item ti, we use the index to retrieve the contiguous block of candidates whose first element is ti. Then we compare the suffix of t, that is {ti, ti+1, ..., tw−1}, to each of those candidates.

Proposed Vertically-Sorted A Priori Algorithm

Our method naturally does this because it operates in a sequential manner on prefixes of sorted lists. Work that is to be done on a particular contiguous block of the data structure is entirely done before the next block is used, because the algorithm proceeds in sorted order and the blocks are sorted. Consequently, we fully process blocks of data before we swap them out. Our method probably also performs decently well in terms of cache utilization, because contiguous blocks of itemsets will be highly similar given that they are fully sorted. Perhaps of even more importance is the independence of itemsets. The candidates of a particular size, so long as their order is ultimately maintained in the output to the next iteration, can be processed together in blocks in whatever order desired. The lists of frequent itemsets can be similarly grouped into blocks, so long as care is taken to ensure that a block boundary occurs between two itemsets fi and fi+1 only when they are not near-equal. The indices can also be grouped into blocks, with the additional advantage that this can be done in a manner corresponding exactly to how the candidates were grouped. As such, all of the data structures can be partitioned quite easily, which lends itself quite nicely to the prospects of parallelization and fault tolerance.

Frequent itemsets play an essential role in many data mining tasks that try to find interesting patterns from databases, such as association rules, correlations, sequences, episodes, classifiers and clusters. We take the classic A Priori algorithm and improve it quite significantly by introducing what we call a vertical sort. We then use the large dataset, web documents, to contrast our performance against several state-of-the-art implementations and demonstrate not only equal efficiency with lower memory usage at all support thresholds, but also the ability to mine support thresholds as yet un-attempted in the literature.

Algorithm: The revised Vertically-Sorted A Priori algorithm

INPUT: A dataset D and a support threshold s
OUTPUT: All sets that appear in at least s transactions of D
F is the set of frequent itemsets; C is the set of candidates
C ← U
Scan database to count support of each item in C
Add frequent items to F
Sort F least-frequent-first (LFF) by support (using quicksort)
Output F
for all f ∈ F, sorted LFF do
    for all g ∈ F, supp(g) ≥ supp(f), sorted LFF do
        Add {f, g} to C
    end for
    Update index for item f
end for
while |C| > 0 do
    {Count support}
    for all t ∈ D do
        for all i ∈ t do
            RelevantCans ← using index, compressed cans from file that start with i
            for all CompressedCans ∈ RelevantCans do
                if first k − 2 elements of CompressedCans are in t then
                    Use compressed candidate support counting technique to update appropriate support counts
                end if
            end for
        end for
    end for
    Add frequent candidates to F
    Output F
    Clear C
    {Generate candidates}
    Start ← 0
    for 1 ≤ i ≤ |F| do
        if i == |F| OR fi is not near-equal to fi−1 then
            Create super candidate from fstart to fi−1 and update index as necessary
            Start ← i
        end if
    end for
    {Candidate pruning—not needed!}
    Clear F; Reset hash
end while
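The index-based counting step can be sketched in Python as follows. This is our own minimal illustration, not the paper's code, and it omits the compressed-candidate representation: candidates are assumed vertically sorted so that all candidates sharing a first item sit in one contiguous block, and a one-item index maps each first item to that block's bounds.

```python
def build_first_item_index(candidates):
    """Map each first item to its contiguous block [start, end)
    in the vertically sorted candidate list."""
    index = {}
    for pos, cand in enumerate(candidates):
        start, _ = index.get(cand[0], (pos, pos))
        index[cand[0]] = (start, pos + 1)
    return index

def count_supports(transactions, candidates, item_rank):
    """Count candidate supports. Each transaction is re-sorted into the
    global least-frequent-first order given by item_rank; each eligible
    item of the transaction then selects one contiguous candidate block
    via the index, and only that block is compared against the suffix."""
    if not candidates:
        return []
    k = len(candidates[0])
    index = build_first_item_index(candidates)
    counts = [0] * len(candidates)
    for t in transactions:
        # Drop unranked (infrequent, already pruned) items, then sort LFF.
        t = sorted((x for x in t if x in item_rank), key=item_rank.get)
        for i in range(len(t) - k + 1):
            block = index.get(t[i])
            if block is None:
                continue
            suffix = set(t[i:])
            for pos in range(*block):
                if suffix.issuperset(candidates[pos]):
                    counts[pos] += 1
    return counts
```

For instance, with item ranks {"c": 0, "b": 1, "a": 2}, candidates [("c", "b"), ("c", "a"), ("b", "a")] and transactions {a, b, c}, {a, b}, {a, c}, {a}, the returned supports are [1, 2, 2]. Each candidate is tested against at most one block per transaction, since its first item can occur only once in the sorted transaction.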
We also indicate how we believe this work can be extended to achieve yet more impressive results. We have demonstrated that our implementation produces the same results with the same performance as the best of the state-of-the-art implementations.

Experimental Results Analysis

The result of this paper is that the frequent itemset mining problem can now be extended to much lower support thresholds (or, equivalently, larger effective file sizes) than have ever yet been considered. These improvements came at no performance cost, as evidenced by the fact that our implementation matched the state-of-the-art competitors while consuming much less memory. Prior to this work, it had been assumed that the performance of A Priori is prohibitively slow. Through these experiments we have demonstrated that at high support thresholds our implementation produces the same results with the same performance as the best of the state-of-the-art implementations. But as the support threshold decreases, the other implementations exceed memory and abort, while the memory utilisation of our implementation remains relatively constant. As such, our performance continues to follow a predictable trend and our program can successfully mine support thresholds that are impossibly low for the other implementations.

We compare the performance of this implementation against a wide selection of the best available implementations of various frequent itemset mining algorithms. Among the previous works are state-of-the-art implementations of the A Priori algorithm which use a trie structure to store candidates. In order to maximally remove uncontrolled variability from the comparisons, the choice of programming language is important. The correctness of our implementation's output is compared to the output of these other algorithms. Since they were all developed for the FIMI workshop and all agree on their output, it seems a fair assumption that they can serve correctly as an "answer key"; but, nonetheless, boundary-condition checking was a prominent component during development. We could generate our own large dataset against which to also run tests, but the value of doing so is minimal. The data in the web documents set comes from a real domain and so is meaningful; constructing a random dataset will not necessarily portray the true performance characteristics of the algorithms. At any rate, the other implementations were designed with knowledge of web documents, so it is a fairer comparison. For these reasons, we used other datasets only for the purpose of verifying the correctness of our output.

We test each implementation on webdocs with support thresholds of 21%, 14%, 11%, 7%, and 6%. Reducing the support threshold in this manner increases the size of the problem, as observed in Figure 1; the number of candidate itemsets is implementation-dependent and in general will be less than the number in the figure.

Figure 1: Size of web documents dataset with noise (infrequent 1-itemsets) removed, graphed against the number of frequent itemsets. [Axes: File Size (Mb) vs. Support (%)]

Figure 2: The performance of our implementations on the webdocs dataset. [Axes: Running Time (s) vs. Support Threshold (%)]

However, because our implementation uses explicit file-handling instead of relying on virtual memory, its memory requirements are effectively constant, whereas those of all the other algorithms grow beyond the limits of memory and consequently cannot initialize. Without the data structures, the programs must obviously abort. Figure 2 describes the performance of our implementations on the webdocs
dataset with respect to Running Time (s) vs. Support Threshold (%). Figure 3 describes the memory usage of our implementations on webdocs, as measured in the final stages of execution, with respect to Memory Consumed (%) vs. Support Threshold (%).

Figure 3: Memory usage of our implementations on webdocs as measured by final stages of execution. [Axes: Memory Consumed (%) vs. Support Threshold (%)]

Proposed Classic Apriori Comparison with Existing Apriori and FP-Growth Algorithm

Figure 4 shows the running time of the proposed classic Apriori frequent pattern mining approach and the existing Apriori-based multilevel frequent pattern mining algorithm on our created database, with respect to the minimum support threshold.

Figure 4: Proposed classic Apriori comparison with existing Apriori. [Axes: Execution Time (seconds) vs. Minimum Support (%)]

Figure 5 shows the running time of the proposed classic Apriori frequent pattern mining approach and the FP-Growth algorithm on our created database, with respect to the minimum support threshold. The results show that as the minimum support decreases, the running time increases. The new approach runs faster than the existing algorithms Apriori and FP-Growth. It should be noted that in the previous works, they ran their FP-Growth implementation on the same benchmark webdocs dataset as do we, and they report impressive running times.

Figure 5: Proposed classic Apriori comparison with FP-Growth. [Axes: Execution Time (seconds) vs. Minimum Support (%)]

Unfortunately, that implementation is now unavailable, and the details in the accompanying paper are not sufficiently precise that we could implement their modifications to the FP-Growth algorithm. As such, no fair comparison can truly be made. Still, they only publish results down to 8%, which is insufficient, as we demonstrated in Figure 1. It is a fair hypothesis that, were their implementation available, it would suffer the same consequences as do the other trie-based implementations when the support threshold is dropped further. So, through these experiments, we have demonstrated that our implementation produces the same results with the same performance as the best of the state-of-the-art implementations.

Conclusion and Future Works

Algorithm APRIORI is one of the oldest and most versatile algorithms of Frequent Pattern Mining (FPM). With sound data structures and careful implementation it has been proven to be a competitive algorithm in the contest of Frequent Itemset Mining Implementations (FIMI). APRIORI is not only an appreciated member of the FIMI community and regarded as a baseline algorithm, but its variants to find frequent sequences of itemsets, episodes, Boolean formulas and labeled graphs have proven to be efficient algorithms as well.

We have demonstrated that our implementation produces the same results with the same performance as the best of the state-of-the-art implementations. But, whereas they blow their memory in order to decrease the support threshold, the memory
utilization of our implementation remains relatively constant. As such, our performance continues to follow a predictable trend and our program can successfully mine support thresholds that are impossibly low for the other implementations. Furthermore, whereas other algorithms in the literature are being fully optimized already, we believe that this work opens up many avenues for yet more pronounced improvement. Given the locality and independence of the data structures used, they can be partitioned quite easily; we intend to do precisely that in parallelizing the algorithm. Extending the index to more than one item, to improve its precision on larger sets of candidates, will likely also yield significant improvement. And, of course, all the optimization tricks used in other implementations can be incorporated here. The result of this paper is that the frequent itemset mining problem can now be extended to much lower support thresholds (or, equivalently, larger effective file sizes) than have ever yet been considered.