2011 International Conference on Recent Trends in Information Systems

Online Mining of Data to Generate Association Rules in Large Databases

Archana Singh, Ph.D Scholar; Megha Chaudhary; Dr. (Prof.) Ajay Rana, Ph.D (Computer Science & Engg.); Gaurav Dubey, Ph.D Scholar
Amity University, NOIDA (U.P.)
+91-9958255675, +91-981811756, gdubey1977@gmail.com

ABSTRACT - Data mining is a technology to explore data, analyze it, and finally discover patterns in large data repositories. In this paper, the problem of online mining of association rules in large databases is discussed. Online association rule mining helps to remove redundant rules and yields a compact representation of the rules for the user. A new and more optimized algorithm is proposed for online rule generation. Its advantage is that the graph it generates has fewer edges than the lattice used in the existing algorithm, while it still generates all the essential rules, with none missing. The use of non-redundant association rules significantly reduces irrelevant noise in the data mining process. The graph-theoretic structure involved, called the adjacency lattice, is crucial for online mining of data. The adjacency lattice can be stored either in main memory or in secondary memory; the idea is to pre-store a number of large itemsets in a special format that reduces the disk I/O required to perform a query.

Index Keywords: Adjacency lattice, Association Rule Mining, Data Mining

I. INTRODUCTION

Data mining is the process of analyzing data and summarizing it into useful information; technically, it is the process of finding patterns among dozens of fields in large relational databases. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified.

A. Overview of the Work Done

Association rule mining, as suggested by R. Agrawal, describes relationships between items in data sets: it helps in finding the items that would be selected given that a certain set of items has already been selected. An improved algorithm for fast rule generation was discussed by Agrawal et al. (1994); two algorithms for generating association rules are given in "Fast Algorithms for Mining Association Rules" by Rakesh Agrawal and Srikant (1994).

Online mining of data is performed by pre-processing the data effectively so that it is suitable for repeated online queries. An online association rule mining technique discussed by Charu C. Aggarwal et al. (2001) takes a graph-theoretic approach in which the pre-processed data is stored in such a way that online processing can be done by applying a graph-theoretic search algorithm. That work introduced the concept of the adjacency lattice of itemsets, which is crucial for effective online data mining. The adjacency lattice can be stored either in main memory or in secondary memory; the idea is to pre-store a number of itemsets at a given level of support in a special format (the adjacency lattice) that reduces the disk I/O required to perform a query.

Online generation of rules means finding the association rules online as the value of the minimum confidence changes. The problem with the existing algorithm is that the lattice has to be constructed again over all large itemsets in order to generate the rules, which is very time-consuming for online rule generation; moreover, the generated lattice has many edges, since every frequent itemset has an edge to each of its supersets in the subsequent level.

This paper develops a new algorithm for online rule generation. A weighted directed graph is constructed, and depth-first search is used for rule generation. In the proposed algorithm, online rules can be generated by building the adjacency matrix for some confidence value and then generating rules for any confidence measure higher than the one used to build the matrix.

978-1-4577-0792-6/11/$26.00 ©2011 IEEE
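Support and confidence, the two measures that drive both algorithms discussed below, can be illustrated with a minimal Python sketch; the toy transaction list here is an assumption for illustration only, not the paper's dataset:

```python
# Support and confidence, the two measures behind association rule mining.
# The transaction list is a made-up toy example (not the paper's data).
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
    {"bread", "milk", "beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """conf(A => B) = support(A and B together) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)
```

For example, `confidence({"bread"}, {"milk"}, transactions)` is 0.6 / 0.8 = 0.75: the rule bread ⇒ milk holds in three of the four transactions that contain bread.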
A new algorithm has been developed to overcome these difficulties. The number of edges in the graph it generates is smaller than in the adjacency lattice, and it is still capable of finding all the essential rules.

The rest of the paper is organized as follows. Section II describes the work done by Charu C. Aggarwal (2001); Section III describes the newly proposed algorithm; Section IV illustrates the existing and proposed algorithms. Finally, the two algorithms are compared and their complexity is discussed.

II. EXISTING ALGORITHM FOR ONLINE RULE GENERATION

The aim of association rule mining (Rakesh et al., 1994) is to detect relationships or patterns between specific values of categorical variables in large data sets; the technique enables analysts and researchers to uncover hidden patterns in large data sets. The existing algorithm takes a graph-theoretic approach: the pre-processed data is stored in such a way that online rule generation can be done with a complexity proportional to the size of the output. The concept of an adjacency lattice of itemsets is introduced; this lattice is crucial to performing effective online data mining and can be stored either in main memory or in secondary memory. The idea is to pre-store as many large itemsets as the available memory allows, in a special format (the adjacency lattice) that reduces the disk I/O required to perform a query. In fact, if enough main memory is available for the entire adjacency lattice, then no I/O may need to be performed at all.

A. Adjacency Lattice

An itemset X is said to be adjacent to an itemset Y if one of them can be obtained from the other by adding a single item. Specifically, X is said to be a parent of Y if Y can be obtained from X by adding a single item to X. An itemset may have more than one parent and more than one child; in fact, the number of parents of an itemset X is exactly equal to the cardinality of X, since for each element i_r in X, X − i_r is a parent of X. If a directed path exists in the adjacency lattice from the vertex corresponding to Z to the vertex corresponding to X, then X ⊇ Z; in such a case, X is said to be a descendant of Z and Z an ancestor of X.

B. The Existing Algorithm

There are three steps in the existing algorithm.

STEP 1: Generation of the adjacency lattice. The adjacency lattice is created from the frequent itemsets generated by any standard algorithm under some minimum support. This support value is called the primary threshold value. The itemsets obtained are referred to as prestored itemsets and can be kept in main or secondary memory; the benefit is that the dataset need not be scanned again for every value of minimum support and confidence given by the user. The adjacency lattice L is a directed acyclic graph, constructed as follows: create a vertex v(I) for each primary itemset I, labelled with the value of its support, denoted S(I); for any pair of itemsets X and Y, a directed edge runs from v(X) to v(Y) if and only if X is a parent of Y. Note that it is not possible to perform online mining of association rules at support levels below the primary threshold.

STEP 2: Online generation of itemsets. Once the adjacency lattice is stored in RAM, the user can retrieve specific large itemsets as desired. Suppose the user wants all large itemsets that contain a set of items I and satisfy a minimum support level s; then the following search must be solved in the adjacency lattice: for the given itemset I, find all itemsets J such that v(J) is reachable from v(I) by a directed path in L and S(J) ≥ s.

STEP 3: Rule generation. Rules are generated from these prestored itemsets for a user-defined minimum support and minimum confidence.

III. PROPOSED ALGORITHM

The algorithm of Charu et al. (2001) was discussed in the previous section; this section discusses the proposed algorithm in detail. The proposed algorithm is also graph-theoretic: the graph generated is a directed graph with weights associated with the edges, and the number of edges is smaller than in the algorithm suggested by Charu et al.

A. Algorithm

The algorithm has two steps. The first, construction of the graph, is explained in Section III-A; the second, rule generation, is explained in Section III-B.

Construction of the adjacency lattice
The large itemsets obtained by applying some traditional frequent-itemset algorithm (such as Apriori) are stored in one file, and the corresponding support values in another. Using these two files, the items and their supports are stored in a structure, say S. Next, an array of structures s(i, j) with two fields, itemsets and support, is created; it stores large itemsets of different lengths in different rows: 1-itemsets in s(1, j), 2-itemsets in s(2, j), 3-itemsets in s(3, j), and so on. A function named Initialize() serves this purpose; its pseudo code is given in the following:

Algorithm Initialize(S)
Begin
    for each large itemset ∈ S do
        Item1 = s(i).itemset;
        Item2 = s(i+1).itemset;
        M1 = length(Item1);
        M2 = length(Item2);
        s(j, k).itemsets = Item1;
        s(j, k).support = s(i).support;
        increment k;
        if (difference of lengths of consecutive items != 0)
            put itemsets in the next row of s;
    return s;
End;

To calculate the weight of the edge between an itemset X and an itemset Y, where X − Y is a 1-itemset, the value support(X)/support(Y) is computed; if this value is greater than or equal to the minimum confidence, there is an edge between X and Y with weight support(X)/support(Y). A function is then required to generate the adjacency matrix using the structures S and s. It takes one large itemset from s(i, j) and compares it with all the items in s(i+1, j); if any subset of this itemset is present in s(i+1, j), it must be determined whether there is a link between them and, if so, what the weight of the link is.

Let an itemset X from s(i, j) be searched in S. When the index of X in S, say index1, is obtained, the support of X is immediately available. Then all subsets of X in s(i+1, j) are searched; the support of each such itemset Y is needed, and its index in S, index2, is obtained by searching for it in S. The weight S(index1).support / S(index2).support is calculated; if it is greater than or equal to the minimum confidence, the adjacency-matrix entry a[index1, index2] is assigned this weight. The pseudo code for gen_adj_lattice() is given in the following:

Algorithm gen_adj_lattice(S, s)
Begin
    for each row of s do
        Item1 = s(i, j).itemsets;
        Index1 = find_index(Item1, S);
        // find all subsets of Item1 in s(i+1, j)
        for each itemset in s(i+1) do
            Item2 = s(i+1, k).itemsets;
            if (Item1 is a superset of Item2)
                Index2 = find_index(Item2, S);
                Confidence = s(Index1).support / s(Index2).support;
                if (Confidence >= minconf)
                    adj_lat(Index1, Index2) = Confidence;
    return adj_lat;
End;

In gen_adj_lattice() there is a sub-function that searches for an element in the structure S and returns the index of that itemset; using this index, the support of the corresponding large itemset is obtained. To search for an itemset X in S, first find the length of X, then traverse S; only when the length of the current itemset equals that of X are the two itemsets compared, and if all items of both itemsets match, the index is returned. The pseudo code for find_index() is given in the following:

Algorithm find_index(item, S)
Begin
    N1 = length(item);
    for each itemset in S do
        Item2 = S(r).itemsets;
        N2 = length(Item2);
        if (N1 == N2)
            if (each item matched)
                index = r;
                return index;
End;

The generated graph is a directed graph in which the largest itemsets are at the first level and the 1-itemsets are at the lowest level. The edges are directed from the (n−1)th level to the nth level, and the weight of an edge equals the support of the itemset at the (n−1)th level divided by the support of the itemset at the nth level.

B. Generation of Rules

Each node in the directed graph is chosen in turn for rule generation. Call that node the starting node and perform a depth-first search in the directed graph, generating a rule from a visited node and the starting node if and only if all the conditions required for an essential rule are satisfied:

1. The product of the confidences along the path between the starting node and the visited node must be greater than or equal to the minimum confidence.

2. To reduce simple redundancy: the set of all children of the visited node is generated and compared with the nodes that have already been used by the same starting node for rule generation. If any one of the child nodes is found there, no rule can be generated from this visited node, since such a rule would be redundant.
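A possible Python rendering of the weighted-graph construction just described is given below. This is a sketch under assumptions, not the authors' code: the supports are those of Tables 4.1-4.3 (A = Bread, ..., F = Coke), and the paper's minimum confidence of 0.67 is written exactly as 2/3 so that edges such as AB-ABD (weight 0.4/0.6) are not lost to rounding:

```python
# Weighted directed graph of the proposed algorithm: an edge runs from a
# (k+1)-itemset X down to each k-itemset subset Y, with weight
# support(X)/support(Y) = conf(Y => X - Y), kept only when it reaches
# the minimum confidence.
MIN_CONF = 2 / 3  # the paper's 0.67, written exactly to avoid rounding

supports = {  # frequent itemsets of Tables 4.1-4.3
    frozenset("A"): 0.8, frozenset("B"): 0.8, frozenset("C"): 0.6,
    frozenset("D"): 0.8, frozenset("F"): 0.4,
    frozenset("AB"): 0.6, frozenset("AC"): 0.4, frozenset("AD"): 0.6,
    frozenset("BC"): 0.4, frozenset("BD"): 0.6, frozenset("BF"): 0.4,
    frozenset("CD"): 0.6, frozenset("DF"): 0.4,
    frozenset("ABD"): 0.4, frozenset("ACD"): 0.4,
    frozenset("BCD"): 0.4, frozenset("BDF"): 0.4,
}

def build_weighted_graph(supports, min_conf):
    """graph[x] = list of (immediate subset y, weight) with weight >= min_conf."""
    graph = {x: [] for x in supports}
    for x in supports:
        if len(x) < 2:
            continue                           # 1-itemsets get no outgoing edges
        for item in x:
            y = x - {item}                     # immediate subset of x
            if y in supports:
                w = supports[x] / supports[y]  # conf(y => x - y)
                if w >= min_conf:
                    graph[x].append((y, w))
    return graph

graph = build_weighted_graph(supports, MIN_CONF)
```

With these supports, ABD keeps all three of its edges (weight 0.4/0.6 each), while AC keeps only the edge to C, since the weight 0.4/0.8 = 0.5 toward A falls below the threshold.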
The pseudo code for find_allChild() is given in the following:

Algorithm find_allChild(adj_lat, i)
Begin
    C1 = C = NULL;
    C1 = C = child(adj_lat, i);
    while C1 != NULL do
        for each c ∈ C1 do
            C1 = child(adj_lat, c);
            C = C ∪ C1;
    return C;
End;

There is a structure, say G, which stores the nodes that have already been used for generating rules; they are stored in such a way that the required nodes can be obtained directly by reaching the corresponding index. The pseudo code for this is given in the following:

Algorithm node_gen_rule(nodeset: S, G)
Begin
    generatedSet = NULL;
    for each node S(i) ∈ S do
        generatedSet = generatedSet ∪ G(S(i));
    return generatedSet;
End;

To reduce strict redundancy:

A) The set of all parents of the starting node is generated, and for all these parent nodes, all the nodes that they have used for rule generation are found. This set of nodes is compared with the visited node; if the visited node is found there, no rule can be generated from it, because such a rule would be strictly redundant.

B) The set of all children of the visited node and the set of all parents of the starting node are generated, and for all these parent nodes, all the nodes that they have used for rule generation are found. This set of nodes is compared with the set of all children; if any child of the visited node is found there, no rule can be generated from the visited node, because such a rule would be strictly redundant.

The pseudo code for find_allParents() is given in the following:

Algorithm find_allParents(adj_lat, i)
Begin
    P1 = P = NULL;
    P1 = P = Parents(adj_lat, i);
    while P1 != NULL do
        for each p ∈ P1 do
            P1 = Parents(adj_lat, p);
            P = P ∪ P1;
    return P;
End;

The overall rule generation is given in the following:

Algorithm GenerateRule(Starting node: X, Visited node: Y, MinConf: c, G)
Begin
    RuleSet = NULL;
    C1 = weighted product of the path(X, Y);
    if (C1 >= c)
        if (~compare(find_allChild(adj_lat, Y), node_gen_rule(X, G)))
            if (~compare(node_gen_rule(find_allParents(adj_lat, X), G), Y))
                if (~compare(find_allChild(adj_lat, Y), node_gen_rule(find_allParents(X), G)))
                    RuleSet = RuleSet ∪ (Y -> (X − Y));
    return RuleSet;
End;

IV. ILLUSTRATION OF EXISTING AND PROPOSED ALGORITHMS

Both algorithms are now illustrated with an example. The market-basket data set used has five transactions and five items. Let the minimum support be 0.4 and the minimum confidence 0.67. The large itemsets with support of at least 0.4, together with their support values, are shown in Tables 4.1 to 4.3.

Table 4.1: 1-large itemsets

    ITEMS        SUPPORT
    A = Bread    0.8
    B = Milk     0.8
    C = Beer     0.6
    D = Diaper   0.8
    F = Coke     0.4

Table 4.2: 2-large itemsets

    AB    0.6
    AC    0.4
    AD    0.6
    BC    0.4
    BD    0.6
    BF    0.4
    CD    0.6
    DF    0.4
Table 4.3: 3-large itemsets

    ABD    0.4
    ACD    0.4
    BCD    0.4
    BDF    0.4

A. Rule Generation by the Proposed Algorithm

The weights of the edges between frequent 1-itemsets and frequent 2-itemsets, and between frequent 2-itemsets and frequent 3-itemsets, are shown in Table 4.4. The weights are calculated in the following manner: let X be a k-itemset and Y a (k+1)-itemset; then the weight of the edge from X to Y is equal to the confidence of the rule X ⇒ (Y − X).

Table 4.4: Weights of the edges

    Edges       Weights
    A – AB      0.75
    A – AC      0.5
    A – AD      0.75
    B – AB      0.75
    B – BC      0.5
    B – BD      0.75
    B – BF      0.5
    C – AC      0.67
    C – BC      0.67
    C – CD      1.0
    D – AD      0.75
    D – BD      0.75
    D – CD      0.75
    D – DF      0.5
    F – BF      1.0
    F – DF      1.0
    AB – ABD    0.67
    AC – ACD    1.0
    AD – ABD    0.67
    AD – ACD    0.67
    BC – BCD    1.0
    BD – ABD    0.67
    BD – BCD    0.67
    BD – BDF    0.67
    BF – BDF    1.0
    CD – ACD    0.67
    CD – BCD    0.67
    DF – BDF    1.0

The lattice generated for the above example is shown in Figure 4.1 (Lattice Structure), and the resultant graph is shown in Figure 4.2 (Graph generated for the rule generation). The lattice generated for the same example has more edges; these extra edges are shown dotted.

Figure 4.3 shows the generation of the rules for the large itemset ABD. Applying depth-first search starting from the node ABD, the node A is visited first, but the weighted product of the path from A to ABD (0.67 × 0.75 = 0.5) is less than the minimum confidence, so A does not participate in rule generation. Node B is visited second and is excluded for the same reason. The next visited node is AB, and the weighted product of the path from AB to ABD is 0.67, which equals the minimum confidence; the child nodes of AB generate no rule, and AB has not been used by any of the parent nodes of ABD, so all three conditions are satisfied and the rule AB ⇒ D is generated. The next visited node is D, but the weighted product of the path from D to ABD is below the minimum confidence, so no rule is produced. The next visited node, AD, satisfies all three conditions, giving the rule AD ⇒ B; likewise the node BD satisfies all three conditions, giving the rule BD ⇒ A.
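The depth-first walk just illustrated can be sketched as follows. Note that the edge weights telescope: the product along any path from a starting node X down to a visited node Y equals support(X)/support(Y), i.e. the confidence of Y ⇒ (X − Y). The graph below is hand-built from the ABD fragment of the example (edge weights as in Table 4.4, minimum confidence written exactly as 2/3), and for brevity only the path-confidence condition is applied; the redundancy checks of Section III.B are omitted, so lower-level starting nodes contribute their own rules as well:

```python
# DFS rule generation over a weighted graph whose edges point from each
# itemset to its immediate subsets. Only the path-confidence condition is
# applied; the simple/strict redundancy checks are omitted for brevity.
def generate_rules(graph, min_conf):
    """Return (antecedent, consequent, confidence) triples found by DFS."""
    rules = []
    for start in graph:
        stack = list(graph[start])    # (visited itemset, path confidence)
        seen = set()
        while stack:
            y, conf = stack.pop()
            if y in seen:             # the product telescopes, so every
                continue              # path to y yields the same confidence
            seen.add(y)
            if conf >= min_conf:
                rules.append((y, start - y, conf))
            for z, w in graph[y]:
                stack.append((z, conf * w))
    return rules

# Hand-built ABD fragment of the example graph (weights from Table 4.4).
ABD, AB, AD, BD = (frozenset(s) for s in ("ABD", "AB", "AD", "BD"))
A, B, D = (frozenset(s) for s in "ABD")
abd_graph = {
    ABD: [(AB, 2/3), (AD, 2/3), (BD, 2/3)],
    AB: [(A, 0.75), (B, 0.75)],
    AD: [(A, 0.75), (D, 0.75)],
    BD: [(B, 0.75), (D, 0.75)],
    A: [], B: [], D: [],
}
rules = generate_rules(abd_graph, 2/3)
```

Starting from ABD this yields exactly AB ⇒ D, AD ⇒ B and BD ⇒ A, as in the walk above; the remaining starting nodes contribute the pairwise rules such as A ⇒ B. The paper reports that, with the redundancy checks added, the full graph yields the 16 rules of Table 4.5.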
Similarly, rules are generated for the large itemsets ACD, BCD, BDF, AB, AD, BD, AC, BC, CD, BF and DF. The resulting rules are shown in Table 4.5.

Table 4.5: The rules generated

    1   AB ⇒ D
    2   AD ⇒ B
    3   BD ⇒ A
    4   C ⇒ AD
    5   AD ⇒ C
    6   C ⇒ BD
    7   BD ⇒ C
    8   F ⇒ BD
    9   BD ⇒ F
    10  A ⇒ B
    11  B ⇒ A
    12  A ⇒ D
    13  D ⇒ A
    14  B ⇒ D
    15  D ⇒ B
    16  D ⇒ C

B. Rules Generated by the Existing Algorithm

Generating the rules for the large itemset ABD: choose all ancestors of ABD whose support is less than or equal to the value support(ABD)/c = 0.4/0.67 ≈ 0.6. AB, AD and BD are selected, giving the lattice shown in Figure 4.4 (Directed graph in the adjacency lattice). It is easy to see that AB, AD and BD are the maximal ancestors in this directed graph; hence the rules AB ⇒ D, AD ⇒ B and BD ⇒ A are obtained.

A total of 16 rules is generated by both algorithms. It was found that no essential rules are missing in the proposed algorithm, and there is no redundancy in the rules generated.

C. Comparison of Algorithms

The complexity of the graph-search algorithm is proportional to the size of the output.

Theorem: The number of edges in the adjacency lattice is equal to the sum of the number of parents of each primary itemset.

Let N(I, s) be the number of primary itemsets in R(I, s). The size of the output for the existing algorithm is N(I, s) · h(I, s), and the complexity of the existing algorithm is proportional to N(I, s) · h(I, s). In the proposed algorithm there are some edges that are not visited from their parents; let those nodes be denoted L(I, s). The size of the output in this case is N(I, s) · h(I, s) − L(I, s), and the complexity of the proposed algorithm is proportional to N(I, s) · h(I, s) − L(I, s).

CONCLUSION AND FUTURE WORK

In this paper, data mining and one of its important techniques, association rule mining, are discussed, and online mining of association rules is introduced to resolve the issues associated with it. Online association mining helps to remove redundant rules and gives a compact representation of the rules for the user. A new algorithm has been proposed for online rule generation; its advantage is that the graph it generates has fewer edges than the lattice used in the existing algorithm, while it still generates all the essential rules, with none missing. Future work will implement both the existing and the proposed algorithms and test them on large datasets such as the Zoo dataset, the Mushroom dataset and synthetic datasets.

REFERENCES

[1] R. Agrawal, T. Imielinski and A. Swami, "Mining association rules between sets of items in large databases," SIGMOD 1993, pp. 207-214.
[2] C. C. Aggarwal and P. S. Yu, "A New Approach to Online Generation of Association Rules," IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 4, pp. 327-340, 2001.
[3] D.-I Lin and Z. M. Kedem, "Pincer search: An efficient algorithm to find maximal frequent itemsets," IEEE Transactions on Knowledge and Data Engineering, no. 3, pp. 333-344, May/June 2002.
[4] B. Liu, W. Hsu and Y. Ma, "Mining association rules with multiple minimum supports," Proc. 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 337-341, New York, 1999, ACM Press.
[5] R. Agrawal, T. Imielinski and A. Swami, "Mining association between sets of items in large databases," Conf. on Management of Data, Washington, DC, May 1993.
[6] R. Srikant, Q. Vu and R. Agrawal, "Mining association rules with itemset constraints," Proc. 3rd International Conference on Knowledge Discovery and Data Mining (KDD-97), Newport Beach, California, August 1997.
[7] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th International Conference on Very Large Data Bases (VLDB), 1994.
[8] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Academic Press, 2001.