A new algorithm has been developed to overcome these difficulties. In this algorithm the number of edges of the graph generated is less than in the adjacency lattice, and it is also capable of finding all the essential rules.

This paper is divided into sections as follows: Section 2 describes the work done by Charu C. Aggarwal (2001). Section 3 describes the new proposed algorithm. Section 4 discusses the illustration of the existing and proposed algorithms. In the last section, the two algorithms are compared and their complexity is analysed.

II. EXISTING ALGORITHM FOR ONLINE RULE GENERATION

The aim of Association Rule Mining (Rakesh et al., 1994) is to detect relationships or patterns between specific values of categorical variables in large data sets. Rakesh suggests a graph theoretic approach. The main idea of association rule mining in the existing algorithm is to partition the attribute values into transaction patterns. Basically, this technique enables analysts and researchers to uncover hidden patterns in large data sets. Here the pre-processed data is stored in such a way that online rule generation may be done with a complexity proportional to the size of the output. In the existing algorithm, the concept of an adjacency lattice of itemsets has been introduced. This adjacency lattice is crucial to performing effective online data mining. The adjacency lattice can be stored either in main memory or on secondary memory. The idea of the adjacency lattice is to prestore a number of large itemsets at the lowest level of support possible given the available memory. These itemsets are stored in a special format (called the adjacency lattice) which reduces the disk I/O required in order to perform a query. In fact, if enough main memory is available for the entire adjacency lattice, then no I/O may need to be performed at all.

A. Adjacency lattice

An itemset X is said to be adjacent to an itemset Y if one of them can be obtained from the other by adding a single item. Specifically, an itemset X is said to be a parent of the itemset Y if Y can be obtained from X by adding a single item to the set X. It is clear that an itemset may have more than one parent and more than one child. In fact, the number of parents of an itemset X is exactly equal to the cardinality of the set X. This observation follows from the fact that for each element ir in an itemset X, X - {ir} is a parent of X. If a directed path exists in the adjacency lattice from the vertex corresponding to Z to the vertex corresponding to X, then Z ⊂ X. In such a case, X is said to be a descendant of Z and Z is said to be an ancestor of X.

B. The Existing Algorithm

There are three steps in the existing algorithm explained by Aggarwal et al. (2001).

STEP 1: Generation of the adjacency lattice
The adjacency lattice is created from the frequent itemsets generated by any standard algorithm under some minimum support. This support value is called the primary threshold value. The itemsets obtained above are referred to as prestored itemsets, and can be stored in main memory or secondary memory. This is beneficial in the sense that we need not refer to the dataset again and again for the different values of minimum support and confidence given by the user.

The adjacency lattice L is a directed acyclic graph. An itemset X is said to be adjacent to an itemset Y if one of them can be obtained from the other by adding a single item. The adjacency lattice L is constructed as follows: construct a graph with a vertex v(I) for each primary itemset I. Each vertex v(I) has a label corresponding to the value of its support, denoted by S(I). For any pair of vertices corresponding to itemsets X and Y, a directed edge exists from v(X) to v(Y) if and only if X is a parent of Y. Note that it is not possible to perform online mining of association rules at support levels less than the primary threshold.

STEP 2: Online generation of itemsets
Once the adjacency lattice is stored in RAM, the user can retrieve specific large itemsets on demand. Suppose the user wants to find all large itemsets which contain a set of items I and satisfy a level of minimum support s; then the following search must be solved in the adjacency lattice: for the given itemset I, find all itemsets J such that v(J) is reachable from v(I) by a directed path in the lattice L and S(J) ≥ s.

STEP 3: Rule generation
Rules are generated from these prestored itemsets for the user-defined minimum support and minimum confidence.

III. PROPOSED ALGORITHM

The algorithm by Charu et al. (2001) was discussed in the previous section. The proposed algorithm is discussed in detail in the current section. A graph theoretic approach has been used in the proposed algorithm. The graph generated is a directed graph with weights associated with the edges. Also, the number of edges is less compared to that in the algorithm suggested by Charu et al.

A. Algorithm

The algorithm has two steps, explained below. The first step, construction of the graph, is explained in section 3(A). The second step, rule generation, is explained in section 3(B).
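The lattice construction (STEP 1) and the reachability search (STEP 2) of the existing algorithm can be sketched as follows. This is a minimal illustration, not the authors' implementation: itemsets are represented as frozensets, and the function names are hypothetical.

```python
def build_adjacency_lattice(supports):
    """supports: dict mapping frozenset itemsets to their support values
    (the prestored primary itemsets). Returns child lists: an edge
    v(X) -> v(Y) exists iff X is a parent of Y, i.e. Y = X plus one item."""
    children = {x: [] for x in supports}
    for y in supports:
        for item in y:              # each parent of Y is Y minus one item
            x = y - {item}
            if x in children:
                children[x].append(y)
    return children

def reachable_with_support(children, supports, start, min_sup):
    """All itemsets J such that v(J) is reachable from v(start) by a
    directed path and S(J) >= min_sup (the STEP 2 search)."""
    found, stack = set(), [start]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return {j for j in found if supports[j] >= min_sup}
```

For example, with A, AB and ABD prestored, the itemsets reachable from A at minimum support 0.5 would include AB but exclude ABD if ABD's support falls below the threshold.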
Construction of adjacency lattice

The large itemsets obtained by applying a traditional frequent-itemset algorithm (such as Apriori) are stored in one file, and the corresponding support values are stored in another file. Using these two files we can store each itemset and its support in a structure, say S. Now create an array of structures s(i, j) having two fields, itemsets and support. This array of structures is used to store large itemsets of different lengths in different dimensions: in the itemsets field of s(i, j) we store 1-itemsets in s(1, j), 2-itemsets in s(2, j), 3-itemsets in s(3, j), and so on. A function named Initialize() has been written for this purpose. Its pseudo code is given below:

Algorithm Initialize(S)
Begin
  for each large itemset Є S do
    Item1 = S(i).itemset;
    Item2 = S(i+1).itemset;
    M1 = length(Item1);
    M2 = length(Item2);
    s(j,k).itemsets = Item1;
    s(j,k).support = S(i).support;
    Increment k;
    If (difference of lengths of consecutive itemsets != 0)
      put itemsets in the next row of s;
  return s;
End;

Now to calculate the weight of the edge between itemset X and itemset Y, where X - Y is a 1-itemset, calculate the value support(X)/support(Y); if this value is >= the minimum confidence then there is an edge between the itemset X and the itemset Y, and this edge has weight support(X)/support(Y). A function is now required to generate the adjacency matrix using the structures S and s. This function takes one large itemset from s(i, j) and compares it with all the itemsets in s(i+1, j). If any subset of this itemset from s(i, j) is present in s(i+1, j), then it must be determined whether there is a link between them, and if so, what the weight of the link is.

Let an itemset X from structure s(i, j) be taken and searched for in S. When the index of itemset X in the structure S, say index1, is obtained, we can easily get the support of this itemset X. Now search for all subsets of this itemset in s(i+1, j). The support must be found for each itemset Y which is present in s(i+1, j) and is also a subset of the itemset X from s(i, j). The index of the itemset Y, index2, is obtained by searching for it in structure S. Now weight = S(index1).support / S(index2).support is calculated; if it is greater than or equal to the minimum confidence, then in the adjacency matrix, say a, a[index1, index2] is assigned a value equal to the weight. The pseudo code for gen_adj_lattice() is given below:

Algorithm gen_adj_lattice(S, s)
Begin
  For each row of s do
    Item1 = s(i,j).itemsets;
    Index1 = find_index(Item1, S);
    // finding all subsets of Item1 in s(i+1, j)
    For each itemset in s(i+1) do
      Item2 = s(i+1,k).itemsets;
      If (Item1 is a superset of Item2)
        Index2 = find_index(Item2, S);
        Confidence = S(Index1).support / S(Index2).support;
        If (Confidence >= minconf)
          adj_lat(Index1, Index2) = Confidence;
  Return adj_lat;
End;

In the above gen_adj_lattice() function there is a sub-function to search for an element in the structure S, which returns the index of that itemset in the structure. Using this index we can get the support of the corresponding large itemset. To search for an itemset X in S, first find the length of X, then traverse the structure S; only when the length of the current itemset equals the length of the itemset being searched for are the two itemsets compared. If all the items of both itemsets match, the index is returned. The pseudo code for find_index() is given below:

Algorithm find_index(item, S)
Begin
  N1 = length(item);
  For each itemset in S do
    Item2 = S(r).itemsets;
    N2 = length(Item2);
    If (N1 == N2)
      If (each item matches)
        index = r;
        return index;
End;

The graph generated is a directed graph in which the largest itemsets are at the first level and the 1-large itemsets are at the lowest level. The direction of the edges is from the (n-1)th level to the nth level, and the weight is equal to the support of the itemset at the (n-1)th level divided by the support of the itemset at the nth level.

B. Generation of Rules

Each node in the directed graph is chosen in turn for rule generation. Call that node the starting node and perform a depth first search in the directed graph. Rules are generated from the visited node and the starting node if and only if all the conditions required to generate an essential rule are satisfied.

Conditions:
1. The product of the confidences along the path between the starting node and the visited node must be greater than or equal to the minimum confidence.
2. To reduce simple redundancy: we generate the set of all children of the visited node, and this set of child nodes is compared with the nodes that have already been used by the same starting node for rule generation. If any one of the child nodes is found there, no rule can be generated from this visited node, since such a rule would be redundant.
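The weighted-edge construction described above can be sketched compactly with dictionaries of frozensets instead of the paper's array-of-structures. This is an illustrative sketch, not the authors' implementation; the function name mirrors the pseudo code only for readability, and edges are keyed (X, Y) with Y one item larger than X, so the weight is the confidence of the rule X → (Y − X), as in Table 4.4.

```python
def gen_adj_lattice(supports, minconf):
    """supports: dict {frozenset itemset: support}. Returns weighted edges
    {(X, Y): w} where Y = X plus one item and w = support(Y)/support(X),
    i.e. the confidence of the rule X -> (Y - X). Edges whose weight is
    below the minimum confidence are pruned."""
    edges = {}
    for y in supports:
        for item in y:
            x = y - {item}              # X is Y with one item removed
            if x in supports:
                conf = supports[y] / supports[x]
                if conf >= minconf:
                    edges[(x, y)] = conf
    return edges
```

For example, with the supports of Tables 4.1 and 4.2, the edge A–AB gets weight 0.6/0.8 = 0.75, while A–AC (0.4/0.8 = 0.5) is pruned at minimum confidence 0.67.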
The pseudo code for find_allChild() is given below:

Algorithm find_allChild(adj_lat, i)
Begin
  C1 = C = NULL;
  C1 = C = child(adj_lat, i);
  while C1 != NULL do
    For each c Є C1 do
      C1 = child(adj_lat, c);
      C = C ∪ C1;
  return C;
End;

We have a structure, say G, which stores the nodes that have already been used for generating rules. They are stored in such a way that we can get the required nodes simply by accessing the corresponding index. The pseudo code for this is given below:

Algorithm node_gen_rule(nodeset: S, G)
Begin
  generatedSet = NULL;
  for each node S(i) Є S do
    generatedSet = generatedSet ∪ G(S(i));
  return generatedSet;
End;

To reduce strict redundancy:
A) We generate the set of all parents of the starting node, and then for all these parent nodes we find all the nodes which have been used for rule generation by these parent nodes. This set of nodes is then compared with the visited node. If the visited node is found, no rule can be generated from it, because such a rule would be strictly redundant. The pseudo code for find_allParents() is given below.
B) We generate the set of all children of the visited node and the set of all parents of the starting node, and then for all these parent nodes we find all the nodes which have been used for rule generation by these parent nodes. This set of nodes is then compared with the set of all children. If any child of the visited node is found there, no rule can be generated from this visited node, because such a rule would be strictly redundant.

Algorithm find_allParents(adj_lat, i)
Begin
  P1 = P = NULL;
  P1 = P = parents(adj_lat, i);
  While P1 != NULL do
    For each p Є P1 do
      P1 = parents(adj_lat, p);
      P = P ∪ P1;
  return P;
End;

Algorithm GenerateRule(Starting node: X, Visited node: Y, MinConf: c, G)
Begin
  RuleSet = NULL;
  c1 = weighted product of the path (X, Y);
  If (c1 >= c)
    If (~compare(find_allChild(adj_lat, Y), node_gen_rule(X, G)))
      If (~compare(node_gen_rule(find_allParents(adj_lat, X), G), Y))
        If (~compare(find_allChild(adj_lat, Y), node_gen_rule(find_allParents(adj_lat, X), G)))
          RuleSet = RuleSet ∪ {Y -> (X - Y)};
  Return RuleSet;
End;

IV. ILLUSTRATION OF EXISTING AND PROPOSED ALGORITHMS

We now illustrate both algorithms with an example, using a market basket dataset of five transactions and five items. Let the minimum support be 0.4 and the minimum confidence be 0.67. The large itemsets obtained, having support value greater than or equal to 0.4, are shown along with their support values in Tables 4.1 to 4.3.

Table 4.1: 1-large itemsets

ITEMS        SUPPORT
A = Bread    0.8
B = Milk     0.8
C = Beer     0.6
D = Diaper   0.8
F = Coke     0.4

Table 4.2: 2-large itemsets

ITEMSETS   SUPPORT
AB         0.6
AC         0.4
AD         0.6
BC         0.4
BD         0.6
BF         0.4
CD         0.6
DF         0.4
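find_allChild() and find_allParents() compute the transitive closure of the child and parent relations. A minimal iterative sketch, assuming the graph is stored as a dict keyed by directed edges (x, y); the names are hypothetical and this is not the authors' implementation:

```python
def all_descendants(edges, node):
    """Transitive closure of the child relation: every node reachable
    from `node` by following directed edges (x, y)."""
    result, frontier = set(), {node}
    while frontier:
        nxt = {y for (x, y) in edges if x in frontier}
        frontier = nxt - result
        result |= nxt
    return result

def all_ancestors(edges, node):
    """Transitive closure of the parent relation (edges reversed)."""
    result, frontier = set(), {node}
    while frontier:
        nxt = {x for (x, y) in edges if y in frontier}
        frontier = nxt - result
        result |= nxt
    return result
```

The `frontier`/`result` bookkeeping mirrors the C1/C (and P1/P) sets of the pseudo code and terminates even if the closure revisits nodes.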
Table 4.3: 3-large itemsets

ITEMSETS   SUPPORT
ABD        0.4
ACD        0.4
BCD        0.4
BDF        0.4

A. Rule Generation from the proposed algorithm

The weights of the edges between frequent 1-itemsets and frequent 2-itemsets, and between frequent 2-itemsets and frequent 3-itemsets, are shown in Table 4.4. The weights are calculated in the following manner: let X be a k-itemset and Y be a (k+1)-itemset; then the weight of the edge from X to Y is equal to the confidence of the rule X -> (Y - X).

Table 4.4: Weights of the edges

Edges       Weights
A - AB      0.75
A - AC      0.5
A - AD      0.75
B - AB      0.75
B - BC      0.5
B - BD      0.75
B - BF      0.5
C - AC      0.67
C - BC      0.67
C - CD      1.0
D - DF      0.5
D - AD      0.75
D - BD      0.75
D - CD      0.75
AB - ABD    0.67
AC - ACD    0.67
AD - ABD    0.67
AD - ACD    0.67
BC - BCD    1.0
BD - BCD    0.67
BF - BDF    1.0
CD - ACD    0.67
CD - BCD    0.67

The lattice generated for the above example is shown in Figure 4.1.

Figure 4.1: Lattice Structure

The resultant graph is shown in Figure 4.2. We can see that there are more edges in the lattice generated for the same example; these extra edges are shown as dotted edges.

Figure 4.2: Graph generated for the rule generation

Figure 4.3: Generating the rules for the large itemset ABD

Applying depth first search starting from the node ABD, the node A is the first visited node, but the weighted product (0.67 * 0.75 ≈ 0.5) of the path from A to ABD is less than the minimum confidence, so the node A does not participate in rule generation. Node B is the second visited node, but it also does not participate in rule generation, for the same reason. The next visited node is AB, and the weighted product of the path from AB to ABD is 0.67, which is equal to the minimum confidence. The children of AB do not generate any rule, and AB has not been used by any of the parent nodes of ABD. Thus all three conditions are satisfied for rule generation.
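The weighted path products used in this walk-through can be checked with a few lines of code; a small sketch using the edge weights of Table 4.4, with itemsets abbreviated as strings:

```python
# Edge weights from Table 4.4 (k-itemset -> (k+1)-itemset).
weights = {('A', 'AB'): 0.75, ('AB', 'ABD'): 0.67,
           ('D', 'AD'): 0.75, ('AD', 'ABD'): 0.67}

def path_product(path):
    """Product of the edge confidences along a path in the lattice."""
    prod = 1.0
    for x, y in zip(path, path[1:]):
        prod *= weights[(x, y)]
    return prod

minconf = 0.67
assert path_product(['A', 'AB', 'ABD']) < minconf    # 0.75 * 0.67 ≈ 0.50: no rule from A
assert path_product(['AB', 'ABD']) >= minconf        # 0.67: rule AB => D is generated
```

This reproduces the decisions in the ABD walk-through: the path through A falls below the minimum confidence, while the direct edge from AB meets it.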
So we generate the rule from AB: AB => D. The next visited node is D, but the weighted product of the path from D to ABD is less than the minimum confidence, hence no rule is generated and we move on to the next visited node, AD; it satisfies all three conditions, so we have the rule AD => B. The next visited node is BD, and this node also satisfies all three conditions, so we have the rule BD => A.

Similarly, rules are generated for the large itemsets ACD, BCD, BDF, AB, AD, BD, AC, BC, CD, BF and DF. The resulting rules are shown in Table 4.5 below.

Table 4.5: The rules generated

1    AB => D
2    AD => B
3    BD => A
4    C => AD
5    AD => C
6    C => BD
7    BD => C
8    F => BD
9    BD => F
10   A => B
11   B => A
12   A => D
13   D => A
14   B => D
15   D => B
16   D => C

B. Rules Generated from the Existing algorithm

Generating the rules for the large itemset ABD: choose all the ancestors of ABD which have support less than or equal to the value support(ABD)/c = 0.4/0.67 = 0.6. AB, AD and BD are selected, so we have the following lattice. We can easily see that AB, AD and BD are the maximal ancestors of the directed graph shown in the figure. Hence we have three rules: AB => D, AD => B, BD => A.

Figure 4.4: Directed Graph in Adjacency Lattice

A total of 16 rules is generated by both algorithms. It was found that no essential rules are missing in the proposed algorithm, and there is also no redundancy in the rules generated.

C. Comparison of Algorithms

The complexity of a graph search algorithm is proportional to the size of its output.

Theorem: The number of edges in the adjacency lattice is equal to the sum of the number of parents of each primary itemset.

Let N(I, s) be the number of primary itemsets in R(I, s). The size of the output in the existing algorithm is N(I, s) · h(I, s), so the complexity of the existing algorithm is proportional to N(I, s) · h(I, s). In the proposed algorithm some edges are not visited by their parents; let these be denoted by L(I, s). The size of the output in this case is N(I, s) · h(I, s) - L(I, s), so the complexity of the proposed algorithm is proportional to N(I, s) · h(I, s) - L(I, s).

CONCLUSION AND FUTURE WORK

In this paper, data mining and one of its important techniques, association rule mining, are discussed. The issues related to association rule mining are described, and online mining of association rules is introduced to resolve these issues. Online association mining helps to remove redundant rules and gives the user a compact representation of the rules. A new algorithm has been proposed for online rule generation. The advantage of this algorithm is that the graph it generates has fewer edges compared to the lattice used in the existing algorithm. This algorithm also generates all the essential rules, with no rule missing.

Future work will be to implement both the existing and proposed algorithms, and then to test them on large datasets such as the Zoo, Mushroom and synthetic datasets.

REFERENCES

[1] Agrawal, R., Imielinski, T., Swami, A., "Mining association rules between sets of items in large databases," SIGMOD 1993, pp. 207-214.
[2] Charu C. Aggarwal and Philip S. Yu, "A New Approach to Online Generation of Association Rules," IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 4, pp. 327-340, 2001.
[3] Dao-I Lin and Zvi M. Kedem, "Pincer search: An efficient algorithm to find maximal frequent itemsets," IEEE Transactions on Knowledge and Data Engineering, no. 3, pp. 333-344, May/June 2002.
[4] Ming Liu, W. Hsu, and Y. Ma, "Mining association rules with multiple minimum supports," in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 337-341, New York, 1999. ACM Press.
[5] R. Agrawal, T. Imielinski, and A. Swami, "Mining association between sets of items in large databases," Conf. on Management of Data, Washington, DC, May 1993.
[6] Ramakrishnan Srikant, Quoc Vu and Rakesh Agrawal, "Mining association rules with itemset constraints," in Proc. of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD-97), Newport Beach, California, August 1997.
[7] Rakesh Agrawal and Ramakrishnan Srikant, "Fast Algorithms for Mining Association Rules," in Proc. 20th Int. Conf. on Very Large Data Bases (VLDB), 1994.
[8] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques. Academic Press, 2001.