Published on

IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. SunithaVanamala, L.Padma sree, S.Durga Bhavani / International Journal of EngineeringResearch and Applications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.753-757753 | P a g eEfficient Rare Association Rule Mining AlgorithmSunithaVanamala*, L.Padma sree **, S.Durga Bhavani****(Assistant Professor, Department of Information Technology, Kakatiya Institute of Technology and Science,Warangal,)** (Professor, Department of ECE, VNR Vignan Jyothi Institute of Technology and science, HYD)***(Professor, School of Information Technology , JNTUH ,HYD,)AbstractData mining is the process of discoveringcorrelations, patterns, trends or relationships bysearching through a large amount of data storedin repositories, corporate databases, and datawarehouses. In Data mining field, the primarytask is to mine frequent item sets from atransaction database using Association RuleMining (ARM).Whereas the extraction offrequent patterns has focused the majorresearches in association rule mining, therequirements of reliable rules that do notfrequently appear is taking an increasing interestin a great number of areas. Rare association rulerefers to an association rule forming betweenfrequent and rare items or among rare items. Inmany cases, the contradictions or exceptions alsooffers useful associations. Recent researchesfocus on the discovery of such kind ofassociations called rare associations. The miningof associations involving rare items is referred asrare association rule mining. Approaches toassociation rule mining uses single minimumsupport for identifying frequent associations.To mine interesting rare association rules, singleminimum support approaches are not useful.Hence, we propose an algorithm based onMSApriori ,we call this new algorithm asMSApriori_VDB which uses vertical databaseformat. Experimental results shows that ouralgorithms out performs previous approaches inboth memory requirement and execution time byreducing the number of database scans.Keywords—Data Mining, Association rules, Rareitems, Rare Association rule mining, MSAPriri_vdb1. INTRODUCTIONNowadays most research on AssociationRule Mining (ARM) [1] [10][11] has been focusedon discovering common patterns and rules in largedatasets. In fact, ARM is in various applicationareas such as telecommunication networks, marketand risk management, inventory control, mobilemining, graph mining, educational mining, etc. Thepatterns and rules discovered are based on themajority of commonly repeated items in the dataset,though some of these data can be either obvious orirrelevant. Unfortunately, not enough attention hasbeen paid to the extraction process of rareassociation rules, also known as non-frequent,unusual, exceptional or sporadic rules, whichprovide valuable knowledge about non-frequentpatterns. The goal of Rare Association Rule Mining(RARM) is to discover rare and low-rank item setsto generate meaningful rules from these items. Thistype of rule cannot be identified easily usingtraditional association mining algorithms.Given a transaction database, the association rulemining task involves two steps.1. Get frequent item sets and2. Automatic generation of interesting associationrulesThe main problem in the association rulemining is setting the MST. A low support thresholdgenerates too many itemsets sometimesuninteresting item sets but a high MST misses fewrare itemsets. The real world database may haveitems that are of varying frequencies. Some itemsappear frequently in transactions and some of themappear rarely. The rare itemsets may also beinteresting. A rare itemset is an itemset consisting ofrare items. It may be found by setting a low supportthreshold but leads to large number association rulesconsisting of both interesting and uninterestingrules. But it is difficult to mine rare association rulesusing single support threshold based approaches likeApriori and Frequent Pattern-Growth (FP-Growth).The problem of specifying an appropriate supportthreshold causes rare item problem. rareassociations are of two types.1. Interesting rare association2. Uninteresting rare associationDefinition: Interesting rare association – Anassociation is said to be interesting rare associationif it has low support but the confidence of theassociation is high.Definition : Uninteresting rare association – Anassociation is said to be uninteresting rareassociation if it has low support and low confidence.1.1 Problem statementIn this paper we first pre-process thedataset by converting into vertical data formatbefore performing association rule mining. Therationale behind the conversion process prior tomining association rules is that to reduce thenumber of database scans performed on original
  2. 2. SunithaVanamala, L.Padma sree, S.Durga Bhavani / International Journal of EngineeringResearch and Applications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.753-757754 | P a g edatabase. The second step deals with the mining ofinteresting rare associations using convertedvertical data and to deal with the rare associationproblem, a multiple minimum support framework isbeing used. The framework assigns each item in thetransaction dataset a minsup called Minimum ItemSupport (MIS).Our process of generating rare associationsconsist of two phases 1. convert to vertical dataformat 2. finding interesting rare associations i,erules with high confidence.2. Related WorkAssociation Rule Mining algorithms [1][2] [3] are well-known data mining methods fordiscovering interesting relationships betweenvariables in transaction databases or other datarepositories. An association rule is an implicationX⇒Y, where X and Y are disjoint itemsets (i.e., setswith no items in common). The intuitive meaning ofsuch a rule is that when X appears, Y also tends toappear. The two traditional measures for evaluatingassociation rules are support and confidence. Theconfidence of an association rule X⇒Y is theproportion of the transactions containing X whichalso contain Y. The support of the rule is thefraction of the database that contains both X and Y.The problem of association rule mining is usuallybroken down into two subtasks. The first one is todiscover those itemsets whose occurrences exceed apredefined support threshold(MST), and which arecalled frequent itemsets. A second task is togenerate association rules from those large itemsconstrained by minimal confidence(MCT). Thoughthese algorithms are theoretically expected to becapable of finding rare association rules, theyactually become intractable if the minimum level ofsupport is set low enough to find rare rules.2.1 Data formatsThere are two data formats to be adopted.One is the horizontal and other is vertical dataformat. The horizontal data format is the same asthat stored in a database. The vertical data format isa conversion from the horizontal one, with thetransaction identifiers grouped for each item.Algorithms for mining frequent item sets based onthe vertical data format are usually more efficientthan those based on the horizontal. Because theformer often scan the database only once andcompute the supports of item sets fast. Thedisadvantage is that it uses more memory foradditional information, like Tidsets (Zaki et al.,1997; Zaki and Hsiao, 2005) or BitTableZaki et al. proposed for mining Frequent item setsby using additional data structure called TidSet thatstores Tid for each data item in database. Thesupport of item set x can be calculated by thecardinality of item in the corresponding Tidset.Thus support(X)=|Tidset(x)|, Tidset of two item setscan also be found easily by computing intersectionof corresponding Tidsets ( one items sets ), i,eTidset(xy)=Tidset(x)∩Tiidset(y).The problem of discovering rare items hasrecently captured the interest of the data miningcommunity [4]. As previously explained, rareitemsets are those that only appear together in veryfew transactions or some very small percentage oftransactions in the database. Rare association ruleshave low support and high confidence in contrast togeneral association rules which are determined byhigh support and a high confidence level. Figure 1illustrates how the support measure behaves inrelation to the two types of rules.Figure 1. Rules in a database.The figure shows Interesting and uninteresting rareassociation ruleABA. Support>MST &Conf>MCTB.Support<MST&conf>MCT(RareInteresting)B. Support<MST&conf<MCT(RareUninteresting)B
  3. 3. SunithaVanamala, L.Padma sree, S.Durga Bhavani / International Journal of EngineeringResearch and Applications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.753-757755 | P a g eThere are several different approaches to discoverrare association rules. The simplest way is todirectly apply the Apriori algorithm [1] by simplysetting the minimum support threshold to a lowvalue. However, this leads to a combinatorialexplosion, which could produce a huge number ofpatterns, most of them frequent with only a smallnumber of them actually rare. A different proposal,known as Apriori-Infrequent, involves themodification of the Apriori algorithm to use only theabove-mentioned infrequent itemsets during rulegeneration. This simple change makes use of themaximum support measure, instead of the usualminimum support, to generate candidate itemsets,i.e., only items with a lower support than a giventhreshold are considered. Next, rules are yielded asgenerated by the Apriori algorithm. A totallydifferent perspective consists of developing a newalgorithm to tackle these new challenges. A firstproposal is Apriori-Inverse [4], which can be seen asa more intricate variation of the traditional Apriorialgorithm. It also uses the maximum support butproposes three different kinds of additions: fixedthreshold, adaptive threshold and hill climbing. Themain idea is that given a user-specified maximumsupport threshold, MaxSup, and a derivedMinAbsSup value, a rule X is rare if Sup(X) <MaxSup and Sup(X)> MinAbsSup. A secondproposal is the Apriori-Rare algorithm [11], alsoknown as Arima, which is another variation of theApriori approach. Arima is actually composed oftwo different algorithms: a naïve one, which relieson Apriori and hence enumerates all frequentitemsets; and MRG-Exp, which limits theconsiderations to frequent itemsets generators only.Finally, please notice that the first two approaches(Apriori-Frequent and Apriori-Infrequent) are takento ensure that rare items are also considered duringitemset generation, although the two latterapproaches (Apriori-Inverse and Apriori-Rare) try toencourage low-support items to take part incandidate rule generation by imposing structuralconstraints.Kanimozhi et al [7] proposed an approach basedon multiple minimum support to efficiently discoverrules that have a confidence value of hundredpercent and also presented the calculation of theappropriate minimum support based on thefrequency of the items.Ravi Kumar [8], et al proposed efficient rareassociation rule generation by constructing thecompact MIS-tree that uses the notion of ―leastminimum support‖ and ―infrequent child nodepruning. The approach is based on FP-Growth..Cristóbal et al [9] proposed algorithm to extractrare associations from e_learing data by gatheringstudent usage data from a Moodle system3.Proposed workRare association rule mining is the process offinding the item sets involving low frequent itemsbut at the same time the confidence of theassociation is strong i.e. above the minimumconfidence level.//finds all the rare patterns of interest andgenerates rulesAlgorithm :MSApriori_VDBInput : Transaction data Base DOutput :L, Interesting patterns in DR, High confidence rulesprocedure :D<=conv_vdb(D)L1= find frequent 1 itemsets(D) //Remove anyinfrequent 1 itemsetsCalculate support of I for all I Є L1MIS(I) <= support(I)for (k = 2; Lk-1 ≠Ø, k++) doCk=candidate-gen (Lk-1)endfor each candidate c Є Ckc.count=get_count(c)if c.count =0 then delete c;MIS(c)= c.count | c.count is minimum for all Ck-1Lk = {c Є_ Ck | c.count = MIS(c)}If c.count=MIS(c) thenLHS = c | c.count is minimum for all Ck-1Rk =Form_Rule(LHS,c)Return ∪k Rk;return ∪k Lk;endendProcedure Candidate-gen(Lk-1)for each itemset l1 Є tidset.[Lk-1]for each itemset l2 Є tidset[Lk-1]perform intersection operation l1, l2if has_infrequent(c, Lk-1)prune c;elsetidset[k][i]].append(I1∩I2)end ifendendreturn tidset[ Ck];Procedure has_infrequent(c, Lk-1)for each (k-1) subset s of cif s is in Lk-1return false;elsereturn true;end ifendProcedure Gen_Rule(LHS,c)RHS=c minus LHSWrite rule in the format LHS⇒RHSend
  4. 4. SunithaVanamala, L.Padma sree, S.Durga Bhavani / International Journal of EngineeringResearch and Applications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.753-757756 | P a g eprocedure conv_vdb(D)for each tid in Dfor each i in tidtidset[1][pos(i)].append(tid)endend3.1.Calculating MIS for item setsInitially the dataset is scanned to calculatethe support of the first level items. During the firstlevel, the MIS value for each item is its support. 1-itemset is generated and arranged in thedecreasing order of their support value.Given an item X, at level 1, MIS(X)=sup(X)According to the example given in Table1, the MISvalues for the itemset{A, B,C,E, D} are {7, 6, 5, 3,3}. All items in the first iteration is considered to beinteresting and moved to the next iteration. Insubsequent levels, the MIS value for each itemset iscomputed as the minimum support among the subsetof items contained in the itemset. 2-itemsetcandidates are generated by intersectingcorresponding 1-itemsets and support for eachcandidate is |TIDSLIST|. Itemsets with zero supportare eliminated. Then the MIS value at the secondlevel is calculated as follows:Given an item set(P,Q), MIS(P,Q) =min(sup(P), sup(Q)) Rules which are not found tobe interesting are pruned. Now, the association rulecan be formed by finding the left hand side (LHS) ofthe pattern. The item with the lowest support valuein the item set is considered as LHS.Example:The Table1 shows the original data setwhich is to be scanned once and it is converted tothe vertical data format(item, Tidlist) as shown inTable2. MST is assumed as 30%. The length of Tidlist gives the support of corresponding data item.Hence we dont require to scan the data base tocalculate the support values. To calculate thesupport of two item sets we have to intersect thecorresponding one item sets ,from this we getTable3. Table4 shows Rare Association rules forinteresting two item sets. Following the similarprocedure i.e. by intersecting two item sets we willget the corresponding 3 item sets. Similarly we canget n item set from (n-1) item set.Table1.DatasetTid Items purchased1 A,B2 A,D3 B,C4 B, E5 A,B,C6 A,D,B7 A,D8 A,C9 B,E,CTable2.Vertical data formatITEM TID_LIST supp MISA 1,2,5,6,7,8,9 7 7B 1,3,4,5,6,9,10 7 7C 3,5,8,9,10 5 5D 2,6,7 3 3E 4,9,10 3 3Table3:TIDSET for 2 itemsetsITEM TID_LIST MISA,B 1,5,6,9 6A,C 5,8,9 5A,D 2,6,7 3A,E 7 3B,C 3,5,9 5B,D 6 3B,E 4,9,10 3C,E 9,10 3A,B 1,5,6,9 6Table4:Rare Association Rules with 2 item setsItemset Sup MIS LHS RuleA, D 3 3 D D  AB,E 3 3 E E  B4. ConclusionOur approach first performs conversion oforiginal database into vertical data format and thengenerates the rare association rules from the
  5. 5. SunithaVanamala, L.Padma sree, S.Durga Bhavani / International Journal of EngineeringResearch and Applications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.753-757757 | P a g econverted database. The approach is very simple toimplement. The algorithm uses multiple minimumsupports which are calculated based on thefrequency of occurrence. Hence the approachreduces the burden of assumption about minimumsupport threshold. Advantage of vertical data formatis that reduces the number database scans to one.One possible direction for future work would be toreduce the size of tidlist , which is proportional tonumber of transactions, so that additional memoryrequirements can be reduced. This would lead tothe generation rare rules efficiently than would bepossible with the MSAppriori_VDBReferences[1] Agrawal, R. and R. Srikanth, 1994. Fastalgorithms for mining association rules.VLDB.[2] Agrawal, R., T. Imielinski and A. Swami,1993. Mining association rules betweensets of items in large databases. SIGMOD,207-216.[3] Kanimozhi Selvi C.S, and A. Tamilarasi,2009. An automated association rulemining technique with cumulative supportthresholds. Int. J. Open Problems Comput.Sci. Math., 2.[4] Koh, Y. and N. Rountree, 2005. Findingsporadic rules using apriori-inverse.Proceeding Of PAKDD ’05, Hanoi,Vietnam, LNCS, Springer, pp: 97-106.[5] Liu, B., Hsu, W. & Ma, Y. (1999), Miningassociationrules with multiple minimumsupports, in ‘Proceedingsof the 5th ACMSIGKDD International Conference onKnowledge Discovery and Data Mining’,[6] Adda, M., Wu, L, Feng, Y. Rare itemsetmining. In Sixth Conference on MachineLearning and Applications. 2007,Cincinnati, Ohio, USA, pp-73-80.[7] C. S. Kanimozhi Selvi, A. Tamilarasi 2011Mining of High Confidence RareAssociation Rules .EJSR ISSN[8]. T. Ravi Kumar , K. Raghava Rao 2011Association Rule Mining using ImprovedFPGrowth .[9] Cristóbal Romero, José Raúl Romero et alMining Rare Association Rules from e-Learning Data[10] B. Nath, D. K. Bhattacharyya, and A.Gosh. Faster generation of associationrules. volume 1, pages 267–279.IJITKM,2008.[11] B. Nath and A. Ghosh. Multi-objective rulemining using genetic algorithm. pages123–133. Information Science 163, 2004.[12] R.Srikant and R. Agrawala. Mininggeneralized association rules. pages 407–419. Proceedings of the 21st VLDBConference Zurich, Swizerland, 1995.