Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. T. Sowmiya, V. Thangamani / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 6, November- December 2012, pp.354-358 Generating Similar Item Sets Of Temporal Databases Using Spamine Algorithm T. Sowmiya1, V. Thangamani2 1 pg Scholar, Dept Of Cse, Vel Tech Dr.Rr & Dr.Sr Technical University, Chennai-62. 2 pg Scholar, Dept Of It, Anna University, Guindy, Chennai-25.ABSTRACT Data mining is the process of extracting ecology, and biology.interesting like non-trivial, implicit, previously 2. RELATED WORKS:unknown and potentially useful information or Many related works are applicable topatterns from large information repositories such this paper that works includes the inductionas: relational database, data warehouses, XML by Bremen et al. and quinoa “classificationrepository, etc. Data mining is known as one of rules” in 1984 and 1993, Spiegel halter et al.the core processes of Knowledge Discovery in at “discovery of causal rules” in 1993,Database (KDD). Association rule mining is a Muggleton and Feng at “learning of logicalpopular and well researched method for definitions” in 1992, Langley et al. at “fittingdiscovering interesting relationships among of functions to data” in 1987, Cheeseman etvarious data items in large databases. The al. at “clustering” in 1988. The goal ofexisting system uses only frequent item sets for association rule discovery is to find associationsassociation rule mining. It may easily cause among items from a set of transactions, each ofthrashing when dataset becomes large and which contains a set of items.sparse. To overcome this spamine algorithm isused here for computation of support values of all Frequent item set generator:possible item sets at each time point and Mohammed Saki discovers Charm frequentgenerates their support sequences and compares item set generator generates the closed frequent itemthe generated support time sequences with a sets and not the association rules on February 25,given reference sequence and finds similar item 2001. Jawed Han’s research group at Simon Frasersets. University discovers FP-growth generates frequent Item sets for association rules from It generates allKeywords: Association rule mining, Support frequent item sets satisfying a given minimumvalues, Similar item set, Candidate item sets. support by growing a frequent pattern tree structure that stores compressed information about the1. INTRODUCTION: frequent-growth can avoid repeated database scans Generating the support time sequences by and also avoids the generation of a large number ofreducing the candidate item sets, using the Candidate item sets. Jawed Han and Jean PeiAssociation rules discover interrelation ships discovers FP-growth implementation in February 5,among various data items in transactional data. 2001, it provides the improvement when comparedSimilarity-profiled temporal association mining is to the earlier version of experimental results. Jawedto discover all associated item sets whose Han’s finds Closet; it is a frequent item set generatorprevalence variations over time are similar to for association rules from research group onthe reference sequence under a threshold. September 21, 2000 .Jawed Han and Jean PeiSimilarity-profiled temporal association mining provided the closet implementation.can reveal interesting relationships of data These implementations of FP-growth anditems that co occur with a particular event Closet only generate the frequent item sets, and notover time. Using the, time stamped transaction the association rules. Geoff Webb implements thedatabase and a user defined reference sequence Magnum Opus on February 1, 2001. Magnum Opusof interest over time. One major area of data directly generates association rules from a datasetmining from these data is association pattern based on a specified search preference. Magnumanalysis. The discovery of association rules Opus is a unique technique for the search algorithmpaid attention to temporal information, which is based on an efficient admissible algorithm forimplicitly related to transaction data, e.g., the unordered search.time that a transaction is executed, anddiscovered association patterns that vary over Discovering calendar based temporal associationtime. Association patterns can give important rules:insight into many Application domains such A temporal association rule is anas business, agriculture, earth science, association rule that holds during specific time 354 | P a g e
  2. 2. T. Sowmiya, V. Thangamani / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 6, November- December 2012, pp.354-358intervals. S.Ramaswamy, S.mahajan, and 4. MODULE DESCRIPTION:A.Silberschatz, “on the discovery of interesting 4.1 Data Preprocessingpatterns in association rules”. B.Ozde, The temporal dataset given by the user canS.Ramaswamy, andA.Silberschatz at “cyclic have any type of item sets. The proposed system isassociation rules” was extended to approximately modeled such that the association rule miningdiscover user-defined temporal patternsin algorithm could process only numeric item sets. Soassociation rules.the work in “on the discovery of the dataset given by the user is converted intointeresting patterns in association rules” is more numeric format and then the numeric data is splitflexible and practical than “cyclic association rules” into different time stamps to which the SPAMINEhowever it requires user-defined calendar algebraic algorithm can be applied.expressions in order to discover temporal patterns.Y. Li, P. Ning, X.S. Wang, and S. Jajodia at 4.2 Calculation of Support Time Sequences“Discovering calendar based temporal association Support time sequences can be constructedrules” J.Data and Knowledge Engg., Vol.15, no.2, by two different ways in scanning the time stamped2003 uses calendar schema for better result. As a transaction data set.result this implements less priori then “on the 1. Lattice-dominant scandiscovery of interesting patterns in association rules” 2. Snapshot-dominant scanand “cyclic association rules”. S.Ramaswamy, Lattice-dominant scan: The lattice-S.mahajan, and A.Silberschatz, at “on the discovery dominant scan method reads a whole transactionof interesting patterns in association rules” which data set from time slot t1 to time slot ten for candidatediscover temporal association rules for one user item sets of each size and generates their supportdefined temporal pattern Y. Li, P. Ning, X.S. time sequences.Wang, and S. Jajodia at “Discovering calendarbased temporal association rules” J.Data and Snapshot-dominant scan: The snapshot-Knowledge Engg., Vol.15,no.2,2003 shows all dominant scan method repeats the scanning ofpossible temporal patterns in calendar schema. It can transactions at each time slot. First, it counts thepotentially discover more temporal association rules. supports of all candidate item sets of different sizes in the first time slot, and then it moves to the next3. METHODS: time slot and repeats the process. This method Association rule mining is a popular and incrementally generates the support time sequenceswell researched method for discovering interesting with the processed time slots.relationships among various data items in large SPAMINE algorithm uses Lattice-dominant scandatabases. The existing apriori based temporal to read the whole Transaction dataset from time slotassociation rule mining. Find all frequent item sets t1 to time slot ten for candidate item sets and generatewhich occur at least as frequent as a pre-determined their support time sequences. The similarity measureminimum support count and Generate strong used in this algorithm is Euclidean distance.association rules from the frequent item sets which Euclidean distance between two pointssatisfy minimum confidence. The problem of mining (x1,y1) and (x2,y2) is calculated using the the temporal pattern is formulated with a similarity- D= [(x2-x1)2+ (y2-y1)2]1/2profiled subset specification, which consists of a Similarity measure (Euclidean distance) isreference time sequence. applied to each candidate item set to calculate the To overcome this problem this paper adopts following:the Similarity-Profiled temporal Association 1. Upper and lower bound sequenceMining method (SPAMINE)[1] algorithm divides 2. Upper lower-bounding distancethe mining process into two separate phases. Generating the support time sequences of• The first phase computes the support values of all item sets is the core operation in similarity-profiled possible item sets at each time point and association mining algorithm. The operation, generates their support sequences. however, is very data intensive and sometimes can• The second phase compares the generated produce the sequences of all combinations of items. support time sequences with a given reference A different way is explored to estimate support time sequence and finds similar item sets. sequences without examining an input data set. A SPAMINE algorithm is developed Lower bounding distances are explored tohere for Temporal patterns are searched with a user find item sets whose support sequences could notdefined numeric reference sequence and consider the possibly be a best match with a reference sequenceprevalence similarities of all possible item sets, not under a dissimilarity threshold. In this concept ofonly frequent item sets. The SPAMINE algorithm lower bounding distance, if the lower boundingwill reduce the search space. The computation cost distance of an item set does not satisfy theis reduced. Both transaction and time are taken dissimilarity threshold, its true distance also does notunder consideration. satisfy the threshold. 355 | P a g e
  3. 3. T. Sowmiya, V. Thangamani / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 6, November- December 2012, pp.354-358 Thus, the lower bounding distance can be  un =min{support (J1, DNS), …., supportused to prune item sets early without the (KJ, DNS)}computation of the true distances. Lower bounding Lower bound sequence:distance is defined with upper and lower bound Lower bound support time sequence of itemsupport time sequences. set I, L1 =< l1 , …, in > is defined bySupport time sequences:  l1 =max{(support (J1, D1)+support (I- J1,Let D=D1U…Dun be a set of disjoint transactions. D1)-1), …, (support(Joke, D1)+support (I-J={J1, …, Joke} be a set of all size k-1 subsets of a Joke, D1)-1),0 }size k item set I.  in=max{(support (J1, Den)+support(I- J1,Upper bound sequence: Den)-1), …, (support (Joke, Den)+support(I- Upper bound support time sequence of item Joke, Den)-1),0 }set I, U1 =< u1 , …, un> is calculated using The lower lower-bounding distance between R and  u1=min{support (J1, D1), …., support (KJ, L, Dib (R, L) is defined to D1)}Lower Bounding Distance  D(RU , LL) The upper lower-bounding distance The lower-bounding distance,between Rand U, Dub(R, U) is defined to  Dab(R, U, L) = Dub (R, U)+  D(RU , UL) Dib(R, L)4.3 Generation of Similar Item Sets such as upper bound support time sequence, lowerThe algorithm compares the calculated distances bound support time sequence and lower boundingwith the dissimilarity threshold in order to prune distance along with the user defined referencethe dissimilar item sets and results in similar item sequence.sets. The algorithm uses all the calculated valuesSYSTEM FLOW DIAGRAM: Read Datasets and Threshold Value Divide into Time Stamps Convert Data into Numeric Values Find Item Sets Find Support Values Calculate Upper Bound and Lower Bound Apply Upper Distance Formula to Find the Similar Item Sets Compare With Minimum Threshold Output the Similar Item Sets That Satisfy the Threshold FIGURE 1: FINDING SIMILAR ITEM SETS 356 | P a g e
  4. 4. T. Sowmiya, V. Thangamani / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 6, November- December 2012, pp.354-358 here is depends on data distribution, this type of reference sequence, dissimilarity threshold. The future improvement may be ofESULT: different similarity models for temporal patterns canExperimental results on fake and real data sets be explored. The current similarity model using a Lpshowed that the SPAMINE algorithm used here norm-based similarity function is a little rigid inis computationally efficient and can produce finding similar temporal patterns. It may bemeaningful results from real data. The interesting to consider not only a relaxed similaritysimilarity-profiled temporal association mining model to catch temporal patterns that show similarmethod algorithm uses the lower bounding trends but also phase shifts in time. For example, thedistance of the bounds of support sequences, sale of items for cleanup such as chain saws andand the monotonicity property of the upper mops would increase after a storm rather thanlower bounding distance without during the storm. The current project consideredcompromising the correctness and completeness whole sequence matching for the similar temporalof the mining results. For the significant patterns. Subsequence matching models may bereduction of search space by pruning candidate more flexible for the patterns.item sets However, the pruning scheme usedFIGURE 2: FABRICATION OF SIMILAR ITEM SETSCONCLUSION: distance without compromising the correctness and The similarity-profiled temporal completeness of the mining results. The miningassociation mining method (SPAMINE)[1] algorithm problem similarity-profiled temporal associationsubstantially reduced the search space by pruning patterns are formulated and an algorithm is proposedcandidate item sets using the lower bounding to discover them. Experimental results on data setsdistance of the bounds of support sequences, and the showed that the SPAMINE algorithm ismonotonicity property of the upper lower-bounding 357 | P a g e
  5. 5. T. Sowmiya, V. Thangamani / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 6, November- December 2012, pp.354-358computationally efficient and can produce meaningful results from real data. J.Lin, “Discovering Frequent Event Patterns with Multiple Granularities in Time Sequences,” IEEE Trans. Knowledge and Data Eng., vol. 10, no. 2,REFERENCES: Mar.-Apr. 1998. 1. Jin Soung Yoo and Shashi Shekhar, 7. G.Dong and J.Li, “Efficient Mining of “Similarity-Profiled Temporal Association Emerging Patterns: Discovering Trends Mining”, IEEE 2009. and Differences,” Proc. ACM SIGKDD, 2. Juan M.Ale and Gustavo H.Rossi R,”An 1999. approach to discovering temporal 8. B. Liu, W.Hsu, and Y.Ma, association rules”, ACM SIGDD 2002. “Discovering the Set of Fundamental 3. Keshri Verma and O.P.Vyas, ”Efficient Rule Change,” Proc. ACM SIGKDD, calendar based temporal association rule”, 2001. SIGMOD Record, Vol.34, No.3, 2005. 9. W.Teng, M.Chen, and P.Yu, “A 4. R. Agarwal and R. Srikant, “Fast Regression-Based Temporal Pattern Algorithms for Mining Association Mining Scheme for Data Streams,” Rules,” Proc. Int’l Conf. Very Large Proc. Int’l Conf. Very Large Databases Databases (VLDB), 1994. (VLDB), 2003. 5. R.Srikant and R.Agrawal, “Mining 10. Y. Li, P.Ning, X.S. Wang, and Generalized Association Rules,” Proc. S.Jajodia, “Discovering Calendar Based Int’l Conf. Very Large Databases (VLDB), Temporal Association Rules,” J. Data and 1995. Knowledge Eng., vol. 15, no. 2, 2003. 6. C.Bettini, X.Wang, S.Jajodia, and 358 | P a g e