International Journal of Modern Engineering Research (IJMER)
www.ijmer.com Vol.2, Issue.4, July-Aug. 2012 pp-2508-2511 ISSN: 2249-6645

CB Based Approach for Mining Frequent Itemsets

K. Jothimani¹, Dr. Antony Selvadoss Thanamani²
¹(PhD Research Scholar, Research Department of Computer Science, NGM College, Pollachi, India)
²(Professor and Head, Research Department of Computer Science, NGM College, Pollachi, India)

ABSTRACT: In this paper, we propose a new method for discovering frequent itemsets in a transactional data stream under the sliding window model. Based on the theory of the Chernoff bound, the proposed method approximates the counts of itemsets from recorded summary information without scanning the input stream for each itemset. Together with an innovative technique, called dynamic approximation, that selects parameter values appropriate to each itemset being approximated, our method adapts to streams with different distributions. Our experiments show that our algorithm performs much better at optimizing memory usage and mines only the most recent patterns in much less time, with a highly accurate mining result.

Keywords: Chernoff Bound, Data Stream, Frequent Itemsets

I. INTRODUCTION
Mining frequent itemsets over data streams is one of the most significant tasks in data mining. It should support a flexible trade-off between processing time and mining accuracy. Mining frequent itemsets in data stream applications is beneficial for a number of purposes such as knowledge discovery, trend learning, fraud detection, and transaction prediction and estimation [1]. However, the characteristics of stream data (unbounded, continuous, fast arriving, and time-changing) make this a challenging task.

In data streams, new data continuously arrive as time advances. It is costly, even impossible, to store all streaming data received so far due to the memory constraint. It is assumed that the stream can only be scanned once; hence, once an item has passed, it cannot be revisited unless it is stored in main memory. Storing large parts of the stream, however, is not possible because the amount of data passing by is typically huge. In this paper, we study the problem of finding frequent items in a continuous stream of items. Many real-world applications are more appropriately handled by the data stream model than by traditional static databases. Such applications include stock tickers, network traffic measurements, transaction flows in retail chains, click streams, sensor networks, and telecommunications call records. In the same way, as the data distribution usually changes with time, end users are very often much more interested in the most recent patterns [3]. For example, in network monitoring, changes in the frequent patterns over the past several minutes are useful to detect network intrusions [4].

Many methods for frequent itemset mining over a stream have been proposed. [13] proposed FPStream to mine frequent itemsets, which was efficient when the average transaction length was small; lossy counting has been used to mine frequent itemsets; [7], [8], and [9] focused on mining the recent itemsets, using a regression parameter to adjust and reflect the importance of recent transactions; [27] presented the FTP-DS method to compress each frequent itemset; [10] and [1] separately focused on multiple-level frequent itemset mining and semi-structured stream mining; [12] proposed a group testing technique, and [15] proposed a hash technique to improve frequent itemset mining; [16] proposed an in-core mining algorithm to speed up the runtime when the number of distinct items is huge or the minimum support is low; [15] presented two methods, based respectively on the average time stamps and the frequency-changing points of patterns, to estimate the approximate supports of frequent itemsets; [5] focused on mining a stream over a flexible sliding window; [11] was a block-based stream mining algorithm with a DSTree structure; [12] used a verification technique to mine frequent itemsets over a stream when the sliding window is large; [11] reviewed the main techniques of frequent itemset mining algorithms over data streams and classified them into categories to be separately addressed.

The above methods are mostly based on the (ε, δ) approximation scheme and depend on the error parameter ε to control the memory consumption and the running time. However, ε creates a dilemma between mining precision and memory consumption: a small decrease of ε may make memory consumption large, and a small increase of ε may degrade output precision. Since uncertain streams are highly time-sensitive, most data items are likely to change as time goes by; besides, people are generally more interested in the recent data items than in those of the past. Thus an efficient model is needed to deal with time-sensitive items on uncertain streams. The sliding window model is the most representative approach to coping with the most recent items in many practical applications. We therefore introduce a Chernoff bound, together with Markov's inequality, to deal with this problem.

The remainder of this paper is organized as follows. In Section 2, we give an overview of the related work and present our motivation for a new approach. Section 3 goes deeper into presenting the problems and gives an extensive statement of our problem. Section 4 presents our solution. Experiments are reported in Section 5, and Section 6 concludes the paper with the features of our work.
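Lossy counting, mentioned in the survey above, maintains approximate item frequencies with a bounded error ε using memory that grows only logarithmically with the stream length. The following is a minimal illustrative sketch for frequent single items in the standard bucket-based formulation; it is not part of this paper's method, and the class and method names are our own:

```python
from math import ceil

class LossyCounter:
    """Approximate item frequencies over a stream with error bound eps.

    Every item with true frequency >= s*N is reported, and each reported
    count underestimates the true count by at most eps*N.
    """
    def __init__(self, eps: float):
        self.eps = eps
        self.width = ceil(1 / eps)   # bucket width
        self.n = 0                   # items seen so far
        self.counts = {}             # item -> (count, max possible undercount)

    def add(self, item):
        self.n += 1
        bucket = ceil(self.n / self.width)
        count, delta = self.counts.get(item, (0, bucket - 1))
        self.counts[item] = (count + 1, delta)
        # at each bucket boundary, prune entries that cannot be frequent
        if self.n % self.width == 0:
            self.counts = {k: (c, d) for k, (c, d) in self.counts.items()
                           if c + d > bucket}

    def frequent(self, support: float):
        """Items whose estimated frequency may reach support * n."""
        return [k for k, (c, _) in self.counts.items()
                if c >= (support - self.eps) * self.n]
```

Note how the pruning step is what keeps memory bounded: an item must keep re-earning its place in the summary, which is exactly the trade-off between ε, precision, and memory discussed above.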
II. RELATED WORK
Many previous studies have contributed to the efficient mining of frequent itemsets (FI) in streaming data [4, 5]. According to the stream processing model [20], research on mining frequent itemsets in data streams can be divided into three categories: landmark windows [15, 12, 19, 11, 13], sliding windows [5, 6, 14, 16, 17, 18], and damped windows [7, 4], briefly described as follows. In the landmark window model, knowledge discovery is performed based on the values between a specific timestamp, called the landmark, and the present. In the sliding window model, knowledge discovery is performed over a fixed number of recently generated data elements, which is the target of data mining.

According to the existential uncertain stream model, an uncertain stream US is a continuous and unbounded sequence of transactions {T1, T2, ..., Tn, ...}. Each transaction Ti in US consists of a number of items, and each item x in Ti is associated with a positive probability PTi(x), which stands for the possibility (or likelihood) that x exists in the transaction Ti. For a given uncertain stream, there are many possible instances, called worlds, carried by the stream. The possible worlds semantics has been widely used in [8, 9] and is adopted in this paper to illustrate uncertain streams clearly. Each probability PTi(x) associated with an item x induces two possible worlds, pw1 and pw2. In possible world pw1, item x exists in the transaction Ti; in possible world pw2, item x does not exist in Ti. Each possible world pwi is annotated with an existence probability, denoted P(pwi), that the possible world pwi happens. By this semantics, we get P(pw1) = PTi(x) and P(pw2) = 1 − PTi(x). In practice, a transaction often contains several items.

We introduce a new approach which combines the mathematical model of the Chernoff bound for this problem. The main aims are to keep some advantages of the previous approach while resolving some of its drawbacks, and consequently to improve run time and memory consumption. We also revise Markov's inequality, which avoids multiple scans of the entire data sets, optimizes memory usage, and mines only the most recent patterns. In this paper, we propose an approximation method for discovering frequent itemsets in a transactional data stream under the sliding window model. Based on combinatorial mathematics, the proposed method approximates the counts of itemsets from recorded summary information without scanning the input stream for each itemset.

III. PROBLEM DESCRIPTION
The problem of mining frequent itemsets was previously defined by [1]. Let I = {i1, i2, ..., im} be a set of literals, called items. Let the database DB be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Associated with each transaction is a unique identifier, called its TID. A set X ⊆ I is also called an itemset, where the items within it are kept in lexicographic order. A k-itemset is represented by (x1, x2, ..., xk) where x1 < x2 < ... < xk. The support of an itemset X, denoted support(X), is the number of transactions in which that itemset occurs as a subset. An itemset is called a frequent itemset if support(X) ≥ ζ × |DB|, where ζ ∈ (0, 1) is a user-specified minimum support threshold and |DB| stands for the size of the database. The problem of mining frequent itemsets is to mine all itemsets whose support is greater than or equal to ζ × |DB| in DB.

The previous definitions consider that the database is static. Let us now assume that data arrive sequentially in the form of continuous, rapid streams. Let the data stream DS = B(a1, b1), B(a2, b2), ..., B(an, bn) be an infinite sequence of batches, where each batch B(ak, bk) is associated with a time period [ak, bk], and let B(an, bn) be the most recent batch. Each batch B(ak, bk) consists of a set of transactions; that is, B(ak, bk) = [T1, T2, T3, ..., Tk]. We also assume that batches do not necessarily have the same size; the length of the stream up to the current batch is the total number of transactions in its batches.

The support of an itemset X at a specific time interval [ai, bi] is now denoted by the ratio of the number of customers having X in the current time window to the total number of customers. Therefore, given a user-defined minimum support, the problem of mining itemsets on a data stream is to find all frequent itemsets X over an arbitrary time period [ai, bi], i.e. those verifying

    Σ_{t=ai..bi} support_t(X) ≥ ζ × |B(ai, bi)|,

using as little main memory as possible.
Given a random variable X (here, a count over transaction sets) with finite expectation, we are interested in finding strong bounds on its tails:

    Low end:  Pr[X < a]        High end:  Pr[X ≥ a]

Lemma 1.1 (Markov's Inequality). Let X be a non-negative random variable with finite expectation µ. Then for any α > 0: Pr[X ≥ α] ≤ µ/α.

Lemma 1.2 (Chebyshev's Inequality). Let X be a random variable with finite expectation µ and standard deviation σ. Then for any α > 0: Pr[|X − µ| ≥ ασ] ≤ 1/α².

Chebyshev's inequality follows from Markov's inequality applied to the variable Y = (X − µ)² (so E[Y] = Var[X]) and the fact that x ↦ x² is strictly monotonic on the non-negative reals:

    Pr[|X − µ| ≥ ασ] = Pr[(X − µ)² ≥ α²σ²] ≤ E[(X − µ)²] / (α²σ²) = 1/α².

This makes it tempting to use even higher moments to get better bounds. One way to do this nicely is to consider an exponential of the basic form e^X.

Lemma 1.3 (Simplified Chernoff Bounds).

    Low end:   Pr[X < (1 − δ)µ] < e^(−µδ²/2),   0 < δ ≤ 1
    High end:  Pr[X ≥ (1 + δ)µ] < e^(−µδ²/3),   0 < δ ≤ 1
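The three inequalities above can be checked numerically on a binomial variable, i.e. a sum of independent Bernoulli trials, which is the setting where the Chernoff bounds of Lemma 1.3 apply. This is an illustrative check, not part of the paper:

```python
import math
import random

random.seed(7)

n, p = 200, 0.5
mu = n * p                              # expectation of the binomial
sigma = math.sqrt(n * p * (1 - p))      # its standard deviation
trials = 5000
samples = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]

delta, alpha = 0.2, 3.0

# empirical tail probabilities
markov_lhs = sum(s >= 2 * mu for s in samples) / trials             # Pr[X >= 2mu]
cheby_lhs = sum(abs(s - mu) >= alpha * sigma for s in samples) / trials
chern_lhs = sum(s >= (1 + delta) * mu for s in samples) / trials

# compare against the bounds of Lemmas 1.1-1.3
assert markov_lhs <= mu / (2 * mu)                  # Markov:    <= 1/2
assert cheby_lhs <= 1 / alpha**2                    # Chebyshev: <= 1/9
assert chern_lhs <= math.exp(-mu * delta**2 / 3)    # Chernoff high end
```

The gap between the empirical tails and the bounds illustrates why the exponentially decaying Chernoff bound is so much tighter than Markov's linear one for concentrated counts.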
For larger deviations there is also a simplified high-end bound:

    Pr[X ≥ (1 + δ)µ] < 2^(−(1+δ)µ),   2e − 1 < δ.

Proof. For the low end, let D = (1 − δ)^(1−δ). By Taylor expansion we get

    ln D = (1 − δ) ln(1 − δ)
         = (1 − δ)(−δ − δ²/2 − δ³/3 − ...)
         = −δ + δ²/2 + δ³/6 + ...
         > −δ + δ²/2.

Applying this to the standard form of the bound gives the low-end inequality:

    (e^(−δ) / (1 − δ)^(1−δ))^µ ≤ e^(−µδ²/2).

IV. CBMFI ALGORITHM
The CBMFI (Chernoff Bound based Mining of Frequent Itemsets) algorithm relies on a verifier function; it is an exact and efficient algorithm for mining frequent itemsets in sliding windows over data streams. The performance of CBMFI improves when small delays are allowed in reporting new frequent itemsets; however, this delay can be set to 0 with a small performance overhead. The following algorithm shows how it combines with the Chernoff bound for mining frequent itemsets.

CBMFI
Initialization:
    For all u pick a route rt(u).
    Place all Mu such that D(u) ≠ u into Ce, where e is the first window on rt(u).
Main loop:
    r = 0
    while (there is a packet not at its destination)
        in parallel, for all window ends e = (x, y) do
            if Ce is not empty,
                pick the highest priority itemsets in Ce
                y = itemsets(T)
            if (a moved itemset is not at its destination)
                Ce′ = next end transaction
        r = r + 1

We would like to use a Chernoff bound, so we need to find some independent Poisson trials somewhere. This requires a bit of thought, since many of the random variables associated with this algorithm are not independent. The expected length of a route is k/2, so the total number of edges on all routes is expected to be nk/2. The total number of edges in a hypercube is nk, so we have E[R(e)] = 1/2. It follows that E[H] ≤ k/2. Now apply the simplified Chernoff bound where δ > 2e − 1:

    Pr[H ≥ 6k] < 2^(−6k).

This is allowed since 6k = (δ + 1)µ implies δ ≥ 11. By applying this in the CBMFI algorithm we obtain the efficiency result.
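The CBMFI pseudocode above is stated abstractly. The following is a hedged sketch of the overall loop it describes: maintain itemset counts over a sliding window of transactions and use a Chernoff/Hoeffding-style error term to decide when an approximate support can safely be reported as frequent. All names, the window layout, and the choice of error term are illustrative assumptions, not the authors' implementation:

```python
import math
from collections import Counter, deque
from itertools import combinations

class SlidingWindowMiner:
    """Count itemsets over the last `window` transactions and report as
    frequent those whose support clears min_sup minus a Chernoff-style
    error term eps = sqrt(ln(1/delta) / (2n))."""

    def __init__(self, window: int, min_sup: float, delta: float = 1e-3,
                 max_len: int = 2):
        self.window = window
        self.min_sup = min_sup
        self.delta = delta          # failure probability of the bound
        self.max_len = max_len
        self.buffer = deque()       # transactions currently in the window
        self.counts = Counter()

    def _itemsets(self, transaction):
        items = sorted(transaction)
        for k in range(1, min(self.max_len, len(items)) + 1):
            yield from combinations(items, k)

    def add(self, transaction):
        self.buffer.append(transaction)
        for x in self._itemsets(transaction):
            self.counts[x] += 1
        if len(self.buffer) > self.window:      # slide: expire the oldest
            old = self.buffer.popleft()
            for x in self._itemsets(old):
                self.counts[x] -= 1
                if self.counts[x] == 0:
                    del self.counts[x]

    def frequent(self):
        n = len(self.buffer)
        eps = math.sqrt(math.log(1 / self.delta) / (2 * n))
        return [x for x, c in self.counts.items()
                if c / n >= self.min_sup - eps]
```

With min_sup − eps as the reporting threshold, an itemset whose true support over the window is at least min_sup is missed with probability at most delta, by a Hoeffding/Chernoff argument on the n window transactions.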
V. EXPERIMENTAL RESULTS
All experiments were implemented in C#, compiled with Visual Studio 2005 on Windows Server 2003, and executed on a Xeon 2.0 GHz PC with 4 GB of RAM. We used one synthetic dataset and one real-life dataset, which are well-known benchmarks for frequent itemset mining. The T40I10D100K dataset is generated with the IBM synthetic data generator.

We first compared the average runtime of the two algorithms under different data sizes with the minimum support fixed. As shown in Fig. 5.1, the running time of MFI and CBMFI increases with the data size; these results verify that both algorithms are sensitive to data size, which is due to the use of an unchanged absolute minimum support for the windows.

Fig. 5.1: Runtime vs. number of records

Memory Cost Evaluation
To evaluate the memory cost of our algorithm, we first compared the count of generated items in MFI and CBMFI. As shown in Fig. 5.2 and Fig. 5.3, when we fix the minimum support and increase the data size, the generated items of both algorithms grow, but the overall item count of CBMFI is less than that of MFI, since CBMFI uses the simplified method for itemsets.

Fig. 5.2: Itemset count vs. transaction sets
Fig. 5.3: Itemset count vs. minimum support

Furthermore, when the minimum support increases, the incremental scope of CBMFI is much smaller than that of MFI; this is because there are more redundant items when the overall item count is fixed but the itemset count increases.

VI. CONCLUSION
In this paper we considered the problem of mining frequent itemsets over a stream sliding window using the Chernoff bound. We compared two algorithms, MFI and CBMFI, and introduced a compact data structure based on combinatorial mathematics, which can compress the itemset storage and optimize itemset computation. Our experimental studies showed that our algorithm is effective and efficient.

REFERENCES
[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large database," Proc. Intl. Conf. on Management of Data (ACM SIGMOD 93), 1993.
[2] Y. Chen, G. Dong, J. Han, B. Wah, and J. Wang, "Multi-dimensional regression analysis of time-series data streams," Proc. VLDB'02, 2002.
[3] Y. Chi, H. Wang, P.S. Yu, and R.R. Muntz, "Moment: maintaining closed frequent itemsets over a stream sliding window," Proc. ICDM'04, 2004.
[4] P. Dokas, L. Ertoz, V. Kumar, A. Lazarevic, J. Srivastava, and P.-N. Tan, "Data mining for network intrusion detection," Proc. 2002 National Science Foundation Workshop on Data Mining, 2002.
[5] G. Giannella, J. Han, J. Pei, X. Yan, and P. Yu, "Mining frequent patterns in data streams at multiple time granularities," in Next Generation Data Mining, MIT Press, 2003.
[6] J. Han, J. Pei, B. Mortazavi-asl, Q. Chen, U. Dayal, and M. Hsu, "FreeSpan: frequent pattern-projected sequential pattern mining," Proc. KDD'00, 2000.
[7] Y. Chi, H. Wang, P.S. Yu, and R.R. Muntz, "Moment: maintaining closed frequent itemsets over a stream sliding window," Proc. 4th IEEE Conf. on Data Mining, Brighton, UK, 2004, pp. 59-66.
[8] N. Jiang and L. Gruenwald, "CFI-Stream: mining closed frequent itemsets in data streams," Proc. 12th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 2006, pp. 592-597.
[9] H.F. Li, S.Y. Lee, and M.K. Shan, "An efficient algorithm for mining frequent itemsets over the entire history of data streams," Proc. First International Workshop on Knowledge Discovery in Data Streams (IWKDDS), 2004.
[10] H.F. Li, S.Y. Lee, and M.K. Shan, "Online mining (recently) maximal frequent itemsets over data streams," Proc. 15th IEEE International Workshop on Research Issues on Data Engineering (RIDE), 2005.
[11] J.H. Chang and W.S. Lee, "A sliding window method for finding recently frequent itemsets over online data streams," Journal of Information Science and Engineering, 20(4), 2004, pp. 753-762.
[12] J. Cheng, Y. Ke, and W. Ng, "Maintaining frequent itemsets over high-speed data streams," Proc. 10th Pacific-Asia Conf. on Knowledge Discovery and Data Mining, Singapore, 2006, pp. 462-467.
[13] C.K.-S. Leung and Q.I. Khan, "DSTree: a tree structure for the mining of frequent sets from data streams," Proc. 6th IEEE Conf. on Data Mining, Hong Kong, China, 2006, pp. 928-932.
[14] B. Mozafari, H. Thakkar, and C. Zaniolo, "Verifying and mining frequent patterns from large windows over data streams," Proc. 24th Conf. on Data Engineering, Mexico, 2008, pp. 179-188.
[15] K.-F. Jea and C.-W. Li, "Discovering frequent itemsets over transactional data streams through an efficient and stable approximate approach," Expert Systems with Applications, 36(10), 2009, pp. 12323-12331.
