Ms. Madhu S. Shukla, Mr. Kirit R. Rathod / International Journal of Engineering Research               and Applications (I...
Ms. Madhu S. Shukla, Mr. Kirit R. Rathod / International Journal of Engineering Research                   and Application...
Ms. Madhu S. Shukla, Mr. Kirit R. Rathod / International Journal of Engineering Research                and Applications (...
Ms. Madhu S. Shukla, Mr. Kirit R. Rathod / International Journal of Engineering Research                and Applications (...
Ms. Madhu S. Shukla, Mr. Kirit R. Rathod / International Journal of Engineering Research                and Applications (...
Ms. Madhu S. Shukla, Mr. Kirit R. Rathod / International Journal of Engineering Research             and Applications (IJE...
Upcoming SlideShare
Loading in …5



Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. Ms. Madhu S. Shukla, Mr. Kirit R. Rathod / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 3, Issue 1, January -February 2013, pp.163-168 Stream Data Mining and Comparative Study of Classification Algorithms Ms. Madhu S. Shukla*, Mr. Kirit R. Rathod** *(PG-CE Student, Department of Computer Engineering), (C.U.Shah College of Engineering and Technology, Gujarat, India) ** (Assistant Professor, Department of Computer Engineering), (C.U.Shah College of Engineering and Technology, Gujarat, India)ABSTRACT Stream Data Mining is a new emerging  Random access is expensive—single lineartopic in the field of research. Today, there are scan algorithm (can only have one look)number of application that generate Massive  Store only the summary of the data seen soamount of stream data. Examples of such kind of are Sensor networks, Real time  Most stream data are at pretty low-level orsurveillance systems, telecommunication systems. multi-dimensional in nature, needs multi-Hence there is requirement of intelligent level and multi-dimensional processing.processing of such type of data that would help inproper analysis and use of this data in other task As said earlier it is a huge amount of data,even. Mining stream data is concerned with so in order to perform certain analysis, we need toextracting knowledge structures represented in take some sample of that data so that processing ofmodels and patterns in non stopping streams of stream data could be done with ease. These samplesinformation [1]. Such massive data are handled taken should be such that whatever data comes inwith software such as MOA (Massive Online the portion of sample is worth analyzing orAnalysis) or other open sources like Data Miner. processing, which means maximum knowledge isIn this paper we present some theoretical aspects extracted from that sampled data.of stream data mining and certain experimental In this paper some sampling techniquesresults obtained on that basis with the use of are briefed in section II. Some classificationMOA. algorithms are discussed in section III. And their implementation results in section IV. ConclusionsKeywords— Stream, Stream Data Mining, are discussed in section V and references coversIntelligent-processing, MOA (Massive Online section VI.Analysis), Continuous Data.I. INTRODUCTION Recently new classes of applications areup-coming rapidly, that basically deal withgenerating stream data as output. Stream data iscontinuous, ordered, changing, fast, huge amount ofdata. Such data are so huge and continuouslychanging that even one look at entire data becomesdifficult. Such systems are, any application thatdeals with Telecommunication calling recordsBusiness: credit card transactions, Networkmonitoring and traffic engineering, Sensor,monitoring and surveillance, Security monitoring,Web logs and page click streams. Methods withwhich we are trying to find knowledge or Specificpatterns are called Stream Data Mining. Certaincharacteristics of stream data are as follows:  Huge volumes of continuous data, possibly infinite. (Fig 1.1: Processing of Stream data)  Fast changing and requires fast, real-time response. II. FOUNDATIONS AND SAMPLING  Data stream captures nicely our data TECHNIQUES. processing needs of today. Foundation of stream data mining is based on statistics, complexity and computational theory 163 | P a g e
  2. 2. Ms. Madhu S. Shukla, Mr. Kirit R. Rathod / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 3, Issue 1, January -February 2013, pp.163-168[2] .Sampling refers to the process of probabilistic Load shedding refers [6] to the process of dropping achoice of a data item to be processed or not. sequence of data streams. Load shedding has beenSampling is an old statistical technique that has been used successfully in querying data streams. It hasused for a long time. Boundaries of the error rate of the same problems of sampling. Load shedding isthe computation are given as a function of the difficult to be used with mining algorithms becausesampling rate. Very Fast Machine Learning it drops chunks of data streams that could be used intechniques [4] have used Hoeffding bound to the structuring of the generated models or it mightmeasure the sample size according to some derived represent a pattern of interest in time series analysis.loss functions [3]. 4. Synopsis Data Structures: Small space, approximate solution to massive data set..Some of the methods for stream data mining are: Summarizing techniques are used like Wavelet, 1. Sampling: Idea of representing large data set Histogram, and Aggregation. by a small random sample of the data Creating synopsis of data refers to the elements. process of applying summarization techniques that 2. Sketching: Building a summary of data are capable of summarizing the incoming stream for stream using a small amount of memory. further analysis. Wavelet analysis [25], histograms, 3. Load Shedding: Process of eliminating a quintiles‘ and frequency moments [5] have been batch of subsequent elements from being proposed as synopsis data structures. Since synopsis analyzed. of data does not represent all the characteristics of 4. Synopsis Data Structures: Small space, the dataset, approximate answers are produced when approximate solution to massive data set. using such data structures. Summarizing techniques are used like Wavelet, Histogram, and Aggregation. Aggregation is the process of computing 5. Sliding Window: Advance technique. statistical measures such as means and variance that Detailed analysis over most recent data summarize the incoming stream. This aggregated items and over summarized versions of data could be taken by the mining algorithm. The older ones. problem with aggregation is that it does not perform well with highly fluctuating data distributions.1. Sampling: Idea of representing large data set by asmall random sample of the data elements. 5. Sliding Window: It is an advance technique. It deals with detailed analysis over most recent data In this technique rather than dealing with items and over summarized versions of older ones.entire data, sample of stream are taken at regular The inspiration behind sliding window is that theinterval. To obtain an unbiased sampling of data we user is more concerned with the analysis of mostneed to know the length of stream in advance, but recent data streams. Thus the detailed analysis isincase if this is not possible then a little modified done over the most recent data items andtechnique is used. It is called RESERVOIR summarized versions of the old ones. This idea hasSampling, in this technique unbiased random sample been adopted in many techniques in the undergoingof ‗s‘ element are taken without replacement. Basic comprehensive data stream mining system.fundamental behind this approach is, a sample ofsize at least ‗S‘, called reservoir is kept from which III. MINING TASKa random sample of size s can be generated. Basic Stream mining task includes task likedrawback of this approach is, when the reservoir is Classification, Clustering and Mining Time-Serieslarge it can be very costly to generate this sample‗s‘. Data.2. Sketching: Building a summary of data stream In this paper, we will discuss some of the algorithmsusing a small amount of memory. that are used for classification of stream data and Sketching [3] is the process of randomly their comparison based on their experimentalprojecting a subset of the features. It is the process results.of vertically sampling the incoming stream. Classification generally is a two stepSketching has been applied in comparing different process consisting of learning or Modeldata streams and in aggregate queries. The major Construction (where a model is constructed baseddrawback of sketching is that of accuracy. It is hard on class labeled tuples from training set) andto use it in the context of data stream mining. classification or Model Usage (where the model isPrincipal Component Analysis (PCA) would be a used to predict the class labels of tuples from newbetter solution that has been applied in streaming data sets).applications [5]. Algorithms involving with classification of stream data mining are Naive Bayesian, Hoeffding3. Load Shedding: Process of eliminating a batch of tree, VFDT (Very Fast Decision Tree) and CVFDTsubsequent elements from being analyzed. (Concept Adapting Very Fast Decision Tree). We 164 | P a g e
  3. 3. Ms. Madhu S. Shukla, Mr. Kirit R. Rathod / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 3, Issue 1, January -February 2013, pp.163-168will discuss working of these algorithms in detail 3.2 Decision Tree Approach: Hoeffding Treehere. Classifier In Hoeffding algorithm, classification3.1Naive Bayesian: problem must be defined. Classification problem is In simple terms, a naive Bayes classifier a set of training examples of the form (a, b), whereassumes that the presence (or absence) of a ‗a‘ is a vector of d attributes and ‗b‘ is a discreteparticular feature of a class is unrelated to the class label. Our goal is to produce a model b= f (a)presence (or absence) of any other feature, given the such that it provides and predicts the classes y forclass variable. For example, a fruit may be future examples x with high accuracy. Decision treeconsidered to be an apple if it is red, round, and learning is considered one of the most effectiveabout 4" in diameter. Even if these features depend classification methods. By recursively replacing leafon each other or upon the existence of the other node with test nodes, starting at the root we canfeatures, a Naive Bayes classifier considers all of learn a Decision trees. In decision tree each nodethese properties to independently contribute to the has a test on the attributes and each branch givesprobability that this fruit is an apple. possible outcome of the test and each leaf contain aThis is very common approach used for class prediction. Before processing starts, data isclassification. Hence forth the name is used as Naive first stored into main memory. After startingBayesian. learning process for complex trees it is expensive toAlgorithm: repeatedly read data from secondary memory. SoStep 1: Assumption our aim is to design decision tree learners than readD - Training set of tuples and class labels each example at most once, and use a small amountX=(x1,x2,…,xn) , Attributes – A1,A2,…,An , time to process it. First key role is to find the bestClasses – C1,C2,…,Cm attribute at a node and for that consider only someStep 2: Maximum Posteriori Hypothesis training examples that pass through that nodes.P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i Second choose the root and then expensiveP(Ci|X)= P(X|Ci)P(Ci)/P(X) (By Bayes‘ Theorem) examples are checked down to the correspondingStep 3: Prior Probability of X leaves and used to choose the attribute there, and soStep 4: Class Conditional Independence (Naive‘s on. How many examples are required at each nodeAssumption) is decided by Hoeffding bound after continuous use. P(X|Ci) = ∏ P(xk|Ci) (k=1 to n) = P(x1|Ci) * Taken a random variable a and its range is R. WeP(x2|Ci) * … * P(xn|Ci) have n observation of a. Now find mean of a (), soStep 5: Classifier PredictionFor tuple X, the class Hoeffding tree bound states that with probability 1-is Ci if and only if δ, the true mean of a is at least - ε. where HoeffdingP(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i . bound ε [7] .Example: Naive Bayesian ClassifierROLLNO BRANCH BUYS_COMPUTERA1 CE YESB1 CE YESA2 CE YES Algorithm: Step 1: Calculate the information gain for theA3 CE YES attributes and determine the best two attributes.B2 IT NO Pre-pruning: consider a ―null‖ attribute that consistsB3 IT NO of not splitting the node. Step 2: At each node, check for the condition.(Fig 1.2: Naive Bayesian Example) Step 3: If condition is satisfied, create child nodes based on the test at the node.Performance Analysis Step 4: If not, stream-in more examples andAdvantages perform calculations till condition is satisfied.  Minimum error rate  High accuracy and speed when applied to large databases.  IncrementalLimitations  Can‘t handle concept drift 165 | P a g e
  4. 4. Ms. Madhu S. Shukla, Mr. Kirit R. Rathod / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 3, Issue 1, January -February 2013, pp.163-168 best split candidate all others is less than G (.) and G (.) < τ. where τ is a user-defined threshold. G computation: Inefficient to calculate G for every new data set, because at specific point it is hard to split best attributes. Users specify minimum amount of new data examples, ‗Nmin‘ that must be calculated at a leaf before G is computed. This mechanism incrementally takes less amount of global time which was spent on G computations. VFDT making learning work as fast as classify dat- sets . Performance Analysis Advantages  VFDT is better than Hoeffding tree in terms of time, memory and accuracy.  So VFDT takes an advantage after 100k to greatly improve accuracy.(Fig:1.3 Hoeffding Tree Example) LimitationsPerformance Analysis  Concept of drift is not handled in VFDT.Advantages  High accuracy with small sample 3.4CVFDT  Multiple scans of the same data never Algorithm: performed Step 1.Alternate trees for each node in HT started as  Incremental empty.  Can classify the data while growing Step 2. Process Examples from the streamLimitations indefinitely.  Time consuming for splitting attribute Step 3. For Each Example (x, y), selection Step 4. Pass (x, y) down to a set of leaves using HT  Can‘t handle concept drift and all alternate trees of the nodes (x, y) pass Through.3.3 VFDT Algorithm Step 5. Add(x, y) to the sliding window ofAlgorithm: examples.Input: δ desired probability level Step 6.Remove and forget the effect of the oldest Output: τ a Decision Tree. Examples, if the sliding window over flows. In it: τ ← Empty Leaf (Root) Step 7. CVFDT Grows.Step 1.While (TRUE) Step 8. Check Split Validity if f examples seenStep 2. Read next example. since last checking of alternate trees.Step 3. Propagate Example through the tree from Step 9. Return HT.the root till a leaf.Step 4. Update sufficient statistics at leaf . Explanation:Step 5. If leaf (number of examples) > Nmin. • CVFDT is an extended version of VFDTStep 6. Evaluate the merit of each attribute. which provides same speed and accuracyStep 7. Let A1 the best attribute and A2 the second advantages but if any changes occur inbest. example generating process, it provides theStep 8. Let ε be the Hoeffding Bound. ability to detect and respond.Step 9. If G (A1)-G (A2) > ε. • CVFDT uses sliding window of variousStep 10. Install a splitting test based on A1. dataset to keep its model consistent.Step 11. Expand the tree with two descendant leaves • Most of systems need to learn a new model from scratch after arrival of newExplanation: data. Instead, CVFDT continuously Sometimes more than one attributes have monitors the quality of new data andsimilar attribute values. In that case choosing best adjusts those that are no longer correct.attribute is quite critical. We have to decide Whenever new data arrives, CVFDTappropriate value between them with high incrementing counts for new data andconfidence.VFDT can decide and solve this decrements counts for oldest data in theproblem. When there is effectively a tie and split on window.the current best attribute if difference between the • However If the concept is changing, some splits examples that will no longer appear 166 | P a g e
  5. 5. Ms. Madhu S. Shukla, Mr. Kirit R. Rathod / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 3, Issue 1, January -February 2013, pp.163-168 best because new data provides more gain 4.2 Online Comparison With GNUPlot: than previous one. Whenever this phenomenon occurs, CVFDT creates alternative sub-tree to find best attribute at root. Each time new best tree replaces old sub tree and it is more accurate on new data.IV.COMPARISION: So the comparison shows that CVDFT isbest algorithm for classification and also it handlesconcept drifts. This result is generated by usingMOA software. Here, the criterion for evaluation isaccuracy of all the algorithms compared:4.1 Comparison of the CSV files generated.Learning Naive Hoefddingevaluation CVFDT Bayesian Treeinstances (Fig 1.5: Comparision Chart)10000 72.91 80.4 VI.CONCLUSION 80.56 In this paper, we discussed about20000 73.17 83.74 theoretical aspects and practical results of stream 84.06 data mining classification algorithms. In these30000 73.27 85.57 classification algorithms, Hoeffding trees spend 85.86 small amount of time for learning. Hoeffding tree40000 73.38 86.4 does not show any similarity with batch trees. In 86.83 real world scenario we have a limited amount of50000 73.35 87.07 hardware resources, despite this it analyzes and 87.49 generates results with high accuracy. In data mining60000 73.36 87.60 Systems VFDT is based on Hoeffding trees. Time- 88.11 variant data is used to extend VFDT to develop the70000 73.38 88.02 CVFDT system. Flooding is used to keep trees up- 88.61 to-date with time variant data streams[7]. Results also show that performance analysis of CVFDT is better80000 73.363 88.40 89.06 than VFDT and Hoeffding tree in terms of accuracy. Also CVFDT handles Concept Drift which is not90000 73.39 88.70 handled by other compared algorithms. Concept 89.43 drift and Memory utilization are the crucial100000 73.38 88.99 challenges in Stream data mining field. 89.73(Fig 1.4: Comparison of CSV Files) REFERENCES [1] Elena ikonomovska,Suzana Loskovska,Dejan Gjorgjevik, ―A Survey Of Stream Data Mining‖ Eight National Conference with International Participation-ETAI2007 [2] S.Muthukrishnan, ―Data streams: Algorithms and Applications‖.Proceeding of the fourteenth annual ACM-SIAM symposium on discrete algorithms,2003 [3] Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy. ]―Mining Data Streams: A Review‖,Centre for Distributed Systems and Software Engineering, Monash University900 167 | P a g e
  6. 6. Ms. Madhu S. Shukla, Mr. Kirit R. Rathod / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 3, Issue 1, January -February 2013, pp.163-168 Dandenong Rd, Caulfield East, VIC3145, Vehicle Monitoring‖, Proceedings of Australia SIAM International Conference on Data[4] P. Domingos and G. Hulten, ―A General Mining, 2004. Method for Scaling Up Machine Learning [6] B. Babcock, M. Datar, and R. Motwani. Algorithms and its Application to ―Load Shedding Techniques for Data Clustering‖, Proceedings of the Eighteenth Stream Systems‖ (short paper) In Proc. of International Conference on Machine the 2003 Workshop on Management and Learning, 2001, Williamstown, MA, Processing of Data Streams, June 2003 Morgan Kaufmann [7] Tusharkumar Trambadiya, Praveen[5] H. Kargupta, R. Bhargava, K. Liu, M. Bhanodia, ―A Comparative study of Stream Powers, P.Blair, S. Bushra, J. Dull, K. Data mining Algorithms‖ International Sarkar, M. Klein, M. Vasa, and D. Handy, Journal of Engineering and Innovative VEDAS: ―A Mobile and Distributed Data Technology (IJEIT) Volume 2, Issue 3, Stream Mining System for Real-Time September 2012. 168 | P a g e