International Journal of Modeling and Optimization, Vol. 2, No. 4, August 2012

A New Adaptive Ensemble Boosting Classifier for Concept Drifting Stream Data

Kapil K. Wankhade and Snehlata S. Dongre, Members, IACSIT

Abstract—With the emergence of large-volume and high-speed streaming data, mining of stream data has become a focus of increasing interest in real applications including credit card fraud detection, target marketing, network intrusion detection, etc. The major new challenges in stream data mining are: (a) since streaming data may flow in and out indefinitely and at fast speed, it is usually expected that a stream data mining process can only scan the data once, and (b) since the characteristics of the data may evolve over time, it is desirable to incorporate the evolving features of streaming data. This paper introduces a new adaptive ensemble boosting approach for the classification of streaming data with concept drift. The method uses an adaptive sliding window and a Hoeffding tree with naïve Bayes adaptive as the base learner. The results show that the proposed algorithm works well in changing environments compared with other ensemble classifiers.

Index Terms—Concept drift, ensemble approach, Hoeffding tree, sliding window, stream data.

Manuscript received May 25, 2012; revised June 25, 2012.
K. K. Wankhade is with the Department of Information Technology, G. H. Raisoni College of Engineering, Nagpur, Maharashtra, INDIA 440019 (kaps.wankhade@gmail.com).
S. S. Dongre is with the Department of Computer Science and Engineering, G. H. Raisoni College of Engineering, Nagpur, Maharashtra, INDIA 440019 (dongre.sneha@gmail.com).

I. INTRODUCTION

The last twenty years or so have witnessed large progress in machine learning and in its capability to handle real-world applications. Nevertheless, machine learning has so far mostly centered on one-shot analysis of homogeneous, stationary data, and on centralized algorithms. Most machine learning and data mining approaches assume that examples are independent, identically distributed, and generated from a stationary distribution. A large number of learning algorithms assume that computational resources are unlimited, e.g., that the data fits in main memory. In that context, standard data mining techniques use finite training sets and generate static models.

Nowadays we are faced with a tremendous amount of distributed data generated by an ever-increasing number of smart devices. In most cases, this data is transient and may not even be stored permanently. Our ability to collect data is changing dramatically: computers and small devices send data to other computers, and we face distributed sources of detailed data that flow continuously, possibly at high speed, generated from non-stationary processes. Examples of data mining applications faced with this scenario include sensor networks, social networks, user modeling, radio frequency identification, web mining, scientific data, financial data, etc.

Most recent learning algorithms [1]-[12] maintain a decision model that continuously evolves over time, taking into account that the environment is non-stationary and computational resources are limited. The desirable properties of learning systems for efficiently mining continuous, high-volume, open-ended data streams are:

• Require small constant time per data example; use a fixed amount of main memory, irrespective of the total number of examples.
• Build a decision model using a single scan over the training data.
• Generate an anytime model independent of the order of the examples.
• Ability to deal with concept drift. For stationary data, ability to produce decision models that are nearly identical to the ones we would obtain using a batch learner.

Ensemble methods are advantageous over single-classifier methods. An ensemble combines several models whose individual predictions are merged in some manner to form a final prediction. Ensemble learning classifiers often have better accuracy, and they are easier to scale and parallelize than single-classifier methods.

The paper is organized as follows: types of concept drift are discussed in Section II. Section III explains the proposed method, with boosting, the adaptive sliding window, and the Hoeffding tree. Experiments and results are included in Section IV, and Section V concludes.

II. TYPES OF CONCEPT DRIFT

Change may come as a result of the changing environment of the problem, e.g., floating probability distributions, migrating clusters of data, loss of old and appearance of new classes and/or features, class label swaps, etc.

Fig. 1 shows four examples of simple changes that may occur in a single variable along time. The first plot (Noise) shows changes that are deemed non-significant and are perceived as noise; the classifier should not respond to minor fluctuations, and can use the noisy data to improve its robustness for the underlying stationary distribution. The second plot (Blip) represents a "rare event". Rare events can be regarded as outliers in a static distribution; examples of such events include anomalies in landfill gas emission, fraudulent card transactions, network intrusions, and rare medical conditions. Finding a rare event in streaming data can signify the onset of a concept drift, so methods for on-line detection of rare events can be a component of the novelty detection paradigm. The last two plots in Fig. 1, (Abrupt) and (Gradual), show typical examples of the two major types of concept drift represented in a single dimension.
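As an illustration (not from the paper), the abrupt and gradual cases can be produced with simple synthetic one-dimensional streams; the generator names and parameters below are hypothetical:

```python
import random

def abrupt_drift_stream(n, change_at, mean_before=0.0, mean_after=3.0):
    """Yield n noisy values whose mean jumps abruptly at position change_at."""
    for i in range(n):
        mean = mean_before if i < change_at else mean_after
        yield random.gauss(mean, 1.0)

def gradual_drift_stream(n, start, end, mean_before=0.0, mean_after=3.0):
    """Yield n noisy values whose mean shifts linearly between positions start and end."""
    for i in range(n):
        if i < start:
            w = 0.0
        elif i > end:
            w = 1.0
        else:
            w = (i - start) / (end - start)  # fraction of the way through the drift
        yield random.gauss((1 - w) * mean_before + w * mean_after, 1.0)
```

A blip, by contrast, would be a short burst of `mean_after` values followed by a return to `mean_before`, and pure noise corresponds to never changing the mean at all.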
Fig. 1. Types of concept change in streaming data.

III. PROPOSED METHOD

This adaptive ensemble method uses boosting, an adaptive sliding window, and Hoeffding trees to improve performance.

A. Boosting

Boosting is a machine learning meta-algorithm for performing supervised learning. While boosting is not algorithmically constrained, most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. When they are added, they are typically weighted in a way that is usually related to the weak learner's accuracy. After a weak learner is added, the data is reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight. Because boosting focuses on the misclassified tuples, it risks overfitting the resulting composite model to such data; therefore the resulting "boosted" model may sometimes be less accurate than a single model derived from the same data. Bagging is less susceptible to model overfitting. While both can significantly improve accuracy in comparison to a single model, boosting tends to achieve greater accuracy. The reason for the improvement is that boosting generates a hypothesis whose error on the training set is small by combining many hypotheses whose individual errors may be large. The effect of boosting has to do with variance reduction; however, unlike bagging, boosting may also reduce the bias of the learning algorithm.

B. Adaptive Sliding Window

In a data stream environment, data arrives continuously and in huge amounts, so it is impossible to store it all and process it quickly. Window techniques address this problem. Window strategies have been used in conjunction with mining algorithms in two ways. In the first, the window is external to the learning algorithm: it monitors the error rate of the current model, which under a stable distribution should keep decreasing or at most stabilize; when this rate instead grows significantly, change is declared and the base learning algorithm is invoked to revise or rebuild the model with fresh data. In the second, a window is maintained that keeps the most recent examples, and older examples are dropped from the window according to some set of rules.

The idea behind the adaptive sliding window [13] is that whenever two "large enough" sub-windows of W exhibit "distinct enough" averages, one can conclude that the corresponding expected values are different, and the older portion of the window is dropped. In other words, W is kept as long as possible while the null hypothesis "µt has remained constant in W" is sustainable up to confidence δ. "Large enough" and "distinct enough" are made precise by choosing an appropriate statistical test for distribution change, which in general involves the value of δ, the lengths of the sub-windows, and their contents. At every step, the algorithm outputs the value of µ̂W as an approximation to µW.

C. Hoeffding Tree

The Hoeffding Tree algorithm is a decision tree learning method for stream data classification. It typically runs in sublinear time and produces a decision tree nearly identical to that of traditional batch learners. Hoeffding trees exploit the idea that a small sample can be enough to choose an optimal splitting attribute, an idea supported mathematically by the Hoeffding bound.

Suppose we make N independent observations of a random variable r with range R, where r is an attribute selection measure; in the case of Hoeffding trees, r is the information gain. If we compute the mean r̄ of this sample, the Hoeffding bound states that the true mean of r is at least r̄ − ε with probability 1 − δ, where δ is user-specified and

    ε = sqrt( R² ln(1/δ) / (2N) )    (1)

D. Adaptive Ensemble Boosting Classifier

The designed algorithm uses boosting as the ensemble method, together with the sliding window and the Hoeffding tree, to improve ensemble performance. Fig. 2 shows the algorithm for data stream classification. In this algorithm there are m base models and training data D; initially each base model is assigned a weight wk equal to 1. The learning algorithm is provided with a series of training datasets {xit ∈ X : yit ∈ Y}, i = 1, ..., mt, where t is an index on the changing data; hence xit is the ith instance obtained from the tth dataset. The algorithm has two inputs: a supervised classification algorithm (the Base classifier) used to train individual classifiers, and the training data Dt drawn from the current distribution Pt(x, y) at time t.

When new data arrives at time t, it is divided into a number of sliding windows W1, ..., Wn. The sliding window is used for change detection and for time and memory management. In this algorithm the sliding window is parameter- and assumption-free in the sense that it automatically detects and adapts to the current rate of change. The window is not maintained explicitly but compressed using a variant of the exponential histogram technique [14]. The expected values Ŵ = total/width of the respective windows are calculated.

If change is detected, a change alarm is raised, and using the expected values the algorithm decides which sub-window to drop. The algorithm then generates one new classifier ht, which is combined with all previous classifiers to create the composite hypothesis Ht. The decision of Ht serves as the ensemble decision. The error of the composite hypothesis, Et, is computed on the new dataset.
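The change test of Section III-B can be sketched directly, using the Hoeffding-style deviation bound of Eq. (1) to decide when two sub-window averages are "distinct enough". This is an illustrative simplification, not the paper's code: the real detector works on the compressed exponential-histogram representation, and the exact cut threshold in [13] differs.

```python
import math

def hoeffding_eps(value_range, delta, n):
    """Eq. (1): deviation bound for the mean of n observations with the given range."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def shrink_window(window, delta=0.01, value_range=1.0):
    """Drop the oldest portion of `window` while some split into an old and a
    recent sub-window shows averages differing by more than the combined bound."""
    w = list(window)
    changed = True
    while changed and len(w) > 1:
        changed = False
        for cut in range(1, len(w)):
            old, recent = w[:cut], w[cut:]
            mean_old = sum(old) / len(old)
            mean_recent = sum(recent) / len(recent)
            # Each sub-window's mean may deviate from its true mean by its own bound.
            eps = (hoeffding_eps(value_range, delta, len(old))
                   + hoeffding_eps(value_range, delta, len(recent)))
            if abs(mean_old - mean_recent) > eps:
                w = recent        # the older portion of the window is dropped
                changed = True
                break
    return w
```

On a window whose older half is all 0s and recent half all 1s, the old portion is (mostly) dropped; on a stationary window the averages never differ significantly and nothing is dropped, matching the "shrinks if and only if there has been significant change" behavior described below.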
This error is then used to update the distribution weights. Once the distribution is updated, the algorithm calls the Base classifier and asks it to create the tth classifier ht from the current training dataset Dt, i.e., ht : X → Y. All classifiers generated so far, hk, k = 1, ..., t, are evaluated on the current dataset by computing εkt, the error of the kth classifier hk at the tth time step:

    εkt = Σ_{i=1}^{mt} Dt(i) · [hk(xi) ≠ yi],  for k = 1, 2, ..., t    (2)

At the current time step t we therefore have t error measures, one for each classifier generated. The error calculated with equation (2) is used to update the weights, which are dynamically updated using equation (3):

    wk = (1 − εkt) / εkt    (3)

Using these updated weights, the ensemble is constructed. Evaluating classifier performance is important in a concept drifting environment: if the performance of a classifier has gone down, the algorithm drops it and adds the next classifier, maintaining the ensemble size.

This classifier assigns dynamic sample weights. It keeps a window of length W using only O(log W) memory and O(log W) processing time per item, rather than the O(W) one expects from a naïve implementation. The window serves both as a change detector, since it shrinks if and only if there has been significant change in the recent examples, and as an estimator of the current average of the sequence it is reading, since, with high probability, older parts of the window with a significantly different average are automatically dropped. The proposed classifier uses a Hoeffding tree as the base learner, which makes the algorithm work faster and increases performance.

The algorithm is lightweight in the sense that it uses little memory: the windowing technique does not store the whole window explicitly but only the statistics required for further computation. The data structure maintains an approximation of the number of 1's in a sliding window of length W with logarithmic memory and update time. It is adaptive in that it can provide this approximation simultaneously for about O(log W) sub-windows whose lengths follow a geometric law, with no memory overhead with respect to keeping the count for a single window. Keeping exact counts for a fixed window size is impossible in sublinear memory; this problem is tackled by shrinking or enlarging the window strategically so that what would otherwise be an approximate count happens to be exact. More precisely, a parameter M is chosen which controls the amount of memory used, O(M log(W/M)) words.

Fig. 2. Proposed algorithm.

IV. EXPERIMENTS AND RESULTS

The proposed method has been tested on both real and synthetic datasets, and the proposed algorithm has been compared with OzaBag [15], OzaBoost [15], OCBoost [16], OzaBagADWIN [17], and OzaBagASHT [17]. The experiments were performed on a 2.59 GHz Intel Dual Core processor with 3 GB RAM, running UBUNTU 8.04. For testing and comparison, the MOA tool [18] and the Interleaved Test-Then-Train method were used.

A. Experiment using Synthetic Data

The synthetic datasets are generated with drifting concepts based on the random radial basis function (RBF), SEA, and LED generators.

Random Radial Basis Function Generator- The RBF generator works as follows. A fixed number of random centroids are generated; each centroid has a random position, a single standard deviation, a class label, and a weight. New examples are generated by selecting a centroid; a displacement is then randomly drawn from a Gaussian distribution with standard deviation determined by the chosen centroid, which also determines the class label of the example. This effectively creates a normally distributed hypersphere of examples surrounding each central point, with varying densities. Only numeric attributes are generated. Drift is introduced by moving the centroids with constant speed, initialized by a drift parameter.

SEA Generator- This artificial dataset contains abrupt concept drift and was first introduced in [5]. It is generated using three attributes, where only the first two are relevant. All three attributes have values between 0 and 10. The points of the dataset are divided into four blocks with different concepts.
In each block, the classification is done using f1 + f2 ≤ θ, where f1 and f2 represent the first two attributes and θ is a threshold value. The threshold values are 8, 9, 7, and 9.5 for the four data blocks. Class noise was then introduced by randomly changing the class value of 10% of the instances in each block.

LED Generator- The task of this dataset is to predict the digit displayed on a seven-segment LED display, where each attribute has a 10% chance of being inverted. It has an optimal Bayes classification rate of 74%. The particular configuration of the generator used for the experiments (led) produces 24 binary attributes, 17 of which are irrelevant.

For all the above datasets the ensemble size has been set to 10 and the dataset size has been set to 1 million instances. All experiments are carried out using the same parameter settings and the prequential method.

Prequential method- In the prequential method the error of a model is computed from the sequence of examples. For each example in the stream, the current model makes a prediction based only on the example's attribute values. The prequential error is computed as an accumulated sum of a loss function between the predicted and observed values.

Comparisons in terms of time (in sec) and memory (in MB) are tabulated in Tables I and II, respectively. The learning curves using the RBF, SEA, and LED datasets are depicted in Fig. 3, Fig. 4, and Fig. 5.

TABLE I: COMPARISON IN TERMS OF TIME (IN SEC) USING SYNTHETIC DATASETS
TABLE II: COMPARISON IN TERMS OF MEMORY (IN MB) USING SYNTHETIC DATASETS

Fig. 3. Learning curves using RBF dataset.
Fig. 4. Learning curves using SEA dataset.
Fig. 5. Learning curves using LED dataset.

B. Experiment using Real Datasets

In this subsection, the proposed method is tested on real datasets from the UCI machine learning repository (http://archive.ics.edu/ml/datasets.html). The proposed classifier is compared with OzaBag, OzaBagASHT, OzaBagADWIN, OzaBoost, and OCBoost. The results are measured in terms of time, accuracy, and memory, and are tabulated in Table III, Table IV, and Table V.

TABLE III: COMPARISON IN TERMS OF TIME (IN SEC) USING REAL DATASETS
TABLE IV: COMPARISON IN TERMS OF ACCURACY (IN %) USING REAL DATASETS
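As an illustration of the prequential (Interleaved Test-Then-Train) protocol described above, here is a minimal sketch using the SEA concept f1 + f2 ≤ θ as ground truth and a trivial majority-class model as a stand-in learner. All names are hypothetical, and MOA's actual implementation differs:

```python
import random

def sea_label(f1, f2, theta):
    """SEA concept: class 1 if f1 + f2 <= theta, else class 0."""
    return 1 if f1 + f2 <= theta else 0

class MajorityClass:
    """Trivial stand-in for the real learner: predicts the majority label seen so far."""
    def __init__(self):
        self.counts = {0: 0, 1: 0}
    def predict(self, x):
        return 0 if self.counts[0] >= self.counts[1] else 1
    def update(self, x, y):
        self.counts[y] += 1

def prequential_error(model, stream):
    """Test-then-train: predict on each example first, then learn from it."""
    errors = 0
    for n, (x, y) in enumerate(stream, start=1):
        if model.predict(x) != y:
            errors += 1
        model.update(x, y)       # the example is used for training only after testing
    return errors / n

random.seed(0)
stream = []
for _ in range(1000):
    f1, f2, f3 = (random.uniform(0, 10) for _ in range(3))  # f3 is irrelevant
    stream.append(((f1, f2, f3), sea_label(f1, f2, 8.0)))
err = prequential_error(MajorityClass(), stream)
```

Because every example is tested before it is trained on, the accumulated error reflects performance on unseen data throughout the stream, which is what the learning curves in Figs. 3-5 plot.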
TABLE V: COMPARISON IN TERMS OF MEMORY (IN MB) USING REAL DATASETS

V. CONCLUSION

In this paper we have investigated the major issues in classifying large-volume, high-speed, and dynamically changing streaming data, and proposed a novel approach, the adaptive ensemble boosting classifier, which uses an adaptive sliding window and a Hoeffding tree to improve ensemble performance.

Compared with the other classifiers, the adaptive ensemble boosting method achieves distinct features: it is dynamically adaptive, uses less memory, and processes data fast. Thus, the adaptive ensemble boosting classifier represents a new methodology for effective classification of dynamic, fast-growing, and large-volume data streams.

REFERENCES

[1] G. Widmer and M. Kubat, "Learning in the presence of concept drift and hidden contexts," Journal on Machine Learning, vol. 23, no. 1, pp. 69-101, 1996.
[2] Y. Freund and R. Schapire, "Experiments with a new boosting algorithm," in Proc. International Conference on Machine Learning, 1996, pp. 148-156.
[3] P. Domingos and G. Hulten, "Mining high-speed data streams," in Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, New York, NY, USA, ACM Press, 2000, pp. 71-80.
[4] G. Hulten, L. Spencer, and P. Domingos, "Mining time-changing data streams," in Proc. ACM SIGKDD, San Francisco, CA, USA, 2001, pp. 97-106.
[5] W. N. Street and Y. Kim, "A streaming ensemble algorithm (SEA) for large-scale classification," in KDD'01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, New York, NY, USA, ACM Press, 2001, pp. 377-382.
[6] H. Wang, W. Fan, P. S. Yu, and J. Han, "Mining concept drifting data streams using ensemble classifiers," in KDD'03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, New York, NY, USA, ACM Press, 2003, pp. 226-235.
[7] J. Z. Kolter and M. A. Maloof, "Dynamic weighted majority: a new ensemble method for tracking concept drift," in 3rd International Conference on Data Mining, ICDM'03, IEEE CS Press, 2003, pp. 123-130.
[8] B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan, "Maintaining variance and k-medians over data stream windows," in Proceedings of the 22nd Symposium on Principles of Database Systems, 2003, pp. 234-243.
[9] J. Gama, R. Rocha, and P. Medas, "Accurate decision trees for mining high-speed data streams," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, pp. 523-528.
[10] J. Gama, P. Medas, G. Castillo, and P. Rodrigues, "Learning with drift detection," in Advances in Artificial Intelligence - SBIA 2004, 17th Brazilian Symposium on Artificial Intelligence, volume 3171 of Lecture Notes in Computer Science, Springer Verlag, 2004, pp. 286-295.
[11] J. Z. Kolter and M. A. Maloof, "Using additive expert ensembles to cope with concept drift," in Proc. International Conference on Machine Learning (ICML), Bonn, Germany, 2005, pp. 449-456.
[12] G. Cormode, S. Muthukrishnan, and W. Zhuang, "Conquering the divide: Continuous clustering of distributed data streams," in ICDE, 2007, pp. 1036-1045.
[13] A. Bifet and R. Gavaldà, "Learning from time-changing data with adaptive windowing," in SIAM International Conference on Data Mining, 2007, pp. 443-449.
[14] M. Datar, A. Gionis, P. Indyk, and R. Motwani, "Maintaining stream statistics over sliding windows," SIAM Journal on Computing, vol. 14, no. 1, pp. 27-45, 2002.
[15] N. Oza and S. Russell, "Online bagging and boosting," in Artificial Intelligence and Statistics, Morgan Kaufmann, 2001, pp. 105-112.
[16] R. Pelossof, M. Jones, I. Vovsha, and C. Rudin, "Online coordinate boosting," http://arxiv.org/abs/0810.4553, 2008.
[17] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavaldà, "New ensemble methods for evolving data streams," in KDD'09, ACM, Paris, 2009, pp. 139-148.
[18] G. Holmes, R. Kirkby, and B. Pfahringer, "MOA: Massive Online Analysis," [Online]. Available: http://sourceforge.net/projects/moa-datastream, 2007.

K. K. Wankhade received the B. E. degree in Information Technology from Swami Ramanand Teerth Marathwada University, Nanded, India, in 2007 and the M. Tech. degree in Computer Engineering from the University of Pune, Pune, India, in 2010.
He is currently working as an Assistant Professor in the Department of Information Technology at G. H. Raisoni College of Engineering, Nagpur, India. He has a number of publications in reputed journals and international conferences such as Springer and IEEE. His research is on data stream mining, machine learning, decision support systems, artificial neural networks, and embedded systems. His book, Data Streams Mining: Classification and Application, was published by LAP Publication House, Germany, 2010.
Mr. Kapil K. Wankhade is a member of the IACSIT and IEEE organizations. He is currently working as a reviewer for Springer's Evolving Systems journal and the IJCEE journal.

S. S. Dongre received the B. E. degree in Computer Science and Engineering from Pt. Ravishankar Shukla University, Raipur, India, in 2007 and the M. Tech. degree in Computer Engineering from the University of Pune, Pune, India, in 2010.
She is currently working as an Assistant Professor in the Department of Computer Science and Engineering at G. H. Raisoni College of Engineering, Nagpur, India. She has a number of publications in reputed international conferences such as IEEE and in journals. Her research is on data stream mining, machine learning, decision support systems, ANN, and embedded systems. Her book, Data Streams Mining: Classification and Application, was published by LAP Publication House, Germany, 2010.
Ms. Snehlata S. Dongre is a member of the IACSIT, IEEE, and ISTE organizations.