Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. Neha Gupta, Indrjeet Rajput / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 1, January -February 2013, pp.1113-1118 Stream Data Mining: A Survey Neha Gupta*1, Indrjeet Rajput*2 *Department of Computer Engineering,Gujarat Technological University, GujaratABSTRACT A data stream is a massive, continuous and The remaining parts of this paper are organized asrapid sequence of data elements. Mining data streams follows: In Section 2, we briefly discuss theraises new problems for the data mining community necessary background needed for data streamabout how to mine continuous high-speed data items analysis. Then, Section 3 and 4 describes techniquesthat you can only have one look at. Due to this reason, and systems used for mining data streams open andtraditional data mining approach is replaced by systems addressed research issues in this growing field areof some special characteristics, such as continuous discussed in Section 5. These research issues shouldarrival in multiple, rapid, time-varying, possiblyunpredictable and unbounded. Analyzing data streams be addressed in order to realize robust systems thathelp in applications like scientific applications, business are capable of fulfilling the needs of data streamand astronomy etc. mining applications. Finally, section 6 summarizesIn this paper, we present the overview of the growing the paper and discusses interesting directions forfield of Data Streams. We cover the theoretical basis future research.needed for analyzing the streams of data. We discussthe various techniques used for mining data streams. II. ESSENTIAL METHODOLOGIESThe focus of this paper is to study the problemsinvolved in mining data streams. Finally, suggested we OF DATA STREAMSconclude with a brief discussion of the big open From the well-established statistical andproblems and some promising research directions in the computational approaches the problems of miningfuture in the area. data streams can be solved using the methodologies which uses Keywords- Data Streams, Association Mining, • Examining a subset of the whole data set orClassification Mining, Clustering Mining. transforms the data to reduce the size. • Algorithms for efficient utilization of time and I. INTRODUCTION space, detailed discussion follow in section 3. As Data mining is considered as a process of The First methodology refers to summarization of thediscovering useful patterns beneath the data, also whole data set or selection of a subset of theuses machine learning algorithms. There have been incoming stream to be analyzed. The techniques usedtechniques which used computer programs to are Sampling, load shedding, sketching, synopsis dataautomatically extract models representing patterns structures and aggregation. These techniques arefrom data and then check those models. Traditional briefly discussed here.data mining techniques cannot be applied to datastreams. Due to most required multiple scans of data 2.1 Samplingto extract information, for stream data it was Sampling is one of the oldest statistical techniques,unrealistic. Information systems have been more used for a long time which makes a probabilisticcomplex, even amount of data being processed have choice of data item to be processed. Boundaries ofincreased and dynamic also, because of common error rate of the computation are given as a functionupdates. This streaming information has the of the sampling rate. Sampling is considered in thefollowing characteristics. process of selecting incoming stream elements to be• The data arrives continuously from data streams. analyzed .Some computing approximate frequency• No assumptions on data stream ordering can be counts over data streams previously performed bymade. sampling techniques. Very Fast Machine learning•The length of the data stream is unbounded. techniques [2] have used Hoeffding bound toEfficiently and effectively capturing knowledge from measure the size of the sample. Even samplingdata streams has become very critical; which include approaches had been studied for clustering datanetwork traffic monitoring, web click-stream There is streams, classification techniques, and the slidingneed of employment of semi-automated interactive window model. [7][8].techniques for the extraction of hidden knowledge Problems related to the sampling technique whileand information in the real-time. Systems, models analyzing the data stream areand techniques have been proposed and developed • Unknown size of data setover the past few years to discuss these challenges [1,3]. 1113 | P a g e
  2. 2. Neha Gupta, Indrjeet Rajput / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 1, January -February 2013, pp.1113-1118• Sampling may not be the correct choice in This method is easy to use with a wide variety ofapplications which check for anomalies in applications as it uses the same multi-dimensionalsurveillance analysis. representation as the original data points. In• It does not address the problem of fluctuating data particular reservoir based sampling methods wererates. very useful for data streams. Relations among data rate, sampling rate and error 2.4.2 Histograms:bounds are to be generated. Another key method for data summarization is that of histograms. Approximate the data in one or more2.2 Load shedding attributes of a relation by grouping attribute valuesLoad shedding refers to the process of dropping a into “buckets” (subsets) and approximating truefraction of data streams during periods of overload. attribute values and their frequencies in the dataLoad shedding is used in querying data streams for based on a summary statistic maintained in eachoptimization. It is desirable to shed load to minimize bucket [3]. Histograms have been used widely tothe drop in accuracy. Load shedding also has two capture data distribution, to represent the data by asteps. Firstly, choose target sampling rates for each small number of step functions. These methods arequery. In the second step, place the load shedders to widely used for static data sets. However mostrealize the targets in the most efficient manner. It is traditional algorithms on static data sets requiredifficult to use load shedding with mining algorithms super-linear time and space. This is because of thebecause the stream which was dropped might use of dynamic programming techniques for optimalrepresent a pattern of interest in time series analysis. histogram construction. For most real-worldThe problems found in sampling are even present in databases, there exist histograms that produce low-this technique also. Still it had been worked on error estimates while occupying reasonably smallsliding window aggregate queries. space. Their extension to the data stream case is a challenging task.2.3 Sketching 2.4.3 Wavelets:Sketching [1, 3] is the process of randomly projecting Wavelets [11] are a well known technique which issubset of the features. It is the process of vertical often used in databases for hierarchical datasampling the incoming stream. Sketching has been decomposition and summarization. Waveletapplied in comparing different data streams and in coefficients are projections of the given set of dataaggregate queries. The major drawback of sketching values onto an orthogonal set of basis vector. Theis that of accuracy because of which it is hard to use basic idea in the wavelet technique is to createthis technique in data stream mining. Techniques decomposition of the data characteristics into thebased on sketching are very convenient for asset of wavelet functions and basic functions. Thedistributed computation over multiple streams. property of the wavelet method is that the higherPrincipal Component Analysis (PCA) would be a order coefficients of the decomposition illustrate thebetter solution if being applied in streaming broad trends in the data, whereas the more localizedapplications. trends are captured by the lower order coefficients. In particular, the dynamic maintenance of the dominant2.4 Synopsis Data Structures coefficients of the wavelet representation requiresSynopsis data structures hold summary information some novel algorithmic techniques. There has beenover data streams. It embodies the idea of small some research done in computing the top waveletspace, approximate solution to massive data set coefficients in the data stream model. The techniqueproblems. Construction of summary or synopsis data of Gilbert gave rise to an easy greedy algorithm tostructures over data streams has been of much find the best B-term Haar wavelet representation.interest recently. Complexities calculated cannot be 2.4.4 Sketches:O (N) where N can be space/time cost per element to Randomized version of wavelet techniques is calledsolve a problem. Some solution which gives results sketch methods. These methods are difficult to applycloser to O (poly (log N)) is needed. Synopsis data as it is difficult to intuitive interpretations of sketchstructures are developed which use much smaller based representations. The generalization of sketchspace of order O (logk N). These structures refer to methods to the multi-dimensional case is still an openthe process of applying summarization problem.techniques.The smaller space which contains 2.4.5 Micro Cluster based summarization:summarized information about the streams is used for A recent micro-clustering method [11] can be used begaining knowledge. perform synopsis construction of data streams. It usesThere are a variety of techniques used for Cluster Feature Vector (CFV) [8]. This micro-clusterconstruction of synopsis data structures. These summarization is applicable for the multi-methods are briefly discussed. dimensional case and works well to the evolution of 2.4.1 Sampling methods: the underlying data stream. 2.4.6 Aggregation: 1114 | P a g e
  3. 3. Neha Gupta, Indrjeet Rajput / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 1, January -February 2013, pp.1113-1118Summarizations of the incoming stream are 3.3 Algorithm Output Granularity (AOG)generated using mean and variance. If the input AOG is the first resource-aware data analysisstreams have highly fluctuating data distributions approach used with fluctuating very high data rates. Itthen this technique fails. This can be used for works well with available memory and with timemerging offline data with online data which was constraints. The stages in this process are mining datastudied in [12]. It is often considered as a data rate streams, adaptation of resources and streams andadaptation technique in a resource-aware mining. merging the generated structures when running out ofMany synopsis methods such as wavelets, memory. These algorithms are used in association,histograms, and sketches are not easy to use for the clustering and classification.multi-dimensional cases. The random samplingtechnique is often the only method of choice for high IV. APPLICATION OFdimensional applications. METHODOLOGIES FOR STREAM MININGIII. ALGORITHMS FOR EFFICIENT UTILIZATION OF TIME AND The need to understand the enormous amount of data SPACE being generated every day in a timely fashion has given rise to a new data processing model-dataThe existing algorithms used for data mining are stream processing. In this new model, data arrives inmodified from generation of efficient algorithms for the form of continuous, high volume, fast and time-data streams. New algorithms have to be invented to varying streams and the processing of such streamsaddress the computational challenges of data streams. entail a near real-time constraint. The algorithmicThe techniques which fall into this category are ideas above presented have proved powerful forapproximation algorithm, sliding window algorithm solving a variety of problems in data streams. Aand output granularity algorithm. number of algorithms for extraction of knowledgeWe examine each of these techniques in the context from data streams were proposed in the domains ofof analyzing data streams. clustering, classification and association. In this section, an overview on mining these streams of data3.1 Approximation algorithm are presented.Approximation techniques are used for algorithm 4.1 Clusteringdesign. The solutions obtained by this algorithm are Guha et al. [7] Have studied analytically clusteringapproximate and are error bound. These have data streams using the K - median technique. Theattracted researchers as a direct solution for data proposed algorithm makes a single pass over the datastreams. Data rates with the available resourcescannot be solved. To provide absolution to these stream and uses small space. It requires O (nk) time and O (n E) space where "k" is the number of centers,other tools should be used along with this algorithm.For tracking most frequent items dynamically this "n" is the number of points and E <1.They havealgorithm was used in order to adapt to the available proved that any k-median algorithm that achievesresources [16]. constant factor approximation cannot achieve a better runtime than O (nk). The algorithm starts by3.2 Sliding Window clustering a calculated size sample according to the available memory into 2k, and then at a second level,In order to analysis recent data, sliding window the algorithm clusters the above points for a numberprotocol is used which is considered as an advanced of samples into 2k and this process is repeated to atechnique for producing approximate answers to a number of levels, and finally it clusters the 2kdata stream query. Analysis over the new arrived data clusters into k clusters.is done using summarized versions of previous data. The exponential histogram (EH) data structure andThis idea has been adopted in many techniques in the another k-median algorithm that overcomes theundergoing comprehensive data stream mining problem of increasing approximation factors in thesystem MAIDS. Using sliding window protocol the Guha et al [7] algorithm. Another algorithm thatold items are removed and replaced with new data captured the attention of many scientists is the k-streams. Two types of windows called count-based means clustering algorithm.windows and time-based windows are used. Using Domingos et al. [13] have proposed a general methodcount-based windows the latest N items are stored. for scaling up machine learning algorithms. TheyUsing the other type of window we can store only have termed this approach Very Fast Machinethose items which have been generated or have Learning VFML. This method depends onarrived in the last T units of time. As it emphasizes determining an upper bound for the Leamers loss asrecent data, which in the majority of real-world a function in a number of data items to be examinedapplications is more important and relevant than in each step of the algorithm. They have applied thisolder data. method to K-means clustering VFKM and decision 1115 | P a g e
  4. 4. Neha Gupta, Indrjeet Rajput / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 1, January -February 2013, pp.1113-1118tree classification VFDT techniques. These statistical measure. This algorithm also drops thealgorithms have been implemented and evaluated non-potential attributes. Aggarwal has used the ideausing synthetic data sets as wells real web data of micro clusters introduced in CluStream in On-streams. VFKM uses the Hoeffding bound to Demand classification and it shows a high accuracy.determine the number of examples needed in each The technique uses clustering results to classify datastep of K-means algorithm. The VFKM runs as a using statistics of class distribution in each cluster.sequence of K-means executions with each run uses Peng Zhang [10] have proposed a solution formore examples than the previous one until a concept drifting where in the categorized the conceptcalculated statistical bound (Hoeffding bound) is drifting in data streams into two scenarios: (1) Loosesatisfied. Concept Drifting (LCD); and (2) Rigorous ConceptA fast and effective scheme that is capable of Drifting (RCD), and proposed solutions to handleincrementally clustering moving objects. They used each of them by using weighted-instancing andnotion of object dissimilarity, which is capable of weighted classifier ensemble framework such that thetaking the future object movement which improves overall accuracy of the classifier ensemble built onthe quality clustering and runtime performance. The them can reach the minimum. Ganti et al. [10] haveExperiments conducted show that the proposed MC developed analytically two algorithms GEMM andalgorithm is 106 times faster than K-Means and 105 FOCUS for model maintenance and change detectiontimes faster than Birch. They have used average- between two data sets in terms of the data miningradius function which automatically detects cluster results they induce.split events.Ordonez [6] has proposed an improved incremental 4.3 Associationk-means algorithm for clustering binary data streams. The approach proposed and implemented anO’Challaghan et al. [4] proposed STREAM and approximate frequency count in data streams whichLOCALSEARCH algorithms for high quality data uses all the previous historical data to calculate thestream clustering. Aggarwal et al. [11] have proposed frequency patterns incrementally. Cormode anda framework for clustering evolving data steams Muthukrishnan [16] developed an algorithm forcalled CluStream algorithm. In [12] they have counting frequent items. This is used toproposed the HPStream; a projected clustering for approximately count the most frequent items. BIDEhigh dimensional data streams, which has was proposed which efficiently mines frequent itemsoutperformed CluStream. Stanford’s STREAM without candidate maintenance. It uses a novelproject has studied the approximate k-median sequence called BIDirectional Extension and prunesclustering with guaranteed probabilistic bound. the search space more deeply than the previous algorithm. The experimental results show that BIDE4.2 Classifications algorithm uses less memory and faster when supportIn the real world concepts are often not stable but is not greater than 88 percent. An algorithm calledchange with time. Typical examples of this are DSP which efficiently mines frequent items in theweather prediction rules and customers preferences. limited memory environment. The algorithm whichThe underlying data distribution may change as well. focuses on mining process; discovers from dataOften these changes make the model built on old data streams all those frequent patterns that satisfy theinconsistent with the new data and regular updating user constraints, and handle situations where theof the model is necessary. This problem is known as available memory space is limited.concept drift. 4.4 Frequency CountingWang et al. [14] proposed a general framework formining concept drifting data streams. This was the Frequency counting has not much in attention amongfirst framework which worked on drifting. They have the researchers in this field, as did clustering andused a weighted classifier to mine streams and old classification. Counting frequent items or itemsets isdata expires based on the distribution of data. The one of the issues considered in frequency counting.proposed algorithm combines multiple classifiers Cormode and Muthukrishnan [16] have developed anweighted by their expected prediction accuracy. algorithm for counting frequent items. The algorithmLastly proposed an online classification system maintains a small space data structure that monitorswhich dynamically adjusts the size of the training the transactions on the relation, and when required,window and the number of new examples between quickly outputs all hot items, without rescanning themodel reconstructions to the current rate of concept relation in the database. Even had developed adrift. frequent item set mining algorithm over data stream.Domingos et al. [15] have developed Vary Fast They have proposed the use of tilted windows toDecision Tree (VFDT) which is a decision tree calculate the frequent patterns for the most recentconstructed on Hoeffding trees. The split point is transactions. The implemented an approximatefound by using Hoeffding bound which satisfies the frequency count in data streams. The implemented 1116 | P a g e
  5. 5. Neha Gupta, Indrjeet Rajput / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 1, January -February 2013, pp.1113-1118algorithm uses all the previous historical data to streams are irregular in their rate of arrival, exhibitingcalculate the frequent pattern incrementally. One burstiness and variation of data arrival rate over time.more AOG-based algorithm: Lightweight frequency • Variants in mining tasks those are desirable in datacounting LWF. It has the ability to find an streams and their integration.approximate solution to the most frequent items in • High accuracy in the results generated while dealingthe incoming stream using adaptation and releasing with continuous streams of datathe least frequent items regularly in order to count the • Transferring data mining results over a wirelessmore frequent ones. network with a limited bandwidth. • The needs of real world applications, even mobile4.5 Time Series Analysis devicesTime series analysis had approximate solutions for • Efficiently store the stream data with timeline andthe error bounding problems. The algorithms based efficiently retrieve them during a certain timeon sketching techniques process of starts with interval in response to user queries is anothercomputing the sketches over an arbitrarily chosen important issue.time window and creating what so called sketch pool. • Online Interactive processing is needed which helpsUsing this pool of sketches, relaxed periods and user to modify the parameters during processingaverage trends are computed and were efficient in period.running time and accuracy.Then have proposed a two phase approach to mine VI. CONCLUSIONSastronomical time series streams. The first phaseclusters sliding window patterns of each time series. The dissemination of data stream phenomenon hasUsing the created clusters, an association rule necessitated the development of stream miningdiscovery technique is used to create affinity analysis algorithms. So in this paper we discussed the severalresults among the created clusters of time series. The issues that are to be considered when designing andproposed techniques to compute some statistical implementing the data stream mining technique. Wemeasures overtime series data streams. The proposed even reviewed some of these methodologies with thetechniques use discrete Fourier transform. The use of existing algorithm like clustering, classification,symbolic representation of time series data streams. frequency counting and time series analysis haveThis representation allows dimensionality/numerosity been developed.reduction. They have demonstrated the applicability We can conclude that most of the current miningof the proposed representation by applying it to approaches use one passes mining algorithms andclustering, classification, and indexing and anomaly few of them even address the problem of drifting.detection. The approach has two main stages. The The present techniques produce approximate resultsfirst one is the transformation of time series data to due to limited memory.Piecewise Aggregate Approximation followed by Research in data streams is still in its early stage.transforming the output to discrete string symbols in Systems have been implemented using thesethe second stage. The application of what so called techniques in real applications. If the problems areregression cubes for data streams was proposed. Due addressed or solved and if more efficient and user-to the success of OLAP technology in the application friendly mining techniques are developed for the endof static stored data, it has been proposed to use users, it is likely that in the near future data streammultidimensional regression analysis to create a mining will play an important role in the businesscompact cube that could be used for answering world as the data flows continuously.aggregate queries over the incoming streams.The randomized variations of segmenting time series REFERENCESdata streams generated on-board mobile phonesensors. One of the applications of clustering time [1]. S. Muthukrishnan (2003), Data streams:series discussed: Changing the user interface of algorithms and applications, Proceedings of themobile phone screen according to the user context. It fourteenth annual ACM- SIAM symposium onhas been proven in this study that Global Iterative discrete algorithms.Replacement provides approximately an optimal [2]. P. Domingos and G. Hulten, A General Methodsolution with high efficiency in running time. for Scaling Up Machine Learning Algorithms and its Application to Clustering, Proceedings V. RESEARCH ISSUES of the Eighteenth International Conference on Machine Learning, 2001, Williamstown, MA,The study of Data stream mining has raised many Morgan Kaufmann.challenges and research issues [3]. B. Babcock, S. Babu, M Datar, R Motwani, and• Optimize the memory space, computation power 1. Widom, Models and issues in data streamwhile processing large data sets as many real data systems, in Proceedings of PODS, 2002. 1117 | P a g e
  6. 6. Neha Gupta, Indrjeet Rajput / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 1, January -February 2013, pp.1113-1118[4] L. OCallaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani. Streaming-data algorithms for high-quality clustering. Proceedings of IEEE International Conference on Data Engineering, March 2002.[5]. Aggarwal C, Han 1., Wang 1., Yu P. (2003) A Framework for Clustering Evolving Data Streams, VLDB Conference.[6] C. Ordonez. Clustering Binary Data Streams with K-means ACM DMKD 2003.[7]. S. Guha, N. Mishra, R Motwani, and L. OCallaghan. Clustering data streams. In Proceedings of the Annual Symposium on Foundations of Computer Science, IEEE, November 2000.[8]. Zhang, T., Ramakrishnan, R, Livny, M.: Birch: An efficient data clustering method for very large databases, In: SIGMOD, Montreal, Canada, ACM (1996).[9] C. Aggarwal, J. Han, J. Wang, and P. S. Yu, On Demand Classification of Data Streams, Proc. 2004 Int. Conf. on Knowledge Discovery and Data Mining, Seattle, WA, Aug. 2004.[10] V. Ganti, J. Gehrke, and R. Ramakrishnan: Mining Data Streams under Block Evolution. SIGKDD Explorations 3(2), 2002. [11]. Keirn D. A. Heczko M. (2001) Wavelets and their Applications in Databases. ICDE Conference. [12]. C Aggarwal, J. Han, J. Wang, and P. S. Yu, A Framework for Projected Clustering of High Dimensional Data Streams, Proc. 2004Int. Conf on Very Large Data Bases, Toronto, Canada, 2004. [13]. G. Hulten, L. Spencer, and P. Domingos. Mining Time-Changing Data Streams. ACM SIGKDD 2001.[14]. H. Wang, W. Fan, P. Yu and 1. Han, Mining Concept-Drifting Data Streams using Ensemble Classifiers, in the 9th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Aug. 2003, Washington DC, USA[15]. P. Domingos and G. Hulten. Mining High- Speed Data Streams. In Proceedings of the Association for Computing Machinery Sixth International Conference on Knowledge Discovery and Data Mining, 2000.[16]. G. Cormode, S. Muthukrishnan Whats hot and whats not: tracking most frequent items dynamically. PODS 2003: 296-306 1118 | P a g e