1) The document discusses mining data streams using an improved version of McDiarmid's bound, aiming to tighten the bounds used in McDiarmid-based decision-tree algorithms and to improve processing efficiency.
2) Traditional data mining techniques cannot be applied directly to data streams because the data arrive continuously and rapidly. The document proposes Gaussian approximations of McDiarmid's bounds to reduce the number of training samples needed to select a split criterion.
3) It describes Hoeffding's inequality, which is commonly used but not sufficient for data streams, and argues that McDiarmid's inequality, applied appropriately, yields a more efficient technique for high-speed, time-changing data streams.
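To make the sample-size argument concrete, here is a minimal Python sketch (with hypothetical gain values) of how a Hoeffding-style bound turns a desired confidence into a split decision in a streaming decision tree; a McDiarmid-type or Gaussian-approximated bound would plug into the same test with a different epsilon.

```python
import math

def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound: with probability >= 1 - delta, the observed mean of n
    i.i.d. samples of a statistic with range `value_range` is within epsilon
    of its true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# Split decision as in Hoeffding trees: accept the best attribute once the
# observed margin between the top two candidate criteria exceeds epsilon.
best_gain, second_gain = 0.42, 0.31   # hypothetical split-criterion values
n, delta = 1000, 1e-6                 # samples seen, allowed failure rate
eps = hoeffding_epsilon(value_range=1.0, delta=delta, n=n)
if best_gain - second_gain > eps:
    print(f"split is statistically safe (margin {best_gain - second_gain:.3f} > eps {eps:.3f})")
```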
A fuzzy clustering algorithm for high dimensional streaming data (Alexander Decker)
This document summarizes a research paper that proposes a new dimension-reduced weighted fuzzy clustering algorithm (sWFCM-HD) for high-dimensional streaming data. The algorithm can cluster datasets that have both high dimensionality and a streaming (continuously arriving) nature. It combines previous work on clustering algorithms for streaming data and high-dimensional data. The paper introduces the algorithm and compares it experimentally to show improvements in memory usage and runtime over other approaches for these types of datasets.
This document summarizes a research paper that proposes a new density-based clustering technique called Triangle-Density Based Clustering Technique (TDCT) to efficiently cluster large spatial datasets. TDCT uses a polygon approach where the number of data points inside each triangle of a polygon is calculated to determine triangle densities. Triangle densities are used to identify clusters based on a density confidence threshold. The technique aims to identify clusters of arbitrary shapes and densities while minimizing computational costs. Experimental results demonstrate the technique's superiority in terms of cluster quality and complexity compared to other density-based clustering algorithms.
A Novel Approach to Mathematical Concepts in Data Mining (ijdmtaiir)
This paper describes three fundamental mathematical programming approaches relevant to data mining: feature selection, clustering, and robust representation. It covers two clustering algorithms, k-means and k-median. Clustering is illustrated through the unsupervised learning of patterns and clusters that may exist in a given database, and it is a useful tool for Knowledge Discovery in Databases (KDD). The results of the k-median algorithm are used to identify blood cancer patients in a medical database. K-means clustering is a data mining/machine learning algorithm used to group observations into clusters of related observations without any prior knowledge of those relationships. The k-means algorithm is one of the simplest clustering techniques and is commonly used in medical imaging, biometrics, and related fields.
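As a reference point for the two algorithms discussed above, here is a minimal NumPy sketch of Lloyd's k-means iteration; it is a generic textbook version, not the paper's implementation. Swapping the mean for a component-wise median in the update step gives a simple k-median variant.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal Lloyd's k-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
```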
This document summarizes a research paper on developing an improved LEACH (Low-Energy Adaptive Clustering Hierarchy) communication protocol for energy efficient data mining in multi-feature sensor networks. It begins with background on wireless sensor networks and issues like energy efficiency. It then discusses the existing LEACH protocol and its drawbacks. The proposed improved LEACH protocol includes cluster heads, sub-cluster heads, and cluster nodes to address LEACH's limitations. This new version aims to minimize energy consumption during cluster formation and data aggregation in multi-feature sensor networks.
The document proposes a Modified Pure Radix Sort algorithm for large heterogeneous datasets. The algorithm divides the data into numeric and string processes that work simultaneously. The numeric process further divides data into sublists by element length and sorts them simultaneously using an even/odd logic across digits. The string process identifies common patterns to convert strings to numbers that are then sorted. This optimizes problems with traditional radix sort through a distributed computing approach.
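For context, the classic least-significant-digit radix sort that the proposal builds on can be sketched as follows; this is the plain sequential baseline, not the paper's distributed numeric/string variant.

```python
def radix_sort(nums):
    """LSD radix sort for non-negative integers: one stable counting-sort
    pass per decimal digit, least significant digit first."""
    if not nums:
        return nums
    exp = 1
    while max(nums) // exp > 0:
        buckets = [[] for _ in range(10)]
        for n in nums:
            buckets[(n // exp) % 10].append(n)   # stable per-digit bucketing
        nums = [n for bucket in buckets for n in bucket]
        exp *= 10
    return nums

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```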
Textual Data Partitioning with Relationship and Discriminative Analysis (Editor IJMTER)
Data partitioning methods group data values by similarity, and similarity measures are used to estimate transaction relationships. Hierarchical clustering models produce tree-structured results, while partitional clustering produces results in a grid format. Text documents are unstructured data with high-dimensional attributes. Document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods require the cluster count (K) as input for the document grouping process, and clustering accuracy degrades drastically when an unsuitable cluster count is chosen.
Textual data elements are divided into two types: discriminative words and non-discriminative words. Only discriminative words are useful for grouping documents; the involvement of non-discriminative words confuses the clustering process and leads to poor clustering solutions. A variational inference algorithm is used to infer the document collection structure and the partition of document words at the same time. A Dirichlet Process Mixture (DPM) model is used to partition documents; it exploits both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to discover the latent cluster structure based on the DPM model, and DPMFP clustering is performed without requiring the number of clusters as input.
Document labels are used to guide the discriminative word identification process. Concept relationships are analyzed with ontology support, and a semantic weight model is used for document similarity analysis. The system improves scalability by using labels and concept relations for dimensionality reduction.
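The core DPM idea, inferring the number of clusters rather than fixing K, can be illustrated with scikit-learn's BayesianGaussianMixture using a Dirichlet-process prior; this is a generic stand-in on toy feature vectors, not the paper's DPMFP model.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Toy stand-in for document vectors (e.g., reduced TF-IDF features).
X = np.vstack([np.random.randn(60, 5) + m for m in (0, 4, 8)])

# Dirichlet-process prior: n_components is only an upper bound; the
# posterior drives the weights of unused components toward zero.
dpm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)
labels = dpm.predict(X)
print("non-negligible components:", (dpm.weights_ > 0.01).sum())
```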
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL... (ijnlc)
The tremendous increase in the amount of available research documents impels researchers to propose topic models to extract the latent semantic themes of a document collection. However, how to extract the hidden topics of the document collection has become a crucial task for many topic model applications. Moreover, conventional topic modeling approaches suffer from a scalability problem as the size of the document collection increases. In this paper, the Correlated Topic Model with a variational Expectation-Maximization algorithm is implemented in the MapReduce framework to solve the scalability problem. The proposed approach uses a dataset crawled from a public digital library, and the full texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. Experiments are conducted to demonstrate the performance of the proposed algorithm. The evaluation shows that the proposed approach achieves performance comparable, in terms of topic coherence, to LDA implemented in the MapReduce framework.
This document introduces an R package called PSF that implements a Pattern Sequence based Forecasting (PSF) algorithm for univariate time series forecasting. The PSF algorithm clusters time series data and then predicts future values based on identifying repeating patterns of clusters. The PSF package contains functions that perform the main steps of the PSF algorithm, including selecting the optimal number of clusters, selecting the optimal window size, and making predictions for a given window size and number of clusters. The package aims to promote and simplify the use of the PSF algorithm for time series forecasting.
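A rough sketch of the PSF idea in Python (the package itself is in R): discretize the series by clustering its values into label symbols, match the most recent window of labels against history, and average the values that followed each match. The function and parameter names here are illustrative, not the package's API.

```python
import numpy as np
from sklearn.cluster import KMeans

def psf_forecast(series, k=3, window=4):
    """Sketch of Pattern Sequence based Forecasting: cluster values into k
    label symbols, find past occurrences of the most recent label window,
    and average the values that immediately followed them."""
    X = np.asarray(series, dtype=float).reshape(-1, 1)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    pattern = labels[-window:]
    followers = [series[i + window]
                 for i in range(len(labels) - window)
                 if np.array_equal(labels[i:i + window], pattern)]
    return float(np.mean(followers)) if followers else float(np.mean(series))

data = [1, 2, 9, 1, 2, 9, 1, 2, 9, 1, 2]
print(psf_forecast(data, k=3, window=2))  # pattern [1, 2] has always preceded 9
```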
This document summarizes a research paper that proposes a new resource scheduling algorithm called STRS for cloud computing environments. STRS aims to optimally allocate data resources across computational clusters in a distributed system to minimize data access costs. It does this through two distributed algorithms: STRSA runs at each parent node to determine optimal data allocation to child nodes, and STRSD runs at each child node to determine optimal data de-allocation. The paper also proposes an intra-cluster replication algorithm called ORPNDA that uses heuristic expansion-shrinking methods to determine optimal partial data replication within each cluster. Experimental results show STRS and ORPNDA significantly outperform general frequency-based replication schemes.
Experimental study of Data clustering using k-Means and modified algorithms (IJDKP)
The k-means clustering algorithm is an old algorithm that has been intensely researched owing to its ease and simplicity of implementation. Clustering algorithms have broad appeal and usefulness in exploratory data analysis. This paper presents the results of an experimental study of different approaches to k-means clustering, comparing results on different datasets using the original k-means and other modified algorithms implemented in MATLAB R2009b. The results are evaluated on performance measures such as the number of iterations, the number of points misclassified, accuracy, the Silhouette validity index, and execution time.
Clustering, also known as data segmentation, aims to partition a data set into groups (clusters) according to similarity. Cluster analysis has been studied extensively, and there are many algorithms for different types of clustering, but these classical algorithms cannot be applied to big data because of its distinct features; applying traditional techniques to large unstructured data is a challenge. This study proposes a hybrid model to cluster big data using the traditional K-means clustering algorithm. The proposed model consists of three phases: a Mapper phase, a Clustering phase, and a Reduce phase. The first phase uses a map-reduce algorithm to split the big data into small datasets; the second phase runs the traditional K-means algorithm on each of the split small datasets; and the last phase is responsible for producing the general cluster output for the complete data set. Two functions, Mode and Fuzzy Gaussian, were implemented and compared in the last phase to determine the most suitable one. The experimental study used four benchmark big data sets: Covtype, Covtype-2, Poker, and Poker-2. The results proved the efficiency of the proposed model in clustering big data with the traditional K-means algorithm, and the experiments show that the Fuzzy Gaussian function produces more accurate results than the traditional Mode function.
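A compact sketch of the three-phase idea, assuming scikit-learn's KMeans and using centroid re-clustering as a simplified stand-in for the paper's Mode and Fuzzy Gaussian merge functions:

```python
import numpy as np
from sklearn.cluster import KMeans

def map_phase(X, n_chunks):
    """Mapper: split the big data into smaller datasets."""
    return np.array_split(X, n_chunks)

def cluster_phase(chunks, k):
    """Clustering: run plain k-means independently on every chunk."""
    return [KMeans(n_clusters=k, n_init=10, random_state=0).fit(c).cluster_centers_
            for c in chunks]

def reduce_phase(centroid_sets, k):
    """Reducer: merge per-chunk centroids into k global clusters by
    clustering the centroids themselves (a simple stand-in for the
    paper's Mode / Fuzzy Gaussian merge functions)."""
    all_centroids = np.vstack(centroid_sets)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_centroids).cluster_centers_

X = np.vstack([np.random.randn(500, 3) + m for m in (0, 6)])
global_centers = reduce_phase(cluster_phase(map_phase(X, n_chunks=4), k=2), k=2)
```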
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A Novel Approach for Clustering Big Data based on MapReduce (IJECEIAES)
Clustering is one of the most important applications of data mining and has attracted the attention of researchers in statistics and machine learning. It is used in many applications such as information retrieval, image processing, and social network analytics; it helps the user understand the similarity and dissimilarity between objects, and cluster analysis helps users understand complex and large data sets more clearly. Different types of clustering algorithms have been analyzed by various researchers. K-means is the most popular partitioning-based algorithm, as it provides good results through accurate calculation on numerical data; however, K-means gives good results for numerical data only, while big data is a combination of numerical and categorical data. The K-prototype algorithm handles numerical as well as categorical data by combining the distances calculated on numeric and categorical attributes. With the growth of data from social networking websites, business transactions, scientific calculations, etc., there is a vast collection of structured, semi-structured, and unstructured data, so K-prototype needs to be optimized to analyze these varieties of data efficiently. In this work, the K-prototype algorithm is implemented on MapReduce. Experiments show that K-prototype on MapReduce gives better performance on multiple nodes than on a single node; CPU execution time and speedup are used as evaluation metrics. An intelligent splitter is also proposed that splits mixed big data into numerical and categorical data. Comparison with traditional algorithms shows that the proposed algorithm works better at large scale.
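The K-prototype dissimilarity the summary refers to, Huang's combination of numeric and categorical distance, can be sketched as follows; gamma is the weighting factor between the two parts.

```python
import numpy as np

def kprototype_distance(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """K-prototype dissimilarity: squared Euclidean distance on numeric
    attributes plus gamma times the number of mismatched categorical
    attributes (Huang's formulation)."""
    numeric = float(np.sum((np.asarray(x_num) - np.asarray(proto_num)) ** 2))
    categorical = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric + gamma * categorical

d = kprototype_distance([1.0, 2.0], ["red", "large"],
                        [1.5, 2.5], ["red", "small"], gamma=0.5)
print(d)  # 0.25 + 0.25 + 0.5 * 1 = 1.0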
IRJET- Review of Existing Methods in K-Means Clustering Algorithm (IRJET Journal)
The document reviews existing methods for the k-means clustering algorithm. It discusses how k-means clustering works and some of its limitations when dealing with large datasets, such as being dependent on the initial choice of centroids. It then proposes using Hadoop to overcome big data challenges and calculate preliminary centroids for k-means clustering in a distributed manner. Finally, it reviews different techniques that have been proposed in other research to improve k-means clustering, such as methods for selecting better initial centroids or determining the optimal number of clusters.
This document describes a new distance-based clustering algorithm (DBCA) that aims to improve upon K-means clustering. DBCA selects initial cluster centroids based on the total distance of each data point to all other points, rather than random selection. It calculates distances between all points, identifies points with maximum total distances, and sets initial centroids as the averages of groups of these maximally distant points. The algorithm is compared to K-means, hierarchical clustering, and hierarchical partitioning clustering on synthetic and real data. Experimental results show DBCA produces better quality clusters than these other algorithms.
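A small sketch of the described seeding strategy, under the assumption that the "groups" are consecutive runs of the most-distant points in the ranking (the paper's exact grouping rule may differ):

```python
import numpy as np

def dbca_initial_centroids(X, k, group_size=3):
    """Sketch of DBCA-style seeding: rank points by their total distance to
    all other points and average small groups of the most distant ones,
    instead of picking random seeds."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    total = d.sum(axis=1)                      # total distance per point
    order = np.argsort(total)[::-1]            # most-distant points first
    return np.array([X[order[i * group_size:(i + 1) * group_size]].mean(axis=0)
                     for i in range(k)])
```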
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER (ijnlc)
This document presents an adaptive log file parser that uses semantics and hidden Markov models. It first clusters log file lines based on semantics to limit unstructured text. It then builds a hidden Markov model to represent parsing patterns, with log entries as states and extracted values as emissions. When applied to a new system, it adapts the model's transition and emission probabilities to fit the new data. The approach achieves over 99.99% accuracy when trained on one system and applied to another with slightly different log patterns.
Extended PSO algorithm for improvement problems of k-means clustering algorithm (IJMIT JOURNAL)
Clustering is an unsupervised process and one of the most common data mining techniques. Its purpose is to group similar data together, so that instances within a cluster are as similar to each other as possible and as different as possible from instances in other clusters. This paper focuses on partitional k-means clustering which, owing to its ease of implementation and high-speed performance on large data sets, remains very popular more than thirty years after its development. To address the problem of k-means becoming trapped in local optima, an extended PSO algorithm named ECPSO is proposed. The new algorithm is able to escape local optima and, with high probability, produce the problem's optimal answer. The results show that the proposed algorithm outperforms other clustering algorithms, especially on two indices: clustering precision and clustering quality.
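The paper builds on the canonical PSO update, which for clustering is typically applied to particles that each encode a full set of centroids; a minimal sketch (generic PSO, not ECPSO itself):

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(x, v, pbest, gbest, w=0.72, c1=1.49, c2=1.49):
    """One canonical PSO update: inertia plus attraction toward each
    particle's personal best and the swarm's global best."""
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v

# In PSO-based clustering, each particle encodes a full set of k centroids
# and its fitness is the clustering objective (e.g., sum of squared errors).
x = rng.random((10, 2 * 3))   # 10 particles, each holding k=2 centroids in 3-D
v = np.zeros_like(x)
```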
K-means Clustering Method for the Analysis of Log Data (idescitation)
Cluster analysis is one of the main analytical methods in data mining, and the choice of clustering algorithm directly influences the clustering results. This paper discusses the standard k-means clustering algorithm and analyzes its shortcomings. It also focuses on web usage mining, analyzing log data for pattern recognition: with the help of the k-means algorithm, usage patterns are identified.
A PSO-Based Subtractive Data Clustering Algorithm (IJORCS)
There is a tremendous proliferation in the amount of information available on the largest shared information source, the World Wide Web. Fast, high-quality clustering algorithms play an important role in helping users effectively navigate, summarize, and organize this information. Recent studies have shown that partitional clustering algorithms such as k-means are the most popular algorithms for clustering large datasets. The major problem with partitional clustering algorithms is that they are sensitive to the selection of the initial partitions and are prone to premature convergence to local optima. Subtractive clustering is a fast, one-pass algorithm for estimating the number of clusters and the cluster centers for any given set of data; these estimates can be used to initialize iterative optimization-based clustering methods and model identification methods. In this paper, we present a hybrid Subtractive + (PSO) clustering algorithm that performs fast clustering. For comparison purposes, we applied the Subtractive + (PSO) clustering algorithm, PSO, and subtractive clustering to three different datasets. The results illustrate that the Subtractive + (PSO) clustering algorithm generates the most compact clustering results compared to the other algorithms.
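The subtractive clustering step can be sketched in a few lines; this follows Chiu's standard formulation with radius r_a and squash radius r_b = 1.5 r_a, not the hybrid Subtractive + (PSO) algorithm itself.

```python
import numpy as np

def subtractive_clustering(X, ra=1.0, n_centers=3):
    """One-pass subtractive clustering sketch: each point's potential is a
    sum of Gaussian contributions from all points; repeatedly pick the
    highest-potential point as a center and subtract its influence."""
    rb = 1.5 * ra                               # common choice of squash radius
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    potential = np.exp(-4.0 * d2 / ra ** 2).sum(axis=1)
    centers = []
    for _ in range(n_centers):
        c = int(np.argmax(potential))
        centers.append(X[c])
        potential -= potential[c] * np.exp(-4.0 * d2[:, c] / rb ** 2)
    return np.array(centers)
```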
The objective of this paper is to present a hybrid approach to edge detection. Under this technique, edge detection is performed in two phases: in the first phase, the Canny algorithm is applied for image smoothing, and in the second phase a neural network detects the actual edges. A neural network is a powerful tool for edge detection, as it is a non-linear network with built-in thresholding capability. A neural network can be trained with the back-propagation technique using few training patterns, but the most important and difficult part is identifying a correct and proper training set.
IRJET- Clustering of Hierarchical Documents based on the Similarity Deduc... (IRJET Journal)
This document discusses techniques for clustering hierarchical documents based on their structural similarity. It summarizes several existing approaches:
1) A tree edit distance-based method that represents trees as paths and computes the distance between subtrees. However, it requires trees to have a pre-specified structure.
2) Chawathe's algorithm that uses pre-order tree traversal and transforms trees into sequences of node labels and depths to calculate distances. It allows efficient assignment of new documents to clusters.
3) The XCLSC algorithm that clusters documents in two phases - grouping structurally similar documents and then searching to further improve clustering results and performance. However, it has high computational requirements.
4) The XPattern and PathXP
A general weighted_fuzzy_clustering_algorithm (TA Minh Thuy)
This document proposes a framework for adapting iterative clustering algorithms to handle streaming data. The key ideas are:
1) As data arrives in chunks, cluster each chunk and represent the clustering results as a set of weighted centroids, with the weights indicating the number of data points assigned to each cluster.
2) Add the weighted centroids from previous chunks to the current chunk as it is clustered. This allows the algorithm to incorporate historical information from all previously seen data.
3) The weighted centroids produced by clustering the entire stream can then be used to assign labels or groups to new data points.
Experimental results on a large dataset treated as a stream show the streaming algorithm produces clusters almost identical to clustering all data at once.
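A minimal sketch of the framework, assuming scikit-learn's KMeans with sample weights as the base iterative algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_stream(chunks, k):
    """Sketch of the weighted-centroid streaming scheme: cluster each chunk
    together with the weighted centroids carried over from earlier chunks,
    where a centroid's weight is the number of points it represents."""
    carried_pts = np.empty((0, chunks[0].shape[1]))
    carried_w = np.empty(0)
    for chunk in chunks:
        pts = np.vstack([chunk, carried_pts])
        w = np.concatenate([np.ones(len(chunk)), carried_w])
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pts, sample_weight=w)
        labels = km.labels_
        carried_pts = km.cluster_centers_
        carried_w = np.array([w[labels == j].sum() for j in range(k)])
    return carried_pts, carried_w

chunks = np.array_split(np.vstack([np.random.randn(1500, 2),
                                   np.random.randn(1500, 2) + 6]), 10)
centers, weights = cluster_stream(chunks, k=2)
```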
Classification of Iris Data using Kernel Radial Basis Probabilistic Neural Ne... (Scientific Review)
The Radial Basis Probabilistic Neural Network (RBPNN) has a broad generalization capability and has been successfully applied in multiple fields. In this paper, the Euclidean distance of each data point in the RBPNN is extended by calculating its kernel-induced distance instead of the conventional sum-of-squares distance. The kernel function is a generalization of the distance metric that measures the distance between two data points as if they were mapped into a high-dimensional space. Comparing four classification models built with Kernel RBPNN, Radial Basis Function networks, RBPNN, and Back-Propagation networks, the results show that classification of the Iris data with Kernel RBPNN displays outstanding performance.
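The kernel-induced distance mentioned above has a closed form via the kernel trick: ||phi(x) - phi(c)||^2 = K(x,x) - 2 K(x,c) + K(c,c). A small sketch with an RBF kernel:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / (2 * sigma ** 2))

def kernel_distance_sq(x, c, kernel=rbf_kernel):
    """Squared distance in the kernel-induced feature space:
    ||phi(x) - phi(c)||^2 = K(x,x) - 2*K(x,c) + K(c,c)."""
    return kernel(x, x) - 2.0 * kernel(x, c) + kernel(c, c)

print(kernel_distance_sq([0.0, 0.0], [1.0, 1.0]))
# For an RBF kernel with sigma=1 this is 2 - 2*exp(-1), roughly 1.264
```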
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS (ijscmcj)
The purpose of this article is to determine the usefulness of Graphics Processing Unit (GPU) calculations for implementing the Latent Semantic Indexing (LSI) reduction of the TERM-BY-DOCUMENT matrix. The considered reduction of the matrix is based on the SVD (Singular Value Decomposition). The high computational complexity of the SVD, O(n^3), makes the reduction of a large indexing structure a difficult task. This article compares the time complexity and accuracy of the algorithms implemented in two different environments: the first associated with the CPU and MATLAB R2011a, the second with graphics processors and the CULA library. The calculations were carried out on generally available benchmark matrices, which were combined to obtain a resulting matrix of large size. For both environments, computations were performed on double- and single-precision data.
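For reference, the CPU-side reduction being benchmarked can be sketched with NumPy's SVD; the matrix sizes and the rank r here are illustrative.

```python
import numpy as np

# Toy term-by-document matrix (rows: terms, columns: documents).
A = np.random.rand(1000, 200)

# Rank-r LSI reduction via truncated SVD: A is approximated by U_r S_r V_r^T.
r = 50
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_reduced = (np.diag(s[:r]) @ Vt[:r]).T   # each document in r concept dimensions

# A query vector is folded into the same space with q_r = S_r^{-1} U_r^T q.
q = np.random.rand(1000)
q_reduced = np.diag(1.0 / s[:r]) @ U[:, :r].T @ q
```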
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis... (ijsrd.com)
A cluster is a group of objects that are similar to each other within the cluster and dissimilar to the objects of other clusters. The similarity is typically calculated on the basis of the distance between two objects or clusters; two or more objects belong to the same cluster only if they are close to each other by that distance. The major objective of clustering is to discover collections of comparable objects based on a similarity metric. Fuzzy Possibilistic C-Means (FPCM) is an effective clustering algorithm for unlabeled data that produces both membership and typicality values during the clustering process. In this approach, the efficiency of FPCM is enhanced by using penalized and compensated constraints (PCFPCM). The proposed PCFPCM approach differs from conventional clustering techniques by imposing a possibilistic reasoning strategy on fuzzy clustering, with penalized and compensated constraints for updating the grades of membership and typicality. The performance of the proposed approach is evaluated on University of California, Irvine (UCI) machine learning repository datasets such as Iris, Wine, Lung Cancer, and Lymphography. The parameters used for the evaluation are clustering accuracy, Mean Squared Error (MSE), execution time, and convergence behavior.
The document describes a study that uses fuzzy logic to predict porosity from well log data. It discusses (1) normalizing the input data, (2) using subtractive clustering to identify clusters and membership functions, and (3) developing fuzzy rules with Gaussian membership functions to relate inputs like density, sonic, and neutron logs to the output of porosity. The results showed fuzzy logic predictions of porosity were more accurate than those from multiple linear regression on the same well log data.
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO... (cscpconf)
For performing distributed data mining, two approaches are possible: first, data from several sources are copied to a data warehouse and mining algorithms are applied to it; secondly, mining can be performed at the local sites and the results aggregated. When the number of features is high, a lot of bandwidth is consumed in transferring datasets to a centralized location, so dimensionality reduction can be done at the local sites. In dimensionality reduction, an encoding is applied to the data to obtain a compressed form. The reduced features obtained at the local sites are then aggregated, and data mining algorithms are applied to them. There are several methods of performing dimensionality reduction; two of the most important are Discrete Wavelet Transforms (DWT) and Principal Component Analysis (PCA). Here a detailed study is done on how PCA can be useful in reducing data flow across a distributed network.
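A minimal sketch of PCA-based compression at a local site, showing why only a small projected matrix needs to cross the network; the sizes are illustrative.

```python
import numpy as np

def pca_compress(X, n_components):
    """PCA at a local site: project centered data onto the top principal
    directions, so only the small projected matrix (plus the mean and the
    components) needs to cross the network."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Eigen-decomposition of the covariance matrix; columns are directions.
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return Xc @ top, top, mean

X = np.random.rand(10_000, 60)          # 60 features at a local site
Z, components, mean = pca_compress(X, n_components=8)
print(Z.shape)                           # (10000, 8): far less data to transmit
```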
Reduct generation for the incremental data using rough set theory (csandit)
In today's changing world a huge amount of data is generated and transferred frequently. Although the data is sometimes static, most commonly it is dynamic and transactional, and newly generated data is constantly added to the old/existing data. To discover knowledge from this incremental data, one approach is to run the algorithm repeatedly on the modified data sets, which is time consuming. The paper proposes a dimension reduction algorithm that can be applied in a dynamic environment to generate a reduced attribute set, a dynamic reduct. The method analyzes the new dataset when it becomes available and modifies the reduct accordingly to fit the entire dataset. The concepts of discernibility relation, attribute dependency, and attribute significance from Rough Set Theory are integrated for the generation of the dynamic reduct set, which not only reduces the complexity but also helps achieve higher accuracy of the decision system. The proposed method has been applied to a few benchmark datasets collected from the UCI repository and a dynamic reduct is computed. Experimental results show the efficiency of the proposed method.
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval (IJECEIAES)
Data mining is an essential process for identifying patterns in large datasets through machine learning techniques and database systems. Clustering high-dimensional data is becoming very challenging due to the curse of dimensionality, and existing methods do not improve space complexity or data retrieval performance. To overcome these limitations, a Spectral Clustering Based VP Tree Indexing Technique is introduced. The technique clusters and indexes densely populated high-dimensional data points for effective retrieval in response to user queries. A normalized spectral clustering algorithm groups similar high-dimensional data points; a Vantage Point Tree is then constructed to index the clustered data points with minimal space complexity; finally, indexed data is retrieved for a user query using a Vantage Point Tree based Data Retrieval Algorithm. This improves the true positive rate while minimizing retrieval time. Performance is measured in terms of space complexity, true positive rate, and data retrieval time on the El Nino weather data sets from the UCI Machine Learning Repository. Experimental results show that the proposed technique reduces space complexity by 33% and data retrieval time by 24% compared to state-of-the-art works.
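The first stage, normalized spectral clustering, is available off the shelf; a sketch using scikit-learn (a generic stand-in, not the paper's implementation):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Normalized spectral clustering: build a nearest-neighbor affinity graph
# and cluster the eigenvectors of its graph Laplacian.
X = np.vstack([np.random.randn(100, 4) + m for m in (0, 5, 10)])
labels = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                            n_neighbors=10, random_state=0).fit_predict(X)
```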
Wireless data broadcast is an efficient way of disseminating data to users in mobile computing environments. From the server's point of view, how to place the data items on channels is a crucial issue, with the objective of minimizing the average access time and tuning time. Similarly, how to schedule the data retrieval process for a given request at the client side, such that all the requested items can be downloaded in a short time, is also an important problem. In this paper, we investigate multi-item data retrieval scheduling in push-based multichannel broadcast environments. The most important issues in mobile computing are energy efficiency and query response efficiency; however, in data broadcast the objectives of reducing access latency and energy cost can contradict each other. Consequently, we define two new problems, the Minimum Cost Data Retrieval (MCDR) problem and the Large Number Data Retrieval (LNDR) problem, and develop a heuristic algorithm to download a large number of items efficiently. When there is no replicated item in a broadcast cycle, we show that an optimal retrieval schedule can be obtained in polynomial time.
A Threshold Fuzzy Entropy Based Feature Selection: Comparative Study (IJMER)
Feature selection is one of the most common and critical tasks in database classification. It reduces the computational cost by removing insignificant and unwanted features and consequently makes the diagnosis process accurate and comprehensible. This paper presents a measurement of feature relevance based on fuzzy entropy, tested with a Radial Basis Function (RBF) network classifier, Bagging (Bootstrap Aggregating), Boosting, and stacking on datasets from various fields. Twenty benchmark datasets available in the UCI Machine Learning Repository and KDD were used for this work. The accuracy obtained from these classification processes shows that the proposed method is capable of producing good and accurate results with fewer features than the original datasets.
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION... (ijcsit)
This document summarizes a research paper that proposes a new method for improving both fault tolerance and load balancing in grid computing networks. The method converts the tree structure of grid computing nodes into a distributed R-tree index structure and then applies an entropy estimation technique. This entropy estimation helps discard nodes with high entropy from the tree, reducing complexity. The method then uses thresholding and control algorithms to select optimal route paths based on load balance and fault tolerance. Various optimization techniques like genetic algorithms, ant colony optimization, and particle swarm optimization are also applied to reach better solutions. Experimental results showed the proposed method improved performance over other existing methods.
Particle Swarm Optimization based K-Prototype Clustering Algorithm (iosrjce)
This document summarizes a research paper that proposes a new Particle Swarm Optimization (PSO) based K-Prototype clustering algorithm to cluster mixed numeric and categorical data. It begins with background information on clustering algorithms like K-Means, K-Modes, and K-Prototype. It then describes the K-Prototype algorithm, PSO, and discrete binary PSO. Related work integrating PSO with other clustering algorithms is also reviewed. The proposed approach uses binary PSO to select improved initial prototypes for K-Prototype clustering in order to obtain better clustering results than traditional K-Prototype and avoid local optima.
This document discusses using particle swarm optimization to improve the k-prototype clustering algorithm. The k-prototype algorithm clusters data with both numeric and categorical attributes but can get stuck in local optima. The proposed method uses particle swarm optimization, a global optimization technique, to guide the k-prototype algorithm towards better clusterings. Particle swarm optimization models potential solutions as particles that explore the search space. It is integrated with k-prototype clustering to avoid locally optimal solutions and produce better clusterings. The method is tested on standard benchmark datasets and shown to outperform traditional k-modes and k-prototype clustering algorithms.
Web image annotation by diffusion maps manifold learning algorithmijfcstjournal
Automatic image annotation is one of the most challenging problems in machine vision. The goal of this task is to automatically predict keywords for images captured in real-world data. Many methods are based on visual features used to calculate similarities between image samples, but the computational cost of these approaches is very high, and they require many training samples to be stored in memory. To lessen this burden, a number of techniques have been developed to reduce the number of features in a dataset. Manifold learning is a popular approach to nonlinear dimensionality reduction. In this paper, we investigate the diffusion maps manifold learning method for the web-image auto-annotation task; it is used to reduce the dimension of several visual features. Extensive experiments and analysis on the NUS-WIDE-LITE web image dataset with different visual features show how this manifold learning dimensionality reduction method can be applied effectively to image annotation.
The document discusses JARVIS-ML, an AI system for fast and accurate screening of materials properties. It uses machine learning models trained on a large dataset of materials properties calculated using density functional theory. Some key points:
- JARVIS-ML uses gradient boosting decision trees to predict properties like formation energies, bandgaps, and elastic moduli, achieving good accuracy compared to DFT calculations.
- Feature selection is important, and JARVIS-ML uses over 1,500 descriptors of atomic structure. Chemical features are most important for predictions.
- The models can screen thousands of materials in seconds, much faster than DFT. This enables large-scale materials discovery tasks like genetic algorithm searches.
The document discusses data stream mining and summarizes some key challenges and techniques. It describes how traditional data mining cannot be directly applied to data streams due to their continuous, rapid arrival. It then outlines several techniques used for summarizing and extracting knowledge from data streams, including sampling, sketching, load shedding, synopsis data structures, and algorithms modified from basic data mining to handle streams.
A Comparative Study Of Various Clustering Algorithms In Data MiningNatasha Grant
This document provides an overview and comparison of various clustering algorithms used in data mining. It discusses the key types of clustering algorithms: partition-based (such as k-means and k-medoids), hierarchical-based, density-based, and grid-based. For partition-based algorithms, it describes k-means and k-medoids in more detail. It also discusses hierarchical clustering approaches like agglomerative nesting. The document aims to provide insights into different clustering techniques for segmenting and grouping data in an unsupervised manner.
This document discusses clustering of uncertain data objects. It first provides background on clustering uncertain data and challenges in doing so. It then proposes combining k-means clustering with Voronoi diagrams to improve the performance of k-means when clustering uncertain data. Specifically, it suggests using k-means to generate clusters and Voronoi diagrams to answer nearest neighbor queries, in order to minimize computation time. Finally, it concludes that integrating clustering algorithms with indexing methods can effectively cluster uncertain data objects.
This document discusses clustering of uncertain data objects. It first provides background on clustering uncertain data and challenges involved. It then reviews various existing approaches for clustering uncertain data, including using soft classifiers and probabilistic databases. The document proposes combining k-means clustering with Voronoi diagrams and indexing techniques to improve the performance and efficiency of clustering uncertain datasets. It outlines a plan to integrate k-means with Voronoi diagrams and indexing to reduce execution time and increase clustering performance and results for uncertain data. Finally, it concludes that combining clustering with indexing approaches can better handle uncertain data clustering challenges.
1. The document summarizes ongoing data mining and machine learning research at the University of Houston from 2006-2009.
2. Key areas of research included developing shape-aware clustering algorithms, discovering regional knowledge in geo-referenced datasets, emergent pattern discovery, and various machine learning applications.
3. The researchers were developing techniques for clustering with plug-in fitness functions, discovering spatial risk patterns like arsenic levels, and an open source data mining framework called Cougar2.
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...acijjournal
Apriori is one of the key algorithms for generating frequent itemsets. Analysing frequent itemsets is a crucial step in analysing structured data and in finding association relationships between items; it also serves as an elementary foundation for supervised learning, which encompasses classifier and feature extraction methods. Applying this algorithm is crucial to understanding the behaviour of structured data. Most structured data in scientific domains is voluminous, and processing such data requires state-of-the-art computing machines; setting up such an infrastructure is expensive. Hence a distributed environment such as a clustered setup is employed for tackling such scenarios. The Apache Hadoop distribution is one of the cluster frameworks for distributed environments that helps by distributing voluminous data across a number of nodes. This paper focuses on the map/reduce design and implementation of the Apriori algorithm for structured data analysis.
Critical Paths Identification on Fuzzy Network Projectiosrjce
In this paper, a new approach for identifying the fuzzy critical path is presented, based on converting the fuzzy network project into a deterministic network project by transforming the parameter set of the fuzzy activities into the time probability density function (PDF) of each fuzzy time activity. A case study is considered as a numerical test problem to demonstrate our approach.
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Sinan KOZAK
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
Literature Review Basics and Understanding Reference Management.pptxDr Ramhari Poudyal
A three-day training on academic research, focusing on analytical tools, held at United Technical College with support from the University Grants Commission, Nepal, 24-26 May 2024.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...IJECEIAES
Climate change's impact on the planet has forced the United Nations and governments to promote green energies and electric transportation. The deployment of photovoltaic (PV) and electric vehicle (EV) systems has gained strong momentum due to their numerous advantages over fossil fuel alternatives, advantages that go beyond sustainability to include financial support and stability. The work in this paper introduces a hybrid PV and EV system to support industrial and commercial plants. The paper covers the theoretical framework of the proposed hybrid system, including the equations required to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram, which sets the priorities and requirements of the system, is presented. The proposed approach allows plants to improve their power stability, especially during power outages. The presented information supports researchers and plant owners in completing the necessary analysis while promoting the deployment of clean energy. The results of a case study representing a dairy milk farmer support the theoretical work and highlight its benefits to existing plants. The short return on investment of the proposed approach underlines the paper's novel approach to a sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line, which enhances the safety of the electrical network.
Understanding Inductive Bias in Machine LearningSUTEJAS
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
Comparative analysis between traditional aquaponics and reconstructed aquapon...bijceesjournal
The aquaponic system of planting is a method that does not require soil usage. It is a method that only needs water, fish, lava rocks (a substitute for soil), and plants. Aquaponic systems are sustainable and environmentally friendly. Its use not only helps to plant in small spaces but also helps reduce artificial chemical use and minimizes excess water use, as aquaponics consumes 90% less water than soil-based gardening. The study applied a descriptive and experimental design to assess and compare conventional and reconstructed aquaponic methods for reproducing tomatoes. The researchers created an observation checklist to determine the significant factors of the study. The study aims to determine the significant difference between traditional aquaponics and reconstructed aquaponics systems propagating tomatoes in terms of height, weight, girth, and number of fruits. The reconstructed aquaponics system’s higher growth yield results in a much more nourished crop than the traditional aquaponics system. It is superior in its number of fruits, height, weight, and girth measurement. Moreover, the reconstructed aquaponics system is proven to eliminate all the hindrances present in the traditional aquaponics system, which are overcrowding of fish, algae growth, pest problems, contaminated water, and dead fish.
ACEP Magazine edition 4th launched on 05.06.2024Rahul
This document provides information about the third edition of the magazine "Sthapatya" published by the Association of Civil Engineers (Practicing) Aurangabad. It includes messages from current and past presidents of ACEP, memories and photos from past ACEP events, information on life time achievement awards given by ACEP, and a technical article on concrete maintenance, repairs and strengthening. The document highlights activities of ACEP and provides a technical educational article for members.
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTjpsjournal1
The rivalry between prominent international actors for dominance over Central Asia's hydrocarbon reserves and the ancient silk trade route, along with China's diplomatic endeavours in the area, has been referred to as the "New Great Game." This research centres on that power struggle, considering geopolitical, geostrategic, and geoeconomic variables. Topics including trade, political hegemony, oil politics, and traditional and nontraditional security are explored and explained. Using Mackinder's Heartland, Spykman's Rimland, and Hegemonic Stability theories, the study examines China's role in Central Asia. It adheres to an empirical epistemological method and takes care to maintain objectivity, critically analysing primary and secondary research documents to elaborate the role of China's geo-economic outreach in Central Asian countries and its future prospects. According to this study, China is seeing significant success in trade, pipeline politics, and gaining influence over other governments, success that may be attributed to the effective utilisation of key instruments such as the Shanghai Cooperation Organisation and the Belt and Road Economic Initiative.
The CBC machine is a common diagnostic tool used by doctors to measure a patient's red blood cell count, white blood cell count and platelet count. The machine uses a small sample of the patient's blood, which is then placed into special tubes and analyzed. The results of the analysis are then displayed on a screen for the doctor to review. The CBC machine is an important tool for diagnosing various conditions, such as anemia, infection and leukemia. It can also help to monitor a patient's response to treatment.
Advanced control scheme of doubly fed induction generator for wind turbine us...IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
International Conference on NLP, Artificial Intelligence, Machine Learning an...gerogepatton
International Conference on NLP, Artificial Intelligence, Machine Learning and Applications (NLAIM 2024) offers a premier global platform for exchanging insights and findings in the theory, methodology, and applications of NLP, Artificial Intelligence, Machine Learning, and their applications. The conference seeks substantial contributions across all key domains of NLP, Artificial Intelligence, Machine Learning, and their practical applications, aiming to foster both theoretical advancements and real-world implementations. With a focus on facilitating collaboration between researchers and practitioners from academia and industry, the conference serves as a nexus for sharing the latest developments in the field.
ME Synopsis
Track 2: Data Mining
Third Post Graduate Symposium on Computer Engineering (cPGCON2014), organized by the Department of Computer Engineering, MCERC Nasik
Mining Data Streams Based on the Improved McDiarmid's Bound. Ms. Poonam Debnath and Prof. Santosh Kumar Chobe, Department of Computer Engineering, University of Pune
Abstract: Complex analysis of data streams is becoming a popular field of research, as the information collected is prone to concept drift or complete shift. The preprocessing, storage, querying, and mining of such data sets are highly challenging computational tasks. Mining data streams means extracting knowledge structures, represented as models and patterns, from non-stopping streams of information. Traditionally, Hoeffding's bound is widely used to resolve the question of how many learning samples are needed at a node before the split attribute can be selected. In this paper, we present the theoretical foundations for enhancing the bounds obtained by the McDiarmid tree algorithm and for improving the processing efficiency of the stream mining system by applying Gaussian approximations to the bounds.

Index Terms: Data streams, Decision trees, Gaussian approximation, Hoeffding's bound, McDiarmid's bound.
I. INTRODUCTION
Recently a new class of emerging applications has become widely recognized: applications in which data is generated at very high rates in the form of transient data streams. In the data stream model, individual data items may be relational tuples, call records, web page visits, sensor readings, and so on. However, the continuous arrival of data in multiple, rapid, time-varying, unpredictable, and unbounded streams opens new fundamental research problems. The rapid generation of continuous streams of information poses a challenge for the storage, computation, and communication capabilities of a computing system. The gigantic amounts of data arriving at high speed require semi-automated interactive techniques to perform real-time extraction of hidden knowledge.
Typical data mining tasks include concept description, regression analysis, association mining, outlier analysis, classification, and clustering. These techniques find interesting patterns, tracing regularities and anomalies in the data set. However, traditional data mining techniques cannot be directly applied to the data streaming model, because most of them require multiple scans of the data to mine the information, which is impractical for stream data. The amount of previously seen events is usually immeasurable, so they must either be dropped after processing or archived separately in secondary storage. More importantly, the traits of the data stream can change over time, and the evolving pattern needs to be recorded. Furthermore, the problem of resource allocation must also be considered in mining data streams: due to the bulky volume and high speed of streaming data, stream mining algorithms must handle the effects of system burden. Thus, how to accomplish optimal results under various resource constraints becomes challenging.

Initially, decision trees developed for data mining were tailored to deal with stream data as well, but the difficulty lies in ensuring that an attribute selected from N examples is equally good when used for infinitely many examples. The target was to calculate the heuristic value from the N training examples and then exploit the results to split the learning sample space. At first the Hoeffding tree algorithm, based on Hoeffding's inequality and Hoeffding's bound, was used for knowledge discovery in data streams. Hoeffding's bound postulates that, with probability 1 - δ, the true mean of a random variable of range R does not differ from the estimated mean, after N independent trials, by more than:

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2N}} \qquad (1)$$

A glance through the techniques and interpretations of data stream research prompted us to amend the existing tactics for improved performance in data stream mining systems. In this paper, we show that: methods based on McDiarmid's inequality call for a gigantic amount of data stream samples at each node; by using Gaussian approximation techniques we can tighten the bounds used and reduce the size of the training samples needed for split criterion selection; Hoeffding's inequality is not sufficient to resolve the fundamental problem in the general case, so all existing methods should be adjusted; and McDiarmid's inequality, used in an appropriate way, is an efficient technique to cope with the glitches of the data streaming model.
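To make the contrast concrete, here is a minimal Python sketch (ours, not from the paper) comparing the Hoeffding epsilon of Eq. (1) with a Gaussian-approximation epsilon of the kind advocated here; the standard deviation sigma = 0.25 is an illustrative assumption.

```python
import math
from statistics import NormalDist

def hoeffding_epsilon(R, delta, N):
    """Eq. (1): with probability 1 - delta, the sample mean of N i.i.d.
    variables of range R is within epsilon of the true mean."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * N))

def gaussian_epsilon(sigma, delta, N):
    """Illustrative Gaussian-approximation counterpart: treat the sample
    mean as approximately N(mu, sigma^2 / N) and take its (1 - delta) quantile."""
    return NormalDist().inv_cdf(1.0 - delta) * sigma / math.sqrt(N)

# Example: range R = 1, delta = 0.05; sigma = 0.25 is an assumed std-dev
# (any variable bounded in [0, 1] has std-dev at most 0.5).
for N in (100, 1000, 10000):
    print(N, round(hoeffding_epsilon(1.0, 0.05, N), 4),
          round(gaussian_epsilon(0.25, 0.05, N), 4))
```

Under these assumptions the Gaussian epsilon is roughly a third of the Hoeffding epsilon at every N, i.e. far fewer training examples are needed to reach the same confidence.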
II. RELATED WORK
One of the earliest works on mining evolving data streams was carried out by P. Domingos and G. Hulten [5], who proposed VFDT, which could generate a decision tree under strict time and memory constraints. C. Aggarwal [2] studied and proposed a two-phase technique for the astronomical time-series problem: in the first phase, sliding-window clusters are created, which are later used to mine association rules from the streams. The effect of concept drift was examined by A. Bifet and R. Kirkby [3], who evaluated online streams with complete concept shifts and used ensembles in their experiments to tackle such concepts. G. Hulten, L. Spencer, and P. Domingos [11] observed that the basic assumptions of a machine learning system do not hold for a streaming model, because the data source distribution is never stationary or predictable; they proposed an algorithm for handling continuously changing data streams, called CVFDT, a variant of VFDT, and worked on the complexity of target examples. Most approaches rely on the underlying concept of Hoeffding's inequality, introduced by W. Hoeffding [10]. The problem with adapting this theory to data stream mining is that Hoeffding's inequality considers only numeric-valued data, whereas real-world data is unpredictable. R. Kirkby [12] studied the streaming model and proposed enhancements to the Hoeffding tree algorithm; he labeled a fraction of the stream sample and used a semi-supervised approach to create clusters from the dataset. B. Pfahringer, G. Holmes, and R. Kirkby [14] used option trees instead of clustering in their algorithm; the use of option trees helped improve the efficiency of Hoeffding's bounds. Recently, L. Rutkowski, L. Pietruczuk, P. Duda, and M. Jaworski [1] have shown that the use of McDiarmid's inequality is the correct way to analyze high-speed, time-changing data streams.
III. IMPLEMENTATION DETAILS
Conventional data mining techniques require several passes over the data to mine knowledge, but this is not feasible for the stream model. In practice it is not possible to store an entire stream or scan through it numerous times, due to its enormous volume. Moreover, data streams evolve over time and undergo severe concept drift or complete shift.
A. Hoeffding's Inequality
Let $Y_1, Y_2, \ldots, Y_N$ be independent random variables with values from a distribution, with $Y_i \in [0, R]$ for $i = 1, \ldots, N$, where $R$ is a constant. Let $E[Y]$ denote the expected value of the distribution and $\bar{Y}$ the sample mean. More generally, let $Y_i$, $i = 1, \ldots, N$, be independent random variables such that $\Pr(Y_i \in [a_i, b_i]) = 1$. Then for $S = \sum_{i=1}^{N} Y_i$ and all $\epsilon > 0$ the following inequalities hold:

$$\Pr(S - E[S] \geq \epsilon) \leq \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^{N}(b_i - a_i)^2}\right) \qquad (2)$$

$$\Pr(|S - E[S]| \geq \epsilon) \leq 2\exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^{N}(b_i - a_i)^2}\right) \qquad (3)$$
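As a quick numerical sanity check (our illustration, not part of the paper), the following sketch verifies inequality (3) by simulation for Uniform[0, 1] variables, for which the bound on the sample mean reduces to 2 exp(-2Nε²); the gap between the empirical frequency and the bound also shows how loose Hoeffding's inequality is in practice, which is what motivates the Gaussian approximation.

```python
import math
import random

def check_hoeffding(N=200, eps=0.1, trials=5000, seed=42):
    """Compare the empirical frequency of |mean - E[Y]| >= eps against
    the Hoeffding bound 2 * exp(-2 * N * eps^2) for Y ~ Uniform[0, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(N)) / N
        if abs(mean - 0.5) >= eps:  # E[Y] = 0.5 for Uniform[0, 1]
            hits += 1
    return hits / trials, 2.0 * math.exp(-2.0 * N * eps * eps)

empirical, bound = check_hoeffding()
print(f"empirical: {empirical:.4f}  Hoeffding bound: {bound:.4f}")
```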
B. Proposed Architecture
To improve the bounds obtained for the split measures, we apply a Gaussian approximation to them.
Fig. 1. Information Flow Path.

Preprocessing: The dataset to be clustered is first analysed to detect the native structure in the document space using correlation analysis. The next step is to remove the null words and write the contents, after removal of the null words, to a new stop-words folder.

McDiarmid's inequality: Suppose an attribute $a$ can take one of $|a|$ different values from the set $A = \{a_1, \ldots, a_{|a|}\}$, and let $K = \{k_1, \ldots, k_K\}$ be the set of different classes. Then let

$$Z = \{X_1, \ldots, X_N\} \qquad (4)$$

be the training set of size $N$, where $X_1, \ldots, X_N$ are independent random variables defined as

$$X_i = (a_{j_i}, b_{l_i}, \ldots, k_{q_i}) \qquad (5)$$

for $i = 1, \ldots, N$, $j_i \in \{1, \ldots, |a|\}$, $l_i \in \{1, \ldots, |b|\}$, $q_i \in \{1, \ldots, K\}$. Each element of $Z$ belongs to one of the $K$ different classes $k_j$. The entropy associated with the classification of $Z$ is defined as

$$H(Z) = -\sum_{j=1}^{K} p_j \log_2 p_j \qquad (6)$$

where $p_j$ is the probability that an element of $Z$ comes from class $k_j$. We estimate this probability by $n_j / N$, where $n_j$ is the number of elements from class $k_j$. Then

$$H(Z) = -\sum_{j=1}^{K} \frac{n_j}{N} \log_2 \frac{n_j}{N} \qquad (7)$$

Choose an attribute $a$ characterizing the elements of set $Z$, and let $Z_{a_i}$ denote the set of elements of $Z$ for which the value of $a$ is $a_i$; the number of elements in $Z_{a_i}$ is labeled $n_{a_i}$. Then the weighted entropy for attribute $a$ and set $Z$ is given by

$$H_a(Z) = \sum_{i=1}^{|a|} \frac{n_{a_i}}{N} H(Z_{a_i}) \qquad (8)$$

where

$$H(Z_{a_i}) = -\sum_{j=1}^{K} \frac{n_{a_i}^j}{n_{a_i}} \log_2 \frac{n_{a_i}^j}{n_{a_i}} \qquad (9)$$

and $n_{a_i}^j$ denotes the number of elements in set $Z_{a_i}$ from class $k_j$. The information gain for attribute $a$ is given by

$$\text{Gain}_a(Z) = H(Z) - H_a(Z) \qquad (10)$$

Let us assume that $a$ is the attribute with the highest value of information gain, while $b$ is the second-best attribute, so that

$$\text{Gain}_a(Z) - \text{Gain}_b(Z) > 0 \qquad (11)$$

and define

$$f(Z) = \text{Gain}_a(Z) - \text{Gain}_b(Z) \qquad (12)$$
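For concreteness, a short Python sketch (ours; the function names are illustrative) of the empirical entropy of Eq. (7) and the information gain of Eq. (10):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Empirical entropy, Eq. (7): -sum_j (n_j / N) log2 (n_j / N)."""
    N = len(labels)
    return -sum((n / N) * math.log2(n / N) for n in Counter(labels).values())

def info_gain(values, labels):
    """Information gain of an attribute, Eq. (10): H(Z) - H_a(Z),
    where H_a(Z) is the weighted entropy of Eq. (8)."""
    N = len(labels)
    partitions = defaultdict(list)
    for v, y in zip(values, labels):  # group class labels by attribute value
        partitions[v].append(y)
    weighted = sum(len(p) / N * entropy(p) for p in partitions.values())
    return entropy(labels) - weighted

# Toy example: attribute a with values a1/a2 over a two-class sample.
a = ["a1", "a1", "a2", "a2", "a2", "a1"]
k = ["k1", "k1", "k2", "k2", "k1", "k2"]
print(round(info_gain(a, k), 4))
```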
Let $Z$, given by (4), be the set of independent random variables, with $X_i$ taking values in a set $A_i$ for each $i$. Let us define

$$Z' = \{X_1, \ldots, X_{i-1}, \hat{X}_i, X_{i+1}, \ldots, X_N\} \qquad (13)$$

with $\hat{X}_i$ taking values in $A_i$. Observe that $Z'$ differs from $Z$ only in the $i$-th element. We will use McDiarmid's inequality: suppose that the function

$$f : A_1 \times \cdots \times A_N \to \mathbb{R} \qquad (14)$$

satisfies

$$|f(Z) - f(Z')| \leq C_i \qquad (15)$$

for all $X_1, \ldots, X_N$, $\hat{X}_i$ and some constants $C_i$. Then

$$\Pr(f(Z) - E[f(Z)] \geq \epsilon) \leq \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^{N} C_i^2}\right) \qquad (16, 17)$$

Proposed architecture: The conceptual user-interaction model of the proposed system is shown in Fig. 2.
Fig. 2. System Architecture.
Mathematical model for information gain: Traditional ID3 uses information gain as the split measure for selecting the best attribute. When using the McDiarmid tree algorithm, the following theorem guarantees that a decision-tree learning system applied to data streams produces output nearly identical to that of a conventional batch learner. Let $Z = \{X_1, \ldots, X_N\}$ be any set of independent random variables, each taking values in $A \times B \times \cdots$. Then, for any fixed $\delta$ and any pair of attributes $a$ and $b$ with $f(Z) = \text{Gain}_a(Z) - \text{Gain}_b(Z) > 0$, if

$$\epsilon = C_{\text{Gain}}(K, N)\sqrt{\frac{\ln(1/\delta)}{2N}} \qquad (18)$$

then

$$\Pr(f(Z) - E[f(Z)] > \epsilon) \leq \delta \qquad (19)$$

where

$$C_{\text{Gain}}(K, N) = 6(K \log_2 eN + \log_2 2N) + 2\log_2 K \qquad (20)$$

Mathematical model for the Gini index: The Gini index is usually used in CART and measures the impurity of the training set. The corresponding theorem using McDiarmid's bound is as follows. Let $Z = \{X_1, \ldots, X_N\}$ be any set of independent random variables, each taking values in $A \times B \times \cdots$. Then, for any fixed $\delta$ and any pair of attributes $a$ and $b$ with $f(Z) > 0$, the bound (19) holds with

$$\epsilon = 8\sqrt{\frac{\ln(1/\delta)}{2N}} \qquad (21)$$
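Both bounds are simple closed-form expressions; the following sketch (our transcription of Eqs. (18)-(21), with illustrative parameter values) evaluates them. Note how large N must be before the information-gain epsilon becomes small, which is exactly the weakness that the Gaussian approximation targets.

```python
import math

def eps_gain(K, N, delta):
    """Eqs. (18)-(20): McDiarmid epsilon for information gain,
    eps = C_Gain(K, N) * sqrt(ln(1/delta) / (2N))."""
    c_gain = (6.0 * (K * math.log2(math.e * N) + math.log2(2.0 * N))
              + 2.0 * math.log2(K))
    return c_gain * math.sqrt(math.log(1.0 / delta) / (2.0 * N))

def eps_gini(N, delta):
    """Eq. (21): McDiarmid epsilon for the Gini index,
    eps = 8 * sqrt(ln(1/delta) / (2N))."""
    return 8.0 * math.sqrt(math.log(1.0 / delta) / (2.0 * N))

# Even at a million examples the information-gain bound is still ~0.47,
# while the Gini bound is already ~0.01 (K = 2 classes, delta = 0.05).
print(round(eps_gain(K=2, N=10**6, delta=0.05), 4))
print(round(eps_gini(N=10**6, delta=0.05), 4))
```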
C. Algorithm
For convenience, the following notation will be used: À denotes the set of all attributes; α denotes any attribute from set À; αMAX1 is the attribute with the highest value of the split function; αMAX2 is the attribute with the second-highest value of the split function.
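Using this notation, a hedged sketch (ours, not the paper's algorithm listing) of the node-splitting test: the node splits on αMAX1 only when its split-measure lead over αMAX2 exceeds the McDiarmid bound.

```python
import math

def eps_gain(K, N, delta):
    """Eqs. (18)-(20), repeated here so the sketch is self-contained."""
    c = (6.0 * (K * math.log2(math.e * N) + math.log2(2.0 * N))
         + 2.0 * math.log2(K))
    return c * math.sqrt(math.log(1.0 / delta) / (2.0 * N))

def choose_split(split_values, N, K, delta):
    """split_values maps each attribute alpha to its split-measure value.
    alpha_MAX1 / alpha_MAX2 are the attributes with the highest and
    second-highest values; split only when their gap exceeds the bound."""
    ranked = sorted(split_values.items(), key=lambda kv: kv[1], reverse=True)
    (a_max1, g1), (_, g2) = ranked[:2]
    if g1 - g2 > eps_gain(K, N, delta):
        return a_max1  # split holds with probability at least 1 - delta
    return None        # not yet statistically distinguishable; keep streaming

# Example: three candidate attributes at a node after a million examples.
print(choose_split({"age": 0.61, "income": 0.10, "zip": 0.05},
                   N=10**6, K=2, delta=0.05))
```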
IV. RESULTS
A. Data Set
Web logs, financial tickers, sensor feeds, and other massive, unordered, continuous data sets.
B. Expected Result Set
The result set will comprise two modules. First, the system will generate a decision tree based on the McDiarmid tree algorithm, with Gaussian bounds applied to the attribute-selection function. Second, it will provide a graph comparing the decision tree produced by the traditional McDiarmid algorithm with the tree obtained by applying the Gaussian approximation to the bounds of the split measures.
C. Platform
The proposed algorithm will be developed and deployed using Java's EJB modules. MySQL will be used for database-related operations in the backend.
V. CONCLUSION
The proliferation of data streams has driven the development of stream mining algorithms. Mining online high-speed data streams poses a number of difficulties for researchers. Due to limited resources and critical time constraints, many summarization and approximation methods have been borrowed from statistics and computational theory. The expected mean used in Hoeffding's theorem is not universally valid; a suitable technique to resolve this shortcoming is McDiarmid's theorem. With the use of Gaussian approximations on the obtained McDiarmid bounds, the system's efficiency can be drastically enhanced and the number of training examples needed to select a splitting criterion reduced.
In this paper we have tried to address a few of these issues, but many open problems and fresh challenges still demand attention; if they are tackled efficiently, data streams will play a major role in every area of our lives.
ACKNOWLEDGMENT
The authors would like to express thanks to the reviewers for helpful comments.
REFERENCES
[1] L. Rutkowski, L. Pietruczuk, P. Duda, and M. Jaworski, "Decision Trees for Mining Data Streams Based on the McDiarmid's Bound," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, 2013.
[2] C. Aggarwal, Data Streams: Models and Algorithms. Springer, 2007.
[3] A. Bifet and R. Kirkby, Data Stream Mining: A Practical Approach, technical report, Univ. of Waikato, 2009.
[4] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees. Chapman and Hall, 1993.
[5] P. Domingos and G. Hulten, "Mining High-Speed Data Streams," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 71-80, 2000.
[6] W. Fan, Y. Huang, and P.S. Yu, "Decision Tree Evolution Using Limited Number of Labeled Data Items from Drifting Data Streams," Proc. IEEE Fourth Int'l Conf. Data Mining, pp. 379-382, 2004.
[7] M. Gaber, A. Zaslavsky, and S. Krishnaswamy, “Mining Data Streams: A Review,” ACM SIGMOD Record, vol. 34, no. 2, pp. 18-26, June 2005.
[8] J. Gama, R. Fernandes, and R. Rocha, “Decision Trees for Mining Data Streams,” Intelligent Data Analysis, vol. 10, no. 1, pp. 23-45, Mar. 2006.
[9] J. Han and M. Kamber, Data Mining: Concepts and Techniques, second ed. Elsevier, 2006.
[10] W. Hoeffding, “Probability Inequalities for Sums of Bounded Random Variables,” J. Am. Statistical Assoc., vol. 58, no. 301, pp. 13-30, Mar. 1963.
[11] G. Hulten, L. Spencer, and P. Domingos, "Mining Time-Changing Data Streams," Proc. Seventh ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 97-106, 2001.
[12] R. Kirkby, "Improving Hoeffding Trees," PhD dissertation, University of Waikato, Hamilton, 2007.
[13] X. Li, J.M. Barajas, and Y. Ding, "Collaborative Filtering on Streaming Data with Interest-Drifting," Intelligent Data Analysis, vol. 11, no. 1, pp. 75-87, 2007.
[14] B. Pfahringer, G. Holmes, and R. Kirkby, “New Options for Hoeffding Trees,” Proc. 20th Australian Joint Conf. Advances in Artificial Intelligence, pp. 90-99, 2007.
[15] J.R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[16] W. Fan, Y. Huang, H. Wang, and P.S. Yu, "Active Mining of Data Streams," Proc. SDM, 2004.
[17] C. Franke, Adaptivity in Data Stream Mining, PhD dissertation, University of California, Davis, 2009.