Track 2: Data Mining 
Third Post Graduate Symposium on Computer Engineering cPGCON2014 Organized by department of Computer Engineering, 
MCERC Nasik 
Mining Data Streams Based on the Improved McDiarmid’s Bound
Ms. Poonam Debnath and Prof. Santosh Kumar Chobe, Department of Computer Engineering, University of Pune
Abstract: Complex analysis of data streams is becoming a popular field of research because the information collected is prone to concept drift or complete concept shift. The preprocessing, storage, querying, and mining of such data sets are computationally challenging tasks. Mining data streams means extracting knowledge structures, represented as models and patterns, from non-stopping streams of information. Traditionally, Hoeffding's bound has been widely used to decide how many learning samples are needed at a node before a split attribute is selected. In this paper, we present the theoretical foundations for tightening the bounds obtained by the McDiarmid's tree algorithm and for improving the processing efficiency of the stream mining system by applying Gaussian approximations to the bounds.
Index Terms: Data streams, Decision trees, Gaussian approximation, Hoeffding's bound, McDiarmid's bound.
I. INTRODUCTION 
Recently a new class of emerging applications has become widely recognized: applications in which data is generated at very high rates in the form of transient data streams. In the data stream model, individual data items may be relational tuples, call records, web page visits, sensor readings, and so on. However, the continuous arrival of data in multiple, rapid, time-varying, unpredictable, and unbounded streams opens new fundamental research problems. The rapid generation of continuous streams of information challenges the storage, computation, and communication capabilities of a computing system. The very large amounts of data arriving at high speed call for semi-automated interactive techniques that extract hidden knowledge in real time.
Typical data mining tasks include concept description, regression analysis, association mining, outlier analysis, classification, and clustering. These techniques find interesting patterns by tracing regularities and anomalies in the data set. However, traditional data mining techniques cannot be applied directly to the data stream model, because most of them require multiple scans of the data, which is impractical for stream data. The number of past events is usually unbounded, so they are either dropped after processing or archived separately in secondary storage. More importantly, the characteristics of the data stream can change over time, and the evolving pattern needs to be recorded. Resource allocation must also be considered: because of the volume and speed of streaming data, stream mining algorithms must cope with the load they place on the system. How to achieve the best results under various resource constraints is therefore a challenging problem.
Initially, decision trees developed for conventional data mining were adapted to stream data as well. The difficulty lies in ensuring that an attribute selected from N examples is equally good for an infinite number of examples. The goal was to calculate the heuristic value from the N training examples and then use the result to split the learning sample space. At first, the Hoeffding tree algorithm, based on Hoeffding's inequality and Hoeffding's bound, was used for knowledge discovery in data streams. Hoeffding's bound states that, with probability $1 - \delta$, the true mean of a random variable of range $R$ does not differ from the mean estimated after $N$ independent trials by more than:
$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2N}} \qquad (1)$$
A review of the techniques used in data stream studies motivated us to revise the existing approaches to improve the performance of data stream mining systems.
In this paper, we show that: Methods based on McDiarmid's inequality require a very large number of stream examples at a node. By using Gaussian approximation techniques we can tighten the bounds and reduce the number of training samples needed to select the split criterion. Hoeffding's inequality is not sufficient to solve the underlying problem in the general case, so the existing methods should be adjusted. McDiarmid's inequality, used appropriately, is an efficient tool for coping with these problems in the data stream model.
II. RELATED WORK 
One of the earliest works on managing and evaluating data streams was carried out by P. Domingos and G. Hulten [5], who proposed VFDT, which could generate a decision tree under strict time and memory constraints. C. Aggarwal [2] studied and proposed a two-phase technique for very large time series problems. In the first phase, sliding window clusters are created, and these are later used to mine association rules from the streams. The effect of concept drift was examined by A. Bifet and R. Kirkby [3], who evaluated online streams with complete concept shifts. To handle such concepts they used ensembles in their experiments. G. Hulten, L. Spencer, and P. Domingos [11] observed that the basic assumption of a machine learning system for a streaming model does not hold, because the data source distribution is never stationary or predictable. They proposed an algorithm for handling continuously changing data streams, called CVFDT, a variant of VFDT. They have worked on the complexity of target examples. Most of the approaches used the underlying concept of Hoeffding's inequality, explained by W. Hoeffding [10]. The problem with adapting this theory to data stream mining was that Hoeffding's inequality considers only numeric-valued data, whereas real-world data is unpredictable. R. Kirkby [12] observed the flow in the streaming model and proposed enhancements to the Hoeffding tree algorithm. He labeled a fraction of the stream sample and used a semi-supervised approach to create clusters from the data set. B. Pfahringer, G. Holmes, and R. Kirkby [14] used option trees instead of clustering in their algorithm. The use of option trees helped improve the efficiency of Hoeffding's bounds. Recently, L. Rutkowski, L. Pietruczuk, P. Duda, and M. Jaworski [1] have shown that the use of McDiarmid's inequality is the correct way to analyze high-speed, time-changing data streams.
III. IMPLEMENTATION DETAILS 
Conventional data mining techniques require several passes over the data to mine knowledge, but this is not feasible in the stream model. In practice it is not possible to store an entire stream or scan through it many times because of its volume. Moreover, data streams evolve over time and are subject to severe concept drift and complete concept shift.
A. Hoeffding’s Inequality 
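Let $Y_1, Y_2, \ldots, Y_N$ be random variables with values drawn from a distribution. Let $Y_i \in [0, R]$ for $i = 1, \ldots, N$, where $R$ is a constant. Let the expected value of the distribution be $E[Y]$ and let the sample mean be denoted by $\bar{Y}$.
Let $Y_i$, for $i = 1, \ldots, N$, be independent random variables such that $\Pr(Y_i \in [a_i, b_i]) = 1$. Then for $S = \sum_{i=1}^{N} Y_i$ and for all $\epsilon > 0$ these inequalities are valid:
$$\Pr(S - E[S] \ge \epsilon) \le \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{N}(b_i - a_i)^2}\right) \qquad (2)$$
$$\Pr(|S - E[S]| \ge \epsilon) \le 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{N}(b_i - a_i)^2}\right) \qquad (3)$$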
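As an illustration of how inequality (2) yields a sample-size rule, the sketch below assumes every $Y_i$ lies in $[0, R]$ and applies (2) to the sample mean, which gives the commonly quoted form $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2N)}$. The function names are hypothetical and are not taken from the paper.

```python
import math


def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """Deviation epsilon such that the sample mean of n values in
    [0, value_range] exceeds the true mean by epsilon or more with
    probability at most delta (one-sided Hoeffding bound)."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def samples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which hoeffding_epsilon(value_range, delta, n) <= epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # Illustrative parameters only: range 1, confidence 1 - 1e-7.
    for n in (100, 1_000, 10_000):
        print(n, hoeffding_epsilon(1.0, 1e-7, n))
    print(samples_needed(1.0, 1e-7, 0.1))
```

Because $\epsilon$ shrinks as $1/\sqrt{N}$, halving the tolerated deviation requires four times as many examples at the node.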
B. Proposed Architecture 
To tighten the bounds obtained for the split measures, we apply a Gaussian approximation to the bounds.
Fig. 1. Information Flow Path.
Preprocessing: The data set that has to be grouped together is chosen for detecting the native structure in the document space using correlation analysis. The next step is to remove the null words and write the remaining contents to a new folder (stop words).
McDiarmid's inequality: Suppose an attribute $a$ can take one of $|a|$ different values from the set $A = \{a_1, \ldots, a_{|a|}\}$ and $\{k_1, \ldots, k_K\}$ is a set of different classes. Then let:
$$Z = \{X_1, \ldots, X_N\} \qquad (4)$$
be the training set of size $N$, where $X_1, \ldots, X_N$ are independent random variables defined as follows:
$$X_i = (a_{j_i}, b_{l_i}, \ldots, k_{q_i}) \qquad (5)$$
for $i = 1, \ldots, N$, $j_i \in \{1, \ldots, |a|\}$, $l_i \in \{1, \ldots, |b|\}$, $q_i \in \{1, \ldots, K\}$. Each element of $Z$ belongs to one of the $K$ different classes $k_j$. The entropy associated with the classification of $Z$ is defined as:
$$H(Z) = -\sum_{j=1}^{K} p_j \log_2 p_j \qquad (6)$$
where $p_j$ is the probability that an element of $Z$ comes from class $k_j$. We estimate this probability by $n_j / N$, where $n_j$ is the number of elements from class $k_j$. Then
$$H(Z) = -\sum_{j=1}^{K} \frac{n_j}{N} \log_2 \frac{n_j}{N} \qquad (7)$$
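The estimate of the entropy from class counts can be sketched as follows. This is a minimal illustration that assumes the class labels of the elements of $Z$ are available as a plain Python list; the function name is hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Empirical entropy (in bits) of a list of class labels, with each
    class probability p_j estimated by the relative frequency n_j / N."""
    n = len(labels)
    counts = Counter(labels)  # n_j for every class that actually occurs
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    print(entropy(["yes", "yes", "no", "no"]))  # two equally likely classes
    print(entropy(["yes"] * 4))                 # a pure set
```

Classes that do not occur in the sample contribute nothing to the sum, so they never trigger a $\log_2 0$ evaluation.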
Choose an attribute $a$ characterizing the elements of set $Z$. Then $Z_{a_i}$ denotes the set of elements of $Z$ for which the value of $a$ is $a_i$. The number of elements in set $Z_{a_i}$ is
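denoted $n_{a_i}$. Then the weighted entropy for attribute $a$ and set $Z$ is given by:
$$H_a(Z) = \sum_{i=1}^{|a|} \frac{n_{a_i}}{N} H(Z_{a_i}) \qquad (8)$$
where
$$H(Z_{a_i}) = -\sum_{j=1}^{K} \frac{n_{a_i}^{j}}{n_{a_i}} \log_2 \frac{n_{a_i}^{j}}{n_{a_i}} \qquad (9)$$
and $n_{a_i}^{j}$ denotes the number of elements in set $Z_{a_i}$ from class $k_j$. The information gain for attribute $a$ is given by:
$$\mathrm{Gain}_a(Z) = H(Z) - H_a(Z) \qquad (10)$$
Let us assume that $a$ is the attribute with the highest value of information gain, while $b$ is the second-best attribute. We denote the difference of their gains by
$$f(Z) = \mathrm{Gain}_a(Z) - \mathrm{Gain}_b(Z) \qquad (11), (12)$$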
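The weighted entropy, the information gain, and the difference between the gains of the two best attributes can be sketched as below. The data layout (one list of attribute values and one list of class labels) and all function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Empirical entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def weighted_entropy(values, labels):
    """Entropy of the labels after partitioning the examples by attribute
    value, each partition weighted by its share of the examples."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def information_gain(values, labels):
    """Entropy of the whole set minus the weighted entropy for the attribute."""
    return entropy(labels) - weighted_entropy(values, labels)


if __name__ == "__main__":
    labels = ["yes", "yes", "no", "no"]
    attr_a = ["x", "x", "y", "y"]   # separates the classes perfectly
    attr_b = ["x", "y", "x", "y"]   # carries no class information
    gain_a = information_gain(attr_a, labels)
    gain_b = information_gain(attr_b, labels)
    print(gain_a, gain_b, gain_a - gain_b)
```

In a stream setting the per-value class counts would be maintained incrementally at each leaf rather than recomputed from stored examples; the batch form is used here only to keep the sketch short.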
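Let $Z$, given by (4), be the set of independent random variables, with $X_i$ taking values in a set $A_i$ for each $i$. Let us define:
$$Z' = \{X_1, \ldots, X_{i-1}, \hat{X}_i, X_{i+1}, \ldots, X_N\} \qquad (13)$$
with $\hat{X}_i$ taking values in $A_i$. Observe that $Z'$ differs from $Z$ only in the $i$th element. We will use McDiarmid's inequality. Suppose that the function
$$f : \prod_{i=1}^{N} A_i \to \mathbb{R} \qquad (14)$$
satisfies
$$\sup_{X_1, \ldots, X_N, \hat{X}_i} |f(Z) - f(Z')| \le C_i \qquad (15)$$
for some constants $C_i$. Then
$$\Pr(f(Z) - E[f(Z)] \ge \epsilon) \qquad (16)$$
$$\le \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{N} C_i^2}\right) \qquad (17)$$
Proposed Architecture: The conceptual user interaction model of the proposed system is as follows: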
Fig. 2. System Architecture. 
Mathematical model for information gain: Traditional ID3 uses information gain as the split measure for choosing the best attribute. When McDiarmid's algorithm is used, the following theorem guarantees that a decision tree learning system applied to data streams produces output nearly identical to that of a conventional learner. Let $Z = \{X_1, \ldots, X_N\}$ be any set of random variables, each of them taking values in $A \times B \times \ldots$. Then, for any fixed $\delta$ and any pair of attributes $a$ and $b$, where $f(Z) = \mathrm{Gain}_a(Z) - \mathrm{Gain}_b(Z) > 0$, if
$$\epsilon = C_{Gain}(K, N) \sqrt{\frac{\ln(1/\delta)}{2N}} \qquad (18)$$
then
$$\Pr(f(Z) - E[f(Z)] > \epsilon) \le \delta \qquad (19)$$
where
$$C_{Gain}(K, N) = 6(K \log_2 eN + \log_2 2N) + 2\log_2 K \qquad (20)$$
Mathematical model for Gini index: The Gini index is usually used in CART and measures the impurity of the training set. The corresponding theorem using McDiarmid's bound is as follows. Let $Z = \{X_1, \ldots, X_N\}$ be any set of random variables, each of them taking values in $A \times B \times \ldots$. Then, for any fixed $\delta$ and any pair of attributes $a$ and $b$, where $f(Z)$ is the difference between the Gini-based split measures of $a$ and $b$ and $f(Z) > 0$, if
$$\epsilon = 8 \sqrt{\frac{\ln(1/\delta)}{2N}} \qquad (21)$$
then $\Pr(f(Z) - E[f(Z)] > \epsilon) \le \delta$.
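To give a feel for the size of these thresholds, the sketch below evaluates $C_{Gain}(K, N)$ from (20) and the resulting $\epsilon$ for both split measures. It assumes that both thresholds have the form $\epsilon = C \sqrt{\ln(1/\delta) / (2N)}$, with $C = C_{Gain}(K, N)$ for information gain and $C = 8$ for the Gini index, following the presentation in [1]; that form and the function names are assumptions of this sketch.

```python
import math


def c_gain(num_classes: int, n: int) -> float:
    """C_Gain(K, N) = 6 (K log2(eN) + log2(2N)) + 2 log2(K), as in (20)."""
    return (6.0 * (num_classes * math.log2(math.e * n) + math.log2(2 * n))
            + 2.0 * math.log2(num_classes))


def epsilon_info_gain(num_classes: int, n: int, delta: float) -> float:
    """Assumed threshold for the information-gain difference after n examples."""
    return c_gain(num_classes, n) * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def epsilon_gini(n: int, delta: float) -> float:
    """Assumed threshold for the Gini-based difference after n examples."""
    return 8.0 * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


if __name__ == "__main__":
    # Illustrative parameters only: two classes, confidence 1 - 1e-7.
    for n in (1_000, 100_000, 10_000_000):
        print(n, epsilon_info_gain(2, n, 1e-7), epsilon_gini(n, 1e-7))
```

Because $C_{Gain}(K, N)$ itself grows with $\log_2 N$, the information-gain threshold falls more slowly than the Gini threshold, which illustrates why bounds of this kind call for very large numbers of examples at a node.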
C. Algorithm 
For convenience the following notation is used:
$\mathcal{A}$: set of all attributes.
$\alpha$: any attribute from set $\mathcal{A}$.
$\alpha_{MAX1}$: attribute with the highest value of the split function.
$\alpha_{MAX2}$: attribute with the second highest value of the split function.
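The text lists only the notation, so the following is a hedged sketch of how that notation is typically used in Hoeffding-style and McDiarmid-style trees [1], [5]: a leaf is split on the best attribute once its split-function value exceeds that of the second-best attribute by more than the threshold $\epsilon$. The decision rule, the dictionary layout, and the function name are assumptions and are not taken from the paper.

```python
from typing import Dict, Optional


def choose_split(split_values: Dict[str, float], epsilon: float) -> Optional[str]:
    """Return the attribute to split on, or None if more examples are needed.

    split_values maps each attribute to the value of the split function
    (for example information gain) computed from the examples seen so far
    at the leaf. epsilon is the confidence threshold for that sample size.
    """
    if len(split_values) < 2:
        return None  # a best and a second-best attribute are both required
    ranked = sorted(split_values, key=split_values.get, reverse=True)
    a_max1, a_max2 = ranked[0], ranked[1]
    if split_values[a_max1] - split_values[a_max2] > epsilon:
        return a_max1
    return None


if __name__ == "__main__":
    print(choose_split({"a": 0.9, "b": 0.2, "c": 0.1}, epsilon=0.5))
    print(choose_split({"a": 0.9, "b": 0.6}, epsilon=0.5))
```

Returning None leaves the leaf unchanged, so the caller keeps accumulating examples and re-evaluates once $\epsilon$ has shrunk.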
IV. RESULTS 
A. Data Set 
The target data sets are web logs, financial tickers, sensor feeds, and other massive, unordered, continuous data sets.
B. Expected Result Set 
The result set will comprise two modules. The first will generate a decision tree based on the McDiarmid's tree algorithm, using Gaussian bounds on the attribute selection function. The second will produce a graph comparing the traditional decision tree built by McDiarmid's algorithm with the decision tree obtained by applying the Gaussian approximation to the bounds of the split measures.
C. Platform 
The proposed algorithm will be developed and deployed with Java EJB modules. MySQL will be used for database operations in the back end.
V. CONCLUSION 
The proliferation of data streams has driven the development of stream mining algorithms. Mining online high-speed data streams poses a number of difficulties for researchers. Because of limited resources and critical time constraints, many summarization and approximation methods have been adopted from statistics and computational theory. The mean estimate used in Hoeffding's theorem is not valid universally. A suitable technique to overcome this limitation is McDiarmid's theorem. By applying Gaussian approximations to the McDiarmid's bounds obtained, the system can substantially improve its efficiency and reduce the number of training examples needed to select a splitting criterion.
In this paper we have tried to address a few of these issues, but many open issues and new challenges still demand attention. If those problems are tackled efficiently, data streams will play a major role in every area of our lives.
ACKNOWLEDGMENT 
The authors would like to express thanks to the reviewers for helpful comments. 
REFERENCES 
[1] L. Rutkowski, L. Pietruczuk, P. Duda, and M. Jaworski, “Decision Trees for Mining Data Streams Based on the McDiarmid's Bound,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, 2013.
[2] C. Aggarwal, Data Streams. Models and Algorithms. Springer, 2007. 
[3] A. Bifet and R. Kirkby, Data Stream Mining a Practical Approach, technical report, Univ. of Waikato, 2009. 
[4] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees. Chapman and Hall, 1993. 
[5] P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 71-80, 2000.
[6] W. Fan, Y. Huang, and P.S. Yu, “Decision Tree Evolution using Limited Number of Labeled Data Items from Drifting Data Streams,” Proc. IEEE Fourth Int'l Conf. Data Mining, pp. 379-382, 2004.
[7] M. Gaber, A. Zaslavsky, and S. Krishnaswamy, “Mining Data Streams: A Review,” ACM SIGMOD Record, vol. 34, no. 2, pp. 18-26, June 2005. 
[8] J. Gama, R. Fernandes, and R. Rocha, “Decision Trees for Mining Data Streams,” Intelligent Data Analysis, vol. 10, no. 1, pp. 23-45, Mar. 2006. 
[9] J. Han and M. Kamber, Data Mining: Concepts and Techniques, second ed. Elsevier, 2006. 
[10] W. Hoeffding, “Probability Inequalities for Sums of Bounded Random Variables,” J. Am. Statistical Assoc., vol. 58, no. 301, pp. 13-30, Mar. 1963. 
[11] G. Hulten, L. Spencer, and P. Domingos, “Mining Time-Changing Data Streams,” Proc. Seventh ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 97-106, 2001.
[12] R. Kirkby, “Improving Hoeffding Trees,” PhD dissertation, University of Waikato, Hamilton, 2007. 
[13] X. Li, J.M. Barajas, and Y. Ding, “Collaborative Filtering On Streaming Data With Interest-Drifting,” Int'l Intelligent Data Analysis, vol. 11, no. 1, pp. 75-87, 2007.
[14] B. Pfahringer, G. Holmes, and R. Kirkby, “New Options for Hoeffding Trees,” Proc. 20th Australian Joint Conf. Advances in Artificial Intelligence, pp. 90-99, 2007. 
[15] J.R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993. 
[16] W. Fan, Y. Huang, H. Wang, and P.S. Yu, “Active Mining of Data Streams,” Proc. SDM, 2004.
[17] C. Franke, “Adaptivity in Data Stream Mining,” PhD dissertation, University of California, Davis, 2009.

More Related Content

What's hot

PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)
PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)
PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)
neeraj7svp
 
IJSETR-VOL-3-ISSUE-12-3358-3363
IJSETR-VOL-3-ISSUE-12-3358-3363IJSETR-VOL-3-ISSUE-12-3358-3363
IJSETR-VOL-3-ISSUE-12-3358-3363
SHIVA REDDY
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
IJDKP
 
Big Data Clustering Model based on Fuzzy Gaussian
Big Data Clustering Model based on Fuzzy GaussianBig Data Clustering Model based on Fuzzy Gaussian
Big Data Clustering Model based on Fuzzy Gaussian
IJCSIS Research Publications
 
J41046368
J41046368J41046368
J41046368
IJERA Editor
 
A Novel Approach for Clustering Big Data based on MapReduce
A Novel Approach for Clustering Big Data based on MapReduce A Novel Approach for Clustering Big Data based on MapReduce
A Novel Approach for Clustering Big Data based on MapReduce
IJECEIAES
 
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET Journal
 
50120140505013
5012014050501350120140505013
50120140505013
IAEME Publication
 
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSERWITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
ijnlc
 
Extended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithmExtended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithm
IJMIT JOURNAL
 
K-means Clustering Method for the Analysis of Log Data
K-means Clustering Method for the Analysis of Log DataK-means Clustering Method for the Analysis of Log Data
K-means Clustering Method for the Analysis of Log Data
idescitation
 
A PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmA PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering Algorithm
IJORCS
 
A NOBEL HYBRID APPROACH FOR EDGE DETECTION
A NOBEL HYBRID APPROACH FOR EDGE  DETECTIONA NOBEL HYBRID APPROACH FOR EDGE  DETECTION
A NOBEL HYBRID APPROACH FOR EDGE DETECTION
ijcses
 
IRJET- Clustering of Hierarchical Documents based on the Similarity Deduc...
IRJET-  	  Clustering of Hierarchical Documents based on the Similarity Deduc...IRJET-  	  Clustering of Hierarchical Documents based on the Similarity Deduc...
IRJET- Clustering of Hierarchical Documents based on the Similarity Deduc...
IRJET Journal
 
A general weighted_fuzzy_clustering_algorithm
A general weighted_fuzzy_clustering_algorithmA general weighted_fuzzy_clustering_algorithm
A general weighted_fuzzy_clustering_algorithm
TA Minh Thuy
 
Classification of Iris Data using Kernel Radial Basis Probabilistic Neural Ne...
Classification of Iris Data using Kernel Radial Basis Probabilistic Neural Ne...Classification of Iris Data using Kernel Radial Basis Probabilistic Neural Ne...
Classification of Iris Data using Kernel Radial Basis Probabilistic Neural Ne...
Scientific Review
 
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONSSVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
ijscmcj
 
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...
ijsrd.com
 
AROPUB-IJPGE-14-30
AROPUB-IJPGE-14-30AROPUB-IJPGE-14-30
AROPUB-IJPGE-14-30
shirko mahmoudi
 

What's hot (19)

PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)
PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)
PSF_Introduction_to_R_Package_for_Pattern_Sequence (1)
 
IJSETR-VOL-3-ISSUE-12-3358-3363
IJSETR-VOL-3-ISSUE-12-3358-3363IJSETR-VOL-3-ISSUE-12-3358-3363
IJSETR-VOL-3-ISSUE-12-3358-3363
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
 
Big Data Clustering Model based on Fuzzy Gaussian
Big Data Clustering Model based on Fuzzy GaussianBig Data Clustering Model based on Fuzzy Gaussian
Big Data Clustering Model based on Fuzzy Gaussian
 
J41046368
J41046368J41046368
J41046368
 
A Novel Approach for Clustering Big Data based on MapReduce
A Novel Approach for Clustering Big Data based on MapReduce A Novel Approach for Clustering Big Data based on MapReduce
A Novel Approach for Clustering Big Data based on MapReduce
 
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
 
50120140505013
5012014050501350120140505013
50120140505013
 
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSERWITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
 
Extended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithmExtended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithm
 
K-means Clustering Method for the Analysis of Log Data
K-means Clustering Method for the Analysis of Log DataK-means Clustering Method for the Analysis of Log Data
K-means Clustering Method for the Analysis of Log Data
 
A PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmA PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering Algorithm
 
A NOBEL HYBRID APPROACH FOR EDGE DETECTION
A NOBEL HYBRID APPROACH FOR EDGE  DETECTIONA NOBEL HYBRID APPROACH FOR EDGE  DETECTION
A NOBEL HYBRID APPROACH FOR EDGE DETECTION
 
IRJET- Clustering of Hierarchical Documents based on the Similarity Deduc...
IRJET-  	  Clustering of Hierarchical Documents based on the Similarity Deduc...IRJET-  	  Clustering of Hierarchical Documents based on the Similarity Deduc...
IRJET- Clustering of Hierarchical Documents based on the Similarity Deduc...
 
A general weighted_fuzzy_clustering_algorithm
A general weighted_fuzzy_clustering_algorithmA general weighted_fuzzy_clustering_algorithm
A general weighted_fuzzy_clustering_algorithm
 
Classification of Iris Data using Kernel Radial Basis Probabilistic Neural Ne...
Classification of Iris Data using Kernel Radial Basis Probabilistic Neural Ne...Classification of Iris Data using Kernel Radial Basis Probabilistic Neural Ne...
Classification of Iris Data using Kernel Radial Basis Probabilistic Neural Ne...
 
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONSSVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
 
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...
 
AROPUB-IJPGE-14-30
AROPUB-IJPGE-14-30AROPUB-IJPGE-14-30
AROPUB-IJPGE-14-30
 

Similar to ME Synopsis

IEEE Datamining 2016 Title and Abstract
IEEE  Datamining 2016 Title and AbstractIEEE  Datamining 2016 Title and Abstract
IEEE Datamining 2016 Title and Abstract
tsysglobalsolutions
 
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
cscpconf
 
Reduct generation for the incremental data using rough set theory
Reduct generation for the incremental data using rough set theoryReduct generation for the incremental data using rough set theory
Reduct generation for the incremental data using rough set theory
csandit
 
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval
IJECEIAES
 
Journal paper 1
Journal paper 1Journal paper 1
Journal paper 1
Editor IJCATR
 
A Threshold Fuzzy Entropy Based Feature Selection: Comparative Study
A Threshold Fuzzy Entropy Based Feature Selection:  Comparative StudyA Threshold Fuzzy Entropy Based Feature Selection:  Comparative Study
A Threshold Fuzzy Entropy Based Feature Selection: Comparative Study
IJMER
 
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...
ijcsit
 
Particle Swarm Optimization based K-Prototype Clustering Algorithm
Particle Swarm Optimization based K-Prototype Clustering Algorithm Particle Swarm Optimization based K-Prototype Clustering Algorithm
Particle Swarm Optimization based K-Prototype Clustering Algorithm
iosrjce
 
I017235662
I017235662I017235662
I017235662
IOSR Journals
 
[20240513_LabSeminar_Huy]GraphFewShort_Transfer.pptx
[20240513_LabSeminar_Huy]GraphFewShort_Transfer.pptx[20240513_LabSeminar_Huy]GraphFewShort_Transfer.pptx
[20240513_LabSeminar_Huy]GraphFewShort_Transfer.pptx
thanhdowork
 
Web image annotation by diffusion maps manifold learning algorithm
Web image annotation by diffusion maps manifold learning algorithmWeb image annotation by diffusion maps manifold learning algorithm
Web image annotation by diffusion maps manifold learning algorithm
ijfcstjournal
 
Physics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learningPhysics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learning
KAMAL CHOUDHARY
 
IJCSIT
IJCSITIJCSIT
A Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data MiningA Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data Mining
Natasha Grant
 
Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932
Editor IJARCET
 
Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932
Editor IJARCET
 
UHDMML.pps
UHDMML.ppsUHDMML.pps
UHDMML.pps
butest
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
acijjournal
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
Critical Paths Identification on Fuzzy Network Project
Critical Paths Identification on Fuzzy Network ProjectCritical Paths Identification on Fuzzy Network Project
Critical Paths Identification on Fuzzy Network Project
iosrjce
 

Similar to ME Synopsis (20)

IEEE Datamining 2016 Title and Abstract
IEEE  Datamining 2016 Title and AbstractIEEE  Datamining 2016 Title and Abstract
IEEE Datamining 2016 Title and Abstract
 
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
 
Reduct generation for the incremental data using rough set theory
Reduct generation for the incremental data using rough set theoryReduct generation for the incremental data using rough set theory
Reduct generation for the incremental data using rough set theory
 
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval
 
Journal paper 1
Journal paper 1Journal paper 1
Journal paper 1
 
A Threshold Fuzzy Entropy Based Feature Selection: Comparative Study
A Threshold Fuzzy Entropy Based Feature Selection:  Comparative StudyA Threshold Fuzzy Entropy Based Feature Selection:  Comparative Study
A Threshold Fuzzy Entropy Based Feature Selection: Comparative Study
 
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...
ME Synopsis

  • 1. Track 2: Data Mining
Third Post Graduate Symposium on Computer Engineering cPGCON2014, organized by the Department of Computer Engineering, MCERC Nasik

Mining Data Streams Based on the Improved McDiarmid’s Bound
Ms. Poonam Debnath and Prof. Santosh Kumar Chobe, Department of Computer Engineering, University of Pune

Abstract: Complex analysis of data streams is becoming a popular field of research because the information collected is prone to concept drift or complete concept shift. Preprocessing, storing, querying and mining such data sets are computationally demanding tasks. Mining data streams means extracting knowledge structures, represented as models and patterns, from non-stopping streams of information. Traditionally, Hoeffding’s bound has been widely used to decide how many learning samples are needed at a node before a split attribute can be selected. In this paper, we present the theoretical foundations for tightening the bounds used by the McDiarmid’s tree algorithm, and we improve the processing efficiency of the stream mining system by applying Gaussian approximations to those bounds.

Index Terms: Data streams, Decision trees, Gaussian approximation, Hoeffding’s bound, McDiarmid’s bound.

I. INTRODUCTION
Recently, a new class of emerging applications has become widely recognized: applications in which data is generated at very high rates in the form of transient data streams. In the data stream model, individual data items may be relational tuples, call records, web page visits, sensor readings, and so on. However, the continuous arrival of data in multiple, rapid, time-varying, unpredictable and unbounded streams opens new fundamental research problems. The rapid generation of continuous streams of information challenges the storage, computation and communication capabilities of a computing system.
The huge amounts of data arriving at high speed call for semi-automated interactive techniques that extract hidden knowledge in real time. Typical data mining tasks include concept description, regression analysis, association mining, outlier analysis, classification, and clustering. These techniques find interesting patterns and trace regularities and anomalies in the data set.

However, traditional data mining techniques cannot be applied directly to the data stream model, because most of them require multiple scans of the data, which is impractical for a stream. The number of past events is usually unbounded, so they must either be dropped after processing or archived separately in secondary storage. More importantly, the characteristics of the data stream can change over time, and the evolving pattern needs to be recorded. Resource allocation must also be considered: because of the volume and speed of streaming data, stream mining algorithms must cope with the load they place on the system. Achieving the best possible results under various resource constraints is therefore a challenging problem.

Initially, decision trees developed for conventional data mining were adapted to deal with stream data as well. The difficulty lies in ensuring that an attribute selected from N examples would also be the best choice for an infinite stream of examples. The aim was to calculate the heuristic value from the N training examples and then use the result to split the learning sample space. At first, the Hoeffding’s tree algorithm, based on Hoeffding’s inequality and Hoeffding’s bound, was used for knowledge discovery in data streams.
Hoeffding’s bound states that, with probability $1 - \delta$, the true mean of a random variable of range $R$ does not differ from the sample mean computed after $N$ independent trials by more than:

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2N}} \qquad (1)$$

A review of the techniques and interpretations used in data stream research prompted us to amend the existing approaches to improve the performance of data stream mining systems. In this paper, we show that:
(i) Methods based on McDiarmid’s inequality call for a very large number of stream examples at the node. By using Gaussian approximation techniques we can tighten the bounds and reduce the number of training samples needed to select the split criterion.
(ii) Hoeffding’s inequality is not sufficient to solve the fundamental problem in the general case, so all existing methods based on it should be adjusted.
(iii) McDiarmid’s inequality, used in an appropriate way, is an efficient tool for coping with the difficulties of the data stream model.
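To make the role of Hoeffding's bound concrete, the following minimal sketch evaluates the standard form $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2N)}$ and inverts it to estimate how many examples a node would need for a target $\epsilon$. The function names `hoeffding_bound` and `samples_needed` are illustrative assumptions and do not come from the paper.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon of Hoeffding's bound: sqrt(R^2 * ln(1/delta) / (2 * N))."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if n <= 0:
        raise ValueError("n must be a positive number of examples")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def samples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest N for which the bound is at most epsilon (hypothetical helper)."""
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # The bound shrinks as more examples are observed at the node.
    for n in (100, 1_000, 10_000):
        print(n, round(hoeffding_bound(1.0, 0.05, n), 4))
    print(samples_needed(1.0, 0.05, 0.05))
```

Because $\epsilon$ decreases only as $1/\sqrt{N}$, halving the tolerance roughly quadruples the number of examples required, which is why the choice of bound matters for stream processing cost.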
  • 2. II. RELATED WORK
One of the earliest works on managing and evaluating data streams was carried out by P. Domingos and G. Hulten [5], who proposed VFDT, which can build a decision tree from examples under strict time and memory constraints. C. Aggarwal [2] studied and proposed a two-phase technique for very large time series problems: in the first phase sliding-window clusters are created, and these are later used to mine association rules from the streams. The effect of concept drift was examined by A. Bifet and R. Kirkby [3], who evaluated online streams with complete concept shifts and used ensembles in their experiments to handle them.

G. Hulten, L. Spencer, and P. Domingos [11] observed that the basic assumption of a machine learning system does not hold in a streaming model, because the data source distribution is never stationary or predictable. They proposed an algorithm for handling continuously changing data streams, called CVFDT, a variant of VFDT, and studied the complexity of target examples.

Most of these approaches rely on Hoeffding’s inequality, introduced by W. Hoeffding [10]. The problem with adopting this theory in data stream mining is that Hoeffding’s inequality considers only numeric-valued data, whereas real-world data is unpredictable. R. Kirkby [12] studied the flow in the streaming model and proposed enhancements to the Hoeffding’s tree algorithm: he labeled a fraction of the stream sample and used a semi-supervised approach to create clusters from the data set. B. Pfahringer, G. Holmes, and R. Kirkby [14] used option trees instead of clustering in their algorithm, which helped improve the efficiency of Hoeffding’s bounds. Recently L. Rutkowski, L. Pietruczuk, P. Duda, M.
Jaworski [1] have shown that the use of McDiarmid’s inequality is the correct way to analyze high-speed, time-changing data streams.

III. IMPLEMENTATION DETAILS
Conventional data mining techniques require several passes over the data to extract knowledge, which is not feasible in the stream model. In practice, an entire stream can be neither stored nor scanned repeatedly because of its sheer volume. Moreover, data streams evolve over time and are subject to severe concept drift and complete concept shift.

A. Hoeffding’s Inequality
Let $Y_1, Y_2, \ldots, Y_N$ be random variables drawn from a distribution, with $Y_i \in [0, R]$ for $i = 1, \ldots, N$, where $R$ is a constant. Let the expected value of the distribution be $E[Y]$ and let the sample mean be denoted by $\bar{Y}$. More generally, let $Y_i$, $i = 1, \ldots, N$, be independent random variables such that $\Pr(Y_i \in [a_i, b_i]) = 1$. Then for $S = \sum_{i=1}^{N} Y_i$ and all $\epsilon > 0$ the following inequalities hold:

$$\Pr(S - E[S] \ge \epsilon) \le \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{N}(b_i - a_i)^2}\right) \qquad (2)$$

$$\Pr(|S - E[S]| \ge \epsilon) \le 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{N}(b_i - a_i)^2}\right) \qquad (3)$$

B. Proposed Architecture
To improve the bounds obtained for the split measures, we apply a Gaussian approximation to the bounds.

Fig. 1. Information Flow Path.

Preprocessing: The data set to be grouped is chosen so that the native structure of the document space can be detected using correlation analysis. The next step is to remove the null words and write the contents, after removal of the null words, to a new folder, "stop words".

McDiarmid’s inequality: Suppose an attribute $a$ can take one of $|a|$ different values from the set $A = \{a_1, \ldots, a_{|a|}\}$, and let $\{k_1, \ldots, k_K\}$ be the set of $K$ different classes. Then let

$$Z = \{X_1, \ldots, X_N\} \qquad (4)$$

be the training set of size $N$, where $X_1, \ldots, X_N$ are independent random variables defined as follows:

$$X_i = (a_{j_i}, b_{l_i}, \ldots, k_{q_i}) \qquad (5)$$

for $i = 1, \ldots, N$, $j_i \in \{1, \ldots, |a|\}$, $l_i \in \{1, \ldots, |b|\}$, $q_i \in \{1, \ldots, K\}$. Each element of $Z$ belongs to one of the $K$ different classes $k_j$. The entropy associated with the classification of $Z$ is defined as:

$$H(Z) = -\sum_{j=1}^{K} p_j \log_2 p_j \qquad (6)$$

where $p_j$ is the probability that an element of $Z$ comes from class $k_j$. We estimate this probability by $n_j / N$, where $n_j$ is the number of elements from class $k_j$. Then

$$H(Z) = -\sum_{j=1}^{K} \frac{n_j}{N} \log_2 \frac{n_j}{N} \qquad (7)$$

Choose an attribute $a$ characterizing the elements of set $Z$. Then $Z_{a_i}$ denotes the set of elements of $Z$ for which the value of $a$ is $a_i$. The number of elements in set $Z_{a_i}$ is denoted by $n_{a_i}$. The weighted entropy for attribute $a$ and set $Z$ is then given by:

$$H_a(Z) = \sum_{i=1}^{|a|} \frac{n_{a_i}}{N} H(Z_{a_i}) \qquad (8)$$

where

$$H(Z_{a_i}) = -\sum_{j=1}^{K} \frac{n_{a_i}^{j}}{n_{a_i}} \log_2 \frac{n_{a_i}^{j}}{n_{a_i}} \qquad (9)$$

and $n_{a_i}^{j}$ denotes the number of elements in set $Z_{a_i}$ from class $k_j$. The information gain for attribute $a$ is given by:

$$\mathrm{Gain}_a(Z) = H(Z) - H_a(Z) \qquad (10)$$

Let us assume that $a$ is the attribute with the highest value of information gain, while $b$ is the second-best attribute. Define

$$f(Z) = \mathrm{Gain}_a(Z) - \mathrm{Gain}_b(Z) \qquad (11)$$

$$\phantom{f(Z)} = H_b(Z) - H_a(Z) \qquad (12)$$

Let $Z$, given by (4), be the set of independent random variables, with $X_i$ taking values in a set $A_i$ for each $i$. Let us define:

$$Z' = \{X_1, \ldots, X_{i-1}, \hat{X}_i, X_{i+1}, \ldots, X_N\} \qquad (13)$$

with $\hat{X}_i$ taking values in $A_i$. Observe that $Z'$ differs from $Z$ only in the $i$th element. We will use McDiarmid’s inequality. Suppose that the function

$$f : \prod_{i=1}^{N} A_i \to \mathbb{R} \qquad (14)$$

satisfies

$$|f(Z) - f(Z')| \le C_i \qquad (15)$$

for all $X_1, \ldots, X_N, \hat{X}_i$ and for some constants $C_i$. Then

$$\Pr(f(Z) - E[f(Z)] \ge \epsilon) \qquad (16)$$

$$\le \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{N} C_i^2}\right) \qquad (17)$$

Proposed Architecture: The conceptual user interaction model of the proposed system is as follows:

Fig. 2. System Architecture.

Mathematical model for information gain: Traditional ID3 uses information gain as the split measure for choosing the best attribute. When McDiarmid’s bound is used, the following theorem guarantees that a decision tree learning system applied to data streams produces output that is nearly identical to that of a conventional learner. Let $Z = \{X_1, \ldots, X_N\}$ be any set of independent random variables, each taking values in $A \times B \times \ldots$. Then, for any fixed $\delta$ and any pair of attributes $a$ and $b$, where $f(Z) = \mathrm{Gain}_a(Z) - \mathrm{Gain}_b(Z) > 0$, if

$$\epsilon = C_{Gain}(K, N) \sqrt{\frac{\ln(1/\delta)}{2N}} \qquad (18)$$

then

$$\Pr(f(Z) - E[f(Z)] > \epsilon) \le \delta \qquad (19)$$

where

$$C_{Gain}(K, N) = 6\,(K \log_2 eN + \log_2 2N) + 2 \log_2 K \qquad (20)$$

Mathematical model for Gini index: The Gini index is usually used in CART and measures the impurity of the training set. The corresponding theorem using McDiarmid’s bound is as follows. Let $Z = \{X_1, \ldots, X_N\}$ be any set of independent random variables, each taking values in $A \times B \times \ldots$. Then, for any fixed $\delta$ and any pair of attributes $a$ and $b$, where $f(Z)$ is now the difference between the Gini gains of $a$ and $b$ and $f(Z) > 0$, if

$$\epsilon = 8 \sqrt{\frac{\ln(1/\delta)}{2N}} \qquad (21)$$

then the bound (19) holds for this $f(Z)$.

C. Algorithm
For convenience, the following notation is used:
À: set of all attributes.
α: any attribute from set À.
$\alpha_{MAX1}$: attribute with the highest value of the split function.
$\alpha_{MAX2}$: attribute with the second highest value of the split function.
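The quantities defined above can be tied together in a short sketch. It computes entropy and information gain from observed examples, evaluates $C_{Gain}(K, N) = 6(K \log_2 eN + \log_2 2N) + 2\log_2 K$, and applies a split test. Two things here are assumptions rather than statements from the text: the form $\epsilon = C_{Gain}(K, N)\sqrt{\ln(1/\delta)/(2N)}$, which is what McDiarmid's inequality yields when each example can change the gain difference by at most $C_{Gain}(K, N)/N$, and the decision rule "split when the gain of the best attribute exceeds that of the second-best by more than $\epsilon$". All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Empirical entropy: -sum_j (n_j / N) * log2(n_j / N)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after partitioning by attribute value."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


def c_gain(k, n):
    """C_Gain(K, N) = 6 * (K * log2(e * N) + log2(2 * N)) + 2 * log2(K)."""
    return 6 * (k * math.log2(math.e * n) + math.log2(2 * n)) + 2 * math.log2(k)


def mcdiarmid_epsilon(k, n, delta):
    """Assumed form: epsilon = C_Gain(K, N) * sqrt(ln(1/delta) / (2 * N))."""
    return c_gain(k, n) * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def choose_split(gains, k, n, delta):
    """Return the best attribute if it beats the runner-up by more than epsilon, else None.

    `gains` maps attribute name -> information gain (needs at least two attributes).
    This decision rule is an assumption made for illustration.
    """
    ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)
    (best, g1), (_, g2) = ranked[0], ranked[1]
    return best if g1 - g2 > mcdiarmid_epsilon(k, n, delta) else None


if __name__ == "__main__":
    labels = ["yes", "yes", "no", "no"]
    print(information_gain([0, 0, 1, 1], labels))  # attribute separates the classes
    print(information_gain([0, 1, 0, 1], labels))  # attribute carries no information
    print(mcdiarmid_epsilon(2, 1_000, 0.05), mcdiarmid_epsilon(2, 10**9, 0.05))
```

With two classes and $\delta = 0.05$, the assumed $\epsilon$ is still far above the largest possible gain difference at $N = 1000$, which illustrates why the unmodified McDiarmid-based test needs very many examples per node.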
  • 4. IV. RESULTS
A. Data Set
Web logs, financial tickers, sensor feeds, and other massive, unordered, continuous data sets.

B. Expected Result Set
The result set will comprise two modules. First, the system will generate a decision tree based on the McDiarmid’s tree algorithm, using Gaussian bounds on the attribute selection function. Second, it will produce a graph comparing the traditional decision tree built by McDiarmid’s algorithm with the decision tree obtained by applying the Gaussian approximation to the bounds of the split measures.

C. Platform
The proposed algorithm will be developed and deployed with Java EJB modules. MySQL will be used for database operations in the back end.

V. CONCLUSION
The proliferation of data streams has driven the development of stream mining algorithms. Mining online high-speed data streams poses a number of difficulties for researchers. Because of limited resources and critical time constraints, many summarization and approximation methods have been adopted from statistics and computational theory. The expected mean used in Hoeffding’s theorem is not universally valid; McDiarmid’s theorem is the suitable technique for overcoming this limitation. By applying Gaussian approximations to the McDiarmid’s bounds, the system can become considerably more efficient and can reduce the number of training examples needed to select a splitting criterion. In this paper we have tried to address a few issues, but many open issues and fresh challenges still demand attention; if they are tackled efficiently, data streams will play a major role in every area of our lives.

ACKNOWLEDGMENT
The authors would like to thank the reviewers for their helpful comments.

REFERENCES
[1] L. Rutkowski, L. Pietruczuk, P. Duda, M. Jaworski, “Decision trees for mining data streams based on the McDiarmid’s bound,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, 2013.
[2] C. Aggarwal, Data Streams: Models and Algorithms. Springer, 2007.
[3] A. Bifet and R. Kirkby, Data Stream Mining: A Practical Approach, technical report, Univ. of Waikato, 2009.
[4] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees. Chapman and Hall, 1993.
[5] P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” Proc. Sixth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp. 71-80, 2000.
[6] W. Fan, Y. Huang, and P.S. Yu, “Decision Tree Evolution using Limited Number of Labeled Data Items from Drifting Data Streams,” Proc. IEEE Fourth Int’l Conf. Data Mining, pp. 379-382, 2004.
[7] M. Gaber, A. Zaslavsky, and S. Krishnaswamy, “Mining Data Streams: A Review,” ACM SIGMOD Record, vol. 34, no. 2, pp. 18-26, June 2005.
[8] J. Gama, R. Fernandes, and R. Rocha, “Decision Trees for Mining Data Streams,” Intelligent Data Analysis, vol. 10, no. 1, pp. 23-45, Mar. 2006.
[9] J. Han and M. Kamber, Data Mining: Concepts and Techniques, second ed. Elsevier, 2006.
[10] W. Hoeffding, “Probability Inequalities for Sums of Bounded Random Variables,” J. Am. Statistical Assoc., vol. 58, no. 301, pp. 13-30, Mar. 1963.
[11] G. Hulten, L. Spencer, and P. Domingos, “Mining Time-Changing Data Streams,” Proc. Seventh ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp. 97-106, 2001.
[12] R. Kirkby, “Improving Hoeffding Trees,” PhD dissertation, University of Waikato, Hamilton, 2007.
[13] X. Li, J.M. Barajas, and Y. Ding, “Collaborative Filtering On Streaming Data With Interest-Drifting,” Int’l Intelligent Data Analysis, vol. 11, no. 1, pp. 75-87, 2007.
[14] B. Pfahringer, G. Holmes, and R. Kirkby, “New Options for Hoeffding Trees,” Proc. 20th Australian Joint Conf. Advances in Artificial Intelligence, pp. 90-99, 2007.
[15] J.R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[16] W. Fan, Y. Huang, H. Wang, and P.S. Yu, “Active Mining of Data Streams,” Proc. SDM, 2004.
[17] C. Franke, “Adaptivity in Data Stream Mining,” PhD dissertation, University of California, Davis, 2009.