1. The document discusses stream data mining and compares classification algorithms, defining stream data and the challenges involved in mining it.
2. It describes sampling techniques and classification algorithms for stream data mining including Naive Bayesian, Hoeffding Tree, VFDT, and CVFDT.
3. The algorithms are experimentally compared in terms of time, memory usage, accuracy, and ability to handle concept drift. VFDT and CVFDT are found to have advantages over Hoeffding Tree in accuracy while maintaining speed, but CVFDT can additionally detect and respond to concept drift.
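For context, Hoeffding Tree, VFDT, and CVFDT all decide when to split a leaf using the Hoeffding bound, which is what lets them approach batch-tree accuracy while reading each example once. The sketch below is illustrative (not code from the document) and shows how the bound shrinks as more examples arrive:

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound: with probability 1 - delta, the observed mean of n
    i.i.d. observations of a variable with range R is within epsilon of
    the true mean."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# A Hoeffding tree splits on attribute A once A's gain advantage over the
# runner-up attribute exceeds epsilon for the examples seen at that leaf.
for n in (100, 1000, 10_000):
    print(n, round(hoeffding_bound(value_range=1.0, delta=1e-7, n=n), 4))
```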
This document provides an overview of stream data mining techniques. It discusses how traditional data mining cannot be directly applied to data streams due to their continuous, rapid nature. The document outlines some essential methodologies for analyzing data streams, including sampling, load shedding, sketching, and data summarization techniques like reservoirs, histograms, and wavelets. It also discusses challenges in applying these techniques to data streams and open problems in the emerging field of stream data mining.
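Among the sampling techniques mentioned, reservoir sampling is the classic way to keep a uniform random sample of a stream in bounded memory. A minimal sketch of Algorithm R in Python (an illustration, not code from the document):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown
    length using O(k) memory (Vitter's Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)   # inclusive on both ends
            if j < k:
                reservoir[j] = item    # replace with probability k / (i + 1)
    return reservoir

print(reservoir_sample(range(10_000), k=5))
```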
Parametric comparison based on split criterion on classification algorithm (IAEME Publication)
This document presents a comparison of different attribute selection criteria for classification algorithms in stream data mining. It analyzes two common criteria - information gain and Gini index - and evaluates their impact on classification accuracy using different datasets. The results show that information gain generally achieves higher accuracy than Gini index, especially for larger data sizes. The document aims to improve the performance of stream data classification algorithms by optimizing the split criterion selection approach.
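The two criteria being compared can be stated compactly. The sketch below (my own illustration with a toy split, not the paper's datasets) computes both the information gain and the Gini reduction for one candidate split:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_score(parent, partitions, impurity):
    """Impurity reduction of a candidate split: parent impurity minus the
    size-weighted impurity of the child partitions."""
    n = len(parent)
    weighted = sum(len(p) / n * impurity(p) for p in partitions)
    return impurity(parent) - weighted

parent = ['y'] * 6 + ['n'] * 4
left, right = ['y'] * 5 + ['n'], ['y'] + ['n'] * 3
print("information gain:", round(split_score(parent, [left, right], entropy), 3))
print("gini reduction:  ", round(split_score(parent, [left, right], gini), 3))
```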
The document discusses data stream mining and summarizes some key challenges and techniques. It describes how traditional data mining cannot be directly applied to data streams due to their continuous, rapid arrival. It then outlines several techniques used for summarizing and extracting knowledge from data streams, including sampling, sketching, load shedding, synopsis data structures, and algorithms modified from basic data mining to handle streams.
This document presents an analytical framework for classifying data stream mining techniques based on their approaches to challenges. It discusses how data streams pose computational challenges due to their continuous, massive, and potentially infinite nature. It classifies data stream mining challenges and techniques for addressing them, such as approaches that modify existing data mining algorithms or develop new ones. The document proposes an analytical framework to evaluate how data mining applications can help develop novel data stream mining algorithms to handle different tasks.
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORI ALGORITHM FOR HANDLING VOLUMIN... (acijjournal)
Apriori is one of the key algorithms for generating frequent itemsets. Analysing frequent itemsets is a crucial step in analysing structured data and in finding association relationships between items, and it serves as an elementary foundation for supervised learning, which encompasses classifier and feature extraction methods. Applying this algorithm is crucial to understanding the behaviour of structured data. Most structured data in scientific domains is voluminous, and processing it requires state-of-the-art computing machines; setting up such an infrastructure is expensive. Hence a distributed environment such as a clustered setup is employed for such scenarios. The Apache Hadoop distribution is one of the cluster frameworks for distributed environments, and it helps by distributing voluminous data across a number of nodes. This paper focuses on the map/reduce design and implementation of the Apriori algorithm for structured data analysis.
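As a rough sketch of the map/reduce decomposition described, simulated in plain Python rather than on a Hadoop cluster: the map phase emits (candidate itemset, 1) pairs per transaction, and the reduce phase sums partial counts and applies the minimum support.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(transactions, candidates):
    """Mapper: emit (candidate, 1) for every candidate found in a transaction."""
    for t in transactions:
        items = set(t)
        for c in candidates:
            if c <= items:
                yield (frozenset(c), 1)

def reduce_phase(pairs, min_support):
    """Reducer: sum partial counts per itemset, then filter by support."""
    counts = defaultdict(int)
    for itemset, one in pairs:
        counts[itemset] += one
    return {s: c for s, c in counts.items() if c >= min_support}

transactions = [{'a', 'b', 'c'}, {'a', 'c'}, {'a', 'd'}, {'b', 'c'}]
candidates = [frozenset(c) for c in combinations('abcd', 2)]
print(reduce_phase(map_phase(transactions, candidates), min_support=2))
```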
This document summarizes a research paper that proposes a heuristic approach to preserve privacy in stream data classification. The approach applies data perturbation to the stream data before performing classification. This allows privacy to be preserved while also building an accurate classification model for the large-scale stream data. The approach is implemented in two phases: first the data stream is perturbed, then classification is performed on both the perturbed and original data. Experimental results show that this approach can effectively preserve privacy while reducing the complexity of mining large stream data.
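The two-phase idea can be illustrated in a few lines. The sketch below uses additive Gaussian noise as one plausible perturbation and a nearest-centroid classifier as a stand-in model, since the summary specifies neither choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(X, sigma=0.3):
    """Phase 1: mask individual numeric records with additive Gaussian noise."""
    return X + rng.normal(0.0, sigma, size=X.shape)

# Toy stream chunk: two Gaussian classes.
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
Xp = perturb(X)

# Phase 2: train a nearest-centroid classifier on the perturbed chunk.
centroids = np.array([Xp[y == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((X[:, None, :] - centroids) ** 2).sum(-1), axis=1)
print("accuracy on original data:", (pred == y).mean())
```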
Different Classification Technique for Data mining in Insurance Industry usin... (IOSRjournaljce)
This paper addresses the issues and techniques for Property/Casualty actuaries applying data mining methods. Data mining is the discovery of previously unknown patterns in large databases. It is an interactive knowledge discovery procedure that includes data acquisition, data integration, data exploration, model building, and model validation. The paper provides an overview of the knowledge discovery process and introduces some important data mining methods for application to insurance, including cluster discovery approaches.
Feature Subset Selection for High Dimensional Data Using Clustering Techniques (IRJET Journal)
The document discusses feature subset selection for high dimensional data using clustering techniques. It proposes the FAST algorithm which has three steps: 1) remove irrelevant features, 2) divide features into clusters using DBSCAN, and 3) select the most representative feature from each cluster. DBSCAN is a density-based clustering algorithm that can identify clusters of varying densities and detect outliers. The FAST algorithm is evaluated to select a small number of discriminative features from high dimensional data in an efficient manner. It aims to remove irrelevant and redundant features to improve predictive accuracy while handling large feature sets.
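A compressed illustration of the three FAST steps, using correlation distance and scikit-learn's DBSCAN (the eps and min_samples values here are arbitrary choices, not the paper's):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
base = rng.normal(size=(300, 3))
# Build 9 features: three noisy copies of each of three latent factors.
X = np.hstack([base + rng.normal(0, 0.1, base.shape) for _ in range(3)])

# Step 2: cluster features by correlation distance (1 - |corr|).
dist = 1.0 - np.abs(np.corrcoef(X.T))
labels = DBSCAN(eps=0.3, min_samples=2, metric='precomputed').fit_predict(dist)

# Step 3: keep one representative feature per cluster, here the member
# with the highest variance inside its cluster.
selected = []
for c in set(labels) - {-1}:
    members = np.where(labels == c)[0]
    selected.append(members[np.argmax(X[:, members].var(axis=0))])
print("feature cluster labels:", labels, "-> selected features:", selected)
```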
Review: Data Driven Traffic Flow Forecasting using MapReduce in Distributed M... (AM Publications)
Over the last decade, the use of communication and transportation technology in urban traffic management systems has increased, and as more data are collected, traffic data grow accordingly; forecasting techniques are used to predict future conditions. In short, a traffic flow forecasting system finds the collection of historical observations most similar to the current conditions and uses them to estimate the future state of the system. This paper focuses on a data-driven traffic flow forecasting system based on the MapReduce framework for distributed systems with a Bayesian network approach. The probability distribution between two adjacent nodes, i.e., the data used for forecasting (input node) and the data being forecast (output node), is modelled with a Gaussian mixture model (GMM) whose parameters are updated using the Expectation-Maximization algorithm. Finally, the paper addresses model fusion, the main problem in distributed modelling for data storage and processing in traffic flow forecasting systems.
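The GMM-plus-EM piece of this pipeline is easy to show in isolation. Below is a minimal one-dimensional EM loop (an illustration only, not the paper's distributed implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
# A 1-D mixture stands in for the traffic-flow quantity being modelled.
x = np.concatenate([rng.normal(20, 3, 400), rng.normal(60, 8, 600)])

# EM for a two-component Gaussian mixture.
w, mu, sd = np.array([0.5, 0.5]), np.array([10.0, 50.0]), np.array([5.0, 5.0])
for _ in range(50):
    # E-step: responsibility of each component for each observation.
    pdf = np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    r = w * pdf
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, and standard deviations.
    n_k = r.sum(axis=0)
    w = n_k / len(x)
    mu = (r * x[:, None]).sum(axis=0) / n_k
    sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)
print("weights", w.round(2), "means", mu.round(1), "sds", sd.round(1))
```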
Comparative analysis of various data stream mining procedures and various dim... (Alexander Decker)
This document provides a comparative analysis of various data stream mining procedures and dimension reduction techniques. It discusses 10 different data stream clustering algorithms and their working mechanisms. It also compares 6 dimension reduction techniques and their objectives. The document proposes applying a dimension reduction technique to reduce the dimensionality of a high-dimensional data stream, before clustering it using a weighted fuzzy c-means algorithm. This combined approach aims to improve clustering quality and enable better visualization of streaming data.
The document discusses mining frequent items and item sets from data streams using fuzzy approaches. It describes objectives of mining frequent items from datasets in real-time using fuzzy sets and slices. This involves fetching relevant records, analyzing the data, searching for liked items using fuzzy slices, identifying frequently viewed item lists, making recommendations, and evaluating the results. Algorithms used for mining frequent items from data streams in a single or multiple pass are also reviewed.
A plethora of effectively infinite data is generated from the Internet and other information sources. Analyzing this massive data in real time and extracting valuable knowledge using different mining application platforms has been an active area for both research and industry. However, data stream mining faces challenges that distinguish it from traditional data mining. Recently, many studies have addressed these massive data mining problems and proposed several techniques that produce impressive results. In this paper, we review real-time clustering and classification mining techniques for data streams. We analyze the characteristics of data stream mining and discuss its challenges and research issues. Finally, we present some of the platforms for data stream mining.
Data characterization towards modeling frequent pattern mining algorithms (csandit)
Big data has quickly come under the spotlight in recent years. As big data involves handling extremely large amounts of data, demand naturally increases for computational environments that accelerate and scale out big data applications. The behavior of big data applications, however, is not yet clearly defined. Among big data applications, this paper focuses specifically on stream mining applications, whose behavior varies according to the characteristics of the input data. The parameters for characterizing that data are likewise not yet clearly defined, and no study has investigated explicit relationships between the input data and stream mining applications. This paper therefore picks up frequent pattern mining as a representative stream mining application and interprets the relationships between the characteristics of the input data and the behaviors of signature algorithms for frequent pattern mining.
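As an illustration of what "parameters for data characterization" might look like, the sketch below computes a few candidate parameters for a toy transaction stream (the parameter set is my assumption; the paper's exact parameters are not listed in the summary):

```python
from collections import Counter

def characterize(transactions):
    """Candidate characterization parameters for a transaction stream:
    size, distinct items, average transaction length, and density."""
    counts = Counter(item for t in transactions for item in t)
    n, d = len(transactions), len(counts)
    avg_len = sum(len(t) for t in transactions) / n
    return {"transactions": n, "distinct_items": d,
            "avg_length": avg_len, "density": avg_len / d}

stream = [{'a', 'b'}, {'a', 'c', 'd'}, {'b', 'd'}, {'a'}]
print(characterize(stream))
```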
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval (IJECEIAES)
Data mining is an essential process for identifying patterns in large datasets through machine learning techniques and database systems. Clustering high dimensional data is a very challenging process due to the curse of dimensionality, and existing methods do not improve space complexity or data retrieval performance. To overcome these limitations, a Spectral Clustering Based VP Tree Indexing Technique is introduced. The technique clusters and indexes densely populated high dimensional data points for effective retrieval based on user queries. A Normalized Spectral Clustering Algorithm groups similar high dimensional data points, after which a Vantage Point Tree is constructed to index the clustered points with minimum space complexity. Finally, indexed data is retrieved in response to user queries using a Vantage Point Tree based Data Retrieval Algorithm, which improves the true positive rate with minimum retrieval time. Performance is measured in terms of space complexity, true positive rate, and data retrieval time on the El Nino weather datasets from the UCI Machine Learning Repository. Experimental results show that the proposed technique reduces space complexity by 33% and data retrieval time by 24% compared with state-of-the-art works.
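To make the Vantage Point Tree idea concrete, here is a minimal, self-contained VP-tree build and nearest-neighbour search (an illustration of the data structure only; the paper couples it with spectral clustering, which is omitted here):

```python
import random

class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point, self.radius = point, radius
        self.inside, self.outside = inside, outside

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def build_vp_tree(points):
    """Pick a vantage point, split the rest by the median distance to it,
    and recurse on the two halves."""
    if not points:
        return None
    vp = points[random.randrange(len(points))]
    rest = [p for p in points if p is not vp]
    if not rest:
        return VPNode(vp, 0.0, None, None)
    radius = sorted(dist(vp, p) for p in rest)[len(rest) // 2]
    inside = [p for p in rest if dist(vp, p) <= radius]
    outside = [p for p in rest if dist(vp, p) > radius]
    return VPNode(vp, radius, build_vp_tree(inside), build_vp_tree(outside))

def search(node, q, best=(float('inf'), None)):
    """Nearest-neighbour search that prunes whole subtrees via the
    triangle inequality."""
    if node is None:
        return best
    d = dist(q, node.point)
    if d < best[0]:
        best = (d, node.point)
    near, far = ((node.inside, node.outside) if d <= node.radius
                 else (node.outside, node.inside))
    best = search(near, q, best)
    if abs(d - node.radius) < best[0]:   # the other half may still win
        best = search(far, q, best)
    return best

pts = [(random.random(), random.random()) for _ in range(500)]
print(search(build_vp_tree(pts), (0.5, 0.5)))
```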
This document discusses big data mining and the Internet of Things. It first presents challenges with big data mining including modeling big data characteristics, identifying key challenges, and issues with statistical analysis of IoT data. It then describes an architecture called IOT-StatisticDB that provides a generalized schema for storing sensor data from IoT devices and a distributed system for parallel computing and statistical analysis of IoT big data. The system includes query operators for data retrieval and statistical analysis of IoT data in areas like transportation networks.
Data mining techniques application for prediction in OLAP cube (IJECEIAES)
Data warehouses represent collections of data organized to support decision-support processes, and provide an appropriate solution for managing large volumes of data. Online analytical processing (OLAP) is a technology that complements data warehouses to make data usable and understandable by users, providing tools for visualization, exploration, and navigation of data cubes. Data mining, on the other hand, allows the extraction of knowledge from data with different methods of description, classification, explanation, and prediction. In this work, we propose new ways to improve existing approaches in the decision-support process. Continuing work on coupling online analysis with data mining to integrate prediction into OLAP, an approach based on machine learning with clustering is proposed to partition an initial data cube into dense sub-cubes that can serve as a learning set for building a prediction model. Regression-tree data mining is then applied to each sub-cube to predict the value of a cell.
A fuzzy clustering algorithm for high dimensional streaming data (Alexander Decker)
This document summarizes a research paper that proposes a new dimension-reduced weighted fuzzy clustering algorithm (sWFCM-HD) for high-dimensional streaming data. The algorithm can cluster datasets that have both high dimensionality and a streaming (continuously arriving) nature. It combines previous work on clustering algorithms for streaming data and high-dimensional data. The paper introduces the algorithm and compares it experimentally to show improvements in memory usage and runtime over other approaches for these types of datasets.
Novel Ensemble Tree for Fast Prediction on Data Streams (IJERA Editor)
Data streams are sequential sets of data records. When data arrives constantly and at high speed, predicting the class in a timely fashion is essential. Ensemble modeling techniques are currently growing rapidly in data stream classification; ensemble learning is well accepted because it can manage data streams of large size and also handle concept drift. Prior work focused mostly on the accuracy of the ensemble model, while prediction efficiency received little attention, since existing ensemble models predict in linear time, which is sufficient for small applications, and accessible models work by integrating only a handful of classifiers. Real-time applications, however, involve huge data streams, so base classifiers must recognize dissimilar models and form a high-grade ensemble. To address these challenges, we developed the Ensemble tree, a height-balanced tree indexing structure over base classifiers for quick prediction on data streams with ensemble modeling techniques. The Ensemble Tree manages ensembles like geodatabases and utilizes an R-tree-like structure to achieve sub-linear time complexity.
This document discusses clustering of uncertain data objects. It first provides background on clustering uncertain data and the challenges involved, then reviews existing approaches, including soft classifiers and probabilistic databases. It proposes combining k-means clustering with Voronoi diagrams and indexing techniques to reduce execution time and improve the performance, efficiency, and quality of clustering for uncertain datasets. It concludes that combining clustering with indexing approaches can better handle the challenges of uncertain data clustering.
Data performance characterization of frequent pattern mining algorithms (IJDKP)
Big data has quickly come under the spotlight in recent years. As big data involves handling extremely large amounts of data, demand naturally increases for computational environments that accelerate and scale out big data applications. The behavior of big data applications, however, is not yet clearly defined. Among big data applications, this paper focuses specifically on stream mining applications, whose behavior varies according to the characteristics of the input data. The parameters for characterizing that data are likewise not yet clearly defined, and no study has investigated explicit relationships between the input data and stream mining applications. This paper therefore picks up frequent pattern mining as a representative stream mining application and interprets the relationships between the characteristics of the input data and the behaviors of signature algorithms for frequent pattern mining.
A Novel Approach for Clustering Big Data based on MapReduce (IJECEIAES)
Clustering is one of the most important applications of data mining and has attracted the attention of researchers in statistics and machine learning. It is used in many applications such as information retrieval, image processing, and social network analytics, and it helps users understand the similarity and dissimilarity between objects; cluster analysis makes complex and large datasets more understandable. Various clustering algorithms have been analyzed by researchers. K-means is the most popular partitioning-based algorithm, as it provides good results through accurate calculation on numerical data, but it works well for numerical data only. Big data combines numerical and categorical data, and the K-prototype algorithm handles both: it combines the distances calculated from numeric and categorical attributes. With the growth of data from social networking websites, business transactions, scientific computation, and so on, there are vast collections of structured, semi-structured, and unstructured data, so K-prototype needs optimization to analyze these varieties of data efficiently. In this work, the K-prototype algorithm is implemented on MapReduce. Experiments show that K-prototype on MapReduce gives better performance on multiple nodes than on a single node, with CPU execution time and speedup used as evaluation metrics. An intelligent splitter is also proposed, which separates mixed big data into numerical and categorical parts. Comparison with traditional algorithms shows that the proposed algorithm works better at large data scales.
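The core of K-prototype is its mixed dissimilarity measure; a minimal sketch follows (the gamma weighting and the toy record are my own choices, not the paper's):

```python
import numpy as np

def kprototype_distance(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """K-prototype dissimilarity: squared Euclidean distance on the numeric
    part plus gamma times the number of categorical mismatches."""
    numeric = float(((x_num - proto_num) ** 2).sum())
    categorical = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric + gamma * categorical

x = (np.array([1.0, 2.0]), ['red', 'small'])
proto = (np.array([1.5, 2.5]), ['red', 'large'])
print(kprototype_distance(*x, *proto, gamma=0.5))   # 0.5 + 0.5 * 1 = 1.0
```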
The document proposes a strategy for clustering distributed databases using self-organizing maps (SOM) and K-means algorithms. The strategy applies SOM locally to each distributed data set to obtain representative subsets, then combines the results and applies SOM and K-means globally. Specifically, it performs local SOM clustering, sends representative data to a central site, applies SOM again on the combined data, then uses K-means on the unified map to produce the final clustering result.
The document summarizes research on mining high utility itemsets from transactional databases. It discusses how traditional frequent itemset mining algorithms do not account for item importance (weights/profits). Utility mining aims to discover itemsets that generate high total utility based on item weights and quantities. The document reviews existing utility mining algorithms like Two-Phase and UP-Growth, and proposes a new algorithm called Miner. Miner uses a novel utility-list structure and an Estimated Utility Cooccurrence Pruning strategy to reduce the number of costly join operations during mining, achieving better performance than UP-Growth. Experimental results on real datasets show Miner performs up to 95% fewer joins and is up to six times faster than UP-Growth.
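The notion of itemset utility that all of these algorithms optimize is straightforward to state. A minimal sketch (toy quantities and unit profits, not the paper's datasets):

```python
def utility(itemset, transactions, profit):
    """Total utility of an itemset: for every transaction containing it,
    sum quantity * unit profit over the itemset's items."""
    total = 0
    for t in transactions:                 # t maps item -> purchased quantity
        if itemset <= set(t):
            total += sum(t[i] * profit[i] for i in itemset)
    return total

profit = {'a': 5, 'b': 2, 'c': 1}
transactions = [{'a': 2, 'b': 3}, {'a': 1, 'c': 4}, {'b': 5, 'c': 1}]
print(utility({'a', 'b'}, transactions, profit))    # 2*5 + 3*2 = 16
```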
The document describes a study that uses artificial neural networks (ANN), fuzzy inference systems (FIS), and adaptive neuro-fuzzy inference systems (ANFIS) to model and predict groundwater levels in the Thurinjapuram watershed in Tamil Nadu, India. Monthly rainfall and water level data from 1985 to 2008 were used as inputs, with one month ahead water level as the output. ANFIS performed best with lower error rates and higher correlation than ANN and FIS models according to statistical evaluations. Validation with unused 2009-2010 data showed ANFIS predictions were 80% accurate.
Efficient Database Management System For Wireless Sensor Network (Onyebuchi nosiri)
An effective database management system is put forward in this work to tackle the problem of remote monitoring using a Wireless Sensor Network. An Object Oriented Analysis and Design method was employed, with classes evolved to create objects in the program used. An algorithm was developed with a corresponding flowchart to realize the design. The work also produced a dynamic graph plotter, which offers an adaptive monitoring facility for data stored in the database. Sensor node querying was implemented, and the transmitted data was filtered for a particular node.
This document proposes a model to parallelize the frequent itemset mining process using GPUs instead of multi-core processors. It aims to speed up the mining process and allow it to handle large datasets more efficiently. The model parallelizes the FP-growth algorithm at different levels without generating the FP-tree. It first sorts the transaction database in parallel using GPUs for preprocessing. It then groups the transactions based on the first item and mines for frequent itemsets within each group in parallel on the GPU. Preliminary results show the sorting step is significantly faster when parallelized on the GPU compared to serial processing. The overall goal is to efficiently mine large datasets using the low-cost and high-performance capabilities of GPUs.
TASK-DECOMPOSITION BASED ANOMALY DETECTION OF MASSIVE AND HIGH-VOLATILITY SES... (ijdpsjournal)
This document summarizes a research paper that presents a task-decomposition based anomaly detection system for analyzing massive and highly volatile session data from the Science Information Network (SINET), Japan's academic backbone network. The system uses a master-worker design with dynamic task scheduling to process over 1 billion sessions per day. It discriminates incoming and outgoing traffic using GPU parallelization and generates histograms of traffic volumes over time. Long short-term memory (LSTM) neural networks detect anomalies like spikes in incoming traffic volumes. The experiment analyzed SINET data from February 27 to March 8, 2021, detecting some anomalies while processing 500-650 gigabytes of daily session data.
This document summarizes a research paper that proposes a new algorithm called ESW-FI to efficiently mine frequent itemsets from data streams using a sliding window model. The algorithm actively maintains potentially frequent itemsets in a compact data structure using only a single pass over the data. It guarantees output quality and bounds memory usage. The algorithm divides the sliding window into fixed-size segments and processes window slides by inserting new segments and removing old ones, avoiding reprocessing of all transactions on each slide.
This document summarizes an algorithm called ESW-FI that efficiently mines frequent itemsets from data streams using a sliding window model. The algorithm actively maintains potentially frequent itemsets in a compact data structure using only a single pass over the data. This is an improvement over existing algorithms that require multiple scans or maintaining all transaction data within the window. The ESW-FI algorithm guarantees output quality and bounds memory usage while processing streams of continuous, unpredictable data in a timely manner.
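The segment-based window update can be sketched compactly. For brevity this tracks frequent single items rather than full itemsets as ESW-FI does, so treat it as an illustration of the window mechanics only:

```python
from collections import Counter, deque

class SegmentedWindow:
    """Sliding window kept as fixed-size segments; a slide adds the new
    segment's counts and subtracts the expired one's, so each transaction
    is touched exactly once (single pass)."""
    def __init__(self, num_segments):
        self.segments = deque(maxlen=num_segments)
        self.counts = Counter()

    def slide(self, new_transactions):
        if len(self.segments) == self.segments.maxlen:
            self.counts -= self.segments.popleft()   # drop the oldest segment
        seg = Counter(item for t in new_transactions for item in t)
        self.segments.append(seg)
        self.counts += seg

    def frequent(self, min_support):
        return {i: c for i, c in self.counts.items() if c >= min_support}

w = SegmentedWindow(num_segments=3)
for batch in ([{'a', 'b'}, {'a'}], [{'b', 'c'}], [{'a', 'c'}], [{'c'}]):
    w.slide(batch)
print(w.frequent(min_support=2))   # only counts inside the current window
```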
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T... (Editor IJCATR)
In this paper we focus on techniques for solving data mining tasks, namely statistics, decision trees, and neural networks. The new approach succeeds in defining new criteria for the evaluation process, and it obtains valuable results based on what each technique is, the environment in which each is used, its advantages and disadvantages, the consequences of choosing it to extract hidden predictive information from large databases, and its methods of implementation. Finally, the paper presents some valuable recommendations in this field.
Evaluation of a New Incremental Classification Tree Algorithm for Mining High... (mlaij)
A new model for the online machine learning of high-speed data streams is proposed, to minimize the severe restrictions associated with existing learning algorithms. Most existing models have three principal steps: first, the system creates a model incrementally; second, the time taken by examples to complete a prescribed procedure at their arrival speed is computed; and third, the memory required for computation is predicted in advance. To overcome these restrictions we propose a new data stream classification algorithm in which the data can be partitioned into a stream of trees and a new data set can be merged into the existing tree. This algorithm, called the incremental classification tree algorithm, proves to be an excellent solution for processing larger data streams. In this paper, we present experimental results for the new algorithm and show that it eradicates the problems of the existing method.
This document discusses privacy-preserving techniques for data stream mining. It proposes a hybrid method that uses both rotation and translation transformations to perturb data streams and preserve privacy. The key steps are:
1) The data stream is represented as a matrix and only numeric attributes are considered.
2) Attribute pairs are randomly selected and perturbed using rotation transformations within a calculated "security range".
3) Additional attributes are perturbed using translation transformations, where random numbers generated by a secure function determine whether values are added to or subtracted from the original data.
4) The perturbed data stream is then used for clustering and analysis while preserving privacy. The goal is to maximize both privacy and utility of results.
This document discusses privacy-preserving techniques for data stream mining. It proposes a hybrid method that uses both rotation and translation based data perturbation to anonymize sensitive attributes in data streams. The key steps are:
1) Select attribute pairs and set security thresholds for perturbation.
2) Apply rotation transformations to selected attribute pairs to distort the data within the security thresholds.
3) Also apply translation perturbations by adding or subtracting random noise values to other attributes.
The goal is to anonymize the data enough to preserve privacy while maintaining accuracy for data stream mining tasks like clustering. Evaluation focuses on balancing privacy protections with preserving data utility for analysis.
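Both summaries describe the same two distortions. A minimal numpy sketch (the angle range standing in for the "security range" is arbitrary here) shows why rotation suits clustering: it preserves pairwise geometry within the rotated attribute pair.

```python
import numpy as np

rng = np.random.default_rng(3)

def rotate_pair(X, i, j, theta):
    """Distort one attribute pair with a rotation: relative geometry is
    preserved while individual values are masked."""
    c, s = np.cos(theta), np.sin(theta)
    X[:, [i, j]] = X[:, [i, j]] @ np.array([[c, -s], [s, c]])

def translate(X, k, scale=1.0):
    """Distort one attribute with random additive (translation) noise."""
    X[:, k] += rng.normal(0.0, scale, size=len(X))

X = rng.normal(size=(100, 4))
original = X.copy()
rotate_pair(X, 0, 1, theta=rng.uniform(0.2, 1.2))   # angle within a "security range"
translate(X, 2, scale=0.5)

# Norms within the rotated pair are unchanged, which is why
# rotation-perturbed data still clusters like the original.
print(np.allclose(np.linalg.norm(original[:, :2], axis=1),
                  np.linalg.norm(X[:, :2], axis=1)))
```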
This research paper addresses the challenges of mining frequent items over data streams with a variable window size and low memory. To detect the point of context change in the streaming transactions, we developed a two-level window structure that supports fixing the window size on the fly, controls heterogeneity, and ensures homogeneity among the transactions added to the window. To minimize memory utilization and computational cost and to improve scalability, the design allows the coverage or support to be fixed at the window level. This document introduces incremental mining of frequent item-sets from the window together with a context variation analysis approach; the complete technique is named Mining Frequent Item-sets using Variable Window Size fixed by Context Variation Analysis (MFI-VWSCVA). There are clear boundaries between frequent and infrequent item-sets in specific item-sets. In this design, a change in window size represents conceptual drift in the information stream; in other words, whenever the window size cannot be set effectively, the item-set will be infrequent. The experiments we executed and documented prove that the designed algorithm is much more efficient than existing ones.
IRJET- AC Duct Monitoring and Cleaning Vehicle for Train Coaches (IRJET Journal)
This document summarizes research on techniques for handling concept drift in data stream mining. It begins with an introduction to the challenges of concept drift in data streams and the two main approaches for handling concept drift using ensembles: online and block-based. It then reviews several existing studies on concept drift detection and handling in data streams. Finally, it proposes an adaptive online ensemble approach that uses an internal change detector to dynamically determine block sizes and capture concept drifts in a timely manner. Experimental results show this approach outperforms other ensemble techniques, especially on datasets with sudden concept changes.
IRJET- A Data Stream Mining Technique Dynamically Updating a Model with Dynam... (IRJET Journal)
This document summarizes several techniques for handling concept drift in data stream mining. It discusses how ensemble methods are commonly used to deal with concept drift and categorizes ensemble approaches into online and block-based. It also reviews several existing studies on handling concept drift, including methods that use adaptive windowing and online learning as well as techniques for detecting concept drift and efficiently updating models. The document concludes by discussing the need for approaches that can adapt to different types of concept drift and changes in non-stationary data streams.
A STUDY OF TRADITIONAL DATA ANALYSIS AND SENSOR DATA ANALYTICS (ijistjournal)
The growth of smart and intelligent devices known as sensors generates large amounts of data, and the data collected over a time span reaches volumes designated as big data; such repositories hold unstructured data. Traditional data analytics methods are well developed and widely used to analyze structured data and, to a limited extent, semi-structured data, which involves additional processing overhead. The methods used to analyze unstructured data differ because they take a distributed computing approach, whereas centralized processing is possible for structured and semi-structured data. The work undertaken here is confined to an analysis of both varieties of methods, and the result of this study is intended to introduce the methods available for analyzing big data.
Analysis on different Data mining Techniques and algorithms used in IOT (IJERA Editor)
In this paper, we discuss five functionalities of data mining in IoT that affect performance: data anomaly detection, data clustering, data classification, feature selection, and time series prediction. Some important algorithms for each functionality are also reviewed here, showing their advantages and limitations, along with some new algorithms that represent current research directions. We also present a knowledge view of data mining in IoT.
Drsp dimension reduction for similarity matching and pruning of time series ... (IJDKP)
The document summarizes a research paper that proposes a framework called DRSP (Dimension Reduction for Similarity Matching and Pruning) for time series data streams. DRSP addresses the challenges of large streaming data size by:
1) Performing dimension reduction using a Multi-level Segment Mean technique to compactly represent the data while retaining crucial information.
2) Incorporating a similarity matching technique to analyze if new data objects match existing streams.
3) Applying a pruning technique to filter out non-relevant data object pairs and join only relevant pairs.
The framework aims to reduce storage and computation costs for similarity matching on large time series data streams.
Concept Drift Identification using Classifier Ensemble Approach (IJECE, IAES)
Abstract: In internetworking systems, huge amounts of data are scattered, generated, and processed over the network, and data mining techniques are used to discover unknown patterns in the underlying data. A traditional classification model classifies data based on past labelled data. However, in many current applications data grows in size with fluctuating patterns, so new features may arrive in the data. This occurs in many applications such as sensor networks, banking and telecommunication systems, the financial domain, and electricity usage and pricing based on demand and supply. Such changes in the data distribution reduce classification accuracy: some patterns may be discovered as frequent while others tend to disappear and are wrongly classified. To mine such data, traditional classification techniques may not be suitable, because the distribution generating the items can change over time, and data from the past may become irrelevant or even false for the current prediction. For handling such varying patterns of data, concept drift mining is used to improve the accuracy of classification techniques. In this paper we propose an ensemble approach for improving classifier accuracy. The ensemble classifier is applied to three different data sets. We investigated different features for the different chunks of data, which are then given to the ensemble classifier, and we observed that the proposed approach improves classifier accuracy across the chunks of data.
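Sketching the chunk-based idea in that abstract: the toy below trains one deliberately trivial model per data chunk and lets a bounded, majority-voting ensemble forget its oldest member so it can follow a drifting concept. All names and the stand-in base learner are our illustration, not the paper's code.

```python
from collections import deque, Counter

class ChunkEnsemble:
    """Train one simple classifier per data chunk; predict by majority vote.
    Old members fall out, so the ensemble tracks drifting concepts."""
    def __init__(self, max_members=5):
        self.members = deque(maxlen=max_members)  # oldest model is dropped

    def add_chunk(self, chunk):
        # The "classifier" here is just the chunk's majority class,
        # a stand-in for any real base learner trained on the chunk.
        majority = Counter(label for _, label in chunk).most_common(1)[0][0]
        self.members.append(lambda x, m=majority: m)

    def predict(self, x):
        votes = Counter(clf(x) for clf in self.members)
        return votes.most_common(1)[0][0]

ens = ChunkEnsemble(max_members=3)
ens.add_chunk([((0,), "A"), ((1,), "A")])
ens.add_chunk([((2,), "B"), ((3,), "B")])
ens.add_chunk([((4,), "A")])
print(ens.predict((5,)))  # "A": two of the three members vote A
```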
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da... (theijes)
Data mining works to extract previously unknown information from enormous quantities of data, which can lead to knowledge; it provides information that helps in making good decisions. The effectiveness of data mining lies in reaching the goal of discovering the hidden facts contained in databases through the use of multiple technologies. Clustering is the organization of data into clusters or groups with high intra-cluster similarity and low inter-cluster similarity. This paper deals with the K-means clustering algorithm, which groups data items based on their characteristics and attributes and performs the clustering by reducing the distances between the data points and the cluster centers. The algorithm is applied using the open source tool WEKA, with an insurance dataset as its input.
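To make the center-reduction step concrete, here is a minimal k-means sketch in Python (WEKA itself is Java; the toy data and all names below are our illustration, not the paper's experiment):

```python
import random

def kmeans(points, k, iters=20):
    """Minimal k-means: assign each point to its nearest center, then move
    each center to the mean of its cluster, for a fixed number of rounds."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old center if a cluster emptied out
                centers[i] = tuple(sum(dim) / len(cl) for dim in zip(*cl))
    return centers, clusters

# Two obvious groups of insurance-like (age, premium) records.
pts = [(25, 300), (27, 320), (26, 310), (60, 900), (62, 950), (61, 940)]
centers, clusters = kmeans(pts, k=2)
print(centers)
```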
A New Data Stream Mining Algorithm for Interestingness-rich Association Rules (Venu Madhav)
Frequent itemset mining and association rule generation are challenging tasks in data streams. Even though various algorithms have been proposed to address them, it has been found that frequency alone does not decide the significance (interestingness) of the mined itemsets, and hence of the association rules. This has pushed algorithms toward mining association rules based on utility, i.e., the proficiency of the mined rules. However, few algorithms in the literature deal with utility, as most concentrate on reducing the complexity of frequent itemset and association rule mining; moreover, those few consider only the overall utility of the association rules and not the consistency of the rules throughout a defined number of periods. To solve this, an enhanced association rule mining algorithm is proposed in this paper. The algorithm introduces a new weightage validation into conventional association rule mining to validate the utility and its consistency in the mined rules. Utility is validated by an integrated calculation of the cost/price efficiency of the itemsets and their frequency. Consistency validation is performed at every defined number of windows using the probability distribution function, under the assumption that the weights are normally distributed. The validated rules thus obtained are frequent and utility-efficient, with their interestingness distributed throughout the entire time period. The algorithm is implemented and the resulting rules are compared against the rules obtainable from conventional mining algorithms.
Mining Stream Data using k-Means clustering Algorithm (Manishankar Medi)
This document discusses using k-means clustering to analyze urban road traffic stream data. Stream data arrives continuously over time and is challenging to process due to its high volume, velocity and volatility. The document proposes using a sliding window technique with k-means clustering to analyze recent urban traffic data and visualize clusters in real-time to provide insights into traffic patterns and congested roads. This analysis could help travelers and authorities respond to traffic issues more quickly.
A Quantified Approach for large Dataset Compression in Association Mining (IOSR Journals)
Abstract: With the rapid development of computer and information technology over the last several decades, an enormous amount of data in science and engineering is continuously generated at massive scale; data compression is needed to reduce cost and storage space. Compression, together with discovering association rules by identifying relationships among sets of items in a transaction database, is an important problem in data mining. Finding frequent itemsets is computationally the most expensive step in association rule discovery and has therefore attracted significant research attention. However, existing compression algorithms are not appropriate for data mining on large data sets. In this research a new approach is described in which the original dataset is sorted in lexicographical order and the desired number of groups is formed to generate quantification tables. These quantification tables are used to generate the compressed dataset, yielding a more efficient algorithm for mining complete frequent itemsets from the compressed data. The experimental results show that the proposed algorithm performs better than the mining merge algorithm across different supports and execution times.
Keywords: Apriori Algorithm, mining merge Algorithm, quantification table
This document discusses big data mining and the Internet of Things. It first presents challenges with big data mining including modeling big data characteristics, identifying key challenges, and issues with statistical analysis of IoT data. It then describes an architecture called IOT-StatisticDB that provides a generalized schema for storing sensor data from IoT devices and a distributed system for parallel computing and statistical analysis of IoT big data. The system includes query operators for data retrieval and statistical analysis of IoT data in areas like transportation networks.
This document outlines a presentation on data mining techniques. It discusses data compression methods like null compression and run length encoding. It also discusses association rule mining and the Apriori algorithm limitations. The problem statement proposes a method for compressing databases that can be decompressed while also improving data mining performance. The proposed work involves compressing data into groups, generating frequent itemsets using Apriori on the compressed data, then decompressing and generating association rules. The implementation environment and conclusions are also outlined. References on related work are provided at the end.
Ms. Madhu S. Shukla, Mr. Kirit R. Rathod / International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 3, Issue 1, January-February 2013, pp. 163-168
Stream Data Mining and Comparative Study of Classification
Algorithms
Ms. Madhu S. Shukla*, Mr. Kirit R. Rathod**
* PG-CE Student, Department of Computer Engineering, C.U.Shah College of Engineering and Technology, Gujarat, India
** Assistant Professor, Department of Computer Engineering, C.U.Shah College of Engineering and Technology, Gujarat, India
ABSTRACT
Stream Data Mining is a new, emerging research topic. Today there are a number of applications that generate massive amounts of stream data; examples of such systems are sensor networks, real-time surveillance systems, and telecommunication systems. Hence there is a requirement for intelligent processing of this type of data that would help in its proper analysis and in its use for other tasks. Mining stream data is concerned with extracting knowledge structures, represented as models and patterns, from non-stopping streams of information [1]. Such massive data are handled with software such as MOA (Massive Online Analysis) or other open-source tools like Data Miner. In this paper we present some theoretical aspects of stream data mining and certain experimental results obtained on that basis with the use of MOA.

Keywords: Stream, Stream Data Mining, Intelligent-processing, MOA (Massive Online Analysis), Continuous Data.

I. INTRODUCTION
Recently, new classes of applications that essentially generate stream data as their output have been emerging rapidly. Stream data is continuous, ordered, changing, fast, and huge in volume; it is so large and changes so continuously that even a single look at the entire data becomes difficult. Such systems include any application dealing with telecommunication calling records, business credit card transactions, network monitoring and traffic engineering, sensor monitoring and surveillance, security monitoring, and web logs and page click streams. The methods by which we try to find knowledge or specific patterns in such data are called Stream Data Mining. Certain characteristics of stream data are as follows:
- Huge volumes of continuous, possibly infinite, data.
- Fast changing, requiring fast, real-time response.
- Data streams capture nicely our data processing needs of today.
- Random access is expensive; only a single linear scan algorithm is possible (we can have only one look).
- Only a summary of the data seen so far can be stored.
- Most stream data are at a pretty low level or multi-dimensional in nature and need multi-level, multi-dimensional processing.

As said earlier, the data is huge in volume, so in order to perform any analysis we need to take a sample of it so that the stream can be processed with ease. The samples should be taken such that whatever data falls into the sample is worth analyzing or processing, which means that maximum knowledge is extracted from the sampled data.

In this paper some sampling techniques are briefed in Section II, some classification algorithms are discussed in Section III, and their implementation results appear in Section IV. Conclusions are discussed in Section V, and the references make up Section VI.

(Fig 1.1: Processing of Stream data)

II. FOUNDATIONS AND SAMPLING TECHNIQUES
The foundation of stream data mining rests on statistics, complexity, and computational theory [2].
Sampling refers to the probabilistic choice of whether or not a data item is processed. It is an old statistical technique that has been used for a long time; bounds on the error rate of a computation are given as a function of the sampling rate. Very Fast Machine Learning techniques [4] have used the Hoeffding bound to determine the sample size required by some derived loss functions [3].

Some of the methods for stream data mining are:
1. Sampling: representing a large data set by a small random sample of its elements.
2. Sketching: building a summary of the data stream using a small amount of memory.
3. Load Shedding: eliminating a batch of subsequent elements from being analyzed.
4. Synopsis Data Structures: small-space, approximate solutions to a massive data set, using summarizing techniques such as wavelets, histograms, and aggregation.
5. Sliding Window: an advanced technique that performs detailed analysis over the most recent data items and over summarized versions of older ones.

1. Sampling
In this technique, rather than dealing with the entire data, samples of the stream are taken at regular intervals. To obtain an unbiased sample we would need to know the length of the stream in advance; when this is not possible, a slightly modified technique called RESERVOIR sampling is used, in which an unbiased random sample of s elements is taken without replacement. The basic idea behind this approach is that a sample of size at least s, called the reservoir, is maintained, from which a random sample of size s can be generated at any time. The basic drawback of this approach is that when the reservoir is large, generating the sample can be very costly.
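To make the reservoir idea concrete, here is a minimal sketch of the classic reservoir algorithm under our own naming (the stream, the reservoir size s, and the function are illustrative, not the authors' code):

```python
import random

def reservoir_sample(stream, s):
    """Keep an unbiased random sample of s items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < s:
            # Fill the reservoir with the first s items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability s / (i + 1),
            # which keeps every item equally likely to be in the sample.
            j = random.randint(0, i)
            if j < s:
                reservoir[j] = item
    return reservoir

# Example: sample 5 elements from a stream of 10,000 readings.
print(reservoir_sample(range(10000), 5))
```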
2. Sketching
Sketching [3] is the process of randomly projecting a subset of the features; it vertically samples the incoming stream. Sketching has been applied to comparing different data streams and to aggregate queries. Its major drawback is accuracy, which makes it hard to use in the context of data stream mining. Principal Component Analysis (PCA) would be a better solution and has been applied in streaming applications [5].

3. Load Shedding
Load shedding [6] refers to the process of dropping a sequence of data streams. It has been used successfully in querying data streams, and it has the same problems as sampling. Load shedding is difficult to use with mining algorithms because it drops chunks of the stream that could be used in structuring the generated models, or that might represent a pattern of interest in time-series analysis.

4. Synopsis Data Structures
Creating a synopsis of the data refers to applying summarization techniques capable of condensing the incoming stream for further analysis. Wavelet analysis, histograms, quantiles, and frequency moments [5] have been proposed as synopsis data structures. Since a synopsis does not represent all the characteristics of the dataset, approximate answers are produced when using such structures.

Aggregation is the process of computing statistical measures, such as means and variances, that summarize the incoming stream; this aggregated data can then be consumed by the mining algorithm. The problem with aggregation is that it does not perform well with highly fluctuating data distributions.

5. Sliding Window
This is an advanced technique. The inspiration behind the sliding window is that the user is more concerned with the analysis of the most recent data. Thus, detailed analysis is done over the most recent data items, while summarized versions are kept for the old ones. This idea has been adopted in many techniques of ongoing comprehensive data stream mining systems.
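The sliding-window idea can be sketched in a few lines: the structure below keeps recent items in full detail and only a running summary (here, a mean) of expired ones. The names and the choice of summary are our assumptions:

```python
from collections import deque

class SlidingWindow:
    """Keep the last `size` items in full detail; summarize everything older."""
    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.expired_count = 0
        self.expired_sum = 0.0  # running summary of old items

    def add(self, value):
        self.window.append(value)
        if len(self.window) > self.size:
            old = self.window.popleft()
            # Old items leave the window but remain in the summary.
            self.expired_count += 1
            self.expired_sum += old

    def recent(self):
        return list(self.window)  # detailed, most recent data

    def old_mean(self):
        if self.expired_count == 0:
            return None
        return self.expired_sum / self.expired_count  # summarized old data

w = SlidingWindow(size=3)
for x in [5, 7, 9, 11, 13]:
    w.add(x)
print(w.recent(), w.old_mean())  # [9, 11, 13] 6.0
```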
III. MINING TASKS
Stream mining tasks include classification, clustering, and mining time-series data.
In this paper we discuss some of the algorithms that are used for the classification of stream data and compare them on the basis of their experimental results.
Classification is generally a two-step process consisting of learning, or model construction (where a model is built from class-labeled tuples in the training set), and classification, or model usage (where the model is used to predict the class labels of tuples from new data sets).
The algorithms involved in the classification of stream data are Naive Bayesian, Hoeffding tree, VFDT (Very Fast Decision Tree), and CVFDT (Concept Adapting Very Fast Decision Tree).
We will discuss the working of these algorithms in detail here.

3.1 Naive Bayesian
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or on the existence of the other features, a naive Bayes classifier considers all of these properties as contributing independently to the probability that the fruit is an apple. This is a very common approach used for classification, hence the name Naive Bayesian.

Algorithm:
Step 1: Assumptions. D is the training set of tuples and class labels; a tuple is X = (x1, x2, ..., xn) over attributes A1, A2, ..., An, and the classes are C1, C2, ..., Cm.
Step 2: Maximum posteriori hypothesis. Predict Ci such that P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i, where P(Ci|X) = P(X|Ci)P(Ci)/P(X) (by Bayes' theorem).
Step 3: Compute the prior probability of X.
Step 4: Class conditional independence (the naive assumption): P(X|Ci) = ∏ P(xk|Ci) for k = 1 to n = P(x1|Ci) * P(x2|Ci) * ... * P(xn|Ci).
Step 5: Classifier prediction. For tuple X, the predicted class is Ci if and only if P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.
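The five steps above reduce to counting for categorical attributes, which is also what makes the classifier incremental. A minimal sketch under our own naming (toy data in the spirit of Fig 1.2; not the authors' code):

```python
from collections import defaultdict

def train_naive_bayes(data):
    """data: list of (attribute_tuple, class_label). Returns count tables."""
    class_counts = defaultdict(int)
    attr_counts = defaultdict(int)  # (class, attr_index, value) -> count
    for attrs, label in data:
        class_counts[label] += 1
        for k, v in enumerate(attrs):
            attr_counts[(label, k, v)] += 1
    return class_counts, attr_counts

def predict(class_counts, attr_counts, attrs):
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / total                         # P(Ci)
        for k, v in enumerate(attrs):
            score *= attr_counts[(c, k, v)] / cc   # P(xk|Ci), naive assumption
        if score > best_score:
            best, best_score = c, score
    return best

# Toy data in the spirit of Fig 1.2: (branch,) -> buys_computer
data = [(("CE",), "YES")] * 4 + [(("IT",), "NO")] * 2
cc, ac = train_naive_bayes(data)
print(predict(cc, ac, ("CE",)))  # YES
```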
Example: Naive Bayesian Classifier

ROLLNO | BRANCH | BUYS_COMPUTER
A1     | CE     | YES
B1     | CE     | YES
A2     | CE     | YES
A3     | CE     | YES
B2     | IT     | NO
B3     | IT     | NO

(Fig 1.2: Naive Bayesian Example)

Performance Analysis
Advantages:
- Minimum error rate.
- High accuracy and speed when applied to large databases.
- Incremental.
Limitations:
- Cannot handle concept drift.

3.2 Decision Tree Approach: Hoeffding Tree Classifier
In the Hoeffding algorithm, the classification problem must first be defined. A classification problem is a set of training examples of the form (a, b), where a is a vector of d attributes and b is a discrete class label. Our goal is to produce a model b = f(a) that predicts the classes of future examples with high accuracy. Decision tree learning is considered one of the most effective classification methods. We can learn a decision tree by recursively replacing leaf nodes with test nodes, starting at the root; each internal node tests an attribute, each branch gives a possible outcome of the test, and each leaf contains a class prediction. Traditionally, the data is stored in main memory before processing starts, and for complex trees it is expensive to repeatedly read data from secondary memory. So our aim is to design decision tree learners that read each example at most once and use only a small amount of time to process it. The first key step is to find the best attribute at a node, considering only the training examples that pass through that node; once the root is chosen, subsequent examples are passed down to the corresponding leaves and used to choose the attributes there, and so on. How many examples are required at each node is decided by the Hoeffding bound. Take a random variable a whose range is R, and suppose we have n independent observations of a with sample mean ā. The Hoeffding bound states that, with probability 1 - δ, the true mean of a is at least ā - ε, where [7]

ε = √( R² ln(1/δ) / (2n) ).
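In code the bound is a one-liner; a small sketch follows (the numbers in the example are ours, purely illustrative):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon such that the true mean lies within epsilon of the sample mean
    with probability 1 - delta, after n independent observations."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# e.g. information gain lies in [0, 1] (R = 1), delta = 1e-6, 1000 examples
print(hoeffding_bound(1.0, 1e-6, 1000))  # about 0.083
```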
Algorithm:
Step 1: Calculate the information gain for the attributes and determine the best two attributes. Pre-pruning: consider a "null" attribute that corresponds to not splitting the node.
Step 2: At each node, check the split condition.
Step 3: If the condition is satisfied, create child nodes based on the test at the node.
Step 4: If not, stream in more examples and repeat the calculations until the condition is satisfied.
(Fig 1.3: Hoeffding Tree Example)

Performance Analysis
Advantages:
- High accuracy with a small sample.
- Multiple scans of the same data are never performed.
- Incremental.
- Can classify the data while growing.
Limitations:
- Time-consuming splitting-attribute selection.
- Cannot handle concept drift.

3.3 VFDT Algorithm
Algorithm:
Input: δ, the desired probability level.
Output: T, a decision tree.
Init: T ← empty leaf (root).
Step 1. While (TRUE)
Step 2. Read the next example.
Step 3. Propagate the example through the tree from the root to a leaf.
Step 4. Update the sufficient statistics at the leaf.
Step 5. If the number of examples at the leaf > Nmin:
Step 6. Evaluate the merit G of each attribute.
Step 7. Let A1 be the best attribute and A2 the second best.
Step 8. Let ε be the Hoeffding bound.
Step 9. If G(A1) - G(A2) > ε:
Step 10. Install a splitting test based on A1.
Step 11. Expand the tree with two descendant leaves.

Explanation:
Sometimes more than one attribute has nearly the same merit value. Choosing the best attribute is then quite critical, and we must decide between the candidates with high confidence. VFDT resolves this: when there is effectively a tie, it splits on the current best attribute whenever the difference in G(.) between the best split candidate and all the others is less than ε and ε < τ, where τ is a user-defined threshold.
G computation: it is inefficient to recompute G for every new example, because near a tie it is hard to separate the best attributes. Instead, the user specifies a minimum number of new examples, Nmin, that must be seen at a leaf before G is recomputed. This mechanism greatly reduces the global time spent on G computations and lets VFDT learn almost as fast as it can classify data sets.
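The split rule of Steps 5-11, together with the tie-breaking just described, can be sketched as a single self-contained check (the parameter values are illustrative assumptions, not from the paper):

```python
import math

def should_split(g_best, g_second, n, delta=1e-6, tau=0.05, value_range=1.0):
    """VFDT split rule at a leaf: split when the best attribute's gain clearly
    beats the runner-up, or when the race is a statistical tie (eps < tau)."""
    eps = math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))
    return (g_best - g_second > eps) or (eps < tau)

# After 5000 examples, gains 0.42 vs 0.40: eps is about 0.037, so the gap is
# not decisive, but eps < tau, so VFDT treats it as a tie and splits anyway.
print(should_split(0.42, 0.40, 5000))  # True
```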
Performance Analysis
Advantages:
- VFDT is better than the Hoeffding tree in terms of time, memory, and accuracy.
- VFDT gains its advantage after about 100k examples, greatly improving accuracy.
Limitations:
- Concept drift is not handled in VFDT.

3.4 CVFDT
Algorithm:
Step 1. The alternate trees for each node in HT start as empty.
Step 2. Process examples from the stream indefinitely.
Step 3. For each example (x, y):
Step 4. Pass (x, y) down to a set of leaves using HT and all alternate trees of the nodes that (x, y) passes through.
Step 5. Add (x, y) to the sliding window of examples.
Step 6. Remove and forget the effect of the oldest examples if the sliding window overflows.
Step 7. CVFDT grows.
Step 8. Check split validity if f examples have been seen since the last check of the alternate trees.
Step 9. Return HT.

Explanation:
- CVFDT is an extended version of VFDT that provides the same speed and accuracy advantages but, when changes occur in the example-generating process, adds the ability to detect and respond to them.
- CVFDT uses a sliding window over the data to keep its model consistent.
- Most systems need to learn a new model from scratch after new data arrives. Instead, CVFDT continuously monitors the quality of incoming data and adjusts those parts of the model that are no longer correct: whenever new data arrives, CVFDT increments the counts for the new data and decrements the counts for the oldest data in the window.
- However, if the concept is changing, some splits that once looked best will no longer appear best, because the new data provides more gain elsewhere. Whenever this happens, CVFDT creates an alternative sub-tree to find the best attribute at its root. Each time a new best sub-tree is found it replaces the old one, being more accurate on the new data.
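The increment/decrement bookkeeping described in the last two bullets can be sketched as follows (a simplification with our own names, not the CVFDT implementation):

```python
from collections import deque, Counter

class WindowedCounts:
    """Keep class counts over a sliding window: increment for arriving
    examples, decrement for the ones that fall out (CVFDT-style)."""
    def __init__(self, window_size):
        self.window = deque()
        self.counts = Counter()
        self.window_size = window_size

    def add(self, label):
        self.window.append(label)
        self.counts[label] += 1          # count the new example in
        if len(self.window) > self.window_size:
            old = self.window.popleft()
            self.counts[old] -= 1        # forget the oldest example

w = WindowedCounts(window_size=4)
for label in ["A", "A", "B", "A", "B", "B"]:
    w.add(label)
print(w.counts)  # Counter({'B': 3, 'A': 1}): reflects only recent data
```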
IV. COMPARISON
The comparison shows that CVFDT is the best of the compared classification algorithms and that it also handles concept drift. These results were generated using the MOA software; the criterion for evaluating all the compared algorithms is accuracy.

4.1 Comparison of the generated CSV files

Learning evaluation instances | Naive Bayesian | Hoeffding Tree | CVFDT
10000  | 72.91  | 80.4  | 80.56
20000  | 73.17  | 83.74 | 84.06
30000  | 73.27  | 85.57 | 85.86
40000  | 73.38  | 86.4  | 86.83
50000  | 73.35  | 87.07 | 87.49
60000  | 73.36  | 87.60 | 88.11
70000  | 73.38  | 88.02 | 88.61
80000  | 73.363 | 88.40 | 89.06
90000  | 73.39  | 88.70 | 89.43
100000 | 73.38  | 88.99 | 89.73

(Fig 1.4: Comparison of CSV Files)

4.2 Online Comparison with GNUPlot

(Fig 1.5: Comparison Chart)
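A chart like Fig 1.5 can also be redrawn from exported evaluation CSVs; a hedged sketch follows (the file names are our assumption, and the column headers shown are those typically written by MOA's prequential evaluation, which may differ by version):

```python
import csv
import matplotlib.pyplot as plt

for name in ["naive_bayes.csv", "hoeffding_tree.csv", "cvfdt.csv"]:
    instances, accuracy = [], []
    with open(name) as f:
        for row in csv.DictReader(f):
            instances.append(int(float(row["learning evaluation instances"])))
            accuracy.append(float(row["classifications correct (percent)"]))
    plt.plot(instances, accuracy, label=name.removesuffix(".csv"))

plt.xlabel("Learning evaluation instances")
plt.ylabel("Accuracy (%)")
plt.legend()
plt.show()
```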
V. CONCLUSION
In this paper we discussed theoretical aspects and practical results of stream data mining classification algorithms. Among these algorithms, Hoeffding trees spend only a small amount of time on learning and are built very differently from batch trees; even with the limited hardware resources of a real-world scenario, they analyze the data and generate results with high accuracy. In data mining systems, VFDT is based on Hoeffding trees, and VFDT was extended to handle time-variant data in the CVFDT system, which uses a sliding window to keep its trees up to date on time-variant data streams [7]. The results also show that CVFDT performs better than VFDT and the Hoeffding tree in terms of accuracy, and that CVFDT handles concept drift, which the other compared algorithms do not. Concept drift and memory utilization remain the crucial challenges in the stream data mining field.
REFERENCES
[1] Elena Ikonomovska, Suzana Loskovska, Dejan Gjorgjevik, "A Survey of Stream Data Mining", Eighth National Conference with International Participation, ETAI 2007.
[2] S. Muthukrishnan, "Data Streams: Algorithms and Applications", Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2003.
[3] Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy, "Mining Data Streams: A Review", Centre for Distributed Systems and Software Engineering, Monash University, 900 Dandenong Rd, Caulfield East, VIC 3145, Australia.
[4] P. Domingos and G. Hulten, "A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering", Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA, Morgan Kaufmann, 2001.
[5] H. Kargupta, R. Bhargava, K. Liu, M. Powers, P. Blair, S. Bushra, J. Dull, K. Sarkar, M. Klein, M. Vasa, and D. Handy, "VEDAS: A Mobile and Distributed Data Stream Mining System for Real-Time Vehicle Monitoring", Proceedings of the SIAM International Conference on Data Mining, 2004.
[6] B. Babcock, M. Datar, and R. Motwani, "Load Shedding Techniques for Data Stream Systems" (short paper), Proceedings of the 2003 Workshop on Management and Processing of Data Streams, June 2003.
[7] Tusharkumar Trambadiya, Praveen Bhanodia, "A Comparative Study of Stream Data Mining Algorithms", International Journal of Engineering and Innovative Technology (IJEIT), Volume 2, Issue 3, September 2012.