Data streams are sequential sets of data records. When data arrives continuously and at high speed, predicting the class of a record in a timely manner is essential. Ensemble modeling techniques are currently growing rapidly in data stream classification. Ensemble learning is widely accepted because of its ability to handle large volumes of streaming data and to cope with concept drift. Prior work has focused mostly on the accuracy of the ensemble model; prediction efficiency has received little attention, since existing ensembles predict in linear time, which is adequate for small applications and for models that integrate only a few classifiers. Real-time applications, however, process huge data streams, so base classifiers that recognize dissimilar models are needed to form a high-grade ensemble. To address these challenges we developed the Ensemble Tree, a height-balanced tree index structure over base classifiers that enables fast prediction on data streams by ensemble modeling techniques. The Ensemble Tree manages ensembles as spatial databases and uses an R-tree-like structure to achieve sub-linear prediction time.
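To make the indexing idea concrete, here is a minimal sketch of selecting base classifiers in logarithmic rather than linear time. The paper's actual structure is R-tree-like; this sketch stands in a sorted 1-D index searched with binary search, and the chunk-centroid key, the `EnsembleIndex` name, and the k-nearest-classifier vote are illustrative assumptions, not the Ensemble Tree's exact design.

```python
# Sketch: index base classifiers by a spatial key so that only the
# classifiers relevant to an incoming record are evaluated, instead of
# all ensemble members. A sorted 1-D list stands in for the R-tree here.
import bisect
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class EnsembleIndex:
    def __init__(self):
        self.keys = []   # sorted 1-D keys (projected chunk centroids)
        self.clfs = []   # base classifiers, aligned with self.keys

    def add_chunk(self, X, y):
        # train one base classifier per stream chunk (assumed setup)
        clf = DecisionTreeClassifier(max_depth=5).fit(X, y)
        key = float(X.mean())            # crude 1-D projection of the chunk
        pos = bisect.bisect_left(self.keys, key)
        self.keys.insert(pos, key)
        self.clfs.insert(pos, clf)

    def predict(self, x, k=3):
        # binary search locates the k classifiers nearest to the record's
        # key: O(log n) selection instead of querying every member
        key = float(np.mean(x))
        pos = bisect.bisect_left(self.keys, key)
        lo, hi = max(0, pos - k), min(len(self.clfs), pos + k)
        votes = [c.predict(x.reshape(1, -1))[0] for c in self.clfs[lo:hi]]
        return max(set(votes), key=votes.count)   # majority vote
```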
This document discusses GCUBE indexing, a method for indexing and aggregating spatial/continuous values in a data warehouse. The key challenges addressed are defining and aggregating spatial/continuous values, and efficiently representing, indexing, updating, and querying data that includes both categorical and continuous dimensions. The proposed GCUBE approach maps multi-dimensional data to a linear ordering using the Hilbert curve, then constructs an index structure on the ordered data to enable efficient query processing. Empirical results show that GCUBE indexing offers significant performance advantages over alternative approaches.
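As an illustration of the linearization step GCUBE relies on, the sketch below implements the classic 2-D Hilbert-curve key and sorts points by it so they can be handed to a conventional 1-D index. The 2-D power-of-two grid and the `hilbert_key` helper are simplifications of the paper's general multi-dimensional setting.

```python
# Classic Hilbert curve mapping from a 2-D grid cell to a position along
# the curve; nearby cells tend to receive nearby keys, which is what makes
# the linear ordering useful for range queries.
def hilbert_key(n, x, y):
    """Position of grid cell (x, y) on the Hilbert curve over an
    n-by-n grid (n must be a power of two)."""
    key = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        key += s * s * ((3 * rx) ^ ry)
        if ry == 0:                       # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return key

# sort points into Hilbert order before building the index
points = [(0, 0), (7, 7), (1, 0), (0, 1), (4, 3)]
print(sorted(points, key=lambda p: hilbert_key(8, *p)))
```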
With the development of databases, the volume of data stored in them increases rapidly, and much important information is hidden in these large amounts of data. If that information can be extracted from the database, it can create a great deal of value for the organization. The question organizations ask is how to extract this value; the answer is data mining. Many technologies are available to data mining practitioners, including artificial neural networks, genetic algorithms, fuzzy logic, and decision trees. Many practitioners are wary of neural networks because of their black-box nature, even though they have proven themselves in many situations. This paper is an overview of artificial neural networks and questions their position as a preferred tool of data mining practitioners.
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE (IJDKP)
Metadata is information about the data to be stored in a data warehouse and is a mandatory element in building an efficient data warehouse. Metadata helps in data integration, lineage, data quality, and populating transformed data into the warehouse. Spatial data warehouses are based on spatial data, mostly collected from Geographical Information Systems (GIS) and from transactional systems specific to an application or enterprise. Metadata design and deployment is the most critical phase in building a data warehouse, where spatial information and data modeling must be brought together. In this paper, we present a holistic metadata framework that drives metadata creation for a spatial data warehouse. Theoretically, the proposed framework improves the efficiency of data access in response to frequent queries on SDWs; in other words, it decreases query response time while accurate information, including the spatial information, is fetched from the data warehouse.
A new model for iris data set classification based on linear support vector m... (IJECEIAES)
1. The authors propose a new model for classifying the iris data set using a linear support vector machine (SVM) classifier with genetic algorithm optimization of the SVM's C and gamma parameters.
2. Principal component analysis was used to reduce the iris data set features from four to three before classification.
3. The genetic algorithm was shown to optimize the SVM parameters, achieving 98.7% accuracy on the iris data set classification compared to 95.3% accuracy without parameter optimization.
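A hedged sketch of the pipeline these points describe is shown below: PCA reduces the four iris features to three, and a small genetic algorithm searches over the SVM's C and gamma. The GA settings (population size, elitist selection, mutation scale) and the RBF kernel are illustrative assumptions, not the paper's exact setup.

```python
import random
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X3 = PCA(n_components=3).fit_transform(X)   # 4 features -> 3

def fitness(ind):
    C, gamma = ind
    return cross_val_score(SVC(C=C, gamma=gamma), X3, y, cv=5).mean()

random.seed(0)
# initial population: (C, gamma) pairs sampled in log space
pop = [(10 ** random.uniform(-2, 3), 10 ** random.uniform(-4, 1))
       for _ in range(20)]
for gen in range(15):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:5]                        # elitist selection
    children = []
    for _ in range(15):
        C, gamma = random.choice(parents)
        # log-space Gaussian mutation keeps both parameters positive
        children.append((C * 10 ** random.gauss(0, 0.3),
                         gamma * 10 ** random.gauss(0, 0.3)))
    pop = parents + children

best = max(pop, key=fitness)
print("best C=%.4g gamma=%.4g acc=%.3f" % (best[0], best[1], fitness(best)))
```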
This document summarizes a research paper that proposes a heuristic approach to preserve privacy in stream data classification. The approach applies data perturbation to the stream data before performing classification. This allows privacy to be preserved while also building an accurate classification model for the large-scale stream data. The approach is implemented in two phases: first the data stream is perturbed, then classification is performed on both the perturbed and original data. Experimental results show that this approach can effectively preserve privacy while reducing the complexity of mining large stream data.
Certain Investigation on Dynamic Clustering in Dynamic Datamining (ijdmtaiir)
Clustering is the process of grouping a set of objects into classes of similar objects. Dynamic clustering is a new research area concerned with datasets that have dynamic aspects: clusters must be updated whenever new data records are added to the dataset, which may change the clustering over time. When updates are continuous and the volume of dynamic data is huge, rescanning the database is not feasible in static data mining, but it is possible in a dynamic data mining process. Dynamic data mining arises when derived information is maintained for analysis while the environment is dynamic, i.e. many updates occur. Since this has now been established by most researchers, research is moving toward solving the problems of mining dynamic databases. This paper investigates existing work on dynamic clustering and incremental data clustering.
This document discusses various methodologies for processing and analyzing stream data, time series data, and sequence data. It covers topics such as random sampling and sketches/synopses for stream data, data stream management systems, the Hoeffding tree and VFDT algorithms for stream data classification, concept-adapting algorithms, ensemble approaches, clustering of evolving data streams, time series databases, Markov chains for sequence analysis, and algorithms like the forward algorithm, Viterbi algorithm, and Baum-Welch algorithm for hidden Markov models.
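The Hoeffding tree and VFDT mentioned in this list decide when a leaf has seen enough examples to split using the Hoeffding bound: with probability $1-\delta$, the true mean of a random variable with range $R$ lies within $\epsilon$ of the mean observed over $n$ examples, where

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$

A leaf is split on the best attribute as soon as its information gain exceeds that of the second-best attribute by more than $\epsilon$; for information gain over $c$ classes, $R = \log_2 c$.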
A fuzzy clustering algorithm for high dimensional streaming data (Alexander Decker)
This document summarizes a research paper that proposes a new dimension-reduced weighted fuzzy clustering algorithm (sWFCM-HD) for high-dimensional streaming data. The algorithm can cluster datasets that have both high dimensionality and a streaming (continuously arriving) nature. It combines previous work on clustering algorithms for streaming data and high-dimensional data. The paper introduces the algorithm and compares it experimentally to show improvements in memory usage and runtime over other approaches for these types of datasets.
Extended PSO algorithm for improvement problems k-means clustering algorithm (IJMIT JOURNAL)
Clustering is an unsupervised process and one of the most common data mining techniques. Its purpose is to group similar data together so that instances within a cluster are as similar as possible to each other and as different as possible from instances in other clusters. In this paper we focus on partitional k-means clustering, which, owing to its ease of implementation and high speed on large data sets, remains very popular thirty years after its development. To address the problem of k-means becoming trapped in local optima, we propose an extended PSO algorithm named ECPSO. The new algorithm is able to escape local optima and, with high probability, produces the problem's optimal answer. The experimental results show that the proposed algorithm performs better than other clustering algorithms, especially on two indices: clustering accuracy and clustering quality.
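A minimal sketch of using PSO to escape k-means' local optima, in the spirit of the ECPSO idea above: each particle encodes a full set of k centroids and the swarm minimizes the within-cluster sum of squared distances. The swarm size, inertia, and acceleration constants are illustrative defaults, not the paper's values.

```python
import numpy as np

def sse(centroids, X):
    # assign each point to its nearest centroid; sum squared distances
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).sum()

def pso_kmeans(X, k, particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    rng = np.random.default_rng(0)
    n, _ = X.shape
    pos = X[rng.integers(0, n, (particles, k))]   # each particle: k centroids
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_f = np.array([sse(p, X) for p in pos])
    g = pbest[pbest_f.argmin()].copy()            # global best
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (g - pos)
        pos = pos + vel
        f = np.array([sse(p, X) for p in pos])
        better = f < pbest_f
        pbest[better], pbest_f[better] = pos[better], f[better]
        g = pbest[pbest_f.argmin()].copy()
    return g   # best set of k centroids found
```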
Ensemble based Distributed K-Modes Clustering (IJERD Editor)
Clustering has been recognized as the unsupervised classification of data items into groups. Due to the explosion in the number of autonomous data sources, there is an emergent need for effective approaches to distributed clustering. A distributed clustering algorithm clusters distributed datasets without gathering all the data at a single site. K-Means is a popular clustering method owing to its simplicity and speed on large datasets, but it cannot directly handle datasets with categorical attributes, which commonly occur in real-life data. Huang proposed the K-Modes clustering algorithm, introducing a new dissimilarity measure for categorical data: it replaces the means of clusters with a frequency-based method that updates modes during the clustering process to minimize the cost function. Most distributed clustering algorithms in the literature cluster numerical data. In this paper, a novel Ensemble based Distributed K-Modes clustering algorithm is proposed, well suited both to handling categorical data sets and to performing the distributed clustering process asynchronously. Its performance is compared with existing distributed K-Means clustering algorithms and with a K-Modes based centralized clustering algorithm; the experiments are carried out on various datasets from the UCI machine learning repository.
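The two ingredients Huang's K-Modes adds over K-Means can be sketched in a few lines: a simple-matching dissimilarity for categorical records and a frequency-based mode update. The measures below are Huang's standard formulations; the example records are illustrative.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def update_mode(cluster_records):
    """Per attribute, take the most frequent category in the cluster."""
    return tuple(Counter(col).most_common(1)[0][0]
                 for col in zip(*cluster_records))

records = [("red", "small"), ("red", "large"), ("blue", "small")]
print(matching_dissimilarity(records[0], records[2]))  # -> 1
print(update_mode(records))                            # -> ('red', 'small')
```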
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic... (IRJET Journal)
This document proposes a two-stage sampling selection strategy (T3S) for large-scale data deduplication using Apache Spark. T3S reduces the labeling effort for training data by first selecting balanced subsets of candidate pairs, then removing redundant pairs to produce a smaller, more informative training set. It detects fuzzy region boundaries using this training set to classify candidate pairs. The approach is implemented in a distributed manner using Apache Spark and shows better performance than an existing method by reducing the training set size.
A statistical data fusion technique in virtual data integration environment (IJDKP)
Data fusion in a virtual data integration environment starts after detecting and clustering duplicated records from the different integrated data sources. It refers to the process of selecting or fusing attribute values from the clustered duplicates into a single record representing the real-world object. In this paper, a statistical technique for data fusion is introduced, based on probabilistic scores from both the data sources and the clustered duplicates.
Data mining is used to manage the huge amounts of information stored in data warehouses and databases and to discover the required information and knowledge. Numerous data mining techniques have been proposed, for example association rules, decision trees, neural networks, and clustering, and the field has been a focus of attention for many years. Among the available data mining strategies, clustering of the dataset is one of the most effective: it groups the dataset into a number of clusters based on predefined guidelines and reveals relationships between the distinct characteristics of the data.
In the k-means clustering algorithm, features are selected on the basis of their relevance for predicting the data, and the Euclidean distance between a cluster's centroid and the data objects outside the cluster is computed when clustering the data points. In this work, the authors enhance the Euclidean distance formula to increase cluster quality.
The problem of accuracy and of redundant, dissimilar points within clusters remains in the improved k-means, so a new enhanced approach is proposed that uses a similarity function to check a point's similarity level before including it in a cluster.
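A hedged sketch of that similarity gate follows: before a point is committed to its nearest cluster, a similarity function vets it, and points below the threshold are held out. The cosine-based similarity and the 0.8 threshold are illustrative assumptions, not the approach's actual function or value.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def assign(point, centroids, threshold=0.8):
    dists = [np.linalg.norm(point - c) for c in centroids]
    nearest = int(np.argmin(dists))          # standard k-means assignment
    if cosine_similarity(point, centroids[nearest]) >= threshold:
        return nearest                       # similar enough: accept
    return None                              # flag as dissimilar / outlier
```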
Assessment of Cluster Tree Analysis based on Data Linkages (journal ijrtem)
Abstract: Record linkage is a procedure that joins two or more sets of data (surveyed or proprietary) from different organizations to generate a valuable store of information that can be used for further analysis, allowing real application of the data. One-to-many data linkage associates an entity from the first data set with a number of related entities from the other data sets. Prior work concentrated on accomplishing one-to-one data linkages. A two-level clustering tree known as the One-Class Clustering Tree (OCCT), with a built-in Jaccard similarity measure, was previously proposed, in which each leaf contains a group instead of a single classified sequence. OCCT's use of the Jaccard similarity coefficient increases time complexity significantly, so we propose substituting the Jaccard coefficient with the Jaro-Winkler similarity measure to obtain the group similarity, since, unlike Jaccard's, it takes order into consideration, using positional indices to calculate relevance. An assessment of our proposal serves as validation of an enhanced one-to-many data linkage system.
Index Terms: Maximum-Weighted Bipartite Matching, Ant Colony Optimization, Graph Partitioning Technique
This document provides an overview of different techniques for clustering categorical data. It discusses various clustering algorithms that have been used for categorical data, including K-modes, ROCK, COBWEB, and EM algorithms. It also reviews more recently developed algorithms for categorical data clustering, such as algorithms based on particle swarm optimization, rough set theory, and feature weighting schemes. The document concludes that clustering categorical data remains an important area of research, with opportunities to develop techniques that initialize cluster centers better.
The document surveys previous work on clustering of time series data. It discusses three main approaches to time series clustering: (1) working directly with raw time series data, (2) extracting features from the raw data and clustering the features, and (3) building models from the raw data and clustering the model parameters. It also summarizes commonly used clustering algorithms, measures for determining time series similarity, and criteria for evaluating clustering results. The document categorizes and discusses past time series clustering research and identifies topics for future work. It aims to provide an overview of the field of time series clustering.
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval (IJECEIAES)
Data mining is an essential process for identifying patterns in large datasets through machine learning techniques and database systems. Clustering of high dimensional data is becoming a very challenging process due to the curse of dimensionality, and existing methods have not improved space complexity or data retrieval performance. To overcome these limitations, a Spectral Clustering Based VP Tree Indexing Technique is introduced. The technique clusters and indexes densely populated high dimensional data points for effective retrieval in response to user queries. A Normalized Spectral Clustering Algorithm groups similar high dimensional data points; a Vantage Point Tree is then constructed to index the clustered data points with minimum space complexity; finally, the indexed data is retrieved for a user query using a Vantage Point Tree based Data Retrieval Algorithm. This improves the true positive rate with minimum retrieval time. Performance is measured in terms of space complexity, true positive rate, and data retrieval time on the El Nino weather data sets from the UCI Machine Learning Repository. Experimental results show that the proposed technique reduces space complexity by 33% and data retrieval time by 24% compared to state-of-the-art works.
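A compact sketch of the vantage-point tree used as the index in this technique: pick a vantage point, split the remaining points by the median distance to it, and recurse; queries prune subtrees via the triangle inequality. This illustrates the index only, not the spectral clustering stage or the paper's exact construction.

```python
import numpy as np

class VPNode:
    def __init__(self, points):
        self.vp = points[0]                      # vantage point
        rest = points[1:]
        if len(rest) == 0:
            self.mu = self.inner = self.outer = None
            return
        d = np.linalg.norm(rest - self.vp, axis=1)
        self.mu = float(np.median(d))            # median split radius
        inner, outer = rest[d < self.mu], rest[d >= self.mu]
        self.inner = VPNode(inner) if len(inner) else None
        self.outer = VPNode(outer) if len(outer) else None

    def nearest(self, q, best=(None, np.inf)):
        d = float(np.linalg.norm(q - self.vp))
        if d < best[1]:
            best = (self.vp, d)
        if self.mu is None:
            return best
        # search the likelier side first; prune the other if it cannot help
        near, far = ((self.inner, self.outer) if d < self.mu
                     else (self.outer, self.inner))
        if near is not None:
            best = near.nearest(q, best)
        if far is not None and abs(d - self.mu) < best[1]:
            best = far.nearest(q, best)
        return best

pts = np.random.default_rng(1).random((200, 8))
tree = VPNode(pts)
print(tree.nearest(np.zeros(8))[1])   # distance to nearest indexed point
```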
This document discusses techniques for detecting duplicate records from multiple web databases. It begins with an abstract describing an unsupervised approach that uses classifiers like the weighted component similarity summing classifier and support vector machine along with a Gaussian mixture model to iteratively identify duplicate records. The document then provides details on related work, including probabilistic matching models, supervised and unsupervised learning techniques, distance-based techniques, rule-based approaches, and methods for improving efficiency like blocking and the sorted neighborhood approach.
Recommendation system using bloom filter in mapreduce (IJDKP)
Many clients like to use the Web to discover product details in the form of online reviews, which are provided by other clients and specialists. Recommender systems provide an important response to the information overload problem by presenting users with more practical and personalized information services. Collaborative filtering methods are a vital component of recommender systems, as they generate high-quality recommendations by leveraging the preferences of communities of similar users; the method assumes that people with the same tastes choose the same items. Conventional collaborative filtering suffers from the sparse data problem and a lack of scalability, so a new recommender system is required to handle sparse data and produce high quality recommendations in a large-scale mobile environment. MapReduce is a programming model widely used for large-scale data analysis. The algorithm described here, a recommendation mechanism for mobile commerce, is user-based collaborative filtering implemented with MapReduce, which reduces the scalability problem of conventional CF systems. One of the essential operations for this analysis is the join, but MapReduce is not very efficient at joins because it always processes all records in the datasets even when only a small fraction is relevant to the join. This cost can be reduced by applying a bloom-join algorithm: Bloom filters are constructed and used to filter out redundant intermediate records. The proposed Bloom filter based algorithm reduces the number of intermediate results and improves join performance.
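A minimal sketch of the bloom-join idea described above: build a Bloom filter over the join keys of the smaller dataset, then use it in the map phase to drop records whose key cannot possibly join, so far fewer intermediate records are shuffled. The filter size, hash count, and sample data are illustrative.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=8192, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _hashes(self, key):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for h in self._hashes(key):
            self.bits[h // 8] |= 1 << (h % 8)

    def might_contain(self, key):
        return all(self.bits[h // 8] & (1 << (h % 8))
                   for h in self._hashes(key))

ratings = {"u1": 4.5, "u2": 3.0}          # small side of the join
bf = BloomFilter()
for user in ratings:
    bf.add(user)

# map phase: emit only records that can possibly join; false positives
# are re-checked in the reduce phase, false negatives never occur
stream = [("u1", "itemA"), ("u9", "itemB"), ("u2", "itemC")]
emitted = [r for r in stream if bf.might_contain(r[0])]
print(emitted)   # u9 is filtered out before the shuffle
```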
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE (cscpconf)
Traditional medical analysis is based on static data: the medical data is analyzed only after collection of the data sets is complete, which falls far short of actual demand. Large amounts of medical data are generated in real time, so real-time analysis can yield more value. This paper introduces the design of Sentinel, a real-time analysis system based on a clustering algorithm. Sentinel performs clustering analysis of real-time data and issues early alerts.
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US... (IJDKP)
The document summarizes a proposed methodology that integrates associative classification and neural networks for improved classification accuracy. It begins by introducing association rule mining and associative classification. It then describes using chi-squared analysis and the Gini index for attribute selection and rule pruning to generate a reduced set of rules. These rules are used to train a backpropagation neural network classifier. The methodology is tested on datasets from a public repository, demonstrating improved accuracy over traditional associative classification alone. Future work to integrate optical neural networks is also proposed.
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING (ijcsa)
This document provides a survey of optimization approaches that have been applied to text document clustering. It discusses several clustering algorithms and categorizes them as partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, frequent pattern-based clustering, and constraint-based clustering. It then describes several soft computing techniques that have been used as optimization approaches for text document clustering, including genetic algorithms, bees algorithms, particle swarm optimization, and ant colony optimization. These optimization techniques perform a global search to improve the quality and efficiency of document clustering algorithms.
This document presents an approach for clustering a mixed dataset containing both numeric and categorical attributes using an ART-2 neural network model. The dataset contains daily stock price data with 19 attributes describing comparisons between consecutive days. Clustering mixed datasets is challenging due to different attribute types. The ART-2 model is used to classify the dataset without requiring a distance function. Then an autoencoder model reduces the dimensionality to allow visual validation of the clusters. The results demonstrate the ART-2 model's ability to cluster complex, mixed datasets.
ANALYSIS AND COMPARISON STUDY OF DATA MINING ALGORITHMS USING RAPIDMINER (IJCSEA Journal)
A comparison study of algorithms is essential before implementing them for the needs of any organization. Comparisons of algorithms depend on various parameters such as data frequency, types of data, and relationships among the attributes in a given data set. A number of learning and classification algorithms are available to analyse data, learn patterns, and categorize data, but the problem is finding the best algorithm for the problem and the desired output. The desired result has always been higher accuracy in predicting future values or events from the given dataset. The algorithms taken for the comparison study are Neural Net, SVM, Naïve Bayes, BFT, and Decision Stump. These are among the most influential data mining algorithms in the research community and are widely used in the field of knowledge discovery and data mining.
This document summarizes a research paper that evaluates cluster quality using a modified density subspace clustering approach. It discusses how density subspace clustering can be used to identify clusters in high-dimensional datasets by detecting density-connected clusters in all subspaces. The proposed approach uses a density subspace clustering algorithm to select attribute subsets and identify the best clusters. It then calculates intra-cluster and inter-cluster distances to evaluate cluster quality and compares the results to other clustering algorithms in terms of accuracy and runtime. Experimental results showed that the proposed method improves clustering quality and performs faster than existing techniques.
Comparison of Text Classifiers on News ArticlesIRJET Journal
This document compares the performance of five text classification algorithms (SVM, Naive Bayes, K-Nearest Neighbors, Decision Tree, Rocchio) on news article datasets. It finds that an SVM classifier achieves the highest accuracy on the datasets tested, with accuracies of 86.7%, 75.6%, and 97.6% on the Twenty Newsgroups, Reuters, and BBC News datasets respectively. The document also evaluates and compares the training times and testing times of the classifiers on the different datasets.
Incremental learning from unbalanced data with concept class, concept drift a...IJDKP
Recently, stream data mining applications has drawn vital attention from several research communities.
Stream data is continuous form of data which is distinguished by its online nature. Traditionally, machine
learning area has been developing learning algorithms that have certain assumptions on underlying
distribution of data such as data should have predetermined distribution. Such constraints on the problem
domain lead the way for development of smart learning algorithms performance is theoretically verifiable.
Real-word situations are different than this restricted model. Applications usually suffers from problems
such as unbalanced data distribution. Additionally, data picked from non-stationary environments are also
usual in real world applications, resulting in the “concept drift” which is related with data stream
examples. These issues have been separately addressed by the researchers, also, it is observed that joint
problem of class imbalance and concept drift has got relatively little research. If the final objective of
clever machine learning techniques is to be able to address a broad spectrum of real world applications,
then the necessity for a universal framework for learning from and tailoring (adapting) to, environment
where drift in concepts may occur and unbalanced data distribution is present can be hardly exaggerated.
In this paper, we first present an overview of issues that are observed in stream data mining scenarios,
followed by a complete review of recent research in dealing with each of the issue.
The document describes a machine learning toolbox developed using Python that implements and compares several supervised machine learning algorithms, including Naive Bayes, K-nearest neighbors, decision trees, SVM, and neural networks. The toolbox allows users to test algorithms on various datasets, including Iris and diabetes data, and compare the accuracy results. Testing on these datasets showed Naive Bayes and K-nearest neighbors had the highest average accuracy rates, while neural networks and decision trees showed more variable performance depending on parameters and dataset splits. The toolbox is intended to help users evaluate which algorithms best fit their datasets.
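A minimal sketch of the comparison loop such a toolbox performs: fit each supervised algorithm on the same dataset and report cross-validated accuracy. The models use scikit-learn defaults here, which are assumptions rather than the toolbox's actual settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
models = {
    "Naive Bayes": GaussianNB(),
    "K-nearest neighbors": KNeighborsClassifier(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "Neural network": MLPClassifier(max_iter=2000, random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()  # 5-fold accuracy
    print(f"{name}: {acc:.3f}")
```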
The document summarizes research on multi-document summarization using EM clustering. It begins with an introduction to the topic and issues with existing techniques. It then proposes using Expectation-Maximization (EM) clustering to identify clusters, which improves over other methods by identifying latent semantic variables between sentences. The architecture involves preprocessing, EM clustering, mutual reinforcement ranking algorithms RARP and RDRP, summarization, and post-processing. Experimental results on DUC2007 data show EM clustering identifies more clusters and sentences than affinity propagation clustering. The technique aims to improve summarization accuracy by better capturing semantic relationships between sentences.
Evaluation of a New Incremental Classification Tree Algorithm for Mining High... (mlaij)
A new model for online machine learning process of high speed data stream is proposed, to minimize the severe restrictions associated with the existing computer learning algorithms. Most of the existing models have three principle steps. In the first step, the system would create a model incrementally. In the second step the time taken by the examples to complete a prescribed procedure with their arrival speed is computed. In the third and final step of the model the size of memory required for computation is predicted in advance. To overcome these restrictions we proposed this new data stream classification algorithm, where the data can be partitioned into stream of trees. In this algorithm, the new data set can be updated with the existing tree. This algorithm, called incremental classification tree algorithm, is proved to be an excellent solution for processing larger data streams. In this paper, we present the experimental results of our new algorithm and prove that our method would eradicate the problems of the existing method.
Text document clustering and similarity detection are a major part of document management, where every document is identified by its key terms and domain knowledge. Based on similarity, documents are grouped into clusters. Several approaches to document similarity calculation have been proposed in existing systems, but they are either term based or pattern based and suffer from several problems. To make a revolution in this challenging environment, the proposed system presents an innovative model for document similarity that applies a back propagation time stamp algorithm. It discovers patterns in text documents as higher level features and creates a network for fast grouping; it also detects the most appropriate patterns based on their weights, and BPTT performs the document similarity measures. Using this approach, documents can be categorized easily, and the new approach helps to reduce problems in the training process. The framework, named BPTT, has been implemented and evaluated on the .NET platform with different sets of datasets.
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORI ALGORITHM FOR HANDLING VOLUMIN... (acijjournal)
Apriori is one of the key algorithms to generate frequent itemsets. Analysing frequent itemset is a crucial
step in analysing structured data and in finding association relationship between items. This stands as an
elementary foundation to supervised learning, which encompasses classifier and feature extraction
methods. Applying this algorithm is crucial to understand the behaviour of structured data. Most of the
structured data in scientific domain are voluminous. Processing such kind of data requires state of the art
computing machines. Setting up such an infrastructure is expensive. Hence a distributed environment
such as a clustered setup is employed for tackling such scenarios. Apache Hadoop distribution is one of
the cluster frameworks in distributed environment that helps by distributing voluminous data across a
number of nodes in the framework. This paper focuses on map/reduce design and implementation of
Apriori algorithm for structured data analysis.
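A hedged sketch of one Apriori pass expressed as map/reduce, in the spirit of the Hadoop design above: the mapper emits (candidate, 1) for every k-item candidate found in a transaction, and the reducer sums the counts and keeps the frequent itemsets. Hadoop Streaming would run these as separate processes over distributed splits; here they are plain functions for illustration.

```python
from itertools import combinations
from collections import defaultdict

def mapper(transaction, k):
    # emit each k-item candidate contained in this transaction
    for candidate in combinations(sorted(transaction), k):
        yield candidate, 1

def reducer(pairs, min_support):
    # sum counts per candidate and keep those meeting minimum support
    counts = defaultdict(int)
    for candidate, one in pairs:
        counts[candidate] += one
    return {c: n for c, n in counts.items() if n >= min_support}

transactions = [{"bread", "milk"}, {"bread", "beer"},
                {"bread", "milk", "beer"}]
pairs = (kv for t in transactions for kv in mapper(t, k=2))
print(reducer(pairs, min_support=2))
# -> {('bread', 'milk'): 2, ('beer', 'bread'): 2}
```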
A survey of modified support vector machine using particle of swarm optimizat... (Editor Jacotech)
This document summarizes a research paper that proposes a modified support vector machine (MSVM) classification algorithm using particle swarm optimization (PSO) for data classification in data streams. It discusses how new evolving features and concept drift in data streams can decrease the performance of traditional SVM classifiers. The proposed MSVM-PSO technique uses PSO to optimize feature selection and control the evaluation of new evolving features. PSO works in two phases - dynamic population selection and optimization of new evolved features. The methodology and implementation of MSVM-PSO is explained along with experimental results on three datasets showing it improves classification performance over traditional SVM.
This document summarizes a paper that presents a novel method for passive resource discovery in cluster grid environments. The method monitors network packet frequency from nodes' network interface cards to identify nodes with available CPU cycles (<70% utilization) by detecting latency signatures from frequent context switching. Experiments on a 50-node testbed showed the method can consistently and accurately discover available resources by analyzing existing network traffic, including traffic passed through a switch. The paper also proposes algorithms for distributed two-level resource discovery, replication and utilization to optimize resource allocation and access costs in distributed computing environments.
Distributed Digital Artifacts on the Semantic Web
Distributed digital artifacts incorporate cryptographic hash values into URIs, called trusty URIs, in a distributed environment,
producing high-quality, verifiable and immutable web resources that counter the rising man-in-the-middle attack. The greatest
challenge of a centralized system is that it gives users no way to check whether data has been modified, and communication
is limited to a single server. The solution is a distributed digital artifact system, where resources are distributed among
different domains to enable inter-domain communication. With the continuing development of the web, attacks have increased rapidly,
among which the man-in-the-middle attack (MIMA) is a serious issue that threatens user security. This work tries to prevent MIMA
to an extent by providing self-referencing trusty URIs even in a distributed environment. Any manipulation of the
data is efficiently identified, and further access to that data is blocked by informing the user that the uniform location has
changed. The system uses self-reference to embed a trusty URI in each resource, a lineage algorithm for generating the seed, and the SHA-512 hash
generation algorithm to ensure security. It is implemented on the semantic web, an extension of the world wide web, using
RDF (Resource Description Framework) to identify the resources. The framework was thus developed to overcome existing
challenges by distributing the digital artifacts on the semantic web, enabling secure communication between different domains across
the network and thereby preventing MIMA.
This document discusses using machine learning clustering algorithms to analyze stock market data. It compares the K-means, COBWEB, DBSCAN, EM and OPTICS clustering algorithms in the WEKA tool on a stock market dataset containing 420 instances and 6 attributes. The K-means algorithm had the best performance with the lowest error and fastest runtime. It clustered the data into 4 groups in 0.16 seconds. The COBWEB algorithm clustered the data into 107 groups in 27.88 seconds. The DBSCAN algorithm found 21 clusters in 3.97 seconds. The paper concludes that K-means is best suited for stock market data mining applications due to its simplicity and speed compared to other algorithms.
Feature selection in high-dimensional datasets is
considered to be a complex and time-consuming problem. To
enhance the accuracy of classification and reduce the execution
time, Parallel Evolutionary Algorithms (PEAs) can be used. In
this paper, we make a review for the most recent works which
handle the use of PEAs for feature selection in large datasets.
We have classified the algorithms in these papers into four main
classes (Genetic Algorithms (GA), Particle Swarm Optimization
(PSO), Scattered Search (SS), and Ant Colony Optimization
(ACO)). The accuracy is adopted as a measure to compare the
efficiency of these PEAs. It is noticeable that the Parallel Genetic
Algorithms (PGAs) are the most suitable algorithms for feature
selection in large datasets; since they achieve the highest accuracy.
On the other hand, we found that the Parallel ACO is time-consuming
and less accurate compared with the other PEAs.
Generic Algorithm based Data Retrieval Technique in Data Mining
This system presents a hybrid robust-model extraction approach (GA), a dynamic XAML-based mechanism for the adaptive management and reuse of e-learning resources in a distributed environment such as the Web. The proposed system argues that to achieve on-demand, semantic-based resource management for Web-based e-learning, one should go beyond using domain ontologies statically. The proposed XAML-based matching process therefore performs semantic mapping over both an open and a closed dataset to integrate e-learning databases using ontology semantics. It defines context-specific portions of the whole ontology as optimized data and proposes an XAML-based resource reuse approach driven by an evolutionary algorithm; the context-aware evolutionary algorithm for dynamic e-learning resource reuse is explained in detail. A simulation experiment is conducted to evaluate the proposed approach on an XAML-based e-learning scenario. The proposed matching process over web-clustered databases from different database servers can be easily integrated, although delivering highly dimensional e-learning resource management and reuse is far from mature. E-learning remains a widely open research area, and there is still much room for improvement of the method. The research directions include 1) improving the proposed evolutionary approach by using and comparing different evolutionary algorithms, 2) applying the proposed approach to support more applications, and 3) extending it to settings with multiple e-learning systems or services.
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
Clustering large sparse and large-scale data is an open research problem in data mining. Discovering significant information through clustering algorithms alone is inadequate, as most of the data turns out to be non-actionable, and existing clustering techniques are not feasible for time-varying data in high-dimensional space. Subspace clustering can address these problems by incorporating domain knowledge and parameter-sensitive prediction; the sensitivity of the data is also predicted through a thresholding mechanism. Usability and usefulness in 3D subspace clustering are important open issues, as is determining the correct dimensions, which is inconsistent and challenging. The solution is highly useful for police departments and law enforcement organisations to better understand issues, track activities and predict likelihoods. In this work a centroid-based subspace forecasting framework with constraints, i.e. must-link and must-not-link constraints derived from domain knowledge, is proposed. An unsupervised subspace clustering algorithm resolves inconsistent constraints over dimensions through singular value decomposition. Principal component analysis is used to estimate the actionable strength of particular attributes, and domain knowledge is used to refine and validate the optimal centroids dynamically. Experimental results show that the proposed framework outperforms competing subspace clustering techniques in terms of efficiency, F-measure, parameter insensitivity and accuracy. G. Raj Kamal | A. Deepika | D. Pavithra | J. Mohammed Nadeem | V. Prasath Kumar "Principle Component Analysis Based on Optimal Centroid Selection Model for SubSpace Clustering Model" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4 | Issue-4, June 2020, URL: https://www.ijtsrd.com/papers/ijtsrd31374.pdf Paper Url: https://www.ijtsrd.com/computer-science/data-miining/31374/principle-component-analysis-based-on-optimal-centroid-selection-model-for-subspace-clustering-model/g-raj-kamal
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
An institution is a place where the teacher explains and the student understands and learns the lesson. Every student has his own definition of toughness and easiness, and there is no absolute scale for measuring knowledge, but examination scores indicate a student's performance. In this case study, knowledge of data mining is combined with educational strategies to improve students' performance. Generally, data mining (sometimes called data or knowledge discovery) is the process of analysing data from different perspectives and summarizing it into useful information. Data mining software is one of a number of analytical tools for data: it allows users to analyse data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in a large relational database. Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (clusters). This project describes the use of the clustering data mining technique to improve the efficiency of academic performance in educational institutions. In this project a live experiment was conducted on students: an exam was conducted for students of a computer science major using MOODLE (LMS), the generated data was analysed using RapidMiner (data mining software), and clustering was then performed on the data. This method helps to identify the students who need special advising or counselling from the teacher, to deliver a high quality of education.
COMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATA
In data mining applications that involve complex data, such as multiple heterogeneous data sources, user preferences, decision-making actions and business impacts, the complete useful information cannot be obtained by a single data mining method in the form of informative patterns: doing so would consume more time and space, and it is possible only if the large relevant data sources can be joined to discover patterns covering the various aspects of useful information. We consider combined mining as an approach for mining informative patterns
from multiple data sources, multiple features, or by multiple methods, as required. In the combined mining approach, we applied the lossy-counting algorithm to each data source to obtain the frequent itemsets and then derived the combined association rules. In the multi-feature combined mining approach, we obtained pair patterns and cluster patterns and then generated incremental pair patterns and incremental cluster patterns, which cannot be generated directly by existing methods. In the multi-method combined mining approach, we combined FP-growth and a Bayesian belief network to build a classifier and obtain more informative knowledge.
Combined mining approach to generate patterns for complex data
The document discusses a combined mining approach to generate patterns from complex data. It proposes applying a lossy-counting algorithm to each data source to obtain frequent itemsets, then generating combined association rules. It also describes obtaining pair and cluster patterns by considering multiple features, and generating incremental pair and cluster patterns. Further, it combines FP-growth and Bayesian belief networks to classify data and obtain more informative knowledge.
BRA: a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
This document summarizes a paper that presents a framework called BRA that provides a bidirectional abstraction of asymmetric mobile ad hoc networks to enable off-the-shelf routing protocols to work. BRA maintains multi-hop reverse routes for unidirectional links, improves connectivity by using unidirectional links, enables reverse route forwarding of control packets, and detects packet loss on unidirectional links. Simulations show packet delivery increases substantially when AODV is layered on BRA in asymmetric networks compared to regular AODV.
Horizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets
Novel Ensemble Tree for Fast Prediction on Data Streams
Shashi et al., International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, Vol. 6, Issue 8 (Part 1), August 2016, pp. 51-58. www.ijera.com
Shashi*, Priyanka Paygude**, Snehal Chaudhary***
*, **, *** Department of Information Technology, College of Engineering, Bharati Vidyapeeth University, Pune
ABSTRACT
Data streams are sequential sets of data records. When data arrives continuously and at high speed, predicting
the class of each record in a timely manner is essential. Ensemble modeling techniques are currently growing
rapidly in data stream classification. Ensemble learning is widely adopted because of its ability to handle huge
volumes of stream data and to cope with concept drift. Prior work has focused mostly on the accuracy of the
ensemble model; prediction efficiency has received little attention, since existing ensemble models predict in
linear time, which is adequate for small applications and for models that integrate only a few classifiers.
Real-time applications, however, involve huge volumes of stream data, so many base classifiers are required to
capture dissimilar patterns and build a high-grade ensemble model. To address these challenges we developed
the Ensemble tree, a height-balanced tree indexing structure over base classifiers for quick prediction on data
streams by ensemble modeling techniques. The Ensemble tree manages ensembles as geodatabases (spatial
databases) and uses an R-tree-like structure to achieve sub-linear time complexity.
Keywords - Data stream mining, classification, machine learning, spatial indexing, concept drift.
I. INTRODUCTION
Nowadays data streams are present everywhere in information systems, so managing such huge volumes of information, and storing and analyzing them, is a challenging task. Classifying such huge amounts of data is equally difficult: conventional or one-pass algorithms struggle because the data stream may exhibit concept drift [4]. Data stream classification has therefore become an active research area in data mining, with many possible applications. This kind of classification is used wherever intrusion detection is needed, and also to identify spam mails or malicious websites. For security purposes, the classification and analysis of biosensor measurements is a significant and growing application. To handle the challenges of huge data volumes and concept drift, several ensemble-based machine learning techniques have been proposed by researchers. These include weighted classifier ensembles, where learning is based on the weight of each node [1][2], incremental classifier ensembles, and combined classifier and cluster ensembles [3]. The basic technique used by ensemble models is to split the stream into small blocks of data, build a base classifier from every block, and then combine these classifiers in some way for prediction.
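As an illustration of this chunk-based scheme, the following minimal Python sketch trains one base classifier per block and combines them by weighted voting. It assumes scikit-learn is available; the chunk size, ensemble capacity and accuracy-based weights are our illustrative choices, not settings from the literature.

    # Minimal sketch of chunk-based ensemble learning on a stream.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    CHUNK_SIZE = 1000   # records per block (illustrative)
    MAX_MODELS = 20     # fixed ensemble capacity (illustrative)
    ensemble = []       # list of (classifier, weight) pairs

    def train_on_chunk(X_chunk, y_chunk):
        # Build one base classifier per data block.
        clf = DecisionTreeClassifier(max_depth=5)
        clf.fit(X_chunk, y_chunk)
        weight = clf.score(X_chunk, y_chunk)  # weight by accuracy on the block
        ensemble.append((clf, weight))
        if len(ensemble) > MAX_MODELS:        # drop the oldest classifier
            ensemble.pop(0)

    def predict(x):
        # Weighted vote over all base classifiers; note this scan is
        # linear in the ensemble size, the cost the E-tree later removes.
        votes = {}
        for clf, w in ensemble:
            label = clf.predict(np.asarray(x).reshape(1, -1))[0]
            votes[label] = votes.get(label, 0.0) + w
        return max(votes, key=votes.get)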
Prior work, up to this point, has mainly focused on the accuracy of the ensemble model; prediction efficiency has not received much attention, because current ensemble models predict in linear time, which is sufficient for general or small applications, and existing models integrate only a small collection of classifiers. Real-time applications, however, involve huge volumes of stream data, so additional base classifiers are needed to capture different patterns and build a high-grade ensemble model. Such applications require a fast, sub-linear prediction solution. Here the base classifiers used in the ensemble model are assumed, without loss of generality, to be decision trees, so each base classifier consists of a set of decision rules. Each rule condition covers a region of the decision space, which can be regarded as a spatial object. In this way each base classifier is converted into a collection of spatial objects, while the whole ensemble model is converted into a geodatabase. Viewed this way, the prediction problem reduces to a spatial query: finding the patterns shared among all the spatial objects of the base classifiers in the geodatabase [5][8]. A number of indexing structures for geodatabases are available that exploit the patterns common to all spatial data; these can be used to reduce the number of queries and the update cost in the database. Several such structures, such as the M-tree and the R+-tree, can be used to index ensemble models.
There are several benefits of indexing an ensemble model compared to standard spatial indexing: (1) An ensemble model indexes decision rules, and these rules carry much richer information, such as class labels, probability distributions and classifier weights, whereas conventional indexing methods were designed for particular kinds of data such as multimedia, road maps, rectangles and pictures. (2) The aim of an ensemble model is fast prediction, whereas the goal of conventional spatial indexing is rapid update and retrieval. (3) Because of concept drift, stream data changes very often, so the decision regions indexed for the decision-rule ensemble may belong to entirely different classes or class labels over time.
To solve the challenges mentioned above, we propose a novel ensemble tree that integrates the base classifiers into a height-balanced tree structure for fast prediction, achieving sub-linear time complexity. The E-tree (Ensemble tree) provides an efficient indexing structure; with its help we attain logarithmic time complexity for prediction on data streams.
The paper is organized as follows. Section 2 reviews the related research work and literature. Section 3 describes the proposed system, its architecture and the basic operations on the tree. Sections 4 and 5 present the mathematical model and the experimental results respectively, and Section 6 concludes the paper.
II. LITERATURE SURVEY
Many algorithms and techniques based on ensemble learning have been proposed for handling large volumes of data or data streams. In [6], classification and novel class detection for concept-drifting data streams under time constraints was proposed, an issue that most existing classification techniques do not address. Existing classification systems assume that the total number of classes in the data stream is fixed, so they cannot discover a novel class before a model trained on it exists. When training data containing the novel class is not available to the classification model, the model cannot identify novel classes, and novel class detection becomes even more challenging in the presence of concept drift; in their approach this is used to detect intrusions. To decide whether a data stream instance belongs to a novel class, the classification technique periodically needs to examine additional test instances and measure the similarities between those instances. A waiting time Tc is applied as a time constraint on classifying an instance, so that a novel class can be discovered even in the absence of a training model for that class. The major difficulties in novel class detection are: (1) saving training data efficiently without excessive memory utilization; (2) knowing when to defer classification of the stream and when to classify a test instance immediately; (3) classifying delayed instances within Tc units of time; and (4) predicting a novel class rapidly and accurately. They applied their technique to two distinct classifiers: decision tree classification and k-nearest neighbor classification. This technique may, however, make the classification model inefficient in terms of memory utilization and execution time; to build a more effective model and improve efficiency, they used k-means clustering on the training data.
In [2] the authors proposed a solution to the large-scale, or data stream, classification problem. They build a committee, or ensemble classifier, which is a combination of many classifiers, each built on a subset of the available data points. Recently, a great deal of attention in the machine learning community has been directed towards strategies such as bagging and boosting. Their approach merges classifiers as follows: each individual classifier is constructed from a small data set, the data is read in blocks rather than as an entire data set at a time, and the component classifiers are then inserted into an ensemble of fixed size. If the ensemble is full, a new classifier is inserted only if it satisfies a quality criterion, so that the performance of the ensemble improves. Performance is estimated by testing the existing ensemble while the new tree is built on the next block of data. They performed several experiments based on this framework. Their results included: (1) Better generalization is achieved by increasing the size of the ensemble up to 20 to 25 classifiers, but with data sets of limited size there is a trade-off between the number of classifiers and the number of points per classifier, unless resampling is performed. (2) Pruning the individual trees decreased the accuracy of the ensemble, even when it increased the accuracy of the trees themselves. (3) Ensemble accuracy is not affected by small variations in the majority vote. The key element of this technique is the strategy used to decide which existing or outdated tree should be deleted, and whether a new tree should be added to the group. Their results show that classification diversity also matters; accuracy alone is not the best criterion. Because the ensemble changes continuously and constantly, accumulating these estimates over time is complicated and the estimates are noisy. One possibility is to favor classifiers that do well on misclassified points, but this is sensitive to noisy data; instead, they favor classifiers that correctly classify points on which the ensemble is still undecided.
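A minimal Python sketch of this fixed-capacity replacement policy follows; the quality test shown, comparing the candidate's accuracy on the newest block against the weakest current member's, is our simplification of the paper's actual criterion, which also accounts for diversity.

    # Sketch of a SEA-like fixed-size ensemble with quality-based replacement.
    from sklearn.tree import DecisionTreeClassifier

    MAX_SIZE = 25  # ensembles of 20-25 classifiers generalized best in [2]

    def update_ensemble(ensemble, X_block, y_block):
        # Train a candidate on the new block; when the ensemble is full,
        # insert it only if it beats the weakest current member.
        candidate = DecisionTreeClassifier().fit(X_block, y_block)
        cand_score = candidate.score(X_block, y_block)
        if len(ensemble) < MAX_SIZE:
            ensemble.append(candidate)
            return ensemble
        scores = [clf.score(X_block, y_block) for clf in ensemble]
        worst = min(range(len(ensemble)), key=lambda i: scores[i])
        if cand_score > scores[worst]:
            ensemble[worst] = candidate  # replace the weakest member
        return ensemble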
In [7] the authors discussed ADWIN Bagging and the Adaptive-Size Hoeffding Tree (ASHT), two new approaches proposed for learning under concept drift. First, they proposed a new bagging method using Hoeffding trees of various sizes. The Hoeffding tree is an incremental approach to constructing a decision tree, based on a decision tree induction algorithm, and is capable of learning from large volumes of stream data under the assumption that the distribution generating the examples does not change very frequently. The Adaptive-Size Hoeffding Tree (ASHT) is derived from the Hoeffding tree algorithm with the following differences: (1) an ASHT has a maximum size, i.e. a maximum number of split nodes; (2) to reduce its size it deletes some nodes whenever, after a node splits, the number of split nodes exceeds the maximum value. With this method, small trees adapt to changes more quickly, whereas large trees, built on more data, do better when there are few or no changes. Second, they proposed a new bagging method using ADWIN. ADWIN solves the problem of tracking the average of a stream of bits or real-valued numbers in a well-specified way; it does not store the stream explicitly but handles it through an exponential histogram technique. The resulting online bagging method, called ADWIN Bagging, uses the ADWIN algorithm both to detect change and to estimate the weights of the boosting method. When a change is detected, the most outdated classifier is deleted and a new classifier is added to the ensemble.
In 2010, classifier and cluster ensembles [3] were proposed for mining concept-drifting data streams. That paper overcomes two main challenges of existing ensemble classifier systems: (1) obtaining an authentic class label for every unlabeled cluster; and (2) deciding how to allocate weights to all base classifiers and clusters correctly, so that the ensemble predictor handles the concept drifting problem. To handle these challenges, a model of weighted classifiers and clusters is used for concept drifting. For challenge (1) they first construct a graph consisting of all classifiers and clusters. This graph is used to propagate a label to each newly introduced cluster: it first propagates the label information from the classifiers to the clusters, and then at each iteration refines the outcomes by exploiting the similarity between clusters. For challenge (2) the authors also propose a consistency-based weighting technique, in which a weight is assigned to each base classifier or cluster depending on its consistency with the updated base model. Following this method, the combined result of both classifiers and clusters yields accurate prediction through a weighted averaging scheme.
Existing systems face two main knowledge discovery problems: the large volume of incoming data streams, and concept drifting. To overcome these, a weighted classifier ensemble [1] was proposed for mining data streams with concept drifting. For timely prediction on data streams it is important that the model captures new and updated patterns. Maintaining an up-to-date and accurate classifier over a large (potentially infinite) volume of stream data with concept drift raises challenges of accuracy, efficiency and ease of use. (1) Accuracy: a high update rate affects the accuracy of the current model, making it less accurate, while a low rate leads to a less sensitive model. (2) Efficiency: even a slight change at the base of the tree may trigger a huge modification of the tree and drastically undermine efficiency. (3) Ease of use: current classification methods use decision trees to handle data streams with drifting concepts incrementally, but the reusability of the approach is limited, so it cannot be applied directly. To overcome this, they propose an ensemble of weighted classifiers for mining input data streams with concept drift. Most previously trained classifiers that are older and less accurate still contain valuable information that would otherwise be wasted; hence, and to avoid overfitting and the problem of conflicting concepts, the deletion of old data must be based on the data distribution rather than on arrival time. The ensemble approach achieves this by assigning each classifier a weight based on its expected prediction accuracy on the current data. Other advantages of this approach are its efficiency and ease of use.
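A minimal sketch of this accuracy-based weighting in Python follows. Weighting each classifier by its raw accuracy margin over a majority-class baseline on the newest labeled block is our simplification; the original paper derives weights from expected classification error.

    # Sketch of accuracy-based ensemble weighting in the spirit of [1].
    import numpy as np

    def reweight(ensemble, X_recent, y_recent):
        # Weight each classifier by how far it beats the majority-class
        # baseline on the most recent labeled data; outdated classifiers
        # that are no better than that baseline receive weight 0.
        baseline = max(np.mean(y_recent == c) for c in np.unique(y_recent))
        weighted = []
        for clf in ensemble:
            acc = np.mean(clf.predict(X_recent) == y_recent)
            weighted.append((clf, max(0.0, acc - baseline)))
        return weighted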
Ensemble model indexing is therefore preferred for quick prediction, i.e. classification, of incoming stream data. The data representation also changes: the data stream changes constantly because of concept drift and hidden patterns, so a decision region used by a decision tree may, over time, correspond to distinct class labels. To overcome the challenges of the existing systems, in this paper we propose a novel ensemble (machine learning) tree that integrates the base classifiers into a height-balanced tree structure for fast prediction, achieving sub-linear time complexity; the E-tree (Ensemble tree) provides an efficient indexing structure.
III. PROPOSED SYSTEM
In this section we discuss the E-tree. The E-tree structure consists of two parts: the first is a tree with an R-tree-like structure [5], and the second is a table. The tree stores decision rules, and the table stores information about each classifier, such as its ID and weight. The two structures are coupled by linking every base classifier in the table to its related decision rules. The E-tree supports three basic operations: the Search operation, which traverses the tree to classify incoming data; the Insertion operation, which integrates a new base classifier into the tree; and the Deletion operation, which is called when the E-tree is full and deletes the outdated (no longer used) classifier.
The system has two main modules, a Training module and a Prediction module, shown in Figure 1. Training module: each incoming unlabeled stream record is stored in a buffer until the buffer is full, and is then labeled. The labeled data is used to build a new base classifier, which is inserted into the E-tree using the Insertion operation. When the E-tree is full, the outdated classifier is deleted using the Deletion operation. Prediction module: the prediction module maintains an identical copy of the E-tree generated by the training module, so when an unlabeled stream record arrives, the prediction module calls the Search operation to predict its label. The updated tree from the training module is synchronized with the tree in the prediction module every time a new classifier is added to the E-tree. Buffer: used to store data records until it is full; in this module the stream is labeled by experts. Classifier: this module builds a new base classifier from all the labeled data records. We use data streams from the KDD data sets, which are listed in Table 1.
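To make the two-part structure concrete, here is a minimal Python sketch of the data layout. It assumes, purely as an illustration, that each decision rule is stored as an axis-aligned rectangle over the attribute space, as in an R-tree; the type and field names are ours, not the paper's.

    # Sketch of the two-part E-tree layout: an R-tree-like tree over
    # decision rules plus a classifier table. Names are illustrative.
    from dataclasses import dataclass, field

    @dataclass
    class Rule:
        low: list           # lower bound of the rule's region, per attribute
        high: list          # upper bound of the rule's region, per attribute
        label: str          # predicted class, e.g. "normal" / "abnormal"
        classifier_id: int  # link back to the classifier table

    @dataclass
    class Node:
        entries: list = field(default_factory=list)  # child Nodes or Rules
        is_leaf: bool = True

    @dataclass
    class ETree:
        root: Node = field(default_factory=Node)
        table: dict = field(default_factory=dict)  # classifier_id -> weight

    def covers(rule, x):
        # True if record x falls inside the rule's rectangle.
        return all(lo <= v <= hi for lo, v, hi in zip(rule.low, x, rule.high))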
Operations on the E-Tree
The Ensemble tree (E-tree) is similar to other tree data structures such as the R-tree, and it supports three main operations:
3.1. Search Operation: Whenever a new record arrives, the search operation is called to predict its class label. We first traverse the tree to find, in the leaf nodes, the decision rules whose regions cover the record, performing a depth-first search on the tree.
3.2. Insert Operation: Insertion into the E-tree is the same as insertion into comparable trees: it integrates a new base classifier into the E-tree, which helps the ensemble model adapt to new trends or classes in the data stream.

Fig 1. Architecture of the proposed system

Table 1: Data stream datasets
Name                 Attributes   Area
Intrusion detection  22           Security
Spam detection       35           Security
Malicious URL        12           Security
3.3. Delete Operation: This operation removes old classifiers that have not been used for a long time, when the E-tree reaches its capacity. There are two different deletion methods. The first is similar to the method used in B-trees: under-full nodes are merged into whichever sibling causes the least area enlargement. The second resembles deletion in R-trees and performs a delete-then-insert sequence: it first deletes the under-full node and then re-inserts the remaining entries into the tree using the insert operation. This second method has two advantages: (1) it is easy to implement; and (2) re-insertion automatically reorganizes the spatial structure of the tree as new classes are found or new features are added to the training data of an existing class.
3.4 Basic Steps in the J48 Algorithm:
1. If all instances belong to the same class, the tree is a leaf, and the leaf is returned labeled with that class.
2. Otherwise, the potential information provided by a test on each attribute is calculated for every attribute, along with the information gain that would result from a test on that attribute.
3. The best attribute, according to the present selection criterion, is selected for branching.
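The attribute selection in step 2 rests on entropy and information gain; the following small Python example (our own illustration, not code from the paper) computes both for a candidate split.

    # Entropy and information gain, the quantities J48 uses in step 2
    # to rank candidate split attributes.
    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy of a list of class labels, in bits.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, partitions):
        # Gain of splitting `labels` into the given partitions (lists).
        n = len(labels)
        remainder = sum(len(p) / n * entropy(p) for p in partitions)
        return entropy(labels) - remainder

    # Example: splitting 10 records on a binary attribute.
    labels = ["normal"] * 6 + ["abnormal"] * 4
    split = [["normal"] * 5 + ["abnormal"], ["normal"] + ["abnormal"] * 3]
    print(information_gain(labels, split))  # ~0.256 bits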
IV. MATHEMATICAL MODEL
For each of the three input streams, we consider a data stream DS consisting of an infinite number of records {(pi, qi)}, where pi ∈ R^d is a d-dimensional attribute vector and qi ∈ {normal, abnormal} is the class label. Suppose that we have built n base classifiers A1, A2, ..., An from historical stream data using a decision tree algorithm. Each base classifier Ai (1 ≤ i ≤ n) consists of l decision rules, so the ensemble E contains M = n × l decision rules in total. Indexing these M rules in a height-balanced tree gives the sub-linear prediction complexity O(log M).
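As an illustrative calculation (the figures are ours, not from the experiments): with n = 100 base classifiers of l = 50 rules each, the ensemble holds M = 100 × 50 = 5000 decision rules. A linear scan examines all 5000 rules for every incoming record, whereas a height-balanced index needs on the order of log2 5000 ≈ 12 node visits, the exact constant depending on the tree's fan-out.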
4.1 Pseudo Code for Search:
Input: E-tree E, stream record x
Output: class label yx predicted for x
S ← ∅; // S collects all leaf rules covering x
initialize an empty stack;
Q ← E.tree.root; // start from the tree root
for each entry r ∈ Q do
    push(stack, r);
while the stack is not empty do
    t ← pop(stack);
    if t is a leaf entry then
        S ← S ∪ {t};
    else
        Q ← t.child;
        for each entry u ∈ Q do
            if x is covered by u then
                push(stack, u);
for each entry t ∈ S do
    look up its weight in the table structure;
yx ← weighted vote over the rules in S;
output yx;
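Building on the illustrative E-tree layout sketched in Section III, a direct Python rendering of this depth-first search might look as follows; the weighted vote at the end is our reading of the final loop, and for brevity we omit pruning non-covering subtrees via node bounding rectangles, which is what yields the sub-linear behaviour in practice.

    # Depth-first search over the sketched E-tree: collect every leaf
    # rule covering x, then take a weighted vote using the table.
    def search(etree, x):
        covering = []                  # the set S of rules covering x
        stack = list(etree.root.entries)
        while stack:
            t = stack.pop()
            if isinstance(t, Rule):    # leaf entry: a decision rule
                if covers(t, x):
                    covering.append(t)
            else:                      # internal node: descend
                stack.extend(t.entries)
        votes = {}
        for rule in covering:
            w = etree.table[rule.classifier_id]  # classifier weight
            votes[rule.label] = votes.get(rule.label, 0.0) + w
        return max(votes, key=votes.get) if votes else None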
4.2 Pseudo Code for Insertion:
Input: E-tree E, new classifier F, node capacity parameters n and N
Output: updated E-tree E'
Q ← E.tree.root; // get the tree root
for each decision rule r ∈ F do
    D ← searchLeaf(r, Q); // find the leaf D that should hold r
    if D.size < N then
        D ← D ∪ {r};
        E' ← updateParentNodes(D);
    else
        <D1, D2> ← splitNode(D, r);
        E' ← adjustTree(E, D1, D2);
insert F into the table structure;
output the updated E-tree E';
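A simplified Python rendering against the same illustrative layout follows; the naive half-and-half node split stands in for a proper R-tree split heuristic, and choose_leaf descends arbitrarily where a real R-tree would pick the child needing the least rectangle enlargement.

    # Simplified insertion: each rule of the new classifier goes into a
    # leaf; over-full leaves get a naive split.
    MAX_ENTRIES = 8  # node capacity N (illustrative)

    def insert_classifier(etree, rules, classifier_id, weight):
        etree.table[classifier_id] = weight  # register in the table
        for r in rules:
            leaf = choose_leaf(etree.root)
            leaf.entries.append(r)
            if len(leaf.entries) > MAX_ENTRIES:
                split_leaf(leaf)

    def choose_leaf(node):
        # Walk down to a leaf; a real R-tree picks the child whose
        # rectangle needs the least enlargement to cover the rule.
        while not node.is_leaf:
            node = node.entries[0]
        return node

    def split_leaf(leaf):
        # Naive split: the leaf becomes an internal node over two halves.
        half = len(leaf.entries) // 2
        left = Node(entries=leaf.entries[:half], is_leaf=True)
        right = Node(entries=leaf.entries[half:], is_leaf=True)
        leaf.entries = [left, right]
        leaf.is_leaf = False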
4.3 Pseudo Code for Deletion:
Input: E-tree E, outdated classifier F, node capacity parameters n and N
Output: updated E-tree E'
Q ← E.tree.root; // obtain the tree root
G ← E.table.ref; // obtain the table references
r ← searchClassifier(F, G); // locate F's rules via the table
Q ← r.pointer;
while Q is not null do
    D ← Q.node(); // D is the leaf node containing rule Q
    p ← Q.sibling;
    D' ← deleteEntry(D, Q);
    if D'.size < n then // node is under-full
        E' ← deleteNode(E, D');
        for each entry t ∈ D' do
            E' ← insertRule(t, E); // re-insert remaining entries
    Q ← p;
output the updated E-tree E';
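Continuing the same illustrative layout, the delete-then-insert strategy of Section 3.3 can be sketched as follows: the outdated classifier's rules are dropped, under-full leaves are dissolved, and their surviving entries are re-inserted (a full implementation would also unlink emptied nodes from their parents).

    # Sketch of delete-then-reinsert removal of an outdated classifier.
    MIN_ENTRIES = 2  # minimum node fill n (illustrative)

    def delete_classifier(etree, classifier_id):
        etree.table.pop(classifier_id, None)  # unregister the classifier
        survivors = []
        prune(etree.root, classifier_id, survivors)
        for rule in survivors:                # re-insert remaining rules
            choose_leaf(etree.root).entries.append(rule)

    def prune(node, classifier_id, survivors):
        # Remove the classifier's rules; dissolve leaves that fall below
        # the minimum fill, collecting their remaining rules.
        if node.is_leaf:
            node.entries = [r for r in node.entries
                            if r.classifier_id != classifier_id]
            if 0 < len(node.entries) < MIN_ENTRIES:
                survivors.extend(node.entries)
                node.entries = []
        else:
            for child in node.entries:
                prune(child, classifier_id, survivors)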
4.4 Pseudo Code for Random Forest:
To create the forest of C classifiers:
for i = 1 to C do
    sample the training data D randomly with replacement to produce Di
    create a root node Ni containing Di
    call BuildTree(Ni)
end for
BuildTree(N):
if N contains instances of a single class only then
    return
else
    randomly select x% of the possible splitting features at N
    select the feature F with the highest information gain to split on
    create h child nodes N1, ..., Nh of N, where F has h possible values F1, ..., Fh
    for i = 1 to h do
        set the contents of Ni to Di, where Di are the instances in N that match Fi
        call BuildTree(Ni)
    end for
end if
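In practice the forest can come from an off-the-shelf implementation; a scikit-learn equivalent of this procedure is shown below, with synthetic placeholder data standing in for the KDD-derived features of Table 1.

    # Off-the-shelf random forest equivalent of the pseudocode above.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Placeholder data: replace with the intrusion/spam/URL features.
    X, y = make_classification(n_samples=5000, n_features=22, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0)

    # n_estimators plays the role of C; max_features the role of x%.
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                    random_state=0)
    forest.fit(X_tr, y_tr)
    print("accuracy:", forest.score(X_te, y_te))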
V. EXPERIMENTAL RESULTS
For the experimental setup we used the KDD dataset, defining the class labels normal and abnormal. We used three input data streams: intrusion detection, spam detection, and malicious URL detection. We used the J48 and Random Forest classifiers to measure the performance of the system on the classes above; the parameters considered are accuracy and detection rate (percentage). Fig 2 shows the graphical comparison. As shown in Table 2, for intrusion detection Random Forest gives a better result, but not by much; likewise for spam detection it does not make a major difference. For malicious URL detection, however, Table 2 shows a major difference in accuracy.
TABLE 2: Comparison between J48 and Random Forest
Type       Instances (J48)  Instances (RF)  Percentage (J48)  Accuracy (J48)  Percentage (RF)  Accuracy (RF)
Malicious  4814.0           4887.0          43.55%            73.08%          44.21%           71.16%
Benign     6241.0           6168.0          56.45%            85.94%          55.79%           84.94%
Spam       2832.0           2769.0          61.55%            98.42%          60.77%           99.77%
Not Spam   1769.0           1805.0          38.45%            97.57%          39.23%           99.56%
Normal     29565.0          29584.0         19.7%             99.85%          19.71%           99.79%
DOS        119014.0         119006.0        79.29%            99.98%          79.28%           99.99%
Probe      1205.0           1208.0          0.8%              98.45%          0.8%             98.69%
R2L        312.0            299.0           0.21%             86.43%          0.2%             82.83%
U2R        12.0             11.0            0.01%             90.91%          0.01%            100.0%
Fig 2: Intrusion Detection
Figure 2 shows the classification comparison of the different types of attacks, namely Normal, U2R, R2L, DOS and Probe, plotting accuracy against each attack type.
Fig 3: Spam Detection
Fig 4: Malicious URL
Figure 3 shows the classification comparison of spam versus non-spam against accuracy; spam and non-spam have different accuracies. Figure 4 shows the classification comparison of malicious versus benign against accuracy; it shows that benign traffic is classified with higher accuracy than malicious traffic.
VI. CONCLUSION
In this work we proposed a novel Ensemble tree indexing structure for classifying high-speed data streams, reducing the expected time complexity of prediction. We are able to reduce the linear time complexity of the system to sub-linear time complexity. Classification in this work is based on ensemble models. In the future the novel Ensemble tree can be extended to other data stream classification models.
REFERENCES
[1]. H. Wang, W. Fan, P. S. Yu, "Mining Concept-Drifting Data Streams using Ensemble Classifiers", ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Washington DC, August 24-27, 2003, pp. 226-235.
[2]. W. Street and Y. Kim, "A Streaming Ensemble Algorithm (SEA) for Large-Scale Classification", ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, August 26-29, 2001, pp. 377-382.
[3]. P. Zhang, X. Zhu, J. Tan and L. Guo, "Classifier and Cluster Ensembles for Mining Concept Drifting Data Streams", 2010 IEEE International Conference on Data Mining (ICDM), Sydney, NSW, December 13-17, 2010, pp. 1175-1180.
[4]. M. Kelly, D. Hand and N. Adams, "The Impact of Changing Populations on Classifier Performance", ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), San Diego, CA, US, August 15-18, 1999, pp. 367-371.
[5]. A. Guttman, "R-Trees: A Dynamic Index Structure for Spatial Searching", ACM SIGMOD International Conference on Management of Data, New York, US, June 1984, pp. 47-57.
[6]. M. Masud, J. Gao, L. Khan, J. Han and B. Thuraisingham, "Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints", IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6, June 2011, pp. 859-874.
[7]. A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby and R. Gavalda, "New Ensemble Methods for Evolving Data Streams", 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Paris, France, June 28-July 1, 2009, pp. 139-148.
[8]. R. Guting, "An Introduction to Spatial Database Systems", VLDB Journal, vol. 3, no. 4, 1994, pp. 357-399.