1) The document discusses mining data streams using an improved version of McDiarmid's bound, aiming to tighten the bounds used in McDiarmid-based decision-tree algorithms and to improve processing efficiency.
2) Traditional data mining techniques cannot be applied directly to data streams because the data arrive continuously and rapidly. The document proposes Gaussian approximations of McDiarmid's bounds to reduce the number of training samples needed to select a split criterion.
3) It describes Hoeffding's inequality, which is commonly used but not sufficient for data streams, and argues that McDiarmid's inequality, applied appropriately, yields a more efficient technique for high-speed, time-changing data streams.
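To make the sample-size argument concrete, here is a minimal Python sketch (with hypothetical gain values) of how a Hoeffding-style bound turns a desired confidence into a split decision in a streaming decision tree; a McDiarmid-type or Gaussian-approximated bound would plug into the same test with a different epsilon.

```python
import math

def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound: with probability >= 1 - delta, the observed mean of n
    i.i.d. samples of a statistic with range `value_range` is within epsilon
    of its true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# Split decision as in Hoeffding trees: accept the best attribute once the
# observed margin between the top two candidate criteria exceeds epsilon.
best_gain, second_gain = 0.42, 0.31   # hypothetical split-criterion values
n, delta = 1000, 1e-6                 # samples seen, allowed failure rate
eps = hoeffding_epsilon(value_range=1.0, delta=delta, n=n)
if best_gain - second_gain > eps:
    print(f"split is statistically safe (margin {best_gain - second_gain:.3f} > eps {eps:.3f})")
```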
A fuzzy clustering algorithm for high dimensional streaming data (Alexander Decker)
This document summarizes a research paper that proposes a new dimension-reduced weighted fuzzy clustering algorithm (sWFCM-HD) for high-dimensional streaming data. The algorithm can cluster datasets that have both high dimensionality and a streaming (continuously arriving) nature. It combines previous work on clustering algorithms for streaming data and high-dimensional data. The paper introduces the algorithm and compares it experimentally to show improvements in memory usage and runtime over other approaches for these types of datasets.
This document summarizes a research paper that proposes a new density-based clustering technique called Triangle-Density Based Clustering Technique (TDCT) to efficiently cluster large spatial datasets. TDCT uses a polygon approach where the number of data points inside each triangle of a polygon is calculated to determine triangle densities. Triangle densities are used to identify clusters based on a density confidence threshold. The technique aims to identify clusters of arbitrary shapes and densities while minimizing computational costs. Experimental results demonstrate the technique's superiority in terms of cluster quality and complexity compared to other density-based clustering algorithms.
A Novel Approach to Mathematical Concepts in Data Mining (ijdmtaiir)
This paper describes three fundamental mathematical programming approaches relevant to data mining: feature selection, clustering, and robust representation. It covers two clustering algorithms, k-means and k-median. Clustering is illustrated through the unsupervised learning of patterns and clusters that may exist in a given database, and it is a useful tool for Knowledge Discovery in Databases (KDD). The results of the k-median algorithm are used to identify blood cancer patients in a medical database. K-means clustering is a data mining/machine learning algorithm used to group observations into clusters of related observations without any prior knowledge of those relationships. The k-means algorithm is one of the simplest clustering techniques and is commonly used in medical imaging, biometrics, and related fields.
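As a reference point for the two algorithms discussed above, here is a minimal NumPy sketch of Lloyd's k-means iteration; it is a generic textbook version, not the paper's implementation. Swapping the mean for a component-wise median in the update step gives a simple k-median variant.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal Lloyd's k-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
```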
This document summarizes a research paper on developing an improved LEACH (Low-Energy Adaptive Clustering Hierarchy) communication protocol for energy efficient data mining in multi-feature sensor networks. It begins with background on wireless sensor networks and issues like energy efficiency. It then discusses the existing LEACH protocol and its drawbacks. The proposed improved LEACH protocol includes cluster heads, sub-cluster heads, and cluster nodes to address LEACH's limitations. This new version aims to minimize energy consumption during cluster formation and data aggregation in multi-feature sensor networks.
The document proposes a Modified Pure Radix Sort algorithm for large heterogeneous datasets. The algorithm divides the data into numeric and string processes that work simultaneously. The numeric process further divides data into sublists by element length and sorts them simultaneously using an even/odd logic across digits. The string process identifies common patterns to convert strings to numbers that are then sorted. This optimizes problems with traditional radix sort through a distributed computing approach.
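For context, the classic least-significant-digit radix sort that the proposal builds on can be sketched as follows; this is the plain sequential baseline, not the paper's distributed numeric/string variant.

```python
def radix_sort(nums):
    """LSD radix sort for non-negative integers: one stable counting-sort
    pass per decimal digit, least significant digit first."""
    if not nums:
        return nums
    exp = 1
    while max(nums) // exp > 0:
        buckets = [[] for _ in range(10)]
        for n in nums:
            buckets[(n // exp) % 10].append(n)   # stable per-digit bucketing
        nums = [n for bucket in buckets for n in bucket]
        exp *= 10
    return nums

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```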
Textual Data Partitioning with Relationship and Discriminative Analysis (Editor IJMTER)
Data partitioning methods group data values by similarity, and similarity measures are used to estimate transaction relationships. Hierarchical clustering models produce tree-structured results, while partitional clustering produces results in a grid format. Text documents are unstructured data with high-dimensional attributes. Document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods require the cluster count (K) as input for the document grouping process, and clustering accuracy degrades drastically when an unsuitable cluster count is chosen.
Textual data elements are divided into two types: discriminative words and non-discriminative words. Only discriminative words are useful for grouping documents; the involvement of non-discriminative words confuses the clustering process and leads to poor clustering solutions. A variational inference algorithm is used to infer the document collection structure and the partition of document words at the same time. A Dirichlet Process Mixture (DPM) model is used to partition documents; it exploits both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to discover the latent cluster structure based on the DPM model, and DPMFP clustering is performed without requiring the number of clusters as input.
Document labels are used to guide the discriminative word identification process. Concept relationships are analyzed with ontology support, and a semantic weight model is used for document similarity analysis. The system improves scalability by using labels and concept relations for dimensionality reduction.
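The core DPM idea, inferring the number of clusters rather than fixing K, can be illustrated with scikit-learn's BayesianGaussianMixture using a Dirichlet-process prior; this is a generic stand-in on toy feature vectors, not the paper's DPMFP model.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Toy stand-in for document vectors (e.g., reduced TF-IDF features).
X = np.vstack([np.random.randn(60, 5) + m for m in (0, 4, 8)])

# Dirichlet-process prior: n_components is only an upper bound; the
# posterior drives the weights of unused components toward zero.
dpm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)
labels = dpm.predict(X)
print("non-negligible components:", (dpm.weights_ > 0.01).sum())
```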
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL... (ijnlc)
The tremendous increase in the amount of available research documents impels researchers to propose topic models to extract the latent semantic themes of a document collection. However, how to extract the hidden topics of the document collection has become a crucial task for many topic model applications. Moreover, conventional topic modeling approaches suffer from a scalability problem as the size of the document collection increases. In this paper, the Correlated Topic Model with a variational Expectation-Maximization algorithm is implemented in the MapReduce framework to solve the scalability problem. The proposed approach uses a dataset crawled from a public digital library, and the full texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. Experiments are conducted to demonstrate the performance of the proposed algorithm. The evaluation shows that the proposed approach achieves performance comparable, in terms of topic coherence, to LDA implemented in the MapReduce framework.
This document introduces an R package called PSF that implements a Pattern Sequence based Forecasting (PSF) algorithm for univariate time series forecasting. The PSF algorithm clusters time series data and then predicts future values based on identifying repeating patterns of clusters. The PSF package contains functions that perform the main steps of the PSF algorithm, including selecting the optimal number of clusters, selecting the optimal window size, and making predictions for a given window size and number of clusters. The package aims to promote and simplify the use of the PSF algorithm for time series forecasting.
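A rough sketch of the PSF idea in Python (the package itself is in R): discretize the series by clustering its values into label symbols, match the most recent window of labels against history, and average the values that followed each match. The function and parameter names here are illustrative, not the package's API.

```python
import numpy as np
from sklearn.cluster import KMeans

def psf_forecast(series, k=3, window=4):
    """Sketch of Pattern Sequence based Forecasting: cluster values into k
    label symbols, find past occurrences of the most recent label window,
    and average the values that immediately followed them."""
    X = np.asarray(series, dtype=float).reshape(-1, 1)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    pattern = labels[-window:]
    followers = [series[i + window]
                 for i in range(len(labels) - window)
                 if np.array_equal(labels[i:i + window], pattern)]
    return float(np.mean(followers)) if followers else float(np.mean(series))

data = [1, 2, 9, 1, 2, 9, 1, 2, 9, 1, 2]
print(psf_forecast(data, k=3, window=2))  # pattern [1, 2] has always preceded 9
```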
This document summarizes a research paper that proposes a new resource scheduling algorithm called STRS for cloud computing environments. STRS aims to optimally allocate data resources across computational clusters in a distributed system to minimize data access costs. It does this through two distributed algorithms: STRSA runs at each parent node to determine optimal data allocation to child nodes, and STRSD runs at each child node to determine optimal data de-allocation. The paper also proposes an intra-cluster replication algorithm called ORPNDA that uses heuristic expansion-shrinking methods to determine optimal partial data replication within each cluster. Experimental results show STRS and ORPNDA significantly outperform general frequency-based replication schemes.
Experimental study of Data clustering using k-Means and modified algorithms (IJDKP)
The k-means clustering algorithm is an old algorithm that has been intensely researched owing to its ease and simplicity of implementation. Clustering algorithms have broad appeal and usefulness in exploratory data analysis. This paper presents the results of an experimental study of different approaches to k-means clustering, comparing results on different datasets using the original k-means and other modified algorithms implemented in MATLAB R2009b. The results are evaluated on performance measures such as the number of iterations, the number of points misclassified, accuracy, the Silhouette validity index, and execution time.
Clustering, also known as data segmentation, aims to partition a data set into groups (clusters) according to similarity. Cluster analysis has been studied extensively, and there are many algorithms for different types of clustering, but these classical algorithms cannot be applied to big data because of its distinct features; applying traditional techniques to large unstructured data is a challenge. This study proposes a hybrid model to cluster big data using the traditional K-means clustering algorithm. The proposed model consists of three phases: a Mapper phase, a Clustering phase, and a Reduce phase. The first phase uses a map-reduce algorithm to split the big data into small datasets; the second phase runs the traditional K-means algorithm on each of the split small datasets; and the last phase is responsible for producing the general cluster output for the complete data set. Two functions, Mode and Fuzzy Gaussian, were implemented and compared in the last phase to determine the most suitable one. The experimental study used four benchmark big data sets: Covtype, Covtype-2, Poker, and Poker-2. The results proved the efficiency of the proposed model in clustering big data with the traditional K-means algorithm, and the experiments show that the Fuzzy Gaussian function produces more accurate results than the traditional Mode function.
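A compact sketch of the three-phase idea, assuming scikit-learn's KMeans and using centroid re-clustering as a simplified stand-in for the paper's Mode and Fuzzy Gaussian merge functions:

```python
import numpy as np
from sklearn.cluster import KMeans

def map_phase(X, n_chunks):
    """Mapper: split the big data into smaller datasets."""
    return np.array_split(X, n_chunks)

def cluster_phase(chunks, k):
    """Clustering: run plain k-means independently on every chunk."""
    return [KMeans(n_clusters=k, n_init=10, random_state=0).fit(c).cluster_centers_
            for c in chunks]

def reduce_phase(centroid_sets, k):
    """Reducer: merge per-chunk centroids into k global clusters by
    clustering the centroids themselves (a simple stand-in for the
    paper's Mode / Fuzzy Gaussian merge functions)."""
    all_centroids = np.vstack(centroid_sets)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_centroids).cluster_centers_

X = np.vstack([np.random.randn(500, 3) + m for m in (0, 6)])
global_centers = reduce_phase(cluster_phase(map_phase(X, n_chunks=4), k=2), k=2)
```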
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A Novel Approach for Clustering Big Data based on MapReduce (IJECEIAES)
Clustering is one of the most important applications of data mining and has attracted the attention of researchers in statistics and machine learning. It is used in many applications such as information retrieval, image processing, and social network analytics; it helps the user understand the similarity and dissimilarity between objects, and cluster analysis helps users understand complex and large data sets more clearly. Different types of clustering algorithms have been analyzed by various researchers. K-means is the most popular partitioning-based algorithm, as it provides good results through accurate calculation on numerical data; however, K-means gives good results for numerical data only, while big data is a combination of numerical and categorical data. The K-prototype algorithm handles numerical as well as categorical data by combining the distances calculated on numeric and categorical attributes. With the growth of data from social networking websites, business transactions, scientific calculations, etc., there is a vast collection of structured, semi-structured, and unstructured data, so K-prototype needs to be optimized to analyze these varieties of data efficiently. In this work, the K-prototype algorithm is implemented on MapReduce. Experiments show that K-prototype on MapReduce gives better performance on multiple nodes than on a single node; CPU execution time and speedup are used as evaluation metrics. An intelligent splitter is also proposed that splits mixed big data into numerical and categorical data. Comparison with traditional algorithms shows that the proposed algorithm works better at large scale.
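The K-prototype dissimilarity the summary refers to, Huang's combination of numeric and categorical distance, can be sketched as follows; gamma is the weighting factor between the two parts.

```python
import numpy as np

def kprototype_distance(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """K-prototype dissimilarity: squared Euclidean distance on numeric
    attributes plus gamma times the number of mismatched categorical
    attributes (Huang's formulation)."""
    numeric = float(np.sum((np.asarray(x_num) - np.asarray(proto_num)) ** 2))
    categorical = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric + gamma * categorical

d = kprototype_distance([1.0, 2.0], ["red", "large"],
                        [1.5, 2.5], ["red", "small"], gamma=0.5)
print(d)  # 0.25 + 0.25 + 0.5 * 1 = 1.0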
IRJET- Review of Existing Methods in K-Means Clustering Algorithm (IRJET Journal)
The document reviews existing methods for the k-means clustering algorithm. It discusses how k-means clustering works and some of its limitations when dealing with large datasets, such as being dependent on the initial choice of centroids. It then proposes using Hadoop to overcome big data challenges and calculate preliminary centroids for k-means clustering in a distributed manner. Finally, it reviews different techniques that have been proposed in other research to improve k-means clustering, such as methods for selecting better initial centroids or determining the optimal number of clusters.
This document describes a new distance-based clustering algorithm (DBCA) that aims to improve upon K-means clustering. DBCA selects initial cluster centroids based on the total distance of each data point to all other points, rather than random selection. It calculates distances between all points, identifies points with maximum total distances, and sets initial centroids as the averages of groups of these maximally distant points. The algorithm is compared to K-means, hierarchical clustering, and hierarchical partitioning clustering on synthetic and real data. Experimental results show DBCA produces better quality clusters than these other algorithms.
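A small sketch of the described seeding strategy, under the assumption that the "groups" are consecutive runs of the most-distant points in the ranking (the paper's exact grouping rule may differ):

```python
import numpy as np

def dbca_initial_centroids(X, k, group_size=3):
    """Sketch of DBCA-style seeding: rank points by their total distance to
    all other points and average small groups of the most distant ones,
    instead of picking random seeds."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    total = d.sum(axis=1)                      # total distance per point
    order = np.argsort(total)[::-1]            # most-distant points first
    return np.array([X[order[i * group_size:(i + 1) * group_size]].mean(axis=0)
                     for i in range(k)])
```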
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER (ijnlc)
This document presents an adaptive log file parser that uses semantics and hidden Markov models. It first clusters log file lines based on semantics to limit unstructured text. It then builds a hidden Markov model to represent parsing patterns, with log entries as states and extracted values as emissions. When applied to a new system, it adapts the model's transition and emission probabilities to fit the new data. The approach achieves over 99.99% accuracy when trained on one system and applied to another with slightly different log patterns.
Extended PSO algorithm for improvement problems of k-means clustering algorithm (IJMIT JOURNAL)
Clustering is an unsupervised process and one of the most common data mining techniques. Its purpose is to group similar data together, so that instances within a cluster are as similar to each other as possible and as different as possible from instances in other clusters. This paper focuses on partitional k-means clustering which, owing to its ease of implementation and high-speed performance on large data sets, remains very popular more than thirty years after its development. To address the problem of k-means becoming trapped in local optima, an extended PSO algorithm named ECPSO is proposed. The new algorithm is able to escape local optima and, with high probability, produce the problem's optimal answer. The results show that the proposed algorithm outperforms other clustering algorithms, especially on two indices: clustering precision and clustering quality.
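The paper builds on the canonical PSO update, which for clustering is typically applied to particles that each encode a full set of centroids; a minimal sketch (generic PSO, not ECPSO itself):

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(x, v, pbest, gbest, w=0.72, c1=1.49, c2=1.49):
    """One canonical PSO update: inertia plus attraction toward each
    particle's personal best and the swarm's global best."""
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v

# In PSO-based clustering, each particle encodes a full set of k centroids
# and its fitness is the clustering objective (e.g., sum of squared errors).
x = rng.random((10, 2 * 3))   # 10 particles, each holding k=2 centroids in 3-D
v = np.zeros_like(x)
```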
K-means Clustering Method for the Analysis of Log Data (idescitation)
Cluster analysis is one of the main analytical methods in data mining, and the choice of clustering algorithm directly influences the clustering results. This paper discusses the standard k-means clustering algorithm and analyzes its shortcomings. It also focuses on web usage mining, analyzing log data for pattern recognition: with the help of the k-means algorithm, usage patterns are identified.
A PSO-Based Subtractive Data Clustering Algorithm (IJORCS)
There is a tremendous proliferation in the amount of information available on the largest shared information source, the World Wide Web. Fast, high-quality clustering algorithms play an important role in helping users effectively navigate, summarize, and organize this information. Recent studies have shown that partitional clustering algorithms such as k-means are the most popular algorithms for clustering large datasets. The major problem with partitional clustering algorithms is that they are sensitive to the selection of the initial partitions and are prone to premature convergence to local optima. Subtractive clustering is a fast, one-pass algorithm for estimating the number of clusters and the cluster centers for any given set of data; these estimates can be used to initialize iterative optimization-based clustering methods and model identification methods. In this paper, we present a hybrid Subtractive + (PSO) clustering algorithm that performs fast clustering. For comparison purposes, we applied the Subtractive + (PSO) clustering algorithm, PSO, and subtractive clustering to three different datasets. The results illustrate that the Subtractive + (PSO) clustering algorithm generates the most compact clustering results compared to the other algorithms.
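The subtractive clustering step can be sketched in a few lines; this follows Chiu's standard formulation with radius r_a and squash radius r_b = 1.5 r_a, not the hybrid Subtractive + (PSO) algorithm itself.

```python
import numpy as np

def subtractive_clustering(X, ra=1.0, n_centers=3):
    """One-pass subtractive clustering sketch: each point's potential is a
    sum of Gaussian contributions from all points; repeatedly pick the
    highest-potential point as a center and subtract its influence."""
    rb = 1.5 * ra                               # common choice of squash radius
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    potential = np.exp(-4.0 * d2 / ra ** 2).sum(axis=1)
    centers = []
    for _ in range(n_centers):
        c = int(np.argmax(potential))
        centers.append(X[c])
        potential -= potential[c] * np.exp(-4.0 * d2[:, c] / rb ** 2)
    return np.array(centers)
```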
The objective of this paper is to present a hybrid approach to edge detection. Under this technique, edge detection is performed in two phases: in the first phase, the Canny algorithm is applied for image smoothing, and in the second phase a neural network detects the actual edges. A neural network is a powerful tool for edge detection, as it is a non-linear network with built-in thresholding capability. A neural network can be trained with the back-propagation technique using few training patterns, but the most important and difficult part is identifying a correct and proper training set.
IRJET- Clustering of Hierarchical Documents based on the Similarity Deduc... (IRJET Journal)
This document discusses techniques for clustering hierarchical documents based on their structural similarity. It summarizes several existing approaches:
1) A tree edit distance-based method that represents trees as paths and computes the distance between subtrees. However, it requires trees to have a pre-specified structure.
2) Chawathe's algorithm that uses pre-order tree traversal and transforms trees into sequences of node labels and depths to calculate distances. It allows efficient assignment of new documents to clusters.
3) The XCLSC algorithm that clusters documents in two phases - grouping structurally similar documents and then searching to further improve clustering results and performance. However, it has high computational requirements.
4) The XPattern and PathXP
A general weighted_fuzzy_clustering_algorithm (TA Minh Thuy)
This document proposes a framework for adapting iterative clustering algorithms to handle streaming data. The key ideas are:
1) As data arrives in chunks, cluster each chunk and represent the clustering results as a set of weighted centroids, with the weights indicating the number of data points assigned to each cluster.
2) Add the weighted centroids from previous chunks to the current chunk as it is clustered. This allows the algorithm to incorporate historical information from all previously seen data.
3) The weighted centroids produced by clustering the entire stream can then be used to assign labels or groups to new data points.
Experimental results on a large dataset treated as a stream show the streaming algorithm produces clusters almost identical to clustering all data at once.
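A minimal sketch of the framework, assuming scikit-learn's KMeans with sample weights as the base iterative algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_stream(chunks, k):
    """Sketch of the weighted-centroid streaming scheme: cluster each chunk
    together with the weighted centroids carried over from earlier chunks,
    where a centroid's weight is the number of points it represents."""
    carried_pts = np.empty((0, chunks[0].shape[1]))
    carried_w = np.empty(0)
    for chunk in chunks:
        pts = np.vstack([chunk, carried_pts])
        w = np.concatenate([np.ones(len(chunk)), carried_w])
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pts, sample_weight=w)
        labels = km.labels_
        carried_pts = km.cluster_centers_
        carried_w = np.array([w[labels == j].sum() for j in range(k)])
    return carried_pts, carried_w

chunks = np.array_split(np.vstack([np.random.randn(1500, 2),
                                   np.random.randn(1500, 2) + 6]), 10)
centers, weights = cluster_stream(chunks, k=2)
```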
Classification of Iris Data using Kernel Radial Basis Probabilistic Neural Ne... (Scientific Review)
The Radial Basis Probabilistic Neural Network (RBPNN) has a broad generalization capability and has been successfully applied in multiple fields. In this paper, the Euclidean distance of each data point in the RBPNN is extended by calculating its kernel-induced distance instead of the conventional sum-of-squares distance. The kernel function is a generalization of the distance metric that measures the distance between two data points as if they were mapped into a high-dimensional space. Comparing four classification models built with Kernel RBPNN, Radial Basis Function networks, RBPNN, and Back-Propagation networks, the results show that classification of the Iris data with Kernel RBPNN displays outstanding performance.
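The kernel-induced distance mentioned above has a closed form via the kernel trick: ||phi(x) - phi(c)||^2 = K(x,x) - 2 K(x,c) + K(c,c). A small sketch with an RBF kernel:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / (2 * sigma ** 2))

def kernel_distance_sq(x, c, kernel=rbf_kernel):
    """Squared distance in the kernel-induced feature space:
    ||phi(x) - phi(c)||^2 = K(x,x) - 2*K(x,c) + K(c,c)."""
    return kernel(x, x) - 2.0 * kernel(x, c) + kernel(c, c)

print(kernel_distance_sq([0.0, 0.0], [1.0, 1.0]))
# For an RBF kernel with sigma=1 this is 2 - 2*exp(-1), roughly 1.264
```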
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS (ijscmcj)
The purpose of this article is to determine the usefulness of Graphics Processing Unit (GPU) calculations for implementing the Latent Semantic Indexing (LSI) reduction of the TERM-BY-DOCUMENT matrix. The considered reduction of the matrix is based on the SVD (Singular Value Decomposition). The high computational complexity of the SVD, O(n^3), makes the reduction of a large indexing structure a difficult task. This article compares the time complexity and accuracy of the algorithms implemented in two different environments: the first associated with the CPU and MATLAB R2011a, the second with graphics processors and the CULA library. The calculations were carried out on generally available benchmark matrices, which were combined to obtain a resulting matrix of large size. For both environments, computations were performed on double- and single-precision data.
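For reference, the CPU-side reduction being benchmarked can be sketched with NumPy's SVD; the matrix sizes and the rank r here are illustrative.

```python
import numpy as np

# Toy term-by-document matrix (rows: terms, columns: documents).
A = np.random.rand(1000, 200)

# Rank-r LSI reduction via truncated SVD: A is approximated by U_r S_r V_r^T.
r = 50
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_reduced = (np.diag(s[:r]) @ Vt[:r]).T   # each document in r concept dimensions

# A query vector is folded into the same space with q_r = S_r^{-1} U_r^T q.
q = np.random.rand(1000)
q_reduced = np.diag(1.0 / s[:r]) @ U[:, :r].T @ q
```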
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis... (ijsrd.com)
A cluster is a group of objects that are similar to each other within the cluster and dissimilar to the objects of other clusters. The similarity is typically calculated on the basis of the distance between two objects or clusters; two or more objects belong to the same cluster only if they are close to each other by that distance. The major objective of clustering is to discover collections of comparable objects based on a similarity metric. Fuzzy Possibilistic C-Means (FPCM) is an effective clustering algorithm for unlabeled data that produces both membership and typicality values during the clustering process. In this approach, the efficiency of FPCM is enhanced by using penalized and compensated constraints (PCFPCM). The proposed PCFPCM approach differs from conventional clustering techniques by imposing a possibilistic reasoning strategy on fuzzy clustering, with penalized and compensated constraints for updating the grades of membership and typicality. The performance of the proposed approach is evaluated on University of California, Irvine (UCI) machine learning repository datasets such as Iris, Wine, Lung Cancer, and Lymphography. The parameters used for the evaluation are clustering accuracy, Mean Squared Error (MSE), execution time, and convergence behavior.
The document describes a study that uses fuzzy logic to predict porosity from well log data. It discusses (1) normalizing the input data, (2) using subtractive clustering to identify clusters and membership functions, and (3) developing fuzzy rules with Gaussian membership functions to relate inputs like density, sonic, and neutron logs to the output of porosity. The results showed fuzzy logic predictions of porosity were more accurate than those from multiple linear regression on the same well log data.
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO... (cscpconf)
For performing distributed data mining, two approaches are possible: first, data from several sources are copied to a data warehouse and mining algorithms are applied to it; secondly, mining can be performed at the local sites and the results aggregated. When the number of features is high, a lot of bandwidth is consumed in transferring datasets to a centralized location, so dimensionality reduction can be done at the local sites. In dimensionality reduction, an encoding is applied to the data to obtain a compressed form. The reduced features obtained at the local sites are then aggregated, and data mining algorithms are applied to them. There are several methods of performing dimensionality reduction; two of the most important are Discrete Wavelet Transforms (DWT) and Principal Component Analysis (PCA). Here a detailed study is done on how PCA can be useful in reducing data flow across a distributed network.
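A minimal sketch of PCA-based compression at a local site, showing why only a small projected matrix needs to cross the network; the sizes are illustrative.

```python
import numpy as np

def pca_compress(X, n_components):
    """PCA at a local site: project centered data onto the top principal
    directions, so only the small projected matrix (plus the mean and the
    components) needs to cross the network."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Eigen-decomposition of the covariance matrix; columns are directions.
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return Xc @ top, top, mean

X = np.random.rand(10_000, 60)          # 60 features at a local site
Z, components, mean = pca_compress(X, n_components=8)
print(Z.shape)                           # (10000, 8): far less data to transmit
```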
Reduct generation for the incremental data using rough set theory (csandit)
In today's changing world a huge amount of data is generated and transferred frequently. Although the data is sometimes static, most commonly it is dynamic and transactional, and newly generated data is constantly added to the old/existing data. To discover knowledge from this incremental data, one approach is to run the algorithm repeatedly on the modified data sets, which is time consuming. The paper proposes a dimension reduction algorithm that can be applied in a dynamic environment to generate a reduced attribute set, a dynamic reduct. The method analyzes the new dataset when it becomes available and modifies the reduct accordingly to fit the entire dataset. The concepts of discernibility relation, attribute dependency, and attribute significance from Rough Set Theory are integrated for the generation of the dynamic reduct set, which not only reduces the complexity but also helps achieve higher accuracy of the decision system. The proposed method has been applied to a few benchmark datasets collected from the UCI repository and a dynamic reduct is computed. Experimental results show the efficiency of the proposed method.
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval (IJECEIAES)
Data mining is an essential process for identifying patterns in large datasets through machine learning techniques and database systems. Clustering high-dimensional data is becoming very challenging due to the curse of dimensionality, and existing methods do not improve space complexity or data retrieval performance. To overcome these limitations, a Spectral Clustering Based VP Tree Indexing Technique is introduced. The technique clusters and indexes densely populated high-dimensional data points for effective retrieval in response to user queries. A normalized spectral clustering algorithm groups similar high-dimensional data points; a Vantage Point Tree is then constructed to index the clustered data points with minimal space complexity; finally, indexed data is retrieved for a user query using a Vantage Point Tree based Data Retrieval Algorithm. This improves the true positive rate while minimizing retrieval time. Performance is measured in terms of space complexity, true positive rate, and data retrieval time on the El Nino weather data sets from the UCI Machine Learning Repository. Experimental results show that the proposed technique reduces space complexity by 33% and data retrieval time by 24% compared to state-of-the-art works.
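The first stage, normalized spectral clustering, is available off the shelf; a sketch using scikit-learn (a generic stand-in, not the paper's implementation):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Normalized spectral clustering: build a nearest-neighbor affinity graph
# and cluster the eigenvectors of its graph Laplacian.
X = np.vstack([np.random.randn(100, 4) + m for m in (0, 5, 10)])
labels = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                            n_neighbors=10, random_state=0).fit_predict(X)
```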
Wireless data broadcast is an efficient way of disseminating data to users in mobile computing environments. From the server's point of view, how to place the data items on channels is a crucial issue, with the objective of minimizing the average access time and tuning time. Similarly, how to schedule the data retrieval process for a given request at the client side, such that all the requested items can be downloaded in a short time, is also an important problem. In this paper, we investigate multi-item data retrieval scheduling in push-based multichannel broadcast environments. The most important issues in mobile computing are energy efficiency and query response efficiency; however, in data broadcast the objectives of reducing access latency and energy cost can contradict each other. Consequently, we define two new problems, the Minimum Cost Data Retrieval (MCDR) problem and the Large Number Data Retrieval (LNDR) problem, and develop a heuristic algorithm to download a large number of items efficiently. When there is no replicated item in a broadcast cycle, we show that an optimal retrieval schedule can be obtained in polynomial time.
A Threshold Fuzzy Entropy Based Feature Selection: Comparative Study (IJMER)
Feature selection is one of the most common and critical tasks in database classification. It reduces the computational cost by removing insignificant and unwanted features and consequently makes the diagnosis process accurate and comprehensible. This paper presents a measurement of feature relevance based on fuzzy entropy, tested with a Radial Basis Function (RBF) network classifier, Bagging (Bootstrap Aggregating), Boosting, and stacking on datasets from various fields. Twenty benchmark datasets available in the UCI Machine Learning Repository and KDD were used for this work. The accuracy obtained from these classification processes shows that the proposed method is capable of producing good and accurate results with fewer features than the original datasets.
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION... (ijcsit)
This document summarizes a research paper that proposes a new method for improving both fault tolerance and load balancing in grid computing networks. The method converts the tree structure of grid computing nodes into a distributed R-tree index structure and then applies an entropy estimation technique. This entropy estimation helps discard nodes with high entropy from the tree, reducing complexity. The method then uses thresholding and control algorithms to select optimal route paths based on load balance and fault tolerance. Various optimization techniques like genetic algorithms, ant colony optimization, and particle swarm optimization are also applied to reach better solutions. Experimental results showed the proposed method improved performance over other existing methods.
Particle Swarm Optimization based K-Prototype Clustering Algorithm (iosrjce)
This document summarizes a research paper that proposes a new Particle Swarm Optimization (PSO) based K-Prototype clustering algorithm to cluster mixed numeric and categorical data. It begins with background information on clustering algorithms like K-Means, K-Modes, and K-Prototype. It then describes the K-Prototype algorithm, PSO, and discrete binary PSO. Related work integrating PSO with other clustering algorithms is also reviewed. The proposed approach uses binary PSO to select improved initial prototypes for K-Prototype clustering in order to obtain better clustering results than traditional K-Prototype and avoid local optima.
This document discusses using particle swarm optimization to improve the k-prototype clustering algorithm. The k-prototype algorithm clusters data with both numeric and categorical attributes but can get stuck in local optima. The proposed method uses particle swarm optimization, a global optimization technique, to guide the k-prototype algorithm towards better clusterings. Particle swarm optimization models potential solutions as particles that explore the search space. It is integrated with k-prototype clustering to avoid locally optimal solutions and produce better clusterings. The method is tested on standard benchmark datasets and shown to outperform traditional k-modes and k-prototype clustering algorithms.
Web image annotation by diffusion maps manifold learning algorithmijfcstjournal
Automatic image annotation is one of the most challenging problems in machine vision. The goal of this task is to automatically predict keywords for images captured in real-world data. Many methods are based on visual features used to calculate similarities between image samples, but the computational cost of these approaches is very high, and they require many training samples to be stored in memory. To lessen this burden, a number of techniques have been developed to reduce the number of features in a dataset. Manifold learning is a popular approach to nonlinear dimensionality reduction. In this paper, we investigate the diffusion maps manifold learning method for the web-image auto-annotation task; it is used to reduce the dimension of several visual features. Extensive experiments and analysis on the NUS-WIDE-LITE web image dataset with different visual features show how this manifold learning dimensionality reduction method can be applied effectively to image annotation.
The document discusses JARVIS-ML, an AI system for fast and accurate screening of materials properties. It uses machine learning models trained on a large dataset of materials properties calculated using density functional theory. Some key points:
- JARVIS-ML uses gradient boosting decision trees to predict properties like formation energies, bandgaps, and elastic moduli, achieving good accuracy compared to DFT calculations.
- Feature selection is important, and JARVIS-ML uses over 1,500 descriptors of atomic structure. Chemical features are most important for predictions.
- The models can screen thousands of materials in seconds, much faster than DFT. This enables large-scale materials discovery tasks like genetic algorithm searches.
The document discusses data stream mining and summarizes some key challenges and techniques. It describes how traditional data mining cannot be directly applied to data streams due to their continuous, rapid arrival. It then outlines several techniques used for summarizing and extracting knowledge from data streams, including sampling, sketching, load shedding, synopsis data structures, and algorithms modified from basic data mining to handle streams.
A Comparative Study Of Various Clustering Algorithms In Data MiningNatasha Grant
This document provides an overview and comparison of various clustering algorithms used in data mining. It discusses the key types of clustering algorithms: partition-based (such as k-means and k-medoids), hierarchical-based, density-based, and grid-based. For partition-based algorithms, it describes k-means and k-medoids in more detail. It also discusses hierarchical clustering approaches like agglomerative nesting. The document aims to provide insights into different clustering techniques for segmenting and grouping data in an unsupervised manner.
This document discusses clustering of uncertain data objects. It first provides background on clustering uncertain data and challenges in doing so. It then proposes combining k-means clustering with Voronoi diagrams to improve the performance of k-means when clustering uncertain data. Specifically, it suggests using k-means to generate clusters and Voronoi diagrams to answer nearest neighbor queries, in order to minimize computation time. Finally, it concludes that integrating clustering algorithms with indexing methods can effectively cluster uncertain data objects.
This document discusses clustering of uncertain data objects. It first provides background on clustering uncertain data and challenges involved. It then reviews various existing approaches for clustering uncertain data, including using soft classifiers and probabilistic databases. The document proposes combining k-means clustering with Voronoi diagrams and indexing techniques to improve the performance and efficiency of clustering uncertain datasets. It outlines a plan to integrate k-means with Voronoi diagrams and indexing to reduce execution time and increase clustering performance and results for uncertain data. Finally, it concludes that combining clustering with indexing approaches can better handle uncertain data clustering challenges.
1. The document summarizes ongoing data mining and machine learning research at the University of Houston from 2006-2009.
2. Key areas of research included developing shape-aware clustering algorithms, discovering regional knowledge in geo-referenced datasets, emergent pattern discovery, and various machine learning applications.
3. The researchers were developing techniques for clustering with plug-in fitness functions, discovering spatial risk patterns like arsenic levels, and an open source data mining framework called Cougar2.
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...acijjournal
Apriori is one of the key algorithms for generating frequent itemsets. Analysing frequent itemsets is a crucial step in analysing structured data and in finding association relationships between items; it also serves as an elementary foundation for supervised learning, which encompasses classifier and feature extraction methods. Applying this algorithm is crucial to understanding the behaviour of structured data. Most structured data in scientific domains is voluminous, and processing such data requires state-of-the-art computing machines; setting up such an infrastructure is expensive. Hence a distributed environment such as a clustered setup is employed for tackling such scenarios. The Apache Hadoop distribution is one of the cluster frameworks for distributed environments that helps by distributing voluminous data across a number of nodes. This paper focuses on the map/reduce design and implementation of the Apriori algorithm for structured data analysis.
Critical Paths Identification on Fuzzy Network Projectiosrjce
In this paper, a new approach for identifying the fuzzy critical path is presented, based on converting the fuzzy network project into a deterministic network project by transforming the parameter set of the fuzzy activities into the time probability density function (PDF) of each fuzzy time activity. A case study is considered as a numerical test problem to demonstrate our approach.
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Sinan KOZAK
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
Literature Review Basics and Understanding Reference Management.pptxDr Ramhari Poudyal
A three-day training on academic research, focusing on analytical tools, held at United Technical College with support from the University Grants Commission, Nepal, 24-26 May 2024.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...IJECEIAES
Climate change's impact on the planet has forced the United Nations and governments to promote green energies and electric transportation. The deployment of photovoltaic (PV) and electric vehicle (EV) systems has gained strong momentum due to their numerous advantages over fossil fuel alternatives, advantages that go beyond sustainability to include financial support and stability. The work in this paper introduces a hybrid PV and EV system to support industrial and commercial plants. The paper covers the theoretical framework of the proposed hybrid system, including the equations required to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram, which sets the priorities and requirements of the system, is presented. The proposed approach allows plants to improve their power stability, especially during power outages. The presented information supports researchers and plant owners in completing the necessary analysis while promoting the deployment of clean energy. The results of a case study representing a dairy milk farmer support the theoretical work and highlight its benefits to existing plants. The short return on investment of the proposed approach underlines the paper's novel approach to a sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line, which enhances the safety of the electrical network.
Understanding Inductive Bias in Machine LearningSUTEJAS
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
Comparative analysis between traditional aquaponics and reconstructed aquapon...bijceesjournal
The aquaponic system of planting is a method that does not require soil usage. It is a method that only needs water, fish, lava rocks (a substitute for soil), and plants. Aquaponic systems are sustainable and environmentally friendly. Its use not only helps to plant in small spaces but also helps reduce artificial chemical use and minimizes excess water use, as aquaponics consumes 90% less water than soil-based gardening. The study applied a descriptive and experimental design to assess and compare conventional and reconstructed aquaponic methods for reproducing tomatoes. The researchers created an observation checklist to determine the significant factors of the study. The study aims to determine the significant difference between traditional aquaponics and reconstructed aquaponics systems propagating tomatoes in terms of height, weight, girth, and number of fruits. The reconstructed aquaponics system’s higher growth yield results in a much more nourished crop than the traditional aquaponics system. It is superior in its number of fruits, height, weight, and girth measurement. Moreover, the reconstructed aquaponics system is proven to eliminate all the hindrances present in the traditional aquaponics system, which are overcrowding of fish, algae growth, pest problems, contaminated water, and dead fish.
ACEP Magazine edition 4th launched on 05.06.2024Rahul
This document provides information about the third edition of the magazine "Sthapatya" published by the Association of Civil Engineers (Practicing) Aurangabad. It includes messages from current and past presidents of ACEP, memories and photos from past ACEP events, information on life time achievement awards given by ACEP, and a technical article on concrete maintenance, repairs and strengthening. The document highlights activities of ACEP and provides a technical educational article for members.
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTjpsjournal1
The rivalry between prominent international actors for dominance over Central Asia's hydrocarbon reserves and the ancient silk trade route, along with China's diplomatic endeavours in the area, has been referred to as the "New Great Game." This research centres on that power struggle, considering geopolitical, geostrategic, and geoeconomic variables. Topics including trade, political hegemony, oil politics, and traditional and nontraditional security are explored and explained. Using Mackinder's Heartland, Spykman's Rimland, and Hegemonic Stability theories, the study examines China's role in Central Asia. It adheres to an empirical epistemological method and takes care to maintain objectivity, critically analysing primary and secondary research documents to elaborate the role of China's geo-economic outreach in Central Asian countries and its future prospects. According to this study, China is seeing significant success in trade, pipeline politics, and gaining influence over other governments, success that may be attributed to the effective utilisation of key instruments such as the Shanghai Cooperation Organisation and the Belt and Road Economic Initiative.
The CBC machine is a common diagnostic tool used by doctors to measure a patient's red blood cell count, white blood cell count and platelet count. The machine uses a small sample of the patient's blood, which is then placed into special tubes and analyzed. The results of the analysis are then displayed on a screen for the doctor to review. The CBC machine is an important tool for diagnosing various conditions, such as anemia, infection and leukemia. It can also help to monitor a patient's response to treatment.
Advanced control scheme of doubly fed induction generator for wind turbine us...IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
International Conference on NLP, Artificial Intelligence, Machine Learning an...gerogepatton
International Conference on NLP, Artificial Intelligence, Machine Learning and Applications (NLAIM 2024) offers a premier global platform for exchanging insights and findings in the theory, methodology, and applications of NLP, Artificial Intelligence, Machine Learning, and their applications. The conference seeks substantial contributions across all key domains of NLP, Artificial Intelligence, Machine Learning, and their practical applications, aiming to foster both theoretical advancements and real-world implementations. With a focus on facilitating collaboration between researchers and practitioners from academia and industry, the conference serves as a nexus for sharing the latest developments in the field.
ME Synopsis
Track 2: Data Mining
Third Post Graduate Symposium on Computer Engineering (cPGCON2014), organized by the Department of Computer Engineering, MCERC Nasik
Mining Data Streams Based on the Improved McDiarmid's Bound. Ms. Poonam Debnath and Prof. Santosh Kumar Chobe, Department of Computer Engineering, University of Pune
Abstract: Complex analysis of data streams is becoming a popular field of research, as the information collected is prone to concept drift or complete shift. The preprocessing, storage, querying, and mining of such data sets are highly challenging computational tasks. Mining data streams means extracting knowledge structures, represented as models and patterns, from non-stopping streams of information. Traditionally, Hoeffding's bound is widely used to resolve the question of how many learning samples are needed at a node before the split attribute can be selected. In this paper, we present the theoretical foundations for enhancing the bounds obtained by the McDiarmid tree algorithm and for improving the processing efficiency of the stream mining system by applying Gaussian approximations to the bounds.

Index Terms: Data streams, Decision trees, Gaussian approximation, Hoeffding's bound, McDiarmid's bound.
I. INTRODUCTION
Recently a new class of emerging applications has become widely recognized: applications in which data is generated at very high rates in the form of transient data streams. In the data stream model, individual data items may be relational tuples, call records, web page visits, sensor readings, and so on. However, the continuous arrival of data in multiple, rapid, time-varying, unpredictable, and unbounded streams opens new fundamental research problems. The rapid generation of continuous streams of information poses a challenge for the storage, computation, and communication capabilities of a computing system. The gigantic amounts of data arriving at high speed require semi-automated interactive techniques to perform real-time extraction of hidden knowledge.
Typical data mining tasks include concept description, regression analysis, association mining, outlier analysis, classification, and clustering. These techniques find interesting patterns, tracing regularities and anomalies in the data set. However, traditional data mining techniques cannot be directly applied to the data streaming model, because most of them require multiple scans of the data to mine the information, which is impractical for stream data. The amount of previously seen events is usually immeasurable, so they must either be dropped after processing or archived separately in secondary storage. More importantly, the traits of the data stream can change over time, and the evolving pattern needs to be recorded. Furthermore, the problem of resource allocation must also be considered in mining data streams: due to the bulky volume and high speed of streaming data, stream mining algorithms must handle the effects of system burden. Thus, how to accomplish optimal results under various resource constraints becomes challenging.

Initially, decision trees developed for data mining were tailored to deal with stream data as well, but the difficulty lies in ensuring that an attribute selected from N examples is equally good when used for infinitely many examples. The target was to calculate the heuristic value from the N training examples and then exploit the results to split the learning sample space. At first the Hoeffding tree algorithm, based on Hoeffding's inequality and Hoeffding's bound, was used for knowledge discovery in data streams. Hoeffding's bound postulates that, with probability 1 - δ, the true mean of a random variable of range R does not differ from the estimated mean, after N independent trials, by more than:

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2N}} \qquad (1)$$

A glance through the techniques and interpretations of data stream research prompted us to amend the existing tactics for improved performance in data stream mining systems. In this paper, we show that: methods based on McDiarmid's inequality call for a gigantic amount of data stream samples at each node; by using Gaussian approximation techniques we can tighten the bounds used and reduce the size of the training samples needed for split criterion selection; Hoeffding's inequality is not sufficient to resolve the fundamental problem in the general case, so all existing methods should be adjusted; and McDiarmid's inequality, used in an appropriate way, is an efficient technique to cope with the glitches of the data streaming model.
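To make the contrast concrete, here is a minimal Python sketch (ours, not from the paper) comparing the Hoeffding epsilon of Eq. (1) with a Gaussian-approximation epsilon of the kind advocated here; the standard deviation sigma = 0.25 is an illustrative assumption.

```python
import math
from statistics import NormalDist

def hoeffding_epsilon(R, delta, N):
    """Eq. (1): with probability 1 - delta, the sample mean of N i.i.d.
    variables of range R is within epsilon of the true mean."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * N))

def gaussian_epsilon(sigma, delta, N):
    """Illustrative Gaussian-approximation counterpart: treat the sample
    mean as approximately N(mu, sigma^2 / N) and take its (1 - delta) quantile."""
    return NormalDist().inv_cdf(1.0 - delta) * sigma / math.sqrt(N)

# Example: range R = 1, delta = 0.05; sigma = 0.25 is an assumed std-dev
# (any variable bounded in [0, 1] has std-dev at most 0.5).
for N in (100, 1000, 10000):
    print(N, round(hoeffding_epsilon(1.0, 0.05, N), 4),
          round(gaussian_epsilon(0.25, 0.05, N), 4))
```

Under these assumptions the Gaussian epsilon is roughly a third of the Hoeffding epsilon at every N, i.e. far fewer training examples are needed to reach the same confidence.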
II. RELATED WORK
One of the earliest works on mining evolving data streams was carried out by P. Domingos and G. Hulten [5], who proposed VFDT, which could generate a decision tree under strict time and memory constraints. C. Aggarwal [2] studied and proposed a two-phase technique for the astronomical time-series problem: in the first phase, sliding-window clusters are created, which are later used to mine association rules from the streams. The effect of concept drift was examined by A. Bifet and R. Kirkby [3], who evaluated online streams with complete concept shifts and used ensembles in their experiments to tackle such concepts. G. Hulten, L. Spencer, and P. Domingos [11] observed that the basic assumptions of a machine learning system do not hold for a streaming model, because the data source distribution is never stationary or predictable; they proposed an algorithm for handling continuously changing data streams, called CVFDT, a variant of VFDT, and worked on the complexity of target examples. Most approaches rely on the underlying concept of Hoeffding's inequality, introduced by W. Hoeffding [10]. The problem with adapting this theory to data stream mining is that Hoeffding's inequality considers only numeric-valued data, whereas real-world data is unpredictable. R. Kirkby [12] studied the streaming model and proposed enhancements to the Hoeffding tree algorithm; he labeled a fraction of the stream sample and used a semi-supervised approach to create clusters from the dataset. B. Pfahringer, G. Holmes, and R. Kirkby [14] used option trees instead of clustering in their algorithm; the use of option trees helped improve the efficiency of Hoeffding's bounds. Recently, L. Rutkowski, L. Pietruczuk, P. Duda, and M. Jaworski [1] have shown that the use of McDiarmid's inequality is the correct way to analyze high-speed, time-changing data streams.
III. IMPLEMENTATION DETAILS
Conventional data mining techniques require several passes over the data to mine knowledge, but this is not feasible for the stream model. In practice it is not possible to store an entire stream or scan through it numerous times, due to its enormous volume. Moreover, data streams evolve over time and undergo severe concept drift or complete shift.
A. Hoeffding's Inequality
Let $Y_1, Y_2, \ldots, Y_N$ be independent random variables with values from a distribution, with $Y_i \in [0, R]$ for $i = 1, \ldots, N$, where $R$ is a constant. Let $E[Y]$ denote the expected value of the distribution and $\bar{Y}$ the sample mean. More generally, let $Y_i$, $i = 1, \ldots, N$, be independent random variables such that $\Pr(Y_i \in [a_i, b_i]) = 1$. Then for $S = \sum_{i=1}^{N} Y_i$ and all $\epsilon > 0$ the following inequalities hold:

$$\Pr(S - E[S] \geq \epsilon) \leq \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^{N}(b_i - a_i)^2}\right) \qquad (2)$$

$$\Pr(|S - E[S]| \geq \epsilon) \leq 2\exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^{N}(b_i - a_i)^2}\right) \qquad (3)$$
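As a quick numerical sanity check (our illustration, not part of the paper), the following sketch verifies inequality (3) by simulation for Uniform[0, 1] variables, for which the bound on the sample mean reduces to 2 exp(-2Nε²); the gap between the empirical frequency and the bound also shows how loose Hoeffding's inequality is in practice, which is what motivates the Gaussian approximation.

```python
import math
import random

def check_hoeffding(N=200, eps=0.1, trials=5000, seed=42):
    """Compare the empirical frequency of |mean - E[Y]| >= eps against
    the Hoeffding bound 2 * exp(-2 * N * eps^2) for Y ~ Uniform[0, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(N)) / N
        if abs(mean - 0.5) >= eps:  # E[Y] = 0.5 for Uniform[0, 1]
            hits += 1
    return hits / trials, 2.0 * math.exp(-2.0 * N * eps * eps)

empirical, bound = check_hoeffding()
print(f"empirical: {empirical:.4f}  Hoeffding bound: {bound:.4f}")
```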
B. Proposed Architecture
To improve the bounds obtained for the split measures, we apply a Gaussian approximation to them.
Fig. 1. Information Flow Path.

Preprocessing: The dataset to be clustered is first analysed to detect the native structure in the document space using correlation analysis. The next step is to remove the null words and write the contents, after removal of the null words, to a new stop-words folder.

McDiarmid's inequality: Suppose an attribute $a$ can take one of $|a|$ different values from the set $A = \{a_1, \ldots, a_{|a|}\}$, and let $K = \{k_1, \ldots, k_K\}$ be the set of different classes. Then let

$$Z = \{X_1, \ldots, X_N\} \qquad (4)$$

be the training set of size $N$, where $X_1, \ldots, X_N$ are independent random variables defined as

$$X_i = (a_{j_i}, b_{l_i}, \ldots, k_{q_i}) \qquad (5)$$

for $i = 1, \ldots, N$, $j_i \in \{1, \ldots, |a|\}$, $l_i \in \{1, \ldots, |b|\}$, $q_i \in \{1, \ldots, K\}$. Each element of $Z$ belongs to one of the $K$ different classes $k_j$. The entropy associated with the classification of $Z$ is defined as

$$H(Z) = -\sum_{j=1}^{K} p_j \log_2 p_j \qquad (6)$$

where $p_j$ is the probability that an element of $Z$ comes from class $k_j$. We estimate this probability by $n_j / N$, where $n_j$ is the number of elements from class $k_j$. Then

$$H(Z) = -\sum_{j=1}^{K} \frac{n_j}{N} \log_2 \frac{n_j}{N} \qquad (7)$$

Choose an attribute $a$ characterizing the elements of set $Z$, and let $Z_{a_i}$ denote the set of elements of $Z$ for which the value of $a$ is $a_i$; the number of elements in $Z_{a_i}$ is labeled $n_{a_i}$. Then the weighted entropy for attribute $a$ and set $Z$ is given by

$$H_a(Z) = \sum_{i=1}^{|a|} \frac{n_{a_i}}{N} H(Z_{a_i}) \qquad (8)$$

where

$$H(Z_{a_i}) = -\sum_{j=1}^{K} \frac{n_{a_i}^j}{n_{a_i}} \log_2 \frac{n_{a_i}^j}{n_{a_i}} \qquad (9)$$

and $n_{a_i}^j$ denotes the number of elements in set $Z_{a_i}$ from class $k_j$. The information gain for attribute $a$ is given by

$$\text{Gain}_a(Z) = H(Z) - H_a(Z) \qquad (10)$$

Let us assume that $a$ is the attribute with the highest value of information gain, while $b$ is the second-best attribute, so that

$$\text{Gain}_a(Z) - \text{Gain}_b(Z) > 0 \qquad (11)$$

and define

$$f(Z) = \text{Gain}_a(Z) - \text{Gain}_b(Z) \qquad (12)$$
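For concreteness, a short Python sketch (ours; the function names are illustrative) of the empirical entropy of Eq. (7) and the information gain of Eq. (10):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Empirical entropy, Eq. (7): -sum_j (n_j / N) log2 (n_j / N)."""
    N = len(labels)
    return -sum((n / N) * math.log2(n / N) for n in Counter(labels).values())

def info_gain(values, labels):
    """Information gain of an attribute, Eq. (10): H(Z) - H_a(Z),
    where H_a(Z) is the weighted entropy of Eq. (8)."""
    N = len(labels)
    partitions = defaultdict(list)
    for v, y in zip(values, labels):  # group class labels by attribute value
        partitions[v].append(y)
    weighted = sum(len(p) / N * entropy(p) for p in partitions.values())
    return entropy(labels) - weighted

# Toy example: attribute a with values a1/a2 over a two-class sample.
a = ["a1", "a1", "a2", "a2", "a2", "a1"]
k = ["k1", "k1", "k2", "k2", "k1", "k2"]
print(round(info_gain(a, k), 4))
```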
Let $Z$, given by (4), be the set of independent random variables, with $X_i$ taking values in a set $A_i$ for each $i$. Let us define

$$Z' = \{X_1, \ldots, X_{i-1}, \hat{X}_i, X_{i+1}, \ldots, X_N\} \qquad (13)$$

with $\hat{X}_i$ taking values in $A_i$. Observe that $Z'$ differs from $Z$ only in the $i$-th element. We will use McDiarmid's inequality: suppose that the function

$$f : A_1 \times \cdots \times A_N \to \mathbb{R} \qquad (14)$$

satisfies

$$|f(Z) - f(Z')| \leq C_i \qquad (15)$$

for all $X_1, \ldots, X_N$, $\hat{X}_i$ and some constants $C_i$. Then

$$\Pr(f(Z) - E[f(Z)] \geq \epsilon) \leq \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^{N} C_i^2}\right) \qquad (16, 17)$$

Proposed architecture: The conceptual user-interaction model of the proposed system is shown in Fig. 2.
Fig. 2. System Architecture.
Mathematical model for information gain: Traditional ID3 uses information gain as the split measure for selecting the best attribute. When using the McDiarmid tree algorithm, the following theorem guarantees that a decision-tree learning system applied to data streams produces output nearly identical to that of a conventional batch learner. Let $Z = \{X_1, \ldots, X_N\}$ be any set of independent random variables, each taking values in $A \times B \times \cdots$. Then, for any fixed $\delta$ and any pair of attributes $a$ and $b$ with $f(Z) = \text{Gain}_a(Z) - \text{Gain}_b(Z) > 0$, if

$$\epsilon = C_{\text{Gain}}(K, N)\sqrt{\frac{\ln(1/\delta)}{2N}} \qquad (18)$$

then

$$\Pr(f(Z) - E[f(Z)] > \epsilon) \leq \delta \qquad (19)$$

where

$$C_{\text{Gain}}(K, N) = 6(K \log_2 eN + \log_2 2N) + 2\log_2 K \qquad (20)$$

Mathematical model for the Gini index: The Gini index is usually used in CART and measures the impurity of the training set. The corresponding theorem using McDiarmid's bound is as follows. Let $Z = \{X_1, \ldots, X_N\}$ be any set of independent random variables, each taking values in $A \times B \times \cdots$. Then, for any fixed $\delta$ and any pair of attributes $a$ and $b$ with $f(Z) > 0$, the bound (19) holds with

$$\epsilon = 8\sqrt{\frac{\ln(1/\delta)}{2N}} \qquad (21)$$
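Both bounds are simple closed-form expressions; the following sketch (our transcription of Eqs. (18)-(21), with illustrative parameter values) evaluates them. Note how large N must be before the information-gain epsilon becomes small, which is exactly the weakness that the Gaussian approximation targets.

```python
import math

def eps_gain(K, N, delta):
    """Eqs. (18)-(20): McDiarmid epsilon for information gain,
    eps = C_Gain(K, N) * sqrt(ln(1/delta) / (2N))."""
    c_gain = (6.0 * (K * math.log2(math.e * N) + math.log2(2.0 * N))
              + 2.0 * math.log2(K))
    return c_gain * math.sqrt(math.log(1.0 / delta) / (2.0 * N))

def eps_gini(N, delta):
    """Eq. (21): McDiarmid epsilon for the Gini index,
    eps = 8 * sqrt(ln(1/delta) / (2N))."""
    return 8.0 * math.sqrt(math.log(1.0 / delta) / (2.0 * N))

# Even at a million examples the information-gain bound is still ~0.47,
# while the Gini bound is already ~0.01 (K = 2 classes, delta = 0.05).
print(round(eps_gain(K=2, N=10**6, delta=0.05), 4))
print(round(eps_gini(N=10**6, delta=0.05), 4))
```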
C. Algorithm
For convenience, the following notation will be used: À denotes the set of all attributes; α denotes any attribute from set À; αMAX1 is the attribute with the highest value of the split function; αMAX2 is the attribute with the second-highest value of the split function.
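Using this notation, a hedged sketch (ours, not the paper's algorithm listing) of the node-splitting test: the node splits on αMAX1 only when its split-measure lead over αMAX2 exceeds the McDiarmid bound.

```python
import math

def eps_gain(K, N, delta):
    """Eqs. (18)-(20), repeated here so the sketch is self-contained."""
    c = (6.0 * (K * math.log2(math.e * N) + math.log2(2.0 * N))
         + 2.0 * math.log2(K))
    return c * math.sqrt(math.log(1.0 / delta) / (2.0 * N))

def choose_split(split_values, N, K, delta):
    """split_values maps each attribute alpha to its split-measure value.
    alpha_MAX1 / alpha_MAX2 are the attributes with the highest and
    second-highest values; split only when their gap exceeds the bound."""
    ranked = sorted(split_values.items(), key=lambda kv: kv[1], reverse=True)
    (a_max1, g1), (_, g2) = ranked[:2]
    if g1 - g2 > eps_gain(K, N, delta):
        return a_max1  # split holds with probability at least 1 - delta
    return None        # not yet statistically distinguishable; keep streaming

# Example: three candidate attributes at a node after a million examples.
print(choose_split({"age": 0.61, "income": 0.10, "zip": 0.05},
                   N=10**6, K=2, delta=0.05))
```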
IV. RESULTS
A. Data Set
Web logs, financial tickers, sensor feeds, and other massive, unordered, continuous data sets.
B. Expected Result Set
The result set will comprise two modules. First, the system will generate a decision tree based on the McDiarmid tree algorithm, with Gaussian bounds applied to the attribute-selection function. Second, it will provide a graph comparing the decision tree produced by the traditional McDiarmid algorithm with the tree obtained by applying the Gaussian approximation to the bounds of the split measures.
C. Platform
The proposed algorithm will be developed and deployed using Java's EJB modules. MySQL will be used for database-related operations in the backend.
V. CONCLUSION
The proliferation of data streams has driven the development of stream mining algorithms. Mining online high-speed data streams poses a number of difficulties for researchers. Due to limited resources and critical time constraints, many summarization and approximation methods have been borrowed from statistics and computational theory. The expected mean used in Hoeffding's theorem is not universally valid; a suitable technique to resolve this shortcoming is McDiarmid's theorem. With the use of Gaussian approximations on the obtained McDiarmid bounds, the system's efficiency can be drastically enhanced and the number of training examples needed to select a splitting criterion reduced.
In this paper we have tried to address a few of these issues, but many open problems and fresh challenges still demand attention; if they are tackled efficiently, data streams will play a major role in every area of our lives.
ACKNOWLEDGMENT
The authors would like to express thanks to the reviewers for helpful comments.
REFERENCES
[1] L. Rutkowski, L. Pietruczuk, P. Duda, and M. Jaworski, "Decision Trees for Mining Data Streams Based on the McDiarmid's Bound," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, 2013.
[2] C. Aggarwal, Data Streams: Models and Algorithms. Springer, 2007.
[3] A. Bifet and R. Kirkby, Data Stream Mining: A Practical Approach, technical report, Univ. of Waikato, 2009.
[4] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees. Chapman and Hall, 1993.
[5] P. Domingos and G. Hulten, "Mining High-Speed Data Streams," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 71-80, 2000.
[6] W. Fan, Y. Huang, and P.S. Yu, "Decision Tree Evolution Using Limited Number of Labeled Data Items from Drifting Data Streams," Proc. IEEE Fourth Int'l Conf. Data Mining, pp. 379-382, 2004.
[7] M. Gaber, A. Zaslavsky, and S. Krishnaswamy, “Mining Data Streams: A Review,” ACM SIGMOD Record, vol. 34, no. 2, pp. 18-26, June 2005.
[8] J. Gama, R. Fernandes, and R. Rocha, “Decision Trees for Mining Data Streams,” Intelligent Data Analysis, vol. 10, no. 1, pp. 23-45, Mar. 2006.
[9] J. Han and M. Kamber, Data Mining: Concepts and Techniques, second ed. Elsevier, 2006.
[10] W. Hoeffding, “Probability Inequalities for Sums of Bounded Random Variables,” J. Am. Statistical Assoc., vol. 58, no. 301, pp. 13-30, Mar. 1963.
[11] G. Hulten, L. Spencer, and P. Domingos, "Mining Time-Changing Data Streams," Proc. Seventh ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 97-106, 2001.
[12] R. Kirkby, "Improving Hoeffding Trees," PhD dissertation, University of Waikato, Hamilton, 2007.
[13] X. Li, J.M. Barajas, and Y. Ding, "Collaborative Filtering on Streaming Data with Interest-Drifting," Intelligent Data Analysis, vol. 11, no. 1, pp. 75-87, 2007.
[14] B. Pfahringer, G. Holmes, and R. Kirkby, “New Options for Hoeffding Trees,” Proc. 20th Australian Joint Conf. Advances in Artificial Intelligence, pp. 90-99, 2007.
[15] J.R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[16] W. Fan, Y. Huang, H. Wang, and P.S. Yu, "Active Mining of Data Streams," Proc. SDM, 2004.
[17] C. Franke, Adaptivity in Data Stream Mining, PhD dissertation, University of California, Davis, 2009.