Data Mining is one of the most significant tools for discovering association patterns that are useful for many knowledge domains. Yet, there are some drawbacks in existing mining techniques. Three main weaknesses of current data-mining techniques are: 1) re-scanning of the entire database must be done whenever new attributes are added. 2) An association rule may be true on a certain granularity but fail on a smaller ones and vise verse. 3) Current methods can only be used to find either frequent rules or infrequent rules, but not both at the same time. This research proposes a novel data schema and an algorithm that solves the above weaknesses while improving on the efficiency and effectiveness of data mining strategies. Crucial mechanisms in each step will be clarified in this paper. Finally, this paper presents experimental results regarding efficiency, scalability, information loss, etc. of the proposed approach to prove its advantages.
This document summarizes a research paper that proposes a multidimensional data mining algorithm to determine association rules across different granularities. The algorithm addresses weaknesses in existing techniques, such as having to rescan the entire database when new attributes are added. It uses a concept taxonomy structure to represent the search space and finds association patterns by selecting concepts from individual taxonomies. An experiment on a wholesale business dataset demonstrates that the algorithm is linear and highly scalable to the number of records and can flexibly handle different data types.
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...acijjournal
Apriori is one of the key algorithms to generate frequent itemsets. Analysing frequent itemset is a crucial
step in analysing structured data and in finding association relationship between items. This stands as an
elementary foundation to supervised learning, which encompasses classifier and feature extraction
methods. Applying this algorithm is crucial to understand the behaviour of structured data. Most of the
structured data in scientific domain are voluminous. Processing such kind of data requires state of the art
computing machines. Setting up such an infrastructure is expensive. Hence a distributed environment
such as a clustered setup is employed for tackling such scenarios. Apache Hadoop distribution is one of
the cluster frameworks in distributed environment that helps by distributing voluminous data across a
number of nodes in the framework. This paper focuses on map/reduce design and implementation of
Apriori algorithm for structured data analysis.
The document proposes a Modified Pure Radix Sort algorithm for large heterogeneous datasets. The algorithm divides the data into numeric and string processes that work simultaneously. The numeric process further divides data into sublists by element length and sorts them simultaneously using an even/odd logic across digits. The string process identifies common patterns to convert strings to numbers that are then sorted. This optimizes problems with traditional radix sort through a distributed computing approach.
This document summarizes a research paper on developing an improved LEACH (Low-Energy Adaptive Clustering Hierarchy) communication protocol for energy efficient data mining in multi-feature sensor networks. It begins with background on wireless sensor networks and issues like energy efficiency. It then discusses the existing LEACH protocol and its drawbacks. The proposed improved LEACH protocol includes cluster heads, sub-cluster heads, and cluster nodes to address LEACH's limitations. This new version aims to minimize energy consumption during cluster formation and data aggregation in multi-feature sensor networks.
This document summarizes a research paper that proposes a new density-based clustering technique called Triangle-Density Based Clustering Technique (TDCT) to efficiently cluster large spatial datasets. TDCT uses a polygon approach where the number of data points inside each triangle of a polygon is calculated to determine triangle densities. Triangle densities are used to identify clusters based on a density confidence threshold. The technique aims to identify clusters of arbitrary shapes and densities while minimizing computational costs. Experimental results demonstrate the technique's superiority in terms of cluster quality and complexity compared to other density-based clustering algorithms.
A fuzzy clustering algorithm for high dimensional streaming dataAlexander Decker
This document summarizes a research paper that proposes a new dimension-reduced weighted fuzzy clustering algorithm (sWFCM-HD) for high-dimensional streaming data. The algorithm can cluster datasets that have both high dimensionality and a streaming (continuously arriving) nature. It combines previous work on clustering algorithms for streaming data and high-dimensional data. The paper introduces the algorithm and compares it experimentally to show improvements in memory usage and runtime over other approaches for these types of datasets.
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...ijsrd.com
A cluster is a group of objects which are similar to each other within a cluster and are dissimilar to the objects of other clusters. The similarity is typically calculated on the basis of distance between two objects or clusters. Two or more objects present inside a cluster and only if those objects are close to each other based on the distance between them.The major objective of clustering is to discover collection of comparable objects based on similarity metric. Fuzzy Possibilistic C-Means (FPCM) is the effective clustering algorithm available to cluster unlabeled data that produces both membership and typicality values during clustering process. In this approach, the efficiency of the Fuzzy Possibilistic C-means clustering approach is enhanced by using the penalized and compensated constraints based FPCM (PCFPCM). The proposed PCFPCM approach differ from the conventional clustering techniques by imposing the possibilistic reasoning strategy on fuzzy clustering with penalized and compensated constraints for updating the grades of membership and typicality. The performance of the proposed approaches is evaluated on the University of California, Irvine (UCI) machine repository datasets such as Iris, Wine, Lung Cancer and Lymphograma. The parameters used for the evaluation is Clustering accuracy, Mean Squared Error (MSE), Execution Time and Convergence behavior.
This document summarizes an article from the International Journal of Computer Engineering and Technology (IJCET) that proposes an algorithm called Replica Placement in Graph Topology Grid (RPGTG) to optimally place data replicas in a graph-based data grid while ensuring quality of service (QoS). The algorithm aims to minimize data access time, balance load among replica servers, and avoid unnecessary replications, while restricting QoS in terms of number of hops and deadline to complete requests. The article describes how the algorithm converts the graph structure of the data grid to a hierarchical structure to better manage replica servers and proposes services to facilitate dynamic replication, including a replica catalog to track replica locations and a replica manager to perform replication
This document summarizes a research paper that proposes a multidimensional data mining algorithm to determine association rules across different granularities. The algorithm addresses weaknesses in existing techniques, such as having to rescan the entire database when new attributes are added. It uses a concept taxonomy structure to represent the search space and finds association patterns by selecting concepts from individual taxonomies. An experiment on a wholesale business dataset demonstrates that the algorithm is linear and highly scalable to the number of records and can flexibly handle different data types.
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...acijjournal
Apriori is one of the key algorithms to generate frequent itemsets. Analysing frequent itemset is a crucial
step in analysing structured data and in finding association relationship between items. This stands as an
elementary foundation to supervised learning, which encompasses classifier and feature extraction
methods. Applying this algorithm is crucial to understand the behaviour of structured data. Most of the
structured data in scientific domain are voluminous. Processing such kind of data requires state of the art
computing machines. Setting up such an infrastructure is expensive. Hence a distributed environment
such as a clustered setup is employed for tackling such scenarios. Apache Hadoop distribution is one of
the cluster frameworks in distributed environment that helps by distributing voluminous data across a
number of nodes in the framework. This paper focuses on map/reduce design and implementation of
Apriori algorithm for structured data analysis.
The document proposes a Modified Pure Radix Sort algorithm for large heterogeneous datasets. The algorithm divides the data into numeric and string processes that work simultaneously. The numeric process further divides data into sublists by element length and sorts them simultaneously using an even/odd logic across digits. The string process identifies common patterns to convert strings to numbers that are then sorted. This optimizes problems with traditional radix sort through a distributed computing approach.
This document summarizes a research paper on developing an improved LEACH (Low-Energy Adaptive Clustering Hierarchy) communication protocol for energy efficient data mining in multi-feature sensor networks. It begins with background on wireless sensor networks and issues like energy efficiency. It then discusses the existing LEACH protocol and its drawbacks. The proposed improved LEACH protocol includes cluster heads, sub-cluster heads, and cluster nodes to address LEACH's limitations. This new version aims to minimize energy consumption during cluster formation and data aggregation in multi-feature sensor networks.
This document summarizes a research paper that proposes a new density-based clustering technique called Triangle-Density Based Clustering Technique (TDCT) to efficiently cluster large spatial datasets. TDCT uses a polygon approach where the number of data points inside each triangle of a polygon is calculated to determine triangle densities. Triangle densities are used to identify clusters based on a density confidence threshold. The technique aims to identify clusters of arbitrary shapes and densities while minimizing computational costs. Experimental results demonstrate the technique's superiority in terms of cluster quality and complexity compared to other density-based clustering algorithms.
A fuzzy clustering algorithm for high dimensional streaming dataAlexander Decker
This document summarizes a research paper that proposes a new dimension-reduced weighted fuzzy clustering algorithm (sWFCM-HD) for high-dimensional streaming data. The algorithm can cluster datasets that have both high dimensionality and a streaming (continuously arriving) nature. It combines previous work on clustering algorithms for streaming data and high-dimensional data. The paper introduces the algorithm and compares it experimentally to show improvements in memory usage and runtime over other approaches for these types of datasets.
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...ijsrd.com
A cluster is a group of objects which are similar to each other within a cluster and are dissimilar to the objects of other clusters. The similarity is typically calculated on the basis of distance between two objects or clusters. Two or more objects present inside a cluster and only if those objects are close to each other based on the distance between them.The major objective of clustering is to discover collection of comparable objects based on similarity metric. Fuzzy Possibilistic C-Means (FPCM) is the effective clustering algorithm available to cluster unlabeled data that produces both membership and typicality values during clustering process. In this approach, the efficiency of the Fuzzy Possibilistic C-means clustering approach is enhanced by using the penalized and compensated constraints based FPCM (PCFPCM). The proposed PCFPCM approach differ from the conventional clustering techniques by imposing the possibilistic reasoning strategy on fuzzy clustering with penalized and compensated constraints for updating the grades of membership and typicality. The performance of the proposed approaches is evaluated on the University of California, Irvine (UCI) machine repository datasets such as Iris, Wine, Lung Cancer and Lymphograma. The parameters used for the evaluation is Clustering accuracy, Mean Squared Error (MSE), Execution Time and Convergence behavior.
This document summarizes an article from the International Journal of Computer Engineering and Technology (IJCET) that proposes an algorithm called Replica Placement in Graph Topology Grid (RPGTG) to optimally place data replicas in a graph-based data grid while ensuring quality of service (QoS). The algorithm aims to minimize data access time, balance load among replica servers, and avoid unnecessary replications, while restricting QoS in terms of number of hops and deadline to complete requests. The article describes how the algorithm converts the graph structure of the data grid to a hierarchical structure to better manage replica servers and proposes services to facilitate dynamic replication, including a replica catalog to track replica locations and a replica manager to perform replication
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment…. And many more.
The improved k means with particle swarm optimizationAlexander Decker
This document summarizes a research paper that proposes an improved K-means clustering algorithm using particle swarm optimization. It begins with an introduction to data clustering and types of clustering algorithms. It then discusses K-means clustering and some of its drawbacks. Particle swarm optimization is introduced as an optimization technique inspired by swarm behavior in nature. The proposed algorithm uses particle swarm optimization to select better initial cluster centroids for K-means clustering in order to overcome some limitations of standard K-means. The algorithm works in two phases - the first uses particle swarm optimization and the second performs K-means clustering using the outputs from the first phase.
Extended pso algorithm for improvement problems k means clustering algorithmIJMIT JOURNAL
The clustering is a without monitoring process and one of the most common data mining techniques. The
purpose of clustering is grouping similar data together in a group, so were most similar to each other in a
cluster and the difference with most other instances in the cluster are. In this paper we focus on clustering
partition k-means, due to ease of implementation and high-speed performance of large data sets, After 30
year it is still very popular among the developed clustering algorithm and then for improvement problem of
placing of k-means algorithm in local optimal, we pose extended PSO algorithm, that its name is ECPSO.
Our new algorithm is able to be cause of exit from local optimal and with high percent produce the
problem’s optimal answer. The probe of results show that mooted algorithm have better performance
regards as other clustering algorithms specially in two index, the carefulness of clustering and the quality
of clustering.
Juha vesanto esa alhoniemi 2000:clustering of the somArchiLab 7
This document summarizes a research paper that proposes clustering the self-organizing map (SOM) as a way to analyze cluster structure in data. The paper discusses:
1) Using a two-stage process where data is first mapped to prototypes using a SOM, then the SOM prototypes are clustered, reducing computational load compared to direct data clustering.
2) Different clustering algorithms that can be applied to the SOM prototypes, including hierarchical and partitive (k-means) methods.
3) The benefits of clustering the SOM include noise reduction and being able to cluster large datasets more efficiently.
A Survey on Fuzzy Association Rule Mining MethodologiesIOSR Journals
Abstract : Fuzzy association rule mining (Fuzzy ARM) uses fuzzy logic to generate interesting association
rules. These association relationships can help in decision making for the solution of a given problem. Fuzzy
ARM is a variant of classical association rule mining. Classical association rule mining uses the concept of
crisp sets. Because of this reason classical association rule mining has several drawbacks. To overcome those
drawbacks the concept of fuzzy association rule mining came. Today there is a huge number of different types of
fuzzy association rule mining algorithms are present in research works and day by day these algorithms are
getting better. But as the problem domain is also becoming more complex in nature, continuous research work
is still going on. In this paper, we have studied several well-known methodologies and algorithms for fuzzy
association rule mining. Four important methodologies are briefly discussed in this paper which will show the
recent trends and future scope of research in the field of fuzzy association rule mining.
Keywords: Knowledge discovery in databases, Data mining, Fuzzy association rule mining, Classical
association rule mining, Very large datasets, Minimum support, Cardinality, Certainty factor, Redundant rule,
Equivalence , Equivalent rules
This document discusses load balancing strategies for grid computing. It proposes a dynamic tree-based model to represent grid architecture in a hierarchical way that supports heterogeneity and scalability. It then develops a hierarchical load balancing strategy and algorithms based on neighborhood properties to decrease communication overhead. Conventional scheduling algorithms like Min-Min, Max-Min, and Sufferage are discussed but determined to ignore dynamic network status, which is important for load balancing. Genetic algorithms are also mentioned as a potential solution.
In recent machine learning community, there is a trend of constructing a linear logarithm version of
nonlinear version through the ‘kernel method’ for example kernel principal component analysis, kernel
fisher discriminant analysis, support Vector Machines (SVMs), and the current kernel clustering
algorithms. Typically, in unsupervised methods of clustering algorithms utilizing kernel method, a
nonlinear mapping is operated initially in order to map the data into a much higher space feature, and then
clustering is executed. A hitch of these kernel clustering algorithms is that the clustering prototype resides
in increased features specs of dimensions and therefore lack intuitive and clear descriptions without
utilizing added approximation of projection from the specs to the data as executed in the literature
presented. This paper aims to utilize the ‘kernel method’, a novel clustering algorithm, founded on the
conventional fuzzy clustering algorithm (FCM) is anticipated and known as kernel fuzzy c-means algorithm
(KFCM). This method embraces a novel kernel-induced metric in the space of data in order to interchange
the novel Euclidean matric norm in cluster prototype and fuzzy clustering algorithm still reside in the space
of data so that the results of clustering could be interpreted and reformulated in the spaces which are
original. This property is used for clustering incomplete data. Execution on supposed data illustrate that
KFCM has improved performance of clustering and stout as compare to other transformations of FCM for
clustering incomplete data.
Particle Swarm Optimization based K-Prototype Clustering Algorithm iosrjce
This document summarizes a research paper that proposes a new Particle Swarm Optimization (PSO) based K-Prototype clustering algorithm to cluster mixed numeric and categorical data. It begins with background information on clustering algorithms like K-Means, K-Modes, and K-Prototype. It then describes the K-Prototype algorithm, PSO, and discrete binary PSO. Related work integrating PSO with other clustering algorithms is also reviewed. The proposed approach uses binary PSO to select improved initial prototypes for K-Prototype clustering in order to obtain better clustering results than traditional K-Prototype and avoid local optima.
In the present day huge amount of data is generated in every minute and transferred frequently. Although
the data is sometimes static but most commonly it is dynamic and transactional. New data that is being
generated is getting constantly added to the old/existing data. To discover the knowledge from this
incremental data, one approach is to run the algorithm repeatedly for the modified data sets which is time
consuming. Again to analyze the datasets properly, construction of efficient classifier model is necessary.
The objective of developing such a classifier is to classify unlabeled dataset into appropriate classes. The
paper proposes a dimension reduction algorithm that can be applied in dynamic environment for
generation of reduced attribute set as dynamic reduct, and an optimization algorithm which uses the
reduct and build up the corresponding classification system. The method analyzes the new dataset, when it
becomes available, and modifies the reduct accordingly to fit the entire dataset and from the entire data
set, interesting optimal classification rule sets are generated. The concepts of discernibility relation,
attribute dependency and attribute significance of Rough Set Theory are integrated for the generation of
dynamic reduct set, and optimal classification rules are selected using PSO method, which not only
reduces the complexity but also helps to achieve higher accuracy of the decision system. The proposed
method has been applied on some benchmark dataset collected from the UCI repository and dynamic
reduct is computed, and from the reduct optimal classification rules are also generated. Experimental
result shows the efficiency of the proposed method.
This document compares hierarchical and non-hierarchical clustering algorithms. It summarizes four clustering algorithms: K-Means, K-Medoids, Farthest First Clustering (hierarchical algorithms), and DBSCAN (non-hierarchical algorithm). It describes the methodology of each algorithm and provides pseudocode. It also describes the datasets used to evaluate the performance of the algorithms and the evaluation metrics. The goal is to compare the performance of the clustering methods on different datasets.
Comparative analysis of various data stream mining procedures and various dim...Alexander Decker
This document provides a comparative analysis of various data stream mining procedures and dimension reduction techniques. It discusses 10 different data stream clustering algorithms and their working mechanisms. It also compares 6 dimension reduction techniques and their objectives. The document proposes applying a dimension reduction technique to reduce the dimensionality of a high-dimensional data stream, before clustering it using a weighted fuzzy c-means algorithm. This combined approach aims to improve clustering quality and enable better visualization of streaming data.
This document describes a new distance-based clustering algorithm (DBCA) that aims to improve upon K-means clustering. DBCA selects initial cluster centroids based on the total distance of each data point to all other points, rather than random selection. It calculates distances between all points, identifies points with maximum total distances, and sets initial centroids as the averages of groups of these maximally distant points. The algorithm is compared to K-means, hierarchical clustering, and hierarchical partitioning clustering on synthetic and real data. Experimental results show DBCA produces better quality clusters than these other algorithms.
Decision tree clustering a columnstores tuple reconstructioncsandit
Column-Stores has gained market share due to promi
sing physical storage alternative for
analytical queries. However, for multi-attribute qu
eries column-stores pays performance
penalties due to on-the-fly tuple reconstruction. T
his paper presents an adaptive approach for
reducing tuple reconstruction time. Proposed approa
ch exploits decision tree algorithm to
cluster attributes for each projection and also eli
minates frequent database scanning.
Experimentations with TPC-H data shows the effectiv
eness of proposed approach.
Text documents clustering using modified multi-verse optimizerIJECEIAES
In this study, a multi-verse optimizer (MVO) is utilised for the text document clus- tering (TDC) problem. TDC is treated as a discrete optimization problem, and an objective function based on the Euclidean distance is applied as similarity measure. TDC is tackled by the division of the documents into clusters; documents belonging to the same cluster are similar, whereas those belonging to different clusters are dissimilar. MVO, which is a recent metaheuristic optimization algorithm established for continuous optimization problems, can intelligently navigate different areas in the search space and search deeply in each area using a particular learning mechanism. The proposed algorithm is called MVOTDC, and it adopts the convergence behaviour of MVO operators to deal with discrete, rather than continuous, optimization problems. For evaluating MVOTDC, a comprehensive comparative study is conducted on six text document datasets with various numbers of documents and clusters. The quality of the final results is assessed using precision, recall, F-measure, entropy accuracy, and purity measures. Experimental results reveal that the proposed method performs competitively in comparison with state-of-the-art algorithms. Statistical analysis is also conducted and shows that MVOTDC can produce significant results in comparison with three well-established methods.
A PSO-Based Subtractive Data Clustering AlgorithmIJORCS
There is a tremendous proliferation in the amount of information available on the largest shared information source, the World Wide Web. Fast and high-quality clustering algorithms play an important role in helping users to effectively navigate, summarize, and organize the information. Recent studies have shown that partitional clustering algorithms such as the k-means algorithm are the most popular algorithms for clustering large datasets. The major problem with partitional clustering algorithms is that they are sensitive to the selection of the initial partitions and are prone to premature converge to local optima. Subtractive clustering is a fast, one-pass algorithm for estimating the number of clusters and cluster centers for any given set of data. The cluster estimates can be used to initialize iterative optimization-based clustering methods and model identification methods. In this paper, we present a hybrid Particle Swarm Optimization, Subtractive + (PSO) clustering algorithm that performs fast clustering. For comparison purpose, we applied the Subtractive + (PSO) clustering algorithm, PSO, and the Subtractive clustering algorithms on three different datasets. The results illustrate that the Subtractive + (PSO) clustering algorithm can generate the most compact clustering results as compared to other algorithms.
The main goal of cluster analysis is to classify elements into groupsbased on their similarity. Clustering has many applications such as astronomy, bioinformatics, bibliography, and pattern recognition. In this paper, a survey of clustering methods and techniques and identification of advantages and disadvantages of these methods are presented to give a solid background to choose the best method to extract strong association rules.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A Quantified Approach for large Dataset Compression in Association MiningIOSR Journals
Abstract: With the rapid development of computer and information technology in the last several decades, an
enormous amount of data in science and engineering will continuously be generated in massive scale; data
compression is needed to reduce the cost and storage space. Compression and discovering association rules by
identifying relationships among sets of items in a transaction database is an important problem in Data Mining.
Finding frequent itemsets is computationally the most expensive step in association rule discovery and therefore
it has attracted significant research attention. However, existing compression algorithms are not appropriate in
data mining for large data sets. In this research a new approach is describe in which the original dataset is
sorted in lexicographical order and desired number of groups are formed to generate the quantification tables.
These quantification tables are used to generate the compressed dataset, which is more efficient algorithm for
mining complete frequent itemsets from compressed dataset. The experimental results show that the proposed
algorithm performs better when comparing it with the mining merge algorithm with different supports and
execution time.
Keywords: Apriori Algorithm, mining merge Algorithm, quantification table
A Framework for Automated Association Mining Over Multiple DatabasesGurdal Ertek
Literature on association mining, the data mining methodology that investigates associations between items, has primarily focused on efficiently mining larger databases. The motivation for association mining is to use the rules obtained from historical data to influence future transactions. However, associations in transactional processes change significantly over time, implying that rules extracted for a given time interval may not be applicable for a later time interval. Hence, an
analysis framework is necessary to identify how associations change over time. This paper presents such a framework, reports the implementation of the framework as a tool, and demonstrates the applicability of and the necessity for the framework through a case study in the domain of finance.
http://research.sabanciuniv.edu.
DECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTIONcscpconf
Column-Stores has gained market share due to promising physical storage alternative for analytical queries. However, for multi-attribute queries column-stores pays performance
penalties due to on-the-fly tuple reconstruction. This paper presents an adaptive approach for reducing tuple reconstruction time. Proposed approach exploits decision tree algorithm to
cluster attributes for each projection and also eliminates frequent database scanning.Experimentations with TPC-H data shows the effectiveness of proposed approach.
This document summarizes a paper that presents a novel method for passive resource discovery in cluster grid environments. The method monitors network packet frequency from nodes' network interface cards to identify nodes with available CPU cycles (<70% utilization) by detecting latency signatures from frequent context switching. Experiments on a 50-node testbed showed the method can consistently and accurately discover available resources by analyzing existing network traffic, including traffic passed through a switch. The paper also proposes algorithms for distributed two-level resource discovery, replication and utilization to optimize resource allocation and access costs in distributed computing environments.
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment…. And many more.
The improved k means with particle swarm optimizationAlexander Decker
This document summarizes a research paper that proposes an improved K-means clustering algorithm using particle swarm optimization. It begins with an introduction to data clustering and types of clustering algorithms. It then discusses K-means clustering and some of its drawbacks. Particle swarm optimization is introduced as an optimization technique inspired by swarm behavior in nature. The proposed algorithm uses particle swarm optimization to select better initial cluster centroids for K-means clustering in order to overcome some limitations of standard K-means. The algorithm works in two phases - the first uses particle swarm optimization and the second performs K-means clustering using the outputs from the first phase.
Extended pso algorithm for improvement problems k means clustering algorithmIJMIT JOURNAL
The clustering is a without monitoring process and one of the most common data mining techniques. The
purpose of clustering is grouping similar data together in a group, so were most similar to each other in a
cluster and the difference with most other instances in the cluster are. In this paper we focus on clustering
partition k-means, due to ease of implementation and high-speed performance of large data sets, After 30
year it is still very popular among the developed clustering algorithm and then for improvement problem of
placing of k-means algorithm in local optimal, we pose extended PSO algorithm, that its name is ECPSO.
Our new algorithm is able to be cause of exit from local optimal and with high percent produce the
problem’s optimal answer. The probe of results show that mooted algorithm have better performance
regards as other clustering algorithms specially in two index, the carefulness of clustering and the quality
of clustering.
Juha vesanto esa alhoniemi 2000:clustering of the somArchiLab 7
This document summarizes a research paper that proposes clustering the self-organizing map (SOM) as a way to analyze cluster structure in data. The paper discusses:
1) Using a two-stage process where data is first mapped to prototypes using a SOM, then the SOM prototypes are clustered, reducing computational load compared to direct data clustering.
2) Different clustering algorithms that can be applied to the SOM prototypes, including hierarchical and partitive (k-means) methods.
3) The benefits of clustering the SOM include noise reduction and being able to cluster large datasets more efficiently.
A Survey on Fuzzy Association Rule Mining MethodologiesIOSR Journals
Abstract : Fuzzy association rule mining (Fuzzy ARM) uses fuzzy logic to generate interesting association
rules. These association relationships can help in decision making for the solution of a given problem. Fuzzy
ARM is a variant of classical association rule mining. Classical association rule mining uses the concept of
crisp sets. Because of this reason classical association rule mining has several drawbacks. To overcome those
drawbacks the concept of fuzzy association rule mining came. Today there is a huge number of different types of
fuzzy association rule mining algorithms are present in research works and day by day these algorithms are
getting better. But as the problem domain is also becoming more complex in nature, continuous research work
is still going on. In this paper, we have studied several well-known methodologies and algorithms for fuzzy
association rule mining. Four important methodologies are briefly discussed in this paper which will show the
recent trends and future scope of research in the field of fuzzy association rule mining.
Keywords: Knowledge discovery in databases, Data mining, Fuzzy association rule mining, Classical
association rule mining, Very large datasets, Minimum support, Cardinality, Certainty factor, Redundant rule,
Equivalence , Equivalent rules
This document discusses load balancing strategies for grid computing. It proposes a dynamic tree-based model to represent grid architecture in a hierarchical way that supports heterogeneity and scalability. It then develops a hierarchical load balancing strategy and algorithms based on neighborhood properties to decrease communication overhead. Conventional scheduling algorithms like Min-Min, Max-Min, and Sufferage are discussed but determined to ignore dynamic network status, which is important for load balancing. Genetic algorithms are also mentioned as a potential solution.
In recent machine learning community, there is a trend of constructing a linear logarithm version of
nonlinear version through the ‘kernel method’ for example kernel principal component analysis, kernel
fisher discriminant analysis, support Vector Machines (SVMs), and the current kernel clustering
algorithms. Typically, in unsupervised methods of clustering algorithms utilizing kernel method, a
nonlinear mapping is operated initially in order to map the data into a much higher space feature, and then
clustering is executed. A hitch of these kernel clustering algorithms is that the clustering prototype resides
in increased features specs of dimensions and therefore lack intuitive and clear descriptions without
utilizing added approximation of projection from the specs to the data as executed in the literature
presented. This paper aims to utilize the ‘kernel method’, a novel clustering algorithm, founded on the
conventional fuzzy clustering algorithm (FCM) is anticipated and known as kernel fuzzy c-means algorithm
(KFCM). This method embraces a novel kernel-induced metric in the space of data in order to interchange
the novel Euclidean matric norm in cluster prototype and fuzzy clustering algorithm still reside in the space
of data so that the results of clustering could be interpreted and reformulated in the spaces which are
original. This property is used for clustering incomplete data. Execution on supposed data illustrate that
KFCM has improved performance of clustering and stout as compare to other transformations of FCM for
clustering incomplete data.
Particle Swarm Optimization based K-Prototype Clustering Algorithm iosrjce
This document summarizes a research paper that proposes a new Particle Swarm Optimization (PSO) based K-Prototype clustering algorithm to cluster mixed numeric and categorical data. It begins with background information on clustering algorithms like K-Means, K-Modes, and K-Prototype. It then describes the K-Prototype algorithm, PSO, and discrete binary PSO. Related work integrating PSO with other clustering algorithms is also reviewed. The proposed approach uses binary PSO to select improved initial prototypes for K-Prototype clustering in order to obtain better clustering results than traditional K-Prototype and avoid local optima.
In the present day huge amount of data is generated in every minute and transferred frequently. Although
the data is sometimes static but most commonly it is dynamic and transactional. New data that is being
generated is getting constantly added to the old/existing data. To discover the knowledge from this
incremental data, one approach is to run the algorithm repeatedly for the modified data sets which is time
consuming. Again to analyze the datasets properly, construction of efficient classifier model is necessary.
The objective of developing such a classifier is to classify unlabeled dataset into appropriate classes. The
paper proposes a dimension reduction algorithm that can be applied in dynamic environment for
generation of reduced attribute set as dynamic reduct, and an optimization algorithm which uses the
reduct and build up the corresponding classification system. The method analyzes the new dataset, when it
becomes available, and modifies the reduct accordingly to fit the entire dataset and from the entire data
set, interesting optimal classification rule sets are generated. The concepts of discernibility relation,
attribute dependency and attribute significance of Rough Set Theory are integrated for the generation of
dynamic reduct set, and optimal classification rules are selected using PSO method, which not only
reduces the complexity but also helps to achieve higher accuracy of the decision system. The proposed
method has been applied on some benchmark dataset collected from the UCI repository and dynamic
reduct is computed, and from the reduct optimal classification rules are also generated. Experimental
result shows the efficiency of the proposed method.
This document compares hierarchical and non-hierarchical clustering algorithms. It summarizes four clustering algorithms: K-Means, K-Medoids, Farthest First Clustering (hierarchical algorithms), and DBSCAN (non-hierarchical algorithm). It describes the methodology of each algorithm and provides pseudocode. It also describes the datasets used to evaluate the performance of the algorithms and the evaluation metrics. The goal is to compare the performance of the clustering methods on different datasets.
Comparative analysis of various data stream mining procedures and various dim...Alexander Decker
This document provides a comparative analysis of various data stream mining procedures and dimension reduction techniques. It discusses 10 different data stream clustering algorithms and their working mechanisms. It also compares 6 dimension reduction techniques and their objectives. The document proposes applying a dimension reduction technique to reduce the dimensionality of a high-dimensional data stream, before clustering it using a weighted fuzzy c-means algorithm. This combined approach aims to improve clustering quality and enable better visualization of streaming data.
This document describes a new distance-based clustering algorithm (DBCA) that aims to improve upon K-means clustering. DBCA selects initial cluster centroids based on the total distance of each data point to all other points, rather than random selection. It calculates distances between all points, identifies points with maximum total distances, and sets initial centroids as the averages of groups of these maximally distant points. The algorithm is compared to K-means, hierarchical clustering, and hierarchical partitioning clustering on synthetic and real data. Experimental results show DBCA produces better quality clusters than these other algorithms.
Decision tree clustering a columnstores tuple reconstructioncsandit
Column-Stores has gained market share due to promi
sing physical storage alternative for
analytical queries. However, for multi-attribute qu
eries column-stores pays performance
penalties due to on-the-fly tuple reconstruction. T
his paper presents an adaptive approach for
reducing tuple reconstruction time. Proposed approa
ch exploits decision tree algorithm to
cluster attributes for each projection and also eli
minates frequent database scanning.
Experimentations with TPC-H data shows the effectiv
eness of proposed approach.
Text documents clustering using modified multi-verse optimizerIJECEIAES
In this study, a multi-verse optimizer (MVO) is utilised for the text document clus- tering (TDC) problem. TDC is treated as a discrete optimization problem, and an objective function based on the Euclidean distance is applied as similarity measure. TDC is tackled by the division of the documents into clusters; documents belonging to the same cluster are similar, whereas those belonging to different clusters are dissimilar. MVO, which is a recent metaheuristic optimization algorithm established for continuous optimization problems, can intelligently navigate different areas in the search space and search deeply in each area using a particular learning mechanism. The proposed algorithm is called MVOTDC, and it adopts the convergence behaviour of MVO operators to deal with discrete, rather than continuous, optimization problems. For evaluating MVOTDC, a comprehensive comparative study is conducted on six text document datasets with various numbers of documents and clusters. The quality of the final results is assessed using precision, recall, F-measure, entropy accuracy, and purity measures. Experimental results reveal that the proposed method performs competitively in comparison with state-of-the-art algorithms. Statistical analysis is also conducted and shows that MVOTDC can produce significant results in comparison with three well-established methods.
A PSO-Based Subtractive Data Clustering AlgorithmIJORCS
There is a tremendous proliferation in the amount of information available on the largest shared information source, the World Wide Web. Fast and high-quality clustering algorithms play an important role in helping users to effectively navigate, summarize, and organize the information. Recent studies have shown that partitional clustering algorithms such as the k-means algorithm are the most popular algorithms for clustering large datasets. The major problem with partitional clustering algorithms is that they are sensitive to the selection of the initial partitions and are prone to premature converge to local optima. Subtractive clustering is a fast, one-pass algorithm for estimating the number of clusters and cluster centers for any given set of data. The cluster estimates can be used to initialize iterative optimization-based clustering methods and model identification methods. In this paper, we present a hybrid Particle Swarm Optimization, Subtractive + (PSO) clustering algorithm that performs fast clustering. For comparison purpose, we applied the Subtractive + (PSO) clustering algorithm, PSO, and the Subtractive clustering algorithms on three different datasets. The results illustrate that the Subtractive + (PSO) clustering algorithm can generate the most compact clustering results as compared to other algorithms.
The main goal of cluster analysis is to classify elements into groupsbased on their similarity. Clustering has many applications such as astronomy, bioinformatics, bibliography, and pattern recognition. In this paper, a survey of clustering methods and techniques and identification of advantages and disadvantages of these methods are presented to give a solid background to choose the best method to extract strong association rules.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A Quantified Approach for large Dataset Compression in Association MiningIOSR Journals
Abstract: With the rapid development of computer and information technology in the last several decades, an
enormous amount of data in science and engineering will continuously be generated in massive scale; data
compression is needed to reduce the cost and storage space. Compression and discovering association rules by
identifying relationships among sets of items in a transaction database is an important problem in Data Mining.
Finding frequent itemsets is computationally the most expensive step in association rule discovery and therefore
it has attracted significant research attention. However, existing compression algorithms are not appropriate in
data mining for large data sets. In this research a new approach is describe in which the original dataset is
sorted in lexicographical order and desired number of groups are formed to generate the quantification tables.
These quantification tables are used to generate the compressed dataset, which is more efficient algorithm for
mining complete frequent itemsets from compressed dataset. The experimental results show that the proposed
algorithm performs better when comparing it with the mining merge algorithm with different supports and
execution time.
Keywords: Apriori Algorithm, mining merge Algorithm, quantification table
A Framework for Automated Association Mining Over Multiple DatabasesGurdal Ertek
Literature on association mining, the data mining methodology that investigates associations between items, has primarily focused on efficiently mining larger databases. The motivation for association mining is to use the rules obtained from historical data to influence future transactions. However, associations in transactional processes change significantly over time, implying that rules extracted for a given time interval may not be applicable for a later time interval. Hence, an
analysis framework is necessary to identify how associations change over time. This paper presents such a framework, reports the implementation of the framework as a tool, and demonstrates the applicability of and the necessity for the framework through a case study in the domain of finance.
http://research.sabanciuniv.edu.
DECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTIONcscpconf
Column-Stores has gained market share due to promising physical storage alternative for analytical queries. However, for multi-attribute queries column-stores pays performance
penalties due to on-the-fly tuple reconstruction. This paper presents an adaptive approach for reducing tuple reconstruction time. Proposed approach exploits decision tree algorithm to
cluster attributes for each projection and also eliminates frequent database scanning.Experimentations with TPC-H data shows the effectiveness of proposed approach.
This document summarizes a paper that presents a novel method for passive resource discovery in cluster grid environments. The method monitors network packet frequency from nodes' network interface cards to identify nodes with available CPU cycles (<70% utilization) by detecting latency signatures from frequent context switching. Experiments on a 50-node testbed showed the method can consistently and accurately discover available resources by analyzing existing network traffic, including traffic passed through a switch. The paper also proposes algorithms for distributed two-level resource discovery, replication and utilization to optimize resource allocation and access costs in distributed computing environments.
Decision Tree Clustering : A Columnstores Tuple Reconstructioncsandit
DTFCA is an approach that uses decision tree clustering to reduce tuple reconstruction time in column-stores databases. It exploits decision tree algorithms to cluster frequently accessed attributes together based on an attribute usage matrix. This clusters attributes into projections so that tuples can be reconstructed more efficiently. Experiments on TPC-H data show that DTFCA lowers tuple reconstruction time compared to traditional methods, with execution time inversely proportional to the minimum support threshold used for clustering. DTFCA provides a way to organize attributes into projections that reflects actual query access patterns and correlations.
Study of Density Based Clustering Techniques on Data StreamsIJERA Editor
Data streams are generated by many real time systems. Data stream is fast changing and massive. In stream data mining traditional methods are not efficient so that many methodologies developed to stream data processing. Many applications require data into groups based on its characteristics. So clustering on data streams is applied. Clustering of non liner data density based clustering is used. Review of clustering algorithm and methodologies is represented and evaluated if they meet requirement of users. Study of density based clustering algorithm is presented here because of advantages of density based clustering method over other clustering method.
The document discusses density-based clustering techniques for data streams. It begins by defining data streams and the challenges of clustering streaming data. It then reviews several density-based clustering algorithms for data streams, including DenStream, StreamOptics, MR-Stream, D-Stream, and HDDStream. These algorithms use concepts like micro-clustering and fading windows to cluster streaming data in an online and incremental manner while handling issues like noise and evolving clusters. The document focuses on density-based methods because they can detect clusters of arbitrary shapes and handle noise well.
The document discusses density-based clustering techniques for data streams. It begins by defining data streams and the challenges of clustering streaming data using traditional methods. It then reviews several density-based clustering algorithms designed for data streams, including DenStream, StreamOptics, MR-Stream, D-Stream, and HDDStream. These algorithms use concepts like micro-clustering and fading windows to cluster streaming data in an online and incremental manner while handling issues like noise and evolving clusters. The document focuses on density-based methods because they can detect clusters of arbitrary shapes and handle noise more effectively than other clustering approaches.
Data mining Algorithm’s Variant AnalysisIOSR Journals
This document discusses and compares three main data mining algorithms: association, classification, and clustering. It provides an overview of each algorithm, including definitions and examples of common algorithms used for each type. Association rule learning is used to find frequent patterns and relationships between variables in large datasets. Classification involves using a training dataset to predict target class labels for new or unknown data. Clustering groups a set of data points into clusters so that objects within a cluster are more similar to each other than objects in other clusters. The document then examines specific algorithms for each type, such as Apriori and FP-Growth for association rules and Naive Bayes for classification. It also discusses the use of the WEKA data mining tool to implement and evaluate
Data mining Algorithm’s Variant AnalysisIOSR Journals
This document discusses and compares three main data mining algorithms: association, classification, and clustering. Association rule learning is used to discover interesting relationships between variables in large databases. Classification involves predicting a target outcome based on given data and finding relationships between predictor and target variables. Clustering groups a set of data into meaningful subclasses called clusters where objects within a cluster are more similar to each other than objects in other clusters. The document then discusses specific algorithms for each technique, including Apriori, FP-Growth, Naive Bayes, k-Nearest Neighbors (k-NN), and DBSCAN. It provides an overview of each algorithm and compares their approaches.
This document discusses and compares three main data mining algorithms: association, classification, and clustering. Association rule learning is used to discover interesting relationships between variables in large databases. Classification involves predicting a target based on given data and finding relationships between predictor and target variables. Clustering groups a set of data into meaningful subclasses called clusters where objects within a cluster are more similar to each other than objects in other clusters. The document then discusses specific algorithms for each technique, including Apriori, FP-Growth, Naive Bayes, k-Nearest Neighbors (k-NN), and DBSCAN. It provides an overview of each algorithm and compares their approaches.
A cyber physical stream algorithm for intelligent software defined storageMade Artha
The document presents a new Cyber Physical Stream (CPS) algorithm for selecting predominant items from large data streams. The algorithm works well for item frequencies starting from 2%. It is designed for use in intelligent Software-Defined Storage systems combined with fuzzy indexing. Experiments show CPS improves accuracy and efficiency over previous algorithms. CPS is inspired by a brain model and works by incrementing a "voltage" value when items match and decrementing it otherwise, selecting the item with highest voltage. It performs well on both uniform random and Zipf's law distributed streams, with optimal parameter values depending on the distribution.
A New Data Stream Mining Algorithm for Interestingness-rich Association RulesVenu Madhav
Frequent itemset mining and association rule generation is
a challenging task in data stream. Even though, various algorithms
have been proposed to solve the issue, it has been found
out that only frequency does not decides the significance
interestingness of the mined itemset and hence the association
rules. This accelerates the algorithms to mine the association
rules based on utility i.e. proficiency of the mined rules. However,
fewer algorithms exist in the literature to deal with the utility
as most of them deals with reducing the complexity in frequent
itemset/association rules mining algorithm. Also, those few
algorithms consider only the overall utility of the association
rules and not the consistency of the rules throughout a defined
number of periods. To solve this issue, in this paper, an enhanced
association rule mining algorithm is proposed. The algorithm
introduces new weightage validation in the conventional
association rule mining algorithms to validate the utility and
its consistency in the mined association rules. The utility is
validated by the integrated calculation of the cost/price efficiency
of the itemsets and its frequency. The consistency validation
is performed at every defined number of windows using the
probability distribution function, assuming that the weights are
normally distributed. Hence, validated and the obtained rules
are frequent and utility efficient and their interestingness are
distributed throughout the entire time period. The algorithm is
implemented and the resultant rules are compared against the
rules that can be obtained from conventional mining algorithms
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
The document summarizes research on vertical fragmentation, allocation, and re-fragmentation in distributed object relational database systems. It proposes an algorithm for vertical fragmentation and allocation that considers the usage of attributes and methods by queries at different sites. The algorithm forms usage matrices, calculates affinity between methods, clusters methods, and partitions the data into fragments that are allocated to sites where they see the most demand. It also describes handling update queries by redirecting them to a server for processing and then propagating the updates to relevant fragments.
Frequent pattern mining techniques helpful to find interesting trends or patterns in
massive data. Prior domain knowledge leads to decide appropriate minimum support threshold. This
review article show different frequent pattern mining techniques based on apriori or FP-tree or user
define techniques under different computing environments like parallel, distributed or available data
mining tools, those helpful to determine interesting frequent patterns/itemsets with or without prior
domain knowledge. Proposed review article helps to develop efficient and scalable frequent pattern
mining techniques.
A SURVEY OF CLUSTERING ALGORITHMS IN ASSOCIATION RULES MININGijcsit
The main goal of cluster analysis is to classify elements into groupsbased on their similarity. Clustering
has many applications such as astronomy, bioinformatics, bibliography, and pattern recognition. In this
paper, a survey of clustering methods and techniques and identification of advantages and disadvantages
of these methods are presented to give a solid background to choose the best method to extract strong
association rules.
A SURVEY OF CLUSTERING ALGORITHMS IN ASSOCIATION RULES MININGijcsit
The main goal of cluster analysis is to classify elements into groupsbased on their similarity. Clustering
has many applications such as astronomy, bioinformatics, bibliography, and pattern recognition. In this
paper, a survey of clustering methods and techniques and identification of advantages and disadvantages
of these methods are presented to give a solid background to choose the best method to extract strong
association rules.
Mining High Utility Patterns in Large Databases using Mapreduce FrameworkIRJET Journal
This document discusses mining high utility patterns from large databases using the MapReduce framework. It proposes using the d2HUP algorithm to efficiently mine high utility patterns from partitioned big data in parallel. The algorithm traverses a reverse set enumeration tree using depth-first search to identify high utility patterns based on a minimum utility threshold, without generating candidates. It partitions the data using MapReduce and mines patterns from each partition individually. The results are then combined to obtain the final high utility patterns. The proposed approach aims to improve efficiency over existing methods that are not scalable to large datasets.
Optimization of resource allocation in computational gridsijgca
The resource allocation in Grid computing system needs to be scalable, reliable and smart. It should also be adaptable to change its allocation mechanism depending upon the environment and user’s requirements. Therefore, a scalable and optimized approach for resource allocation where the system can adapt itself to the changing environment and the fluctuating resources is essentially needed. In this paper, a Teaching Learning based optimization approach for resource allocation in Computational Grids is proposed. The proposed algorithm is found to outperform the existing ones in terms of execution time and cost. The algorithm is simulated using GRIDSIM and the simulation results are presented.
Similar to DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR DISCOVERING ASSOCIATION RULES (20)
ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR cscpconf
The progressive development of Synthetic Aperture Radar (SAR) systems diversify the exploitation of the generated images by these systems in different applications of geoscience. Detection and monitoring surface deformations, procreated by various phenomena had benefited from this evolution and had been realized by interferometry (InSAR) and differential interferometry (DInSAR) techniques. Nevertheless, spatial and temporal decorrelations of the interferometric couples used, limit strongly the precision of analysis results by these techniques. In this context, we propose, in this work, a methodological approach of surface deformation detection and analysis by differential interferograms to show the limits of this technique according to noise quality and level. The detectability model is generated from the deformation signatures, by simulating a linear fault merged to the images couples of ERS1 / ERS2 sensors acquired in a region of the Algerian south.
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFCATIONcscpconf
A novel based a trajectory-guided, concatenating approach for synthesizing high-quality image real sample renders video is proposed . The lips reading automated is seeking for modeled the closest real image sample sequence preserve in the library under the data video to the HMM predicted trajectory. The object trajectory is modeled obtained by projecting the face patterns into an KDA feature space is estimated. The approach for speaker's face identification by using synthesise the identity surface of a subject face from a small sample of patterns which sparsely each the view sphere. An KDA algorithm use to the Lip-reading image is discrimination, after that work consisted of in the low dimensional for the fundamental lip features vector is reduced by using the 2D-DCT.The mouth of the set area dimensionality is ordered by a normally reduction base on the PCA to obtain the Eigen lips approach, their proposed approach by[33]. The subjective performance results of the cost function under the automatic lips reading modeled , which wasn’t illustrate the superior performance of the
method.
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE...cscpconf
Universities offer software engineering capstone course to simulate a real world-working environment in which students can work in a team for a fixed period to deliver a quality product. The objective of the paper is to report on our experience in moving from Waterfall process to Agile process in conducting the software engineering capstone project. We present the capstone course designs for both Waterfall driven and Agile driven methodologies that highlight the structure, deliverables and assessment plans.To evaluate the improvement, we conducted a survey for two different sections taught by two different instructors to evaluate students’ experience in moving from traditional Waterfall model to Agile like process. Twentyeight students filled the survey. The survey consisted of eight multiple-choice questions and an open-ended question to collect feedback from students. The survey results show that students were able to attain hands one experience, which simulate a real world-working environment. The results also show that the Agile approach helped students to have overall better design and avoid mistakes they have made in the initial design completed in of the first phase of the capstone project. In addition, they were able to decide on their team capabilities, training needs and thus learn the required technologies earlier which is reflected on the final product quality
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIEScscpconf
This document discusses using social media technologies to promote student engagement in a software project management course. It describes the course and objectives of enhancing communication. It discusses using Facebook for 4 years, then switching to WhatsApp based on student feedback, and finally introducing Slack to enable personalized team communication. Surveys found students engaged and satisfied with all three tools, though less familiar with Slack. The conclusion is that social media promotes engagement but familiarity with the tool also impacts satisfaction.
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGICcscpconf
In real world computing environment with using a computer to answer questions has been a human dream since the beginning of the digital era, Question-answering systems are referred to as intelligent systems, that can be used to provide responses for the questions being asked by the user based on certain facts or rules stored in the knowledge base it can generate answers of questions asked in natural , and the first main idea of fuzzy logic was to working on the problem of computer understanding of natural language, so this survey paper provides an overview on what Question-Answering is and its system architecture and the possible relationship and
different with fuzzy logic, as well as the previous related research with respect to approaches that were followed. At the end, the survey provides an analytical discussion of the proposed QA models, along or combined with fuzzy logic and their main contributions and limitations.
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS cscpconf
Human beings generate different speech waveforms while speaking the same word at different times. Also, different human beings have different accents and generate significantly varying speech waveforms for the same word. There is a need to measure the distances between various words which facilitate preparation of pronunciation dictionaries. A new algorithm called Dynamic Phone Warping (DPW) is presented in this paper. It uses dynamic programming technique for global alignment and shortest distance measurements. The DPW algorithm can be used to enhance the pronunciation dictionaries of the well-known languages like English or to build pronunciation dictionaries to the less known sparse languages. The precision measurement experiments show 88.9% accuracy.
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS cscpconf
In education, the use of electronic (E) examination systems is not a novel idea, as Eexamination systems have been used to conduct objective assessments for the last few years. This research deals with randomly designed E-examinations and proposes an E-assessment system that can be used for subjective questions. This system assesses answers to subjective questions by finding a matching ratio for the keywords in instructor and student answers. The matching ratio is achieved based on semantic and document similarity. The assessment system is composed of four modules: preprocessing, keyword expansion, matching, and grading. A survey and case study were used in the research design to validate the proposed system. The examination assessment system will help instructors to save time, costs, and resources, while increasing efficiency and improving the productivity of exam setting and assessments.
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTICcscpconf
African Buffalo Optimization (ABO) is one of the most recent swarms intelligence based metaheuristics. ABO algorithm is inspired by the buffalo’s behavior and lifestyle. Unfortunately, the standard ABO algorithm is proposed only for continuous optimization problems. In this paper, the authors propose two discrete binary ABO algorithms to deal with binary optimization problems. In the first version (called SBABO) they use the sigmoid function and probability model to generate binary solutions. In the second version (called LBABO) they use some logical operator to operate the binary solutions. Computational results on two knapsack problems (KP and MKP) instances show the effectiveness of the proposed algorithm and their ability to achieve good and promising solutions.
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAINcscpconf
In recent years, many malware writers have relied on Dynamic Domain Name Services (DDNS) to maintain their Command and Control (C&C) network infrastructure to ensure a persistence presence on a compromised host. Amongst the various DDNS techniques, Domain Generation Algorithm (DGA) is often perceived as the most difficult to detect using traditional methods. This paper presents an approach for detecting DGA using frequency analysis of the character distribution and the weighted scores of the domain names. The approach’s feasibility is demonstrated using a range of legitimate domains and a number of malicious algorithmicallygenerated domain names. Findings from this study show that domain names made up of English characters “a-z” achieving a weighted score of < 45 are often associated with DGA. When a weighted score of < 45 is applied to the Alexa one million list of domain names, only 15% of the domain names were treated as non-human generated.
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...cscpconf
The document proposes a blockchain-based digital currency and streaming platform called GoMAA to address issues of piracy in the online music streaming industry. Key points:
- GoMAA would use a digital token on the iMediaStreams blockchain to enable secure dissemination and tracking of streamed content. Content owners could control access and track consumption of released content.
- Original media files would be converted to a Secure Portable Streaming (SPS) format, embedding watermarks and smart contract data to indicate ownership and enable validation on the blockchain.
- A browser plugin would provide wallets for fans to collect GoMAA tokens as rewards for consuming content, incentivizing participation and addressing royalty discrepancies by recording
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEMcscpconf
This document discusses the importance of verb suffix mapping in discourse translation from English to Telugu. It explains that after anaphora resolution, the verbs must be changed to agree with the gender, number, and person features of the subject or anaphoric pronoun. Verbs in Telugu inflect based on these features, while verbs in English only inflect based on number and person. Several examples are provided that demonstrate how the Telugu verb changes based on whether the subject or pronoun is masculine, feminine, neuter, singular or plural. Proper verb suffix mapping is essential for generating natural and coherent translations while preserving the context and meaning of the original discourse.
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...cscpconf
In this paper, based on the definition of conformable fractional derivative, the functional
variable method (FVM) is proposed to seek the exact traveling wave solutions of two higherdimensional
space-time fractional KdV-type equations in mathematical physics, namely the
(3+1)-dimensional space–time fractional Zakharov-Kuznetsov (ZK) equation and the (2+1)-
dimensional space–time fractional Generalized Zakharov-Kuznetsov-Benjamin-Bona-Mahony
(GZK-BBM) equation. Some new solutions are procured and depicted. These solutions, which
contain kink-shaped, singular kink, bell-shaped soliton, singular soliton and periodic wave
solutions, have many potential applications in mathematical physics and engineering. The
simplicity and reliability of the proposed method is verified.
AUTOMATED PENETRATION TESTING: AN OVERVIEWcscpconf
The document discusses automated penetration testing and provides an overview. It compares manual and automated penetration testing, noting that automated testing allows for faster, more standardized and repeatable tests but has limitations in developing new exploits. It also reviews some current automated penetration testing methodologies and tools, including those using HTTP/TCP/IP attacks, linking common scanning tools, a Python-based tool targeting databases, and one using POMDPs for multi-step penetration test planning under uncertainty. The document concludes that automated testing is more efficient than manual for known vulnerabilities but cannot replace manual testing for discovering new exploits.
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORKcscpconf
Since the mid of 1990s, functional connectivity study using fMRI (fcMRI) has drawn increasing
attention of neuroscientists and computer scientists, since it opens a new window to explore
functional network of human brain with relatively high resolution. BOLD technique provides
almost accurate state of brain. Past researches prove that neuro diseases damage the brain
network interaction, protein- protein interaction and gene-gene interaction. A number of
neurological research paper also analyse the relationship among damaged part. By
computational method especially machine learning technique we can show such classifications.
In this paper we used OASIS fMRI dataset affected with Alzheimer’s disease and normal
patient’s dataset. After proper processing the fMRI data we use the processed data to form
classifier models using SVM (Support Vector Machine), KNN (K- nearest neighbour) & Naïve
Bayes. We also compare the accuracy of our proposed method with existing methods. In future,
we will other combinations of methods for better accuracy.
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...cscpconf
The document proposes a new validation method for fuzzy association rules based on three steps: (1) applying the EFAR-PN algorithm to extract a generic base of non-redundant fuzzy association rules using fuzzy formal concept analysis, (2) categorizing the extracted rules into groups, and (3) evaluating the relevance of the rules using structural equation modeling, specifically partial least squares. The method aims to address issues with existing fuzzy association rule extraction algorithms such as large numbers of extracted rules, redundancy, and difficulties with manual validation.
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATAcscpconf
In many applications of data mining, class imbalance is noticed when examples in one class are
overrepresented. Traditional classifiers result in poor accuracy of the minority class due to the
class imbalance. Further, the presence of within class imbalance where classes are composed of
multiple sub-concepts with different number of examples also affect the performance of
classifier. In this paper, we propose an oversampling technique that handles between class and
within class imbalance simultaneously and also takes into consideration the generalization
ability in data space. The proposed method is based on two steps- performing Model Based
Clustering with respect to classes to identify the sub-concepts; and then computing the
separating hyperplane based on equal posterior probability between the classes. The proposed
method is tested on 10 publicly available data sets and the result shows that the proposed
method is statistically superior to other existing oversampling methods.
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCHcscpconf
Data collection is an essential, but manpower intensive procedure in ecological research. An
algorithm was developed by the author which incorporated two important computer vision
techniques to automate data cataloging for butterfly measurements. Optical Character
Recognition is used for character recognition and Contour Detection is used for imageprocessing.
Proper pre-processing is first done on the images to improve accuracy. Although
there are limitations to Tesseract’s detection of certain fonts, overall, it can successfully identify
words of basic fonts. Contour detection is an advanced technique that can be utilized to
measure an image. Shapes and mathematical calculations are crucial in determining the precise
location of the points on which to draw the body and forewing lines of the butterfly. Overall,
92% accuracy were achieved by the program for the set of butterflies measured.
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...cscpconf
Smart cities utilize Internet of Things (IoT) devices and sensors to enhance the quality of the city
services including energy, transportation, health, and much more. They generate massive
volumes of structured and unstructured data on a daily basis. Also, social networks, such as
Twitter, Facebook, and Google+, are becoming a new source of real-time information in smart
cities. Social network users are acting as social sensors. These datasets so large and complex
are difficult to manage with conventional data management tools and methods. To become
valuable, this massive amount of data, known as 'big data,' needs to be processed and
comprehended to hold the promise of supporting a broad range of urban and smart cities
functions, including among others transportation, water, and energy consumption, pollution
surveillance, and smart city governance. In this work, we investigate how social media analytics
help to analyze smart city data collected from various social media sources, such as Twitter and
Facebook, to detect various events taking place in a smart city and identify the importance of
events and concerns of citizens regarding some events. A case scenario analyses the opinions of
users concerning the traffic in three largest cities in the UAE
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGEcscpconf
The anonymity of social networks makes it attractive for hate speech to mask their criminal
activities online posing a challenge to the world and in particular Ethiopia. With this everincreasing
volume of social media data, hate speech identification becomes a challenge in
aggravating conflict between citizens of nations. The high rate of production, has become
difficult to collect, store and analyze such big data using traditional detection methods. This
paper proposed the application of apache spark in hate speech detection to reduce the
challenges. Authors developed an apache spark based model to classify Amharic Facebook
posts and comments into hate and not hate. Authors employed Random forest and Naïve Bayes
for learning and Word2Vec and TF-IDF for feature selection. Tested by 10-fold crossvalidation,
the model based on word2vec embedding performed best with 79.83%accuracy. The
proposed method achieve a promising result with unique feature of spark for big data.
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXTcscpconf
This article presents Part of Speech tagging for Nepali text using General Regression Neural
Network (GRNN). The corpus is divided into two parts viz. training and testing. The network is
trained and validated on both training and testing data. It is observed that 96.13% words are
correctly being tagged on training set whereas 74.38% words are tagged correctly on testing
data set using GRNN. The result is compared with the traditional Viterbi algorithm based on
Hidden Markov Model. Viterbi algorithm yields 97.2% and 40% classification accuracies on
training and testing data sets respectively. GRNN based POS Tagger is more consistent than the
traditional Viterbi decoding technique.
This presentation includes basic of PCOS their pathology and treatment and also Ayurveda correlation of PCOS and Ayurvedic line of treatment mentioned in classics.
How to Manage Your Lost Opportunities in Odoo 17 CRMCeline George
Odoo 17 CRM allows us to track why we lose sales opportunities with "Lost Reasons." This helps analyze our sales process and identify areas for improvement. Here's how to configure lost reasons in Odoo 17 CRM
Thinking of getting a dog? Be aware that breeds like Pit Bulls, Rottweilers, and German Shepherds can be loyal and dangerous. Proper training and socialization are crucial to preventing aggressive behaviors. Ensure safety by understanding their needs and always supervising interactions. Stay safe, and enjoy your furry friends!
A Strategic Approach: GenAI in EducationPeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Introduction to AI for Nonprofits with Tapp NetworkTechSoup
Dive into the world of AI! Experts Jon Hill and Tareq Monaur will guide you through AI's role in enhancing nonprofit websites and basic marketing strategies, making it easy to understand and apply.
Strategies for Effective Upskilling is a presentation by Chinwendu Peace in a Your Skill Boost Masterclass organisation by the Excellence Foundation for South Sudan on 08th and 09th June 2024 from 1 PM to 3 PM on each day.
How to Add Chatter in the odoo 17 ERP ModuleCeline George
In Odoo, the chatter is like a chat tool that helps you work together on records. You can leave notes and track things, making it easier to talk with your team and partners. Inside chatter, all communication history, activity, and changes will be displayed.
Biological screening of herbal drugs: Introduction and Need for
Phyto-Pharmacological Screening, New Strategies for evaluating
Natural Products, In vitro evaluation techniques for Antioxidants, Antimicrobial and Anticancer drugs. In vivo evaluation techniques
for Anti-inflammatory, Antiulcer, Anticancer, Wound healing, Antidiabetic, Hepatoprotective, Cardio protective, Diuretics and
Antifertility, Toxicity studies as per OECD guidelines
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Dr. Vinod Kumar Kanvaria
Exploiting Artificial Intelligence for Empowering Researchers and Faculty,
International FDP on Fundamentals of Research in Social Sciences
at Integral University, Lucknow, 06.06.2024
By Dr. Vinod Kumar Kanvaria
2. 150 Computer Science & Information Technology (CS & IT)
mining tool deals only with the database as a whole, the meaningful at smaller granularities will
be lost.
The goal of this research is to provide an approach with novel data structure and efficient. The
crucial issue is to explore more efficient and accurate multidimensional mining of association
patterns on different granularities in a flexible and robust manner.
2. RELATED WORKS AND TERMINOLOGIES
2.1. Frequent and Infrequent Rules
Records in a transactional database contain simple items identified by Transaction IDs using
conventional methods. The notion of association is applied to capture the co-occurrence of items
in transactions. There are two important factors for association rules: support and confidence.
Support means how often the rule applies while confidence refers to how often the rule is true.
We are likely to find association rules with high confidence and support. Some data mining
approaches allow users to set minimum support/confidence as the threshold for mining [6, 10].
Efficient algorithms for finding infrequent rules are also in development.
2.2. Multidimensional Data Mining
Finding association rules involving various attributes efficiently is an important subject for data
mining. Association Rule Clustering System (ARCS) was proposed in [9], where association
clustering is proposed for a 2-dimensional space. The restriction of ARCS is that only one rule is
generated in each scan. Hence, it takes massive redundant scans to find all rules.
The method proposed in [9] mines all large itemsets at first and then use a relationship graph to
assign attribute according to user given priorities of each attribute. Since the method is meant to
discover large itemsets over a database as the whole, infrequent rules that hold in smaller
granularities will be lost. Different priorities of the condition attributes will induce different rules
so that user may need to try with all possible priorities to discover all rules.
2.3. Apriori Algorithm
The Apriori algorithm is a level wise iterative search algorithm for mining frequent itemsets with
regards to association rules [1,3-5,13]. The weakness of Apriori algorithm s that it requires k
passes of database scans when the cardinality of the longest frequent itemsets is k. the algorithm
is also computation intensive in generating the candidate itemsets and computing of support
values. If the number of first itemsets element is k, the database will be scanned k times at least.
Hence it is not efficient enough.
A variant of Apriori algorithm is the AprioriTID [2]. The AprioriTID reduces the time required
for frequency counting by replacing every transaction in the database by a set of candidate sets
that occurs in that transaction [2]. Although the AprioriTID algorithm is much faster in later
iterations, it is much slower in early iterations as compared to the original Apriori algorithm.
Another drawback of AprioriTID is inefficient use of space; the database modified by Apriori-gen
can be much larger than the initial database.
2.4. Concept Description and Taxonomy
The issue of data structure and descriptive model for mining is less discussed when comparing
works on algorithms. The concept description task is problematic since the term concept
description is used in different ways. In this situation, researchers argue for a de facto standard
definition for the term [10, 11]. At this stage it is easier to deal with common criterion on higher
abstraction level for concept description such as comprehension [11] and compatibility [6].
3. Computer Science & Information Technology (CS & IT) 151
Han & Kamber view concept description as a form of data generalization and define concept
description as a task that generates descriptions for the characterization and comparison of the
data [11].
Ontology provides a vocabulary for specific domains and defines the meaning of the terms and
relationships between them in practical situations. The term Taxonomy is used in this paper as it
is more flexible and can cover cases with no semantic meaning.
3. THE MULTIDIMENSIONAL MINING APPROACH
3.1. Representation Schema and Data Structure
Figure 1 Forest of Concept Taxonomy
For the sakes of comprehension and compatibility, we use the forest structure consisting of
Concept-Taxonomies to represent the overall searching space, i.e. the set of all propositions of the
concepts. On top of this structure, the sets of association patterns can be formed by selecting
concepts from individual taxonomies. The notions can be clarified with examples as follows:
3.1.1. Taxonomy: A category consists of domain concepts in a latticed hierarchical structure,
while each member per se can be in turn a taxonomy. An example for customer’s characteristics
can be [Age, Sex, Occupation…], while the taxonomy of occupation can be [manager, labor,
teacher, engineer…].
3.1.2. Forest of concept taxonomies: A hyper-graph for representing universe of discourse or the
closed-world of interests built with domain taxonomies with regards to location and Sex of
customers is shown in Figure 1.
3.1.3. Association Rule: a pattern consisting of elements taken from various concept taxonomies
such as [(location=web),(Sex=female),(Goods=milk)].
By multidimensional data mining of association rules, the notion of relation often refers to the
belonging relationship between elementary patterns and generalized patterns rather than
semantics [6]. Other notations will be used in this paper are shown in Table 1.
TABLE 1: CONCEPTS AND NOTATIONS
Notation Definition
TID Transaction identifier
MD Multidimensional database
CT Concept Taxonomy
Ei The i-th element segment
T[Ei] An element segment over Ei in MD
4. 152 Computer Science & Information Technology (CS & IT)
Notation Definition
Gi The j-th generalized pattern
T[Gj] The j-th combined segment over Gj
REi
Rules with regards to the i-th element
segment
RGj
Rules with regards to the j-th generalized
pattern
(Gj,r)
Association rules over Gj with regards to
to match ratio r
m Ratio for a relax match, given by the user.
3.2. The Data Mining Algorithm
Outline of the proposed algorithm is shown in Figure 2. The input of the mining process involves
5 entities: 1) a multidimensional transaction database MD (optional when a default MD is
assigned), 2) a set of concept taxonomies for each dimension (CT), 3) a minimal support viz.
minSup, 4) a minimal confidence, viz. minConf, and 5) a match ration m for the relaxed match.
The output of the algorithm is the multidimensional association with regards to a full/relaxed
match in the MD. The mining process can be characterized in two independent steps: 1) finding
all item sets in each element segment and 2) updating all combinations of segments by using the
output of step 1. For practical reasons, step 1 of the algorithm can be replaced by similar
algorithms such as Apriori. This segregation of the two steps enables flexible mining and ease of
use in distributed environments.
Figure 2 Outline of algorithm
The task of the algorithm is to discover all association rules REi in the element segment T[Ei] for
each element pattern Ei. It then uses REi to update RGj, i.e the set of association rules for every
generalized pattern Gj which covers Ei. The task done by each element pattern is to find large
itemsets in itself and acknowledge its super generalized patterns with the association rules. The
task don by each generalized pattern is o decide which rules hold within it according to the
acknowledgements from the element patterns. The mining procedure needs only to work on each
element segment and uses the output from each segment to determine which rules hold in the
1) Input:
2) Multidimensional Transaction Database MD
3) Concept taxonomies for each dimension: CTx(X= 1-n)
4) User given threshold: minSup, minConf, match ratio m, ;
5) Procedure:
6) Phase0:
7) to generate all Ei and Gj by CTx (x = 1 to n);
8) build the pattern table;
9) Phase1:
10) For all Ei ⊂ G
11) to discover all association rules r in T[Ei] as REi
12) Phase2:
13) for all Ei
14) for all Gjthat Ei ⊂ Gj
15) to update RGj using REi;
16) Phase3:
17) for all Gj
18) For all r (which satisfy m) in RGj
19) output (Gj, r);
20) Output:
21) all multidimensional association rules(p, r)
,
,
5. Computer Science & Information Technology (CS & IT) 153
combined segments. Thus, there is no need to scan all of the possible segments for finding the
rules.
3.3. Generating all patterns and the pattern table
The procedure generates all elementary and generalized patterns with the given forest, where a
pattern table for recording the belonging relationship between the elementary and generalized
patterns is built. Given a set of concept taxonomies, a multi-dimensional pattern can be generated
by choosing a node from each concept taxonomy. The combination of different choices represents
all the represents all the multidimensional patterns. For example, Figure 3 presents a situation of
12 patterns.
Figure 3 Belonging relationships between patterns
TABLE II. THE PATTERN TABLE FOR RELATIONS SHOWN IN FIGURE 3
Mar Apr May Spring,
Male
Spring,
Female
Spring
Male 1 0 0 1 0 1
Female 1 0 0 0 1 1
Male 0 1 0 1 0 1
Female 0 1 0 0 1 1
Male 0 0 1 1 0 1
Female 0 0 1 0 1 1
Figure 3 also shows the belonging relationship between patterns in a lattice structure. The
relationships are recorded in the form of bit map which includes element patterns and generalized
patterns. In the table, a “1” indicates that the element pattern belongs to the corresponding
generalized pattern and “0” indicates case vice versa. Table II shows a bit map which stores such
relationships.
3.4. Update Process
Figure 4 The “update” algorithm
After all patterns have been generated and the pattern table has been built, the procedure begins to
read the transactions with regards to each element segment to discover all association patterns.
Besides our algorithm, the Apriori algorithm can also be used in this phase. The output of this
1) for all REi
2) for all Gj Ei
3) if (RGj never be updated)
4) RGj = REi;
5) else
6) RGj = RGj REi;
⊃
I
1) for all REi
2) for all Gj Ei
3) if (RGj never be updated)
4) RGj = REi;
5) else
6) RGj = RGj REi;
⊃
I
Spring, Female
Spring (Any)
Spring, MaleAprMar May
Mar, Male Mar, Female Apr, Male Apr, Female May, Male May, Female
Spring, Female
Spring (Any)
Spring, MaleAprMar May
Mar, Male Mar, Female Apr, Male Apr, Female May, Male May, Female
6. 154 Computer Science & Information Technology (CS & IT)
phase is all of REi for each element patter Ei. It will be fed as the input to the next phase for
updating each RGj using REi. Figure 4 shows the outline of the update procedure with a full
match.
3.5. The output function
For a full match, the mining process outputs all (Gj, r) pairs for every r left in each RGj. For a
relaxed match, it outputs all (Gj,r) pair for every r in each RGj where the count exceeds |mT[Gj]|
by a relax match. By means of algorithm described above, loss of finding the rules that only hold
in some segments can be prevented and pickup of multidimensional association rules that do not
holder over all the range of the domain can be avoided.
4. THE EXPERIMENT AND EVALUATION
4.1. Experiment scenario of a wholesale case
To measure and prove the performance of the method, a scenario for a wholesales business using
synthetic data are established for the test. The wholesales enterprise has various business
branches and a web-site for its business operations.
Data from four branches and the website are gathered for the experiment. We take five of the
various attributes (Abode, Sex, Occupation, Age and Marriage) as the dimensions for the test.
Adding with the product catalog and price/profit record, there are 7 dimensions and we build the
concept taxonomies for each dimension.
To examine the effect of different customer behaviors, we generate three data types as illustrated
in Table III. The parameters and the default values of the data sets are shown in Table IV. There
are 118 multidimensional patterns from these taxonomies, 44 of them are element patterns and the
other 74 of them are generalized ones. The mining tool should find all large item sets for the 74
generalized patterns.
TABLE III. THREE TYPES OF DATA SETS
Type 1 To generate a single set of maximal potentially large itemsets and then generate
transactions for each element pattern Ei following apriori-gen.[4]
Type 2 Besides a set of common maximal potential large itemsets, to genreate maximal
potentially large itemsets for each element pattern Ei, and then genreate
transactions for each element pattern Ei. The common maximal potentially large
itemsets respectively following the apriori-gen[4].
Type 3 Generating a set of maximal potentially large itemsets for each element pattern Ei,
and generating transactions for each element pattern Ei from its own maximal
potentially large items following the apriori-gen.[4]
TABLE IV. PARAMETERS AND DEFAULT VALUES OF DATA SETS
Notation Meaning Default
|D| Number of transactions 100K
|T| Average size of transactions 6
|I| Average size of maximal potentially large itemsets 4
|L| NBumber of maximal potentially large itemsets 1000
N Number of items 1000
SM The maximum size of segmentation 50
7. Computer Science & Information Technology (CS & IT) 155
4.2. Experiment Results
At first, the 74 generalized patterns are successfully found. The key feature of the algorithm as
shown in Figure 5 is that it is linear (and hence highly scalable) to the number of records and that
it is flexible in terms of reading various data types. The test result w.r.t scalability in Figure 5
shows that the algorithm takes execution time linear to the number of transactions of all three data
types. The experiment results of both the test (see Figure 5 and 6) shows that our algorithm is
superior to conventional methods in several areas
Figure 5 Execution Time w.r.t to the no. of transactions
Figure 6 Scalability test w.r.t. the no. of records
Execution time with regards to number of transactions is linear for the data types tested for the
whole process. This means that the time and space cost of executing our algorithm do not increase
exponentially as compared to conventional methods.
Phase 2 ( the update phase ) of our algorithm is an important space and time saver as shown by
the Figure 8; execution time is also linear and time taken to read up to 2000k records took less
than 5 seconds. This means that data patterns from new data can be quickly extracted and used to
update the existing pattern table for immediate use. result in Figure 5 shows that the algorithm
takes linear time to the number of transactions.
8. 156 Computer Science & Information Technology (CS & IT)
Figure 7 Efficiency in relation to the number of element patterns
In general, an increase of element patterns with result in an increase in execution time; the key to
scalability is having the execution time increasing in a linear manner with an increase in element
patterns. In Figure 7, all three data types experienced an increase of execution time with an
increase of element pattern in a linear fashion, thus making our algorithm efficient.
Most importantly, an increase in element patterns leads to a less than proportion increase in
execution time, making out the algorithm highly scalable. Reading off Figure 7, a 4 time increase
of 30 element patterns from 10 to 40 will result in:
1) 75 times increase in execution time for data type 1 from 20 seconds to 35 seconds
2) 67 times increase in execution time for data type 2 from 15 seconds to 25 seconds.
3) 05 times increase in execution time for data type 3 from approximately 22 seconds to 45
seconds.
The test results shows that for increasing the number of element patterns will not decrease the
efficiency of the algorithm. It fulfills the requirement of scalability in terms of number of element
patterns as well.
After we have understood the execution effectiveness and flexibility of the algorithm, the next
step is evaluate the impact of various user inputs on the algorithm. As described earlier in this
paper, the main user inputs are minimum support minSup, minimum confidence minConf, and
match ratio m. The impact of user input on the algorithm is shown in Figure 8, 9, 10 and 11
respectively.
4.3. Impact of User Input on the Algorithm
The impact of minSup on the algorithm can be categorized in terms of efficiency, discrete ratio
and lost ratio. All of such algorithms are sensitive to the minimum support; the smaller the
minimum support, the longer the execution time. However, we have shown that the real execution
time of the step 2 (the update) in the proposed algorithm is relatively much shorter than the whole
process (see Figure 5).
The test results proved that an increase in minSup will lead to greater returns of investment in
terms of time efficiency; this is in line with one of the core objectives of building an efficient
algorithm. Our algorithm is more efficient than conventional methods in terms of execution time
over data. For instance in Figure 8, a 10 time increase (from 0.1 to 1) in minSup leads to a more
than proportionate decrease in execution time across all data types:
9. Computer Science & Information Technology (CS & IT) 157
4) Execution time for data type 1 decreased by approximately 10 times, from approximately 400
seconds to approximately 40 seconds in terms of execution time.
5) Execution time for data type 2 decreased by more than 30 times, from more than 600 seconds
to approximately 20 seconds in terms of execution time.
6) Execution time for data type 3 decreased by more than 11 times, from approximately 350
seconds to approximately 30 seconds in terms of execution time.
The test results proved that an increase in minSup will lead to greater returns of investment in
terms of time efficiency; this is in line with one of the core objectives of building an efficient
algorithm.
Fig 8 Efficiency in Relation to Minimum Support
The discrete ratio is the ratio of the number of rules pruned by the improved algorithm to the
number of rules discovered by prior mining approaches. Figure 10 shows the ratio of rules pruned
by the improved algorithm against minSup.
In general, all three data types (except for data type 1) exhibited an increase of ratio with an
increase of minSup from approximately 0.2% to 2%.
The test results point the fact that the proposed algorithm can effectively decrease unwanted
generalized patterns in which elemental data patterns is not true. This greatly helps users to focus
on data patterns that are useful for their organizations while uncovering niche data patterns. For
instance with a higher setting value, only <Female, Age over 60, take AP transfusions> will be
found instead of <Age over 60, take AP transfusions>.
Figure 9 Effects of minSup on Discrete Large Itemsets Ratio
10. 158 Computer Science & Information Technology (CS & IT)
Figure 10 shows the test result on lost ratio, i.e. the influence of minSup values on the lost rules
by other mining tools in comparison to this approach. All three data types experienced an increase
in lost ratio over an increase in minSup from 0.25% to 2%, with the greatest increase in data type
2, followed by data type 3 and finally data type 1.
Figure 10 Effects of minSup on Lost Itemsets
Figure 11 Effects of match ratio on discrete large item sets ratio
The test results prove that the proposed algorithm will help users uncover useful data patterns
which otherwise would be uncovered by traditional approaches.Thus, our objective of uncovering
niche data patterns that would otherwise be left out is met and proved by this test result.
Increasing the match ratio would reduce unwanted data patterns in general. Figure 11 shows the
effect of match ratio (r) on discrete ratio.
Similar to the above test results, an increase of m from 0.5 to 1 results in a more than proportional
increase in discrete ratio across all three forms of data types. The significance of this test result is
congruent with the test results above; the algorithm is efficient and scalable without losing
flexibility and helps uncover niche data patterns.
5. SUMMARY
The article proposes an approach which includes a novel data structure and an efficient algorithm
for mining association rules on various granularities. The advantages of this approach over
existing ones include 1) greater comprehensiveness and easy of use. 2) More efficient with
limited scans and storage I/O operations. 3) More effective in terms of finding rules that hold in
11. Computer Science & Information Technology (CS & IT) 159
different granularity levels. 4) Association patterns can be stored so that incremental mining is
possible. 5) Information loss rate is very low.
The whole process of the development and experimental measurement of the multidimensional
data mining approach was discussed in this paper. The test result shows that its performance,
efficiency, scalability and information loss rate are better than the current approaches. The effects
of perceived issues and potential development of data mining and concept description are worthy
of further investigation.
REFERENCES
Article in a journal:
[1] Agrawal,R and Shafer, J.C, “Parallel mining of association rules”, IEEE Transactions on Knowledge
and Data Engineering, 8(6) (1996) 962-969.
[2] He, L-J., Chen, L. C. and Liu, S.-Y., “Improvement of aprioriTID algorithm for mining association
rules”, Journal of Yangtai University, Vol. 16, No. 4, 2003
Article in conference proceedings:
[3] Srikant, R. and Agrawal, R, “Mining generalized association rules, Proceedings of the 21th VLDB
Conference, Zurich, Swizerland, 1995
[4] Agrawal, R. and Srikant, R, “Fast algorithms for mining association rules in large databases”, in
Proceedings of the ’94 International Conference on Very Large Data Bases, 1994, pp. 487–499.
[5] Agrawal, R., Imielinski, T. and Swami, A.N, “Mining association rules between sets of items in large
databases”, in Proceedings of the ACM-SIGMOD 1993 International Conference on Management of
Data, 1993, pp. 207–216.
[6] Chiang, J. K. and Wu, J.C, “Mining multi-dimension rules in multiple database segmentation-on
examples of cross-selling”, 16th ICIM Conference, 2005, Taipei, Taiwan.
[7] Lent, B, Swami A. and Widon, J, “Clustering association rules”, in:13th International Conference on
Data Engineering
[8] Liu, B. Hsu, W. and Ma, Y. “Mining association rules with multiple minumum supports”, in ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-99)
[9] Tasi, S. M. Pauray and Chen, C-M, “Mining interesting association rules from customer databases
and transaction databases”.
[10] The CRISP-DM Consortium, CRISP-DM 1.0 www.crisp-dm.org
Textbook reference:
[11] Han, J. and Kamber, M. “Data mining - concepts and techniques 2nd ed, Morgan Kaufman, 2006.
[12] Li, M. and Baker, M. : “The GRID – core technologies”, Willy 2005.
[13] Feldman, R, and Sanger, J. “The text mining handbook – advanced approaches in analyzing
unstructured data”, Cambridge University Press, 2007
Author
Prof. Dr.-Ing. Johannes K. Chiang is now a faculty member on the Department of MIS
and Deputy Director of the Center for Cloud Computing at National Chengchi
University Taipei. He received the academic degree of Doctor in Engineering Science
(Dr.-Ing., Summa Cum laude) from the RWTH University of Aachen, Germany. His
current research interests include “Data and Semantic GRID”, “Business Intelligence
and Data Mining”, e-Business and ebXML, Business Data Communication. He also
serves as a consultant for the government agencies in Taiwan and as an active member
of various international affiliations, such as IEEE, ACM etc. before 1995, he has been
a research fellow at RWTH of Aachen and Manager of EU/CEC projects.