The document discusses finding the optimal parameter K for the emerging pattern-based classifier PCL. PCL selects the top-K emerging patterns from each class that match a test instance and uses these patterns to predict the class label. The authors develop an algorithm to determine the best value of K for PCL on different datasets, as K's optimal value varies. They sample subsets of training data to find the K that performs best on average, maximizing the likelihood of good performance on the whole dataset. Experimental results show that finding the best K improves PCL's accuracy and that using incremental frequent pattern maintenance techniques speeds up the algorithm significantly.
Patterns that only occur in objects belonging to a
single class are called Jumping Emerging Patterns (JEP). JEP
based Classifiers are considered one of the successful classification
systems. Due to its comprehensibility, simplicity and strong
differentiating abilities JEPs have captured significant recognition.
However, discovery of JEPs in a large pattern space is normally a
time consuming and challenging task because of their exponential
behaviour. In this work a novel method based on genetic
algorithm (GA) is proposed to discover JEPs in large pattern
space. Since the complexity of GA is lower than other algorithms,
so we have combined the power of JEPs and GA to find high
quality JEPs from datasets to improve performance of
classification system. Our proposed method explores a set of high
quality JEPs from pattern search space unlike other methods in
literature that compute complete set of JEPs, Large numbers of
duplicate and redundant JEPs are filtered out during their
discovery process. Experimental results show that our proposed
Genetic-JEPs are effective and accurate for classification of a
variety of data sets and in general achieve higher accuracy than
other standard classifiers.
Genetic Programming for Generating Prototypes in Classification ProblemsTarundeep Dhot
This document summarizes a research paper that proposes using genetic programming to generate prototypes for classification problems. It describes encoding prototypes as variable-length logical expressions and using a multi-tree representation. The approach evolves a population of these prototypes using genetic operators like crossover and mutation. Classification involves matching samples to prototypes, and fitness is based on recognition rate on a training set. The method was tested on several datasets and achieved higher recognition rates than an alternative genetic programming approach.
A Review on Feature Selection Methods For Classification TasksEditor IJCATR
In recent years, application of feature selection methods in medical datasets has greatly increased. The challenging task in
feature selection is how to obtain an optimal subset of relevant and non redundant features which will give an optimal solution without
increasing the complexity of the modeling task. Thus, there is a need to make practitioners aware of feature selection methods that have
been successfully applied in medical data sets and highlight future trends in this area. The findings indicate that most existing feature
selection methods depend on univariate ranking that does not take into account interactions between variables, overlook stability of the
selection algorithms and the methods that produce good accuracy employ more number of features. However, developing a universal
method that achieves the best classification accuracy with fewer features is still an open research area.
This document discusses feature selection concepts and methods. It defines features as attributes that determine which class an instance belongs to. Feature selection aims to select a relevant subset of features by removing irrelevant, redundant and unnecessary data. This improves learning accuracy, model performance and interpretability. The document categorizes feature selection algorithms as filter, wrapper or embedded methods based on how they evaluate feature subsets. It also discusses concepts like feature relevance, search strategies, successor generation and evaluation measures used in feature selection algorithms.
The document proposes a novel hybrid method called PCA-BEL for classifying gene expression microarray data. PCA-BEL uses principal component analysis (PCA) for feature extraction followed by classification using a Brain Emotional Learning (BEL) network. PCA reduces the dimensionality of the microarray data to overcome the high dimensionality problem. BEL is then used for classification due its low computational complexity making it suitable for high dimensional data. The method is tested on several cancer gene expression datasets and achieves average accuracies of 100%, 96%, 98.32%, 87.40% and 88% on five datasets respectively, demonstrating its effectiveness for microarray classification tasks.
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...ijcsit
Data mining is indispensable for business organizations for extracting useful information from the huge volume of stored data which can be used in managerial decision making to survive in the competition. Due to the day-to-day advancements in information and communication technology, these data collected from ecommerce and e-governance are mostly high dimensional. Data mining prefers small datasets than high dimensional datasets. Feature selection is an important dimensionality reduction technique. The subsets selected in subsequent iterations by feature selection should be same or similar even in case of small perturbations of the dataset and is called as selection stability. It is recently becomes important topic of research community. The selection stability has been measured by various measures. This paper analyses the selection of the suitable search method and stability measure for the feature selection algorithms and also the influence of the characteristics of the dataset as the choice of the best approach is highly problem dependent.
A Modified KS-test for Feature SelectionIOSR Journals
This document proposes a modified Kolmogorov-Smirnov (KS) test-based feature selection algorithm. It begins with an overview of feature selection and its benefits. It then discusses two common feature selection approaches: filter and wrapper models. The document proposes a fast redundancy removal filter based on a modified KS statistic that utilizes class label information to compare feature pairs. It compares the proposed algorithm to other methods like Correlation Feature Selection (CFS) and KS-Correlation Based Filter (KS-CBF). The efficiency and effectiveness of the various methods are tested on standard classifiers. In most cases, the proposed approach achieved equal or better classification accuracy compared to using all features or the other algorithms.
Patterns that only occur in objects belonging to a
single class are called Jumping Emerging Patterns (JEP). JEP
based Classifiers are considered one of the successful classification
systems. Due to its comprehensibility, simplicity and strong
differentiating abilities JEPs have captured significant recognition.
However, discovery of JEPs in a large pattern space is normally a
time consuming and challenging task because of their exponential
behaviour. In this work a novel method based on genetic
algorithm (GA) is proposed to discover JEPs in large pattern
space. Since the complexity of GA is lower than other algorithms,
so we have combined the power of JEPs and GA to find high
quality JEPs from datasets to improve performance of
classification system. Our proposed method explores a set of high
quality JEPs from pattern search space unlike other methods in
literature that compute complete set of JEPs, Large numbers of
duplicate and redundant JEPs are filtered out during their
discovery process. Experimental results show that our proposed
Genetic-JEPs are effective and accurate for classification of a
variety of data sets and in general achieve higher accuracy than
other standard classifiers.
Genetic Programming for Generating Prototypes in Classification ProblemsTarundeep Dhot
This document summarizes a research paper that proposes using genetic programming to generate prototypes for classification problems. It describes encoding prototypes as variable-length logical expressions and using a multi-tree representation. The approach evolves a population of these prototypes using genetic operators like crossover and mutation. Classification involves matching samples to prototypes, and fitness is based on recognition rate on a training set. The method was tested on several datasets and achieved higher recognition rates than an alternative genetic programming approach.
A Review on Feature Selection Methods For Classification TasksEditor IJCATR
In recent years, application of feature selection methods in medical datasets has greatly increased. The challenging task in
feature selection is how to obtain an optimal subset of relevant and non redundant features which will give an optimal solution without
increasing the complexity of the modeling task. Thus, there is a need to make practitioners aware of feature selection methods that have
been successfully applied in medical data sets and highlight future trends in this area. The findings indicate that most existing feature
selection methods depend on univariate ranking that does not take into account interactions between variables, overlook stability of the
selection algorithms and the methods that produce good accuracy employ more number of features. However, developing a universal
method that achieves the best classification accuracy with fewer features is still an open research area.
This document discusses feature selection concepts and methods. It defines features as attributes that determine which class an instance belongs to. Feature selection aims to select a relevant subset of features by removing irrelevant, redundant and unnecessary data. This improves learning accuracy, model performance and interpretability. The document categorizes feature selection algorithms as filter, wrapper or embedded methods based on how they evaluate feature subsets. It also discusses concepts like feature relevance, search strategies, successor generation and evaluation measures used in feature selection algorithms.
The document proposes a novel hybrid method called PCA-BEL for classifying gene expression microarray data. PCA-BEL uses principal component analysis (PCA) for feature extraction followed by classification using a Brain Emotional Learning (BEL) network. PCA reduces the dimensionality of the microarray data to overcome the high dimensionality problem. BEL is then used for classification due its low computational complexity making it suitable for high dimensional data. The method is tested on several cancer gene expression datasets and achieves average accuracies of 100%, 96%, 98.32%, 87.40% and 88% on five datasets respectively, demonstrating its effectiveness for microarray classification tasks.
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...ijcsit
Data mining is indispensable for business organizations for extracting useful information from the huge volume of stored data which can be used in managerial decision making to survive in the competition. Due to the day-to-day advancements in information and communication technology, these data collected from ecommerce and e-governance are mostly high dimensional. Data mining prefers small datasets than high dimensional datasets. Feature selection is an important dimensionality reduction technique. The subsets selected in subsequent iterations by feature selection should be same or similar even in case of small perturbations of the dataset and is called as selection stability. It is recently becomes important topic of research community. The selection stability has been measured by various measures. This paper analyses the selection of the suitable search method and stability measure for the feature selection algorithms and also the influence of the characteristics of the dataset as the choice of the best approach is highly problem dependent.
A Modified KS-test for Feature SelectionIOSR Journals
This document proposes a modified Kolmogorov-Smirnov (KS) test-based feature selection algorithm. It begins with an overview of feature selection and its benefits. It then discusses two common feature selection approaches: filter and wrapper models. The document proposes a fast redundancy removal filter based on a modified KS statistic that utilizes class label information to compare feature pairs. It compares the proposed algorithm to other methods like Correlation Feature Selection (CFS) and KS-Correlation Based Filter (KS-CBF). The efficiency and effectiveness of the various methods are tested on standard classifiers. In most cases, the proposed approach achieved equal or better classification accuracy compared to using all features or the other algorithms.
Oversampling technique in student performance classification from engineering...IJECEIAES
This document discusses various oversampling techniques for dealing with imbalanced data in student performance classification. It compares SMOTE, Borderline-SMOTE, SVMSMOTE, and ADASYN oversampling combined with MLP, gradient boosting, AdaBoost, and random forest classifiers. The results show that Borderline-SMOTE gave the best performance for predicting the minority (low performance) class according to several evaluation metrics. SVMSMOTE also performed well overall, particularly for recall, F1-measure, and AUC. Gradient boosting provided high and consistent precision, recall, F1-measure, and AUC across the different oversampling methods.
This document discusses feature selection techniques for classification problems. It begins by outlining class separability measures like divergence, Bhattacharyya distance, and scatter matrices. It then discusses feature subset selection approaches, including scalar feature selection which treats features individually, and feature vector selection which considers feature sets and correlations. Examples are provided to demonstrate calculating class separability measures for different feature combinations on sample datasets. Exhaustive search and suboptimal techniques like forward, backward, and floating selection are discussed for choosing optimal feature subsets. The goal of feature selection is to select a subset of features that maximizes class separation.
This document discusses computational intelligence and supervised learning techniques for classification. It provides examples of applications in medical diagnosis and credit card approval. The goal of supervised learning is to learn from labeled training data to predict the class of new unlabeled examples. Decision trees and backpropagation neural networks are introduced as common supervised learning algorithms. Evaluation methods like holdout validation, cross-validation and performance metrics beyond accuracy are also summarized.
Lazy Associative Classification is a novel lazy associative classifier that generates classification rules on demand for each test instance, focusing only on relevant features. This approach avoids overfitting and outperforms traditional eager associative classifiers and decision tree classifiers in experiments on 26 datasets. The lazy classifier is able to better handle small disjuncts problems compared to eager approaches and achieves higher accuracy with faster execution times.
This document discusses Item Response Theory (IRT), which is a psychometric theory used to analyze test items and scores. IRT aims to estimate examinee ability independently of the test items used and provides sample-independent item and test statistics. Some key benefits of IRT include obtaining ability estimates and measurement errors that are independent of the particular test used. IRT also allows test developers to select test items to achieve specific test properties like matching items to ability levels or targeting information.
Multi-Cluster Based Approach for skewed Data in Data MiningIOSR Journals
This document proposes a multi-cluster based majority under-sampling approach for handling skewed data in data mining. It begins with an introduction to the class imbalance problem, where one class has significantly more samples than the other class. Common techniques for addressing this include under-sampling the majority class and over-sampling the minority class. However, under-sampling can lose important majority class information. The proposed approach first clusters the majority class into multiple clusters. It then under-samples each cluster based on its size to select representative samples, avoiding information loss. It also oversamples the minority class. Experimental results on several datasets show the approach improves prediction of the minority class compared to standard under-sampling.
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
journal publishing, how to publish research paper, Call For research paper, international journal, publishing a paper, IJERD, journal of science and technology, how to get a research paper published, publishing a paper, publishing of journal, publishing of research paper, reserach and review articles, IJERD Journal, How to publish your research paper, publish research paper, open access engineering journal, Engineering journal, Mathemetics journal, Physics journal, Chemistry journal, Computer Engineering, Computer Science journal, how to submit your paper, peer reviw journal, indexed journal, reserach and review articles, engineering journal, www.ijerd.com, research journals,
yahoo journals, bing journals, International Journal of Engineering Research and Development, google journals, hard copy of journal
Analysis On Classification Techniques In Mammographic Mass Data SetIJERA Editor
Data mining, the extraction of hidden information from large databases, is to predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. Data-Mining classification techniques deals with determining to which group each data instances are associated with. It can deal with a wide variety of data so that large amount of data can be involved in processing. This paper deals with analysis on various data mining classification techniques such as Decision Tree Induction, Naïve Bayes , k-Nearest Neighbour (KNN) classifiers in mammographic mass dataset.
This document provides an overview of item response theory (IRT), including key concepts like item response functions, item parameters, and assumptions of IRT models. IRT aims to measure latent traits through analysis of item-level data, allowing item and person parameters to be estimated independent of specific test administrations. The document outlines the 1, 2, and 3 parameter IRT models and how they relate item characteristics like difficulty and discrimination to the probability of endorsing an item based on trait level. Key assumptions like unidimensionality and local independence are also discussed.
Item Response Theory in Constructing MeasuresCarlo Magno
The document discusses approaches to analyzing test data, including classical test theory (CTT) and item response theory (IRT). It provides an overview of CTT, limitations of CTT, approaches in IRT including advantages over CTT. It also discusses the Rasch model as an example of an IRT model. The document outlines what can be interpreted from IRT analyses including using IRT for scales. It concludes by mentioning some applications of IRT on tests.
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...IJRES Journal
The document presents a mathematical programming approach for selecting important variables in cluster analysis. It formulates a nonlinear binary model to minimize the distance between observations within clusters, using indicator variables to select important variables. The model is applied to a sample dataset of 30 observations across 5 variables, correctly identifying variables 3, 4 and 5 as most important for clustering the observations into two groups. The results are compared to an existing variable selection heuristic, with the mathematical programming approach achieving a 100% correct classification versus 97% for the other method.
This Presentation is on recommended system on question paper predication using machine learning techniques. We did literature survey and implement using same technique.
This document summarizes a study on multilabel text classification and the effect of label hierarchy. The study implements various algorithms for multilabel classification, including naive Bayes, k-nearest neighbors, random forests, SVMs, RBMs, and hierarchical classification algorithms. It evaluates the algorithms on four datasets that vary in features, labels, training/test sizes, and label cardinality. The goal is to analyze how different algorithmic approaches and dataset properties affect classification performance, particularly for hierarchical learning algorithms. Evaluation measures include micro/macro-averaged precision, recall and F1-score. The document provides details on the problem formulation, algorithms, implementation, datasets and evaluation.
Leave one out cross validated Hybrid Model of Genetic Algorithm and Naïve Bay...IJERA Editor
This document presents a new hybrid model for feature selection and classification that combines genetic algorithm and naive Bayes. The proposed method first uses a binary coded genetic algorithm to select a reduced subset of important features from datasets. It then applies a naive Bayes classification method to evaluate the selected features and identify the subset that achieves the highest classification accuracy. The performance of the proposed hybrid model is evaluated on eight datasets and compared to recent publications, finding it achieves satisfactory or higher classification accuracy using fewer features.
A Survey Ondecision Tree Learning Algorithms for Knowledge DiscoveryIJERA Editor
Theimmense volumes of data are populated into repositories from various applications. In order to find out desired information and knowledge from large datasets, the data mining techniques are very much helpful. Classification is one of the knowledge discovery techniques. In Classification, Decision trees are very popular in research community due to simplicity and easy comprehensibility. This paper presentsan updated review of recent developments in the field of decision trees.
This document proposes algorithms to efficiently find the best parameter K for the emerging pattern-based classifier PCL. It first summarizes the PCL classifier and discusses how the parameter K impacts its performance. It then presents the rPCL and ePCL algorithms to determine the optimal K value. rPCL repeatedly subsamples the training data and evaluates different K values to find the best average accuracy. ePCL enhances rPCL by using an incremental pattern maintenance technique called PSM to avoid recomputing emerging patterns from scratch for each subsample, improving runtime efficiency. Experimental results show that finding the best K through these algorithms significantly improves PCL's accuracy.
Game theory concepts like the Nash equilibrium and the Shapley value are being used to solve fundamental problems in knowledge discovery and data mining, such as clustering, classification, and discovering influential nodes in social networks. The talk will discuss the conceptual underpinnings of applying game theory techniques to problem solving in these areas. It will also present recent results on discovering influential nodes in social networks using the Shapley value and identifying topologies of strategically formed social networks using a game theoretic approach.
Efficiently Finding the Best Parameter for the Emerging Pattern-Based ClassifierGanesh Pavan Kambhampati
This document proposes algorithms to efficiently find the best parameter K for the emerging pattern-based classifier PCL. It first summarizes the PCL classifier and discusses how the parameter K impacts its performance. It then presents the rPCL and ePCL algorithms. rPCL repeatedly runs cross-validation to determine the best K by averaging accuracies. ePCL enhances rPCL by using an incremental pattern maintenance technique to avoid repeatedly constructing emerging patterns from scratch, improving efficiency. Experimental results show ePCL finds the best K much faster than rPCL by leveraging pattern maintenance.
Block chain 101 what it is, why it mattersPaul Brody
The Blockchain is an important new technology, but it is shrouded in mystery: what does it do? Why is it such a big deal? How is it related to bitcoin? In this short presentation (with attached video), I attempt to answer those questions.
Oversampling technique in student performance classification from engineering...IJECEIAES
This document discusses various oversampling techniques for dealing with imbalanced data in student performance classification. It compares SMOTE, Borderline-SMOTE, SVMSMOTE, and ADASYN oversampling combined with MLP, gradient boosting, AdaBoost, and random forest classifiers. The results show that Borderline-SMOTE gave the best performance for predicting the minority (low performance) class according to several evaluation metrics. SVMSMOTE also performed well overall, particularly for recall, F1-measure, and AUC. Gradient boosting provided high and consistent precision, recall, F1-measure, and AUC across the different oversampling methods.
This document discusses feature selection techniques for classification problems. It begins by outlining class separability measures like divergence, Bhattacharyya distance, and scatter matrices. It then discusses feature subset selection approaches, including scalar feature selection which treats features individually, and feature vector selection which considers feature sets and correlations. Examples are provided to demonstrate calculating class separability measures for different feature combinations on sample datasets. Exhaustive search and suboptimal techniques like forward, backward, and floating selection are discussed for choosing optimal feature subsets. The goal of feature selection is to select a subset of features that maximizes class separation.
This document discusses computational intelligence and supervised learning techniques for classification. It provides examples of applications in medical diagnosis and credit card approval. The goal of supervised learning is to learn from labeled training data to predict the class of new unlabeled examples. Decision trees and backpropagation neural networks are introduced as common supervised learning algorithms. Evaluation methods like holdout validation, cross-validation and performance metrics beyond accuracy are also summarized.
Lazy Associative Classification is a novel lazy associative classifier that generates classification rules on demand for each test instance, focusing only on relevant features. This approach avoids overfitting and outperforms traditional eager associative classifiers and decision tree classifiers in experiments on 26 datasets. The lazy classifier is able to better handle small disjuncts problems compared to eager approaches and achieves higher accuracy with faster execution times.
This document discusses Item Response Theory (IRT), which is a psychometric theory used to analyze test items and scores. IRT aims to estimate examinee ability independently of the test items used and provides sample-independent item and test statistics. Some key benefits of IRT include obtaining ability estimates and measurement errors that are independent of the particular test used. IRT also allows test developers to select test items to achieve specific test properties like matching items to ability levels or targeting information.
Multi-Cluster Based Approach for skewed Data in Data MiningIOSR Journals
This document proposes a multi-cluster based majority under-sampling approach for handling skewed data in data mining. It begins with an introduction to the class imbalance problem, where one class has significantly more samples than the other class. Common techniques for addressing this include under-sampling the majority class and over-sampling the minority class. However, under-sampling can lose important majority class information. The proposed approach first clusters the majority class into multiple clusters. It then under-samples each cluster based on its size to select representative samples, avoiding information loss. It also oversamples the minority class. Experimental results on several datasets show the approach improves prediction of the minority class compared to standard under-sampling.
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
journal publishing, how to publish research paper, Call For research paper, international journal, publishing a paper, IJERD, journal of science and technology, how to get a research paper published, publishing a paper, publishing of journal, publishing of research paper, reserach and review articles, IJERD Journal, How to publish your research paper, publish research paper, open access engineering journal, Engineering journal, Mathemetics journal, Physics journal, Chemistry journal, Computer Engineering, Computer Science journal, how to submit your paper, peer reviw journal, indexed journal, reserach and review articles, engineering journal, www.ijerd.com, research journals,
yahoo journals, bing journals, International Journal of Engineering Research and Development, google journals, hard copy of journal
Analysis On Classification Techniques In Mammographic Mass Data SetIJERA Editor
Data mining, the extraction of hidden information from large databases, is to predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. Data-Mining classification techniques deals with determining to which group each data instances are associated with. It can deal with a wide variety of data so that large amount of data can be involved in processing. This paper deals with analysis on various data mining classification techniques such as Decision Tree Induction, Naïve Bayes , k-Nearest Neighbour (KNN) classifiers in mammographic mass dataset.
This document provides an overview of item response theory (IRT), including key concepts like item response functions, item parameters, and assumptions of IRT models. IRT aims to measure latent traits through analysis of item-level data, allowing item and person parameters to be estimated independent of specific test administrations. The document outlines the 1, 2, and 3 parameter IRT models and how they relate item characteristics like difficulty and discrimination to the probability of endorsing an item based on trait level. Key assumptions like unidimensionality and local independence are also discussed.
Item Response Theory in Constructing MeasuresCarlo Magno
The document discusses approaches to analyzing test data, including classical test theory (CTT) and item response theory (IRT). It provides an overview of CTT, limitations of CTT, approaches in IRT including advantages over CTT. It also discusses the Rasch model as an example of an IRT model. The document outlines what can be interpreted from IRT analyses including using IRT for scales. It concludes by mentioning some applications of IRT on tests.
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...IJRES Journal
The document presents a mathematical programming approach for selecting important variables in cluster analysis. It formulates a nonlinear binary model to minimize the distance between observations within clusters, using indicator variables to select important variables. The model is applied to a sample dataset of 30 observations across 5 variables, correctly identifying variables 3, 4 and 5 as most important for clustering the observations into two groups. The results are compared to an existing variable selection heuristic, with the mathematical programming approach achieving a 100% correct classification versus 97% for the other method.
This Presentation is on recommended system on question paper predication using machine learning techniques. We did literature survey and implement using same technique.
This document summarizes a study on multilabel text classification and the effect of label hierarchy. The study implements various algorithms for multilabel classification, including naive Bayes, k-nearest neighbors, random forests, SVMs, RBMs, and hierarchical classification algorithms. It evaluates the algorithms on four datasets that vary in features, labels, training/test sizes, and label cardinality. The goal is to analyze how different algorithmic approaches and dataset properties affect classification performance, particularly for hierarchical learning algorithms. Evaluation measures include micro/macro-averaged precision, recall and F1-score. The document provides details on the problem formulation, algorithms, implementation, datasets and evaluation.
Leave one out cross validated Hybrid Model of Genetic Algorithm and Naïve Bay...IJERA Editor
This document presents a new hybrid model for feature selection and classification that combines genetic algorithm and naive Bayes. The proposed method first uses a binary coded genetic algorithm to select a reduced subset of important features from datasets. It then applies a naive Bayes classification method to evaluate the selected features and identify the subset that achieves the highest classification accuracy. The performance of the proposed hybrid model is evaluated on eight datasets and compared to recent publications, finding it achieves satisfactory or higher classification accuracy using fewer features.
A Survey Ondecision Tree Learning Algorithms for Knowledge DiscoveryIJERA Editor
Theimmense volumes of data are populated into repositories from various applications. In order to find out desired information and knowledge from large datasets, the data mining techniques are very much helpful. Classification is one of the knowledge discovery techniques. In Classification, Decision trees are very popular in research community due to simplicity and easy comprehensibility. This paper presentsan updated review of recent developments in the field of decision trees.
This document proposes algorithms to efficiently find the best parameter K for the emerging pattern-based classifier PCL. It first summarizes the PCL classifier and discusses how the parameter K impacts its performance. It then presents the rPCL and ePCL algorithms to determine the optimal K value. rPCL repeatedly subsamples the training data and evaluates different K values to find the best average accuracy. ePCL enhances rPCL by using an incremental pattern maintenance technique called PSM to avoid recomputing emerging patterns from scratch for each subsample, improving runtime efficiency. Experimental results show that finding the best K through these algorithms significantly improves PCL's accuracy.
Game theory concepts like the Nash equilibrium and the Shapley value are being used to solve fundamental problems in knowledge discovery and data mining, such as clustering, classification, and discovering influential nodes in social networks. The talk will discuss the conceptual underpinnings of applying game theory techniques to problem solving in these areas. It will also present recent results on discovering influential nodes in social networks using the Shapley value and identifying topologies of strategically formed social networks using a game theoretic approach.
Efficiently Finding the Best Parameter for the Emerging Pattern-Based ClassifierGanesh Pavan Kambhampati
This document proposes algorithms to efficiently find the best parameter K for the emerging pattern-based classifier PCL. It first summarizes the PCL classifier and discusses how the parameter K impacts its performance. It then presents the rPCL and ePCL algorithms. rPCL repeatedly runs cross-validation to determine the best K by averaging accuracies. ePCL enhances rPCL by using an incremental pattern maintenance technique to avoid repeatedly constructing emerging patterns from scratch, improving efficiency. Experimental results show ePCL finds the best K much faster than rPCL by leveraging pattern maintenance.
Block chain 101 what it is, why it mattersPaul Brody
The Blockchain is an important new technology, but it is shrouded in mystery: what does it do? Why is it such a big deal? How is it related to bitcoin? In this short presentation (with attached video), I attempt to answer those questions.
Artificial intelligence (AI) is everywhere, promising self-driving cars, medical breakthroughs, and new ways of working. But how do you separate hype from reality? How can your company apply AI to solve real business problems?
Here’s what AI learnings your business should keep in mind for 2017.
Study: The Future of VR, AR and Self-Driving CarsLinkedIn
We asked LinkedIn members worldwide about their levels of interest in the latest wave of technology: whether they’re using wearables, and whether they intend to buy self-driving cars and VR headsets as they become available. We asked them too about their attitudes to technology and to the growing role of Artificial Intelligence (AI) in the devices that they use. The answers were fascinating – and in many cases, surprising.
This SlideShare explores the full results of this study, including detailed market-by-market breakdowns of intention levels for each technology – and how attitudes change with age, location and seniority level. If you’re marketing a tech brand – or planning to use VR and wearables to reach a professional audience – then these are insights you won’t want to miss.
Fuzzy clustering has been widely studied and applied in a variety of key areas of science and
engineering. In this paper the Improved Teaching Learning Based Optimization (ITLBO)
algorithm is used for data clustering, in which the objects in the same cluster are similar. This
algorithm has been tested on several datasets and compared with some other popular algorithm
in clustering. Results have been shown that the proposed method improves the output of
clustering and can be efficiently used for fuzzy clustering.
An Automatic Medical Image Segmentation using Teaching Learning Based Optimiz...idescitation
Nature inspired population based evolutionary algorithms are very popular with
their competitive solutions for a wide variety of applications. Teaching Learning based
Optimization (TLBO) is a very recent population based evolutionary algorithm evolved
on the basis of Teaching Learning process of a class room. TLBO does not require any
algorithmic specific parameters. This paper proposes an automatic grouping of pixels into
different homogeneous regions using the TLBO. The experimental results have
demonstrated the effectiveness of TLBO in image segmentation.
Association rule discovery for student performance prediction using metaheuri...csandit
According to the increase of using data mining tech
niques in improving educational systems
operations, Educational Data Mining has been introd
uced as a new and fast growing research
area. Educational Data Mining aims to analyze data
in educational environments in order to
solve educational research problems. In this paper
a new associative classification technique
has been proposed to predict students final perform
ance. Despite of several machine learning
approaches such as ANNs, SVMs, etc. associative cla
ssifiers maintain interpretability along
with high accuracy. In this research work, we have
employed Honeybee Colony Optimization
and Particle Swarm Optimization to extract associat
ion rule for student performance prediction
as a multi-objective classification problem. Result
s indicate that the proposed swarm based
algorithm outperforms well-known classification tec
hniques on student performance prediction
classification problem.
ASSOCIATION RULE DISCOVERY FOR STUDENT PERFORMANCE PREDICTION USING METAHEURI...cscpconf
According to the increase of using data mining techniques in improving educational systems
operations, Educational Data Mining has been introduced as a new and fast growing research
area. Educational Data Mining aims to analyze data in educational environments in order to
solve educational research problems. In this paper a new associative classification technique
has been proposed to predict students final performance. Despite of several machine learning
approaches such as ANNs, SVMs, etc. associative classifiers maintain interpretability along
with high accuracy. In this research work, we have employed Honeybee Colony Optimization
and Particle Swarm Optimization to extract association rule for student performance prediction
as a multi-objective classification problem. Results indicate that the proposed swarm based
algorithm outperforms well-known classification techniques on student performance prediction
classification problem.
Engineering Research Publication
Best International Journals, High Impact Journals,
International Journal of Engineering & Technical Research
ISSN : 2321-0869 (O) 2454-4698 (P)
www.erpublication.org
Optimal rule set generation using pso algorithmcsandit
Classification and Prediction is an important resea
rch area of data mining. Construction of
classifier model for any decision system is an impo
rtant job for many data mining applications.
The objective of developing such a classifier is to
classify unlabeled dataset into classes. Here
we have applied a discrete Particle Swarm Optimizat
ion (PSO) algorithm for selecting optimal
classification rule sets from huge number of rules
possibly exist in a dataset. In the proposed
DPSO algorithm, decision matrix approach was used f
or generation of initial possible
classification rules from a dataset. Then the propo
sed algorithm discovers important or
significant rules from all possible classification
rules without sacrificing predictive accuracy.
The proposed algorithm deals with discrete valued d
ata, and its initial population of candidate
solutions contains particles of different sizes. Th
e experiment has been done on the task of
optimal rule selection in the data sets collected f
rom UCI repository. Experimental results show
that the proposed algorithm can automatically evolv
e on average the small number of
conditions per rule and a few rules per rule set, a
nd achieved better classification performance
of predictive accuracy for few classes.
An Automatic Clustering Technique for Optimal ClustersIJCSEA Journal
This document presents a new automatic clustering algorithm called Automatic Merging for Optimal Clusters (AMOC). AMOC is a two-phase iterative extension of k-means clustering that aims to automatically determine the optimal number of clusters for a given dataset. In the first phase, AMOC initializes a large number of clusters k using k-means. In the second phase, it iteratively merges the lowest probability cluster with its closest neighbor, recomputing metrics each time to evaluate if the merge improved clustering quality. The algorithm stops merging once no improvements are found. Experimental results on synthetic and real datasets show AMOC finds nearly optimal cluster structures in terms of number, compactness and separation of clusters.
Performance Comparision of Machine Learning AlgorithmsDinusha Dilanka
In this paper Compare the performance of two
classification algorithm. I t is useful to differentiate
algorithms based on computational performance rather
than classification accuracy alone. As although
classification accuracy between the algorithms is similar,
computational performance can differ significantly and it
can affect to the final results. So the objective of this paper
is to perform a comparative analysis of two machine
learning algorithms namely, K Nearest neighbor,
classification and Logistic Regression. In this paper it
was considered a large dataset of 7981 data points and 112
features. Then the performance of the above mentioned
machine learning algorithms are examined. In this paper
the processing time and accuracy of the different machine
learning techniques are being estimated by considering the
collected data set, over a 60% for train and remaining
40% for testing. The paper is organized as follows. In
Section I, introduction and background analysis of the
research is included and in section II, problem statement.
In Section III, our application and data analyze Process,
the testing environment, and the Methodology of our
analysis are being described briefly. Section IV comprises
the results of two algorithms. Finally, the paper concludes
with a discussion of future directions for research by
eliminating the problems existing with the current
research methodology.
EFFICIENTLY PROCESSING OF TOP-K TYPICALITY QUERY FOR STRUCTURED DATAcsandit
This work presents a novel ranking scheme for structured data. We show how to apply the
notion of typicality analysis from cognitive science and how to use this notion to formulate the
problem of ranking data with categorical attributes. First, we formalize the typicality query
model for relational databases. We adopt Pearson correlation coefficient to quantify the extent
of the typicality of an object. The correlation coefficient estimates the extent of statistical
relationships between two variables based on the patterns of occurrences and absences of their
values. Second, we develop a top-k query processing method for efficient computation. TPFilter
prunes unpromising objects based on tight upper bounds and selectively joins tuples of highest
typicality score. Our methods efficiently prune unpromising objects based on upper bounds.
Experimental results show our approach is promising for real data.
Patterns that only occur in objects belonging to a
single class are called Jumping Emerging Patterns (JEP). JEP
based Classifiers are considered one of the successful classification
systems. Due to its comprehensibility, simplicity and strong
differentiating abilities JEPs have captured significant recognition.
However, discovery of JEPs in a large pattern space is normally a
time consuming and challenging task because of their exponential
behaviour. In this work a novel method based on genetic
algorithm (GA) is proposed to discover JEPs in large pattern
space. Since the complexity of GA is lower than other algorithms,
so we have combined the power of JEPs and GA to find high
quality JEPs from datasets to improve performance of
classification system. Our proposed method explores a set of high
quality JEPs from pattern search space unlike other methods in
literature that compute complete set of JEPs, Large numbers of
duplicate and redundant JEPs are filtered out during their
discovery process. Experimental results show that our proposed
Genetic-JEPs are effective and accurate for classification of a
variety of data sets and in general achieve higher accuracy than
other standard classifiers.
A Study of Efficiency Improvements Technique for K-Means AlgorithmIRJET Journal
This document discusses techniques to improve the efficiency of the K-means clustering algorithm. It begins with an introduction to K-means clustering and discusses some of its limitations, such as high computational time. It then proposes using a ranking method to help assign data points to clusters more efficiently. The key steps of standard K-means clustering and the proposed ranking-based approach are described. Experimental results on sample datasets show that the ranking method leads to faster clustering compared to standard K-means, with comparable accuracy. Therefore, the ranking method can help enhance the performance of K-means clustering for large datasets.
A Formal Machine Learning or Multi Objective Decision Making System for Deter...Editor IJCATR
Decision-making typically needs the mechanisms to compromise among opposing norms. Once multiple objectives square measure is concerned of machine learning, a vital step is to check the weights of individual objectives to the system-level performance. Determinant, the weights of multi-objectives is associate in analysis method, associated it's been typically treated as a drawback. However, our preliminary investigation has shown that existing methodologies in managing the weights of multi-objectives have some obvious limitations like the determination of weights is treated as one drawback, a result supporting such associate improvement is limited, if associated it will even be unreliable, once knowledge concerning multiple objectives is incomplete like an integrity caused by poor data. The constraints of weights are also mentioned. Variable weights square measure is natural in decision-making processes. Here, we'd like to develop a scientific methodology in determinant variable weights of multi-objectives. The roles of weights in a creative multi-objective decision-making or machine-learning of square measure analyzed, and therefore the weights square measure determined with the help of a standard neural network.
Optimal Feature Selection from VMware ESXi 5.1 Feature Setijccmsjournal
A study of VMware ESXi 5.1 server has been carried out to find the optimal set of parameters which suggest usage of different resources of the server. Feature selection algorithms have been used to extract the optimum set of parameters of the data obtained from VMware ESXi 5.1 server using esxtop command. Multiple virtual machines (VMs) are running in the mentioned server. K-means algorithm is used for clustering the VMs. The goodness of each cluster is determined by Davies Bouldin index and Dunn index respectively. The best cluster is further identified by the determined indices. The features of the best cluster are considered into a set of optimal parameters.
Optimal feature selection from v mware esxi 5.1 feature setijccmsjournal
A study of VMware ESXi 5.1 server has been carried out to find the optimal set of parameters which
suggest usage of different resources of the server. Feature selection algorithms have been used to extract
the optimum set of parameters of the data obtained from VMware ESXi 5.1 server using esxtop command.
Multiple virtual machines (VMs) are running in the mentioned server. K-means algorithm is used for
clustering the VMs. The goodness of each cluster is determined by Davies Bouldin index and Dunn index
respectively. The best cluster is further identified by the determined indices. The features of the best cluster
are considered into a set of optimal parameters.
A Threshold Fuzzy Entropy Based Feature Selection: Comparative StudyIJMER
Feature selection is one of the most common and critical tasks in database classification. It
reduces the computational cost by removing insignificant and unwanted features. Consequently, this
makes the diagnosis process accurate and comprehensible. This paper presents the measurement of
feature relevance based on fuzzy entropy, tested with Radial Basis Classifier (RBF) network,
Bagging(Bootstrap Aggregating), Boosting and stacking for various fields of datasets. Twenty
benchmarked datasets which are available in UCI Machine Learning Repository and KDD have been
used for this work. The accuracy obtained from these classification process shows that the proposed
method is capable of producing good and accurate results with fewer features than the original
datasets.
EFFECTIVENESS PREDICTION OF MEMORY BASED CLASSIFIERS FOR THE CLASSIFICATION O...cscpconf
Classification is a step by step practice for allocating a given piece of input into any of the given
category. Classification is an essential Machine Learning technique. There are many
classification problem occurs in different application areas and need to be solved. Different
types are classification algorithms like memory-based, tree-based, rule-based, etc are widely
used. This work studies the performance of different memory based classifiers for classification
of Multivariate data set from UCI machine learning repository using the open source machine
learning tool. A comparison of different memory based classifiers used and a practical
guideline for selecting the most suited algorithm for a classification is presented. Apart fromthat some empirical criteria for describing and evaluating the best classifiers are discussed
Enhanced Genetic Algorithm with K-Means for the Clustering ProblemAnders Viken
The document proposes a genetic algorithm combined with K-Means clustering to solve clustering problems. The genetic algorithm is used to generate new clusters through crossover, then K-Means is applied to further improve cluster quality and speed up search. Experimental results showed the combined approach converges faster while producing equivalent quality clusters compared to a standard genetic algorithm alone.
Main news related to the CCS TSI 2023 (2023/1695)Jakub Marek
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Dive into the realm of operating systems (OS) with Pravash Chandra Das, a seasoned Digital Forensic Analyst, as your guide. 🚀 This comprehensive presentation illuminates the core concepts, types, and evolution of OS, essential for understanding modern computing landscapes.
Beginning with the foundational definition, Das clarifies the pivotal role of OS as system software orchestrating hardware resources, software applications, and user interactions. Through succinct descriptions, he delineates the diverse types of OS, from single-user, single-task environments like early MS-DOS iterations, to multi-user, multi-tasking systems exemplified by modern Linux distributions.
Crucial components like the kernel and shell are dissected, highlighting their indispensable functions in resource management and user interface interaction. Das elucidates how the kernel acts as the central nervous system, orchestrating process scheduling, memory allocation, and device management. Meanwhile, the shell serves as the gateway for user commands, bridging the gap between human input and machine execution. 💻
The narrative then shifts to a captivating exploration of prominent desktop OSs, Windows, macOS, and Linux. Windows, with its globally ubiquitous presence and user-friendly interface, emerges as a cornerstone in personal computing history. macOS, lauded for its sleek design and seamless integration with Apple's ecosystem, stands as a beacon of stability and creativity. Linux, an open-source marvel, offers unparalleled flexibility and security, revolutionizing the computing landscape. 🖥️
Moving to the realm of mobile devices, Das unravels the dominance of Android and iOS. Android's open-source ethos fosters a vibrant ecosystem of customization and innovation, while iOS boasts a seamless user experience and robust security infrastructure. Meanwhile, discontinued platforms like Symbian and Palm OS evoke nostalgia for their pioneering roles in the smartphone revolution.
The journey concludes with a reflection on the ever-evolving landscape of OS, underscored by the emergence of real-time operating systems (RTOS) and the persistent quest for innovation and efficiency. As technology continues to shape our world, understanding the foundations and evolution of operating systems remains paramount. Join Pravash Chandra Das on this illuminating journey through the heart of computing. 🌟
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
A Comprehensive Guide to DeFi Development Services in 2024Intelisync
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance.
In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
Digital Marketing Trends in 2024 | Guide for Staying AheadWask
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Building Production Ready Search Pipelines with Spark and Milvus
Efficiently finding the parameter for emergign pattern based classifier
1. Efficiently Finding the Best Parameter
for the Emerging Pattern-Based Classifier PCL
Thanh-Son Ngo1 , Mengling Feng2 Guimei Liu1 , and Limsoon Wong1
1
National University of Singapore
{ngothanh,liugm,wongls}@comp.nus.edu.sg
2
Institute for Infocomm Research
mfeng@i2r.a-star.edu.sg
Abstract. Emerging patterns are itemsets whose frequencies change
sharply from one class to the other. PCL is an example of efficient clas-
sification algorithms that leverage the prediction power of emerging pat-
terns. It first selects the top-K emerging patterns of each class that match
a testing instance, and then uses these selected patterns to decide the
class label of the testing instance. We study the impact of the parameter
K on the accuracy of PCL. We have observed that in many cases, the
value of K is critical to the performance of PCL. This motivates us to de-
velop an algorithm to find the best value of K for PCL. Our results show
that finding the best K can improve the accuracy of PCL greatly, and
employing incremental frequent itemset maintenance techniques reduces
the running time of our algorithm significantly.
1 Introduction
Classification is the task of learning from one data set and making predictions
in another data set. The learning data is called training data which consists of
entities with their labels. The other data set is called testing data which consists
of entities without labels, and the classifying task is to make prediction of their
labels based on what we learn from the training data. Many classifiers have been
proposed in the literature. Here, we focus on pattern-based classifiers.
Frequent patterns are patterns that appear frequently in the dataset. Fre-
quent patterns whose supports in the training data set change significantly from
one class to the other are used to construct the classifiers. These patterns are
called emerging patterns (EP) [9,10,16] and these patterns are applied to test-
ing instances to predict their class memberships. As a classifier may generate
many rules for the same class, aggregating their discrimination power gives bet-
ter prediction results [4,8,10,20]. Using EPs has the advantage that they not
only predict the class labels but also provide explanation for the decision.
Jumping emerging patterns (JEP) are a special type of EPs that have non-zero
occurrence in one class and zero occurrence in all other classes [10,11]. PCL [9,10]
is a classifier based on aggregating the prediction power of frequent JEPs. Given
Supported by A*STAR grant (SERC 072 101 0016).
M.J. Zaki et al. (Eds.): PAKDD 2010, Part I, LNAI 6118, pp. 121–133, 2010.
c Springer-Verlag Berlin Heidelberg 2010
2. 122 T.-S. Ngo et al.
a test instance t, PCL selects the top-K JEPs with the highest supports that
are contained in t from each class and then computes the score for each class
using these selected patterns. PCL assigns the class label with the highest score
to the testing instance. The choice of a good value for K is tricky and its optimal
value varies on different datasets, as shown in Fig. 1. Remarkably, the problem
of choosing the best value of K has not been investigated previously.
Fig. 1. Accuracy of PCL on four datasets with different values of K. X-axis shows
different values of K and Y-axis shows the accuracy. When K increases, the accuracy
does not always increase.
Here, to avoid overfitting K to the data, we sample many subsets of the train-
ing data to get the value of K that appears the best on average. By this method,
we maximize the likelihood that the chosen K will produce the best results in the
whole dataset. Our main contributions are summarized as follows: (i) We revisit
the PCL algorithm [9,10] and propose a method to find the most appropriate
parameters to improve the performance of PCL. (ii) We introduce a method
to speed up the proposed algorithm as well as cross-validation methodology for
frequent-pattern-based classification algorithms in general.
2 Related Work
Emerging patterns are a special type of frequent patterns. They occur frequently
in one class but rarely in other classes. Emerging pattern-based classification ben-
efits enormously from the advancement of frequent pattern mining algorithms
and better understanding of the patterns space. Many methods have been pro-
posed to select a good set of patterns for constructing classifiers [15]. Most of
these algorithms first generate frequent patterns satisfying a certain minimum
support constraint, and then select patterns based on some criteria and make
3. Finding Best Parameter for PCL 123
predictions based on the selected patterns. These algorithms mainly differ in
how patterns are selected and how class labels are determined.
The first class of methods pick the best pattern to classify testing instances,
like CBA [14] and MMAC [18]. CBA first derives classification rules from fre-
quent patterns, and then ranks the rules in descending order of their confidence.
A very small set of rules is then selected to maximize the classification accuracy
on the training data. The class label of a new instance is decided by the first rule
that matches the instance. MMAC extends CBA to multiple classes and labels.
Only a small set of patterns are selected for classification. So it is possible that
a new testing instance does not contain any EP that is selected.
The second class of methods treats frequent patterns as additional features
and then use classical algorithms to classify the data [2,3]. Cheng et al. [2,3]
build a connection between pattern frequency and discriminative measures such
as information gain score, and develop a strategy to set minimum support in
frequent pattern mining for generating useful patterns for classification. Based
on this strategy, coupled with a proposed feature selection algorithm, selected
EPs are used as additional features to build high quality classifiers. The results
show that using EPs as additional features can improve the accuracy of C4.5
and SVM. The main drawback of this approach is that the intuitiveness of the
pattern-based classifiers may be lost.
The last class of methods selects the top-K patterns and aggregates the pre-
dictive power of the selected patterns. They include CAEP [4], iCAEP [20],
PCL [9,10], CEP [1], CPAR [17], CMAR [8] and HARMONY [19]. CAEP uses
EPs with high growth rate to do classification. iCAEP aggregates the prediction
power of EPs based on information theory. PCL uses only JEPs. JEPs occur in
one and only one class, which may be too restrictive in some cases. CEP relaxes
this constraint by putting an upper bound on the number of occurrences of the
EPs in other classes. CPAR uses the expected accuracy to select classification
rules. It compares the average expected accuracy of the best K rules of each class
and chooses the class with the highest expected accuracy as the predicted class.
CMAR uses a weighted measure to select rules, and the score of a class is also
calculated using this weighted measure. HARMONY directly mines the final set
of classification rules by using an instance-centric rule generation approach.
The aggregation-based algorithms generally show better performance than
other algorithms. However, the value of K is critical to their performance in many
cases. Here, we use PCL as an example to study how to select proper values for
parameter K to maximize the accuracy of the aggregation-based algorithms.
3 Preliminaries
Let I = {i1 , i2 , .., in } be a set of distinct literals called items. An itemset or
a pattern is a subset of I. A transaction is a non-empty set of items. Let
C = {C1 , C2 , .., Ck } be a set of distinct labels called class labels. A transac-
tional database consists of a set of transactions associated with their labels. A
classification task involves two phases: training and testing. In training, the class
4. 124 T.-S. Ngo et al.
labels are revealed and in testing, class labels are hidden. The classifier builds a
model based on the training set and uses this model to predict the class labels
of transactions in the testing set. The accuracy of a classifier A is the proportion
of test-instances which are correctly classified by A.
A pattern P covers a transaction t if P ⊆ t. The support of P is the number
of transactions that are covered by P . We use sup(P, D) to indicate the support
of P in the dataset D. We use P |D to indicate the set of transactions in D
that contain P . Given a 2-class dataset, jumping emerging patterns (JEP) are
patterns whose frequency in one class is non-zero and in other class is zero. An
EPi is called a JEP from class A to class B if its support in class A is zero. A
pattern P is a generator if and only if for every P ⊂ P , sup(P , D) > sup(P, D).
A JEP generator is both a generator and a JEP.
4 Algorithms
4.1 PCL
We present here an overview of PCL [9,10]. The dataset D is divided into
positive and negative classes. Given a test-instance t, two sets of JEPs that
cover t are used: EPt1 , EPt2 , .., EPtn from negative to positive classes and
+ + +
− − −
EPt1 , EPt2 , .., EPtn from positive to negative classes. The K most frequent
JEPs that cover t are sorted in descending order of their supports. Suppose
the set of JEPs (based on the training set) from negative to positive classes
are EP1 , EP2 , .., EPn in descending order of their supports. Similarly,
+ + +
− − −
EP1 , EP2 , .., EPn is the set of JEPs from positive to negative. It is hypoth-
esized that if t belongs to one class, it should contain more JEPs of this class
than the other class. So Li and Wong [9] formulated the scores of t with respect
to the two classes as below.
k k −
i=1 sup(EPti , D)
+
i=1 sup(EPti , D)
Score(t, +) = k
Score(t, −) = k
i=1 sup(EPi+ , D) i=1 sup(EPi− , D)
4.2 PSM
Maintenance of EPs was first introduced in [11], though a complete solution
for insertion and deletion was not discussed. Here, we describe a more efficient
method called PSM for complete maintenance introduced recently in [5].
PSM is a maintenance algorithm for frequent pattern space. It is based on the
GE-tree, an effective data structure described in [6,7] for enumerating genera-
tors. Frequent generators are stored in the GE-tree and the tree is incrementally
adjusted when new transactions are added or existing transactions are removed.
Each node in the tree represents a generator. To find EPs, we modify the GE-tree
so that each node stores both positive and negative supports of the correspond-
ing generator. In addition to frequent generators, GE-tree maintains a negative
border which comprises infrequent generators whose immediate subsets are all
5. Finding Best Parameter for PCL 125
frequent [5,6]. The negative border helps generate new frequent generators and
equivalence classes efficiently when transactions are added or removed.
Computating frequent generators is expensive. The benefit of PSM is that a
new set of frequent generators does not need to be computed from scratch when
the data is modified. As a small change in dataset seldom causes a big change
in the set of frequent patterns, PSM is effective for pattern space maintenance.
4.3 rPCL and ePCL
A good choice of K has a big impact on prediction results in PCL. We propose a
method to tackle this problem as follows. According to the Central Limit The-
orem, the distribution of accuracy will behave like normal distribution. Indeed,
Fig. 2 suggests the convergence of average classification accuracy in training data
to the real value of accuracy in the whole dataset. So we simulate the actual pro-
cess of classification in training set and choose the value of K that maximizes
the mean accuracy. The simulation is run repeatedly to determine which value
of K appears as the best on average. By the Central Limit Theorem, the average
accuracy of each K approaches the true accuracy of that value of K, given suf-
ficient number of simulation runs. Thus, the value of K that has the best mean
accuracy in the simulation runs will also perform well in the whole set of data.
Fig. 2. Distribution of accuracies across 100 runs given a fixed K=10 over training set.
The vertical red line indicates the actual value of accuracy when we evaluate the same
K in the whole dataset. The means are shown to be close to the actual accuracy.
We describe two algorithms rPCL and ePCL. Algorithm rPCL is a direct but
naive solution for the method above. We use it as a benchmark to understand
where inefficiencies might be and to assess the improvement of a more efficient
method using maintenance. ePCL is the fast version where PSM is used. It
produces the same results as rPCL but runs many times faster.
6. 126 T.-S. Ngo et al.
Fig. 3. The rPCL algorithm
In rPCL (Fig. 3) we use a technique called repeated random sub-sampling
validation as a framework to assess the best parameter for PCL. It involves
several rounds, in each round the training set is randomly divided into new
complementary testing and training set. For each K, we perform this procedure
to determine the expected accuracy of the classifier wrt this K. The chosen K is
the one which returns the best average accuracy over these runs. 10-fold cross
validation is popular for accessing classifier performance. So, we subsample 10%
of training set as testing subsample and 90% as training subsample.
In rPCL, makeDecision simply decides if the classifier with ScoreK [t, +] and
ScoreK [t, −] correctly predicts the class of t, wrt to the current K. At step 9, we
can compute ScoreK [t, +] from Score( K − 1)[t, +] in constant time. To achieve
this, we use a vector to store ScoreK [t, +] and ScoreK [t, −] for all t ∈ Dt. To
determine a good value of K, a significantly large number of iterations maxtime
is performed and all values of K in the range 1..maxK are considered.
Constructing a set of frequent EPs from the training set (step 4) is compu-
tationally very expensive. It involves frequent pattern mining. And this step is
repeated many times with different Dt and Dn. Because the testing-training sep-
arations are largely similar among all the runs, we should not need to construct
the set of EPs from scratch in such a naive fashion. PSM [5,6] is used to achieve
this purpose. PSM allows us to adjust the set of frequent patterns when the
training fold is changed. We only need to construct the frequent patterns space
at the beginning and repeatedly use PSM to maintain the set of rules when the
sub-sampling folds are changed. That is, we only need to mine frequent patterns
once at the beginning. With this innovation, we present an improved algorithm
called ePCL (Fig. 4) which stands for enhanced PCL.
7. Finding Best Parameter for PCL 127
Fig. 4. ePCL algorithms. The highlighted codes indicate the involvement of PSM.
Fig. 5. Workflow for rPCL and ePCL. EP set indicates the set of EPs used to construct
PCL.
8. 128 T.-S. Ngo et al.
The set of frequent generators are incrementally maintained. Line 5 and 17
show the execution of PSM: PSM.delete is invoked when a fold is removed from
the training set and PSM.add is invoked to add this fold into original set. That
eliminates building a new set of frequent patterns when we run the simulation
with a new testing-training separation. Fig. 5 is the workflow of rPCL and ePCL.
4.4 Complexity Analysis
We compare the theoretical improvement of ePCL from rPCL. In rPCL, the
outer loop repeats maxtime, which is the number of runs needed to find the
best K. In each loop, we need to run PCL classifier which involves a frequent
pattern mining step (F P M ) and evaluation for all K from 1 to maxK. The
total time is maxtime ∗ (F P M + P CL(maxK)). P CL(maxK) is the time to
compute scores for K = 1..maxK. This step evolves visiting entire frequent pat-
tern set to get the top K patterns and compute scores accordingly for each test
instance. In ePCL, we only need to build frequent patterns once and incremen-
tally adjust the pattern space. The total time is F P M + maxtime ∗ (|D|/10 ∗
maintenance + P CL(maxK)), where maintenance is the running time for main-
taining one transaction, |D| is the size of the dataset. According to [5], PSM is
much more efficient than mining-from-scratch algorithms. At 10% of the data
size, the speed-up is at least 3 times faster for incremental and 5 times faster for
decremental maintenance than mining from scratch.
5 Experimental Studies
5.1 Experiment Setup
Experiments are conducted using 18 data sets of various sizes from UCI reposi-
tory. Continuous attributes are discretized by the entropy method. Table 1 gives
details of these data sets. We evaluate the performance on two-class datasets. For
multi-class datasets, we select one class as positive and the remaining as negative.
The process is done for all classes. PCL can be extended to handle multiple-class
datasets; however, the method of choosing parameter is still applicable. 10-fold
cross validation is used to assess the efficiency and average accuracies are re-
ported. The datasets are separated in such a way that the proportion of each
class is consistent in both testing and training.
For the original PCL, we use the frequent pattern mining algorithm provided
by [13]. The experiments are done in Windows machine with 2.6 GHz processor
and 1G memory. We assess the accuracy improvement of rPCL over PCL and
running time comparison of ePCL and rPCL. Time is measured in seconds.
When we compute the score of an instance in two classes, it is possible that
the scores are both zero. Such a test-instance is not covered by any pattern
and therefore unclassifiable by both PCL and ePCL. The average percentage of
unclassified data is 11% and there is no dataset with more than 20%. We do not
include these cases in our accuracy computation. The purpose is to evaluate the
improvement of rPCL over the original PCL only.
9. Finding Best Parameter for PCL 129
Table 1. Datasets information
Dataset # attributes # instances # continuous attributes # nomial attributes
mushroom 4 8124 4 0
iris 8 150 8 0
Glass 22 214 0 22
zoo 16 101 0 16
Promoter 18 106 0 18
Vote 10 435 10 0
Splice 19 3190 0 19
Hepatitis 59 155 0 59
Pima 61 768 0 61
Hypothyroid 25 3163 7 18
Breast 16 3190 0 16
Cleve 10 2727 0 10
German 12 2700 0 12
Lymph 19 1332 0 19
Vehicle 19 3809 0 19
Waveform 20 9000 0 20
Wine 14 178 0 14
Tic-tac-toe 10 4312 0 10
5.2 Parameters Setting
All the patterns used in PCL and its improved versions are JEP generators [10].
The exception is an early paper [8]; it uses the “boundary” JEPs, which are
defined in [9] as those JEPs where none of their proper subsets are JEPs. Here,
we use JEP generators as discriminative patterns because:
– It is known that the JEP space is partitioned into disjoint equivalence
classes [9,10]. So the set of JEP generators (which correspond to the most
general patterns in each equivalence class) adequately characterizes the data.
In contrast, boundary JEPs are insufficient to capture all equivalence classes.
– The minimum description length principle is a well-accepted general solution
for model selection problems. The choice of JEP generators is in accord with
this principle [12]. Also, our experiments show that JEP generators are less
noise-sensitive than boundary JEPs.
– JEP generators are efficient to compute. We can make use of many efficient
frequent patterns mining algorithms [12].
For rPCL and ePCL, we set maxtime to 50 and we run K in a range from 1
to 50. We limit the range to 50 because small-frequency patterns have minimal
prediction power over the top ones. Patterns ranked higher than 50 generally
have low support. As shown in Fig. 1, the predictions stabilize after K = 50,
so no need to consider K > 50. For original PCL, we use K = 10 as suggested
in [9]. Minimum support is set to make sure enough EPs are found.
10. 130 T.-S. Ngo et al.
5.3 Efficiency
PCL with generators and boundary EPs. For PCL, we have choices over
which type of frequent patterns are used to make predictions. We have done
experiments to justify our choice of JEP generators, as suggested in [9], rather
than boundary JEPs. We want to test the robustness of PCL in the case of noise
with these two types of patterns. When a part of the original dataset is missing,
the noisy dataset reduces the accuracy of the classifier. Fig. 6 shows accuracy
of PCL with generators (gen PCL) and boundary JEPs (boundary PCL) for
different levels of missing rate (We assume items in the dataset are missing with
certain rate): 0%, 10%, 20%, 30%, 40% and 50%. The accuracies are average
over 18 datasets. Gen PCL is less noise-sensitive than boundary PCL.
Fig. 6. Accuracy of gen PCL and boundary PCL in the presence of missing information
PCL and ePCL. We compare the accuracy of the original PCL and ePCL.
Since rPCL and ePCL give the same results, we do not show the accuracy of
rPCL here. Table 2 shows accuracy of ePCL and the original PCL in 18 datasets.
Overall, the improvement is 3.19%. In some datasets like Promoters, Wine and
Hepatitis, we get improvement of 26%, 11% and 15% respectively.
Table 2. Accuracy comparison of ePCL and PCL
Datasets ePCL PCL Improve (%) Datasets ePCL PCL Improve (%)
Mushroom 1 1 0 Iris 0.9999 0.9658 3.534591
Glass 0.9714 0.9675 0.404967 Zoo 0.9997 0.9528 4.925911
Promoter 0.8216 0.6525 25.92549 Vote 0.9439 0.9492 -0.55892
Splice 0.6667 0.6667 0 Hepatitis 0.8872 0.7716 14.98068
Pima 0.8192 0.8331 -1.67043 Hypothyroid 0.9962 0.9861 1.017527
Breast 0.9912 0.9907 0.050572 Cleve 0.9644 0.9959 -3.16567
German 0.9644 0.994 -2.97462 Lymph 1 1 0
Vehicle 0.9569 0.9437 1.390222 Waveform 0.8347 0.8053 3.643507
Wine 0.97892 0.8789 11.3941 Tic-tac-toe 0.9512 0.965 -1.42591
11. Finding Best Parameter for PCL 131
Table 3. Accuracy comparison of ePCL and PCL in difficult cases
Datasets Difficult PCL ePCL Datasets Difficult PCL ePCL
cases (%) cases (%)
mushroom 0 - - Iris 0 - -
Glass 33.94 0.9071 0.9183 zoo 26.53 0.8274 0.9989
Promoter 49.02 0.3123 0.647 Vote 16.31 0.6978 0.6663
Splice 100 0.6667 0.6667 Hepatitis 100 0.7716 0.8872
Pima 12.10 0 0 Hypothyroid 20 0.9327 0.9813
Breast 10.90 0.917 0.9214 Cleve ¡1 - -
German 10.60 0.9449 0.6745 Lymph 0 - -
Vehicle 3.7453 0 0 Waveform ¡2 - -
Wine 31 0.8788 0.9789 Tic-tac-toe 30.2 0.553 0.578
Recall in PCL, to classify one test instance, one score is computed for each
class. Some instances received zero score in one class and non-zero score in the
other class; thus the association rules vote unanimously to one class. We call these
the easy cases and the rest are difficult cases. The value of K has a much greater
impact on the difficult cases. Table 3 shows the accuracy of original PCL and
ePCL in difficult cases. Although we do not show statistics for easy cases here,
the accuracy is close to 100% for most easy cases. We do not show datasets with
too few difficult cases since there are not enough samples to evaluate accurately.
Performance. We compare the running time of rPCL, ePCL and original PCL.
The performance of ePCL compared to rPCL is good, demonstrating the supe-
riority of using pattern maintenance. Table 4 shows that ePCL is an order of
magnitude faster in many cases. In one case (vehicle.dat), rPCL could not com-
plete after more than one day, but ePCL completed within 4 hours. Nevertheless,
ePCL takes more time than PCL due to the repeating part; fortunately, this is
compensated by the much better accuracy of ePCL.
Table 4. Performance comparison
Datasets PCL rPCL ePCL Speed up Datasets PCL rPCL ePCL Speed up
(rPCL/ (rPCL/
ePCL) ePCL)
mushroom 2.75 84 35 2.4 Iris 2 99 3 33
glass 8.5 120 2 60 zoo 5 291 7 41.5
promoters 1 51 7 7.2 Vote 0.75 47 11 4.2
splice 2.5 129 4 32.2 hepatitis 0.5 38 3 12.6
pima 0.75 42 9 4.6 hypothyroid 2 105 43 2.4
breast 2.5 109 60 1.8 cleve 4.25 181 123 1.4
german 11 513 611 0.8 lymph 27.5 1224 512 2.3
vehicle 2518 too long 6966 A lot waveform 117 5738 2507 2.3
wine 3.25 188 17 11.1 tictactoe 13 88 43 2
12. 132 T.-S. Ngo et al.
6 Discussion and Conclusion
Cross-validation is a technique to assess the performance of classification algo-
rithms. Typically the testing set is much smaller than the training set and a
classifier is built from the training set which is largely similar between runs. Our
framework of maintenance can be applied to efficiently perform cross-validation
for pattern-based classification. Recall PSM allows us to maintain the set of
frequent generators while adding or deleting a relatively small portion of the
dataset. We only need to maintain a single frequent pattern set and incremen-
tally adjust using PSM. In addition, for more efficient computation, PSM can be
extended to handle multiple additions/deletions at the same time by organizing
transactions in a prefix tree.
We showed the importance of finding the top K patterns for PCL. We intro-
duced a method to perform the task efficiently, making use of the PSM main-
tenance algorithm [5]. The maintenance framework can be applied to general
cross-validation tasks which involve frequent update of the pattern space. Our
experiments showed improvement in accuracy. However, we compromised on
complexity. We hope to get a better implementation for ePCL, thus minimiz-
ing the running time to close to original PCL and implement the mentioned
extensions.
References
1. Bailey, J., Manoukian, T., Ramamohanarao, K.: Classification using constrained
emerging patterns. In: Dong, G., Tang, C., Wang, W. (eds.) WAIM 2003. LNCS,
vol. 2762, pp. 226–237. Springer, Heidelberg (2003)
2. Cheng, H., et al.: Discriminative frequent pattern analysis for effective classifica-
tion. In: Proc 23th ICDE, pp. 716–725 (2007)
3. Cheng, H., et al.: Direct discriminative pattern mining for effective classification.
In: Proc 24th ICDE, pp. 169–178 (2008)
4. Dong, G., et al.: CAEP: Classification by Aggregating Emerging Patterns. In: Proc.
2nd Intl. Conf. on Discovery Science, pp. 30–42 (1999)
5. Feng, M.: Frequent Pattern Maintenance: Theories and Algorithms. PhD thesis,
Nanyang Technological University (2009)
6. Feng, M., et al.: Negative generator border for effective pattern maintenance. In:
Proc 4th Intl. Conf. on Advanced Data Mining and Applications, pp. 217–228
(2008)
7. Feng, M., et al.: Evolution and maintenance of frequent pattern space when trans-
actions are removed. In: Zhou, Z.-H., Li, H., Yang, Q. (eds.) PAKDD 2007. LNCS
(LNAI), vol. 4426, pp. 489–497. Springer, Heidelberg (2007)
8. Li, W., Han, J., Pei, J.: CMAR: Accurate and efficient classification based on
multiple class-association rules. In: Proc 1st ICDM, pp. 369–376 (2001)
9. Li, J., Wong, L.: Solving the fragmentation problem of decision trees by discovering
boundary emerging patterns. In: Proc. 2nd ICDM, pp. 653–656 (2002)
10. Li, J., Wong, L.: Structural geography of the space of emerging patterns. Intelligent
Data Analysis 9(6), 567–588 (2005)
13. Finding Best Parameter for PCL 133
11. Li, J., Ramamohanarao, K., Dong, G.: The space of jumping emerging patterns
and its incremental maintenance algorithms. In: Proc. of 17th ICML, pp. 551–558
(2000)
12. Li, J., et al.: Minimum description length principle: Generators are preferable to
closed patterns. In: Proc. 21st Natl. Conf. on Artificial Intelligence, pp. 409–415
(2006)
13. Li, H., et al.: Relative risk and odds ratio: A data mining perspective. In: Proc
24th PODS, pp. 368–377 (2005)
14. Liu, B., Hsu, W., Ma, Y.: Integrating classification and association rule mining.
In: Proc. 4th KDD, pp. 80–86 (1998)
15. Ramamohanarao, K., Fan, H.: Patterns based classifiers. In: Proc. 16th WWW,
pp. 71–83 (2007)
16. Ramamohanarao, K., Bailey, J.: Discovery of emerging patterns and their use in
classification. In: Gedeon, T(T.) D., Fung, L.C.C. (eds.) AI 2003. LNCS (LNAI),
vol. 2903, pp. 1–12. Springer, Heidelberg (2003)
17. Thabtah, F.A., Cowling, P., Peng, Y.: MMAC: A new Multi-class Multi-label As-
sociative Classification approach. In: Proc. 4th ICDM, pp. 217–224 (2004)
18. Yin, X., Han, J.: CPAR: Classification based on Predictive Association Rules. In:
Proc. 3rd SDM, pp. 331–335 (2003)
19. Wang, J., Karypis, G.: HARMONY: Efficiently mining the best rules for classifi-
cation. In: Proc. 5th SDM, pp. 205–216 (2005)
20. Zhang, X., Dong, G., Ramamohanarao, K.: Information-based classification by
aggregating emerging patterns. In: Proc. 2nd Intl. Conf. on Intelligent Data Engi-
neering and Automated Learning, pp. 48–53 (2000)