Supervised classification algorithms are now routinely applied in many application domains. A large number of base algorithms, together with their variants, gives rise to a palette of several dozen options, so selecting the right algorithm, or at least narrowing down the candidates, is valuable.
In this presentation we explore meta-attributes of data sets to predict which set of algorithms will perform well when used for supervised classification of a data set. We restrict our attention to binary classification.
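As a rough illustration of the idea, a few simple meta-attributes can be computed directly from a data set. The specific attributes below (size, dimensionality, class imbalance, mean attribute skew) are hypothetical stand-ins, not necessarily the set used in the presentation:

```python
import numpy as np

def meta_attributes(X, y):
    """A few simple meta-attributes of a binary classification data set
    (illustrative choices, not the exact set from the presentation)."""
    n, d = X.shape
    counts = np.bincount(y)
    # Per-attribute skewness, then averaged across attributes
    skew = ((X - X.mean(0)) ** 3).mean(0) / (X.std(0) ** 3 + 1e-12)
    return {
        "n_instances": n,
        "n_attributes": d,
        "class_imbalance": counts.max() / counts.min(),
        "mean_attr_skew": float(skew.mean()),
    }

# Tiny synthetic binary data set
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)
m = meta_attributes(X, y)
```

A meta-learner would then be trained on such attribute vectors, labeled with which algorithms performed well on each data set.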
The document compares constructive meta-learning and stacking methods for composing inductive applications. It presents CAMLET, a tool for constructive meta-learning that analyzes learning algorithms, organizes them in a repository, and searches for compositions. A case study shows CAMLET achieving accuracies on par with stacking on common datasets and good parallel efficiency for composition.
Zero-shot Image Recognition Using Relational Matching, Adaptation and Calibra..., by Debasmit Das
This document proposes a three-step approach for zero-shot image recognition using relational matching, domain adaptation, and calibration. The approach uses relational matching to find structural correspondences between semantic embeddings and features, domain adaptation to adapt unseen semantic embeddings to the test data domain, and calibration to reduce bias towards seen classes. Experimental results on four datasets show improved zero-shot and generalized zero-shot classification performance compared to previous methods, with domain adaptation providing the most benefit. Analysis of hubness and convergence properties are also presented.
This document discusses various machine learning techniques for transfer learning, including unsupervised domain adaptation (UDA), few-shot learning (FSL), zero-shot learning (ZSL), and hypothesis transfer learning (HTL). For UDA, the author proposes graph matching approaches to minimize domain discrepancy between source and target domains. For FSL, a two-stage approach is used to estimate novel class prototypes and variances. For ZSL, an approach is described that uses relational matching, adaptation, and calibration. For HTL, estimating novel class prototypes from source prototypes and sparse target data is discussed. Experimental results demonstrate the effectiveness of the proposed approaches.
IRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive, by IRJET Journal
The document describes a project to develop a deep learning model to predict hardware performance. The model takes hardware configuration parameters like CPU, memory, etc. as input and predicts benchmark scores. The authors preprocessed data, tested various regression models like linear regression and lasso regression, and techniques like backward elimination and cross-validation. Their best model used backward elimination and linear regression, achieving 80.82% accuracy. The project aims to automate hardware performance analysis and prediction to save time compared to manual methods.
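The backward-elimination step described above can be sketched as a greedy loop that drops a feature whenever doing so improves cross-validated R²; the stopping criterion and scoring choice here are assumptions, since the summary does not specify them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def backward_eliminate(X, y, cv=5):
    """Greedily drop the first feature whose removal improves mean
    cross-validated R^2; stop when no removal helps."""
    keep = list(range(X.shape[1]))
    best = cross_val_score(LinearRegression(), X[:, keep], y, cv=cv).mean()
    improved = True
    while improved and len(keep) > 1:
        improved = False
        for j in list(keep):
            trial = [k for k in keep if k != j]
            score = cross_val_score(LinearRegression(), X[:, trial], y, cv=cv).mean()
            if score > best:
                best, keep, improved = score, trial, True
                break
    return keep, best

# Synthetic data: only the first two of six features carry signal
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200)
keep, score = backward_eliminate(X, y)
```

On real hardware-benchmark data the loop would run over configuration columns (CPU, memory, etc.) instead of synthetic features.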
Comparative Recommender System Evaluation: Benchmarking Recommendation Frame..., by Alan Said
Video available here http://www.youtube.com/watch?v=1jHxGCl8RXc
Recommender systems research is often based on comparisons of predictive accuracy: the better the evaluation scores, the better the recommender.
However, it is difficult to compare results from different recommender systems due to the many options in design and implementation of an evaluation strategy.
Additionally, algorithmic implementations can diverge from the standard formulation due to manual tuning and modifications that work better in some situations.
In this work we compare common recommendation algorithms as implemented in three popular recommendation frameworks.
To provide a fair comparison, we have complete control of the evaluation dimensions being benchmarked: dataset, data splitting, evaluation strategies, and metrics.
We also include results using the internal evaluation mechanisms of these frameworks.
Our analysis points to large differences in recommendation accuracy across frameworks and strategies, i.e. the same baselines may perform orders of magnitude better or worse across frameworks.
Our results show the necessity of clear guidelines when reporting evaluation of recommender systems to ensure reproducibility and comparison of results.
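One source of the divergence described above is that frameworks define even simple metrics differently. A fully explicit precision@k, computed for a single user, removes that ambiguity:

```python
def precision_at_k(recommended, relevant, k=10):
    """Precision@k for one user: fraction of the top-k recommended
    items that appear in the user's held-out relevant set."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

# Hypothetical toy example: 10 ranked recommendations, 3 hits
recs = [3, 17, 5, 42, 8, 23, 11, 9, 30, 2]
held_out = {17, 8, 2, 99}
p = precision_at_k(recs, held_out, k=10)  # 3 of the top 10 are relevant
```

Note the choices baked in even here (divide by k, not by the number of relevant items; no handling of users with fewer than k recommendations); each such choice varies across frameworks.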
This document discusses various transfer learning techniques for machine learning, including domain adaptation and small sample learning. It proposes three methods for unsupervised domain adaptation that use graph or hypergraph matching to minimize domain discrepancy: 1) Graph Matching, 2) Hypergraph Matching, and 3) Graph Matching with representation learning. For small sample learning, it discusses approaches for few-shot learning and zero-shot learning, and proposes a two-stage solution for few-shot learning that learns a discriminative low-dimensional space and estimates class variance, and a method for zero-shot learning that matches features to semantics. Evaluation on standard datasets shows the proposed methods achieve competitive performance.
Classification techniques in data mining, by Kamal Acharya
The document discusses classification algorithms in machine learning. It provides an overview of various classification algorithms including decision tree classifiers, rule-based classifiers, nearest neighbor classifiers, Bayesian classifiers, and artificial neural network classifiers. It then describes the supervised learning process for classification, which involves using a training set to construct a classification model and then applying the model to a test set to classify new data. Finally, it provides a detailed example of how a decision tree classifier is constructed from a training dataset and how it can be used to classify data in the test set.
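The train-then-classify workflow described above can be sketched with scikit-learn; the Iris data set here is a convenient stand-in for the document's training example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split labeled data into a training set and a test set
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Construct the classification model from the training set,
# then apply it to classify the held-out test set
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```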
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model, by Lifeng (Aaron) Han
This document summarizes an experiment on using graph-based semi-supervised learning to improve a conditional random field model for Chinese named entity recognition. The experiment used unlabeled data from previous NER tasks to extend the labeled training data via label propagation. This enhanced CRF model was evaluated on a standard test corpus and showed a slight improvement over a closed CRF baseline, particularly for person and organization entities. However, the unlabeled data was not large enough to cover all entity types. Future work could explore using more unlabeled data and optimizing features for the graph construction.
This document discusses classification and prediction. Classification predicts categorical class labels by classifying data based on a training set and class labels. Prediction models continuous values and predicts unknown values. Some applications are credit approval, marketing, medical diagnosis, and treatment analysis. Classification involves a learning step to describe classes and a classification step to classify new data. Prediction involves estimating accuracy by comparing test results to known labels. Issues with classification and prediction include data preparation, comparing methods, and decision tree induction algorithms.
Presentation slides for my PhD dissertation on developing machine learning algorithms to analyze multi-dimensional genomic data such as microarrays
Analysis of Textual Data Classification with a Reddit Comments Dataset, by AdamBab
McGill COMP551, Applied Machine Learning Course
The main objective of this project is to categorize comments from the American social news aggregation, web content rating, and discussion website, Reddit.
This document provides a summary of Bayesian classification. Bayesian classification predicts the probability of class membership for new data instances based on prior knowledge and training data. It uses Bayes' theorem to calculate the posterior probability of a class given the attributes of an instance. The naive Bayesian classifier assumes attribute independence and uses frequency counts to estimate probabilities. It classifies new instances by selecting the class with the highest posterior probability. The example shows how probabilities are estimated from training data and used to classify an unseen instance in the play-tennis dataset.
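The frequency-count estimation and highest-posterior classification described above can be written out directly; the tiny weather table below is a simplified stand-in for the full play-tennis data set:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(class) and P(attribute=value | class) by frequency counts."""
    prior = Counter(labels)
    cond = defaultdict(Counter)  # (attr_index, class) -> value counts
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, c)][v] += 1
    return prior, cond

def classify(row, prior, cond, n):
    """Pick the class whose (proportional) posterior probability is highest,
    assuming attribute independence."""
    best_c, best_p = None, -1.0
    for c, pc in prior.items():
        p = pc / n  # prior P(class)
        for i, v in enumerate(row):
            p *= cond[(i, c)][v] / pc  # P(attribute=value | class)
        if p > best_p:
            best_c, best_p = c, p
    return best_c

# Simplified play-tennis-style data: (outlook, wind) -> play?
rows = [("sunny", "weak"), ("sunny", "strong"), ("rain", "weak"),
        ("overcast", "weak"), ("rain", "strong"), ("overcast", "strong")]
labels = ["no", "no", "yes", "yes", "no", "yes"]
prior, cond = train_nb(rows, labels)
pred = classify(("overcast", "weak"), prior, cond, len(rows))
```

A production version would add Laplace smoothing so that an unseen attribute value does not zero out a class's posterior, as happens for class "no" here.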
Chapter 6 slides from "Data Mining: Concepts and Techniques", 2nd Ed. (Han & Kamber), by error007
The document describes Chapter 6 of the book "Data Mining: Concepts and Techniques" which covers the topics of classification and prediction. It defines classification and prediction and discusses key issues in classification such as data preparation, evaluating methods, and decision tree induction. Decision tree induction creates a tree model by recursively splitting the training data on attributes and their values to make predictions. The chapter also covers other classification methods like Bayesian classification, rule-based classification, and support vector machines. It describes the process of model construction from training data and then using the model to classify new, unlabeled data.
Alleviating cold-user start problem with users' social network data in recomm..., by Eduardo Castillejo Gil
This work explores the possibility of using relevant data from users' social networks to alleviate the cold-user problem in a recommender system domain. The proposed solution extracts the most valuable node in the graph generated by checking in at a venue with an Android application using the Foursquare API. By obtaining the recommendations for this node we estimate the probability that some categories are similar to users' tastes...
Evolutionary Search Techniques with Strong Heuristics for Multi-Objective Fea..., by Abdel Salam Sayyad
This document summarizes Abdel Salam Sayyad's doctoral defense that addressed using evolutionary search techniques with strong heuristics for multi-objective feature selection in software product lines. The defense outlined modeling feature models, analyzing them automatically, formulating the multi-objective feature selection problem, and using multi-objective evolutionary algorithms. Results demonstrated scalability by increasing objectives, tuning parameters, and using heuristics like "PUSH" and "PULL" as well as population seeding. The defense concluded by discussing future work.
A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems, by Alan Said
The evaluation of recommender systems is crucial for their development. In today's recommendation landscape there are many standardized recommendation algorithms and approaches; however, there exists no standardized method for the experimental setup of evaluation -- not even for widely used measures such as precision and root-mean-squared error. This creates a setting where comparison of recommendation results using the same datasets becomes problematic. In this paper, we propose an evaluation protocol specifically developed with the recommendation use-case in mind, i.e. the recommendation of one or several items to an end user. The protocol attempts to closely mimic a scenario of a deployed (production) recommendation system, taking specific user aspects into consideration and allowing a comparison of small and large scale recommendation systems. The protocol is evaluated on common recommendation datasets and compared to traditional recommendation settings found in the research literature. Our results show that the proposed model can better capture the quality of a recommender system than traditional evaluation does, and is not affected by characteristics of the data (e.g. size, sparsity, etc.).
Data.Mining.C.6(II).classification and prediction, by Margaret Wang
The document summarizes different machine learning classification techniques including instance-based approaches, ensemble approaches, co-training approaches, and partially supervised approaches. It discusses k-nearest neighbor classification and how it works. It also explains bagging, boosting, and AdaBoost ensemble methods. Co-training uses two independent views to label unlabeled data. Partially supervised approaches can build classifiers using only positive and unlabeled data.
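The bagging and AdaBoost ensembles described above can be sketched with scikit-learn; the synthetic data set and hyperparameters here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: full trees on bootstrap resamples, combined by majority vote
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=0).fit(X_tr, y_tr)
# AdaBoost: weak learners trained in sequence, upweighting
# the examples the previous learners misclassified
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

bag_acc = bag.score(X_te, y_te)
ada_acc = ada.score(X_te, y_te)
```

Bagging mainly reduces variance of an unstable learner; boosting also reduces bias by focusing successive learners on the hard examples.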
Presentation made during the Intelligent User-Adapted Interfaces: Design and Multi-Modal Evaluation Workshop (IUadaptME) workshop conducted as part of UMAP 2018
Fuzzy logic applications for data acquisition systems of practical measurement, by IJECEIAES
In laboratory work, errors in measurement, misreadings of the measuring devices, overly similar experimental data, and lack of understanding of the practicum materials are often found. These lead to inaccurate and invalid data. As an alternative solution, fuzzy logic is applied to the data acquisition system using a web server. This research focuses on the design of data acquisition systems with the target of reducing the error rate in measuring experimental data in the laboratory. Data measurement on the laboratory practice module is done by taking the analog data resulting from the measurement. The data are then converted into digital data via an Arduino and stored on the server. To obtain valid data, the server processes the data using the fuzzy logic method. The valid data are integrated into a web server so that they can be accessed as needed. The results showed that the fuzzy-logic-based data acquisition system is able to provide recommendations on measurement results in the lab work based on the degree of membership and truth value. Fuzzy logic selects measured data with a maximum error percentage of 5% and selects the measurement result with the minimum error rate.
Pareto-Optimal Search-Based Software Engineering (POSBSE): A Literature Survey, by Abdel Salam Sayyad
Paper presented at the 2nd International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE'13), San Francisco, USA, May 2013.
This document discusses computational intelligence and supervised learning techniques for classification. It provides examples of applications in medical diagnosis and credit card approval. The goal of supervised learning is to learn from labeled training data to predict the class of new unlabeled examples. Decision trees and backpropagation neural networks are introduced as common supervised learning algorithms. Evaluation methods like holdout validation, cross-validation and performance metrics beyond accuracy are also summarized.
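The evaluation methods mentioned, cross-validation with metrics beyond plain accuracy, can be sketched as follows; the data set and classifier are stand-ins for the medical-diagnosis example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

# 5-fold cross-validation, scored with several metrics at once
scores = cross_validate(clf, X, y, cv=5,
                        scoring=["accuracy", "precision", "recall", "f1"])
mean_f1 = scores["test_f1"].mean()
```

Precision and recall matter here precisely because, in diagnosis-style problems with imbalanced classes, a high-accuracy classifier can still miss most positives.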
This document proposes an online course recommendation system that uses machine learning algorithms like K-nearest neighbor (KNN), K-means clustering, and collaborative filtering to recommend courses to students. It extracts student data like marks, attendance, and teacher ratings to classify students and identify lacking skills. It then generates personalized course recommendations and study material links for each student cluster. Finally, it provides recommendations to students using collaborative filtering by rating previously recommended links. The system aims to provide more effective recommendations than solely using collaborative filtering by integrating multiple student attributes.
This document describes a Yelp data challenge to predict user ratings of businesses from user review text using classical machine learning algorithms and deep learning techniques. It provides details on the problem definition, preprocessing steps, models used, and results. Classical machine learning approaches such as Naive Bayes, logistic regression, and SVM were able to predict ratings with around 67% accuracy, while a convolutional neural network achieved slightly higher accuracy of 73.5%. A Docker image containing the code was also created to allow easy running of the models.
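A minimal version of the classical pipeline (TF-IDF features plus logistic regression) looks like this; the six-review corpus is fabricated for illustration, not Yelp data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus standing in for Yelp review text
reviews = ["terrible food and rude staff", "awful experience, never again",
           "cold food, slow service", "amazing food and friendly staff",
           "great experience, will return", "delicious food, fast service"]
stars = [1, 1, 1, 5, 5, 5]

# Vectorize the text, then fit a linear classifier on the star labels
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, stars)
pred = model.predict(["friendly staff and great food"])[0]
```

The CNN variant from the document would replace the TF-IDF bag-of-words with learned word embeddings and convolutional filters over word windows.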
This document discusses classification and prediction techniques in data mining. It covers various classification methods like decision tree induction, Bayesian classification, and support vector machines. It also discusses scaling classification to large databases, evaluating model accuracy, and presenting classification results visually. The key methods covered are decision tree construction using information gain, the naĆÆve Bayesian classifier based on Bayes' theorem, and scaling tree learning using techniques like RainForest.
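The information-gain criterion used in decision tree construction reduces to a small entropy computation:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a class-label multiset, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction from splitting `labels` on attribute `values`."""
    n = len(labels)
    split_entropy = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        split_entropy += len(subset) / n * entropy(subset)
    return entropy(labels) - split_entropy

# A perfectly informative attribute vs. a useless one
labels = ["yes", "yes", "no", "no"]
gain_good = information_gain(["a", "a", "b", "b"], labels)  # separates classes
gain_bad = information_gain(["a", "b", "a", "b"], labels)   # does not
```

Tree induction picks the attribute with the highest gain at each node, then recurses on the resulting subsets.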
This document discusses item-based collaborative filtering for recommender systems. It describes how item-based collaborative filtering works by predicting a target user's rating for an item based on the ratings of similar items. It highlights advantages over user-based filtering like lower computational cost and more stable similarity computations. Key aspects covered include using cosine similarity to calculate item similarities, adjusting for individual rating biases, selecting the top K similar items, and predicting ratings based on similar items' ratings.
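The core of item-based filtering, cosine similarity between item columns followed by a similarity-weighted average over the most similar items, can be sketched as follows (toy ratings, and without the rating-bias adjustment mentioned above):

```python
import numpy as np

def item_similarities(R):
    """Cosine similarity between item columns of a user-item rating
    matrix (0 = unrated)."""
    norms = np.linalg.norm(R, axis=0, keepdims=True)
    return (R.T @ R) / (norms.T @ norms + 1e-12)

def predict(R, sim, user, item, k=2):
    """Predict a rating as the similarity-weighted average of the user's
    ratings on the k items most similar to `item`."""
    rated = np.where(R[user] > 0)[0]
    top = rated[np.argsort(sim[item, rated])[::-1][:k]]
    w = sim[item, top]
    return float((w @ R[user, top]) / (w.sum() + 1e-12))

# Toy matrix: 4 users x 3 items, ratings 1-5, 0 means unrated
R = np.array([[5, 4, 0],
              [4, 5, 1],
              [1, 1, 5],
              [0, 1, 4]], dtype=float)
sim = item_similarities(R)
r_hat = predict(R, sim, user=0, item=2)  # user 0's missing rating for item 2
```

The stability advantage noted above comes from these item-item similarities changing slowly, so they can be precomputed offline.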
Building a Classifier Employing Prism Algorithm with Fuzzy Logic, by IJDKP
Classification in data mining is receiving immense interest in recent times. As knowledge is based on historical data, classification of data is essential for discovering that knowledge. To decrease classification complexity, the quantitative attributes of the data need splitting, but splitting using classical logic is less accurate. This can be overcome by the use of fuzzy logic. This paper illustrates how to build up classification rules using fuzzy logic. The fuzzy classifier is built using the Prism decision tree algorithm, and produces more realistic results than the classical one. The effectiveness of the method is demonstrated on a sample dataset.
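The fuzzy splitting of quantitative attributes can be illustrated with triangular membership functions; the attribute and breakpoints below are hypothetical, not taken from the paper:

```python
def triangular(x, a, b, c):
    """Triangular membership: rises from 0 at a to 1 at b, falls to 0 at c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Hypothetical fuzzy partition of a quantitative attribute (temperature, deg C)
SETS = {"cool": (-10, 5, 18), "mild": (10, 18, 26), "hot": (20, 30, 45)}

def fuzzify(x):
    """Degree of membership of x in each fuzzy set."""
    return {name: triangular(x, *abc) for name, abc in SETS.items()}

m = fuzzify(19)  # a value near the 'mild' peak
```

Unlike a crisp threshold, a value near a boundary gets partial membership in the neighboring sets, which is what makes the induced rules less brittle.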
Improving neural question generation using answer separation, by NAVER Engineering
Neural question generation (NQG) is the task of generating a question from a given passage with deep neural networks. Previous NQG models suffer from a problem that a significant proportion of the generated questions include words in the question target, resulting in the generation of unintended questions. In this paper, we propose answer-separated seq2seq, which better utilizes the information from both the passage and the target answer. By replacing the target answer in the original passage with a special token, our model learns to identify which interrogative word should be used. We also propose a new module termed keyword-net, which helps the model better capture the key information in the target answer and generate an appropriate question. Experimental results demonstrate that our answer separation method significantly reduces the number of improper questions which include answers. Consequently, our model significantly outperforms previous state-of-the-art NQG models.
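The answer-separation preprocessing step, replacing the target answer with a special token so the generator cannot copy answer words into the question, is simple to sketch; the token name `<a>` and the example passage are assumptions for illustration:

```python
def separate_answer(passage, answer, token="<a>"):
    """Replace the first occurrence of the target answer span in the
    passage with a special token."""
    return passage.replace(answer, token, 1)

passage = "Marie Curie won the Nobel Prize in Physics in 1903."
masked = separate_answer(passage, "Marie Curie")
```

The seq2seq model is then trained on the masked passage, with the answer fed separately through the keyword-net.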
Machine learning algorithms are categorized as either supervised or unsupervised. Supervised algorithms learn from labeled examples to predict future labels, while unsupervised algorithms find hidden patterns in unlabeled data. Specifically, supervised algorithms are presented with labeled training data and learn a model to predict the class labels of new test data. Common supervised algorithms include neural networks, decision trees, k-nearest neighbors, and naive Bayes classifiers. Naive Bayes is an easy-to-implement algorithm that assumes independence between features. It has been successfully applied to problems like spam filtering.
This document provides an overview of supervised and unsupervised learning, with a focus on clustering as an unsupervised learning technique. It describes the basic concepts of clustering, including how clustering groups similar data points together without labeled categories. It then covers two main clustering algorithms - k-means, a partitional clustering method, and hierarchical clustering. It discusses aspects like cluster representation, distance functions, strengths and weaknesses of different approaches. The document aims to introduce clustering and compare it with supervised learning.
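The k-means partitional method described above can be sketched with scikit-learn on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated blobs; k-means should recover the grouping
# without ever seeing labels
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# Alternate between assigning points to the nearest centroid and
# recomputing centroids, restarting n_init times from random seeds
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_
n_clusters_found = len(set(labels))
```

This illustrates the contrast with supervised learning: the only input is the distance structure of the data, and the "classes" are discovered rather than given.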
This document discusses classification and prediction. Classification predicts categorical class labels by classifying data based on a training set and class labels. Prediction models continuous values and predicts unknown values. Some applications are credit approval, marketing, medical diagnosis, and treatment analysis. Classification involves a learning step to describe classes and a classification step to classify new data. Prediction involves estimating accuracy by comparing test results to known labels. Issues with classification and prediction include data preparation, comparing methods, and decision tree induction algorithms.
Presentation slides for my PhD thesis dissertation on machine learning algorithm development to analyze multi dimensional genomic data such as microarrays
Analysis of Textual Data Classification with a Reddit Comments DatasetAdamBab
Ā
McGill COMP551, Applied Machine Learning Course
The main objective of this project is to categorize comments from the American social
news aggregation, web content rating, and discussion website ā Reddit.
This document provides a summary of Bayesian classification. Bayesian classification predicts the probability of class membership for new data instances based on prior knowledge and training data. It uses Bayes' theorem to calculate the posterior probability of a class given the attributes of an instance. The naive Bayesian classifier assumes attribute independence and uses frequency counts to estimate probabilities. It classifies new instances by selecting the class with the highest posterior probability. The example shows how probabilities are estimated from training data and used to classify an unseen instance in the play-tennis dataset.
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
Ā
The document describes Chapter 6 of the book "Data Mining: Concepts and Techniques" which covers the topics of classification and prediction. It defines classification and prediction and discusses key issues in classification such as data preparation, evaluating methods, and decision tree induction. Decision tree induction creates a tree model by recursively splitting the training data on attributes and their values to make predictions. The chapter also covers other classification methods like Bayesian classification, rule-based classification, and support vector machines. It describes the process of model construction from training data and then using the model to classify new, unlabeled data.
Alleviating cold-user start problem with users' social network data in recomm...Eduardo Castillejo Gil
Ā
This work explores the possibility of using relevant data from usersā
social network to alleviate the cold-user problems in a recommender
system domain. The proposed solution extracts the most valuable
node in the graph generated by check in a venue with an Android
application using the Foursquare API. By obtaining the recommendations to this node we estimate the probability of some categories
to be similar to users tastes...
Evolutionary Search Techniques with Strong Heuristics for Multi-Objective Fea...Abdel Salam Sayyad
Ā
This document summarizes Abdel Salam Sayyad's doctoral defense that addressed using evolutionary search techniques with strong heuristics for multi-objective feature selection in software product lines. The defense outlined modeling feature models, analyzing them automatically, formulating the multi-objective feature selection problem, and using multi-objective evolutionary algorithms. Results demonstrated scalability by increasing objectives, tuning parameters, and using heuristics like "PUSH" and "PULL" as well as population seeding. The defense concluded by discussing future work.
A Top-N Recommender System Evaluation Protocol Inspired by Deployed SystemsAlan Said
Ā
he evaluation of recommender systems is crucial for their development. In today's recommendation landscape there are many standardized recommendation algorithms and approaches, however, there exists no standardized method for experimental setup of evaluation -- not even for widely used measures such as precision and root-mean-squared error. This creates a setting where comparison of recommendation results using the same datasets becomes problematic. In this paper, we propose an evaluation protocol specifically developed with the recommendation use-case in mind, i.e. the recommendation of one or several items to an end user. The protocol attempts to closely mimic a scenario of a deployed (production) recommendation system, taking specific user aspects into consideration and allowing a comparison of small and large scale recommendation systems. The protocol is evaluated on common recommendation datasets and compared to traditional recommendation settings found in research literature. Our results show that the proposed model can better capture the quality of a recommender system than traditional evaluation does, and is not affected by characteristics of the data (e.g. size. sparsity, etc.).
Data.Mining.C.6(II).classification and predictionMargaret Wang
Ā
The document summarizes different machine learning classification techniques including instance-based approaches, ensemble approaches, co-training approaches, and partially supervised approaches. It discusses k-nearest neighbor classification and how it works. It also explains bagging, boosting, and AdaBoost ensemble methods. Co-training uses two independent views to label unlabeled data. Partially supervised approaches can build classifiers using only positive and unlabeled data.
Presentation made during the Intelligent User-Adapted Interfaces: Design and Multi-Modal Evaluation Workshop (IUadaptME) workshop conducted as part of UMAP 2018
Fuzzy logic applications for data acquisition systems of practical measurement IJECEIAES
Ā
In laboratory works, the error in measurement, reading the measurring devices, similarity of experimental data and lack of understanding of practicum materials are often found. These will lead to the inacurracy and invalid in data obtanined. As an alternative solution, application of fuzzy logic to the data acquisition system using a web server. This research focuses on the design of data acquisition systems with the target of reducing the error rate in measuring experimental data on the laboratory. Data measurement on laboratory practice module is done by taking the analog data resulted from the measurement. Furthermore, the data are converted into digital data via arduino and stored on the server. To get valid data, the server will process the data by using fuzzy logic method. The valid data are integrated into a web server so that it can be accessed as needed. The results showed that the data acquisition system based on fuzzy logic is able to provide recommendation of measurement result on the lab works based on the degree value of membership and truth value. Fuzzy logic will select the measured data with a maximum error percentage of 5% and select the measurement result which has minimum error rate.
Pareto-Optimal Search-Based Software Engineering (POSBSE): A Literature Survey (Abdel Salam Sayyad)
Paper presented at the 2nd International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE'13), San Francisco, USA, May 2013.
This document discusses computational intelligence and supervised learning techniques for classification. It provides examples of applications in medical diagnosis and credit card approval. The goal of supervised learning is to learn from labeled training data to predict the class of new unlabeled examples. Decision trees and backpropagation neural networks are introduced as common supervised learning algorithms. Evaluation methods like holdout validation, cross-validation and performance metrics beyond accuracy are also summarized.
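The holdout and cross-validation methods mentioned above both come down to splitting the data into disjoint train/test index sets. A minimal sketch of generating k-fold splits; the fold-balancing convention shown (distributing the remainder over the first folds) is one common choice, not taken from the document:

```python
def kfold_indices(n, k):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    # Spread the remainder n % k over the first folds so sizes differ by at most 1
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

# 5-fold split of 10 examples: every example appears in exactly one test fold
folds = list(kfold_indices(10, 5))
```

Accuracy is then averaged over the k test folds, which is exactly why cross-validating many candidate algorithms is expensive: every algorithm is trained k times.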
This document proposes an online course recommendation system that uses machine learning algorithms like K-nearest neighbor (KNN), K-means clustering, and collaborative filtering to recommend courses to students. It extracts student data like marks, attendance, and teacher ratings to classify students and identify lacking skills. It then generates personalized course recommendations and study material links for each student cluster. Finally, it provides recommendations to students using collaborative filtering by rating previously recommended links. The system aims to provide more effective recommendations than solely using collaborative filtering by integrating multiple student attributes.
This document describes a Yelp data challenge to predict user ratings of businesses from user review text using classical machine learning algorithms and deep learning techniques. It provides details on the problem definition, preprocessing steps, models used, and results. Classical machine learning approaches such as Naive Bayes, logistic regression, and SVM predicted ratings with around 67% accuracy, while a convolutional neural network achieved a higher accuracy of 73.5%. A Docker image containing the code was also created to allow easy running of the models.
This document discusses classification and prediction techniques in data mining. It covers various classification methods like decision tree induction, Bayesian classification, and support vector machines. It also discusses scaling classification to large databases, evaluating model accuracy, and presenting classification results visually. The key methods covered are decision tree construction using information gain, the naĆÆve Bayesian classifier based on Bayes' theorem, and scaling tree learning using techniques like RainForest.
This document discusses item-based collaborative filtering for recommender systems. It describes how item-based collaborative filtering works by predicting a target user's rating for an item based on the ratings of similar items. It highlights advantages over user-based filtering like lower computational cost and more stable similarity computations. Key aspects covered include using cosine similarity to calculate item similarities, adjusting for individual rating biases, selecting the top K similar items, and predicting ratings based on similar items' ratings.
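The item-based prediction described above can be sketched as follows. This is a minimal sketch using plain cosine similarity and a similarity-weighted average over the top-K similar items; the toy rating vectors and function names are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two item rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def predict_rating(target, user_ratings, item_vectors, k=2):
    """Similarity-weighted average of the user's ratings of the k most similar items."""
    sims = sorted(((cosine(item_vectors[target], item_vectors[i]), r)
                   for i, r in user_ratings.items() if i != target), reverse=True)[:k]
    den = sum(abs(s) for s, _ in sims)
    return sum(s * r for s, r in sims) / den if den else 0.0

# Each vector holds three users' ratings of one item (toy data)
item_vectors = {"A": [5, 3, 4], "B": [4, 3, 5], "C": [1, 5, 2]}
user_ratings = {"A": 5, "C": 1}  # the target user's known ratings
prediction = predict_rating("B", user_ratings, item_vectors)
```

The bias adjustment mentioned in the document would subtract each user's mean rating before computing similarities; it is omitted here for brevity.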
Building a Classifier Employing Prism Algorithm with Fuzzy Logic (IJDKP)
Classification in data mining is receiving immense interest in recent times. As knowledge is based on historical data, classifying the data is essential for discovering that knowledge. To decrease classification complexity, the quantitative attributes of the data need splitting, but splitting using classical logic is less accurate; this can be overcome by using fuzzy logic. This paper illustrates how to build classification rules using fuzzy logic. The fuzzy classifier is built using the Prism decision-tree algorithm and produces more realistic results than the classical one. The effectiveness of the method is demonstrated on a sample dataset.
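Fuzzifying a quantitative attribute, as the splitting step above requires, is often done with triangular membership functions. A minimal sketch; the breakpoints (a, b, c) and the example values are illustrative, not taken from the paper:

```python
def triangular(x, a, b, c):
    """Membership degree of x in the triangular fuzzy set with breakpoints (a, b, c)."""
    if x <= a or x >= c:
        return 0.0
    if x == b:
        return 1.0
    # Rise linearly from a to b, fall linearly from b to c
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Fuzzify a quantitative attribute: value 2.5 in a set spanning 0..10 peaking at 5
degree = triangular(2.5, 0, 5, 10)  # 0.5
```

A rule learner like Prism can then fire rules to the degree an example belongs to each fuzzy set, instead of using hard interval boundaries.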
Improving neural question generation using answer separation (NAVER Engineering)
Neural question generation (NQG) is the task of generating a question from a given passage with deep neural networks. Previous NQG models suffer from a problem that a significant proportion of the generated questions include words in the question target, resulting in the generation of unintended questions. In this paper, we propose answer-separated seq2seq, which better utilizes the information from both the passage and the target answer. By replacing the target answer in the original passage with a special token, our model learns to identify which interrogative word should be used. We also propose a new module termed keyword-net, which helps the model better capture the key information in the target answer and generate an appropriate question. Experimental results demonstrate that our answer separation method significantly reduces the number of improper questions which include answers. Consequently, our model significantly outperforms previous state-of-the-art NQG models.
Machine learning algorithms are categorized as either supervised or unsupervised. Supervised algorithms learn from labeled examples to predict future labels, while unsupervised algorithms find hidden patterns in unlabeled data. Specifically, supervised algorithms are presented with labeled training data and learn a model to predict the class labels of new test data. Common supervised algorithms include neural networks, decision trees, k-nearest neighbors, and Naive Bayes classifiers. Naive Bayes is an easy-to-implement algorithm that assumes independence between features and has been successfully applied to problems like spam filtering.
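The Naive Bayes classifier summarized above multiplies per-feature likelihoods under the independence assumption. A minimal spam-filtering sketch with Laplace smoothing; the toy documents and function names are illustrative:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (token_list, label). Returns class counts, per-class word counts, vocabulary."""
    labels = Counter(label for _, label in docs)
    words = defaultdict(Counter)
    for tokens, label in docs:
        words[label].update(tokens)
    vocab = {w for counts in words.values() for w in counts}
    return labels, words, vocab

def classify(tokens, labels, words, vocab):
    """Pick the label maximizing log P(label) + sum of log P(word | label)."""
    total = sum(labels.values())
    best, best_lp = None, float("-inf")
    for label, n in labels.items():
        lp = math.log(n / total)
        denom = sum(words[label].values()) + len(vocab)  # Laplace smoothing
        for w in tokens:
            lp += math.log((words[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [(["win", "cash", "prize"], "spam"), (["win", "prize", "now"], "spam"),
        (["meeting", "at", "noon"], "ham"), (["lunch", "at", "noon"], "ham")]
labels, words, vocab = train_nb(docs)
```

Log-probabilities avoid underflow from multiplying many small likelihoods, and the +1 smoothing keeps unseen words from zeroing out a class.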
This document provides an overview of supervised and unsupervised learning, with a focus on clustering as an unsupervised learning technique. It describes the basic concepts of clustering, including how clustering groups similar data points together without labeled categories. It then covers two main clustering algorithms - k-means, a partitional clustering method, and hierarchical clustering. It discusses aspects like cluster representation, distance functions, strengths and weaknesses of different approaches. The document aims to introduce clustering and compare it with supervised learning.
This document discusses unsupervised machine learning techniques for clustering data. It introduces the concepts of supervised vs. unsupervised learning and describes clustering as an unsupervised technique for grouping similar data points into clusters without labeled categories. The document outlines different clustering algorithms, including K-means clustering and K-center clustering, and discusses their applications in data reduction, hypothesis generation, and prediction based on group membership.
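The K-means procedure mentioned above alternates nearest-center assignment with centroid recomputation (Lloyd's algorithm). A minimal sketch; the toy points and initial centers are illustrative:

```python
import math

def kmeans(points, centers, iters=10):
    """Lloyd's algorithm: repeat nearest-center assignment and centroid update."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster; keep old center if empty
        centers = [tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else c
                   for cl, c in zip(clusters, centers)]
    return centers, clusters

# Two well-separated toy groups; initial centers picked from the data
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centers, clusters = kmeans(points, [(0, 0), (10, 10)])
```

In practice the initial centers are chosen randomly (or with k-means++), and the loop stops when assignments no longer change.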
This document provides an overview and literature review of unsupervised feature learning techniques. It begins with background on machine learning and the challenges of feature engineering. It then discusses unsupervised feature learning as a framework to learn representations from unlabeled data. The document specifically examines sparse autoencoders, PCA, whitening, and self-taught learning. It provides details on the mathematical concepts and implementations of these algorithms, including applying them to learn features from images. The goal is to use unsupervised learning to extract features that can enhance supervised models without requiring labeled training data.
1. Reinforcement learning involves an agent learning through trial-and-error interactions with an environment. The agent learns a policy for how to act by maximizing rewards.
2. The document outlines key elements of reinforcement learning including states, actions, rewards, value functions, and explores different methods for solving reinforcement learning problems including dynamic programming, Monte Carlo methods, and temporal difference learning.
3. Temporal difference learning combines the advantages of Monte Carlo methods and dynamic programming by allowing for incremental learning through bootstrapping predictions like dynamic programming while also learning directly from experience like Monte Carlo methods.
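The TD(0) update described in point 3 moves a state's value toward the bootstrapped target r + gamma * V(s'). A minimal sketch; the step size alpha and discount gamma shown are illustrative defaults:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) backup: move V(s) toward the bootstrapped target r + gamma * V(s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

V = {"A": 0.0, "B": 0.0}
td0_update(V, "A", 1.0, "B")  # V["A"] becomes 0.1 * (1 + 0.9*0 - 0) = 0.1
```

The bootstrapping is visible in the target term gamma * V[s_next]: unlike Monte Carlo, the update uses the current estimate of the next state rather than waiting for the full return.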
The document provides an overview of various machine learning algorithms and methods. It begins with an introduction to predictive modeling and supervised vs. unsupervised learning. It then describes several supervised learning algorithms in detail including linear regression, K-nearest neighbors (KNN), decision trees, random forest, logistic regression, support vector machines (SVM), and naive Bayes. It also briefly discusses unsupervised learning techniques like clustering and dimensionality reduction methods.
Optimization Technique for Feature Selection and Classification Using Support... (IJTET Journal)
Abstract: Classification problems often have a large number of features in the data sets, but only some of them are useful for classification. Irrelevant and redundant features reduce data mining performance. Feature selection aims to choose a small number of relevant features that achieve similar or even better classification performance than using all features. It has two main objectives: maximizing classification performance and minimizing the number of features. Moreover, existing feature selection algorithms treat the task as a single-objective problem. Attribute selection is done by combining an attribute evaluator and a search method using the WEKA machine learning tool. The SVM classification algorithm is then used to automatically classify the data using the selected features on different standard datasets.
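A simple filter-style baseline for the feature selection task described above is to rank features by their correlation with the class label. A minimal sketch; Pearson correlation as the scoring function is one common choice, not the paper's method, and the toy data is illustrative:

```python
import statistics

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def rank_features(X, y):
    """Rank feature columns by |correlation| with the label (a filter-style score)."""
    columns = list(zip(*X))
    scored = sorted(((abs(pearson(col, y)), j) for j, col in enumerate(columns)),
                    reverse=True)
    return [j for _, j in scored]

# Feature 0 tracks the label exactly; feature 1 is constant noise
X = [[0, 5], [1, 5], [0, 5], [1, 5]]
y = [0, 1, 0, 1]
```

A wrapper method would instead score feature subsets by actually training the SVM on each subset, which is what makes wrappers slower but often more accurate.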
The document discusses query processing and optimization. It defines query processing as translating a query into low-level activities like evaluation and data extraction. Query optimization aims to select the most efficient query evaluation plan. The key steps in query processing are parsing, translating to relational algebra, creating evaluation plans, optimization to find the best plan, and executing the plan. Optimization techniques include heuristic-based and cost-based approaches. Heuristic rules are used to modify the query representation to improve performance. Cost-based optimization estimates the costs of different plans and selects the lowest cost plan.
This document presents a traditional approach to predicting hard queries using a keyword analyzer over databases. It proposes using association analysis to find the top k results from search keywords. An algorithm is proposed to find the top k searched keyword items from a combination of keywords in a probabilistic method that predicts results quickly. The proposed system uses a keyword analyzer and frequent pattern tree generation to efficiently rank the top k results over a corrupted database.
Network Based Intrusion Detection System using Filter Based Feature Selection... (IRJET Journal)
This document proposes a mutual information-based feature selection algorithm to select optimal features for network intrusion detection classification. The algorithm aims to handle dependent data features better than previous methods. It evaluates the effectiveness of the algorithm on network intrusion detection cases. Most previous methods suffer from low detection rates and high false alarm rates. The proposed approach uses feature selection, filtering, clustering, and clustering ensemble techniques in a hybrid data mining method to achieve high accuracy for intrusion detection systems.
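The mutual information score underlying the proposed feature selection can be computed directly from empirical frequencies. A minimal sketch for discrete features; this is the basic I(X;Y) estimate, not the paper's full algorithm:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits for two discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    # Sum over observed (x, y) pairs: p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

A feature that perfectly determines the class scores the class entropy (1 bit for a balanced binary label); an independent feature scores 0.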
Metabolomic Data Analysis Workshop and Tutorials (2014) (Dmitry Grapov)
This document provides an introduction and overview of tutorials for metabolomic data analysis. It discusses downloading required files and software. The goals of the analysis include using statistical and multivariate analyses to identify differences between sample groups and impacted biochemical domains. It also discusses various data analysis techniques including data quality assessment, univariate and multivariate statistical analyses, clustering, principal component analysis, partial least squares modeling, functional enrichment analysis, and network mapping.
IRJET- Deep Learning Model to Predict Hardware Performance (IRJET Journal)
This document discusses using deep learning models to predict hardware performance. Specifically, it aims to predict benchmark scores from hardware configurations, or predict configurations from scores. It explores various machine learning algorithms like linear regression, logistic regression, and multi-linear regression on hardware performance data. The best results were from backward elimination and linear regression, achieving over 80% accuracy. Data preprocessing like encoding was important. The model can help analyze hardware performance more quickly than manual methods.
ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI... (csandit)
Attribute reduction and classification are essential processes when dealing with large data sets that comprise numerous input attributes. Many search methods and classifiers have been used to find the optimal number of attributes. The aim of this paper is to find the optimal set of attributes and improve classification accuracy by adopting an ensemble rule classifiers method. The research process involves two phases: finding the optimal set of attributes, and applying the ensemble classifiers method to the classification task. Results are reported as percentage accuracy and the number of selected attributes and rules generated. Six datasets were used for the experiment. The final output is an optimal set of attributes combined with the ensemble rule classifiers method. Experimental results on public real datasets demonstrate that the ensemble rule classifiers method consistently improves classification accuracy on the selected datasets, with significant improvement in accuracy and an optimal set of selected attributes.
The document discusses an agenda for a lecture on deriving knowledge from data at scale. The lecture will include a course project check-in, a thought exercise on data transformation, and a deeper dive into ensembling techniques. It also provides tips on gaining experience and intuition for data science, including becoming proficient in tools, deeply understanding algorithms, and focusing on specific data types through hands-on practice of experiments. Attribute selection techniques like filters, wrappers and embedded methods are also covered. Finally, the document discusses support vector machines and handling missing values in data.
Kaggle Higgs Boson Machine Learning Challenge (Bernard Ong)
What It Took to Score in the Top 2% on the Higgs Boson Machine Learning Challenge: a journey into advanced machine learning model ensembles and stacking methods.
The International Journal of Engineering and Science (The IJES) (theijes)
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
Weka is a collection of machine learning algorithms for data mining tasks. The name "Weka" stands for "Waikato Environment for Knowledge Analysis," as it was developed at the University of Waikato in New Zealand. Weka provides a graphical user interface (GUI) that makes it easy to experiment with various machine learning algorithms on datasets.
Study on Relevance Feature Selection Methods (IRJET Journal)
This document summarizes research on feature selection methods. It discusses how feature selection is used to reduce dimensionality when working with large datasets that have thousands of variables. Several feature selection algorithms are examined, including ant colony optimization, quadratic programming, variable ranking using filter, wrapper and embedded methods, and fast correlation-based filtering with sequential forward selection. Feature selection can improve classification efficiency and understanding of data by identifying the most meaningful features.
This document is a research proposal on attribute selection and representation for software defect prediction. The proposal discusses limitations in existing attribute selection methods and the importance of pre-processing data. It aims to propose a new attribute selection method that improves accuracy by addressing shortcomings, and to study appropriate classifiers. The methodology involves a literature review on pre-processing, attribute selection and classification methods. It will then propose and implement a new attribute selection process, compare it using different classifiers and pre-processing, and evaluate it against existing techniques in a technical report.
The document discusses a study comparing the SQL optimizer in Oracle and Hive query execution. It aims to understand how the SQL optimizer works in Oracle by generating query plans using Explain and comparing performance to queries executed on Hive. Various query types including single relations, joins, aggregates, and subqueries are executed on both Oracle and Hive and their plans and performance are analyzed and compared to understand how each system optimizes queries and executes them efficiently.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
This document summarizes a research paper that evaluated the effect of feature reduction using principal component analysis (PCA) on sentiment analysis of online product reviews. The researchers developed two models - Model I used unigram features directly, while Model II reduced the features to the top 57 principal components. Both support vector machines and naive Bayes classifiers showed improved accuracy when trained on the reduced feature set of Model II compared to the full feature set of Model I. Receiver operating characteristic curves also indicated better classification performance from both classifiers when using the reduced features. The results provide promising evidence that PCA can be an effective feature reduction method for sentiment analysis tasks.
This document summarizes a research paper that examines the effect of feature reduction in sentiment analysis of online reviews. It uses principle component analysis to reduce the number of features (product attributes) from a dataset of 500 camera reviews labeled as positive or negative. Two models are developed - one using the original set of 95 product attributes, and one using the reduced set. Support vector machines and naive Bayes classifiers are applied to both models and their performance is evaluated to determine if classification accuracy can be maintained while using fewer features. The results show it is possible to achieve similar accuracy levels with less features, improving computational efficiency.
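Principal component analysis, as used for feature reduction in both studies above, projects data onto the directions of maximal variance. A minimal 2-D sketch that extracts the leading component in closed form from the 2x2 covariance matrix; the data points are illustrative (real sentiment features would be high-dimensional and need an iterative or SVD-based solver):

```python
import math

def first_pc_2d(points):
    """Leading principal component of 2-D data, from the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]] via trace/determinant
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    lam = tr / 2 + math.sqrt(tr * tr / 4 - det)
    # Eigenvector for lam; handle the axis-aligned case sxy == 0 separately
    v = (lam - syy, sxy) if sxy else ((1.0, 0.0) if sxx >= syy else (0.0, 1.0))
    norm = math.hypot(*v)
    return (v[0] / norm, v[1] / norm)

pc = first_pc_2d([(0, 0), (1, 1), (2, 2), (3, 3)])  # points on the line y = x
```

Projecting each example onto the top components (57 in Model II above) yields the reduced feature set the classifiers are then trained on.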
New Feature Selection Model Based Ensemble Rule Classifiers Method for Datase... (ijaia)
Feature selection and classification are essential processes when dealing with large data sets that comprise numerous input attributes. Many search methods and classifiers have been used to find the optimal number of attributes. The aim of this paper is to find the optimal set of attributes and improve classification accuracy by adopting an ensemble rule classifiers method. The research process involves two phases: finding the optimal set of attributes, and applying the ensemble classifiers method to the classification task. Results are reported as percentage accuracy and the number of selected attributes and rules generated. Six datasets were used for the experiment. The final output is an optimal set of attributes combined with the ensemble rule classifiers method. Experimental results on public real datasets demonstrate that the ensemble rule classifiers method consistently improves classification accuracy on the selected datasets, with significant improvement in accuracy and an optimal set of selected attributes.
This is an introductory workshop on machine learning. It introduces machine learning tasks such as supervised learning, unsupervised learning, and reinforcement learning.
Similar to Predicting best classifier using properties of data sets
The Ipsos - AI - Monitor 2024 Report.pdf (Social Samosa)
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data (Kiwi Creative)
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
End-to-end pipeline agility - Berlin Buzzwords 2024 (Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today's world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Open Source Contributions to Postgres: The Basics, POSETTE 2024 (ElizabethGarrettChri)
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I'll give you an overview of Postgres versions and how the underlying project codebase functions. I'll also show you the process for submitting a patch and getting that tested and committed.
Build applications with generative AI on Google Cloud (MƔrton Kodok)
We will explore Vertex AI - Model Garden powered experiences, we are going to learn more about the integration of these generative AI APIs. We are going to see in action what the Gemini family of generative models are for developers to build and deploy AI-driven applications. Vertex AI includes a suite of foundation models, these are referred to as the PaLM and Gemini family of generative ai models, and they come in different versions. We are going to cover how to use via API to: - execute prompts in text and chat - cover multimodal use cases with image prompts. - finetune and distill to improve knowledge domains - run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps using the generative ai industry trends.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You... (Aggregage)
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Predicting best classifier using properties of data sets
1. Predicting the Best Classifier using Properties of Datasets
Abhishek Vijayvargia
Supervised by: Prof. Harish Karnick
Department of Computer Science & Engineering
IIT Kanpur
June 24, 2015
2. Introduction and Background | Data Properties | Regression and Significance Testing | Results | Conclusion and Future Work
Outline
1 Introduction and Background
2 Data Properties
3 Regression and Significance Testing
4 Results
5 Conclusion and Future Work
Abhishek Vijayvargia, Predicting the Best Classifier using Properties of Datasets, 2/47
4. Introduction
Classification techniques have applications in different domains
Datasets contain a mixture of nominal, integer, real, and text attributes
Datasets have different properties
Classification algorithms perform differently on these datasets
No single best algorithm exists (No Free Lunch)
Cross-validation is used to find a good algorithm, but it is time consuming
It takes even more time with large datasets and many algorithms
9. Introduction
Meta-Learning
Knowledge of datasets and the performance of algorithms is stored
Predict the performance of algorithms
Generate a ranking
Top-k algorithms can be chosen
Problem Statement
Predict an optimal learning algorithm, or nearly optimal learning algorithms via a ranking paradigm in terms of performance, for a new data set by using the properties of the data set.
Related Work
Characterization of Classification Algorithms [2]
Dataset characteristics
Simple Measures
Statistical Measures
Information Theoretic Measures
Used four types of models
Partial Learning Curve [4]
Full learning curve estimated from a partial learning curve
Fraction of instances used (10%)
Predict the better algorithm from a pair of algorithms
Related Work
Meta-Analysis [3]
Meta Features
Simple, Statistical and Information Theoretic Measures
Model Based Measures
Landmarks
Classification for algorithm selection
Synthetic datasets used
Automatic Classifier Selection for Non-Experts [5]
Meta Features
Accuracy predicted by regression
Motivation
Empirical Comparison of Supervised Learning Algorithms [1]
The best methods perform poorly on some problems
Poor methods perform exceptionally well on some problems
Motivation to generate a ranking of algorithms
Outline
1 Introduction and Background
2 Data Properties
3 Regression and Significance Testing
4 Results
5 Conclusion and Future Work
Histogram of Standard Deviation
Creating Histograms
K standard deviation values (one per numerical attribute)
H histogram bins
2 histograms per dataset (one per class, binary classification)
Bins cover the range [0, 0.5] (data is normalized to [0, 1])
Histogram of Standard Deviation
Table 1: Standard Deviation Data

Class 0: 0.228  0.215  0.200  0.187  0.135  0.150  0.116  0.154
Class 1: 0.366  0.223  0.204  0.171  0.179  0.162  0.164  0.178

Table 2: Histogram

Range             .00-.05  .05-.10  .10-.15  .15-.20  .20-.25  .25-.30  .30-.35  .35-.40  .40-.45  .45-.50
Class 0 Histogram    0        0        2        3        3        0        0        0        0        0
Class 1 Histogram    0        0        0        5        2        0        0        1        0        0
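The construction above can be sketched in Python; `std_histogram` is an illustrative helper name, and the small epsilon guards against floating-point artifacts at bin boundaries:

```python
def std_histogram(std_values, bins=10, lo=0.0, hi=0.5):
    """Bin per-attribute standard deviations into equal-width bins over [lo, hi]."""
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in std_values:
        # epsilon so that e.g. 0.15 falls in bin .15-.20, not .10-.15
        idx = min(int((v - lo) / width + 1e-9), bins - 1)
        hist[idx] += 1
    return hist

class0 = [0.228, 0.215, 0.200, 0.187, 0.135, 0.150, 0.116, 0.154]
class1 = [0.366, 0.223, 0.204, 0.171, 0.179, 0.162, 0.164, 0.178]
print(std_histogram(class0))  # [0, 0, 2, 3, 3, 0, 0, 0, 0, 0] — matches Table 2
print(std_histogram(class1))  # [0, 0, 0, 5, 2, 0, 0, 1, 0, 0]
```

Applied to the values of Table 1, this reproduces the two rows of Table 2.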
Comparing Histograms
1-Norm Distance Based Comparison
Two datasets' histograms are compared on the basis of 1-norm distance
Two pairwise comparisons (one per class pairing) between datasets
The minimum distance score of the two pairwise comparisons is taken
Order datasets by increasing distance

Dataset-1:
Range             .00-.05  .05-.10  .10-.15  .15-.20  .20-.25  .25-.30  .30-.35  .35-.40  .40-.45  .45-.50
Class 0 Histogram    0        0        2        3        3        0        0        0        0        0
Class 1 Histogram    0        0        0        5        2        0        0        1        0        0

Dataset-2:
Range             .00-.05  .05-.10  .10-.15  .15-.20  .20-.25  .25-.30  .30-.35  .35-.40  .40-.45  .45-.50
Class 0 Histogram    0        1        0        5        2        0        0        0        0        0
Class 1 Histogram    0        1        1        1        2        3        0        0        0        0

Score 1: Comparing Class 0 of Dataset-1 with Class 0 of Dataset-2, and Class 1 of Dataset-1 with Class 1 of Dataset-2.

Score-1 = (|0−0| + |0−1| + |2−0| + |3−5| + |3−2| + |0−0| + |0−0| + |0−0| + |0−0| + |0−0|)
        + (|0−0| + |0−1| + |0−1| + |5−1| + |2−2| + |0−3| + |0−0| + |1−0| + |0−0| + |0−0|)
        = 6 + 10 = 16

Score 2: Comparing Class 0 of Dataset-1 with Class 1 of Dataset-2, and Class 1 of Dataset-1 with Class 0 of Dataset-2.

Score-2 = (|0−0| + |0−1| + |2−1| + |3−1| + |3−2| + |0−3| + |0−0| + |0−0| + |0−0| + |0−0|)
        + (|0−0| + |0−1| + |0−0| + |5−5| + |2−2| + |0−0| + |0−0| + |1−0| + |0−0| + |0−0|)
        = 8 + 2 = 10

Distance Score = min(Score-1, Score-2) = min(16, 10) = 10
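A minimal sketch of the class-paired 1-norm comparison (function names are illustrative):

```python
def l1(h1, h2):
    """1-norm distance between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def histogram_distance(d1_c0, d1_c1, d2_c0, d2_c1):
    """Minimum over the two possible class pairings."""
    score1 = l1(d1_c0, d2_c0) + l1(d1_c1, d2_c1)  # class 0-0, 1-1
    score2 = l1(d1_c0, d2_c1) + l1(d1_c1, d2_c0)  # class 0-1, 1-0
    return min(score1, score2)

d1_c0 = [0, 0, 2, 3, 3, 0, 0, 0, 0, 0]
d1_c1 = [0, 0, 0, 5, 2, 0, 0, 1, 0, 0]
d2_c0 = [0, 1, 0, 5, 2, 0, 0, 0, 0, 0]
d2_c1 = [0, 1, 1, 1, 2, 3, 0, 0, 0, 0]
print(histogram_distance(d1_c0, d1_c1, d2_c0, d2_c1))  # 10, as in the worked example
```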
Comparing Histograms
Kolmogorov-Smirnov Test Based Comparison
Take the values of a histogram as a sample
Calculate the proportion of each value in the two samples
Calculate the cumulative proportion of each sample
Calculate the D statistic
Two pairwise comparisons
Minimum distance score
Comparing Histograms
Table 3: Kolmogorov-Smirnov test

Bin Range  Histogram-1  Histogram-2  Proportion-1  Proportion-2  Cum. Pro.-1  Cum. Pro.-2  Difference
.00-.05         0            0           0             0             0            0           0
.05-.10         1            0           0.125         0             0.125        0           0.125
.10-.15         0            2           0             0.25          0.125        0.25        0.125
.15-.20         5            3           0.625         0.375         0.75         0.625       0.125
.20-.25         2            3           0.25          0.375         1            1           0
.25-.30         0            0           0             0             1            1           0
.30-.35         0            0           0             0             1            1           0
.35-.40         0            0           0             0             1            1           0
.40-.45         0            0           0             0             1            1           0
.45-.50         0            0           0             0             1            1           0

Two datasets can be compared in two ways (two class pairings)
Two D values in total for each comparison
Sum both values and take the class mapping with the minimum D score
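The D statistic of Table 3 can be computed as follows (a sketch; `ks_d` is an illustrative name):

```python
def ks_d(hist1, hist2):
    """Kolmogorov-Smirnov D statistic between two histograms treated as samples."""
    n1, n2 = sum(hist1), sum(hist2)
    cum1 = cum2 = 0.0
    d = 0.0
    for a, b in zip(hist1, hist2):
        cum1 += a / n1  # cumulative proportion of sample 1
        cum2 += b / n2  # cumulative proportion of sample 2
        d = max(d, abs(cum1 - cum2))  # D = max difference of cumulatives
    return d

h1 = [0, 1, 0, 5, 2, 0, 0, 0, 0, 0]
h2 = [0, 0, 2, 3, 3, 0, 0, 0, 0, 0]
print(ks_d(h1, h2))  # 0.125, the maximum of the Difference column in Table 3
```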
Dataset Properties
CAV (Cluster Analysis Vector)
Separate the dataset based on its class
Apply K-Means clustering
Cluster Properties
Number of instances in each cluster
Cluster Value: C_k = Σ_{x̄ ∈ cluster k} dist(x̄, centroid_k)
Cluster Centroid
Moments of Data
Variance
Skewness
Kurtosis
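A sketch of the cluster-property extraction for one class's instances, using a plain NumPy implementation of K-Means (Lloyd's algorithm); all names and the choice of k are illustrative:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: returns centroids and cluster labels."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def cluster_properties(X, k=3):
    """Per-cluster size, cluster value C_k (sum of distances to centroid), centroid."""
    centroids, labels = kmeans(X, k)
    props = []
    for j in range(k):
        members = X[labels == j]
        value = np.linalg.norm(members - centroids[j], axis=1).sum()  # C_k
        props.append((len(members), value, centroids[j]))
    return props

rng = np.random.default_rng(1)
X = rng.random((60, 4))  # one class's instances, attributes scaled to [0, 1]
for size, value, centroid in cluster_properties(X):
    print(size, round(value, 3))
```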
Dataset Properties
Mixture of Gaussians
A dataset may have overlapping clusters (non-circular shapes)
For each attribute, k different Gaussians
Model fit by maximum likelihood of the observed data
Mean and variance of each Gaussian are stored
Multivariate Gaussian Model

N(x; μ, Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp(−(1/2) (x − μ) Σ^(−1) (x − μ)^T)

μ is a d-length row vector
Σ is a d × d covariance matrix
Singular value decomposition of the covariance matrix
Values from the diagonal matrix and the mean vector are stored
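The multivariate-Gaussian properties can be sketched with NumPy; the exact layout of the stored vector (mean entries followed by singular values) is an assumption:

```python
import numpy as np

def gaussian_properties(X):
    """Mean vector plus singular values of the covariance matrix (via SVD)."""
    mu = X.mean(axis=0)                # d-length mean vector
    cov = np.cov(X, rowvar=False)      # d x d covariance matrix
    _, s, _ = np.linalg.svd(cov)       # s: diagonal entries of the SVD's Sigma
    return np.concatenate([mu, s])     # stored as part of the property vector

rng = np.random.default_rng(0)
X = rng.random((100, 4))
props = gaussian_properties(X)
print(props.shape)  # (8,) — d mean entries followed by d singular values
```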
Regression to Predict Performance Measures
Regression Analysis
Property vector populated by meta-properties
The entire vector, or a sub-vector, can be used to predict performance measures
The regression model is given as Y = f(X, α)
Y is the dependent variable (performance measure)
X is the vector of independent variables (property vector)
α is a vector of unknown parameters
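One way the per-classifier regression and ranking could look, sketched with least-squares linear regression (the data arrays here are random stand-ins, not the paper's meta-data):

```python
import numpy as np

def fit_models(P, acc):
    """One linear model per classifier: P is (datasets x features),
    acc is (datasets x classifiers) observed accuracies."""
    Pb = np.hstack([P, np.ones((len(P), 1))])          # add bias column
    coefs, *_ = np.linalg.lstsq(Pb, acc, rcond=None)   # (features+1) x classifiers
    return coefs

def rank_classifiers(coefs, p_new):
    """Predict the accuracy of every classifier on a new dataset and rank them."""
    preds = np.append(p_new, 1.0) @ coefs
    return np.argsort(-preds)  # classifier indices, best predicted first

rng = np.random.default_rng(0)
P = rng.random((40, 6))     # property vectors of 40 datasets
acc = rng.random((40, 13))  # accuracies of 13 classifiers on them
coefs = fit_models(P, acc)
ranking = rank_classifiers(coefs, rng.random(6))
print(ranking[:3])          # predicted top-3 classifiers
```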
Regression to Predict Performance Measures
Figure 1: Training
Figure 2: Testing
Statistical Significance Testing
Comparison of the predicted sequence with a random sequence
The actual sequence is taken as the baseline
Probability that at least one algorithm of the top-k is present in the actual top-k:

Probability = 1 − P(none of the k algorithms present in the actual top-k)
            = 1 − [(n−k)/n] × [(n−k−1)/(n−1)] × … × [(n−2k+1)/(n−k+1)]

Expected number of algorithms from a random sequence present in the actual top-k:

Expected Value = Σ_{j=1}^{k} j × [C(k, j) × C(n−k, k−j)] / C(n, k)
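Both quantities are easy to evaluate (a sketch; for n = 13 algorithms, the k = 1 and k = 2 values reproduce the "Random Probability" column of the later tables):

```python
from math import comb

def prob_at_least_one(n, k):
    """P(a random top-k shares at least one algorithm with the actual top-k)."""
    return 1 - comb(n - k, k) / comb(n, k)

def expected_matches(n, k):
    """Expected number of random top-k algorithms present in the actual top-k
    (the mean of a hypergeometric distribution, which equals k*k/n)."""
    return sum(j * comb(k, j) * comb(n - k, k - j) for j in range(1, k + 1)) / comb(n, k)

print(prob_at_least_one(13, 1))  # 0.0769... = 1/13
print(prob_at_least_one(13, 2))  # 0.2948...
print(expected_matches(13, 2))   # 0.3076... = 4/13
```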
Statistical Significance Testing
Test 1
At least one algorithm of the given top-k sequence is present in the actual top-k sequence
Statistical significance test between
Predicted rank-actual rank matches
Random rank-actual rank matches
Exclude prediction methods where the difference is not statistically significant
Test 2
Number of algorithms of the given top-k sequence present in the actual top-k sequence
Statistical significance test between
Predicted rank-actual rank matches
Random rank-actual rank matches
Autoencoder
Learns to reconstruct its own input
Increasing the dimension of the data by an autoencoder
Decreasing the dimension of the data by an autoencoder
Stacked autoencoder
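As a sketch, a linear single-hidden-layer autoencoder that decreases the dimension of a property vector; the architecture and training details here are illustrative, not those used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))  # property vectors
X = X - X.mean(axis=0)         # center the data

d, h = 8, 3                    # input dimension, bottleneck dimension
W1 = rng.normal(scale=0.1, size=(d, h))  # encoder weights
W2 = rng.normal(scale=0.1, size=(h, d))  # decoder weights
lr = 0.05

loss_before = np.mean((X @ W1 @ W2 - X) ** 2)
for step in range(500):
    Z = X @ W1               # encode
    X_hat = Z @ W2           # decode: reconstruct the input
    E = X_hat - X
    # gradients of the mean squared reconstruction error
    gW2 = Z.T @ E * (2 / E.size)
    gW1 = X.T @ (E @ W2.T) * (2 / E.size)
    W1 -= lr * gW1
    W2 -= lr * gW2
loss_after = np.mean((X @ W1 @ W2 - X) ** 2)
print(loss_before, loss_after)  # reconstruction error drops with training
```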
Data and Algorithm Set
Real-world datasets
44 binary datasets from UCI, tunedIT, KEEL, Delve
Synthetic datasets
484 datasets generated using univariate and multivariate distributions
DA1: 13 classification algorithms and 44 real datasets
DA2: 70 classification algorithms and 44 real datasets
DA3: 70 classification algorithms and 484 synthetic datasets
Data Cleaning
Steps for Data Cleaning
Nominal attributes converted to binary 0/1 attributes
PCA with the maximum number of attributes set to 8
Normalization done on these reduced sets of attributes using (value − min) / (max − min)
Class attributes are renamed to 0 and 1
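The reduction and normalization steps can be sketched as follows (PCA via SVD; the helper name `clean` is illustrative):

```python
import numpy as np

def clean(X, max_components=8):
    """Project onto at most 8 PCA components, then min-max normalize each to [0, 1]."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(max_components, Xc.shape[1])
    Z = Xc @ Vt[:k].T                 # PCA projection
    lo, hi = Z.min(axis=0), Z.max(axis=0)
    return (Z - lo) / (hi - lo)       # (value - min) / (max - min)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 12))
Z = clean(X)
print(Z.shape, Z.min(), Z.max())  # (50, 8) 0.0 1.0
```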
Predicting Ranking from Histogram Bins
Table 4: Predicting ranking from histogram bins using 1-norm distance on DA1

Count   Random Probability   1-Norm distance   Confidence for alternative hypothesis
1       0.076923077          0.090909091       0.5583927
2       0.294871795          0.295454545       0.44665546
3       0.58041958           0.659090909       0.8166996
4       0.823776224          0.795454545       0.2378282
5       0.956487956          1                 0.8587793
6       0.995920746          1                 0.164608
7       1                    1                 0

Table 5: Predicting ranking from histogram bins using Kolmogorov-Smirnov on DA1

Count   Random Probability   Kolmogorov-Smirnov test   Confidence for alternative hypothesis
1       0.076923077          0.090909091               0.7518056
2       0.294871795          0.340909091               0.6987076
3       0.58041958           0.636363636               0.7234472
4       0.823776224          0.863636364               0.6779081
5       0.956487956          0.977272727               0.5761086
6       0.995920746          1                         0.164608
7       1                    1                         0
Predicting Ranking from Histogram Bins
Table 6: Predicting ranking from histogram bins using DA2

Count   Random Probability   1-Norm distance   Conf. for alt. hyp.   Kolmogorov-Smirnov test   Conf. for alt. hyp.
1       0.014285714          0                 0                     0                         0
2       0.056728778          0.022727273       0.07656138            0.022727273               0.07656138
3       0.124862989          0.068181818       0.07501977            0.090909091               0.1837721
4       0.213955796          0.159090909       0.1402761             0.159090909               0.1402761
5       0.317534624          0.340909091       0.5753865             0.272727273               0.213988
6       0.428182857          0.454545455       0.5822902             0.363636364               0.1543545
7       0.538469854          0.522727273       0.3581275             0.454545455               0.1025831
8       0.641846095          0.659090909       0.5264127             0.681818182               0.6489032
9       0.733341187          0.818181818       0.8664682             0.795454545               0.77316
10      0.809949161          0.863636364       0.756462              0.863636364               0.756462
11      0.870659846          0.909090909       0.688802              0.886363636               0.5115497
12      0.916175987          0.954545455       0.7251068             0.977272727               0.8932743
13      0.948419826          0.977272727       0.6699302             0.977272727               0.6699302
14      0.969963161          1                 0.7386451             1                         0.7386451
15      0.983506557          1                 0.5189398             1                         0.5189398
Predicting Ranking from Histogram Bins
Table 7: Predicting ranking from histogram bins using DA3

Count   Random Probability   1-Norm distance   Conf. for alt. hyp.   Kolmogorov-Smirnov test   Conf. for alt. hyp.
1       0.014285714          0                 0                     0                         0
2       0.056728778          0                 0                     0                         0
3       0.124862989          0.008264463       0                     0.010330579               0
4       0.213955796          0.068181818       0                     0.07231405                0
5       0.317534624          0.150826446       0                     0.150826446               0
6       0.428182857          0.268595041       5.64E-14              0.27892562                3.98E-12
7       0.538469854          0.400826446       2.69E-10              0.404958678               1.49E-09
8       0.641846095          0.541322314       1.44E-06              0.530991736               2.25E-07
9       0.733341187          0.683884298       0.0066866             0.681818182               0.00504363
10      0.809949161          0.780991736       0.04829307            0.783057851               0.04829307
11      0.870659846          0.863636364       0.2945404             0.863636364               0.2945404
12      0.916175987          0.919421488       0.5609581             0.919421488               0.5609581
13      0.948419826          0.960743802       0.8715195             0.962809917               0.8715195
14      0.969963161          0.975206612       0.6958514             0.977272727               0.6958514
15      0.983506557          0.981404959       0.2801928             0.983471074               0.2801928
Predicting Ranking by Data Property Vector
Steps
These properties are considered:
Histogram of standard deviation of each class.
Cluster Analysis Vector (CAV).
Moments of data.
Mixture of Gaussians on each attribute.
Mean, and the vector of diagonal entries from the Sigma matrix obtained by singular value decomposition (SVD) of the covariance matrix of the multivariate Gaussian model of the dataset.
Property vector as independent variable and accuracy as dependent variable in regression
One regression model for each classifier
Results compared with a random sequence
Predicting Ranking by Data Property Vector
Table 8: Test-1 using data characteristics for DA1

Value of k   Random Probability   Gaussian Processes   p-value for null hyp.   IBk           p-value for null hyp.
1            0.076923077          0.136363636          0.2481944               0.25          0.000392335
2            0.294871795          0.522727273          0.001277486             0.454545455   0.01802491
3            0.58041958           0.795454545          0.006170564             0.704545455   0.06290122

Table 9: Test-2 using data characteristics for DA1

Count   Random Probability   Gaussian Processes   p-value for null hyp.   IBk           p-value for null hyp.
1       0.076923077          0.136363636          0.2481944               0.25          0.000392335
2       0.153846154          0.284090909          0.002996455             0.25          0.0128041
3       0.230769231          0.386363636          0.000394481             0.363636364   0.000394481
4       0.307692308          0.443181818          0.000343617             0.403409091   0.0159287
5       0.384615385          0.518181818          0.00032955              0.440909091   0.08601391
6       0.461538462          0.549242424          0.003806131             0.53030303    0.02680756
Predicting Ranking by Data Property Vector
Table 10: Test-1 using data characteristics for DA2

Count   Random Probability   Gaussian Processes   p-value for null hyp.   IBk           p-value for null hyp.
1       0.014285714          0.045454545          0.469059                0.068181818   0.130488
2       0.056728778          0.227272727          0.000702291             0.295454545   4.23E-06
3       0.124862989          0.318181818          0.002140653             0.295454545   0.0063842
4       0.213955796          0.431818182          0.000962345             0.409090909   0.006957731
5       0.317534624          0.590909091          0.000168674             0.477272727   0.03942881
6       0.428182857          0.636363636          0.01014022              0.613636364   0.01014022

Table 11: Test-2 using data characteristics for DA2

Count   Random Probability   Gaussian Processes   p-value for null hyp.   IBk           p-value for null hyp.
5       0.071428571          0.195454545          4.80E-08                0.25          1.92E-16
10      0.142857143          0.281818182          1.03E-12                0.327272727   8.45E-21
15      0.214285714          0.363636364          1.32E-18                0.366666667   1.32E-18
20      0.285714286          0.454545455          1.86E-26                0.422727273   3.23E-15
25      0.357142857          0.512727273          2.45E-22                0.470909091   1.99E-11
30      0.428571429          0.575                1.61E-24                0.543939394   3.85E-12
35      0.5                  0.646753247          3.16E-27                0.607792208   5.06E-13
67. Introduction and Background Data Properties Regression and Signiļ¬cance Testing Results Conclusion and Future Work
Predicting Ranking by Data Property Vector
Table 12: Test-1 using Data Characteristics for DA3
Count
Random
Probability
Gaussian
Processes
p-value for
null hypo.
IBk
p-value for
null hypo.
1 0.014285714 0.148760331 5.27E-49 0.237603306 2.57E-101
2 0.056728778 0.384297521 1.92E-101 0.48553719 1.66E-154
3 0.124862989 0.520661157 7.63E-97 0.657024793 8.85E-163
4 0.213955796 0.597107438 1.49E-73 0.766528926 4.31E-148
5 0.317534624 0.665289256 1.05E-54 0.830578512 3.11E-119
6 0.428182857 0.710743802 3.06E-36 0.873966942 5.45E-92
Table 13: Test-2 using Data Characteristics for DA3

Count | Random Probability | Gaussian Processes | p-value (null hyp.) | IBk | p-value (null hyp.)
5 | 0.071428571 | 0.321487603 | 1.43E-284 | 0.395041322 | 0
10 | 0.142857143 | 0.40661157 | 0 | 0.480991736 | 0
15 | 0.214285714 | 0.478236915 | 0 | 0.534710744 | 0
20 | 0.285714286 | 0.517768595 | 0 | 0.577582645 | 0
25 | 0.357142857 | 0.573636364 | 0 | 0.622479339 | 0
30 | 0.428571429 | 0.631060606 | 0 | 0.666804408 | 0
35 | 0.5 | 0.687839433 | 0 | 0.717532468 | 0
Figure 3: Test-1 on DA1
Figure 4: Test-1 on DA2
Figure 5: Test-2 on DA1
Figure 6: Test-2 on DA2
Abhishek Vijayvargia Predicting the Best Classiļ¬er using Properties of Datasets 35/ 47
69. Introduction and Background Data Properties Regression and Signiļ¬cance Testing Results Conclusion and Future Work
Figure 7: Test-1 on DA3
Figure 8: Test-2 on DA3
Increasing Dimension of Data (Autoencoder)
Figure 9: Test-1 on DA2
Figure 10: Test-2 on DA2
Decreasing Dimension of Data (Autoencoder)
Figure 11: Test-1 on DA2
Figure 12: Test-2 on DA2
Stacked Autoencoder
Figure 13: Test-1 on DA2
Figure 14: Test-2 on DA2
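The slides use autoencoders to increase or decrease the dimension of the meta-feature vectors before regression. As a purely illustrative sketch (not the stacked architecture from the slides), a single-layer tied-weight linear autoencoder trained by gradient descent looks like this; all sizes here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for meta-feature vectors: 100 data sets x 6 features
# (hypothetical sizes). Compress to a 3-dimensional code; choosing
# n_hid > n_in would instead *increase* the dimension.
X = rng.normal(size=(100, 6))
n_in, n_hid = X.shape[1], 3

# Tied-weight linear autoencoder: encode Z = X W, decode X_hat = Z W^T.
W = rng.normal(scale=0.1, size=(n_in, n_hid))
lr = 0.01

def loss(W):
    E = X @ W @ W.T - X          # reconstruction error
    return 0.5 * np.mean(E ** 2)

loss_before = loss(W)
for _ in range(500):
    E = X @ W @ W.T - X
    # Gradient of 0.5 * ||X W W^T - X||_F^2 w.r.t. W, scaled by 1/n.
    grad = (X.T @ E @ W + E.T @ X @ W) / len(X)
    W -= lr * grad
loss_after = loss(W)

Z = X @ W  # compressed meta-feature representation, shape (100, 3)
```

The linear, tied-weight case converges toward the top principal subspace of the data; a stacked autoencoder repeats this encode/decode step layer by layer with nonlinear activations.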
Comparison with Previous Techniques
Test-1: Difference Between Accuracies
The true accuracy and the predicted accuracy of an algorithm on each test dataset are compared
The absolute differences are averaged over all datasets and all algorithms
The result of the best regression technique is reported
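The Test-1 metric described above can be sketched in a few lines; the accuracy values here are made up for illustration:

```python
import numpy as np

# Hypothetical true and predicted accuracies for 3 algorithms on 4 test
# data sets (rows = data sets, columns = algorithms).
true_acc = np.array([[0.81, 0.75, 0.90],
                     [0.66, 0.70, 0.72],
                     [0.93, 0.88, 0.91],
                     [0.58, 0.61, 0.64]])
pred_acc = np.array([[0.78, 0.77, 0.85],
                     [0.70, 0.68, 0.75],
                     [0.90, 0.90, 0.89],
                     [0.55, 0.65, 0.60]])

# Average absolute difference over all data sets and all algorithms.
diff = np.mean(np.abs(true_acc - pred_acc))
print(round(diff, 4))  # -> 0.0308
```

Lower values mean the regression on data properties predicts accuracies that track the true ones more closely (compare Table 14).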
Table 14: Difference Between Accuracy

Feature set | Simple | Statistical | Info Theoretic | Model Based | Landmark | DCT | ALL | MDDF
Difference | 0.0890 | 0.0915 | 0.0648 | 0.0859 | 0.0422 | 0.0747 | 0.0525 | 0.0426
Comparison with Previous Techniques
Test-2: Rank Correlation
Spearman's rank correlation coefficient between the actual and predicted rankings is calculated
The value of this coefficient is averaged over all datasets
The higher this coefficient, the stronger the agreement between the actual and predicted ranks
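The Test-2 metric can be sketched with SciPy's `spearmanr`; the rankings below are hypothetical:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical actual and predicted ranks of 6 classifiers on 2 data sets.
actual_ranks = [np.array([1, 2, 3, 4, 5, 6]),
                np.array([2, 1, 4, 3, 6, 5])]
predicted_ranks = [np.array([2, 1, 3, 4, 6, 5]),
                   np.array([1, 2, 4, 3, 5, 6])]

# Spearman coefficient per data set, averaged over data sets.
rhos = []
for a, p in zip(actual_ranks, predicted_ranks):
    rho, _ = spearmanr(a, p)
    rhos.append(rho)
avg_rho = float(np.mean(rhos))
print(round(avg_rho, 3))  # -> 0.886
```

A coefficient of 1 means the predicted ordering of classifiers matches the actual ordering exactly; 0 means no monotone relationship (compare Table 15).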
Table 15: Rank Correlation

Feature set | Simple | Statistical | Info Theoretic | Model Based | Landmark | DCT | ALL | MDDF
Average Value | 0.464 | 0.444 | 0.488 | 0.431 | 0.495 | 0.459 | 0.488 | 0.520
Outline
1 Introduction and Background
2 Data Properties
3 Regression and Significance Testing
4 Results
5 Conclusion and Future Work
Conclusion
Methods to generate a ranking of binary classification algorithms without running them
Based on intrinsic properties of the data
Ranking of classifiers predicted via regression
Autoencoders used for further predictive analysis
Our approach gives better results than previous techniques
Future Work
Extend to multi-class classification
Datasets can be grouped together based on domain knowledge
Other performance measures such as precision, recall, and F-measure can be used
Thank you!