This document discusses an optimized approach for building ensemble models from large datasets in distributed environments. The key aspects are:
1. Data is randomly partitioned into training blocks, a validation set, and a holdout set to build base models, with partitions sized optimally for model building.
2. Base models are built concurrently on each partition using map-reduce frameworks; the best-performing base models are then combined into an ensemble model (see the sketch following this list).
3. The approach eliminates issues arising from unevenly sized partitions and non-randomly distributed data, improving model performance and statistical validity over existing methods.
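To make the workflow above concrete, here is a minimal, self-contained Python sketch of the partition/train/select/combine cycle. It is an illustration only, not the paper's implementation: the synthetic dataset, the partition count, decision trees as base learners, the top-k selection rule, and the majority-vote combiner are all assumptions chosen for readability, and in a real PSM deployment the per-partition training step would run inside map-reduce mappers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a large dataset (assumption made for the sketch).
X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)

# Randomly split off a validation set (for selecting base models) and a holdout set.
X_rest, X_hold, y_rest, y_hold = train_test_split(X, y, test_size=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.1, random_state=0)

# Randomly partition the training data into roughly equal-sized blocks.
n_partitions = 8                      # assumed value; tuned per cluster in practice
rng = np.random.default_rng(0)
order = rng.permutation(len(X_train))
blocks = np.array_split(order, n_partitions)

# "Map" step: build one base model per partition and score it on the validation set.
base_models = []
for idx in blocks:
    model = DecisionTreeClassifier(max_depth=8, random_state=0)
    model.fit(X_train[idx], y_train[idx])
    score = accuracy_score(y_val, model.predict(X_val))
    base_models.append((score, model))

# "Merge" step: keep the best-performing base models and combine them.
top_k = 4                             # assumed ensemble size
selected = [m for _, m in sorted(base_models, key=lambda t: t[0], reverse=True)[:top_k]]

# Simple majority-vote ensemble prediction on the holdout set.
votes = np.stack([m.predict(X_hold) for m in selected])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("ensemble holdout accuracy:", accuracy_score(y_hold, ensemble_pred))
```

In a genuine map-reduce run, each block would typically correspond to an input split handled by one mapper, and only the fitted base models, not the raw data, would be sent to the step that performs selection and merging.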
DATA PARTITIONING FOR ENSEMBLE MODEL BUILDING
International Journal on Cloud Computing: Services and Architecture (IJCCSA) Vol. 7, No. 3/4, August 2017
DOI: 10.5121/ijccsa.2017.7401
DATA PARTITIONING FOR ENSEMBLE MODEL BUILDING

Ates Dagli1, Niall McCarroll2 and Dmitry Vasilenko3

1 IBM Big Data Analytics, Chicago, USA
2 IBM Watson Machine Learning, Hursley, UK
3 IBM Watson Cloud Platform and Data Science, Chicago, USA
ABSTRACT
In distributed ensemble model-building algorithms, the performance and statistical validity of the models depend on the sizes of the input data partitions as well as the distribution of records among the partitions. Failure to correctly select and pre-process the data often results in models that are not stable and do not perform well. This article introduces an optimized approach to building ensemble models for very large data sets in distributed map-reduce environments using the Pass-Stream-Merge (PSM) algorithm. To ensure model correctness, the input data is randomly distributed using the facilities built into map-reduce frameworks.
KEYWORDS
Ensemble Models, Pass-Stream-Merge, Big Data, Map-Reduce, Cloud
1. INTRODUCTION
Ensemble models are used to enhance model accuracy (boosting) [1], enhance model stability (bagging) [2], and build models for very large datasets (pass, stream, merge) [3]. In distributed ensemble model-building algorithms, one so-called base model is built from each data partition (split) and evaluated against a sample set aside for this purpose. The best-performing base models are then selected and combined into a model ensemble for purposes of prediction. Both model-building performance and the statistical validity of the models depend on data records being distributed approximately randomly across roughly equal-sized partitions. When implemented in a map-reduce framework, base models are built in mappers [4]. The sizes of data partitions and the distribution of records among them are properties of the input data source. The partitions of the input source are often uneven and rarely of an appropriate size for building models. Furthermore, data records are frequently arranged in some systematic order rather than randomly ordered. As a result, base models sometimes fail to build or, worse, produce incorrect or suboptimal results. The algorithm proposed in this paper eliminates partition-size and ordering variability and, as a result, improves the performance and statistical validity of the generated models.
2. METHODOLOGY
We implement the PSM features Pass, Stream and Merge through ensemble modeling [2], [5],
[6], [7], [8], [9]. Pass builds models on very large data sets with only one data pass [3]; Stream
updates the existing model with new cases without the need to store or recall the old training data;
Merge builds models in a distributed environment and merges the built models into one model. In
an ensemble model, the training set will be divided into subsets called blocks, and a model will be
built on each block. Because the blocks may be dispatched to different processing nodes in the
map-reduce environment, models can be built concurrently. As new data blocks arrive, the algorithm repeats the procedure. Therefore, it can handle the data stream and perform incremental learning for ensemble modeling [10]. The Pass operation includes the following steps:
1. Splitting the data into training blocks, a testing set and a holdout set.
2. Building base models on training blocks and a reference model on the testing set.
One model is built on the testing set and one on each training block.
3. Evaluating each base model by computing its accuracy based on the testing set and
selecting a subset of base models as ensemble elements according to accuracy.
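As an illustration only, the following Python sketch mirrors the Pass steps above. The helpers build_model and evaluate, the fractions used for the testing and holdout sets, and the block size are placeholders introduced here for the example; they are not the authors' implementation.

```python
import random

def pass_operation(records, build_model, evaluate, block_size,
                   test_frac=0.1, holdout_frac=0.1, ensemble_size=10):
    """Illustrative Pass step: split the data, build base models, select the best."""
    random.shuffle(records)                                 # randomize record order
    n_holdout = int(len(records) * holdout_frac)
    n_test = int(len(records) * test_frac)
    holdout = records[:n_holdout]
    testing = records[n_holdout:n_holdout + n_test]
    training = records[n_holdout + n_test:]

    # One training block per base model.
    blocks = [training[i:i + block_size] for i in range(0, len(training), block_size)]

    reference_model = build_model(testing)                  # reference model on the testing set
    base_models = [build_model(block) for block in blocks]  # one model per training block

    # Rank base models by accuracy on the testing set and keep the best ones.
    ranked = sorted(base_models, key=lambda m: evaluate(m, testing), reverse=True)
    ensemble = ranked[:ensemble_size]
    return ensemble, reference_model, testing, holdout
```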
During the Stream step, when new cases arrive and the existing ensemble model needs to be updated with these cases, the algorithm will:
1. Start a Pass operation to build an ensemble model on the new data, and then
2. Merge the newly created ensemble model and the existing ensemble model.
The Merge operation has the following steps:
1. Merging the holdout sets into a single holdout set and, if necessary, reducing the set
to a reasonable size.
2. Merging the testing sets into a single testing set and, if necessary, reducing the set to
a reasonable size.
3. Building a merged reference model on the merged testing set.
4. Evaluating every base model by computing its accuracy based on the merged testing
set and selecting a subset of base models as elements of the merged ensemble model
according to accuracy.
5. Evaluating the merged ensemble model and the merged reference model by
computing their accuracy based on the merged holdout set.
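A minimal sketch of the Merge steps, under the same placeholder assumptions as the Pass sketch above (build_model, evaluate, and the size cap are illustrative, not the authors' code):

```python
def merge_operation(ensembles, testing_sets, holdout_sets, build_model, evaluate,
                    ensemble_size=10, max_sample=10000):
    """Illustrative Merge step: combine testing/holdout sets and re-select base models."""
    # Merge and cap the sets (a real implementation would subsample randomly).
    merged_holdout = sum(holdout_sets, [])[:max_sample]
    merged_testing = sum(testing_sets, [])[:max_sample]

    merged_reference = build_model(merged_testing)          # reference model on merged testing set

    # Re-rank all base models on the merged testing set and keep the best ones.
    all_base_models = [m for ensemble in ensembles for m in ensemble]
    ranked = sorted(all_base_models, key=lambda m: evaluate(m, merged_testing), reverse=True)
    merged_ensemble = ranked[:ensemble_size]

    # Compare with the reference model on the merged holdout set
    # (simplified: a real implementation would score combined ensemble predictions).
    ensemble_score = sum(evaluate(m, merged_holdout) for m in merged_ensemble) / len(merged_ensemble)
    reference_score = evaluate(merged_reference, merged_holdout)
    return merged_ensemble, merged_reference, ensemble_score, reference_score
```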
Figure 1. Data blocks and base models
Figure 2. File sizes and base models
It can be shown that in map-reduce environments the training block or partition size of the input source is often uneven and rarely of an appropriate size for building models. The simplest example of uneven partition sizes is caused by the fact that the last block of a file in the distributed file system is almost always a different size from those before it.
In the example of Figure 1, the base model built from the last block is built from a smaller number of records. When the dataset consists of multiple files, as is often the case, the number of small partitions increases. This is illustrated in Figure 2.
Another assumption that is frequently violated is that the input records are randomly distributed among partitions. If the records in the dataset are ordered by the values of the modeling target field, an input field, or a field correlated with them, base models either cannot be built or exhibit low predictive accuracy. In the example shown in Figure 3, records are ordered by the binary-valued target. No model can be built from the first or the last partition because there is no variation in the target value. A model can probably be built from the second partition, but its quality depends on where the boundary between the two target values lies.
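To make the single-class-partition failure concrete, here is a small illustrative check (assuming scikit-learn is available; this example is not part of the paper): most classifiers refuse to train when a partition contains only one target value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 3)   # records from the first partition
y = np.zeros(100)            # target is constant: only one class is present

try:
    LogisticRegression().fit(X, y)
except ValueError as err:
    # scikit-learn rejects single-class training data, so no base model can be built here
    print("cannot build base model:", err)
```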
Figure 3. The data sort order and base models
An existing partial solution to the problem is to first run a map-reduce job that shuffles the input
records and creates a temporary data set. Shuffling is achieved by creating a random-valued field
and sorting the data by that field. The ensemble-building program is subsequently run with the
temporary data set as input. The solution is only partial because, while records are now randomly
ordered, the partitioning of data is determined by the needs of the random sorting. The resulting
partitioning is rarely optimal because the upper limit on the partition size is still dependent on the
default block size employed by the map-reduce framework, and smaller, unevenly sized partitions
may be created. This approach also requires the duplication of all the input data in a temporary
data set.
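For comparison, the explicit-shuffling workaround described above could be sketched as a standalone shuffle pass. This is a hedged example in a streaming-mapper style; the random field name and script name are made up, and a real deployment would use the framework's own job APIs.

```python
# shuffle_mapper.py -- emit a random key so the framework's sort randomizes record order.
# The sorted output must then be written to a temporary dataset before model building,
# which is exactly the extra I/O and duplication the proposed approach avoids.
import random
import sys

for line in sys.stdin:
    record = line.rstrip("\n")
    random_key = random.random()              # the added random-valued field
    print(f"{random_key:.12f}\t{record}")     # the framework sorts records by this key
```

An identity reducer then writes the sorted records to the temporary dataset, roughly doubling the input footprint before the ensemble-building job can even start.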
We propose a method that provides optimally-sized partitions of shuffled, or randomly ordered, records to model-building steps using facilities built into map-reduce frameworks. The model-building step is run in reducers with input partitions whose size is configured automatically or by the user. The contents of the partitions are randomly assigned. Our approach allows the partition size to be set at runtime. The partition size may be based on statistical heuristic rules, properties of the modeling problem, properties of the computing environment, or any combination of these factors. Each partition consists of a set of records selected with equal probability from the input. The advantages of using our method over the known explicit shuffling solution are:
1. Our method guarantees partitions of uniform optimal size. The explicit shuffling solution cannot guarantee a given size or uniform sizes.
2. Map-reduce frameworks have built-in mechanisms for automatically grouping records passed to reducers. Explicit shuffling incurs the additional cost of the creation of a temporary dataset and a sort operation.
This description is confined to the portion of the ensemble modeling process where base models are built, because that is the step our approach improves.
In addition to partitions for building base models, ensemble modeling requires the creation of two
small random samples of records called the validation and the holdout samples. The sizes of these
samples are preset constants. A so-called reference model is built from the validation sample in
order to compare with the ensemble later. The validation sample is also used to rank the
predictive performance of the base models in a later step. Also in a later step, the holdout sample
is used to compare the predictive performance of the ensemble with that of the reference model.
The desired number of models in the final ensemble, E, is determined by the user. It is usually in
the range 10-100. The rules we use to determine how to partition the data balance the goal of a
desirable partition size with that of building an ensemble of the desired size.
To compute the average adjusted partition size, we pick an optimal value for the size of the base
model partition, B. B may be based on statistical heuristic rules, properties of the modeling
problem, properties of the computing environment, or any combination of these factors.
We also determine a minimum acceptable value for the size of the base model partition, Bmin,
based on the same factors. Given N, the total number of records in the input dataset, the size of
the holdout sample H, and the size of the validation sample V, we determine the number of base
models, S, as follows:
Given S, compute an average adjusted partition size, B′ as
B′ = (N−H−V)/S.
B′ is usually not a whole number. Sampling probabilities for base partitions, the validation
sample, and the holdout sample are B′/N, V/N, and H/N, respectively. Note that
S ∗ (B′/N) + V/N + H/N = N/N = 1
so that the sampling probabilities add up to 1.
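A quick numeric check of these quantities; the figures below are made-up example values chosen for illustration, not values from the paper.

```python
# Hypothetical example: 1,000,000 input records, 20,000-record holdout and
# validation samples, and S = 49 base-model partitions.
N, H, V, S = 1_000_000, 20_000, 20_000, 49

B_prime = (N - H - V) / S          # average adjusted partition size B'
p_base = B_prime / N               # sampling probability for each base partition
p_validation = V / N
p_holdout = H / N

total = S * p_base + p_validation + p_holdout
print(B_prime)   # 19591.83..., usually not a whole number
print(total)     # ~1.0 (up to floating-point rounding): the probabilities add up to 1
```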
The map stage consists of randomly assigning each record one of S+2 keys 1, 2, ..., S+2. Key values 1 and 2 correspond to the holdout and validation samples, respectively. Thus, a given record is assigned key 1 with probability H/N, key 2 with probability V/N, and keys 3..S+2 each with probability B′/N. The resulting value is used as the map-reduce key. The key is also written to mapper output records so that the reducers can distinguish partitions 1 and 2 from the base model partitions. The sizes of the resulting partitions will be approximately B′, H and V.
In the reduce stage, we build models from each partition except the holdout sample.
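The key-assignment logic could look like the following sketch (hedged: streaming-style mapper and reducer helpers in Python, with build_model as a placeholder; the actual implementation targets the map-reduce framework's native APIs):

```python
import random

def assign_key(S, p_holdout, p_validation):
    """Draw one of the S+2 keys: 1 = holdout, 2 = validation, 3..S+2 = base-model partitions."""
    r = random.random()
    if r < p_holdout:
        return 1
    if r < p_holdout + p_validation:
        return 2
    # The remaining probability mass is split evenly over the S base partitions,
    # so each base partition is drawn with probability B'/N.
    return 3 + random.randrange(S)

def map_record(record, S, p_holdout, p_validation):
    key = assign_key(S, p_holdout, p_validation)
    return key, record                 # the key is emitted together with the record

def reduce_partition(key, records, build_model):
    if key == 1:
        return None                    # holdout sample: no model is built here
    return build_model(records)        # key 2: reference model; keys 3..S+2: base models
```

Because each record draws its key independently with the probabilities derived above, the resulting partition sizes are approximately B′, H and V regardless of the order or grouping of the input.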
Figure 4. The data flow through mappers to reducers
Figure 4 shows the flow of input data through an arbitrary number K of mappers to the reducers
where base and reference models are built. Regardless of the order and grouping of the input data,
the expected sizes of the holdout, validation and base model training partitions are as determined
above and their contents are randomly assigned.
3. CONCLUSIONS
We have presented an algorithm for creating optimally-sized random partitions for data-parallel ensemble model building in map-reduce environments. The approach improves
performance and statistical validity of the generated ensemble models by introducing random
record keys that reflect probabilities for the holdout, validation and training samples. The keys are
also written to mapper output records so that the reducers can distinguish holdout and validation
partitions from the base model partitions. In the reduce stage, we build models from each partition
except the holdout sample.
Future work includes implementing the algorithm in a real cloud setup and testing its performance and statistical validity under the anticipated workload.
ACKNOWLEDGEMENTS
The authors thank Steven Halter of IBM for his invaluable comments during the preparation of this paper.
REFERENCES
[1] R. Schapire and Y. Freund. Boosting: Foundations and Algorithms. MIT, 2012.
[2] R. Bordawekar, B. Blainey, C. Apte, and M. McRoberts. A survey of business analytics models and algorithms. IBM Research Report RC25186, IBM, November 2011. W1107-029.
[3] L. Wilkinson. System and method for computing analytics on structured data. Patent US 7627432,
December 2009.
[4] M. I. Danciu, L. Fan, M. McRoberts, J Shyr, D. Spisic, and J. Xu. Generating a predictive model from
multiple data sources. Patent US 20120278275 A1, November 2012.
[5] R. Levesque. Programming and Data Management for IBM SPSS Statistics. IBM Corporation,
Armonk, New York, 2011.
[6] R. Polikar. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine,
6(3):21–45, 2006.
[7] D. Opitz and R. Maclin. Popular ensemble methods: An empirical study. Journal of Artificial
Intelligence Research, 1(11):169–198, 1999.
[8] B. Zenko. Is combining classifiers better than selecting the best one. Machine Learning, 1(1):255–273, 2004.
[9] Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC, 2012.
[10] IBM SPSS Modeler 16 Algorithms Guide. IBM Corporation, Armonk, New York, 2013.
AUTHORS
Ates Dagli is a Principal Software Engineer working on the IBM SPSS Analytic Server line of products. Mr. Dagli graduated from Columbia University in the City of New York with a Master of Philosophy degree in Economics in 1978. Mr. Dagli led the design and implementation of a variety of statistical and machine learning algorithms that address the entire analytical process, from planning to data collection to analysis, reporting and deployment.
Niall McCarroll, Ph.D., is an Architect and Senior Software Engineer working on the IBM
Watson Machine Learning and predictive analytics tools. In the last few years Niall has
been involved in the creation and teaching of a module covering applied predictive analytics
in several UK universities.
Dmitry Vasilenko is an Architect and Senior Software Engineer working on the IBM Watson Cloud Platform. He received an M.S. degree in Electrical Engineering from Novosibirsk State Technical University, Russian Federation, in 1986. Before joining IBM SPSS in 1997, Mr. Vasilenko led Computer Aided Design projects in the area of Electrical Engineering at the Institute of Electric Power System and Electric Transmission Networks. During his tenure with IBM, Mr. Vasilenko received three technical excellence awards for his work in Business Analytics. He is an author or co-author of 11 technical papers and a US patent.