International Journal on Cloud Computing: Services and Architecture (IJCCSA) Vol. 7, No. 3/4, August 2017
DOI: 10.5121/ijccsa.2017.7401
DATA PARTITIONING FOR ENSEMBLE MODEL BUILDING
Ates Dagli¹, Niall McCarroll² and Dmitry Vasilenko³
¹IBM Big Data Analytics, Chicago, USA
²IBM Watson Machine Learning, Hursley, UK
³IBM Watson Cloud Platform and Data Science, Chicago, USA
ABSTRACT
In distributed ensemble model-building algorithms, the performance and statistical validity of models
depend on the sizes of the input data partitions as well as the distribution of records among the partitions.
Failure to correctly select and pre-process the data often results in models that are unstable and
perform poorly. This article introduces an optimized approach to building ensemble models for very
large data sets in distributed map-reduce environments using the Pass-Stream-Merge (PSM) algorithm. To
ensure model correctness, the input data is randomly distributed using the facilities built into map-reduce
frameworks.
KEYWORDS
Ensemble Models, Pass-Stream-Merge, Big Data, Map-Reduce, Cloud
1. INTRODUCTION
Ensemble models are used to enhance model accuracy (boosting) [1], enhance model stability
(bagging) [2], and build models for very large datasets (pass, stream, merge) [3]. In distributed
ensemble model-building algorithms, one so-called base model is built from each data partition
(split) and evaluated against a sample set aside for this purpose. The best-performing base models
are then selected and combined into a model ensemble for purposes of prediction. Both model-
building performance and the statistical validity of the models depend on data records being
distributed approximately randomly across roughly equal-sized partitions. When implemented in
a map-reduce framework, base models are built in mappers [4]. Sizes of data partitions and the
distribution of records among them are properties of the input data source. The partitions of
the input source are often uneven in size and rarely of a size appropriate for building models. Furthermore,
data records are frequently arranged in some systematic order rather than randomly. As a
result, base models sometimes fail to build or, worse, produce incorrect or suboptimal
results. The algorithm proposed in this paper removes the variability in partition size and record ordering
and, as a result, improves the performance and statistical validity of the generated models.
2. METHODOLOGY
We implement the PSM features Pass, Stream and Merge through ensemble modeling [2], [5],
[6], [7], [8], [9]. Pass builds models on very large data sets with only one data pass [3]; Stream
updates the existing model with new cases without the need to store or recall the old training data;
Merge builds models in a distributed environment and merges the built models into one model. In
an ensemble model, the training set will be divided into subsets called blocks, and a model will be
built on each block. Because the blocks may be dispatched to different processing nodes in the
map-reduce environment, models can be built concurrently. As new data blocks arrive, the
algorithm repeats the procedure. Therefore, it can handle the data stream and perform incremental
learning for ensemble modeling [10]. The Pass operation includes the following steps:
1. Splitting the data into training blocks, a testing set and a holdout set.
2. Building base models on training blocks and a reference model on the testing set.
One model is built on the testing set and one on each training block.
3. Evaluating each base model by computing its accuracy based on the testing set and
selecting a subset of base models as ensemble elements according to accuracy.
During the Stream step, when new cases arrive and the existing ensemble model
needs to be updated with these cases, the algorithm will:
1. Start a Pass operation to build an ensemble model on the new data, and then
2. Merge the newly created ensemble model and the existing ensemble model.
The Merge operation has the following steps:
1. Merging the holdout sets into a single holdout set and, if necessary, reducing the set
to a reasonable size.
2. Merging the testing sets into a single testing set and, if necessary, reducing the set to
a reasonable size.
3. Building a merged reference model on the merged testing set.
4. Evaluating every base model by computing its accuracy based on the merged testing
set and selecting a subset of base models as elements of the merged ensemble model
according to accuracy.
5. Evaluating the merged ensemble model and the merged reference model by
computing their accuracy based on the merged holdout set.
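The evaluation and selection step shared by the Pass and Merge operations can be sketched as follows. This is a minimal illustration only: the text says base models are selected "according to accuracy" but does not state the exact rule, so keeping the top `max_models` models is an assumption, and `accuracy`, `select_ensemble` and the toy models are hypothetical names introduced here.

```python
def accuracy(model, testing_set):
    """Fraction of (features, target) pairs that the model predicts correctly."""
    correct = sum(1 for features, target in testing_set if model(features) == target)
    return correct / len(testing_set)


def select_ensemble(base_models, testing_set, max_models):
    """Rank base models by accuracy on the testing set and keep the best ones.

    Keeping the top `max_models` is an assumed selection rule.
    """
    ranked = sorted(base_models, key=lambda m: accuracy(m, testing_set), reverse=True)
    return ranked[:max_models]


# Toy models: each maps a numeric feature to a binary prediction.
def always_zero(x):
    return 0


def always_one(x):
    return 1


def threshold(x):
    return int(x >= 1)


testing_set = [(0, 0), (1, 1), (2, 1), (3, 1)]
selected = select_ensemble([always_zero, always_one, threshold], testing_set, max_models=2)
print([m.__name__ for m in selected])
```

In the Merge operation the same ranking would be applied to every base model against the merged testing set.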
Figure 1. Data blocks and base models
Figure 2. File sizes and base models
In map-reduce environments, the training block, or partition, size of the input source is often
uneven and rarely appropriate for building models. The simplest example of uneven partition sizes
arises because the last block of a file in the distributed file system is almost always a different
size from those before it.
In the example of Figure 1, the base model built from the last block is built from a smaller
number of records. When the dataset is composed of multiple files, as is often the case, the
number of small partitions increases. This is illustrated in Figure 2.
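The effect of file boundaries on partition sizes can be illustrated with a short calculation. The sketch assumes that each file is split independently into fixed-size blocks; the file sizes and the 128 MB block size are hypothetical values chosen for illustration, and `block_sizes` is a name introduced here.

```python
def block_sizes(file_sizes, block_size):
    """Sizes of the blocks produced when each file is split independently
    into fixed-size blocks; the last block of a file holds the remainder."""
    sizes = []
    for size in file_sizes:
        full_blocks, remainder = divmod(size, block_size)
        sizes.extend([block_size] * full_blocks)
        if remainder:
            sizes.append(remainder)
    return sizes


# Hypothetical sizes in MB.
single_file = block_sizes([300], block_size=128)
many_files = block_sizes([300, 140, 50], block_size=128)
print(single_file)  # one short trailing block
print(many_files)   # one short block per file
```

With a single file only the trailing block is short; with three files, three of the six blocks are short, so half of the base models would be built from undersized partitions.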
Another assumption that is frequently violated is that the input records are randomly distributed
among partitions. If the records in the dataset are ordered by the values of the modeling target
field, an input field, or a field correlated with them, base models either cannot be built or exhibit low
predictive accuracy. In the example shown in Figure 3, records are ordered by the binary-valued
target. No model can be built from the first or the last partition because there is no variation in the
target value. A model can probably be built from the second partition but its quality depends on
where the boundary between the two target values lies.
Figure 3. The data sort order and base models
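The ordering problem can be reproduced with a small sketch. It assumes 90 records sorted by a binary target and split into three contiguous partitions; the counts are hypothetical, and `contiguous_partitions` and `can_build_model` are illustrative names rather than part of the described system.

```python
def contiguous_partitions(records, n_parts):
    """Split records into n_parts contiguous slices, as a block-based reader would."""
    size = -(-len(records) // n_parts)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def can_build_model(targets):
    """A classifier needs at least two distinct target values to learn from."""
    return len(set(targets)) > 1


# 90 records sorted by a binary target: 40 zeros followed by 50 ones.
targets = [0] * 40 + [1] * 50
parts = contiguous_partitions(targets, 3)
buildable = [can_build_model(p) for p in parts]
print(buildable)
```

Only the middle partition contains both target values, and its class balance (10 zeros to 20 ones here) depends entirely on where the boundary between the two values falls.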
An existing partial solution to the problem is to first run a map-reduce job that shuffles the input
records and creates a temporary data set. Shuffling is achieved by creating a random-valued field
and sorting the data by that field. The ensemble-building program is subsequently run with the
temporary data set as input. The solution is only partial because, while records are now randomly
ordered, the partitioning of data is determined by the needs of the random sorting. The resulting
partitioning is rarely optimal because the upper limit on the partition size is still dependent on the
default block size employed by the map-reduce framework, and smaller, unevenly sized partitions
may be created. This approach also requires the duplication of all the input data in a temporary
data set.
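The explicit shuffling approach can be sketched in a single process as follows. This is an in-memory stand-in for the map-reduce job, not the job itself; `explicit_shuffle` is a hypothetical name, and the `keyed` list plays the role of the temporary data set.

```python
import random


def explicit_shuffle(records, seed=None):
    """Shuffle by attaching a random-valued field and sorting on it."""
    rng = random.Random(seed)
    # Temporary copy of every record, each paired with a random sort value.
    keyed = [(rng.random(), record) for record in records]
    keyed.sort(key=lambda pair: pair[0])
    return [record for _, record in keyed]


records = list(range(100))
shuffled = explicit_shuffle(records, seed=1)
print(shuffled[:10])
```

The sketch makes both costs visible: every record is duplicated into `keyed`, and a full sort is required. Nothing in it controls how the shuffled output is later divided into partitions.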
We propose a method that provides optimally-sized partitions of shuffled, or randomly ordered,
records to model-building steps using facilities built into map-reduce frameworks. The model-
building step is run in reducers with input partitions whose size is configurable automatically or
by the user. The contents of the partitions are randomly assigned. Our approach allows the
partition size to be set at runtime. The partition size may be based on statistical heuristic rules,
properties of the modeling problem, properties of the computing environment, or any
combination of these factors. Each partition consists of a set of records selected with equal
probability from the input. The advantages of using our method over the known explicit shuffling
solution are:
1. Our method guarantees partitions of uniform optimal size. The explicit shuffling
solution cannot guarantee a given size or uniform sizes.
2. Map-reduce frameworks have built-in mechanisms for automatically grouping
records passed to reducers. Explicit shuffling incurs the additional cost of the creation
of a temporary dataset and a sort operation.
This description is confined to the portion of the ensemble modeling process where base models
are built because that is the step our approach improves.
In addition to partitions for building base models, ensemble modeling requires the creation of two
small random samples of records called the validation and the holdout samples. The sizes of these
samples are preset constants. A so-called reference model is built from the validation sample in
order to compare with the ensemble later. The validation sample is also used to rank the
predictive performance of the base models in a later step. Also in a later step, the holdout sample
is used to compare the predictive performance of the ensemble with that of the reference model.
The desired number of models in the final ensemble, E, is determined by the user. It is usually in
the range 10-100. The rules we use to determine how to partition the data balance the goal of a
desirable partition size with that of building an ensemble of the desired size.
To compute the average adjusted partition size we pick an optimal value for the size of the base
model partition, B. B may be based on statistical heuristic rules, properties of the modeling
problem, properties of the computing environment, or any combination of these factors.
We also determine a minimum acceptable value for the size of the base model partition, Bmin,
based on the same factors. Given N, the total number of records in the input dataset, the size of
the holdout sample H, and the size of the validation sample V, we determine the number of base
models, S, as follows:
Given S, compute an average adjusted partition size, B′ as
B′ = (N−H−V)/S.
B′ is usually not a whole number. Sampling probabilities for base partitions, the validation
sample, and the holdout sample are B′/N, V/N, and H/N, respectively. Note that
S ∗ (B′/N) + V/N + H/N = N/N = 1
so that the sampling probabilities add up to 1.
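A worked example of these quantities is given below. The values of N, H, V and S are hypothetical, and S is taken as an input here because the rule that derives it is not part of this sketch; `sampling_probabilities` is an illustrative name.

```python
import math


def sampling_probabilities(N, H, V, S):
    """Return B' and the sampling probabilities for the holdout sample,
    the validation sample, and a single base partition."""
    b_adj = (N - H - V) / S  # average adjusted partition size B'
    return b_adj, H / N, V / N, b_adj / N


# Hypothetical values chosen only for illustration.
N, H, V, S = 1_000_000, 10_000, 10_000, 40
b_adj, p_holdout, p_validation, p_base = sampling_probabilities(N, H, V, S)
total = S * p_base + p_validation + p_holdout
print(b_adj, p_holdout, p_validation, p_base, total)
```

Here B′ = 980,000 / 40 = 24,500, so each base partition is sampled with probability 0.0245 and the holdout and validation samples with probability 0.01 each, which together sum to 1 up to floating-point rounding.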
The map stage consists of randomly assigning each record one of S+2 keys 1, 2, ..., S+2. Key
values 1 and 2 correspond to the holdout and validation samples, respectively. Thus, a given
record is assigned key 1 with probability H/N, key 2 with probability V/N, and keys 3..S+2 each
with probability B′/N. The resulting value is used as the map-reduce key. The key is also written
to mapper output records so that the reducers can distinguish partitions 1 and 2 from the base
model partitions. The sizes of the resulting partitions will be approximately B′, H and V.
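The key assignment in the map stage can be sketched as a single uniform draw per record. This is a single-process illustration under stated assumptions, not a mapper for any particular framework: `assign_key` and `map_stage` are hypothetical names, N is taken from the length of an in-memory list (a real job would need the record count in advance), and the parameter values are illustrative.

```python
import random
from collections import Counter

HOLDOUT_KEY = 1
VALIDATION_KEY = 2


def assign_key(N, H, V, S, rng):
    """Draw key 1 with probability H/N, key 2 with probability V/N,
    and each of the keys 3..S+2 with probability B'/N."""
    u = rng.random() * N
    if u < H:
        return HOLDOUT_KEY
    if u < H + V:
        return VALIDATION_KEY
    b_adj = (N - H - V) / S
    # min() guards against floating-point rounding at the upper edge.
    return min(3 + int((u - H - V) / b_adj), S + 2)


def map_stage(records, H, V, S, seed=None):
    """Emit (key, record) pairs, as a mapper would."""
    rng = random.Random(seed)
    N = len(records)
    return [(assign_key(N, H, V, S, rng), record) for record in records]


records = list(range(10_000))
pairs = map_stage(records, H=500, V=500, S=9, seed=0)
sizes = Counter(key for key, _ in pairs)
print(sorted(sizes.items()))
```

Because each record is assigned independently, the partition sizes are random: they are close to H, V and B′ (here 500, 500 and 1,000) without being exactly equal to them.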
In the reduce stage, we build models from each partition except the holdout sample.
Figure 4. The data flow through mappers to reducers
Figure 4 shows the flow of input data through an arbitrary number K of mappers to the reducers
where base and reference models are built. Regardless of the order and grouping of the input data,
the expected sizes of the holdout, validation and base model training partitions are as determined
above and their contents are randomly assigned.
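The reduce stage can be sketched in the same single-process style. The grouping step stands in for the framework's built-in grouping of records by key, and `build_model` is a hypothetical placeholder for whatever model-building routine is used; here it simply averages numeric records so that the example stays self-contained.

```python
from collections import defaultdict

HOLDOUT_KEY = 1
VALIDATION_KEY = 2


def group_by_key(pairs):
    """Stand-in for the framework's grouping of mapper output by key."""
    groups = defaultdict(list)
    for key, record in pairs:
        groups[key].append(record)
    return groups


def reduce_stage(groups, build_model):
    """Build a model from every partition except the holdout sample."""
    reference_model = None
    base_models = {}
    for key, partition in groups.items():
        if key == HOLDOUT_KEY:
            continue  # the holdout sample is reserved for the final comparison
        model = build_model(partition)
        if key == VALIDATION_KEY:
            reference_model = model
        else:
            base_models[key] = model
    return reference_model, base_models, groups.get(HOLDOUT_KEY, [])


def mean_model(partition):
    """Placeholder 'model': the mean of the numeric records."""
    return sum(partition) / len(partition)


pairs = [(1, 5.0), (2, 1.0), (2, 3.0), (3, 10.0), (3, 20.0), (4, 7.0)]
reference_model, base_models, holdout = reduce_stage(group_by_key(pairs), mean_model)
print(reference_model, base_models, holdout)
```

The key written to each output record is what lets the reducer tell the holdout and validation partitions apart from the base model partitions.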
3. CONCLUSIONS
We have presented an algorithm for creating optimally sized random partitions for data-parallel
ensemble model building in map-reduce environments. The approach improves
performance and statistical validity of the generated ensemble models by introducing random
record keys that reflect probabilities for the holdout, validation and training samples. The keys are
also written to mapper output records so that the reducers can distinguish holdout and validation
partitions from the base model partitions. In the reduce stage, we build models from each partition
except the holdout sample.
The future plan is to implement the algorithm in a real cloud setup and test its performance and
statistical validity under the anticipated workload.
ACKNOWLEDGEMENTS
The authors thank Steven Halter of IBM for invaluable comments during preparation of this paper.
REFERENCES
[1] R. Schapire and Y. Freund. Boosting: Foundations and Algorithms. MIT, 2012.
[2] R. Bordawekar, B. Blainey, C. Apte, and M. McRoberts. A survey of business analytics models and
algorithms. IBM Research Report RC25186, IBM, November 2011. W1107-029.
[3] L. Wilkinson. System and method for computing analytics on structured data. Patent US 7627432,
December 2009.
[4] M. I. Danciu, L. Fan, M. McRoberts, J. Shyr, D. Spisic, and J. Xu. Generating a predictive model from
multiple data sources. Patent US 20120278275 A1, November 2012.
[5] R. Levesque. Programming and Data Management for IBM SPSS Statistics. IBM Corporation,
Armonk, New York, 2011.
[6] R. Polikar. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine,
6(3):21–45, 2006.
[7] D. Opitz and R. Maclin. Popular ensemble methods: An empirical study. Journal of Artificial
Intelligence Research, 1(11):169–198, 1999.
[8] B. Zenko. Is combining classifiers better than selecting the best one. Machine Learning,
1(1):255–273, 2004.
[9] Z. Zhihua. Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC, 2012.
[10] IBM SPSS Modeler 16 Algorithms Guide. IBM Corporation, Armonk, New York, 2013.
AUTHORS
Ates Dagli is a Principal Software Engineer working on the IBM SPSS Analytic Server line
of products. Mr. Dagli graduated from Columbia University in the City of New York
with a Master of Philosophy degree in Economics in 1978. Mr. Dagli led the design and
implementation of a variety of statistical and machine learning algorithms that address the
entire analytical process, from planning to data collection to analysis, reporting and
deployment.
Niall McCarroll, Ph.D., is an Architect and Senior Software Engineer working on the IBM
Watson Machine Learning and predictive analytics tools. In the last few years Niall has
been involved in the creation and teaching of a module covering applied predictive analytics
in several UK universities.
Dmitry Vasilenko is an Architect and Senior Software Engineer working on the IBM
Watson Cloud Platform. He received an M.S. degree in Electrical Engineering from
Novosibirsk State Technical University, Russian Federation, in 1986. Before joining IBM
SPSS in 1997 Mr. Vasilenko led Computer Aided Design projects in the area of Electrical
Engineering at the Institute of Electric Power System and Electric Transmission Networks.
During his tenure with IBM Mr. Vasilenko received three technical excellence awards for his work in
Business Analytics. He is an author or co-author of 11 technical papers and a US patent.

Intrusion Detection and Marking Transactions in a Cloud of Databases Environm...neirew J
 
A Survey on Resource Allocation in Cloud Computing
A Survey on Resource Allocation in Cloud ComputingA Survey on Resource Allocation in Cloud Computing
A Survey on Resource Allocation in Cloud Computingneirew J
 
An Approach to Reduce Energy Consumption in Cloud data centers using Harmony ...
An Approach to Reduce Energy Consumption in Cloud data centers using Harmony ...An Approach to Reduce Energy Consumption in Cloud data centers using Harmony ...
An Approach to Reduce Energy Consumption in Cloud data centers using Harmony ...neirew J
 
Data Distribution Handling on Cloud for Deployment of Big Data
Data Distribution Handling on Cloud for Deployment of Big DataData Distribution Handling on Cloud for Deployment of Big Data
Data Distribution Handling on Cloud for Deployment of Big Dataneirew J
 
Multi-Campus Universities Private-Cloud Migration Infrastructure
Multi-Campus Universities Private-Cloud Migration Infrastructure Multi-Campus Universities Private-Cloud Migration Infrastructure
Multi-Campus Universities Private-Cloud Migration Infrastructure neirew J
 
Implementation of the Open Source Virtualization Technologies in Cloud Computing
Implementation of the Open Source Virtualization Technologies in Cloud ComputingImplementation of the Open Source Virtualization Technologies in Cloud Computing
Implementation of the Open Source Virtualization Technologies in Cloud Computingneirew J
 
A Broker-based Framework for Integrated SLA-Aware SaaS Provisioning
A Broker-based Framework for Integrated SLA-Aware SaaS Provisioning A Broker-based Framework for Integrated SLA-Aware SaaS Provisioning
A Broker-based Framework for Integrated SLA-Aware SaaS Provisioning neirew J
 
Comparative Study of Various Platform as a Service Frameworks
Comparative Study of Various Platform as a Service Frameworks Comparative Study of Various Platform as a Service Frameworks
Comparative Study of Various Platform as a Service Frameworks neirew J
 
Neuro-Fuzzy System Based Dynamic Resource Allocation in Collaborative Cloud C...
Neuro-Fuzzy System Based Dynamic Resource Allocation in Collaborative Cloud C...Neuro-Fuzzy System Based Dynamic Resource Allocation in Collaborative Cloud C...
Neuro-Fuzzy System Based Dynamic Resource Allocation in Collaborative Cloud C...neirew J
 
A Proposed Model for Improving Performance and Reducing Costs of IT Through C...
A Proposed Model for Improving Performance and Reducing Costs of IT Through C...A Proposed Model for Improving Performance and Reducing Costs of IT Through C...
A Proposed Model for Improving Performance and Reducing Costs of IT Through C...neirew J
 
Improved Secure Cloud Transmission Protocol
Improved Secure Cloud Transmission ProtocolImproved Secure Cloud Transmission Protocol
Improved Secure Cloud Transmission Protocolneirew J
 

More from neirew J (20)

ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
 
SUCCESS-DRIVING BUSINESS MODEL CHARACTERISTICS OF IAAS AND PAAS PROVIDERS
SUCCESS-DRIVING BUSINESS MODEL CHARACTERISTICS OF IAAS AND PAAS PROVIDERSSUCCESS-DRIVING BUSINESS MODEL CHARACTERISTICS OF IAAS AND PAAS PROVIDERS
SUCCESS-DRIVING BUSINESS MODEL CHARACTERISTICS OF IAAS AND PAAS PROVIDERS
 
Strategic Business Challenges in Cloud Systems
Strategic Business Challenges in Cloud SystemsStrategic Business Challenges in Cloud Systems
Strategic Business Challenges in Cloud Systems
 
Laypeople's and Experts' Risk Perception of Cloud Computing Services
Laypeople's and Experts' Risk Perception of Cloud Computing Services Laypeople's and Experts' Risk Perception of Cloud Computing Services
Laypeople's and Experts' Risk Perception of Cloud Computing Services
 
Factors Influencing Risk Acceptance of Cloud Computing Services in the UK Gov...
Factors Influencing Risk Acceptance of Cloud Computing Services in the UK Gov...Factors Influencing Risk Acceptance of Cloud Computing Services in the UK Gov...
Factors Influencing Risk Acceptance of Cloud Computing Services in the UK Gov...
 
A Cloud Security Approach for Data at Rest Using FPE
A Cloud Security Approach for Data at Rest Using FPE A Cloud Security Approach for Data at Rest Using FPE
A Cloud Security Approach for Data at Rest Using FPE
 
Error Isolation and Management in Agile Multi-Tenant Cloud Based Applications
Error Isolation and Management in Agile Multi-Tenant Cloud Based Applications Error Isolation and Management in Agile Multi-Tenant Cloud Based Applications
Error Isolation and Management in Agile Multi-Tenant Cloud Based Applications
 
Locality Sim : Cloud Simulator with Data Locality
Locality Sim : Cloud Simulator with Data LocalityLocality Sim : Cloud Simulator with Data Locality
Locality Sim : Cloud Simulator with Data Locality
 
Benefits and Challenges of the Adoption of Cloud Computing in Business
Benefits and Challenges of the Adoption of Cloud Computing in BusinessBenefits and Challenges of the Adoption of Cloud Computing in Business
Benefits and Challenges of the Adoption of Cloud Computing in Business
 
Intrusion Detection and Marking Transactions in a Cloud of Databases Environm...
Intrusion Detection and Marking Transactions in a Cloud of Databases Environm...Intrusion Detection and Marking Transactions in a Cloud of Databases Environm...
Intrusion Detection and Marking Transactions in a Cloud of Databases Environm...
 
A Survey on Resource Allocation in Cloud Computing
A Survey on Resource Allocation in Cloud ComputingA Survey on Resource Allocation in Cloud Computing
A Survey on Resource Allocation in Cloud Computing
 
An Approach to Reduce Energy Consumption in Cloud data centers using Harmony ...
An Approach to Reduce Energy Consumption in Cloud data centers using Harmony ...An Approach to Reduce Energy Consumption in Cloud data centers using Harmony ...
An Approach to Reduce Energy Consumption in Cloud data centers using Harmony ...
 
Data Distribution Handling on Cloud for Deployment of Big Data
Data Distribution Handling on Cloud for Deployment of Big DataData Distribution Handling on Cloud for Deployment of Big Data
Data Distribution Handling on Cloud for Deployment of Big Data
 
Multi-Campus Universities Private-Cloud Migration Infrastructure
Multi-Campus Universities Private-Cloud Migration Infrastructure Multi-Campus Universities Private-Cloud Migration Infrastructure
Multi-Campus Universities Private-Cloud Migration Infrastructure
 
Implementation of the Open Source Virtualization Technologies in Cloud Computing
Implementation of the Open Source Virtualization Technologies in Cloud ComputingImplementation of the Open Source Virtualization Technologies in Cloud Computing
Implementation of the Open Source Virtualization Technologies in Cloud Computing
 
A Broker-based Framework for Integrated SLA-Aware SaaS Provisioning
A Broker-based Framework for Integrated SLA-Aware SaaS Provisioning A Broker-based Framework for Integrated SLA-Aware SaaS Provisioning
A Broker-based Framework for Integrated SLA-Aware SaaS Provisioning
 
Comparative Study of Various Platform as a Service Frameworks
Comparative Study of Various Platform as a Service Frameworks Comparative Study of Various Platform as a Service Frameworks
Comparative Study of Various Platform as a Service Frameworks
 
Neuro-Fuzzy System Based Dynamic Resource Allocation in Collaborative Cloud C...
Neuro-Fuzzy System Based Dynamic Resource Allocation in Collaborative Cloud C...Neuro-Fuzzy System Based Dynamic Resource Allocation in Collaborative Cloud C...
Neuro-Fuzzy System Based Dynamic Resource Allocation in Collaborative Cloud C...
 
A Proposed Model for Improving Performance and Reducing Costs of IT Through C...
A Proposed Model for Improving Performance and Reducing Costs of IT Through C...A Proposed Model for Improving Performance and Reducing Costs of IT Through C...
A Proposed Model for Improving Performance and Reducing Costs of IT Through C...
 
Improved Secure Cloud Transmission Protocol
Improved Secure Cloud Transmission ProtocolImproved Secure Cloud Transmission Protocol
Improved Secure Cloud Transmission Protocol
 

Recently uploaded

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Recently uploaded (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Data Partitioning for Ensemble Model Building

International Journal on Cloud Computing: Services and Architecture (IJCCSA) Vol. 7, No. 3/4, August 2017
DOI: 10.5121/ijccsa.2017.7401

DATA PARTITIONING FOR ENSEMBLE MODEL BUILDING

Ates Dagli¹, Niall McCarroll² and Dmitry Vasilenko³
¹ IBM Big Data Analytics, Chicago, USA
² IBM Watson Machine Learning, Hursley, UK
³ IBM Watson Cloud Platform and Data Science, Chicago, USA

ABSTRACT
In distributed ensemble model-building algorithms, the performance and statistical validity of models depend on the sizes of the input data partitions as well as the distribution of records among the partitions. Failure to correctly select and pre-process the data often results in models that are unstable and perform poorly. This article introduces an optimized approach to building ensemble models for very large data sets in distributed map-reduce environments using the Pass-Stream-Merge (PSM) algorithm. To ensure model correctness, the input data is randomly distributed using the facilities built into map-reduce frameworks.

KEYWORDS
Ensemble Models, Pass-Stream-Merge, Big Data, Map-Reduce, Cloud

1. INTRODUCTION
Ensemble models are used to enhance model accuracy (boosting) [1], enhance model stability (bagging) [2], and build models for very large datasets (pass, stream, merge) [3]. In distributed ensemble model-building algorithms, one so-called base model is built from each data partition (split) and evaluated against a sample set aside for this purpose. The best-performing base models are then selected and combined into a model ensemble for purposes of prediction. Both model-building performance and the statistical validity of the models depend on data records being distributed approximately randomly across roughly equal-sized partitions. When implemented in a map-reduce framework, base models are built in mappers [4]. Sizes of data partitions and the distribution of records among them are properties of the input data source.
The partition size of the input source is often uneven and rarely of appropriate size for building models. Furthermore, data records are frequently arranged in some systematic order rather than randomly. As a result, base models sometimes fail to build or, what is worse, produce incorrect or suboptimal results. The algorithm proposed in this paper eliminates variability in partition size and record ordering and, as a result, improves the performance and statistical validity of the generated models.

2. METHODOLOGY
We implement the PSM features Pass, Stream and Merge through ensemble modeling [2], [5], [6], [7], [8], [9]. Pass builds models on very large data sets with only one data pass [3]; Stream updates the existing model with new cases without the need to store or recall the old training data; Merge builds models in a distributed environment and merges the built models into one model.

In an ensemble model, the training set is divided into subsets called blocks, and a model is built on each block. Because the blocks may be dispatched to different processing nodes in the map-reduce environment, models can be built concurrently. As new data blocks arrive, the algorithm repeats the procedure. Therefore, it can handle the data stream and perform incremental learning for ensemble modeling [10].

The Pass operation includes the following steps:
1. Splitting the data into training blocks, a testing set and a holdout set.
2. Building base models on training blocks and a reference model on the testing set. One model is built on the testing set and one on each training block.
3. Evaluating each base model by computing its accuracy based on the testing set and selecting a subset of base models as ensemble elements according to accuracy.

During the Stream step, when new cases arrive and the existing ensemble model needs to be updated with them, the algorithm will:
1. Start a Pass operation to build an ensemble model on the new data, and then
2. Merge the newly created ensemble model and the existing ensemble model.

The Merge operation has the following steps:
1. Merging the holdout sets into a single holdout set and, if necessary, reducing the set to a reasonable size.
2. Merging the testing sets into a single testing set and, if necessary, reducing the set to a reasonable size.
3. Building a merged reference model on the merged testing set.
4. Evaluating every base model by computing its accuracy based on the merged testing set and selecting a subset of base models as elements of the merged ensemble model according to accuracy.
5. Evaluating the merged ensemble model and the merged reference model by computing their accuracy based on the merged holdout set.

Figure 1. Data blocks and base models
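The three steps of the Pass operation listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the nearest-class-mean classifier, the single numeric input field, and the names `fit_base_model`, `accuracy` and `pass_operation` are hypothetical stand-ins for a real model-building procedure.

```python
import random
from statistics import mean


def fit_base_model(block):
    """Toy stand-in for a real learner: nearest-class-mean on one numeric field.

    Returns None when the block holds a single target value, because no
    model can be built without variation in the target.
    """
    class0 = [x for x, y in block if y == 0]
    class1 = [x for x, y in block if y == 1]
    if not class0 or not class1:
        return None
    m0, m1 = mean(class0), mean(class1)
    return lambda x: 1 if abs(x - m1) < abs(x - m0) else 0


def accuracy(model, records):
    """Fraction of records whose target the model predicts correctly."""
    return sum(model(x) == y for x, y in records) / len(records)


def pass_operation(records, n_blocks, test_size, holdout_size, ensemble_size, seed=0):
    """Step 1: split the data; step 2: build models; step 3: rank and select."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    holdout = shuffled[:holdout_size]
    testing = shuffled[holdout_size:holdout_size + test_size]
    training = shuffled[holdout_size + test_size:]
    blocks = [training[i::n_blocks] for i in range(n_blocks)]

    # One reference model on the testing set, one base model per training block.
    reference = fit_base_model(testing)
    base_models = [m for m in (fit_base_model(b) for b in blocks) if m is not None]

    # Keep the most accurate base models as ensemble elements.
    ranked = sorted(base_models, key=lambda m: accuracy(m, testing), reverse=True)
    return ranked[:ensemble_size], reference, testing, holdout
```

The holdout set is returned untouched so that a later step can compare the ensemble with the reference model, as the Merge operation describes.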
Figure 2. File sizes and base models

It can be shown that in map-reduce environments the training block or partition size of the input source is often uneven and rarely of appropriate size for building models. The simplest example of uneven partition sizes arises because the last block of a file in the distributed file system is almost always a different size from those before it. In the example of Figure 1, the base model built from the last block is built from a smaller number of records. When the dataset comprises multiple files, as is often the case, the number of small partitions increases. This is illustrated in Figure 2.

Another assumption that is frequently violated is that the input records are randomly distributed among partitions. If the records in the dataset are ordered by the values of the modeling target field, an input field, or a field correlated with them, base models either cannot be built or exhibit low predictive accuracy. In the example shown in Figure 3, records are ordered by the binary-valued target. No model can be built from the first or the last partition because there is no variation in the target value. A model can probably be built from the second partition, but its quality depends on where the boundary between the two target values lies.

Figure 3. The data sort order and base models
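The situation described for Figure 3 can be reproduced with a small sketch. The record layout (an id paired with a binary target), the record counts, and the helper names `contiguous_partitions` and `has_target_variation` are assumptions made for illustration only.

```python
def contiguous_partitions(records, n_partitions):
    """Split records into consecutive blocks, as a file system would."""
    size = -(-len(records) // n_partitions)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def has_target_variation(partition):
    """A classifier can only be trained if both target values occur."""
    return len({target for _, target in partition}) > 1


# 100 records sorted by a binary target: 40 zeros followed by 60 ones.
records = [(i, 0) for i in range(40)] + [(i, 1) for i in range(40, 100)]
parts = contiguous_partitions(records, 3)

sizes = [len(p) for p in parts]                   # the last block is smaller
usable = [has_target_variation(p) for p in parts]  # only the middle block is usable
```

Only the middle partition contains both target values, and the last partition is smaller than the others, which illustrates both problems at once.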
An existing partial solution to the problem is to first run a map-reduce job that shuffles the input records and creates a temporary data set. Shuffling is achieved by creating a random-valued field and sorting the data by that field. The ensemble-building program is subsequently run with the temporary data set as input. The solution is only partial because, while records are now randomly ordered, the partitioning of data is determined by the needs of the random sorting. The resulting partitioning is rarely optimal because the upper limit on the partition size is still dependent on the default block size employed by the map-reduce framework, and smaller, unevenly sized partitions may be created. This approach also requires the duplication of all the input data in a temporary data set.

We propose a method that provides optimally-sized partitions of shuffled, or randomly ordered, records to model-building steps using facilities built into map-reduce frameworks. The model-building step is run in reducers with input partitions whose size is configurable automatically or by the user. The contents of the partitions are randomly assigned. Our approach allows the partition size to be set at runtime. The partition size may be based on statistical heuristic rules, properties of the modeling problem, properties of the computing environment, or any combination of these factors. Each partition consists of a set of records selected with equal probability from the input.

The advantages of using our method over the known explicit shuffling solution are:
1. Our method guarantees partitions of uniform optimal size. The explicit shuffling solution cannot guarantee a given size or uniform sizes.
2. Map-reduce frameworks have built-in mechanisms for automatically grouping records passed to reducers. Explicit shuffling incurs the additional cost of the creation of a temporary dataset and a sort operation.

This description is confined to the portion of the ensemble modeling process where base models are built, because that is the step our approach improves. In addition to partitions for building base models, ensemble modeling requires the creation of two small random samples of records called the validation and the holdout samples. The sizes of these samples are preset constants. A so-called reference model is built from the validation sample in order to compare with the ensemble later. The validation sample is also used to rank the predictive performance of the base models in a later step. Also in a later step, the holdout sample is used to compare the predictive performance of the ensemble with that of the reference model.

The desired number of models in the final ensemble, E, is determined by the user. It is usually in the range 10-100. The rules we use to determine how to partition the data balance the goal of a desirable partition size with that of building an ensemble of the desired size. To compute the average adjusted partition size we pick an optimal value for the size of the base model partition, B. B may be based on statistical heuristic rules, properties of the modeling problem, properties of the computing environment, or any combination of these factors. We also determine a minimum acceptable value for the size of the base model partition, Bmin, based on the same factors. Given N, the total number of records in the input dataset, the size of the holdout sample H, and the size of the validation sample V, we determine the number of base models, S, as follows:
[The expression for S appeared as an image in the original slides and is not preserved in this transcript.]

Given S, compute an average adjusted partition size, B′, as B′ = (N − H − V)/S. B′ is usually not a whole number. Sampling probabilities for base partitions, the validation sample, and the holdout sample are B′/N, V/N, and H/N, respectively. Note that S · (B′/N) + V/N + H/N = N/N = 1, so the sampling probabilities add up to 1.

The map stage consists of randomly assigning each record one of S+2 keys 1, 2, ..., S+2. Key values 1 and 2 correspond to the holdout and validation samples, respectively. Thus, a given record is assigned key 1 with probability H/N, key 2 with probability V/N, and keys 3..S+2 each with probability B′/N. The resulting value is used as the map-reduce key. The key is also written to mapper output records so that the reducers can distinguish partitions 1 and 2 from the base model partitions. The sizes of the resulting partitions will be approximately H, V and B′, respectively. In the reduce stage, we build models from each partition except the holdout sample.

Figure 4. The data flow through mappers to reducers
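The key assignment of the map stage can be simulated in plain Python as follows. This is a sketch under assumptions rather than the authors' map-reduce code: S is taken as a given input, the values of N, H, V and S are made up, and `sampling_probabilities`, `map_records` and `group_by_key` are hypothetical names, with `group_by_key` standing in for the grouping that a map-reduce framework performs between mappers and reducers.

```python
import random
from collections import defaultdict


def sampling_probabilities(n, h, v, s):
    """Probabilities for keys 1 (holdout), 2 (validation), 3..s+2 (base)."""
    b_prime = (n - h - v) / s  # average adjusted partition size B'
    return [h / n, v / n] + [b_prime / n] * s


def map_records(records, probabilities, seed=0):
    """Map stage: emit (key, record) with the key drawn at random."""
    rng = random.Random(seed)
    keys = list(range(1, len(probabilities) + 1))
    for record in records:
        key = rng.choices(keys, weights=probabilities, k=1)[0]
        yield key, record


def group_by_key(pairs):
    """Stand-in for the framework's shuffle: one group of records per key."""
    groups = defaultdict(list)
    for key, record in pairs:
        groups[key].append(record)
    return groups


# Assumed example values: 10,000 records, 500 holdout, 500 validation, 9 base models.
n, h, v, s = 10_000, 500, 500, 9
probs = sampling_probabilities(n, h, v, s)
partitions = group_by_key(map_records(range(n), probs))
sizes = {key: len(group) for key, group in sorted(partitions.items())}
```

With these values B′ is 1000, so keys 3 to 11 each receive roughly 1000 records, while keys 1 and 2 each receive roughly 500. In a real job with several mappers, each mapper would need its own random seed so that the draws are independent across mappers.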
  • 6. International Journal on Cloud Computing: Services and Architecture (IJCCSA) Vol. 7, No. 3/4, August 2017 6 Figure 4 shows the flow of input data through an arbitrary number K of mappers to the reducers where base and reference models are built. Regardless of the order and grouping of the input data, the expected sizes of the holdout, validation and base model training partitions are as determined above and their contents are randomly assigned. 3. CONCLUSIONS We have presented an algorithm for creating optimally-sized random partitioning for data- parallel ensemble model building in map-reduce environments. The approach improves performance and statistical validity of the generated ensemble models by introducing random record keys that reflect probabilities for the holdout, validation and training samples. The keys are also written to mapper output records so that the reducers can distinguish holdout and validation partitions from the base model partitions. In the reduce stage, we build models from each partition except the holdout sample. The future plan is to implement the algorithm in the real cloud setup and test the performance and statistical validity by considering the anticipated workload. ACKNOWLEDGEMENTS Authors thank Steven Halter of IBM for invaluable comments during preparation of this paper. REFERENCES [1] R. Schapire and Y. Freund. Boosting: Foundations and Algorithms. MIT, 2012. [2] R. Bordawekar, B. Blainey, C. Apte, and M. McRoberts. A survey of business analytics models and algorithms. IBM Research Report RC25186, IBM, November 2011. W1107- 029. [3] L. Wilkinson. System and method for computing analytics on structured data. Patent US 7627432, December 2009. [4] M. I. Danciu, L. Fan, M. McRoberts, J Shyr, D. Spisic, and J. Xu. Generating a predictive model from multiple data sources. Patent US 20120278275 A1, November 2012. [5] R. Levesque. Programming and Data Management for IBM SPSS Statistics. IBM Corporation, Armonk, New York, 2011. 
[6] R. Polikar. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine, 6(3):21–45, 2006.
[7] D. Opitz and R. Maclin. Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 1(11):169–198, 1999.
[8] B. Zenko. Is combining classifiers better than selecting the best one. Machine Learning, 1(1):255–273, 2004.
[9] Z. Zhihua. Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC, 2012.
[10] IBM SPSS Modeler 16 Algorithms Guide. IBM Corporation, Armonk, New York, 2013.
  • 7. AUTHORS

Ates Dagli is a Principal Software Engineer working on the IBM SPSS Analytic Server line of products. Mr. Dagli graduated from Columbia University in the City of New York with a Master of Philosophy degree in Economics in 1978. Mr. Dagli has led the design and implementation of a variety of statistical and machine learning algorithms that address the entire analytical process, from planning to data collection to analysis, reporting and deployment.

Niall McCarroll, Ph.D., is an Architect and Senior Software Engineer working on IBM Watson Machine Learning and predictive analytics tools. In the last few years Niall has been involved in the creation and teaching of a module covering applied predictive analytics at several UK universities.

Dmitry Vasilenko is an Architect and Senior Software Engineer working on the IBM Watson Cloud Platform. He received an M.S. degree in Electrical Engineering from Novosibirsk State Technical University, Russian Federation, in 1986. Before joining IBM SPSS in 1997, Mr. Vasilenko led Computer Aided Design projects in the area of Electrical Engineering at the Institute of Electric Power System and Electric Transmission Networks. During his tenure with IBM, Mr. Vasilenko received three technical excellence awards for his work in Business Analytics. He is an author or co-author of 11 technical papers and a US patent.