SlideShare a Scribd company logo
1 of 21
Download to read offline
Introduction Preprocessing methods Racing Conclusions
Racing for unbalanced methods selection
–
the unbalanced package
Andrea Dal Pozzolo
24/2/2016
1/ 20
Introduction Preprocessing methods Racing Conclusions
ABOUT ME
I’m a Decision Analytics Consultant at VASCO Data
Security.
I hold a PhD from the Machine Learning Group (MLG) of
the Universit´e Libre de Bruxelles (ULB).
My research focused on Machine Learning techniques for
Fraud Detection in electronic transaction.
2/ 20
Introduction Preprocessing methods Racing Conclusions
TABLE OF CONTENTS
1 Introduction
2 Preprocessing methods
3 Racing
4 Conclusions
3/ 20
Introduction Preprocessing methods Racing Conclusions
INTRODUCTION
In several binary classification problems (e.g. fraud
detection), the two classes are not equally represented in
the dataset.
When one class is underrepresented in a dataset, the data
is said to be unbalanced.
In unbalanced datasets, classification algorithms perform
poorly in terms of predictive accuracy [9].
A common strategy is to balance the classes before
learning a classifier [1].
Technical sections will be denoted by the symbol
4/ 20
Introduction Preprocessing methods Racing Conclusions
PCA PROJECTION OF CREDIT CARD TRANSACTIONS
Figure : Transactions distribution over the first two Principal
Components in different days.
5/ 20
Introduction Preprocessing methods Racing Conclusions
THE FRAUD DETECTION PROBLEM
We formalize FD as a classification task f : Rn → {+, −}.
X ∈ Rn is the input and Y ∈ {+, −} the output domain.
+ is the fraud (minority) and − the genuine (majority)
class.
Given a classifier K and a training set TN, we are interested
in estimating for a new sample (x, y) the posterior
probability P(y = +|x).
6/ 20
Introduction Preprocessing methods Racing Conclusions
PREPROCESSING METHODS FOR UNBALANCED DATA
Sampling methods
Undersampling [7]
Oversampling [7]
SMOTE [3]
Distance based methods
Tomek link [15]
Condensed Nearest Neighbor (CNN) [8]
One side Selection (OSS) [10]
Edited Nearest Neighbor (ENN) [16]
Neighborhood Cleaning Rule (NCL) [11]
7/ 20
Introduction Preprocessing methods Racing Conclusions
SAMPLING METHODS
Undersampling- Oversampling- SMOTE-
Figure : Resampling methods for unbalanced classification. The
negative and positive symbols denotes majority and minority class
instances. In red the new observations created with oversampling
methods.
8/ 20
Introduction Preprocessing methods Racing Conclusions
SELECTING THE BEST STRATEGY
With no prior information about the data distribution is
difficult to decide which unbalanced strategy to use.
No-free-lunch theorem [17]: no single strategy is coherently
superior to all others in all conditions (i.e. algorithm,
dataset and performance metric)
Testing all unbalanced techniques is not an option because
of the associated computational cost.
We proposed to use the Racing approach [12] to perform
strategy selection.
9/ 20
Introduction Preprocessing methods Racing Conclusions
RACING FOR STRATEGY SELECTION
Racing consists in testing in parallel a set of alternatives
and using a statistical test to remove an alternative if it is
significantly worse than the others.
We adopted F-Race version [2] to search efficiently for the
best strategy for unbalanced data.
The F-race combines the Friedman test with Hoeffding
Races [12].
10/ 20
Introduction Preprocessing methods Racing Conclusions
RACING FOR UNBALANCED TECHNIQUE SELECTION
Automatically select the most adequate technique for a given
dataset.
1. Test in parallel a set of alternative balancing strategies on a
subset of the dataset
2. Remove progressively the alternatives which are
significantly worse.
3. Iterate the testing and removal step until there is only one
candidate left or not more data is available
Candidate(1 Candidate(2 Candidate(3
subset(1( 0.50 0.47 0.48
subset(2 0.51 0.48 0.30
subset(3 0.51 0.47
subset(4 0.60 0.45
subset(5 0.55
Candidate(1 Candidate(2 Candidate(3
subset(1( 0.50 0.47 0.48
subset(2 0.51 0.48 0.3011/ 20
Introduction Preprocessing methods Racing Conclusions
F-RACE METHOD
Use 10-fold Cross Validation (CV) to provide the data
during the race.
Every time new data is added to the race, the Friedman
test is used to remove significantly bad candidates.
We made a comparison of CV and F-race in terms of
F-measure.
Candidate(1 Candidate(2 Candidate(3
subset(1( 0.50 0.47 0.48
subset(2 0.51 0.48 0.30
subset(3 0.51 0.47
subset(4 0.60 0.45
subset(5 0.55
Candidate(1 Candidate(2 Candidate(3
subset(1( 0.50 0.47 0.48
subset(2 0.51 0.48 0.30
subset(3 0.51 0.47
subset(4 0.49 0.46
subset(5 0.48 0.46
subset(6 0.60 0.45
subset(7 0.59
subset(8
subset(9
subset(10
Original(Dataset
Time
12/ 20
Introduction Preprocessing methods Racing Conclusions
F-RACE VS CROSS VALIDATION
Dataset Exploration Method Ntest % Gain Mean Sd
ecoli
Race Under 46 49 0.836 0.04
CV SMOTE 90 - 0.754 0.112
letter-a
Race Under 34 62 0.952 0.008
CV SMOTE 90 - 0.949 0.01
letter-vowel
Race Under 34 62 0.884 0.011
CV Under 90 - 0.887 0.009
letter
Race SMOTE 37 59 0.951 0.009
CV Under 90 - 0.951 0.01
oil
Race Under 41 54 0.629 0.074
CV SMOTE 90 - 0.597 0.076
page
Race SMOTE 45 50 0.919 0.01
CV SMOTE 90 - 0.92 0.008
pendigits
Race Under 39 57 0.978 0.011
CV Under 90 - 0.981 0.006
PhosS
Race Under 19 79 0.598 0.01
CV Under 90 - 0.608 0.016
satimage
Race Under 34 62 0.843 0.008
CV Under 90 - 0.841 0.011
segment
Race SMOTE 90 0 0.978 0.01
CV SMOTE 90 - 0.978 0.01
estate
Race Under 27 70 0.553 0.023
CV Under 90 - 0.563 0.021
covtype
Race Under 42 53 0.924 0.007
CV SMOTE 90 - 0.921 0.008
cam
Race Under 34 62 0.68 0.007
CV Under 90 - 0.674 0.015
compustat
Race Under 37 59 0.738 0.021
CV Under 90 - 0.745 0.017
creditcard
Race Under 43 52 0.927 0.008
CV SMOTE 90 - 0.924 0.006
Table : Results in terms of G-mean for RF classifier.
13/ 20
Introduction Preprocessing methods Racing Conclusions
DISCUSSION
The best strategy is extremely dependent on the data
nature, algorithm adopted and performance measure.
F-race is able to automatise the selection of the best
unbalanced strategy for a given unbalanced problem
without exploring the whole dataset.
For the fraud dataset the unbalanced strategy chosen had a
big impact on the accuracy of the results.
14/ 20
Introduction Preprocessing methods Racing Conclusions
CONCLUSIONS
With racing [6] we can rapidly select the best strategy for a
given unbalanced task.
However, we see that undersampling and SMOTE are
often the best strategy to adopt.
In R [14] we have a software package called
unbalanced [4] that implements all these algorithms.
15/ 20
Introduction Preprocessing methods Racing Conclusions
PACKAGE DEMO
Let’s see in practise how to do it ...
The code of this demo is available at
https://github.com/dalpozz/utility/blob/master/ubDEMO.R
16/ 20
Introduction Preprocessing methods Racing Conclusions
Vignette: www.ulb.ac.be/di/map/adalpozz/pdf/unbalanced.pdf
CRAN: http://CRAN.R-project.org/package=unbalanced
Github: https://github.com/dalpozz/unbalanced
Email: dalpozz@gmail.com
Questions?
17/ 20
Introduction Preprocessing methods Racing Conclusions
BIBLIOGRAPHY I
[1] R. Akbani, S. Kwek, and N. Japkowicz.
Applying support vector machines to imbalanced datasets.
In Machine Learning: ECML 2004, pages 39–50. Springer, 2004.
[2] M. Birattari, T. St¨utzle, L. Paquete, and K. Varrentrapp.
A racing algorithm for configuring metaheuristics.
In Proceedings of the genetic and evolutionary computation conference, pages 11–18, 2002.
[3] N. Chawla, K. Bowyer, L. O. Hall, and W. P. Kegelmeyer.
Smote: synthetic minority over-sampling technique.
Journal of Artificial Intelligence Research (JAIR), 16:321–357, 2002.
[4] A. Dal Pozzolo, O. Caelen, and G. Bontempi.
unbalanced: Racing For Unbalanced Methods Selection., 2015.
R package version 2.0.
[5] A. Dal Pozzolo, O. Caelen, R. A. Johnson, and G. Bontempi.
Calibrating probability with undersampling for unbalanced classification.
In 2015 IEEE Symposium on Computational Intelligence and Data Mining. IEEE, 2015.
[6] A. Dal Pozzolo, O. Caelen, S. Waterschoot, and G. Bontempi.
Racing for unbalanced methods selection.
In Proceedings of the 14th International Conference on Intelligent Data Engineering and Automated Learning. IDEAL,
2013.
[7] C. Drummond and R. Holte.
C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling.
In Workshop on Learning from Imbalanced Datasets II, 2003.
17/ 20
Introduction Preprocessing methods Racing Conclusions
BIBLIOGRAPHY II
[8] P. E. Hart.
The condensed nearest neighbor rule.
IEEE Transactions on Information Theory, 1968.
[9] N. Japkowicz and S. Stephen.
The class imbalance problem: A systematic study.
Intelligent data analysis, 6(5):429–449, 2002.
[10] M. Kubat, S. Matwin, et al.
Addressing the curse of imbalanced training sets: one-sided selection.
In ICML, volume 97, pages 179–186. Nashville, USA, 1997.
[11] J. Laurikkala.
Improving identification of difficult small classes by balancing class distribution.
Artificial Intelligence in Medicine, pages 63–66, 2001.
[12] O. Maron and A. Moore.
Hoeffding races: Accelerating model selection search for classification and function approximation.
Robotics Institute, page 263, 1993.
[13] J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence.
Dataset shift in machine learning.
The MIT Press, 2009.
[14] R Development Core Team.
R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria, 2011.
ISBN 3-900051-07-0.
18/ 20
Introduction Preprocessing methods Racing Conclusions
BIBLIOGRAPHY III
[15] I. Tomek.
Two modifications of cnn.
IEEE Trans. Syst. Man Cybern., 6:769–772, 1976.
[16] D. Wilson.
Asymptotic properties of nearest neighbor rules using edited data.
Systems, Man and Cybernetics, (3):408–421, 1972.
[17] D. H. Wolpert.
The lack of a priori distinctions between learning algorithms.
Neural computation, 8(7):1341–1390, 1996.
19/ 20
Introduction Preprocessing methods Racing Conclusions
SAMPLING SELECTION BIAS
Sampling selection bias [13] occurs when the training and
testing set comes from a different distribution. In this case we
need to calibrate the posterior probability of the classifier [5].
Posi%ves(
Nega%ves(
Undersampling(
Posi%ves(
Nega%ves(
Data(
Genera%on(
Process(
Posi%ves(
Nega%ves(
Training(
Tes%ng(
Different(distribu%ons(
20/ 20

More Related Content

What's hot

Learning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification DataLearning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification Data萍華 楊
 
Classification using L1-Penalized Logistic Regression
Classification using L1-Penalized Logistic RegressionClassification using L1-Penalized Logistic Regression
Classification using L1-Penalized Logistic RegressionSetia Pramana
 
Naive Bayes Classifier
Naive Bayes ClassifierNaive Bayes Classifier
Naive Bayes ClassifierArunabha Saha
 
Big Data Analysis
Big Data AnalysisBig Data Analysis
Big Data AnalysisNBER
 
Naive bayes
Naive bayesNaive bayes
Naive bayesumeskath
 
Naive bayesian classification
Naive bayesian classificationNaive bayesian classification
Naive bayesian classificationDr-Dipali Meher
 
Text classification
Text classificationText classification
Text classificationHarry Potter
 
Scalable sentiment classification for big data analysis using naive bayes cla...
Scalable sentiment classification for big data analysis using naive bayes cla...Scalable sentiment classification for big data analysis using naive bayes cla...
Scalable sentiment classification for big data analysis using naive bayes cla...Tien-Yang (Aiden) Wu
 
Distributed Group Analytical Hierarchical Process by Consensus
 Distributed Group Analytical Hierarchical Process by Consensus Distributed Group Analytical Hierarchical Process by Consensus
Distributed Group Analytical Hierarchical Process by ConsensusMiguel Rebollo
 
Self similarity student for partial label histopathology image segmentation
Self similarity student for partial label histopathology image segmentationSelf similarity student for partial label histopathology image segmentation
Self similarity student for partial label histopathology image segmentationtaeseon ryu
 
Diagnostic methods for Building the regression model
Diagnostic methods for Building the regression modelDiagnostic methods for Building the regression model
Diagnostic methods for Building the regression modelMehdi Shayegani
 

What's hot (13)

Learning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification DataLearning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification Data
 
Classification using L1-Penalized Logistic Regression
Classification using L1-Penalized Logistic RegressionClassification using L1-Penalized Logistic Regression
Classification using L1-Penalized Logistic Regression
 
Naive Bayes Classifier
Naive Bayes ClassifierNaive Bayes Classifier
Naive Bayes Classifier
 
Big Data Analysis
Big Data AnalysisBig Data Analysis
Big Data Analysis
 
Naive bayes
Naive bayesNaive bayes
Naive bayes
 
Naive bayesian classification
Naive bayesian classificationNaive bayesian classification
Naive bayesian classification
 
Text classification
Text classificationText classification
Text classification
 
Scalable sentiment classification for big data analysis using naive bayes cla...
Scalable sentiment classification for big data analysis using naive bayes cla...Scalable sentiment classification for big data analysis using naive bayes cla...
Scalable sentiment classification for big data analysis using naive bayes cla...
 
Distributed Group Analytical Hierarchical Process by Consensus
 Distributed Group Analytical Hierarchical Process by Consensus Distributed Group Analytical Hierarchical Process by Consensus
Distributed Group Analytical Hierarchical Process by Consensus
 
Mech ma6452 snm_notes
Mech ma6452 snm_notesMech ma6452 snm_notes
Mech ma6452 snm_notes
 
Self similarity student for partial label histopathology image segmentation
Self similarity student for partial label histopathology image segmentationSelf similarity student for partial label histopathology image segmentation
Self similarity student for partial label histopathology image segmentation
 
Diagnostic methods for Building the regression model
Diagnostic methods for Building the regression modelDiagnostic methods for Building the regression model
Diagnostic methods for Building the regression model
 
MUMS: Bayesian, Fiducial, and Frequentist Conference - Inference on Treatment...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Inference on Treatment...MUMS: Bayesian, Fiducial, and Frequentist Conference - Inference on Treatment...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Inference on Treatment...
 

Viewers also liked

The true religion of god
The true  religion of godThe true  religion of god
The true religion of godexploringislam
 
Подарунки
ПодарункиПодарунки
ПодарункиIrina Lut
 
Valve Handbook For New Employees
Valve Handbook For New EmployeesValve Handbook For New Employees
Valve Handbook For New EmployeesGauin
 
The three fundamental principles
The three fundamental principlesThe three fundamental principles
The three fundamental principlesexploringislam
 
互邀新平台产品宣介
互邀新平台产品宣介互邀新平台产品宣介
互邀新平台产品宣介Gauin
 
Shapiro capitulo 4 1° parte
Shapiro capitulo 4 1° parteShapiro capitulo 4 1° parte
Shapiro capitulo 4 1° parteferaza
 
Shapiro capitulo 4 2° parte
Shapiro capitulo 4 2° parteShapiro capitulo 4 2° parte
Shapiro capitulo 4 2° parteferaza
 
Data Innovation Summit - Made in Belgium 2015
Data Innovation Summit - Made in Belgium 2015Data Innovation Summit - Made in Belgium 2015
Data Innovation Summit - Made in Belgium 2015Andrea Dal Pozzolo
 
Aq workshop l01 - way of a muslim
Aq workshop   l01 - way of a muslimAq workshop   l01 - way of a muslim
Aq workshop l01 - way of a muslimexploringislam
 
Andrea Dal Pozzolo - Data Scientist
Andrea Dal Pozzolo - Data ScientistAndrea Dal Pozzolo - Data Scientist
Andrea Dal Pozzolo - Data ScientistAndrea Dal Pozzolo
 
Raising a Mathematician Training Program 2015
Raising a Mathematician Training Program 2015Raising a Mathematician Training Program 2015
Raising a Mathematician Training Program 2015School of Vedic Maths
 
产品经理实用工具全集(1 8)
产品经理实用工具全集(1 8)产品经理实用工具全集(1 8)
产品经理实用工具全集(1 8)Gauin
 
腾讯产品运营PPT-产品经理的视角
腾讯产品运营PPT-产品经理的视角腾讯产品运营PPT-产品经理的视角
腾讯产品运营PPT-产品经理的视角Gauin
 

Viewers also liked (20)

Estructura para la pagina web 1 zanabria
Estructura para  la pagina web  1 zanabriaEstructura para  la pagina web  1 zanabria
Estructura para la pagina web 1 zanabria
 
The true religion of god
The true  religion of godThe true  religion of god
The true religion of god
 
Colegio nacional nicolas esguerra
Colegio nacional nicolas esguerraColegio nacional nicolas esguerra
Colegio nacional nicolas esguerra
 
Подарунки
ПодарункиПодарунки
Подарунки
 
Board room of today presentation
Board room of today presentationBoard room of today presentation
Board room of today presentation
 
Valve Handbook For New Employees
Valve Handbook For New EmployeesValve Handbook For New Employees
Valve Handbook For New Employees
 
Confusion in manhaj
Confusion in manhajConfusion in manhaj
Confusion in manhaj
 
The three fundamental principles
The three fundamental principlesThe three fundamental principles
The three fundamental principles
 
互邀新平台产品宣介
互邀新平台产品宣介互邀新平台产品宣介
互邀新平台产品宣介
 
Shapiro capitulo 4 1° parte
Shapiro capitulo 4 1° parteShapiro capitulo 4 1° parte
Shapiro capitulo 4 1° parte
 
Shapiro capitulo 4 2° parte
Shapiro capitulo 4 2° parteShapiro capitulo 4 2° parte
Shapiro capitulo 4 2° parte
 
Notes day 2 - angels
Notes   day 2 - angelsNotes   day 2 - angels
Notes day 2 - angels
 
Data Innovation Summit - Made in Belgium 2015
Data Innovation Summit - Made in Belgium 2015Data Innovation Summit - Made in Belgium 2015
Data Innovation Summit - Made in Belgium 2015
 
Catalogue
CatalogueCatalogue
Catalogue
 
Aq workshop l01 - way of a muslim
Aq workshop   l01 - way of a muslimAq workshop   l01 - way of a muslim
Aq workshop l01 - way of a muslim
 
Andrea Dal Pozzolo - Data Scientist
Andrea Dal Pozzolo - Data ScientistAndrea Dal Pozzolo - Data Scientist
Andrea Dal Pozzolo - Data Scientist
 
Raising a Mathematician Training Program 2015
Raising a Mathematician Training Program 2015Raising a Mathematician Training Program 2015
Raising a Mathematician Training Program 2015
 
tttt
tttttttt
tttt
 
产品经理实用工具全集(1 8)
产品经理实用工具全集(1 8)产品经理实用工具全集(1 8)
产品经理实用工具全集(1 8)
 
腾讯产品运营PPT-产品经理的视角
腾讯产品运营PPT-产品经理的视角腾讯产品运营PPT-产品经理的视角
腾讯产品运营PPT-产品经理的视角
 

Similar to Presentation of the unbalanced R package

Racing for unbalanced methods selection
Racing for unbalanced methods selectionRacing for unbalanced methods selection
Racing for unbalanced methods selectionAndrea Dal Pozzolo
 
ENACTMENT RANKING OF SUPERVISED ALGORITHMS DEPENDENCE OF DATA SPLITTING ALGOR...
ENACTMENT RANKING OF SUPERVISED ALGORITHMS DEPENDENCE OF DATA SPLITTING ALGOR...ENACTMENT RANKING OF SUPERVISED ALGORITHMS DEPENDENCE OF DATA SPLITTING ALGOR...
ENACTMENT RANKING OF SUPERVISED ALGORITHMS DEPENDENCE OF DATA SPLITTING ALGOR...ijnlc
 
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...AIRCC Publishing Corporation
 
Mining the LET Performance in Generating Prediction Models for OTDSS
Mining the LET Performance in Generating Prediction Models for OTDSSMining the LET Performance in Generating Prediction Models for OTDSS
Mining the LET Performance in Generating Prediction Models for OTDSSIvy Tarun
 
IRJET- Predicting Customers Churn in Telecom Industry using Centroid Oversamp...
IRJET- Predicting Customers Churn in Telecom Industry using Centroid Oversamp...IRJET- Predicting Customers Churn in Telecom Industry using Centroid Oversamp...
IRJET- Predicting Customers Churn in Telecom Industry using Centroid Oversamp...IRJET Journal
 
TECHNICAL REVIEW: PERFORMANCE OF EXISTING IMPUTATION METHODS FOR MISSING DATA...
TECHNICAL REVIEW: PERFORMANCE OF EXISTING IMPUTATION METHODS FOR MISSING DATA...TECHNICAL REVIEW: PERFORMANCE OF EXISTING IMPUTATION METHODS FOR MISSING DATA...
TECHNICAL REVIEW: PERFORMANCE OF EXISTING IMPUTATION METHODS FOR MISSING DATA...IJDKP
 
TECHNICAL REVIEW: PERFORMANCE OF EXISTING IMPUTATION METHODS FOR MISSING DATA...
TECHNICAL REVIEW: PERFORMANCE OF EXISTING IMPUTATION METHODS FOR MISSING DATA...TECHNICAL REVIEW: PERFORMANCE OF EXISTING IMPUTATION METHODS FOR MISSING DATA...
TECHNICAL REVIEW: PERFORMANCE OF EXISTING IMPUTATION METHODS FOR MISSING DATA...IJDKP
 
Improving Analogy Software Effort Estimation using Fuzzy Feature Subset Selec...
Improving Analogy Software Effort Estimation using Fuzzy Feature Subset Selec...Improving Analogy Software Effort Estimation using Fuzzy Feature Subset Selec...
Improving Analogy Software Effort Estimation using Fuzzy Feature Subset Selec...gregoryg
 
An advance extended binomial GLMBoost ensemble method with synthetic minorit...
An advance extended binomial GLMBoost ensemble method  with synthetic minorit...An advance extended binomial GLMBoost ensemble method  with synthetic minorit...
An advance extended binomial GLMBoost ensemble method with synthetic minorit...IJECEIAES
 
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
A Novel Approach for Breast Cancer Detection using Data Mining TechniquesA Novel Approach for Breast Cancer Detection using Data Mining Techniques
A Novel Approach for Breast Cancer Detection using Data Mining Techniquesahmad abdelhafeez
 
Comparative performance analysis
Comparative performance analysisComparative performance analysis
Comparative performance analysiscsandit
 
Comparative Performance Analysis of Machine Learning Techniques for Software ...
Comparative Performance Analysis of Machine Learning Techniques for Software ...Comparative Performance Analysis of Machine Learning Techniques for Software ...
Comparative Performance Analysis of Machine Learning Techniques for Software ...csandit
 
Robust Breast Cancer Diagnosis on Four Different Datasets Using Multi-Classif...
Robust Breast Cancer Diagnosis on Four Different Datasets Using Multi-Classif...Robust Breast Cancer Diagnosis on Four Different Datasets Using Multi-Classif...
Robust Breast Cancer Diagnosis on Four Different Datasets Using Multi-Classif...ahmad abdelhafeez
 
Decision Making Using The Analytic Hierarchy Process
Decision Making Using The Analytic Hierarchy ProcessDecision Making Using The Analytic Hierarchy Process
Decision Making Using The Analytic Hierarchy ProcessVaibhav Gaikwad
 
Traffic Outlier Detection by Density-Based Bounded Local Outlier Factors
Traffic Outlier Detection by Density-Based Bounded Local Outlier FactorsTraffic Outlier Detection by Density-Based Bounded Local Outlier Factors
Traffic Outlier Detection by Density-Based Bounded Local Outlier FactorsITIIIndustries
 
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...cscpconf
 
16 03 06 Smiths Aeros 2084a
16 03 06 Smiths Aeros 2084a16 03 06 Smiths Aeros 2084a
16 03 06 Smiths Aeros 2084aFNian
 

Similar to Presentation of the unbalanced R package (20)

Racing for unbalanced methods selection
Racing for unbalanced methods selectionRacing for unbalanced methods selection
Racing for unbalanced methods selection
 
ENACTMENT RANKING OF SUPERVISED ALGORITHMS DEPENDENCE OF DATA SPLITTING ALGOR...
ENACTMENT RANKING OF SUPERVISED ALGORITHMS DEPENDENCE OF DATA SPLITTING ALGOR...ENACTMENT RANKING OF SUPERVISED ALGORITHMS DEPENDENCE OF DATA SPLITTING ALGOR...
ENACTMENT RANKING OF SUPERVISED ALGORITHMS DEPENDENCE OF DATA SPLITTING ALGOR...
 
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
 
Da35573574
Da35573574Da35573574
Da35573574
 
Mining the LET Performance in Generating Prediction Models for OTDSS
Mining the LET Performance in Generating Prediction Models for OTDSSMining the LET Performance in Generating Prediction Models for OTDSS
Mining the LET Performance in Generating Prediction Models for OTDSS
 
IRJET- Predicting Customers Churn in Telecom Industry using Centroid Oversamp...
IRJET- Predicting Customers Churn in Telecom Industry using Centroid Oversamp...IRJET- Predicting Customers Churn in Telecom Industry using Centroid Oversamp...
IRJET- Predicting Customers Churn in Telecom Industry using Centroid Oversamp...
 
TECHNICAL REVIEW: PERFORMANCE OF EXISTING IMPUTATION METHODS FOR MISSING DATA...
TECHNICAL REVIEW: PERFORMANCE OF EXISTING IMPUTATION METHODS FOR MISSING DATA...TECHNICAL REVIEW: PERFORMANCE OF EXISTING IMPUTATION METHODS FOR MISSING DATA...
TECHNICAL REVIEW: PERFORMANCE OF EXISTING IMPUTATION METHODS FOR MISSING DATA...
 
TECHNICAL REVIEW: PERFORMANCE OF EXISTING IMPUTATION METHODS FOR MISSING DATA...
TECHNICAL REVIEW: PERFORMANCE OF EXISTING IMPUTATION METHODS FOR MISSING DATA...TECHNICAL REVIEW: PERFORMANCE OF EXISTING IMPUTATION METHODS FOR MISSING DATA...
TECHNICAL REVIEW: PERFORMANCE OF EXISTING IMPUTATION METHODS FOR MISSING DATA...
 
Improving Analogy Software Effort Estimation using Fuzzy Feature Subset Selec...
Improving Analogy Software Effort Estimation using Fuzzy Feature Subset Selec...Improving Analogy Software Effort Estimation using Fuzzy Feature Subset Selec...
Improving Analogy Software Effort Estimation using Fuzzy Feature Subset Selec...
 
An advance extended binomial GLMBoost ensemble method with synthetic minorit...
An advance extended binomial GLMBoost ensemble method  with synthetic minorit...An advance extended binomial GLMBoost ensemble method  with synthetic minorit...
An advance extended binomial GLMBoost ensemble method with synthetic minorit...
 
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
A Novel Approach for Breast Cancer Detection using Data Mining TechniquesA Novel Approach for Breast Cancer Detection using Data Mining Techniques
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
 
Comparative performance analysis
Comparative performance analysisComparative performance analysis
Comparative performance analysis
 
Comparative Performance Analysis of Machine Learning Techniques for Software ...
Comparative Performance Analysis of Machine Learning Techniques for Software ...Comparative Performance Analysis of Machine Learning Techniques for Software ...
Comparative Performance Analysis of Machine Learning Techniques for Software ...
 
Robust Breast Cancer Diagnosis on Four Different Datasets Using Multi-Classif...
Robust Breast Cancer Diagnosis on Four Different Datasets Using Multi-Classif...Robust Breast Cancer Diagnosis on Four Different Datasets Using Multi-Classif...
Robust Breast Cancer Diagnosis on Four Different Datasets Using Multi-Classif...
 
Decision Making Using The Analytic Hierarchy Process
Decision Making Using The Analytic Hierarchy ProcessDecision Making Using The Analytic Hierarchy Process
Decision Making Using The Analytic Hierarchy Process
 
Traffic Outlier Detection by Density-Based Bounded Local Outlier Factors
Traffic Outlier Detection by Density-Based Bounded Local Outlier FactorsTraffic Outlier Detection by Density-Based Bounded Local Outlier Factors
Traffic Outlier Detection by Density-Based Bounded Local Outlier Factors
 
Ijetr021251
Ijetr021251Ijetr021251
Ijetr021251
 
Dt34733739
Dt34733739Dt34733739
Dt34733739
 
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
 
16 03 06 Smiths Aeros 2084a
16 03 06 Smiths Aeros 2084a16 03 06 Smiths Aeros 2084a
16 03 06 Smiths Aeros 2084a
 

More from Andrea Dal Pozzolo

Adaptive Machine Learning for Credit Card Fraud Detection
Adaptive Machine Learning for Credit Card Fraud DetectionAdaptive Machine Learning for Credit Card Fraud Detection
Adaptive Machine Learning for Credit Card Fraud DetectionAndrea Dal Pozzolo
 
Is Machine learning useful for Fraud Prevention?
Is Machine learning useful for Fraud Prevention?Is Machine learning useful for Fraud Prevention?
Is Machine learning useful for Fraud Prevention?Andrea Dal Pozzolo
 
Credit card fraud detection and concept drift adaptation with delayed supervi...
Credit card fraud detection and concept drift adaptation with delayed supervi...Credit card fraud detection and concept drift adaptation with delayed supervi...
Credit card fraud detection and concept drift adaptation with delayed supervi...Andrea Dal Pozzolo
 
Doctiris project - Innoviris, Brussels
Doctiris project - Innoviris, Brussels Doctiris project - Innoviris, Brussels
Doctiris project - Innoviris, Brussels Andrea Dal Pozzolo
 
Using HDDT to avoid instances propagation in unbalanced and evolving data str...
Using HDDT to avoid instances propagation in unbalanced and evolving data str...Using HDDT to avoid instances propagation in unbalanced and evolving data str...
Using HDDT to avoid instances propagation in unbalanced and evolving data str...Andrea Dal Pozzolo
 

More from Andrea Dal Pozzolo (6)

Andrea Dal Pozzolo's CV
Andrea Dal Pozzolo's CVAndrea Dal Pozzolo's CV
Andrea Dal Pozzolo's CV
 
Adaptive Machine Learning for Credit Card Fraud Detection
Adaptive Machine Learning for Credit Card Fraud DetectionAdaptive Machine Learning for Credit Card Fraud Detection
Adaptive Machine Learning for Credit Card Fraud Detection
 
Is Machine learning useful for Fraud Prevention?
Is Machine learning useful for Fraud Prevention?Is Machine learning useful for Fraud Prevention?
Is Machine learning useful for Fraud Prevention?
 
Credit card fraud detection and concept drift adaptation with delayed supervi...
Credit card fraud detection and concept drift adaptation with delayed supervi...Credit card fraud detection and concept drift adaptation with delayed supervi...
Credit card fraud detection and concept drift adaptation with delayed supervi...
 
Doctiris project - Innoviris, Brussels
Doctiris project - Innoviris, Brussels Doctiris project - Innoviris, Brussels
Doctiris project - Innoviris, Brussels
 
Using HDDT to avoid instances propagation in unbalanced and evolving data str...
Using HDDT to avoid instances propagation in unbalanced and evolving data str...Using HDDT to avoid instances propagation in unbalanced and evolving data str...
Using HDDT to avoid instances propagation in unbalanced and evolving data str...
 

Recently uploaded

Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 

Recently uploaded (20)

Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 

Presentation of the unbalanced R package

  • 1. Introduction Preprocessing methods Racing Conclusions Racing for unbalanced methods selection – the unbalanced package Andrea Dal Pozzolo 24/2/2016 1/ 20
  • 2. Introduction Preprocessing methods Racing Conclusions ABOUT ME I’m a Decision Analytics Consultant at VASCO Data Security. I hold a PhD from the Machine Learning Group (MLG) of the Universit´e Libre de Bruxelles (ULB). My research focused on Machine Learning techniques for Fraud Detection in electronic transaction. 2/ 20
  • 3. Introduction Preprocessing methods Racing Conclusions TABLE OF CONTENTS 1 Introduction 2 Preprocessing methods 3 Racing 4 Conclusions 3/ 20
  • 4. Introduction Preprocessing methods Racing Conclusions INTRODUCTION In several binary classification problems (e.g. fraud detection), the two classes are not equally represented in the dataset. When one class is underrepresented in a dataset, the data is said to be unbalanced. In unbalanced datasets, classification algorithms perform poorly in terms of predictive accuracy [9]. A common strategy is to balance the classes before learning a classifier [1]. Technical sections will be denoted by the symbol 4/ 20
  • 5. Introduction Preprocessing methods Racing Conclusions PCA PROJECTION OF CREDIT CARD TRANSACTIONS Figure : Transactions distribution over the first two Principal Components in different days. 5/ 20
  • 6. Introduction Preprocessing methods Racing Conclusions THE FRAUD DETECTION PROBLEM We formalize FD as a classification task f : Rn → {+, −}. X ∈ Rn is the input and Y ∈ {+, −} the output domain. + is the fraud (minority) and − the genuine (majority) class. Given a classifier K and a training set TN, we are interested in estimating for a new sample (x, y) the posterior probability P(y = +|x). 6/ 20
  • 7. Introduction Preprocessing methods Racing Conclusions PREPROCESSING METHODS FOR UNBALANCED DATA Sampling methods Undersampling [7] Oversampling [7] SMOTE [3] Distance based methods Tomek link [15] Condensed Nearest Neighbor (CNN) [8] One side Selection (OSS) [10] Edited Nearest Neighbor (ENN) [16] Neighborhood Cleaning Rule (NCL) [11] 7/ 20
  • 8. Introduction Preprocessing methods Racing Conclusions SAMPLING METHODS Undersampling- Oversampling- SMOTE- Figure : Resampling methods for unbalanced classification. The negative and positive symbols denotes majority and minority class instances. In red the new observations created with oversampling methods. 8/ 20
  • 9. Introduction Preprocessing methods Racing Conclusions SELECTING THE BEST STRATEGY With no prior information about the data distribution is difficult to decide which unbalanced strategy to use. No-free-lunch theorem [17]: no single strategy is coherently superior to all others in all conditions (i.e. algorithm, dataset and performance metric) Testing all unbalanced techniques is not an option because of the associated computational cost. We proposed to use the Racing approach [12] to perform strategy selection. 9/ 20
  • 10. Introduction Preprocessing methods Racing Conclusions RACING FOR STRATEGY SELECTION Racing consists in testing in parallel a set of alternatives and using a statistical test to remove an alternative if it is significantly worse than the others. We adopted F-Race version [2] to search efficiently for the best strategy for unbalanced data. The F-race combines the Friedman test with Hoeffding Races [12]. 10/ 20
  • 11. Introduction Preprocessing methods Racing Conclusions RACING FOR UNBALANCED TECHNIQUE SELECTION Automatically select the most adequate technique for a given dataset. 1. Test in parallel a set of alternative balancing strategies on a subset of the dataset 2. Remove progressively the alternatives which are significantly worse. 3. Iterate the testing and removal step until there is only one candidate left or not more data is available Candidate(1 Candidate(2 Candidate(3 subset(1( 0.50 0.47 0.48 subset(2 0.51 0.48 0.30 subset(3 0.51 0.47 subset(4 0.60 0.45 subset(5 0.55 Candidate(1 Candidate(2 Candidate(3 subset(1( 0.50 0.47 0.48 subset(2 0.51 0.48 0.3011/ 20
  • 12. Introduction Preprocessing methods Racing Conclusions F-RACE METHOD Use 10-fold Cross Validation (CV) to provide the data during the race. Every time new data is added to the race, the Friedman test is used to remove significantly bad candidates. We made a comparison of CV and F-race in terms of F-measure. Candidate(1 Candidate(2 Candidate(3 subset(1( 0.50 0.47 0.48 subset(2 0.51 0.48 0.30 subset(3 0.51 0.47 subset(4 0.60 0.45 subset(5 0.55 Candidate(1 Candidate(2 Candidate(3 subset(1( 0.50 0.47 0.48 subset(2 0.51 0.48 0.30 subset(3 0.51 0.47 subset(4 0.49 0.46 subset(5 0.48 0.46 subset(6 0.60 0.45 subset(7 0.59 subset(8 subset(9 subset(10 Original(Dataset Time 12/ 20
  • 13. Introduction Preprocessing methods Racing Conclusions F-RACE VS CROSS VALIDATION Dataset Exploration Method Ntest % Gain Mean Sd ecoli Race Under 46 49 0.836 0.04 CV SMOTE 90 - 0.754 0.112 letter-a Race Under 34 62 0.952 0.008 CV SMOTE 90 - 0.949 0.01 letter-vowel Race Under 34 62 0.884 0.011 CV Under 90 - 0.887 0.009 letter Race SMOTE 37 59 0.951 0.009 CV Under 90 - 0.951 0.01 oil Race Under 41 54 0.629 0.074 CV SMOTE 90 - 0.597 0.076 page Race SMOTE 45 50 0.919 0.01 CV SMOTE 90 - 0.92 0.008 pendigits Race Under 39 57 0.978 0.011 CV Under 90 - 0.981 0.006 PhosS Race Under 19 79 0.598 0.01 CV Under 90 - 0.608 0.016 satimage Race Under 34 62 0.843 0.008 CV Under 90 - 0.841 0.011 segment Race SMOTE 90 0 0.978 0.01 CV SMOTE 90 - 0.978 0.01 estate Race Under 27 70 0.553 0.023 CV Under 90 - 0.563 0.021 covtype Race Under 42 53 0.924 0.007 CV SMOTE 90 - 0.921 0.008 cam Race Under 34 62 0.68 0.007 CV Under 90 - 0.674 0.015 compustat Race Under 37 59 0.738 0.021 CV Under 90 - 0.745 0.017 creditcard Race Under 43 52 0.927 0.008 CV SMOTE 90 - 0.924 0.006 Table : Results in terms of G-mean for RF classifier. 13/ 20
  • 14. Introduction Preprocessing methods Racing Conclusions DISCUSSION The best strategy is extremely dependent on the data nature, algorithm adopted and performance measure. F-race is able to automatise the selection of the best unbalanced strategy for a given unbalanced problem without exploring the whole dataset. For the fraud dataset the unbalanced strategy chosen had a big impact on the accuracy of the results. 14/ 20
  • 15. Introduction Preprocessing methods Racing Conclusions CONCLUSIONS With racing [6] we can rapidly select the best strategy for a given unbalanced task. However, we see that undersampling and SMOTE are often the best strategy to adopt. In R [14] we have a software package called unbalanced [4] that implements all these algorithms. 15/ 20
  • 16. Introduction Preprocessing methods Racing Conclusions PACKAGE DEMO Let’s see in practise how to do it ... The code of this demo is available at https://github.com/dalpozz/utility/blob/master/ubDEMO.R 16/ 20
  • 17. Introduction Preprocessing methods Racing Conclusions Vignette: www.ulb.ac.be/di/map/adalpozz/pdf/unbalanced.pdf CRAN: http://CRAN.R-project.org/package=unbalanced Github: https://github.com/dalpozz/unbalanced Email: dalpozz@gmail.com Questions? 17/ 20
  • 18. Introduction Preprocessing methods Racing Conclusions BIBLIOGRAPHY I [1] R. Akbani, S. Kwek, and N. Japkowicz. Applying support vector machines to imbalanced datasets. In Machine Learning: ECML 2004, pages 39–50. Springer, 2004. [2] M. Birattari, T. St¨utzle, L. Paquete, and K. Varrentrapp. A racing algorithm for configuring metaheuristics. In Proceedings of the genetic and evolutionary computation conference, pages 11–18, 2002. [3] N. Chawla, K. Bowyer, L. O. Hall, and W. P. Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research (JAIR), 16:321–357, 2002. [4] A. Dal Pozzolo, O. Caelen, and G. Bontempi. unbalanced: Racing For Unbalanced Methods Selection., 2015. R package version 2.0. [5] A. Dal Pozzolo, O. Caelen, R. A. Johnson, and G. Bontempi. Calibrating probability with undersampling for unbalanced classification. In 2015 IEEE Symposium on Computational Intelligence and Data Mining. IEEE, 2015. [6] A. Dal Pozzolo, O. Caelen, S. Waterschoot, and G. Bontempi. Racing for unbalanced methods selection. In Proceedings of the 14th International Conference on Intelligent Data Engineering and Automated Learning. IDEAL, 2013. [7] C. Drummond and R. Holte. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Datasets II, 2003. 17/ 20
  • 19. Introduction Preprocessing methods Racing Conclusions BIBLIOGRAPHY II [8] P. E. Hart. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, 1968. [9] N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent data analysis, 6(5):429–449, 2002. [10] M. Kubat, S. Matwin, et al. Addressing the curse of imbalanced training sets: one-sided selection. In ICML, volume 97, pages 179–186. Nashville, USA, 1997. [11] J. Laurikkala. Improving identification of difficult small classes by balancing class distribution. Artificial Intelligence in Medicine, pages 63–66, 2001. [12] O. Maron and A. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. Robotics Institute, page 263, 1993. [13] J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset shift in machine learning. The MIT Press, 2009. [14] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2011. ISBN 3-900051-07-0. 18/ 20
  • 20. Introduction Preprocessing methods Racing Conclusions BIBLIOGRAPHY III [15] I. Tomek. Two modifications of cnn. IEEE Trans. Syst. Man Cybern., 6:769–772, 1976. [16] D. Wilson. Asymptotic properties of nearest neighbor rules using edited data. Systems, Man and Cybernetics, (3):408–421, 1972. [17] D. H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural computation, 8(7):1341–1390, 1996. 19/ 20
  • 21. Introduction Preprocessing methods Racing Conclusions SAMPLING SELECTION BIAS Sampling selection bias [13] occurs when the training and testing set comes from a different distribution. In this case we need to calibrate the posterior probability of the classifier [5]. Posi%ves( Nega%ves( Undersampling( Posi%ves( Nega%ves( Data( Genera%on( Process( Posi%ves( Nega%ves( Training( Tes%ng( Different(distribu%ons( 20/ 20