Classifying Commit
Messages: A Case Study in
Resampling Techniques
Presenter:
Hamid Shekarforoush
Advisor :
Dr Robert Green
Bowling Green State University
Computer Science
Bowling Green, OH, USA
Our Dataset
● A set of commit messages extracted from multiple GitHub and
SourceForge projects to answer the question, “Do developers discuss
design?”
● Highly imbalanced
○ 15% design commits
○ 85% non-design commits
Our Dataset
Commit                                              ID
Update Handlebars to v1000 and recompile templates  0
Update descriptions of ant targets                  0
update dpchartpiejs add event actions               1
Add interface ConstField                            1
Feature Extraction
● TF-IDF
○ Term Frequency–Inverse Document Frequency
● CountVectorizer
○ Converts text to a matrix of token counts
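The two feature extractors above can be illustrated with a minimal pure-Python sketch. It assumes whitespace tokenization and the unsmoothed textbook weight tf × log(N / df); library implementations such as scikit-learn's typically add smoothing and normalization, so their numbers will differ. The function names `count_vectorize` and `tfidf` are hypothetical, and the third example document is made up for illustration.

```python
import math
from collections import Counter

def count_vectorize(docs):
    """Return (vocabulary, rows); each row holds the token counts of one document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

def tfidf(rows):
    """Weight raw counts by log(N / document frequency) (assumed, unsmoothed variant)."""
    n_docs = len(rows)
    n_terms = len(rows[0])
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for row in rows]

docs = [
    "Add interface ConstField",
    "Update descriptions of ant targets",
    "Add event actions",
]
vocab, counts = count_vectorize(docs)
weights = tfidf(counts)
```

A term that appears in many commits (here "add") receives a lower weight than a term that appears in only one.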
Our Dataset
Principal component analysis (PCA)
● Blue dots are normal commits (Majority)
● Red dots are design commits (Minority)
Classification
Identifying the category of new
samples based on the training set
Image : http://cdn-akfst-hatenacom/images/fotolife/T/TJO/20140106/20140106225602png
Classifiers (k Nearest Neighbors)
Image : http://trevorwhitneycom/data_mining/classification
?
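As a rough illustration of the k nearest neighbors idea pictured on this slide, the unknown point takes the majority label among its k closest training points. This is a minimal sketch assuming Euclidean distance and numeric feature tuples; `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    dists = sorted((math.dist(x, query), label) for x, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
train_y = [0, 0, 0, 1, 1]
near_origin = knn_predict(train_X, train_y, (0.2, 0.2))
near_cluster = knn_predict(train_X, train_y, (5, 5.5))
```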
Our Classifiers
● Random Forest
● Decision Tree
● SVC : Support Vector Classification
● Linear SVC
● BNB: Bernoulli Naive Bayes
● NC: Nearest Centroid
Imbalanced dataset
Machine learning algorithms have
problems with imbalanced datasets
Image : http://contribscikit-learnorg/imbalanced-learn/_images/sphx_glr_plot_make_imbalance_thumbpng
Resampling
● Under-samplers
○ Removing samples (usually only from the majority class)
● Over-samplers
○ Generating new samples (usually from the minority class)
● Hybrid methods
○ A combination of under-samplers and over-samplers
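The simplest members of the first two families are random under-sampling and random over-sampling. The sketch below assumes a binary problem in which the minority class is the smaller one, and balances the classes exactly; the function names and the 17/3 toy split are hypothetical.

```python
import random

def random_under_sample(X, y, majority=0, seed=0):
    """Drop random majority samples until both classes have the same size."""
    rng = random.Random(seed)
    maj = [i for i, label in enumerate(y) if label == majority]
    mino = [i for i, label in enumerate(y) if label != majority]
    # Assumes len(mino) <= len(maj); rng.sample raises ValueError otherwise.
    keep = sorted(rng.sample(maj, len(mino)) + mino)
    return [X[i] for i in keep], [y[i] for i in keep]

def random_over_sample(X, y, minority=1, seed=0):
    """Duplicate random minority samples until both classes have the same size."""
    rng = random.Random(seed)
    mino = [i for i, label in enumerate(y) if label == minority]
    n_majority = len(y) - len(mino)
    extra = [rng.choice(mino) for _ in range(n_majority - len(mino))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

X = list(range(20))
y = [0] * 17 + [1] * 3
X_under, y_under = random_under_sample(X, y)
X_over, y_over = random_over_sample(X, y)
```

Under-sampling discards information from the majority class, while over-sampling repeats minority samples; the trade-off between the two is what the experiments later compare.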
Resampling - Under Sampling
CNN (Condensed Nearest Neighbour)
Starts with two bins, store and garbage. The
first sample is placed into the store, and then
the second sample is classified by the nearest
neighbor rule using the store as the reference. If the
sample is classified correctly, it is placed in
the garbage bin. Otherwise, it is placed in
the store bin. This procedure repeats for the
entire sample space.
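The store and garbage procedure described above can be sketched as a single pass over the data. This is an illustration under assumptions, not the implementation used in the study: it uses Euclidean distance and a 1-nearest-neighbor rule, and it stops after one pass, whereas fuller descriptions of the condensed nearest neighbor rule keep re-checking the garbage bin until nothing moves. `condensed_nn_pass` is a hypothetical name.

```python
import math

def condensed_nn_pass(X, y):
    """One pass of the store/garbage procedure; returns (store, garbage) index lists."""
    store = [0]      # the first sample always goes into the store
    garbage = []
    for i in range(1, len(X)):
        # Classify sample i with the 1-nearest-neighbor rule, using only the store.
        nearest = min(store, key=lambda j: math.dist(X[j], X[i]))
        if y[nearest] == y[i]:
            garbage.append(i)   # classified correctly: redundant, discard
        else:
            store.append(i)     # misclassified: needed to describe the boundary
    return store, garbage

X = [(0, 0), (0, 1), (5, 5), (5, 6), (0, 2)]
y = [0, 0, 1, 1, 0]
store, garbage = condensed_nn_pass(X, y)
```

The retained set is the store; samples in the garbage bin are the ones removed by the under-sampler.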
Resampling - Over sampling
SMOTE
Synthetic Minority Over-sampling
TEchnique, which over-samples the
minority class by generating synthetic
examples. It takes one minority
sample and one of its nearest neighbors,
calculates the difference between the
two, multiplies it by a random
number between 0 and 1, and adds the
result to the original sample. This new
sample is then added to the training
data.
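The interpolation step at the heart of SMOTE can be sketched in a few lines. This is a minimal illustration that assumes numeric feature tuples, Euclidean distance, and at least two minority samples; `smote_sample` is a hypothetical helper, not the imbalanced-learn API.

```python
import math
import random

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and one of its k nearest neighbors."""
    rng = rng or random.Random(0)
    i = rng.randrange(len(minority))
    base = minority[i]
    others = [p for j, p in enumerate(minority) if j != i]
    neighbors = sorted(others, key=lambda p: math.dist(p, base))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # random number in [0, 1)
    # base + gap * (neighbor - base): a point on the segment between the two samples
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
rng = random.Random(42)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(50)]
```

Because every synthetic point lies on a line segment between two existing minority samples, SMOTE fills in the minority region rather than duplicating points as random over-sampling does.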
Resampling - Hybrid
SMOTE Tomek
First over-sample with SMOTE, then
under-sample with Tomek links
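A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbor. The sketch below finds such pairs and then drops the majority-class member of each link; dropping only the majority member is an assumption made here for illustration (some variants remove both points). The function names and toy data are hypothetical.

```python
import math

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, of opposite-class mutual nearest neighbors."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))
    links = []
    for i in range(len(X)):
        j = nearest(i)
        if i < j and y[i] != y[j] and nearest(j) == i:
            links.append((i, j))
    return links

def drop_majority_in_links(X, y, majority=0):
    """Under-sample by removing the majority-class member of every Tomek link."""
    drop = {i if y[i] == majority else j for i, j in tomek_links(X, y)}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

X = [(0, 0), (0, 0.5), (1, 1), (1.2, 1), (5, 5), (5, 6)]
y = [0, 0, 0, 1, 1, 1]
links = tomek_links(X, y)
X_clean, y_clean = drop_majority_in_links(X, y)
```

In the hybrid method, a cleaning step of this kind would run on the data after SMOTE has added the synthetic minority samples.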
Resampling
Image : http://contribscikit-learnorg/imbalanced-learn
Samples from Smote
oversampler family
Our Resamplers
Under Samplers
Random Under Sampler
Tomek Links
Cluster Centroids
Near Miss
Condensed Nearest Neighbour
One Sided Selection
Neighbourhood Cleaning Rule
Edited Nearest Neighbours
Instance Hardness Threshold
Repeated Edited Nearest Neighbours
Over samplers
Random Over Sampler
SMOTE
SMOTE borderline 1
SMOTE borderline 2
SMOTE svm
ADASYN
Hybrid
SMOTE Tomek
SMOTE ENN
Resampling results
Original data
Experiment
Performance Metrics
Confusion Matrix
True positive (TP)
True negative (TN) : correct rejection
False positive (FP) : False alarm
False negative (FN) : Miss
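The four confusion-matrix cells can be counted directly from true and predicted labels. A minimal sketch, assuming the design (minority) class is encoded as 1 and treated as the positive class; `confusion_counts` is a hypothetical name.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary problem."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1   # hit
            else:
                fp += 1   # false alarm
        else:
            if truth == positive:
                fn += 1   # miss
            else:
                tn += 1   # correct rejection
    return tp, tn, fp, fn

y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
```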
Performance Metrics
Recall (R) : true positive rate or sensitivity
Specificity (S) : true negative rate
Precision (P) : positive predictive value
Performance Metrics
Accuracy
F1-score (F) : Harmonic mean of precision and recall
G-mean (G) : Geometric mean of recall and specificity
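All six metrics follow from the confusion-matrix counts. The sketch below uses the standard definitions, with G-mean taken as the geometric mean of recall and specificity, and returns 0 where a denominator would be zero (that convention is an assumption). The example shows why accuracy alone is misleading on a 15%/85% split: a classifier that never predicts the design class still scores 0.85.

```python
import math

def metrics(tp, tn, fp, fn):
    """Compute recall, specificity, precision, accuracy, F1, and G-mean from counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
        "g_mean": g_mean,
    }

# Always predicting the majority class on 15 design and 85 non-design commits.
majority_only = metrics(tp=0, tn=85, fp=0, fn=15)
balanced_case = metrics(tp=2, tn=2, fp=1, fn=1)
```

F1 and G-mean both drop to 0 for the majority-only classifier, which is why they are reported alongside accuracy.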
Experiment - Tools
● Python 3
○ Pandas
○ Numpy
● Scikit Learn
○ imbalanced-learn
Experiment 1
Training set: All the data (GitHub and SourceForge)
Testing set: All the data (GitHub and SourceForge)
Experiment 1 - k-fold cross-validation
Using 10-fold cross-validation
Image : https://enwikipediaorg/wiki/Cross-validation_(statistics)#/media/File:K-fold_cross_validation_ENjpg
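The fold structure of k-fold cross-validation can be sketched as an index generator: each fold serves once as the test set while the remaining folds form the training set. This sketch assumes no shuffling and no stratification, which a real experiment on imbalanced data would likely want; `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(25, k=10))
```

Every sample appears in exactly one test fold, so each reported score is an average over ten train/test splits.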
Experiment 1 - Results (Accuracy)
             RF    DT    SVC   LSVC  BNB   NC
No Sampling  0.84  0.81  0.86  0.85  0.84  0.86
ROS          0.95  0.87  0.52  0.95  0.84  0.82
SMOTE B1     0.90  0.85  0.78  0.91  0.87  0.84
SMOTE B2     0.90  0.85  0.60  0.90  0.83  0.84
SMOTE SVM    0.88  0.82  0.66  0.88  0.83  0.84
ADASYN       0.86  0.80  0.51  0.97  0.75  NA
SMOTE        0.94  0.87  0.63  0.95  0.88  0.83
SMOTE Tomek  0.94  0.88  0.59  0.95  0.88  0.83
SMOTE ENN    0.94  0.88  0.51  0.96  0.88  0.83

             RF    DT    SVC   LSVC  BNB   NC
No Sampling  0.84  0.81  0.86  0.85  0.84  0.86
Tomek Links  0.84  0.81  0.86  0.85  0.84  0.86
OSS          0.84  0.81  0.86  0.85  0.84  0.86
NCR          0.84  0.81  0.86  0.85  0.84  0.86
ENN          0.84  0.81  0.86  0.85  0.84  0.86
RENN         0.84  0.81  0.86  0.85  0.84  0.86
RUS          0.59  0.63  0.57  0.68  0.60  0.72
CC           0.61  0.62  0.67  0.69  0.61  NA
NM           0.74  0.76  0.88  0.86  0.87  0.89
CNN          0.67  0.63  0.80  0.75  0.80  0.77
IHT          0.70  0.69  0.63  0.79  0.65  0.83
Experiment 1 - Results (P R S)
             RFC            DTC            SVC            LSVC           BNB            NC
             P    R    S    P    R    S    P    R    S    P    R    S    P    R    S    P    R    S
No Sampling  23   05   97   30   29   89   0    0    100  33   10   97   35   15   95   24   66   66
Tomek Links  19   03   98   31   30   89   0    0    100  37   10   97   36   15   95   24   67   65
OSS          14   03   97   30   30   89   0    0    100  37   10   97   36   15   95   24   67   65
NCR          25   05   98   32   29   90   0    0    100  38   10   97   36   15   95   24   67   65
ENN          21   04   97   31   29   89   0    0    100  38   10   97   36   15   95   24   67   65
RENN         26   05   98   31   30   89   0    0    100  38   10   97   36   15   95   24   67   65
RUS          76   32   90   68   50   77   54   90   24   73   60   78   81   32   93   62   78   53
CC           75   33   89   64   51   72   67   71   64   74   58   80   56   100  21   73   55   79
NM           81   58   86   85   64   89   86   91   85   87   85   87   86   89   85   86   90   85
CNN          82   71   47   79   67   40   77   100  0    76   92   01   77   100  0    75   79   12
IHT          84   56   90   72   73   71   57   98   28   79   87   76   61   88   43   60   90   40
ROS          91   100  90   81   100  76   52   91   15   91   100  91   83   85   83   68   88   58
SMOTE B1     96   85   97   85   87   84   99   54   99   94   87   95   91   82   92   94   77   95
SMOTE B2     96   85   96   85   86   84   99   37   100  94   86   94   88   76   90   98   78   99
SMOTE SVM    92   72   97   76   76   87   0    0    100  89   76   95   81   70   91   86   70   94
ADASYN       89   79   91   77   86   76   0    0    100  94   99   94   79   66   83   76   58   82
SMOTE        93   94   93   83   93   81   59   87   40   92   100  91   90   86   91   72   85   67
SMOTE Tomek  92   95   92   83   93   81   56   56   60   92   100  91   90   86   91   72   85   66
SMOTE ENN    93   95   93   83   94   81   0    0    100  92   100  91   90   86   91   71   85   67
P = Precision, R = Recall, S = Specificity (values in %)
Experiment 1 - Results
F1 = F1-score, G = G-mean (values in %)
             RFC       DTC       SVC       LSVC      BNB       NC
             F1   G    F1   G    F1   G    F1   G    F1   G    F1   G
No Sampling  09   23   30   51   0    0    15   31   21   38   35   66
Tomek Links  05   18   31   52   0    0    16   32   22   38   35   66
OSS          05   16   30   51   0    0    16   32   22   38   35   66
NCR          08   21   30   51   0    0    16   32   22   38   35   66
ENN          07   20   30   51   0    0    16   32   22   38   35   66
RENN         08   22   31   52   0    0    16   32   22   38   35   66
RUS          45   54   58   62   68   46   66   68   45   54   69   64
CC           46   55   57   60   69   68   65   68   72   46   62   66
NM           67   71   73   76   89   88   86   86   88   87   88   88
CNN          76   58   72   52   87   0    83   09   87   0    77   31
IHT          68   71   72   72   72   52   82   81   72   62   72   60
ROS          95   95   89   87   66   37   96   95   84   84   77   71
SMOTE B1     90   90   86   85   70   74   90   90   86   86   85   86
SMOTE B2     90   90   86   85   49   61   90   90   82   83   87   88
SMOTE SVM    81   84   76   81   0    0    82   85   75   80   77   81
ADASYN       84   85   81   80   0    0    97   97   72   74   66   69
SMOTE        94   94   88   87   70   59   96   95   88   88   78   75
SMOTE Tomek  94   93   88   87   48   58   96   95   88   88   78   75
SMOTE ENN    94   94   88   87   0    0    96   95   88   88   78   75
Experiment 2
Training set: GitHub
Testing set: SourceForge
Training set: SourceForge
Testing set: GitHub
Only the training set is resampled
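The protocol of this experiment, resampling the training project only and scoring on an untouched test project, can be sketched as follows. Everything here is illustrative: random over-sampling stands in for whichever resampler is being tested, a hand-written nearest-centroid rule stands in for the classifiers, and the toy points are made up.

```python
import math
import random

def oversample_minority(X, y, minority=1, seed=0):
    """Duplicate random minority samples until the classes are balanced."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    n_majority = len(y) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(n_majority - len(minority_idx))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

def fit_centroids(X, y):
    """Nearest-centroid training: the mean point of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = tuple(sum(col) / len(rows) for col in zip(*rows))
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))

def cross_project_accuracy(train_X, train_y, test_X, test_y):
    # Resample the training project only; the test project keeps its
    # original class distribution.
    X_res, y_res = oversample_minority(train_X, train_y)
    model = fit_centroids(X_res, y_res)
    hits = sum(predict(model, x) == t for x, t in zip(test_X, test_y))
    return hits / len(test_y)

train_X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
train_y = [0, 0, 0, 0, 1]
test_X = [(0.5, 0.5), (5, 6), (4, 5)]
test_y = [0, 1, 1]
accuracy = cross_project_accuracy(train_X, train_y, test_X, test_y)
```

Keeping the test set unresampled means the reported scores reflect the class balance a classifier would actually face on the other project.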
Experiment 2 - Results (Accuracy)
             RFC  DTC  SVC  LSVC  BNB  NC
No Sampling  85   80   86   86    85   26
ROS          85   82   14   85    77   37
SMOTE        86   81   14   85    86   64
SMOTE B1     86   81   14   85    85   55
SMOTE B2     85   81   28   84    85   21
SMOTE SVM    86   80   86   86    85   50
ADASYN       85   81   86   86    78   86
SMOTE Tomek  85   82   14   85    86   70
SMOTE ENN    86   81   86   85    86   78

             RFC  DTC  SVC  LSVC  BNB  NC
No Sampling  85   80   86   86    85   26
RUS          86   69   14   40    59   17
Tomek Links  85   80   86   86    85   25
CC           85   83   16   60    14   39
NM           86   85   14   14    14   14
CNN          86   82   86   81    86   85
OSS          85   80   86   86    85   34
NCR          86   80   86   86    85   25
ENN          86   80   86   86    85   25
IHT          85   42   14   16    14   14
RENN         85   80   86   86    85   25
Experiment 2 - Results (P R S)
Experiment 1 - Results
Other experiments
Changing the TF-IDF setting from 3 words to a single word
Using CountVectorizer
The results stay fairly similar
Conclusion
● Resampling works
○ Need enough training data
○ Choose suitable resampler
● Poor resampling can degrade your results
Future study
● Combining different resampling methods
● Expanding the dataset
● Different types of data
● Using natural language processing instead of TF-IDF
Thank you
Questions?
