resampling techniques in machine learning
1. Classifying Commit Messages: A Case Study in Resampling Techniques
Presenter: Hamid Shekarforoush
Advisor: Dr. Robert Green
Bowling Green State University
Computer Science
Bowling Green, OH, USA
2. Our Dataset
● A set of commit messages extracted from multiple GitHub and
SourceForge projects in order to answer the question, “Do developers discuss
design?”
● Highly imbalanced
○ 15% design commits
○ 85% non-design commits
3. Our Dataset
Commit | ID
Update Handlebars to v1000 and recompile templates | 0
Update descriptions of ant targets | 0
update dpchartpiejs add event actions | 1
Add interface ConstField | 1
4. Feature Extraction
● TF-IDF
○ Term Frequency–Inverse Document Frequency
● CountVectorizer
○ Converts text to a matrix of token counts
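The two feature extractors above can be illustrated with a minimal pure-Python sketch. It is not the presenter's pipeline: the helper names `count_vectorize` and `tf_idf` are hypothetical, the third example message is made up, and the IDF is the textbook log(N / df) form. scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each row, so its numbers would differ.

```python
import math
from collections import Counter


def count_vectorize(docs):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


def tf_idf(rows):
    """Weight raw counts by log(N / document frequency), the textbook IDF."""
    n_docs = len(rows)
    n_terms = len(rows[0])
    doc_freq = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / doc_freq[j]) for j in range(n_terms)]
        for row in rows
    ]


docs = [
    "Update descriptions of ant targets",
    "Add interface ConstField",
    "Update interface descriptions",  # made-up message for illustration
]
vocabulary, counts = count_vectorize(docs)
weights = tf_idf(counts)
```

Terms that occur in few messages (such as "ant") receive a higher weight than terms shared by several messages (such as "update").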
5. Our Dataset
Principal component analysis (PCA)
● Blue dots are normal commits (Majority)
● Red dots are design commits (Minority)
6. Classification
Identifying the category of new
samples based on the training set
Image : http://cdn-akfst-hatenacom/images/fotolife/T/TJO/20140106/20140106225602png
8. Our Classifiers
● Random Forest
● Decision Tree
● SVC : Support Vector Classification
● Linear SVC
● BNB: Bernoulli Naive Bayes
● NC: Nearest Centroid
9. Imbalanced dataset
Machine learning algorithms have
problems with imbalanced datasets
Image : http://contribscikit-learnorg/imbalanced-learn/_images/sphx_glr_plot_make_imbalance_thumbpng
10. Resampling
● Under-samplers
○ Reduce the number of samples (usually only in the majority class)
● Over-samplers
○ Generate new samples (usually from the minority class)
● Hybrid methods
○ A combination of under-samplers and over-samplers
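As a rough sketch of the simplest form of each idea, the code below balances a binary dataset by randomly dropping majority samples or by randomly duplicating minority samples. The function names and the toy 17/3 split (mirroring the 85%/15% ratio) are assumptions for illustration, and duplication is only the most basic over-sampler; it does not synthesize new points.

```python
import random


def random_under_sample(samples, labels, majority, rng):
    """Drop random majority samples until both classes have the same size."""
    minority_idx = [i for i, y in enumerate(labels) if y != majority]
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    kept = sorted(rng.sample(majority_idx, len(minority_idx)) + minority_idx)
    return [samples[i] for i in kept], [labels[i] for i in kept]


def random_over_sample(samples, labels, minority, rng):
    """Duplicate random minority samples until both classes have the same size."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    n_extra = len(labels) - 2 * len(minority_idx)  # majority size - minority size
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    idx = list(range(len(labels))) + extra
    return [samples[i] for i in idx], [labels[i] for i in idx]


rng = random.Random(0)
labels = [0] * 17 + [1] * 3   # 85% non-design, 15% design
samples = list(range(20))     # stand-ins for feature vectors

under_x, under_y = random_under_sample(samples, labels, majority=0, rng=rng)
over_x, over_y = random_over_sample(samples, labels, minority=1, rng=rng)
```

A hybrid method would apply both steps, for example over-sampling the minority class first and then removing some majority samples.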
11. Resampling - Under Sampling
CNN (Condensed Nearest Neighbor)
Starts with two bins, store and garbage. The
first sample is placed into the store, and then
the second sample is classified by the nearest
neighbor rule using the store as the reference. If the
sample is classified correctly, it is placed in
the garbage bin. Otherwise, it is placed in
the store bin. This procedure repeats for the
entire sample space.
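The store/garbage procedure can be sketched in a few lines of pure Python. This is an illustrative version, not the implementation used in the study: the function names are hypothetical, distances are Euclidean, and the loop makes repeated passes over the garbage bin until no sample moves, which is one common way to finish the procedure.

```python
import math


def nearest_label(point, store):
    """Label of the stored sample closest to `point` (1-nearest-neighbor rule)."""
    _, label = min(store, key=lambda item: math.dist(item[0], point))
    return label


def condensed_nearest_neighbor(points, labels):
    """Return the store bin as a list of (point, label) pairs."""
    store = [(points[0], labels[0])]            # first sample goes to the store
    garbage = list(zip(points[1:], labels[1:]))  # candidates still to be checked
    moved = True
    while moved:  # repeat passes until no sample changes bin
        moved = False
        still_garbage = []
        for point, label in garbage:
            if nearest_label(point, store) == label:
                still_garbage.append((point, label))  # classified correctly
            else:
                store.append((point, label))          # misclassified: keep it
                moved = True
        garbage = still_garbage
    return store


# Toy 1-D data: four majority points near 0, two minority points near 10.
points = [(0,), (1,), (2,), (3,), (10,), (11,)]
labels = [0, 0, 0, 0, 1, 1]
store = condensed_nearest_neighbor(points, labels)
```

On this toy data the store ends up holding one sample per class, since those two samples are enough to classify all the others correctly.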
12. Resampling - Over Sampling
SMOTE
Synthetic Minority Over-sampling
TEchnique, which over-samples the
minority class by generating synthetic
examples. It takes one minority
sample and one of its nearest neighbors,
calculates the difference between the
two, multiplies it by a random
number between 0 and 1, and adds the
result to the original sample. This new
sample is then added to the feature
space.
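The interpolation step can be sketched as follows. This is a simplified pure-Python illustration, not a library implementation: the function name `smote`, its parameters, and the toy points are assumptions, and neighbors are found by brute-force Euclidean distance among minority points only.

```python
import math
import random


def smote(minority, n_new, k, rng):
    """Create `n_new` synthetic points from a list of minority-class points."""
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # k nearest minority neighbors of `base`, excluding `base` itself
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbors = sorted(others, key=lambda p: math.dist(p, base))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # random number in [0, 1)
        # new point = base + gap * (neighbor - base), coordinate by coordinate
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=10, k=2, rng=random.Random(0))
```

Each synthetic point lies on the line segment between a minority sample and one of its neighbors, so with these toy points every new sample falls on an edge of the unit square.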
31. Other experiments
Changing the TF-IDF setting from 3 words to a single word
Using CountVectorizer
The results stay fairly the same
32. Conclusion
● Resampling works
○ Need enough training data
○ Choose suitable resampler
● Bad resampling can degrade your results
33. Future study
● Combining different resampling methods
● Expanding the dataset
● Different types of data
● Using natural language processing instead of TF-IDF