Topic: Synthetic minority over-sampling technique (SMOTE)
Machine learning, imbalanced data sets
Presented by Hector Franco, TCD
Basic concepts
1. Introduction
2. Recent developments
3. Algorithms description
4. Evaluation
5. Discussion
Basic concepts
• Multi-class problems are imbalanced when we compare one class against all the others.
• In some cases the data set is too small to generalize well.
• Text classification is an example of imbalanced data.
• It can be used with tree kernels.
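The one-against-all point can be illustrated with a few lines of Python. This is a minimal sketch; the function name `one_vs_all_counts` and the toy labels are made up for illustration and are not from the slides.

```python
from collections import Counter


def one_vs_all_counts(labels):
    """For each class, return (positives, negatives) when that class
    is compared against all the other classes together."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: (n, total - n) for cls, n in counts.items()}


# A perfectly balanced four-class data set (hypothetical labels).
labels = ["a"] * 30 + ["b"] * 30 + ["c"] * 30 + ["d"] * 30
print(one_vs_all_counts(labels)["a"])
```

Even though every class has 30 examples, each one-against-all sub-problem has 30 positives against 90 negatives.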

Effect of SMOTE and DEC (SDC)
[Figure: two panels, "After DEC alone" and "After SMOTE and DEC"; legend: majority sample, minority sample, synthetic sample]
Introduction
• By convention, the class with fewer examples is called the minority or positive class.
Recent developments in learning from imbalanced data sets
• Between-class imbalance (the focus here).
• Within-class imbalance.
• It is important in text classification.
• We focus on the minority class: we want high prediction accuracy for the minority class.
• A multi-class problem can be treated as a two-class problem (one against all).

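• [Formula not recoverable from the slide]: not very good on unbalanced data.
• [Formula not recoverable from the slide]: popular evaluation measure for the imbalance problem. Usually β = 1, and it is 1 in this paper.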
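The formulas themselves are not in the text. Assuming the two measures are overall accuracy and the F-value with parameter β (an assumption, not something the transcript confirms), the contrast can be sketched as follows. The function names and the toy labels are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the minority label."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of examples whose prediction matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f_value(y_true, y_pred, beta=1.0, positive=1):
    """Standard F-beta score; with beta = 1 it is the harmonic mean
    of precision and recall on the positive (minority) class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical data: 5 minority examples, 95 majority examples,
# and a classifier that always predicts the majority class.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
print(accuracy(y_true, y_pred), f_value(y_true, y_pred))
```

The always-majority classifier reaches an accuracy of 0.95 while its F-value on the minority class is 0, which is why accuracy alone is a poor guide on unbalanced data.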
AUC: area under the ROC curve, which plots TP rate (vertical axis) against FP rate (horizontal axis).
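The area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal sketch using that equivalence is shown here; the function name `auc` and the scores are illustrative assumptions.

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties contribute one half."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical classifier scores for two positives and two negatives.
print(auc([0.9, 0.4], [0.5, 0.1]))
```

Because it only depends on how positives rank against negatives, the value does not change when the majority class is made larger with examples of similar scores.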
• Data level: change the distribution
    ◦ Make the data balanced
• Modify the existing data mining algorithms
    ◦ Make new algorithms
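Changing the distribution at the data level can be as simple as resampling. The sketch below balances two classes by random over-sampling or random under-sampling; the function names and the toy data are assumptions made for illustration.

```python
import random


def random_oversample(minority, majority, rng):
    """Duplicate randomly chosen minority examples until both classes
    have the same size. Returns (minority, majority)."""
    n_extra = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(minority) + extra, list(majority)


def random_undersample(minority, majority, rng):
    """Keep a random subset of the majority class that is as large as
    the minority class. Discarded examples may carry useful information."""
    return list(minority), rng.sample(majority, len(minority))


rng = random.Random(0)  # fixed seed so the run is repeatable
minority = [("min", i) for i in range(3)]
majority = [("maj", i) for i in range(10)]
over_min, over_maj = random_oversample(minority, majority, rng)
under_min, under_maj = random_undersample(minority, majority, rng)
print(len(over_min), len(over_maj), len(under_min), len(under_maj))
```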
• Random over-sampling: duplicate minority examples.
• Random under-sampling (can remove important data).
• Remove noise.
• SMOTE.
• Combine under-sampling and over-sampling.
• Find the hard examples and over-sample them.
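SMOTE creates new minority examples by interpolating between a minority example and one of its nearest minority neighbours. The following is a simplified sketch of that idea, not the reference implementation: it picks base points at random instead of looping over every minority example, and the names `smote`, `k_nearest` and `n_new` are assumptions.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point`, excluding `point` itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: euclidean(point, c))[:k]


def smote(minority, n_new, k, rng):
    """Generate `n_new` synthetic minority points by interpolation.
    Needs at least two minority points."""
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        q = rng.choice(k_nearest(p, minority, k))
        gap = rng.random()  # random number in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(p, q)))
    return synthetic


rng = random.Random(1)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
for point in smote(minority, 3, 2, rng):
    print(point)
```

Every synthetic point lies on the segment between two existing minority points, so the new examples fill in the minority region instead of repeating existing points as random over-sampling does.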
• AdaBoost (increases the weights of misclassified examples) does not perform well on imbalanced data sets.
• Improved weight updates for TP & FP do better than prediction weights based on TP & FP.
• Use an SVM kernel.
• Use a BMPM (Biased Minimax Probability Machine).
• There are other cost-based learning methods…
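The weight increase for misclassified examples can be made concrete with one round of the classic AdaBoost update. This is a sketch of the standard algorithm under the assumption that the weights sum to 1 and the weighted error is strictly between 0 and 1; the function name `adaboost_reweight` is made up here.

```python
import math


def adaboost_reweight(weights, correct):
    """One AdaBoost round. `weights` sum to 1 and `correct[i]` says whether
    example i was classified correctly. Returns (alpha, new_weights)."""
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Correct examples are scaled down, misclassified ones are scaled up.
    raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(raw)
    return alpha, [r / total for r in raw]


weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
alpha, new_weights = adaboost_reweight(weights, correct)
print(alpha, new_weights)
```

The update treats every misclassified example the same way regardless of its class, which is one reading of why plain AdaBoost gives the minority class no special attention.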

A new over-sampling method: Borderline-SMOTE
• Algorithms usually try to learn the borderline as exactly as possible.
• Borderline-SMOTE1
• Borderline-SMOTE2

• Borderline-SMOTE2 also generates synthetic examples using majority-class neighbours.
• For those neighbours the random numbers are between 0 and 0.5, so the synthetic examples are closer to the minority class.
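Both variants first pick out the borderline (DANGER) minority examples. As the method is usually described, a minority example is noise if all of its m nearest neighbours belong to the majority class, DANGER if at least half but not all of them do, and safe otherwise. The sketch below follows that description; the slides do not spell the rule out, so treat it as an assumption, and the function name and toy data are illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_minority(minority, majority, m):
    """Label each minority point 'noise', 'danger' or 'safe' by counting
    majority points among its m nearest neighbours in the whole data set."""
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    labels = []
    for i, p in enumerate(minority):
        # Minority points come first in `labelled`, so index i is p itself.
        others = [item for j, item in enumerate(labelled) if j != i]
        others.sort(key=lambda item: euclidean(p, item[0]))
        n_major = sum(1 for _, y in others[:m] if y == 0)
        if n_major == m:
            labels.append("noise")
        elif 2 * n_major >= m:
            labels.append("danger")
        else:
            labels.append("safe")
    return labels


# Hypothetical one-dimensional data.
minority = [(0.0,), (1.0,), (2.0,), (4.0,), (5.0,), (9.0,)]
majority = [(6.0,), (7.0,), (8.0,), (10.0,), (11.0,)]
print(classify_minority(minority, majority, 3))
```

Only the points labelled `danger` would then be used as base points for SMOTE-style interpolation.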
Experiments
• Nothing: baseline.
• SMOTE
• Random over-sampling
• Borderline-SMOTE1
• Borderline-SMOTE2

• K = 5
• 10-fold cross-validation.
• C4.5 classifier
• We only want to improve the prediction of the minority class.
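The 10-fold cross-validation protocol can be sketched as an index splitter. The function name `k_fold_indices` is an assumption, and the slides do not say how the folds were drawn or where the over-sampling was applied. A common practice is to over-sample only the training part of each fold so that the test part contains real examples only.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds over range(n).
    Fold sizes differ by at most one."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


# Hypothetical data set of 25 examples split into 10 folds.
for train, test in k_fold_indices(25, 10):
    print(len(train), test)
```

In practice the examples would be shuffled first, and preferably stratified by class so that each fold keeps some minority examples.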
Conclusion
• Working with imbalanced data sets is a common problem.
• Borderline examples are more easily misclassified.
• Our methods are better than traditional SMOTE.
• Open to research:
    ◦ How to define DANGER examples.
    ◦ How to determine the number of examples in DANGER.
    ◦ Combination with data mining algorithms.
You are free:
•to copy, distribute, display, and perform the work
•to make derivative works

Under the following conditions:
•Attribution. You must give the original author credit.

•Non-Commercial. You may not use this work for commercial purposes.
•For any reuse or distribution, you must make clear to others the licence terms of this
work.
•Any of these conditions can be waived if you get permission from the copyright holder.
•Nothing in this license impairs or restricts the author's moral rights.

Borderline Smote

  • 1. Topic: Machine learning, imbalanced data sets. Synthetic minority over-sampling technique (SMOTE). Presented by Hector Franco, TCD.
  • 2. Basic concepts: 1. Introduction. 2. Recent developments. 3. Algorithm description. 4. Evaluation. 5. Discussion.
  • 3. 0
  • 4. Multi-class problems are imbalanced when we compare one class against all the others. In some cases the data set is too small to generalize well. Text classification is an example of imbalanced data. It can be used with tree kernels.
  • 5. Effect of SMOTE and DEC (SDC). Figure panels: after DEC alone; after SMOTE and DEC.
  • 6. Legend: majority sample; minority sample; synthetic sample.
  • 7.
  • 9. By convention, the class with fewer examples is called the minority (or positive) class.
  • 10. The recent developments in imbalanced data sets learning
  • 11. Between-class imbalance (our focus). Within-class imbalance. It is important in text classification. We focus on the minority class: we want high prediction accuracy for the minority class. Two-class problem = multi-class problem.
  • 12. [Formula not recovered] Not very good on unbalanced data. [Formula not recovered] Popular evaluation measure for imbalanced problems. Usually β = 1, and β = 1 in this paper.
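
The slide above contrasts a measure that is not very good on unbalanced data with a popular alternative whose parameter is set to 1. The formulas did not survive in this transcript, so the sketch below assumes they are overall accuracy and the F-value; the function names and the counts are hypothetical, chosen only for illustration.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all examples that are classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def f_value(tp, fp, fn, beta=1.0):
    """F-value of the minority (positive) class.

    With beta = 1 this reduces to 2 * precision * recall / (precision + recall).
    """
    if tp == 0:
        return 0.0  # no minority example was found; the F-value is taken as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical data set: 990 majority and 10 minority examples.
# A classifier that labels everything "majority" has tp=0, fp=0, tn=990, fn=10.
always_majority_acc = accuracy(tp=0, fp=0, tn=990, fn=10)
always_majority_f = f_value(tp=0, fp=0, fn=10)
print(always_majority_acc, always_majority_f)
```

Under these assumptions the trivial classifier scores high on accuracy while its F-value on the minority class is zero, which is the weakness the slide points at.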
  • 13. AUC: area under the ROC curve (TP rate plotted against FP rate).
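
As a rough illustration of the AUC slide, the following sketch builds ROC points (FP rate, TP rate) by sweeping a decision threshold over classifier scores and integrates them with the trapezoid rule. The helper names `roc_points` and `auc`, and the scores and labels, are hypothetical; label 1 is taken to mean the minority (positive) class.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) points, lowering the threshold step by step."""
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        # Examples with tied scores cross the threshold together.
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if labels[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores for three positive and three negative examples.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(auc(roc_points(scores, labels)))
```

In this toy case eight of the nine positive/negative pairs are ranked correctly, so the area comes out at 8/9.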
  • 14. Data level: change the distribution (make the data balanced). Modify the existing data mining algorithms (make new algorithms).
  • 15. Random over-sampling: duplicates examples. Random under-sampling: can remove important data. Remove noise. SMOTE. Combine under-sampling and over-sampling. Find the hard examples and over-sample them.
  • 16. AdaBoost (increases the weights of misclassified examples) does not perform well on imbalanced data sets. Improved: updated weights of TP & FP, better than weights of prediction based on TP & FP. Use a kernel of SVM. Use a BMPM (Biased Minimax Probability Machine). There are other cost-based learning methods…
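
The data-level methods listed on this slide can be sketched in a few lines. This is a simplified illustration, not the presenter's code: it assumes examples are tuples of floats, that the minority class has at least two distinct examples, and that SMOTE places each synthetic example on the line segment between a minority example and one of its k nearest minority neighbors. All function names are hypothetical, and seed examples are drawn at random rather than visited one by one.

```python
import random


def random_oversample(minority, n_extra, rng):
    """Duplicate randomly chosen minority examples."""
    return minority + [rng.choice(minority) for _ in range(n_extra)]


def random_undersample(majority, n_keep, rng):
    """Keep a random subset of the majority class (may drop important data)."""
    return rng.sample(majority, n_keep)


def nearest_neighbors(x, pool, k):
    """The k points of pool closest to x (squared Euclidean), skipping x itself."""
    others = [p for p in pool if p is not x]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
    return others[:k]


def smote(minority, n_synthetic, k, rng):
    """Create synthetic minority examples by interpolating toward a neighbor."""
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        neighbor = rng.choice(nearest_neighbors(x, minority, k))
        gap = rng.random()  # random number in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbor)))
    return synthetic


rng = random.Random(0)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority = [(float(i), 5.0) for i in range(20)]
print(len(random_oversample(minority, 16, rng)))   # 20 minority examples
print(len(random_undersample(majority, 4, rng)))   # 4 majority examples
print(smote(minority, 3, k=2, rng=rng))
```

Unlike duplication, the interpolated points are new locations inside the region spanned by the minority class, which is the motivation for SMOTE over plain random over-sampling.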
  • 17. A new Over-Sampling Method: Borderline-SMOTE.
  • 18. Algorithms usually try to learn the borderline as exactly as possible.
  • 19. Borderline-SMOTE1. Borderline-SMOTE2.
  • 20.
  • 21.
  • 22. Also uses majority-class neighbors when over-sampling. The random numbers are between 0 and 0.5, so the synthetic examples are closer to the minority examples.
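
The algorithm slides themselves did not survive in this transcript, so the following is only a sketch under stated assumptions. It assumes a minority example is in DANGER when at least half, but not all, of its m nearest neighbors in the whole training set belong to the majority class, and that synthetic examples are generated only from DANGER examples. Variant 2 additionally interpolates toward the nearest majority neighbor with a random number between 0 and 0.5, as the slide above states. The names `danger_set`, `borderline_smote`, `m`, `k`, and `s` are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_nearest(x, pool, k):
    """The k points of pool closest to x, skipping x itself (by identity)."""
    return sorted((p for p in pool if p is not x), key=lambda p: sq_dist(x, p))[:k]


def danger_set(minority, majority, m):
    """Minority examples with m/2 <= (majority count among m neighbors) < m."""
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    danger = []
    for x in minority:
        neighbors = sorted((e for e in labelled if e[0] is not x),
                           key=lambda e: sq_dist(x, e[0]))[:m]
        n_major = sum(1 for _, label in neighbors if label == 0)
        if m / 2 <= n_major < m:
            danger.append(x)
    return danger


def interpolate(x, target, gap):
    return tuple(a + gap * (b - a) for a, b in zip(x, target))


def borderline_smote(minority, majority, m, k, s, rng, variant=1):
    """Generate synthetic minority examples from DANGER examples only."""
    synthetic = []
    for x in danger_set(minority, majority, m):
        neighbors = k_nearest(x, minority, k)
        for nb in rng.sample(neighbors, min(s, len(neighbors))):
            synthetic.append(interpolate(x, nb, rng.random()))
        if variant == 2:
            # Stay in the half of the segment nearer to the minority example.
            nearest_major = k_nearest(x, majority, 1)[0]
            synthetic.append(interpolate(x, nearest_major, rng.uniform(0.0, 0.5)))
    return synthetic


# Hypothetical 2-D data: a safe minority cluster and two minority points
# sitting next to the majority class.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 0.0), (1.1, 0.0)]
majority = [(1.2, 0.0), (1.05, 0.1), (2.0, 0.0), (2.1, 0.0), (2.0, 0.1)]
print(danger_set(minority, majority, m=3))
print(borderline_smote(minority, majority, m=3, k=2, s=1, rng=random.Random(0)))
```

With this toy data only the two minority points next to the majority class end up in DANGER, so the safe cluster produces no synthetic examples.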
  • 23.
  • 24.
  • 25.
  • 27.
  • 28. Nothing (baseline); SMOTE; random over-sampling; Borderline-SMOTE1; Borderline-SMOTE2. K = 5. 10-fold cross-validation. C4.5 classifier. We only want to improve the prediction of the minority class.
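
The 10-fold cross-validation mentioned on this slide can be sketched as a plain index split. This is a generic illustration, not the presenter's experimental code; the function name `k_fold_indices` and the data set size are hypothetical, and the classifier and over-sampling steps are left out.

```python
import random


def k_fold_indices(n_examples, n_folds, rng):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_examples))
    rng.shuffle(indices)
    # Fold i takes every n_folds-th shuffled index, starting at position i.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


# Hypothetical data set of 25 examples split into 10 folds.
for train, test in k_fold_indices(25, 10, random.Random(0)):
    print(len(train), len(test))
```

Each example lands in exactly one test fold, so every example is used for testing once and for training in the other nine rounds.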
  • 29.
  • 30.
  • 31.
  • 32.
  • 34. Working with imbalanced data sets is a common problem. Borderline examples are more easily misclassified. Our methods are better than traditional SMOTE. Open to research: ◦ how to define DANGER examples; ◦ how to determine the number of examples in DANGER; ◦ how to combine with data mining algorithms.
  • 35.
  • 36. You are free: • to copy, distribute, display, and perform the work; • to make derivative works. Under the following conditions: • Attribution. You must give the original author credit. What does "Attribute this work" mean? The page you came from contained embedded licensing metadata, including how the creator wishes to be attributed for re-use. You can use the HTML here to cite the work. Doing so will also include metadata on your page so that others can find the original work as well. • Non-Commercial. You may not use this work for commercial purposes. • For any reuse or distribution, you must make clear to others the licence terms of this work. • Any of these conditions can be waived if you get permission from the copyright holder. • Nothing in this license impairs or restricts the author's moral rights.