2018-08-15, Berlin Lyubomir Danov, RatePAY
HIGHLY IMBALANCED DATA CLASSIFICATION PROBLEMS FOR
ESTIMATING EXPOSURE TO FRAUD RISK
a.k.a. Imbalanced learning 101
2 | PyData Berlin August meetup | RatePAY | 2018
CONTENT
• RatePAY, Regulation and Finance
• ML classification problem setup
• Pitfalls of imbalanced data
• Some methodologies to counter imbalanced data
• Performance on example dataset
FOREWORD: TERMINOLOGY
• Exposure – fancy finance word for potential monetary loss
• Positive – in binary classification problems with imbalanced data, the minority class
• Any unknown terminology – ask away
FOREWORD: RATEPAY
• Founded 2009
• Payment (no Service) Provider for deferred payments
• Web shop checkout oriented
PSP <> DATA SCIENCE <> CODING
• Required to measure risk exposure
• Simulate and generate report
• Risk exposure
• E.g. Expected loss = $\text{Probability}_{\text{event}} \times \text{Loss}_{\text{event}}$
• Value at Risk
WHAT IS VALUE AT RISK
Source: federalreserve.gov
FRAUD VALUE AT RISK
Simulation steps:
• First level
    • Fraud_default1 = Rand(0,1) <= Probability
• Second level
    • Fraud_default2 = max(Fraud_default1, Cluster_default_rate)
    • Cluster_default_rate = $\left( \frac{\sum \text{Fraud\_default1}_{NN}}{N_{NN}} > \text{threshold} \right)$
• $\text{Expected loss}_z = \sum \text{open amounts}_{\text{Fraud\_default2} = \text{True}}$
• fVaR = quantile(Confidence, Expected losses)
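The simulation steps above can be sketched in plain Python. This is a minimal illustration under assumptions, not RatePAY's implementation: the order schema (`amount`, `probability`, `cluster`), the threshold, the number of runs and the nearest-rank quantile are all hypothetical choices, and the NN grouping from the slide is stood in for by a precomputed cluster label. The second level is read here as "every order in a cluster defaults once the cluster's simulated default rate exceeds the threshold".

```python
import math
import random

def simulate_fvar(orders, threshold, confidence=0.99, n_runs=1000, seed=0):
    """Monte Carlo sketch of fraud Value at Risk (hypothetical schema and defaults)."""
    rng = random.Random(seed)
    # Group order indices by cluster label (assumed stand-in for the NN grouping).
    clusters = {}
    for idx, order in enumerate(orders):
        clusters.setdefault(order["cluster"], []).append(idx)

    losses = []
    for _ in range(n_runs):
        # First level: independent draw per order against its fraud probability.
        default1 = [rng.random() <= order["probability"] for order in orders]
        # Second level: flag the whole cluster if its default rate exceeds the threshold.
        default2 = list(default1)
        for members in clusters.values():
            rate = sum(default1[i] for i in members) / len(members)
            if rate > threshold:
                for i in members:
                    default2[i] = True
        # Loss of this run: sum of open amounts of all defaulted orders.
        losses.append(sum(o["amount"] for o, d in zip(orders, default2) if d))

    # fVaR: empirical quantile of the simulated losses (nearest-rank method).
    losses.sort()
    rank = max(1, math.ceil(confidence * n_runs))
    return losses[rank - 1]

# Made-up example portfolio.
example_orders = [
    {"amount": 120.0, "probability": 0.02, "cluster": "shop_a"},
    {"amount": 80.0, "probability": 0.01, "cluster": "shop_a"},
    {"amount": 300.0, "probability": 0.05, "cluster": "shop_b"},
]
example_fvar = simulate_fvar(example_orders, threshold=0.4)
```

Fixing the seed keeps the run reproducible; in practice the number of runs would need to be large enough for the chosen confidence level to be estimated stably.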
Minimum viable product
MVP
• Haves: amounts, categorical variables for NN
• Have-nots: probability of fraud, underlying theory explaining why
ML MODEL
• Models generate probabilities, right?
ML MODEL
ML MODEL
• Scale your non-probabilistic output (tree-based methods)
• Softmax function in classifier NN
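As an illustration of the softmax bullet, the function below maps raw classifier scores to positive values that sum to one. It is a minimal sketch with made-up scores; note that outputs summing to one are not automatically well-calibrated probabilities.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into values in (0, 1) that sum to one."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs for the classes [non-fraud, fraud].
probs = softmax([2.0, 0.5])
```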
BACK TO ML MODEL
• MVP – what performance measure to use?
    a) Accuracy?
    b) AUC?
    c) Recall (true positive rate) = TP/(TP+FN)? Precision = TP/(TP+FP)?
        • What is positive, what is negative
    d) Business value – e.g. expected loss?
CAREFUL WITH THE PERFORMANCE MEASURE
• Accuracy performance on imbalanced dataset
Class        Share      Accuracy on test
0_nonfraud   99.82%     99.82%
1_fraud      0.18%      0.00%
Total        100.00%    99.82%

Dataset: kaggle.com/mlg-ulb/creditcardfraud
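The effect in the table can be reproduced with a few lines. The sketch below assumes a hypothetical test set of 10,000 transactions with 18 frauds (0.18%) and a model that labels everything as non-fraud; the counts are made up to match the class shares, not taken from the talk.

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, recall and precision from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Recall and precision are undefined when their denominator is zero.
        "recall": tp / (tp + fn) if (tp + fn) else None,
        "precision": tp / (tp + fp) if (tp + fp) else None,
    }

# "Always predict non-fraud": no positives are ever predicted.
majority_only = metrics(tp=0, fp=0, tn=9982, fn=18)
```

Accuracy comes out at 99.82% although not a single fraud case is found (recall is zero), which is why accuracy alone is a misleading measure here.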
TAXONOMY OF IMBALANCED LEARNING
Data-level       Algorithm-level
Undersampling    Cost-based learning
Oversampling     Ensembles or meta-algorithms
Combined
SELECTED APPROACHES
Data-level (implementation used)    Oversampling    Undersampling
Random (imbalanced-learn)           X               X
SMOTE (own implem.)                 Synthetic       -
SMOTE+ENN (imbalanced-learn)        Synthetic       Filter
ADASYN (imbalanced-learn)           Synthetic       -
Safe-Level-SMOTE (own implem.)      Synthetic       -
SMOTE+IPF (own implem.)             Synthetic       Filter
LOGIC BEHIND METHODS
Source: Fig. 1, Fernandez et al. (2018), “SMOTE […] 15-year Anniversary”
Data-level methods: Random, SMOTE, SMOTE+ENN, ADASYN, Safe-Level-SMOTE, SMOTE+IPF
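The core idea behind SMOTE, which most of the listed methods build on, is to create a synthetic minority point on the line segment between a minority sample and one of its k nearest minority neighbours. The sketch below is a simplified illustration, not the speaker's own implementation: it picks base samples at random, uses squared Euclidean distance, and the function name and example points are hypothetical.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)

    def dist2(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: dist2(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Interpolate: base + u * (neighbour - base), with u drawn from [0, 1).
        u = rng.random()
        synthetic.append([b + u * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Made-up minority-class points with two features.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority_points, n_new=20, k=2, seed=1)
```

Variants such as SMOTE+ENN or SMOTE+IPF add a filtering step after this generation step, which is not shown here.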
EXPERIMENTAL SETUP
• Data: kaggle.com/mlg-ulb/creditcardfraud
• H2O gradient boosted classification trees
• Separate (not sampled) validation set
• Stopping metric: mean(tpr, tnr)
• Oversampling: target share of the minority class raised from 0.18% to 8.00%
    • subject to methodology
• kNN: k = 5
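How many synthetic minority rows a target share implies follows from share = minority / (minority + majority). The sketch below works this out with exact fractions; the training-set counts are made up (100,000 rows, 180 of them fraud, matching a 0.18% share) and the function name is hypothetical.

```python
import math
from fractions import Fraction

def n_synthetic(n_minority, n_majority, target_share):
    """Number of minority rows to add so that the minority share reaches target_share."""
    share = Fraction(target_share)  # e.g. "0.08" -> 2/25, avoids float rounding
    # Solve m / (m + n_majority) = share for the required minority count m.
    required = math.ceil(share * n_majority / (1 - share))
    return max(0, required - n_minority)

# Hypothetical counts: 180 fraud rows among 100,000 training rows.
to_generate = n_synthetic(n_minority=180, n_majority=99_820, target_share="0.08")
```

With these made-up counts, 8,500 synthetic rows bring the minority class to 8,680 of 108,500 rows, i.e. exactly 8%. Only the training data would be resampled this way; the validation set stays untouched, as stated in the setup.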
PERFORMANCE ON TEST
Threshold and performance at the highest tnr among the thresholds that reach the maximum tpr (tnr = max(tnr), s.t. tpr == max(tpr)).

dataset     f1       f2       tpr   tnr     mean_tr   min_tr   threshold
ada         0.213    0.403    1     0.987   0.993     0.987    0.449
ros         0.0193   0.0469   1     0.821   0.911     0.821    0.309
rus         0.0177   0.0431   1     0.805   0.902     0.805    0.347
sl_smote    0.0514   0.119    1     0.935   0.968     0.935    0.0397
smote       0.476    0.694    1     0.996   0.998     0.996    0.165
smote_enn   0.29     0.505    1     0.991   0.996     0.991    0.526
smote_ipf   0.541    0.746    1     0.997   0.999     0.997    0.142
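The threshold rule and the F-scores in the table can be sketched as follows. This is an illustration under assumptions, not the evaluation code used for the talk: a score at or above the threshold counts as a fraud prediction, candidate thresholds are the observed scores, and the labels and scores in the example are made up.

```python
def confusion(y_true, scores, threshold):
    """Confusion counts when scores >= threshold are predicted positive (fraud = 1)."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    return tp, fp, tn, fn

def f_beta(tp, fp, fn, beta):
    """F-beta score; beta = 1 gives f1, beta = 2 weights recall more heavily (f2)."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

def pick_threshold(y_true, scores):
    """Maximise tnr among the thresholds that reach the maximum tpr."""
    best = None
    for t in sorted(set(scores)):
        tp, fp, tn, fn = confusion(y_true, scores, t)
        key = (tp / (tp + fn), tn / (tn + fp))  # compare tpr first, then tnr
        if best is None or key > best[0]:
            best = (key, t)
    (tpr, tnr), threshold = best
    return threshold, tpr, tnr

# Made-up labels (1 = fraud) and model scores; both classes must be present.
labels = [0, 0, 0, 0, 1, 1]
model_scores = [0.1, 0.2, 0.4, 0.7, 0.35, 0.9]
threshold, tpr, tnr = pick_threshold(labels, model_scores)
tp, fp, tn, fn = confusion(labels, model_scores, threshold)
f1, f2 = f_beta(tp, fp, fn, beta=1), f_beta(tp, fp, fn, beta=2)
```

The table columns mean_tr and min_tr then correspond to `(tpr + tnr) / 2` and `min(tpr, tnr)`.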
WHAT TO OPTIMISE ON
• Distance used
• Target ratio
• k in NN
• Cost function
=> hyper-parameter optimisation options?
PAPERS
• Fernandez et al. (2018) “SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary”
• Chawla et al. (2002) “SMOTE […]”
• Batista et al. (2004) “A study of the behavior of several methods for balancing machine learning training data”
• He et al. (2008) “ADASYN […]”
• Bunkhumpornpat et al. (2009) “Safe-Level-SMOTE […]”
• Saez et al. (2015) “SMOTE-IPF […]”
RatePAY GmbH | Franklinstraße 28-29 | 10587 Berlin | www.ratepay.com
THANK YOU FOR YOUR PATIENCE :)
