3. CONTENT
RatePAY, Regulation and Finance
ML classification problem setup
Pitfalls of imbalanced data
Some methodologies to counter imbalanced data
Performance on example dataset
3 | PyData Berlin August meetup | RatePAY | 2018
4. FOREWORD: TERMINOLOGY
Exposure – fancy finance word for potential monetary loss
Positive – the minority class in binary classification problems with imbalanced data
Any unknown terminology – ask away
5. FOREWORD: RATEPAY
Founded 2009
Payment (no Service) Provider for deferred payments
Web shop checkout oriented
6. PSP <> DATA SCIENCE <> CODING
Required to measure risk exposure
Simulate and generate report
Risk exposure
E.g. $\text{Expected loss} = \text{Probability}_{\text{event}} \times \text{Loss}_{\text{event}}$
Value at Risk
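A minimal sketch of the expected-loss formula applied to a small portfolio. The invoice probabilities and amounts are made-up illustration values, not RatePAY data, and `expected_loss` is a hypothetical helper name.

```python
# Hypothetical open invoices: (probability of the loss event, amount at stake).
invoices = [(0.01, 200.0), (0.05, 80.0), (0.002, 1500.0)]


def expected_loss(probability: float, loss: float) -> float:
    """Expected loss of a single event: probability times loss."""
    return probability * loss


# Portfolio expected loss is the sum of the per-invoice expected losses.
portfolio_expected_loss = sum(expected_loss(p, amount) for p, amount in invoices)
print(round(portfolio_expected_loss, 2))
```

Expected loss is an average; it says nothing about how bad a rare outcome can get, which is what Value at Risk addresses.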
7. WHAT IS VALUE AT RISK
Source: federalreserve.gov
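For reference, the usual textbook definition (not taken from the slide's figure): Value at Risk at confidence level $\alpha$ is the $\alpha$-quantile of the loss distribution. The symbols $L$ (loss) and $\alpha$ (confidence level) are notation assumed here.

```latex
\mathrm{VaR}_{\alpha}(L) = \inf \{\, \ell \in \mathbb{R} : P(L \le \ell) \ge \alpha \,\}
```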
8. FRAUD VALUE AT RISK
Simulation steps:
First level
Fraud_default1 = Rand(0,1) <= Probability
Second level
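Fraud_default2 = max(Fraud_default1, Cluster_default_rate)
$\text{Cluster\_default\_rate} = \left( \frac{\sum \text{Fraud\_default1}_{NN}}{NN} > \text{threshold} \right)$
$\text{Expected loss}_z = \sum \text{open amounts}_{\text{Fraud\_default2} == \text{True}}$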
fVaR = quantile (Confidence, Expected losses)
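The simulation steps above can be sketched in NumPy as follows. This is an illustration under assumptions, not the speaker's implementation: the NN groups are represented by a plain `cluster` label per invoice, the second level is read as "if the share of first-level defaults in a cluster exceeds the threshold, the whole cluster defaults", and the function name and parameter defaults are hypothetical.

```python
import numpy as np


def simulate_fvar(prob, amount, cluster, threshold=0.5, confidence=0.99,
                  n_sims=1000, seed=0):
    """Monte Carlo sketch of fraud Value at Risk (fVaR).

    prob, amount and cluster are per-invoice arrays: fraud probability,
    open amount, and an (assumed) cluster label standing in for the NN group.
    """
    rng = np.random.default_rng(seed)
    prob = np.asarray(prob, dtype=float)
    amount = np.asarray(amount, dtype=float)
    # Map arbitrary cluster labels to 0..n_clusters-1.
    _, cluster_idx = np.unique(np.asarray(cluster), return_inverse=True)
    cluster_size = np.bincount(cluster_idx)

    losses = np.empty(n_sims)
    for z in range(n_sims):
        # First level: independent draw per invoice.
        default1 = rng.random(prob.size) <= prob
        # Share of first-level defaults within each cluster.
        rate = np.bincount(cluster_idx, weights=default1.astype(float)) / cluster_size
        # Second level: clusters above the threshold default as a whole.
        cluster_default = (rate > threshold)[cluster_idx]
        default2 = default1 | cluster_default
        # Loss of simulation run z: open amounts of all defaulted invoices.
        losses[z] = amount[default2].sum()

    # fVaR: quantile of the simulated losses at the chosen confidence level.
    return float(np.quantile(losses, confidence)), losses


fvar, simulated_losses = simulate_fvar(
    prob=[0.02, 0.10, 0.01, 0.30],
    amount=[100.0, 50.0, 400.0, 80.0],
    cluster=["a", "a", "b", "b"],
)
```

The cluster step is what makes the tail heavier than under independent defaults: one risky invoice can pull its neighbours into default.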
9. MVP
Minimum viable product
Haves: amounts, categorical variables for NN
Have-nots: probability of fraud, underlying theory explaining why
10. ML MODEL
Models generate probabilities, right?
12. ML MODEL
Scale your non-probabilistic output (tree-based methods)
softmax function in classifier NN
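A minimal sketch of the softmax mentioned above, which turns a row of raw classifier scores into values that are non-negative and sum to one. The example scores are made up. Note that softmax output only looks like a probability; for tree-based scores a separate calibration step (for example scikit-learn's `CalibratedClassifierCV`) is a common choice, named here as a suggestion rather than something the slides prescribe.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    logits = np.asarray(logits, dtype=float)
    # Subtracting the row maximum avoids overflow in exp without changing the result.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical raw scores for two samples and two classes (non-fraud, fraud).
raw_scores = np.array([[2.0, -1.0], [0.0, 0.0]])
probabilities = softmax(raw_scores)
```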
13. BACK TO ML MODEL
MVP – what performance measure to use?
a) Accuracy?
b) AUC?
c) Recall (true positive rate) = TP/(TP+FN)? Precision = TP/(TP+FP)?
What is positive, what is negative?
d) Business value – e.g. expected loss?
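The measures above can be computed directly from confusion-matrix counts. The counts below are invented for illustration (fraud is the positive class), and the F-beta helper is added because F1 and F2 are reported later in the talk.

```python
def recall(tp: int, fn: int) -> float:
    """True positive rate: share of actual positives that were caught."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)


# Hypothetical counts on 10,000 transactions with 18 frauds.
tp, fn, fp, tn = 15, 3, 30, 9952
r = recall(tp, fn)
p = precision(tp, fp)
acc = accuracy(tp, tn, fp, fn)
f1 = f_beta(p, r, beta=1.0)
f2 = f_beta(p, r, beta=2.0)
```

With these counts accuracy is above 99% while precision is only one third, which is why accuracy alone says little on imbalanced data.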
14. CAREFUL WITH THE PERFORMANCE MEASURE
Accuracy performance on imbalanced dataset
Class        Share     Accuracy on test
0_nonfraud   99.82%    99.82%
1_fraud       0.18%     0.00%
Total       100.00%    99.82%
Dataset: kaggle.com/mlg-ulb/creditcardfraud
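The table can be reproduced with a model that never predicts fraud. This sketch uses synthetic labels with the same 0.18% fraud share (18 in 10,000), not the Kaggle dataset itself.

```python
import numpy as np

n_total, n_fraud = 10_000, 18  # 0.18% positives, as in the table above

y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

# A "classifier" that always predicts the majority class (non-fraud).
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
fraud_recall = (y_pred[y_true == 1] == 1).mean()
print(f"accuracy={accuracy:.4f}, recall on fraud={fraud_recall:.4f}")
```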
15. TAXONOMY OF IMBALANCED LEARNING
Data-level       Algorithm-level
Undersampling    Cost-based learning
Oversampling     Ensembles or meta-algorithms
Combined
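As an illustration of the algorithm-level branch, cost-based learning can be as simple as weighting each class in the loss. The sketch below uses the inverse-frequency heuristic `n_samples / (n_classes * class_count)`, which is the same formula scikit-learn applies for `class_weight="balanced"`; the function names are hypothetical and the talk does not state which cost scheme, if any, was used.

```python
import numpy as np


def balanced_class_weights(y):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


def weighted_log_loss(y_true, p_fraud, class_weight):
    """Binary log loss where each sample counts with its class weight."""
    y_true = np.asarray(y_true)
    p = np.clip(np.asarray(p_fraud, dtype=float), 1e-12, 1 - 1e-12)
    w = np.where(y_true == 1, class_weight[1], class_weight[0])
    losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return float(np.sum(w * losses) / np.sum(w))


# Synthetic labels with 18 frauds in 10,000 transactions.
labels = np.array([0] * 9982 + [1] * 18)
weights = balanced_class_weights(labels)
```

With these weights a missed fraud costs several hundred times more than a false alarm, so the two classes contribute equally to the total loss.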
16. SELECTED APPROACHES
Data-level (implementation used)    Oversampling    Undersampling
Random (imbalanced-learn)           X               X
SMOTE (own implem.)                 Synthetic       -
SMOTE+ENN (imbalanced-learn)        Synthetic       Filter
ADASYN (imbalanced-learn)           Synthetic       -
Safe-Level-SMOTE (own implem.)      Synthetic       -
SMOTE+IPF (own implem.)             Synthetic       Filter
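To make "Synthetic" concrete, here is a minimal sketch of the core SMOTE idea: pick a minority sample, pick one of its k nearest minority neighbours, and interpolate between the two. This is not the speaker's own implementation; the function name, the brute-force distance computation, and the defaults are assumptions for illustration.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation.

    X_min holds only minority-class rows and needs at least two of them.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Brute-force squared Euclidean distances between minority samples.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]

    # Random base sample and one of its k neighbours for each new point.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random position on the segment base -> neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


# Hypothetical 2-D minority samples.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(minority, n_new=10, k=2)
```

The filter variants in the table (ENN, IPF) then remove samples judged noisy after oversampling; they are not covered by this sketch.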
21. PERFORMANCE ON TEST
Threshold and performance where tnr = max(tnr), s.t. tpr == max(tpr)
dataset     f1       f2       tpr   tnr     mean_tr   min_tr   threshold
ada         0.213    0.403    1     0.987   0.993     0.987    0.449
ros         0.0193   0.0469   1     0.821   0.911     0.821    0.309
rus         0.0177   0.0431   1     0.805   0.902     0.805    0.347
sl_smote    0.0514   0.119    1     0.935   0.968     0.935    0.0397
smote       0.476    0.694    1     0.996   0.998     0.996    0.165
smote_enn   0.29     0.505    1     0.991   0.996     0.991    0.526
smote_ipf   0.541    0.746    1     0.997   0.999     0.997    0.142
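The threshold rule in the slide title (maximise tnr subject to tpr being at its maximum) can be sketched as below. The sketch assumes a sample is flagged as fraud when its score is at or above the threshold and uses the observed scores as candidate thresholds; both are assumptions, as are the function names and the toy data.

```python
import numpy as np


def rates_at(y_true, scores, threshold):
    """True positive rate and true negative rate at a given threshold."""
    pred = scores >= threshold
    tpr = (pred & y_true).sum() / y_true.sum()
    tnr = (~pred & ~y_true).sum() / (~y_true).sum()
    return tpr, tnr


def pick_threshold(y_true, scores):
    """Among thresholds with maximal tpr, return the one with maximal tnr."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    candidates = np.unique(scores)
    rates = np.array([rates_at(y_true, scores, t) for t in candidates])
    tpr, tnr = rates[:, 0], rates[:, 1]
    at_max_tpr = tpr == tpr.max()
    # Ignore thresholds that lose recall by masking their tnr with -1.
    best = int(np.argmax(np.where(at_max_tpr, tnr, -1.0)))
    return float(candidates[best]), float(tpr[best]), float(tnr[best])


# Toy labels (1 = fraud) and model scores.
y = [0, 0, 0, 0, 1, 1]
s = [0.1, 0.2, 0.4, 0.6, 0.5, 0.9]
threshold, tpr_at, tnr_at = pick_threshold(y, s)
```

In this toy case the chosen threshold equals the lowest score of any fraud sample: anything higher would miss a fraud, anything lower would only add false alarms.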
22. WHAT TO OPTIMISE ON
Distance used
Target ratio
k in NN
Cost function
=> Hyper-parameter optimisation options?
23. PAPERS
Fernandez et al. (2018) “SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary”
Chawla et al. (2002) “SMOTE […]”
Batista et al. (2004) “A study of the behavior of several methods for balancing machine learning training data”
He et al. (2008) “ADASYN […]”
Bunkhumpornpat et al. (2009) “Safe-Level-SMOTE […]”
Saez et al. (2015) “SMOTE-IPF […]”
25. RatePAY GmbH | Franklinstraße 28-29 | 10587 Berlin | www.ratepay.com
THANK YOU FOR YOUR PATIENCE