Because fraud-detection data are inherently imbalanced, area under the ROC curve alone is not a sufficient performance metric for evaluating the classifiers. The Synthetic Minority Oversampling Technique (SMOTE) is adopted to balance the response ratio, after which both the area under the ROC curve and the area under the precision-recall curve improve. Tree-based approaches gave the best predictions.
Fraud Detection for Insurance Claims
Yit Wei Chia - cyitwei@uwaterloo.ca
University of Waterloo, Winter 2019, CS 680
Introduction
• 81% of Canadians think that additional premiums are mainly due to false claims
• Built binary classifiers using Random Forest, SVM with RBF kernel, and Gradient Boosting
• Applied the Synthetic Minority Oversampling Technique (SMOTE) to balance the data
• Achieved a maximum AUC of 94% and AUPRC of 96%
Dataset and Analysis
• Databricks dataset with 1000 claims and
33 features
• Imbalanced data with 247 false claims
• SMOTE using 5-nearest neighbors
• SMOTE-balanced data with 741 observations each for fraudulent & non-fraudulent claims
• 70:30 split for train and test sets
• 10-fold cross validation & OOB samples
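The SMOTE step above can be illustrated with a minimal numpy sketch of its core interpolation idea: each synthetic point is placed on the segment between a minority sample and one of its k nearest minority-class neighbours. The function name and toy data are hypothetical, and this omits details of the full SMOTE algorithm.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by SMOTE-style
    interpolation between each sample and one of its k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]            # k nearest neighbours per sample
    base = rng.integers(0, n, size=n_new)        # random base sample
    nbr = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                 # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

# toy minority class in the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
X_syn = smote_oversample(X_min, n_new=10, k=3, rng=0)
print(X_syn.shape)  # (10, 2)
```

Since each synthetic point is a convex combination of two minority samples, it always lies between existing minority observations.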
Models
• Random Forest

  \hat{f}_{\mathrm{rf}}^{B}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)

• SVM with RBF kernel

  \min_{w,\,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i} \xi_i

  K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)

• Gradient Boosting

  F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm}\, I(x \in R_{jm})

  \gamma_{jm} = \frac{\sum_{x_i \in R_{jm}} \left(y_i - p(x_i)\right)}{\sum_{x_i \in R_{jm}} p(x_i)\left(1 - p(x_i)\right)}
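The gradient-boosting leaf value γ_jm above is a single Newton step on the logistic loss, computed over the samples falling in leaf region R_jm. A small numpy sketch (toy values, names hypothetical):

```python
import numpy as np

def leaf_value(y, p):
    """Newton-step leaf value for gradient boosting with logistic loss:
    sum of residuals (y - p) divided by sum of Hessian terms p(1 - p),
    over the samples in one leaf region R_jm."""
    return np.sum(y - p) / np.sum(p * (1.0 - p))

# toy leaf with three samples: true labels y, current probabilities p
y = np.array([1.0, 0.0, 1.0])
p = np.array([0.6, 0.4, 0.5])
print(round(leaf_value(y, p), 4))  # 0.6849
```

The numerator sums to 0.5 and the denominator to 0.73, so the leaf shifts the log-odds of its samples upward by about 0.685.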
References
[1] Gareth James, Daniela Witten, Trevor Hastie,
Robert Tibshirani: An Introduction to Statistical
Learning (2013)
[2] Leo Breiman: Random Forests (2001)
[3] Jerome H. Friedman: Greedy Function Approximation: A Gradient Boosting Machine (2001)
Future Work
• Neural networks & robust logistic regression built on combinations of powerful classifiers
• Other sampling techniques
Data insights from Learning Algorithms
The variable-importance plot from Random Forest and the relative influence from Gradient Boosting show consistent results. Insured hobbies and incident severity are both important features (with high predictive power) across the learning algorithms.
Adding more trees does not overfit a Random Forest, since the OOB error always converges (Breiman 2001). Feature selection, however, can still overfit the model. Gradient Boosting shows a cutoff threshold: the model overfits if the number of trees is increased beyond the point of minimum validation loss.
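The OOB-convergence claim can be checked empirically by growing a forest incrementally and recording the out-of-bag error as trees are added. A sketch using scikit-learn's warm-start mechanism on synthetic data (the dataset and tree counts are illustrative, not those of the poster):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative check: the OOB error settles as trees are added,
# rather than rising, so more trees do not overfit the forest.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)
oob_errors = []
for n_trees in (25, 50, 100, 200):
    rf.set_params(n_estimators=n_trees)  # warm_start reuses existing trees
    rf.fit(X, y)
    oob_errors.append(1.0 - rf.oob_score_)

print(oob_errors)
```

Plotting `oob_errors` against the tree counts would reproduce the flattening curve the text describes.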
ROC (Sensitivity-Specificity) & Precision-Recall Curves + Statistics
We compare the improvement due to the synthetic oversampling approach. The RBF-kernel SVM and Gradient Boosting are tuned via 10-fold cross-validation on the training set, while the Random Forest is tuned through out-of-bag samples. The data set (before SMOTE) is split into train and test sets using stratified sampling, so that both sets maintain the same response ratio.
After SMOTE we shift our focus from the PR curve back to the ROC curve, since AUC measures true/false positives well once the data are balanced. The SMOTE approach shows a large improvement in all performance metrics.
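The AUC reported throughout has a direct probabilistic reading: it is the chance that a randomly chosen positive (fraudulent) claim receives a higher score than a randomly chosen negative one. A small numpy sketch of that pairwise definition, on toy scores:

```python
import numpy as np

def auc_pairwise(y_true, scores):
    """ROC AUC as the probability that a random positive outranks
    a random negative; tied scores count as half a win."""
    pos = scores[y_true == 1][:, None]   # positive scores as a column
    neg = scores[y_true == 0][None, :]   # negative scores as a row
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

y = np.array([0, 0, 1, 1])
s = np.array([0.1, 0.4, 0.35, 0.8])
print(auc_pairwise(y, s))  # 0.75
```

Here three of the four positive/negative pairs are ranked correctly, giving AUC = 0.75; on balanced (post-SMOTE) data this quantity tracks the true/false-positive trade-off well, which is why the focus shifts back to ROC.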
Discussion:
• Tree-based learning algorithms perform very well on our dataset
• Gradient Boosting achieved the highest AUC & Random Forest the highest AUPRC