SlideShare a Scribd company logo
FraudDetetionforInsuranceClaims
Yit Wei Chia - cyitwei@uwaterloo.ca
University of Waterloo, Winter 2019, CS 680
Introduction
• 81% of Canadians think that the addi-
tional premium mainly due to false claims
• Built binary classifiers using Random
Forest, SVM with RBF kernel and Gra-
dient Boosting
• Developed Synthetic Minority Over-
sampling Technique (SMOTE) approach
to balance data
• Achieved max AUC and AUPRC of 94%
and 96%
Dataset and Analysis
• Databricks dataset with 1000 claims and
33 features
• Imbalanced data with 247 false claims
• SMOTE using 5-nearest neighbors
• SMOTE - Balanced data with 741 obs on
fraudulent & non-fraudulent claims
• 70:30 split for train and test sets
• 10-fold cross validation & OOB samples
Models
• Random Forest
ˆfB
rf =
1
B
B
b=1
fb(x)
• SVM with RBF kernel
min
w, ξ
1
2
w
2
+ C
∀ i
ξi
K(x, z) = exp − x−z 2
2σ2
• Gradient Boosting
Fm(x) = Fm−1(x) +
Jm
j=1
γjmI(x ∈ Rjm)
γm =
xi∈Rjm
(yi − p(xi))
xi∈Rjm
p(xi) (1 − p(xi))
References
[1] Gareth James, Daniela Witten, Trevor Hastie,
Robert Tibshirani: An Introduction to Statistical
Learning (2013)
[2] Leo Breiman: Random Forest (2001)
[3] Jerome H. Friedman: Greedy Function Approxima-
tion: A Gradient Boosting Machine (2001)
Future Work
• Neural Network & Robust Logistic on
combinations of powerful classifiers
• Other sampling techniques
Data insights from Learning Algorithms
Variable Importance plot from Random Forest and the Influence from Gradient Boosting shows a
consistent result. Insured hobbies and incident severity are both importance features (with high
predictive power) from the learning algorithm.
Adding more trees will not overfit in Random Forest as the OOB error will always converge (Breiman
1999). However, features selection will overfit the model. Gradient boosting shows a cutoff threshold
indicating the model is overfit if we select # of trees beyond the minimum validation loss.
ROC for Sensitivity Specificity & Precision Recall + Statistics
We will compare the improvement before synthetic oversampling approach, the train and test sets
are tuned via 10-fold cross validation for RBF kernel and gradient boosting, random forest are
tuned through out-of-bag samples. The data set (before SMOTE) are split into train and test using
stratified sampling approach, we want to maintain the same response ratio in both sets.
We shift our original concentration from PC Curve back to ROC curve after SMOTE as the AUC
measures the true/false positives so well when data is balanced. The SMOTE approach shows a huge
improvement in all performance metrics.
Discussion:
• Tree-based learning algorithms perform very well in our dataset
• Gradient boosting achieved highest AUC & Random Forest has the highest AUPC

More Related Content

Similar to Fraud Detection for Insurance Claims

Modified monte carlo technique for confidence limits of system reliability us...
Modified monte carlo technique for confidence limits of system reliability us...Modified monte carlo technique for confidence limits of system reliability us...
Modified monte carlo technique for confidence limits of system reliability us...DineshRaj Goud
 
Machine Learning Project - 1994 U.S. Census
Machine Learning Project - 1994 U.S. CensusMachine Learning Project - 1994 U.S. Census
Machine Learning Project - 1994 U.S. CensusTim Enalls
 
Heuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient searchHeuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient searchGreg Makowski
 
House Sale Price Prediction
House Sale Price PredictionHouse Sale Price Prediction
House Sale Price Predictionsriram30691
 
NEURAL Network Design Training
NEURAL Network Design  TrainingNEURAL Network Design  Training
NEURAL Network Design TrainingESCOM
 
A Framework for Scene Recognition Using Convolutional Neural Network as Featu...
A Framework for Scene Recognition Using Convolutional Neural Network as Featu...A Framework for Scene Recognition Using Convolutional Neural Network as Featu...
A Framework for Scene Recognition Using Convolutional Neural Network as Featu...Tahmid Abtahi
 
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationThomas Ploetz
 
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...Databricks
 
To bag, or to boost? A question of balance
To bag, or to boost? A question of balanceTo bag, or to boost? A question of balance
To bag, or to boost? A question of balanceAlex Henderson
 
Trinity of AI: data, algorithms and cloud
Trinity of AI: data, algorithms and cloudTrinity of AI: data, algorithms and cloud
Trinity of AI: data, algorithms and cloudAnima Anandkumar
 
Predict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPredict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPiyush Srivastava
 
Random Forest / Bootstrap Aggregation
Random Forest / Bootstrap AggregationRandom Forest / Bootstrap Aggregation
Random Forest / Bootstrap AggregationRupak Roy
 
Kaggle Higgs Boson Machine Learning Challenge
Kaggle Higgs Boson Machine Learning ChallengeKaggle Higgs Boson Machine Learning Challenge
Kaggle Higgs Boson Machine Learning ChallengeBernard Ong
 
COSMOS1_Scitech_2014_Ali
COSMOS1_Scitech_2014_AliCOSMOS1_Scitech_2014_Ali
COSMOS1_Scitech_2014_AliMDO_Lab
 
Thesis presentation: Applications of machine learning in predicting supply risks
Thesis presentation: Applications of machine learning in predicting supply risksThesis presentation: Applications of machine learning in predicting supply risks
Thesis presentation: Applications of machine learning in predicting supply risksTuanNguyen1697
 

Similar to Fraud Detection for Insurance Claims (20)

Modified monte carlo technique for confidence limits of system reliability us...
Modified monte carlo technique for confidence limits of system reliability us...Modified monte carlo technique for confidence limits of system reliability us...
Modified monte carlo technique for confidence limits of system reliability us...
 
Machine Learning Project - 1994 U.S. Census
Machine Learning Project - 1994 U.S. CensusMachine Learning Project - 1994 U.S. Census
Machine Learning Project - 1994 U.S. Census
 
Data mining Project
Data mining ProjectData mining Project
Data mining Project
 
Heuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient searchHeuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient search
 
House Sale Price Prediction
House Sale Price PredictionHouse Sale Price Prediction
House Sale Price Prediction
 
NEURAL Network Design Training
NEURAL Network Design  TrainingNEURAL Network Design  Training
NEURAL Network Design Training
 
A Framework for Scene Recognition Using Convolutional Neural Network as Featu...
A Framework for Scene Recognition Using Convolutional Neural Network as Featu...A Framework for Scene Recognition Using Convolutional Neural Network as Featu...
A Framework for Scene Recognition Using Convolutional Neural Network as Featu...
 
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
 
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
 
To bag, or to boost? A question of balance
To bag, or to boost? A question of balanceTo bag, or to boost? A question of balance
To bag, or to boost? A question of balance
 
crossvalidation.pptx
crossvalidation.pptxcrossvalidation.pptx
crossvalidation.pptx
 
Trinity of AI: data, algorithms and cloud
Trinity of AI: data, algorithms and cloudTrinity of AI: data, algorithms and cloud
Trinity of AI: data, algorithms and cloud
 
evaluation and credibility-Part 2
evaluation and credibility-Part 2evaluation and credibility-Part 2
evaluation and credibility-Part 2
 
Predict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPredict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an Organization
 
Random Forest / Bootstrap Aggregation
Random Forest / Bootstrap AggregationRandom Forest / Bootstrap Aggregation
Random Forest / Bootstrap Aggregation
 
Expedia Data Analysis
Expedia Data AnalysisExpedia Data Analysis
Expedia Data Analysis
 
Kaggle Higgs Boson Machine Learning Challenge
Kaggle Higgs Boson Machine Learning ChallengeKaggle Higgs Boson Machine Learning Challenge
Kaggle Higgs Boson Machine Learning Challenge
 
COSMOS1_Scitech_2014_Ali
COSMOS1_Scitech_2014_AliCOSMOS1_Scitech_2014_Ali
COSMOS1_Scitech_2014_Ali
 
Thesis presentation: Applications of machine learning in predicting supply risks
Thesis presentation: Applications of machine learning in predicting supply risksThesis presentation: Applications of machine learning in predicting supply risks
Thesis presentation: Applications of machine learning in predicting supply risks
 
Mattar_PhD_Thesis
Mattar_PhD_ThesisMattar_PhD_Thesis
Mattar_PhD_Thesis
 

Recently uploaded

Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJames Polillo
 
Uber Ride Supply Demand Gap Analysis Report
Uber Ride Supply Demand Gap Analysis ReportUber Ride Supply Demand Gap Analysis Report
Uber Ride Supply Demand Gap Analysis ReportSatyamNeelmani2
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单ewymefz
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单vcaxypu
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单ukgaet
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单vcaxypu
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay
 
Introduction-to-Cybersecurit57hhfcbbcxxx
Introduction-to-Cybersecurit57hhfcbbcxxxIntroduction-to-Cybersecurit57hhfcbbcxxx
Introduction-to-Cybersecurit57hhfcbbcxxxzahraomer517
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单ewymefz
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhArpitMalhotra16
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsCEPTES Software Inc
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatheahmadsaood
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundOppotus
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sMAQIB18
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...elinavihriala
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单ocavb
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单enxupq
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxbenishzehra469
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单ewymefz
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesStarCompliance.io
 

Recently uploaded (20)

Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
Uber Ride Supply Demand Gap Analysis Report
Uber Ride Supply Demand Gap Analysis ReportUber Ride Supply Demand Gap Analysis Report
Uber Ride Supply Demand Gap Analysis Report
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Introduction-to-Cybersecurit57hhfcbbcxxx
Introduction-to-Cybersecurit57hhfcbbcxxxIntroduction-to-Cybersecurit57hhfcbbcxxx
Introduction-to-Cybersecurit57hhfcbbcxxx
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 

Fraud Detection for Insurance Claims

  • 1. FraudDetetionforInsuranceClaims Yit Wei Chia - cyitwei@uwaterloo.ca University of Waterloo, Winter 2019, CS 680 Introduction • 81% of Canadians think that the addi- tional premium mainly due to false claims • Built binary classifiers using Random Forest, SVM with RBF kernel and Gra- dient Boosting • Developed Synthetic Minority Over- sampling Technique (SMOTE) approach to balance data • Achieved max AUC and AUPRC of 94% and 96% Dataset and Analysis • Databricks dataset with 1000 claims and 33 features • Imbalanced data with 247 false claims • SMOTE using 5-nearest neighbors • SMOTE - Balanced data with 741 obs on fraudulent & non-fraudulent claims • 70:30 split for train and test sets • 10-fold cross validation & OOB samples Models • Random Forest ˆfB rf = 1 B B b=1 fb(x) • SVM with RBF kernel min w, ξ 1 2 w 2 + C ∀ i ξi K(x, z) = exp − x−z 2 2σ2 • Gradient Boosting Fm(x) = Fm−1(x) + Jm j=1 γjmI(x ∈ Rjm) γm = xi∈Rjm (yi − p(xi)) xi∈Rjm p(xi) (1 − p(xi)) References [1] Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani: An Introduction to Statistical Learning (2013) [2] Leo Breiman: Random Forest (2001) [3] Jerome H. Friedman: Greedy Function Approxima- tion: A Gradient Boosting Machine (2001) Future Work • Neural Network & Robust Logistic on combinations of powerful classifiers • Other sampling techniques Data insights from Learning Algorithms Variable Importance plot from Random Forest and the Influence from Gradient Boosting shows a consistent result. Insured hobbies and incident severity are both importance features (with high predictive power) from the learning algorithm. Adding more trees will not overfit in Random Forest as the OOB error will always converge (Breiman 1999). However, features selection will overfit the model. Gradient boosting shows a cutoff threshold indicating the model is overfit if we select # of trees beyond the minimum validation loss. ROC for Sensitivity Specificity & Precision Recall + Statistics We will compare the improvement before synthetic oversampling approach, the train and test sets are tuned via 10-fold cross validation for RBF kernel and gradient boosting, random forest are tuned through out-of-bag samples. The data set (before SMOTE) are split into train and test using stratified sampling approach, we want to maintain the same response ratio in both sets. We shift our original concentration from PC Curve back to ROC curve after SMOTE as the AUC measures the true/false positives so well when data is balanced. The SMOTE approach shows a huge improvement in all performance metrics. Discussion: • Tree-based learning algorithms perform very well in our dataset • Gradient boosting achieved highest AUC & Random Forest has the highest AUPC