Peipei Han
Marshall School of Business, USC
Card Payment Fraud Detection
Agenda
• Objective
• Data Exploration
• Expert Variables
• Feature Selection
• Supervised Models
• Fraud Prediction
Predict potential fraud transactions
Build a supervised model to generate fraud scores
Objective
Original Dataset
95,007 credit card payment transaction records from 2010
10 variables: recordnum, cardnum, date, merchnum, merch description, merch state, merch zip, transtype, amount, and fraud
298 fraud labels
3 records from the original dataset:
Data Exploration
Data Exploration – Fraud Spatial Distribution
Fraud transactions: Palo Alto, Seattle, LA, Chicago, Savage, State College
Data Exploration – Amount & Seasonality
Different patterns between fraud and normal transactions
Seasonality: avoid it when building variables
61 new numeric variables in total
Expert Variables
Expert Variables Cont.
Two-dimension combination variables
Expert Variables Cont.
Amount and count variables over the last 7 and 3 days
Computed with HiveQL window functions
The computation takes no more than 1 minute
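The trailing-window variables above can be illustrated with a minimal Python sketch. This is not the HiveQL query used for the slides. It assumes the windows are computed per card (cardnum); the function name, the tuple layout, and the sample rows are all hypothetical.

```python
from collections import defaultdict, deque
from datetime import date, timedelta

def rolling_features(transactions, days):
    """For each transaction, return (count, total amount) of the same card's
    transactions in the trailing window of `days` days, current one included.

    `transactions` is a list of (cardnum, date, amount) tuples sorted by date.
    Grouping by cardnum is an assumption made for this sketch.
    """
    window = defaultdict(deque)   # cardnum -> recent (date, amount) pairs
    totals = defaultdict(float)   # cardnum -> running amount inside the window
    features = []
    for cardnum, day, amount in transactions:
        recent = window[cardnum]
        recent.append((day, amount))
        totals[cardnum] += amount
        # Drop transactions that fell out of the trailing window.
        cutoff = day - timedelta(days=days)
        while recent[0][0] <= cutoff:
            _, old_amount = recent.popleft()
            totals[cardnum] -= old_amount
        features.append((len(recent), round(totals[cardnum], 2)))
    return features

# Hypothetical sample rows, not taken from the original dataset.
sample = [
    ("C1", date(2010, 1, 1), 20.0),
    ("C1", date(2010, 1, 3), 35.5),
    ("C2", date(2010, 1, 4), 10.0),
    ("C1", date(2010, 1, 9), 100.0),
]
last_7 = rolling_features(sample, days=7)
last_3 = rolling_features(sample, days=3)
```

Each pass over the sorted data is linear in the number of transactions, because every record enters and leaves a card's window at most once.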
Expert Variables Cont.
Lasso variable selection plot
Feature Selection
log(Best Lambda) = -7
Logistic Regression
Supervised Models
3% of the population captures 50% of frauds
XGBoost
Supervised Models Cont.
3% of the population captures 91.94% of frauds
Depth 30, 16 rounds
Random Forest
Supervised Models Cont.
3% of the population captures 94.63% of frauds
Model Comparison
Supervised Models Cont.
Models               FDR* at 3%   FDR at 5%   FDR at 50%   FDR at 90%
Logistic Regression  50           52.68       86.91        98.99
XGBoost              91.95        93.29       96.98        100
Random Forest        94.63        96.31       98.99        99.66
* FDR: Fraud Detection Rate (%)
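As a rough illustration of how an FDR figure at a given population cutoff can be computed, here is a minimal Python sketch. It assumes records are ranked by model score and that FDR is the share of all frauds found in the top-scored fraction. The function name, scores, and labels are hypothetical and unrelated to the results in the table.

```python
import math

def fraud_detection_rate(scores, labels, top_fraction):
    """Share of all frauds found among the highest-scored `top_fraction` of records."""
    total_frauds = sum(labels)
    if total_frauds == 0:
        raise ValueError("no fraud labels present")
    # Rank records from most to least suspicious.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_examined = math.ceil(top_fraction * len(ranked))
    caught = sum(label for _, label in ranked[:n_examined])
    return caught / total_frauds

# Hypothetical scores and labels (1 = fraud), for illustration only.
scores = [0.95, 0.90, 0.80, 0.40, 0.35, 0.30, 0.20, 0.15, 0.10, 0.05]
labels = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
fdr_at_20 = fraud_detection_rate(scores, labels, 0.20)
fdr_at_50 = fraud_detection_rate(scores, labels, 0.50)
```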
Fraud Prediction
* False Positive Ratio = # goods caught / # bads caught
** False Positive Rate = # goods caught / # examined
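The two false positive definitions can be compared in a short Python sketch. It assumes that "# examined" equals goods caught plus bads caught, that is, every flagged record is examined. The function name and the counts are hypothetical.

```python
def false_positive_metrics(goods_caught, bads_caught):
    """Return (false positive rate, false positive ratio) for one score cutoff.

    Assumption for this sketch: examined = goods caught + bads caught.
    """
    examined = goods_caught + bads_caught
    if examined == 0 or bads_caught == 0:
        raise ValueError("need at least one examined record and one bad caught")
    rate = goods_caught / examined       # goods caught / examined
    ratio = goods_caught / bads_caught   # goods caught / bads caught
    return rate, ratio

# Hypothetical counts: 30 good transactions flagged alongside 10 frauds.
rate, ratio = false_positive_metrics(goods_caught=30, bads_caught=10)
```

Under that assumption the rate stays between 0 and 1, while the ratio is unbounded and reads as the number of good transactions flagged per fraud caught.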
Fraud Prediction Cont.
Maximum ROI $146,440
[Figure: Fraud Savings, Lost Sales, and ROI curves; y-axis from ($1,000,000) to $1,200,000, x-axis from 0 to 1.2]
Loss per fraud   Loss per lost sale   Cost per score
$600             $10                  $0.05
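One way to turn the three unit costs into an ROI curve is sketched below in Python. The formula (fraud savings minus lost sales minus scoring cost) is an assumption, since the slides do not spell it out. The candidate cutoffs and counts are hypothetical and do not reproduce the maximum ROI reported above.

```python
LOSS_PER_FRAUD = 600.0   # from the slide
LOSS_PER_SALE = 10.0     # from the slide
COST_PER_SCORE = 0.05    # from the slide

def roi_at_cutoff(frauds_caught, goods_caught, n_scored):
    """Assumed formula: fraud savings - lost sales - scoring cost."""
    fraud_savings = LOSS_PER_FRAUD * frauds_caught
    lost_sales = LOSS_PER_SALE * goods_caught
    scoring_cost = COST_PER_SCORE * n_scored
    return fraud_savings - lost_sales - scoring_cost

def best_cutoff(candidates, n_scored):
    """Pick the (cutoff, frauds_caught, goods_caught) tuple with the highest ROI."""
    return max(candidates, key=lambda c: roi_at_cutoff(c[1], c[2], n_scored))

# Hypothetical cutoffs: (share of population flagged, frauds caught, goods caught).
candidates = [
    (0.01, 40, 60),
    (0.03, 55, 245),
    (0.10, 60, 940),
]
best = best_cutoff(candidates, n_scored=10_000)
best_roi = roi_at_cutoff(best[1], best[2], n_scored=10_000)
```

Flagging more of the population catches more fraud but rejects more good sales, so under this formula the ROI rises and then falls as the cutoff loosens.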
