This document discusses data analysis techniques for credit card fraud detection. It introduces logistic regression models for fraud prediction and evaluates their performance based on accuracy metrics and financial costs. Cost-sensitive logistic regression directly incorporates financial costs into the model training process, achieving lower total costs compared to models optimized for accuracy alone. In conclusion, fraud detection algorithms should evaluate and optimize models based on real financial impacts rather than traditional accuracy metrics.
6. Database
• Large European card processing company
• 2012 card present transactions
• 750,000 Transactions
• 3500 Frauds
• 0.467% Fraud rate
• 148,562 EUR lost due to fraud on test dataset
[Figure: timeline of the months Jan to Dec 2012, divided into Train and Test periods]
7. Database
• Raw attributes
TRXID  Client ID  Date         Amount  Location  Type      Merchant Group  Fraud
1      1          2/1/12 6:00  580     Lux       Internet  Airlines        No
2      1          2/1/12 6:15  120     Lux       Present   Car Renting     No
3      2          2/1/12 8:20  12      Bel       Present   Hotel           Yes
4      1          3/1/12 4:15  60      Lux       ATM       ATM             No
5      2          3/1/12 9:18  8       Fra       Present   Retail          No
6      1          3/1/12 9:55  1210    Lux       Internet  Airlines        Yes
• Other attributes:
Age, country of residence, postal code, type of card
8. Database
• Derived attributes
ID  Num CC  Date         Amt   Location  Type      Merchant Group  Fraud  Trx 6h  Sum 7d
1   1       2/1/12 6:00  580   Lux       Internet  Airlines        No     0       0
2   1       2/1/12 6:15  120   Lux       Present   Car Renting     No     1       580
3   2       2/1/12 8:20  12    Bel       Present   Hotel           Yes    0       0
4   1       3/1/12 4:15  60    Lux       ATM       ATM             No     0       700
5   2       3/1/12 9:18  8     Fra       Present   Retail          No     0       12
6   1       3/1/12 9:55  1210  Lux       Internet  Airlines        Yes    1       760
Trx 6h = No. of Trx - same client - last 6 hours
Sum 7d = Sum - same client - last 7 days
Combination of the following criteria:
By        Client, Credit Card
Group     None, Transaction Type, Merchant, Merchant Category, Merchant Group 1, Merchant Group 2, Merchant Country
Last      hour, day, week, month, 3 months
Function  Count, Sum(Amount), Avg(Amount)
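The two derived columns in the table can be reproduced with a small look-back aggregation. What follows is a minimal sketch, not the author's implementation: the function name `derive` and its parameters are hypothetical, dates are assumed to be day/month/year, and the look-back window is assumed to exclude the current transaction. Only the "By" (client), "Last" (window) and "Function" (count or sum) criteria are covered; the "Group" criterion is left out.

```python
from datetime import datetime, timedelta

# Sample rows copied from the slide's table: (trx_id, client_id, date, amount).
# Dates are read as day/month/year, which is an assumption about the slide.
TRANSACTIONS = [
    (1, 1, "2/1/12 6:00", 580),
    (2, 1, "2/1/12 6:15", 120),
    (3, 2, "2/1/12 8:20", 12),
    (4, 1, "3/1/12 4:15", 60),
    (5, 2, "3/1/12 9:18", 8),
    (6, 1, "3/1/12 9:55", 1210),
]


def derive(transactions, by_index, window, func):
    """For each transaction, aggregate the amounts of earlier transactions
    that share the same key (e.g. client) and fall inside the window."""
    parsed = [(datetime.strptime(t[2], "%d/%m/%y %H:%M"), t) for t in transactions]
    result = []
    for when, trx in parsed:
        amounts = [
            other[3]
            for other_when, other in parsed
            if other[by_index] == trx[by_index]
            and when - window <= other_when < when  # strictly earlier, within window
        ]
        result.append(func(amounts))
    return result


# "No. of Trx - same client - last 6 hours"
count_6h = derive(TRANSACTIONS, by_index=1, window=timedelta(hours=6), func=len)
# "Sum - same client - last 7 days"
sum_7d = derive(TRANSACTIONS, by_index=1, window=timedelta(days=7), func=sum)

print(count_6h)
print(sum_7d)
```

Other combinations of the criteria follow the same pattern, for example by passing a different `window` or an averaging function as `func`.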
14. Financial evaluation
• Motivation
• False positives carry a different cost than false negatives
• Frauds range from a few to thousands of euros (dollars, pounds, etc.)
There is a need for a real comparison measure
15. Financial evaluation
• Cost matrix
                        True class (y_i)
Predicted class (p_i)   Fraud (y_i = 1)   Legitimate (y_i = 0)
Fraud (p_i = 1)         Ca                Ca
Legitimate (p_i = 0)    Amt_i             0
where:
Ca = Administrative costs
Amt_i = Amount of transaction i
• Evaluation measure
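The formula for the evaluation measure is not reproduced on this slide. Reading it off the cost matrix, with y_i the true class and p_i the predicted class of transaction i, a consistent total cost would be Cost = sum over i of [ y_i (1 - p_i) Amt_i + p_i Ca ]. The sketch below computes that sum under this assumption. The function name `total_cost` and the value of `CA` are hypothetical, since the slides give no administrative cost; the labels and amounts are the six sample transactions from the Database slide.

```python
def total_cost(y_true, y_pred, amounts, admin_cost):
    """Sum the cost-matrix entry that applies to each transaction."""
    cost = 0.0
    for y, p, amt in zip(y_true, y_pred, amounts):
        if p == 1:
            cost += admin_cost  # flagged as fraud: pay Ca whether or not it is fraud
        elif y == 1:
            cost += amt         # missed fraud: lose the transaction amount
        # predicted legitimate and truly legitimate: cost 0
    return cost


# Labels and amounts of the six sample transactions on the Database slide.
y_true = [0, 0, 1, 0, 0, 1]
amounts = [580, 120, 12, 60, 8, 1210]
CA = 5.0  # hypothetical administrative cost; the slides give no value

# "No Model": nothing is flagged, so every fraud amount is lost.
no_model = total_cost(y_true, [0] * 6, amounts, CA)
# A hypothetical rule that flags every transaction above 500.
flag_large = total_cost(y_true, [1, 0, 0, 0, 0, 1], amounts, CA)

print(no_model, flag_large)
```

Under this reading, the "No Model" cost is simply the sum of all fraud amounts, which matches the way the deck reports EUR lost due to fraud on the test dataset.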
16. Logistic Regression
Results
[Figure: Cost (EUR, axis from € 0 to € 160,000) and Recall, Precision and F1-Score (axis from 0% to 70%) per model.
Models: No Model, All, 1%, 5%, 10%, 20%, 50%.
Cost labels: € 148,562, € 148,196, € 142,510, € 112,103, € 79,838, € 65,870, € 46,530.
Annotations: "Selecting the algorithm by F1-Score" and "Selecting the algorithm by Cost".]
17. Logistic Regression
• The best model selected using the traditional F1-Score does not give the best results in terms of cost
• The model selected by cost is trained using less than 1% of the database, meaning a lot of information is excluded
• The algorithm is trained to minimize the misclassification (approx.) but is then evaluated based on cost
• Why not train the algorithm to minimize the cost instead?
18. Cost Sensitive Logistic Regression
• Cost Matrix
                        True Class (y_i)
Predicted class (p_i)   Fraud (y_i = 1)   Legitimate (y_i = 0)
Fraud (p_i = 1)         Ca                Ca
Legitimate (p_i = 0)    Amt_i             0
• Cost Function
• Objective
Find θ that minimizes the cost function (Genetic Algorithms)
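The cost function itself is not reproduced on this slide. One plausible form, assumed here, replaces the hard prediction in the cost matrix with the logistic output h_θ(x_i): J(θ) = (1/m) sum over i of [ y_i (h_θ(x_i) Ca + (1 - h_θ(x_i)) Amt_i) + (1 - y_i) h_θ(x_i) Ca ]. The sketch below minimizes that J(θ) with a deliberately tiny genetic algorithm. Everything in it is illustrative: the function names, the GA settings, the feature encoding and the value of `CA` are hypothetical and are not taken from the slides.

```python
import math
import random


def sigmoid(z):
    # Clamp z so math.exp cannot overflow for extreme parameter values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def cs_cost(theta, X, y, amounts, admin_cost):
    """Average expected cost when h = sigmoid(theta . x) is read as the
    probability of flagging a transaction as fraud (assumed form of J)."""
    total = 0.0
    for x_i, y_i, amt_i in zip(X, y, amounts):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += y_i * (h * admin_cost + (1 - h) * amt_i) + (1 - y_i) * h * admin_cost
    return total / len(X)


def genetic_minimize(cost, n_params, pop_size=40, generations=60, seed=0):
    """Toy genetic algorithm: elitism, uniform crossover, Gaussian mutation."""
    rng = random.Random(seed)
    population = [[0.0] * n_params]  # all-zero vector serves as a baseline
    population += [
        [rng.uniform(-3, 3) for _ in range(n_params)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=cost)
        parents = population[: pop_size // 4]  # best quarter survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child = [g + rng.gauss(0, 0.3) for g in child]    # mutation
            children.append(child)
        population = parents + children
    return min(population, key=cost)


# Six sample transactions from the Database slide, encoded (as an assumption)
# as [bias, amount in thousands of EUR, 1 if the type is Internet].
X = [[1, 0.58, 1], [1, 0.12, 0], [1, 0.012, 0],
     [1, 0.06, 0], [1, 0.008, 0], [1, 1.21, 1]]
y = [0, 0, 1, 0, 0, 1]
amounts = [580, 120, 12, 60, 8, 1210]
CA = 5.0  # hypothetical administrative cost


def objective(theta):
    return cs_cost(theta, X, y, amounts, CA)


best_theta = genetic_minimize(objective, n_params=3)
baseline_cost = objective([0.0, 0.0, 0.0])
best_cost = objective(best_theta)
print(best_theta, baseline_cost, best_cost)
```

Because the best quarter of each generation is carried over unchanged, the cost of the returned θ can never exceed that of the all-zero baseline. A derivative-free search is a natural fit here because this cost, unlike the usual log-loss, is not convex in θ.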
21. Conclusion
• Selecting models based on traditional statistics does not give the best results in terms of cost
• Models should be evaluated taking into account the real financial costs of the application
• Algorithms should be developed to incorporate those financial costs
23. Contact information
Alejandro Correa Bahnsen
University of Luxembourg
Luxembourg
al.bahnsen@gmail.com
http://www.linkedin.com/in/albahnsen
http://www.slideshare.net/albahnsen