1609 Fraud Data Science

Fraud Data Science
Alejandro Correa Bahnsen, PhD
Lead Data Scientist

About me
• PhD in Machine Learning at Luxembourg University
• Lead Data Scientist at Easy Solutions
• Worked for 8+ years as a data scientist at GE Money, Scotiabank
and SIX Financial Services
• Bachelor and Master in Industrial Engineering
• Organizer of the Big Data & Data Science Bogota Meetup

Big data (Data Science) is like teenage sex:
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims they are doing it...

Those are the pillars of data science: computing, statistics,
mathematics and quantitative disciplines combined to
analyze data for better decision making

Data Science is the use of methods and tools of
Machine Learning and Artificial Intelligence
with the objective of making data-driven decisions

Fraud detection
and prevention

Estimate the probability of a transaction being fraud based on
analyzing customer patterns and recent fraudulent behavior
Issues when constructing a fraud detection system:
• Skewness of the data
• Cost-sensitivity
• Short response time of the system
• Dimensionality of the search space
• Feature preprocessing
• Model selection
Credit card fraud detection

• Large European card processing company
• 2012 & 2013 card-present transactions
• 20MM transactions
• 40,000 frauds
• 0.467% fraud rate
• ~2MM EUR lost due to fraud on the test dataset
[Figure: monthly timeline, Jan to Dec, showing the split of the data into train and test sets]
Data

• “Purpose is to use facts and rules, taken from the knowledge
of many human experts, to help make decisions.”
• Example of rules
• More than 4 ATM transactions in one hour?
• More than 2 transactions in 5 minutes?
• Magnetic stripe transaction then internet transaction?
If-Then rules (Expert rules)
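
The three example rules above can be sketched as plain if-then checks. This is only an illustration: the dictionary keys (`time`, `type`, `entry_mode`), the rule names, and the sample values are hypothetical, not the schema of any real system, and the history is assumed to hold earlier transactions of the same card, oldest first.

```python
from datetime import datetime, timedelta

def flag_transaction(history, new_trx):
    """Return the names of the expert rules that fire for new_trx."""
    fired = []
    t = new_trx["time"]
    trxs = history + [new_trx]

    # Rule 1: more than 4 ATM transactions in one hour.
    atm_last_hour = [x for x in trxs
                     if x["type"] == "ATM" and t - x["time"] <= timedelta(hours=1)]
    if len(atm_last_hour) > 4:
        fired.append("atm_velocity")

    # Rule 2: more than 2 transactions in 5 minutes.
    last_5_min = [x for x in trxs if t - x["time"] <= timedelta(minutes=5)]
    if len(last_5_min) > 2:
        fired.append("burst")

    # Rule 3: magnetic stripe transaction followed by an internet transaction.
    if (history and history[-1]["entry_mode"] == "magnetic stripe"
            and new_trx["type"] == "Internet"):
        fired.append("stripe_then_internet")
    return fired

# Made-up example: two card-present swipes, then an internet purchase.
base = datetime(2013, 1, 1, 12, 0)
history = [
    {"time": base, "type": "POS", "entry_mode": "magnetic stripe"},
    {"time": base + timedelta(minutes=2), "type": "POS", "entry_mode": "magnetic stripe"},
]
new_trx = {"time": base + timedelta(minutes=4), "type": "Internet", "entry_mode": "online"}
fired_rules = flag_transaction(history, new_trx)
print(fired_rules)  # ['burst', 'stripe_then_internet']
```

Each rule is a fixed threshold chosen by an expert, which makes the logic easy to read but blind to patterns nobody thought to encode.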

Misclassification   Recall   Precision   F1-Score
1.04%               31%      17%         22%
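
As a quick consistency check, the F1-score is the harmonic mean of precision and recall, so the reported 22% follows from the 17% precision and 31% recall. A minimal sketch (the function name is illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(precision=0.17, recall=0.31)
print(f"{f1:.0%}")  # 22%
```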

Credit card fraud detection is a cost-sensitive problem, because the cost of a
false positive differs from the cost of a false negative.
• False positives: when a legitimate transaction is predicted as fraudulent,
the financial institution incurs an administrative cost.
• False negatives: when a fraud goes undetected, the amount of that transaction
is lost.
Moreover, it is not enough to assume a constant cost difference between
false positives and false negatives, because transaction amounts vary
considerably.
Financial evaluation

Cost matrix

$$
Cost(f(S)) = \sum_{i=1}^{N} \Big( y_i \big( c_i C_{TP_i} + (1 - c_i) C_{FN_i} \big) + (1 - y_i) \big( c_i C_{FP_i} + (1 - c_i) C_{TN_i} \big) \Big)
$$

                               Actual Positive (y_i = 1)   Actual Negative (y_i = 0)
Predicted Positive (c_i = 1)   C_TP_i = C_a                C_FP_i = C_a
Predicted Negative (c_i = 0)   C_FN_i = Amt_i              C_TN_i = 0
Financial evaluation
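
The cost measure can be sketched directly from the cost matrix: every alert (true or false positive) costs the administrative cost C_a, a missed fraud costs the transaction amount, and a correctly ignored transaction costs nothing. The function name, the sample labels, the amounts, and the value of C_a below are made up for illustration.

```python
def total_cost(y_true, y_pred, amounts, admin_cost):
    """Example-dependent cost of a set of predictions.

    y_true[i] is 1 for fraud, y_pred[i] is 1 when the transaction is
    flagged, amounts[i] is Amt_i, and admin_cost is C_a.
    """
    cost = 0.0
    for y, c, amt in zip(y_true, y_pred, amounts):
        c_tp, c_fp = admin_cost, admin_cost  # any alert costs C_a
        c_fn, c_tn = amt, 0.0                # a missed fraud loses the amount
        cost += (y * (c * c_tp + (1 - c) * c_fn)
                 + (1 - y) * (c * c_fp + (1 - c) * c_tn))
    return cost

# One true positive, one false negative, one false positive, one true negative.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]
amounts = [100.0, 250.0, 40.0, 15.0]
cost = total_cost(y_true, y_pred, amounts, admin_cost=5.0)
print(cost)  # 260.0 = 5 (TP) + 250 (FN) + 5 (FP) + 0 (TN)
```

Note how the single missed fraud dominates the total, which a plain misclassification rate would not reveal.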

Cost     Total Losses
1.24 €   1.94 €

Misclassification   Recall   Precision   F1-Score
1.04%               31%      17%         22%

Fraud Data Science is the use of
statistical and mathematical techniques
(Machine Learning) to discover patterns
in data in order to make predictions
Fraud Data Science

Raw features

Attribute name     Description
Transaction ID     Transaction identification number
Time               Date and time of the transaction
Account number     Identification number of the customer
Card number        Identification of the credit card
Transaction type   e.g. Internet, ATM, POS, ...
Entry mode         e.g. chip and PIN, magnetic stripe, ...
Amount             Amount of the transaction in euros
Merchant code      Identification of the merchant type
Merchant group     Merchant group identification
Country            Country of the transaction
Country 2          Country of residence
Type of card       e.g. Visa debit, Mastercard, American Express, ...
Gender             Gender of the card holder
Age                Card holder age
Bank               Issuer bank of the card
Features

Transaction aggregation strategy

Raw Features
TrxId   Time        Type   Country   Amt
1       1/1 18:20   POS    Lux       250
2       1/1 20:35   POS    Lux       400
3       1/1 22:30   ATM    Lux       250
4       2/1 00:50   POS    Ger       50
5       2/1 19:18   POS    Ger       100
6       2/1 23:45   POS    Ger       150
7       3/1 06:00   POS    Lux       10

Aggregated Features
TrxId   No Trx last 24h   Amt last 24h   No Trx last 24h, same type and country   Amt last 24h, same type and country
1       0                 0              0                                        0
2       1                 250            1                                        250
3       2                 650            0                                        0
4       3                 900            0                                        0
5       3                 700            1                                        50
6       2                 150            2                                        150
7       3                 400            0                                        0
Features
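
The aggregation strategy can be sketched as a lookback over a customer's earlier transactions. The sketch below assumes a strict 24-hour window and reads the dates as day/month; the year 2013 and all names are assumptions for illustration. Under that window it reproduces the first six aggregated rows of the table; for transaction 7 it yields 2 transactions totalling 250, so the table's last row presumably follows a different window convention.

```python
from datetime import datetime, timedelta

# (TrxId, time, type, country, amount); the year is an assumption.
trxs = [
    (1, datetime(2013, 1, 1, 18, 20), "POS", "Lux", 250),
    (2, datetime(2013, 1, 1, 20, 35), "POS", "Lux", 400),
    (3, datetime(2013, 1, 1, 22, 30), "ATM", "Lux", 250),
    (4, datetime(2013, 1, 2, 0, 50), "POS", "Ger", 50),
    (5, datetime(2013, 1, 2, 19, 18), "POS", "Ger", 100),
    (6, datetime(2013, 1, 2, 23, 45), "POS", "Ger", 150),
    (7, datetime(2013, 1, 3, 6, 0), "POS", "Lux", 10),
]

def aggregate(trxs, window=timedelta(hours=24)):
    """For each transaction, summarise the earlier ones inside the window.

    Returns one tuple per transaction:
    (count, amount, count same type and country, amount same type and country).
    """
    rows = []
    for i, (_, t, typ, country, _) in enumerate(trxs):
        # Earlier transactions that fall inside the lookback window.
        prev = [x for x in trxs[:i] if t - x[1] < window]
        # Of those, the ones sharing the transaction type and country.
        same = [x for x in prev if x[2] == typ and x[3] == country]
        rows.append((len(prev), sum(x[4] for x in prev),
                     len(same), sum(x[4] for x in same)))
    return rows

rows = aggregate(trxs)
for (trx_id, *_), row in zip(trxs, rows):
    print(trx_id, row)
```

Grouping by other attributes (merchant group, entry mode, and so on) or using other window lengths follows the same pattern.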

When is a customer expected to
make a new transaction?
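Considering a von Mises distribution with a period of 24 hours such that

$$
P(time) \sim vonmises(\mu, \sigma) = \frac{e^{\sigma \cos(time - \mu)}}{2 \pi I_0(\sigma)}
$$

where 𝝁 is the mean, 𝝈 controls the spread (as written it acts as the
concentration parameter, so larger values give a narrower distribution),
and 𝑰𝟎 is the modified Bessel function of the first kind of order 0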
Periodic features
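
The von Mises density above can be evaluated with the standard library alone. This is a minimal sketch: hours are mapped to angles on a 24-hour circle, I_0 is computed from its power series, and the customer profile (a mean of 19h and sigma = 2.0) is invented for illustration.

```python
import math

def bessel_i0(x, terms=50):
    """Modified Bessel function of the first kind, order 0 (power series)."""
    return sum((x / 2) ** (2 * k) / math.factorial(k) ** 2 for k in range(terms))

def hour_to_angle(hour):
    """Map an hour on the 24-hour clock to an angle in radians."""
    return 2 * math.pi * hour / 24

def von_mises_pdf(hour, mu_hour, sigma):
    """Von Mises density (per radian) evaluated at an hour of the day."""
    angle, mu = hour_to_angle(hour), hour_to_angle(mu_hour)
    return math.exp(sigma * math.cos(angle - mu)) / (2 * math.pi * bessel_i0(sigma))

# Hypothetical customer who usually transacts around 19h.
p_usual = von_mises_pdf(19, mu_hour=19, sigma=2.0)
p_unusual = von_mises_pdf(9, mu_hour=19, sigma=2.0)
print(p_usual > p_unusual)  # True
```

Because the density is defined on a circle, 23h and 1h are treated as two hours apart, which an ordinary arithmetic mean of hours would get wrong.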

*New Periodic features
• Analyzing the time of a transaction using a 24-hour clock
• Model a non-linear von Mises kernel

*New Periodic features
19h risk = 10
9h risk = 95
• Estimate the risk comparing a new transaction with the kernel
distribution

[Figure: amount of the transaction vs. number of transactions last day, with normal transactions and fraud marked]

[Figure: amount of the transaction vs. number of transactions last day vs. number of ATM transactions last week, with normal transactions and fraud marked]

Fraud Analytics Algorithms
• Fuzzy Rules
• Neural Nets
• Naive Bayes
• Random Forests
• RF – with Cost-Proportionate Rejection Sampling
• Cost-Sensitive Random Patches
• Decision Trees

[Figure: chart of % Savings and % Frauds (axis from 0% to 100%) for Expert Rules, Fuzzy Rules, Neural Nets, Naïve Bayes, Random Forests, RF - CP Random Sampling, and CS Random Patches]

Model Performance vs. Interpretability

Local Interpretable Model-agnostic Explanations
The LIME algorithm approximates
the underlying model with an
interpretable one by:
• Learning on perturbations of the
original instance
• Finding the nearest neighborhood
around the target instance
• Training a sparse linear model in
the

Interpreting Model Predictions
Transaction 1
Anomaly Score = 82
Example of using LIME to
understand predictions of
an anomaly detection
algorithm (Isolation Forest),
trained with over 2 million
parameters.

Interpreting Model Predictions
Transaction 3
Anomaly Score = 99
Transaction 2
Anomaly Score = 0

• Fraud Data Science (ML) models are
significantly better than expert rules
• Models should be evaluated taking into
account real financial costs of the application
• Algorithms should be developed to
incorporate those financial costs
• Don't be afraid of complex ML models
Takeaways!!

Questions?
Alejandro Correa Bahnsen, PhD
Lead Data Scientist
acorrea@Easysol.net
