Credit Card Fraudulent Transaction Detection research paper using machine learning techniques such as Logistic Regression, Random Forest, and feature engineering, together with methods for handling a highly skewed dataset
1. Credit Card Fraudulent Transaction Detection
Garvit Burad (Author), Scholar
Dr. Brijendra Gupta (Co-Author), Associate Professor
Department of Computer Engineering
Poornima Institute of Engineering & Technology
2. The Problem
$21.84 Billion
was lost worldwide to credit card fraud in 2015.
Fewer than 0.2% of transactions are fraudulent,
which makes fraud very hard to predict.
Our dataset had
◻284,807 transactions in total
◻492 fraudulent transactions
◻0.172% of all transactions
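As a quick sanity check, the imbalance figure follows directly from the counts above (the totals are the dataset's published figures; the snippet is only illustrative arithmetic):

```python
# Fraud share of the dataset: 492 frauds out of 284,807 transactions.
total_transactions = 284_807
fraud_transactions = 492

fraud_share = fraud_transactions / total_transactions
print(f"{fraud_share * 100:.2f}%")  # 0.17%
```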
3. How Frauds Are Recognized
Methods generally employed by credit card companies:
● Location: a purchase is made from an unusual location
● Items you buy: the cardholder deviates from their regular buying pattern or time
● Frequency: a large number of transactions occur in a short period of time
● Amount: unusually costly items are suddenly purchased
5. Challenges
■ Standard machine learning algorithms such as
◻logistic regression and SVM classifiers
would report 99%+ accuracy.
But a model that classifies every fraud as legitimate already achieves 99.8% accuracy, so accuracy alone is a misleading metric here.
6. Understanding the Data
■ The dataset contains 31 features.
■ Features V1 to V28 have been transformed with PCA to protect the privacy of users.
■ The other three features are Time, Amount, and Class, which is 1 for fraud and 0 otherwise.
7. Approaching the Problem
■ Given the class imbalance, plain accuracy from the confusion matrix would not be a good criterion.
■ So we used the Area Under the Precision-Recall Curve (AUPRC) and the Receiver Operating Characteristic (ROC) curve.
■ A high area under the precision-recall curve represents both high recall and high precision.
■ High precision corresponds to a low false positive rate, and high recall corresponds to a low false negative rate.
■ The ROC curve plots the true positive (TP) rate against the false positive (FP) rate at various threshold settings.
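Both curves and their areas can be computed with scikit-learn. The sketch below uses made-up labels and scores (`y_true` and `scores` are illustrative, not the paper's results); real code would pass the model's predicted probabilities on the held-out test set:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve, auc

# Toy ground truth (1 = fraud) and classifier scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.05, 0.3, 0.15, 0.4, 0.8, 0.35, 0.25, 0.9])

# Area under the precision-recall curve (AUPRC).
precision, recall, _ = precision_recall_curve(y_true, scores)
auprc = auc(recall, precision)

# Area under the ROC curve: TP rate against FP rate across thresholds.
fpr, tpr, _ = roc_curve(y_true, scores)
auroc = auc(fpr, tpr)
print(round(auprc, 3), round(auroc, 3))
```

Note that `auc(recall, precision)` integrates whichever curve it is given, so the same helper serves both metrics.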
8. Dividing the Dataset into Test & Train
To validate our model's predictions, we divided the dataset into
◻75% for training (213,605 rows)
◻25% for testing (71,202 rows)
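The split maps directly onto scikit-learn's `train_test_split`. The snippet uses a small synthetic stand-in for the real data; the `stratify=y` argument, which keeps the rare fraud ratio identical in both splits, is our addition rather than something the paper states:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))            # stand-in for the 30 predictor columns
y = (rng.random(1000) < 0.02).astype(int)  # rare positive class

# 75% train / 25% test, preserving the class ratio in each split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
print(len(X_train), len(X_test))  # 750 250
```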
9. Resampling Methods
■ Resampling methods adjust the class distribution of the data, since the minority class is not equally represented.
■ There are two ways to perform the resampling:
◻Oversampling
◻Undersampling
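The two strategies can be sketched with plain index manipulation (the array sizes here are illustrative; dedicated implementations exist in the imbalanced-learn library):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 5 + [0] * 95)  # 5 minority (fraud), 95 majority samples

minority = np.where(y == 1)[0]
majority = np.where(y == 0)[0]

# Oversampling: duplicate minority rows (with replacement) until balanced.
over_idx = np.concatenate([majority, rng.choice(minority, size=len(majority))])

# Undersampling: keep a random majority subset the size of the minority.
under_idx = np.concatenate(
    [minority, rng.choice(majority, size=len(minority), replace=False)])

print(np.bincount(y[over_idx]), np.bincount(y[under_idx]))
```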
1. Preprocessing
2. Filtering
3. Classification
4. Decision Trees
10. Feature Engineering
■ Select only the features that matter the most.
■ Those features can be identified by
◻finding the correlation of each feature with the target
◻studying the domain of the problem
◻transforming some existing features to make them more informative
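The correlation-based selection above can be sketched as follows; the feature names and the synthetic relationship are assumptions for illustration, not the dataset's actual correlations:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "V1": rng.normal(size=500),
    "V2": rng.normal(size=500),
})
# Make the label depend on V1 so the ranking has a clear winner.
df["Class"] = (df["V1"] + 0.1 * rng.normal(size=500) > 1.5).astype(int)

# Rank features by absolute correlation with the fraud label.
corr = df.corr()["Class"].drop("Class").abs().sort_values(ascending=False)
print(corr.index[0])  # V1
```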
12. Logistic Regression
■ Logistic regression is a binary classifier based on the regression model, where the dependent variable is categorical.
■ In logistic regression, we fit
logit(P) = ln(P / (1 − P)) = a + bX
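The fitted model's intercept and coefficient play the roles of a and b in the equation above. A minimal sketch on simulated data (the true values a = 0.5, b = 2.0 are ours, not the paper's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 1))
p_true = 1 / (1 + np.exp(-(0.5 + 2.0 * X[:, 0])))  # true a = 0.5, b = 2.0
y = (rng.random(400) < p_true).astype(int)

clf = LogisticRegression().fit(X, y)
a, b = clf.intercept_[0], clf.coef_[0, 0]

# The predicted probabilities satisfy logit(P) = a + bX by construction.
proba = clf.predict_proba(X)[:, 1]
logit = np.log(proba / (1 - proba))
print(np.allclose(logit, a + b * X[:, 0]))  # True
```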
14. Logistic Regression
We can observe that our logistic regression classifier attains a 98% true negative rate and a 92% true positive rate, but the precision for the fraud class is only 0.06.
---Classification Report---
      precision  recall
0     1.00       0.98
1     0.06       0.92
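Reports of this form come from scikit-learn's `classification_report`. The labels below are illustrative only; the paper's numbers come from the real test split:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels and predictions (1 = fraud).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred)       # rows: true class, cols: predicted
print(cm)
print(classification_report(y_true, y_pred, digits=2))
```

A low fraud-class precision, as in the slide, shows up here whenever the false positives (top-right cell) are large relative to the true positives.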
15. LOGISTIC REGRESSION WITH SMOTE
■ SMOTE stands for Synthetic Minority Oversampling Technique.
■ It generates new minority-class instances by interpolating, in feature space, between each minority sample and its nearest neighbors.
■ This statistical technique increases the number of minority-class samples in the dataset to make it balanced.
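The description above maps to a short NumPy sketch. The `smote` function and all sizes are our illustration; the imbalanced-learn library provides a production implementation with more safeguards:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: interpolate between a random minority sample
    and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # idx[:, 0] is the point itself
    samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]  # a random true neighbor
        lam = rng.random()                  # interpolation factor in [0, 1)
        samples.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(samples)

X_min = np.random.default_rng(0).normal(size=(20, 2))  # minority samples
X_new = smote(X_min, n_new=30)
print(X_new.shape)  # (30, 2)
```

Because every synthetic point lies on a segment between two existing minority points, SMOTE enlarges the minority class without simply duplicating rows.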
17. RANDOM FOREST CLASSIFIER AFTER SMOTE
■ A random forest classifier fits multiple decision tree classifiers on various subsamples of the dataset.
■ It then averages their predictions to improve accuracy and control overfitting.
■ We used the random forest classifier with the Gini impurity criterion.
■ Gini impurity measures how often a randomly selected element from the dataset would be incorrectly labeled if it were labeled randomly according to the class distribution.
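A sketch of this setup with scikit-learn, on synthetic imbalanced data rather than the paper's transactions (all sizes and parameters here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: ~95% majority class.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# criterion="gini" selects the Gini impurity split measure from the text.
rf = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=0)
rf.fit(X_tr, y_tr)
acc = rf.score(X_te, y_te)
print(round(acc, 3))
```

In practice the SMOTE step from the previous slide would be applied to `X_tr`/`y_tr` only, never to the test split, so the evaluation stays honest.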
20. LOGISTIC REGRESSION WITH UNDERSAMPLING
■ We applied undersampling to make the number of samples from the majority class equal to the number of samples in the minority class.
■ The sample size becomes:
◻Number of fraud cases: 492
◻Number of normal cases: 492
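The train-on-undersampled, evaluate-on-everything pattern can be sketched as below; the synthetic data and its separability are assumptions standing in for the real transactions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic imbalanced data: ~1% positives (1 = fraud).
X, y = make_classification(n_samples=5000, weights=[0.99],
                           class_sep=2.0, flip_y=0, random_state=0)

rng = np.random.default_rng(0)
fraud = np.where(y == 1)[0]
normal = rng.choice(np.where(y == 0)[0], size=len(fraud), replace=False)
keep = np.concatenate([fraud, normal])

# Train on the balanced undersample, evaluate on the complete dataset.
clf = LogisticRegression().fit(X[keep], y[keep])
auc_full = roc_auc_score(y, clf.decision_function(X))
print(round(auc_full, 3))
```

Because ROC AUC is a ranking metric, it remains meaningful when the evaluation set is heavily imbalanced, which is why it is computed on the full data here.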
23. LOGISTIC REGRESSION WITH UNDERSAMPLING
ROC curve for the complete original training dataset, using the logistic regression model trained on the undersampled data. The AUC is 0.96.
24. APPLICATIONS
● These fraud-detection techniques can be deployed to make credit card transactions more secure.
● The learned models can run on high-performance computers to monitor all transactions in real time, stopping an ongoing fraudulent transaction and alerting the card owner about the steps required to further reduce the risk.
25. CONCLUSION
The best AUC for the ROC curve was achieved by:
◻Logistic regression with SMOTE: 0.95 AUC
◻Logistic regression with undersampling: 0.96 AUC
These methods performed better than the other methods we tried.
26. References
[1] Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.
[2] http://mlg.ulb.ac.be
[3] http://www.businesswire.com/news/home/20150804007054/en/Global-Card-Fraud-Losses-Reach-16.31-Billion
[4] http://www.kaggle.com
[5] The Nilson Report
[6] http://www.thinksaveretire.com/2015/09/14/how-credit-card-fraud-detection-works/
[7] Nitesh V. Chawla et al. SMOTE: Synthetic Minority Over-sampling Technique. Department of Computer Science and Engineering.
[8] Kendrick Boyd, Kevin H. Eng, and C. David Page. Area Under the Precision-Recall Curve: Point Estimates and Confidence Intervals.
[9] http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
[10] Pedregosa et al. Scikit-learn: Machine Learning in Python. JMLR 12, pp. 2825-2830, 2011.
[11] Jones E, Oliphant T, Peterson P, et al. SciPy: Open Source Scientific Tools for Python, 2001-, http://www.scipy.org/ [Online; accessed 2017-03-30].
[12] Stéfan van der Walt, S. Chris Colbert and Gaël Varoquaux. The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering, 13, 22-30 (2011), DOI: 10.1109/MCSE.2011.37.