1. Sensitivity: Internal
Distributed Data Processing & Learning
Approaches for Credit Card Fraud Detection
Systems (C.C.F.D.S)
Supervisor: Dr. Danish Mehmood
Taj Wali (MS DS - 4)
2.
Introduction
• The idea of the credit card dates back to 1887; issuers such as Visa
later introduced the modern concept of credit cards
• Millions of users pay for transactions with credit cards
• The rapid growth of credit card use at POS terminals and through other
offline and online channels has increased abuse
• The concerned authorities have adopted many measures, such as
introducing smart cards, but the problem is still growing
3.
Introduction
• In the proposed thesis, we use advanced Machine Learning
techniques to detect fraudulent transactions.
• There is a major difference between the numbers of fraudulent
and genuine transactions (class imbalance).
• To overcome the class imbalance issue, we will use
MCC-SMOTE and entropy-based techniques.
4.
Literature Review-[1]
Domain: Bank sector data
Contribution: They used a private dataset in which all the data was available and no feature was missing.
Feature Engineering: Transaction aggregation and PCA
Classifier: Random Forest, CNN
Parameters: Accuracy, Precision, Recall, F1
Limitation: A private dataset is used, so it will be hard for future work to improve on the existing work.
Ref [1] Li Z, Huang M, Liu G, Jiang C. A hybrid method with dynamic weighted entropy for handling the problem of class imbalance with overlap in credit card fraud detection. Expert Systems with Applications. 2021 Aug 1;175:114750.
5.
Literature Review-[2]
Domain: Bank sector data
Contribution: They handled data entries of the same type, which cause an unwanted correlation between the records.
Feature Engineering: PCA and feature engineering techniques
Classifier: Random Forest, LR, Decision Tree
Parameters: Accuracy
Limitation: A private dataset is used, so it will be hard for future work to improve on the existing work.
Ref [2] Han Y, Yao S, Wen T, Tian Z, Wang C, Gu Z. Detection and Analysis of Credit Card Application Fraud Using Machine Learning Algorithms. In Journal of Physics: Conference Series 2020 Dec 1 (Vol. 1693, No. 1, p. 012064). IOP Publishing.
6.
Literature Review-[3]
Domain: Bank sector data of credit card users
Contribution: They used a real dataset for detecting credit card fraud in online transactions.
Feature Engineering: PCA and feature engineering techniques, plus filling of missing values
Classifier: Deep learning methods
Parameters: Accuracy
Limitation: Deep learning is helpful but expensive in terms of memory occupation and computational power.
Ref [3] Alghofaili Y, Albattah A, Rassam MA. A Financial Fraud Detection Model Based on LSTM Deep Learning Technique. Journal of Applied Security Research. 2020 Oct 1;15(4):498-516.
7.
Literature Review-[4]
Domain: Singaporean bank data
Contribution: They improved on the results of existing work that used the same dataset and the same methods.
Feature Engineering: Filling of missing values, normalization and data integration
Classifier: Decision Tree
Parameters: Accuracy, F1, Precision
Limitation: All the data in the dataset needs to be converted to numeric values and normalized.
Ref [4] Ahammad J, Hossain N, Alam MS. Credit card fraud detection using data pre-processing on imbalanced data - both oversampling and undersampling. In Proceedings of the International Conference on Computing Advancements 2020 Jan 10 (pp. 1-4).
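The limitation noted for [4], that all data must be converted to numeric values and normalized, can be illustrated with a minimal sketch. This is not the preprocessing used in [4]; the integer encoding, the min-max scaling, and the toy `channels` and `amounts` columns are assumptions chosen only for illustration.

```python
def encode_categories(values):
    """Map each distinct category to an integer code, in first-seen order."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


def min_max_normalize(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical toy columns: transaction channel and amount.
channels = ["online", "pos", "online", "atm"]
amounts = [10.0, 250.0, 40.0, 1000.0]

encoded = encode_categories(channels)
scaled = min_max_normalize(amounts)
print(encoded)
print(scaled)
```

Integer codes impose an arbitrary order on the categories, so a one-hot encoding may be preferable for classifiers that are sensitive to magnitude.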
8.
Critical Evaluation Table
Ref. [31]
Domain: Bank sector
Dataset: Public (Kaggle) and private datasets
Preprocessing techniques: Data sampling for class imbalance and normalization techniques; transaction aggregation
Classifiers used: Random forest, CNN
Limitations: The private dataset is not publicly available
Validation: Accuracy, Precision, Recall, F1
Remarks: Two datasets were used for this study, one of which is private, so future work is potentially difficult

Ref. [32]
Domain: Bank sector
Dataset: Public dataset from Kaggle with a million records
Preprocessing techniques: All the data is 100% populated, so there is no need to fill in missing values; frequency variables were engineered based on the credit card users' frequency
Classifiers used: Linear Regression, Decision trees, Random forest
Limitations: Data entries of the same type cause an unwanted correlation between the records
Validation: Accuracy
Remarks: PCA is best for dimensionality reduction in any credit card fraud detection system

Ref. [35]
Domain: Bank sector
Dataset: A real dataset of credit card frauds is used
Preprocessing techniques: Missing value filling and basic data handling techniques; encoder method
Classifiers used: Deep learning methods
Limitations: Some of the records are missing from the dataset due to confidentiality issues
Validation: Accuracy and loss function
Remarks: Deep learning methods are helpful but expensive in terms of memory occupation and computational power

Ref. [37]
Domain: Bank sector
Dataset: Public dataset from the Kaggle repository
Preprocessing techniques: Filling of missing values, normalization and data integration; oversampling and undersampling techniques to process the imbalanced classes
Classifiers used: KNN
Limitations: The performance can be increased if all the relevant features are added to the dataset while training
Validation: G-mean, F1 and AUC
Remarks: Missing data features cause some rows to be removed from the data, which potentially affects the overall results of the ML classifier
9.
Scope
• This study focuses only on online transactions made with credit cards
• We use advanced SMOTE techniques to handle the class imbalance issue
in credit card transaction datasets
• We use a public dataset from the Kaggle repository to conduct the study
10.
Significance
• This research shows how important it is to preprocess the data and handle
the anomalies in online credit card transactions
• Millions of transactions are made each day with credit cards. Most of them
are genuine and only a few are fraudulent, but because of their importance
these transactions need a proper mechanism for detecting and handling
fraud.
• The aim is to detect fraudulent transactions overall and then give future
directions to the concerned authorities.
11.
Research Questions
RQ.1 What is the impact of using Distributed Data Processing (DDP) for training the model?
RQ.2 Does training the credit card fraud detection (CCFD) model on multiple datasets
improve its performance?
RQ.3 What is the impact of cluster-based feature engineering?
RQ.4 Which performance measure(s) is (are) most adequate for detecting credit
card fraud in online transactions?
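As background to RQ.4, the sketch below computes the measures named in the literature review (accuracy, precision, recall, F1, G-mean) from confusion-matrix counts. The toy labels are hypothetical; the point is only that a model predicting "genuine" for every transaction still reaches high accuracy on imbalanced data.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn), treating label 1 as fraud and 0 as genuine."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(a, b):
    """Divide a by b, returning 0.0 when the denominator is zero."""
    return a / b if b else 0.0


def fraud_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)        # true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "g_mean": math.sqrt(recall * specificity),
    }


# Hypothetical toy data: 995 genuine and 5 fraudulent transactions,
# scored by a model that labels everything as genuine.
y_true = [0] * 995 + [1] * 5
always_genuine = [0] * 1000
metrics = fraud_metrics(y_true, always_genuine)
print(metrics)
```

Here accuracy is 99.5% although no fraud is caught, while recall, F1 and G-mean are all zero, which is why accuracy alone is a weak measure for this problem.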
12.
Problem Statement
• Class imbalance is the fundamental problem in the credit card
fraud detection domain
• Researchers have used SMOTE to handle this problem, but the synthetic
instances it creates can introduce noisy data points
• The cluster-based technique used in this thesis overcomes this issue
• The distributed SMOTE uses the most accurate clusters from the
existing data for better results in less time
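The interpolation step of plain SMOTE can be sketched as follows. This is standard SMOTE only, not the MCC-SMOTE or distributed variant of the thesis; the function names, the `k` and `seed` parameters, and the toy `fraud` points are assumptions for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority, n_synthetic, k=2, seed=0):
    """Create n_synthetic points by interpolating between minority samples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point, excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority (fraud) samples; (5.0, 5.0) is an isolated outlier.
fraud = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (5.0, 5.0)]
new_points = smote(fraud, n_synthetic=6)
print(new_points)
```

When the outlier is picked as the base point, the synthetic sample lands on a segment between it and the dense group, in a region with no real fraud samples. This is one way the noisy points mentioned above can arise, and it is the kind of case that restricting interpolation to clusters is meant to avoid.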
20.
Research Contribution
• Data pre-processing through MCC-SMOTE in a distributed environment
(Apache Spark)
• Use of a DDP environment (Apache Spark) for lower-latency operations
• Study of the impact of training on multiple datasets with different data types,
and generation of an alert when an abnormal transaction occurs
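The partition-then-merge pattern behind distributed data processing can be sketched on a single machine. The example below uses Python threads as a stand-in and does not use the Spark API; the function names, the round-robin split, and the toy `rows` are assumptions for illustration. In a real cluster the partitions would live on different worker nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, n_partitions):
    """Deal the rows round-robin into n_partitions lists."""
    return [rows[i::n_partitions] for i in range(n_partitions)]


def count_labels(partition):
    """Map step: count the class labels inside one partition."""
    return Counter(label for _, label in partition)


def distributed_label_counts(rows, n_partitions=4):
    partitions = split_into_partitions(rows, n_partitions)
    # Each partition is processed independently of the others.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(count_labels, partitions))
    # Reduce step: merge the per-partition counters into one result.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Hypothetical toy data: (amount, label) rows with 5 frauds in 1000 rows.
rows = [(float(i), 1 if i % 200 == 0 else 0) for i in range(1000)]
counts = distributed_label_counts(rows)
print(counts)
```

Because each partition is counted without reference to the others, the same two steps scale out when the partitions are spread over several machines.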
21.
Future Work
• MCC-SMOTE is computationally costly: it takes more time as the data size
increases.
• In future, changes to the algorithm can reduce the run time.
• Besides the techniques used, new methods can help to
improve performance.
• PCA can be used for feature reduction to obtain better results
22.
Conclusion
• Fraud in credit card transactions is a significant issue
• A proper methodology should be implemented to avoid huge financial losses
in this domain
• Machine Learning and Distributed Data Processing can handle the huge
amounts of generated data effectively
• The results can be used at the production level