CCFDS - Thesis II PPT.pptx

Distributed Data Processing & Learning
Approaches for Credit Card Fraud Detection
Systems (C.C.F.D.S)
Supervisor: Dr. Danish Mehmood
Taj Wali (MS DS - 4)

Introduction
• The credit card concept dates back to 1887; issuers such as Visa later popularized the modern credit card
• Millions of users rely on credit cards for transactions
• The rapid growth of credit card use at POS terminals and other offline and online channels has increased abuse
• Authorities have adopted many countermeasures, such as smart cards, but fraud continues to increase

Introduction
• In the proposed thesis, we use advanced machine learning techniques to detect fraudulent transactions.
• There is a major imbalance between the numbers of fraudulent and genuine transactions.
• To overcome this class imbalance, we use MCC-SMOTE and entropy-based techniques.
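The slides do not say how entropy is applied in the thesis. As one possible illustration only, the Shannon entropy of the class labels can quantify how imbalanced a dataset is. The function name `class_entropy` and the toy label lists are assumptions made for this sketch.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Shannon entropy (in bits) of the class distribution of `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical toy labels: 1 = fraudulent, 0 = genuine.
balanced = [0] * 50 + [1] * 50
imbalanced = [0] * 99 + [1] * 1

balanced_entropy = class_entropy(balanced)      # maximal for two classes
imbalanced_entropy = class_entropy(imbalanced)  # much lower for a 99/1 split
```

An even two-class split gives the maximum entropy of 1 bit, and the value drops toward 0 as one class dominates.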

Literature Review-[1]
Domain: Bank Sector Data
Contribution: They used a private dataset in which all the data was available and no feature was missing.
Feature Engineering: Transaction Aggregation and PCA
Classifier: Random Forest, CNN
Parameters: Accuracy, Precision, Recall, F1
Limitation: A private dataset is used, so it will be hard for future work to improve on the existing work.
Ref [1] Li Z, Huang M, Liu G, Jiang C. A hybrid method with dynamic weighted entropy for handling the problem of class imbalance with overlap in credit card fraud detection. Expert Systems with Applications. 2021 Aug 1;175:114750.

Literature Review-[2]
Domain: Bank Sector Data
Contribution: They handled entries of the same type, which cause unwanted correlation between records.
Feature Engineering: PCA & Feature Engineering techniques
Classifier: Random Forest, LR, Decision Tree
Parameters: Accuracy
Limitation: A private dataset is used, so it will be hard for future work to improve on the existing work.
Ref [2] Han Y, Yao S, Wen T, Tian Z, Wang C, Gu Z. Detection and Analysis of Credit Card Application Fraud Using Machine Learning Algorithms. In Journal of Physics: Conference Series 2020 Dec 1 (Vol. 1693, No. 1, p. 012064). IOP Publishing.

Literature Review-[3]
Domain: Bank Sector Data of Credit Card Users
Contribution: They used a real dataset to detect credit card fraud in online transactions.
Feature Engineering: PCA & Feature Engineering techniques + Filling Missing Values
Classifier: Deep Learning Methods
Parameters: Accuracy
Limitation: Deep learning is helpful but expensive in terms of memory usage and computational power.
Ref [3] Alghofaili Y, Albattah A, Rassam MA. A Financial Fraud Detection Model Based on LSTM Deep Learning Technique. Journal of Applied Security Research. 2020 Oct 1;15(4):498-516.

Literature Review-[4]
Domain: Singaporean Bank Data
Contribution: They improved on the results of existing work that used the same dataset and methods.
Feature Engineering: Filling Missing Values, Normalization & Data Integration
Classifier: Decision Tree
Parameters: Accuracy, F1, Precision
Limitation: All data in the dataset must be converted to numeric values and normalized.
Ref [4] Ahammad J, Hossain N, Alam MS. Credit card fraud detection using data pre-processing on imbalanced data-Both oversampling and undersampling. In Proceedings of the International Conference on Computing Advancements 2020 Jan 10 (pp. 1-4).
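Ref [4] concerns both oversampling and undersampling of imbalanced data. A minimal sketch of the two random baselines follows, assuming the majority class is at least as large as the minority class. The function names and toy records are hypothetical and are not taken from the cited paper.

```python
import random

def random_undersample(majority, minority, rng):
    """Drop majority samples at random until both classes have equal size."""
    return rng.sample(majority, len(minority)), list(minority)

def random_oversample(majority, minority, rng):
    """Duplicate minority samples at random until both classes have equal size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

rng = random.Random(0)  # fixed seed so the sketch is reproducible
genuine = [("genuine", i) for i in range(95)]  # hypothetical majority class
fraud = [("fraud", i) for i in range(5)]       # hypothetical minority class

under_majority, under_minority = random_undersample(genuine, fraud, rng)
over_majority, over_minority = random_oversample(genuine, fraud, rng)
```

Undersampling discards genuine transactions, while oversampling repeats existing fraud records without adding new information.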

Critical Evaluation Table
Ref. | Domain | Dataset | Preprocessing techniques | Feature engineering | Classifiers used | Limitations | Validation | Remarks
[31] | Bank Sector | Public (Kaggle) and private datasets | Data sampling for class imbalance and normalization techniques | Transaction aggregation | Random Forest, CNN | The private dataset is not publicly available | Accuracy, Precision, Recall, F1 | Two datasets were used in this study, one of which is private, so future work may be difficult
- | - | Kaggle dataset with a million records | All data is 100% populated, so no missing values need to be filled | Frequency variables were engineered based on credit card users' frequency | Linear Regression, Decision Trees, Random Forest | Entries of the same type cause unwanted correlation between records | Accuracy | PCA is best for dimensionality reduction in any credit card fraud detection system
[35] | Bank Sector | A real dataset of credit card frauds | Missing-value filling and basic data handling techniques | Encoder method | Deep Learning Methods | Some records are missing from the dataset due to confidentiality | Accuracy and loss function | Deep learning methods are helpful but expensive in terms of memory usage and computational power
- | - | Dataset from the Kaggle repository | Missing-value filling, normalization, and data integration | Oversampling and undersampling techniques to process the imbalanced classes | KNN | Performance could improve if all relevant features were included in the dataset during training | G-mean, F1, and AUC | Missing feature values force some rows to be removed, which can affect the overall results of the ML classifier

Scope
• This study focuses only on online transactions made through credit cards
• We use advanced SMOTE techniques to handle class imbalance in credit card transaction datasets
• We use a public dataset from the Kaggle repository to conduct the study

Significance
• This research shows how important it is to preprocess online credit card transaction data and handle its anomalies
• Millions of credit card transactions are made each day; most are genuine and only a few are fraudulent, but the stakes are high enough that these transactions need a proper mechanism for detection and handling
• The aim is to detect fraudulent transactions and then give future directions to the concerned authorities

Research Questions
RQ.1 What is the impact of using distributed data processing (DDP) for training the model?
RQ.2 Does training the CCFD model on multiple datasets improve its performance?
RQ.3 What is the impact of cluster-based feature engineering?
RQ.4 Which performance measure(s) are most adequate for detecting credit card fraud in online transactions?
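For RQ.4, the measures named elsewhere in the slides (Accuracy, Precision, Recall, F1, G-mean) can all be derived from confusion-matrix counts. This is a minimal sketch with hypothetical counts, assuming all denominators are non-zero. The function name `fraud_metrics` is an illustration and not part of the thesis.

```python
import math

def fraud_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion counts (fraud is the positive class)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # share of frauds that were caught
    specificity = tn / (tn + fp)       # share of genuine transactions kept clean
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "g_mean": g_mean,
    }

# Hypothetical counts: 1,000 transactions, of which 10 are fraudulent.
metrics = fraud_metrics(tp=6, fp=4, fn=4, tn=986)
```

With these counts, accuracy is 0.992 even though only 60% of the frauds are caught. This is why accuracy alone can be misleading on imbalanced data.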

Problem Statement
• Class imbalance is the fundamental problem in the credit card fraud detection domain
• Researchers have used SMOTE to handle this problem, but it can create synthetic instances that are noisy data points
• The cluster-based technique used in this thesis overcomes this issue
• Distributed SMOTE uses the most accurate clusters from the existing data for better results in less time
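The idea behind SMOTE, and one way clustering can limit noisy synthetic points, can be sketched as follows. This is an illustration under stated assumptions and not the thesis's MCC-SMOTE implementation. The cluster assignments are given by hand, and the function name `smote`, the toy points, and the parameter values are hypothetical.

```python
import math
import random

def smote(minority, n_new, k, rng):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical minority (fraud) samples, already grouped into two clusters.
clusters = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0)],
]

rng = random.Random(42)
# Applying SMOTE inside each cluster keeps every synthetic point within
# that cluster, instead of on a line that crosses the gap between clusters.
per_cluster = [pt for cluster in clusters for pt in smote(cluster, 4, 2, rng)]
```

If the same function were run on all six points at once with a large `k`, some synthetic points could fall between the two clusters, which is the kind of noisy instance described above.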

Proposed System Model

Results: Undersampled
Dataset | Random Forest (Precision / Recall / F1) | Decision Tree (Precision / Recall / F1) | SVM (Precision / Recall / F1) | Logistic Regression (Precision / Recall / F1)
Dataset1 | 76 / 50 / 89 | 92 / 71 / 82 | 88 / 86 / 88 | 88 / 76 / 91
Dataset2 | 82 / 72 / 82 | 91 / 69 / 76 | 90 / 76 / 83 | 95 / 60 / 92
Dataset3 | 90 / 68 / 92 | 93 / 70 / 84 | 92 / 80 / 82 | 82 / 71 / 90
Dataset4 | 82 / 80 / 80 | 89 / 74 / 91 | 95 / 72 / 84 | 86 / 78 / 77
Dataset5 | 81 / 69 / 83 | 82 / 84 / 78 | 82 / 91 / 78 | 82 / 82 / 87
Dataset6 | 89 / 72 / 90 | 87 / 68 / 81 | 80 / 74 / 75 | 81 / 84 / 82

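Results: Oversampled
Dataset | Random Forest (Precision / Recall / F1) | Decision Tree (Precision / Recall / F1) | SVM (Precision / Recall / F1) | Logistic Regression (Precision / Recall / F1)
Dataset1 | 70 / 77 / 80 | 92 / 75 / 77 | 80 / 86 / 87 | 80 / 70 / 88
Dataset2 | 82 / 72 / 82 | 95 / 69 / 76 | 90 / 77 / 83 | 95 / 87 / 92
Dataset3 | 88 / 80 / 88 | 93 / 80 / 89 | 92 / 80 / 82 | 74 / 73 / 91
Dataset4 | 82 / 72 / 80 | 82 / 74 / 89 | 87 / 73 / 85 | 74 / 76 / 95
Dataset5 | 81 / 64 / 83 | 82 / 84 / 78 | 83 / 91 / 78 | 82 / 83 / 89
Dataset6 | 83 / 60 / 89 | 92 / 91 / 80 | 78 / 74 / 79 | 80 / 82 / 82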

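Results: SMOTE
Dataset | Random Forest (Precision / Recall / F1) | Decision Tree (Precision / Recall / F1) | SVM (Precision / Recall / F1) | Logistic Regression (Precision / Recall / F1)
Dataset1 | 66 / 89 / 81 | 91 / 78 / 79 | 90 / 87 / 88 | 91 / 91 / 79
Dataset2 | 81 / 72 / 84 | 98 / 81 / 75 | 81 / 77 / 84 | 95 / 87 / 73
Dataset3 | 85 / 67 / 85 | 96 / 80 / 91 | 78 / 81 / 82 | 71 / 73 / 81
Dataset4 | 82 / 78 / 83 | 81 / 74 / 79 | 81 / 75 / 91 | 74 / 74 / 94
Dataset5 | 83 / 63 / 80 | 78 / 65 / 81 | 82 / 89 / 89 | 78 / 82 / 87
Dataset6 | 86 / 65 / 92 | 91 / 89 / 86 | 79 / 75 / 79 | 80 / 81 / 81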

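Results: MCC-SMOTE
Dataset | Random Forest (Precision / Recall / F1) | Decision Tree (Precision / Recall / F1) | SVM (Precision / Recall / F1) | Logistic Regression (Precision / Recall / F1)
Dataset1 | 81 / 67 / 81 | 91 / 70 / 81 | 91 / 92 / 91 | 92 / 70 / 81
Dataset2 | 83 / 71 / 86 | 92 / 72 / 69 | 93 / 76 / 81 | 90 / 66 / 92
Dataset3 | 83 / 56 / 81 | 94 / 81 / 91 | 92 / 81 / 82 | 89 / 65 / 81
Dataset4 | 81 / 81 / 79 | 92 / 72 / 82 | 90 / 63 / 84 | 87 / 76 / 93
Dataset5 | 84 / 64 / 86 | 91 / 81 / 86 | 87 / 80 / 90 | 84 / 79 / 90
Dataset6 | 87 / 70 / 89 | 93 / 55 / 69 | 82 / 71 / 91 | 82 / 81 / 87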

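Results: Distributed SMOTE
Dataset | Random Forest (Precision / Recall / F1) | Decision Tree (Precision / Recall / F1) | SVM (Precision / Recall / F1) | Logistic Regression (Precision / Recall / F1)
Dataset1 | 71 / 53 / 88 | 88 / 47 / 83 | 91 / 83 / 81 | 90 / 81 / 72
Dataset2 | 81 / 72 / 84 | 98 / 81 / 75 | 81 / 77 / 84 | 95 / 87 / 73
Dataset3 | 78 / 67 / 85 | 96 / 80 / 91 | 78 / 81 / 82 | 71 / 73 / 81
Dataset4 | 71 / 78 / 83 | 81 / 74 / 79 | 81 / 75 / 91 | 74 / 74 / 94
Dataset5 | 54 / 63 / 80 | 78 / 65 / 81 | 82 / 89 / 89 | 78 / 82 / 87
Dataset6 | 87 / 65 / 92 | 91 / 89 / 86 | 79 / 75 / 79 | 80 / 81 / 81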

Results: Comparison
Algorithm | Distributed MCC-SMOTE (Precision / Recall / F1) | MCC-SMOTE (Precision / Recall / F1) | SMOTE (Precision / Recall / F1) | Oversampling (Precision / Recall / F1) | Undersampling (Precision / Recall / F1)
LR | 81.33 / 79.66 / 81.3 | 87 / 72 / 87 | 81.5 / 81.3 / 82 | 80.8 / 78 / 89 | 83.3 / 68.5 / 86
DT | 88 / 72.6 / 85.3 | 81 / 71 / 79 | 82 / 77 / 81 | 89 / 78 / 81.5 | 89 / 72 / 82
SVM | 82 / 80 / 88.3 | 79 / 77 / 86 | 81 / 80 / 85 | 85 / 80 / 82 | 87 / 79 / 81
RF | 73.66 / 66.3 / 85.3 | 83.5 / 68 / 83 | 80.5 / 72 / 84 | 81 / 70 / 83 | 83 / 68 / 86
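The RF entries under the distributed variant in the comparison appear consistent with an unweighted mean of the Random Forest columns from the "Results: Distributed SMOTE" slide across the six datasets. The slides do not state how the comparison values were derived, so the averaging rule here is an assumption. This sketch checks only the Random Forest row.

```python
from statistics import mean

# Random Forest (Precision, Recall, F1) for Dataset1..Dataset6, copied from
# the "Results: Distributed SMOTE" slide.
rf_distributed = [
    (71, 53, 88),
    (81, 72, 84),
    (78, 67, 85),
    (71, 78, 83),
    (54, 63, 80),
    (87, 65, 92),
]

# Average each metric column over the six datasets.
rf_precision, rf_recall, rf_f1 = (
    round(mean(column), 2) for column in zip(*rf_distributed)
)
# rf_precision == 73.67, rf_recall == 66.33, rf_f1 == 85.33
```

These means agree with the reported 73.66, 66.3 and 85.3 up to rounding or truncation.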

Research Contribution
• Data pre-processing through MCC-SMOTE in a distributed environment (Apache Spark)
• Use of a DDP environment (Apache Spark) for lower-latency operations
• Assessment of the impact of training on multiple datasets with different data types, with an alert generated when an abnormal transaction occurs
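The general pattern of oversampling each data partition independently and then merging the results can be sketched with the Python standard library. This is a stand-in and not the thesis's Apache Spark code: a thread pool plays the role of Spark executors, and the partition contents, the function `oversample_partition`, and all parameter values are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def oversample_partition(task):
    """Interpolate between random pairs of minority points in one partition."""
    partition, n_new, seed = task
    rng = random.Random(seed)  # per-partition seed keeps results reproducible
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(partition, 2)
        gap = rng.random()
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return synthetic

# Hypothetical partitions, e.g. one per cluster of minority (fraud) samples.
partitions = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0)],
]
tasks = [(part, 3, seed) for seed, part in enumerate(partitions)]

# Each partition is processed independently, then the results are merged.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(oversample_partition, tasks))
synthetic = [point for part in results for point in part]
```

Because no partition needs data from another, the per-partition work can run in parallel, which is the property a distributed engine exploits.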

Future Work
• MCC-SMOTE is computationally costly: its run time grows as the dataset size increases
• In future, changes to the algorithm could reduce the run time
• Beyond the techniques used here, new methods could help improve performance
• PCA could be used for feature reduction to obtain better results

Conclusion
• Fraud in credit card transactions is a significant issue
• A proper methodology should be implemented to avoid huge financial losses in this domain
• Machine learning and distributed data processing can handle the huge volume of generated data effectively
• The results can be applied at the production level

Thank You
