Sensitivity: Internal
Distributed Data Processing & Learning
Approaches for Credit Card Fraud Detection
Systems (C.C.F.D.S)
Supervisor: Dr. Danish Mehmood
Taj Wali (MS DS - 4)
Introduction
• The credit card concept dates back to 1887; Visa later introduced the modern form of the credit card
• Millions of users rely on credit cards for transactions
• The rapid growth of credit card use at POS terminals and through other offline and online channels has increased abuse
• The concerned authorities have adopted many measures, such as introducing smart cards, but the problem is still growing
Introduction
• In the proposed thesis, we use advanced machine learning techniques to detect fraudulent transactions.
• There is a major difference between fraudulent and genuine transactions
• To overcome the undersampling issue, we will use MCC-SMOTE and entropy techniques.
Literature Review-[1]
Domain: Bank sector data
Contribution: They used a private dataset in which all the data was available and no feature was missing.
Feature Engineering: Transaction aggregation and PCA
Classifier: Random Forest, CNN
Parameters: Accuracy, Precision, Recall, F1
Limitation: A private dataset is used, so it will be hard for future work to improve on the existing results.
Ref [1] Li Z, Huang M, Liu G, Jiang C. A hybrid method with dynamic weighted entropy for handling the problem of class imbalance with overlap in credit card fraud detection. Expert Systems with Applications. 2021 Aug 1;175:114750.
Literature Review-[2]
Domain: Bank sector data
Contribution: They handled data entries of the same type, which cause an unwanted correlation between the records.
Feature Engineering: PCA and feature engineering techniques
Classifier: Random Forest, LR, Decision Tree
Parameters: Accuracy
Limitation: A private dataset is used, so it will be hard for future work to improve on the existing results.
Ref [2] Han Y, Yao S, Wen T, Tian Z, Wang C, Gu Z. Detection and Analysis of Credit Card Application Fraud Using Machine Learning Algorithms. In Journal of Physics: Conference Series 2020 Dec 1 (Vol. 1693, No. 1, p. 012064). IOP Publishing.
Literature Review-[3]
Domain: Bank sector data of credit card users
Contribution: They used a real dataset for detecting credit card fraud in online transactions.
Feature Engineering: PCA and feature engineering techniques, plus filling of missing values
Classifier: Deep learning methods
Parameters: Accuracy
Limitation: Deep learning is helpful but expensive in terms of memory use and computational power.
Ref [3] Alghofaili Y, Albattah A, Rassam MA. A Financial Fraud Detection Model Based on LSTM Deep Learning Technique. Journal of Applied Security Research. 2020 Oct 1;15(4):498-516.
Literature Review-[4]
Domain: Singaporean bank data
Contribution: They improved on the results of existing work that used the same dataset and the same methods.
Feature Engineering: Filling of missing values, normalization, and data integration
Classifier: Decision Tree
Parameters: Accuracy, F1, Precision
Limitation: All the data in a dataset needs to be converted to numeric values and normalized.
Ref [4] Ahammad J, Hossain N, Alam MS. Credit card fraud detection using data pre-processing on imbalanced data-Both oversampling and undersampling. In Proceedings of the International Conference on Computing Advancements 2020 Jan 10 (pp. 1-4).
Critical Evaluation Table

Ref. [31]
Domain: Bank sector
Dataset: Public (Kaggle) and private datasets
Preprocessing techniques: Data sampling for class imbalance and normalization techniques; transaction aggregation
Classifiers used: Random Forest, CNN
Limitations: The private dataset is not publicly available
Validation: Accuracy, Precision, Recall, F1
Remarks: Two datasets were used for this study, one of which is private, so future work will potentially be difficult

Ref. [32]
Domain: Bank sector
Dataset: Public dataset from Kaggle with a million records
Preprocessing techniques: All the data is 100% populated, so there is no need to fill in missing values; frequency variables were engineered based on the credit card users' frequency
Classifiers used: Linear Regression, Decision Trees, Random Forest
Limitations: Data entries of the same type cause an unwanted correlation between the records
Validation: Accuracy
Remarks: PCA is best for dimensionality reduction in any credit card fraud detection system

Ref. [35]
Domain: Bank sector
Dataset: A real dataset of credit card frauds
Preprocessing techniques: Missing value filling and basic data handling techniques; encoder method
Classifiers used: Deep learning methods
Limitations: Some of the records are missing from the dataset due to confidentiality issues
Validation: Accuracy and loss function
Remarks: Deep learning methods are helpful, but they are expensive in terms of memory use and computational power

Ref. [37]
Domain: Bank sector
Dataset: Public dataset from the Kaggle repository
Preprocessing techniques: Filling of missing values, normalization, and data integration; oversampling and undersampling techniques to process the imbalanced classes
Classifiers used: KNN
Limitations: Performance can be increased if all the relevant features are added to the dataset during training
Validation: G-mean, F1, and AUC
Remarks: Missing feature values cause some rows to be removed from the data, which potentially affects the overall results of the ML classifier
Scope
• This study focuses only on online transactions made through credit cards
• We use advanced SMOTE techniques to handle the class imbalance issue in credit card transaction datasets
• We use a public dataset from the Kaggle repository to conduct the study
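The results slides compare SMOTE-based methods against random undersampling and random oversampling. As a rough illustration of those two baselines only, the sketch below rebalances a toy dataset using the Python standard library. The function names, the toy data, and the 95:5 class split are assumptions made for illustration and are not taken from the thesis.

```python
import random
from collections import Counter


def group_by_class(rows, labels):
    """Collect the rows belonging to each class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows until every class has the minority-class count."""
    rng = random.Random(seed)
    groups = group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def random_oversample(rows, labels, seed=0):
    """Randomly duplicate rows until every class has the majority-class count."""
    rng = random.Random(seed)
    groups = group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: 95 genuine (label 0) and 5 fraudulent (label 1) rows.
rows = [[float(i)] for i in range(100)]
labels = [0] * 95 + [1] * 5

under_rows, under_labels = random_undersample(rows, labels)
over_rows, over_labels = random_oversample(rows, labels)
print(Counter(under_labels))  # 5 rows per class
print(Counter(over_labels))   # 95 rows per class
```

Undersampling discards most genuine transactions, while oversampling only repeats existing fraud rows; SMOTE-style methods instead synthesize new minority rows.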
Significance
• This research shows how important it is to preprocess the data and handle anomalies in online credit card transactions
• Millions of transactions are made each day through credit cards. Most are genuine and only a few are fraudulent, but because of their importance these transactions need a proper mechanism for detection and handling.
• The aim is to detect fraudulent transactions overall and then give future directions to the concerned authorities.
Research Questions
RQ.1 What is the impact of using distributed data processing (DDP) for training the model?
RQ.2 Does training the CCFD model on multiple datasets improve its performance?
RQ.3 What is the impact of cluster-based feature engineering?
RQ.4 Which performance measure(s) are most adequate for detecting credit card fraud in online transactions?
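To make RQ.4 concrete, the sketch below computes accuracy, precision, recall, and F1 from confusion-matrix counts. The counts are hypothetical and chosen only to show why accuracy alone can mislead on imbalanced data; they are not results from this study.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all transactions that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for the fraud (positive) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical case: 10,000 transactions, of which 20 are fraudulent.

# Model A flags nothing as fraud: high accuracy, but it catches no fraud.
acc_a = accuracy(tp=0, fp=0, fn=20, tn=9980)
prec_a, rec_a, f1_a = precision_recall_f1(tp=0, fp=0, fn=20)

# Model B catches 15 of the 20 frauds and raises 5 false alarms.
acc_b = accuracy(tp=15, fp=5, fn=5, tn=9975)
prec_b, rec_b, f1_b = precision_recall_f1(tp=15, fp=5, fn=5)

print(f"A: accuracy={acc_a:.3f} recall={rec_a:.2f} f1={f1_a:.2f}")
print(f"B: accuracy={acc_b:.3f} recall={rec_b:.2f} f1={f1_b:.2f}")
```

Both hypothetical models score above 99% accuracy, yet only recall and F1 separate the model that detects fraud from the one that does not.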
Problem Statement
• Class imbalance is the fundamental problem in the credit card fraud detection domain
• Researchers have used SMOTE to handle this problem, but the synthetic instances it creates can introduce noisy data points
• The cluster-based technique used in this thesis overcomes this issue
• The distributed SMOTE uses the most accurate clusters from the existing data for better results in less time
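For readers unfamiliar with SMOTE, the following is a minimal sketch of the plain SMOTE interpolation step, written with the standard library only. It is not the MCC-SMOTE or distributed variant used in the thesis, whose details the slides do not give. The function name, the parameter `k`, and the toy points are assumptions for illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Plain SMOTE: create synthetic minority samples by interpolating
    between a minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment from `base` to `neighbour`.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical fraud samples with two features each.
fraud = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(fraud, n_synthetic=6)
for point in new_points:
    print([round(value, 3) for value in point])
```

Because each synthetic point lies on a line between two minority samples, it can land in a region that is mostly populated by genuine transactions when the classes overlap. This is one way the noisy points mentioned above can arise.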
Proposed System Model
Results: Undersampled
Datasets | Random Forest       | Decision Tree       | SVM                 | Logistic Regression
         | Precision Recall F1 | Precision Recall F1 | Precision Recall F1 | Precision Recall F1
Dataset1 | 76        50     89 | 92        71     82 | 88        86     88 | 88        76     91
Dataset2 | 82        72     82 | 91        69     76 | 90        76     83 | 95        60     92
Dataset3 | 90        68     92 | 93        70     84 | 92        80     82 | 82        71     90
Dataset4 | 82        80     80 | 89        74     91 | 95        72     84 | 86        78     77
Dataset5 | 81        69     83 | 82        84     78 | 82        91     78 | 82        82     87
Dataset6 | 89        72     90 | 87        68     81 | 80        74     75 | 81        84     82
Results: Oversampled
Datasets | Random Forest       | Decision Tree       | SVM                 | Logistic Regression
         | Precision Recall F1 | Precision Recall F1 | Precision Recall F1 | Precision Recall F1
Dataset1 | 70        77     80 | 92        75     77 | 80        86     87 | 80        70     88
Dataset2 | 82        72     82 | 95        69     76 | 90        77     83 | 95        87     92
Dataset3 | 88        80     88 | 93        80     89 | 92        80     82 | 74        73     91
Dataset4 | 82        72     80 | 82        74     89 | 87        73     85 | 74        76     95
Dataset5 | 81        64     83 | 82        84     78 | 83        91     78 | 82        83     89
Dataset6 | 83        60     89 | 92        91     80 | 78        74     79 | 80        82     82
Results: SMOTE
Datasets | Random Forest       | Decision Tree       | SVM                 | Logistic Regression
         | Precision Recall F1 | Precision Recall F1 | Precision Recall F1 | Precision Recall F1
Dataset1 | 66        89     81 | 91        78     79 | 90        87     88 | 91        91     79
Dataset2 | 81        72     84 | 98        81     75 | 81        77     84 | 95        87     73
Dataset3 | 85        67     85 | 96        80     91 | 78        81     82 | 71        73     81
Dataset4 | 82        78     83 | 81        74     79 | 81        75     91 | 74        74     94
Dataset5 | 83        63     80 | 78        65     81 | 82        89     89 | 78        82     87
Dataset6 | 86        65     92 | 91        89     86 | 79        75     79 | 80        81     81
Results: MCC-SMOTE
Datasets | Random Forest       | Decision Tree       | SVM                 | Logistic Regression
         | Precision Recall F1 | Precision Recall F1 | Precision Recall F1 | Precision Recall F1
Dataset1 | 81        67     81 | 91        70     81 | 91        92     91 | 92        70     81
Dataset2 | 83        71     86 | 92        72     69 | 93        76     81 | 90        66     92
Dataset3 | 83        56     81 | 94        81     91 | 92        81     82 | 89        65     81
Dataset4 | 81        81     79 | 92        72     82 | 90        63     84 | 87        76     93
Dataset5 | 84        64     86 | 91        81     86 | 87        80     90 | 84        79     90
Dataset6 | 87        70     89 | 93        55     69 | 82        71     91 | 82        81     87
Results: Distributed SMOTE
Datasets | Random Forest       | Decision Tree       | SVM                 | Logistic Regression
         | Precision Recall F1 | Precision Recall F1 | Precision Recall F1 | Precision Recall F1
Dataset1 | 71        53     88 | 88        47     83 | 91        83     81 | 90        81     72
Dataset2 | 81        72     84 | 98        81     75 | 81        77     84 | 95        87     73
Dataset3 | 78        67     85 | 96        80     91 | 78        81     82 | 71        73     81
Dataset4 | 71        78     83 | 81        74     79 | 81        75     91 | 74        74     94
Dataset5 | 54        63     80 | 78        65     81 | 82        89     89 | 78        82     87
Dataset6 | 87        65     92 | 91        89     86 | 79        75     79 | 80        81     81
Results: Comparison
Classifier | Distributed MCC SMOTE | MCC SMOTE             | SMOTE                 | Oversampling          | Undersampling
           | Precision Recall F1   | Precision Recall F1   | Precision Recall F1   | Precision Recall F1   | Precision Recall F1
LR         | 81.33     79.66  81.3 | 87        72     87   | 81.5      81.3   82   | 80.8      78     89   | 83.3      68.5   86
DT         | 88        72.6   85.3 | 81        71     79   | 82        77     81   | 89        78     81.5 | 89        72     82
SVM        | 82        80     88.3 | 79        77     86   | 81        80     85   | 85        80     82   | 87        79     81
RF         | 73.66     66.3   85.3 | 83.5      68     83   | 80.5      72     84   | 81        70     83   | 83        68     86
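The comparison figures appear to be per-classifier averages over the six datasets. As a check under that assumption, the sketch below averages the Logistic Regression columns copied from the "Results: Distributed SMOTE" slide; the variable names are illustrative.

```python
from statistics import mean

# Logistic Regression (precision, recall, F1) for Dataset1 to Dataset6,
# copied from the "Results: Distributed SMOTE" slide.
lr_distributed = [
    (90, 81, 72),
    (95, 87, 73),
    (71, 73, 81),
    (74, 74, 94),
    (78, 82, 87),
    (80, 81, 81),
]

# zip(*rows) transposes the rows into one column per metric.
avg_precision, avg_recall, avg_f1 = (mean(col) for col in zip(*lr_distributed))
print(round(avg_precision, 2), round(avg_recall, 2), round(avg_f1, 2))
# prints: 81.33 79.67 81.33
```

These means agree with the LR entry under Distributed MCC SMOTE (81.33, 79.66, 81.3) if the slide values were truncated rather than rounded.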
Research Contribution
• Data pre-processing through MCC-SMOTE in a distributed environment (Apache Spark)
• Use of a DDP environment (Apache Spark) for lower-latency operations
• Assessment of the impact of training on multiple datasets with different data types, and generation of an alert when an abnormal transaction occurs
Future Work
• MCC-SMOTE is computationally costly: its run time grows as the dataset size increases.
• In future, changes to the algorithm could reduce the run time.
• Beyond the techniques used here, new methods could help improve performance.
• PCA could be used for feature reduction to obtain better results
Conclusion
• Fraud in credit card transactions is a significant issue
• A proper methodology should be implemented to avoid the huge financial losses in this domain
• Machine Learning and Distributed Data Processing can effectively handle the huge volume of data generated
• The results can be used at the production level
Thank You

Editor's Notes

  1. Paper Reference: Credit Card Fraud Detection Systems (CCFDS) using Machine Learning (Apache Spark)