Comparative study of various approaches for transaction Fraud Detection using Machine Learning Algorithms

COMPARATIVE STUDY OF VARIOUS APPROACHES
FOR TRANSACTION FRAUD DETECTION USING
DATA MINING
Submitted By
PRATIBHA SINGH
M.Tech(CS)/4505/14
Guided By :
Dr. B. B. SAGAR (Asst. Prof., Dept. of CSE)
Mrs. S. MALLIKA (Asst. Prof., Dept. of CSE)

CONTENT
• AIM
• INTRODUCTION
• LITERATURE SURVEY
• DEFINITION OF DATA MINING
• ORIGIN OF DATA MINING
• DATA MINING PROCESS
• APPROACHES TO USE DATA MINING
• DATA MINING TASKS
• WHY PREPROCESSING OF DATA IS REQUIRED
• DATA MINING TECHNIQUES AND TYPES OF FRAUD
• DESCRIPTION OF DATASET USED IN OUR STUDY
• IMPLEMENTATION OF NN, LR AND KNN
• PERFORMANCE EVALUATION OF VARIOUS MODELS
• RESULT
• CONCLUSION
• REFERENCES
• PAPER PUBLISHED

AIM
Our aim is to compare three predictive data mining techniques
(Neural Network, Logistic Regression and K-Nearest Neighbour) on a
dataset taken from a large Brazilian bank, with records in the time
window from Jul/14/2004 through Sep/12/2004. Each record represents a
credit card authorization; only approved transactions are included, and
denied transactions are not visible. The data mining predictions are
simulated in the R console, and the results are visualized as ROC, Lift
Chart, PR curve, Confusion Matrix and other skill scores for better
understanding.

INTRODUCTION
• Fraud detection is the process of detecting fraud through various
data mining and machine learning approaches.
• In this research we compare various approaches for transaction fraud
detection using data mining and machine learning techniques. The
comparison shows the results of these techniques applied to a real
dataset, based on certain parameters. With these results we show how
accurately each method works on real data. Our aim is to find the most
suitable method: one that helps in catching fraud, is cost sensitive
and reduces the false rate.

LITERATURE SURVEY
• In paper [5], S. Bhattacharyya et al. carried out a comparative study
of various approaches on one synthetic dataset and analysed the results
of logistic regression, random forest and SVM. Their results show that
LR performs best in that study.
• I took paper [5] as the base paper and carried out a comparative study
of LR, NN and KNN on a real dataset, which should be helpful for further
research.

DATA MINING
• The efficient discovery of previously unknown, valid, potentially
useful, understandable patterns in large datasets.
• The analysis of (often large) observational data sets to find
unsuspected relationships and to summarize the data in novel ways that
are both understandable and useful to the data owner.
TERMS
• Data
• Pattern
• Attribute
• Interestingness

ORIGINS OF DATA MINING
[Diagram: Data Mining shown together with AI / Machine Learning, Statistics and Database systems]

APPROACH TO USE DATA MINING
• Identify the problem
• Use data mining techniques to transform the data into information
• Act on the information
• Measure the results
General Approach
• Understand the domain
• Create a dataset
• Select the interesting attributes
• Data cleaning and preprocessing
• Choose the data mining task and the specific algorithm
• Interpret the results

Data Mining Tasks
• Classification: learning a function that maps an item into one of a
set of predefined classes
• Regression: learning a function that maps an item to a real value, or
an independent variable to a dependent variable
• Clustering: identify a set of groups of similar items
• Dependencies and associations: identify significant dependencies
between data attributes
• Summarization: find a compact description of the dataset or a subset
of the dataset

Why is Preprocessing of Data required?
• The available data does not fulfil the input requirements of the data
mining process.
• The attributes are not properly defined for the data mining process.
• Incomplete data: lacking attribute values, lacking certain attributes
of interest (e.g., customer information for sales transaction data),
containing only aggregate data, or having missing/unknown values.
• Noisy data: containing errors or outliers.
When data quality is poor, the results of data mining are also poor and
vary.
• Quality decisions must be based on quality data.
So, data cleaning is required.

DATA MINING TECHNIQUES
• Classification
• Regression
• Clustering
• Visualisation
• Outlier detection
• Prediction
TYPES OF FRAUD

DESCRIPTION OF THE DATASET TAKEN FOR THE STUDY
Symbol  Attribute          Description
A       MCC                Merchant Category Code
B       MCC_P              Merchant Category Code from previous transaction of the same credit card
C       Post_zip_code      Post/zip code
D       Post_zip_code_P    Post/zip code from previous transaction
E       Amount             Amount of money of the current transaction
F       Amount_P           Amount of money of the previous transaction
G       Type_T             Type of transaction: Card present (Normal transaction), Internet, Telephone, direct debit, etc.
H       Credit_limit       Credit limit of the account
I       Brand_scheme       Brand/scheme: Visa, MasterCard, Diners, JCB, etc.
J       Variant            Variant: Local, International, Gold, Platinum
K       F_score            Fraud score of the previous transaction (this is really important to know)
L       No_of_instalments  Number of instalments of the current transaction
M       Time_Last_T        Time in minutes since the last transaction
N       Diff_F_Score       Difference between the fraud score of the previous transaction and the fraud score of the one before it
O       M_T_Limit          Merchant transaction limit: maximum amount of money allowed for a transaction in that specific type of business
P       Flag               Fraud transaction flag (N = No fraud, S = Yes fraud)

NUMBER OF FRAUDULENT AND LEGITIMATE TRANSACTIONS IN EACH SPLIT OF THE DATASET

FRAUD DETECTION METHOD
Structure of the dataset
Attribute            Class
MCC                  33
MCC_P                33
Post_zip_code        10
Post_zip_code_P      10
Amount               10
Amount_P             10
Type_T               10
Credit_limit         10
Brand_scheme         6
Variant              6
F_score              10
VIP_Common           2
Local_International  2
No_of_instalments    4
Time_Last_T          8
Diff_F_Score         6
M_T_Limit            9
Flag                 2
Method
• Training Set -> Learn Classifier (NN, K-NN, LR) -> Model
• Test Set -> Model -> VALIDATION
• VALIDATION: Curves (ROC, PR, LIFT) and Confusion Matrix

FRAUD DETECTION USING NEURAL NETWORK
• Dataset: split into training and testing sets
• R-Script
• R-Library: neuralnet
• Algorithm: resilient backpropagation
• Hidden layer: c(4,2) OR 8
• Activation fn: logistic
• Error fn: 'sse' (sum of squared errors)
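To make the configuration above concrete, here is a minimal sketch of the forward pass of a network with hidden layers (4, 2), logistic activations and a sum-of-squared-errors measure. It is written in Python rather than the R used in the study, it is not the authors' script, and it does not implement resilient backpropagation training. The function names, the random weight initialisation and the toy rows are assumptions made for illustration only.

```python
import math
import random


def logistic(z):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_inputs, hidden=(4, 2), seed=0):
    """Random weights for n_inputs -> hidden layers -> 1 output.

    Each neuron is a list of weights whose first entry is the bias.
    The initialisation range is an assumption, not taken from the study.
    """
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden, 1]
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(sizes[i] + 1)]
         for _ in range(sizes[i + 1])]
        for i in range(len(sizes) - 1)
    ]


def forward(network, x):
    """Propagate one input row through every layer and return the output."""
    activations = list(x)
    for layer in network:
        activations = [
            logistic(w[0] + sum(wi * a for wi, a in zip(w[1:], activations)))
            for w in layer
        ]
    return activations[0]


def sum_squared_error(network, rows, targets):
    """Sum of squared differences between outputs and targets."""
    return sum((forward(network, x) - t) ** 2 for x, t in zip(rows, targets))


# Hypothetical, already normalised rows with 0/1 targets.
rows = [[0.1, 0.5, 0.9], [0.8, 0.2, 0.4]]
targets = [0, 1]
net = init_network(n_inputs=3)
error = sum_squared_error(net, rows, targets)
```

Passing hidden=(8,) gives the single hidden layer of 8 nodes that the slide lists as the alternative.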

R Code for Implementation of Neural Network

FRAUD DETECTION USING K-NEAREST NEIGHBOUR
• Dataset
• Normalization: (x - min(x)) / (max(x) - min(x))
• Normalized dataset split into trainset and testset
• Categorical class: trainset target, testset target
• R-Script
• R-Library: class
• Algorithm: knn
• k = 1, 2, 3, 4, 5
• Best K
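The K-NN steps listed above (min-max normalization, a train/test split, trying k = 1 to 5 and keeping the best K) can be sketched as follows. This is a Python illustration rather than the R script used in the study. The toy rows, the N/S labels, the choice of Euclidean distance and the use of test accuracy to pick K are assumptions.

```python
from collections import Counter


def min_max_normalize(rows):
    """Rescale each column to [0, 1] with (x - min(x)) / (max(x) - min(x))."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training rows closest to query."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_x, train_y)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def best_k(train_x, train_y, test_x, test_y, candidates=(1, 2, 3, 4, 5)):
    """Return the candidate k with the highest accuracy on the test rows."""
    def accuracy(k):
        hits = sum(knn_predict(train_x, train_y, q, k) == t
                   for q, t in zip(test_x, test_y))
        return hits / len(test_y)
    return max(candidates, key=accuracy)


# Hypothetical rows: [amount, minutes since last transaction]; N = no fraud, S = fraud.
rows = [[10, 300], [12, 280], [11, 310], [500, 2],
        [520, 1], [480, 3], [13, 290], [510, 2]]
labels = ["N", "N", "N", "S", "S", "S", "N", "S"]

scaled = min_max_normalize(rows)
train_x, train_y = scaled[:6], labels[:6]
test_x, test_y = scaled[6:], labels[6:]
chosen_k = best_k(train_x, train_y, test_x, test_y)
```

Normalizing before computing distances matters here because the two toy columns have very different ranges; without it the amount column would dominate the vote.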

R CODE FOR IMPLEMENTATION OF
K-NEAREST NEIGHBOR

FRAUD DETECTION USING LOGISTIC REGRESSION
• Dataset: split into training and testing sets
• R-Script
• R-Library: in-built (stats)
• Function: glm
• Family: binomial
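As an illustration of what a binomial-family model fits, the sketch below estimates an intercept and coefficients for a logistic model. It is Python, not the study's R script, and it swaps in plain batch gradient descent for the iteratively reweighted least squares that glm uses. The learning rate, epoch count and toy data are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, x):
    """Probability of class 1; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


def fit_logistic(rows, targets, learning_rate=0.5, epochs=2000):
    """Minimise the binomial log-loss by batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for x, t in zip(rows, targets):
            err = predict_proba(weights, x) - t
            grads[0] += err
            for j, v in enumerate(x, start=1):
                grads[j] += err * v
        weights = [w - learning_rate * g / n for w, g in zip(weights, grads)]
    return weights


# Hypothetical single normalised feature; 1 marks the class of interest.
rows = [[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]]
targets = [0, 0, 0, 0, 1, 1, 1, 1]
weights = fit_logistic(rows, targets)
low = predict_proba(weights, [0.1])
high = predict_proba(weights, [0.9])
```

The fitted probabilities can then be thresholded for a confusion matrix or ranked for ROC and lift analysis.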

R CODE FOR IMPLEMENTATION OF
LOGISTIC REGRESSION

FRAUD DETECTION MODEL VALIDATION
ROC Analysis
• ROC stands for “Receiver Operating Characteristic”
• From signal processing: trade-off between hit rate and false alarm
rate over a noisy channel
• Compute FPR and TPR and plot them in ROC space
• Every classifier is a point in ROC space
• For probabilistic algorithms
• Area Under Curve (AUC): the example curve shown has AUC = 0.75
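A short Python sketch of how the ROC points and the AUC can be computed from classifier scores: the threshold is lowered through every distinct score, each threshold gives one (FPR, TPR) point, and the area is taken with the trapezoid rule. The scores and labels are made-up values, not results from the study, and label 1 is assumed to mark the positive class.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per distinct score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every item tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores for eight transactions; 1 = positive class.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)  # 0.8125 for this toy data
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.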

FRAUD DETECTION MODEL VALIDATION (CONT.)
Confusion Matrix
               actual +              actual -
predicted +    TP (true positive)    FP (false positive)
predicted -    FN (false negative)   TN (true negative)
total          TP+FN                 FP+TN
TP Rate (Sensitivity, Recall) = TP / (TP+FN)
FP Rate (fall-out) = FP / (FP+TN)
Precision = TP / (TP+FP)

FRAUD DETECTION MODEL VALIDATION (CONT.)
Positive Predictive Value (PPV)
P(TP): % True Positives: Sensitivity
P(FP): % False Positives: 1 – Specificity
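The measures on these validation slides all follow from the four confusion-matrix counts. The Python sketch below counts TP, FP, FN and TN from actual and predicted flags and derives sensitivity, fall-out, specificity, precision (PPV) and accuracy. The flags are invented for illustration, and treating S (fraud) as the positive class is an assumption.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, FN and TN for the given positive class label."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def measures(tp, fp, fn, tn):
    """Standard measures derived from the confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),          # TP rate, recall
        "fall_out": fp / (fp + tn),             # FP rate, 1 - specificity
        "specificity": tn / (fp + tn),
        "precision": tp / (tp + fp),            # positive predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical flags for ten transactions: S = fraud, N = no fraud.
actual = ["S", "S", "S", "N", "N", "N", "N", "N", "N", "N"]
predicted = ["S", "S", "N", "N", "N", "N", "N", "N", "S", "N"]
counts = confusion_counts(actual, predicted, positive="S")
result = measures(*counts)
```

Which class is treated as positive changes every measure except accuracy, so it should be stated alongside any reported sensitivity or specificity.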

PERFORMANCE MEASURES CALCULATED FROM THE CONFUSION MATRIX OF NEURAL
NETWORK (NN), K-NEAREST NEIGHBOUR (KNN) AND LOGISTIC REGRESSION (LR)
Performance Measure   Neural Network   K-NN        LR
Accuracy              96.2 %           97.14 %     96.19 %
AUC                   0.856            0.77        0.86
Execution Time        17 seconds       3 seconds   Instant
Detection rate        96.06 %          95.03 %     95.7 %
Sensitivity           99.7 %           98.7 %      99.4 %
Specificity           4 %              56.7 %      12.2 %
Precision             96.4 %           98.3 %      96.7 %
RMSE                  0.194            0.16        0.17
RSquare               0.14             0.33        0.14

RESULTS (LIFT CHART)
Lift is a measure of the effectiveness of a predictive model, calculated
as the ratio between the results obtained with and without the
predictive model.
[Lift charts: NN, K-NN, LR]
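One common way to turn this definition of lift into a number is to compare the positive rate among the top-scored fraction of cases (with the model) against the overall positive rate (without the model). The Python sketch below does this; the scores, labels and chosen fractions are illustrative assumptions, not values from the study.

```python
def lift_at(scores, labels, fraction):
    """Lift for the top-scored fraction of cases.

    Ratio of the positive rate among the highest-scored cases to the
    positive rate in the whole set (selection without a model).
    """
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda pair: -pair[0])]
    top_n = max(1, round(len(ranked) * fraction))
    rate_with_model = sum(ranked[:top_n]) / top_n
    rate_without_model = sum(ranked) / len(ranked)
    return rate_with_model / rate_without_model


# Hypothetical scores for ten transactions; 1 = positive class.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

lift_top_20 = lift_at(scores, labels, 0.2)
lift_top_50 = lift_at(scores, labels, 0.5)
lift_all = lift_at(scores, labels, 1.0)
```

A lift chart plots this ratio for increasing fractions; the curve falls towards 1 as the selected fraction approaches the whole dataset.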

CONCLUSION
• The performance of the three data mining algorithms was compared. We
found that the fraud detection rate of NN is the highest of the three
algorithms studied and that its specificity is the lowest, which we
take to show that NN is a better predictive model than the other two.
On execution time and RMSE, however, KNN is better than NN. Overall,
the results show that KNN and NN perform better than Logistic
Regression, but both take some execution time to process large data.
Better choices of arguments and functions, such as the backpropagation
variant and the number of hidden nodes, can reduce the execution time
of NN; likewise, the value of K and other function settings affect the
execution time of KNN.
• We believe that these results are very promising and support a
multi-algorithmic approach to classifying and assessing large, noisy,
real-time data sets. Future work will focus on testing the algorithms
and resolution strategies on similarly complex data sets from other
real-world domains.

REFERENCES
[1] S. Benson Edwin Raj, A. Annie Portia, “Analysis on Credit Card Fraud Detection Methods”,
IEEE International Conference on Computer, Communication and Electrical Technology,
pp. 152-156, 2011.
[2] C. Haruna, S. Abdul-Kareem, A. Abubakar, “A Framework for selecting the optimal technique
suitable for application in data mining task”, Future Information Technology, pp. 163-169, 2014.
[3] Manoel Fernando Alonso Gadi, Xidi Wang, Alair Pereira do Lago, “Credit card fraud detection
with artificial immune system”, in ICARIS 2008, 7th International Conference, Phuket, Thailand,
August 10-13, 2008, Proceedings, Lecture Notes in Computer Science, vol. 5132, pp. 119-131,
Springer, 2008.
[4] E.W.T. Ngai, Yong Hu, Y.H. Wong, Yijun Chen, Xin Sun, “The application of data mining techniques
in financial fraud detection: A classification framework and an academic review of literature”,
Elsevier Decision Support Systems, vol. 50, pp. 559-569, 2011.
[5] S. Bhattacharyya, S. Jha, K. Tharakunnel, J.C. Westland, “Data mining for credit card fraud: A
comparative study”, Elsevier Decision Support Systems, 2011.
[6] Usama M. Fayyad, et al., Advances in Knowledge Discovery and Data Mining, Cambridge,
Mass.: MIT Press, 1996.
[7] Jun Han, Claudio Moraga, “The influence of the sigmoid function parameters on the speed of
backpropagation learning”, in José Mira, Francisco Sandoval (eds.), From Natural to Artificial
Neural Computation, pp. 195-201, 1995.
[8] Neda S. Halvaiee, M. Kazem Akbari, “A novel model for credit card fraud detection using Artificial
Immune Systems”, Elsevier Applied Soft Computing, vol. 24, pp. 40-49, 2014.
[9] F. Campos, S. Cavalcante, “An extended approach for Dempster-Shafer theory”, in
Proceedings of the IEEE International Conference on Information Reuse and Integration,
pp. 338-344, 2003.

RESEARCH PAPER PUBLISHED
IEEE CONFERENCE ID: 37465
