Comparative study of various approaches for transaction Fraud Detection using Machine Learning Algorithms
1. COMPARATIVE STUDY OF VARIOUS APPROACHES
FOR TRANSACTION FRAUD DETECTION USING
DATA MINING
Submitted By
PRATIBHA SINGH
M.Tech(CS)/4505/14
Guided By :
Dr. B. B. SAGAR (Asst. Prof., Dept. of CSE)
Mrs. S. MALLIKA (Asst. Prof., Dept. of CSE)
2. CONTENT
AIM
INTRODUCTION
LITERATURE SURVEY
DEFINITION OF DATA MINING
ORIGIN OF DATA MINING
DATA MINING PROCESS
APPROACHES TO USE DATA MINING
DATA MINING TASKS
WHY PREPROCESSING OF DATA IS REQUIRED
DATA MINING TECHNIQUES AND TYPES OF FRAUD
DESCRIPTION OF DATASET USED IN OUR STUDY
IMPLEMENTATION OF NN, LR AND KNN
PERFORMANCE EVALUATION OF VARIOUS MODELS
RESULT
CONCLUSION
REFERENCES
PAPER PUBLISHED
3. AIM
Our aim is to compare three predictive data mining techniques (Neural Network, Logistic Regression and K-Nearest Neighbour) on a dataset taken from a large Brazilian bank, with records falling within the time window from Jul/14/2004 through Sep/12/2004. Each record represents a credit card authorization; only approved transactions are included, and denied transactions are not visible. The data mining predictions are simulated in the R console, and the results are visualized as ROC curves, lift charts, PR curves, confusion matrices and other skill scores for better understanding.
4. INTRODUCTION
Fraud detection is the process of detecting fraud through various data mining and machine learning approaches.
In this research we compare various approaches for transaction fraud detection using data mining and machine learning techniques. The comparison shows the results of several transaction fraud detection techniques applied to a real dataset and evaluated on certain parameters. With these results we show how accurately each method works on real data. Our aim is to find the most suitable method: one that helps in catching fraud, is cost sensitive and reduces the false alarm rate.
5. LITERATURE SURVEY
In paper [5], S. Bhattacharyya et al. carried out a comparative study of various approaches on one synthetic dataset and analysed the results of logistic regression, random forest and SVM. Their results show that LR performed best in that study.
I took paper [5] as the base paper and carried out a comparative study of LR, NN and KNN on a real dataset, which should be helpful for further research.
6. DATA MINING
Data mining is the efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets.
It is also defined as the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.
TERMS
Data
Pattern
Attribute
Interestingness
9. APPROACH TO USE DATA MINING
Identify the problem
Use data mining techniques to transform the data into information
Act on the information
Measure the results
General Approach
Understand the domain
Create a dataset
Select the interesting attributes
Data cleaning and preprocessing
Choose the data mining task and the specific algorithm
Interpret the results
10. Data Mining Tasks
Classification: learning a function that maps an item into one of a set of predefined classes
Regression: learning a function that maps an item to a real value, or an independent variable to a dependent variable
Clustering: identifying a set of groups of similar items
Dependencies and associations: identifying significant dependencies between data attributes
Summarization: finding a compact description of the dataset or a subset of the dataset
11. Why Preprocessing of Data is required?
The available data does not fulfil the input requirements of the data mining process
The attributes are not properly defined for the data mining process
Incomplete data: lacking attribute values, lacking certain attributes of interest (e.g., customer information for sales transaction data), containing only aggregate data, or having missing/unknown values
Noisy data: containing errors or outliers
If the data quality is poor, the results of data mining are also poor and inconsistent
Quality decisions must be based on quality data
So data cleaning is required
13. DESCRIPTION OF THE DATASET TAKEN FOR THE STUDY
Header attributes:
Symbol  Attribute          Description
A       MCC                Merchant Category Code
B       MCC_P              Merchant Category Code from the previous transaction of the same credit card
C       Post_zip_code      Post/zip code
D       Post_zip_code_P    Post/zip code from the previous transaction
E       Amount             Amount of money of the current transaction
F       Amount_P           Amount of money of the previous transaction
G       Type_T             Type of transaction: card present (normal transaction), Internet, telephone, direct debit, etc.
H       Credit_limit       Credit limit of the account
I       Brand_scheme       Brand/scheme: Visa, MasterCard, Diners, JCB, etc.
J       Variant            Variant: Local, International, Gold, Platinum
K       F_score            Fraud score of the previous transaction (a particularly important attribute)
L       No_of_instalments  Number of instalments of the current transaction
M       Time_Last_T        Time in minutes since the last transaction
N       Diff_F_Score       Difference between the fraud score of the previous transaction and the fraud score of the one before it
O       M_T_Limit          Merchant transaction limit: maximum amount of money allowed for a transaction in that specific type of business
P       Flag               Fraud transaction flag (N = no fraud, S = fraud)
19. FRAUD DETECTION USING K-NEAREST NEIGHBOUR
Dataset: split into trainset and testset
Normalization: (x - min(x)) / (max(x) - min(x))
Class: categorical (trainset target and testset target)
R-Script
R-Library: class
Algorithm: knn
k = 1, 2, 3, 4, 5
Best K
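The pipeline above (min-max normalization, a train/test split, and a search over k = 1 to 5) can be sketched as follows. This is a minimal Python illustration using only the standard library, not the R script used in the study; the function names and the toy two-attribute data are hypothetical, and the class labels follow the dataset's Flag convention (N = no fraud, S = fraud).

```python
import math
from collections import Counter


def min_max_normalize(column):
    """Rescale a list of numbers to [0, 1] with (x - min(x)) / (max(x) - min(x))."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: avoid division by zero by mapping every value to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def normalize_rows(rows):
    """Apply min-max normalization to every column of a list of rows."""
    columns = [min_max_normalize(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


def knn_predict(train_rows, train_labels, query, k):
    """Return the majority label among the k training rows closest to query."""
    by_distance = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def best_k(train_rows, train_labels, test_rows, test_labels, candidates):
    """Pick the k with the highest test accuracy (ties go to the first candidate)."""
    def accuracy(k):
        hits = sum(
            knn_predict(train_rows, train_labels, row, k) == label
            for row, label in zip(test_rows, test_labels)
        )
        return hits / len(test_rows)
    return max(candidates, key=accuracy)


# Hypothetical toy data: [amount, minutes since last transaction].
rows = [[10, 300], [12, 280], [15, 310], [900, 2],
        [950, 5], [870, 1], [11, 290], [920, 3]]
labels = ["N", "N", "N", "S", "S", "S", "N", "S"]

normalized = normalize_rows(rows)
train_rows, test_rows = normalized[:6], normalized[6:]
train_labels, test_labels = labels[:6], labels[6:]

prediction = knn_predict(train_rows, train_labels, test_rows[0], k=3)
chosen_k = best_k(train_rows, train_labels, test_rows, test_labels, [1, 2, 3, 4, 5])
```

Normalizing before computing distances keeps a large-valued attribute such as the amount from dominating a small-valued one, which is why the normalization step precedes the knn call in the flow above.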
20. R CODE FOR IMPLEMENTATION OF
K-NEAREST NEIGHBOR
21. FRAUD DETECTION USING LOGISTIC REGRESSION
Dataset: split into training and testing sets
R-Script
R-Library: stats (built in; generalized linear models)
Function: glm
Family: binomial
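A binomial-family model of this kind estimates the probability of fraud as a logistic function of a weighted sum of the attributes. The sketch below illustrates the idea in Python with the standard library only. It is not the study's R script, and it fits the weights by plain batch gradient descent rather than the iteratively reweighted least squares that R's glm uses; the function names, learning rate, epoch count and toy data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: maps a log-odds value to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # numerically safer branch for negative z
    return ez / (1.0 + ez)


def predict_proba(intercept, weights, x):
    """Estimated probability that row x belongs to the positive (fraud) class."""
    return sigmoid(intercept + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(rows, labels, learning_rate=0.5, epochs=2000):
    """Fit intercept and weights by batch gradient descent on the log-loss."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    intercept = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = predict_proba(intercept, weights, x) - y
            grad_b += error
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
        intercept -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return intercept, weights


# Hypothetical toy data, already scaled to [0, 1]:
# [normalized amount, normalized time since last transaction]; 1 = fraud.
rows = [[0.01, 0.95], [0.02, 0.90], [0.05, 0.99],
        [0.97, 0.01], [1.00, 0.02], [0.93, 0.00]]
labels = [0, 0, 0, 1, 1, 1]

intercept, weights = fit_logistic(rows, labels)
p_fraud_like = predict_proba(intercept, weights, [0.95, 0.01])
p_legit_like = predict_proba(intercept, weights, [0.03, 0.92])
```

A transaction is then flagged as fraud when its estimated probability exceeds a chosen threshold; 0.5 is a common default, but the threshold is a separate design decision.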
22. R CODE FOR IMPLEMENTATION OF
LOGISTIC REGRESSION
23. FRAUD DETECTION MODEL VALIDATION
ROC Analysis
ROC stands for “Receiver Operating Characteristic”
From signal processing: trade-off between hit rate and false alarm rate over a noisy channel
Compute FPR and TPR and plot them in ROC space
Every classifier is a point in ROC space
For probabilistic algorithms
Area Under Curve (AUC): AUC = 0.75
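To make the FPR/TPR computation concrete, the following Python sketch builds ROC points by lowering a decision threshold over model scores and integrates them with the trapezoidal rule. It uses only the standard library; the function names, labels and scores are hypothetical and are not results from the study.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold over the scores.

    labels: 1 for fraud (positive), 0 for legitimate (negative).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all items tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores for six transactions (three fraud, three legitimate).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = auc(points)  # 8/9, about 0.889, for this toy example
```

An AUC of 0.5 corresponds to ranking at random and 1.0 to a perfect ranking, so a value such as 0.75 lies between the two.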
28. RESULTS (LIFT CHART)
Lift is a measure of the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the predictive model.
[Lift charts: NN, K-NN, LR]
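As a worked illustration of this ratio, the Python sketch below computes lift for the top-scoring fraction of transactions: the fraud rate among the transactions the model ranks highest, divided by the overall fraud rate that random selection would give. The function name and the toy data are hypothetical and are not taken from the study.

```python
def lift_at(labels, scores, fraction):
    """Lift in the top `fraction` of transactions ranked by model score.

    Lift = fraud rate among the selected transactions (with the model)
           / overall fraud rate (without the model, i.e. random selection).
    labels: 1 for fraud, 0 for legitimate.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_selected = max(1, round(fraction * len(ranked)))
    selected_rate = sum(label for _, label in ranked[:n_selected]) / n_selected
    overall_rate = sum(labels) / len(labels)
    return selected_rate / overall_rate


# Hypothetical toy data: ten transactions, two of them fraudulent.
labels = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
scores = [0.1, 0.9, 0.2, 0.3, 0.1, 0.4, 0.8, 0.2, 0.3, 0.5]

lift_top_20 = lift_at(labels, scores, 0.2)   # both frauds fall in the top 20%
lift_top_50 = lift_at(labels, scores, 0.5)
lift_all = lift_at(labels, scores, 1.0)      # selecting everything gives lift 1
```

A lift chart plots this value against the selected fraction, so a curve that stays well above 1 for small fractions indicates a model that concentrates fraud cases near the top of its ranking.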
30. CONCLUSION
The performance of the three data mining algorithms was compared. We found that the fraud detection rate of NN is the highest of the three algorithms studied while its specificity is the lowest, which we take to show that NN is a better predictive model than the other two. In terms of execution time and RMSE, however, the KNN algorithm is better than NN. Overall, KNN and NN both performed better than Logistic Regression, but both take considerable execution time on large data. Choosing better arguments and functions for the model, such as the backpropagation settings and the number of hidden nodes, can reduce the execution time for NN; likewise, the value of K and other function arguments affect the execution time of the KNN model.
We believe that these results are very promising and supportive of a multi-algorithmic approach to classifying and assessing large, noisy and real-time data sets. Future work will focus on testing the algorithms and resolution strategies on similarly complex data sets from other real-world domains.
31. REFERENCES
[1] S. Benson Edwin Raj, A. Annie Portia, "Analysis on Credit Card Fraud Detection Methods", IEEE International Conference on Computer, Communication and Electrical Technology, pp. 152-156, 2011.
[2] C. Haruna, S. Abdul-Kareem, A. Abubakar, "A Framework for Selecting the Optimal Technique Suitable for Application in Data Mining Task", Future Information Technology, pp. 163-169, 2014.
[3] Manoel Fernando Alonso Gadi, Xidi Wang, Alair Pereira do Lago, "Credit Card Fraud Detection with Artificial Immune System", in ICARIS 2008: 7th International Conference, Phuket, Thailand, August 10-13, 2008, Proceedings, Lecture Notes in Computer Science, vol. 5132, pp. 119-131, Springer, 2008.
[4] E.W.T. Ngai, Yong Hu, Y.H. Wong, Yijun Chen, Xin Sun, "The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature", Decision Support Systems (Elsevier), vol. 50, pp. 559-569, 2011.
[5] S. Bhattacharyya, S. Jha, K. Tharakunnel, J.C. Westland, "Data mining for credit card fraud: A comparative study", Decision Support Systems (Elsevier), 2011.
[6] Usama M. Fayyad et al., Advances in Knowledge Discovery and Data Mining, Cambridge, Mass.: MIT Press, 1996.
[7] Jun Han, Claudio Moraga, "The influence of the sigmoid function parameters on the speed of backpropagation learning", in José Mira, Francisco Sandoval (eds.), From Natural to Artificial Neural Computation, pp. 195-201, 1995.
[8] Neda S. Halvaiee, M. Kazem Akbari, "A novel model for credit card fraud detection using Artificial Immune Systems", Applied Soft Computing (Elsevier), vol. 24, pp. 40-49, 2014.
[9] F. Campos, S. Cavalcante, "An extended approach for Dempster-Shafer theory", in Proceedings of the IEEE International Conference on Information Reuse and Integration, pp. 338-344, 2003.