COMPARATIVE STUDY OF VARIOUS APPROACHES
FOR TRANSACTION FRAUD DETECTION USING
DATA MINING
Submitted By
PRATIBHA SINGH
M.Tech(CS)/4505/14
Guided By :
Dr. B. B. SAGAR (Asst. Prof., Dept. of CSE)
Mrs. S. MALLIKA (Asst. Prof., Dept. of CSE)
CONTENT
• AIM
• INTRODUCTION
• LITERATURE SURVEY
• DEFINITION OF DATA MINING
• ORIGIN OF DATA MINING
• DATA MINING PROCESS
• APPROACHES TO USE DATA MINING
• DATA MINING TASKS
• WHY PREPROCESSING OF DATA IS REQUIRED
• DATA MINING TECHNIQUES AND TYPES OF FRAUD
• DESCRIPTION OF DATASET USED IN OUR STUDY
• IMPLEMENTATION OF NN, LR AND KNN
• PERFORMANCE EVALUATION OF VARIOUS MODELS
• RESULT
• CONCLUSION
• REFERENCES
• PAPER PUBLISHED
AIM
Our aim is to compare three predictive data-mining techniques
(Neural Network, Logistic Regression and K-Nearest Neighbour)
on a dataset taken from a large Brazilian bank, with records in
the time window from Jul/14/2004 through Sep/12/2004. Each
record represents a credit card authorization; only approved
transactions are included, and denied transactions are excluded.
The data mining predictions are simulated in the R console, and
the results are visualized as ROC curves, lift charts, PR curves,
confusion matrices and other skill scores.
INTRODUCTION
• Fraud detection is the process of detecting fraud through
various data mining and machine learning approaches.
• In this research we compare various approaches for
transaction fraud detection using data mining and machine
learning techniques. The comparison shows the results of these
techniques applied to a real dataset, based on certain
parameters. With these results we show how accurately each
method works on real data. Our aim is to find the most suitable
method: one that helps in catching fraud, is cost sensitive and
reduces the rate of false predictions.
LITERATURE SURVEY
• In paper [5], S. Bhattacharyya et al. carried out a comparative
study of various approaches on one synthetic dataset and analysed
the results of logistic regression, random forest and SVM. Their
results show that LR performed best.
• I took paper [5] as the base paper and carried out a comparative
study of LR, NN and KNN on a real dataset, which should be helpful
for further research.
DATA MINING
• The efficient discovery of previously unknown, valid,
potentially useful, understandable patterns in large datasets.
• The analysis of (often large) observational data sets to find
unsuspected relationships and to summarize the data in novel
ways that are both understandable and useful to the data owner.
TERMS
• Data
• Pattern
• Attribute
• Interestingness
ORIGINS OF DATA MINING
Data mining draws on:
• AI / Machine Learning
• Statistics
• Database systems
KNOWLEDGE DISCOVERY
APPROACH TO USE DATA MINING
General Approach
• Identify the problem
• Use data mining techniques to transform the data into information
• Act on the information
• Measure the results
Process steps
• Understand the domain
• Create a dataset
• Data cleaning and preprocessing
• Select the interesting attributes
• Choose the data mining task and the specific algorithm
• Interpret the results
Data Mining Tasks
• Classification: learning a function that maps an item into one
of a set of predefined classes
• Regression: learning a function that maps an item to a real
value, or an independent variable to a dependent variable
• Clustering: identifying a set of groups of similar items
• Dependencies and associations: identifying significant
dependencies between data attributes
• Summarization: finding a compact description of the dataset or
a subset of the dataset
Why is Preprocessing of Data Required?
• The available data does not fulfil the input requirements of
the data mining process.
• The attributes are not properly defined for the data mining
process.
• Incomplete data: lacking attribute values, lacking certain
attributes of interest (e.g., customer information for sales
transaction data), containing only aggregate data, or having
missing/unknown values for some data.
• Noisy data: containing errors or outliers.
If the data quality is poor, the results of data mining are also
poor and vary.
• Quality decisions must be based on quality data.
So data cleaning is required.
DATA MINING TECHNIQUES
• Classification
• Regression
• Clustering
• Visualisation
• Outlier detection
• Prediction
TYPES OF FRAUD
DESCRIPTION OF THE DATASET TAKEN FOR THE STUDY
Header  Attribute          Description
A       MCC                Merchant Category Code
B       MCC_P              Merchant Category Code from the previous transaction of the same credit card
C       Post_zip_code      Post/zip code
D       Post_zip_code_P    Post/zip code from the previous transaction
E       Amount             Amount of money of the current transaction
F       Amount_P           Amount of money of the previous transaction
G       Type_T             Type of transaction: card present (normal transaction), Internet, telephone, direct debit, etc.
H       Credit_limit       Credit limit of the account
I       Brand_scheme       Brand/scheme: Visa, MasterCard, Diners, JCB, etc.
J       Variant            Variant: Local, International, Gold, Platinum
K       F_score            Fraud score of the previous transaction (very important to know)
L       No_of_instalments  Number of instalments of the current transaction
M       Time_Last_T        Time in minutes since the last transaction
N       Diff_F_Score       Difference between the fraud score of the previous transaction and the fraud score of the one before it
O       M_T_Limit          Merchant transaction limit: maximum amount of money allowed for a transaction in that specific type of business
P       Flag               Fraud transaction flag (N = no fraud, S = fraud)
NUMBER OF FRAUDULENT AND LEGITIMATE TRANSACTIONS
IN EACH SPLIT OF THE DATASET
FRAUD DETECTION METHOD
Structure of the dataset
Attribute            Class
MCC                  33
MCC_P                33
Post_zip_code        10
Post_zip_code_P      10
Amount               10
Amount_P             10
Type_T               10
Credit_limit         10
Brand_scheme         6
Variant              6
F_score              10
VIP_Common           2
Local_Interantional  2
No_of_instalments    4
Time_Last_T          8
Diff_F_Score         6
M_T_Limit            9
Flag                 2
Workflow
• Training set -> learn classifier (NN, K-NN, LR) -> model
• Test set -> model -> validation
• Validation outputs: confusion matrix and curves (ROC, PR, lift)
FRAUD DETECTION USING NEURAL NETWORK
Dataset: split into training and testing sets
R script
• R library: neuralnet
• Algorithm: resilient backpropagation
• Hidden layer: c(4,2) or 8
• Activation function: logistic
• Error function: 'sse' (sum of squared errors)
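The network shape listed above can be illustrated with a minimal sketch. The study itself used the R neuralnet library; this is a Python illustration of the forward pass only, with two hidden layers of 4 and 2 logistic units and a sum-of-squared-errors function. It does not implement resilient backpropagation. The 17 inputs (the dataset columns without Flag), the random initial weights and all function names are assumptions made for illustration.

```python
import math
import random

def logistic(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_in, n_out, rng):
    """One layer: n_out rows, each holding n_in weights plus a trailing bias."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def forward(layers, x):
    """Propagate one input vector through every layer and return the outputs."""
    activations = list(x)
    for layer in layers:
        activations = [
            logistic(sum(w * a for w, a in zip(row[:-1], activations)) + row[-1])
            for row in layer
        ]
    return activations

def sse(layers, inputs, targets):
    """Sum of squared errors between network outputs and 0/1 targets."""
    return sum((forward(layers, x)[0] - t) ** 2 for x, t in zip(inputs, targets))

# Assumed shape: 17 inputs -> 4 hidden -> 2 hidden -> 1 output (fraud score).
rng = random.Random(0)
sizes = [17, 4, 2, 1]
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

score = forward(layers, [0.5] * 17)[0]
print(f"untrained fraud score: {score:.3f}")
```

Training would adjust the weights to reduce the value returned by `sse`; that step is left to the library.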
R Code for Implementation of Neural Network
FRAUD DETECTION USING K-NEAREST NEIGHBOUR
Dataset
• Normalization: (x - min(x)) / (max(x) - min(x))
• Split into trainset and testset
• Categorical class: trainset target and testset target
R script
• R library: class
• Algorithm: knn
• k = 1, 2, 3, 4, 5
• Output: best k
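The steps on this slide (min-max normalization, k-nearest-neighbour voting and choosing the best k from 1 to 5) can be sketched as follows. The study used the knn function from the R class library; this is a plain Python illustration under assumptions: Euclidean distance, majority voting, accuracy on the test set as the selection criterion, and a tiny made-up dataset. All function names are hypothetical.

```python
import math
from collections import Counter

def min_max_normalize(column):
    """Scale one column to [0, 1] using (x - min(x)) / (max(x) - min(x))."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: avoid division by zero
    return [(x - lo) / (hi - lo) for x in column]

def normalize_rows(rows):
    """Normalize every column of a row-oriented table."""
    columns = [min_max_normalize(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]

def knn_predict(train_x, train_y, query, k):
    """Majority label among the k training rows closest to the query."""
    ranked = sorted((math.dist(x, query), label) for x, label in zip(train_x, train_y))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def best_k(train_x, train_y, test_x, test_y, candidates=(1, 2, 3, 4, 5)):
    """Candidate k with the highest test accuracy (first one wins ties)."""
    def accuracy(k):
        hits = sum(knn_predict(train_x, train_y, q, k) == t for q, t in zip(test_x, test_y))
        return hits / len(test_y)
    return max(candidates, key=accuracy)

# Made-up rows: [amount, number of instalments]; labels use the Flag codes N and S.
rows = [[10, 1], [12, 2], [11, 1], [200, 9], [220, 8], [210, 9]]
labels = ["N", "N", "N", "S", "S", "S"]
norm = normalize_rows(rows)
train_x, train_y = [norm[i] for i in (0, 1, 3, 4)], [labels[i] for i in (0, 1, 3, 4)]
test_x, test_y = [norm[i] for i in (2, 5)], [labels[i] for i in (2, 5)]

print(knn_predict(train_x, train_y, test_x[0], k=3))
print(best_k(train_x, train_y, test_x, test_y))
```

Normalizing before computing distances keeps a large-valued attribute such as the amount from dominating the distance.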
R CODE FOR IMPLEMENTATION OF
K-NEAREST NEIGHBOR
FRAUD DETECTION USING LOGISTIC REGRESSION
Dataset: split into training and testing sets
R script
• R library: stats (built-in, generalized linear models)
• Function: glm
• Family: binomial
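A binomial logistic regression can be illustrated with the minimal sketch below. The study used the R glm function; this Python illustration swaps in plain batch gradient descent on the log-loss, which is not how glm fits the model, so it only shows the shape of the model. The learning rate, the epoch count, the toy data and the coding of fraud as 1 are all assumptions.

```python
import math

def sigmoid(z):
    """Logistic link function."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Estimated probability that x belongs to class 1 (assumed here: fraud)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def fit_logistic(inputs, targets, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n_features = len(inputs[0])
    weights, bias = [0.0] * n_features, 0.0
    n = len(inputs)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, t in zip(inputs, targets):
            error = predict_proba(weights, bias, x) - t
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

# Toy data: one normalized attribute; higher values are labelled 1.
toy_x = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
toy_y = [0, 0, 0, 1, 1, 1]
weights, bias = fit_logistic(toy_x, toy_y)

print(f"P(class 1 | x = 0.1) = {predict_proba(weights, bias, [0.1]):.3f}")
print(f"P(class 1 | x = 0.9) = {predict_proba(weights, bias, [0.9]):.3f}")
```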
R CODE FOR IMPLEMENTATION OF
LOGISTIC REGRESSION
FRAUD DETECTION MODEL VALIDATION
ROC Analysis
• ROC stands for "Receiver Operating Characteristic".
• It comes from signal processing: the trade-off between hit rate
and false alarm rate over a noisy channel.
• Compute the FPR and TPR and plot them in ROC space.
• Every classifier is a point in ROC space.
• For probabilistic algorithms
Area Under Curve (AUC) = 0.75
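The ROC construction described above can be sketched as follows: each threshold on the predicted score gives one (FPR, TPR) point, and the AUC is the area under the resulting curve. This is a Python illustration rather than the R code used in the study; the scores and labels are made up, and the trapezoidal rule is one common way to compute the area.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by thresholding at each distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, t in zip(scores, labels) if s >= threshold and t == 1)
        fp = sum(1 for s, t in zip(scores, labels) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Made-up scores for eight transactions; label 1 marks the positive class.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(points)
print(auc(points))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.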
FRAUD DETECTION MODEL VALIDATION (CONT.)
Confusion Matrix
               actual +              actual -              total
predicted +    TP (true positive)    FP (false positive)   TP+FP
predicted -    FN (false negative)   TN (true negative)
total          TP+FN                 FP+TN
TP Rate (Sensitivity, Recall) = TP / (TP+FN)
FP Rate (fall-out) = FP / (FP+TN)
FRAUD DETECTION MODEL VALIDATION (CONT.)
Positive Predictive Value (PPV, precision) = TP / (TP+FP)
P(TP): % true positives = sensitivity
P(FP): % false positives = 1 - specificity
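The measures on the validation slides all follow from the four confusion-matrix counts. The sketch below is a Python illustration with made-up counts, not figures from the study; the function name is hypothetical.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard rates derived from the four confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),            # TP rate, recall
        "fall_out": fp / (fp + tn),               # FP rate
        "specificity": tn / (fp + tn),            # 1 - fall-out
        "precision": tp / (tp + fp),              # positive predictive value
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

# Made-up counts for illustration only.
metrics = confusion_metrics(tp=90, fn=10, fp=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```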
PERFORMANCE MEASURES CALCULATED FROM THE CONFUSION
MATRIX OF NEURAL NETWORK (NN), K-NEAREST NEIGHBOUR
(KNN) AND LOGISTIC REGRESSION (LR)
Performance Measure   Neural Network   K-NN        LR
Accuracy              96.2 %           97.14 %     96.19 %
AUC                   0.856            0.77        0.86
Execution Time        17 seconds       3 seconds   Instant
Detection rate        96.06 %          95.03 %     95.7 %
Sensitivity           99.7 %           98.7 %      99.4 %
Specificity           4 %              56.7 %      12.2 %
Precision             96.4 %           98.3 %      96.7 %
RMSE                  0.194            0.16        0.17
R-squared             0.14             0.33        0.14
RESULTS (ROC)
ROC curves: NN, K-NN, LR
RESULTS (LIFT CHART)
Lift is a measure of the effectiveness of a predictive model, calculated as
the ratio between the results obtained with and without the predictive
model.
Lift charts: NN, K-NN, LR
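The lift definition above can be sketched as the rate of positive cases among the top-scored fraction of transactions, divided by the overall rate that would be obtained without a model. This is a Python illustration on made-up scores; the function name and the choice of fraction are assumptions.

```python
def lift_at(scores, labels, fraction):
    """Lift in the top-scored fraction: positive rate with the model / base rate."""
    n_top = max(1, round(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# Made-up example: 10 transactions, 2 positives, both ranked at the top.
scores = [0.95, 0.90, 0.60, 0.55, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(lift_at(scores, labels, 0.2))
print(lift_at(scores, labels, 1.0))
```

A lift of 1 means the model does no better than selecting transactions at random; plotting the lift for increasing fractions gives the lift chart.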
RESULTS (PR CHART)
PR curves: NN, K-NN, LR
CONCLUSION
• The performance of the three data mining algorithms was compared.
The fraud detection rate of NN is the highest of the three
algorithms studied, while its specificity is the lowest. On
execution time and RMSE, however, KNN is better than NN. Overall,
the results show that KNN and NN perform better than Logistic
Regression, although both take some execution time to process
large data. Choosing better arguments and functions for the model,
such as the backpropagation algorithm and the number of hidden
nodes, can reduce the execution time of NN; likewise, the value of
k and other function arguments affect the execution time of KNN.
• We believe that these results are promising and supportive of a
multi-algorithmic approach to classifying and assessing large,
noisy, real-time data sets. Future work will focus on testing the
algorithms and resolution strategies on similarly complex data
sets from other real-world domains.
REFERENCES
[1] S. Benson Edwin Raj and A. Annie Portia, "Analysis on Credit Card Fraud Detection Methods",
IEEE International Conference on Computer, Communication and Electrical Technology,
pp. 152-156, 2011.
[2] C. Haruna, S. Abdul-Kareem and A. Abubakar, "A Framework for selecting the optimal technique
suitable for application in data mining task", Future Information Technology, pp. 163-169, 2014.
[3] Manoel Fernando Alonso Gadi, Xidi Wang and Alair Pereira do Lago, "Credit card fraud detection
with artificial immune system", 7th International Conference, ICARIS 2008, Phuket, Thailand,
August 10-13, 2008, Proceedings, Lecture Notes in Computer Science, vol. 5132, pp. 119-131,
Springer, 2008.
[4] E.W.T. Ngai, Yong Hu, Y.H. Wong, Yijun Chen and Xin Sun, "The application of data mining techniques
in financial fraud detection: A classification framework and an academic review of literature",
Decision Support Systems (Elsevier), vol. 50, pp. 559-569, 2011.
[5] S. Bhattacharyya, S. Jha, K. Tharakunnel and J.C. Westland, "Data mining for credit card fraud: A
comparative study", Decision Support Systems (Elsevier), 2011.
[6] U. M. Fayyad et al., Advances in Knowledge Discovery and Data Mining, Cambridge,
Mass.: MIT Press, 1996.
[7] Jun Han and Claudio Moraga, "The influence of the sigmoid function parameters on the speed of
backpropagation learning", in José Mira and Francisco Sandoval (eds.), From Natural to Artificial
Neural Computation, pp. 195-201, 1995.
[8] Neda S. Halvaiee and M. Kazem Akbari, "A novel model for credit card fraud detection using Artificial
Immune Systems", Applied Soft Computing (Elsevier), vol. 24, pp. 40-49, 2014.
[9] F. Campos and S. Cavalcante, "An extended approach for Dempster-Shafer theory", in
Proceedings of the IEEE International Conference on Information Reuse and Integration, 2003,
pp. 338-344.
RESEARCH PAPER PUBLISHED
IEEE CONFERENCE ID: 37465
Comparative study of various approaches for transaction Fraud Detection using Machine Learning Algorithms
