CREDIT CARD FRAUD DETECTION USING RANDOM FOREST ALGORITHM
GUIDED BY
TEAM MEMBERS
ABSTRACT
❖Credit card fraud is increasing considerably with the
development of modern technology.
❖Here we mainly focus on fraudulent credit card transactions in the
real world. First we collect the credit card dataset, and then the
dataset is analysed and processed.
❖After that, the Random Forest algorithm is applied to classify the
transactions and its accuracy is measured. Finally, the number of
fraud transactions present in the dataset is identified.
❖The performance of the technique is evaluated based on
accuracy, sensitivity, specificity and precision. The accuracy
obtained is about 98%.
EXISTING SYSTEM
❖In the existing system, methods such as Cluster Analysis, SVM,
Bayesian Network, Logistic Regression, Naïve Bayes and Hidden
Markov Model are used to find fraudulent credit card
transactions.
❖Some of the methods used in the existing system, such as cluster
analysis, are based on unsupervised learning, and the accuracy
reported for the existing system is about 60-70%.
PROPOSED SYSTEM
❖The proposed system overcomes the low accuracy mentioned above in
an efficient way. It aims at identifying the number of fraud
transactions that are present in the dataset.
❖In the proposed system, we use the Random Forest algorithm to
classify the credit card dataset. Random Forest is an algorithm for
classification and regression.
❖The dataset is split into a training dataset and a test dataset, and
the model is trained on the training part only. The Random Forest
algorithm is able to process large amounts of data.
❖Even for a large dataset this algorithm is fast and is able to give
an accuracy of about 98%. Finally, the number of fraud
transactions is identified and represented in the form of a
confusion matrix.
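The steps in this slide can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn is available and uses a synthetic, imbalanced dataset from `make_classification` as a stand-in for the credit card data, with label 1 meaning fraud. The sample size, split ratio and number of trees are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit card dataset (assumption): label 1 = fraud.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=0
)

# Stratified split keeps the fraud ratio similar in the training and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Train on the training part only, then predict the unseen test part.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
fraud_flagged = int(cm[:, 1].sum())
print(cm)
print("Transactions flagged as fraud:", fraud_flagged)
```

The second column of the matrix counts every transaction predicted as fraud, so its sum is the number of fraud transactions the model reports for the test set.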
LITERATURE SURVEY
S.No | Title | Year | Authors | Techniques used | Demerits
1 | Credit Card Fraud Detection: A Realistic Modeling and a Novel Learning Strategy | 2017 | Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi, Gianluca Bontempi | Cluster analysis, Artificial Neural Network (ANN) | Accuracy of the algorithm is only around 90%.
2 | Credit card fraud detection using Machine Learning Techniques | 2017 | John O. Awoyemi, Adebayo O. Adetunmbi, Samuel A. Oluwadare | Naïve Bayes, K-Nearest Neighbour, Logistic Regression | Imbalanced data set. Accuracy of the algorithm is only around 71% to 80% (Naïve Bayes). KNN degrades with high-dimension data, as there is little difference between the nearest and farthest neighbour.
3 | Analysis on Credit Card Fraud identification techniques based on KNN and outlier detection | 2017 | N. Malini, M. Pushpa | K-Nearest Neighbour (KNN), Outlier detection technique | Couldn't handle large datasets of more than 1.5 lakh transactions. Accuracy is less than 80% (degrades with high-dimension data).
4 | Credit card fraud detection: A Hybrid approach using fuzzy clustering and Neural Networks | 2015 | Tanumay Kumar Behera, Suvasini Panigrahi | Fuzzy C-means clustering, Artificial Neural Networks (ANN) | Accuracy is about 60% to 80% for the classifiers. For the ensemble result it is 90%.
RANDOM FOREST ALGORITHM
❖The Random Forest algorithm is a supervised learning algorithm
that is used for classification and regression.
❖Random Forest is a tree-based algorithm that builds
several decision trees and then combines their outputs to improve
the predictive ability of the model.
❖It can be used to identify the important features in the training
dataset, and it can handle thousands of input variables.
❖Random Forest is widely used because each decision tree
can be trained independently and the ensemble reduces overfitting.
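A short sketch of these properties, assuming scikit-learn's `RandomForestClassifier` and a synthetic dataset (both are illustrative choices, not taken from the slides). `n_jobs=-1` asks scikit-learn to build trees in parallel, which is possible because each tree is trained independently, and `feature_importances_` gives one impurity-based score per input feature.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): 8 features, only the first 3 carry signal.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

# n_jobs=-1 trains the independent trees in parallel on all CPU cores.
forest = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)
forest.fit(X, y)

# One fitted decision tree per estimator; predictions combine all of them.
print("Number of trees:", len(forest.estimators_))

# Impurity-based importances, one per input feature, normalised to sum to 1.
for index, importance in enumerate(forest.feature_importances_):
    print(f"feature_{index}: {importance:.3f}")
```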
REQUIREMENTS
HARDWARE
❖RAM – 4GB
SOFTWARE
❖Anaconda
PROGRAMMING LANGUAGE
❖Python
SYSTEM ARCHITECTURE
[Figure: system architecture block diagram. Blocks: Dataset, Pre-processing, Feature extraction, Machine learning model, Classifier Section, Test data, Result, Performance analysis.]
MODULES DESCRIPTION
MODULE 1: DATA COLLECTION
The data used in this work is a set of credit card transaction
records. This step is concerned with selecting the subset of all
available data that we will be working with. ML problems start with
data, preferably lots of data (examples or observations) for which
the target answer is already known. Data for which the target answer
is already known is called labelled data.
MODULE 2: DATA PRE-PROCESSING
Pre-processing refers to the transformations applied to the credit card
dataset before feeding it to the algorithm. In Python, the scikit-learn
library provides pre-built functionality under
sklearn.preprocessing. Three common data pre-processing steps for
the credit card dataset are as follows:
1. Formatting
2. Cleaning
3. Sampling
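The three steps could look roughly like the sketch below. It uses pandas rather than sklearn.preprocessing to keep the example short, and the column names (`amount`, `is_fraud`) and values are hypothetical, not taken from the project's dataset.

```python
import pandas as pd

# Hypothetical raw records; column names and values are illustrative assumptions.
raw = pd.DataFrame({
    "amount": ["12.50", "300.00", None, "45.10", "45.10", "9.99", "120.00", "5000.00"],
    "is_fraud": [0, 0, 0, 0, 0, 0, 1, 1],
})

# 1. Formatting: convert text columns to numeric types.
formatted = raw.assign(amount=pd.to_numeric(raw["amount"]))

# 2. Cleaning: drop rows with missing values and exact duplicate rows.
cleaned = formatted.dropna().drop_duplicates()

# 3. Sampling: undersample the majority class to match the number of fraud rows.
fraud = cleaned[cleaned["is_fraud"] == 1]
legit = cleaned[cleaned["is_fraud"] == 0].sample(n=len(fraud), random_state=0)
sampled = pd.concat([fraud, legit]).reset_index(drop=True)
print(sampled)
```

Undersampling is only one possible sampling strategy; it is shown here because it needs no extra library.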
MODULE 3: FEATURE EXTRACTION
The next step is feature extraction, which is an attribute reduction
process. Unlike feature selection, which ranks the existing attributes
according to their predictive significance, feature extraction actually
transforms the attributes. The transformed attributes, or features, are
linear combinations of the original attributes. Finally, our models are
trained using a classifier algorithm. We use the classify module of the
Natural Language Toolkit library in Python. Part of the labelled dataset
gathered is used for training, and the rest of our labelled data is used
to evaluate the models. A machine learning algorithm is then used to
classify the pre-processed data. The chosen classifier was Random
Forest, an algorithm that is also popular in text classification tasks.
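One common way to obtain features that are linear combinations of the original attributes is principal component analysis (PCA). The slides do not name the extraction method, so PCA here is an assumption chosen for illustration, applied to random data with scikit-learn and NumPy.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in data (assumption): 200 rows, 5 original attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)  # make one attribute redundant

# Reduce the 5 original attributes to 2 extracted features.
pca = PCA(n_components=2)
features = pca.fit_transform(X)

# Each extracted feature is a linear combination of the centred attributes:
# the weights are the rows of pca.components_.
manual = (X - pca.mean_) @ pca.components_.T
matches = bool(np.allclose(features, manual))
print("Shape of extracted features:", features.shape)
print("Matches manual linear combination:", matches)
```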
MODULE 4: EVALUATION MODEL
➢ Model evaluation is an integral part of the model
development process. It helps to find the model that best represents
our data and to estimate how well the chosen model will work in the
future. Evaluating model performance on the data used for training is
not acceptable in data science because it can easily produce
overoptimistic and overfitted models. Two common methods of
evaluating models in data science are:
1. Hold-Out
2. Cross-Validation
➢ To avoid overfitting, both methods use a test set (data not seen
during training) to evaluate model performance. The performance of
each classification model is estimated based on its averaged score.
The results are presented in visual form, as graphs of the
classified data.
Accuracy is defined as the percentage of correct predictions for the
test data. It is calculated by dividing the number of
correct predictions by the total number of predictions.
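Both evaluation methods, and the accuracy calculation described above, can be sketched as follows. The example assumes scikit-learn and a synthetic dataset; the 25% test share and the 5 folds are arbitrary illustrative values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic labelled data standing in for the real dataset (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 1. Hold-out: keep 25% of the data aside and never train on it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy = number of correct predictions / total number of predictions.
manual_accuracy = (y_pred == y_test).sum() / len(y_test)
holdout_accuracy = accuracy_score(y_test, y_pred)

# 2. Cross-validation: 5 train/test rounds, each fold used once as the test set.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=5, scoring="accuracy",
)

print("Hold-out accuracy:", holdout_accuracy)
print("Mean cross-validation accuracy:", cv_scores.mean())
```

The cross-validation score is the average over the folds, which makes it less dependent on one particular split than the hold-out score.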
DATASET
SCREENSHOTS
IMPORT PACKAGE AND READ DATASET:
PRE-PROCESSING
SPLITTING THE DATASET
RANDOM FOREST ALGORITHM AND ACCURACY
VISUALIZATION
FINAL CONFUSION MATRIX
ACCURACY COMPARISON
METRICS | SVM | NAÏVE BAYES | K-MEANS CLUSTERING | LR | RF
ACCURACY (%) | 85.05 | 83.50 | 78.62 | 96.82 | 98.60
SENSITIVITY (%) | 84.06 | 78.00 | 69.93 | 95.68 | 98.87
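For reference, the metrics named in this deck can all be derived from the four cells of a confusion matrix. The sketch below uses the standard definitions; the counts passed in at the end are made-up illustrative numbers, not results from this project.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion-matrix counts (fraud = positive)."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "sensitivity": tp / (tp + fn),   # share of actual fraud that is detected
        "specificity": tn / (tn + fp),   # share of legitimate transactions kept
        "precision": tp / (tp + fp),     # share of fraud alerts that are real fraud
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fp=5, tn=895, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On a heavily imbalanced dataset, accuracy can stay high even when many fraud cases are missed, which is why sensitivity is reported alongside it.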
GRAPHICAL REPRESENTATION
[Figure: bar chart of ACCURACY (y-axis, 0 to 120) by ALGORITHM (x-axis: SVM, NB, KMEANS, LR, RF)]
CONCLUSION & FUTURE ENHANCEMENT
The Random Forest algorithm will perform better with a larger amount of training data, and
applying more pre-processing techniques would also help. It is one of the most
accurate learning algorithms available. For many data sets, it produces a highly accurate
classifier. It runs efficiently on large databases. It can handle thousands of input variables
without variable deletion. Thus the Random Forest algorithm produces more accurate
results in credit card fraud detection. It also has the capacity to estimate missing data
and is able to handle a large proportion of missing data.
Results show that Random Forest continues to perform well as the imbalance ratio in the
data gradually increases. Because Random Forest gave better results, there is a need
to explore it further with larger datasets. Using these findings, a further extension of this
work could be to apply different resampling techniques to the data, to gain more insight
into imbalanced credit card data and to obtain improved results.
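As one example of the resampling techniques mentioned above, random oversampling duplicates fraud rows until both classes have the same size. This is a minimal NumPy sketch on synthetic data, offered as an illustration of the idea rather than the method the project will use; the 2% fraud rate is an assumption.

```python
import numpy as np

# Synthetic imbalanced data (assumption): 1000 rows, 2% labelled as fraud.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:20] = 1

fraud_idx = np.flatnonzero(y == 1)
legit_idx = np.flatnonzero(y == 0)

# Random oversampling: draw fraud rows with replacement until classes balance.
extra = rng.choice(fraud_idx, size=len(legit_idx) - len(fraud_idx), replace=True)
balanced_idx = np.concatenate([legit_idx, fraud_idx, extra])
rng.shuffle(balanced_idx)

X_balanced, y_balanced = X[balanced_idx], y[balanced_idx]
class_counts = np.bincount(y_balanced)
print("Class counts after oversampling:", class_counts)
```

Resampling should be applied to the training split only, so that the test set keeps the original class ratio.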
REFERENCES
1. "Credit Card Fraud Detection: A Realistic Modeling and a Novel Learning
Strategy", Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi,
Gianluca Bontempi, IEEE Transactions on Neural Networks and Learning Systems, 2018.
2. "A new Credit card fraud detecting method based on behavior certificate",
Lutao Zheng, Guanjun Liu, Wenjing Luan, Zhengchuan Li, Yuwei Zhang,
Chungang Yan, Changjun Jiang, 2018 IEEE 15th International Conference on
Networking, Sensing and Control (ICNSC).
3. "Supervised Machine Learning Algorithms for Credit Card Fraudulent
Transaction Detection: A Comparative Study", Sahil Dhankhad, Emad
Mohammed, Behrouz Far, 2018 IEEE International Conference on Information
Reuse and Integration (IRI).
4. "Credit Card Fraud Detection using Machine Learning Models and Collating
Machine Learning models", Navanshu Khare and Saad Yunus Sait, International
Journal of Pure and Applied Mathematics, Volume 118, No. 20, 825-838,
2018.
5. "Credit Card Fraud Detection using learning to Rank Approach", N. Kalaiselvi,
S. Rajalakshmi, J. Padmavathi, Joyce B. Karthiga, 2018 International Conference
on Computation of Power, Energy, Information and Communication (ICCPEIC).