DATA MINING PROJECT
Fraud Detection Using Data Mining
JUNE 7, 2015
NORTHWESTERN UNIVERSITY, MSIS 435
ALBERT KENNEDY
TABLE OF CONTENTS
Abstract
Introduction
Data Mining Applications
Data Mining Themes
CRISP-DM Methodology
Data Understanding
Data Preparation
Data Mining Algorithm
Experimental Results and Analysis
Conclusion
Future Work
References
Fraud Detection
1 ABSTRACT
Adopting data mining can benefit many use cases and organizations that understand what can be done with existing data. Many organizations do not recognize the power and value of the data they already control. This paper discusses the benefits of using a structured process for making smarter decisions, not only to explain a data mining topic and its benefits, but also to address common problems in the fraud and identity sector that many businesses and individuals can act on.
Fraud detection is more important today than ever before. With e-commerce growing rapidly and people having more access to sensitive information, financial institutions need to be more aware of ways to detect possible fraudulent acts. We can achieve this goal with the use of data mining.
This document will showcase a typical problem using sample data taken from borrowers and their information related to credit approval. A second data set is similar but presents typical account holders and a common profile of related attributes that make up a "type" of customer. We will go into detail on the proper process for solving this problem using the following outline:
• Defining the business problem
• Collecting the data and enhancing it
• Choosing a modeling strategy and algorithm that fits the business need
• Executing the model against a training set, then testing that model
• Evaluating the results of the model
• Deciding on the model or making any changes
• Deploying the model into an actionable project
The above outline is based on a framework that many data scientists use today called the Cross Industry Standard Process for Data Mining (CRISP-DM).1 This foundation will be used to analyze the fraud detection use case on both the German Credit fraud data and the Give Me Some Credit data set, comparing two separate techniques. The two datasets have been thoroughly cleansed and checked for correctness, free of any biased input that would skew the results in any direction. To succeed in building a fraud detection system, it is important to understand the data mining tool applications used in the industry. Second, it is important to know the themes of data mining and, most importantly, the CRISP-DM methodology that will be used to support our business problem and data mining design for a reliable fraud detection system.
1 Information related to CRISP-DM: see CRISP-DM 1.0, Step-by-Step Data Mining Guide, SPSS.
2 INTRODUCTION
In America and many other parts of the world, crime remains prevalent in today's society. As the government and the public continue to find ways to prevent crime and steer individuals away from unlawful acts, people find alternative and clever ways to be corrupt. Government law enforcement has largely only been able to catch perpetrators after a wrongdoing has been reported or discovered. Not every act can be controlled or stopped before it happens, particularly those within a person's private setting. But what about the crimes that can be warned about, if not stopped, before they happen? These are the crimes that businesses and individual victims can gain some control over with the use of data analysis. The crimes of identity theft and fraudulent credit approval can be defended against.
The purpose of this study is to empower financial businesses and individuals with the ability to combat potential theft of personal and company information and to avoid its misuse for another person's gain.
Description of the problem: In today's fast-growing technological society, individuals are completing more and more online transactions and sending data to multiple businesses using the same vital information to identify a person.
An example of this is applying for credit through an online store. So what information is needed for this? In many cases:
• Full name
• Telephone number or street address
• Social security number (SSN)
The above information is often all that is needed for a creditor to approve someone for a line of credit or an account under a person's name. There is an issue here: anyone who does not know you can still obtain this information easily. The only harder piece of information to obtain is an SSN. A common way for someone to get a person's SSN is through work records, for example through a person who handles administrative tasks for employees and has access to them. This is a problem, because the telephone number and street address can be almost any number or address; the creditor cares about little beyond having a place to mail bills. Typically a creditor will validate this through mailed correspondence, and if a different phone number was given, perhaps with a phone call. We can use smarter ways to combat this, as some companies do with multiple levels of validation (phone call, matching the current address, and identification checks).
Our objective: We can help fix these problems through the use of data mining and sound business decisions when completing a credit transaction.
• First, we identify the types of data needed to detect identity theft or similar fraudulent actions. We will use previous data from users of different credit accounts as the basis for the analysis.
• Then, we choose one or more data mining algorithms to test this data and produce an outcome: a recommendation, prediction, classification, and/or description that supports a better business decision.
So, let's explore the many different data mining applications related to this problem.
3 DATA MINING APPLICATIONS
There are many related data mining applications that can be used for detecting fraudulent activities. This is a growing field in which many established organizations have carried out in-depth research, though that research more often exposes the issues and weaknesses than it produces real-world working applications that actually identify the issue and combat the problem defensively.
A company called Morpho positions itself as a market leader in security solutions and a pioneer in identification and detection systems. It delivers many products, largely targeting government and national agencies, with dedicated tools and systems to safeguard sensitive information. Morpho completed a study on data mining as it relates to identity fraud, as an application to prevent, or at least warn, businesses, government organizations, and individuals of a possible fraudulent act. In their paper on the Safran product, "Fighting Identity Fraud with Data Mining,"2 they describe a comprehensive fraud-proof process.
A second effort worth mentioning that used data mining methods for a similar study came from Federal Data Corporation and SAS Institute Inc. Together they completed a thorough study, "Using Data Mining Techniques for Fraud Detection," carried out with the SAS Enterprise Miner software. The two use cases presented were 1) health care fraud detection and 2) purchase card fraud detection. Both have similar, if not the same, business problems and end goals. In the first case, FDC and SAS used decision trees to group all the nominal input values into smaller groups that in turn give a predictive target outcome. In the second case study, they used clustering as the modeling strategy. Their analysis used three clusters, demonstrating how cluster analysis efficiently segments data into groups of similar cases.
The overall conclusions for both unveiled previously unknown patterns and regularities in their data.
4 DATA MINING THEMES
The study of data mining is best explained and organized into different themes. These areas are described by the four core data mining tasks. According to "Introduction to Data Mining" by Pang-Ning Tan, these themes are covered under the four core tasks: predictive modeling, cluster analysis, association analysis, and anomaly detection.3
To briefly describe each theme of data mining, it is best to show examples. These themes, in detail, are:
• Classification
• Clustering
• Anomaly detection
2 Product from Morpho Inc. (Safran), "Fighting Identity Fraud with Data Mining."
3 See more information on themes in Assignment 1: Data Science Application, Kennedy, Albert.
First, predictive modeling is split into two types: 1) classification, which is used for discrete target variables, and 2) regression, "which is used for continuous target variables" (Tan). Classification is most commonly used to predict the outcome, the target variable, of a single action, whereas regression might, for example, analyze the amount a consumer spends monthly on an e-commerce website. These two task types (classification and regression) are what define predictive modeling.
The second theme, cluster analysis, "seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than observations that belong to other clusters" (Tan). A good example of this is grouping customers by their purchasing behaviors, which helps a clothing retailer define the types of customers it has and the purchases they make.
Lastly, there is the association analysis theme, used "to discover patterns that describe strongly associated features in the data" (Tan). Association analysis is commonly used by retailers and grocery businesses. We can group like items together based on related attributes and/or on transactions in which users purchase them together.
Each of these themes has its purpose in solving particular data science problems, and by solving them, businesses can make better decisions. Understanding each definition is key to ensuring the right solution or theme is applied to the right problem. Once that understanding is in place, an analyst can use the right tool for the most appropriate case. In many situations more than one theme applies, and multiple themes can be combined for better analysis, comparison, and results.
5 CRISP-DM METHODOLOGY
With so many data mining techniques available, we don't want to apply the wrong or a less effective solution. Luckily, there is a well-organized methodology that gives businesses the steps and processes to handle these types of data mining projects. We use what is called the Cross Industry Standard Process for Data Mining, or CRISP-DM for short. CRISP-DM is a structured framework with hierarchical steps to follow that helps guide a proper data mining problem and solution. CRISP-DM includes six phases:4
4 Information related to CRISP-DM: see CRISP-DM 1.0, Step-by-Step Data Mining Guide, SPSS.
Figure 5-1: CRISP-DM diagram
1) Business Understanding – this makes sense as the initial step. Any data mining project has a business need and problem that must be understood. This stage "represents a part of the craft where the analysts' creativity plays a large role…the design team should think carefully about the use scenario" (Provost). In this stage, questions such as what needs to be done and how it needs to be done are asked.
2) Data Understanding – this phase is self-explanatory yet can take a lot of time. Making sure that you, as the analyst, become knowledgeable about the data makes for an easier process. This phase enables you to become "familiar with the data, identify data quality problems, discover first insights into data, and/or detect interesting subsets to form hypotheses regarding hidden information" (SPSS).
3) Data Preparation – to make a good analysis, we need to put the data into the form best suited to the chosen model. Examples of work in this phase include converting data into a simple tabular format, removing attributes not relevant to the problem, and/or converting a data file to a particular format so that it works in the chosen data mining tools.
4) Modeling – "The modeling stage is the primary place where data mining techniques are applied to the data" (Provost). Simply put, this is where the magic happens and the actual data mining craft and chosen algorithm(s) are put to work.
5) Evaluation – in the evaluation phase, we take time to assess the results of the models that were built from our data. The most important aspect of this phase is gaining confidence in the model's outcome. We want to analyze the results and understand them well enough to ensure the model reliably meets the original business problem's needs.
6) Deployment – now that the model has been created and tested, we can put this reliable model to use in a real-life production case. What can we do with it? "The knowledge gained will need to be organized and presented in a way that the customer can use it" (SPSS). Depending on the business's need for the data mining model that was crafted, deployment can be simple or complex, from simply reporting the results to managers to actually implementing the model in the business that needs it.
Within each phase there is a set of tasks, shaped by the business, for both the data mining team and the business to complete in order to move through the CRISP-DM cycle successfully.
6 DATA UNDERSTANDING
To demonstrate the data mining algorithms used for the fraud detection case, we will examine two different data sets. The first is the German Credit fraud data from Dr. Hans Hofmann of the University of Hamburg in Germany. This data set has 1,000 instances of customer information for understanding credit approval. There are 20 attributes used to describe each instance and its uniqueness.
German Credit Data Definition:
Variable Name – Description (Type)
over_draft – Status of existing checking account (qualitative)
credit_usage – Duration in months (numerical)
credit_history – Credit history: no credits taken / all credits paid back duly; all credits at this bank paid back duly; existing credits paid back duly till now; delay in paying off in the past; critical account / other credits existing (not at this bank) (qualitative)
purpose – Type of credit/loan requested: new car, used car, furniture/equipment, radio/tv, repairs, education, vacation, business, other (qualitative)
current_balance – Credit amount (numerical)
Average_Credit_Balance – Savings account/bonds (qualitative)
employment – Present employment since (qualitative)
location – Installment rate as a percentage of disposable income (numerical)
personal_status – Personal status and sex (qualitative)
other_parties – Other debtors / guarantors (qualitative)
residence_since – Present residence since (numerical)
property_magnitude – Property (qualitative)
cc_age – Age in years (numerical)
other_payment_plans – Other installment plans (qualitative)
housing – Housing type: rent, own, for free (qualitative)
existing_Credits – Number of existing credits at this bank (numerical)
job – Job type: unemployed/unskilled, skilled, management/self-employed, highly qualified (qualitative)
num_dependents – Number of people liable to provide maintenance for (numerical)
own_telephone – Telephone (qualitative)
foreign_worker – Indicates yes or no if the customer is a foreign worker (qualitative)
class – Indicates whether the customer is good or bad, used with the cost matrix (qualitative)
The above information is financial data from 1994 for customers who applied for credit. The attributes are a mix of integer and categorical types. This data set is best suited to a classification data mining task.
The second data source comes from Kaggle's Give Me Some Credit competition. According to Kaggle, its purpose is to improve "the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years." This dataset uses fewer attributes to describe the customers; however, it has 4,000 instances (after the sampling described in the Data Preparation section).
Credit Distress Probability Data Definition:
Variable Name – Description (Type)
SeriousDlqin2yrs – Person experienced 90 days past due delinquency or worse (Y/N)
RevolvingUtilizationOfUnsecuredLines – Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits (percentage)
age – Age of borrower in years (integer)
NumberOfTime30-59DaysPastDueNotWorse – Number of times borrower has been 30-59 days past due but no worse in the last 2 years (integer)
DebtRatio – Monthly debt payments, alimony, and living costs divided by monthly gross income (percentage)
MonthlyIncome – Monthly income (real)
NumberOfOpenCreditLinesAndLoans – Number of open loans (installment loans such as a car loan or mortgage) and lines of credit (e.g., credit cards) (integer)
NumberOfTimes90DaysLate – Number of times borrower has been 90 days or more past due (integer)
NumberRealEstateLoansOrLines – Number of mortgage and real estate loans, including home equity lines of credit (integer)
NumberOfTime60-89DaysPastDueNotWorse – Number of times borrower has been 60-89 days past due but no worse in the last 2 years (integer)
NumberOfDependents – Number of dependents in family, excluding themselves (spouse, children, etc.) (integer)
The goal of this data is to help build a model that helps borrowers make better financial decisions. This also falls under the classification task type.
7 DATA PREPARATION
The preparation process for the German Credit fraud data was very simple. I decided to use the ARFF file format to organize it, since the data comes from the UC Irvine Machine Learning Repository. I collected the attributes needed for the analysis and pasted them into a plain-text editor, with @attribute lines at the top to declare the variables, followed by the @data line and the raw comma-separated records below it. As long as each comma-separated record has the same number of values as the number of declared attributes, the file will be valid.
Here's a snapshot of what the inside of the ARFF file contains:
@relation german_credit
@attribute over_draft { '<0', '0<=X<200', '>=200', 'no checking'}
@attribute credit_usage real
@attribute credit_history { 'no credits/all paid', 'all paid', 'existing paid', 'delayed previously', 'critical/other existing credit'}
@attribute purpose { 'new car', 'used car', furniture/equipment, radio/tv, 'domestic appliance', repairs, education, vacation, retraining, business, other}
@attribute current_balance real
@attribute Average_Credit_Balance { '<100', '100<=X<500', '500<=X<1000', '>=1000', 'no known savings'}
@attribute employment { unemployed, '<1', '1<=X<4', '4<=X<7', '>=7'}
@attribute location real
@attribute personal_status { 'male div/sep', 'female div/dep/mar', 'male single', 'male mar/wid', 'female single'}
@attribute other_parties { none, 'co applicant', guarantor}
@attribute residence_since real
@attribute property_magnitude { 'real estate', 'life insurance', car, 'no known property'}
@attribute cc_age real
@attribute other_payment_plans { bank, stores, none}
@attribute housing { rent, own, 'for free'}
@attribute existing_credits real
@attribute job { 'unemp/unskilled non res', 'unskilled resident', skilled, 'high qualif/self emp/mgmt'}
@attribute num_dependents real
@attribute own_telephone { none, yes}
@attribute foreign_worker { yes, no}
@attribute class { good, bad}
@data
'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,4,'real estate',67,none,own,2,skilled,1,yes,yes,good
'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female div/dep/mar',none,2,'real estate',22,none,own,1,skilled,1,none,yes,bad
A compiled version of this can be viewed here: http://weka.8497.n7.nabble.com/file/n23121/credit_fruad.arff
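As a quick sanity check before modeling, the compiled ARFF can also be loaded programmatically with the WEKA Java API, the same toolkit used later in this report. This is only a minimal sketch, assuming the file is saved locally as credit_fraud.arff and that WEKA is on the classpath:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadGermanCredit {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file built in this section (local file name assumed).
        DataSource source = new DataSource("credit_fraud.arff");
        Instances data = source.getDataSet();

        // The last attribute ("class": good/bad) is the classification target.
        data.setClassIndex(data.numAttributes() - 1);

        // Expect 1000 instances and 21 attributes (20 predictors plus the class).
        System.out.println("Instances:  " + data.numInstances());
        System.out.println("Attributes: " + data.numAttributes());
        System.out.println("Class:      " + data.classAttribute());
    }
}

If the declared attributes and the comma-separated records line up, WEKA reports 1,000 instances and 21 attributes, confirming the file is well formed.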
The second data source is a CSV file that required some modification. The original file had 150,000 instances, which was too large for the WEKA data mining tool's heap size to handle. I manually reduced the data set by dividing the 150,000 rows into four sections and removing 3,750-row blocks from the bottom half of each section, so that I kept an evenly distributed sample instead of only taking the top 4,000 instances. This file was then saved as cs-training.csv for WEKA input.
For both files, I created data definitions to elaborate on the chosen attributes. The purpose of this is to explain what inputs are used and why, giving a clearer business understanding when we need to make a decision after reviewing the results.
8 DATA MINING ALGORITHM
German Credit Fraud Use case:
The first dataset, the German credit fraud data, was ideal for a decision tree algorithm. The decision tree, a classification algorithm, is a good first technique for this particular use case. Because it is such a commonly used algorithm, there are several reasons it benefits this analysis. First, it is a relatively simple approach for classification data: it takes sample data with known attributes and places the records into categories. Second, it helps you visualize how the data is broken down into the splits that drive decisions. Last, you can derive a predictive outcome from the results.
To use a decision tree, we need to explain in more depth how the method works and how it is structured. When data is run through this type of analysis, the model organizes it around three kinds of nodes:
• The root node is the attribute that asks the initial question of whether or not a record belongs to a particular group. There is only a single root node, and the rest of the groups branch off from it, much like a tree growing from its base.
• Next come the internal nodes, each heading a branch that proceeds from the root node. The purpose of an internal node is to test information pertaining only to that group.
• Lastly, there are the leaf nodes. Think of these as the individual leaves that answer the internal nodes. There can be multiple leaves branching off an internal node, and they finalize the classification of the items in that internal node's group.
Figure 8-1 Decision tree
This simple method works well for data with multiple attributes that carry class-type values. For the purposes of our fraudulent credit problem, the root node answers the major question of which customers will overdraft and which will not. In this case, I am less interested in the individual nodes that are derived and more interested in the confusion matrix.
The confusion matrix compares the actual classes against the classes the model predicts, including those predicted incorrectly.5 We use the confusion matrix to separate the decisions made by the classifier, making explicit how one class is being confused for another, so that each kind of error can be handled separately. We do this by looking at the true class and the predicted class of each item in a matrix.
TP = true positive, FN = false negative
FP = false positive, TN = true negative

                     PREDICTED CLASS
                     Yes         No
ACTUAL CLASS   Yes   a = TP      b = FN
               No    c = FP      d = TN

Figure 8.2 – Confusion Matrix
The goal here is to have the model obtain the highest possible accuracy rate or lowest error rate.
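For reference, a decision tree of this kind can be trained and scored against the class attribute with the WEKA Java API. The sketch below is an illustrative equivalent of the experiment reported in the next section rather than its exact configuration: it assumes the J48 learner (WEKA's C4.5 implementation) and stratified 10-fold cross-validation, which match the style of the output shown there.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GermanCreditTree {
    public static void main(String[] args) throws Exception {
        // Load the ARFF prepared earlier and mark the last attribute (good/bad) as the class.
        Instances data = new DataSource("credit_fraud.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // J48 is WEKA's C4.5 decision tree learner (assumed here as the tree implementation).
        J48 tree = new J48();

        // Stratified 10-fold cross-validation over the 1,000 instances.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());        // accuracy, error rates, kappa
        System.out.println(eval.toClassDetailsString());   // per-class TP rate, precision, recall
        System.out.println(eval.toMatrixString());         // the confusion matrix
    }
}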
9 EXPERIMENTAL RESULTS AND ANALYSIS
To test our fraud data, the experiment was executed with two different data mining techniques: the decision tree and Simple K-Means algorithms were used to generate the results.
German Credit Data Use:
Examining the first data set, the German credit fraud data, we analyze it using the decision tree algorithm. The output below shows the summary of the results.
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 705 70.5 %
Incorrectly Classified Instances 295 29.5 %
Kappa statistic 0.2467
Mean absolute error 0.3467
Root mean squared error 0.4796
Relative absolute error 82.5233 %
Root relative squared error 104.6565 %
Total Number of Instances 1000
=== Detailed Accuracy By Class ===
5 See more information in Assignment 2: Marketing Campaign Effectiveness, Kennedy, Albert.
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.84 0.61 0.763 0.84 0.799 0.639 good
0.39 0.16 0.511 0.39 0.442 0.639 bad
Weighted Avg. 0.705 0.475 0.687 0.705 0.692 0.639
=== Confusion Matrix ===
a b <-- classified as
588 112 | a = good
183 117 | b = bad
What we are looking for here is a way to determine whether the model is good enough to use. Based on the results, 70.5% of the instances in the data set are correctly classified, which is decent justification. However, the ROC area is 0.64, which is not bad but is far from perfect or ideal. So let us consider the confusion matrix for deeper analysis. Recall from above that the confusion matrix is separated into four sections to show where the good and bad classifications fall.
a = true positives (actual good, predicted good)
b = false negatives (actual good, predicted bad)
Our matrix has an overwhelming 588 instances in the true positive cell, where the predicted and actual class are both "good." If we take the percentage for the predicted "good" column:
588 + 183 = 771 instances predicted as good
588 (TP) / 771 (predicted good) ≈ 76%, the precision for the "good" class (matching the 0.763 in the detailed accuracy table).
These results were generated with the WEKA data mining tool, version 3.6.12.
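To keep the different rates straight, the short sketch below (my own addition, not part of the original experiment) recomputes accuracy, precision, and recall for the "good" class directly from the four confusion matrix counts shown above:

public class ConfusionMetrics {
    public static void main(String[] args) {
        // Counts from the WEKA confusion matrix above (class "good" treated as positive).
        int tp = 588;  // good predicted as good
        int fn = 112;  // good predicted as bad
        int fp = 183;  // bad predicted as good
        int tn = 117;  // bad predicted as bad

        double accuracy  = (double) (tp + tn) / (tp + fn + fp + tn); // 0.705
        double precision = (double) tp / (tp + fp);                  // 0.763, the "76%" above
        double recall    = (double) tp / (tp + fn);                  // 0.840, the TP rate for "good"

        System.out.printf("accuracy=%.3f precision=%.3f recall=%.3f%n",
                accuracy, precision, recall);
    }
}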
Give Me Some Credit Use:
Using the data set from the Kaggle competition, we take a different approach to viewing the results. Initially, trying the decision tree algorithm on this dataset did not yield results solid enough for analysis or to make business sense of. So it was best to use a clustering algorithm instead and review those results.
Simple K-Means was the algorithm chosen for this.
kMeans
======
Number of iterations: 10
Within cluster sum of squared errors: 5429.211249046848
Missing values globally replaced with mean/mode
Cluster centroids:
Time taken to build model (full training data) : 0.05 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 401 ( 40%)
1 266 ( 27%)
2 333 ( 33%)
To help explain the results, it helps to define the K-means method and its use. K-means is a commonly used clustering algorithm, a "simple, iterative way to approach and divide a data set into a specific number of clusters" (Manning). The process runs through a dataset and uses "closeness," measured as the Euclidean distance, to gauge how far apart items are. For each cluster a center point, the centroid, is created; every item is assigned to the cluster whose centroid it is closest to, and if an item is closer to a different centroid, it becomes part of that other cluster. K is simply the number of clusters, three in this analysis.
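A clustering run like the one above can be reproduced with WEKA's Java API. The sketch below is a minimal illustration rather than the exact configuration used: Simple K-Means with three clusters and the cs-training.csv file come from this report, while dropping the first (target) column before clustering is my own assumption, and the column index would need adjusting for the actual file layout.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class CreditKMeans {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("cs-training.csv").getDataSet();

        // Clustering is unsupervised, so drop the target column before building the model
        // (assumed here to be the first attribute; adjust the index for the actual file).
        Remove remove = new Remove();
        remove.setAttributeIndices("1");
        remove.setInputFormat(data);
        Instances clusterData = Filter.useFilter(data, remove);

        // Simple K-Means with three clusters, as in the analysis above.
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(3);
        km.buildClusterer(clusterData);

        // Print the centroids and the per-cluster instance counts/percentages.
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(clusterData);
        System.out.println(eval.clusterResultsToString());
    }
}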
In the analysis above, we have three defined clusters, separated into groups based on the attributes their members share.
Cluster 0:
This cluster has the largest pool of instances. The results suggest that the individuals in this cluster tend to have an existing, paid credit history, fall in the younger age group (around 31), are female, and are most likely requesting credit for a NEW CAR.
Cluster 1:
This cluster has the smallest number of instances. The results suggest that customers in this group are the oldest (around age 40) and are seeking credit for USED CARS.
Cluster 2:
This cluster holds 33% of the instances. The results suggest these are SINGLE MALE customers seeking credit for a RADIO/TV.
10 CONCLUSION
The completed analysis, drawn from two different datasets, yields two different business decision results. Visually inspecting the first dataset and its decision tree results, I would initially conclude that this is not a strong enough test to stand alone as a fraud detection solution. We have to remember that the goal is to address the problem of fraud in which customer information is misused to benefit another person by opening credit accounts. However, the confusion matrix does present some promising results to consider. What a financial institution can take from it is a starting point for prediction, with roughly 76% precision on the "good" class indicating the chance that such a prediction is correct.
The second analysis should be viewed from a different angle: not so much as prediction, but as a clear view of where customers land and the attributes tied to them. Using the Simple K-Means algorithm and picking 3 clusters, we can analyze a few important things:
1) where the most important group of customers sits
2) the related attributes in comparison to other groups
3) the purpose of the line of credit
4) the instance counts or distribution percentages, showing which group has the highest activity and means to apply for credit.
With the clustering results, a business can answer the above questions ahead of time. We can pick and choose the determinant for our analysis; for our purpose, we want to detect the reason a line of credit is being requested.
11 FUTURE WORK
There are organizations already taking the necessary actions to benefit from findings like these, companies like Morpho with their Safran product. Here is a list of procedures and steps that should be considered for this study to become successful:
1) Ensure that the data scientists and analysts responsible for crafting these results use proper data mining practices and methodologies such as CRISP-DM.
2) Take risks: I believe there are enough intelligent groups of people who understand the value of, and are capable of, drafting similar fraud detection analyses. There needs to be more action in the deployment phase.
3) Implement a production system in which this process is automated.
Based on the analysis and results, we have a plausible solution to a business problem involving fraud in credit applications.
12 REFERENCES
Tan, Pang-Ning, and Michael Steinbach. Introduction to Data Mining. Boston: Pearson Addison Wesley, 2005. Print (Ch. 1, Ch. 4).
Provost, Foster, and Tom Fawcett. Data Science for Business. Sebastopol, CA, 2014. Print (Ch. 2, Ch. 7).
Morpho (Safran). Fighting Identity Fraud with Data Mining: Groundbreaking Means to Prevent Fraud in Identity Management Solutions. France. Print (p. 4, p. 7).
Federal Data Corporation and SAS Institute Inc. Using Data Mining Techniques for Fraud Detection: Solving Business Problems Using SAS Enterprise Miner Software. Cary, NC. Print (p. 1, p. 15, p. 20).
Hofmann, Hans. University of Hamburg. Statlog (German Credit Data). UCI Machine Learning Repository, 2000. http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29
Kaggle. Give Me Some Credit, 2011. https://www.kaggle.com/c/GiveMeSomeCredit
More Related Content

What's hot

Facial Expression Recognition System using Deep Convolutional Neural Networks.
Facial Expression Recognition  System using Deep Convolutional Neural Networks.Facial Expression Recognition  System using Deep Convolutional Neural Networks.
Facial Expression Recognition System using Deep Convolutional Neural Networks.Sandeep Wakchaure
 
Credit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning AlgorithmsCredit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning Algorithmsankit panigrahy
 
Credit Card Fraud Detection
Credit Card Fraud DetectionCredit Card Fraud Detection
Credit Card Fraud DetectionBinayakreddy
 
Fraud detection ML
Fraud detection MLFraud detection ML
Fraud detection MLMaatougSelim
 
Presentation on FACE MASK DETECTION
Presentation on FACE MASK DETECTIONPresentation on FACE MASK DETECTION
Presentation on FACE MASK DETECTIONShantaJha2
 
Credit card fraud detection using python machine learning
Credit card fraud detection using python machine learningCredit card fraud detection using python machine learning
Credit card fraud detection using python machine learningSandeep Garg
 
House Sale Price Prediction
House Sale Price PredictionHouse Sale Price Prediction
House Sale Price Predictionsriram30691
 
Anomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsAnomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsManojit Nandi
 
Loan Prediction System Using Machine Learning.pptx
Loan Prediction System Using Machine Learning.pptxLoan Prediction System Using Machine Learning.pptx
Loan Prediction System Using Machine Learning.pptxBhoirRitesh19ET5008
 
Credit card payment_fraud_detection
Credit card payment_fraud_detectionCredit card payment_fraud_detection
Credit card payment_fraud_detectionPEIPEI HAN
 
Analysis of-credit-card-fault-detection
Analysis of-credit-card-fault-detectionAnalysis of-credit-card-fault-detection
Analysis of-credit-card-fault-detectionJustluk Luk
 
Is Machine learning useful for Fraud Prevention?
Is Machine learning useful for Fraud Prevention?Is Machine learning useful for Fraud Prevention?
Is Machine learning useful for Fraud Prevention?Andrea Dal Pozzolo
 
Image classification using CNN
Image classification using CNNImage classification using CNN
Image classification using CNNNoura Hussein
 
Machine Learning for Disease Prediction
Machine Learning for Disease PredictionMachine Learning for Disease Prediction
Machine Learning for Disease PredictionMustafa Oğuz
 
Adaptive Machine Learning for Credit Card Fraud Detection
Adaptive Machine Learning for Credit Card Fraud DetectionAdaptive Machine Learning for Credit Card Fraud Detection
Adaptive Machine Learning for Credit Card Fraud DetectionAndrea Dal Pozzolo
 

What's hot (20)

Facial Expression Recognition System using Deep Convolutional Neural Networks.
Facial Expression Recognition  System using Deep Convolutional Neural Networks.Facial Expression Recognition  System using Deep Convolutional Neural Networks.
Facial Expression Recognition System using Deep Convolutional Neural Networks.
 
Fraud detection
Fraud detectionFraud detection
Fraud detection
 
Anomaly detection
Anomaly detectionAnomaly detection
Anomaly detection
 
Credit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning AlgorithmsCredit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning Algorithms
 
Credit Card Fraud Detection
Credit Card Fraud DetectionCredit Card Fraud Detection
Credit Card Fraud Detection
 
Fraud detection ML
Fraud detection MLFraud detection ML
Fraud detection ML
 
Credit card fraud dection
Credit card fraud dectionCredit card fraud dection
Credit card fraud dection
 
Presentation on FACE MASK DETECTION
Presentation on FACE MASK DETECTIONPresentation on FACE MASK DETECTION
Presentation on FACE MASK DETECTION
 
Credit card fraud detection using python machine learning
Credit card fraud detection using python machine learningCredit card fraud detection using python machine learning
Credit card fraud detection using python machine learning
 
House Sale Price Prediction
House Sale Price PredictionHouse Sale Price Prediction
House Sale Price Prediction
 
Anomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsAnomaly Detection for Real-World Systems
Anomaly Detection for Real-World Systems
 
Loan Prediction System Using Machine Learning.pptx
Loan Prediction System Using Machine Learning.pptxLoan Prediction System Using Machine Learning.pptx
Loan Prediction System Using Machine Learning.pptx
 
Credit card payment_fraud_detection
Credit card payment_fraud_detectionCredit card payment_fraud_detection
Credit card payment_fraud_detection
 
Analysis of-credit-card-fault-detection
Analysis of-credit-card-fault-detectionAnalysis of-credit-card-fault-detection
Analysis of-credit-card-fault-detection
 
Is Machine learning useful for Fraud Prevention?
Is Machine learning useful for Fraud Prevention?Is Machine learning useful for Fraud Prevention?
Is Machine learning useful for Fraud Prevention?
 
Image classification using CNN
Image classification using CNNImage classification using CNN
Image classification using CNN
 
Machine Learning for Disease Prediction
Machine Learning for Disease PredictionMachine Learning for Disease Prediction
Machine Learning for Disease Prediction
 
Fraud and Risk in Big Data
Fraud and Risk in Big DataFraud and Risk in Big Data
Fraud and Risk in Big Data
 
CREDIT_CARD.ppt
CREDIT_CARD.pptCREDIT_CARD.ppt
CREDIT_CARD.ppt
 
Adaptive Machine Learning for Credit Card Fraud Detection
Adaptive Machine Learning for Credit Card Fraud DetectionAdaptive Machine Learning for Credit Card Fraud Detection
Adaptive Machine Learning for Credit Card Fraud Detection
 

Viewers also liked

Application of Data Mining and Machine Learning techniques for Fraud Detectio...
Application of Data Mining and Machine Learning techniques for Fraud Detectio...Application of Data Mining and Machine Learning techniques for Fraud Detectio...
Application of Data Mining and Machine Learning techniques for Fraud Detectio...Christian Adom
 
A data mining framework for fraud detection in telecom based on MapReduce (Pr...
A data mining framework for fraud detection in telecom based on MapReduce (Pr...A data mining framework for fraud detection in telecom based on MapReduce (Pr...
A data mining framework for fraud detection in telecom based on MapReduce (Pr...Mohammed Kharma
 
Fraud Detection presentation
Fraud Detection presentationFraud Detection presentation
Fraud Detection presentationHernan Huwyler
 
Credit risk scoring model final
Credit risk scoring model finalCredit risk scoring model final
Credit risk scoring model finalRitu Sarkar
 
Survey on Credit Card Fraud Detection Using Different Data Mining Techniques
Survey on Credit Card Fraud Detection Using Different Data Mining TechniquesSurvey on Credit Card Fraud Detection Using Different Data Mining Techniques
Survey on Credit Card Fraud Detection Using Different Data Mining Techniquesijsrd.com
 
FRAUD DETECTION IN ONLINE AUCTIONING
FRAUD DETECTION IN ONLINE AUCTIONINGFRAUD DETECTION IN ONLINE AUCTIONING
FRAUD DETECTION IN ONLINE AUCTIONINGSatish Chandra
 
Entire forensic accounting project
Entire forensic accounting projectEntire forensic accounting project
Entire forensic accounting projectavinash mathias
 
Frauds in telecom sector
Frauds in telecom sectorFrauds in telecom sector
Frauds in telecom sectorsksahu099
 
Data Mining in telecommunication industry
Data Mining in telecommunication industryData Mining in telecommunication industry
Data Mining in telecommunication industrypragya ratan
 
Data mining in Telecommunications
Data mining in TelecommunicationsData mining in Telecommunications
Data mining in TelecommunicationsMohsin Nadaf
 
Using Data Mining Techniques to Analyze Crime Pattern
Using Data Mining Techniques to Analyze Crime PatternUsing Data Mining Techniques to Analyze Crime Pattern
Using Data Mining Techniques to Analyze Crime PatternZakaria Zubi
 
Anomaly detection in deep learning
Anomaly detection in deep learningAnomaly detection in deep learning
Anomaly detection in deep learningAdam Gibson
 
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic ConceptsData Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic ConceptsSalah Amean
 
ACFE Presentation on Analytics for Fraud Detection and Mitigation
ACFE Presentation on Analytics for Fraud Detection and MitigationACFE Presentation on Analytics for Fraud Detection and Mitigation
ACFE Presentation on Analytics for Fraud Detection and MitigationScott Mongeau
 
PayPal's Fraud Detection with Deep Learning in H2O World 2014
PayPal's Fraud Detection with Deep Learning in H2O World 2014PayPal's Fraud Detection with Deep Learning in H2O World 2014
PayPal's Fraud Detection with Deep Learning in H2O World 2014Sri Ambati
 

Viewers also liked (20)

Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
Application of Data Mining and Machine Learning techniques for Fraud Detectio...
Application of Data Mining and Machine Learning techniques for Fraud Detectio...Application of Data Mining and Machine Learning techniques for Fraud Detectio...
Application of Data Mining and Machine Learning techniques for Fraud Detectio...
 
A data mining framework for fraud detection in telecom based on MapReduce (Pr...
A data mining framework for fraud detection in telecom based on MapReduce (Pr...A data mining framework for fraud detection in telecom based on MapReduce (Pr...
A data mining framework for fraud detection in telecom based on MapReduce (Pr...
 
Fraud Detection presentation
Fraud Detection presentationFraud Detection presentation
Fraud Detection presentation
 
Big Data CDR Analyzer - Kanthaka
Big Data CDR Analyzer - KanthakaBig Data CDR Analyzer - Kanthaka
Big Data CDR Analyzer - Kanthaka
 
Credit risk scoring model final
Credit risk scoring model finalCredit risk scoring model final
Credit risk scoring model final
 
Survey on Credit Card Fraud Detection Using Different Data Mining Techniques
Survey on Credit Card Fraud Detection Using Different Data Mining TechniquesSurvey on Credit Card Fraud Detection Using Different Data Mining Techniques
Survey on Credit Card Fraud Detection Using Different Data Mining Techniques
 
Managing the business_risk_of_fraud
Managing the business_risk_of_fraudManaging the business_risk_of_fraud
Managing the business_risk_of_fraud
 
Lecture - Data Mining
Lecture - Data MiningLecture - Data Mining
Lecture - Data Mining
 
FRAUD DETECTION IN ONLINE AUCTIONING
FRAUD DETECTION IN ONLINE AUCTIONINGFRAUD DETECTION IN ONLINE AUCTIONING
FRAUD DETECTION IN ONLINE AUCTIONING
 
Entire forensic accounting project
Entire forensic accounting projectEntire forensic accounting project
Entire forensic accounting project
 
Frauds in telecom sector
Frauds in telecom sectorFrauds in telecom sector
Frauds in telecom sector
 
Data Mining in telecommunication industry
Data Mining in telecommunication industryData Mining in telecommunication industry
Data Mining in telecommunication industry
 
Data mining in Telecommunications
Data mining in TelecommunicationsData mining in Telecommunications
Data mining in Telecommunications
 
Using Data Mining Techniques to Analyze Crime Pattern
Using Data Mining Techniques to Analyze Crime PatternUsing Data Mining Techniques to Analyze Crime Pattern
Using Data Mining Techniques to Analyze Crime Pattern
 
Anomaly detection in deep learning
Anomaly detection in deep learningAnomaly detection in deep learning
Anomaly detection in deep learning
 
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic ConceptsData Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
 
ACFE Presentation on Analytics for Fraud Detection and Mitigation
ACFE Presentation on Analytics for Fraud Detection and MitigationACFE Presentation on Analytics for Fraud Detection and Mitigation
ACFE Presentation on Analytics for Fraud Detection and Mitigation
 
Deep Learning for Fraud Detection
Deep Learning for Fraud DetectionDeep Learning for Fraud Detection
Deep Learning for Fraud Detection
 
PayPal's Fraud Detection with Deep Learning in H2O World 2014
PayPal's Fraud Detection with Deep Learning in H2O World 2014PayPal's Fraud Detection with Deep Learning in H2O World 2014
PayPal's Fraud Detection with Deep Learning in H2O World 2014
 

Similar to Detect Fraud Before It Happens

Data mining techniques and dss
Data mining techniques and dssData mining techniques and dss
Data mining techniques and dssNiyitegekabilly
 
Big Data Analytics Fraud Detection and Risk Management in Fintech.pdf
Big Data Analytics Fraud Detection and Risk Management in Fintech.pdfBig Data Analytics Fraud Detection and Risk Management in Fintech.pdf
Big Data Analytics Fraud Detection and Risk Management in Fintech.pdfSmartinfologiks
 
Data mining and privacy preserving in data mining
Data mining and privacy preserving in data miningData mining and privacy preserving in data mining
Data mining and privacy preserving in data miningNeeda Multani
 
Claims Fraud Network Analysis
Claims Fraud Network AnalysisClaims Fraud Network Analysis
Claims Fraud Network AnalysisCogitate.us
 
Big Data Meets Privacy:De-identification Maturity Model for Benchmarking and ...
Big Data Meets Privacy:De-identification Maturity Model for Benchmarking and ...Big Data Meets Privacy:De-identification Maturity Model for Benchmarking and ...
Big Data Meets Privacy:De-identification Maturity Model for Benchmarking and ...Khaled El Emam
 
Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Seerat Malik
 
Predictive analytics-white-paper
Predictive analytics-white-paperPredictive analytics-white-paper
Predictive analytics-white-paperShubhashish Biswas
 
Data mining by_ashok
Data mining by_ashokData mining by_ashok
Data mining by_ashokAshok Kumar
 
Top Data Mining Techniques and Their Applications
Top Data Mining Techniques and Their ApplicationsTop Data Mining Techniques and Their Applications
Top Data Mining Techniques and Their ApplicationsPromptCloud
 
big data on science of analytics and innovativeness among udergraduate studen...
big data on science of analytics and innovativeness among udergraduate studen...big data on science of analytics and innovativeness among udergraduate studen...
big data on science of analytics and innovativeness among udergraduate studen...johnmutiso245
 
big data on science of analytics and innovativeness among udergraduate studen...
big data on science of analytics and innovativeness among udergraduate studen...big data on science of analytics and innovativeness among udergraduate studen...
big data on science of analytics and innovativeness among udergraduate studen...johnmutiso245
 
Why Data Science is Getting Popular in 2023?
Why Data Science is Getting Popular in 2023?Why Data Science is Getting Popular in 2023?
Why Data Science is Getting Popular in 2023?kavyagaur3
 
Big data introduction
Big data introductionBig data introduction
Big data introductionvikas samant
 
Application areas of data mining
Application areas of data miningApplication areas of data mining
Application areas of data miningpriya jain
 

Similar to Detect Fraud Before It Happens (20)

Data mining
Data miningData mining
Data mining
 
Data mining techniques and dss
Data mining techniques and dssData mining techniques and dss
Data mining techniques and dss
 
Cis 500 assignment 4
Cis 500 assignment 4Cis 500 assignment 4
Cis 500 assignment 4
 
Big Data Analytics Fraud Detection and Risk Management in Fintech.pdf
Big Data Analytics Fraud Detection and Risk Management in Fintech.pdfBig Data Analytics Fraud Detection and Risk Management in Fintech.pdf
Big Data Analytics Fraud Detection and Risk Management in Fintech.pdf
 
Fake News Detection
Fake News DetectionFake News Detection
Fake News Detection
 
Data mining and privacy preserving in data mining
Data mining and privacy preserving in data miningData mining and privacy preserving in data mining
Data mining and privacy preserving in data mining
 
Claims Fraud Network Analysis
Claims Fraud Network AnalysisClaims Fraud Network Analysis
Claims Fraud Network Analysis
 
Big Data Meets Privacy:De-identification Maturity Model for Benchmarking and ...
Big Data Meets Privacy:De-identification Maturity Model for Benchmarking and ...Big Data Meets Privacy:De-identification Maturity Model for Benchmarking and ...
Big Data Meets Privacy:De-identification Maturity Model for Benchmarking and ...
 
Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Data Mining: What is Data Mining?
Data Mining: What is Data Mining?
 
Predictive analytics-white-paper
Predictive analytics-white-paperPredictive analytics-white-paper
Predictive analytics-white-paper
 
Data mining by_ashok
Data mining by_ashokData mining by_ashok
Data mining by_ashok
 
Top Data Mining Techniques and Their Applications
Top Data Mining Techniques and Their ApplicationsTop Data Mining Techniques and Their Applications
Top Data Mining Techniques and Their Applications
 
Data mining
Data miningData mining
Data mining
 
big data on science of analytics and innovativeness among udergraduate studen...
big data on science of analytics and innovativeness among udergraduate studen...big data on science of analytics and innovativeness among udergraduate studen...
big data on science of analytics and innovativeness among udergraduate studen...
 
big data on science of analytics and innovativeness among udergraduate studen...
big data on science of analytics and innovativeness among udergraduate studen...big data on science of analytics and innovativeness among udergraduate studen...
big data on science of analytics and innovativeness among udergraduate studen...
 
Why Data Science is Getting Popular in 2023?
Why Data Science is Getting Popular in 2023?Why Data Science is Getting Popular in 2023?
Why Data Science is Getting Popular in 2023?
 
Big data impact and concerns
Big data impact and concernsBig data impact and concerns
Big data impact and concerns
 
Big data introduction
Big data introductionBig data introduction
Big data introduction
 
Application areas of data mining
Application areas of data miningApplication areas of data mining
Application areas of data mining
 
A data-centric program
A data-centric program A data-centric program
A data-centric program
 

Detect Fraud Before It Happens

  • 1. DATA MINING PROJECT Fraud Detention using Data Mining JUNE 7, 2015 NORTHWESTERN UNIVERSITY, MSIS 435 ALBERT KENNEDY
  • 2. 1 TABLE OF CONTENTS Abstract............................................................................................................................................... 2 Introduction........................................................................................................................................ 3 Data Mining Applications.................................................................................................................... 4 Data Mining Themes........................................................................................................................... 4 CRISP-DM Methodology...................................................................................................................... 5 Data Understanding............................................................................................................................ 7 Data Preparation................................................................................................................................. 9 Data Mining Algorithm...................................................................................................................... 10 Experimental Results and Analysis................................................................................................ 11 Conclusion..................................................................................................................................... 14 Future Work.................................................................................................................................. 15 References .................................................................................................................................... 16
  • 3. 2 Fraud Detention 1 ABSTRACT The Adoption of data mining can be great for many use cases and organization that have a special need and understanding of what can be done with existing data. Many organizations don’t understand the power and value they have in what you control. With this, the benefits of using a process for making smarter decision will be discussed in this paper. For the purpose of explaining not yet a data mining topic and its benefits but also to address common problems in the Fraud and identity sector where many businesses and individuals can take advantage of. Fraud detention should be more important today than ever before. With the growing e-commence business that moving rapidly and people having more access to important, financial institutions need to be more aware of ways to detect possible fraudulent acts. We can achieve this goal with the use of data mining. This document will show case a typical problem with sample data taken from borrowers and their information that related to credit approval. Another data set is similar but present typical account holders and a common profile of related attributes that make up a “type” of customer. We will go into detail and the proper process of how to solve this problem using the outline:  Defining the business problem  Collection the data and enhancing that data  Choosing a model strategy and algorithm that fixes the business need  Executing the model through a training set then test that model  Evaluating the results of the model  Decide for the model or make any changes  Deploy the model into an actionable project The above outline is based on a framework that many data scientist use today called “The Cross Industry Standard Process for Data Mining” (CRISP-DM)1 . This foundation is what will be use to analyze the fraud detention use case of both the German Credit fraud data and the Give Me Some Credit data set to compare against using two separate techniques. This two datasets have been thoroughly cleanse and checked for correctly and free of any bias input to shew any result toward any direction. In order to become successful in building a fraud detention system, it is important to understand the data mining tools applications used in the industry. Second, it’s important to know the themes of data mining and most importantly the CRISP-DM methodology that will be used to support our business problem and data mining design for a reliable fraud detention system. 1 Information related to CRISP-DM, see CRISP-DM 1.0, Step-by-Step data mining guide, SPSS,
  • 4. 3 2 INTRODUCTION In America and many other others of the world, crime appears to be relevant in today’s society. As the government and we as people continue to find ways to prevent and coach individuals to shrey away from unlawful event, people find alternative and intelligent ways to be corrupted. Using our government law enforcement has only become as good as to catch perpetrators after a particular wrong doing event has been either reported or caught. Not all actions we can’t control and stop before happening that’s within any person’s private setting. How about we can help prevent those crimes that can be warned if not stop prior to happing? This type of crime that we people such as businesses and individual victims can have control over with the used of data analysis. The crimes of identity thief and Credit approval can be defended. The purpose of this study is to empower financial businesses and individuals with the ability to combat potential thief of personal and company information to avoid misuse for another person’s gain. Description of Problem: In today’s fast growing technology sociality, individuals are completing more and more online transactions and sending data to multiple sources of businesses using the same vital information to identify a person. Examples of this could be applying for credit via an online store. So what information is needed for this? In many cases  Full name  Telephone number or street address  Social security number (SSN) The above information is all that is needed for a creditors to approve someone for a line of credit or account under a person’s name. There’s an issue here; anyone that doesn’t know you can obtain this information easily. The only harder bit of information to obtain is an SSN. The best way for someone to get a person’s SSN, and through personal work records; a person that has access to this who handle’s administrative tasks for employees. This is a problem, because a telephone number and street address can be any number or address that the creditor does not care for other than a place to mail bills to. Typically a creditor will validate this through mailing information. Say a different phone number was given and validated with a phone call that could also be possible. We can use smarter ways to combat this, which some companies do with multiple levels on validating (through phone call, matching current address and identification). Our Objective: We can help fix these problems through the use of data mining and making sound business decisions in order to complete a credit transition.  First we need we can identify the types of data needed to solve an identity or fraud similar action. We will use pervious data from users of different credit accounts to do the analysis work on.  Then, we need to choose one or many different data mining algorithms to test this data for an outcome of what we are analyzing to make a recommendation, prediction, classifications, and/or description of for better business decision.
So, let's explore the different data mining applications that have been used for problems related to this one.

3 DATA MINING APPLICATIONS

There are many related data mining applications that can be used for the purpose of detecting fraudulent activities. This is a growing field of study in which many established organizations have sought out and completed in-depth research, although that research tends to expose the issues and weaknesses more than it delivers real-world working applications that actually identify and combat the problem in a defensive manner.

Morpho is a company whose mission is to be the market leader in security solutions and a pioneer in identification and detection systems. They deliver products that target government and national agencies with dedicated tools and systems to safeguard sensitive information. They completed a study using data mining in relation to identity fraud, as an application to prevent or warn businesses, government organizations, and individuals of a possible fraudulent act. In their paper on the Safran product, "Fighting Identity Fraud with Data Mining,"2 they describe a comprehensive fraud-proof process.

A second effort worth mentioning that used data mining methods for a similar study was conducted by Federal Data Corporation and SAS Institute Inc. Together they completed a thorough study, "Using Data Mining Techniques for Fraud Detection," carried out with the SAS Enterprise Miner software. The two use cases presented were 1) health care fraud detection and 2) purchase card fraud detection. Both have similar, if not the same, business problems and end goals. In the first case, FDC and SAS used decision trees to group all the nominal input values into smaller groups that in turn give a predicted target outcome. In the second case study, they used a clustering modeling strategy; their analysis included three clusters, which showed how cluster analysis efficiently segments data into groups of similar cases. The overall conclusion for both was that the techniques unveiled unknown patterns and regularities in their data.

4 DATA MINING THEMES

The study of data mining is best explained and organized into different themes. These areas are described by the four core data mining tasks. According to "Introduction to Data Mining" by Pang-Ning Tan, these themes are covered by four core tasks: predictive modeling, cluster analysis, association analysis, and anomaly detection.3 To briefly describe each theme of data mining, it is best to show examples. The themes detailed below are:
 Classification
 Clustering
 Anomaly detection

2 Product from Morpho Inc., Safran, "Fighting Identity Fraud with Data Mining."
3 See more information on themes in Assignment 1: Data Science Application, Kennedy, Albert.
First, predictive modeling is split into two types: 1) classification, which is used for discrete target variables, and 2) regression, which is used for continuous target variables (Tan). The classification type is most commonly used to predict the outcome, the target variable, of a single action, whereas regression might analyze the amount a consumer spends monthly on an e-commerce website. These two kinds of target variables (discrete and continuous) are what distinguish the two types of predictive model.

The second theme, cluster analysis, "seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than observations that belong to other clusters" (Tan). A good example of this is grouping customers by purchasing behavior; doing so helps a clothing retailer define the types of customers it has and what they purchase.

Lastly, there is the association analysis theme. This theme is used "to discover patterns that describe strongly associated features in the data" (Tan). Association analysis is commonly used in retail and grocery businesses: we can group items together that have similarities based on related attributes and/or paired transactions from users.

Each of these themes has its purpose in solving particular data science problems, and businesses can make decisions using these techniques. Understanding each definition is key to ensuring that the right solution or theme is applied to the right problem. Once that understanding is in place, an analyst can choose the most appropriate tool for the case at hand. In many situations more than one theme applies, and multiple themes can be combined for better analysis and comparison of results.
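To make the distinction between these themes concrete, here is a minimal sketch in Python using scikit-learn on synthetic toy data. This example is not part of the original study (which was carried out in the Weka tool); every dataset and parameter below is invented purely for illustration.

# Illustrative sketch only: the three task types on synthetic data.
from sklearn.datasets import make_classification, make_regression, make_blobs
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Classification: discrete target (e.g., good/bad credit risk)
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=1)
clf = DecisionTreeClassifier(random_state=1).fit(Xc, yc)
print("classification accuracy on training data:", clf.score(Xc, yc))

# Regression: continuous target (e.g., monthly spend)
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10, random_state=1)
reg = LinearRegression().fit(Xr, yr)
print("regression R^2 on training data:", reg.score(Xr, yr))

# Clustering: no target at all, just groups of similar observations
Xk, _ = make_blobs(n_samples=200, centers=3, random_state=1)
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(Xk)
print("cluster sizes:", [list(km.labels_).count(i) for i in range(3)])

Classification and regression both require a labeled target column, while clustering works on the attributes alone; that difference is what separates predictive modeling from cluster analysis.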
5 CRISP-DM METHODOLOGY

With so many data mining techniques available, we do not want to apply the wrong or a less effective solution. Luckily, there is a well-organized methodology that gives businesses the steps and processes to carry out this type of data mining project: the Cross Industry Standard Process for Data Mining, or CRISP-DM for short. CRISP-DM is a structured framework with hierarchical steps to follow that guides a proper data mining problem and solution. CRISP-DM includes six phases:4

Figure 5-1: CRISP-DM diagram

1) Business Understanding – this makes sense as an initial step. Any data mining problem has a business need and problem that must be understood. This stage "represents a part of the craft where the analysts' creativity plays a large role…the design team should think carefully about the use scenario" (Provost). In this stage, questions such as what needs to be done and how it needs to be done are asked.

2) Data Understanding – this phase is a self-explanatory step, yet it can take a lot of time. Making sure that you as an analyst become knowledgeable about the data makes for an easier process. This phase enables you to become "familiar with the data, identify data quality problems, discover first insights into data, and/or detect interesting subsets to form hypotheses regarding hidden information" (SPSS).

3) Data Preparation – in order to make a good analysis, we need tools that let us process the data in the manner best suited to the chosen model. This phase may require converting data into a simple tabular format, removing attributes that are not relevant to the data problem, and/or converting a data file into the particular format required by the chosen data mining tool.

4 Information related to CRISP-DM: see CRISP-DM 1.0, Step-by-Step Data Mining Guide, SPSS.
4) Modeling – "The modeling stage is the primary place where data mining techniques are applied to the data" (Provost). Simply put, this is where the magic happens and the actual data mining craft and chosen algorithm(s) are put to work.

5) Evaluation – at the evaluation phase, we take time to assess the results and outcomes of the models that were built from our data. The most important aspect of this phase is to gain confidence in the model's outcome. We analyze the results and make sure we understand them, to ensure the model is reliable enough to meet the original business problem's needs.

6) Deployment – now that the model has been created and tested, we can put this reliable model to use in a real production case. What can we do with it? "The knowledge gained will need to be organized and presented in a way that the customer can use it" (SPSS). Depending on the business's need for the data mining model that was crafted, deployment can be simple or complex, from reporting the results to managers to actually implementing the model inside the business in need.

Within each phase there is a set of tasks that the business helps generate, for both the data mining team and the business to complete, in order to move through the CRISP-DM cycle successfully.

6 DATA UNDERSTANDING

For the purpose of demonstrating the data mining algorithms used for fraud detection, we will examine two different data sets. The first is the German Credit fraud data from Dr. Hans Hofmann of the University of Hamburg in Germany. This data has 1,000 instances of customer information used to understand credit approval. There are 20 attributes that describe each instance and its uniqueness.

German Credit Data Definition (variable – description – type):
over_draft – Status of existing checking account – qualitative
credit_usage – Duration in months – numerical
credit_history – A30: no credits taken / all credits paid back duly; A31: all credits at this bank paid back duly; A32: existing credits paid back duly till now; A33: delay in paying off in the past; A34: critical account / other credits existing (not at this bank) – qualitative
purpose – Type of credit/loan needed (new car, used car, furniture/equipment, radio/tv, repairs, education, vacation, business, other) – qualitative
current_balance – Credit amount – numerical
Average_Credit_Balance – Savings account/bonds – qualitative
employment – Present employment since – qualitative
Location – Installment rate as a percentage of disposable income – numerical
Personal_status – Personal status and sex – qualitative
other_parties – Other debtors / guarantors – qualitative
residence_since – Present residence since – numerical
property_magnitude – Property – qualitative
cc_age – Age in years – numerical
other_payment_plans – Other installment plans – qualitative
housing – Housing type (rent, own, for free) – qualitative
existing_credits – Number of existing credits at this bank – numerical
job – Job type (unemployed/unskilled, skilled, management/self-employed, highly qualified) – qualitative
num_dependents – Number of people liable to provide maintenance for – numerical
own_telephone – Telephone – qualitative
foreign_worker – Indicates yes or no if a foreign worker – qualitative
class – The cost-matrix label indicating whether the customer is good or bad – qualitative

The above is financial data taken from 1994 for customers who applied for credit. The attributes are of integer and categorical types. This data set is best used for a classification data mining task.

The second data source comes from Kaggle's Give Me Some Credit competition. According to Kaggle, the purpose of this dataset is to advance "the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years." This dataset uses fewer attributes to describe the customers; the sample used here has 4,000 instances.

Credit Distress Probability Data Definition (variable – description – type):
SeriousDlqin2yrs – Person experienced 90 days past due delinquency or worse – Y/N
RevolvingUtilizationOfUnsecuredLines – Total balance on credit cards and personal lines of credit (except real estate and installment debt like car loans) divided by the sum of credit limits – percentage
age – Age of borrower in years – integer
NumberOfTime30-59DaysPastDueNotWorse – Number of times borrower has been 30-59 days past due but no worse in the last 2 years – integer
DebtRatio – Monthly debt payments, alimony, and living costs divided by monthly gross income – percentage
MonthlyIncome – Monthly income – real
NumberOfOpenCreditLinesAndLoans – Number of open loans (installment, like a car loan or mortgage) and lines of credit (e.g., credit cards) – integer
NumberOfTimes90DaysLate – Number of times borrower has been 90 days or more past due – integer
NumberRealEstateLoansOrLines – Number of mortgage and real estate loans, including home equity lines of credit – integer
NumberOfTime60-89DaysPastDueNotWorse – Number of times borrower has been 60-89 days past due but no worse in the last 2 years – integer
NumberOfDependents – Number of dependents in family, excluding themselves (spouse, children, etc.) – integer

The goal of this data is to help build a model that helps borrowers make better financial decisions. This will also be treated as a classification task.
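As a hedged illustration of the Data Understanding phase (not something done in the original Weka workflow), a few lines of Python with pandas can profile the Kaggle file before any modeling; the file name and index column below are assumptions.

# Illustrative sketch: quick profile of the Give Me Some Credit sample.
import pandas as pd

df = pd.read_csv("cs-training.csv", index_col=0)   # path is an assumption

print(df.shape)                                             # instances and attributes
print(df["SeriousDlqin2yrs"].value_counts(normalize=True))  # class balance
print(df.describe().T)                                      # ranges help spot implausible values
print(df.isnull().sum())                                    # attributes with missing values

Checks like these surface any attributes with missing values, which the clustering output later reports replacing with the mean or mode.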
7 DATA PREPARATION

The preparation process for the German Credit fraud data was very simple. I decided to use the ARFF file format, since the data comes from the UC Irvine Machine Learning Repository. I collected the attributes needed for the analysis and pasted them into a notepad application, using lines beginning with the @ symbol to declare the attributes, then placed the raw data below an @data line. As long as the comma-separated raw data has the same number of values as the number of declared attributes, the file is valid. Here is a snapshot of what the inside of the ARFF file contains:

@relation german_credit
@attribute over_draft { '<0', '0<=X<200', '>=200', 'no checking'}
@attribute credit_usage real
@attribute credit_history { 'no credits/all paid', 'all paid', 'existing paid', 'delayed previously', 'critical/other existing credit'}
@attribute purpose { 'new car', 'used car', furniture/equipment, radio/tv, 'domestic appliance', repairs, education, vacation, retraining, business, other}
@attribute current_balance real
@attribute Average_Credit_Balance { '<100', '100<=X<500', '500<=X<1000', '>=1000', 'no known savings'}
@attribute employment { unemployed, '<1', '1<=X<4', '4<=X<7', '>=7'}
@attribute location real
@attribute personal_status { 'male div/sep', 'female div/dep/mar', 'male single', 'male mar/wid', 'female single'}
@attribute other_parties { none, 'co applicant', guarantor}
@attribute residence_since real
@attribute property_magnitude { 'real estate', 'life insurance', car, 'no known property'}
@attribute cc_age real
@attribute other_payment_plans { bank, stores, none}
@attribute housing { rent, own, 'for free'}
@attribute existing_credits real
@attribute job { 'unemp/unskilled non res', 'unskilled resident', skilled, 'high qualif/self emp/mgmt'}
@attribute num_dependents real
@attribute own_telephone { none, yes}
@attribute foreign_worker { yes, no}
@attribute class { good, bad}
@data
'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,4,'real estate',67,none,own,2,skilled,1,yes,yes,good
'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female div/dep/mar',none,2,'real estate',22,none,own,1,skilled,1,none,yes,bad

A compiled version of this file can be viewed here: http://weka.8497.n7.nabble.com/file/n23121/credit_fruad.arff

The second data source is a CSV file that required some modification. The original data set file had 150,000 instances, which was too large for the WEKA data mining tool's heap size to handle. I manually reduced the data set by dividing the total 150,000 rows into four sections and removing 3,750 rows from the bottom half of the sections. I did this to get an evenly distributed sample instead of only taking the top 4,000 instances. The reduced file was then saved as cs-training.csv for WEKA input.

For both files, I created data definitions to elaborate on the chosen attributes. The purpose of this is to explain what inputs are used and their purpose, for a clearer business understanding when we need to make a decision after reviewing the results.
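The even reduction described above was done by hand; the following sketch shows one way the same idea could be scripted. The file names are assumptions and are not from the original project.

# Sketch only: keep an evenly distributed ~4,000-row sample instead of the
# first 4,000 rows of the 150,000-row file. File names are assumptions.
import pandas as pd

full = pd.read_csv("cs-training-full.csv", index_col=0)   # ~150,000 instances

# Take every Nth row so the sample is spread across the whole file,
# approximating the manual "four sections" reduction described above.
step = len(full) // 4000
sample = full.iloc[::step].head(4000)

sample.to_csv("cs-training.csv")   # smaller file that fits in WEKA's heap

Taking every Nth row keeps the sample spread across the entire file, which is the stated goal of the manual reduction.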
8 DATA MINING ALGORITHM

This section selects the data mining algorithm for the project and elaborates on the chosen algorithm in detail, with the reasons it was chosen over other algorithms.

German Credit Fraud use case: The first dataset, the German credit fraud data, was ideal for the decision tree algorithm. The decision tree, a classification algorithm, is a good first technique for this particular use case. Because it is such a commonly used algorithm, there are several reasons an analysis with this method is beneficial. First, it is a relatively simple approach for classification-type data: it takes sample data with known attributes and places the records into categories. Second, it helps you visualize the workflow of how the data is broken down into the sections that make the decisions. Last, you can derive a predictive outcome from the results.

Before using a decision tree, we need to explain in more depth how this method works and its structure. When data is run through this type of analysis, the tree structures the data into three kinds of nodes:
 The root node holds the attribute used to ask the initial question of whether or not something belongs in a particular group. There is only a single root node, and the groups branch off from it much like a tree from its base.
 From there grow the internal nodes, each of which heads a branch off the root node. The purpose of an internal node is to carry information pertaining only to its group.
 Lastly, there are the leaf nodes. Think of these as individual leaves that answer the internal nodes; there can be multiple leaves branching off an internal node. They finalize the answer for an item in a particular internal node's group.

Figure 8-1: Decision tree
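As a rough code counterpart to the description above: the paper's tree was built in the Weka GUI, but an analogous CART tree can be sketched in Python with scikit-learn, assuming a CSV export of the German credit data (the file name and tree depth below are hypothetical choices for the sketch).

# Sketch of the decision tree step, assuming a CSV export named german_credit.csv
# with the 20 attributes and the good/bad class column listed earlier.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("german_credit.csv")          # 1,000 instances
X = pd.get_dummies(data.drop(columns="class"))   # one-hot encode qualitative attributes
y = data["class"]                                # good / bad

tree = DecisionTreeClassifier(max_depth=5, random_state=0)

# 10-fold cross-validation, in the spirit of the stratified cross-validation
# reported in the results section
scores = cross_val_score(tree, X, y, cv=10)
print("mean accuracy:", scores.mean())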
This simple method works well for data that has multiple attributes with classes in them. For the purposes of our fraudulent credit problem, the root node answers the major question of who will overdraft and who will not. In this case, however, I am less interested in the derived nodes themselves than in the confusion matrix. The confusion matrix separates the actual classes from the predicted classes, including the falsely classified instances.5 We use the confusion matrix to separate out the decisions made by the classifier, making explicit how one class is being confused with another, so that each kind of error can be handled separately. We do this by looking at the true class items and the predicted class items in a matrix:

5 See more information in Assignment 2: Marketing Campaign Effectiveness, Kennedy, Albert.

TP = true positive, FN = false negative, FP = false positive, TN = true negative

                         PREDICTED CLASS
                         Yes          No
ACTUAL CLASS    Yes      a = TP       b = FN
                No       c = FP       d = TN

Figure 8.2 – Confusion matrix

The goal is for the model to obtain the highest possible accuracy rate, or equivalently the lowest error rate.

9 EXPERIMENTAL RESULTS AND ANALYSIS

To test our fraud data, the experiments were executed with two different data mining techniques: the decision tree and the Simple K-Means algorithm.

German Credit Data use: Examining the first data set, the German credit fraud data, we analyze it using the decision tree algorithm. The output below shows the summary of the results.

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      705       70.5 %
Incorrectly Classified Instances    295       29.5 %
Kappa statistic                       0.2467
Mean absolute error                   0.3467
Root mean squared error               0.4796
Relative absolute error              82.5233 %
Root relative squared error         104.6565 %
Total Number of Instances          1000

=== Detailed Accuracy By Class ===
              TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.84      0.61      0.763       0.84     0.799       0.639      good
               0.39      0.16      0.511       0.39     0.442       0.639      bad
Weighted Avg.  0.705     0.475     0.687       0.705    0.692       0.639

=== Confusion Matrix ===

   a    b   <-- classified as
 588  112 |  a = good
 183  117 |  b = bad

What we are looking for here is a way to determine whether the model is good enough to use. Based on the results, 70.5% of the instances in the data set were correctly classified, which is decent justification. However, the ROC area is 0.639, which is not bad but is far from perfect or ideal. Let us consider the confusion matrix for a deeper analysis. Remember from above that the confusion matrix is separated into four sections that show where the good and bad classifications fall.

In the output, column a holds the instances predicted as good and column b the instances predicted as bad. Our matrix has an overwhelming 588 instances that are actual good customers also predicted as good (the true positives for the good class). Taking the predicted-good column as a whole:

588 + 183 = 771 predictions of "good"
588 / 771 ≈ 76%, the share of customers predicted as good that really are good (the precision for the good class, reported as 0.763 above)

To generate these results, the data mining tool used was Weka v3-6-12.
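The arithmetic above can be double-checked directly from the four counts Weka reports; nothing here is assumed beyond those numbers.

# Quick check of the figures above from the reported confusion matrix counts.
tp, fn = 588, 112   # actual good: predicted good / predicted bad
fp, tn = 183, 117   # actual bad:  predicted good / predicted bad

accuracy       = (tp + tn) / (tp + fn + fp + tn)   # 705 / 1000 = 0.705
precision_good = tp / (tp + fp)                    # 588 / 771 ≈ 0.763
recall_good    = tp / (tp + fn)                    # 588 / 700 = 0.840

print(accuracy, precision_good, recall_good)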
Give Me Some Credit use: Using the data set from the Kaggle competition, we take a different approach to viewing the results. Initially, trying the decision tree algorithm on this dataset did not yield results solid enough for analysis or to make any business sense, so it was better to apply a clustering algorithm and review those results instead. Simple K-Means was the algorithm chosen.

kMeans
======
Number of iterations: 10
Within cluster sum of squared errors: 5429.211249046848
Missing values globally replaced with mean/mode
Cluster centroids:

Time taken to build model (full training data) : 0.05 seconds

=== Model and evaluation on training set ===

Clustered Instances
0      401 ( 40%)
1      266 ( 27%)
2      333 ( 33%)

To help explain the results, it helps to define the K-means method and its use. K-means is a commonly used clustering algorithm that is a "simple, iterative way to approach and divide a data set into a specific number of clusters" (Manning). The process runs through the dataset and uses "closeness" as the way to measure the distance between items; this measure is the Euclidean distance. For each cluster a center point, called the centroid, is created, and every item is assigned to the cluster whose centroid it is closest to. If an item is closer to a different centroid than to its current one, it belongs to that other cluster, and the centroids are recomputed over several iterations until the assignments settle.
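For reference, here is a minimal sketch of the same clustering step outside Weka, assuming the prepared cs-training.csv sample from the data preparation section; dropping the label, scaling, and the random seed are choices made for this sketch, not settings from the original run.

# Sketch of the clustering step with scikit-learn; the paper itself ran
# SimpleKMeans in Weka. File name and preprocessing choices are assumptions.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("cs-training.csv", index_col=0)
data = data.drop(columns="SeriousDlqin2yrs")   # clustering is unsupervised, so the
                                               # label is set aside (a choice here)
data = data.fillna(data.mean())                # replace missing values with the mean,
                                               # as the Weka output above reports doing

X = StandardScaler().fit_transform(data)       # put attributes on comparable scales
                                               # so Euclidean distance is meaningful

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(pd.Series(km.labels_).value_counts())    # cluster sizes
print(km.cluster_centers_)                     # centroids (in standardized units)

Standardizing the attributes first matters because K-Means relies on Euclidean distance, so attributes measured on large scales, such as monthly income, would otherwise dominate the clustering.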
In the analysis above, we have three clusters, each grouping instances by the attributes they share.

Cluster 0: This cluster holds the largest share of instances. The results would suggest that the individuals in this cluster mostly have an existing, paid credit history, fall into the younger age group (around 31), are female, and are most likely requesting credit for a NEW CAR.

Cluster 1: This cluster has the smallest share of instances. The results would suggest that the customers in this group are the oldest (around age 40) and are seeking credit for USED CARS.

Cluster 2: This cluster covers 33% of the instances. The results would suggest these are SINGLE MALE customers seeking credit for RADIO/TV.

10 CONCLUSION

The completed analysis of the two datasets yields two different business conclusions. Visually inspecting the first dataset's decision tree results, I would initially conclude that the tree alone is not a strong enough test to serve as a solid basis for fraud detection. We have to remember that the goal is to address the problem of fraud in which customer information is misused to create credit accounts that benefit another person. However, the confusion matrix does present some promising results to consider: what a financial institution can take from it is a starting point for prediction, with roughly 76% of the customers predicted as good actually being good.

The second analysis should be viewed from a different angle, not so much as a predictive model but as a clear view of where customers land relative to the attributes tied to them. Using the Simple K-Means algorithm with 3 clusters, we can analyze several important things: 1) where the most important group of customers is, 2) their attributes in comparison with the other groups, 3) the purpose of the line of credit, and 4) the distribution of instances, that is, which group has the highest activity and propensity to apply for credit. With the clustering results, a business can answer these questions ahead of time, and we can pick and choose the determinants for our analysis. For our purpose, we want to detect the reason for a line of credit.
11 FUTURE WORK

This section describes the next steps to continue work on this project. There are organizations already taking the necessary actions to benefit from findings like these, such as Morpho with its Safran product. Here is a list of procedures and steps that should be considered for this study to become successful:

1) Ensure that the data scientists and analysts responsible for crafting these results use proper data mining practices and methodologies such as CRISP-DM.
2) Take risks: I believe there are enough intelligent groups of people who understand the worth of, and are capable of drafting, similar fraud detection analyses. There needs to be more action in the deployment phase.
3) Implement a production system in which this process is automated.

Based on the analysis and results, we have a plausible solution to the business problem of fraud in credit applications.
12 REFERENCES

Tan, Pang-Ning, and Michael Steinbach. Introduction to Data Mining. Boston: Pearson Addison Wesley, 2005. Print (Ch. 1, Ch. 4).

Provost, Foster, and Tom Fawcett. Data Science for Business. Sebastopol, CA: O'Reilly, 2014. Print (Ch. 2, Ch. 7).

Morpho, Safran. Fighting Identity Fraud with Data Mining: Groundbreaking Means to Prevent Fraud in Identity Management Solutions. France. Print (pp. 4, 7).

Federal Data Corporation and SAS. Using Data Mining Techniques for Fraud Detection: Solving Business Problems Using SAS Enterprise Miner Software. Cary, NC. Print (pp. 1, 15, 20).

Hofmann, Hans, University of Hamburg. Statlog (German Credit Data), UCI Machine Learning Repository, 2000. http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

Kaggle. Give Me Some Credit, 2011. https://www.kaggle.com/c/GiveMeSomeCredit