TELECOM SUBSCRIPTION FRAUD DETECTION
Using Naïve Bayes in R
By (BA Group2)
Santosh Koppada
Maruthi Nataraj K
Sudhanshu Ranjan
Sunil Kumar
Sumit Sahay
• Introduction
• Business Background
• Objective
• Datasets Description
• Tools & Methods Used
• Statistical Procedure
• Future Direction
• Telecommunication fraud, the focus of this study, appeals to fraudsters because calls from a mobile terminal are not tied to any physical location and a subscription is easy to obtain.
• Fraud hurts the company in four ways: financially, in marketing, in customer relations and in shareholder perception.
• In subscription fraud, fraudsters obtain an account with no intention of paying the bill (theft of service). At the level of the phone number, every call made from it is therefore fraudulent, which shows up as abnormal usage. The account is usually used for selling calls at cheaper rates or for intensive personal use.
• Bad Idea Company Ltd was the target of subscription fraud by a gang of three fraudsters: Sally, Virginia and Vince.
• Call logs of the fraudsters spanning one and a half months were recorded.
• An audit is carried out every 5 days to check whether these fraudsters have joined the company network.
• The list of subscribers is reviewed to identify those whose calling patterns match those of the fraudsters.
Note: the Praxis Plan allows 4 calls a day, one in each time slot: Morning (9 AM-Noon), Afternoon (Noon-4 PM), Evening (4 PM-9 PM) and Night (9 PM-Midnight).
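The four time slots in the note can be derived from a call's start hour. The sketch below is only an illustration: the `call_hour` values are made up, and assigning each boundary hour to the later slot (noon counts as Afternoon) is an assumption, since the slides do not say how boundaries are handled.

```r
# Hypothetical call start hours on a 24-hour clock.
call_hour <- c(9, 11, 12, 15, 16, 20, 21, 23)

# right = FALSE makes each interval closed on the left: [9,12), [12,16), [16,21), [21,24).
slot <- cut(
  call_hour,
  breaks = c(9, 12, 16, 21, 24),
  labels = c("Morning", "Afternoon", "Evening", "Night"),
  right  = FALSE
)

# Number of calls per slot.
table(slot)
```

Hours outside 9-24 fall outside every interval and come back as `NA`, which makes out-of-plan calls easy to spot.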
• Our goal is to create a fraud-management classification model that is powerful enough to handle the subscription fraud the company has already encountered and flexible enough to apply to cases that have not yet been seen.
• In this case, the company wants a high degree of certainty, expressed as a high probability, before treating a person as a fraudster.
• Call detail records, which are generated in real time and available immediately, can be used to build a robust statistical model.
• Dataset 1 – BlackListSubscriberCallLogs.xlsx
Instances – 138; Target Variable – Caller (3 levels)
• Dataset 2 – AuditLog.xlsx
Instances – 15
• Tools: R (RStudio)
• Statistical Method: Naïve Bayes Classifier
• Before the process starts, all required packages must be loaded:
1. ElemStatLearn
2. caret
3. klaR
4. gmodels
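A minimal sketch of the loading step, assuming the four packages above are meant to be attached with `library()`. Because some of them may not be installed (or may no longer be offered by the default repository), the sketch checks availability first instead of failing on the first missing package.

```r
# Packages named on the slide (R package names are case-sensitive).
required_pkgs <- c("ElemStatLearn", "caret", "klaR", "gmodels")

# requireNamespace() returns FALSE instead of raising an error when a package is absent.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach only the packages that are actually available.
for (pkg in required_pkgs[is_installed]) {
  library(pkg, character.only = TRUE)
}
```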
• The BlackListSubscriberCallLogs (CSV) file is read into the R environment (working directory: My Documents).
• A random sample of 70% of the instances from the Black List Callers is then selected as the Training dataset, and the remaining 30% becomes the Test dataset.
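The 70/30 split can be sketched in base R as follows. The data frame is a randomly generated stand-in with 138 rows: the column names `Morning`, `Afternoon`, `Evening`, `Night` and the callee labels are assumptions, because the slides give only the target variable `Caller` and the row count. The seed is arbitrary.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for the 138 black-list call records.
callees <- c("A", "B", "C", "D")
call_logs <- data.frame(
  Morning   = factor(sample(callees, 138, replace = TRUE)),
  Afternoon = factor(sample(callees, 138, replace = TRUE)),
  Evening   = factor(sample(callees, 138, replace = TRUE)),
  Night     = factor(sample(callees, 138, replace = TRUE)),
  Caller    = factor(sample(c("Sally", "Virginia", "Vince"), 138, replace = TRUE))
)

# Draw 70% of the row indices at random; the rest form the test set.
n         <- nrow(call_logs)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train_set <- call_logs[train_idx, ]
test_set  <- call_logs[-train_idx, ]

# Class shares of the target in each part, to compare for uniformity.
round(prop.table(table(train_set$Caller)), 2)
round(prop.table(table(test_set$Caller)), 2)
```

A purely random draw does not guarantee equal class shares in both parts, which is why comparing the two proportion tables is a useful check.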
• Next, the class proportions of the target variable are compared between the Training and Test datasets to confirm that they are similar.
• The attributes and the labels of the Training and Test datasets are then stored in separate variables (X~ and Y~ respectively) for convenience in coding.
• A Naïve Bayes classifier is then built on the Training dataset using 10-fold cross-validation: the data are split into 10 folds of size n/10, the model is trained on 9 folds and tested on the remaining one, the process is repeated 10 times so that each fold serves once as the test fold, and the mean accuracy is taken.
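The cross-validation procedure can be illustrated without any packages. The sketch below is not the authors' code: it uses a small hand-written categorical Naïve Bayes (with add-one smoothing, an assumption made to avoid zero probabilities) on simulated data whose column names and values are hypothetical. With caret, the same idea is usually expressed by passing `trainControl(method = "cv", number = 10)` to `train()`.

```r
set.seed(1)  # arbitrary seed

# Simulated training data: each caller tends to dial one favourite callee.
n         <- 100
caller    <- factor(sample(c("Sally", "Virginia", "Vince"), n, replace = TRUE))
favourite <- c(Sally = "A", Virginia = "B", Vince = "C")
make_slot <- function() {
  noisy <- sample(c("A", "B", "C"), n, replace = TRUE)
  factor(ifelse(runif(n) < 0.8, favourite[as.character(caller)], noisy),
         levels = c("A", "B", "C"))
}
slots   <- c("Morning", "Afternoon", "Evening", "Night")
train_x <- as.data.frame(setNames(replicate(4, make_slot(), simplify = FALSE), slots))
train_y <- caller

# Fit: class priors plus, for each attribute, a matrix of P(value | class).
nb_fit <- function(x, y, laplace = 1) {
  cond <- lapply(x, function(col) {
    counts <- table(y, col) + laplace
    unclass(counts / rowSums(counts))
  })
  list(prior = prop.table(table(y)), cond = cond, classes = levels(y))
}

# Predict: add log prior and log likelihoods per class, pick the largest sum.
nb_predict <- function(model, newx) {
  log_post <- matrix(log(as.numeric(model$prior)), nrow = nrow(newx),
                     ncol = length(model$classes), byrow = TRUE)
  for (name in names(model$cond)) {
    lik <- model$cond[[name]][, as.character(newx[[name]]), drop = FALSE]
    log_post <- log_post + t(log(lik))
  }
  factor(model$classes[max.col(log_post, ties.method = "first")],
         levels = model$classes)
}

# 10-fold cross-validation: every row is assigned to exactly one fold.
k        <- 10
fold_id  <- sample(rep(seq_len(k), length.out = n))
fold_acc <- vapply(seq_len(k), function(f) {
  held_out <- fold_id == f
  model    <- nb_fit(train_x[!held_out, , drop = FALSE], train_y[!held_out])
  mean(nb_predict(model, train_x[held_out, , drop = FALSE]) == train_y[held_out])
}, numeric(1))

mean_accuracy <- mean(fold_acc)
```

Each element of `fold_acc` is the accuracy on one held-out fold, and `mean_accuracy` is the cross-validated estimate described above.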
• The resulting classification model is applied to the Test data to predict the target class; the output also reports the posterior probability of each class.
• A confusion matrix is then generated, comparing the Naïve Bayes predictions with the actual classes of the data instances, to visualize the classification errors.
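A confusion matrix of this kind can be built in base R with `table()`. The labels below are made up for illustration and are not the authors' results. The packages listed earlier offer richer versions of the same table, for example `CrossTable()` in gmodels and `confusionMatrix()` in caret.

```r
# Hypothetical actual and predicted callers for eight test records.
classes   <- c("Sally", "Vince", "Virginia")
actual    <- factor(c("Sally", "Sally", "Vince", "Virginia",
                      "Virginia", "Vince", "Sally", "Virginia"), levels = classes)
predicted <- factor(c("Sally", "Virginia", "Vince", "Virginia",
                      "Virginia", "Vince", "Sally", "Sally"), levels = classes)

# Rows are predictions, columns are the true classes.
conf_mat <- table(Predicted = predicted, Actual = actual)
print(conf_mat)

# Overall accuracy: share of records on the diagonal.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Per-class recall: correct predictions divided by the true count of each class.
per_class_recall <- diag(conf_mat) / colSums(conf_mat)
```

Off-diagonal cells show which callers are mistaken for one another, which matters here because a false accusation is costly.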
• The AuditLog (CSV) file (the validation dataset) is read into the R environment (working directory: My Documents).
• At this stage, the required independent attributes are stored in a separate variable, and the model built earlier is applied to the validation dataset to predict the probable fraudsters together with their probabilities.
From the model's predictions on the audit log, we can infer that Customer X and Customer Z are probably Sally (judging by calling pattern) and that Customer Y might be Virginia.
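The probability attached to each prediction is a Naïve Bayes posterior: the class prior multiplied by the likelihood of each attribute value, then normalised across classes. The worked sketch below uses invented priors and likelihoods for a single audit record, and the 0.90 cut-off is a hypothetical value chosen to reflect the requirement for high confidence; none of these numbers come from the slides.

```r
# Hypothetical class priors.
prior <- c(Sally = 0.4, Virginia = 0.3, Vince = 0.3)

# Hypothetical P(observed callee in this slot | caller) for one audit record.
likelihood <- rbind(
  Morning   = c(Sally = 0.7, Virginia = 0.2, Vince = 0.1),
  Afternoon = c(Sally = 0.6, Virginia = 0.3, Vince = 0.1),
  Evening   = c(Sally = 0.5, Virginia = 0.2, Vince = 0.3),
  Night     = c(Sally = 0.6, Virginia = 0.1, Vince = 0.3)
)

# Naive independence assumption: multiply the slot likelihoods within each class.
unnormalised <- prior * apply(likelihood, 2, prod)
posterior    <- unnormalised / sum(unnormalised)

# Flag the record only when the best class clears the confidence threshold.
threshold <- 0.90
best      <- names(which.max(posterior))
flagged   <- posterior[[best]] >= threshold

round(posterior, 4)
```

Records whose top posterior falls below the threshold would be left unflagged rather than attributed to a fraudster on weak evidence.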
The same results, with greater accuracy, can be obtained using the e1071 package with Laplace correction.
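Laplace correction adds a small constant to every count before the conditional probabilities are computed, so that a value never seen with a given class does not get probability zero. In e1071 this is controlled by the `laplace` argument of `naiveBayes()`. The base-R sketch below shows the arithmetic on made-up counts.

```r
# Hypothetical counts: how often one caller dialled each callee in one time slot.
counts <- c(A = 9, B = 3, C = 0)

# Unsmoothed estimate: callee C gets probability 0, which would zero out the
# whole product of likelihoods for any record that contains C.
p_mle <- counts / sum(counts)

# Laplace (add-one) correction: add 1 to each count and renormalise.
laplace   <- 1
p_laplace <- (counts + laplace) / (sum(counts) + laplace * length(counts))

rbind(unsmoothed = p_mle, laplace = p_laplace)
```

The smoothed probabilities still sum to one, and the previously impossible value now has a small positive probability.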
Naïve Bayes Classification in RapidMiner
[Figure: RapidMiner process; labels: Black List Callers, Split Validation (70:30), Audit Log, 4 Time Slots]
[Figure: Inside Split Validation; label: Performance Classification]
[Figure: Confusion Matrix for Test data (30%)]
[Figure: Final Result with Probable Fraudster and Probabilities]
Thank You