SUPERVISED LEARNING: CLASSIFY SUBSCRIPTION FRAUD
The problem: Bad Idea Company knew that eventually someone would run an analytics
project on its rate plans. So it designed a strange plan called the Praxis Plan, in which you
are allowed to make only one call in each of the Morning (9AM-Noon), Afternoon (Noon-4PM),
Evening (4PM-9PM) and Night (9PM-Midnight) slots, i.e. 4 calls per day. The plan was very
popular and a lot of people opted for it.
However, Bad Idea was the target of subscription fraud by a gang of three fraudsters:
Sally, Vince and Virginia, whose services were eventually terminated. Bad Idea has
their call logs spanning one and a half months.
Every 5 days the company runs an audit to check whether these fraudsters have rejoined
the network. It reviews the list of subscribers who have called the same people
as the three fraudsters, in the same time frames.
The approach: This problem is an example of supervised learning. Unlike unsupervised
learning, where the idea is to find patterns in unlabeled data, supervised learning is the
task of inferring a function from labeled training data. To find a suitable classifier for the
problem, we first take a look at our data.
The data: Our training data, an Excel sheet, has 138 instances of the names
of the people called by the fraudsters Sally, Vince and Virginia in each time frame.
We also have a test set of 15 instances with the fraudster Caller feature/column missing,
which we predict by the end of this report. We use R to build our classifiers; before that,
we walk through the steps.
Methods: Since we deal with categorical, non-metric data, we have used the Naïve Bayes
classification technique with both the “caret” and “klaR” packages in R. A Naïve Bayes
classifier is a simple probabilistic classifier that applies Bayes’ theorem with strong
independence assumptions. In simple terms, a Naïve Bayes classifier assumes that the
presence or absence of a particular feature is independent of the presence or
absence of any other feature, given the class variable. So in this case, given each caller,
the time-frame features are assumed independent of one another.
An advantage of the Naïve Bayes classifier is that it requires only a small amount of
training data to estimate the parameters necessary for classification.
Since the classifier is a direct application of Bayes’ theorem, we can compute each of the
required steps manually. First, the prior probabilities of each caller:
P(Sally) = 47/138 ≈ 0.3406
P(Vince) = 43/138 ≈ 0.3116
P(Virginia) = 48/138 ≈ 0.3478
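These priors can be reproduced with a short script. The original report computes them in R; the sketch below uses Python for illustration, with the per-caller counts taken from the figures above (Fraction keeps the arithmetic exact):

```python
from fractions import Fraction

# Counts of training instances per fraudster, taken from the priors above.
counts = {"Sally": 47, "Vince": 43, "Virginia": 48}
total = sum(counts.values())   # 138

# Prior of each caller = its count over the total number of instances.
priors = {caller: Fraction(n, total) for caller, n in counts.items()}

for caller, p in priors.items():
    print(f"P({caller}) = {p} = {float(p):.4f}")
```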
We compute these from our training data (mlcallers), the given labeled BlackListedSubscriberCall
dataset. Now suppose we need to find the likely caller if, in the Morning, a
call was made to Robert, in the Afternoon to John, in the Evening to David and at Night to Alex.
P(Morning = Robert | Caller = Sally) = 0 (with Laplace smoothing we would replace this with a small non-zero probability)
P(Afternoon = John| Caller = Sally) = 19/47
P(Evening = David| Caller = Sally) = 23/47
P(Night = Alex| Caller = Sally) = 28/47
Thus, P(Caller = Sally | Morning = Robert, Afternoon = John, Evening = David, Night = Alex)
∝ 0 × (19/47) × (23/47) × (28/47) × (47/138),
which would be zero. However, if we assign a small non-zero probability to events that
never occur in the sample (Laplace smoothing), we end up with a usable posterior. Computing
the same quantity for Vince and Virginia and comparing the three posteriors, we find that
Sally has the highest probability, so we choose her as the predicted caller.
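The zero-probability issue can be sketched numerically. Sally's counts are the ones given above; the smoothing constant fL = 1 and the assumption of 20 distinct callee names are placeholders, since the report does not state how many different people were called:

```python
from fractions import Fraction

fL = 1        # Laplace smoothing constant (illustrative choice)
n_names = 20  # assumed number of distinct callee names (not stated in the report)

def smoothed(count, class_total):
    # Additive (Laplace) smoothing: unseen events get a small non-zero probability.
    return Fraction(count + fL, class_total + fL * n_names)

# Sally's counts from above: Morning=Robert never observed (0), Afternoon=John 19,
# Evening=David 23, Night=Alex 28, out of her 47 training calls.
prior_sally = Fraction(47, 138)
unsmoothed = Fraction(0, 47) * Fraction(19, 47) * Fraction(23, 47) * Fraction(28, 47)
smoothed_lik = smoothed(0, 47) * smoothed(19, 47) * smoothed(23, 47) * smoothed(28, 47)

print(float(prior_sally * unsmoothed))    # 0.0, the single zero wipes everything out
print(float(prior_sally * smoothed_lik))  # small but strictly positive
```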
Cross-validation: To implement the Naïve Bayes classifier, we also use 10-fold
cross-validation. The algorithm divides the training set randomly into 10 parts, trains
on 9 parts and tests on the remaining part. Here that means training on roughly 124
instances and testing on the remaining 14 (or 125 and 13, since 138 does not divide evenly
by 10), repeating the process 10 times with a different held-out part each time. The best
model is selected, tweaked if necessary by adjusting a few parameters, checked for accuracy,
and then used to predict.
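The fold arithmetic above can be sketched with plain index shuffling; only the split is shown here, not the actual model fitting, and Python stands in for the R code the report used:

```python
import random

random.seed(42)
n = 138                          # number of training instances, as above
indices = list(range(n))
random.shuffle(indices)

k = 10
# Striding through the shuffled indices yields 8 folds of 14 and 2 folds of 13.
folds = [indices[i::k] for i in range(k)]

for test_fold in folds:
    held_out = set(test_fold)
    train = [i for i in indices if i not in held_out]
    # ...fit the classifier on `train` and score it on `test_fold` here...
    assert len(train) + len(test_fold) == n
```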
The Process: Making use of the “caret” and “klaR” packages in R, we do the following:
• Load the libraries
• Train our model using 10-fold cross-validation
• Check the Accuracy and Kappa values
• Tweak the parameters of the classifier model
• Run on the test set to predict
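The steps above can be mirrored end to end in a compact sketch. The rows below are invented for illustration (the real data set is the 138-row call log), and Python stands in for the caret/klaR calls, so this is a toy pipeline, not the report's actual code:

```python
import random
from collections import Counter, defaultdict

random.seed(0)
NAMES = ["Robert", "John", "David", "Alex", "Frank", "Clark"]
CALLERS = ["Sally", "Vince", "Virginia"]

# Step 1: "load" a toy data set (invented; the real one has 138 labeled rows).
def make_row(caller):
    bias = CALLERS.index(caller)           # give each caller a different habit
    pool = NAMES[bias:bias + 3]
    return tuple(random.choice(pool) for _ in range(4)) + (caller,)

rows = [make_row(random.choice(CALLERS)) for _ in range(138)]

# Step 2: train a categorical Naive Bayes classifier with Laplace smoothing.
def train(train_rows, fL=1):
    priors = Counter(r[-1] for r in train_rows)
    cond = defaultdict(Counter)            # (slot, caller) -> callee-name counts
    for r in train_rows:
        for slot in range(4):
            cond[(slot, r[-1])][r[slot]] += 1

    def predict(features):
        def score(c):
            s = priors[c] / len(train_rows)
            for slot, name in enumerate(features):
                s *= (cond[(slot, c)][name] + fL) / (priors[c] + fL * len(NAMES))
            return s
        return max(CALLERS, key=score)

    return predict

# Steps 3-5: check accuracy on a held-out slice, then predict on unseen rows.
predict = train(rows[:124])
held_out = rows[124:]
accuracy = sum(predict(r[:4]) == r[-1] for r in held_out) / len(held_out)
print(f"hold-out accuracy: {accuracy:.2f}")
```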

Now that we have our model in place, we randomly hold out 15 instances of our training set
(about 10%) and test the prediction abilities of our classifier on them.
Upon analyzing the confusion matrix, we find that:
• The accuracy of our model is (4+3+4)/15 = 73.3%
• Precision of predicting Sally = 4/5 = 80%
• Precision of predicting Vince = 3/5 = 60%
• Precision of predicting Virginia = 4/5 = 80%
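These figures can be checked from a confusion matrix. The report gives only the diagonal (4, 3, 4) and per-caller prediction totals of 5, so the off-diagonal cells below are one hypothetical arrangement consistent with those numbers:

```python
# Rows = actual caller, columns = predicted caller (Sally, Vince, Virginia).
# Diagonal counts come from the report; off-diagonal cells are hypothetical.
labels = ["Sally", "Vince", "Virginia"]
cm = [
    [4, 1, 0],   # actual Sally
    [1, 3, 1],   # actual Vince
    [0, 1, 4],   # actual Virginia
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / total

# Precision of class j = correct predictions of j / all predictions of j.
precision = {
    labels[j]: cm[j][j] / sum(cm[i][j] for i in range(3)) for j in range(3)
}
print(round(accuracy, 3), precision)
```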

Tweaking parameters: a digression. Upon tweaking a few controls while building the classifier, we
found that the confusion matrix given by our initial model remained unchanged. Here we
used “usekernel = TRUE” and a Laplace smoothing factor “fL = 2”. Kernel density estimation
is a non-parametric way to estimate the probability density function of a random variable,
essentially a data-smoothing technique. Laplace (additive) smoothing is a technique for
smoothing categorical data: it assigns non-zero probabilities to events that do not
occur in a sample.
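The effect of the fL = 2 setting can be shown in one line. The choice of 20 possible callee names is an assumption for illustration; the report does not state the category count:

```python
from fractions import Fraction

def laplace(count, total, fL, k):
    # Additive smoothing over k possible categories.
    return Fraction(count + fL, total + fL * k)

# With fL = 2 (the value used above) and, say, 20 possible callee names,
# an event never observed in 47 samples still gets a small probability:
p_unseen = laplace(0, 47, fL=2, k=20)
print(p_unseen)   # 2/87
```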

Checking the posterior probabilities, we also find that the predictions on the held-out
part of our training set come with strong conditional probabilities in most cases.
The unseen test data: Everything so far has been done on the training set, albeit with
splitting and random validation of each part. We now move to an unlabeled test set, i.e. one
without the Caller column, which we predict with the previously trained algorithm.
This new, unseen test set has only 15 observations and no class label; we call it the
“testdata”.
Predicting the Caller together with its confidence (the posterior probability), we obtain:

The calling pattern can be predicted fairly well, apart from the instances marked in red. For
those, we do not have strong enough probability evidence to be certain that the callers are
indeed the fraudsters; we might flag them and set up a tagged priority for those cases. For
the others, we can be fairly confident, probabilistically.
Using decision trees: Decision tree learning (or classification trees) uses a decision tree
as a predictive model mapping observations about an item to conclusions about the item’s
target value. In these tree structures, leaves represent class labels and branches represent
conjunctions of features that lead to those class labels. The resulting tree can be used as
an input for decision making.
Once again, we make use of cross-validation to build the classifier.
Comparing the accuracy of this classifier with the Naïve Bayes results, we find that the
latter still scores better. We then plot the tree.

One error in this plot is that “Sally” is printed across the bottom of every leaf. The label
should instead be the caller whose probability is maximal at that node; for example, node 20
(second from the right) should read “Virginia”.
Reading the plot: if the evening call goes to Frank, the night call to Clark, and the morning
call to Kelly, Larry or Robert, the probability that the fraudster caller is Vince is
close to 80%. The entire tree may be read similarly, taking all branches and leaves into account.
Code re-run: Tweaking our decision tree algorithm a bit further, we reached an accuracy
quite close to that of the Naïve Bayes classifier.
With mincriterion set to 0.01, we arrive at an accuracy of 72.6%. Next, we try a
Random Forest classifier on the same data.
Using Random Forests: Random Forests are an ensemble learning method for classification that
operates by constructing a multitude of decision trees at training time and outputting the
class that is the mode of the classes output by the individual trees.
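The “mode of the classes” aggregation is the only new idea relative to a single tree; it reduces to a majority vote. The trees below are stand-in functions for illustration, not real fitted trees, and Python stands in for the R code:

```python
from collections import Counter

# Three stand-in "trees", each mapping a feature tuple to a predicted caller.
# A real forest fits each tree on a bootstrap sample with random feature subsets.
trees = [
    lambda x: "Vince",
    lambda x: "Sally",
    lambda x: "Vince",
]

def forest_predict(x):
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]      # the modal (most frequent) class

print(forest_predict(("Robert", "John", "David", "Alex")))  # Vince
```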
We train this classifier with similar R code.
Although the model accuracy, at 84.8%, is higher than that of Naïve Bayes, we still decline
to use this classifier because its confusion matrix on the training set (which should ideally
be very optimistic) shows around a 43% error rate.
Weighing all of these models, we conclude that Naïve Bayes may be considered the best
classifier in this case, perhaps because the data set is quite small and categorical; it
also gives us the best results.

More Related Content

What's hot

SVM Vs Naive Bays Algorithm (Jupyter Notebook)
SVM Vs Naive Bays Algorithm (Jupyter Notebook)SVM Vs Naive Bays Algorithm (Jupyter Notebook)
SVM Vs Naive Bays Algorithm (Jupyter Notebook)Ravi Nakulan
 
M08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffM08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffRaman Kannan
 
Application of Genetic Algorithm in Software Testing
Application of Genetic Algorithm in Software TestingApplication of Genetic Algorithm in Software Testing
Application of Genetic Algorithm in Software TestingGhanshyam Yadav
 
Understanding the Machine Learning Algorithms
Understanding the Machine Learning AlgorithmsUnderstanding the Machine Learning Algorithms
Understanding the Machine Learning AlgorithmsRupak Roy
 
Machine learning session6(decision trees random forrest)
Machine learning   session6(decision trees random forrest)Machine learning   session6(decision trees random forrest)
Machine learning session6(decision trees random forrest)Abhimanyu Dwivedi
 
Validation and Over fitting , Validation strategies
Validation and Over fitting , Validation strategiesValidation and Over fitting , Validation strategies
Validation and Over fitting , Validation strategiesChode Amarnath
 
Intro to modelling-supervised learning
Intro to modelling-supervised learningIntro to modelling-supervised learning
Intro to modelling-supervised learningJustin Sebok
 
Performance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsPerformance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsKush Kulshrestha
 
Solving non linear programming minimization problem using genetic algorithm
Solving non linear programming minimization problem using genetic algorithmSolving non linear programming minimization problem using genetic algorithm
Solving non linear programming minimization problem using genetic algorithmLahiru Dilshan
 

What's hot (12)

SVM Vs Naive Bays Algorithm (Jupyter Notebook)
SVM Vs Naive Bays Algorithm (Jupyter Notebook)SVM Vs Naive Bays Algorithm (Jupyter Notebook)
SVM Vs Naive Bays Algorithm (Jupyter Notebook)
 
Chapter 05 k nn
Chapter 05 k nnChapter 05 k nn
Chapter 05 k nn
 
M08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffM08 BiasVarianceTradeoff
M08 BiasVarianceTradeoff
 
Application of Genetic Algorithm in Software Testing
Application of Genetic Algorithm in Software TestingApplication of Genetic Algorithm in Software Testing
Application of Genetic Algorithm in Software Testing
 
Understanding the Machine Learning Algorithms
Understanding the Machine Learning AlgorithmsUnderstanding the Machine Learning Algorithms
Understanding the Machine Learning Algorithms
 
Machine learning session6(decision trees random forrest)
Machine learning   session6(decision trees random forrest)Machine learning   session6(decision trees random forrest)
Machine learning session6(decision trees random forrest)
 
Validation and Over fitting , Validation strategies
Validation and Over fitting , Validation strategiesValidation and Over fitting , Validation strategies
Validation and Over fitting , Validation strategies
 
Intro to modelling-supervised learning
Intro to modelling-supervised learningIntro to modelling-supervised learning
Intro to modelling-supervised learning
 
Performance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsPerformance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning Algorithms
 
Modified Genetic Algorithm for Solving n-Queens Problem
Modified Genetic Algorithm for Solving n-Queens ProblemModified Genetic Algorithm for Solving n-Queens Problem
Modified Genetic Algorithm for Solving n-Queens Problem
 
PyGotham 2016
PyGotham 2016PyGotham 2016
PyGotham 2016
 
Solving non linear programming minimization problem using genetic algorithm
Solving non linear programming minimization problem using genetic algorithmSolving non linear programming minimization problem using genetic algorithm
Solving non linear programming minimization problem using genetic algorithm
 

Similar to Ba group3

Subscription fraud analytics using classification
Subscription fraud analytics using classificationSubscription fraud analytics using classification
Subscription fraud analytics using classificationSomdeep Sen
 
UNIT2_NaiveBayes algorithms used in machine learning
UNIT2_NaiveBayes algorithms used in machine learningUNIT2_NaiveBayes algorithms used in machine learning
UNIT2_NaiveBayes algorithms used in machine learningmichaelaaron25322
 
Machine learning naive bayes and svm.pdf
Machine learning naive bayes and svm.pdfMachine learning naive bayes and svm.pdf
Machine learning naive bayes and svm.pdfSubhamKumar3239
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learningAmAn Singh
 
Naive_hehe.pptx
Naive_hehe.pptxNaive_hehe.pptx
Naive_hehe.pptxMahimMajee
 
Lazy Association Classification
Lazy Association ClassificationLazy Association Classification
Lazy Association ClassificationJason Yang
 
Barga Data Science lecture 9
Barga Data Science lecture 9Barga Data Science lecture 9
Barga Data Science lecture 9Roger Barga
 
Naïve Bayes Classifier Algorithm.pptx
Naïve Bayes Classifier Algorithm.pptxNaïve Bayes Classifier Algorithm.pptx
Naïve Bayes Classifier Algorithm.pptxPriyadharshiniG41
 
WEKA:Credibility Evaluating Whats Been Learned
WEKA:Credibility Evaluating Whats Been LearnedWEKA:Credibility Evaluating Whats Been Learned
WEKA:Credibility Evaluating Whats Been Learnedweka Content
 
Robustness Metrics for ML Models based on Deep Learning Methods
Robustness Metrics for ML Models based on Deep Learning MethodsRobustness Metrics for ML Models based on Deep Learning Methods
Robustness Metrics for ML Models based on Deep Learning MethodsData Science Milan
 
NLP - Sentiment Analysis
NLP - Sentiment AnalysisNLP - Sentiment Analysis
NLP - Sentiment AnalysisRupak Roy
 
SVM - Functional Verification
SVM - Functional VerificationSVM - Functional Verification
SVM - Functional VerificationSai Kiran Kadam
 
4. Classification.pdf
4. Classification.pdf4. Classification.pdf
4. Classification.pdfJyoti Yadav
 
Learning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification DataLearning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification Data萍華 楊
 

Similar to Ba group3 (20)

Subscription fraud analytics using classification
Subscription fraud analytics using classificationSubscription fraud analytics using classification
Subscription fraud analytics using classification
 
UNIT2_NaiveBayes algorithms used in machine learning
UNIT2_NaiveBayes algorithms used in machine learningUNIT2_NaiveBayes algorithms used in machine learning
UNIT2_NaiveBayes algorithms used in machine learning
 
Machine learning naive bayes and svm.pdf
Machine learning naive bayes and svm.pdfMachine learning naive bayes and svm.pdf
Machine learning naive bayes and svm.pdf
 
Supervised algorithms
Supervised algorithmsSupervised algorithms
Supervised algorithms
 
Naive.pdf
Naive.pdfNaive.pdf
Naive.pdf
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
 
Naive_hehe.pptx
Naive_hehe.pptxNaive_hehe.pptx
Naive_hehe.pptx
 
Lazy Association Classification
Lazy Association ClassificationLazy Association Classification
Lazy Association Classification
 
Barga Data Science lecture 9
Barga Data Science lecture 9Barga Data Science lecture 9
Barga Data Science lecture 9
 
Navies bayes
Navies bayesNavies bayes
Navies bayes
 
Naive bayes classifier
Naive bayes classifierNaive bayes classifier
Naive bayes classifier
 
Naïve Bayes Classifier Algorithm.pptx
Naïve Bayes Classifier Algorithm.pptxNaïve Bayes Classifier Algorithm.pptx
Naïve Bayes Classifier Algorithm.pptx
 
WEKA:Credibility Evaluating Whats Been Learned
WEKA:Credibility Evaluating Whats Been LearnedWEKA:Credibility Evaluating Whats Been Learned
WEKA:Credibility Evaluating Whats Been Learned
 
Machine Learning 1
Machine Learning 1Machine Learning 1
Machine Learning 1
 
Robustness Metrics for ML Models based on Deep Learning Methods
Robustness Metrics for ML Models based on Deep Learning MethodsRobustness Metrics for ML Models based on Deep Learning Methods
Robustness Metrics for ML Models based on Deep Learning Methods
 
NLP - Sentiment Analysis
NLP - Sentiment AnalysisNLP - Sentiment Analysis
NLP - Sentiment Analysis
 
SVM - Functional Verification
SVM - Functional VerificationSVM - Functional Verification
SVM - Functional Verification
 
4. Classification.pdf
4. Classification.pdf4. Classification.pdf
4. Classification.pdf
 
NAIVE BAYES ALGORITHM
NAIVE BAYES ALGORITHMNAIVE BAYES ALGORITHM
NAIVE BAYES ALGORITHM
 
Learning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification DataLearning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification Data
 

More from Ashish Ranjan

Telecom Fraudsters Prediction
Telecom Fraudsters Prediction Telecom Fraudsters Prediction
Telecom Fraudsters Prediction Ashish Ranjan
 
Sales forecasting of an airline company using time series analysis (1) (1)
Sales forecasting of an airline company using time series analysis (1) (1)Sales forecasting of an airline company using time series analysis (1) (1)
Sales forecasting of an airline company using time series analysis (1) (1)Ashish Ranjan
 
Classification model for predicting student's knowledge
Classification model for predicting student's knowledgeClassification model for predicting student's knowledge
Classification model for predicting student's knowledgeAshish Ranjan
 
Sas medical case study final (1)
Sas medical case study final (1)Sas medical case study final (1)
Sas medical case study final (1)Ashish Ranjan
 
Sbi mm project-final
Sbi mm project-finalSbi mm project-final
Sbi mm project-finalAshish Ranjan
 
Telecom analytics assignment revised
Telecom analytics assignment revisedTelecom analytics assignment revised
Telecom analytics assignment revisedAshish Ranjan
 
Insurance claims clustering final (1)
Insurance claims  clustering final (1)Insurance claims  clustering final (1)
Insurance claims clustering final (1)Ashish Ranjan
 
Insurance claims clustering final (1)
Insurance claims  clustering final (1)Insurance claims  clustering final (1)
Insurance claims clustering final (1)Ashish Ranjan
 

More from Ashish Ranjan (10)

Telecom Fraudsters Prediction
Telecom Fraudsters Prediction Telecom Fraudsters Prediction
Telecom Fraudsters Prediction
 
Sales forecasting of an airline company using time series analysis (1) (1)
Sales forecasting of an airline company using time series analysis (1) (1)Sales forecasting of an airline company using time series analysis (1) (1)
Sales forecasting of an airline company using time series analysis (1) (1)
 
Regression ppt (1)
Regression ppt (1)Regression ppt (1)
Regression ppt (1)
 
Classification model for predicting student's knowledge
Classification model for predicting student's knowledgeClassification model for predicting student's knowledge
Classification model for predicting student's knowledge
 
Sas medical case study final (1)
Sas medical case study final (1)Sas medical case study final (1)
Sas medical case study final (1)
 
Ntpc final
Ntpc finalNtpc final
Ntpc final
 
Sbi mm project-final
Sbi mm project-finalSbi mm project-final
Sbi mm project-final
 
Telecom analytics assignment revised
Telecom analytics assignment revisedTelecom analytics assignment revised
Telecom analytics assignment revised
 
Insurance claims clustering final (1)
Insurance claims  clustering final (1)Insurance claims  clustering final (1)
Insurance claims clustering final (1)
 
Insurance claims clustering final (1)
Insurance claims  clustering final (1)Insurance claims  clustering final (1)
Insurance claims clustering final (1)
 

Recently uploaded

Ensure the security of your HCL environment by applying the Zero Trust princi...
Ensure the security of your HCL environment by applying the Zero Trust princi...Ensure the security of your HCL environment by applying the Zero Trust princi...
Ensure the security of your HCL environment by applying the Zero Trust princi...Roland Driesen
 
M.C Lodges -- Guest House in Jhang.
M.C Lodges --  Guest House in Jhang.M.C Lodges --  Guest House in Jhang.
M.C Lodges -- Guest House in Jhang.Aaiza Hassan
 
Call Girls Pune Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Pune Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Pune Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Pune Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best Services
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best ServicesMysore Call Girls 8617370543 WhatsApp Number 24x7 Best Services
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best ServicesDipal Arora
 
Event mailer assignment progress report .pdf
Event mailer assignment progress report .pdfEvent mailer assignment progress report .pdf
Event mailer assignment progress report .pdftbatkhuu1
 
Call Girls in Gomti Nagar - 7388211116 - With room Service
Call Girls in Gomti Nagar - 7388211116  - With room ServiceCall Girls in Gomti Nagar - 7388211116  - With room Service
Call Girls in Gomti Nagar - 7388211116 - With room Servicediscovermytutordmt
 
Pharma Works Profile of Karan Communications
Pharma Works Profile of Karan CommunicationsPharma Works Profile of Karan Communications
Pharma Works Profile of Karan Communicationskarancommunications
 
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...Lviv Startup Club
 
VIP Call Girls In Saharaganj ( Lucknow ) 🔝 8923113531 🔝 Cash Payment (COD) 👒
VIP Call Girls In Saharaganj ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment (COD) 👒VIP Call Girls In Saharaganj ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment (COD) 👒
VIP Call Girls In Saharaganj ( Lucknow ) 🔝 8923113531 🔝 Cash Payment (COD) 👒anilsa9823
 
Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...Roland Driesen
 
VIP Call Girls Gandi Maisamma ( Hyderabad ) Phone 8250192130 | ₹5k To 25k Wit...
VIP Call Girls Gandi Maisamma ( Hyderabad ) Phone 8250192130 | ₹5k To 25k Wit...VIP Call Girls Gandi Maisamma ( Hyderabad ) Phone 8250192130 | ₹5k To 25k Wit...
VIP Call Girls Gandi Maisamma ( Hyderabad ) Phone 8250192130 | ₹5k To 25k Wit...Suhani Kapoor
 
It will be International Nurses' Day on 12 May
It will be International Nurses' Day on 12 MayIt will be International Nurses' Day on 12 May
It will be International Nurses' Day on 12 MayNZSG
 
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...anilsa9823
 
KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...
KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...
KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...Any kyc Account
 
Understanding the Pakistan Budgeting Process: Basics and Key Insights
Understanding the Pakistan Budgeting Process: Basics and Key InsightsUnderstanding the Pakistan Budgeting Process: Basics and Key Insights
Understanding the Pakistan Budgeting Process: Basics and Key Insightsseri bangash
 
Mondelez State of Snacking and Future Trends 2023
Mondelez State of Snacking and Future Trends 2023Mondelez State of Snacking and Future Trends 2023
Mondelez State of Snacking and Future Trends 2023Neil Kimberley
 
The Coffee Bean & Tea Leaf(CBTL), Business strategy case study
The Coffee Bean & Tea Leaf(CBTL), Business strategy case studyThe Coffee Bean & Tea Leaf(CBTL), Business strategy case study
The Coffee Bean & Tea Leaf(CBTL), Business strategy case studyEthan lee
 
Call Girls In Holiday Inn Express Gurugram➥99902@11544 ( Best price)100% Genu...
Call Girls In Holiday Inn Express Gurugram➥99902@11544 ( Best price)100% Genu...Call Girls In Holiday Inn Express Gurugram➥99902@11544 ( Best price)100% Genu...
Call Girls In Holiday Inn Express Gurugram➥99902@11544 ( Best price)100% Genu...lizamodels9
 
Monte Carlo simulation : Simulation using MCSM
Monte Carlo simulation : Simulation using MCSMMonte Carlo simulation : Simulation using MCSM
Monte Carlo simulation : Simulation using MCSMRavindra Nath Shukla
 
Grateful 7 speech thanking everyone that has helped.pdf
Grateful 7 speech thanking everyone that has helped.pdfGrateful 7 speech thanking everyone that has helped.pdf
Grateful 7 speech thanking everyone that has helped.pdfPaul Menig
 

Recently uploaded (20)

Ensure the security of your HCL environment by applying the Zero Trust princi...
Ensure the security of your HCL environment by applying the Zero Trust princi...Ensure the security of your HCL environment by applying the Zero Trust princi...
Ensure the security of your HCL environment by applying the Zero Trust princi...
 
M.C Lodges -- Guest House in Jhang.
M.C Lodges --  Guest House in Jhang.M.C Lodges --  Guest House in Jhang.
M.C Lodges -- Guest House in Jhang.
 
Call Girls Pune Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Pune Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Pune Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Pune Just Call 9907093804 Top Class Call Girl Service Available
 
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best Services
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best ServicesMysore Call Girls 8617370543 WhatsApp Number 24x7 Best Services
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best Services
 
Event mailer assignment progress report .pdf
Event mailer assignment progress report .pdfEvent mailer assignment progress report .pdf
Event mailer assignment progress report .pdf
 
Call Girls in Gomti Nagar - 7388211116 - With room Service
Call Girls in Gomti Nagar - 7388211116  - With room ServiceCall Girls in Gomti Nagar - 7388211116  - With room Service
Call Girls in Gomti Nagar - 7388211116 - With room Service
 
Pharma Works Profile of Karan Communications
Pharma Works Profile of Karan CommunicationsPharma Works Profile of Karan Communications
Pharma Works Profile of Karan Communications
 
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
 
VIP Call Girls In Saharaganj ( Lucknow ) 🔝 8923113531 🔝 Cash Payment (COD) 👒
VIP Call Girls In Saharaganj ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment (COD) 👒VIP Call Girls In Saharaganj ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment (COD) 👒
VIP Call Girls In Saharaganj ( Lucknow ) 🔝 8923113531 🔝 Cash Payment (COD) 👒
 
Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...
 
VIP Call Girls Gandi Maisamma ( Hyderabad ) Phone 8250192130 | ₹5k To 25k Wit...
VIP Call Girls Gandi Maisamma ( Hyderabad ) Phone 8250192130 | ₹5k To 25k Wit...VIP Call Girls Gandi Maisamma ( Hyderabad ) Phone 8250192130 | ₹5k To 25k Wit...
VIP Call Girls Gandi Maisamma ( Hyderabad ) Phone 8250192130 | ₹5k To 25k Wit...
 
It will be International Nurses' Day on 12 May
It will be International Nurses' Day on 12 MayIt will be International Nurses' Day on 12 May
It will be International Nurses' Day on 12 May
 
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
SUPERVISED LEARNING: CLASSIFY SUBSCRIPTION FRAUD

The problem: Bad Idea Company knew that eventually someone would work on an analytics project on their rate plans. So they designed an unusual rate plan called the Praxis Plan, under which a subscriber is allowed to make only one call in each of four time frames: Morning (9AM-Noon), Afternoon (Noon-4PM), Evening (4PM-9PM) and Night (9PM-Midnight), i.e. four calls per day. The plan proved very popular and a lot of people opted for it.

However, Bad Idea was the target of subscription fraud by a gang of three fraudsters: Sally, Vince and Virginia, whose services were finally terminated. Bad Idea has their call logs spanning one and a half months. Every 5 days the company undertakes an audit to see whether these fraudsters have rejoined the network, reviewing the list of subscribers who have made calls to the same people as the three fraudsters, and in the same time frames.

The approach: This problem may be classified as an example of supervised learning in machine learning. Unlike unsupervised learning, where the idea is to find patterns in unlabeled data, supervised learning is the task of inferring a function from labeled training data. To find a suitable classifier for the problem, we first take a look at our data.

The data: Our training data, in the form of an Excel sheet, has 138 instances of the names of the people called by the fraudsters Sally, Vince and Virginia in each of the time frames. We also have a test set of 15 instances with the fraudster Caller feature/column missing, which we predict by the end of this report. We use R to build our classifiers. Before that, we take a look at the steps.
Methods: Since we deal with categorical (non-metric) data, we use the Naïve Bayes classification technique, via both the "caret" and "klaR" packages in R. A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions. In simple terms, a Naïve Bayes classifier assumes that the presence or absence of a particular feature is independent of the presence or absence of any other feature, given the class variable. So in this case, given each caller, the time-frame features are independent of one another. An advantage of the Naïve Bayes classifier is that it requires only a small amount of training data to estimate the parameters necessary for classification.

Because the method is a direct application of Bayes' theorem, we can compute each of the steps manually. From our training data (mlcallers), the given labeled BlackListedSubscriberCall dataset, the prior probabilities of each caller are:

P(Sally) = 47/138 ≈ 0.3406
P(Vince) = 43/138 ≈ 0.3116
P(Virginia) = 48/138 ≈ 0.3478

Now suppose we need to find the most likely caller if, in the Morning, a call was made to Robert, in the Afternoon to John, in the Evening to David, and at Night to Alex. For Sally:

P(Morning = Robert | Caller = Sally) = 0 (Sally never called Robert in the morning in the training data, so we apply Laplace smoothing to replace this zero with a small non-zero probability)
P(Afternoon = John | Caller = Sally) = 19/47
P(Evening = David | Caller = Sally) = 23/47
P(Night = Alex | Caller = Sally) = 28/47

Thus, P(Caller = Sally | Morning = Robert, Afternoon = John, Evening = David, Night = Alex) is proportional to 0 × (19/47) × (23/47) × (28/47) × (47/138), which would be zero with the raw counts. By assigning a small non-zero probability to the unseen event, we end up with a usable conditional probability.
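The calculation above can be reproduced in a few lines of base R. This is a sketch using the counts from the report; the number of distinct callees per time frame (k) is not stated, so the value below is an assumption:

```r
# Prior probabilities of each caller (counts from the training data)
n <- 138
priors <- c(Sally = 47, Vince = 43, Virginia = 48) / n

# Sally's counts for the observed callees in each time frame
# (Morning = Robert never occurs in her rows, hence 0)
counts <- c(Morning = 0, Afternoon = 19, Evening = 23, Night = 28)

# Laplace (additive) smoothing: add fL to every count so that unseen
# events get a small non-zero probability instead of collapsing the
# whole product to zero. k = 10 distinct callees is a placeholder.
laplace <- function(count, total, fL = 2, k = 10) (count + fL) / (total + fL * k)

post_sally <- priors["Sally"] * prod(laplace(counts, total = 47))
post_sally  # unnormalised posterior score for Sally
```

Computing the analogous product for Vince and Virginia and taking the maximum gives the predicted caller.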
Now, computing the same product for both Vince and Virginia and comparing the three conditional probabilities, we find that the caller being Sally has the highest probability, and we choose this maximum.

Cross Validation: To implement the Naïve Bayes classifier, we also make use of 10-fold cross validation. In this method, the algorithm divides the training set randomly into 10 parts, uses 9 parts to train and tests on the remaining part. With our 138 rows, each run trains on roughly 124-125 instances and tests on the remaining 13-14; the process is repeated 10 times so that every fold serves as the test set once. The best model is selected, tweaked if necessary by adjusting a few parameters, checked for accuracy, and then used to predict on held-out data.

The Process: Making use of the "caret" and "klaR" packages in R, we do the following:

• Load the libraries
• Train our model using 10-fold cross validation
• Check the Accuracy and Kappa values
• Tweak the parameters of the classifier model
• Run on the test set to predict

Now that we have our model in place, we randomly hold out 15 instances from our training set (about 10%) and test the prediction abilities of our classifier on them.
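The original code listings appear only as screenshots, so the steps above can be sketched as follows. This is a minimal, hypothetical version of the pipeline using caret with the klaR Naïve Bayes backend; a synthetic data frame stands in for the real mlcallers dataset, and the callee names are assumptions:

```r
library(caret)  # train(), trainControl()
library(klaR)   # backs caret's "nb" (Naive Bayes) method

# Synthetic stand-in for the real 138-row training data
set.seed(1)
callees <- c("Alex", "Clark", "David", "Frank", "John", "Kelly", "Larry", "Robert")
mlcallers <- data.frame(
  Morning   = factor(sample(callees, 138, replace = TRUE)),
  Afternoon = factor(sample(callees, 138, replace = TRUE)),
  Evening   = factor(sample(callees, 138, replace = TRUE)),
  Night     = factor(sample(callees, 138, replace = TRUE)),
  Caller    = factor(sample(c("Sally", "Vince", "Virginia"), 138, replace = TRUE))
)

ctrl <- trainControl(method = "cv", number = 10)           # 10-fold CV
grid <- expand.grid(fL = 2, usekernel = TRUE, adjust = 1)  # Laplace + kernel

# x/y interface keeps the predictors as factors for klaR
nb_fit <- train(x = mlcallers[, c("Morning", "Afternoon", "Evening", "Night")],
                y = mlcallers$Caller,
                method = "nb", trControl = ctrl, tuneGrid = grid)
nb_fit$results  # cross-validated Accuracy and Kappa
```

On the real dataset, nb_fit$results is where the Accuracy and Kappa values quoted in this report would be read off.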
Upon analyzing the confusion matrix, we find that:

• The accuracy of our model is (4+3+4)/15 = 11/15 ≈ 73.3%
• Precision when predicting Sally = 4/5 = 80%
• Precision when predicting Vince = 3/5 = 60%
• Precision when predicting Virginia = 4/5 = 80%

Tweaking parameters, a digression: Upon tweaking a few controls while building the classifier, we found that the confusion matrix given by our initial model remained unchanged. In this case, we have used usekernel = TRUE and a Laplace smoothing factor of fL = 2. Kernel density estimation is a non-parametric way to estimate the probability density function of a random variable, used mostly as a data-smoothing technique. Laplace, or additive, smoothing is a technique used to smooth categorical data: it assigns non-zero probabilities to events that do not occur in a sample. Checking the posterior probabilities, we also find that the predictions on the held-out part of our training set come with strong conditional probabilities in most cases.
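The accuracy and precision figures can be checked with a small base-R computation. The diagonal counts and the row totals come from the report; the off-diagonal cells below are one hypothetical arrangement consistent with them:

```r
# Rows = predicted caller, columns = actual caller.
# Diagonal (4, 3, 4) and row totals (5 each) are from the report;
# the off-diagonal placement is assumed for illustration.
cm <- matrix(c(4, 1, 0,
               1, 3, 1,
               0, 1, 4),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = c("Sally", "Vince", "Virginia"),
                             actual    = c("Sally", "Vince", "Virginia")))

accuracy  <- sum(diag(cm)) / sum(cm)  # (4+3+4)/15
precision <- diag(cm) / rowSums(cm)   # per-caller precision
round(accuracy, 3)
round(precision, 2)
```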
The unseen test data: All of the activities so far have been performed on the training set alone, albeit by splitting it and validating each part in turn. Now we move on to an unlabeled test set, i.e. one without the Caller column, which we predict with the algorithm trained earlier. This new, unseen test set has only 15 observations, without the class label. We call it "testdata".
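Predicting the callers together with their posterior probabilities can be sketched directly with klaR; synthetic data again stands in for the real training and test sets, and the callee names are placeholders:

```r
library(klaR)  # NaiveBayes() with Laplace factor fL

set.seed(1)
callees <- c("Alex", "Clark", "David", "Frank", "John", "Kelly")
# Fixed factor levels so the test set introduces no unseen categories
make_calls <- function(n) data.frame(
  Morning   = factor(sample(callees, n, replace = TRUE), levels = callees),
  Afternoon = factor(sample(callees, n, replace = TRUE), levels = callees),
  Evening   = factor(sample(callees, n, replace = TRUE), levels = callees),
  Night     = factor(sample(callees, n, replace = TRUE), levels = callees)
)
mlcallers <- cbind(make_calls(138),
                   Caller = factor(sample(c("Sally", "Vince", "Virginia"),
                                          138, replace = TRUE)))
testdata <- make_calls(15)  # unlabeled: no Caller column

nb <- NaiveBayes(Caller ~ ., data = mlcallers, fL = 2)
pred <- predict(nb, newdata = testdata)
pred$class                # most likely caller per row
round(pred$posterior, 3)  # confidence for each of the three callers
```

Rows whose posterior probabilities are all close to 1/3 are the ones the report flags: the model cannot say with confidence which fraudster, if any, made those calls.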
Now, predicting the Caller in terms of confidence, i.e. the conditional (posterior) probability, we find that the calling pattern can be predicted fairly well, apart from the instances marked in red. For those instances, we do not have strong probabilistic evidence that the callers are indeed fraudsters. We might want to flag those instances and set up a tagged priority for those cases. For the others, we can be quite confident, probabilistically.

Using Decision Trees: Decision tree learning (or classification trees) uses a decision tree as a predictive model, mapping observations about an item to conclusions about the item's target value. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. The classification tree can be used as an input for decision making. Once again, we make use of cross validation to build the classifier. The code goes as:
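The original listing is a screenshot; the node numbering and per-leaf probability bars described here suggest a conditional inference tree, so a hypothetical equivalent using the party package might look like this (synthetic stand-in data, assumed names):

```r
library(party)  # ctree(): conditional inference trees

set.seed(1)
callees <- c("Alex", "Clark", "David", "Frank", "John", "Kelly", "Larry", "Robert")
mlcallers <- data.frame(
  Morning   = factor(sample(callees, 138, replace = TRUE)),
  Afternoon = factor(sample(callees, 138, replace = TRUE)),
  Evening   = factor(sample(callees, 138, replace = TRUE)),
  Night     = factor(sample(callees, 138, replace = TRUE)),
  Caller    = factor(sample(c("Sally", "Vince", "Virginia"), 138, replace = TRUE))
)

tree_fit <- ctree(Caller ~ ., data = mlcallers)
plot(tree_fit)  # leaves show the class-probability bars read off in the report
table(predicted = predict(tree_fit), actual = mlcallers$Caller)
```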
Comparing the accuracy of this classifier with the results of the Naïve Bayes model, we find that the latter still scores better. We then plot the tree. One error in this graph is that "Sally" is printed all along the bottom of the plot; we take the caller at each leaf to be the one whose probability of occurrence is maximum, so, for example, node 20 (second from the right) should have read Virginia.

Reading the plot: if in the evening the call lands on Frank, at night on Clark, and in the morning the call reaches Kelly, Larry or Robert, the probability of the caller fraudster being Vince is close to 80%. The entire tree may be read similarly, taking all branches and leaves into account.

Code re-run: Tweaking our decision tree algorithm a bit further, we came up with an accuracy quite close to that of the Naïve Bayes classifier.
With the mincriterion parameter set to 0.01, we arrive at an accuracy of 72.6%. Next, we try a Random Forest classifier on the given data.
Using Random Forests: A random forest is an ensemble learning method for classification that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees. Training the classifier with similar R code, we find that the model accuracy, at 84.8%, is higher than that of Naïve Bayes. We still decline to use this classifier, however, because the confusion matrix on the training set (which should ideally give very optimistic results) shows around a 43% error rate. Weighing all of these models, we conclude that Naïve Bayes may be considered the best classifier in this case, perhaps because the data set is quite small and purely categorical, and it provides us with the best results.
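A hypothetical sketch of the random forest step, again on synthetic stand-in data with assumed names:

```r
library(randomForest)

set.seed(1)
callees <- c("Alex", "Clark", "David", "Frank", "John", "Kelly", "Larry", "Robert")
mlcallers <- data.frame(
  Morning   = factor(sample(callees, 138, replace = TRUE)),
  Afternoon = factor(sample(callees, 138, replace = TRUE)),
  Evening   = factor(sample(callees, 138, replace = TRUE)),
  Night     = factor(sample(callees, 138, replace = TRUE)),
  Caller    = factor(sample(c("Sally", "Vince", "Virginia"), 138, replace = TRUE))
)

# 500 trees; each prediction is the majority vote of the individual trees
rf_fit <- randomForest(Caller ~ ., data = mlcallers, ntree = 500)
rf_fit$confusion  # out-of-bag confusion matrix with per-class error rates
```

The out-of-bag confusion matrix is where the roughly 43% error rate noted in the report would show up for the real data.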