Haicku Milestone 3
Spam Mail Classification
The Excellers
Lakshmi Prasad (460470) Ravi H S (475560)
Saurabh Singh (431471) Siva Surya Teja (444898)
SUMMARY
Email is not just text; it has structure. SPAM filtering is not just classification: misclassifying a genuine mail as SPAM is so costly that it must be treated as a different kind of error when building the model.
Business Problem
Mail Classification: set apart SPAM mails from genuine mails with high accuracy.
Control False Negatives: prevent genuine mail from being classified as SPAM, without significantly compromising model accuracy.

Business Solution
Data Import: leverage a Python script to import .eml files and integrate it with R.
Exploratory Data Analysis: finalize the model predictors (Email Subject + Attachment Flag + CC Flag + Sender Domain) through exploratory data analysis.
Model Building: explore various algorithms (Naïve Bayes, SVM, KNN, Neural Network) to build the model.
Model Selection & Optimization: select the algorithm(s), an average of Naïve Bayes and SVM, with additional focus on controlling false negatives.

Business Benefits
Efficient and scalable data import technique.
High accuracy and near-zero false classification of genuine mail.
Integrated approach that is not susceptible to overfitting and gives better control over false negatives.

Algorithm Accuracy FN ratio
Average of Naïve Bayes & SVM 99.5% 0.0%
Get (Data Extraction) → Prepare (Data Preparation) → Build (Model Building)
Tools Used: integration of R and Python
Algorithms Explored: Naïve Bayes (NB), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Neural Network
Case Study
Column     Value
Mail ID    TRAIN_169.eml
To         Online.Friends@netnoteinc.com
From       iccsuuccesss@yahoo.com
CC         -
Date       Sun, 21 Jul 2002 01:19:02 -0700
Attachment No
Subject    Say "Goodbye" to the 9-5 yhe @yahoo.com
Body       Fire Your Boss... Say Goodbye to the 9-5! Tired of working to make someone else wealthy? FREE tape teaches you how to make YOU wealthy! Click here and send your name and mailing address for a free copy To unsubscribe click here xgcqyahtsvdwhqnhjiuweimhfumiaiyawr @yahoo.com
NB score: 50.47 → prediction: HAM
SVM score: 0.22 → prediction: SPAM
NB+SVM score: 25.34 → prediction: SPAM
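The example suggests each model returns a score on a common scale and the averaged score is then compared against a cut-off. A minimal sketch of that averaging step, assuming a 0-100 "ham-likeness" scale and a cut-off of 50 (both assumptions; only the three scores come from the slide):

# Toy illustration of how the two scores above combine. The 0-100 scale
# (higher = more HAM-like) and the cut-off of 50 are assumptions.
SPAM_CUTOFF = 50.0

def classify(nb_score, svm_score):
    averaged = (nb_score + svm_score) / 2.0
    label = "SPAM" if averaged < SPAM_CUTOFF else "HAM"
    return averaged, label

print(classify(50.47, 0.22))  # (25.345, 'SPAM'), matching the NB+SVM verdict above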
Analytical Engine Flow Chart
Email files → extract using Python → structured data → read the data in R → data cleaning
→ sender domain, attachment and CC flags, enriched subject and body → predictor selection
→ NB on mail subject + SVM on mail subject → model integration → Spam/Ham classification
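A minimal sketch of the "extract using Python" step, assuming only the Python standard library and an illustrative CSV hand-off to R; the field names and the output path are assumptions, not the team's actual script:

import csv
from email import policy
from email.parser import BytesParser
from pathlib import Path

def parse_eml(path):
    # Parse one .eml file into the flat record used later in R.
    with open(path, "rb") as fh:
        msg = BytesParser(policy=policy.default).parse(fh)
    sender = msg.get("From", "") or ""
    body_part = msg.get_body(preferencelist=("plain", "html"))
    return {
        "mail_id": Path(path).name,
        "to": msg.get("To", ""),
        "from": sender,
        "cc_flag": int(bool(msg.get("Cc"))),
        "date": msg.get("Date", ""),
        "attachment_flag": int(any(part.get_filename() for part in msg.walk())),
        "sender_domain": sender.rsplit("@", 1)[-1].strip("> ") if "@" in sender else "",
        "subject": msg.get("Subject", ""),
        "body": body_part.get_content() if body_part else "",
    }

def emails_to_csv(eml_dir, out_csv="structured_mails.csv"):
    # Write all .eml files in eml_dir to one CSV that R can read.
    rows = [parse_eml(p) for p in sorted(Path(eml_dir).glob("*.eml"))]
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

# emails_to_csv("emails/")  # then read.csv("structured_mails.csv") on the R side

The resulting CSV is the "structured data" box in the flow chart; R reads it with read.csv() for cleaning and modelling.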
Model Predictors / Feature Vector
A machine learning algorithm will make mistakes without wise feature selection; a clever choice of features can make classification much easier. The features need not be taken only from the mail body: additional features can be added to improve model accuracy.
Message/Subject Features
• All words, or frequently featured words, in the message or body text
• Top N differentially featured words in the message or body text

Additional Features
1. Identify influential sender domains: 20% of SPAM mails came from a single domain, csmining.org.
2. Capture the importance of CC attributes: 21% of genuine mails have at least one email in CC, compared with only 6% of SPAM mails.
3. Leverage knowledge of attached documents: 20% of genuine mails have an attachment, compared with only 1% of SPAM mails.

Mail Text Frequency Mail Type
csminingorg 20% SPAM
hibody 17% SPAM
work 12% Genuine

Selected Features: frequently featuring words in the email message, email sender domain, attachment flag, CC flag
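A minimal sketch of assembling that feature vector (frequent subject words plus the attachment flag, CC flag and sender domain), assuming pandas and scikit-learn; the column names are illustrative and this is not the team's actual R code:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def build_features(df, top_n_words=200):
    # df is assumed to have columns: subject, attachment_flag, cc_flag, sender_domain.
    vec = CountVectorizer(max_features=top_n_words, stop_words="english")
    subject_counts = pd.DataFrame(
        vec.fit_transform(df["subject"].fillna("")).toarray(),
        columns=vec.get_feature_names_out(),
        index=df.index,
    )
    extras = pd.concat(
        [
            df[["attachment_flag", "cc_flag"]],
            pd.get_dummies(df["sender_domain"], prefix="domain"),  # e.g. domain_csmining.org
        ],
        axis=1,
    )
    return pd.concat([subject_counts, extras], axis=1)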
ALGORITHM SELECTION
Evaluation criteria:
1. Overall accuracy of the model*
2. Better control of False Negatives
3. Works well even in case of noisy data
4. Not prone to overfitting
5. Computationally not intensive
Model (predictors) Accuracy FN ratio
Naïve Bayes (Mail Body + Mail Attributes) 89.40% 3.17%
SVM (Mail Body + Mail Attributes) 97.70% 0.00%
Naïve Bayes (Subject + Mail Attributes) 96.50% 0.07%
SVM (Subject + Mail Attributes) 97.60% 0.00%
Average of Naïve Bayes & SVM (Subject + Mail Attributes) 99.50% 0.00%
[Chart: Model Selection, plotting Accuracy and FN ratio per model; values as in the table above]
N.B.: * based on 2000 email records
[Matrix: NB, SVM, ANN, KNN and NB+SVM each rated Desired / Average / Poor against criteria 1 to 5 above]
In spam filtering the source of error is not just random variation: a live human spammer works actively to defeat the filter, and the algorithm should be smart enough to defeat such attempts.
FN Ratio = (# of misclassified genuine mails) / (# of genuine mails)
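A small sketch of the FN ratio exactly as defined above; the labels and variable names are illustrative:

def fn_ratio(actual, predicted):
    # Misclassified genuine (HAM) mails divided by all genuine mails.
    genuine = [(a, p) for a, p in zip(actual, predicted) if a == "HAM"]
    misclassified = sum(1 for a, p in genuine if p == "SPAM")
    return misclassified / len(genuine) if genuine else 0.0

# fn_ratio(["HAM", "HAM", "SPAM"], ["HAM", "SPAM", "SPAM"])  -> 0.5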
Model Accuracy
Mistakenly classifying legitimate mail as SPAM is completely unacceptable: the user would then have to review the Spam folder regularly, which defeats the whole purpose of spam filtering.
Google claims its artificial intelligence catches 99.9 percent of Gmail spam.
Algorithm Accuracy FN ratio
Naïve Bayes 92.8% 3.0%
SVM 98.1% 1.6%
KNN 91.2% 9.5%
Neural Network 98.5% 1.3%
Machine Learning approach: careful predictor selection + prudent algorithm choice + integrated solution approach.
Predictors Algorithm Accuracy FN ratio
Subject + Mail Attributes Average of Naïve Bayes & SVM 99.5% 0.0%
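A minimal sketch of the integrated model, assuming scikit-learn estimators, HAM/SPAM string labels and a simple average of the two spam probabilities with a 0.5 cut-off; it stands in for the team's R implementation and is not their actual code:

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

def fit_ensemble(X_train, y_train):
    # Fit both base models on the same subject/attribute features.
    nb = MultinomialNB().fit(X_train, y_train)
    svm = SVC(kernel="linear", probability=True).fit(X_train, y_train)
    return nb, svm

def predict_ensemble(nb, svm, X, threshold=0.5):
    # Average the two SPAM probabilities and apply the cut-off.
    # Assumes the labels are the strings "HAM" and "SPAM".
    spam_col = list(nb.classes_).index("SPAM")
    p_spam = (nb.predict_proba(X)[:, spam_col] + svm.predict_proba(X)[:, spam_col]) / 2.0
    return np.where(p_spam >= threshold, "SPAM", "HAM")

Averaging the two probabilities smooths out cases where one model is overconfident, which is in line with the slide's claim that the integrated approach resists overfitting and gives better control over false negatives.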
Thank You
