Haicku Milestone 3
Spam Mail Classification
The Excellers
Lakshmi Prasad (460470) Ravi H S (475560)
Saurabh Singh (431471) Siva Surya Teja (444898)
SUMMARY
Email is not just text; it has structure. SPAM filtering is not just classification: misclassifying a genuine mail as SPAM is so costly that it must be treated as a different kind of error when building the model.
Business Problem
Mail Classification: set apart SPAM mails from genuine mails with high accuracy.
Control False Negatives: prevent genuine mail from being classified as SPAM, without significantly compromising model accuracy.

Business Solution
Data Import: leverage a Python script to import .eml files and integrate it with R.
Exploratory Data Analysis: finalize the model predictors (Email Subject + Attachment Flag + CC Flag + Sender Domain) through exploratory data analysis.
Model Building: explore various algorithms (Naïve Bayes, SVM, KNN, Neural Network) to build the model.
Model Selection & Optimization: select the algorithm(s), an average of Naïve Bayes and SVM, with additional focus on controlling false negatives.

Business Benefits
Efficient and scalable data import technique.
High accuracy and near-zero false classification of genuine mail.
Integrated approach that is not susceptible to overfitting and gives better control over false negatives.

Algorithm Accuracy FN ratio
Average of Naïve Bayes & SVM 99.5% 0.0%
Get (Data Extraction) → Prepare (Data Preparation) → Build (Model Building)
Tools Used: integration of R and Python
Algorithms Explored: Naïve Bayes (NB), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Neural Network
Case Study
Column     Value
Mail ID    TRAIN_169.eml
To         Online.Friends@netnoteinc.com
From       iccsuuccesss@yahoo.com
CC         -
Date       Sun, 21 Jul 2002 01:19:02 -0700
Attachment No
Subject    Say "Goodbye" to the 9-5 yhe @yahoo.com
Body       Fire Your Boss... Say Goodbye to the 9-5! Tired of working to make someone else wealthy? FREE tape teaches you how to make YOU wealthy! Click here and send your name and mailing address for a free copy To unsubscribe click here xgcqyahtsvdwhqnhjiuweimhfumiaiyawr @yahoo.com
NB score: 50.47 → prediction: HAM
SVM score: 0.22 → prediction: SPAM
NB+SVM score: 25.34 → prediction: SPAM
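The example suggests each model returns a score on a common scale and the averaged score is then compared against a cut-off. A minimal sketch of that averaging step, assuming a 0-100 "ham-likeness" scale and a cut-off of 50 (both assumptions; only the three scores come from the slide):

# Toy illustration of how the two scores above combine. The 0-100 scale
# (higher = more HAM-like) and the cut-off of 50 are assumptions.
SPAM_CUTOFF = 50.0

def classify(nb_score, svm_score):
    averaged = (nb_score + svm_score) / 2.0
    label = "SPAM" if averaged < SPAM_CUTOFF else "HAM"
    return averaged, label

print(classify(50.47, 0.22))  # (25.345, 'SPAM'), matching the NB+SVM verdict above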
Analytical Engine Flow Chart
Email files → extract using Python → structured data → read the data in R → data cleaning
→ sender domain, attachment and CC flags, enriched subject and body → predictor selection
→ NB on mail subject + SVM on mail subject → model integration → Spam/Ham classification
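A minimal sketch of the "extract using Python" step, assuming only the Python standard library and an illustrative CSV hand-off to R; the field names and the output path are assumptions, not the team's actual script:

import csv
from email import policy
from email.parser import BytesParser
from pathlib import Path

def parse_eml(path):
    # Parse one .eml file into the flat record used later in R.
    with open(path, "rb") as fh:
        msg = BytesParser(policy=policy.default).parse(fh)
    sender = msg.get("From", "") or ""
    body_part = msg.get_body(preferencelist=("plain", "html"))
    return {
        "mail_id": Path(path).name,
        "to": msg.get("To", ""),
        "from": sender,
        "cc_flag": int(bool(msg.get("Cc"))),
        "date": msg.get("Date", ""),
        "attachment_flag": int(any(part.get_filename() for part in msg.walk())),
        "sender_domain": sender.rsplit("@", 1)[-1].strip("> ") if "@" in sender else "",
        "subject": msg.get("Subject", ""),
        "body": body_part.get_content() if body_part else "",
    }

def emails_to_csv(eml_dir, out_csv="structured_mails.csv"):
    # Write all .eml files in eml_dir to one CSV that R can read.
    rows = [parse_eml(p) for p in sorted(Path(eml_dir).glob("*.eml"))]
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

# emails_to_csv("emails/")  # then read.csv("structured_mails.csv") on the R side

The resulting CSV is the "structured data" box in the flow chart; R reads it with read.csv() for cleaning and modelling.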
Model Predictors / Feature Vector
A machine learning algorithm will make mistakes without wise feature selection; a clever choice of features can make classification much easier. The features need not be taken only from the mail body: additional features can be added to improve model accuracy.
Message/Subject Features
• All words, or frequently featured words, in the message or body text
• Top N differentially featured words in the message or body text

Additional Features
1. Identify influential sender domains: 20% of SPAM mails came from a single domain, csmining.org.
2. Capture the importance of CC attributes: 21% of genuine mails have at least one email in CC, compared with only 6% of SPAM mails.
3. Leverage knowledge of attached documents: 20% of genuine mails have an attachment, compared with only 1% of SPAM mails.

Mail Text Frequency Mail Type
csminingorg 20% SPAM
hibody 17% SPAM
work 12% Genuine

Selected Features: frequently featuring words in the email message, email sender domain, attachment flag, CC flag
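A minimal sketch of assembling that feature vector (frequent subject words plus the attachment flag, CC flag and sender domain), assuming pandas and scikit-learn; the column names are illustrative and this is not the team's actual R code:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def build_features(df, top_n_words=200):
    # df is assumed to have columns: subject, attachment_flag, cc_flag, sender_domain.
    vec = CountVectorizer(max_features=top_n_words, stop_words="english")
    subject_counts = pd.DataFrame(
        vec.fit_transform(df["subject"].fillna("")).toarray(),
        columns=vec.get_feature_names_out(),
        index=df.index,
    )
    extras = pd.concat(
        [
            df[["attachment_flag", "cc_flag"]],
            pd.get_dummies(df["sender_domain"], prefix="domain"),  # e.g. domain_csmining.org
        ],
        axis=1,
    )
    return pd.concat([subject_counts, extras], axis=1)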
ALGORITHM SELECTION
Evaluation criteria:
1. Overall accuracy of the model*
2. Better control of False Negatives
3. Works well even in case of noisy data
4. Not prone to overfitting
5. Computationally not intensive
Model (predictors) Accuracy FN ratio
Naïve Bayes (Mail Body + Mail Attributes) 89.40% 3.17%
SVM (Mail Body + Mail Attributes) 97.70% 0.00%
Naïve Bayes (Subject + Mail Attributes) 96.50% 0.07%
SVM (Subject + Mail Attributes) 97.60% 0.00%
Average of Naïve Bayes & SVM (Subject + Mail Attributes) 99.50% 0.00%
[Chart: Model Selection, plotting Accuracy and FN ratio per model; values as in the table above]
N.B.: * based on 2000 email records
[Matrix: NB, SVM, ANN, KNN and NB+SVM each rated Desired / Average / Poor against criteria 1 to 5 above]
In spam filtering the source of error is not just random variation: a live human spammer works actively to defeat the filter, and the algorithm should be smart enough to defeat such attempts.
FN Ratio = (# of misclassified genuine mails) / (# of genuine mails)
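A small sketch of the FN ratio exactly as defined above; the labels and variable names are illustrative:

def fn_ratio(actual, predicted):
    # Misclassified genuine (HAM) mails divided by all genuine mails.
    genuine = [(a, p) for a, p in zip(actual, predicted) if a == "HAM"]
    misclassified = sum(1 for a, p in genuine if p == "SPAM")
    return misclassified / len(genuine) if genuine else 0.0

# fn_ratio(["HAM", "HAM", "SPAM"], ["HAM", "SPAM", "SPAM"])  -> 0.5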
Model Accuracy
Mistakenly classifying legitimate mail as SPAM is completely unacceptable: the user would then have to review the Spam folder regularly, which defeats the whole purpose of spam filtering.
Google claims its artificial intelligence catches 99.9 percent of Gmail spam.
Algorithm Accuracy FN ratio
Naïve Bayes 92.8% 3.0%
SVM 98.1% 1.6%
KNN 91.2% 9.5%
Neural Network 98.5% 1.3%
Machine Learning approach: careful predictor selection + prudent algorithm choice + integrated solution approach.
Predictors Algorithm Accuracy FN ratio
Subject + Mail Attributes Average of Naïve Bayes & SVM 99.5% 0.0%
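A minimal sketch of the integrated model, assuming scikit-learn estimators, HAM/SPAM string labels and a simple average of the two spam probabilities with a 0.5 cut-off; it stands in for the team's R implementation and is not their actual code:

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

def fit_ensemble(X_train, y_train):
    # Fit both base models on the same subject/attribute features.
    nb = MultinomialNB().fit(X_train, y_train)
    svm = SVC(kernel="linear", probability=True).fit(X_train, y_train)
    return nb, svm

def predict_ensemble(nb, svm, X, threshold=0.5):
    # Average the two SPAM probabilities and apply the cut-off.
    # Assumes the labels are the strings "HAM" and "SPAM".
    spam_col = list(nb.classes_).index("SPAM")
    p_spam = (nb.predict_proba(X)[:, spam_col] + svm.predict_proba(X)[:, spam_col]) / 2.0
    return np.where(p_spam >= threshold, "SPAM", "HAM")

Averaging the two probabilities smooths out cases where one model is overconfident, which is in line with the slide's claim that the integrated approach resists overfitting and gives better control over false negatives.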
Thank You
