MACHINE LEARNING
Project Title: Email-Spam Filtering
Aman Singhla 16212220
Shareesh Bellamkonda 16212926
Vikas Chillar 16212887
Vikas Chhillar
Agenda:
Introduction
Problem Definition
Review Existing Methods
Proposed Methods
Description of Dataset
Description of Source Code
Introduction:
Machine learning focuses on the development of computer programs that can teach themselves to grow
and change when exposed to new data.
HAM
SPAM
Problem definition:
● We consider a dataset of about 33,700 emails that are pre-classified as ham or spam.
● We find the top 10 words for ham emails and for spam emails.
● We determine which emails are generally longer, ham or spam, by calculating the
average word count of each class.
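The top-10-words task can be sketched as follows. This is a minimal illustration, not the project's code: the function name `top_words`, the tokenization rule (lowercase alphabetic runs), and the toy emails are all assumptions.

```python
from collections import Counter
import re

def top_words(emails, n=10):
    """Return the n most frequent alphabetic tokens across a list of email texts."""
    counts = Counter()
    for text in emails:
        # Lowercase and keep alphabetic runs only (an assumed tokenization rule).
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts.most_common(n)

# Toy data, invented for illustration only.
ham = ["Meeting moved to Monday", "Please review the meeting notes"]
spam = ["Win money now", "Free money, win big"]

print(top_words(ham, 3))
print(top_words(spam, 3))
```

In practice, very common words such as "the" or "to" tend to dominate both lists, so a stop-word filter may be worth adding before counting.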
Review existing methods:
Most anti-spam programs are designed to do the same job, but each goes about it in a different way.
Common filtering techniques include:
● List-based filters.
● Content-based filters.
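The difference between the two families can be illustrated with a toy sketch. Everything here is hypothetical: the function names, the blacklist entry, the word list, and the threshold are invented for illustration and are not taken from any real filter.

```python
def list_based_filter(sender, blacklist):
    """Flag a message as spam purely from the sender address."""
    return sender.lower() in blacklist

def content_based_filter(body, spam_words, threshold=2):
    """Flag a message as spam when it contains enough suspicious words."""
    tokens = body.lower().split()
    hits = sum(1 for token in tokens if token in spam_words)
    return hits >= threshold

# Hypothetical configuration.
blacklist = {"offers@example.com"}
spam_words = {"free", "win", "prize"}

print(list_based_filter("Offers@example.com", blacklist))
print(content_based_filter("win a free prize today", spam_words))
print(content_based_filter("lunch at noon?", spam_words))
```

A list-based filter never looks at the message body, while a content-based filter ignores the sender. Learned classifiers such as Naive Bayes belong to the content-based family, with the word list and weights estimated from labelled data instead of being written by hand.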
Proposed method
We used two algorithms in our work:
1) Support Vector Machine (SVM)
2) Naïve Bayes algorithm
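To make the second algorithm concrete, here is a minimal sketch of multinomial Naive Bayes with Laplace (add-one) smoothing in plain Python. It is not the project's implementation; the function names `train_nb` and `predict_nb`, the whitespace tokenization, and the toy emails are assumptions. Only Naive Bayes is sketched, since an SVM needs a numerical optimizer and is usually taken from a library.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Fit multinomial Naive Bayes: per-class word counts and class counts."""
    word_counts = {}                 # class -> Counter of word frequencies
    class_counts = Counter(labels)   # class -> number of training emails
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def predict_nb(model, text):
    """Return the class with the highest log posterior for the given text."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)            # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue                                 # ignore unseen words
            # Add-one smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set, invented for illustration only.
docs = ["win free money now", "free prize win",
        "meeting agenda attached", "see you at the meeting"]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)

print(predict_nb(model, "free money"))
print(predict_nb(model, "meeting agenda"))
```

Working in log space keeps the product of many small word probabilities from underflowing on long emails.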
Receiver Operating Characteristic (ROC)
A Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate against the
false positive rate at each possible cut-point (decision threshold) of a diagnostic test or classifier.
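The definition above can be made concrete with a short sketch that sweeps the cut-point over a set of classifier scores. The function names `roc_points` and `auc`, the toy labels and scores, and the choice of 1 as the positive class are assumptions for illustration.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per distinct score threshold.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier scores, higher meaning more likely positive.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the cut-point step by step, from the highest score to the lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy scores, invented for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)

print(points)
print(auc(points))
```

A curve that hugs the top-left corner (area close to 1) means the classifier ranks most positives above most negatives; the diagonal (area 0.5) corresponds to random guessing.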
Description of Dataset:
Subset             Ham emails   Spam emails
Enron Dataset 1    3,672        1,500
Enron Dataset 2    4,361        1,496
Enron Dataset 3    4,012        1,500
Enron Dataset 4    1,500        4,500
Enron Dataset 5    1,500        3,675
Enron Dataset 6    1,500        4,500
Total              16,545       17,171
Total emails: 33,716
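The totals can be checked directly from the per-subset counts. The snippet below is only an arithmetic check; the dictionary name `enron_counts` is an assumption, and the numbers are copied from the listing above.

```python
# Per-subset counts from the dataset description: subset -> (ham, spam).
enron_counts = {
    1: (3672, 1500),
    2: (4361, 1496),
    3: (4012, 1500),
    4: (1500, 4500),
    5: (1500, 3675),
    6: (1500, 4500),
}

total_ham = sum(ham for ham, _ in enron_counts.values())
total_spam = sum(spam for _, spam in enron_counts.values())
total = total_ham + total_spam

print(total_ham, total_spam, total)  # 16545 17171 33716
print(round(total_spam / total, 3))  # share of spam in the combined data
```

Subsets 1 to 3 are ham-heavy and subsets 4 to 6 are spam-heavy, but the combined data is close to balanced between the two classes.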
Description of Source Code:
➢ Split dataset 70-30 (training and testing)
➢ Make dictionary and extract features
➢ Top 10 words for ham and spam
➢ Which emails are longer?
➢ ROC curve
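The first two steps above (the 70-30 split, then building a dictionary and extracting features) can be sketched as follows. This is an illustration under assumptions, not the project's source: the function names, the fixed random seed, the dictionary size of 3000, and the toy emails are all hypothetical.

```python
import random
from collections import Counter

def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of items and split it into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def make_dictionary(texts, size=3000):
    """Map the most frequent words in texts to column indices."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(size))}

def extract_features(text, dictionary):
    """Turn one email into a vector of word counts over the dictionary."""
    vector = [0] * len(dictionary)
    for word in text.lower().split():
        if word in dictionary:
            vector[dictionary[word]] += 1
    return vector

# Toy labelled emails, invented for illustration only.
emails = [
    ("win free money", "spam"), ("free prize inside", "spam"),
    ("claim your prize now", "spam"), ("cheap loans win big", "spam"),
    ("money back offer", "spam"), ("meeting at noon", "ham"),
    ("agenda for the meeting", "ham"), ("see attached report", "ham"),
    ("lunch on friday", "ham"), ("please review the report", "ham"),
]

train, test = train_test_split(emails)
dictionary = make_dictionary(text for text, _ in train)
features = [extract_features(text, dictionary) for text, _ in train]

print(len(train), len(test), len(dictionary))
```

The dictionary is built from the training part only, so words that appear solely in the test emails are ignored at feature-extraction time.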
Results:
Number of training emails (training set):
70% of 33,716 emails ≈ 23,596 emails
Number of test emails (test set):
30% of 33,716 emails ≈ 10,112 emails
Confusion Matrix:
For Multinomial Naive Bayes:
[[4822  143]
 [ 115 5031]]
For Support Vector Machine (SVM):
[[4843  122]
 [  78 5068]]
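Summary metrics can be derived from the two confusion matrices. The sketch below assumes the common convention that rows are true classes and columns are predicted classes, with the second row and column as the positive class; the slides do not state which class is which, so that layout is an assumption, as is the helper name `metrics`.

```python
def metrics(matrix):
    """Accuracy, precision and recall for a 2x2 confusion matrix.

    Assumed layout: [[TN, FP], [FN, TP]] (rows = true, columns = predicted).
    """
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Values copied from the results above.
naive_bayes = [[4822, 143], [115, 5031]]
svm = [[4843, 122], [78, 5068]]

for name, matrix in [("Naive Bayes", naive_bayes), ("SVM", svm)]:
    scores = metrics(matrix)
    print(name, {key: round(value, 4) for key, value in scores.items()})
```

Accuracy does not depend on the assumed layout: it comes to about 97.4% for Naive Bayes and about 98.0% for the SVM, so the SVM misclassifies fewer test emails (200 against 258).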
Top 10 words for Ham:
Top 10 Words for Spam:
Which emails are generally longer?
We calculated the average word count for ham emails and spam emails
separately and compared the two.
Average word count for ham emails: 365.5 words
Average word count for spam emails: 261.3 words
So ham emails are generally longer than spam emails.
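The comparison can be sketched in a few lines. The function name `average_word_count`, the whitespace tokenization, and the toy emails are assumptions; the project may have counted words differently.

```python
def average_word_count(emails):
    """Mean number of whitespace-separated tokens per email."""
    if not emails:
        return 0.0
    return sum(len(text.split()) for text in emails) / len(emails)

# Toy data, invented for illustration only.
ham = ["see you at the meeting tomorrow morning", "notes attached"]
spam = ["win money now"]

ham_avg = average_word_count(ham)
spam_avg = average_word_count(spam)
print(ham_avg, spam_avg)
print("ham" if ham_avg > spam_avg else "spam", "emails are longer on average")
```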
ROC Curve:
Division of Tasks:
Source code: mainly done by Aman, with help from Vikas and Shareesh
Report: mainly done by Vikas, with help from Aman and Shareesh
Presentation: mainly done by Shareesh, with help from Aman and Vikas
Video: all three of us.
Thank you!

Machine Learning Project - Email Spam Filtering using Enron Dataset
