Machine Learning Project - Email Spam Filtering using Enron Dataset

MACHINE LEARNING
Project Title: Email-Spam Filtering
Aman Singhla 16212220
Shareesh Bellamkonda 16212926
Vikas Chillar 16212887
Vikas Chhillar

Agenda:
Introduction
Problem Definition
Review Existing Methods
Proposed Methods
Description of Dataset
Description of Source Code

Introduction:
Machine learning focuses on the development of computer programs that can teach themselves to grow
and change when exposed to new data.
Each email belongs to one of two classes: HAM (legitimate mail) or SPAM (unsolicited mail).

Problem definition:
● We consider a dataset of about 33,700 emails that are pre-classified into ham and spam.
● We find the top 10 words for ham emails and for spam emails.
● We determine which emails are generally longer, ham or spam, by calculating the average word count
of each class.
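The top-10 word count described above can be sketched as follows. This is a minimal illustration, assuming each email is already loaded as a plain string; the `top_words` helper and the toy `ham` and `spam` lists are hypothetical and are not taken from the project's source code.

```python
from collections import Counter
import re

def top_words(emails, n=10):
    """Return the n most frequent alphabetic words across a list of email texts."""
    counts = Counter()
    for text in emails:
        # Lowercase and keep alphabetic tokens only, so punctuation and digits are ignored.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts.most_common(n)

# Hypothetical toy emails, for illustration only.
ham = ["Meeting moved to Monday", "Please review the meeting notes"]
spam = ["Win money now", "Free money, click now"]

print(top_words(ham, n=3))
print(top_words(spam, n=3))
```

On a real corpus, very common words such as "the" or "to" would probably dominate both lists unless a stop-word list is applied first.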

Review existing methods:
Most anti-spam programs are designed to do the same job, but each goes about it in a different way. The
main filtering techniques are:
● List-based filters.
● Content-based filters.
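The difference between the two families can be sketched in a few lines. This is only an illustration: the blocked address, the keyword set, the threshold, and both function names are hypothetical.

```python
# Hypothetical blacklist and keyword set, for illustration only.
BLOCKED_SENDERS = {"offers@example.com"}
SPAM_KEYWORDS = {"free", "winner", "prize"}

def list_based_filter(sender):
    """List-based: decide from the sender address alone."""
    return sender.lower() in BLOCKED_SENDERS

def content_based_filter(body, threshold=2):
    """Content-based: flag the message when enough suspicious words appear in the body."""
    words = body.lower().split()
    hits = sum(1 for w in words if w.strip(".,!?") in SPAM_KEYWORDS)
    return hits >= threshold

print(list_based_filter("offers@example.com"))
print(content_based_filter("You are a winner! Claim your free prize."))
```

A list-based filter never reads the message, while a content-based filter ignores who sent it. Learned classifiers such as Naive Bayes and SVM belong to the content-based family.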

Proposed method:
We used two algorithms in our work:
1) Support Vector Machine (SVM)
2) Naïve Bayes algorithm
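To make the second algorithm concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, written with the standard library only. It is not the project's implementation; the function names `train_nb` and `predict_nb` and the four toy emails are assumptions made for illustration.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Collect per-class word counts and document counts from labelled emails."""
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for text, label in zip(docs, labels):
        word_counts[label].update(text.lower().split())
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return {"classes": classes, "word_counts": word_counts,
            "doc_counts": doc_counts, "vocab": vocab}

def predict_nb(model, text):
    """Return the class with the highest log posterior for one email."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_class, best_score = None, -math.inf
    for c in model["classes"]:
        counts = model["word_counts"][c]
        total_words = sum(counts.values())
        score = math.log(model["doc_counts"][c] / total_docs)  # log prior
        for word in text.lower().split():
            if word in model["vocab"]:  # words never seen in training are skipped
                # Add-one smoothing keeps unseen (class, word) pairs from zeroing the product.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical toy training set, for illustration only.
docs = ["win free money now", "free prize click now",
        "meeting agenda for monday", "project report attached"]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)

print(predict_nb(model, "free money"))
print(predict_nb(model, "monday meeting report"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.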
Receiver Operating Characteristic (ROC)
The Receiver Operating Characteristic curve (ROC curve) is a plot of the true positive rate against the
false positive rate at the different possible cut-points of a diagnostic test (here, the decision
thresholds of the classifier).
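The definition above can be turned into a short computation. The sketch below assumes every email has a numeric spam score, that label 1 means spam and 0 means ham, that both classes are present, and that the scores are distinct; `roc_points`, `auc`, and the toy scores are hypothetical names and values.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points for a threshold sweep."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold step by step: the highest-scoring emails are flagged first.
    for score, label in sorted(zip(scores, labels), key=lambda pair: -pair[0]):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical classifier scores and true labels (1 = spam, 0 = ham).
scores = [0.9, 0.8, 0.4, 0.2]
labels = [1, 0, 1, 0]

curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

A classifier that ranks every spam email above every ham email has an area of 1.0, while random ranking gives about 0.5.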

Description of Dataset:
Enron Dataset 1: 3,672 ham emails and 1,500 spam emails
Total Ham Emails: 16,545 emails
Total Spam Emails: 17,171 emails
Total Emails: 33,716 emails

Description of Source Code:
➢ Split the dataset 70-30 (training and testing)
➢ Make a dictionary and extract features
➢ Find the top 10 words for ham and spam
➢ Determine which emails are longer
➢ Plot the ROC curve
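The first two steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's source code: the helper names, the fixed random seed, and the dictionary size of 3000 are all hypothetical choices.

```python
import random
from collections import Counter

def split_dataset(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of the items and split it into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def make_dictionary(emails, size=3000):
    """Keep the `size` most frequent purely alphabetic words as the vocabulary."""
    counts = Counter(w for text in emails for w in text.lower().split() if w.isalpha())
    return [word for word, _ in counts.most_common(size)]

def extract_features(text, dictionary):
    """Represent one email as a vector of word counts over the dictionary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in dictionary]

# Hypothetical toy corpus, for illustration only.
emails = ["email number %d" % i for i in range(10)]
train, test = split_dataset(emails)
print(len(train), len(test))

dictionary = make_dictionary(["free money free", "money talks"], size=2)
print(dictionary, extract_features("free free offer", dictionary))
```

Building the dictionary from the training part only keeps information from the test emails out of the features.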

Results:
Number of training emails (training set):
approximately 70% of 33,716 emails = 23,596 emails
Number of test emails (test set):
approximately 30% of 33,716 emails = 10,112 emails

Confusion Matrix:
For Multinomial Naive Bayes:
[[4822  143]
 [ 115 5031]]
For Support Vector Machines:
[[4843  122]
 [  78 5068]]
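Summary metrics can be derived from the two matrices. The sketch below assumes that rows are true classes and columns are predicted classes, in the order ham then spam; the slides do not state the layout, so precision and recall are only valid under that assumption. Accuracy does not depend on it. The `summarize` helper is a hypothetical name.

```python
def summarize(matrix):
    """Accuracy, plus spam precision and recall under the assumed layout."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "total": total,
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Values copied from the confusion matrices above.
naive_bayes = [[4822, 143], [115, 5031]]
svm = [[4843, 122], [78, 5068]]

for name, matrix in [("Multinomial Naive Bayes", naive_bayes),
                     ("Support Vector Machine", svm)]:
    stats = summarize(matrix)
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

Each matrix sums to 10,111 emails. Accuracy comes to about 97.4% for Multinomial Naive Bayes and about 98.0% for the Support Vector Machine.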

Which emails are generally longer?
We calculated the average word count for ham emails and spam emails separately and compared the two
to determine which emails are generally longer.
Average Word Count for Ham Emails: 365.5 words
Average Word Count for Spam Emails: 261.3 words
So it can be concluded that ham emails are generally longer than spam emails.
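The comparison above reduces to a mean over per-email word counts. A minimal sketch, assuming words are whitespace-separated tokens; the `average_word_count` helper and the toy emails are hypothetical and will not reproduce the figures reported above.

```python
def average_word_count(emails):
    """Mean number of whitespace-separated tokens per email (0.0 for an empty list)."""
    if not emails:
        return 0.0
    return sum(len(text.split()) for text in emails) / len(emails)

# Hypothetical toy emails, for illustration only.
ham = ["Meeting moved to Monday at noon", "Please review the attached notes"]
spam = ["Win money now", "Free offer"]

ham_avg = average_word_count(ham)
spam_avg = average_word_count(spam)
print(ham_avg, spam_avg)
print("ham" if ham_avg > spam_avg else "spam", "emails are longer on average")
```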

Division of Tasks:
Source Code: mainly done by Aman, with help from Vikas and Shareesh
Report: mainly done by Vikas, with help from Aman and Shareesh
Presentation: mainly done by Shareesh, with help from Aman and Vikas
Video: all three of us
