This is the presentation for the Machine Learning assignment at Dublin City University, Spring 2017. In this project, we built an email spam filter using the Enron dataset.
4. Machine learning
Problem definition:
● We consider a dataset of about 33,700 emails that are pre-classified as ham or spam.
● We have also found the top 10 words for ham emails and spam emails.
● We have also determined which emails are generally longer, ham or spam, by calculating the
average word count of ham emails and of spam emails.
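The top-10 word count mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `top_words`, the whitespace tokenization, and the toy emails are all assumptions.

```python
from collections import Counter

def top_words(emails, n=10):
    """Return the n most frequent alphabetic words across a list of email texts."""
    counts = Counter()
    for text in emails:
        # Assumption: lowercase, split on whitespace, keep purely alphabetic tokens.
        counts.update(w for w in text.lower().split() if w.isalpha())
    return counts.most_common(n)

# Toy data standing in for the real ham and spam folders.
toy_spam = ["free money free", "free prize now"]
toy_ham = ["meeting on monday", "monday report attached"]

print("spam:", top_words(toy_spam, n=3))
print("ham: ", top_words(toy_ham, n=3))
```

Running the same function once per class gives one ranked list for ham and one for spam.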
5. Machine learning
Review existing methods:
Most anti-spam programs are designed to do the same job, but each goes about it in a different way.
Different techniques are used for filtering:
● List-based filters.
● Content-based filters.
6. Machine learning
Proposed method
We used two algorithms in our work:
1) Support Vector Machine (SVM)
2) Naïve Bayes algorithm
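To illustrate the second algorithm, here is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, written with the standard library only. The slides do not show the project's implementation, so the function names `train_naive_bayes` and `predict`, the whitespace tokenization, and the toy emails are assumptions. The SVM is left out because it needs a numerical optimizer that does not fit in a short sketch.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """Count emails per class and word occurrences per class."""
    class_docs = Counter(labels)
    word_counts = {label: Counter() for label in class_docs}
    for text, label in zip(docs, labels):
        word_counts[label].update(text.lower().split())
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return {"class_docs": class_docs, "word_counts": word_counts, "vocab": vocab}

def predict(model, text):
    """Return the class with the highest log posterior for the given text."""
    total_docs = sum(model["class_docs"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_docs"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in text.lower().split():
            if word in model["vocab"]:  # words never seen in training are skipped
                # Add-one (Laplace) smoothing avoids log(0) for unseen class/word pairs.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data standing in for the Enron emails.
toy_docs = [
    "win free money now",
    "free prize claim now",
    "meeting schedule for monday",
    "please review the attached report",
]
toy_labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(toy_docs, toy_labels)

print(predict(model, "claim your free money"))
print(predict(model, "review the meeting schedule"))
```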
Receiver Operating Characteristic (ROC)
The Receiver Operating Characteristic curve (or ROC curve) is a plot of the true positive rate against the
false positive rate at the different possible cutpoints of a diagnostic test.
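The definition above can be made concrete with a short sketch that computes one (false positive rate, true positive rate) pair per cutpoint. The names `roc_points` and `auc` and the toy scores are assumptions for illustration. Label 1 stands for spam, and both classes are assumed to be present.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per cutpoint."""
    positives = sum(labels)
    negatives = len(labels) - positives  # assumes both classes occur at least once
    points = [(0.0, 0.0)]  # cutpoint above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Toy classifier scores (higher means more spam-like) and true labels.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]

points = roc_points(toy_scores, toy_labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
print(f"AUC={auc(points):.3f}")
```

A curve that rises quickly toward the top-left corner, with an area close to 1, indicates a classifier that separates the two classes well.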
7. Description of Dataset:
Dataset            Ham emails    Spam emails
Enron Dataset 1    3,672         1,500
Enron Dataset 2    4,361         1,496
Enron Dataset 3    4,012         1,500
Enron Dataset 4    1,500         4,500
Enron Dataset 5    1,500         3,675
Enron Dataset 6    1,500         4,500
Total              16,545        17,171
Total emails: 33,716
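As a quick check, the totals can be recomputed from the per-dataset counts. The dictionary name `enron_counts` is an assumption; the numbers are the ones listed above.

```python
# Per-subset counts from the dataset description: (ham, spam).
enron_counts = {
    "Enron Dataset 1": (3672, 1500),
    "Enron Dataset 2": (4361, 1496),
    "Enron Dataset 3": (4012, 1500),
    "Enron Dataset 4": (1500, 4500),
    "Enron Dataset 5": (1500, 3675),
    "Enron Dataset 6": (1500, 4500),
}

total_ham = sum(ham for ham, _ in enron_counts.values())
total_spam = sum(spam for _, spam in enron_counts.values())
total_emails = total_ham + total_spam

print(total_ham, total_spam, total_emails)  # 16545 17171 33716
print(f"spam share: {total_spam / total_emails:.1%}")
```

The two classes are roughly balanced overall, even though each individual subset is skewed toward either ham or spam.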
8. Description of Source Code:
➢ Split the dataset 70-30 (training and testing)
➢ Make the dictionary and extract features
➢ Find the top 10 words for ham and spam
➢ Determine which emails are longer
➢ Plot the ROC curve
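The first two steps above could look roughly like this. It is a sketch under stated assumptions, not the project's source code: the function names, the fixed random seed, the dictionary size of 3000, and the whitespace tokenization are all hypothetical choices.

```python
import random
from collections import Counter

def split_dataset(emails, train_fraction=0.7, seed=0):
    """Shuffle (text, label) pairs and split them into training and test sets."""
    shuffled = emails[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def make_dictionary(train_emails, size=3000):
    """Keep the most frequent alphabetic words seen in the training emails."""
    counts = Counter()
    for text, _label in train_emails:
        counts.update(w for w in text.lower().split() if w.isalpha())
    return [word for word, _ in counts.most_common(size)]

def extract_features(text, dictionary):
    """Turn one email into a vector of word counts, one entry per dictionary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in dictionary]

# Ten toy emails standing in for the Enron data.
toy_emails = [(f"free money offer number {i}", "spam") for i in range(5)]
toy_emails += [(f"meeting notes for project {i}", "ham") for i in range(5)]

train, test = split_dataset(toy_emails)
dictionary = make_dictionary(train, size=5)
features = [extract_features(text, dictionary) for text, _ in train]
print(len(train), len(test), dictionary)
```

Building the dictionary from the training set only keeps the test emails unseen until evaluation.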
9. Results:
Number of training emails (training set):
approximately 70% of 33,716 emails = 23,596 emails
Number of test emails (test set):
approximately 30% of 33,716 emails = 10,112 emails
13. Which emails are generally longer?
Here, we calculated the average word count for ham emails and spam emails
separately and then determined which emails are generally longer.
Average word count for ham emails: 365.5 words
Average word count for spam emails: 261.3 words
So it can be concluded that ham emails are generally longer than spam emails.
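The comparison above boils down to a per-class mean of word counts. A minimal sketch, assuming whitespace tokenization and a hypothetical helper named `average_word_count`; the toy emails are invented and do not reproduce the reported averages.

```python
def average_word_count(emails):
    """Mean number of whitespace-separated words per email (0.0 for an empty list)."""
    if not emails:
        return 0.0
    return sum(len(text.split()) for text in emails) / len(emails)

# Toy data standing in for the real ham and spam folders.
toy_ham = ["please find the quarterly report attached", "see you at the meeting on monday"]
toy_spam = ["free money now", "claim your prize"]

ham_avg = average_word_count(toy_ham)
spam_avg = average_word_count(toy_spam)
longer = "ham" if ham_avg > spam_avg else "spam"
print(f"ham: {ham_avg:.1f}  spam: {spam_avg:.1f}  longer on average: {longer}")
```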
15. Division of Tasks:
Source code: mainly done by Aman, with help from Vikas and Shareesh
Report: mainly done by Vikas, with help from Aman and Shareesh
Presentation: mainly done by Shareesh, with help from Aman and Vikas
Video: all three of us