CLASSIFYING EMAILS USING THEIR LANGUAGE AND READABILITY

Rushdi Shams
Computational Linguistics Group
Department of Computer Science
University of Western Ontario, London, Canada
rshams@uwo.ca
Supervisor: Prof. Bob Mercer

UWORCS 2013
PRESENTATION OUTLINE
• Introduction
• Existing email classification approaches
• Proposed approach
• Dataset
• Features and feature selection
• Classification algorithms
• Performance evaluation and results
• Performance comparison
• Conclusions and future work
INTRODUCTION
• Email spam is one of the major problems of today's Internet
  – Financial loss to institutions ($50B in 2005)
  – Misuse of network traffic and storage
  – Loss of work productivity, etc.
• In addition, spam constitutes 75-80% of total email.
[Pie chart: spam vs. ham share of total emails]
EXISTING EMAIL CLASSIFICATION APPROACHES
The slide contrasts three families of approaches (the family labels appeared only in the slide graphic):
• More stable, fast, wide coverage, better results
• Less stable, fast, small coverage, good results
• Stable, slow, good coverage, good results
ML-BASED EMAIL CLASSIFICATION APPROACHES
Two feature paradigms are contrasted (labels appeared only in the slide graphic), plus a third that contains both the pros and cons of the other two:
• Limited features, language independent, less stability
• Unbound features, language dependent, more stability
PROPOSED APPROACH
Pipeline: Email Dataset → extract m features per message → Classification Algorithm → 10-fold cross-validation → Performance
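A minimal sketch of this pipeline in scikit-learn terms, assuming the per-message features have already been extracted into a matrix X with labels y (file names and the AUC scoring choice are illustrative, not from the slides):

    # Evaluate a classifier on precomputed message features with 10-fold CV.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X = np.load("features.npy")  # hypothetical: one row of m features per message
    y = np.load("labels.npy")    # hypothetical: 1 = spam, 0 = ham

    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    scores = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
    print(f"Mean AUC over 10 folds: {scores.mean():.3f}")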
DATASET

Dataset        Messages   Spam Rate   Raw Texts?   Year of Curation
SpamAssassin   6,046      31.36%      Yes          2002
LingSpam       2,893      16.63%      No           2000
CSDMC2010      4,327      31.85%      Yes          2010

• All data are preprocessed where necessary: headers, subjects, and attachments are removed, along with non-ASCII characters.
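A sketch of this preprocessing with Python's standard email module; the directory layout and the decision to keep only text/plain parts are assumptions, not the exact procedure used in the study:

    # Strip headers, subjects, and attachments, then drop non-ASCII characters.
    import email
    from email import policy
    from pathlib import Path

    def clean_message(raw_bytes):
        msg = email.message_from_bytes(raw_bytes, policy=policy.default)
        # Keeping only text/plain body parts discards headers, the subject,
        # and any attachments.
        parts = [part.get_content() for part in msg.walk()
                 if part.get_content_type() == "text/plain"]
        body = "\n".join(parts)
        # Remove non-ASCII characters.
        return body.encode("ascii", errors="ignore").decode("ascii")

    for path in Path("spamassassin/raw").glob("*"):  # hypothetical layout
        text = clean_message(path.read_bytes())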
FEATURES
We extracted 39 features per message m and grouped them into three groups:
• Traditional spam detection features: Spam Words, Total HTML Tags, Total Anchor Tags, Total Regular Tags
• Language-based features: Alphanumeric Words, Verbs, Stop Words, TF-ISF, TF-IDF, and grammar and spell errors (Grammar Errors, Spell Errors)
• Readability-based features: Fog Index (FI), FKRI, Smog Index, FORCAST, FRES, Simple FI, Inverse FI, Complex Words, Simple Words, Document Length, Word Length, TF-IDF (Simple Words), TF-IDF (Complex Words)
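To illustrate the readability group, a minimal Gunning Fog Index computation; the sentence splitter and syllable counter below are naive stand-ins, not the extraction code used in the study:

    # Gunning Fog Index: FI = 0.4 * (words/sentences + 100 * complex/words),
    # where "complex" words have three or more syllables.
    import re

    def count_syllables(word):
        # Crude vowel-group heuristic, sufficient for a sketch.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def fog_index(text):
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        complex_words = [w for w in words if count_syllables(w) >= 3]
        return 0.4 * (len(words) / len(sentences)
                      + 100 * len(complex_words) / len(words))

    print(fog_index("Win a free prize now. Click this suspicious link immediately."))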
FEATURE SELECTION
• For each dataset, we applied the Boruta feature selection algorithm to the extracted features.
• The outcome shows that all of these features are important for classifying emails in the datasets.
  – Exception: on the LingSpam dataset, the word length feature was labeled unimportant.
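The slides do not say which implementation was used (Boruta originated as an R package); a sketch with the Python port BorutaPy, reusing the feature matrix X and labels y from the pipeline sketch above:

    # Boruta wraps a random forest and compares each feature's importance
    # against shuffled "shadow" copies to decide which features matter.
    import numpy as np
    from boruta import BorutaPy
    from sklearn.ensemble import RandomForestClassifier

    X = np.load("features.npy")  # hypothetical, as in the pipeline sketch
    y = np.load("labels.npy")

    rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
    selector = BorutaPy(rf, n_estimators="auto", random_state=42)
    selector.fit(X, y)

    print("Indices of confirmed-important features:",
          np.where(selector.support_)[0])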
IMPORTANCE OF FEATURES
[Boruta importance plots: a snapshot for SpamAssassin annotated with the three feature groups (readability-based, traditional spam detection, language-based), followed by full plots for SpamAssassin, LingSpam, and CSDMC]
CLASSIFICATION ALGORITHMS
1. Random Forest [Jarrah et al. (2012), Hu et al. (2010)]
2. Boosted Random Forest with AdaBoost [Zhang et al. (2004)]
3. Bagged Random Forest
4. Support Vector Machine (SVM) [Jarrah et al. (2012), Hu et al. (2010), Ye et al. (2008), Lai and Tsai (2004), Zhang et al. (2004)]
5. Naïve Bayes (NB) [Hu et al. (2010), Haidar and Rocha (2008), Metsis et al. (2008), Lai and Tsai (2004)]
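A sketch of this line-up in scikit-learn terms; the slides give no hyperparameters, so defaults and illustrative ensemble sizes are used (the estimator keyword is for recent scikit-learn versions; older ones call it base_estimator):

    # The five classifiers compared in the study, expressed with scikit-learn.
    from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                                  BaggingClassifier)
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB

    base_rf = RandomForestClassifier(n_estimators=100, random_state=42)
    classifiers = {
        "RF": base_rf,
        "Boosted RF": AdaBoostClassifier(estimator=base_rf, n_estimators=10),
        "Bagged RF": BaggingClassifier(estimator=base_rf, n_estimators=10),
        "SVM": SVC(probability=True),  # probabilities are needed for AUC
        "NB": GaussianNB(),
    }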
PERFORMANCE EVALUATION
Metrics reported (spam is the positive class):
• False Positive Rate (FPR), or ham misclassification
• False Negative Rate (FNR), or spam misclassification
• Accuracy, i.e., 1 - overall misclassification
• Precision, or spam discovery rate
• Recall, or spam hit rate
• F1-score
• Area Under the ROC Curve (AUC)
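A sketch of computing these metrics from predictions, treating spam as the positive class:

    # Derive FPR, FNR, and accuracy from the confusion matrix; the rest
    # come straight from scikit-learn's metric functions.
    from sklearn.metrics import (confusion_matrix, precision_score,
                                 recall_score, f1_score, roc_auc_score)

    def report(y_true, y_pred, y_score):
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        return {
            "FPR": fp / (fp + tn),  # ham misclassification
            "FNR": fn / (fn + tp),  # spam misclassification
            "Accuracy": (tp + tn) / (tp + tn + fp + fn),
            "Precision": precision_score(y_true, y_pred),
            "Recall": recall_score(y_true, y_pred),
            "F1": f1_score(y_true, y_pred),
            "AUC": roc_auc_score(y_true, y_score),
        }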
PERFORMANCE ON SPAMASSASSIN

            FPR    FNR    Accuracy %  Precision  Recall  F1     AUC
RF          0.035  0.093  94.707      0.923      0.907   0.915  0.979
Boosted RF  0.027  0.079  95.700      0.941      0.921   0.931  0.982
Bagged RF   0.023  0.099  95.353      0.948      0.901   0.924  0.986
SVM         0.052  0.292  87.265      0.861      0.708   0.777  0.828
NB          0.104  0.558  75.373      0.660      0.443   0.529  0.847

Best per metric: Bagged RF for FPR, Precision, and AUC; Boosted RF for FNR, Accuracy, Recall, and F1.
PERFORMANCE ON LINGSPAM

            FPR    FNR    Accuracy %  Precision  Recall  F1     AUC
RF          0.018  0.162  95.817      0.907      0.838   0.869  0.978
Boosted RF  0.017  0.162  95.886      0.910      0.838   0.871  0.977
Bagged RF   0.010  0.193  95.956      0.944      0.807   0.868  0.986
SVM         0.014  0.341  93.156      0.907      0.659   0.760  0.822
NB          0.219  0.277  77.186      0.402      0.723   0.515  0.831

Best per metric: Bagged RF for FPR, Accuracy, Precision, and AUC; Boosted RF for F1; RF and Boosted RF tie for FNR and Recall.
PERFORMANCE ON CSDMC

            FPR    FNR    Accuracy %  Precision  Recall  F1     AUC
RF          0.040  0.092  94.338      0.914      0.908   0.911  0.980
Boosted RF  0.030  0.089  95.124      0.934      0.912   0.922  0.980
Bagged RF   0.021  0.107  95.193      0.953      0.893   0.922  0.988
SVM         0.028  0.390  85.718      0.913      0.610   0.730  0.792
NB          0.101  0.396  80.471      0.737      0.604   0.662  0.855

Best per metric: Bagged RF for FPR, Accuracy, Precision, and AUC; Boosted RF for FNR and Recall; Boosted RF and Bagged RF tie for F1.
PERFORMANCE COMPARISON: SPAMASSASSIN
• Ma et al. (2010), Neural Nets: reported Precision 0.920 and Overall Misclassification 0.080, vs. our Precision 0.948 and Overall Misclassification 0.043 (p < 0.05: yes)
• Srisanyalak and Sornil (2007), Neural Nets: reported Accuracy 0.924, vs. our Accuracy 0.957 (p < 0.05: yes)
• Bratko et al. (2006), Statistical: reported FPR 0.001, FNR 0.012, AUC 0.982, vs. our FPR 0.023, FNR 0.079, AUC 0.986 (p < 0.05: yes)
PERFORMANCE COMPARISON: LINGSPAM
• Basavaraju and Pravakar (2010), BIRCH and K-NNC: reported Precision 0.698, Recall 0.637, Specificity 0.828, Accuracy 0.755, vs. our Precision 0.944, Recall 0.838, Specificity 0.990, Accuracy 0.960 (p < 0.05: yes)
• Cormack and Bratko (2006), PPM: reported AUC 0.960, vs. our AUC 0.986 (p < 0.05: yes)
• Yang et al. (2011), Naïve Bayes: reported Precision 0.943, Recall 0.820, AUC 0.992, vs. our Precision 0.944, Recall 0.838, AUC 0.986 (p < 0.05: yes, for Recall)
PERFORMANCE COMPARISON: CSDMC
• Jarrah et al. (2012), RF: reported Precision 0.958, Recall 0.958, F1 0.958, AUC 0.981, vs. our Precision 0.953, Recall 0.912, F1 0.922, AUC 0.988 (p < 0.05: yes, for Recall and F1)
• Yang et al. (2011), Naïve Bayes: reported Precision 0.935, Recall 1.000, AUC 0.976, vs. our Precision 0.953, Recall 0.912, AUC 0.988 (p < 0.05: yes)
• Yang et al. (2011), SVM: reported Precision 0.943, Recall 0.965, AUC 0.995, vs. our Precision 0.953, Recall 0.912, AUC 0.988 (p < 0.05: yes)
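The slides report significance at p < 0.05 but do not name the test; a common choice for 10-fold CV results is a paired t-test over per-fold scores, sketched here with purely illustrative numbers:

    # Paired t-test over matched per-fold scores from two systems.
    from scipy import stats

    fold_auc_ours = [0.986, 0.984, 0.988, 0.985, 0.987,
                     0.986, 0.983, 0.989, 0.985, 0.987]  # hypothetical
    fold_auc_base = [0.979, 0.978, 0.981, 0.977, 0.980,
                     0.979, 0.976, 0.982, 0.978, 0.980]  # hypothetical

    t_stat, p_value = stats.ttest_rel(fold_auc_ours, fold_auc_base)
    print(f"p = {p_value:.4f}  (significant at 0.05: {p_value < 0.05})")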
CONCLUSIONS
• Our spam classification approach performed:
  – Best on LingSpam
    • Smallest dataset, with the fewest spam messages
    • Ham messages were collected from forums
    • Easy to achieve better FPR and accuracy
  – Better than many others on SpamAssassin, and comparably on CSDMC2010
    • Similar spam:ham ratios
    • Random ham and spam collection
FUTURE WORK
• Use personalized email data rather than a random collection
  – e.g., Enron-Spam
• Use the probability scores of terms in email contents from a Naïve Bayes spam filter as an additional feature