Transcript

  • 1. Improving Image Spam Filtering Using Image Text Features. Battista Biggio, Ignazio Pillai, Giorgio Fumera, Fabio Roli. Pattern Recognition and Applications Group (PRAG), Department of Electrical and Electronic Engineering, University of Cagliari, Italy. 5th Conference on Email and Anti-Spam (CEAS 2008), Mountain View, California, USA, August 21-22, 2008
  • 2. About me • Pattern Recognition and Applications Group http://prag.diee.unica.it – DIEE, University of Cagliari, Italy. • Contact – Battista Biggio, Ph.D. student battista.biggio@diee.unica.it 21-08-2008 Image Spam Filtering CEAS 2008 2
  • 3. Pattern Recognition and Applications Group (PRAG) • Research interests – Methodological issues • Multiple classifier systems • Adversarial learning • Classification reliability – Main applications • Intrusion detection in computer networks • Multimedia document categorization, spam filtering • Biometric authentication (fingerprint, face) • Content-based image retrieval • Faculty members – F. Roli (group head) – G. Giacinto – G. Fumera – L. Didaci – G.L. Marcialis – 7 PhD students – 3 post docs – 2 consultants
  • 4. Outline • Introduction – What is image spam? • Image spam filtering – Image spam state of the art (SoA) – Our work • Experiments • A plug-in for SpamAssassin: Image Cerberus
  • 5. Image spam • Image spam has been in use since about 2005 – Spam messages are embedded into images to evade filter modules based on machine learning approaches (e.g. Bayesian filters) – Adversarial noise is added to prevent OCR from reading the embedded text (obfuscated spam images)
  • 6. Image spam SoA • Commercial / open source anti-spam filters: – OCR + keyword search – Image low-level feature analysis • Research: – OCR + text categorization (TC) • Fumera et al., JMLR 2006 – BayesOCR plug-in for SpamAssassin – Image classifiers (ham/spam) based on low-level image features (text areas, color distribution, etc.) • Wu et al., ICIP 2005 • Aradhye et al., ICDAR 2005 • Dredze et al., CEAS 2007
  • 7. Our past work • OCR is not effective against obfuscated images – Spammers learned from CAPTCHAs / HIPs! • Our idea: the presence of adversarially obfuscated text can be a hint of spamminess (Biggio et al., CEAS 2007) – How did we detect the presence of adversarially obfuscated text? • Four features based on: – Text localisation – Perimetric complexity – Edge detection • However, these features did not work as we expected: they did not respond only to adversarially obfuscated text…
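The slide names perimetric complexity but gives no formula or implementation. As a rough illustration only, the sketch below uses one common definition from the literature (squared perimeter divided by ink area, sometimes further divided by 4π) on a binarised text region. The function name, the list-of-rows mask format, and the pixel-edge perimeter estimate are assumptions made for this example, not the authors' method.

```python
def perimetric_complexity(mask):
    """Squared perimeter over ink area for a binary mask.

    mask is a list of rows; truthy cells are "ink" (text) pixels.
    The perimeter is estimated by counting ink-pixel edges that touch
    background or the image border (4-connectivity). This is an
    illustrative estimate, not the measure used in the slides.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    area = 0
    perimeter = 0
    for r in range(height):
        for c in range(width):
            if not mask[r][c]:
                continue
            area += 1
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                outside = not (0 <= rr < height and 0 <= cc < width)
                if outside or not mask[rr][cc]:
                    perimeter += 1
    if area == 0:
        return 0.0
    return perimeter ** 2 / area
```

Under this estimate a compact blob scores lower than a fragmented one with less ink, which is the intuition behind using the measure as a hint that characters have been broken up or cluttered by noise.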
  • 8. This work • Our image text defect measures nevertheless appeared to capture low-level text characteristics that differ between ham and spam images • We therefore use these measures as additional features in image-classification approaches, to improve their discriminant capability
  • 9. Experiments • Data sets (publicly available at http://prag.diee.unica.it/n3ws1t0/eng/spamRepository) – A: 2006 ham images, 3297 spam images – B: 2006 ham images, 8549 spam images • Image feature sets – Aradhye et al., ICDAR 2005 (aradhye) • Color heterogeneity, color saturation, text area – Dredze et al., CEAS 2007 (dredze) • Image meta-data, visual features – Four other visual features, for comparison (generic) • Number of colors (log), number of pixels (log), relative area occupied by the most common color, text area – Features used in this work (text)
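Three of the four "generic" features listed on slide 9 can be computed directly from pixel values. The sketch below is a minimal illustration under stated assumptions: the image is already decoded into a flat list of (r, g, b) tuples, the logarithm is the natural log (the slide does not give the base), and the function and key names are hypothetical. The fourth feature, text area, needs a text-localisation step and is left out.

```python
import math
from collections import Counter


def generic_features(pixels, width, height):
    """Compute three simple visual features from a flat list of RGB tuples.

    pixels must hold exactly width * height (r, g, b) tuples.
    Natural log is an assumption; the slides only say "(log)".
    """
    if not pixels or len(pixels) != width * height:
        raise ValueError("pixels must contain width * height entries")
    color_counts = Counter(pixels)
    n_pixels = len(pixels)
    # Count of the single most frequent color in the image.
    most_common_count = color_counts.most_common(1)[0][1]
    return {
        "log_num_colors": math.log(len(color_counts)),
        "log_num_pixels": math.log(n_pixels),
        "most_common_color_area": most_common_count / n_pixels,
    }
```

For example, a 4x2 image with six white and two black pixels has two distinct colors, eight pixels, and a most-common-color area of 0.75.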
  • 10. Experiments (cont’d) • We evaluated the performance of image ham/spam classifiers based on individual feature sets (aradhye, dredze, generic) and on their fusion, at either feature level or score level, with our features (text). [Diagram labels: feature-level fusion, C(x1∪x2); score-level fusion, C(x1) and C(x2) feeding C(s)]
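The two fusion schemes named on slide 10 can be sketched as follows. In feature-level fusion the two feature vectors are concatenated and a single classifier is trained on the result; in score-level fusion each feature set gets its own classifier and a further classifier is trained on the pair of scores. The nearest-centroid scorer below is a toy stand-in chosen to keep the example self-contained; the slides do not say which classifier was used, and all class and function names here are hypothetical. For brevity the score-level combiner is trained on the same samples as the base classifiers, which a real experiment would normally avoid.

```python
import math


class CentroidScorer:
    """Toy classifier: positive score means closer to the spam centroid."""

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):  # 0 = ham, 1 = spam
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def score(self, x):
        d_ham = math.dist(x, self.centroids[0])
        d_spam = math.dist(x, self.centroids[1])
        return d_ham - d_spam


def feature_level_fusion(X1, X2, y):
    """Train one classifier C(x1 ∪ x2) on concatenated feature vectors."""
    fused = [list(a) + list(b) for a, b in zip(X1, X2)]
    clf = CentroidScorer().fit(fused, y)
    return lambda x1, x2: clf.score(list(x1) + list(x2))


def score_level_fusion(X1, X2, y):
    """Train C(x1) and C(x2), then a combiner C(s) on their two scores."""
    c1 = CentroidScorer().fit(X1, y)
    c2 = CentroidScorer().fit(X2, y)
    scores = [[c1.score(a), c2.score(b)] for a, b in zip(X1, X2)]
    combiner = CentroidScorer().fit(scores, y)
    return lambda x1, x2: combiner.score([c1.score(x1), c2.score(x2)])
```

Both functions return a callable that takes one sample's two feature vectors and yields a single spamminess score, so the two schemes can be compared on the same held-out images.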
  • 11. Results
  • 12. A plug-in for SpamAssassin: Image Cerberus • We implemented a SpamAssassin plug-in based on our approach – generic + text features, fused at feature level • Publicly available – http://prag.diee.unica.it/n3ws1t0/imageCerberus • We will release the source code (C++) soon • We need your feedback!
  • 13. Some examples [three example images; values shown: 1.06, score = 0.28, 0.98]
  • 14. Some examples (cont’d) [three example images; values shown: score = 0.82, score = 0.63, 1.00]
  • 15. Spam or ham? [three example images; score = 0.20, score = -1.4, score = 0.27] • Ham images from the TREC 2007 spam corpus!
  • 16. Thank you! • See you at the poster session! • Contacts – roli@diee.unica.it – fumera@diee.unica.it – pillai@diee.unica.it – battista.biggio@diee.unica.it • Web – http://prag.diee.unica.it