Improving Image Spam Filtering Using Image Text Features - Presentation Transcript
P R A G
Pattern Recognition and Applications Group
University of Cagliari, Italy
Department of Electrical and Electronic Engineering
Improving Image Spam Filtering
Using Image Text Features
Battista Biggio, Ignazio Pillai, Giorgio Fumera, Fabio Roli
5th Conference on Email and Anti-Spam (CEAS) 2008,
Mountain View, California, USA, August 21st - 22nd
About me
• Pattern Recognition and Applications Group
http://prag.diee.unica.it
– DIEE, University of Cagliari, Italy.
• Contact
– Battista Biggio, Ph.D. student
battista.biggio@diee.unica.it
21-08-2008 Image Spam Filtering CEAS 2008 2
Pattern Recognition and Applications Group
• Research interests
– Methodological issues
• Multiple classifier systems
• Adversarial learning
• Classification reliability
– Main applications
• Intrusion detection in computer networks
• Multimedia document categorization, Spam filtering
• Biometric authentication (fingerprint, face)
• Content-based image retrieval
• Faculty members
– F. Roli (group head)
– G. Giacinto
– G. Fumera
– L. Didaci
– G.L. Marcialis
– 7 PhD students
– 3 post docs
– 2 consultants
Outline
• Introduction
– What is image spam?
• Image spam filtering
– Image spam state of the art (SoA)
– Our work
• Experiments
• A plug-in for SpamAssassin: Image Cerberus
Image spam
• Since about 2005: image spam
– Embedding spam messages into images to evade
filter modules that analyse email text with machine
learning approaches (e.g. Bayesian filters)
– Adding adversarial noise to prevent OCR from
reading embedded text (obfuscated spam images)
Image spam SoA
• Commercial / open source anti-spam filters:
– OCR + keyword search
– Image low-level feature analysis
• Research:
– OCR + text categorization (TC)
• Fumera et al., JMLR 2006
– BayesOCR plug-in for SpamAssassin
– Image classifiers (ham/spam) based on low-level
image features (text areas, color distribution, etc.)
• Wu et al., ICIP 2005
• Aradhye et al., ICDAR 2005
• Dredze et al., CEAS 2007
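The "OCR + keyword search" approach listed above can be sketched roughly as follows. This is only an illustration, assuming the OCR step has already produced a plain string; the keyword list, the function names and the `min_hits` threshold are hypothetical and do not come from any particular filter.

```python
import re

# Hypothetical keyword list; real filters ship their own, much larger lists.
SPAM_KEYWORDS = {"viagra", "stock", "pharmacy", "mortgage"}

def keyword_hits(ocr_text, keywords=SPAM_KEYWORDS):
    """Return the set of spam keywords found in text extracted by OCR."""
    # Lowercase and keep only runs of letters, so "STOCK!!" still matches "stock".
    tokens = set(re.findall(r"[a-z]+", ocr_text.lower()))
    return tokens & set(keywords)

def is_spam_by_keywords(ocr_text, min_hits=2):
    """Flag the image when at least min_hits distinct keywords are present."""
    return len(keyword_hits(ocr_text)) >= min_hits
```

A string such as "st0ck ph4rmacy" produces no hits here, which is the weakness that obfuscated spam images exploit once OCR output becomes noisy.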
Our past work
• OCR is not effective against obfuscated images
– Spammers learned from CAPTCHAs / HIPs!
• Our idea: the presence of adversarial obfuscated text
can be a spamminess hint (Biggio et al., CEAS 2007)
– How did we detect the presence of adversarial obfuscated text?
• Four features based on:
– Text localisation
– Perimetric complexity
– Edge detection
• However, these features did not respond only to
adversarial obfuscated text, as we had expected…
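The slides do not define perimetric complexity. A commonly used definition is the squared perimeter of the text (ink) region divided by its area, often normalised by 4π. The sketch below assumes that definition and an already binarised image given as rows of 0/1 values; the function name and the pixel-edge way of measuring the perimeter are illustrative choices, not the authors' implementation.

```python
import math

def perimetric_complexity(img):
    """img: list of rows of 0/1 ints, where 1 marks a text (ink) pixel."""
    rows, cols = len(img), len(img[0])
    area = sum(sum(row) for row in img)
    if area == 0:
        return 0.0
    perimeter = 0
    for r in range(rows):
        for c in range(cols):
            if not img[r][c]:
                continue
            # Count every side of an ink pixel that faces background
            # or lies on the image border.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if not inside or not img[nr][nc]:
                    perimeter += 1
    # Assumed definition: P^2 / (4 * pi * A).
    return perimeter ** 2 / (4 * math.pi * area)
```

Under this definition a compact blob scores low, while characters that are broken into fragments or surrounded by noise have a longer perimeter for the same ink area and score higher.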
This work
• Our image text defect measures appeared to
capture low-level text characteristics that
discriminate between ham and spam images
• We exploit the proposed image text defect
measures as additional features in approaches
based on image classification techniques, to
improve their discriminant capability
Experiments
• Data sets (1)
– A: 2006 ham images, 3297 spam images
– B: 2006 ham images, 8549 spam images
• Image feature sets
– Aradhye et al., ICDAR 2005
• Color heterogeneity, color saturation, text area
– Dredze et al., CEAS 2007
• Image meta-data, visual features
– Four other visual features, for comparison (generic)
• Number of colors (log), number of pixels (log), relative
area occupied by the most common color, text area
– Features used in this work (text)
(1) Data sets are publicly available at: http://prag.diee.unica.it/n3ws1t0/eng/spamRepository
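Three of the four generic features listed above can be computed directly from pixel values. The sketch below is a minimal illustration, assuming the image is given as a flat list of (R, G, B) tuples and that "log" means the natural logarithm (the slides do not state the base). The text area feature is left out because it needs a text localisation step; the function and key names are hypothetical.

```python
import math
from collections import Counter

def generic_features(pixels):
    """pixels: non-empty flat list of (R, G, B) tuples for one image."""
    counts = Counter(pixels)          # how often each distinct color occurs
    n_pixels = len(pixels)
    most_common_count = counts.most_common(1)[0][1]
    return {
        "log_num_colors": math.log(len(counts)),
        "log_num_pixels": math.log(n_pixels),
        # Fraction of the image covered by the single most frequent color.
        "most_common_color_area": most_common_count / n_pixels,
    }
```

For example, an image with six white and two black pixels yields two colors, eight pixels, and a most-common-color area of 0.75.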
Experiments (cont’d)
• We evaluated the performance of image
ham/spam classifiers based on individual
feature sets (aradhye, dredze, generic) and
on their fusion (either at feature or score level) with
our features (text).
[Figure: two fusion schemes. Feature level fusion: a single classifier C(x1∪x2). Score level fusion: classifiers C(x1) and C(x2), combined by C(s).]
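The two fusion schemes can be sketched as follows. The slides do not say which classifier was used, so a tiny logistic regression stands in for C here; all function names are hypothetical. As a simplification, the combiner C(s) is trained on scores computed on the same training samples, whereas a careful setup would use held-out scores.

```python
import math

def predict_score(model, x):
    """Return the estimated probability that x is spam (label 1)."""
    w, b = model
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, epochs=200, lr=0.1):
    """Stand-in classifier: logistic regression trained by plain SGD."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict_score((w, b), xi) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def feature_level_fusion(X1, X2, y):
    """Train one classifier on the concatenated feature vectors x1 + x2."""
    return train_logreg([a + b for a, b in zip(X1, X2)], y)

def score_level_fusion(X1, X2, y):
    """Train C(x1) and C(x2), then a combiner C(s) on their two scores."""
    c1, c2 = train_logreg(X1, y), train_logreg(X2, y)
    S = [[predict_score(c1, a), predict_score(c2, b)] for a, b in zip(X1, X2)]
    return c1, c2, train_logreg(S, y)

def score_level_predict(models, x1, x2):
    """Apply the three models returned by score_level_fusion to one sample."""
    c1, c2, combiner = models
    s = [predict_score(c1, x1), predict_score(c2, x2)]
    return predict_score(combiner, s)
```

Feature level fusion lets a single model weigh all features jointly. Score level fusion keeps each feature set's classifier separate, so a new feature set can be added by training one more classifier and retraining only the combiner.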
A plug-in for SpamAssassin:
Image Cerberus
• We implemented a SpamAssassin plug-in based on our
approach
– generic + text fused at feature level
• Publicly available
– http://prag.diee.unica.it/n3ws1t0/imageCerberus
• We will release source code (C++) soon
We need your feedback!
Image Cerberus
Spam or ham?
score = 0.20
score = -1.4
score = 0.27
• Ham images from the TREC 2007 spam corpus!
Thank you!
• See you at the poster session!
• Contacts
– roli@diee.unica.it
– fumera@diee.unica.it
– pillai@diee.unica.it
– battista.biggio@diee.unica.it
• Web
– http://prag.diee.unica.it