Mobile Phone Spam Image Detection
based on Graph Partitioning with Pyramid
Histogram of Visual Words Image Descriptor
SO YEON KIM, KYUNG-AH SOHN
DEPARTMENT OF INFORMATION AND COMPUTER ENGINEERING, AJOU UNIVERSITY
Contents
Introduction
•Motivation and Challenges
Methods
•Feature extraction
•Database construction
•Image classification and evaluation
Results
•Dataset
•Performance comparison in 5-fold cross validation
•Averaged performance comparison with optimal parameters
•Misclassified samples in best-performing cluster
Conclusion
•Summary and future works
Introduction
Motivation and challenges
Introduction
Image spam, rather than text spam, is rapidly increasing on mobile phones.
Introduction
We want to detect image spam on mobile phones.
66 spam images
405 non-spam images
377 images for training (80%)
94 images for test (20%)
This dataset is too small to train a predictive model.
Some e-mail image spam has features similar to mobile phone image spam.
Methods
Feature extraction
Database construction
Image classification and evaluation
Methods
[Figure: feature extraction pipeline. Input image → SIFT feature extraction → bag of visual words → concatenation of spatial histograms → feature vector. Input image → RGB histogram feature extraction → feature vector.]
Feature extraction
◦ RGB histogram
◦ Pyramid Histogram of Visual Words (PHOW) – color mode: gray / RGB / opponent
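The two descriptors listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the bin count, the pyramid levels, and the function names `rgb_histogram` and `pyramid_histogram` are assumptions made for illustration, and the SIFT extraction and visual-word quantization steps are taken as already done (each keypoint arrives as a normalized position plus a visual-word index).

```python
from typing import List, Sequence, Tuple


def rgb_histogram(pixels: Sequence[Tuple[int, int, int]], bins: int = 8) -> List[float]:
    """Concatenate the R, G and B histograms, each normalised by the pixel count.

    `bins` is an assumed setting; the slides do not state the bin count.
    """
    hist = [0.0] * (3 * bins)
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            # Map an 8-bit intensity (0..255) to one of `bins` equal-width bins.
            hist[channel * bins + value * bins // 256] += 1.0
    n = max(len(pixels), 1)
    return [count / n for count in hist]


def pyramid_histogram(
    words: Sequence[Tuple[float, float, int]],
    vocab_size: int,
    levels: Sequence[int] = (1, 2),
) -> List[float]:
    """Concatenate bag-of-visual-words histograms over a spatial pyramid.

    `words` holds (x, y, word_id) with x, y in [0, 1]. For each grid size g in
    `levels` (an assumed choice), the image is split into g x g cells and one
    visual-word histogram is built per cell.
    """
    feature: List[float] = []
    for grid in levels:
        cells = [[0.0] * vocab_size for _ in range(grid * grid)]
        for x, y, word in words:
            col = min(int(x * grid), grid - 1)
            row = min(int(y * grid), grid - 1)
            cells[row * grid + col][word] += 1.0
        for cell in cells:
            feature.extend(cell)
    total = sum(feature) or 1.0
    return [value / total for value in feature]
```

The pyramid is what adds geometric information: two images with the same visual words in different places share the 1 x 1 histogram but differ in the finer cells.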
Methods
Database construction
◦ K-means clustering
◦ Elkan k-means clustering algorithm
◦ K-means++ algorithm for initializing centroids
[Figure: feature vectors of mobile phone spam images and e-mail spam images → Euclidean distance matrix (e-mail image × phone image) → k-means clustering → most similar e-mail images, which are combined with the phone images into the Phone + E-mail Spam Image Dataset.]
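As a rough illustration of this step, the sketch below runs k-means with k-means++ seeding and then keeps the e-mail images that share a cluster with at least one phone spam image. Several points are assumptions: the selection rule is only one plausible reading of the slide, the helper names are hypothetical, and plain Lloyd iterations are used in place of Elkan's algorithm (Elkan's method is an acceleration that yields the same assignments).

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points: Sequence[Vector], k: int, rng: random.Random) -> List[List[float]]:
    """k-means++ seeding: later centres are drawn with probability proportional to D^2."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:  # every point coincides with a centre
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centers


def kmeans(points: Sequence[Vector], k: int, iters: int = 50, seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    """Plain Lloyd iterations (not Elkan's accelerated variant)."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centre for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: move each non-empty cluster's centre to its mean.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def similar_email_subset(phone: Sequence[Vector], email: Sequence[Vector], k: int, seed: int = 0) -> List[int]:
    """Indices of e-mail images that land in a cluster containing a phone image (assumed rule)."""
    labels, _ = kmeans(list(phone) + list(email), k, seed=seed)
    phone_clusters = set(labels[: len(phone)])
    return [i for i, lab in enumerate(labels[len(phone):]) if lab in phone_clusters]
```

The number of clusters `k` is a free parameter here; the slides do not state the value used.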
Methods
Database construction
◦ Spectral clustering
[Figure: feature vectors of mobile phone spam images and e-mail spam images → Euclidean distance matrix (e-mail image × phone image) → … → similarity matrix (e-mail image × e-mail image).]
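The conversion from distances to similarities can be sketched as follows. The Gaussian (RBF) kernel is an assumption: the slides do not name the kernel, although the σ values reported with the spectral clustering results are consistent with one. How the authors derive the e-mail × e-mail similarity matrix from the e-mail × phone distance matrix is not specified, so the functions below are kept generic.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean_distance_matrix(rows: Sequence[Vector], cols: Sequence[Vector]) -> List[List[float]]:
    """Entry [i][j] is the Euclidean distance between rows[i] and cols[j]."""
    return [[math.dist(a, b) for b in cols] for a in rows]


def gaussian_similarity(dist: Sequence[Sequence[float]], sigma: float) -> List[List[float]]:
    """Assumed kernel: s = exp(-d^2 / (2 * sigma^2)), so d = 0 gives s = 1."""
    return [[math.exp(-d * d / (2.0 * sigma * sigma)) for d in row] for row in dist]
```

A smaller σ makes the similarity fall off faster with distance, which is why σ behaves as a tuning parameter for the resulting graph.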
Methods
Database construction
◦ Spectral clustering
[Figure: spectral clustering (normalized cut) selects e-mail spam images, which are combined with the phone images into the Phone + E-mail Spam Image Dataset.]
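A two-way normalized cut can be sketched with the standard spectral relaxation: take the second-largest eigenvector of the normalized affinity D^(-1/2) W D^(-1/2), rescale it by D^(-1/2), and split at zero. The version below is an illustrative standard-library sketch, not the authors' code; it finds the eigenvector by power iteration with deflation, whereas a practical implementation would normally call a linear algebra library. The number of clusters used in the slides is not stated, and the function name is hypothetical.

```python
import math
from typing import List, Sequence


def normalized_cut_bipartition(W: Sequence[Sequence[float]], iters: int = 200) -> List[int]:
    """Split a graph in two from its symmetric similarity matrix W (positive degrees assumed)."""
    n = len(W)
    degree = [sum(row) for row in W]
    inv_sqrt = [1.0 / math.sqrt(d) for d in degree]
    # Shifted normalised affinity (I + D^-1/2 W D^-1/2) / 2. The shift moves all
    # eigenvalues into [0, 1], so power iteration converges to the largest ones.
    M = [
        [0.5 * (W[i][j] * inv_sqrt[i] * inv_sqrt[j] + (1.0 if i == j else 0.0)) for j in range(n)]
        for i in range(n)
    ]
    # The top eigenvector is known in closed form: proportional to sqrt(degree).
    norm = math.sqrt(sum(degree))
    top = [math.sqrt(d) / norm for d in degree]
    v = [float(i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(iters):
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        # Deflation: remove the component along the trivial top eigenvector.
        proj = sum(a * b for a, b in zip(v, top))
        v = [a - proj * b for a, b in zip(v, top)]
        length = math.sqrt(sum(a * a for a in v)) or 1.0
        v = [a / length for a in v]
    # Rescale by D^-1/2 and threshold at zero to obtain the two groups.
    return [1 if v[i] * inv_sqrt[i] >= 0 else 0 for i in range(n)]
```

Which group receives label 0 or 1 is arbitrary; only the grouping itself is meaningful.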
Methods
Image classification and Evaluation
[Figure: SVM classification (spam / ham) on the Phone + E-mail Spam Image Dataset, evaluated with 5-fold cross validation; training set 80%, test set 20%; image sources labelled e-mail and phone.]
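The evaluation protocol can be sketched as a fold generator. One assumption needs stating plainly: the sketch places the selected e-mail spam images in every training fold and tests only on phone images, which is a reading of the diagram rather than something the slides spell out. The function name is hypothetical, the folds are not stratified by class, and SVM training itself is left out.

```python
import random
from typing import List, Tuple

Sample = Tuple[str, int]  # (source, index), where source is "phone" or "email"


def five_fold_splits(
    n_phone: int, n_email: int, folds: int = 5, seed: int = 0
) -> List[Tuple[List[Sample], List[Sample]]]:
    """Return (train, test) pairs; only phone images are ever held out."""
    idx = list(range(n_phone))
    random.Random(seed).shuffle(idx)
    email_ids: List[Sample] = [("email", i) for i in range(n_email)]
    splits = []
    for f in range(folds):
        test = idx[f::folds]  # every folds-th shuffled index
        held_out = set(test)
        train: List[Sample] = [("phone", i) for i in idx if i not in held_out]
        # Assumed: the e-mail spam subset augments training only.
        splits.append((train + email_ids, [("phone", i) for i in test]))
    return splits
```

With the 471 phone images, each fold holds out 94 or 95 of them, matching the roughly 80% / 20% split shown on the slide.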
Results
Dataset
Performance comparison in 5-fold cross validation
Averaged performance comparison with optimal parameters
Misclassified samples in best-performing cluster
Results
Dataset
◦ Similar sub-set of e-mail spam images from Image Spam Hunter dataset.
                           Phone   E-mail   Total
Spam      RGB histogram      66      12       78
          PHOW-gray          66     201      267
          PHOW-RGB           66      20       86
          PHOW-opponent      66     324      390
Non-spam                    405       -      405
E-mail column: similar sub-set from spectral clustering
Results
Performance comparison in 5-fold cross validation
◦ Evaluation measure
Confusion matrix
                       Predicted
                       Spam   Non-spam
Actual    Spam         TP     FN
          Non-spam     FP     TN

Accuracy = (TP + TN) / (TP + FN + FP + TN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (FP + TN)
Precision = TP / (TP + FP)
F-score = 2 * (Precision * Sensitivity) / (Precision + Sensitivity)
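The evaluation measures above translate directly into code. The sketch below is illustrative; the function name and the convention of returning 0.0 for an empty denominator are assumptions, not part of the slides.

```python
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division: an empty denominator yields 0.0 (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Compute the five measures from confusion-matrix counts (spam = positive)."""
    sensitivity = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    return {
        "accuracy": _ratio(tp + tn, tp + fn + fp + tn),
        "sensitivity": sensitivity,
        "specificity": _ratio(tn, fp + tn),
        "precision": precision,
        "f_score": _ratio(2 * precision * sensitivity, precision + sensitivity),
    }
```

For example, `classification_metrics(tp=8, fn=2, fp=4, tn=86)` gives an accuracy of 0.94 and a sensitivity of 0.8. Because non-spam images dominate the dataset, accuracy alone can look high while sensitivity and F-score stay low.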
Results
Performance comparison in 5-fold cross validation
◦ RGB-histogram feature
Results
Performance comparison in 5-fold cross validation
◦ PHOW feature (gray mode)
Results
Performance comparison in 5-fold cross validation
◦ PHOW feature (RGB color mode)
Results
Performance comparison in 5-fold cross validation
◦ PHOW feature (opponent color mode)
Results
Sample e-mail spam images
◦ These images are correctly grouped into the same cluster with the PHOW descriptor but fall into different clusters with the RGB histogram feature.
The PHOW descriptor considers not only the color distribution but also the geometric information of images.
Results
Averaged performance comparison with optimal parameters
• PHOW descriptors outperform the RGB histogram feature
• The color mode of the PHOW descriptor does not affect performance significantly
Results
Averaged performance comparison in k-means clustering
              RGB histogram   PHOW (gray)   PHOW (RGB)   PHOW (opponent)   random 10%
Accuracy         73.47%         95.12%        95.54%         94.27%          72.25%
Sensitivity      42.42%         92.42%        92.42%         87.91%          32.03%
Specificity      78.52%         95.56%        96.05%         95.31%          78.81%
F-score          30.73%         84.19%        85.49%         81.15%          24.14%
Results
Averaged performance comparison in spectral clustering
              RGB histogram   PHOW (gray)   PHOW (RGB)   PHOW (opponent)   random 10%
σ                 0.4            0.7           0.3            0.6              -
Accuracy         81.75%         96.39%        96.82%         96.39%          72.25%
Sensitivity      30.55%         95.45%        87.91%         84.95%          32.03%
Specificity      90.12%         96.54%        98.27%         98.27%          78.81%
F-score          32.31%         88.28%        88.48%         86.76%          24.14%
Results
Misclassified samples in best-performing cluster
[Figure: (a) False positives (FP); (b) False negatives (FN)]
Conclusion
Summary and future works
Conclusion
• We proposed a mobile phone spam image filtering system that uses a large set of e-mail spam images to address the shortage of phone spam image data.
• We compared phone spam image classification performance using the RGB histogram and the PHOW descriptor in several color modes (gray, RGB, opponent).
• The PHOW descriptor, which considers both geometric and color information, improves performance over the RGB histogram.
• An advanced clustering technique such as spectral clustering has a positive impact on performance.
Thank you!
Q & A
