Tran Phuc Ho, Ho-Seok Kang, Sung-Ryul Kim
Journal of Universal Computer Science, vol. 19, no. 16 (2013)
* Spam SMS: advertisements from commercial companies, and hacking messages that cheat recipients and steal personal information.
* Content-based approach: graph-based text representation + KNN algorithm
* Labeled messages (spam / normal) are divided into small message groups: 5 messages per group (in real time, only 1 message)
* Tokenize them by white space and punctuation
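
The grouping and tokenizing step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `tokenize` and `group_messages` are hypothetical, and lowercasing the tokens is an assumption the slide does not state.

```python
import re
import string

# Character class matching any whitespace or ASCII punctuation character.
_SPLIT = re.compile(r"[\s" + re.escape(string.punctuation) + r"]+")

def tokenize(message: str) -> list[str]:
    """Split a message on runs of whitespace/punctuation (lowercasing is an assumption)."""
    return [tok for tok in _SPLIT.split(message.lower()) if tok]

def group_messages(messages: list, group_size: int = 5) -> list[list]:
    """Cut a list of labeled messages into consecutive groups of `group_size`."""
    return [messages[i:i + group_size] for i in range(0, len(messages), group_size)]
```

For example, `tokenize("WIN a FREE prize!!! Call 0800-123, now.")` yields the tokens `win`, `a`, `free`, `prize`, `call`, `0800`, `123`, `now`.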
* Feature selection: remove the noisy features and select the good ones
* Methods: Mutual Information (MI), χ²-statistic (CHI)
* Mutual Information (MI): the dependence between a word (t) and a type of message (c)
* t: token (word or phrase)
* c: class (type of message: spam or ham)
* Quantities used: the probability that t and c co-occur, the conditional probability of t in c, and the probability of t
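
The formula itself was an image on the slide and did not survive extraction. The three labeled quantities match the standard mutual information criterion used for text feature selection, so the following form is an assumption based on those labels, not a quotation from the slide:

```latex
MI(t, c) = \log \frac{P(t \wedge c)}{P(t)\,P(c)} = \log \frac{P(t \mid c)}{P(t)}
```

The second equality follows from $P(t \wedge c) = P(t \mid c)\,P(c)$. A value near zero means t and c are close to independent.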
* χ²-statistic (CHI): the lack of independence between a word (t) and a type of message (c)
* t: token (word or phrase)
* c: class (type of message: spam or ham)
* Quantities used: the probability of t, the probability that t and c co-occur, and the probability that the text belongs to c
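
The slide's formula was an image and is missing here. One standard probability form of the χ² statistic for a term and a class is shown below as an assumption; $N$ (the total number of messages) and the bar notation for "t absent" and "not class c" are not taken from the slide.

```latex
\chi^2(t, c) =
  \frac{N \,\bigl[\, P(t, c)\, P(\bar{t}, \bar{c}) - P(t, \bar{c})\, P(\bar{t}, c) \,\bigr]^2}
       {P(t)\, P(\bar{t})\, P(c)\, P(\bar{c})}
```

The statistic is zero when t and c are independent and grows as the dependence gets stronger.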
* Calculate the weight of each feature (CHI: χ²-statistic; MI: Mutual Information)
* Use the highly weighted words for constructing the graphs
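
A minimal sketch of this weighting step, assuming the standard count-based forms of MI and χ² over a 2x2 term/class contingency table. The function names, the `target` and `top_k` parameters, and the choice to rank by χ² are hypothetical and only for illustration.

```python
import math
from collections import Counter

def mi_and_chi(n_tc: int, n_t: int, n_c: int, n: int) -> tuple[float, float]:
    """n_tc: messages of class c containing t; n_t: messages containing t;
    n_c: messages of class c; n: all messages."""
    a = n_tc                    # t present, class c
    b = n_t - n_tc              # t present, other class
    c = n_c - n_tc              # t absent, class c
    d = n - n_t - n_c + n_tc    # t absent, other class
    # MI = log( P(t, c) / (P(t) P(c)) ), estimated from counts.
    mi = math.log((a * n) / (n_t * n_c)) if a > 0 else float("-inf")
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    chi = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return mi, chi

def select_features(docs, labels, target="spam", top_k=100):
    """Rank tokens by chi-square weight and keep the top_k (ranking choice is an assumption)."""
    n = len(docs)
    n_c = sum(1 for y in labels if y == target)
    df = Counter(tok for doc in docs for tok in set(doc))
    df_c = Counter(tok for doc, y in zip(docs, labels) if y == target for tok in set(doc))
    weights = {t: mi_and_chi(df_c[t], df[t], n_c, n)[1] for t in df}
    return sorted(weights, key=weights.get, reverse=True)[:top_k]
```

Note that χ² is symmetric in the class, so words that are typical of normal messages also receive high weights; MI as defined here rewards only words that co-occur with the target class.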
* G = (V, E, FWN)
* V: set of nodes; each node is a token selected by feature selection (a unique word)
* E: set of weighted edges linking the nodes
* FWN: feature weight matrix (weights of edges and nodes)
* Edges capture the order and co-occurrence relationship between two feature words: if two feature words co-occur within a step length, an edge is assigned
* FWN holds the weights of the edges and the probabilities of the tokens represented by the nodes
* W_ij: co-occurrence frequency of two feature words f_i and f_j within a step length
* Only the weights W_ij with i > j are calculated. Ex) "scientific paper"
* The other edge entries are zero. Ex) "paper scientific"
* Node entries: frequency of single words
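
The graph construction can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: `build_graph` and its `step` parameter are hypothetical names, node weights are taken as relative frequencies among the selected feature words, the step length is measured in positions of the original token sequence, and the ordered pair (earlier word, later word) stands in for the triangular W_ij storage.

```python
from collections import Counter

def build_graph(tokens, features, step=2):
    """Return (node_weights, edge_weights) for one tokenized message group.

    node_weights: feature word -> relative frequency among kept feature words.
    edge_weights: (earlier word, later word) -> co-occurrence count within `step`.
    """
    kept = [t for t in tokens if t in features]
    node_w = {t: n / len(kept) for t, n in Counter(kept).items()}
    edge_w = Counter()
    for i, a in enumerate(tokens):
        if a not in features:
            continue
        # Look ahead at most `step` positions; word order is preserved in the key.
        for b in tokens[i + 1 : i + 1 + step]:
            if b in features and b != a:
                edge_w[(a, b)] += 1
    return node_w, dict(edge_w)
```

With `step=1`, the tokens `scientific paper about scientific paper` give the edge `("scientific", "paper")` a weight of 2, while the reversed pair `("paper", "scientific")` never appears, which mirrors the slide's two examples.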
* KNN: among the K nearest neighbors of the text T to be classified, the class of T is the class that appears most frequently in this collection
1. Build sample graphs (elements)
2. A new message comes in
3. Build a testing graph
* Similarity of two graphs -> Feature Weight (FW): weights of the edges + weight of the edge itself (for edges that appear in both graphs)
* Testing graph (tg) vs. sample graphs (sg_1, sg_2, … sg_n)
* List (RL):
  1  FW(tg, sg_1) = 2  Spam
  2  FW(tg, sg_2) = 3  Spam
  …  FW(tg, sg_3) = 4  Normal
  K  FW(tg, sg_4) = 5  Spam
* Nfp: how many nodes in the sample graph with weights larger than 0 also appear in the test graph
* If Nfp > threshold, calculate FW(tg, sg_1)
0.0001
3
* Testing graph (tg) vs. sample graphs (sg_1, sg_2, … sg_n)
* If Nfp > threshold, calculate FW(tg, sg_5) = 6
* List (RL):
  1  FW(tg, sg_5) = 6  Spam
  2  FW(tg, sg_2) = 3  Spam
  …  FW(tg, sg_3) = 4  Normal
  K  FW(tg, sg_4) = 5  Spam
* Result: spam message
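
The classification walk-through can be sketched as follows. Several parts are assumptions: each graph is represented as a `(node_weights, edge_weights)` pair of dictionaries, `feature_weight` is only one possible reading of the slide's FW description (for every edge present in both graphs, the sample graph's two node weights plus the edge weight), and the defaults for `k` and `threshold` are placeholders, not values from the slides.

```python
import heapq
from collections import Counter

def nfp(test_nodes, sample_nodes):
    """Count sample-graph nodes with weight > 0 that also appear in the test graph."""
    return sum(1 for tok, w in sample_nodes.items() if w > 0 and tok in test_nodes)

def feature_weight(test_graph, sample_graph):
    """Assumed FW: sum over shared edges of (endpoint node weights + edge weight)."""
    _, t_edges = test_graph
    s_nodes, s_edges = sample_graph
    fw = 0.0
    for (a, b), w in s_edges.items():
        if (a, b) in t_edges:
            fw += s_nodes.get(a, 0.0) + s_nodes.get(b, 0.0) + w
    return fw

def classify(test_graph, samples, k=3, threshold=0):
    """samples: list of (graph, label). Returns the majority label of the K
    highest-FW sample graphs, or None if no sample passes the Nfp filter."""
    scored = []
    for graph, label in samples:
        if nfp(test_graph[0], graph[0]) > threshold:  # cheap pre-filter
            scored.append((feature_weight(test_graph, graph), label))
    top = heapq.nlargest(k, scored, key=lambda pair: pair[0])  # the list RL
    if not top:
        return None
    return Counter(label for _, label in top).most_common(1)[0][0]
```

The Nfp check acts as a cheap filter: FW is only computed for sample graphs that share enough nodes with the testing graph, which keeps the per-message cost low when most samples are unrelated.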
* NUS SMS Corpus (5,574 messages): 4,827 normal (86.6%), 747 spam (13.4%)
* [Uysal and Yildiz] SMS collection (875 messages): 450 normal, 425 spam
* [Result charts not recoverable; axis units: (%), (seconds)]
* Spam SMS messages are evolving, so keywords are hard to capture.
* ex) 대★출 (obfuscated "loan"), 이ㅈr (obfuscated "interest"), <<통>> / <<장>> (split "bankbook"), no spaces or punctuation, no specific keyword, the same content with different phone numbers, image-only messages with no words …
* Graph patterns of communication between sender and receiver should be combined with the content-based approach.

Graph-based KNN Algorithm for Spam SMS Detection
