Graph-based KNN Algorithm for Spam SMS Detection

Tran Phuc Ho, Ho-Seok Kang, Sung-Ryul Kim
Journal of Universal Computer Science, vol. 19, no. 16 (2013)
*
* Spam SMS: advertisements from commercial companies, and hacking messages that cheat recipients and steal personal information.
* Content-based approach: graph-based text representation and KNN algorithm
* Labeled messages (spam / normal) are collected into small message groups of 5 messages (in real time, only 1 message)
* Tokenize them by white space and punctuation
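
The tokenization step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `tokenize` and `tokenize_group` are hypothetical, and lowercasing is an added assumption that the slide does not state.

```python
import re

def tokenize(message: str) -> list[str]:
    """Split a message on runs of white space and punctuation."""
    # [\W_]+ matches any run of non-word characters plus underscores,
    # i.e. spaces and punctuation. Lowercasing is an assumption.
    return [tok for tok in re.split(r"[\W_]+", message.lower()) if tok]

def tokenize_group(messages: list[str]) -> list[str]:
    """Tokenize a small group of messages (e.g. 5) into one token list."""
    tokens: list[str] = []
    for message in messages:
        tokens.extend(tokenize(message))
    return tokens
```

For a group of 5 labeled messages, `tokenize_group` returns a single token sequence; in the real-time case the group would contain only 1 message.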
*
* Remove the noisy features and select the good ones
* Measures: mutual information (MI), χ² statistic (CHI)

*
Mutual information (MI): the dependence between a word (t) and a type of message (c)
t: token (word or phrase)
c: class (type of message: spam or ham)
Terms in the formula:
- P(t, c): the probability that t and c co-occur
- P(t | c): the conditional probability of t in c
- P(t): the probability of t
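
The three labeled terms match the usual pointwise form of mutual information used for text feature selection. As a sketch, assuming that standard definition, with P(t, c) the probability that t and c co-occur, P(t | c) the conditional probability of t in c, and P(t) the probability of t (this notation is introduced here):

```latex
\mathrm{MI}(t, c) = \log \frac{P(t, c)}{P(t)\,P(c)} = \log \frac{P(t \mid c)}{P(t)}
```

The second equality follows from P(t, c) = P(t | c) P(c).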

*
χ² statistic (CHI): the lack of independence between a word (t) and a type of message (c)
t: token (word or phrase)
c: class (type of message: spam or ham)
Terms in the formula:
- P(t): the probability of t
- P(t, c): the probability that t and c co-occur
- P(c): the probability that the text belongs to c
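
For reference, the χ² statistic is commonly written in terms of document counts. The form below is the standard textbook definition, assumed here rather than read from the slide. The counts A, B, C, D and N are notation introduced for this sketch: A is the number of messages in c that contain t, B the number outside c that contain t, C the number in c without t, D the number outside c without t, and N = A + B + C + D.

```latex
\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}
```

The value is 0 when t and c are independent and grows as their co-occurrence departs from independence.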

*
* Calculate the weight of each feature
* Use the highly weighted words for constructing the graphs
* Weighting measures: CHI (χ² statistic), MI (mutual information)
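
The weighting step can be sketched as below, assuming the standard document-count forms of MI and χ². The function names `feature_weights` and `top_features`, the `target` parameter, and the choice to count each token once per message are all assumptions made for illustration.

```python
import math
from collections import Counter

def feature_weights(docs, labels, target="spam"):
    """Return {token: (mi, chi)} from tokenized messages and their labels."""
    n = len(docs)
    n_c = sum(1 for y in labels if y == target)
    df = Counter()    # number of messages containing the token
    df_c = Counter()  # number of target-class messages containing the token
    for tokens, y in zip(docs, labels):
        for t in set(tokens):  # count each token once per message
            df[t] += 1
            if y == target:
                df_c[t] += 1
    weights = {}
    for t in df:
        a = df_c[t]        # token present, target class
        b = df[t] - a      # token present, other class
        c = n_c - a        # token absent, target class
        d = n - a - b - c  # token absent, other class
        # MI = log(P(t,c) / (P(t) P(c))); -inf if t never occurs in the class
        mi = math.log(a * n / ((a + b) * (a + c))) if a > 0 else float("-inf")
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        chi = n * (a * d - c * b) ** 2 / denom if denom else 0.0
        weights[t] = (mi, chi)
    return weights

def top_features(weights, k, by="chi"):
    """Return the k tokens with the highest weight under the chosen measure."""
    idx = 0 if by == "mi" else 1
    return sorted(weights, key=lambda t: weights[t][idx], reverse=True)[:k]
```

The tokens returned by `top_features` would then serve as the feature words from which the graphs are built.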

*
G = (V, E, FWN)
V: set of nodes; each node is a unique word (a token selected by feature selection)
E: set of weighted edges linking the nodes; an edge records the order and co-occurrence relationship between two feature words (if two feature words co-occur within a step length, assign an edge)
FWN: feature weight matrix, holding the weights of edges and nodes

*
FWN, the feature weight matrix of G = (V, E, FWN), holds the weights of the edges and the probabilities of the tokens represented by the nodes:
- Node entries: frequency of single words
- W_ij: co-occurrence frequency of two feature words f_i and f_j within a step length
- Only the weights W_ij with i > j are calculated (ex: "scientific paper")
- The remaining entries are zero (ex: "paper scientific")
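
Graph construction can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's data structure: edges are stored in a dictionary keyed by ordered word pairs instead of a triangular matrix, the step length is measured over the sequence of feature words only, and the name `build_graph` and its parameters are hypothetical.

```python
from collections import Counter

def build_graph(tokens, features, step=2):
    """Return (node_weights, edge_weights) for one tokenized message group."""
    # Keep only the selected feature words, preserving their order.
    kept = [t for t in tokens if t in features]
    nodes = Counter(kept)  # frequency of single words
    edges = Counter()
    for p, first in enumerate(kept):
        # Pair each word with the words that follow it within `step` positions.
        for second in kept[p + 1 : p + 1 + step]:
            if first != second:
                edges[(first, second)] += 1  # ordered: (a, b) differs from (b, a)
    return nodes, edges
```

With this representation, "scientific paper" increments the edge `("scientific", "paper")` while the reverse pair `("paper", "scientific")` stays at zero, which mirrors the order sensitivity described above.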

*
Among the K nearest neighbors of the text T to be classified, the class of T is the class that appears most frequently in this collection.
1. Build sample graphs (elements)
2. New message comes in
3. Build a testing graph
Similarity of two graphs -> Feature Weight (FW): weights of the edges + weight of the edge itself (for edges that appear in the two graphs)
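
One possible reading of the feature weight is sketched below. The slide does not give the exact formula, so this is an assumption: for every edge that appears in both graphs, the weight from the testing graph and the weight from the sample graph are added. The name `feature_weight` and the dictionary representation of edges are hypothetical.

```python
def feature_weight(test_edges, sample_edges):
    """Sum the weights of edges that appear in both graphs (assumed reading)."""
    # Edges are dictionaries mapping (first_word, second_word) to a weight.
    shared = test_edges.keys() & sample_edges.keys()
    return sum(test_edges[e] + sample_edges[e] for e in shared)
```

Two graphs with no edge in common get a feature weight of 0, so they never rank among the nearest neighbors.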

*
Testing graph (tg), sample graphs (sg_1, sg_2, … sg_n)
List (RL):
No.  Feature weight  Class
1    FW(tg,sg1)=2    Spam
2    FW(tg,sg2)=3    Spam
…    FW(tg,sg3)=4    Normal
K    FW(tg,sg4)=5    Spam
Nfp: how many nodes in the sample graph with weights larger than 0 also appear in the test graph
If Nfp > threshold, calculate FW(tg,sg1)
0.0001
3
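
The Nfp check acts as a cheap filter before the feature weight is computed. A minimal sketch, assuming node weights are stored in dictionaries keyed by word; the names `count_shared_nodes` and `should_compare` are hypothetical and the threshold value is left to the caller:

```python
def count_shared_nodes(test_nodes, sample_nodes):
    """Nfp: sample-graph nodes with weight > 0 that also appear in the test graph."""
    return sum(1 for word, w in sample_nodes.items() if w > 0 and word in test_nodes)

def should_compare(test_nodes, sample_nodes, threshold):
    """Only compute FW(tg, sg) when Nfp exceeds the threshold."""
    return count_shared_nodes(test_nodes, sample_nodes) > threshold
```

Sample graphs that share too few nodes with the testing graph are skipped, which avoids computing the feature weight for clearly unrelated samples.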

*
Testing graph (tg), sample graphs (sg_1, sg_2, … sg_n)
List (RL):
No.  Feature weight  Class
1    FW(tg,sg5)=6    Spam
2    FW(tg,sg2)=3    Spam
…    FW(tg,sg3)=4    Normal
K    FW(tg,sg4)=5    Spam
If Nfp > threshold, calculate FW(tg,sg5) = 6
Result: spam message
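
The list update and the final vote can be sketched as follows, using the feature weights from the example above. This is an illustration under assumptions: the list of the K best samples is kept in a min-heap so that a new sample replaces the current lowest entry, and the name `classify` is hypothetical.

```python
import heapq
from collections import Counter

def classify(scored_samples, k):
    """Return the majority class among the k samples with the largest FW."""
    top = []  # min-heap holding the k largest scores seen so far
    for i, (fw, label) in enumerate(scored_samples):
        item = (fw, i, label)  # the index i breaks ties between equal scores
        if len(top) < k:
            heapq.heappush(top, item)
        elif item > top[0]:
            heapq.heapreplace(top, item)  # drop the current lowest entry
    if not top:
        return None  # no sample passed the Nfp filter
    votes = Counter(label for _, _, label in top)
    return votes.most_common(1)[0][0]

# Scores taken from the two lists above: sg5 (FW = 6) replaces sg1 (FW = 2).
samples = [(2, "Spam"), (3, "Spam"), (4, "Normal"), (5, "Spam"), (6, "Spam")]
result = classify(samples, k=4)
```

With K = 4 the retained entries are sg2, sg3, sg4 and sg5, giving three Spam votes against one Normal vote, so `result` is `"Spam"`.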

*
* NUS SMS Corpus (5,574 messages): 4,827 normal (86.6%), 747 spam (13.4%)
* [Uysal and Yildiz] SMS collection (875 messages): 450 normal, 425 spam

*
* Spam SMS messages are evolving, so keywords are hard to capture.
* ex) Korean keywords obfuscated with inserted symbols, such as 대★출 (loan), 이ㅈr (interest), <<통>> / <<장>> (bank account); no space or punctuation; no specific keyword; the same content sent from different phone numbers; no words, only an image …
* Graph patterns of communication between sender and receiver should be combined with the content-based approach.
