Motivation
• Text mining + Network effect
[Figure: SMS corpus; content analysis and network analysis; spam score; spam filtering system (spam / ham)]
Many users’ data are needed
Deep Belief Networks (DBNs)
• What is a DBN (for classification)?
– A feedforward neural network with a deep architecture (many hidden layers)
– Consists of visible (input) units, hidden units, and output units (for classification, one per class)
• Parameters of a DBN
– W(j): weights between the units of layers j-1 and j
– b(j): biases of layer j (the input layer has no biases)
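The parameterization above can be sketched as a forward pass. This is a minimal illustration, not the presented system: the sigmoid hidden activations, the softmax output, the small Gaussian initialization, and the helper names `init_params` and `forward` are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_params(layer_sizes, rng):
    """One (W(j), b(j)) pair per non-input layer; W(j) has shape (units in j-1, units in j)."""
    return [(rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, params):
    """Propagate a batch x through the hidden layers, then the output layer."""
    a = x
    for W, b in params[:-1]:
        a = sigmoid(a @ W + b)           # hidden layers (assumed sigmoid units)
    W_out, b_out = params[-1]
    return softmax(a @ W_out + b_out)    # one output unit per class (assumed softmax)

rng = np.random.default_rng(0)
params = init_params([6, 4, 3, 2], rng)  # toy sizes: 6 inputs, two hidden layers, 2 classes
probs = forward(rng.random((5, 6)), params)
```

Each row of `probs` holds the class probabilities for one input vector.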
Training a DBN
• Conventional approach: gradient-based optimization
– Random initialization of weights and biases
– Adjustment by backpropagation
• Problem: with random initialization, optimization algorithms get stuck in poor solutions
• Solution
– Hinton et al. [2006] proposed a greedy layer-wise unsupervised algorithm for initializing DBN parameters
– Initialization phase: initialize each layer by treating it as a Restricted Boltzmann Machine (RBM)
Restricted Boltzmann Machines (RBMs)
• An RBM is a two-layer neural network
– Binary inputs (visible units) are connected to binary outputs (hidden units) by symmetrically weighted, bidirectional connections
• Parameters of an RBM
– W: weights between the two layers
– b, c: biases for the visible and hidden layers, respectively
• Layer-to-layer conditional distributions: P(h | v) and P(v | h)
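For reference, the standard conditionals of a binary RBM can be written with the parameters W, b, c defined above. The unit indices i (visible) and j (hidden) and the symbol σ for the logistic function are notation assumed here.

```latex
P(h_j = 1 \mid v) = \sigma\Big(c_j + \sum_i W_{ij}\, v_i\Big), \qquad
P(v_i = 1 \mid h) = \sigma\Big(b_i + \sum_j W_{ij}\, h_j\Big), \qquad
\sigma(x) = \frac{1}{1 + e^{-x}}
```

Because there are no connections within a layer, the units of one layer are conditionally independent given the other layer, so each layer can be sampled in a single step.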
RBM Training
• For every training example (data vector v)
1. Propagate it from the visible to the hidden units
2. Sample the hidden states from the conditional P(h | v)
3. Propagate the sample in the opposite direction using P(v | h) ⇒ confabulation of the original data
4. Update the hidden units once more using the confabulation
• Update the RBM parameters
• Repeat
• Remember that RBM training is unsupervised
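The four steps above correspond to one step of contrastive divergence (CD-1). The sketch below is one common way to write it, under assumptions: the learning rate, the toy data, the use of hidden probabilities rather than samples in the update, and the function name `cd1_update` are illustrative choices, not details from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b, c, rng, lr=0.1):
    """One CD-1 step for a single binary data vector v0 (updates W, b, c in place).

    W: (n_visible, n_hidden) weights, b: visible biases, c: hidden biases.
    """
    # 1. Propagate the data vector from the visible to the hidden units.
    p_h0 = sigmoid(c + v0 @ W)
    # 2. Sample binary hidden states from P(h | v0).
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # 3. Propagate the sample back using P(v | h0): the confabulation.
    p_v1 = sigmoid(b + h0 @ W.T)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    # 4. Update the hidden units once more using the confabulation.
    p_h1 = sigmoid(c + v1 @ W)
    # Parameter update: data statistics minus confabulation statistics.
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b += lr * (v0 - v1)
    c += lr * (p_h0 - p_h1)
    return v1

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]], dtype=float)  # toy binary vectors, no labels used

for epoch in range(200):        # "Repeat"
    for v in data:
        cd1_update(v, W, b, c, rng)

recon = cd1_update(data[0], W, b, c, rng)  # a confabulation of the first example
```

No class labels appear anywhere in the update, which is what makes the procedure unsupervised.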
DBN Training
1. Train the first-layer RBM: W(1), b(1)
2. Stack another hidden layer on top of the first RBM and train W(2), b(2) as a second RBM
3. Continue stacking layers on top of the network, training each as in the previous step, up to W(L), b(L)
4. Initialize the output-layer weights W(L+1) randomly
• Good initializations are obtained
• Fine-tune the whole network with a standard supervised criterion (mean squared error, cross-entropy); they used conjugate gradients
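The greedy layer-wise procedure can be sketched as follows. This is a simplified illustration under assumptions: `train_rbm` is a compact batch CD-1 variant that uses reconstruction probabilities instead of samples, each RBM is fed the hidden probabilities of the layer below, the output biases start at zero, and supervised fine-tuning is omitted. The helper names and toy sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, rng, epochs=10, lr=0.1):
    """Batch CD-1 on `data`; returns the weights W and hidden biases c."""
    n, n_visible = data.shape
    W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
    b = np.zeros(n_visible)
    c = np.zeros(n_hidden)
    for _ in range(epochs):
        p_h0 = sigmoid(data @ W + c)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(h0 @ W.T + b)       # confabulation (probabilities, an assumption)
        p_h1 = sigmoid(p_v1 @ W + c)
        W += lr * (data.T @ p_h0 - p_v1.T @ p_h1) / n
        b += lr * (data - p_v1).mean(axis=0)
        c += lr * (p_h0 - p_h1).mean(axis=0)
    return W, c

def pretrain_dbn(data, hidden_sizes, n_classes, rng):
    """Greedy layer-wise initialization: one RBM per hidden layer, random output layer."""
    params = []
    layer_input = data
    for n_hidden in hidden_sizes:
        W, c = train_rbm(layer_input, n_hidden, rng)
        params.append((W, c))                        # W(j), b(j) of the DBN
        layer_input = sigmoid(layer_input @ W + c)   # activations feed the next RBM
    # Output-layer weights W(L+1) are random; they are only learned during fine-tuning.
    W_out = rng.normal(0.0, 0.01, (hidden_sizes[-1], n_classes))
    params.append((W_out, np.zeros(n_classes)))
    return params

rng = np.random.default_rng(0)
X = (rng.random((20, 12)) < 0.3).astype(float)       # toy binary "messages"
params = pretrain_dbn(X, hidden_sizes=[5, 5, 8], n_classes=2, rng=rng)
```

The returned `params` would then serve as the starting point for supervised fine-tuning of the whole network.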
Dataset
• LingSpam, SpamAssassin, EnronSpam
Performance Measures
• Accuracy: percentage of correctly classified messages
• Ham/spam recall: percentage of ham (or spam) messages that are correctly classified
• Ham/spam precision: percentage of messages classified as ham (or spam) that are indeed ham (or spam)
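A small sketch of how these measures can be computed from true and predicted labels. The function name `evaluate` and the toy label lists are illustrative assumptions, and values are returned as fractions rather than percentages.

```python
def evaluate(y_true, y_pred, labels=("ham", "spam")):
    """Accuracy plus per-class recall and precision, as fractions in [0, 1]."""
    pairs = list(zip(y_true, y_pred))
    report = {"accuracy": sum(t == p for t, p in pairs) / len(pairs)}
    for label in labels:
        true_pos = sum(t == label and p == label for t, p in pairs)
        actual = sum(t == label for t in y_true)      # messages that really are `label`
        predicted = sum(p == label for p in y_pred)   # messages classified as `label`
        report[f"{label}_recall"] = true_pos / actual if actual else 0.0
        report[f"{label}_precision"] = true_pos / predicted if predicted else 0.0
    return report

y_true = ["spam", "spam", "ham", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]
report = evaluate(y_true, y_pred)
# accuracy 0.6; ham recall and precision 2/3; spam recall and precision 0.5
```

Reporting both classes separately matters for spam filtering, since misclassifying ham as spam usually costs more than the reverse.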
Experimental Setup
• Message representation: x = [x1, x2, …, xm]
– Each attribute corresponds to a distinct word from the corpus
– Its value is the frequency of the corresponding word in the message
• Attribute selection
– Stop words and words appearing in fewer than 2 messages were removed; the remaining words were ranked by information gain score (m=1500 for LingSpam, m=1000 for SpamAssassin and EnronSpam)
• All experiments were performed using 10-fold cross-validation
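The representation and attribute selection described above might look like the sketch below. Several points are assumptions: whitespace tokenization, raw counts as the "frequency", information gain computed on word presence/absence, ties broken alphabetically, and the helper names. The tiny corpus and stop-word set are made up for illustration.

```python
import math
from collections import Counter

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(docs, labels, word):
    """IG of the binary feature 'word occurs in the message' w.r.t. the class label."""
    with_word = [l for d, l in zip(docs, labels) if word in d]
    without_word = [l for d, l in zip(docs, labels) if word not in d]
    h_class = entropy(list(Counter(labels).values()))
    h_conditional = sum(len(part) / len(labels) * entropy(list(Counter(part).values()))
                        for part in (with_word, without_word) if part)
    return h_class - h_conditional

def select_vocabulary(docs, labels, stop_words, m, min_df=2):
    """Drop stop words and words in fewer than min_df messages, keep the top m by IG."""
    doc_freq = Counter(w for d in docs for w in set(d))
    candidates = [w for w, n in doc_freq.items() if n >= min_df and w not in stop_words]
    candidates.sort(key=lambda w: (-information_gain(docs, labels, w), w))
    return candidates[:m]

def to_vector(doc, vocabulary):
    """x = [x1, ..., xm]: count of each selected word in the message."""
    counts = Counter(doc)
    return [counts[w] for w in vocabulary]

texts = ["win money now", "win a free prize now",
         "meeting at noon", "lunch meeting at the office"]
docs = [text.split() for text in texts]
labels = ["spam", "spam", "ham", "ham"]
vocab = select_vocabulary(docs, labels, stop_words={"a", "the", "at"}, m=3)
vectors = [to_vector(d, vocab) for d in docs]
```

In a cross-validation setting, the vocabulary would normally be selected on the training folds only, so that the held-out fold does not influence attribute selection.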
Experimental Setup
• SVM configuration
– Cosine kernel (the usual choice in text classification)
– The cost parameter C must be determined a priori
– Many values of C were tried and the best was kept
• DBN configuration
– An m-50-50-200-2 DBN architecture (3 hidden layers)
– RBM training used binary message vectors (the presence or absence of each word in a message)
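As a brief illustration of the two input treatments above, the sketch below computes a cosine-kernel Gram matrix for the SVM and binarizes frequency vectors for RBM training. The function name `cosine_kernel`, the `eps` guard against zero-length vectors, and the toy matrix are assumptions.

```python
import numpy as np

def cosine_kernel(A, B, eps=1e-12):
    """Gram matrix K[i, j] = <a_i, b_j> / (||a_i|| * ||b_j||)."""
    A_unit = A / np.maximum(np.linalg.norm(A, axis=1, keepdims=True), eps)
    B_unit = B / np.maximum(np.linalg.norm(B, axis=1, keepdims=True), eps)
    return A_unit @ B_unit.T

# Toy word-frequency vectors: rows are messages, columns are selected words.
X = np.array([[2.0, 0.0, 1.0],
              [4.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])

K = cosine_kernel(X, X)            # would be passed to an SVM that accepts a precomputed kernel
X_binary = (X > 0).astype(float)   # presence/absence vectors for RBM training
```

The cosine kernel ignores message length: the first two rows differ only by a scale factor, so their kernel value is 1, while messages with no words in common get 0.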
Experimental Results
• The DBN achieves higher accuracy on all datasets
• It beats the SVM on all measures on SpamAssassin
• The DBN proved robust to variations in the number of units in each layer
• DBN training is much slower than SVM training
Conclusions
• The effectiveness of the initialization method was demonstrated in practice
• DBNs constitute a new viable solution to e-mail filtering
• The selection of the DBN architecture needs to be addressed in a more systematic way
– Number of layers
– Number of units in each layer
Challenges
• An example from the SpamAssassin dataset (email spam)
Hi there,
To be removed please visit:
http://www.supersitescentral.com/rl/remove.html
BIG News...
Visit http://www.supersitescentral.com/rl/x601001.html for full details.
We have discovered a secret to generating a fortune over
the Internet and are looking for a few good people to
share it with.
This could finally be your chance to get that brand new
car and go on that dream vacation you have always wanted.
This is THE BIG ONE! So pay real close attention...
Literally thousands of people are making obscene amounts
of money from the Internet and ecommerce. We found an
Internet giant who markets 11 million products with HUGE
demand in every country around the globe.
You can sit in the comfort of your home making money
hand over fist with a HUGE global market at your
fingertips. Most people never get the opportunity like
this to join *BEFORE* the masses come in.
Consider this:
* Debt Free Multi-Million Dollar Company
* International - in over 180 Countries
* A 100 Billion Dollar Industry
* 3 Year Proven Track Record
* eCommerce Shopping Giant
* Online Marketing Tools
* Phenomenal Support Systems
* Automated Recruiting Systems
* Proprietary Back Office Technology
* Huge Compensation Plan
* Lifetime Residual income
Go to the web site below to get all the details.
http://www.supersitescentral.com/rl/x601001.html
Isn't it your turn to make a fortune over the
Internet? Don't drag your feet on this one. It
could be the one you have been waiting for all
your life.
Talk to you soon,
Mark
iNet Marketing Services
Challenges
[Web발신]
- N H. 금 융 -
더쉽고, 더안전하게
~
7.8 % 로 7000
사.용.하.실.수
있습니다
(English: "[Sent via web] - NH Finance - Easier, safer ~ You can use 7000 at 7.8%")
[Web발신]
크♥사[ㅏ리1.95
로♡ㅂㅔ당+0.05
스♥ㅂL셀1.65
OK♡레알1.49
추쳐닌ck77
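time-pr콤
(English: heavily obfuscated; it appears to list betting odds, followed by the referral code ck77 and the site time-pr콤, where 콤 stands for ".com")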
[Web발신]
사-용-중-인
체_크_카_드
빌-려-주-면
월-4-5-0
당-일-진-행
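바_로_결_제
(English: "If you lend us the check card you are using: 450 per month, same-day processing, immediate payment")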
[Web발신]
KB국민카드 김소
연님08/18KB국민
카드결제금액
3,500원.잔여포인
트리230(08/06기
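준)
(English: "KB Kookmin Card: Kim So-yeon, 08/18, KB Kookmin Card payment amount 3,500 won. Remaining Pointree 230 (as of 08/06)")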
<공학인
증>2014-1학기
미상담 시 성적확
인 및 수강신청
제약!! 학기 중 상
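담 필수~!!
(English: "<Engineering accreditation> 2014 semester 1: without advising, grade checks and course registration are restricted!! Advising during the semester is mandatory~!!")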
• What about Korean spam SMS?
1. Examine the distribution of words and special characters in spam and ham messages.
2. The DBN input vector could be the number of special characters or a measure of how grammatical the message is, instead of the number of spam words.
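One possible way to turn the special-character idea into input features is sketched below. It is only an illustration: treating Unicode punctuation and symbol categories as "special", counting isolated Hangul compatibility jamo as a rough obfuscation signal, and the function name `sms_features` are assumptions. No grammar-correctness measure is attempted.

```python
import unicodedata

def sms_features(message):
    """Character-level features for one SMS (whitespace is ignored)."""
    chars = [ch for ch in message if not ch.isspace()]
    n = len(chars) or 1  # avoid division by zero on empty messages
    # Unicode general categories starting with P (punctuation) or S (symbol).
    special = sum(1 for ch in chars if unicodedata.category(ch)[0] in ("P", "S"))
    # Isolated Hangul compatibility jamo (U+3131..U+318E) are unusual in ordinary text.
    jamo = sum(1 for ch in chars if "\u3131" <= ch <= "\u318e")
    return {"n_special": special,
            "special_ratio": special / n,
            "n_isolated_jamo": jamo}

spam_features = sms_features("사-용-중-인 체_크_카_드 빌-려-주-면")
ham_features = sms_features("학기 중 상담 필수")
```

On these two fragments the spam text yields nine special characters and the ham text none, which is the kind of distributional difference step 1 proposes to examine.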
Challenges
• How to handle MMS spam that contains images?
– Extract text from the image
– Image clustering
– The DBN input vector could be an image vector

Deep belief networks for spam filtering
