A Deep Learning Approach
For Twitter Spam Detection
Lijie Zhou (lijie@mail.sfsu.edu) & Hao Yue
San Francisco State University
Outline
• Problem and Challenges
• Past Work
• Our Model and Results
• Conclusion
• Future Work
What Is Spam?
Spam on Facebook and Twitter
Platform   # of active users   # of spam accounts   % of users
Facebook   2.2 billion         60-83 million        2.73%-3.77%
Twitter    330 million         23 million           6.97%
Source: https://www.statista.com/
Various Social Media Sites
Social Media's Fundamental Design Flaw
• Sophisticated spam accounts know how to exploit various features to cause the most harm:
  • Use shortened URLs to trick users
  • Buy compromised accounts to look legitimate
  • Use campaigns to gain traction in a short period of time
  • Use bots to amplify the noise
• Social media makes it easier and faster to spread spam.
Related Work
• Detection at the tweet level
  • Focus on the content of tweets
  • E.g., spam words? Overuse of hashtags, URLs, mentions, …?
• Detection at the account level
  • Focus on the characteristics of spam accounts
  • E.g., age of the account? # of followers? # of followees? …
Challenges
• Large amount of unlabeled data
  • Labeling is time- and labor-intensive
• Feature selection may cause model overfitting
• Twitter spam drift
  • Spamming behavior changes over time, so the performance of existing machine-learning-based classifiers degrades.
Research Questions
• Question 1: Can we find an unsupervised way to learn from the unlabeled data and later apply what we have learnt to labeled data?
  • Will this approach outperform the hand-labeling process?
• Question 2: Can we find a more systematic way to reduce the feature dimensions than manual feature engineering?
Stage 1: Self-taught Learning From Unlabeled Data
Training data (without labels) → one-to-N encoding → max-min normalization → sparse auto-encoder → trained parameter set
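The two preprocessing steps in this stage can be illustrated with a minimal sketch. It assumes that one-to-N encoding means one-hot encoding of a categorical feature and that max-min normalization rescales each numeric feature to [0, 1]; the function names and toy values are hypothetical and do not come from the slides.

```python
def one_to_n_encode(values):
    """One-to-N (one-hot) encode a list of categorical values."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0.0] * len(categories)
        row[index[v]] = 1.0
        encoded.append(row)
    return categories, encoded


def max_min_normalize(column):
    """Rescale a numeric column to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical toy data: one categorical feature and one numeric feature.
categories, client_onehot = one_to_n_encode(["web", "mobile", "api", "web"])
followers_scaled = max_min_normalize([10, 250, 40, 1000])
```

After both steps every feature lies in [0, 1], which suits an auto-encoder whose output units are bounded.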
Stage 2: Soft-max Classifier Training
Preprocessed labeled training data → sparse auto-encoder → soft-max regression → trained parameter set
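Soft-max regression turns class scores computed from the auto-encoder's hidden features into class probabilities. A minimal sketch of the soft-max function itself, assuming two classes (spam, non-spam); the scores below are made-up values, not outputs of the trained model.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the classes [spam, non-spam].
probs = softmax([2.0, 0.5])
predicted = "spam" if probs[0] > probs[1] else "non-spam"
```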
Stage 3: Classification
Preprocessed test data → sparse auto-encoder → soft-max regression → spam / non-spam
Self-taught Learning
• Assumption:
  • A single unlabeled record is not very informative
  • A large number of unlabeled records may show a certain pattern
• Goal:
  • Find an effective model to reveal this pattern (if it exists)
  • We choose the sparse auto-encoder for its good performance and simplicity
Auto-encoder
• A special neural network whose output is (almost) identical to its input.
• A compression tool
  • The hidden layer is considered the compressed representation of the input.
Auto-encoder
• Model parameters:
  $(W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$
• Activation function:
  $a_1^{(2)} = f(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)})$
  $a_2^{(2)} = f(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)})$
  $a_3^{(2)} = f(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)})$
• Hypothesis $h_{W,b}(x)$:
  $h_{W,b}(x) = a_1^{(3)} = f(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)}) = \hat{x}$
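The forward pass above can be sketched for the small 3-3-1 example network. This is an illustration only: it assumes that f is the sigmoid function (the slides do not name the activation), and the weights are arbitrary made-up numbers.

```python
import math


def sigmoid(z):
    """Assumed activation f; the slides do not say which f is used."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(W, b, inputs):
    """Compute a_i = f(sum_j W[i][j] * inputs[j] + b[i]) for every unit i."""
    return [
        sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, inputs)) + b_i)
        for row, b_i in zip(W, b)
    ]


# Hypothetical parameters: 3 inputs -> 3 hidden units -> 1 output unit.
W1 = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1], [0.2, 0.2, 0.2]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.5, -0.5, 0.25]]
b2 = [0.05]

x = [1.0, 0.0, 1.0]
a2 = layer_forward(W1, b1, x)   # hidden activations a^(2)
a3 = layer_forward(W2, b2, a2)  # output h_{W,b}(x) = a^(3)
```

A full auto-encoder would have as many output units as inputs; the single output unit here only mirrors the equations on the slide.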
Sparse Auto-encoder
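• Sparsity parameter
  • Definition: a constraint imposed on the hidden layer
  • Goal: ensure that patterns are still revealed even if the hidden layer is large
  • Average activation of hidden unit $j$: $\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} \left[ a_j^{(2)}(x^{(i)}) \right]$
• Penalty term
  • Constraint: $\hat{\rho}_j = \rho$ (with $\rho = 0.05$)
  • Kullback-Leibler (KL) divergence: $\sum_{j=1}^{k} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{k} \left[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \right]$
  • $\sum_{j=1}^{k} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = 0$ if $\hat{\rho}_j = \rho$ for every $j$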
Cost Function
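$J(W,b) = \frac{1}{m} \sum_{i=1}^{m} \| x_i - \hat{x}_i \|^2 + \frac{\lambda}{2} \left( \sum_{k,n} W^2 + \sum_{n,k} V^2 + \sum_{k} b_1^2 + \sum_{n} b_2^2 \right) + \beta \sum_{j=1}^{k} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$
• First term: average sum-of-squares error term
• Second term: weight decay term
• Third term: penalty term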
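To make the three terms concrete, here is a minimal sketch that evaluates them separately. It assumes that the hidden activations and reconstructions have already been computed; all names and toy numbers are hypothetical, and the default values of λ and β are placeholders, not the settings used in the experiments.

```python
import math


def kl_divergence(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))


def sparse_ae_cost(X, X_hat, hidden, weights, biases,
                   lam=1e-4, beta=3.0, rho=0.05):
    """Return (error, weight_decay, sparsity_penalty) for a batch.

    X, X_hat : lists of input / reconstructed vectors (m examples)
    hidden   : list of hidden activation vectors, one per example
    weights  : list of weight matrices; biases: list of bias vectors
    """
    m = len(X)
    # Average sum-of-squares reconstruction error.
    error = sum(
        sum((x - xh) ** 2 for x, xh in zip(xi, xhi))
        for xi, xhi in zip(X, X_hat)
    ) / m
    # Weight decay over all weights and biases, as in the formula above.
    squares = sum(w ** 2 for M in weights for row in M for w in row)
    squares += sum(b ** 2 for vec in biases for b in vec)
    decay = lam / 2 * squares
    # Sparsity penalty: compare each unit's average activation with rho.
    k = len(hidden[0])
    rho_hat = [sum(h[j] for h in hidden) / m for j in range(k)]
    penalty = beta * sum(kl_divergence(rho, r) for r in rho_hat)
    return error, decay, penalty
```

The total cost J(W, b) is the sum of the three returned values; keeping them separate makes it easier to see which term dominates during training.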
Cost Function
• Goal: minimize $J(W, b)$ as a function of $W$ and $b$
• Steps
  • Initialization
  • Update parameters with gradient descent:
    $W_{ij}^{(l)} = W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b)$
    $b_i^{(l)} = b_i^{(l)} - \alpha \frac{\partial}{\partial b_i^{(l)}} J(W, b)$
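The update rule can be illustrated on a toy problem. The sketch below applies the same step, parameter minus α times gradient, to a one-dimensional quadratic rather than to the auto-encoder cost; the function, learning rate, and iteration count are arbitrary choices for illustration.

```python
def gradient_descent(grad, theta, alpha=0.1, steps=100):
    """Repeat theta = theta - alpha * grad(theta) for a fixed number of steps."""
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta


# Toy cost J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_min = gradient_descent(lambda t: 2 * (t - 3), theta=0.0)
```

For the auto-encoder, the same loop runs over every weight and bias, with the gradients supplied by back-propagation.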
Back-propagation
• $\delta_i^{(n_l)}$: the "error term"
  • Measures how much the node is "responsible" for any error in the output
Back-propagation
1. Perform a feedforward pass, computing the activations for layers $L_2$, $L_3$, and so on up to the output layer $L_{n_l}$.
2. For each output unit $i$ in layer $n_l$ (the output layer), set
   • $\delta_i^{(n_l)} = -(y_i - a_i^{(n_l)}) \cdot f'(z_i^{(n_l)})$
3. For $l = n_l - 1, n_l - 2, n_l - 3, \ldots, 2$
   • For each node $i$ in layer $l$, set $\delta_i^{(l)} = \left( \sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \delta_j^{(l+1)} \right) f'(z_i^{(l)})$
4. Compute the partial derivatives
   • $\frac{\partial}{\partial W_{ij}^{(l)}} J(W, b; x, y) = a_j^{(l)} \delta_i^{(l+1)}$
   • $\frac{\partial}{\partial b_i^{(l)}} J(W, b; x, y) = \delta_i^{(l+1)}$
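The four steps can be sketched for a network with a single hidden layer. This is a minimal illustration, not the authors' implementation: it assumes a sigmoid activation, the per-example cost J(W, b; x, y) = ½‖y − h(x)‖², and made-up weights. The last lines compare one analytic gradient entry with a finite-difference estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(W, b, inputs):
    """Activations of one layer: f(W * inputs + b)."""
    return [sigmoid(sum(w * v for w, v in zip(row, inputs)) + bi)
            for row, bi in zip(W, b)]


def cost(W1, b1, W2, b2, x, y):
    """Assumed per-example cost: 1/2 * ||y - h(x)||^2."""
    a3 = layer(W2, b2, layer(W1, b1, x))
    return 0.5 * sum((yi - ai) ** 2 for yi, ai in zip(y, a3))


def backprop(W1, b1, W2, b2, x, y):
    """Return gradients (dW1, db1, dW2, db2) following steps 1-4."""
    a2 = layer(W1, b1, x)                       # step 1: feedforward
    a3 = layer(W2, b2, a2)
    # Step 2: output error terms; for the sigmoid, f'(z) = a * (1 - a).
    d3 = [-(yi - ai) * ai * (1 - ai) for yi, ai in zip(y, a3)]
    # Step 3: propagate the error terms back to the hidden layer.
    d2 = [sum(W2[j][i] * d3[j] for j in range(len(d3))) * a2[i] * (1 - a2[i])
          for i in range(len(a2))]
    # Step 4: dJ/dW_ij = a_j * delta_i and dJ/db_i = delta_i.
    dW2 = [[d * a for a in a2] for d in d3]
    dW1 = [[d * xj for xj in x] for d in d2]
    return dW1, d2, dW2, d3


# Hypothetical tiny auto-encoder: 2 inputs -> 2 hidden units -> 2 outputs.
W1 = [[0.1, -0.3], [0.4, 0.2]]
b1 = [0.0, 0.1]
W2 = [[0.3, -0.1], [-0.2, 0.5]]
b2 = [0.05, -0.05]
x = [0.8, 0.2]

dW1, db1, dW2, db2 = backprop(W1, b1, W2, b2, x, x)  # target y = input x

# Numerical check of one entry: perturb W1[0][1] and compare slopes.
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_minus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus[0][1] -= eps
numeric = (cost(W1_plus, b1, W2, b2, x, x)
           - cost(W1_minus, b1, W2, b2, x, x)) / (2 * eps)
```

If the analytic and numerical values disagree by more than a small tolerance, the usual culprits are a transposed weight index in step 3 or a wrong derivative of the activation.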
Fine-tuning
Preprocessed labeled training data → sparse auto-encoder → soft-max regression → trained parameter set
Dataset
• 1065 instances; each instance has 62 features.
• The 1065 instances are split into three groups:
  • Training without labels: 600 instances
  • Training with labels: 365 instances
  • Test with labels: 100 instances
• Comparison group: SVM, naïve Bayes, and random forests
  • Training with labels: 365 instances
  • Test with labels: 100 instances
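The split sizes can be reproduced with a short sketch. It assumes a random shuffle with a fixed seed, which the slides do not specify; the instance indices stand in for the real feature vectors.

```python
import random


def split_dataset(n_instances=1065, n_unlabeled=600, n_train=365,
                  n_test=100, seed=0):
    """Shuffle instance indices and cut them into the three groups."""
    assert n_unlabeled + n_train + n_test == n_instances
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    unlabeled = indices[:n_unlabeled]
    train = indices[n_unlabeled:n_unlabeled + n_train]
    test = indices[n_unlabeled + n_train:]
    return unlabeled, train, test


unlabeled_idx, train_idx, test_idx = split_dataset()
```

The comparison classifiers would use only `train_idx` and `test_idx`, since they cannot make use of the unlabeled group.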
Evaluation
• True Positive (TP): actual spammer, predicted spammer.
• True Negative (TN): actual non-spammer, predicted non-spammer.
• False Positive (FP): actual non-spammer, predicted spammer.
• False Negative (FN): actual spammer, predicted non-spammer.
Evaluation
Accuracy: the number of correctly classified instances over the total number of test instances.
Precision: $P = \frac{TP}{TP + FP} \times 100\%$
Recall: $R = \frac{TP}{TP + FN} \times 100\%$
F-Measure: $F = \frac{2 \cdot R \cdot P}{R + P}$
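These definitions translate directly into code. The sketch below is a minimal illustration with made-up confusion-matrix counts; it does not guard against empty denominators (for example, a classifier that never predicts spam).

```python
def evaluate(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F-measure as percentages."""
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return accuracy, precision, recall, f_measure


# Hypothetical counts, chosen only for illustration.
a, p, r, f = evaluate(tp=40, tn=45, fp=5, fn=10)
```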
Results
Rows: size of hidden layer L1. Columns: size of hidden layer L2.

L1 \ L2   15    20    25    30    35    40    45    50    55    Avg
55        86%   88%   85%   84%   87%   85%   83%   86%   86%   86%
50        84%   84%   86%   88%   86%   89%   87%   86%   88%   86%
45        85%   88%   87%   86%   85%   84%   88%   86%   86%   86%
40        88%   87%   85%   85%   85%   87%   87%   86%   89%   87%
35        87%   88%   87%   86%   87%   86%   86%   85%   86%   86%
30        85%   86%   89%   85%   85%   84%   83%   87%   88%   86%
25        87%   87%   88%   87%   85%   88%   85%   87%   88%   87%
20        84%   88%   83%   88%   86%   85%   88%   87%   86%   86%
15        83%   83%   83%   87%   85%   82%   85%   86%   85%   84%
Avg       85%   87%   86%   86%   86%   86%   86%   86%   87%
Results – Comparison with SVM
Model    TP   TN   FP   FN   A     P       R       F
SAE      34   52   3    11   86%   91.9%   75.6%   83.0%
Top 5    28   52   2    18   80%   93.3%   60.9%   73.7%
Top 10   27   52   3    18   79%   90%     60.0%   72.0%
Top 20   28   52   3    17   80%   90.3%   62.2%   73.7%
Top 30   29   52   3    16   81%   90.6%   64.4%   75.3%
Results – Comparison with Random Forests & Naïve Bayes
Model           TP   TN   FP   FN   A     P       R       F
SAE             34   52   3    11   86%   91.9%   75.6%   83.0%
Random Forest   32   52   3    13   84%   91%     71.0%   80.0%
Naïve Bayes     33   50   5    12   83%   86.8%   73.0%   79.5%
Conclusion
• Self-taught learning: combines a large amount of unlabeled data with a small amount of labeled data
• Sparse AE: reduces the feature dimensions
• Fine-tuning: improves the deep learning model to a large extent.
Limitation & Future Work
• The dataset we use is relatively small.
• We are still exploring new ways to apply this model to raw data.


Editor's Notes

  1. The key is to compute the partial derivatives.
  2. We conducted an experiment on this implementation but the result is not as expected.