Microposts2015 - Social Spam Detection on Twitter

Making the Most of Tweet-Inherent Features for
Social Spam Detection on Twitter
Bo Wang, Arkaitz Zubiaga, Maria Liakata and Rob Procter
Department of Computer Science
University of Warwick
18th May 2015

Social Spam on Twitter
Motivation
• Social spam is an important issue in social media services
such as Twitter, e.g.:
• Users inject tweets in trending topics.
• Users reply with promotional messages providing a link.
• We want to be able to identify these spam tweets in a
Twitter stream.

Why Did We Need to Identify Spam?
• We started tracking events via streaming API.
• They were often riddled with noisy tweets.

Example

Our Approach
• Detection of spammers: unsuitable, we couldn’t
aggregate a user’s data from a stream.
• Alternative solution: Determine if tweet is spam from its
inherent features.

Definitions
• Spam originally coined for unsolicited email.
• How to define spam for Twitter? (not easy!)
• Twitter has its own definition of spam, where a certain level of
advertisements is allowed:
• It refers to the user level rather than the tweet level, e.g.,
users who massively follow others.
• Harder to define a spam tweet than a spammer.

Our Definition
• Twitter spam: noisy content produced by users who
exhibit behaviour different from what the system is
intended for, with the goal of grabbing attention by
exploiting the social media service’s characteristics.

Spammer vs. Spam Detection
What Did Others Do?
• Most previous work focused on spammer detection (users).
• They used features which are not readily available in a
tweet:
• For example, historical user behaviour and network
features.
• Not feasible for our use.

Spammer vs. Spam Detection
What Do We Want To Do Instead?
• (Near) Real-time spam detection, limited to features
readily available in a stream of tweets.
• Contributions:
• Test on two existing datasets, adapted to our purposes.
• Definition of different feature sets.
• Compare different classification algorithms.
• Investigate the use of different tweet-inherent features.

Datasets
• We relied on two (spammer vs non-spammer) datasets:
• Social Honeypot (Lee et al., 2011 [1]): used social honeypots
to attract spammers.
• 1KS-10KN (Yang et al., 2011 [2]): harvested tweets
containing certain malicious URLs.
• From spammer datasets to our spam datasets: randomly select
one tweet from each spammer or legitimate user.
• Social Honeypot: 20,707 spam vs 19,249 non-spam (∼1:1).
• 1KS-10KN: 1,000 spam vs 9,828 non-spam (∼1:10).
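
The conversion from a user-level dataset to a tweet-level one can be sketched as below. This is a minimal illustration, not the authors' code: the `(user_id, text, is_spammer)` record layout, the function name and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict

def one_tweet_per_user(tweets, seed=0):
    """Pick one random tweet per user; the tweet inherits the user's label.

    `tweets` is an iterable of (user_id, text, is_spammer) tuples. This
    record layout is hypothetical, chosen only for the example.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user_id, text, is_spammer in tweets:
        by_user[user_id].append((text, is_spammer))
    # Iterate over sorted user ids so the sample is reproducible for a given seed.
    return [(user_id, *rng.choice(by_user[user_id])) for user_id in sorted(by_user)]
```

Each sampled tweet is labelled spam if its author was labelled a spammer, which is the assumption the Future Work slide flags as needing further study.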

Feature Engineering
User features
• Length of profile name
• Length of profile description
• Number of followings (FI)
• Number of followers (FE)
• Number of tweets posted
• Age of the user account, in hours (AU)
• Ratio of number of followings and followers (FE/FI)
• Reputation of the user (FE/(FI + FE))
• Following rate (FI/AU)
• Number of tweets posted per day
• Number of tweets posted per week

Content features
• Number of words
• Number of characters
• Number of white spaces
• Number of capitalization words
• Number of capitalization words per word
• Maximum word length
• Mean word length
• Number of exclamation marks
• Number of question marks
• Number of URL links
• Number of URL links per word
• Number of hashtags
• Number of hashtags per word
• Number of mentions
• Number of mentions per word
• Number of spam words
• Number of spam words per word
• Part of speech tags of every tweet

N-grams
• Uni + bi-gram or bi + tri-gram

Sentiment features
• Automatically created sentiment lexicons
• Manually created sentiment lexicons
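
A few of the user and content features listed above can be computed from a single tweet roughly as follows. This is a hedged sketch rather than the authors' implementation: the function names, the whitespace tokenisation, the URL pattern, the reading of "capitalization words" as words starting with an upper-case letter, and the 0.0 fallback for empty denominators are all assumptions.

```python
import re

URL_RE = re.compile(r"https?://\S+")  # assumed URL pattern

def user_features(followings, followers, account_age_hours):
    """Ratio-style user features built from FI, FE and AU."""
    fi, fe, au = followings, followers, account_age_hours
    return {
        "followers_per_following": fe / fi if fi else 0.0,   # FE/FI
        "reputation": fe / (fi + fe) if (fi + fe) else 0.0,  # FE/(FI + FE)
        "following_rate": fi / au if au else 0.0,            # FI/AU
    }

def content_features(text):
    """Surface counts over the tweet text, using whitespace tokenisation."""
    words = text.split()
    n = len(words) or 1  # guard against division by zero for empty tweets
    n_caps = sum(1 for w in words if w[0].isupper())
    n_urls = len(URL_RE.findall(text))
    n_hashtags = sum(1 for w in words if w.startswith("#"))
    n_mentions = sum(1 for w in words if w.startswith("@"))
    return {
        "n_words": len(words),
        "n_chars": len(text),
        "n_whitespace": sum(1 for c in text if c.isspace()),
        "n_cap_words": n_caps,
        "cap_words_per_word": n_caps / n,
        "max_word_len": max((len(w) for w in words), default=0),
        "mean_word_len": sum(len(w) for w in words) / n,
        "n_exclamation": text.count("!"),
        "n_question": text.count("?"),
        "n_urls": n_urls,
        "urls_per_word": n_urls / n,
        "n_hashtags": n_hashtags,
        "hashtags_per_word": n_hashtags / n,
        "n_mentions": n_mentions,
        "mentions_per_word": n_mentions / n,
    }
```

All of these values depend only on the tweet and the user object attached to it, so nothing has to be aggregated across a user's history.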

Evaluation
Experiment Settings
• 5 widely-used classification algorithms: Bernoulli Naive
Bayes, KNN, SVM, Decision Tree and Random Forests.
• Hyperparameters optimised on a subset of the dataset
separate from train/test sets.
• All 4 feature sets were combined.
• 10-fold cross-validation.
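
For reference, the fold splitting behind 10-fold cross-validation can be sketched in plain Python. The slides do not say whether the folds were shuffled or stratified, so the shuffled, unstratified split below is an assumption, as are the function name and the seed.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

Each sample appears in exactly one test fold, and a classifier is trained on the remaining nine folds each time.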

Evaluation
Selection of Classifier
                 1KS-10KN Dataset                 Social Honeypot Dataset
Classifier       Precision  Recall  F-measure     Precision  Recall  F-measure
Bernoulli NB     0.899      0.688   0.778         0.772      0.806   0.789
KNN              0.924      0.706   0.798         0.802      0.778   0.790
SVM              0.872      0.708   0.780         0.844      0.817   0.830
Decision Tree    0.788      0.782   0.784         0.914      0.916   0.915
Random Forest    0.993      0.716   0.831         0.941      0.950   0.946
• Random Forests outperform others in terms of
F1-measure and Precision.
• Better performance on Social Honeypot (due to its 1:1 class
ratio rather than 1:10?).
• Results only 4% below original papers, which require
historic user features.
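
The three measures reported in the table can be computed from predictions as in the sketch below. It assumes spam is treated as the positive class, which the slides do not state explicitly; the function name and the 0.0 fallback for empty denominators are also choices made for the example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

A classifier with high precision but lower recall, like Random Forest on 1KS-10KN, rarely flags legitimate tweets but misses a share of the spam.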

Evaluation
Evaluation of Features (w/ Random Forests)
                          1KS-10KN Dataset                 Social Honeypot Dataset
Feature Set               Precision  Recall  F-measure     Precision  Recall  F-measure
User features (U)         0.895      0.709   0.791         0.938      0.940   0.940
Content features (C)      0.951      0.657   0.776         0.771      0.753   0.762
Uni + Bi-gram (Binary)    0.930      0.725   0.815         0.759      0.727   0.743
Uni + Bi-gram (Tf)        0.959      0.715   0.819         0.783      0.767   0.775
Uni + Bi-gram (Tfidf)     0.943      0.726   0.820         0.784      0.765   0.775
Bi + Tri-gram (Tfidf)     0.931      0.684   0.788         0.797      0.656   0.720
Sentiment features (S)    0.966      0.574   0.718         0.679      0.727   0.702
• Testing feature sets one by one:
• User features (U) most determinant for Social Honeypot.
• N-gram features best for 1KS-10KN.
• Potentially due to diff. dataset generation approaches?
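
The Binary, Tf and Tfidf n-gram weightings compared in the table can be illustrated with the sketch below. The slides do not give the exact formulas, so the lower-cased whitespace tokenisation and the idf variant log(N/df) are assumptions, as are the function names.

```python
import math
from collections import Counter

def ngrams(tokens, n_values=(1, 2)):
    """All n-grams of the given orders, joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in n_values
            for i in range(len(tokens) - n + 1)]

def vectorise(docs, weighting="tf", n_values=(1, 2)):
    """Return one {n-gram: weight} dict per document."""
    counts = [Counter(ngrams(d.lower().split(), n_values)) for d in docs]
    if weighting == "binary":
        return [{g: 1.0 for g in c} for c in counts]
    if weighting == "tf":
        return [{g: float(v) for g, v in c.items()} for c in counts]
    if weighting == "tfidf":
        # Document frequency: number of documents containing each n-gram.
        df = Counter(g for c in counts for g in c)
        n_docs = len(docs)
        return [{g: v * math.log(n_docs / df[g]) for g, v in c.items()}
                for c in counts]
    raise ValueError(f"unknown weighting: {weighting}")
```

Passing `n_values=(2, 3)` gives the bi + tri-gram variant.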

Evaluation
Evaluation of Features (w/ Random Forests)
                                  1KS-10KN Dataset                 Social Honeypot Dataset
Feature Set                       Precision  Recall  F-measure     Precision  Recall  F-measure
Single feature set                0.943      0.726   0.820         0.938      0.940   0.940
U + C                             0.974      0.708   0.819         0.938      0.949   0.943
U + Bi & Tri-gram (Tf)            0.972      0.745   0.843         0.937      0.949   0.943
U + S                             0.948      0.732   0.825         0.940      0.944   0.942
Uni & Bi-gram (Tf) + S            0.964      0.721   0.824         0.797      0.744   0.770
C + S                             0.970      0.649   0.777         0.778      0.762   0.770
C + Uni & Bi-gram (Tf)            0.968      0.717   0.823         0.783      0.757   0.770
U + C + Uni & Bi-gram (Tf)        0.985      0.727   0.835         0.934      0.949   0.941
U + C + S                         0.982      0.704   0.819         0.937      0.948   0.942
U + Uni & Bi-gram (Tf) + S        0.994      0.720   0.834         0.928      0.946   0.937
C + Uni & Bi-gram (Tf) + S        0.966      0.720   0.824         0.806      0.758   0.782
U + C + Uni & Bi-gram (Tf) + S    0.988      0.725   0.835         0.936      0.947   0.942
• However, when we combine feature sets:
• The same approach performs best (F1) for both datasets:
U + Bi & Tri-gram (Tf).
• Combining features helps us capture different types of spam
tweets.

Evaluation
Computational Efficiency
• Beyond accuracy, how can all these features be applied
efficiently in a stream?

Evaluation
Computational Efficiency
Feature set                               Comp. time (seconds) for 1k tweets
User features                             0.0057
N-gram                                    0.3965
Sentiment features                        20.9838
Number of spam words (NSW)                19.0111
Part-of-speech counts (POS)               0.6139
Content features including NSW and POS    20.2367
Content features without NSW              1.0448
Content features without POS              19.6165
• Tested on regular computer (2.8 GHz Intel Core i7 processor
and 16 GB memory).
• The features that performed best in combination (User
and N-grams) are those most efficiently calculated.
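
A per-1k-tweets cost like the ones in the table can be measured with a small harness such as the one below. How the authors actually timed their extractors is not stated, so the best-of-`repeats` wall-clock approach, the function name and the parameters are assumptions.

```python
import time

def seconds_per_1k(extract, tweets, repeats=3):
    """Best wall-clock time of `extract` over `tweets`, scaled to 1,000 tweets."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for tweet in tweets:
            extract(tweet)
        best = min(best, time.perf_counter() - start)
    return best * 1000 / len(tweets)
```

Here `extract` stands for any single feature extractor that takes one tweet; comparing its cost with the tweet arrival rate indicates whether it can keep up with a stream.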

Conclusion
• Random Forests were found to be the most accurate
classifier.
• Comparable performance to previous work (-4%) while
limiting features to those in a tweet.
• The use of multiple feature sets increases the possibility
to capture different spam types, and makes it more
difficult for spammers to evade.
• Different feature sets perform best on each dataset when used
separately, but the same combination performs best on both.

Future Work
• Spam corpus constructed by picking tweets from
spammers.
• Need to study if legitimate users are also likely to post spam
tweets, and how it could affect the results.
• A more recent, manually labelled spam/non-spam
dataset.
• Feasibility of cross-dataset spam classification?

That’s it!
• Any Questions?

[1] K. Lee, B. D. Eoff, and J. Caverlee.
Seven months with the devils: A long-term study of content
polluters on Twitter.
In L. A. Adamic, R. A. Baeza-Yates, and S. Counts, editors,
ICWSM. The AAAI Press, 2011.
[2] C. Yang, R. C. Harkreader, and G. Gu.
Die free or live hard? Empirical evaluation and new design for
fighting evolving Twitter spammers.
In Proceedings of the 14th International Conference on Recent
Advances in Intrusion Detection, RAID’11, pages 318–337,
Berlin, Heidelberg, 2011. Springer-Verlag.
