SlideShare a Scribd company logo
1 of 21
Download to read offline
Making the Most of Tweet-Inherent Features for
Social Spam Detection on Twitter
Bo Wang, Arkaitz Zubiaga, Maria Liakata and Rob Procter
Department of Computer Science
University of Warwick
18th May 2015
Social Spam on Twitter
Motivation
• Social spam is an important issue in social media services
such as Twitter, e.g.:
• Users inject tweets in trending topics.
• Users reply with promotional messages providing a link.
• We want to be able to identify these spam tweets in a
Twitter stream.
Social Spam on Twitter
How Did we Feel the Need to Identify Spam?
• We started tracking events via streaming API.
• They were often riddled with noisy tweets.
Social Spam on Twitter
Example
Social Spam on Twitter
Our Approach
• Detection of spammers: unsuitable, we couldn’t
aggregate a user’s data from a stream.
• Alternative solution: Determine if tweet is spam from its
inherent features.
Social Spam on Twitter
Definitions
• Spam originally coined for unsolicited email.
• How to define spam for Twitter? (not easy!)
• Twitter has own definition of spam, where certain level of
advertisements is allowed:
• It rather refers to the user level rather than tweet level, e.g.,
users who massively follow others.
• Harder to define a spam than a spammer.
Social Spam on Twitter
Our Definition
• Twitter spam: noisy content produced by users who
express a different behaviour from what the system is
intended for, and has the goal of grabbing attention by
exploiting the social media service’s characteristics.
Spammer vs. Spam Detection
What Did Others Do?
• Most previous work focused on spammer detection (users).
• They used features which are not readily available in a
tweet:
• For example, historical user behaviour and network
features.
• Not feasible for our use.
Spammer vs. Spam Detection
What Do We Want To Do Instead?
• (Near) Real-time spam detection, limited to features
readily available in a stream of tweets.
• Contributions:
• Test on two existing datasets, adapted to our purposes.
• Definition of different feature sets.
• Compare different classification algorithms.
• Investigate the use of different tweet-inherent features.
Datasets
• We relied on two (spammer vs non-spammer) datasets:
• Social Honeypot (Lee et al., 2011 [1]): used social honeypots
to attract spammers.
• 1KS-10KN (Yang et al., 2011 [2]): harvested tweets
containing certain malicious URLs.
• Spammer dataset to our spam dataset: Randomly select
one tweet from each spammer or legitimate user.
• Social Honeypot: 20,707 spam vs 19,249 non-spam (∼1:1).
• 1KS-10KN: 1,000 spam vs 9,828 non-spam (∼1:10).
Feature Engineering
User features Content features
Length of profile name Number of words
Length of profile description Number of characters
Number of followings (FI) Number of white spaces
Number of followers (FE) Number of capitalization words
Number of tweets posted Number of capitalization words per word
Age of the user account, in hours (AU) Maximum word length
Ratio of number of followings and followers (FE/FI) Mean word length
Reputation of the user (FE/(FI + FE)) Number of exclamation marks
Following rate (FI/AU) Number of question marks
Number of tweets posted per day Number of URL links
Number of tweets posted per week Number of URL links per word
N-grams Number of hashtags
Uni + bi-gram or bi + tri-gram Number of hashtags per word
Number of mentions
Sentiment features Number of mentions per word
Automatically created sentiment lexicons Number of spam words
Manually created sentiment lexicons Number of spam words per word
Part of speech tags of every tweet
Evaluation
Experiment Settings
• 5 widely-used classification algorithms: Bernoulli Naive
Bayes, KNN, SVM, Decision Tree and Random Forests.
• Hyperparameters optimised from a subset of the dataset
separate from train/test sets.
• All 4 feature sets were combined.
• 10-fold cross-validation.
Evaluation
Selection of Classifier
Classifier
1KS-10KN Dataset Social Honeypot Dataset
Precision Recall F-measure Precision Recall F1-measure
Bernoulli NB 0.899 0.688 0.778 0.772 0.806 0.789
KNN 0.924 0.706 0.798 0.802 0.778 0.790
SVM 0.872 0.708 0.780 0.844 0.817 0.830
Decision Tree 0.788 0.782 0.784 0.914 0.916 0.915
Random Forest 0.993 0.716 0.831 0.941 0.950 0.946
• Random Forests outperform others in terms of
F1-measure and Precision.
• Better performance on Social Honeypot (1:1 ratio rather
than 1:10?).
• Results only 4% below original papers, which require
historic user features.
Evaluation
Evaluation of Features (w/ Random Forests)
Feature Set
1KS-10KN Dataset Social Honeypot Dataset
Precision Recall F-measure Precision Recall F-measure
User features (U) 0.895 0.709 0.791 0.938 0.940 0.940
Content features (C) 0.951 0.657 0.776 0.771 0.753 0.762
Uni + Bi-gram (Binary) 0.930 0.725 0.815 0.759 0.727 0.743
Uni + Bi-gram (Tf) 0.959 0.715 0.819 0.783 0.767 0.775
Uni + Bi-gram (Tfidf) 0.943 0.726 0.820 0.784 0.765 0.775
Bi + Tri-gram (Tfidf) 0.931 0.684 0.788 0.797 0.656 0.720
Sentiment features (S) 0.966 0.574 0.718 0.679 0.727 0.702
• Testing feature sets one by one:
• User features (U) most determinant for Social Honeypot.
• N-gram features best for 1KS-10KN.
• Potentially due to diff. dataset generation approaches?
Evaluation
Evaluation of Features (w/ Random Forests)
Feature Set
1KS-10KN Dataset Social Honeypot Dataset
Precision Recall F-measure Precision Recall F-measure
Single feature set 0.943 0.726 0.820 0.938 0.940 0.940
U + C 0.974 0.708 0.819 0.938 0.949 0.943
U + Bi & Tri-gram (Tf) 0.972 0.745 0.843 0.937 0.949 0.943
U + S 0.948 0.732 0.825 0.940 0.944 0.942
Uni & Bi-gram (Tf) + S 0.964 0.721 0.824 0.797 0.744 0.770
C + S 0.970 0.649 0.777 0.778 0.762 0.770
C + Uni & Bi-gram (Tf) 0.968 0.717 0.823 0.783 0.757 0.770
U + C + Uni & Bi-gram (Tf) 0.985 0.727 0.835 0.934 0.949 0.941
U + C + S 0.982 0.704 0.819 0.937 0.948 0.942
U + Uni & Bi-gram (Tf) + S 0.994 0.720 0.834 0.928 0.946 0.937
C + Uni & Bi-gram (Tf) + S 0.966 0.720 0.824 0.806 0.758 0.782
U + C + Uni & Bi-gram (Tf) + S 0.988 0.725 0.835 0.936 0.947 0.942
• However, when we combine feature sets:
• The same approach performs best (F1) for both: U + Bi &
Tri-gram (Tf).
• Combining features helps us capture diff. types of spam
tweets.
Evaluation
Computational Efficiency
• Beyond accuracy, how can all these features be applied
efficiently in a stream?
Evaluation
Computational Efficiency
Feature set
Comp. time (seconds)
for 1k tweets
User features 0.0057
N-gram 0.3965
Sentiment features 20.9838
Number of spam words (NSW) 19.0111
Part-of-speech counts (POS) 0.6139
Content features including NSW and POS 20.2367
Content features without NSW 1.0448
Content features without POS 19.6165
• Tested on regular computer (2.8 GHz Intel Core i7 processor
and 16 GB memory).
• The features that performed best in combination (User
and N-grams) are those most efficiently calculated.
Conclusion
• Random Forests were found to be the most accurate
classifier.
• Comparable performance to previous work (-4%) while
limiting features to those in a tweet.
• The use of multiple feature sets increases the possibility
to capture different spam types, and makes it more
difficult for spammers to evade.
• Diff. features perform better when used separately, but
same features are useful when combined.
Future Work
• Spam corpus constructed by picking tweets from
spammers.
• Need to study if legitimate users also likely to post spam
tweets, and how it could affect the results.
• A more recent, manually labelled spam/non-spam
dataset.
• Feasibility of cross-dataset spam classification?
That’s it!
• Any Questions?
K. Lee, B. D. Eoff, and J. Caverlee.
Seven months with the devils: A long-term study of content
polluters on twitter.
In L. A. Adamic, R. A. Baeza-Yates, and S. Counts, editors,
ICWSM. The AAAI Press, 2011.
C. Yang, R. C. Harkreader, and G. Gu.
Die free or live hard? empirical evaluation and new design for
fighting evolving twitter spammers.
In Proceedings of the 14th International Conference on Recent
Advances in Intrusion Detection, RAID’11, pages 318–337,
Berlin, Heidelberg, 2011. Springer-Verlag.

More Related Content

What's hot

Detection of cyber-bullying
Detection of cyber-bullying Detection of cyber-bullying
Detection of cyber-bullying Ziar Khan
 
Sentiment analysis using ml
Sentiment analysis using mlSentiment analysis using ml
Sentiment analysis using mlPravin Katiyar
 
Twitter sentimentanalysis report
Twitter sentimentanalysis reportTwitter sentimentanalysis report
Twitter sentimentanalysis reportSavio Aberneithie
 
Fake news detection project
Fake news detection projectFake news detection project
Fake news detection projectHarshdaGhai
 
Sentiment Analaysis on Twitter
Sentiment Analaysis on TwitterSentiment Analaysis on Twitter
Sentiment Analaysis on TwitterNitish J Prabhu
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streamsKrish_ver2
 
Twitter sentiment analysis
Twitter sentiment analysisTwitter sentiment analysis
Twitter sentiment analysisSunil Kandari
 
Sentiment Analysis Using Hybrid Structure of Machine Learning Algorithms
Sentiment Analysis Using Hybrid Structure of Machine Learning AlgorithmsSentiment Analysis Using Hybrid Structure of Machine Learning Algorithms
Sentiment Analysis Using Hybrid Structure of Machine Learning AlgorithmsSangeeth Nagarajan
 
Web scraping
Web scrapingWeb scraping
Web scrapingSelecto
 
Emotion Detection in text
Emotion Detection in text Emotion Detection in text
Emotion Detection in text kashif kashif
 
Sentiment Analysis using Twitter Data
Sentiment Analysis using Twitter DataSentiment Analysis using Twitter Data
Sentiment Analysis using Twitter DataHari Prasad
 
Sentiment analysis of twitter data
Sentiment analysis of twitter dataSentiment analysis of twitter data
Sentiment analysis of twitter dataBhagyashree Deokar
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSumit Raj
 
Detecting Fake News Through NLP
Detecting Fake News Through NLPDetecting Fake News Through NLP
Detecting Fake News Through NLPSakha Global
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in TwitterAyushi Dalmia
 

What's hot (20)

Detection of cyber-bullying
Detection of cyber-bullying Detection of cyber-bullying
Detection of cyber-bullying
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Sentiment analysis using ml
Sentiment analysis using mlSentiment analysis using ml
Sentiment analysis using ml
 
Twitter sentimentanalysis report
Twitter sentimentanalysis reportTwitter sentimentanalysis report
Twitter sentimentanalysis report
 
Captcha
CaptchaCaptcha
Captcha
 
Fake news detection project
Fake news detection projectFake news detection project
Fake news detection project
 
Sentiment Analaysis on Twitter
Sentiment Analaysis on TwitterSentiment Analaysis on Twitter
Sentiment Analaysis on Twitter
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
 
Twitter sentiment analysis
Twitter sentiment analysisTwitter sentiment analysis
Twitter sentiment analysis
 
Sentiment Analysis Using Hybrid Structure of Machine Learning Algorithms
Sentiment Analysis Using Hybrid Structure of Machine Learning AlgorithmsSentiment Analysis Using Hybrid Structure of Machine Learning Algorithms
Sentiment Analysis Using Hybrid Structure of Machine Learning Algorithms
 
Web scraping
Web scrapingWeb scraping
Web scraping
 
Emotion Detection in text
Emotion Detection in text Emotion Detection in text
Emotion Detection in text
 
Sentiment Analysis using Twitter Data
Sentiment Analysis using Twitter DataSentiment Analysis using Twitter Data
Sentiment Analysis using Twitter Data
 
Sentiment analysis of twitter data
Sentiment analysis of twitter dataSentiment analysis of twitter data
Sentiment analysis of twitter data
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
 
Detecting Fake News Through NLP
Detecting Fake News Through NLPDetecting Fake News Through NLP
Detecting Fake News Through NLP
 
FAKE NEWS DETECTION (1).pptx
FAKE NEWS DETECTION (1).pptxFAKE NEWS DETECTION (1).pptx
FAKE NEWS DETECTION (1).pptx
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in Twitter
 
Content based filtering
Content based filteringContent based filtering
Content based filtering
 
Web Scraping Basics
Web Scraping BasicsWeb Scraping Basics
Web Scraping Basics
 

Viewers also liked

Detecting Spammers on Social Networks
Detecting Spammers on Social NetworksDetecting Spammers on Social Networks
Detecting Spammers on Social NetworksGianluca Stringhini
 
Twitter Content-based Spam Filtering - CISIS 2013
Twitter Content-based Spam Filtering - CISIS 2013Twitter Content-based Spam Filtering - CISIS 2013
Twitter Content-based Spam Filtering - CISIS 2013Carlos Laorden
 
12 ways trending twitter topics and hashtags may not be working for you
12 ways trending twitter topics and hashtags may not be working for you12 ways trending twitter topics and hashtags may not be working for you
12 ways trending twitter topics and hashtags may not be working for youOnline Promotion Success, Inc.
 
Graph-based KNN Algorithm for Spam SMS Detection
Graph-based KNN Algorithm for Spam SMS DetectionGraph-based KNN Algorithm for Spam SMS Detection
Graph-based KNN Algorithm for Spam SMS DetectionSOYEON KIM
 
Spamming and Spam Filtering
Spamming and Spam FilteringSpamming and Spam Filtering
Spamming and Spam FilteringiNazneen
 
E Mail & Spam Presentation
E Mail & Spam PresentationE Mail & Spam Presentation
E Mail & Spam Presentationnewsan2001
 
Enhancing Twitter spam discovery using cross account pattern matching.
Enhancing Twitter spam discovery using cross account pattern matching.Enhancing Twitter spam discovery using cross account pattern matching.
Enhancing Twitter spam discovery using cross account pattern matching.Ambarish Pande
 

Viewers also liked (12)

Spam Filtering
Spam FilteringSpam Filtering
Spam Filtering
 
Detecting Spammers on Social Networks
Detecting Spammers on Social NetworksDetecting Spammers on Social Networks
Detecting Spammers on Social Networks
 
Spam, security
Spam, securitySpam, security
Spam, security
 
Twitter Spam
Twitter SpamTwitter Spam
Twitter Spam
 
Twitter Content-based Spam Filtering - CISIS 2013
Twitter Content-based Spam Filtering - CISIS 2013Twitter Content-based Spam Filtering - CISIS 2013
Twitter Content-based Spam Filtering - CISIS 2013
 
12 ways trending twitter topics and hashtags may not be working for you
12 ways trending twitter topics and hashtags may not be working for you12 ways trending twitter topics and hashtags may not be working for you
12 ways trending twitter topics and hashtags may not be working for you
 
Graph-based KNN Algorithm for Spam SMS Detection
Graph-based KNN Algorithm for Spam SMS DetectionGraph-based KNN Algorithm for Spam SMS Detection
Graph-based KNN Algorithm for Spam SMS Detection
 
Bulk sms
Bulk smsBulk sms
Bulk sms
 
Spamming and Spam Filtering
Spamming and Spam FilteringSpamming and Spam Filtering
Spamming and Spam Filtering
 
E Mail & Spam Presentation
E Mail & Spam PresentationE Mail & Spam Presentation
E Mail & Spam Presentation
 
Enhancing Twitter spam discovery using cross account pattern matching.
Enhancing Twitter spam discovery using cross account pattern matching.Enhancing Twitter spam discovery using cross account pattern matching.
Enhancing Twitter spam discovery using cross account pattern matching.
 
Spam
SpamSpam
Spam
 

Similar to Microposts2015 - Social Spam Detection on Twitter

SplunkLive! New York Dec 2012 - SNAP Interactive
SplunkLive! New York Dec 2012 - SNAP InteractiveSplunkLive! New York Dec 2012 - SNAP Interactive
SplunkLive! New York Dec 2012 - SNAP InteractiveSplunk
 
Semantic Analysis to Compute Personality Traits from Social Media Posts
Semantic Analysis to Compute Personality Traits from Social Media PostsSemantic Analysis to Compute Personality Traits from Social Media Posts
Semantic Analysis to Compute Personality Traits from Social Media PostsGiulio Carducci
 
18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...
18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...
18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...LINE Corp.
 
Scalable Topic-Specific Influence Analysis on Microblogs
Scalable Topic-Specific Influence Analysis on MicroblogsScalable Topic-Specific Influence Analysis on Microblogs
Scalable Topic-Specific Influence Analysis on MicroblogsYuanyuan Tian
 
A flexible recommenndation system for Cable TV
A flexible recommenndation system for Cable TVA flexible recommenndation system for Cable TV
A flexible recommenndation system for Cable TVIntoTheMinds
 
A Flexible Recommendation System for Cable TV
A Flexible Recommendation System for Cable TVA Flexible Recommendation System for Cable TV
A Flexible Recommendation System for Cable TVFrancisco Couto
 
Feature Based Opinion Mining from Amazon Reviews
Feature Based Opinion Mining from Amazon ReviewsFeature Based Opinion Mining from Amazon Reviews
Feature Based Opinion Mining from Amazon ReviewsRavi Kiran Holur Vijay
 
Recommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic AlgorithmRecommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic AlgorithmVaibhav Varshney
 
Building Continuous Learning Systems
Building Continuous Learning SystemsBuilding Continuous Learning Systems
Building Continuous Learning SystemsAnuj Gupta
 
DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...
DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...
DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...yeung2000
 
Recommendation Systems
Recommendation SystemsRecommendation Systems
Recommendation SystemsRobin Reni
 
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...Jeffrey Nichols
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsMarina Santini
 
A Two Step Ranking Solution for Twitter User Engagement
A Two Step Ranking Solution for Twitter User Engagement�A Two Step Ranking Solution for Twitter User Engagement�
A Two Step Ranking Solution for Twitter User EngagementBehnoush Abdollahi
 
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...Real-time Classification of Malicious URLs on Twitter using Machine Activity ...
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...Pete Burnap
 
Building High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsBuilding High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsYalçın Yenigün
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringTao Xie
 
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)Amazon Web Services
 
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...Tuan Hoang
 

Similar to Microposts2015 - Social Spam Detection on Twitter (20)

SplunkLive! New York Dec 2012 - SNAP Interactive
SplunkLive! New York Dec 2012 - SNAP InteractiveSplunkLive! New York Dec 2012 - SNAP Interactive
SplunkLive! New York Dec 2012 - SNAP Interactive
 
Semantic Analysis to Compute Personality Traits from Social Media Posts
Semantic Analysis to Compute Personality Traits from Social Media PostsSemantic Analysis to Compute Personality Traits from Social Media Posts
Semantic Analysis to Compute Personality Traits from Social Media Posts
 
18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...
18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...
18.02.05_IAAI2018_Mobille Network Failure Event Detection and Forecasting wit...
 
Scalable Topic-Specific Influence Analysis on Microblogs
Scalable Topic-Specific Influence Analysis on MicroblogsScalable Topic-Specific Influence Analysis on Microblogs
Scalable Topic-Specific Influence Analysis on Microblogs
 
Fuzzy Rough Set Feature Selection to Enhance Phishing Attack Detection
Fuzzy Rough Set Feature Selection to Enhance Phishing Attack Detection Fuzzy Rough Set Feature Selection to Enhance Phishing Attack Detection
Fuzzy Rough Set Feature Selection to Enhance Phishing Attack Detection
 
A flexible recommenndation system for Cable TV
A flexible recommenndation system for Cable TVA flexible recommenndation system for Cable TV
A flexible recommenndation system for Cable TV
 
A Flexible Recommendation System for Cable TV
A Flexible Recommendation System for Cable TVA Flexible Recommendation System for Cable TV
A Flexible Recommendation System for Cable TV
 
Feature Based Opinion Mining from Amazon Reviews
Feature Based Opinion Mining from Amazon ReviewsFeature Based Opinion Mining from Amazon Reviews
Feature Based Opinion Mining from Amazon Reviews
 
Recommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic AlgorithmRecommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic Algorithm
 
Building Continuous Learning Systems
Building Continuous Learning SystemsBuilding Continuous Learning Systems
Building Continuous Learning Systems
 
DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...
DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...
DeepScan: Exploiting Deep Learning for Malicious Account Detection in Locatio...
 
Recommendation Systems
Recommendation SystemsRecommendation Systems
Recommendation Systems
 
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology Applications
 
A Two Step Ranking Solution for Twitter User Engagement
A Two Step Ranking Solution for Twitter User Engagement�A Two Step Ranking Solution for Twitter User Engagement�
A Two Step Ranking Solution for Twitter User Engagement
 
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...Real-time Classification of Malicious URLs on Twitter using Machine Activity ...
Real-time Classification of Malicious URLs on Twitter using Machine Activity ...
 
Building High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsBuilding High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning Applications
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software Engineering
 
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
 
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
ML&AI APPROACH TO USER UNDERSTANDING ECOSYSTEM AT VCCORP Applications to News...
 

More from azubiaga

Exploiting context for rumour detection in social media
Exploiting context for rumour detection in social mediaExploiting context for rumour detection in social media
Exploiting context for rumour detection in social mediaazubiaga
 
Crowdsourcing the Annotation of Rumourous Conversations in Social Media
Crowdsourcing the Annotation of Rumourous Conversations in Social MediaCrowdsourcing the Annotation of Rumourous Conversations in Social Media
Crowdsourcing the Annotation of Rumourous Conversations in Social Mediaazubiaga
 
Curating and Contextualizing Twitter Stories to Assist with Social Newsgathering
Curating and Contextualizing Twitter Stories to Assist with Social NewsgatheringCurating and Contextualizing Twitter Stories to Assist with Social Newsgathering
Curating and Contextualizing Twitter Stories to Assist with Social Newsgatheringazubiaga
 
Mining Twitter for Real-Time Trend and Information Discovery
Mining Twitter for Real-Time Trend and Information DiscoveryMining Twitter for Real-Time Trend and Information Discovery
Mining Twitter for Real-Time Trend and Information Discoveryazubiaga
 
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...azubiaga
 
Harnessing Folksonomies for Resource Classification
Harnessing Folksonomies for Resource ClassificationHarnessing Folksonomies for Resource Classification
Harnessing Folksonomies for Resource Classificationazubiaga
 
Clasificación de Páginas Web con Anotaciones Sociales
Clasificación de Páginas Web con Anotaciones SocialesClasificación de Páginas Web con Anotaciones Sociales
Clasificación de Páginas Web con Anotaciones Socialesazubiaga
 
Content-based Clustering for Tag Cloud Visualization
Content-based Clustering for Tag Cloud VisualizationContent-based Clustering for Tag Cloud Visualization
Content-based Clustering for Tag Cloud Visualizationazubiaga
 
Getting the Most Out of Social Annotations for Web Page Classification
Getting the Most Out of Social Annotations for Web Page ClassificationGetting the Most Out of Social Annotations for Web Page Classification
Getting the Most Out of Social Annotations for Web Page Classificationazubiaga
 
Enhancing Navigation on Wikipedia with Social Tags
Enhancing Navigation on Wikipedia with Social TagsEnhancing Navigation on Wikipedia with Social Tags
Enhancing Navigation on Wikipedia with Social Tagsazubiaga
 
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?azubiaga
 
Etiketa-lainoen ikuskera hobetzeko multzokatzea
Etiketa-lainoen ikuskera hobetzeko multzokatzeaEtiketa-lainoen ikuskera hobetzeko multzokatzea
Etiketa-lainoen ikuskera hobetzeko multzokatzeaazubiaga
 
Master thesis presentation
Master thesis presentationMaster thesis presentation
Master thesis presentationazubiaga
 
Tags vs Shelves: From Social Tagging to Social Classification
Tags vs Shelves: From Social Tagging to Social ClassificationTags vs Shelves: From Social Tagging to Social Classification
Tags vs Shelves: From Social Tagging to Social Classificationazubiaga
 

More from azubiaga (14)

Exploiting context for rumour detection in social media
Exploiting context for rumour detection in social mediaExploiting context for rumour detection in social media
Exploiting context for rumour detection in social media
 
Crowdsourcing the Annotation of Rumourous Conversations in Social Media
Crowdsourcing the Annotation of Rumourous Conversations in Social MediaCrowdsourcing the Annotation of Rumourous Conversations in Social Media
Crowdsourcing the Annotation of Rumourous Conversations in Social Media
 
Curating and Contextualizing Twitter Stories to Assist with Social Newsgathering
Curating and Contextualizing Twitter Stories to Assist with Social NewsgatheringCurating and Contextualizing Twitter Stories to Assist with Social Newsgathering
Curating and Contextualizing Twitter Stories to Assist with Social Newsgathering
 
Mining Twitter for Real-Time Trend and Information Discovery
Mining Twitter for Real-Time Trend and Information DiscoveryMining Twitter for Real-Time Trend and Information Discovery
Mining Twitter for Real-Time Trend and Information Discovery
 
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...
 
Harnessing Folksonomies for Resource Classification
Harnessing Folksonomies for Resource ClassificationHarnessing Folksonomies for Resource Classification
Harnessing Folksonomies for Resource Classification
 
Clasificación de Páginas Web con Anotaciones Sociales
Clasificación de Páginas Web con Anotaciones SocialesClasificación de Páginas Web con Anotaciones Sociales
Clasificación de Páginas Web con Anotaciones Sociales
 
Content-based Clustering for Tag Cloud Visualization
Content-based Clustering for Tag Cloud VisualizationContent-based Clustering for Tag Cloud Visualization
Content-based Clustering for Tag Cloud Visualization
 
Getting the Most Out of Social Annotations for Web Page Classification
Getting the Most Out of Social Annotations for Web Page ClassificationGetting the Most Out of Social Annotations for Web Page Classification
Getting the Most Out of Social Annotations for Web Page Classification
 
Enhancing Navigation on Wikipedia with Social Tags
Enhancing Navigation on Wikipedia with Social TagsEnhancing Navigation on Wikipedia with Social Tags
Enhancing Navigation on Wikipedia with Social Tags
 
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
 
Etiketa-lainoen ikuskera hobetzeko multzokatzea
Etiketa-lainoen ikuskera hobetzeko multzokatzeaEtiketa-lainoen ikuskera hobetzeko multzokatzea
Etiketa-lainoen ikuskera hobetzeko multzokatzea
 
Master thesis presentation
Master thesis presentationMaster thesis presentation
Master thesis presentation
 
Tags vs Shelves: From Social Tagging to Social Classification
Tags vs Shelves: From Social Tagging to Social ClassificationTags vs Shelves: From Social Tagging to Social Classification
Tags vs Shelves: From Social Tagging to Social Classification
 

Recently uploaded

Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 

Recently uploaded (20)

Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 

Microposts2015 - Social Spam Detection on Twitter

  • 1. Making the Most of Tweet-Inherent Features for Social Spam Detection on Twitter Bo Wang, Arkaitz Zubiaga, Maria Liakata and Rob Procter Department of Computer Science University of Warwick 18th May 2015
  • 2. Social Spam on Twitter Motivation • Social spam is an important issue in social media services such as Twitter, e.g.: • Users inject tweets in trending topics. • Users reply with promotional messages providing a link. • We want to be able to identify these spam tweets in a Twitter stream.
  • 3. Social Spam on Twitter How Did we Feel the Need to Identify Spam? • We started tracking events via streaming API. • They were often riddled with noisy tweets.
  • 4. Social Spam on Twitter Example
  • 5. Social Spam on Twitter Our Approach • Detection of spammers: unsuitable, we couldn’t aggregate a user’s data from a stream. • Alternative solution: Determine if tweet is spam from its inherent features.
  • 6. Social Spam on Twitter Definitions • Spam originally coined for unsolicited email. • How to define spam for Twitter? (not easy!) • Twitter has own definition of spam, where certain level of advertisements is allowed: • It rather refers to the user level rather than tweet level, e.g., users who massively follow others. • Harder to define a spam than a spammer.
  • 7. Social Spam on Twitter Our Definition • Twitter spam: noisy content produced by users who express a different behaviour from what the system is intended for, and has the goal of grabbing attention by exploiting the social media service’s characteristics.
  • 8. Spammer vs. Spam Detection What Did Others Do? • Most previous work focused on spammer detection (users). • They used features which are not readily available in a tweet: • For example, historical user behaviour and network features. • Not feasible for our use.
  • 9. Spammer vs. Spam Detection What Do We Want To Do Instead? • (Near) Real-time spam detection, limited to features readily available in a stream of tweets. • Contributions: • Test on two existing datasets, adapted to our purposes. • Definition of different feature sets. • Compare different classification algorithms. • Investigate the use of different tweet-inherent features.
  • 10. Datasets • We relied on two (spammer vs non-spammer) datasets: • Social Honeypot (Lee et al., 2011 [1]): used social honeypots to attract spammers. • 1KS-10KN (Yang et al., 2011 [2]): harvested tweets containing certain malicious URLs. • Spammer dataset to our spam dataset: Randomly select one tweet from each spammer or legitimate user. • Social Honeypot: 20,707 spam vs 19,249 non-spam (∼1:1). • 1KS-10KN: 1,000 spam vs 9,828 non-spam (∼1:10).
  • 11. Feature Engineering User features Content features Length of profile name Number of words Length of profile description Number of characters Number of followings (FI) Number of white spaces Number of followers (FE) Number of capitalization words Number of tweets posted Number of capitalization words per word Age of the user account, in hours (AU) Maximum word length Ratio of number of followings and followers (FE/FI) Mean word length Reputation of the user (FE/(FI + FE)) Number of exclamation marks Following rate (FI/AU) Number of question marks Number of tweets posted per day Number of URL links Number of tweets posted per week Number of URL links per word N-grams Number of hashtags Uni + bi-gram or bi + tri-gram Number of hashtags per word Number of mentions Sentiment features Number of mentions per word Automatically created sentiment lexicons Number of spam words Manually created sentiment lexicons Number of spam words per word Part of speech tags of every tweet
  • 12. Evaluation Experiment Settings • 5 widely-used classification algorithms: Bernoulli Naive Bayes, KNN, SVM, Decision Tree and Random Forests. • Hyperparameters optimised from a subset of the dataset separate from train/test sets. • All 4 feature sets were combined. • 10-fold cross-validation.
  • 13. Evaluation Selection of Classifier Classifier 1KS-10KN Dataset Social Honeypot Dataset Precision Recall F-measure Precision Recall F1-measure Bernoulli NB 0.899 0.688 0.778 0.772 0.806 0.789 KNN 0.924 0.706 0.798 0.802 0.778 0.790 SVM 0.872 0.708 0.780 0.844 0.817 0.830 Decision Tree 0.788 0.782 0.784 0.914 0.916 0.915 Random Forest 0.993 0.716 0.831 0.941 0.950 0.946 • Random Forests outperform others in terms of F1-measure and Precision. • Better performance on Social Honeypot (1:1 ratio rather than 1:10?). • Results only 4% below original papers, which require historic user features.
  • 14. Evaluation Evaluation of Features (w/ Random Forests) Feature Set 1KS-10KN Dataset Social Honeypot Dataset Precision Recall F-measure Precision Recall F-measure User features (U) 0.895 0.709 0.791 0.938 0.940 0.940 Content features (C) 0.951 0.657 0.776 0.771 0.753 0.762 Uni + Bi-gram (Binary) 0.930 0.725 0.815 0.759 0.727 0.743 Uni + Bi-gram (Tf) 0.959 0.715 0.819 0.783 0.767 0.775 Uni + Bi-gram (Tfidf) 0.943 0.726 0.820 0.784 0.765 0.775 Bi + Tri-gram (Tfidf) 0.931 0.684 0.788 0.797 0.656 0.720 Sentiment features (S) 0.966 0.574 0.718 0.679 0.727 0.702 • Testing feature sets one by one: • User features (U) most determinant for Social Honeypot. • N-gram features best for 1KS-10KN. • Potentially due to diff. dataset generation approaches?
  • 15. Evaluation Evaluation of Features (w/ Random Forests) Feature Set 1KS-10KN Dataset Social Honeypot Dataset Precision Recall F-measure Precision Recall F-measure Single feature set 0.943 0.726 0.820 0.938 0.940 0.940 U + C 0.974 0.708 0.819 0.938 0.949 0.943 U + Bi & Tri-gram (Tf) 0.972 0.745 0.843 0.937 0.949 0.943 U + S 0.948 0.732 0.825 0.940 0.944 0.942 Uni & Bi-gram (Tf) + S 0.964 0.721 0.824 0.797 0.744 0.770 C + S 0.970 0.649 0.777 0.778 0.762 0.770 C + Uni & Bi-gram (Tf) 0.968 0.717 0.823 0.783 0.757 0.770 U + C + Uni & Bi-gram (Tf) 0.985 0.727 0.835 0.934 0.949 0.941 U + C + S 0.982 0.704 0.819 0.937 0.948 0.942 U + Uni & Bi-gram (Tf) + S 0.994 0.720 0.834 0.928 0.946 0.937 C + Uni & Bi-gram (Tf) + S 0.966 0.720 0.824 0.806 0.758 0.782 U + C + Uni & Bi-gram (Tf) + S 0.988 0.725 0.835 0.936 0.947 0.942 • However, when we combine feature sets: • The same approach performs best (F1) for both: U + Bi & Tri-gram (Tf). • Combining features helps us capture diff. types of spam tweets.
  • 16. Evaluation Computational Efficiency • Beyond accuracy, how can all these features be applied efficiently in a stream?
  • 17. Evaluation Computational Efficiency Feature set Comp. time (seconds) for 1k tweets User features 0.0057 N-gram 0.3965 Sentiment features 20.9838 Number of spam words (NSW) 19.0111 Part-of-speech counts (POS) 0.6139 Content features including NSW and POS 20.2367 Content features without NSW 1.0448 Content features without POS 19.6165 • Tested on regular computer (2.8 GHz Intel Core i7 processor and 16 GB memory). • The features that performed best in combination (User and N-grams) are those most efficiently calculated.
  • 18. Conclusion • Random Forests were found to be the most accurate classifier. • Comparable performance to previous work (-4%) while limiting features to those in a tweet. • The use of multiple feature sets increases the possibility to capture different spam types, and makes it more difficult for spammers to evade. • Diff. features perform better when used separately, but same features are useful when combined.
  • 19. Future Work • Spam corpus constructed by picking tweets from spammers. • Need to study if legitimate users also likely to post spam tweets, and how it could affect the results. • A more recent, manually labelled spam/non-spam dataset. • Feasibility of cross-dataset spam classification?
  • 20. That’s it! • Any Questions?
  • 21. K. Lee, B. D. Eoff, and J. Caverlee. Seven months with the devils: A long-term study of content polluters on twitter. In L. A. Adamic, R. A. Baeza-Yates, and S. Counts, editors, ICWSM. The AAAI Press, 2011. C. Yang, R. C. Harkreader, and G. Gu. Die free or live hard? empirical evaluation and new design for fighting evolving twitter spammers. In Proceedings of the 14th International Conference on Recent Advances in Intrusion Detection, RAID’11, pages 318–337, Berlin, Heidelberg, 2011. Springer-Verlag.