The document describes 3 methods for analyzing fabrication on Twitter:
1. A lightweight supervised classifier that classifies tweets as real or fake based on the user's account history and behavior rates.
2. A dynamic lightweight topic modeler that uses incremental Pareto NMF to determine topics in a corpus and the percentage of real and fake tweets per topic.
3. A corpus summarizer that extracts the most important tweet to capture each subtopic by calculating word and tweet importance based on TF-IDF.
1. BotBoosted
Explore and Understand
the Amount of Fabrication
of a Topic in Twitter
Brian Balagot
brian.balagot@gmail.com
https://www.linkedin.com/in/briancbalagot
https://github.com/brityboy
2. To explore and understand the
amount of fabrication on Twitter…
A tweet is fabricated if it was not made
by a bona-fide human or genuine news source
3. To explore and understand the
amount of fabrication on Twitter…
A tweet is fabricated if it was not made
by a bona-fide human or genuine news source
Search: lose weight instantly
4. To explore and understand the
amount of fabrication on Twitter…
A tweet is fabricated if it was not made
by a bona-fide human or genuine news source
Search: lose weight instantly
5. To explore and understand the
amount of fabrication on Twitter…
A tweet is fabricated if it was not made
by a bona-fide human or genuine news source
REAL
REAL
REAL
REAL
FAKE
REAL
Search: lose weight instantly
6. To explore and understand the
amount of fabrication on Twitter…
A tweet is fabricated if it was not made
by a bona-fide human or genuine news source
REAL
REAL
REAL
REAL
FAKE
REAL
Search: lose weight instantly
#1: Lightweight
Supervised Classifier
7. To explore and understand the
amount of fabrication on Twitter…
A tweet is fabricated if it was not made
by a bona-fide human or genuine news source
REAL
REAL
REAL
REAL
FAKE
REAL
Search: lose weight instantly
“think invent
try lose
weight”
100% real
“learn shed
pounds burn
belly fat
smoothie”
50% real
#1: Lightweight
Supervised Classifier
8. To explore and understand the
amount of fabrication on Twitter…
A tweet is fabricated if it was not made
by a bona-fide human or genuine news source
REAL
REAL
REAL
REAL
FAKE
REAL
Search: lose weight instantly
“think invent
try lose
weight”
100% real
“learn shed
pounds burn
belly fat
smoothie”
50% real
#1: Lightweight
Supervised Classifier
#2: Dynamic Lightweight
Topic Modeler
9. To explore and understand the
amount of fabrication on Twitter…
A tweet is fabricated if it was not made
by a bona-fide human or genuine news source
REAL
REAL
REAL
REAL
FAKE
REAL
Search: lose weight instantly
“think invent
try lose
weight”
100% real
“learn shed
pounds burn
belly fat
smoothie”
50% real
REAL
FAKE
REAL
#1: Lightweight
Supervised Classifier
#2: Dynamic Lightweight
Topic Modeler
10. To explore and understand the
amount of fabrication on Twitter…
A tweet is fabricated if it was not made
by a bona-fide human or genuine news source
REAL
REAL
REAL
REAL
FAKE
REAL
Search: lose weight instantly
“think invent
try lose
weight”
100% real
“learn shed
pounds burn
belly fat
smoothie”
50% real
REAL
FAKE
REAL
#1: Lightweight
Supervised Classifier
#2: Dynamic Lightweight
Topic Modeler
#3: Corpus Summarizer
11. 2: Dynamic Lightweight Topic Modeler
Tweet
Corpus
(labeled)
Tweet
Tokenizer
Normalized
Tokens
TFIDF
Vectorizer
TFIDF
Matrix
The GOAL: Heuristically determine the topic count of a corpus, on the fly
12. 2: Dynamic Lightweight Topic Modeler
Tweet
Corpus
(labeled)
Tweet
Tokenizer
Normalized
Tokens
TFIDF
Vectorizer
TFIDF
Matrix
Incremental
“Pareto”
NMF
The GOAL: Heuristically determine the topic count of a corpus, on the fly
This is different from Regular NMF
(we have to specify the number
of topics to extract when using NMF)
13. 2: Dynamic Lightweight Topic Modeler
Tweet
Corpus
(labeled)
Tweet
Tokenizer
Normalized
Tokens
TFIDF
Vectorizer
TFIDF
Matrix
Incremental
“Pareto”
NMF
Unexplained
Topics
(“noise”)
Rich Content
(the “MEAT”)
The GOAL: Heuristically determine the topic count of a corpus, on the fly
14. 2: Dynamic Lightweight Topic Modeler
Tweet
Corpus
(labeled)
Tweet
Tokenizer
Normalized
Tokens
TFIDF
Vectorizer
TFIDF
Matrix
Incremental
“Pareto”
NMF
Unexplained
Topics
(“noise”)
Rich Content
(the “MEAT”)
The GOAL: Heuristically determine the topic count of a corpus, on the fly
4 in the body
2 in the tail
Increment = 2
15. 2: Dynamic Lightweight Topic Modeler
Tweet
Corpus
(labeled)
Tweet
Tokenizer
Normalized
Tokens
TFIDF
Vectorizer
TFIDF
Matrix
Incremental
“Pareto”
NMF
Unexplained
Topics
(“noise”)
Rich Content
(the “MEAT”)
The GOAL: Heuristically determine the topic count of a corpus, on the fly
4 in the body
2 in the tail
6 in the body
2 in the tail
Increment = 2
16. 2: Dynamic Lightweight Topic Modeler
Tweet
Corpus
(labeled)
Tweet
Tokenizer
Normalized
Tokens
TFIDF
Vectorizer
TFIDF
Matrix
Incremental
“Pareto”
NMF
Unexplained
Topics
(“noise”)
Rich Content
(the “MEAT”)
The GOAL: Heuristically determine the topic count of a corpus, on the fly
4 in the body
2 in the tail
6 in the body
2 in the tail
6 in the body
4 in the tail
Increment = 2
17. 2: Dynamic Lightweight Topic Modeler
Tweet
Corpus
(labeled)
Tweet
Tokenizer
Normalized
Tokens
TFIDF
Vectorizer
TFIDF
Matrix
Topic Count,
Document
“Soft Clustering”
Incremental
“Pareto”
NMF
Unexplained
Topics
(“noise”)
Rich Content
(the “MEAT”)
The GOAL: Heuristically determine the topic count of a corpus, on the fly
4 in the body
2 in the tail
6 in the body
2 in the tail
6 in the body
4 in the tail
Increment = 2
18. BotBoosted
% Share of Real/Fake Tweets per Topic for “Lose Weight Now”
On Average, 34% of Tweets are Fake and 66% are Real
fast, weight,
lose, tips,
smoothie
%ShareofTweets
fast, lose,
weight, diet,
days
ways, lose,
fast, weight,
fat
tips, fast,
weight, lose,
scientifically
brian, flatt,
lose, fast,
ways
#Workout #Exercise Smoothie
To Lose Weight
https://...
6 Bedtime Habits
To lose Weight Fast
https://...
#Zumba #dance to Lose
#Weight by Brian Flatt
#lose_weight_fast https://...
Natural Ways to Lose #Weight
Fast by Brian Flatt |
#lose_weight https://...
REAL
FAKE
REAL
FAKE
19. BotBoosted
% Share of Real/Fake Tweets per Topic for “Lose Weight Now”
On Average, 34% of Tweets are Fake and 66% are Real
%ShareofTweets
fast, lose,
weight, diet,
days
ways, lose,
fast, weight,
fat
tips, fast,
weight, lose,
scientifically
brian, flatt,
lose, fast,
ways
We can see which conversations are boosted by bots
fast, weight,
lose, tips,
smoothie
#Workout #Exercise Smoothie
To Lose Weight
https://...
6 Bedtime Habits
To lose Weight Fast
https://...
#Zumba #dance to Lose
#Weight by Brian Flatt
#lose_weight_fast https://...
Natural Ways to Lose #Weight
Fast by Brian Flatt |
#lose_weight https://...
REAL
FAKE
REAL
FAKE
20. Limitations & Future Work
• Twitter is very aggressive at suspending
spammers (77% suspended 1 day after 1st tweet)
(BotBoosted detects Twitter’s False Negatives)
• Twitter’s FREE API is not a statistically
representative sample (try the Garden Hose API)
• Strengthen the model by continuously training it
with twitter’s false negatives
• Benchmark Incremental “Pareto” NMF vs HDP and
LDA to further improve the heuristic algorithm
• Improve corpus summarization with graph theory
via tweet “centrality” based on word co-
occurrence
21. BotBoosted
Explore and Understand
the Amount of Fabrication
of a Topic in Twitter
Brian Balagot
brian.balagot@gmail.com
https://www.linkedin.com/in/briancbalagot
https://github.com/brityboy
Thank
You!Major References:
• Aritter. Twitter NLP. (2016). Github repository https://github.com/aritter/twitter_nlp
• Azab, A., Idrees, A., Mahmoud, M., Hefny, H. (Nov 1, 2016). Fake Account Detection in Twitter Based on Minimum Weighted
Feature set. World Academy of Science, Engineering and Technology International Journal of Computer, Electrical, Automation,
Control and Information Engineering Vol:10, No:1, 2016.
• Cresci, S., Di Pietro, R., Petrocchi, M., Spognardi, A., Tesconi, M. (2015). Fame for Sale: efficient
detection of fake Twitter followers. Retrieved from Cornell University Library. (arXiv:1509.04098)
• Joseph, K., Landwehr, P., Carley, K. (2014). Two 1%s don’t make a whole: Comparing simultaneous samples from Twitter’s Streaming API.
• Karambelkar, B. (2015, Jan 5). How to use Twitter’s Search REST API most effectively.
Kharde, V., Sonawane, S. (2016). Sentiment Analysis of Twitter Data: A Survey of Techniques. International Journal of Computer
Applications. Volum 139, No 11, April 2016.
• Kontaxis, G., Polakis, I., Ioannidis, S., Markatos, E. (2011). Detecting Social Network Profile Cloning.
Retrieved from SysSec Consortium. (n.d.).
• Mori, T., Kikuchi, M., Yoshia K. (2001). Term Weighting Method based on Information Gain Ratio for Summarizing
Documents retrieved by IR systems.
• Thomas, K., Grier, C., Paxson, V., Song, D. (2011). Suspended Accounts in Retrospect: An Analysis of Twitter Spam.
22. 1: Lightweight Supervised Classifier
The GOAL: Classify a user, based on their most recent tweet, as real or fake
Relative_Volume:
likes_friends,
tweets_friends
Account
History
Random
Forest
Behavior Rate
Random
Forest
10FCV
Acc: 97%
10FCV
Acc: 95%
Account History
Behavior Rate
History Pred %
Rate Pred %
Random
Forest
10FCV
Acc: 98%
Account History: followers_count,
friends_count, total_tweets,
total_likes
Behavior Rate: likes_day,
tweets_day, friends_day,
likes_friends_day
23. 3: Corpus Summarizer
The GOAL: extract the most important tweet that captures the subtopic
Tweet
Corpus
(labeled)
Tweet
Tokenizer
Normalized
Tokens
TFIDF
Vectorizer
TFIDF
Matrix
Incremental
“Pareto”
NMF
Topic Count,
Document
“Soft Clustering”
Term
Frequency
Inverse
Document
Frequency
Word
Importance
= X
Highest Total Word
Importance
Tweet
Importance
=
Feature/Word
Importance
X
TFIDF Matrix
Topic
Label
T
w
e
et
s
Bag of Words
Feature/Word Importance
Random Forest
A word is important if it is helpful in
“bucketing” tweets into topics
A tweet is important if it is made up of important words