BotBoosted
Explore and Understand
the Amount of Fabrication
of a Topic in Twitter
Brian Balagot
brian.balagot@gmail.com
https://www.linkedin.com/in/briancbalagot
https://github.com/brityboy
To explore and understand the
amount of fabrication on Twitter…
A tweet is fabricated if it was not made
by a bona-fide human or genuine news source
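Search: lose weight instantly
Labels shown on the example search results: REAL, REAL, REAL, REAL, FAKE, REAL
Topics shown, with their share of real tweets:
• “think invent try lose weight”: 100% real
• “learn shed pounds burn belly fat smoothie”: 50% real
Further labels shown on the slide: REAL, FAKE, REAL
The three components:
#1: Lightweight Supervised Classifier
#2: Dynamic Lightweight Topic Modeler
#3: Corpus Summarizer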
2: Dynamic Lightweight Topic Modeler
The GOAL: Heuristically determine the topic count of a corpus, on the fly
Tweet Corpus (labeled) → Tweet Tokenizer → Normalized Tokens → TFIDF Vectorizer → TFIDF Matrix → Incremental “Pareto” NMF → Topic Count, Document “Soft Clustering”
This is different from regular NMF (we have to specify the number of topics to extract when using NMF)
Incremental “Pareto” NMF output: Rich Content (the “MEAT”) and Unexplained Topics (“noise”)
Increment = 2
• 4 in the body, 2 in the tail
• 6 in the body, 2 in the tail
• 6 in the body, 4 in the tail
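The body/tail iterations on this slide suggest a stopping rule, sketched below in plain Python. This is only a guess at the heuristic, not the author's algorithm. It assumes the “body” is the smallest set of largest topics holding at least 80% of the tweets, that the “tail” is everything else, and that the search stops once adding topics no longer grows the body. The names `split_body_tail`, `choose_topic_count`, and `docs_per_topic_for` are hypothetical. In practice the callback would fit NMF with `k` topics and count tweets per dominant topic.

```python
from typing import Callable, Sequence, Tuple


def split_body_tail(docs_per_topic: Sequence[int],
                    body_share: float = 0.8) -> Tuple[int, int]:
    """Return (body, tail) topic counts.

    Assumption: the body is the smallest set of largest topics that
    together hold at least `body_share` of all documents.
    """
    total = sum(docs_per_topic)
    if total == 0:
        return 0, len(docs_per_topic)
    covered = 0
    body = 0
    for count in sorted(docs_per_topic, reverse=True):
        if covered >= body_share * total:
            break
        covered += count
        body += 1
    return body, len(docs_per_topic) - body


def choose_topic_count(docs_per_topic_for: Callable[[int], Sequence[int]],
                       start: int = 2,
                       increment: int = 2,
                       max_topics: int = 50) -> int:
    """Grow k by `increment` until the body stops growing.

    `docs_per_topic_for(k)` is a hypothetical callback that fits a
    k-topic model and returns the number of documents per topic.
    """
    best_body = 0
    k = start
    while k <= max_topics:
        body, _tail = split_body_tail(docs_per_topic_for(k))
        if body <= best_body:
            break  # the extra topics only landed in the tail
        best_body = body
        k += increment
    return best_body


# Made-up document counts that reproduce the slide's three iterations.
fits = {
    6: [30, 25, 16, 10, 10, 9],                 # 4 in the body, 2 in the tail
    8: [20, 15, 15, 12, 10, 10, 10, 8],         # 6 in the body, 2 in the tail
    10: [20, 18, 15, 12, 10, 7, 5, 5, 4, 4],    # 6 in the body, 4 in the tail
}
topic_count = choose_topic_count(fits.__getitem__, start=6, increment=2,
                                 max_topics=10)
print(topic_count)
```

Passing the model fit in as a callback keeps the stopping rule independent of the NMF library used.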
BotBoosted
% Share of Real/Fake Tweets per Topic for “Lose Weight Now”
On average, 34% of tweets are fake and 66% are real
We can see which conversations are boosted by bots
[Chart: % share of tweets per topic, split into REAL and FAKE]
Topics (top terms):
• fast, weight, lose, tips, smoothie
• fast, lose, weight, diet, days
• ways, lose, fast, weight, fat
• tips, fast, weight, lose, scientifically
• brian, flatt, lose, fast, ways
Example tweets shown on the chart:
• #Workout #Exercise Smoothie To Lose Weight https://...
• 6 Bedtime Habits To lose Weight Fast https://...
• #Zumba #dance to Lose #Weight by Brian Flatt #lose_weight_fast https://...
• Natural Ways to Lose #Weight Fast by Brian Flatt | #lose_weight https://...
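The per-topic percentages in this chart can be computed from (topic, label) pairs, as in the minimal sketch below. The function name `fake_share_per_topic` and the sample rows are hypothetical. The rows are invented for illustration and are not data from the deck.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def fake_share_per_topic(rows: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Map each topic to the percentage of its tweets labeled FAKE."""
    counts = defaultdict(Counter)
    for topic, label in rows:
        counts[topic][label] += 1
    return {
        topic: 100.0 * labels["FAKE"] / sum(labels.values())
        for topic, labels in counts.items()
    }


# Invented (topic, label) pairs, for illustration only.
rows = [
    ("smoothie", "REAL"), ("smoothie", "REAL"),
    ("smoothie", "FAKE"), ("smoothie", "REAL"),
    ("brian flatt", "FAKE"), ("brian flatt", "FAKE"),
    ("brian flatt", "REAL"), ("brian flatt", "FAKE"),
]
shares = fake_share_per_topic(rows)
for topic, share in shares.items():
    print(f"{topic}: {share:.0f}% fake, {100 - share:.0f}% real")
```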
Limitations & Future Work
• Twitter is very aggressive at suspending spammers (77% are suspended within 1 day of their 1st tweet), so BotBoosted detects Twitter’s false negatives
• Twitter’s FREE API does not return a statistically representative sample (try the Garden Hose API)
• Strengthen the model by continuously training it with Twitter’s false negatives
• Benchmark Incremental “Pareto” NMF vs HDP and LDA to further improve the heuristic algorithm
• Improve corpus summarization with graph theory via tweet “centrality” based on word co-occurrence
Thank You!
Major References:
• Aritter. (2016). Twitter NLP. GitHub repository. https://github.com/aritter/twitter_nlp
• Azab, A., Idrees, A., Mahmoud, M., Hefny, H. (Nov 1, 2016). Fake Account Detection in Twitter Based on Minimum Weighted Feature set. World Academy of Science, Engineering and Technology International Journal of Computer, Electrical, Automation, Control and Information Engineering. Vol. 10, No. 1, 2016.
• Cresci, S., Di Pietro, R., Petrocchi, M., Spognardi, A., Tesconi, M. (2015). Fame for Sale: efficient detection of fake Twitter followers. Retrieved from Cornell University Library. (arXiv:1509.04098)
• Joseph, K., Landwehr, P., Carley, K. (2014). Two 1%s don’t make a whole: Comparing simultaneous samples from Twitter’s Streaming API.
• Karambelkar, B. (2015, Jan 5). How to use Twitter’s Search REST API most effectively.
• Kharde, V., Sonawane, S. (2016). Sentiment Analysis of Twitter Data: A Survey of Techniques. International Journal of Computer Applications. Volume 139, No. 11, April 2016.
• Kontaxis, G., Polakis, I., Ioannidis, S., Markatos, E. (2011). Detecting Social Network Profile Cloning. Retrieved from SysSec Consortium. (n.d.).
• Mori, T., Kikuchi, M., Yoshida, K. (2001). Term Weighting Method based on Information Gain Ratio for Summarizing Documents retrieved by IR systems.
• Thomas, K., Grier, C., Paxson, V., Song, D. (2011). Suspended Accounts in Retrospect: An Analysis of Twitter Spam.
1: Lightweight Supervised Classifier
The GOAL: Classify a user, based on their most recent tweet, as real or fake
Features:
• Account History: followers_count, friends_count, total_tweets, total_likes
• Behavior Rate: likes_day, tweets_day, friends_day, likes_friends_day
• Relative_Volume: likes_friends, tweets_friends
Models (10FCV = 10-fold cross-validation):
• Account History → Random Forest and Behavior Rate → Random Forest (10FCV Acc: 97% and 95%, in the order listed on the slide)
• Account History + Behavior Rate + History Pred % + Rate Pred % → Random Forest (10FCV Acc: 98%)
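The slide names the Behavior Rate and Relative_Volume features but does not define them. The sketch below shows one plausible derivation from the Account History counts. Every formula here is an assumption: per-day rates divide by account age, and per-friend ratios divide by friends_count. The `age_days` field and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Account:
    followers_count: int
    friends_count: int
    total_tweets: int
    total_likes: int
    age_days: float  # hypothetical field: days since the account was created


def safe_div(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 instead of failing on a zero denominator."""
    return numerator / denominator if denominator else 0.0


def derived_features(account: Account) -> Dict[str, float]:
    """Compute rate and ratio features under the assumed definitions."""
    likes_friends = safe_div(account.total_likes, account.friends_count)
    return {
        # Behavior Rate (assumed: totals per day of account age)
        "likes_day": safe_div(account.total_likes, account.age_days),
        "tweets_day": safe_div(account.total_tweets, account.age_days),
        "friends_day": safe_div(account.friends_count, account.age_days),
        "likes_friends_day": safe_div(likes_friends, account.age_days),
        # Relative_Volume (assumed: totals per friend)
        "likes_friends": likes_friends,
        "tweets_friends": safe_div(account.total_tweets, account.friends_count),
    }


# Invented account, for illustration only.
features = derived_features(Account(followers_count=50, friends_count=200,
                                    total_tweets=1000, total_likes=400,
                                    age_days=100))
print(features)
```

The zero-denominator guard matters for brand-new accounts and for accounts that follow no one.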
3: Corpus Summarizer
The GOAL: extract the most important tweet that captures the subtopic
Tweet Corpus (labeled) → Tweet Tokenizer → Normalized Tokens → TFIDF Vectorizer → TFIDF Matrix → Incremental “Pareto” NMF → Topic Count, Document “Soft Clustering”
Bag of Words (Tweets × words) + Topic Label → Random Forest → Feature/Word Importance
Tweet Importance = Feature/Word Importance × TFIDF Matrix (Term Frequency × Inverse Document Frequency)
Selected tweet: the one with the Highest Total Word Importance
A word is important if it is helpful in “bucketing” tweets into topics
A tweet is important if it is made up of important words
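The scoring rule on this slide (Tweet Importance = Feature/Word Importance × TFIDF Matrix) can be sketched as a weighted sum per tweet, followed by picking the top-scoring tweet in each topic. In the deck the word importances come from a Random Forest. The function names and all numbers below are invented for illustration.

```python
from typing import Dict, List, Sequence


def tweet_importance(tfidf_rows: Sequence[Sequence[float]],
                     word_importance: Sequence[float]) -> List[float]:
    """Score each tweet as the dot product of its TFIDF row and the word importances."""
    return [
        sum(weight * importance
            for weight, importance in zip(row, word_importance))
        for row in tfidf_rows
    ]


def summary_tweet_per_topic(tweets: Sequence[str],
                            tfidf_rows: Sequence[Sequence[float]],
                            word_importance: Sequence[float],
                            topic_labels: Sequence[int]) -> Dict[int, str]:
    """Return, for each topic, the tweet with the highest total word importance."""
    scores = tweet_importance(tfidf_rows, word_importance)
    best: Dict[int, tuple] = {}
    for tweet, score, topic in zip(tweets, scores, topic_labels):
        if topic not in best or score > best[topic][0]:
            best[topic] = (score, tweet)
    return {topic: tweet for topic, (_score, tweet) in best.items()}


# Invented example; vocabulary order: lose, weight, smoothie, zumba.
word_importance = [0.1, 0.1, 0.5, 0.3]
tweets = ["smoothie to lose weight", "lose weight fast",
          "zumba to lose weight", "lose weight"]
tfidf_rows = [
    [0.4, 0.4, 0.8, 0.0],
    [0.7, 0.7, 0.0, 0.0],
    [0.3, 0.3, 0.0, 0.9],
    [0.5, 0.5, 0.0, 0.0],
]
topic_labels = [0, 0, 1, 1]
summary = summary_tweet_per_topic(tweets, tfidf_rows, word_importance,
                                  topic_labels)
print(summary)
```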
