Self Trending a Tweet
Cluster and Topic Analysis on Tweets
© Mor Krispil, 2019 1
Abstract
A Twitter trend dataset provides tweets retrieved per pre-defined “Trend”.
Can we recover those trends from the raw statuses alone? If so, we could
use the same technique to assign topics to any un-trended list of tweets!
On a tweet dataset curated from the top 10 Twitter trends, I explored different
Natural Language Processing (NLP) and clustering techniques applied to the
raw tweets’ text.
I found that with the right NLP and clustering, I could map ~80% of the
tweets back to their labeled trends!
Motivation
While trended tweets can be retrieved from the Twitter API, they represent only
the latest tweets and a fraction of all public tweets. In addition, they are grouped
into trends with a high degree of bias (such as promotions and popularity indices).
In order to group any tweet list (historical, non-“popular” and so on), we’ll need
to come up with a Tweet-to-Topic technique that can be verified.
Twitter has no doubt put tremendous work into its trend labels, so we can
use those labels as ground truth to verify our topic-ing abilities!
Dataset(s)
The data comes from the Twitter API; the clustering and NLP techniques are ones I
use in my day-to-day job.
I mined my dataset from the Twitter API in 2 steps:
● Top 10 Twitter Trends in the US, mined with the Twitter trends-per-place API.
Sample tuple: (trend name, trend query string, tweet volume)
● For each trend, I mined ~800-1000 tweets over repeated API calls with the
Twitter search-per-query API. Sample tuple: (trend name, tweet text, tweet
hashtag list)
Total size: (7528, 3), i.e. 7,528 tweets with 3 columns each
Dataset(s)
In addition, I used small helper datasets:
● English stop words from the NLTK, stop_words and gensim libraries
● Some manually added tweet-specific stop words that were missing from those lists
Data Preparation and Cleaning
Empty values
● No missing values were encountered in the mining process
● In the enrichment process, some text vectorization calculations produced a small
number of missing values in the new features
● I preferred replacing them with “0” over dropping rows: some matrix
calculations were heavy, so the checkpoints of each step had to be cached, and the
row count had to stay consistent between steps
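The zero-fill choice could look roughly like the sketch below. It is plain Python rather than the data-frame call the notebook presumably used; the `fill_missing` helper and the sample rows are hypothetical.

```python
import math

def fill_missing(rows, fill_value=0.0):
    """Replace None/NaN entries with fill_value, keeping every row in place."""
    cleaned = []
    for row in rows:
        cleaned.append([
            fill_value
            if v is None or (isinstance(v, float) and math.isnan(v))
            else v
            for v in row
        ])
    return cleaned

# Hypothetical feature rows; NaN/None stand for a failed vectorization result.
features = [[0.12, float("nan")], [None, 3.4], [0.5, 0.7]]
filled = fill_missing(features)
```

Because no row is dropped, row `i` of any cached checkpoint still refers to tweet `i`.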
Data Preparation and Cleaning
Non-Unicode text problems
● Such text caused issues when printing results to the notebook, and with some
libraries like matplotlib
● I restricted the mining process to the US trend-place
● I restricted the tweets API to the English locale (‘en’)
● I still had to explicitly encode text values as UTF-8
● In other cases I had to decode back from UTF-8
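As an illustration of the encode/decode round trip, here is a minimal sketch using only built-in string methods. The helper names are hypothetical, and replacing undecodable bytes is an assumption about how bad input might be handled, not something the slides state.

```python
def to_utf8_bytes(text):
    """Encode a text value as UTF-8 bytes (e.g. before caching it on disk)."""
    return text.encode("utf-8")

def from_utf8_bytes(raw):
    """Decode UTF-8 bytes back to text; undecodable bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")

# A trend name with an accented character, written with an escape so the
# example does not depend on the encoding of this file.
name = "Santiago Bernab\u00e9u"
raw = to_utf8_bytes(name)          # the accented 'e' becomes the bytes C3 A9
round_trip = from_utf8_bytes(raw)
```

When such bytes are printed without decoding, the accented letter shows up as the escape sequence `\xc3\xa9` instead of a single character.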
Data Preparation and Cleaning
Twitter-specific text pre-processing
● Tweets follow slightly different punctuation rules
● They use special conventions such as hashtags and mentions
● I used an NLTK object built for tokenizing tweets: TweetTokenizer
● For the LDA part of the NLP, I used gensim’s preprocessing together with the
WordNetLemmatizer and SnowballStemmer objects
Research Question(s)
● Can I cluster / topic the tweets back into their ground-truth trends, based only
on the raw tweets’ text?
● Is this method accurate enough for clustering / topic-ing un-trended tweets into
meaningful groups?
Methods
Clustering validation method:
● Plotting a 2-D TSNE scatter plot (after dimensionality reduction) of the
data-frame at the current stage
● Coloring each dot by its “ground truth” trend
● The more each color concentrates in its own cluster, the better!
This way I could check whether a specific object, or even a single parameter in the
pipeline, improves or worsens the result of the previous stage.
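The visual check can be complemented by a single number. The sketch below computes cluster purity, i.e. the share of tweets whose trend is the majority trend of their cluster. This is a swapped-in measure for illustration, not the validation used in the slides; the function name and the sample assignments are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, trend_labels):
    """Fraction of points whose trend is the majority trend of their cluster."""
    by_cluster = defaultdict(list)
    for cid, trend in zip(cluster_ids, trend_labels):
        by_cluster[cid].append(trend)
    # For each cluster, count how many members carry its most common trend.
    majority_total = sum(
        Counter(trends).most_common(1)[0][1] for trends in by_cluster.values()
    )
    return majority_total / len(trend_labels)

# Hypothetical cluster assignments and ground-truth trends for six tweets.
clusters = [0, 0, 0, 1, 1, 1]
trends = ["Paul Ryan", "Paul Ryan", "Meek", "Meek", "Meek", "Meek"]
purity = cluster_purity(clusters, trends)
```

A purity of 1.0 would correspond to a plot in which every cluster shows a single color. Note that purity rises trivially as the number of clusters grows, so it is only comparable between runs with a similar cluster count.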
Methods
Topic modeling validation method (used to keep track of progress):
● I collected the top 2-4 tokens (words) in each of the top 10 topics
● I compared them to the names of the top 10 trends from the Twitter dataset
● The more the topics covered the trends, the better!
* This is an unusual case where we’re dealing with an unsupervised task, but we’re able
to validate it like a supervised one!
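One way to turn this comparison into a score is sketched below. The matching rule (a prefix match in either direction, so that stemmed words still count) and the `trend_coverage` name are assumptions; the slides do not say how the comparison was actually scored.

```python
def trend_coverage(trend_names, topics):
    """Fraction of trends with at least one name token found in some topic.

    A topic word matches when it is a prefix of the trend token or the other
    way round, so a stemmed word such as 'polit' can still match 'politics'.
    """
    topic_words = {w.lstrip("#").lower() for words in topics for w in words}
    topic_words.discard("")
    covered = 0
    for name in trend_names:
        tokens = [t.lstrip("#").lower() for t in name.split()]
        tokens = [t for t in tokens if t]
        if any(t.startswith(w) or w.startswith(t)
               for t in tokens for w in topic_words):
            covered += 1
    return covered / len(trend_names)

# Small illustrative sample: three trend names, top words of two topics.
trends = ["Paul Ryan", "Laura Loomer", "Blade Runner"]
topics = [["ryan", "paul", "polit"], ["loomer", "laura", "twitter"]]
coverage = trend_coverage(trends, topics)
```

Here two of the three trends are covered; "Blade Runner" is not, because none of its tokens appears among the topic words.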
Methods
Text Tokenization – initial text processing, extracting the “essence” of the text
○ Custom tokenization of a tweet’s text
○ After trying many tokenizers, I settled on NLTK’s TweetTokenizer object
● Engineering 4 features:
○ Text length and word count of the original text
○ Text length and word count of the cleaned text
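The tokenization step and the four length features could be sketched as follows. The regular expression is a simplified stand-in for NLTK's TweetTokenizer, not its actual rules; the stop-word list, the cleaning rules and the feature names are all hypothetical.

```python
import re

# Keeps URLs, @mentions and #hashtags whole; other text splits into words.
TOKEN_RE = re.compile(r"https?://\S+|[@#]\w+|\w+(?:'\w+)?")
STOP_WORDS = {"the", "a", "is", "to", "rt", "via"}  # hypothetical short list

def tokenize(text):
    """Lower-cased tokens of a tweet."""
    return [t.lower() for t in TOKEN_RE.findall(text)]

def clean_tokens(tokens):
    """Drop stop words, links and mentions; keep hashtags and plain words."""
    return [t for t in tokens
            if t not in STOP_WORDS and not t.startswith(("http", "@"))]

def length_features(text):
    """Text length and word count of the original and of the cleaned text."""
    cleaned = " ".join(clean_tokens(tokenize(text)))
    return {
        "orig_len": len(text),
        "orig_words": len(text.split()),
        "clean_len": len(cleaned),
        "clean_words": len(cleaned.split()),
    }

tweet = "RT @user: The #NOvsDAL game is on! https://t.co/abc"
feats = length_features(tweet)
```

For this sample tweet the cleaned text is "#novsdal game on", so the cleaned features are much smaller than the original ones.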
Methods
Word2Vec - a word vectorization technique used to produce word embeddings, for
extracting topic and context from texts.
● Applying pre-trained word vectors (the Google News vectors) to the
cleaned text. Each word is represented as a 300-dimensional vector, and the
entire tweet’s text (= sentence) is then combined into a single 300-dimensional vector
● Engineering 2 features:
○ A skew of that vector array (1 float)
○ A kurtosis of that vector array (1 float)
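A minimal sketch of the two features, under two assumptions the slides do not confirm: the tweet vector is the element-wise mean of its word vectors, and skew and kurtosis use the biased moment definitions (excess kurtosis, so a normal distribution scores 0). The toy embeddings are hypothetical and have 4 dimensions instead of 300.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized word vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def central_moment(values, k):
    """k-th central moment of a list of numbers."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** k for v in values) / len(values)

def skewness(values):
    """Biased sample skewness: m3 / m2**1.5 (undefined for constant input)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Biased excess kurtosis: m4 / m2**2 - 3 (undefined for constant input)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

# Hypothetical toy embeddings standing in for the pre-trained vectors.
embeddings = {"paul": [0.0, 1.0, 2.0, 5.0], "ryan": [2.0, 1.0, 0.0, 3.0]}
tweet_vec = mean_vector([embeddings[w] for w in ["paul", "ryan"]])
features = (skewness(tweet_vec), excess_kurtosis(tweet_vec))
```

Words missing from the embedding vocabulary would have to be skipped before averaging; a tweet with no known words then needs a fallback value.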
Methods
Term Frequency - Inverse Document Frequency (TFIDF)
I researched different methods of:
● scikit-learn objects: HashingVectorizer, TfidfTransformer, TfidfVectorizer
● Stop words, tokenization
● Distance metrics from scipy and scikit-learn
● Engineering ~50-500 features of tfidf values
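To make the TFIDF features concrete, here is a plain-Python sketch of the weighting itself. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is how scikit-learn documents its default, and scales each row to sum to 1 (L1 normalization). The `tfidf_l1` function and the sample documents are hypothetical, and this is not a replacement for the scikit-learn objects listed above.

```python
import math
from collections import Counter

def tfidf_l1(docs):
    """TF-IDF rows with smoothed IDF, each row scaled to sum to 1 (L1 norm)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(doc_freq)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocab]
        total = sum(weights)
        rows.append([w / total if total else 0.0 for w in weights])
    return vocab, rows

# Hypothetical tokenized tweets.
docs = [["paul", "ryan", "vote"],
        ["laura", "loomer", "twitter"],
        ["paul", "ryan", "ryan"]]
vocab, matrix = tfidf_l1(docs)
```

Each column of `matrix` is one feature, so limiting the vocabulary to the most frequent terms is what bounds the feature count.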
Methods
Clustering - Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)
● HDBSCAN’s main advantages here:
○ Less sensitive to sparse data (like TFIDF results)
○ Has a built-in advanced clustering validation measure: DBCV
● I tried different dimensionality-reduction parameters using TruncatedSVD
● I researched different HDBSCAN parameters: minimum cluster size, distance metric,
etc.
Methods
NLP – Topic Modeling using an LDA model
● Using the LdaMulticore object from the gensim library for LDA
● Lemmatization and stemming with NLTK’s WordNetLemmatizer and
SnowballStemmer objects
● Building a Corpus from all the tweets and a Bag Of Words per each tweet
● Calculating the top 10 Topics – assuming they’d converge to our 10 Trends
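The corpus-building step can be illustrated without any library. The sketch below maps tokens to integer ids and turns each tweet into a sorted list of (token id, count) pairs, the same shape of data that gensim's `Dictionary.doc2bow` produces as input for an LDA model. The helper names and sample tweets are hypothetical, and this stands in for the gensim objects rather than reproducing them.

```python
from collections import Counter

def build_dictionary(token_lists):
    """Map each distinct token to an integer id, in first-seen order."""
    token_to_id = {}
    for tokens in token_lists:
        for tok in tokens:
            token_to_id.setdefault(tok, len(token_to_id))
    return token_to_id

def to_bow(tokens, token_to_id):
    """Sorted (token_id, count) pairs for one tweet; unknown tokens are skipped."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())

# Hypothetical tweets after lemmatization and stemming.
tweets = [["paul", "ryan", "vote"], ["ryan", "ryan", "polit"]]
dictionary = build_dictionary(tweets)
corpus = [to_bow(t, dictionary) for t in tweets]
```

Word order is discarded at this point, which is why the lemmatization and stemming choices made earlier have such a large effect on the resulting topics.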
Findings – with Tokenization features – best TSNE
• Twitter links formatting
• Tweet Tokenization
• Text length, #words, #hashtags
• TSNE with Canberra metric
Findings – with Word2Vec features – best TSNE
• Skew and Kurtosis measurements
• TSNE Hamming metric
Findings – with TFIDF features – best TSNE
• ~85% clustering match of the raw tweets into the Trends!!
• 512 TFIDF features, unicode accent, L1 normalization
• TSNE with PCA initialization and Russellrao metric
Findings – Text Features Clustering
Each stage (tokenization, word vectorization and TFIDF vectorization) required
tedious tuning.
In the end, I found that ~85% of the raw tweet samples were mapped
back into their Trends!!
Findings – NLP LDA’s Topics vs our top Trends
Trend / Topic Order   Trend Name          Topic Top 3 Words
1                     Laura Loomer        #novsdal, #dallascowboy, cowboy
2                     Meek                abcd, girl, name
3                     #GoodFormVideo      abcd, one, ryan
4                     Santiago Bernabéu   #novsdal, meek, album
5                     Marc Lamont Hill    ryan, paul, today
6                     Abcde               never, hill, marc
7                     Paul Ryan           abcd, kid, girl
8                     Ed Burke            loomer, laura, twitter
9                     Blade Runner        ryan, paul, polit
10                    #NOvsDAL            vote, million, paul
Findings - NLP LDA Topic Modeling
After some tuning (not nearly as much as in the clustering stage), I found a high
degree (~75%) of convergence between the names of the original 10 trends and the
LDA’s 10 topics.
● The LDA’s topics are not distinct like cluster labels; each topic is a set of
probabilities that a collection of topic-words belongs to the same
“Topic”
● Tuning the LDA for more than 10 topics increased the coverage of the 10
trends, as expected
● Adding more tweet-specific stop words helped as well
Conclusions
● Can I cluster / topic the tweets back into their ground-truth trends, based only
on the raw tweets’ text? I believe so, with ~80% success, using an
ensemble of techniques from clustering and topic modeling
● Is this method accurate enough for clustering / topic-ing un-trended tweets into
meaningful groups? I believe so. All learning stages were done without
knowledge of the original Trends (these were used only as a measure of success)
● In retrospect, tweets are very hard to process with conventional text
techniques, mainly because their text is heavily accented, repetitive,
shortened, etc.
Limitations
1. Text encoding limitation: the research was limited to English text from the US
due to encoding compatibility. Given more time I’d build a more
comprehensive data collection and processing pipeline covering more languages, which I
believe would improve the clustering results
2. Twitter heavily biases its API towards the popular stream of tweets. When
insisting on non-trendy tweets, the results get scarce. Twitter also limits
data collection through the public API: no bulk access is allowed, only
iterations of limited API calls, each ending with quite a small dataset
Limitations
3. TSNE computation is quite heavy, so I had to cache checkpoints of the data
transformations on disk to avoid repeating work. However, this limited my ability
to change the number of rows between steps, as the original indices could point
to missing rows
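The checkpoint constraint can be made explicit in code. The sketch below caches a step's result with `pickle` and refuses to use a result whose row count no longer matches. The `cached_step` helper and its signature are hypothetical; the slides do not describe how the checkpoints were actually stored.

```python
import os
import pickle

def cached_step(path, compute, expected_rows):
    """Load a cached result if present, otherwise compute and store it.

    Raises ValueError when the row count differs from expected_rows, which
    is the situation that makes a stale checkpoint unsafe to reuse.
    """
    from_cache = os.path.exists(path)
    if from_cache:
        with open(path, "rb") as fh:
            result = pickle.load(fh)
    else:
        result = compute()
    if len(result) != expected_rows:
        raise ValueError(
            f"{path}: expected {expected_rows} rows, found {len(result)}"
        )
    if not from_cache:
        # Only results that passed the row-count check are written to disk.
        with open(path, "wb") as fh:
            pickle.dump(result, fh)
    return result
```

A heavy step would then be wrapped as `cached_step("tsne.pkl", run_tsne, len(df))`, where `run_tsne` and `df` are placeholders for the real computation and data-frame. Dropping rows upstream then fails loudly instead of silently misaligning indices.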
Acknowledgements
The data was obtained from the Twitter API, using a user access license.
References
● https://code.google.com/archive/p/word2vec/
● McInnes, L., & Healy, J. (2017). Accelerated Hierarchical Density Based
Clustering. In 2017 IEEE International Conference on Data Mining Workshops
(ICDMW) (pp. 33-42). IEEE.
● Campello, R. J., Moulavi, D., Zimek, A., & Sander, J. (2015). Hierarchical
density estimates for data clustering, visualization, and outlier detection. ACM
Transactions on Knowledge Discovery from Data (TKDD), 10(1), 5.
● Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J., 2014.
Density-Based Clustering Validation. In SDM (pp. 839-847).
27© Mor Krispil, 2019

More Related Content

Similar to Self Trending a Tweet - Cluster and Topic Analysis on Tweets

Perceptual Mapping using Twitter Data
Perceptual Mapping using Twitter DataPerceptual Mapping using Twitter Data
Perceptual Mapping using Twitter Data
David Gerson
 
Social Media Brand Positioning Workflow- David Gerson
Social Media Brand Positioning Workflow- David GersonSocial Media Brand Positioning Workflow- David Gerson
Social Media Brand Positioning Workflow- David Gerson
PyData
 
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Fwdays
 
Predicting Tweet Sentiment
Predicting Tweet SentimentPredicting Tweet Sentiment
Predicting Tweet Sentiment
Lucinda Linde
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
Sumit Raj
 
ML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queriesML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queries
Varun Nathan
 
2201.00598.pdf
2201.00598.pdf2201.00598.pdf
2201.00598.pdf
KSHITIJCHAUDHARY20
 
ML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queriesML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queries
Varun Nathan
 
Market Research Meets Big Data Analytics for Business Transformation
Market Research Meets Big Data Analytics  for Business Transformation Market Research Meets Big Data Analytics  for Business Transformation
Market Research Meets Big Data Analytics for Business Transformation
Sally Sadosky
 
STAT200 Assignment #2 - Descriptive Statistics Analysis and.docx
STAT200 Assignment #2 - Descriptive Statistics Analysis and.docxSTAT200 Assignment #2 - Descriptive Statistics Analysis and.docx
STAT200 Assignment #2 - Descriptive Statistics Analysis and.docx
rafaelaj1
 
[DSC MENA 24] Nada_GabAllah_-_Advancement_in_NLP_and_Text_Analytics.pptx
[DSC MENA 24] Nada_GabAllah_-_Advancement_in_NLP_and_Text_Analytics.pptx[DSC MENA 24] Nada_GabAllah_-_Advancement_in_NLP_and_Text_Analytics.pptx
[DSC MENA 24] Nada_GabAllah_-_Advancement_in_NLP_and_Text_Analytics.pptx
DataScienceConferenc1
 
ChatGPT and OpenAI.pdf
ChatGPT and OpenAI.pdfChatGPT and OpenAI.pdf
ChatGPT and OpenAI.pdf
Sonal Tiwari
 
Research Literature OUTLINE, APA Format!American writer Wa.docx
Research Literature OUTLINE, APA Format!American writer Wa.docxResearch Literature OUTLINE, APA Format!American writer Wa.docx
Research Literature OUTLINE, APA Format!American writer Wa.docx
WilheminaRossi174
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
Dr Arash Najmaei ( Phd., MBA, BSc)
 
BTech Final Project (1).pptx
BTech Final Project (1).pptxBTech Final Project (1).pptx
BTech Final Project (1).pptx
SwarajPatel19
 
SentimentAnalysisofTwitterProductReviewsDocument.pdf
SentimentAnalysisofTwitterProductReviewsDocument.pdfSentimentAnalysisofTwitterProductReviewsDocument.pdf
SentimentAnalysisofTwitterProductReviewsDocument.pdf
DevinSohi
 
Tldr
TldrTldr
IRJET- Twitter Sentimental Analysis for Predicting Election Result using ...
IRJET-  	  Twitter Sentimental Analysis for Predicting Election Result using ...IRJET-  	  Twitter Sentimental Analysis for Predicting Election Result using ...
IRJET- Twitter Sentimental Analysis for Predicting Election Result using ...
IRJET Journal
 
How Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment AnalysisHow Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment Analysis
CrowdFlower
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Simon Hughes
 

Similar to Self Trending a Tweet - Cluster and Topic Analysis on Tweets (20)

Perceptual Mapping using Twitter Data
Perceptual Mapping using Twitter DataPerceptual Mapping using Twitter Data
Perceptual Mapping using Twitter Data
 
Social Media Brand Positioning Workflow- David Gerson
Social Media Brand Positioning Workflow- David GersonSocial Media Brand Positioning Workflow- David Gerson
Social Media Brand Positioning Workflow- David Gerson
 
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
 
Predicting Tweet Sentiment
Predicting Tweet SentimentPredicting Tweet Sentiment
Predicting Tweet Sentiment
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
 
ML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queriesML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queries
 
2201.00598.pdf
2201.00598.pdf2201.00598.pdf
2201.00598.pdf
 
ML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queriesML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queries
 
Market Research Meets Big Data Analytics for Business Transformation
Market Research Meets Big Data Analytics  for Business Transformation Market Research Meets Big Data Analytics  for Business Transformation
Market Research Meets Big Data Analytics for Business Transformation
 
STAT200 Assignment #2 - Descriptive Statistics Analysis and.docx
STAT200 Assignment #2 - Descriptive Statistics Analysis and.docxSTAT200 Assignment #2 - Descriptive Statistics Analysis and.docx
STAT200 Assignment #2 - Descriptive Statistics Analysis and.docx
 
[DSC MENA 24] Nada_GabAllah_-_Advancement_in_NLP_and_Text_Analytics.pptx
[DSC MENA 24] Nada_GabAllah_-_Advancement_in_NLP_and_Text_Analytics.pptx[DSC MENA 24] Nada_GabAllah_-_Advancement_in_NLP_and_Text_Analytics.pptx
[DSC MENA 24] Nada_GabAllah_-_Advancement_in_NLP_and_Text_Analytics.pptx
 
ChatGPT and OpenAI.pdf
ChatGPT and OpenAI.pdfChatGPT and OpenAI.pdf
ChatGPT and OpenAI.pdf
 
Research Literature OUTLINE, APA Format!American writer Wa.docx
Research Literature OUTLINE, APA Format!American writer Wa.docxResearch Literature OUTLINE, APA Format!American writer Wa.docx
Research Literature OUTLINE, APA Format!American writer Wa.docx
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
BTech Final Project (1).pptx
BTech Final Project (1).pptxBTech Final Project (1).pptx
BTech Final Project (1).pptx
 
SentimentAnalysisofTwitterProductReviewsDocument.pdf
SentimentAnalysisofTwitterProductReviewsDocument.pdfSentimentAnalysisofTwitterProductReviewsDocument.pdf
SentimentAnalysisofTwitterProductReviewsDocument.pdf
 
Tldr
TldrTldr
Tldr
 
IRJET- Twitter Sentimental Analysis for Predicting Election Result using ...
IRJET-  	  Twitter Sentimental Analysis for Predicting Election Result using ...IRJET-  	  Twitter Sentimental Analysis for Predicting Election Result using ...
IRJET- Twitter Sentimental Analysis for Predicting Election Result using ...
 
How Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment AnalysisHow Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment Analysis
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
 

Recently uploaded

UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Zilliz
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Zilliz
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 

Recently uploaded (20)

UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 

Self Trending a Tweet - Cluster and Topic Analysis on Tweets

  • 1. Self Trending a Tweet Cluster and Topic Analysis on Tweets © Mor Krispil, 2019 1
  • 2. Abstract In a Twitter dataset we are provided with tweets, retrieved per a pre-defined “Trend”. Can we verify those trends back from the raw statuses? If so – we could use this technique to topic any un-trended tweet-list! On a tweet dataset, curated from a top of 10 twitter trends, I researched different Natural Language Processing (NLP) and Clustering techniques to apply on the raw tweets’ text. I found that with the right NLP and Clustering – I could verify ~80% of the tweets back to their labeled trends! 2© Mor Krispil, 2019
  • 3. Motivation While trended tweets can be provided from the tweeter API, they represent only the latest and a fraction of all public tweets. In addition, they are grouped into trends with a high degree of bias (like promotions and popularity indices). In order to group any tweet list: historical, non “popular” and so on – we’ll need to come by with a Tweet-to-Topic technique that can be verified. Twitter, no doubt, have put tremendous work into their trend labels – so we could actually use those labels to verify our Topic-ing abilities! (ground truths) 3© Mor Krispil, 2019
  • 4. Dataset(s) I got the Twitter data from the Twitter API, and Clustering and NLP techniques I use in my day to day job. I mined my dataset from the Twitter API, in 2 steps: ● Top 10 Twitter Trends in the US, mined with the twitter trends-per-place API. Sample tuple: (trend name, trend query string, tweet volume) ● Per each trend: I mined ~800-1000 Tweets, in an API call iterations, with the twitter search-per-query API. Sample tuple: (trend name, tweet text, tweet hashtag list) Total size: (7528, 3) 4© Mor Krispil, 2019
  • 5. Dataset(s) In Addition I used small helper-datasets ● English stop-words from: NLTK, stop_words and genism libraries ● Manually added some tweet-specific missing stop words 5© Mor Krispil, 2019
  • 6. Data Preparation and Cleaning Empty values ● No missing values were encountered in the mining process ● In the enrichment process, some text vectorization calculation created small missing values in the new features ● I preferred replacement with “0” instead of dropping rows, since some matrix calculation were heavy, and steps’ checkpoints had to be cached, so row- count had to be kept consistent 6© Mor Krispil, 2019
  • 7. Data Preparation and Cleaning Non-Unicode text problems ● It caused issues with printing results to the notebook, and with some libraries like matplotlib ● I filtered the mining process only to the US trend-place ● I filtered the tweets API to English locale (‘en’) ● I still had to make sure and encode text values as utf-8 ● I had to decode back from utf-8 in other cases 7© Mor Krispil, 2019
  • 8. Data Preparation and Cleaning Twitter specific text pre-processing ● A tweet has a little bit different rules of punctuations ● Special conventions like: hashtags and mentions ● I used an NLTK object built for tokenizing tweets – TweetTokenizer ● For the LDA part of the NLP, I used genism’s preprocessing, WordNetLemmatizer and SnowballStemmer objects 8© Mor Krispil, 2019
  • 9. Research Question(s)
    ● Can I cluster / topic the tweets back into their ground-truth trends, based only on the raw tweets’ text?
    ● Is this method accurate enough for clustering / topic-ing un-trended tweets into meaningful groups?
  • 10. Methods: Clustering validation method
    ● Plotting a 2-d TSNE scatter plot (after dimensionality reduction) of the current stage’s data-frame
    ● Coloring each dot by its “ground truth” trend
    ● The more each color concentrates in its own cluster, the better
    This way I could check whether a specific object, or even a single parameter in the pipeline, improves or worsens the previous state.
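The validation plot described on this slide might look roughly like the sketch below. It uses synthetic features in place of the real data-frame, and the metric, perplexity and plotting details are assumed values, not the author's settings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: 3 "trends", 30 rows each.
trend_ids = np.repeat(np.arange(3), 30)
features = rng.normal(size=(90, 8)) + trend_ids[:, None] * 6.0

# Project to 2-d; metric and perplexity are assumptions for this toy data.
embedding = TSNE(n_components=2, metric="canberra", init="pca",
                 perplexity=15, random_state=0).fit_transform(features)


def plot_by_trend(points, labels, path="tsne.png"):
    """Scatter the 2-d points, one color per ground-truth trend."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=10)
    ax.set_title("TSNE colored by ground-truth trend")
    fig.savefig(path)
    plt.close(fig)
```

Calling `plot_by_trend(embedding, trend_ids)` writes the colored scatter plot to disk; rerunning it after each pipeline change gives the before/after comparison the slide describes.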
  • 11. Methods: Topic Modeling validation method
    To keep track of progress, I used this validation method:
    ● I collected the top 2-4 tokens (words) in each of the top 10 topics
    ● I compared them to the top 10 trends’ names from the Twitter dataset
    ● The more the topics covered the trends, the better
    * This is a unique case where we’re dealing with an unsupervised task, but we’re able to validate it like a supervised one!
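One way to turn this comparison into a number is sketched below. The matching rule (a trend counts as covered when any word of its name appears among the topics' top words) and the toy inputs are assumptions; the slides do not specify how coverage was scored.

```python
def trend_coverage(trend_names, topic_top_words):
    """Return (fraction of covered trends, list of covered trend names)."""
    # Pool all topic words, ignoring case and a leading '#'.
    topic_vocab = {w.lower().lstrip("#") for words in topic_top_words for w in words}
    covered = []
    for name in trend_names:
        name_tokens = {t.lower().lstrip("#") for t in name.split()}
        if name_tokens & topic_vocab:
            covered.append(name)
    return len(covered) / len(trend_names), covered


# Toy inputs for illustration, not the study's results.
trends = ["Paul Ryan", "#NOvsDAL", "Blade Runner", "Meek", "Ed Burke"]
topics = [["ryan", "paul", "today"],
          ["#novsdal", "cowboy", "game"],
          ["meek", "album", "new"]]

score, matched = trend_coverage(trends, topics)
print(score, matched)  # 0.6 ['Paul Ryan', '#NOvsDAL', 'Meek']
```

A stricter rule (for example, requiring all words of a trend name inside a single topic) would give lower scores, so the rule should be fixed before comparing runs.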
  • 12. Methods: Text Tokenization
    ● Initial text processing, extracting the “essence” of the text:
      ○ Custom tokenization of a tweet’s text
      ○ After trying many tokenizers, I settled on NLTK’s TweetTokenizer object
    ● Engineering 4 features:
      ○ Text length and word count of the original text
      ○ Text length and word count of the cleaned text
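The four length features can be sketched as follows. The `clean_text` helper is a hypothetical stand-in for the author's custom tokenization, and the feature names are made up for this example.

```python
import re


def clean_text(text):
    """Hypothetical cleaner: drop links and mentions, keep words and hashtags."""
    text = re.sub(r"https?://\S+|@\w+", " ", text)
    return " ".join(re.findall(r"#?\w+", text.lower()))


def length_features(text):
    """Length and word count of the original and of the cleaned text."""
    cleaned = clean_text(text)
    return {
        "raw_len": len(text),
        "raw_words": len(text.split()),
        "clean_len": len(cleaned),
        "clean_words": len(cleaned.split()),
    }


features = length_features("@someone Cowboys win!!! #NOvsDAL https://t.co/abc")
print(features)
```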
  • 13. Methods: Word2Vec
    A word vectorization technique that produces word embeddings, used for extracting topic and context from texts.
    ● Applying pre-trained word vectors (the Google News vectors) to the cleaned text. Each word is represented as a 300-dimensional vector, and the entire tweet’s text (= sentence) is then combined into a single 300-dimensional vector
    ● Engineering 2 features:
      ○ The skew of that vector (1 float)
      ○ The kurtosis of that vector (1 float)
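A rough sketch of the two Word2Vec features follows. A small random embedding table stands in for the pre-trained Google News vectors, and averaging the word vectors into a tweet vector is an assumption, since the slide does not say how the words were combined.

```python
import numpy as np
from scipy.stats import kurtosis, skew

DIM = 300
rng = np.random.default_rng(0)

# Toy embedding table standing in for the pre-trained vectors.
toy_vectors = {w: rng.normal(size=DIM) for w in ["cowboys", "win", "game"]}


def tweet_vector(tokens, vectors, dim=DIM):
    """Average the vectors of known tokens (averaging is an assumption)."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


# "tonight" is not in the toy table, so it is skipped.
vec = tweet_vector(["cowboys", "win", "tonight"], toy_vectors)

# Two scalar features summarizing the shape of the 300 values.
w2v_features = {"skew": float(skew(vec)), "kurtosis": float(kurtosis(vec))}
print(vec.shape, w2v_features)
```

When a tweet has no known tokens the vector is constant, and skew and kurtosis are then undefined (NaN), which is one way a vectorization step can introduce missing values.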
  • 14. Methods: Term Frequency - Inverse Document Frequency (TFIDF)
    I researched different combinations of:
    ● scikit-learn objects: HashingVectorizer, TfidfTransformer, TfidfVectorizer
    ● Stop words and tokenization
    ● Distance metrics from scipy and scikit-learn
    ● Engineering ~50-500 features of TFIDF values
  • 15. Methods: Clustering with Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)
    ● HDBSCAN’s main advantages here:
      ○ Less sensitive to sparse data (like TFIDF results)
      ○ Has a built-in advanced clustering validation: DBCV
    ● I tried different dimensionality reduction parameters using TruncatedSVD
    ● I researched different HDBSCAN parameters: min-cluster-size, distance metric, etc.
  • 16. Methods: NLP Topic Modeling using an LDA model
    ● Using the LdaMulticore object from the gensim library for LDA
    ● Lemmatizing and stemming with NLTK’s WordNetLemmatizer and SnowballStemmer objects
    ● Building a corpus from all the tweets and a bag of words for each tweet
    ● Calculating the top 10 topics, assuming they’d converge to our 10 trends
  • 17. Findings with Tokenization features: best TSNE
    • Twitter links formatting
    • Tweet tokenization
    • Text length, #words, #hashtags
    • TSNE with Canberra metric
  • 18. Findings with Word2Vec features: best TSNE
    • Skew and kurtosis measurements
    • TSNE with Hamming metric
  • 19. Findings with TFIDF features: best TSNE
    • ~85% clustering match of the raw tweets to their Trends
    • 512 TFIDF features, unicode accent stripping, L1 normalization
    • TSNE with PCA initialization and Russell-Rao metric
  • 20. Findings: Text Features Clustering
    Each stage (tokenization, word vectorization and TFIDF vectorization) required tedious tuning. In the end, ~85% of the raw tweet samples were mapped back to their Trends.
  • 21. Findings: NLP LDA’s Topics vs our top Trends
    Trend / Topic Order | Trend Name | Topic Top 3 Words
    1 | Laura Loomer | #novsdal, #dallascowboy, cowboy
    2 | Meek | abcd, girl, name
    3 | #GoodFormVideo | abcd, one, ryan
    4 | Santiago Bernabéu | #novsdal, meek, album
    5 | Marc Lamont Hill | ryan, paul, today
    6 | Abcde | never, hill, marc
    7 | Paul Ryan | abcd, kid, girl
    8 | Ed Burke | loomer, laura, twitter
    9 | Blade Runner | ryan, paul, polit
    10 | #NOvsDAL | vote, million, paul
  • 22. Findings: NLP LDA Topic Modeling
    After some tuning (not nearly as much as in the clustering stage), I found a high degree (~75%) of convergence between the original 10 trends’ names and the LDA’s 10 topics.
    ● The LDA’s topics are not distinct like cluster labels; each is a set of probabilities that a collection of topic-words belongs to the same “Topic”
    ● Tuning the LDA for more than 10 topics increased the coverage of the 10 trends, as expected
    ● Adding tweet-specific stop words helped as well
  • 23. Conclusions
    ● Can I cluster / topic the tweets back into their ground-truth trends, based only on the raw tweets’ text? I believe so, with ~80% success, using an ensemble of techniques from clustering and topic modeling
    ● Is this method accurate enough for clustering / topic-ing un-trended tweets into meaningful groups? I believe so. All learning stages were done without knowledge of the original Trends, which were used only to measure success
    ● In retrospect, tweets are very hard to process with conventional text techniques, mainly because their text is heavily accented, repeated, shortened, etc.
  • 24. Limitations
    1. Text encoding limitation: the research was limited to English text from the US due to encoding compatibility. Given more time, I’d build a more comprehensive data collection and processing pipeline covering more languages, which I believe would improve the clustering results
    2. Twitter heavily biases its API towards the popular stream of tweets. When insisting on non-trendy tweets, the results get scarce. Twitter also limits data collection through the public API: no bulk access is allowed, only iterations of limited API calls, each ending with a quite small dataset
  • 25. Limitations
    3. TSNE computation is quite heavy, so I had to cache checkpoints of the data transformations on disk to avoid repeating work. However, this prevented me from changing the number of rows between steps, since the cached indices could point to missing rows
  • 26. Acknowledgements
    The data was obtained from the Twitter API, using a user access license.
  • 27. References
    ● https://code.google.com/archive/p/word2vec/
    ● McInnes, L., & Healy, J. (2017). Accelerated Hierarchical Density Based Clustering. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW) (pp. 33-42). IEEE.
    ● Campello, R. J., Moulavi, D., Zimek, A., & Sander, J. (2015). Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(1), 5.
    ● Moulavi, D., Jaskowiak, P. A., Campello, R. J., Zimek, A., & Sander, J. (2014). Density-Based Clustering Validation. In SDM (pp. 839-847).