Self Trending a Tweet
Cluster and Topic Analysis on Tweets
© Mor Krispil, 2019
Abstract
In a Twitter dataset, tweets are retrieved per a pre-defined
“Trend”. Can we recover those trends from the raw statuses alone? If so, we could
use the same technique to assign topics to any un-trended tweet list!
On a tweet dataset curated from the top 10 Twitter trends, I researched different
Natural Language Processing (NLP) and Clustering techniques to apply to the
raw tweets’ text.
I found that with the right NLP and Clustering, I could map ~80% of the
tweets back to their labeled trends!
Motivation
While trending tweets are available from the Twitter API, they represent only
the latest tweets and a fraction of all public tweets. In addition, they are grouped into
trends with a high degree of bias (such as promotions and popularity indices).
To group any tweet list (historical, non-“popular” and so on), we need
to come up with a Tweet-to-Topic technique that can be verified.
Twitter has no doubt put tremendous work into its trend labels, so we can
use those labels as ground truth to verify our topic-assignment abilities!
Dataset(s)
The data comes from the Twitter API; the Clustering and NLP techniques are ones I
use in my day-to-day job.
I mined the dataset from the Twitter API in 2 steps:
● Top 10 Twitter Trends in the US, mined with the Twitter trends-per-place API.
Sample tuple: (trend name, trend query string, tweet volume)
● For each trend, I mined ~800-1000 tweets over several API call iterations with the
Twitter search-per-query API. Sample tuple: (trend name, tweet text, tweet
hashtag list)
Total size: (7528, 3), i.e. 7,528 tweets × 3 columns
Dataset(s)
In addition, I used small helper datasets:
● English stop words from the NLTK, stop_words and gensim libraries
● Some manually added tweet-specific stop words that those lists were missing
Data Preparation and Cleaning
Empty values
● No missing values were encountered in the mining process
● In the enrichment process, some text vectorization calculations produced a small
number of missing values in the new features
● I preferred replacing them with “0” over dropping rows: some matrix
calculations were heavy and each step’s checkpoint had to be cached, so the row
count had to stay consistent
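The zero-replacement choice can be illustrated with a minimal sketch. It assumes the engineered features are held as plain lists of floats (the slides do not show the actual data structures), and `fill_missing` is a hypothetical helper name.

```python
import math

def fill_missing(rows, fill_value=0.0):
    """Return a copy of rows with None/NaN entries replaced by fill_value.

    Hypothetical helper: no row is dropped, so the row count stays
    aligned with checkpoints cached at earlier steps.
    """
    cleaned = []
    for row in rows:
        cleaned.append([
            fill_value if v is None or (isinstance(v, float) and math.isnan(v)) else v
            for v in row
        ])
    return cleaned

# Toy feature rows with two missing entries.
features = [[0.5, float("nan")], [None, 1.25], [2.0, 3.0]]
filled = fill_missing(features)
```

The output has exactly as many rows as the input, which is the property the cached checkpoints depend on.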
Data Preparation and Cleaning
Non-Unicode text problems
● Non-Unicode text caused issues when printing results to the notebook, and with some libraries
like matplotlib
● I restricted the mining process to the US trend-place
● I restricted the tweets API to the English locale (‘en’)
● I still had to make sure text values were encoded as utf-8
● In other cases I had to decode back from utf-8
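The encode and decode steps might look like the sketch below. The helper names are hypothetical, and the choice to replace undecodable bytes rather than raise is an assumption, not something the slides state.

```python
def to_utf8_bytes(text):
    """Encode a str to UTF-8 bytes, e.g. before writing to a byte-oriented sink."""
    return text.encode("utf-8")

def from_utf8_bytes(raw):
    """Decode UTF-8 bytes back to str; undecodable bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")

# A trend name with an accented character makes the byte form visible.
trend = "Santiago Bernabéu"
raw = to_utf8_bytes(trend)
restored = from_utf8_bytes(raw)
```

Here `é` becomes the two bytes `\xc3\xa9`, and decoding restores the original string.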
Data Preparation and Cleaning
Twitter specific text pre-processing
● Tweets follow slightly different punctuation rules than ordinary text
● They use special conventions such as hashtags and mentions
● I used an NLTK object built for tokenizing tweets: TweetTokenizer
● For the LDA part of the NLP, I used gensim’s preprocessing together with NLTK’s
WordNetLemmatizer and SnowballStemmer objects
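To show why tweet-aware tokenization matters, here is a standard-library sketch that keeps mentions, hashtags and links as single tokens. It is a simplified stand-in for illustration, not NLTK's TweetTokenizer; the regular expression and the sample tweet are made up.

```python
import re

# Order matters: URLs first, then mentions and hashtags, then plain words.
TOKEN_RE = re.compile(r"https?://\S+|@\w+|#\w+|\w+(?:'\w+)?")

def tokenize_tweet(text):
    """Lowercase a tweet and split it into tokens, keeping @mentions and #hashtags whole."""
    return TOKEN_RE.findall(text.lower())

# Hypothetical tweet text, used only to exercise the pattern.
tokens = tokenize_tweet("RT @nfl: Cowboys win! #NOvsDAL https://t.co/abc")
```

A plain word-splitting tokenizer would separate `#` from `novsdal` and break the link apart, losing exactly the signals that tie a tweet to its trend.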
Research Question(s)
● Can I cluster / topic the tweets back into their ground-truth trends, based only
on the raw tweets’ text?
● Is this method accurate enough for clustering / topic-ing un-trended tweets into
meaningful groups?
Methods
Clustering Validation method:
● Plotting a 2-d TSNE scatter plot (after dimensionality reduction) of the current
stage’s data-frame
● Coloring each dot by its “ground truth” trend
● The more each color concentrates in its own cluster, the better!
This way I could check whether a specific object, or even a single parameter in the pipeline,
improves or worsens the previous state.
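The visual check above has a simple numeric counterpart, sketched here as cluster purity. This is an assumption about how such a check could be quantified; the slides only describe visual inspection, and the toy labels below are invented.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, trend_labels):
    """Fraction of points whose trend equals the majority trend of their cluster."""
    by_cluster = defaultdict(list)
    for cluster, trend in zip(cluster_ids, trend_labels):
        by_cluster[cluster].append(trend)
    # For each cluster, count the members that carry its most common trend.
    matched = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return matched / len(trend_labels)

# Toy data: six tweets, three clusters, one tweet in the "wrong" cluster.
clusters = [0, 0, 0, 1, 1, 2]
trends = ["Paul Ryan", "Paul Ryan", "Meek", "Meek", "Meek", "Ed Burke"]
purity = cluster_purity(clusters, trends)
```

A purity of 1.0 would correspond to every color sitting in its own cluster on the scatter plot.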
Methods
Topic Modeling Validation method, used to keep track of progress:
● I collected the top 2-4 tokens (words) in each of the top 10 topics
● I compared them to the names of the top 10 trends from the Twitter dataset
● The more the topics covered the trends, the better!
* This is an unusual case where we’re dealing with an unsupervised task, but we’re able
to validate it like a supervised one!
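The topic-versus-trend comparison can be sketched as a coverage score. The matching rule (a trend counts as covered if any word of its name appears among any topic's top words) is an assumption for illustration; the slides do not say how the comparison was scored.

```python
def trend_coverage(trend_names, topics_top_words):
    """Fraction of trends with at least one name token among any topic's top words."""
    topic_vocab = {w.lstrip("#").lower() for words in topics_top_words for w in words}
    covered = 0
    for name in trend_names:
        name_tokens = {t.lstrip("#").lower() for t in name.split()}
        if name_tokens & topic_vocab:
            covered += 1
    return covered / len(trend_names)

# Three trend names and three topics' top words, as a small example.
trend_names = ["Laura Loomer", "Paul Ryan", "Blade Runner"]
topics = [
    ["loomer", "laura", "twitter"],
    ["ryan", "paul", "polit"],
    ["vote", "million", "paul"],
]
coverage = trend_coverage(trend_names, topics)
```

In this example two of the three trends are covered; no topic contains "blade" or "runner".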
Methods
Text Tokenization – initial text processing, extracting the “essence” of the text
○ Custom tokenization of a tweet’s text
○ After trying many tokenizers, I settled on NLTK’s TweetTokenizer object
● Engineering 4 features:
○ Text length and word count of the original text
○ Text length and word count of the cleaned text
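The four length features can be sketched as follows. The feature names are hypothetical, and word count is taken as a whitespace split, which is an assumption; the sample texts are invented.

```python
def length_features(original_text, cleaned_text):
    """Return the four length features: chars and words, before and after cleaning."""
    return {
        "orig_len": len(original_text),
        "orig_words": len(original_text.split()),
        "clean_len": len(cleaned_text),
        "clean_words": len(cleaned_text.split()),
    }

# Hypothetical original tweet and its cleaned form.
feats = length_features("Watching #NOvsDAL tonight!!", "watching #novsdal tonight")
```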
Methods
Word2Vec - a word vectorization technique that produces word embeddings, used here for
extracting topic and context from texts.
● Applying a pre-trained word vectorization (the Google News vectors) to the
cleaned text. Each word is represented as a 300-dimensional vector, and the
entire tweet’s text (= sentence) is then reduced to a single 300-dimensional vector
● Engineering 2 features:
○ The skew of that vector (1 float)
○ The kurtosis of that vector (1 float)
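The two features can be sketched with the standard library alone. Averaging the word vectors is an assumed aggregation (the slides do not say how the tweet vector was computed), and the toy vectors have 5 dimensions instead of 300. The moment formulas are the population (biased) skewness and excess kurtosis.

```python
def mean_vector(word_vectors):
    """Average word vectors element-wise (assumed aggregation, not confirmed by the slides)."""
    n = len(word_vectors)
    return [sum(col) / n for col in zip(*word_vectors)]

def skew_and_kurtosis(values):
    """Population skewness and excess kurtosis computed from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        return 0.0, 0.0  # constant vector: the ratios are undefined, fall back to 0
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Toy 5-dimensional "embeddings"; the real vectors have 300 dimensions.
word_vectors = [[0.0, 1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0, 6.0]]
tweet_vector = mean_vector(word_vectors)
skew, kurt = skew_and_kurtosis(tweet_vector)
```

The averaged vector here is symmetric around its mean, so its skew is zero and its excess kurtosis is negative (flatter than a normal distribution).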
Methods
Term Frequency-Inverse Document Frequency (TFIDF)
I researched different combinations of:
● scikit-learn objects: HashingVectorizer, TfidfTransformer, TfidfVectorizer
● Stop words, tokenization
● Distance metrics from scipy and scikit-learn
● Engineering ~50-500 features of TFIDF values
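To make the TFIDF features concrete, here is a standard-library sketch of the computation with L1 row normalization. It is not the scikit-learn implementation; it assumes the smoothed IDF form ln((1 + n) / (1 + df)) + 1 and uses invented toy documents.

```python
import math
from collections import Counter

def tfidf_l1(docs):
    """Compute L1-normalized TF-IDF rows for tokenized docs.

    Uses the smoothed IDF ln((1 + n) / (1 + df)) + 1.
    Returns (vocabulary, rows); each row is a list aligned with vocabulary.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many docs each token appears at least once.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = [tf[tok] * idf[tok] for tok in vocab]
        total = sum(abs(w) for w in weights)
        rows.append([w / total if total else 0.0 for w in weights])
    return vocab, rows

# Toy tokenized tweets.
docs = [["paul", "ryan", "vote"], ["laura", "loomer", "twitter"], ["paul", "ryan", "paul"]]
vocab, rows = tfidf_l1(docs)
```

Each row sums to 1 after L1 normalization, and every vocabulary term becomes one feature column, which is how a few hundred TFIDF features arise from a tweet corpus.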
Methods
Clustering - Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)
● HDBSCAN’s main advantages here:
○ Less sensitive to sparse data (like TFIDF results)
○ Has a built-in clustering validation measure: DBCV (Density-Based Clustering Validation)
● I tried different dimensionality reduction params using TruncatedSVD
● I researched different HDBSCAN params: min-cluster-size, distance metric,
etc.
Methods
NLP – Topic Modeling using an LDA model
● Using the LdaMulticore object from the gensim library for LDA
● Lemmatizing and stemming with NLTK’s WordNetLemmatizer and
SnowballStemmer objects
● Building a Corpus from all the tweets and a Bag Of Words per each tweet
● Calculating the top 10 Topics – assuming they’d converge to our 10 Trends
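The corpus-building step can be sketched without any library. The functions below are hypothetical stand-ins that produce the same shape of data a gensim bag-of-words corpus uses, a list of (token id, count) pairs per document; they are not gensim's own Dictionary class, and the toy tweets are invented.

```python
from collections import Counter

def build_dictionary(tokenized_tweets):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for tokens in tokenized_tweets:
        for tok in tokens:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for one tweet, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Toy tweets, already lemmatized and stemmed.
tweets = [["paul", "ryan", "vote"], ["ryan", "polit", "ryan"]]
token2id = build_dictionary(tweets)
corpus = [to_bow(t, token2id) for t in tweets]
```

An LDA model is then trained on such a corpus together with the id-to-token mapping and the requested number of topics (10 here).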
Findings – with Tokenization features – best TSNE
• Twitter links formatting
• Tweet Tokenization
• Text length, #words, #hashtags
• TSNE with Canberra metric
Findings – with Word2Vec features – best TSNE
• Skew and Kurtosis measurements
• TSNE Hamming metric
Findings – with TFIDF features – best TSNE
• ~85% clustering match of the raw tweets into the Trends!!
• 512 TFIDF features, unicode accent, L1 normalization
• TSNE with PCA initialization and Russellrao metric
Findings – Text Features Clustering
Each stage required tedious tuning: tokenization, word
vectorization and TFIDF vectorization.
In the end, I found that ~85% of the raw tweet samples were mapped
back into their Trends!
Findings – NLP LDA’s Topics vs our top Trends
Trend / Topic Order | Trend Name         | Topic Top 3 Words
1                   | Laura Loomer       | #novsdal, #dallascowboy, cowboy
2                   | Meek               | abcd, girl, name
3                   | #GoodFormVideo     | abcd, one, ryan
4                   | Santiago Bernabéu  | #novsdal, meek, album
5                   | Marc Lamont Hill   | ryan, paul, today
6                   | Abcde              | never, hill, marc
7                   | Paul Ryan          | abcd, kid, girl
8                   | Ed Burke           | loomer, laura, twitter
9                   | Blade Runner       | ryan, paul, polit
10                  | #NOvsDAL           | vote, million, paul
Findings - NLP LDA Topic Modeling
After some tuning (not nearly as much as in the clustering stage), I found a high
degree (~75%) of convergence between the original 10 trends’ names and the
LDA’s 10 topics.
● LDA topics are not distinct like cluster labels; each is a probability
distribution over words that are considered to belong to the same
“Topic”
● Tuning the LDA for more than 10 topics increased the coverage of the 10
trends, as expected
● Adding more tweet-specific stop words helped as well
Conclusions
● Can I cluster / topic the tweets back into their ground-truth trends, based only
on the raw tweets’ text? I believe so, with ~80% success, using an
ensemble of techniques from clustering and topic modeling
● Is this method accurate enough for clustering / topic-ing un-trended tweets into
meaningful groups? I believe so. All learning stages were done without
knowledge of the original trends (which served only as a measure of success)
● In retrospect, tweets are very hard to process with conventional
text techniques, mainly because of their nature: heavily accented, repetitive,
shortened, etc.
Limitations
1. Text encoding limitation: the research was limited to English text from the US
due to encoding compatibility. Given more time, I’d build a more
comprehensive data collection and processing pipeline covering more languages, which I
believe would improve the clustering results
2. Twitter heavily skews its API towards the popular stream of tweets. When
trying to query non-trending tweets, the results get scarce. Twitter also limits
data collection through the public API: no bulk access is allowed, just
iterations of limited API calls, each yielding quite a small dataset
Limitations
3. TSNE computation is quite heavy, so I had to cache checkpoints of the data
transformations on disk to avoid repeating work. However, this prevented me from
changing the number of rows between steps, as the original indices could then point
to missing rows
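The checkpointing pattern and its row-count constraint can be sketched as below. The helper name, the pickle format and the file name are assumptions for illustration; the slides do not describe how the checkpoints were stored.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute, expected_rows):
    """Load a cached result from path, or compute and cache it.

    Raises ValueError if the result's row count differs from expected_rows,
    since later steps rely on row indices staying aligned.
    """
    if os.path.exists(path):
        with open(path, "rb") as fh:
            result = pickle.load(fh)
    else:
        result = compute()
        with open(path, "wb") as fh:
            pickle.dump(result, fh)
    if len(result) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(result)}")
    return result

calls = []

def expensive_step():
    """Stand-in for a heavy transformation such as a TSNE projection."""
    calls.append(1)  # track how often the heavy computation actually runs
    return [[0.1, 0.2], [0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "tsne_step.pkl")
    first = load_or_compute(checkpoint, expensive_step, expected_rows=2)
    second = load_or_compute(checkpoint, expensive_step, expected_rows=2)
```

The second call reads the checkpoint instead of recomputing, and the row-count check makes the limitation explicit: dropping rows upstream would invalidate every cached step after it.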
Acknowledgements
The data was obtained from the Twitter API, using a user access license.
References
● https://code.google.com/archive/p/word2vec/
● McInnes, L., & Healy, J. (2017). Accelerated Hierarchical Density Based Clustering. In
2017 IEEE International Conference on Data Mining Workshops (ICDMW),
pp. 33-42. IEEE.
● Campello, R. J., Moulavi, D., Zimek, A., & Sander, J. (2015). Hierarchical
density estimates for data clustering, visualization, and outlier detection. ACM
Transactions on Knowledge Discovery from Data (TKDD), 10(1), 5.
● Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J., 2014.
Density-Based Clustering Validation. In SDM (pp. 839-847).

More Related Content

Similar to Self Trending a Tweet - Cluster and Topic Analysis on Tweets

Social Media Brand Positioning Workflow- David Gerson
Social Media Brand Positioning Workflow- David GersonSocial Media Brand Positioning Workflow- David Gerson
Social Media Brand Positioning Workflow- David GersonPyData
 
Perceptual Mapping using Twitter Data
Perceptual Mapping using Twitter DataPerceptual Mapping using Twitter Data
Perceptual Mapping using Twitter DataDavid Gerson
 
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Fwdays
 
Predicting Tweet Sentiment
Predicting Tweet SentimentPredicting Tweet Sentiment
Predicting Tweet SentimentLucinda Linde
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSumit Raj
 
ML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queriesML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queriesVarun Nathan
 
ML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queriesML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queriesVarun Nathan
 
Market Research Meets Big Data Analytics for Business Transformation
Market Research Meets Big Data Analytics  for Business Transformation Market Research Meets Big Data Analytics  for Business Transformation
Market Research Meets Big Data Analytics for Business Transformation Sally Sadosky
 
STAT200 Assignment #2 - Descriptive Statistics Analysis and.docx
STAT200 Assignment #2 - Descriptive Statistics Analysis and.docxSTAT200 Assignment #2 - Descriptive Statistics Analysis and.docx
STAT200 Assignment #2 - Descriptive Statistics Analysis and.docxrafaelaj1
 
[DSC MENA 24] Nada_GabAllah_-_Advancement_in_NLP_and_Text_Analytics.pptx
[DSC MENA 24] Nada_GabAllah_-_Advancement_in_NLP_and_Text_Analytics.pptx[DSC MENA 24] Nada_GabAllah_-_Advancement_in_NLP_and_Text_Analytics.pptx
[DSC MENA 24] Nada_GabAllah_-_Advancement_in_NLP_and_Text_Analytics.pptxDataScienceConferenc1
 
ChatGPT and OpenAI.pdf
ChatGPT and OpenAI.pdfChatGPT and OpenAI.pdf
ChatGPT and OpenAI.pdfSonal Tiwari
 
Research Literature OUTLINE, APA Format!American writer Wa.docx
Research Literature OUTLINE, APA Format!American writer Wa.docxResearch Literature OUTLINE, APA Format!American writer Wa.docx
Research Literature OUTLINE, APA Format!American writer Wa.docxWilheminaRossi174
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
BTech Final Project (1).pptx
BTech Final Project (1).pptxBTech Final Project (1).pptx
BTech Final Project (1).pptxSwarajPatel19
 
SentimentAnalysisofTwitterProductReviewsDocument.pdf
SentimentAnalysisofTwitterProductReviewsDocument.pdfSentimentAnalysisofTwitterProductReviewsDocument.pdf
SentimentAnalysisofTwitterProductReviewsDocument.pdfDevinSohi
 
IRJET- Twitter Sentimental Analysis for Predicting Election Result using ...
IRJET-  	  Twitter Sentimental Analysis for Predicting Election Result using ...IRJET-  	  Twitter Sentimental Analysis for Predicting Election Result using ...
IRJET- Twitter Sentimental Analysis for Predicting Election Result using ...IRJET Journal
 
Continuous Learning Systems: Building ML systems that learn from their mistakes
Continuous Learning Systems: Building ML systems that learn from their mistakesContinuous Learning Systems: Building ML systems that learn from their mistakes
Continuous Learning Systems: Building ML systems that learn from their mistakesAnuj Gupta
 
How Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment AnalysisHow Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment AnalysisCrowdFlower
 

Similar to Self Trending a Tweet - Cluster and Topic Analysis on Tweets (20)

Social Media Brand Positioning Workflow- David Gerson
Social Media Brand Positioning Workflow- David GersonSocial Media Brand Positioning Workflow- David Gerson
Social Media Brand Positioning Workflow- David Gerson
 
Perceptual Mapping using Twitter Data
Perceptual Mapping using Twitter DataPerceptual Mapping using Twitter Data
Perceptual Mapping using Twitter Data
 
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
 
Predicting Tweet Sentiment
Predicting Tweet SentimentPredicting Tweet Sentiment
Predicting Tweet Sentiment
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
 
ML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queriesML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queries
 
2201.00598.pdf
2201.00598.pdf2201.00598.pdf
2201.00598.pdf
 
ML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queriesML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queries
 
Market Research Meets Big Data Analytics for Business Transformation
Market Research Meets Big Data Analytics  for Business Transformation Market Research Meets Big Data Analytics  for Business Transformation
Market Research Meets Big Data Analytics for Business Transformation
 
STAT200 Assignment #2 - Descriptive Statistics Analysis and.docx
STAT200 Assignment #2 - Descriptive Statistics Analysis and.docxSTAT200 Assignment #2 - Descriptive Statistics Analysis and.docx
STAT200 Assignment #2 - Descriptive Statistics Analysis and.docx
 
[DSC MENA 24] Nada_GabAllah_-_Advancement_in_NLP_and_Text_Analytics.pptx
[DSC MENA 24] Nada_GabAllah_-_Advancement_in_NLP_and_Text_Analytics.pptx[DSC MENA 24] Nada_GabAllah_-_Advancement_in_NLP_and_Text_Analytics.pptx
[DSC MENA 24] Nada_GabAllah_-_Advancement_in_NLP_and_Text_Analytics.pptx
 
ChatGPT and OpenAI.pdf
ChatGPT and OpenAI.pdfChatGPT and OpenAI.pdf
ChatGPT and OpenAI.pdf
 
Research Literature OUTLINE, APA Format!American writer Wa.docx
Research Literature OUTLINE, APA Format!American writer Wa.docxResearch Literature OUTLINE, APA Format!American writer Wa.docx
Research Literature OUTLINE, APA Format!American writer Wa.docx
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
BTech Final Project (1).pptx
BTech Final Project (1).pptxBTech Final Project (1).pptx
BTech Final Project (1).pptx
 
SentimentAnalysisofTwitterProductReviewsDocument.pdf
SentimentAnalysisofTwitterProductReviewsDocument.pdfSentimentAnalysisofTwitterProductReviewsDocument.pdf
SentimentAnalysisofTwitterProductReviewsDocument.pdf
 
Tldr
TldrTldr
Tldr
 
IRJET- Twitter Sentimental Analysis for Predicting Election Result using ...
IRJET-  	  Twitter Sentimental Analysis for Predicting Election Result using ...IRJET-  	  Twitter Sentimental Analysis for Predicting Election Result using ...
IRJET- Twitter Sentimental Analysis for Predicting Election Result using ...
 
Continuous Learning Systems: Building ML systems that learn from their mistakes
Continuous Learning Systems: Building ML systems that learn from their mistakesContinuous Learning Systems: Building ML systems that learn from their mistakes
Continuous Learning Systems: Building ML systems that learn from their mistakes
 
How Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment AnalysisHow Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment Analysis
 

Recently uploaded

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 

Recently uploaded (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 

Self Trending a Tweet - Cluster and Topic Analysis on Tweets

  • 1. Self Trending a Tweet Cluster and Topic Analysis on Tweets © Mor Krispil, 2019 1
  • 2. Abstract In a Twitter dataset we are provided with tweets, retrieved per a pre-defined “Trend”. Can we verify those trends back from the raw statuses? If so – we could use this technique to topic any un-trended tweet-list! On a tweet dataset, curated from a top of 10 twitter trends, I researched different Natural Language Processing (NLP) and Clustering techniques to apply on the raw tweets’ text. I found that with the right NLP and Clustering – I could verify ~80% of the tweets back to their labeled trends! 2© Mor Krispil, 2019
  • 3. Motivation While trended tweets can be provided from the tweeter API, they represent only the latest and a fraction of all public tweets. In addition, they are grouped into trends with a high degree of bias (like promotions and popularity indices). In order to group any tweet list: historical, non “popular” and so on – we’ll need to come by with a Tweet-to-Topic technique that can be verified. Twitter, no doubt, have put tremendous work into their trend labels – so we could actually use those labels to verify our Topic-ing abilities! (ground truths) 3© Mor Krispil, 2019
  • 4. Dataset(s) I got the Twitter data from the Twitter API, and Clustering and NLP techniques I use in my day to day job. I mined my dataset from the Twitter API, in 2 steps: ● Top 10 Twitter Trends in the US, mined with the twitter trends-per-place API. Sample tuple: (trend name, trend query string, tweet volume) ● Per each trend: I mined ~800-1000 Tweets, in an API call iterations, with the twitter search-per-query API. Sample tuple: (trend name, tweet text, tweet hashtag list) Total size: (7528, 3) 4© Mor Krispil, 2019
  • 5. Dataset(s) In Addition I used small helper-datasets ● English stop-words from: NLTK, stop_words and genism libraries ● Manually added some tweet-specific missing stop words 5© Mor Krispil, 2019
  • 6. Data Preparation and Cleaning Empty values ● No missing values were encountered in the mining process ● In the enrichment process, some text vectorization calculation created small missing values in the new features ● I preferred replacement with “0” instead of dropping rows, since some matrix calculation were heavy, and steps’ checkpoints had to be cached, so row- count had to be kept consistent 6© Mor Krispil, 2019
  • 7. Data Preparation and Cleaning Non-Unicode text problems ● It caused issues with printing results to the notebook, and with some libraries like matplotlib ● I filtered the mining process only to the US trend-place ● I filtered the tweets API to English locale (‘en’) ● I still had to make sure and encode text values as utf-8 ● I had to decode back from utf-8 in other cases 7© Mor Krispil, 2019
  • 8. Data Preparation and Cleaning Twitter specific text pre-processing ● A tweet has a little bit different rules of punctuations ● Special conventions like: hashtags and mentions ● I used an NLTK object built for tokenizing tweets – TweetTokenizer ● For the LDA part of the NLP, I used genism’s preprocessing, WordNetLemmatizer and SnowballStemmer objects 8© Mor Krispil, 2019
  • 9. Research Question(s) ● Can I cluster / topic the tweets back into their ground-truth trends, based only on the raw tweets’ text? ● Is this method accurate enough for clustering / topic-ing un-trended tweets into meaningful groups? 9© Mor Krispil, 2019
  • 10. Methods
    Clustering validation method:
    ● Plotting a 2-D TSNE scatter plot (after dimensionality reduction) of the current-stage data frame
    ● Coloring each dot by its “ground truth” trend
    ● The more each color concentrates in its own cluster, the better!
    This way I could check whether a specific object, or even a single parameter in the pipeline, improves or worsens the previous state.
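The validation above is visual. A numeric counterpart, offered here only as an assumption and not as the author's method, is cluster purity: the share of clustered tweets whose trend equals the majority trend of their cluster. The function name, the toy labels and the use of `-1` for unclustered points (the label HDBSCAN assigns to noise) are illustrative choices.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, trend_labels, noise_label=-1):
    """Fraction of clustered points that share their cluster's majority trend."""
    by_cluster = defaultdict(Counter)
    for cid, trend in zip(cluster_ids, trend_labels):
        if cid == noise_label:
            continue  # unclustered points are left out of the score
        by_cluster[cid][trend] += 1

    total = sum(sum(counts.values()) for counts in by_cluster.values())
    if total == 0:
        return 0.0
    # For each cluster, count the points belonging to its most common trend.
    matched = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return matched / total


# Toy data: two clusters plus one noise point.
clusters = [0, 0, 0, 1, 1, 1, -1]
trends = ["Meek", "Meek", "Paul Ryan", "Paul Ryan", "Paul Ryan", "Paul Ryan", "Meek"]
purity = cluster_purity(clusters, trends)
print(round(purity, 3))
```

Purity rewards many tiny clusters, so it is best read next to the number of clusters and the share of noise points rather than on its own.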
  • 11. Methods
    Topic Modeling validation method, used to keep track of progress:
    ● I collected the top 2-4 tokens (words) in each of the top 10 topics
    ● I compared them to the top 10 trends’ names from the Twitter dataset
    ● The more the topics covered the trends, the better!
    * This is a unique case where we’re dealing with an unsupervised task, but we’re able to validate it like a supervised one!
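One way to turn this comparison into a number is sketched below. The matching rule (a trend counts as covered when any word of its name appears among the topics' top tokens, ignoring case and a leading `#`) is an assumption; the deck does not state how matches were judged. The function name and the toy inputs are illustrative.

```python
def trend_coverage(trend_names, topic_top_words):
    """Return (fraction of trends covered, list of covered trends)."""
    if not trend_names:
        return 0.0, []
    # Pool the top words of all topics into one normalized vocabulary.
    vocab = {w.lower().lstrip("#") for words in topic_top_words for w in words}

    covered = []
    for trend in trend_names:
        tokens = [t.lower().lstrip("#") for t in trend.split()]
        if any(tok in vocab for tok in tokens):
            covered.append(trend)
    return len(covered) / len(trend_names), covered


# Toy example with three trends and two topics.
trends = ["Paul Ryan", "#NOvsDAL", "Blade Runner"]
topics = [["ryan", "paul", "today"], ["#novsdal", "cowboy"]]
score, hits = trend_coverage(trends, topics)
print(score, hits)
```

Because stemming shortens tokens, an exact-match rule like this one undercounts; a prefix match would be a looser alternative.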
  • 12. Methods
    Text Tokenization: initial text processing, extracting the “essence” of the text
      ○ Custom tokenization of a tweet’s text
      ○ After trying many tokenizers, I settled on NLTK’s TweetTokenizer object
    ● Engineering 4 features:
      ○ Text length and word count of the original text
      ○ Text length and word count of the cleaned text
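The four length features can be sketched as below. The cleaning step is a hypothetical stand-in (a regex that drops URLs and punctuation); the deck used NLTK's TweetTokenizer for the real cleaning, so the numbers would differ. The feature names are illustrative as well.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Anything that is not a word character, '#', '@' or whitespace.
PUNCT_RE = re.compile(r"[^\w#@\s]")


def clean_tweet(text):
    """Hypothetical cleaner: drop URLs and punctuation, lowercase, collapse spaces."""
    text = URL_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    return " ".join(text.lower().split())


def length_features(text):
    """Text length and word count of the original and of the cleaned text."""
    cleaned = clean_tweet(text)
    return {
        "orig_len": len(text),
        "orig_words": len(text.split()),
        "clean_len": len(cleaned),
        "clean_words": len(cleaned.split()),
    }


features = length_features("Go #Cowboys!!! https://t.co/abc")
print(features)
```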
  • 13. Methods
    Word2Vec: a word vectorization technique that produces word embeddings, used for extracting topic and context from texts.
    ● Applying a pre-trained word vectorization (the Google News vectors) to the cleaned text. Each word is represented as a 300-dimensional vector, and the entire tweet’s text (= sentence) is then reduced to a single 300-dimensional vector
    ● Engineering 2 features:
      ○ The skew of that vector (1 float)
      ○ The kurtosis of that vector (1 float)
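The two Word2Vec features can be illustrated with a toy embedding table. Several things here are assumptions: the tweet vector is taken as the plain average of its word vectors (the deck does not say how the words were combined), the 4-dimensional embeddings stand in for the 300-dimensional Google News vectors, and degenerate inputs return 0.0, mirroring the deck's choice of filling missing values with 0.

```python
from statistics import fmean


def mean_vector(tokens, embeddings):
    """Average the vectors of the tokens that exist in the embedding table."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return []
    return [fmean(column) for column in zip(*vectors)]


def skew_and_kurtosis(values):
    """Population skewness and excess kurtosis (0 for a normal distribution)."""
    if not values:
        return 0.0, 0.0
    mu = fmean(values)
    m2 = fmean([(v - mu) ** 2 for v in values])
    if m2 == 0:
        return 0.0, 0.0  # constant vector: moments are undefined, use 0 by choice
    m3 = fmean([(v - mu) ** 3 for v in values])
    m4 = fmean([(v - mu) ** 4 for v in values])
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Toy 4-dimensional embeddings (hypothetical values).
toy_embeddings = {"paul": [1.0, 0.0, 2.0, 1.0], "ryan": [3.0, 0.0, 0.0, 1.0]}
vec = mean_vector(["paul", "ryan", "unknownword"], toy_embeddings)
skew, kurt = skew_and_kurtosis(vec)
print(vec, skew, kurt)
```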
  • 14. Methods
    Term Frequency (TFIDF)
    I researched different options for:
    ● scikit-learn objects: HashingVectorizer, TfidfTransformer, TfidfVectorizer
    ● Stop words, tokenization
    ● Distance metrics from scipy and scikit-learn
    ● Engineering ~50-500 features of TFIDF values
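To show what a TFIDF feature row contains, here is a small standard-library sketch. It is not the scikit-learn pipeline used in the deck: it applies the smoothed IDF formula ln((1 + n) / (1 + df)) + 1, which scikit-learn documents as its default, followed by L1 normalization. The function name and the toy documents are illustrative.

```python
import math
from collections import Counter


def tfidf_l1(docs):
    """TF-IDF rows as {term: weight} dicts, smoothed IDF, L1-normalized."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = sum(weights.values())  # L1 norm (all weights are positive)
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


# Toy pre-tokenized tweets.
docs = [["paul", "ryan", "vote"], ["paul", "ryan"], ["meek", "album"]]
rows = tfidf_l1(docs)
for row in rows:
    print(row)
```

With L1 normalization each row sums to 1, so rarer terms such as "vote" take a larger share of a tweet's weight than terms shared across tweets.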
  • 15. Methods
    Clustering: Hierarchical Density-Based Spatial Clustering (HDBSCAN)
    ● HDBSCAN’s main advantages here:
      ○ Less sensitive to sparse data (like TFIDF results)
      ○ Has a built-in advanced clustering validation: DBCV
    ● Different dimensionality reduction params using TruncatedSVD
    ● I researched different HDBSCAN params: min-cluster-size, distance metric, etc.
  • 16. Methods
    NLP: Topic Modeling using an LDA model
    ● Using the LdaMulticore object from the gensim library for LDA
    ● Lemmatization and stemming with NLTK’s WordNetLemmatizer and SnowballStemmer objects
    ● Building a corpus from all the tweets and a bag of words for each tweet
    ● Calculating the top 10 topics, assuming they’d converge to our 10 trends
  • 17. Findings – with Tokenization features – best TSNE
    • Twitter links formatting
    • Tweet Tokenization
    • Text length, #words, #hashtags
    • TSNE with Canberra metric
  • 18. Findings – with Word2Vec features – best TSNE
    • Skew and Kurtosis measurements
    • TSNE with Hamming metric
  • 19. Findings – with TFIDF features – best TSNE
    • ~85% clustering match of the raw tweets into the Trends!!
    • 512 TFIDF features, unicode accent stripping, L1 normalization
    • TSNE with PCA initialization and Russellrao metric
  • 20. Findings – Text Features Clustering
    Each stage (tokenization, word vectorization and TFIDF vectorization) required tedious tuning. In the end, I found that ~85% of the raw tweet samples were mapped back into their Trends!!
  • 21. Findings – NLP LDA’s Topics vs our top Trends
    Trend / Topic Order | Trend Name | Topic Top 3 Words
    1 | Laura Loomer | #novsdal, #dallascowboy, cowboy
    2 | Meek | abcd, girl, name
    3 | #GoodFormVideo | abcd, one, ryan
    4 | Santiago Bernabéu | #novsdal, meek, album
    5 | Marc Lamont Hill | ryan, paul, today
    6 | Abcde | never, hill, marc
    7 | Paul Ryan | abcd, kid, girl
    8 | Ed Burke | loomer, laura, twitter
    9 | Blade Runner | ryan, paul, polit
    10 | #NOvsDAL | vote, million, paul
  • 22. Findings – NLP LDA Topic Modeling
    After some tuning (not nearly as much as in the clustering stage), I found a high degree (~75%) of convergence between the original 10 trends’ names and the LDA’s 10 topics.
    ● The LDA’s topics are not distinct like cluster labels; each is a set of probabilities that a collection of topic-words belongs to the same “Topic”
    ● Tuning the LDA for more than 10 topics increased the coverage of the 10 trends, as expected
    ● Adding additional tweet-specific stop words helped as well
  • 23. Conclusions
    ● Can I cluster / topic the tweets back into their ground-truth trends, based only on the raw tweets’ text? – I believe so, with ~80% success using an ensemble of techniques from clustering and topic modeling
    ● Is this method accurate enough for clustering / topic-ing un-trended tweets into meaningful groups? – I believe so. All learning stages were done without knowledge of the original Trends (which served only as a measure of success)
    ● In retrospect, tweets are very hard to text-process using conventional techniques, mainly because their text is heavily accented, repeated, shortened, etc.
  • 24. Limitations
    1. Text encoding limitation: the research was limited to English text from the US due to encoding compatibility. Given more time, I’d build a more comprehensive data collection and processing pipeline covering more languages, which I believe would improve the clustering results
    2. Twitter heavily biases its API towards the popular stream of tweets. When insisting on non-trendy tweets, the results get scarce. Twitter also limits data collection through the public API: no bulk access is allowed, just iterations of limited API calls, each ending with quite a small dataset
  • 25. Limitations
    3. TSNE computation is quite heavy, so I had to cache checkpoints of the data transformations on disk to avoid repeating work. However, this limited my ability to change the number of rows between steps, as the original indices could point to missing rows
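The on-disk checkpointing mentioned here can be sketched with the standard library. This is an assumed pattern, not the author's implementation: `cached_step` is a hypothetical helper that loads a pickled result if the file exists and otherwise runs the expensive step and saves its output.

```python
import pickle
from pathlib import Path


def cached_step(path, compute):
    """Load a pickled checkpoint if present, otherwise compute and save it."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()  # the heavy transformation, e.g. a TSNE fit
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result
```

A call such as `cached_step("tsne_stage.pkl", run_tsne)`, where `run_tsne` is any zero-argument function, would run the step once and reuse the file afterwards. The cache does not notice when upstream rows are added or dropped, which is the row-consistency constraint this slide describes.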
  • 26. Acknowledgements
    The data was obtained from the Twitter API, using a user access license.
  • 27. References
    ● https://code.google.com/archive/p/word2vec/
    ● McInnes, L., & Healy, J. (2017). Accelerated Hierarchical Density Based Clustering. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW) (pp. 33-42). IEEE.
    ● Campello, R. J., Moulavi, D., Zimek, A., & Sander, J. (2015). Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(1), 5.
    ● Moulavi, D., Jaskowiak, P. A., Campello, R. J., Zimek, A., & Sander, J. (2014). Density-Based Clustering Validation. In SDM (pp. 839-847).