Self Trending a Tweet - Cluster and Topic Analysis on Tweets

Self Trending a Tweet
Cluster and Topic Analysis on Tweets
© Mor Krispil, 2019

Abstract
A Twitter dataset provides tweets retrieved per pre-defined “Trend”. Can we
recover those trends from the raw statuses alone? If so, we could use the same
technique to assign topics to any un-trended tweet list!
On a tweet dataset curated from the top 10 Twitter trends, I researched different
Natural Language Processing (NLP) and clustering techniques to apply to the
raw tweets’ text.
I found that with the right NLP and clustering, I could map ~80% of the
tweets back to their labeled trends!

Motivation
While trended tweets can be retrieved from the Twitter API, they represent only
the latest tweets, and only a fraction of all public tweets. In addition, they
are grouped into trends with a high degree of bias (such as promotions and
popularity indices).
In order to group any tweet list (historical, non-“popular” and so on), we need
to come up with a Tweet-to-Topic technique that can be verified.
Twitter has no doubt put tremendous work into its trend labels, so we can
use those labels as ground truth to verify our topic-assignment abilities!

Dataset(s)
The Twitter data comes from the Twitter API; the clustering and NLP techniques
are ones I use in my day-to-day job.
I mined my dataset from the Twitter API in 2 steps:
● Top 10 Twitter Trends in the US, mined with the Twitter trends-per-place API.
Sample tuple: (trend name, trend query string, tweet volume)
● For each trend, I mined ~800-1000 tweets over repeated calls to the
Twitter search-per-query API. Sample tuple: (trend name, tweet text, tweet
hashtag list)
Total size: 7528 rows x 3 columns, i.e. a shape of (7528, 3)

Dataset(s)
In addition, I used small helper datasets:
● English stop words from the NLTK, stop_words and gensim libraries
● Some manually added tweet-specific stop words that were missing from those lists

Data Preparation and Cleaning
Empty values
● No missing values were encountered in the mining process
● In the enrichment process, some text-vectorization calculations produced a
small number of missing values in the new features
● I preferred replacing them with “0” over dropping rows: some matrix
calculations were heavy and intermediate checkpoints had to be cached, so the
row count had to stay consistent between steps
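
The zero-fill choice above can be sketched in a few lines. This is a minimal illustration in plain Python, not the original notebook code; the function name `fill_missing_with_zero` and the sample feature rows are hypothetical.

```python
import math

def fill_missing_with_zero(rows):
    """Return a copy of `rows` where None/NaN cells become 0.0; no row is dropped."""
    filled = []
    for row in rows:
        filled.append([
            0.0 if value is None or (isinstance(value, float) and math.isnan(value))
            else value
            for value in row
        ])
    return filled

# Hypothetical feature rows; NaN/None stand for a vectorization step with no result.
features = [
    [0.12, 3.4],
    [float("nan"), 1.1],
    [0.50, None],
]
clean = fill_missing_with_zero(features)
```

Because every row survives, row indices cached at an earlier checkpoint still line up with the filled data.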

Non-Unicode text problems
● Non-Unicode text caused issues with printing results to the notebook, and
with some libraries like matplotlib
● I restricted the mining process to the US trend-place
● I restricted the tweets API to the English locale (‘en’)
● I still had to explicitly encode text values as UTF-8
● In other cases I had to decode back from UTF-8
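
A minimal sketch of the encode/decode round trip described above, using only Python's built-in `str.encode` and `bytes.decode`. The helper names are hypothetical, and using `errors="replace"` for malformed bytes is an assumption, not something the slides specify.

```python
def to_utf8_bytes(text):
    """Encode a text value as UTF-8 bytes."""
    return text.encode("utf-8")

def from_utf8_bytes(raw):
    """Decode UTF-8 bytes; undecodable bytes become U+FFFD instead of raising."""
    return raw.decode("utf-8", errors="replace")

trend = "Santiago Bernabéu"
raw = to_utf8_bytes(trend)        # the accented 'é' becomes the two bytes 0xC3 0xA9
restored = from_utf8_bytes(raw)   # round-trips back to the original string
```

Printing `raw` directly shows the escaped byte pair `\xc3\xa9` in place of the accented letter, which is the kind of output that is easy to mistake for corrupted text.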

Twitter specific text pre-processing
● Tweets follow slightly different punctuation rules than regular text
● They use special conventions such as hashtags and mentions
● I used an NLTK object built for tokenizing tweets: TweetTokenizer
● For the LDA part of the NLP, I used gensim’s preprocessing together with
NLTK’s WordNetLemmatizer and SnowballStemmer objects

Research Question(s)
● Can I cluster / topic the tweets back into their ground-truth trends, based only
on the raw tweets’ text?
● Is this method accurate enough for clustering / topic-ing un-trended tweets into
meaningful groups?

Methods
Clustering Validation method:
● Plotting a 2-d scatter TSNE plot (after dimensionality reduction) of the current
stage data-frame
● Applying a color to each dot, tracking it back to its “ground truth” trend-color
● The more each color concentrates in its own cluster, the better!
This way I could check whether a specific object, or even a single parameter in
the pipeline, improves or worsens the previous state.
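
The validation above is visual. A numeric counterpart can be sketched as cluster purity: each cluster votes for its majority ground-truth trend, and the score is the share of tweets that agree with their cluster's vote. The slides do not state how the match percentage was computed, so treat this as one plausible measure; the function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_match_rate(cluster_ids, trends):
    """Share of samples whose trend equals the majority trend of their cluster."""
    by_cluster = defaultdict(list)
    for cid, trend in zip(cluster_ids, trends):
        by_cluster[cid].append(trend)
    # For each cluster, count the members that carry the most common trend label.
    matched = sum(Counter(members).most_common(1)[0][1]
                  for members in by_cluster.values())
    return matched / len(trends)

# Toy example: 8 tweets, 3 clusters, 3 ground-truth trends.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
trends = ["Paul Ryan", "Paul Ryan", "Meek", "Meek", "Meek",
          "#NOvsDAL", "#NOvsDAL", "Paul Ryan"]
rate = cluster_match_rate(cluster_ids, trends)  # 6 of 8 tweets match -> 0.75
```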

Methods
Topic Modeling Validation method - in order to keep track of progress, I used this
validation method:
● I collected the top 2-4 tokens (words) in each of the top 10 topics
● I compared them to the names of the top 10 trends from the Twitter dataset
● The more trends the topics cover, the better!
* This is a unique case where we are dealing with an unsupervised task, but are
able to validate it like a supervised one!
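
The comparison above can be sketched as a simple coverage check: a trend counts as covered when any token of its name appears among the topics' top words. The exact matching rule used in the research is not stated, so the exact-token rule here is an assumption, and the helper names are hypothetical. The sample trends and topic words are a subset of the results table in the findings.

```python
import re

def trend_tokens(name):
    """Lowercased word tokens of a trend name; '#' is dropped by the regex."""
    return set(re.findall(r"\w+", name.lower()))

def covered_trends(trend_names, topics):
    """Trends for which at least one name token is among the topics' top words."""
    topic_words = {word.lstrip("#").lower() for words in topics for word in words}
    return [name for name in trend_names if trend_tokens(name) & topic_words]

trend_names = ["Laura Loomer", "Meek", "Paul Ryan", "#NOvsDAL",
               "Ed Burke", "Blade Runner"]
topics = [
    ["loomer", "laura", "twitter"],
    ["#novsdal", "meek", "album"],
    ["ryan", "paul", "today"],
]
covered = covered_trends(trend_names, topics)
coverage = len(covered) / len(trend_names)  # 4 of the 6 sample trends are covered
```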

Methods
Text Tokenization – initial text processing, extracting the “essence” of the text
○ Custom tokenization of a tweet’s text
○ After trying many tokenizers, I settled on NLTK’s TweetTokenizer object
● Engineering 4 features:
○ Text length and word count of the original text
○ Text length and word count of the cleaned text

Methods
Word2Vec - a word vectorization technique that produces word embeddings, used for
extracting topic and context from texts.
● Applying pre-trained word vectors (the Google News vectors) to the
cleaned text. Each word is represented as a 300-dimensional vector, and the
entire tweet’s text (=sentence) is then combined into a single 300-dimensional vector
● Engineering 2 features:
○ A skew of that vector array (1 float)
○ A kurtosis of that vector array (1 float)
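
A minimal sketch of the two features, assuming the tweet vector is the element-wise mean of its word vectors (the slides do not say how the words were combined). The toy 4-dimensional embeddings stand in for the 300-dimensional Google News vectors, and the function names are hypothetical. Skew and kurtosis use population moments, with kurtosis reported as excess kurtosis (a normal distribution scores 0).

```python
def tweet_vector(tokens, embeddings):
    """Element-wise mean of the vectors of known tokens; None if none are known."""
    vectors = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def central_moment(values, k):
    """k-th population central moment."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

# Toy embeddings (hypothetical values, 4 dimensions instead of 300).
toy_embeddings = {"paul": [1.0, 0.0, 2.0, 1.0], "ryan": [3.0, 2.0, 0.0, 3.0]}
vec = tweet_vector(["paul", "ryan", "unknownword"], toy_embeddings)  # [2.0, 1.0, 1.0, 2.0]
skew_feature = skewness(vec)             # symmetric values -> 0.0
kurtosis_feature = excess_kurtosis(vec)  # two-point distribution -> -2.0
```

Each tweet thus contributes two floats, regardless of how many words it contains.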

Methods
Term Frequency - Inverse Document Frequency (TFIDF)
I researched different combinations of:
● scikit-learn objects: HashingVectorizer, TfidfTransformer, TfidfVectorizer
● Stop words, tokenization
● Distance metrics from scipy and scikit-learn
● Engineering ~50-500 features of tfidf values
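
To make the TFIDF features concrete, here is a dependency-free sketch of TF-IDF with L1 row normalization. The research used scikit-learn's vectorizers; this stand-in uses the smoothed idf formula `ln((1 + n) / (1 + df)) + 1`, which to my understanding is scikit-learn's default, but treat that correspondence as an assumption. The function name and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_l1(docs):
    """Return (vocabulary, matrix) with one L1-normalized TF-IDF row per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1.0 for term in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = [tf[term] * idf[term] for term in vocab]
        total = sum(abs(w) for w in weights)
        rows.append([w / total if total else 0.0 for w in weights])
    return vocab, rows

docs = [
    ["paul", "ryan", "vote"],
    ["laura", "loomer", "twitter"],
    ["paul", "ryan", "today"],
]
vocab, matrix = tfidf_l1(docs)
```

With L1 normalization each row sums to 1, and a term that appears in fewer tweets ("vote") gets a larger weight than one shared across tweets ("paul").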

Methods
Clustering - Hierarchical Density-Based Spatial Clustering of Applications with
Noise (HDBSCAN)
● HDBSCAN’s main advantages here:
○ Less sensitive to sparse data (like TFIDF results)
○ Has a built-in advanced clustering validation measure: DBCV
● I tried different dimensionality-reduction parameters using TruncatedSVD
● I researched different HDBSCAN parameters: min-cluster-size, distance metric,
etc.

Methods
NLP – Topic Modeling using an LDA model
● Using the LdaMulticore object from the gensim library for LDA
● Lemmatization and stemming with NLTK’s WordNetLemmatizer and
SnowballStemmer objects
● Building a Corpus from all the tweets and a Bag Of Words per each tweet
● Calculating the top 10 Topics – assuming they’d converge to our 10 Trends
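
The corpus-building step can be sketched without gensim. Each tweet becomes a sorted list of `(token_id, count)` pairs, which is the bag-of-words shape that gensim's LDA models consume. The helper names are hypothetical, and the alphabetical id assignment is a simplification, not gensim's behavior.

```python
from collections import Counter

def build_dictionary(tokenized_tweets):
    """Map every distinct token in the corpus to an integer id."""
    vocab = sorted({tok for tweet in tokenized_tweets for tok in tweet})
    return {tok: idx for idx, tok in enumerate(vocab)}

def to_bow(tokens, token2id):
    """Bag of words for one tweet: sorted (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Toy tweets, already tokenized, lemmatized and stemmed.
tweets = [["paul", "ryan", "vote", "vote"], ["laura", "loomer", "twitter"]]
token2id = build_dictionary(tweets)
corpus = [to_bow(tweet, token2id) for tweet in tweets]
```

The resulting `corpus` and dictionary are the two inputs an LDA model needs, along with the number of topics (10 in this research).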

Findings – with Tokenization features – best TSNE
• Twitter links formatting
• Tweet Tokenization
• Text length, #words, #hashtags
• TSNE with Canberra metric

Findings – with Word2Vec features – best TSNE
• Skew and Kurtosis measurements
• TSNE Hamming metric

Findings – with TFIDF features – best TSNE
• ~85% clustering match of the raw tweets into the Trends!!
• 512 TFIDF features, unicode accent, L1 normalization
• TSNE with PCA initialization and Russellrao metric
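
For reference, the three TSNE distance metrics named in these findings (Canberra, Hamming, Russell-Rao) can be written out in plain Python. These follow the definitions as I understand them from `scipy.spatial.distance`; the slides do not say which implementation was used, and applying Russell-Rao to truthy/falsy (binarized) values is an assumption.

```python
def canberra(u, v):
    """Sum of |a - b| / (|a| + |b|); terms where both values are 0 contribute 0."""
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom:
            total += abs(a - b) / denom
    return total

def hamming(u, v):
    """Proportion of positions where the two vectors differ."""
    return sum(a != b for a, b in zip(u, v)) / len(u)

def russellrao(u, v):
    """Proportion of positions that are NOT non-zero in both vectors."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    return (len(u) - both) / len(u)

d_canberra = canberra([1, 0, 2], [3, 0, 2])           # 2/4 + 0 + 0 = 0.5
d_hamming = hamming([1, 0, 1, 1], [1, 1, 0, 1])       # 2 of 4 differ = 0.5
d_russellrao = russellrao([1, 0, 1, 1], [1, 1, 0, 1])  # (4 - 2) / 4 = 0.5
```

Russell-Rao only rewards terms present in both tweets, which may be why it suited the sparse TFIDF features.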

Findings – Text Features Clustering
Each stage required tedious tuning: tokenization, word
vectorization and TFIDF vectorization.
In the end, I found that ~85% of the raw tweet samples were mapped
back into their Trends!!

Findings – NLP LDA’s Topics vs our top Trends
Trend / Topic Order   Trend Name           Topic Top 3 Words
1                     Laura Loomer         #novsdal, #dallascowboy, cowboy
2                     Meek                 abcd, girl, name
3                     #GoodFormVideo       abcd, one, ryan
4                     Santiago Bernabéu    #novsdal, meek, album
5                     Marc Lamont Hill     ryan, paul, today
6                     Abcde                never, hill, marc
7                     Paul Ryan            abcd, kid, girl
8                     Ed Burke             loomer, laura, twitter
9                     Blade Runner         ryan, paul, polit
10                    #NOvsDAL             vote, million, paul

Findings - NLP LDA Topic Modeling
After some tuning (not nearly as much as in the clustering stage), I found a high
degree (~75%) of convergence between the original 10 trends’ names and the
LDA’s 10 Topics.
● LDA topics are not distinct like cluster labels; each topic is a set of
probabilities for a collection of topic-words to be considered part of the same
“Topic”
● Tuning the LDA for more than 10 topics increased the coverage of the 10
trends, as expected
● Adding additional Tweet-specific stop words helped as well

Conclusions
● Can I cluster / topic the tweets back into their ground-truth trends, based only
on the raw tweets’ text? – I believe so, with ~80% success using an
ensemble of techniques from clustering and topic modeling
● Is this method accurate enough for clustering / topic-ing un-trended tweets into
meaningful groups? – I believe so. All learning stages were done without
knowledge of the original Trends (they were used only to measure success)
● In retrospect, tweets are very hard to text-process using conventional
techniques, mainly because their text is heavily accented, repetitive,
shortened, etc.

Limitations
1. Text encoding limitation – the research was limited to English text from the US
due to encoding compatibility. Given more time, I would build a more
comprehensive data collection and processing pipeline covering more languages,
which I believe would improve the clustering results
2. Twitter heavily biases its API toward the popular stream of tweets. When
insisting on non-trendy tweets, the results get scarce. Twitter also limits
data collection through the public API: no bulk access is allowed, only
iterations of limited API calls, each ending with quite a small dataset

Limitations
3. TSNE computation is quite heavy, so I had to cache checkpoints of the data
transformations on disk to avoid repeating work. However, this limited my ability
to change the number of rows between steps, as the original indices could point
to missing rows

Acknowledgements
The data was obtained from the Twitter API, using a user access license.

References
● https://code.google.com/archive/p/word2vec/
● McInnes, L., & Healy, J. (2017). Accelerated Hierarchical Density Based
Clustering. In 2017 IEEE International Conference on Data Mining Workshops
(ICDMW) (pp. 33-42). IEEE.
● Campello, R. J., Moulavi, D., Zimek, A., & Sander, J. (2015). Hierarchical
density estimates for data clustering, visualization, and outlier detection. ACM
Transactions on Knowledge Discovery from Data (TKDD), 10(1), 5.
● Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J., 2014.
Density-Based Clustering Validation. In SDM (pp. 839-847).
