A Topic Analysis Approach To Revealing Discussions On The Australian Twittersphere
1. A TOPIC ANALYSIS APPROACH TO REVEALING
DISCUSSIONS ON THE AUSTRALIAN TWITTERSPHERE
Brenda Moon
Queensland University of Technology
2. Introduction
This paper investigates techniques to identify the topics
being discussed in one week of tweets from the
Australian Twittersphere. Tweets were extracted from the
Tracking Infrastructure for Social Media Analysis (TrISMA),
a comprehensive dataset which captures all tweets by
2.8m Australian accounts (Bruns, Burgess & Banks et al.,
2016).
3. Selected week: Sunday 2 August to Saturday 8
August 2015
• Thursday 6th August 2015 was used for One Day in
the Life of a National Twittersphere (Axel Bruns and Brenda
Moon, presented at Social Media and Society, London, 13 July 2016)
• Same day used for initial development of topic
modelling approach
• Then extended to full week
5. Data cleaning
• Remove
– retweets & multitweets (“rt”, “mt” or “via”)
– URLs
– dates, times, distances & weights
– words shorter than 3 characters
– ellipses (‘...’)
• NLTK tokenisation using the Twitter tokenizer (TweetTokenizer)
– Remove all @users and URLs
– Lowercase
• Convert
– HTML entities to text
– Hashtags to words (trim ‘#’ off hashtags)
• NLTK lemmatisation
• NLTK stopwords
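The cleaning steps above can be approximated with the Python standard library alone. This is a minimal sketch, not the pipeline used for the slides: the regular expressions, the tiny `STOPWORDS` set and the function name `clean_tweet` are all illustrative assumptions, lemmatisation is omitted, and NLTK's tokeniser and stopword list are replaced by simple stand-ins. It also assumes that a tweet containing “rt”, “mt” or “via” is dropped entirely.

```python
import html
import re

# Hypothetical, tiny stopword list for illustration; the slides use NLTK's list.
STOPWORDS = {"the", "and", "for", "with", "this", "that", "was", "are"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Rough patterns for times (10:30, 7pm), dates (6/8/2015),
# distances and weights (5km, 70 kg).
MEASURE_RE = re.compile(
    r"\b\d{1,2}:\d{2}\b|\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b"
    r"|\b\d+(?:\.\d+)?\s?(?:am|pm|km|kg|mi|lb|lbs|cm|mm|m|g)\b"
)
RETWEET_RE = re.compile(r"\b(?:rt|mt|via)\b")
TOKEN_RE = re.compile(r"[a-z0-9']+")


def clean_tweet(text):
    """Return a list of cleaned tokens, or None if the tweet looks like a retweet."""
    # Convert HTML entities to text and lowercase.
    text = html.unescape(text).lower()
    # Assumption: tweets marked "rt", "mt" or "via" are dropped entirely.
    if RETWEET_RE.search(text):
        return None
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = MEASURE_RE.sub(" ", text)
    # Trim '#' so hashtags become ordinary words.
    text = text.replace("#", "")
    # The token pattern also discards punctuation such as ellipses.
    tokens = TOKEN_RE.findall(text)
    # Drop short words and stopwords.
    return [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]


example = clean_tweet(
    "Watching the #Ashes at 10:30 with @mate &amp; friends... https://t.co/abc 5km away"
)
retweet = clean_tweet("RT @user: great match")
```

In a fuller version the regex tokeniser would be swapped for NLTK's `TweetTokenizer` and a lemmatisation pass would follow; note that matching “via” as a plain word will also drop some ordinary tweets.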
6. Hashtag pooling
• Mehrotra, Sanner, Buntine & Xie (2013) compared
different ways of ‘pooling’ tweets into documents
before LDA analysis to see whether this improved the topic models.
They found that hashtag pooling was effective (the best results
came from hashtag pooling with clustering, but that is more
complex to apply)
• Group all the tweets with hashtags into documents for
each hashtag (some tweets will be added into more than
one document)
• Tweets without hashtags stay as individual documents
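The pooling rule described above can be sketched in a few lines of Python. The function name `pool_by_hashtag`, the hashtag pattern and the choice to lowercase hashtags before grouping are assumptions made for illustration; the slides do not specify these details.

```python
import re
from collections import defaultdict

HASHTAG_RE = re.compile(r"#(\w+)")


def pool_by_hashtag(tweets):
    """Group tweets into one document per hashtag.

    Returns (pools, singles): pools maps each hashtag to the list of tweets
    that contain it; singles holds one single-tweet document for every
    tweet without a hashtag.
    """
    pools = defaultdict(list)
    singles = []
    for text in tweets:
        # Assumption: hashtags are compared case-insensitively.
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        if tags:
            # A tweet with several hashtags is added to several documents.
            for tag in tags:
                pools[tag].append(text)
        else:
            singles.append([text])
    return dict(pools), singles


tweets = [
    "#auspol budget news",
    "cricket today #Ashes #auspol",
    "good morning",
]
pools, singles = pool_by_hashtag(tweets)
documents = list(pools.values()) + singles
```

Here the second tweet lands in both the `auspol` and `ashes` documents, while the untagged tweet remains a document of its own.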
7. Corpus filtering
(Thursday 6 August 2015)
• Raw tweets: 963,064
• After data cleaning: 583,528
• After hashtag pooling: 516,263
– 23% of tweets had hashtags
• Dictionary pruning – remove most frequent and least
frequent terms
– no_above=0.5 (fraction of documents), no_below=5
(documents)
– 223,157 unique tokens reduced to 49,964 unique tokens
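The pruning rule can be illustrated without any library. The parameter names `no_below` and `no_above` match those of gensim's `Dictionary.filter_extremes`, but the slides do not name the library used, so the sketch below is a plain-Python reading of the rule under that assumption: keep a token only if it occurs in at least `no_below` documents and in no more than a fraction `no_above` of all documents. The function name `prune_vocabulary` is hypothetical.

```python
from collections import Counter


def prune_vocabulary(documents, no_below=5, no_above=0.5):
    """Return the set of tokens to keep.

    documents: list of token lists (one list per pooled document).
    no_below:  minimum number of documents a token must appear in.
    no_above:  maximum fraction of documents a token may appear in.
    """
    doc_freq = Counter()
    for tokens in documents:
        # Count each token once per document (document frequency).
        doc_freq.update(set(tokens))
    max_docs = no_above * len(documents)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


docs = [["a", "b"], ["a", "c"], ["a", "b"], ["a"]]
kept = prune_vocabulary(docs, no_below=2, no_above=0.5)
```

In this toy corpus `a` appears in every document and is removed as too frequent, `c` appears once and is removed as too rare, and only `b` survives.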
16. Thursday 6th August 2015: 30 topics, with hashtag pooling
Comparison to other study. Figure annotations: Pop? Teen culture? MH370
17. 1.1m tweets from 147k accounts, to 224k accounts
294k nodes total, including non-Australians
535k edges from 856k @mentions / RTs
Visualisation: Gephi, Force Atlas 2
Colours: Gephi, modularity resolution 1.0
Labels assigned through qualitative evaluation
Cluster labels: Politics, Cricket, Teen Culture, Pop
From “One Day in the Life of a National Twittersphere” by Axel Bruns and
Brenda Moon, presented at Social Media and Society, London, 13 July 2016.
18. Further Outlook
• Confirm initial topic labelling by looking at top tweets for each
topic
• Check whether the hashtag pooling has allowed non-hashtag
tweet topics to still be visible
• Use statistical coherence of model (U_Mass Coherence, C_V
coherence) to tune LDA parameters
• Model different numbers of topics (coarse/fine grain)
• Relate topics per user back to our mention network graphs
• Extend to the full week (or longer)
• Compare to alternative approaches
– Doc2Vec / TensorFlow / dynamic LDA etc.
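As an illustration of the U_Mass coherence mentioned above, the sketch below scores one topic's top words by how often they co-occur in documents, summing log((D(w_m, w_l) + 1) / D(w_l)) over all pairs where w_l ranks above w_m, with D counting documents. The function name `umass_coherence` is hypothetical, the sketch assumes every top word occurs in at least one document, and library implementations may normalise or smooth the score differently.

```python
import math


def umass_coherence(top_words, documents):
    """U_Mass-style coherence for one topic.

    top_words: the topic's top words, most probable first.
    documents: list of token lists.
    Assumes every word in top_words occurs in at least one document.
    """
    doc_sets = [set(tokens) for tokens in documents]

    def doc_count(*words):
        # Number of documents containing all the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_count(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_count(top_words[l]))
    return score


docs = [["a", "b"], ["a", "b"], ["a"], ["c"]]
coherent = umass_coherence(["a", "b"], docs)    # words that co-occur
incoherent = umass_coherence(["a", "c"], docs)  # words that never co-occur
```

Scores closer to zero indicate top words that tend to appear together, so comparing the average score across models with different parameters gives one way to tune them.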
19. References
• Blei, D. M. (2011). Introduction to probabilistic topic models. Communications of the ACM, 1–
16. Retrieved from http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf
• Mehrotra, R., Sanner, S., Buntine, W., & Xie, L. (2013). Improving LDA Topic Models for
Microblogs via Tweet Pooling and Automatic Labeling. Proceedings of the 36th International
ACM SIGIR Conference on Research and Development in Information Retrieval, 889–892.
http://doi.org/10.1145/2484028.2484166
• Lau, J. H., & Baldwin, T. (2016). An Empirical Evaluation of doc2vec with Practical Insights into
Document Embedding Generation.
• Puschmann, C., & Scheffler, T. (2016). Topic modeling for media and communication
research: A short primer (HIIG Discussion Paper Series No. 2016–5). Retrieved from
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2836478
• Sievert, C., & Shirley, K. (2014). LDAvis: A method for visualizing and interpreting topics.
Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces,
63–70. Retrieved from http://www.aclweb.org/anthology/W/W14/W14-3110