Independent Study_Final Report

2015
Topic Model Comparison on Microblog Data
FALL ’15 INDEPENDENT STUDY REPORT
JOE KOOLIPPURACKAL | SHIKHA SWAMI

Introduction
Social networks are today among the largest platforms for communicating and expressing
ideas. Twitter is one of the most popular of these and generates abundant text data: it has
over 300 million monthly active users sharing 500 million tweets every day. Twitter provides
unprecedented opportunities for researchers, in both academia and business, to analyse
user opinions, sentiments and interests. However, one major problem encountered while
developing a classification or prediction model on microblog data such as tweets is the need
to manually label the tweets in the training dataset, which is extremely cumbersome and
time-consuming owing to the large size of the datasets.
In this study, we analyse and compare two topic modelling techniques, Correlated
Topic Models (CTM) and Latent Dirichlet Allocation (LDA), on two different datasets:
a. Dataset A: Tweets captured using the ‘asthma’ keyword
b. Dataset B: Tweets captured using the ‘#asthma’ keyword
We compare the terms in the topics that the two techniques produce on these two datasets
while varying the number of topics in the corpus.
Literature Review
A number of studies have analysed social network data to find health-related information.
These studies rely on natural language processing methods to extract information from
unstructured, raw data. The study “Review of Extracting Information From the Social Web
for Health Personalization” [1] explains the concept of extracting information for health
personalization. It explains that individuals socialise online to share information about their
health, the problems they face and their experiences. The article shows how promising the
web can be as a source of information for studying health-related topics.
A study by Dr. Sudha Ram used predictive modelling on data from multiple sources, such as
Twitter and Google, to predict asthma-related Emergency Department (ED) visits [2]. This
research shows that asthma is a very prevalent disease in the US and has high severity. The
research analysed the relation between asthma-related ED visits and data from the web.
Another study in the field of public health is by Michael J. Paul and Mark Dredze of Johns
Hopkins University. In their paper “You Are What You Tweet: Analyzing Twitter for Public
Health” [3], they used the Ailment Topic Aspect Model to analyse how users express their
illnesses and ailments in tweets.
A similar study, “Use of Hangeul Twitter to Track and Predict Human Influenza Infection” [4],
analysed tweets to track and predict the spread of influenza.
All of this research shows that Twitter and web media are data-rich resources for analysing
and predicting health-related information. Researchers are continuously exploring new
cost-effective and robust tools to analyse unstructured data on the web using data mining
techniques.

Topic Modeling Techniques
In this study, the LDA and CTM models are used to analyse asthma-related microblog
discussions on Twitter. As mentioned earlier, we have two datasets (Dataset A and Dataset
B), which contain asthma-related tweets from June 2015 to August 2015. The goal of the
project is to identify the topics discussed in these tweets. The approach is to pre-process
the tweets and identify the minimal set of terms useful for analysis. We then label and
cluster the topics found in the tweets.
Topic modelling is a machine learning technique used to find the themes of a document. It
infers the latent (hidden) topics in a document set and thereby determines what each
document is about. Given a document or dataset, topic modelling techniques determine the
topics based on how frequently particular words occur. In our study, the words ‘asthma’
and ‘copd’ had the highest probabilities in the topics.
Latent Dirichlet Allocation (LDA) is a probabilistic model used to infer latent topics in a
document. LDA works on the idea that every document (a tweet, in our case) is composed
of multiple topics. Based on each tweet’s balance of topics, we can identify the topic with
the highest score for each tweet and label the tweet accordingly. LDA uses Dirichlet
distributions and their parameters to compute the probability of the topics and of the
words under each topic.
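The labelling step described above, assigning each tweet the topic with the highest score, can be sketched as follows. This is a minimal illustration in standard-library Python, not the R script used in the study; the function name and the topic proportions are hypothetical placeholders.

```python
def dominant_topic(topic_weights):
    """Return the index of the topic with the largest weight for one tweet."""
    return max(range(len(topic_weights)), key=lambda k: topic_weights[k])


# Hypothetical per-tweet topic proportions (made-up values; each row sums to 1).
tweet_topics = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.45, 0.25],
]

# Label every tweet with the index of its highest-scoring topic.
labels = [dominant_topic(row) for row in tweet_topics]
print(labels)
```

In a fitted model, each row would be the estimated topic distribution for one tweet rather than hand-written numbers.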
Similar to LDA, the Correlated Topic Model (CTM) is used to find the hidden topics in a
document. It determines the word and topic probabilities based on word frequencies.
Unlike LDA, CTM also models correlation between the topics. The idea behind it is that the
presence of one topic in a document can be correlated with the presence of another
topic.
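The difference between the two models can be written compactly as below. This follows the standard formulation in the Correlated Topic Models paper listed in the references; the symbols (document d, word position n, K topics, topic proportions theta, topic-word distributions beta) are notation introduced here for illustration and do not appear elsewhere in this report.

```latex
\begin{align*}
\text{LDA:}\quad & \theta_d \sim \mathrm{Dirichlet}(\alpha) \\
\text{CTM:}\quad & \eta_d \sim \mathcal{N}(\mu, \Sigma), \qquad
  \theta_{d,k} = \frac{\exp(\eta_{d,k})}{\sum_{j=1}^{K} \exp(\eta_{d,j})} \\
\text{both:}\quad & z_{d,n} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d), \qquad
  w_{d,n} \mid z_{d,n}, \beta \sim \mathrm{Multinomial}(\beta_{z_{d,n}})
\end{align*}
```

The covariance matrix \Sigma is what lets CTM express that two topics tend to appear together; a Dirichlet prior has no comparable parameter.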
Implementation
For this project, we used the R programming language to implement both techniques.
The tweets from Twitter were extracted into an Excel file, which was pre-processed to clean
the dataset. The cleaned dataset and a stopword list were provided as input to both the LDA
and CTM implementations; the dataset and stopword list were the same for both algorithms.
The process was repeated for three settings of the number of clusters (topics): 2, 5 and 10.
Pre-processing
Over 41,000 tweets containing the keyword ‘asthma’ were collected and merged into a .csv
file. The file was processed to contain the tweet id, the date of the tweet, the user id and
the tweet text. After merging, the following pre-processing steps were performed on the
datasets:
i. Remove URLs
ii. Remove usernames (starting with @)
iii. Remove numbers
iv. Remove special characters
v. Remove non-ASCII characters
vi. Remove stopwords
vii. Remove punctuation
viii. Convert all text to lower case
For all of the above pre-processing, we used the ‘tm’ package in R. For stopword removal, in
addition to the built-in stopword list in the ‘tm’ package, we added a list of stopwords
specific to these datasets. These included names of users and words that were not relevant
to asthma. We identified some of these stopwords by manually inspecting the tweets. We
then iteratively ran the topic models, identified the irrelevant terms in the topics and added
them to the stopword list.
The list of stopwords used is in the accompanying file Stopwords.txt (see Appendix).
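The cleaning steps listed in this section can be sketched as a single function. The study itself used the ‘tm’ package in R; the version below is a standard-library Python illustration only. The stopword set, the regular expressions and the order of the steps are assumptions made for the example (lower-casing is done before stopword removal so that matching is case-insensitive).

```python
import re

# Hypothetical stopword list for the example; the study combined the built-in
# 'tm' list with a custom, dataset-specific list.
STOPWORDS = {"the", "a", "is", "my", "and", "to", "of", "for", "rt"}


def clean_tweet(text, stopwords=STOPWORDS):
    """Apply the cleaning steps to one tweet and return the cleaned text."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)      # remove URLs
    text = re.sub(r"@\w+", " ", text)                       # remove usernames
    text = text.encode("ascii", "ignore").decode("ascii")   # remove non-ASCII
    text = re.sub(r"\d+", " ", text)                        # remove numbers
    text = text.lower()                                     # lower-case
    text = re.sub(r"[^a-z\s]", " ", text)                   # punctuation, special chars
    tokens = [t for t in text.split() if t not in stopwords]
    return " ".join(tokens)


example = "RT @user123: My asthma is worse today!! 2 inhalers needed… http://t.co/abc #Asthma"
cleaned = clean_tweet(example)
print(cleaned)
```

Note that this sketch also strips the ‘#’ from hashtags, so ‘#Asthma’ becomes the plain term ‘asthma’.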
Topic Modelling Process
We used the ‘topicmodels’ package to implement the CTM and LDA techniques. After
pre-processing, we created a document-term matrix for each of the two datasets. Sparse
terms, those occurring at a rate below 0.1%, were removed from the document-term
matrix. On each of the two datasets, we ran the LDA and CTM techniques for three
different numbers of clusters:
a. 2 clusters
b. 5 clusters
c. 10 clusters
The terms in each of these clusters were sorted in descending order of probability score,
and we picked the 10 terms with the highest probability scores in each topic.
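The steps around the model fit (building the document-term matrix, dropping sparse terms and picking the top terms per topic) can be sketched as below. This is a standard-library Python illustration, not the ‘topicmodels’ workflow used in the study. The helper names are hypothetical, the toy documents and probabilities are made up, and the sketch assumes that sparsity is measured as the fraction of documents containing a term.

```python
from collections import Counter


def document_term_rows(docs):
    """Return one Counter of term counts per document (a sparse document-term matrix)."""
    return [Counter(doc.split()) for doc in docs]


def drop_sparse_terms(rows, min_doc_fraction):
    """Keep only terms that occur in at least min_doc_fraction of the documents."""
    doc_freq = Counter()
    for row in rows:
        doc_freq.update(row.keys())          # count each term once per document
    keep = {t for t, df in doc_freq.items() if df / len(rows) >= min_doc_fraction}
    filtered = [Counter({t: c for t, c in row.items() if t in keep}) for row in rows]
    return filtered, keep


def top_terms(term_probs, n=10):
    """Return the n terms with the highest probability, in descending order."""
    return sorted(term_probs, key=term_probs.get, reverse=True)[:n]


# Toy cleaned tweets (made up for the example).
docs = ["asthma inhaler", "asthma pollen", "asthma inhaler allergy", "asthma"]
rows, kept = drop_sparse_terms(document_term_rows(docs), min_doc_fraction=0.5)

# Hypothetical term probabilities for one topic, as a fitted model might return.
topic = {"asthma": 0.4, "inhaler": 0.3, "allergy": 0.2, "pets": 0.1}
best = top_terms(topic, n=2)
print(sorted(kept), best)
```

With the 0.1% threshold described above, `min_doc_fraction` would be 0.001 and `n` would be 10.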
Results and Analysis
As the number of clusters increases, new terms with high probability are added to the
topics. Comparing the clusters, the important terms captured about asthma are: asthma,
hygiene, pets, allergy, farm, bird, anaphylactic, pollution. The topics discussed could mean
that pets, farms and birds can be causes of asthma.
Comparing the topics under CTM and LDA for the search key #asthma, the frequency of the
word ‘asthma’ is very high under all topics when LDA is used, whereas under CTM the
frequency of the word ‘asthma’ varies across the topics.
As the number of clusters increases, this variation becomes more prominent for the words
under the topics modelled by CTM.
With 10 clusters modelled by LDA, the frequency of the word ‘asthma’ varies more under
the search key asthma than under #asthma. For CTM, however, the frequency under the
two search keys (with 10 clusters) is almost the same.

With 10 clusters, LDA captured the word ‘june’ under many topics when the search key
#asthma was used, but not when the search key asthma was used. CTM captured this word
only once with the #asthma keyword. This suggests that CTM gives a better correlation
between the words of the topics as the number of clusters increases.
For smaller numbers of clusters, CTM and LDA behave similarly: the words found across all
the topics using CTM are similar to those found using LDA. Variation appears in CTM as the
number of clusters increases.
For example, with 2 and 5 clusters, the words repeated under both CTM and LDA are
asthma, cannabisoil, children, symptoms and antibiotic. The frequency of the word ‘asthma’
is nearly the highest under all the topics for both CTM and LDA.
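Comparisons like the ones above (how many topics contain a given word, and which words both models share) can be made mechanical with a few lines of code. The sketch below is a standard-library Python illustration; the helper names are hypothetical and the term lists are made-up placeholders, not the study's results.

```python
def topics_containing(term, topics):
    """Count how many topics list the given term among their top terms."""
    return sum(term in terms for terms in topics)


def shared_terms(topics_a, topics_b):
    """Return the terms that appear in at least one topic of each model."""
    return set().union(*topics_a) & set().union(*topics_b)


# Placeholder top-term lists for two models (illustrative only).
lda_topics = [["asthma", "june", "allergy"], ["asthma", "june", "pets"], ["asthma", "pollution"]]
ctm_topics = [["asthma", "allergy"], ["june", "hygiene", "pets"], ["farm", "bird", "pollution"]]

june_in_lda = topics_containing("june", lda_topics)
june_in_ctm = topics_containing("june", ctm_topics)
common = shared_terms(lda_topics, ctm_topics)
print(june_in_lda, june_in_ctm, sorted(common))
```

Applied to the top-10 term lists from each run, the same two helpers would give the per-word topic counts and the overlap between the CTM and LDA vocabularies.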
The consolidated results of the analysis are in the accompanying file
Results_Consolidated.xlsx.
References
1) Review of Extracting Information From the Social Web for Health Personalization
http://www.jmir.org/2011/1/e15/
2) Predicting Asthma-Related Emergency Department Visits Using Big Data
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7045443
3) You Are What You Tweet: Analyzing Twitter for Public Health
https://www.cs.jhu.edu/~mdredze/publications/twitter_health_icwsm_11.pdf
4) Use of Hangeul Twitter to Track and Predict Human Influenza Infection
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0069305
5) Correlated Topic Models
https://www.cs.princeton.edu/~blei/papers/BleiLafferty2006.pdf
6) Probabilistic Topic Models
https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf

Appendix
Description of the files accompanying this report:
File                          Description
Merged_Asthma_Clean.csv       Cleaned dataset ‘asthma’ containing (SetId, TweetId, Date, Userid, Tweets)
Merged_HashAsthma_Clean.csv   Cleaned dataset ‘#asthma’ containing (SetId, TweetId, Date, Userid, Tweets)
TweetCleaning.R               Script used to clean the tweets
Stopwords.txt                 List of stopwords used
LDA.R                         Script for LDA model
CTM.R                         Script for CTM model
Results_Consolidated.xlsx     Consolidated results for LDA and CTM for different cluster sizes
                              (2, 5, 10) for both the ‘asthma’ and ‘#asthma’ datasets
