2015
Topic Model Comparison on Microblog Data
FALL ’15 INDEPENDENT STUDY REPORT
JOE KOOLIPPURACKAL | SHIKHA SWAMI
Introduction
Social networks are today among the largest platforms for communicating and expressing ideas.
Twitter is one of the most popular of them and holds abundant text data: it has over 300 million
monthly active users sharing 500 million tweets every day. This gives researchers in both
academia and business unprecedented opportunities to analyse user opinions, sentiments and
interests. However, one major problem in developing classification or prediction models on
microblog data such as tweets is the need to manually label the tweets in the training dataset,
which is extremely cumbersome and time-consuming owing to the large size of the datasets.
In this study, we analyse and compare two topic modelling techniques, Correlated
Topic Models (CTM) and Latent Dirichlet Allocation (LDA), on two different datasets:
a. Dataset A: Tweets captured using the ‘asthma’ keyword
b. Dataset B: Tweets captured using the ‘#asthma’ keyword
We compare the two techniques on these datasets by examining the terms in the topics
as the number of topics in the corpus is varied.
Literature Review
A number of studies have analysed social network data to find health-related
information. These studies rely on natural language processing methods to extract
information from unstructured, raw data. The study “Review of Extracting Information
From the Social Web for Health Personalization”[1] explains the concept of extracting
information for health personalization. It explains that individuals socialise online to share
information about their health, the problems they face and their experiences. The
article shows how promising the web can be as a source of information for studying
health-related topics.
A study by Dr. Sudha Ram used predictive modelling on data from multiple sources
such as Twitter and Google to predict asthma-related Emergency Department (ED) visits[2]. This
research shows that asthma is a very prevalent disease in the US and has high severity. The
research analysed the relation between asthma-related ED visits and data from the web.
Another study in the field of public health is by Michael J. Paul and Mark Dredze of Johns
Hopkins University. In their paper “You Are What You Tweet: Analyzing Twitter for Public
Health”[3], they used the Ailment Topic Aspect Model to analyse how users express their
illnesses and ailments in tweets.
A similar study, “Use of Hangeul Twitter to Track and Predict Human Influenza
Infection”[4], analysed tweets to track and predict the spread of influenza.
All of these studies show that Twitter and other web media are data-rich resources for analysing
and predicting health-related information. Researchers are continuously exploring new
cost-effective and robust tools to analyse unstructured data on the web using data mining
techniques.
Topic Modeling Techniques
In this study, the LDA and CTM models are used to analyse asthma-related microblog
discussions on Twitter. As mentioned earlier, we have two datasets (Dataset A and Dataset
B), which contain asthma-related tweets from June 2015 to August 2015. The goal of the project
is to identify the topics discussed in these tweets. The approach is to pre-process the
tweets and identify the minimal set of terms useful for analysis. We then label
and cluster the topics found in the tweets.
Topic modelling is a machine learning technique used to find the themes of
documents. It infers the latent (hidden) topics in a document set and so indicates
what each document is about. Given a document set, topic modelling techniques
determine the topics based on how frequently particular words occur. In
our study, the words ‘asthma’ and ‘copd’ had the highest probabilities in the topics.
Latent Dirichlet Allocation (LDA) is a probabilistic model used to infer latent
topics in a document set. LDA works on the idea that every document (a tweet, in our case) is
composed of multiple topics. Based on each tweet’s balance of topics, we can identify the
topic with the highest score for that tweet and label the tweet accordingly. LDA places
Dirichlet priors, governed by Dirichlet parameters, on the topic proportions of each document
and on the word distribution of each topic, and uses them to compute the probability of the
topics and of the words under each topic.
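As a rough illustration of the labelling idea described above, the following Python sketch draws a topic mixture for each tweet from a symmetric Dirichlet prior and labels the tweet with its highest-scoring topic. It is not the R code used in this study. The function names, the value of alpha and the number of topics are assumptions chosen only for illustration, and a fitted model would estimate the mixtures from the data instead of sampling them.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one k-dimensional sample from a symmetric Dirichlet(alpha)."""
    # A Dirichlet sample can be obtained by normalising independent Gamma draws.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def dominant_topic(topic_mixture):
    """Return the index of the topic with the highest proportion."""
    return max(range(len(topic_mixture)), key=lambda i: topic_mixture[i])


rng = random.Random(2015)  # fixed seed so the sketch is repeatable
num_topics = 5             # assumed value; the study used 2, 5 and 10
# Hypothetical topic mixtures for three tweets (alpha = 0.5 is an assumption).
mixtures = [sample_dirichlet(0.5, num_topics, rng) for _ in range(3)]
labels = [dominant_topic(m) for m in mixtures]
```

Each entry of `labels` is the topic index that would be assigned to the corresponding tweet.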
Like LDA, the Correlated Topic Model (CTM) is used to find the hidden topics in a
document set, and it determines word and topic probabilities from the frequencies with
which words occur. In addition, CTM models the correlation between topics. The idea behind
this model is that the presence of one topic in a document can be correlated with the presence
of another topic.
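To make the notion of correlated topics concrete, the sketch below follows the logistic normal construction from the Correlated Topic Models paper [5]: a correlated Gaussian vector is drawn and then mapped to topic proportions with a softmax. This is a minimal Python sketch, not the implementation used in this study. The mean vector, the covariance values and the function names are assumptions made for illustration.

```python
import math
import random


def softmax(values):
    """Map real-valued scores to proportions that sum to one."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_ctm_mixture(mean, chol, rng):
    """Draw topic proportions from a logistic normal distribution.

    `chol` is the lower-triangular Cholesky factor of the covariance
    matrix, so mean + chol * z is a correlated Gaussian draw.
    """
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    eta = [
        m + sum(chol[i][j] * z[j] for j in range(i + 1))
        for i, m in enumerate(mean)
    ]
    return softmax(eta)


rng = random.Random(2015)
mean = [0.0, 0.0, 0.0]
# Assumed covariance: topics 0 and 1 have correlation 0.9 and unit variance,
# topic 2 is independent. The Cholesky factor is written out by hand.
chol = [
    [1.0, 0.0, 0.0],
    [0.9, math.sqrt(1.0 - 0.9 ** 2), 0.0],
    [0.0, 0.0, 1.0],
]
theta = sample_ctm_mixture(mean, chol, rng)
```

Because topics 0 and 1 share most of their Gaussian noise, a tweet with a high proportion of one tends to have a high proportion of the other, which a Dirichlet prior cannot express.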
Implementation
For this project, we used the R programming language to implement both techniques.
The tweets from Twitter were extracted into an Excel file, which was pre-processed to clean
the dataset. The cleaned dataset and a stopword list were provided as input to both the LDA and
CTM implementations; the dataset and stopword list were the same for both algorithms. The
process was repeated for three numbers of clusters (topics): 2, 5 and 10.
Pre-processing
More than 41k tweets containing the keyword ‘asthma’ were collected
and merged into a .csv file. The file was processed to contain the tweet id, the date of the tweet,
the user id and the tweet text. After merging, the following pre-processing steps were
performed on the datasets:
i. Remove URLs
ii. Remove usernames (starting with @)
iii. Remove numbers
iv. Remove special characters
v. Remove non-ASCII characters
vi. Remove stopwords
vii. Remove punctuation
viii. Convert all text to lower case
For all of the above pre-processing, we used the ‘tm’ package in R. For stopword removal,
in addition to the built-in stopword list in the ‘tm’ package, we added a list of
stopwords specific to these datasets. These included names of users and words
that were not relevant to asthma. We identified some of these stopwords by manually
inspecting the tweets. We then iteratively ran the topic models, identified the irrelevant
terms in the resulting topics and added them to the stopword list.
Please refer to the accompanying file Stopwords.txt for the list of stopwords used.
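The pre-processing steps listed above can be sketched in a few lines. The study itself used the ‘tm’ package in R; the Python version below is only an approximation of the same steps, and the regular expressions, the function name `clean_tweet` and the tiny stopword set are assumptions made for illustration.

```python
import re
import string

# Tiny illustrative stopword set (an assumption); the study combined the
# 'tm' list with dataset-specific terms (see Stopwords.txt).
STOPWORDS = {"the", "a", "is", "my", "in", "for", "and", "to", "of", "rt"}

# Translation table that turns every ASCII punctuation mark into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_tweet(text, stopwords=STOPWORDS):
    """Apply the pre-processing steps to one tweet and return its tokens."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)      # i. URLs
    text = re.sub(r"@\w+", " ", text)                        # ii. usernames
    text = re.sub(r"\d+", " ", text)                         # iii. numbers
    text = text.encode("ascii", "ignore").decode("ascii")    # v. non-ASCII
    text = text.translate(PUNCT_TO_SPACE)                    # iv. and vii.
    text = text.lower()                                      # viii. lower case
    return [tok for tok in text.split() if tok not in stopwords]  # vi.


tokens = clean_tweet("RT @user1: My #Asthma is worse in 2015! http://t.co/abc ☹")
# tokens == ["asthma", "worse"]
```

Lower-casing is applied before stopword removal here so that stopword matching is case-insensitive; this ordering is a choice of the sketch, not necessarily the order used in the study.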
Topic Modelling Process
We used the ‘topicmodels’ package to implement the CTM and LDA techniques. After
pre-processing, we created a document-term matrix for each of the two datasets. The
sparse terms, those with an occurrence below 0.1%, were removed from the
document-term matrix. On each of the two datasets, we ran the LDA and CTM techniques
for three different numbers of clusters:
a. 2 clusters
b. 5 clusters
c. 10 clusters
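The document-term matrix and sparse-term filtering described above can be sketched as follows. This is a minimal Python approximation and not the R code used in the study. In particular, reading the 0.1% cut-off as "a term must appear in at least 0.1% of the documents" is an assumption, as are the function names.

```python
from collections import Counter


def build_dtm(docs):
    """Return (vocabulary, rows); each row maps a term to its count in one document."""
    rows = [Counter(tokens) for tokens in docs]
    vocabulary = sorted(set().union(*rows)) if rows else []
    return vocabulary, rows


def drop_sparse_terms(vocabulary, rows, min_doc_fraction=0.001):
    """Keep terms that appear in at least `min_doc_fraction` of the documents."""
    # Iterating over a Counter yields each term once, so this counts documents.
    doc_freq = Counter(term for row in rows for term in row)
    threshold = min_doc_fraction * len(rows)
    kept = [t for t in vocabulary if doc_freq[t] >= threshold]
    kept_set = set(kept)
    pruned = [{t: c for t, c in row.items() if t in kept_set} for row in rows]
    return kept, pruned


# Toy tokenised tweets (made up for illustration, not taken from the datasets).
docs = [
    ["asthma", "inhaler"],
    ["asthma", "pets"],
    ["asthma", "inhaler", "pollen"],
]
vocabulary, rows = build_dtm(docs)
# A deliberately high threshold so the effect is visible on three documents.
kept, pruned = drop_sparse_terms(vocabulary, rows, min_doc_fraction=0.5)
```

With the toy threshold of 0.5, only terms found in at least two of the three documents survive.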
The terms in each of these clusters were sorted in descending order of probability score,
and we picked the 10 terms with the highest probability scores in each topic.
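Selecting the top terms per topic amounts to a sort followed by a cut-off. The Python sketch below shows the idea on a hand-made topic; the probabilities are invented for illustration and are not results from this study, and the function name `top_terms` is an assumption.

```python
def top_terms(topic_term_probs, n=10):
    """Return the n most probable terms of each topic, highest first."""
    result = []
    for probs in topic_term_probs:
        ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
        result.append([term for term, _ in ranked[:n]])
    return result


# One hypothetical topic with made-up term probabilities.
example_topics = [
    {"asthma": 0.4, "inhaler": 0.3, "pets": 0.2, "pollen": 0.1},
]
top_two = top_terms(example_topics, n=2)
```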
Results and Analysis
As the number of clusters increases, new high-probability terms are added to the topics.
Comparing the clusters, the important terms captured about asthma are:
asthma, hygiene, pets, allergy, farm, bird, anaphylactic, pollution. These topics could
suggest that pets, farms and birds are discussed as possible causes of asthma.
Comparing the topics under CTM and LDA for the search key #asthma, the frequency of the
word ‘asthma’ is very high under all topics when LDA is used, whereas under CTM its
frequency varies across the topics. As the number of clusters increases, this variation
becomes more prominent for the words in the topics modelled by CTM.
For the 10-cluster LDA models, the variation in the frequency of the word ‘asthma’ is
greater under the search key ‘asthma’ than under ‘#asthma’. For CTM, however, the
frequency under the two search keys (with 10 clusters) is almost the same.
With 10 clusters, LDA captured the word ‘june’ under many topics when the search key
#asthma was used, but not when the search key ‘asthma’ was used. CTM captured this word
only once with the #asthma keyword. This suggests that CTM gives a better
correlation between the words of the topics as the number of clusters increases.
For smaller numbers of clusters, CTM and LDA behave similarly: the words found
across all topics using CTM are similar to those found using LDA. Variation
appears in CTM as the number of clusters increases.
For example, with 2 and 5 clusters, the words repeated under both CTM and LDA are asthma,
cannabisoil, children, symptoms and antibiotic. The frequency of the word ‘asthma’ is
among the highest under all topics for both CTM and LDA.
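A comparison of this kind can be automated once the top terms of each model are available. The Python sketch below computes the terms shared by two models and the fraction of topics in which a given term appears. The term lists are placeholders built from words mentioned in this report; they are not the actual output recorded in Results_Consolidated.xlsx, and the function names are assumptions.

```python
def shared_terms(topics_a, topics_b):
    """Terms that appear in at least one topic of both models."""
    return set().union(*topics_a) & set().union(*topics_b)


def term_topic_coverage(topics, term):
    """Fraction of topics whose top terms include `term`."""
    return sum(term in topic for topic in topics) / len(topics)


# Placeholder top-term lists for a two-topic run (illustrative only).
lda_topics = [
    ["asthma", "children", "symptoms"],
    ["asthma", "cannabisoil", "antibiotic"],
]
ctm_topics = [
    ["asthma", "children", "antibiotic"],
    ["symptoms", "cannabisoil", "pollution"],
]

common = shared_terms(lda_topics, ctm_topics)
lda_coverage = term_topic_coverage(lda_topics, "asthma")
ctm_coverage = term_topic_coverage(ctm_topics, "asthma")
```

On real output, a lower coverage value for one model would indicate that a term is spread less evenly across that model's topics.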
The accompanying file Results_Consolidated.xlsx has the results of the analysis.
References
1) Review of Extracting Information From the Social Web for Health Personalization
http://www.jmir.org/2011/1/e15/
2) Predicting Asthma-Related Emergency Department Visits Using Big Data
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7045443
3) You Are What You Tweet: Analyzing Twitter for Public Health
https://www.cs.jhu.edu/~mdredze/publications/twitter_health_icwsm_11.pdf
4) Use of Hangeul Twitter to Track and Predict Human Influenza Infection
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0069305
5) Correlated Topic Models
https://www.cs.princeton.edu/~blei/papers/BleiLafferty2006.pdf
6) Probabilistic Topic Models
https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
Appendix
Description of the files accompanying this report:
File                           Description
Merged_Asthma_Clean.csv        Cleaned dataset ‘asthma’ containing (SetId, TweetId, Date, Userid, Tweets)
Merged_HashAsthma_Clean.csv    Cleaned dataset ‘#asthma’ containing (SetId, TweetId, Date, Userid, Tweets)
TweetCleaning.R                Script used to clean the tweets
Stopwords.txt                  List of stopwords used
LDA.R                          Script for LDA Model
CTM.R                          Script for CTM Model
Results_Consolidated.xlsx      Consolidated results for LDA and CTM for different cluster sizes
                               (2, 5, 10) for both ‘asthma’ and ‘#asthma’ datasets
