Latent Dirichlet Allocation as a Twitter Hashtag
Recommender System ∗
Brian Gillespie
Northeastern University
Seattle, WA
bng1290@gmail.com
Shailly Saxena
Northeastern University
Seattle, WA
saxena.sha@husky.neu.edu
ABSTRACT
In this paper, we investigate an unsupervised approach
to hashtag recommendation, to aid in the classification of
tweets and to improve the usability of Twitter. Two cor-
pora were generated from a sample of tweets collected from
September 2009 to January 2010. The first corpus was com-
posed simply of individual tweets, and the second corpus
was made by aggregating tweets by UserID into what is re-
ferred to as the USER PROFILE model. Latent Dirichlet
Allocation (LDA) was used to generate a topic model for
each corpus, and each model was then queried to gener-
ate topic distributions for new test tweets. By sampling a
topic from each test tweet, and then sampling one of the
top terms from that topic, a set of words can be selected as
recommended hashtags. This recommendation process was
applied to several example tweets, and the relevancy of the
suggested hashtags was evaluated by human observers. With
the best-performing model, the user profile model with 40
topics, at least one suitable hashtag was suggested for 49%
of test tweets when three hashtags were generated, rising to
67% when seven were generated.
Keywords
Twitter, LDA, topic modeling, NLP, social media
1. INTRODUCTION
Twitter is an amazing source of text data, both in con-
tent and quantity. With over 305 million monthly active
users across the globe, there are a wide variety of subjects
being discussed. Cataloging this data, and finding ways to
make it more searchable, is very desirable, as this data can
be an effective resource for business knowledge and for
studying social trends. Hashtags appear to be the natural
categorization index for tweets; however, since only 8% of
tweets contain hashtags, they cannot be used as a direct
categorization for all tweets.

∗There are no legal restrictions! Break all the rules! We
don't own this, and wouldn't even dare claim it if our lives
depended on it. Plagiarize to your heart's desire, but maybe
hook us up with a job interview if you think we had some
good ideas in here.

Compounding this issue, there are
no prescribed hashtags; any sequence of characters can be
a hashtag as long as it has a # in front of it. A hashtag
recommendation system can be implemented to encourage
users to utilize more hashtags, and to provide a more consol-
idated hashtag base for tweet categorization. But in order
to avoid prescribing hashtags, and thus limiting the free ex-
pression of Twitter users, it is more reasonable to generate
the suggested hashtags from the content of Twitter itself.
This has the added gain of providing an adaptable system
that can develop alongside the vocabulary and interests of
the Twitter user base.
2. RELATED WORK
Although there has been much research into the structure
and dynamics of microblogging networks, and many insights
have been gained by studying how users interact with one
another and with the world, the area of hashtag recommen-
dation remains under-explored. Hashtags are an important
feature of Twitter, as they aid in the classification and easy
retrieval of tweets, and the recommendation problem has
been addressed in a number of ways so far.
There have been recent developments in hashtag recom-
mendation approaches to Twitter data such as Zangerle et
al. [10] and Kywe, Hoang, et al. [6] where recommendations
are made on the basis of similarity. These similarities can be
of tweet content or of users. Zangerle et al. [10] analyzed
three ways of ranking candidate hashtags: by their overall
popularity, by their popularity within the most similar
tweets, and by the similarity of the tweets they appear in.
Of these, the third approach was found to out-perform the
rest. Mazzia and Juett [8], on the other hand, use a naive
Bayes model to determine the relevance of hashtags to an
individual tweet. Since these methods do not take into con-
sideration the personal preferences of the user, Kywe et
al. [6] suggest recommending hashtags drawn from simi-
lar users as well as similar tweets, achieving over 20% higher
accuracy than the above methods.
One drawback of these approaches is that they rely on
existing hashtags to recommend new ones. Since the cur-
rent number of hashtags is already very low and most of
them are rarely repeated, using these hashtags will not im-
prove the classification of tweets. Godin et al. [2] took these
facts into consideration and applied an LDA model trained
to cluster tweets into a number of topics, from which key-
words can be suggested for new tweets. In this paper, we
extend the work done by Godin et al. [2] by
incorporating the user profile approach suggested in Hong et
al. [5]. In this approach, tweets are aggregated by the user
IDs of their authors, and a new set of documents is created
in which each document is a user profile: a unique user ID
and his or her combined set of tweets.

Table 1: Twitter Sample, Sep 2009 to Jan 2010
UserID | Tweet
66 | For all the everything you've brought to my internet life #mathowielove
53717680 | What's good Tweeps this Friday
16729208 | "I bet you all thought Bert & Ernie were just roommates!"
86114354 | @miakhyrra take me with you! lol
Hence, we explore some of the more promising approaches
to hashtag recommendation for tweets, and investigate their
implications for further research into more successful topic-
model-based hashtag recommendation on Twitter. Given the
success of the user profile approach, we employ and evaluate
this method to further investigate its effectiveness.
3. DATASET AND PREPROCESSING
The datasets used were obtained from Cheng, Caverlee,
and Lee [9]. In order to reduce noise in the text data, a
set of stop words was removed from the tweet bodies, the
remaining words were then stemmed, and any hyperlinks
were also removed. A bag-of-words model was created, and
TF-IDF vectors were generated. For the user profile LDA
approach [5] and the Twitter-LDA model [11], the vectors
were aggregated based on user IDs. We also processed each
tweet for the documents “as-is” for the Twitter-LDA model.
3.1 The Dataset
The training dataset from Cheng, Caverlee, and Lee [9],
contains 3,844,612 tweets from 115,886 Twitter users over
the time period of September 2009 to January 2010. The
test set, also from [9], contains 5,156,047 tweets from 5,136
Twitter users over the same time period. In general, tweets
contain between 1 and 20 words, with an average of 7 words
per tweet. A smaller proportion of the tweets contained
more than 20 words as seen in Figure 1. Each line of the
dataset contains a unique user ID and tweet ID, text con-
tent, and a timestamp. The text content of each tweet is
limited to 140 characters, and can contain references to other
users of the form @username, as well as popular #hashtags.
Many tweets also include hyperlinks which are often passed
through URL shorteners (e.g. http://goo.gl/uLLAe). The
implications of these more anomalous text instances are con-
sidered in the next section.
An additional document corpus was generated from each
dataset by aggregating tweets by user ID. This reduction is
referred to as the user profile model [5]. After aggregation,
the corpus contained a total of 115,886 documents. Data
preprocessing was performed on both of these corpora, and
is described in the next section.
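The aggregation into user profiles described above can be sketched as follows; the helper name and the sample records are illustrative assumptions, not the authors' code:

```python
from collections import defaultdict

def build_user_profiles(tweets):
    """Aggregate (user_id, text) pairs into one document per user,
    following the user profile model of [5]."""
    profiles = defaultdict(list)
    for user_id, text in tweets:
        profiles[user_id].append(text)
    # Each profile document is the concatenation of that user's tweets.
    return {uid: " ".join(texts) for uid, texts in profiles.items()}

sample = [
    (66, "For all the everything you've brought to my internet life"),
    (66, "a second tweet from the same user"),
    (53717680, "What's good Tweeps this Friday"),
]
profiles = build_user_profiles(sample)
# Two distinct users yield two profile documents.
```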
3.2 Data Preprocessing and Reduction
Due to the unconventional vocabulary in tweets and the
large amount of noise inherent in such a diverse set of text,
we decided to perform several data preprocessing steps. A
regular expression was used to remove any non-latin char-
acters and any URL links, as URL links are generated ran-
domly and cannot be easily related to a topic. We also re-
move all numbers, punctuation, retweets, and words start-
ing with the character @. After pruning the tweet contents, the text
was tokenized, stop words were removed, and the remain-
ing tokens were stemmed using the Porter Stemming library
from the Natural Language Toolkit (NLTK). The prepared
corpus of tweets was then converted to a set of Term Fre-
quency (TF) vectors, using the corpora library from gensim.
We noticed that many tweets contained excessive repetition
of words (for instance one tweet read: ”@jester eh eh eh eh
eh eh eh eh eh...”), so in order to reduce any bias towards
overused words in the data we removed words appearing
over 5 times in a tweet, as well as using the TF-IDF model
to reduce emphasis on such words. To reduce bias to com-
mon words across the corpus, we removed terms appearing
in over 70% of the documents as per [11]. As seen in Figure
2, before preprocessing the data was dominated by common
words that do not confer much meaning (e.g. get, u, just,
one). After preprocessing, however, we begin to see more
meaningful words (e.g. people, twitter, home, blog in Figure
3). After preprocessing, there were 26,542,006 total words
and 184,002 unique words in the vocabulary.
Figure 1: Plate Diagram of LDA.
4. MODEL AND METHODOLOGY
Latent Dirichlet Allocation, first proposed by Blei, Ng,
and Jordan [1], is a generative probabilistic model that sup-
poses each document in a corpus is generated from a smaller,
hidden set of topics. Each word in a document is drawn
from a topic, making every document a mixture of samples
from a variety of different topics. The plate diagram in Fig-
ure 1 demonstrates the generative story for a given word
in this model. Here, the outer plate represents the corpus
level where M is the number of documents, and the inner
plate represents the document level where N is the number
of words in a given document.

Figure 2: Average Hellinger Distances Between Topic Mod-
els

So working from the outside-in, α is a parameter to the
Dirichlet prior which defines the
topic distribution for a set of documents. This informs θ, the
topic distribution for a given document, and this supplies us
with a topic, z. Ultimately, each word in each document will
be drawn from one such topic, which is itself a distribution
of words with a Dirichlet prior parameterized by β.
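As a concrete illustration of this generative story, the following toy sketch draws a tiny corpus from the model; the vocabulary size, corpus dimensions, and hyperparameter values are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, M, N = 6, 2, 3, 8      # vocabulary, topics, documents, words/document
alpha, beta = 0.5, 0.1       # symmetric Dirichlet hyperparameters

# Each topic is a distribution over the vocabulary, drawn with prior beta.
phi = rng.dirichlet(np.full(V, beta), size=T)

docs = []
for _ in range(M):
    theta = rng.dirichlet(np.full(T, alpha))  # document's topic mixture
    doc = []
    for _ in range(N):
        z = rng.choice(T, p=theta)            # sample a topic z ~ theta
        w = rng.choice(V, p=phi[z])           # sample a word w ~ phi_z
        doc.append(w)
    docs.append(doc)
```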
There are multiple ways of estimating these hidden top-
ics, including Markov Chain Monte Carlo techniques such
as Collapsed Gibbs Sampling [3], or with a stochastic gradi-
ent descent method such as in [4]. In this paper we utilized
the latter, and the details of our topic model generation are
discussed in the next section.
For the purpose of hashtag recommendation, we first
need to see which topics are most represented by a new
tweet. We then propose that the most significant words
from within those strongly related topics will provide a more
general representation of the message contained in a given
tweet. So essentially for each new test tweet, we sample from
the top topics in that tweet, and we recommend a hashtag
from the top most representative words in that topic. In
this way, we are recommending topics that also reflect the
social component of Twitter. Since the topics we find are
driven by the tweets themselves, our recommendations serve
as an approximation of the connections between what users
are talking about.
food thanksgiv good great eat love dinner coffe happi
cook lunch turkey chocol recip tweet enjoy drink hope
pizza friend chicken tast wine restaur sweet pie fun cream
cooki delici nice pumpkin night tea chees favorit bar
soup ice chef awesom fri fresh cake tonight thing bacon
breakfast meal year morn grill kid thx hey work sandwich
idea special hot sound place kitchen salad burger top
peopl bake famili start glad call bread potato twitter
tomorrow cool meet appl amaz hear holiday butter sauc
lot find wonder big share serv bean parti read egg red
bring open milk rice tasti
food drink coffe good wine dinner photo eat art
great beer lunch thanksgiv wed tonight work cook design
recip chocol tast bar night photographi pizza turkey
pumpkin paint chicken fun tea happi love halloween fresh
restaur cream delici photograph parti nice cooki pie start
kitchen sweet tomorrow chees red open ice store awesom
bottl cake enjoy soup chef breakfast imag hot favorit
idea place cool holiday appl top hous special order year
fri shot morn grill water glass bacon burger sandwich
photoshop shop bake salad lot light meal green head big
friend taco cup amaz potato free color offic bread
Figure 3: Two topics generated by our LDA. The first is
from the User Profile model with T=20, the second is from
the User Profile model with T=40. These were found to
be highly similar, with a Hellinger Distance of 0.40636.
4.1 Getting a Topic Model
In order to generate our topic model, we utilized the
LDAModel package from gensim. This package uses the on-
lineldavb.py algorithm that was proposed in [4]. This is an
online, constant memory implementation of the LDA algo-
rithm, which allowed us to process a rather large dataset
with limited resources. This package processes the docu-
ments in mini batches, and we chose our batch size to be the
default, 2,000. The key parameters κ and τ as mentioned
in [4] were kept at 0.5 and 1.0, respectively, as these values
performed best in their performance analyses. Symmetric
priors were also kept for the topic-to-word and document-
to-topic distributions; these priors can be tweaked in order
to boost certain topics or terms. Finally, the numbers of
topics were chosen based on the more successful results seen
in [5]. For the per-tweet corpus we chose topic sizes of 50,
100, and 200, in keeping with [2]. For the user profile model,
however, we turned to the experiments of Hong et al. [5],
who determined that this approach tends to favor lower
topic counts; in their case a size of 40 performed best. For
our own user profile implementation we
used topic sizes of 20, 40, 70, and 100.
Since several models were created, it is informative to see
if our various approaches have had much of an effect on the
final model. In order to compare the similarity of our topic
distributions, we calculated the Hellinger Distance between
the distributions generated for each topic size and training
corpus. For two distributions, P and Q, each of size k, the
Hellinger Distance is given by

H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left( \sqrt{p_i} - \sqrt{q_i} \right)^2}

Table 2: Tweets and their suggested hashtags.
Tweet Text | Suggested Hashtags
And I want to know why I dream about food!?!? I'm sure this is not normal | cook happi bad love great guy love
Detailed Pics of The Dj Am Dunks and Dj Premier Af1 Low QS Will be up tomorrow | music album ik vote dec op
@MacsSTLsweetie Dakota always chews the squeaker out of her toys! | dog code report free save airport fire
headache - then eye exam in an hour. first one in 15 years... | bad guy tonight work thing good

The Hellinger Distance is a distance metric that can be
used to evaluate the similarity of two probability distribu-
tions, and in this case we are comparing the word distribu-
tions of the topics generated by our various LDA models.
Of particular interest were the Hellinger Distances between
the various user profile corpora and the single-tweet cor-
pora. In Figure 3, we see two of the topics generated by
our LDA. These topics were found to be highly similar by
their Hellinger Distance, which was 0.40636. By looking
over the terms in each topic, it is clear that they’re both
related to food and eating, and each contains a lot of the
same terms. From Figure 2, we can see that the user pro-
file model seems to favor more similar topics. While the
regular model was more susceptible to change as the num-
ber of topics was changed, the user profile model tended to
have more stable topics, as the average distance remained
relatively lower. It is likely that the smaller number of top-
ics contributed to this, however it is noted that at T=100,
the user profile topic models were still comparatively quite
close to the others. The topics generated by the regular
LDA approach were very diverse, and were very susceptible
to changes as the topic size was altered.
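The Hellinger Distance used above is straightforward to compute; a short numpy sketch over two illustrative distributions:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.
    Ranges from 0 (identical) to 1 (disjoint supports)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2.0))

uniform = np.full(4, 0.25)
skewed = np.array([0.7, 0.1, 0.1, 0.1])
d = hellinger(uniform, skewed)   # strictly between 0 and 1
```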
4.2 Hashtag Recommendation
In order to provide recommendations for a new tweet, we
first decided to do some quick preliminary preprocessing by
removing any user-added hashtags, references to other users,
URL links, and non-latin characters. From here, we can con-
vert this new document into its bag-of-words representation
by referring to our existing dictionary. We used the LDA
model generated previously to infer the topics from which
each word in the tweet is drawn. This can also be done using
gensim; by querying the generated model, LDAModel will
return the distribution of topics and their respective poste-
rior probabilities for a given document. We used the belief
that each document represents some mixture of topics and
we sample the most strongly correlated topics from a new
tweet. Next, we select one of these strong topics, and rec-
ommend a hashtag from the set of words that make up that
topic. In this experiment, we decided to select the first three
recommendations only from the words that most strongly
represent their topics, in the hopes that this will strengthen
the relevance of our suggestions. For three recommenda-
tions, we take the top two words from the best topic and,
as the third, the top word from the second-best topic. These
three recommendations remain the same for the 5- and 7-
recommendation categories as well; the remaining recom-
mendations are obtained by randomly sampling words from
the top two topics for the given tweet.
5. RESULTS AND DISCUSSION
In order to evaluate the quality of recommendations from
our model, we decided to make a simple web app, where
a user can write out a tweet and the number of hashtags
they’d like, and see our recommendations. We intend to
provide this setup to several human testers, who will evalu-
ate whether a given hashtag is relevant to their submission
or not.
For the evaluation, we took a random sample of 100 test
tweets and queried three of the LDA models generated dur-
ing training: the user profile model with 40 topics, and the
simple per-tweet models with 50 and 100 topics. We report
results for the user profile model with T = 40, since it per-
formed the best. We generated three sets of hashtags for
each tweet; the first set had three recommendations, the
second five, and the third seven. After test-
ing with 2 human subjects, we tallied up the frequency of
positive cases for each trial. For example, if three of seven
recommended hashtags seemed appropriate, the tester as-
signed a score of 3 to that tweet for the 7-hashtag category.
This was done for every tweet, for each of the three hashtag
categories. The results are shown in Figure 4.
Figure 4: Histogram of the number of correctly suggested
hashtags per tweet when three, five or seven hashtags are
suggested with user profile model, T = 40.
It was found that for 49% of the tweets, at least 1 suitable
hashtag could be recommended out of 3 generated hashtags.
Similarly, when 5 hashtags were generated, 58% of tweets
had at least 1 suitable hashtag and in case of 7, this accuracy
increased to 67%. There are instances where a tweet does
not contain sufficient words, and hence none of the recom-
mended hashtags are appropriate. We also came across weak
recommendations such as 'good' appearing for a large num-
ber of tweets; this could be eliminated by a more sophisti-
cated means of drawing a topic, or by training on more data
to produce better topics. In some cases, 5 out of 7 hashtags
were appropriate, and this number could increase if we
trained on more data.
6. FUTURE WORK
In the future, more preprocessing work could be done, in
order to make more meaningful topics, and thus better hash-
tag recommendations. Some words that bore little meaning
on their own (e.g. good) still made it into a lot of our topics,
and their presence in so many of the topics had the effect of
polluting our recommendations. Additionally, it would be
very advantageous to obtain more data. Our experiments
were ultimately performed on around 3 million tweets, which
is a comparatively small sample. Many words from test
tweets were not incorporated into our model, and thus pre-
dictions on those words cannot be made reliably. By using
the Twitter Streaming API, we could collect a denser dataset
of 10 million+ tweets over a much shorter period of time.
As well as getting more data, there are other models we
would like to try, namely the Twitter-LDA model proposed
in [11], and the Author-Topic model. In this way we would
aim to get stronger topics that better reflect the structure of
tweet data. One more enhancement we considered, was ag-
gregating a collection of popular tweets, and trying to boost
the prior distributions of our LDA so that these terms are
weighted more heavily. Doing this, we may be able to use
meaningful, pre-existing hashtags in our recommendations.
Besides enhancements to our approach, we encountered
a few interesting ideas while working on this project. One
interesting avenue of study would be topic-based tweet and
user recommendation. One possible experiment would in-
volve using the user profile model to generate a corpus, then
taking a new user profile and recommending users based on
some distance measure between their posterior topic proba-
bilities.
7. REFERENCES
[1] David M Blei, Andrew Y Ng, and Michael I Jordan.
Latent Dirichlet allocation. Journal of Machine
Learning Research, 3:993–1022, 2003.
[2] Fréderic Godin, Viktor Slavkovikj, Wesley De Neve,
Benjamin Schrauwen, and Rik Van de Walle. Using
topic models for twitter hashtag recommendation. In
Proceedings of the 22nd international conference on
World Wide Web companion, pages 593–596.
International World Wide Web Conferences Steering
Committee, 2013.
[3] Tom Griffiths. Gibbs sampling in the generative model
of latent dirichlet allocation. 2002.
[4] Matthew Hoffman, Francis R Bach, and David M Blei.
Online learning for latent dirichlet allocation. In
advances in neural information processing systems,
pages 856–864, 2010.
[5] Liangjie Hong and Brian D Davison. Empirical study
of topic modeling in twitter. In Proceedings of the first
workshop on social media analytics, pages 80–88.
ACM, 2010.
[6] Su Mon Kywe, Tuan-Anh Hoang, Ee-Peng Lim, and
Feida Zhu. On recommending hashtags in twitter
networks. In Social Informatics, pages 337–350.
Springer, 2012.
[7] Tianxi Li, Yu Wu, and Yu Zhang. Twitter hash tag
prediction algorithm. In ICOMP'11 - The 2011
International Conference on Internet Computing,
2011.
[8] Allie Mazzia and James Juett. Suggesting hashtags on
twitter. EECS 545m, Machine Learning, Computer
Science and Engineering, University of Michigan,
2009.
[9] J. Caverlee Z. Cheng and K. Lee. You are where you
tweet: A content-based approach to geo-locating
twitter users. In Proceeding of the 19th ACM
Conference on Information and Knowledge
Management (CIKM), October 2010.
[10] Eva Zangerle, Wolfgang Gassler, and Gunther Specht.
Recommending #-tags in Twitter. In Proceedings of the
Workshop on Semantic Adaptive Social Web (SASWeb
2011). CEUR Workshop Proceedings, volume 730,
pages 67–78, 2011.
[11] Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He,
Ee-Peng Lim, Hongfei Yan, and Xiaoming Li.
Comparing twitter and traditional media using topic
models. In Advances in Information Retrieval, pages
338–349. Springer, 2011.

Recommending Tags with a Model of Human Categorization
Christoph Trattner
 
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)rchbeir
 
Geometric Aspects of LSA
Geometric Aspects of LSAGeometric Aspects of LSA
Geometric Aspects of LSA
Heinrich Hartmann
 
Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)
muzzy4friends
 
AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...
Christos Katsanos
 
Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3
Krishna Bollojula
 

Viewers also liked (20)

Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003
 
C4.5
C4.5C4.5
C4.5
 
LDA Beginner's Tutorial
LDA Beginner's TutorialLDA Beginner's Tutorial
LDA Beginner's Tutorial
 
WSDM2014
WSDM2014WSDM2014
WSDM2014
 
Interactive Latent Dirichlet Allocation
Interactive Latent Dirichlet AllocationInteractive Latent Dirichlet Allocation
Interactive Latent Dirichlet Allocation
 
Goal-based Recommendation utilizing Latent Dirichlet Allocation
Goal-based Recommendation utilizing Latent Dirichlet AllocationGoal-based Recommendation utilizing Latent Dirichlet Allocation
Goal-based Recommendation utilizing Latent Dirichlet Allocation
 
What's new in scikit-learn 0.17
What's new in scikit-learn 0.17What's new in scikit-learn 0.17
What's new in scikit-learn 0.17
 
A Note on Expectation-Propagation for Latent Dirichlet Allocation
A Note on Expectation-Propagation for Latent Dirichlet AllocationA Note on Expectation-Propagation for Latent Dirichlet Allocation
A Note on Expectation-Propagation for Latent Dirichlet Allocation
 
Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Al...
Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Al...Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Al...
Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Al...
 
Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentation
 
Visualzing Topic Models
Visualzing Topic ModelsVisualzing Topic Models
Visualzing Topic Models
 
Topic Modeling
Topic ModelingTopic Modeling
Topic Modeling
 
A Simple Stochastic Gradient Variational Bayes for the Correlated Topic Model
A Simple Stochastic Gradient Variational Bayes for the Correlated Topic ModelA Simple Stochastic Gradient Variational Bayes for the Correlated Topic Model
A Simple Stochastic Gradient Variational Bayes for the Correlated Topic Model
 
20 cv mil_models_for_words
20 cv mil_models_for_words20 cv mil_models_for_words
20 cv mil_models_for_words
 
Recommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human CategorizationRecommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human Categorization
 
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
 
Geometric Aspects of LSA
Geometric Aspects of LSAGeometric Aspects of LSA
Geometric Aspects of LSA
 
Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)
 
AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...
 
Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3
 

Similar to Latent Dirichlet Allocation as a Twitter Hashtag Recommendation System

Independent Study_Final Report
Independent Study_Final ReportIndependent Study_Final Report
Independent Study_Final ReportShikha Swami
 
Twitter Sentiment Analysis: An Unsupervised Approach
Twitter Sentiment Analysis: An Unsupervised ApproachTwitter Sentiment Analysis: An Unsupervised Approach
Twitter Sentiment Analysis: An Unsupervised Approach
IRJET Journal
 
SentimentAnalysisofTwitterProductReviewsDocument.pdf
SentimentAnalysisofTwitterProductReviewsDocument.pdfSentimentAnalysisofTwitterProductReviewsDocument.pdf
SentimentAnalysisofTwitterProductReviewsDocument.pdf
DevinSohi
 
Groundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitterGroundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitter
Dan Nguyen
 
E017433538
E017433538E017433538
E017433538
IOSR Journals
 
Text mining on Twitter information based on R platform
Text mining on Twitter information based on R platformText mining on Twitter information based on R platform
Text mining on Twitter information based on R platformFayan TAO
 
Detailed Investigation of Text Classification and Clustering of Twitter Data ...
Detailed Investigation of Text Classification and Clustering of Twitter Data ...Detailed Investigation of Text Classification and Clustering of Twitter Data ...
Detailed Investigation of Text Classification and Clustering of Twitter Data ...
ijtsrd
 
USING HASHTAG GRAPH-BASED TOPIC MODEL TO CONNECT SEMANTICALLY-RELATED WORDS W...
USING HASHTAG GRAPH-BASED TOPIC MODEL TO CONNECT SEMANTICALLY-RELATED WORDS W...USING HASHTAG GRAPH-BASED TOPIC MODEL TO CONNECT SEMANTICALLY-RELATED WORDS W...
USING HASHTAG GRAPH-BASED TOPIC MODEL TO CONNECT SEMANTICALLY-RELATED WORDS W...
Nexgen Technology
 
ABBREVIATION DICTIONARY FOR TWITTER HATE SPEECH
ABBREVIATION DICTIONARY FOR TWITTER HATE SPEECHABBREVIATION DICTIONARY FOR TWITTER HATE SPEECH
ABBREVIATION DICTIONARY FOR TWITTER HATE SPEECH
IJCI JOURNAL
 
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docxBUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
jasoninnes20
 
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docxBUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
curwenmichaela
 
Conversational aspects of twitter
Conversational aspects of twitterConversational aspects of twitter
Conversational aspects of twitter
Prayukth K V
 
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly DetectionDetection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
IJERA Editor
 
Applying Clustering Techniques for Efficient Text Mining in Twitter Data
Applying Clustering Techniques for Efficient Text Mining in Twitter DataApplying Clustering Techniques for Efficient Text Mining in Twitter Data
Applying Clustering Techniques for Efficient Text Mining in Twitter Data
ijbuiiir1
 
Tweet segmentation and its application to named entity recognition
Tweet segmentation and its application to named entity recognitionTweet segmentation and its application to named entity recognition
Tweet segmentation and its application to named entity recognition
ieeepondy
 
Ire major project
Ire major projectIre major project
Ire major project
Abhishek Mungoli
 
Big Data Analytics- USE CASES SOLVED USING NETWORK ANALYSIS TECHNIQUES IN GEPHI
Big Data Analytics- USE CASES SOLVED USING NETWORK ANALYSIS TECHNIQUES IN GEPHIBig Data Analytics- USE CASES SOLVED USING NETWORK ANALYSIS TECHNIQUES IN GEPHI
Big Data Analytics- USE CASES SOLVED USING NETWORK ANALYSIS TECHNIQUES IN GEPHI
Ruchika Sharma
 
Characterizing microblogs
Characterizing microblogsCharacterizing microblogs
Characterizing microblogs
Etico Capital
 
Profile Analysis of Users in Data Analytics Domain
Profile Analysis of   Users in Data Analytics DomainProfile Analysis of   Users in Data Analytics Domain
Profile Analysis of Users in Data Analytics Domain
Drjabez
 
INFORMATION RETRIEVAL TOPICS IN TWITTER USING WEIGHTED PREDICTION NETWORK
INFORMATION RETRIEVAL TOPICS IN TWITTER USING WEIGHTED PREDICTION NETWORKINFORMATION RETRIEVAL TOPICS IN TWITTER USING WEIGHTED PREDICTION NETWORK
INFORMATION RETRIEVAL TOPICS IN TWITTER USING WEIGHTED PREDICTION NETWORK
IAEME Publication
 

Similar to Latent Dirichlet Allocation as a Twitter Hashtag Recommendation System (20)

Independent Study_Final Report
Independent Study_Final ReportIndependent Study_Final Report
Independent Study_Final Report
 
Twitter Sentiment Analysis: An Unsupervised Approach
Twitter Sentiment Analysis: An Unsupervised ApproachTwitter Sentiment Analysis: An Unsupervised Approach
Twitter Sentiment Analysis: An Unsupervised Approach
 
SentimentAnalysisofTwitterProductReviewsDocument.pdf
SentimentAnalysisofTwitterProductReviewsDocument.pdfSentimentAnalysisofTwitterProductReviewsDocument.pdf
SentimentAnalysisofTwitterProductReviewsDocument.pdf
 
Groundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitterGroundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitter
 
E017433538
E017433538E017433538
E017433538
 
Text mining on Twitter information based on R platform
Text mining on Twitter information based on R platformText mining on Twitter information based on R platform
Text mining on Twitter information based on R platform
 
Detailed Investigation of Text Classification and Clustering of Twitter Data ...
Detailed Investigation of Text Classification and Clustering of Twitter Data ...Detailed Investigation of Text Classification and Clustering of Twitter Data ...
Detailed Investigation of Text Classification and Clustering of Twitter Data ...
 
USING HASHTAG GRAPH-BASED TOPIC MODEL TO CONNECT SEMANTICALLY-RELATED WORDS W...
USING HASHTAG GRAPH-BASED TOPIC MODEL TO CONNECT SEMANTICALLY-RELATED WORDS W...USING HASHTAG GRAPH-BASED TOPIC MODEL TO CONNECT SEMANTICALLY-RELATED WORDS W...
USING HASHTAG GRAPH-BASED TOPIC MODEL TO CONNECT SEMANTICALLY-RELATED WORDS W...
 
ABBREVIATION DICTIONARY FOR TWITTER HATE SPEECH
ABBREVIATION DICTIONARY FOR TWITTER HATE SPEECHABBREVIATION DICTIONARY FOR TWITTER HATE SPEECH
ABBREVIATION DICTIONARY FOR TWITTER HATE SPEECH
 
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docxBUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
 
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docxBUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
 
Conversational aspects of twitter
Conversational aspects of twitterConversational aspects of twitter
Conversational aspects of twitter
 
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly DetectionDetection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
 
Applying Clustering Techniques for Efficient Text Mining in Twitter Data
Applying Clustering Techniques for Efficient Text Mining in Twitter DataApplying Clustering Techniques for Efficient Text Mining in Twitter Data
Applying Clustering Techniques for Efficient Text Mining in Twitter Data
 
Tweet segmentation and its application to named entity recognition
Tweet segmentation and its application to named entity recognitionTweet segmentation and its application to named entity recognition
Tweet segmentation and its application to named entity recognition
 
Ire major project
Ire major projectIre major project
Ire major project
 
Big Data Analytics- USE CASES SOLVED USING NETWORK ANALYSIS TECHNIQUES IN GEPHI
Big Data Analytics- USE CASES SOLVED USING NETWORK ANALYSIS TECHNIQUES IN GEPHIBig Data Analytics- USE CASES SOLVED USING NETWORK ANALYSIS TECHNIQUES IN GEPHI
Big Data Analytics- USE CASES SOLVED USING NETWORK ANALYSIS TECHNIQUES IN GEPHI
 
Characterizing microblogs
Characterizing microblogsCharacterizing microblogs
Characterizing microblogs
 
Profile Analysis of Users in Data Analytics Domain
Profile Analysis of   Users in Data Analytics DomainProfile Analysis of   Users in Data Analytics Domain
Profile Analysis of Users in Data Analytics Domain
 
INFORMATION RETRIEVAL TOPICS IN TWITTER USING WEIGHTED PREDICTION NETWORK
INFORMATION RETRIEVAL TOPICS IN TWITTER USING WEIGHTED PREDICTION NETWORKINFORMATION RETRIEVAL TOPICS IN TWITTER USING WEIGHTED PREDICTION NETWORK
INFORMATION RETRIEVAL TOPICS IN TWITTER USING WEIGHTED PREDICTION NETWORK
 

Recently uploaded

Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 

Recently uploaded (20)

Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 

Latent Dirichlet Allocation as a Twitter Hashtag Recommendation System

hook us up with a job interview if you think we had some good ideas in here.

Hashtags appear to be the natural categorization index for tweets; however, since only 8% of tweets contain hashtags, they cannot be used as a direct categorization for all tweets. Compounding this issue, there are no prescribed hashtags: any sequence of characters can be a hashtag as long as it has a # in front of it. A hashtag recommendation system can be implemented to encourage users to use more hashtags, and to provide a more consolidated hashtag base for tweet categorization. But to avoid prescribing hashtags, and thus limiting the free expression of Twitter users, it is more reasonable to generate the suggested hashtags from the content of Twitter itself. This has the added benefit of providing an adaptable system that can develop alongside the vocabulary and interests of the Twitter user base.

2. RELATED WORK
Although there has been much research into the structure and dynamics of microblogging networks, and many insights have been gained by studying how users interact with one another and with the world, hashtag recommendation remains underexplored. Hashtags are an important feature of Twitter because they aid in the classification and easy retrieval of tweets, and the recommendation problem has been addressed in a number of ways.

There have been recent developments in hashtag recommendation for Twitter data, such as Zangerle et al. [10] and Kywe et al. [6], where recommendations are made on the basis of similarity, either of tweet content or of users. Zangerle et al. [10] analyzed three ways of ranking candidate hashtags: by the overall popularity of the hashtag, by its popularity within the most similar tweets, and by the similarity of the tweets it appears in. Of these, the third approach was found to outperform the rest. Mazzia and Juett [8], on the other hand, use a naive Bayes model to determine the relevance of hashtags to an individual tweet.
Since these methods do not take into consideration the personal preferences of the user, Kywe et al. [6] suggest recommending hashtags from similar users as well as from similar tweets, achieving more than 20% higher accuracy than the methods above. One drawback of these approaches is that they rely on existing hashtags to recommend new ones. Since the current number of hashtags is already very low, and very few of them are ever repeated, reusing these hashtags will not improve the classification of tweets [2]. Godin et al. [2] took these facts into consideration and applied an LDA model trained to cluster tweets into a number of topics, from which keywords can be suggested for new tweets. In this paper, we extend the work done by Godin et al. [2] by incorporating the user profile approach as suggested in Hong et
Table 1: Twitter sample, Sep 2009 to Jan 2010

  UserID     Tweet
  66         For all the everything you’ve brought to my internet life #mathowielove
  53717680   What’s good Tweeps this Friday
  16729208   ”I bet you all thought Bert & Ernie were just roommates!”
  86114354   @miakhyrra take me with you! lol

al. [5]. In this approach, tweets were aggregated by the user IDs of their authors, and a new set of documents was created in which each document is a user profile: a unique user ID and his or her combined set of tweets. We therefore explore some of the more promising approaches to hashtag recommendation for tweets, and investigate their implications for further research toward a more successful topic-model-based Twitter hashtag recommender. Due to the success of the user profile approach, we employ and evaluate this method to further investigate its effectiveness.

3. DATASET AND PREPROCESSING
The datasets used were obtained from Cheng, Caverlee, and Lee [9]. To reduce noise in the text data, a set of stop words was removed from the tweet bodies, the remaining words were stemmed, and any hyperlinks were removed. A bag-of-words model was created, and TF-IDF vectors were generated. For the user profile LDA approach [5] and the Twitter-LDA model [11], the vectors were aggregated by user ID. We also processed each tweet as its own document, "as-is", for the Twitter-LDA model.

3.1 The Dataset
The training dataset from Cheng, Caverlee, and Lee [9] contains 3,844,612 tweets from 115,886 Twitter users over the period September 2009 to January 2010. The test set, also from [9], contains 5,156,047 tweets from 5,136 Twitter users over the same period. In general, tweets contain between 1 and 20 words, with an average of 7 words per tweet; a smaller proportion of tweets contain more than 20 words, as seen in Figure 1. Each line of the dataset contains a unique user ID and tweet ID, the text content, and a timestamp. The text content of each tweet is limited to 140 characters and can contain references to other users of the form @username, as well as popular #hashtags. Many tweets also include hyperlinks, which are often passed through URL shorteners (e.g. http://goo.gl/uLLAe). The implications of these more anomalous text instances are considered in the next section.

An additional document corpus was generated from each dataset by aggregating tweets by user ID. This reduction is referred to as the user profile model [5]. After aggregation, the corpus contained a total of 115,886 documents. Data preprocessing was performed on both of these corpora, as described in the next section.

3.2 Data Preprocessing and Reduction
Due to the unconventional vocabulary of tweets and the large amount of noise inherent in such a diverse body of text, we performed several data preprocessing steps. A regular expression was used to remove any non-Latin characters and any URL links, as URLs are generated randomly and cannot easily be related to a topic. We also removed all numbers, punctuation, retweet markers, and words starting with the character @. After pruning the tweet contents, the text was tokenized, stop words were removed, and the remaining tokens were stemmed using the Porter stemmer from the Natural Language Toolkit (NLTK). The prepared corpus of tweets was then converted to a set of term frequency (TF) vectors, using the corpora library from gensim. We noticed that many tweets contained excessive repetition of words (for instance, one tweet read: "@jester eh eh eh eh eh eh eh eh eh..."), so to reduce bias toward overused words we removed words appearing more than 5 times in a tweet, in addition to using the TF-IDF model to reduce the emphasis on such words. To reduce bias toward common words across the corpus, we removed terms appearing in over 70% of the documents, as per [11].
As seen in Figure 2, before preprocessing the data was dominated by common words that do not convey much meaning (e.g. get, u, just, one). After preprocessing, however, we begin to see more meaningful words (e.g. people, twitter, home, blog in Figure 3). After preprocessing, there were 26,542,006 total words and 184,002 unique words in the vocabulary.

Figure 1: Plate Diagram of LDA.

4. MODEL AND METHODOLOGY

Latent Dirichlet Allocation, first proposed by Blei, Ng, and Jordan [1], is a generative probabilistic model that supposes each document in a corpus is generated from a smaller, hidden set of topics. Each word in a document is drawn from a topic, making every document a mixture of samples from a variety of different topics. The plate diagram in Figure 1 illustrates the generative story for a given word in this model. Here, the outer plate represents the corpus level, where M is the number of documents, and the inner plate represents the document level, where N is the number of words in a given document.

Figure 2: Average Hellinger Distances Between Topic Models.

Working from the outside in, α is a parameter of the Dirichlet prior which defines the topic distribution for a set of documents. This informs θ, the topic distribution for a given document, which in turn supplies us with a topic, z. Ultimately, each word in each document is drawn from one such topic, which is itself a distribution over words with a Dirichlet prior parameterized by β. There are multiple ways of estimating these hidden topics, including Markov Chain Monte Carlo techniques such as Collapsed Gibbs Sampling [3], or a stochastic gradient descent method such as in [4]. In this paper we utilized the latter, and the details of our topic model generation are discussed in the next section. For the purpose of hashtag recommendation, we first need to see which topics are most strongly represented by a new tweet. We then propose that the most significant words from within those strongly related topics will provide a more general representation of the message contained in a given tweet. So, essentially, for each new test tweet we sample from the top topics in that tweet, and we recommend a hashtag from the most representative words in that topic. In this way, we are recommending topics that also reflect the social component of Twitter. Since the topics we find are driven by the tweets themselves, our recommendations serve as an approximation of the connections between what users are talking about.
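The generative story above can be illustrated with a toy simulation. All dimensions and hyperparameter values here are illustrative assumptions, not the settings used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: K topics, a V-word vocabulary, N words in one document
K, V, N = 3, 8, 10
alpha = np.full(K, 0.1)                          # Dirichlet prior on per-document topic mixtures
beta = rng.dirichlet(np.full(V, 0.1), size=K)    # each topic is a distribution over words

theta = rng.dirichlet(alpha)                     # theta: topic mixture for this document
doc = []
for _ in range(N):
    z = rng.choice(K, p=theta)                   # draw a topic z for this word position
    w = rng.choice(V, p=beta[z])                 # draw a word w from topic z's word distribution
    doc.append(int(w))
print(doc)                                       # a "document" of N word indices
```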
Figure 3: Two topics generated by our LDA. The first is from the User Profile model with T=20, the second is from the User Profile model with T=40. These were found to be highly similar, with a Hellinger Distance of 0.40636.

Topic 1 (User Profile, T=20): food thanksgiv good great eat love dinner coffe happi cook lunch turkey chocol recip tweet enjoy drink hope pizza friend chicken tast wine restaur sweet pie fun cream cooki delici nice pumpkin night tea chees favorit bar soup ice chef awesom fri fresh cake tonight thing bacon breakfast meal year morn grill kid thx hey work sandwich idea special hot sound place kitchen salad burger top peopl bake famili start glad call bread potato twitter tomorrow cool meet appl amaz hear holiday butter sauc lot find wonder big share serv bean parti read egg red bring open milk rice tasti

Topic 2 (User Profile, T=40): food drink coffe good wine dinner photo eat art great beer lunch thanksgiv wed tonight work cook design recip chocol tast bar night photographi pizza turkey pumpkin paint chicken fun tea happi love halloween fresh restaur cream delici photograph parti nice cooki pie start kitchen sweet tomorrow chees red open ice store awesom bottl cake enjoy soup chef breakfast imag hot favorit idea place cool holiday appl top hous special order year fri shot morn grill water glass bacon burger sandwich photoshop shop bake salad lot light meal green head big friend taco cup amaz potato free color offic bread

4.1 Getting a Topic Model

In order to generate our topic model, we utilized the LdaModel package from gensim. This package uses the onlineldavb.py algorithm that was proposed in [4]. This is an online, constant-memory implementation of the LDA algorithm, which allowed us to process a rather large dataset with limited resources. The package processes the documents in mini-batches, and we chose our batch size to be the default, 2,000.
The key parameters κ and τ, as described in [4], were kept at 0.5 and 1.0, respectively, as these values performed best in their performance analyses. Symmetric priors were also kept for the topic-to-word and document-to-topic distributions; these priors can be tweaked in order to boost certain topics or terms. Finally, our choices for the number of topics were based on the more successful results seen in [5]. For the per-tweet corpus we chose topic sizes of 50, 100, and 200, in keeping with [2]. For the user profile model, however, we turned to the experiments by Hong et al. [5], who determined that this approach tends to favor lower topic sizes; in their case a size of 40 performed best. For our own user profile implementation we used topic sizes of 20, 40, 70, and 100. Since several models were created, it is informative to see whether our various approaches had much of an effect on the final model. In order to compare the similarity of our topic distributions, we calculated the Hellinger Distance between the distributions generated for each topic size and training corpus. For two discrete distributions P and Q, each of size k, the Hellinger Distance is given by

H(P, Q) = (1/√2) · √( Σ_{i=1}^{k} (√p_i − √q_i)² )
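The formula above can be computed directly; a small self-contained helper (equivalent to H(P, Q) = √(½ Σ (√p_i − √q_i)²), so identical distributions give 0 and disjoint ones give 1):

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (range 0..1)."""
    return math.sqrt(0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                               for pi, qi in zip(p, q)))

print(hellinger([0.5, 0.5], [0.5, 0.5]))  # identical -> 0.0
print(hellinger([1.0, 0.0], [0.0, 1.0]))  # disjoint  -> 1.0
```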
Table 2: Tweets and their suggested hashtags.
- "And I want to know why I dream about food!?!? I'm sure this is not normal" → cook, happi, bad, love, great, guy, love
- "Detailed Pics of The Dj Am Dunks and Dj Premier Af1 Low QS Will be up tomorrow" → music, album, ik, vote, dec, op
- "@MacsSTLsweetie Dakota always chews the squeaker out of her toys!" → dog, code, report, free, save
- "airport fire headache - then eye exam in an hour. first one in 15 years..." → bad, guy, tonight, work, thing, good

The Hellinger Distance is a distance metric that can be used to evaluate the similarity of two probability distributions; in this case we are comparing the word distributions of the topics generated by our various LDA models. Of particular interest were the Hellinger Distances between the various user profile corpora and the single-tweet corpora. In Figure 3, we see two of the topics generated by our LDA. These topics were found to be highly similar by their Hellinger Distance, which was 0.40636. By looking over the terms in each topic, it is clear that both are related to food and eating, and each contains many of the same terms. From Figure 2, we can see that the user profile model seems to favor more similar topics. While the regular model was more susceptible to change as the number of topics was varied, the user profile model tended to have more stable topics, as the average distance remained relatively low. It is likely that the smaller number of topics contributed to this; however, we note that at T=100 the user profile topic models were still comparatively quite close to the others. The topics generated by the regular LDA approach were very diverse, and were very susceptible to changes as the topic size was altered.
4.2 Hashtag Recommendation

In order to provide recommendations for a new tweet, we first perform some quick preliminary preprocessing by removing any user-added hashtags, references to other users, URL links, and non-latin characters. From here, we can convert the new document into its bag-of-words representation by referring to our existing dictionary. We used the LDA model generated previously to infer the topics from which each word in the tweet is drawn. This can also be done using gensim: by querying the generated model, LdaModel will return the distribution of topics and their respective posterior probabilities for a given document. Following the assumption that each document represents some mixture of topics, we sample the most strongly correlated topics for a new tweet. Next, we select one of these strong topics, and recommend a hashtag from the set of words that make up that topic. In this experiment, we decided to select the first three recommendations only from the words that most strongly represent their topics, in the hope that this will strengthen the relevance of our suggestions. For 3 recommendations, we take the top two words from the top topic, and the third recommendation is the top word from the second-best topic. These three recommendations remain the same for the 5- and 7-recommendation categories as well; to get the remaining recommendations in the 5- and 7-recommendation categories, we randomly sample from the top two topics for the given tweet.

5. RESULTS AND DISCUSSION

In order to evaluate the quality of recommendations from our model, we made a simple web app, where a user can write out a tweet and the number of hashtags they would like, and see our recommendations. We intend to provide this setup to several human testers, who will evaluate whether a given hashtag is relevant to their submission or not. For the evaluation of experiments, we took a random sample of 100 test tweets and queried three different LDA models generated during training.
One was taken from the User Profile model with 40 topics, and the other two were taken from the simple tweets model with 50 and 100 topics. We report results for the user profile model with T = 40, since it performed the best. We generated three sets of hashtags for each tweet: the first set had three recommendations, the second had 5, and the third, 7. After testing with 2 human subjects, we tallied up the frequency of positive cases for each trial. For example, if three of seven recommended hashtags seem appropriate to the tester, they assign a score of 3 to that tweet for the 7-hashtag category. This is done for every tweet, for each of the three hashtag categories. The results are shown in Figure 4.

Figure 4: Histogram of the number of correctly suggested hashtags per tweet when three, five or seven hashtags are suggested with the user profile model, T = 40.

It was found that for 49% of the tweets, at least 1 suitable hashtag could be recommended out of 3 generated hashtags. Similarly, when 5 hashtags were generated, 58% of tweets had at least 1 suitable hashtag, and in the case of 7, this accuracy increased to 67%. There are instances where the tweet does not contain sufficient words, and hence none of the recommended hashtags are appropriate. We also came across recommendations like 'good' appearing in a large number of tweets; this could be eliminated by developing a more sophisticated means of drawing a topic, or by using more data that would result in better topic generation. In some cases, 5 out of 7 hashtags are appropriate, and this number could increase provided we train on more data.
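The selection scheme evaluated above (top two words from the strongest topic, top word from the second) can be sketched as follows. The topic distributions here are hypothetical placeholders; in practice they would come from querying the trained gensim model:

```python
def recommend_hashtags(doc_topics, topic_words):
    """3-recommendation scheme from Section 4.2.

    doc_topics: list of (topic_id, posterior probability) for the new tweet.
    topic_words: dict mapping topic_id -> words sorted by probability, descending.
    Returns the top two words from the strongest topic plus the top word
    from the second-strongest topic.
    """
    ranked = sorted(doc_topics, key=lambda tp: tp[1], reverse=True)
    best, second = ranked[0][0], ranked[1][0]
    return topic_words[best][:2] + topic_words[second][:1]

# Hypothetical posterior over topics and topic-word rankings (illustrative only)
doc_topics = [(0, 0.7), (1, 0.2), (2, 0.1)]
topic_words = {0: ["food", "dinner", "cook"],
               1: ["music", "album", "band"],
               2: ["travel", "flight", "hotel"]}
print(recommend_hashtags(doc_topics, topic_words))  # -> ['food', 'dinner', 'music']
```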
6. FUTURE WORK

In the future, more preprocessing work could be done in order to make more meaningful topics, and thus better hashtag recommendations. Some words that bear little meaning on their own (e.g. good) still made it into many of our topics, and their presence in so many of the topics had the effect of polluting our recommendations. Additionally, it would be very advantageous to get more data. Ultimately our experiments were performed on around 3 million tweets; however, this really isn't a significant number. Many words from test tweets were not incorporated into our model, and thus predictions on those words cannot be made reliably. By using the Twitter Streaming API, we could get a dense dataset of 10 million+ tweets over a much shorter period of time. As well as getting more data, there are other models we would like to try, namely the Twitter-LDA model proposed in [11], and the Author-Topic model. In this way we would aim to get stronger topics that better reflect the structure of tweet data. One more enhancement we considered was aggregating a collection of popular tweets and trying to boost the prior distributions of our LDA so that their terms are weighted more heavily. Doing this, we may be able to use meaningful, pre-existing hashtags in our recommendations. Besides enhancements to our approach, we encountered a few interesting ideas while working on this project. One interesting avenue of study would be topic-based tweet and user recommendations. One possible experiment would involve using the user profile model to generate a corpus, then taking a new user profile and recommending users based on some distance measure between their posterior topic probabilities.

7. REFERENCES
[1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
[2] Fréderic Godin, Viktor Slavkovikj, Wesley De Neve, Benjamin Schrauwen, and Rik Van de Walle.
Using topic models for Twitter hashtag recommendation. In Proceedings of the 22nd International Conference on World Wide Web Companion, pages 593–596. International World Wide Web Conferences Steering Committee, 2013.
[3] Tom Griffiths. Gibbs sampling in the generative model of latent Dirichlet allocation. 2002.
[4] Matthew Hoffman, Francis R. Bach, and David M. Blei. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 856–864, 2010.
[5] Liangjie Hong and Brian D. Davison. Empirical study of topic modeling in Twitter. In Proceedings of the First Workshop on Social Media Analytics, pages 80–88. ACM, 2010.
[6] Su Mon Kywe, Tuan-Anh Hoang, Ee-Peng Lim, and Feida Zhu. On recommending hashtags in Twitter networks. In Social Informatics, pages 337–350. Springer, 2012.
[7] Tianxi Li, Yu Wu, and Yu Zhang. Twitter hash tag prediction algorithm. In ICOMP'11: The 2011 International Conference on Internet Computing, 2011.
[8] Allie Mazzia and James Juett. Suggesting hashtags on Twitter. EECS 545m, Machine Learning, Computer Science and Engineering, University of Michigan, 2009.
[9] Z. Cheng, J. Caverlee, and K. Lee. You are where you tweet: A content-based approach to geo-locating Twitter users. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM), October 2010.
[10] Eva Zangerle, Wolfgang Gassler, and Günther Specht. Recommending #-tags in Twitter. In Proceedings of the Workshop on Semantic Adaptive Social Web (SASWeb 2011), CEUR Workshop Proceedings, volume 730, pages 67–78, 2011.
[11] Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan, and Xiaoming Li. Comparing Twitter and traditional media using topic models. In Advances in Information Retrieval, pages 338–349. Springer, 2011.