Geospatial Clustering and Classifying of Twitter Data
Taylor Graham
University of Colorado
taylor.s.graham@colorado.edu
Spring 2015
Abstract
This paper investigates how to organize and classify a large collection of geotagged Twitter data, working
with a dataset of almost 15 million tweets collected directly from Twitter. The approach described here
combines spatial analysis of the GPS location of the tweet with content analysis of the text and hashtags
associated with the same tweet. I use the spatial distribution of where people tweet from to define ‘hot
spots’, the regions across America that produce the largest volume of tweets. I then look
into those regions to identify their most popular topics and hashtags, and
compare the differences between areas. I expect the results to vary with where people are tweeting
from: posts on the East Coast, for example, will likely be about different subjects than posts on the West Coast.
I. Introduction
Social media has drastically changed the way that we as individuals communicate with the
world around us over the past 15 years or so. During the early years of social media, users
were connected to their direct peers: their friends, family, and colleagues (if they wanted
to be, that is). However, as the platforms grew and matured, people began interacting
worldwide, often with people they knew nothing about. People began connecting based on
similar ideologies and common discussion topics instead of simply whom they knew or wished
they knew.
Twitter is a social media platform loosely based on this idea. On Twitter, users can share
their opinions and communicate with other users around the world, as long as they keep
their messages short: Twitter limits each tweet to 140 characters. As part of a message,
users can include hashtags, which are keywords prefixed with a # symbol. Hashtags attach a
theme or topic to a tweet. They also work well for Twitter, which can easily track the most
popular hashtags across the world and identify the ‘trending’ topics, that is, the most
used hashtags over time. I wanted to investigate the trending topics on Twitter, but at a
finer granularity than Twitter currently reports.
II. Data
My dataset was collected by downloading public tweets and their metadata using Twitter’s
streaming API. More information about the API can be found here:
https://dev.twitter.com/streaming/overview. Data was collected at intervals during a
two-week period in late March 2015. I had to stop collecting for certain periods because I
did not have a dedicated server: any time I had to move my computer, such as when going to
class, I had to turn off the data stream. Figure 1 is a plot of the specific times when data
is available.
Figure 1: Data Availability
Data that contained geolocation data and was located in the United States was kept, while
the rest was discarded. Luckily, Twitter’s API lets you query it with a bounding box of GPS
coordinates; the API then returns only data that is geolocated within that bounding box.
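The bounding-box test itself is simple to illustrate. The following is a minimal sketch of checking whether a point falls inside such a box; it does not call Twitter's API, and the `BoundingBox` type and the corner coordinates are assumptions made for illustration, not the values used to collect this dataset.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """Axis-aligned box in degrees, given by its west/south/east/north edges."""
    west: float   # minimum longitude
    south: float  # minimum latitude
    east: float   # maximum longitude
    north: float  # maximum latitude

    def contains(self, lon: float, lat: float) -> bool:
        """Return True if the point lies inside the box (edges included)."""
        return self.west <= lon <= self.east and self.south <= lat <= self.north


# Hypothetical, approximate box around the contiguous United States.
# These corners are illustrative only, not the ones used in the paper.
CONTIGUOUS_US = BoundingBox(west=-125.0, south=24.0, east=-66.0, north=50.0)

# (longitude, latitude) pairs, approximate city centres
points = {
    "Boulder": (-105.27, 40.01),
    "London": (-0.13, 51.51),
}
inside = {name: CONTIGUOUS_US.contains(lon, lat) for name, (lon, lat) in points.items()}
```

A rectangular box is only a coarse filter: any rectangle around the United States also covers parts of neighbouring countries and open water.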
Data is returned from Twitter’s servers formatted as JSON objects. These objects have many
fields useful to this research, such as the message text, hashtags, and GPS coordinates
mentioned earlier. However, they also contain many fields that were unnecessary for the
system, such as user information, retweets, and favorite counts. To limit the size of the
database, I chose to save only the relevant fields: tweet_id, message text, timestamp, GPS
coordinates, and user_id. This reduced the size of a typical tweet from around 2.5 KB to
0.8 KB. Once I had the final format of the data, I populated a PostGIS database using
Python scripts. In total, I collected 14,412,529 tweets, with 754,109 unique hashtags
shared between them.
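As a rough illustration of this trimming step, the sketch below keeps only the fields listed above from one tweet's JSON. The input field names (`id`, `text`, `created_at`, `coordinates`, `user.id`) are my assumption about the tweet layout, and the function name and sample tweet are hypothetical; the actual scripts used for this project are not shown here.

```python
import json
from typing import Optional


def slim_tweet(raw: str) -> Optional[dict]:
    """Keep only the stored fields; return None if the tweet has no point location."""
    tweet = json.loads(raw)
    point = tweet.get("coordinates")
    if not point:
        return None
    lon, lat = point["coordinates"]  # assumed GeoJSON order: longitude, latitude
    return {
        "tweet_id": tweet["id"],
        "text": tweet["text"],
        "timestamp": tweet["created_at"],
        "lon": lon,
        "lat": lat,
        "user_id": tweet["user"]["id"],
    }


# Made-up sample tweet, including fields that get dropped.
sample = json.dumps({
    "id": 1,
    "text": "Sunny day on campus #boulder",
    "created_at": "Mon Mar 23 18:00:00 +0000 2015",
    "coordinates": {"type": "Point", "coordinates": [-105.27, 40.01]},
    "user": {"id": 42, "screen_name": "example"},
    "retweet_count": 0,
    "favorite_count": 0,
})
record = slim_tweet(sample)
```

Each resulting record could then be written as one row of the database table.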
A problem I ran into with Twitter’s data is that there are quite a few ‘bots’, set up by
humans, that automatically post tweets, usually spamming some advertisement or recruitment
event to all of Twitter. These bots can greatly skew the frequency of hashtags in a
particular region, or even the entire world if they send enough messages. To limit the
amount of spam from a single bot, I chose to further filter the data set by ensuring that
each post from a specific GPS location was from a unique user_id. This takes care of a
single user posting many tweets from one location, which is the case with these bots. This
filtering step cut the number of tweets under consideration from about 14 million to about
2.5 million.
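One way to express this filter is sketched below: keep the first tweet seen for each (user, location) pair and drop the rest. The function name, the record layout, and the rounding precision used to decide that two coordinates are "the same location" are all assumptions for illustration; the text above does not specify them.

```python
def one_tweet_per_user_per_location(tweets, precision=4):
    """Keep the first tweet for each (user_id, rounded lat, rounded lon) key."""
    seen = set()
    kept = []
    for tweet in tweets:
        # Rounding groups nearly identical coordinates; the precision is an assumption.
        key = (
            tweet["user_id"],
            round(tweet["lat"], precision),
            round(tweet["lon"], precision),
        )
        if key not in seen:
            seen.add(key)
            kept.append(tweet)
    return kept


# Made-up records for illustration.
tweets = [
    {"user_id": 1, "lat": 40.0150, "lon": -105.2705},
    {"user_id": 1, "lat": 40.0150, "lon": -105.2705},  # same user, same spot: dropped
    {"user_id": 2, "lat": 40.0150, "lon": -105.2705},  # other user, same spot: kept
    {"user_id": 1, "lat": 39.7392, "lon": -104.9903},  # same user, other spot: kept
]
kept = one_tweet_per_user_per_location(tweets)
```

With this rule a bot that posts thousands of times from one coordinate contributes a single tweet, while ordinary users at the same place are unaffected.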
III. Methods
Given a large collection of geotagged tweets, I wanted to automatically find the popular
places from which people are tweeting. In measuring how popular a place is, I consider only
the number of unique users who post a tweet from that location. This prevents places like
apartment complexes and common residential areas from overwhelming the results, as many
users tweet daily from their homes and offices. I expect this also accounts for the wide
variability in tweeting rates and behaviour across individuals. Most of the analysis
methods were inspired by [Mapping the World’s Photos, 2009], which performed a similar
analysis on Flickr photo collections instead of tweets.
The problem of finding the ‘hot spots’, the areas of the country that are tweeted from most
frequently, can be viewed as a problem of clustering points in a two-dimensional feature
space. I chose mean-shift clustering over a fixed-cluster method such as k-means, which
requires the user to define the number of clusters beforehand. Mean shift is a
non-parametric method for estimating the modes of an underlying probability distribution
from a set of samples, given just an estimate of the scale of the data. For my project
specifically, there is an underlying, unobservable probability distribution of where people
tweet from, whose modes are the most popular places to tweet from. Mean shift estimates the
modes of this distribution using only the directly observed locations of tweets.
The main idea behind mean shift is to treat the GPS points in the 2-D feature space as
samples from an empirical probability density function, where dense regions of the feature
space correspond to the local maxima, or modes, of the underlying distribution. For each
data point, a gradient ascent procedure is performed on the locally estimated density until
it converges. The stationary points identified by this procedure represent the modes of the
distribution. Additionally, the data points that converge to roughly the same stationary
point are considered members of the same cluster. Below is an example of the result of mean
shift being run on arbitrary points in space.
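To make the procedure concrete, here is a minimal pure-Python sketch of mean shift with a flat kernel: each point is repeatedly moved to the mean of the data points within one bandwidth of it, and points that settle on nearly the same location share a cluster. This is an illustration only, not the implementation used for the results in this paper (the references point to scikit-learn's MeanShift). The function names, the toy points, the bandwidth, and the rule for merging nearby modes are all assumptions.

```python
import math


def shift_to_mode(start, points, bandwidth, tol=1e-6, max_iter=200):
    """Move `start` uphill by replacing it with the mean of the points within `bandwidth`."""
    x, y = start
    for _ in range(max_iter):
        near = [(px, py) for px, py in points if math.hypot(px - x, py - y) <= bandwidth]
        if not near:  # defensive guard; not expected when starting from a data point
            break
        new_x = sum(p[0] for p in near) / len(near)
        new_y = sum(p[1] for p in near) / len(near)
        moved = math.hypot(new_x - x, new_y - y)
        x, y = new_x, new_y
        if moved < tol:  # converged: the window mean no longer moves
            break
    return x, y


def mean_shift(points, bandwidth):
    """Return (modes, labels); labels[i] is the index of the mode that points[i] reached."""
    modes, labels = [], []
    for p in points:
        m = shift_to_mode(p, points, bandwidth)
        for i, known in enumerate(modes):
            # Assumed merge rule: modes closer than half a bandwidth are the same mode.
            if math.hypot(m[0] - known[0], m[1] - known[1]) < bandwidth / 2:
                labels.append(i)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels


# Two small, well-separated toy blobs (arbitrary units).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
modes, labels = mean_shift(points, bandwidth=1.0)
```

Note that the number of clusters is never specified; it follows from the data and the bandwidth, which is the single scale parameter mentioned above. This naive version compares every point with every other point on each iteration, so it would need spatial indexing or binning before being applied to millions of tweets.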
Once the clusters are generated using mean shift, I look at the topics of each cluster. A
single database query returned a list of every hashtag contained in every tweet within a
cluster. I then parsed the list for each cluster and sorted the hashtags by frequency. I
tracked the top 5 hashtags in each cluster, since I was only interested in the most common
topics in that region. Ideally, the major topics would differ from cluster to cluster, so
that analysing only the top 5 hashtags would be enough to see a difference. Once topics
were identified, comparisons were made between neighbouring clusters as well as between
clusters in entirely different parts of America. These comparisons support some conclusions
about tweets based solely on the location they originate from.
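The counting step can be sketched in a few lines. The example below assumes the query results are available as (cluster_id, hashtag) pairs; that layout, the function name, and the sample rows are hypothetical and stand in for the actual database output.

```python
from collections import Counter, defaultdict


def top_hashtags(rows, k=5):
    """Map each cluster id to its k most frequent hashtags, most frequent first."""
    counts = defaultdict(Counter)
    for cluster_id, hashtag in rows:
        counts[cluster_id][hashtag] += 1
    return {
        cluster_id: [tag for tag, _ in counter.most_common(k)]
        for cluster_id, counter in counts.items()
    }


# Made-up rows for illustration.
rows = [
    (1, "NYC"), (1, "NYC"), (1, "nyc"),
    (2, "LA"), (2, "LA"), (2, "LA"),
    (2, "losangeles"), (2, "losangeles"), (2, "la"),
]
top = top_hashtags(rows, k=2)
```

Here hashtags are compared as exact strings, so `NYC` and `nyc` are counted separately.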
Figure 2: Clustered Tweets
Location    Top Hashtag  2nd Hashtag      3rd Hashtag   4th Hashtag  5th Hashtag
Cluster 1   NYC          nyc              newyork       photo        Brooklyn
Cluster 2   LA           losangeles       LosAngeles    California   la
Cluster 3   SXSW         sxsw             photo         Austin       austin
Cluster 4   photo        Job              toronto       apple        applegeeks
Cluster 5   Miami        miami            photo         miamibeach   ultra2015
Cluster 6   chicago      Chicago          photo         Job          breakfast
Cluster 7   Atlanta      Job              Jobs          TweetMyJobs  Nursing
Cluster 8   photo        SanFrancisco     sf            SF           love
Cluster 9   photo        trndnl           Job           nc           clt
Cluster 10  photo        seattle          Seattle       vancouver    Vancouver
Cluster 11  earthquake   photo            Earthquake    Sismo        USGS
Cluster 12  photo        fashion          beautiful     fashionshow  springintostyle
Cluster 13  jagstate     RubberBandYogis  FLUFFNFOLD    Missouri     PetsLoveUsDoingYoga
Cluster 14  photo        trndnl           Memphis       HAA          TheHauntedGhostTown
Cluster 15  PCB          pcb2k15          PCB2K15       NOLA         SpringBreak
Cluster 16  photo        trndnl           Job           EauClaire    minneapolis
Cluster 17  photo        Job              kc            kcfw         DesMoines
Cluster 18  photo        drums            BeatADay      denver       Denver
Cluster 19  utah         nature           photo         mountains    travel
Cluster 20  trndnl       photo            BoomerSooner  buyacar      kbb
Table 1: Most popular hashtags in each cluster
IV. Results and Discussion
It turns out that the hashtags people use vary greatly depending on where they are tweeting
from. Figure 2 is a plot of one solution of the clustering algorithm run on the entire data
set, 2,469,476 tweets in total. The color of each cluster is arbitrary and serves only to
make it easier to tell two clusters apart. The clusters are ordered by the number of tweets
they contain: Cluster 1 has the largest volume of tweets, Cluster 2 the second largest, and
so on. Table 1 lists the top five hashtags found in each of the 20 most popular clusters. I
chose to display only the top 20 clusters, as they are the most frequently tweeted-from
areas in America. Additionally, the clusters after 20 have less variation in users, so they
become dominated by the hashtags #Job, #photo, #trndnl, and #TweetMyJobs.
There are a couple of things you can immediately conclude from the figure. First of all,
you can clearly see the outline of the United States just from plotting the collected
tweets. Some interesting things I noticed were the volume of tweets in the Bahamas, and the
fact that you can actually make out the Great Lakes. Also, the density of the data points
shows that many more tweets are posted from the East Coast and West Coast than from the
Midwest. This is likely due to the difference in population of those areas rather than the
rate at which individuals tweet. I noticed that the tweets begin to disperse rapidly around
Cluster 29, at the Texas-Mexico border. I speculate that this is due to the language
barrier, and that people begin posting in Spanish more frequently somewhere around there.
Looking at the table, you can clearly see differences in the topics between clusters. The
most posted tags were typically the name of the closest large city or state in that
cluster. In fact, in many cases you can guess what part of the United States a cluster is
located in without even looking at the map.
V. Limitations and Future Work
A large limitation of this project was finding good values to seed my clustering algorithm
with. These values greatly impact the clusters that mean shift identifies, and I suspect
that further work on the algorithm could have found better ones. There was also some
variability in the size of each cluster. I adjusted the algorithm by trial and error until
the resulting clusters were reasonably small on a national level but still distinguishable
from one another. Further work could have been done to find a better value for this as
well. I think additional studies at the city level instead of the country-wide level would
be worth looking into. The clustering algorithm would have to be tuned further to fit the
data, but I still think that popular regions within a city would be identified, and that
those regions would potentially have differing tweet content.
Another large limitation I ran into, specifically with the data, was the number of posts
that were clearly from a Twitter posting bot. These bots post thousands of messages a day,
each with the same or similar hashtags. Despite my efforts to remove duplicate posts and
posts from a single user in the same location, some of these hashtags still made it
through. Many of the clusters were dominated by the hashtags #Job and #Jobs. This is more
evident in clusters 25 and beyond, as those are the clusters with the fewest tweets. The
tag #photo was by far the most common hashtag, coming up first in most of the clusters.
One thing I didn’t take into account in the tweet analysis was the case of the hashtags. In
the data Twitter returns, #NYC and #nyc are distinct strings, but for my analysis they
could easily be considered the same tag entirely. Doing so would allow for a larger
variation in the results, as the ‘almost duplicates’ would be combined.
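A minimal sketch of that merge follows: hashtag counts are folded to lower case so that case variants add up under a single key. The function name and the counts are made up for illustration; this step was not part of the analysis reported here.

```python
from collections import Counter


def merge_case_variants(counts):
    """Combine counts of hashtags that differ only by letter case."""
    merged = Counter()
    for tag, n in counts.items():
        merged[tag.casefold()] += n
    return merged


# Made-up counts for one cluster.
raw_counts = Counter({"NYC": 120, "nyc": 95, "newyork": 60, "photo": 40, "Brooklyn": 30})
merged = merge_case_variants(raw_counts)
top3 = merged.most_common(3)
```

After merging, the slot that `nyc` occupied separately is freed for another tag, which is how the top five of a cluster would gain variety.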
An obvious extension of this study is to look not only at the contents of each cluster
using static, previously collected data, but also at the data as it is consumed by the
system in real time. This would allow us to see not only the differences between clusters,
but also how those clusters change over time. This would be particularly interesting to
watch as a new topic emerges on social media. An immediate example is the current protests
happening in Baltimore. In that area of the country, I would expect the frequency of tweets
to increase, producing larger clusters than were there previously. Additionally, I would
expect the most frequent hashtags not only to differ from those on the West Coast and in
the South, but also to change rapidly at the point when the protests first began.
VI. Conclusion
In this paper, I introduced several techniques for analyzing and classifying a large
collection of geotagged Twitter data. The approach described combines spatial analysis of
the GPS latitude and longitude of a tweet with content analysis of the text and hashtags
associated with the same tweet. I presented a technique that uses the mean-shift clustering
algorithm to automatically identify the places in the United States from which the most
tweets are posted. Once regions are defined, I investigate the theme of each region by
identifying its top 5 most used hashtags, and compare the most popular hashtags across
regions. Preliminary investigation shows that the process worked, and that there are clear
differences between the hashtags used in different locations across the United States.
However, further analysis and refinement of the algorithms may be necessary to draw
meaningful conclusions beyond that.
References
[Mapping the World’s Photos, 2009] David Crandall, Lars Backstrom, Daniel Huttenlocher, Jon
Kleinberg (2009). Mapping the World’s Photos.
[Mean Shift and other clustering algorithms]
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MeanShift.html