Geospatial Clustering and Classifying
of Twitter Data
Taylor Graham
University of Colorado
taylor.s.graham@colorado.edu
Spring 2015
Abstract
This paper investigates how to organize and classify a large collection of geotagged Twitter data, working
with a dataset of almost 15 million tweets collected directly from Twitter. The approach described here
combines spatial analysis of the GPS location of the tweet with content analysis of the text and hashtags
associated with the same tweet. I use the spatial distribution of where people tweet from to define ‘hot
spots’, which are the regions across America that produce the largest volume of tweets. I then look
into those specific regions to try to identify their most popular topics and hashtags, and
compare the differences between areas. I expect my results to vary based on where people are tweeting
from. Posts on the east coast will likely be about different subjects than posts on the west coast, for example.
I. Introduction
Social media has drastically changed the way that we as individuals
communicate with the world around us over the past 15 years or so. During
the early years of social media, a user would be connected to their direct
peers: their friends, family, and colleagues (if they so wanted to be, that
is). However, as the social platforms grew and matured, people began
interacting worldwide, and with people they know literally nothing about.
People began connecting based on similar ideologies and common discussion
topics instead of simply who they knew or wished they knew.
Twitter is a social media platform loosely based on this idea. On Twitter,
users are able to share their opinions and communicate with other users
around the entire world, as long as they keep their messages short: Twitter
limits users to 140 characters for any particular tweet. As a part of this
message, users can include hashtags, which are keywords prefixed with a #
symbol. These hashtags attach a theme or topic to each tweet. This also
works well for Twitter, because it can easily track the most popular
hashtags across the world and identify the ‘trending’ topics, the most used
hashtags over time. I wanted to investigate the trending topics on Twitter,
but at a finer granularity than what Twitter currently reports.
II. Data
My dataset was collected by downloading pub-
lic tweets and their metadata using Twitter’s
streaming API. More information about the API can be found at
https://dev.twitter.com/streaming/overview. Data was collected
intermittently over a two-week period in late March 2015. I had to stop collecting data
for certain periods of time because I did not
have a dedicated server to collect the tweets,
so any time I had to move my computer, such
as going to class, I had to turn off the data
stream. Figure 1 is a plot of the specific times when data is available.
Only tweets that contained geolocation data and were located in the United
States were kept; the rest were discarded. Luckily, Twitter’s API can be
queried with a bounding box of GPS coordinates, and it then returns only
tweets that are geolocated inside that box.
Figure 1: Data Availability
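The bounding-box test itself is simple enough to sketch. The snippet below is only an illustration of the idea, not the collection script used for this paper: the box coordinates and the names `US_BBOX` and `in_bounding_box` are assumptions chosen for the example.

```python
# Hypothetical bounding box roughly covering the contiguous United States,
# written as (west_lon, south_lat, east_lon, north_lat). These numbers are
# an illustrative assumption, not the box used in the paper.
US_BBOX = (-125.0, 24.0, -66.0, 50.0)


def in_bounding_box(lon, lat, bbox=US_BBOX):
    """Return True if the point (lon, lat) lies inside bbox, edges included."""
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


# A point in Boulder, Colorado falls inside the box; a point in Paris does not.
boulder_inside = in_bounding_box(-105.27, 40.01)
paris_inside = in_bounding_box(2.35, 48.85)
```

A rectangular box necessarily includes some area outside the country, so a check like this is a coarse first filter rather than an exact border test.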
Data is returned from Twitter’s servers formatted as JSON objects. These
objects have many fields useful to this research, such as the message text,
hashtags, and GPS coordinates, as mentioned earlier. However, they also
contain many fields that were unnecessary for the system, such as user
information, retweets, and favorite counts. To limit the size of the
database, I chose to save only the relevant fields: tweet_id, message text,
timestamp, GPS coordinates, and user_id. This reduced the size of a typical
tweet from around 2.5 KB to 0.8 KB. Once I had the final format of the data,
I populated a PostGIS database using Python scripts. In total, I collected
14,412,529 tweets, with 754,109 unique hashtags shared between them.
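As a rough sketch of the trimming step, the function below keeps only the fields listed above. The input key names (`id`, `text`, `created_at`, `coordinates`, `user`) are assumptions modeled on the tweet JSON of that period, and `slim_tweet` is a hypothetical helper, not the script actually used.

```python
def slim_tweet(raw):
    """Reduce a decoded tweet object to the fields kept in the database.

    Returns None for tweets that carry no point coordinates.
    """
    point = raw.get("coordinates")
    if not point:
        return None
    # Assumed GeoJSON-style point: [longitude, latitude].
    lon, lat = point["coordinates"]
    return {
        "tweet_id": raw["id"],
        "text": raw["text"],
        "timestamp": raw["created_at"],
        "lon": lon,
        "lat": lat,
        "user_id": raw["user"]["id"],
    }
```

Everything else in the object (user profile details, retweet and favorite counts) is simply dropped, which is where the size reduction comes from.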
A problem I ran into with Twitter’s data is that there are quite a few
‘bots’, set up by humans, that automatically post tweets, usually spamming
some advertisement or recruitment event to all of Twitter. As you can
imagine, these bots can greatly skew the frequency of hashtags in a
particular region, or even the entire world if they send enough messages.
To limit the amount of spam from a single bot, I chose to further filter the
data set so that each post from a specific GPS location came from a unique
user_id. This takes care of a single user posting many tweets from one
location, which is the case with these bots. This filtering step cut the
number of tweets I was considering from about 14 million to about 2.5 million.
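One way to read this filter is "keep at most one tweet per user per exact GPS location". The sketch below implements that reading; the record layout (dicts with `user_id`, `lon`, `lat`) and the function name are assumptions for illustration, and the real filter may have been done differently, for example inside the database.

```python
def one_tweet_per_user_per_location(tweets):
    """Keep the first tweet seen for each (user_id, lon, lat) combination."""
    seen = set()
    kept = []
    for tweet in tweets:
        key = (tweet["user_id"], tweet["lon"], tweet["lat"])
        if key not in seen:
            seen.add(key)
            kept.append(tweet)
    return kept
```

A bot that posts thousands of times from one fixed coordinate collapses to a single record, while a person who tweets from several places keeps one record per place.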
III. Methods
Given a large collection of geotagged tweets, I wanted to automatically
find popular places from which people are tweeting. In measuring how popular
a place is, I only want to consider the number of unique users who post a
tweet from that particular location. This prevents things like apartment
complexes and common residential areas from overwhelming the results, as a
lot of users tend to tweet daily from their homes and offices. I expect this
will also account for the wide variability in tweeting rates and behaviour
across different individuals. Most of the analysis methods were inspired by
[Mapping the World’s Photos, 2009], which did a similar analysis on Flickr
photo sets instead of tweets.
The problem of finding the ‘hot spots’, the areas of the country that are
most frequently tweeted from, can be viewed as a problem of clustering points
in a two-dimensional feature space. I chose to use mean-shift clustering
instead of a fixed-cluster method such as k-means, which requires the user to
define the number of clusters beforehand. Mean-shift
is a non-parametric method for estimating
the modes of an underlying probability dis-
tribution from a set of samples, given just an
estimate of the scale of the data. For my project
specifically, there is an underlying unobserv-
able probability distribution of where people
are tweeting from, with modes being the most
popular or frequent places to tweet from. Mean
shift estimates the modes of these underlying
distributions by only directly observing the
locations that people are tweeting from.
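For reference, the textbook form of this estimate can be written down explicitly. The notation here is not from the paper and is introduced only for illustration: x_i are the n observed tweet locations, h is the bandwidth (the "scale of the data"), K is a kernel, and g is the weight function derived from the kernel's profile.

```latex
\hat{f}(x) = \frac{1}{n h^{2}} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),
\qquad
x^{(t+1)} =
\frac{\sum_{i=1}^{n} x_i \, g\!\left(\left\lVert \frac{x^{(t)} - x_i}{h} \right\rVert^{2}\right)}
     {\sum_{i=1}^{n} g\!\left(\left\lVert \frac{x^{(t)} - x_i}{h} \right\rVert^{2}\right)}
```

The left expression is the kernel density estimate in two dimensions; the right one is the mean-shift update, which repeatedly moves x to a weighted average of nearby samples. When g is 1 inside the unit ball and 0 outside, the update is simply the mean of all samples within distance h of the current position.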
The main idea behind mean shift is to treat the GPS points in the 2-D
feature space as samples from an empirical probability density function,
where dense regions in the feature space correspond to the local maxima, or
modes, of the underlying distribution. For each data point, a gradient
ascent procedure is performed on the locally estimated density until it
converges. The stationary points identified by this procedure represent the
modes of the distribution. Additionally, the data points that converge to
roughly the same stationary point are considered members of the same
cluster. Below is an example of the result of mean shift being run on
arbitrary points in space.
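The procedure described above can also be sketched in a few lines of plain Python. This is a toy stand-in under stated assumptions, not the code behind the results: the reference list points to scikit-learn's MeanShift, whereas this version uses a flat kernel, treats longitude and latitude as planar coordinates, and merges modes closer than half the bandwidth. The names `mean_shift`, `bandwidth`, `max_iter`, and `tol` are chosen for the example.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Cluster 2-D points with a flat-kernel mean shift.

    Returns (centers, labels): the modes found and, for each input point,
    the index of the mode it converged to.
    """
    modes = []
    for x, y in points:
        for _ in range(max_iter):
            # Flat kernel: every sample within `bandwidth` gets equal weight.
            near = [(qx, qy) for qx, qy in points
                    if math.hypot(qx - x, qy - y) <= bandwidth]
            if not near:  # guard against floating-point edge cases
                break
            new_x = sum(q[0] for q in near) / len(near)
            new_y = sum(q[1] for q in near) / len(near)
            shift = math.hypot(new_x - x, new_y - y)
            x, y = new_x, new_y
            if shift < tol:  # the position has stopped moving
                break
        modes.append((x, y))

    # Points whose modes land close together are put in the same cluster.
    centers, labels = [], []
    for mx, my in modes:
        for idx, (cx, cy) in enumerate(centers):
            if math.hypot(mx - cx, my - cy) < bandwidth / 2:
                labels.append(idx)
                break
        else:
            centers.append((mx, my))
            labels.append(len(centers) - 1)
    return centers, labels
```

Each iteration scans every point, so this sketch is only practical for small samples; a data set of millions of tweets needs a library implementation with spatial indexing or binning.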
Once the clusters were generated using mean shift, I looked at the topics
of each cluster. A single query to my database returned a list of every
hashtag contained in every tweet within a given cluster. I then parsed the
list for each cluster and sorted the hashtags by frequency. I tracked the
top 5 hashtags in each cluster, since I was only interested in the most
common topics in that region. Ideally, the major topics would change from
cluster to cluster, so that analysing only the top 5 hashtags would be
enough to see a difference. Once topics were identified, comparisons were
made between neighbouring clusters as well as clusters in entirely different
parts of America. These comparisons supported some conclusions about tweets
based solely on the location where they originate.
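The counting step can be sketched with the standard library. The input format, an iterable of (cluster_id, hashtag) pairs, and the name `top_hashtags` are assumptions for illustration; in the paper the per-cluster hashtag lists came from a database query.

```python
from collections import Counter, defaultdict


def top_hashtags(tagged, k=5):
    """Return the k most frequent hashtags for each cluster.

    `tagged` is an iterable of (cluster_id, hashtag) pairs.
    """
    counts = defaultdict(Counter)
    for cluster_id, tag in tagged:
        counts[cluster_id][tag] += 1
    return {
        cluster_id: [tag for tag, _ in counter.most_common(k)]
        for cluster_id, counter in counts.items()
    }
```

Hashtags are compared as exact strings here, so differently capitalized spellings of the same tag are counted separately.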
Figure 2: Clustered Tweets
Location    Top Hashtag  2nd Hashtag      3rd Hashtag   4th Hashtag  5th Hashtag
Cluster 1   NYC          nyc              newyork       photo        Brooklyn
Cluster 2   LA           losangeles       LosAngeles    California   la
Cluster 3   SXSW         sxsw             photo         Austin       austin
Cluster 4   photo        Job              toronto       apple        applegeeks
Cluster 5   Miami        miami            photo         miamibeach   ultra2015
Cluster 6   chicago      Chicago          photo         Job          breakfast
Cluster 7   Atlanta      Job              Jobs          TweetMyJobs  Nursing
Cluster 8   photo        SanFrancisco     sf            SF           love
Cluster 9   photo        trndnl           Job           nc           clt
Cluster 10  photo        seattle          Seattle       vancouver    Vancouver
Cluster 11  earthquake   photo            Earthquake    Sismo        USGS
Cluster 12  photo        fashion          beautiful     fashionshow  springintostyle
Cluster 13  jagstate     RubberBandYogis  FLUFFNFOLD    Missouri     PetsLoveUsDoingYoga
Cluster 14  photo        trndnl           Memphis       HAA          TheHauntedGhostTown
Cluster 15  PCB          pcb2k15          PCB2K15       NOLA         SpringBreak
Cluster 16  photo        trndnl           Job           EauClaire    minneapolis
Cluster 17  photo        Job              kc            kcfw         DesMoines
Cluster 18  photo        drums            BeatADay      denver       Denver
Cluster 19  utah         nature           photo         mountains    travel
Cluster 20  trndnl       photo            BoomerSooner  buyacar      kbb
Table 1: Most popular hashtags in each cluster
IV. Results and Discussion
It turns out that the hashtags people use when they are tweeting vary
greatly depending on where those people are tweeting from. Figure 2 is a
plot of one solution of the clustering algorithm run on the entire data set,
2,469,476 tweets in total. The color of each cluster is arbitrary and is
only there to make it easier to tell two clusters apart. The clusters are
ordered by the number of tweets in each cluster: Cluster 1 has the largest
volume of tweets, Cluster 2 the second largest, and so on. Table 1 lists
the top five hashtags found in the 20 most popular clusters. I chose to
display only the top 20 clusters, as they are the most frequently tweeted
from areas in America. Additionally, the clusters after 20 have less
variation in users, so they become dominated by the hashtags #Job, #photo,
#trndnl, and #TweetMyJobs.
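The cluster numbering described above, where Cluster 1 is the largest, can be derived from the raw labels a clustering routine returns. The following is a small sketch; `rank_clusters` and its input, one label per tweet, are assumptions for illustration.

```python
from collections import Counter


def rank_clusters(labels):
    """Map each raw cluster label to its rank by size (1 = most tweets)."""
    sizes = Counter(labels)
    return {
        label: rank
        for rank, (label, _) in enumerate(sizes.most_common(), start=1)
    }
```

Keeping only the labels with rank 20 or lower then gives the set of clusters shown in Table 1.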
There are a couple of things you can immediately conclude from the figure.
First of all, you can clearly see the outline of the United States just from
plotting the collected tweets. Some interesting things I noticed were the
volume of tweets in the Bahamas and the fact that you can actually make out
the Great Lakes. Also, you can tell by the density of the data points that
many more tweets are being posted from the east coast and west coast than
from the midwest. This is likely due to the difference in population of
those areas rather than the rate at which individuals are tweeting. I
noticed that the tweets begin to disperse rapidly around Cluster 29, at the
Texas-Mexico border. I speculate that this is due to the language barrier,
and that people begin posting in Spanish more frequently somewhere around
there.
Looking at the table you can clearly see
differences in the topics between each cluster.
The most posted tags were typically the name of the closest
large city or state in that cluster. In fact, in
many cases, you can begin to guess what part
of the United States the cluster is located in,
without even looking at the map.
V. Limitations and Future Work
A large limitation of this project was finding good values with which to
seed my clustering algorithm. These values greatly impact the clusters that
mean shift identifies, and I suspect that further work on the algorithm
could have found better ones. There was also some variability in the size
of each cluster. I adjusted the algorithm by trial and error until the
resulting clusters seemed reasonably small on a national level, but still
large enough that you can make out the individual clusters. Further work
could have been done to find a better value for this setting as well. I
think additional studies at a city level instead of the country-wide level
would be worth looking into. The clustering algorithm would have to be
tuned further to fit the data, but I still think that popular regions within
a city would be identified, and that those regions would potentially have
differing tweet content.
Another huge limitation I ran into, specifically with the data, was the
number of posts that were clearly from a Twitter posting bot. These bots
post thousands of messages a day, each one with the same or similar
hashtags. Despite my efforts to remove duplicate posts and posts from a
single user in the same location, some of these hashtags still made it
through. Many of the clusters were dominated by the hashtags #Job and
#Jobs. This is more evident in clusters 25 and above, as those are the
clusters with the fewest tweets. The tag #photo was by far the most common
hashtag, coming up first in most of the clusters.
One thing I didn’t take into account when doing the tweet analysis was
lowercase versus uppercase hashtags. In my data, #NYC and #nyc are counted
as different tags, but for my analysis they could easily be considered the
same tag. Merging them would allow for a larger variation in the results,
as the ‘almost duplicates’ would be combined.
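Merging the near-duplicates is a small change to the counting step. The sketch below shows one possible approach, folding case before counting; the name `merged_counts` is hypothetical, and this normalization was not applied to the results reported in Table 1.

```python
from collections import Counter


def merged_counts(hashtags):
    """Count hashtags after folding case, so 'NYC' and 'nyc' share a bucket."""
    return Counter(tag.casefold() for tag in hashtags)


counts = merged_counts(["NYC", "nyc", "newyork", "Nyc"])
```

With this change, a cluster whose top five currently contains both NYC and nyc would free up a slot for another topic.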
An obvious extension of this study is to look at the contents of each
cluster not only using static, previously collected data, but also as the
data is consumed by the system in real time. This would allow us to see not
only the differences between clusters, but also how those clusters change
over time. This would be particularly interesting to watch as some new
topic emerges on social media. An immediate example I thought of is the
current protests happening in Baltimore. In that area of the country, I
would expect the frequency of tweets to increase, producing larger clusters
in the area than were there previously. Additionally, I would expect the
most frequent hashtags not only to differ from those on the west coast and
down south, but also to change rapidly at the point when the protests first
began.
VI. Conclusion
In this paper, I introduced several techniques
for analyzing and classifying a large collection
of geotagged Twitter data. The approach de-
scribed combines spatial analysis of the GPS
latitude and longitude of a tweet, with con-
tent analysis of the text and hashtags associ-
ated with the same tweet. I present a tech-
nique to automatically identify the places in
the United States which have the most tweets
being posted using the mean-shift clustering
algorithm. Then, once regions are defined, I
investigate the theme of each region by iden-
tifying the top 5 most used hashtags within
each region. Comparisons were made between
the most popular hashtags in each region. Pre-
liminary investigation shows that the process
worked, and that there are clear differences be-
tween the hashtags used in different locations
across the United States. However, further anal-
ysis and refinement of the algorithms used may
be necessary to accurately draw meaningful
conclusions beyond that.
References
[Mapping the World’s Photos, 2009] David Crandall, Lars Backstrom, Daniel Huttenlocher, Jon Kleinberg (2009). Mapping the World’s Photos.
[Mean Shift and other clustering algorithms] http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MeanShift.html
INFORMATION RETRIEVAL TOPICS IN TWITTER USING WEIGHTED PREDICTION NETWORK
 
SENTIMENT ANALYSIS OF TWITTER DATA
SENTIMENT ANALYSIS OF TWITTER DATASENTIMENT ANALYSIS OF TWITTER DATA
SENTIMENT ANALYSIS OF TWITTER DATA
 
REAL TIME SENTIMENT ANALYSIS OF TWITTER DATA
REAL TIME SENTIMENT ANALYSIS OF TWITTER DATAREAL TIME SENTIMENT ANALYSIS OF TWITTER DATA
REAL TIME SENTIMENT ANALYSIS OF TWITTER DATA
 
SENTIMENT ANALYSIS OF TWITTER DATA
SENTIMENT ANALYSIS OF TWITTER DATASENTIMENT ANALYSIS OF TWITTER DATA
SENTIMENT ANALYSIS OF TWITTER DATA
 
srd117.final.512Spring2016
srd117.final.512Spring2016srd117.final.512Spring2016
srd117.final.512Spring2016
 
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docxBUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
 
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docxBUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
 
IRJET- Categorization of Geo-Located Tweets for Data Analysis
IRJET- Categorization of Geo-Located Tweets for Data AnalysisIRJET- Categorization of Geo-Located Tweets for Data Analysis
IRJET- Categorization of Geo-Located Tweets for Data Analysis
 
Tweet Segmentation and Its Application to Named Entity Recognition
Tweet Segmentation and Its Application to Named Entity RecognitionTweet Segmentation and Its Application to Named Entity Recognition
Tweet Segmentation and Its Application to Named Entity Recognition
 
Groundhog Day: Near-Duplicate Detection on Twitter
Groundhog Day: Near-Duplicate Detection on Twitter Groundhog Day: Near-Duplicate Detection on Twitter
Groundhog Day: Near-Duplicate Detection on Twitter
 
Who gives a tweet
Who gives a tweetWho gives a tweet
Who gives a tweet
 
Latent Dirichlet Allocation as a Twitter Hashtag Recommendation System
Latent Dirichlet Allocation as a Twitter Hashtag Recommendation SystemLatent Dirichlet Allocation as a Twitter Hashtag Recommendation System
Latent Dirichlet Allocation as a Twitter Hashtag Recommendation System
 

GeospatialDataAnalysis

Geospatial Clustering and Classifying of Twitter Data

Taylor Graham
University of Colorado
taylor.s.graham@colorado.edu
Spring 2015

Abstract

This paper investigates how to organize and classify a large collection of geotagged Twitter data, working with a dataset of almost 15 million tweets collected directly from Twitter. The approach described here combines spatial analysis of the GPS location of the tweet with content analysis of the text and hashtags associated with the same tweet. I use the spatial distribution of where people tweet from to define 'hot spots', which are regions across America where the largest volume of tweets come from. I then look into those specific regions to try and identify the most popular topics and hashtags of the regions, and compare differences between each area. I expect my results to vary based on where people are tweeting from. Posts on the east coast will likely be about a different subject than posts on the west coast, for example.

I. Introduction

Social media has drastically changed the way that we as individuals communicate with the world around us in the past 15 years or so. During the early years of social media, a user would be connected to their direct peers, their friends and family and colleagues (if they so wanted to be, that is). However, as the social platform grew and matured, people began interacting worldwide, and with people who they know literally nothing about. People began connecting based on similar ideologies and common discussion topics instead of simply who you knew or wished you knew.

Twitter is a social media platform loosely based on this idea. On Twitter, users are able to share their opinions and communicate with other users around the entire world, as long as you are able to keep your message short: Twitter limits users to 140 characters for any particular tweet. As a part of this message, users can include hashtags, which are distinct keywords prefixed with a # symbol.
These hashtags are used to attach a theme or distinct message to each tweet. This also works well for Twitter, because it can easily track the most popular hashtags across the world and identify the 'trending' topics, or the most used hashtags over time. I wanted to investigate the trending topics on Twitter, but with a finer granularity than what is currently reported by Twitter.

II. Data

My dataset was collected by downloading public tweets and their metadata using Twitter's streaming API. More information about the API can be found here: https://dev.twitter.com/streaming/overview. Data was collected at intervals over a two-week period in late March 2015. I had to stop collecting data for certain periods of time because I did not have a dedicated server to collect the tweets, so any time I had to move my computer, such as going to class, I had to turn off the data stream. Figure 1 is a plot of the specific times when data is available.

Figure 1: Data Availability

Data that contained geolocation data and was located in the United States was kept, while the rest of the data was discarded. Luckily, Twitter's API is built so that you can query it with a bounding box of GPS coordinates. The API will then only return data that is geolocated in that particular bounding box.

Data is returned from Twitter's servers formatted as JSON objects. These objects have many fields useful to this research, such as the message text and hashtags as well as the GPS coordinates, as mentioned earlier. However, these objects also contained many other fields that were unnecessary for the system, such as user information, retweets, and favorite counts. In order to limit the size of the database, I chose to save only the relevant fields: tweet_id, message text, timestamp, GPS coordinates, and a user_id. This reduced the size of a typical tweet from around 2.5 KB to 0.8 KB. Once I had the final format of the data, I populated a PostGIS database using Python scripts. In total, I collected 14,412,529 tweets, with 754,109 unique hashtags shared between them.

A problem I ran into using Twitter's data is that there are quite a few 'bots', set up by humans, that automatically post tweets to the system, usually spamming some advertisement or recruitment event to all of Twitter. As you can imagine, these bots can greatly skew the frequency of hashtags in a particular region, or even the entire world if they send enough messages. In order to limit the amount of spam from a single bot, I chose to further filter the data set by ensuring that each post from a specific GPS location was from a unique user_id. This takes care of the problem of a single user posting many tweets from a specific location, which is the case with these bots. This filtering step cut the total number of tweets I was considering from 14 million to about 2.5 million.
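The trimming and bot-filtering steps described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the scripts used for the paper: the function names, the `precision` parameter, and the sample records are hypothetical, and the JSON field names (`id`, `text`, `created_at`, `coordinates`, `user`) are assumed to follow the tweet object layout of the streaming API at the time. Rounding coordinates to decide what counts as "the same location" is also an assumption; the paper does not say how locations were compared.

```python
def trim_tweet(raw):
    """Keep only the fields listed in the paper; return None if there is no GPS point."""
    coords = raw.get("coordinates")
    if not coords:
        return None
    lon, lat = coords["coordinates"]  # assumed GeoJSON order: longitude first
    return {
        "tweet_id": raw["id"],
        "text": raw["text"],
        "timestamp": raw["created_at"],
        "lon": lon,
        "lat": lat,
        "user_id": raw["user"]["id"],
    }


def one_per_user_per_location(tweets, precision=4):
    """Keep the first tweet for each (user, rounded location) pair.

    `precision` (decimal places used to compare locations) is a
    hypothetical knob, not a value taken from the paper.
    """
    seen = set()
    kept = []
    for t in tweets:
        key = (t["user_id"], round(t["lat"], precision), round(t["lon"], precision))
        if key in seen:
            continue
        seen.add(key)
        kept.append(t)
    return kept


# Made-up records: a bot posting twice from one spot, a second user at the
# same spot, and a tweet without coordinates.
raw_stream = [
    {"id": 1, "text": "#Job opening", "created_at": "t1",
     "coordinates": {"type": "Point", "coordinates": [-105.27, 40.01]},
     "user": {"id": 42}},
    {"id": 2, "text": "#Job opening again", "created_at": "t2",
     "coordinates": {"type": "Point", "coordinates": [-105.27, 40.01]},
     "user": {"id": 42}},
    {"id": 3, "text": "hello #photo", "created_at": "t3",
     "coordinates": {"type": "Point", "coordinates": [-105.27, 40.01]},
     "user": {"id": 7}},
    {"id": 4, "text": "no location", "created_at": "t4",
     "coordinates": None, "user": {"id": 9}},
]

trimmed = [t for t in map(trim_tweet, raw_stream) if t is not None]
kept = one_per_user_per_location(trimmed)
```

In this toy run the tweet without coordinates is dropped during trimming, and the bot's second post from the same spot is dropped by the filter, while the second user's post at that spot is kept.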
III. Methods

Given a large collection of geotagged tweets, I wanted to automatically find popular places from which people are tweeting. In measuring how popular a place is, I only want to consider the number of unique users who post a tweet from that particular location. This prevents things like apartment complexes and common residential areas from overwhelming the results, as a lot of users tend to tweet daily from their homes and offices. I expect this will also take into account the wide variability in tweeting rates and behaviour across different individuals. Most of the analysis methods were inspired by the paper [Mapping the World's Photos, 2009], which did similar analysis on Flickr picture sets instead of Twitter tweets.

The problem of finding the 'hot spots' of the most frequently tweeted-from areas in the country can be viewed as a problem of clustering points in a two-dimensional feature space. I chose to use mean-shift clustering instead of a fixed-cluster method such as k-means, which requires the user to define the number of clusters beforehand. Mean shift is a non-parametric method for estimating the modes of an underlying probability distribution from a set of samples, given just an estimate of the scale of the data. For my project specifically, there is an underlying unobservable probability distribution of where people are tweeting from, with modes being the most popular or frequent places to tweet from. Mean shift estimates the modes of this underlying distribution by directly observing only the locations that people are tweeting from.

The main idea behind mean shift is to treat the GPS points in the 2-d feature space as an empirical probability density function, where dense regions in the feature space correspond to the local maxima or modes of the underlying distribution. For each point, a gradient ascent procedure is performed on the locally estimated density until it converges.
The stationary points identified by this procedure represent the modes of the distribution. Additionally, the data points associated with roughly the same stationary point are considered members of the same cluster. Below is an example of the result of mean shift being run on arbitrary points in space.

Once I generate clusters using mean shift, I begin to look at the topics of each cluster. For each cluster, a single query was made to my database, which returned a list of every hashtag contained in every tweet within that cluster. I then parsed the list for each cluster and sorted the hashtags by frequency. I tracked the top 5 hashtags in each cluster, since I was only interested in the most common topics in that region. Ideally, the major topics would change from cluster to cluster, so that analysing only the top 5 hashtags would be enough to see a difference. Once topics were identified, comparisons were made between neighbouring clusters as well as clusters that were in entirely different parts of America. These comparisons highlighted some conclusions about tweets based solely on the location where they originate.
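The two steps of this section, mean-shift clustering followed by a per-cluster hashtag ranking, can be illustrated with a small standard-library sketch. It is a toy version under stated assumptions and not the implementation used for the paper, which cites scikit-learn's MeanShift. All names here (`shift_point`, `mean_shift`, `top_hashtags`, `bandwidth`, `tol`) are hypothetical, the kernel is a simple flat window, distances are plain Euclidean distances on the coordinate pairs (ignoring the curvature of the Earth), and the points and hashtags are made up.

```python
import math
from collections import Counter


def shift_point(p, points, bandwidth):
    """Move p to the mean of all points within `bandwidth` of it (flat kernel)."""
    near = [q for q in points if math.dist(p, q) <= bandwidth]
    if not near:  # defensive guard; should not trigger when p starts on a data point
        return p
    n = len(near)
    return (sum(q[0] for q in near) / n, sum(q[1] for q in near) / n)


def mean_shift(points, bandwidth, tol=1e-6, max_iter=100):
    """Return (modes, labels): each point climbs to a mode; nearby modes are merged."""
    modes, labels = [], []
    for p in points:
        for _ in range(max_iter):
            nxt = shift_point(p, points, bandwidth)
            moved = math.dist(p, nxt)
            p = nxt
            if moved < tol:
                break
        # Assign the converged point to an existing mode if one is close enough.
        for i, m in enumerate(modes):
            if math.dist(p, m) < bandwidth / 2:
                labels.append(i)
                break
        else:
            modes.append(p)
            labels.append(len(modes) - 1)
    return modes, labels


def top_hashtags(labels, hashtags_per_tweet, k=5):
    """Count hashtags per cluster label and keep the k most frequent."""
    counts = {}
    for label, tags in zip(labels, hashtags_per_tweet):
        counts.setdefault(label, Counter()).update(tags)
    return {label: [t for t, _ in c.most_common(k)] for label, c in counts.items()}


# Two made-up groups of points with made-up hashtags.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
tags = [["a"], ["a", "b"], ["a"], ["c"], ["c"], ["d", "c"]]

modes, labels = mean_shift(points, bandwidth=1.0)
top = top_hashtags(labels, tags, k=1)
```

The `bandwidth` argument plays the role of the "estimate of the scale of the data" mentioned above: a larger value merges nearby groups into one cluster, a smaller value splits them. This quadratic-time loop is only practical for small inputs; a dataset of millions of tweets would need a spatially indexed or library implementation.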
Figure 2: Clustered Tweets

Location    Top Hashtag  2nd Hashtag      3rd Hashtag   4th Hashtag  5th Hashtag
Cluster 1   NYC          nyc              newyork       photo        Brooklyn
Cluster 2   LA           losangeles       LosAngeles    California   la
Cluster 3   SXSW         sxsw             photo         Austin       austin
Cluster 4   photo        Job              toronto       apple        applegeeks
Cluster 5   Miami        miami            photo         miamibeach   ultra2015
Cluster 6   chicago      Chicago          photo         Job          breakfast
Cluster 7   Atlanta      Job              Jobs          TweetMyJobs  Nursing
Cluster 8   photo        SanFrancisco     sf            SF           love
Cluster 9   photo        trndnl           Job           nc           clt
Cluster 10  photo        seattle          Seattle       vancouver    Vancouver
Cluster 11  earthquake   photo            Earthquake    Sismo        USGS
Cluster 12  photo        fashion          beautiful     fashionshow  springintostyle
Cluster 13  jagstate     RubberBandYogis  FLUFFNFOLD    Missouri     PetsLoveUsDoingYoga
Cluster 14  photo        trndnl           Memphis       HAA          TheHauntedGhostTown
Cluster 15  PCB          pcb2k15          PCB2K15       NOLA         SpringBreak
Cluster 16  photo        trndnl           Job           EauClaire    minneapolis
Cluster 17  photo        Job              kc            kcfw         DesMoines
Cluster 18  photo        drums            BeatADay      denver       Denver
Cluster 19  utah         nature           photo         mountains    travel
Cluster 20  trndnl       photo            BoomerSooner  buyacar      kbb

Table 1: Most popular hashtags in each cluster
IV. Results and Discussion

It turns out that the hashtags that people use when they are tweeting vary greatly depending on where those people are tweeting from. Figure 2 is a plot of one solution of the clustering algorithm run on the entire data set, 2,469,476 tweets in total. The color of each cluster is arbitrary, and is only there so that it is easier to tell two clusters apart. The clusters are ordered by the number of tweets in each cluster. Cluster 1 has the largest volume of tweets, Cluster 2 is the second most popular, and so on. Table 1 is a list of the top five hashtags found in the 20 most popular clusters. I chose to display only the top 20 clusters, as they are the most frequently tweeted-from areas in America. Additionally, the clusters after 20 have less variation in users, so they become dominated by the hashtags #Job, #photo, #trndnl, and #TweetMyJobs.

There are a couple of things you can immediately conclude from the figure. First of all, you can clearly see the United States, just by plotting the tweets collected. Some interesting things I noticed were the volume of tweets in the Bahamas, and the fact that you can actually make out the Great Lakes. Also, you can tell by the density of the data points that there are many more tweets being posted from the east coast and west coast compared to the midwest. This is likely due to the difference in population of those areas rather than the rate at which individuals are tweeting. I noticed that the tweets begin to disperse rapidly around Cluster 29, at the Texas-Mexico border. I speculate that this is due to the language barrier, and that people begin posting in Spanish more frequently somewhere around there.

Looking at the table, you can clearly see differences in the topics between each cluster. The most posted tags were typically the closest large city or state in that cluster.
In fact, in many cases, you can begin to guess what part of the United States the cluster is located in, without even looking at the map.

V. Limitations and Future Work

A large limitation of this project was finding good values to seed my clustering algorithm with. These values greatly impact the clusters that mean shift identifies, and I suspect that further work on the algorithm could have found better seed values. There was also some variability in the size of each cluster. I adjusted the algorithm by hand until the resulting clusters seemed reasonably small on a national level, but still large enough to be made out on the map. Further work could have been done to find a better value for this parameter as well. I think additional studies at a city level instead of the country-wide level would be worth looking into. The clustering algorithm would have to be further tweaked to fit the data, but I still think that popular regions within a city would be identified, and that those regions would potentially have differing tweet content.

Another huge limitation I ran into specifically with the data was the number of posts that were clearly from a Twitter posting bot. These bots will post thousands of messages a day, each one with the same or similar hashtags. Despite my efforts to remove duplicate posts and posts from a single user in the same location, some of these hashtags still made it through. Many of the clusters were dominated by the hashtags #Job and #Jobs. This is more evident in clusters 25+, as those are the clusters with the fewest tweets coming from them. The tag #photo was by far the most common hashtag, coming up first in most of the clusters.

One thing I didn't take into account when doing tweet analysis was the case of the hashtags. According to Twitter, #NYC is different from #nyc, but for my analysis, they could easily be considered the same tag.
Doing so would allow for a larger variation in the results, as the 'almost duplicates' would be combined.
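As a rough sketch of the case-insensitive merge suggested here, hashtag counts can be folded to lowercase before ranking. The helper name `merge_case_variants` is hypothetical, and the counts below are invented purely for illustration; only the tag names echo Cluster 1 of Table 1.

```python
from collections import Counter


def merge_case_variants(tag_counts):
    """Combine counts of hashtags that differ only in letter case."""
    merged = Counter()
    for tag, n in tag_counts.items():
        merged[tag.lower()] += n
    return merged


# Invented counts; the paper reports only the ranking, not the frequencies.
cluster_counts = Counter({"NYC": 5, "nyc": 4, "newyork": 3, "photo": 2, "Brooklyn": 1})
merged = merge_case_variants(cluster_counts)
top_two = merged.most_common(2)
```

With the case variants merged, #NYC and #nyc occupy a single slot, which leaves room in a top-5 list for a tag that would otherwise have been pushed out.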
An obvious extension of this study is to look not only at the contents of each cluster using static data collected previously, but also at the data as it is consumed by the system in real time. This would allow us to see not only the differences between clusters, but also how those clusters change over time. This would be particularly interesting to watch as some new topic emerges on social media. An immediate example I thought of is the current protests happening in Baltimore. In that area of the country, I would expect the frequency of tweets to increase, making larger clusters in the area than what was there previously. Additionally, I would expect the most frequent hashtags not only to differ from those on the west coast and down south, but also to change rapidly at the point when the protests first began.

VI. Conclusion

In this paper, I introduced several techniques for analyzing and classifying a large collection of geotagged Twitter data. The approach described combines spatial analysis of the GPS latitude and longitude of a tweet with content analysis of the text and hashtags associated with the same tweet. I present a technique to automatically identify the places in the United States which have the most tweets being posted, using the mean-shift clustering algorithm. Then, once regions are defined, I investigate the theme of each region by identifying the top 5 most used hashtags within each region. Comparisons were made between the most popular hashtags in each region. Preliminary investigation shows that the process worked, and that there are clear differences between the hashtags used in different locations across the United States. However, further analysis and refinement of the algorithms used may be necessary to draw meaningful conclusions beyond that.
References

[Mapping the World's Photos, 2009] David Crandall, Lars Backstrom, Daniel Huttenlocher, Jon Kleinberg (2009). Mapping the World's Photos.

[Mean Shift and other clustering algorithms] http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MeanShift.html