Flux of MEME - DOW 1st semester

Flux of MEME - description of work, 1st semester

project: Flux of Meme
author: Thomas M. Alisi - thomasalisi@gmail.com
client: Telecom Italia
review: deliverable 11.3.11

1

Wednesday, March 9, 2011

although geo-tagging is growing,
it still accounts for less than 1% of all user-generated content

2


What makes a trend a Trend?
Twitter users now send more than 95 million Tweets a day, on just about every topic imaginable. We track the
volume of terms mentioned on Twitter on an ongoing basis. Topics break into the Trends list when the volume of
Tweets about that topic at a given moment dramatically increases.

(from the Twitter blog, December 2010)

3


project overview

1. fetch data from real-time social networks
2. create clusters of geo-located information
3. extract topics
4. analyze stats, creating timeline predictions

4


prologue - struggling with hardware and algorithms

5


fetching data: the Twitter streaming API

• data is fetched using the Twitter streaming API

• issues:

• access to data is limited: a basic “Spritzer”
account receives only 1% of total tweets

• geo-localized tweets still represent
a small fraction: around 1%

• “good” data (tweets that carry geo-localized
information) therefore amounts to around:
90M (total tweets/day) * 1% * 1% ≈ 9,000 tweets/day
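
The estimate above can be worked out explicitly. This is a minimal sketch using only the figures quoted on the slide; the constant and function names are illustrative, and the per-minute rate is derived here rather than stated in the original.

```python
# Back-of-the-envelope estimate of usable ("good") tweets, using the slide's figures.
TOTAL_TWEETS_PER_DAY = 90_000_000  # total tweets/day, as quoted on the slide
STREAM_SHARE = 0.01                # share of tweets a basic "Spritzer" account receives
GEO_SHARE = 0.01                   # approximate share of tweets carrying geo information


def usable_tweets_per_day(total, stream_share, geo_share):
    """Tweets per day that are both delivered by the stream and geo-localized."""
    return round(total * stream_share * geo_share)


good_per_day = usable_tweets_per_day(TOTAL_TWEETS_PER_DAY, STREAM_SHARE, GEO_SHARE)
good_per_minute = good_per_day / (24 * 60)

print(good_per_day, good_per_minute)
```

In other words, the pipeline can expect only a handful of directly geo-localized tweets per minute, which motivates the approximation and enrichment steps that follow.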

6


problems

1. how to increase geo-localized data?

2. how to increase the amount / quality of text used for topic extraction?

7


approximating geo-information

when a tweet carries no geo information, its location is approximated
from the location text in the user's Twitter profile: that text is indexed
and searched against the GeoNames database (cities with population > 5,000)
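
A minimal sketch of this lookup, assuming a tiny in-memory gazetteer in place of the real GeoNames index. The entries, their coordinates and populations, and the function names are all illustrative and not taken from the project; the matching is deliberately naive (single-word city names only).

```python
import re

# Hypothetical miniature gazetteer: (name, latitude, longitude, population).
# Values are rough and illustrative, not taken from a GeoNames dump.
GAZETTEER = [
    ("Florence", 43.77, 11.25, 350000),
    ("Turin", 45.07, 7.69, 870000),
    ("Smallville", 40.0, 10.0, 1200),  # made-up entry below the population threshold
]

MIN_POPULATION = 5000  # threshold quoted on the slide


def build_index(entries, min_population=MIN_POPULATION):
    """Index city names (lowercased), keeping only cities above the threshold."""
    return {
        name.lower(): (lat, lon)
        for name, lat, lon, population in entries
        if population > min_population
    }


def approximate_location(profile_location, index):
    """Return (lat, lon) of the first known city named in the profile text, else None."""
    for token in re.findall(r"\w+", profile_location.lower()):
        if token in index:
            return index[token]
    return None


INDEX = build_index(GAZETTEER)
print(approximate_location("Living in Florence, Italy", INDEX))
print(approximate_location("Smallville", INDEX))
```

A real index would also need to handle multi-word names, alternate spellings and ambiguous place names, which this sketch ignores.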

8


enriching information

geo information:

• present

• not present: fetched through GeoNames

• extra information carried by single tweets is
used to enrich data sets for topic extraction

• linked data is filtered through a blacklist, so that
only links actually relevant for clustering purposes
are crawled and fetched
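
The blacklist filter can be sketched as follows. The blacklisted hosts and the function names are assumptions made for illustration; the slides do not say which domains the project actually excludes.

```python
from urllib.parse import urlparse

# Hypothetical blacklist of hosts whose pages add little text for topic extraction.
BLACKLIST = {"twitpic.com", "yfrog.com", "foursquare.com"}


def host_of(url):
    """Return the lowercased host of a URL without a leading 'www.', or '' if absent."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def relevant_links(urls, blacklist=BLACKLIST):
    """Keep only well-formed links whose host is not blacklisted."""
    kept = []
    for url in urls:
        host = host_of(url)
        if host and host not in blacklist:
            kept.append(url)
    return kept


links = [
    "http://twitpic.com/1",
    "http://www.example.org/news",
    "not a url",
]
print(relevant_links(links))
```

Only the links that survive this filter would then be crawled and their text added to the data set.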

9


entity-relationship (E-R) model, focusing on posts / links / queries / clusters

10


application lifecycle

• while the Twitter API connection fetches a
continuous stream of data, the clustering
algorithm is executed asynchronously

1. fetch data and store it in a continuous timeline

2. cut time into relevant slices
[figure: timeline T cut into time slices: yesterday, today, tomorrow?]

3. create geo-localized clusters of information
for each time slice, using HAC (Hierarchical
Agglomerative Clustering)

4. extract topics from geo-clusters using LDA
(Latent Dirichlet Allocation)
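
Steps 2 and 3 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: the slides do not specify the linkage criterion, the distance measure or the stopping rule, so the sketch assumes single linkage, great-circle (haversine) distance and a distance threshold `max_km`. Topic extraction with LDA (step 4) is left out.

```python
import math
from collections import defaultdict


def slice_by_time(posts, slice_seconds):
    """Group posts (timestamp, lat, lon) into consecutive fixed-width time slices."""
    slices = defaultdict(list)
    for timestamp, lat, lon in posts:
        slices[int(timestamp // slice_seconds)].append((lat, lon))
    return dict(slices)


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def hac(points, max_km):
    """Single-linkage agglomerative clustering.

    Starts with one cluster per point and repeatedly merges the two closest
    clusters, stopping once the closest pair is farther apart than max_km.
    """
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(haversine_km(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_km:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Illustrative posts: (seconds since start, lat, lon).
posts = [(0, 43.77, 11.25), (10, 43.78, 11.26), (20, 45.07, 7.69), (90000, 43.77, 11.25)]
slices = slice_by_time(posts, slice_seconds=86400)  # one slice per day
clusters = hac(slices[0], max_km=50)
print(clusters)
```

The naive pairwise search is only suitable for small slices; a production system would use an optimized clustering library or a spatial index.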

11


software architecture

12


web interface

• first prototype of web interface,
showing geo-localized clusters

• radius of clusters indicates standard
deviation

• opacity indicates density (number of
posts)

• for each cluster, its corresponding
metadata is shown, including:

• list of topics

• list of posts

• related links
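
The mapping from cluster statistics to marker appearance can be sketched as below. The slides do not say how the standard deviation or the opacity scale is computed, so this assumes a root-mean-square distance from the centroid (in degrees, treating coordinates as planar) and an opacity that grows linearly with the post count up to a hypothetical `max_posts`.

```python
import math


def cluster_marker(points, max_posts):
    """Derive display attributes for a cluster of (lat, lon) points.

    radius_deg: root-mean-square distance of the points from the centroid.
    opacity:    post count scaled linearly to [0, 1], saturating at max_posts.
    """
    n = len(points)
    lat_c = sum(p[0] for p in points) / n
    lon_c = sum(p[1] for p in points) / n
    mean_sq = sum((p[0] - lat_c) ** 2 + (p[1] - lon_c) ** 2 for p in points) / n
    return {
        "center": (lat_c, lon_c),
        "radius_deg": math.sqrt(mean_sq),
        "opacity": min(1.0, n / max_posts),
    }


marker = cluster_marker([(0.0, 0.0), (0.0, 2.0)], max_posts=4)
print(marker)
```

A map widget would convert `radius_deg` to pixels or metres at the current zoom level before drawing the circle.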

13


what’s next?

• refinements of the LDA topic extraction algorithm (using different sources, determining better datasets of ground
truth content for construction of the statistical model)

• Twitter streaming API tweaks:

• location boxes

• use of keywords and keyword expansion for context-specific searches

• implementation of search masks with a content indexing system (e.g. Apache Solr)

• timeline representation of clusters / topics
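
As an illustration of the location-box idea, the sketch below formats bounding boxes as a comma-separated list of longitude,latitude pairs with the south-west corner first, which is the convention the Twitter streaming API's `locations` filter parameter used at the time; treat that convention as an assumption to check against the current API documentation. The box for Italy is a rough illustrative value, and the helper names are hypothetical.

```python
# Bounding box as (west_lon, south_lat, east_lon, north_lat); rough values for Italy.
ITALY = (6.6, 35.5, 18.5, 47.1)


def locations_param(boxes):
    """Format boxes as 'west,south,east,north,...' (longitude first, SW corner first)."""
    return ",".join(f"{value:g}" for box in boxes for value in box)


def in_box(lat, lon, box):
    """Check whether a point falls inside a bounding box (no antimeridian handling)."""
    west, south, east, north = box
    return south <= lat <= north and west <= lon <= east


print(locations_param([ITALY]))
print(in_box(43.77, 11.25, ITALY))  # a point near Florence
print(in_box(48.85, 2.35, ITALY))   # a point near Paris
```

The same `in_box` check can be applied client-side to posts whose location was approximated from profile text, since those never pass through a server-side location filter.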

14


http://a.parsons.edu/~drumb588/tweetcatcha/
http://truthy.indiana.edu/

15
http://www.janwillemtulp.com/worldeconomicforum/
http://moritz.stefaner.eu/projects/map%20your%20moves/

thanks!

Thomas M. Alisi, PhD - thomasalisi@gmail.com
Giuseppe Serra, PhD - giuseppe.serra@gmail.com
Marco Bertini, PhD - bertini@dsi.unifi.it

16

