This document provides a summary of the Flux of Meme project for the 1st semester deliverable. The project involves fetching geo-located social media data from Twitter, creating clusters of this information, extracting topics from the clusters, and analyzing statistics to create timeline predictions. Initial issues involved limited access to Twitter data and a small percentage of tweets being geo-tagged. The document outlines the software architecture and application lifecycle, and discusses plans to refine the topic extraction algorithm and Twitter data collection.
Flux of MEME - DOW 1st semester
1. Flux of MEME - description of work, 1st semester
project: Flux of Meme
author: Thomas M. Alisi - thomasalisi@gmail.com
client: Telecom Italia
review: deliverable 11.3.11
Wednesday, March 9, 2011
2. even if geo-tagging is growing,
it still represents <1% of the total user generated content
3. What makes a trend a Trend?
from Twitter blog, December 2010:
"Twitter users now send more than 95 million Tweets a day, on just about every topic imaginable. We track the
volume of terms mentioned on Twitter on an ongoing basis. Topics break into the Trends list when the volume of
Tweets about that topic at a given moment dramatically increases."
4. project overview
1. fetch data from real-time social networks
2. create clusters of geo-located information
3. extract topics
4. analyze stats, creating timeline predictions
6. fetching data: the Twitter streaming API
• data is fetched using Twitter streaming API
• issues:
• access to data is limited: a basic “Spritzer”
account is limited to 1% of total tweets
• the amount of geo-localized tweets still
represents a small figure: around 1%
• “good” data (meaning data that has geo-localized
information) is around:
90M (total tweets/day) * 1% * 1%
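
The estimate above can be worked out explicitly. The sketch below only multiplies the figures given on the slide; the function and parameter names are illustrative, not part of the project code.

```python
def usable_tweets_per_day(total_per_day, stream_pct, geo_pct):
    """Tweets per day that are both in the sampled stream and geo-localized.

    Percentages are whole numbers (1 means 1%), so integer arithmetic
    keeps the result exact.
    """
    return total_per_day * stream_pct * geo_pct // (100 * 100)


# Figures from the slide: 90M tweets/day, 1% "Spritzer" sample, 1% geo-localized.
print(usable_tweets_per_day(90_000_000, 1, 1))  # 9000
```

That is roughly 9,000 usable tweets per day, which motivates the two problems listed next.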
7. problems
1. how to increase geo-localized data?
2. how to increase the amount / quality of text used for topic extraction?
8. approximating geo-information
geo information is extracted as text from the twitter profile,
after having indexed its content and searched it on the GeoNames database
(cities with population > 5,000)
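
A minimal sketch of this approximation, assuming the profile location is free text and the gazetteer is a simple in-memory lookup table. The `GAZETTEER` dictionary, its rows (approximate coordinates and populations, plus one made-up small town) and the `locate` function are all hypothetical stand-ins for the indexed GeoNames data; they are not the project's actual code.

```python
import re

# Hypothetical gazetteer: lowercase city name -> (lat, lon, population).
# Values are approximate and for illustration only.
GAZETTEER = {
    "florence": (43.77, 11.25, 350_000),
    "turin": (45.07, 7.69, 870_000),
    "smallville": (44.00, 10.00, 1_200),  # fictional town below the threshold
}


def locate(profile_location, min_population=5_000):
    """Return (lat, lon) of the first known city named in the profile text."""
    # Split free text such as "Florence, Italy" into candidate place names.
    candidates = [c.strip().lower() for c in re.split(r"[,/|;-]", profile_location)]
    for name in candidates:
        entry = GAZETTEER.get(name)
        # Keep only cities above the population threshold from the slide.
        if entry and entry[2] > min_population:
            return entry[0], entry[1]
    return None  # no usable geo information in the profile


print(locate("Florence, Italy"))  # (43.77, 11.25)
print(locate("Smallville"))       # None
```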
9. enriching information
geo information: present in the tweet, or, when not present, fetched through GeoNames
• extra information carried by single tweets is
used to enrich data sets for topic extraction
• linked data is filtered through a blacklist to
crawl and fetch what is effectively relevant
for clustering purposes
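
The blacklist step could look like the sketch below. It assumes the blacklist is a set of domains and that a link is dropped when its host matches a blacklisted domain or one of its subdomains; the domains and function names are hypothetical, and the slides do not say how the actual blacklist is defined.

```python
from urllib.parse import urlparse

# Hypothetical blacklist of domains whose pages add no value for clustering.
BLACKLIST = {"ads.example.com", "spam.example.net"}


def is_relevant(url):
    """True when the link's host is not a blacklisted domain or subdomain."""
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in BLACKLIST)


def filter_links(urls):
    """Keep only the links worth crawling and fetching."""
    return [u for u in urls if is_relevant(u)]


links = [
    "http://news.example.org/article",
    "http://ads.example.com/banner",
    "http://www.spam.example.net/page",
]
print(filter_links(links))  # ['http://news.example.org/article']
```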
10. e.r. model, focusing on posts / links / queries / clusters
11. application lifecycle
• as the twitter API is connected and fetches a
continuous stream of data, the clustering
algorithm is executed asynchronously
1. fetch data and store in a continuous timeline
2. cut time in relevant slices
3. create geo-localized clusters of information,
using HAC (Hierarchical Agglomerative
Clustering)
4. extract topics from geo-clusters using LDA
(Latent Dirichlet Allocation)
[figure: timeline T cut into time slices: yesterday, today, tomorrow?]
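
Steps 2 and 3 can be sketched as follows. The slides name HAC but do not specify the slice length, the linkage criterion or the stopping rule, so the one-day slices, single linkage, great-circle distance and `max_km` cutoff below are all assumptions chosen for illustration. The naive implementation is only suitable for small inputs.

```python
from collections import defaultdict
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def slice_by_day(posts):
    """Group (timestamp, lat, lon) posts into one time slice per calendar day."""
    slices = defaultdict(list)
    for ts, lat, lon in posts:
        slices[ts.date()].append((lat, lon))
    return dict(slices)


def hac_single_linkage(points, max_km):
    """Start from singleton clusters and repeatedly merge the two closest
    clusters until the closest pair is farther apart than max_km."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(haversine_km(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > max_km:
            break
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters


posts = [
    (datetime(2011, 3, 8, 10, 0), 43.77, 11.25),
    (datetime(2011, 3, 8, 11, 30), 43.78, 11.26),
    (datetime(2011, 3, 8, 12, 0), 41.89, 12.48),
    (datetime(2011, 3, 9, 9, 0), 45.07, 7.69),
]
for day, points in slice_by_day(posts).items():
    print(day, hac_single_linkage(points, max_km=50))
```

Each resulting geo-cluster would then be passed to the topic extraction step (LDA), which is not shown here.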
13. web interface
• first prototype of web interface,
showing geo-localized clusters
• radius of clusters indicates standard
deviation
• opacity indicates density (number of
posts)
• for each cluster, its corresponding
metadata is shown, including:
• list of topics
• list of posts
• related links
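
The visual encoding described above could be derived from a cluster as in the sketch below. It assumes the radius is the standard deviation of the posts' distance from the cluster centroid (measured here in plain degrees) and that opacity is the post count scaled by a chosen maximum; the prototype's actual formulas are not given in the slides, and `cluster_marker` is a hypothetical helper.

```python
from math import sqrt


def cluster_marker(points, max_posts):
    """Map a cluster of (lat, lon) points to marker properties for the map."""
    n = len(points)
    lat_c = sum(p[0] for p in points) / n
    lon_c = sum(p[1] for p in points) / n
    # Spread of the posts around the centroid, in degrees (an assumption).
    variance = sum((p[0] - lat_c) ** 2 + (p[1] - lon_c) ** 2 for p in points) / n
    return {
        "center": (lat_c, lon_c),
        "radius": sqrt(variance),          # standard deviation
        "opacity": min(1.0, n / max_posts),  # density, capped at fully opaque
    }


print(cluster_marker([(0.0, 0.0), (0.0, 2.0)], max_posts=4))
# {'center': (0.0, 1.0), 'radius': 1.0, 'opacity': 0.5}
```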
14. what’s next?
• refinements of LDA topic extraction algorithm (using different sources, determining better datasets of ground
truth content for construction of statistical model)
• twitter streaming API tweaks:
• location boxes
• use of keywords and keyword expansion for context specific searches
• implementation of search masks with a content indexing system (e.g. Apache Solr)
• timeline representation of clusters / topics
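
For the streaming API tweaks, a request could be parameterized as sketched below. The parameter names `locations` (bounding boxes as longitude/latitude pairs, south-west corner first) and `track` (comma-separated keywords) follow the filter endpoint of the Twitter streaming API as historically documented; they should be checked against the current documentation before use. The helper itself, the bounding box (a rough box around Italy) and the keywords are illustrative, and no network call is made.

```python
def build_filter_params(bounding_boxes=(), keywords=()):
    """Build query parameters for a filtered stream request."""
    params = {}
    if bounding_boxes:
        # Each box is (sw_lon, sw_lat, ne_lon, ne_lat): longitude comes first.
        params["locations"] = ",".join(str(v) for box in bounding_boxes for v in box)
    if keywords:
        params["track"] = ",".join(keywords)
    return params


print(build_filter_params([(6.6, 36.6, 18.5, 47.1)], ["meme", "trend"]))
# {'locations': '6.6,36.6,18.5,47.1', 'track': 'meme,trend'}
```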
15.
http://a.parsons.edu/~drumb588/tweetcatcha/
http://truthy.indiana.edu/
http://www.janwillemtulp.com/worldeconomicforum/
http://moritz.stefaner.eu/projects/map%20your%20moves/
16. thanks!
Thomas M. Alisi, PhD: thomasalisi@gmail.com
Giuseppe Serra, PhD: giuseppe.serra@gmail.com
Marco Bertini, PhD: bertini@dsi.unifi.it