Flux of MEME - DOW 1st semester
First results after six months of work under the Telecom Italia Working Capital research grant, showing the technology used and a prototype web interface for a topic extraction and clustering tool.

Published in Technology
  • 1. Flux of MEME - description of work, 1st semester. Project: Flux of Meme. Author: Thomas M. Alisi (thomasalisi@gmail.com). Client: Telecom Italia. Review: deliverable 11.3.11. Wednesday, March 9, 2011
  • 2. even though geo-tagging is growing, it still represents <1% of total user-generated content
  • 3. What makes a trend a Trend? “Twitter users now send more than 95 million Tweets a day, on just about every topic imaginable. We track the volume of terms mentioned on Twitter on an ongoing basis. Topics break into the Trends list when the volume of Tweets about that topic at a given moment dramatically increases.” (from the Twitter blog, December 2010)
  • 4. project overview
    1. fetch data from real-time social networks
    2. create clusters of geo-located information
    3. extract topics
    4. analyze stats, creating timeline predictions
  • 5. prologue - struggling with hardware and algorithms
  • 6. fetching data: the Twitter streaming API
    • data is fetched using the Twitter streaming API
    • issues:
      • access to data is limited: a basic “Spritzer” account is limited to 1% of total tweets
      • the share of geo-localized tweets is still small: around 1%
      • “good” data (meaning data that has geo-localized information) is around: 90M (total tweets/day) * 1% * 1%
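The estimate on this slide can be worked out explicitly. The sketch below only multiplies the three figures quoted above. The function and constant names are illustrative, and integer percentages are used to avoid floating-point rounding.

```python
# Figures quoted on the slide; the names are illustrative, not from the project.
TOTAL_TWEETS_PER_DAY = 90_000_000  # "90M (total tweets/day)"
SPRITZER_PERCENT = 1               # share of all tweets visible to a basic account
GEO_PERCENT = 1                    # share of tweets carrying geo information


def usable_tweets_per_day(total, access_percent, geo_percent):
    """Return the number of geo-localized tweets reachable per day."""
    return total * access_percent * geo_percent // (100 * 100)


estimate = usable_tweets_per_day(TOTAL_TWEETS_PER_DAY, SPRITZER_PERCENT, GEO_PERCENT)
print(estimate)  # 9000
```

With these figures, roughly 9,000 geo-localized tweets per day are available worldwide, which is what motivates the two problems on the next slide.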
  • 7. problems
    1. how to increase geo-localized data?
    2. how to increase the amount / quality of text used for topic extraction?
  • 8. approximating geo-information: geo information is extracted as text from the Twitter profile and searched on the GeoNames database, after having indexed its content (cities with population > 5,000)
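A minimal sketch of this kind of lookup, assuming a tiny in-memory gazetteer in place of an indexed GeoNames dump. The entries, their approximate coordinates and populations, and all function names are illustrative assumptions, not the project's code or data.

```python
import re

# Hypothetical gazetteer: normalized city name -> (lat, lon, population).
# Values are approximate and for illustration only; a real system would
# index the GeoNames database instead.
GAZETTEER = {
    "florence": (43.77, 11.26, 380_000),
    "new york": (40.71, -74.01, 8_000_000),
    "tinyville": (0.0, 0.0, 1_200),  # made-up place below the threshold
}
MIN_POPULATION = 5_000  # threshold quoted on the slide


def tokenize(text):
    """Lowercase the text and keep only ASCII letter runs (a simplification)."""
    return re.sub(r"[^a-z ]", " ", text.lower()).split()


def locate_profile(location_text):
    """Return (name, lat, lon) for the first known city in a profile string."""
    tokens = tokenize(location_text)
    # Try longer phrases first so that "new york" is matched before "york".
    for size in (3, 2, 1):
        for start in range(len(tokens) - size + 1):
            name = " ".join(tokens[start:start + size])
            entry = GAZETTEER.get(name)
            if entry and entry[2] > MIN_POPULATION:
                return name, entry[0], entry[1]
    return None


print(locate_profile("Florence, Italy"))
print(locate_profile("living in Tinyville"))
```

The second call returns `None` because the made-up town falls below the population threshold, so the tweet would stay without geo information.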
  • 9. enriching information
    • geo information: present, or not present and fetched through GeoNames
    • extra information carried by single tweets is used to enrich data sets for topic extraction
    • linked data is filtered through a blacklist to crawl and fetch only what is actually relevant for clustering purposes
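The blacklist step could look roughly like the sketch below. The slide does not say what the blacklist contains, so the domains here are placeholders and the matching rule (exact host or any subdomain) is an assumption.

```python
from urllib.parse import urlparse

# Placeholder blacklist; the real list used by the project is not given.
BLACKLIST = {"photos.example", "checkins.example"}


def is_relevant(url):
    """Return True if the URL has a host that is not blacklisted."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if not host:
        return False  # not a crawlable absolute URL
    return not any(host == d or host.endswith("." + d) for d in BLACKLIST)


def links_to_crawl(urls):
    """Keep only the links worth fetching for topic extraction."""
    return [u for u in urls if is_relevant(u)]


candidates = [
    "http://news.example.org/story",
    "http://www.photos.example/abc",
    "not a url",
]
print(links_to_crawl(candidates))
```

Only the first candidate survives: the second is on a blacklisted domain and the third has no host to crawl.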
  • 10. e.r. model, focusing on posts / links / queries / clusters
  • 11. application lifecycle
    • as the Twitter API is connected and fetches a continuous stream of data, the clustering algorithm is executed asynchronously
    1. fetch data and store it in a continuous timeline
    2. cut time into relevant slices (yesterday, today, tomorrow?)
    3. create geo-localized clusters of information for each time slice, using HAC (Hierarchical Agglomerative Clustering)
    4. extract topics from geo-clusters using LDA (Latent Dirichlet Allocation)
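Steps 2 and 3 can be illustrated with a small pure-Python sketch. The slide does not state the linkage criterion, the distance measure or the stopping rule, so single linkage, great-circle distance and a distance cut-off are assumptions here. The LDA step is not covered.

```python
import math
from collections import defaultdict


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def slice_by_time(posts, slice_seconds):
    """Group (timestamp, lat, lon) posts into fixed-width time slices."""
    slices = defaultdict(list)
    for timestamp, lat, lon in posts:
        slices[timestamp // slice_seconds].append((lat, lon))
    return dict(slices)


def hac(points, max_km):
    """Single-linkage agglomerative clustering with a distance cut-off."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(haversine_km(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > max_km:
            break  # remaining clusters are too far apart to merge
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters


# Illustrative posts: (unix-style timestamp, lat, lon).
posts = [(0, 43.77, 11.25), (100, 43.78, 11.26),
         (200, 45.46, 9.19), (90_000, 41.90, 12.50)]
by_day = slice_by_time(posts, 86_400)
day0_clusters = hac(by_day[0], max_km=50)
print(len(day0_clusters))
```

This brute-force version recomputes all pairwise distances at every merge, which is fine for a few hundred points per slice but would need a proper library implementation for larger volumes.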
  • 12. software architecture
  • 13. web interface
    • first prototype of the web interface, showing geo-localized clusters
    • radius of clusters indicates standard deviation
    • opacity indicates density (number of posts)
    • for each cluster, its corresponding metadata is shown, including:
      • list of topics
      • list of posts
      • related links
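One way to derive the two visual properties from a cluster is sketched below. The exact formulas used by the prototype are not given, so the standard deviation is computed here as the root-mean-square distance from the centroid in degrees, and opacity as the post count relative to an assumed maximum. Both choices and all names are illustrative.

```python
import math


def cluster_marker(points, max_posts):
    """Compute centre, radius and opacity for drawing one cluster.

    points: list of (lat, lon); max_posts: post count mapped to full opacity.
    """
    n = len(points)
    lat_c = sum(p[0] for p in points) / n
    lon_c = sum(p[1] for p in points) / n
    # Spread of the posts around the centroid, in degrees (a simplification:
    # a map layer would normally convert this to metres or pixels).
    variance = sum((p[0] - lat_c) ** 2 + (p[1] - lon_c) ** 2 for p in points) / n
    return {
        "center": (lat_c, lon_c),
        "radius_deg": math.sqrt(variance),
        "opacity": min(1.0, n / max_posts),
    }


marker = cluster_marker([(0.0, 0.0), (0.0, 2.0)], max_posts=4)
print(marker)
```

For the two sample points the centre is (0.0, 1.0), the radius is 1.0 degree and the opacity is 0.5.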
  • 14. what’s next?
    • refinements of the LDA topic extraction algorithm (using different sources, determining better datasets of ground truth content for construction of the statistical model)
    • Twitter streaming API tweaks:
      • location boxes
      • use of keywords and keyword expansion for context-specific searches
    • implementation of search masks with a content indexing system (e.g. Apache Solr)
    • timeline representation of clusters / topics
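The two streaming tweaks could be expressed as request parameters along these lines. This sketch assumes the `locations` (bounding boxes given as longitude/latitude pairs, south-west corner first) and `track` (comma-separated keywords) parameters of the streaming filter endpoint as documented at the time; both should be checked against the current API documentation. The helper names and the sample box are illustrative, and no request is sent.

```python
def bounding_box(sw_lon, sw_lat, ne_lon, ne_lat):
    """Format one box as 'sw_lon,sw_lat,ne_lon,ne_lat'."""
    return f"{sw_lon},{sw_lat},{ne_lon},{ne_lat}"


def filter_params(boxes=(), keywords=()):
    """Build a parameter dict for a streaming filter request."""
    params = {}
    if boxes:
        params["locations"] = ",".join(bounding_box(*b) for b in boxes)
    if keywords:
        params["track"] = ",".join(keywords)
    return params


# Rough box around Florence and two hand-picked keywords (illustrative values).
params = filter_params(
    boxes=[(11.0, 43.6, 11.5, 43.9)],
    keywords=["firenze", "florence"],
)
print(params)
```

Keyword expansion would then amount to growing the `keywords` list from a seed term before building the parameters.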
  • 15. http://a.parsons.edu/~drumb588/tweetcatcha/ http://truthy.indiana.edu/ http://www.janwillemtulp.com/worldeconomicforum/ http://moritz.stefaner.eu/projects/map%20your%20moves/
  • 16. thanks! Thomas M. Alisi, PhD (thomasalisi@gmail.com); Giuseppe Serra, PhD (giuseppe.serra@gmail.com); Marco Bertini, PhD (bertini@dsi.unifi.it)