Flux of MEME - final report



Final presentation and results of a topic extraction and analysis tool, developed for the Telecom Italia Working Capital research grant.

Published in: Technology, Education


  1. flux of meme - final report
     telecom italia, milan 30.9.11
     thomas alisi @grudelsud
     Friday, September 30, 11
  2. the basics
  3. the idea
     Meme: a postulated unit or element of cultural ideas transmitted from one mind to another through speech or similar phenomena.
     Zeitgeist: German language expression referring to "the spirit of the times".
     Semantic Web: an evolving development of the World Wide Web in which the meaning (semantics) of information on the web is defined, making it possible for machines to process it.
     Flux of MEME: analysis of the web Zeitgeist through geo-localized Memes, updated and shared on social media, mainly via mobile networks.
  4. background
     yahoo research
     - WWW2011 - Who Says What to Whom on Twitter - Wu, Hofman, Mason, Watts
     - WSDM2011 - Who Uses Web Search for What? And How? - Weber, Jaimes
     - CSCW2011 - Peaks and Persistence: Modeling the Shape of Microblog Conversations - Shamma, Kennedy, Churchill
     others
     - WWW2010 - What is Twitter, a Social Network or a News Media? - Kwak, Lee, Park, Moon
     - Tech report 2009 (Princeton / Carnegie Mellon) - Topic Models - Blei, Lafferty
     - Tech report 2009 (Facebook / Maryland / Princeton) - Reading Tea Leaves: How Humans Interpret Topic Models - Chang, Boyd-Graber, Gerrish, Wang, Blei
  5. algorithm steps
     1. fetch data
     2. create clusters
     3. extract topics
     4. analyze stats
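The four steps above can be sketched end to end as a toy pipeline. This is purely illustrative: the function names and the naive coordinate-bucketing and word-counting stand-ins are assumptions, not code from the fom codebase (which uses HAC and LDA for steps 2 and 3).

```python
from collections import Counter

def fetch_data():
    # stand-in for the Twitter Spritzer stream: a few geotagged posts
    return [
        {"text": "pizza festival in milan", "lat": 45.46, "lon": 9.19},
        {"text": "milan pizza is great",    "lat": 45.47, "lon": 9.18},
        {"text": "rainy day in london",     "lat": 51.51, "lon": -0.13},
    ]

def create_clusters(posts, radius=1.0):
    # naive geo-grouping stand-in for HAC: bucket by rounded coordinates
    clusters = {}
    for p in posts:
        key = (round(p["lat"] / radius), round(p["lon"] / radius))
        clusters.setdefault(key, []).append(p)
    return list(clusters.values())

def extract_topics(cluster):
    # stand-in for LDA: most frequent words in the cluster's bag of words
    words = Counter(w for p in cluster for w in p["text"].split())
    return [w for w, _ in words.most_common(2)]

def analyze(clusters):
    # per-cluster stats, as displayed by the frontend
    return [{"size": len(c), "topics": extract_topics(c)} for c in clusters]

stats = analyze(create_clusters(fetch_data()))
```

Each later slide replaces one of these stand-ins with the real technique (streaming API, HAC, LDA, TF-IDF analytics).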
  6. implementation
  7. step 1. fetch data!
     - using the free Spritzer access to the Twitter streaming API (~1% of total tweets)
     - defined set of location boxes (Italy, UK, France, Spain)
     - reinforcing locations with geonames didn't prove to be efficient (origin: "from a galaxy far far away")
     - enrich content through web scraping, also carrying meta & opengraph keywords
     - blacklist of noisy sources
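The location-box filter can be sketched as a simple bounding-box test. The Italy box below is taken from the example request on slide 21; the UK box is a rough assumed value for illustration only.

```python
# Boxes are (swLat, swLon, neLat, neLon) tuples.
BOXES = {
    "italy_north": (44.61, 8.52, 45.57, 11.33),  # from the dzstat example request
    "uk":          (49.9, -8.2, 60.9, 1.8),      # rough bounding box, assumed
}

def in_box(lat, lon, box):
    sw_lat, sw_lon, ne_lat, ne_lon = box
    return sw_lat <= lat <= ne_lat and sw_lon <= lon <= ne_lon

def keep_post(lat, lon):
    """True when a geotagged post falls inside any configured box."""
    return any(in_box(lat, lon, b) for b in BOXES.values())
```

A post geotagged near Milan or London passes the filter; one from New York does not.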
  8. step 2. create geo-clusters
     - create time slices
     - select all the posts within a time slice
     - choose geo-granularity (radius of clusters)
     - agglomerate posts with Hierarchical Agglomerative Clustering (HAC)
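A minimal single-linkage HAC over lat/lon points, stopping once no two clusters are within the chosen radius, can be sketched as follows. This is a stdlib-only illustration; the actual tool may use a different linkage criterion and distance function.

```python
import math

def dist_km(a, b):
    # equirectangular approximation of distance between two (lat, lon)
    # points, adequate at city scale
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return 6371.0 * math.hypot(x, y)

def hac(points, radius_km):
    """Single-linkage agglomerative clustering with a radius cutoff."""
    clusters = [[p] for p in points]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist_km(p, q)
                        for p in clusters[i] for q in clusters[j])
                if d <= radius_km and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:          # no pair closer than the geo-granularity
            return clusters
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair

# two posts in Milan, one in London, 5 km geo-granularity
points = [(45.464, 9.190), (45.486, 9.204), (51.507, -0.128)]
clusters = hac(points, 5.0)
```

With a 5 km radius the two Milan points merge into one cluster and London stays on its own.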
  9. step 3. extract topics
     - a geo-cluster represents the whole bag of words used to define a document
     - topic extraction is implemented with LDA
     - α: Dirichlet prior parameter on the per-document topic distributions (frontend output: weight)
     - β: Dirichlet prior parameter on the per-topic word distribution
     - θi is the topic distribution for document i, zij is the topic for the jth word in document i, and wij is the specific word
     - user-defined params: number of topics, number of words per topic, min followers
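Written out, the notation on this slide corresponds to the standard LDA generative process (the textbook formulation, restated here for reference):

```latex
\begin{aligned}
\theta_i &\sim \operatorname{Dir}(\alpha) && \text{topic distribution for document } i\\
\phi_k &\sim \operatorname{Dir}(\beta) && \text{word distribution for topic } k\\
z_{ij} &\sim \operatorname{Multinomial}(\theta_i) && \text{topic of the } j\text{th word of document } i\\
w_{ij} &\sim \operatorname{Multinomial}(\phi_{z_{ij}}) && \text{the specific word}
\end{aligned}
```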
  10. step 4. analyze data
     - define search context: topics or keywords
     - perform live search with TF-IDF indicators
     - display time-lapse of clusters' analytics evolution (log-scale count and average size)
     - quick and easy interface: toggle visibility of clusters
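The TF-IDF indicators used for live search can be sketched with the classic formula tf(t, d) x log(N / df(t)), each "document" standing in for one cluster's bag of words. This is the generic textbook scoring, not necessarily the exact variant in the tool.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: score} dict per document (list of tokens)."""
    df = Counter(t for d in docs for t in set(d))   # document frequency
    n = len(docs)
    out = []
    for d in docs:
        tf = Counter(d)
        out.append({t: (c / len(d)) * math.log(n / df[t])
                    for t, c in tf.items()})
    return out

docs = [
    "milan pizza pizza festival".split(),   # cluster 1
    "london rain rain rain".split(),        # cluster 2
    "milan fashion week".split(),           # cluster 3
]
scores = tfidf(docs)
```

In cluster 1, "pizza" outranks "milan" because "milan" also appears in cluster 3 and so carries a lower inverse document frequency.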
  11. step 4. analyze data
     - drag and zoom on specific location boxes
     - select time interval
     - display aggregated stats of clusters (count and size) within a location box
     - show and export breakdown of posts' languages
  12. step 4. analyze data
     - show stats and content of specific clusters: lat-lon of centroids, std. deviation, surface and radius
     - display weighted topics, TF-IDF of terms within topics, TF-IDF of meta keywords
     - show / export list of posts
     - show related links
  13. step 4. analyze data
     - show query metrics and parameters
     - display overall TF-IDF for the selected query
  14. demo: http://fom.londondroids.com/fom/
  15. sorry guys, now the boring stuff...
     backend, front-end API, cron jobs
  16. Backend
     Streaming API
     - a batch process is constantly running and saving data to the db
     - options: fetch by search query, expand terms with wikiminer, access all the stream, filter geotagged, filter location box, fetch related content
     Clustering and topic extraction
     - configurable: geo granularity, time/size of geo clusters, followers and retweets, number of topics / keywords, language mapping
  17. API
     - search clusters containing specific topics / keywords; returns lists of clusters ordered by topic weight
     - all the data extraction API conforms to a RESTful model and returns JSON-structured data
  18. API
     - read list of geographic clusters; usually called after a search topic has been raised
  19. API
     - read semantic content of a geographic cluster
     - topics are grouped by score (the alpha parameter in LDA) and words are weighted with TF-IDF with respect to the whole cluster content
  20. API
     - read meta / opengraph content of a geographic cluster
  21. API
     export list of posts
     - exports all the posts contained in a cluster
     - example request: /cluster/export_posts/1026/csv
     read post content
     - reads the content of a post
     - example request: /cluster/read_post/560951
     read related link
     - reads the content of a link related to a post (the id is usually fetched through the variable "links" returned by the function above)
     - example request: /cluster/read_link/16268
     execute cluster stats within a location box
     - reads the list of clusters contained within a location box and creates stat charts (in the form of google chart images)
     - example request: /cluster/dzstat/c_since=2011-05-07/c_until=2011-05-10/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33
     execute post stats within a location box
     - reads the list of posts contained within a location box and performs stats on languages
     - example request: /search/dzstat/p_since=2011-05-07/p_until=2011-05-10/p_timespan=daily/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33
     read query content
     - reads the list of geo-clusters associated to a specific query id (usually fetched by the function above)
     - example request: /cluster/read/2
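The stats endpoints above encode their parameters as slash-delimited key=value segments rather than a query string. A small client-side parser for that convention (a sketch of the URL scheme as shown in the examples; the server-side routing in fom may differ):

```python
def parse_fom_path(path):
    """Split a fom-style path into (endpoint, params).

    Segments without '=' form the endpoint; segments with '='
    become key/value parameters.
    """
    parts = [p for p in path.strip("/").split("/") if p]
    endpoint = "/".join(p for p in parts if "=" not in p)
    params = dict(p.split("=", 1) for p in parts if "=" in p)
    return endpoint, params

ep, params = parse_fom_path(
    "/cluster/dzstat/c_since=2011-05-07/c_until=2011-05-10"
    "/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33"
)
```

Applied to the cluster-stats example request, this yields the endpoint "cluster/dzstat" plus the date range and bounding-box parameters.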
  22. Cron
     - keep everything running
     - restart the streaming API now and then, so as to keep twitter happy
     - create the clusters at the end of the day
  24. servers
  25. final thoughts
  26. improvements
     optimize time slicing!
     - emerging topics should be checked on an hourly basis against the complete dataset
     train models!
     - a training set would be ideal to create models and optimize performance of the topic extraction algorithm
     - models could relate to a specific context in order to improve results (e.g. all the tweets from newspapers)
     create language classifiers
     - increase the precision of language detection with naive bayes classifiers
     think of scalability
     - increasing the amount of data makes it necessary to scale up to Map/Reduce architectures
     - increase flexibility (e.g. manage multimedia data, offer a rich contextualized API, ...)
     - enhance analysis and visualization (e.g. reinforce topic correlation / n-grams)
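The proposed naive bayes language classifier could look like the following toy character-bigram model with Laplace smoothing. Everything here is a sketch: real training corpora (not three phrases per language) would be needed for usable precision.

```python
import math
from collections import Counter

def bigrams(text):
    # character bigrams, padded with spaces to capture word boundaries
    t = f" {text.lower()} "
    return [t[i:i + 2] for i in range(len(t) - 1)]

class NaiveBayesLang:
    def __init__(self):
        self.counts = {}                     # lang -> Counter of bigrams

    def train(self, lang, texts):
        c = self.counts.setdefault(lang, Counter())
        for t in texts:
            c.update(bigrams(t))

    def predict(self, text):
        vocab = len({b for c in self.counts.values() for b in c})
        def log_likelihood(lang):
            c = self.counts[lang]
            total = sum(c.values())
            # Laplace-smoothed log-probability of the observed bigrams
            return sum(math.log((c[b] + 1) / (total + vocab))
                       for b in bigrams(text))
        return max(self.counts, key=log_likelihood)

nb = NaiveBayesLang()
nb.train("it", ["ciao come stai", "che bella giornata", "grazie mille"])
nb.train("en", ["hello how are you", "what a nice day", "thanks a lot"])
```

Character n-grams are a common choice for short, noisy posts because they work without tokenization and degrade gracefully on misspellings.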
  27. other refs
     algorithms
     - LDA - http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
     - HAC - http://en.wikipedia.org/wiki/Cluster_analysis
     libraries
     - twitter4j (twitter 4 java) - http://twitter4j.org
     - machine learning (MALLET) - http://mallet.cs.umass.edu/
     - jquery (core + ui) - http://jquery.org/
     - data tables - http://datatables.net/
     - chart api - http://code.google.com/apis/chart/
     image courtesy
     - http://yesyesno.com/nike-city-runs
  28. ? thanks!
     codebase source + wiki: https://github.com/grudelsud/fom
     thomas alisi @grudelsud
     giuseppe serra @giuseppeserra
     marco bertini @bertinimarco