
Flux of MEME - DOW 1st semester

First results after 6 months of work under the Telecom Italia Working Capital research grant, showing the technology used and a prototype of the web interface for a topic extraction and clustering tool.

Published in: Technology

  1. Flux of MEME - description of work, 1st semester
     project: Flux of Meme
     author: Thomas M. Alisi - thomasalisi@gmail.com
     client: Telecom Italia
     review: deliverable 11.3.11
     Wednesday, March 9, 2011
  2. even if geo-tagging is growing, it still represents <1% of the total user-generated content
  3. What makes a trend a Trend?
     "Twitter users now send more than 95 million Tweets a day, on just about every topic imaginable. We track the volume of terms mentioned on Twitter on an ongoing basis. Topics break into the Trends list when the volume of Tweets about that topic at a given moment dramatically increases."
     (from the Twitter blog, December 2010)
  4. project overview
     1. fetch data
     2. create clusters
     3. extract topics
     4. analyze stats
     (diagram captions: from real-time social networks; of geo-located information; creating timeline predictions)
  5. prologue - struggling with hardware and algorithms
  6. fetching data: the Twitter streaming API
     • data is fetched using the Twitter streaming API
     • issues:
       • access to data is limited: a basic "Spritzer" account is limited to 1% of total tweets
       • the amount of geo-localized tweets still represents a small figure: around 1%
       • "good" data (meaning data that has geo-localized information) is around: 90M (total tweets/day) * 1% * 1%, i.e. roughly 9,000 tweets/day
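The estimate on this slide can be checked with a few lines of arithmetic. This is only a sketch of the calculation: the function name and parameter names are made up for illustration, and the two 1% fractions are the approximate figures quoted on the slide.

```python
def expected_geo_tweets(total_per_day, stream_fraction=0.01, geo_fraction=0.01):
    """Estimate how many geo-localized tweets per day reach the application.

    total_per_day   -- total tweets sent per day (the slide uses 90M)
    stream_fraction -- share of tweets visible to a basic "Spritzer" account
    geo_fraction    -- share of tweets that carry geo-localized information
    """
    return total_per_day * stream_fraction * geo_fraction


if __name__ == "__main__":
    # 90M tweets/day, 1% sampled by the stream, 1% of those geo-localized
    print(round(expected_geo_tweets(90_000_000)))
```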
  7. problems
     1. how to increase geo-localized data?
     2. how to increase the amount / quality of text used for topic extraction?
  8. approximating geo-information
     • geo information is extracted as text from the Twitter profile
     • and searched on the GeoNames database (cities with population > 5,000), after having indexed its content
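The lookup described on this slide might be sketched as follows. This is not the project's code: a tiny in-memory dictionary stands in for the indexed GeoNames database, its entries (including the made-up "tinyville") carry rounded, illustrative values rather than real GeoNames records, and the function name is hypothetical. Only the population threshold comes from the slide.

```python
import re

# Threshold taken from the slide: only cities with population > 5,000
MIN_POPULATION = 5000

# Hypothetical stand-in for the indexed GeoNames data:
# lowercase city name -> (latitude, longitude, population), illustrative values
GAZETTEER = {
    "florence": (43.77, 11.26, 380000),
    "rome": (41.90, 12.50, 2800000),
    "tinyville": (0.0, 0.0, 1200),  # made-up place below the threshold
}


def approximate_location(profile_location):
    """Return (lat, lon) for the first known city named in the profile text."""
    tokens = re.findall(r"[a-z]+", profile_location.lower())
    for token in tokens:
        entry = GAZETTEER.get(token)
        if entry is not None and entry[2] > MIN_POPULATION:
            return entry[0], entry[1]
    return None  # no usable city found in the free-text location


if __name__ == "__main__":
    print(approximate_location("Florence, Italy"))
    print(approximate_location("somewhere over the rainbow"))
```

A real index would also need to handle multi-word names and ambiguous city names, which this sketch ignores.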
  9. enriching information
     (diagram: geo information present, or not present and fetched through GeoNames)
     • extra information carried by single tweets is used to enrich data sets for topic extraction
     • linked data is filtered through a blacklist, so that only what is actually relevant for clustering purposes is crawled and fetched
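A blacklist filter for linked data could look roughly like the sketch below. The slides do not say which sites are blacklisted or how links are matched, so the domains listed here and the host-based matching are assumptions made purely for illustration.

```python
from urllib.parse import urlparse

# Hypothetical blacklist: hosts whose pages add little text for topic extraction
BLACKLIST = {"twitpic.com", "4sq.com"}


def is_relevant(url):
    """Return True if the link has a host that is not on the blacklist."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return bool(host) and host not in BLACKLIST


def filter_links(urls):
    """Keep only the links worth crawling and fetching."""
    return [url for url in urls if is_relevant(url)]


if __name__ == "__main__":
    links = [
        "http://twitpic.com/abc",
        "http://www.example.org/news",
        "not a url",
    ]
    print(filter_links(links))
```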
  10. e.r. model, focusing on posts / links / queries / clusters
  11. application lifecycle
      • as the Twitter API is connected and fetches a continuous stream of data, the clustering algorithm is executed asynchronously
      1. fetch data and store it in a continuous timeline
      2. cut time into relevant slices (diagram labels: T, yesterday, today, tomorrow?, time slice)
      3. create geo-localized clusters of information, using HAC (Hierarchical Agglomerative Clustering)
      4. extract topics from geo-clusters using LDA (Latent Dirichlet Allocation)
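Steps 2 and 3 of the lifecycle can be illustrated with a small pure-Python sketch. It rests on several assumptions the slides do not confirm: fixed-width time slices, single-linkage merging, a distance cut-off as the stopping rule, and plain Euclidean distance on latitude/longitude degrees (a simplification that ignores the Earth's curvature). All function names are hypothetical, and the LDA step is left out.

```python
from collections import defaultdict
from math import hypot


def slice_by_time(posts, slice_seconds):
    """Group (timestamp, lat, lon) posts into fixed-width time slices."""
    slices = defaultdict(list)
    for timestamp, lat, lon in posts:
        slices[int(timestamp // slice_seconds)].append((lat, lon))
    return dict(slices)


def single_linkage(cluster_a, cluster_b):
    """Distance between the two closest points of two clusters."""
    return min(
        hypot(p[0] - q[0], p[1] - q[1]) for p in cluster_a for q in cluster_b
    )


def hac(points, max_distance):
    """Agglomerative clustering: merge the closest pair until too far apart."""
    clusters = [[point] for point in points]
    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                distance = single_linkage(clusters[i], clusters[j])
                if best is None or distance < best[0]:
                    best = (distance, i, j)
        distance, i, j = best
        if distance > max_distance:
            break  # remaining clusters are all farther apart than the cut-off
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    posts = [
        (0, 43.77, 11.25),
        (10, 43.78, 11.26),
        (20, 41.90, 12.50),
        (4000, 45.46, 9.19),
    ]
    for slice_id, points in sorted(slice_by_time(posts, 3600).items()):
        print(slice_id, hac(points, max_distance=0.5))
```

The pairwise search is quadratic per merge, which is fine for a demonstration but would need a library implementation for real data volumes.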
  12. software architecture
  13. web interface
      • first prototype of web interface, showing geo-localized clusters
      • radius of clusters indicates standard deviation
      • opacity indicates density (number of posts)
      • for each cluster, its corresponding metadata is shown, including:
        • list of topics
        • list of posts
        • related links
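One way to derive the two visual parameters mentioned above is sketched below. The slides do not specify how the standard deviation or the density are computed, so this sketch assumes the radius is the root-mean-square distance of posts from the cluster centroid (in degrees) and the opacity is the post count relative to a chosen maximum. The function name and the `max_posts` parameter are hypothetical.

```python
from math import sqrt


def cluster_marker(points, max_posts):
    """Compute center, radius and opacity for one cluster of (lat, lon) posts."""
    count = len(points)
    lat = sum(p[0] for p in points) / count
    lon = sum(p[1] for p in points) / count
    # Assumed definition: spread of the posts around the centroid
    variance = sum((p[0] - lat) ** 2 + (p[1] - lon) ** 2 for p in points) / count
    return {
        "center": (lat, lon),
        "radius": sqrt(variance),
        "opacity": min(1.0, count / max_posts),  # denser clusters are more opaque
    }


if __name__ == "__main__":
    print(cluster_marker([(0.0, 0.0), (2.0, 0.0)], max_posts=4))
```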
  14. what's next?
      • refinements of LDA topic extraction algorithm (using different sources, determining better datasets of ground truth content for construction of statistical model)
      • Twitter streaming API tweaks:
        • location boxes
        • use of keywords and keyword expansion for context-specific searches
      • implementation of search masks with a content indexing system (e.g. Apache Solr)
      • timeline representation of clusters / topics
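The "location boxes" tweak refers to restricting the stream to bounding boxes. The Twitter streaming API takes these in its `locations` parameter as comma-separated longitude,latitude pairs, with the south-west corner of each box first. The helpers below are only a sketch of building that value and testing a point against a box locally; the function names are hypothetical and no request is sent.

```python
def format_locations(boxes):
    """Build a `locations` value from (west_lon, south_lat, east_lon, north_lat) boxes."""
    return ",".join(f"{value:g}" for box in boxes for value in box)


def box_contains(box, lat, lon):
    """Return True if the point lies inside the bounding box."""
    west_lon, south_lat, east_lon, north_lat = box
    return west_lon <= lon <= east_lon and south_lat <= lat <= north_lat


if __name__ == "__main__":
    san_francisco = (-122.75, 36.8, -121.75, 37.8)
    print(format_locations([san_francisco]))
    print(box_contains(san_francisco, lat=37.77, lon=-122.42))
```

Note that `box_contains` does not handle boxes that cross the 180° meridian.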
  15. • http://a.parsons.edu/~drumb588/tweetcatcha/
      • http://truthy.indiana.edu/
      • http://www.janwillemtulp.com/worldeconomicforum/
      • http://moritz.stefaner.eu/projects/map%20your%20moves/
  16. thanks!
      • Thomas M. Alisi, PhD - thomasalisi@gmail.com
      • Giuseppe Serra, PhD - giuseppe.serra@gmail.com
      • Marco Bertini, PhD - bertini@dsi.unifi.it
