Flux of MEME - DOW 1st semester

First results after six months of work under the Telecom Italia Working Capital research grant, showing the technology used and a prototype web interface for a topic extraction and clustering tool.


Transcript

  1. Flux of MEME - description of work, 1st semester
     project: Flux of Meme
     author: Thomas M. Alisi - thomasalisi@gmail.com
     client: Telecom Italia
     review: deliverable 11.3.11
     Wednesday, March 9, 2011
  2. even though geo-tagging is growing, it still represents <1% of total user-generated content
  3. What makes a trend a Trend?
     from the Twitter blog, December 2010: “Twitter users now send more than 95 million Tweets a day, on just about every topic imaginable. We track the volume of terms mentioned on Twitter on an ongoing basis. Topics break into the Trends list when the volume of Tweets about that topic at a given moment dramatically increases.”
  4. project overview
     1. fetch data from real-time social networks
     2. create clusters of geo-located information
     3. extract topics
     4. analyze stats, creating timeline predictions
  5. prologue - struggling with hardware and algorithms
  6. fetching data: the Twitter streaming API
     • data is fetched using the Twitter streaming API
     • issues:
       • access to data is limited: a basic “Spritzer” account is limited to 1% of total tweets
       • the amount of geo-localized tweets still represents a small figure: around 1%
       • “good” data (meaning data that has geo-localized information) is around: 90M (total tweets/day) * 1% * 1%, i.e. roughly 9,000 tweets/day
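The estimate on this slide can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the figures quoted above; the function and constant names are illustrative and not part of the project.

```python
# Figures quoted on the slide; names are illustrative assumptions.
TOTAL_TWEETS_PER_DAY = 90_000_000  # "90M (total tweets/day)"
SAMPLE_RATE = 0.01                 # basic "Spritzer" access: about 1% of all tweets
GEO_TAGGED_RATE = 0.01             # about 1% of tweets carry geo information


def estimate_geo_tweets_per_day(total=TOTAL_TWEETS_PER_DAY,
                                sample_rate=SAMPLE_RATE,
                                geo_rate=GEO_TAGGED_RATE):
    """Return the expected number of geo-localized tweets received per day."""
    return round(total * sample_rate * geo_rate)


print(estimate_geo_tweets_per_day())  # 9000
```

At this volume the usable data set grows slowly, which is what motivates the two problems raised on the next slide.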
  7. problems
     1. how to increase geo-localized data?
     2. how to increase the amount / quality of text used for topic extraction?
  8. approximating geo-information
     geo information is extracted as text from the Twitter profile and searched on the GeoNames database, after having indexed its content (cities with population > 5,000)
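A minimal sketch of this kind of lookup, assuming a small in-memory gazetteer in place of the indexed GeoNames data. The rows, coordinates, and population figures below are rough illustrative values (one entry is made up), and the function names are hypothetical; the project's actual index is not described in the slides.

```python
import re

# Tiny illustrative gazetteer: (name, lat, lon, population).
# Values are rough and only for demonstration; "Smallville" is a made-up entry.
GAZETTEER = [
    ("Florence", 43.77, 11.25, 350_000),
    ("Turin", 45.07, 7.69, 850_000),
    ("London", 51.51, -0.13, 8_000_000),
    ("Smallville", 0.0, 0.0, 1_200),
]
MIN_POPULATION = 5_000  # threshold quoted on the slide


def build_index(rows, min_population=MIN_POPULATION):
    """Index city names (lowercased) for cities above the population threshold."""
    return {name.lower(): (lat, lon)
            for name, lat, lon, pop in rows
            if pop > min_population}


def approximate_location(profile_location, index):
    """Return (lat, lon) of the first indexed city named in the text, else None."""
    for token in re.findall(r"\w+", profile_location.lower()):
        if token in index:
            return index[token]
    return None


index = build_index(GAZETTEER)
print(approximate_location("Florence, Italy", index))
```

This token-by-token match ignores multi-word city names and ambiguous names; a real text index would need to handle both.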
  9. enriching information
     geo information: present / fetched through GeoNames / not present
     • extra information carried by single tweets is used to enrich data sets for topic extraction
     • linked data is filtered through a blacklist to crawl and fetch what is effectively relevant for clustering purposes
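The blacklist step could look roughly like the sketch below. The blacklisted domains are placeholders and the function names are hypothetical; the slides do not say which sites the project actually excludes or how the filter is implemented.

```python
from urllib.parse import urlparse

# Placeholder domains, assumed for illustration only.
BLACKLIST = {"photos.example", "checkins.example"}


def is_relevant(url, blacklist=BLACKLIST):
    """Return True if the URL has a host that is not on the blacklist."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if not host:
        return False
    # Reject the blacklisted domain itself and any of its subdomains.
    return not any(host == d or host.endswith("." + d) for d in blacklist)


def links_to_crawl(urls, blacklist=BLACKLIST):
    """Keep only the links worth fetching for topic extraction."""
    return [u for u in urls if is_relevant(u, blacklist)]


print(links_to_crawl([
    "http://news.example/article/1",
    "http://www.photos.example/p/42",
]))
```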
  10. e.r. model, focusing on posts / links / queries / clusters
  11. application lifecycle
      • as the Twitter API is connected and fetches a continuous stream of data, the clustering algorithm is executed asynchronously
      1. fetch data and store it in a continuous timeline
      2. cut time into relevant slices (yesterday, today, tomorrow?)
      3. create geo-localized clusters of information for each time slice, using HAC (Hierarchical Agglomerative Clustering)
      4. extract topics from geo-clusters using LDA (Latent Dirichlet Allocation)
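Steps 2 and 3 of this lifecycle can be sketched in plain Python. This is a simplified illustration, not the project's implementation: it assumes posts are `(timestamp, lat, lon)` tuples, uses fixed-width time slices, and merges clusters by centroid distance until the closest pair is farther apart than a chosen threshold (`max_km`, an assumed parameter). Topic extraction with LDA (step 4) is not shown.

```python
import math
from collections import defaultdict


def slice_by_time(posts, slice_seconds):
    """Group (timestamp, lat, lon) posts into fixed-width time slices."""
    slices = defaultdict(list)
    for ts, lat, lon in posts:
        slices[int(ts // slice_seconds)].append((lat, lon))
    return dict(slices)


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def centroid(points):
    """Arithmetic mean of the coordinates (adequate for small clusters)."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def hac(points, max_km):
    """Agglomerative clustering with centroid linkage and a distance cut-off."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = haversine_km(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_km:
            break  # remaining clusters are too far apart to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


posts = [(0, 43.77, 11.25), (100, 43.78, 11.26), (200, 45.07, 7.69), (4000, 51.51, -0.13)]
for slice_id, points in sorted(slice_by_time(posts, 3600).items()):
    print(slice_id, hac(points, max_km=50))
```

The pairwise search makes each merge quadratic in the number of clusters, which is acceptable for a sketch but would need an optimized implementation for a continuous stream.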
  12. software architecture
  13. web interface
      • first prototype of web interface, showing geo-localized clusters
      • radius of clusters indicates standard deviation
      • opacity indicates density (number of posts)
      • for each cluster, its corresponding metadata is shown, including:
        • list of topics
        • list of posts
        • related links
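One way to derive the two visual attributes described above is sketched below. It assumes the radius is the root-mean-square distance of the posts from the cluster centroid, measured in degrees, and that opacity is the post count scaled by a chosen maximum (`max_posts`). Both are assumptions made for illustration; the slides do not give the exact formulas.

```python
import math


def cluster_marker(points, max_posts):
    """Compute centre, radius and opacity for a cluster of (lat, lon) points."""
    n = len(points)
    lat_c = sum(p[0] for p in points) / n
    lon_c = sum(p[1] for p in points) / n
    # Assumed spread measure: RMS distance from the centroid, in degrees.
    variance = sum((lat - lat_c) ** 2 + (lon - lon_c) ** 2 for lat, lon in points) / n
    return {
        "center": (lat_c, lon_c),
        "radius": math.sqrt(variance),
        # Assumed density measure: share of the maximum post count, capped at 1.
        "opacity": min(1.0, n / max_posts),
    }


print(cluster_marker([(0.0, 0.0), (0.0, 2.0)], max_posts=4))
```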
  14. what’s next?
      • refinements of the LDA topic extraction algorithm (using different sources, determining better datasets of ground truth content for construction of the statistical model)
      • Twitter streaming API tweaks:
        • location boxes
        • use of keywords and keyword expansion for context-specific searches
      • implementation of search masks with a content indexing system (e.g. Apache Solr)
      • timeline representation of clusters / topics
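As an illustration of the "location boxes" idea, the sketch below represents a bounding box as `(west, south, east, north)` in degrees, serializes it as comma-separated longitude,latitude pairs with the south-west corner first (the format the streaming API's `locations` filter is understood to expect; treat this as an assumption and check the API documentation), and tests whether a point falls inside a box. The box coordinates are rough and the helper names are hypothetical.

```python
# Rough bounding box around Italy: (west, south, east, north), illustrative only.
ITALY_BOX = (6.6, 35.5, 18.5, 47.1)


def format_locations(boxes):
    """Serialize boxes as 'west,south,east,north[,west,south,...]'."""
    return ",".join(str(v) for box in boxes for v in box)


def in_box(lon, lat, box):
    """Return True if the (lon, lat) point lies inside the bounding box."""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


print(format_locations([ITALY_BOX]))   # 6.6,35.5,18.5,47.1
print(in_box(11.25, 43.77, ITALY_BOX))  # True (approximate position of Florence)
```

Restricting the stream to a region this way trades global coverage for a higher share of usable geo-localized posts.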
  15. http://a.parsons.edu/~drumb588/tweetcatcha/
      http://truthy.indiana.edu/
      http://www.janwillemtulp.com/worldeconomicforum/
      http://moritz.stefaner.eu/projects/map%20your%20moves/
  16. thanks!
      Thomas M. Alisi, PhD - thomasalisi@gmail.com
      Giuseppe Serra, PhD - giuseppe.serra@gmail.com
      Marco Bertini, PhD - bertini@dsi.unifi.it
