20090914 Petamedia Irp5

Presentation for PetaMedia meeting in Lausanne, discussing IRP5 (Data acquisition) results.

  1. IRP5: Social Media Data Acquisition
     Presented by: Arjen P. de Vries
  2. PROBLEM
     • Each of us can think of many research questions related to the PetaMedia objective of integrating the data network, the user network and the physical network...
  3. PROBLEM
     • ... but today these three networks are mostly disparate: they do not overlap!
     • So how do we evaluate the effectiveness of our ideas?
  4. IRP5 OBJECTIVES
     • Develop a common data set that is large enough to allow meaningful research, and that contains video content as well as explicit information generated by a social network of sufficient density
     • Organize a PetaMedia ‘data-set yellow pages’
     • irp5petamedia.pbworks.com
     • Also functions as the crawling code repository!
  5. DATASET REQUIREMENTS
     • Availability of video data
        • Creative Commons (CC) licensed data only
        • Data sources with developer-friendly APIs only
     • Availability of social data
        • Users creating the data should be organized in a social network, and provide feedback about their preferences in relation to the data (comments, ratings, ...)
  6. CANDIDATE DATA SOURCES
     • Blip.tv
        • High quality, 25-60% CC, poor social data
     • Revver.com
        • Medium quality, 100% CC, poor social data
     • Flickr.com
        • 200K CC
        • No API for access to video content
  7. DECISIONS
     • 1) Join Blip.tv and Revver.com with Del.icio.us, Digg and Twitter to get richer social data
     • 2) Crawl links to videos, as well as the social networks of the users creating those links, up to 4 levels of social network but only 2 levels of metadata (bookmarks/posts/profiles)
     • 3) Crawl Flickr despite the missing API
     [Diagram: video data from blip.tv and revver.com, connected to comments, user info and friend info through social ‘mentions’ in Digg, Del.icio.us and Twitter]
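Decision 2 amounts to a breadth-first crawl with two separate depth limits. The sketch below is only an illustration, not the project's crawler: `get_contacts` and `get_metadata` are hypothetical callables standing in for whatever site-specific code fetches a user's contacts and bookmarks/posts/profile, and it assumes the seed user counts as level 0.

```python
from collections import deque


def crawl(seed, get_contacts, get_metadata,
          max_network_depth=4, max_metadata_depth=2):
    """Breadth-first crawl of a social network with two depth limits.

    get_contacts(user) and get_metadata(user) are hypothetical callables
    supplied by the caller; the seed user is treated as level 0.
    """
    depths = {seed: 0}   # user -> hop distance from the seed
    metadata = {}        # user -> metadata, only for the inner levels
    queue = deque([seed])
    while queue:
        user = queue.popleft()
        depth = depths[user]
        if depth <= max_metadata_depth:
            metadata[user] = get_metadata(user)
        if depth == max_network_depth:
            continue  # record the user, but do not expand any further
        for contact in get_contacts(user):
            if contact not in depths:  # visit every user at most once
                depths[contact] = depth + 1
                queue.append(contact)
    return depths, metadata


if __name__ == "__main__":
    # Toy chain u0 -> u1 -> ... -> u6 instead of a real social site.
    graph = {f"u{i}": [f"u{i + 1}"] for i in range(6)}
    users, meta = crawl("u0", lambda u: graph.get(u, []), lambda u: {"user": u})
    print(sorted(users), sorted(meta))
```

Keeping the metadata limit lower than the network limit bounds the number of expensive metadata requests while still recording the wider network structure.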
  8. FLICKR
     • Outline:
        • Video data acquired by scraping the mobile Flickr site (m.flickr.com)
           • At most 90 seconds each
           • Only mp4 (no flv)
        • Metadata acquired through the API
        • Typical download rate: 1 video with metadata per minute
     • Approach:
        • Query for travel-related tags; 10 videos per tag
        • Leads to 211 videos from 162 uploaders
        • Leads to 4143 videos in total from these uploaders
        • Leads to 17598 videos from their 32K contacts
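As a back-of-the-envelope check, the download rate and video counts on the Flickr slide give a rough idea of crawl duration. The sketch assumes a single sequential downloader that sustains the quoted rate of 1 video per minute; neither assumption is stated on the slide, and the three sets are simply treated one by one.

```python
MINUTES_PER_VIDEO = 1  # typical rate quoted on the slide

# Video counts quoted on the slide.
VIDEO_COUNTS = {
    "seed videos (tag queries)": 211,
    "all videos of the seed uploaders": 4143,
    "videos of the uploaders' contacts": 17598,
}


def crawl_days(n_videos, minutes_per_video=MINUTES_PER_VIDEO):
    """Estimated crawl time in days for a single sequential downloader."""
    return n_videos * minutes_per_video / (60 * 24)


if __name__ == "__main__":
    for label, count in VIDEO_COUNTS.items():
        print(f"{label}: {count} videos, about {crawl_days(count):.1f} days")
```

Under these assumptions the largest set alone would take roughly 12 days of uninterrupted downloading.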
  9. BLIP+REVVER
     • Most popular videos:
        • Blip-10,000 and Revver-10,000 data sets
     • Social mentions of additional videos:
        • 175 blip.tv and 45 revver.com downloadable videos mentioned at Del.icio.us (from 5 GB of social data, reached by starting with ‘joshua’, the Del.icio.us founder, and recursively following network fan links)
        • 1250 blip.tv and 9198 revver.com video clips digg-ed (by 3602 unique users)
        • ~850 blip.tv links posted by Twitter users per week
  10. PRELIMINARY RESULTS
     • Del.icio.us (now delicious.com) is better queried through the TU Berlin DAI-Labor collection (bookmarks from 2003-2007)
        • 14K distinct links to blip.tv (from 22K bookmarks)
        • 4K distinct links to revver (from 10K bookmarks)
     • Reason: the API truncates results to only 100 per item
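Counting distinct links from a bookmark collection, as in the figures above, can be sketched as follows. The normalisation rules (ignore the scheme, a leading `www.`, a trailing slash and the fragment) are assumptions made for illustration; the slides do not say how duplicates were actually identified, and `distinct_links` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def distinct_links(bookmark_urls, domain):
    """Return the set of distinct normalised links that point at `domain`.

    Assumed normalisation: scheme, leading 'www.', trailing slash and
    fragment are ignored. URLs without a scheme have no parsable host
    and are skipped.
    """
    links = set()
    for url in bookmark_urls:
        parts = urlsplit(url.strip())
        host = (parts.hostname or "").lower()
        if host != domain and not host.endswith("." + domain):
            continue  # bookmark points at another site
        if host.startswith("www."):
            host = host[len("www."):]
        key = host + parts.path.rstrip("/")
        if parts.query:
            key += "?" + parts.query
        links.add(key)
    return links


if __name__ == "__main__":
    sample = [
        "http://blip.tv/file/123",
        "http://www.blip.tv/file/123/",   # same video, different spelling
        "http://revver.com/video/9",
    ]
    print(len(distinct_links(sample, "blip.tv")), "distinct blip.tv links")
```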
  11. PRELIMINARY RESULTS
     • Twitter is better ‘crawled’ through Topsy, a search engine over tweets
        • 27K links to blip (from 42K indexed by Topsy)
        • 300 links to revver (from ~1200 indexed by Topsy)
     • Reason: the Twitter API limits the number of queries per IP address and, more importantly, only gives access to messages at most 7 days old
        • Note: Topsy was queried with ~2100 ‘popular’ ‘travel-related’ tags to work around its limit of 500 results per query
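The workaround for the per-query result limit is to issue many narrower queries and merge the results. A minimal sketch, assuming a hypothetical `search(tag)` callable that returns a list of links for one tag and never more than the cap; it does not model Topsy's real interface.

```python
def collect_links(tags, search, per_query_limit=500):
    """Union the results of one capped query per tag.

    `search` is a hypothetical callable (tag -> list of links) that is
    assumed to return at most `per_query_limit` results. Tags whose
    result list reaches the cap are reported, because their results
    were probably truncated and may need narrower queries.
    """
    links = set()
    capped_tags = []
    for tag in tags:
        results = search(tag)
        if len(results) >= per_query_limit:
            capped_tags.append(tag)
        links.update(results)  # the set removes links found under several tags
    return links, capped_tags


if __name__ == "__main__":
    # Tiny fake index instead of a real search engine; the cap is 3 here.
    index = {"paris": ["a", "b"], "rome": ["b", "c", "d", "e"]}
    found, capped = collect_links(
        ["paris", "rome"], lambda t: index.get(t, [])[:3], per_query_limit=3
    )
    print(len(found), "distinct links; possibly truncated tags:", capped)
```

Reporting the capped tags makes it visible where the merged result is still incomplete.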
  12. NEXT
     • Crawl actual social network data and tweets corresponding to the Twitter V2 and Del.icio.us V2 links, by re-using QMUL and EPFL code
  13. TO BE DONE
     • Analyse data and methods
        • Useful?
        • Complete?
     • Create data repository
     • Legal check for public sharing
        • Twitter data
        • Flickr data (HTML-scraped)
     • Advertise data sets (SIGIR Forum?)
  14. TEAM
     • TUD:
        • Pavel Serdyukov (coordination, Del.icio.us V2, Twitter V2)
        • Stevan Rudinac (blip.tv, revver.com)
        • Ronald Poppe (blip.tv, revver.com)
        • Maarten Clements (blip.tv, revver.com)
        • Arjen P. de Vries (coordination)
     • TUB:
        • Sebastian Schmiedeke (flickr.com)
     • EPFL:
        • Ivan Ivanov (Twitter V1)
     • UEP:
        • David Chudan (Digg)
        • Tomas Kliegr (Digg)
     • QMUL:
        • Naeem Ramzan (Del.icio.us V1)
        • Muhammad Akram (Del.icio.us V1)
