20090914 Petamedia Irp5

Presentation for PetaMedia meeting in Lausanne, discussing IRP5 (Data acquisition) results.

Transcript

  • 1. IRP5: Social Media Data Acquisition Presented by: Arjen P. de Vries
  • 2. PROBLEM
      • Each of us can think of many research questions related to the PetaMedia objective of integrating the data network, the user network and the physical network...
  • 3.  
  • 4. PROBLEM
      • ... but, today, these three networks are mostly disparate – they do not overlap!
      • So, how to evaluate the effectiveness of our ideas?!
  • 5. IRP5 OBJECTIVES
      • Develop a common data-set that is large enough to allow meaningful research, and contains video content as well as explicit information generated by a social network of sufficient density
      • Organize a PetaMedia ‘data-set yellow pages’
      • irp5petamedia.pbworks.com
      • Functions also as crawling code repository!
  • 6.  
  • 7. DATASET REQUIREMENTS
      • Availability of video data
          • Creative Commons (CC) licenced data only
          • Data sources with developer-friendly APIs only
      • Availability of social data
          • Users creating the data should be organized in a social network, and provide feedback about their preferences in relation to the data (comments, ratings, ...)
  • 8. CANDIDATE DATA SOURCES
      • Blip.tv
          • High quality, 25-60% CC, poor social data
      • Revver.com
          • Medium quality, 100% CC, poor social data
      • Flickr.com
          • 200K CC
          • No API for access to video content
  • 9. DECISIONS
      • 1) Join Blip.tv and revver.com with Del.icio.us, Digg and Twitter to get richer social data
      • 2) Crawl links to videos, as well as the social networks of users creating those links, up to 4 levels of social network but only 2 levels of metadata (bookmarks/posts/profiles)
      • 3) Crawl Flickr irrespective of missing API
      • [Diagram labels: Comments, User info, Friend info; Video data; blip.tv, revver.com; Social ‘mention’ in Digg, Del.icio.us, Twitter]
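Decision 2 can be read as a breadth-first crawl with two separate depth limits. The sketch below is only an illustration of that reading, not the project's crawler: `get_contacts` and `fetch_metadata` are hypothetical callables standing in for whatever site-specific code is used, and it assumes that the users who posted the video links count as level 1.

```python
from collections import deque


def crawl(seeds, get_contacts, fetch_metadata,
          max_network_level=4, max_metadata_level=2):
    """Breadth-first crawl with separate limits for network and metadata.

    Assumption: seed users are level 1. `get_contacts(user)` and
    `fetch_metadata(user)` are hypothetical, caller-supplied functions.
    """
    level_of = {user: 1 for user in seeds}
    metadata = {}
    queue = deque(seeds)
    while queue:
        user = queue.popleft()
        level = level_of[user]
        # Metadata (bookmarks/posts/profiles) only for the first levels.
        if level <= max_metadata_level:
            metadata[user] = fetch_metadata(user)
        # Follow social links only while below the network limit.
        if level < max_network_level:
            for contact in get_contacts(user):
                if contact not in level_of:
                    level_of[contact] = level + 1
                    queue.append(contact)
    return level_of, metadata
```

With a chain of six users starting from one seed, this visits the first four users and fetches metadata for the first two only.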
  • 10. FLICKR
      • Outline:
          • Video data acquired through scraping the mobile flickr site (m.flickr.com)
              • At most 90 seconds each
              • Only mp4 (no flv)
          • Metadata acquired through API
          • Typical download rate: 1 video with metadata per minute
      • Approach:
          • Query for travel-related tags; 10 videos per tag
          • Leads to 211 videos from 162 uploaders;
          • Leads to 4143 videos in total from these uploaders
          • Leads to 17598 videos from their 32K contacts
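The approach on this slide is a three-stage expansion: seed videos per tag, then all uploads of the seed uploaders, then the uploads of their contacts. A minimal sketch of that staging follows. All three callables (`search_videos`, `videos_of`, `contacts_of`) are hypothetical placeholders and do not correspond to actual Flickr API calls; excluding the seed uploaders from the contact set is also an assumption.

```python
def snowball(tags, search_videos, videos_of, contacts_of, per_tag=10):
    """Three-stage expansion from tags to uploaders to contacts.

    Hypothetical callables (not real Flickr API functions):
      search_videos(tag) -> list of (video_id, uploader) pairs
      videos_of(user)    -> list of video ids uploaded by `user`
      contacts_of(user)  -> list of the user's contacts
    """
    # Stage 1: at most `per_tag` seed videos for each tag.
    seed = set()
    for tag in tags:
        seed.update(search_videos(tag)[:per_tag])
    uploaders = {owner for _, owner in seed}

    # Stage 2: every video from the seed uploaders.
    uploader_videos = set()
    for user in uploaders:
        uploader_videos.update(videos_of(user))

    # Stage 3: videos from contacts (assumption: uploaders excluded).
    contacts = set()
    for user in uploaders:
        contacts.update(contacts_of(user))
    contacts -= uploaders
    contact_videos = set()
    for user in contacts:
        contact_videos.update(videos_of(user))

    return {
        "seed_videos": {vid for vid, _ in seed},
        "uploaders": uploaders,
        "uploader_videos": uploader_videos,
        "contacts": contacts,
        "contact_videos": contact_videos,
    }
```

The sizes of the returned sets play the role of the counts quoted on the slide (seed videos, uploaders, videos from uploaders, videos from contacts).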
  • 11. BLIP+REVVER
      • Most popular videos:
          • Blip-10,000 and Revver-10,000 data sets
      • Social mentions of additional videos:
          • 175 blip.tv and 45 revver.com downloadable videos mentioned at Del.icio.us (out of 5GB of social data, reached by starting with ‘joshua’ – its founder – and recursively following network fan links)
          • 1250 blip.tv and 9198 revver.com video clips digg-ed (by 3602 unique users)
          • ~850 blip.tv links posted by Twitter users per week
  • 12. PRELIMINARY RESULTS
      • Del.icio.us (now delicious.com) is better queried over the TU Berlin DAI-Labor lab collection (bookmarks from 2003-2007)
          • 14K distinct links to blip.tv (from 22K bookmarks)
          • 4K distinct links to revver (from 10K bookmarks)
      • Reason: API truncates results to only 100 per item
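Counting distinct links among a larger number of bookmarks requires some form of URL normalization, since several bookmarks can point at the same video. The sketch below shows one way this could be done. The normalization rules (ignore scheme, a leading `www.`, fragments and trailing slashes) are assumptions for illustration and are not taken from the slides.

```python
from urllib.parse import urlsplit


def distinct_links(bookmark_urls, domain):
    """Return the set of distinct normalized links pointing at `domain`.

    The normalization rules are illustrative assumptions: the scheme,
    a leading "www.", the fragment and trailing slashes are ignored.
    """
    seen = set()
    for url in bookmark_urls:
        parts = urlsplit(url.strip())
        host = (parts.hostname or "").lower()
        # Accept the domain itself and any of its subdomains.
        if host != domain and not host.endswith("." + domain):
            continue
        if host.startswith("www."):
            host = host[len("www."):]
        key = host + parts.path.rstrip("/")
        if parts.query:
            key += "?" + parts.query
        seen.add(key)
    return seen
```

Comparing `len(distinct_links(urls, "blip.tv"))` with the number of matching bookmarks gives the kind of distinct-versus-total figures reported on the slide.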
  • 13. PRELIMINARY RESULTS
      • Twitter is better ‘crawled’ through Topsy, a search engine over Tweets
          • 27K links to blip (from 42K indexed by Topsy)
          • 300 links to revver (from ~1200 indexed by Topsy)
      • Reason: API usage is limited in the number of queries per IP address, but, more importantly, the API only gives access to messages at most 7 days old
          • BTW: Topsy has been queried using ~2100 ‘popular’ ‘travel-related’ tags, to circumvent the limit of 500 results per query
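The workaround for the per-query result cap amounts to splitting one broad query into many narrow ones and taking the union of the results. A minimal sketch under stated assumptions: `search(site, tag)` is a hypothetical callable that returns at most `cap` links per call, and it is not modelled on the real Topsy interface.

```python
def collect_links(search, site, tags, cap=500):
    """Union the results of one capped query per tag.

    `search(site, tag)` is a hypothetical callable returning at most
    `cap` links. Tags whose result list reaches the cap are reported,
    since those queries may still be truncated.
    """
    links = set()
    saturated = []
    for tag in tags:
        results = search(site, tag)
        if len(results) >= cap:
            saturated.append(tag)
        links.update(results)
    return links, saturated
```

Tags reported as saturated could be refined further, for example by combining them with a second tag, although the slides do not say whether this was done.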
  • 14. NEXT
      • Crawl actual social network data and tweets corresponding to the Twitter V2 and Del.icio.us V2 links, by re-using QMUL and EPFL code
  • 15. TO BE DONE
      • Analyse data and methods
          • Useful?
          • Complete?
      • Create data repository
      • Legal check for public sharing
          • Twitter data
          • Flickr data html-scraped
      • Advertise data sets (SIGIR-Forum?)
  • 16. TEAM
      • TUD:
          • Pavel Serdyukov (coordination, Del.icio.us V2, Twitter V2)
          • Stevan Rudinac (blip.tv, revver.com)
          • Ronald Poppe (blip.tv, revver.com)
          • Maarten Clements (blip.tv, revver.com)
          • Arjen P. de Vries (coordination)
      • TUB:
          • Sebastian Schmiedeke (flickr.com)
      • EPFL:
          • Ivan Ivanov (Twitter V1)
      • UEP:
          • David Chudan (Digg)
          • Tomas Kliegr (Digg)
      • QMUL:
          • Naeem Ramzan (Del.icio.us V1)
          • Muhammad Akram (Del.icio.us V1)
