Plan A: Sampling Users
∗ Unfortunately, YouTube's user identifiers do not follow a standard format; they are user-specified strings.
∗ We were therefore unable to create a random sample of YouTube users.
Mislove et al. (2007) Measurement and Analysis of Online Social Networks. IMC
Plan B: Sampling Videos
∗ Using the YouTube search API, Zhou et al. develop a random prefix sampling method and estimate that there were roughly 500 million YouTube videos as of May 2011.
∗ Sample the videos first, and then find the respective users.
Zhou et al. (2011) Counting YouTube Videos via Random Prefix Sampling. IMC
Get proportional users?
∗ Limitation: selection bias towards users who upload more videos.
∗ Therefore, each user must be weighted inversely to the number of videos they uploaded (weight factor = maximum video count / the user's own video count) to obtain a random sample of YouTube users.
∗ Is it possible?
[Figure: videos crawled, users detected]
Users detected, with weight factors:

User ID   Video Num   Weight Factor   Active Days
1         10          1               20
2         5           2               15
3         1           10              1

Weight cases:

User ID   Video Num   Active Days
1         10          20
2         5           15
2         5           15
3         1           1      (this row appears 10 times)
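The weighting in the three-user example can be sketched in a few lines of Python. This is a minimal illustration, assuming the weight factor is the maximum video count divided by each user's own video count; the names `weight_factors` and `weighted_cases` are hypothetical and not part of any library.

```python
def weight_factors(video_counts):
    """Return {user_id: max video count / that user's video count}."""
    max_count = max(video_counts.values())
    return {user: max_count / count for user, count in video_counts.items()}


def weighted_cases(weights):
    """Repeat each user ID according to its (rounded) weight factor."""
    return [user for user, w in weights.items() for _ in range(round(w))]


# Video counts per user, taken from the three-user example.
video_counts = {1: 10, 2: 5, 3: 1}

weights = weight_factors(video_counts)
cases = weighted_cases(weights)

print(weights)     # weight factor per user
print(len(cases))  # number of rows after weighting
```

Users who upload many videos are more likely to be reached through a video sample, so they receive small weights, while single-video users are repeated many times.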
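Strategy
∗ ID space: 64^10 × 16 = 2^64 ≈ 1.8447e+19 (11-character IDs over the 64 symbols 0-9, a-z, A-Z, "-" and "_", with only 16 possible values for the last character)
∗ YouTube video IDs are randomly generated from the ID space
∗ The sampling space is far too large!
∗ Any good idea?
∗ http://www.youtube.com/watch?v=1yo0zBFCMxo
∗ http://www.youtube.com/watch?v=_OBlgSz8sSM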
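To see why guessing IDs at random is hopeless, the size of the ID space can be computed directly. The sketch below assumes the alphabet is the 64 characters 0-9, a-z, A-Z, "-" and "_", with 10 free positions and 16 possible values for the last character; `id_space_size` is a hypothetical helper, and a different alphabet can be passed in to reproduce other counts.

```python
import string

# Assumed ID alphabet: digits, letters, "-" and "_" (64 symbols).
ID_ALPHABET = string.digits + string.ascii_letters + "-_"


def id_space_size(free_positions=10, last_char_values=16, alphabet=ID_ALPHABET):
    """Number of possible IDs: |alphabet|^free_positions * last_char_values."""
    return len(alphabet) ** free_positions * last_char_values


size = id_space_size()
print(size)

# Chance that one uniformly random ID is a real video, taking the
# roughly 500 million videos reported by Zhou et al. as the total.
hit_rate = 500_000_000 / size
print(hit_rate)
```

With a hit rate this small, almost every random guess would miss, which is why the search API is used to enumerate IDs by prefix instead.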
YouTube Search API
∗ One unique property of the YouTube search API is that searching with a keyword string of the format "watch?v=xy...z" (including the quotes), where "xy...z" is a prefix (of length L, 1 ≤ L ≤ 11) of a possible YouTube video ID that does not contain the literal "-", returns a list of videos whose IDs begin with this prefix followed by "-", if any exist.
∗ YouTube limits the number of results returned for any query.
∗ When the prefix is short (e.g., L = 1 or 2), the search results are more likely to contain "noisy" video IDs that do not match the pattern; a short prefix may also match a large number of videos.
∗ In contrast, if the prefix is too long (e.g., L = 6 or 7), the search engine may return no results.
Practice
∗ However, in practice, a prefix of length L < 5 usually matches more than one hundred videos, and the YouTube API returns at most 30 IDs per prefix query.
∗ On the other hand, based on our experimental results, a prefix of length L = 5 always matches fewer than 10 valid IDs.
∗ Therefore, a prefix length of 5 is a good choice in practice.
∗ They find that querying with prefixes of length four returns IDs that have a "-" in the fifth position; this result set is large enough that each prefix returns some results and small enough that it never reaches the result limit set by the API.
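The query construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes prefixes are drawn from the ID alphabet without "-", and the helper names `random_prefix`, `build_query` and `matching_ids` are hypothetical. No network call is made; the returned IDs are passed in as a plain list.

```python
import random
import string

# Assumed prefix alphabet: ID characters except "-", which the prefix must not contain.
PREFIX_ALPHABET = string.digits + string.ascii_letters + "_"


def random_prefix(length=4, rng=random):
    """Draw a uniformly random prefix of the given length."""
    return "".join(rng.choice(PREFIX_ALPHABET) for _ in range(length))


def build_query(prefix):
    """Build the search string; the double quotes are part of the query."""
    return '"watch?v={}"'.format(prefix)


def matching_ids(prefix, returned_ids):
    """Keep 11-character IDs that start with prefix + '-'; drop noisy hits."""
    wanted = prefix + "-"
    return [vid for vid in returned_ids if len(vid) == 11 and vid.startswith(wanted)]


rng = random.Random(0)  # seeded only to make the example repeatable
prefix = random_prefix(rng=rng)
query = build_query(prefix)
print(query)
```

Filtering the response with `matching_ids` keeps the count per prefix clean even when the search engine returns IDs that do not fit the pattern.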
∗ Zhou et al. estimated that there were about 500 million YouTube videos as of 2011 (Zhou et al., 2011).
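How per-prefix counts scale up to a total can be sketched as below. This is an illustration under stated assumptions, not necessarily the exact estimator used by Zhou et al.: it assumes IDs are uniformly distributed over a 64-symbol alphabet, so the mean number of IDs found per sampled pattern (a length-4 prefix followed by "-") is multiplied by the number of possible 5-character beginnings. The function name and the sample counts are hypothetical.

```python
from statistics import mean


def estimate_total(counts, prefix_length=4, alphabet_size=64):
    """Scale the mean count per sampled pattern up to the whole ID space.

    Each sampled pattern fixes prefix_length + 1 characters (the prefix
    plus the trailing "-"), so there are alphabet_size ** (prefix_length + 1)
    equally likely beginnings under the uniformity assumption.
    """
    return mean(counts) * alphabet_size ** (prefix_length + 1)


# Made-up counts of matching IDs for eight sampled prefixes (illustration only).
sample_counts = [0, 1, 0, 0, 2, 0, 1, 0]

estimate = estimate_total(sample_counts)
print(estimate)
```

In a real study the estimate would use many more sampled prefixes, and its variance would depend on how evenly IDs are spread across prefixes.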
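Python and gdata
∗ gdata is a module for connecting to Google data services (including YouTube) via their APIs.

gdata Code

    import gdata.youtube.service

    def PrintVideoFeed(feed):
        # Minimal stand-in for the helper called below: print each video title.
        for entry in feed.entry:
            print(entry.media.title.text)

    def SearchAndPrint(search_terms):
        # Create the client and build a video search query.
        yt_service = gdata.youtube.service.YouTubeService()
        query = gdata.youtube.service.YouTubeVideoQuery()
        query.vq = search_terms
        query.orderby = 'viewCount'   # sort results by view count
        query.racy = 'include'        # include restricted content
        feed = yt_service.YouTubeQuery(query)
        PrintVideoFeed(feed)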
Test Validity
∗ http://www.youtube.com/watch?v=1yo0zBFCMxo
∗ The Secret State - The Biggest Mistake - Official Lyric Music Video
Can't find the video!
∗ searchApi("watch?v=1yo0z")
Reference
∗ Zhou et al. (2011) Counting YouTube Videos via Random Prefix Sampling. IMC
∗ Mislove et al. (2007) Measurement and Analysis of Online Social Networks. IMC
∗ YouTube developer's guide for Python: https://developers.google.com/youtube/1.0/developers_guide_python
∗ Introduction to the gdata.youtube library: http://gdata-pythonclient.googlecode.com/svn/trunk/pydocs/gdata.youtube.html#YouTubeVideoEntry