This document describes the random prefix sampling method for sampling YouTube videos developed by Zhou et al., who estimated that there were approximately 500 million YouTube videos as of 2011. The method samples video IDs by querying YouTube's search API with random prefixes of length 5, each of which returns a small but non-empty set of video IDs. This makes it possible to draw an unbiased sample of YouTube videos for analysis. The document also discusses simulating the sampling method numerically and comparing properties of the sample with those of the overall population.
4. Plan A: Sampling Users
∗ Unfortunately, YouTube’s user identifiers do not follow a
standard format: they are user-specified strings. We were
therefore unable to create a random sample of YouTube users.
Mislove (2007) Measurement and Analysis of Online Social Networks. IMC
5. Plan B: Sampling Videos
∗ Using the YouTube search API, Zhou et al. developed a random
prefix sampling method and found roughly 500 million
YouTube videos as of May 2011.
∗ Sample the videos first, and then find the respective users.
Zhou et al. (2011) Counting YouTube Videos via Random Prefix Sampling. IMC
6. Get proportional users?
∗ Limitation: selection bias towards users who upload more
videos. Therefore, weighting each user by the number of videos
uploaded (relative to the max value) is necessary to get a
random sample of YouTube users.
∗ Is it possible?
[Figure: videos crawled and users detected]
7. Weight Cases
Sampled users with weight factors:
User ID   Video Num   Weight Factor   Active Days
1         10          1               20
2         5           2               15
3         1           10              1
After weighting cases (each row repeated Weight Factor times):
User ID   Video Num   Active Days   Rows
1         10          20            1
2         5           15            2
3         1           1             10
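The weighting in the table above can be sketched in a few lines of Python. This is only an illustration under an assumption read off the table: the weight factor is taken to be the maximum video count divided by each user's own video count (10/10 = 1, 10/5 = 2, 10/1 = 10). The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    user_id: int
    video_num: int
    active_days: int


def weight_factors(users):
    """Assumed rule: weight = max video count / this user's video count."""
    max_videos = max(u.video_num for u in users)
    return {u.user_id: max_videos / u.video_num for u in users}


def expand_by_weight(users):
    """Repeat each user's row round(weight) times, like 'Weight Cases'."""
    weights = weight_factors(users)
    rows = []
    for u in users:
        rows.extend([u] * round(weights[u.user_id]))
    return rows


# The three users from the table: (User ID, Video Num, Active Days)
users = [User(1, 10, 20), User(2, 5, 15), User(3, 1, 1)]
weights = weight_factors(users)
weighted_rows = expand_by_weight(users)
```

After expansion, the user who uploaded 10 videos contributes one row and the user who uploaded a single video contributes ten, which offsets the higher chance of detecting heavy uploaders through their videos.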
8. Strategy
∗ 64^10*16 = 1.844674e+19
∗ YouTube video ids are randomly generated from the id space
∗ Sampling space is tooooooo large!
∗ Any good idea?
∗ http://www.youtube.com/watch?v=1yo0zBFCMxo
∗ http://www.youtube.com/watch?v=_OBlgSz8sSM
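To see why guessing ids uniformly at random is hopeless, the size of the id space can be compared with the number of videos. The sketch below is an assumption-laden illustration: it takes the id alphabet to be the 64 URL-safe characters (A-Z, a-z, 0-9, "-", "_"), with ten free characters and 16 possible values for the last one; the function names are hypothetical, and alphabet_size can be changed to test a different assumption.

```python
import string

# Assumption: ids use the 64-character URL-safe alphabet.
ALPHABET = string.ascii_letters + string.digits + "-_"


def id_space_size(alphabet_size=len(ALPHABET), free_chars=10, last_char_choices=16):
    """Number of possible ids: alphabet_size^free_chars * last_char_choices."""
    return alphabet_size ** free_chars * last_char_choices


def hit_probability(num_videos, space=None):
    """Chance that one uniformly random id belongs to an existing video."""
    if space is None:
        space = id_space_size()
    return num_videos / space


space = id_space_size()
p_hit = hit_probability(500_000_000)  # roughly the 2011 video count
```

With about 500 million videos, a single random guess hits an existing video with a probability on the order of 10^-11, so a smarter way of locating valid ids is needed.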
9. YouTube Search API
∗ One unique property of YouTube search API we find is that when searching
using a keyword string of the format “watch?v=xy...z” (including the quotes)
where “xy...” is a prefix (of length L, 1 ≤ L ≤ 11) of a possible YouTube video id
which does not contain the literal “-” in the prefix, YouTube will return a list
of videos whose id’s begin with this prefix followed by “-”, if they exist.
∗ YouTube limits the number of returned results for any query.
∗ When the prefix is short (e.g., L = 1 or 2), the returned search results
are more likely to contain “noisy” video ids; a short prefix may also
match a large number of videos.
∗ In contrast, if the prefix is too long (e.g., L = 6 or 7), the search engine
may return no results.
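The query format described above can be sketched as two small helpers: one builds the quoted keyword string, the other keeps only returned ids that begin with the prefix followed by "-". Both function names and the sample ids are hypothetical; no real API call is made here.

```python
def build_query(prefix):
    """Return the quoted keyword string "watch?v=<prefix>" for a prefix search."""
    if not 1 <= len(prefix) <= 11:
        raise ValueError("prefix length must be between 1 and 11")
    if "-" in prefix:
        raise ValueError('prefix must not contain "-"')
    # The surrounding double quotes are part of the search string.
    return '"watch?v={}"'.format(prefix)


def filter_matching_ids(prefix, returned_ids):
    """Keep only ids that start with the prefix followed by "-"."""
    wanted = prefix + "-"
    return [vid for vid in returned_ids if vid.startswith(wanted)]


query = build_query("abcd")
# Hypothetical ids standing in for a search response.
kept = filter_matching_ids("abcd", ["abcd-xyz123", "abcdexyz123", "zzzz-000000"])
```

Filtering on the client side is one simple way to discard “noisy” ids, whatever their cause.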
10. Practice
∗ In practice, however, a prefix of length L < 5 usually
matches more than one hundred ids, and the YouTube API can
return at most 30 ids for each prefix query.
∗ On the other hand, based on our experimental results, a prefix
of length L = 5 always matches fewer than 10 valid ids.
∗ Therefore, a prefix length of 5 is a good choice in practice.
11. ∗ They find that querying with prefixes of length four
returns ids having a “-” in the fifth place, which gives a result
set big enough that each prefix returns some results and small
enough never to reach the result limit set by the API.
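A minimal sketch of the sampling loop, assuming four-character random prefixes drawn from the id alphabet without "-". The search argument is a hypothetical callable that takes a prefix and returns a list of video ids; in a real crawl it would wrap the search API, while here it is replaced by a fake so the sketch runs offline.

```python
import random
import string

# Prefix characters: the assumed id alphabet minus "-", which a prefix may not contain.
PREFIX_ALPHABET = string.ascii_letters + string.digits + "_"


def random_prefix(rng, length=4):
    """Draw a random prefix of the given length."""
    return "".join(rng.choice(PREFIX_ALPHABET) for _ in range(length))


def sample_video_ids(search, num_queries, rng=None, length=4):
    """Query random prefixes and collect ids with "-" right after the prefix."""
    rng = rng or random.Random()
    sampled = set()
    for _ in range(num_queries):
        prefix = random_prefix(rng, length)
        for vid in search(prefix):
            if vid.startswith(prefix + "-"):
                sampled.add(vid)
    return sampled


def fake_search(prefix):
    """Stand-in for the real search: one matching id and one noisy id."""
    return [prefix + "-AAAAAA", "noise"]


ids = sample_video_ids(fake_search, num_queries=5, rng=random.Random(0))
```

Passing the search function as a parameter keeps the sampling logic independent of any particular API client.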
12. ∗ Zhou et al. found that there were about 500 million YouTube
videos as of 2011!
Zhou et al. (2011) Counting YouTube Videos via Random Prefix Sampling. IMC
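13. Python and gdata
∗ gdata is a module for connecting to Google data services
(including YouTube) via their APIs
gdata Code
import gdata.youtube.service

def SearchAndPrint(search_terms):
  # Client for the YouTube Data API
  yt_service = gdata.youtube.service.YouTubeService()
  # Build a video search query for the given terms
  query = gdata.youtube.service.YouTubeVideoQuery()
  query.vq = search_terms
  query.orderby = 'viewCount'  # sort results by view count
  query.racy = 'include'  # also return restricted content
  feed = yt_service.YouTubeQuery(query)
  # PrintVideoFeed is a helper assumed to be defined elsewhere
  PrintVideoFeed(feed)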
16. Compare two random samples
∗ # summary(da$Freq)
  #  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
  #  1.00    7.00  25.00 17.15   25.00 75.00
∗ # summary(db$Freq)
  #  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
  #  1.00    8.00  25.00 17.57   25.00 50.00
17. There were about 604 million videos on
YouTube as of Dec 2012!
∗ length(unique(subset(a[,1], b[,1]%in%a[,1]))) == 26
∗ 34361/x = 125/34361
∗ X = x*64 = (34361^2/125)*64 ≈ 604507300
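The arithmetic above has the form of a two-sample capture-recapture (Lincoln-Petersen) estimate, so it can be reproduced as follows. This is a sketch of one reading of the slide, not a confirmed description of the author's procedure: the sample sizes and overlap are taken from the ratio 34361/x = 125/34361, and the factor of 64 is presumably there because the sampled ids all have "-" in the fifth place, one of 64 possible characters. Function names are hypothetical.

```python
def lincoln_petersen(n_first, n_second, n_overlap):
    """Population estimate from two samples and the size of their overlap."""
    if n_overlap <= 0:
        raise ValueError("the two samples must overlap")
    return n_first * n_second / n_overlap


def estimate_total_videos(n_first, n_second, n_overlap, scale=64):
    """Scale the estimate up from the sampled slice of the id space (assumed 1/64)."""
    return lincoln_petersen(n_first, n_second, n_overlap) * scale


slice_estimate = lincoln_petersen(34361, 34361, 125)
total_estimate = estimate_total_videos(34361, 34361, 125)
```

With these inputs the scaled estimate comes out at roughly 604.5 million, matching the figure on the slide.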
19. Reference
∗ Zhou et al. (2011) Counting YouTube Videos via Random Prefix
Sampling. IMC
∗ Mislove (2007) Measurement and Analysis of Online Social
Networks. IMC
∗ YouTube developer's guide for Python
https://developers.google.com/youtube/1.0/developers_guide_python
∗ Introduction to the library of gdata.youtube
http://gdata-pythonclient.googlecode.com/svn/trunk/pydocs/gdata.youtube.html#YouTubeVideoEntry