Random Prefix Sampling YouTube Users

Randomly Sampling YouTube Users:
An Introduction to the Random Prefix Sampling Method

Cheng-Jun Wang

Web Mining Lab
City University of Hong Kong
20121225

YouTube growth curve

http://singularityhub.com/2012/05/25/now-serving-the-latest-in-exponential-growth-youtube/

https://gdata.youtube.com/feeds/api/standardfeeds/most_recent

Plan A: Sampling Users

∗ Unfortunately, YouTube’s user identifiers do not follow a
standard format; they are user-specified strings. We were
therefore unable to create a random sample of YouTube users.

Mislove et al. (2007) Measurement and Analysis of Online Social Networks. IMC

Plan B: Sampling Videos

∗ Using the YouTube search API, Zhou et al. develop a random
prefix sampling method and estimate that there were roughly
500 million YouTube videos by May 2011.
∗ Sample the videos first, and then find the respective users.

Zhou et al. (2011) Counting YouTube Videos via Random Prefix Sampling. IMC

Get proportional users?

∗ Limitation: selection bias toward users who upload more
videos. Therefore, each detected user must be weighted
inversely to their number of videos (weight factor = max video
number / user's video number) to get a random sample of
YouTube users.
∗ Is it possible?

Videos crawled → users detected → weight cases

Users detected
User ID   Video Num   Weight Factor   Active Days
1         10          1               20
2         5           2               15
3         1           10              1

Weighted cases (each user repeated Weight Factor times)
User ID   Video Num   Active Days   Rows
1         10          20            1
2         5           15            2
3         1           1             10
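The weighting in the table above can be sketched in Python as follows. This is a minimal illustration, not the author's code: the `User` tuple and the function names `weight_factors` and `expand_cases` are hypothetical, and the weight is assumed to be the maximum video number divided by each user's video number, as the table suggests.

```python
from collections import namedtuple

# Hypothetical record type; the fields mirror the table columns.
User = namedtuple("User", ["user_id", "video_num", "active_days"])

def weight_factors(users):
    """Weight each user by max(video_num) / video_num (assumed rule)."""
    max_videos = max(u.video_num for u in users)
    return {u.user_id: max_videos / u.video_num for u in users}

def expand_cases(users):
    """Repeat each user round(weight) times to build the weighted cases."""
    weights = weight_factors(users)
    cases = []
    for u in users:
        cases.extend([u] * int(round(weights[u.user_id])))
    return cases

# The three users from the table.
users = [User(1, 10, 20), User(2, 5, 15), User(3, 1, 1)]
weights = weight_factors(users)
cases = expand_cases(users)
print(weights)
print(len(cases), "weighted cases")
```

After expansion, user 3 (one video) appears ten times and user 1 (ten videos) once, which offsets the higher chance of detecting heavy uploaders through their videos.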

Strategy

∗ Id space: 64^10*16 = 2^64 ≈ 1.844674e+19 (64 symbols in each of the first 10 positions, 16 in the last)
∗ Each YouTube video id is randomly generated from the id space
∗ The sampling space is far too large!
∗ Any good ideas?
∗ http://www.youtube.com/watch?v=1yo0zBFCMxo
∗ http://www.youtube.com/watch?v=_OBlgSz8sSM

YouTube Search API
∗ One unique property of the YouTube search API we find is that when searching
using a keyword string of the format “watch?v=xy...z” (including the quotes),
where “xy...z” is a prefix (of length L, 1 ≤ L ≤ 11) of a possible YouTube video id
that does not contain the literal “-”, YouTube returns a list
of videos whose ids begin with this prefix followed by “-”, if any exist.
∗ YouTube limits the number of returned results for any query.

∗ When the prefix is short (e.g., L = 1 or 2), the returned search results
are more likely to contain “noisy” video ids; a short prefix may also
match a large number of videos.
∗ In contrast, if the prefix is too long (e.g., L = 6 or 7), the search engine
may return no results.
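The query format and the filtering of returned ids can be sketched as below. This is only an illustration of the string handling: `build_query` and `matching_ids` are hypothetical helper names, the example ids are made up, and no call to the YouTube API is made.

```python
def build_query(prefix):
    """Return the quoted search string for a prefix, e.g. '"watch?v=1yo0"'."""
    if "-" in prefix:
        raise ValueError("the prefix must not contain '-'")
    return '"watch?v={}"'.format(prefix)

def matching_ids(prefix, returned_ids):
    """Keep only 11-character ids that start with the prefix followed by '-'."""
    wanted = prefix + "-"
    return [vid for vid in returned_ids
            if len(vid) == 11 and vid.startswith(wanted)]

# Made-up ids standing in for one page of search results.
results = ["1yo0-abcdef", "1yo0zBFCMxo", "x1yo0-abcde"]
print(build_query("1yo0"))
print(matching_ids("1yo0", results))
```

Filtering on the client side is a cheap way to discard "noisy" results before counting, whatever the search engine happens to return.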

Practice

∗ However, in practice, a prefix of length L < 5 usually matches
more than one hundred ids, and the YouTube API returns
at most 30 ids for each prefix query.
∗ On the other hand, based on our experimental results, a prefix
of length L = 5 always matches fewer than 10 valid ids.
∗ Therefore, a prefix length of 5 is a good choice in practice.

∗ They find that querying with a prefix of length four
returns ids having a “-” in the fifth place, which gives a result
set big enough that each prefix returns some results and
small enough never to reach the result limit set by the API.
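Drawing the random prefixes themselves might look like the sketch below. It assumes that video ids use the 64-symbol alphabet of letters, digits, "-" and "_", and that prefixes are drawn uniformly without the "-" character; `sample_prefixes` is a hypothetical helper, not part of any library.

```python
import random
import string

# Assumption: ids are built from these 64 symbols.
ID_ALPHABET = string.ascii_letters + string.digits + "-_"
# The prefix itself must not contain "-".
PREFIX_ALPHABET = ID_ALPHABET.replace("-", "")

def sample_prefixes(n, length=4, seed=None):
    """Draw n distinct random prefixes of the given length."""
    if n > len(PREFIX_ALPHABET) ** length:
        raise ValueError("more prefixes requested than exist")
    rng = random.Random(seed)
    prefixes = set()
    while len(prefixes) < n:
        prefixes.add("".join(rng.choice(PREFIX_ALPHABET)
                             for _ in range(length)))
    return sorted(prefixes)

prefixes = sample_prefixes(5, length=4, seed=42)
print(prefixes)
```

Each prefix would then be submitted as one quoted search query; fixing the seed makes a crawl repeatable.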

∗ Zhou et al. estimated that there were about 500 million
YouTube videos by 2011!

Zhou et al. (2011) Counting YouTube Videos via Random Prefix Sampling. IMC

Python and gdata

gdata Code
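∗ gdata is a module for connecting to Google data services
(including YouTube) via their APIs.

    import gdata.youtube.service

    def SearchAndPrint(search_terms):
        # Query the YouTube search feed for the given terms.
        yt_service = gdata.youtube.service.YouTubeService()
        query = gdata.youtube.service.YouTubeVideoQuery()
        query.vq = search_terms
        query.orderby = 'viewCount'
        query.racy = 'include'
        feed = yt_service.YouTubeQuery(query)
        # PrintVideoFeed is not defined here; it is assumed to be the
        # helper from the YouTube developer's guide for Python.
        PrintVideoFeed(feed)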

Test Validity

∗ http://www.youtube.com/watch?v=1yo0zBFCMxo
∗ The Secret State - The Biggest Mistake - Official Lyric Music Video
∗ searchApi("watch?v=1yo0z"): can't find the video!

Restricted query term

∗ searchApi('"watch?v=1yo0"')

Compare two random samples

∗ # summary(da$Freq)
∗ #  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
∗ #  1.00    7.00   25.00   17.15   25.00   75.00

∗ # summary(db$Freq)
∗ #  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
∗ #  1.00    8.00   25.00   17.57   25.00   50.00

About 604 million videos on
YouTube by Dec 2012!
∗ length(unique(subset(a[,1], b[,1]%in%a[,1]))) == 26
∗ 34361/x = 125/34361
∗ x = 34361^2/125, and x*64 ≈ 604507300
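The arithmetic above has the form of a capture-recapture (Lincoln-Petersen) estimate. The sketch below reproduces it under two assumptions that the slide does not spell out: that the two samples each contain 34361 ids and share 125 of them, and that the factor 64 scales the sampled part of the id space up to the whole. The function names are hypothetical.

```python
def lincoln_petersen(n1, n2, overlap):
    """Estimate population size from two samples and their overlap."""
    if overlap <= 0:
        raise ValueError("the samples must share at least one id")
    return n1 * n2 / overlap

def estimate_total(n1, n2, overlap, scale=64):
    """Scale the estimate for the sampled subspace to the full id space."""
    return scale * lincoln_petersen(n1, n2, overlap)

estimate = estimate_total(34361, 34361, 125)
print(round(estimate))  # 604507300
```

A smaller overlap between the two samples gives a larger estimate, so the result is sensitive to how the overlap is counted.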

Numeric simulation of random
prefix sampling
    # Use degreenet to simulate a discrete Pareto distribution
    # of the number of videos per user.
    library(degreenet)
    a <- simdp(n = 100000, v = 3.5, maxdeg = 10000)

    # Build one row per video: repeat each user by its video count.
    users <- data.frame(uid = seq_along(a), count = a)
    videos <- users[rep(seq_len(nrow(users)), users$count), ]
    videos$vid <- seq_len(nrow(videos))

    # Sample 2000 videos at random, then keep each detected user once.
    id <- sample(videos$vid, 2000, replace = FALSE)
    ds <- subset(videos, vid %in% id)
    dat <- subset(ds, !duplicated(uid))

    hist(dat$count)

    # Frequency of each video count: population (da) vs. sample (ds).
    da <- as.data.frame(table(a))
    ds <- as.data.frame(table(dat$count))

    plot(log(da[,2]) ~ log(as.numeric(as.character(da[,1]))),
         xlab = "Number of Videos (Log)", ylab = "Frequency (Log)")
    points(log(ds[,2]) ~ log(as.numeric(as.character(ds[,1]))),
           pch = 2, col = "red")
    # pch values match the plotted symbols (1 = population, 2 = sample).
    legend("topright", c("population", "sample"),
           col = c("black", "red"), cex = 0.9, pch = c(1, 2))
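A rough Python analogue of this simulation, using only the standard library, can also check whether inverse weighting removes the bias. It is a sketch under stated assumptions: video counts come from `random.paretovariate` truncated to integers (not degreenet's discrete Pareto), the shape 2.5 and the population size are arbitrary, and `simulate_bias` is a hypothetical name.

```python
import random

def simulate_bias(num_users=20000, sample_size=2000, seed=0):
    """Sample videos, detect their users, and compare mean video counts."""
    rng = random.Random(seed)
    # Heavy-tailed videos per user (assumed shape), at least 1 each.
    counts = [min(10000, int(rng.paretovariate(2.5)))
              for _ in range(num_users)]
    # One entry per video, holding the index of its uploader.
    video_owner = [uid for uid, c in enumerate(counts) for _ in range(c)]
    sampled = rng.sample(video_owner, sample_size)
    detected = sorted(set(sampled))  # each detected user once

    pop_mean = sum(counts) / num_users
    naive_mean = sum(counts[u] for u in detected) / len(detected)
    # Inverse weighting: weight each detected user by 1 / video count,
    # so the weighted mean is (number of users) / (sum of weights).
    weighted_mean = len(detected) / sum(1.0 / counts[u] for u in detected)
    return pop_mean, naive_mean, weighted_mean

pop_mean, naive_mean, weighted_mean = simulate_bias()
print(pop_mean, naive_mean, weighted_mean)
```

The unweighted mean over detected users overstates the population mean because heavy uploaders are detected more often; the 1/count weighting is only an approximation of the true inclusion probabilities, but it brings the estimate much closer.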

Reference

∗ Zhou et al. (2011) Counting YouTube Videos via Random Prefix Sampling. IMC
∗ Mislove et al. (2007) Measurement and Analysis of Online Social Networks. IMC
∗ YouTube developer's guide for Python
https://developers.google.com/youtube/1.0/developers_guide_python
∗ Introduction to the gdata.youtube library
http://gdata-pythonclient.googlecode.com/svn/trunk/pydocs/gdata.youtube.html#YouTubeVideoEntry
