Randomly sampling YouTube users

  1. Randomly Sampling YouTube Users: An Introduction to the Random Prefix Sampling Method
     Cheng-Jun Wang, Web Mining Lab, City University of Hong Kong, 2012-12-25
  2. YouTube growth curve
     ∗ http://singularityhub.com/2012/05/25/now-serving-the-latest-in-exponential-growth-youtube/
     ∗ https://gdata.youtube.com/feeds/api/standardfeeds/most_recent
  3. Contents
  4. Plan A: Sampling Users
     ∗ Unfortunately, YouTube’s user identifiers do not follow a standard format; they are user-specified strings. We were therefore unable to create a random sample of YouTube users.
     Mislove (2007) Measurement and Analysis of Online Social Networks. IMC
  5. Plan B: Sampling Videos
     ∗ Using the YouTube search API, Zhou et al. developed a random prefix sampling method and estimated that there were roughly 500 million YouTube videos as of May 2011.
     ∗ Sample the videos first, and then find the respective users.
     Zhou et al. (2011) Counting YouTube Videos via Random Prefix Sampling. IMC
  6. Get proportional users?
     ∗ Limitation: sampling videos introduces a selection bias towards users who upload more videos. To get a random sample of YouTube users, each sampled user must therefore be weighted against their number of videos (relative to the maximum value).
     ∗ Is it possible?
     [Figure: videos crawled and users detected]
  7. Sampled users with weight factors:

     User ID   Video Num   Weight Factor   Active Days
     1         10          1               20
     2         5           2               15
     3         1           10              1

     Weight cases (each user repeated by its weight factor):

     User ID   Video Num   Active Days
     1         10          20
     2         5           15
     2         5           15
     3         1           1
     3         1           1
     3         1           1
     3         1           1
     3         1           1
     3         1           1
     3         1           1
     3         1           1
     3         1           1
     3         1           1
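The weighting in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: it assumes the weight factor is the maximum video count divided by each user's video count (which reproduces the factors 1, 2 and 10 in the table), and the function names are hypothetical.

```python
# Toy data copied from the slide: (user_id, videos_uploaded, active_days).
sampled_users = [(1, 10, 20), (2, 5, 15), (3, 1, 1)]


def weight_factors(users):
    """Return {user_id: max_videos / videos}; an assumed reading of 'weight factor'."""
    max_videos = max(videos for _, videos, _ in users)
    return {uid: max_videos / videos for uid, videos, _ in users}


def weighted_mean_active_days(users):
    """Mean active days after weighting each user by its weight factor."""
    weights = weight_factors(users)
    total = sum(weights[uid] * days for uid, _, days in users)
    return total / sum(weights.values())


factors = weight_factors(sampled_users)
unweighted = sum(days for _, _, days in sampled_users) / len(sampled_users)
weighted = weighted_mean_active_days(sampled_users)
print(factors)
print(unweighted, weighted)
```

Weighting by a factor gives the same mean as physically repeating the rows, as the "weight cases" table does, without having to build the expanded table.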
  8. Strategy
     ∗ 60^10*16 = 9.674588e+18
     ∗ YouTube video ids are randomly generated from the id space
     ∗ The sampling space is far too large!
     ∗ Any good ideas?
     ∗ http://www.youtube.com/watch?v=1yo0zBFCMxo
     ∗ http://www.youtube.com/watch?v=_OBlgSz8sSM
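As a quick check of why guessing ids at random is hopeless, the sketch below recomputes the size of the id space. The first figure is the slide's own expression. The second assumes a 64-symbol alphabet (letters, digits, "-" and "_"), which the slide does not state. The 500 million video count is the Zhou et al. estimate quoted earlier in the deck.

```python
# Size of the id space as written on the slide.
slide_space = 60**10 * 16

# Assumption (not on the slide): ids are drawn from a 64-symbol alphabet,
# with 16 possible values for the last character.
base64_space = 64**10 * 16

# Zhou et al.'s estimate of the number of videos as of May 2011.
videos = 500_000_000

# Chance that one uniformly random id points at an existing video.
hit_rate_slide = videos / slide_space
hit_rate_base64 = videos / base64_space

print(f"{slide_space:.6e}", f"{base64_space:.6e}")
print(hit_rate_slide, hit_rate_base64)
```

Under either figure, far fewer than one in a billion random ids would hit a real video, which is the motivation for querying by prefix instead.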
  9. YouTube Search API
     ∗ One unique property of the YouTube search API is that, when searching with a keyword string of the format “watch?v=xy...z” (including the quotes), where “xy...z” is a prefix (of length L, 1 ≤ L ≤ 11) of a possible YouTube video id that does not contain the literal “-”, YouTube returns a list of videos whose ids begin with this prefix followed by “-”, if any exist.
     ∗ YouTube limits the number of returned results for any query.
     ∗ When the prefix is short (e.g., 1 or 2), the returned search results are more likely to contain “noisy” video ids; a short prefix may also match a large number of videos.
     ∗ In contrast, if the prefix is too long (e.g., 6 or 7), the search engine may return no results.
  10. Practice
     ∗ However, in practice, a prefix of length L < 5 usually matches more than one hundred results, and the YouTube API can return at most 30 ids for each prefix query.
     ∗ On the other hand, based on our experimental results, a prefix of length L = 5 always matches fewer than 10 valid ids.
     ∗ Therefore, a prefix length of 5 is a good choice in practice.
  11. ∗ They find that querying prefixes of length four returns ids that have a “-” in the fifth position. This gives a result set big enough that each prefix returns some results, and small enough that it never reaches the result limit set by the API.
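The prefix step described above can be sketched as follows. This is a hedged illustration rather than the author's code: the 64-symbol alphabet is an assumption, the function names are hypothetical, and the id "1yo0-abcdef" in the usage lines is made up for the example. No network call is made; the sketch only builds the query string and filters a list of returned ids.

```python
import random
import string

# Assumption: video ids use letters, digits, "-" and "_" (64 symbols).
ID_ALPHABET = string.ascii_letters + string.digits + "-_"
# Per the slides, the prefix itself must not contain "-".
PREFIX_ALPHABET = ID_ALPHABET.replace("-", "")


def random_prefix(length=4, rng=random):
    """Draw a random prefix of the given length, without "-"."""
    return "".join(rng.choice(PREFIX_ALPHABET) for _ in range(length))


def build_query(prefix):
    """Build the search string, including the surrounding double quotes."""
    return '"watch?v={}"'.format(prefix)


def keep_prefix_matches(prefix, video_ids):
    """Keep only 11-character ids that start with the prefix followed by "-"."""
    wanted = prefix + "-"
    return [vid for vid in video_ids if len(vid) == 11 and vid.startswith(wanted)]


prefix = random_prefix(4, random.Random(0))
query = build_query("1yo0")
# "1yo0-abcdef" is a made-up id used only to exercise the filter.
kept = keep_prefix_matches("1yo0", ["1yo0-abcdef", "1yo0zBFCMxo", "_OBlgSz8sSM"])
print(prefix, query, kept)
```

Filtering on the client side discards any returned ids that do not actually begin with the queried prefix followed by "-".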
  12. ∗ Zhou et al. found that there were about 500 million YouTube videos by 2011!
     Zhou et al. (2011) Counting YouTube Videos via Random Prefix Sampling. IMC
  13. Python and gdata
     ∗ gdata is a module for connecting to Google data (including YouTube) via API.
     ∗ Code:

         import gdata.youtube.service

         def SearchAndPrint(search_terms):
             yt_service = gdata.youtube.service.YouTubeService()
             query = gdata.youtube.service.YouTubeVideoQuery()
             query.vq = search_terms        # the search string
             query.orderby = 'viewCount'    # sort results by view count
             query.racy = 'include'         # also return restricted content
             feed = yt_service.YouTubeQuery(query)
             PrintVideoFeed(feed)           # helper not shown on the slide
  14. Test Validity
     ∗ http://www.youtube.com/watch?v=1yo0zBFCMxo
     ∗ The Secret State - The Biggest Mistake - Official Lyric Music Video
     ∗ searchApi("watch?v=1yo0z")
     Can’t find the video!
  15. Restricted query term
     ∗ searchApi("watch?v=1yo0")
  16. Compare two random samples

         # summary(da$Freq)
         #  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
         #  1.00    7.00   25.00   17.15   25.00   75.00

         # summary(db$Freq)
         #  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
         #  1.00    8.00   25.00   17.57   25.00   50.00
  17. There were 604 million videos on YouTube as of Dec. 2012!
     ∗ length(unique(subset(a[,1], b[,1]%in%a[,1]))) == 26
     ∗ 34361/x = 125/34361
     ∗ x = (34361^2/125)*64 == 604507300
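The arithmetic on this slide can be reproduced with a short Python sketch. The proportion 34361/x = 125/34361 has the form of a two-sample capture-recapture (Lincoln-Petersen) estimate; that label, the function name, and the reading of the factor 64 as a scale-up from the ids with "-" in the fifth position to the full 64-symbol alphabet are assumptions, not statements from the slide. The numbers 34361, 125 and 64 are taken from the slide as written.

```python
def lincoln_petersen(n_first, n_second, n_overlap):
    """Two-sample population estimate: n_first * n_second / n_overlap."""
    if n_overlap <= 0:
        raise ValueError("the two samples must share at least one item")
    return n_first * n_second / n_overlap


sample_size = 34361   # size of each sample, from the slide
overlap = 125         # ids common to both samples, from the slide
scale = 64            # scale-up factor, from the slide

sub_population = lincoln_petersen(sample_size, sample_size, overlap)
estimate = sub_population * scale
print(round(estimate))
```

The smaller the overlap between the two samples, the larger the estimated population, so the result is sensitive to the overlap count used.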
  18. Numeric simulation of random prefix sampling

         # use degreenet to simulate a discrete Pareto distribution
         library(degreenet)
         a <- simdp(n=100000, v=3.5, maxdeg=10000)   # videos per user
         b <- data.frame(cbind(c(1:length(a)), a))
         c <- b[rep(1:nrow(b), b$a), ]               # one row per video
         c$vid <- c(1:length(c$a))
         names(c) <- c("uid", "count", "vid")
         id <- sample(c(1:length(c$vid)), 2000, replace = F)   # sample 2000 videos
         ds <- subset(c, c$vid %in% id)
         dat <- subset(ds, !duplicated(ds$uid))      # keep each sampled user once
         hist(dat$count)
         da <- as.data.frame(table(a))
         ds <- as.data.frame(table(dat$count))
         plot(log(da[,2]) ~ log(as.numeric(as.character(da[,1]))),
              xlab = "Number of Videos (Log)", ylab = "Frequency (Log)")
         points(log(ds[,2]) ~ log(as.numeric(as.character(ds[,1]))), pch=2, col="red")
         # pch matches the symbols actually drawn (default 1 for plot, 2 for points)
         legend("topright", c("population", "sample"),
                col = c("black", "red"),
                cex=0.9, pch = c(1, 2))
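The same idea can be sketched in Python using only the standard library. This is not a port of the R code above: it swaps degreenet's discrete Pareto generator for a truncated `random.paretovariate` draw, and the parameters (20,000 users, shape 2.5, 2,000 sampled videos) and the function name are illustrative assumptions. It shows the selection bias named earlier in the deck: users reached through a random sample of videos upload more, on average, than users in the whole population.

```python
import random


def simulate(n_users=20000, n_videos_sampled=2000, alpha=2.5, seed=1):
    """Sample videos uniformly, then compare uploaders with the population."""
    rng = random.Random(seed)
    # Heavy-tailed videos-per-user counts (at least 1 per user).
    counts = [max(1, int(rng.paretovariate(alpha))) for _ in range(n_users)]
    # One entry per video, holding the index of its uploader.
    owners = [uid for uid, k in enumerate(counts) for _ in range(k)]
    # Uniform sample of videos, then the distinct users behind them.
    sampled_users = set(rng.sample(owners, n_videos_sampled))
    population_mean = sum(counts) / n_users
    sample_mean = sum(counts[u] for u in sampled_users) / len(sampled_users)
    return population_mean, sample_mean


population_mean, sample_mean = simulate()
print(population_mean, sample_mean)
```

The gap between the two means is what the weight factor is meant to correct for.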
  19. References
     ∗ Zhou et al. (2011) Counting YouTube Videos via Random Prefix Sampling. IMC
     ∗ Mislove (2007) Measurement and Analysis of Online Social Networks. IMC
     ∗ YouTube developer's guide for Python https://developers.google.com/youtube/1.0/developers_guide_python
     ∗ Introduction to the gdata.youtube library http://gdata-pythonclient.googlecode.com/svn/trunk/pydocs/gdata.youtube.html#YouTubeVideoEntry
  20. 2012-12-25
