Randomly Sampling YouTube Users:
 An Introduction to the Random Prefix
         Sampling Method




             Cheng-Jun Wang

              Web Mining Lab
        City University of Hong Kong
                  20121225
YouTube growth curve




http://singularityhub.com/2012/05/25/now-serving-the-latest-in-exponential-growth-youtube/



https://gdata.youtube.com/feeds/api/standardfeeds/most_recent
Contents
Plan A: Sampling Users

∗ Unfortunately, YouTube’s user identifiers do not follow a
  standard format; they are user-specified strings. Mislove et al.
  were therefore unable to create a random sample of YouTube
  users.




  Mislove et al. (2007) Measurement and Analysis of Online Social Networks. IMC
Plan B: Sampling Videos

∗ Using the YouTube search API, Zhou et al. developed a random
  prefix sampling method and estimated that there were roughly
  500 million YouTube videos by May 2011.
∗ Sample the videos first, and then find the respective users.




  Zhou et al. (2011) Counting YouTube Videos via Random Prefix Sampling. IMC
Get proportional users?

∗ Limitation: selection bias towards users who upload more
  videos. Therefore, each user must be weighted against their
  number of videos (relative to the max value) to get a random
  sample of YouTube users.
∗ Is it possible?



  [Figure: videos crawled and users detected]
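Users detected, with Weight Factor = (max Video Num) / (Video Num):

User ID   Video Num   Weight Factor   Active Days
1         10          1               20
2         5           2               15
3         1           10              1

After weighting cases (each user repeated Weight Factor times):

User ID   Video Num   Active Days
1         10          20
2         5           15
2         5           15
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1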
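
The weighting step in the table above can be sketched in a few lines of Python. This is only an illustration under the assumption that the weight factor is the maximum video count divided by the user's own video count, as the example numbers suggest; the variable names are hypothetical.

```python
# Users detected through sampled videos: (user_id, video_num, active_days),
# copied from the example table.
users = [(1, 10, 20), (2, 5, 15), (3, 1, 1)]

max_videos = max(video_num for _, video_num, _ in users)

# Weight factor = max video count / own video count, so users who are
# over-represented in a video sample count for less. Integer division is
# enough here because the example values divide evenly; real data would
# need fractional weights.
weights = {uid: max_videos // video_num for uid, video_num, _ in users}

# "Weight cases": repeat each user's row weight-factor times.
weighted_rows = [row for row in users for _ in range(weights[row[0]])]

# Any statistic is then computed on the weighted rows, e.g. mean active days.
weighted_mean_days = sum(days for _, _, days in weighted_rows) / len(weighted_rows)

print(weights)
print(len(weighted_rows), weighted_mean_days)
```

With the example values, user 3 (one video) is repeated ten times and user 1 (ten videos) once, which offsets the higher chance that heavy uploaders have of being picked up through their videos.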
Strategy




∗   60^10*16 = 9.674588e+18
∗   A YouTube video id is generated randomly from the id space
∗   The sampling space is far too large to sample ids directly!
∗   Any good ideas?
∗   http://www.youtube.com/watch?v=1yo0zBFCMxo
∗   http://www.youtube.com/watch?v=_OBlgSz8sSM
YouTube Search API
∗ One unique property of the YouTube search API, reported by Zhou et al., is
  that when searching with a keyword string of the format “watch?v=xy...z”
  (including the quotes), where “xy...z” is a prefix (of length L, 1 ≤ L ≤ 11)
  of a possible YouTube video id that does not contain the literal “-”,
  YouTube returns a list of videos whose ids begin with this prefix followed
  by “-”, if any exist.
∗ YouTube limits the number of returned results for any query.


∗ When the prefix is short (e.g., 1 or 2), the returned search results are
  more likely to contain “noisy” video ids; a short prefix may also match a
  large number of videos.
∗ In contrast, if the prefix is too long (e.g., 6 or 7), the search engine
  may return no results.
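
The query format described above can be sketched without touching the API. The snippet below is a minimal illustration, not the code from the talk: the 64-character id alphabet is an assumption, the function names are hypothetical, and the ids in the usage lines are made up.

```python
import random
import string

# Assumed id alphabet: letters, digits, "-" and "_" (64 characters).
ID_ALPHABET = string.ascii_letters + string.digits + "-_"


def random_prefix(length, rng):
    """Draw a random prefix of the given length that contains no '-'."""
    allowed = ID_ALPHABET.replace("-", "")
    return "".join(rng.choice(allowed) for _ in range(length))


def build_query(prefix):
    """Wrap the prefix in the quoted keyword format; the quotes are part of the query."""
    return '"watch?v={}"'.format(prefix)


def keep_matching_ids(prefix, returned_ids):
    """Keep ids that begin with the prefix followed by '-'; drop noisy results."""
    return [vid for vid in returned_ids if vid.startswith(prefix + "-")]


rng = random.Random(0)
prefix = random_prefix(4, rng)
print(build_query(prefix))

# Made-up ids, only to show the filtering step.
print(keep_matching_ids("1yo0", ["1yo0-abcdef", "1yo0zBFCMxo"]))
```

Filtering on the prefix plus "-" after the search returns is a cheap guard against noisy results, whatever client is used to send the query.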
Practice

∗ However, in practice, a prefix of length L < 5 usually matches
  more than one hundred results, and the YouTube API returns
  at most 30 ids for each prefix query.
∗ On the other hand, based on our experimental results, a prefix
  of length L = 5 always matches fewer than 10 valid ids.
∗ Therefore, a prefix length of 5 is a good choice in practice.
∗ They find that querying prefixes of length four returns ids
  with a “-” in the fifth place, which gives a result set big
  enough that each prefix returns some results and small enough
  to never reach the result limit set by the API.
∗ Zhou et al. estimated that there were about 500 million
  YouTube videos by 2011!




        Zhou et al. (2011) Counting YouTube Videos via Random Prefix Sampling. IMC
Python and gdata


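∗ gdata is a module for connecting to Google data services
  (including YouTube) via their APIs.

    import gdata.youtube.service

    def PrintVideoFeed(feed):
        # Minimal helper: print the title of each video in the feed
        for entry in feed.entry:
            print(entry.media.title.text)

    def SearchAndPrint(search_terms):
        # Query the YouTube search feed and print the matching videos
        yt_service = gdata.youtube.service.YouTubeService()
        query = gdata.youtube.service.YouTubeVideoQuery()
        query.vq = search_terms
        query.orderby = 'viewCount'
        query.racy = 'include'
        feed = yt_service.YouTubeQuery(query)
        PrintVideoFeed(feed)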
Test Validity

∗ http://www.youtube.com/watch?v=1yo0zBFCMxo
∗ The Secret State - The Biggest Mistake - Official Lyric Music
  Video
∗ searchApi("watch?v=1yo0z"): can't find the video!
Restricted query term

∗ searchApi('"watch?v=1yo0"')
Compare two random samples

∗   summary(da$Freq)
      Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1.00    7.00   25.00   17.15   25.00   75.00

∗   summary(db$Freq)
      Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1.00    8.00   25.00   17.57   25.00   50.00
There were about 604 million videos on
        YouTube by Dec 2012!
∗ length(unique(subset(a[,1], b[,1]%in%a[,1]))) == 26
∗ 34361/x = 125/34361
∗ x = (34361^2/125)*64 ≈ 604507300
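
The arithmetic above has the form of a capture-recapture (Lincoln-Petersen) estimate. The sketch below reproduces it under two assumptions that the slide does not state explicitly: both samples contain 34361 ids with 125 ids in common, and the factor 64 scales up from ids with "-" in the fifth position to the whole id space. The function name is hypothetical.

```python
def lincoln_petersen(n_first, n_second, n_overlap):
    """Estimate a population size from two samples and the size of their overlap."""
    if n_overlap <= 0:
        raise ValueError("the two samples must share at least one item")
    return n_first * n_second / n_overlap


# Numbers taken from the slide: two samples of 34361 ids, 125 ids in common.
sampled_subspace = lincoln_petersen(34361, 34361, 125)

# Assumption: only ids with "-" in the fifth position were sampled,
# i.e. 1 in 64 of all ids, so the estimate is scaled by 64.
total_videos = sampled_subspace * 64

print(round(total_videos))
```

The smaller the overlap between the two samples, the larger and the noisier the estimate, so the result is best read as an order of magnitude.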
Numeric simulation of random
                 prefix sampling
    # Use degreenet to simulate a discrete Pareto distribution of videos per user
    library(degreenet)
    a <- simdp(n = 100000, v = 3.5, maxdeg = 10000)

    # One row per user, then one row per video
    b <- data.frame(uid = seq_along(a), count = a)
    vids <- b[rep(seq_len(nrow(b)), b$count), ]
    vids$vid <- seq_len(nrow(vids))

    # Sample 2000 videos at random, then keep each uploader once
    id <- sample(vids$vid, 2000, replace = FALSE)
    ds <- subset(vids, vid %in% id)
    dat <- subset(ds, !duplicated(uid))

    hist(dat$count)

    # Frequency tables for the population and for the sampled users
    da <- as.data.frame(table(a))
    dsf <- as.data.frame(table(dat$count))

    # Log-log plot: population (black circles) vs. sample (red triangles)
    plot(log(da[,2]) ~ log(as.numeric(as.character(da[,1]))),
         xlab = "Number of Videos (Log)", ylab = "Frequency (Log)")
    points(log(dsf[,2]) ~ log(as.numeric(as.character(dsf[,1]))), pch = 2, col = "red")
    legend("topright", c("population", "sample"),
           col = c("black", "red"),
           cex = 0.9, pch = c(1, 2))
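
The same experiment can be sketched in Python with the standard library only. This is not a port of the R code: the population size, the Pareto shape parameter and the seed are arbitrary assumptions, and `random.paretovariate` stands in for degreenet's discrete Pareto generator.

```python
import random

rng = random.Random(42)

# Hypothetical population: heavy-tailed number of videos per user (always >= 1).
counts = [int(rng.paretovariate(2.5)) for _ in range(10000)]

# One entry per video, labelled with the index of its uploader.
video_owner = [uid for uid, k in enumerate(counts) for _ in range(k)]

# Sample videos uniformly at random, then collapse to distinct uploaders.
sampled_videos = rng.sample(video_owner, 2000)
sampled_users = set(sampled_videos)

population_mean = sum(counts) / len(counts)
sample_mean = sum(counts[u] for u in sampled_users) / len(sampled_users)

# Weight each sampled user by 1 / (number of videos) to offset the bias.
weights = {u: 1.0 / counts[u] for u in sampled_users}
weighted_mean = sum(w * counts[u] for u, w in weights.items()) / sum(weights.values())

print(population_mean, sample_mean, weighted_mean)
```

Users reached through their videos upload more on average than the population does, and the 1/count weighting pulls the estimate back down. The correction is only approximate, because a user's chance of being detected is not exactly proportional to their number of videos.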
References

∗ Zhou et al. (2011) Counting YouTube Videos via Random Prefix
  Sampling. IMC
∗ Mislove et al. (2007) Measurement and Analysis of Online Social
  Networks. IMC
∗ YouTube developer's guide for Python
  https://developers.google.com/youtube/1.0/developers_guide_python
∗ Introduction to the gdata.youtube library
  http://gdata-pythonclient.googlecode.com/svn/trunk/pydocs/gdata.youtube.html#YouTubeVideoEntry