A longitudinal analysis of search engine index size
Antal van den Bosch^, Toine Bogers*, Maurice de Kunder#
^ Radboud University, Nijmegen, the Netherlands
* Aalborg University Copenhagen, Denmark
# De Kunder Internet Media BV, Nijmegen, the Netherlands
ISSI 2015, Istanbul, Turkey
June 29 – July 3, 2015
Introduction
• Webometrics is the study of the content, structure and technologies
of the WWW (Almind & Ingwersen, 1997; Thelwall, 2009)
- Research topics include link structure, Web citation analysis, user
demographics, Web page credibility, search engines, and WWW size
• Size of the WWW is hard to measure!
- Only a subset is accessible through search engines and Web crawling (aka
the Surface Web)
‣ Deep web is the part of the WWW not indexed by search engines
- Most work has therefore focused on estimating search engine index size
Introduction
• Our work focuses on the estimation of index sizes of individual
search engines
• Why is this important?
- Index size used to be a competitive advantage for search engines
‣ Has slowly been superseded by recency and personalization
- Index size is an important aspect of the quality of a Web search engine
- Provides a ceiling estimate of the size of the WWW accessible to the
average Internet user
Contributions of this work
1. A novel method of estimating the size of a Web search engine’s
index
2. A longitudinal analysis of the size of Google and Bing’s indexes over
a nine-year period
Background
• Index size estimation
- Bharat & Broder (1998) estimated the size of the indexed WWW using
self-reported index sizes and overlap estimates → 200 million pages
- Gulli & Signorini (2005) extended their work → 11.5 billion pages
- Lawrence & Giles (1998) estimated the size using capture-recapture
methodology and self-reported index sizes → 320 million pages
- Lawrence & Giles (1999) updated their own work → 800 million pages
- Dobra & Fienberg (2004) updated the original 1998 estimates of Lawrence &
Giles (1998) → doubled to 788 million pages in 1998
Background
• Some related work on the stability of search engine results
- In terms of hit counts, rankings, and persistence of results
• Problem: no true longitudinal studies on hit counts or index size!
- Longest period for hit count variability studies was 3 months (Rousseau,
1999)
• Question: how stable are studies based on hit counts over time?
- We attempt to provide an answer by analyzing the results of a novel
estimation method over a nine-year period (March 2006 – January 2015)
Methodology
• Our method: estimation through extrapolation
- We extrapolate the unknown index size by using another textual training
corpus that is fully available to us
- We assume that for in-domain corpora the relative document
frequencies will be the same
- This results in the following formula:
\[ \text{index size} = |C| = \frac{df_{w,C} \times |T|}{df_{w,T}} \]
‣ $|C|$ = size of index
‣ $|T|$ = size of training corpus
‣ $df_{w,C}$ = hit count of w in the index
‣ $df_{w,T}$ = doc frequency of w in T
Methodology
• Selecting a training corpus
- Should be representative of Web search engine indexes
- Crawled a random selection of 531,624 Web pages from DMOZ
‣ 254,094,395 word tokens and 4,395,017 unique word types
• Estimation example for the term ‘are’:
- ‘are’ occurs in 50% of all DMOZ documents
- Google hit count is 17,540,000,000 pages
- Extrapolation: Google’s index contains 35 billion pages
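The example for ‘are’ can be reproduced with a few lines of Python. This is a minimal sketch of the extrapolation formula, not the authors' code: the function name is hypothetical, and the document count of 265,812 is an assumption obtained by taking the slide's "50%" as exactly half of the 531,624 DMOZ pages.

```python
def estimate_index_size(hit_count, df_in_corpus, corpus_size):
    """Extrapolate |C| = df_{w,C} * |T| / df_{w,T} for a single term w."""
    if df_in_corpus <= 0:
        raise ValueError("term must occur in the training corpus")
    return hit_count * corpus_size / df_in_corpus


# Figures from the slide; 265_812 is an assumed value (exactly 50% of the corpus).
corpus_size = 531_624          # |T|: DMOZ pages in the training corpus
df_are_corpus = 265_812        # df_{w,T}: DMOZ pages containing 'are' (assumed)
google_hits_are = 17_540_000_000  # df_{w,C}: reported Google hit count

estimate = estimate_index_size(google_hits_are, df_are_corpus, corpus_size)
print(f"{estimate / 1e9:.2f} billion pages")
```

Under that assumption the estimate is about 35 billion pages, matching the figure on the slide.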
Methodology
• Which terms should we use for the extrapolation?
- Single-word terms are preferred according to Uyar (2009)
- Random selection of word types will oversample low-frequency words, as
predicted by Zipf’s second law
- Terms should be sampled from across document frequency bands →
selected exponential series of selection rank with exponent 1.6, rounded
off to the nearest integer
- Set of words used should not be overly small → averaged
estimations over a set of 28 words (where predictions became stable)
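The exponential selection series can be sketched as follows. The slides do not say where the series starts or how repeated ranks are handled, so this sketch assumes the exponent runs from 0 upward over a word list sorted by descending document frequency, and skips a rank that repeats the previous one. The function name is hypothetical.

```python
def selection_ranks(count=28, exponent=1.6):
    """Return `count` frequency ranks following round(exponent ** i), i = 0, 1, 2, ...

    Assumption: ranks are 1-based positions in a word list sorted by
    descending document frequency; consecutive duplicates are skipped.
    """
    ranks = []
    i = 0
    while len(ranks) < count:
        rank = round(exponent ** i)
        if not ranks or rank != ranks[-1]:
            ranks.append(rank)
        i += 1
    return ranks


ranks = selection_ranks()
print(ranks[:7])
```

Because the ranks grow exponentially, the selected words span the document frequency bands from the most common function words down to rare word types.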
Methodology
• Final set of 28 selected words:
and was photo preliminary accordée
of can headlines definite reticular
to do william psychologists recitificació
for people basketball vielfalt
on very spread illini
are show nfl chèque
Methodology
• Validation
- Predictions on an out-of-sample DMOZ test corpus were only off by 1.3%
• Daily procedure
- Estimate index size for each of these 28 words
- Average all estimates into a single estimate
- Rinse and repeat
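The daily procedure can be sketched as below. This is an illustration under assumptions rather than the authors' implementation: the function name and the toy numbers are made up, and how the hit counts are fetched from a search engine is deliberately left out.

```python
from statistics import mean


def daily_index_estimate(hit_counts, corpus_df, corpus_size):
    """Average the per-word extrapolations into one index size estimate.

    hit_counts:  word -> hit count reported by the search engine today
    corpus_df:   word -> number of training corpus documents containing the word
    corpus_size: number of documents in the training corpus
    """
    estimates = [
        hit_counts[word] * corpus_size / corpus_df[word]
        for word in hit_counts
        if corpus_df.get(word, 0) > 0  # skip words absent from the corpus
    ]
    if not estimates:
        raise ValueError("no usable words for this day")
    return mean(estimates)


# Toy numbers for illustration only (not from the study).
toy_corpus_df = {"and": 800, "are": 500}
toy_hit_counts = {"and": 1600, "are": 1200}
toy_estimate = daily_index_estimate(toy_hit_counts, toy_corpus_df, corpus_size=1000)
print(toy_estimate)
```

Averaging over many words from different frequency bands dampens the effect of any single word whose hit count is unusually high or low on a given day.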
Methodology
• Collected data from two search engines between March 2006 and January 2015
- Google: 3,027 data points (93.6% of all possible days)
- Bing: 3,002 data points (92.8% of all possible days)
Google Bing (aka Live Search)
Results
• Google usually has the largest index
- Peak of 49.4 billion pages (December 2011)
- Bing has a peak of 23 billion pages (March 2014)
• Both search engines show great variability!
[Figure: Estimated number of Web pages in the Google and Bing indexes, 2007–2015. x-axis: year; y-axis: estimated number of Web pages (0 to roughly 5×10^10, i.e. about 50 billion). Each point is a moving average over 31 days.]
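The 31-day smoothing used in the plot can be sketched as a simple moving average. The slides do not state whether the window is trailing or centered, or how days without data are treated, so this sketch assumes a trailing window over the available data points and averages over fewer points at the start of the series.

```python
from collections import deque


def trailing_moving_average(values, window=31):
    """Smooth a series with a trailing moving average of up to `window` points."""
    if window < 1:
        raise ValueError("window must be at least 1")
    buffer = deque()
    running_total = 0.0
    smoothed = []
    for value in values:
        buffer.append(value)
        running_total += value
        if len(buffer) > window:
            running_total -= buffer.popleft()  # drop the oldest point
        smoothed.append(running_total / len(buffer))
    return smoothed


print(trailing_moving_average([1, 2, 3, 4], window=2))
```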
What causes this variability?
• Intrinsic variability
- However, the estimation method performs well on a representative in-domain sample
- Rel. doc. frequency is unlikely to radically change over short time periods!
• Extrinsic variability
- Changes in indexing and ranking infrastructure happen all the time
‣ Google makes “roughly 500 changes to our search algorithm in a typical year” (Cutts,
2011)
- Affects the hit count estimates and thus the index size estimates!
- Examined Google and Bing search engine blogs for reported changes
[Figure: The same plot of estimated Google and Bing index sizes, 2007–2015 (x-axis: year; y-axis: estimated number of Web pages), annotated with reported infrastructure changes. Google: Caffeine update, Panda 1.0 update, Panda 4.0 update. Bing: launch of Bing, launch of BingBot crawler, Catapult update.]
Discussion
• Estimation bias
- Distributed indexes result in hit count variability
‣ Different servers contain different shards in different states of up-to-dateness
- Modern search engines use Document-at-a-time (DAAT) processing
‣ Means they traverse the postings list of an index until they have found enough
matching documents, not until they’ve found all matching documents
‣ Overall hit counts are then estimated using statistical prediction methods
• Language
- English dominates the WWW (55%); DMOZ might be skewed toward English even more
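The early-termination behaviour described under estimation bias can be illustrated with a toy example. This is only a sketch under strong assumptions: real search engines do not publish how they predict hit counts, and the linear extrapolation from the scanned fraction of a postings list is a stand-in chosen for illustration. All names are hypothetical.

```python
def estimated_hit_count(postings_a, postings_b, enough):
    """Intersect two sorted postings lists, stopping after `enough` matches.

    If a list is exhausted, the count is exact. Otherwise the total is
    extrapolated from the fraction of `postings_a` scanned so far
    (an assumed, simplistic prediction method).
    """
    if enough < 1:
        raise ValueError("enough must be at least 1")
    i = j = found = 0
    while i < len(postings_a) and j < len(postings_b) and found < enough:
        if postings_a[i] == postings_b[j]:
            found += 1
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    if i == len(postings_a) or j == len(postings_b):
        return found  # traversed everything relevant: exact count
    scanned_fraction = i / len(postings_a)
    return round(found / scanned_fraction)


all_docs = list(range(100))          # term A occurs in documents 0..99
even_docs = list(range(0, 100, 2))   # term B occurs in the even documents

exact = estimated_hit_count(all_docs, even_docs, enough=1000)
early = estimated_hit_count(all_docs, even_docs, enough=10)
print(exact, early)
```

Even in this toy setting the early-terminated estimate differs from the exact count, which shows how reported hit counts can vary without any change in the underlying index.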
Discussion
• Cut-off bias
- Search engines are reported to index each page only up to a certain size
- If that size is equal to the average DMOZ page size, then our estimates
are great :)
• Quality bias
- DMOZ is a curated directory of ‘good’ websites
- May not be representative of the ‘average’ website
Conclusions
• Long-term longitudinal analysis of search engine index sizes
- Estimation using hit counts shows great variability over time!
• Much of the variability seems attributable to infrastructure changes
- 72% of infrastructure changes are reflected in estimate variation
- Be careful when using hit counts for one-off Webometric studies!
- Confirmation of work by Rousseau (1999), Bar-Ilan (1999), and Payne &
Thelwall (2008)
• Future work will focus on extending the analysis to other languages
Questions? Comments? Suggestions?
Thanks for your attention!
References
• Almind, T.C. & Ingwersen, P. (1997). Informetric Analyses on the World
Wide Web: Methodological Approaches to ‘Webometrics’. Journal of
Documentation, 53, pp. 404–426.
• Bar-Ilan, J. (1999). Search Engine Results over Time: A Case Study on
Search Engine Stability. Cybermetrics, 2, 1.
• Bharat, K. & Broder, A. (1998). A Technique for Measuring the Relative Size
and Overlap of Public Web Search Engines. In Proceedings of WWW ’98
(pp. 379–388). New York, NY, USA: ACM Press.
• Cutts, M. (2011). Ten Algorithm Changes on Inside Search, Google Official
Blog. Available at
http://googleblog.blogspot.com/2011/11/ten-algorithm-changes-on-inside-search.html,
last visited January 21, 2015.
• Dobra, A. & Fienberg, S.E. (2004). How Large is the World Wide Web? In
Web Dynamics (pp. 23–43). Berlin: Springer.
• Gulli, A. & Signorini, A. (2005). The Indexable Web is More than 11.5 Billion
Pages. In Proceedings of WWW ’05 (pp. 902–903). New York, NY, USA:
ACM Press.
• Lawrence, S. & Giles, C.L. (1998). Searching the World Wide Web. Science,
280, pp. 98–100.
• Lawrence, S. & Giles, C.L. (1999). Accessibility of Information on the Web.
Nature, 400, pp. 107–109.
• Payne, N. & Thelwall, M. (2008). Longitudinal Trends in Academic Web
Links. Journal of Information Science, 34, pp. 3–14.
• Rousseau, R. (1999). Daily Time Series of Common Single Word Searches in
AltaVista and NorthernLight. Cybermetrics, 2, 1.
• Thelwall, M. (2008). Quantitative Comparisons of Search Engine Results.
Journal of the American Society for Information Science and Technology,
59, pp. 1702–1710.
• Thelwall, M. (2009). Introduction to Webometrics: Quantitative Web
Research for the Social Sciences. Synthesis Lectures on Information
Concepts, Retrieval, and Services, 1, pp. 1–116.
• Thelwall, M. & Sud, P. (2012). Webometric Research with the Bing Search
API 2.0. Journal of Informetrics, 6, pp. 44–52.
26

More Related Content

Viewers also liked

Marketing Research Report Proposal [Elegant (VI)]
Marketing Research Report Proposal [Elegant (VI)]Marketing Research Report Proposal [Elegant (VI)]
Marketing Research Report Proposal [Elegant (VI)]Md. Abdur Rakib
 
PPC AdWords Report
PPC AdWords ReportPPC AdWords Report
PPC AdWords ReportReportGarden
 
Writing Smarter Applications with Machine Learning
Writing Smarter Applications with Machine LearningWriting Smarter Applications with Machine Learning
Writing Smarter Applications with Machine LearningAnoop Thomas Mathew
 
Site analysis presentation
Site analysis presentationSite analysis presentation
Site analysis presentationAh Jun
 
The Top Skills That Can Get You Hired in 2017
The Top Skills That Can Get You Hired in 2017The Top Skills That Can Get You Hired in 2017
The Top Skills That Can Get You Hired in 2017LinkedIn
 
State of the Word 2011
State of the Word 2011State of the Word 2011
State of the Word 2011photomatt
 

Viewers also liked (9)

Guida facile a Google Adwords
Guida facile a Google Adwords Guida facile a Google Adwords
Guida facile a Google Adwords
 
Marketing Research Report Proposal [Elegant (VI)]
Marketing Research Report Proposal [Elegant (VI)]Marketing Research Report Proposal [Elegant (VI)]
Marketing Research Report Proposal [Elegant (VI)]
 
PPC AdWords Report
PPC AdWords ReportPPC AdWords Report
PPC AdWords Report
 
Sam sung presentation
Sam sung presentationSam sung presentation
Sam sung presentation
 
Writing Smarter Applications with Machine Learning
Writing Smarter Applications with Machine LearningWriting Smarter Applications with Machine Learning
Writing Smarter Applications with Machine Learning
 
Site analysis presentation
Site analysis presentationSite analysis presentation
Site analysis presentation
 
9 handy Excel demos
9 handy Excel demos9 handy Excel demos
9 handy Excel demos
 
The Top Skills That Can Get You Hired in 2017
The Top Skills That Can Get You Hired in 2017The Top Skills That Can Get You Hired in 2017
The Top Skills That Can Get You Hired in 2017
 
State of the Word 2011
State of the Word 2011State of the Word 2011
State of the Word 2011
 

Similar to A Longitudinal Analysis of Search Engine Index Size

Search and Social Media Marketing Course Slides - Salford Universtiy
Search and Social Media Marketing Course Slides - Salford UniverstiySearch and Social Media Marketing Course Slides - Salford Universtiy
Search and Social Media Marketing Course Slides - Salford UniverstiyTom Mason
 
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...Lucidworks
 
SEO - What is it?
SEO - What is it?SEO - What is it?
SEO - What is it?Woj Kwasi
 
Data analytics and SEO to grow your international business
Data analytics and SEO to grow your international businessData analytics and SEO to grow your international business
Data analytics and SEO to grow your international businessEnterprise Ireland
 
Efficient Query Processing Infrastructures
Efficient Query Processing InfrastructuresEfficient Query Processing Infrastructures
Efficient Query Processing InfrastructuresCrai Macdonald
 
Sw mas-web-workshop
Sw mas-web-workshopSw mas-web-workshop
Sw mas-web-workshopAndrew Knutt
 
The 5-Day "Meal Plan" for SEO Success
The 5-Day "Meal Plan" for SEO SuccessThe 5-Day "Meal Plan" for SEO Success
The 5-Day "Meal Plan" for SEO SuccessKnoxville HUG
 
Webinar - SEO for Beginners: Simple Steps for Nonprofits and Libraries - 2016...
Webinar - SEO for Beginners: Simple Steps for Nonprofits and Libraries - 2016...Webinar - SEO for Beginners: Simple Steps for Nonprofits and Libraries - 2016...
Webinar - SEO for Beginners: Simple Steps for Nonprofits and Libraries - 2016...TechSoup
 
From Optimiser to Consultant: How to Remain Relevant as an SEO Practitioner
From Optimiser to Consultant: How to Remain Relevant as an SEO PractitionerFrom Optimiser to Consultant: How to Remain Relevant as an SEO Practitioner
From Optimiser to Consultant: How to Remain Relevant as an SEO PractitionerBoom Online Marketing
 
Croud Presents: How to Build a Data-driven SEO Strategy Using NLP
Croud Presents: How to Build a Data-driven SEO Strategy Using NLPCroud Presents: How to Build a Data-driven SEO Strategy Using NLP
Croud Presents: How to Build a Data-driven SEO Strategy Using NLPDaniel Liddle
 
TechSEO Boost 2018: Research Competition
TechSEO Boost 2018: Research CompetitionTechSEO Boost 2018: Research Competition
TechSEO Boost 2018: Research CompetitionCatalyst
 
[@IndeedEng] Large scale interactive analytics with Imhotep
[@IndeedEng] Large scale interactive analytics with Imhotep[@IndeedEng] Large scale interactive analytics with Imhotep
[@IndeedEng] Large scale interactive analytics with Imhotepindeedeng
 
Introduction To SEO (SEARCH ENGINE OPTIMIZATION)- Learning Catalyst
Introduction To SEO (SEARCH ENGINE OPTIMIZATION)- Learning CatalystIntroduction To SEO (SEARCH ENGINE OPTIMIZATION)- Learning Catalyst
Introduction To SEO (SEARCH ENGINE OPTIMIZATION)- Learning CatalystLearning-Catalyst
 
Using data to guide product development
Using data to guide product developmentUsing data to guide product development
Using data to guide product developmentMat Clayton
 
Seo and analytics wk 2
Seo and analytics wk 2Seo and analytics wk 2
Seo and analytics wk 2Toby Eborn
 
The beginners guide to SEO
The beginners guide to SEOThe beginners guide to SEO
The beginners guide to SEOThanh Nguyen
 
SEO presentation for marketing summit 2017
SEO presentation for marketing summit 2017SEO presentation for marketing summit 2017
SEO presentation for marketing summit 2017Scott True
 

Similar to A Longitudinal Analysis of Search Engine Index Size (20)

Search and Social Media Marketing Course Slides - Salford Universtiy
Search and Social Media Marketing Course Slides - Salford UniverstiySearch and Social Media Marketing Course Slides - Salford Universtiy
Search and Social Media Marketing Course Slides - Salford Universtiy
 
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
 
SEO - What is it?
SEO - What is it?SEO - What is it?
SEO - What is it?
 
How to be data savvy manager
How to be data savvy managerHow to be data savvy manager
How to be data savvy manager
 
Data analytics and SEO to grow your international business
Data analytics and SEO to grow your international businessData analytics and SEO to grow your international business
Data analytics and SEO to grow your international business
 
Efficient Query Processing Infrastructures
Efficient Query Processing InfrastructuresEfficient Query Processing Infrastructures
Efficient Query Processing Infrastructures
 
Sw mas-web-workshop
Sw mas-web-workshopSw mas-web-workshop
Sw mas-web-workshop
 
The 5-Day "Meal Plan" for SEO Success
The 5-Day "Meal Plan" for SEO SuccessThe 5-Day "Meal Plan" for SEO Success
The 5-Day "Meal Plan" for SEO Success
 
Webinar - SEO for Beginners: Simple Steps for Nonprofits and Libraries - 2016...
Webinar - SEO for Beginners: Simple Steps for Nonprofits and Libraries - 2016...Webinar - SEO for Beginners: Simple Steps for Nonprofits and Libraries - 2016...
Webinar - SEO for Beginners: Simple Steps for Nonprofits and Libraries - 2016...
 
From Optimiser to Consultant: How to Remain Relevant as an SEO Practitioner
From Optimiser to Consultant: How to Remain Relevant as an SEO PractitionerFrom Optimiser to Consultant: How to Remain Relevant as an SEO Practitioner
From Optimiser to Consultant: How to Remain Relevant as an SEO Practitioner
 
Croud Presents: How to Build a Data-driven SEO Strategy Using NLP
Croud Presents: How to Build a Data-driven SEO Strategy Using NLPCroud Presents: How to Build a Data-driven SEO Strategy Using NLP
Croud Presents: How to Build a Data-driven SEO Strategy Using NLP
 
Search Analytics - Comperio
Search Analytics - ComperioSearch Analytics - Comperio
Search Analytics - Comperio
 
TechSEO Boost 2018: Research Competition
TechSEO Boost 2018: Research CompetitionTechSEO Boost 2018: Research Competition
TechSEO Boost 2018: Research Competition
 
[@IndeedEng] Large scale interactive analytics with Imhotep
[@IndeedEng] Large scale interactive analytics with Imhotep[@IndeedEng] Large scale interactive analytics with Imhotep
[@IndeedEng] Large scale interactive analytics with Imhotep
 
Introduction To SEO (SEARCH ENGINE OPTIMIZATION)- Learning Catalyst
Introduction To SEO (SEARCH ENGINE OPTIMIZATION)- Learning CatalystIntroduction To SEO (SEARCH ENGINE OPTIMIZATION)- Learning Catalyst
Introduction To SEO (SEARCH ENGINE OPTIMIZATION)- Learning Catalyst
 
B2B SEO in 2020
B2B SEO in 2020B2B SEO in 2020
B2B SEO in 2020
 
Using data to guide product development
Using data to guide product developmentUsing data to guide product development
Using data to guide product development
 
Seo and analytics wk 2
Seo and analytics wk 2Seo and analytics wk 2
Seo and analytics wk 2
 
The beginners guide to SEO
The beginners guide to SEOThe beginners guide to SEO
The beginners guide to SEO
 
SEO presentation for marketing summit 2017
SEO presentation for marketing summit 2017SEO presentation for marketing summit 2017
SEO presentation for marketing summit 2017
 

More from Toine Bogers

"If I like BLANK, what else will I like?": Analyzing a Human Recommendation C...
"If I like BLANK, what else will I like?": Analyzing a Human Recommendation C..."If I like BLANK, what else will I like?": Analyzing a Human Recommendation C...
"If I like BLANK, what else will I like?": Analyzing a Human Recommendation C...Toine Bogers
 
Hands-free but not Eyes-free: A Usability Evaluation of Siri while Driving
Hands-free but not Eyes-free: A Usability Evaluation of Siri while DrivingHands-free but not Eyes-free: A Usability Evaluation of Siri while Driving
Hands-free but not Eyes-free: A Usability Evaluation of Siri while DrivingToine Bogers
 
“Looking for an Amazing Game I Can Relax and Sink Hours into...”: A Study of ...
“Looking for an Amazing Game I Can Relax and Sink Hours into...”: A Study of ...“Looking for an Amazing Game I Can Relax and Sink Hours into...”: A Study of ...
“Looking for an Amazing Game I Can Relax and Sink Hours into...”: A Study of ...Toine Bogers
 
A Study of Usage and Usability of Intelligent Personal Assistants in Denmark
A Study of Usage and Usability of Intelligent Personal Assistants in DenmarkA Study of Usage and Usability of Intelligent Personal Assistants in Denmark
A Study of Usage and Usability of Intelligent Personal Assistants in DenmarkToine Bogers
 
“What was this movie about this chick?”: A Comparative Study of Relevance Asp...
“What was this movie about this chick?”: A Comparative Study of Relevance Asp...“What was this movie about this chick?”: A Comparative Study of Relevance Asp...
“What was this movie about this chick?”: A Comparative Study of Relevance Asp...Toine Bogers
 
"I just scroll through my stuff until I find it or give up": A Contextual Inq...
"I just scroll through my stuff until I find it or give up": A Contextual Inq..."I just scroll through my stuff until I find it or give up": A Contextual Inq...
"I just scroll through my stuff until I find it or give up": A Contextual Inq...Toine Bogers
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingToine Bogers
 
Defining and Supporting Narrative-driven Recommendation
Defining and Supporting Narrative-driven RecommendationDefining and Supporting Narrative-driven Recommendation
Defining and Supporting Narrative-driven RecommendationToine Bogers
 
An In-depth Analysis of Tags and Controlled Metadata for Book Search
An In-depth Analysis of Tags and Controlled Metadata for Book SearchAn In-depth Analysis of Tags and Controlled Metadata for Book Search
An In-depth Analysis of Tags and Controlled Metadata for Book SearchToine Bogers
 
Personalized search
Personalized searchPersonalized search
Personalized searchToine Bogers
 
Tagging vs. Controlled Vocabulary: Which is More Helpful for Book Search?
Tagging vs. Controlled Vocabulary: Which is More Helpful for Book Search?Tagging vs. Controlled Vocabulary: Which is More Helpful for Book Search?
Tagging vs. Controlled Vocabulary: Which is More Helpful for Book Search?Toine Bogers
 
Measuring System Performance in Cultural Heritage Systems
Measuring System Performance in Cultural Heritage SystemsMeasuring System Performance in Cultural Heritage Systems
Measuring System Performance in Cultural Heritage SystemsToine Bogers
 
How 'Social' are Social News Sites? Exploring the Motivations for Using Reddi...
How 'Social' are Social News Sites? Exploring the Motivations for Using Reddi...How 'Social' are Social News Sites? Exploring the Motivations for Using Reddi...
How 'Social' are Social News Sites? Exploring the Motivations for Using Reddi...Toine Bogers
 
Search & Recommendation: Birds of a Feather?
Search & Recommendation: Birds of a Feather?Search & Recommendation: Birds of a Feather?
Search & Recommendation: Birds of a Feather?Toine Bogers
 
Micro-Serendipity: Meaningful Coincidences in Everyday Life Shared on Twitter
Micro-Serendipity: Meaningful Coincidences in Everyday Life Shared on TwitterMicro-Serendipity: Meaningful Coincidences in Everyday Life Shared on Twitter
Micro-Serendipity: Meaningful Coincidences in Everyday Life Shared on TwitterToine Bogers
 
Benchmarking Domain-specific Expert Search using Workshop Program Committees
Benchmarking Domain-specific Expert Search using Workshop Program CommitteesBenchmarking Domain-specific Expert Search using Workshop Program Committees
Benchmarking Domain-specific Expert Search using Workshop Program CommitteesToine Bogers
 

More from Toine Bogers (16)

"If I like BLANK, what else will I like?": Analyzing a Human Recommendation C...
"If I like BLANK, what else will I like?": Analyzing a Human Recommendation C..."If I like BLANK, what else will I like?": Analyzing a Human Recommendation C...
"If I like BLANK, what else will I like?": Analyzing a Human Recommendation C...
 
Hands-free but not Eyes-free: A Usability Evaluation of Siri while Driving
Hands-free but not Eyes-free: A Usability Evaluation of Siri while DrivingHands-free but not Eyes-free: A Usability Evaluation of Siri while Driving
Hands-free but not Eyes-free: A Usability Evaluation of Siri while Driving
 
“Looking for an Amazing Game I Can Relax and Sink Hours into...”: A Study of ...
“Looking for an Amazing Game I Can Relax and Sink Hours into...”: A Study of ...“Looking for an Amazing Game I Can Relax and Sink Hours into...”: A Study of ...
“Looking for an Amazing Game I Can Relax and Sink Hours into...”: A Study of ...
 
A Study of Usage and Usability of Intelligent Personal Assistants in Denmark
A Study of Usage and Usability of Intelligent Personal Assistants in DenmarkA Study of Usage and Usability of Intelligent Personal Assistants in Denmark
A Study of Usage and Usability of Intelligent Personal Assistants in Denmark
 
“What was this movie about this chick?”: A Comparative Study of Relevance Asp...
“What was this movie about this chick?”: A Comparative Study of Relevance Asp...“What was this movie about this chick?”: A Comparative Study of Relevance Asp...
“What was this movie about this chick?”: A Comparative Study of Relevance Asp...
 
"I just scroll through my stuff until I find it or give up": A Contextual Inq...
"I just scroll through my stuff until I find it or give up": A Contextual Inq..."I just scroll through my stuff until I find it or give up": A Contextual Inq...
"I just scroll through my stuff until I find it or give up": A Contextual Inq...
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Defining and Supporting Narrative-driven Recommendation
Defining and Supporting Narrative-driven RecommendationDefining and Supporting Narrative-driven Recommendation
Defining and Supporting Narrative-driven Recommendation
 
An In-depth Analysis of Tags and Controlled Metadata for Book Search
An In-depth Analysis of Tags and Controlled Metadata for Book SearchAn In-depth Analysis of Tags and Controlled Metadata for Book Search
An In-depth Analysis of Tags and Controlled Metadata for Book Search
 
Personalized search
Personalized searchPersonalized search
Personalized search
 
Tagging vs. Controlled Vocabulary: Which is More Helpful for Book Search?
Tagging vs. Controlled Vocabulary: Which is More Helpful for Book Search?Tagging vs. Controlled Vocabulary: Which is More Helpful for Book Search?
Tagging vs. Controlled Vocabulary: Which is More Helpful for Book Search?
 
Measuring System Performance in Cultural Heritage Systems
Measuring System Performance in Cultural Heritage SystemsMeasuring System Performance in Cultural Heritage Systems
Measuring System Performance in Cultural Heritage Systems
 
How 'Social' are Social News Sites? Exploring the Motivations for Using Reddi...
How 'Social' are Social News Sites? Exploring the Motivations for Using Reddi...How 'Social' are Social News Sites? Exploring the Motivations for Using Reddi...
How 'Social' are Social News Sites? Exploring the Motivations for Using Reddi...
 
Search & Recommendation: Birds of a Feather?
Search & Recommendation: Birds of a Feather?Search & Recommendation: Birds of a Feather?
Search & Recommendation: Birds of a Feather?
 
Micro-Serendipity: Meaningful Coincidences in Everyday Life Shared on Twitter
Micro-Serendipity: Meaningful Coincidences in Everyday Life Shared on TwitterMicro-Serendipity: Meaningful Coincidences in Everyday Life Shared on Twitter
Micro-Serendipity: Meaningful Coincidences in Everyday Life Shared on Twitter
 
Benchmarking Domain-specific Expert Search using Workshop Program Committees
Benchmarking Domain-specific Expert Search using Workshop Program CommitteesBenchmarking Domain-specific Expert Search using Workshop Program Committees
Benchmarking Domain-specific Expert Search using Workshop Program Committees
 

A Longitudinal Analysis of Search Engine Index Size

  • 1. A longitudinal analysis of search engine index size Antal van den Bosch^, Toine Bogers*, Maurice de Kunder# ^ Radboud University, Nijmegen, the Netherlands * Aalborg University Copenhagen, Denmark # De Kunder Internet Media BV, Nijmegen, the Netherlands ISSI 2015, Istanbul, Turkey June 29 – July 3, 2015
  • 2. Introduction • Webometrics is the study of the content, structure and technologies of the WWW (Almind & Ingwersen, 1997; Thelwall, 2009) - Research topics include link structure, Web citation analysis, user demographics, Web page credibility, search engines, and WWW size • Size of the WWW is hard to measure! - Only subset is accessible through search engines and Web crawling (aka the Surface web) ‣ Deep web is the part of the WWW not indexed by search engines - Most work has therefore focused on estimating search engine index size 2
  • 3. Introduction • Our work focuses on the estimation of index sizes of individual search engines • Why is this important? - Index size used to be a competitive advantage for search engines ‣ Slowly been superseded by recency and personalization - Index size is an important aspect of the quality of a Web search engine - Provides a ceiling estimate of the size of the WWW accessible to the average Internet user 3
  • 4. Contributions of this work 1. A novel method of estimating the size of a Web search engine’s index 2. A longitudinal analysis of the size of Google and Bing’s indexes over a nine-year period 4
  • 5. Background • Index size estimation - Bharat & Broder (1998) estimated the size of the indexed WWW using self-reported index sizes and overlap estimates → 200 million pages - Gulli & Signorini (2005) extended their work → 11.5 billion pages - Lawrence & Giles (1998) estimated the size using capture-recapture methodology and self-reported index sizes → 320 million pages - Lawrence & Giles (1999) updated their own work → 800 million pages - Dobra & Fienberg (2004) updated the original 1998 estimates of Lawrence & Giles (1998) → doubled to 788 million pages in 1998
  • 6. Background • Some related work on the stability of search engine results - In terms of hit counts, rankings, and persistence of results • Problem: no true longitudinal studies on hit counts or index size! - Longest period for hit count variability studies was 3 months (Rousseau, 1999) • Question: how stable are studies based on hit counts over time? - We attempt to provide an answer by analyzing the results of a novel estimation method over a nine-year period (March 2006 – January 2015)
  • 7. Methodology • Our method: estimation through extrapolation - We extrapolate the unknown index size by using another textual training corpus that is fully available to us - We assume that for in-domain corpora the relative document frequencies will be the same - Results in the following formula: $|C| = \frac{df_{w,C} \times |T|}{df_{w,T}}$, where $|C|$ = size of the index, $|T|$ = size of the training corpus, $df_{w,C}$ = hit count of $w$ in the search engine, and $df_{w,T}$ = document frequency of $w$ in $T$
  • 9. Methodology • Selecting a training corpus - Should be representative of Web search engine indexes - Crawled a random selection of 531,624 Web pages from DMOZ ‣ 254,094,395 word tokens and 4,395,017 unique word types • Estimation example for the term ‘are’: - ‘are’ occurs in 50% of all DMOZ documents - Google hit count is 17,540,000,000 pages - Extrapolation: Google’s index contains an estimated 35 billion pages (17.54 billion / 0.5)
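The ‘are’ example can be reproduced in a few lines of Python. This is a minimal sketch rather than the authors' code: the function name `estimate_index_size` is hypothetical, and the document frequency of ‘are’ is assumed to be exactly half of the 531,624 training pages, since the slides only give the rounded 50% figure.

```python
def estimate_index_size(hit_count, df_train, train_size):
    """Extrapolate |C| = df_{w,C} * |T| / df_{w,T} for a single term w."""
    if df_train <= 0:
        raise ValueError("the term must occur in the training corpus")
    return hit_count * train_size / df_train


TRAIN_SIZE = 531_624              # DMOZ pages in the training corpus (from the slides)
DF_ARE = TRAIN_SIZE // 2          # assumption: 'are' occurs in exactly 50% of the pages
GOOGLE_HITS_ARE = 17_540_000_000  # Google hit count for 'are' (from the slides)

estimate = estimate_index_size(GOOGLE_HITS_ARE, DF_ARE, TRAIN_SIZE)
print(f"{estimate / 1e9:.2f} billion pages")  # prints "35.08 billion pages"
```

The result rounds to the 35 billion pages quoted on the slide.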
  • 10. Methodology • Which terms should we use for the extrapolation? - Single-word terms are preferred according to Uyar (2009) - Random selection of word types will oversample low-frequency words, as predicted by Zipf’s second law - Terms should be sampled from across document frequency bands → selected an exponential series of selection ranks with exponent 1.6, rounded off to the nearest integer - The set of words used should not be overly small → averaged estimations over a set of 28 words (where predictions became stable)
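One possible reading of the “exponential series with exponent 1.6” is that the k-th selected word sits at frequency rank round(1.6^k). The sketch below follows that reading; the function name `selection_ranks` and the choice to start at k = 0 are assumptions, and the slides do not confirm that the authors' ranks were generated exactly this way.

```python
def selection_ranks(n_terms, base=1.6):
    """Return n_terms distinct frequency ranks round(base ** k) for k = 0, 1, 2, ..."""
    ranks = []
    k = 0
    while len(ranks) < n_terms:
        rank = round(base ** k)
        if rank not in ranks:  # skip duplicates caused by rounding
            ranks.append(rank)
        k += 1
    return ranks


ranks = selection_ranks(28)
print(ranks[:8])  # [1, 2, 3, 4, 7, 10, 17, 27]
```

Because consecutive ranks grow by a constant factor, the series samples densely among the most frequent words and increasingly sparsely further down the frequency list, which covers the document frequency bands with few terms.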
  • 11. Methodology • Final set of 28 selected words: and, was, photo, preliminary, accordée, of, can, headlines, definite, reticular, to, do, william, psychologists, recitificació, for, people, basketball, vielfalt, on, very, spread, illini, are, show, nfl, chèque
  • 12. Methodology • Validation - Predictions on an out-of-sample DMOZ test corpus were only off by 1.3% • Daily procedure - Estimate index size for each of these 28 words - Average all estimates into a single estimate - Rinse and repeat
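The daily procedure amounts to one extrapolation per word followed by an average. The sketch below assumes an arithmetic mean (the slides only say “average”); the function name `daily_estimate` and all numbers in the example are made up for illustration and are not measurements from the study.

```python
from statistics import mean


def daily_estimate(hit_counts, doc_freqs, train_size):
    """Average the per-word extrapolations into a single index size estimate."""
    per_word = [
        hit_counts[word] * train_size / doc_freqs[word]
        for word in hit_counts
        if doc_freqs.get(word, 0) > 0  # skip words missing from the training corpus
    ]
    if not per_word:
        raise ValueError("no usable words for the estimate")
    return mean(per_word)


# Toy numbers (hypothetical): a 1,000-page training corpus and three words.
toy_doc_freqs = {"and": 800, "photo": 100, "nfl": 10}
toy_hit_counts = {"and": 8_000_000, "photo": 1_100_000, "nfl": 90_000}

print(daily_estimate(toy_hit_counts, toy_doc_freqs, 1000))  # 10000000.0
```

Averaging over words from different frequency bands dampens the effect of any single word whose hit count is off on a given day.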
  • 13. Methodology • Collected data from two search engines from March 2006 to January 2015 - Google: 3,027 data points (93.6% of all possible days) - Bing (aka Live Search): 3,002 data points (92.8% of all possible days)
  • 14. Results • Google usually has the largest index - Peak of 49.4 billion pages (December 2011) - Bing has a peak of 23 billion pages (March 2014) • Both search engines show great variability!
  • 15. [Figure: estimated number of Web pages in the Google and Bing indexes per year, 2007–2015; y-axis from 0 to 55 billion pages. Each point is a moving average over 31 days.]
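The 31-day smoothing mentioned for the plot can be sketched as follows. Whether the authors used a centered or a trailing window, and how they treated days without data, is not stated; this sketch assumes a centered window and simply skips missing days (represented as `None`). The function name `moving_average` is hypothetical.

```python
def moving_average(values, window=31):
    """Centered moving average; None marks days without a data point."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        # Collect the available values within `half` days on either side of day i.
        chunk = [v for v in values[max(0, i - half): i + half + 1] if v is not None]
        smoothed.append(sum(chunk) / len(chunk) if chunk else None)
    return smoothed


print(moving_average([1, 2, 3, 4, 5], window=3))  # [1.5, 2.0, 3.0, 4.0, 4.5]
print(moving_average([10, None, 20], window=3))   # [10.0, 15.0, 20.0]
```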
  • 16. What causes this variability? • Intrinsic variability of the estimation method itself - However, the method performs well on a representative in-domain sample - Relative document frequency is unlikely to change radically over short time periods! • Extrinsic variability - Changes in indexing and ranking infrastructure happen all the time ‣ Google makes “roughly 500 changes to our search algorithm in a typical year” (Cutts, 2011) - These changes affect the hit count estimates and thus the index size estimates! - Examined Google and Bing search engine blogs for reported changes
  • 17. [Figure: estimated number of Web pages in the Google and Bing indexes per year, 2007–2015, annotated with reported infrastructure changes: Caffeine update, Panda 1.0 update, Panda 4.0 update, launch of Bing, launch of BingBot crawler, Catapult update.]
  • 18. [Figure: estimated number of Web pages in the Google and Bing indexes per year, 2007–2015, annotated with reported infrastructure changes: launch of Bing, Caffeine update, Panda 1.0 update, Panda 4.0 update, launch of BingBot crawler, Catapult update.]
  • 19. Discussion • Estimation bias - Distributed indexes result in hit count variability ‣ Different servers contain different shards in different states of up-to-dateness - Modern search engines use Document-at-a-time (DAAT) processing ‣ This means they traverse the postings list of an index until they have found enough matching documents, not until they have found all matching documents ‣ Overall hit counts are then estimated using statistical prediction methods • Language - English dominates the WWW (55%); DMOZ might suffer from this even more
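To see why early termination turns hit counts into estimates, consider the toy sketch below. It intersects two sorted postings lists document by document, stops after `k` matches, and extrapolates linearly from the fraction of the first list that was scanned. This is purely illustrative: the function name, the stopping rule, and the linear extrapolation are assumptions, and real search engines use prediction methods that are not described in the slides.

```python
def estimate_hits_with_early_stop(postings_a, postings_b, k):
    """Estimate |A AND B| from a scan that stops after k matches (k >= 1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    i = j = matches = 0
    while i < len(postings_a) and j < len(postings_b) and matches < k:
        if postings_a[i] == postings_b[j]:
            matches += 1
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    if i == len(postings_a) or j == len(postings_b):
        return matches  # a list was exhausted, so the count is exact
    scanned_fraction = i / len(postings_a)
    return round(matches / scanned_fraction)  # linear extrapolation (assumption)


evens = list(range(0, 1000, 2))   # toy postings list: even document IDs
threes = list(range(0, 1000, 3))  # toy postings list: multiples of three

print(estimate_hits_with_early_stop(evens, threes, k=10))    # 179 (estimate)
print(estimate_hits_with_early_stop(evens, threes, k=1000))  # 167 (exact count)
```

In this toy case the estimate after 10 matches is 179, while the true number of matching documents is 167, so the reported hit count depends on how far the scan went.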
  • 20. Discussion • Cut-off bias - Search engines are reported to index each page only up to a certain size - If that size is equal to the average DMOZ page size, then our estimates are great :) • Quality bias - DMOZ is a curated directory of ‘good’ websites - May not be representative of the ‘average’ website
  • 21. Conclusions • Long-term longitudinal analysis of search engine index sizes - Estimation using hit counts shows great variability over time! • Much of the variability seems attributable to infrastructure changes - 72% of infrastructure changes are reflected in estimate variation - Be careful when using hit counts for one-off Webometric studies! - Confirmation of work by Rousseau (1999), Bar-Ilan (1999), and Payne & Thelwall (2008) • Future work will focus on extending the analysis to other languages
  • 23. References • Almind, T.C. & Ingwersen, P. (1997). Informetric Analyses on the World Wide Web: Methodological Approaches to ‘Webometrics’. Journal of Documentation, 53, pp. 404–426. • Bar-Ilan, J. (1999). Search Engine Results over Time: A Case Study on Search Engine Stability. Cybermetrics, 2, 1. • Bharat, K. & Broder, A. (1998). A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines. In Proceedings of WWW ’98 (pp. 379–388). New York, NY, USA: ACM Press.
  • 24. References • Cutts, M. (2011). Ten Algorithm Changes on Inside Search, Google Official Blog. Available at http://googleblog.blogspot.com/2011/11/ten-algorithm-changes-on-inside-search.html, last visited January 21, 2015. • Dobra, A. & Fienberg, S.E. (2004). How Large is the World Wide Web? In Web Dynamics (pp. 23–43). Berlin: Springer. • Gulli, A. & Signorini, A. (2005). The Indexable Web is More than 11.5 Billion Pages. In Proceedings of WWW ’05 (pp. 902–903). New York, NY, USA: ACM Press.
  • 25. References • Lawrence, S. & Giles, C.L. (1998). Searching the World Wide Web. Science, 280, pp. 98–100. • Lawrence, S. & Giles, C.L. (1999). Accessibility of Information on the Web. Nature, 400, pp. 107–109. • Payne, N. & Thelwall, M. (2008). Longitudinal Trends in Academic Web Links. Journal of Information Science, 34, pp. 3–14. • Rousseau, R. (1999). Daily Time Series of Common Single Word Searches in AltaVista and NorthernLight. Cybermetrics, 2, 1.
  • 26. References • Thelwall, M. (2008). Quantitative Comparisons of Search Engine Results. Journal of the American Society for Information Science and Technology, 59, pp. 1702–1710. • Thelwall, M. (2009). Introduction to Webometrics: Quantitative Web Research for the Social Sciences. Synthesis Lectures on Information Concepts, Retrieval, and Services, 1, pp. 1–116. • Thelwall, M. & Sud, P. (2012). Webometric Research with the Bing Search API 2.0. Journal of Informetrics, 6, pp. 44–52.