1. A Longitudinal Analysis of Search Engine Index Size
Antal van den Bosch^, Toine Bogers*, Maurice de Kunder#
^ Radboud University, Nijmegen, the Netherlands
* Aalborg University Copenhagen, Denmark
# De Kunder Internet Media BV, Nijmegen, the Netherlands
ISSI 2015, Istanbul, Turkey
June 29 – July 3, 2015
2. Introduction
• Webometrics is the study of the content, structure and technologies
of the WWW (Almind & Ingwersen, 1997; Thelwall, 2009)
- Research topics include link structure, Web citation analysis, user
demographics, Web page credibility, search engines, and WWW size
• Size of the WWW is hard to measure!
- Only a subset is accessible through search engines and Web crawling (a.k.a. the Surface Web)
‣ Deep web is the part of the WWW not indexed by search engines
- Most work has therefore focused on estimating search engine index size
3. Introduction
• Our work focuses on the estimation of index sizes of individual
search engines
• Why is this important?
- Index size used to be a competitive advantage for search engines
‣ Has slowly been superseded by recency and personalization
- Index size is an important aspect of the quality of a Web search engine
- Provides a ceiling estimate of the size of the WWW accessible to the
average Internet user
4. Contributions of this work
1. A novel method of estimating the size of a Web search engine’s
index
2. A longitudinal analysis of the size of Google and Bing’s indexes over
a nine-year period
5. Background
• Index size estimation
- Bharat & Broder (1998) estimated the size of the indexed WWW using
self-reported index sizes and overlap estimates → 200 million pages
- Gulli & Signorini (2005) extended their work → 11.5 billion pages
- Lawrence & Giles (1998) estimated the size using capture-recapture
methodology and self-reported index sizes → 320 million pages
- Lawrence & Giles (1999) updated their own work → 800 million pages
- Dobra & Fienberg (2004) revised the original estimates of Lawrence & Giles (1998) upward → 788 million pages for 1998
6. Background
• Some related work on the stability of search engine results
- In terms of hit counts, rankings, and persistence of results
• Problem: no true longitudinal studies on hit counts or index size!
- The longest period covered by hit count variability studies was 3 months (Rousseau, 1999)
• Question: how stable are studies based on hit counts over time?
- We attempt to provide an answer by analyzing the results of a novel
estimation method over a nine-year period (March 2006 – January 2015)
7. Methodology
• Our method: estimation through extrapolation
- We extrapolate the unknown index size using a textual training corpus that is fully available to us
- We assume that for in-domain corpora the relative document frequencies of words will be the same
- This results in the following formula:
index size = $|C| = \dfrac{\mathit{df}_{w,C} \times |T|}{\mathit{df}_{w,T}}$

where:
- $|C|$ = size of the index
- $|T|$ = size of the training corpus
- $\mathit{df}_{w,C}$ = document frequency of $w$ in $C$, i.e. the hit count
- $\mathit{df}_{w,T}$ = document frequency of $w$ in $T$
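As a minimal sketch in Python (function and argument names are mine, not from the slides), the extrapolation is a single ratio:

```python
def estimate_index_size(hit_count, df_in_corpus, corpus_size):
    """Extrapolate the index size |C| from a single word w.

    Assumes w has the same relative document frequency in the
    index C as in the training corpus T:
        df_{w,C} / |C| = df_{w,T} / |T|
    hence |C| = df_{w,C} * |T| / df_{w,T}.
    """
    return hit_count * corpus_size / df_in_corpus
```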
9. Methodology
• Selecting a training corpus
- Should be representative of Web search engine indexes
- Crawled a random selection of 531,624 Web pages from DMOZ
‣ 254,094,395 word tokens and 4,395,017 unique word types
• Estimation example for the term ‘are’:
- ‘are’ occurs in 50% of all DMOZ documents
- The Google hit count for ‘are’ is 17,540,000,000 pages
- Extrapolation: 17.54 billion ÷ 0.5 ≈ 35 billion pages in Google’s index
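Plugging these numbers into the sketch above (‘are’ occurring in 50% of the 531,624 DMOZ pages gives a document frequency of 265,812; the rounded percentage makes the result approximate):

```python
estimate = estimate_index_size(
    hit_count=17_540_000_000,  # Google hit count for 'are'
    df_in_corpus=265_812,      # 50% of the 531,624 DMOZ documents
    corpus_size=531_624,
)
print(f"{estimate / 1e9:.2f} billion pages")  # -> 35.08 billion pages
```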
10. Methodology
• Which terms should we use for the extrapolation?
- Single-word terms are preferred according to Uyar (2009)
- Random selection of word types would oversample low-frequency words, as predicted by Zipf’s second law
- Terms should be sampled from across document frequency bands → selected an exponential series of selection ranks with exponent 1.6, rounded off to the nearest integer (sketched below)
- The set of words used should not be overly small → averaged estimations over a set of 28 words (the point where predictions became stable)
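A minimal sketch of that rank sampling, assuming ranks of the form round(1.6^k) over a vocabulary sorted by descending document frequency (the slides don’t specify the starting exponent, so k = 1 is an assumption):

```python
def sample_ranks(exponent=1.6, n_words=28):
    """Selection ranks round(exponent**k), deduplicated after rounding."""
    ranks, k = [], 1  # k = 1 is an assumption; the slides don't say
    while len(ranks) < n_words:
        r = round(exponent ** k)
        if r not in ranks:  # rounding can produce duplicates for small k
            ranks.append(r)
        k += 1
    return ranks

# sample_ranks() -> [2, 3, 4, 7, 10, 17, 27, 43, 69, 110, ...]
# words = [vocab_by_descending_df[r - 1] for r in sample_ranks()]
```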
11. Methodology
• Final set of 28 selected words:
and was photo preliminary accordée
of can headlines definite reticular
to do william psychologists recitificació
for people basketball vielfalt
on very spread illini
are show nfl chèque
12. Methodology
• Validation
- Predictions on an out-of-sample DMOZ test corpus were only off by 1.3%
• Daily procedure
- Estimate the index size for each of the 28 words
- Average all estimates into a single daily estimate (sketched below)
- Rinse and repeat
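The daily procedure then reduces to one loop (a sketch; get_hit_count is a hypothetical stand-in for querying the search engine, and df_counts maps each word to its document frequency in the training corpus):

```python
def daily_estimate(words, df_counts, corpus_size, get_hit_count):
    """Average the per-word extrapolations into one index-size estimate."""
    estimates = [
        estimate_index_size(get_hit_count(w), df_counts[w], corpus_size)
        for w in words
    ]
    return sum(estimates) / len(estimates)
```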
13. Methodology
• Collected data from two search engines between March 2006 and January 2015
- Google: 3,027 data points (93.6% of all possible days)
- Bing: 3,002 data points (92.8% of all possible days)
[Figure: Google and Bing (a.k.a. Live Search)]
14. Results
• Google usually has the largest index
- Peak of 49.4 billion pages (December 2011)
- Bing has a peak of 23 billion pages (March 2014)
• Both search engines show great variability!
16. What causes this variability?
• Intrinsic variability: is our estimation method itself to blame?
- The method performs well on a representative in-domain sample
- Relative document frequencies are unlikely to change radically over short time periods!
• Extrinsic variability
- Changes in indexing and ranking infrastructure happen all the time
‣ Google makes “roughly 500 changes to our search algorithm in a typical year” (Cutts,
2011)
- Affects the hit count estimates and thus the index size estimates!
- Examined Google and Bing search engine blogs for reported changes
19. Discussion
• Estimation bias
- Distributed indexes result in hit count variability
‣ Different servers contain different shards of the index in different states of freshness
- Modern search engines use document-at-a-time (DAAT) processing
‣ This means they traverse the postings lists only until they have found enough matching documents, not until they have found all of them
‣ Overall hit counts are then estimated using statistical prediction methods (see the sketch after this slide)
• Language
- English dominates the WWW (55% of pages); DMOZ may be skewed toward English even more strongly
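A toy illustration of why early termination yields estimated rather than exact hit counts (this is a sketch of the general idea, not any engine’s actual implementation): once enough matches are found, the total is extrapolated from the match density observed so far.

```python
def estimated_hit_count(candidate_docs, matches, early_stop=1000):
    """Toy early-termination hit counting.

    Scans candidate documents until `early_stop` matches are found,
    then extrapolates the total from the match density seen so far.
    """
    found = scanned = 0
    for doc_id in candidate_docs:
        scanned += 1
        if matches(doc_id):
            found += 1
            if found >= early_stop:
                # Assume the unscanned tail has the same match
                # density as the scanned prefix.
                return round(found / scanned * len(candidate_docs))
    return found  # scanned everything: the count is exact
```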
20. Discussion
• Cut-off bias
- Search engines are reported to stop indexing a page’s content beyond a certain size
- If that cut-off matches the average DMOZ page size, then our estimates are unaffected :)
• Quality bias
- DMOZ is a curated directory of ‘good’ websites
- May not be representative of the ‘average’ website
21. Conclusions
• Long-term longitudinal analysis of search engine index sizes
- Estimation using hit counts shows great variability over time!
• Much of the variability seems attributable to infrastructure changes
- 72% of infrastructure changes are reflected in estimate variation
- Be careful when using hit counts for one-off Webometric studies!
- Confirmation of work by Rousseau (1999), Bar-Ilan (1999), and Payne &
Thelwall (2008)
• Future work will focus on extending the analysis to other languages
23. References
• Almind, T.C. & Ingwersen, P. (1997). Informetric Analyses on the World
Wide Web: Methodological Approaches to ‘Webometrics’. Journal of
Documentation, 53, pp. 404–426.
• Bar-Ilan, J. (1999). Search Engine Results over Time: A Case Study on
Search Engine Stability. Cybermetrics, 2, 1.
• Bharat, K. & Broder, A. (1998). A Technique for Measuring the Relative Size
and Overlap of Public Web Search Engines. In Proceedings of WWW ’98
(pp. 379–388). New York, NY, USA: ACM Press.
24. References
• Cutts, M. (2011). Ten Algorithm Changes on Inside Search. Google Official Blog. Available at http://googleblog.blogspot.com/2011/11/ten-algorithm-changes-on-inside-search.html, last visited January 21, 2015.
• Dobra, A. & Fienberg, S.E. (2004). How Large is the World Wide Web? In Web Dynamics (pp. 23–43). Berlin: Springer.
• Gulli, A. & Signorini, A. (2005). The Indexable Web is More than 11.5 Billion
Pages. In Proceedings of WWW ’05 (pp. 902–903). New York, NY, USA:
ACM Press.
25. References
• Lawrence, S. & Giles, C.L. (1998). Searching the World Wide Web. Science,
280, pp. 98–100.
• Lawrence, S. & Giles, C.L. (1999). Accessibility of Information on the Web.
Nature, 400, pp. 107–109.
• Payne, N. & Thelwall, M. (2008). Longitudinal Trends in Academic Web
Links. Journal of Information Science, 34, pp. 3–14.
• Rousseau, R. (1999). Daily Time Series of Common Single Word Searches in
AltaVista and NorthernLight. Cybermetrics, 2, 1.
26. References
• Thelwall, M. (2008). Quantitative Comparisons of Search Engine Results.
Journal of the American Society for Information Science and Technology,
59, pp. 1702–1710.
• Thelwall, M. (2009). Introduction to Webometrics: Quantitative Web
Research for the Social Sciences. Synthesis Lectures on Information
Concepts, Retrieval, and Services, 1, pp. 1–116.
• Thelwall, M. & Sud, P. (2012). Webometric Research with the Bing Search
API 2.0. Journal of Informetrics, 6, pp. 44–52.