Document ranking using
web evidence
Trystan Garrett Upstill
A thesis submitted for the degree of
Doctor of Philosophy at
The Australian National University
August 2005
© Trystan Garrett Upstill
Typeset in Palatino by TeX and LaTeX2ε.
This thesis includes experiments published in:
• Upstill T., Craswell N., and Hawking D. “Buying Bestsellers Online: A Case
Study in Search and Searchability”, which appeared in the Proceedings of
ADCS2002, December 2002 [199].
• Upstill T., Craswell N., and Hawking D. “Query-independent evidence in home
page finding”, which appeared in the ACM TOIS volume 21:3, July 2003 [201].
• Craswell N., Hawking D., Thom J., Upstill T., Wilkinson R., and Wu M. “TREC12
Web Track at CSIRO”, which appeared in the TREC-12 Notebook Proceedings,
November 2003 [58].
• Upstill T., Craswell N., and Hawking D. “Predicting Fame and Fortune: Page-
Rank or Indegree?”, which appeared in the Proceedings of ADCS2003, Decem-
ber 2003 [200].
• Upstill T., and Robertson S. “Exploiting Hyperlink Recommendation Evidence
in Navigational Web Search”, which appeared in the Proceedings of SIGIR’04,
August 2004 [202].
• Hawking D., Upstill T., and Craswell N. “Towards Better Weighting of An-
chors”, which appeared in the Proceedings of SIGIR’04, August 2004 [120].
Chapter 9 contains results submitted as “csiro” runs in TREC 2003. The Topic Distilla-
tion runs submitted to TREC 2003 were generated in collaboration with Nick Craswell
and David Hawking. The framework used to tune parameters in Chapter 9 was de-
veloped by Nick Craswell. The first-cut ranking algorithm presented in Chapter 9 was
formulated by David Hawking for use in the Panoptic search system.
Except where indicated above, this thesis is my own original work.
Trystan Garrett Upstill
13 August 2005
Abstract
Evidence based on web graph structure is reportedly used by the current generation
of World-Wide Web (WWW) search engines to identify “high-quality”, “important”
pages and to reject “spam” content. However, despite the apparently wide use of this
evidence, its application in web-based document retrieval is controversial. Confusion
exists as to how to incorporate web evidence in document ranking, and whether such
evidence is in fact useful.
This thesis demonstrates how web evidence can be used to improve retrieval effec-
tiveness for navigational search tasks. Fundamental questions investigated include:
which forms of web evidence are useful, how web evidence should be combined with
other document evidence, and what biases are present in web evidence. Through
investigating these questions, this thesis presents a number of findings regarding
how web evidence may be effectively used in a general-purpose web-based document
ranking algorithm.
The results of experimentation with well-known forms of web evidence on several
small-to-medium collections of web data are surprising. Aggregate anchor-text mea-
sures perform well, but well-studied hyperlink recommendation algorithms are far
less useful. Further gains in retrieval effectiveness are achieved for anchor-text mea-
sures by revising traditional full-text ranking methods to favour aggregate anchor-text
documents containing large volumes of anchor-text. For home page finding tasks ad-
ditional gains are achieved by including a simple URL depth measure which favours
short URLs over long ones.
The most effective combination of evidence treats document-level and web-based
evidence as separate document components, and combines their scores linearly. It is
submitted that the document-level evidence contains the author’s description of
document contents, and that the web-based evidence gives the wider web community’s
view of the document. Consequently, if both measures agree, and the document is
scored highly in both cases, this is a strong indication that the page is what it
claims to be. A linear combination of the two types of evidence is found to be partic-
ularly effective, achieving the highest retrieval effectiveness of any query-dependent
evidence on navigational and Topic Distillation tasks.
However, care should be taken when using hyperlink-based evidence as a direct
measure of document quality. Thesis experiments show the existence of bias towards
the home pages of large, popular and technology-oriented companies. Further empir-
ical evidence is presented to demonstrate how the authorship of web documents and
sites directly affects the quantity and quality of available web evidence. These factors
demonstrate the need for robust methods for mining and interpreting data from the
web graph.
Contents
Abstract v
1 Introduction 3
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 5
2.1 A web search system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 The document gatherer . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 The indexer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 The query processor . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.4 The results presentation interface . . . . . . . . . . . . . . . . . . . 7
2.2 Ranking in web search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Document-level evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Text-based document evidence . . . . . . . . . . . . . . . . . . . . 9
2.3.1.1 Boolean matching . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1.2 Vector space model . . . . . . . . . . . . . . . . . . . . . 10
2.3.1.3 Probabilistic ranking . . . . . . . . . . . . . . . . . . . . 12
2.3.1.4 Statistical language model ranking . . . . . . . . . . . . 14
2.3.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.3 Other evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.3.1 Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.3.2 URL information . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3.3 Document structure and tag information . . . . . . . . . 19
2.3.3.4 Quality metrics . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.3.5 Units of retrieval . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Web-based evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.1 Anchor-text evidence . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4.2 Bibliometric measures . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.2.1 Bibliographic methods applied to a web . . . . . . . . . 27
2.4.3 Hyperlink recommendation . . . . . . . . . . . . . . . . . . . . . . 28
2.4.3.1 Link counting / in-degree . . . . . . . . . . . . . . . . . 28
2.4.3.2 PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4.3.3 Topic-specific PageRank . . . . . . . . . . . . . . . . . . 30
2.4.4 Other hyperlink analysis methods . . . . . . . . . . . . . . . . . . 30
2.4.4.1 HITS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5 Combining document evidence . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5.1 Score/rank fusion methods . . . . . . . . . . . . . . . . . . . . . . 33
2.5.1.1 Linear combination of scores . . . . . . . . . . . . . . . . 34
2.5.1.2 Re-ranking . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5.1.3 Meta-search fusion techniques . . . . . . . . . . . . . . . 34
2.5.1.4 Rank aggregation . . . . . . . . . . . . . . . . . . . . . . 35
2.5.1.5 Using minimum query-independent evidence thresh-
olds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5.2 Revising retrieval models to address combination of evidence . . 35
2.5.2.1 Field-weighted Okapi BM25 . . . . . . . . . . . . . . . . 36
2.5.2.2 Language mixture models . . . . . . . . . . . . . . . . . 37
2.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.6.1 Web information needs and search taxonomy . . . . . . . . . . . . 38
2.6.2 Navigational search tasks . . . . . . . . . . . . . . . . . . . . . . . 39
2.6.2.1 Home page finding . . . . . . . . . . . . . . . . . . . . . 39
2.6.2.2 Named page finding . . . . . . . . . . . . . . . . . . . . 39
2.6.3 Informational search tasks . . . . . . . . . . . . . . . . . . . . . . . 40
2.6.3.1 Topic Distillation . . . . . . . . . . . . . . . . . . . . . . . 40
2.6.4 Transactional search tasks . . . . . . . . . . . . . . . . . . . . . . . 40
2.6.5 Evaluation strategies / judging relevance . . . . . . . . . . . . . . 40
2.6.5.1 Human relevance judging . . . . . . . . . . . . . . . . . 40
2.6.5.2 Implicit human judgements . . . . . . . . . . . . . . . . 42
2.6.5.3 Judgements based on authoritative links . . . . . . . . . 42
2.6.6 Evaluation measures . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.6.6.1 Precision and recall . . . . . . . . . . . . . . . . . . . . . 42
2.6.6.2 Mean Reciprocal Rank and success rates . . . . . . . . . 44
2.6.7 The Text REtrieval Conference . . . . . . . . . . . . . . . . . . . . 44
2.6.7.1 TREC corpora used in this thesis . . . . . . . . . . . . . 45
2.6.7.2 TREC web track evaluations . . . . . . . . . . . . . . . . 45
3 Hyperlink methods - implementation issues 49
3.1 Building the web graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.1.1 URL address resolution . . . . . . . . . . . . . . . . . . . . . . . . 50
3.1.2 Duplicate documents . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.1.3 Hyperlink redirects . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.1.4 Dynamic content . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.1.5 Links created for reasons other than recommendation . . . . . . . 54
3.2 Extracting hyperlink evidence from WWW search engines . . . . . . . . 55
3.3 Implementing PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.1 Dangling links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.2 Bookmark vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3.3 PageRank convergence . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3.4 PageRank applied to small-to-medium webs . . . . . . . . . . . . 59
3.4 Expected correlation of hyperlink recommendation measures . . . . . . 59
4 Web search and site searchability 61
4.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1.1 Query selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1.2 Search engine selection . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.3 Bookstore selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.4 Submitting queries and collecting results . . . . . . . . . . . . . . 65
4.1.5 Judging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 Comparing bookstores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3 Comparing search engines . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.1 Search engine bookstore coverage . . . . . . . . . . . . . . . . . . 67
4.4 Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.1 Bookstore searchability: coverage . . . . . . . . . . . . . . . . . . 70
4.4.2 Bookstore searchability: matching/ranking performance . . . . . 73
4.4.3 Search engine retrieval effectiveness . . . . . . . . . . . . . . . . . 73
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5 Analysis of hyperlink recommendation evidence 77
5.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1.1 Sourcing candidate pages . . . . . . . . . . . . . . . . . . . . . . . 78
5.1.2 Company attributes . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.1.3 Extracting hyperlink recommendation scores . . . . . . . . . . . . 79
5.2 Hyperlink recommendation bias . . . . . . . . . . . . . . . . . . . . . . . 81
5.2.1 Home page preference . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2.2 Hyperlink recommendation as a page quality recommendation . 82
5.2.2.1 Large, famous company preference . . . . . . . . . . . . 82
5.2.2.2 Country and technology preference . . . . . . . . . . . . 82
5.3 Correlation between hyperlink recommendation measures . . . . . . . . 87
5.3.1 For company home pages . . . . . . . . . . . . . . . . . . . . . . . 87
5.3.2 For spam pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4.1 Home page bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4.2 Other systematic biases . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4.3 PageRank or in-degree? . . . . . . . . . . . . . . . . . . . . . . . . 91
6 Combining query-independent web evidence with query-dependent evidence 93
6.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.1.1 Query and document set . . . . . . . . . . . . . . . . . . . . . . . . 94
6.1.2 Query-dependent baselines . . . . . . . . . . . . . . . . . . . . . . 94
6.1.3 Extracting PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.1.4 Combining query-dependent baselines with query-independent
web evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.1 Baseline performance . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.2 Using a threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2.3 Re-ranking using PageRank . . . . . . . . . . . . . . . . . . . . . . 97
6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7 Home page finding using query-independent web evidence 101
7.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.1.1 Query-independent evidence . . . . . . . . . . . . . . . . . . . . . 102
7.1.2 Query-dependent baselines . . . . . . . . . . . . . . . . . . . . . . 102
7.1.3 Test collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.1.4 Combining query-dependent baselines with query-independent
evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.2 Minimum threshold experiments . . . . . . . . . . . . . . . . . . . . . . . 106
7.2.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.2.2 Training cutoffs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.3 Optimal combination experiments . . . . . . . . . . . . . . . . . . . . . . 112
7.3.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.4 Score-based re-ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4.1 Setting score cutoffs . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.5 Interpretation of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.5.1 What query-independent evidence should be used in re-ranking? 123
7.5.2 Which query-dependent baseline should be used? . . . . . . . . . 125
7.6 Further experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.6.1 Rank and score distributions . . . . . . . . . . . . . . . . . . . . . 127
7.6.2 Can the four-tier URL-type classification be improved? . . . . . . 127
7.6.3 PageRank and in-degree correlation . . . . . . . . . . . . . . . . . 131
7.6.4 Use of external link information . . . . . . . . . . . . . . . . . . . 132
7.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
8 Anchor-text in web search 135
8.1 Document statistics in anchor-text . . . . . . . . . . . . . . . . . . . . . . 135
8.1.1 Term frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
8.1.2 Inverse document frequency . . . . . . . . . . . . . . . . . . . . . 136
8.1.3 Document length normalisation . . . . . . . . . . . . . . . . . . . 138
8.1.3.1 Removing aggregate anchor-text length normalisation . 140
8.1.3.2 Anchor-text length normalisation by other document
fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
8.2 Combining anchor-text with other document evidence . . . . . . . . . . 143
8.2.1 Linear combination . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.2.2 Field-weighted Okapi BM25 . . . . . . . . . . . . . . . . . . . . . 143
8.2.3 Fusion of linear combination and field-weighted evidence . . . . 144
8.2.4 Snippet-based anchor-text scoring . . . . . . . . . . . . . . . . . . 144
8.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
8.3.1 Anchor-text baseline effectiveness . . . . . . . . . . . . . . . . . . 145
8.3.2 Anchor-text and full-text document evidence . . . . . . . . . . . . 146
8.3.2.1 Field-weighted Okapi BM25 combination . . . . . . . . 147
8.3.2.2 Linear combination . . . . . . . . . . . . . . . . . . . . . 148
8.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
9 A first-cut document ranking function using web evidence 151
9.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
9.1.1 Evaluating performance . . . . . . . . . . . . . . . . . . . . . . . . 151
9.1.2 Document evidence . . . . . . . . . . . . . . . . . . . . . . . . . . 152
9.1.2.1 Full-text evidence . . . . . . . . . . . . . . . . . . . . . . 152
9.1.2.2 Title evidence . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.1.2.3 URL length . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.1.3 Web evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.1.3.1 Anchor-text . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.1.3.2 In-degree . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.1.4 Combining document evidence . . . . . . . . . . . . . . . . . . . . 154
9.1.5 Test sets and tuning . . . . . . . . . . . . . . . . . . . . . . . . . . 155
9.1.6 Addressing the combined HP/NP task . . . . . . . . . . . . . . . 156
9.2 Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
9.2.1 Combining HP and NP runs for the combined task . . . . . . . . 160
9.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
9.3.1 TREC 2003 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
9.3.1.1 Topic Distillation 2003 (TD2003) results . . . . . . . . . . 160
9.3.1.2 Combined HP/NP 2003 (HP/NP2003) results . . . . . . 162
9.3.2 Evaluating the ranking function on further corporate web col-
lections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
9.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
10 Discussion 167
10.1 Web search system applicability . . . . . . . . . . . . . . . . . . . . . . . . 167
10.2 Which tasks should be modelled and evaluated in web search experi-
ments? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
10.3 Building a more efficient ranking system . . . . . . . . . . . . . . . . . . . 169
10.4 Tuning on a per corpus basis . . . . . . . . . . . . . . . . . . . . . . . . . . 170
11 Summary and conclusions 173
11.1 Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
11.2 Document ranking recommendations . . . . . . . . . . . . . . . . . . . . 176
11.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
A Glossary 179
B The canonicalisation of URLs 183
C Bookstore search and searchability: case study data 185
C.1 Book categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
C.2 Web search engine querying . . . . . . . . . . . . . . . . . . . . . . . . . . 185
C.3 Correct book answers in bookstore case study . . . . . . . . . . . . . . . 187
D TREC participation in 2002 195
D.1 Topic Distillation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
D.2 Named page finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
E Analysis of hyperlink recommendation evidence additional results 199
F Okapi BM25 distributions 203
G Query sets 205
G.1 .GOV home page set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Bibliography 213
List of Tables
2.1 Proximity of the term “Yahoo” to links to http://www.yahoo.com/ 24
4.1 Search engine properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Bookstores included in the evaluation . . . . . . . . . . . . . . . . . . . . 64
4.3 Bookstore comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4 Search engine success rates . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.5 Search engine precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.6 Search engine document coverage . . . . . . . . . . . . . . . . . . . . . . 69
4.7 Search engine link coverage . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1 Values extracted from Google . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 PageRanks by industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3 Extreme cases where PageRank and in-degree scores disagree. . . . . . . 88
7.1 Test collection information . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.2 Using query-independent thresholds on the ANU collection . . . . . . . 107
7.3 Using query-independent thresholds on the WT10gC collection . . . . . 109
7.4 Using query-independent thresholds on the WT10gT collection. . . . . . 111
7.5 Optimal re-ranking results for content . . . . . . . . . . . . . . . . . . . . 113
7.6 Optimal re-ranking results for anchor-text . . . . . . . . . . . . . . . . . . 114
7.7 Optimal re-ranking results for content+anchor-text . . . . . . . . . . . . . 115
7.8 Significant differences between methods when using Optimal re-rankings . 116
7.9 Summary of Optimal re-ranking results . . . . . . . . . . . . . . . . . . . 117
7.10 Score-based re-ranking results for content . . . . . . . . . . . . . . . . . . 120
7.11 Score-based re-ranking results for anchor-text . . . . . . . . . . . . . . . . 121
7.12 Score-based re-ranking results for content+anchor-text . . . . . . . . . . 122
7.13 Numerical summary of re-ranking improvements . . . . . . . . . . . . . 123
7.14 S@5 for URL-type category combinations, length and directory depth . . 131
7.15 Correlation of PageRank variants with in-degree . . . . . . . . . . . . . . 132
7.16 Using VLC2 links in WT10g . . . . . . . . . . . . . . . . . . . . . . . . . . 133
8.1 Summary of idf variants used in ranking functions under examination . 138
8.2 Summary of document length normalisation variants in ranking func-
tions under examination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
8.3 Summary of snippet-based document ranking algorithms under exam-
ination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
8.4 Okapi BM25 aggregate anchor-text scores and ranks for length normal-
isation variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
8.5 Effectiveness of Okapi BM25 aggregate anchor-text length normalisa-
tion techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
8.6 Length normalisation in Field-weighted Okapi BM25 . . . . . . . . . . . 147
8.7 Effectiveness of anchor-text snippet-based ranking functions . . . . . . . 148
8.8 Effectiveness of the evaluated combination methods for TD2003 . . . . . 149
8.9 Effectiveness of the evaluated combination methods for NP2002 and
NP&HP2003 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
9.1 Tuned parameters and retrieval effectiveness . . . . . . . . . . . . . . . . 159
9.2 Results for combined HP/NP runs on the training set . . . . . . . . . . . 160
9.3 Topic Distillation submission summary . . . . . . . . . . . . . . . . . . . 161
9.4 Combined home page/named page finding task submission summary . 162
9.5 Ranking function retrieval effectiveness on the public corporate webs
of several large Australian organisations . . . . . . . . . . . . . . . . . . . 164
C.1 Correct book answers in bookstore case study . . . . . . . . . . . . . . . 194
D.1 Official results for submissions to the 2002 TREC web track Topic Dis-
tillation task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
D.2 Official results for submissions to the 2002 TREC web track named page
finding task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
G.1 .GOV home page finding training set . . . . . . . . . . . . . . . . . . . . . 211
List of Figures
2.1 A sample network of relationships . . . . . . . . . . . . . . . . . . . . . . 25
3.1 Effect of PageRank d value (random jump probability) on success rate
for Democratic PageRank calculations for the WT10gC test collection . . 57
3.2 Effect of PageRank d value (random jump probability) on success rate
for Aristocratic PageRank calculations for the WT10gC test collection . . 58
3.3 Effect of PageRank d value on the rate of Democratic PageRank conver-
gence on WT10g, by number of iterations . . . . . . . . . . . . . . . . . . 58
5.1 Combined PageRank distribution for the non-home page document set . 79
5.2 Toolbar PageRank distributions within sites . . . . . . . . . . . . . . . . . 83
5.3 Bias in hyperlink recommendation evidence towards large, admired
and popular companies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4 Bias in hyperlink recommendation evidence towards technology-oriented
or US companies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.5 Toolbar PageRank versus in-degree for company home pages. . . . . . . 88
5.6 Toolbar PageRank versus in-degree for links to a spam company. . . . . 89
6.1 The percentage of home pages and non-home pages that exceed each
Google PageRank value. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2 Quota-based re-ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3 Score-based re-ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.4 Example of two queries using different re-ranking techniques . . . . . . 99
7.1 Example of an Optimal re-ranking and calculation of random control
success rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.2 Setting score-based re-ranking cutoffs for the content and anchor-text
baselines using the WT10gC collection . . . . . . . . . . . . . . . . . . . . 118
7.3 Setting score-based re-ranking cutoffs for the content+anchor-text base-
line using the WT10gC collection . . . . . . . . . . . . . . . . . . . . . . . 119
7.4 Baseline success rates across different cutoffs . . . . . . . . . . . . . . . . 126
7.5 Baseline rankings of the correct answers for WT10gC . . . . . . . . . . . 128
7.6 PageRank distributions for WT10gC . . . . . . . . . . . . . . . . . . . . . 129
7.7 In-degree and URL-type distributions for WT10gC . . . . . . . . . . . . . 130
8.1 Document scores achieved by BM25 using several values of k1 with
increasing tf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8.2 Aggregate anchor-text term distribution for the USGS home page . . . . 139
8.3 Aggregate anchor-text term distribution for a USGS info page . . . . . . 139
8.4 The effect of document length normalisation on BM25 scores for a sin-
gle term query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
9.1 Document scores achieved by AF1 and BM25 for values of tf . . . . . . 154
9.2 A plot illustrating the concurrent exploration of Okapi BM25 k1 and b
values using the hill-climbing function . . . . . . . . . . . . . . . . . . . . 157
9.3 A full iteration of the hill-climbing function . . . . . . . . . . . . . . . . . 158
E.1 Google Toolbar PageRank distributions within sites (Additional to those
in Chapter 5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
E.2 Google Toolbar PageRank distributions within sites (Additional to those
in Chapter 5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
F.1 Distribution of normalised Okapi BM25 scores for document full-text . 204
F.2 Distribution of normalised Okapi BM25 scores for aggregate anchor-text 204
“In an extreme view, the world can be seen as only connections, nothing
else. We think of a dictionary as the repository of meaning, but it defines
words only in terms of other words. I liked the idea that a piece of infor-
mation is really defined only by what it’s related to, and how it’s related.
There really is little else to meaning. The structure is everything. There
are billions of neurons in our brains, but what are neurons? Just cells. The
brain has no knowledge until connections are made between neurons. All
that we know, all that we are, comes from the way our neurons are con-
nected.”
— Tim Berners-Lee [20]
Chapter 1
Introduction
Document retrieval on the World-Wide Web (WWW), arguably the world’s largest col-
lection of documents, is a challenging and important task. The scale of the WWW is
immense, consisting of at least ten billion publicly visible web documents1 distributed
on millions of servers world-wide. Web authors follow few formal protocols, often re-
main anonymous and publish in a wide variety of formats. There is no central registry
or repository of the WWW’s contents and documents are often in a constant state of
flux. The WWW is also an environment where documents often misrepresent their
content as some web authors seek to unbalance ranking algorithms in their favour for
personal gain [122]. To compound these factors, WWW search engine users typically
provide short queries (averaging around two terms [184]) and expect a sub-second
response time from the system. Given these significant challenges, there is potentially
much to be learnt from the search systems which manage to retrieve relevant docu-
ments in such an environment.
The current generation of WWW search engines reportedly makes extensive use
of evidence derived from the structure of the WWW to better match relevant doc-
uments and identify potentially authoritative pages [31]. However, despite this re-
ported use, to date there has been little analysis which supports the inclusion of web
evidence in document ranking, or which examines precisely what its effect on search
results might be. The success of document ranking in the current generation of WWW
search engines is attributed to a number of web analysis techniques. How these tech-
niques are used and incorporated remains a trade secret. It also remains unclear as
to whether such techniques can be employed to improve retrieval effectiveness in
smaller, corporate-sized web collections.
This thesis investigates how web evidence can be used to improve retrieval ef-
fectiveness for navigational search tasks. Three important forms of web evidence
are considered: anchor-text, hyperlink recommendation measures (PageRank vari-
ants and in-degree), and URL hierarchy-based measures. These forms of web evi-
dence are believed to be used by prominent WWW search engines [31]. Other forms
of web evidence reviewed, but not examined, include HITS [132], HTML document
structure [42] and page segmentation [37], information unit measures [196], and click-
through evidence [129].
1. This is necessarily a crude estimate of the WWW’s static size. See Section 2.4 for details.
To exploit web evidence effectively in a document ranking algorithm, several ques-
tions must be addressed:
• Which forms of web evidence are useful?
• How should web evidence be combined with other document evidence?
• What biases are inherent in web evidence?
Through addressing these and other related problems, this thesis demonstrates
how web evidence may be used effectively in a general-purpose web-based document
ranking algorithm.
1.1 Overview
Chapters 2 and 3 review background literature and implementation issues. Chap-
ter 2 surveys the web search domain, and presents an overview of document and
web evidence often used in web-based document ranking, methods for combining
this evidence, and a review of strategies for evaluating the effectiveness of ranking
algorithms. To justify the formulations of hyperlink evidence used, and to ensure ex-
periments can be reproduced, Chapter 3 describes methods used to process the web
graph and implement recommendation evidence.
Chapters 4 to 8 present a series of detailed experiments. Chapter 4 reports results
from an investigation of how the searchability of web sites affects hyperlink evidence,
and thereby retrieval effectiveness in WWW search engines. Chapter 5 presents a
set of experiments that analyse the extent to which hyperlink evidence is correlated
with “real-world” measures of authority or quality. It includes an analysis of how
the use of web evidence may bias search results, and whether hyperlink recommen-
dation evidence is useful in identifying site entry points. Chapters 6 and 7 follow
with an evaluation of retrieval effectiveness improvements afforded by hyperlink ev-
idence. Chapter 6 investigates how query-independent evidence might be combined
with query-dependent baselines. Chapter 7 investigates the home page finding task
on small-to-medium web collections. Chapter 8 presents a set of experiments that in-
vestigates further possibilities for improving the effectiveness of measures based on
anchor-text evidence.
The experiments culminate in a proposal for, and evaluation of, a ranking function
that incorporates evidence explored in this thesis. The effectiveness of this ranking
function is evaluated through submissions to the TREC 2003 web track, presented in
Chapter 9. Chapters 10 and 11 present and discuss findings, draw conclusions and
outline future research directions. A glossary is included as Appendix A.
Chapter 2
Background
To provide a foundation and context for thesis experiments, this chapter outlines the
web-based document ranking domain. The chapter includes:
• An overview of a generic web search system, outlining the role of document
ranking in web search;
• A detailed analysis of document and web-level evidence commonly used for
document ranking in research and (believed to be used in) commercial web
search systems;
• An exploration of methods for combining evidence into a single ranking func-
tion; and
• A review of common user web search tasks and methods used to evaluate the
effectiveness of document ranking for such tasks.
Where applicable, reference is made throughout this chapter to the related scope
of the thesis and the rationale for experiments undertaken.
2.1 A web search system
A web search engine typically consists of a document gatherer (usually a crawler), a
document indexer, a query processor and a results presentation interface [31]. The
document gatherer and document indexer need only be run when the underlying set
of web documents has changed (which is likely to be continuous on the WWW, but
perhaps intermittent for other web corpora).
How each element of a generic web search system is understood in the context of this
thesis is discussed below.
2.1.1 The document gatherer
Web-based documents are normally1 gathered using a crawler [123]. Crawlers traverse
a web graph by recursively following hyperlinks, storing each document encountered,
and parsing stored documents for URLs to crawl. Crawlers typically maintain a fron-
tier, the queue of pages which remain to be downloaded. The frontier may be a FIFO2
queue, or sorted by some other attribute, such as perceived authority or frequency of
change [46]. Crawlers also typically maintain a list of all downloaded or detected du-
plicate pages (so pages are not fetched more than once), and a scope of pages to crawl
(for example, a maximum depth, specified domain, or timeout value), both of which
are checked prior to adding pages to the frontier. The crawler frontier is initialised
with a set of seed pages from which the crawl starts (these are specified manually).
Crawling ceases when the frontier is empty, or some time or resource limit is reached.
Once crawling is complete,3 the downloaded documents are indexed.
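To make the crawl loop concrete, the following Python sketch implements the frontier-based traversal described above. It is a minimal illustration only, not a description of any particular crawler: the fetch, extract_links and in_scope functions are assumed to be supplied by the caller, the frontier is a plain FIFO queue, and duplicate detection is by exact URL.
\begin{verbatim}
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, fetch, extract_links, in_scope, limit=10000):
    """Minimal breadth-first crawler sketch (illustrative assumptions only).

    fetch(url) -> page content or None, extract_links(url, content) -> hrefs,
    and in_scope(url) -> bool are supplied by the caller.
    """
    frontier = deque(seed_urls)      # FIFO queue of pages still to be downloaded
    seen = set(seed_urls)            # pages already downloaded or queued
    store = []                       # downloaded documents

    while frontier and len(store) < limit:
        url = frontier.popleft()
        content = fetch(url)
        if content is None:          # fetch failed, disallowed or timed out
            continue
        store.append((url, content))
        for href in extract_links(url, content):
            absolute = urljoin(url, href)        # resolve relative links
            if absolute not in seen and in_scope(absolute):
                seen.add(absolute)
                frontier.append(absolute)
    return store                     # documents are then passed to the indexer
\end{verbatim}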
2.1.2 The indexer
The indexer distils information contained within corpus documents into a format
which is amenable to quick access by the query processor. Typically this involves
extracting document features by breaking documents down into their constituent terms,
extracting statistics relating to term presence within the documents and corpus, and
calculating any query-independent evidence.4 After the index is built, the system is
ready to process queries.
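As a deliberately simplified illustration of this step, the following sketch builds an in-memory inverted index recording, for each term, the documents it occurs in and the within-document term frequency, together with document lengths for later ranking. The tokenisation rule and the tiny stop list are assumptions made for illustration; the postings and doc_lengths structures it produces are reused by the ranking sketches later in this chapter.
\begin{verbatim}
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "and", "a"}     # illustrative stop list only

def tokenise(text):
    """Lower-case, split on non-alphanumerics and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]

def build_index(docs):
    """docs maps doc_id -> text.
    Returns (postings, doc_lengths) where postings[term][doc_id] = tf."""
    postings = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in docs.items():
        terms = tokenise(text)
        doc_lengths[doc_id] = len(terms)     # length in terms, not bytes
        for term, tf in Counter(terms).items():
            postings[term][doc_id] = tf
    return postings, doc_lengths
\end{verbatim}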
2.1.3 The query processor
The query processor serves user queries by matching and ranking documents from the
index according to user input. As the query processor interacts directly with the doc-
ument index created by the indexer, they are often considered in tandem.
This thesis is concerned with a non-iterative retrieval process, i.e. one without
query refinement or relevance feedback [169, 174, 175, 177]. This is the level of in-
teraction supported by current popular WWW search systems and many web search
systems, most of which incorporate little relevance feedback beyond “find more like
this” [93] or lists containing suggested supplementary query terms [217].
Although particularly important in WWW search systems, this thesis is not pri-
marily concerned with the efficiency of query processing. A comprehensive overview
of efficient document query processing and indexing methods is provided in [214].
1. In some cases alternative document accessing methods may be available, for example if the documents being indexed are stored locally.
2. A queue ordered such that the first item in is the first item out.
3. If crawling is continuous, and an incremental index structure is used, documents might be indexed continuously.
4. Query-independent evidence is evidence that does not depend on the user query. For efficiency reasons such evidence is generally collected and calculated during the document indexing phase (prior to query processing).
2.1.4 The results presentation interface
The results presentation interface displays and links to the documents matched by the
query processor in response to the user query. Current popular WWW and web search
systems present a linear list of ranked results, sometimes with the degree of match
and/or summaries and abstracts for the matching documents. This type of interface
is modelled in experiments within this thesis.
2.2 Ranking in web search
The principal component of the query processor is the document ranking function.
The ranking functions of modern search systems frequently incorporate many forms
of document evidence [31]. Some of this evidence, such as textual information, is
collected locally for each document in the corpus (described in Section 2.3). Other
evidence, such as external document descriptions or recommendations, is amassed
through an examination of the context of a document within the web graph (described
in Section 2.4).
2.3 Document-level evidence
Text-based ranking algorithms typically assign scores to documents based on the dis-
tribution of query terms within both the document and the corpus. Therefore the
choice of what should constitute a term is an important concern. While terms are
often simply defined as document words (treated individually) [170] they may also
take further forms. For example, terms may consist of the canonical string compo-
nents of words (stems) [163], include (n-)tuples of words [214], consist of a word and
associated synonyms [128], or may include a combination of some or many of these
properties.
Unless otherwise noted, the ranking functions examined within this thesis use
single words as terms. In some experiments ranking functions instead make use of
canonical word stems, conflated using the Porter stemmer [163], as terms. These and
alternative term representations are discussed below.
The conflation of terms may increase the overlap between documents and queries,
finding term matches which may otherwise have been missed. For example, if the
query term “cat” is processed and a document in the corpus mentions “cats”, it is
likely that the document will be relevant to the user’s request. Stemming methods
are frequently employed to reduce words to their canonical forms and thereby allow
such matches. An empirically validated method for reducing terms to their canon-
ical forms is the Porter stemmer [163]. The Porter stemmer has been demonstrated
to perform as well as other suffix-stemming algorithms and to perform comparably
to other significantly more expensive, linguistic-based stemming algorithms [126].5
5. These algorithms are expensive with regard to training and computational cost.
The Porter stemmer removes suffixes, for example “shipping” and “shipped” would
become “ship”. In this way suffix-stemming attempts to remove pluralisation from
terms and to generalise words [126],6 sometimes leading to an improvement in re-
trieval system recall [134]. However, reducing the exactness of term matches can
result in the retrieval of less relevant documents [84, 101], thereby reducing search
precision.7 Furthermore, if a retrieved document does not contain any occurrences of
a query term, as all term matches are stems, it may be difficult for a user to understand
why that document was retrieved [108].
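For illustration, the Porter stemmer implementation distributed with NLTK reproduces the conflation behaviour described above; the choice of library is incidental and is not implied by the thesis.
\begin{verbatim}
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["shipping", "shipped", "ships", "cat", "cats"]:
    print(word, "->", stemmer.stem(word))

# "shipping", "shipped" and "ships" all reduce to "ship", and "cats" to "cat",
# so a query for one surface form matches documents containing the others
# once both queries and documents are stemmed.
\end{verbatim}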
In many popular ranking functions documents are considered to be “bags-of-
words” [162, 170, 176], where term occurrence is assumed to be independent and
unordered. For example, given a term such as “Computer” there is no prior probabil-
ity of encountering the word “Science” afterwards. Accordingly no extra evidence
is recorded if the words “Computer Science” are encountered together in a docu-
ment rather than separately. While there is arguably more meaning conveyed through
higher order terms (terms containing multiple words) than in single-word term mod-
els, there is little empirical evidence to support the use of higher-order terms [128].
Even when using manually created word association thesauri, retrieval effectiveness
has not been observed to be significantly improved [128]. “Bags-of-words” algorithms
are also generally less expensive when indexing and querying English language doc-
uments8 [214].
Terms may have multiple meanings (polysemy) and many concepts are repre-
sented by multiple words (synonyms). Several methods attempt to explore relation-
ships between terms to compress the document and query space. The association
of words to concepts can be performed manually through the use of dictionaries
or ontologies, or automatically using techniques such as Latent Semantic Analysis
(LSA) [22, 66]. LSA involves the extraction of contextual meaning of words through
examinations of the distribution of terms within a corpus using the vector space model
(see Section 2.3.1.2). Terms are broken down into co-occurrence tables and then a Sin-
gular Value Decomposition (SVD) is performed to determine term relationships [66].
The SVD projects the initial term meanings onto a subspace spanned by only the “im-
portant” singular term vectors. The potential benefits of LSA techniques are two-fold:
firstly they may reduce user confusion through the compression of similar (synony-
mous or polysemous) terms, and secondly they may reduce the size of term space, and
thereby improve system efficiency [66]. Indeed, LSA techniques have been shown
to improve the efficiency of retrieval systems considerably while maintaining (but
not exceeding) the effectiveness of non-decomposed vector space-based retrieval sys-
tems [22, 43]. However, the use of LSA-based algorithms is likely to negatively affect
navigational search (an important search task, described in Section 2.6.2) as the mean-
ing conveyed by entity naming terms may be lost.
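The following is a minimal sketch of the LSA decomposition, using NumPy’s SVD on a small dense term-document matrix; the matrix values and the retained rank k are illustrative assumptions, and production systems would use sparse matrices and truncated decompositions.
\begin{verbatim}
import numpy as np

# Rows are terms, columns are documents; entries are (weighted) term counts.
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                          # keep the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation of A

# Documents (and queries folded in the same way) can now be compared in the
# reduced k-dimensional space, where terms that co-occur are projected onto
# nearby directions.
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]       # one column per document
\end{verbatim}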
6. Employing stemming prior to indexing reduces the size of the corpus index; however, the discarded term information is then lost. As an alternative, stemming can be applied during query processing [214].
7. The measures of system precision and recall are defined in Section 2.6.6.1.
8. The use of phrase-optimised indexes can improve the efficiency of phrase-based retrieval [213].
Some terms occur so frequently in a corpus that their presence or absence within
a document may have negligible effect. The most frequent terms arguably convey the
least document relevance information and have the smallest discrimination value (see
inverse document frequency measure in Section 2.3.1.2). Additionally, because of the
high frequency of occurrence, such terms are likely to generate the highest overhead
during indexing and querying.9 Extremely frequent terms (commonly referred to as
“stop words”) are often removed from documents prior to indexing.10 However, it
has been suggested that such terms might be useful when matching documents [214],
particularly in phrase-based searches [14]. Nevertheless, in experiments within this
thesis, stop words are removed prior to indexing.
2.3.1 Text-based document evidence
To build a retrieval model, an operational definition of what constitutes a relevant
document is required. While the ranking models discussed below share similar
document statistics, they were all derived from different relevance-matching
assumptions. Experiments within this thesis employ the Okapi BM25 probabilistic
algorithm (for reasons outlined in Section 2.3.2). Other full-text ranking methods are
discussed for completeness.
The notation used during the model discussions below is as follows: D denotes a
document, Q denotes a query, t is a term, wt indicates a weight or score for a single
term, and S(D, Q) is the score assigned to the query to document match.
2.3.1.1 Boolean matching
In the Boolean model, retrieved documents are matched to queries formed with logic
operators. There are no degrees of match; a document either satisfies the query or
does not. Thus Boolean models are often referred to as “exact match” techniques [14].
While Boolean matching makes it clear why documents were retrieved, its syntax is
largely unfamiliar to ordinary users [28, 49, 51]. Nevertheless empirical evidence sug-
gests that trained search users prefer Boolean search as it provides an exact specifica-
tion for retrieving documents [49]. However, without any ranking by degree of match,
the navigation of the set of matching documents is difficult, particularly on large cor-
pora with unstructured content [14]. Empirical evidence also suggests that the use
of term weights in the retrieval model (described in the next sub-section) brings large
gains [14]. To employ Boolean matching techniques on corpora of the scale considered
in this thesis, it would have to be supplemented by some other document statistic in
order to provide a ranked list of results [14].
9. However, given the high amount of expected repetition they could potentially be more efficiently compressed [214].
10. This list often contains common function words or connectives such as “the”, “and” and “a”.
The Boolean scoring function is:
\[
S(D, Q) =
\begin{cases}
0 & Q \notin D \\
1 & Q \in D
\end{cases}
\tag{2.1}
\]
where Q is the query condition expressed in Boolean logic operators.
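Exact-match retrieval can be evaluated directly against an inverted index. The sketch below, which assumes the postings structure from the earlier indexing sketch, evaluates simple conjunctive (AND) and disjunctive (OR) queries; every matching document effectively receives the score 1 of Equation 2.1, so no ranking is produced.
\begin{verbatim}
def boolean_and(postings, query_terms):
    """Documents containing every query term (exact conjunctive match)."""
    doc_sets = [set(postings.get(t, {})) for t in query_terms]
    return set.intersection(*doc_sets) if doc_sets else set()

def boolean_or(postings, query_terms):
    """Documents containing at least one query term."""
    return set().union(*(set(postings.get(t, {})) for t in query_terms))
\end{verbatim}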
2.3.1.2 Vector space model
The vector space model is based on the implicit assumption that the relevance of a
document with respect to some query is correlated with the distance between the query
and document. In the vector space model each document (and query) is represented
in an n-dimensional Euclidean space with an orthogonal dimension for each term in
the corpus.11 The degree of relevance between a query and document is measured
using a distance function [176].
The most basic term vector representation simply flags term presence using vec-
tors of binary {0, 1}. This is known as the binary vector model [176]. The document
representation can be extended by including term and document statistics in the docu-
ment and query vector representations [176]. An empirically validated document sta-
tistic is the number of term occurrences within a document (term frequency or tf ) [176].
The intuitive justification for this statistic is that a document that mentions a term
more often is more likely to be relevant for, or about, that term. Another important
statistic is the potential for a term to discriminate between candidate documents [190].
The potential of a term to discriminate between documents has been observed to be
inversely proportional to the frequency of its occurrence in a corpus [190], with terms
that are common in a corpus less likely to convey useful relevance information. A
frequently used measure of term discrimination based on this observation is inverse
document frequency (or idf ) [190]. Using the tf and idf measures, the weight of a term
present in a document can be defined as:
\[
w_{t,D} = tf_{t,D} \times idf_t
\tag{2.2}
\]
where idf is:
\[
idf_t = \log \frac{N}{n_t}
\tag{2.3}
\]
where $n_t$ is the number of documents in the corpus that contain term $t$, and $N$ is the
total number of documents in the corpus.
11. So all dimensions are linearly independent.
There are many functions that can be used to score the distance between document
and query vectors [176]. A commonly used distance function is the cosine measure of
similarity [14]:
\[
S(D, Q) = \frac{D \cdot Q}{|D| \times |Q|}
\tag{2.4}
\]
or:
\[
S(D, Q) = \frac{\sum_{t \in Q} w_{t,D} \times w_{t,Q}}
{\sqrt{\sum_{t \in Q} w_{t,D}^2} \times \sqrt{\sum_{t \in Q} w_{t,Q}^2}}
\tag{2.5}
\]
Because the longer a document is, the more likely it is that a term will be encoun-
tered in it, an unnormalised tf component is more likely to assign higher scores to
longer documents. To compensate for this effect the term weighting function in the
vector space model is often length normalised, such that a term that occurs in a short
document is assigned more weight than a term that occurs in a long document. This
is termed document length normalisation. For example, a simple form of length normal-
isation is [14]:
\[
w_{t,D} = \frac{tf_{t,D} + 1}{maxtf_D + 1} \times idf_t
\tag{2.6}
\]
where $maxtf_D$ is the maximum term frequency observed for a term in document D.
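Equations 2.2 to 2.5 can be combined into a small ranking sketch. The version below reuses the assumed postings and doc_lengths structures from the indexing sketch, weights query terms by idf alone and, to keep the example short, normalises document vectors over the query vocabulary only; a full implementation would normalise over all document terms.
\begin{verbatim}
import math
from collections import defaultdict

def tfidf_cosine_scores(postings, doc_lengths, query_terms):
    """Rank documents by the cosine of tf-idf weighted vectors (Eqs 2.2-2.5)."""
    N = len(doc_lengths)
    scores = defaultdict(float)
    doc_norms = defaultdict(float)
    query_norm = 0.0

    for t in set(query_terms):
        n_t = len(postings.get(t, {}))
        if n_t == 0:
            continue
        idf = math.log(N / n_t)                 # Equation 2.3
        w_q = idf                               # query term weight (assumption)
        query_norm += w_q ** 2
        for doc_id, tf in postings[t].items():
            w_d = tf * idf                      # Equation 2.2
            scores[doc_id] += w_d * w_q         # numerator of Equation 2.5
            doc_norms[doc_id] += w_d ** 2

    for doc_id in scores:
        scores[doc_id] /= math.sqrt(doc_norms[doc_id]) * math.sqrt(query_norm)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
\end{verbatim}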
After observing relatively poor performance for the vector space model in a set of
TREC experiments, Singhal et al. [186] hypothesised that the form of document length
normalisation used within the model was inferior to that used in other models. To in-
vestigate this effect they compared the length of known relevant documents with the
length of documents otherwise retrieved by the retrieval system. Their results indi-
cated that long documents were more likely to be relevant for the task studied,12 but
no more likely to be retrieved after length normalisation in the vector space model.
Accordingly, Singhal et al. [186] proposed that the (cosine) length normalisation com-
ponent be pivoted to favour documents that were more frequently relevant (in this
case, longer documents).
12. The task studied was the TREC-3 ad-hoc retrieval task. The ad-hoc retrieval task is an informational task (see Section 2.6.1) where the user needs to acquire or learn some information that may be present in a document.
2.3.1.3 Probabilistic ranking
Probabilistic ranking algorithms provide an intuitive justification for the relevance of
matched documents by attempting to model and thereby rank the statistical proba-
bility that a document is relevant given the matching terms found [146, 169]. The
Probability Ranking Principle was described by Cooper [167] as:
“If a reference retrieval system’s response to each request is a ranking of
the documents in the collections in order of decreasing probability of use-
fulness to the user who submitted the request, where the probabilities are
estimated as accurately as possible on the basis of whatever data has been
made available to the system for this purpose, the overall effectiveness of
the system to its users will be the best that is obtainable on the basis of that
data.”
The probabilistic model for information retrieval was originally proposed by
Maron and Kuhns [146] and updated in an influential paper by Robertson and
Sparck Jones [169]. Probabilistic ranking techniques have a strong theoretical basis
and should, at least in principle and given all available information, provide the best
predictions of document relevance. The formal specification of the Probabilistic Rank-
ing Principle can be described as an optimisation problem, where documents should
only be retrieved in response to a query if the cost of retrieving the document is less
than the cost of not retrieving the document [169].
A prominent probabilistic ranking formulation is the Binary Independence Model
used in the Okapi BM25 algorithm [171]. The Binary Independence Model is con-
ditioned by several important assumptions in order to decrease complexity. These
assumptions include:
• Independence of documents, i.e. that the relevance of one document is indepen-
dent of the relevance of all other documents;13
• Independence of terms, i.e. that the occurrence or absence of one term is not
related to the presence or absence of any other term;14 and
• That the distribution of terms within a document can be used to estimate the
document’s probability of relevance.15
13. This is brought into question when one document’s relevance may be affected by another document ranked above it (as is the case with duplicate documents). This independence assumption was removed in several probabilistic formulations without significant improvement in retrieval effectiveness [204].
14. This assumption was also removed from probabilistic formulations without significant effectiveness improvements [204].
15. This assumption is made according to the cluster hypothesis, which states that “closely associated documents tend to be relevant to the same requests”, therefore “documents relevant to a request are separate from those that are not” [204].
In most early probabilistic models, the term probabilities were estimated from
a sample set of documents and queries with corresponding relevance judgements.
However, this information is not always available. Croft and Harper [61] have revis-
ited the initial formulation of relevance and proposed a probabilistic model that did
not include a prior estimate of relevance.
Okapi BM25
The Okapi BM25 formula was proposed by Robertson et al. [172]. In Okapi BM25,
documents are ordered by decreasing probability of their relevance to the query,
P(R|Q, D). The formulation takes into account the number of times a query term oc-
curs in a document (tf ), the proportion of other documents which contain the query
term (idf ), and the relative length of the document. A score for each document is
calculated by summing the match weights for each query term. The document score
indicates the Bayesian inference weight that the document will be relevant to the user
query.
Robertson and Walker [170] derived the document length normalisation used in
the Okapi BM25 formula as an approximation to the 2-Poisson model. The form of
length normalisation employed when using Okapi BM25 with default parameters
(k1 = 2, b = 0.75) is justified because long documents contain more information than
shorter documents, and are thus more likely to be relevant [186].
The base Okapi BM25 formulation [172] is:
\[
BM25_{w_t} = idf_t \times
\frac{(k_1 + 1)\, tf_{t,D}}{k_1\left((1 - b) + \frac{b \times dl}{avdl}\right) + tf_{t,D}}
\times
\frac{(k_3 + 1) \times Qw_t}{k_3 + Qw_t}
+ k_2 \times nq\, \frac{avdl - dl}{avdl + dl}
\tag{2.7}
\]
where $w_t$ is the relevance weight assigned to a document due to query term $t$, $Qw_t$ is
the weight attached to the term by the query, $nq$ is the number of query terms, $tf_{t,D}$ is
the number of times $t$ occurs in the document, $N$ is the total number of documents, $n_t$
is the number of documents containing $t$, $dl$ is the length of the document and $avdl$ is
the average document length (both measured in bytes).
Here $k_1$ controls the influence of $tf_{t,D}$ and $b$ adjusts the document length normalisation.
A $k_1$ approaching 0 reduces the influence of the term frequency, while a larger
$k_1$ increases the influence. A $b$ approaching 1 assumes that the documents are longer
due to repetition (full length normalisation), whilst $b = 0$ assumes that documents are
long because they cover multiple topics (no length normalisation) [168].
Setting $k_1 = 2$, $k_2 = 0$, $k_3 = \infty$ and $b = 0.75$ (verified experimentally in TREC tasks
and on large corpora [168, 186]):
\[
BM25_{w_{t,D}} = Qw_t \times tf_{t,D} \times
\frac{\log\left(\frac{N - n_t + 0.5}{n_t + 0.5}\right)}
{2 \times \left(0.25 + 0.75 \times \frac{dl}{avdl}\right) + tf_{t,D}}
\tag{2.8}
\]
The final document score is the sum of term weights:
BM25(D, Q) = \sum_{t \in Q} BM25_{w_{t,D}} \quad (2.9)
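To make the arithmetic concrete, the sketch below implements the simplified scoring of Equations 2.8 and 2.9, assuming k1 = 2, b = 0.75, unit query-term weights (Qw_t = 1), and document length measured in terms rather than bytes. The function and argument names are illustrative only and are not taken from any particular system.

import math
from collections import Counter

# A minimal sketch of the simplified Okapi BM25 scoring in Equations 2.8-2.9.
# Assumes k1 = 2, b = 0.75, k2 = 0, k3 = infinity and unit query-term weights.
def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len):
    tf = Counter(doc_terms)
    dl = len(doc_terms)                        # document length (in terms here)
    score = 0.0
    for t in query_terms:
        n_t = doc_freqs.get(t, 0)              # number of documents containing t
        if tf[t] == 0 or n_t == 0:
            continue
        idf = math.log((num_docs - n_t + 0.5) / (n_t + 0.5))
        k = 2.0 * (0.25 + 0.75 * dl / avg_doc_len)   # k1((1 - b) + b * dl/avdl)
        score += tf[t] * idf / (k + tf[t])           # Equation 2.8, summed as in 2.9
    return score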
2.3.1.4 Statistical language model ranking
Statistical language modelling is based on Shannon’s communication theory [182]16
and examines the distribution of language in a document to estimate the probabil-
ity that a query was generated in an attempt to retrieve that document. Statistical
language models have long been used in language generation, speech recognition
and machine translation tasks, but have only recently been applied to document re-
trieval [162].
Language models calculate the probability of encountering a particular string (s)
in a language (modelled by M) by estimating P(s|M). The application of language
modelling to information retrieval conceptually reverses the document ranking
process. Unlike probabilistic ranking functions which model the relevance of docu-
ments to a query, language modelling approaches model the probability that a query
was generated from a document. In this way, language models replace the notion of
relevance with one of sampling, where the probability that the query was picked from
a document is modelled. The motivation for this approach is that users have some pro-
totype document in mind when an information need is formed, and they choose query
terms to that effect. Further, it is asserted that when a user seeks a document they are
thinking about what it is that makes the document they are seeking “different”. The
statistical language model ranks documents using the maximum likelihood estima-
tion (Pmle) that the query was generated with that document in mind (P(Q|MD)),
otherwise considered to be the probability of generating the query according to each
document language model.
Language modelling was initially applied to document retrieval by Ponte and
Croft [162] who proposed a simple unigram-based document model.17 The simple
unigram model assigns:
P(D|Q) = \prod_{t \in Q} P(t|M_D) \quad (2.10)
The model presented above may not be effective in general document retrieval
as it requires a document to contain all query terms. Any document that is missing
one or more query terms will be assigned a probability of query generation of zero.
Smoothing is often used to counter this effect (by adjusting the maximum likelihood
16. This is primarily known for its application to text sequencing and estimation of message noise.
17. A unigram language model models the probability of each term occurring independently, whilst higher order (n-gram) language models model the probability that consecutive terms appear near each other (described in Section 2.3). In the unigram model the occurrence of a term is independent of the presence or absence of any other term (similar to the term independence assumption in the Okapi model).
estimation of the language model). Smoothing methods discount the probabilities
of the terms seen in the text, to assign extra probability mass to the unseen terms
according to a fallback model [218]. In information retrieval it is common to exploit
corpus properties for this purpose. Thereby:
P(D|Q) = \prod_{t \in Q} \begin{cases} P(t|M_D) & \text{if } t \in M_D \\ \alpha P(t|M_C) & \text{otherwise} \end{cases} \quad (2.11)
where P(t|M_D) is the smoothed probability of a term seen in the document D, P(t|M_C) is the collection language model (over C), and α is the coefficient controlling the probability mass assigned to unseen terms (so that all probabilities sum to one).
Models for smoothing the document model include Dirichlet smoothing [155],
geometric smoothing [162], linear interpolation [19] and 2-state Hidden Markov Mod-
els. Dirichlet smoothing has been shown to be particularly effective when dealing
with short queries, as it provides an effective normalisation using document
length [155, 218].18 Language models with Dirichlet smoothing have been used to
good effect in recent TREC web tracks by Ogilvie and Callan [155].
A document language model is built for all query terms [155]:
P(Q|M_D) = \prod_{t \in Q} P(t|M_D) \quad (2.12)
Adding smoothing to the document model using the collection model:
P(t|MD) = β1Pmle(t|D) + β2Pmle(t|C) (2.13)
The β1 and β2 document and collection linear interpolation parameters are then estimated using Dirichlet smoothing:
\beta_1 = \frac{|D|}{|D| + \gamma}, \qquad \beta_2 = \frac{\gamma}{|D| + \gamma} \quad (2.14)
where |D| is the document length and γ is often set near the average document length
in the corpus [155]. The mle for a document is defined as:
P_{mle}(t|D) = \frac{tf_{t,D}}{|D|} \quad (2.15)
Similarly, for the corpus:
P_{mle}(t|C) = \frac{tf_{t,C}}{|C|} \quad (2.16)
18. Document length has been exploited with success in the Okapi BM25 model and in the vector space model.
The document score is then:
S(D, Q) = \prod_{t \in Q} \left( \beta_1 \times \frac{count(t; D)}{|D|} + \beta_2 \times \frac{count(t; C)}{|C|} \right) \quad (2.17)
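As a worked illustration of Equations 2.13 to 2.17, the sketch below scores a document under a Dirichlet-smoothed unigram model. The corpus statistics are assumed to be supplied as a term-frequency dictionary and a total corpus length, and gamma would typically be set near the average document length; all names are illustrative rather than taken from any particular system.

from collections import Counter

def lm_dirichlet_score(query_terms, doc_terms, corpus_tf, corpus_len, gamma):
    tf = Counter(doc_terms)
    dlen = len(doc_terms)
    beta1 = dlen / (dlen + gamma)              # document weight (Equation 2.14)
    beta2 = gamma / (dlen + gamma)             # collection weight (Equation 2.14)
    score = 1.0
    for t in query_terms:
        p_doc = tf[t] / dlen if dlen else 0.0          # P_mle(t|D), Equation 2.15
        p_col = corpus_tf.get(t, 0) / corpus_len       # P_mle(t|C), Equation 2.16
        score *= beta1 * p_doc + beta2 * p_col         # Equation 2.17
    return score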
Statistical language models have several beneficial properties. If users are as-
sumed to provide query terms that are likely to occur in documents of interest, and
that distinguish those documents from other documents in the corpus, language mod-
els provide a degree of confidence that a particular document should be retrieved [162].
Further, while the vector space and probabilistic models use a crude approximation
to document corpus statistics (such as document frequency, discrimination value and
document length), language models are sometimes seen to provide a more integrated
and natural use of corpus statistics [162].
2.3.2 Discussion
The most effective implementations of each of the retrieval models discussed above
have been empirically shown to be very similar [53, 60, 106, 110, 119, 121]. Discrepan-
cies previously observed in the effectiveness of the different models have been found
to be due to differences in the underlying statistics used in the model implementa-
tion, and not the model formalisation [186]. All models employ a tf × idf approach to
some degree, and normalise term contribution using document length. This is explicit
in probabilistic [170] and vector space models [186], and is often included within the
smoothing function in language models [155, 218]. The use of these document statis-
tics in information retrieval systems has been empirically validated over the past ten
years [155, 168].
When dealing with free-text elements, experiments within this thesis use the prob-
abilistic ranking function Okapi BM25 without prior relevance information [170].
This function has been empirically validated to perform as well as current state-of-
the-art ranking functions [53, 57, 58, 59, 60, 168, 170].
Further discussion and comparison of full-text ranking functions is outside the scope of this thesis. The interested reader should consult [14, 176, 191, 204].
2.3.3 Other evidence
To build a baseline that achieves similar performance to that of popular web and
WWW search engines several further types of document-level evidence may need
to be considered [31, 109, 113].
2.3.3.1 Metadata
Metadata is data used to describe data. An example of real-world metadata is a library
catalogue card, which contains data that describes a book within the library (although
metadata is not always stored separately from the document it describes). In web
documents metadata may be stored within HTML metadata tags (<META>), or in separate XML/RDF resource descriptors. As metadata tags are intended to describe
document contents, the content of metadata tags is not rendered by web browsers.
Several standards exist for metadata creation, one of the least restricted forms of which
is simple Dublin Core [70]. Dublin Core provides a small set of core elements (all
of which are optional) that are used to describe resources. These elements include:
document author, title, subject, description, and language. An example of HTML
metadata usage, taken from http://cs.anu.edu.au/∼Trystan.Upstill/19 is:
<meta http-equiv="Content-Type"
content="text/html;
charset=iso-8859-1" />
<meta name="keywords"
content="Upstill, Web, Information, Retrieval" />
<meta name="description"
content="Trystan Upstill’s Homepage, Web IR" />
<meta name="revised"
content="Trystan Upstill, 6/27/01" />
<meta name="author"
content="Trystan Upstill" />
The utility of metadata depends on the observance of document authorship stan-
dards. Inconsistencies between document content and purpose, and associated meta-
data tags, may severely reduce system retrieval effectiveness. Such inconsistencies
may occur either unintentionally through outdated metadata information, or through
deliberate attribute “stuffing” in an attempt by the document author to have the doc-
ument retrieved for a particular search term [71]. When a document is retrieved due
to misleading metadata information, search system users may have no idea why the
document has been retrieved, with no visible text justifying the document match.
The use of HTML metadata tags is not considered within this thesis due to the
relatively low adherence to metadata standards in documents across the WWW, and
the inconsistency of adherence in other web corpora [107]. This policy is followed by
many WWW search systems [71, 193].
2.3.3.2 URL information
Uniform Resource Locators, or URLs, provide web addresses for documents. The URL
of a document may contain document evidence, either through term presence in the
URL or implicitly through some other URL characteristic (such as depth in the site
hierarchy).
The URL string may contain useful query-dependent evidence by including a po-
tential search term (e.g. http://cs.anu.edu.au/∼Joe.Blogs/ contains the po-
tentially useful terms of “Joe” and “Blogs”). URLs can be matched using simple string
matching techniques (e.g. checking if the text is present or not) or using full-text
19. META tags have been formatted according to XHTML 1.0.
ranking algorithms (although a binary term presence vector would probably suffice).
Ogilvie and Callan [50, 155, 156] proposed a novel method for matching URL strings
within a language modelling framework. In their method the probability that a URL
was generated for a particular term, given the URLs of all corpus documents, is cal-
culated. Query terms and URLs are treated as character sequences and a character-
based trigram generative probability is computed for each URL. The numerator and
denominator probabilities in the trigram expansion are then estimated using a linear
interpolation with the collection model [50, 155, 156]. Ogilvie and Callan then com-
bined this URL-based language model with the language models of other document
components. The actual contribution of this type of URL matching is unclear.20
Further query-independent evidence relating to URLs might also be gained through examining common formatting practices. For example, useful features may be derived from the length of a URL (in characters or directory depth), the presence of a particular character in the URL (e.g. looking for ‘∼’ when matching personal home pages [181]), or a more advanced metric. Westerveld et al. [135, 212] proposed
a URL-type indicator for estimating the likelihood that a page is a home page. In this
measure URLs are grouped into four categories, Root, Subroot, Path and File, using
the following rules:
Root a domain name,
e.g. www.cyborg.com/.
Subroot a domain name followed by a single directory,
e.g. www.glasgow.ac.uk/staff/.
Path a domain name followed by two or more directories,
e.g. trec.nist.gov/pubs/trec9/.
File any URL ending in a filename rather than a directory,
e.g. trec.nist.gov/contact.html.
Westerveld et al. [135, 212] calculated probabilities for encountering a home page
in each of these URL-types using training data on the WT10g collection (described
in Section 2.6.7.2). They then used these probabilities to assign scores to documents
based on the likelihood that their document URL would be a home page.
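A sketch of the four URL-type rules is given below. The exact handling of scheme-less URLs, trailing slashes and query strings is an assumption made here for illustration, not a detail taken from Westerveld et al.

from urllib.parse import urlparse

def url_type(url):
    if '://' not in url:
        url = 'http://' + url                  # allow scheme-less URLs
    path = urlparse(url).path
    if path in ('', '/'):
        return 'Root'                          # a bare domain name
    if not path.endswith('/'):
        return 'File'                          # ends in a filename
    depth = len([p for p in path.split('/') if p])
    return 'Subroot' if depth == 1 else 'Path' # one directory vs. two or more

Documents would then be assigned the home-page probability estimated for their URL type on training data.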
In experiments reported within this thesis, URL-type and URL length informa-
tion are considered. While the textual elements in a URL may be useful in doc-
ument matching, consistent benefits arising from their use are yet to be substanti-
ated [107, 155]. As such they are not considered within this work.
20. Ranking functions which included this URL measure performed well, but the contribution of the URL measure was unclear.
2.3.3.3 Document structure and tag information
Web documents may contain markup indicating to a viewer that a particular segment of the document, or the full document, is important. For example, useful evidence could be collected from:
• Titles / Heading tags: encoded in <H?> or <TITLE> tags.
• Marked-up text: For example bold (B), emphasised (EM) or italic (I) text may contain important information.
• Internal tag structure: The structural makeup of a document may give insight
into what a document contains. For example, if a document contains a very
long table, list or form, this may give some indication as to the utility of that
document.
• Descriptive text tags: Images often include descriptions of their content for users
viewing web pages without graphics capabilities. These are included as an at-
tribute in the IMG tag (ALT=).
Ogilvie and Callan [50, 155, 156] achieved small effectiveness gains through an up-
weighting of TITLE, Image ALT text and FONT tag text for both named page finding
and home page finding tasks. However, the effectiveness gains through the use of
these additional forms of evidence were small compared to those achieved through
the use of document full-text, referring anchor-text and URL length priors.21
The only document structure used in experiments within this thesis is document
TITLE. While there is some evidence to suggest that up-weighting marked-up text
might provide some gains, experiments have shown that the associated improvement
is relatively small [155].
2.3.3.4 Quality metrics
Zhu and Gauch [219] considered whether the effectiveness of full-text-based docu-
ment ranking22 could be improved through the inclusion of quality metrics.
They evaluated six measures of document quality:
• Currency: how recently a document was last modified (using document time
stamps).
• Availability: how many links leaving a document were available (calculated as
the number of broken links from a page divided by the total number of links).
• Information-to-noise: a measurement of how much text in the document was
noise (such as HTML tags or whitespace) as opposed to how much was useful
content.
21. Using the length of a URL to estimate a prior probability of document relevance.
22. Calculated using a tf × idf vector space model (see Section 2.3.1.2).
• Authority: a score sourced from Yahoo Internet Life reviews and ZDNet ratings
in 1999. According to these reviews each site was assigned an authority score.
Sites not reviewed were assigned an authority score of zero.
• Popularity: how many documents link to the site (in-degree). This information
was sourced from AltaVista [7]. The in-degree measure is discussed in detail in
Section 2.4.3.1.
• Cohesiveness: how closely related the elements of a web page are, determined
by classifying elements using a vector space model into a 4385 node ontology
and measuring the distance between competing classifications. A small distance
between classifications indicates that the document was cohesive. A large dis-
tance indicates the opposite.
Zhu and Gauch [219] evaluated performance using a small corpus with 40 queries
taken from a query log file.23 They observed some improvement in mean precision
based on all the quality metrics, although not all improvements were significant.24
The smallest individual improvements were for “Popularity” and “Authority” (both
non-significant). The improvements obtained through the use of all other metrics was
significant. The largest individual improvement was observed for the “Information-
to-noise” ratio. Using all quality metrics apart from “Popularity” and “Authority”
resulted in a (significant) 24% increase in performance over the baseline document
ranking [219].
These quality metrics, apart from in-degree, are not included in experiments
within this thesis because sourced information may be incomplete [219] or inaccu-
rate [113].
2.3.3.5 Units of retrieval
Identifying the URL which contains the information unit most relevant to the user
may be a difficult task. There are many ways in which a unit of information may be
defined on a web and so the granularity of information units retrieved by web search
engines may vary considerably.
If the granularity is too fine (e.g. the retrieval of a single document URL when a
whole web site is relevant), the user may not be able to fulfil their information need.
In particular the user may not be able to tell whether the system has retrieved an ad-
equate answer, or the retrieved document list may contain many parts of a composite
document from a single web site.
If the unit of retrieval is too large (e.g. the retrieval of a home page URL when
only a deep page is relevant), the information may be buried such that it is difficult
for users to retrieve.
The obvious unit for WWW-based document retrieval is the web page. However,
there are many situations in which a user may be looking for a smaller element of
23. It is unclear how the underlying search task [106, 108] was modelled in this experiment.
24. Significance was tested using a paired-samples t-test [219].
information, such as when seeking an answer to a specific question. Alternatively, a
unit of information may be considered to be a set of web pages. It is common for web
documents to be made up of multiple web pages, or at least be related to other co-
located documents [75]. An example of a composite document is the WWW site for
the Keith van Rijsbergen book ‘Information Retrieval’ which consists of many pages,
each containing small sections from the book [205]. In a study of the IBM intranet
Eiron and McCurley [75] reported that approximately 25% of all URLs encountered on
the IBM corpus were members of some larger “compound” document that spanned
several pages.
The problem of determining the “most useful” level for an information unit
was considered in the 2003 TREC Topic Distillation task (TD2003 – described in Sec-
tion 2.6.7). The TD2003 task judged systems according to whether they retrieved im-
portant resources, and did not mark subsidiary documents as being relevant [60]. The
TD2003 task is similar to the “component coverage” assessment used in the INEX
XML task [85], where XML retrieval systems are rewarded for retrieving the correct
unit of information. In the XML task the optimal system would return the unit of
information that contains the relevant information and nothing else.
Some methods analyse the web graph and site structure in an attempt to identify
logical information units. Terveen et al. build site units by graphing co-located pages,
using a method entitled “clan graphs” [196]. Further methods attempt to determine
the appropriate information unit by applying a set of heuristics based on site hierarchy
and linkage [13, 142].
This thesis adopts the view that finding the correct information unit is analogous
to finding the optimal entry point for the correct information unit. As such, none of
the heuristics outlined above are used to detect information units. Instead, hyperlink
recommendation and other document evidence is evaluated according to whether it
can be used to find information unit entry points.
Document segmentation
Document segmentation methods break down HTML documents into document components that can be analysed individually. A commonly used segmentation method is to break down HTML documents into their Document Object Model (DOM), according to the document tag hierarchy [42, 45]. Visual Information Processing System (VIPS) [37, 38, 39] is a recently proposed extension of DOM-based segmentation that dissects HTML documents using visual elements in addition to their DOM.
Document segmentation techniques are not considered in this thesis. While finer
document breakdown might be useful for finding short answers to particular ques-
tions, there is little evidence of improvements in ranking at the web page level [39].
2.4 Web-based evidence
Many early WWW search engines conceptualised the document corpus as a flat struc-
ture and relied solely on the document-level evidence outlined above, ignoring hy-
perlinks between documents [33]. This section outlines techniques for exploiting the
web graph that is created when considering documents within a web as nodes and
hyperlinks between documents as directed edges.
This thesis does not consider measures based on user interaction with the web
search system, such as click-through evidence [129]. While click-through evidence
may be useful when ranking web pages, assumptions made about user behaviour may
be questionable. In many cases it may be difficult to determine whether users have
judged a document relevant from a sequence of queries and clicks. Collecting such
evidence also requires access to user interaction logs for a large scale search system.
Work within this thesis relating to the combination of query-dependent evidence with query-independent evidence is applicable to this domain.
The WWW graph was initially hypothesised to be a small world network [18], that
is, a network that has a finite diameter,25 where each node has a path to every other
node by a relatively small number of steps. Small world networks have been shown
to exist in other natural phenomena, such as relationships between research scientists
or between actors [2, 5, 6, 18]. Barabasi hypothesised that the diameter of the WWW
graph was 18.59 links (estimated for 8 × 108 documents) [18]. However, this work
was challenged by WWW graph analysis performed by Broder et al. [35]. Using a
200 million page crawl from AltaVista, which contained 1.5 billion links [7], Broder et
al. observed that the WWW graph’s maximal and average diameter was infinite. The
study revealed that the WWW graph resembles a bow-tie with a Strongly Connected
Component (SCC), an upstream component (IN), a downstream component (OUT),
links between IN and OUT (Tendrils), and disconnected components. Each of these
components was observed to be roughly the same size (around 50 million nodes).
The SCC is a highly connected graph that exhibits the small-world property. The IN
component consists of nodes that link into the SCC, but cannot be accessed from the
SCC. The OUT component consists of nodes that are linked to from the SCC, but do
not link back to the SCC. Tendrils link IN nodes directly to OUT nodes, bypassing the
SCC. Disconnected components are pages to which no-one linked, and which linked-
to no-one.
The minimal diameter26 for the bow-tie was 28 for the SCC and 500 for the entire
graph. The probability of a directed path existing between two nodes was observed
to be 24%, and the average length of such a path was observed to be 16 links. The
shortest directed path between two random nodes in the SCC was, on average, 16 to
20 links. Further work by Dill et al. [67] has reported that WWW subgraphs, when
restricted by domain or keyword occurrence, also form bow-tie-like structures. This
phenomenon has been termed the fractal nature of the WWW, and is exhibited by
25. Average distance between two nodes in a graph.
26. The minimum number of steps by which the graph could be crossed.
other scale-free networks [67].
Many WWW distributions have been observed to follow a power law [3]. That is, the distributions take the form k = 1/i^x for i > 1, where k is the probability that a node has the value i and x is the exponent of the distribution. Important WWW distributions that have been observed to follow the power law include:
• WWW site in-links (in-degrees). The fraction of pages with an in-degree i was first approximated by Kumar et al. [136, 137] to be distributed according to a power law with exponent x = 2 on a 1997 crawl of around 40 million pages gathered by Alexa.27 Later Barabasi et al. estimated the exponent at x = 2.1 over a graph computed for a corpus containing 325 000 documents from the nd.edu domain [17, 18]. Broder et al. [35] have since confirmed the estimate of x = 2.1.
• WWW site out-links (out-degrees). Barabasi and Albert [17] estimated a power law distribution with exponent x = 2.45. Broder et al. [35] reported an exponent of x = 2.75 for out-degree on a 200 million page crawl from AltaVista.
• Local WWW site in-degrees and out-degrees [25].
• WWW site accesses [4].
2.4.1 Anchor-text evidence
Web authors often supply textual snippets when marking-up links between web doc-
uments, encoded within anchor <A HREF=""></A> tags. The average length of
an anchor-text snippet has been observed to be 2.85 terms [159]. This is similar to
the average query length submitted to WWW search engines [184] and suggests there
might be some similarity between a document’s anchor-text and the queries typically
submitted to search engines to find that document [73, 74].
A common method for exploiting anchor-text is to combine all anchor-text snip-
pets pointing to a single document into a single aggregate anchor-text document, and
then to use the aggregate document to score the target document [56]. In terms of
document evidence, this aggregate anchor-text document may give some indication
of what other web authors view as the content, or purpose, of a document. It has been
observed that anchor-text frequently includes information associated with a page that
is not included in the page itself [90].
To increase the anchor-text information collected for hyperlinks, anchor-text evi-
dence can be expanded to include text outside (but in close proximity to) anchor-tags.
However, there is disagreement regarding whether such text should be included.
Chakrabarti [44] investigated the potential utility of text surrounding anchor tags
by measuring the proximity of the term “Yahoo” to the anchor tags of links to
http://www.yahoo.com in 5000 web documents. Chakrabarti found that including 50 words around the anchor tags performed best, as most occurrences of “Yahoo” were within that bound (see Table 2.1).

27. http://www.alexa.com

Distance  -100  -75  -50  -25    0   25   50   75  100
Density      1    6   11   31  880   73  112   21    7

Table 2.1: Proximity of the term “Yahoo” to links to http://www.yahoo.com/ for 5000 WWW documents (from [44]). Distance is measured in bytes. A distance of 0 indicates that “Yahoo” appeared within the anchor tag. A negative distance indicates it occurred before the anchor-tag, and a positive distance indicates that it occurred after the tag.

Chakrabarti found that using this extra text
improved recall, but at the cost of precision (precision and recall are described in Sec-
tion 2.6.6.1). In later research Davison [64, 65] reported that extra text surrounding the
anchor-text did not describe the target document any more accurately than the text
within anchor-tags. However, Glover et al. [90] reported that using up to 25 terms
around anchor-text tags improved page-content classification performance. Pant et al. [159] proposed a further method for expanding anchor-text evidence using a DOM
break-down (DOM described in Section 2.3.3.5). They suggested that if an anchor-text
snippet contains under 20 terms then the anchor-text evidence should be extended to
consider all text up to the next set of HTML tags. They found that expanding to be-
tween two and four HTML tag levels improved classification of the target documents
when compared to only using text that occurred within anchor-tags.
Experiments within this thesis only consider text within the anchor tags, as there
is little conclusive evidence to support the use of text surrounding anchor tags.
Anchor-text ranking
Approaches to ranking anchor-text evidence include:
• Vector space. Hyperlink Vector Voting, proposed by Li and Rafsky [143], ranks
anchor-text evidence using a vector space containing all anchor-text pointing to
a document. The final score is the sum of all the dot products between the query
vector and anchor-text vectors. Li and Rafsky did not formally evaluate this
method.
• Okapi BM25. Craswell, Hawking and Robertson [56] built surrogate docu-
ments from all the anchor-text snippets pointing to a page and ranked the doc-
uments as if they contained document full-text. This application of anchor-text
provided dramatic improvements in navigational search performance.28
28. Navigational search is described in Section 2.6.2.

• Language Modelling. Ogilvie and Callan [155] modelled anchor-text separately from other document evidence using a unigram language model with Dirichlet smoothing. The anchor-text language model was then combined with their
models for other sections of the document using a mixture model (see Sec-
tion 2.5.2.2). This type of anchor-text scoring has been empirically evaluated
and shown to be effective [155, 156].
Unless otherwise noted, the anchor-text baselines used in this thesis are scored
from anchor-text aggregate documents using the Okapi BM25 ranking algorithm.
This method is used because it has previously been reported to perform well [56].
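A sketch of building the aggregate (surrogate) anchor-text documents is shown below; the representation of the link data and the whitespace tokenisation are assumptions of this illustration. The resulting term lists can then be ranked exactly as full-text documents, for example with a BM25-style function such as the earlier sketch.

from collections import defaultdict

def build_anchor_documents(links):
    """links: iterable of (source_url, target_url, anchor_text) triples."""
    surrogates = defaultdict(list)
    for _source, target, anchor_text in links:
        # Append each snippet's terms to the target's aggregate document.
        surrogates[target].extend(anchor_text.lower().split())
    return surrogates        # target URL -> list of anchor-text terms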
2.4.2 Bibliometric measures
Figure 2.1: A sample network of relationships
Social networks researchers [125, 131] are concerned with the general study of
links in nature for diverse applications, including communication (to detect espi-
onage or optimise transmission) and modelling disease outbreak [89]. Bibliomet-
rics researchers are similarly interested in the citation patterns between research pa-
pers [87, 89], and study these citations in an attempt to identify relationships. This
can be seen as a specialisation of social network analysis [89]. In many social net-
work models, there is an implicit assumption that the occurrence of a link (citation)
indicates a relationship or some attribution of prestige. However, in the context of
some areas (such as research) it may be difficult to determine whether a citation is an
indication of praise or retort [203].
Social networks and citations may be modelled using link adjacency matrices. A
directed social network of size n can be represented as an n × n matrix, where links
between nodes are encoded in the matrix (e.g. if a node i links to j, then Ei,j = 1). For
example, the relationship network shown in Figure 2.1 may be represented as:
E = \begin{pmatrix}
0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0
\end{pmatrix}
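The same matrix can be written down directly in code. The short sketch below (numpy is an assumption of this illustration) encodes the network of Figure 2.1 and reads off each node's in-degree, its direct prestige, alongside its out-degree.

import numpy as np

# Adjacency matrix for Figure 2.1: E[i, j] = 1 means node i+1 links to node j+1.
E = np.array([
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
])

in_degree = E.sum(axis=0)      # direct prestige: number of incoming links
out_degree = E.sum(axis=1)     # number of outgoing links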
Prestige
The number of incoming links to a node is a basic measure of its prestige [131]. This
gives a measure of the direct endorsement the node has received. However, examining direct endorsements alone may not give an accurate representation of node prestige.
It may be more interesting to know if a node is recognised by other important nodes,
thus transitive citation becomes important. A transitive endorsement is an endorse-
ment through an intermediate node (i.e. if A links to B links to C, then A weakly
endorses C).
An early measure of prestige in a social network analysis was proposed by See-
ley [179] and later revised by Hubbell [125]. In this model, every document has an
initial prestige associated with it (represented as a row in p), which is transferred to
its adjacent nodes (through the adjacency matrix E). Thus the direct prestige of any (a
priori equal) node can be calculated by setting p = (1, ..., 1)T and calculating p = pET .
By performing a power iteration over p ← pET the prestige measure p converges to
the principal eigenvector of the matrix ET and provides a measure of transitive pres-
tige.29 The power iteration method multiplies p by increasing powers of ET until the calculation converges (tested against some convergence constant).
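A minimal sketch of this power iteration is given below, treating p as a column vector. The per-step renormalisation and the tolerance test are additions made here so the values stay bounded; as the text later notes for PageRank, convergence is only guaranteed for suitably connected (irreducible, aperiodic) graphs.

import numpy as np

def transitive_prestige(E, tol=1e-8, max_iter=1000):
    p = np.ones(E.shape[0], dtype=float)       # equal a priori prestige
    for _ in range(max_iter):
        p_next = E.T @ p                       # transfer prestige along in-links
        norm = np.linalg.norm(p_next)
        if norm == 0:
            return p_next
        p_next /= norm                         # keep the vector bounded
        if np.linalg.norm(p_next - p) < tol:   # stop once the scores stabilise
            break
        p = p_next
    return p_next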
To measure prestige for academic journals Garfield [88] proposed the “impact
factor”. The impact factor score for a journal j is the average number of citations
to papers within that journal received during the previous two years. Pinski and
Narin [161] proposed a variation to the “impact factor”, termed the influence weight,
based on the observation that all journal citations may not be equally important. They hypothesised that a journal is influential if its papers are cited by papers in other influential journals, thus incorporating a measure of transitive endorsement. This
notion of transitive endorsement is similar to that modelled in PageRank and HITS
(described in Sections 2.4.3.2 and 2.4.4.1).
Co-citation and bibliographic coupling
Co-citation is used to measure subject similarity between two documents. If a docu-
ment A cites documents B and C, documents B and C are co-cited by A. If many doc-
uments cite both documents B and C, this indicates that B and C may be related [187].
The more documents that cite both B and C, the closer their relationship.
The co-citation matrix (CC) is calculated as:
CC = ET
E (2.18)
where CCi,j is the number of papers which jointly cite papers i and j, and the diagonal
is node in-degree.
29. See Golub and Van Loan [91, pp. 330–332] for more information about principal eigenvectors and the power method.

Bibliographic coupling is the inverse of co-citation, and infers that if two documents include the same references then they are likely to be related, i.e. if documents A and B both cite document C this gives some indication that they are related. The more documents that documents A and B both cite, the stronger their relationship.
The bibliographic coupling (BC) matrix is calculated as:
BC = E E^T \quad (2.19)
where BCi,j is the number of papers jointly cited by i and j and the diagonal is node
out-degree.
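Both measures are one matrix product away from the adjacency matrix, as the short sketch below shows for any 0/1 adjacency matrix E (such as the Figure 2.1 example above).

import numpy as np

def cocitation(E):
    return E.T @ E       # CC[i, j]: documents citing both i and j (Equation 2.18)

def bibliographic_coupling(E):
    return E @ E.T       # BC[i, j]: documents cited by both i and j (Equation 2.19)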
Citation graph measures
Important information may be conveyed by the distance between two nodes in a cita-
tion graph, the radius of a node (maximum distance from a node to the graph edge),
the cut of the graph (the edges of the graph which, when removed, will disconnect large sections of the graph), and the centre of the graph (the node that has the smallest
radius). For example, when examining a field of research, interesting papers can be
identified by their small radius, as this indicates that most papers in the area have a
short path to the paper. The cut of the graph typically indicates communication be-
tween cliques, and can be used to identify important nodes, whose omission would
lead to the loss of the relationship between the groups [196].
2.4.2.1 Bibliographic methods applied to a web
Hyperlink-based scoring assumes that web hyperlinks provide some vote for the im-
portance of their targets. However, due to the relatively small cost of web publishing,
the discretion used when creating links between web pages may be less than is em-
ployed by researchers in scientific literature [203]. Indeed it has been observed that
not all web links are created for recommendation purposes [63] (discussed in Section
3.1.5).
An early use of hyperlink-based evidence was in a WWW site visualisation, where
a site’s visibility represented its direct prestige, and the out-degree of a site was the
node’s luminosity [30]. Larson [138] presented one of the first applications of biblio-
metrics on the WWW by using co-citation to cluster related web pages and to explore
topical themes.
Marchiori [145] provided an early examination of the use of hyperlink evidence in
a document ranking scheme, by proposing that a document’s score should be relative
to that document’s full-text score and “hyper” (hypertext-based) score. Marchiori’s
model was based on the idea that a document’s quality is enriched through the pro-
vision of links to other important resources. In this model, the “hyper-information”
score was a measure based on a document’s subsidiary links, rather than its parent
links. The page score was dependent not only on its full-text content, but the content
of its subsidiaries as well. A decay factor was implemented such that the farther a
subsidiary was from the initial document, the less its contribution would be.
Xu and Croft [215] outline two broad domains for web-based hyperlink informa-
tion: global link information and local link information. Global link information is
computed from a full web graph, based on links between all documents in a cor-
pus [40, 215]. In comparison, local link information is built for some subset of the
graph currently under examination, such as the set of documents retrieved in response
to a particular query. In many cases the additional cost involved in calculating local
link information might be unacceptable for web or WWW search systems [40].
2.4.3 Hyperlink recommendation
The hyperlink recommendation techniques examined here are similar to the biblio-
metric measures of prestige, and may be able to provide some measure of the “im-
portance”, “quality” or “authority” of a web document [31]. This hypothesis is tested
through experiments presented in Chapter 5.
2.4.3.1 Link counting / in-degree
A page’s in-degree score is a measure of its direct prestige, and is obtained through a
count of its incoming links [29, 41]. It is widely believed that a web page’s in-degree
may give some indication of its importance or popularity [219].
In an analysis of link targets Bharat et al. [25] found that the US commercial do-
main .com had higher in-degree on average than all other domains. Sites within the
.org and .net domains also had higher in-degree (on average) than sites in other
countries.
2.4.3.2 PageRank
PageRank is a more sophisticated query-independent link citation measure developed
by Page and Brin [31, 157] to “objectively and mechanically [measure] the human in-
terest and attention devoted [to web pages]” [157]. PageRank uses global link infor-
mation and is stated to be the primary link recommendation scheme employed in the
Google search engine [93] and search appliance [96]. PageRank is designed to simu-
late the behaviour of a “random web surfer” [157] who navigates a web by randomly
following links. If a page with no outgoing links is reached, the surfer jumps to a
randomly chosen bookmark. In addition to this normal surfing behaviour, the surfer
occasionally spontaneously jumps to a bookmark instead of following a link. The
PageRank of a page is the probability that the web surfer will be visiting that page at
any given moment.
PageRank is similar to bibliometric prestige, but differs by down-weighting documents that have many outgoing links; the fewer outgoing links a node has, the larger the portion of prestige it bestows on each of them. The PageRank distribution matrix (EPR) is then:
EPR_{i,j} = \frac{E_{i,j}}{\sum_{n=1}^{\dim(E)} E_{n,j}} \quad (2.20)
for the link adjacency matrix E.
The PageRank distribution matrix (EPR) is a non-negative stochastic30 matrix that
is aperiodic and irreducible.31 The PageRank calculation is a Markov process, where
PageRank is an n-state system and the distribution matrix (EPR) contains the inde-
pendent transition probabilities EPRi,j of jumping from state i to j. If the random
surfer, on leaving a node i, is equally likely to move to any state, then EPR_{1..n,j} = (1/n, ..., 1/n).
The basic formulation of a single iteration of PageRank is then:
p = p \times EPR^T \quad (2.21)
where p is initialised according to the bookmark vector (by default a unit vector) and holds the updated PageRank scores after each iteration.
Page and Brin observed that unlike scientific citation graphs, it is quite common to
find sections of the web graph that act as “rank sinks”. To address this difficulty Page
and Brin introduced a random jump (or teleport) component where, with a constant
probability d, the surfer jumps to a random bookmarked node in b. That is:
p = ((1 - d) \times b) + d \times b \times p \times EPR^T \quad (2.22)
If d = 0 or b is not broad enough, the PageRank calculation may not converge [102].
Another complexity in the PageRank calculation is nodes that act as “rank leaks”. These occur when the surfer encounters a page with no outgoing links, or a link to a page that is outside the crawl (a dangling link). One approach to resolving this issue is to
jump with certainty (a probability of one) when a dangling link is encountered [154].
This approach, and several others, are covered in more detail in Section 3.3.1. If ap-
plying the “jump with certainty” method, and using a unit b bookmark vector (such
that the random surfer has every page bookmarked), the final PageRank scores are
equivalent to the principal eigenvector of the transition matrix EPR, where EPR is
updated to include the random jump factor:
EPR_{i,j} = \frac{(1 - d)}{\dim(E)} + d \times \frac{E_{i,j}}{\sum_{n=1}^{\dim(E)} E_{n,j}} \quad (2.23)
Expressed algorithmically, the PageRank algorithm (when using “jump with cer-
tainty”) is:
R0 ← S
loop :
r ← dang(Ri)
Ri+1 ← rE + ARi
Ri+1 ← (1 − d)E + d(Ri+1)
δ ← Ri+1 − Ri 1
while δ >
30. Every node can reach any other node at any time-step (implies irreducibility).
31. Every node can reach every other node.
where R_i is the PageRank vector at iteration i, A is the link adjacency matrix (where A_{i,j} = 1 if a link exists, and is 0 otherwise), S is the initial PageRank vector (the probability that a surfer starts at a node), E is the vector of bookmarked pages (the probability that the surfer jumps to a certain node at random), dang() is a function that returns the PageRank of all nodes that have no outgoing links, r is the amount of PageRank lost due to dangling links which is distributed amongst bookmarks (after [43, 154]), d is a constant which controls the proportion of random noise (spontaneous jumping) introduced into the system to ensure stability (0 < d < 1), and ε is the convergence constant. The double bar (‖·‖_1) notation indicates an l1 norm, the sum of the absolute values of a vector's elements.
In this formulation, for a given link graph, PageRank varies according to the values
of the d constant and the set of bookmark pages E. The PageRank variants investi-
gated in this thesis are described in more detail in Section 3.3.
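A compact sketch of the random-surfer computation with the “jump with certainty” treatment of dangling links is given below. It is not the exact variant evaluated in this thesis (those are detailed in Section 3.3); in this sketch d is the probability of following a link and 1 − d the probability of jumping to a bookmark, mirroring the structure of Equation 2.22, and all names are illustrative.

import numpy as np

def pagerank(A, bookmarks=None, d=0.85, tol=1e-8, max_iter=100):
    """A: 0/1 adjacency matrix with A[i, j] = 1 if page i links to page j."""
    n = A.shape[0]
    b = np.ones(n) / n if bookmarks is None else bookmarks / bookmarks.sum()
    out_degree = A.sum(axis=1)
    safe_out = np.maximum(out_degree, 1)           # avoid division by zero
    W = A / safe_out[:, None]                      # each page splits its score over its out-links
    p = b.copy()
    for _ in range(max_iter):
        dangling = p[out_degree == 0].sum()        # "jump with certainty" for dangling pages
        p_next = (1 - d) * b + d * (p @ W + dangling * b)
        delta = np.abs(p_next - p).sum()           # l1 convergence test
        p = p_next
        if delta < tol:
            break
    return p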
2.4.3.3 Topic-specific PageRank
Further PageRank formulations seek to personalise the calculation according to user
preferences. Haveliwala [103, 104] proposed Personalised PageRank, and demon-
strated how user topic preferences may be introduced by modifying the bookmark
vector and changing the random jump targets, and thereby altering PageRank scores.
Haveliwala proposed that a bookmark vector be built for each top-level DMOZ [69]
category by including all URLs within the tree as bookmarks.
During query processing, each incoming query is classified into these categories
(represented by the influence vector v) and a new “dynamic” PageRank score is com-
puted from a weighted sum of the category-specific PageRanks (ppr), and the Page-
Rank calculation is modified to explicitly include a bookmark vector (e.g. PR(E, b) is
the PageRank calculation for the adjacency matrix E using bookmarks b). So:
ppr = PR(E, v) (2.24)
Category preferences can also be mixed. To compute a set of personalisation vec-
tors (vi) with weights (wi) for a mixture of categories:
ppr = PR\left(E, \sum_i [w_i \cdot v_i]\right) \quad (2.25)
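Assuming the pagerank() sketch given after Section 2.4.3.2, a score in the spirit of Equation 2.25 can be obtained by swapping the uniform bookmark vector for a weighted mixture of category vectors (for Haveliwala, indicator vectors over the DMOZ top-level categories). The names below are illustrative only.

import numpy as np

def topic_specific_pagerank(A, category_vectors, weights, d=0.85):
    # Mix the category-specific personalisation vectors (Equation 2.25).
    v = sum(w * vec for w, vec in zip(weights, category_vectors))
    return pagerank(A, bookmarks=np.asarray(v, dtype=float), d=d)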
2.4.4 Other hyperlink analysis methods
2.4.4.1 HITS
Hyperlink Induced Topic Search (HITS) is a method used to identify two sets of pages
that may be important: Hub pages and Authority pages [132]. Hub and Authority
pages have a mutually reinforcing relationship – a good Hub page links to many Au-
thority pages (thereby indicating high Authority co-citation), and a good Authority
page is linked-to by many Hubs (thereby indicating high Hub bibliometric coupling).
Each page in the web graph is assigned two measures of quality; an Authority score
Au[u] and a Hub score H[u]. Sometimes the act of generating HITS results sets is
termed “Topic Distillation”, but in this thesis the phrase is associated with its use in
the TREC web track experiments (described in Section 2.6.3.1).
HITS-based scores may be computed either using local or global link information.
Local HITS has two major steps: collection sampling and weight propagation. Global
HITS is computed for the entire web graph at once so there is no collection sampling
step.
When calculating local HITS a small focused web subgraph, often based around a
search engine result list, is retrieved for a particular query.32 This root set of pages is
then expanded to make a base set by downloading all pages that link to, or are linked-
to by, pages within the root set. The assumption is that, although the base set may not
be a fully connected graph, it should include a large connected component (otherwise
the computation is ineffective).
The Hub and Authority score computation is a recursive process where Au and
H are updated until convergence (initialised with all pages having the same value).
For a graph containing edges E and links between q and p the weight is distributed
according to:
Au_p = \sum_{(q,p) \in E} H_q \quad (2.26)

H_p = \sum_{(p,q) \in E} Au_q \quad (2.27)
Like PageRank, these equations can be solved using the power method [91]. Au
will converge to the principal eigenvector of ET E, and H will converge to the princi-
pal eigenvector of EET [154]. The non-principal eigenvectors can also be calculated,
and may represent different term clusters [132]. For example, three term clusters (and corresponding meanings) occur for the query ‘Jaguar’: one on the large cat, one on the Atari game console, and one on the prestige car [132].
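A compact sketch of the Hub/Authority iteration of Equations 2.26 and 2.27 is shown below, for an adjacency matrix E with E[q, p] = 1 when page q links to page p; the l2 renormalisation each round and the fixed iteration count are assumptions of this illustration.

import numpy as np

def hits(E, iterations=50):
    n = E.shape[0]
    authority = np.ones(n)
    hub = np.ones(n)
    for _ in range(iterations):
        authority = E.T @ hub              # Au_p: sum of Hub scores linking to p
        hub = E @ authority                # H_p: sum of Authority scores p links to
        authority /= np.linalg.norm(authority) or 1.0
        hub /= np.linalg.norm(hub) or 1.0
    return authority, hub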
Revisiting HITS
Several limitations of the HITS model, as presented by Kleinberg [132], were observed
and addressed by Bharat and Henzinger [26]. These are:
• Mutually reinforcing relationships between hosts. This occurs when a set of
documents on one host point to a single document on a second host.
• Automatically generated links. This occurs when web documents are generated
by tools and are not authored (recommendation) links.
• Non-relevant nodes. This arises through what Bharat and Henzinger termed
topic drift. Topic drift occurs when the local subgraph is expanded to include
32. This was originally performed using result sets from the AltaVista WWW search engine [7].
surrounding links, and as a result, pages not relevant to the initial query are
included in the graph, and therefore in the HITS calculation.
Bharat and Henzinger [26] addressed the first and second issues by assigning
a weight to identical multiple links “inversely proportional to their multiplicity”.33
To address the third problem, topic drift, they removed content outliers. This was
achieved by computing a document vector centroid and removing pages that were
dissimilar to the vector from the base set.
Lempel and Moran [140, 141] proposed a more “robust” stochastic version of HITS
called SALSA (Stochastic Algorithm for Link Structure Analysis). This algorithm aims
to address concerns that Tightly Knit Communities (TKC) affect HITS calculations. A
TKC occurs when a small collection of pages is connected so that every Hub links
to every Authority. The pages in a TKC can be ranked very highly by HITS, and
therefore achieve the principal eigenvector, even when there is a larger collection of
pages in which Hubs link to Authorities (but are not completely connected). The TKC
effect could be used by spammers to increase Hub and Authority rankings for their
pages, using techniques such as link farming.34
Calado et al. [40] observed significant improvement through the use of local and
global HITS over a document full-text-only baseline. The experiments examined a set
of 50 informational-type queries (see Section 2.6.1) extracted from a Brazilian WWW
search engine log. The queries were observed to be 1.78 terms long on average, signif-
icantly shorter than those observed in previous WWW log studies (2.35 terms [184]).
Further, it was observed that 28 of the queries were very general, and consisted of
terms such as “tabs”, “movies” and “mp3s”. The information needs behind these
queries were estimated for relevance assessment following the method proposed by
Hawking and Craswell [110]. Through the addition of local HITS to the baseline vec-
tor space ranking Calado et al. observed an improvement of 8% in precision35 at ten
documents retrieved. Through the incorporation of global HITS evidence they ob-
served an improvement of 24% in precision at ten documents retrieved. The improve-
ments were reported to be significant for local link analysis after thirty results, and for
global link analysis after ten results. Similar improvements were observed through
the use of PageRank.
2.4.5 Discussion
This thesis only considers in-degree and variants of PageRank, and not other hyper-
link recommendation techniques. In-degree is included because it is the simplest hy-
perlink recommendation measure and is cheap to compute. PageRank was chosen as
a representative of other more expensive methods because:
33. Thereby lessening the effects of nepotistic and navigational links, described in Section 3.1.5.
34. Link farms are artificial web graphs created by spammers through the generation of link spam. They are designed to funnel hyperlink evidence to a set of pages for which they desire high web rankings.
35. The precision measure is described in Section 2.6.6.1.
The precision measure is described in Section 2.6.6.1.
§2.5 Combining document evidence 33
• Google [93], one of the world’s most popular search engines, state that PageRank
is an important part of their ranking function [31, 157].
• In recent years there have been many studies of how PageRank might be im-
proved [1, 39, 105, 197], optimised [11, 102] and personalised [103, 104, 127, 130],
but there have not been any detailed evaluations of its potential benefit to re-
trieval effectiveness [8, 78].
• PageRank has been observed to be more resilient to small changes in the web
graph than HITS [154]. This may be an important property when dealing with
WWW-based search as it is difficult to construct an accurate and complete web
graph (see Chapter 3), and the web graph is likely to be impacted by web server
down-time [52].
• PageRank has previously been observed to exhibit similar performance to non-
query-dependent HITS (global HITS) [40].
• While locally computed HITS may perform quite differently to global HITS, the
cost of computing HITS at query-time is prohibitive in most production web and
WWW search systems [132].
2.5 Combining document evidence
There are many ways in which the different types of evidence examined in the previ-
ous two sections could be combined into a single ranking function. It is important that
the combination method is effective, as a poor combination could lead to spurious re-
sults. This section describes several methods that can be used to combine document
evidence.
The discussion of combination methods is split into two sub-sections. The first
sub-section reviews score and rank-based fusion methods. In fusion methods the out-
put from ranking function components is combined without prior knowledge of the
underlying retrieval model (how documents were ranked and scored). The second
sub-section reviews modifications to full-text retrieval models such that they include
more than one form of document evidence.
2.5.1 Score/rank fusion methods
Score or rank-based fusion techniques attempt to merge document rankings based
either on document ranks, or document scores, without prior knowledge of the un-
derlying retrieval model.
The combination of multiple forms of document evidence into a single ranking is
similar to the results merging problem in meta-search, where the ranked output from
several systems are consolidated to a single ranking. A comprehensive discussion of
meta-search data fusion techniques is provided by Montague in [151].
2.5.1.1 Linear combination of scores
The simplest method for combining evidence is with a linear combination of docu-
ment scores. A linear combination of scores is referred to as combSUM in distributed
information retrieval research [83]. In a linear combination the total score S for a doc-
ument D and query Q, using document scoring functions F1..N is:
S(D, Q) = F1(D, Q) + ... + FN (D, Q) (2.28)
For a linear combination of scores to be effective, scores need to be normalised to a
common scale, and exhibit compatible distributions. As the forms of evidence consid-
ered in this thesis display different distributions, a simple linear combination of scores
may not be effective. In-degree and PageRank values are distributed according to a
power law [35, 159]. By contrast, Okapi BM25 scores are not distributed according
to a power law. Examples of two Okapi BM25 distributions, for the top 1000 doc-
uments retrieved for 100 queries used in experiments in Chapter 7, are included in
Appendix F.
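A sketch of a combSUM-style combination is shown below. Because raw scores from different evidence types sit on incompatible scales, a simple min-max normalisation is applied first; that normalisation choice is an assumption of this illustration rather than the method used in later chapters, and all names are illustrative.

def min_max_normalise(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def comb_sum(runs):
    """runs: list of {doc_id: score} dictionaries, one per form of evidence."""
    combined = {}
    for run in runs:
        for doc, score in min_max_normalise(run).items():
            combined[doc] = combined.get(doc, 0.0) + score     # Equation 2.28
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)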
2.5.1.2 Re-ranking
Another method for combining document rankings “post hoc” is to re-rank docu-
ments above some cutoff using another form of document evidence [178]. The re-
ranking cutoffs can be tuned using a training set.
Re-ranking based combinations have the advantage of not requiring a full under-
standing of the distribution of scores underlying each type of evidence, as only the
ordering of lists can be considered. However, this type of re-ranking may be insensi-
tive to the magnitude of difference between scores.36 A further disadvantage of this
method is that it is relatively expensive to re-rank long result lists.
2.5.1.3 Meta-search fusion techniques
Further methods proposed for the fusion of meta-search results include:
• combMNZ: a document’s normalised scores are summed and the sum is then multiplied by the number of runs that assign the document a non-zero score [83].
• combSUM: a linear combination of scores [83] (described above).
• combMAX, combMIN, combMED: In combMAX the maximum score of all runs
is considered. In combMIN the minimum score of all runs is considered. In
combMED the median score of all runs is considered. These methods have pre-
viously been observed to be inferior to combMNZ and combSUM [83]. Fur-
ther, these types of combinations do not make sense when used with query-independent evidence, as such evidence provides an overall ranking of documents, and needs to be used in conjunction with some form of query-dependent evidence for query processing.

36. If re-ranking using the relative ranks of documents only, the magnitude of score differences in all forms of evidence is lost. By contrast, if re-ranking based on some score-based measure, only the magnitude of score differences in the evidence used to re-rank documents is lost.
Other techniques include Condorcet fuse, Borda fuse and the reciprocal rank func-
tion [151]. Recent empirical evidence suggests that when combining document rank-
ings these methods are inferior to those outlined above [155].
2.5.1.4 Rank aggregation
A further method proposed for combining the ranked results lists of meta-search sys-
tems [72] and document rankings [78], is rank aggregation [79]. In rank aggregation
the union of several ranked lists is taken, and the lists are merged into a single rank-
ing with the least disturbance to any of the underlying rankings. This may reduce the promotion of documents that score highly in only one or a small number of runs but poorly in the remaining runs.
The rank aggregation process can make it difficult to measure and control the con-
tribution of each form of evidence. For this reason, rank aggregation techniques are
not considered in this thesis.
2.5.1.5 Using minimum query-independent evidence thresholds
Implementing a threshold involves setting a minimum query-independent score that
a document must exceed to be considered by the ranking function. That is, for some
threshold τ, if QIE(D) < τ then P(R|D, Q) is estimated to be zero.37 The use of
a static threshold means that some documents may never be retrieved. A more ef-
fective technique might exploit query match statistics to dynamically determine the
minimum threshold.
The potential benefits of using thresholds are two-fold: as an effective method by
which to remove uninteresting pages (such as spam or less frequently visited pages),
and the improvement of computational performance (by reducing the number of doc-
uments to be scored, see Section 6.2.2).
2.5.2 Revising retrieval models to address combination of evidence
Rather than combining document evidence post hoc, the underlying retrieval models
can be modified to include further document evidence. The approaches outlined be-
low combine several forms of document evidence in a single unified retrieval model,
through modifications to the full-text ranking algorithms discussed in Section 2.3.1.
37. This is similar to a rank-based re-ranking of query-independent evidence (described in Section 2.5.1.2) as documents above the cutoff are re-ranked. In comparison, the use of a cutoff does not require a full ranking of query-independent evidence, but means that some documents may never be retrieved.
2.5.2.1 Field-weighted Okapi BM25
Field-weighted Okapi BM25, proposed by Robertson et al. [173], is a modification of
Okapi BM25 that combines multiple sources of evidence in a single document rank-
ing function. Conceptually the field weighting model involves the creation of a new
composite document that includes evidence from multiple document fields.38 The im-
portance of fields in the ranking function can be modified by re-weighting their con-
tribution. For example, a two-fold weighting of title compared to document full-text
would see the title repeated twice in the composite document.
If used with Okapi BM25 the score and rank fusion techniques outlined in Sec-
tion 2.5.1 invalidate the non-linear term saturation component and may thereby lessen
retrieval effectiveness [173]. The use of such post hoc score combination means that a
document matching a single query term over multiple fields may outperform a docu-
ment that matches several query terms in a single document field.
In Okapi BM25, the score of a document is equal to the sum of the BM25 scores of
each of its terms:
S(D, Q) = \sum_{t \in Q} \mathrm{BM25w}_{t,D} \qquad (2.29)
The score for each term is calculated using a term weighting function and a measure
of term rarity within the corpus (idf ):
\mathrm{BM25w}_{t,D} = f(\mathit{tf}_{t,D}) \times \mathit{idf}_t \qquad (2.30)
The term weighting function consists of term saturation and document length nor-
malisation components:
f(\mathit{tf}_{t,D}) = \frac{\mathit{tf}_{t,D}}{k_1 + \mathit{tf}_{t,D}}, \qquad f(\mathit{tf}_{t,D}) = \frac{\mathit{tf}_{t,D}}{\beta}, \quad \text{where } \beta = k_1\left((1 - b) + b\,\frac{dl}{avdl}\right) \qquad (2.31)
where dl is the current document length, and avdl is the average length of a document
in the corpus. These components are combined to form:
\mathrm{BM25w}_{t,D} = \frac{\mathit{tf}_{t,D}}{k_1\left((1 - b) + b\,\frac{dl}{avdl}\right) + \mathit{tf}_{t,D}} \times \mathit{idf}_t \qquad (2.32)
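Equations 2.29–2.32 translate directly into code. The Python sketch below is illustrative only: the idf variant and the parameter values k1 and b are assumptions made here, not necessarily those used in the experiments of this thesis.

import math

def bm25_term_weight(tf, dl, avdl, df, N, k1=2.0, b=0.75):
    """BM25 contribution of a single term (Equation 2.32).

    tf: term frequency in the document; dl: document length;
    avdl: average document length; df: document frequency of the term;
    N: number of documents in the corpus."""
    idf = math.log(N / df)                      # one common idf variant (an assumption)
    beta = k1 * ((1.0 - b) + b * dl / avdl)     # length-normalised saturation constant
    return tf / (beta + tf) * idf

def bm25_score(query_terms, doc_tf, dl, avdl, df, N):
    """Document score as the sum of per-term weights (Equation 2.29)."""
    return sum(bm25_term_weight(doc_tf.get(t, 0), dl, avdl, df[t], N)
               for t in query_terms if t in df)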
In the Field-weighted Okapi BM25 model, documents are seen to contain fields F1..FN
each holding a different form of (query-dependent) document evidence:
F = (F_1, \ldots, F_N) \qquad (2.33)
38 Document fields are some form of query-dependent document evidence, such as document title, full-text or anchor-text.
and each field is assigned a weight:
w = (w_1, \ldots, w_N), \qquad \mathit{wtf}_{t,D} := \sum_{F=1}^{N} \mathit{tf}_{t,F} \times w_F \qquad (2.34)
where w is a vector of field weights, and wtf is the weighted term frequency. The
contribution of terms is then:
f_w(\mathit{wtf}_{t,D}) = \frac{\mathit{wtf}_{t,D}}{k_1 + \mathit{wtf}_{t,D}}, \qquad f_w(\mathit{wtf}_{t,D}) = \frac{\mathit{wtf}_{t,D}}{\beta} \qquad (2.35)
and the document length is updated to reflect the new composite document length:
\mathit{wdl} := \sum_{f=1}^{N} dl_f \times w_f, \qquad \mathit{wavdl} := \sum_{f=1}^{N} avdl_f \times w_f \qquad (2.36)
The final formulation for Field-weighted Okapi BM25 is then:
\mathrm{BM25w}^{FW}_{t,D} = \frac{\mathit{wtf}_{t,D}}{k_1\left((1 - b) + b\,\frac{\mathit{wdl}}{\mathit{wavdl}}\right) + \mathit{wtf}_{t,D}} \times \mathit{idf}_t \qquad (2.37)
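Equation 2.37 can likewise be sketched in code; the per-field data structures and default parameter values below are illustrative assumptions.

def bm25fw_term_weight(tf_by_field, dl_by_field, avdl_by_field, field_weights,
                       idf, k1=2.0, b=0.75):
    """Field-weighted BM25 contribution of a single term (Equation 2.37).

    tf_by_field: term frequency of the term in each field;
    dl_by_field / avdl_by_field: per-field document and average lengths;
    field_weights: weight w_F assigned to each field."""
    wtf = sum(tf_by_field.get(f, 0) * w for f, w in field_weights.items())
    wdl = sum(dl_by_field.get(f, 0) * w for f, w in field_weights.items())
    wavdl = sum(avdl_by_field[f] * w for f, w in field_weights.items())
    beta = k1 * ((1.0 - b) + b * wdl / wavdl)
    return wtf / (beta + wtf) * idf

A two-fold weighting of title relative to full-text, as in the example above, would correspond to field_weights such as {'title': 2.0, 'fulltext': 1.0}.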
2.5.2.2 Language mixture models
In the same way that document models are combined with collection models in order
to smooth the ranking in language models, document models may also be combined
with other language models for the same documents [135, 155]. For example, to com-
bine the language models for anchor-text and content document evidence:
P(D|Q) = P(D) \prod_{t \in Q} \left[ (1 - \lambda - \gamma)\,P(t|C) + \lambda\,P_{\mathrm{anchor}}(t|D) + \gamma\,P_{\mathrm{content}}(t|D) \right] \qquad (2.38)
Language mixture models have been used to good effect when combining mul-
tiple modalities for multimedia retrieval in video [211]. Indeed, combining multiple
modalities for multimedia retrieval is a similar problem to that of combining multiple
forms of text-based document evidence.
Kraaij et al. [135] incorporate query-independent evidence into a language mixture
model by computing and including prior probabilities of document relevance. Here
P(D) is set according to the prior probability that a document will be relevant, given
the document and corpus properties. The prior relevance probabilities are estimated
by evaluating how a particular feature affects relevance judgements using training
data.
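A sketch of scoring under Equation 2.38, with a document prior in the manner of Kraaij et al., is given below. It is computed in log space for numerical stability; the mixture weights, smoothing floor and data structures are illustrative assumptions rather than the models used in this thesis.

import math

def mixture_log_score(query_terms, p_content, p_anchor, p_collection, prior,
                      lam=0.3, gamma=0.5):
    """Log of P(D|Q) under Equation 2.38 for a single document.

    p_content / p_anchor: per-term P(t|D) under the content and anchor-text models;
    p_collection: background model P(t|C); prior: P(D)."""
    score = math.log(prior)
    for t in query_terms:
        p = ((1.0 - lam - gamma) * p_collection.get(t, 1e-9)
             + lam * p_anchor.get(t, 0.0)
             + gamma * p_content.get(t, 0.0))
        score += math.log(max(p, 1e-12))      # floor avoids log(0)
    return score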
2.6 Evaluation
Search system performance may be measured over many different dimensions, such
as economy in the use of computational resources, speed of query processing, or user
satisfaction with search results [209]. It is unlikely that a single system will outperform
all others on each of these dimensions, and accordingly it is important to understand
the tradeoffs involved ([191], pp.167).
This thesis is primarily concerned with retrieval effectiveness, that is, how well a
given system or algorithm can match and retrieve documents that are most useful or
relevant to the user’s information need [150]. This is difficult to quantify precisely as it
involves assigning some measure to the value of information retrieved ([191], pp.167).
Judgements of information value are expensive39 and difficult to collect in a way that
is representative of needs and judgements of the intended search system users [209].
In addition, the effectiveness of a system depends on a number of system compo-
nents, and identifying those responsible for a particular outcome in an uncontrolled
environment can be difficult (typical web search system components are described in
Section 2.1).
The use of a test collection is a robust method for the evaluation of retrieval effec-
tiveness and avoids some of the cost involved in performing user studies. A test col-
lection consists of a snapshot of a user task and the document corpus ([191], pp.168).
This encompasses a set of documents, queries, and complete judgements for the doc-
uments according to those queries [48, 209]. Test collections allow for standard perfor-
mance baselines, reproducible results and the potential for collaborative experiments.
However, if proper care is not taken, heavily training ranking function parameters us-
ing a test collection can lead to over-tuning, particularly when training and testing on
the same test collection. In this case, observed performance gains may be unrealistic
and may not apply in general. It is therefore important to train algorithms on one test
collection, and evaluate the algorithms on another.
2.6.1 Web information needs and search taxonomy
Traditional information retrieval evaluations and early TREC web experiments evalu-
ated retrieval effectiveness according to how well methods fulfilled informational-type
search requests (i.e. finding documents that contain relevant text) [48, 176, 191, 205].
An early evaluation of WWW search engines examined their performance on an infor-
mational search task and found it to be below that of the then state-of-the-art TREC
systems [112]. Recent research suggests, however, that the task evaluated was not
typical of WWW search tasks [26, 33, 56, 75, 185]. Broder [33] argues that WWW user
information needs are often not of an informational nature and nominates three key
WWW-based retrieval tasks:
Navigational: a need to locate a particular page or site given its name. An example of
such a query is “CSIRO” where the correct answer would be the CSIRO WWW
site home page.
Informational: a need to acquire or learn some information that will be present in
one or more web pages. An example of such a query is “Thesis formatting advice” where correct (relevant) answers would contain advice relating to how a thesis should be formatted.
39 In that employing judges to rate documents may be a financially expensive operation.
Transactional: a need to perform an activity on the WWW. An example of such a
query is “apply for a Californian driver’s licence” where the correct answer
would be a page from which a user could apply for a Californian driver’s li-
cence.
2.6.2 Navigational search tasks
Navigational search, particularly home page finding, is the focus of experiments
within this thesis. Navigational search is an important WWW and web search task
which has been shown to be inadequately fulfilled using full-text-based ranking meth-
ods [56, 185]. Evidence derived from query logs suggests that navigational search
makes up a significant proportion of the total WWW search requests [75]. Naviga-
tional search also provides an important cornerstone in the support of search-and-
browse based interaction. Two prominent navigational search tasks, home page find-
ing and named page finding, are described in more detail below.
2.6.2.1 Home page finding
The home page finding task is: given the name of an entity, retrieve that entity’s
home page. An example of a home page finding search is when a user wants to visit
http://trec.nist.gov and submits the query “Text REtrieval
Conference”. The task is similar to Bharat and Mihaila’s organisation search [27],
where users provided web site naming queries, and Singhal and Kaszkiel’s site-finding
experiment [185], where queries were taken from an Excite log and judged as home
page finding queries [77].
Home page finding queries typically specify entities such as people, companies,
departments and products.40 A searcher who submits an entity name as a query is
likely to be pleased to find a home page for that entity at the top of the list of search
results, even if they were looking for information. It is in this way that home pages
may also provide primary-source information in response to informational and trans-
actional queries [33, 198].
2.6.2.2 Named page finding
The named page finding task can be seen as a superset of the home page finding task,
and includes queries naming both non-home page and home page documents [53].
Accordingly, the objective of the named page finding task is to find a particular web
page given a page naming query.
40 For example: ‘Trystan Upstill’, ‘CSIRO’, ‘Computer Science’ or ‘Panoptic’.
2.6.3 Informational search tasks
Two prominent informational search tasks evaluated in previous web-based experi-
ments [111] are: the search for pages relevant to an informational need (evaluated in
TREC ad-hoc [119]), and Topic Distillation [53]. Experiments within this thesis con-
sider the Topic Distillation task, but not the traditional ad-hoc informational search
task. The ad-hoc informational search task is described in detail in [111].
2.6.3.1 Topic Distillation
The Topic Distillation task asks systems to construct a list of key resources on some
broad topic, similar to those compiled by human editors of WWW directories [111].
More precisely, in TREC experiments the task is defined as: given a search topic, find
the key resources for that topic [111]. An example of a Topic Distillation query might
be “cotton industry” where the information need modelled might be “give me all
sites in the corpus about the cotton industry, by listing their home pages” [60]. A
good resource is deemed to be an entry point for a site which is “principally devoted
to the topic, provides credible information on the topic, and is not part of larger site
also principally devoted to the topic” [60].
While Topic Distillation is primarily an informational search task, it is somewhat
similar to navigational search tasks. The goal in both cases is to retrieve good en-
try points to relevant information units [58]. Indeed, experiments within this thesis
demonstrate that methods useful in home page finding are also effective for Topic
Distillation tasks (in Chapter 9).
2.6.4 Transactional search tasks
To date there has been little direct evaluation of transactional search tasks [111] and at
the time of writing there are currently no reusable transactional test collections. While
transactional search tasks are not the focus of this thesis, a case study that examines
WWW search engine transactional search performance is presented in Chapter 4.
2.6.5 Evaluation strategies / judging relevance
This section describes methods used to collect snapshots of queries and document
relevance judgements with which retrieval effectiveness can be evaluated.
2.6.5.1 Human relevance judging
The most accurate method for judging user satisfaction with results retrieved by a
search system is to record human judgements. However, care needs to be taken when
collecting human judgements that:
• Judges are representative of the general population using the search tool. In
particular if information needs behind given queries are to be modelled (as
in [106, 108]), the user demographic responsible for the query should be taken
into account in order to estimate the underlying need.
• Relevance judgements are correlated with the retrieval task modelled. This may
be difficult as judging instructions for the same query can be interpreted in sev-
eral ways [58].
Judging informational type queries
The scale of large corpora makes the generation of complete relevance judgements
(i.e. judging every document for every query) impossible. In the TREC conference,
judgement pools are created, which comprise the union of the top 100 retrieved docu-
ments per run submitted. These document pools are judged so that complete top 100
relevance judgements are collected for all runs submitted to TREC.41 All non-judged
documents are assumed to be non-relevant. Therefore when these judgements are
used in post hoc experiments, the judgements are incomplete and so relevant docu-
ments will likely be marked non-relevant [209]. The judgement pooling process was
used when judging runs submitted to the TREC Topic Distillation task.
These measures have been used when judging informational queries for several
decades [48, 119, 207]. Some relevant observations about such judging are:
• Agreement between human relevance judges is less than perfect. Voorhees and
Harman [210] reported that 71.7% of judgements made by three assessors for
14 968 documents were unanimous. However, Voorhees [208] later found that
while substituting relevance judgements made by different human assessors
changed score magnitude, it had a negligible effect on the rank order of sys-
tems [119, 208].
• When dealing with pooled relevance judgements un-judged documents are as-
sumed to be non-relevant. This may result in bias against systems that retrieve
documents not typically retrieved by the evaluated systems. Two investigations of this phenomenon have reported that while incomplete judging of documents may affect individual system scores, it is not likely to affect the rank-
ing of systems [206, 220].
• The order of search results affects relevance judgements [76]. However, in later
work it was found that this was not the case when judging less than fifteen
documents [160].
Judging known item or named item queries
In comparison to informational type queries, the cost of judging named item queries
(such as home page finding and named page finding queries) is much lower, and the judging is less contentious. Named item queries are developed by judges navigating to a page in the collection and generating a query designed to retrieve that page. The judging consists of checking retrieved documents to determine whether they are duplicates of the correct page (which can be performed semi-automatically [114]).
41 Although every group is asked to nominate an order of importance for its submitted runs, in case full pooled judgements cannot be completed in time.
2.6.5.2 Implicit human judgements
Implicit human judgements can be collected by examining how a user navigates
through a set of search results [129]. Evaluations based on this data may be attrac-
tive for WWW search engines as such data are easy and inexpensive to collect.
One way to collect implicit relevance judgements is through monitoring
click-through of search results [129]. However, the quality of judging obtained based
on this method may depend on how informative the document summaries are, as
the summaries must allow the user to make a satisfactory relevance-based “click-
through” decision. Also, given the implicit user preference for clicking on the first
result retrieved (as it has been most highly recommended by the search system), ob-
served effectiveness scores are likely to be unrealistically high.
If directly comparing algorithms using “click-through”-based evaluation, care must
be taken to ensure competing systems are compared meaningfully. Joachims [129]
proposed that the ranked output from each search algorithm under consideration be
interleaved in the results list, and the order of the algorithms be reversed following
each query so as not to preference one algorithm over the other (thereby removing the
effect of user bias towards the first correct document).
2.6.5.3 Judgements based on authoritative links
A set of navigational queries can be constructed cheaply by sourcing queries and
judgements automatically from human-generated lists of important web resources.
The anchor-text of links within such lists can be used as the queries, and the corre-
sponding target documents as the query answers.
Two recent studies use online WWW directories as authoritative sources for sam-
ple query-document pairs [8, 55]. An extension proposed by Hawking et al. [114] is
the use of site maps found on many web sites as a source of query-document pairs.
In all these methods it is important to remove the query/result source from the cor-
pus prior to query processing, as otherwise anchor-text measures will have an unfair
advantage.
2.6.6 Evaluation measures
2.6.6.1 Precision and recall
Precision and recall are the standard measures for evaluating information retrieval for
informational tasks [14]. Precision is the proportion of retrieved documents that are
relevant to a query at a particular rank cutoff, i.e.:
\mathrm{precision}(k) = \frac{1}{k} \sum_{1 \le i \le k} r_i \qquad (2.39)
where k is the rank cutoff, Rk is the set of documents from D that are relevant to the query Q at cutoff k, (D1 . . . Dn) is the ranked list of documents returned by the system, and ri = 1 if Di ∈ Rk and ri = 0 otherwise.
Recall is the total proportion of all relevant documents that have been retrieved
within a particular cut-off for a query, i.e.:
\mathrm{recall}(k) = \frac{1}{|R_Q|} \sum_{1 \le i \le k} r_i \qquad (2.40)
where RQ is the set of all documents relevant to the query Q.
In large-scale test collections, recall cannot be measured as it is too difficult to obtain
relevance judgements for all documents (as it is too expensive to judge a very large
document pool). Therefore recall is not often considered in web search evaluations.
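Equations 2.39 and 2.40 can be computed directly from a ranked list and a set of judged-relevant documents; a minimal Python sketch (the names are illustrative):

def precision_at(k, ranking, relevant):
    """Proportion of the top-k retrieved documents that are relevant (Equation 2.39)."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def recall_at(k, ranking, relevant):
    """Proportion of all relevant documents retrieved within the top k (Equation 2.40)."""
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)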
The measures of precision and recall are intrinsically tied together, as an increase
in recall almost always results in a decrease in precision. In fact precision can be ex-
plicitly traded off for recall by increasing k; for very large k every document in the
corpus is retrieved so perfect recall is assured. Given the expected drop-off in preci-
sion when increasing recall it can be informative to plot a graph of precision against
recall [14]. Precision-recall graphs allow for a closer examination of the distribution of
relevant and irrelevant documents retrieved by the system.
Both precision and recall are unstable at very early cutoffs, and it is therefore more
difficult to achieve statistical significance when comparing runs [36]. However, as
WWW search users tend to evaluate only the first few answers retrieved [99, 184],
precision at early cutoffs may be an important measure for WWW search systems.
Counter-intuitively, rather than precision decreasing when a large collection of
documents is searched, empirical evidence suggests that precision is increased [118].42
This phenomenon was examined in detail by Hawking and Robertson [118] who ex-
plained it in terms of signal detection theory.
A further measure of system performance is average precision:
\mathrm{average\ precision} = \frac{1}{|R_Q|} \sum_{1 \le k \le |D|} r_k \times \mathrm{precision}(k) \qquad (2.41)
where the sum accumulates precision(k) at each rank k at which a relevant document is observed (i.e. rk = 1). Average preci-
sion gives an indication of how many irrelevant documents must be examined before
all relevant documents are found. The average precision is 1 if the system retrieves all
relevant documents without retrieving any irrelevant documents. Average precision
figures are obtained after each new relevant document is observed.
R-precision is a computation of precision at the R-th position in the ranking (i.e. precision(R)), where R is the total number of relevant documents for that query. R-precision is a useful parameter for averaging algorithm behaviour across several queries [14].
42 These gains were tested at early precision with a cutoff that did not grow with collection size. Also the collection grew homogeneously, such that the content did not degrade during the crawl (as might be observed by crawling more content and thereby retrieving more spam on the WWW).
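Average precision (Equation 2.41) and R-precision follow the same pattern; again a minimal, illustrative sketch:

def average_precision(ranking, relevant):
    """Mean of precision(k) over the ranks k holding relevant documents (Equation 2.41)."""
    hits, total = 0, 0.0
    for k, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / k          # precision at rank k
    return total / len(relevant)

def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    R = len(relevant)
    return sum(1 for d in ranking[:R] if d in relevant) / R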
2.6.6.2 Mean Reciprocal Rank and success rates
Both the Mean Reciprocal Rank and success rate measures give an indication of how
many low value results a user would have to skip before reaching the correct an-
swer [110], or the first relevant answer [180].
The Mean Reciprocal Rank (MRR) measure is commonly used when there is only
one correct answer. For each query examined, the rank of the first correct document
is recorded. The score for that query is then the reciprocal of the rank at which the
document was retrieved. If there is only one relevant answer retrieved by the system,
then the MRR score corresponds exactly to average precision. The score for a system
as a whole is taken by averaging across all queries.
The success rate measure is often used when measuring effectiveness for exact
match queries, such as home page finding and named page finding tasks. Success rate
is indicated by S@k, where k is the cutoff rank and indicates the percentage of queries
for which the correct answer was retrieved in the top k ranks [56]. The “I’m feeling
lucky” button on Google [93] takes a user to the first retrieved result; accordingly, S@1 is the rate at which clicking such a button would take the user to a correct answer.
The success rate at 5, or S@5 is sometimes measured as it represents how often the
correct answer might be visible in the first results page without scrolling (“above the
fold”) [184]. The S@10 measures how often the correct page is returned within the
first 10 results.
These measures may provide important insight as to the utility of a document
ranking function. Silverstein et al. observed from a series of WWW logs that 85% of
query sessions never proceed past the first page of results [184]. Further, it has recently
been demonstrated that more time is spent by users examining results ranked highly,
with less attention paid to results beyond rank 5.43 All results beyond rank 5 were
observed to, on average, be examined for 15% of the time that was spent examining
the top result. These findings illustrate the importance of precision at high cutoffs and
success rates for WWW search systems.
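The MRR and S@k measures can be sketched as follows, assuming one ranked list and one set of correct answers per query (the names are illustrative):

def reciprocal_rank(ranking, correct):
    """1/rank of the first correct answer, or 0 if no correct answer is retrieved."""
    for k, d in enumerate(ranking, start=1):
        if d in correct:
            return 1.0 / k
    return 0.0

def mean_reciprocal_rank(rankings, correct_answers):
    """MRR averaged over all queries."""
    return sum(reciprocal_rank(r, c)
               for r, c in zip(rankings, correct_answers)) / len(rankings)

def success_at(k, rankings, correct_answers):
    """S@k: proportion of queries with a correct answer in the top k ranks."""
    hits = sum(1 for r, c in zip(rankings, correct_answers)
               if any(d in c for d in r[:k]))
    return hits / len(rankings)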
2.6.7 The Text REtrieval Conference
The Text REtrieval Conference (TREC) was established in 1992 by the National Insti-
tute of Standards and Technology (NIST) and the Defence Advanced Research Projects
Agency (DARPA). The conference was initiated to promote the understanding of In-
formation Retrieval algorithms by allowing research groups to compare effectiveness
on common test collections. Voorhees and Harman present a comprehensive history of the TREC conference and the TREC web track development in [111]. As outlined by Hawking et al. [117], the benefits of TREC evaluations include: the potential for reproducible results, the blind testing of relevance judgements, the sharing of these judgements, the potential for collaborative experiments, and the extensive training sets created for TREC.
43 In this experiment 75% of the users reported that Google was their primary search engine. These users’ prior experience with Google may be that the top ranked answer is often the correct document and that effectiveness drops off quickly, which could affect these results.
2.6.7.1 TREC corpora used in this thesis
Several TREC web track corpora are used and evaluated within this thesis – namely
the TREC VLC2, WT10g and .GOV TREC corpora. Some experiments also use query
sets from the TREC web track of 2001 and 2002 [53], and from the non-interactive web
track of 2003 [60]. These query sets include home page finding sets (2001 and 2003),
named page finding sets (2002 and 2003) and Topic Distillation sets (2002 and 2003).
These query sets and corresponding task descriptions are discussed in Section 2.6.7.2.
• TREC VLC2: is a 100GB corpus containing 18.5 million web documents. This
corpus is one third of an Internet Archive general WWW crawl gathered in
1997 [119]. The size of this corpus is comparable to the size of Google’s index at
the time of its launch (of around 24 million pages [31]). Current search engines
index two orders of magnitude more data [93].
• TREC WT10g: is a 10GB corpus containing a 1.7 million document subset of
the VLC2 corpus [15]. The corpus was designed to be representative of a small
highly connected web crawl. When building the corpus, duplicates, non-English
and binary documents were removed.
• TREC .GOV: is an 18.5GB corpus containing 1.25 million documents crawled
from the US .GOV domain in 2001 [53]. Redirect and duplicate document infor-
mation is available for this corpus (but not WT10g or VLC2).
There is debate as to whether the TREC web track corpora are representative of
larger WWW-based crawls, in particular whether the linkage patterns and density
are comparable (and therefore whether methods useful in WWW-based search would
be applicable to smaller scale web search) [100]. Recent work by Soboroff [188] has
reported that the WT10g and .GOV TREC web corpora do exhibit important charac-
teristics present in the WWW.
2.6.7.2 TREC web track evaluations
TREC 2001 web track
The TREC 2001 web track evaluated two search tasks over the WT10g web corpus
(described in Section 2.6.7.1): home page finding and relevance-based (ad-hoc) infor-
mational search. The objective of the home page finding task was to find a home page
given some query created to name the page (as described in Section 2.6.2.1). The objec-
tive of the relevance-based informational search task was to find documents relevant
to some topic, given a short summary query. Experiments in Chapter 7 of this thesis
make use of data from the TREC 2001 web track home page finding task. The ad-hoc
informational search task is not considered.
For the 2001 home page finding task, 145 queries were created by NIST assessors
by navigating to a home page within the WT10g corpus and composing a query de-
signed to locate that home page [110]. A training set of 100 home page finding queries
and correct answers, created in the same way, was provided before the TREC evalu-
ation to allow participants to train their systems for home page finding [56]. Systems
were compared officially on the basis of the rank of the first answer (the correct home
page, or an equivalent duplicate page). Search system performance was compared
using the Mean Reciprocal Rank of the first correct answer and success rate (both de-
fined in Section 2.6.6.2).
TREC 2002 web track
The TREC 2002 web track44 evaluated two search tasks over the .GOV web corpus
(described in Section 2.6.7.1): named page finding and Topic Distillation. The objec-
tive of the named page finding task was to find a particular web page given a page
naming query (as described in Section 2.6.2.2). The objective of the Topic Distillation
task was to retrieve entry points to relevant resources rather than relevant documents
(as described in Section 2.6.3.1).
This thesis includes experiments that use data from both of these web track tasks.
Data from the 2002 TREC Topic Distillation task are used sparingly as the task is con-
sidered to be closer to a traditional ad-hoc informational task than to a Topic Distillation task [53, 111].
For the 2002 named page finding task, 150 queries were created by NIST assessors
by accessing a random page within the .GOV corpus and then composing a query
designed to locate that page [53]. Systems were compared officially on the basis of
the rank of the first answer (the correct page, or an equivalent duplicate page), using
Mean Reciprocal Rank and success rates (both measures defined in Section 2.6.6.2).
The 2002 Topic Distillation task consisted of 50 queries created by NIST to be repre-
sentative of broad topics in the .GOV corpus (however, the topics chosen are believed
to have not been sufficiently broad [53]). System effectiveness was compared using
the precision @ 10 measure.
TREC 2003 web track
The TREC 2003 web track evaluated two further search tasks over the .GOV web cor-
pus: a combined home page / named page finding task, and a revised Topic Distilla-
tion task. The objective of the combined task was to evaluate whether systems could
fulfil both types of query without prior knowledge of whether queries named home pages or other pages. The objective of the Topic Distillation task was to find entry points to relevant resources given a broad query (as described in Section 2.6.3.1).
44 The report for the official submissions to the 2002 TREC web track (csiro02–) is included in Appendix D, but the results from these experiments are not discussed further.
The instructions given to the relevance judges in the 2003 Topic Distillation task
differed from those given in 2002. In 2003 the judges were asked to emphasise
“home page-ness” more than in the 2002 Topic Distillation task, and broader queries
were used to ensure that some sites devoted to the topic existed [60].
The TREC 2003 combined home page / named page task consisted of a total of 300
queries, with an equal mix of home page and named page queries. The query set was
selected using the methods previously used for generating the query/result sets for
the 2001 home page finding task and the 2002 named page finding task. Systems were
compared officially on the basis of the rank of the first retrieved answer, using Mean
Reciprocal Rank and success rate measures.
The TREC 2003 Topic Distillation task consisted of 50 queries created by NIST to
be representative of broad topics in the .GOV corpus. Judges ensured that queries
were “broad” by submitting candidate topics to a search system in order to determine
whether there were sufficient matches for the proposed topics.
Systems were compared officially on the basis of R-precision as many of the topics
did not have 10 correct results (and thus precision @ 10 was not a viable measure).
Later work by Soboroff challenged the use of these measures, and demonstrated that
precision @ 10 would have been a superior evaluation measure [189].
Chapter 3
Hyperlink methods -
implementation issues
The value of hyperlink evidence may be seriously degraded if the algorithms that
exploit it are not well implemented. Thus hyperlink-based evidence is intrinsically
dependent on the accuracy and completeness of the web graph from which it is calcu-
lated.
This chapter documents and justifies implementation decisions taken during em-
pirical work and details limitations of the corpora available for use.
3.1 Building the web graph
An ideally accurate web graph would be one where all hyperlinks in the corpus rep-
resented the intentions of the document author when the hyperlinks were created.
Such accuracy would require hyperlink authors to be consulted during web graph
construction to confirm that their hyperlinks were pointing to the web content they
intended to link to. In most cases this process would not be feasible. Therefore the
discussion of graph accuracy within this chapter relates to how likely it is that the web
graph is an accurate representation of web authors’ link intentions. The discussion of
web graph completeness refers to the amount of hyperlink evidence directed at docu-
ments within the corpus that has been successfully assigned to the target document
(and not lost). To ensure web graph accuracy and completeness:
• Document URLs need to be resolved;
• Duplicate documents may need to be removed;
• Hyperlink redirects may need to be followed;
• Dynamic page content may need to be detected; and
• Links created for reasons other than recommendation may need to be removed.
The following sections discuss, in turn, how each of these requirements has been ad-
dressed when building a representation of the web graph.
3.1.1 URL address resolution
Hyperlink targets can be expressed as fully qualified absolute addresses (such as http://cs.anu.edu.au/index.html) or provided as an address relative to the hyperlink source (such as ../index.html from http://cs.anu.edu.au/~Trystan.Upstill/index.html). Whether addressed using relative or absolute URLs, hy-
perlinks need to be mapped to a single target document either within or external to
the corpus. Non-standard address resolution could lead to phantom pages (and sub-
graphs) being introduced into the web graph. In experiments within this thesis all
relative URLs are decoded to their associated absolute URL (if present) following the
conventions outlined in RFC 2396 [21] and additional rules detailed in Appendix B.
Some examples of address resolution are:
• A relative link to /../foo.html from http://cs.anu.edu.au/ is resolved to http://cs.anu.edu.au/foo.html;
• Links to any of http://cs.anu.edu.au/~Trystan.Upstill/index.html, http://cs.anu.edu.au//////~Trystan.Upstill//, or http://cs.anu.edu.au:80/~Trystan.Upstill/ are resolved to http://cs.anu.edu.au/~Trystan.Upstill/;
• Links to http://cs.anu.edu.au/foo.html#Trystan are resolved to http://cs.anu.edu.au/foo.html;
• Links to panopticsearch.com/ are resolved to http://www.panopticsearch.com/.
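As a rough approximation of this kind of resolution (the authoritative rules are those of RFC 2396 and Appendix B, not this fragment), the Python standard library can be used as follows:

import re
from urllib.parse import urljoin, urlsplit, urlunsplit

def resolve(base, href):
    """Resolve a (possibly relative) link against its source URL and normalise it."""
    url = urljoin(base, href)                                # relative -> absolute
    scheme, netloc, path, query, _fragment = urlsplit(url)   # drop any #fragment
    netloc = netloc.lower()
    if scheme == "http" and netloc.endswith(":80"):          # default port is redundant
        netloc = netloc[:-3]
    path = re.sub(r"/{2,}", "/", path) or "/"                # collapse repeated slashes
    return urlunsplit((scheme, netloc, path, query, ""))

# For example, resolve("http://cs.anu.edu.au/~Trystan.Upstill/index.html", "../index.html")
# yields http://cs.anu.edu.au/index.html.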
3.1.2 Duplicate documents
Duplicate and near-duplicate1 documents are prevalent in most web crawls [34, 78,
123, 183]. In a 30 million page corpus collected by AltaVista [7] from the WWW
in 1996, 20% of documents were found to be effective duplicates (either exact duplicates
or near duplicates) of other documents within the collection [34]. In a 26 million page
WWW crawl collected by Google [93] in 1997, 24% of the documents were observed to be exact duplicates [183]. In a further crawl of 80 million documents from the WWW in May 1999 [123], 8.5% of all documents downloaded were exact duplicates. In a 2003 crawl of the IBM intranet, over 75% of URLs were effective duplicates [78].
1 Near-duplicate documents share the same core content but a small part of the page is changed, such as a generation date or a navigation pane.
The presence of duplicate pages in a web graph can lead to inconsistent assign-
ment of hyperlink evidence to target documents. For example, if two documents con-
tain duplicate content, other web authors may split hyperlink evidence between the
two documents. These duplicate documents should be identified and collapsed down
to a single URL. However, if unrelated pages are mistakenly identified as duplicates
and collapsed, distortion will be introduced into the web graph and the effective-
ness of both hyperlink recommendation and anchor-text evidence may be reduced.
For example, if Microsoft and Toyota’s home pages were tagged as duplicates, all
link information for Microsoft.com might be re-assigned to Toyota’s home page, lead-
ing to http://www.toyota.com possibly being retrieved for the query ‘Microsoft’.
Therefore it is important to ensure exact (or very close) duplicate matching when as-
signing hyperlink recommendation scores and anchor-text evidence to consolidated
documents.
Common causes of duplicate documents in the corpus are:
• Host name aliasing. Host name aliasing is a technique used to assign multiple
host names to a single IP address. In some cases several host names may serve
the same set of documents under each host name. This may result in identical
sets of documents being stored for each web server alias [15, 123].
• Symbolic links between files. Symbolic links are often employed to map multi-
ple file names to the same document [123], resulting in the same content being
retrieved for several URLs. If there is no consensus amongst web authors as to
the correct URL, incoming links may be divided amongst all symbolically linked
URLs.
• Web server redirects. In many web server configurations the root of a directory
is configured to redirect to a default page (e.g. http://cs.anu.edu.au/ to
http://cs.anu.edu.au/index.html). Once again if there is no consensus
amongst web authors as to the correct URL, incoming links may be divided
amongst the URLs.
• File path equivalence. On web servers running on case-insensitive operating
systems (such as the Microsoft Windows Internet Information Server [148]) the
case of characters in the path is ignored and all case variants will map to the
same file (so Foo/, foo/ and FoO/ are all equivalent). By contrast, for web
servers running on case-sensitive operating systems (such as Apache [10] with
default settings on Linux), folder case is meaningful (so Foo/, foo/ and FoO/
may all map to different directories).
• Mirrors. A mirror is a copy of a set of web pages, served with little or no modi-
fication on another host [23, 122]. In a crawl of 179 million URLs in 1998, 10% of
the URLs were observed to contain mirrored content [23].
Duplicates created as a result of host name aliasing may be resolved through map-
ping domain names down to their canonical domain name (using “canonical name”
(CNAME) and “address” (A) requests to a domain name server, as detailed in [123]).
This process has several drawbacks, including that some of these virtual hosts may be
incorrectly collapsed down to a single server [123]. To accurately detect duplicates the
process of domain name collapsing should be performed at the time of crawling [123].
This is because the canonical domain name mappings may have changed prior to du-
plicate checking and may incorrectly identify duplicate servers.
In experiments within this thesis host name alias information was collected when
available. Host name alias information was not available for the (externally collected)
VLC2 and WT10g TREC web track collections [15, 62].
Other types of duplicates may be detected using heuristics [24], but page content
examination needs to be performed to resolve these duplicates reliably [123]. Scalable
document full-text-based duplicate detection can be achieved through the calculation
of a signature (typically an MD5 checksum [166]) for each crawled page. However,
such checksums may map two nearly identical pages to very different checksum val-
ues [34]. Therefore document-based checksums cannot be used to detect near du-
plicate documents. Near-duplicate documents can be detected using methods such
as Shingling [32, 34], which detects duplicates using random hash functions, and I-
Match [47], a more efficient method which uses collection statistics. Full site mirrors
may be more easily detected by considering documents not in isolation, but in the
context of all documents on a particular host. Bharat et al. [24] investigated several
methods for detecting mirrors in the web graph using site heuristics such as network
(IP) address, URL structure and host graph connectivity.
In corpora built for this thesis, exact duplicates on the same host were detected
using MD5 checksums [166] during corpus collection. Duplicate host aliases were
also consolidated. Mirror detection and near duplicate detection techniques were not
employed due to link graph distortion that may be introduced through false positive
duplicate matching.
During the construction of the VLC2 test collection no duplicate detection was
employed; however, for WT10g (a corpus constructed from VLC2) duplicates present
on the same web server were detected (using checksums) and eliminated [15]. This
was reported to remove around 10% of VLC2 URLs from consideration. Host aliasing
was not checked for either collection [15]. During the .GOV corpus crawl duplicate
documents were detected and eliminated using MD5 checksums.
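Grouping exact duplicates by content checksum can be sketched as follows (the data structures are illustrative; near-duplicate and mirror detection would require the shingling or heuristic methods cited above):

import hashlib
from collections import defaultdict

def group_exact_duplicates(pages):
    """pages: iterable of (url, raw_bytes) pairs. Returns checksum -> URLs sharing it."""
    groups = defaultdict(list)
    for url, content in pages:
        digest = hashlib.md5(content).hexdigest()   # content signature
        groups[digest].append(url)
    return {h: urls for h, urls in groups.items() if len(urls) > 1}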
3.1.3 Hyperlink redirects
Three methods frequently employed by web authors to redirect page requests are:
• Using an HTTP redirect configured through the web server [81].2 The redirect
information is then transferred from the web server to the client in the HTTP response header code. HTTP redirects return a redirection HTTP status code (301 – Moved Permanently or 302 – Moved Temporarily) [81]. In a crawl of 80 million documents in May 1999 [123], 4.5% of all HTTP requests received a redirection response.
2 This method is recommended by the W3C for encoding redirects [81].
• Using HTML redirects [164].3 HTML redirects are often accompanied by a tex-
tual explanation of the redirect with some arbitrary timeout value for page for-
warding. HTML redirects return an “OK” (200) HTTP status code [81].
• Using Javascript [152]. The detection of Javascript redirects requires the crawler
(or web page parser4) to have a full Javascript interpreter and run Javascript
code to determine the target page.
Ensuring hyperlink evidence is assigned to the correct page when dealing with
hyperlink redirects is no simple matter. A link pointing to a page containing a redirect
can either be left to point at the placeholder page (the page used to direct users to the
new document) or re-mapped to the new target page. The web author who created
the link is unlikely to have deliberately directed evidence to the placeholder page.
By contrast, if the link is re-mapped to the final target, the document may not be
representative of the initial document for which the link was created.
HTML and Javascript redirect information was logged and stored when building
the VLC2 and WT10g test collections. For the .GOV collection all three types of redi-
rects were stored and logged.
If possible, for experiments within this thesis, redirect information was used to
reassign the link to the end of the redirect chain. Due to the complexity of dealing
with Javascript redirects, experiments in this thesis do not resolve these redirects.
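Re-mapping a link to the end of a known redirect chain might be sketched as follows, assuming a redirect table extracted at crawl time (the table format and loop guard are illustrative assumptions):

def resolve_redirects(url, redirect_map, max_hops=10):
    """Follow a chain of known redirects (source URL -> target URL) to its end."""
    seen = {url}
    for _ in range(max_hops):
        target = redirect_map.get(url)
        if target is None or target in seen:   # end of the chain, or a loop
            return url
        seen.add(target)
        url = target
    return url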
3.1.4 Dynamic content
Unbounded crawling of dynamic content can lead to crawlers being caught in “crawler
traps” [123] and the creation of phantom link structures in the web graph. This may
lead to “sinks” being introduced into the web graph, and a reduction of the effective-
ness of hyperlink analysis techniques.
Dynamic content on the WWW is bounded only by the space of all potential
URLs on live host names. A study in 1997 estimated that 80% of useful WWW docu-
ments are dynamically generated [139]; moreover this has been observed to be a lower
bound [165].
During the creation of the VLC2 test collection, dynamic content was crawled
when linked-to [15]. For the WT10g corpus all identifiably dynamic documents5 were
removed [15]. This meant removing around 20% of the documents present in the VLC2 corpus. This is surprising given the estimate that 80% of all useful WWW content is dynamic. The large disagreement indicates that either the crawler used to gather the VLC2 corpus did not effectively crawl dynamic content, the estimate of dynamic content was incorrect, or static content was crawled first during the Internet Archive crawl.6 It is unclear why dynamic content was removed from the WT10g corpus, given that dynamic web content is likely to contain useful information.
3 This method for encoding redirects is not recommended in the latest HTML specification [164].
4 The system component that processes web documents and extracts document data prior to indexing.
5 i.e. not having a static URL extension, e.g. a “?” or common dynamic extensions such as “.php”, “.cgi” or “.shtml”.
3.1.5 Links created for reasons other than recommendation
Hyperlink recommendation algorithms assume that links between documents imply
some degree of recommendation [157]. Therefore links created for reasons other than
recommendation may adversely affect hyperlink recommendation scores [63]. Links
are often created for site navigation purposes or for nepotistic reasons [63]. Nepotistic
linking is link generation that is the result of some relationship between the source and
target, rather than the merit of the target [63, 137]. Kleinberg [132] proposed that all
internal site links be removed to lessen the influence of local nepotistic hyperlinks and
navigational hyperlinks. This was further refined by Bharat and Henzinger [26] who
observed that nepotistic links may exist not only within a single site but between sites
as well. To remove these nepotistic links they suggested that all sites be considered as
units, and proposed that only a single link between hosts be counted. However, the
removal of all internal host link structure may discard useful site information. Amitay
et al. [9] studied the relationship between site structure and site content and through
an examination of internal and external hyperlink structure were able to distinguish
between university sites, online directories, virtual hosting services, and link farms.
The link structures in each of these sites were observed to be quite different, indicating
that reducing the effects of nepotistic and navigational links according to the type of
site may be more effective than simply removing all internal links.
Fundamental changes in the use of hyperlinks on the web may also challenge the
recommendation assumption by affecting the quality or quantity of mined hyperlink
information. For example, the use of web logging tools (blogs) [92] may alter the
dynamics of hyperlinks on the WWW. Such pages are often stored together on a single
host, are very frequently updated, and the cost of generating a link to other content
in a blog is small. As such, the applicability of hyperlink recommendation algorithms
in this environment has been challenged [86]. It is also possible that as WWW search
engine effectiveness improves, authors are less likely to link to documents that they
find useful, as such documents can be easily found using a popular WWW search
engine. An analysis of how such trends affect hyperlink quality is outside the scope
of this thesis and is left for future work.
In experiments in this thesis internal site links are preserved and weighted equally.
This is important, as some of the evidence useful in navigational search may be en-
coded into internal site structure or nepotistic links, such as links to site home pages and entry points. For example, almost all external links to the Australia Post web site [12] are directed to the post-code lookup, with the home page identified by evidence present in the anchor-text of internal links [114]. Also, within some of the collections studied (such as WT10g [15]), inter-server linking is relatively infrequent.
6 The VLC2 collection consists of the first one-third of the documents stored during an all-of-WWW crawl performed by the Internet Archive in 1997.
3.2 Extracting hyperlink evidence from WWW search engines
Some of the experiments performed in this thesis rely on hyperlink evidence extracted
from WWW search engines via their publicly available interfaces. The WWW search
engines used are well engineered and provide effective and robust all-of-WWW search.
However, there are disadvantages in using WWW search engines for link informa-
tion. Such experiments are not reproducible as search engine algorithms and indexes
are not known and may well change over time. Additionally, some of the sourced
information is incomplete (such as the top 1000 results lists) or estimated (such as
document linkage information7).
3.3 Implementing PageRank
PageRank implementations outlined in the literature differ in the ways they deal with
dangling links, in the bookmarks used for random jumping, and in the conditions that
must be satisfied for convergence [154, 157, 158, 201]. Section 2.4.3.2 gave an overview
of the PageRank calculation. The current section outlines the process that has been
followed when calculating PageRank values for use in this thesis.
3.3.1 Dangling links
A hyperlink in the web graph that refers to a document outside of the corpus, or links
to a document which has no outgoing links, is termed a dangling link [157].
In Page and Brin’s [157] PageRank formulation, dangling links are removed prior
to indexing and then re-introduced after the PageRank calculation has converged.
The removal of dangling links using this method increases the weight of PageRank
distributed through other links on pages that point to dangling links. This is because
dangling links are not considered when dividing PageRank amongst document out-
links.
An alternative PageRank calculation sees the random surfer jump with certainty
(probability 1, rather than (1 − d)) when they reach a dangling link. This implies
that the random surfer jumps to a bookmark when they reach a dead-end [43, 154].
This implementation has desirable stability properties when used with a bookmark set
that evenly distributes “jump” PageRank amongst all pages (as described in Section
2.4.3.2).
7 Sourced using methods outlined in Section 5.1.3.
A further PageRank variant sees the random surfer jump back to the page they
came from when they reach a dangling link [158]. This variant is problematic as it
may lead to rank sinks if a page has many dangling links. This may result in inflated
scores for sections of the graph.
The PageRanks used for web corpora in this thesis are calculated using the dan-
gling link “jump with certainty” method. This method has been shown to have desir-
able stability and convergence properties [154].
3.3.2 Bookmark vectors
In experiments within this thesis PageRank values are calculated for two different
bookmark vectors (E). The first vector produces a “Democratic” or unbiased Page-
Rank in which all pages are a priori considered equal. The second bookmark vector
“personalises” [157] the PageRank calculation to favour known authoritative pages.
The bookmark vector is created using links from a hand-picked source and is termed
“Aristocratic” PageRank.
In Democratic PageRank (DPR) every page in the corpus is considered to be a book-
mark and therefore every page has a non-zero PageRank. Every link is important and
thus in-degree might be expected to be a good predictor of DPR. Because it is easy for
web page authors to create links and pages, it is easy to manipulate DPR with link
spam.
In Aristocratic PageRank (APR) a set of authoritative pages is used as bookmarks to
systematically bias scores. In practice the authoritative pages might be taken from a
reputable web directory or corpus site-map. For example, for WWW-based corpora,
bookmarks might be sourced from a WWW directory service such as Yahoo! [217],
Looksmart [144] or the Open Directory [69]. APR may be harder to spam than DPR
because newly created pages are not, by default, included in the bookmarks.
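A minimal power-iteration sketch of the PageRank variant described above, in which dangling pages jump to the bookmark vector with certainty, is given below. The data structures, convergence test and parameter defaults are illustrative assumptions; a page whose out-links all fall outside the corpus is treated as dangling.

def pagerank(out_links, bookmarks, d=0.85, eps=1e-4, max_iter=1000):
    """out_links: {page: [link targets]}; bookmarks: {page: jump probability}, summing to 1."""
    pages = list(out_links)
    pr = {p: 1.0 / len(pages) for p in pages}          # PageRank mass sums to 1 throughout
    for _ in range(max_iter):
        new = dict.fromkeys(pages, 0.0)
        jump_mass = 1.0 - d                            # (1 - d) of all mass goes to the bookmarks
        for p, score in pr.items():
            targets = [t for t in out_links.get(p, []) if t in new]
            if targets:                                # distribute d * score over in-corpus out-links
                share = d * score / len(targets)
                for t in targets:
                    new[t] += share
            else:                                      # dangling page: jump with certainty
                jump_mass += d * score
        for p in pages:
            new[p] += jump_mass * bookmarks.get(p, 0.0)
        if sum(abs(new[p] - pr[p]) for p in pages) < eps:
            break
        pr = new
    return new

For Democratic PageRank the bookmark vector assigns equal probability to every page; for Aristocratic PageRank it is concentrated on a hand-picked authoritative set. The default eps mirrors the 0.0001 convergence threshold used in the experiments of the next section.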
3.3.3 PageRank convergence
This section presents a small experiment to determine how the performance of Page-
Rank is affected by changes to the PageRank d value. These experiments examine re-
trieval effectiveness on the WT10gC home page finding test collection, for Optimal re-
rankings (described in Section 7.1.4) of two query-dependent baselines (document
full-text and anchor-text). This collection was provided to participants in TREC 2001
so that they could train systems for home page search (described in Section 2.6.7.2, the
test collection is used in experiments in Chapter 7).
Figures 3.1 and 3.2 illustrate how the PageRank on the WT10gC collection is af-
fected by changes to the d value. Figure 3.3 shows how the choice of d affects conver-
gence. In practice the d value is typically chosen to be between 0.8 and 0.9 [16, 157].
Results from these experiments reveal that the performance of PageRank can be
remarkably stable even with large changes in the d value. When d was set to 0.02
the performance of the Optimal re-ranking (see Section 7.3) was similar to the per-
formance at d = 0.85. Without the introduction of any random noise (at d = 1.0)
the PageRank calculation did not converge. However, the PageRank calculation did
converge with only a small amount of weight distributed through random jumping (d = 0.99).
Unless the score is to be directly incorporated in a ranking function, only the rel-
ative ordering of pages is important. Haveliwala [102] noted this as a possible Page-
Rank optimisation method since a final ordering of pages might be achieved before fi-
nal convergence. Haveliwala observed that the ordering of pages by PageRank values
did not change significantly after a few PageRank iterations. When moving from 25 to
100 iterations of the PageRank calculation, on corpora of over 100 000 documents, no
significant difference in document ranking order was observed [102]. In experiments
in this thesis the PageRank calculation was run until convergence. This allowed for
flexibility when combining PageRank values with other ranking components.
Since little improvement in performance was observed when increasing d, the em-
pirical evidence suggests d should be set to a very small value (around 0.10) for cor-
pora of this size, thereby reducing the number of iterations required and minimising
computational cost. However, to maintain consistency with previous evaluations, in
experiments within this thesis, d was set at 0.85, as suggested by Brin and Page [31].
[Figure 3.1: plot of success rate (S@1, S@5, S@10) for the content baseline against the Democratic PageRank d value; caption below.]
Figure 3.1: Effect of d value (random jump probability) on success rate for Democratic Page-
Rank calculations for the WT10gC test collection. As d approaches 0 the bookmarks become
more influential. As d approaches 1 the calculation approaches “pure” PageRank (i.e. a Page-
Rank calculation with no random jumps). The convergence threshold (ε) is set to 0.0001. The
WT10gC test collection is described in Section 7.1.3. The PageRank scores are combined with a
document full-text (content) baseline ranking using the Optimal re-ranking method described
in Section 7.1.4.
[Figure 3.2: plot of success rate (S@1, S@5, S@10) for the anchor-text baseline against the Aristocratic PageRank d value; caption below.]
Figure 3.2: Effect of d value (random jump probability) on success rate for Aristocratic
PageRank calculations for the WT10gC collection. As d approaches 0 the bookmarks be-
come more influential. As d approaches 1 the calculation approaches “pure” PageRank (i.e. a
PageRank calculation with no random jumps). The convergence threshold (ε) is set to 0.0001.
The WT10gC test collection is described in Section 7.1.3. The PageRank scores are combined
with an aggregate anchor-text (anchor) baseline ranking using the Optimal re-ranking method
described in Section 7.1.4.
[Figure 3.3: plot of the number of iterations to convergence against the d value, for Aristocratic and Democratic PageRank; caption below.]
Figure 3.3: Effect of PageRank d value on the rate of Democratic PageRank convergence on
WT10g, by number of iterations. PageRank did not converge at d = 1 (no random jumps).
The WT10g collection contains 1.7 million documents and is described in Section 2.6.7.1.
3.3.4 PageRank applied to small-to-medium webs
It is sometimes claimed that PageRanks are not useful unless the web graph is very
large (tens or hundreds of millions of nodes), but this claim has not been substan-
tiated. PageRanks can be calculated for a web graph of any size. PageRank scores
are therefore usable within any web crawl, including single organisations (enterprise)
and portals. The Google organisational search appliance incorporates PageRank for
crawls of below 150 000 pages [96].
3.4 Expected correlation of hyperlink recommendation measures
As DPR depends to some degree on the number of incoming links a page receives, one
might expect DPR to be correlated with in-degree. Ding et al. [68] previously observed
that for this reason, in-degree is a useful first order approximation to DPR. Moreover,
when DPR is calculated with a low convergence threshold it might be expected to
be more highly correlated with in-degree, as little weight is transferred through the
graph. Similarly it might be expected that corpora with large numbers of dangling
links would be more highly correlated. APR is likely to be far less correlated, with
many documents potentially having an APR score of zero.8
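One simple way to quantify such a relationship is a rank correlation between the two score vectors. The sketch below computes Spearman’s ρ directly (ignoring tie correction); this is only one of several possible correlation measures and is not necessarily the statistic used in the later chapters.

def spearman_rho(xs, ys):
    """Spearman rank correlation between two equal-length score lists (no tie correction)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))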
In this thesis the following correlations between hyperlink recommendation scores
are tested:
• Between WWW-based PageRank scores (from Google [93]) and WWW-based
in-degree scores (from AllTheWeb [80]), in Section 5.3;
• Between small-to-medium web based scores for DPR, APR and in-degree, in
Section 7.6.3; and
• Between small-to-medium web based scores for DPR, APR and in-degree, and
WWW-based PageRank scores (from Google), also in Section 7.6.3.
8 For example, if not bookmarked, no APR score will be achieved by pages in the so-termed WWW “Tendrils” (unless linked to by other Tendrils) [35] or pages in the IN component.
Chapter 4
Web search and site searchability
The potential for hyperlink evidence to improve retrieval effectiveness may depend
upon the authorship of web sites. Some web documents are authored in such a way
as to prevent or discourage direct linking. This may make it difficult for web search
engines to retrieve a document. Decisions made when authoring documents can af-
fect the evidence collected by web crawlers, and thereby reduce or increase the quality
of end-user search results. This chapter investigates how the “searchability” of sites
influences retrieval effectiveness. It also provides a whole-of-WWW context for the
experimental work based on smaller web corpora. In particular, the case study pre-
sented in this chapter illustrates:
• The importance of web hyperlink evidence in the ranking algorithms of promi-
nent WWW search engines, by investigating whether well-linked content is more
likely to be retrieved by WWW search engines.
• Difficulties faced by prominent WWW search engines when resolving author
intentions through web graph processing (and how successfully resolving issues
discussed in Chapter 3 can improve retrieval effectiveness).
• The effect of web authorship conventions on the likelihood of hyperlink evi-
dence generation.
This case study examines both search effectiveness and searchability with respect to a particular type of commodity which is frequently sold over the WWW: books. The task examined is that of finding web pages from which a book may be purchased, specifying the book’s title as the query.
Online book buying is a type of transactional search task (see Section 2.6.1) [33].
Transactional search is an important web search task [124], as it drives e-commerce
and directs first-time buyers to particular merchant web sites. However, despite the
prevalence of such tasks, information retrieval research has largely ignored product
purchasing and transactional search tasks [106, 116].
The product purchasing search task is characterised by multiple correct answers.
For example, in this case study, any of the investigated bookstores may provide the
service which has been requested (the purchase of a particular book). Employing a
task with many equivalent answers spread over a number of sites makes it possible to
study which sites are most easily searchable by search engines, and conversely which
search engines provide the best coverage.
The study of searchability is primarily concerned with site crawlability and the
prevalence of link information, that is, how easy it is to retrieve pages and link struc-
ture from a web site. A site with good searchability is one whose pages can be matched
and ranked well by search engines, and whose URLs are simple and consistent, such
that other authors may be more likely to create hyperlinks to them.
Previous studies of transactional search have evaluated the service finding ability
of TREC search systems [106] and WWW search engines [116] on a set of apparently
transactional queries extracted from natural language WWW logs. The aim of these
studies was to compare search engines on early precision; no information was avail-
able (or needed) about what resources could be found, and there was no comparison
of the searchability of online vendor sites.
4.1 Method
The initial step in the experiment was the selection of candidate books, the titles of
which formed the query set. This query set was then submitted to four popular WWW
search engines and ranked lists of documents were retrieved. Links to candidate
bookstore-based pages within these ranked lists were extracted, examined, and (if
required) downloaded. These documents were then examined to determine whether
they fulfilled the requirements of transactional search; that is, that the document did
not only match the book specified in the query, but also allowed for the book to be
purchased directly. The search engines were then compared based on how often they
successfully retrieved a transactional document for the requested books. Similarly, a
comparison of bookstores was performed based on how often each bookstore had a
transactional document for the desired book retrieved by any of the WWW search en-
gines. To examine the effect that hyperlink and document coverage had on bookstore
and search engine retrieval effectiveness, further site-based information was extracted
from the search engines and analysed. The experimental data used were collected in
the fourth quarter of 2002.
The following sections describe these steps in greater detail. The methods used for
extracting evidence relating to search engine coverage of bookstore URLs and hyper-
links are described in Appendix C.
4.1.1 Query selection
The book query set was identified from the New York Times bestseller lists, by sourc-
ing the titles of the best-sellers for September 2002 [153]. A total of 206 distinct book
titles were retrieved from nine categories.1 Book titles were listed on the best-seller
lists fully capitalised, and were later converted to lower case and revised such that all
terms, apart from join terms (such as “the”, “and” and “or”), began with a capital.
1 The book/category breakdown is included in Appendix C.
The query selection presumes that users search for a book using its exact title. In
fact users may seek books using author names, topics, or even partial and/or incorrect
titles. However, it is likely that a significant proportion of book searches are made
using the exact listed title.
The ISBNs of correct books were identified for page judging. Both hardcover and
paperback editions of books were considered to be correct answers.2 A list of the
queries and the ISBNs of the books judged as correct answers is available in Appen-
dix C.
4.1.2 Search engine selection
Four search engines were identified from the Nielsen/NetRatings Search Engine Ratings for September 2002 (as outlined in Table 4.1). At the time, the four engines provided the core search technology for the four most popular search services, and for eight of the top ten search services [194].
S. Engine            Abbr.   Used by [195]   Rank
AltaVista [7]        AV      AltaVista       8
AllTheWeb [80]       FA      AllTheWeb       -
Google [93]          GO      Google          3
                             AOL             4
                             Netscape        9
                             Yahoo           1
MSN Search [149]     MS      MSN Search      2
(based on Inktomi)           Looksmart       10
                             HotBot          -
                             Overture        6
Table 4.1: Search engine properties. The column labelled “Abbr.” contains abbreviations used in the study. “Used by” indicates search services that used the search engine. “Rank” indicates the search service’s position in the Nielsen/NetRatings Search Engine Ratings of September 2002 [194].
4.1.3 Bookstore selection
The bookstore set was derived from the Google DMOZ “Shopping > Publications >
Books > General” [94] and Yahoo! “Business and Economy > Shopping and Services >
Books > Booksellers” [216] directories. Bookstores were considered if they sold the top
bestseller in at least three of the nine categories. The process of bookstore candidate
identification was performed manually using internal search engines to search for
both the title and the author of each book (both title and author were used to uniquely identify books).
2 Large print and audio editions were deemed to be incorrect answers.
Bookstore Core URL De. Dy. URL Cat.
1BookStreet 1bookstreet.com N Y ISBN 9
A1Books a1books.com N Y ISBN 9
AllDirect alldirect.com N Y ISBN 9
Amazon amazon.com N P ISBN 9
Americana Books americanabooks.com N Y - 7
Arthurs Books arthursbooks.com N Y ISBN 4
Barnes and Noble barnesandnoble.com N Y ISBN 9
BookWorks bookworksaptos.com Y* Y ISBN 9
BookSite booksite.com Y+ Y ISBN 9
Changing Hands changinghands.com Y* Y ISBN 9
ecampus ecampus.com N Y ISBN 9
NetstoreUSA netstoreusa.com N P ISBN 9
Planet Gold planetgold.com N Y - 9
TextbookX.com textbookx.com N Y ISBN 9
Sam Weller’s Books samwellers.com N Y ISBN 9
All Textbooks 4 Less alltextbooks4less.com N Y ISBN 9
The Book Shop bookshopmorris.com Y* Y ISBN 9
Cornwall Discount Books cornwalldiscountbooks.com N Y - 8
A Lot of Books alotofbooks.com N Y - 3
HearthFire Books hearthfirebooks.com Y* Y ISBN 9
Walmart walmart.com N Y - 9
Wordsworth.com wordsworth.com N Y ISBN 9
Powells powells.com N Y - 9
BiggerBooks.com biggerbooks.com N Y ISBN 9
That Bookstore in Blytheville tbib.com Y* Y ISBN 9
StrandBooks.com strandbooks.com N Y ISBN 7
St. Marks Bookshop stmarksbookshop.com Y* Y ISBN 9
RJ Julia rjjulia.com N Y ISBN 9
Paulina Springs Book Company paulinasprings.com Y* Y ISBN 9
Books-A-Million booksamillion.com N Y ISBN 9
CodysBooks.com codysbooks.com Y* Y ISBN 9
The Concord Bookshop concordbookshop.com Y* Y ISBN 9
Dartmouth Bookshop dartbook.com Y* Y ISBN 9
GoodEnough Books goodenoughbooks.com Y* Y ISBN 9
MediaPlay.com mediaplay.com N Y - 9
Table 4.2: Bookstores included in the evaluation. This table reports whether the bookstore contained ISBNs in its internal URLs (“URL”), whether the sites were generated through a series of dynamic scripts (“Dy.”), whether they were a derivative of another site (“De.”) and how many of the nine book categories they matched (“Cat.”). A “*” in the “De.” column indicates that the site was a booksense.com derivative, while a “+” indicates that the bookstore was a booksite.com derivative. A “P” in the “Dy.” column indicates that the site was dynamic but did not “look” dynamic (it did not have a “?” with parameters following the URL).
Bookstores were only judged on the categories for which they stocked
(or listed) the bestseller. The justification for this approach was that there may be some
specialised (e.g. fiction only) bookstores that should be included in the study, but not
considered for all book categories. A full listing of all 35 eligible bookstores and their
salient properties is presented in Table 4.2.
4.1.4 Submitting queries and collecting results
The queries were made up of book titles submitted to search engines as phrases (i.e.
inside double quotes or marked as phrases in advanced searches). The exact query
syntax submitted to each search engine is reported in Appendix C. The top 1000
results for each query from each search engine were retrieved and recorded.
4.1.5 Judging
The candidate documents were required to fulfil two criteria in order to be considered as a correct answer: 1) the page must have been for the book whose title was given as the query; and 2) the retrieved page must have been transactional in nature.
A transactional page was considered to be a bookstore page from which a user could
buy a book. Browse pages (documents that list multiple books, for example, a list of
books in a particular category, by series or by author) or bookstore search results were
not judged as correct results. For many bookstores the correct answers were observed
to have the hardcover or paperback ISBN in the URL (in many cases there were numerous duplicate correct URLs, all of which contained the ISBN). To cut down on
manual judging for these bookstores, automatic judging was performed based on the
presence or absence of the ISBN in the URL. For other bookstores the unique product
identifiers for each book were manually collected and recorded, and URLs checked
for their presence.
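The automatic judging step can be illustrated with the following minimal sketch, which accepts a retrieved URL if it contains one of the ISBNs recorded as correct for the query. It is a reconstruction of the procedure described above rather than the actual judging scripts; the function name and the ISBN values are placeholders introduced here.

def is_correct_answer(url, correct_isbns):
    """Automatic judging sketch: a URL is accepted if it contains one of the
    hardcover or paperback ISBNs recorded as correct for the query."""
    return any(isbn in url for isbn in correct_isbns)

# Placeholder ISBNs for two hypothetical editions of the same title.
correct_isbns = {"0123456789", "0987654321"}
print(is_correct_answer(
    "http://bookstore.example.com/books/0123456789.html", correct_isbns))   # True
print(is_correct_answer(
    "http://bookstore.example.com/search?q=some+title", correct_isbns))     # False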
4.2 Comparing bookstores
The book finding success rates were measured at several cutoffs (S@1, S@5, S@10,
S@100 and S@1000). Table 4.3 contains the results for this experiment. The following
observations may be made:
• Of the 35 bookstores evaluated, only 14 returned any correct answers within the
top 1000 results of any of the search engines.
• Only four bookstores contributed answers within the top ten results in any search engine: Amazon, Barnes and Noble, BookSite and Walmart.
• Amazon was the most searchable bookstore in the evaluation, achieving the high-
est success rates.
• Only Amazon had correct results returned by every search engine.
Bookstore               S@1 / S@5 / S@10 / S@100 / S@1000     S@1000 break. (AV:FA:GO:MS)     Host Res.
Amazon 0.124 / 0.325 / 0.402 / 0.492 / 0.584 104:83:162:132 3903
Barnes and Noble 0.028 / 0.096 / 0.140 / 0.225 / 0.316 0:87:170:3 3603
Walmart 0.010 / 0.030 / 0.045 / 0.070 / 0.075 2:0:0:60 277
BookSite 0.000 / 0.004 / 0.005 / 0.013 / 0.013 0:0:0:11 52
ecampus 0.000 / 0.000 / 0.000 / 0.005 / 0.012 0:7:0:3 290
AllDirect 0.000 / 0.000 / 0.000 / 0.002 / 0.005 0:4:0:0 52
NetstoreUSA 0.000 / 0.000 / 0.000 / 0.001 / 0.010 0:8:0:0 261
Sam Weller’s Books 0.000 / 0.000 / 0.000 / 0.001 / 0.006 0:5:0:0 22
Books-A-Million 0.000 / 0.000 / 0.000 / 0.000 / 0.008 0:4:0:3 775
1BookStreet 0.000 / 0.000 / 0.000 / 0.000 / 0.006 0:5:0:0 17
Wordsworth.com 0.000 / 0.000 / 0.000 / 0.000 / 0.004 1:0:1:1 92
TextbookX.com 0.000 / 0.000 / 0.000 / 0.000 / 0.002 0:2:0:0 22
CodysBooks.com 0.000 / 0.000 / 0.000 / 0.000 / 0.002 0:2:0:0 78
Arthurs Books 0.000 / 0.000 / 0.000 / 0.000 / 0.003 0:1:0:0 3
Powells Bookstore 0.000 / 0.000 / 0.000 / 0.000 / 0.000 0:0:0:0 1031
Table 4.3: Bookstore comparison. This table includes all bookstores which had at least one
success at 1000 (S@1000) in a search engine. Powells is included in the table for comparison due
to the high number of results retrieved by the search engines from Powells’ host name. The
“S@1000 break.” column shows the number of correct books retrieved from each bookstore
within the top 1000 search results for each search engine. The “Host Res.” column reports the
number of pages found for each bookstore’s host name by all search engines.
• Barnes and Noble performed well on Google (GO) and AllTheWeb (FA).
• Walmart performed well on MSN Search (MS).
• The only search engine which returned results for many of the smaller book-
stores was AllTheWeb (FA).
4.3 Comparing search engines
Search engine effectiveness was also compared: the results are presented in Table 4.4
and Table 4.5. From data in these tables the following observations were made:
• AltaVista’s (AV) performance was inferior to that of both Google (GO) and MSN
Search (MS) at all cutoffs. AltaVista demonstrated around half the precision of
MSN Search.
• AllTheWeb (FA) trailed well behind all other search engines, but provided a
large number of correct answers between the 100th and 1000th position (success
rate jumps from 0.18 to 0.52). The precision for AllTheWeb was low.
• Google (GO) trailed MSN Search at S@1, but exceeded MSN Search’s performance from S@10 onwards. Google returned more correct answers in its top 5, 10 and 100 results than MSN Search.
• MSN Search (MS) produced the strongest results at S@1 and S@5, but as the cutoff was extended it fell progressively further behind Google.
Search Success Rates
Engine @1 @5 @10 @100 @1000
AV 0.14 0.39 0.45 0.50 0.52
FA 0.00 0.02 0.05 0.18 0.52
GO 0.15 0.56 0.67 0.83 0.89
MS 0.36 0.57 0.65 0.72 0.73
Table 4.4: Search engine success rates. The best result at each cutoff is highlighted.
4.3.1 Search engine bookstore coverage
The search engine bookstore coverage was measured by sourcing counts from WWW
search engines for the number of URLs indexed per bookstore (site document cover-
age), and the number of hyperlinks that were directed at each bookstore (site hyperlink
coverage).
Search Precision
Engine @1 @5 @10 @100
AV 0.14 0.08 0.05 0.01
FA 0.00 0.00 0.01 0.00
GO 0.15 0.20 0.15 0.03
MS 0.36 0.13 0.08 0.01
Table 4.5: Search engine precision. Note that precision at 1 is equivalent to the success rate
at 1. The precision at cutoffs greater than 100 is less than 1/100 in all cases. The best result for
each measure is highlighted.
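For reference, the success-rate and precision figures reported in Tables 4.3–4.5 can be computed from judged ranked lists as in the following sketch; this is a generic reconstruction of the standard measures rather than the evaluation code actually used for the case study.

def success_at(ranked_relevance, k):
    """Success for one query: 1 if any of the top-k results is correct, else 0."""
    return 1.0 if any(ranked_relevance[:k]) else 0.0

def precision_at(ranked_relevance, k):
    """Precision for one query: fraction of the top-k result slots holding a correct answer."""
    return sum(1 for rel in ranked_relevance[:k] if rel) / k

def mean_over_queries(metric, judged_runs, k):
    """Average a per-query measure over all queries (here, book titles)."""
    return sum(metric(run, k) for run in judged_runs) / len(judged_runs)

# Toy example: three queries, each a list of per-rank correctness flags.
runs = [[False, True, False], [False] * 10, [True, True, False]]
print(mean_over_queries(success_at, runs, 5))    # S@5
print(mean_over_queries(precision_at, runs, 5))  # P@5

With these definitions, precision at 1 coincides with success at 1, as noted in the caption of Table 4.5.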
Site document coverage
The transactional pages for some bookstores may not have been returned because they
have never been crawled by a search engine. Table 4.6 lists the number of pages from
each bookstore reported to be contained within each search engine’s index.
From these results it was observed that:
• Amazon had a consistently large search engine coverage – around three million
documents on three-out-of-four search engines. AllTheWeb covered an order of magnitude fewer Amazon-based documents than any of the other search engines. However, AllTheWeb crawled more pages for Amazon than it did for any
other bookstore. This may indicate that AllTheWeb incorrectly eliminated many
of Amazon’s pages as duplicates, applied more stringent limits on crawling dy-
namic content, or its coverage was estimated in a different way compared to the
other search engines.
• The coverage of Barnes and Noble varied widely across engines. While the MSN
Search coverage of Barnes and Noble was small, it appeared to contain many
product pages, with three correct answers retrieved. Only around 500 Barnes and Noble pages were covered by AltaVista, while over a million pages were covered by Google.
• A large number of Walmart pages were covered by MSN Search, whereas
AllTheWeb and Google covered a relatively small number of pages. This may
indicate that MSN handled dynamic pages in a different manner to the other
search engines, or that there was some special relationship between MSN Search
and Walmart.
• AllTheWeb did not have large coverage of any one bookstore (their maximum
crawl of a bookstore was around 360 000 pages). Instead they tended to have a larger breadth of coverage, with larger crawls of lesser-known bookstores. As many
bookstores served content through dynamic pages, this may further indicate
that AllTheWeb applied more stringent limits on dynamic content.
Bookstore AV FA GO MS TOTAL
amazon.com 3 675 723 358 376 3 620 000 2 838 819 10 492 918
barnesandnoble.com 521 192 792 1 240 000 2822 1 436 135
walmart.com 89 243 1076 10 500 916 162 1 016 981
netstoreusa.com 1171 315 002 93 000 42 052 451 225
powells.com 39 397 111 977 65 900 6204 223 478
textbookx.com 18 23 157 38 600 150 61 925
alldirect.com 24 26 278 7 27 26 336
ecampus.com 300 7763 2010 240 10 313
planetgold.com 18 8361 774 18 9171
booksamillion.com 22 5860 54 865 6801
cornwalldiscountbooks.com 1 5423 2 1 5427
wordsworth.com 735 228 2290 1271 4524
booksite.com 93 169 1190 290 1742
codysbooks.com 74 1308 238 57 1677
arthursbooks.com 7 1221 8 384 1620
samwellers.com 7 278 5 8 298
tbib.com 1 2701 3 0 2705
stmarksbookshop.com 1 2414 4 0 2419
1bookstreet.com 5 1009 779 172 1965
a1books.com 15 1311 29 173 1528
biggerbooks.com 0 1 395 1 397
americanabooks.com 3 309 15 14 341
alltextbooks4less.com 3 208 22 31 264
dartbook.com 31 74 17 1 123
mediaplay.com 19 0 32 7 58
paulinasprings.com 1 40 1 0 42
rjjulia.com 7 27 3 4 41
concordbookshop.com 5 2 3 0 10
goodenoughbooks.com 1 4 2 0 7
bookworksaptos.com 1 3 2 0 6
alotofbooks.com 1 1 2 1 5
bookshopmorris.com 1 2 2 0 5
changinghands.com 1 2 2 0 5
hearthfirebooks.com 1 2 2 0 5
Total 3 809 257 1 087 000 5 081 161 3 810 094 13 787 512
Table 4.6: Search engine document coverage. Note that the totals in the right-hand column may count the same URL more than once (this occurs when the same URL is indexed by more than one search engine). These values were collected using methods outlined in Appendix C. The column labelled “AV” contains data from AltaVista, “FA” contains data from AllTheWeb, “GO” contains data from Google, and “MS” contains data from MSN Search.
• AltaVista had large coverage only of Amazon, Walmart and Powells. It seems unlikely that book results could be found in its small (sub-1000-page) crawls of other bookstores. The searchability of all three bookstores was improved by
having simple URL structures.
• Powells had large coverage (with three-out-of-four search engines indexing close to 40 000 pages or more), but did not have any product pages returned in the top 1000
results for these search engines. This may indicate that hyperlink evidence di-
rected at the Powells bookstore was either not present, not directed at book buy-
ing pages, or was resolved incorrectly by the WWW search engines.
Hyperlink graph completeness
Only two of the evaluated search engines supported domain name hyperlink counts:
AltaVista and AllTheWeb. Domain name hyperlink counts retrieve the number of
links to an entire domain name rather than just to a single page. This information was
used to determine the hyperlink coverage of an entire bookstore. Table 4.7 contains
the results for this study. Some observations are that:
• AllTheWeb discovered a large number of links to Amazon, but did not crawl
documents from Amazon as comprehensively as other search engines.
• Powells bookstore had a large number of incoming links, but still performed
poorly. This further indicates that incoming links may not have been success-
fully resolved by the WWW search engines (due to anomalies in the search en-
gine representations of Powells’ document set or link graph), or that links were
not directed to transactional pages.
• AllTheWeb discovered more links to diverse hosts than AltaVista. This could be attributed to the fact that AllTheWeb performed deeper crawls of lesser-known sites and so encountered a larger number of internal links.
4.4 Findings
This section discusses the bookstore findings. It includes an analysis of the URL and
hyperlink coverage, of bookstore ranking performance, and finally of the relative re-
trieval effectiveness of the evaluated search engines.
4.4.1 Bookstore searchability: coverage
The results in Tables 4.6 and 4.7 reveal that the top three bookstores by URL coverage
were also the top three bookstores by success rate. The bookstore coverage appears to
have had a significant impact on how often books from the bookstore were retrieved
early in the document ranking. Amazon achieved high coverage in the indexes of all
evaluated search engines.
Bookstore AV FA TOTAL
amazon.com 12 408 441 25 955 858 38 364 299
powells.com 5 197 526 316 989 5 514 515
textbookx.com 3 456 068 28 453 3 484 521
barnesandnoble.com 234 137 784 088 1 018 225
walmart.com 14 783 267 008 281 791
booksite.com 4927 113 729 118 656
booksamillion.com 34 137 79 351 113 488
ecampus.com 2170 102 047 104 217
netstoreusa.com 10 548 91 867 102 415
1bookstreet.com 25 229 50 064 75 293
wordsworth.com 2750 21 694 24 444
a1books.com 4545 16 270 20 815
codysbooks.com 1062 9512 10 574
alldirect.com 614 6508 7122
arthursbooks.com 109 1700 1809
samwellers.com 106 208 314
americanabooks.com 114 2163 2277
alltextbooks4less.com 52 945 997
rjjulia.com 174 337 511
concordbookshop.com 185 118 303
planetgold.com 31 200 231
dartbook.com 95 117 212
changinghands.com 68 96 164
bookshopmorris.com 46 99 145
cornwalldiscountbooks.com 12 123 135
alotofbooks.com 11 108 119
stmarksbookshop.com 42 62 104
hearthfirebooks.com 8 73 81
tbib.com 29 40 69
bookworksaptos.com 15 53 68
paulinasprings.com 31 37 68
biggerbooks.com 2 63 65
goodenoughbooks.com 12 15 27
Total 21 463 332 28 160 545 49 623 877
Table 4.7: Search engine link coverage. The column labelled “AV” contains data from Al-
taVista and “FA” contains data from AllTheWeb. Note that because of overlap between AV
and FA the totals in the right-hand column may contain several links to the same URL.
It is important for a bookstore to have deep crawls indexed in as many search
engines as possible. Three potential reasons why bookstores included in this study
were not crawled deeply may be offered:
1. Despite many incoming links to the bookstore domain, few pages were crawled.
This may have been because the crawler was trapped when building the book-
stores’ link graph and only crawled a few books many times over. Alternatively,
book pages could have been identified as near-duplicates and eliminated from
the document index.
2. The bookstores did not receive sufficient links directly to product pages from
external sites (i.e. most links were directed to the bookstore home page).
3. The search engines appeared to treat the bookstores as containing uninteresting dynamically generated content. WWW search engines may decline to index apparently dynamically generated content due to concerns about polluting their representation of the web graph (see Section 3.1.2). Some dynamic content was
observed to be in the form of parameterised URLs (with question marks) gener-
ated by a single script. Given the poor performance of bookstores which gener-
ated content using a single script, it appears that WWW search engine crawlers
might have either simply ignored some of these documents (according to some
URLs-to-crawl rule, for example, stripping all URL parameters), or have been
unable to retrieve any meaningful information from them.
Many bookstores with a high link count were unable to achieve wide URL coverage. This is most apparent for Powells, which has a large number of incoming links but fewer indexed pages than other well-linked bookstores. Further investigation revealed that Powells encodes book ISBNs as query parameters to a .cgi script, in contrast to Amazon, where ISBNs are encoded in the URL path rather than as parameters.
The site which managed to best convert incoming links to crawled pages was Net-
storeUSA. In contrast to all other evaluated bookstores, NetstoreUSA had more pages
indexed by the search engines than incoming links. NetstoreUSA improved its search-
ability by using static-looking documents organised in simple hierarchies of shtml
pages.
To encourage a deep crawl that will cover all site content it is necessary for web
authors to ensure they have both internal and external links directly to their hierar-
chically deep, but important, content. This increases the chance that a WWW search
engine will encounter a link to the page, and adds valuable hyperlink evidence. To
encourage user linking it is important to use meaningful and consistent URL strings.
While one can envisage a web developer linking to a URL which has the form foo.com/ISBN/, it may be less likely that they link directly to foo.com/prod/prod.asp?prod=9283&source=09834. There is also a higher likelihood that such a link
would be discarded during the crawl or the creation of the web graph. Deep linking
may be encouraged further through the use of incentive or partnership programs. If
such a program is in place, it is important to ensure partners are able to point directly
to products and that all partners point to the same consistent URL for each product
(e.g. Amazon provides an incentive program so that web authors link directly to their
product pages).
To ensure database-generated content is not rejected by WWW search engines, it is important that the content is provided through individual, static-looking URLs.
Duplicate pages should also be removed from the site. However, if duplicate pages
are to be retained, it is important that web authors know what URL they should link
to, and that crawls of duplicate pages be minimised (potentially through the use of
page crawl exclusion measures in “robots.txt” files [133]).
4.4.2 Bookstore searchability: matching/ranking performance
Transactional documents for the requested books were most frequently matched (and
retrieved) from the Amazon and Barnes and Noble bookstores. Many of the documents
retrieved by WWW search engines from other bookstores were observed to be browse
and search pages, and not transactional documents. The Powells bookstore is a case in point. Despite having many links, reasonable coverage in search engine indexes, and results matched frequently, Powells’ transactional pages were never returned.
This may indicate poor page full-text content, poor site organisation and/or a lack of
encouragement to link directly to products (as their referral program appears to be
processed through their front page).
These identified problems could also be alleviated somewhat by employing robot
exclusion directives to inform crawlers to ignore search and browse pages, and index
only product pages (through the use of “robots.txt” files [133], as outlined above).
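As a hedged illustration of such exclusion directives, the following sketch uses Python’s standard urllib.robotparser to check URLs against a hypothetical robots.txt that blocks search and browse pages while leaving product pages crawlable; the path names are invented for the example and are not taken from any of the evaluated bookstores.

from urllib import robotparser

# Hypothetical robots.txt for a bookstore wishing to expose only product pages
# to crawlers (the paths are illustrative, not from any real site).
robots_txt = """User-agent: *
Disallow: /search
Disallow: /browse
Allow: /""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(robots_txt)

print(rp.can_fetch("*", "http://bookstore.example.com/product/0123456789"))   # True
print(rp.can_fetch("*", "http://bookstore.example.com/search?q=bestsellers"))  # False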
4.4.3 Search engine retrieval effectiveness
The best book finding search engines were Google and MSN Search and the most
successful bookstore was Amazon. MSN Search provided the most correct answers
at the first rank. However, Google provided more correct answers in the top five
positions, potentially giving users more book buying options.
The empirical findings indicate that, in order to maximise the book-finding ability of a WWW search engine, deep crawls of dynamic content need to be performed. All of the examined bookstores bury product pages deep within their URL directory tree
(generally as leaf nodes). While AllTheWeb appeared to index a much larger selection of bookstores, it appeared not to crawl as much of the Amazon bookstore as the other search engines did. Given that the majority of correct hits for all search engines came
from the Amazon bookstore, this could be one of the main reasons for the observed
low effectiveness of AllTheWeb on this task.
Some WWW search engines appear to favour certain bookstores over others. For
example Google and AllTheWeb have large indexes of Barnes and Noble while the oth-
ers do not. A further example of this is the good performance of the Walmart book-
store in MSN Search. The results suggest that MSN Search may have access to extra
information for Walmart that is not available to the other search engines.
For WWW search engines to provide good coverage of popular bookstores it is
necessary for them to crawl dynamic URLs, even when there are many pages gen-
erated from a single script with different parameters. On the Walmart and Powells
bookstores, all product pages are created from a single script, with the book’s ISBN
as a parameter. Also, as many slightly different URLs frequently contain information
about exactly the same ISBN it may be necessary to perform advanced equivalence
(duplicate) URL or content detection. This is the case with duplicate product pages
on the Amazon bookstore, as the same document is retrieved no matter what referral
identifier is included in the URL. Without effective duplicate detection, and consolidation of duplicate documents in the hyperlink graph, the effectiveness of link evidence will be decreased.
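A minimal sketch of the kind of URL equivalence detection described above is given below. It canonicalises URLs by dropping parameters that look like referral or tracking identifiers, so that variants of the same product page collapse to a single node in the link graph. The parameter names treated as referral identifiers here are assumptions made for illustration, not an inventory of what any particular search engine or bookstore actually uses.

from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Parameter names assumed (for illustration only) to carry referral/tracking
# information rather than identifying distinct content.
REFERRAL_PARAMS = {"source", "ref", "referrer", "affiliate", "partner"}

def canonicalise(url):
    """Collapse URL variants that differ only in referral parameters."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in REFERRAL_PARAMS]
    kept.sort()  # parameter order should not create distinct URLs
    return urlunparse((parts.scheme, parts.netloc.lower(), parts.path,
                       parts.params, urlencode(kept), ""))

# Two hypothetical variants of the same product page map to one canonical URL.
print(canonicalise("http://Bookstore.example.com/prod.asp?prod=9283&source=09834"))
print(canonicalise("http://bookstore.example.com/prod.asp?source=123&prod=9283"))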
4.5 Discussion
The coverage results from leading WWW search engines indicate that all of the eval-
uated engines dealt with web graph anomalies in a different manner (some more ef-
fectively than others). The most effective search engines retrieved book buying pages
from dynamic sites for which they had crawled between 0.9 and 3.6 million docu-
ments. This demonstrates the importance of using robust methods when sourcing and
building the web graph (such as those outlined in Chapter 3) for effective retrieval.
From a web site author’s point of view the design of a web site directly affects how
well search engines can crawl, match and rank its pages. For this reason, searchabil-
ity should be an important concern in site design. Observations from this case study
indicate that there are large discrepancies in the relative searchability of bookselling
web sites. Many of the bookstore sites incorporated dynamic URLs that may be difficult for some WWW search engines to crawl, and unattractive targets for web
authors to direct hyperlinks to. Many bookstore sites were also marred by duplicate
content and confusing link graphs. Of the 35 evaluated bookstores, 24 did not appear in the top 1000 results of any of the evaluated search engines for any of the evaluated books.
These results illustrate the importance of a combined approach to improving trans-
actional search. To improve effectiveness WWW search engines should endeavour to
discover more product pages, by performing deep crawls of provider sites and of
dynamic pages (especially those that are linked to directly). It is equally important
for bookstores to build a suitable site structure that allows search engines to perform
thorough crawls. To improve searchability, bookstores should use short non-changing
URLs (like NetstoreUSA) and encourage deep linking directly to their product pages
(like Amazon). It is submitted that these findings are likely to hold for other WWW
search tasks.
The amount of link evidence available for a bookstore, as observed in the link
coverage study, proved to be particularly important for achieving high rankings in
some search engines (such as Google [93]). The apparent heavy use of web evidence in
the document ranking algorithms of WWW search engines provides further support
for the investigations of web evidence within this thesis.
Chapter 5
Analysis of hyperlink recommendation evidence
It is commonly stated that hyperlink recommendation measures help modern WWW search engines rank “important, high quality” pages ahead of relevant but less valuable pages, and reject “spam” [97]. However, what exactly constitutes an “important” or “high quality” page remains unclear [8]. Google has previously been shown
to perform well on a home page finding task [116] and the PageRank hyperlink rec-
ommendation algorithm may be a factor in this success.
This chapter presents an analysis of the potential for hyperlink recommendation
evidence to improve retrieval effectiveness in navigational search tasks, and to favour
documents that possess some “real-world quality” or “importance”. The analysis con-
siders PageRank and in-degree scores extracted from leading WWW search engines.
These scores are tested for bias and their usefulness is compared over corpora of home
page, non-home page and spam page documents. The hyperlink recommendation
scores are tested to determine the weight assigned to the home pages of companies
that exhibit “real-world” measures of quality. The measures of “real-world” qual-
ity investigated include whether favoured companies are highly profitable or well-
known. Less beneficial biases are also tested to examine whether hyperlink recom-
mendation scores favour companies based on their industry or location.
5.1 Method
An analysis of score biases requires a set of candidate documents, the hyperlink rec-
ommendation scores for those documents, and, in order to test for bias, attributes by
which the candidate documents may be distinguished. In this experiment three sets
of candidate pages are identified from data relating to publicly listed companies and
links to known spam content. These form useful sets for analysis for reasons outlined
in the following sections. Hyperlink recommendation scores are sourced for each of
these pages using WWW search engines and tools. The attributes used to test rec-
ommendation score bias are gathered from listed company information and publicly
available company attributes. The data used in this experiment were extracted during
September 2003.
The following subsections detail methods used to amass the data for this exper-
iment. This includes a description of how candidate pages were selected, how the
salient company properties (used when evaluating bias) were sourced, and the meth-
ods used to extract hyperlink recommendation scores for each document.
5.1.1 Sourcing candidate pages
The home page set includes the home pages of public companies listed on the three
largest US stock exchanges: the New York Stock Exchange (NYSE), NASDAQ and the
American Stock Exchange (AMEX) (a total of 8329 companies were retrieved). The
home pages of publicly listed companies form a useful corpus as there is publicly
available information relating to company popularity, revenue, and other properties,
such as which industry the company belongs to. Furthermore, publicly listed compa-
nies are plausible targets for home page finding queries.
Company information was obtained from the stock exchange web sites, and in-
cluded the official company name, symbol and description. Then, using the company
information service at http://quote.fool.com/, 5370 unique company home
page URLs were identified. These URLs were almost always the root page of a host
(e.g. http://hostname.com/) without any file path (only fourteen URLs had some
path). These are considered to be the company home pages, even though in some
cases the root page is a Flash animation or another form of redirect. The company
information service also provided an industry for each stock, e.g. “Real Estate”.
For comparison with these home pages, two further sets of pages were collected:
a non-home page set and a spam page set. Non-home pages were collected by sorting
company home pages by PageRank (extracted using methods outlined in the next
section) and selecting twenty home pages at a uniform interval. From these home
pages crawls of up to 100 pages were commenced (restricted to the company domain).
The overall PageRank distribution for the pages in the twenty crawls is shown in
Figure 5.1.
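The sketch below reconstructs the sampling and bounded crawling procedure just described: twenty home pages are chosen at uniform intervals from the PageRank-sorted list, and a breadth-first crawl of up to 100 pages restricted to each company’s domain is started from each. The fetch_outlinks helper is a placeholder for whatever page-fetching and link-extraction machinery is available; it is not part of the original experimental setup.

from collections import deque
from urllib.parse import urlparse

def uniform_sample(home_pages_by_pagerank, n=20):
    """Pick n home pages at a uniform interval from a PageRank-sorted list."""
    step = max(1, len(home_pages_by_pagerank) // n)
    return home_pages_by_pagerank[::step][:n]

def bounded_crawl(start_url, fetch_outlinks, limit=100):
    """Breadth-first crawl of up to `limit` pages, restricted to the start domain.

    fetch_outlinks(url) is assumed to return the URLs linked from `url`.
    """
    domain = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    crawled = []
    while queue and len(crawled) < limit:
        url = queue.popleft()
        crawled.append(url)
        for link in fetch_outlinks(url):
            if link not in seen and urlparse(link).netloc == domain:
                seen.add(link)
                queue.append(link)
    return crawled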
The spam page set was collected by sourcing 399 links pointing to a search engine
optimiser company (using Google’s link: operator). The spam pages were largely
content-free, having been created to direct traffic and PageRank towards the search
engine optimiser’s customers. After sourcing in-degrees, all pages with an in-degree
of zero were eliminated leaving 280 pages for consideration.
5.1.2 Company attributes
The set of company home pages was grouped into subsets according to their member-
ships and attributes, such as the Fortune 500 list [82] and the Wired 40 list of compa-
nies judged to be best prepared for the new economy [147]. The goal was to observe
how well PageRank and in-degree could predict inclusion in such lists.
Salient company properties were collected from the following web resources:
• The company information service at http://quote.fool.com provided com-
pany industry and location information.
[Figure 5.1 plot: number of pages crawled versus PageRank for the non-home page document set.]
Figure 5.1: Combined PageRank distribution for the non-home page document set. The non-home page document set was constructed by crawling up to 100 pages from a selection of company webs. The observed PageRank distribution is not a power-law distribution as might be expected in PageRank distributions (see Section 2.4). These pages are more representative of the general WWW page population than the home-page-only set. The zero PageRanks are most likely caused by pages not present in Google crawls, by lost redirects, or by small PageRanks being rounded to 0.
• The Fortune magazine provided the list of Fortune 500 largest companies (by rev-
enue) and Fortune Most Admired companies. Fortune 500 companies are those
with the highest revenue, based on publicly available data, listed by Fortune
Magazine (http://www.fortune.com/). The Fortune Most Admired com-
pany list is generated through peer review by Fortune Magazine.
• The Business Week magazine Top 100 Global Brands list was sourced from http://www.businessweek.com/magazine/content/03_31/b3844020_mz046.htm. This lists the most valuable brands from around the world, based on publicly available marketing and financial data.
• The Wired 40 list of technology-ready companies was taken from Wired Magazine and is available online at http://www.wired.com/wired/archive/11.07/40main.html. The list contains the companies that Wired Magazine believes are best prepared for the new economy.
In all cases the 2003 editions of the lists were used.
5.1.3 Extracting hyperlink recommendation scores
For each URL, PageRanks and in-degrees were extracted from search engines
Google [93] and AllTheWeb [80].
Unfortunately there is no way for researchers external to Google to access PageR-
anks used in Google document ranking. The only publicly available PageRank values
are provided in the Google toolbar [98] and through the Google directory [95]. When
a page is visited, the Toolbar lists its PageRank on a scale of 0 to 10, indicating “the im-
portance Google assigns to a page”.1 When a directory category is viewed, the pages
are listed in descending PageRank order with a PageRank indicator next to each page,
to “tell you at a glance whether other people on the web consider a page to be a high-
quality site worth checking out”.2 With PageRank provided directly in these ways,
it can be analysed as a direct indicator of quality, without needing to know whether
or how it is used in Google ranking. The PageRank from the Google Toolbar is inter-
esting as toolbar users may use it directly as a measure of document quality, and the
quality of this measure is unknown. Further, as it is sometimes claimed that PageRank behaves differently on a large-scale web graph, it may allow for some insight
into properties of WWW-based PageRank (to accompany results presented in Chap-
ter 7).
PageRanks were extracted from the Microsoft Internet Explorer Google Toolbar [98]
by visiting pages and noting the interaction between the Toolbar and Google servers.
To ensure consistency a single Google network (IP) address was used to gather Toolbar
data.3 When the requested URL resulted in a redirect, the PageRank was retrieved for
the final destination page (types of redirects are discussed in Section 3.1.3). During the
extraction process it was noted that PageRank values had been heavily transformed. Actual PageRanks are power-law distributed, so low PageRank values should be represented far more frequently than higher values. By contrast, the Toolbar reports
values in the range of 0 to 10, with all values frequently reported (see Figure 5.1). It
is likely that one reason for this transformation is to provide a more meaningful mea-
sure of page quality to toolbar users. Without such a transformation most documents
would achieve a Toolbar PageRank value of 0.
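The exact Toolbar transformation is not public. A common conjecture, consistent with the power-law argument above, is that raw scores are mapped onto the 0–10 scale roughly logarithmically; the sketch below illustrates that idea only, and the scaling used is an arbitrary assumption rather than Google’s actual mapping.

import math

def toolbar_style_bucket(raw_score, max_raw):
    """Map a raw power-law-distributed score onto a 0-10 scale (illustrative only).

    A logarithmic bucketing: each additional point corresponds to a roughly
    constant multiplicative increase in the underlying score.
    """
    if raw_score <= 1:
        return 0
    scaled = 10 * math.log(raw_score) / math.log(max_raw)
    return max(0, min(10, int(round(scaled))))

# Toy illustration: raw scores spanning several orders of magnitude.
max_raw = 10 ** 7
for raw in (1, 10, 10 ** 3, 10 ** 5, 10 ** 7):
    print(raw, toolbar_style_bucket(raw, max_raw))

Without some such compression, almost all pages would indeed report the lowest Toolbar values, as the text above notes.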
Several problems were faced when obtaining in-degree values. These could only
be reliably extracted for site home pages. Problems that have been identified in meth-
ods used by WWW search engines to estimate linkage include:
1. counting pages which simply mention a URL rather than linking to it,
2. not anchoring the link match, so that the count for http://www.apple.com
includes pages with http://www.apple.com.au and http://www.apple.
com/quicktime/, and
3. under reporting the in-degree, for example by systematically ignoring links from
pages with PageRanks less than four.4
Three methods for accessing in-degree estimates for a URL were evaluated (estimates are summarised in Table 5.1):
1 From: http://toolbar.google.com/button_help.html
2 From: http://www.google.com/dirhelp.html.
3 The Google Toolbar sources PageRank scores from one of several servers. During experiments it was noted that the PageRank scores for the same page could differ according to which server was queried. This effect is believed to be caused by out-of-date indexes being used on some servers.
4 This is believed to be the case in Google’s link counts, see http://www.webmasterworld.com/forum80/254.htm
         Extracted from Google
         link: in-degree   contains ‘in-degree’   PageRank   AllTheWeb in-degree
Min      0                 0                      0          0
Max      857 000           1 250 000              10         14 324 793
Mean     958               1910                   5.3        17 889
Median   82                112                    5          319
Apple    87 500            237 000                10         2 985 141
Table 5.1: Values extracted from Google [93] and AllTheWeb [80] for 5370 company home
pages in September 2003. Listed are range, mean, median and an example value (for Apple
Computer http://www.apple.com/).
• The first method used the Google query link:URL, which reportedly has prob-
lem 1.
• The second method used Google to find pages which contained the URL. This
solution was suggested by the Google Team,5 but it exhibits problems 1 and 2,
and also seems to only return pages which contain the URL in visible text.
• The third method used the AllTheWeb query link:URL -site:URL to re-
trieve in-degree values. The operator -site:URL was included because the
method has problem 2, and adding a -site:URL excludes all intra-site links,
and so eliminates many of the non-home page links and a few home page links.
All three types of in-degree estimates were found to be correlated with each other
(Pearson r > 0.7).
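As a worked illustration of how such correlations can be computed, the sketch below calculates Pearson’s r between two score lists, applying a log transform to the in-degree values as is done for the comparisons reported later in this chapter; adding 1 before taking the log (to accommodate zero in-degrees) is an assumption made here for the illustration, and the example values are invented.

import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: Toolbar-style PageRanks against log-transformed in-degrees.
pageranks = [3, 5, 6, 7, 9, 10]
in_degrees = [40, 300, 1200, 9000, 150000, 3000000]
log_in_degrees = [math.log(d + 1) for d in in_degrees]  # +1 guards against zero in-degree
print(pearson_r(pageranks, log_in_degrees))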
AllTheWeb in-degrees were chosen for comparison with Google PageRanks to
eliminate any potential search engine preference, and to ensure that in-degree sourc-
ing issue 3 did not impact correlations between in-degree and PageRank values. Both
search engines had independent crawls of a similar size (AllTheWeb crawled 3.1 billion documents, compared to Google’s 3.3 billion).6
Table 5.1 displays some pertinent properties of the extracted values, namely the
minimum, maximum, mean and median values of all extracted hyperlink recommen-
dation evidence.
5.2 Hyperlink recommendation bias
This section presents the results of an analysis of potential bias in hyperlink recommendation scores. Biases considered include a preference for home pages, large famous companies, a particular country of origin, or the industry in which the company operates.
5 As discussed in: http://slashdot.org/comments.pl?sid=75934&cid=6779776
6 The collection size was estimated to be 3.3 billion on http://www.google.com at September 2003; as of November 2004 it is estimated to be around 8 billion documents on http://www.google.com.
5.2.1 Home page preference
Figure 5.2 shows the PageRank distributions for eight of the twenty crawls (distri-
butions for the other twelve crawls are included in Appendix E). The distributions
reveal that in almost every case, the company home page has the highest PageRank.
In every case at least some pages received lower PageRank than the home page. This
is not surprising, as links from one server to another usually target the root page of
the target server. In fact targeting deeper pages has even led to lawsuits [192].
5.2.2 Hyperlink recommendation as a page quality recommendation
Having examined intra-site hyperlink recommendation effects, inter-site comparisons are now considered.
5.2.2.1 Large, famous company preference
The Fortune 500 (F500), Fortune Most Admired and Business Week Top 100 Global
Brands lists provide good examples of large, famous companies, relative to the general
population of companies. Figure 5.3 shows that companies from these lists tended to
have higher PageRanks than other companies. However, there are examples of non-F500 companies with PageRank 10, such as http://www.adobe.com. At the other end of the spectrum, the Zanett group (http://www.zanett.com) has an F500 rank of 363, but a PageRank of 3. This puts it in the bottom 6% of the 5370 companies,
based on Toolbar advice.
The home pages of Fortune 500 and Most Admired companies receive, on av-
erage, one extra PageRank point. Business Week Top Brand companies receive, on
average, two extra PageRank points. Similar findings were observed for in-degree.
These findings support Google’s claim that PageRank indicates importance and qual-
ity. In-degree was observed to be an equally good indicator of popularity on all three
counts.
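The “extra PageRank point” comparisons above amount to comparing mean Toolbar PageRank between list members and non-members. The sketch below shows this computation over a hypothetical scores table; the URLs and values are placeholders, not the thesis data.

def mean_pagerank_by_membership(pageranks, members):
    """Compare mean Toolbar PageRank of list members against non-members.

    pageranks: dict mapping home-page URL -> Toolbar PageRank (0-10).
    members: set of URLs belonging to the list (e.g. Fortune 500).
    """
    in_list = [pr for url, pr in pageranks.items() if url in members]
    out_list = [pr for url, pr in pageranks.items() if url not in members]
    return sum(in_list) / len(in_list), sum(out_list) / len(out_list)

# Placeholder example (values invented for illustration).
pageranks = {"a.com": 7, "b.com": 5, "c.com": 8, "d.com": 4, "e.com": 5}
f500 = {"a.com", "c.com"}
member_mean, other_mean = mean_pagerank_by_membership(pageranks, f500)
print(member_mean - other_mean)  # the "extra PageRank points" for list members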
5.2.2.2 Country and technology preference
Given the diversity of WWW search users, a preference in hyperlink recommendation
evidence for a particular company, industry or geographical location may be undesir-
able. This section investigates biases towards technically-oriented and US companies.
As shown in Figure 5.4, a bias towards US companies was not observed. However, it should be noted that all companies studied are listed on US stock exchanges. Further, as a smaller regional stock exchange was included (AMEX), there may be a bias towards non-US companies by virtue of comparing large international (globally listed) companies with smaller (regionally listed) US companies. Perhaps if local Australian Stock Exchange (ASX) companies were compared to similarly sized companies from the American Stock Exchange the results would differ. This is left for future work.
[Figure 5.2 plots: within-site Toolbar PageRank distributions (number of pages crawled versus PageRank) for www.microsoft.com (HP PR=10), www.apple.com (HP PR=10), www.qwest.com (HP PR=8), www.captaris.com (HP PR=7), www.credence.com (HP PR=6), www.cummins.com (HP PR=6), www.unitedauto.com (HP PR=5) and www.acmeunited.com (HP PR=4).]
Figure 5.2: Toolbar PageRank distributions within sites. The PageRank advice to users is usually that the home page is the most important or highest quality page, and other pages are less important or of lower quality. The PageRank of the home page of the site is shown as “HP PR=”. Distributions for the twelve other companies are provided in Appendix E.
[Figure 5.3 plots: proportion of group versus Toolbar PageRank (left) and versus in-degree (right) for companies in and not in the Fortune 500, Fortune Most Admired and Business Week Top 100 Global Brands lists.]
Figure 5.3: Bias in hyperlink recommendation evidence towards large, admired and popular companies. Companies in Fortune 500, Fortune Most Admired and Business Week Top 100 Global Brands lists tend to have higher PageRank. The effect is strongest for companies with well known brands. On the right similar effects are present in in-degree.
[Figure 5.4 plots: proportion of group versus Toolbar PageRank (left) and versus in-degree (right) for US versus non-US companies, technology versus non-technology companies, and Wired 40 versus other companies.]
Figure 5.4: Bias in hyperlink recommendation evidence towards technology-oriented or US companies. A strong PageRank bias towards US companies was not observed. However, companies in the “Internet Services”, “Software” and “Computers” industries had higher PageRank, as did those in the Wired 40. The strong bias towards technology companies is most useful if users are interested in technology, however given the increasing global reach of the WWW, and the increasing ease of access for non-technical users, such biases are helping a smaller and smaller proportion of the WWW user population. On the right are similar plots for in-degree.
                                          PageRank
Industry                      Companies   Range   Mean
Internet Services 29 3–9 6.66
Publishing 58 4–9 6.66
Airlines 25 3–8 6.48
Office Equipment 7 5–8 6.43
Entertainment 14 4–8 6.36
Software 306 3–10 6.35
Computers 86 4–10 6.29
Consumer Electronics 18 5–8 6.17
Automobile Manufacturers 7 4–8 6.14
Diversified Technology Services 46 4–8 6.02
...
Steel 34 3–7 4.68
Coal 6 4–5 4.67
Clothing & Fabrics 54 2–7 4.63
Oil Companies 132 1–8 4.60
Pipelines 25 3–6 4.56
Banks 433 0–8 4.55
Real Estate 174 2–7 4.55
Precious Metals 38 0–6 4.47
Marine Transport 12 3–6 4.42
Savings & Loans 146 0–6 4.08
Table 5.2: PageRanks by industry. The “Internet Services” and “Publishing” industries,
with 29 and 58 companies respectively, had the highest mean PageRank.
Two measures of technology bias were investigated: bias towards companies which produce technology, and bias towards heavy users of it. First, using company information from http://quote.fool.com/, companies in industries involving computer software, computer hardware, or the Internet were identified. The industry and
PageRank breakdown is shown in Table 5.2. Results in Figure 5.4 illustrate a bias to-
wards technology-oriented companies. These companies received an extra PageRank
point on average. The second test of technology bias used the 2003 Wired 40 list of
technology-ready companies. This demonstrated an even greater pro-technology bias
(Figure 5.4), with companies present in the Wired 40 receiving two extra PageRank
points on average.
A strong bias towards technology-oriented companies is useful if users are interested in technology; however, given the increasing global reach of the WWW and the increasing ease of access for non-technical users, such biases assist a smaller and smaller proportion of the WWW user population.
5.3 Correlation between hyperlink recommendation measures
This section presents results from an investigation of the extent to which the advice given by PageRank and by in-degree on the WWW is correlated. This investigation was conducted over the set of company home pages and the set of known spam pages.
5.3.1 For company home pages
The strong correlation between Toolbar-reported PageRank and log of in-degree for
company home pages is depicted in Figure 5.5. To better understand the differences
between in-degree and PageRank, an analysis of “winners” and “losers” from the
PageRank calculation was performed. Winners in the PageRank calculation have
high PageRanks even though they have low in-degree (the bottom right quadrant
in Figure 5.5), whilst losers have high in-degree but receive a low PageRank (top
left quadrant). Some anomalies were observed due to errors in in-degree calcula-
tions (e.g. www.safeway.com had PageRank of 6 with in-degree 0). However, these
cases were rare and uninteresting, as they appeared to be due to anomalies within the
search engines rather than the link graph. Nonetheless, after discounting cases where
AllTheWeb scores disagreed with the other two in-degree estimates, there were some
extreme cases where in-degree and PageRank were at odds. These cases are shown in
Table 5.3.
In some cases the discrepancies shown in Table 5.3 are very large. For example, ESS Technology (http://www.esstech.com) was demoted, achieving only PageRank 3 despite having an in-degree of 22 357. On the other hand, Akamai (http://www.akamai.com) achieved a PageRank of 9 with only 17 359 links. The promotions and demotions of sites relative to their in-degree ranking by PageRank do not appear to indicate any systematic additional preference for higher “real-world quality”.
[Figure 5.5 plot: AllTheWeb in-degree (log scale) versus Toolbar PageRank for company home pages, with the median in-degree marked at each PageRank value.]
Figure 5.5: Toolbar PageRank versus in-degree for company home pages. For 5370 company home pages, Toolbar PageRank and log of AllTheWeb [80] in-degree have a correlation of 0.767 (Pearson r). This high degree of correlation is achieved despite the relatively large spread of PageRank zero pages. Such pages may have been missed by the Google crawler or indexer, or might have been penalised by Google policy.
Stock URL Industry PageRank In-degree
AAPL http://www.apple.com Computers 10 2985141
YHOO http://www.yahoo.com Internet Services 9 5620063
AKAM http://www.akamai.com Internet Services 9 17359
EBAY http://www.ebay.com Consumer Services 8 737792
BDAL http://www.bdal.com Advanced Medical Supplies 8 199
GTW http://www.gateway.com Computers 7 170888
JAGI http://www.janushotels.com Lodging 7 64
FLWS http://www.1800flowers.com Retailers 6 38254
KB http://www.kookminbank.co.kr Banks 6 5
IO http://www.i-o.com Oil Drilling 5 235
FFFL http://www.fidelityfederal.com Savings & Loans 5 34
USNA http://www.usanahealthsciences.com Food Products 4 13353
RSC http://www.rextv.com Retailers 4 6
ESST http://www.esstech.com Semiconductors 3 22347
CAFE http://www.selectforce.net Restaurants 3 3
MCBF http://www.monarchcommunitybank.com Savings & Loans 2 6
WEFC http://www.wellsfinancialcorp.com Savings & Loans 2 1
PTNR http://investors.orange.co.il Wireless Communications 1 176
HMP http://www.horizonvascular.com Medical Supplies 1 5
VCLK http://www.valueclick.com Advertising 0 46659
Table 5.3: Extreme cases where PageRank and in-degree disagree. Even after eliminating cases where the AllTheWeb in-degree disagreed with the two Google in-degree estimates, large disparities in scores were observed. The promotions and demotions of sites relative to their in-degree ranking do not seem to indicate a more accurate assessment by PageRank.
Figure 5.6: Toolbar PageRank versus in-degree for links to a spam company. The 280 spam
pages achieve good PageRank without needing massive numbers of in-links. In some cases,
they achieve good PageRank with few links. Pages with PageRank 6 had a median in-degree
of 1168 for companies and 44 for spam pages.
appear to indicate any systematic additional preference for higher “real-world qual-
ity”.
5.3.2 For spam pages
One claimed benefit of PageRank over in-degree is that it is less susceptible to link
spam [103]. To test this claim the in-degree and PageRank scores for 280 spam pages
were compared. The relationship is plotted in Figure 5.6.
If PageRank were spam-resistant one might expect high in-degree spam pages to
have low PageRank. Such a case would be placed in the top left quadrant of the scat-
ter plots. However, for the 280 spam pages the effect is minimal, and in some cases
the opposite. For example, the median in-degree values for a PageRank score of 6
were 1168 for company home pages and 44 for spam pages. Spam pages tended to
achieve a PageRank of 6 seemingly with fewer incoming links than legitimate compa-
nies.
It is possible that any pages which did fall in the top left quadrant had already been
excluded from Google. However, this still shows that Google cannot rely entirely on
PageRank for eliminating spam. This is not surprising when considering the extreme
case: a legitimate page such as an academic’s home page might have an in-degree
of 10, while a search engine optimiser has massive resources to generate link spam
from thousands or millions of pages.
5.4 Discussion
5.4.1 Home page bias
The analysis showed that home pages tended to have higher PageRank. Within all
evaluated sites the home page usually had the highest or equal highest score. These
results lend support to the use of hyperlink recommendation evidence for home page
finding tasks. A detailed evaluation of potential gains from using hyperlink recommendation measures in home page finding is presented in Chapter 7.
While the home page bias may be useful in web ranking, in the context of the
Google Toolbar it could have a potentially confusing effect. For example, from a Tool-
bar user’s point of view it might seem mystifying that the “Apple Computer” home
page is rated 10, but its “PowerBook G4 15-inch” page is rated 7. Is the Toolbar im-
plying that the product is less important or of lower quality? Is it useful to give such
advice about deeper pages in general? In fact, it may be preferable to display a con-
stant indicator in the Toolbar when navigating within a web site. An investigation of
whether WWW users understand hyperlink-recommendation scores reported by the
Google Toolbar remains for future work.
5.4.2 Other systematic biases
The experimental results for company home pages show that Toolbar PageRank favours the following, by an average of two PageRank points:
1. Companies with famous brands (by Business Week Top Brands)
2. Companies considered to be prepared for the new economy (by Wired 40 listing)
Furthermore, PageRank scores are an average of one point higher for:
1. Companies with large revenue (by Fortune 500 membership)
2. Admired companies (by Fortune Most Admired membership)
3. Technology-oriented companies (by Industry type)
Similar patterns were observed for in-degree (with correspondingly larger gaps in in-degree values).
The bias towards high-revenue, admired and famous companies is consistent with the stated goal of hyperlink recommendation algorithms. The fact that hyperlink measures more strongly recommend sites operated by companies with highly recognised brands suggests that recognition is a key factor. This is intuitively
obvious, as a web site can only be linked to by authors who know of its existence.
Favouring high-recognition sites in search results or directory listings helps searchers
by bringing to bear their existing knowledge.
A list which gives prominence to relevant web sites already known to the searcher
may also inspire confidence in the value of the list. Consider the Google Directory
category for Australian health insurance.7 Viewed alphabetically the top two entries
are the relatively little known web sites “Ask Ted” and “Australian Health Manage-
ment Group”. Viewed in PageRank order, the top two entries are the arguably better
known (in Australia), “Medibank Private” and “MBF Health Insurance”. Even if the
user does not agree that these are the best results in some contexts, it may be better to
list results which the user will immediately recognise.
An important, but less beneficial side-effect of using hyperlink-recommendation
algorithms is the inherent bias towards technology-oriented companies. There are
a number of query terms whose common interpretation may be lost through heavy
use of hyperlink-recommendation algorithms.8 For example, using Google there are
a number of general queries where technology interpretations are ranked higher than
their non-technology interpretations: “opera”, “album”, “java”, “Jakarta”, “png”,
“putty”, “blackberry”, “orange” and “latex”. The strong technology bias may be an
artefact of the fact that people building web pages are from a largely technology-
oriented demographic. Many web authors are technically-oriented and may primar-
ily think of Jakarta as a Java programming project. On the other hand, many WWW
users may predominantly think of Jakarta as the capital of Indonesia! As the demo-
graphics of WWW users change, returning an obscure technology-related result will
become less desirable. This effect highlights the need for recommendation methods
which more closely match user expectations. Such methods, which might take into ac-
count individual differences, or simply estimate the demographics of typical WWW
users, remain for future work. Measures other than link recommendation may be bet-
ter indicators of quality. Such measures may include whether companies are listed on
the stock exchange, present in online directories and/or are highly recommended by
peer review.9
The precise effect of these biases on navigational search is difficult to quantify. It
may be that the observed bias will be more problematic for informational tasks rather
than navigational tasks.
5.4.3 PageRank or in-degree?
PageRank and in-degree measures performed equally well when identifying home
pages and membership to Fortune 500, Most Admired and Global Brand lists. In
cases where the measures did not agree, such as for those listed in Table 5.3, there is
no evidence to demonstrate that PageRank was superior to in-degree.
A high level of correlation was observed between Toolbar PageRank and log in-
degree scores, even for a collection of spam pages. Given the extra cost involved in
computing PageRank, this correlation raises serious questions about the benefit of using PageRank over in-degree. Subsequent chapters investigate this further, examining whether there is anything to be gained by using PageRank or in-degree in navigational search situations.
7. Available at: http://directory.google.com/Top/Regional/Oceania/Australia/Business and Economy/Financial Services/Insurance/Health/
8. It is likely that anchor-text is also biased in this way, although it may affect results less, as the bias would be narrower, i.e. only for terms that are commonly used in the anchor-text pointing to a particular page.
9. For example, by using scores from a service such as http://www.alexa.com.
Chapter 6
Combining query-independent web
evidence with query-dependent
evidence
Query-independent measures, such as PageRank and in-degree, provide an overall
ranking of corpus documents. Such measures need to be combined with some form
of query-dependent evidence for query processing, otherwise the same list of doc-
uments would be retrieved for every query. There are many ways in which query-
independent and query-dependent evidence can be combined, and few combination
methods have been evaluated explicitly for this purpose (see Section 2.5). This chap-
ter presents an analysis of three methods for combining query-independent evidence,
in the form of WWW PageRanks, with query-dependent baselines.
6.1 Method
This chapter examines a home page finding task where, given the name of a public
company, the ranking algorithm has to retrieve that company’s home page from a
corpus containing the home pages of publicly listed US companies.
The query and document set used in this experiment were sourced from company
data used throughout experiments in the previous chapter. The document corpus
consisted of the downloaded full-text content of each company’s home page, and
the anchor-text of links directed to those home pages. The query set consisted of
the official names of all companies. The query and document set were used to build
three query-dependent baselines; a full-text-only baseline, an aggregate anchor-text-
only baseline, and a baseline using both forms of evidence. The PageRank scores for
these pages were extracted from Google. Three methods for combining PageRank
and query-dependent evidence were examined: the first used PageRank as a mini-
mum score threshold, and the second and third methods used PageRank to re-rank
the query-dependent baseline rankings.
The following sections outline the query and document set, the scoring methods
used to generate the query-dependent baselines, how hyperlink recommendation ev-
idence was gathered, and methods for combining query-dependent baselines with
query-independent web evidence.
6.1.1 Query and document set
The document corpus consisted of the home pages of the publicly listed companies used in experiments in Chapter 5: 5370 home page documents – one for each company listed on a prominent US stock exchange (such as NYSE or NASDAQ) for which a home page URL was found (see Section 5.1.1).
As little useful anchor-text information was contained in the set of downloaded
documents (because companies rarely link to their competitors' home pages), the
anchor-text evidence was gathered from the Google WWW search engine [93]. This
WWW-based anchor-text evidence was sourced for a 1000 page sample selected at
random from the set of company home pages. For each of these pages 100 back-links1
were retrieved using Google’s “link:” operator (as described in Appendix C). Each
back-link identified by Google was parsed and anchor-text snippets whose target was
the company home page were added to the aggregate anchor-text for that page.
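A hedged sketch of this aggregation step follows, assuming the HTML of each back-link page has already been downloaded; the class and function names are illustrative and URL canonicalisation is greatly simplified compared to what a real harvester would need.

    from html.parser import HTMLParser


    class AnchorTextCollector(HTMLParser):
        """Collect the anchor-text of links whose href points at `target`.

        A simplified sketch: real anchor-text harvesting would also need proper
        URL canonicalisation (scheme, host aliases, relative links and so on).
        """

        def __init__(self, target):
            super().__init__()
            self.target = target.rstrip("/").lower()
            self.snippets = []
            self._in_target_link = False
            self._buffer = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href") or ""
                if href.rstrip("/").lower() == self.target:
                    self._in_target_link = True
                    self._buffer = []

        def handle_data(self, data):
            if self._in_target_link:
                self._buffer.append(data)

        def handle_endtag(self, tag):
            if tag == "a" and self._in_target_link:
                self.snippets.append(" ".join("".join(self._buffer).split()))
                self._in_target_link = False


    def aggregate_anchor_text(backlink_pages, home_page_url):
        """Concatenate anchor-text snippets from a set of back-link pages into one
        aggregate anchor-text document for `home_page_url`."""
        collector = AnchorTextCollector(home_page_url)
        for html in backlink_pages:
            collector.feed(html)
        return " ".join(collector.snippets)


    # Toy usage with two hypothetical back-link pages.
    backlinks = [
        '<p>See <a href="http://www.microsoft.com/">Microsoft Corporation</a>.</p>',
        '<p><a href="http://www.microsoft.com">Microsoft home page</a></p>',
    ]
    print(aggregate_anchor_text(backlinks, "http://www.microsoft.com"))
    # -> "Microsoft Corporation Microsoft home page"
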
The query set consisted of the official names for all 5370 companies, and the correct
results were the named company’s home page. For example, for the query
“MICROSOFT CORP” the correct answer was the document downloaded from http:
//www.microsoft.com.
The retrieval effectiveness for both the anchor-text and full-text baselines is likely
to be higher than would be expected for a complete document corpus. In the full-
text baseline, the inclusion of only the home pages of candidate companies excludes many pages that might also match company-name queries. In particular, in a more complete document corpus, a non-home page document on a company's website (such as a contact information page) might achieve a higher match score than that company's home page. The anchor-text baseline is also likely to achieve unrealistically
high retrieval effectiveness even given the incomplete aggregate anchor-text evidence
examined (only 100 snippets of anchor-text are retrieved per home page). This is be-
cause the aggregate anchor-text corpus only contains text that is used to link to one
of the evaluated companies, and so will be unlikely to contain much misleading or
ill-targeted anchor-text.
6.1.2 Query-dependent baselines
Three query-dependent baselines were evaluated: content, anchor-text and
content+anchor-text.
• The content baseline was built by scoring the full-text of the downloaded home
pages using Okapi BM25 with untrained parameters (k1 = 2 and b = 0.75) [172] (described in Section 2.3.1.3).
1. A back-link is a document that has a hyperlink directed to the page under consideration.
• The anchor-text baseline was built by scoring aggregate anchor-text documents
using Okapi BM25 with the same parameters as used for content (described in
Section 2.4.1).
• The content+anchor-text baseline was built by scoring document full-text and ag-
gregate anchor-text concurrently using Field-weighted Okapi BM25 [173] (de-
scribed in Section 2.5.2.1). The field-weights for document full-text (content) and
aggregate anchor-text were set to 1, and k1 and b were set to the same values
used in the content and anchor-text baselines [173]. The content+anchor baseline
was computed for the set of pages for which anchor-text was retrieved.
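The three baselines above might be computed along the following lines. This is a minimal sketch: the idf variant and the stand-in for the field-weighted combination are assumptions, and may differ in detail from the Okapi BM25 and Field-weighted Okapi BM25 formulations cited above.

    import math
    from collections import Counter


    def bm25_scores(query_terms, docs, k1=2.0, b=0.75):
        """Okapi BM25 scores for each doc (a list of terms) against the query.

        Uses a Robertson-Sparck Jones style idf; the exact Okapi formulation used
        in the thesis (Section 2.3.1.3) may differ in such details.
        """
        n_docs = len(docs)
        avgdl = sum(len(d) for d in docs) / n_docs
        df = Counter(t for d in docs for t in set(d))
        scores = []
        for d in docs:
            tf = Counter(d)
            score = 0.0
            for t in query_terms:
                if tf[t] == 0:
                    continue
                idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
                norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
                score += idf * tf[t] * (k1 + 1) / norm
            scores.append(score)
        return scores


    def fielded_docs(content_docs, anchor_docs, w_content=1, w_anchor=1):
        """Crude stand-in for the field-weighted combination: repeat each field's
        terms in proportion to its (integer) weight before scoring with BM25.
        BM25F proper applies the weights inside the saturation function
        (Section 2.5.2.1); with both weights set to 1 the two coincide."""
        return [c * w_content + a * w_anchor
                for c, a in zip(content_docs, anchor_docs)]


    # Toy usage: the content, anchor-text and content+anchor-text baselines.
    query = ["microsoft", "corp"]
    content = [["welcome", "to", "microsoft"], ["acme", "corp", "home", "page"]]
    anchors = [["microsoft", "corp", "microsoft", "corporation"], ["acme", "inc"]]
    print(bm25_scores(query, content))                         # content baseline
    print(bm25_scores(query, anchors))                         # anchor-text baseline
    print(bm25_scores(query, fielded_docs(content, anchors)))  # content+anchor-text
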
6.1.3 Extracting PageRank
Google's PageRank scores were extracted from the Google Toolbar for Microsoft Internet Explorer using the method described in Section 5.1.3. These scores were calculated by
Google [93] for a 3.3 billion page crawl.2
6.1.4 Combining query-dependent baselines with query-independent web
evidence
Many different schemes have been proposed for combining query-independent and
query-dependent evidence. Kraaij et al. [135] suggest measuring the query-
independent evidence as the probability of document relevance and treating it as a
prior in a language model (see Section 2.5.2.2). However, because Okapi BM25 scores
are weights rather than probabilities, prior document relevance cannot be directly
incorporated into the model. Westerveld et al. [212] also make use of linear combi-
nations of normalised scores, but for this to be useful with PageRank, a non-linear
transformation of the scores would almost certainly be needed:3 the distribution of
Google’s PageRanks is unknown, and those provided via the Toolbar have been ob-
served not to follow a power law (see Section 5.1.3). Savoy and Rasolofo [178] combine
query-dependent URL length evidence with Okapi BM25 scores by re-ranking the top
n documents on the basis of the URL scores (described in Section 2.5.1.2). The benefit
of this type of combination is that it does not require knowledge of the underlying
data distribution.
The three combination methods examined in this experiment are: retrieving only
those documents that exceed a PageRank threshold (see Section 2.5.1.5), using PageRank in a rank-based (quota) re-ranking of query-dependent baselines, and using PageRank in a score-based re-ranking of query-dependent baselines. The re-ranking
approaches are variations on those proposed by Savoy and Rasolofo, and are used
because they do not require any knowledge of the global distribution of Google's PageRank values [178].
2. The collection size was estimated to be 3.3 billion on http://www.google.com in September 2003; as of November 2004 it is estimated to be around 8 billion documents on http://www.google.com.
3. This is because while most PageRanks are very low, a few are orders of magnitude larger, as PageRank values are believed to follow a power-law distribution (see Section 2.4).
The use of a minimum PageRank threshold that pages need to exceed prior to
inclusion is equivalent to ranking results by PageRank evidence and then re-ranking
above a score-based threshold using query-dependent evidence. The use of a static4
minimum query-independent threshold value means that some pages will never be
retrieved, and so could be removed from the corpus. To enable the retrieval of pages
that do not exceed the static threshold value, a dynamic threshold function could be
used. Such a function could reduce the minimum threshold if some condition is not
met (for example if less than ten pages are matched). Such a scheme is discussed
further in Section 10.3.
The re-ranking experiments explore two important scenarios. In the first, PageRank plays a large role in ranking documents, through a quota-based combination. In
the quota-based combination all documents retrieved within the top n ranks in the
query-dependent baseline are re-ranked by PageRank. In the second scenario Page-
Rank has a smaller contribution and is used to re-order documents that achieve query-
dependent scores within n% of the highest baseline score (per query). This is termed
a score-based combination. In both cases if the re-ranking cutoffs are sufficiently large,
then all baseline documents will be re-ranked by PageRank order.
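A small sketch of the two re-ranking schemes is given below, assuming each query's baseline results arrive as (document, score) pairs in descending score order and that a PageRank lookup is available; the function names and cutoff values are illustrative.

    def quota_rerank(results, pagerank, n):
        """Re-order the top-n ranked documents by PageRank; leave the tail as is."""
        head = sorted(results[:n], key=lambda d: pagerank.get(d[0], 0), reverse=True)
        return head + results[n:]


    def score_rerank(results, pagerank, pct):
        """Re-order by PageRank only those documents scoring within pct% of the
        top baseline score for this query; leave the tail as is."""
        if not results:
            return results
        cutoff = results[0][1] * (1 - pct / 100.0)
        head = [d for d in results if d[1] >= cutoff]
        tail = results[len(head):]
        head = sorted(head, key=lambda d: pagerank.get(d[0], 0), reverse=True)
        return head + tail


    # Toy example of a query where the top result scores far above the rest.
    results = [("lycos-home", 100.0), ("lycos-press", 60.0), ("other", 20.0)]
    pagerank = {"lycos-home": 6, "lycos-press": 8, "other": 3}
    print(quota_rerank(results, pagerank, n=2))     # the top two are shuffled
    print(score_rerank(results, pagerank, pct=10))  # top result left in place

With a sufficiently large n or pct both functions degenerate to ranking the query-matched documents by PageRank alone, as noted above.
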
6.2 Results
This section reports the effectiveness of the baselines and the three evaluated combi-
nation methods.
6.2.1 Baseline performance
The effectiveness of the three baselines varied considerably:
• The content baseline retrieved the named home page at the first rank for only
two-out-of-five queries, and within the first ten results for a little over half the
queries (S@1 = 0.42, S@10 = 0.55).
• The anchor baseline performed well, retrieving three-out-of-four companies at
the first rank (S@1 = 0.725, S@10 = 0.79).
• The content+anchor baseline performed well, also retrieving three-out-of-four
companies at the first rank (S@1 = 0.729, S@10 = 0.82).
The performance of the full-text (content) baseline was poor given the small size of
the corpus from which the home pages were retrieved. A small benefit was observed
when adding full-text evidence to the anchor-text baseline.
4. A threshold value that does not change between queries.
6.2.2 Using a threshold
Figure 6.1: The percentage of home pages and other pages that exceed each PageRank value.
Implementing a PageRank threshold minimum value of 1 would lead to the inclusion of 99.7%
of the home pages, while reducing the number of other pages retrieved by 16.1%.
Figure 6.1 illustrates the percentage of home pages and non-home pages5 that ex-
ceed each PageRank value. Implementing a PageRank threshold value of 1 leads to
the inclusion of 99.7% of the home pages, while significantly reducing the number of
other pages retrieved (by 16.1%, to 83.9% of pages). The non-home page PageRanks
examined here may be somewhat inflated relative to those on the general WWW, as
they were retrieved using a (breadth-first) crawl halted after 100 pages. It has been reported that WWW in-links are distributed according to a power law [35]. Thus,
assuming the distribution of PageRank is similar to that of in-degree,6 setting a thresh-
old at some small PageRank is likely to eliminate many pages from ranking consid-
eration. In a home page finding system this may provide substantial computational
performance gains and little (if any) degradation in home page finding effectiveness.
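The threshold analysis behind Figure 6.1 reduces to counting, per candidate threshold, the fraction of home pages and of other pages at or above that PageRank. A toy sketch, with illustrative page records:

    def pct_exceeding(pages, threshold):
        """Percentage of `pages` whose Toolbar PageRank is at least `threshold`."""
        if not pages:
            return 0.0
        return 100.0 * sum(p["pagerank"] >= threshold for p in pages) / len(pages)


    # Toy corpus: a few home pages and other (non-home) pages.
    corpus = [
        {"url": "http://www.apple.com", "pagerank": 10, "home": True},
        {"url": "http://www.apple.com/support", "pagerank": 7, "home": False},
        {"url": "http://www.example.com", "pagerank": 1, "home": True},
        {"url": "http://www.example.com/contact.html", "pagerank": 0, "home": False},
    ]
    home = [p for p in corpus if p["home"]]
    other = [p for p in corpus if not p["home"]]
    for t in range(0, 11):
        print(t, pct_exceeding(home, t), pct_exceeding(other, t))
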
6.2.3 Re-ranking using PageRank
Results for the quota-based combination are presented in Figure 6.2. Re-ranking by
quota severely degrades performance, with a re-ranking of the top two results in the
full-text (content) baseline decreasing the percentage of home pages retrieved at the
5. The hyperlink recommendation values extracted for the set of “non-home page” documents, described in Section 5.1.1.
6. The distribution of Google’s PageRanks for company home pages was observed not to follow a power-law distribution (Figure 5.1), although the Google PageRanks are likely to have been normalised and transformed for use in the Toolbar. The PageRanks calculated for use in experiments in Chapter 7 do exhibit a power-law distribution (see Section 7.6.1).
Figure 6.2: Quota-based re-ranking. Re-ranking the top n documents in the query-dependent baselines by PageRank. Re-ranking by quota severely degrades performance, with a re-ranking of the top 2 results in the full-text baseline decreasing the percentage of home pages retrieved at the first position from 42% to 29%. Note that the re-ranking of all results by PageRank (at n = 50) is equivalent to ranking query-matched documents by PageRank.
Figure 6.3: Score-based re-ranking. Re-ranking documents that are within x% of the top
query-dependent baseline score. Re-ranking using score produces a much slower decline in
performance than re-ranking based on rank only (Figure 6.2). Note that the re-ranking of all
results by PageRank (at 100% of score) is equivalent to ranking query-matched documents by
PageRank.
Figure 6.4: Example of two queries using different re-ranking techniques. For the query
“Lycos” the correct answer is located at position one of the full-text (content) baseline. Given
that the second match scores far less than the first, a shuffling of the first two results would
favour the document with a much smaller query-dependent score. For the second query “Ya-
hoo” the correct answer is located at position two and achieves a comparable score to the first
result: in this case a shuffle would improve retrieval effectiveness.
first rank from 42% to 29%. Results for the score-based combination are presented in
Figure 6.3. Compared to the quota-based combination, re-ranking using score produces
a much slower decline in performance.
An example illustrating the comparative effectiveness of quota-based and score-
based combinations for two queries is presented in Figure 6.4. For the query “Lycos”
the correct answer is located at position one of the full-text (content) baseline. The
second document in the baseline scores far less than the first. Using a quota-based re-ranking with a cutoff of two, the first two results would be reversed. By comparison,
using score-based re-ranking, the cutoff would have to be set to 35% (or larger) of
the top score for a reversal by re-ranking. For the second query “Yahoo” the correct
answer is located at position two and achieves a comparable score to the first result.
In this case, re-ranking by PageRank using either a quota-based or a score-based re-ranking with n = 2 would reverse the ranking, improving retrieval effectiveness.
6.3 Discussion
Results from experiments in this Chapter support the use of PageRank and other hy-
perlink recommendation evidence as a minimum threshold for document retrieval
or in a score-based re-ranking of query-dependent evidence. The use of a minimum
PageRank threshold in a home page finding task may improve computational per-
formance by eliminating non-home pages from the ranking. Another method by
which computational efficiency could be improved is by ranking documents using
aggregate anchor-text evidence only (which also implicitly imposes a threshold of in-
degree ≥ 1). An anchor-text index would be much smaller than a full-text index and
therefore likely to be more efficient.
Quota-based re-ranking was observed to be inferior to score-based re-ranking. This
illustrates the negative effects of not considering relative query-dependent scores when
combining baselines with query-independent evidence. Further, this suggests that
query-independent evidence should not be relied upon to identify the most relevant
documents. Pages that achieve high query-independent scores are likely to be im-
portant pages in the corpus (such as the home pages of popular, large or technology-
oriented companies, as reported in Chapter 5), but may not necessarily be more rele-
vant (and indeed, in this experiment, might be the “wrong” home pages).
The results from experiments in this Chapter also reinforce the previously observed importance of aggregate anchor-text for effective home page finding [56]. The
correct home page was retrieved at the first rank in the anchor-text baseline for three-
out-of-four queries, compared to being retrieved at the first rank for only two-out-of-
five queries in the full-text baseline. While the baseline retrieval effectiveness in this
experiment may be unrealistically high, these findings show that there is generally
adequate anchor-text evidence, even when using only 100 snippets, to find the home
pages of publicly listed companies. Combining the full-text and aggregate anchor-text
evidence in a field-weighted combination resulted in a slight improvement in home
page finding effectiveness.
The next chapter investigates whether query-independent evidence can be used
to improve home page finding effectiveness for small-to-medium web corpora. The
experiments include evaluations of the effectiveness of minimum query-independent
evidence thresholds, score-based re-ranking of query-dependent baselines by query-
independent evidence, and of aggregate anchor-text-only indexes.
Chapter 7
Home page finding using
query-independent web evidence
Providing effective home page search is important for both web and WWW search
systems (see Section 2.6.2.1). The empirical results reported in Chapter 5 showed hy-
perlink recommendation evidence to be biased towards home pages. This chapter
presents a series of detailed experiments to determine whether this bias can be ex-
ploited to improve home page finding performance on small-to-medium sized web
corpora. Experiments in this Chapter evaluate the effectiveness of hyperlink recom-
mendation evidence and URL length for document full-text and anchor-text baselines
on three such corpora.
The potential contribution of query-independent evidence to home page finding
is evaluated in three ways:
• By measuring the potential for query-independent evidence to exclude non-
home pages, through the use of minimum query-independent threshold scores
that documents must achieve for retrieval (following from experiments in Chap-
ter 6). The use of thresholds is investigated as a measure by which both the
retrieval effectiveness and efficiency of a home page finding system could be
improved;
• By gauging the maximum improvements offered through query-independent
evidence when combined with query-dependent baselines using some linear
combination of scores; and
• By empirically investigating a combination method that could be used to incor-
porate query-independent evidence in a production web search system, namely
a score-based re-ranking of query-dependent baselines by query-independent
evidence (following from experiments in Chapter 6).
7.1 Method
The initial step in this experiment was to identify the set of candidate test corpora. The
corpora were then crawled (if required) and indexed. Four types of query-independent
evidence (in-degree, two PageRank variants and URL-type, described below) were
computed during indexing. Following indexing, the top 1000 documents for each
query-dependent baseline were retrieved. Three query-dependent baselines were
studied; one based solely on document full-text, one based solely on document ag-
gregate anchor-text, and one consisting of both forms of evidence. The baselines were
then combined with query-independent scores using three combination methods. The
first method used query-independent evidence as a threshold, such that documents
that did not exceed the threshold were not retrieved (shown to be a promising ap-
proach in Chapter 6). The second method explored the optimal improvement that
could be gained when combining query-independent evidence with query-dependent
baselines using a linear combination of scores. The final combination method was a
score-based re-ranking of query-dependent baselines by query-independent evidence
(also shown to be a promising approach in Chapter 6). The improvements in effec-
tiveness achieved through these combination methods were then measured and com-
pared.
Throughout the experiments the Wilcoxon matched-pairs signed ranks test was
performed to determine whether improvements afforded were significant. This test
compares the algorithms according to the (best) ranks achieved by correct answers,
rather than the success rate measure. A confidence criterion of 95% (α = 0.05) is used.
Success rates (described in Section 2.6.6.2) were used to evaluate retrieval effec-
tiveness. The success rate measure is indicated by S@n where n is the cutoff rank.
S@n results were computed for n = 1, 5, 10.
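Both evaluation steps can be sketched briefly, assuming each run is reduced to the rank of the correct answer for every query (1001 when it is missing from the top 1000); the toy ranks are illustrative and scipy's wilcoxon is used as a stand-in for whatever test implementation was actually used.

    from scipy.stats import wilcoxon


    def success_at(ranks, n):
        """S@n: fraction of queries whose correct answer appears at rank <= n."""
        return sum(r <= n for r in ranks) / len(ranks)


    # Ranks of the correct answer under a baseline and a re-ranked run (toy data).
    baseline_ranks = [2, 3, 12, 1001, 4, 8, 40, 6, 5, 17]
    reranked_ranks = [1, 1, 4, 200, 2, 3, 15, 1, 2, 9]

    for n in (1, 5, 10):
        print(f"S@{n}: baseline={success_at(baseline_ranks, n):.2f} "
              f"re-ranked={success_at(reranked_ranks, n):.2f}")

    # Wilcoxon matched-pairs signed-ranks test on the per-query ranks.
    stat, p = wilcoxon(baseline_ranks, reranked_ranks)
    print(f"Wilcoxon statistic={stat}, p={p:.3f}, significant at alpha=0.05: {p < 0.05}")
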
The following sections give a description of the query-independent and query-
dependent baselines, outline the test collections used in the experiments and their
salient properties, and discuss the methods used to combine query-independent and
query-dependent evidence.
7.1.1 Query-independent evidence
Four types of query-independent evidence were considered:
IDG the document’s in-degree score (described in Section 2.4.3.1);
DPR the document’s Democratic PageRank score (described in Section 3.3.2);
APR the document’s Aristocratic PageRank score, using bookmarks from the Yahoo!
directory [217] or other web directory listings, which might be available to a
production search system (described in Section 3.3.2);
URL the document’s URL-type score, through a re-ranking by the UTwente/TNO
URL-type [135] (described in Section 2.3.3.2). The URL-types were scored ac-
cording to Root > Subroot > Directory > File.
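A minimal sketch of a URL-type classifier in the Root > Subroot > Directory > File ordering follows; the treatment of index pages, the example URLs and the numeric scores are assumptions and may differ from the UTwente/TNO definition used in the thesis.

    from urllib.parse import urlsplit

    # Higher score = more likely to be an entry page under the URL-type ordering.
    URL_TYPE_SCORE = {"root": 4, "subroot": 3, "directory": 2, "file": 1}


    def url_type(url):
        """Classify a URL as root, subroot, directory or file."""
        path = urlsplit(url).path
        # Strip a trailing index page, which is usually equivalent to its directory.
        for index in ("index.html", "index.htm", "default.htm"):
            if path.lower().endswith(index):
                path = path[: -len(index)]
        if path in ("", "/"):
            return "root"
        if path.endswith("/"):
            # One directory below the root is a "subroot"; deeper is "directory".
            depth = path.strip("/").count("/")
            return "subroot" if depth == 0 else "directory"
        return "file"


    for u in ("http://www.anu.edu.au/",
              "http://www.anu.edu.au/physics/",
              "http://www.anu.edu.au/physics/staff/",
              "http://www.anu.edu.au/physics/staff/page.html"):
        t = url_type(u)
        print(u, t, URL_TYPE_SCORE[t])
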
7.1.2 Query-dependent baselines
The relative improvements achieved over three query dependent baselines were ex-
amined. The baselines were:
• content baselines built by scoring document full-text using Okapi BM25 with
default parameters (k1 = 2 and b = 0.75) (see Section 2.3.1.3) [172].
• anchor-text baselines built using the methods outlined previously (i.e. by record-
ing all anchor-text pointing to each document and building a new aggregate
document containing all source anchor-text). The aggregate anchor-text docu-
ments were scored using Okapi BM25 using the same parameters as content.
• content+anchor-text baselines built by using Field-weighted Okapi BM25 [173]
to build and score composite documents containing document full-text and ag-
gregate anchor-text evidence. The baseline was scored with document full-text
and anchor-text field-weights set to 1, and k1 and b as above (see Section 2.5.2.1)
[173].
7.1.3 Test collections
Effectiveness improvements were evaluated using five test collections that spanned
three small-to-medium sized web corpora. The test corpora used in the evaluation in-
cluded a 2001 crawl of a university web (the ANU), and the TREC corpora VLC2 [106]
and WT10g [15]. Detailed collection information is reported in Table 7.1 and a further
discussion of the TREC collection properties appears in Section 2.6.7. Note that since
experiments published in Upstill et al. [201] the link tables have been re-visited and
further duplicates and equivalences removed (using methods described in Chapter 3).
This has resulted in some non-statistically significant changes in retrieval effective-
ness.
Test Pages Links Dead Content Anchor No. of Book-
Collection Size (million) (million) links queries queries marks (APR)
ANU 4.7GB 0.40 6.92 0.646 97/100 99/100 439
WT10gC 10GB 1.69 8.06 0.306 93/100 84/100 25 487
WT10gT 10GB 1.69 8.06 0.306 136/145 119/145 25 487
VLC2P 100GB 18.57 96.37 3.343 95/100 93/100 77 150
VLC2R 100GB 18.57 96.37 3.343 88/100 77/100 77 150
Table 7.1: Test collection information. The experiments were performed for five test collec-
tions spanning three small-to-medium sized web corpora. Two sets of queries were submitted
over the VLC2 collection - a popular set (VLC2P) and a random set (VLC2R) (see text for expla-
nation). The two sets computed for WT10g were the set used by Craswell et al. [56] (WT10gC)
and the official queries used in the TREC 2001 home page finding task (WT10gT). The values
in the “Content” and “Anchor” queries columns show the number of home pages found by
the baseline out of the number of queries submitted (this is equivalent to S@1000, as the top
1000 results for each search are considered).
Although there are many spam pages on the WWW, little spam was found in the
three corpora. Any spam-like effect observed seemed unintentional. For example, the
pages of a large bibliographic database all linked to the same page, thereby artificially
inflating its in-degree and PageRank.
In each run, sets of 100 or more queries were processed over the applicable corpus
using the chosen baseline algorithm. The first 1000 results for each were recorded.
While all queries have only one correct answer, that answer may have multiple cor-
rect URLs, e.g. a host with two aliases. If multiple correct URLs were retrieved the
minimum baseline rank was used (i.e. earliest in the ranked list of documents) and
was assigned the best query-independent score of all the equivalent URLs. This
approach introduces a slight bias in favour of the re-ranking algorithms, ensuring
that any beneficial effect will be detected. If a correct document did not appear in the
top 1000 positions a rank of 1001 was assigned.
These experiments investigated two home page finding scenarios: queries for pop-
ular and random home pages.1 Popular queries allow the study of which forms of
evidence achieve high effectiveness when ranking for queries targeting high profile
sites. Random queries allow the study of effective ranking for any home page, even if
it is not well known.
The ANU web includes a number of official directories of internal sites. These site
directories can be used as PageRank bookmarks. This allows for the evaluation of
APR in a single-organisation environment. Test home pages were picked randomly
from these directories and then queries were generated manually by navigating to a
home page, and formulating a query based on the home page name2. Consequently
APR might be expected to perform well on this collection.
The query set labelled WT10gC [56] was created by randomly selecting pages
within the WT10g corpus, navigating to the corresponding home page, and formu-
lating a query based on the home page’s name. The WT10gC set was used as training
data in the TREC-2001 web track. The query set labelled WT10gT was developed
by the NIST assessors for the TREC-2001 web track using the same method. Wester-
veld et al. [212] have previously found that the URL-type method improved retrieval
performance on the WT10gT collection. Using the method outlined in Section 3.3.2,
every Yahoo-listed page in the WT10g collection is bookmarked in the APR calcula-
tion. These are lower quality bookmarks than the ANU set as the bookmarks played
no part in the selection of either query set.
Two sets of queries were evaluated over the VLC2 collection, popular (VLC2P) and
random (VLC2R). The popular series was derived from the Yahoo! directory. The ran-
dom series was selected using the method described above for WT10g. For the APR
calculation every Yahoo-listed page in the collection was bookmarked. As such, the
bookmarks were well matched to the VLC2P queries (also from Yahoo!), but less so
for the VLC2R set.
1. Note that the labels popular and random were chosen for simplicity and are derived from the method used to choose the target answer, not from the nature of the queries. Information about query volumes is obviously unavailable for the TREC test collections and was not used in the case of ANU.
2. This set was generated by Nick Craswell in 2001.
The home page results for the ANU and VLC2P query sets are considered popular
because they are derived from directory listings. Directory listings have been chosen
by a human editor as important, possibly because they are pages of interest to many
people. Such pages also tend to have above average in-degree. This means that more
web page editors have chosen to link to the page, directing web surfers (and search
engine crawlers) to it.
On all these corpora anchor-text ranking has been shown to improve home page
finding effectiveness (relative to full-text-only) [15, 56].
7.1.4 Combining query-dependent baselines with query-independent evi-
dence
Throughout these experiments there is a risk that a poor choice of combining function
could lead to a spurious conclusion. The combination of evidence experiments in the
previous chapter outlined two methods for combining query-independent and query-
dependent evidence which may be effective: the use of minimum threshold values
and score-based re-ranking. This chapter includes a further combination scheme – an
Optimal re-ranking.
The Optimal re-ranking is an unrealistic re-ranking, and is termed “Optimal” to
distinguish it from a re-ranking that could be used in a production web search sys-
tem.3 In the Optimal combination experiments, the maximum possible improvement
when combining query-independent evidence with query-dependent evidence using
a linear combination is gauged. This is done by locating the right answer in the base-
line (obviously not possible in a practical system) and re-ranking it and the docu-
ments above it, on the basis of the query-independent score alone (as illustrated in
Figure 7.1). This is an unrealistic combination: if this information were known in practice, perfection could easily be achieved by swapping the document at that position
with the document at rank one. Indeed, no linear combination or product of query-
independent and query-dependent scores (assuming positive coefficients) could im-
prove upon the Optimal combination. This is because documents above the correct
answer score as well or better on both query-independent and query-dependent com-
ponents (see Figure 7.1). In Optimal experiments a control condition random was intro-
duced in which the correct document and all those above it were arbitrarily shuffled.
Throughout re-ranking experiments if the query-independent scores are equal,
then the original baseline ordering is preserved.
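The Optimal re-ranking and its random control can be sketched as follows, assuming the baseline is a ranked list of document identifiers and the correct answer is known; the names and toy PageRank values are illustrative, and the toy example mirrors Figure 7.1 below.

    import random


    def optimal_rerank(ranked_docs, qi_score, correct_id):
        """Re-rank the correct answer and every document above it by the
        query-independent score alone, preserving baseline order on ties."""
        if correct_id not in ranked_docs:
            return list(ranked_docs)
        cut = ranked_docs.index(correct_id) + 1
        head = sorted(range(cut), key=lambda i: (-qi_score[ranked_docs[i]], i))
        return [ranked_docs[i] for i in head] + list(ranked_docs[cut:])


    def random_control(ranked_docs, correct_id, rng=random):
        """Control condition: shuffle the correct answer and all documents above it."""
        if correct_id not in ranked_docs:
            return list(ranked_docs)
        cut = ranked_docs.index(correct_id) + 1
        head = list(ranked_docs[:cut])
        rng.shuffle(head)
        return head + list(ranked_docs[cut:])


    # Toy example: the correct answer ("d6") sits at baseline rank 6.
    baseline = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
    pagerank = {"d1": 0.3, "d2": 0.9, "d3": 0.2, "d4": 0.25, "d5": 0.1,
                "d6": 0.8, "d7": 0.05, "d8": 0.4}
    print(optimal_rerank(baseline, pagerank, "d6"))   # d6 moves up to rank 2
    print(random_control(baseline, "d6"))
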
The following sections report and discuss the results for each combination method.
The use of minimum query-independent evidence thresholds is investigated first,
followed by re-ranking using the (unrealistic) Optimal combination, and finally re-
ranking using the (realistic) score-based re-ranking.
3. The Optimal re-ranking relies on knowledge of the correct answer within the baseline ranking.
Figure 7.1: Example of Optimal re-ranking and calculation of random control success rate.
In the baseline, the correct answer is document 6 at rank 6. Re-ranking by PageRank puts it
at rank 2. This is optimal because any document ranked more highly must score as well or
better on both baseline and PageRank (i.e. “document 2” scored better on the baseline, and
PageRank). In this case, S@5 fails on the baseline and succeeds on re-ranking. However, a
random resorting of the top 6 would have succeeded in 5 of 6 cases, so expected S@5 for the
random control is 5/6.
7.2 Minimum threshold experiments
These experiments investigate whether the use of a static minimum threshold require-
ment for page inclusion can improve retrieval effectiveness and system efficiency.
Retrieval effectiveness may be improved through the removal of unimportant doc-
uments from the corpus. Additionally, retrieval efficiency may be improved by reducing the number of documents that must be ranked when processing a query.
The evaluation of the performance of threshold techniques requires a set of candi-
date cutoff values. Up to nine cutoffs were generated for each form of evidence, and
an attempt was made to pick intervals that would divide the corpus into 10% increments. These cutoffs were possible for DPR evidence because its scores spanned many distinct values. Even
spacing was not possible for in-degree or URL-type evidence because early cutoffs
eliminated many of the pages from consideration. For example, picking an in-degree
minimum of 2 removed up to 60% of the ANU corpus. Discounting URL-type “File”
URLs removed over 95% of the ANU collection.
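A small sketch of how evenly spaced cutoffs might be derived from a continuous score such as DPR; the helper name and toy scores are illustrative, not the values used in the experiments.

    def decile_cutoffs(scores, n_cuts=9):
        """Return candidate cutoff values that split the score distribution into
        roughly 10% slices (highest-scoring ~10%, ~20%, ... of documents)."""
        ranked = sorted(scores, reverse=True)
        n = len(ranked)
        return [ranked[int(n * k / (n_cuts + 1))] for k in range(1, n_cuts + 1)]


    dpr_scores = [1.2e-6, 5.1e-6, 3.3e-6, 9.8e-6, 2.2e-6,
                  7.5e-6, 4.4e-6, 6.1e-6, 8.0e-6, 1.9e-6]
    print(decile_cutoffs(dpr_scores))
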
An evaluation of the use of minimum thresholds was performed for three of the five test collections, namely ANU, WT10gC and WT10gT.4
Content Anchor Both
Type Cut Prop. S@1 S@5 S@10 S. S@1 S@5 S@10 S. S@1 S@5 S@10 S.
BASE 100% 0.29 0.50 0.58 0.72 0.96 0.97 0.63 0.81 0.86
IDG 2 51% 0.34 0.57 0.66 *+ 0.72 0.96 0.97 = 0.63 0.81 0.86 *+
IDG 3 45% 0.36 0.58 0.68 *+ 0.73 0.96 0.97 = 0.64 0.82 0.85 *+
IDG 4 37% 0.38 0.60 0.68 *+ 0.73 0.96 0.97 = 0.66 0.82 0.85 *+
IDG 6 33% 0.39 0.61 0.68 *+ 0.72 0.95 0.96 = 0.65 0.81 0.84 =
IDG 8 28% 0.40 0.60 0.70 *+ 0.72 0.95 0.96 = 0.65 0.82 0.86 =
IDG 10 8% 0.41 0.64 0.69 *+ 0.70 0.91 0.92 = 0.65 0.81 0.85 =
IDG 25 2% 0.33 0.42 0.47 *- 0.49 0.62 0.63 *- 0.44 0.55 0.58 *-
IDG 50 1% 0.21 0.30 0.36 *- 0.36 0.42 0.42 *- 0.28 0.38 0.39 *-
IDG 100 0.5% 0.11 0.19 0.20 *- 0.20 0.24 0.24 *- 0.17 0.22 0.22 *-
DPR 5.02 90% 0.30 0.50 0.59 *+ 0.72 0.97 0.98 = 0.63 0.81 0.86 *+
DPR 5.06 80% 0.30 0.50 0.59 *+ 0.72 0.97 0.98 = 0.64 0.81 0.86 *+
DPR 5.10 70% 0.30 0.50 0.59 *+ 0.72 0.97 0.98 = 0.64 0.81 0.86 *+
DPR 5.22 60% 0.31 0.51 0.62 *+ 0.72 0.97 0.98 = 0.64 0.81 0.87 *+
DPR 5.28 55% 0.31 0.52 0.62 *+ 0.72 0.97 0.98 = 0.64 0.81 0.87 *+
DPR 5.61 40% 0.33 0.54 0.63 *+ 0.72 0.97 0.98 = 0.64 0.82 0.87 *+
DPR 6.15 30% 0.34 0.55 0.65 *+ 0.71 0.95 0.97 = 0.65 0.82 0.87 *+
DPR 8.04 20% 0.36 0.57 0.63 *+ 0.64 0.86 0.88 *- 0.61 0.78 0.81 =
DPR 14.9 10% 0.35 0.54 0.60 = 0.62 0.78 0.80 *- 0.58 0.74 0.76 =
URL >F 5% 0.48 0.64 0.76 *+ 0.73 0.88 0.88 = 0.64 0.79 0.82 =
URL >D 2% 0.33 0.48 0.50 *- 0.47 0.55 0.55 *- 0.41 0.53 0.53 *-
URL >SR 0.1% 0.17 0.22 0.23 *- 0.25 0.26 0.26 *- 0.21 0.24 0.24 *-
Table 7.2: Using query-independent thresholds on the ANU collection. Bold values indi-
cate the highest effectiveness achieved for each type of query-independent evidence on each
query-dependent baseline. Underlined bold values indicate the highest effectiveness achieved
for each query-dependent baseline. The cutoff value is indicated by “Cut”. The percentage of the collection that is included within the cutoff is indicated by “Prop.”. “S.” reports whether observed changes are significantly better (“*+”), equivalent (“=”) or worse (“*-”). The cutoff values given for Democratic PageRank are of the order ×10−6. For URL-type cutoffs: >F indicates that URLs are more important than “File” URLs (i.e. either “Directory”, “Subroot” or “Root”), >D that URLs are more important than “Directory” (i.e. either “Subroot” or “Root”), and >SR that URLs are more important than “Subroot” (i.e. “Root”).
7.2.1 Results
ANU
The performance of the ANU collection when using minimum query-independent
thresholds is presented in Table 7.2. Observations from these results are:
• Removing the bottom 80% of pages according to Democratic PageRank, in-degree
or URL-type improves the effectiveness of the content baseline. In the case of
URL-type, the improvement is dramatic.
• Using the least restrictive URL-type as a minimum threshold (i.e. removing
“File” pages) removes around 95% of pages from consideration without a sig-
nificant decrease in retrieval effectiveness for any baseline.
• Using appropriate in-degree and Democratic PageRank threshold values, around
80% of pages can be removed before observing a significant decrease in retrieval
effectiveness for any baseline.
• The highest retrieval effectiveness is achieved using an anchor-text baseline with
no thresholds, although this is not significantly better than that of anchor-text
with the base URL-type threshold.
In the ANU collection there was a group of documents with identical Democratic
PageRank values of 5.28×10−6. This made it impossible to choose a cutoff of 60% and
so a cutoff of 55% was used. The large number of documents that achieved the same
PageRank value was found to be caused by a crawler trap on an ANU web server.
WT10gC
The performance of the WT10gC collection using minimum thresholds is presented in
Table 7.3. Observations from these results are:
• Excluding pages using the “> File” and “> File or Directory” URL-type thresh-
olds provided significant gains on all three baselines while reducing the size of
the collection by 97%. Excluding pages using the “> Subroot” URL-type thresh-
old resulted in the removal of 99% of pages without significantly affecting the
effectiveness of any baseline.
• Excluding pages with in-degree < 2 removed 58% of pages from consideration
without significantly reducing effectiveness for any baseline (and improved ef-
fectiveness for the content baseline).
• Excluding pages with a DPR of < 1.73 × 10−6 removed 40% of pages from con-
sideration without significantly reducing effectiveness for any baseline.
4. An evaluation of performance on the VLC2P and VLC2R test collections was not possible due to time constraints.
Content Anchor Both
Type Cut Prop. S@1 S@5 S@10 S. S@1 S@5 S@10 S. S@1 S@5 S@10 S.
BASE 100% 0.23 0.45 0.55 0.47 0.69 0.72 0.45 0.71 0.83
IDG 2 42% 0.23 0.47 0.55 *+ 0.45 0.65 0.69 = 0.41 0.66 0.75 =
IDG 3 26% 0.23 0.50 0.59 *+ 0.45 0.64 0.67 *- 0.40 0.64 0.72 *-
IDG 4 19% 0.23 0.48 0.54 = 0.43 0.62 0.64 *- 0.39 0.60 0.68 *-
IDG 6 12% 0.24 0.44 0.53 *- 0.41 0.59 0.60 *- 0.38 0.60 0.62 *-
IDG 8 7.5% 0.25 0.45 0.53 *- 0.41 0.56 0.57 *- 0.38 0.59 0.60 *-
IDG 10 5% 0.21 0.43 0.45 *- 0.40 0.49 0.50 *- 0.37 0.52 0.52 *-
IDG 25 2% 0.20 0.36 0.39 *- 0.34 0.41 0.41 *- 0.31 0.43 0.43 *-
IDG 50 1% 0.19 0.28 0.29 *- 0.28 0.30 0.30 *- 0.25 0.31 0.31 *-
IDG 100 0.5% 0.15 0.22 0.23 *- 0.22 0.24 0.24 *- 0.21 0.24 0.24 *-
DPR 1.33 99% 0.23 0.45 0.55 = 0.47 0.69 0.72 = 0.45 0.71 0.83 =
DPR 1.38 80% 0.21 0.42 0.53 = 0.45 0.67 0.61 = 0.44 0.67 0.77 =
DPR 1.51 70% 0.20 0.42 0.53 = 0.45 0.66 0.70 = 0.43 0.68 0.77 =
DPR 1.73 60% 0.19 0.41 0.52 = 0.45 0.65 0.70 = 0.41 0.67 0.75 =
DPR 2.11 50% 0.19 0.39 0.52 = 0.44 0.64 0.68 = 0.39 0.64 0.73 *-
DPR 2.72 40% 0.18 0.39 0.51 = 0.42 0.62 0.66 *- 0.37 0.61 0.69 *-
DPR 3.77 30% 0.20 0.41 0.46 = 0.41 0.59 0.62 *- 0.35 0.58 0.63 *-
DPR 5.45 20% 0.20 0.37 0.44 *- 0.37 0.57 0.59 *- 0.35 0.54 0.58 *-
DPR 8.65 10% 0.19 0.38 0.47 *- 0.39 0.55 0.56 *- 0.37 0.54 0.55 *-
URL > F 7% 0.56 0.83 0.87 *+ 0.68 0.76 0.78 *+ 0.75 0.93 0.95 *+
URL > D 3% 0.63 0.81 0.87 *+ 0.67 0.73 0.75 *+ 0.76 0.89 0.90 *+
URL > SR 1% 0.65 0.75 0.76 = 0.59 0.65 0.65 = 0.75 0.77 0.77 =
Table 7.3: Using query-independent thresholds on the WT10gC collection. Bold values in-
dicate the highest effectiveness achieved for each type of query-independent evidence on each
query-dependent baseline. Underlined bold values indicate the highest effectiveness achieved
for each query-dependent baseline. The cutoff value is indicated by “Cut”. The percentage of
the collection that is included within the cutoff is indicated by “Prop.”. “S.” reports whether
observed changes are significantly better (“*+”), equivalent (“=”) or worse (“*-”). The specified
cutoffs for Democratic PageRank are of the order ×10−6. For URL-type cutoffs: >F indicates that URLs are more important than “File” URLs (i.e. either “Directory”, “Subroot” or “Root”), >D that URLs are more important than “Directory” (i.e. either “Subroot” or “Root”), and >SR that URLs are more important than “Subroot” (i.e. “Root”).
• The highest effectiveness is achieved with a content+anchor-text baseline and
URL-type “> File” threshold. Using the URL-type threshold gives gains of 7%
to 20% over the best baseline score and removes 93% of pages from considera-
tion.
WT10gT
The performance of the WT10gT collection using minimum thresholds is presented in
Table 7.4. Observations from these results are:
• Excluding documents based on a “> File” URL-type threshold, provides signifi-
cant gains on all three baselines while reducing the size of the collection by 93%.
Excluding documents using a “> Subroot” URL-type threshold reduces collec-
tion size by 99% while only negatively affecting anchor-text retrieval effective-
ness.
• Excluding documents which achieve in-degree < 2 removes 58% of pages from
consideration without significantly reducing effectiveness for any baseline.
• Excluding documents which achieve a DPR in the top 90% of values resulted in
a significant decrease in effectiveness for the anchor-text baseline.
• The highest effectiveness is achieved with a content+anchor-text baseline and a
“> File” URL-type threshold. Using this threshold gives gains of 7-15% over the
baseline while removing 93% of pages from consideration.
7.2.2 Training cutoffs
While several cutoffs were considered for each collection, a sensible approach for
future experiments would be to train a threshold cutoff value on a single collection
and then apply that as a threshold on other collections. The trained cutoff, if calcu-
lated for the S@5 measure on the WT10gC collection (as with other realistic combina-
tion experiments detailed below), would have been a “> File” URL-type cutoff (with
an associated effectiveness gain of 24% along with a reduction of collection size by
around 93%). Applied to the WT10gT collection, this cutoff would have resulted in a
significant improvement in retrieval effectiveness of 12% at S@5 (along with the same
reduction of collection size of 93%). Applied to the ANU collection, the collection size
would be reduced by 95%, with an associated non-significant decrease in retrieval
effectiveness of 9% at S@5.
The exact efficiency gains achieved through using a minimum query-independent
value for inclusion are difficult to quantify as they depend on the indexing and query
processing methods used. However, one would expect that indexing an order of magnitude fewer documents would result in significant efficiency gains.
Content Anchor Both
Type Cut Prop. S@1 S@5 S@10 S. S@1 S@5 S@10 S. S@1 S@5 S@10 S.
BASE 100% 0.22 0.48 0.59 0.53 0.68 0.72 0.48 0.71 0.75
IDG 2 42% 0.22 0.47 0.55 = 0.53 0.67 0.72 = 0.50 0.61 0.67 =
IDG 3 26% 0.26 0.44 0.52 = 0.50 0.59 0.61 *- 0.48 0.61 0.64 =
IDG 4 19% 0.23 0.43 0.51 = 0.46 0.54 0.56 *- 0.43 0.57 0.60 =
IDG 6 12% 0.26 0.43 0.49 *- 0.43 0.51 0.52 *- 0.43 0.52 0.56 *-
IDG 8 7.5% 0.24 0.41 0.45 *- 0.38 0.48 0.49 *- 0.39 0.50 0.51 *-
IDG 10 5% 0.24 0.39 0.42 *- 0.37 0.44 0.46 *- 0.37 0.46 0.48 *-
IDG 25 2% 0.23 0.31 0.34 *- 0.28 0.34 0.35 *- 0.28 0.35 0.37 *-
IDG 50 1% 0.21 0.28 0.30 *- 0.25 0.28 0.28 *- 0.24 0.30 0.30 *-
IDG 100 0.5% 0.16 0.22 0.23 *- 0.19 0.21 0.21 *- 0.20 0.22 0.23 *-
DPR 1.33 99% 0.22 0.48 0.59 = 0.53 0.68 0.72 = 0.48 0.71 0.75 *+
DPR 1.38 80% 0.20 0.41 0.50 = 0.50 0.62 0.66 *- 0.45 0.61 0.65 =
DPR 1.51 70% 0.20 0.39 0.49 = 0.51 0.62 0.64 *- 0.46 0.60 0.63 =
DPR 1.73 60% 0.20 0.37 0.48 = 0.50 0.60 0.63 *- 0.46 0.59 0.61 =
DPR 2.11 50% 0.23 0.41 0.50 = 0.50 0.59 0.63 *- 0.46 0.59 0.63 =
DPR 2.72 40% 0.19 0.36 0.46 = 0.48 0.57 0.59 *- 0.43 0.54 0.59 *-
DPR 3.77 30% 0.18 0.37 0.46 *- 0.47 0.55 0.56 *- 0.43 0.53 0.58 *-
DPR 5.45 20% 0.18 0.37 0.44 *- 0.44 0.52 0.54 *- 0.41 0.50 0.55 *-
DPR 8.65 10% 0.15 0.35 0.42 *- 0.39 0.48 0.48 *- 0.37 0.46 0.49 *-
URL > F 7% 0.53 0.71 0.80 *+ 0.61 0.73 0.74 *+ 0.62 0.80 0.83 *+
URL > D 3% 0.57 0.76 0.78 *+ 0.62 0.70 0.71 = 0.66 0.79 0.81 *+
URL > SR 1% 0.60 0.62 0.63 = 0.53 0.57 0.58 *- 0.61 0.64 0.65 =
Table 7.4: Using query-independent thresholds on the WT10gT collection. Bold values indi-
cate the highest effectiveness achieved for each type of query-independent evidence on each
query-dependent baseline. Underlined bold values indicate the highest effectiveness achieved
for each query-dependent baseline. The cutoff value is indicated by “Cut”. The percentage of
the collection that is included within the cutoff is indicated by “Prop.”. “S.” reports whether
observed changes are significantly better (“*+”), equivalent (“=”) or worse (“*-”). The specified
cutoffs for Democratic PageRank are of the order ×10−6. For URL-type cutoffs: >F indicates that URLs are more important than “File” URLs (i.e. either “Directory”, “Subroot” or “Root”), >D that URLs are more important than “Directory” (i.e. either “Subroot” or “Root”), and >SR that URLs are more important than “Subroot” (i.e. “Root”).
7.3 Optimal combination experiments
These experiments investigate the effectiveness improvements offered through the
use of query-independent evidence in an Optimal re-ranking. The Optimal re-ranking
is unrealistic, and is used to gauge the potential contribution of query-independent
evidence when combined with query-dependent evidence.
7.3.1 Results
Full re-ranking and significance test results are shown in Tables 7.5, 7.6, 7.7 and 7.8,
and a summary of optimal results is presented in Table 7.9. Observations based on
these results are:
1. All re-rankings of the content baseline significantly outperform the random con-
trol.
2. The only re-ranking method which shows significant benefit over the anchor-
text baseline is URL. This benefit is shown only for the random query sets. The
benefits of re-ranking by URL are greatly diminished for anchor-text compared
with content and content+anchor-text baselines.
3. All re-rankings of the content+anchor-text baseline significantly outperform the
random control on ANU, WT10gT and VLC2R. Only the URL-type re-ranking
on WT10gC and VLC2P outperforms the random control.
4. With no re-ranking, the content+anchor-text baselines perform worse than their
anchor-text counterparts. However, the content+anchor-text re-rankings equal (on ANU) or exceed (on WT10gC, WT10gT, VLC2P and VLC2R) their counterpart anchor-text re-rankings.
5. URL performs at a consistently high level for all baselines. The URL anchor-
text re-ranking is only outperformed by APR on the ANU and VLC2P. These are
cases where the query set and bookmarks were both derived from the same list
of authoritative sources.
6. For the popular home page queries (ANU and VLC2P), all anchor-text re-rankings
outperform their content counterparts.
7. For random home page queries (WT10gT, WT10gC and VLC2R), the content+
anchor-text and content-only re-rankings perform better than their anchor-text
counterparts.
8. Improvements due to APR were only observed when using high quality book-
marks, i.e. when the query answers were to be found among the bookmarks.
9. Improvements due to IDG and DPR are almost identical.
Coll. Meas. Base Rand IDG DPR APR URL
ANU S@1 0.29 0.37 0.73 0.71 0.75 0.68
ANU S@5 0.50 0.61 0.88 0.90 0.91 0.87
ANU S@10 0.58 0.69 0.93 0.93 0.96 0.91
ANU Sig. n/a n/a ** ** ** **
WT10gC S@1 0.23 0.34 0.61 0.59 0.55 0.75
WT10gC S@5 0.45 0.58 0.86 0.82 0.84 0.89
WT10gC S@10 0.55 0.68 0.86 0.87 0.88 0.93
WT10gC Sig. n/a n/a ** ** ** **
WT10gT S@1 0.22 0.34 0.64 0.62 0.55 0.84
WT10gT S@5 0.48 0.61 0.81 0.83 0.80 0.90
WT10gT S@10 0.59 0.69 0.86 0.87 0.84 0.92
WT10gT Sig. n/a n/a ** ** ** **
VLC2P S@1 0.27 0.38 0.66 0.62 0.67 0.71
VLC2P S@5 0.51 0.65 0.79 0.79 0.82 0.87
VLC2P S@10 0.61 0.76 0.88 0.87 0.90 0.89
VLC2P Sig. n/a n/a ** ** ** **
VLC2R S@1 0.16 0.25 0.50 0.48 0.46 0.72
VLC2R S@5 0.36 0.48 0.72 0.69 0.69 0.87
VLC2R S@10 0.44 0.58 0.73 0.72 0.72 0.88
VLC2R Sig. n/a n/a ** ** ** **
Table 7.5: Optimal re-ranking results for content. The Optimal combination experiment
is described in Section 7.3. “Sig.” reports the statistical significance of the improvements.
Significance is tested using the Wilcoxon matched-pairs signed ranks test. The Wilcoxon test
compares the full document ranking, and so only a single significance value is reported per
type of evidence, per collection. A “**” indicates improvements were significant at p < 0.01,
and a “*” indicates improvements were significant at p < 0.05. Relative to the random control,
all Optimal re-rankings of the content baseline were significant. The highest effectiveness
achieved for each measure on each collection is highlighted in bold.
Coll. Meas. Base Rand IDG DPR APR URL
ANU S@1 0.72 0.82 0.87 0.87 0.89 0.88
ANU S@5 0.96 0.97 0.98 0.98 0.98 0.98
ANU S@10 0.97 0.97 0.98 0.98 0.99 0.98
ANU Sig. n/a n/a - - - -
WT10gC S@1 0.47 0.58 0.60 0.59 0.63 0.73
WT10gC S@5 0.69 0.73 0.71 0.72 0.73 0.82
WT10gC S@10 0.72 0.76 0.74 0.75 0.75 0.83
WT10gC Sig. n/a n/a - - - *
WT10gT S@1 0.53 0.60 0.63 0.61 0.64 0.74
WT10gT S@5 0.68 0.73 0.72 0.71 0.75 0.78
WT10gT S@10 0.72 0.76 0.76 0.76 0.75 0.79
WT10gT Sig. n/a n/a - - - *
VLC2P S@1 0.70 0.77 0.78 0.79 0.85 0.81
VLC2P S@5 0.86 0.88 0.88 0.89 0.92 0.90
VLC2P S@10 0.87 0.89 0.90 0.89 0.92 0.92
VLC2P Sig. n/a n/a - - - -
VLC2R S@1 0.48 0.55 0.63 0.60 0.61 0.68
VLC2R S@5 0.67 0.71 0.75 0.75 0.73 0.74
VLC2R S@10 0.72 0.73 0.75 0.75 0.75 0.76
VLC2R Sig. n/a n/a - - - *
Table 7.6: Optimal re-ranking results for anchor-text. The Optimal combination experiment
is described in Section 7.3. “Sig.” reports the statistical significance of the improvements.
Significance is tested using the Wilcoxon matched-pairs signed ranks test. The Wilcoxon test
compares the full document ranking, and so only a single significance value is reported per
type of evidence, per collection. A “**” indicates improvements were significant at p < 0.01,
and a “*” indicates improvements were significant at p < 0.05. The highest effectiveness
achieved for each measure on each collection is highlighted in bold.
Coll. Meas. Base Rand IDG DPR APR URL
ANU S@1 0.63 0.70 0.85 0.85 0.84 0.88
ANU S@5 0.81 0.86 0.96 0.98 0.96 0.98
ANU S@10 0.86 0.90 0.98 0.99 0.98 0.98
ANU Sig. n/a n/a * * * *
WT10gC S@1 0.45 0.58 0.65 0.67 0.68 0.94
WT10gC S@5 0.71 0.81 0.90 0.88 0.89 0.97
WT10gC S@10 0.83 0.86 0.92 0.91 0.90 0.97
WT10gC Sig. n/a n/a - - - **
WT10gT S@1 0.48 0.58 0.70 0.69 0.68 0.84
WT10gT S@5 0.71 0.77 0.88 0.86 0.85 0.94
WT10gT S@10 0.75 0.80 0.88 0.90 0.88 0.95
WT10gT Sig. n/a n/a * * * **
VLC2P S@1 0.67 0.75 0.84 0.86 0.89 0.90
VLC2P S@5 0.85 0.88 0.93 0.94 0.94 0.97
VLC2P S@10 0.88 0.91 0.94 0.94 0.95 0.98
VLC2P Sig. n/a n/a - - * **
VLC2R S@1 0.40 0.50 0.63 0.60 0.58 0.84
VLC2R S@5 0.62 0.69 0.78 0.76 0.74 0.93
VLC2R S@10 0.66 0.75 0.79 0.78 0.77 0.93
VLC2R Sig. n/a n/a - - - **
Table 7.7: Optimal re-ranking results for content+anchor-text. The Optimal combination
experiment is described in Section 7.3. Significance is tested using the Wilcoxon matched-
pairs signed ranks test. The Wilcoxon test compares the full document ranking, and so only
a single significance value is reported per type of evidence, per collection. A “**” indicates
improvements were significant at p < 0.01, and a “*” indicates improvements were signifi-
cant at p < 0.05. The highest effectiveness achieved for each measure on each collection is
highlighted in bold.
ANU (Popular): Content: APR > DPR, URL. Anchor-text: -. Content+Anchor-text: -.
WT10gC (Random): Content: DPR > IDG; APR > IDG; URL > IDG, DPR, APR. Anchor-text: URL > IDG, DPR, APR. Content+Anchor-text: URL > IDG, DPR, APR.
WT10gT (Random): Content: IDG > APR; DPR > APR; URL > IDG, DPR, APR. Anchor-text: APR > IDG, DPR; URL > IDG, DPR, APR. Content+Anchor-text: URL > IDG, DPR, APR.
VLC2P (Popular): Content: -. Anchor-text: APR > IDG, DPR. Content+Anchor-text: URL > IDG.
VLC2R (Random): Content: IDG > APR; URL > IDG, DPR, APR. Anchor-text: DPR > IDG; URL > IDG, DPR, APR. Content+Anchor-text: IDG > APR; URL > IDG, DPR, APR.
Table 7.8: Significant differences between methods when using Optimal re-rankings. Each
(non-random) method was compared against each of the others in turn and differences were
tested for significance using the Wilcoxon test. Each significant difference found is shown
with the direction of the difference.
7.4 Score-based re-ranking
These experiments investigate the effectiveness of a score-based re-ranking of base-
lines using query-independent evidence.
7.4.1 Setting score cutoffs
For the realistic score-based re-rankings the same cutoff was applied to all queries.
Suitable score cutoffs were determined for WT10gC by plotting S@5 effectiveness
against potential cutoff values (see Figures 7.2 and 7.3) and recording the optimal
cutoff for each form of query-independent evidence. The other collections were then
re-ranked using this cutoff. Optimal cutoffs were calculated at S@5 due to the instability of S@1 (see footnote 5) and the smaller effectiveness gains observed at S@10.
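As a concrete illustration of the procedure, the sketch below re-orders only those documents whose baseline score is at least the chosen percentage of the top baseline score, sorting them by query-independent evidence while documents with equal query-independent scores retain their baseline order (as noted in Section 7.5.1); the remainder of the ranking is left untouched. This is a minimal reading of the method, not the exact implementation used in the experiments.

    # Sketch: score-based re-ranking above a fixed cutoff (illustrative only).
    def score_based_rerank(baseline, qie_score, cutoff_fraction):
        """baseline: list of (doc_id, baseline_score), best first.
        qie_score: dict mapping doc_id -> query-independent evidence score.
        cutoff_fraction: e.g. 0.337 for the 33.7% URL cutoff trained on WT10gC."""
        if not baseline:
            return []
        threshold = cutoff_fraction * baseline[0][1]   # % of maximum baseline score
        head = [d for d in baseline if d[1] >= threshold]
        tail = [d for d in baseline if d[1] < threshold]
        # Stable sort: documents with equal query-independent scores keep baseline order.
        head.sort(key=lambda d: qie_score.get(d[0], 0.0), reverse=True)
        return head + tail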
7.4.2 Results
Tables 7.10, 7.11 and 7.12 show the results of the score-based re-ranking of content and
anchor-text baselines. From these results it can be observed that:
1. URL re-ranking provided significant improvements over all three baselines for
WT10gT, VLC2P and VLC2R.
2. URL re-ranking performance is only surpassed by APR on the ANU collection
(at S@1) where APR used very high quality bookmarks.
3. None of the hyperlink-recommendation based schemes provided a significant
improvement over the anchor-text baseline.
5 S@1 is equivalent to P@1; the instability of Precision at 1 is discussed in Section 2.6.6.1.
Collection Measure Best Cont. Best Anch Best Cont+Anch
ANU S@1 0.76 0.90 0.88
S@5 0.92 0.98 0.98
S@10 0.94 0.98 0.98
QIE URL ALL URL,DPR
WT10gC S@1 0.82 0.73 0.94
S@5 0.93 0.84 0.97
S@10 0.93 0.84 0.97
QIE URL URL URL
WT10gT S@1 0.84 0.74 0.84
S@5 0.90 0.78 0.94
S@10 0.92 0.79 0.95
QIE URL URL URL
VLC2P S@1 0.71 0.84 0.90
S@5 0.87 0.91 0.97
S@10 0.89 0.91 0.98
QIE URL APR URL
VLC2R S@1 0.73 0.68 0.84
S@5 0.87 0.74 0.93
S@10 0.88 0.76 0.93
QIE URL URL URL
Table 7.9: Summary of Optimal re-ranking results. The highest effectiveness achieved by
each method is highlighted in bold. The “QIE” row indicates the query-independent evidence
that performed best.
[Two plots: success rate at 5 (%) against the cutoff, expressed as a percentage of the maximum baseline score, with curves for URL, APR, Indeg and DPR.]
Figure 7.2: Setting score-based re-ranking cutoffs for the content (top) and anchor-text (bot-
tom) baselines using the WT10gC collection. The vertical lines represent the chosen cutoff
values, which were then used in all score-based re-ranking experiments. If the optimal cutoff
spanned multiple values then the mean of those values was used. Numerical cutoff scores are
provided in Tables 7.10 and 7.11.
[Plot: success rate at 5 (%) against the cutoff, expressed as a percentage of the maximum baseline score, with curves for APR, URL, Indeg and DPR.]
Figure 7.3: Setting score-based re-ranking cutoffs for the content+anchor-text baseline us-
ing the WT10gC collection. The vertical lines represent the chosen cutoff values, which were
then used in all score-based re-ranking experiments. If the optimal cutoff spanned multiple
values then the mean of those values was used. Numerical cutoff scores are provided in Ta-
ble 7.12.
4. For the popular query sets (ANU and VLC2P) the anchor-text baseline with URL re-ranking produced the best performance, although it only narrowly outperformed the corresponding content+anchor-text re-ranking.
5. For the random query sets (WT10gT and VLC2R) the content+anchor-text base-
line with URL re-ranking produced the best performance, with the URL re-
ranking of the content baseline performing better than the anchor-text re-ranking.
6. In the absence of very high quality bookmarks (i.e. on every collection except
for the ANU), APR performance was very similar to that of the other hyperlink
recommendation techniques.
Coll. Meas. Base IDG DPR APR URL
(at 20.6%) (at 17.4%) (at 14.1%) (at 33.7%)
ANU S@1 0.29 0.36 0.29 0.48 0.39
ANU S@5 0.50 0.60 0.52 0.67 0.73
ANU S@10 0.58 0.73 0.60 0.72 0.83
ANU Sig. - - - ** **
WT10gC S@1 0.23 0.36 0.38 0.33 0.71
WT10gC S@5 0.45 0.67 0.58 0.59 0.88
WT10gC S@10 0.55 0.73 0.67 0.65 0.90
WT10gT S@1 0.22 0.46 0.41 0.32 0.70
WT10gT S@5 0.48 0.64 0.59 0.62 0.83
WT10gT S@10 0.59 0.71 0.69 0.65 0.88
WT10gT Sig. - - - - **
VLC2P S@1 0.27 0.38 0.42 0.41 0.56
VLC2P S@5 0.51 0.61 0.61 0.63 0.68
VLC2P S@10 0.61 0.70 0.70 0.76 0.76
VLC2P Sig. - - - ** **
VLC2R S@1 0.16 0.26 0.20 0.22 0.62
VLC2R S@5 0.36 0.47 0.44 0.45 0.82
VLC2R S@10 0.44 0.56 0.52 0.53 0.83
VLC2R Sig. - - - - **
Table 7.10: Score-based re-ranking results for content. Cutoffs (shown as “(at ?)”) were
obtained by training on WT10gC at S@5. “Sig.” reports the statistical significance of the im-
provements. Significance is tested using the Wilcoxon matched-pairs signed ranks test. The
Wilcoxon test compares the full document ranking, and so only a single significance value
is reported per type of evidence, per collection. A “**” indicates improvements were signifi-
cant at p < 0.01, and a “*” indicates improvements were significant at p < 0.05. The highest
effectiveness achieved at each measure for each collection is highlighted in bold.
Coll. Meas. Base IDG DPR APR URL
(at 15.5%) (at 11.1%) (at 15.6%) (at 20.4%)
ANU S@1 0.72 0.77 0.74 0.83 0.78
ANU S@5 0.96 0.95 0.94 0.96 0.98
ANU S@10 0.97 0.98 0.98 0.98 0.98
ANU Sig. - - - - -
WT10gC S@1 0.47 0.50 0.51 0.51 0.67
WT10gC S@5 0.69 0.71 0.71 0.71 0.76
WT10gC S@10 0.72 0.72 0.72 0.72 0.76
WT10gT S@1 0.53 0.51 0.52 0.47 0.65
WT10gT S@5 0.68 0.70 0.68 0.70 0.73
WT10gT S@10 0.72 0.72 0.72 0.73 0.74
WT10gT Sig. - - - - **
VLC2P S@1 0.70 0.69 0.70 0.73 0.81
VLC2P S@5 0.86 0.84 0.84 0.85 0.89
VLC2P S@10 0.87 0.86 0.88 0.86 0.91
VLC2P Sig. - - - - **
VLC2R S@1 0.48 0.48 0.46 0.41 0.66
VLC2R S@5 0.67 0.70 0.71 0.69 0.73
VLC2R S@10 0.72 0.73 0.72 0.70 0.76
VLC2R Sig. - - - - **
Table 7.11: Score-based re-ranking results for anchor-text. Cutoffs (shown as “(at ?)”) were
obtained by training on WT10gC at S@5. “Sig.” reports the statistical significance of the im-
provements. Significance is tested using the Wilcoxon matched-pairs signed ranks test. The
Wilcoxon test compares the full document ranking, and so only a single significance value
is reported per type of evidence, per collection. A “**” indicates improvements were signifi-
cant at p < 0.01, and a “*” indicates improvements were significant at p < 0.05. The highest
effectiveness achieved at each measure for each collection is highlighted in bold.
Coll. Meas. Base IDG DPR APR URL
(at 10.3%) (at 6.9%) (at 10%) (at 31.7%)
ANU S@1 0.63 0.71 0.64 0.70 0.69
ANU S@5 0.81 0.84 0.82 0.86 0.88
ANU S@10 0.86 0.89 0.86 0.89 0.91
ANU Sig. - * - * *
WT10gC S@1 0.45 0.51 0.49 0.53 0.79
WT10gC S@5 0.71 0.77 0.73 0.75 0.92
WT10gC S@10 0.83 0.83 0.83 0.82 0.94
WT10gT S@1 0.48 0.51 0.52 0.41 0.72
WT10gT S@5 0.71 0.68 0.70 0.67 0.86
WT10gT S@10 0.75 0.77 0.78 0.75 0.89
WT10gT Sig. - - - - **
VLC2P S@1 0.67 0.65 0.68 0.68 0.68
VLC2P S@5 0.85 0.87 0.86 0.86 0.88
VLC2P S@10 0.88 0.91 0.90 0.91 0.93
VLC2P Sig. - - - - *
VLC2R S@1 0.40 0.42 0.42 0.34 0.75
VLC2R S@5 0.62 0.61 0.59 0.61 0.87
VLC2R S@10 0.66 0.70 0.67 0.69 0.89
VLC2R Sig. - - - - **
Table 7.12: Score-based re-ranking results for content+anchor-text. Cutoffs (shown as “(at
?)”) were obtained by training on WT10gC at S@5. “Sig.” reports the statistical significance
of the improvements. Significance is tested using the Wilcoxon matched-pairs signed ranks
test. The Wilcoxon test compares the full document ranking, and so only a single significance
value is reported per type of evidence, per collection. A “**” indicates improvements were
significant at p < 0.01, and a “*” indicates improvements were significant at p < 0.05. The
highest effectiveness achieved at each measure for each collection is highlighted in bold.
7.5 Interpretation of results
Coll. Type B'mark quality Optimal best S@5 Score-based best S@5 S@1 improve S@5 improve S@10 improve Sig.
ANU Pop. v.High 0.98 (AT+*) 0.98 (AT+URL) 7.7% (0.72→0.78) 2.0% (0.96→0.98) 0% (0.98→0.98) -
WT10gT Rand. Low 0.88 (C+AT+URL) 0.85 (C+AT+URL) 29% (0.48→0.68) 16% (0.71→0.85) 14% (0.75→0.87) **
VLC2P Pop. High 0.97 (C+AT+URL) 0.88 (AT+URL) 14% (0.69→0.79) 3% (0.85→0.88) 4% (0.86→0.90) **
VLC2R Rand. Low 0.93 (C+AT+URL) 0.87 (C+AT+URL) 47% (0.40→0.75) 29% (0.62→0.87) 26% (0.66→0.89) **
Table 7.13: Numerical summary of re-ranking improvements. “Sig.” reports the statistical
significance of the improvements. Significance is tested using the Wilcoxon matched-pairs
signed ranks test. The Wilcoxon test compares the full document ranking, and so only a
single significance value is reported per type of evidence, per collection. A “**” indicates
improvements were significant at p < 0.01, and a “*” indicates improvements were signif-
icant at p < 0.05. The percentile realistic improvements are calculated as a percentage im-
provement over the best baseline. “AT+*” denotes a combination of anchor-text with any
of the query independent evidence examined here. “AT+URL” denotes a combination of
anchor-text with URL-type query-independent evidence. “AT+APR” denotes a combination
of anchor-text with APR query-independent evidence. “C+URL” denotes a combination of
content with URL-type query-independent evidence. “C+AT+URL” denotes a combination of
content+anchor-text with URL-type query-independent evidence.
7.5.1 What query-independent evidence should be used in re-ranking?
The Optimal combination results show that re-rankings by all of the query-
independent methods considered significantly improve upon the random control for
the content baseline. For all random query sets, URL re-ranking of the anchor-text base-
line significantly improves upon the random control. Further, many of content+anchor-
text baseline re-rankings are significant. Results are quite stable across collections de-
spite differences in their scale.
Naturally, the benefits of the realistic score-based re-rankings are smaller, but the
URL method in particular achieves substantial gains over all baselines, as shown in
Table 7.13. It is clear that classification of URL-type is of considerable value in a home
page finding system. Section 7.6.2 examines whether the URL-type classifications em-
ployed in this experiment are optimal.
It is of interest that URL re-ranking results for the ANU collection are poorer than
for the other collections. Although investigation confirmed UTwente/TNO’s order-
ing, i.e. “Root” (36/137) > “Subroot” (50/862) > “Directory” (72/1059) >
"File" (40/382 274) (see footnote 6), the ratio for the URL "Subroot" class was higher than for the other collections.
It should be noted that URL re-ranking would be of little use in webs in which
URLs exhibit no hierarchical structure. For example, some organisations publish
URLs of the form xyz.org/getdoc.cgi?docid=9999999. Such URLs include no
potential “Subroot” or “Directory” URL break-downs.
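The four-tier classification itself can be sketched as follows. The rules below are a plausible reading of the UTwente/TNO scheme used in this chapter (a bare host is "Root", a single trailing directory is "Subroot", deeper trailing directories are "Directory", and a URL ending in a filename, such as the getdoc.cgi example above, is "File"); the exact rules applied in the experiments are those defined earlier in the thesis.

    # Sketch: four-tier URL-type classification (Root > Subroot > Directory > File).
    from urllib.parse import urlparse

    def url_type(url):
        path = urlparse(url).path
        # Treat a default index page as equivalent to its enclosing directory
        # (an assumption; the exact normalisation used in the experiments may differ).
        for index in ("index.html", "index.htm", "default.htm"):
            if path.endswith(index):
                path = path[: -len(index)]
        segments = [s for s in path.split("/") if s]
        if not segments:
            return "Root"
        if not path.endswith("/"):
            return "File"                      # ends in a filename (or a CGI script)
        return "Subroot" if len(segments) == 1 else "Directory"

    # Example: url_type("http://xyz.org/getdoc.cgi?docid=9999999") returns "File".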
In the experiments within this chapter the baseline ordering was preserved if the re-ranking scores were equal. Such equality occurred more often for URL-type scores, which could take only one of four distinct values. To confirm that the superiority of URL-type re-ranking was not an artifact of this quantisation, hyperlink recommendation scores were also quantised into four groups (see footnote 7) and the effectiveness of the quantised scores was evaluated. The quantisation of hyperlink recommendation scores decreased retrieval effectiveness, indicating that it is unlikely that URL-type has an unfair advantage due to quantisation.
Hyperlink recommendation results indicate these schemes may have relatively lit-
tle role to play in home page finding tasks using re-ranking based combination meth-
ods for corpora within the range of sizes studied here (400 000 to 18.5 million pages).
The full-text (content) baseline improvements when using hyperlink recommendation
scores as a minimum threshold for document retrieval, or in an Optimal re-ranking
of the query-dependent baselines, were encouraging. By contrast, the performance
improvements over the anchor-text baseline were minimal. This suggests that most
of the potential improvement offered by hyperlink recommendation methods is al-
ready exploited by the anchor-text baseline. In most of the score-based re-rankings it
is almost impossible to differentiate between the re-ranking of the anchor-text base-
line and the baseline itself. The extent to which hyperlink recommendation evidence
is implicit in anchor-text evidence is considered in the next chapter.
Throughout the experiments in-degree appeared to provide more consistent per-
formance improvements than APR or DPR. APR performed well when using high-
quality bookmark sets, but did not improve performance when using lower qual-
ity bookmark sets on random (WT10gT and VLC2R) query sets. The improvement
achieved by these methods relative to the anchor-text baselines was not significant.
The difference in effectiveness of the two PageRank variants shows that PageRank's
contribution to home page finding on corpora of this size is highly dependent upon
the choice of bookmark pages. However, even for popular queries (ANU and VLC2P),
APR results are generally inferior to those of URL re-rankings. Of the three hyperlink
recommendation methods in-degree may be the best choice, as the PageRank variants
offer little or no advantage and are more computationally expensive.
In conclusion, the results of these experiments show the best query-independent
evidence to be URL-type.
6 Note that in these figures all URLs (including equivalent URLs) were considered.
7 I.e. similar scores were grouped to reduce the number of possible values.
7.5.2 Which query-dependent baseline should be used?
In the experiments, prior to re-ranking, the anchor-text baseline generally outper-
formed the content and content+anchor-text baselines. However, on two collections,8
URL-type re-rankings of full-text (content) outperformed similar re-rankings of
anchor-text. In these two cases the target home pages were randomly chosen. This
effect was not observed for the popular targets, although the content+anchor-text per-
formance was comparable to that of anchor-text only.
Figure 7.4 illustrates the difference between the random and popular sets by plotting
S@n against n for the content and anchor-text baselines. For the popular query set, the
two baselines converge at about n = 500, but for the random set the content baseline is
clearly superior for n > 150. The plot for VLC2R is similar to that observed in a pre-
vious study of content and anchor-text performance on the WT10gT collection [135].
An explanation for the observed increase in effectiveness of the content baseline
for n > 150 is that while anchor-text rankings are better able to discriminate be-
tween home pages and other relevant pages, full anchor-text rankings are shorter9
than those for content. Some home pages have no useful incoming anchor-text and
therefore do not appear anywhere in the anchor-text ranking. By contrast, most home
pages do contain some form of site name within their content and will eventually
appear in the content ranking.
Selecting queries from a directory within the collection guarantees that the anchor
document for the target home page will not be empty, but there is no such guarantee
for randomly chosen home pages. Selection of home pages for listing in a directory
is undoubtedly biased toward useful, important or well-known sites which are also
more likely to be linked to from other pages (Experiments in Chapter 5 observed that
PageRank does favour popular pages). It should be noted that incoming home page
queries would probably also be biased toward this type of site.
In conclusion, the results of the experiments show the content+anchor-text base-
line to be the most consistent performer across all tasks, and to perform particularly
well when combined with URL-type evidence.
7.6 Further experiments
Having established the principal results above, a series of follow-up experiments was
conducted. In particular these investigated:
• to what extent results can be understood in terms of rank and score distributions;
• whether other classifications of URL-type provide similar, or superior, gains in
retrieval effectiveness;
8 Of the four evaluated. The WT10gC test collection is not included as it was used to train the re-ranking cutoffs.
9 Ignoring documents that achieve a score of zero.
[Two plots: success rate @ n (%) against the number of documents n (log scale), with curves for the content (Base C) and anchor-text (Base AT) baselines.]
Figure 7.4: Baseline success rates across different cutoffs. The top plot is for VLC2P, the
VLC2 crawl with a popular home page query set. The bottom plot is for VLC2R, the same
crawl, but with a random home page query set. The anchor-text baseline performs well for n up to about 150 for both collections. In VLC2P, at around S@150 the anchor-text baseline performance approaches the content baseline performance. In VLC2R the anchor-text performance is surpassed by the content performance at around S@150. These plots are consistent with the S@1000 values reported in Table 7.1.
• to what extent the PageRanks and in-degrees are correlated with those reported
by Google; and
• whether the use of anchor-text and link graph information external to the corpus
could improve retrieval effectiveness.
7.6.1 Rank and score distributions
This section analyses the distribution of correct answers for each type of evidence over
the WT10gC collection.
The content and anchor-text baseline rankings of the correct answers are plotted
in Figure 7.5. For over 50% of queries, both the content and anchor-text baselines contain the correct answer within the top ten results. Anchor-text provides the better scoring of the two baselines, with the correct home page ranked as the top result for almost 50% of the queries. This confirms the effectiveness of anchor-text for home page finding [15, 56].
The PageRank distributions are plotted in Figure 7.6. The distribution of the De-
mocratic PageRank scores for all pages follows a power law. In contrast, the PageRank
distribution for correct answers is much more even, with the proportion of pages that
are correct answers increasing at higher PageRanks. There are many pages which
do not achieve an APR score. Merely having an APR score > 0 gives some indica-
tion that a page is a correct answer in the WT10gC collection. These plots indicate
that both forms of PageRank provide some sort of home page evidence (as observed
in Chapter 5), even though these computed PageRank values differ markedly from
those mined from the Google toolbar in Chapter 5. This large difference re-affirms the
belief that PageRanks reported by the Google toolbar have been heavily transformed.
The in-degree distribution is plotted at the top of Figure 7.7 and is similar to the
Democratic PageRank distribution. However, the graph is slightly shifted to the left,
indicating that there are more pages with low in-degrees than there are pages with
low PageRanks. The distribution of correct answers is spread across in-degree scores,
with the proportion of pages that are correct answers increasing at higher in-degrees.
This shows that in-degree also provides some sort of home page evidence.
The URL-type distribution is plotted at the bottom of Figure 7.7. URL-type is a
particularly useful home page indicator for this collection, with a large proportion of
the correct answers located in the “Root” class and few correct answers located within
the “File” class.
7.6.2 Can the four-tier URL-type classification be improved?
This section evaluates how combining the four URL-type classes and introducing
length and directory depth based scores impacts retrieval effectiveness. The results
for this series of experiments are presented in Table 7.14.
None of the new URL-type methods significantly improved upon the performance
of the original URL-type classes (“Root” > “Subroot” > “Directory” > “File”). How-
ever, combining the “Subroot” and “Directory” classes did not adversely affect URL-
[Two histograms: number of documents against the rank of the correct answer (ranks 1 to 10, and >10).]
Figure 7.5: Baseline rankings of the correct answers for WT10gC (content top, anchor-text
bottom). The correct answer is retrieved within the top ten results for over 50% of queries on
both baselines. The anchor-text baseline has the correct answer ranked as the top result on
almost 50% of the queries.
[Two log-log plots: number of documents against normalised Democratic (top) and Aristocratic (bottom) PageRank score (quantised to 40 steps), for All pages and Correct answers.]
Figure 7.6: PageRank distributions for WT10gC (DPR top, APR bottom). These plots con-
tain the distribution of all pages in the collection (All) and the distribution of the 100 correct
answers (Correct). The distribution of the DPR scores for all pages follows a power law. In contrast, the correct answers are spread more evenly across DPR scores. Therefore the proportion of pages which are correct answers increases at higher PageRanks. Approximately 17% of pages do not achieve an APR score, thus merely having an APR score > 0 is some indication that a page is more likely to be a correct answer.
[Top: log-log plot of the number of documents against normalised in-degree score (quantised to 40 steps), for All pages and Correct answers. Bottom: percentage of documents in each URL type (file, path, subroot, root), for All pages and Correct answers.]
Figure 7.7: Other distributions for WT10gC (in-degree top, URL-type bottom). The top plot
contains the in-degree distribution for all pages (All) and the 100 correct answers (Correct).
The distribution of the in-degree scores for all pages follows a power law. In contrast, the correct answers are spread more evenly across in-degree scores. The proportion of pages which are correct answers increases at higher in-degree scores. The bottom plot contains the URL-type distribution (in percentages) of all pages (All) and the correct answers (Correct). The "Root" tier contains only 1% of the pages in the collection, but 80% of the correct answers. In contrast, the "File" tier contains 92% of the collection's pages, but only 5% of the correct answers.
Dataset Baseline R>S>D>F Length Dir Depth R>S+D+F R>S+D>F R>S>D(l)>F
ANU content 87 88 68 62 77 87
ANU anchor-text 98 98 98 97 98 98
WT10gC content 89 90 72 83 89 89
WT10gC anchor-text 82 83 75 78 82 82
WT10gT content 88 88 74 80 85 88
WT10gT anchor-text 77 79 74 75 77 77
VLC2P content 87 86 68 81 84 87
VLC2P anchor-text 89 92 87 89 89 90
VLC2R content 87 86 62 82 85 87
VLC2R anchor-text 74 76 73 73 74 74
Table 7.14: S@5 for URL-type category combinations, length (how long a URL is, favouring
short directories) and directory depth (how many directories the URL contains, favouring
URLs with shallow directories). R represents the “Root” tier, S represents the “Subroot” tier,
D is for the "Directory" tier and F is for the "File" tier. D(l) indicates that directories were ranked according to length (where shorter directories are preferred). In all cases an Optimal
re-ranking of baselines by query-independent evidence was performed.
type effectiveness. A high level of effectiveness was also obtained using a simple URL
length measure. This measure ranked pages according to the length of their URLs (in
characters, and favouring short URLs). “File” URLs contain filenames and are thereby
longer than their “Root” and “Directory” counterparts, which may explain the good
performance of the URL length measure. Re-ranking baselines using only the URL
directory depth (number of slashes in the URL) performed relatively poorly.
In conclusion, when using URL-type scores for home page finding tasks it is im-
portant to distinguish between “Root”, “Directory” and “File” pages. This can be
done either explicitly through a categorisation of URL-types or by measuring the
length of the URL.
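The two simpler measures from Table 7.14 are easily stated in code; a minimal sketch is given below, in which smaller raw values are better and scores are therefore negated so that sorting in decreasing score order favours short, shallow URLs. This mirrors the descriptions above rather than the exact implementation used in the experiments.

    # Sketch: URL length and directory depth measures (smaller is better, so negate).
    from urllib.parse import urlparse

    def url_length_score(url):
        """Favour short URLs: score by negative character length."""
        return -len(url)

    def directory_depth_score(url):
        """Favour shallow URLs: score by negative number of slashes in the path."""
        return -urlparse(url).path.count("/")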
7.6.3 PageRank and in-degree correlation
The results in Table 7.15 show that DPR and in-degree are highly correlated, but that
the correlation tends to weaken as the size of the corpus increases. This weaker as-
sociation as corpus size increases suggests that PageRank might have quite different
properties when calculated for very large crawls. Google’s PageRank, based on 50
to 100 times more documents than are in VLC2, is likely to be different and possi-
bly superior to the PageRanks studied here. In addition, Google may use a different
PageRank variant and different bookmarks.
To understand the relationship between the PageRank values calculated in ex-
periments, and the PageRank employed by the Google WWW search engine, scores
were compared with the Google PageRanks reported for all 201 ANU pages listed
in the Google Directory.10 For those pages, PageRanks were extracted from Google’s
10 A version of the manually constructed DMOZ open WWW directory which reports Google PageRanks. The Google DMOZ Directory is available at http://directory.google.com
DPR APR No. of pages (millions)
ANU 0.836 0.448 0.40
WT10g 0.71 0.555 1.69
VLC2 0.666 0.164 18.57
Table 7.15: Correlation of PageRank variants with in-degree. The correlation was tested
using the Pearson r significance test.
DMOZ directory and in-degrees were extracted using the Google link: query op-
erator. Google PageRank and in-degree were correlated (r=0.358), as they were for
ANU, WT10g and VLC2. Also, the correlation between Google in-degree and ANU
in-degree was very strong (r=0.933). Google’s in-degrees, based on a much larger
crawl, were only three times larger than those from the ANU crawl (during link count
extraction the difficulties outlined in Section 5.1.3 were encountered).
While Google PageRank and ANU PageRank were correlated over the 201 obser-
vations, the correlation was less strong than for in-degree (DPR r=0.26, APR r=0.31).
This indicates that Google PageRank is different from the PageRanks studied here
(as observed in Section 5.1.1). Note that only five different values of PageRank were
reported by Google for the 201 pages (11, 16, 22, 27 and 32 out of 40). The directory-
based PageRanks are on a different scale to those extracted using the Google Toolbar
in Chapter 5, and both have been transformed and quantised from Google’s internal
PageRank values.
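The correlations reported here and in Table 7.15 are Pearson product-moment correlations over paired per-page scores; a minimal sketch, assuming scipy is available, is:

    # Sketch: Pearson correlation between two query-independent measures.
    from scipy.stats import pearsonr

    def correlate(score_a, score_b):
        """score_a, score_b: dicts mapping page identifier -> score (e.g. DPR, in-degree)."""
        common = sorted(set(score_a) & set(score_b))
        r, p_value = pearsonr([score_a[u] for u in common],
                              [score_b[u] for u in common])
        return r, p_value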
Although this study may not be directly applicable to very large crawls, its re-
sults are quite stable for a range of smaller multi-server crawls. The range of sizes of
corpora examined here (400 000 to 18.5 million pages) is typical of many enterprise webs and thus interesting both scientifically and commercially.11
7.6.4 Use of external link information
To explore the effects of increasing corpus size, a series of hybrid WT10g/VLC2 runs
was performed. This is potentially revealing because the WT10g corpus is a subset
of the VLC2 corpus. The runs, shown in Table 7.16, used combinations of WT10g
corpus data and VLC2 link information. The hypothesis was that by using link tables
from the larger corpus it would be possible to obtain a more complete link graph
and thereby improve the performance of the hyperlink recommendation and anchor-
text measures (due to a potential increase in the hyperlink votes, and the amount of
11 The rated capacities of the two Google search appliances are in fact very similar to these sizes (150 000 and 15 million pages), see http://www.google.com/appliance/products.html.
available anchor-text). During these hybrid runs all VLC2 anchor-text that pointed to
pages outside the WT10g corpus was removed.
WT10g anchor-text (DPR: none / WT10g / VLC2) VLC2 anchor-text (DPR: none / WT10g / VLC2)
WT10gC 0.69 0.72 0.69 0.78 0.79 0.78
WT10gT 0.68 0.71 0.71 0.72 0.72 0.73
Table 7.16: Using VLC2 links in WT10g. Note that the WT10g collection is a subset of the
VLC2 collection. The WT10g anchor-text scores are the baselines used throughout all other ex-
periments in this chapter. The VLC2 anchor scores are new rankings that use external anchor-
text from the VLC2 collection. WT10g DPR is a Democratic PageRank re-ranking using the
link table from the WT10g collection. VLC2 DPR is a Democratic PageRank re-ranking using
the link table from the VLC2 collection. The use of the (larger) VLC2 link table DPR scores did
not significantly improve the performance of DPR re-ranking. The use of external anchor text,
taken from the VLC2 collection, provided significant performance gains.
Surprisingly, the use of the (larger) VLC2 link table DPR scores did not noticeably
improve the performance of DPR re-ranking. However, the use of external anchor-
text, taken from the VLC2 corpus, provided significant performance gains. This would
suggest that in situations where an enterprise or small web has link information for
a larger web, benefits will be seen if the anchor-text from the external link graph is
recorded and used for the smaller corpus.12
The WT10g collection is not a uniform sample of VLC2, but was engineered to
maximise the interconnectivity of the documents selected [15]. Hence the effects of
scaling up may be smaller than would be expected in other web corpora.
7.7 Discussion
Using query-independent evidence scores as a minimum threshold for page inclusion
appears to be a useful method by which system efficiency can be improved without
significantly harming home page finding effectiveness. The use of hyperlink recom-
mendation evidence as a threshold resulted in a reduction of 10% of the corpus without any change in retrieval effectiveness. By comparison, using a URL-type threshold of "> File", corpus size was reduced by over 90%, and retrieval effectiveness was significantly improved for two out of three collections.
12 This was later investigated further by Hawking et al. [115] who found that the use of external anchor-text did not improve retrieval effectiveness.
Re-ranking query-dependent baselines (both content and anchor-text) on the basis
of URL-type produced consistent benefit. This heuristic would be a valuable compo-
nent of a home page finding system for web corpora with explicit hierarchical struc-
ture.
By contrast, in these experiments, unless Optimal re-ranking is used, hyperlink-
based recommendation schemes do not achieve significant effectiveness gains. Even
on the WT10gC collection, on which the re-ranking cutoffs were trained, the recom-
mendation results were poor. For corpora of up to twenty million pages, the hyper-
link recommendation methods do not appear to provide benefits in document rank-
ing for a home page finding task. Similarly, little benefit has previously been found
for relevance-based retrieval in the TREC web track [121]. An alternative means of biasing ranking towards pages that are heavily linked to, re-weighting the anchor-text ranking formula to favour large volumes of anchor-text, is investigated in Chapter 8.
An ideal home page finding system would exploit both anchor-text (for superior
performance when targeting popular sites) and document full-text information (to
ensure that home pages with inadequate anchor-text are not missed). While the pre-
liminary content+anchor-text baseline presented here goes some way to investigat-
ing combined performance, further work is needed to better understand whether this
combination is optimal. Further examination is required to determine how to provide
the best all-round search effectiveness when home page queries are interspersed with
other query types. Additional work is also required to determine whether evidence
useful in home page finding is useful for other web retrieval tasks (such as Topic
Distillation). These issues are investigated in Chapter 9, through the description and
evaluation of a first-cut general-purpose document ranking function that incorporates
web evidence.
Chapter 8
Anchor-text in web search
Full-text ranking algorithms have been used to score aggregate anchor-text evidence
with some success, both in experiments within this thesis (see Chapters 6 and 7), and
in experiments reported elsewhere [56]. When comparing the textual contents of doc-
ument full-text and aggregate anchor-text it is clear that, in many cases, they differ
markedly. For example, aggregate anchor-text sometimes contains extremely high
rates of term repetition. Excessive term repetition may make a negligible (or even neg-
ative1) contribution to full-text evidence, but may be a useful indicator in anchor-text
evidence. This is because each term occurrence could indicate an independent “vote”
from an external author that the document is a worthwhile target for that term.
This chapter examines whether the Okapi BM25 and Field-weighted Okapi BM25
ranking algorithms, previously used with success in scoring both document full-text
and aggregate anchor-text [56, 173], can be revised to better match anchor-text evi-
dence. The investigation is split into three sections. The first section presents an inves-
tigation of how the Okapi BM25 full-text ranking algorithm is applied when scoring
aggregate anchor-text. This includes an analysis of how the document and collection
statistics used in BM25 (and commonly used in other full-text ranking algorithms)
might be modified to better score aggregate anchor-text evidence. The second section
examines four different methods for combining the aggregate anchor-text evidence
with other document evidence. The third and final section provides an empirical in-
vestigation of the effectiveness of the revised scoring methods, for both combined
anchor-text and full-text evidence, and anchor-text alone.
8.1 Document statistics in anchor-text
This section examines how document statistics used in the Okapi BM25 ranking func-
tion, and other full-text ranking methods (see Section 2.3.2), apply to aggregate anchor-
text evidence.
1 As it could indicate a spam document, which was designed explicitly to be retrieved by the search system in response to that query term.
8.1.1 Term frequency
In full-text document retrieval, term frequency (tf ) is used to give some measure of
the “aboutness” of a document (see Section 2.3.1.2). The underlying assumption is
that if a document repeats a term many times, it is likely to be about that term.
The distribution of tf s in aggregate anchor-text appears to be quite different from
that in document full-text. For example, an analysis of the term distribution in anchor-
text and full-text for the "World Bank projects" home page (see footnote 2) illustrates how tf s can
differ markedly. In the aggregate anchor-text for this document the term “projects”
has a tf of 6798 (and makes up approximately 80% of all incoming anchor-text). By
comparison, in the document full-text the term “projects” has a tf of only 5 (and makes
up approximately 4% of the total document full-text).
As shown in Figure 8.1, when using the default term saturation parameter (k1 = 2) Okapi BM25 scores are almost flat beyond a tf of 10. This may not be a desirable property when scoring aggregate anchor-text, as each occurrence of a query term may be a separate vote that the term relates to the contents of the document. The early saturation of term contribution can be particularly problematic when combining document scores in a linear combination (see Section 2.5.1.1). Taking the "World Bank projects" home page again as an example: suppose another corpus document (of average length) has only 60 occurrences of the term "projects" in its incoming links (6738 fewer occurrences than in the "World Bank projects" home page anchor-text), but its full-text contains "projects" ten times (four more occurrences than in the full-text of the "World Bank projects" home page). When measures are combined using a linear combination of Okapi BM25 scores (with default k1 and b parameters), that page will outperform the home page.
Changing the rate of saturation for anchor-text, through modification of the Okapi
BM25 k1 value, is one method by which the impact of high aggregate anchor-text term
frequencies might be changed. For example, Figure 8.1 illustrates that given a higher
k1 value, the function saturates more slowly, thereby allowing for higher term counts
before complete function saturation. However, if this evidence is to be combined with
other document evidence (computed using different Okapi BM25 parameters) using
a linear combination, then scores have to be renormalised.
This analysis suggests that when scoring aggregate anchor-text evidence the use
of a much higher value of k1 may be effective.3 A change in saturation rate is ex-
plored below, through length normalising aggregate anchor-text contribution using
the length of document full-text.
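The saturation behaviour in Figure 8.1 can be reproduced with the term-weighting core of Okapi BM25. The sketch below assumes the common form in which a term contributes idf(t) · tf/(K + tf) with K = k1((1 − b) + b · dl/avdl), so the constants may differ slightly from the exact formulation in Section 2.3.1.3.

    # Sketch: BM25 term contribution as tf grows, for several values of k1.
    import math

    def bm25_term_weight(tf, dl, avdl, N, n_t, k1=2.0, b=0.75):
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5))
        K = k1 * ((1.0 - b) + b * dl / avdl)
        return idf * tf / (K + tf)

    # Reproduce the shape of Figure 8.1: a document of average length, N = 100000, n_t = 10.
    for k1 in (0, 1, 2, 10):
        scores = [bm25_term_weight(tf, dl=100, avdl=100, N=100000, n_t=10, k1=k1)
                  for tf in range(1, 21)]
        print(k1, [round(s, 2) for s in scores])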
8.1.2 Inverse document frequency
Inverse document frequency (idf ) is used in full-text ranking to provide a measure of
the frequency of term occurrence in documents within a corpus, and thereby a mea-
sure of the importance of observing a term in a document or query (see Section 2.3.1.2).
2 Located at: http://www.worldbank.org/projects/
3 Time did not permit confirmation of the benefits of this.
[Plot: BM25 document score against tf, with curves for k1 = 0, 1, 2 and 10.]
Figure 8.1: Document scores achieved by BM25 using several values of k1 with increasing tf, assuming a document of average length, and N = 100 000, nt = 10.
The idf measure is likely to be useful when scoring aggregate anchor-text (i.e.
assigning more weight to query terms that occur in fewer documents). However, it
is unclear whether idf values should be calculated across all document fields (see footnote 4) at once
(i.e. one idf value per document), or individually for each document field (i.e. one idf
value per field, per document). Accordingly two possible idf measures are proposed:
• Global inverse document frequency (gidf ): A single idf value is computed per
term.
• Field-based inverse document frequency (fidf ): Multiple idf values are computed per term, one per field (i.e. per type of query-dependent evidence).
There are situations in which gidf and fidf scores vary considerably. For example
while the term “Microsoft” occurs in 16 330 documents in the TREC WT10g corpus
(see Section 2.6.7.1), it occurs in the aggregate anchor-text for only 532 documents.
“Microsoft” would have a low gidf in WT10g because many documents in the corpus
mention it, but a relatively high fidf as few documents are the targets of anchor-text
containing that term. A comprehensive comparison of the effectiveness of gidf and
fidf measures was not performed, although a limited examination was performed as
part of revised anchor-text formulations.
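To make the distinction concrete, a minimal sketch of the two variants is given below; documents are assumed to be represented as dictionaries of per-field texts, and the idf form follows the Okapi formulation used elsewhere in this chapter.

    # Sketch: global (gidf) versus field-based (fidf) inverse document frequency.
    import math

    def gidf(term, docs):
        """docs: list of dicts mapping field name -> text, e.g.
        {"content": "...", "title": "...", "anchor": "..."}."""
        n_t = sum(1 for d in docs
                  if any(term in text.split() for text in d.values()))
        return math.log((len(docs) - n_t + 0.5) / (n_t + 0.5))

    def fidf(term, docs, field):
        """One idf value per field: count only documents whose given field contains the term."""
        n_t = sum(1 for d in docs if term in d.get(field, "").split())
        return math.log((len(docs) - n_t + 0.5) / (n_t + 0.5))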
A summary of the evaluated idf measures is presented in Table 8.1.
4 A field is a form of query-dependent evidence, for example document full-text, title or anchor-text.
Abbreviation Description Described in
BM25 Default Okapi BM25 Section 2.3.1.3
(calculates field-based idf values)
BM25gidf Okapi BM25 with global idf statistics, such Section 8.1.2
that idf is calculated only once using
all document fields
BM25FW Default Field-weighted Okapi BM25 Section 2.5.2.1
(a single global idf value is calculated
across all document fields)
BM25FWfidf Field-weighted Okapi BM25 with field-based Section 8.1.2
idf values
Table 8.1: Summary of idf variants used in ranking functions under examination.
8.1.3 Document length normalisation
Document length normalisation is used in full-text ranking algorithms to reduce bias
towards long documents. This bias occurs because the longer a piece of text, the
greater the likelihood that a particular query term will occur in it (see Section 2.3.1.3).
In Okapi BM25 the length normalisation function is controlled by b, with b = 1 en-
forcing strict length normalisation, and b = 0 removing length normalisation. Using
Okapi BM25 with the default length normalisation parameter (b = 0.75) [56, 186],
slightly longer documents are favoured. This was shown to be effective when scoring
document full-text in TREC ad-hoc tasks, as slightly longer (full-text) documents were
found to be more likely to be judged relevant [186] (described in Section 2.3.1.3).
The length of aggregate anchor-text is usually dependent on the number of in-
coming links. Therefore, applying length normalisation to aggregate anchor-text, and
thereby reducing the contribution of terms that occur in long aggregate anchor-text, is
in direct contrast to the use of hyperlink recommendation algorithms.
Aggregate anchor-text length is also much more variable than document full-text
length, with many documents having little or no anchor-text, and some having a very
large amount of incoming anchor-text (attributable to the power law distribution of
links amongst pages, see Section 2.4). In the TREC .GOV corpus (see Section 2.6.7.1)
the average full-text document length is around 870 terms.5 By comparison, the aver-
age aggregate anchor-text length is only 25 words.
An example of the negative effects of aggregate anchor-text length normalisation
can be studied for the query “USGS” on the .GOV corpus. Figure 8.2 contains the
aggregate anchor-text distribution for the home page of the United States Geolog-
5 Not including binary documents.
[Pie chart of the aggregate anchor-text term distribution: USGS 23%, SURVEY 10%, GEOLOGICAL 10%, US 10%, HOME 8%, Other 39%.]
Figure 8.2: Aggregate anchor-text term distribution for the USGS home page
(http://www.usgs.gov) from the .GOV corpus. This page has the highest in-degree of
all .GOV pages (around 88 000 links) and an aggregate anchor-text length of around 170 000
terms.
[Pie chart of the aggregate anchor-text term distribution: INFORMATION 50%, USGS 50%.]
Figure 8.3: Aggregate anchor-text term distribution for
‘‘http://nh.water.usgs.gov/USGSInfo’’ from the .GOV corpus. This page
has 243 incoming links, and an aggregate anchor-text length of around 486 terms.
ical Survey (USGS), the most highly linked-to document in the .GOV corpus. For
comparison, Figure 8.3 contains the aggregate anchor-text distribution for a “USGS
info” page (http://nh.water.usgs.gov/USGSInfo). The USGS home page has
around 170 000 terms in its aggregate anchor-text (from around 88 000 incoming
links), 34 000 of which (23%) are “USGS”. By contrast, http://nh.water.usgs.
gov/USGSInfo has 486 terms in its aggregate anchor-text (from 243 incoming links),
of which half (243) are “USGS”. Considering only aggregate anchor-text evidence
and using the default Okapi BM25 length normalisation parameter (b = 0.75), the
http://nh.water.usgs.gov/USGSInfo page outperforms the USGS home page
for the query “USGS”!
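The effect can be checked numerically using the same common BM25 term-weight form assumed above. With the figures quoted in the text (avdl = 25, b = 0.75, k1 = 2), the shorter aggregate anchor-text of the "USGS info" page does indeed outscore the USGS home page, while with b = 0 the ordering is reversed. The collection statistics N and n_t below are illustrative, since the idf factor is common to both pages and does not affect the comparison.

    # Sketch: anchor-text-only BM25 weights for the query "USGS" under b = 0.75 and b = 0.
    import math

    def bm25_term_weight(tf, dl, avdl, N, n_t, k1=2.0, b=0.75):
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5))
        K = k1 * ((1.0 - b) + b * dl / avdl)
        return idf * tf / (K + tf)

    # Figures taken from the text; N and n_t are illustrative (idf cancels in the comparison).
    pages = {"www.usgs.gov": {"tf": 34000, "dl": 170000},
             "nh.water.usgs.gov/USGSInfo": {"tf": 243, "dl": 486}}
    for b in (0.75, 0.0):
        for name, d in pages.items():
            w = bm25_term_weight(d["tf"], d["dl"], avdl=25, N=1000000, n_t=10000, b=b)
            print(f"b={b}: {name} scores {w:.3f}")
    # With b = 0.75 the USGSInfo page wins; with b = 0 the USGS home page wins.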
An illustration of the effects of Okapi BM25 length normalisation of aggregate
anchor-text (and other document fields) for a one term query, is presented in Fig-
ure 8.4. This Figure contains plots for both length normalised and unnormalised
Okapi BM25 scores for documents with three different proportions of matching terms.
The average document length (avdl) value is set to the average aggregate anchor-text
length in the .GOV corpus (25 terms). The idf value is set such that the probability of
encountering a term in a document is one-in-one-thousand (nt = 1000, N = 100 000).
The Okapi BM25 k1 parameter is set to 2. In the top plot document scores are length
normalised, and in the bottom they are not. The top plot shows that, when using a
default length normalisation (b = 0.75) value, it is impossible for a document with
only 25% of terms matching the query to be ranked above a document where 50% of
the terms match the query, even when comparing a tf of 5 to a tf of 2 500 000.
When using the Field-weighted Okapi BM25 method (BM25FW , described in
Section 2.5.2.1) the negative effects associated with aggregate anchor-text length nor-
malisation can be even more severe. The field-weighting method combines all doc-
ument evidence (including aggregate anchor-text) into a single composite document
and then uses the combined composite document length to normalise term contribu-
tion. Due to document length normalisation, it is unlikely that a document with a
large number of incoming links will be retrieved for any query. This is even the case
if the document full-text as well as the aggregate anchor-text mention the query term
more than any other document in the corpus.
A summary of the evaluated length normalisation techniques is presented in Ta-
ble 8.2.
8.1.3.1 Removing aggregate anchor-text length normalisation
One approach to dealing with the length normalisation issue outlined above is to
eliminate aggregate anchor-text length from consideration. In the Okapi BM25 for-
mulation length normalisation is controlled by the b constant, so length normalisation
can be removed by setting b = 0. The bottom plot of Figure 8.4 represents the Okapi
BM25 scores for documents with three different proportions of matching terms, with
no length normalisation (b = 0) and other parameters as specified in the section above
(avdl = 25, N = 100 000, nt = 1000, and k1 = 2). Without length normalisation the
proportion of terms that match is ignored and the sheer volume of matching anchor-
[Two plots: BM25 score (k1 = 2) against document length (log scale), with b = 0.75 (top) and b = 0 (bottom), for documents in which 75%, 50% and 25% of terms match the query.]
Figure 8.4: The effect of document length normalisation on BM25 scores for a single term
query. Each line represents a document containing some proportion of terms that match a
query term (i.e. 25% of terms match is a document where one-out-of-four document terms
match the query term). The graph illustrates the change in scores when the number of doc-
ument terms increases. For example, if 75% of terms match in a document that contains 1000
terms, a total of 750 term matches have been observed. BM25 scores are calculated assuming
an avdl of average aggregate anchor-text length in the .GOV corpus (25 terms), idf values are
calculated using N = 100 000 and nt = 1000, and k1 is set to 2. The top plot shows the Okapi
BM25 document scores when using the default length normalisation parameter (b = 0.75).
The bottom plot gives the Okapi BM25 scores without length normalisation (b = 0). For the
length normalised documents (top), even as the number of term matches are increased, the
“proportion” of terms that match the query is still the most important factor. By comparison
without length normalisation (bottom), only the raw frequency of term matches is important.
Abbreviation Description Described in
BM25 Default Okapi BM25 formulation Section 2.3.1.3
(length normalisation using the field length)
BM25nodln Okapi BM25 using no length normalisation. Section 8.1.3.1
BM25contdln Okapi BM25 using full-text length Section 8.1.3.2
to normalise score
BM25FW Default Field-weighted Okapi BM25 Section 2.5.2.1 &
(length normalised using the composite
document length, which is the
sum of all field lengths). Section 8.2.2
BM25FWnoanchdln Field-weighted Okapi BM25 length Section 8.1.3.1
normalised using lengths of every
field except for anchor-text
Table 8.2: Summary of document length normalisation variants in ranking functions under
examination.
text terms is considered. This favours documents that have a large number of incom-
ing links and that may therefore be expected to achieve high hyperlink recommenda-
tion scores. The revised formulation of Okapi BM25 (with k1 = 2) for a document D,
and a query Q, containing terms t is:
\[ BM25_{nodln}(D, Q) = \sum_{t \in Q} \frac{tf_{t,D} \times \log\left(\frac{N - n_t + 0.5}{n_t + 0.5}\right)}{2 + tf_{t,D}} \tag{8.1} \]
In the BM25FW formulation, the aggregate anchor-text length may be omitted
when computing the composite document length. In these experiments the removal of
aggregate anchor-text length in the BM25FW formulation is referred to as
BM25FWnoanchdln.
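A direct implementation of equation (8.1), as reconstructed above, is sketched below; the document is assumed to be represented by the term frequencies of its aggregate anchor-text.

    # Sketch: BM25 with no document length normalisation (equation 8.1, k1 = 2, b = 0).
    import math

    def bm25_nodln(query_terms, anchor_tf, N, n):
        """anchor_tf: term -> tf in the document's aggregate anchor-text.
        n: term -> number of documents containing the term (idf statistics)."""
        score = 0.0
        for t in query_terms:
            tf = anchor_tf.get(t, 0)
            if tf:
                idf = math.log((N - n[t] + 0.5) / (n[t] + 0.5))
                score += tf * idf / (2.0 + tf)
        return score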
8.1.3.2 Anchor-text length normalisation by other document fields
Rather than using the length of aggregate anchor-text to normalise anchor-text scores,
it might be more effective to normalise aggregate anchor-text using the length of an-
other document field. For example, the length of document full-text could be used to
normalise aggregate anchor-text term contribution. Document length is known to be
useful query-independent evidence for some tasks (see Section 2.3.1.2) [186].
In experiments within this chapter, the use of document full-text length when scor-
ing aggregate anchor-text in the Okapi BM25 formulation is referred to as BM25contdln.
This approach may be more efficient than using individual document lengths as only
the full-text document lengths needs to be recorded.
8.2 Combining anchor-text with other document evidence
Four methods for combining anchor-text baselines with other document evidence are
investigated:
• BM25LC : a linear combination of Okapi BM25 scores;
• BM25FW : a combination using Field-weighted Okapi BM25 (described in Sec-
tion 2.5.2.1);
• BM25HYLC : a linear combination of Okapi BM25 and Field-weighted Okapi
BM25 scores; and
• BM25FWSn, BM25FWSnI : a combination of the best scoring anchor-text snip-
pet (repeated according to in-degree in BM25FWSnI ) with other document evi-
dence using the Field-weighted Okapi BM25 method.
In all cases the 2002 Topic Distillation task (TD 2002) was used to train combination
and ranking function parameters. It was later observed (both in experiments in Chap-
ter 9, and for reasons outlined in Section 2.6.7.2) that due to the informational nature
of the TD 2002 task it may not have been the most appropriate training set for these
navigational tasks [53]. Further gains may be achieved by re-training parameters for
a navigational-based search task (as used in Section 9.1.5).
8.2.1 Linear combination
In experiments within this chapter a linear combination of Okapi BM25 scores for
full-text and aggregate anchor-text is explored. Document title is not considered sep-
arately, and is scored as part of the document full-text baseline. A document score D
for a query Q is then:
BM25LC(D, Q) = BM25(C + T, Q) + αBM25(A, Q) (8.2)
where C + T is the full-text and title of document D, A is the aggregate anchor-text
for document D, and α is tuned according to the expected contribution of anchor-text
evidence.
Conceptually the linear combination assigns separate scores to document full-text
and aggregate anchor-text, considering them as independent descriptions of docu-
ment content. The BM25 linear combination constant was trained on the TD 2002
task, leading to α = 3.
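A minimal sketch of this combination follows. The bm25 helper below uses the common tf/(K + tf) form of the Okapi weight (the exact formulation is given in Section 2.3.1.3), the shape of the stats dictionary is an assumption made for illustration, and α = 3 as trained on TD 2002.

    # Sketch: linear combination of Okapi BM25 scores (equation 8.2), alpha = 3.
    import math

    def bm25(query_terms, field_tf, field_len, avdl, N, n, k1=2.0, b=0.75):
        """Okapi BM25 over one field treated as a pseudo-document (common form)."""
        K = k1 * ((1.0 - b) + b * field_len / avdl)
        score = 0.0
        for t in query_terms:
            tf = field_tf.get(t, 0)
            if tf:
                idf = math.log((N - n[t] + 0.5) / (n[t] + 0.5))
                score += idf * tf / (K + tf)
        return score

    def bm25_lc(query_terms, content_title, anchor, stats, alpha=3.0):
        """content_title / anchor: {"tf": term -> count, "len": field length} for one document.
        stats: per-field collection statistics, e.g. {"content": {"avdl": ..., "N": ..., "n": {...}},
        "anchor": {...}} (an assumed shape, for illustration)."""
        c = bm25(query_terms, content_title["tf"], content_title["len"], **stats["content"])
        a = bm25(query_terms, anchor["tf"], anchor["len"], **stats["anchor"])
        return c + alpha * a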
8.2.2 Field-weighted Okapi BM25
The BM25FW formulation (see Section 2.5.2.1) includes three document fields: doc-
ument full-text (content), aggregate anchor-text, and title. The weights for each of
these fields were derived by Robertson et al. [173] for the TD 2002 task (content:1,
anchor-text:20, title:50, k1 = 3.4, b = 0.85). In this chapter, the document fields scored
144 Anchor-text in web search
are represented in brackets after BM25FW , with default fields of full-text, anchor-text
and title indicated by BM25FW ((C, A, T), Q) for query Q.
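A sketch of the field-weighted combination is given below: per-field term frequencies are scaled by the field weights and merged into a single composite document, which is then scored once using the composite document length. The weights and parameters are those quoted above (content:1, anchor-text:20, title:50, k1 = 3.4, b = 0.85); the exact BM25FW formulation is the one in Section 2.5.2.1, so this should be read as an approximation.

    # Sketch: Field-weighted Okapi BM25 (BM25FW) via a weighted composite document.
    import math

    FIELD_WEIGHTS = {"content": 1.0, "anchor": 20.0, "title": 50.0}

    def bm25_fw(query_terms, fields, avdl, N, n, k1=3.4, b=0.85):
        """fields: field name -> (term -> tf) for one document; n: global document frequencies."""
        composite_tf = {}
        for field, tfs in fields.items():
            weight = FIELD_WEIGHTS.get(field, 1.0)
            for t, tf in tfs.items():
                composite_tf[t] = composite_tf.get(t, 0.0) + weight * tf
        dl = sum(composite_tf.values())            # composite document length
        K = k1 * ((1.0 - b) + b * dl / avdl)
        score = 0.0
        for t in query_terms:
            tf = composite_tf.get(t, 0.0)
            if tf:
                idf = math.log((N - n[t] + 0.5) / (n[t] + 0.5))   # single global idf
                score += idf * tf / (K + tf)
        return score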
8.2.3 Fusion of linear combination and field-weighted evidence
A hybrid combination can be performed by grouping and scoring like document evi-
dence with Field-weighted Okapi BM25, and combining independent document evi-
dence using a linear combination of scores. The split examined in experiments in this
chapter is between document-level and web-based evidence: the document full-text
and title are scored independently from the externally related aggregate anchor-text.
This approach is referred to as (BM25HYLC).
BM25HYLC(D, Q) = BM25FW ((C, T), Q) + αBM25gidf (A, Q) (8.3)
8.2.4 Snippet-based anchor-text scoring
An alternative to scoring documents based on their aggregate anchor-text is to score
documents according to their best matching anchor-text snippet.6 Overlap between
different forms of document evidence may be reduced through snippet-based rank-
ing. When using full-text ranking algorithms to score aggregate anchor-text evidence,
there may be overlap in the document features used to score documents. For exam-
ple, a document that has a large number of in-links, is also likely to have a high tf for
a particular term (see “USGS” example in Section 8.1.3). Additionally, the aggregate
anchor-text for a document with a large number of incoming links is likely to be long,
and so will be impacted by document length normalisation.
Snippet-based scores are collected by scoring every snippet of anchor-text pointing
to each document, and using the highest scoring snippet per document.7 These snip-
pets are then combined with other document evidence using Field-weighted Okapi
BM25 with snippet-based collection and document statistics.8 Whilst these may not
be the best formulations of snippet statistics, they are consistent with the derivations
used in Okapi BM25.
Two snippet-based scoring functions were considered: BM25FWSn and
BM25FWSnI. BM25FWSn combines a single occurrence of the best scoring snippet with
other document evidence using Field-weighted Okapi BM25. BM25FWSnI combines
the best scoring snippet repeated according to document in-degree with other docu-
ment evidence using Field-weighted Okapi BM25.9
The evaluated snippet based runs are reported in Table 8.3.
6 An anchor-text snippet is the anchor-text of a single link pointing to a document.
7 This is a computationally-expensive operation, as all non-duplicate snippets require individual scoring at query time.
8 The statistics were adapted as follows: term frequency was set to within-snippet term frequency, inverse document frequency to the frequency of terms within snippets, and document length as the length of a particular snippet.
9 Time did not allow for the investigation of further snippet ranking combinations.
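To make the snippet selection step concrete, a small sketch is given below, reusing the bm25 helper from Section 8.2.1. Snippet-level statistics follow the adaptation described in footnote 8, but the data layout and the way the selected snippet is handed to the field-weighted combination are assumptions.

def select_best_snippet(snippets, query_terms, snippet_df, num_snippets,
                        avg_snippet_len, indegree=1, repeat_by_indegree=False):
    # Score each anchor-text snippet pointing at a document with BM25 using
    # snippet-level collection statistics, and keep the best scoring snippet.
    # If repeat_by_indegree is set, the best snippet is repeated in-degree
    # times (as in BM25FWSnI) before it enters the field-weighted combination.
    best_terms, best_score = [], 0.0
    for snippet_terms in snippets:            # one entry per incoming link
        s = bm25(snippet_terms, query_terms, snippet_df,
                 num_snippets, avg_snippet_len)
        if s > best_score:
            best_terms, best_score = snippet_terms, s
    return best_terms * indegree if repeat_by_indegree else best_terms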
Abbreviation Description Described in
BM25FWSn Field-weighted Okapi BM25 using Section 8.2.4
the best matching anchor-text snippet
as the anchor-text component.
BM25FWSnI Field-weighted Okapi BM25 using Section 8.2.4
the best matching anchor-text snippet
repeated according to document in-degree
as the anchor-text component.
Table 8.3: Summary of snippet-based document ranking algorithms under examination.
8.3 Results
This section provides an empirical investigation of the effectiveness of the revised
scoring methods. Effectiveness was evaluated using an automatic site-map based
experiment on a university web, and using test collections from the 2002 and 2003
TREC web tracks. The TREC tasks studied were a named page finding task (NP2002),
the 2003 combined home page finding / named page finding task (HP/NP2003), and
the 2003 Topic Distillation task (TD2003). TREC web track corpus and task details are
outlined in Section 2.6.7.
8.3.1 Anchor-text baseline effectiveness
The effectiveness of the aggregate anchor-text scoring techniques was evaluated using
a set of 332 navigational queries over a corpus of 80 000 web pages gathered from a
university web. The navigational queries were sourced using the automatic site map
method (described in Section 2.6.5.3).
Ranking function Score Rank
BM25 61 62
BM25contdln 100 1
BM25nodln 100 1
Table 8.4: Okapi BM25 aggregate anchor-text scores and ranks for length normalisation
variants. The “Score” and “Rank” are the normalised scores and ranks achieved for the correct
answer to the query ‘library’ on the university corpus.
Table 8.4 shows the ranks and normalised scores achieved by the best answer in
response to the query “library” when using only aggregate anchor-text. When in-
corporating aggregate anchor-text length normalisation in Okapi BM25, the correct
answer was severely penalised, as the aggregate anchor-text length was 13 484 words
(262 times the average length in the collection). This was despite the document having
very high term frequency (tf ) for the query term (1664). In contrast, both BM25contdln
and BM25nodln placed the best answer at rank one, but scored it only slightly above
many other candidate documents. In fact, the score was only 1% higher than the home
page of a minor library whose tf was a factor of 7.5 lower. Due to the small difference
in scores assigned for anchor-text evidence, if these scores were combined with other
document scores in a linear combination, the ranking of documents might change.
To increase the contribution of strong anchor-text matches, the weight of anchor-text
evidence must be increased and/or the saturation rate of anchor-text changed. An
anchor-text ranking function that does not saturate anchor-text term contribution is
presented in the following Chapter (AF1, in Section 9.1.3.1).
Table 8.5 shows results for the full set of 332 navigational queries processed over
the university corpus. Wilcoxon tests show that using full-text document length
(BM25contdln) to length normalise aggregate anchor-text significantly (p < 0.02) im-
proved effectiveness relative to the case of no length normalisation (BM25nodln). Fur-
ther, both BM25contdln and BM25nodln were superior to the default Okapi BM25 for-
mulation (p < 10−5).
Ranking function MRR P@1
BM25 0.61 0.47
BM25contdln 0.72 0.63
BM25nodln 0.70 0.61
Table 8.5: Effectiveness of Okapi BM25 aggregate anchor-text length normalisation tech-
niques on the university corpus. MRR depicts the Mean Reciprocal Rank of the first correct
answer; P@1 is precision at 1, the proportion of queries for which the best answer was returned
at rank one.
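To clarify the distinction between the three variants compared in Tables 8.4 and 8.5, the sketch below parameterises a single BM25 routine by the length used for normalisation. This parameterisation is one plausible reading of the definitions in Section 8.1, not the exact implementation used in these experiments.

import math
from collections import Counter

def bm25_anchor(anchor_terms, query_terms, df, num_docs, avg_len,
                k1=2.0, b=0.75, norm_len=None):
    # norm_len=None with b>0    -> BM25        (aggregate anchor-text length)
    # norm_len=full-text length -> BM25contdln (full-text document length)
    # b=0                       -> BM25nodln   (no length normalisation)
    # avg_len must correspond to whichever length statistic is used.
    tf = Counter(anchor_terms)
    length = len(anchor_terms) if norm_len is None else norm_len
    score = 0.0
    for t in query_terms:
        if tf[t] == 0 or t not in df:
            continue
        idf = math.log((num_docs - df[t] + 0.5) / (df[t] + 0.5))
        score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * length / avg_len))
    return score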
8.3.2 Anchor-text and full-text document evidence
This section examines the results from experiments that combine the new anchor-text
scoring methods with document full-text evidence. Combined runs are evaluated
using TREC web track test collections from 2002 and 2003 (discussed in Section 2.6.7).
8.3.2.1 Field-weighted Okapi BM25 combination
Table 8.6 shows the results from the Field-weighted Okapi BM25-based (BM25FW )
experiments.
Task Ranking Function C A T P@1 P@10 MRR Sig.
NP2002 BM25FW 1 50 20 0.59 0.82 0.68 -
NP2002 BM25FWnoanchdln 1 50 20 0.59 0.87 0.68 -
NP2002 BM25FW 1 500 20 0.49 0.78 0.60 -
NP2002 BM25FWnoanchdln 1 500 20 0.52 0.85 0.63 -
TD2003 BM25FW 1 50 20 0.10 0.09 0.10 -
TD2003 BM25FWnoanchdln 1 50 20 0.18 0.09 0.13 *+
TD2003 BM25FW 1 500 20 0.17 0.08 0.09 -
TD2003 BM25FWnoanchdln 1 500 20 0.20 0.09 0.13 *+
HP&NP2003 BM25FW 1 50 20 0.48 0.76 0.58 -
HP&NP2003 BM25FWnoanchdln 1 50 20 0.63 0.85 0.71 *+
HP&NP2003 BM25FW 1 500 20 0.36 0.67 0.46 -
HP&NP2003 BM25FWnoanchdln 1 500 20 0.59 0.84 0.68 *+
Table 8.6: Effectiveness of Field-weighted Okapi BM25. Three TREC web track tasks were
evaluated; “NP2002” is the 2002 TREC web track named page finding task; “TD2003” is the
2003 TREC web track Topic Distillation task; and “HP&NP2003” is the 2003 TREC web track
combined home page / named page finding task. “C” is the content weight (1 by default), “A”
is the aggregate anchor-text weight (50 by default) and “T” is the title weight (20 by default).
“Sig.” indicates whether improvements were significant (“*+”) over the BM25FW (C, A, T)
baseline. Improvements for no length normalisation were only significant for TD2003 and
the HP&NP2003 task. Performance decreased dramatically when up-weighting the aggregate
anchor-text field while including aggregate anchor-text in composite document length.
The removal of aggregate anchor-text length from composite document lengths
in the Field-weighted Okapi BM25 model (BM25FWnoanchdln) significantly improved
performance in two-out-of-three tasks, and did not affect performance in the other.
The results show that increasing the weight of aggregate anchor-text by an order
of magnitude in BM25FW exacerbates the negative effects of including aggregate
anchor-text length in the composite document length. Combining BM25FW scores
with hyperlink recommendation evidence might go some way to re-balancing the re-
trieval of highly linked pages. The investigation of this potential is left for future
work.
Function parameters were optimised for composite document lengths that included
aggregate anchor-text. It is likely that improvements achieved through the removal of
aggregate anchor-text length might be increased through re-tuning Okapi BM25FW ’s
document length (b) and saturation (k1) parameters. This is also left for future work.
Anchor-text snippets in Field-weighted Okapi BM25
The performance of the anchor-text snippet-based ranking functions is presented in
Table 8.7. Both snippet-based runs performed poorly by comparison to the BM25FW
runs. The snippet-based runs were also far less efficient than aggregate anchor-text
runs, as statistics were calculated and stored for each link rather than each document
(and there is an order of magnitude more links than documents in the .GOV corpus).
Further investigation would be required to determine whether a snippet-based rank-
ing could be effective. For example, effectiveness might be improved by re-optimising
the Okapi BM25 parameters, or re-weighting snippets according to their origin (e.g.
according to whether they are within-site or cross-site links) or according to some
notion of source authority.
Ranking function P@1 P@10 MRR Sig.
BM25FW 0.10 0.09 0.10 -
BM25FWnoanchdln 0.18 0.09 0.13 *+
BM25FWSn 0.06 0.04 0.04 *-
BM25FWSnI 0.12 0.06 0.06 *-
Table 8.7: Effectiveness of anchor-text snippet-based ranking functions. The snippet runs
performed poorly by comparison to the BM25FW runs for the 2003 TREC web track Topic
Distillation task. “Sig.” indicates whether improvements (“*+”) or losses (“*-”) were signifi-
cant compared to the BM25FW (C, A, T) baseline.
8.3.2.2 Linear combination
Table 8.8 shows the performance of the combinations of aggregate anchor-text and
full-text evidence for the Topic Distillation 2003 task. The following observations were
made from these results:
• Excluding aggregate anchor-text length from document length improves the
performance of the BM25FW method by around 25%. Likewise, removing ag-
gregate anchor-text length normalisation when combining content and anchor-
text BM25 scores results in significant performance gains, with an MRR increase
of around 30%.
• A further small effectiveness gain is achieved through a hybrid combination,
where the document title and full-text are scored using the field-weighting
method, and are then combined with aggregate anchor-text evidence in a lin-
ear combination.
• The “pure” linear combination performs poorly, most likely due to the use of
aggregate anchor-text length normalisation.
Ranking function Comb P@1 P@10 MRR Sig
BM25FW FW 0.10 0.09 0.10 -
BM25FWnoanchdln FW 0.18 0.09 0.13 *+
BM25gidf (C) + BM25gidf (A) LC 0.24 0.10 0.12 *+
BM25gidf (C) + BM25gidf ,contdln(A) LC 0.18 0.13 0.16 *+
BM25(C) + BM25contdln(A) LC 0.18 0.12 0.16 *+
BM25FW (C, T) + BM25contdln(A) HYLC 0.22 0.14 0.17 *+
Table 8.8: Effectiveness of the evaluated combination methods for TD2003. “TD2003” is
the 2003 TREC web track Topic Distillation task. C is document full-text, A is aggregate
anchor-text, and T is title. FW uses a Field-weighted Okapi BM25 combination, LC is a linear
combination, and HYLC is a fusion of Field-weighted Okapi BM25 and linear combination.
“Sig.” indicates whether improvements (“*+”) or losses (“*-”) were significant compared to
the BM25FW (C, A, T) baseline.
Table 8.9 contains the results for experiments on further TREC tasks. In all cases
the linear combination methods are outperformed by the field-weighting method.
This demonstrates potential differences between the tasks studied, and suggests that
no one method considered here will achieve high effectiveness on all search tasks.
8.4 Discussion
The results for the Okapi BM25 modifications show that effectiveness was improved
when length normalisation was not performed on aggregate anchor-text. Additional
gains were achieved when aggregate anchor-text was normalised using document
full-text length. The reason for this may be that full-text document length provides
useful query-independent document evidence.
The removal of aggregate anchor-text from composite document lengths in the
Okapi BM25FW formula improved or maintained retrieval effectiveness for all eval-
uated tasks. A re-tuning of the field-weighting weights without aggregate anchor-
text in composite document length is required to determine whether further improve-
ments can be attained. The removal of aggregate anchor-text length from composite
document length normalisation favours documents with long aggregate anchor-text,
as it is more likely that a link to a document containing the query term will be found.
This preference for long aggregate anchor-text is similar to biasing towards heavily
linked-to pages (except that a term match is assured). This may be a method by
which query-independent hyperlink recommendation evidence can be more easily
combined with query-dependent evidence.
Task Ranking function Comb. P@1 P@10 MRR
NP&HP2003 BM25FWnoanchdln FW 0.60 0.85 0.69
NP&HP2003 BM25(C) + BM25(A) LC 0.26 0.57 0.36
NP&HP2003 BM25gidf (C) + BM25gidf ,nodln(A) LC 0.47 0.71 0.56
NP&HP2003 BM25FW (C, T) + BM25contdln(A) HYLC 0.51 0.76 0.60
NP2002 BM25FWnoanchdln FW 0.56 0.87 0.67
NP2002 BM25(C) + BM25(A) LC 0.33 0.65 0.44
NP2002 BM25gidf (C) + BM25gidf ,nodln(A) LC 0.26 0.51 0.35
NP2002 BM25FW (C, T) + BM25contdln(A) HYLC 0.31 0.61 0.30
Table 8.9: Effectiveness of the evaluated combination methods for NP2002 and
NP&HP2003. “NP2002” is the 2002 TREC web track named page finding task; and
“HP&NP2003” is the 2003 TREC web track combined home page / named page finding task.
“C” is document full-text, “A” is aggregate anchor-text, and “T” is title. FW uses a Field-
weighted Okapi BM25 combination, LC is a linear combination, and HYLC is a fusion of
Field-weighted Okapi BM25 and linear combination.
Results for the hybrid combination strategy illustrate the benefits of treating
document-level and web-based evidence as separate document descriptions. The hy-
brid combination approach significantly outperformed other methods, equalling the
best run submitted to TREC 2003 (discussed in Chapter 9). Computing a gidf and
using document full-text to normalise all document fields was also an effective ap-
proach, improving retrieval effectiveness as well as allowing for potential gains in
efficiency by reducing the number of statistics per term. A “pure” linear combination
of document evidence was significantly less effective and more costly (as document
statistics were required for each form of evidence).
In general, the results in this chapter illustrate an interesting trade-off when deal-
ing with aggregate anchor-text. The trade-off is whether to favour documents which
contain the most occurrences of a particular term in anchor-text (by employing no
anchor-text aggregate length normalisation), or to favour documents whose aggregate
anchor-text contains the greatest percentage of anchor-text that matches the query
term (by employing full aggregate anchor-text length normalisation). The choice
is akin to trading off the quantity of anchor-text for the “purity” of the aggregate
anchor-text. If aggregate anchor-text is heavily length normalised, thereby encourag-
ing anchor-text purity, hyperlink recommendation evidence could be used to counter
the preference for short aggregate anchor-text by up-weighting pages with high link
popularity. How best to address these issues is left for future work.
Chapter 9
A first-cut document ranking
function using web evidence
The first-cut ranking function explored in this chapter combines document and web-
based evidence found effective in previous experiments within this thesis. A weighted
linear combination was used to combine this evidence. The weights for evidence and
combination parameters were tuned for three sets of navigational queries using a hill-
climbing algorithm. The tuned ranking function was evaluated through submissions
to the TREC 2003 web track, and on data spanning several small-to-medium sized
corporate web collections.
9.1 Method
The following sections outline:
• How the effectiveness of the ranking function was tested;
• The document-level and web evidence used in the ranking function;
• How document evidence was combined in the ranking function;
• The training data, and how the data were used to tune the ranking function; and
• The methods used to address the combined home page / named page finding
task.
9.1.1 Evaluating performance
The first-cut ranking function was used to generate runs for participations in both
the Topic Distillation (TD2003), and the combined home page / named page finding
(HP/NP2003) tasks of the 2003 TREC web track (described in Section 2.6.7.2).
The goal of the TD2003 task was to study how well systems could find entry points
to relevant sites given a broad query. The Topic Distillation task is nominally an in-
formational task (see Section 2.6.3). However, the focus in Topic Distillation is quite
different from previous informational tasks studied in TREC. Topic Distillation stud-
ies the retrieval of relevant resources, rather than relevant documents. The TD2003
submission studied in this chapter sought to determine whether the first-cut ranking
function trained for navigational search (especially home page finding queries) would
perform well for Topic Distillation. This training set was chosen in an effort to favour
the retrieval of relevant resources rather than documents.
The goal of the HP/NP2003 task was to study how well systems could retrieve
both home page documents and other documents specified by their name, without
prior knowledge of which queries were for named pages, and which were for home
pages. The HP/NP2003 submission studied in this chapter examined different meth-
ods for combining home page and named page based tunings into a single run. This
included an investigation of whether best performance was achieved by tuning for
both tasks at once, using a training set containing both types of queries, or through
“post hoc” fusion of home page and named page tuned document rankings.
A series of follow-up experiments used corpora gathered from several small cor-
porate webs to provide a preliminary study of how the ranking function performed
on diverse corporate-sized webs. In each case the effectiveness of the ranking function
studied was compared to that of the incumbent search system.
9.1.2 Document evidence
The ranking function included three important forms of document evidence: full-text,
title and URL length.
The query-dependent evidence (full-text and title) was scored using Okapi BM25
with tuned k1 and b parameters. The k1 and b parameters were tuned once per run
rather than individually per field. The application of term stemming was also eval-
uated (using the Porter stemmer [163], described in Section 2.3). Strict term coordi-
nation was applied for all query-dependent evidence, with documents containing the
most query terms ranked first. If combining Okapi BM25 scores computed for a mul-
tiple term query in a linear combination without term co-ordination, a document that
matches a single query term in multiple document fields can outperform a document
that contains all query terms in a single field. The use of strict term co-ordination
ensures that the first ranked document contains the maximum number of matched
query terms in a document field.
9.1.2.1 Full-text evidence
Okapi BM25 was used to score document full-text evidence (BM25(C)). Prior to
scoring full-text evidence all HTML tags and comments were removed. For efficiency
reasons global document length and global inverse document frequency (gidf ) values
were used (described in Section 8.1.2).
9.1.2.2 Title evidence
Title text was scored independently of other document evidence using BM25
(BM25(T)). For efficiency reasons the BM25 title formulation used global docu-
ment length and global inverse document frequency (gidf ) values (described in Sec-
tion 8.1.2).
9.1.2.3 URL length
URL lengths (URLlen) were capped at 127 for efficiency reasons. URLs longer than 127
characters were recorded as being 127 characters long.
9.1.3 Web evidence
Anchor-text and two forms of in-degree were included in the ranking function. Page-
Rank and (simple) in-degree were not considered because of the relatively poor per-
formance observed in previous experiments. Instead, two important sub-types of in-
degree were examined: off-site and on-site in-degree [55].
9.1.3.1 Anchor-text
The Anchor Formula 1 (AF1) proposed here is an alternative to the revised anchor-
text models presented in the previous chapter. In AF1, term frequency (tf ) values are
not saturated (as described in Section 8.1.1) and document length normalisation is re-
moved (as described in Section 8.1.3). When AF1 values are multiplied by 1.7 (using the
KWT parameter, see Section 9.1.4), the resulting curve is similar to the BM25 saturation
curve for an average-length document over the first three term occurrences (with default
Okapi parameters, see Figure 9.1).
The score for a document D, for query Q, over terms t, with aggregate anchor-
text A according to AF1 is:
AF1(D, Q) = Σ_{t∈Q} log(tf_{t,D} + 1) × gidf_t (9.1)
As term frequency scores in AF1 never saturate, term coordination must be enforced.
Without term coordination a single query term may dominate. For example, for the
query “Microsoft Research Cambridge” the term “Microsoft” may dominate, so that a
page which matches “Microsoft” strongly in its aggregate anchor-text but never matches
“Research” or “Cambridge” (such as the Microsoft home page) could be retrieved ahead
of the intended page.
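A minimal sketch of AF1, together with one reading of strict term coordination as a primary sort key, is given below. The gidf values are assumed to be precomputed global inverse document frequencies, and the per-document term set used for coordination is an assumed representation rather than a detail taken from the implementation.

import math
from collections import Counter

def af1(anchor_terms, query_terms, gidf):
    # AF1: unsaturated, length-unnormalised anchor-text score. Each query
    # term contributes log(tf + 1), weighted by its global inverse document
    # frequency.
    tf = Counter(anchor_terms)
    return sum(math.log(tf[t] + 1) * gidf.get(t, 0.0) for t in query_terms)

def rank_with_coordination(docs, query_terms, score_fn):
    # Strict term coordination: rank primarily by the number of distinct
    # query terms matched, and only then by score, so a strong match on a
    # single term cannot outrank a document matching all query terms.
    def key(doc):
        matched = sum(1 for t in set(query_terms) if t in doc['terms'])
        return (matched, score_fn(doc))
    return sorted(docs, key=key, reverse=True)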
9.1.3.2 In-degree
The log values of on-site (IDGon) and off-site (IDGoff) in-degrees were normalised
(according to the highest in-degree value for the collection) and quantised to 127 val-
ues (for efficiency reasons). This may have reduced ranking effectiveness, although
[Figure 9.1 plots the document score achieved against tf for AF1 and for BM25 with k1 = 0, 1, 2 and 10.]
Figure 9.1: Document scores achieved by AF1 and BM25 for values of tf. A document of average length is assumed, with the likelihood of encountering a term in the corpora one-in-one-thousand (using idf values of N = 100 000 and nt = 100).
experience with the retrieval system in practical use suggests that there are minimal
adverse effects associated with this normalisation.
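A sketch of this normalisation and quantisation step is shown below; the log(1 + x) form is an assumption made to handle documents with zero in-degree, and the same routine would be applied separately to on-site and off-site counts.

import math

def quantise_indegree(indegree, max_indegree, levels=127):
    # Log-normalise an in-degree count against the collection maximum and
    # quantise it to a fixed number of levels for compact storage.
    if indegree <= 0 or max_indegree <= 0:
        return 0
    normalised = math.log(1 + indegree) / math.log(1 + max_indegree)
    return int(round(normalised * levels))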
9.1.4 Combining document evidence
The ranking formulation includes four key components: a query dependent score and
three query-independent scores.
• Query-dependent evidence: this component is a linear combination of docu-
ment full-text, title, and AF1 anchor-text scores. The relative contribution of
AF1 is controlled through the KWT parameter. The relative contribution of
query-dependent evidence is controlled using the QD parameter. Full-text, ti-
tle and anchor-text are combined using a linear combination with gidf values, a
method previously demonstrated to be effective for home page and Topic Dis-
tillation tasks in Chapter 8. Term stemming was also evaluated (Stem).
• On-site in-degree: this component is the log normalised number of incoming
on-site links (quantised to 127 values). The contribution of this component is
controlled using the ON parameter.
• Off-site in-degree: this component is the log normalised number of incoming
off-site links (quantised to 127 values). The contribution of this component is
controlled using the OFF parameter.
• URL length: this component is the length, in characters, of the URL (for up
to 127 characters). The contribution of this component is controlled using the
URL parameter.
Accordingly the score for a document D is computed by:
S(D, Q) = QD × (BM25gidf(C, Q) + BM25gidf(T, Q) + KWT × AF1(A, Q))
               / max(BM25gidf(C, Q) + BM25gidf(T, Q) + KWT × AF1(A, Q))
         + ON × IDGon_D / max(IDGon)
         + OFF × IDGoff_D / max(IDGoff)
         + URL × (max(URLlen) − URLlen_D) / max(URLlen)
Documents must also fulfill the constraints imposed through term coordination.
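The formula above can be sketched as follows. The quantised in-degree and URL-length values are assumed to be precomputed per document, the collection maxima are assumed to be computed at indexing time, and the maximum of the query-dependent component is taken over the candidate documents for the query; these are interpretations of the formula rather than details stated in the text. The default weights are the HPF tuning reported later in Table 9.1.

def combined_score(qd_score, max_qd_score, idg_on, max_idg_on,
                   idg_off, max_idg_off, url_len, max_url_len=127,
                   QD=15, ON=1, OFF=6, URL=38):
    # qd_score is BM25gidf(content) + BM25gidf(title) + KWT * AF1(anchor).
    # Each component is normalised to [0, 1] and scaled by its tuned weight.
    score = QD * (qd_score / max_qd_score) if max_qd_score else 0.0
    score += ON * (idg_on / max_idg_on) if max_idg_on else 0.0
    score += OFF * (idg_off / max_idg_off) if max_idg_off else 0.0
    score += URL * (max_url_len - url_len) / max_url_len
    return score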
9.1.5 Test sets and tuning
Eight parameters (k1, b , KWT, QD, ON , OFF, URL and Stem) were tuned for each
test set. The values explored for each parameter are as follows:
• k1 in steps of 0.25 between 0 and 4;
• b in steps of 0.25 between 0 and 1;
• KWT in steps of 1.7 between 0 and 17;
• QD, ON , OFF in steps of 2 between 0 and 20;
• URL in steps of 4 between 0 and 40; and
• Stem on or off.
The parameters were tuned using three test sets:
• Home page set (HPF): this training set was based on the http://first.gov
government home page list. Queries and results were extracted from this docu-
ment using the automatic site map method (see Section 2.6.5.3). The set consists
of 241 queries whose results were home pages. The full query and result set is
included as Appendix G.
• Named page set (NPF): this training set consists of the queries and relevance
judgements (qrels) used in the TREC 2002 named page finding task (described in
Section 2.6.7.2). The set consists of 150 queries whose results are named pages.1
• Both sets of queries (BOTH): this consists of all queries and relevance judge-
ments used in HPF and NPF.
There are inherent limitations in the training sets employed. The set of home pages
was taken from a .GOV portal, which may inadvertently have favoured prestigious, or
larger and more popular home pages. Further, the named page tuning includes some
home pages that were included in the 2002 NP task. This may have biased training
towards home page queries. The BOTH set of queries included a disproportionate
number of home page queries due to the presence of home pages in the NPF set, and
because the HPF set was larger than the NPF set.
1 The results for some of the named page queries were home pages.
9.1.6 Addressing the combined HP/NP task
Three approaches for applying the ranking function to the combined HP/NP task
were evaluated.
The first method was a tuning of parameters for both tasks simultaneously (i.e.
using the BOTH tuning to generate a run).
This second method summed document scores achieved for each tuning. This is
equivalent, in rank fusion and distributed IR terminology, to performing a combSUM
of document HPF and NPF scores.
The third and final method interleaved the ranked results for each run by taking
a document from the top of each ranking in turn, and removing any already seen
(duplicate) documents. For example, the first result in an HP/NP interleaving is the
first ranked document for the HPF tuning, and the second result is the first ranked
document for the NPF tuning.2 In an attempt to improve early precision, the inter-
leaving order was swapped if a keyword indicative of a named page finding query
was observed.3
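The second and third methods can be sketched as follows, assuming each ranking is a list of (document id, score) pairs sorted by decreasing score. Swapping the argument order of the interleaving reproduces the reversed (named page first) ordering used when a query appears to be a named page query.

from itertools import zip_longest

def comb_sum(ranking_a, ranking_b):
    # combSUM: sum the scores a document receives under each tuning and
    # re-sort by the combined score.
    totals = {}
    for doc_id, score in list(ranking_a) + list(ranking_b):
        totals[doc_id] = totals.get(doc_id, 0.0) + score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def interleave(ranking_a, ranking_b):
    # Take documents alternately from the top of each ranking, skipping any
    # document that has already been emitted.
    seen, merged = set(), []
    for a, b in zip_longest(ranking_a, ranking_b):
        for item in (a, b):
            if item is None:
                continue
            doc_id, _ = item
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged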
9.2 Tuning
Parameters were tuned using a hill climbing algorithm with a complete exploration
of two parameters at a time (at each step the parameters which achieved the highest
retrieval effectiveness were stored and used for other tunings). The tuning stopped
when a full tuning cycle completed without change in tuned values. Plots of the tun-
ing process are provided in Figures 9.2 and 9.3. Figure 9.2 provides an example of the
concurrent tuning of two function parameters (in this case the Okapi BM25 k1 and b
values). Figure 9.3 shows plots for the rest of the tuning cycle.
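A rough sketch of the tuning loop is given below, assuming an evaluate(params) function that returns MRR@10 on the training set for a given parameter setting, a grid of candidate values per parameter, and a fixed list of parameter pairs to explore together; these interfaces are assumptions used only to illustrate the procedure.

def tune(params, grids, pairs, evaluate):
    # Hill climb over parameter pairs: for each pair, exhaustively evaluate
    # the grid of joint values (holding all other parameters fixed), keep
    # the best setting, and repeat until a full cycle produces no change.
    best_score = evaluate(params)
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            for va in grids[a]:
                for vb in grids[b]:
                    trial = dict(params, **{a: va, b: vb})
                    score = evaluate(trial)
                    if score > best_score:
                        best_score, params, changed = score, trial, True
    return params, best_score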
The tuned values and effectiveness of the ranking function on the three training
sets (HPF, NPF and BOTH) are reported in Table 9.1.
The optimal tunings derived for each task differed significantly. The only consis-
tent result was that the query-dependent component was important in all tunings.
The following observations can be made from the Home Page Finding (HPF) pa-
rameter tunings:
• The tuned Okapi BM25 term saturation parameter (k1 = 3.6) is higher than the
default parameter of k1 = 2. This indicates that home pages may contain page
naming text several times and that matching their name more than once is a
good indicator of a home page match.
• The tuned Okapi BM25 length normalisation parameter (b = 1) is higher than
the default parameter of b = 0.75. The tuning favoured a strict length normal-
2 So long as that document was not the same document retrieved at the first rank by the HPF tuning, in which case the next document in the NPF ranking is taken.
3 Query terms were selected from last year’s query set and included terms such as “page”, “form” and “2000”.
[Figure 9.2 shows a surface plot of mean reciprocal rank at 10 over the Okapi BM25 k1 (0 to 4) and b (0 to 1) grid, with the remaining parameters held fixed; the best MRR@10 in this sweep was approximately 0.737.]
Figure 9.2: A plot illustrating the concurrent exploration of Okapi BM25 k1 and b values using the hill-climbing function. The values at which the best performance is achieved are stored (the highest point in the plot, represented by a “+”) and used when tuning other values. The tuning stops when a full iteration of the tuning cycle completes without change in tuned values.
[Figure 9.3 shows three surface plots of mean reciprocal rank at 10 from the remaining hill-climbing steps: a sweep over the anchor-text and content weights (best MRR@10 approximately 0.745), a sweep over the on-site and off-site in-degree weights (best approximately 0.748), and a sweep over the URL weight and the final parameter (best approximately 0.751), with all other parameters held at their current best values.]
Figure 9.3: A full iteration of the hill-climbing function. The first step in this iteration is illustrated in Figure 9.2. The tuning of parameters was performed using a hill climbing algorithm with complete exploration of two parameters at a time. The highest point (best performance) is represented by a “+”, and the parameter values at that point are stored and used when tuning other values. The tuning stops when a full iteration completes without change in tuned values.
Test Set MRR k1 b KWT QD, ON , OFF, URL Stem
HPF 0.846 3.6 1 11.9 15,1,6,38 Y
NPF 0.522 4 0.2 1.7 18,2,0,1 N
BOTH 0.715 0.8 0.4 8.5 20,0,6,20 Y
Table 9.1: Tuned parameters and retrieval effectiveness. Parameters are as described in Sec-
tion 9.1.5. “MRR” is the Mean Reciprocal Rank, as described in Section 2.6.6.2. “HPF” is the
home page training set. “NPF” is the named page training set. “BOTH” contains both HPF
and NPF.
isation of document full-text.4 This suggests that longer full-text content is no
more likely to be a relevant home page.
• Anchor-text in the form of AF1 was important, with the KWT parameter per-
forming best at 11.9.
• The contribution of off-site and on-site links was small, with off-site links more
useful than on-site links.
• URL length once again proved to be an important contributor in home page
finding.
• Stemming improved retrieval effectiveness.
From the Named Page Finding (NPF) parameter tunings:
• Like for the HPF tunings, a higher than normal Okapi BM25 k1 value was effec-
tive.
• Unlike in the HPF tunings, length normalisation did not improve effectiveness,
with a low b value found to perform best.
• The contribution of URL, on-site in-degree, and off-site in-degree was small.
• Anchor-text was useful for the NPF task, although its contribution was far less
than in the HPF task.
• Stemming adversely affected retrieval effectiveness.
In general the BOTH tuning was similar to the HPF tuning (indicating that home
pages dominated in the tuning). The only large differences between BOTH and HPF
were in the form of a much smaller tuned term saturation value (k1 = 0.8), and less
length normalisation (b = 0.4).
4 Note that length normalisation is not present in the AF1 measure.
9.2.1 Combining HP and NP runs for the combined task
Results for the HP/NP combination methods tested on the combined test set (BOTH)
are presented in Table 9.2.
Combination Method MRR on training set
Tuned for BOTH 0.758
HPF and NPF combSUM 0.489
HPF and NPF interleaved (HP,NP) 0.734
Table 9.2: Results for combined HP/NP runs on the BOTH training set.
The interleaving of runs performed similarly to tuning using BOTH types of query.
This may be an effective method for combining the runs without prior tuning infor-
mation. The performance of the combSUM score combination was relatively poor.
9.3 Results
This section investigates results from the empirical studies of the first-cut ranking
function. The ranking function was evaluated, using the parameter tunings described
above, for the two TREC 2003 web track tasks, and for navigational search on several
corporate webs.
9.3.1 TREC 2003
This section sets out the results from the official TREC 2003 web track submissions.
The Topic Distillation runs (csiro03td–) are presented first, followed by the combined
HP/NP finding task runs (csiro03ki–).
9.3.1.1 Topic Distillation 2003 (TD2003) results
Results for the TD2003 web track task are presented in Table 9.3. The best of these
runs (csiro03td03) achieved the highest performance of any system submission. This
run used the HPF tuning and incorporated stemming. Further observations based on
the Topic Distillation results are:
• The tuned k1 and b values offered some improvement (csiro03td01 versus
csiro03td05). The effectiveness of the length normalisation parameter used (b =
1) suggests that longer pages are no more likely to be relevant in Topic Distilla-
tion.
• The new anchor-text ranking function AF1 was particularly effective, achieving
gains of up to 60% (csiro03td03 versus not sub 05).
Description Average R-Prec Run Id
HPF (Stem = ON , ON = 0, OFF = 0) 0.170 not sub 01
HPF (Stem = ON ) 0.164 csiro03td03
HPF (Stem = ON , ON = 0, OFF = 0, URL = 0) 0.149 not sub 02
HPF 0.144 csiro03td01
HPF (ON = 0, OFF = 0) 0.143 not sub 03
HPF (k1 = 2, b = 0.75) 0.127 csiro03td05
NPF 0.117 not sub 04
HPF (ON = 0, OFF = 0, URL = 0) 0.116 csiro03td02
HPF (Stem = ON , KWT = 0) 0.108 not sub 05
HPF (KWT = 0) 0.099 csiro03td04
HPF (Stem = ON , No Red./Dup.) 0.147 not sub 06
HPF (No Red./Dup.) 0.138 not sub 07
HPF (KWT = 0, No Red./Dup.) 0.116 not sub 08
HPF (Stem = ON , KWT = 0, No Red./Dup.) 0.106 not sub 09
Table 9.3: Topic Distillation submission summary. “HPF” indicates that the home page
finding tunings were used (tunings in Table 9.1). “NPF” indicates that the named page finding
tunings were used (tunings also in Table 9.1). Other description notes indicate variations from
the tuned parameters. “Run Id” reports the run identifier used in TREC experiments. “No
Red./Dup.” indicates that redirect and duplicate URL information was not used. Further
runs were computed post hoc (not sub –).
• Hyperlink recommendation evidence was not effective. A post hoc run achieved
slightly better performance (4%) when hyperlink recommendation evidence was
removed (not sub 01).
• URL length evidence appeared to slightly improve retrieval effectiveness
(not sub 01 versus not sub 02).
• The NPF tuning performed worse than the HPF tuning (not sub 04), with an
associated drop in average R-precision of around 20%.
• A linear combination of query-dependent scores from document-level and web-
based evidence, where both scores were computed using gidf values, was effec-
tive.
• The redirect and duplicate information (collected using methods outlined in
Chapter 3) was important when scoring anchor-text using AF1. Without redi-
rect and duplicate information, retrieval effectiveness was reduced by 15%
(csiro03td03 versus not sub 06).
The results from the Topic Distillation task support the notion that the home page
training set favoured prominent resources (an advantage for Topic Distillation). The
results also illustrate the benefits of the new anchor-text ranking component AF1, es-
pecially when used with stemming, and with redirect and duplicate URL information.
9.3.1.2 Combined HP/NP 2003 (HP/NP2003) results
The official run results for the HP/NP2003 task are presented in Table 9.4. The best of
these runs achieved the second highest performance of any submitted system
(csiro03ki04). The results show that tuning specifically for the home page finding task
significantly harmed named page retrieval effectiveness (csiro03ki02 versus
csiro03ki03). The highest MRR was achieved using the NPF-only tuning, whilst the
best S@10 used interleaved lists from HPF and NPF tunings. The results show that an
overemphasis on home page finding harmed the named page searches.
The run with the highest S@10 (csiro03ki04) interleaved the csiro03ki02 and
csiro03ki03 runs (i.e. top HP, top NP, second HP, second NP etc.). From subsequent
evaluations (not sub 01) it was apparent that leading with the top NP result rather
than the top HP result would have further improved precision (achieving an MRR
of 0.717). Tuning for both named page and home page training queries concurrently
(csiro03ki01) performed well for home page finding, but poorly for named page find-
ing. This confirms that the BOTH training set was biased towards home page finding
due to the larger sample of home page queries considered, and the presence of home
page queries in the named page training set (see Section 9.1.5).
In summary, interleaving HP then NP without query classification achieves an
MRR of 0.646. Interleaving HP then NP and reversing the interleaving if the query
appears to be a named page query achieves an MRR of 0.667. Finally, interleaving NP
then HP without query classification achieves 0.717.
Description MRR S@10 (%) MRR (HP) MRR (NP) Run Id
HPF and NPF interleaved (NPF,HPF) 0.717 87.0 0.781 0.651 not sub 01
NPF 0.702 84.0 0.755 0.649 csiro03ki03
HPF and NPF combSUM 0.699 81.0 0.812 0.586 csiro03ki05
BOTH 0.692 83.7 0.815 0.569 csiro03ki01
HPF and NPF interleaved (HPF,NPF) 0.667 86.3 0.801 0.532 csiro03ki04
HPF 0.603 77.7 0.774 0.432 csiro03ki02
Table 9.4: Combined home page/named page finding task submission summary. To aid
understanding of retrieval performance, MRR was also computed for home pages only
(“MRR (HP)”) and for named pages only (“MRR (NP)”). “HPF”, “NPF”, and “BOTH” indicate
the tunings used (home page finding, named page finding and both sets respectively, parameters
reported in Table 9.1). Other description notes indicate variations from the tuned parameters.
“Run Id” reports the run identifier used in TREC experiments. Post hoc, a further run
was computed using NPF tunings.
9.3.2 Evaluating the ranking function on further corporate web collections
The ranking function was evaluated for eight further collections built from the pub-
licly available corporate webs of eight large Australian organisations: five public com-
panies, two government departments and an educational institution.
The query and result sets were generated using the automated site map method
described in Section 2.6.5.3. In each case the new ranking function was compared to
the performance of the incumbent search system. The anchor-text component was
calculated using a BM25 anchor-text formulation that used full-text document length
for normalisation (BM25contdln).
Table 9.5 presents the results from this experiment. The first-cut ranking function
performed significantly better than seven out of eight evaluated search systems, and
comparably to the other search system (University). The use of query-independent
evidence (off-site links, on-site links and URL length) did not significantly improve
retrieval effectiveness on any collection.
9.4 Discussion
The first-cut ranking function performed well over a variety of tasks and corpora. The
runs submitted to the 2003 TREC web track achieved the highest Topic Distillation
score [60], and the second highest combined HP/NP score [60]. The ranking function
also outperformed the incumbent search engines of seven-of-the-eight corporate webs
studied (and performed comparably to the other).
The tuning of ranking function parameters using the NPF training set achieved
better retrieval effectiveness than tuning using the HPF set in the HP/NP2003 task.
This indicates that the HPF-tuned ranking function may have been over-trained to-
wards prominent home pages (as would be listed on first.gov).
Arguably, the most important component of the ranking function was the anchor-
text evidence in the form of AF1. This finding re-iterates the importance of anchor-
text evidence in web document retrieval. The AF1 ranking function provided an ef-
fective alternative to scoring anchor-text using full-text ranking methods. However,
the methods used to score aggregate anchor-text evidence merit further investigation.
In particular, more work is required to determine whether the use of global inverse
document frequency values (gidf ) is preferable to the use of field-based anchor-text
(fidf) values.
The results show little performance gain through the use of query-independent
evidence, for both the web track tasks and small corporate web collections. URL
length evidence produced small gains for the home page finding and Topic Distilla-
tion tasks. By contrast hyperlink recommendation evidence never improved retrieval
effectiveness. The poor performance of query-independent evidence could indicate
that the method used to combine it with query-dependent evidence was ineffective.
More effective combination strategies might incorporate query-independent evidence
as some prior probability of document relevance [155], re-rank baselines (as in Sec-
tion 6.1.4), or may use the query-independent score to normalise or transform term
Institution Search Engine Queries S@1 S@5 S@10 Docs
Telecomm. Unknown 266 75 113 126 72 337
New (no QIE) 266 166 208 212
New (w/QIE) 266 166 208 219
Large Bank 1 Lotus Notes 228 15 46 63 6690
New (no QIE) 151 206 209
New (w/QIE) 150 206 210
Large Bank 2 Unknown 64 17 26 28 1805
New (no QIE) 41 59 60
New (w/QIE) 42 59 60
Large Bank 3 Unknown 143 4 21 39 5113
New (no QIE) 116 132 135
New (w/QIE) 100 132 134
Large Bank 4 Unknown 295 96 165 170 7827
New (no QIE) 170 232 243
New (w/QIE) 160 228 241
University Ultraseek 360 179 235 253 50 203
New (no QIE) 218 293 315
New (w/QIE) 204 304 324
Gov Dept 1 ht:// dig 160 38 98 119 8414
New (no QIE) 128 140 146
New (w/QIE) 128 147 148
Gov Dept 2 Verity & MS 154 1 8 12 42 981
New (no QIE) 79 108 111
New (w/QIE) 86 110 111
Table 9.5: Ranking function retrieval effectiveness on the public corporate webs of several
large Australian organisations. “New” is the first-cut ranking function described within this
chapter. A “no QIE” indicates that the run was performed with query-independent evidence
removed (ON = 0, OFF = 0, URL = 0). The BM25 parameters were set to k1 = 2, b = 0.75.
When used, query-independent evidence parameters were specified as QD = 17, ON = 2,
OFF = 6 and URL = 19. The evaluation was performed between February and March 2003.
contribution. For example, the in-degree of a document might be a more useful satu-
ration value than length when scoring aggregate anchor-text. The exploration of new
approaches to term normalisation and transformation may be particularly interesting
in the context of further anchor-text evidence scoring functions.
Hyperlink recommendation evidence, evaluated in the form of off-site (IDGoff)
and on-site (IDGon) in-degree, was once again found to be a relatively poor form of
document evidence. It is possible that this negative result may be attributed to the
relatively small size of the collection (in comparison to the web), and accordingly a
limited amount of cross site linking in the collection. That said, the demonstration of a
search situation in which the use of hyperlink recommendation evidence significantly
improves retrieval effectiveness remains an elusive goal.
URL length evidence, while found to be important in the training set and in pre-
vious home page finding experiments, was found to be relatively ineffective for the
tasks examined here. Incorporating URL length moderately improved effectiveness
for Topic Distillation, but reduced effectiveness on the combined NP/HP finding
tasks. These results indicate that while URL length is an important component for
effective home page search, its contribution to other tasks may be limited.
Chapter 10
Discussion
The findings presented in this thesis raise a number of issues. This chapter discusses:
• the extent to which experimental findings are likely to hold for enterprise and
intranet web search systems and WWW search engines;
• the search tasks that are the most appropriate to model when evaluating web
search performance;
• how web evidence could be used to build a more efficient ranking algorithm
while maintaining retrieval effectiveness; and
• whether the set of document features used by the ranking function could be
tuned on a per corpus basis.
10.1 Web search system applicability
This thesis has evaluated the effectiveness of web evidence over a large selection of
corporate webs and corporate-sized webs, with corpora ranging from 5000 to 18.5 mil-
lion pages. This range of sizes covers almost all enterprise corpora. The web evidence
inclusive ranking function achieved consistent gains over eight diverse enterprise cor-
pora (in Section 9.3.2), indicating that findings are likely to hold for many small-to-
medium sized web corpora. However, it should be noted that the improvements af-
forded by web evidence are dependent on the quality of hyperlink information in
the corpus, and are subject to the publishing procedures employed by organisations.
These procedures can reduce the effectiveness of web evidence (as studied in Chap-
ter 4). For example, the effectiveness of web evidence is likely to be decreased if the
corpus contains URLs that are unlikely to be linked-to, or the corpus contains a lot of
duplicate content.
Findings from experiments in this thesis may be less applicable to WWW search
engines than to enterprise web search engines. WWW search engines are subject to
substantial efficiency constraints, due to the scale of the document corpus and query
processing demands. The indexes of current state-of-the-art WWW search engines
contain two orders of magnitude more documents than the largest corpus considered
in this thesis. These systems also process thousands of concurrent queries with sub-
second response time. These efficiency requirements are likely to limit the document
features examined and scored during query processing. One benefit of a larger corpus
size is that there is likely to be more link evidence, and so differentiation between links
(e.g. on-site, off-site or nepotistic) might lead to larger gains. However, the hyperlink
recommendation scores calculated throughout experiments in this thesis were found
to be correlated with the scores for corresponding documents extracted from WWW
search engines (see Section 7.6.3). Further, a recent experiment reported that the use
of anchor-text evidence external to a web corpus (but linking to documents inside
the corpus) did not improve retrieval effectiveness [114]. Consequently, it is possible
that further link evidence may not be useful. The correlations between hyperlink
recommendation scores, and the small observed benefit achieved by using external
link evidence, indicate that hyperlink evidence used in WWW search systems is likely
to be comparable to that studied here.
WWW search engines also operate in an adversarial information retrieval envi-
ronment, where web authors may seek to bias ranking functions in their favour by
creating spam content [122]. Given the relative ease and low cost of link construction
on the WWW, one might expect hyperlink recommendation scores to be susceptible to
link spamming. Some spam-like properties were observed in thesis experiments, but
these appeared unsystematic and were deemed to have been created unintentionally.
While some experiments in this thesis cast doubts on the use of hyperlink recommen-
dation methods for spam reduction, these results are not conclusive.
Therefore, results presented in this thesis are likely to apply to ranking in enter-
prise web search, subject to publishing practices, but are less directly applicable to
ranking in WWW search systems.
10.2 Which tasks should be modelled and evaluated in web
search experiments?
It is important that the tasks evaluated and modelled for a web search system be repre-
sentative of the tasks that will be performed by the users of the system. Without access
to studies relating to the user populations, intended system usage and/or large scale
query-logs, it is difficult to determine which tasks are most frequently performed.
Document ranking functions need to be evaluated over more than one type of
search task. It is apparent, both in results from experiments presented in this thesis
and in previous TREC web track evaluations, that performance gains in a single re-
trieval task often do not carry benefits to other tasks. For example, URL length based
measures are particularly useful when seeking home pages (Chapter 7), but appear
to reduce retrieval effectiveness on other tasks (Chapter 9). Therefore a mixed query
set should be used when evaluating a general purpose ranking function. In the 2004
TREC web track, one of the tasks examined was a mixed task that included an equal
mix of named page, home page and Topic Distillation queries [54]. Alternatively a
mixed query set might be balanced in anticipation of the types of queries a system
might receive.1 The query set might also include queries for which the answers are
important resources (either popular, or key corpus documents) for each type of search
task.
The evaluation concerns for WWW search engines are likely to be quite different
from those of corporate webs. WWW search engines need to provide results for a
diverse document corpus and user group. By comparison, search engines on corpo-
rate webs are likely to have a smaller target user audience, and a more homogeneous
document corpus. A prime concern for WWW search engines may be known item
searches, where the pages are important and well known to the user. If the search
system fails for these types of queries, the user is likely to lose some degree of trust in
the system. Therefore, a useful basic effectiveness test may be to observe how well the
search engine can find pages listed in WWW directories, using listing descriptions as
queries (similar to the automated site map method).
In an enterprise search context, the automatic site map method appears to be an
effective way of evaluating retrieval effectiveness for navigational search tasks (when
a site map is available). Site maps often contain organisation specific terminology and
include links to documents that are frequently accessed. For enterprise web search
engines, pages that are contained in site maps may be representative of potential nav-
igational queries, and could thus be an excellent source of queries.
A WWW search engine is likely to be required to process much broader queries
than enterprise web search systems, and so should be evaluated for varied tasks.
Known item search is likely to be particularly important, as a user may be disap-
pointed in a WWW search system if they cannot use it to find a page they know exists,
especially well known entities. For known item search, online WWW directories may
be a good source of query/answer sets of well known and/or useful WWW pages.
10.3 Building a more efficient ranking system
The web evidence and combination methods considered within this thesis may be
used to improve query-processing performance and reduce the size of document in-
dexes.
The high level of effectiveness achieved by anchor-text over all the search tasks
considered in this thesis indicates that a high level of retrieval effectiveness could
be achieved over many search tasks using an anchor-text only index. Such an index
would be far smaller than a full-text index. For the .GOV corpus, aggregate docu-
ments have an average length of 25 terms, as opposed to the 870 terms for document
full-text evidence. Further, there is far more repetition in anchor-text evidence, mean-
ing indexes containing aggregate anchor-text might be expected to achieve higher
compression than indexes of document full-text.
An alternative method for improving query processing efficiency is to exclude
documents that do not meet a minimum query-independent score prior to (or during)
1 For example, if home page finding is an important task, ensure there are many home page finding queries in the test set.
indexing. Results from experiments in this thesis indicate that restricting document
inclusion by imposing a minimum URL-type value can reduce the number of doc-
uments indexed by an order of magnitude, without significantly affecting retrieval
effectiveness for home page finding tasks (see Section 7.2.1).
The use of an anchor-text only index or minimum document threshold may result
in a decrease in retrieval effectiveness for some tasks (such as ad-hoc informational,
or named page finding tasks), as some crawled documents are not indexed and so
would never be retrieved. An extension to this model would be to use two indexes;
one primary index, consisting of aggregate anchor-text only or documents that exceed
the minimum threshold value, and a second index containing the full document cor-
pus. During query processing, if some criteria are not met by documents retrieved
from the primary, faster index (e.g. less than ten matching documents are found, no
documents match all terms, or some minimum score is not achieved), the secondary
index could be consulted. Further work is required to investigate whether such multi-
level indexes would provide large efficiency gains while maintaining (or improving)
retrieval effectiveness, and to explore distributed techniques for dealing with several
indexes.
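One way such a two-level arrangement could behave at query time is sketched below. The fallback criteria (a minimum number of results and at least one document matching all query terms) are those suggested above, while the index interface and result attributes are assumptions.

def two_tier_search(query_terms, primary_index, secondary_index, min_results=10):
    # Query the small, fast primary index (for example anchor-text only, or
    # documents above a query-independent threshold) first, and fall back to
    # the full secondary index only when the primary results look inadequate.
    results = primary_index.search(query_terms)
    enough = len(results) >= min_results
    full_match = any(r.matched_terms == len(query_terms) for r in results)
    if enough and full_match:
        return results
    return secondary_index.search(query_terms)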
The size of a combined document index can be reduced through the use of a single
set of document and corpus statistics when scoring query-dependent features. This
requires only one set of statistics to be stored per document/term combination, rather
than a set for each query-dependent feature. In fact, the use of full-text length when
normalising term contribution in aggregate anchor-text improved retrieval effective-
ness (see Section 8.3.1). Further work is required to determine whether inverse docu-
ment frequency should be scored per document field.
10.4 Tuning on a per corpus basis
The results from experiments in this thesis indicate that document ranking effective-
ness not only depends on the search task evaluated, but also on the document corpus.
For example, if the ranking function is to be used on a corporate web in which all
documents are published through a Content Management System (CMS) that uses
long parameterised URLs, URL length-based measures are not likely to be effective.
This effect was observed for one of the corpora studied in Section 9.3.2: Large Bank 1.
This bank publishes all its content using the Lotus Domino system, which (at least con-
figured as it was in this case) serves content using long URLs. Similarly, hyperlink
evidence is not likely to be effective for a corpus which has few hyperlinks.
An attractive avenue for future work may be the tuning of document feature con-
tribution according to the expected utility of that evidence. For example, if a web
site’s hyperlink graph is sufficiently small, hyperlink evidence could be disabled. This
could be generalised further through the creation of profiles for common CMS con-
figurations that indicate what forms of document evidence are likely to be useful.
Alternatively, ranking parameters could be tuned using an automated approach us-
ing judgements such as those collected from a web site map. This remains for future
work.
If corpus-based tuning is not employed it is important that the web authors are
aware of evidence commonly used to match and rank documents. This is especially
the case in an enterprise web context.
Chapter 11
Summary and conclusions
The experiments in this thesis demonstrate how web evidence can be used to improve
retrieval effectiveness for navigational search tasks.
The first set of experiments, presented in Chapter 4, studied the relationship be-
tween site searchability and the likelihood of a site’s documents being retrieved by
prominent WWW search engines. This study provided one of the first empirical in-
vestigations of transactional search. The performance of WWW search engines was
shown to differ markedly, with two-out-of-four search engines never retrieving books
within the top ten results, and one search engine favouring a particular bookstore
(perhaps indicating a partnership). A large variation in bookstore searchability was
also observed.
An investigation of potential biases in hyperlink evidence was then presented in
Chapter 5, using data collected from WWW search engines. Biases were observed
in hyperlink recommendation evidence towards the home pages of popular and/or
technology-oriented companies. These results indicate that the use of hyperlink evi-
dence may not only improve home page finding effectiveness (important in naviga-
tional search), but also bias search results towards this user demographic (i.e. users
who are interested in popular, technology-oriented information). The two types of
hyperlink recommendation evidence (Google PageRank and AllTheWeb in-degree)
were virtually indistinguishable, providing similar recommendations towards popu-
lar companies. Both measures were also correlated for a set of company home pages,
and a set of known spam pages. The similarity between the two measures raised ques-
tions as to the usefulness of PageRank over in-degree. Both measures gave preference
to home page documents, supporting the investigation of hyperlink recommendation
evidence for home page finding tasks in later chapters.
Methods for combining hyperlink recommendation evidence (and other query-
independent measures) with query-dependent evidence were investigated in Chap-
ter 6. Results from this experiment demonstrated how assigning a large weight to
hyperlink recommendation evidence in a ranking function may trade document rele-
vance for link popularity. It was submitted that hyperlink recommendation evidence
should be included either as a small component in the ranking function, or in the form
of a minimum threshold value enforced prior to document ranking.
Chapter 7 presented a detailed evaluation of home page finding on five small-to-
medium web test collections using three query-dependent baselines and four forms
of query-independent evidence (in-degree, Democratic PageRank, Aristocratic Page-
Rank, and URL length). The results from these experiments demonstrated the impor-
tance of both anchor-text and URL length measures in home page finding tasks. The
most consistent improvements in retrieval effectiveness were achieved using a base-
line containing document full-text and anchor-text, with a score-based re-ranking by
URL-type. Improvements were observed in both efficiency and effectiveness when
using minimum query-independent value thresholds for page inclusion, with the
gains for URL length thresholds being particularly large. Little benefit was observed
through the use of hyperlink recommendation methods. Small gains were achieved
when hyperlink recommendation scores were used as minimum thresholds for page
inclusion. However, a score-based re-ranking of query-dependent baselines by hyper-
link recommendation evidence performed poorly.
Both PageRank and in-degree performed similarly and were found to be highly
correlated. This correlation, and the almost identical performance of both PageRank
and in-degree in the home page finding tasks, indicated no reason to choose Demo-
cratic PageRank over in-degree for home page finding on corpora of under 18.5 mil-
lion pages. When considered with the correlations previously observed in WWW-
based hyperlink recommendation scores, these results also cast doubt as to whether
PageRank and in-degree values would show more divergence on the complete WWW
graph. The PageRank values computed for these experiments were also found to be
correlated with Google WWW PageRanks for pages present in the Open Directory.
A series of follow-up experiments (using the same data) found that the use of
URL length, when measured in characters, is as effective as using URL-types. A fur-
ther finding was that using hyperlink recommendation evidence calculated for a web
graph that included link evidence external to the corpus did not improve retrieval
effectiveness. By contrast, the use of external anchor-text information significantly
improved retrieval effectiveness.
Chapter 8 presented an analysis of the application of Okapi BM25 based measures
in scoring anchor-text evidence. This analysis led to several proposed modifications
to Okapi BM25 that, it was hypothesised, might improve the scoring of anchor-text
evidence. Proposed modifications included an increase of the saturation point for
document term frequencies, the calculation of separate anchor-text-only inverse doc-
ument frequency values, and the use of document full-text length to normalise aggre-
gate anchor-text. An empirical investigation was carried out to determine whether
the proposed changes to anchor-text scoring improved retrieval effectiveness. This
showed that the revised scoring functions achieved significant improvements in re-
trieval effectiveness, for both Topic Distillation and navigational tasks.
Experiments within Chapter 8 also analysed and evaluated strategies for com-
bining query-dependent baselines. Results for these combinations demonstrated the
importance of treating document-level and web-based evidence as separate entities.
Additionally the results showed that computing a single set of (global) document
and corpus statistics for all query-dependent fields improved system efficiency and
provided small gains in retrieval effectiveness. Surprisingly, the effectiveness of the
anchor-text baseline improved when full-text length was used to normalise aggregate
anchor-text document length.
Chapter 9 presented a first-cut document ranking function that included web ev-
idence found useful in earlier experiments within this thesis (anchor-text and URL-
length measures in particular). The ranking function was evaluated through ten runs
submitted to the two TREC web track tasks in 2003. The best of the runs submit-
ted for the Topic Distillation task achieved the highest performance of any system
submission. The best of the runs submitted for the combined home page / named
page finding task achieved the second highest performance of any system submis-
sion. To further validate the ranking function, a series of follow-up experiments was
performed using corporate web collections. Results from these experiments showed
that the ranking function outperformed seven out of eight incumbent search systems
(while performing comparably to the other).
11.1 Findings
Experimental findings suggest that the most important form of web evidence is anchor-
text. Using anchor-text evidence to rank documents, rather than document full-text,
provides significant effectiveness gains in home page finding and Topic Distillation
tasks. The methods commonly used for length normalising anchor-text aggregate
documents were found to be deficient. Removing aggregate anchor-text length nor-
malisation altogether, or normalising according to full-text document length were
both found to improve retrieval effectiveness. The removal of length normalisation
from the anchor-text scoring function favours large volumes of incoming anchor-text,
and according to prestige and recommendation assumptions, may favour prominent
pages.
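As a point of reference (a sketch using the standard Okapi BM25 formulation, not a verbatim reproduction of the scoring functions evaluated in Chapter 8), the BM25 contribution of a term t to the score of document d can be written as
\[
  w(t,d) \;=\; \mathit{idf}(t)\,
  \frac{(k_1+1)\,\mathit{tf}_{t,d}}
       {k_1\left((1-b)+b\,\frac{\mathit{dl}_d}{\mathit{avdl}}\right)+\mathit{tf}_{t,d}}.
\]
In these terms, removing aggregate anchor-text length normalisation corresponds to setting b = 0 when scoring the anchor-text field, while normalising by full-text length corresponds to replacing dl_d, the length of the anchor-text aggregate document, with the document's full-text length and leaving b unchanged.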
The use of URL-length based measures, either through grouping URLs into classes
(as in URL-type) or simply by counting the number of characters, brought consistent
gains for home page finding tasks. However, the use of this evidence reduced ef-
fectiveness for other tasks, and would be ineffective for corpora which do not exhibit
any URL hierarchy. Further work is needed to understand how to best use URL-based
measures in a general purpose web search system.
Hyperlink recommendation evidence was far less effective than URL-based mea-
sures. The use of hyperlink recommendation evidence provided minimal gains, even
when an Optimal re-ranking was used. The most effective use of hyperlink rec-
ommendation scores was in reducing the size of corpora without reducing home
page search performance. However, these gains were small by comparison to those
achieved using URL-type thresholds. Democratic PageRank was not observed to sig-
nificantly out-perform simple in-degree. Given the extra cost involved in comput-
ing Democratic PageRank, this thesis presents no evidence to support the use of De-
mocratic PageRank over in-degree. A PageRank biased towards authoritative sites
improved effectiveness somewhat; however, the scores were based on bookmarks
known to match the best answers for the queries used. Further work is required to
investigate and compare this PageRank formulation to other authority-biased mea-
sures.
The combination method for query-dependent evidence which achieved the high-
est retrieval effectiveness on navigational and Topic Distillation tasks was the hybrid
combination of scores. The hybrid combination considers document-level and web
based evidence as separate document components, and uses a linear combination to
sum scores. The separation of document-level and web-based information means that
two scores are assigned per document, one for the document content (or the author’s
description), and one for the wider web community view of the document. If both
measures agree (and the document is scored highly on both measures for a particular
query) this is likely to be a strong indication that the page is what it claims to be. Com-
puting global document and corpus statistics for all query-dependent fields improved
system efficiency and provided small gains in retrieval effectiveness.
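A minimal Python sketch of this hybrid combination; the field names, weights, and example scores are illustrative assumptions, not the tuned values used in the experiments.

def hybrid_score(document_scores, web_scores, w_doc=1.0, w_web=1.0):
    # Document-level evidence (full-text, title, ...) and web-based evidence
    # (aggregate anchor-text and other external descriptions) are scored as
    # two separate components and combined with a linear sum.
    return (w_doc * sum(document_scores.values())
            + w_web * sum(web_scores.values()))

# A page whose own content and whose incoming anchor-text both match the
# query scores highly on both components, and so ranks highly overall.
example = hybrid_score({"fulltext": 7.2, "title": 1.4}, {"anchor_text": 9.1})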
The best methods for combining query-independent evidence with query-
dependent baselines involved the application of minimum thresholds for page in-
clusion, or re-ranking all pages within some percentage of the top score. Both combi-
nations proved effective when combining URL-type evidence with query-dependent
baselines.
Bias towards the home pages of popular and/or technology-oriented companies
was observed in hyperlink-based evidence. Some biases, such as the technology bias,
could negatively affect document ranking if ignored, as search results will cater to a
small demographic of web users. These findings indicate that care should be taken
when using such evidence in document ranking, or in a direct Toolbar indicator. The
observed bias may be especially confusing when recommendation scores are used
directly as a measure of a page’s quality, as in the Google Toolbar.
11.2 Document ranking recommendations
Experimental results indicate that an effective web-based document ranking algo-
rithm for navigational tasks should exploit both document-level evidence and web-
based evidence. These two types of document evidence are best combined using a
hybrid combination with globally computed document and term statistics. Document
evidence should include full-text evidence and other useful document-level evidence.
Web-based evidence should make use of incoming anchor-text, and other useful ex-
ternal document descriptions. Anchor-text aggregate document length should not be
used to normalise anchor-text term contribution.
For home page search, a URL depth component either measured by characters
or classified by type, should be included. The measure may be included either by
re-ranking documents that achieve within n% of the top score by URL length, or by
adding a normalised URL length score to the query-dependent score.
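A sketch of the score-based re-ranking option, in Python; the n = 20% cutoff and the (doc_id, query score, URL length) document representation are assumptions made for the example.

def rerank_by_url_length(ranked, n_percent=20.0):
    # `ranked`: list of (doc_id, query_score, url_length), best score first.
    if not ranked:
        return ranked
    cutoff = ranked[0][1] * (1.0 - n_percent / 100.0)
    head = [d for d in ranked if d[1] >= cutoff]
    tail = [d for d in ranked if d[1] < cutoff]
    # Within the group of near-top documents, prefer shorter (shallower) URLs.
    head.sort(key=lambda d: d[2])
    return head + tail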
The best choice of hyperlink recommendation algorithm for use in home page
finding within corporate-scale corpora is in-degree, as the PageRank variants appear
to offer little or no advantage and are more computationally expensive.
11.3 Future work
The findings within this thesis raise several issues that merit further investigation.
Future work for web-based document ranking might include:
• A study of whether web evidence can improve retrieval effectiveness for other
web-based user search tasks, such as informational and transactional search.
• A study of further anchor-text ranking functions. The modifications to Okapi
BM25 improved retrieval effectiveness; however, further work is needed to
determine whether the document and collection statistics applied to scoring
anchor-text were optimal.
• Further study of how document and web-based evidence should be combined.
This thesis has explored many different ways of combining document evidence,
but it is not clear that the optimal method has been found.
Further studies might also look at the nature of hyperlink recommendation on the
WWW. This could include:
• A study of the changing nature of hyperlink evidence on the WWW. For exam-
ple, is the proportion of dynamic vs. static hyperlinks on the WWW constant?
Is the proportion of links which are dead (have no target) constant over time?
Also worthy of further examination is how new trends on the WWW, such as
web logging, might affect the quality and quantity of hyperlink evidence.
• A study of how an increase in the effectiveness of WWW-based search engines
might affect the quality of hyperlink evidence on the WWW. Does high quality
search mean authors are less likely to link to useful documents?
• A further study of how document quality metrics, such as PageRank and in-
degree, relate to users' (or industry professionals') satisfaction with document
quality. This investigation could focus on the use of tools like the Google Toolbar.
Appendix A
Glossary
All terms within this thesis, unless defined below, are as used in the (Australian) Mac-
quarie Dictionary, searchable on the WWW at http://www.dict.mq.edu.au.
Aggregate anchor-text: all anchor-text snippets pointing to a page.
Anchor-text: words contained within anchor-tags which are “clicked on” when a
link is followed.
Anchor-text snippet: a piece of anchor-text that annotates a single link.
Anchor-text aggregate document: a surrogate document containing all anchor-text
snippets pointing to a page.
Aristocratic PageRank (APR): a formulation of PageRank that favours a manually
specified set of (authoritative) pages. The PageRank calculation is biased towards
these pages by using the set of pages in the PageRank bookmark vector.
Collection: see Test collection.
Corpus: a set of documents.
Crawler: the web search system component that gathers documents from a web.
Democratic PageRank (DPR): the default PageRank formulation in which all pages
are treated a priori as equal.
Entry point: a document within a site hierarchy from which web users can begin to
explore a particular topic.
Evidence: a document attribute, feature, or group of attributes and features that may
be useful in determining whether the document should be retrieved (or not) for
a particular query.
Feature: information extracted from a document and used during query processing.
Field: a query-dependent document component, for example document full-text,
document title or document aggregate anchor-text.
Home page: the key entry point for a particular web site.
Home page finding: a navigational search task in which the goal is to find home pages.
Hyperlink recommendation: an algorithm which is based on the number or “qual-
ity” of web recommendations for a particular document.
In-degree: the simplest hyperlink recommendation algorithm in which a document’s
value is measured by the number of incoming hyperlinks.
Indexer: the web search system component that indexes documents gathered by the
crawler into a format which is amenable to quick access by the query processor.
Informational search task: a user task in which the user need is to acquire or learn
some information that may be present in one or more web pages.
Link farms: an “artificial” web graph created by spammers through generating link
spam to funnel hyperlink evidence to a set of pages for which they desire high
web rankings.
Link spam: spam content introduced into hyperlink evidence by generating spam
documents that link to other documents with false or misleading information.
Mean Reciprocal Rank (MRR): a measure used in evaluating web search system
performance, computed by averaging over a set of queries the reciprocal of the
rank at which the first relevant (useful) document is retrieved.
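Expressed as a formula (a standard statement of the measure, assuming rank_i is the rank of the first relevant document retrieved for query i, and taking 1/rank_i = 0 when no relevant document is retrieved):
\[
  \mathrm{MRR} \;=\; \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\mathit{rank}_i}
\]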
Named page finding: a navigational search task in which the goal of the search system
is to find a particular page given its name.
Navigational search task: a user task where the user needs to locate a particular
entity given its name.
PageRank: a hyperlink recommendation algorithm that estimates the probability that a
“random” web surfer would be on a particular page on a web at any particular
time.
Precision: a measure used in evaluating web search system performance. Precision
is the proportion of retrieved documents that are relevant to a query at a partic-
ular rank cut-off.
Query-dependent evidence: evidence that depends on the user query and is calcu-
lated by the query processor during query processing.
Query-independent evidence: evidence that does not depend on the user query,
generally calculated during the document indexing phase (prior to query process-
ing).
Query processor: a typical component of a web search system that consults the index
to retrieve documents in response to a user query.
R-Precision (R-Prec): a measure used to evaluate web search system performance.
R-Precision is the precision of a system at rank R, where R is the number of
documents relevant to the query (averaged across multiple queries).
Recall: a measure used in evaluating web search system performance. Recall is the
total proportion of all relevant documents that have been retrieved within a par-
ticular cut-off for a query.
Search Engine Optimisation: optimising document and web structure such that
search engines may better match a document’s content (without generating spam
content).
Spam: content generated by web publishers to artificially boost the rank of their
pages. Spam techniques include the addition of otherwise unneeded keywords
and hyperlinks.
Stemming: stripping term suffixes or prefixes to collapse a term down to its canon-
ical form (or stem). The Porter suffix stemmer [163] is used for this purpose in
some thesis experiments.
Test collection: a snapshot of a user task and document corpus used to evaluate
system effectiveness. A test collection includes a set of documents (corpus), a set
of queries, and relevance judgements for documents in the corpus according to
the queries.
Topic Distillation: a user task in which the goal is to find entry points to relevant sites
given a broad query.
Text REtrieval Conference (TREC): an annual conference run by the US National
Institute of Standards and Technology (NIST) and the US Defense Advanced Re-
search Projects Agency (DARPA) since 1992. The goal of the conference is to
promote the understanding of information retrieval algorithms by allowing re-
search groups to compare system effectiveness on common test collections.
Traditional information retrieval: information retrieval performed over a flat corpus
using full-text fields.
Transactional search task: a user search task where the user needs to perform some
activity on, or using, the WWW.
URL-type: a URL class breakdown, proposed by Westerveld et al. [212], in which
some URLs are deemed more important than others on the basis of structure
and depth (outlined in Section 2.3.3.2).
web: a corpus containing linked documents.
web evidence: evidence derived from some web property or context.
web graph: a graph built from the hyperlink structure of a web, where web pages
are nodes, and hyperlinks are edges.
WWW: the World-Wide Web is a huge repository of linked documents distributed on
millions of servers world-wide. The WWW contains at least ten billion publicly
visible web documents.
Appendix B
The canonicalisation of URLs
When canonicalising URLs, the following rules were followed (a code sketch covering several of these rules appears after the list):
• If the relative URI steps outside the root of the server, the link is resolved to the
server root directory. For example:
– A link to /../foo.html from http://cs.anu.edu.au/ will be resolved
to http://cs.anu.edu.au/foo.html;
– A link to ../../foo.html from http://cs.anu.edu.au/~Trystan.Upstill/ will be
resolved to http://cs.anu.edu.au/foo.html; and
– A link to /../foo.html from http://cs.anu.edu.au/~Trystan.Upstill/pubs/ will
be resolved to http://cs.anu.edu.au/foo.html.
• Hyperlinks and documents with common default root page names (e.g.
index.htm(l), default.htm(l), welcome.htm(l), and home.htm(l))
are stemmed to the directory path. For example:
– A link to http://cs.anu.edu.au/default.html is resolved to
http://cs.anu.edu.au/; and
– A link to http://cs.anu.edu.au/~Trystan.Upstill/index.html is resolved to
http://cs.anu.edu.au/~Trystan.Upstill/.
• Multiple directory slashes are resolved to a single slash. For example:
– A link to http://cs.anu.edu.au///// is resolved to http://cs.anu.edu.au/; and
– A link to http://cs.anu.edu.au//////~Trystan.Upstill// is resolved to
http://cs.anu.edu.au/~Trystan.Upstill/.
• URLs pointing to targets inside documents are treated as links to the full docu-
ment. For example:
– A link to http://cs.anu.edu.au/foo.html#Trystan is resolved to
http://cs.anu.edu.au/foo.html; and
– A link to http://cs.anu.edu.au/#foo is resolved to http://cs.anu.edu.au/.
• Hyperlinks are not followed from framesets (as they are not crawled). Hyperlink
extraction from frameset sites requires that links directly to navigational panes
be observed (and not links to framesets).
• If the port to which an HTTP request is made is the default port (e.g. 80), it is
removed. For example:
– A link to http://cs.anu.edu.au:80 is resolved to http://cs.anu.edu.au; and
– A link to http://cs.anu.edu.au:80/~Trystan.Upstill/ is resolved to
http://cs.anu.edu.au/~Trystan.Upstill/.
• URLs whose host name lacks a leading “www” label have “www.” prepended. For example, a link to
http://sony.com/ is resolved to http://www.sony.com.
• If no protocol is provided http:// is assumed. For example a link to sony.com
is resolved to http://www.sony.com.
• Host names are converted into lower case (as host names are case-insensitive).
• Default web server directory listing pages are removed.
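The sketch below illustrates several of these rules in Python (default root page names, multiple slashes, fragments, default ports, missing protocols, and lower-casing of host names); it is an illustration of the rules above, not the code used to build the collections, and the remaining rules (relative-URI resolution, framesets, “www” prefixing, and directory listing pages) are omitted.

import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PAGES = re.compile(r'/(index|default|welcome|home)\.html?$')

def canonicalise(url: str) -> str:
    if '://' not in url:                    # no protocol given: assume http
        url = 'http://' + url
    scheme, netloc, path, query, _ = urlsplit(url)   # discard any #fragment
    netloc = netloc.lower()                 # host names are case-insensitive
    if netloc.endswith(':80'):              # drop the default HTTP port
        netloc = netloc[:-3]
    path = re.sub(r'/{2,}', '/', path)      # collapse multiple slashes
    path = DEFAULT_PAGES.sub('/', path)     # stem default root page names
    return urlunsplit((scheme, netloc, path, query, ''))

# For example, canonicalise("http://CS.ANU.EDU.AU:80///default.html#foo")
# returns "http://cs.anu.edu.au/".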
Appendix C
Bookstore search and searchability:
case study data
C.1 Book categories
• (27) Children’s
• (15) Hardcover Advice
• (11) Hardcover Business
• (35) Hardcover Fiction
• (29) Hardcover Non-Fiction
• (15) Paperback Advice
• (07) Paperback Business
• (35) Paperback Fiction
• (32) Paperback Non-Fiction
• (206) Total
Duplicate books were removed from the query set. For example the book titled
“Stupid White Men” was in both the Hardcover Business and Hardcover Non-Fiction
sections, and so was only considered in the Hardcover Business category.
C.2 Web search engine querying
• AltaVista
– General Queries: Book title surrounded by quotation (“) marks.
– URL Coverage: canonical domain name with “url:” parameter.
– Link Coverage: canonical domain name with “link:” parameter.
– Timeframe: General and Domain Restricted Queries were submitted between
20/09/02 and 02/10/02; Link Coverage and URL Coverage experiments were
performed on 09/10/02.
• AllTheWeb (Fast)
– General Queries: Book title with exact phrase box ticked.
– URL Coverage: Advanced search restricting to domain using “domain”
textbox with canonical domain name.
– Link Coverage: Advanced search using Word Filter with “Must Include” in
the preceding drop down box, canonical domain name in middle text box
and “in the link to URL” in the final drop down box.
– Timeframe: General and Domain Restricted Queries were submitted between
20/09/02 and 02/10/02; Link Coverage and URL Coverage experiments were
performed on 09/10/02.
• Google
– General Queries: Book title surrounded by quotation (“) marks.
– URL Coverage: Search for the non-presence of a non-existing word
(e.g.: -adsljflkjlkjdflkjasdlfj0982739547asdhkas) and using canonical domain
name with “host:” parameter.
– Link Coverage: Not available.
– Timeframe: General and Domain Restricted Queries were submitted between
20/09/02 and 02/10/02; Link Coverage and URL Coverage experiments were
performed on 09/10/02.
• MSN Search (Inktomi)
– General Queries: Advanced search with the book title entered in the “exact phrase” box.
– URL Coverage: Advanced search using the domain name as the query, and
restricting domain using “domain” text box with canonical domain name.
– Link Coverage: Not available.
– Timeframe: General and Domain Restricted Queries were submitted between
20/09/02 and 02/10/02; Link Coverage and URL Coverage experiments were
performed on 09/10/02.
C.3 Correct book answers in bookstore case study
Category Book Title ISBN
Childrens America 0689851928
Childrens Artemis Fowl 0786808012
0786817070
Childrens Artemis Fowl: the Arctic Incident 0786808551
Childrens Can You See What I See? 0439163919
Childrens Daisy Comes Home 039923618X
Childrens Disney’s Lilo and Stitch 0736413219
Childrens Giggle, Giggle, Quack 0689845065
Childrens Good Morning, Gorillas 0375806148
0375906142
Childrens Harry Potter and the Chamber of Secrets 0439064872
0439064864
0613287142
Childrens Harry Potter and the Goblet of Fire 0439139600
0439139597
Childrens Harry Potter and the Prisoner of Azkaban 0439136369
0439139597
0613371062
Childrens Harry Potter and the Sorcerer’s Stone 059035342X
0590353403
0613206339
Childrens Holes 0440414806
0374332657
044022859X
0613236696
Childrens If You Take a Mouse to School 0060283289
Childrens Junie B., First Grader (at Last!) 0375802932
0375815163
0375902937
Childrens Junie B., First Grader: Boss of Lunch 0375815171
Childrens Lemony Snicket: the Unauthorized Autobiography. 0060007192
Childrens Oh, the Places You’ll Go! 0679805273
Childrens Olivia 0689829531
Childrens Olivia Saves the Circus 068982954X
Childrens Princess in the Spotlight 0060294655
0064472795
0060294663
Childrens Stargirl 037582233X
0679886370
0679986375
B00005TZX9
B00005TPDD
Childrens The All New Captain Underpants Extracrunchy Book O’fun 2 0439376084
Childrens The Bad Beginning 0064407667
0060283122
Childrens The Reptile Room 0064407675
0060283130
Childrens The Three Pigs 0618007016
Childrens The Wide Window 0064407683
0060283149
Hardcover Advice 10 Secrets for Success and Inner Peace 1561708755
Hardcover Advice Body for Life 0060193395
Hardcover Advice Conquer the Crash 0470849827
Hardcover Advice Execution 0609610570
Hardcover Advice Fish! 0786866020
Hardcover Advice Get With the Program! 0743225996
Hardcover Advice I Hope You Dance 1558538445
Hardcover Advice Self Matters 074322423X
Hardcover Advice Sylvia Browne’s Book of Dreams 0525946586
Hardcover Advice The Fat Flush Plan 0071383832
Hardcover Advice The Perricone Prescription 0060188790
Hardcover Advice The Prayer of Jabez 1576737330
1576738108
Hardcover Advice The Prayer of Jabez for Women 1576739627
1590520491
Hardcover Advice The Wisdom of Menopause 055380121X
Hardcover Advice Who Moved My Cheese? 0399144463
Hardcover Business Conquer the Crash (duplicate)
Hardcover Business Execution (duplicate)
Hardcover Business Fish (duplicate)
Hardcover Business Fish! Tales 0786868686
Hardcover Business Good to Great 0066620996
Hardcover Business How to Lose Friends and Alienate People 030681188X
Hardcover Business Martha Inc. 0471123005
Hardcover Business Oh, the Things I Know! 052594673X
Hardcover Business Snobbery: the American Version 0395944171
Hardcover Business Stupid White Men 0060392452
Hardcover Business Ten Things I Learned From Bill Porter 1577312031
Hardcover Business The Pact 157322216X
Hardcover Business Tuxedo Park 0684872870
0684872889
Hardcover Business Wealth and Democracy 0767905334
Hardcover Business Who Moved My Cheese? (duplicate)
Hardcover Fiction A Love of My Own 0385492707
Hardcover Fiction A Thousand Country Roads 0971766711
Hardcover Fiction Absolute Rage 0743403444
Hardcover Fiction An Accidental Woman 0743204700
Hardcover Fiction Ash Wednesday 037541326X
Hardcover Fiction Atonement 0385503954
Hardcover Fiction Charleston 0525946500
Hardcover Fiction Eleventh Hour 0399148779
Hardcover Fiction Enemy Women 0066214440
Hardcover Fiction Fire Ice 0399148728
Hardcover Fiction Hard Eight 0312265859
Hardcover Fiction Her Father’s House 0385334729
Hardcover Fiction Hot Ice 0553802747
Hardcover Fiction In This Mountain 0670031046
Hardcover Fiction Lawrence Sanders: Mcnally’s Alibi 0399148795
Hardcover Fiction Leslie 0743228669
Hardcover Fiction Partner in Crime 0380977303
Hardcover Fiction Pasadena 0375504567
Hardcover Fiction Prague 0375507876
Hardcover Fiction Red Rabbit 0399148701
Hardcover Fiction Standing in the Rainbow 0679426159
Hardcover Fiction Stone Kiss 0446530387
Hardcover Fiction Sunset in St. Tropez 0385335466
Hardcover Fiction The Art of Deception 0786867248
Hardcover Fiction The Beach House 0316969680
Hardcover Fiction The Dive From Clausen’s Pier 0375412824
Hardcover Fiction The Emperor of Ocean Park 0375413634
Hardcover Fiction The Lovely Bones 0316666343
Hardcover Fiction The Nanny Diaries 0312278586
Hardcover Fiction The Remnant 0842332278
Hardcover Fiction The Shelters of Stone 0609610597
Hardcover Fiction The Summons 0385503822
Hardcover Fiction Unfit to Practice 0385334842
Hardcover Fiction Whispers and Lies 0743446259
Hardcover Fiction You Are Not a Stranger Here 0385509529
Hardcover Non-Fiction A Long Strange Trip 0767911857
Hardcover Non-Fiction A Mind at a Time 0743202228
Hardcover Non-Fiction A Nation Challenged 0935112766
Hardcover Non-Fiction Among the Heroes 0060099089
Hardcover Non-Fiction Cicero 0375507469
Hardcover Non-Fiction Crossroads of Freedom: Antietam 0195135210
Hardcover Non-Fiction Firehouse 1401300057
Hardcover Non-Fiction General Patton 0060009829
Hardcover Non-Fiction Gettysburg 0060193638
Hardcover Non-Fiction Good to Great (duplicate)
Hardcover Non-Fiction John Adams 0743223136
Hardcover Non-Fiction Lucky Man 0786867647
Hardcover Non-Fiction Martha Inc. (duplicate)
Hardcover Non-Fiction Odd Girl Out 0151006040
Hardcover Non-Fiction Once Upon a Town 0060081961
Hardcover Non-Fiction Profiles in Courage for Our Time 0786867930
Hardcover Non-Fiction Running With Scissors 0312283709
Hardcover Non-Fiction Sacred Contracts 0517703920
Hardcover Non-Fiction Sex, Lies, and Headlocks 0609606905
Hardcover Non-Fiction Six Days of War 0195151747
Hardcover Non-Fiction Slander 1400046610
Hardcover Non-Fiction Small Wonder 0060504072
Hardcover Non-Fiction Snobbery (duplicate)
Hardcover Non-Fiction Strong of Heart 006050949X
Hardcover Non-Fiction Stupid White Men (duplicate)
Hardcover Non-Fiction The Art of Travel 0375420827
Hardcover Non-Fiction The Cell 0786869003
Hardcover Non-Fiction The Lobster Chronicles 0786866772
Hardcover Non-Fiction The Right Words at the Right Time 0743446496
Hardcover Non-Fiction The Sexual Life of Catherine M. 0802117163
Hardcover Non-Fiction The Universe in a Nutshell 055380202X
Hardcover Non-Fiction Tuxedo Park (duplicate)
Hardcover Non-Fiction Wealth and Democracy (duplicate)
Hardcover Non-Fiction Why I Am a Catholic 0618134298
Hardcover Non-Fiction You Cannot Be Serious 0399148582
Paperback Advice A Week in the Zone 006103083X
Paperback Advice Chicken Soup for the Teacher’s Soul 1558749780
1558749799
Paperback Advice Crucial Conversations 0071401946
Paperback Advice Dr. Atkins’ New Diet Revolution 006001203X
1590770021
Paperback Advice Fix-it and Forget-it Cookbook 1561483397
1561483389
1561483176
Paperback Advice Guinness World Records 2002 0553583786
Paperback Advice Leonard Maltin’s 2003 Movie and Video Guide 0451206495
Paperback Advice Life Strategies 0786884592
0786865482
Paperback Advice Relationship Rescue 0786866314
078688598X
Paperback Advice Rich Dad, Poor Dad 0446677450
Paperback Advice The Four Agreements 1878424319
1878424505
Paperback Advice The Pill Book: New and Revised 10th Edition. 0553584782
0553050133
Paperback Advice The Unauthorized Osbournes 1572435208
Paperback Advice The Wrinkle Cure 0446677760
1579542379
Paperback Advice What to Expect When You’re Expecting 0761121323
0761125493
Paperback Business Crucial Conversations (duplicate)
Paperback Business Fast Food Nation 0060938455
0395977894
Paperback Business How to Make Money in Stocks 0071373616
Paperback Business Life Strategies (duplicate)
Paperback Business Nickel and Dimed 0805063897
0805063889
Paperback Business Rich Dad, Poor Dad (duplicate)
Paperback Business The Tipping Point 0316316962
0316346624
Paperback Business Two Bad Years and Up We Go! 1892008726
Paperback Business What Color Is Your Parachute 2002 1580083420
1580083412
Paperback Business What Went Wrong at Enron 0471265748
Paperback Fiction A Bend in the Road 0446611867
0446527785
Paperback Fiction A Painted House 044023722X
038550120X
Paperback Fiction A Walk to Remember 0446608955
0613281292
Paperback Fiction Always in My Heart 0451206665
Paperback Fiction Bel Canto 0060934417
Paperback Fiction Blood Work 0446602620
0613236882
Paperback Fiction Cordina’s Royal Family 0373484836
Paperback Fiction Divine Secrets of the Ya-ya Sisterhood 0060928336
0060173289
Paperback Fiction Empire Falls 0375726403
0679432477
Paperback Fiction Enemy Within 0743403436
0743403428
Paperback Fiction Envy 0446611808
0446527130
Paperback Fiction Face the Fire 051513287X
Paperback Fiction Fanning the Flame 0743419162
Paperback Fiction For Better, for Worse 0380820447
Paperback Fiction Four Blondes 080213825X
0871138190
Paperback Fiction Good in Bed 0743418174
0743418166
Paperback Fiction Hemlock Bay 0399147381
0515133302
Paperback Fiction Honest Illusions 0399137610
0515110973
Paperback Fiction Little Altars Everywhere 0060976845
006019362X
Paperback Fiction Mercy 0671034022
0671034014
Paperback Fiction Paradise Lost 0140424261
Paperback Fiction Stonebrook Cottage 1551669234
Paperback Fiction Summer Pleasures 0373218397
Paperback Fiction Suzanne’s Diary for Nicholas 0446679593
0316969443
Paperback Fiction The Associate 0061030643
0060196254
Paperback Fiction The Bachelor 0446610542
Paperback Fiction The Last Time They Met 0316781266
0316781142
Paperback Fiction The New Jedi Order: Traitor 034542865X
0553713175
Paperback Fiction The Smoke Jumper 0385334036
0440235162
Paperback Fiction The Straw Men 0515134279
Paperback Fiction The Surgeon 0345447840
0345447832
Paperback Fiction True Blue 0553583980
Paperback Fiction Valhalla Rising 039914787X
0425185710
Paperback Fiction When Strangers Marry 0060507365
Paperback Fiction Whisper of Evil 0553583468
Paperback Non-Fiction A Beautiful Mind 0743224574
0684819066
Paperback Non-Fiction A Child Called ”It” 1558743669
0613171373
Paperback Non-Fiction A Man Named Dave 0452281903
0525945210
Paperback Non-Fiction An Italian Affair 0375724850
0375420657
Paperback Non-Fiction April 1865 0060930888
0060187239
Paperback Non-Fiction Ava’s Man 0375724443
0375410627
Paperback Non-Fiction Black Hawk Down 0871137380
0140288503
Paperback Non-Fiction Brunelleschi’s Dome 0142000159
0802713661
Paperback Non-Fiction Comfort Me With Apples 0375758739
0375501959
Paperback Non-Fiction Fast Food Nation (duplicate)
Paperback Non-Fiction Founding Brothers 0375405445
0375705244
Paperback Non-Fiction French Lessons 0375705619
0375405909
Paperback Non-Fiction From Beirut to Jerusalem 0385413726
0374158959
Paperback Non-Fiction Ghost Soldiers 038549565X
0385495641
Paperback Non-Fiction It’s Not About the Bike 0399146113
0425179613
Paperback Non-Fiction Justice 0609608738
0609809636
Paperback Non-Fiction Me Talk Pretty One Day 0316776963
0316777722
Paperback Non-Fiction Napalm and Silly Putty 0786887583
0786864133
Paperback Non-Fiction Nickel and Dimed (duplicate)
Paperback Non-Fiction On Writing 0743455967
0684853523
Paperback Non-Fiction Paris to the Moon 0679444920
0375758232
Paperback Non-Fiction Perpetual War for Perpetual Peace 156025405X
Paperback Non-Fiction Personal History 0375701044
0394585852
Paperback Non-Fiction Seabiscuit 0375502912
0449005615
Paperback Non-Fiction The Botany of Desire 0375501290
0375760393
Paperback Non-Fiction The Darwin Awards 0525945725
0452283442
Paperback Non-Fiction The First American 0385495404
0385493282
Paperback Non-Fiction The Idiot Girls’ Action-adventure Club 0375760911
Paperback Non-Fiction The Lost Boy 1558745157
0613173538
Paperback Non-Fiction The Map That Changed the World 0060931809
0060193611
Paperback Non-Fiction The Metaphysical Club 0374199639
0374528497
Paperback Non-Fiction The Piano Shop on the Left Bank 0375758623
0375503048
Paperback Non-Fiction The Tipping Point (duplicate)
Paperback Non-Fiction The Wild Blue 0743203399
0743223098
Paperback Non-Fiction Washington 1586481185
0783895909
Table C.1: Correct book answers in bookstore case study.
Appendix D
TREC participation in 2002
This appendix is included for reference only and is drawn directly from [57].
TREC2002 included a named page finding task and a Topic Distillation task. A
preliminary exploration of forms of evidence which might be useful for named page
finding and topic distillation was performed. For this reason there was heavy use of
evidence other than page content.
D.1 Topic Distillation
In Topic Distillation the following forms of evidence were used:
• BM25 on full-text (content). Pages returned should be “relevant”. The .GOV
corpus was indexed and BM25 applied, sometimes with stemming and sometimes
without.
• BM25 on content and referring anchor-text. An alternative to content-only BM25
is to include referring anchor-text words in the BM25 calculation (content and
anchors).
• In-link counting and filtering. This tested whether pages with more in-links are po-
tentially better answers, differentiating between on-host and off-host links.
Many results were eliminated on the grounds that they had insufficient in-links.
• URL length. Short URLs are expected to be better answers than long URLs.
• BM25 score aggregation. Sites with many BM25-matching pages are expected
to be better than those with few.
In the 2002 Topic Distillation (TD2002) task, the focus on local page content rele-
vance (BM25 content only) was probably too high for the non-content and aggrega-
tion methods to succeed. Most correct answers were expected to be shallow URLs of
sites containing much useful content. In fact, correct answers were deeper, and the ag-
gregation method for finding sites rich with relevant information was quite harmful
(csiro02td3 and csiro02td4). The focus on page content is borne out by the improve-
ment in effectiveness achieved when simple BM25 was applied in an unofficial run
(csiro02unoff). To perform better in the TD2002 task, less (or no) emphasis should have
been put on distillation evidence and far more emphasis on relevance. However, in
some Web search situations, it is likely that the distillation evidence would be more
important than it was in this TD2002 task.
Run P@10 BM25 cont. BM25 cont. & anch. In-link counting & filtering URL length BM25 aggr.
csiro02td1 0.1000 y y y
csiro02td2 0.0714 y y
csiro02td3 0.0184 y y y y
csiro02td4 0.0184 y y y
csiro02td5 0.0939 y (stem) y y
csiro02unoff 0.1959 y
Table D.1: Official results for submissions to the 2002 TREC web track Topic Distillation task.
D.2 Named page finding
In the named page finding experiments the following forms of evidence were used:
• Okapi BM25 on document full-text (content) and/or anchor text. Okapi BM25
was used to score document content and anchor-text aggregate documents.
• Stemming of query terms.
• Extra Title Weighting. To bias the results towards “page naming text” further
emphasis was placed on document titles.
• PageRank. To see whether link recommendation could be used to improve re-
sults [31].
Prior to submission twenty named page training queries were generated. This
training found that content with extra title weighting performed best. Therefore page
titles were expected to be important evidence in the official named page finding task.
However, this appeared not to be the case; in fact, extra title weighting for the TREC
queries appeared to reduce effectiveness (csiro02np01 vs csiro02np03). While there
was some anchor text evidence present for the query set (csiro02np02), when this ev-
idence was combined with content (csiro02np04 and csiro02np16) results were notice-
ably worse than for the content-only run (csiro02np01). PageRank harmed retrieval
effectiveness (run csiro02np16 versus csiro02np04).
Run MRR S@10 BM25 Stemming Extra title weighting PageRank
csiro02np01 0.573 0.77 Content
csiro02np02 0.241 0.34 Anchor text
csiro02np03 0.416 0.59 Content y
csiro02np04 0.318 0.51 Content and anchor text y y
csiro02np16 0.307 0.49 Content and anchor text y y y
Table D.2: Official results for submissions to the 2002 TREC web track named page finding task.
Appendix E
Analysis of hyperlink
recommendation evidence
additional results
This appendix contains further graphs from the experiment series examined in Chap-
ter 5, Section 5.2.1. Figures E.1 and E.2 contain PageRank distributions for several
company websites. These figures support the results presented in Chapter 5, but do
not show any further interesting trends.
[Figure E.1 panels: Toolbar PageRank distributions (pages in our crawl vs. PageRank 0–10) for www.harman.com (HP PR=7), www.introgen.com (HP PR=5), www.pnc.com (HP PR=6), www.progressenergy.com (HP PR=5), www.csx.com (HP PR=6), and www.southtrust.com (HP PR=7).]
Figure E.1: Toolbar PageRank distributions within sites. (Additional to those presented
in Section 5.2.1) The PageRank distributions for other sites are included in Figure 5.2, and in
Figure E.2. The PageRank advice to users is usually that the home page is the most important
or highest quality page, and other pages are less important or of lower quality. PageRank of
the home page of the site is shown as “HP PR=”.
[Figure E.2 panels: Toolbar PageRank distributions (pages in our crawl vs. PageRank 0–10) for www.tenneco-automotive.com (HP PR=6), www.novavax.com (HP PR=5), www.valero.com (HP PR=6), www.synergybrands.com (HP PR=5), www.teletouch.com (HP PR=5), and www.tofc.net (HP PR=3).]
Figure E.2: Toolbar PageRank distributions within sites (Additional to those presented in
Section 5.2.1)
Appendix F
Okapi BM25 distributions
This appendix contains the distributions of Okapi BM25 scores for query-dependent
evidence for the WT10gC collection (see Section 7.1.2) used throughout experiments
in Chapter 7. Figure F.1 contains the distribution of scores for the document full-text.
Figure F.2 contains the distribution of scores for the anchor-text baseline. The BM25
distributions are calculated using the top 1000 results for each of the 100 queries. Un-
like query-independent evidence, BM25 scores are not comparable between query re-
sults. To build these distributions the BM25 scores for all queries were independently
normalised (the top answer for each query receives a 1). Due to the cutoff at 1000, a
truncated curve is expected. Additionally, because the query score distributions are
not centred at the same point, the plot exhibits a flatter curve than would be observed
for a single query score distribution.
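A small Python sketch of the per-query normalisation used to build these distributions; the data layout (a mapping from query to its top-1000 BM25 scores) is an assumption made for the example.

def normalise_per_query(scores_by_query):
    # BM25 scores are not comparable between queries, so each query's scores
    # are divided by that query's maximum (the top answer receives 1.0).
    return {q: [s / max(scores) for s in scores]
            for q, scores in scores_by_query.items() if scores}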
[Figure F.1 plot: percentage of documents vs. normalised BM25 content score (top 1000 documents per query), with one curve for each of the ANU, WT10gC and VLC2R collections.]
Figure F.1: Distribution of normalised Okapi BM25 scores for document full-text for the
WT10gC collection. The BM25 distributions are calculated using the top 1000 results for
each of the 100 queries. Unlike query-independent evidence BM25 scores are not comparable
between query results. To build this distribution the BM25 scores for all queries were inde-
pendently normalised (the top answer for each query receives a 1). Due to the cutoff at 1000,
a truncated curve is expected. Additionally, because the query score distributions are not cen-
tred at the same point, the plot exhibits a flatter curve than would be observed for a single
query.
[Figure F.2 plot: percentage of documents vs. normalised BM25 anchor score (top 1000 documents per query), with one curve for each of the ANU, WT10gC and VLC2R collections.]
Figure F.2: Distribution of normalised Okapi BM25 scores for aggregate anchor-text for
the WT10gC collection. The BM25 distributions are calculated using the top 1000 results for
each of the 100 queries. Unlike query-independent evidence BM25 scores are not comparable
between query results. To build this distribution the BM25 scores for all queries were inde-
pendently normalised (the top answer for each query receives a 1). Due to the cutoff at 1000,
a truncated curve is expected. Additionally, because the query score distributions are not cen-
tred at the same point, the plot exhibits a flatter curve than would be observed for a single
query.
Appendix G
Query sets
G.1 .GOV home page set
Query .GOV Doc ID
White House G03-16-2396677
Office of Homeland Security G25-97-0219687
Office of Management and Budget G01-47-2257273
OMB G01-47-2257273
United States Trade Representative G00-02-0599362
USTR G00-02-0599362
Department of Agriculture G42-03-3102230
USDA G42-03-3102230
Agricultural Research Service G00-03-3996998
Animal Plant Health Inspection Service G00-06-2853218
Cooperative State Research Education and Extension Service G00-11-0223618
Economic Research Service G00-03-2081400
Farm Service Agency G01-58-2364809
National Agricultural Library G00-00-2308409
Natural Resources Conservation Service G00-04-2280100
Research Economics Education G01-91-2827118
Rural Development G00-09-0025460
Bureau of the Census G02-93-4116586
STATUSA Database G00-10-3137809
Bureau of Export Administration G00-03-1901246
FEDWorld G00-06-4174747
International Trade Administration G00-00-3667859
ITA G00-00-3667859
National Institute of Standards Technology G40-04-1519418
NIST G40-04-1519418
National Marine Fisheries Service G46-01-2225985
NMFS G46-01-2225985
National Oceanic Atmospheric Administration G21-42-3486883
NOAA G21-42-3486883
National Ocean Service G00-03-1496820
National Technical Information Service G01-03-0674427
NTIS G01-03-0674427
National Telecommunications Information Administration G00-05-1550998
National Weather Service G00-10-2171731
Department of Education G00-03-2042174
Educational Resources Information Center G08-78-1802103
ERIC G08-78-1802103
National Library of Education G04-56-3588687
NLE G04-56-3588687
Department of Energy G00-06-1479477
Office of Economic Impact and Diversity G05-02-2264248
Southwestern Power Administration G00-11-0259770
Department of Health and Human Services G00-00-3031135
HHS G00-00-3031135
Administration for Children and Families G29-19-2177375
Agency for Health Care Research and Quality G00-01-0960846
AHCRQ G00-01-0960846
Centers for Disease Control and Prevention G08-82-2708305
CDC G08-82-2708305
Food and Drug Administration G00-01-3511414
FDA G00-01-3511414
Health Care Financing Administration G00-03-3635966
National Institutes of Health G00-01-3774693
NIH G00-01-3774693
National Library of Medicine G00-06-1119476
NLM G00-06-1119476
Department of Housing and Urban Development G19-73-3432233
HUD G19-73-3432233
Government National Mortgage Association G37-23-0000000
Ginnie Mae G37-23-0000000
Housing and Urban Development Reading Room G12-73-4081497
Office of Healthy Homes and Lead Hazard Control G10-39-2062297
Public and Indian Housing Agencies G12-36-3618097
Department of the Interior G00-09-2318516
DOI G00-09-2318516
Bureau of Land Management G00-00-2056373
BLM G00-00-2056373
Geological Survey G01-26-3878517
National Park Service G00-03-0029179
Office of Surface Mining G00-44-0995015
Department of Justice G00-04-3171772
DOJ G00-04-3171772
Drug Enforcement Agency G00-72-4001908
DEA G00-72-4001908
Federal Bureau of Investigation G01-84-2237979
FBI G01-84-2237979
Federal Bureau of Prisons G00-03-2244949
Immigration and Naturalization Service G04-47-1027920
INS G04-47-1027920
Office of Justice Programs G00-52-2562368
OJP G00-52-2562368
United States Marshals Service G04-91-1779147
USMS G04-91-1779147
Department of Labor G19-13-1577185
DOL G19-13-1577185
Bureau of Labor Statistics G39-37-3612440
G00-01-0682299
BLS G39-37-3612440
G00-01-0682299
Mine Safety and Health Administration G00-10-3730888
Occupational Safety Health Administration G00-09-2693851
OSHA G00-09-2693851
Department of State G00-58-0058694
DOS G00-58-0058694
Department of State Library G00-18-1147964
Department of Transportation G01-50-1226182
DOT G01-50-1226182
Bureau of Transportation Statistics G00-01-3065065
Federal Aviation Administration G00-06-2330537
FAA G00-06-2330537
National Transportation Library G00-03-1771651
Department of the Treasury G00-03-3649117
Bureau of Alcohol Tobacco Firearms G04-24-1874467
ATF G04-24-1874467
Bureau of Engraving and Printing G00-01-0534347
Bureau of Public Debt G00-04-1219947
Executive Office for Asset Forfeiture G04-75-2804241
Financial Crimes Enforcement Network G03-33-2329825
Financial Management Service G00-10-2794731
FMS G00-10-2794731
Internal Revenue Service IRS G01-42-2236557
G27-81-0697864
Office of Thrift Supervision G00-10-2917540
OTS G00-10-2917540
Secret Service G03-62-1819147
US Customs Service G26-69-3739619
US Mint G01-38-0907787
Department of Veterans Affairs G07-29-0536719
Advisory Council on Historic Preservation G00-08-1007258
ACHP G00-08-1007258
American Battle Monuments Commission G08-41-4046345
Central Intelligence Agency G06-34-0212798
G00-04-0693582
CIA G06-34-0212798
G00-04-0693582
Commodity Futures Trading Commission G00-16-3850519
CFTC G00-16-3850519
Consumer Product Safety Commission G00-03-1848726
CPSC G00-03-1848726
Corporation for National Service G00-08-4188069
Environmental Protection Agency G00-00-0029827
EPA G00-00-0029827
Equal Employment Opportunity Commission G00-79-1517391
EEOC G00-79-1517391
Farm Credit Administration G00-07-3398062
FCA G00-07-3398062
Federal Communications Commission G36-78-0130889
FCC G36-78-0130889
Federal Deposit Insurance Corporation G01-51-0988286
FDIC G01-51-0988286
Federal Election Commission G00-06-3072823
FEC G00-06-3072823
Federal Emergency Management Agency G00-03-2245885
FEMA G00-03-2245885
Federal Energy Regulatory Commission G00-05-0212361
FERC G00-05-0212361
Federal Labor Relations Authority G00-07-2059058
FLRA G00-07-2059058
Federal Maritime Commission G00-00-2164772
Federal Retirement Thrift Investment Board G00-06-0905797
FRTIB G00-06-0905797
Federal Trade Commission G03-32-2819928
FTC G03-32-2819928
General Services Administration G00-05-1904668
GSA G00-05-1904668
Federal Consumer Information Center Pueblo CO G22-50-0922418
Institute of Museum and Library Services G00-11-0472793
IMLS G00-11-0472793
International Broadcasting Bureau G00-06-1636322
IBB G00-06-1636322
Merit Systems Protection Board G01-60-1363045
MSPB G01-60-1363045
National Archives and Records Administration G00-02-1372443
NARA G00-02-1372443
National Capital Planning Commission G00-08-1222422
NCPC G00-08-1222422
National Commission on Libraries and Information Science NCLIS G00-05-0712949
NCLIS G00-05-0712949
National Council on Disability G00-08-0435196
National Credit Union Administration G42-74-1917577
NCUA G42-74-1917577
National Endowment for the Arts G00-00-3681135
NEA G00-00-3681135
National Mediation Board G00-06-2661322
NMB G00-06-2661322
National Science Foundation NSF G00-07-1120880
NSF G00-07-1120880
National Transportation Safety Board G00-02-1479121
NTSB G00-02-1479121
Nuclear Regulatory Commission G00-11-0770745
NRC G00-11-0770745
Nuclear Waste Technical Review Board G00-05-1894408
NWTRB G00-05-1894408
Occupational Safety and Health Administration G00-09-2693851
OSHA G00-09-2693851
Office of Federal Housing Enterprise Oversight G00-07-2732685
OFHEO G00-07-2732685
Office of Personnel Management G01-78-1330378
OPM G01-78-1330378
Office of Special Counsel G12-71-1037814
G00-09-3815798
OSC G12-71-1037814
G00-09-3815798
Overseas Private Investment Corporation G00-03-1048747
OPIC G00-03-1048747
Peace Corps G12-14-0612098
Pension Benefit Guaranty Corporation G00-08-2596456
Postal Rate Commission G00-10-2861072
Railroad Retirement Board G00-00-2016453
RRB G00-00-2016453
Securities and Exchange Commission G00-05-3121512
SEC G00-05-3121512
Selective Service System G00-08-4021223
SSS G00-08-4021223
Social Security Administration G03-24-2061352
SSA G03-24-2061352
Tennessee Valley Authority G00-07-2267029
TVA G00-07-2267029
Thrift Savings Plan G00-04-2615580
TSP G00-04-2615580
United States Arms Control and Disarmament Agency G00-50-1769358
ACDA G00-50-1769358
United States International Trade Commission G00-00-0300859
USITC G00-00-0300859
Dataweb G00-00-1961652
United States Office of Government Ethics G01-28-2830345
United States Postal Service G00-07-4137777
USPS G00-07-4137777
United States Trade and Development Agency G00-02-0555602
Voice of America G00-22-0758032
Broadcasting Bureau of Governors G01-30-3859822
Task Force on Agricultural Air Quality Research G01-51-3170401
White House Commission on Aviation Safety and Security G12-57-0619425
Radio and TV Marti G01-88-3234145
Judicial Branch G00-03-1342151
Legislative Branch G02-36-2411536
G02-32-2010279
Library of Congress G00-03-097897
Table G.1: .GOV home page finding training set. Generated us-
ing the automated sitemap method (described in Section 2.6.5.3)
on the first.gov listing of government departments.
Bibliography
1. ABITEBOUL, S., PREDA, M., AND COBENA, G. Adaptive On-Line Page Impor-
tance Computation. In Proceedings of WWW2003 (Budapest, Hungary, May 2003).
2. ADAMIC, L. A. The small World Wide Web. In Proceedings of ECDL’99 (Paris,
France, 1999), pp. 443–452.
3. ADAMIC, L. A. Zipf, Power-laws, and Pareto - a ranking tutorial. Tech.
rep., Information Dynamics Lab, HP Labs, 2000. http://www.hpl.hp.com/
research/idl/papers/ranking/ranking.html.
4. ADAMIC, L. A., AND HUBERMAN, B. A. The Nature of Markets in the World
Wide Web. Quarterly Journal of Economic Commerce 1 (2000), 5–12.
5. ADAMIC, L. A., AND HUBERMAN, B. A. The Web’s Hidden Order. Communica-
tions of the ACM 44, 9 (September 2001).
6. ALBERT, R., BARABASI, A., AND JEONG, H. Diameter of the World Wide Web.
Nature 401, 9 (September 1999), 103–131.
7. ALTAVISTA. AltaVista. http://www.altavista.com, accessed 10/12/2003.
8. AMENTO, B., TERVEEN, L. G., AND HILL, W. C. Does “authority” mean qual-
ity? Predicting expert quality ratings of Web documents. In Proceedings of ACM
SIGIR’00 (Athens, Greece, July 2000), pp. 296–303.
9. AMITAY, E., CARMEL, D., DARLOW, A., LEMPEL, R., AND SOFFER, A. The
Connectivity Sonar: Detecting Site Functionality by Structural Patterns. In Pro-
ceedings of ACM HT’03 (Nottingham, United Kingdom, August 2003).
10. APACHE. Welcome! - The Apache HTTP Server Project, 2004. http://httpd.
apache.org, accessed 12/11/2004.
11. ARASU, A., NOVAK, J., TOMKINS, A., AND TOMLIN, J. PageRank Computation
and the Structure of the Web: Experiments and Algorithms. In Proceedings of
WWW2002 (Hawaii, USA, May 2002).
12. AUSTRALIA POST. Australia post, 2004. http://www.australiapost.com.
au, accessed 12/11/2004.
13. AYAN, N. F., LI, W.-S., AND KOLAK, O. Automating extraction of logical do-
mains in a web site. Data and Knowledge Engineering 43, 2 (November 2002), 179–
205.
14. BAEZA-YATES, R., AND RIBEIRO-NETO, B. Modern Information Retrieval. Addi-
son Wesley, 1999.
15. BAILEY, P., CRASWELL, N., AND HAWKING, D. Engineering a multi-purpose
test collection for Web retrieval experiments. Information Processing and Man-
agement 39, 6 (2003), 853–871. http://es.cmis.csiro.au/pubs/bailey
ipm03.pdf.
16. BALDI, P., FRASCONI, P., AND SMYTH, P. Modeling the Internet and the Web:
Probabilistic Methods and Algorithms. Wiley, 2003.
17. BARABASI, A.-L., AND ALBERT, R. Emergence of Scaling in Random Networks.
Science 286 (October 1999).
18. BARABASI, A.-L., ALBERT, R., AND JEONG, H. Scale-free characteristics of ran-
dom networks: the topology of the World-Wide Web. Physica A 281 (2000), 69–
77.
19. BERGER, A., AND LAFFERTY, J. D. Information Retrieval as Statistical Transla-
tion. In Proceedings of ACM SIGIR’99 (Berkeley, CA, USA, 1999), pp. 222–229.
20. BERNERS-LEE, T. Weaving the Web. The Original Design and Ultimate Destiny of the
World Wide Web by its Inventor. Harper Collins, San Francisco, 1999.
21. BERNERS-LEE, T., FIELDING, R., AND MASINTER, L. RFC2396 – Uniform Re-
source Identifiers. Request for Comments, August 1998.
22. BERRY, M. W., DUMAIS, S. T., AND O’BRIEN, G. W. Using Linear Algebra for
Intelligent Information Retrieval. Tech. rep., University of Tennessee, Depart-
ment of Computer Science, December 1994.
23. BHARAT, K., AND BRODER, A. Mirror, Mirror on the Web: A Study of Host Pairs
with Replicated Content. In Proceedings of WWW8 (Toronto, Canada, May 1999).
http://www8.org/w8-papers/4c-server/mirror/mirror.html.
24. BHARAT, K., BRODER, A., DEAN, J., AND HENZINGER, M. A Comparison of
Techniques to Find Mirrored Hosts on the WWW. In WOWS’99 (Berkeley, USA,
August 1999). http://www.henzinger.com/monika/.
25. BHARAT, K., CHANG, B., HENZINGER, M., AND RUHL, M. Who links to whom:
Mining linkage between Web sites. In Proceedings of ICDM’01 (San Jose, USA,
November 2001).
26. BHARAT, K., AND HENZINGER, M. Improved Algorithms for Topic Distilla-
tion in a Hyperlinked Environment. In Proceedings of ACM SIGIR’98 (Melbourne,
Australia, 1998).
27. BHARAT, K., AND MIHAILA, G. A. When Experts Agree: Using Non-Affiliated
Experts to Rank Popular Topics. In Proceedings of WWW2001 (Hong Kong, 2001).
http://www10.org/cdrom/papers/474/.
28. BOOKSTEIN, A. Implications of Boolean Structures for Probabilistic Retrieval. In
Proceedings of ACM SIGIR’85 (New York, USA, 1985), pp. 11–17.
29. BOTAFOGO, R., RIVLIN, E., AND SHNEIDERMAN, B. Structural Analysis of Hy-
pertexts: Identifying Hierarchies and Useful Metrics. ACM Transactions on Infor-
mation Systems 10, 2 (1992), 142–180.
30. BRAY, T. Measuring the Web. In Proceedings of WWW5 (Paris, France, May 1996).
31. BRIN, S., AND PAGE, L. The anatomy of a large-scale hypertextual web search
engine. In Proceedings of WWW7 (Brisbane, Australia, May 1998). http:
//www7.scu.edu.au/programme/fullpapers/1921/com1921.htm.
32. BRODER, A. On the Resemblance and Containment of Documents. In Proceed-
ings of SEQS’97 (1997).
33. BRODER, A. A taxonomy of web search. ACM SIGIR Forum 36, 2 (Fall 2002),
3–10.
34. BRODER, A., GLASSMAN, S., MANASSE, M., AND ZWEIG, G. Syntactic
Clustering of the Web. In Proceedings of WWW6 (Santa Clara, USA, April
1997). http://www.scope.gmd.de/info/www6/technical/paper205/
paper205.html.
35. BRODER, A., KUMAR, R., MAGHOUL, F., RAGHAVAN, P., RAJAGOPALAN, S.,
STATA, R., TOMKINS, A., AND WIENER, J. Graph structure in the Web: ex-
periments and models. In Proceedings of WWW9 (Amsterdam, 2000). http:
//www9.org/w9cdrom/index.html.
36. BUCKLEY, C., AND VOORHEES, E. Evaluating evaluation measure stability. In
Proceedings of ACM SIGIR’00 (Athens, Greece, July 2000), pp. 33–40.
37. CAI, D., YU, S., WEN, J.-R., AND MA, W.-Y. VIPS: a Vision-based Page Segmen-
tation Algorithm. Tech. rep., Microsoft Research Asia, 2003. MSR-TR-2003-79.
38. CAI, D., YU, S., WEN, J.-R., AND MA, W.-Y. Block-based web search. In Pro-
ceedings of ACM SIGIR’04 (Sheffield, UK, July 2004), pp. 456–463.
39. CAI, D., YU, S., WEN, J.-R., AND MA, W.-Y. Block-level Link Analysis. In
Proceedings of ACM SIGIR’04 (Sheffield, UK, July 2004), pp. 440–447.
40. CALADO, P., RIBEIRO-NETO, B., ZIVIANI, N., MOURA, E., AND SILVA, I. Local
Versus Global Link Information in the Web. ACM Transactions on Information
Systems 21, 1 (January 2003), 42–63.
41. CARRIÈRE, S. J., AND KAZMAN, R. WebQuery: Searching and visualizing
the Web through connectivity. In Proceedings of WWW6 (Santa Clara, USA,
1997), pp. 701–711. http://www.scope.gmd.de/info/www6/technical/
paper096/paper96.html.
42. CHAKRABARTI, S. Integrating the Document Object Model with Hyperlinks
for Enhanced Topic Distillation and Information Extraction. In Proceedings of
WWW2001 (Hong Kong, 2001), pp. 211–220.
43. CHAKRABARTI, S. Mining the Web: Discovering knowledge from hypertext data.
Morgan Kaufmann, San Francisco, 2003.
44. CHAKRABARTI, S., DOM, B., RAGHAVAN, P., RAJAGOPALAN, S., AND KLEIN-
BERG, J. Automatic resource compilation by analyzing hyperlink structure and
associated text. In Proceedings of WWW7 (Melbourne, Australia, 1998), pp. 65–74.
45. CHAKRABARTI, S., JOSHI, M., AND TAWDE, V. Enhanced Topic Distillation us-
ing Text, Markup Tags, and Hyperlinks. In Proceedings of ACM SIGIR’01 (New
Orleans, USA, 2001), pp. 208–216.
46. CHO, J., GARCÍA-MOLINA, H., AND PAGE, L. Efficient crawling through URL
ordering. Computer Networks and ISDN Systems 30, 1–7 (1998), 161–172.
47. CHOWDHURY, A., FRIEDER, O., GROSSMAN, D., AND MCCABE, M. Collection
Statistics for Fast Duplicate Document Detection. ACM Transactions on Informa-
tion Systems 20, 2 (April 2002), 171–191.
48. CLEVERDON, C., MILLS, J., AND KEEN, M. Factors determining the perfor-
mance of indexing systems. In ASLib Cranfield Project. Cranfield, 1966.
49. CLEVERDON, C. W. Optimizing convenient online access to bibliographic data-
bases. Information Services and Use 4 (1984), 37–47.
50. COLLINS-THOMPSON, K., OGILVIE, P., ZHANG, Y., AND CALLAN, J. Informa-
tion Filtering, Novelty Detection, and Named-Page Finding. In TREC-11 Note-
book Proceedings (Gaithersburg, Maryland USA, November 2002), NIST.
51. COOPER, W. S. Getting beyond Boole. Information Processing and Management:
An International Journal 24 (May 1988), 243–248.
52. CRASWELL, N., CRIMMINS, F., HAWKING, D., AND MOFFAT, A. Performance
and cost tradeoffs in web search. In ADC’04 (Dunedin, New Zealand, January
2004), pp. 161–170. http://es.csiro.au/pubs/craswell adc04.pdf.
53. CRASWELL, N., AND HAWKING, D. Overview of the TREC-2002 Web Track. In
TREC-11 Notebook Proceedings (Gaithersburg, MD, USA, November 2002).
54. CRASWELL, N., AND HAWKING, D. TREC-2004 Web Track Guidelines, July
2004. http://es.csiro.au/TRECWeb/guidelines 2004.html, accessed
10/11/2004.
55. CRASWELL, N., AND HAWKING, D. Characteristics of human-generated re-
source lists. Unpublished (In submission).
56. CRASWELL, N., HAWKING, D., AND ROBERTSON, S. Effective site finding us-
ing link anchor information. In Proceedings of ACM SIGIR’01 (New Orleans,
USA, 2001), pp. 250–257. http://es.cmis.csiro.au/pubs/craswell
sigir01.pdf.
57. CRASWELL, N., HAWKING, D., THOM, J., UPSTILL, T., WILKINSON, R., AND
WU, M. TREC11 Web and Interactive Tracks at CSIRO. In TREC-11 Notebook
Proceedings (Gaithersburg, MD, USA, November 2002).
58. CRASWELL, N., HAWKING, D., THOM, J., UPSTILL, T., WILKINSON, R., AND
WU, M. TREC12 Web Track at CSIRO. In TREC-12 Notebook Proceedings
(Gaithersburg, MD, USA, November 2003).
59. CRASWELL, N., HAWKING, D., WILKINSON, R., AND WU, M. TREC10 Web
and Interactive Tracks at CSIRO. In TREC-10 Notebook Proceedings (Gaithersburg,
MD, USA, November 2001). http://es.cmis.csiro.au/pubs/craswell
trec01.pdf.
60. CRASWELL, N., HAWKING, D., WILKINSON, R., AND WU, M. Overview of
the TREC-2003 Web Track. In TREC-12 Notebook Proceedings (Gaithersburg, MD,
USA, November 2003).
61. CROFT, W. B., AND HARPER, D. J. Using probabilistic models of document
retrieval without relevance information. Journal of Documentation 35 (1979), 285–
295.
62. CSIRO. TREC Web Corpus: WT10g, 2003. http://es.csiro.au/TRECWeb/
wt10g.html, accessed 12/11/2004.
63. DAVISON, B. D. Recognizing Nepotistic Links on the Web. In Proceedings of
AAAI’00 (Workshop on Artificial Intelligence for Web Search) (Austin, Texas USA,
2000), pp. 23–28.
64. DAVISON, B. D. Topical Locality in the Web. In Proceedings of ACM SIGIR’00
(Athens, Greece, July 2000), pp. 272–279.
65. DAVISON, B. D. Topical Locality in the Web: Experiments and Observations.
Tech. rep., Department of Computer Science, Rutgers, New Jersey, July 2000.
66. DEERWESTER, S., DUMAIS, S. T., FURNAS, G. W., LANDAUER, T. K., AND
HARSHMAN, R. Indexing by Latent Semantic Analysis. JASIS 41, 6 (1990), 391–
407.
67. DILL, S., KUMAR, R., MCCURLEY, K. S., RAJAGOPALAN, S., SIVAKUMAR, D.,
AND TOMKINS, A. Self-Similarity in the Web. ACM Transactions On Internet
Technologies 2, 3 (August 2002), 205–223.
68. DING, C., HE, X., HUSBANDS, P., ZHA, H., AND SIMON, H. PageRank, HITS
and a unified framework for link analysis. Tech. Rep. 49372, LBNL, 2002. http:
//citeseer.nj.nec.com/546720.html.
69. DMOZ. Open Directory Project. http://www.dmoz.org, accessed
12/11/2004.
70. DUBLIN CORE METADATA INITIATIVE. Dublin Core Metadata Element
Set, Version 1.1: Reference Description, 2003. http://dublincore.org/
documents/dces/, accessed 14/11/2004.
71. DUBLIN CORE METADATA INITIATIVE. DCMI Frequently Asked Ques-
tions (FAQ) – What search engines support the Dublin Core Metadata
Element Set?, 2004. http://www.dublincore.org/resources/faq/
#whatsearchenginessupport, accessed 14/11/2004.
72. DWORK, C., KUMAR, R., NAOR, M., AND SIVAKUMAR, D. Rank aggregation
methods for the Web. In Proceedings of WWW2001 (Hong Kong, 2001), pp. 613–
622. http://doi.acm.org/10.1145/371920.372165.
73. EIRON, N., AND MCCURLEY, K. S. Analysis of Anchor Text for Web Search.
Tech. rep., IBM, 2003.
74. EIRON, N., AND MCCURLEY, K. S. Analysis of Anchor Text for Web Search
(Extended Abstract). In Proceedings of ACM SIGIR’03 (Toronto, Canada, 2003),
pp. 450–460.
75. EIRON, N., AND MCCURLEY, K. S. Untangling Compound Documents on the
Web. Tech. rep., IBM, 2003.
76. EISENBERG, M., AND BARRY, C. Order effects: A study of the possible influence
of presentation order on user judgments of document relevance. JASIS 39, 5
(1988), 293–300.
77. EXCITE. Excite, 2004. http://www.excite.com, accessed 12/11/2004.
78. FAGIN, R., KUMAR, R., MCCURLEY, K. S., NOVAK, J., SIVAKUMAR, D., TOM-
LIN, J. A., AND WILLIAMSON, D. P. Searching the Workplace Web. In Proceed-
ings of WWW2003 (Budapest, Hungary, May 2003), pp. 366–375.
79. FAGIN, R., KUMAR, R., AND SIVAKUMAR, D. Comparing top k lists. In ACM
SIAM (Baltimore, MD, USA, 2003), pp. 28–36.
80. FAST SEARCH AND TRANSFER, ASA. Personal communication, 2004. http:
//www.alltheweb.com, accessed 12/11/2003.
81. FIELDING, R. RFC2616 - HTTP/1.1: Status Code Definitions, 1999. http://www.
w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3, accessed
12/11/2004.
82. FORTUNE. Fortune 500, 2003. http://www.fortune.com/fortune/
fortune500, accessed 06/09/2003.
83. FOX, E., AND SHAW, J. Combination of multiple searches. In TREC-3 Notebook
Proceedings (Gaithersburg, MD, USA, 1994), pp. 243–252.
84. FRAKES, W., AND BAEZA-YATES, R., Eds. Information Retrieval: Data Structures
and Algorithms. Prentice Hall, 1992.
85. FUHR, N., LALMAS, M., KAZAI, G., AND GÖVERT, N. Proceedings of the INitia-
tive for the Evaluation of XML Retrieval (INEX). In ERCIM workshop proceedings
(Dagstuhl, 2003).
86. FUJIMURA, K., INOUE, T., AND SUGISAKI, M. The EigenRumor Algorithm for
Ranking Blogs. In 2nd Annual Workshop on the Weblogging Ecosystem - Aggregation,
Analysis and Dynamics (Chiba, Japan, 2005).
87. GARFIELD, E. Citation Indexes for Science: A New Dimension in Documentation
through Association of Ideas. Science 122, 3159 (1955), 108–111.
88. GARFIELD, E. Citation analysis as a tool in journal evaluation. Science 178, 4060
(1972), 471–479.
89. GARNER, R. A Computer Oriented, Graph Theoretic Analysis of Citation Index Struc-
tures. Drexel University Press, Philadelphia, 1967.
90. GLOVER, E. J., TSIOUTSIOULIKLIS, K., LAWRENCE, S., PENNOCK, D. M., AND
FLAKE, G. W. Using Web Structure for Classifying and Describing Web Pages.
In Proceedings of WWW2002 (Honolulu, Hawaii, USA, May 2002).
91. GOLUB, G. H., AND VAN LOAN, C. F. Matrix Computations. The Johns Hopkins
University Press, Baltimore, USA, 1996.
92. GOOGLE. Blogger. http://www.blogger.com, accessed 06/11/2005.
93. GOOGLE. Google search engine. http://www.google.com, accessed
12/11/2004.
94. GOOGLE. Google Directory > Shopping Publications > Books > Gen-
eral, September 2002. http://directory.google.com/Top/Shopping/
Publications/Books/General, accessed 09/09/2002.
95. GOOGLE. Google Directory, 2004. http://directory.google.com/, ac-
cessed 12/11/2004.
96. GOOGLE. Google Search Appliance Frequently Asked Questions, 2004. http:
//www.google.com/appliance/faq.html, accessed 12/11/2004.
97. GOOGLE. Google Technology, 2004. http://www.google.com/
technology/, accessed 10/11/2004.
98. GOOGLE. Google Toolbar, 2004. http://toolbar.google.com/, accessed
12/11/2004.
99. GRANKA, L., JOACHIMS, T., AND GAY, G. Eye-Tracking Analysis of User Behav-
ior in WWW Search. In Proceedings of ACM SIGIR’04 (Sheffield, United Kingdom,
August 2004).
100. GURRIN, C., AND SMEATON, A. F. Replicating Web Structure in Small-Scale Test
Collections. Information Retrieval 7 (2004), 239–263.
101. HARMAN, D. How effective is suffixing? JASIS 42, 1 (1991), 7–15.
102. HAVELIWALA, T. H. Efficient computation of PageRank. Tech. Rep. 1999-31,
Stanford University Database Group, 1999. http://dbpubs.stanford.edu:
8090/pub/1999-31.
103. HAVELIWALA, T. H. Topic-sensitive pagerank. In Proceedings of WWW2002
(Honolulu, Hawaii, USA, 2002), ACM Press, pp. 517–526.
104. HAVELIWALA, T. H. Topic-Sensitive PageRank: A Context-Sensitive Ranking
Algorithm for Web Search. In IEEE Transactions on Knowledge and Data Engineer-
ing (July 2003).
105. HAVELIWALA, T. H., AND KAMVAR, S. D. The Second Eigenvalue of the Google
Matrix. Tech. rep., Stanford University, 2003.
106. HAWKING, D. Overview of the TREC-9 Web Track. In TREC-9 Notebook Pro-
ceedings (Gaithersburg, MD, USA, 2000). http://trec.nist.gov/pubs/
trec9/.
107. HAWKING, D. Challenges in enterprise search. In Proceedings of the Australasian
Database Conference ADC2004 (Dunedin, New Zealand, January 2004), pp. 15–26.
Invited paper: http://es.csiro.au/pubs/hawking adc04keynote.pdf.
108. HAWKING, D., BAILEY, P., AND CRASWELL, N. An intranet reality check for
TREC ad hoc. Tech. rep., CSIRO Mathematical and Information Sciences, 2000.
http://es.cmis.csiro.au/pubs/hawking tr00.pdf.
109. HAWKING, D., BAILEY, P., AND CRASWELL, N. Efficient and flexible search
using text and metadata. Tech. rep., CSIRO Mathematical and Information Sci-
ences, 2000. http://es.csiro.au/pubs/hawking tr00b.pdf.
110. HAWKING, D., AND CRASWELL, N. Overview of the TREC-2001 Web Track. In
TREC-10 Notebook Proceedings (Gaithersburg, MD, USA, 2001). http://trec.
nist.gov/pubs/.
111. HAWKING, D., AND CRASWELL, N. Very large scale retrieval and web search. In
TREC: Experiment and Evaluation in Information Retrieval, E. Voorhees and D. Har-
man, Eds. MIT Press, 2005. http://es.csiro.au/pubs/trecbook for
website.pdf.
112. HAWKING, D., CRASWELL, N., BAILEY, P., AND GRIFFITHS, K. Measuring
search engine quality. Information Retrieval 4, 1 (2001), 33–59. http://es.
cmis.csiro.au/pubs/hawking ir01.pdf.
113. HAWKING, D., CRASWELL, N., CRIMMINS, F., AND UPSTILL, T. Enterprise
search: What works and what doesn’t. In Proceedings of the Infonortics Search
Engines Meeting (San Francisco, April 2002). http://es.csiro.au/pubs/
hawking se02talk.pdf.
114. HAWKING, D., CRASWELL, N., CRIMMINS, F., AND UPSTILL, T. How valu-
able is external link evidence when searching enterprise webs? In Proceedings
of ADC’04 (Dunedin, New Zealand, January 2004). http://es.cmis.csiro.
au/pubs/hawking adc04.pdf.
115. HAWKING, D., CRASWELL, N., CRIMMINS, F., AND UPSTILL, T. How Valuable
is External Link Evidence when Searching Enterprise Webs? In Proceedings of
ADC’04 (Dunedin, New Zealand, January 2004). http://es.cmis.csiro.
au/pubs/hawking adc04.pdf.
116. HAWKING, D., CRASWELL, N., AND GRIFFITHS, K. Which search engine is best
at finding online services? In Proceedings of WWW10 (Hong Kong, 2001). http:
//www10.org/cdrom/posters/1089.pdf.
117. HAWKING, D., CRASWELL, N., THISTLEWAITE, P., AND HARMAN, D. Results
and challenges in Web search evaluation. In Proceedings of WWW8 (Toronto,
Canada, 1999), vol. 31, pp. 1321–1330. http://es.cmis.csiro.au/pubs/
hawking www99.pdf.
118. HAWKING, D., AND ROBERTSON, S. On Collection Size and Retrieval Effective-
ness. Information Retrieval 6, 1 (2003), 99–150.
119. HAWKING, D., AND THISTLEWAITE, P. Overview of TREC-6 Very Large Collec-
tion Track. In TREC-6 Notebook Proceedings (Gaithersburg, MD, USA, 1997), E. M.
Voorhees and D. K. Harman, Eds., pp. 93–105.
120. HAWKING, D., UPSTILL, T., AND CRASWELL, N. Towards better weighting of
anchors. In Proceedings of SIGIR’04 (Sheffield, England, July 2004), pp. 512–513.
http://es.csiro.au/pubs/hawking sigirposter04.pdf.
121. HAWKING, D., VOORHEES, E., BAILEY, P., AND CRASWELL, N. Overview of
TREC-8 Web Track. In TREC-8 Notebook Proceedings (Gaithersburg, MD, USA,
1999), pp. 131–150. http://trec.nist.gov/pubs/trec-8.
122. HENZINGER, M., MOTWANI, R., AND SILVERSTEIN, C. Challenges in Web
Search Engines. ACM SIGIR Forum 36, 2 (Fall 2002).
123. HEYDON, A., AND NAJORK, M. Mercator: A Scalable, Extensible Web Crawler.
World Wide Web Journal (December 1999), 219–229. http://www.research.
digital.com/SRC/mercator/.
124. HORRIGAN, J. B., AND RAINIE, L. PEW Internet & American life project:
Getting serious online, March 2002. http://www.pewinternet.org/
reports/reports.asp?Report=55&Section=ReportLevel1&Field=
Level1ID&ID=241, accessed 12/11/2004.
125. HUBBELL, C. H. An Input-Output Approach to Clique Identification. Sociometry
28 (1965), 377–399.
126. HULL, D. Stemming algorithms – a case study for detailed evaluation. JASIS 47,
1 (1996), 70–84.
127. JEH, G., AND WIDOM, J. Scaling personalized web search. In Proceedings of
WWW2003 (Budapest, Hungary, 2003), pp. 271–279.
128. JING, Y., AND CROFT, W. B. An association thesaurus for information retrieval.
In Proceedings of RIAO’94 (New York, USA, 1994), pp. 146–160.
129. JOACHIMS, T. Evaluating Retrieval Performance Using Clickthrough Data. In
Proceedings of ACM SIGIR’02 Workshop on Mathematical/Formal Methods in Infor-
mation Retrieval (Tampere, Finland, 2002).
130. KAMVAR, S. D., HAVELIWALA, T. H., MANNING, C. D., AND GOLUB, G. H.
Exploiting the block structure of the web for computing PageRank. Tech. rep.,
Stanford University, 2003.
131. KATZ, L. A new status index derived from sociometric analysis. Psychometrika
18, 1 (March 1953), 39–43.
132. KLEINBERG, J. M. Authoritative Sources in a Hyperlinked Environment. Journal
of the ACM 46, 5 (1999), 604–632.
133. KOSTER, M. robotstxt.org, 2003. http://www.robotstxt.org/, accessed
12/11/2003.
134. KRAAIJ, W., AND POHLMANN, R. Viewing Stemming as Recall Enhancement.
In Proceedings of ACM SIGIR’96 (Zurich, Switzerland, 1996), pp. 40–48.
135. KRAAIJ, W., WESTERVELD, T., AND HIEMSTRA, D. The Importance of Prior
Probabilities for Entry Page Search. In Proceedings of ACM SIGIR’02 (Tampere,
Finland, 2002), pp. 27–34.
136. KUMAR, S. R., RAGHAVAN, P., RAJAGOPALAN, S., SIVAKUMAR, D., TOMKINS,
A., AND UPFAL, E. The Web as a Graph. In Symposium on Principles of Database
Systems (Dallas, Texas USA, 2000), pp. 1–10.
137. KUMAR, S. R., RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. Trawling
the Web for emerging cyber-communities. In Proceedings of WWW8 (Toronto,
Canada, 1999), pp. 403–415.
138. LARSON, R. R. Bibliometrics of the World Wide Web: An exploratory analysis
of the intellectual architecture of cyberspace. Tech. rep., Computer Science De-
partment, University of California, Santa Barbara, 1996. http://sherlock.
berkeley.edu/asis96/asis96.html.
139. LAWRENCE, S., AND GILES, C. L. Searching the World Wide Web. Science 280,
5360 (1998).
140. LEMPEL, R., AND MORAN, S. The stochastic approach for link-structure analysis
(SALSA) and the TKC effect. Computer Networks 33, 1–6 (2000), 387–401.
141. LEMPEL, R., AND MORAN, S. SALSA: The stochastic approach for link-structure
analysis. ACM Transactions on Information Systems (2001).
142. LI, W.-S., KOLAK, O., AND VU, Q. Defining Logical Domains in a Web Site. In
Proceedings of HT’00 (San Antonio, Texas USA, 2000).
143. LI, Y., AND RAFSKY, L. Beyond Relevance Ranking: Hyperlink Vector Vot-
ing. In Proceedings of ACM SIGIR’97 Workshop on Networked Information Retrieval
(Philadelphia, USA, 1997).
144. LOOKSMART. Looksmart, 2003. http://www.looksmart.com, accessed
12/11/2004.
145. MARCHIORI, M. The Quest for Correct Information on the Web: Hyper Search
Engines. In Proceedings of WWW6 (Santa Clara, USA, 1997), pp. 265–276.
146. MARON, M., AND KUHNS, J. On Relevance, Probabilistic Indexing and Infor-
mation Retrieval. Journal of the ACM 7, 3 (1960), 216–244.
147. MCKELLEHER, K. The Wired 40, July 2003. http://www.wired.com/
wired/archive/11.07/40main.html, accessed 06/09/2003.
148. MICROSOFT. Internet Information Services, 2004. http://www.microsoft.
com/windowsserver2003/iis/default.mspx, accessed 11/12/2004.
149. MICROSOFT. MSN Search Engine, 2004. http://search.msn.com, accessed
11/12/2004.
150. MIZZARO, S. Relevance: The Whole History. JASIS 48, 9 (1997), 810–832.
151. MONTAGUE, M. Metasearch: Data fusion for Document Retrieval. PhD thesis, Dart-
mouth College, Hanover, New Hampshire, 2002.
152. NETSCAPE. Core JavaScript Guide 1.5, 2000. http://devedge.netscape.
com/library/manuals/2000/javascript/1.5/guide/.
153. NEW YORK TIMES. Bestsellers. Web Site, September 2002. http://www.
nytimes.com/2002/09/01/books/bestseller/, accessed 09/09/2002.
154. NG, A. Y., ZHENG, A. X., AND JORDAN, M. I. Link analysis, eigenvectors, and
stability. In Proceedings of IJCAI’01 (Seattle, USA, 2001), ACM Press.
155. OGILVIE, P., AND CALLAN, J. Combining document representations for known-
item search. In Proceedings of ACM SIGIR’03 (Toronto, Canada, August 2003),
pp. 143–150.
156. OGILVIE, P., AND CALLAN, J. Combining structural information and the use
of priors in mixed named-page and homepage finding. In TREC-12 Notebook
Proceedings (Gaithersburg, MD, USA, November 2003), NIST.
157. PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. The PageRank Cita-
tion Ranking: Bringing Order to the Web. Tech. Rep. 1999-66, Stanford Uni-
versity Database Group, 1998. http://dbpubs.stanford.edu:8090/pub/
1999-66.
158. PANDURANGAN, G., RAGHAVAN, P., AND UPFAL, E. Using PageRank to Char-
acterize Web Structure. Tech. rep., Purdue University, 2002.
159. PANT, G. Deriving Link-context from HTML. In ACM DMKD (San Diego, Cali-
fornia, USA, June 2003).
160. PARKER, L. M. P., AND JOHNSON, R. E. Does order of presentation affect users’
judgment of documents? JASIS 41, 7 (1990), 493–494.
161. PINSKI, G., AND NARIN, F. Citation influence for journal aggregates of scientific
publications: Theory, with application to the literature of physics. Information
Processing and Management 12 (1976).
162. PONTE, J. M., AND CROFT, W. B. A Language Modeling Approach to Informa-
tion Retrieval. In Proceedings of ACM SIGIR’98 (Melbourne, Australia, August
1998).
163. PORTER, M. An algorithm for suffix stripping. Program 14, 3 (1980), 130–137.
http://www.tartarus.org/~martin/PorterStemmer/.
164. RAGGETT, D., HORS, A. L., AND JACOBS, I. HTML 4.01 Specification: The
global structure of an HTML document, 1999. http://www.w3.org/TR/
html4/struct/global.html#didx-meta data, accessed 12/11/2004.
165. RAGHAVAN, S., AND GARCIA-MOLINA, H. Crawling the Hidden Web. In Pro-
ceedings of VLDB’01 (2001), pp. 129–138. http://citeseer.ist.psu.edu/
article/raghavan01crawling.html.
166. RIVEST, R. The MD5 message-digest algorithm. Request for Comments, April
1992.
167. ROBERTSON, S. The probability ranking principle in IR. Journal of Documentation
33 (1977), 294–304. As appears in Sparck-Jones and Willet, 1997.
168. ROBERTSON, S., AND JONES, K. S. Simple, proven approaches to text
retrieval. Tech. Rep. UCAM-CL-TR-356, University of Cambridge, May
1997. http://www.cl.cam.ac.uk/ftp/papers/reports/abstract.
html#TR356-ksj-approaches-to-text-retrieval.html.
169. ROBERTSON, S., AND SPARCK-JONES, K. Relevance weighting of search terms.
JASIS 27 (1976), 129–146.
170. ROBERTSON, S., AND WALKER, S. Some simple effective approximations to the
2-Poisson model for probabilistic weighted retrieval. In Proceedings of ACM SI-
GIR’94 (Dublin, Ireland, 1994), pp. 232–241.
171. ROBERTSON, S., WALKER, S., HANCOCK-BEAULIEU, M., GULL, A., AND LAU,
M. Okapi at TREC-1. In TREC-1 Notebook Proceedings (Gaithersburg, MD, USA,
1992), pp. 21–30. http://trec.nist.gov/pubs/trec1/.
172. ROBERTSON, S., WALKER, S., JONES, S., HANCOCK-BEAULIEU, M., AND GAT-
FORD, M. Okapi at TREC-3. In TREC-3 Notebook Proceedings (Gaithersburg, MD,
USA, 1994), pp. 109–126. http://trec.nist.gov/pubs/trec3/.
173. ROBERTSON, S., ZARAGOZA, H., AND TAYLOR, M. Simple BM25 extension to
multiple weighted fields. In Proceedings of CIKM’04 (2004), pp. 42–49. http:
//research.microsoft.com/%7Ehugoz/bm25wf.pdf.
174. ROCCHIO, J. Document Retrieval Systems–Optimization and Evaluation. PhD the-
sis, Harvard Computational Laboratory, 1966.
175. ROCCHIO, J. Relevance Feedback in Information Retrieval. Prentice-Hall, Inc., 1971.
176. SALTON, G. Automatic Information Organization. McGraw-Hill, New York, 1968.
177. SALTON, G., Ed. The SMART retrieval system - experiments in automatic document
processing. McGraw-Hill, New York, 1971.
178. SAVOY, J., AND RASOLOFO, Y. Report on the TREC-10 experiment: Distributed
collections and entrypage searching. In TREC-10 Notebook Proceedings (Gaithers-
burg, MD, USA, 2001). http://trec.nist.gov/pubs/.
179. SEELEY, J. R. The net of reciprocal influence: A problem in treating sociometric
data. Canadian Journal of Psychology 3 (1949), 234–240.
180. SHAH, C., AND CROFT, W. B. Evaluating High Accuracy Retrieval Techniques.
In Proceedings of ACM SIGIR’04 (Sheffield, United Kingdom, 2004), pp. 2–9.
181. SHAKES, J., LANGHEINRICH, M., AND ETZIONI, O. Dynamic reference sifting:
a case study in the homepage domain. Computer Networks and ISDN Systems 29
(1997), 1193–1204.
182. SHANNON, C. E. Prediction and entropy of printed English. Bell System Techni-
cal Journal 30 (1951), 51–64.
183. SHIVAKUMAR, N., AND GARCIA-MOLINA, H. Finding Near-Replicas of Docu-
ments on the Web. In Proceedings of WDB’98 (1998).
184. SILVERSTEIN, C., HENZINGER, M., MARAIS, H., AND MORICZ, M. Analysis of
a Very Large AltaVista Query Log. Tech. rep., Digital Systems Research Center,
1998.
185. SINGHAL, A., AND KASZKIEL, M. A Case Study in Web Search using TREC
Algorithms. In Proceedings of WWW10 (Hong Kong, 2001), pp. 708–716. http:
//www10.org/cdrom/papers/317/.
186. SINGHAL, A., SALTON, G., MITRA, M., AND BUCKLEY, C. Document Length
Normalization. Information Processing and Management 32, 5 (1996).
187. SMALL, H. Co-citation in the scientific literature: A new measure of the relation-
ship between two documents. JASIS 24, 4 (1973), 265–269.
188. SOBOROFF, I. Do TREC Web Collections Look Like the Web? ACM SIGIR Forum
36, 2 (2002), 23–31.
189. SOBOROFF, I. On evaluating web search with very few relevant documents. In
Proceedings of ACM SIGIR’04 (Sheffield, United Kingdom, 2004), pp. 530–531.
190. SPARCK-JONES, K. A statistical interpretation of term specificity and its applica-
tion in retrieval. Journal of Documentation 28, 1 (1972), 11–20.
191. SPARCK-JONES, K., AND WILLET, P., Eds. Readings in Information Retrieval. Mor-
gan Kaufmann, 1997.
192. SPINELLO, R. A. An ethical evaluation of web site linking. ACM SIGCAS Com-
puters and Society 30, 4 (2000), 25–32.
193. SULLIVAN, D. How To Use HTML Meta Tags, December 2002. http:
//searchenginewatch.com/webmasters/article.php/2167931,
accessed 08/11/04.
194. SULLIVAN, D. Nielsen/NetRatings Search Engine Ratings. Web Site, September
2002. http://www.searchenginewatch.com/reports/netratings.
html, accessed 06/11/2002.
195. SULLIVAN, D. Who Powers Whom? Search Providers Chart. Web Site, Septem-
ber 2002. http://www.searchenginewatch.com/reports/alliances.
html, accessed 06/11/2002.
196. TERVEEN, L., HILL, W., AND AMENTO, B. Constructing, Organizing, and Visu-
alizing Collections of Topically Related Web Resources. ACM Transactions on
Computer-Human Interaction 6, 1 (March 1999), 67–94.
197. TOMLIN, J. A. A New Paradigm for Ranking Pages on the World Wide Web. In
Proceedings of WWW2003 (Budapest, Hungary, May 2003). http://www2003.
org/cdrom/papers/refereed/p042/paper42 html/p42-tomlin.htm.
198. TRAVIS, B., AND BRODER, A. Web search quality vs. informational relevance. In
Proceedings of the Infonortics Search Engines Meeting (Boston, 2001). http://www.
infonortics.com/searchengines/sh01/slides-01/travis.html.
199. UPSTILL, T., CRASWELL, N., AND HAWKING, D. Buying Bestsellers On-
line: A Case Study in Search and Searchability. In Proceedings of ADCS2002
(Sydney, Australia, 2002). http://es.cmis.csiro.au/pubs/upstill
adcs02.pdf.
200. UPSTILL, T., CRASWELL, N., AND HAWKING, D. Predicting fame and fortune:
Pagerank or indegree? In Proceedings of ADCS2003 (Canberra, Australia, Decem-
ber 2003). http://es.cmis.csiro.au/pubs/upstill adcs03.pdf.
201. UPSTILL, T., CRASWELL, N., AND HAWKING, D. Query-independent evidence
in home page finding. ACM Transactions on Information Systems 21, 3 (2003), 286–
313.
202. UPSTILL, T., AND ROBERTSON, S. Exploiting Hyperlink Recommendation Ev-
idence in Navigational Web Search. In Proceedings of ACM SIGIR’04 (Sheffield,
United Kingdom, July 2004), pp. 576–577.
203. VAUGHAN, L., AND SHAW, D. Bibliographic and Web Citations: What Is The
Difference? JASIS 54, 14 (2003), 1313–1322.
204. VAN RIJSBERGEN, C. J. Information Retrieval, 2nd edition. Dept. of Computer
Science, University of Glasgow, 1979.
205. VAN RIJSBERGEN, K. Information Retrieval. Butterworths, 1979. http://www.
dcs.gla.ac.uk/Keith/Preface.html.
206. VOORHEES, E. Evaluation by highly relevant documents. In Proceedings of ACM
SIGIR’01 (New Orleans, USA, 2001), pp. 74–82.
207. VOORHEES, E. M. Overview of the first Text REtrieval Conference (TREC-1). In
TREC-1 Notebook Proceedings (Gaithersburg, MD, USA, 1991).
208. VOORHEES, E. M. Variations in relevance judgments and the measurement of
retrieval effectiveness. In Proceedings of ACM SIGIR’98 (Melbourne, 1998).
209. VOORHEES, E. M. The Philosophy of Information Retrieval Evaluation. In
Springer’s Lecture Notes. Springer, January 2002.
210. VOORHEES, E. M., AND HARMAN, D. K. Overview of the fifth Text REtrieval
Conference (TREC-5). In TREC-5 Notebook Proceedings (Gaithersburg, MD, USA,
1996).
211. WESTERVELD, T. Using generative probabilistic models for multimedia retrieval. PhD
thesis, Centrum voor Wiskunde en Informatica, Amsterdam, Netherlands, 2004.
212. WESTERVELD, T., KRAAIJ, W., AND HIEMSTRA, D. Retrieving Web pages using
content, links, URLs and anchors. In TREC-10 Notebook Proceedings (Gaithers-
burg, MD, USA, 2001). http://trec.nist.gov/pubs/.
213. WILLIAMS, H. E., ZOBEL, J., AND BAHLE, D. Fast phrase querying with com-
bined indexes. ACM Transactions on Information Systems 22, 4 (October 2004),
573–594.
214. WITTEN, I. H., BELL, T. C., AND MOFFAT, A. Managing Gigabytes: Compressing
and Indexing Documents and Images. John Wiley & Sons, Inc., 1999.
215. XU, J., AND CROFT, W. B. Query expansion using local and global document
analysis. In Proceedings of ACM SIGIR’96 (Zurich, Switzerland, 1996), pp. 4–11.
216. YAHOO! Yahoo! Business and Economy > Shopping and Services >
Books > Booksellers, September 2002. http://www.yahoo.com/Business
and Economy/Shopping and Services/Books/Booksellers/, accessed
09/09/2002.
217. YAHOO! Yahoo! Directory Service, 2004. http://www.yahoo.com, accessed
12/11/2004.
218. ZHAI, C., AND LAFFERTY, J. A study of smoothing methods for language mod-
els applied to information retrieval. ACM Transactions on Information Systems 22, 2
(April 2004).
219. ZHU, X., AND GAUCH, S. Incorporating Quality Metrics in Central-
ized/Distributed Information Retrieval on the World Wide Web. Tech. rep., De-
partment of Electrical Engineering and Computer Science, University of Kansas,
2000.
220. ZOBEL, J. How reliable are the results of large-scale information retrieval exper-
iments? In Proceedings of ACM SIGIR’98 (Melbourne, Australia, August 1998),
pp. 307–314.

Upstill_Thesis_Revised_17Aug05

  • 1.
    Document ranking using webevidence Trystan Garrett Upstill A thesis submitted for the degree of Doctor of Philosophy at The Australian National University August 2005
  • 2.
    c Trystan GarrettUpstill Typeset in Palatino by TEX and LATEX2ε.
  • 3.
    This thesis includesexperiments published in: • Upstill T., Craswell N., and Hawking D. “Buying Bestsellers Online: A Case Study in Search and Searchability”, which appeared in the Proceedings of ADCS2002, December 2002 [199]. • Upstill T., Craswell N., and Hawking D. “Query-independent evidence in home page finding”, which appeared in the ACM TOIS volume 21:3, July 2003 [201]. • Craswell N., Hawking D., Thom J., Upstill T., Wilkinson R., and Wu M. “TREC12 Web Track at CSIRO”, which appeared in the TREC-12 Notebook Proceedings, November 2003 [58]. • Upstill T., Craswell N., and Hawking D. “Predicting Fame and Fortune: Page- Rank or Indegree?”, which appeared in the Proceedings of ADCS2003, Decem- ber 2003 [200]. • Upstill T., and Robertson S. “Exploiting Hyperlink Recommendation Evidence in Navigational Web Search”, which appeared in the Proceedings of SIGIR’04, August 2004 [202]. • Hawking D., Upstill T., and Craswell N. “Towards Better Weighting of An- chors”, which appeared in the Proceedings of SIGIR’04, August 2004 [120]. Chapter 9 contains results submitted as “csiro” runs in TREC 2003. The Topic Distilla- tion runs submitted to TREC 2003 were generated in collaboration with Nick Craswell and David Hawking. The framework used to tune parameters in Chapter 9 was de- veloped by Nick Craswell. The first-cut ranking algorithm presented in Chapter 9 was formulated by David Hawking for use in the Panoptic search system. Except where indicated above, this thesis is my own original work. Trystan Garrett Upstill 13 August 2005
  • 5.
    Abstract Evidence based onweb graph structure is reportedly used by the current generation of World-Wide Web (WWW) search engines to identify “high-quality”, “important” pages and to reject “spam” content. However, despite the apparent wide use of this evidence its application in web-based document retrieval is controversial. Confusion exists as to how to incorporate web evidence in document ranking, and whether such evidence is in fact useful. This thesis demonstrates how web evidence can be used to improve retrieval effec- tiveness for navigational search tasks. Fundamental questions investigated include: which forms of web evidence are useful, how web evidence should be combined with other document evidence, and what biases are present in web evidence. Through investigating these questions, this thesis presents a number of findings regarding how web evidence may be effectively used in a general-purpose web-based document ranking algorithm. The results of experimentation with well-known forms of web evidence on several small-to-medium collections of web data are surprising. Aggregate anchor-text mea- sures perform well, but well-studied hyperlink recommendation algorithms are far less useful. Further gains in retrieval effectiveness are achieved for anchor-text mea- sures by revising traditional full-text ranking methods to favour aggregate anchor-text documents containing large volumes of anchor-text. For home page finding tasks ad- ditional gains are achieved by including a simple URL depth measure which favours short URLs over long ones. The most effective combination of evidence treats document-level and web-based evidence as separate document components, and uses a linear combination to sum scores. It is submitted that the document-level evidence contains the author’s de- scription of document contents, and that the web-based evidence gives the wider web community view of the document. Consequently if both measures agree, and the doc- ument is scored highly in both cases, this is a strong indication that the page is what it claims to be. A linear combination of the two types of evidence is found to be partic- ularly effective, achieving the highest retrieval effectiveness of any query-dependent evidence on navigational and Topic Distillation tasks. However, care should be taken when using hyperlink-based evidence as a direct measure of document quality. Thesis experiments show the existence of bias towards the home pages of large, popular and technology-oriented companies. Further empir- ical evidence is presented to demonstrate how the authorship of web documents and sites directly affects the quantity and quality of available web evidence. These factors demonstrate the need for robust methods for mining and interpreting data from the web graph. v
  • 6.
  • 7.
    Contents Abstract v 1 Introduction3 1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Background 5 2.1 A web search system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 The document gatherer . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.2 The indexer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.3 The query processor . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.4 The results presentation interface . . . . . . . . . . . . . . . . . . . 7 2.2 Ranking in web search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Document-level evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3.1 Text-based document evidence . . . . . . . . . . . . . . . . . . . . 9 2.3.1.1 Boolean matching . . . . . . . . . . . . . . . . . . . . . . 9 2.3.1.2 Vector space model . . . . . . . . . . . . . . . . . . . . . 10 2.3.1.3 Probabilistic ranking . . . . . . . . . . . . . . . . . . . . 12 2.3.1.4 Statistical language model ranking . . . . . . . . . . . . 14 2.3.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.3 Other evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.3.1 Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.3.2 URL information . . . . . . . . . . . . . . . . . . . . . . . 17 2.3.3.3 Document structure and tag information . . . . . . . . . 19 2.3.3.4 Quality metrics . . . . . . . . . . . . . . . . . . . . . . . . 19 2.3.3.5 Units of retrieval . . . . . . . . . . . . . . . . . . . . . . . 20 2.4 Web-based evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.4.1 Anchor-text evidence . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.4.2 Bibliometric measures . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.4.2.1 Bibliographic methods applied to a web . . . . . . . . . 27 2.4.3 Hyperlink recommendation . . . . . . . . . . . . . . . . . . . . . . 28 2.4.3.1 Link counting / in-degree . . . . . . . . . . . . . . . . . 28 2.4.3.2 PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.4.3.3 Topic-specific PageRank . . . . . . . . . . . . . . . . . . 30 2.4.4 Other hyperlink analysis methods . . . . . . . . . . . . . . . . . . 30 2.4.4.1 HITS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.5 Combining document evidence . . . . . . . . . . . . . . . . . . . . . . . . 33 vii
  • 8.
    viii Contents 2.5.1 Score/rankfusion methods . . . . . . . . . . . . . . . . . . . . . . 33 2.5.1.1 Linear combination of scores . . . . . . . . . . . . . . . . 34 2.5.1.2 Re-ranking . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.5.1.3 Meta-search fusion techniques . . . . . . . . . . . . . . . 34 2.5.1.4 Rank aggregation . . . . . . . . . . . . . . . . . . . . . . 35 2.5.1.5 Using minimum query-independent evidence thresh- olds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.5.2 Revising retrieval models to address combination of evidence . . 35 2.5.2.1 Field-weighted Okapi BM25 . . . . . . . . . . . . . . . . 36 2.5.2.2 Language mixture models . . . . . . . . . . . . . . . . . 37 2.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.6.1 Web information needs and search taxonomy . . . . . . . . . . . . 38 2.6.2 Navigational search tasks . . . . . . . . . . . . . . . . . . . . . . . 39 2.6.2.1 Home page finding . . . . . . . . . . . . . . . . . . . . . 39 2.6.2.2 Named page finding . . . . . . . . . . . . . . . . . . . . 39 2.6.3 Informational search tasks . . . . . . . . . . . . . . . . . . . . . . . 40 2.6.3.1 Topic Distillation . . . . . . . . . . . . . . . . . . . . . . . 40 2.6.4 Transactional search tasks . . . . . . . . . . . . . . . . . . . . . . . 40 2.6.5 Evaluation strategies / judging relevance . . . . . . . . . . . . . . 40 2.6.5.1 Human relevance judging . . . . . . . . . . . . . . . . . 40 2.6.5.2 Implicit human judgements . . . . . . . . . . . . . . . . 42 2.6.5.3 Judgements based on authoritative links . . . . . . . . . 42 2.6.6 Evaluation measures . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.6.6.1 Precision and recall . . . . . . . . . . . . . . . . . . . . . 42 2.6.6.2 Mean Reciprocal Rank and success rates . . . . . . . . . 44 2.6.7 The Text REtrieval Conference . . . . . . . . . . . . . . . . . . . . 44 2.6.7.1 TREC corpora used in this thesis . . . . . . . . . . . . . 45 2.6.7.2 TREC web track evaluations . . . . . . . . . . . . . . . . 45 3 Hyperlink methods - implementation issues 49 3.1 Building the web graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.1.1 URL address resolution . . . . . . . . . . . . . . . . . . . . . . . . 50 3.1.2 Duplicate documents . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.1.3 Hyperlink redirects . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.1.4 Dynamic content . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.1.5 Links created for reasons other than recommendation . . . . . . . 54 3.2 Extracting hyperlink evidence from WWW search engines . . . . . . . . 55 3.3 Implementing PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.3.1 Dangling links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.3.2 Bookmark vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.3.3 PageRank convergence . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.3.4 PageRank applied to small-to-medium webs . . . . . . . . . . . . 59 3.4 Expected correlation of hyperlink recommendation measures . . . . . . 59
  • 9.
    Contents ix 4 Websearch and site searchability 61 4.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.1.1 Query selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.1.2 Search engine selection . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.1.3 Bookstore selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.1.4 Submitting queries and collecting results . . . . . . . . . . . . . . 65 4.1.5 Judging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.2 Comparing bookstores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.3 Comparing search engines . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.3.1 Search engine bookstore coverage . . . . . . . . . . . . . . . . . . 67 4.4 Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.4.1 Bookstore searchability: coverage . . . . . . . . . . . . . . . . . . 70 4.4.2 Bookstore searchability: matching/ranking performance . . . . . 73 4.4.3 Search engine retrieval effectiveness . . . . . . . . . . . . . . . . . 73 4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5 Analysis of hyperlink recommendation evidence 77 5.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.1.1 Sourcing candidate pages . . . . . . . . . . . . . . . . . . . . . . . 78 5.1.2 Company attributes . . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.1.3 Extracting hyperlink recommendation scores . . . . . . . . . . . . 79 5.2 Hyperlink recommendation bias . . . . . . . . . . . . . . . . . . . . . . . 81 5.2.1 Home page preference . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.2.2 Hyperlink recommendation as a page quality recommendation . 82 5.2.2.1 Large, famous company preference . . . . . . . . . . . . 82 5.2.2.2 Country and technology preference . . . . . . . . . . . . 82 5.3 Correlation between hyperlink recommendation measures . . . . . . . . 87 5.3.1 For company home pages . . . . . . . . . . . . . . . . . . . . . . . 87 5.3.2 For spam pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.4.1 Home page bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.4.2 Other systematic biases . . . . . . . . . . . . . . . . . . . . . . . . 90 5.4.3 PageRank or in-degree? . . . . . . . . . . . . . . . . . . . . . . . . 91 6 Combining query-independent web evidence with query-dependent evidence 93 6.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 6.1.1 Query and document set . . . . . . . . . . . . . . . . . . . . . . . . 94 6.1.2 Query-dependent baselines . . . . . . . . . . . . . . . . . . . . . . 94 6.1.3 Extracting PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . 95 6.1.4 Combining query-dependent baselines with query-independent web evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 6.2.1 Baseline performance . . . . . . . . . . . . . . . . . . . . . . . . . 96 6.2.2 Using a threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
  • 10.
    x Contents 6.2.3 Re-rankingusing PageRank . . . . . . . . . . . . . . . . . . . . . . 97 6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 7 Home page finding using query-independent web evidence 101 7.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 7.1.1 Query-independent evidence . . . . . . . . . . . . . . . . . . . . . 102 7.1.2 Query-dependent baselines . . . . . . . . . . . . . . . . . . . . . . 102 7.1.3 Test collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 7.1.4 Combining query-dependent baselines with query-independent evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 7.2 Minimum threshold experiments . . . . . . . . . . . . . . . . . . . . . . . 106 7.2.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 7.2.2 Training cutoffs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 7.3 Optimal combination experiments . . . . . . . . . . . . . . . . . . . . . . 112 7.3.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 7.4 Score-based re-ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 7.4.1 Setting score cutoffs . . . . . . . . . . . . . . . . . . . . . . . . . . 116 7.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 7.5 Interpretation of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 7.5.1 What query-independent evidence should be used in re-ranking? 123 7.5.2 Which query-dependent baseline should be used? . . . . . . . . . 125 7.6 Further experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 7.6.1 Rank and score distributions . . . . . . . . . . . . . . . . . . . . . 127 7.6.2 Can the four-tier URL-type classification be improved? . . . . . . 127 7.6.3 PageRank and in-degree correlation . . . . . . . . . . . . . . . . . 131 7.6.4 Use of external link information . . . . . . . . . . . . . . . . . . . 132 7.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 8 Anchor-text in web search 135 8.1 Document statistics in anchor-text . . . . . . . . . . . . . . . . . . . . . . 135 8.1.1 Term frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 8.1.2 Inverse document frequency . . . . . . . . . . . . . . . . . . . . . 136 8.1.3 Document length normalisation . . . . . . . . . . . . . . . . . . . 138 8.1.3.1 Removing aggregate anchor-text length normalisation . 140 8.1.3.2 Anchor-text length normalisation by other document fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 8.2 Combining anchor-text with other document evidence . . . . . . . . . . 143 8.2.1 Linear combination . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 8.2.2 Field-weighted Okapi BM25 . . . . . . . . . . . . . . . . . . . . . 143 8.2.3 Fusion of linear combination and field-weighted evidence . . . . 144 8.2.4 Snippet-based anchor-text scoring . . . . . . . . . . . . . . . . . . 144 8.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 8.3.1 Anchor-text baseline effectiveness . . . . . . . . . . . . . . . . . . 145 8.3.2 Anchor-text and full-text document evidence . . . . . . . . . . . . 146
  • 11.
    Contents xi 8.3.2.1 Field-weightedOkapi BM25 combination . . . . . . . . 147 8.3.2.2 Linear combination . . . . . . . . . . . . . . . . . . . . . 148 8.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 9 A first-cut document ranking function using web evidence 151 9.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 9.1.1 Evaluating performance . . . . . . . . . . . . . . . . . . . . . . . . 151 9.1.2 Document evidence . . . . . . . . . . . . . . . . . . . . . . . . . . 152 9.1.2.1 Full-text evidence . . . . . . . . . . . . . . . . . . . . . . 152 9.1.2.2 Title evidence . . . . . . . . . . . . . . . . . . . . . . . . . 153 9.1.2.3 URL length . . . . . . . . . . . . . . . . . . . . . . . . . . 153 9.1.3 Web evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 9.1.3.1 Anchor-text . . . . . . . . . . . . . . . . . . . . . . . . . . 153 9.1.3.2 In-degree . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 9.1.4 Combining document evidence . . . . . . . . . . . . . . . . . . . . 154 9.1.5 Test sets and tuning . . . . . . . . . . . . . . . . . . . . . . . . . . 155 9.1.6 Addressing the combined HP/NP task . . . . . . . . . . . . . . . 156 9.2 Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 9.2.1 Combining HP and NP runs for the combined task . . . . . . . . 160 9.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 9.3.1 TREC 2003 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 9.3.1.1 Topic Distillation 2003 (TD2003) results . . . . . . . . . . 160 9.3.1.2 Combined HP/NP 2003 (HP/NP2003) results . . . . . . 162 9.3.2 Evaluating the ranking function on further corporate web col- lections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 9.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 10 Discussion 167 10.1 Web search system applicability . . . . . . . . . . . . . . . . . . . . . . . . 167 10.2 Which tasks should be modelled and evaluated in web search experi- ments? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 10.3 Building a more efficient ranking system . . . . . . . . . . . . . . . . . . . 169 10.4 Tuning on a per corpus basis . . . . . . . . . . . . . . . . . . . . . . . . . . 170 11 Summary and conclusions 173 11.1 Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 11.2 Document ranking recommendations . . . . . . . . . . . . . . . . . . . . 176 11.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 A Glossary 179 B The canonicalisation of URLs 183
  • 12.
    xii Contents C Bookstoresearch and searchability: case study data 185 C.1 Book categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 C.2 Web search engine querying . . . . . . . . . . . . . . . . . . . . . . . . . . 185 C.3 Correct book answers in bookstore case study . . . . . . . . . . . . . . . 187 D TREC participation in 2002 195 D.1 Topic Distillation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 D.2 Named page finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 E Analysis of hyperlink recommendation evidence additional results 199 F Okapi BM25 distributions 203 G Query sets 205 G.1 .GOV home page set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205 Bibliography 213
  • 13.
    List of Tables 2.1Proximity of the the term “Yahoo” to links to http://www.yahoo.com/ 24 4.1 Search engine properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.2 Bookstores included in the evaluation . . . . . . . . . . . . . . . . . . . . 64 4.3 Bookstore comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.4 Search engine success rates . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.5 Search engine precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.6 Search engine document coverage . . . . . . . . . . . . . . . . . . . . . . 69 4.7 Search engine link coverage . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.1 Values extracted from Google . . . . . . . . . . . . . . . . . . . . . . . . . 81 5.2 PageRanks by industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.3 Extreme cases where PageRank and in-degree scores disagree. . . . . . . 88 7.1 Test collection information . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 7.2 Using query-independent thresholds on the ANU collection . . . . . . . 107 7.3 Using query-independent thresholds on the WT10gC collection . . . . . 109 7.4 Using query-independent thresholds on the WT10gT collection. . . . . . 111 7.5 Optimal re-ranking results for content . . . . . . . . . . . . . . . . . . . . 113 7.6 Optimal re-ranking results for anchor-text . . . . . . . . . . . . . . . . . . 114 7.7 Optimal re-ranking results for content+anchor-text . . . . . . . . . . . . . 115 7.8 Significant differences between methods when using Optimal re-rankings116 7.9 Summary of Optimal re-ranking results . . . . . . . . . . . . . . . . . . . 117 7.10 Score-based re-ranking results for content . . . . . . . . . . . . . . . . . . 120 7.11 Score-based re-ranking results for anchor-text . . . . . . . . . . . . . . . . 121 7.12 Score-based re-ranking results for content+anchor-text . . . . . . . . . . 122 7.13 Numerical summary of re-ranking improvements . . . . . . . . . . . . . 123 7.14 S@5 for URL-type category combinations, length and directory depth . . 131 7.15 Correlation of PageRank variants with in-degree . . . . . . . . . . . . . . 132 7.16 Using VLC2 links in WT10g . . . . . . . . . . . . . . . . . . . . . . . . . . 133 8.1 Summary of idf variants used in ranking functions under examination . 138 8.2 Summary of document length normalisation variants in ranking func- tions under examination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 8.3 Summary of snippet-based document ranking algorithms under exam- ination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 8.4 Okapi BM25 aggregate anchor-text scores and ranks for length normal- isation variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 xiii
  • 14.
    xiv LIST OFTABLES 8.5 Effectiveness of Okapi BM25 aggregate anchor-text length normalisa- tion techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 8.6 Length normalisation in Field-weighted Okapi BM25 . . . . . . . . . . . 147 8.7 Effectiveness of anchor-text snippet-based ranking functions . . . . . . . 148 8.8 Effectiveness of the evaluated combination methods for TD2003 . . . . . 149 8.9 Effectiveness of the evaluated combination methods for NP2002 and NP&HP2003 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 9.1 Tuned parameters and retrieval effectiveness . . . . . . . . . . . . . . . . 159 9.2 Results for combined HP/NP runs on the training set . . . . . . . . . . . 160 9.3 Topic Distillation submission summary . . . . . . . . . . . . . . . . . . . 161 9.4 Combined home page/named page finding task submission summary . 162 9.5 Ranking function retrieval effectiveness on the public corporate webs of several large Australian organisations . . . . . . . . . . . . . . . . . . . 164 C.1 Correct book answers in bookstore case study . . . . . . . . . . . . . . . 194 D.1 Official results for submissions to the 2002 TREC web track Topic Dis- tillation task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 D.2 Official results for submissions to the 2002 TREC web track named page finding task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 G.1 .GOV home page finding training set . . . . . . . . . . . . . . . . . . . . . 211
List of Figures

2.1 A sample network of relationships ... 25
3.1 Effect of PageRank d value (random jump probability) on success rate for Democratic PageRank calculations for the WT10gC test collection ... 57
3.2 Effect of PageRank d value (random jump probability) on success rate for Aristocratic PageRank calculations for the WT10gC test collection ... 58
3.3 Effect of PageRank d value on the rate of Democratic PageRank convergence on WT10g, by number of iterations ... 58
5.1 Combined PageRank distribution for the non-home page document set ... 79
5.2 Toolbar PageRank distributions within sites ... 83
5.3 Bias in hyperlink recommendation evidence towards large, admired and popular companies ... 84
5.4 Bias in hyperlink recommendation evidence towards technology-oriented or US companies ... 85
5.5 Toolbar PageRank versus in-degree for company home pages ... 88
5.6 Toolbar PageRank versus in-degree for links to a spam company ... 89
6.1 The percentage of home pages and non-home pages that exceed each Google PageRank value ... 97
6.2 Quota-based re-ranking ... 98
6.3 Score-based re-ranking ... 98
6.4 Example of two queries using different re-ranking techniques ... 99
7.1 Example of an Optimal re-ranking and calculation of random control success rate ... 106
7.2 Setting score-based re-ranking cutoffs for the content and anchor-text baselines using the WT10gC collection ... 118
7.3 Setting score-based re-ranking cutoffs for the content+anchor-text baseline using the WT10gC collection ... 119
7.4 Baseline success rates across different cutoffs ... 126
7.5 Baseline rankings of the correct answers for WT10gC ... 128
7.6 PageRank distributions for WT10gC ... 129
7.7 In-degree and URL-type distributions for WT10gC ... 130
8.1 Document scores achieved by BM25 using several values of k1 with increasing tf ... 137
8.2 Aggregate anchor-text term distribution for the USGS home page ... 139
8.3 Aggregate anchor-text term distribution for a USGS info page ... 139
8.4 The effect of document length normalisation on BM25 scores for a single term query ... 141
9.1 Document scores achieved by AF1 and BM25 for values of tf ... 154
9.2 A plot illustrating the concurrent exploration of Okapi BM25 k1 and b values using the hill-climbing function ... 157
9.3 A full iteration of the hill-climbing function ... 158
E.1 Google Toolbar PageRank distributions within sites (Additional to those in Chapter 5) ... 200
E.2 Google Toolbar PageRank distributions within sites (Additional to those in Chapter 5) ... 201
F.1 Distribution of normalised Okapi BM25 scores for document full-text ... 204
F.2 Distribution of normalised Okapi BM25 scores for aggregate anchor-text ... 204
“In an extreme view, the world can be seen as only connections, nothing else. We think of a dictionary as the repository of meaning, but it defines words only in terms of other words. I liked the idea that a piece of information is really defined only by what it’s related to, and how it’s related. There really is little else to meaning. The structure is everything. There are billions of neurons in our brains, but what are neurons? Just cells. The brain has no knowledge until connections are made between neurons. All that we know, all that we are, comes from the way our neurons are connected.”

— Tim Berners-Lee [20]
    Chapter 1 Introduction Document retrievalon the World-Wide Web (WWW), arguably the world’s largest col- lection of documents, is a challenging and important task. The scale of the WWW is immense, consisting of at least ten billion publicly visible web documents1 distributed on millions of servers world-wide. Web authors follow few formal protocols, often re- main anonymous and publish in a wide variety of formats. There is no central registry or repository of the WWW’s contents and documents are often in a constant state of flux. The WWW is also an environment where documents often misrepresent their content as some web authors seek to unbalance ranking algorithms in their favour for personal gain [122]. To compound these factors, WWW search engine users typically provide short queries (averaging around two terms [184]) and expect a sub-second response time from the system. Given these significant challenges, there is potentially much to be learnt from the search systems which manage to retrieve relevant docu- ments in such an environment. The current generation of WWW search engines reportedly makes extensive use of evidence derived from the structure of the WWW to better match relevant doc- uments and identify potentially authoritative pages [31]. However, despite this re- ported use, to date there has been little analysis which supports the inclusion of web evidence in document ranking, or which examines precisely what its effect on search results might be. The success of document ranking in the current generation of WWW search engines is attributed to a number of web analysis techniques. How these tech- niques are used and incorporated remains a trade secret. It also remains unclear as to whether such techniques can be employed to improve retrieval effectiveness in smaller, corporate-sized web collections. This thesis investigates how web evidence can be used to improve retrieval ef- fectiveness for navigational search tasks. Three important forms of web evidence are considered: anchor-text, hyperlink recommendation measures (PageRank vari- ants and in-degree), and URL hierarchy-based measures. These forms of web evi- dence are believed to be used by prominent WWW search engines [31]. Other forms of web evidence reviewed, but not examined, include HITS [132], HTML document structure [42] and page segmentation [37], information unit measures [196], and click- through evidence [129]. 1 This is necessarily a crude estimate of the WWW’s static size. See Section 2.4 for details. 3
    4 Introduction To exploitweb evidence effectively in a document ranking algorithm, several ques- tions must be addressed: • Which forms of web evidence are useful? • How should web evidence be combined with other document evidence? • What biases are inherent in web evidence? Through addressing these and other related problems, this thesis demonstrates how web evidence may be used effectively in a general-purpose web-based document ranking algorithm. 1.1 Overview Chapters 2 and 3 review background literature and implementation issues. Chap- ter 2 surveys the web search domain, and presents an overview of document and web evidence often used in web-based document ranking, methods for combining this evidence, and a review of strategies for evaluating the effectiveness of ranking algorithms. To justify the formulations of hyperlink evidence used, and to ensure ex- periments can be reproduced, Chapter 3 describes methods used to process the web graph and implement recommendation evidence. Chapters 4 to 8 present a series of detailed experiments. Chapter 4 reports results from an investigation of how the searchability of web sites affects hyperlink evidence, and thereby retrieval effectiveness in WWW search engines. Chapter 5 presents a set of experiments that analyse the extent to which hyperlink evidence is correlated with “real-world” measures of authority or quality. It includes an analysis of how the use of web evidence may bias search results, and whether hyperlink recommen- dation evidence is useful in identifying site entry points. Chapters 6 and 7 follow with an evaluation of retrieval effectiveness improvements afforded by hyperlink ev- idence. Chapter 6 investigates how query-independent evidence might be combined with query-dependent baselines. Chapter 7 investigates the home page finding task on small-to-medium web collections. Chapter 8 presents a set of experiments that in- vestigates further possibilities for improving the effectiveness of measures based on anchor-text evidence. The experiments culminate in a proposal for, and evaluation of, a ranking function that incorporates evidence explored in this thesis. The effectiveness of this ranking function is evaluated through submissions to the TREC 2003 web track, presented in Chapter 9. Chapters 10 and 11 present and discuss findings, draw conclusions and outline future research directions. A glossary is included as Appendix A.
    Chapter 2 Background To providea foundation and context for thesis experiments, this chapter outlines the web-based document ranking domain. The chapter includes: • An overview of a generic web search system, outlining the role of document ranking in web search; • A detailed analysis of document and web-level evidence commonly used for document ranking in research and (believed to be used in) commercial web search systems; • An exploration of methods for combining evidence into a single ranking func- tion; and • A review of common user web search tasks and methods used to evaluate the effectiveness of document ranking for such tasks. Where applicable, reference is made throughout this chapter to the related scope of the thesis and the rationale for experiments undertaken. 2.1 A web search system A web search engine typically consists of a document gatherer (usually a crawler), a document indexer, a query processor and a results presentation interface [31]. The document gatherer and document indexer need only be run when the underlying set of web documents has changed (which is likely to be continuous on the WWW, but perhaps intermittent for other web corpora). How each element which makes up a generic web search system is understood in the context of this thesis is discussed below. 5
    6 Background 2.1.1 Thedocument gatherer Web-based documents are normally1 gathered using a crawler [123]. Crawlers traverse a web graph by recursively following hyperlinks, storing each document encountered, and parsing stored documents for URLs to crawl. Crawlers typically maintain a fron- tier, the queue of pages which remain to be downloaded. The frontier may be a FIFO2 queue, or sorted by some other attribute, such as perceived authority or frequency of change [46]. Crawlers also typically maintain a list of all downloaded or detected du- plicate pages (so pages are not fetched more than once), and a scope of pages to crawl (for example, a maximum depth, specified domain, or timeout value), both of which are checked prior to adding pages to the frontier. The crawler frontier is initialised with a set of seed pages from which the crawl starts (these are specified manually). Crawling ceases when the frontier is empty, or some time or resource limit is reached. Once crawling is complete,3 the downloaded documents are indexed. 2.1.2 The indexer The indexer distills information contained within corpus documents into a format which is amenable to quick access by the query processor. Typically this involves ex- tracting document features by breaking-down documents into their constituent terms, extracting statistics relating to term presence within the documents and corpus, and calculating any query-independent evidence.4 After the index is built, the system is ready to process queries. 2.1.3 The query processor The query processor serves user queries by matching and ranking documents from the index according to user input. As the query processor interacts directly with the doc- ument index created by the indexer, they are often considered in tandem. This thesis is concerned with a non-iterative retrieval process, i.e. one without query refinement or relevance feedback [169, 174, 175, 177]. This is the level of in- teraction supported by current popular WWW search systems and many web search systems, most of whom incorporate little relevance feedback beyond “find more like this” [93] or lists containing suggested supplementary query terms [217]. Although particularly important in WWW search systems, this thesis is not pri- marily concerned with the efficiency of query processing. A comprehensive overview of efficient document query processing and indexing methods is provided in [214]. 1 In some cases alternative document accessing methods may be available, for example if the docu- ments being indexed are stored locally. 2 A queue ordered such that the first item in is the first item out. 3 If crawling is continuous, and an incremental index structure is used, documents might be indexed continuously. 4 Query-independent evidence is evidence that does not depend on the user query. For efficiency reasons such evidence is generally collected and calculated during the document indexing phase (prior to query processing).
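To make the crawl loop just described concrete, the following sketch shows a minimal frontier-based crawler. It is an illustrative assumption rather than a description of any particular system: fetch_page and extract_urls are hypothetical helpers standing in for a real HTTP client and link parser, and a production crawler would also need politeness delays, robots.txt handling, duplicate-content detection and error recovery.

from collections import deque

def crawl(seed_pages, in_scope, fetch_page, extract_urls, max_pages=10000):
    # The frontier is a FIFO queue here; it could instead be ordered by
    # perceived authority or expected frequency of change, as discussed above.
    frontier = deque(seed_pages)
    seen = set(seed_pages)          # pages already queued or downloaded
    stored = {}                     # URL -> document text, handed to the indexer

    while frontier and len(stored) < max_pages:
        url = frontier.popleft()
        html = fetch_page(url)      # hypothetical helper: returns None on failure
        if html is None:
            continue
        stored[url] = html
        for link in extract_urls(html, url):   # hypothetical helper
            # Only unseen, in-scope pages are queued, so no page is fetched twice.
            if link not in seen and in_scope(link):
                seen.add(link)
                frontier.append(link)
    return stored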
    §2.2 Ranking inweb search 7 2.1.4 The results presentation interface The results presentation interface displays and links to the documents matched by the query processor in response to the user query. Current popular WWW and web search systems present a linear list of ranked results, sometimes with the degree of match and/or summaries and abstracts for the matching documents. This type of interface is modelled in experiments within this thesis. 2.2 Ranking in web search The principal component of the query processor is the document ranking function. The ranking functions of modern search systems frequently incorporate many forms of document evidence [31]. Some of this evidence, such as textual information, is collected locally for each document in the corpus (described in Section 2.3). Other evidence, such as external document descriptions or recommendations, is amassed through an examination of the context of a document within the web graph (described in Section 2.4). 2.3 Document-level evidence Text-based ranking algorithms typically assign scores to documents based on the dis- tribution of query terms within both the document and the corpus. Therefore the choice of what should constitute a term is an important concern. While terms are often simply defined as document words (treated individually) [170] they may also take further forms. For example, terms may consist of the canonical string compo- nents of words (stems) [163], include (n-)tuples of words [214], consist of a word and associated synonyms [128], or may include a combination of some or many of these properties. Unless otherwise noted, the ranking functions examined within this thesis use single words as terms. In some experiments ranking functions instead make use of canonical word stems, conflated using the Porter stemmer [163], as terms. These and alternative term representations are discussed below. The conflation of terms may increase the overlap between documents and queries, finding term matches which may otherwise have been missed. For example if the query term “cat” is processed and a document in the corpus mentions “cats” it is likely that the document will be relevant to the user’s request. Stemming methods are frequently employed to reduce words to their canonical forms and thereby allow such matches. An empirically validated method for reducing terms to their canon- ical forms is the Porter stemmer [163]. The Porter stemmer has been demonstrated to perform as well as other suffix-stemming algorithms and to perform comparably to other significantly more expensive, linguistic-based stemming algorithms [126].5 5 These algorithms are expensive with regard to training and computational cost.
    8 Background The Porterstemmer removes suffixes, for example “shipping” and “shipped” would become “ship”. In this way suffix-stemming attempts to remove pluralisation from terms and to generalise words [126],6 sometimes leading to an improvement in re- trieval system recall [134]. However, reducing the exactness of term matches can result in the retrieval of less relevant documents [84, 101], thereby reducing search precision.7 Furthermore, if a retrieved document does not contain any occurrences of a query term, as all term matches are stems, it may be difficult for a user to understand why that document was retrieved [108]. In many popular ranking functions documents are considered to be “bags-of- words” [162, 170, 176], where term occurrence is assumed to be independent and unordered. For example, given a term such as “Computer” there is no prior probabil- ity of encountering the word “Science” afterwards. Accordingly no extra evidence is recorded if the words “Computer Science” are encountered together in a docu- ment rather than separately. While there is arguably more meaning conveyed through higher order terms (terms containing multiple words) than in single-word term mod- els, there is little empirical evidence to support the use of higher-order terms [128]. Even when using manually created word association thesauri, retrieval effectiveness has not been observed to be significantly improved [128]. “Bags-of-words” algorithms are also generally less expensive when indexing and querying English language doc- uments8 [214]. Terms may have multiple meanings (polysemy) and many concepts are repre- sented by multiple words (synonyms). Several methods attempt to explore relation- ships between terms to compress the document and query space. The association of words to concepts can be performed manually through the use of dictionaries or ontologies, or automatically using techniques such as Latent Semantic Analysis (LSA) [22, 66]. LSA involves the extraction of contextual meaning of words through examinations of the distribution of terms within a corpus using the vector space model (see Section 2.3.1.2). Terms are broken down into co-occurrence tables and then a Sin- gular Value Decomposition (SVD) is performed to determine term relationships [66]. The SVD projects the initial term meanings onto a subspace spanned by only the “im- portant” singular term vectors. The potential benefits of LSA techniques are two-fold: firstly they may reduce user confusion through the compression of similar (synony- mous or polysemous) terms, and secondly they may reduce the size of term space, and thereby improve system efficiency [66]. Indeed, LSA techniques have been shown to improve the efficiency of retrieval systems considerably while maintaining (but not exceeding) the effectiveness of non-decomposed vector space-based retrieval sys- tems [22, 43]. However, the use of LSA-based algorithms is likely to negatively affect navigational search (an important search task, described in Section 2.6.2) as the mean- ing conveyed by entity naming terms may be lost. 6 Employing stemming prior to indexing reduces the size of the corpus index however the discarded term information is then lost. As an alternative stemming can be applied during query processing [214]. 7 The measures of system precision and recall are defined in Section 2.6.6.1 8 The use of phrase-optimised indexes can improve the efficiency of phrase-based retrieval [213].
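For illustration, the snippet below applies a Porter stemmer to a few of the examples used in this section. It relies on the PorterStemmer class from the NLTK library purely as a convenient implementation; the thesis does not prescribe any particular stemming package, and the exact outputs shown are approximate.

from nltk.stem import PorterStemmer   # third-party package: pip install nltk

stemmer = PorterStemmer()

# Suffix stripping conflates inflected forms onto a shared stem, so a query
# containing "cat" can match a document that only mentions "cats".
for word in ["shipping", "shipped", "ships", "cats", "computer"]:
    print(word, "->", stemmer.stem(word))

# Expected output (approximately):
#   shipping -> ship, shipped -> ship, ships -> ship, cats -> cat, computer -> comput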
    §2.3 Document-level evidence9 Some terms occur so frequently in a corpus that their presence or absence within a document may have negligible effect. The most frequent terms arguably convey the least document relevance information and have the smallest discrimination value (see inverse document frequency measure in Section 2.3.1.2). Additionally, because of the high frequency of occurrence, such terms are likely to generate the highest overhead during indexing and querying.9 Extremely frequent terms (commonly referred to as “stop words”) are often removed from documents prior to indexing.10 However, it has been suggested that such terms might be useful when matching documents [214], particularly in phrase-based searches [14]. Nevertheless, in experiments within this thesis, stop words are removed prior to indexing. 2.3.1 Text-based document evidence To build a retrieval model, an operational definition of what constitutes a relevant document is required. While each of the ranking models discussed below shares sim- ilar document statistics, they were all derived through different relevance matching assumptions. Experiments within this thesis employ the Okapi BM25 probabilistic algorithm (for reasons outlined in Section 2.3.2). Other full-text ranking methods are discussed for completeness. The notation used during the model discussions below is as follows: D denotes a document, Q denotes a query, t is a term, wt indicates a weight or score for a single term, and S(D, Q) is the score assigned to the query to document match. 2.3.1.1 Boolean matching In the Boolean model, retrieved documents are matched to queries formed with logic operators. There are no degrees of match; a document either satisfies the query or does not. Thus Boolean models are often referred to as “exact match” techniques [14]. While Boolean matching makes it clear why documents were retrieved, its syntax is largely unfamiliar to ordinary users [28, 49, 51]. Nevertheless empirical evidence sug- gests that trained search users prefer Boolean search as it provides an exact specifica- tion for retrieving documents [49]. However, without any ranking by degree of match, the navigation of the set of matching documents is difficult, particularly on large cor- pora with unstructured content [14]. Empirical evidence also suggests that the use of term weights in the retrieval model (described in the next sub-section) brings large gains [14]. To employ Boolean matching techniques on corpora of the scale considered in this thesis, it would have to be supplemented by some other document statistic in order to provide a ranked list of results [14]. 9 However, given the high amount of expected repetition they could potentially be more efficiently compressed [214]. 10 This list often contains common function words or connectives such as “the”, “and” and “a”.
The Boolean scoring function is:

\[
S(D, Q) =
\begin{cases}
0 & Q \notin D \\
1 & Q \in D
\end{cases}
\tag{2.1}
\]

where Q is the query condition expressed in Boolean logic operators.

2.3.1.2 Vector space model

The vector space model is based on the implicit assumption that the relevance of a document with respect to some query is correlated with the distance between the query and document. In the vector space model each document (and query) is represented in an n-dimensional Euclidean space with an orthogonal dimension for each term in the corpus.11 The degree of relevance between a query and document is measured using a distance function [176].

The most basic term vector representation simply flags term presence using vectors of binary {0, 1}. This is known as the binary vector model [176]. The document representation can be extended by including term and document statistics in the document and query vector representations [176]. An empirically validated document statistic is the number of term occurrences within a document (term frequency or tf) [176]. The intuitive justification for this statistic is that a document that mentions a term more often is more likely to be relevant for, or about, that term. Another important statistic is the potential for a term to discriminate between candidate documents [190]. The potential of a term to discriminate between documents has been observed to be inversely proportional to the frequency of its occurrence in a corpus [190], with terms that are common in a corpus less likely to convey useful relevance information. A frequently used measure of term discrimination based on this observation is inverse document frequency (or idf) [190].

Using the tf and idf measures, the weight of a term present in a document can be defined as:

\[
w_{t,D} = tf_{t,D} \times idf_t
\tag{2.2}
\]

where idf is:

\[
idf_t = \log \frac{N}{n_t}
\tag{2.3}
\]

where n_t is the number of documents in the corpus that contain term t, and N is the total number of documents in the corpus.

11 So all dimensions are linearly independent.
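As an illustration of Equations 2.2 and 2.3, the sketch below computes tf × idf weights over a toy corpus of pre-tokenised documents. The corpus, tokenisation and function names are illustrative assumptions, not part of the thesis.

import math
from collections import Counter

def idf(term, corpus):
    # Inverse document frequency (Equation 2.3): log(N / n_t).
    n_t = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_t) if n_t else 0.0

def tf_idf(term, doc, corpus):
    # Term weight w_{t,D} = tf_{t,D} * idf_t (Equation 2.2).
    return Counter(doc)[term] * idf(term, corpus)

corpus = [["web", "search", "ranking"],
          ["web", "crawler", "frontier"],
          ["anchor", "text", "ranking", "ranking"]]
print(tf_idf("ranking", corpus[2], corpus))   # 2 * log(3/2), approximately 0.81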
There are many functions that can be used to score the distance between document and query vectors [176]. A commonly used distance function is the cosine measure of similarity [14]:

\[
S(D, Q) = \frac{D \cdot Q}{|D| \times |Q|}
\tag{2.4}
\]

or:

\[
S(D, Q) = \frac{\sum_{t \in Q} w_{t,D} \times w_{t,Q}}{\sqrt{\sum_{t \in Q} w^2_{t,D}} \times \sqrt{\sum_{t \in Q} w^2_{t,Q}}}
\tag{2.5}
\]

Because the longer a document is, the more likely it is that a term will be encountered in it, an unnormalised tf component is more likely to assign higher scores to longer documents. To compensate for this effect the term weighting function in the vector space model is often length normalised, such that a term that occurs in a short document is assigned more weight than a term that occurs in a long document. This is termed document length normalisation. For example, a simple form of length normalisation is [14]:

\[
w_{t,D} = \frac{tf_{t,D} + 1}{maxtf_D + 1} \times idf_t
\tag{2.6}
\]

where maxtf_D is the maximum term frequency observed for a term in document D.

After observing relatively poor performance for the vector space model in a set of TREC experiments, Singhal et al. [186] hypothesised that the form of document length normalisation used within the model was inferior to that used in other models. To investigate this effect they compared the length of known relevant documents with the length of documents otherwise retrieved by the retrieval system. Their results indicated that long documents were more likely to be relevant for the task studied,12 but no more likely to be retrieved after length normalisation in the vector space model. Accordingly, Singhal et al. [186] proposed that the (cosine) length normalisation component be pivoted to favour documents that were more frequently relevant (in this case, longer documents).

12 The task studied was the TREC-3 ad-hoc retrieval task. The ad-hoc retrieval task is an informational task (see Section 2.6.1) where the user needs to acquire or learn some information that may be present in a document.
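The cosine measure of Equations 2.4 and 2.5 can be sketched over sparse term-weight vectors represented as dictionaries. As in Equation 2.4, the norms below are taken over the full weight vectors; the weights themselves would come from a scheme such as the tf × idf weighting above, and the example values are illustrative only.

import math

def cosine(query_weights, doc_weights):
    # Dot product over query terms, divided by the product of the vector norms.
    dot = sum(w_q * doc_weights.get(t, 0.0) for t, w_q in query_weights.items())
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0
    return dot / (q_norm * d_norm)

print(cosine({"web": 1.2, "ranking": 0.8},
             {"web": 0.9, "ranking": 1.1, "crawler": 0.4}))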
    12 Background 2.3.1.3 Probabilisticranking Probabilistic ranking algorithms provide an intuitive justification for the relevance of matched documents by attempting to model and thereby rank the statistical proba- bility that a document is relevant given the matching terms found [146, 169]. The Probability Ranking Principle was described by Cooper [167] as: “If a reference retrieval system’s response to each request is a ranking of the documents in the collections in order of decreasing probability of use- fulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.” The probabilistic model for information retrieval was originally proposed by Maron and Kuhn [146] and updated in an influential paper by Robertson and Sparck-Jones [169]. Probabilistic ranking techniques have a strong theoretical basis and should, at least in principle and given all available information, provide the best predictions of document relevance. The formal specification of the Probabilistic Rank- ing Principle can be described as an optimisation problem, where documents should only be retrieved in response to a query if the cost of retrieving the document is less than the cost of not retrieving the document [169]. A prominent probabilistic ranking formulation is the Binary Independence Model used in the Okapi BM25 algorithm [171]. The Binary Independence Model is con- ditioned by several important assumptions in order to decrease complexity. These assumptions include: • Independence of documents, i.e. that the relevance of one document is indepen- dent of the relevance of all other documents;13 • Independence of terms, i.e. that the occurrence or absence of one term is not related to the presence or absence of any other term;14 and • That the distribution of terms within a document can be used to estimate the document’s probability of relevance.15 13 This is brought into question when one document’s relevance may be affected by another document ranked above it (as is the case with duplicate documents). This independence assumption was removed in several probabilistic formulations without significant improvement in retrieval effectiveness [204]. 14 This assumption was also removed from probabilistic formulations without significant effectiveness improvements [204]. 15 This assumption is made according to the cluster hypothesis which states that “closely associated documents tend to be relevant to the same requests”, therefore “documents relevant to a request are separate from those that are not” [204].
In most early probabilistic models, the term probabilities were estimated from a sample set of documents and queries with corresponding relevance judgements. However, this information is not always available. Croft and Harper [61] have revisited the initial formulation of relevance and proposed a probabilistic model that did not include a prior estimate of relevance.

Okapi BM25

The Okapi BM25 formula was proposed by Robertson et al. [172]. In Okapi BM25, documents are ordered by decreasing probability of their relevance to the query, P(R|Q, D). The formulation takes into account the number of times a query term occurs in a document (tf), the proportion of other documents which contain the query term (idf), and the relative length of the document. A score for each document is calculated by summing the match weights for each query term. The document score indicates the Bayesian inference weight that the document will be relevant to the user query.

Robertson and Walker [170] derived the document length normalisation used in the Okapi BM25 formula as an approximation to the 2-Poisson model. The form of length normalisation employed when using Okapi BM25 with default parameters (k1 = 2, b = 0.75) is justified because long documents contain more information than shorter documents, and are thus more likely to be relevant [186].

The base Okapi BM25 formulation [172] is:

\[
\mathrm{BM25}\,w_t = idf_t \times \frac{(k_1 + 1)\,tf_{t,D}}{k_1\left((1 - b) + \frac{b \times dl}{avdl}\right) + tf_{t,D}} \times \frac{(k_3 + 1) \times Qw_t}{k_3 + Qw_t} + k_2 \times nq\,\frac{(avdl - dl)}{avdl + dl}
\tag{2.7}
\]

where w_t is the relevance weight assigned to a document due to query term t, Qw_t is the weight attached to the term by the query, nq is the number of query terms, tf_{t,D} is the number of times t occurs in the document, N is the total number of documents, n_t is the number of documents containing t, dl is the length of the document and avdl is the average document length (both measured in bytes).

Here k1 controls the influence of tf_{t,D} and b adjusts the document length normalisation. A k1 approaching 0 reduces the influence of the term frequency, while a larger k1 increases the influence. A b approaching 1 assumes that the documents are longer due to repetition (full length normalisation), whilst b = 0 assumes that documents are long because they cover multiple topics (no length normalisation) [168].

Setting k1 = 2, k2 = 0, k3 = ∞ and b = 0.75 (verified experimentally in TREC tasks and on large corpora [168, 186]):

\[
\mathrm{BM25}\,w_{t,D} = \frac{Qw_t \times tf_{t,D} \times \log\left(\frac{N - n_t + 0.5}{n_t + 0.5}\right)}{2 \times \left(0.25 + 0.75 \times \frac{dl}{avdl}\right) + tf_{t,D}}
\tag{2.8}
\]
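A minimal sketch of the simplified weighting in Equation 2.8, summed over query terms as in Equation 2.9 below, is given here. Corpus statistics (N, n_t and the average document length) are assumed to have been gathered at indexing time; query-term weights Qw_t are fixed at 1, and document length is measured in tokens rather than bytes purely for simplicity.

import math
from collections import Counter

def bm25(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=2.0, b=0.75):
    # doc_freqs maps term -> n_t (the number of documents containing the term).
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        n_t = doc_freqs.get(t, 0)
        if n_t == 0 or tf[t] == 0:
            continue
        idf = math.log((num_docs - n_t + 0.5) / (n_t + 0.5))
        norm = k1 * ((1.0 - b) + b * dl / avg_doc_len)
        score += idf * tf[t] / (norm + tf[t])   # per-term weight (Equation 2.8)
    return score                                 # summed over query terms (Equation 2.9)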
The final document score is the sum of term weights:

\[
\mathrm{BM25}(D, Q) = \sum_{t \in Q} w_{t,D}
\tag{2.9}
\]

2.3.1.4 Statistical language model ranking

Statistical language modelling is based on Shannon's communication theory [182]16 and examines the distribution of language in a document to estimate the probability that a query was generated in an attempt to retrieve that document. Statistical language models have long been used in language generation, speech recognition and machine translation tasks, but have only recently been applied to document retrieval [162]. Language models calculate the probability of encountering a particular string (s) in a language (modelled by M) by estimating P(s|M).

The application of language modelling to information retrieval conceptually reverses the document ranking process. Unlike probabilistic ranking functions which model the relevance of documents to a query, language modelling approaches model the probability that a query was generated from a document. In this way, language models replace the notion of relevance with one of sampling, where the probability that the query was picked from a document is modelled. The motivation for this approach is that users have some prototype document in mind when an information need is formed, and they choose query terms to that effect. Further, it is asserted that when a user seeks a document they are thinking about what it is that makes the document they are seeking “different”. The statistical language model ranks documents using the maximum likelihood estimation (Pmle) that the query was generated with that document in mind (P(Q|MD)), otherwise considered to be the probability of generating the query according to each document language model.

Language modelling was initially applied to document retrieval by Ponte and Croft [162] who proposed a simple unigram-based document model.17 The simple unigram model assigns:

\[
P(D|Q) = \prod_{t \in Q} P(t|M_D)
\tag{2.10}
\]

The model presented above may not be effective in general document retrieval as it requires a document to contain all query terms. Any document that is missing one or more query terms will be assigned a probability of query generation of zero.

16 This is primarily known for its application to text sequencing and estimation of message noise.
17 A unigram language model models the probability of each term occurring independently, whilst higher order (n-gram) language models model the probability that consecutive terms appear near each other (described in Section 2.3). In the unigram model the occurrence of a term is independent of the presence or absence of any other term (similar to the term independence assumption in the Okapi model).
Smoothing is often used to counter this effect (by adjusting the maximum likelihood estimation of the language model). Smoothing methods discount the probabilities of the terms seen in the text, to assign extra probability mass to the unseen terms according to a fallback model [218]. In information retrieval it is common to exploit corpus properties for this purpose. Thereby:

\[
P(D|Q) = \prod_{t \in Q}
\begin{cases}
P(t|M_D) & \text{if } t \in M_D \\
\alpha P(t|M_C) & \text{otherwise}
\end{cases}
\tag{2.11}
\]

where P(t|M_D) is the smoothed probability of a term seen in the document D, P(t|M_C) is the collection language model (over C), and α is the co-efficient controlling probability mass assigned to unseen terms (so that all probabilities sum to one).

Models for smoothing the document model include Dirichlet smoothing [155], geometric smoothing [162], linear interpolation [19] and 2-state Hidden Markov Models. Dirichlet smoothing has been shown to be particularly effective when dealing with short queries, as it provides an effective normalisation using document length [155, 218].18 Language models with Dirichlet smoothing have been used to good effect in recent TREC web tracks by Ogilvie and Callan [155]. A document language model is built for all query terms [155]:

\[
P(Q|M_D) = \prod_{t \in Q} P(t|M_D)
\tag{2.12}
\]

Adding smoothing to the document model using the collection model:

\[
P(t|M_D) = \beta_1 P_{mle}(t|D) + \beta_2 P_{mle}(t|C)
\tag{2.13}
\]

The β1 and β2 collection and document linear interpolation parameters are then estimated using Dirichlet smoothing:

\[
\beta_1 = \frac{|D|}{|D| + \gamma}, \qquad \beta_2 = \frac{\gamma}{|D| + \gamma}
\tag{2.14}
\]

where |D| is the document length and γ is often set near the average document length in the corpus [155]. The mle for a document is defined as:

\[
P_{mle}(w|D) = \frac{tf_{t,D}}{|D|}
\tag{2.15}
\]

Similarly, for the corpus:

\[
P_{mle}(w|C) = \frac{tf_{t,C}}{|C|}
\tag{2.16}
\]

The document score is then:

\[
S(D, Q) = \prod_{t \in Q} \left( \beta_1 \times \frac{count(t; D)}{|D|} + \beta_2 \times \frac{count(t; C)}{|C|} \right)
\tag{2.17}
\]

18 Document length has been exploited with success in the Okapi BM25 model and in the vector space model.
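As an illustration of Equations 2.13 to 2.17, the sketch below computes a Dirichlet-smoothed query-likelihood score. It works in log space to avoid numerical underflow, which is a common implementation choice rather than something the equations prescribe, and the collection statistics are assumed to be precomputed totals.

import math
from collections import Counter

def dirichlet_lm_score(query_terms, doc_terms, coll_tf, coll_len, gamma):
    # coll_tf maps term -> tf_{t,C}; coll_len is |C|; gamma is the Dirichlet
    # parameter, typically set near the average document length (Equation 2.14).
    tf = Counter(doc_terms)
    dl = len(doc_terms)                         # |D|
    beta1 = dl / (dl + gamma)
    beta2 = gamma / (dl + gamma)
    log_score = 0.0
    for t in query_terms:
        p_doc = tf[t] / dl if dl else 0.0       # P_mle(t|D), Equation 2.15
        p_coll = coll_tf.get(t, 0) / coll_len   # P_mle(t|C), Equation 2.16
        p = beta1 * p_doc + beta2 * p_coll      # smoothed P(t|M_D), Equation 2.13
        if p == 0.0:
            return float("-inf")                # term unseen everywhere
        log_score += math.log(p)                # log of the product in Equation 2.17
    return log_score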
Statistical language models have several beneficial properties. If users are assumed to provide query terms that are likely to occur in documents of interest, and that distinguish those documents from other documents in the corpus, language models provide a degree of confidence that a particular document should be retrieved [162]. Further, while the vector space and probabilistic models use a crude approximation to document corpus statistics (such as document frequency, discrimination value and document length), language models are sometimes seen to provide a more integrated and natural use of corpus statistics [162].

2.3.2 Discussion

The most effective implementations of each of the retrieval models discussed above have been empirically shown to be very similar [53, 60, 106, 110, 119, 121]. Discrepancies previously observed in the effectiveness of the different models have been found to be due to differences in the underlying statistics used in the model implementation, and not the model formalisation [186]. All models employ a tf × idf approach to some degree, and normalise term contribution using document length. This is explicit in probabilistic [170] and vector space models [186], and is often included within the smoothing function in language models [155, 218]. The use of these document statistics in information retrieval systems has been empirically validated over the past ten years [155, 168].

When dealing with free-text elements, experiments within this thesis use the probabilistic ranking function Okapi BM25 without prior relevance information [170]. This function has been empirically validated to perform as well as current state-of-the-art ranking functions [53, 57, 58, 59, 60, 168, 170]. Further discussion and comparison of full-text ranking functions is outside the scope of this thesis; interested readers should consult [14, 176, 191, 204].

2.3.3 Other evidence

To build a baseline that achieves similar performance to that of popular web and WWW search engines several further types of document-level evidence may need to be considered [31, 109, 113].

2.3.3.1 Metadata

Metadata is data used to describe data. An example of real-world metadata is a library catalogue card, which contains data that describes a book within the library (although metadata is not always stored separately from the document it describes). In web documents metadata may be stored within HTML metadata tags (<META>), or in a
    §2.3 Document-level evidence17 separate XML/RDF resource descriptors. As metadata tags are intended to describe document contents, the content of metadata tags is not rendered by web browsers. Several standards exist for metadata creation, one of the least restricted forms of which is simple Dublin Core [70]. Dublin Core provides a small set of core elements (all of which are optional) that are used to describe resources. These elements include: document author, title, subject, description, and language. An example of HTML metadata usage, taken from http://cs.anu.edu.au/∼Trystan.Upstill/19 is: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> <meta name="keywords" content="Upstill, Web, Information, Retrieval" /> <meta name="description" content="Trystan Upstill’s Homepage, Web IR" /> <meta name="revised" content="Trystan Upstill, 6/27/01" /> <meta name="author" content="Trystan Upstill" /> The utility of metadata depends on the observance of document authorship stan- dards. Inconsistencies between document content and purpose, and associated meta- data tags, may severely reduce system retrieval effectiveness. Such inconsistencies may occur either unintentionally through outdated metadata information, or through deliberate attribute “stuffing” in an attempt by the document author to have the doc- ument retrieved for a particular search term [71]. When a document is retrieved due to misleading metadata information, search system users may have no idea why the document has been retrieved, with no visible text justifying the document match. The use of HTML metadata tags is not considered within this thesis due to the relatively low adherence to metadata standards in documents across the WWW, and the inconsistency of adherence in other web corpora [107]. This policy is followed by many WWW search systems [71, 193]. 2.3.3.2 URL information Uniform Resource Locators, or URLs, provide web addresses for documents. The URL of a document may contain document evidence, either though term presence in the URL or implicitly through some other URL characteristic (such as depth in the site hierarchy). The URL string may contain useful query-dependent evidence by including a po- tential search term (e.g: http://cs.anu.edu.au/∼Joe.Blogs/ contains the po- tentially useful terms of “Joe” and “Blogs”). URLs can be matched using simple string matching techniques (e.g. checking if the text is present or not) or using full-text 19 META tags have been formatted according to X-HTML 1.0.
    18 Background ranking algorithms(although a binary term presence vector would probably suffice). Ogilvie and Callan [50, 155, 156] proposed a novel method for matching URL strings within a language modelling framework. In their method the probability that a URL was generated for a particular term, given the URLs of all corpus documents, is cal- culated. Query terms and URLs are treated as character sequences and a character- based trigram generative probability is computed for each URL. The numerator and denominator probabilities in the trigram expansion are then estimated using a linear interpolation with the collection model [50, 155, 156]. Ogilvie and Callan then com- bined this URL-based language model with the language models of other document components. The actual contribution of this type of URL matching is unclear.20 Further query-independent evidence relating to URLs might also be gained through examining common formatting practices. For example some features could be correlated with the length of a URL (by characters or directory depth), the match- ing of a particular character in the URL (e.g. looking for ‘∼’ when matching personal home pages [181]), or a more advanced metric. Westerveld et al. [135, 212] proposed a URL-type indicator for estimating the likelihood that a page is a home page. In this measure URLs are grouped into four categories, Root, Subroot, Path and File, using the following rules: Root a domain name, e.g. www.cyborg.com/. Subroot a domain name followed by a single directory, e.g. www.glasgow.ac.uk/staff/. Path a domain name followed by two or more directories, e.g. trec.nist.gov/pubs/trec9/. File any URL ending in a filename rather than a directory, e.g. trec.nist.gov/contact.html. Westerveld et al. [135, 212] calculated probabilities for encountering a home page in each of these URL-types using training data on the WT10g collection (described in Section 2.6.7.2). They then used these probabilities to assign scores to documents based on the likelihood that their document URL would be a home page. In experiments reported within this thesis, URL-type and URL length informa- tion are considered. While the textual elements in a URL may be useful in doc- ument matching, consistent benefits arising from their use are yet to be substanti- ated [107, 155]. As such they are not considered within this work. 20 Ranking functions which included this URL measure performed well, but the contribution of the URL measure was unclear.
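The Root/Subroot/Path/File categorisation described above can be sketched as a small classifier. The treatment of bare host names and trailing slashes below is a guess at reasonable behaviour rather than a reconstruction of Westerveld et al.'s exact rules, and the example URLs are those listed above.

from urllib.parse import urlparse

def url_type(url):
    # Classify a URL as 'root', 'subroot', 'path' or 'file' in the style of
    # the URL-type indicator of Westerveld et al. described above.
    if "://" not in url:
        url = "http://" + url                 # allow bare host names
    path = urlparse(url).path
    if path == "" or path.endswith("/"):
        depth = len([segment for segment in path.split("/") if segment])
        if depth == 0:
            return "root"
        if depth == 1:
            return "subroot"
        return "path"
    return "file"                             # URL ends in a filename

for u in ["www.cyborg.com/", "www.glasgow.ac.uk/staff/",
          "trec.nist.gov/pubs/trec9/", "trec.nist.gov/contact.html"]:
    print(u, "->", url_type(u))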
    §2.3 Document-level evidence19 2.3.3.3 Document structure and tag information Important information might be marked up within a web document to indicate to a document viewer that a particular segment of the document, or full document, is important. For example useful evidence could be collected from: • Titles / Heading tags: encoded in <H?> or <TITLE> tags. • Marked-up text: For example bold (B), emphasised (E) or italic (I) text may contain important information. • Internal tag structure: The structural makeup of a document may give insight into what a document contains. For example, if a document contains a very long table, list or form, this may give some indication as to the utility of that document. • Descriptive text tags: Images often include descriptions of their content for users viewing web pages without graphics capabilities. These are included as an at- tribute in the IMG tag (ALT=). Ogilvie and Callan [50, 155, 156] achieved small effectiveness gains through an up- weighting of TITLE, Image ALT text and FONT tag text for both named page finding and home page finding tasks. However, the effectiveness gains through the use of these additional forms of evidence were small compared to those achieved through the use of document full-text, referring anchor-text and URL length priors.21 The only document structure used in experiments within this thesis is document TITLE. While there is some evidence to suggest that up-weighting marked-up text might provide some gains, experiments have shown that the associated improvement is relatively small [155]. 2.3.3.4 Quality metrics Zhu and Gauch [219] considered whether the effectiveness of full-text-based docu- ment ranking22 could be improved through the inclusion of quality metrics. They evaluated six measures of document quality: • Currency: how recently a document was last modified (using document time stamps). • Availability: how many links leaving a document were available (calculated as the number of broken links from a page divided by the total number of links). • Information-to-noise: a measurement of how much text in the document was noise (such as HTML tags or whitespace) as opposed to how much was useful content. 21 Using the length of a URL to estimate a prior probability of document relevance. 22 Calculated using a tf × idf vector space model (see Section 2.3.1.2).
    20 Background • Authority:a score sourced from Yahoo Internet Life reviews and ZDNet ratings in 1999. According to these reviews each site was assigned an authority score. Sites not reviewed were assigned an authority score of zero. • Popularity: how many documents link to the site (in-degree). This information was sourced from AltaVista [7]. The in-degree measure is discussed in detail in Section 2.4.3.1. • Cohesiveness: how closely related the elements of a web page are, determined by classifying elements using a vector space model into a 4385 node ontology and measuring the distance between competing classifications. A small distance between classifications indicates that the document was cohesive. A large dis- tance indicates the opposite. Zhu and Gauch [219] evaluated performance using a small corpus with 40 queries taken from a query log file.23 They observed some improvement in mean precision based on all the quality metrics, although not all improvements were significant.24 The smallest individual improvements were for “Popularity” and “Authority” (both non-significant). The improvements obtained through the use of all other metrics was significant. The largest individual improvement was observed for the “Information- to-noise” ratio. Using all quality metrics apart from “Popularity” and “Authority” resulted in a (significant) 24% increase in performance over the baseline document ranking [219]. These quality metrics, apart from in-degree, are not included in experiments within this thesis because sourced information may be incomplete [219] or inaccu- rate [113]. 2.3.3.5 Units of retrieval Identifying the URL which contains the information unit most relevant to the user may be a difficult task. There are many ways in which a unit of information may be defined on a web and so the granularity of information units retrieved by web search engines may vary considerably. If the granularity is too fine (e.g. the retrieval of a single document URL when a whole web site is relevant), the user may not be able to fulfil their information need. In particular the user may not be able to tell whether the system has retrieved an ad- equate answer, or the retrieved document list may contain many parts of a composite document from a single web site. If the unit of retrieval is too large (e.g. the retrieval of a home page URL when only a deep page is relevant), the information may be buried such that it is difficult for users to retrieve. The obvious unit for WWW-based document retrieval is the web page. However, there are many situations in which a user may be looking for a smaller element of 23 It is unclear how the underlying search task [106, 108] was modelled in this experiment. 24 Significance was tested using a paired-samples t-test [219].
    §2.3 Document-level evidence21 information, such as when seeking an answer to a specific question. Alternatively, a unit of information may be considered to be a set of web pages. It is common for web documents to be made up of multiple web pages, or at least be related to other co- located documents [75]. An example of a composite document is the WWW site for the Keith van Rijsbergen book ‘Information Retrieval’ which consists of many pages, each containing small sections from the book [205]. In a study of the IBM intranet Eiron and McCurley [75] reported that approximately 25% of all URLs encountered on the IBM corpus were members of some larger “compound” document that spanned several pages. The problem of determining the “most useful” level for an information unit was considered in the 2003 TREC Topic Distillation task (TD2003 – described in Sec- tion 2.6.7). The TD2003 task judged systems according to whether they retrieved im- portant resources, and did not mark subsidiary documents as being relevant [60]. The TD2003 task is similar to the “component coverage” assessment used in the INEX XML task [85], where XML retrieval systems are rewarded for retrieving the correct unit of information. In the XML task the optimal system would return the unit of information that contains the relevant information and nothing else. Some methods analyse the web graph and site structure in an attempt to identify logical information units. Terveen et al. build site units by graphing co-located pages, using a method entitled “clan graphs” [196]. Further methods attempt to determine the appropriate information unit by applying a set of heuristics based on site hierarchy and linkage [13, 142]. This thesis adopts the view that finding the correct information unit is analogous to finding the optimal entry point for the correct information unit. As such, none of the heuristics outlined above are used to detect information units. Instead, hyperlink recommendation and other document evidence is evaluated according to whether it can be used to find information unit entry points. Document segmentation Document segmentation methods break-down HTML documents into document com- ponents that can be analysed individually. A commonly used segmentation method is to break-down HTML documents into their Document Object Model (DOM), accord- ing to the document tag hierarchy [42, 45]. Visual Information Processing System (VIPS) [37, 38, 39] is a recently proposed extension of DOM-based break-down and dissects HTML documents using visual elements in addition to their DOM. Document segmentation techniques are not considered in this thesis. While finer document breakdown might be useful for finding short answers to particular ques- tions, there is little evidence of improvements in ranking at the web page level [39].
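Purely as an illustration of the DOM-based break-down idea described above (it is not used in the thesis experiments), the sketch below collects text fragments keyed by their position in the tag hierarchy, using Python's standard html.parser module. The sample document is invented.

from html.parser import HTMLParser

class DomSegmenter(HTMLParser):
    """Collect text fragments keyed by their path in the tag hierarchy,
    a crude stand-in for DOM-based document segmentation."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags, e.g. ['html', 'body', 'p']
        self.segments = []     # (path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag; tolerates sloppy HTML.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.segments.append(("/".join(self.stack), text))

segmenter = DomSegmenter()
segmenter.feed("<html><body><h1>Example site</h1>"
               "<p>A short page used only for illustration.</p></body></html>")
print(segmenter.segments)
# [('html/body/h1', 'Example site'), ('html/body/p', 'A short page used only for illustration.')]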
    22 Background 2.4 Web-basedevidence Many early WWW search engines conceptualised the document corpus as a flat struc- ture and relied solely on the document-level evidence outlined above, ignoring hy- perlinks between documents [33]. This section outlines techniques for exploiting the web graph that is created when considering documents within a web as nodes and hyperlinks between documents as directed edges. This thesis does not consider measures based on user interaction with the web search system, such as click-through evidence [129]. While click-through evidence may be useful when ranking web pages, assumptions made about user behaviour may be questionable. In many cases it may be difficult to determine whether users have judged a document relevant from a sequence of queries and clicks. Collecting such evidence also requires access to user interaction logs for a large scale search system. Work within this thesis relating to the combination of query-dependent evidence with other query-independent evidence is applicable to this domain. The WWW graph was initially hypothesised to be a small world network [18], that is, a network that has a finite diameter,25 where each node has a path to every other node by a relatively small number of steps. Small world networks have been shown to exist in other natural phenomena, such as relationships between research scientists or between actors [2, 5, 6, 18]. Barabasi hypothesised that the diameter of the WWW graph was 18.59 links (estimated for 8 × 108 documents) [18]. However, this work was challenged by WWW graph analysis performed by Broder et al. [35]. Using a 200 million page crawl from AltaVista, which contained 1.5 billion links [7], Broder et al. observed that the WWW graph’s maximal and average diameter was infinite. The study revealed that the WWW graph resembles a bow-tie with a Strongly Connected Component (SCC), an upstream component (IN), a downstream component (OUT), links between IN and OUT (Tendrils), and disconnected components. Each of these components was observed to be roughly the same size (around 50 million nodes). The SCC is a highly connected graph that exhibits the small-world property. The IN component consists of nodes that link into the SCC, but cannot be accessed from the SCC. The OUT component consists of nodes that are linked to from the SCC, but do not link back to the SCC. Tendrils link IN nodes directly to OUT nodes, bypassing the SCC. Disconnected components are pages to which no-one linked, and which linked- to no-one. The minimal diameter26 for the bow-tie was 28 for the SCC and 500 for the entire graph. The probability of a directed path existing between two nodes was observed to be 24%, and the average length of such a path was observed to be 16 links. The shortest directed path between two random nodes in the SCC was, on average, 16 to 20 links. Further work by Dill et al. [67] has reported that WWW subgraphs, when restricted by domain or keyword occurrence, also form bow-tie-like structures. This phenomenon has been termed the fractal nature of the WWW, and is exhibited by 25 Average distance between two nodes in a graph 26 The minimum number of steps by which the graph could be crossed
    §2.4 Web-based evidence23 other scale-free networks [67]. Many WWW distributions have been observed to follow a power law [3]. That is, the distributions take some form k = 1/ix for i > 1, where k is the probability that a node has the value i according to some exponent x. Important WWW distributions that have been observed to follow the power law include: • WWW site in-links (in-degrees). The fraction of pages with an in-degree i was first approximated by Kumar et al. [136, 137] to be distributed according to power law with exponent x = 2 on a 1997 crawl of around 40 million pages gathered by Alexa.27 Later Barabasi et al. estimated the exponent at x = 2.1 over a graph computed for a corpus containing 325 000 documents from the nd.edu domain [17, 18]. Broder et al. [35] have since confirmed the estimate of x = 2.1. • WWW site out-links (out-degrees). Barabasi and Albert [17] estimated a power law distribution with exponent x = 2.45. Broder et al. [35] reported a x = 2.75 exponent for out-degree on a 200 million page crawl from AltaVista. • Local WWW site in-degrees and out-degrees [25]. • WWW site accesses [4]. 2.4.1 Anchor-text evidence Web authors often supply textual snippets when marking-up links between web doc- uments, encoded within anchor “<A HREF=’’></A>” tags. The average length of an anchor-text snippet has been observed to be 2.85 terms [159]. This is similar to the average query length submitted to WWW search engines [184] and suggests there might be some similarity between a document’s anchor-text and the queries typically submitted to search engines to find that document [73, 74]. A common method for exploiting anchor-text is to combine all anchor-text snip- pets pointing to a single document into a single aggregate anchor-text document, and then to use the aggregate document to score the target document [56]. In terms of document evidence, this aggregate anchor-text document may give some indication of what other web authors view as the content, or purpose, of a document. It has been observed that anchor-text frequently includes information associated with a page that is not included in the page itself [90]. To increase the anchor-text information collected for hyperlinks, anchor-text evi- dence can be expanded to include text outside (but in close proximity to) anchor-tags. However, there is disagreement regarding whether such text should be included. Chakrabarti [44] investigated the potential utility of text surrounding anchor tags by measuring the proximity of the term “Yahoo” to the anchor tags of links to http://www.yahoo.com in 5000 web documents. Chakrabarti found that includ- ing 50 words around the anchor tags performed best as most occurrences of Yahoo 27 http://www.alexa.com
Distance:  -100   -75   -50   -25     0    25    50    75   100
Density:      1     6    11    31   880    73   112    21     7

Table 2.1: Proximity of the term “Yahoo” to links to http://www.yahoo.com/ for 5000 WWW documents (from [44]). Distance is measured in bytes. A distance of 0 indicates that “Yahoo” appeared within the anchor tag. A negative distance indicates it occurred before the anchor-tag, and a positive distance indicates that it occurred after the tag.

Chakrabarti found that using this extra text improved recall, but at the cost of precision (precision and recall are described in Section 2.6.6.1). In later research Davison [64, 65] reported that extra text surrounding the anchor-text did not describe the target document any more accurately than the text within anchor-tags. However, Glover et al. [90] reported that using up to 25 terms around anchor-text tags improved page-content classification performance. Pant et al. [159] proposed a further method for expanding anchor-text evidence using a DOM break-down (DOM described in Section 2.3.3.5). They suggested that if an anchor-text snippet contains under 20 terms then the anchor-text evidence should be extended to consider all text up to the next set of HTML tags. They found that expanding to between two and four HTML tag levels improved classification of the target documents when compared to only using text that occurred within anchor-tags.

Experiments within this thesis only consider text within the anchor tags, as there is little conclusive evidence to support the use of text surrounding anchor tags.

Anchor-text ranking

Approaches to ranking anchor-text evidence include:

• Vector space. Hyperlink Vector Voting, proposed by Li and Rafsky [143], ranks anchor-text evidence using a vector space containing all anchor-text pointing to a document. The final score is the sum of all the dot products between the query vector and anchor-text vectors. Li and Rafsky did not formally evaluate this method.

• Okapi BM25. Craswell, Hawking and Robertson [56] built surrogate documents from all the anchor-text snippets pointing to a page and ranked the documents as if they contained document full-text. This application of anchor-text provided dramatic improvements in navigational search performance.^28

• Language Modelling. Ogilvie and Callan [155] modelled anchor-text separately from other document evidence using a unigram language model with Dirichlet smoothing. The anchor-text language model was then combined with their models for other sections of the document using a mixture model (see Section 2.5.2.2). This type of anchor-text scoring has been empirically evaluated and shown to be effective [155, 156].

^28 Navigational search is described in Section 2.6.2.
Unless otherwise noted, the anchor-text baselines used in this thesis are scored from anchor-text aggregate documents using the Okapi BM25 ranking algorithm. This method is used because it has previously been reported to perform well [56].

2.4.2 Bibliometric measures

[Figure 2.1: A sample network of relationships]

Social networks researchers [125, 131] are concerned with the general study of links in nature for diverse applications, including communication (to detect espionage or optimise transmission) and modelling disease outbreak [89]. Bibliometrics researchers are similarly interested in the citation patterns between research papers [87, 89], and study these citations in an attempt to identify relationships. This can be seen as a specialisation of social network analysis [89]. In many social network models, there is an implicit assumption that the occurrence of a link (citation) indicates a relationship or some attribution of prestige. However, in the context of some areas (such as research) it may be difficult to determine whether a citation is an indication of praise or retort [203].

Social networks and citations may be modelled using link adjacency matrices. A directed social network of size n can be represented as an n × n matrix, where links between nodes are encoded in the matrix (e.g. if a node i links to j, then E_{i,j} = 1). For example, the relationship network shown in Figure 2.1 may be represented as:

E = \begin{pmatrix}
0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0
\end{pmatrix}
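For concreteness, the matrix above can be generated from an edge list, and its row and column sums give out-degrees and in-degrees (the direct prestige measure discussed below). The edge list used in this sketch is one reading of the Figure 2.1 network, chosen so that it reproduces the matrix E shown; it is illustrative only.

import numpy as np

def adjacency_matrix(n, edges):
    """E[i, j] = 1 if node i links to node j (nodes numbered from 1)."""
    E = np.zeros((n, n), dtype=int)
    for i, j in edges:
        E[i - 1, j - 1] = 1
    return E

# Edges read off the Figure 2.1 example network.
edges = [(1, 3), (1, 4), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5), (5, 3)]
E = adjacency_matrix(5, edges)
print(E)                              # reproduces the matrix shown above
print("out-degree:", E.sum(axis=1))   # [2 3 1 1 1]
print("in-degree: ", E.sum(axis=0))   # [0 0 3 3 2]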
Prestige

The number of incoming links to a node is a basic measure of its prestige [131]. This gives a measure of the direct endorsement the node has received. However, examining direct endorsements alone may not give an accurate representation of node prestige. It may be more interesting to know if a node is recognised by other important nodes, thus transitive citation becomes important. A transitive endorsement is an endorsement through an intermediate node (i.e. if A links to B, and B links to C, then A weakly endorses C).

An early measure of prestige in a social network analysis was proposed by Seeley [179] and later revised by Hubbell [125]. In this model, every document has an initial prestige associated with it (represented as a row in p), which is transferred to its adjacent nodes (through the adjacency matrix E). Thus the direct prestige of any (a priori equal) node can be calculated by setting p = (1, ..., 1)^T and calculating p = pE^T. By performing a power iteration over p ← pE^T the prestige measure p converges to the principal eigenvector of the matrix E^T and provides a measure of transitive prestige.^29 The power iteration method multiplies p by increasing powers of E^T until the calculation converges (tested using some convergence constant).

To measure prestige for academic journals Garfield [88] proposed the “impact factor”. The impact factor score for a journal j is the average number of citations to papers within that journal received during the previous two years. Pinski and Narin [161] proposed a variation to the “impact factor”, termed the influence weight, based on the observation that all journal citations may not be equally important. They hypothesised that a journal is influential if its papers are cited by papers in other influential journals, and thus incorporate a measure of transitive endorsement. This notion of transitive endorsement is similar to that modelled in PageRank and HITS (described in Sections 2.4.3.2 and 2.4.4.1).

Co-citation and bibliographic coupling

Co-citation is used to measure subject similarity between two documents. If a document A cites documents B and C, documents B and C are co-cited by A. If many documents cite both documents B and C, this indicates that B and C may be related [187]. The more documents that cite both B and C, the closer their relationship. The co-citation matrix (CC) is calculated as:

CC = E^T E   (2.18)

where CC_{i,j} is the number of papers which jointly cite papers i and j, and the diagonal is node in-degree.

Bibliographic coupling is the inverse of co-citation, and infers that if two documents include the same references then they are likely to be related, i.e. if documents A and B both cite document C this gives some indication that they are related.

^29 See Golub and Van Loan [91] for more information about principal eigenvectors and the power method [pp. 330–332].
The more documents that documents A and B both cite, the stronger their relationship. The bibliographic coupling (BC) matrix is calculated as:

BC = EE^T   (2.19)

where BC_{i,j} is the number of papers jointly cited by i and j and the diagonal is node out-degree.

Citation graph measures

Important information may be conveyed by the distance between two nodes in a citation graph, the radius of a node (the maximum distance from a node to the graph edge), the cut of the graph (the edges of the graph that, when removed, will disconnect large sections of the graph), and the centre of the graph (the node that has the smallest radius). For example, when examining a field of research, interesting papers can be identified by their small radius, as this indicates that most papers in the area have a short path to the paper. The cut of the graph typically indicates communication between cliques, and can be used to identify important nodes, whose omission would lead to the loss of the relationship between the groups [196].

2.4.2.1 Bibliographic methods applied to a web

Hyperlink-based scoring assumes that web hyperlinks provide some vote for the importance of their targets. However, due to the relatively small cost of web publishing, the discretion used when creating links between web pages may be less than is employed by researchers in scientific literature [203]. Indeed it has been observed that not all web links are created for recommendation purposes [63] (discussed in Section 3.1.5).

An early use of hyperlink-based evidence was in a WWW site visualisation, where a site's visibility represented its direct prestige, and the out-degree of a site was the node's luminosity [30]. Larson [138] presented one of the first applications of bibliometrics on the WWW by using co-citation to cluster related web pages and to explore topical themes.

Marchiori [145] provided an early examination of the use of hyperlink evidence in a document ranking scheme, by proposing that a document's score should be relative to that document's full-text score and “hyper” (hypertext-based) score. Marchiori's model was based on the idea that a document's quality is enriched through the provision of links to other important resources. In this model, the “hyper-information” score was a measure based on a document's subsidiary links, rather than its parent links. The page score was dependent not only on its full-text content, but the content of its subsidiaries as well. A decay factor was implemented such that the farther a subsidiary was from the initial document, the less its contribution would be.

Xu and Croft [215] outline two broad domains for web-based hyperlink information: global link information and local link information. Global link information is computed from a full web graph, based on links between all documents in a corpus [40, 215].
In comparison, local link information is built for some subset of the graph currently under examination, such as the set of documents retrieved in response to a particular query. In many cases the additional cost involved in calculating local link information might be unacceptable for web or WWW search systems [40].

2.4.3 Hyperlink recommendation

The hyperlink recommendation techniques examined here are similar to the bibliometric measures of prestige, and may be able to provide some measure of the “importance”, “quality” or “authority” of a web document [31]. This hypothesis is tested through experiments presented in Chapter 5.

2.4.3.1 Link counting / in-degree

A page's in-degree score is a measure of its direct prestige, and is obtained through a count of its incoming links [29, 41]. It is widely believed that a web page's in-degree may give some indication of its importance or popularity [219].

In an analysis of link targets Bharat et al. [25] found that the US commercial domain .com had higher in-degree on average than all other domains. Sites within the .org and .net domains also had higher in-degree (on average) than sites in other countries.

2.4.3.2 PageRank

PageRank is a more sophisticated query-independent link citation measure developed by Page and Brin [31, 157] to “objectively and mechanically [measure] the human interest and attention devoted [to web pages]” [157]. PageRank uses global link information and is stated to be the primary link recommendation scheme employed in the Google search engine [93] and search appliance [96]. PageRank is designed to simulate the behaviour of a “random web surfer” [157] who navigates a web by randomly following links. If a page with no outgoing links is reached, the surfer jumps to a randomly chosen bookmark. In addition to this normal surfing behaviour, the surfer occasionally spontaneously jumps to a bookmark instead of following a link. The PageRank of a page is the probability that the web surfer will be visiting that page at any given moment.

PageRank is similar to bibliometric prestige, but differs by down-weighting documents that have many outgoing links; the fewer links a node has, the larger the portion of prestige it will bestow on its outgoing links. The PageRank distribution matrix (E^{PR}) is then:

E^{PR}_{i,j} = \frac{E_{i,j}}{\sum_{n=1}^{\dim(E)} E_{n,j}}   (2.20)

for the link adjacency matrix E.
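Before continuing with the PageRank formulation, it is worth noting how cheaply the in-degree measure of Section 2.4.3.1 can be computed: a single pass over the resolved link pairs suffices. The pairs in the sketch below are hypothetical, and self-links are discarded, a common though not universal choice.

from collections import Counter

def in_degrees(edges):
    """Count incoming links per target; a page's in-degree is its direct
    prestige under the simplest hyperlink recommendation measure."""
    return Counter(target for source, target in edges if source != target)

# Hypothetical resolved link pairs (source URL, target URL).
edges = [("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "d")]
print(in_degrees(edges))   # c and d each have in-degree 2; the self-link d->d is ignored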
The PageRank distribution matrix (E^{PR}) is a non-negative stochastic matrix that is aperiodic^30 and irreducible.^31 The PageRank calculation is a Markov process, where PageRank is an n-state system and the distribution matrix (E^{PR}) contains the independent transition probabilities E^{PR}_{i,j} of jumping from state i to j. If the random surfer is in all states with equal probability when leaving from a node i, then E^{PR}_{1..n,j} = (1/n, ..., 1/n).

^30 Every node can reach any other node at any time-step (implies irreducibility).
^31 Every node can reach every other node.

The basic formulation of a single iteration of PageRank is then:

p = p × (E^{PR})^T   (2.21)

where p is initialised according to the bookmark vector (by default a unit vector), and is the updated PageRank score after each iteration.

Page and Brin observed that unlike scientific citation graphs, it is quite common to find sections of the web graph that act as “rank sinks”. To address this difficulty Page and Brin introduced a random jump (or teleport) component where, with a constant probability d, the surfer jumps to a random bookmarked node in b. That is:

p = ((1 − d) × b) + d × (p × (E^{PR})^T)   (2.22)

If d = 0 or b is not broad enough, the PageRank calculation may not converge [102].

Another complexity in the PageRank calculation is presented by nodes that act as “rank leaks”; this occurs if the surfer encounters a page with no outgoing links, or a link to a page that is outside the crawl (a dangling link). One approach to resolving this issue is to jump with certainty (a probability of one) when a dangling link is encountered [154]. This approach, and several others, are covered in more detail in Section 3.3.1. If applying the “jump with certainty” method, and using a unit b bookmark vector (such that the random surfer has every page bookmarked), the final PageRank scores are equivalent to the principal eigenvector of the transition matrix E^{PR}, where E^{PR} is updated to include the random jump factor:

E^{PR}_{i,j} = \frac{1 − d}{\dim(E)} + d × \frac{E_{i,j}}{\sum_{n=1}^{\dim(E)} E_{n,j}}   (2.23)

Expressed algorithmically, the PageRank algorithm (when using “jump with certainty”) is:

R_0 ← S
loop:
    r ← dang(R_i)
    R_{i+1} ← rE + A R_i
    R_{i+1} ← (1 − d)E + d(R_{i+1})
    δ ← ||R_{i+1} − R_i||_1
while δ > ε
where R_i is the PageRank vector at iteration i, A is the link adjacency matrix (where A_{i,j} = 1 if a link exists, and is 0 otherwise), S is the initial PageRank vector (the probability that a surfer starts at a node), E is the vector of bookmarked pages (the probability that the surfer jumps to a certain node at random), dang() is a function that returns the PageRank of all nodes that have no outgoing links, r is the amount of PageRank lost due to dangling links, which is distributed amongst bookmarks (after [43, 154]), d is a constant which controls the proportion of random noise (spontaneous jumping) introduced into the system to ensure stability (0 < d < 1), and ε is the convergence constant. The double bar (|| · ||_1) notation indicates an l1 norm, the sum of the absolute values of a vector's elements.

In this formulation, for a given link graph, PageRank varies according to the values of the d constant and the set of bookmark pages E. The PageRank variants investigated in this thesis are described in more detail in Section 3.3.

2.4.3.3 Topic-specific PageRank

Further PageRank formulations seek to personalise the calculation according to user preferences. Haveliwala [103, 104] proposed Personalised PageRank, and demonstrated how user topic preferences may be introduced by modifying the bookmark vector and changing the random jump targets, thereby altering PageRank scores. Haveliwala proposed that a bookmark vector be built for each top-level DMOZ [69] category by including all URLs within the tree as bookmarks.

During query processing, each incoming query is classified into these categories (represented by the influence vector v) and a new “dynamic” PageRank score is computed from a weighted sum of the category-specific PageRanks (ppr), and the PageRank calculation is modified to explicitly include a bookmark vector (e.g. PR(E, b) is the PageRank calculation for the adjacency matrix E using bookmarks b). So:

ppr = PR(E, v)   (2.24)

Category preferences can also be mixed. To compute a set of personalisation vectors (v_i) with weights (w_i) for a mixture of categories:

ppr = PR(E, \sum_i [w_i \cdot v_i])   (2.25)
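The following sketch is a generic power-iteration implementation of the calculation just described: dangling-page mass is returned to the bookmark vector (the “jump with certainty” treatment) and, with weight 1 − d, the surfer teleports to a bookmarked page, mirroring the update R_{i+1} ← (1 − d)E + d(R_{i+1}) above. It is illustrative rather than the implementation used in later chapters, and the parameter values are arbitrary.

import numpy as np

def pagerank(E, d=0.85, bookmarks=None, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a link adjacency matrix E, where
    E[i, j] = 1 if page i links to page j.  Dangling-page mass is given
    back to the bookmark vector ("jump with certainty"), and with weight
    (1 - d) the surfer teleports to a bookmarked page."""
    n = E.shape[0]
    b = np.full(n, 1.0 / n) if bookmarks is None else bookmarks / bookmarks.sum()
    out_deg = E.sum(axis=1)
    # Row-stochastic transition: follow one of the page's out-links uniformly.
    T = np.divide(E, out_deg[:, None], out=np.zeros_like(E, dtype=float),
                  where=out_deg[:, None] > 0)
    p = b.copy()
    for _ in range(max_iter):
        dangling = p[out_deg == 0].sum()        # rank held by pages with no out-links
        new_p = d * (p @ T + dangling * b) + (1 - d) * b
        if np.abs(new_p - p).sum() < tol:       # l1-norm convergence test
            return new_p
        p = new_p
    return p

# The Figure 2.1 adjacency matrix used earlier in this section.
E = np.array([[0, 0, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0]], dtype=float)
print(pagerank(E).round(4))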
2.4.4 Other hyperlink analysis methods

2.4.4.1 HITS

Hyperlink Induced Topic Search (HITS) is a method used to identify two sets of pages that may be important: Hub pages and Authority pages [132]. Hub and Authority pages have a mutually reinforcing relationship – a good Hub page links to many Authority pages (thereby indicating high Authority co-citation), and a good Authority page is linked-to by many Hubs (thereby indicating high Hub bibliographic coupling).

Each page in the web graph is assigned two measures of quality; an Authority score Au[u] and a Hub score H[u]. Sometimes the act of generating HITS results sets is termed “Topic Distillation”, but in this thesis the phrase is associated with its use in the TREC web track experiments (described in Section 2.6.3.1).

HITS-based scores may be computed using either local or global link information. Local HITS has two major steps: collection sampling and weight propagation. Global HITS is computed for the entire web graph at once so there is no collection sampling step.

When calculating local HITS a small focused web subgraph, often based around a search engine result list, is retrieved for a particular query.^32 This root set of pages is then expanded to make a base set by downloading all pages that link to, or are linked-to by, pages within the root set. The assumption is that, although the base set may not be a fully connected graph, it should include a large connected component (otherwise the computation is ineffective).

The Hub and Authority score computation is a recursive process where Au and H are updated until convergence (initialised with all pages having the same value). For a graph containing edges E, where (q, p) denotes a link from q to p, the weight is distributed according to:

Au_p = \sum_{(q,p) \in E} H_q   (2.26)

H_p = \sum_{(p,q) \in E} Au_q   (2.27)

Like PageRank, these equations can be solved using the power method [91]. Au will converge to the principal eigenvector of E^T E, and H will converge to the principal eigenvector of EE^T [154]. The non-principal eigenvectors can also be calculated, and may represent different term clusters [132]. For example, three term clusters (and corresponding meanings) occur for the query ‘Jaguar’: one on the large cat, one on the Atari hand-held game console, and one on the prestige car [132].

Revisiting HITS

Several limitations of the HITS model, as presented by Kleinberg [132], were observed and addressed by Bharat and Henzinger [26]. These are:

• Mutually reinforcing relationships between hosts. This occurs when a set of documents on one host point to a single document on a second host.

• Automatically generated links. This occurs when web documents are generated by tools and are not authored (recommendation) links.

• Non-relevant nodes. This arises through what Bharat and Henzinger termed topic drift. Topic drift occurs when the local subgraph is expanded to include surrounding links, and as a result, pages not relevant to the initial query are included in the graph, and therefore in the HITS calculation.

^32 This was originally performed using result sets from the AltaVista WWW search engine [7].
Bharat and Henzinger [26] addressed the first and second issues by assigning a weight to identical multiple links “inversely proportional to their multiplicity”.^33 To address the third problem, topic drift, they removed content outliers. This was achieved by computing a document vector centroid and removing pages that were dissimilar to the vector from the base set.

Lempel and Moran [140, 141] proposed a more “robust” stochastic version of HITS called SALSA (Stochastic Algorithm for Link Structure Analysis). This algorithm aims to address concerns that Tightly Knit Communities (TKC) affect HITS calculations. A TKC occurs when a small collection of pages is connected so that every Hub links to every Authority. The pages in a TKC can be ranked very highly by HITS, and therefore achieve the principal eigenvector, even when there is a larger collection of pages in which Hubs link to Authorities (but are not completely connected). The TKC effect could be used by spammers to increase Hub and Authority rankings for their pages, using techniques such as link farming.^34

Calado et al. [40] observed significant improvement through the use of local and global HITS over a document full-text-only baseline. The experiments examined a set of 50 informational-type queries (see Section 2.6.1) extracted from a Brazilian WWW search engine log. The queries were observed to be 1.78 terms long on average, significantly shorter than those observed in previous WWW log studies (2.35 terms [184]). Further, it was observed that 28 of the queries were very general, and consisted of terms such as “tabs”, “movies” and “mp3s”. The information needs behind these queries were estimated for relevance assessment following the method proposed by Hawking and Craswell [110]. Through the addition of local HITS to the baseline vector space ranking Calado et al. observed an improvement of 8% in precision^35 at ten documents retrieved. Through the incorporation of global HITS evidence they observed an improvement of 24% in precision at ten documents retrieved. The improvements were reported to be significant for local link analysis after thirty results, and for global link analysis after ten results. Similar improvements were observed through the use of PageRank.

^33 Thereby lessening the effects of nepotistic and navigational links, described in Section 3.1.5.
^34 Link farms are artificial web graphs created by spammers through the generation of link spam. They are designed to funnel hyperlink evidence to a set of pages for which they desire high web rankings.
^35 The precision measure is described in Section 2.6.6.1.
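For reference, the Hub and Authority updates of equations (2.26) and (2.27) can be iterated directly, as in the sketch below. The example reuses the small Figure 2.1 adjacency matrix and normalises after each step; it illustrates the basic HITS computation only, without the sampling step or the refinements of Bharat and Henzinger or SALSA.

import numpy as np

def hits(E, tol=1e-10, max_iter=200):
    """Iterate the Hub/Authority updates on a link adjacency matrix E
    (E[q, p] = 1 if q links to p), normalising each step; Au tends towards
    the principal eigenvector of E^T E and H towards that of E E^T."""
    n = E.shape[0]
    au = np.ones(n) / n
    h = np.ones(n) / n
    for _ in range(max_iter):
        new_au = E.T @ h        # Au_p = sum of Hub scores of pages linking to p
        new_h = E @ new_au      # H_p  = sum of Authority scores of pages p links to
        new_au /= np.linalg.norm(new_au)
        new_h /= np.linalg.norm(new_h)
        diff = np.abs(new_au - au).sum() + np.abs(new_h - h).sum()
        au, h = new_au, new_h
        if diff < tol:
            break
    return au, h

# The same small example graph as before (Figure 2.1).
E = np.array([[0, 0, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0]], dtype=float)
au, h = hits(E)
print("authorities:", au.round(3))
print("hubs:       ", h.round(3))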
2.4.5 Discussion

This thesis only considers in-degree and variants of PageRank, and not other hyperlink recommendation techniques. In-degree is included because it is the simplest hyperlink recommendation measure and is cheap to compute. PageRank was chosen as a representative of other more expensive methods because:

• Google [93], one of the world's most popular search engines, states that PageRank is an important part of their ranking function [31, 157].

• In recent years there have been many studies of how PageRank might be improved [1, 39, 105, 197], optimised [11, 102] and personalised [103, 104, 127, 130], but there have not been any detailed evaluations of its potential benefit to retrieval effectiveness [8, 78].

• PageRank has been observed to be more resilient to small changes in the web graph than HITS [154]. This may be an important property when dealing with WWW-based search as it is difficult to construct an accurate and complete web graph (see Chapter 3), and the web graph is likely to be impacted by web server down-time [52].

• PageRank has previously been observed to exhibit similar performance to non-query-dependent HITS (global HITS) [40].

• While locally computed HITS may perform quite differently to global HITS, the cost of computing HITS at query-time is prohibitive in most production web and WWW search systems [132].

2.5 Combining document evidence

There are many ways in which the different types of evidence examined in the previous two sections could be combined into a single ranking function. It is important that the combination method is effective, as a poor combination could lead to spurious results. This section describes several methods that can be used to combine document evidence.

The discussion of combination methods is split into two sub-sections. The first sub-section reviews score and rank-based fusion methods. In fusion methods the output from ranking function components is combined without prior knowledge of the underlying retrieval model (how documents were ranked and scored). The second sub-section reviews modifications to full-text retrieval models such that they include more than one form of document evidence.

2.5.1 Score/rank fusion methods

Score or rank-based fusion techniques attempt to merge document rankings based either on document ranks, or document scores, without prior knowledge of the underlying retrieval model. The combination of multiple forms of document evidence into a single ranking is similar to the results merging problem in meta-search, where the ranked output from several systems is consolidated into a single ranking. A comprehensive discussion of meta-search data fusion techniques is provided by Montague in [151].
2.5.1.1 Linear combination of scores

The simplest method for combining evidence is with a linear combination of document scores. A linear combination of scores is referred to as combSUM in distributed information retrieval research [83]. In a linear combination the total score S for a document D and query Q, using document scoring functions F_1..F_N, is:

S(D, Q) = F_1(D, Q) + ... + F_N(D, Q)   (2.28)

For a linear combination of scores to be effective, scores need to be normalised to a common scale, and exhibit compatible distributions. As the forms of evidence considered in this thesis display different distributions, a simple linear combination of scores may not be effective. In-degree and PageRank values are distributed according to a power law [35, 159]. By contrast, Okapi BM25 scores are not distributed according to a power law. Examples of two Okapi BM25 distributions, for the top 1000 documents retrieved for 100 queries used in experiments in Chapter 7, are included in Appendix F.

2.5.1.2 Re-ranking

Another method for combining document rankings “post hoc” is to re-rank documents above some cutoff using another form of document evidence [178]. The re-ranking cutoffs can be tuned using a training set.

Re-ranking based combinations have the advantage of not requiring a full understanding of the distribution of scores underlying each type of evidence, as only the ordering of lists is considered. However, this type of re-ranking may be insensitive to the magnitude of difference between scores.^36 A further disadvantage of this method is that it is relatively expensive to re-rank long result lists.

2.5.1.3 Meta-search fusion techniques

Further methods proposed for the fusion of meta-search results include:

• combMNZ: document scores are normalised and summed, and the sum is multiplied by the number of runs that assigned the document a non-zero score [83].

• combSUM: a linear combination of scores [83] (described above).

• combMAX, combMIN, combMED: In combMAX the maximum score of all runs is considered. In combMIN the minimum score of all runs is considered. In combMED the median score of all runs is considered. These methods have previously been observed to be inferior to combMNZ and combSUM [83]. Further, these types of combinations do not make sense when used with query-independent evidence, as such evidence provides an overall ranking of documents, and needs to be used in conjunction with some form of query-dependent evidence for query processing.

^36 If re-ranking using the relative ranks of documents only, the magnitude of score differences in all forms of evidence is lost. By contrast, if re-ranking based on some score-based measure, only the magnitude of score differences in the evidence used to re-rank documents is lost.
Other techniques include Condorcet fuse, Borda fuse and the reciprocal rank function [151]. Recent empirical evidence suggests that when combining document rankings these methods are inferior to those outlined above [155].

2.5.1.4 Rank aggregation

A further method proposed for combining the ranked results lists of meta-search systems [72] and document rankings [78] is rank aggregation [79]. In rank aggregation the union of several ranked lists is taken, and the lists are merged into a single ranking with the least disturbance to any of the underlying rankings. This may reduce the promotion of documents that perform well in only one (or a small number) of the runs and poorly in the others.

The rank aggregation process can make it difficult to measure and control the contribution of each form of evidence. For this reason, rank aggregation techniques are not considered in this thesis.

2.5.1.5 Using minimum query-independent evidence thresholds

Implementing a threshold involves setting a minimum query-independent score that a document must exceed to be considered by the ranking function. That is, for some threshold τ, if QIE(D) < τ then P(R|D, Q) is estimated to be zero.^37 The use of a static threshold means that some documents may never be retrieved. A more effective technique might exploit query match statistics to dynamically determine the minimum threshold.

The potential benefits of using thresholds are two-fold: they provide an effective method by which to remove uninteresting pages (such as spam or less frequently visited pages), and they improve computational performance (by reducing the number of documents to be scored, see Section 6.2.2).

2.5.2 Revising retrieval models to address combination of evidence

Rather than combining document evidence post hoc, the underlying retrieval models can be modified to include further document evidence. The approaches outlined below combine several forms of document evidence in a single unified retrieval model, through modifications to the full-text ranking algorithms discussed in Section 2.3.1.

^37 This is similar to a rank-based re-ranking of query-independent evidence (described in Section 2.5.1.2) as documents above the cutoff are re-ranked. In comparison, the use of a cutoff does not require a full ranking of query-independent evidence, but means that some documents may never be retrieved.
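To make the score-fusion methods of Section 2.5.1 concrete, the sketch below normalises each run with a simple min-max rescaling and then forms a weighted combSUM combination. The runs and weights are hypothetical, and min-max normalisation is only one possible choice; the distributional caveats noted in Section 2.5.1.1 still apply.

def min_max_normalise(scores):
    """Rescale a {doc: score} run to [0, 1] so runs are on a common scale."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def comb_sum(runs, weights=None):
    """combSUM: an (optionally weighted) linear combination of normalised
    document scores from several runs; missing documents contribute zero."""
    weights = weights or [1.0] * len(runs)
    combined = {}
    for run, w in zip(runs, weights):
        for doc, s in min_max_normalise(run).items():
            combined[doc] = combined.get(doc, 0.0) + w * s
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Hypothetical runs: full-text BM25 scores and a query-independent PageRank score.
fulltext = {"d1": 12.3, "d2": 8.1, "d3": 2.2}
pagerank = {"d1": 0.0004, "d2": 0.0100, "d4": 0.0007}
print(comb_sum([fulltext, pagerank], weights=[1.0, 0.5]))
# d2 ranks first here: it scores well on both runs.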
2.5.2.1 Field-weighted Okapi BM25

Field-weighted Okapi BM25, proposed by Robertson et al. [173], is a modification of Okapi BM25 that combines multiple sources of evidence in a single document ranking function. Conceptually the field weighting model involves the creation of a new composite document that includes evidence from multiple document fields.^38 The importance of fields in the ranking function can be modified by re-weighting their contribution. For example, a two-fold weighting of title compared to document full-text would see the title repeated twice in the composite document.

^38 Document fields are some form of query-dependent document evidence such as document title, full-text or anchor-text.

If used with Okapi BM25, the score and rank fusion techniques outlined in Section 2.5.1 invalidate the non-linear term saturation component and may thereby lessen retrieval effectiveness [173]. The use of such post hoc score combination means that a document matching a single query term over multiple fields may outperform a document that matches several query terms in a single document field.

In Okapi BM25, the score of a document is equal to the sum of the BM25 scores of each of its terms:

S(D, Q) = \sum_{t \in Q} BM25w_{t,D}   (2.29)

The score for each term is calculated using a term weighting function and a measure of term rarity within the corpus (idf):

BM25w_{t,D} = f(tf_{t,D}) × idf_t   (2.30)

The term weighting function consists of term saturation and document length normalisation components:

f(tf_{t,D}) = \frac{tf_{t,D}}{k_1 + tf_{t,D}}, \quad f(tf_{t,D}) = \frac{tf_{t,D}}{\beta}, \quad \text{where } \beta = k_1\left((1 − b) + b\,\frac{dl}{avdl}\right)   (2.31)

where dl is the current document length, and avdl is the average length of a document in the corpus. These components are combined to form:

BM25w_{t,D} = \frac{tf_{t,D}}{k_1\left((1 − b) + b\,\frac{dl}{avdl}\right) + tf_{t,D}} × idf_t   (2.32)

In the Field-weighted Okapi BM25 model, documents are seen to contain fields F_1..F_N, each holding a different form of (query-dependent) document evidence:

F = (F_1, ..., F_N)   (2.33)
and each field is assigned a weight:

w = (w_1, ..., w_N), \quad wtf_{t,D} := \sum_{F=1}^{N} tf_{t,F} × w_F   (2.34)

where w is a vector of field weights, and wtf is the weighted term frequency. The contribution of terms is then:

f_w(wtf_{t,D}) = \frac{wtf_{t,D}}{k_1 + wtf_{t,D}}, \quad f_w(wtf_{t,D}) = \frac{wtf_{t,D}}{\beta}   (2.35)

and the document length is updated to reflect the new composite document length:

wdl := \sum_{f=1}^{N} dl_f × w_f, \quad wavdl := \sum_{f=1}^{N} avdl_f × w_f   (2.36)

The final formulation for Field-weighted Okapi BM25 is then:

BM25FW_{t,D} = \frac{wtf_{t,D}}{k_1\left((1 − b) + b\,\frac{wdl}{wavdl}\right) + wtf_{t,D}} × idf_t   (2.37)

2.5.2.2 Language mixture models

In the same way that document models are combined with collection models in order to smooth the ranking in language models, document models may also be combined with other language models for the same documents [135, 155]. For example, to combine the language models for anchor-text and content document evidence:

P(D|Q) = P(D) \prod_{t \in Q} \left[ (1 − λ − γ)P(t|C) + λP_{anchor}(t|D) + γP_{content}(t|D) \right]   (2.38)

Language mixture models have been used to good effect when combining multiple modalities for multimedia retrieval in video [211]. Indeed, combining multiple modalities for multimedia retrieval is a similar problem to that of combining multiple forms of text-based document evidence.

Kraaij et al. [135] incorporate query-independent evidence into a language mixture model by computing and including prior probabilities of document relevance. Here P(D) is set according to the prior probability that a document will be relevant, given the document and corpus properties. The prior relevance probabilities are estimated by evaluating how a particular feature affects relevance judgements using training data.
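A minimal sketch of the field-weighted scoring of equations (2.34)–(2.37) is given below. All statistics (field weights, term frequencies, field lengths and idf) are hypothetical; summing the returned quantity over the query terms gives S(D, Q) as in equation (2.29).

import math

def bm25f_term_score(tf_per_field, field_weights, doc_len_per_field,
                     avg_len_per_field, idf, k1=1.2, b=0.75):
    """Field-weighted Okapi BM25 contribution of one term in one document:
    term frequencies and lengths are first combined into a weighted composite
    document, which is then scored with the usual BM25 saturation formula."""
    wtf = sum(tf_per_field.get(f, 0) * w for f, w in field_weights.items())
    wdl = sum(doc_len_per_field.get(f, 0) * w for f, w in field_weights.items())
    wavdl = sum(avg_len_per_field[f] * w for f, w in field_weights.items())
    denom = k1 * ((1 - b) + b * wdl / wavdl) + wtf
    return (wtf / denom) * idf if wtf > 0 else 0.0

# Hypothetical statistics for one term in one document with three fields.
weights = {"title": 5.0, "anchor": 3.0, "body": 1.0}
tf = {"title": 1, "anchor": 4, "body": 7}
doc_len = {"title": 6, "anchor": 40, "body": 900}
avg_len = {"title": 7, "anchor": 25, "body": 1100}
idf = math.log(1_000_000 / 500)        # illustrative corpus statistics
print(round(bm25f_term_score(tf, weights, doc_len, avg_len, idf), 3))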
2.6 Evaluation

Search system performance may be measured over many different dimensions, such as economy in the use of computational resources, speed of query processing, or user satisfaction with search results [209]. It is unlikely that a single system will outperform all others on each of these dimensions, and accordingly it is important to understand the tradeoffs involved ([191], pp. 167).

This thesis is primarily concerned with retrieval effectiveness, that is, how well a given system or algorithm can match and retrieve documents that are most useful or relevant to the user's information need [150]. This is difficult to quantify precisely as it involves assigning some measure to the value of information retrieved ([191], pp. 167). Judgements of information value are expensive^39 and difficult to collect in a way that is representative of the needs and judgements of the intended search system users [209].

^39 In that employing judges to rate documents may be a financially expensive operation.

In addition, the effectiveness of a system depends on a number of system components, and identifying those responsible for a particular outcome in an uncontrolled environment can be difficult (typical web search system components are described in Section 2.1).

The use of a test collection is a robust method for the evaluation of retrieval effectiveness and avoids some of the cost involved in performing user studies. A test collection consists of a snapshot of a user task and the document corpus ([191], pp. 168). This encompasses a set of documents, queries, and complete judgements for the documents according to those queries [48, 209]. Test collections allow for standard performance baselines, reproducible results and the potential for collaborative experiments. However, if proper care is not taken, heavily training ranking function parameters using a test collection can lead to over-tuning, particularly when training and testing on the same test collection. In this case, observed performance gains may be unrealistic and may not apply in general. It is therefore important to train algorithms on one test collection, and evaluate the algorithms on another.

2.6.1 Web information needs and search taxonomy

Traditional information retrieval evaluations and early TREC web experiments evaluated retrieval effectiveness according to how well methods fulfilled informational-type search requests (i.e. finding documents that contain relevant text) [48, 176, 191, 205]. An early evaluation of WWW search engines examined their performance on an informational search task and found it to be below that of the then state-of-the-art TREC systems [112]. Recent research suggests, however, that the task evaluated was not typical of WWW search tasks [26, 33, 56, 75, 185]. Broder [33] argues that WWW user information needs are often not of an informational nature and nominates three key WWW-based retrieval tasks:

Navigational: a need to locate a particular page or site given its name. An example of such a query is “CSIRO” where the correct answer would be the CSIRO WWW site home page.

Informational: a need to acquire or learn some information that will be present in one or more web pages. An example of such a query is “Thesis formatting advice” where correct (relevant) answers would contain advice relating to how a thesis should be formatted.
Transactional: a need to perform an activity on the WWW. An example of such a query is “apply for a Californian driver's licence” where the correct answer would be a page from which a user could apply for a Californian driver's licence.

2.6.2 Navigational search tasks

Navigational search, particularly home page finding, is the focus of experiments within this thesis. Navigational search is an important WWW and web search task which has been shown to be inadequately fulfilled using full-text-based ranking methods [56, 185]. Evidence derived from query logs suggests that navigational search makes up a significant proportion of the total WWW search requests [75]. Navigational search also provides an important cornerstone in the support of search-and-browse based interaction. Two prominent navigational search tasks, home page finding and named page finding, are described in more detail below.

2.6.2.1 Home page finding

The home page finding task is: given the name of an entity, retrieve that entity's home page. An example of a home page finding search is when a user wants to visit http://trec.nist.gov and submits the query “Text REtrieval Conference”. The task is similar to Bharat and Mihaila's organisation search [27], where users provided web site naming queries, and Singhal and Kaszkiel's site-finding experiment [185], where queries were taken from an Excite log and judged as home page finding queries [77].

Home page finding queries typically specify entities such as people, companies, departments and products.^40 A searcher who submits an entity name as a query is likely to be pleased to find a home page for that entity at the top of the list of search results, even if they were looking for information. It is in this way that home pages may also provide primary-source information in response to informational and transactional queries [33, 198].

2.6.2.2 Named page finding

The named page finding task can be seen as a superset of the home page finding task, and includes queries naming both non-home page and home page documents [53]. Accordingly, the objective of the named page finding task is to find a particular web page given a page naming query.

^40 For example: ‘Trystan Upstill’, ‘CSIRO’, ‘Computer Science’ or ‘Panoptic’.
2.6.3 Informational search tasks

Two prominent informational search tasks evaluated in previous web-based experiments [111] are: the search for pages relevant to an informational need (evaluated in TREC ad-hoc [119]), and Topic Distillation [53]. Experiments within this thesis consider the Topic Distillation task, but not the traditional ad-hoc informational search task. The ad-hoc informational search task is described in detail in [111].

2.6.3.1 Topic Distillation

The Topic Distillation task asks systems to construct a list of key resources on some broad topic, similar to those compiled by human editors of WWW directories [111]. More precisely, in TREC experiments the task is defined as: given a search topic, find the key resources for that topic [111]. An example of a Topic Distillation query might be “cotton industry” where the information need modelled might be “give me all sites in the corpus about the cotton industry, by listing their home pages” [60]. A good resource is deemed to be an entry point for a site which is “principally devoted to the topic, provides credible information on the topic, and is not part of larger site also principally devoted to the topic” [60].

While Topic Distillation is primarily an informational search task, it is somewhat similar to navigational search tasks. The goal in both cases is to retrieve good entry points to relevant information units [58]. Indeed, experiments within this thesis demonstrate that methods useful in home page finding are also effective for Topic Distillation tasks (in Chapter 9).

2.6.4 Transactional search tasks

To date there has been little direct evaluation of transactional search tasks [111] and at the time of writing there are currently no reusable transactional test collections. While transactional search tasks are not the focus of this thesis, a case study that examines WWW search engine transactional search performance is presented in Chapter 4.

2.6.5 Evaluation strategies / judging relevance

This section describes methods used to collect snapshots of queries and document relevance judgements with which retrieval effectiveness can be evaluated.

2.6.5.1 Human relevance judging

The most accurate method for judging user satisfaction with results retrieved by a search system is to record human judgements. However, care needs to be taken when collecting human judgements that:

• Judges are representative of the general population using the search tool. In particular, if information needs behind given queries are to be modelled (as in [106, 108]), the user demographic responsible for the query should be taken into account in order to estimate the underlying need.
• Relevance judgements are correlated with the retrieval task modelled. This may be difficult as judging instructions for the same query can be interpreted in several ways [58].

Judging informational type queries

The scale of large corpora makes the generation of complete relevance judgements (i.e. judging every document for every query) impossible. In the TREC conference, judgement pools are created, which comprise the union of the top 100 retrieved documents per run submitted. These document pools are judged so that complete top 100 relevance judgements are collected for all runs submitted to TREC.^41 All non-judged documents are assumed to be non-relevant. Therefore when these judgements are used in post hoc experiments, the judgements are incomplete and so relevant documents will likely be marked non-relevant [209]. The judgement pooling process was used when judging runs submitted to the TREC Topic Distillation task.

^41 Although every group is asked to nominate an order of importance for its submitted runs, in case full pooled judgements cannot be completed in time.

These measures have been used when judging informational queries for several decades [48, 119, 207]. Some relevant observations about such judging are:

• Agreement between human relevance judges is less than perfect. Voorhees and Harman [210] reported that 71.7% of judgements made by three assessors for 14 968 documents were unanimous. However, Voorhees [208] later found that while substituting relevance judgements made by different human assessors changed score magnitude, it had a negligible effect on the rank order of systems [119, 208].

• When dealing with pooled relevance judgements, un-judged documents are assumed to be non-relevant. This may result in bias against systems that retrieve documents not typically retrieved by the evaluated systems. Two investigations of this phenomenon have reported that while the non-complete judging of documents may affect individual system scores, it is not likely to affect the ranking of systems [206, 220].

• The order of search results affects relevance judgements [76]. However, in later work it was found that this was not the case when judging fewer than fifteen documents [160].

Judging known item or named item queries

In comparison to informational type queries, the cost of judging named item queries (such as home page finding and named page finding queries) is much lower, and the judging is less contentious.
Named item queries are developed by judges navigating to a page in the collection and generating a query designed to retrieve that page. The judging consists of checking retrieved documents to determine whether they are duplicates of the correct page (which can be performed semi-automatically [114]).

2.6.5.2 Implicit human judgements

Implicit human judgements can be collected by examining how a user navigates through a set of search results [129]. Evaluations based on this data may be attractive for WWW search engines as such data are easy and inexpensive to collect. One way to collect implicit relevance judgements is through monitoring click-through of search results [129].

However, the quality of judging obtained based on this method may depend on how informative the document summaries are, as the summaries must allow the user to make a satisfactory relevance-based “click-through” decision. Also, given the implicit user preference for clicking on the first result retrieved (as it has been most highly recommended by the search system), observed effectiveness scores are likely to be unrealistically high.

If directly comparing algorithms using “click-through”-based evaluation, care must be taken to ensure competing systems are compared meaningfully. Joachims [129] proposed that the ranked output from each search algorithm under consideration be interleaved in the results list, and the order of the algorithms be reversed following each query so as not to preference one algorithm over the other (thereby removing the effect of user bias towards the first correct document).

2.6.5.3 Judgements based on authoritative links

A set of navigational queries can be constructed cheaply by sourcing queries and judgements automatically from human-generated lists of important web resources. The anchor-text of links within such lists can be used as the queries, and the corresponding target documents as the query answers.

Two recent studies use online WWW directories as authoritative sources for sample query-document pairs [8, 55]. An extension proposed by Hawking et al. [114] is the use of site maps found on many web sites as a source of query-document pairs. In all these methods it is important to remove the query/result source from the corpus prior to query processing, as otherwise anchor-text measures will have an unfair advantage.

2.6.6 Evaluation measures

2.6.6.1 Precision and recall

Precision and recall are the standard measures for evaluating information retrieval for informational tasks [14]. Precision is the proportion of retrieved documents that are relevant to a query at a particular rank cutoff, i.e.:
precision(k) = \frac{1}{k} \sum_{1 \le i \le k} r_i   (2.39)

where k is the rank cutoff, R_k is the set of documents from D that are relevant to the query Q at cutoff k, (D_1 ... D_n) is a ranked list of documents returned by the system, and r_i = 1 if D_i ∈ R_k or r_i = 0 otherwise.

Recall is the total proportion of all relevant documents that have been retrieved within a particular cutoff for a query, i.e.:

recall(k) = \frac{1}{|R_Q|} \sum_{1 \le i \le k} r_i   (2.40)

In large-scale test collections, recall cannot be measured as it is too difficult to obtain relevance judgements for all documents (as it is too expensive to judge a very large document pool). Therefore recall is not often considered in web search evaluations.

The measures of precision and recall are intrinsically tied together, as an increase in recall almost always results in a decrease in precision. In fact precision can be explicitly traded off for recall by increasing k; for very large k every document in the corpus is retrieved so perfect recall is assured. Given the expected drop-off in precision when increasing recall it can be informative to plot a graph of precision against recall [14]. Precision-recall graphs allow for a closer examination of the distribution of relevant and irrelevant documents retrieved by the system.

Both precision and recall are unstable at very early cutoffs, and it is therefore more difficult to achieve statistical significance when comparing runs [36]. However, as WWW search users tend to evaluate only the first few answers retrieved [99, 184], precision at early cutoffs may be an important measure for WWW search systems.

Counter-intuitively, rather than precision decreasing when a large collection of documents is searched, empirical evidence suggests that precision is increased [118].^42 This phenomenon was examined in detail by Hawking and Robertson [118] who explained it in terms of signal detection theory.

A further measure of system performance is average precision:

average precision = \frac{1}{|R_Q|} \sum_{1 \le k \le |D|} r_k \times precision(k)   (2.41)

where k ranges over the ranks at which relevant documents are observed. Average precision gives an indication of how many irrelevant documents must be examined before all relevant documents are found. The average precision is 1 if the system retrieves all relevant documents without retrieving any irrelevant documents. Average precision figures are obtained after each new relevant document is observed.

R-precision is a computation of precision at the R-th position in the ranking (i.e. precision(R)), where R is the total number of relevant documents for that query. R-precision is a useful parameter for averaging algorithm behaviour across several queries [14].

^42 These gains were tested at early precision with a cutoff that did not grow with collection size. Also the collection grew homogeneously, such that the content did not degrade during the crawl (as might be observed by crawling more content and thereby retrieving more spam on the WWW).
2.6.6.2 Mean Reciprocal Rank and success rates

Both the Mean Reciprocal Rank and success rate measures give an indication of how many low value results a user would have to skip before reaching the correct answer [110], or the first relevant answer [180].

The Mean Reciprocal Rank (MRR) measure is commonly used when there is only one correct answer. For each query examined, the rank of the first correct document is recorded. The score for that query is then the reciprocal of the rank at which the document was retrieved. If there is only one relevant answer retrieved by the system, then the MRR score corresponds exactly to average precision. The score for a system as a whole is taken by averaging across all queries.

The success rate measure is often used when measuring effectiveness for exact match queries, such as home page finding and named page finding tasks. Success rate is indicated by S@k, where k is the cutoff rank; it indicates the percentage of queries for which the correct answer was retrieved in the top k ranks [56]. The “I'm feeling lucky” button on Google [93] takes a user to the first retrieved result; accordingly S@1 is the rate at which clicking on such a button would take the user to a right answer. The success rate at 5, or S@5, is sometimes measured as it represents how often the correct answer might be visible in the first results page without scrolling (“above the fold”) [184]. S@10 measures how often the correct page is returned within the first 10 results.

These measures may provide important insight as to the utility of a document ranking function. Silverstein et al. observed from a series of WWW logs that 85% of query sessions never proceed past the first page of results [184]. Further, it has recently been demonstrated that more time is spent by users examining results ranked highly, with less attention paid to results beyond rank 5.^43 All results beyond rank 5 were observed to, on average, be examined for 15% of the time that was spent examining the top result. These findings illustrate the importance of precision at high cutoffs and success rates for WWW search systems.

^43 In this experiment 75% of the users reported that Google was their primary search engine. These users' prior experience with Google may be that the top ranked answer is often the correct document and that effectiveness drops off quickly, which could affect these results.

2.6.7 The Text REtrieval Conference

The Text REtrieval Conference (TREC) was established in 1992 by the National Institute of Standards and Technology (NIST) and the Defence Advanced Research Projects Agency (DARPA). The conference was initiated to promote the understanding of Information Retrieval algorithms by allowing research groups to compare effectiveness on common test collections. Voorhees and Harman present a comprehensive history of the TREC conference and the TREC web track development in [111].
As outlined by Hawking et al. [117] the benefits of TREC evaluations include: the potential for reproducible results, the blind testing of relevance judgements, the sharing of these judgements, the potential for collaborative experiments, and the extensive training sets created for TREC.

2.6.7.1 TREC corpora used in this thesis

Several TREC web track corpora are used and evaluated within this thesis – namely the TREC VLC2, WT10g and .GOV TREC corpora. Some experiments also use query sets from the TREC web track of 2001 and 2002 [53], and from the non-interactive web track of 2003 [60]. These query sets include home page finding sets (2001 and 2003), named page finding sets (2002 and 2003) and Topic Distillation sets (2002 and 2003). These query sets and corresponding task descriptions are discussed in Section 2.6.7.2.

• TREC VLC2: a 100GB corpus containing 18.5 million web documents. This corpus is one third of an Internet Archive general WWW crawl gathered in 1997 [119]. The size of this corpus is comparable to the size of Google's index at the time of its launch (around 24 million pages [31]). Current search engines index two orders of magnitude more data [93].

• TREC WT10g: a 10GB corpus containing a 1.7 million document subset of the VLC2 corpus [15]. The corpus was designed to be representative of a small highly connected web crawl. When building the corpus, duplicates, non-English and binary documents were removed.

• TREC .GOV: an 18.5GB corpus containing 1.25 million documents crawled from the US .GOV domain in 2001 [53]. Redirect and duplicate document information is available for this corpus (but not WT10g or VLC2).

There is debate as to whether the TREC web track corpora are representative of larger WWW-based crawls, in particular whether the linkage patterns and density are comparable (and therefore whether methods useful in WWW-based search would be applicable to smaller scale web search) [100]. Recent work by Soboroff [188] has reported that the WT10g and .GOV TREC web corpora do exhibit important characteristics present in the WWW.

2.6.7.2 TREC web track evaluations

TREC 2001 web track

The TREC 2001 web track evaluated two search tasks over the WT10g web corpus (described in Section 2.6.7.1): home page finding and relevance-based (ad-hoc) informational search. The objective of the home page finding task was to find a home page given some query created to name the page (as described in Section 2.6.2.2). The objective of the relevance-based informational search task was to find documents relevant to some topic, given a short summary query.
Experiments in Chapter 7 of this thesis make use of data from the TREC 2001 web track home page finding task. The ad-hoc informational search task is not considered.

For the 2001 home page finding task, 145 queries were created by NIST assessors by navigating to a home page within the WT10g corpus and composing a query designed to locate that home page [110]. A training set of 100 home page finding queries and correct answers, created in the same way, was provided before the TREC evaluation to allow participants to train their systems for home page finding [56]. Systems were compared officially on the basis of the rank of the first answer (the correct home page, or an equivalent duplicate page). Search system performance was compared using the Mean Reciprocal Rank of the first correct answer and success rate (both defined in Section 2.6.6.2).

TREC 2002 web track

The TREC 2002 web track^44 evaluated two search tasks over the .GOV web corpus (described in Section 2.6.7.1): named page finding and Topic Distillation. The objective of the named page finding task was to find a particular web page given a page naming query (as described in Section 2.6.2.2). The objective of the Topic Distillation task was to retrieve entry points to relevant resources rather than relevant documents (as described in Section 2.6.3.1).

^44 The report for the official submissions to the 2002 TREC web track (csiro02–) is included in Appendix D, but the results from these experiments are not discussed further.

This thesis includes experiments that use data from both of these web track tasks. Data from the 2002 TREC Topic Distillation task are used sparingly as the task is considered to be closer to a traditional ad-hoc informational task than to a Topic Distillation task [53, 111].

For the 2002 named page finding task, 150 queries were created by NIST assessors by accessing a random page within the .GOV corpus and then composing a query designed to locate that page [53]. Systems were compared officially on the basis of the rank of the first answer (the correct page, or an equivalent duplicate page), using Mean Reciprocal Rank and success rates (both measures defined in Section 2.6.6.2).

The 2002 Topic Distillation task consisted of 50 queries created by NIST to be representative of broad topics in the .GOV corpus (however, the topics chosen are believed to have not been sufficiently broad [53]). System effectiveness was compared using the precision @ 10 measure.

TREC 2003 web track

The TREC 2003 web track evaluated two further search tasks over the .GOV web corpus: a combined home page / named page finding task, and a revised Topic Distillation task. The objective of the combined task was to evaluate whether systems could fulfil both types of query without prior knowledge of whether queries named home pages or other pages.
pages or other pages. The objective of the Topic Distillation task was to find entry points to relevant resources given a broad query (as described in Section 2.6.3.1). The instructions given to the relevance judges in the 2003 Topic Distillation task differed from those given in 2002. In 2003 the judges were asked to emphasise “home page-ness” more than in the 2002 Topic Distillation task, and broader queries were used to ensure that some sites devoted to the topic existed [60].

The TREC 2003 combined home page / named page task consisted of a total of 300 queries, with an equal mix of home page and named page queries. The query set was selected using the methods previously used for generating the query/result sets for the 2001 home page finding task and the 2002 named page finding task. Systems were compared officially on the basis of the rank of the first retrieved answer, using Mean Reciprocal Rank and success rate measures.

The TREC 2003 Topic Distillation task consisted of 50 queries created by NIST to be representative of broad topics in the .GOV corpus. Judges ensured that queries were “broad” by submitting candidate topics to a search system in order to determine whether there were sufficient matches for the proposed topics. Systems were compared officially on the basis of R-precision as many of the topics did not have 10 correct results (and thus precision @ 10 was not a viable measure). Later work by Soboroff challenged the use of these measures, and demonstrated that precision @ 10 would have been a superior evaluation measure [189].
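The two navigational measures used throughout these evaluations, Mean Reciprocal Rank and the success rate at a cutoff (defined in Section 2.6.6.2), can be illustrated with a short sketch. The Python fragment below is a minimal illustration rather than the evaluation code used for the official runs; it assumes that the 1-based rank of the first correct answer is already known for each query (None when no correct answer was retrieved).

    def mrr_and_success(first_correct_ranks, cutoffs=(1, 5, 10)):
        """first_correct_ranks: one entry per query; the 1-based rank of the
        first correct answer, or None if no correct answer was returned."""
        n = len(first_correct_ranks)
        # Reciprocal rank is 1/rank for answered queries and 0 otherwise.
        mrr = sum(1.0 / r for r in first_correct_ranks if r is not None) / n
        # Success@k: fraction of queries answered at or before rank k.
        success = {
            k: sum(1 for r in first_correct_ranks if r is not None and r <= k) / n
            for k in cutoffs
        }
        return mrr, success

    # Example: three queries answered at ranks 1, 4 and not at all.
    print(mrr_and_success([1, 4, None]))   # MRR = (1 + 0.25 + 0) / 3 ~= 0.417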
Chapter 3

Hyperlink methods – implementation issues

The value of hyperlink evidence may be seriously degraded if the algorithms that exploit it are not well implemented. Hyperlink-based evidence is intrinsically dependent on the accuracy and completeness of the web graph from which it is calculated. This chapter documents and justifies implementation decisions taken during the empirical work and details limitations of the corpora available for use.

3.1 Building the web graph

An ideally accurate web graph would be one where all hyperlinks in the corpus represented the intentions of the document author when the hyperlinks were created. Such accuracy would require hyperlink authors to be consulted during web graph construction to confirm that their hyperlinks were pointing to the web content they intended to link to. In most cases this process would not be feasible. Therefore the discussion of graph accuracy within this chapter relates to how likely it is that the web graph is an accurate representation of web authors’ link intentions. The discussion of web graph completeness refers to the amount of hyperlink evidence directed at documents within the corpus that has been successfully assigned to the target document (and not lost).

To ensure web graph accuracy and completeness:

• Document URLs need to be resolved;
• Duplicate documents may need to be removed;
• Hyperlink redirects may need to be followed;
• Dynamic page content may need to be detected; and
• Links created for reasons other than recommendation may need to be removed.

The following sections discuss, in turn, how each of these requirements has been addressed when building a representation of the web graph.
3.1.1 URL address resolution

Hyperlink targets can be expressed as fully qualified absolute addresses (such as http://cs.anu.edu.au/index.html) or provided as an address relative to the hyperlink source (such as ../index.html from http://cs.anu.edu.au/~Trystan.Upstill/index.html). Whether addressed using relative or absolute URLs, hyperlinks need to be mapped to a single target document either within or external to the corpus. Non-standard address resolution could lead to phantom pages (and subgraphs) being introduced into the web graph. In experiments within this thesis all relative URLs are decoded to their associated absolute URL (if present) following the conventions outlined in RFC 2396 [21] and additional rules detailed in Appendix B. Some examples of address resolution are:

• A relative link to /../foo.html from http://cs.anu.edu.au/ is resolved to http://cs.anu.edu.au/foo.html;

• Links to one of http://cs.anu.edu.au/~Trystan.Upstill/index.html, http://cs.anu.edu.au//////~Trystan.Upstill//, or http://cs.anu.edu.au:80/~Trystan.Upstill/ are resolved to http://cs.anu.edu.au/~Trystan.Upstill/;

• Links to http://cs.anu.edu.au/foo.html#Trystan are resolved to http://cs.anu.edu.au/foo.html;

• Links to panopticsearch.com/ are resolved to http://www.panopticsearch.com/.

3.1.2 Duplicate documents

Duplicate and near-duplicate1 documents are prevalent in most web crawls [34, 78, 123, 183].

1 Near-duplicate documents share the same core content but a small part of the page is changed, such as a generation date or a navigation pane.

In a 30 million page corpus collected by AltaVista [7] from the WWW in 1996, 20% of documents were found to be effective duplicates (either exact duplicates or near duplicates) of other documents within the collection [34]. In a 26 million page WWW crawl collected by Google [93] in 1997, 24% of the documents were observed to be exact duplicates [183]. In a further crawl of 80 million documents from the WWW
in May 1999 [123], 8.5% of all documents downloaded were exact duplicates. In a 2003 crawl of the IBM intranet, over 75% of URLs were effective duplicates [78].

The presence of duplicate pages in a web graph can lead to inconsistent assignment of hyperlink evidence to target documents. For example, if two documents contain duplicate content, other web authors may split hyperlink evidence between the two documents. These duplicate documents should be identified and collapsed down to a single URL. However, if unrelated pages are mistakenly identified as duplicates and collapsed, distortion will be introduced into the web graph and the effectiveness of both hyperlink recommendation and anchor-text evidence may be reduced. For example, if Microsoft and Toyota’s home pages were tagged as duplicates, all link information for Microsoft.com might be re-assigned to Toyota’s home page, leading to http://www.toyota.com possibly being retrieved for the query ‘Microsoft’. Therefore it is important to ensure exact (or very close) duplicate matching when assigning hyperlink recommendation scores and anchor-text evidence to consolidated documents.

Common causes of duplicate documents in the corpus are:

• Host name aliasing. Host name aliasing is a technique used to assign multiple host names to a single IP address. In some cases several host names may serve the same set of documents under each host name. This may result in identical sets of documents being stored for each web server alias [15, 123].

• Symbolic links between files. Symbolic links are often employed to map multiple file names to the same document [123], resulting in the same content being retrieved for several URLs. If there is no consensus amongst web authors as to the correct URL, incoming links may be divided amongst all symbolically linked URLs.

• Web server redirects. In many web server configurations the root of a directory is configured to redirect to a default page (e.g. http://cs.anu.edu.au/ to http://cs.anu.edu.au/index.html). Once again, if there is no consensus amongst web authors as to the correct URL, incoming links may be divided amongst the URLs.

• File path equivalence. On web servers running on case-insensitive operating systems (such as the Microsoft Windows Internet Information Server [148]) the case of characters in the path is ignored and all case variants will map to the same file (so Foo/, foo/ and FoO/ are all equivalent). By contrast, for web servers running on case-sensitive operating systems (such as Apache [10] with default settings on Linux), folder case is meaningful (so Foo/, foo/ and FoO/ may all map to different directories).

• Mirrors. A mirror is a copy of a set of web pages, served with little or no modification on another host [23, 122]. In a crawl of 179 million URLs in 1998, 10% of the URLs were observed to contain mirrored content [23].
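Some of the spurious URL-level variation behind these duplicates (default ports, repeated slashes, fragments, mixed-case host names) can be removed by canonicalising URLs when the web graph is built, following the address resolution conventions of Section 3.1.1; content-level duplicates still require the detection methods discussed below. The sketch that follows is only a rough approximation of the rules actually used (RFC 2396 plus the additional rules in Appendix B), built on Python's standard urllib.parse module.

    import re
    from urllib.parse import urljoin, urlsplit, urlunsplit

    def canonicalise(source_url, href):
        """Resolve href against source_url and normalise the result.
        A rough approximation of the conventions in Section 3.1.1 / Appendix B."""
        absolute = urljoin(source_url, href)              # resolve relative references
        parts = urlsplit(absolute)
        scheme = (parts.scheme or "http").lower()
        host = (parts.hostname or "").lower()
        if parts.port and not (scheme == "http" and parts.port == 80):
            host = "%s:%d" % (host, parts.port)           # keep only non-default ports
        path = re.sub(r"/{2,}", "/", parts.path) or "/"   # collapse duplicate slashes
        # Fragments (#...) never change the target document, so they are discarded.
        return urlunsplit((scheme, host, path, parts.query, ""))

    print(canonicalise("http://cs.anu.edu.au/", "/../foo.html"))
    # -> http://cs.anu.edu.au/foo.html
    print(canonicalise("http://cs.anu.edu.au/",
                       "http://cs.anu.edu.au:80//////~Trystan.Upstill//"))
    # -> http://cs.anu.edu.au/~Trystan.Upstill/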
    52 Hyperlink methods- implementation issues Duplicates created as a result of host name aliasing may be resolved through map- ping domain names down to their canonical domain name (using “canonical name” (CNAME) and “address” (A) requests to a domain name server, as detailed in [123]). This process has several drawbacks, including that some of these virtual hosts may be incorrectly collapsed down to a single server [123]. To accurately detect duplicates the process of domain name collapsing should be performed at the time of crawling [123]. This is because the canonical domain name mappings may have changed prior to du- plicate checking and may incorrectly identify duplicate servers. In experiments within this thesis host name alias information was collected when available. Host name alias information was not available for the (externally collected) VLC2 and WT10g TREC web track collections [15, 62]. Other types of duplicates may be detected using heuristics [24], but page content examination needs to be performed to resolve these duplicates reliably [123]. Scalable document full-text-based duplicate detection can be achieved through the calculation of a signature (typically an MD5 checksum [166]) for each crawled page. However, such checksums may map two nearly identical pages to very different checksum val- ues [34]. Therefore document-based checksums cannot be used to detect near du- plicate documents. Near-duplicate documents can be detected using methods such as Shingling [32, 34], which detects duplicates using random hash functions, and I- Match [47], a more efficient method which uses collection statistics. Full site mirrors may be more easily detected by considering documents not in isolation, but in the context of all documents on a particular host. Bharat et al. [24] investigated several methods for detecting mirrors in the web graph using site heuristics such as network (IP) address, URL structure and host graph connectivity. In corpora built for this thesis, exact duplicates on the same host were detected using MD5 checksums [166] during corpus collection. Duplicate host aliases were also consolidated. Mirror detection and near duplicate detection techniques were not employed due to link graph distortion that may be introduced through false positive duplicate matching. During the construction of the VLC2 test collection no duplicate detection was employed, however for WT10g (a corpus constructed from VLC2) duplicates present on the same web server were detected (using checksums) and eliminated [15]. This was reported to remove around 10% of VLC2 URLs from consideration. Host aliasing was not checked for either collection [15]. During the .GOV corpus crawl duplicate documents were detected and eliminated using MD5 checksums. 3.1.3 Hyperlink redirects Three methods frequently employed by web authors to redirect page requests are: • Using an HTTP redirect configured through the web server [81].2 The redirect information is then transferred from the web server to the client in the HTTP 2 This method is recommended by the W3C for encoding redirects [81].
    §3.1 Building theweb graph 53 response header code. HTTP redirects return a redirection HTTP status code (301 – Moved Permanently or 302 – Moved Temporarily) [81]. In a crawl of 80 million documents in May 1999 [123] 4.5% of all HTTP requests received a redirection response. • Using HTML redirects [164].3 HTML redirects are often accompanied by a tex- tual explanation of the redirect with some arbitrary timeout value for page for- warding. HTML redirects return an “OK” (200) HTTP status code [81]. • Using Javascript [152]. The detection of Javascript redirects requires the crawler (or web page parser4) to have a full Javascript interpreter and run Javascript code to determine the target page. Ensuring hyperlink evidence is assigned to the correct page when dealing with hyperlink redirects is no simple matter. A link pointing to a page containing a redirect can either be left to point at the placeholder page (the page used to direct users to the new document) or re-mapped to the new target page. The web author who created the link is unlikely to have deliberately directed evidence to the placeholder page. By contrast, if the link is re-mapped to the final target, the document may not be representative of the initial document for which the link was created. HTML and Javascript redirect information was logged and stored when building the VLC2 and WT10g test collections. For the .GOV collection all three types of redi- rects were stored and logged. If possible, for experiments within this thesis, redirect information was used to reassign the link to the end of the redirect chain. Due to the complexity of dealing with Javascript redirects, experiments in this thesis do not resolve these redirects. 3.1.4 Dynamic content Unbounded crawling of dynamic content can lead to crawlers being caught in “crawler traps” [123] and the creation of phantom link structures in the web graph. This may lead to “sinks” being introduced into the web graph, and a reduction of the effective- ness of hyperlink analysis techniques. Dynamic content on the WWW is bounded only by the space of all potential URLs on live host names. A study in 1997 estimated that 80% of useful WWW docu- ments are dynamically generated [139]; moreover this has been observed to be a lower bound [165]. During the creation of the VLC2 test collection, dynamic content was crawled when linked-to [15]. For the WT10g corpus all identifiably dynamic documents5 were removed [15]. This meant removing around 20% of the documents present in the 3 This method for encoding redirects is not recommended in the latest HTML specification [164]. 4 The system component that processes web documents and extracts document data prior to indexing. 5 i.e. not having a static URL extension, e.g. a “?” or common dynamic extensions such as “.php”, “.cgi” or “.shtml”.
    54 Hyperlink methods- implementation issues VLC2 corpus. This is surprising given the estimate that 80% of all useful WWW con- tent is dynamic. The large disagreement indicates that either: the crawler used to gather the VLC2 corpus did not effectively crawl dynamic content, or the estimate of dynamic content was incorrect, or static content was crawled first during the Inter- net Archive crawl.6 It is unclear why dynamic content was removed from the WT10g corpus, given that dynamic web content is likely to contain useful information. 3.1.5 Links created for reasons other than recommendation Hyperlink recommendation algorithms assume that links between documents imply some degree of recommendation [157]. Therefore links created for reasons other than recommendation may adversely affect hyperlink recommendation scores [63]. Links are often created for site navigation purposes or for nepotistic reasons [63]. Nepotistic linking is link generation that is the result of some relationship between the source and target, rather than the merit of the target [63, 137]. Kleinberg [132] proposed that all internal site links be removed to lessen the influence of local nepotistic hyperlinks and navigational hyperlinks. This was further refined by Bharat and Henzinger [26] who observed that nepotistic links may exist not only within a single site but between sites as well. To remove these nepotistic links they suggested that all sites be considered as units, and proposed that only a single link between hosts be counted. However, the removal of all internal host link structure may discard useful site information. Amitay et al. [9] studied the relationship between site structure and site content and through an examination of internal and external hyperlink structure were able to distinguish between university sites, online directories, virtual hosting services, and link farms. The link structures in each of these sites were observed to be quite different, indicating that reducing the effects of nepotistic and navigational links according to the type of site may be more effective than simply removing all internal links. Fundamental changes in the use of hyperlinks on the web may also challenge the recommendation assumption by affecting the quality or quantity of mined hyperlink information. For example, the use of web logging tools (blogs) [92] may alter the dynamics of hyperlinks on the WWW. Such pages are often stored together on a single host, are very frequently updated, and the cost of generating a link to other content in a blog is small. As such, the applicability of hyperlink recommendation algorithms in this environment has been challenged [86]. It is also possible that as WWW search engine effectiveness improves, authors are less likely to link to documents that they find useful, as such documents can be easily found using a popular WWW search engine. An analysis how such trends affect hyperlink quality is outside of the scope of this thesis and is left for future work. In experiments in this thesis internal site links are preserved and weighted equally. This is important, as some of the evidence useful in navigational search may be en- coded into internal site structure or nepotistic links, such as links to site home pages 6 The VLC2 collection consists of the first one-third of the documents stored during an all-of-WWW crawl performed by the Internet Archive in 1997.
and entry points. For example, almost all external links to the Australia Post web site [12] are directed to the post-code lookup, with the home page identified by evidence present in the anchor-text of internal links [114]. Also, within some of the collections studied (such as WT10g [15]), inter-server linking is relatively infrequent.

3.2 Extracting hyperlink evidence from WWW search engines

Some of the experiments performed in this thesis rely on hyperlink evidence extracted from WWW search engines via their publicly available interfaces. The WWW search engines used are well engineered and provide effective and robust all-of-WWW search. However, there are disadvantages in using WWW search engines for link information. Such experiments are not reproducible, as search engine algorithms and indexes are not known and may well change over time. Additionally, some of the sourced information is incomplete (such as the top 1000 results lists) or estimated (such as document linkage information7).

7 Sourced using methods outlined in Section 5.1.3.

3.3 Implementing PageRank

PageRank implementations outlined in the literature differ in the ways they deal with dangling links, in the bookmarks used for random jumping, and in the conditions that must be satisfied for convergence [154, 157, 158, 201]. Section 2.4.3.2 gave an overview of the PageRank calculation. The current section outlines the process that has been followed when calculating PageRank values for use in this thesis.

3.3.1 Dangling links

A hyperlink in the web graph that refers to a document outside of the corpus, or links to a document which has no outgoing links, is termed a dangling link [157]. In Page and Brin’s [157] PageRank formulation, dangling links are removed prior to indexing and then re-introduced after the PageRank calculation has converged. The removal of dangling links using this method increases the weight of PageRank distributed through other links on pages that point to dangling links. This is because dangling links are not considered when dividing PageRank amongst document outlinks.

An alternative PageRank calculation sees the random surfer jump with certainty (probability 1, rather than (1 − d)) when they reach a dangling link. This implies that the random surfer jumps to a bookmark when they reach a dead-end [43, 154]. This implementation has desirable stability properties when used with a bookmark set that evenly distributes “jump” PageRank amongst all pages (as described in Section 2.4.3.2).
    56 Hyperlink methods- implementation issues A further PageRank variant sees the random surfer jump back to the page they came from when they reach a dangling link [158]. This variant is problematic as it may lead to rank sinks if a page has many dangling links. This may result in inflated scores for sections of the graph. The PageRanks used for web corpora in this thesis are calculated using the dan- gling link “jump with certainty” method. This method has been shown to have desir- able stability and convergence properties [154]. 3.3.2 Bookmark vectors In experiments within this thesis PageRank values are calculated for two different bookmark vectors (E). The first vector produces a “Democratic” or unbiased Page- Rank in which all pages are a priori considered equal. The second bookmark vector “personalises” [157] the PageRank calculation to favour known authoritative pages. The bookmark vector is created using links from a hand-picked source and is termed “Aristocratic” PageRank. In Democratic PageRank (DPR) every page in the corpus is considered to be a book- mark and therefore every page has a non-zero PageRank. Every link is important and thus in-degree might be expected to be a good predictor of DPR. Because it is easy for web page authors to create links and pages, it is easy to manipulate DPR with link spam. In Aristocratic PageRank (APR) a set of authoritative pages is used as bookmarks to systematically bias scores. In practice the authoritative pages might be taken from a reputable web directory or corpus site-map. For example, for WWW-based corpora, bookmarks might be sourced from a WWW directory service such as Yahoo! [217], Looksmart [144] or the Open Directory [69]. APR may be harder to spam than DPR because newly created pages are not, by default, included in the bookmarks. 3.3.3 PageRank convergence This section presents a small experiment to determine how the performance of Page- Rank is affected by changes to the PageRank d value. These experiments examine re- trieval effectiveness on the WT10gC home page finding test collection, for Optimal re- rankings (described in Section 2.6.7.1) of two query-dependent baselines (document full-text and anchor-text). This collection was provided to participants in TREC 2001 so that they could train systems for home page search (described in Section 2.6.7.2, the test collection is used in experiments in Chapter 7). Figures 3.1 and 3.2 illustrate how the PageRank on the WT10gC collection is af- fected by changes to the d value. Figure 3.3 shows how the choice of d affects conver- gence. In practice the d value is typically chosen to be between 0.8 and 0.9 [16, 157]. Results from these experiments reveal that the performance of PageRank can be remarkably stable even with large changes in the d value. When d was set to 0.02 the performance of the Optimal re-ranking (see Section 7.3) was similar to the per- formance at d = 0.85. Without the introduction of any random noise (at d = 1.0)
the PageRank calculation did not converge. However, the PageRank calculation did converge with only a small amount of weight distributed in random jumping (d = 0.99).

Unless the score is to be directly incorporated in a ranking function, only the relative ordering of pages is important. Haveliwala [102] noted this as a possible PageRank optimisation method, since a final ordering of pages might be achieved before final convergence. Haveliwala observed that the ordering of pages by PageRank values did not change significantly after a few PageRank iterations. When moving from 25 to 100 iterations of the PageRank calculation, on corpora of over 100 000 documents, no significant difference in document ranking order was observed [102]. In experiments in this thesis the PageRank calculation was run until convergence. This allowed for flexibility when combining PageRank values with other ranking components.

Since little improvement in performance was observed when increasing d, the empirical evidence suggests d should be set to a very small value (around 0.10) for corpora of this size, thereby reducing the number of iterations required and minimising computational cost. However, to maintain consistency with previous evaluations, in experiments within this thesis d was set at 0.85, as suggested by Brin and Page [31].

[Figure 3.1 plot omitted: success rate (S@1, S@5, S@10, content baseline) versus the Democratic PageRank d value.]

Figure 3.1: Effect of d value (random jump probability) on success rate for Democratic PageRank calculations for the WT10gC test collection. As d approaches 0 the bookmarks become more influential. As d approaches 1 the calculation approaches “pure” PageRank (i.e. a PageRank calculation with no random jumps). The convergence threshold is set to 0.0001. The WT10gC test collection is described in Section 7.1.3. The PageRank scores are combined with a document full-text (content) baseline ranking using the Optimal re-ranking method described in Section 7.1.4.
[Figure 3.2 plot omitted: success rate (S@1, S@5, S@10, anchor-text baseline) versus the Aristocratic PageRank d value.]

Figure 3.2: Effect of d value (random jump probability) on success rate for Aristocratic PageRank calculations for the WT10gC collection. As d approaches 0 the bookmarks become more influential. As d approaches 1 the calculation approaches “pure” PageRank (i.e. a PageRank calculation with no random jumps). The convergence threshold is set to 0.0001. The WT10gC test collection is described in Section 7.1.3. The PageRank scores are combined with an aggregate anchor-text (anchor) baseline ranking using the Optimal re-ranking method described in Section 7.1.4.

[Figure 3.3 plot omitted: number of iterations to convergence versus d value, for Aristocratic and Democratic PageRank.]

Figure 3.3: Effect of PageRank d value on the rate of Democratic PageRank convergence on WT10g, by number of iterations. PageRank did not converge at d = 1 (no random jumps). The WT10g collection contains 1.7 million documents and is described in Section 2.6.7.1.
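The “jump with certainty” treatment of dangling links (Section 3.3.1), the bookmark vector E (Section 3.3.2) and the convergence threshold used in the figures above can be combined in a compact iterative sketch. The Python fragment below is illustrative only, not the implementation used for the experiments: it assumes the web graph fits in memory as a dictionary of out-links, uses a uniform bookmark vector when none is supplied (Democratic PageRank), and stops when the L1 change between iterations falls below the threshold.

    def pagerank(outlinks, d=0.85, bookmarks=None, tol=1e-4, max_iter=400):
        """outlinks: dict mapping each page to the list of pages it links to.
        bookmarks: optional dict of bookmark weights (Aristocratic PageRank);
        a uniform vector (Democratic PageRank) is used when omitted."""
        pages = list(outlinks)
        n = len(pages)
        if bookmarks:
            total = float(sum(bookmarks.values()))
            e = {p: bookmarks.get(p, 0.0) / total for p in pages}
        else:
            e = {p: 1.0 / n for p in pages}
        pr = {p: 1.0 / n for p in pages}
        for _ in range(max_iter):
            new = {p: 0.0 for p in pages}
            jump = 0.0
            for p in pages:
                targets = [t for t in outlinks[p] if t in new]  # drop links leaving the corpus
                if targets:
                    share = d * pr[p] / len(targets)
                    for t in targets:
                        new[t] += share
                    jump += (1.0 - d) * pr[p]   # usual random jump
                else:
                    jump += pr[p]               # dangling page: jump with certainty
            for p in pages:
                new[p] += jump * e[p]           # jump mass spread via the bookmark vector
            delta = sum(abs(new[p] - pr[p]) for p in pages)
            pr = new
            if delta < tol:
                break
        return pr

    # Tiny example: B is dangling, so its rank mass is spread via the bookmark vector.
    print(pagerank({"A": ["B"], "B": [], "C": ["A"]}))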
3.3.4 PageRank applied to small-to-medium webs

It is sometimes claimed that PageRanks are not useful unless the web graph is very large (tens or hundreds of millions of nodes), but this claim has not been substantiated. PageRanks can be calculated for a web graph of any size, and PageRank scores are therefore usable within any web crawl, including crawls of single organisations (enterprise search) and portals. The Google organisational search appliance incorporates PageRank for crawls of below 150 000 pages [96].

3.4 Expected correlation of hyperlink recommendation measures

As DPR depends to some degree on the number of incoming links a page receives, one might expect DPR to be correlated with in-degree. Ding et al. [68] previously observed that, for this reason, in-degree is a useful first-order approximation to DPR. Moreover, when DPR is calculated with a loose convergence threshold (so that the calculation stops after few iterations) it might be expected to be more highly correlated with in-degree, as little weight is transferred through the graph. Similarly, corpora with large numbers of dangling links might be expected to show a higher correlation between DPR and in-degree. APR is likely to be far less correlated, with many documents potentially having an APR score of zero.8

In this thesis the following correlations between hyperlink recommendation scores are tested:

• Between WWW-based PageRank scores (from Google [93]) and WWW-based in-degree scores (from AllTheWeb [80]), in Section 5.3;

• Between small-to-medium web based scores for DPR, APR and in-degree, in Section 7.6.3; and

• Between small-to-medium web based scores for DPR, APR and in-degree, and WWW-based PageRank scores (from Google), also in Section 7.6.3.

8 For example, if not bookmarked, no APR score will be achieved by pages in the so-termed WWW “Tendrils” (unless linked to by other Tendrils) [35] or by pages in the IN component.
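A hedged sketch of how such correlations can be computed: the coefficient is not specified at this point in the thesis, so Spearman's rank correlation (via SciPy) is used here purely for illustration, and the score dictionaries keyed by URL are assumed inputs rather than data structures defined in the text.

    from scipy.stats import spearmanr

    def rank_correlation(scores_a, scores_b):
        """Spearman rank correlation over the URLs scored by both measures."""
        urls = sorted(set(scores_a) & set(scores_b))
        a = [scores_a[u] for u in urls]
        b = [scores_b[u] for u in urls]
        rho, p_value = spearmanr(a, b)
        return rho, p_value

    # Hypothetical usage: dpr, apr and indegree are dicts of scores keyed by URL.
    # print(rank_correlation(dpr, indegree))
    # print(rank_correlation(dpr, apr))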
    Chapter 4 Web searchand site searchability The potential for hyperlink evidence to improve retrieval effectiveness may depend upon the authorship of web sites. Some web documents are authored in such a way as to prevent or discourage direct linking. This may make it difficult for web search engines to retrieve a document. Decisions made when authoring documents can af- fect the evidence collected by web crawlers, and thereby reduce or increase the quality of end-user search results. This chapter investigates how the “searchability” of sites influences retrieval effectiveness. It also provides a whole-of-WWW context for the experimental work based on smaller web corpora. In particular, the case study pre- sented in this chapter illustrates: • The importance of web hyperlink evidence in the ranking algorithms of promi- nent WWW search engines, by investigating whether well-linked content is more likely to be retrieved by WWW search engines. • Difficulties faced by prominent WWW search engines when resolving author intentions through web graph processing (and how successfully resolving issues discussed in Chapter 3 can improve retrieval effectiveness). • The effect of web authorship conventions on the likelihood of hyperlink evi- dence generation. This case study examines both search effectiveness and searchability with respect to a particular type of commodity which is frequently sold over the WWW; namely books. The task examined is that of finding web pages from which a book may be purchased, specifying the book’s title as the query. Online book buying is a type of transactional search task (see Section 2.6.1) [33]. Transactional search is an important web search task [124], as it drives e-commerce and directs first-time buyers to particular merchant web sites. However, despite the prevalence of such tasks, information retrieval research has largely ignored product purchasing and transactional search tasks [106, 116]. The product purchasing search task is characterised by multiple correct answers. For example, in this case study, any of the investigated bookstores may provide the service which has been requested (the purchase of a particular book). Employing a task with many equivalent answers spread over a number of sites makes it possible to 61
    62 Web searchand site searchability study which sites are most easily searchable by search engines, and conversely which search engines provide the best coverage. The study of searchability is primarily concerned with site crawlability and the prevalence of link information, that is, how easy it is to retrieve pages and link struc- ture from a web site. A site with good searchability is one whose pages can be matched and ranked well by search engines, and whose URLs are simple and consistent, such that other authors may be more likely to create hyperlinks to them. Previous studies of transactional search have evaluated the service finding ability of TREC search systems [106] and WWW search engines [116] on a set of apparently transactional queries extracted from natural language WWW logs. The aim of these studies was to compare search engines on early precision; no information was avail- able (or needed) about what resources could be found, and there was no comparison of the searchability of online vendor sites. 4.1 Method The initial step in the experiment was the selection of candidate books, the titles of which formed the query set. This query set was then submitted to four popular WWW search engines and ranked lists of documents were retrieved. Links to candidate bookstore-based pages within these ranked lists were extracted, examined, and (if required) downloaded. These documents were then examined to determine whether they fulfilled the requirements of transactional search; that is, that the document did not only match the book specified in the query, but also allowed for the book to be purchased directly. The search engines were then compared based on how often they successfully retrieved a transactional document for the requested books. Similarly, a comparison of bookstores was performed based on how often each bookstore had a transactional document for the desired book retrieved by any of the WWW search en- gines. To examine the effect that hyperlink and document coverage had on bookstore and search engine retrieval effectiveness, further site-based information was extracted from the search engines and analysed. The experimental data used were collected in the fourth quarter of 2002. The following sections describe these steps in greater detail. The methods used for extracting evidence relating to search engine coverage of bookstore URLs and hyper- links are described in Appendix C. 4.1.1 Query selection The book query set was identified from the New York Times bestseller lists, by sourc- ing the titles of the best-sellers for September 2002 [153]. A total of 206 distinct book titles were retrieved from nine categories.1 Book titles were listed on the best-seller lists fully capitalised, and were later converted to lower case and revised such that all terms, apart from join terms (such as “the”, “and” and “or”), began with a capital. 1 The book/category breakdown is included in Appendix C.
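A hedged sketch of the title normalisation just described: best-seller titles were listed fully capitalised and were converted so that every term except join terms begins with a capital. The exact join-term list is not given in the text; the short list below ("the", "and", "or" plus a few similar words), the capitalisation of the first word, and the example title are assumptions for illustration only.

    JOIN_TERMS = {"the", "and", "or", "of", "a", "an", "in", "to"}  # assumed list

    def normalise_title(raw_title):
        """Convert an ALL-CAPS best-seller title to the query form used in the study."""
        words = raw_title.lower().split()
        out = []
        for i, w in enumerate(words):
            # The first word is always capitalised; join terms elsewhere stay lower case.
            out.append(w if (i > 0 and w in JOIN_TERMS) else w.capitalize())
        return " ".join(out)

    # Illustrative title only, not necessarily one from the 2002 query set.
    print(normalise_title("THE LORD OF THE RINGS"))   # -> "The Lord of the Rings"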
    §4.1 Method 63 Thequery selection presumes that users search for a book using its exact title. In fact users may seek books using author names, topics, or even partial and/or incorrect titles. However, it is likely that a significant proportion of book searches are made using the exact listed title. The ISBNs of correct books were identified for page judging. Both hardcover and paperback editions of books were considered to be correct answers.2 A list of the queries and the ISBNs of the books judged as correct answers is available in Appen- dix C. 4.1.2 Search engine selection Four search engines were identified from the Nielsen/NetRatings Search Engine Rat- ings for September 2002 (as outlined in Table 4.1). At the time, the four engines pro- vided the core search services for the four most popular search services, and for eight of the top ten search services [194]. S.Engine Abbr. Used by [195] Rank AltaVista [7] AV AltaVista 8 AllTheWeb [80] FA AllTheWeb - Google [93] GO Google 3 AOL 4 Netscape 9 Yahoo 1 MSN Search [149] MS MSN Search 2 (based on Looksmart 10 Inktomi) HotBot - Overture 6 Table 4.1: Search engine properties. The column labelled “Abbr.” contains abbreviations used in the study. “Used by” indicates search services that used the search engine. “Rank” indicates the search services position in the Nielsen/NetRatings Search Engine Ratings of Sep- tember 2002 [194]. 4.1.3 Bookstore selection The bookstore set was derived from the Google DMOZ “Shopping > Publications > Books > General” [94] and Yahoo! “Business and Economy > Shopping and Services > Books > Booksellers” [216] directories. Bookstores were considered if they sold the top bestseller in at least three of the nine categories. The process of bookstore candidate identification was performed manually using internal search engines to search for both the title and the author of each book (both title and author were used to uniquely 2 Large print and audio editions were deemed to be incorrect answers.
    64 Web searchand site searchability Bookstore Core URL De. Dy. URL Cat. 1BookStreet 1bookstreet.com N Y ISBN 9 A1Books a1books.com N Y ISBN 9 AllDirect alldirect.com N Y ISBN 9 Amazon amazon.com N P ISBN 9 Americana Books americanabooks.com N Y - 7 Arthurs Books arthursbooks.com N Y ISBN 4 Barnes and Noble barnesandnoble.com N Y ISBN 9 BookWorks bookworksaptos.com Y* Y ISBN 9 BookSite booksite.com Y+ Y ISBN 9 Changing Hands changinghands.com Y* Y ISBN 9 ecampus ecampus.com N Y ISBN 9 NetstoreUSA netstoreusa.com N P ISBN 9 Planet Gold planetgold.com N Y - 9 TextbookX.com textbookx.com N Y ISBN 9 Sam Weller’s Books samwellers.com N Y ISBN 9 All Textbooks 4 Less alltextbooks4less.com N Y ISBN 9 The Book Shop bookshopmorris.com Y* Y ISBN 9 Cornwall Discount Books cornwalldiscountbooks.com N Y - 8 A Lot of Books alotofbooks.com N Y - 3 HearthFire Books hearthfirebooks.com Y* Y ISBN 9 Walmart walmart.com N Y - 9 Wordsworth.com wordsworth.com N Y ISBN 9 Powells powells.com N Y - 9 BiggerBooks.com biggerbooks.com N Y ISBN 9 That Bookstore in Blytheville tbib.com Y* Y ISBN 9 StrandBooks.com strandbooks.com N Y ISBN 7 St. Marks Bookshop stmarksbookshop.com Y* Y ISBN 9 RJ Julia rjjulia.com N Y ISBN 9 Paulina Springs Book Company paulinasprings.com Y* Y ISBN 9 Books-A-Million booksamillion.com N Y ISBN 9 CodysBooks.com codysbooks.com Y* Y ISBN 9 The Concord Bookshop concordbookshop.com Y* Y ISBN 9 Dartmouth Bookshop dartbook.com Y* Y ISBN 9 GoodEnough Books goodenoughbooks.com Y* Y ISBN 9 MediaPlay.com mediaplay.com N Y - 9 Table 4.2: Bookstores included in the evaluation. This table reports whether the bookstore contained ISBNs in its internal URLs (“URL”), whether the sites were generated through a series of dynamic scripts (“Dy.”), whether they were a derivative of another site (“De.”) and how many of the nine book categories they matched (“Cat.”). A “*” next to the “De.” column indicates that the site was a booksense.com derivative, while a “+” indicates that the bookstore was a booksite.com derivative. A “P” in the “Dyn” column indicates that the site was dynamic but did not “look” dynamic (it did not have a “?” with parameters following the URL).
    §4.2 Comparing bookstores65 identify books). Bookstores were only judged on the categories for which they stocked (or listed) the bestseller. The justification for this approach was that there may be some specialised (e.g. fiction only) bookstores that should be included in the study, but not considered for all book categories. A full listing of all 35 eligible bookstores and their salient properties is presented in Table 4.2. 4.1.4 Submitting queries and collecting results The queries were made up of book titles submitted to search engines as phrases (i.e. inside double quotes or marked as phrases in advanced searches). The exact query syntax submitted to each search engine is reported in Appendix C. The top 1000 results for each query from each search engine were retrieved and recorded. 4.1.5 Judging The candidate documents were required to fulfil two criteria in order to be considered as a correct answer; 1) the page must have been for the book whose title is given as the query, and 2) the retrieved page must have been transactional in nature. A transactional page was considered to be a bookstore page from which a user could buy a book. Browse pages (documents that list multiple books, for example, a list of books in a particular category, by series or by author) or bookstore search results were not judged as correct results. For many bookstores the correct answers were observed to have the hardcover or paperback ISBN in the URL (in many cases there were many correct duplicate URLs which were all observed to contain the ISBN). To cut down on manual judging for these bookstores, automatic judging was performed based on the presence or absence of the ISBN in the URL. For other bookstores the unique product identifiers for each book were manually collected and recorded, and URLs checked for their presence. 4.2 Comparing bookstores The book finding success rates were measured at several cutoffs (S@1, S@5, S@10, S@100 and S@1000). Table 4.3 contains the results for this experiment. The following observations may be made: • Of the 35 bookstores evaluated, only 14 returned any correct answers within the top 1000 results of any of the search engines. • Only four bookstores contributed answers within the top ten results in any search engine: Amazon, Barnes and Noble, Booksite and Walmart • Amazon was the most searchable bookstore in the evaluation, achieving the high- est success rates. • Only Amazon had correct results returned by every search engine.
    66 Web searchand site searchability S@1000 break. Host Bookstore S@1 / S@5 / S@10 / S@100 / S@1000 (AV:FA:GO:MS) Res. Amazon 0.124 / 0.325 / 0.402 / 0.492 / 0.584 104:83:162:132 3903 Barnes and Noble 0.028 / 0.096 / 0.140 / 0.225 / 0.316 0:87:170:3 3603 Walmart 0.010 / 0.030 / 0.045 / 0.070 / 0.075 2:0:0:60 277 BookSite 0.000 / 0.004 / 0.005 / 0.013 / 0.013 0:0:0:11 52 ecampus 0.000 / 0.000 / 0.000 / 0.005 / 0.012 0:7:0:3 290 AllDirect 0.000 / 0.000 / 0.000 / 0.002 / 0.005 0:4:0:0 52 NetstoreUSA 0.000 / 0.000 / 0.000 / 0.001 / 0.010 0:8:0:0 261 Sam Weller’s Books 0.000 / 0.000 / 0.000 / 0.001 / 0.006 0:5:0:0 22 Books-A-Million 0.000 / 0.000 / 0.000 / 0.000 / 0.008 0:4:0:3 775 1BookStreet 0.000 / 0.000 / 0.000 / 0.000 / 0.006 0:5:0:0 17 Wordsworth.com 0.000 / 0.000 / 0.000 / 0.000 / 0.004 1:0:1:1 92 TextbookX.com 0.000 / 0.000 / 0.000 / 0.000 / 0.002 0:2:0:0 22 CodysBooks.com 0.000 / 0.000 / 0.000 / 0.000 / 0.002 0:2:0:0 78 Arthurs Books 0.000 / 0.000 / 0.000 / 0.000 / 0.003 0:1:0:0 3 Powells Bookstore 0.000 / 0.000 / 0.000 / 0.000 / 0.000 0:0:0:0 1031 Table 4.3: Bookstore comparison. This table includes all bookstores which had at least one success at 1000 (S@1000) in a search engine. Powells is included in the table for comparison due to the high number of results retrieved by the search engines from Powells’ host name. The “S@1000 break.” column shows the number of correct books retrieved from each bookstore within the top 1000 search results for each search engine. The “Host Res.” column reports the number of pages found for each bookstore’s host name by all search engines.
• Barnes and Noble performed well on Google (GO) and AllTheWeb (FA).

• Walmart performed well on MSN Search (MS).

• The only search engine which returned results for many of the smaller bookstores was AllTheWeb (FA).

4.3 Comparing search engines

Search engine effectiveness was also compared: the results are presented in Table 4.4 and Table 4.5. From the data in these tables the following observations were made:

• AltaVista’s (AV) performance was inferior to that of both Google (GO) and MSN Search (MS) at all cutoffs. AltaVista demonstrated around half the precision of MSN Search.

• AllTheWeb (FA) trailed well behind all other search engines, but provided a large number of correct answers between the 100th and 1000th positions (its success rate jumps from 0.18 to 0.52). The precision for AllTheWeb was low.

• Google (GO) trailed MSN Search at S@1, but exceeded MSN Search’s performance from S@10 onwards. Google returned more correct answers in its top 5, 10 and 100 results than MSN Search.

• MSN Search (MS) produced the strongest results at S@1 and S@5, but when the cutoffs were extended, retrieval effectiveness decreased dramatically.

Search engine    S@1     S@5     S@10    S@100   S@1000
AV               0.14    0.39    0.45    0.50    0.52
FA               0.00    0.02    0.05    0.18    0.52
GO               0.15    0.56    0.67*   0.83*   0.89*
MS               0.36*   0.57*   0.65    0.72    0.73

Table 4.4: Search engine success rates. The best result at each cutoff is marked with an asterisk.

4.3.1 Search engine bookstore coverage

The search engine bookstore coverage was measured by sourcing counts from WWW search engines for the number of URLs indexed per bookstore (site document coverage), and the number of hyperlinks that were directed at each bookstore (site hyperlink coverage).
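The automatic judging step of Section 4.1.5, together with the success-rate tallies reported in Table 4.4, can be sketched as follows. The fragment is illustrative rather than the scripts actually used: it assumes each engine's run is a dict mapping a query to its ranked list of result URLs and that the correct ISBNs per query are known, so a result is judged correct when one of those ISBNs appears in its URL (bookstores that required manual judging are ignored here).

    def first_correct_rank(result_urls, correct_isbns):
        """1-based rank of the first URL containing a correct ISBN, else None."""
        for rank, url in enumerate(result_urls, start=1):
            if any(isbn in url for isbn in correct_isbns):
                return rank
        return None

    def success_rates(run, isbns_per_query, cutoffs=(1, 5, 10, 100, 1000)):
        """run: query -> ranked list of URLs; isbns_per_query: query -> set of ISBNs."""
        ranks = [first_correct_rank(run[q], isbns_per_query[q]) for q in run]
        n = len(ranks)
        return {k: sum(1 for r in ranks if r is not None and r <= k) / n
                for k in cutoffs}

    # Hypothetical usage for one engine:
    # print(success_rates(google_run, isbns_per_query))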
    68 Web searchand site searchability Search Precision Engine @1 @5 @10 @ 100 AV 0.14 0.08 0.05 0.01 FA 0.00 0.00 0.01 0.00 GO 0.15 0.20 0.15 0.03 MS 0.36 0.13 0.08 0.01 Table 4.5: Search engine precision. Note that precision at 1 is equivalent to the success rate at 1. The precision at cutoffs greater than 100 is less than 1/100 in all cases. The best result for each measure is highlighted. Site document coverage The transactional pages for some bookstores may not have been returned because they have never been crawled by a search engine. Table 4.6 lists the number of pages from each bookstore reported to be contained within each search engines’ index. From these results it was observed that: • Amazon had a consistently large search engine coverage – around three million documents on three-out-of-four search engines. AllTheWeb covered an order of magnitude less Amazon-based documents than did any of the other search engines. However, AllTheWeb crawled more pages for Amazon than it did any other bookstore. This may indicate that AllTheWeb incorrectly eliminated many of Amazon’s pages as duplicates, applied more stringent limits on crawling dy- namic content, or its coverage was estimated in a different way compared to the other search engines. • The coverage of Barnes and Noble varied widely across engines. While the MSN Search coverage of Barnes and Noble was small, it appeared to contain many product pages, with three correct answers retrieved. Only 500 Barnes and Noble pages were covered by AltaVista. Over a million pages were covered by Google. • A large number of Walmart pages were covered by MSN Search, whereas AllTheWeb and Google covered a relatively small number of pages. This may indicate that MSN handled dynamic pages in a different manner to the other search engines, or that there was some special relationship between MSN Search and Walmart. • AllTheWeb did not have large coverage of any one bookstore (their maximum crawl of a bookstore was 360 000 pages). Instead they tended to have a larger breadth of results, with larger crawls of lesser known bookstores. As many bookstores served content through dynamic pages, this may further indicate that AllTheWeb applied more stringent limits on dynamic content.
    §4.3 Comparing searchengines 69 Bookstore AV FA GO MS TOTAL amazon.com 3 675 723 358 376 3 620 000 2 838 819 10 492 918 barnesandnoble.com 521 192 792 1 240 000 2822 1 436 135 walmart.com 89 243 1076 10 500 916 162 1 016 981 netstoreusa.com 1171 315 002 93 000 42 052 451 225 powells.com 39 397 111 977 65 900 6204 223 478 textbookx.com 18 23 157 38 600 150 61 925 alldirect.com 24 26 278 7 27 26 336 ecampus.com 300 7763 2010 240 10 313 planetgold.com 18 8361 774 18 9171 booksamillion.com 22 5860 54 865 6801 cornwalldiscountbooks.com 1 5423 2 1 5427 wordsworth.com 735 228 2290 1271 4524 booksite.com 93 169 1190 290 1742 codysbooks.com 74 1308 238 57 1677 arthursbooks.com 7 1221 8 384 1620 samwellers.com 7 278 5 8 298 tbib.com 1 2701 3 0 2705 stmarksbookshop.com 1 2414 4 0 2419 1bookstreet.com 5 1009 779 172 1965 a1books.com 15 1311 29 173 1528 biggerbooks.com 0 1 395 1 397 americanabooks.com 3 309 15 14 341 alltextbooks4less.com 3 208 22 31 264 dartbook.com 31 74 17 1 123 mediaplay.com 19 0 32 7 58 paulinasprings.com 1 40 1 0 42 rjjulia.com 7 27 3 4 41 concordbookshop.com 5 2 3 0 10 goodenoughbooks.com 1 4 2 0 7 bookworksaptos.com 1 3 2 0 6 alotofbooks.com 1 1 2 1 5 bookshopmorris.com 1 2 2 0 5 changinghands.com 1 2 2 0 5 hearthfirebooks.com 1 2 2 0 5 Total 3 809 257 1 087 000 5 081 161 3 810 094 13 787 512 Table 4.6: Search engine document coverage. Note that the totals in the right-hand side col- umn may contain duplicate links (this occurs when the same URL is found by different search engines). These values were collected using methods outlined in Appendix C. The column la- belled “AV” contains data from AltaVista, “FA” contains data from AllTheWeb, “GO” contains data from Google, and “MS” contains data from MSN Search.
    70 Web searchand site searchability • AltaVista had large coverage only of Amazon, Walmart and Powells. It seems unlikely that book results could be found in their small (sub 1000 page) crawls of other bookstores. The searchability of all three bookstores was improved by having simple URL structures. • Powells had large coverage (with three-out-of-four search engines indexing over 40 000 pages), but did not have any product pages returned in the top 1000 results for these search engines. This may indicate that hyperlink evidence di- rected at the Powells bookstore was either not present, not directed at book buy- ing pages, or was resolved incorrectly by the WWW search engines. Hyperlink graph completeness Only two of the evaluated search engines supported domain name hyperlink counts: AltaVista and AllTheWeb. Domain name hyperlink counts retrieve the number of links to an entire domain name rather than just to a single page. This information was used to determine the hyperlink coverage of an entire bookstore. Table 4.7 contains the results for this study. Some observations are that: • AllTheWeb discovered a large number of links to Amazon, but did not crawl documents from Amazon as comprehensively as other search engines. • Powells bookstore had a large number of incoming links, but still performed poorly. This further indicates that incoming links may not have been success- fully resolved by the WWW search engines (due to anomalies in the search en- gine representations of Powells’ document set or link graph), or that links were not directed to transactional pages. • AllTheWeb discovered more links to diverse hosts than AltaVista. This could be attributed to the fact that AllTheWeb performed a deeper crawl of lesser sites and encountered a larger number of internal links. 4.4 Findings This section discusses the bookstore findings. It includes an analysis of the URL and hyperlink coverage, of bookstore ranking performance, and finally of the relative re- trieval effectiveness of the evaluated search engines. 4.4.1 Bookstore searchability: coverage The results in Tables 4.6 and 4.7 reveal that the top three bookstores by URL coverage were also the top three bookstores by success rate. The bookstore coverage appears to have had a significant impact on how often books from the bookstore were retrieved early in the document ranking. Amazon achieved high coverage in the indexes of all evaluated search engines.
    §4.4 Findings 71 BookstoreAV FA TOTAL amazon.com 12 408 441 25 955 858 38 364 299 powells.com 5 197 526 316 989 5 514 515 textbookx.com 3 456 068 28 453 3 484 521 barnesandnoble.com 234 137 784 088 1 018 225 walmart.com 14 783 267 008 281 791 booksite.com 4927 113 729 118 656 booksamillion.com 34 137 79 351 113 488 ecampus.com 2170 102 047 104 217 netstoreusa.com 10 548 91 867 102 415 1bookstreet.com 25 229 50 064 75 293 wordsworth.com 2750 21 694 24 444 a1books.com 4545 16 270 20 815 codysbooks.com 1062 9512 10 574 alldirect.com 614 6508 7122 arthursbooks.com 109 1700 1809 samwellers.com 106 208 314 americanabooks.com 114 2163 2277 alltextbooks4less.com 52 945 997 rjjulia.com 174 337 511 concordbookshop.com 185 118 303 planetgold.com 31 200 231 dartbook.com 95 117 212 changinghands.com 68 96 164 bookshopmorris.com 46 99 145 cornwalldiscountbooks.com 12 123 135 alotofbooks.com 11 108 119 stmarksbookshop.com 42 62 104 hearthfirebooks.com 8 73 81 tbib.com 29 40 69 bookworksaptos.com 15 53 68 paulinasprings.com 31 37 68 biggerbooks.com 2 63 65 goodenoughbooks.com 12 15 27 Total 21 463 332 28 160 545 49 623 877 Table 4.7: Search engine link coverage. The column labelled “AV” contains data from Al- taVista and “FA” contains data from AllTheWeb. Note that because of overlap between AV and FA the totals in the right-hand column may contain several links to the same URL.
    72 Web searchand site searchability It is important for a bookstore to have deep crawls indexed in as many search engines as possible. Three potential reasons why bookstores included in this study were not crawled deeply may be offered: 1. Despite many incoming links to the bookstore domain, few pages were crawled. This may have been because the crawler was trapped when building the book- stores’ link graph and only crawled a few books many times over. Alternatively, book pages could have been identified as near-duplicates and eliminated from the document index. 2. The bookstores did not receive sufficient links directly to product pages from external sites (i.e. most links were directed to the bookstore home page). 3. The search engines appeared to label bookstores as containing uninteresting dynamically generated content. The WWW search engines may not consider apparent dynamically generated content due to concerns about polluting their representation of the web graph (see Section 3.1.2). Some dynamic content was observed to be in the form of parameterised URLs (with question marks) gener- ated by a single script. Given the poor performance of bookstores which gener- ated content using a single script, it appears that WWW search engine crawlers might have either simply ignored some of these documents (according to some URLs-to-crawl rule, for example, stripping all URL parameters), or have been unable to retrieve any meaningful information from them. Many bookstores that have a high link-count were unable to achieve wide URL coverage. This is most apparent on Powells, which has a large number of incoming links, but less indexed pages than other well linked bookstores. Further investigation uncovered that Powells encodes book ISBN codes as a query to a .cgi script. This is in contrast to the Amazon method, where ISBN codes are encoded in the URL and not as parameters. The site which managed to best convert incoming links to crawled pages was Net- storeUSA. In contrast to all other evaluated bookstores, NetstoreUSA had more pages indexed by the search engines than incoming links. NetstoreUSA improved its search- ability by using static-looking documents organised in simple hierarchies of shtml pages. To encourage a deep crawl that will cover all site content it is necessary for web authors to ensure they have both internal and external links directly to their hierar- chically deep, but important, content. This increases the chance that a WWW search engine will encounter a link to the page, and adds valuable hyperlink evidence. To encourage user linking it is important to use meaningful and consistent URL strings. While one can envisage a web developer linking to a URL which has the form foo. com/ISBN/ it may be less likely that they link directly to foo.com/prod/prod. asp?prod=9283&source=09834. There is also a higher likelihood that such a link would be discarded during the crawl or the creation of the web graph. Deep linking may be encouraged further through the use of incentive or partnership programs. If
    §4.4 Findings 73 sucha program is in place, it is important to ensure partners are able to point directly to products and that all partners point to the same consistent URL for each product (e.g. Amazon provides an incentive program so that web authors link directly to their product pages). To ensure database generated content is not rejected by WWW search engines, it is important that the content is provided through individual, static looking URLs. Duplicate pages should also be removed from the site. However, if duplicate pages are to be retained, it is important that web authors know what URL they should link to, and that crawls of duplicate pages be minimised (potentially through the use of page crawl exclusion measures in “robots.txt” files [133]). 4.4.2 Bookstore searchability: matching/ranking performance Transactional documents for the requested books were most frequently matched (and retrieved) from the Amazon and Barnes and Noble bookstores. Many of the documents retrieved by WWW search engines from other bookstores were observed to be browse and search pages, and not transactional documents. The Powells bookstore is a case in point. Despite having many links, reasonable coverage in search engine indexes and having results matched frequently, Powells transactional pages were never returned. This may indicate poor page full-text content, poor site organisation and/or a lack of encouragement to link directly to products (as their referral program appears to be processed through their front page). These identified problems could also be alleviated somewhat by employing robot exclusion directives to inform crawlers to ignore search and browse pages, and index only product pages (through the use of “robots.txt” files [133], as outlined above). 4.4.3 Search engine retrieval effectiveness The best book finding search engines were Google and MSN Search and the most successful bookstore was Amazon. MSN Search provided the most correct answers at the first rank. However, Google provided more correct answers in the top five positions, potentially giving users more book buying options. In order to maximise the book finding ability of a WWW search engine empirical findings indicate that deep crawls of dynamic content needs to be performed. All of the examined bookstores bury product pages deep within their URL directory tree (generally as leaf nodes). While AllTheWeb appeared to index a much larger selection of bookstores, they appeared to not crawl as much of the Amazon bookstore as other search engines. Given that the majority of correct hits for all search engines came from the Amazon bookstore, this could be one of the main reasons for the observed low effectiveness of AllTheWeb on this task. Some WWW search engines appear to favour certain bookstores over others. For example Google and AllTheWeb have large indexes of Barnes and Noble while the oth- ers do not. A further example of this is the good performance of the Walmart book- store in MSN Search. The results suggests that MSN Search may have access to extra
information for Walmart that is not available to the other search engines.

For WWW search engines to provide good coverage of popular bookstores it is necessary for them to crawl dynamic URLs, even when there are many pages generated from a single script with different parameters. On the Walmart and Powells bookstores, all product pages are created from a single script, with the book's ISBN as a parameter. Also, as many slightly different URLs frequently contain information about exactly the same ISBN, it may be necessary to perform advanced URL equivalence or content (duplicate) detection. This is the case with duplicate product pages on the Amazon bookstore, as the same document is retrieved no matter what referral identifier is included in the URL. Without effective duplicate detection and consolidation of duplicate documents in the hyperlink graph, the effectiveness of link evidence will be decreased.

4.5 Discussion

The coverage results from leading WWW search engines indicate that all of the evaluated engines dealt with web graph anomalies in a different manner (some more effectively than others). The most effective search engines retrieved book-buying pages from dynamic sites for which they had crawled between 0.9 and 3.6 million documents. This demonstrates the importance of using robust methods when sourcing and building the web graph (such as those outlined in Chapter 3) for effective retrieval.

From a web site author's point of view, the design of a web site directly affects how well search engines can crawl, match and rank its pages. For this reason, searchability should be an important concern in site design. Observations from this case study indicate that there are large discrepancies in the relative searchability of bookselling web sites. Many of the bookstore sites incorporated dynamic URLs that may be difficult for some WWW search engines to crawl, and unattractive targets for web authors to direct hyperlinks to. Many bookstore sites were also marred by duplicate content and confusing link graphs. Of the 35 evaluated bookstores, 24 did not appear in the top 1000 results in any of the evaluated search engines for any of the evaluated books.

These results illustrate the importance of a combined approach to improving transactional search. To improve effectiveness WWW search engines should endeavour to discover more product pages, by performing deep crawls of provider sites and of dynamic pages (especially those that are linked to directly). It is equally important for bookstores to build a suitable site structure that allows search engines to perform thorough crawls. To improve searchability, bookstores should use short non-changing URLs (like NetstoreUSA) and encourage deep linking directly to their product pages (like Amazon).

It is submitted that these findings are likely to hold for other WWW search tasks. The amount of link evidence available for a bookstore, as observed in the link coverage study, proved to be particularly important for achieving high rankings in some search engines (such as Google [93]). The apparent heavy use of web evidence in
the document ranking algorithms of WWW search engines provides further support for the investigations of web evidence within this thesis.
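The "advanced URL equivalence or content (duplicate) detection" called for in the discussion above can be sketched with a simple shingle-and-Jaccard check. This is only an illustration of one well-known technique, not the method used by any of the evaluated engines; the shingle length, similarity threshold and example pages are illustrative assumptions.

    # Minimal sketch of shingle-based near-duplicate detection for crawled pages.
    # The shingle length (4 words) and the 0.9 Jaccard threshold are illustrative
    # assumptions, not values used by any engine evaluated in this chapter.
    def shingles(text, k=4):
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 1.0

    def near_duplicates(pages, threshold=0.9):
        """pages: dict mapping URL -> page text. Returns pairs judged near-duplicate."""
        sigs = {url: shingles(text) for url, text in pages.items()}
        urls = sorted(sigs)
        return [(u, v) for i, u in enumerate(urls) for v in urls[i + 1:]
                if jaccard(sigs[u], sigs[v]) >= threshold]

    if __name__ == "__main__":
        pages = {
            "http://store.example/book.asp?isbn=0201896834&ref=a":
                "The Art of Computer Programming volume one by Donald Knuth in stock",
            "http://store.example/book.asp?isbn=0201896834&ref=b":
                "The Art of Computer Programming volume one by Donald Knuth in stock",
            "http://store.example/contact":
                "Contact our customer service team by phone or email",
        }
        print(near_duplicates(pages))

Consolidating the detected duplicates to a single canonical URL before the link graph is built is what allows the link evidence for a product to accumulate on one document rather than being split across referral variants.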
Chapter 5

Analysis of hyperlink recommendation evidence

It is commonly stated that hyperlink recommendation measures help modern WWW search engines rank "important, high quality" pages ahead of relevant, but less valuable pages, and to reject "spam" [97]. However, what exactly constitutes an "important" or "high quality" page remains unclear [8]. Google has previously been shown to perform well on a home page finding task [116] and the PageRank hyperlink recommendation algorithm may be a factor in this success.

This chapter presents an analysis of the potential for hyperlink recommendation evidence to improve retrieval effectiveness in navigational search tasks, and to favour documents that possess some "real-world quality" or "importance". The analysis considers PageRank and in-degree scores extracted from leading WWW search engines. These scores are tested for bias and their usefulness is compared over corpora of home page, non-home page and spam page documents. The hyperlink recommendation scores are tested to determine the weight assigned to the home pages of companies that exhibit "real-world" measures of quality. The measures of "real-world" quality investigated include whether favoured companies are highly profitable or well-known. Less beneficial biases are also tested to examine whether hyperlink recommendation scores favour companies based on their base industry or location.

5.1 Method

An analysis of score biases requires a set of candidate documents, the hyperlink recommendation scores for those documents, and, in order to test for bias, attributes by which the candidate documents may be distinguished. In this experiment three sets of candidate pages are identified from data relating to publicly listed companies and links to known spam content. These form useful sets for analysis for reasons outlined in the following sections. Hyperlink recommendation scores are sourced for each of these pages using WWW search engines and tools. The attributes used to test recommendation score bias are gathered from listed company information and publicly available company attributes. The data used in this experiment were extracted during September 2003.
The following subsections detail methods used to amass the data for this experiment. This includes a description of how candidate pages were selected, how the salient company properties (used when evaluating bias) were sourced, and the methods used to extract hyperlink recommendation scores for each document.

5.1.1 Sourcing candidate pages

The home page set includes the home pages of public companies listed on the three largest US stock exchanges: the New York Stock Exchange (NYSE), NASDAQ and the American Stock Exchange (AMEX) (a total of 8329 companies were retrieved). The home pages of publicly listed companies form a useful corpus as there is publicly available information relating to company popularity, revenue, and other properties, such as which industry the company belongs to. Furthermore, publicly listed companies are plausible targets for home page finding queries.

Company information was obtained from the stock exchange web sites, and included the official company name, symbol and description. Then, using the company information service at http://quote.fool.com/, 5370 unique company home page URLs were identified. These URLs were almost always the root page of a host (e.g. http://hostname.com/) without any file path (only fourteen URLs had some path). These are considered to be the company home pages, even though in some cases the root page is a Flash animation or another form of redirect. The company information service also provided an industry for each stock, e.g. "Real Estate".

For comparison with these home pages, two further sets of pages were collected: a non-home page set and a spam page set. Non-home pages were collected by sorting company home pages by PageRank (extracted using methods outlined in the next section) and selecting twenty home pages at a uniform interval. From these home pages crawls of up to 100 pages were commenced (restricted to the company domain). The overall PageRank distribution for the pages in the twenty crawls is shown in Figure 5.1. The spam page set was collected by sourcing 399 links pointing to a search engine optimiser company (using Google's link: operator). The spam pages were largely content-free, having been created to direct traffic and PageRank towards the search engine optimiser's customers. After sourcing in-degrees, all pages with an in-degree of zero were eliminated leaving 280 pages for consideration.

5.1.2 Company attributes

The set of company home pages was grouped into subsets according to their memberships and attributes, such as the Fortune 500 list [82] and the Wired 40 list of companies judged to be best prepared for the new economy [147]. The goal was to observe how well PageRank and in-degree could predict inclusion in such lists.

Salient company properties were collected from the following web resources:

• The company information service at http://quote.fool.com provided company industry and location information.
[Figure 5.1: Combined PageRank distribution for the non-home page document set (x-axis: PageRank 0–10; y-axis: number of pages crawled). The non-home page document set was constructed by crawling up to 100 pages from a selection of company webs. The observed PageRank distribution is not a power law distribution as might be expected in PageRank distributions (see Section 2.4). These pages are more representative of the general WWW page population than the home page only set. The zero PageRanks are most likely caused by pages not present in Google crawls, or through lost redirects, or through small PageRanks being rounded to 0.]

• The Fortune magazine provided the list of Fortune 500 largest companies (by revenue) and Fortune Most Admired companies. Fortune 500 companies are those with the highest revenue, based on publicly available data, listed by Fortune Magazine (http://www.fortune.com/). The Fortune Most Admired company list is generated through peer review by Fortune Magazine.

• The Business Week magazine Top 100 Global Brands was sourced from http://www.businessweek.com/magazine/content/03 31/b3844020 mz046.htm. This lists the most valuable brands from around the world, based on publicly available marketing and financial data.

• The Wired 40 list of technology-ready companies was taken from Wired Magazine and is available online at http://www.wired.com/wired/archive/11.07/40main.html. The list contains the companies that Wired Magazine believe are best prepared for the new economy.

In all cases the 2003 editions of the lists were used.

5.1.3 Extracting hyperlink recommendation scores

For each URL, PageRanks and in-degrees were extracted from the search engines Google [93] and AllTheWeb [80].

Unfortunately there is no way for researchers external to Google to access PageRanks used in Google document ranking. The only publicly available PageRank values are provided in the Google toolbar [98] and through the Google directory [95]. When
a page is visited, the Toolbar lists its PageRank on a scale of 0 to 10, indicating "the importance Google assigns to a page".1 When a directory category is viewed, the pages are listed in descending PageRank order with a PageRank indicator next to each page, to "tell you at a glance whether other people on the web consider a page to be a high-quality site worth checking out".2 With PageRank provided directly in these ways, it can be analysed as a direct indicator of quality, without needing to know whether or how it is used in Google ranking. The PageRank from the Google Toolbar is interesting as toolbar users may use it directly as a measure of document quality, and the quality of this measure is unknown. Further, as it is sometimes claimed that PageRank behaves differently on a large-scale web graph, it may allow for some insight into properties of WWW-based PageRank (to accompany results presented in Chapter 7).

PageRanks were extracted from the Microsoft Internet Explorer Google Toolbar [98] by visiting pages and noting the interaction between the Toolbar and Google servers. To ensure consistency a single Google network (IP) address was used to gather Toolbar data.3 When the requested URL resulted in a redirect, the PageRank was retrieved for the final destination page (types of redirects are discussed in Section 3.1.3). During the extraction process it was noted that PageRank values had been heavily transformed. Actual PageRanks are power law distributed, so low PageRank values should be represented far more frequently than higher values. By contrast, the Toolbar reports values in the range of 0 to 10, with all values frequently reported (see Figure 5.1). It is likely that one reason for this transformation is to provide a more meaningful measure of page quality to toolbar users. Without such a transformation most documents would achieve a Toolbar PageRank value of 0.

Several problems were faced when obtaining in-degree values. These could only be reliably extracted for site home pages. Problems that have been identified in methods used by WWW search engines to estimate linkage include:

1. counting pages which simply mention a URL rather than linking to it,

2. not anchoring the link match, so that the count for http://www.apple.com includes pages with http://www.apple.com.au and http://www.apple.com/quicktime/, and

3. under-reporting the in-degree, for example by systematically ignoring links from pages with PageRanks less than four.4

Three methods for accessing in-degree estimates for a URL were evaluated (estimates are reported in Table 5.1):

1 From: http://toolbar.google.com/button help.html
2 From: http://www.google.com/dirhelp.html.
3 The Google Toolbar sources PageRank scores from one of several servers. During experiments it was noted that the PageRank scores for the same page could differ according to which server was queried. This effect is believed to be caused by out-of-date indexes being used on some servers.
4 This is believed to be the case in Google's link counts, see http://www.webmasterworld.com/forum80/254.htm
                Google link:   contains       PageRank   AllTheWeb
                in-degree      'in-degree'               in-degree
  Min                   0              0            0             0
  Max             857 000      1 250 000           10    14 324 793
  Mean                958           1910          5.3        17 889
  Median               82            112            5           319
  Apple            87 500        237 000           10     2 985 141

Table 5.1: Values extracted from Google [93] and AllTheWeb [80] for 5370 company home pages in September 2003. Listed are range, mean, median and an example value (for Apple Computer, http://www.apple.com/).

• The first method used the Google query link:URL, which reportedly has problem 1.

• The second method used Google to find pages which contained the URL. This solution was suggested by the Google Team,5 but it exhibits problems 1 and 2, and also seems to only return pages which contain the URL in visible text.

• The third method used the AllTheWeb query link:URL -site:URL to retrieve in-degree values. The operator -site:URL was included because the method has problem 2, and adding a -site:URL excludes all intra-site links, and so eliminates many of the non-home page links and a few home page links.

All three types of in-degree estimates were found to be correlated with each other (Pearson r > 0.7).

AllTheWeb in-degrees were chosen for comparison with Google PageRanks to eliminate any potential search engine preference, and to ensure that in-degree sourcing issue 3 did not impact correlations between in-degree and PageRank values. Both search engines had independent crawls of a similar size (AllTheWeb crawled 3.1 billion documents, compared to Google's 3.3 billion).6

Table 5.1 displays some pertinent properties of the extracted values, namely the minimum, maximum, mean and median values of all extracted hyperlink recommendation evidence.

5.2 Hyperlink recommendation bias

This section presents the results of an analysis of potential bias in hyperlink recommendation scores. Biases considered include a preference for home pages, large

5 As discussed in: http://slashdot.org/comments.pl?sid=75934&cid=6779776
6 The collection size was estimated to be 3.3 billion on http://www.google.com at September 2003; as of November 2004 it is estimated to be around 8 billion documents on http://www.google.com.
famous companies, a particular country of origin, or the industry in which the company operates.

5.2.1 Home page preference

Figure 5.2 shows the PageRank distributions for eight of the twenty crawls (distributions for the other twelve crawls are included in Appendix E). The distributions reveal that in almost every case, the company home page has the highest PageRank. In every case at least some pages received lower PageRank than the home page. This is not surprising, as links from one server to another usually target the root page of the target server. In fact targeting deeper pages has even led to lawsuits [192].

5.2.2 Hyperlink recommendation as a page quality recommendation

Having considered intra-site hyperlink recommendation effects, inter-site comparisons are now considered.

5.2.2.1 Large, famous company preference

The Fortune 500 (F500), Fortune Most Admired and Business Week Top 100 Global Brands lists provide good examples of large, famous companies, relative to the general population of companies. Figure 5.3 shows that companies from these lists tended to have higher PageRanks than other companies. However, there are examples of non-F500 companies with PageRank 10 such as http://www.adobe.com. At the other end of the spectrum, the Zanett group http://www.zanett.com has a F500 rank of 363, but a PageRank of 3. This puts them in the bottom 6% of 5370 companies, based on Toolbar advice.

The home pages of Fortune 500 and Most Admired companies receive, on average, one extra PageRank point. Business Week Top Brand companies receive, on average, two extra PageRank points. Similar findings were observed for in-degree. These findings support Google's claim that PageRank indicates importance and quality. In-degree was observed to be an equally good indicator of popularity on all three counts.

5.2.2.2 Country and technology preference

Given the diversity of WWW search users, a preference in hyperlink recommendation evidence for a particular company, industry or geographical location may be undesirable. This section investigates biases towards technically-oriented and US companies.

As shown in Figure 5.4 a bias towards US companies was not observed. However, it should be noted that all companies studied are listed on US stock exchanges. Further, as a smaller regional stock exchange was included (AMEX) there may be a bias towards non-US companies by virtue of comparing large international (globally listed) companies with smaller (regionally listed) US companies. Perhaps if local Australian Stock Exchange (ASX) companies were compared to similarly sized companies from the American Stock Exchange the results would differ. This is left for future work.
[Figure 5.2: Toolbar PageRank distributions within sites (panels: www.microsoft.com, HP PR=10; www.apple.com, HP PR=10; www.qwest.com, HP PR=8; www.captaris.com, HP PR=7; www.credence.com, HP PR=6; www.cummins.com, HP PR=6; www.unitedauto.com, HP PR=5; www.acmeunited.com, HP PR=4; x-axis: PageRank 0–10, y-axis: number of pages crawled). The PageRank advice to users is usually that the home page is the most important or highest quality page, and other pages are less important or of lower quality. The PageRank of the home page of the site is shown as "HP PR=". Distributions for the twelve other companies are provided in Appendix E.]
[Figure 5.3: Bias in hyperlink recommendation evidence towards large, admired and popular companies (left panels: proportion of group by PageRank; right panels: proportion of group by in-degree; groups: F500 vs not F500, Most Admired vs not, Global Brands vs not). Companies in the Fortune 500, Fortune Most Admired and Business Week Top 100 Global Brands lists tend to have higher PageRank. The effect is strongest for companies with well known brands. On the right, similar effects are present in in-degree.]
[Figure 5.4: Bias in hyperlink recommendation evidence towards technology-oriented or US companies (left panels: proportion of group by PageRank; right panels: proportion of group by in-degree; groups: US vs not US, Technology vs not, Wired 40 vs not). A strong PageRank bias towards US companies was not observed. However, companies in the "Internet Services", "Software" and "Computers" industries had higher PageRank, as did those in the Wired 40. The strong bias towards technology companies is most useful if users are interested in technology; however, given the increasing global reach of the WWW, and the increasing ease of access for non-technical users, such biases are helping a smaller and smaller proportion of the WWW user population. On the right are similar plots for in-degree.]
                                                PageRank
  Industry                          Companies   Range   Mean
  Internet Services                        29     3–9   6.66
  Publishing                               58     4–9   6.66
  Airlines                                 25     3–8   6.48
  Office Equipment                          7     5–8   6.43
  Entertainment                            14     4–8   6.36
  Software                                306    3–10   6.35
  Computers                                86    4–10   6.29
  Consumer Electronics                     18     5–8   6.17
  Automobile Manufacturers                  7     4–8   6.14
  Diversified Technology Services          46     4–8   6.02
  ...
  Steel                                    34     3–7   4.68
  Coal                                      6     4–5   4.67
  Clothing & Fabrics                       54     2–7   4.63
  Oil Companies                           132     1–8   4.60
  Pipelines                                25     3–6   4.56
  Banks                                   433     0–8   4.55
  Real Estate                             174     2–7   4.55
  Precious Metals                          38     0–6   4.47
  Marine Transport                         12     3–6   4.42
  Savings & Loans                         146     0–6   4.08

Table 5.2: PageRanks by industry. The "Internet Services" and "Publishing" industries, with 29 and 58 companies respectively, had the highest mean PageRank.
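Table 5.2 summarises the Toolbar PageRank range and mean per industry. A minimal sketch of that aggregation is shown below; the example records and field names are illustrative assumptions, not the thesis data, which was the 5370 company home pages with industries sourced from http://quote.fool.com/.

    # Minimal sketch of the per-industry aggregation behind Table 5.2.
    from collections import defaultdict

    def pagerank_by_industry(companies):
        """companies: iterable of (industry, toolbar_pagerank) pairs."""
        groups = defaultdict(list)
        for industry, pr in companies:
            groups[industry].append(pr)
        summary = {}
        for industry, prs in groups.items():
            summary[industry] = (len(prs), min(prs), max(prs), sum(prs) / len(prs))
        # sort by mean PageRank, highest first, as in Table 5.2
        return sorted(summary.items(), key=lambda kv: kv[1][3], reverse=True)

    if __name__ == "__main__":
        sample = [("Internet Services", 9), ("Internet Services", 5),
                  ("Banks", 4), ("Banks", 6), ("Software", 7)]
        for industry, (n, lo, hi, mean) in pagerank_by_industry(sample):
            print(f"{industry:20s} n={n} range={lo}-{hi} mean={mean:.2f}")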
Two measures of technology bias were investigated: bias towards companies which produce technology and bias towards heavy users of it. First, using company information from http://quote.fool.com/, companies in industries involving computer software, computer hardware, or the Internet were identified. The industry and PageRank breakdown is shown in Table 5.2. Results in Figure 5.4 illustrate a bias towards technology-oriented companies. These companies received an extra PageRank point on average.

The second test of technology bias used the 2003 Wired 40 list of technology-ready companies. This demonstrated an even greater pro-technology bias (Figure 5.4), with companies present in the Wired 40 receiving two extra PageRank points on average.

A strong bias towards technology-oriented companies is useful if users are interested in technology; however, given the increasing global reach of the WWW, and the increasing ease of access for non-technical users, such biases are assisting a smaller and smaller proportion of the WWW user population.

5.3 Correlation between hyperlink recommendation measures

This section presents results from an investigation of the extent of the correlation of advice given by PageRank and in-degree on the WWW. This investigation was conducted over the set of company home pages and the set of known spam pages.

5.3.1 For company home pages

The strong correlation between Toolbar-reported PageRank and the log of in-degree for company home pages is depicted in Figure 5.5. To better understand the differences between in-degree and PageRank, an analysis of "winners" and "losers" from the PageRank calculation was performed. Winners in the PageRank calculation have high PageRanks even though they have low in-degree (the bottom right quadrant in Figure 5.5), whilst losers have high in-degree but receive a low PageRank (top left quadrant). Some anomalies were observed due to errors in in-degree calculations (e.g. www.safeway.com had a PageRank of 6 with in-degree 0). However, these cases were rare and uninteresting, as they appeared to be due to anomalies within the search engines rather than the link graph. Nonetheless, after discounting cases where AllTheWeb scores disagreed with the other two in-degree estimates, there were some extreme cases where in-degree and PageRank were at odds. These cases are shown in Table 5.3.

In some cases the discrepancies shown in Table 5.3 are very large. For example, ESS Technology (http://www.esstech.com) was demoted, achieving only PageRank 3 despite having an in-degree of 22 357. On the other hand, Akamai (http://www.akamai.com) achieved a PageRank of 9 with only 17 359 links. The promotions and demotions of sites relative to their in-degree ranking by PageRank do not appear to indicate any systematic additional preference for higher "real-world quality".
[Figure 5.5: Toolbar PageRank versus in-degree for company home pages (x-axis: PageRank 0–10; y-axis: AllTheWeb in-degree, log scale; the median in-degree at each PageRank value is marked). For 5370 company home pages, Toolbar PageRank and the log of AllTheWeb [80] in-degree have a correlation of 0.767 (Pearson r). This high degree of correlation is achieved despite the relatively large spread of PageRank zero pages. Such pages may have been missed by the Google crawler or indexer, or might have been penalised by Google policy.]

  Stock  URL                                   Industry                   PageRank  In-degree
  AAPL   http://www.apple.com                  Computers                        10    2985141
  YHOO   http://www.yahoo.com                  Internet Services                 9    5620063
  AKAM   http://www.akamai.com                 Internet Services                 9      17359
  EBAY   http://www.ebay.com                   Consumer Services                 8     737792
  BDAL   http://www.bdal.com                   Advanced Medical Supplies         8        199
  GTW    http://www.gateway.com                Computers                         7     170888
  JAGI   http://www.janushotels.com            Lodging                           7         64
  FLWS   http://www.1800flowers.com            Retailers                         6      38254
  KB     http://www.kookminbank.co.kr          Banks                             6          5
  IO     http://www.i-o.com                    Oil Drilling                      5        235
  FFFL   http://www.fidelityfederal.com        Savings & Loans                   5         34
  USNA   http://www.usanahealthsciences.com    Food Products                     4      13353
  RSC    http://www.rextv.com                  Retailers                         4          6
  ESST   http://www.esstech.com                Semiconductors                    3      22347
  CAFE   http://www.selectforce.net            Restaurants                       3          3
  MCBF   http://www.monarchcommunitybank.com   Savings & Loans                   2          6
  WEFC   http://www.wellsfinancialcorp.com     Savings & Loans                   2          1
  PTNR   http://investors.orange.co.il         Wireless Communications           1        176
  HMP    http://www.horizonvascular.com        Medical Supplies                  1          5
  VCLK   http://www.valueclick.com             Advertising                       0      46659

Table 5.3: Extreme cases where PageRank and in-degree disagree. Even after cases where AllTheWeb in-degrees were in disagreement with the two Google in-degrees have been eliminated, large disparities in scores were observed. The promotions and demotions of sites relative to their in-degree ranking do not seem to indicate a more accurate assessment by PageRank.
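Figure 5.5 reports a Pearson correlation of 0.767 between Toolbar PageRank and the log of AllTheWeb in-degree. A minimal sketch of that calculation follows; the sample pairs, the base-10 logarithm and the log(in-degree + 1) handling of zero in-degrees are illustrative assumptions, since the thesis does not state those details.

    # Minimal sketch of the PageRank / log(in-degree) correlation behind Figure 5.5.
    import math

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    def toolbar_vs_indegree(pages):
        """pages: iterable of (toolbar_pagerank, alltheweb_indegree) pairs."""
        prs = [float(pr) for pr, _ in pages]
        logd = [math.log10(deg + 1) for _, deg in pages]  # +1 smooths zero in-degrees
        return pearson(prs, logd)

    if __name__ == "__main__":
        sample = [(10, 2985141), (9, 17359), (6, 38254), (5, 235), (3, 3), (0, 46659)]
        print(round(toolbar_vs_indegree(sample), 3))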
[Figure 5.6: Toolbar PageRank versus in-degree for links to a spam company (x-axis: PageRank 0–10; y-axis: AllTheWeb in-degree, log scale; the median in-degree, ignoring zeroes, is marked for the spam pages and for the company home pages). The 280 spam pages achieve good PageRank without needing massive numbers of in-links. In some cases, they achieve good PageRank with few links. Pages with PageRank 6 had a median in-degree of 1168 for companies and 44 for spam pages.]

5.3.2 For spam pages

One claimed benefit of PageRank over in-degree is that it is less susceptible to link spam [103]. To test this claim the in-degree and PageRank scores for 280 spam pages were compared. The relationship is plotted in Figure 5.6.

If PageRank were spam-resistant one might expect high in-degree spam pages to have low PageRank. Such a case would be placed in the top left quadrant of the scatter plots. However, for the 280 spam pages the effect is minimal, and in some cases the opposite. For example, the median in-degree values for a PageRank score of 6 were 1168 for company home pages and 44 for spam pages. Spam pages tended to achieve a PageRank of 6 seemingly with fewer incoming links than legitimate companies.

It is possible that any pages which did fall in the top left quadrant had already been excluded from Google. However, this still shows that Google cannot rely entirely on PageRank for eliminating spam. This is not surprising when considering the extreme case: a legitimate page such as an academic's home page might have an in-degree of 10, while a search engine optimiser has massive resources to generate link spam from thousands or millions of pages.
5.4 Discussion

5.4.1 Home page bias

The analysis showed that home pages tended to have higher PageRank. Within all evaluated sites the home page usually had the highest or equal highest score. These results lend support to the use of hyperlink recommendation evidence for home page finding tasks. A detailed evaluation of potential gains using hyperlink recommendation measures in home page finding is presented in Chapter 7.

While the home page bias may be useful in web ranking, in the context of the Google Toolbar it could have a potentially confusing effect. For example, from a Toolbar user's point of view it might seem mystifying that the "Apple Computer" home page is rated 10, but its "PowerBook G4 15-inch" page is rated 7. Is the Toolbar implying that the product is less important or of lower quality? Is it useful to give such advice about deeper pages in general? In fact, it may be preferable to display a constant indicator in the Toolbar when navigating within a web site. An investigation of whether WWW users understand hyperlink-recommendation scores reported by the Google Toolbar remains for future work.

5.4.2 Other systematic biases

The experimental results for company home pages show that Toolbar PageRank favours, by an average of two PageRank points:

1. Companies with famous brands (by Business Week Top Brands)

2. Companies considered to be prepared for the new economy (by Wired 40 listing)

Furthermore, PageRank scores are an average of one point higher for:

1. Companies with large revenue (by Fortune 500 membership)

2. Admired companies (by Fortune Most Admired membership)

3. Technology-oriented companies (by industry type)

Similar patterns were observed for in-degree (with correspondingly larger gaps in in-degree values).

The bias towards high-revenue, admired and famous companies can be seen to be consistent with the stated goal of hyperlink recommendation algorithms. The fact that hyperlink measures more strongly recommend sites operated by companies with highly recognised brands suggests that recognition is a key factor. This is intuitively obvious, as a web site can only be linked to by authors who know of its existence. Favouring high-recognition sites in search results or directory listings helps searchers by bringing to bear their existing knowledge.

A list which gives prominence to relevant web sites already known to the searcher may also inspire confidence in the value of the list. Consider the Google Directory
category for Australian health insurance.7 Viewed alphabetically the top two entries are the relatively little known web sites "Ask Ted" and "Australian Health Management Group". Viewed in PageRank order, the top two entries are the arguably better known (in Australia) "Medibank Private" and "MBF Health Insurance". Even if the user does not agree that these are the best results, in some contexts it may be better to list results which the user will immediately recognise.

An important, but less beneficial, side-effect of using hyperlink-recommendation algorithms is the inherent bias towards technology-oriented companies. There are a number of query terms whose common interpretation may be lost through heavy use of hyperlink-recommendation algorithms.8 For example, using Google there are a number of general queries where technology interpretations are ranked higher than their non-technology interpretations: "opera", "album", "java", "Jakarta", "png", "putty", "blackberry", "orange" and "latex". The strong technology bias may be an artefact of the fact that people building web pages are from a largely technology-oriented demographic. Many web authors are technically-oriented and may primarily think of Jakarta as a Java programming project. On the other hand, many WWW users may predominantly think of Jakarta as the capital of Indonesia! As the demographics of WWW users change, returning an obscure technology-related result will become less desirable. This effect highlights the need for recommendation methods which more closely match user expectations. Such methods, which might take into account individual differences, or simply estimate the demographics of typical WWW users, remain for future work. Measures other than link recommendation may be better indicators of quality. Such measures may include whether companies are listed on the stock exchange, present in online directories and/or are highly recommended by peer review.9

The precise effect of these biases on navigational search is difficult to quantify. It may be that the observed bias will be more problematic for informational tasks than for navigational tasks.

5.4.3 PageRank or in-degree?

PageRank and in-degree measures performed equally well when identifying home pages and membership of the Fortune 500, Most Admired and Global Brand lists. In cases where the measures did not agree, such as for those listed in Table 5.3, there is no evidence to demonstrate that PageRank was superior to in-degree.

A high level of correlation was observed between Toolbar PageRank and log in-degree scores, even for a collection of spam pages. Given the extra cost involved in computing PageRank, this correlation raises serious questions about the benefit of using

7 Available at: http://directory.google.com/Top/Regional/Oceania/Australia/Business and Economy/Financial Services/Insurance/Health/
8 It is likely that anchor-text is also biased in this way, although it may affect results less as the bias would be narrower, i.e. only for terms that are commonly used in the anchor-text pointing to a particular page.
9 For example, by using scores from a service such as http://www.alexa.com.
PageRank over in-degree. Subsequent chapters investigate this further, examining whether there is anything to be gained by using PageRank or in-degree in navigational search situations.
Chapter 6

Combining query-independent web evidence with query-dependent evidence

Query-independent measures, such as PageRank and in-degree, provide an overall ranking of corpus documents. Such measures need to be combined with some form of query-dependent evidence for query processing, otherwise the same list of documents would be retrieved for every query. There are many ways in which query-independent and query-dependent evidence can be combined, and few combination methods have been evaluated explicitly for this purpose (see Section 2.5). This chapter presents an analysis of three methods for combining query-independent evidence, in the form of WWW PageRanks, with query-dependent baselines.

6.1 Method

This chapter examines a home page finding task where, given the name of a public company, the ranking algorithm has to retrieve that company's home page from a corpus containing the home pages of publicly listed US companies.

The query and document set used in this experiment were sourced from company data used throughout experiments in the previous chapter. The document corpus consisted of the downloaded full-text content of each company's home page, and the anchor-text of links directed to those home pages. The query set consisted of the official names of all companies. The query and document set were used to build three query-dependent baselines: a full-text-only baseline, an aggregate anchor-text-only baseline, and a baseline using both forms of evidence. The PageRank scores for these pages were extracted from Google. Three methods for combining PageRank and query-dependent evidence were examined: the first used PageRank as a minimum score threshold, and the second and third methods used PageRank to re-rank the query-dependent baseline rankings.
The following sections outline the query and document set, the scoring methods used to generate the query-dependent baselines, how hyperlink recommendation evidence was gathered, and methods for combining query-dependent baselines with query-independent web evidence.

6.1.1 Query and document set

The document corpus consisted of the home pages of the publicly listed companies used in experiments in Chapter 5. The corpus consisted of 5370 home page documents – one for each company on a prominent US stock exchange (NYSE, NASDAQ and AMEX) for which a home page URL was found (see Section 5.1.1).

As little useful anchor-text information was contained in the set of downloaded documents (because companies rarely link to their competitors' home pages), the anchor-text evidence was gathered from the Google WWW search engine [93]. This WWW-based anchor-text evidence was sourced for a 1000 page sample selected at random from the set of company home pages. For each of these pages 100 back-links1 were retrieved using Google's "link:" operator (as described in Appendix C). Each back-link identified by Google was parsed and anchor-text snippets whose target was the company home page were added to the aggregate anchor-text for that page.

The query set consisted of the official names for all 5370 companies, and the correct result was the named company's home page. For example, for the query "MICROSOFT CORP" the correct answer was the document downloaded from http://www.microsoft.com.

The retrieval effectiveness for both the anchor-text and full-text baselines is likely to be higher than would be expected for a complete document corpus. In the full-text baseline the inclusion of only the home pages of candidate companies discounts many pages that may also match company naming queries. In particular, in a more complete document corpus, non-homepage documents on a company's website might achieve higher match scores than that company's home page (such as a company contact information page). The anchor-text baseline is also likely to achieve unrealistically high retrieval effectiveness even given the incomplete aggregate anchor-text evidence examined (only 100 snippets of anchor-text are retrieved per home page). This is because the aggregate anchor-text corpus only contains text that is used to link to one of the evaluated companies, and so will be unlikely to contain much misleading or ill-targeted anchor-text.

6.1.2 Query-dependent baselines

Three query-dependent baselines were evaluated: content, anchor-text and content+anchor-text.

• The content baseline was built by scoring the full-text of the downloaded home pages using Okapi BM25 with untrained parameters (k1 = 2 and b = 0.75) [172] (described in Section 2.3.1.3).

1 A back-link is a document that has a hyperlink directed to the page under consideration.
• The anchor-text baseline was built by scoring aggregate anchor-text documents using Okapi BM25 with the same parameters as used for content (described in Section 2.4.1).

• The content+anchor-text baseline was built by scoring document full-text and aggregate anchor-text concurrently using Field-weighted Okapi BM25 [173] (described in Section 2.5.2.1). The field-weights for document full-text (content) and aggregate anchor-text were set to 1, and k1 and b were set to the same values used in the content and anchor-text baselines [173]. The content+anchor baseline was computed for the set of pages for which anchor-text was retrieved.

6.1.3 Extracting PageRank

Google's PageRank scores were extracted from the Google Microsoft Internet Explorer Toolbar using the method described in Section 5.1.3. These scores were calculated by Google [93] for a 3.3 billion page crawl.2

6.1.4 Combining query-dependent baselines with query-independent web evidence

Many different schemes have been proposed for combining query-independent and query-dependent evidence. Kraaij et al. [135] suggest measuring the query-independent evidence as the probability of document relevance and treating it as a prior in a language model (see Section 2.5.2.2). However, because Okapi BM25 scores are weights rather than probabilities, prior document relevance cannot be directly incorporated into the model. Westerveld et al. [212] also make use of linear combinations of normalised scores, but for this to be useful with PageRank, a non-linear transformation of the scores would almost certainly be needed:3 the distribution of Google's PageRanks is unknown, and those provided via the Toolbar have been observed not to follow a power law (see Section 5.1.3). Savoy and Rasolofo [178] combine query-independent URL length evidence with Okapi BM25 scores by re-ranking the top n documents on the basis of the URL scores (described in Section 2.5.1.2). The benefit of this type of combination is that it does not require knowledge of the underlying data distribution.

The three combination methods examined in this experiment are: retrieving only those documents that exceed a PageRank threshold (see Section 2.5.1.5), using PageRank in a rank-based (quota) re-ranking of query-dependent baselines, and using PageRank in a score-sensitive re-ranking of query-dependent baselines. The re-ranking approaches are variations on those proposed by Savoy and Rasolofo, and are used because they do not require any knowledge of the global distribution of Google's PageRank values [178].

2 The collection size was estimated to be 3.3 billion on http://www.google.com at September 2003; as of November 2004 it is estimated to be around 8 billion documents on http://www.google.com.
3 This is because while most PageRanks are very low a few are orders of magnitude larger, as PageRank values are believed to follow a power law distribution (see Section 2.4).
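The baselines in Section 6.1.2 are scored with Okapi BM25 and a field-weighted variant. The sketch below is a minimal illustration of that style of scoring, not the exact formulation or parameter handling used in the thesis; the idf form, tokenisation and example documents are illustrative assumptions.

    # Minimal sketch of Okapi BM25 and a field-weighted (BM25F-style) variant of
    # the kind used for the content, anchor-text and content+anchor-text baselines.
    import math
    from collections import Counter

    K1, B = 2.0, 0.75          # the untrained parameters quoted in Section 6.1.2

    def idf(term, corpus):
        n = sum(1 for doc in corpus if term in doc)
        return math.log((len(corpus) - n + 0.5) / (n + 0.5) + 1.0)

    def bm25(query, doc, corpus, avgdl):
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if tf[term]:
                denom = tf[term] + K1 * (1 - B + B * len(doc) / avgdl)
                score += idf(term, corpus) * tf[term] * (K1 + 1) / denom
        return score

    def bm25f(query, fields, corpus, weights, avg_lens):
        """fields: {'content': tokens, 'anchor': tokens}; weighted, length-normalised
        term frequencies are summed across fields before saturation."""
        score = 0.0
        for term in query:
            pseudo_tf = 0.0
            for name, tokens in fields.items():
                tf = tokens.count(term)
                if tf:
                    pseudo_tf += weights[name] * tf / (1 - B + B * len(tokens) / avg_lens[name])
            if pseudo_tf:
                score += idf(term, corpus) * pseudo_tf * (K1 + 1) / (pseudo_tf + K1)
        return score

    if __name__ == "__main__":
        corpus = [["microsoft", "corporation", "home"], ["apple", "computer", "store"]]
        doc = corpus[0]
        fields = {"content": doc, "anchor": ["microsoft", "microsoft", "corp"]}
        print(bm25(["microsoft", "corp"], doc, corpus, avgdl=3))
        print(bm25f(["microsoft", "corp"], fields, corpus,
                    weights={"content": 1.0, "anchor": 1.0},
                    avg_lens={"content": 3.0, "anchor": 3.0}))

In the field-weighted variant the field frequencies are combined before saturation, which is what allows a large volume of matching anchor-text to contribute more than a simple sum of per-field scores would.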
The use of a minimum PageRank threshold that pages need to exceed prior to inclusion is equivalent to ranking results by PageRank evidence and then re-ranking above a score-based threshold using query-dependent evidence. The use of a static4 minimum query-independent threshold value means that some pages will never be retrieved, and so could be removed from the corpus. To enable the retrieval of pages that do not exceed the static threshold value, a dynamic threshold function could be used. Such a function could reduce the minimum threshold if some condition is not met (for example if fewer than ten pages are matched). Such a scheme is discussed further in Section 10.3.

The re-ranking experiments explore two important scenarios. In the first, PageRank plays a large role in ranking documents, through a quota-based combination. In the quota-based combination all documents retrieved within the top n ranks in the query-dependent baseline are re-ranked by PageRank. In the second scenario PageRank has a smaller contribution and is used to re-order documents that achieve query-dependent scores within n% of the highest baseline score (per query). This is termed a score-based combination. In both cases, if the re-ranking cutoffs are sufficiently large, then all baseline documents will be re-ranked by PageRank order.

6.2 Results

This section reports the effectiveness of the baselines and the three evaluated combination methods.

6.2.1 Baseline performance

The effectiveness of the three baselines varied considerably:

• The content baseline retrieved the named home page at the first rank for only two out of five queries, and within the first ten results for a little over half the queries (S@1 = 0.42, S@10 = 0.55).

• The anchor baseline performed well, retrieving three out of four companies at the first rank (S@1 = 0.725, S@10 = 0.79).

• The content+anchor baseline also performed well, retrieving three out of four companies at the first rank (S@1 = 0.729, S@10 = 0.82).

The performance of the full-text (content) baseline was poor given the small size of the corpus from which the home pages were retrieved. A small benefit was observed when adding full-text evidence to the anchor-text baseline.

4 A threshold value that does not change between queries.
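Before the detailed results, a minimal sketch of the three combination schemes of Section 6.1.4 may be helpful: a static PageRank threshold, quota-based re-ranking of the top n baseline ranks, and score-based re-ranking of documents within n% of the top baseline score. The data structures and example values are illustrative assumptions, not the code used for the experiments.

    # Minimal sketch of the three combination methods in Section 6.1.4.
    # 'results' is assumed to be a baseline ranking: a list of (doc_id, bm25_score)
    # pairs sorted by descending query-dependent score; 'pagerank' maps doc_id to
    # its query-independent score.

    def apply_threshold(results, pagerank, min_pr=1):
        """Method 1: drop documents that do not exceed a static PageRank threshold."""
        return [(d, s) for d, s in results if pagerank.get(d, 0) > min_pr]

    def quota_rerank(results, pagerank, n=10):
        """Method 2: re-rank the top n baseline ranks purely by PageRank."""
        head = sorted(results[:n], key=lambda ds: pagerank.get(ds[0], 0), reverse=True)
        return head + results[n:]

    def score_rerank(results, pagerank, pct=10.0):
        """Method 3: re-rank by PageRank only those documents whose baseline score
        is within pct% of the top query-dependent score for this query."""
        if not results:
            return results
        cutoff = results[0][1] * (1 - pct / 100.0)
        head = [ds for ds in results if ds[1] >= cutoff]
        tail = results[len(head):]
        head = sorted(head, key=lambda ds: pagerank.get(ds[0], 0), reverse=True)
        return head + tail

    if __name__ == "__main__":
        baseline = [("lycos.com", 98.0), ("tripod.com", 60.0), ("lycos.co.uk", 55.0)]
        pr = {"lycos.com": 8, "tripod.com": 9, "lycos.co.uk": 6}
        print(quota_rerank(baseline, pr, n=2))     # PageRank shuffles the top two
        print(score_rerank(baseline, pr, pct=10))  # nothing within 10% of top score; unchanged

The usage example anticipates the "Lycos" case discussed with Figure 6.4: a quota of two reverses a top result that the query-dependent evidence strongly preferred, whereas the score-based cutoff leaves it alone.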
6.2.2 Using a threshold

[Figure 6.1: The percentage of home pages and other pages that exceed each PageRank value (x-axis: PageRank 0–10; y-axis: percentage of pages that exceed the PageRank value). Implementing a PageRank threshold minimum value of 1 would lead to the inclusion of 99.7% of the home pages, while reducing the number of other pages retrieved by 16.1%.]

Figure 6.1 illustrates the percentage of home pages and non-home pages5 that exceed each PageRank value. Implementing a PageRank threshold value of 1 leads to the inclusion of 99.7% of the home pages, while significantly reducing the number of other pages retrieved (by 16.1%, to 83.9% of pages). The non-home page PageRanks examined here may be somewhat inflated relative to those on the general WWW, as they were retrieved using a (breadth-first) crawl halted after 100 pages. It has been reported that WWW in-links are distributed according to a power law [35]. Thus, assuming the distribution of PageRank is similar to that of in-degree,6 setting a threshold at some small PageRank is likely to eliminate many pages from ranking consideration. In a home page finding system this may provide substantial computational performance gains and little (if any) degradation in home page finding effectiveness.

6.2.3 Re-ranking using PageRank

Results for the quota-based combination are presented in Figure 6.2. Re-ranking by quota severely degrades performance, with a re-ranking of the top two results in the full-text (content) baseline decreasing the percentage of home pages retrieved at the first rank from 42% to 29%.

5 The hyperlink recommendation values extracted for the set of "non-home page" documents, described in Section 5.1.1.
6 The distribution of Google's PageRanks for company home pages was observed not to follow a power-law distribution (Figure 5.1), although the Google PageRanks are likely to have been normalised and transformed for use in the Toolbar. The PageRanks calculated for use in experiments in Chapter 7 do exhibit a power law distribution (see Section 7.6.1).
[Figure 6.2: Quota-based re-ranking (x-axis: re-rank top n results by PageRank; y-axis: Mean Reciprocal Rank (MRR); curves: anchor-text, content, content+anchor). Re-ranking the top n documents in the query-dependent baselines by PageRank. Re-ranking by quota severely degrades performance, with a re-ranking of the top 2 results in the full-text baseline decreasing the percentage of home pages retrieved at the first position from 42% to 29%. Note that the re-ranking of all results by PageRank (at 50) is equivalent to ranking query-matched documents by PageRank.]

[Figure 6.3: Score-based re-ranking (x-axis: re-rank by PageRank URLs that score in the top n%; y-axis: Mean Reciprocal Rank (MRR); curves: anchor-text, content, content+anchor). Re-ranking documents that are within n% of the top query-dependent baseline score. Re-ranking using score produces a much slower decline in performance than re-ranking based on rank only (Figure 6.2). Note that the re-ranking of all results by PageRank (at 100% of score) is equivalent to ranking query-matched documents by PageRank.]
[Figure 6.4: Example of two queries using different re-ranking techniques (x-axis: rank by Okapi BM25 content score, ranks 1–10; y-axis: normalised Okapi BM25 content score; curves: Yahoo, Lycos). For the query "Lycos" the correct answer is located at position one of the full-text (content) baseline. Given that the second match scores far less than the first, a shuffling of the first two results would favour the document with a much smaller query-dependent score. For the second query "Yahoo" the correct answer is located at position two and achieves a comparable score to the first result: in this case a shuffle would improve retrieval effectiveness.]

Results for the score-based combination are presented in Figure 6.3. Compared to the quota-based combination, re-ranking using score produces a much slower decline in performance.

An example illustrating the comparative effectiveness of quota-based and score-based combinations for two queries is presented in Figure 6.4. For the query "Lycos" the correct answer is located at position one of the full-text (content) baseline. The second document in the baseline scores far less than the first. Using a quota-based re-ranking with a cutoff of two, the first two results would be reversed. By comparison, using score-based re-ranking, the cutoff would have to be set to 35% (or larger) of the top score for a reversal by re-ranking. For the second query "Yahoo" the correct answer is located at position two and achieves a comparable score to the first result. In this case re-ranking by PageRank using either a quota- or score-based re-ranking with n = 2 would reverse the ranking (in this case improving retrieval effectiveness).

6.3 Discussion

Results from experiments in this chapter support the use of PageRank and other hyperlink recommendation evidence as a minimum threshold for document retrieval or in a score-based re-ranking of query-dependent evidence. The use of a minimum PageRank threshold in a home page finding task may improve computational performance by eliminating non-home pages from the ranking. Another method by
which computational efficiency could be improved is by ranking documents using aggregate anchor-text evidence only (which also implicitly imposes a threshold of in-degree ≥ 1). An anchor-text index would be much smaller than a full-text index and therefore likely to be more efficient.

Quota-based re-ranking was observed to be inferior to score-based re-ranking. This illustrates the negative effects of not considering relative query-dependent scores when combining baselines with query-independent evidence. Further, this suggests that query-independent evidence should not be relied upon to identify the most relevant documents. Pages that achieve high query-independent scores are likely to be important pages in the corpus (such as the home pages of popular, large or technology-oriented companies, as reported in Chapter 5), but may not necessarily be more relevant (and indeed, in this experiment, might be the "wrong" home pages).

The results from experiments in this chapter also reinforce the previously observed importance of aggregate anchor-text for effective home page finding [56]. The correct home page was retrieved at the first rank in the anchor-text baseline for three out of four queries, compared to being retrieved at the first rank for only two out of five queries in the full-text baseline. While the baseline retrieval effectiveness in this experiment may be unrealistically high, these findings show that there is generally adequate anchor-text evidence, even when using only 100 snippets, to find the home pages of publicly listed companies. Combining the full-text and aggregate anchor-text evidence in a field-weighted combination resulted in a slight improvement in home page finding effectiveness.

The next chapter investigates whether query-independent evidence can be used to improve home page finding effectiveness for small-to-medium web corpora. The experiments include evaluations of the effectiveness of minimum query-independent evidence thresholds, score-based re-ranking of query-dependent baselines by query-independent evidence, and of aggregate anchor-text-only indexes.
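Ranking from an aggregate anchor-text-only index, as suggested above, relies on building one aggregate document per target URL from the anchor-text of its in-links. The following is a minimal sketch of that aggregation using Python's standard-library HTML parser; the example source pages are illustrative assumptions, whereas the thesis sourced back-links via Google's "link:" operator.

    # Minimal sketch of building aggregate anchor-text documents: for each source
    # page, collect the text of <a> elements and append each snippet to the
    # aggregate document of the URL it points at.
    from collections import defaultdict
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class AnchorCollector(HTMLParser):
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.current_target = None
            self.anchors = []                 # list of (target_url, anchor_text)

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.current_target = urljoin(self.base_url, href)

        def handle_data(self, data):
            if self.current_target and data.strip():
                self.anchors.append((self.current_target, data.strip()))

        def handle_endtag(self, tag):
            if tag == "a":
                self.current_target = None

    def aggregate_anchor_text(source_pages):
        """source_pages: iterable of (source_url, html). Returns target -> anchor text."""
        aggregate = defaultdict(list)
        for source_url, html in source_pages:
            parser = AnchorCollector(source_url)
            parser.feed(html)
            for target, text in parser.anchors:
                aggregate[target.rstrip("/")].append(text)
        return {target: " ".join(snippets) for target, snippets in aggregate.items()}

    if __name__ == "__main__":
        pages = [("http://blog.example/links.html",
                  '<p>Buy it at <a href="http://www.amazon.com/">Amazon books</a>.</p>')]
        print(aggregate_anchor_text(pages))

Only the aggregate documents need to be indexed, which is why such an index is much smaller than a full-text index: pages with no in-links contribute nothing, giving the implicit in-degree ≥ 1 threshold noted above.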
Chapter 7

Home page finding using query-independent web evidence

Providing effective home page search is important for both web and WWW search systems (see Section 2.6.2.1). The empirical results reported in Chapter 5 showed hyperlink recommendation evidence to be biased towards home pages. This chapter presents a series of detailed experiments to determine whether this bias can be exploited to improve home page finding performance on small-to-medium sized web corpora. Experiments in this chapter evaluate the effectiveness of hyperlink recommendation evidence and URL length for document full-text and anchor-text baselines on three such corpora. The potential contribution of query-independent evidence to home page finding is evaluated in three ways:

• By measuring the potential for query-independent evidence to exclude non-home pages, through the use of minimum query-independent threshold scores that documents must achieve for retrieval (following from experiments in Chapter 6). The use of thresholds is investigated as a measure by which both the retrieval effectiveness and efficiency of a home page finding system could be improved;

• By gauging the maximum improvements offered through query-independent evidence when combined with query-dependent baselines using some linear combination of scores; and

• By empirically investigating a combination method that could be used to incorporate query-independent evidence in a production web search system, namely a score-based re-ranking of query-dependent baselines by query-independent evidence (following from experiments in Chapter 6).

7.1 Method

The initial step in this experiment was to identify the set of candidate test corpora. The corpora were then crawled (if required) and indexed. Four types of query-independent evidence (in-degree, two PageRank variants and URL-type, described below) were computed during indexing.
Following indexing, the top 1000 documents for each query-dependent baseline were retrieved. Three query-dependent baselines were studied: one based solely on document full-text, one based solely on document aggregate anchor-text, and one consisting of both forms of evidence. The baselines were then combined with query-independent scores using three combination methods. The first method used query-independent evidence as a threshold, such that documents that did not exceed the threshold were not retrieved (shown to be a promising approach in Chapter 6). The second method explored the optimal improvement that could be gained when combining query-independent evidence with query-dependent baselines using a linear combination of scores. The final combination method was a score-based re-ranking of query-dependent baselines by query-independent evidence (also shown to be a promising approach in Chapter 6). The improvements in effectiveness achieved through these combination methods were then measured and compared.

Throughout the experiments the Wilcoxon matched-pairs signed ranks test was performed to determine whether the improvements afforded were significant. This test compares the algorithms according to the (best) ranks achieved by correct answers, rather than the success rate measure. A confidence criterion of 95% (α = 0.05) is used. Success rates (described in Section 2.6.6.2) were used to evaluate retrieval effectiveness. The success rate measure is indicated by S@n where n is the cutoff rank. S@n results were computed for n = 1, 5, 10.

The following sections give a description of the query-independent evidence and query-dependent baselines, outline the test collections used in the experiments and their salient properties, and discuss the methods used to combine query-independent and query-dependent evidence.

7.1.1 Query-independent evidence

Four types of query-independent evidence were considered:

IDG the document's in-degree score (described in Section 2.4.3.1);

DPR the document's Democratic PageRank score (described in Section 3.3.2);

APR the document's Aristocratic PageRank score, using bookmarks from the Yahoo! directory [217] or other web directory listings, which might be available to a production search system (described in Section 3.3.2);

URL the document's URL-type score, through a re-ranking by the UTwente/TNO URL-type [135] (described in Section 2.3.3.2). The URL-types were scored according to Root > Subroot > Directory > File (a sketch of this URL-type classification is given after Table 7.1).

7.1.2 Query-dependent baselines

The relative improvements achieved over three query-dependent baselines were examined. The baselines were:
    §7.1 Method 103 •content baselines built by scoring document full-text using Okapi BM25 with default parameters (k1 = 2 and b = 0.75) (see Section 2.3.1.3) [172]. • anchor-text baselines built using the methods outlined previously (i.e. by record- ing all anchor-text pointing to each document and building a new aggregate document containing all source anchor-text). The aggregate anchor-text docu- ments were scored using Okapi BM25 using the same parameters as content. • content+anchor-text baselines built by using Field-weighted Okapi BM25 [173] to build and score composite documents containing document full-text and ag- gregate anchor-text evidence. The baseline was scored with document full-text and anchor-text field-weights set to 1, and k1 and b as above (see Section 2.5.2.1) [173]. 7.1.3 Test collections Effectiveness improvements were evaluated using five test collections that spanned three small-to-medium sized web corpora. The test corpora used in the evaluation in- cluded a 2001 crawl of a university web (the ANU), and the TREC corpora VLC2 [106] and WT10g [15]. Detailed collection information is reported in Table 7.1 and a further discussion of the TREC collection properties appears in Section 2.6.7. Note that since experiments published in Upstill et al. [201] the link tables have been re-visited and further duplicates and equivalences removed (using methods described in Chapter 3). This has resulted in some non-statistically significant changes in retrieval effective- ness. Test Pages Links Dead Content Anchor No. of Book- Collection Size (million) (million) links queries queries marks (APR) ANU 4.7GB 0.40 6.92 0.646 97/100 99/100 439 WT10gC 10GB 1.69 8.06 0.306 93/100 84/100 25 487 WT10gT 10GB 1.69 8.06 0.306 136/145 119/145 25 487 VLC2P 100GB 18.57 96.37 3.343 95/100 93/100 77 150 VLC2R 100GB 18.57 96.37 3.343 88/100 77/100 77 150 Table 7.1: Test collection information. The experiments were performed for five test collec- tions spanning three small-to-medium sized web corpora. Two sets of queries were submitted over the VLC2 collection - a popular set (VLC2P) and a random set (VLC2R) (see text for expla- nation). The two sets computed for WT10g were the set used by Craswell et al. [56] (WT10gC) and the official queries used in the TREC 2001 home page finding task (WT10gT). The values in the “Content” and “Anchor” queries columns show the number of home pages found by the baseline out of the number of queries submitted (this is equivalent to S@1000, as the top 1000 results for each search are considered).
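To make the content+anchor-text baseline concrete, the sketch below shows one common formulation of field-weighted BM25: per-field term frequencies are length-normalised, combined using the field weights, and then passed through the usual BM25 saturation and idf. It is illustrative only and is not claimed to reproduce the exact Okapi variant of [173] used in these experiments; the structure names (doc_fields, stats, weights) are hypothetical.

    import math

    def bm25f_score(query_terms, doc_fields, stats, weights, k1=2.0, b=0.75):
        """Simplified field-weighted BM25 for a single document.

        doc_fields : {"content": {term: tf}, "anchor": {term: tf}}
        stats      : {"N": corpus size,
                      "df": {term: document frequency},
                      "avg_len": {"content": ..., "anchor": ...},
                      "len": {"content": ..., "anchor": ...}}   # this document's field lengths
        weights    : field weights, e.g. {"content": 1.0, "anchor": 1.0}
        """
        score = 0.0
        for term in query_terms:
            # Weighted, length-normalised pseudo-frequency across fields.
            pseudo_tf = 0.0
            for field, w in weights.items():
                tf = doc_fields.get(field, {}).get(term, 0)
                if tf == 0:
                    continue
                norm = (1.0 - b) + b * stats["len"][field] / stats["avg_len"][field]
                pseudo_tf += w * tf / norm
            if pseudo_tf == 0.0:
                continue
            df = stats["df"].get(term, 0)
            idf = math.log((stats["N"] - df + 0.5) / (df + 0.5) + 1.0)
            score += idf * pseudo_tf / (k1 + pseudo_tf)
        return score

With both field weights set to 1 and k1 = 2, b = 0.75, this matches the parameter settings described above; setting the anchor weight to zero recovers a content-only BM25 scorer of the same general form.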
    104 Home pagefinding using query-independent web evidence Although there are many spam pages on the WWW, little spam was found in the three corpora. Any spam-like effect observed seemed unintentional. For example, the pages of a large bibliographic database all linked to the same page, thereby artificially inflating its in-degree and PageRank. In each run, sets of 100 or more queries were processed over the applicable corpus using the chosen baseline algorithm. The first 1000 results for each were recorded. While all queries have only one correct answer, that answer may have multiple cor- rect URLs, e.g. a host with two aliases. If multiple correct URLs were retrieved the minimum baseline rank was used (i.e. earliest in the ranked list of documents) and had assigned to it the best query-independent score of all the equivalent URLs. This approach introduces a slight bias in favour of the re-ranking algorithms, ensuring that any beneficial effect will be detected. If a correct document did not appear in the top 1000 positions a rank of 1001 was assigned. These experiments investigated two home page finding scenarios: queries for pop- ular and random home pages.1 Popular queries allow the study of which forms of evidence achieve high effectiveness when ranking for queries targeting high profile sites. Random queries allow the study of effective ranking for any home page, even if it is not well known. The ANU web includes a number of official directories of internal sites. These site directories can be used as PageRank bookmarks. This allows for the evaluation of APR in a single-organisation environment. Test home pages were picked randomly from these directories and then queries were generated manually by navigating to a home page, and formulating a query based on the home page name2. Consequently APR might be expected to perform well on this collection. The query set labelled WT10gC [56] was created by randomly selecting pages within the WT10g corpus, navigating to the corresponding home page, and formu- lating a query based on the home page’s name. The WT10gC set was used as training data in the TREC-2001 web track. The query set labelled WT10gT was developed by the NIST assessors for the TREC-2001 web track using the same method. Wester- veld et al. [212] have previously found that the URL-type method improved retrieval performance on the WT10gT collection. Using the method outlined in Section 3.3.2, every Yahoo-listed page in the WT10g collection is bookmarked in the APR calcula- tion. These are lower quality bookmarks than the ANU set as the bookmarks played no part in the selection of either query set. Two sets of queries were evaluated over the VLC2 collection, popular (VLC2P) and random (VLC2R). The popular series was derived from the Yahoo! directory. The ran- dom series was selected using the method described above for WT10g. For the APR calculation every Yahoo-listed page in the collection was bookmarked. As such, the bookmarks were well matched to the VLC2P queries (also from Yahoo!), but less so for the VLC2R set. 1 Note that the labels popular and random were chosen for simplicity and are derived from the method used to choose the target answer, not from the nature of the queries. Information about query volumes is obviously unavailable for the TREC test collections and were not used in the case of ANU. 2 This set was generated by Nick Craswell in 2001.
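The two PageRank variants used here differ only in their teleport ("bookmark") vector, as described in Section 3.3.2. The power-iteration sketch below is a generic personalised PageRank, included purely to illustrate that difference; it is not the implementation used to build the link tables, and names such as out_links and bookmarks are placeholders.

    def pagerank(out_links, teleport, d=0.85, iters=50):
        """Power-iteration PageRank over a dict {page: [linked pages]}.

        A uniform teleport vector gives a 'Democratic' PageRank; concentrating
        the teleport mass on a bookmark set (e.g. directory-listed home pages)
        gives an 'Aristocratic' PageRank.
        """
        pages = list(out_links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iters):
            new = {p: (1.0 - d) * teleport.get(p, 0.0) for p in pages}
            for p, targets in out_links.items():
                if targets:
                    share = d * rank[p] / len(targets)
                    for t in targets:
                        if t in new:          # links pointing outside the graph are ignored
                            new[t] += share
                else:
                    # Dangling pages redistribute their mass via the teleport vector.
                    for t in pages:
                        new[t] += d * rank[p] * teleport.get(t, 0.0)
            rank = new
        return rank

    # Democratic teleport: uniform over all pages.
    # dpr_teleport = {p: 1.0 / len(graph) for p in graph}
    # Aristocratic teleport: uniform over the bookmark set only.
    # apr_teleport = {p: (1.0 / len(bookmarks) if p in bookmarks else 0.0) for p in graph}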
The home page results for the ANU and VLC2P query sets are considered popular because they are derived from directory listings. Directory listings have been chosen by a human editor as important, possibly because they are pages of interest to many people. Such pages also tend to have above-average in-degree; that is, more web page editors have chosen to link to the page, directing web surfers (and search engine crawlers) to it. On all these corpora anchor-text ranking has been shown to improve home page finding effectiveness (relative to full-text-only) [15, 56].

7.1.4 Combining query-dependent baselines with query-independent evidence

Throughout these experiments there is a risk that a poor choice of combining function could lead to a spurious conclusion. The combination-of-evidence experiments in the previous chapter outlined two methods for combining query-independent and query-dependent evidence which may be effective: the use of minimum threshold values and score-based re-ranking. This chapter includes a further combination scheme: an Optimal re-ranking.

The Optimal re-ranking is an unrealistic re-ranking, and is termed "Optimal" to distinguish it from a re-ranking that could be used in a production web search system.3 The Optimal combination experiments gauge the maximum possible improvement available when combining query-independent evidence with query-dependent evidence using a linear combination. This is done by locating the right answer in the baseline (obviously not possible in a practical system) and re-ranking it, together with the documents above it, on the basis of the query-independent score alone (as illustrated in Figure 7.1). This is an unrealistic combination: if this information were known in practice, perfection could easily be achieved by swapping the document at that position with the document at rank one. Indeed, no linear combination or product of query-independent and query-dependent scores (assuming positive coefficients) could improve upon the Optimal combination, because every document ranked above the correct answer scores as well as or better than it on both the query-independent and query-dependent components (see Figure 7.1). In the Optimal experiments a control condition, random, was introduced in which the correct document and all those above it were arbitrarily shuffled. Throughout the re-ranking experiments, if two query-independent scores are equal then the original baseline ordering is preserved.

The following sections report and discuss the results for each combination method. The use of minimum query-independent evidence thresholds is investigated first, followed by re-ranking using the (unrealistic) Optimal combination, and finally re-ranking using the (realistic) score-based re-ranking.

3 The Optimal re-ranking relies on knowledge of the correct answer within the baseline ranking.
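The Optimal combination and its random control can be stated compactly. The sketch below is a hypothetical illustration (the procedure, not the code, comes from the text): each query's baseline is assumed to be a list of document identifiers in rank order, with query-independent scores held in a dictionary.

    import random

    def optimal_rerank(ranked_docs, qi_score, correct_id):
        """Re-rank the correct answer and everything above it by the
        query-independent score alone (the unrealistic 'Optimal' combination)."""
        if correct_id not in ranked_docs:
            return ranked_docs                      # answer not retrieved: nothing to do
        cut = ranked_docs.index(correct_id) + 1     # correct answer plus all docs above it
        head, tail = ranked_docs[:cut], ranked_docs[cut:]
        # A stable sort preserves the baseline order when scores are tied.
        head = sorted(head, key=lambda d: qi_score.get(d, 0.0), reverse=True)
        return head + tail

    def random_control(ranked_docs, correct_id, seed=0):
        """Control condition: arbitrarily shuffle the correct answer and the
        documents above it, leaving the rest of the ranking untouched."""
        if correct_id not in ranked_docs:
            return ranked_docs
        cut = ranked_docs.index(correct_id) + 1
        head, tail = ranked_docs[:cut], ranked_docs[cut:]
        random.Random(seed).shuffle(head)
        return head + tail

The stable sort preserves the baseline ordering when query-independent scores are tied, matching the tie-handling rule stated above.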
[Figure 7.1 shows a baseline ranking of eight documents alongside the top six of those documents resorted by PageRank.]

Figure 7.1: Example of Optimal re-ranking and calculation of the random control success rate. In the baseline, the correct answer is document 6 at rank 6. Re-ranking by PageRank puts it at rank 2. This is optimal because any document ranked more highly must score as well or better on both the baseline and PageRank (i.e. "document 2" scored better on both the baseline and PageRank). In this case, S@5 fails on the baseline and succeeds on re-ranking. However, a random resorting of the top 6 would have succeeded in 5 of 6 cases, so the expected S@5 for the random control is 5/6.

7.2 Minimum threshold experiments

These experiments investigate whether the use of a static minimum threshold requirement for page inclusion can improve retrieval effectiveness and system efficiency. Retrieval effectiveness may be improved through the removal of unimportant documents from the corpus. Additionally, retrieval efficiency may be improved by reducing the number of documents requiring ranking when processing a query.

The evaluation of threshold techniques requires a set of candidate cutoff values. Up to nine cutoffs were generated for each form of evidence, and an attempt was made to pick intervals that would cut the corpus into 10% gaps. Such cutoffs were possible for DPR evidence because its scores spanned many values. Even spacing was not possible for in-degree or URL-type evidence because early cutoffs eliminated many of the pages from consideration. For example, picking an in-degree minimum of 2 removed up to 60% of the ANU corpus, and discounting URL-type "File" URLs removed over 95% of the ANU collection.

An evaluation of the use of minimum thresholds was performed for three of the five test collections, namely ANU, WT10gC and WT10gT.4
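The threshold mechanism itself is straightforward to express. The sketch below is illustrative rather than a description of the Panoptic implementation used for these runs; it assumes per-document query-independent scores are available in a dictionary and shows how a candidate cutoff both shrinks the candidate document set (the "Prop." column in the tables that follow) and filters a query-dependent ranking.

    def apply_min_threshold(corpus_scores, cutoff):
        """Return the set of documents whose query-independent score meets the
        cutoff, plus the proportion of the corpus retained."""
        kept = {doc for doc, score in corpus_scores.items() if score >= cutoff}
        return kept, len(kept) / len(corpus_scores)

    def filter_ranking(ranked_docs, kept):
        """Remove below-threshold documents from a query-dependent ranking;
        the remaining documents keep their baseline order."""
        return [doc for doc in ranked_docs if doc in kept]

    # Example (hypothetical data structures): an in-degree cutoff of 2.
    # indegree = {...}                       # doc id -> in-degree, computed at indexing time
    # kept, prop = apply_min_threshold(indegree, 2)
    # filtered = filter_ranking(baseline_run["some query"], kept)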
    §7.2 Minimum thresholdexperiments 107 Content Anchor Both Type Cut Prop. S@1 S@5 S@10 S. S@1 S@5 S@10 S. S@1 S@5 S@10 S. BASE 100% 0.29 0.50 0.58 0.72 0.96 0.97 0.63 0.81 0.86 IDG 2 51% 0.34 0.57 0.66 *+ 0.72 0.96 0.97 = 0.63 0.81 0.86 *+ IDG 3 45% 0.36 0.58 0.68 *+ 0.73 0.96 0.97 = 0.64 0.82 0.85 *+ IDG 4 37% 0.38 0.60 0.68 *+ 0.73 0.96 0.97 = 0.66 0.82 0.85 *+ IDG 6 33% 0.39 0.61 0.68 *+ 0.72 0.95 0.96 = 0.65 0.81 0.84 = IDG 8 28% 0.40 0.60 0.70 *+ 0.72 0.95 0.96 = 0.65 0.82 0.86 = IDG 10 8% 0.41 0.64 0.69 *+ 0.70 0.91 0.92 = 0.65 0.81 0.85 = IDG 25 2% 0.33 0.42 0.47 *- 0.49 0.62 0.63 *- 0.44 0.55 0.58 *- IDG 50 1% 0.21 0.30 0.36 *- 0.36 0.42 0.42 *- 0.28 0.38 0.39 *- IDG 100 0.5% 0.11 0.19 0.20 *- 0.20 0.24 0.24 *- 0.17 0.22 0.22 *- DPR 5.02 90% 0.30 0.50 0.59 *+ 0.72 0.97 0.98 = 0.63 0.81 0.86 *+ DPR 5.06 80% 0.30 0.50 0.59 *+ 0.72 0.97 0.98 = 0.64 0.81 0.86 *+ DPR 5.10 70% 0.30 0.50 0.59 *+ 0.72 0.97 0.98 = 0.64 0.81 0.86 *+ DPR 5.22 60% 0.31 0.51 0.62 *+ 0.72 0.97 0.98 = 0.64 0.81 0.87 *+ DPR 5.28 55% 0.31 0.52 0.62 *+ 0.72 0.97 0.98 = 0.64 0.81 0.87 *+ DPR 5.61 40% 0.33 0.54 0.63 *+ 0.72 0.97 0.98 = 0.64 0.82 0.87 *+ DPR 6.15 30% 0.34 0.55 0.65 *+ 0.71 0.95 0.97 = 0.65 0.82 0.87 *+ DPR 8.04 20% 0.36 0.57 0.63 *+ 0.64 0.86 0.88 *- 0.61 0.78 0.81 = DPR 14.9 10% 0.35 0.54 0.60 = 0.62 0.78 0.80 *- 0.58 0.74 0.76 = URL >F 5% 0.48 0.64 0.76 *+ 0.73 0.88 0.88 = 0.64 0.79 0.82 = URL >D 2% 0.33 0.48 0.50 *- 0.47 0.55 0.55 *- 0.41 0.53 0.53 *- URL >SR 0.1% 0.17 0.22 0.23 *- 0.25 0.26 0.26 *- 0.21 0.24 0.24 *- Table 7.2: Using query-independent thresholds on the ANU collection. Bold values indi- cate the highest effectiveness achieved for each type of query-independent evidence on each query-dependent baseline. Underlined bold values indicate the highest effectiveness achieved for each query-dependent baseline. The cutoff value is indicated by “Cut”. The percentage amount of the collection that is included within the cutoff is indicated by “Prop.”. “S.” reports whether observed changes are significantly better (“*+”), equivalent (“=”) or worse (“*-”). The cutoff values for Democratic PageRank values given are of the order ×10−6 . For URL-type cutoffs; >F indicates that URLs are more important than “File” URLs (i.e. either “Directory, “Subroot” or “Root”), >D that URLs are more important than “Directory” (i.e. either “Sub- root” or “Root”), and >SR that URLs are more important than “Subroot” (i.e. “Root”).
    108 Home pagefinding using query-independent web evidence 7.2.1 Results ANU The performance of the ANU collection when using minimum query-independent thresholds is presented in Table 7.2. Observations from these results are: • Removing the bottom 80% of pages according to Democratic PageRank, in-degree or URL-type improves the effectiveness of the content baseline. In the case of URL-type, the improvement is dramatic. • Using the least restrictive URL-type as a minimum threshold (i.e. removing “File” pages) removes around 95% of pages from consideration without a sig- nificant decrease in retrieval effectiveness for any baseline. • Using appropriate in-degree and Democratic PageRank threshold values, around 80% of pages can be removed before observing a significant decrease in retrieval effectiveness for any baseline. • The highest retrieval effectiveness is achieved using an anchor-text baseline with no thresholds, although this is not significantly better than that of anchor-text with the base URL-type threshold. In the ANU collection there was a group of documents with identical Democratic PageRank values of 5.28×10−6. This made it impossible to choose a cutoff of 60% and so a cutoff of 55% was used. The large number of documents that achieved the same PageRank value was found to be caused by a crawler trap on an ANU web server. WT10gC The performance of the WT10gC collection using minimum thresholds is presented in Table 7.3. Observations from these results are: • Excluding pages using the “> File” and “> File or Directory” URL-type thresh- olds provided significant gains on all three baselines while reducing the size of the collection by 97%. Excluding pages using the “> Subroot” URL-type thresh- old resulted in the removal of 99% of pages without significantly affecting the effectiveness of any baseline. • Excluding pages with in-degree < 2 removed 58% of pages from consideration without significantly reducing effectiveness for any baseline (and improved ef- fectiveness for the content baseline). • Excluding pages with a DPR of < 1.73 × 10−6 removed 40% of pages from con- sideration without significantly reducing effectiveness for any baseline. 4 An evaluation of performance on the VLC2P and VLC2R test collections was not possible due to time constraints
    §7.2 Minimum thresholdexperiments 109 Content Anchor Both Type Cut Prop. S@1 S@5 S@10 S. S@1 S@5 S@10 S. S@1 S@5 S@10 S. BASE 100% 0.23 0.45 0.55 0.47 0.69 0.72 0.45 0.71 0.83 IDG 2 42% 0.23 0.47 0.55 *+ 0.45 0.65 0.69 = 0.41 0.66 0.75 = IDG 3 26% 0.23 0.50 0.59 *+ 0.45 0.64 0.67 *- 0.40 0.64 0.72 *- IDG 4 19% 0.23 0.48 0.54 = 0.43 0.62 0.64 *- 0.39 0.60 0.68 *- IDG 6 12% 0.24 0.44 0.53 *- 0.41 0.59 0.60 *- 0.38 0.60 0.62 *- IDG 8 7.5% 0.25 0.45 0.53 *- 0.41 0.56 0.57 *- 0.38 0.59 0.60 *- IDG 10 5% 0.21 0.43 0.45 *- 0.40 0.49 0.50 *- 0.37 0.52 0.52 *- IDG 25 2% 0.20 0.36 0.39 *- 0.34 0.41 0.41 *- 0.31 0.43 0.43 *- IDG 50 1% 0.19 0.28 0.29 *- 0.28 0.30 0.30 *- 0.25 0.31 0.31 *- IDG 100 0.5% 0.15 0.22 0.23 *- 0.22 0.24 0.24 *- 0.21 0.24 0.24 *- DPR 1.33 99% 0.23 0.45 0.55 = 0.47 0.69 0.72 = 0.45 0.71 0.83 = DPR 1.38 80% 0.21 0.42 0.53 = 0.45 0.67 0.61 = 0.44 0.67 0.77 = DPR 1.51 70% 0.20 0.42 0.53 = 0.45 0.66 0.70 = 0.43 0.68 0.77 = DPR 1.73 60% 0.19 0.41 0.52 = 0.45 0.65 0.70 = 0.41 0.67 0.75 = DPR 2.11 50% 0.19 0.39 0.52 = 0.44 0.64 0.68 = 0.39 0.64 0.73 *- DPR 2.72 40% 0.18 0.39 0.51 = 0.42 0.62 0.66 *- 0.37 0.61 0.69 *- DPR 3.77 30% 0.20 0.41 0.46 = 0.41 0.59 0.62 *- 0.35 0.58 0.63 *- DPR 5.45 20% 0.20 0.37 0.44 *- 0.37 0.57 0.59 *- 0.35 0.54 0.58 *- DPR 8.65 10% 0.19 0.38 0.47 *- 0.39 0.55 0.56 *- 0.37 0.54 0.55 *- URL > F 7% 0.56 0.83 0.87 *+ 0.68 0.76 0.78 *+ 0.75 0.93 0.95 *+ URL > D 3% 0.63 0.81 0.87 *+ 0.67 0.73 0.75 *+ 0.76 0.89 0.90 *+ URL > SR 1% 0.65 0.75 0.76 = 0.59 0.65 0.65 = 0.75 0.77 0.77 = Table 7.3: Using query-independent thresholds on the WT10gC collection. Bold values in- dicate the highest effectiveness achieved for each type of query-independent evidence on each query-dependent baseline. Underlined bold values indicate the highest effectiveness achieved for each query-dependent baseline. The cutoff value is indicated by “Cut”. The percentage of the collection that is included within the cutoff is indicated by “Prop.”. “S.” reports whether observed changes are significantly better (“*+”), equivalent (“=”) or worse (“*-”). The specified cutoffs for Democratic PageRank are of the order ×10−6 . For URL-type cutoffs; >F indicates that URLs are more important than “File” URLs (i.e. either “Directory”, “Subroot” or “Root”), >D that URLs are more important than “Directory” (i.e. either “Subroot” or “Root”), and >SR that URLs are more important than “Subroot” (i.e. “Root”).
    110 Home pagefinding using query-independent web evidence • The highest effectiveness is achieved with a content+anchor-text baseline and URL-type “> File” threshold. Using the URL-type threshold gives gains of 7% to 20% over the best baseline score and removes 93% of pages from considera- tion. WT10gT The performance of the WT10gT collection using minimum thresholds is presented in Table 7.4. Observations from these results are: • Excluding documents based on a “> File” URL-type threshold, provides signifi- cant gains on all three baselines while reducing the size of the collection by 93%. Excluding documents using a “> Subroot” URL-type threshold reduces collec- tion size by 99% while only negatively affecting anchor-text retrieval effective- ness. • Excluding documents which achieve in-degree < 2 removes 58% of pages from consideration without significantly reducing effectiveness for any baseline. • Excluding documents which achieve a DPR in the top 90% of values resulted in a significant decrease in effectiveness for the anchor-text baseline. • The highest effectiveness is achieved with a content+anchor-text baseline and a “> File” URL-type threshold. Using this threshold gives gains of 7-15% over the baseline while removing 93% of pages from consideration. 7.2.2 Training cutoffs While several cutoffs were considered for each collection, a sensible approach for future experiments would be to train a threshold cutoff value on a single collection and then apply that as a threshold on other collections. The trained cutoff, if calcu- lated for the S@5 measure on the WT10gC collection (as with other realistic combina- tion experiments detailed below), would have been a “> File” URL-type cutoff (with an associated effectiveness gain of 24% along with a reduction of collection size by around 93%). Applied to the WT10gT collection, this cutoff would have resulted in a significant improvement in retrieval effectiveness of 12% at S@5 (along with the same reduction of collection size of 93%). Applied to the ANU collection, the collection size would be reduced by 95%, with an associated non-significant decrease in retrieval effectiveness of 9% at S@5. The exact efficiency gains achieved through using a minimum query-independent value for inclusion are difficult to quantify as they depend on the indexing and query processing methods used. However, one would expect that indexing an order of mag- nitude less documents would result in significant efficiency gains.
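The training procedure described above amounts to a one-dimensional sweep over candidate cutoffs. A minimal sketch follows; rank_after_cutoff stands in for a re-run of the retrieval pipeline under a given threshold and is a hypothetical helper, not part of the thesis experiments.

    def success_at(ranks, n):
        """S@n: the fraction of queries whose correct answer ranks in the top n."""
        return sum(1 for r in ranks if r <= n) / len(ranks)

    def train_cutoff(candidate_cutoffs, rank_after_cutoff, n=5):
        """Pick the candidate threshold that maximises S@n on a training collection.

        rank_after_cutoff(cutoff) should return, for every training query, the rank
        of the correct answer once below-threshold documents are excluded (1001 if
        the answer itself is excluded or falls outside the top 1000).
        """
        return max(candidate_cutoffs,
                   key=lambda c: success_at(rank_after_cutoff(c), n))

    # A cutoff trained on WT10gC in this way would then be applied unchanged to
    # the WT10gT and ANU collections, as described above.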
    §7.2 Minimum thresholdexperiments 111 Content Anchor Both Type Cut Prop. S@1 S@5 S@10 S. S@1 S@5 S@10 S. S@1 S@5 S@10 S. BASE 100% 0.22 0.48 0.59 0.53 0.68 0.72 0.48 0.71 0.75 IDG 2 42% 0.22 0.47 0.55 = 0.53 0.67 0.72 = 0.50 0.61 0.67 = IDG 3 26% 0.26 0.44 0.52 = 0.50 0.59 0.61 *- 0.48 0.61 0.64 = IDG 4 19% 0.23 0.43 0.51 = 0.46 0.54 0.56 *- 0.43 0.57 0.60 = IDG 6 12% 0.26 0.43 0.49 *- 0.43 0.51 0.52 *- 0.43 0.52 0.56 *- IDG 8 7.5% 0.24 0.41 0.45 *- 0.38 0.48 0.49 *- 0.39 0.50 0.51 *- IDG 10 5% 0.24 0.39 0.42 *- 0.37 0.44 0.46 *- 0.37 0.46 0.48 *- IDG 25 2% 0.23 0.31 0.34 *- 0.28 0.34 0.35 *- 0.28 0.35 0.37 *- IDG 50 1% 0.21 0.28 0.30 *- 0.25 0.28 0.28 *- 0.24 0.30 0.30 *- IDG 100 0.5% 0.16 0.22 0.23 *- 0.19 0.21 0.21 *- 0.20 0.22 0.23 *- DPR 1.33 99% 0.22 0.48 0.59 = 0.53 0.68 0.72 = 0.48 0.71 0.75 *+ DPR 1.38 80% 0.20 0.41 0.50 = 0.50 0.62 0.66 *- 0.45 0.61 0.65 = DPR 1.51 70% 0.20 0.39 0.49 = 0.51 0.62 0.64 *- 0.46 0.60 0.63 = DPR 1.73 60% 0.20 0.37 0.48 = 0.50 0.60 0.63 *- 0.46 0.59 0.61 = DPR 2.11 50% 0.23 0.41 0.50 = 0.50 0.59 0.63 *- 0.46 0.59 0.63 = DPR 2.72 40% 0.19 0.36 0.46 = 0.48 0.57 0.59 *- 0.43 0.54 0.59 *- DPR 3.77 30% 0.18 0.37 0.46 *- 0.47 0.55 0.56 *- 0.43 0.53 0.58 *- DPR 5.45 20% 0.18 0.37 0.44 *- 0.44 0.52 0.54 *- 0.41 0.50 0.55 *- DPR 8.65 10% 0.15 0.35 0.42 *- 0.39 0.48 0.48 *- 0.37 0.46 0.49 *- URL > F 7% 0.53 0.71 0.80 *+ 0.61 0.73 0.74 *+ 0.62 0.80 0.83 *+ URL > D 3% 0.57 0.76 0.78 *+ 0.62 0.70 0.71 = 0.66 0.79 0.81 *+ URL > SR 1% 0.60 0.62 0.63 = 0.53 0.57 0.58 *- 0.61 0.64 0.65 = Table 7.4: Using query-independent thresholds on the WT10gT collection. Bold values indi- cate the highest effectiveness achieved for each type of query-independent evidence on each query-dependent baseline. Underlined bold values indicate the highest effectiveness achieved for each query-dependent baseline. The cutoff value is indicated by “Cut”. The percentage of the collection that is included within the cutoff is indicated by “Prop.”. “S.” reports whether observed changes are significantly better (“*+”), equivalent (“=”) or worse (“*-”). The specified cutoffs for Democratic PageRank are of the order ×10−6 . For URL-type cutoffs; >F indicates that URLs are more important than “File” URLs (i.e. either “Directory”, “Subroot” or “Root”), >D that URLs are more important than “Directory” (i.e. either “Subroot” or “Root”), and >SR that URLs are more important than “Subroot” (i.e. “Root”).
    112 Home pagefinding using query-independent web evidence 7.3 Optimal combination experiments These experiments investigate the effectiveness improvements offered through the use of query-independent evidence in an Optimal re-ranking. The Optimal re-ranking is unrealistic, and is used to gauge the potential contribution of query-independent evidence when combined with query-dependent evidence. 7.3.1 Results Full re-ranking and significance test results are shown in Tables 7.5, 7.6, 7.7 and 7.8, and a summary of optimal results is presented in Table 7.9. Observations based on these results are: 1. All re-rankings of the content baseline significantly outperform the random con- trol. 2. The only re-ranking method which shows significant benefit over the anchor- text baseline is URL. This benefit is shown only for the random query sets. The benefits of re-ranking by URL are greatly diminished for anchor-text compared with content and content+anchor-text baselines. 3. All re-rankings of the content+anchor-text baseline significantly outperform the random control on ANU, WT10gT and VLC2R. Only the URL-type re-ranking on WT10gC and VLC2P outperforms the random control. 4. With no re-ranking, the content+anchor-text baselines perform worse than their anchor-text counterparts. However, the content+anchor-text based re-rankings are equal to (in ANU), or exceed their counterpart anchor-text re-rankings (in WT10gC, WT10gT, VLC2P, VLC2R). 5. URL performs at a consistently high level for all baselines. The URL anchor- text re-ranking is only outperformed by APR on the ANU and VLC2P. These are cases where the query set and bookmarks were both derived from the same list of authoritative sources. 6. For the popular home page queries (ANU and VLC2P), all anchor-text re-rankings outperform their content counterparts. 7. For random home page queries (WT10gT, WT10gC and VLC2R), the content+ anchor-text and content-only re-rankings perform better than their anchor-text counterparts. 8. Improvements due to APR were only observed when using high quality book- marks, i.e. when the query answers were to be found among the bookmarks. 9. Improvements due to IDG and DPR are almost identical.
    §7.3 Optimal combinationexperiments 113 Coll. Meas. Base Rand IDG DPR APR URL ANU S@1 0.29 0.37 0.73 0.71 0.75 0.68 ANU S@5 0.50 0.61 0.88 0.90 0.91 0.87 ANU S@10 0.58 0.69 0.93 0.93 0.96 0.91 ANU Sig. n/a n/a ** ** ** ** WT10gC S@1 0.23 0.34 0.61 0.59 0.55 0.75 WT10gC S@5 0.45 0.58 0.86 0.82 0.84 0.89 WT10gC S@10 0.55 0.68 0.86 0.87 0.88 0.93 WT10gC Sig. n/a n/a ** ** ** ** WT10gT S@1 0.22 0.34 0.64 0.62 0.55 0.84 WT10gT S@5 0.48 0.61 0.81 0.83 0.80 0.90 WT10gT S@10 0.59 0.69 0.86 0.87 0.84 0.92 WT10gT Sig. n/a n/a ** ** ** ** VLC2P S@1 0.27 0.38 0.66 0.62 0.67 0.71 VLC2P S@5 0.51 0.65 0.79 0.79 0.82 0.87 VLC2P S@10 0.61 0.76 0.88 0.87 0.90 0.89 VLC2P Sig. n/a n/a ** ** ** ** VLC2R S@1 0.16 0.25 0.50 0.48 0.46 0.72 VLC2R S@5 0.36 0.48 0.72 0.69 0.69 0.87 VLC2R S@10 0.44 0.58 0.73 0.72 0.72 0.88 VLC2R Sig. n/a n/a ** ** ** ** Table 7.5: Optimal re-ranking results for content. The Optimal combination experiment is described in Section 7.3. “Sig.” reports the statistical significance of the improvements. Significance is tested using the Wilcoxon matched-pairs signed ranks test. The Wilcoxon test compares the full document ranking, and so only a single significance value is reported per type of evidence, per collection. A “**” indicates improvements were significant at p < 0.01, and a “*” indicates improvements were significant at p < 0.05. Relative to the random control, all Optimal re-rankings of the content baseline were significant. The highest effectiveness achieved for each measure on each collection is highlighted in bold.
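For reference, the significance test reported in these tables can be reproduced with any standard Wilcoxon signed-rank implementation. The sketch below uses SciPy and is illustrative only (the thesis does not specify its statistical software); it takes, for each query, the best rank achieved by the correct answer under each method, and assumes the two runs differ on at least some queries.

    from scipy.stats import wilcoxon

    def compare_runs(ranks_a, ranks_b, alpha=0.05):
        """Wilcoxon matched-pairs signed-rank test on the ranks achieved by the
        correct answer under two methods (one paired observation per query).

        ranks_a, ranks_b : query-aligned lists of best correct-answer ranks,
                           with 1001 standing in for 'not in the top 1000'.
        """
        stat, p = wilcoxon(ranks_a, ranks_b)
        better_a = sum(a < b for a, b in zip(ranks_a, ranks_b))
        better_b = sum(b < a for a, b in zip(ranks_a, ranks_b))
        return {"p_value": p,
                "significant": p < alpha,
                "favours": "A" if better_a > better_b else "B"}

The 95% confidence criterion used throughout the chapter corresponds to alpha = 0.05 here.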
    114 Home pagefinding using query-independent web evidence Coll. Meas. Base Rand IDG DPR APR URL ANU S@1 0.72 0.82 0.87 0.87 0.89 0.88 ANU S@5 0.96 0.97 0.98 0.98 0.98 0.98 ANU S@10 0.97 0.97 0.98 0.98 0.99 0.98 ANU Sig. n/a n/a - - - - WT10gC S@1 0.47 0.58 0.60 0.59 0.63 0.73 WT10gC S@5 0.69 0.73 0.71 0.72 0.73 0.82 WT10gC S@10 0.72 0.76 0.74 0.75 0.75 0.83 WT10gC Sig. n/a n/a - - - * WT10gT S@1 0.53 0.60 0.63 0.61 0.64 0.74 WT10gT S@5 0.68 0.73 0.72 0.71 0.75 0.78 WT10gT S@10 0.72 0.76 0.76 0.76 0.75 0.79 WT10gT Sig. n/a n/a - - - * VLC2P S@1 0.70 0.77 0.78 0.79 0.85 0.81 VLC2P S@5 0.86 0.88 0.88 0.89 0.92 0.90 VLC2P S@10 0.87 0.89 0.90 0.89 0.92 0.92 VLC2P Sig. n/a n/a - - - - VLC2R S@1 0.48 0.55 0.63 0.60 0.61 0.68 VLC2R S@5 0.67 0.71 0.75 0.75 0.73 0.74 VLC2R S@10 0.72 0.73 0.75 0.75 0.75 0.76 VLC2R Sig. n/a n/a - - - * Table 7.6: Optimal re-ranking results for anchor-text. The Optimal combination experiment is described in Section 7.3. “Sig.” reports the statistical significance of the improvements. Significance is tested using the Wilcoxon matched-pairs signed ranks test. The Wilcoxon test compares the full document ranking, and so only a single significance value is reported per type of evidence, per collection. A “**” indicates improvements were significant at p < 0.01, and a “*” indicates improvements were significant at p < 0.05. The highest effectiveness achieved for each measure on each collection is highlighted in bold.
    §7.3 Optimal combinationexperiments 115 Coll. Meas. Base Rand IDG DPR APR URL ANU S@1 0.63 0.70 0.85 0.85 0.84 0.88 ANU S@5 0.81 0.86 0.96 0.98 0.96 0.98 ANU S@10 0.86 0.90 0.98 0.99 0.98 0.98 ANU Sig. n/a n/a * * * * WT10gC S@1 0.45 0.58 0.65 0.67 0.68 0.94 WT10gC S@5 0.71 0.81 0.90 0.88 0.89 0.97 WT10gC S@10 0.83 0.86 0.92 0.91 0.90 0.97 WT10gC Sig. n/a n/a - - - ** WT10gT S@1 0.48 0.58 0.70 0.69 0.68 0.84 WT10gT S@5 0.71 0.77 0.88 0.86 0.85 0.94 WT10gT S@10 0.75 0.80 0.88 0.90 0.88 0.95 WT10gT Sig. n/a n/a * * * ** VLC2P S@1 0.67 0.75 0.84 0.86 0.89 0.90 VLC2P S@5 0.85 0.88 0.93 0.94 0.94 0.97 VLC2P S@10 0.88 0.91 0.94 0.94 0.95 0.98 VLC2P Sig. n/a n/a - - * ** VLC2R S@1 0.40 0.50 0.63 0.60 0.58 0.84 VLC2R S@5 0.62 0.69 0.78 0.76 0.74 0.93 VLC2R S@10 0.66 0.75 0.79 0.78 0.77 0.93 VLC2R Sig. n/a n/a - - - ** Table 7.7: Optimal re-ranking results for content+anchor-text. The Optimal combination experiment is described in Section 7.3. Significance is tested using the Wilcoxon matched- pairs signed ranks test. The Wilcoxon test compares the full document ranking, and so only a single significance value is reported per type of evidence, per collection. A “**” indicates improvements were significant at p < 0.01, and a “*” indicates improvements were signifi- cant at p < 0.05. The highest effectiveness achieved for each measure on each collection is highlighted in bold.
Collection Type Content Anchor-text Content + Anchor-text ANU Popular APR > DPR, URL - - WT10gC Random DPR > IDG URL > IDG, DPR, APR URL > IDG, DPR, APR APR > IDG URL > IDG, DPR, APR WT10gT Random IDG > APR APR > IDG, DPR URL > IDG, DPR, APR DPR > APR URL > IDG, DPR, APR URL > IDG, DPR, APR VLC2P Popular - APR > IDG, DPR URL > IDG VLC2R Random IDG > APR DPR > IDG IDG > APR URL > IDG, DPR, APR URL > IDG, DPR, APR URL > IDG, DPR, APR

Table 7.8: Significant differences between methods when using Optimal re-rankings. Each (non-random) method was compared against each of the others in turn and differences were tested for significance using the Wilcoxon test. Each significant difference found is shown with the direction of the difference.

7.4 Score-based re-ranking

These experiments investigate the effectiveness of a score-based re-ranking of baselines using query-independent evidence.

7.4.1 Setting score cutoffs

For the realistic score-based re-rankings the same cutoff was applied to all queries. Suitable score cutoffs were determined for WT10gC by plotting S@5 effectiveness against potential cutoff values (see Figures 7.2 and 7.3) and recording the optimal cutoff for each form of query-independent evidence. The other collections were then re-ranked using this cutoff. Optimal cutoffs were calculated at S@5 due to the instability of S@1 (see footnote 5) and the smaller effectiveness gains observed at S@10.

7.4.2 Results

Tables 7.10, 7.11 and 7.12 show the results of the score-based re-ranking of the content, anchor-text and content+anchor-text baselines. From these results it can be observed that:

1. URL re-ranking provided significant improvements over all three baselines for WT10gT, VLC2P and VLC2R.

2. URL re-ranking performance is only surpassed by APR on the ANU collection (at S@1), where APR used very high quality bookmarks.

3. None of the hyperlink-recommendation-based schemes provided a significant improvement over the anchor-text baseline.

5 S@1 is equivalent to P@1; the instability of Precision at 1 is discussed in Section 2.6.6.1.
Collection   Measure   Best Cont.   Best Anch   Best Cont+Anch
ANU          S@1       0.76         0.90        0.88
             S@5       0.92         0.98        0.98
             S@10      0.94         0.98        0.98
             QIE       URL          ALL         URL, DPR
WT10gC       S@1       0.82         0.73        0.94
             S@5       0.93         0.84        0.97
             S@10      0.93         0.84        0.97
             QIE       URL          URL         URL
WT10gT       S@1       0.84         0.74        0.84
             S@5       0.90         0.78        0.94
             S@10      0.92         0.79        0.95
             QIE       URL          URL         URL
VLC2P        S@1       0.71         0.84        0.90
             S@5       0.87         0.91        0.97
             S@10      0.89         0.91        0.98
             QIE       URL          APR         URL
VLC2R        S@1       0.73         0.68        0.84
             S@5       0.87         0.74        0.93
             S@10      0.88         0.76        0.93
             QIE       URL          URL         URL

Table 7.9: Summary of Optimal re-ranking results. The highest effectiveness achieved by each method is highlighted in bold. The "QIE" row indicates the query-independent evidence that performed best.
[Figure 7.2: two plots of success rate at 5 against the cutoff, expressed as a percentage of the maximum baseline score, for the URL, APR, in-degree and DPR re-rankings.]

Figure 7.2: Setting score-based re-ranking cutoffs for the content (top) and anchor-text (bottom) baselines using the WT10gC collection. The vertical lines represent the chosen cutoff values, which were then used in all score-based re-ranking experiments. If the optimal cutoff spanned multiple values then the mean of those values was used. Numerical cutoff scores are provided in Tables 7.10 and 7.11.
[Figure 7.3: success rate at 5 against the cutoff, expressed as a percentage of the maximum baseline score, for the APR, URL, in-degree and DPR re-rankings.]

Figure 7.3: Setting score-based re-ranking cutoffs for the content+anchor-text baseline using the WT10gC collection. The vertical lines represent the chosen cutoff values, which were then used in all score-based re-ranking experiments. If the optimal cutoff spanned multiple values then the mean of those values was used. Numerical cutoff scores are provided in Table 7.12.
    120 Home pagefinding using query-independent web evidence 4. For the popular query sets (ANU and VLC2P) the anchor text baseline with URL re-ranking produced the best performance, although the baseline only narrowly outperformed content+anchor-text. 5. For the random query sets (WT10gT and VLC2R) the content+anchor-text base- line with URL re-ranking produced the best performance, with the URL re- ranking of the content baseline performing better than the anchor-text re-ranking. 6. In the absence of very high quality bookmarks (i.e. on every collection except for the ANU), APR performance was very similar to that of the other hyperlink recommendation techniques. Coll. Meas. Base IDG DPR APR URL (at 20.6%) (at 17.4%) (at 14.1%) (at 33.7%) ANU S@1 0.29 0.36 0.29 0.48 0.39 ANU S@5 0.50 0.60 0.52 0.67 0.73 ANU S@10 0.58 0.73 0.6 0.72 0.83 ANU Sig. - - - ** ** WT10gC S@1 0.23 0.36 0.38 0.33 0.71 WT10gC S@5 0.45 0.67 0.58 0.59 0.88 WT10gC S@10 0.55 0.73 0.67 0.65 0.90 WT10gT S@1 0.22 0.46 0.41 0.32 0.70 WT10gT S@5 0.48 0.64 0.59 0.62 0.83 WT10gT S@10 0.59 0.71 0.69 0.65 0.88 WT10gT Sig. - - - - ** VLC2P S@1 0.27 0.38 0.42 0.41 0.56 VLC2P S@5 0.51 0.61 0.61 0.63 0.68 VLC2P S@10 0.61 0.70 0.70 0.76 0.76 VLC2P Sig. - - - ** ** VLC2R S@1 0.16 0.26 0.20 0.22 0.62 VLC2R S@5 0.36 0.47 0.44 0.45 0.82 VLC2R S@10 0.44 0.56 0.52 0.53 0.83 VLC2R Sig. - - - - ** Table 7.10: Score-based re-ranking results for content. Cutoffs (shown as “(at ?)”) were obtained by training on WT10gC at S@5. “Sig.” reports the statistical significance of the im- provements. Significance is tested using the Wilcoxon matched-pairs signed ranks test. The Wilcoxon test compares the full document ranking, and so only a single significance value is reported per type of evidence, per collection. A “**” indicates improvements were signifi- cant at p < 0.01, and a “*” indicates improvements were significant at p < 0.05. The highest effectiveness achieved at each measure for each collection is highlighted in bold.
    §7.4 Score-based re-ranking121 Coll. Meas. Base IDG DPR APR URL (at 15.5%) (at 11.1%) (at 15.6%) (at 20.4%) ANU S@1 0.72 0.77 0.74 0.83 0.78 ANU S@5 0.96 0.95 0.94 0.96 0.98 ANU S@10 0.97 0.98 0.98 0.98 0.98 ANU Sig. - - - - - WT10gC S@1 0.47 0.5 0.51 0.51 0.67 WT10gC S@5 0.69 0.71 0.71 0.71 0.76 WT10gC S@10 0.72 0.72 0.72 0.72 0.76 WT10gT S@1 0.53 0.51 0.52 0.47 0.65 WT10gT S@5 0.68 0.70 0.68 0.70 0.73 WT10gT S@10 0.72 0.72 0.72 0.73 0.74 WT10gT Sig. - - - - ** VLC2P S@1 0.70 0.69 0.70 0.73 0.81 VLC2P S@5 0.86 0.84 0.84 0.85 0.89 VLC2P S@10 0.87 0.86 0.88 0.86 0.91 VLC2P Sig. - - - - ** VLC2R S@1 0.48 0.48 0.46 0.41 0.66 VLC2R S@5 0.67 0.70 0.71 0.69 0.73 VLC2R S@10 0.72 0.73 0.72 0.70 0.76 VLC2R Sig. - - - - ** Table 7.11: Score-based re-ranking results for anchor-text. Cutoffs (shown as “(at ?)”) were obtained by training on WT10gC at S@5. “Sig.” reports the statistical significance of the im- provements. Significance is tested using the Wilcoxon matched-pairs signed ranks test. The Wilcoxon test compares the full document ranking, and so only a single significance value is reported per type of evidence, per collection. A “**” indicates improvements were signifi- cant at p < 0.01, and a “*” indicates improvements were significant at p < 0.05. The highest effectiveness achieved at each measure for each collection is highlighted in bold.
    122 Home pagefinding using query-independent web evidence Coll. Meas. Base IDG DPR APR URL (at 10.3%) (at 6.9%) (at 10%) (at 31.7%) ANU S@1 0.63 0.71 0.64 0.70 0.69 ANU S@5 0.81 0.84 0.82 0.86 0.88 ANU S@10 0.86 0.89 0.86 0.89 0.91 ANU Sig. - * - * * WT10gC S@1 0.45 0.51 0.49 0.53 0.79 WT10gC S@5 0.71 0.77 0.73 0.75 0.92 WT10gC S@10 0.83 0.83 0.83 0.82 0.94 WT10gT S@1 0.48 0.51 0.52 0.41 0.72 WT10gT S@5 0.71 0.68 0.70 0.67 0.86 WT10gT S@10 0.75 0.77 0.78 0.75 0.89 WT10gT Sig. - - - - ** VLC2P S@1 0.67 0.65 0.68 0.68 0.68 VLC2P S@5 0.85 0.87 0.86 0.86 0.88 VLC2P S@10 0.88 0.91 0.90 0.91 0.93 VLC2P Sig. - - - - * VLC2R S@1 0.40 0.42 0.42 0.34 0.75 VLC2R S@5 0.62 0.61 0.59 0.61 0.87 VLC2R S@10 0.66 0.70 0.67 0.69 0.89 VLC2R Sig. - - - - ** Table 7.12: Score-based re-ranking results for content+anchor-text. Cutoffs (shown as “(at ?)”) were obtained by training on WT10gC at S@5. “Sig.” reports the statistical significance of the improvements. Significance is tested using the Wilcoxon matched-pairs signed ranks test. The Wilcoxon test compares the full document ranking, and so only a single significance value is reported per type of evidence, per collection. A “**” indicates improvements were significant at p < 0.01, and a “*” indicates improvements were significant at p < 0.05. The highest effectiveness achieved at each measure for each collection is highlighted in bold.
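The realistic score-based re-ranking evaluated in Tables 7.10 to 7.12 can be sketched as follows. This is one plausible reading of the procedure (cutoffs expressed as a percentage of the maximum baseline score, following Chapter 6 and Figures 7.2 and 7.3) rather than the exact implementation; variable names are illustrative.

    def score_based_rerank(results, qi_score, cutoff_fraction):
        """Re-rank the head of a query-dependent result list by query-independent
        evidence.  Documents whose baseline score is at least `cutoff_fraction`
        of the top baseline score are resorted by qi_score (ties keep baseline
        order); the remaining documents follow in their original order.

        results : list of (doc_id, baseline_score), best first.
        """
        if not results:
            return []
        threshold = cutoff_fraction * results[0][1]
        head = [(d, s) for d, s in results if s >= threshold]
        tail = [(d, s) for d, s in results if s < threshold]
        head = sorted(head, key=lambda ds: qi_score.get(ds[0], 0.0), reverse=True)
        return [d for d, _ in head + tail]

    # e.g. a URL-type re-ranking of the content baseline with the trained cutoff of
    # 33.7% of the maximum baseline score (Table 7.10), on hypothetical run data:
    # reranked = score_based_rerank(content_run["some query"], url_type_score, 0.337)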
    §7.5 Interpretation ofresults 123 7.5 Interpretation of results Collection Info Optimal Score-based B’mark Best Best S@1 S@5 S@10 Coll. Type Quality S@5 S@5 Improve Improve Improve Sig. ANU Pop. v.High 0.98 0.98 7.7% 2.0% 0% - AT+* AT+URL 0.72→0.78 0.96→0.98 0.98→0.98 WT10gT Rand. Low 0.88 0.85 29% 16% 14% ** C+AT+URL C+AT+URL 0.48→0.68 0.71→0.85 0.75→0.87 VLC2P Pop. High 0.97 0.88 14% 3% 4% ** C+AT+URL AT+URL 0.69→0.79 0.85→0.88 0.86→0.90 VLC2R Rand. Low 0.93 0.87 47% 29% 26% ** C+AT+URL C+AT+URL 0.40→0.75 0.62→0.87 0.66→0.89 Table 7.13: Numerical summary of re-ranking improvements. “Sig.” reports the statistical significance of the improvements. Significance is tested using the Wilcoxon matched-pairs signed ranks test. The Wilcoxon test compares the full document ranking, and so only a single significance value is reported per type of evidence, per collection. A “**” indicates improvements were significant at p < 0.01, and a “*” indicates improvements were signif- icant at p < 0.05. The percentile realistic improvements are calculated as a percentage im- provement over the best baseline. “AT+*” denotes a combination of anchor-text with any of the query independent evidence examined here. “AT+URL” denotes a combination of anchor-text with URL-type query-independent evidence. “AT+APR” denotes a combination of anchor-text with APR query-independent evidence. “C+URL” denotes a combination of content with URL-type query-independent evidence. “C+AT+URL” denotes a combination of content+anchor-text with URL-type query-independent evidence. 7.5.1 What query-independent evidence should be used in re-ranking? The Optimal combination results show that re-rankings by all of the query- independent methods considered significantly improve upon the random control for the content baseline. For all random query sets, URL re-ranking of the anchor-text base- line significantly improves upon the random control. Further, many of content+anchor- text baseline re-rankings are significant. Results are quite stable across collections de- spite differences in their scale. Naturally, the benefits of the realistic score-based re-rankings are smaller, but the URL method in particular achieves substantial gains over all baselines, as shown in Table 7.13. It is clear that classification of URL-type is of considerable value in a home page finding system. Section 7.6.2 examines whether the URL-type classifications em- ployed in this experiment are optimal. It is of interest that URL re-ranking results for the ANU collection are poorer than for the other collections. Although investigation confirmed UTwente/TNO’s order- ing, i.e. “Root” (36/137) > “Subroot” (50/862) > “Directory” (72/1059) >
    124 Home pagefinding using query-independent web evidence “File” (40/382 274),6 the ratio for the URL “Subroot” class was higher than for other collections. It should be noted that URL re-ranking would be of little use in webs in which URLs exhibit no hierarchical structure. For example, some organisations publish URLs of the form xyz.org/getdoc.cgi?docid=9999999. Such URLs include no potential “Subroot” or “Directory” URL break-downs. In experiments within this chapter the baseline ordering was preserved if the re- ranking scores were equal. Such equality occurred more often in URL-type scores, which could take only one of four distinct values. To confirm that the superiority of URL-type re-ranking was not an artifact of their quantisation, hyperlink recommen- dation scores were quantised7 into four groups, and the effectiveness of the quantised scores was also evaluated. The quantisation of hyperlink recommendation scores de- creased retrieval effectiveness. This indicates that it is unlikely that URL-type has an unfair advantage due to quantisation. Hyperlink recommendation results indicate these schemes may have relatively lit- tle role to play in home page finding tasks using re-ranking based combination meth- ods for corpora within the range of sizes studied here (400 000 to 18.5 million pages). The full-text (content) baseline improvements when using hyperlink recommendation scores as a minimum threshold for document retrieval, or in an Optimal re-ranking of the query-dependent baselines, were encouraging. By contrast, the performance improvements over the anchor-text baseline were minimal. This suggests that most of the potential improvement offered by hyperlink recommendation methods is al- ready exploited by the anchor-text baseline. In most of the score-based re-rankings it is almost impossible to differentiate between the re-ranking of the anchor-text base- line and the baseline itself. The extent to which hyperlink recommendation evidence is implicit in anchor-text evidence is considered in the next chapter. Throughout the experiments in-degree appeared to provide more consistent per- formance improvements than APR or DPR. APR performed well when using high- quality bookmark sets, but did not improve performance when using lower qual- ity bookmark sets on random (WT10gT and VLC2R) query sets. The improvement achieved by these methods relative to the anchor-text baselines was not significant. The difference in effectiveness of the two PageRank variants show that PageRank’s contribution to home page finding on corpora of this size is highly dependent upon the choice of bookmark pages. However, even for popular queries (ANU and VLC2P), APR results are generally inferior to those of URL re-rankings. Of the three hyperlink recommendation methods in-degree may be the best choice, as the PageRank variants offer little or no advantage and are more computationally expensive. In conclusion, the results of these experiments show the best query-independent evidence to be URL-type. 6 Note that in these figures all URLs (including equivalent URLs) were considered. 7 I.e. similar scores were grouped to reduce the number of possible values.
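The quantisation check mentioned above (see footnote 7) can be illustrated as follows. The thesis does not state how scores were grouped; equal-frequency (quartile) binning is assumed here purely for illustration.

    def quantise_scores(scores, bins=4):
        """Group similar query-independent scores into a small number of bins
        (four here, to mirror the four URL-type classes).  Each document is
        mapped to the index of the quantile bin its score falls into."""
        ordered = sorted(scores.values())
        # Bin boundaries at the 25th, 50th and 75th percentiles when bins=4.
        # With heavily skewed scores (e.g. in-degree) boundaries may coincide,
        # collapsing some bins -- which only strengthens the quantisation.
        bounds = [ordered[int(len(ordered) * i / bins)] for i in range(1, bins)]
        return {doc: sum(value >= b for b in bounds) for doc, value in scores.items()}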
    §7.6 Further experiments125 7.5.2 Which query-dependent baseline should be used? In the experiments, prior to re-ranking, the anchor-text baseline generally outper- formed the content and content+anchor-text baselines. However, on two collections,8 URL-type re-rankings of full-text (content) outperformed similar re-rankings of anchor-text. In these two cases the target home pages were randomly chosen. This effect was not observed for the popular targets, although the content+anchor-text per- formance was comparable to that of anchor-text only. Figure 7.4 illustrates the difference between the random and popular sets by plotting S@n against n for the content and anchor-text baselines. For the popular query set, the two baselines converge at about n = 500, but for the random set the content baseline is clearly superior for n > 150. The plot for VLC2R is similar to that observed in a pre- vious study of content and anchor-text performance on the WT10gT collection [135]. An explanation for the observed increase in effectiveness of the content baseline above n > 150 is that while anchor-text rankings are better able to discriminate be- tween home pages and other relevant pages, full anchor-text rankings are shorter9 than those for content. Some home pages have no useful incoming anchor-text and therefore do not appear anywhere in the anchor-text ranking. By contrast, most home pages do contain some form of site name within their content and will eventually appear in the content ranking. Selecting queries from a directory within the collection guarantees that the anchor document for the target home page will not be empty, but there is no such guarantee for randomly chosen home pages. Selection of home pages for listing in a directory is undoubtedly biased toward useful, important or well-known sites which are also more likely to be linked to from other pages (Experiments in Chapter 5 observed that PageRank does favour popular pages). It should be noted that incoming home page queries would probably also be biased toward this type of site. In conclusion, the results of the experiments show the content+anchor-text base- line to be the most consistent performer across all tasks, and to perform particularly well when combined with URL-type evidence. 7.6 Further experiments Having established the principal results above, a series of follow-up experiments was conducted. In particular these investigated: • to what extent results can be understood in terms of rank and score distributions; • whether other classifications of URL-type provide similar, or superior, gains in retrieval effectiveness; 8 Of the four evaluated. The WT10gC test collection is not included as it was used to train the re- ranking cutoffs. 9 Ignoring documents that achieve a score of zero.
[Figure 7.4: two plots of success rate at n against the number of documents n (log scale, 1 to 1000) for the content and anchor-text baselines.]

Figure 7.4: Baseline success rates across different cutoffs. The top plot is for VLC2P, the VLC2 crawl with a popular home page query set. The bottom plot is for VLC2R, the same crawl, but with a random home page query set. The anchor-text baseline performs well between n = 0 and 150 for both collections. In VLC2P, at around S@150 the anchor-text baseline performance approaches the content baseline performance. In VLC2R the anchor-text performance is surpassed by the content performance at around S@150. These plots are consistent with the S@1000 values reported in Table 7.1.
    §7.6 Further experiments127 • to what extent the PageRanks and in-degrees are correlated with those reported by Google; and • whether the use of anchor-text and link graph information external to the corpus could improve retrieval effectiveness. 7.6.1 Rank and score distributions This section analyses the distribution of correct answers for each type of evidence over the WT10gC collection. The content and anchor-text baseline rankings of the correct answers are plotted in Figure 7.5. In over 50% of occasions both the content and anchor-text baselines contain the correct answer within the top ten results. Anchor-text provides the better scoring of the two baselines, with the correct home page ranked as the top result for almost 50% of the queries. This confirms the effectiveness of anchor-text for home page finding [15, 56]). The PageRank distributions are plotted in Figure 7.6. The distribution of the De- mocratic PageRank scores for all pages follow a power law. In contrast, the PageRank distribution for correct answers is much more even, with the proportion of pages that are correct answers increasing at higher PageRanks. There are many pages which do not achieve an APR score. Merely having an APR score > 0 gives some indica- tion that a page is a correct answer in the WT10gC collection. These plots indicate that both forms of PageRank provide some sort of home page evidence (as observed in Chapter 5), even though these computed PageRank values differ markedly from those mined from the Google toolbar in Chapter 5. This large difference re-affirms the belief that PageRanks reported by the Google toolbar have been heavily transformed. The in-degree distribution is plotted at the top of Figure 7.7 and is similar to the Democratic PageRank distribution. However, the graph is slightly shifted to the left, indicating that there are more pages with low in-degrees than there are pages with low PageRanks. The distribution of correct answers is spread across in-degree scores, with the proportion of pages that are correct answers increasing at higher in-degrees. This shows that in-degree also provides some sort of home page evidence. The URL-type distribution is plotted on the right in Figure 7.7. URL-type is a particularly useful home page indicator for this collection, with a large proportion of the correct answers located in the “Root” class and few correct answers located within the “File” class. 7.6.2 Can the four-tier URL-type classification be improved? This section evaluates how combining the four URL-type classes and introducing length and directory depth based scores impacts retrieval effectiveness. The results for this series of experiments are presented in Table 7.14. None of the new URL-type methods significantly improved upon the performance of the original URL-type classes (“Root” > “Subroot” > “Directory” > “File”). How- ever, combining the “Subroot” and “Directory” classes did not adversely affect URL-
[Figure 7.5: two histograms of the number of documents against the rank (1 to 10, and >10) of the correct answer.]

Figure 7.5: Baseline rankings of the correct answers for WT10gC (content top, anchor-text bottom). The correct answer is retrieved within the top ten results for over 50% of queries on both baselines. The anchor-text baseline has the correct answer ranked as the top result for almost 50% of the queries.
[Figure 7.6: two log-log plots of the number of documents against the normalised Democratic and Aristocratic PageRank scores (quantised to 40 steps), shown for all pages (All) and for the correct answers (Correct).]

Figure 7.6: PageRank distributions for WT10gC (DPR top, APR bottom). These plots contain the distribution of all pages in the collection (All) and the distribution of the 100 correct answers (Correct). The distribution of the DPR scores for all pages follows a power law. In contrast, the correct answers are spread more evenly across DPR scores, so the proportion of pages which are correct answers increases at higher PageRanks. Approximately 17% of pages do not achieve an APR score, thus merely having an APR score > 0 is some indication that a page is more likely to be a correct answer.
[Figure 7.7: a log-log plot of the number of documents against the normalised in-degree score (quantised to 40 steps), and a bar chart of the percentage of documents in each URL type (file, path, subroot, root), each shown for all pages (All) and for the correct answers (Correct).]

Figure 7.7: Other distributions for WT10gC (in-degree top, URL-type bottom). The top plot contains the in-degree distribution for all pages (All) and the 100 correct answers (Correct). The distribution of the in-degree scores for all pages follows a power law. In contrast, the correct answers are spread more evenly across in-degree scores, and the proportion of pages which are correct answers increases at higher in-degree scores. The bottom plot contains the URL-type distribution (in percentages) of all pages (All) and the correct answers (Correct). The "Root" tier contains only 1% of the pages in the collection, but 80% of the correct answers. In contrast, the "File" tier contains 92% of the collection's pages, but only 5% of the correct answers.
Dataset   Baseline      R>S>D>F   Length   Dir Depth   R>S+D+F   R>S+D>F   R>S>D(l)>F
ANU       content       87        88       68          62        77        87
ANU       anchor-text   98        98       98          97        98        98
WT10gC    content       89        90       72          83        89        89
WT10gC    anchor-text   82        83       75          78        82        82
WT10gT    content       88        88       74          80        85        88
WT10gT    anchor-text   77        79       74          75        77        77
VLC2P     content       87        86       68          81        84        87
VLC2P     anchor-text   89        92       87          89        89        90
VLC2R     content       87        86       62          82        85        87
VLC2R     anchor-text   74        76       73          73        74        74

Table 7.14: S@5 (as a percentage) for URL-type category combinations, length (how long a URL is, favouring short URLs) and directory depth (how many directories the URL contains, favouring URLs with shallow directories). R represents the "Root" tier, S the "Subroot" tier, D the "Directory" tier and F the "File" tier. D(l) indicates that directories were ranked according to length (where shorter directories are preferred). In all cases an Optimal re-ranking of baselines by query-independent evidence was performed.

type effectiveness. A high level of effectiveness was also obtained using a simple URL length measure. This measure ranked pages according to the length of their URLs (in characters, favouring short URLs). "File" URLs contain filenames and are thereby longer than their "Root" and "Directory" counterparts, which may explain the good performance of the URL length measure. Re-ranking baselines using only the URL directory depth (the number of slashes in the URL) performed relatively poorly.

In conclusion, when using URL-type scores for home page finding tasks it is important to distinguish between "Root", "Directory" and "File" pages. This can be done either explicitly through a categorisation of URL-types or by measuring the length of the URL.

7.6.3 PageRank and in-degree correlation

The results in Table 7.15 show that DPR and in-degree are highly correlated, but that the correlation tends to weaken as the size of the corpus increases. This weaker association as corpus size increases suggests that PageRank might have quite different properties when calculated for very large crawls. Google's PageRank, based on 50 to 100 times more documents than are in VLC2, is likely to be different and possibly superior to the PageRanks studied here. In addition, Google may use a different PageRank variant and different bookmarks.

To understand the relationship between the PageRank values calculated in these experiments and the PageRank employed by the Google WWW search engine, scores were compared with the Google PageRanks reported for all 201 ANU pages listed in the Google Directory.10 For those pages, PageRanks were extracted from Google's

10 A version of the manually constructed DMOZ open WWW directory which reports Google PageR-
    132 Home pagefinding using query-independent web evidence DPR APR No. of pages (millions) ANU 0.836 0.448 0.40 WT10g 0.71 0.555 1.69 VLC2 0.666 0.164 18.57 Table 7.15: Correlation of PageRank variants with in-degree. The correlation was tested using the Pearson r significance test. DMOZ directory and in-degrees were extracted using the Google link: query op- erator. Google PageRank and in-degree were correlated (r=0.358), as they were for ANU, WT10g and VLC2. Also, the correlation between Google in-degree and ANU in-degree was very strong (r=0.933). Google’s in-degrees, based on a much larger crawl, were only three times larger than those from the ANU crawl (during link count extraction the difficulties outlined in Section 5.1.3 were encountered). While Google PageRank and ANU PageRank were correlated over the 201 obser- vations, the correlation was less strong than for in-degree (DPR r=0.26, APR r=0.31). This indicates that Google PageRank is different from the PageRanks studied here (as observed in Section 5.1.1). Note that only five different values of PageRank were reported by Google for the 201 pages (11, 16, 22, 27 and 32 out of 40). The directory- based PageRanks are on a different scale to those extracted using the Google Toolbar in Chapter 5, and both have been transformed and quantised from Google’s internal PageRank values. Although this study may not be directly applicable to very large crawls, its re- sults are quite stable for a range of smaller multi-server crawls. The range of sizes of corpora examined here (400 000 to 18.5 million pages) are typical of many enterprise webs and thus interesting both scientifically and commercially.11 7.6.4 Use of external link information To explore the effects of increasing corpus size, a series of hybrid WT10g/VLC2 runs was performed. This is potentially revealing because the WT10g corpus is a subset of the VLC2 corpus. The runs, shown in Table 7.16, used combinations of WT10g corpus data and VLC2 link information. The hypothesis was that by using link tables from the larger corpus it would be possible to obtain a more complete link graph and thereby improve the performance of the hyperlink recommendation and anchor- text measures (due to a potential increase in the hyperlink votes, and the amount of anks. The Google DMOZ Directory is available at http://directory.google.com 11 The rated capacities of the two Google search appliances are in fact very similar to these sizes (150 000 and 15 million pages), see http://www.google.com/appliance/products.html.
7.6.4 Use of external link information

To explore the effects of increasing corpus size, a series of hybrid WT10g/VLC2 runs was performed. This is potentially revealing because the WT10g corpus is a subset of the VLC2 corpus. The runs, shown in Table 7.16, used combinations of WT10g corpus data and VLC2 link information. The hypothesis was that by using link tables from the larger corpus it would be possible to obtain a more complete link graph and thereby improve the performance of the hyperlink recommendation and anchor-text measures (due to a potential increase in the hyperlink votes, and the amount of available anchor-text). During these hybrid runs all VLC2 anchor-text that pointed to pages outside the WT10g corpus was removed.

                  WT10g anchor-text         VLC2 anchor-text
DPR re-ranking    —      WT10g   VLC2       —      WT10g   VLC2
WT10gC            0.69   0.72    0.69       0.78   0.79    0.78
WT10gT            0.68   0.71    0.71       0.72   0.72    0.73

Table 7.16: Using VLC2 links in WT10g. Note that the WT10g collection is a subset of the VLC2 collection. The WT10g anchor-text scores are the baselines used throughout all other experiments in this chapter. The VLC2 anchor-text scores are new rankings that use external anchor-text from the VLC2 collection. WT10g DPR is a Democratic PageRank re-ranking using the link table from the WT10g collection. VLC2 DPR is a Democratic PageRank re-ranking using the link table from the VLC2 collection. The use of the (larger) VLC2 link table DPR scores did not significantly improve the performance of DPR re-ranking. The use of external anchor-text, taken from the VLC2 collection, provided significant performance gains.

Surprisingly, the use of the (larger) VLC2 link table DPR scores did not noticeably improve the performance of DPR re-ranking. However, the use of external anchor-text, taken from the VLC2 corpus, provided significant performance gains. This would suggest that in situations where an enterprise or small web has link information for a larger web, benefits will be seen if the anchor-text from the external link graph is recorded and used for the smaller corpus.12

The WT10g collection is not a uniform sample of VLC2, but was engineered to maximise the interconnectivity of the documents selected [15]. Hence the effects of scaling up may be smaller than would be expected in other web corpora.

7.7 Discussion

Using query-independent evidence scores as a minimum threshold for page inclusion appears to be a useful method by which system efficiency can be improved without significantly harming home page finding effectiveness. The use of hyperlink recommendation evidence as a threshold resulted in a reduction of 10% of the corpus without any change in retrieval effectiveness. By comparison, using a URL-type threshold of "> File", corpus size was reduced by over 90%, and retrieval effectiveness was significantly improved for two-out-of-three collections.

12 This was later investigated further by Hawking et al. [115] who found that the use of external anchor-text did not improve retrieval effectiveness.
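The URL-based measures compared above (the Root/Subroot/Directory/File categorisation, URL length in characters, and directory depth as the number of slashes) can all be computed from the URL string alone. The sketch below is an illustrative approximation only: the tier rules shown are my reading of the four-tier scheme, not the exact definitions used in the thesis experiments, and the helper names are invented.

```python
# Sketch of the URL-based query-independent measures discussed above.
# The tier rules here approximate the Root/Subroot/Directory/File scheme;
# the thesis' exact definitions may differ in detail.
from urllib.parse import urlparse

def url_tier(url: str) -> str:
    path = urlparse(url).path
    for index in ("index.html", "index.htm", "default.htm"):  # treat /index.html like /
        if path.lower().endswith(index):
            path = path[: -len(index)]
    if path in ("", "/"):
        return "Root"
    if path.endswith("/"):
        return "Subroot" if path.strip("/").count("/") == 0 else "Directory"
    return "File"                      # path ends in a filename

def url_length(url: str) -> int:
    return len(url)                    # shorter URLs are favoured

def directory_depth(url: str) -> int:
    return urlparse(url).path.count("/")   # shallower URLs are favoured

for u in ("http://www.example.edu/",
          "http://www.example.edu/library/",
          "http://www.example.edu/library/branches/",
          "http://www.example.edu/library/branches/hours.html"):
    print(f"{url_tier(u):9s} len={url_length(u):2d} depth={directory_depth(u)}  {u}")
```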
Re-ranking query-dependent baselines (both content and anchor-text) on the basis of URL-type produced consistent benefit. This heuristic would be a valuable component of a home page finding system for web corpora with explicit hierarchical structure.

By contrast, in these experiments, unless Optimal re-ranking is used, hyperlink-based recommendation schemes do not achieve significant effectiveness gains. Even on the WT10gC collection, on which the re-ranking cutoffs were trained, the recommendation results were poor. For corpora of up to twenty million pages, the hyperlink recommendation methods do not appear to provide benefits in document ranking for a home page finding task. Similarly, little benefit has previously been found for relevance-based retrieval in the TREC web track [121]. An alternative means of biasing towards pages that are heavily linked-to, by re-weighting the anchor-text ranking formula to favour large volumes of anchor-text, is investigated in Chapter 8.

An ideal home page finding system would exploit both anchor-text (for superior performance when targeting popular sites) and document full-text information (to ensure that home pages with inadequate anchor-text are not missed). While the preliminary content+anchor-text baseline presented here goes some way to investigating combined performance, further work is needed to better understand whether this combination is optimal.

Further examination is required to determine how to provide the best all-round search effectiveness when home page queries are interspersed with other query types. Additional work is also required to determine whether evidence useful in home page finding is useful for other web retrieval tasks (such as Topic Distillation). These issues are investigated in Chapter 9, through the description and evaluation of a first-cut general-purpose document ranking function that incorporates web evidence.
Chapter 8

Anchor-text in web search

Full-text ranking algorithms have been used to score aggregate anchor-text evidence with some success, both in experiments within this thesis (see Chapters 6 and 7), and in experiments reported elsewhere [56]. When comparing the textual contents of document full-text and aggregate anchor-text it is clear that, in many cases, they differ markedly. For example, aggregate anchor-text sometimes contains extremely high rates of term repetition. Excessive term repetition may make a negligible (or even negative1) contribution to full-text evidence, but may be a useful indicator in anchor-text evidence. This is because each term occurrence could indicate an independent "vote" from an external author that the document is a worthwhile target for that term.

This chapter examines whether the Okapi BM25 and Field-weighted Okapi BM25 ranking algorithms, previously used with success in scoring both document full-text and aggregate anchor-text [56, 173], can be revised to better match anchor-text evidence.

The investigation is split into three sections. The first section presents an investigation of how the Okapi BM25 full-text ranking algorithm is applied when scoring aggregate anchor-text. This includes an analysis of how the document and collection statistics used in BM25 (and commonly used in other full-text ranking algorithms) might be modified to better score aggregate anchor-text evidence. The second section examines four different methods for combining the aggregate anchor-text evidence with other document evidence. The third and final section provides an empirical investigation of the effectiveness of the revised scoring methods, for both combined anchor-text and full-text evidence, and anchor-text alone.

8.1 Document statistics in anchor-text

This section examines how document statistics used in the Okapi BM25 ranking function, and other full-text ranking methods (see Section 2.3.2), apply to aggregate anchor-text evidence.

1 As it could indicate a spam document, which was designed explicitly to be retrieved by the search system in response to that query term.
8.1.1 Term frequency

In full-text document retrieval, term frequency (tf) is used to give some measure of the "aboutness" of a document (see Section 2.3.1.2). The underlying assumption is that if a document repeats a term many times, it is likely to be about that term.

The distribution of tfs in aggregate anchor-text appears to be quite different from that in document full-text. For example, an analysis of the term distribution in anchor-text and full-text for the "World Bank projects" home page2 illustrates how tfs can differ markedly. In the aggregate anchor-text for this document the term "projects" has a tf of 6798 (and makes up approximately 80% of all incoming anchor-text). By comparison, in the document full-text the term "projects" has a tf of only 5 (and makes up approximately 4% of the total document full-text).

As shown in Figure 8.1, when using the default term saturation parameter (k1 = 2) Okapi BM25 scores are almost flat beyond a tf of 10. This may not be a desirable property when scoring aggregate anchor-text, as each occurrence of a query term may be a separate vote that the term relates to the contents of the document. The early saturation of term contribution can be particularly problematic when combining document scores in a linear combination (see Section 2.5.1.1). For example, taking the "World Bank projects" home page again, if another corpus document (of average length) has only 60 occurrences of the term "projects" in incoming links (6 738 fewer occurrences than in the "World Bank projects" home page anchor-text), but the document full-text contains "projects" ten times (four more occurrences than in the full-text of the "World Bank projects" home page), that page will outperform the home page when measures are combined using a linear combination of Okapi BM25 scores (using default k1 and b parameters).

Changing the rate of saturation for anchor-text, through modification of the Okapi BM25 k1 value, is one method by which the impact of high aggregate anchor-text term frequencies might be changed. For example, Figure 8.1 illustrates that given a higher k1 value, the function saturates more slowly, thereby allowing for higher term counts before complete function saturation. However, if this evidence is to be combined with other document evidence (computed using different Okapi BM25 parameters) using a linear combination, then scores have to be renormalised.

This analysis suggests that when scoring aggregate anchor-text evidence the use of a much higher value of k1 may be effective.3 A change in saturation rate is explored below, through length normalising aggregate anchor-text contribution using the length of document full-text.

8.1.2 Inverse document frequency

Inverse document frequency (idf) is used in full-text ranking to provide a measure of the frequency of term occurrence in documents within a corpus, and thereby a measure of the importance of observing a term in a document or query (see Section 2.3.1.2).

2 Located at: http://www.worldbank.org/projects/
3 Time did not permit confirmation of the benefits of this.
[Figure 8.1: plot of document score against tf for BM25 with k1 = 0, 1, 2 and 10.]

Figure 8.1: Document scores achieved by BM25 using several values of k1 with increasing tf, assuming a document of average length, and N = 100 000, nt = 10.

The idf measure is likely to be useful when scoring aggregate anchor-text (i.e. assigning more weight to query terms that occur in fewer documents). However, it is unclear whether idf values should be calculated across all document fields4 at once (i.e. one idf value per document), or individually for each document field (i.e. one idf value per field, per document). Accordingly, two possible idf measures are proposed:

• Global inverse document frequency (gidf): A single idf value is computed per term.

• Field-based inverse document frequency (fidf): Multiple idf values are computed per term, one per field (i.e. per type of query-dependent evidence).

There are situations in which gidf and fidf scores vary considerably. For example, while the term "Microsoft" occurs in 16 330 documents in the TREC WT10g corpus (see Section 2.6.7.1), it occurs in the aggregate anchor-text for only 532 documents. "Microsoft" would have a low gidf in WT10g because many documents in the corpus mention it, but a relatively high fidf as few documents are the targets of anchor-text containing that term.

A comprehensive comparison of the effectiveness of gidf and fidf measures was not performed, although a limited examination was performed as part of the revised anchor-text formulations. A summary of the evaluated idf measures is presented in Table 8.1.

4 A field is a form of query-dependent evidence, for example document full-text, title or anchor-text.
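A small sketch may make the gidf/fidf distinction concrete. It assumes the BM25-style idf form log((N − nt + 0.5)/(nt + 0.5)) used elsewhere in this chapter; the toy documents, field names and counts below are invented, not corpus statistics.

```python
# Toy illustration of gidf vs fidf. Each "document" has two fields; idf
# uses the BM25-style form log((N - n_t + 0.5) / (n_t + 0.5)).
import math

docs = [
    {"fulltext": "microsoft research lab".split(),  "anchor": "microsoft home".split()},
    {"fulltext": "microsoft windows faq".split(),   "anchor": "windows faq".split()},
    {"fulltext": "university research".split(),     "anchor": "research home".split()},
    {"fulltext": "library opening hours".split(),   "anchor": "library".split()},
    {"fulltext": "campus maps".split(),             "anchor": "maps".split()},
]
N = len(docs)

def idf(n_t: int) -> float:
    return math.log((N - n_t + 0.5) / (n_t + 0.5))

def gidf(term: str) -> float:
    # One idf per term: a document counts if the term occurs in ANY field.
    return idf(sum(any(term in field for field in d.values()) for d in docs))

def fidf(term: str, field: str) -> float:
    # One idf per (term, field): only the named field is inspected.
    return idf(sum(term in d[field] for d in docs))

print(f"gidf('microsoft')           = {gidf('microsoft'):.2f}")
print(f"fidf('microsoft', 'anchor') = {fidf('microsoft', 'anchor'):.2f}")
```

As in the "Microsoft"/WT10g example above, the field-based value for the anchor-text field comes out higher than the global value, because fewer documents carry the term in their incoming anchor-text than mention it anywhere.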
    138 Anchor-text inweb search Abbreviation Description Described in BM25 Default Okapi BM25 Section 2.3.1.3 (calculates field-based idf values) BM25gidf Okapi BM25 with global idf statistics, such Section 8.1.2 that idf is calculated only once using all document fields BM25FW Default Field-weighted Okapi BM25 Section 2.5.2.1 (a single global idf value is calculated across all document fields) BM25FWfidf Field-weighted Okapi BM25 with field-based Section 8.1.2 idf values Table 8.1: Summary of idf variants used in ranking functions under examination. 8.1.3 Document length normalisation Document length normalisation is used in full-text ranking algorithms to reduce bias towards long documents. This bias occurs because the longer a piece of text, the greater the likelihood that a particular query term will occur in it (see Section 2.3.1.3). In Okapi BM25 the length normalisation function is controlled by b, with b = 1 en- forcing strict length normalisation, and b = 0 removing length normalisation. Using Okapi BM25 with the default length normalisation parameter (b = 0.75) [56, 186], slightly longer documents are favoured. This was shown to be effective when scoring document full-text in TREC ad-hoc tasks, as slightly longer (full-text) documents were found to be more likely to be judged relevant [186] (described in Section 2.3.1.3). The length of aggregate anchor-text is usually dependent on the number of in- coming links. Therefore, applying length normalisation to aggregate anchor-text, and thereby reducing the contribution of terms that occur in long aggregate anchor-text, is in direct contrast to the use of hyperlink recommendation algorithms. Aggregate anchor-text length is also much more variable than document full-text length, with many documents having little or no anchor-text, and some having a very large amount of incoming anchor-text (attributable to the power law distribution of links amongst pages, see Section 2.4). In the TREC .GOV corpus (see Section 2.6.7.1) the average full-text document length is around 870 terms.5 By comparison, the aver- age aggregate anchor-text length is only 25 words. An example of the negative effects of aggregate anchor-text length normalisation can be studied for the query “USGS” on the .GOV corpus. Figure 8.2 contains the aggregate anchor-text distribution for the home page of the United States Geolog- 5 Not including binary documents.
[Figure 8.2 (pie chart): USGS 23%, SURVEY 10%, GEOLOGICAL 10%, US 10%, HOME 8%, Other 39%.]

Figure 8.2: Aggregate anchor-text term distribution for the USGS home page (http://www.usgs.gov) from the .GOV corpus. This page has the highest in-degree of all .GOV pages (around 88 000 links) and an aggregate anchor-text length of around 170 000 terms.

[Figure 8.3 (pie chart): INFORMATION 50%, USGS 50%.]

Figure 8.3: Aggregate anchor-text term distribution for "http://nh.water.usgs.gov/USGSInfo" from the .GOV corpus. This page has 243 incoming links, and an aggregate anchor-text length of around 486 terms.
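Distributions like those in Figures 8.2 and 8.3 can be produced directly from a link table. The sketch below aggregates anchor-text per target URL and reports per-term proportions; the link records shown are invented and stand in for the .GOV link data.

```python
# Build aggregate anchor-text per target page from (source, target, text)
# link records and print each page's term distribution, as in Figures 8.2
# and 8.3. The link records below are invented.
from collections import Counter, defaultdict

links = [
    ("http://a.gov/1", "http://www.usgs.gov/",              "USGS home"),
    ("http://b.gov/2", "http://www.usgs.gov/",              "US Geological Survey"),
    ("http://c.gov/3", "http://www.usgs.gov/",              "USGS"),
    ("http://d.gov/4", "http://nh.water.usgs.gov/USGSInfo", "USGS information"),
]

aggregate = defaultdict(list)            # target URL -> list of anchor terms
for _source, target, text in links:
    aggregate[target].extend(text.lower().split())

for target, terms in aggregate.items():
    shares = {t: round(100 * c / len(terms))
              for t, c in Counter(terms).most_common()}
    print(target, shares)
```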
    140 Anchor-text inweb search ical Survey (USGS), the most highly linked-to document in the .GOV corpus. For comparison, Figure 8.3 contains the aggregate anchor-text distribution for a “USGS info” page (http://nh.water.usgs.gov/USGSInfo). The USGS home page has around 170 000 terms in its aggregate anchor-text (from around 88 000 incoming links), 34 000 of which (23%) are “USGS”. By contrast, http://nh.water.usgs. gov/USGSInfo has 486 terms in its aggregate anchor-text (from 243 incoming links), of which half (243) are “USGS”. Considering only aggregate anchor-text evidence and using the default Okapi BM25 length normalisation parameter (b = 0.75), the http://nh.water.usgs.gov/USGSInfo page outperforms the USGS home page for the query “USGS”! An illustration of the effects of Okapi BM25 length normalisation of aggregate anchor-text (and other document fields) for a one term query, is presented in Fig- ure 8.4. This Figure contains plots for both length normalised and unnormalised Okapi BM25 scores for documents with three different proportions of matching terms. The average document length (avdl) value is set to the average aggregate anchor-text length in the .GOV corpus (25 terms). The idf value is set such that the probability of encountering a term in a document is one-in-one-thousand (nt = 1000, N = 100 000). The Okapi BM25 k1 parameter is set to 2. In the top plot document scores are length normalised, and in the bottom they are not. The top plot shows that, when using a default length normalisation (b = 0.75) value, it is impossible for a document with only 25% of terms matching the query to be ranked above a document where 50% of the terms match the query, even when comparing a tf of 5 to a tf of 2 500 000. When using the Field-weighted Okapi BM25 method (BM25FW , described in Section 2.5.2.1) the negative effects associated with aggregate anchor-text length nor- malisation can be even more severe. The field-weighting method combines all doc- ument evidence (including aggregate anchor-text) into a single composite document and then uses the combined composite document length to normalise term contribu- tion. Due to document length normalisation, it is unlikely that a document with a large number of incoming links will be retrieved for any query. This is even the case if the document full-text as well as the aggregate anchor-text mention the query term more than any other document in the corpus. A summary of the evaluated length normalisation techniques is presented in Ta- ble 8.2. 8.1.3.1 Removing aggregate anchor-text length normalisation One approach to dealing with the length normalisation issue outlined above is to eliminate aggregate anchor-text length from consideration. In the Okapi BM25 for- mulation length normalisation is controlled by the b constant, so length normalisation can be removed by setting b = 0. The bottom plot of Figure 8.4 represents the Okapi BM25 scores for documents with three different proportions of matching terms, with no length normalisation (b = 0) and other parameters as specified in the section above (avdl = 25, N = 100 000, nt = 1000, and k1 = 2). Without length normalisation the proportion of terms that match is ignored and the sheer volume of matching anchor-
[Figure 8.4: two plots of BM25 score against document length (10 to 10^7 terms) for documents with 75%, 50% and 25% of terms matching the query; top plot k1 = 2, b = 0.75, bottom plot k1 = 2, b = 0.]

Figure 8.4: The effect of document length normalisation on BM25 scores for a single term query. Each line represents a document containing some proportion of terms that match a query term (i.e. 25% of terms match is a document where one-out-of-four document terms match the query term). The graph illustrates the change in scores when the number of document terms increases. For example, if 75% of terms match in a document that contains 1000 terms, a total of 750 term matches have been observed. BM25 scores are calculated assuming an avdl equal to the average aggregate anchor-text length in the .GOV corpus (25 terms), idf values are calculated using N = 100 000 and nt = 1000, and k1 is set to 2. The top plot shows the Okapi BM25 document scores when using the default length normalisation parameter (b = 0.75). The bottom plot gives the Okapi BM25 scores without length normalisation (b = 0). For the length normalised documents (top), even as the number of term matches is increased, the "proportion" of terms that match the query is still the most important factor. By comparison, without length normalisation (bottom), only the raw frequency of term matches is important.
text terms is considered. This favours documents that have a large number of incoming links and that may therefore be expected to achieve high hyperlink recommendation scores. The revised formulation of Okapi BM25 (with k1 = 2) for a document D, and a query Q, containing terms t is:

BM25nodln(D, Q) = \sum_{t \in Q} \frac{tf_{t,D} \times \log\left(\frac{N - n_t + 0.5}{n_t + 0.5}\right)}{2 + tf_{t,D}}    (8.1)

In the BM25FW formulation, the aggregate anchor-text length may be omitted when computing the composite document length. In these experiments the removal of aggregate anchor-text length in the BM25FW formulation is referred to as BM25FWnoanchdln.

Abbreviation      Description                                                 Described in
BM25              Default Okapi BM25 formulation (length normalisation       Section 2.3.1.3
                  using the field length)
BM25nodln         Okapi BM25 using no length normalisation                   Section 8.1.3.1
BM25contdln       Okapi BM25 using full-text length to normalise score       Section 8.1.3.2
BM25FW            Default Field-weighted Okapi BM25 (length normalised       Sections 2.5.2.1 & 8.2.2
                  using the composite document length, which is the sum
                  of all field lengths)
BM25FWnoanchdln   Field-weighted Okapi BM25 length normalised using the      Section 8.1.3.1
                  lengths of every field except for anchor-text

Table 8.2: Summary of document length normalisation variants in ranking functions under examination.

8.1.3.2 Anchor-text length normalisation by other document fields

Rather than using the length of aggregate anchor-text to normalise anchor-text scores, it might be more effective to normalise aggregate anchor-text using the length of another document field. For example, the length of document full-text could be used to normalise aggregate anchor-text term contribution. Document length is known to be useful query-independent evidence for some tasks (see Section 2.3.1.2) [186].

In experiments within this chapter, the use of document full-text length when scoring aggregate anchor-text in the Okapi BM25 formulation is referred to as BM25contdln. This approach may be more efficient than using individual field lengths, as only the full-text document lengths need to be recorded.
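A sketch of the variants in Table 8.2 is given below. The term weight follows the shape of Equation 8.1 (no (k1 + 1) numerator factor, which does not affect ranking within a single field); the thesis' exact BM25 formulation in Section 2.3.1.3 may differ in detail, and the example reuses the approximate "USGS" counts quoted earlier purely as an illustration.

```python
# Sketch of the anchor-text length-normalisation variants in Table 8.2.
# Term weights follow Equation 8.1's shape: idf * tf / (K + tf), where K
# collapses to k1 when b = 0. Counts in the example are approximate.
import math

def idf(n_t, N):
    return math.log((N - n_t + 0.5) / (n_t + 0.5))

def bm25_term(tf, dl, avdl, n_t, N, k1=2.0, b=0.75):
    K = k1 * ((1 - b) + b * dl / avdl)       # length-normalised saturation point
    return idf(n_t, N) * tf / (K + tf)

def bm25_nodln_term(tf, n_t, N, k1=2.0):
    # Equation 8.1: no length normalisation (b = 0).
    return bm25_term(tf, dl=0, avdl=1, n_t=n_t, N=N, k1=k1, b=0.0)

def bm25_contdln_term(tf, fulltext_len, avdl, n_t, N, k1=2.0, b=0.75):
    # BM25contdln: anchor-text tf normalised by the FULL-TEXT length.
    return bm25_term(tf, fulltext_len, avdl, n_t, N, k1, b)

# "USGS" example: ~34 000 matches in ~170 000 anchor terms (USGS home page)
# versus 243 matches in 486 anchor terms (USGSInfo page); anchor avdl ~= 25.
home, info = (34_000, 170_000), (243, 486)
print("default (b=0.75):",
      round(bm25_term(*home, 25, 1000, 100_000), 2), "vs",
      round(bm25_term(*info, 25, 1000, 100_000), 2))
print("no dln  (b=0)   :",
      round(bm25_nodln_term(home[0], 1000, 100_000), 2), "vs",
      round(bm25_nodln_term(info[0], 1000, 100_000), 2))
```

With default length normalisation the smaller, "purer" USGSInfo page scores above the heavily linked-to home page; with normalisation removed the ordering reverses, reflecting the trade-off discussed in this section.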
    §8.2 Combining anchor-textwith other document evidence 143 8.2 Combining anchor-text with other document evidence Four methods for combining anchor-text baselines with other document evidence are investigated: • BM25LC : a linear combination of Okapi BM25 scores; • BM25FW : a combination using Field-weighted Okapi BM25 (described in Sec- tion 2.5.2.1); • BM25HYLC : a linear combination of Okapi BM25 and Field-weighted Okapi BM25 scores; and • BM25FWSn, BM25FWSnI : a combination of the best scoring anchor-text snip- pet (repeated according to in-degree in BM25FWSnI ) with other document evi- dence using the Field-weighted Okapi BM25 method. In all cases the 2002 Topic Distillation task (TD 2002) was used to train combination and ranking function parameters. It was later observed (both in experiments in Chap- ter 9, and for reasons outlined in Section 2.6.7.2) that due to the informational nature of the TD 2002 task it may not have been the most appropriate training set for these navigational tasks [53]. Further gains may be achieved by re-training parameters for a navigational-based search task (as used in Section 9.1.5). 8.2.1 Linear combination In experiments within this chapter a linear combination of Okapi BM25 scores for full-text and aggregate anchor-text is explored. Document title is not considered sep- arately, and is scored as part of the document full-text baseline. A document score D for a query Q is then: BM25LC(D, Q) = BM25(C + T, Q) + αBM25(A, Q) (8.2) where C + T is the full-text and title of document D, A is the aggregate anchor-text for document D, and α is tuned according to the expected contribution of anchor-text evidence. Conceptually the linear combination assigns separate scores to document full-text and aggregate anchor-text, considering them as independent descriptions of docu- ment content. The BM25 linear combination constant was trained on the TD 2002 task, leading to α = 3. 8.2.2 Field-weighted Okapi BM25 The BM25FW formulation (see Section 2.5.2.1) includes three document fields: doc- ument full-text (content), aggregate anchor-text, and title. The weights for each of these fields were derived by Robertson et al. [173] for the TD 2002 task (content:1, anchor-text:20, title:50, k1 = 3.4, b = 0.85). In this chapter, the document fields scored
    144 Anchor-text inweb search are represented in brackets after BM25FW , with default fields of full-text, anchor-text and title indicated by BM25FW ((C, A, T), Q) for query Q. 8.2.3 Fusion of linear combination and field-weighted evidence A hybrid combination can be performed by grouping and scoring like document evi- dence with Field-weighted Okapi BM25, and combining independent document evi- dence using a linear combination of scores. The split examined in experiments in this chapter is between document-level and web-based evidence: the document full-text and title are scored independently from the externally related aggregate anchor-text. This approach is referred to as (BM25HYLC). BM25HYLC(D, Q) = BM25FW ((C, T), Q) + αBM25gidf (A, Q) (8.3) 8.2.4 Snippet-based anchor-text scoring An alternative to scoring documents based on their aggregate anchor-text is to score documents according to their best matching anchor-text snippet.6 Overlap between different forms of document evidence may be reduced through snippet-based rank- ing. When using full-text ranking algorithms to score aggregate anchor-text evidence, there may be overlap in the document features used to score documents. For exam- ple, a document that has a large number of in-links, is also likely to have a high tf for a particular term (see “USGS” example in Section 8.1.3). Additionally, the aggregate anchor-text for a document with a large number of incoming links is likely to be long, and so will be impacted by document length normalisation. Snippet-based scores are collected by scoring every snippet of anchor-text pointing to each document, and using the highest scoring snippet per document.7 These snip- pets are then combined with other document evidence using Field-weighted Okapi BM25 with snippet-based collection and document statistics.8 Whilst these may not be the best formulations of snippet statistics, they are consistent with the derivations used in Okapi BM25. Two snippet-based scoring functions were considered: BM25FWSn and BM25FWSnI . BM25Sn combines a single occurrence of the best scoring snippet with other document evidence using Field-weighted Okapi BM25. BM25FWSnI combines the best scoring snippet repeated according to document in-degree with other docu- ment evidence using Field-weighted Okapi BM25.9 The evaluated snippet based runs are reported in Table 8.3. 6 An anchor-text snippet is the anchor-text of a single link pointing to a document. 7 This is a computationally-expensive operation, as all non-duplicate snippets require individual scor- ing at query time. 8 The statistics were adapted as follows: term frequency was set to within snippet term frequency, in- verse document frequency to the frequency of terms within snippets and document length as the length of a particular snippet. 9 Time did not allow for the investigation of further snippet ranking combinations.
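Before the results, a small sketch may help distinguish the combination styles above. The scorers below are deliberately simplified stand-ins (constant idf, no length normalisation) so that the example is self-contained; they are not the thesis' ranking code, and the content/title weights are simply those quoted for BM25FW in Section 8.2.2.

```python
# Simplified stand-ins for the combination styles of Section 8.2.

def toy_bm25(field_terms, query_terms, k1=2.0):
    # Saturated tf with a constant idf of 1; illustrative only.
    return sum(field_terms.count(q) / (k1 + field_terms.count(q)) for q in query_terms)

def toy_bm25_fw(fields_with_weights, query_terms, k1=2.0):
    # Field-weighted style: pool weighted tfs into one composite document,
    # then saturate once (cf. BM25FW, Section 2.5.2.1).
    score = 0.0
    for q in query_terms:
        tf = sum(w * field.count(q) for field, w in fields_with_weights)
        score += tf / (k1 + tf)
    return score

def bm25_lc(doc, q, alpha=3.0):        # Equation 8.2: content+title vs anchor-text
    return toy_bm25(doc["fulltext"] + doc["title"], q) + alpha * toy_bm25(doc["anchor"], q)

def bm25_hylc(doc, q, alpha=3.0):      # Equation 8.3: field-weighted (C, T) plus anchor-text
    return (toy_bm25_fw([(doc["fulltext"], 1), (doc["title"], 50)], q)
            + alpha * toy_bm25(doc["anchor"], q))

doc = {"fulltext": "world bank projects database".split(),
       "title": "projects".split(),
       "anchor": "projects projects world bank projects".split()}
print(bm25_lc(doc, ["projects"]), bm25_hylc(doc, ["projects"]))
```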
Abbreviation   Description                                             Described in
BM25FWSn       Field-weighted Okapi BM25 using the best matching       Section 8.2.4
               anchor-text snippet as the anchor-text component.
BM25FWSnI      Field-weighted Okapi BM25 using the best matching       Section 8.2.4
               anchor-text snippet repeated according to document
               in-degree as the anchor-text component.

Table 8.3: Summary of snippet-based document ranking algorithms under examination.

8.3 Results

This section provides an empirical investigation of the effectiveness of the revised scoring methods. Effectiveness was evaluated using an automatic site-map based experiment on a university web, and using test collections from the 2002 and 2003 TREC web tracks. The TREC tasks studied were a named page finding task (NP2002), the 2003 combined home page finding / named page finding task (HP/NP2003), and the 2003 Topic Distillation task (TD2003). TREC web track corpus and task details are outlined in Section 2.6.7.

8.3.1 Anchor-text baseline effectiveness

The effectiveness of the aggregate anchor-text scoring techniques was evaluated using a set of 332 navigational queries over a corpus of 80 000 web pages gathered from a university web. The navigational queries were sourced using the automatic site map method (described in Section 2.6.5.3).

Ranking function   Score   Rank
BM25               61      62
BM25contdln        100     1
BM25nodln          100     1

Table 8.4: Okapi BM25 aggregate anchor-text scores and ranks for length normalisation variants. The "Score" and "Rank" are the normalised scores and ranks achieved for the correct answer to the query 'library' on the university corpus.
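The measures reported for the full query set in Table 8.5 below (MRR and P@1) are computed from the rank of the single correct answer for each navigational query. A minimal sketch, with invented rankings rather than the university-corpus runs:

```python
# MRR and P@1 over navigational queries: each query has one correct answer,
# MRR averages 1/rank of that answer, P@1 counts answers found at rank one.
# The rankings below are invented.

def evaluate(runs):
    reciprocal_ranks, at_one = [], 0
    for ranking, answer in runs:
        rank = ranking.index(answer) + 1 if answer in ranking else None
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
        at_one += int(rank == 1)
    return sum(reciprocal_ranks) / len(runs), at_one / len(runs)

runs = [
    (["u/library", "u/about", "u/maps"], "u/library"),   # correct answer at rank 1
    (["u/about", "u/library"],           "u/library"),   # rank 2
    (["u/maps", "u/news"],               "u/library"),   # not retrieved
]
mrr, p_at_1 = evaluate(runs)
print(f"MRR = {mrr:.2f}, P@1 = {p_at_1:.2f}")            # MRR = 0.50, P@1 = 0.33
```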
Table 8.4 shows the ranks and normalised scores achieved by the best answer in response to the query "library" when using only aggregate anchor-text. When incorporating aggregate anchor-text length normalisation in Okapi BM25, the correct answer was severely penalised, as the aggregate anchor-text length was 13 484 words (262 times the average length in the collection). This was despite the document having very high term frequency (tf) for the query term (1664). In contrast, both BM25contdln and BM25nodln placed the best answer at rank one, but scored it only slightly above many other candidate documents. In fact, the score was only 1% higher than the home page of a minor library whose tf was a factor of 7.5 lower. Due to the small difference in scores assigned for anchor-text evidence, if these scores were combined with other document scores in a linear combination, the ranking of documents might change. To increase the contribution of strong anchor-text matches, the weight of anchor-text evidence must be increased and/or the saturation rate of anchor-text changed. An anchor-text ranking function that does not saturate anchor-text term contribution is presented in the following chapter (AF1, in Section 9.1.3.1).

Table 8.5 shows results for the full set of 332 navigational queries processed over the university corpus. Wilcoxon tests show that using full-text document length (BM25contdln) to length normalise aggregate anchor-text significantly (p < 0.02) improved effectiveness relative to the case of no length normalisation (BM25nodln). Further, both BM25contdln and BM25nodln were superior to the default Okapi BM25 formulation (p < 10^-5).

Ranking function   MRR    P@1
BM25               0.61   0.47
BM25contdln        0.72   0.63
BM25nodln          0.70   0.61

Table 8.5: Effectiveness of Okapi BM25 aggregate anchor-text length normalisation techniques on the university corpus. MRR depicts the Mean Reciprocal Rank of the first correct answer; P@1 is precision at 1, the proportion of queries for which the best answer was returned at rank one.

8.3.2 Anchor-text and full-text document evidence

This section examines the results from experiments that combine the new anchor-text scoring methods with document full-text evidence. Combined runs are evaluated using TREC web track test collections from 2002 and 2003 (discussed in Section 2.6.7).
    §8.3 Results 147 8.3.2.1Field-weighted Okapi BM25 combination Table 8.6 shows the results from the Field-weighted Okapi BM25-based (BM25FW ) experiments. Task Ranking Function C A T P@1 P@10 MRR Sig. NP2002 BM25FW 1 50 20 0.59 0.82 0.68 - NP2002 BM25FWnoanchdln 1 50 20 0.59 0.87 0.68 - NP2002 BM25FW 1 500 20 0.49 0.78 0.60 - NP2002 BM25FWnoanchdln 1 500 20 0.52 0.85 0.63 - TD2003 BM25FW 1 50 20 0.10 0.09 0.10 - TD2003 BM25FWnoanchdln 1 50 20 0.18 0.09 0.13 *+ TD2003 BM25FW 1 500 20 0.17 0.08 0.09 - TD2003 BM25FWnoanchdln 1 500 20 0.20 0.09 0.13 *+ HP&NP2003 BM25FW 1 50 20 0.48 0.76 0.58 - HP&NP2003 BM25FWnoanchdln 1 50 20 0.63 0.85 0.71 *+ HP&NP2003 BM25FW 1 500 20 0.36 0.67 0.46 - HP&NP2003 BM25FWnoanchdln 1 500 20 0.59 0.84 0.68 *+ Table 8.6: Effectiveness of Field-weighted Okapi BM25. Three TREC web track tasks were evaluated; “NP2002” is the 2002 TREC web track named page finding task; “TD2003” is the 2003 TREC web track Topic Distillation task; and “HP&NP2003” is the 2003 TREC web track combined home page / name page finding task. “C” is the content weight (1 by default), “A” is the aggregate anchor-text weight (50 by default) and “T” is the title weight (20 by default). “Sig.” indicates whether improvements were significant (“*+”) over the BM25FW (C, A, T) baseline. Improvements for no length normalisation were only significant for TD2003 and the HP&NP2003 task. Performance decreased dramatically when up-weighting the aggregate anchor-text field while including aggregate anchor-text in composite document length. The removal of aggregate anchor-text length from composite document lengths in the Field-weighted Okapi BM25 model (BM25FWnoanchdln) significantly improved performance in two-out-of-three tasks, and did not affect performance in the other. The results show that increasing the weight of aggregate anchor-text by an order of magnitude in BM25FW exacerbates the negative effects of including aggregate anchor-text length in the composite document length. Combining BM25FW scores with hyperlink recommendation evidence might go some way to re-balancing the re- trieval of highly linked pages. The investigation of this potential is left for future work. Function parameters were optimised for composite document lengths that included aggregate anchor-text. It is likely that improvements achieved through the removal of aggregate anchor-text length might be increased through re-tuning Okapi BM25FW ’s document length (b) and saturation (k1) parameters. This is also left for future work.
Anchor-text snippets in Field-weighted Okapi BM25

The performance of the anchor-text snippet-based ranking functions is presented in Table 8.7. Both snippet-based runs performed poorly by comparison to the BM25FW runs. The snippet-based runs were also far less efficient than aggregate anchor-text runs, as statistics were calculated and stored for each link rather than each document (and there is an order of magnitude more links than documents in the .GOV corpus). Further investigation would be required to determine whether a snippet-based ranking could be effective. For example, effectiveness might be improved by re-optimising the Okapi BM25 parameters, or re-weighting snippets according to their origin (e.g. according to whether they are within-site or cross-site links) or according to some notion of source authority.

Ranking function   P@1    P@10   MRR    Sig.
BM25FW             0.10   0.09   0.10   -
BM25FWnoanchdln    0.18   0.09   0.13   *+
BM25FWSnip         0.06   0.04   0.04   *-
BM25FWSnipIDG      0.12   0.06   0.06   *-

Table 8.7: Effectiveness of anchor-text snippet-based ranking functions. The snippet runs performed poorly by comparison to the BM25FW runs for the 2003 TREC web track Topic Distillation task. "Sig." indicates whether improvements ("*+") or losses ("*-") were significant compared to the BM25FW(C, A, T) baseline.
• The "pure" linear combination performs poorly, most likely due to the use of aggregate anchor-text length normalisation.

Ranking function                       Comb   P@1    P@10   MRR    Sig.
BM25FW                                 FW     0.10   0.09   0.10   -
BM25FWnoanchdln                        FW     0.18   0.09   0.13   *+
BM25gidf(C) + BM25gidf(A)              LC     0.24   0.10   0.12   *+
BM25gidf(C) + BM25gidf,contdln(A)      LC     0.18   0.13   0.16   *+
BM25(C) + BM25contdln(A)               LC     0.18   0.12   0.16   *+
BM25FW(C, T) + BM25contdln(A)          HYLC   0.22   0.14   0.17   *+

Table 8.8: Effectiveness of the evaluated combination methods for TD2003. "TD2003" is the 2003 TREC web track Topic Distillation task. C is document full-text, A is aggregate anchor-text, and T is title. FW uses a Field-weighted Okapi BM25 combination, LC is a linear combination, and HYLC is a fusion of Field-weighted Okapi BM25 and linear combination. "Sig." indicates whether improvements ("*+") or losses ("*-") were significant compared to the BM25FW(C, A, T) baseline.

Table 8.9 contains the results for experiments on further TREC tasks. In all cases the linear combination methods are outperformed by the field-weighting method. This demonstrates potential differences between the tasks studied, and suggests that no one method considered here will achieve high effectiveness on all search tasks.

8.4 Discussion

The results for the Okapi BM25 modifications show that effectiveness was improved when length normalisation was not performed on aggregate anchor-text. Additional gains were achieved when aggregate anchor-text was normalised using document full-text length. The reason for this may be that full-text document length provides useful query-independent document evidence.

The removal of aggregate anchor-text from composite document lengths in the Okapi BM25FW formula improved or maintained retrieval effectiveness for all evaluated tasks. A re-tuning of the field-weighting weights without aggregate anchor-text in composite document length is required to determine whether further improvements can be attained. The removal of aggregate anchor-text length from composite document length normalisation favours documents with long aggregate anchor-text, as it is more likely that a link to a document containing the query term will be found. This preference for long aggregate anchor-text is similar to biasing towards heavily linked-to pages (except that a term match is assured). This may be a method by which query-independent hyperlink recommendation evidence can be more easily combined with query-dependent evidence.
    150 Anchor-text inweb search Task Ranking function Comb. P@1 P@10 MRR NP&HP2003 BM25FWnoanchdln FW 0.60 0.85 0.69 NP&HP2003 BM25(C) + BM25(A) LC 0.26 0.57 0.36 NP&HP2003 BM25gidf (C) + BM25gidf ,nodln(A) LC 0.47 0.71 0.56 NP&HP2003 BM25FW (C, T) + BM25contdln(A) HYLC 0.51 0.76 0.60 NP2002 BM25FWnoanchdln FW 0.56 0.87 0.67 NP2002 BM25(C) + BM25(A) LC 0.33 0.65 0.44 NP2002 BM25gidf (C) + BM25gidf ,nodln(A) LC 0.26 0.51 0.35 NP2002 BM25FW (C, T) + BM25contdln(A) HYLC 0.31 0.61 0.30 Table 8.9: Effectiveness of the evaluated combination methods for NP2002 and NP&HP2003. “NP2002” is the 2002 TREC web track named page finding task; and “HP&NP2003” is the 2003 TREC web track combined home page / name page finding task. “C” is document full-text, “A” is aggregate anchor-text, and “T” is title. FW uses a Field- weighted Okapi BM25 combination, LC is a linear combination, and HYLC is a fusion of Field-weighted Okapi BM25 and linear combination. Results for the hybrid combination strategy illustrate the benefits of treating document-level and web-based evidence as separate document descriptions. The hy- brid combination approach significantly outperformed other methods, equalling the best run submitted to TREC 2003 (discussed in Chapter 9). Computing a gidf and using document full-text to normalise all document fields was also an effective ap- proach, improving retrieval effectiveness as well as allowing for potential gains in efficiency by reducing the number of statistics per term. A “pure” linear combination of document evidence was significantly less effective and more costly (as document statistics were required for each form of evidence). In general, the results in this chapter illustrate an interesting trade-off when deal- ing with aggregate anchor-text. The trade-off is whether to favour documents which contain the most occurrences of a particular term in anchor-text (by employing no anchor-text aggregate length normalisation), or to favour documents whose aggregate anchor-text contains the greatest percentage of anchor-text that matches the query term (by employing full aggregate anchor-text length normalisation). The choice is akin to trading off the quantity of anchor-text for the “purity” of the aggregate anchor-text. If aggregate anchor-text is heavily length normalised, thereby encourag- ing anchor-text purity, hyperlink recommendation evidence could be used to counter the preference for short aggregate anchor-text by up-weighting pages with high link popularity. How best to address these issues is left for future work.
Chapter 9

A first-cut document ranking function using web evidence

The first-cut ranking function explored in this chapter combines document and web-based evidence found effective in previous experiments within this thesis. A weighted linear combination was used to combine this evidence. The weights for evidence and combination parameters were tuned for three sets of navigational queries using a hill-climbing algorithm. The tuned ranking function was evaluated through submissions to the TREC 2003 web track, and on data spanning several small-to-medium sized corporate web collections.

9.1 Method

The following sections outline:

• How the effectiveness of the ranking function was tested;
• The document-level and web evidence used in the ranking function;
• How document evidence was combined in the ranking function;
• The training data, and how the data were used to tune the ranking function; and
• The methods used to address the combined home page / named page finding task.

9.1.1 Evaluating performance

The first-cut ranking function was used to generate runs for participations in both the Topic Distillation (TD2003), and the combined home page / named page finding (HP/NP2003) tasks of the 2003 TREC web track (described in Section 2.6.7.2).

The goal of the TD2003 task was to study how well systems could find entry points to relevant sites given a broad query. The Topic Distillation task is nominally an informational task (see Section 2.6.3). However, the focus in Topic Distillation is quite
    152 A first-cutdocument ranking function using web evidence different from previous informational tasks studied in TREC. Topic Distillation stud- ies the retrieval of relevant resources, rather than relevant documents. The TD2003 submission studied in this chapter sought to determine whether the first-cut ranking function trained for navigational search (especially home page finding queries) would perform well for Topic Distillation. This training set was chosen in an effort to favour the retrieval of relevant resources rather than documents. The goal of the HP/NP2003 task was to study how well systems could retrieve both home page documents and other documents specified by their name, without prior knowledge of which queries were for named pages, and which were for home pages. The HP/NP2003 submission studied in this chapter examined different meth- ods for combining home page and named page based tunings into a single run. This included an investigation of whether best performance was achieved by tuning for both tasks at once, using a training set containing both types of queries, or through “post hoc” fusion of home page and named page tuned document rankings. A series of follow-up experiments used corpora gathered from several small cor- porate webs to provide a preliminary study of how the ranking function performed on diverse corporate-sized webs. In each case the effectiveness of the ranking function studied was compared to that of the incumbent search system. 9.1.2 Document evidence The ranking function included three important forms of document evidence: full-text, title and URL length. The query-dependent evidence (full-text and title) was scored using Okapi BM25 with tuned k1 and b parameters. The k1 and b parameters were tuned once per run rather than individually per field. The application of term stemming was also eval- uated (using the Porter stemmer [163], described in Section 2.3). Strict term coordi- nation was applied for all query-dependent evidence, with documents containing the most query terms ranked first. If combining Okapi BM25 scores computed for a mul- tiple term query in a linear combination without term co-ordination, a document that matches a single query term in multiple document fields can outperform a document that contains all query terms in a single field. The use of strict term co-ordination ensures that the first ranked document contains the maximum number of matched query terms in a document field. 9.1.2.1 Full-text evidence Okapi BM25 was used to score document full-text evidence (BM25(C)). Prior to scoring full-text evidence all HTML tags and comments were removed. For efficiency reasons global document length and global inverse document frequency (gidf ) values were used (described in Section 8.1.2).
9.1.2.2 Title evidence

Title text was scored independently of other document evidence using BM25 (BM25(T)). For efficiency reasons the BM25 title formulation used global document length and global inverse document frequency (gidf) values (described in Section 8.1.2).

9.1.2.3 URL length

URL lengths (URLlen) were capped at 127 for efficiency reasons. URLs longer than 127 characters were recorded as being 127 characters long.

9.1.3 Web evidence

Anchor-text and two forms of in-degree were included in the ranking function. PageRank and (simple) in-degree were not considered because of the relatively poor performance observed in previous experiments. Instead, two important sub-types of in-degree were examined: off-site and on-site in-degree [55].

9.1.3.1 Anchor-text

The Anchor Formula 1 (AF1) proposed here is an alternative to the revised anchor-text models presented in the previous chapter. In AF1, term frequency (tf) values are not saturated (as described in Section 8.1.1) and document length normalisation is removed (as described in Section 8.1.3). When AF1 values are multiplied by 1.7 (using the KWT parameter, see Section 9.1.4), the curve is similar to the BM25 saturation of an average length document for the first three term occurrences (with default Okapi parameters, see Figure 9.1).

The score for a document D, for query Q, over terms t, with aggregate anchor-text A according to AF1 is:

AF1(D, Q) = \sum_{t \in Q} \log(tf_{t,D} + 1) \times gidf_t    (9.1)

As term frequency scores in AF1 never saturate, term coordination must be enforced. Without term coordination a single term in a query may dominate. For example, if seeking "Microsoft Research Cambridge", the term "Microsoft" may dominate, potentially leading to the retrieval of a page (such as the Microsoft home page) that matches "Microsoft" strongly in its aggregate anchor-text but never matches "Research" or "Cambridge".

9.1.3.2 In-degree

The log values of on-site (IDGon) and off-site (IDGoff) in-degrees were normalised (according to the highest in-degree value for the collection) and quantised to 127 values (for efficiency reasons). This may have reduced ranking effectiveness, although
experience with the retrieval system in practical use suggests that there are minimal adverse effects associated with this normalisation.

[Figure 9.1: plot of document score against tf for AF1 and for BM25 with k1 = 0, 1, 2 and 10.]

Figure 9.1: Document scores achieved by AF1 and BM25 for values of tf. A document of average length is assumed, with the likelihood of encountering a term in the corpora one-in-one-thousand (using idf values of N = 100 000 and nt = 100).

9.1.4 Combining document evidence

The ranking formulation includes four key components: a query-dependent score and three query-independent scores.

• Query-dependent evidence: this component is a linear combination of document full-text, title, and AF1 anchor-text scores. The relative contribution of AF1 is controlled through the KWT parameter. The relative contribution of query-dependent evidence is controlled using the QD parameter. Full-text, title and anchor-text are combined using a linear combination with gidf values, a method previously demonstrated to be effective for home page and Topic Distillation tasks in Chapter 8. Term stemming was also evaluated (Stem).

• On-site in-degree: this component is the log normalised number of incoming on-site links (quantised to 127 values). The contribution of this component is controlled using the ON parameter.

• Off-site in-degree: this component is the log normalised number of incoming off-site links (quantised to 127 values). The contribution of this component is controlled using the OFF parameter.

• URL length: this component is the length, in characters, of the URL (for up to 127 characters). The contribution of this component is controlled using the URL parameter.
Accordingly, the score for a document D is computed by:

S(D, Q) = QD × (BM25gidf(C, Q) + BM25gidf(T, Q) + KWT × AF1(A, Q)) / max(BM25gidf(C, Q) + BM25gidf(T, Q) + KWT × AF1(A, Q))
        + ON × (IDGon_D / max(IDGon))
        + OFF × (IDGoff_D / max(IDGoff))
        + URL × ((max(URLlen) − URLlen_D) / max(URLlen))

Documents must also fulfil the constraints imposed through term coordination.

9.1.5 Test sets and tuning

Eight parameters (k1, b, KWT, QD, ON, OFF, URL and Stem) were tuned for each test set. The values explored for each parameter are as follows:

• k1 in steps of 0.25 between 0 and 4;
• b in steps of 0.25 between 0 and 1;
• KWT in steps of 1.7 between 0 and 17;
• QD, ON, OFF in steps of 2 between 0 and 20;
• URL in steps of 4 between 0 and 40; and
• Stem on or off.

The parameters were tuned using three test sets:

• Home page set (HPF): this training set was based on the http://first.gov government home page list. Queries and results were extracted from this document using the automatic site map method (see Section 2.6.5.3). The set consists of 241 queries whose results were home pages. The full query and result set is included as Appendix G.

• Named page set (NPF): this training set consists of the queries and relevance judgements (qrels) used in the TREC 2002 named page finding task (described in Section 2.6.7.2). The set consists of 150 queries whose results are named pages.1

• Both sets of queries (BOTH): this consists of all queries and relevance judgements used in HPF and NPF.

There are inherent limitations in the training sets employed. The set of home pages was taken from a .GOV portal, which may inadvertently have favoured prestigious, or larger and more popular home pages. Further, the named page tuning includes some home pages that were included in the 2002 NP task. This may have biased training towards home page queries. The BOTH set of queries included a disproportionate number of home page queries due to the presence of home pages in the NPF set, and because the HPF set was larger than the NPF set.

1 The results for some of the named page queries were home pages.
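The scoring formula above can be read as the following sketch. Per-document statistics (the BM25 field scores, quantised in-degrees, URL length and the set of matched query terms) are assumed to be precomputed, the dictionary keys are mine rather than the Panoptic implementation's, and the parameter settings in the usage example simply borrow the HPF-tuned values reported later in Table 9.1.

```python
# Sketch of the first-cut score S(D, Q) defined above. Per-document BM25
# scores, quantised in-degrees, URL lengths and matched-term sets are
# assumed to be precomputed; field names here are illustrative.
import math

def af1(anchor_tf, query_terms, gidf):
    # Equation 9.1: unsaturated, length-unnormalised anchor-text evidence.
    return sum(math.log(anchor_tf.get(t, 0) + 1) * gidf[t] for t in query_terms)

def score(doc, query_terms, gidf, maxima, QD, KWT, ON, OFF, URL):
    qd = doc["bm25_content"] + doc["bm25_title"] + KWT * af1(doc["anchor_tf"], query_terms, gidf)
    s = (QD  * qd / maxima["qd"]                     # maxima["qd"]: best query-dependent score among candidates
         + ON  * doc["idg_on"]  / maxima["idg_on"]
         + OFF * doc["idg_off"] / maxima["idg_off"]
         + URL * (maxima["urllen"] - doc["urllen"]) / maxima["urllen"])
    # Strict term coordination: documents matching more query terms always
    # rank first; the combined score only orders documents within a level.
    return (len(query_terms & doc["matched"]), s)

query  = {"library"}
gidf   = {"library": 2.3}
maxima = {"qd": 250.0, "idg_on": 127, "idg_off": 127, "urllen": 127}
doc = {"bm25_content": 4.1, "bm25_title": 2.0, "anchor_tf": {"library": 1664},
       "idg_on": 90, "idg_off": 60, "urllen": 28, "matched": {"library"}}
print(score(doc, query, gidf, maxima, QD=15, KWT=11.9, ON=1, OFF=6, URL=38))
```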
9.1.6 Addressing the combined HP/NP task

Three approaches for applying the ranking function to the combined HP/NP task were evaluated.

The first method was a tuning of parameters for both tasks simultaneously (i.e. using the BOTH tuning to generate a run). The second method summed the document scores achieved for each tuning. This is equivalent, in rank fusion and distributed IR terminology, to performing a combSUM of document HPF and NPF scores. The third and final method interleaved the ranked results for each run by taking a document from the top of each ranking in turn, and removing any already seen (duplicate) documents. For example, the first result in an HP/NP interleaving is the first ranked document for the HPF tuning, and the second result is the first ranked document for the NPF tuning.2 In an attempt to improve early precision, the interleaving order was swapped if a keyword indicative of a named page finding query was observed.3

9.2 Tuning

Parameters were tuned using a hill climbing algorithm with a complete exploration of two parameters at a time (at each step the parameters which achieved the highest retrieval effectiveness were stored and used for other tunings). The tuning stopped when a full tuning cycle completed without change in tuned values. Plots of the tuning process are provided in Figures 9.2 and 9.3. Figure 9.2 provides an example of the concurrent tuning of two function parameters (in this case the Okapi BM25 k1 and b values). Figure 9.3 shows plots for the rest of the tuning cycle.

The tuned values and effectiveness of the ranking function on the three training sets (HPF, NPF and BOTH) are reported in Table 9.1. The optimal tunings derived for each task differed significantly. The only consistent result was that the query-dependent component was important in all tunings.

The following observations can be made from the Home Page Finding (HPF) parameter tunings:

• The tuned Okapi BM25 term saturation parameter (k1 = 3.6) is higher than the default parameter of k1 = 2. This indicates that home pages may contain page naming text several times and that matching their name more than once is a good indicator of a home page match.

• The tuned Okapi BM25 length normalisation parameter (b = 1) is higher than the default parameter of b = 0.75. The tuning favoured a strict length normalisation of document full-text.4 This suggests that longer full-text content is no more likely to be a relevant home page.

2 So long as that document was not the same document retrieved at the first rank by the HPF tuning, in which case the next document in the NPF ranking is taken.
3 Query terms were selected from last year's query set and included terms such as "page", "form" and "2000".
4 Note that length normalisation is not present in the AF1 measure.
[Figure 9.2: surface plot of MRR@10 over the k_1 (0–4) and b (0–1) grid, with the remaining parameters held at their current values; the best MRR@10 found in this step is approximately 0.737.]

Figure 9.2: A plot illustrating the concurrent exploration of Okapi BM25 k1 and b values using the hill-climbing function. The values at which the best performance is achieved are stored (the highest point in the plot, represented by a "+") and used when tuning other values. The tuning stops when a full iteration of the tuning cycle completes without change in tuned values.
[Figure 9.3: three further MRR@10 surface plots from the same tuning iteration, exploring the anchor-text and content weights, the on-site and off-site in-degree weights, and the URL-length weight in turn, with the remaining parameters held at their current best values; the best MRR@10 rises from approximately 0.745 to 0.751 across these steps.]

Figure 9.3: A full iteration of the hill-climbing function. The first step in this iteration is illustrated in Figure 9.2. The tuning of parameters was performed using a hill climbing algorithm with complete exploration of two parameters at a time. The highest point (best performance) is represented by a "+", and the parameter values at that point are stored and used when tuning other values. The tuning stops when a full iteration completes without change in tuned values.
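The procedure shown in Figures 9.2 and 9.3 amounts to pairwise coordinate ascent over the parameter grids of Section 9.1.5. The sketch below is illustrative only: it uses a stand-in objective in place of MRR@10 on the training queries, and only a subset of the parameters.

```python
# Pairwise hill-climbing over parameter grids, as in Figures 9.2 and 9.3:
# exhaustively explore two parameters at a time, keep the best setting,
# and stop when a full cycle produces no change. The objective below is a
# stand-in for MRR@10 on a training set, and only four parameters are shown.
from itertools import combinations, product

GRID = {
    "k1":  [i * 0.25 for i in range(17)],   # 0 .. 4 in steps of 0.25
    "b":   [i * 0.25 for i in range(5)],    # 0 .. 1 in steps of 0.25
    "KWT": [i * 1.7  for i in range(11)],   # 0 .. 17 in steps of 1.7
    "QD":  list(range(0, 21, 2)),           # 0 .. 20 in steps of 2
}

def tune(objective, start):
    params, improved = dict(start), True
    while improved:                          # one full tuning cycle per iteration
        improved = False
        for p1, p2 in combinations(GRID, 2):
            best = objective(params)
            for v1, v2 in product(GRID[p1], GRID[p2]):
                trial = {**params, p1: v1, p2: v2}
                if objective(trial) > best:
                    best, params, improved = objective(trial), trial, True
    return params

# Stand-in objective with a known optimum, just to exercise the procedure.
target = {"k1": 3.5, "b": 1.0, "KWT": 11.9, "QD": 14}
objective = lambda p: -sum(abs(p[k] - target[k]) for k in target)
print(tune(objective, {name: values[0] for name, values in GRID.items()}))
```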
Test Set   MRR     k1    b     KWT    QD, ON, OFF, URL   Stem
HPF        0.846   3.6   1     11.9   15, 1, 6, 38       Y
NPF        0.522   4     0.2   1.7    18, 2, 0, 1        N
BOTH       0.715   0.8   0.4   8.5    20, 0, 6, 20       Y

Table 9.1: Tuned parameters and retrieval effectiveness. Parameters are as described in Section 9.1.5. "MRR" is the Mean Reciprocal Rank, as described in Section 2.6.6.2. "HPF" is the home page training set. "NPF" is the named page training set. "BOTH" contains both HPF and NPF.

• Anchor-text in the form of AF1 was important, with the KWT parameter performing best at 11.9.

• The contribution of off-site and on-site links was small, with off-site links more useful than on-site links.

• URL length once again proved to be an important contributor in home page finding.

• Stemming improved retrieval effectiveness.

From the Named Page Finding (NPF) parameter tunings:

• Like for the HPF tunings, a higher than normal Okapi BM25 k1 value was effective.

• Unlike in the HPF tunings, length normalisation did not improve effectiveness, with a low b value found to perform best.

• The contribution of URL, on-site in-degree, and off-site in-degree was small.

• Anchor-text was useful for the NPF task, although its contribution was far less than in the HPF task.

• Stemming adversely affected retrieval effectiveness.

In general the BOTH tuning was similar to the HPF tuning (indicating that home pages dominated in the tuning). The only large differences between BOTH and HPF were in the form of a much smaller tuned term saturation value (k1 = 0.8), and less length normalisation (b = 0.4).
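Before turning to the combined-task results in Section 9.2.1, the two fusion approaches described in Section 9.1.6 (combSUM of the HPF- and NPF-tuned scores, and rank interleaving with duplicate removal) can be sketched as follows. The keyword trigger that swaps the interleaving order is omitted, and the inputs are invented.

```python
# combSUM and interleaving of two tuned runs (Section 9.1.6). The keyword
# trigger that swaps the interleaving order is omitted; data are invented.

def comb_sum(hp_scores, np_scores):
    fused = {u: hp_scores.get(u, 0.0) + np_scores.get(u, 0.0)
             for u in set(hp_scores) | set(np_scores)}
    return sorted(fused, key=fused.get, reverse=True)

def interleave(hp_ranking, np_ranking):
    seen, merged = set(), []
    for pair in zip(hp_ranking, np_ranking):   # HP first, then NP, at each depth
        for doc in pair:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

hp = ["gov/home", "gov/about", "gov/report-2000"]
np = ["gov/report-2000", "gov/home", "gov/form"]
print(interleave(hp, np))   # ['gov/home', 'gov/report-2000', 'gov/about', 'gov/form']
print(comb_sum({"gov/home": 2.0, "gov/about": 1.0}, {"gov/home": 0.5, "gov/form": 1.2}))
```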
9.2.1 Combining HP and NP runs for the combined task

Results for the HP/NP combination methods tested on the combined test set (BOTH) are presented in Table 9.2.

Combination Method                  MRR on training set
Tuned for BOTH                      0.758
HPF and NPF combSUM                 0.489
HPF and NPF interleaved (HP,NP)     0.734

Table 9.2: Results for combined HP/NP runs on the BOTH training set.

The interleaving of runs performed similarly to tuning using BOTH types of query. This may be an effective method for combining the runs without prior tuning information. The performance of the linear combination (combSUM) was relatively poor.

9.3 Results

This section presents results from the empirical studies of the first-cut ranking function. The ranking function was evaluated, using the parameter tunings described above, for the two TREC 2003 web track tasks, and for navigational search on several corporate webs.

9.3.1 TREC 2003

This section sets out the results from the official TREC 2003 web track submissions. The Topic Distillation runs (csiro03td–) are presented first, followed by the combined HP/NP finding task runs (csiro03ki–).

9.3.1.1 Topic Distillation 2003 (TD2003) results

Results for the TD2003 web track task are presented in Table 9.3. The best of these runs (csiro03td03) achieved the highest performance of any system submission. This run used the HPF tuning and incorporated stemming. Further observations based on the Topic Distillation results are:

• The tuned k1 and b values offered some improvement (csiro03td01 versus csiro03td05). The effectiveness of the length normalisation parameter used (b = 1) suggests that longer pages are no more likely to be relevant in Topic Distillation.

• The new anchor-text ranking function AF1 was particularly effective, achieving gains of up to 60% (csiro03td03 versus not sub 05).
Description                                     Average R-Prec    Run Id
HPF (Stem = ON, ON = 0, OFF = 0)                0.170             not sub 01
HPF (Stem = ON)                                 0.164             csiro03td03
HPF (Stem = ON, ON = 0, OFF = 0, URL = 0)       0.149             not sub 02
HPF                                             0.144             csiro03td01
HPF (ON = 0, OFF = 0)                           0.143             not sub 03
HPF (k1 = 2, b = 0.75)                          0.127             csiro03td05
NPF                                             0.117             not sub 04
HPF (ON = 0, OFF = 0, URL = 0)                  0.116             csiro03td02
HPF (Stem = ON, KWT = 0)                        0.108             not sub 05
HPF (KWT = 0)                                   0.099             csiro03td04
HPF (Stem = ON, No Red./Dup.)                   0.147             not sub 06
HPF (No Red./Dup.)                              0.138             not sub 07
HPF (KWT = 0, No Red./Dup.)                     0.116             not sub 08
HPF (Stem = ON, KWT = 0, No Red./Dup.)          0.106             not sub 09

Table 9.3: Topic Distillation submission summary. "HPF" indicates that the home page finding tunings were used (tunings in Table 9.1). "NPF" indicates that the named page finding tunings were used (tunings also in Table 9.1). Other description notes indicate variations from the tuned parameters. "Run Id" reports the run identifier used in TREC experiments. "No Red./Dup." indicates that redirect and duplicate URL information was not used. Further runs were computed post hoc (not sub –).

• Hyperlink recommendation evidence was not effective. A post hoc run achieved slightly better performance (4%) when hyperlink recommendation evidence was removed (not sub 01).

• URL length evidence appeared to slightly improve retrieval effectiveness (not sub 01 versus not sub 02).

• The NPF tuning performed worse than the HPF tuning (not sub 04), with an associated drop in MRR of around 20%.

• A linear combination of query-dependent scores from document-level and web-based evidence, where both scores were computed using gidf values, was effective.

• The redirect and duplicate information (collected using methods outlined in Chapter 3) was important when scoring anchor-text using AF1. Without redirect and duplicate information, retrieval effectiveness was reduced by 15% (csiro03td03 versus not sub 06).

The results from the Topic Distillation task support the notion that the home page training set favoured prominent resources (an advantage for Topic Distillation).
The results also illustrate the benefits of the new anchor-text ranking component AF1, especially when used with stemming, and with redirect and duplicate URL information.

9.3.1.2 Combined HP/NP 2003 (HP/NP2003) results

The official run results for the HP/NP2003 task are presented in Table 9.4. The best of these runs achieved the second highest performance of any submitted system (csiro03ki04). The results show that tuning specifically for the home page finding task significantly harmed named page retrieval effectiveness (csiro03ki02 versus csiro03ki03). The highest MRR was achieved using the NPF-only tuning, whilst the best S@10 used interleaved lists from the HPF and NPF tunings. The results show that an overemphasis on home page finding harmed the named page searches.

The run with the highest S@10 (csiro03ki04) interleaved the csiro03ki02 and csiro03ki03 runs (i.e. top HP, top NP, second HP, second NP, etc.). From subsequent evaluations (not sub 01) it was apparent that leading with the top NP result rather than the top HP result would have further improved precision (achieving an MRR of 0.717).

Tuning for both named page and home page training queries concurrently (csiro03ki01) performed well for home page finding, but poorly for named page finding. This confirms that the BOTH training set was biased towards home page finding due to the larger sample of home page queries considered, and the presence of home page queries in the named page training set (see Section 9.1.5).

In summary, interleaving HP then NP without query classification achieves an MRR of 0.646; interleaving HP then NP and reversing the interleaving if the query appears to be a named page query achieves an MRR of 0.667; and interleaving NP then HP without query classification achieves 0.717.

Description                              MRR     S@10 (%)   MRR (HP)   MRR (NP)   Run Id
HPF and NPF interleaved (NPF,HPF)        0.717   87.0       0.781      0.651      not sub 01
NPF                                      0.702   84.0       0.755      0.649      csiro03ki03
HPF and NPF combSUM                      0.699   81.0       0.812      0.586      csiro03ki05
BOTH                                     0.692   83.7       0.815      0.569      csiro03ki01
HPF and NPF interleaved (HPF,NPF)        0.667   86.3       0.801      0.532      csiro03ki04
HPF                                      0.603   77.7       0.774      0.432      csiro03ki02

Table 9.4: Combined home page/named page finding task submission summary. To aid understanding of retrieval performance, MRR was also computed for home pages only ("MRR (HP)") and for named pages only ("MRR (NP)"). "HPF", "NPF", and "BOTH" indicate the tunings used (home page finding, named page finding, and both sets respectively; parameters reported in Table 9.1). Other description notes indicate variations from the tuned parameters. "Run Id" reports the run identifier used in TREC experiments. Post hoc, a further run was computed using NPF tunings.
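For concreteness, the two run-combination methods compared above (the combSUM of HPF- and NPF-tuned scores, and the interleaving of the two ranked lists with optional swapping for apparent named page queries, Section 9.1.6) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used to produce the submitted runs; the keyword set is taken from the examples in the footnote to Section 9.1.6.

from itertools import zip_longest

def comb_sum(hp_scores, np_scores):
    """combSUM: sum the scores a document receives under the HPF and NPF
    tunings, then rank by the combined score (documents missing from one
    run contribute zero there)."""
    docs = set(hp_scores) | set(np_scores)
    return sorted(docs,
                  key=lambda d: hp_scores.get(d, 0.0) + np_scores.get(d, 0.0),
                  reverse=True)

def interleave(first_run, second_run):
    """Take one document from the top of each ranking in turn, skipping
    documents that have already been seen."""
    merged, seen = [], set()
    for a, b in zip_longest(first_run, second_run):
        for doc in (a, b):
            if doc is not None and doc not in seen:
                merged.append(doc)
                seen.add(doc)
    return merged

NP_KEYWORDS = {"page", "form", "2000"}   # illustrative, from the Section 9.1.6 footnote

def combine_runs(query, hp_run, np_run):
    """Interleave HP-first by default, but lead with the NP run when the
    query contains a keyword indicative of a named page finding query."""
    if NP_KEYWORDS & set(query.lower().split()):
        return interleave(np_run, hp_run)
    return interleave(hp_run, np_run)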
9.3.2 Evaluating the ranking function on further corporate web collections

The ranking function was evaluated for eight further collections built from the publicly available corporate webs of eight large Australian organisations: five public companies, two government departments and an educational institution. The query and result sets were generated using the automated site map method described in Section 2.6.5.3. In each case the new ranking function was compared to the performance of the incumbent search system. The anchor-text component was calculated using a BM25 anchor-text formulation that used full-text document length for normalisation (BM25contdln).

Table 9.5 presents the results from this experiment. The first-cut ranking function performed significantly better than seven out of the eight evaluated search systems, and comparably to the other search system (University). The use of query-independent evidence (off-site links, on-site links and URL length) did not significantly improve retrieval effectiveness on any collection.

9.4 Discussion

The first-cut ranking function performed well over a variety of tasks and corpora. The runs submitted to the 2003 TREC web track achieved the highest Topic Distillation score [60], and the second highest combined HP/NP score [60]. The ranking function also outperformed the incumbent search engines of seven of the eight corporate webs studied (and performed comparably to the other).

The tuning of ranking function parameters using the NPF training set achieved better retrieval effectiveness than tuning using the HPF set in the HP/NP2003 task. This indicates that the HPF-tuned ranking function may have been over-trained towards prominent home pages (such as would be listed on first.gov).

Arguably, the most important component of the ranking function was the anchor-text evidence in the form of AF1. This finding re-iterates the importance of anchor-text evidence in web document retrieval. The AF1 ranking function provided an effective alternative to scoring anchor-text using full-text ranking methods. However, the methods used to score aggregate anchor-text evidence merit further investigation. In particular, more work is required to determine whether the use of global inverse document frequency values (gidf) is preferable to the use of field-based anchor-text (fidf) values.

The results show little performance gain through the use of query-independent evidence, for both the web track tasks and the small corporate web collections. URL length evidence produced small gains for the home page finding and Topic Distillation tasks. By contrast, hyperlink recommendation evidence never improved retrieval effectiveness. The poor performance of query-independent evidence could indicate that the method used to combine it with query-dependent evidence was ineffective. More effective combination strategies might incorporate query-independent evidence as a prior probability of document relevance [155], re-rank baselines (as in Section 6.1.4), or use the query-independent score to normalise or transform term contribution.
Institution     Search Engine    Queries   S@1   S@5   S@10   Docs
Telecomm.       Unknown          266       75    113   126    72 337
                New (no QIE)     266       166   208   212
                New (w/QIE)      266       166   208   219
Large Bank 1    Lotus Notes      228       15    46    63     6690
                New (no QIE)               151   206   209
                New (w/QIE)                150   206   210
Large Bank 2    Unknown          64        17    26    28     1805
                New (no QIE)               41    59    60
                New (w/QIE)                42    59    60
Large Bank 3    Unknown          143       4     21    39     5113
                New (no QIE)               116   132   135
                New (w/QIE)                100   132   134
Large Bank 4    Unknown          295       96    165   170    7827
                New (no QIE)               170   232   243
                New (w/QIE)                160   228   241
University      Ultraseek        360       179   235   253    50 203
                New (no QIE)               218   293   315
                New (w/QIE)                204   304   324
Gov Dept 1      ht://dig         160       38    98    119    8414
                New (no QIE)               128   140   146
                New (w/QIE)                128   147   148
Gov Dept 2      Verity & MS      154       1     8     12     42 981
                New (no QIE)               79    108   111
                New (w/QIE)                86    110   111

Table 9.5: Ranking function retrieval effectiveness on the public corporate webs of several large Australian organisations. "New" is the first-cut ranking function described within this chapter. "No QIE" indicates that the run was performed with query-independent evidence removed (ON = 0, OFF = 0, URL = 0). The BM25 parameters were set to k1 = 2, b = 0.75. When used, query-independent evidence parameters were specified as QD = 17, ON = 2, OFF = 6 and URL = 19. The evaluation was performed between February and March 2003.
For example, the in-degree of a document might be a more useful saturation value than length when scoring aggregate anchor-text. The exploration of new approaches to term normalisation and transformation may be particularly interesting in the context of further anchor-text evidence scoring functions.

Hyperlink recommendation evidence, evaluated in the form of off-site (IDGoff) and on-site (IDGon) in-degree, was once again found to be a relatively poor form of document evidence. It is possible that this negative result may be attributed to the relatively small size of the collection (in comparison to the web), and accordingly a limited amount of cross-site linking in the collection. That said, the demonstration of a search situation in which the use of hyperlink recommendation evidence significantly improves retrieval effectiveness remains an elusive goal.

URL length evidence, while found to be important in the training set and in previous home page finding experiments, was found to be relatively ineffective for the tasks examined here. Incorporating URL length moderately improved effectiveness for Topic Distillation, but reduced effectiveness on the combined NP/HP finding tasks. These results indicate that while URL length is an important component for effective home page search, its contribution to other tasks may be limited.
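As an illustration only of the speculative idea raised above — that a document's in-degree, rather than aggregate anchor-text length, might govern term saturation — a BM25-style anchor-text term weight could be sketched as below. This formulation was not evaluated in this thesis; the parameter names, the default values, and the direction in which in-degree alters saturation are all assumptions.

import math

def anchor_term_score(tf, in_degree, df, num_docs, k1=2.0, avg_in_degree=10.0):
    """BM25-style weight for one query term in a document's aggregate
    anchor-text, with the saturation component driven by the document's
    in-degree rather than by aggregate anchor-text length.
    tf: term frequency in the aggregate anchor-text; df: number of
    aggregate documents containing the term; num_docs: collection size."""
    if tf <= 0:
        return 0.0
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    # in_degree/avg_in_degree plays the role that dl/avdl plays in BM25
    # (an assumption; the reverse relationship is equally plausible).
    K = k1 * (in_degree / avg_in_degree)
    return idf * (tf * (k1 + 1.0)) / (tf + K)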
Chapter 10

Discussion

The findings presented in this thesis raise a number of issues. This chapter discusses:

• the extent to which experimental findings are likely to hold for enterprise and intranet web search systems and WWW search engines;

• the search tasks that are the most appropriate to model when evaluating web search performance;

• how web evidence could be used to build a more efficient ranking algorithm while maintaining retrieval effectiveness; and

• whether the set of document features used by the ranking function could be tuned on a per-corpus basis.

10.1 Web search system applicability

This thesis has evaluated the effectiveness of web evidence over a large selection of corporate webs and corporate-sized webs, with corpora ranging from 5000 to 18.5 million pages. This range of sizes covers almost all enterprise corpora. The web evidence inclusive ranking function achieved consistent gains over eight diverse enterprise corpora (in Section 9.3.2), indicating that findings are likely to hold for many small-to-medium sized web corpora. However, it should be noted that the improvements afforded by web evidence are dependent on the quality of hyperlink information in the corpus, and are subject to the publishing procedures employed by organisations. These procedures can reduce the effectiveness of web evidence (as studied in Chapter 4). For example, the effectiveness of web evidence is likely to be decreased if the corpus contains URLs that are unlikely to be linked-to, or the corpus contains a lot of duplicate content.

Findings from experiments in this thesis may be less applicable to WWW search engines than to enterprise web search engines. WWW search engines are subject to substantial efficiency constraints, due to the scale of the document corpus and query processing demands. The indexes of current state-of-the-art WWW search engines contain two orders of magnitude more documents than the largest corpus considered in this thesis.
These systems also process thousands of concurrent queries with sub-second response times. These efficiency requirements are likely to limit the document features examined and scored during query processing. One benefit of a larger corpus size is that there is likely to be more link evidence, and so differentiation between links (e.g. on-site, off-site or nepotistic) might lead to larger gains. However, the hyperlink recommendation scores calculated throughout experiments in this thesis were found to be correlated with the scores for corresponding documents extracted from WWW search engines (see Section 7.6.3). Further, a recent experiment reported that the use of anchor-text evidence external to a web corpus (but linking to documents inside the corpus) did not improve retrieval effectiveness [114]. Consequently, it is possible that further link evidence may not be useful. The correlations between hyperlink recommendation scores, and the small observed benefit achieved by using external link evidence, indicate that hyperlink evidence used in WWW search systems is likely to be comparable to that studied here.

WWW search engines also operate in an adversarial information retrieval environment, where web authors may seek to bias ranking functions in their favour by creating spam content [122]. Given the relative ease and low cost of link construction on the WWW, one might expect hyperlink recommendation scores to be susceptible to link spamming. Some spam-like properties were observed in thesis experiments, but these appeared unsystematic and were deemed to have been created unintentionally. While some experiments in this thesis cast doubt on the use of hyperlink recommendation methods for spam reduction, these results are not conclusive.

Therefore, results presented in this thesis are likely to apply to ranking in enterprise web search, subject to publishing practices, but are less directly applicable to ranking in WWW search systems.

10.2 Which tasks should be modelled and evaluated in web search experiments?

It is important that the tasks evaluated and modelled for a web search system be representative of the tasks that will be performed by the users of the system. Without access to studies relating to the user populations, intended system usage and/or large-scale query-logs, it is difficult to determine which tasks are most frequently performed.

Document ranking functions need to be evaluated over more than one type of search task. It is apparent, both in results from experiments presented in this thesis and in previous TREC web track evaluations, that performance gains in a single retrieval task often do not carry benefits to other tasks. For example, URL length based measures are particularly useful when seeking home pages (Chapter 7), but appear to reduce retrieval effectiveness on other tasks (Chapter 9). Therefore a mixed query set should be used when evaluating a general purpose ranking function. In the 2004 TREC web track, one of the tasks examined was a mixed task that included an equal mix of named page, home page and Topic Distillation queries [54]. Alternatively a mixed query set might be balanced in anticipation of the types of queries a system might receive.1
The query set might also include queries for which the answers are important resources (either popular, or key corpus documents) for each type of search task.

The evaluation concerns for WWW search engines are likely to be quite different from those of corporate webs. WWW search engines need to provide results for a diverse document corpus and user group. By comparison, search engines on corporate webs are likely to have a smaller target user audience, and a more homogeneous document corpus. A prime concern for WWW search engines may be known item searches, where the pages are important and well known to the user. If the search system fails for these types of queries, the user is likely to lose some degree of trust in the system. Therefore, a useful basic effectiveness test may be to observe how well the search engine can find pages listed in WWW directories, using listing descriptions as queries (similar to the automated site map method).

In an enterprise search context, the automated site map method appears to be an effective way of evaluating retrieval effectiveness for navigational search tasks (when a site map is available). Site maps often contain organisation-specific terminology and include links to documents that are frequently accessed. For enterprise web search engines, pages that are contained in site maps may be representative of potential navigational queries, and could thus be an excellent source of queries.

A WWW search engine is likely to be required to process much broader queries than enterprise web search systems, and so should be evaluated for varied tasks. Known item search is likely to be particularly important, as a user may be disappointed in a WWW search system if they cannot use it to find a page they know exists, especially for well known entities. For known item search, online WWW directories may be a good source of query/answer sets of well known and/or useful WWW pages.

10.3 Building a more efficient ranking system

The web evidence and combination methods considered within this thesis may be used to improve query-processing performance and reduce the size of document indexes. The high level of effectiveness achieved by anchor-text over all the search tasks considered in this thesis indicates that a high level of retrieval effectiveness could be achieved over many search tasks using an anchor-text only index. Such an index would be far smaller than a full-text index. For the .GOV corpus, aggregate documents have an average length of 25 terms, as opposed to the 870 terms for document full-text evidence. Further, there is far more repetition in anchor-text evidence, meaning indexes containing aggregate anchor-text might be expected to achieve higher compression than indexes of document full-text.

An alternative method for improving query processing efficiency is to exclude documents that do not meet a minimum query-independent score prior to (or during) indexing.

1 For example, if home page finding is an important task, ensure there are many home page finding queries in the test set.
Results from experiments in this thesis indicate that restricting document inclusion by imposing a minimum URL-type value can reduce the number of documents indexed by an order of magnitude, without significantly affecting retrieval effectiveness for home page finding tasks (see Section 7.2.1).

The use of an anchor-text only index or minimum document threshold may result in a decrease in retrieval effectiveness for some tasks (such as ad-hoc informational, or named page finding tasks), as some crawled documents are not indexed and so would never be retrieved. An extension to this model would be to use two indexes: one primary index, consisting of aggregate anchor-text only or of documents that exceed the minimum threshold value, and a second index containing the full document corpus. During query processing, if some criteria are not met by documents retrieved from the primary, faster index (e.g. fewer than ten matching documents are found, no documents match all terms, or some minimum score is not achieved), the secondary index could be consulted (a sketch of this two-level approach appears at the end of this chapter). Further work is required to investigate whether such multi-level indexes would provide large efficiency gains while maintaining (or improving) retrieval effectiveness, and to explore distributed techniques for dealing with several indexes.

The size of a combined document index can be reduced through the use of a single set of document and corpus statistics when scoring query-dependent features. This requires only one set of statistics to be stored per document/term combination, rather than a set for each query-dependent feature. In fact, the use of full-text length when normalising term contribution in aggregate anchor-text improved retrieval effectiveness (see Section 8.3.1). Further work is required to determine whether inverse document frequency should be scored per document field.

10.4 Tuning on a per corpus basis

The results from experiments in this thesis indicate that document ranking effectiveness not only depends on the search task evaluated, but also on the document corpus. For example, if the ranking function is to be used on a corporate web in which all documents are published through a Content Management System (CMS) that uses long parameterised URLs, URL length-based measures are not likely to be effective. This effect was observed for one of the corpora studied in Section 9.3.2: Large Bank 1. This bank publishes all its content using the Lotus Domino system, which (at least configured as it was in this case) serves content using long URLs. Similarly, hyperlink evidence is not likely to be effective for a corpus which has few hyperlinks.

An attractive avenue for future work may be the tuning of document feature contribution according to the expected utility of that evidence. For example, if a web site's hyperlink graph is sufficiently small, hyperlink evidence could be disabled. This could be generalised further through the creation of profiles for common CMS configurations that indicate what forms of document evidence are likely to be useful. Alternatively, ranking parameters could be tuned using an automated approach using judgements such as those collected from a web site map. This remains for future work.
If corpus-based tuning is not employed, it is important that web authors are aware of the evidence commonly used to match and rank documents. This is especially the case in an enterprise web context.
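To make the two-level index idea from Section 10.3 concrete, the query-time fallback logic might look like the sketch below. This is a minimal illustration under assumed interfaces (a search(query) method on each index returning scored results, and hypothetical fallback criteria), not a description of an implemented system.

MIN_RESULTS = 10       # assumed fallback criteria
MIN_TOP_SCORE = 1.0

def two_level_search(query, primary_index, secondary_index, top_k=10):
    """Query the small, fast primary index (e.g. aggregate anchor-text
    only, or documents above a minimum query-independent threshold) and
    fall back to the full secondary index only when the primary results
    look inadequate."""
    results = primary_index.search(query)        # [(doc_id, score), ...], best first
    inadequate = (not results
                  or len(results) < MIN_RESULTS
                  or results[0][1] < MIN_TOP_SCORE)
    if inadequate:
        results = secondary_index.search(query)  # slower, complete corpus
    return results[:top_k]

Which criteria trigger the fallback, and how often they do so, would determine how much of the potential efficiency gain is realised in practice.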
Chapter 11

Summary and conclusions

The experiments in this thesis demonstrate how web evidence can be used to improve retrieval effectiveness for navigational search tasks.

The first set of experiments, presented in Chapter 4, studied the relationship between site searchability and the likelihood of a site's documents being retrieved by prominent WWW search engines. This study provided one of the first empirical investigations of transactional search. The performance of WWW search engines was shown to differ markedly, with two-out-of-four search engines never retrieving books within the top ten results, and one search engine favouring a particular bookstore (perhaps indicating a partnership). A large variation in bookstore searchability was also observed.

An investigation of potential biases in hyperlink evidence was then presented in Chapter 5, using data collected from WWW search engines. Biases were observed in hyperlink recommendation evidence towards the home pages of popular and/or technology-oriented companies. These results indicate that the use of hyperlink evidence may not only improve home page finding effectiveness (important in navigational search), but also bias search results towards this user demographic (i.e. users who are interested in popular, technology-oriented information). The two types of hyperlink recommendation evidence (Google PageRank and AllTheWeb in-degree) were virtually indistinguishable, providing similar recommendations towards popular companies. Both measures were also correlated for a set of company home pages, and a set of known spam pages. The similarity between the two measures raised questions as to the usefulness of PageRank over in-degree. Both measures gave preference to home page documents, supporting the investigation of hyperlink recommendation evidence for home page finding tasks in later chapters.

Methods for combining hyperlink recommendation evidence (and other query-independent measures) with query-dependent evidence were investigated in Chapter 6. Results from this experiment demonstrated how assigning a large weight to hyperlink recommendation evidence in a ranking function may trade document relevance for link popularity. It was submitted that hyperlink recommendation evidence should be included either as a small component in the ranking function, or in the form of a minimum threshold value enforced prior to document ranking.

Chapter 7 presented a detailed evaluation of home page finding on five small-to-medium web test collections
using three query-dependent baselines and four forms of query-independent evidence (in-degree, Democratic PageRank, Aristocratic PageRank, and URL length). The results from these experiments demonstrated the importance of both anchor-text and URL length measures in home page finding tasks. The most consistent improvements in retrieval effectiveness were achieved using a baseline containing document full-text and anchor-text, with a score-based re-ranking by URL-type. Improvements were observed in both efficiency and effectiveness when using minimum query-independent value thresholds for page inclusion, with the gains for URL length thresholds being particularly large. Little benefit was observed through the use of hyperlink recommendation methods. Small gains were achieved when hyperlink recommendation scores were used as minimum thresholds for page inclusion. However, a score-based re-ranking of query-dependent baselines by hyperlink recommendation evidence performed poorly.

Both PageRank and in-degree performed similarly and were found to be highly correlated. This correlation, and the almost identical performance of both PageRank and in-degree in the home page finding tasks, indicated no reason to choose Democratic PageRank over in-degree for home page finding on corpora of under 18.5 million pages. When considered with the correlations previously observed in WWW-based hyperlink recommendation scores, these results also cast doubt as to whether PageRank and in-degree values would show more divergence on the complete WWW graph. The PageRank values computed for these experiments were also found to be correlated with Google WWW PageRanks for pages present in the Open Directory.

A series of follow-up experiments (using the same data) found that the use of URL length, when measured in characters, is as effective as using URL-types. A further finding was that using hyperlink recommendation evidence calculated for a web graph that included link evidence external to the corpus did not improve retrieval effectiveness. By contrast, the use of external anchor-text information significantly improved retrieval effectiveness.

Chapter 8 presented an analysis of the application of Okapi BM25 based measures in scoring anchor-text evidence. This analysis led to several proposed modifications to Okapi BM25 that, it was hypothesised, might improve the scoring of anchor-text evidence. Proposed modifications included an increase of the saturation point for document term frequencies, the calculation of separate anchor-text-only inverse document frequency values, and the use of document full-text length to normalise aggregate anchor-text. An empirical investigation was carried out to determine whether the proposed changes to anchor-text scoring improved retrieval effectiveness. This showed that the revised scoring functions achieved significant improvements in retrieval effectiveness, for both Topic Distillation and navigational tasks.

Experiments within Chapter 8 also analysed and evaluated strategies for combining query-dependent baselines. Results for these combinations demonstrated the importance of treating document-level and web-based evidence as separate entities. Additionally the results showed that computing a single set of (global) document and corpus statistics for all query-dependent fields improved system efficiency and provided small gains in retrieval effectiveness. Surprisingly, the effectiveness of the anchor-text baseline improved when full-text length was used to normalise aggregate anchor-text document length.
Chapter 9 presented a first-cut document ranking function that included web evidence found useful in earlier experiments within this thesis (anchor-text and URL-length measures in particular). The ranking function was evaluated through ten runs submitted to the two TREC web track tasks in 2003. The best of the runs submitted for the Topic Distillation task achieved the highest performance of any system submission. The best of the runs submitted for the combined home page / named page finding task achieved the second highest performance of any system submission. To further validate the ranking function, a series of follow-up experiments were performed using corporate web collections. Results from these experiments showed that the ranking function outperformed seven out of eight incumbent search systems (while performing comparably to the other).

11.1 Findings

Experimental findings suggest that the most important form of web evidence is anchor-text. Using anchor-text evidence to rank documents, rather than document full-text, provides significant effectiveness gains in home page finding and Topic Distillation tasks. The methods commonly used for length normalising anchor-text aggregate documents were found to be deficient. Removing aggregate anchor-text length normalisation altogether, or normalising according to full-text document length, were both found to improve retrieval effectiveness. The removal of length normalisation from the anchor-text scoring function favours large volumes of incoming anchor-text and, according to prestige and recommendation assumptions, may favour prominent pages.

The use of URL-length based measures, either through grouping URLs into classes (as in URL-type) or simply by counting the number of characters, brought consistent gains for home page finding tasks. However, the use of this evidence reduced effectiveness for other tasks, and would be ineffective for corpora which do not exhibit any URL hierarchy. Further work is needed to understand how best to use URL-based measures in a general purpose web search system.

Hyperlink recommendation evidence was far less effective than URL-based measures. The use of hyperlink recommendation evidence provided minimal gains, even when an Optimal re-ranking was used. The most effective use of hyperlink recommendation scores was in reducing the size of corpora without reducing home page search performance. However, these gains were small by comparison to those achieved using URL-type thresholds. Democratic PageRank was not observed to significantly out-perform simple in-degree. Given the extra cost involved in computing Democratic PageRank, this thesis presents no evidence to support the use of Democratic PageRank over in-degree. A PageRank biased towards authoritative sites improved effectiveness somewhat; however, the scores were based on bookmarks known to match the best answers for the queries used. Further work is required to investigate and compare this PageRank formulation to other authority-biased measures.
The combination method for query-dependent evidence which achieved the highest retrieval effectiveness on navigational and Topic Distillation tasks was the hybrid combination of scores. The hybrid combination considers document-level and web-based evidence as separate document components, and uses a linear combination to sum scores. The separation of document-level and web-based information means that two scores are assigned per document: one for the document content (or the author's description), and one for the wider web community view of the document. If both measures agree (and the document is scored highly on both measures for a particular query) this is likely to be a strong indication that the page is what it claims to be. Computing global document and corpus statistics for all query-dependent fields improved system efficiency and provided small gains in retrieval effectiveness.

The best methods for combining query-independent evidence with query-dependent baselines involved the application of minimum thresholds for page inclusion, or re-ranking all pages within some percentage of the top score. Both combinations proved effective when combining URL-type evidence with query-dependent baselines.

Bias towards the home pages of popular and/or technology-oriented companies was observed in hyperlink-based evidence. Some biases, such as the technology bias, could negatively affect document ranking if ignored, as search results will cater to a small demographic of web users. These findings indicate that care should be taken when using such evidence in document ranking, or in a direct Toolbar indicator. The observed bias may be especially confusing when recommendation scores are used directly as a measure of a page's quality, as in the Google Toolbar.

11.2 Document ranking recommendations

Experimental results indicate that an effective web-based document ranking algorithm for navigational tasks should exploit both document-level evidence and web-based evidence. These two types of document evidence are best combined using a hybrid combination with globally computed document and term statistics. Document evidence should include full-text evidence and other useful document-level evidence. Web-based evidence should make use of incoming anchor-text, and other useful external document descriptions. Anchor-text aggregate document length should not be used to normalise anchor-text term contribution. For home page search, a URL depth component, either measured by characters or classified by type, should be included. The measure may be included either by re-ranking documents that achieve within n% of the top score by URL length, or by adding a normalised URL length score to the query-dependent score. The best choice of hyperlink recommendation algorithm for use in home page finding within corporate-scale corpora is in-degree, as the PageRank variants appear to offer little or no advantage and are more computationally expensive.
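The recommendations above can be summarised in a short sketch. The weights, field names and the n% re-ranking window below are illustrative assumptions; the intent is only to show the shape of the hybrid combination (separate document-level and web-based query-dependent scores summed linearly) and the URL-length re-ranking step for home page search, not the exact formulation used in the experiments.

def hybrid_score(doc_score, web_score, w_doc=1.0, w_web=1.0):
    """Linear (hybrid) combination of the document-level query-dependent
    score (full-text etc.) and the web-based query-dependent score
    (aggregate anchor-text etc.), each computed with global document and
    term statistics."""
    return w_doc * doc_score + w_web * web_score

def rerank_by_url_length(ranked, url_length, window=0.10):
    """For home page search: among documents scoring within `window`
    (e.g. 10%) of the top score, prefer shorter URLs; the remainder keep
    their score order.  `ranked` is a list of (doc_id, score), best first;
    `url_length` maps doc_id to URL length in characters."""
    if not ranked:
        return ranked
    top = ranked[0][1]
    head = [d for d in ranked if d[1] >= (1 - window) * top]
    tail = [d for d in ranked if d[1] < (1 - window) * top]
    head.sort(key=lambda d: url_length[d[0]])   # shortest URL first
    return head + tail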
11.3 Future work

The findings within this thesis raise several issues that merit further investigation. Future work for web-based document ranking might include:

• A study of whether web evidence can improve retrieval effectiveness for other web-based user search tasks, such as informational and transactional search.

• A study of further anchor-text ranking functions. The modifications to Okapi BM25 improved retrieval effectiveness; however, further work is needed to determine whether the document and collection statistics applied to scoring anchor-text were optimal.

• Further study of how document and web-based evidence should be combined. This thesis has explored many different ways of combining document evidence, but it is not clear that the optimal method has been found.

Further studies might also look at the nature of hyperlink recommendation on the WWW. This could include:

• A study of the changing nature of hyperlink evidence on the WWW. For example, is the proportion of dynamic vs. static hyperlinks on the WWW constant? Is the proportion of links which are dead (have no target) constant over time? Also worthy of further examination is how new trends on the WWW, such as web logging, might affect the quality and quantity of hyperlink evidence.

• A study of how an increase in the effectiveness of WWW-based search engines might affect the quality of hyperlink evidence on the WWW. Does high quality search mean authors are less likely to link to useful documents?

• A further study of how document quality metrics, such as PageRank and in-degree, relate to user satisfaction with document quality, or industry-professional satisfaction with document quality. This investigation could focus on the use of tools like the Google Toolbar.
Appendix A

Glossary

All terms within this thesis, unless defined below, are as used in the (Australian) Macquarie Dictionary, searchable on the WWW at http://www.dict.mq.edu.au.

Aggregate anchor-text: all anchor-text snippets pointing to a page.

Anchor-text: words contained within anchor-tags which are "clicked on" when a link is followed.

Anchor-text snippet: a piece of anchor-text that annotates a single link.

Anchor-text aggregate document: a surrogate document containing all anchor-text snippets pointing to a page.

Aristocratic PageRank (APR): a formulation of PageRank that favours a manually specified set of (authoritative) pages. The PageRank calculation is biased towards these pages by using the set of pages in the PageRank bookmark vector.

Collection: see Test collection.

Corpus: a set of documents.

Crawler: the web search system component that gathers documents from a web.

Democratic PageRank (DPR): the default PageRank formulation in which all pages are treated a priori as equal.

Entry point: a document within a site hierarchy from which web users can begin to explore a particular topic.

Evidence: a document attribute, feature, or group of attributes and features that may be useful in determining whether the document should be retrieved (or not) for a particular query.

Feature: information extracted from a document and used during query processing.

Field: a query-dependent document component, for example document full-text, document title or document aggregate anchor-text.
Home page: the key entry point for a particular web site.

Home page finding: a navigational search task in which the goal is to find home pages.

Hyperlink recommendation: an algorithm which is based on the number or "quality" of web recommendations for a particular document.

In-degree: the simplest hyperlink recommendation algorithm in which a document's value is measured by the number of incoming hyperlinks.

Indexer: the web search system component that indexes documents gathered by the crawler into a format which is amenable to quick access by the query processor.

Informational search task: a user task in which the user need is to acquire or learn some information that may be present in one or more web pages.

Link farms: an "artificial" web graph created by spammers through generating link spam to funnel hyperlink evidence to a set of pages for which they desire high web rankings.

Link spam: spam content introduced into hyperlink evidence by generating spam documents that link to other documents with false or misleading information.

Mean Reciprocal Rank (MRR): a measure used in evaluating web search system performance, computed by averaging, over queries, the reciprocal of the rank at which the system retrieves the first useful (relevant) document.

Named page finding: a navigational search task in which the goal of the search system is to find a particular page given its name.

Navigational search task: a user task where the user needs to locate a particular entity given its name.

PageRank: a hyperlink recommendation algorithm that estimates the probability that a "random" web surfer would be on a particular page on a web at any particular time.

Precision: a measure used in evaluating web search system performance. Precision is the proportion of retrieved documents that are relevant to a query at a particular rank cut-off.

Query-dependent evidence: evidence that depends on the user query and is calculated by the query processor during query processing.

Query-independent evidence: evidence that does not depend on the user query, generally calculated during the document indexing phase (prior to query processing).

Query processor: a typical component of a web search system that consults the index to retrieve documents in response to a user query.
R-Precision (R-Prec): a measure used to evaluate web search system performance. R-Precision is the precision of a system at the Rth retrieved document, where R is the number of relevant documents for the query (averaged across multiple queries).

Recall: a measure used in evaluating web search system performance. Recall is the proportion of all relevant documents that have been retrieved within a particular cut-off for a query.

Search Engine Optimisation: optimising document and web structure such that search engines may better match a document's content (without generating spam content).

Spam: the name applied to content generated by web publishers to artificially boost the rank of their pages. Spam techniques include the addition of otherwise unneeded keywords and hyperlinks.

Stemming: stripping term suffixes or prefixes to collapse a term down to its canonical form (or stem). The Porter suffix stemmer [163] is used for this purpose in some thesis experiments.

Test collection: a snapshot of a user task and document corpus used to evaluate system effectiveness. A test collection includes a set of documents (corpus), a set of queries, and relevance judgements for documents in the corpus according to the queries.

Topic Distillation: a user task in which the goal is to find entry points to relevant sites given a broad query.

Text REtrieval Conference (TREC): an annual conference run by the US National Institute of Standards and Technology (NIST) and the US Defense Advanced Research Projects Agency (DARPA) since 1992. The goal of the conference is to promote the understanding of information retrieval algorithms by allowing research groups to compare system effectiveness on common test collections.

Traditional information retrieval: information retrieval performed over a flat corpus using full-text fields.

Transactional search task: a user search task where the user needs to perform some activity on, or using, the WWW.

URL-type: a URL class breakdown, proposed by Westerveld et al. [212], in which some URLs are deemed more important than others on the basis of structure and depth (outlined in Section 2.3.3.2).

web: a corpus containing linked documents.

web evidence: evidence derived from some web property or context.
web graph: a graph built from the hyperlink structure of a web, where web pages are nodes, and hyperlinks are edges.

WWW: the World-Wide Web is a huge repository of linked documents distributed on millions of servers world-wide. The WWW contains at least ten billion publicly visible web documents.
Appendix B

The canonicalisation of URLs

When canonicalising URLs the following rules were followed:

• If the relative URI steps below the root of the server the link is resolved to the server root directory. For example:
  – A link to /../foo.html from http://cs.anu.edu.au/ will be resolved to http://cs.anu.edu.au/foo.html;
  – A link to ../../foo.html from http://cs.anu.edu.au/∼Trystan.Upstill/ will be resolved to http://cs.anu.edu.au/foo.html; and
  – A link to /../foo.html from http://cs.anu.edu.au/∼Trystan.Upstill/pubs/ will be resolved to http://cs.anu.edu.au/foo.html.

• Hyperlinks and documents with common default root page names (e.g. index.htm(l), default.htm(l), welcome.htm(l), and home.htm(l)) are stemmed to the directory path. For example:
  – A link to http://cs.anu.edu.au/default.html is resolved to http://cs.anu.edu.au/; and
  – A link to http://cs.anu.edu.au/∼Trystan.Upstill/index.html is resolved to http://cs.anu.edu.au/∼Trystan.Upstill/.

• Multiple directory slashes are resolved to a single slash. For example:
  – A link to http://cs.anu.edu.au///// is resolved to http://cs.anu.edu.au/; and
  – A link to http://cs.anu.edu.au//////∼Trystan.Upstill// is resolved to http://cs.anu.edu.au/∼Trystan.Upstill/.

• URLs pointing to targets inside documents are treated as links to the full document. For example:
  – A link to http://cs.anu.edu.au/foo.html#Trystan is resolved to http://cs.anu.edu.au/foo.html; and
  – A link to http://cs.anu.edu.au/#foo is resolved to http://cs.anu.edu.au/.

• Hyperlinks are not followed from framesets (as they are not crawled). Hyperlink extraction from frameset sites requires that links directly to navigational panes be observed (and not links to framesets).

• If the port to which an HTTP request is made is the default port (e.g. 80), it is removed. For example:
  – A link to http://cs.anu.edu.au:80 is resolved to http://cs.anu.edu.au; and
  – A link to http://cs.anu.edu.au:80/∼Trystan.Upstill/ is resolved to http://cs.anu.edu.au/∼Trystan.Upstill/.

• URLs without a leading host are prefixed with "www". For example, a link to http://sony.com/ is resolved to http://www.sony.com.

• If no protocol is provided, http:// is assumed. For example, a link to sony.com is resolved to http://www.sony.com.

• Host names are converted into lower case (as host names are case-insensitive).

• Default web server directory listing pages are removed.
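A minimal sketch of how several of these rules might be applied in code is given below. It is illustrative only (the crawler used in the thesis is not reproduced here) and covers only a subset of the rules: assuming http:// when no protocol is given, dropping fragments, lower-casing host names, removing the default port, collapsing repeated slashes, and stripping common default page names.

from urllib.parse import urlsplit, urlunsplit
import re

DEFAULT_PAGES = re.compile(r"/(index|default|welcome|home)\.html?$", re.IGNORECASE)

def canonicalise(url):
    """Apply a subset of the canonicalisation rules of Appendix B."""
    if "://" not in url:                 # no protocol given: assume http://
        url = "http://" + url
    scheme, netloc, path, query, _frag = urlsplit(url)   # drop any #fragment
    netloc = netloc.lower()              # host names are case-insensitive
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]             # remove the default port
    path = re.sub(r"/{2,}", "/", path)   # collapse multiple directory slashes
    path = DEFAULT_PAGES.sub("/", path)  # stem default page names to the directory
    if not path:
        path = "/"
    return urlunsplit((scheme, netloc, path, query, ""))

For example, canonicalise("http://cs.anu.edu.au:80//foo.html#bar") returns http://cs.anu.edu.au/foo.html under these rules.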
Appendix C

Bookstore search and searchability: case study data

C.1 Book categories

• (27) Children's
• (15) Hardcover Advice
• (11) Hardcover Business
• (35) Hardcover Fiction
• (29) Hardcover Non-Fiction
• (15) Paperback Advice
• (07) Paperback Business
• (35) Paperback Fiction
• (32) Paperback Non-Fiction
• (206) Total

Duplicate books were removed from the query set. For example, the book titled "Stupid White Men" was in both the Hardcover Business and Hardcover Non-Fiction sections, and so was only considered in the Hardcover Business category.

C.2 Web search engine querying

• AltaVista
  – General Queries: Book title surrounded by quotation (") marks.
  – URL Coverage: canonical domain name with "url:" parameter.
  – Link Coverage: canonical domain name with "link:" parameter.
  – Timeframe: General and Domain Restricted Queries submitted between 20/09/02 and 02/10/02; Link Coverage and URL Coverage experiments performed on 09/10/02.

• AllTheWeb (Fast)
  – General Queries: Book title with exact phrase box ticked.
  – URL Coverage: Advanced search restricting to domain using "domain" textbox with canonical domain name.
  – Link Coverage: Advanced search using Word Filter with "Must Include" in the preceding drop-down box, canonical domain name in the middle text box and "in the link to URL" in the final drop-down box.
  – Timeframe: General and Domain Restricted Queries submitted between 20/09/02 and 02/10/02; Link Coverage and URL Coverage experiments performed on 09/10/02.

• Google
  – General Queries: Book title surrounded by quotation (") marks.
  – URL Coverage: Search for the non-presence of a non-existing word (e.g. -adsljflkjlkjdflkjasdlfj0982739547asdhkas) and using canonical domain name with "host:" parameter.
  – Link Coverage: Not available.
  – Timeframe: General and Domain Restricted Queries submitted between 20/09/02 and 02/10/02; Link Coverage and URL Coverage experiments performed on 09/10/02.

• MSN Search (Inktomi)
  – General Queries: Advanced search with book title in the "exact phrase" box.
  – URL Coverage: Advanced search using the domain name as the query, and restricting domain using "domain" text box with canonical domain name.
  – Link Coverage: Not available.
  – Timeframe: General and Domain Restricted Queries submitted between 20/09/02 and 02/10/02; Link Coverage and URL Coverage experiments performed on 09/10/02.
    §C.3 Correct bookanswers in bookstore case study 187 C.3 Correct book answers in bookstore case study Category Book Title ISBN Childrens America 0689851928 Childrens Artemis Fowl 0786808012 0786817070 Childrens Artemis Fowl: the Arctic Incident 0786808551 Childrens Can You See What I See? 0439163919 Childrens Daisy Comes Home 039923618X Childrens Disney’s Lilo and Stitch 0736413219 Childrens Giggle, Giggle, Quack 0689845065 Childrens Good Morning, Gorillas 0375806148 0375906142 Childrens Harry Potter and the Chamber of Secrets 0439064872 0439064864 0613287142 Childrens Harry Potter and the Goblet of Fire 0439139600 0439139597 Childrens Harry Potter and the Prisoner of Azkaban 0439136369 0439139597 0613371062 Childrens Harry Potter and the Sorcerer’s Stone 059035342X 0590353403 0613206339 Childrens Holes 0440414806 0374332657 044022859X 0613236696 Childrens If You Take a Mouse to School 0060283289 Childrens Junie B., First Grader (at Last!) 0375802932 0375815163 0375902937 Childrens Junie B., First Grader: Boss of Lunch 0375815171 Childrens Lemony Snicket: the Unauthorized Autobiography. 0060007192 Childrens Oh, the Places You’ll Go! 0679805273 Childrens Olivia 0689829531 Childrens Olivia Saves the Circus 068982954X Childrens Princess in the Spotlight 0060294655 0064472795 0060294663 Childrens Stargirl 037582233X 0679886370 0679986375 B00005TZX9 B00005TPDD Childrens The All New Captain Underpants Extracrunchy Book O’fun 2 0439376084 Childrens The Bad Beginning 0064407667 0060283122
    188 Bookstore searchand searchability: case study data Childrens The Reptile Room 0064407675 0060283130 Childrens The Three Pigs 0618007016 Childrens The Wide Window 0064407683 0060283149 Hardcover Advice 10 Secrets for Success and Inner Peace 1561708755 Hardcover Advice Body for Life 0060193395 Hardcover Advice Conquer the Crash 0470849827 Hardcover Advice Execution 0609610570 Hardcover Advice Fish! 0786866020 Hardcover Advice Get With the Program! 0743225996 Hardcover Advice I Hope You Dance 1558538445 Hardcover Advice Self Matters 074322423X Hardcover Advice Sylvia Browne’s Book of Dreams 0525946586 Hardcover Advice The Fat Flush Plan 0071383832 Hardcover Advice The Perricone Prescription 0060188790 Hardcover Advice The Prayer of Jabez 1576737330 1576738108 Hardcover Advice The Prayer of Jabez for Women 1576739627 1590520491 Hardcover Advice The Wisdom of Menopause 055380121X Hardcover Advice Who Moved My Cheese? 0399144463 Hardcover Business Conquer the Crash (duplicate) Hardcover Business Execution (duplicate) Hardcover Business Fish (duplicate) Hardcover Business Fish! Tales 0786868686 Hardcover Business Good to Great 0066620996 Hardcover Business How to Lose Friends and Alienate People 030681188X Hardcover Business Martha Inc. 0471123005 Hardcover Business Oh, the Things I Know! 052594673X Hardcover Business Snobbery: the American Version 0395944171 Hardcover Business Stupid White Men 0060392452 Hardcover Business Ten Things I Learned From Bill Porter 1577312031 Hardcover Business The Pact 157322216X Hardcover Business Tuxedo Park 0684872870 0684872889 Hardcover Business Wealth and Democracy 0767905334 Hardcover Business Who Moved My Cheese? (duplicate) Hardcover Fiction A Love of My Own 0385492707 Hardcover Fiction A Thousand Country Roads 0971766711 Hardcover Fiction Absolute Rage 0743403444 Hardcover Fiction An Accidental Woman 0743204700
    §C.3 Correct bookanswers in bookstore case study 189 Hardcover Fiction Ash Wednesday 037541326X Hardcover Fiction Atonement 0385503954 Hardcover Fiction Charleston 0525946500 Hardcover Fiction Eleventh Hour 0399148779 Hardcover Fiction Enemy Women 0066214440 Hardcover Fiction Fire Ice 0399148728 Hardcover Fiction Hard Eight 0312265859 Hardcover Fiction Her Father’s House 0385334729 Hardcover Fiction Hot Ice 0553802747 Hardcover Fiction In This Mountain 0670031046 Hardcover Fiction Lawrence Sanders: Mcnally’s Alibi 0399148795 Hardcover Fiction Leslie 0743228669 Hardcover Fiction Partner in Crime 0380977303 Hardcover Fiction Pasadena 0375504567 Hardcover Fiction Prague 0375507876 Hardcover Fiction Red Rabbit 0399148701 Hardcover Fiction Standing in the Rainbow 0679426159 Hardcover Fiction Stone Kiss 0446530387 Hardcover Fiction Sunset in St. Tropez 0385335466 Hardcover Fiction The Art of Deception 0786867248 Hardcover Fiction The Beach House 0316969680 Hardcover Fiction The Dive From Clausen’s Pier 0375412824 Hardcover Fiction The Emperor of Ocean Park 0375413634 Hardcover Fiction The Lovely Bones 0316666343 Hardcover Fiction The Nanny Diaries 0312278586 Hardcover Fiction The Remnant 0842332278 Hardcover Fiction The Shelters of Stone 0609610597 Hardcover Fiction The Summons 0385503822 Hardcover Fiction Unfit to Practice 0385334842 Hardcover Fiction Whispers and Lies 0743446259 Hardcover Fiction You Are Not a Stranger Here 0385509529 Hardcover Non-Fiction A Long Strange Trip 0767911857 Hardcover Non-Fiction A Mind at a Time 0743202228 Hardcover Non-Fiction A Nation Challenged 0935112766 Hardcover Non-Fiction Among the Heroes 0060099089 Hardcover Non-Fiction Cicero 0375507469 Hardcover Non-Fiction Crossroads of Freedom: Antietam 0195135210 Hardcover Non-Fiction Firehouse 1401300057 Hardcover Non-Fiction General Patton 0060009829 Hardcover Non-Fiction Gettysburg 0060193638
    190 Bookstore searchand searchability: case study data Hardcover Non-Fiction Good to Great (duplicate) Hardcover Non-Fiction John Adams 0743223136 Hardcover Non-Fiction Lucky Man 0786867647 Hardcover Non-Fiction Martha Inc. (duplicate) Hardcover Non-Fiction Odd Girl Out 0151006040 Hardcover Non-Fiction Once Upon a Town 0060081961 Hardcover Non-Fiction Profiles in Courage for Our Time 0786867930 Hardcover Non-Fiction Running With Scissors 0312283709 Hardcover Non-Fiction Sacred Contracts 0517703920 Hardcover Non-Fiction Sex, Lies, and Headlocks 0609606905 Hardcover Non-Fiction Six Days of War 0195151747 Hardcover Non-Fiction Slander 1400046610 Hardcover Non-Fiction Small Wonder 0060504072 Hardcover Non-Fiction Snobbery (duplicate) Hardcover Non-Fiction Strong of Heart 006050949X Hardcover Non-Fiction Stupid White Men (duplicate) Hardcover Non-Fiction The Art of Travel 0375420827 Hardcover Non-Fiction The Cell 0786869003 Hardcover Non-Fiction The Lobster Chronicles 0786866772 Hardcover Non-Fiction The Right Words at the Right Time 0743446496 Hardcover Non-Fiction The Sexual Life of Catherine M. 0802117163 Hardcover Non-Fiction The Universe in a Nutshell 055380202X Hardcover Non-Fiction Tuxedo Park (duplicate) Hardcover Non-Fiction Wealth and Democracy (duplicate) Hardcover Non-Fiction Why I Am a Catholic 0618134298 Hardcover Non-Fiction You Cannot Be Serious 0399148582 Paperback Advice A Week in the Zone 006103083X Paperback Advice Chicken Soup for the Teacher’s Soul 1558749780 1558749799 Paperback Advice Crucial Conversations 0071401946 Paperback Advice Dr. Atkins’ New Diet Revolution 006001203X 1590770021 Paperback Advice Fix-it and Forget-it Cookbook 1561483397 1561483389 1561483176 Paperback Advice Guinness World Records 2002 0553583786 Paperback Advice Leonard Maltin’s 2003 Movie and Video Guide 0451206495 Paperback Advice Life Strategies 0786884592 0786865482 Paperback Advice Relationship Rescue 0786866314 078688598X Paperback Advice Rich Dad, Poor Dad 0446677450
  • 207.
    §C.3 Correct bookanswers in bookstore case study 191 Paperback Advice The Four Agreements 1878424319 1878424505 Paperback Advice The Pill Book: New and Revised 10th Edition. 0553584782 0553050133 Paperback Advice The Unauthorized Osbournes 1572435208 Paperback Advice The Wrinkle Cure 0446677760 1579542379 Paperback Advice What to Expect When You’re Expecting 0761121323 0761125493 Paperback Business Crucial Conversations (duplicate) Paperback Business Fast Food Nation 0060938455 0395977894 Paperback Business How to Make Money in Stocks 0071373616 Paperback Business Life Strategies (duplicate) Paperback Business Nickel and Dimed 0805063897 0805063889 Paperback Business Rich Dad, Poor Dad (duplicate) Paperback Business The Tipping Point 0316316962 0316346624 Paperback Business Two Bad Years and Up We Go! 1892008726 Paperback Business What Color Is Your Parachute 2002 1580083420 1580083412 Paperback Business What Went Wrong at Enron 0471265748 Paperback Fiction A Bend in the Road 0446611867 0446527785 Paperback Fiction A Painted House 044023722X 038550120X Paperback Fiction A Walk to Remember 0446608955 0613281292 Paperback Fiction Always in My Heart 0451206665 Paperback Fiction Bel Canto 0060934417 Paperback Fiction Blood Work 0446602620 0613236882 Paperback Fiction Cordina’s Royal Family 0373484836 Paperback Fiction Divine Secrets of the Ya-ya Sisterhood 0060928336 0060173289 Paperback Fiction Empire Falls 0375726403 0679432477 Paperback Fiction Enemy Within 0743403436 0743403428 Paperback Fiction Envy 0446611808 0446527130 Paperback Fiction Face the Fire 051513287X Paperback Fiction Fanning the Flame 0743419162 Paperback Fiction For Better, for Worse 0380820447 Paperback Fiction Four Blondes 080213825X
  • 208.
    192 Bookstore searchand searchability: case study data 0871138190 Paperback Fiction Good in Bed 0743418174 0743418166 Paperback Fiction Hemlock Bay 0399147381 0515133302 Paperback Fiction Honest Illusions 0399137610 0515110973 Paperback Fiction Little Altars Everywhere 0060976845 006019362X Paperback Fiction Mercy 0671034022 0671034014 Paperback Fiction Paradise Lost 0140424261 Paperback Fiction Stonebrook Cottage 1551669234 Paperback Fiction Summer Pleasures 0373218397 Paperback Fiction Suzanne’s Diary for Nicholas 0446679593 0316969443 Paperback Fiction The Associate 0061030643 0060196254 Paperback Fiction The Bachelor 0446610542 Paperback Fiction The Last Time They Met 0316781266 0316781142 Paperback Fiction The New Jedi Order: Traitor 034542865X 0553713175 Paperback Fiction The Smoke Jumper 0385334036 0440235162 Paperback Fiction The Straw Men 0515134279 Paperback Fiction The Surgeon 0345447840 0345447832 Paperback Fiction True Blue 0553583980 Paperback Fiction Valhalla Rising 039914787X 0425185710 Paperback Fiction When Strangers Marry 0060507365 Paperback Fiction Whisper of Evil 0553583468 Paperback Non-Fiction A Beautiful Mind 0743224574 0684819066 Paperback Non-Fiction A Child Called ”It” 1558743669 0613171373 Paperback Non-Fiction A Man Named Dave 0452281903 0525945210 Paperback Non-Fiction An Italian Affair 0375724850 0375420657 Paperback Non-Fiction April 1865 0060930888 0060187239 Paperback Non-Fiction Ava’s Man 0375724443 0375410627 Paperback Non-Fiction Black Hawk Down 0871137380 0140288503
  • 209.
    §C.3 Correct bookanswers in bookstore case study 193 Paperback Non-Fiction Brunelleschi’s Dome 0142000159 0802713661 Paperback Non-Fiction Comfort Me With Apples 0375758739 0375501959 Paperback Non-Fiction Fast Food Nation (duplicate) Paperback Non-Fiction Founding Brothers 0375405445 0375705244 Paperback Non-Fiction French Lessons 0375705619 0375405909 Paperback Non-Fiction From Beirut to Jerusalem 0385413726 0374158959 Paperback Non-Fiction Ghost Soldiers 038549565X 0385495641 Paperback Non-Fiction It’s Not About the Bike 0399146113 0425179613 Paperback Non-Fiction Justice 0609608738 0609809636 Paperback Non-Fiction Me Talk Pretty One Day 0316776963 0316777722 Paperback Non-Fiction Napalm and Silly Putty 0786887583 0786864133 Paperback Non-Fiction Nickel and Dimed (duplicate) Paperback Non-Fiction On Writing 0743455967 0684853523 Paperback Non-Fiction Paris to the Moon 0679444920 0375758232 Paperback Non-Fiction Perpetual War for Perpetual Peace 156025405X Paperback Non-Fiction Personal History 0375701044 0394585852 Paperback Non-Fiction Seabiscuit 0375502912 0449005615 Paperback Non-Fiction The Botany of Desire 0375501290 0375760393 Paperback Non-Fiction The Darwin Awards 0525945725 0452283442 Paperback Non-Fiction The First American 0385495404 0385493282 Paperback Non-Fiction The Idiot Girls’ Action-adventure Club 0375760911 Paperback Non-Fiction The Lost Boy 1558745157 0613173538 Paperback Non-Fiction The Map That Changed the World 0060931809 0060193611 Paperback Non-Fiction The Metaphysical Club 0374199639 0374528497 Paperback Non-Fiction The Piano Shop on the Left Bank 0375758623 0375503048 Paperback Non-Fiction The Tipping Point (duplicate) Paperback Non-Fiction The Wild Blue 0743203399
  • 210.
    194 Bookstore searchand searchability: case study data 0743223098 Paperback Non-Fiction Washington 1586481185 0783895909 Table C.1: Correct book answers in bookstore case study.
  • 211.
Appendix D

TREC participation in 2002

This appendix is included for reference only and is drawn directly from [57]. TREC 2002 included a named page finding task and a Topic Distillation task. A preliminary exploration of forms of evidence which might be useful for named page finding and Topic Distillation was performed. For this reason there was heavy use of evidence other than page content.

D.1 Topic Distillation

In Topic Distillation the following forms of evidence were used:
• BM25 on full-text (content). Pages returned should be “relevant”. The .GOV corpus was indexed and BM25 applied, sometimes with stemming and sometimes without.
• BM25 on content and referring anchor-text. An alternative to content-only BM25 is to include referring anchor-text words in the BM25 calculation (content and anchors).
• In-link counting and filtering. To test whether pages with more in-links are potentially better answers, with differentiation between on-host and off-host links. Many results were eliminated on the grounds that they had insufficient in-links.
• URL length. Short URLs are expected to be better answers than long URLs.
• BM25 score aggregation. Sites with many BM25-matching pages are expected to be better than those with few.

Run           P@10     Evidence used (BM25 cont.; BM25 cont. & anch.; in-link counting & filtering; URL length; BM25 aggr.)
csiro02td1    0.1000   y y y
csiro02td2    0.0714   y y
csiro02td3    0.0184   y y y y
csiro02td4    0.0184   y y y
csiro02td5    0.0939   y (stem) y y
csiro02unoff  0.1959   y

Table D.1: Official results for submissions to the 2002 TREC web track Topic Distillation task.

In the 2002 Topic Distillation (TD2002) task, the focus on local page content relevance (BM25 content only) was probably too high for the non-content and aggregation methods to succeed. Most correct answers were expected to be shallow URLs of sites containing much useful content. In fact, correct answers were deeper, and the aggregation method for finding sites rich with relevant information was quite harmful (csiro02td3 and csiro02td4). The focus on page content is borne out by the improvement in effectiveness achieved when simple BM25 was applied in an unofficial run (csiro02unoff). To perform better in the TD2002 task, less (or no) emphasis should have been put on distillation evidence and far more emphasis on relevance. However, in some Web search situations, it is likely that the distillation evidence would be more important than it was in this TD2002 task.
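The exact thresholds and weights used in the csiro02td runs are not reproduced in this appendix. Purely as an illustration of how the non-content evidence listed above can be layered on top of a BM25 content score, the following sketch filters on off-host in-link counts, rewards short URLs and aggregates scores by site; all function names, input structures and parameter values are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

def rerank_with_web_evidence(bm25_scores, in_links, min_offhost=1, url_weight=0.5):
    """Illustrative re-ranking: drop pages with too few off-host in-links and
    add a bonus favouring short URLs.  `bm25_scores` maps URL -> BM25 score
    and `in_links` maps URL -> (on_host, off_host) counts; both inputs, and
    the threshold and weight values, are assumptions for illustration."""
    rescored = {}
    for url, score in bm25_scores.items():
        _on_host, off_host = in_links.get(url, (0, 0))
        if off_host < min_offhost:
            continue  # in-link filtering: eliminate poorly linked pages
        depth = urlparse(url).path.rstrip("/").count("/")
        rescored[url] = score + url_weight / (1.0 + depth)  # short URLs score higher
    return sorted(rescored.items(), key=lambda item: -item[1])

def aggregate_by_site(bm25_scores):
    """BM25 score aggregation: sum the scores of matching pages per host, so
    sites with many BM25-matching pages rank ahead of sites with few."""
    site_scores = defaultdict(float)
    for url, score in bm25_scores.items():
        site_scores[urlparse(url).netloc] += score
    return sorted(site_scores.items(), key=lambda item: -item[1])
```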
  • 212.
D.2 Named page finding

In the named page finding experiments the following forms of evidence were used:
• Okapi BM25 on document full-text (content) and/or anchor-text. Okapi BM25 was used to score document content and aggregate anchor-text documents.
• Stemming of query terms.
• Extra title weighting. To bias the results towards “page naming text”, further emphasis was placed on document titles.
• PageRank. To see whether link recommendation could be used to improve results [31].

Prior to submission, twenty named page training queries were generated. This training found that content with extra title weighting performed best, so page titles were expected to be important evidence in the official named page finding task. However, this appeared not to be the case; in fact, extra title weighting for the TREC queries appeared to reduce effectiveness (csiro02np01 vs csiro02np03). While there was some anchor-text evidence present for the query set (csiro02np02), when this evidence was combined with content (csiro02np04 and csiro02np16), results were noticeably worse than for the content-only run (csiro02np01). PageRank harmed retrieval effectiveness (run csiro02np16 versus csiro02np04). The official results are given in Table D.2.
  • 213.
Run          MRR    S@10   BM25                      Stemming   Extra title weighting   PageRank
csiro02np01  0.573  0.77   Content
csiro02np02  0.241  0.34   Anchor text
csiro02np03  0.416  0.59   Content                              y
csiro02np04  0.318  0.51   Content and anchor text   y          y
csiro02np16  0.307  0.49   Content and anchor text   y          y                       y

Table D.2: Official results for submissions to the 2002 TREC web track named page finding task.
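The exact weighting given to titles and anchor-text in these runs is not specified here. As an illustration only, the sketch below shows one common way that “extra title weighting” can be realised: a weighted linear combination of per-field Okapi BM25 scores, with an optional query-independent PageRank term. The bm25 implementation, field names and weights are assumptions, not the parameters of the official csiro02np runs.

```python
import math
from collections import Counter

def bm25(query_terms, doc_terms, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Minimal Okapi BM25 over one tokenised field (a sketch only; not the
    exact formulation or parameter settings used in the official runs)."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        if term not in tf:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score

def named_page_score(query_terms, fields, stats, pagerank=0.0,
                     w_content=1.0, w_title=3.0, w_anchor=1.0, w_pr=0.0):
    """Hypothetical combination: weighted sum of per-field BM25 scores plus an
    optional query-independent PageRank term.  `fields` maps a field name
    ("content", "title", "anchor") to its token list; `stats` maps the same
    names to (doc_freq, n_docs, avg_len) tuples for that field."""
    score = 0.0
    for field, weight in (("content", w_content),
                          ("title", w_title),
                          ("anchor", w_anchor)):
        doc_freq, n_docs, avg_len = stats[field]
        score += weight * bm25(query_terms, fields.get(field, []),
                               doc_freq, n_docs, avg_len)
    return score + w_pr * pagerank
```

In this sketch, setting w_title above 1.0 plays the role of extra title weighting, and setting w_pr to zero removes the PageRank contribution.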
  • 214.
  • 215.
Appendix E

Analysis of hyperlink recommendation evidence: additional results

This appendix contains further graphs from the experiment series examined in Chapter 5, Section 5.2.1. Figures E.1 and E.2 contain PageRank distributions for several company websites. These figures support the results presented in Chapter 5, but do not show any further interesting trends.
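For reference, each panel in Figures E.1 and E.2 is simply a tally of Toolbar PageRank values (integers from 0 to 10) over the pages crawled from one site. A minimal sketch of that tally is given below; the input mapping from page URL to Toolbar PageRank is a hypothetical stand-in for the crawl data used in Chapter 5.

```python
from collections import Counter
from urllib.parse import urlparse

def toolbar_pagerank_histograms(toolbar_pr):
    """Bucket crawled pages by host and count how many pages of each host
    carry each Toolbar PageRank value (0-10).  `toolbar_pr` maps page URL ->
    the integer Toolbar PageRank reported for that page."""
    per_site = {}
    for url, pr in toolbar_pr.items():
        host = urlparse(url).netloc
        per_site.setdefault(host, Counter())[pr] += 1
    return {host: [counts.get(value, 0) for value in range(11)]
            for host, counts in per_site.items()}
```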
[Figure E.1: six histograms of Toolbar PageRank (x-axis: PageRank 0-10; y-axis: pages in our crawl), one per site: www.harman.com (HP PR=7), www.introgen.com (HP PR=5), www.pnc.com (HP PR=6), www.progressenergy.com (HP PR=5), www.csx.com (HP PR=6), www.southtrust.com (HP PR=7).]

Figure E.1: Toolbar PageRank distributions within sites (additional to those presented in Section 5.2.1). The PageRank distributions for other sites are included in Figure 5.2 and in Figure E.2. The PageRank advice to users is usually that the home page is the most important or highest quality page, and other pages are less important or of lower quality. The PageRank of the home page of each site is shown as “HP PR=”.
[Figure E.2: six histograms of Toolbar PageRank (x-axis: PageRank 0-10; y-axis: pages in our crawl), one per site: www.tenneco-automotive.com (HP PR=6), www.novavax.com (HP PR=5), www.valero.com (HP PR=6), www.synergybrands.com (HP PR=5), www.teletouch.com (HP PR=5), www.tofc.net (HP PR=3).]

Figure E.2: Toolbar PageRank distributions within sites (additional to those presented in Section 5.2.1).
Appendix F

Okapi BM25 distributions

This appendix contains the distributions of Okapi BM25 scores for query-dependent evidence for the WT10gC collection (see Section 7.1.2) used throughout the experiments in Chapter 7. Figure F.1 contains the distribution of scores for document full-text. Figure F.2 contains the distribution of scores for the anchor-text baseline. The BM25 distributions are calculated using the top 1000 results for each of the 100 queries. Unlike query-independent evidence, BM25 scores are not comparable between query results. To build these distributions the BM25 scores for all queries were independently normalised (the top answer for each query receives a 1). Due to the cutoff at 1000, a truncated curve is expected. Additionally, because the query score distributions are not centred at the same point, the plots exhibit a flatter curve than would be observed for a single query score distribution.
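The per-query normalisation described above amounts to dividing each query's scores by that query's top score. A minimal sketch, assuming a run is represented as a mapping from query identifier to scored results (an input format chosen only for illustration), is:

```python
def normalise_per_query(run):
    """Rescale scores so the top-ranked answer for each query receives 1.0.
    `run` maps query_id -> list of (doc_id, bm25_score) pairs; this input
    representation is an assumption made for illustration."""
    normalised = {}
    for query_id, results in run.items():
        top = max(score for _, score in results) or 1.0  # guard against all-zero scores
        normalised[query_id] = [(doc_id, score / top) for doc_id, score in results]
    return normalised
```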
[Figure F.1: plot of percentage of documents (y-axis) against normalised BM25 content score (x-axis; top 1000 documents per query) for the ANU, WT10gC and VLC2R collections.]

Figure F.1: Distribution of normalised Okapi BM25 scores for document full-text for the WT10gC collection. The BM25 distributions are calculated using the top 1000 results for each of the 100 queries. Unlike query-independent evidence, BM25 scores are not comparable between query results. To build this distribution the BM25 scores for all queries were independently normalised (the top answer for each query receives a 1). Due to the cutoff at 1000, a truncated curve is expected. Additionally, because the query score distributions are not centred at the same point, the plot exhibits a flatter curve than would be observed for a single query.

[Figure F.2: plot of percentage of documents (y-axis) against normalised BM25 anchor score (x-axis; top 1000 documents per query) for the ANU, WT10gC and VLC2R collections.]

Figure F.2: Distribution of normalised Okapi BM25 scores for aggregate anchor-text for the WT10gC collection. The BM25 distributions are calculated using the top 1000 results for each of the 100 queries. Unlike query-independent evidence, BM25 scores are not comparable between query results. To build this distribution the BM25 scores for all queries were independently normalised (the top answer for each query receives a 1). Due to the cutoff at 1000, a truncated curve is expected. Additionally, because the query score distributions are not centred at the same point, the plot exhibits a flatter curve than would be observed for a single query.
Appendix G

Query sets

G.1 .GOV home page set

Query .GOV Doc ID
White House G03-16-2396677
Office of Homeland Security G25-97-0219687
Office of Management and Budget G01-47-2257273
OMB G01-47-2257273
United States Trade Representative G00-02-0599362
USTR G00-02-0599362
Department of Agriculture G42-03-3102230
USDA G42-03-3102230
Agricultural Research Service G00-03-3996998
Animal Plant Health Inspection Service G00-06-2853218
Cooperative State Research Education and Extension Service G00-11-0223618
Economic Research Service G00-03-2081400
Farm Service Agency G01-58-2364809
National Agricultural Library G00-00-2308409
Natural Resources Conservation Service G00-04-2280100
Research Economics Education G01-91-2827118
Rural Development G00-09-0025460
Bureau of the Census G02-93-4116586
STATUSA Database G00-10-3137809
Bureau of Export Administration G00-03-1901246
FEDWorld G00-06-4174747
International Trade Administration G00-00-3667859
ITA G00-00-3667859
National Institute of Standards Technology G40-04-1519418
NIST G40-04-1519418
National Marine Fisheries Service G46-01-2225985
NMFS G46-01-2225985
National Oceanic Atmospheric Administration G21-42-3486883
NOAA G21-42-3486883
National Ocean Service G00-03-1496820
National Technical Information Service G01-03-0674427
NTIS G01-03-0674427
National Telecommunications Information Administration G00-05-1550998
National Weather Service G00-10-2171731
Department of Education G00-03-2042174
Educational Resources Information Center G08-78-1802103
ERIC G08-78-1802103
National Library of Education G04-56-3588687
NLE G04-56-3588687
Department of Energy G00-06-1479477
Office of Economic Impact and Diversity G05-02-2264248
Southwestern Power Administration G00-11-0259770
Department of Health and Human Services G00-00-3031135
HHS G00-00-3031135
Administration for Children and Families G29-19-2177375
Agency for Health Care Research and Quality G00-01-0960846
AHCRQ G00-01-0960846
Centers for Disease Control and Prevention G08-82-2708305
CDC G08-82-2708305
Food and Drug Administration G00-01-3511414
FDA G00-01-3511414
Health Care Financing Administration G00-03-3635966
National Institutes of Health G00-01-3774693
NIH G00-01-3774693
National Library of Medicine G00-06-1119476
NLM G00-06-1119476
Department of Housing and Urban Development G19-73-3432233
HUD G19-73-3432233
Government National Mortgage Association G37-23-0000000
Ginnie Mae G37-23-0000000
Housing and Urban Development Reading Room G12-73-4081497
Office of Healthy Homes and Lead Hazard Control G10-39-2062297
Public and Indian Housing Agencies G12-36-3618097
Department of the Interior G00-09-2318516
DOI G00-09-2318516
Bureau of Land Management G00-00-2056373
BLM G00-00-2056373
Geological Survey G01-26-3878517
National Park Service G00-03-0029179
Office of Surface Mining G00-44-0995015
Department of Justice G00-04-3171772
DOJ G00-04-3171772
Drug Enforcement Agency G00-72-4001908
DEA G00-72-4001908
Federal Bureau of Investigation G01-84-2237979
FBI G01-84-2237979
Federal Bureau of Prisons G00-03-2244949
Immigration and Naturalization Service G04-47-1027920
INS G04-47-1027920
Office of Justice Programs G00-52-2562368
OJP G00-52-2562368
United States Marshals Service G04-91-1779147
USMS G04-91-1779147
Department of Labor G19-13-1577185
DOL G19-13-1577185
Bureau of Labor Statistics G39-37-3612440 G00-01-0682299
BLS G39-37-3612440 G00-01-0682299
Mine Safety and Health Administration G00-10-3730888
Occupational Safety Health Administration G00-09-2693851
OSHA G00-09-2693851
Department of State G00-58-0058694
DOS G00-58-0058694
Department of State Library G00-18-1147964
Department of Transportation G01-50-1226182
DOT G01-50-1226182
Bureau of Transportation Statistics G00-01-3065065
Federal Aviation Administration G00-06-2330537
FAA G00-06-2330537
National Transportation Library G00-03-1771651
Department of the Treasury G00-03-3649117
Bureau of Alcohol Tobacco Firearms G04-24-1874467
ATF G04-24-1874467
Bureau of Engraving and Printing G00-01-0534347
Bureau of Public Debt G00-04-1219947
Executive Office for Asset Forfeiture G04-75-2804241
Financial Crimes Enforcement Network G03-33-2329825
Financial Management Service G00-10-2794731
FMS G00-10-2794731
Internal Revenue Service IRS G01-42-2236557 G27-81-0697864
Office of Thrift Supervision G00-10-2917540
OTS G00-10-2917540
Secret Service G03-62-1819147
US Customs Service G26-69-3739619
US Mint G01-38-0907787
Department of Veterans Affairs G07-29-0536719
Advisory Council on Historic Preservation G00-08-1007258
ACHP G00-08-1007258
American Battle Monuments Commission G08-41-4046345
Central Intelligence Agency G06-34-0212798 G00-04-0693582
CIA G06-34-0212798 G00-04-0693582
Commodity Futures Trading Commission G00-16-3850519
CFTC G00-16-3850519
Consumer Product Safety Commission G00-03-1848726
CPSC G00-03-1848726
Corporation for National Service G00-08-4188069
Environmental Protection Agency G00-00-0029827
EPA G00-00-0029827
Equal Employment Opportunity Commission G00-79-1517391
EEOC G00-79-1517391
Farm Credit Administration G00-07-3398062
FCA G00-07-3398062
Federal Communications Commission G36-78-0130889
FCC G36-78-0130889
Federal Deposit Insurance Corporation G01-51-0988286
FDIC G01-51-0988286
Federal Election Commission G00-06-3072823
FEC G00-06-3072823
Federal Emergency Management Agency G00-03-2245885
FEMA G00-03-2245885
Federal Energy Regulatory Commission G00-05-0212361
FERC G00-05-0212361
Federal Labor Relations Authority G00-07-2059058
FLRA G00-07-2059058
Federal Maritime Commission G00-00-2164772
Federal Retirement Thrift Investment Board G00-06-0905797
FRTIB G00-06-0905797
Federal Trade Commission G03-32-2819928
FTC G03-32-2819928
General Services Administration G00-05-1904668
GSA G00-05-1904668
Federal Consumer Information Center Pueblo CO G22-50-0922418
Institute of Museum and Library Services G00-11-0472793
IMLS G00-11-0472793
International Broadcasting Bureau G00-06-1636322
IBB G00-06-1636322
Merit Systems Protection Board G01-60-1363045
MSPB G01-60-1363045
National Archives and Records Administration G00-02-1372443
NARA G00-02-1372443
National Capital Planning Commission G00-08-1222422
NCPC G00-08-1222422
National Commission on Libraries and Information Science NCLIS G00-05-0712949
NCLIS G00-05-0712949
National Council on Disability G00-08-0435196
National Credit Union Administration G42-74-1917577
NCUA G42-74-1917577
National Endowment for the Arts G00-00-3681135
NEA G00-00-3681135
National Mediation Board G00-06-2661322
NMB G00-06-2661322
National Science Foundation NSF G00-07-1120880
NSF G00-07-1120880
National Transportation Safety Board G00-02-1479121
NTSB G00-02-1479121
Nuclear Regulatory Commission G00-11-0770745
NRC G00-11-0770745
Nuclear Waste Technical Review Board G00-05-1894408
NWTRB G00-05-1894408
Occupational Safety and Health Administration G00-09-2693851
OSHA G00-09-2693851
Office of Federal Housing Enterprise Oversight G00-07-2732685
OFHEO G00-07-2732685
Office of Personnel Management G01-78-1330378
OPM G01-78-1330378
Office of Special Counsel G12-71-1037814 G00-09-3815798
OSC G12-71-1037814 G00-09-3815798
Overseas Private Investment Corporation G00-03-1048747
OPIC G00-03-1048747
Peace Corps G12-14-0612098
Pension Benefit Guaranty Corporation G00-08-2596456
Postal Rate Commission G00-10-2861072
Railroad Retirement Board G00-00-2016453
RRB G00-00-2016453
Securities and Exchange Commission G00-05-3121512
SEC G00-05-3121512
Selective Service System G00-08-4021223
SSS G00-08-4021223
Social Security Administration G03-24-2061352
SSA G03-24-2061352
Tennessee Valley Authority G00-07-2267029
TVA G00-07-2267029
Thrift Savings Plan G00-04-2615580
TSP G00-04-2615580
United States Arms Control and Disarmament Agency G00-50-1769358
ACDA G00-50-1769358
United States International Trade Commission G00-00-0300859
USITC G00-00-0300859
Dataweb G00-00-1961652
United States Office of Government Ethics G01-28-2830345
United States Postal Service G00-07-4137777
USPS G00-07-4137777
United States Trade and Development Agency G00-02-0555602
Voice of America G00-22-0758032
Broadcasting Bureau of Governors G01-30-3859822
Task Force on Agricultural Air Quality Research G01-51-3170401
White House Commission on Aviation Safety and Security G12-57-0619425
Radio and TV Marti G01-88-3234145
Judicial Branch G00-03-1342151
Legislative Branch G02-36-2411536 G02-32-2010279
Library of Congress G00-03-097897

Table G.1: .GOV home page finding training set. Generated using the automated sitemap method (described in Section 2.6.5.3) on the first.gov listing of government departments.
    Bibliography 1. ABITEBOUL, S.,PREDA, M., AND COBENA, G. Adaptive On-Line Page Impor- tance Computation. In Proceedings of WWW2003 (Budapest, Hungary, May 2003). 2. ADAMIC, L. A. The small World Wide Web. In Proceedings of ECDL’99 (Paris, France, 1999), pp. 443–452. 3. ADAMIC, L. A. Zipf, Power-laws, and Pareto - a ranking tutorial. Tech. rep., Information Dynamics Lab, HP Labs, 2000. http://www.hpl.hp.com/ research/idl/papers/ranking/ranking.html. 4. ADAMIC, L. A., AND HUBERMAN, B. A. The Nature of Markets in the World Wide Web. Quarterly Journal of Economic Commerce 1 (2000), 5–12. 5. ADAMIC, L. A., AND HUBERMAN, B. A. The Web’s Hidden Order. Communica- tions of the ACM 44, 9 (September 2001). 6. ALBERT, R., BARABASI, A., AND JEONG, H. Diameter of the World Wide Web. Nature 401, 9 (September 1999), 103–131. 7. ALTAVISTA. AltaVista. http://www.altavista.com, accessed 10/12/2003. 8. AMENTO, B., TERVEEN, L. G., AND HILL, W. C. Does “authority” mean qual- ity? Predicting expert quality ratings of Web documents. In Proceedings of ACM SIGIR’00 (Athens, Greece, July 2000), pp. 296–303. 9. AMITAY, E., CARMEL, D., DARLOW, A., LEMPEL, R., AND SOFFER, A. The Connectivity Sonar: Detecting Site Functionality by Structural Patterns. In Pro- ceedings of ACM HT’03 (Nottingham, United Kingdom, August 2003). 10. APACHE. Welcome! - The Apache HTTP Server Project, 2004. http://httpd. apache.org, accessed 12/11/2004. 11. ARASU, A., NOVAK, J., TOMKINS, A., AND TOMLIN, J. PageRank Computation and the Structure of the Web: Experiments and Algorithms. In Proceedings of WWW2002 (Hawaii, USA, May 2002). 12. AUSTRALIA POST. Australia post, 2004. http://www.australiapost.com. au, accessed 12/11/2004. 13. AYAN, N. F., LI, W.-S., AND KOLAK, O. Automating extraction of logical do- mains in a web site. Data and Knowledge Engineering 43, 2 (November 2002), 179– 205. 213
    214 Bibliography 14. BAEZA-YATES,R., AND RIBEIRO-NETO, B. Modern Information Retrieval. Addi- son Wesley, 1999. 15. BAILEY, P., CRASWELL, N., AND HAWKING, D. Engineering a multi-purpose test collection for Web retrieval experiments. Information Processing and Man- agement 39, 6 (2003), 853–871. http://es.cmis.csiro.au/pubs/bailey ipm03.pdf. 16. BALDI, P., FRASCONI, P., AND SMYTH, P. Modeling the Internet and the Web: Probabilistic Methods and Algorithms. Wiley, 2003. 17. BARABASI, A.-L., AND ALBERT, R. Emergence of Scaling in Random Networks. Science 286 (October 1999). 18. BARABASI, A.-L., ALBERT, R., AND JEONG, H. Scale-free characteristics of ran- dom networks: the topology of the World-Wide Web. Physica A 281 (2000), 69– 77. 19. BERGER, A., AND LAFFERTY, J. D. Information Retrieval as Statistical Transla- tion. In Proceedings of ACM SIGIR’99 (Berkeley, CA, USA, 1999), pp. 222–229. 20. BERNERS-LEE, T. Weaving the Web. The Original Design and Ultimate Destiny of the World Wide Web by its Inventor. Harper Collins, San Francisco, 1999. 21. BERNERS-LEE, T., FIELDING, R., AND MASINTER, L. RFC2396 – Uniform Re- source Identifiers. Request for Comments, August 1998. 22. BERRY, M. W., DUMAIS, S. T., AND O’BRIEN, G. W. Using Linear Algebra for Intelligent Information Retrieval. Tech. rep., University of Tennessee, Depart- ment of Computer Science, December 1994. 23. BHARAT, K., AND BRODER, A. Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content. In Proceedings of WWW8 (Toronto, Canada, May 1999). http://www8.org/w8-papers/4c-server/mirror/mirror.html. 24. BHARAT, K., BRODER, A., DEAN, J., AND HENZINGER, M. A Comparison of Techniques to Find Mirrored Hosts on the WWW. In WOWS’99 (Berkeley, USA, August 1999). http://www.henzinger.com/monika/. 25. BHARAT, K., CHANG, B., HENZINGER, M., AND RUHL, M. Who links to whom: Mining linkage between Web sites. In Proceedings of ICDM’01 (San Jose, USA, November 2001). 26. BHARAT, K., AND HENZINGER, M. Improved Algorithms for Topic Distilla- tion in a Hyperlinked Environment. In Proceedings of ACM SIGIR’98 (Melbourne, Australia, 1998). 27. BHARAT, K., AND MIHAILA, G. A. When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics. In Proceedings of WWW2001 (Hong Kong, 2001). http://www10.org/cdrom/papers/474/.
    Bibliography 215 28. BOOKSTEIN,A. Implications of Boolean Structures for Probabilistic Retrieval. In Proceedings of ACM SIGIR’85 (New York, USA, 1985), pp. 11–17. 29. BOTAFOGO, R., RIVLIN, E., AND SHNEIDERMAN, B. Structural Analysis of Hy- pertexts: Identifying Hierarchies and Useful Metrics. ACM Transactions on Infor- mation Systems 10, 2 (1992), 142–180. 30. BRAY, T. Measuring the Web. In Proceedings of WWW5 (Paris, France, May 1996). 31. BRIN, S., AND PAGE, L. The anatomy of a large-scale hypertextual web search engine. In Proceedings of WWW7 (Brisbane, Australia, May 1998). http: //www7.scu.edu.au/programme/fullpapers/1921/com1921.htm. 32. BRODER, A. On the Resemblance and Containment of Documents. In Proceed- ings of SEQS’97 (1997). 33. BRODER, A. A taxonomy of web search. ACM SIGIR Forum 36, 2 (Fall 2002), 3–10. 34. BRODER, A., GLASSMAN, S., MANASSE, M., AND ZWEIG, G. Syntactic Clustering of the Web. In Proceedings of WWW6 (Santa Clara, USA, April 1997). http://www.scope.gmd.de/info/www6/technical/paper205/ paper205.html. 35. BRODER, A., KUMAR, R., MAGHOUL, F., RAGHAVAN, P., RAJAGOPALAN, S., STATA, R., TOMKINS, A., AND WIENER, J. Graph structure in the Web: ex- periments and models. In Proceedings of WWW9 (Amsterdam, 2000). http: //www9.org/w9cdrom/index.html. 36. BUCKLEY, C., AND VOORHEES, E. Evaluating evaluation measure stability. In Proceedings of ACM SIGIR’00 (Athens, Greece, July 2000), pp. 33–40. 37. CAI, D., YU, S., WEN, J.-R., AND MA, W.-Y. VIPS: a Vision-based Page Segmen- tation Algorithm. Tech. rep., Microsoft Research Asia, 2003. MSR-TR-2003-79. 38. CAI, D., YU, S., WEN, J.-R., AND MA, W.-Y. Block-based web search. In Pro- ceedings of ACM SIGIR’04 (Sheffield, UK, July 2004), pp. 456–463. 39. CAI, D., YU, S., WEN, J.-R., AND MA, W.-Y. Block-level Link Analysis. In Proceedings of ACM SIGIR’04 (Sheffield, UK, July 2004), pp. 440–447. 40. CALADO, P., RIBEIRO-NETO, B., ZIVIANI, N., MOURA, E., AND SILVA, I. Local Versus Global Link Information in the Web. ACM Transactions on Information Systems 21, 1 (January 2003), 42–63. 41. CARRI `ERE, S. J., AND KAZMAN, R. WebQuery: Searching and visualizing the Web through connectivity. In Proceedings of WWW6 (Santa Clara, USA, 1997), pp. 701–711. http://www.scope.gmd.de/info/www6/technical/ paper096/paper96.html.
    216 Bibliography 42. CHAKRABARTI,S. Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction. In Proceedings of WWW2001 (Hong Kong, 2001), pp. 211–220. 43. CHAKRABARTI, S. Mining the Web: Discovering knowledge from hypertext data. Morgan Kaufmann, San Francisco, 2003. 44. CHAKRABARTI, S., DOM, B., RAGHAVAN, P., RAJOGOPALAN, S., AND KLEIN- BERG, J. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of WWW7 (Melbourne, Australia, 1998), pp. 65–74. 45. CHAKRABARTI, S., JOSHI, M., AND TAWDE, V. Enhanced Topic Distillation us- ing Text, Markup Tags, and Hyperlinks. In Proceedings of ACM SIGIR’01 (New Orleans, USA, 2001), pp. 208–216. 46. CHO, J., GARC´IA-MOLINA, H., AND PAGE, L. Efficient crawling through URL ordering. Computer Networks and ISDN Systems 30, 1–7 (1998), 161–172. 47. CHOWDHURY, A., FRIEDER, O., GROSSMAN, D., AND MCCABE, M. Collection Statistics for Fast Duplicate Document Detection. ACM Transactions on Informa- tion Systems 20, 2 (April 2002), 171–191. 48. CLEVERDON, C., MILLS, J., AND KEEN, M. Factors determining the perfor- mance of indexing systems. In ASLib Cranfield Project. Cranfield, 1966. 49. CLEVERDON, C. W. Optimizing convenient online access to bibliographic data- bases. Information Services and Use 4 (1984), 37–47. 50. COLLINS-THOMPSON, K., OGILVIE, P., ZHANG, Y., AND CALLAN, J. Informa- tion Filtering, Novelty Detection, and Named-Page Finding. In TREC-11 Note- book Proceedings (Gaithersburg, Maryland USA, November 2002), NIST. 51. COOPER, W. S. Getting beyond Boole. Information Processing and Management: An International Journal 24 (May 1988), 243–248. 52. CRASWELL, N., CRIMMINS, F., HAWKING, D., AND MOFFAT, A. Performance and cost tradeoffs in web search. In ADC’04 (Dunedin, New Zealand, January 2004), pp. 161–170. http://es.csiro.au/pubs/craswell adc04.pdf. 53. CRASWELL, N., AND HAWKING, D. Overview of the TREC-2002 Web Track. In TREC-11 Notebook Proceedings (Gaithersburg, MD, USA, November 2002). 54. CRASWELL, N., AND HAWKING, D. TREC-2004 Web Track Guidelines, July 2004. http://es.csiro.au/TRECWeb/guidelines 2004.html, accessed 10/11/2004. 55. CRASWELL, N., AND HAWKING, D. Characteristics of human-generated re- source lists. Unpublished (In submission).
    Bibliography 217 56. CRASWELL,N., HAWKING, D., AND ROBERTSON, S. Effective site finding us- ing link anchor information. In Proceedings of ACM SIGIR’01 (New Orleans, USA, 2001), pp. 250–257. http://es.cmis.csiro.au/pubs/craswell sigir01.pdf. 57. CRASWELL, N., HAWKING, D., THOM, J., UPSTILL, T., WILKINSON, R., AND WU, M. TREC11 Web and Interactive Tracks at CSIRO. In TREC-11 Notebook Proceedings (Gaithersburg, MD, USA, November 2002). 58. CRASWELL, N., HAWKING, D., THOM, J., UPSTILL, T., WILKINSON, R., AND WU, M. TREC12 Web Track at CSIRO. In TREC-12 Notebook Proceedings (Gaithersburg, MD, USA, November 2003). 59. CRASWELL, N., HAWKING, D., WILKINSON, R., AND WU, M. TREC10 Web and Interactive Tracks at CSIRO. In TREC-10 Notebook Proceedings (Gaithersburg, MD, USA, November 2001). http://es.cmis.csiro.au/pubs/craswell trec01.pdf. 60. CRASWELL, N., HAWKING, D., WILKINSON, R., AND WU, M. Overview of the TREC-2003 Web Track. In TREC-12 Notebook Proceedings (Gaithersburg, MD, USA, November 2003). 61. CROFT, W. B., AND HARPER, D. J. Using probabilistic models of document retrieval without relevance information. Journal of Documentation 35 (1979), 285– 295. 62. CSIRO. TREC Web Corpus: WT10g, 2003. http://es.csiro.au/TRECWeb/ wt10g.html, accessed 12/11/2004. 63. DAVISON, B. D. Recognizing Nepotistic Links on the Web. In Proceedings of AAAI’00 (Workship on Artificial Intelligence for Web Search) (Austin, Texas USA, 2000), pp. 23–28. 64. DAVISON, B. D. Topical Locality in the Web. In Proceedings of ACM SIGIR’00 (Athens, Greece, July 2000), pp. 272–279. 65. DAVISON, B. D. Topical Locality in the Web: Experiments and Observations. Tech. rep., Department of Computer Science, Rutgers, New Jersey, July 2000. 66. DEERWESTER, S., DUMAIS, S. T., FURNAS, G. W., LANDAUER, T. K., AND HARSHMAN, R. Indexing by Latent Semantic Analysis. JASIS 41, 6 (1990), 391– 407. 67. DILL, S., KUMAR, R., MCCURLEY, K. S., RAJAGOPALAN, S., SIVAKUMAR, D., AND TOMKINS, A. Self-Similarity in the Web. ACM Transactions On Internet Technologies 2, 3 (August 2002), 205–223.
    218 Bibliography 68. DING,C., HE, X., HUSBANDS, P., ZHA, H., AND SIMON, H. PageRank, HITS and a unified framework for link analysis. Tech. Rep. 49372, LBNL, 2002. http: //citeseer.nj.nec.com/546720.html. 69. DMOZ. Open Directory Project. http://www.dmoz.org, accessed 12/11/2004. 70. DUBLIN CORE METADATA INITIATIVE. Dublin Core Metadata Element Set, Version 1.1: Reference Description, 2003. http://dublincore.org/ documents/dces/, accessed 14/11/2004. 71. DUBLIN CORE METADATA INITIATIVE. DCMI Frequently Asked Ques- tions (FAQ) – What search-enginges support the Dublin Core Metadata Element Set?, 2004. http://www.dublincore.org/resources/faq/ #whatsearchenginessupport, accessed 14/11/2004. 72. DWORK, C., KUMAR, R., NAOR, M., AND SIVAKUMAR, D. Rank aggregation methods for the Web. In Proceedings of WWW2001 (Hong Kong, 2001), pp. 613– 622. http://doi.acm.org/10.1145/371920.372165. 73. EIRON, N., AND MCCURLEY, K. S. Analysis of Anchor Text for Web Search. Tech. rep., IBM, 2003. 74. EIRON, N., AND MCCURLEY, K. S. Analysis of Anchor Text for Web Search (Extended Abstract). In Proceedings of ACM SIGIR’03 (Toronto, Canada, 2003), pp. 450–460. 75. EIRON, N., AND MCCURLEY, K. S. Untangling Compound Documents on the Web. Tech. rep., IBM, 2003. 76. EISENBERG, M., AND BARRY, C. Order effects: A study of the possible influence of presentation order on user judgments of document relevance. JASIS 39, 5 (1988), 293–300. 77. EXCITE. Excite, 2004. http://www.excite.com, accessed 12/11/2004. 78. FAGIN, R., KUMAR, R., MCCURLEY, K. S., NOVAK, J., SIVAKUMAR, D., TOM- LIN, J. A., AND WILLIAMSON, D. P. Searching the Workplace Web. In Proceed- ings of WWW2003 (Budapest, Hungary, May 2003), pp. 366–375. 79. FAGIN, R., KUMAR, R., AND SIVAKUMAR, D. Comparing top k lists. In ACM SIAM (Baltimore, MD, USA, 2003), pp. 28–36. 80. FAST SEARCH AND TRANSFER, ASA. Personal communication, 2004. http: //www.alltheweb.com, accessed 12/11/2003. 81. FIELDING. RFC2616 - HTTP/1.1: Status Code Definitions, 1999. http://www. w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3, accessed 12/11/2004.
    Bibliography 219 82. FORTUNE.Fortune 500, 2003. http://www.fortune.com/fortune/ fortune500, accessed 06/09/2003. 83. FOX, E., AND SHAW, J. Combination of multiple searches. In TREC-3 Notebook Proceedings (Gaithersburg, MD, USA, 1994), pp. 243–252. 84. FRAKES, W., AND BAEZA-YATES, R., Eds. Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992. 85. FUHR, N., LALMAS, M., KAZAI, G., AND VERT, N. G. Proceedings of the INitia- tive for the Evaluation of XML Retrieval (INEX). In ERCIM workshop proceedings (Dagstuhl, 2003). 86. FUJIMURA, K., INOUE, T., AND SUGISAKI, M. The EigenRumor Algorithm for Ranking Blogs. In 2nd Annual Workshop on the Weblogging Ecosystem - Aggregation, Analysis and Dynamics (Chiba, Japan, 2005). 87. GARFIELD, E. Citation Indexes for Science: A New Dimension in Documentation through Association of Ideas. Science 122, 3159 (1955), 108–111. 88. GARFIELD, E. Citation analysis as a tool in journal evaluation. Science 178, 4060 (1972), 471–479. 89. GARNER, R. A Computer Oriented, Graph Theoretic Analysis of Citation Index Struc- tures. Drexel University Press, Philadelphia, 1967. 90. GLOVER, E. J., TSIOUTSIOULIKLIS, K., LAWRENCE, S., PENNOCK, D. M., AND FLAKE, G. W. Using Web Structure for Classifying and Describing Web Pages. In Proceedings of WWW2002 (Honolulu, Hawaii, USA, May 2002). 91. GOLUB, G. H., AND LOAN, C. F. V. Matrix Computations. The Johns Hopkins University Press, Baltimore, USA, 1996. 92. GOOGLE. Blogger. http://www.blogger.com, accessed 06/11/2005. 93. GOOGLE. Google search engine. http://www.google.com, accessed 12/11/2004. 94. GOOGLE. Google Directory > Shopping Publications > Books > Gen- eral, September 2002. http://directory.google.com/Top/Shopping/ Publications/Books/General, accessed 09/09/2002. 95. GOOGLE. Google Directory, 2004. http://directory.google.com/, ac- cessed 12/11/2004. 96. GOOGLE. Google Search Appliance Frequently Asked Questions, 2004. http: //www.google.com/appliance/faq.html, accessed 12/11/2004. 97. GOOGLE. Google Technology, 2004. http://www.google.com/ technology/, accessed 10/11/2004.
    220 Bibliography 98. GOOGLE.Google Toolbar, 2004. http://toolbar.google.com/, accessed 12/11/2004. 99. GRANKA, L., JOACHIMS, T., AND GAY, G. Eye-Tracking Analysis of User Behav- ior in WWW Search. In Proceedings of ACM SIGIR’04 (Sheffield, United Kingdom, August 2004). 100. GURRIN, C., AND SMEATON, A. F. Replicating Web Structure in Small-Scale Test Collections. Information Retrieval 7 (2004), 239–263. 101. HARMAN, D. How effective is suffixing? JASIS 42, 1 (1991), 7–15. 102. HAVELIWALA, T. H. Efficient computation of PageRank. Tech. Rep. 1999-31, Stanford University Database Group, 1999. http://dbpubs.stanford.edu: 8090/pub/1999-31. 103. HAVELIWALA, T. H. Topic-sensitive pagerank. In Proceedings of WWW2002 (Honolulu, Hawaii, USA, 2002), ACM Press, pp. 517–526. 104. HAVELIWALA, T. H. Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search. In IEEE Transactions on Knowledge and Data Engineer- ing (July 2003). 105. HAVELIWALA, T. H., AND KAMVAR, S. D. The Second Eigenvalue of the Google Matrix. Tech. rep., Stanford University, 2003. 106. HAWKING, D. Overview of the TREC-9 Web Track. In TREC-9 Notebook Pro- ceedings (Gaithersburg, MD, USA, 2000). http://trec.nist.gov/pubs/ trec9/. 107. HAWKING, D. Challenges in enterprise search. In Proceedings of the Australasian Database Conference ADC2004 (Dunedin, New Zealand, January 2004), pp. 15–26. Invited paper: http://es.csiro.au/pubs/hawking adc04keynote.pdf. 108. HAWKING, D., BAILEY, P., AND CRASWELL, N. An intranet reality check for TREC ad hoc. Tech. rep., CSIRO Mathematical and Information Sciences, 2000. http://es.cmis.csiro.au/pubs/hawking tr00.pdf. 109. HAWKING, D., BAILEY, P., AND CRASWELL, N. Efficient and flexible search using text and metadata. Tech. rep., CSIRO Mathematical and Information Sci- ences, 2000. http://es.csiro.au/pubs/hawking tr00b.pdf. 110. HAWKING, D., AND CRASWELL, N. Overview of the TREC-2001 Web Track. In TREC-10 Notebook Proceedings (Gaithersburg, MD, USA, 2001). http://trec. nist.gov/pubs/. 111. HAWKING, D., AND CRASWELL, N. Very large scale retrieval and web search. In TREC: Experiment and Evaluation in Information Retrieval, E. Voorhees and D. Har- man, Eds. MIT Press, 2005. http://es.csiro.au/pubs/trecbook for website.pdf.
    Bibliography 221 112. HAWKING,D., CRASWELL, N., BAILEY, P., AND GRIFFITHS, K. Measuring search engine quality. Information Retrieval 4, 1 (2001), 33–59. http://es. cmis.csiro.au/pubs/hawking ir01.pdf. 113. HAWKING, D., CRASWELL, N., CRIMMINS, F., AND UPSTILL, T. Enterprise search: What works and what doesn’t. In Proceedings of the Infonortics Search Engines Meeting (San Francisco, April 2002). http://es.csiro.au/pubs/ hawking se02talk.pdf. 114. HAWKING, D., CRASWELL, N., CRIMMINS, F., AND UPSTILL, T. How valu- able is external link evidence when searching enterprise webs? In Proceedings of ADC’04 (Dunedin, New Zealand, January 2004). http://es.cmis.csiro. au/pubs/hawking adc04.pdf. 115. HAWKING, D., CRASWELL, N., CRIMMINS, F., AND UPSTILL, T. How Valuable is External Link Evidence when Searching Enterprise Webs? In Proceedings of ADC’04 (Dunedin, New Zealand, January 2004). http://es.cmis.csiro. au/pubs/hawking adc04.pdf. 116. HAWKING, D., CRASWELL, N., AND GRIFFITHS, K. Which search engine is best at finding online services? In Proceedings of WWW10 (Hong Kong, 2001). http: //www10.org/cdrom/posters/1089.pdf. 117. HAWKING, D., CRASWELL, N., THISTLEWAITE, P., AND HARMAN, D. Results and challenges in Web search evaluation. In Proceedings of WWW8 (Toronto, Canada, 1999), vol. 31, pp. 1321–1330. http://es.cmis.csiro.au/pubs/ hawking www99.pdf. 118. HAWKING, D., AND ROBERTSON, S. On Collection Size and Retrieval Effective- ness. Information Retrieval 6, 1 (2003), 99–150. 119. HAWKING, D., AND THISTLEWAITE, P. Overview of TREC-6 Very Large Collec- tion Track. In TREC-6 Notebook Proceedings (Gaithersburg, MD, USA, 1997), E. M. Voorhees and D. K. Harman, Eds., pp. 93–105. 120. HAWKING, D., UPSTILL, T., AND CRASWELL, N. Towards better weighting of anchors. In Proceedings of SIGIR’04 (Sheffield, England, July 2004), pp. 512–513. http://es.csiro.au/pubs/hawking sigirposter04.pdf. 121. HAWKING, D., VOORHEES, E., BAILEY, P., AND CRASWELL, N. Overview of TREC-8 Web Track. In TREC-8 Notebook Proceedings (Gaithersburg, MD, USA, 1999), pp. 131–150. http://trec.nist.gov/pubs/trec-8. 122. HENZINGER, M., MOTWANI, R., AND SILVERSTEIN, C. Challenges in Web Search Engines. ACM SIGIR Forum 36, 2 (Fall 2002). 123. HEYDON, A., AND NAJORK, M. Mercator: A Scalable, Extensible Web Crawler. World Wide Web Journal (December 1999), 219 – 229. http://www.research. digital.com/SRC/mercator/.
    222 Bibliography 124. HORRIGAN,J. B., AND RAINIE, L. PEW Internet & American life project: Getting serious online, March 2002. http://www.pewinternet.org/ reports/reports.asp?Report=55&Section=ReportLevel1&Field= Level1ID&ID=241, accessed 12/11/2004. 125. HUBBELL, C. H. An Input-Output Approach to Clique Identification. Sociometry 28 (1965), 377–399. 126. HULL, D. Stemming algorithms – a case study for detailed evaluation. JASIS 47, 1 (1996), 70–84. 127. JEH, G., AND WIDOM, J. Scaling personalized web search. In Proceedings of WWW2003 (Budapest, Hungry, 2003), pp. 271–279. 128. JING, Y., AND CROFT, W. B. An association thesaurus for information retrieval. In Proceedings of RIAO’94 (New York, USA, 1994), pp. 146–160. 129. JOACHIMS, T. Evaluating Retrieval Performance Using Clickthrough Data. In Proceedings of ACM SIGIR’02 Workshop on Mathematical/Formal Methods in Infor- mation Retrieval (Tampere, Finland, 2002). 130. KAMVAR, S. D., HAVELIWALA, T. H., MANNING, C. D., AND GOLUB, G. H. Exploiting the block structure of the web for computing PageRank. Tech. rep., Stanford University, 2003. 131. KATZ, L. A new status index derived from sociometric analysis. Psychometrika 18, 1 (March 1953), 39–43. 132. KLEINBERG, J. M. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM 46, 5 (1999), 604–632. 133. KOSTER, M. robotstxt.org, 2003. http://www.robotstxt.org/, accessed 12/11/2003. 134. KRAAIJ, W., AND POHLMANN, R. Viewing Stemming as Recall Enhancement. In Proceedings of ACM SIGIR’96 (Zurich, Switzerland, 1996), pp. 40–48. 135. KRAAIJ, W., WESTERVELD, T., AND HIEMSTRA, D. The Importance of Prior Probabilities for Entry Page Search. In Proceedings of ACM SIGIR’02 (Tampere, Finland, 2002), pp. 27–34. 136. KUMAR, S. R., RAGHAVAN, P., RAJAGOPALAN, S., SIVAKUMAR, D., TOMKINS, A., AND UPFAL, E. The Web as a Graph. In Symposium on Principles of Database Systems (Dallas, Texas USA, 2000), pp. 1–10. 137. KUMAR, S. R., RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. Trawling the Web for emerging cyber-communities. In Proceedings of WWW8 (Toronto, Canada, 1999), pp. 403–415.
    Bibliography 223 138. LARSON,R. R. Bibliometrics of the World Wide Web: An exploratory analysis of the intellection architecture of cyberspace. Tech. rep., Computer Science De- partment, University of California, Santa Barbara, 1996. http://sherlock. berkeley.edu/asis96/asis96.html. 139. LAWRENCE, S., AND GILES, C. L. Searching the World Wide Web. Science 280, 5360 (1998). 140. LEMPEL, R., AND MORAN, S. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks 33, 1–6 (2000), 387–401. 141. LEMPEL, R., AND MORAN, S. (SALSA) the stochastic approach for link-structure analysis. ACM Transactions on Information Systems (2001). 142. LI, W.-S., KOLAK, O., AND VU, Q. Defining Logical Domains in a Web Site. In Proceedings of HT’00 (San Antonio, Texas USA, 2000). 143. LI, Y., AND RAFSKY, L. Beyond Relevance Ranking: Hyperlink Vector Vot- ing. In Proceedings of ACM SIGIR’97 Workshop on Networked Information Retrieval (Philadelphia, USA, 1997). 144. LOOKSMART. Looksmart, 2003. http://www.looksmart.com, accessed 12/11/2004. 145. MARCHIORI, M. The Quest for Correct Information on the Web: Hyper Search Engines. In Proceedings of WWW6 (Santa Clara, USA, 1997), pp. 265–276. 146. MARON, M., AND KUHNS, J. On Relevance, Probabilistic Indexing and Infor- mation Retrieval. Journal of the ACM 7, 3 (1960), 216–244. 147. MCKELLEHER, K. The Wired 40, July 2003. http://www.wired.com/ wired/archive/11.07/40main.html, accessed 06/09/2003. 148. MICROSOFT. Internet Information Services, 2004. http://www.microsoft. com/windowsserver2003/iis/default.mspx, accessed 11/12/2004. 149. MICROSOFT. MSN Search Engine, 2004. http://search.msn.com, accessed 11/12/2004. 150. MIZZARO, S. Relevance: The Whole History. JASIS 48, 9 (1997), 810–832. 151. MONTAGUE, M. Metasearch: Data fusion for Document Retrieval. PhD thesis, Dart- mouth College, Hannover, New Hampshire, 2002. 152. NETSCAPE. Core JavaScript Guide 1.5, 2000. http://devedge.netscape. com/library/manuals/2000/javascript/1.5/guide/. 153. NEW YORK TIMES. Bestsellers. Web Site, September 2002. http://www. nytimes.com/2002/09/01/books/bestseller/, accessed 09/09/2002.
    224 Bibliography 154. NG,A. Y., ZHENG, A. X., AND JORDAN, M. I. Link analysis, eigenvectors, and stability. In Proceedings of IJCAI’01 (Seattle, USA, 2001), ACM Press. 155. OGILVIE, P., AND CALLAM, J. Combining document representations for known- item search. In Proceedings of ACM SIGIR’03 (Toronto, Canada, August 2003), pp. 143–150. 156. OGILVIE, P., AND CALLAM, J. Combining structural information and the use of priors in mixed named-page and homepage finding. In TREC-12 Notebook Proceedings (Gaithersburg, MD, USA, November 2003), NIST. 157. PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. The PageRank Cita- tion Ranking: Bringing Order to the Web. Tech. Rep. 1999-66, Stanford Uni- versity Database Group, 1998. http://dbpubs.stanford.edu:8090/pub/ 1999-66. 158. PANDURANGAN, G., RAGHAVAN, P., AND UPFAL, E. Using PageRank to Char- acterize Web Structure. Tech. rep., Purdue University, 2002. 159. PANT, G. Deriving Link-context from HTML. In ACM DMKD (San Diego, Cali- fornia, USA, June 2003). 160. PARKER, L. M. P., AND JOHNSON, R. E. Does order of presentation affect users’ judgment of documents? JASIS 41, 7 (1990), 493–494. 161. PINSKI, G., AND NARIN, F. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Information Processing and Management 12 (1976). 162. PONTE, J. M., AND CROFT, W. B. A Language Modeling Approach to Informa- tion Retrieval. In Proceedings of ACM SIGIR’98 (Melbourne, Australia, August 1998). 163. PORTER, M. An algorithm for suffix stripping. Program 14, 3 (1980), 130–137. http://www.tartarus.org/∼martin/PorterStemmer/. 164. RAGGETT, D., HORS, A. L., AND JACOBS, I. HTML 4.01 Specification: The global structure of an HTML document, 1999. http://www.w3.org/TR/ html4/struct/global.html#didx-meta data, accessed 12/11/2004. 165. RAGHAVAN, S., AND GARCIA-MOLINA, H. Crawling the Hidden Web. In Pro- ceedings of VLDB’01 (2001), pp. 129–138. http://citeseer.ist.psu.edu/ article/raghavan01crawling.html. 166. RIVEST, R. The MD5 message-digest algorithm. Request for Comments, April 1992. 167. ROBERTSON, S. The probability ranking principle in IR. Journal of Documentation 33 (1977), 294–304. As appears in Spark-Jones and Willet, 1997.
    Bibliography 225 168. ROBERTSON,S., AND JONES, K. S. Simple, proven approaches to text retrieval. Tech. Rep. UCAM-CL-TR-356, University of Cambridge, May 1997. http://www.cl.cam.ac.uk/ftp/papers/reports/abstract. html#TR356-ksj-approaches-to-text-retrieval.html. 169. ROBERTSON, S., AND SPARCK-JONES, K. Relevance weighting of search terms. JASIS 27 (1976), 129–146. 170. ROBERTSON, S., AND WALKER, S. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of ACM SI- GIR’94 (Dublin, Ireland, 1994), pp. 232–241. 171. ROBERTSON, S., WALKER, S., HANCOCK-BEAULIEU, M., GULL, A., AND LAU, M. Okapi at TREC-1. In TREC-1 Notebook Proceedings (Gaithersburg, MD, USA, 1992), pp. 21–30. http://trec.nist.gov/pubs/trec1/. 172. ROBERTSON, S., WALKER, S., JONES, S., HANCOCK-BEAULIEU, M., AND GAT- FORD, M. Okapi at TREC-3. In TREC-3 Notebook Proceedings (Gaithersburg, MD, USA, 1994), pp. 109–126. http://trec.nist.gov/pubs/trec3/. 173. ROBERTSON, S., ZARAGOZA, H., AND TAYLOR, M. Simple BM25 extension to multiple weighted fields. In Proceedings of CIKM’04 (2004), pp. 42–49. http: //research.microsoft.com/%7Ehugoz/bm25wf.pdf. 174. ROCCHIO, J. Document Retrieval Systems–Optimization and Evaluation. PhD the- sis, Harvard Computational Laboratory, 1966. 175. ROCCHIO, J. Relevance Feedback in Information Retrieval. Prentice-Hall, Inc., 1971. 176. SALTON, G. Automatic Information Organization. McGraw-Hill, New York, 1968. 177. SALTON, G., Ed. The SMART retrieval system - experiments in automatic documment processing. McGraw-Hill, New York, 1971. 178. SAVOY, J., AND RASOLOFO, Y. Report on the TREC-10 experiment: Distributed collections and entrypage searching. In TREC-10 Notebook Proceedings (Gaithers- burg, MD, USA, 2001). http://trec.nist.gov/pubs/. 179. SEELEY, J. R. The net of reciprocal influence: A problem in treating sociometric data. Canadian Journal of Psychology 3 (1949), 234–240. 180. SHAH, C., AND CROFT, W. B. Evaluating High Accuracy Retrieval Techniques. In Proceedings of ACM SIGIR’04 (Sheffield, United Kingdom, 2004), pp. 2–9. 181. SHAKES, J., LANGHEINRICH, M., AND ETZIONI, O. Dynamic reference sifting: a case study in the homepage domain. Computer Networks and ISDN Systems 29 (1997), 1193–1204. 182. SHANNON, C. E. Prediction and entropy of printed English. Bell Systems Techni- cal Journal, 30 (1951), 51–64.
    226 Bibliography 183. SHIVAKUMAR,N., AND GARCIA-MOLINA, H. Finding Near-Replicas of Docu- ments on the Web. In Proceedings of WDB’98 (1998). 184. SILVERSTEIN, C., HENZINGER, M., MARAIS, H., AND MORICZ, M. Analysis of a Very Large AltaVista Query Log. Tech. rep., Digital Systems Research Center, 1998. 185. SINGHAL, A., AND KASZKIEL, M. A Case Study in Web Search using TREC Algorithms. In Proceedings of WWW10 (Hong Kong, 2001), pp. 708–716. http: //www10.org/cdrom/papers/317/. 186. SINGHAL, A., SALTON, G., MITRA, M., AND BUCKLEY, C. Document Length Normalization. Information Processing and Management 32, 5 (1996). 187. SMALL, H. Co-citation in the scientific literature: A new measure of the relation- ship between two documents. JASIS 24, 4 (1973), 265–269. 188. SOBOROFF, I. Do TREC Web Collections Look Like the Web? ACM SIGIR Forum 36, 2 (2002), 23–31. 189. SOBOROFF, I. On evaluating web search with very few relevant documents. In Proceedings of ACM SIGIR’04 (Sheffield, United Kingdom, 2004), pp. 530–531. 190. SPARCK-JONES, K. A statistical interpretation of term specificity and its applica- tion in retrieval. Journal of Documentation 28, 1 (1972), 11–20. 191. SPARCK-JONES, K., AND WILLET, P., Eds. Readings in Information Retrieval. Mor- gan Kaufmann, 1997. 192. SPINELLO, R. A. An ethical evaluation of web site linking. ACM SIGCAS Com- puters and Society 30, 4 (2000), 25–32. 193. SULLIVAN, D. How To Use HTML Meta Tags, December 2002. http: //searchenginewatch.com/webmasters/article.php/2167931, accessed 08/11/04. 194. SULLIVAN, D. Nielsen/NetRatings Search Engine Ratings. Web Site, September 2002. http://www.searchenginewatch.com/reports/netratings. html, accessed 06/11/2002. 195. SULLIVAN, D. Who Powers Whom? Search Providers Chart. Web Site, Septem- ber 2002. http://www.searchenginewatch.com/reports/alliances. html, accessed 06/11/2002. 196. TERVEEN, L., HILL, W., AND AMENTO, B. Constructing, Organizing, and Col- lections of Topically Related Web Resources. ACM Transactions of Computer- Human Interation 6, 1 (March 1999), 67–94.
    Bibliography 227 197. TOMLIN,J. A. A New Paradigm for Ranking Pages on the World Wide Web. In Proceedings of WWW2003 (Budapest, Hungary, May 2003). http://www2003. org/cdrom/papers/refereed/p042/paper42 html/p42-tomlin.htm. 198. TRAVIS, B., AND BRODER, A. Web search quality vs. informational relevance. In Proceedings of the Infonortics Search Engines Meeting (Boston, 2001). http://www. infonortics.com/searchengines/sh01/slides-01/travis.html. 199. UPSTILL, T., CRASWELL, N., AND HAWKING, D. Buying Bestsellers On- line: A Case Study in Search and Searchability. In Proceedings of ADCS2002 (Sydney, Australia, 2002). http://es.cmis.csiro.au/pubs/upstill adcs02.pdf. 200. UPSTILL, T., CRASWELL, N., AND HAWKING, D. Predicting fame and fortune: Pagerank or indegree? In Proceedings of ADCS2003 (Canberra, Australia, Decem- ber 2003). http://es.cmis.csiro.au/pubs/upstill adcs03.pdf. 201. UPSTILL, T., CRASWELL, N., AND HAWKING, D. Query-independent evidence in home page finding. ACM Transactions on Information Systems 21, 3 (2003), 286– 313. 202. UPSTILL, T., AND ROBERTSON, S. Exploiting Hyperlink Recommendation Ev- idence in Navigational Web Search. In Proceedings of ACM SIGIR’04 (Sheffield, United Kingdom, July 2004), pp. 576–577. 203. VAGHAN, L., AND SHAW, D. Bibliographic and Web Citations: What Is The Difference? JASIS 54, 14 (2003), 1313–1322. 204. VAN RIJSBERGEN, C. J. Information Retrieval, 2nd edition. Dept. of Computer Science, University of Glasgow, 1979. 205. VAN RIJSBERGEN, K. Information Retrieval. Butterworths, 1979. http://www. dcs.gla.ac.uk/Keith/Preface.html. 206. VOORHEES, E. Evaluation by highly relevant documents. In Proceedings of ACM SIGIR’01 (New Orleans, USA, 2001), pp. 74–82. 207. VOORHEES, E. M. Overview of the first Text REtrieval Conference (TREC-1). In TREC-1 Notebook Proceedings (Gaithersburg, MD, USA, 1991). 208. VOORHEES, E. M. Variations in relevance judgments and the measurement of retrieval effectiveness. In Proceedings of ACM SIGIR’98 (Melbourne, 1998). 209. VOORHEES, E. M. The Philosophy of Information Retrieval Evaluation. In Springer’s Lecture Notes. Springer, January 2002. 210. VOORHEES, E. M., AND HARMAN, D. K. Overview of the fifth Text REtrieval Conference (TREC-5). In TREC-5 Notebook Proceedings (Gaithersburg, MD, USA, 1996).
    228 Bibliography 211. WESTERVELD,T. Using generative probabilistic models for multimedia retrieval. PhD thesis, Centrum voor Wiskunde en Informatica, Amsterdam, Netherlands, 2004. 212. WESTERVELD, T., KRAAIJ, W., AND HIEMSTRA, D. Retrieving Web pages using content, links, URLs and anchors. In TREC-10 Notebook Proceedings (Gaithers- burg, MD, USA, 2001). http://trec.nist.gov/pubs/. 213. WILLIAMS, H. E., ZOBEL, J., AND BAHLE, D. Fast phrase querying with com- bined indexes. ACM Transactions on Information Systems 22, 4 (October 2004), 573–572. 214. WITTEN, I. H., BELL, T. C., AND MOFFAT, A. Managing Gigabytes: Compressing and Indexing Documents and Images. John Wiley & Sons, Inc., 1999. 215. XU, J., AND CROFT, W. B. Query expansion using local and global document analysis. In Proceedings of ACM SIGIR’96 (Zurich, Switzerland, 1996), pp. 4–11. 216. YAHOO! Yahoo! Business and Economy > Shopping and Services > Books > Booksellers, September 2002. http://www.yahoo.com/Business and Economy/Shopping and Services/Books/Booksellers/, accessed 09/09/2002. 217. YAHOO! Yahoo! Directory Service, 2004. http://www.yahoo.com, accessed 12/11/2004. 218. ZHAI, C., AND LAFFERTY, J. A study of smoothing methods for language mod- els applied to information retrieval. ACM Transactions on Information Systems 2, 2 (April 2004). 219. ZHU, X., AND GAUCH, S. Incorporating Quality Metrics in Central- ized/Distributed Information Retrieval on the World Wide Web. Tech. rep., De- partment of Electrical Engineering and Computer Science, University of Kansas, 2000. 220. ZOBEL, J. How reliable are the results of large-scale information retrieval exper- iments? In Proceedings of ACM SIGIR’98 (Melbourne, Australia, August 1998), pp. 307–314.