Document ranking using
web evidence
Trystan Garrett Upstill
A thesis submitted for the degree of
Doctor of Philosophy at
The Australian National University
August 2005
© Trystan Garrett Upstill
Typeset in Palatino by TeX and LaTeX2ε.
This thesis includes experiments published in:
• Upstill T., Craswell N., and Hawking D. “Buying Bestsellers Online: A Case
Study in Search and Searchability”, which appeared in the Proceedings of
ADCS2002, December 2002 [199].
• Upstill T., Craswell N., and Hawking D. “Query-independent evidence in home
page finding”, which appeared in the ACM TOIS volume 21:3, July 2003 [201].
• Craswell N., Hawking D., Thom J., Upstill T., Wilkinson R., and Wu M. “TREC12
Web Track at CSIRO”, which appeared in the TREC-12 Notebook Proceedings,
November 2003 [58].
• Upstill T., Craswell N., and Hawking D. “Predicting Fame and Fortune: Page-
Rank or Indegree?”, which appeared in the Proceedings of ADCS2003, Decem-
ber 2003 [200].
• Upstill T., and Robertson S. “Exploiting Hyperlink Recommendation Evidence
in Navigational Web Search”, which appeared in the Proceedings of SIGIR’04,
August 2004 [202].
• Hawking D., Upstill T., and Craswell N. “Towards Better Weighting of An-
chors”, which appeared in the Proceedings of SIGIR’04, August 2004 [120].
Chapter 9 contains results submitted as “csiro” runs in TREC 2003. The Topic Distilla-
tion runs submitted to TREC 2003 were generated in collaboration with Nick Craswell
and David Hawking. The framework used to tune parameters in Chapter 9 was de-
veloped by Nick Craswell. The first-cut ranking algorithm presented in Chapter 9 was
formulated by David Hawking for use in the Panoptic search system.
Except where indicated above, this thesis is my own original work.
Trystan Garrett Upstill
13 August 2005
Abstract
Evidence based on web graph structure is reportedly used by the current generation
of World-Wide Web (WWW) search engines to identify “high-quality”, “important”
pages and to reject “spam” content. However, despite the apparently wide use of this
evidence, its application in web-based document retrieval is controversial. Confusion
exists as to how to incorporate web evidence in document ranking, and whether such
evidence is in fact useful.
This thesis demonstrates how web evidence can be used to improve retrieval effec-
tiveness for navigational search tasks. Fundamental questions investigated include:
which forms of web evidence are useful, how web evidence should be combined with
other document evidence, and what biases are present in web evidence. Through
investigating these questions, this thesis presents a number of findings regarding
how web evidence may be effectively used in a general-purpose web-based document
ranking algorithm.
The results of experimentation with well-known forms of web evidence on several
small-to-medium collections of web data are surprising. Aggregate anchor-text mea-
sures perform well, but well-studied hyperlink recommendation algorithms are far
less useful. Further gains in retrieval effectiveness are achieved for anchor-text mea-
sures by revising traditional full-text ranking methods to favour aggregate anchor-text
documents containing large volumes of anchor-text. For home page finding tasks ad-
ditional gains are achieved by including a simple URL depth measure which favours
short URLs over long ones.
The most effective combination of evidence treats document-level and web-based
evidence as separate document components, and combines their scores linearly. It is
submitted that the document-level evidence contains the author’s description of
document contents, and that the web-based evidence gives the wider web community’s
view of the document. Consequently, if both measures agree, and the document is
scored highly in both cases, this is a strong indication that the page is what it
claims to be. A linear combination of the two types of evidence is found to be partic-
ularly effective, achieving the highest retrieval effectiveness of any query-dependent
evidence on navigational and Topic Distillation tasks.
However, care should be taken when using hyperlink-based evidence as a direct
measure of document quality. Thesis experiments show the existence of bias towards
the home pages of large, popular and technology-oriented companies. Further empir-
ical evidence is presented to demonstrate how the authorship of web documents and
sites directly affects the quantity and quality of available web evidence. These factors
demonstrate the need for robust methods for mining and interpreting data from the
web graph.
Contents
Abstract v
1 Introduction 3
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 5
2.1 A web search system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 The document gatherer . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 The indexer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 The query processor . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.4 The results presentation interface . . . . . . . . . . . . . . . . . . . 7
2.2 Ranking in web search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Document-level evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Text-based document evidence . . . . . . . . . . . . . . . . . . . . 9
2.3.1.1 Boolean matching . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1.2 Vector space model . . . . . . . . . . . . . . . . . . . . . 10
2.3.1.3 Probabilistic ranking . . . . . . . . . . . . . . . . . . . . 12
2.3.1.4 Statistical language model ranking . . . . . . . . . . . . 14
2.3.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.3 Other evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.3.1 Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.3.2 URL information . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3.3 Document structure and tag information . . . . . . . . . 19
2.3.3.4 Quality metrics . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.3.5 Units of retrieval . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Web-based evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.1 Anchor-text evidence . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4.2 Bibliometric measures . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.2.1 Bibliographic methods applied to a web . . . . . . . . . 27
2.4.3 Hyperlink recommendation . . . . . . . . . . . . . . . . . . . . . . 28
2.4.3.1 Link counting / in-degree . . . . . . . . . . . . . . . . . 28
2.4.3.2 PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4.3.3 Topic-specific PageRank . . . . . . . . . . . . . . . . . . 30
2.4.4 Other hyperlink analysis methods . . . . . . . . . . . . . . . . . . 30
2.4.4.1 HITS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5 Combining document evidence . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5.1 Score/rank fusion methods . . . . . . . . . . . . . . . . . . . . . . 33
2.5.1.1 Linear combination of scores . . . . . . . . . . . . . . . . 34
2.5.1.2 Re-ranking . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5.1.3 Meta-search fusion techniques . . . . . . . . . . . . . . . 34
2.5.1.4 Rank aggregation . . . . . . . . . . . . . . . . . . . . . . 35
2.5.1.5 Using minimum query-independent evidence thresh-
olds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5.2 Revising retrieval models to address combination of evidence . . 35
2.5.2.1 Field-weighted Okapi BM25 . . . . . . . . . . . . . . . . 36
2.5.2.2 Language mixture models . . . . . . . . . . . . . . . . . 37
2.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.6.1 Web information needs and search taxonomy . . . . . . . . . . . . 38
2.6.2 Navigational search tasks . . . . . . . . . . . . . . . . . . . . . . . 39
2.6.2.1 Home page finding . . . . . . . . . . . . . . . . . . . . . 39
2.6.2.2 Named page finding . . . . . . . . . . . . . . . . . . . . 39
2.6.3 Informational search tasks . . . . . . . . . . . . . . . . . . . . . . . 40
2.6.3.1 Topic Distillation . . . . . . . . . . . . . . . . . . . . . . . 40
2.6.4 Transactional search tasks . . . . . . . . . . . . . . . . . . . . . . . 40
2.6.5 Evaluation strategies / judging relevance . . . . . . . . . . . . . . 40
2.6.5.1 Human relevance judging . . . . . . . . . . . . . . . . . 40
2.6.5.2 Implicit human judgements . . . . . . . . . . . . . . . . 42
2.6.5.3 Judgements based on authoritative links . . . . . . . . . 42
2.6.6 Evaluation measures . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.6.6.1 Precision and recall . . . . . . . . . . . . . . . . . . . . . 42
2.6.6.2 Mean Reciprocal Rank and success rates . . . . . . . . . 44
2.6.7 The Text REtrieval Conference . . . . . . . . . . . . . . . . . . . . 44
2.6.7.1 TREC corpora used in this thesis . . . . . . . . . . . . . 45
2.6.7.2 TREC web track evaluations . . . . . . . . . . . . . . . . 45
3 Hyperlink methods - implementation issues 49
3.1 Building the web graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.1.1 URL address resolution . . . . . . . . . . . . . . . . . . . . . . . . 50
3.1.2 Duplicate documents . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.1.3 Hyperlink redirects . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.1.4 Dynamic content . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.1.5 Links created for reasons other than recommendation . . . . . . . 54
3.2 Extracting hyperlink evidence from WWW search engines . . . . . . . . 55
3.3 Implementing PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.1 Dangling links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.2 Bookmark vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3.3 PageRank convergence . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3.4 PageRank applied to small-to-medium webs . . . . . . . . . . . . 59
3.4 Expected correlation of hyperlink recommendation measures . . . . . . 59
4 Web search and site searchability 61
4.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1.1 Query selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1.2 Search engine selection . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.3 Bookstore selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.4 Submitting queries and collecting results . . . . . . . . . . . . . . 65
4.1.5 Judging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 Comparing bookstores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3 Comparing search engines . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.1 Search engine bookstore coverage . . . . . . . . . . . . . . . . . . 67
4.4 Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.1 Bookstore searchability: coverage . . . . . . . . . . . . . . . . . . 70
4.4.2 Bookstore searchability: matching/ranking performance . . . . . 73
4.4.3 Search engine retrieval effectiveness . . . . . . . . . . . . . . . . . 73
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5 Analysis of hyperlink recommendation evidence 77
5.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1.1 Sourcing candidate pages . . . . . . . . . . . . . . . . . . . . . . . 78
5.1.2 Company attributes . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.1.3 Extracting hyperlink recommendation scores . . . . . . . . . . . . 79
5.2 Hyperlink recommendation bias . . . . . . . . . . . . . . . . . . . . . . . 81
5.2.1 Home page preference . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2.2 Hyperlink recommendation as a page quality recommendation . 82
5.2.2.1 Large, famous company preference . . . . . . . . . . . . 82
5.2.2.2 Country and technology preference . . . . . . . . . . . . 82
5.3 Correlation between hyperlink recommendation measures . . . . . . . . 87
5.3.1 For company home pages . . . . . . . . . . . . . . . . . . . . . . . 87
5.3.2 For spam pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4.1 Home page bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4.2 Other systematic biases . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4.3 PageRank or in-degree? . . . . . . . . . . . . . . . . . . . . . . . . 91
6 Combining query-independent web evidence with query-dependent evidence 93
6.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.1.1 Query and document set . . . . . . . . . . . . . . . . . . . . . . . . 94
6.1.2 Query-dependent baselines . . . . . . . . . . . . . . . . . . . . . . 94
6.1.3 Extracting PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.1.4 Combining query-dependent baselines with query-independent
web evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.1 Baseline performance . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.2 Using a threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2.3 Re-ranking using PageRank . . . . . . . . . . . . . . . . . . . . . . 97
6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7 Home page finding using query-independent web evidence 101
7.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.1.1 Query-independent evidence . . . . . . . . . . . . . . . . . . . . . 102
7.1.2 Query-dependent baselines . . . . . . . . . . . . . . . . . . . . . . 102
7.1.3 Test collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.1.4 Combining query-dependent baselines with query-independent
evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.2 Minimum threshold experiments . . . . . . . . . . . . . . . . . . . . . . . 106
7.2.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.2.2 Training cutoffs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.3 Optimal combination experiments . . . . . . . . . . . . . . . . . . . . . . 112
7.3.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.4 Score-based re-ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4.1 Setting score cutoffs . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.5 Interpretation of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.5.1 What query-independent evidence should be used in re-ranking? 123
7.5.2 Which query-dependent baseline should be used? . . . . . . . . . 125
7.6 Further experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.6.1 Rank and score distributions . . . . . . . . . . . . . . . . . . . . . 127
7.6.2 Can the four-tier URL-type classification be improved? . . . . . . 127
7.6.3 PageRank and in-degree correlation . . . . . . . . . . . . . . . . . 131
7.6.4 Use of external link information . . . . . . . . . . . . . . . . . . . 132
7.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
8 Anchor-text in web search 135
8.1 Document statistics in anchor-text . . . . . . . . . . . . . . . . . . . . . . 135
8.1.1 Term frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
8.1.2 Inverse document frequency . . . . . . . . . . . . . . . . . . . . . 136
8.1.3 Document length normalisation . . . . . . . . . . . . . . . . . . . 138
8.1.3.1 Removing aggregate anchor-text length normalisation . 140
8.1.3.2 Anchor-text length normalisation by other document
fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
8.2 Combining anchor-text with other document evidence . . . . . . . . . . 143
8.2.1 Linear combination . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.2.2 Field-weighted Okapi BM25 . . . . . . . . . . . . . . . . . . . . . 143
8.2.3 Fusion of linear combination and field-weighted evidence . . . . 144
8.2.4 Snippet-based anchor-text scoring . . . . . . . . . . . . . . . . . . 144
8.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
8.3.1 Anchor-text baseline effectiveness . . . . . . . . . . . . . . . . . . 145
8.3.2 Anchor-text and full-text document evidence . . . . . . . . . . . . 146
8.3.2.1 Field-weighted Okapi BM25 combination . . . . . . . . 147
8.3.2.2 Linear combination . . . . . . . . . . . . . . . . . . . . . 148
8.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
9 A first-cut document ranking function using web evidence 151
9.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
9.1.1 Evaluating performance . . . . . . . . . . . . . . . . . . . . . . . . 151
9.1.2 Document evidence . . . . . . . . . . . . . . . . . . . . . . . . . . 152
9.1.2.1 Full-text evidence . . . . . . . . . . . . . . . . . . . . . . 152
9.1.2.2 Title evidence . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.1.2.3 URL length . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.1.3 Web evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.1.3.1 Anchor-text . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.1.3.2 In-degree . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.1.4 Combining document evidence . . . . . . . . . . . . . . . . . . . . 154
9.1.5 Test sets and tuning . . . . . . . . . . . . . . . . . . . . . . . . . . 155
9.1.6 Addressing the combined HP/NP task . . . . . . . . . . . . . . . 156
9.2 Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
9.2.1 Combining HP and NP runs for the combined task . . . . . . . . 160
9.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
9.3.1 TREC 2003 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
9.3.1.1 Topic Distillation 2003 (TD2003) results . . . . . . . . . . 160
9.3.1.2 Combined HP/NP 2003 (HP/NP2003) results . . . . . . 162
9.3.2 Evaluating the ranking function on further corporate web col-
lections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
9.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
10 Discussion 167
10.1 Web search system applicability . . . . . . . . . . . . . . . . . . . . . . . . 167
10.2 Which tasks should be modelled and evaluated in web search experi-
ments? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
10.3 Building a more efficient ranking system . . . . . . . . . . . . . . . . . . . 169
10.4 Tuning on a per corpus basis . . . . . . . . . . . . . . . . . . . . . . . . . . 170
11 Summary and conclusions 173
11.1 Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
11.2 Document ranking recommendations . . . . . . . . . . . . . . . . . . . . 176
11.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
A Glossary 179
B The canonicalisation of URLs 183
C Bookstore search and searchability: case study data 185
C.1 Book categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
C.2 Web search engine querying . . . . . . . . . . . . . . . . . . . . . . . . . . 185
C.3 Correct book answers in bookstore case study . . . . . . . . . . . . . . . 187
D TREC participation in 2002 195
D.1 Topic Distillation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
D.2 Named page finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
E Analysis of hyperlink recommendation evidence additional results 199
F Okapi BM25 distributions 203
G Query sets 205
G.1 .GOV home page set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Bibliography 213
List of Tables
2.1 Proximity of the term “Yahoo” to links to http://www.yahoo.com/ 24
4.1 Search engine properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Bookstores included in the evaluation . . . . . . . . . . . . . . . . . . . . 64
4.3 Bookstore comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4 Search engine success rates . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.5 Search engine precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.6 Search engine document coverage . . . . . . . . . . . . . . . . . . . . . . 69
4.7 Search engine link coverage . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1 Values extracted from Google . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 PageRanks by industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3 Extreme cases where PageRank and in-degree scores disagree. . . . . . . 88
7.1 Test collection information . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.2 Using query-independent thresholds on the ANU collection . . . . . . . 107
7.3 Using query-independent thresholds on the WT10gC collection . . . . . 109
7.4 Using query-independent thresholds on the WT10gT collection. . . . . . 111
7.5 Optimal re-ranking results for content . . . . . . . . . . . . . . . . . . . . 113
7.6 Optimal re-ranking results for anchor-text . . . . . . . . . . . . . . . . . . 114
7.7 Optimal re-ranking results for content+anchor-text . . . . . . . . . . . . . 115
7.8 Significant differences between methods when using Optimal re-rankings . 116
7.9 Summary of Optimal re-ranking results . . . . . . . . . . . . . . . . . . . 117
7.10 Score-based re-ranking results for content . . . . . . . . . . . . . . . . . . 120
7.11 Score-based re-ranking results for anchor-text . . . . . . . . . . . . . . . . 121
7.12 Score-based re-ranking results for content+anchor-text . . . . . . . . . . 122
7.13 Numerical summary of re-ranking improvements . . . . . . . . . . . . . 123
7.14 S@5 for URL-type category combinations, length and directory depth . . 131
7.15 Correlation of PageRank variants with in-degree . . . . . . . . . . . . . . 132
7.16 Using VLC2 links in WT10g . . . . . . . . . . . . . . . . . . . . . . . . . . 133
8.1 Summary of idf variants used in ranking functions under examination . 138
8.2 Summary of document length normalisation variants in ranking func-
tions under examination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
8.3 Summary of snippet-based document ranking algorithms under exam-
ination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
8.4 Okapi BM25 aggregate anchor-text scores and ranks for length normal-
isation variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
8.5 Effectiveness of Okapi BM25 aggregate anchor-text length normalisa-
tion techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
8.6 Length normalisation in Field-weighted Okapi BM25 . . . . . . . . . . . 147
8.7 Effectiveness of anchor-text snippet-based ranking functions . . . . . . . 148
8.8 Effectiveness of the evaluated combination methods for TD2003 . . . . . 149
8.9 Effectiveness of the evaluated combination methods for NP2002 and
NP&HP2003 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
9.1 Tuned parameters and retrieval effectiveness . . . . . . . . . . . . . . . . 159
9.2 Results for combined HP/NP runs on the training set . . . . . . . . . . . 160
9.3 Topic Distillation submission summary . . . . . . . . . . . . . . . . . . . 161
9.4 Combined home page/named page finding task submission summary . 162
9.5 Ranking function retrieval effectiveness on the public corporate webs
of several large Australian organisations . . . . . . . . . . . . . . . . . . . 164
C.1 Correct book answers in bookstore case study . . . . . . . . . . . . . . . 194
D.1 Official results for submissions to the 2002 TREC web track Topic Dis-
tillation task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
D.2 Official results for submissions to the 2002 TREC web track named page
finding task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
G.1 .GOV home page finding training set . . . . . . . . . . . . . . . . . . . . . 211
List of Figures
2.1 A sample network of relationships . . . . . . . . . . . . . . . . . . . . . . 25
3.1 Effect of PageRank d value (random jump probability) on success rate
for Democratic PageRank calculations for the WT10gC test collection . . 57
3.2 Effect of PageRank d value (random jump probability) on success rate
for Aristocratic PageRank calculations for the WT10gC test collection . . 58
3.3 Effect of PageRank d value on the rate of Democratic PageRank conver-
gence on WT10g, by number of iterations . . . . . . . . . . . . . . . . . . 58
5.1 Combined PageRank distribution for the non-home page document set . 79
5.2 Toolbar PageRank distributions within sites . . . . . . . . . . . . . . . . . 83
5.3 Bias in hyperlink recommendation evidence towards large, admired
and popular companies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4 Bias in hyperlink recommendation evidence towards technology-oriented
or US companies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.5 Toolbar PageRank versus in-degree for company home pages. . . . . . . 88
5.6 Toolbar PageRank versus in-degree for links to a spam company. . . . . 89
6.1 The percentage of home pages and non-home pages that exceed each
Google PageRank value. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2 Quota-based re-ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3 Score-based re-ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.4 Example of two queries using different re-ranking techniques . . . . . . 99
7.1 Example of an Optimal re-ranking and calculation of random control
success rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.2 Setting score-based re-ranking cutoffs for the content and anchor-text
baselines using the WT10gC collection . . . . . . . . . . . . . . . . . . . . 118
7.3 Setting score-based re-ranking cutoffs for the content+anchor-text base-
line using the WT10gC collection . . . . . . . . . . . . . . . . . . . . . . . 119
7.4 Baseline success rates across different cutoffs . . . . . . . . . . . . . . . . 126
7.5 Baseline rankings of the correct answers for WT10gC . . . . . . . . . . . 128
7.6 PageRank distributions for WT10gC . . . . . . . . . . . . . . . . . . . . . 129
7.7 In-degree and URL-type distributions for WT10gC . . . . . . . . . . . . . 130
8.1 Document scores achieved by BM25 using several values of k1 with
increasing tf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8.2 Aggregate anchor-text term distribution for the USGS home page . . . . 139
8.3 Aggregate anchor-text term distribution for a USGS info page . . . . . . 139
8.4 The effect of document length normalisation on BM25 scores for a sin-
gle term query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
9.1 Document scores achieved by AF1 and BM25 for values of tf . . . . . . 154
9.2 A plot illustrating the concurrent exploration of Okapi BM25 k1 and b
values using the hill-climbing function . . . . . . . . . . . . . . . . . . . . 157
9.3 A full iteration of the hill-climbing function . . . . . . . . . . . . . . . . . 158
E.1 Google Toolbar PageRank distributions within sites (Additional to those
in Chapter 5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
E.2 Google Toolbar PageRank distributions within sites (Additional to those
in Chapter 5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
F.1 Distribution of normalised Okapi BM25 scores for document full-text . 204
F.2 Distribution of normalised Okapi BM25 scores for aggregate anchor-text 204
“In an extreme view, the world can be seen as only connections, nothing
else. We think of a dictionary as the repository of meaning, but it defines
words only in terms of other words. I liked the idea that a piece of infor-
mation is really defined only by what it’s related to, and how it’s related.
There really is little else to meaning. The structure is everything. There
are billions of neurons in our brains, but what are neurons? Just cells. The
brain has no knowledge until connections are made between neurons. All
that we know, all that we are, comes from the way our neurons are con-
nected.”
— Tim Berners-Lee [20]
Chapter 1
Introduction
Document retrieval on the World-Wide Web (WWW), arguably the world’s largest col-
lection of documents, is a challenging and important task. The scale of the WWW is
immense, consisting of at least ten billion publicly visible web documents1 distributed
on millions of servers world-wide. Web authors follow few formal protocols, often re-
main anonymous and publish in a wide variety of formats. There is no central registry
or repository of the WWW’s contents and documents are often in a constant state of
flux. The WWW is also an environment where documents often misrepresent their
content as some web authors seek to unbalance ranking algorithms in their favour for
personal gain [122]. To compound these factors, WWW search engine users typically
provide short queries (averaging around two terms [184]) and expect a sub-second
response time from the system. Given these significant challenges, there is potentially
much to be learnt from the search systems which manage to retrieve relevant docu-
ments in such an environment.
The current generation of WWW search engines reportedly makes extensive use
of evidence derived from the structure of the WWW to better match relevant doc-
uments and identify potentially authoritative pages [31]. However, despite this re-
ported use, to date there has been little analysis which supports the inclusion of web
evidence in document ranking, or which examines precisely what its effect on search
results might be. The success of document ranking in the current generation of WWW
search engines is attributed to a number of web analysis techniques. How these tech-
niques are used and incorporated remains a trade secret. It also remains unclear as
to whether such techniques can be employed to improve retrieval effectiveness in
smaller, corporate-sized web collections.
This thesis investigates how web evidence can be used to improve retrieval ef-
fectiveness for navigational search tasks. Three important forms of web evidence
are considered: anchor-text, hyperlink recommendation measures (PageRank vari-
ants and in-degree), and URL hierarchy-based measures. These forms of web evi-
dence are believed to be used by prominent WWW search engines [31]. Other forms
of web evidence reviewed, but not examined, include HITS [132], HTML document
structure [42] and page segmentation [37], information unit measures [196], and click-
through evidence [129].
1. This is necessarily a crude estimate of the WWW’s static size. See Section 2.4 for details.
To exploit web evidence effectively in a document ranking algorithm, several ques-
tions must be addressed:
• Which forms of web evidence are useful?
• How should web evidence be combined with other document evidence?
• What biases are inherent in web evidence?
Through addressing these and other related problems, this thesis demonstrates
how web evidence may be used effectively in a general-purpose web-based document
ranking algorithm.
1.1 Overview
Chapters 2 and 3 review background literature and implementation issues. Chap-
ter 2 surveys the web search domain, and presents an overview of document and
web evidence often used in web-based document ranking, methods for combining
this evidence, and a review of strategies for evaluating the effectiveness of ranking
algorithms. To justify the formulations of hyperlink evidence used, and to ensure ex-
periments can be reproduced, Chapter 3 describes methods used to process the web
graph and implement recommendation evidence.
Chapters 4 to 8 present a series of detailed experiments. Chapter 4 reports results
from an investigation of how the searchability of web sites affects hyperlink evidence,
and thereby retrieval effectiveness in WWW search engines. Chapter 5 presents a
set of experiments that analyse the extent to which hyperlink evidence is correlated
with “real-world” measures of authority or quality. It includes an analysis of how
the use of web evidence may bias search results, and whether hyperlink recommen-
dation evidence is useful in identifying site entry points. Chapters 6 and 7 follow
with an evaluation of retrieval effectiveness improvements afforded by hyperlink ev-
idence. Chapter 6 investigates how query-independent evidence might be combined
with query-dependent baselines. Chapter 7 investigates the home page finding task
on small-to-medium web collections. Chapter 8 presents a set of experiments that in-
vestigates further possibilities for improving the effectiveness of measures based on
anchor-text evidence.
The experiments culminate in a proposal for, and evaluation of, a ranking function
that incorporates evidence explored in this thesis. The effectiveness of this ranking
function is evaluated through submissions to the TREC 2003 web track, presented in
Chapter 9. Chapters 10 and 11 present and discuss findings, draw conclusions and
outline future research directions. A glossary is included as Appendix A.
Chapter 2
Background
To provide a foundation and context for thesis experiments, this chapter outlines the
web-based document ranking domain. The chapter includes:
• An overview of a generic web search system, outlining the role of document
ranking in web search;
• A detailed analysis of document and web-level evidence commonly used for
document ranking in research and (believed to be used in) commercial web
search systems;
• An exploration of methods for combining evidence into a single ranking func-
tion; and
• A review of common user web search tasks and methods used to evaluate the
effectiveness of document ranking for such tasks.
Where applicable, reference is made throughout this chapter to the related scope
of the thesis and the rationale for experiments undertaken.
2.1 A web search system
A web search engine typically consists of a document gatherer (usually a crawler), a
document indexer, a query processor and a results presentation interface [31]. The
document gatherer and document indexer need only be run when the underlying set
of web documents has changed (which is likely to be continuous on the WWW, but
perhaps intermittent for other web corpora).
How each element of a generic web search system is understood in the context of this
thesis is discussed below.
2.1.1 The document gatherer
Web-based documents are normally1 gathered using a crawler [123]. Crawlers traverse
a web graph by recursively following hyperlinks, storing each document encountered,
and parsing stored documents for URLs to crawl. Crawlers typically maintain a fron-
tier, the queue of pages which remain to be downloaded. The frontier may be a FIFO2
queue, or sorted by some other attribute, such as perceived authority or frequency of
change [46]. Crawlers also typically maintain a list of all downloaded or detected du-
plicate pages (so pages are not fetched more than once), and a scope of pages to crawl
(for example, a maximum depth, specified domain, or timeout value), both of which
are checked prior to adding pages to the frontier. The crawler frontier is initialised
with a set of seed pages from which the crawl starts (these are specified manually).
Crawling ceases when the frontier is empty, or some time or resource limit is reached.
Once crawling is complete,3 the downloaded documents are indexed.
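To make the crawl loop concrete, the following Python sketch implements the frontier-based traversal described above. It is a minimal illustration only, not a description of any particular crawler: the fetch, extract_links and in_scope functions are assumed to be supplied by the caller, the frontier is a plain FIFO queue, and duplicate detection is by exact URL.
\begin{verbatim}
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, fetch, extract_links, in_scope, limit=10000):
    """Minimal breadth-first crawler sketch (illustrative assumptions only).

    fetch(url) -> page content or None, extract_links(url, content) -> hrefs,
    and in_scope(url) -> bool are supplied by the caller.
    """
    frontier = deque(seed_urls)      # FIFO queue of pages still to be downloaded
    seen = set(seed_urls)            # pages already downloaded or queued
    store = []                       # downloaded documents

    while frontier and len(store) < limit:
        url = frontier.popleft()
        content = fetch(url)
        if content is None:          # fetch failed, disallowed or timed out
            continue
        store.append((url, content))
        for href in extract_links(url, content):
            absolute = urljoin(url, href)        # resolve relative links
            if absolute not in seen and in_scope(absolute):
                seen.add(absolute)
                frontier.append(absolute)
    return store                     # documents are then passed to the indexer
\end{verbatim}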
2.1.2 The indexer
The indexer distils information contained within corpus documents into a format
which is amenable to quick access by the query processor. Typically this involves
extracting document features by breaking documents down into their constituent terms,
extracting statistics relating to term presence within the documents and corpus, and
calculating any query-independent evidence.4 After the index is built, the system is
ready to process queries.
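As a deliberately simplified illustration of this step, the following sketch builds an in-memory inverted index recording, for each term, the documents it occurs in and the within-document term frequency, together with document lengths for later ranking. The tokenisation rule and the tiny stop list are assumptions made for illustration; the postings and doc_lengths structures it produces are reused by the ranking sketches later in this chapter.
\begin{verbatim}
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "and", "a"}     # illustrative stop list only

def tokenise(text):
    """Lower-case, split on non-alphanumerics and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]

def build_index(docs):
    """docs maps doc_id -> text.
    Returns (postings, doc_lengths) where postings[term][doc_id] = tf."""
    postings = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in docs.items():
        terms = tokenise(text)
        doc_lengths[doc_id] = len(terms)     # length in terms, not bytes
        for term, tf in Counter(terms).items():
            postings[term][doc_id] = tf
    return postings, doc_lengths
\end{verbatim}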
2.1.3 The query processor
The query processor serves user queries by matching and ranking documents from the
index according to user input. As the query processor interacts directly with the doc-
ument index created by the indexer, they are often considered in tandem.
This thesis is concerned with a non-iterative retrieval process, i.e. one without
query refinement or relevance feedback [169, 174, 175, 177]. This is the level of in-
teraction supported by current popular WWW search systems and many web search
systems, most of which incorporate little relevance feedback beyond “find more like
this” [93] or lists containing suggested supplementary query terms [217].
Although particularly important in WWW search systems, this thesis is not pri-
marily concerned with the efficiency of query processing. A comprehensive overview
of efficient document query processing and indexing methods is provided in [214].
1. In some cases alternative document accessing methods may be available, for example if the documents being indexed are stored locally.
2. A queue ordered such that the first item in is the first item out.
3. If crawling is continuous, and an incremental index structure is used, documents might be indexed continuously.
4. Query-independent evidence is evidence that does not depend on the user query. For efficiency reasons such evidence is generally collected and calculated during the document indexing phase (prior to query processing).
2.1.4 The results presentation interface
The results presentation interface displays and links to the documents matched by the
query processor in response to the user query. Current popular WWW and web search
systems present a linear list of ranked results, sometimes with the degree of match
and/or summaries and abstracts for the matching documents. This type of interface
is modelled in experiments within this thesis.
2.2 Ranking in web search
The principal component of the query processor is the document ranking function.
The ranking functions of modern search systems frequently incorporate many forms
of document evidence [31]. Some of this evidence, such as textual information, is
collected locally for each document in the corpus (described in Section 2.3). Other
evidence, such as external document descriptions or recommendations, is amassed
through an examination of the context of a document within the web graph (described
in Section 2.4).
2.3 Document-level evidence
Text-based ranking algorithms typically assign scores to documents based on the dis-
tribution of query terms within both the document and the corpus. Therefore the
choice of what should constitute a term is an important concern. While terms are
often simply defined as document words (treated individually) [170] they may also
take further forms. For example, terms may consist of the canonical string compo-
nents of words (stems) [163], include (n-)tuples of words [214], consist of a word and
associated synonyms [128], or may include a combination of some or many of these
properties.
Unless otherwise noted, the ranking functions examined within this thesis use
single words as terms. In some experiments ranking functions instead make use of
canonical word stems, conflated using the Porter stemmer [163], as terms. These and
alternative term representations are discussed below.
The conflation of terms may increase the overlap between documents and queries,
finding term matches which may otherwise have been missed. For example, if the
query term “cat” is processed and a document in the corpus mentions “cats”, it is
likely that the document will be relevant to the user’s request. Stemming methods
are frequently employed to reduce words to their canonical forms and thereby allow
such matches. An empirically validated method for reducing terms to their canon-
ical forms is the Porter stemmer [163]. The Porter stemmer has been demonstrated
to perform as well as other suffix-stemming algorithms and to perform comparably
to other significantly more expensive, linguistic-based stemming algorithms [126].5
5. These algorithms are expensive with regard to training and computational cost.
The Porter stemmer removes suffixes, for example “shipping” and “shipped” would
become “ship”. In this way suffix-stemming attempts to remove pluralisation from
terms and to generalise words [126],6 sometimes leading to an improvement in re-
trieval system recall [134]. However, reducing the exactness of term matches can
result in the retrieval of less relevant documents [84, 101], thereby reducing search
precision.7 Furthermore, if a retrieved document does not contain any occurrences of
a query term, as all term matches are stems, it may be difficult for a user to understand
why that document was retrieved [108].
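For illustration, the Porter stemmer implementation distributed with NLTK reproduces the conflation behaviour described above; the choice of library is incidental and is not implied by the thesis.
\begin{verbatim}
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["shipping", "shipped", "ships", "cat", "cats"]:
    print(word, "->", stemmer.stem(word))

# "shipping", "shipped" and "ships" all reduce to "ship", and "cats" to "cat",
# so a query for one surface form matches documents containing the others
# once both queries and documents are stemmed.
\end{verbatim}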
In many popular ranking functions documents are considered to be “bags-of-
words” [162, 170, 176], where term occurrence is assumed to be independent and
unordered. For example, given a term such as “Computer” there is no prior probabil-
ity of encountering the word “Science” afterwards. Accordingly no extra evidence
is recorded if the words “Computer Science” are encountered together in a docu-
ment rather than separately. While there is arguably more meaning conveyed through
higher order terms (terms containing multiple words) than in single-word term mod-
els, there is little empirical evidence to support the use of higher-order terms [128].
Even when using manually created word association thesauri, retrieval effectiveness
has not been observed to be significantly improved [128]. “Bags-of-words” algorithms
are also generally less expensive when indexing and querying English language doc-
uments8 [214].
Terms may have multiple meanings (polysemy) and many concepts are repre-
sented by multiple words (synonyms). Several methods attempt to explore relation-
ships between terms to compress the document and query space. The association
of words to concepts can be performed manually through the use of dictionaries
or ontologies, or automatically using techniques such as Latent Semantic Analysis
(LSA) [22, 66]. LSA involves the extraction of contextual meaning of words through
examinations of the distribution of terms within a corpus using the vector space model
(see Section 2.3.1.2). Terms are broken down into co-occurrence tables and then a Sin-
gular Value Decomposition (SVD) is performed to determine term relationships [66].
The SVD projects the initial term meanings onto a subspace spanned by only the “im-
portant” singular term vectors. The potential benefits of LSA techniques are two-fold:
firstly they may reduce user confusion through the compression of similar (synony-
mous or polysemous) terms, and secondly they may reduce the size of term space, and
thereby improve system efficiency [66]. Indeed, LSA techniques have been shown
to improve the efficiency of retrieval systems considerably while maintaining (but
not exceeding) the effectiveness of non-decomposed vector space-based retrieval sys-
tems [22, 43]. However, the use of LSA-based algorithms is likely to negatively affect
navigational search (an important search task, described in Section 2.6.2) as the mean-
ing conveyed by entity naming terms may be lost.
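The following is a minimal sketch of the LSA decomposition, using NumPy’s SVD on a small dense term-document matrix; the matrix values and the retained rank k are illustrative assumptions, and production systems would use sparse matrices and truncated decompositions.
\begin{verbatim}
import numpy as np

# Rows are terms, columns are documents; entries are (weighted) term counts.
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                          # keep the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation of A

# Documents (and queries folded in the same way) can now be compared in the
# reduced k-dimensional space, where terms that co-occur are projected onto
# nearby directions.
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]       # one column per document
\end{verbatim}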
6. Employing stemming prior to indexing reduces the size of the corpus index; however, the discarded term information is then lost. As an alternative, stemming can be applied during query processing [214].
7. The measures of system precision and recall are defined in Section 2.6.6.1.
8. The use of phrase-optimised indexes can improve the efficiency of phrase-based retrieval [213].
Some terms occur so frequently in a corpus that their presence or absence within
a document may have negligible effect. The most frequent terms arguably convey the
least document relevance information and have the smallest discrimination value (see
inverse document frequency measure in Section 2.3.1.2). Additionally, because of the
high frequency of occurrence, such terms are likely to generate the highest overhead
during indexing and querying.9 Extremely frequent terms (commonly referred to as
“stop words”) are often removed from documents prior to indexing.10 However, it
has been suggested that such terms might be useful when matching documents [214],
particularly in phrase-based searches [14]. Nevertheless, in experiments within this
thesis, stop words are removed prior to indexing.
2.3.1 Text-based document evidence
To build a retrieval model, an operational definition of what constitutes a relevant
document is required. While the ranking models discussed below share similar
document statistics, they were all derived from different relevance-matching
assumptions. Experiments within this thesis employ the Okapi BM25 probabilistic
algorithm (for reasons outlined in Section 2.3.2). Other full-text ranking methods are
discussed for completeness.
The notation used during the model discussions below is as follows: D denotes a
document, Q denotes a query, t is a term, wt indicates a weight or score for a single
term, and S(D, Q) is the score assigned to the query to document match.
2.3.1.1 Boolean matching
In the Boolean model, retrieved documents are matched to queries formed with logic
operators. There are no degrees of match; a document either satisfies the query or
does not. Thus Boolean models are often referred to as “exact match” techniques [14].
While Boolean matching makes it clear why documents were retrieved, its syntax is
largely unfamiliar to ordinary users [28, 49, 51]. Nevertheless empirical evidence sug-
gests that trained search users prefer Boolean search as it provides an exact specifica-
tion for retrieving documents [49]. However, without any ranking by degree of match,
the navigation of the set of matching documents is difficult, particularly on large cor-
pora with unstructured content [14]. Empirical evidence also suggests that the use
of term weights in the retrieval model (described in the next sub-section) brings large
gains [14]. To employ Boolean matching techniques on corpora of the scale considered
in this thesis, it would have to be supplemented by some other document statistic in
order to provide a ranked list of results [14].
9. However, given the high amount of expected repetition they could potentially be more efficiently compressed [214].
10. This list often contains common function words or connectives such as “the”, “and” and “a”.
The Boolean scoring function is:
\[
S(D, Q) =
\begin{cases}
0 & Q \notin D \\
1 & Q \in D
\end{cases}
\tag{2.1}
\]
where Q is the query condition expressed in Boolean logic operators.
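Exact-match retrieval can be evaluated directly against an inverted index. The sketch below, which assumes the postings structure from the earlier indexing sketch, evaluates simple conjunctive (AND) and disjunctive (OR) queries; every matching document effectively receives the score 1 of Equation 2.1, so no ranking is produced.
\begin{verbatim}
def boolean_and(postings, query_terms):
    """Documents containing every query term (exact conjunctive match)."""
    doc_sets = [set(postings.get(t, {})) for t in query_terms]
    return set.intersection(*doc_sets) if doc_sets else set()

def boolean_or(postings, query_terms):
    """Documents containing at least one query term."""
    return set().union(*(set(postings.get(t, {})) for t in query_terms))
\end{verbatim}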
2.3.1.2 Vector space model
The vector space model is based on the implicit assumption that the relevance of a
document with respect to some query is correlated with the distance between the query
and document. In the vector space model each document (and query) is represented
in an n-dimensional Euclidean space with an orthogonal dimension for each term in
the corpus.11 The degree of relevance between a query and document is measured
using a distance function [176].
The most basic term vector representation simply flags term presence using vec-
tors of binary {0, 1}. This is known as the binary vector model [176]. The document
representation can be extended by including term and document statistics in the docu-
ment and query vector representations [176]. An empirically validated document sta-
tistic is the number of term occurrences within a document (term frequency or tf ) [176].
The intuitive justification for this statistic is that a document that mentions a term
more often is more likely to be relevant for, or about, that term. Another important
statistic is the potential for a term to discriminate between candidate documents [190].
The potential of a term to discriminate between documents has been observed to be
inversely proportional to the frequency of its occurrence in a corpus [190], with terms
that are common in a corpus less likely to convey useful relevance information. A
frequently used measure of term discrimination based on this observation is inverse
document frequency (or idf ) [190]. Using the tf and idf measures, the weight of a term
present in a document can be defined as:
\[
w_{t,D} = tf_{t,D} \times idf_t
\tag{2.2}
\]
where idf is:
\[
idf_t = \log \frac{N}{n_t}
\tag{2.3}
\]
where $n_t$ is the number of documents in the corpus that contain term $t$, and $N$ is the
total number of documents in the corpus.
11. So all dimensions are linearly independent.
There are many functions that can be used to score the distance between document
and query vectors [176]. A commonly used distance function is the cosine measure of
similarity [14]:
\[
S(D, Q) = \frac{D \cdot Q}{|D| \times |Q|}
\tag{2.4}
\]
or:
\[
S(D, Q) = \frac{\sum_{t \in Q} w_{t,D} \times w_{t,Q}}
{\sqrt{\sum_{t \in Q} w_{t,D}^2} \times \sqrt{\sum_{t \in Q} w_{t,Q}^2}}
\tag{2.5}
\]
Because the longer a document is, the more likely it is that a term will be encoun-
tered in it, an unnormalised tf component is more likely to assign higher scores to
longer documents. To compensate for this effect the term weighting function in the
vector space model is often length normalised, such that a term that occurs in a short
document is assigned more weight than a term that occurs in a long document. This
is termed document length normalisation. For example, a simple form of length normal-
isation is [14]:
\[
w_{t,D} = \frac{tf_{t,D} + 1}{maxtf_D + 1} \times idf_t
\tag{2.6}
\]
where $maxtf_D$ is the maximum term frequency observed for a term in document D.
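Equations 2.2 to 2.5 can be combined into a small ranking sketch. The version below reuses the assumed postings and doc_lengths structures from the indexing sketch, weights query terms by idf alone and, to keep the example short, normalises document vectors over the query vocabulary only; a full implementation would normalise over all document terms.
\begin{verbatim}
import math
from collections import defaultdict

def tfidf_cosine_scores(postings, doc_lengths, query_terms):
    """Rank documents by the cosine of tf-idf weighted vectors (Eqs 2.2-2.5)."""
    N = len(doc_lengths)
    scores = defaultdict(float)
    doc_norms = defaultdict(float)
    query_norm = 0.0

    for t in set(query_terms):
        n_t = len(postings.get(t, {}))
        if n_t == 0:
            continue
        idf = math.log(N / n_t)                 # Equation 2.3
        w_q = idf                               # query term weight (assumption)
        query_norm += w_q ** 2
        for doc_id, tf in postings[t].items():
            w_d = tf * idf                      # Equation 2.2
            scores[doc_id] += w_d * w_q         # numerator of Equation 2.5
            doc_norms[doc_id] += w_d ** 2

    for doc_id in scores:
        scores[doc_id] /= math.sqrt(doc_norms[doc_id]) * math.sqrt(query_norm)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
\end{verbatim}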
After observing relatively poor performance for the vector space model in a set of
TREC experiments, Singhal et al. [186] hypothesised that the form of document length
normalisation used within the model was inferior to that used in other models. To in-
vestigate this effect they compared the length of known relevant documents with the
length of documents otherwise retrieved by the retrieval system. Their results indi-
cated that long documents were more likely to be relevant for the task studied,12 but
no more likely to be retrieved after length normalisation in the vector space model.
Accordingly, Singhal et al. [186] proposed that the (cosine) length normalisation com-
ponent be pivoted to favour documents that were more frequently relevant (in this
case, longer documents).
12. The task studied was the TREC-3 ad-hoc retrieval task. The ad-hoc retrieval task is an informational task (see Section 2.6.1) where the user needs to acquire or learn some information that may be present in a document.
2.3.1.3 Probabilistic ranking
Probabilistic ranking algorithms provide an intuitive justification for the relevance of
matched documents by attempting to model and thereby rank the statistical proba-
bility that a document is relevant given the matching terms found [146, 169]. The
Probability Ranking Principle was described by Cooper [167] as:
“If a reference retrieval system’s response to each request is a ranking of
the documents in the collections in order of decreasing probability of use-
fulness to the user who submitted the request, where the probabilities are
estimated as accurately as possible on the basis of whatever data has been
made available to the system for this purpose, the overall effectiveness of
the system to its users will be the best that is obtainable on the basis of that
data.”
The probabilistic model for information retrieval was originally proposed by
Maron and Kuhns [146] and updated in an influential paper by Robertson and
Sparck Jones [169]. Probabilistic ranking techniques have a strong theoretical basis
and should, at least in principle and given all available information, provide the best
predictions of document relevance. The formal specification of the Probabilistic Rank-
ing Principle can be described as an optimisation problem, where documents should
only be retrieved in response to a query if the cost of retrieving the document is less
than the cost of not retrieving the document [169].
A prominent probabilistic ranking formulation is the Binary Independence Model
used in the Okapi BM25 algorithm [171]. The Binary Independence Model is con-
ditioned by several important assumptions in order to decrease complexity. These
assumptions include:
• Independence of documents, i.e. that the relevance of one document is indepen-
dent of the relevance of all other documents;13
• Independence of terms, i.e. that the occurrence or absence of one term is not
related to the presence or absence of any other term;14 and
• That the distribution of terms within a document can be used to estimate the
document’s probability of relevance.15
13. This is brought into question when one document’s relevance may be affected by another document ranked above it (as is the case with duplicate documents). This independence assumption was removed in several probabilistic formulations without significant improvement in retrieval effectiveness [204].
14. This assumption was also removed from probabilistic formulations without significant effectiveness improvements [204].
15. This assumption is made according to the cluster hypothesis, which states that “closely associated documents tend to be relevant to the same requests”, therefore “documents relevant to a request are separate from those that are not” [204].
In most early probabilistic models, the term probabilities were estimated from
a sample set of documents and queries with corresponding relevance judgements.
However, this information is not always available. Croft and Harper [61] have revis-
ited the initial formulation of relevance and proposed a probabilistic model that did
not include a prior estimate of relevance.
Okapi BM25
The Okapi BM25 formula was proposed by Robertson et al. [172]. In Okapi BM25,
documents are ordered by decreasing probability of their relevance to the query,
P(R|Q, D). The formulation takes into account the number of times a query term oc-
curs in a document (tf ), the proportion of other documents which contain the query
term (idf ), and the relative length of the document. A score for each document is
calculated by summing the match weights for each query term. The document score
indicates the Bayesian inference weight that the document will be relevant to the user
query.
Robertson and Walker [170] derived the document length normalisation used in
the Okapi BM25 formula as an approximation to the 2-Poisson model. The form of
length normalisation employed when using Okapi BM25 with default parameters
(k1 = 2, b = 0.75) is justified because long documents contain more information than
shorter documents, and are thus more likely to be relevant [186].
The base Okapi BM25 formulation [172] is:
\[
BM25_{w_t} = idf_t \times
\frac{(k_1 + 1)\, tf_{t,D}}{k_1\left((1 - b) + \frac{b \times dl}{avdl}\right) + tf_{t,D}}
\times
\frac{(k_3 + 1) \times Qw_t}{k_3 + Qw_t}
+ k_2 \times nq\, \frac{avdl - dl}{avdl + dl}
\tag{2.7}
\]
where $w_t$ is the relevance weight assigned to a document due to query term $t$, $Qw_t$ is
the weight attached to the term by the query, $nq$ is the number of query terms, $tf_{t,D}$ is
the number of times $t$ occurs in the document, $N$ is the total number of documents, $n_t$
is the number of documents containing $t$, $dl$ is the length of the document and $avdl$ is
the average document length (both measured in bytes).
Here $k_1$ controls the influence of $tf_{t,D}$ and $b$ adjusts the document length normalisation.
A $k_1$ approaching 0 reduces the influence of the term frequency, while a larger
$k_1$ increases the influence. A $b$ approaching 1 assumes that the documents are longer
due to repetition (full length normalisation), whilst $b = 0$ assumes that documents are
long because they cover multiple topics (no length normalisation) [168].
Setting $k_1 = 2$, $k_2 = 0$, $k_3 = \infty$ and $b = 0.75$ (verified experimentally in TREC tasks
and on large corpora [168, 186]):
\[
BM25_{w_{t,D}} = Qw_t \times tf_{t,D} \times
\frac{\log\left(\frac{N - n_t + 0.5}{n_t + 0.5}\right)}
{2 \times \left(0.25 + 0.75 \times \frac{dl}{avdl}\right) + tf_{t,D}}
\tag{2.8}
\]
The final document score is the sum of term weights:
BM25(D, Q) = \sum_{t \in Q} BM25_{w_{t,D}} \quad (2.9)
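To make the arithmetic concrete, the sketch below implements the simplified scoring of Equations 2.8 and 2.9, assuming k1 = 2, b = 0.75, unit query-term weights (Qw_t = 1), and document length measured in terms rather than bytes. The function and argument names are illustrative only and are not taken from any particular system.

import math
from collections import Counter

# A minimal sketch of the simplified Okapi BM25 scoring in Equations 2.8-2.9.
# Assumes k1 = 2, b = 0.75, k2 = 0, k3 = infinity and unit query-term weights.
def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len):
    tf = Counter(doc_terms)
    dl = len(doc_terms)                        # document length (in terms here)
    score = 0.0
    for t in query_terms:
        n_t = doc_freqs.get(t, 0)              # number of documents containing t
        if tf[t] == 0 or n_t == 0:
            continue
        idf = math.log((num_docs - n_t + 0.5) / (n_t + 0.5))
        k = 2.0 * (0.25 + 0.75 * dl / avg_doc_len)   # k1((1 - b) + b * dl/avdl)
        score += tf[t] * idf / (k + tf[t])           # Equation 2.8, summed as in 2.9
    return score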
2.3.1.4 Statistical language model ranking
Statistical language modelling is based on Shannon’s communication theory [182]16
and examines the distribution of language in a document to estimate the probabil-
ity that a query was generated in an attempt to retrieve that document. Statistical
language models have long been used in language generation, speech recognition
and machine translation tasks, but have only recently been applied to document re-
trieval [162].
Language models calculate the probability of encountering a particular string (s)
in a language (modelled by M) by estimating P(s|M). The application of language
modelling to information retrieval conceptually reverses the document ranking
process. Unlike probabilistic ranking functions which model the relevance of docu-
ments to a query, language modelling approaches model the probability that a query
was generated from a document. In this way, language models replace the notion of
relevance with one of sampling, where the probability that the query was picked from
a document is modelled. The motivation for this approach is that users have some pro-
totype document in mind when an information need is formed, and they choose query
terms to that effect. Further, it is asserted that when a user seeks a document they are
thinking about what it is that makes the document they are seeking “different”. The
statistical language model ranks documents using the maximum likelihood estima-
tion (Pmle) that the query was generated with that document in mind (P(Q|MD)),
otherwise considered to be the probability of generating the query according to each
document language model.
Language modelling was initially applied to document retrieval by Ponte and
Croft [162] who proposed a simple unigram-based document model.17 The simple
unigram model assigns:
P(D|Q) = \prod_{t \in Q} P(t|M_D) \quad (2.10)
The model presented above may not be effective in general document retrieval
as it requires a document to contain all query terms. Any document that is missing
one or more query terms will be assigned a probability of query generation of zero.
Smoothing is often used to counter this effect (by adjusting the maximum likelihood
16. This is primarily known for its application to text sequencing and estimation of message noise.
17. A unigram language model models the probability of each term occurring independently, whilst higher order (n-gram) language models model the probability that consecutive terms appear near each other (described in Section 2.3). In the unigram model the occurrence of a term is independent of the presence or absence of any other term (similar to the term independence assumption in the Okapi model).
estimation of the language model). Smoothing methods discount the probabilities
of the terms seen in the text, to assign extra probability mass to the unseen terms
according to a fallback model [218]. In information retrieval it is common to exploit
corpus properties for this purpose. Thereby:
P(D|Q) = \prod_{t \in Q} \begin{cases} P(t|M_D) & \text{if } t \in M_D \\ \alpha P(t|M_C) & \text{otherwise} \end{cases} \quad (2.11)
where P(t|M_D) is the smoothed probability of a term seen in the document D, P(t|M_C) is the collection language model (over C), and α is the coefficient controlling the probability mass assigned to unseen terms (so that all probabilities sum to one).
Models for smoothing the document model include Dirichlet smoothing [155],
geometric smoothing [162], linear interpolation [19] and 2-state Hidden Markov Mod-
els. Dirichlet smoothing has been shown to be particularly effective when dealing
with short queries, as it provides an effective normalisation using document
length [155, 218].18 Language models with Dirichlet smoothing have been used to
good effect in recent TREC web tracks by Ogilvie and Callan [155].
A document language model is built for all query terms [155]:
P(Q|M_D) = \prod_{t \in Q} P(t|M_D) \quad (2.12)
Adding smoothing to the document model using the collection model:
P(t|MD) = β1Pmle(t|D) + β2Pmle(t|C) (2.13)
The β1 and β2 document and collection linear interpolation parameters are then estimated using Dirichlet smoothing:
\beta_1 = \frac{|D|}{|D| + \gamma}, \qquad \beta_2 = \frac{\gamma}{|D| + \gamma} \quad (2.14)
where |D| is the document length and γ is often set near the average document length
in the corpus [155]. The mle for a document is defined as:
P_{mle}(t|D) = \frac{tf_{t,D}}{|D|} \quad (2.15)
Similarly, for the corpus:
P_{mle}(t|C) = \frac{tf_{t,C}}{|C|} \quad (2.16)
18. Document length has been exploited with success in the Okapi BM25 model and in the vector space model.
The document score is then:
S(D, Q) = \prod_{t \in Q} \left( \beta_1 \times \frac{count(t; D)}{|D|} + \beta_2 \times \frac{count(t; C)}{|C|} \right) \quad (2.17)
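As a worked illustration of Equations 2.13 to 2.17, the sketch below scores a document under a Dirichlet-smoothed unigram model. The corpus statistics are assumed to be supplied as a term-frequency dictionary and a total corpus length, and gamma would typically be set near the average document length; all names are illustrative rather than taken from any particular system.

from collections import Counter

def lm_dirichlet_score(query_terms, doc_terms, corpus_tf, corpus_len, gamma):
    tf = Counter(doc_terms)
    dlen = len(doc_terms)
    beta1 = dlen / (dlen + gamma)              # document weight (Equation 2.14)
    beta2 = gamma / (dlen + gamma)             # collection weight (Equation 2.14)
    score = 1.0
    for t in query_terms:
        p_doc = tf[t] / dlen if dlen else 0.0          # P_mle(t|D), Equation 2.15
        p_col = corpus_tf.get(t, 0) / corpus_len       # P_mle(t|C), Equation 2.16
        score *= beta1 * p_doc + beta2 * p_col         # Equation 2.17
    return score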
Statistical language models have several beneficial properties. If users are as-
sumed to provide query terms that are likely to occur in documents of interest, and
that distinguish those documents from other documents in the corpus, language mod-
els provide a degree of confidence that a particular document should be retrieved [162].
Further, while the vector space and probabilistic models use a crude approximation
to document corpus statistics (such as document frequency, discrimination value and
document length), language models are sometimes seen to provide a more integrated
and natural use of corpus statistics [162].
2.3.2 Discussion
The most effective implementations of each of the retrieval models discussed above
have been empirically shown to be very similar [53, 60, 106, 110, 119, 121]. Discrepan-
cies previously observed in the effectiveness of the different models have been found
to be due to differences in the underlying statistics used in the model implementa-
tion, and not the model formalisation [186]. All models employ a tf × idf approach to
some degree, and normalise term contribution using document length. This is explicit
in probabilistic [170] and vector space models [186], and is often included within the
smoothing function in language models [155, 218]. The use of these document statis-
tics in information retrieval systems has been empirically validated over the past ten
years [155, 168].
When dealing with free-text elements, experiments within this thesis use the prob-
abilistic ranking function Okapi BM25 without prior relevance information [170].
This function has been empirically validated to perform as well as current state-of-
the-art ranking functions [53, 57, 58, 59, 60, 168, 170].
Further discussion and comparison of full-text ranking functions is outside the scope of this thesis. The interested reader should consult [14, 176, 191, 204].
2.3.3 Other evidence
To build a baseline that achieves similar performance to that of popular web and
WWW search engines several further types of document-level evidence may need
to be considered [31, 109, 113].
2.3.3.1 Metadata
Metadata is data used to describe data. An example of real-world metadata is a library
catalogue card, which contains data that describes a book within the library (although
metadata is not always stored separately from the document it describes). In web
documents metadata may be stored within HTML metadata tags (<META>), or in separate XML/RDF resource descriptors. As metadata tags are intended to describe
document contents, the content of metadata tags is not rendered by web browsers.
Several standards exist for metadata creation, one of the least restricted forms of which
is simple Dublin Core [70]. Dublin Core provides a small set of core elements (all
of which are optional) that are used to describe resources. These elements include:
document author, title, subject, description, and language. An example of HTML
metadata usage, taken from http://cs.anu.edu.au/∼Trystan.Upstill/19 is:
<meta http-equiv="Content-Type"
content="text/html;
charset=iso-8859-1" />
<meta name="keywords"
content="Upstill, Web, Information, Retrieval" />
<meta name="description"
content="Trystan Upstill’s Homepage, Web IR" />
<meta name="revised"
content="Trystan Upstill, 6/27/01" />
<meta name="author"
content="Trystan Upstill" />
The utility of metadata depends on the observance of document authorship stan-
dards. Inconsistencies between document content and purpose, and associated meta-
data tags, may severely reduce system retrieval effectiveness. Such inconsistencies
may occur either unintentionally through outdated metadata information, or through
deliberate attribute “stuffing” in an attempt by the document author to have the doc-
ument retrieved for a particular search term [71]. When a document is retrieved due
to misleading metadata information, search system users may have no idea why the
document has been retrieved, with no visible text justifying the document match.
The use of HTML metadata tags is not considered within this thesis due to the
relatively low adherence to metadata standards in documents across the WWW, and
the inconsistency of adherence in other web corpora [107]. This policy is followed by
many WWW search systems [71, 193].
2.3.3.2 URL information
Uniform Resource Locators, or URLs, provide web addresses for documents. The URL
of a document may contain document evidence, either through term presence in the
URL or implicitly through some other URL characteristic (such as depth in the site
hierarchy).
The URL string may contain useful query-dependent evidence by including a po-
tential search term (e.g. http://cs.anu.edu.au/∼Joe.Blogs/ contains the po-
tentially useful terms of “Joe” and “Blogs”). URLs can be matched using simple string
matching techniques (e.g. checking if the text is present or not) or using full-text
19. META tags have been formatted according to XHTML 1.0.
ranking algorithms (although a binary term presence vector would probably suffice).
Ogilvie and Callan [50, 155, 156] proposed a novel method for matching URL strings
within a language modelling framework. In their method the probability that a URL
was generated for a particular term, given the URLs of all corpus documents, is cal-
culated. Query terms and URLs are treated as character sequences and a character-
based trigram generative probability is computed for each URL. The numerator and
denominator probabilities in the trigram expansion are then estimated using a linear
interpolation with the collection model [50, 155, 156]. Ogilvie and Callan then com-
bined this URL-based language model with the language models of other document
components. The actual contribution of this type of URL matching is unclear.20
Further query-independent evidence relating to URLs might also be gained through examining common formatting practices. For example, useful features may be derived from the length of a URL (in characters or directory depth), the presence of a particular character in the URL (e.g. looking for ‘∼’ when matching personal home pages [181]), or a more advanced metric. Westerveld et al. [135, 212] proposed
a URL-type indicator for estimating the likelihood that a page is a home page. In this
measure URLs are grouped into four categories, Root, Subroot, Path and File, using
the following rules:
Root a domain name,
e.g. www.cyborg.com/.
Subroot a domain name followed by a single directory,
e.g. www.glasgow.ac.uk/staff/.
Path a domain name followed by two or more directories,
e.g. trec.nist.gov/pubs/trec9/.
File any URL ending in a filename rather than a directory,
e.g. trec.nist.gov/contact.html.
Westerveld et al. [135, 212] calculated probabilities for encountering a home page
in each of these URL-types using training data on the WT10g collection (described
in Section 2.6.7.2). They then used these probabilities to assign scores to documents
based on the likelihood that their document URL would be a home page.
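A sketch of the four URL-type rules is given below. The exact handling of scheme-less URLs, trailing slashes and query strings is an assumption made here for illustration, not a detail taken from Westerveld et al.

from urllib.parse import urlparse

def url_type(url):
    if '://' not in url:
        url = 'http://' + url                  # allow scheme-less URLs
    path = urlparse(url).path
    if path in ('', '/'):
        return 'Root'                          # a bare domain name
    if not path.endswith('/'):
        return 'File'                          # ends in a filename
    depth = len([p for p in path.split('/') if p])
    return 'Subroot' if depth == 1 else 'Path' # one directory vs. two or more

Documents would then be assigned the home-page probability estimated for their URL type on training data.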
In experiments reported within this thesis, URL-type and URL length informa-
tion are considered. While the textual elements in a URL may be useful in doc-
ument matching, consistent benefits arising from their use are yet to be substanti-
ated [107, 155]. As such they are not considered within this work.
20. Ranking functions which included this URL measure performed well, but the contribution of the URL measure was unclear.
2.3.3.3 Document structure and tag information
Web documents may contain markup indicating to a viewer that a particular segment of the document, or the full document, is important. For example, useful evidence could be collected from:
• Titles / Heading tags: encoded in <H?> or <TITLE> tags.
• Marked-up text: For example bold (B), emphasised (EM) or italic (I) text may contain important information.
• Internal tag structure: The structural makeup of a document may give insight
into what a document contains. For example, if a document contains a very
long table, list or form, this may give some indication as to the utility of that
document.
• Descriptive text tags: Images often include descriptions of their content for users
viewing web pages without graphics capabilities. These are included as an at-
tribute in the IMG tag (ALT=).
Ogilvie and Callan [50, 155, 156] achieved small effectiveness gains through an up-
weighting of TITLE, Image ALT text and FONT tag text for both named page finding
and home page finding tasks. However, the effectiveness gains through the use of
these additional forms of evidence were small compared to those achieved through
the use of document full-text, referring anchor-text and URL length priors.21
The only document structure used in experiments within this thesis is document
TITLE. While there is some evidence to suggest that up-weighting marked-up text
might provide some gains, experiments have shown that the associated improvement
is relatively small [155].
2.3.3.4 Quality metrics
Zhu and Gauch [219] considered whether the effectiveness of full-text-based docu-
ment ranking22 could be improved through the inclusion of quality metrics.
They evaluated six measures of document quality:
• Currency: how recently a document was last modified (using document time
stamps).
• Availability: how many links leaving a document were available (calculated as
the number of broken links from a page divided by the total number of links).
• Information-to-noise: a measurement of how much text in the document was
noise (such as HTML tags or whitespace) as opposed to how much was useful
content.
21. Using the length of a URL to estimate a prior probability of document relevance.
22. Calculated using a tf × idf vector space model (see Section 2.3.1.2).
• Authority: a score sourced from Yahoo Internet Life reviews and ZDNet ratings
in 1999. According to these reviews each site was assigned an authority score.
Sites not reviewed were assigned an authority score of zero.
• Popularity: how many documents link to the site (in-degree). This information
was sourced from AltaVista [7]. The in-degree measure is discussed in detail in
Section 2.4.3.1.
• Cohesiveness: how closely related the elements of a web page are, determined
by classifying elements using a vector space model into a 4385 node ontology
and measuring the distance between competing classifications. A small distance
between classifications indicates that the document was cohesive. A large dis-
tance indicates the opposite.
Zhu and Gauch [219] evaluated performance using a small corpus with 40 queries
taken from a query log file.23 They observed some improvement in mean precision
based on all the quality metrics, although not all improvements were significant.24
The smallest individual improvements were for “Popularity” and “Authority” (both
non-significant). The improvements obtained through the use of all other metrics was
significant. The largest individual improvement was observed for the “Information-
to-noise” ratio. Using all quality metrics apart from “Popularity” and “Authority”
resulted in a (significant) 24% increase in performance over the baseline document
ranking [219].
These quality metrics, apart from in-degree, are not included in experiments
within this thesis because sourced information may be incomplete [219] or inaccu-
rate [113].
2.3.3.5 Units of retrieval
Identifying the URL which contains the information unit most relevant to the user
may be a difficult task. There are many ways in which a unit of information may be
defined on a web and so the granularity of information units retrieved by web search
engines may vary considerably.
If the granularity is too fine (e.g. the retrieval of a single document URL when a
whole web site is relevant), the user may not be able to fulfil their information need.
In particular the user may not be able to tell whether the system has retrieved an ad-
equate answer, or the retrieved document list may contain many parts of a composite
document from a single web site.
If the unit of retrieval is too large (e.g. the retrieval of a home page URL when
only a deep page is relevant), the information may be buried such that it is difficult
for users to retrieve.
The obvious unit for WWW-based document retrieval is the web page. However,
there are many situations in which a user may be looking for a smaller element of
23. It is unclear how the underlying search task [106, 108] was modelled in this experiment.
24. Significance was tested using a paired-samples t-test [219].
information, such as when seeking an answer to a specific question. Alternatively, a
unit of information may be considered to be a set of web pages. It is common for web
documents to be made up of multiple web pages, or at least be related to other co-
located documents [75]. An example of a composite document is the WWW site for
the Keith van Rijsbergen book ‘Information Retrieval’ which consists of many pages,
each containing small sections from the book [205]. In a study of the IBM intranet
Eiron and McCurley [75] reported that approximately 25% of all URLs encountered on
the IBM corpus were members of some larger “compound” document that spanned
several pages.
The problem of determining the “most useful” level for an information unit
was considered in the 2003 TREC Topic Distillation task (TD2003 – described in Sec-
tion 2.6.7). The TD2003 task judged systems according to whether they retrieved im-
portant resources, and did not mark subsidiary documents as being relevant [60]. The
TD2003 task is similar to the “component coverage” assessment used in the INEX
XML task [85], where XML retrieval systems are rewarded for retrieving the correct
unit of information. In the XML task the optimal system would return the unit of
information that contains the relevant information and nothing else.
Some methods analyse the web graph and site structure in an attempt to identify
logical information units. Terveen et al. build site units by graphing co-located pages,
using a method entitled “clan graphs” [196]. Further methods attempt to determine
the appropriate information unit by applying a set of heuristics based on site hierarchy
and linkage [13, 142].
This thesis adopts the view that finding the correct information unit is analogous
to finding the optimal entry point for the correct information unit. As such, none of
the heuristics outlined above are used to detect information units. Instead, hyperlink
recommendation and other document evidence is evaluated according to whether it
can be used to find information unit entry points.
Document segmentation
Document segmentation methods break down HTML documents into document components that can be analysed individually. A commonly used segmentation method is to break down HTML documents into their Document Object Model (DOM), according to the document tag hierarchy [42, 45]. Visual Information Processing System (VIPS) [37, 38, 39] is a recently proposed extension of DOM-based segmentation that dissects HTML documents using visual elements in addition to their DOM.
Document segmentation techniques are not considered in this thesis. While finer
document breakdown might be useful for finding short answers to particular ques-
tions, there is little evidence of improvements in ranking at the web page level [39].
2.4 Web-based evidence
Many early WWW search engines conceptualised the document corpus as a flat struc-
ture and relied solely on the document-level evidence outlined above, ignoring hy-
perlinks between documents [33]. This section outlines techniques for exploiting the
web graph that is created when considering documents within a web as nodes and
hyperlinks between documents as directed edges.
This thesis does not consider measures based on user interaction with the web
search system, such as click-through evidence [129]. While click-through evidence
may be useful when ranking web pages, assumptions made about user behaviour may
be questionable. In many cases it may be difficult to determine whether users have
judged a document relevant from a sequence of queries and clicks. Collecting such
evidence also requires access to user interaction logs for a large scale search system.
Work within this thesis relating to the combination of query-dependent evidence with query-independent evidence is applicable to this domain.
The WWW graph was initially hypothesised to be a small world network [18], that
is, a network that has a finite diameter,25 where each node has a path to every other
node by a relatively small number of steps. Small world networks have been shown
to exist in other natural phenomena, such as relationships between research scientists
or between actors [2, 5, 6, 18]. Barabasi hypothesised that the diameter of the WWW
graph was 18.59 links (estimated for 8 × 108 documents) [18]. However, this work
was challenged by WWW graph analysis performed by Broder et al. [35]. Using a
200 million page crawl from AltaVista, which contained 1.5 billion links [7], Broder et
al. observed that the WWW graph’s maximal and average diameter was infinite. The
study revealed that the WWW graph resembles a bow-tie with a Strongly Connected
Component (SCC), an upstream component (IN), a downstream component (OUT),
links between IN and OUT (Tendrils), and disconnected components. Each of these
components was observed to be roughly the same size (around 50 million nodes).
The SCC is a highly connected graph that exhibits the small-world property. The IN
component consists of nodes that link into the SCC, but cannot be accessed from the
SCC. The OUT component consists of nodes that are linked to from the SCC, but do
not link back to the SCC. Tendrils link IN nodes directly to OUT nodes, bypassing the
SCC. Disconnected components are pages to which no-one linked, and which linked-
to no-one.
The minimal diameter26 for the bow-tie was 28 for the SCC and 500 for the entire
graph. The probability of a directed path existing between two nodes was observed
to be 24%, and the average length of such a path was observed to be 16 links. The
shortest directed path between two random nodes in the SCC was, on average, 16 to
20 links. Further work by Dill et al. [67] has reported that WWW subgraphs, when
restricted by domain or keyword occurrence, also form bow-tie-like structures. This
phenomenon has been termed the fractal nature of the WWW, and is exhibited by
25. Average distance between two nodes in a graph.
26. The minimum number of steps by which the graph could be crossed.
other scale-free networks [67].
Many WWW distributions have been observed to follow a power law [3]. That is, the distributions take the form k = 1/i^x for i > 1, where k is the probability that a node has the value i and x is the exponent of the distribution. Important WWW distributions that have been observed to follow the power law include:
• WWW site in-links (in-degrees). The fraction of pages with an in-degree i was first approximated by Kumar et al. [136, 137] to be distributed according to a power law with exponent x = 2 on a 1997 crawl of around 40 million pages gathered by Alexa.27 Later Barabasi et al. estimated the exponent at x = 2.1 over a graph computed for a corpus containing 325 000 documents from the nd.edu domain [17, 18]. Broder et al. [35] have since confirmed the estimate of x = 2.1.
• WWW site out-links (out-degrees). Barabasi and Albert [17] estimated a power law distribution with exponent x = 2.45. Broder et al. [35] reported an exponent of x = 2.75 for out-degree on a 200 million page crawl from AltaVista.
• Local WWW site in-degrees and out-degrees [25].
• WWW site accesses [4].
2.4.1 Anchor-text evidence
Web authors often supply textual snippets when marking-up links between web doc-
uments, encoded within anchor <A HREF=""></A> tags. The average length of
an anchor-text snippet has been observed to be 2.85 terms [159]. This is similar to
the average query length submitted to WWW search engines [184] and suggests there
might be some similarity between a document’s anchor-text and the queries typically
submitted to search engines to find that document [73, 74].
A common method for exploiting anchor-text is to combine all anchor-text snip-
pets pointing to a single document into a single aggregate anchor-text document, and
then to use the aggregate document to score the target document [56]. In terms of
document evidence, this aggregate anchor-text document may give some indication
of what other web authors view as the content, or purpose, of a document. It has been
observed that anchor-text frequently includes information associated with a page that
is not included in the page itself [90].
To increase the anchor-text information collected for hyperlinks, anchor-text evi-
dence can be expanded to include text outside (but in close proximity to) anchor-tags.
However, there is disagreement regarding whether such text should be included.
Chakrabarti [44] investigated the potential utility of text surrounding anchor tags
by measuring the proximity of the term “Yahoo” to the anchor tags of links to
http://www.yahoo.com in 5000 web documents. Chakrabarti found that including 50 words around the anchor tags performed best, as most occurrences of “Yahoo” were within that bound (see Table 2.1).

27. http://www.alexa.com

Distance  -100  -75  -50  -25    0   25   50   75  100
Density      1    6   11   31  880   73  112   21    7

Table 2.1: Proximity of the term “Yahoo” to links to http://www.yahoo.com/ for 5000 WWW documents (from [44]). Distance is measured in bytes. A distance of 0 indicates that “Yahoo” appeared within the anchor tag. A negative distance indicates it occurred before the anchor-tag, and a positive distance indicates that it occurred after the tag.

Chakrabarti found that using this extra text
improved recall, but at the cost of precision (precision and recall are described in Sec-
tion 2.6.6.1). In later research Davison [64, 65] reported that extra text surrounding the
anchor-text did not describe the target document any more accurately than the text
within anchor-tags. However, Glover et al. [90] reported that using up to 25 terms
around anchor-text tags improved page-content classification performance. Pant et al. [159] proposed a further method for expanding anchor-text evidence using a DOM
break-down (DOM described in Section 2.3.3.5). They suggested that if an anchor-text
snippet contains under 20 terms then the anchor-text evidence should be extended to
consider all text up to the next set of HTML tags. They found that expanding to be-
tween two and four HTML tag levels improved classification of the target documents
when compared to only using text that occurred within anchor-tags.
Experiments within this thesis only consider text within the anchor tags, as there
is little conclusive evidence to support the use of text surrounding anchor tags.
Anchor-text ranking
Approaches to ranking anchor-text evidence include:
• Vector space. Hyperlink Vector Voting, proposed by Li and Rafsky [143], ranks
anchor-text evidence using a vector space containing all anchor-text pointing to
a document. The final score is the sum of all the dot products between the query
vector and anchor-text vectors. Li and Rafsky did not formally evaluate this
method.
• Okapi BM25. Craswell, Hawking and Robertson [56] built surrogate docu-
ments from all the anchor-text snippets pointing to a page and ranked the doc-
uments as if they contained document full-text. This application of anchor-text
provided dramatic improvements in navigational search performance.28
28. Navigational search is described in Section 2.6.2.

• Language Modelling. Ogilvie and Callan [155] modelled anchor-text separately from other document evidence using a unigram language model with Dirichlet smoothing. The anchor-text language model was then combined with their
models for other sections of the document using a mixture model (see Sec-
tion 2.5.2.2). This type of anchor-text scoring has been empirically evaluated
and shown to be effective [155, 156].
Unless otherwise noted, the anchor-text baselines used in this thesis are scored
from anchor-text aggregate documents using the Okapi BM25 ranking algorithm.
This method is used because it has previously been reported to perform well [56].
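A sketch of building the aggregate (surrogate) anchor-text documents is shown below; the representation of the link data and the whitespace tokenisation are assumptions of this illustration. The resulting term lists can then be ranked exactly as full-text documents, for example with a BM25-style function such as the earlier sketch.

from collections import defaultdict

def build_anchor_documents(links):
    """links: iterable of (source_url, target_url, anchor_text) triples."""
    surrogates = defaultdict(list)
    for _source, target, anchor_text in links:
        # Append each snippet's terms to the target's aggregate document.
        surrogates[target].extend(anchor_text.lower().split())
    return surrogates        # target URL -> list of anchor-text terms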
2.4.2 Bibliometric measures
Figure 2.1: A sample network of relationships
Social networks researchers [125, 131] are concerned with the general study of
links in nature for diverse applications, including communication (to detect espi-
onage or optimise transmission) and modelling disease outbreak [89]. Bibliomet-
rics researchers are similarly interested in the citation patterns between research pa-
pers [87, 89], and study these citations in an attempt to identify relationships. This
can be seen as a specialisation of social network analysis [89]. In many social net-
work models, there is an implicit assumption that the occurrence of a link (citation)
indicates a relationship or some attribution of prestige. However, in the context of
some areas (such as research) it may be difficult to determine whether a citation is an
indication of praise or retort [203].
Social networks and citations may be modelled using link adjacency matrices. A
directed social network of size n can be represented as an n × n matrix, where links
between nodes are encoded in the matrix (e.g. if a node i links to j, then Ei,j = 1). For
example, the relationship network shown in Figure 2.1 may be represented as:
E = \begin{pmatrix}
0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0
\end{pmatrix}
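The same matrix can be written down directly in code. The short sketch below (numpy is an assumption of this illustration) encodes the network of Figure 2.1 and reads off each node's in-degree, its direct prestige, alongside its out-degree.

import numpy as np

# Adjacency matrix for Figure 2.1: E[i, j] = 1 means node i+1 links to node j+1.
E = np.array([
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
])

in_degree = E.sum(axis=0)      # direct prestige: number of incoming links
out_degree = E.sum(axis=1)     # number of outgoing links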
Prestige
The number of incoming links to a node is a basic measure of its prestige [131]. This
gives a measure of the direct endorsement the node has received. However, examining direct endorsements alone may not give an accurate representation of node prestige.
It may be more interesting to know if a node is recognised by other important nodes,
thus transitive citation becomes important. A transitive endorsement is an endorse-
ment through an intermediate node (i.e. if A links to B links to C, then A weakly
endorses C).
An early measure of prestige in a social network analysis was proposed by See-
ley [179] and later revised by Hubbell [125]. In this model, every document has an
initial prestige associated with it (represented as a row in p), which is transferred to
its adjacent nodes (through the adjacency matrix E). Thus the direct prestige of any (a
priori equal) node can be calculated by setting p = (1, ..., 1)T and calculating p = pET .
By performing a power iteration over p ← pET the prestige measure p converges to
the principal eigenvector of the matrix ET and provides a measure of transitive pres-
tige.29 The power iteration method multiplies p by increasing powers of ET until the calculation converges (tested against some convergence constant).
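A minimal sketch of this power iteration is given below, treating p as a column vector. The per-step renormalisation and the tolerance test are additions made here so the values stay bounded; as the text later notes for PageRank, convergence is only guaranteed for suitably connected (irreducible, aperiodic) graphs.

import numpy as np

def transitive_prestige(E, tol=1e-8, max_iter=1000):
    p = np.ones(E.shape[0], dtype=float)       # equal a priori prestige
    for _ in range(max_iter):
        p_next = E.T @ p                       # transfer prestige along in-links
        norm = np.linalg.norm(p_next)
        if norm == 0:
            return p_next
        p_next /= norm                         # keep the vector bounded
        if np.linalg.norm(p_next - p) < tol:   # stop once the scores stabilise
            break
        p = p_next
    return p_next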
To measure prestige for academic journals Garfield [88] proposed the “impact
factor”. The impact factor score for a journal j is the average number of citations
to papers within that journal received during the previous two years. Pinski and
Narin [161] proposed a variation to the “impact factor”, termed the influence weight,
based on the observation that all journal citations may not be equally important. They hypothesised that a journal is influential if its papers are cited by papers in other influential journals, thus incorporating a measure of transitive endorsement. This
notion of transitive endorsement is similar to that modelled in PageRank and HITS
(described in Sections 2.4.3.2 and 2.4.4.1).
Co-citation and bibliographic coupling
Co-citation is used to measure subject similarity between two documents. If a docu-
ment A cites documents B and C, documents B and C are co-cited by A. If many doc-
uments cite both documents B and C, this indicates that B and C may be related [187].
The more documents that cite both B and C, the closer their relationship.
The co-citation matrix (CC) is calculated as:
CC = ET
E (2.18)
where CCi,j is the number of papers which jointly cite papers i and j, and the diagonal
is node in-degree.
29. See Golub and Van Loan [91, pp. 330–332] for more information about principal eigenvectors and the power method.

Bibliographic coupling is the inverse of co-citation, and infers that if two documents include the same references then they are likely to be related, i.e. if documents A and B both cite document C this gives some indication that they are related. The more documents that documents A and B both cite, the stronger their relationship.
The bibliographic coupling (BC) matrix is calculated as:
BC = E E^T \quad (2.19)
where BCi,j is the number of papers jointly cited by i and j and the diagonal is node
out-degree.
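Both measures are one matrix product away from the adjacency matrix, as the short sketch below shows for any 0/1 adjacency matrix E (such as the Figure 2.1 example above).

import numpy as np

def cocitation(E):
    return E.T @ E       # CC[i, j]: documents citing both i and j (Equation 2.18)

def bibliographic_coupling(E):
    return E @ E.T       # BC[i, j]: documents cited by both i and j (Equation 2.19)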
Citation graph measures
Important information may be conveyed by the distance between two nodes in a cita-
tion graph, the radius of a node (maximum distance from a node to the graph edge),
the cut of the graph (the edges of the graph which, when removed, will disconnect large sections of the graph), and the centre of the graph (the node that has the smallest
radius). For example, when examining a field of research, interesting papers can be
identified by their small radius, as this indicates that most papers in the area have a
short path to the paper. The cut of the graph typically indicates communication be-
tween cliques, and can be used to identify important nodes, whose omission would
lead to the loss of the relationship between the groups [196].
2.4.2.1 Bibliographic methods applied to a web
Hyperlink-based scoring assumes that web hyperlinks provide some vote for the im-
portance of their targets. However, due to the relatively small cost of web publishing,
the discretion used when creating links between web pages may be less than is em-
ployed by researchers in scientific literature [203]. Indeed it has been observed that
not all web links are created for recommendation purposes [63] (discussed in Section
3.1.5).
An early use of hyperlink-based evidence was in a WWW site visualisation, where
a site’s visibility represented its direct prestige, and the out-degree of a site was the
node’s luminosity [30]. Larson [138] presented one of the first applications of biblio-
metrics on the WWW by using co-citation to cluster related web pages and to explore
topical themes.
Marchiori [145] provided an early examination of the use of hyperlink evidence in
a document ranking scheme, by proposing that a document’s score should be relative
to that document’s full-text score and “hyper” (hypertext-based) score. Marchiori’s
model was based on the idea that a document’s quality is enriched through the pro-
vision of links to other important resources. In this model, the “hyper-information”
score was a measure based on a document’s subsidiary links, rather than its parent
links. The page score was dependent not only on its full-text content, but the content
of its subsidiaries as well. A decay factor was implemented such that the farther a
subsidiary was from the initial document, the less its contribution would be.
Xu and Croft [215] outline two broad domains for web-based hyperlink informa-
tion: global link information and local link information. Global link information is
computed from a full web graph, based on links between all documents in a cor-
pus [40, 215]. In comparison, local link information is built for some subset of the
graph currently under examination, such as the set of documents retrieved in response
to a particular query. In many cases the additional cost involved in calculating local
link information might be unacceptable for web or WWW search systems [40].
2.4.3 Hyperlink recommendation
The hyperlink recommendation techniques examined here are similar to the biblio-
metric measures of prestige, and may be able to provide some measure of the “im-
portance”, “quality” or “authority” of a web document [31]. This hypothesis is tested
through experiments presented in Chapter 5.
2.4.3.1 Link counting / in-degree
A page’s in-degree score is a measure of its direct prestige, and is obtained through a
count of its incoming links [29, 41]. It is widely believed that a web page’s in-degree
may give some indication of its importance or popularity [219].
In an analysis of link targets Bharat et al. [25] found that the US commercial do-
main .com had higher in-degree on average than all other domains. Sites within the
.org and .net domains also had higher in-degree (on average) than sites in other
countries.
2.4.3.2 PageRank
PageRank is a more sophisticated query-independent link citation measure developed
by Page and Brin [31, 157] to “objectively and mechanically [measure] the human in-
terest and attention devoted [to web pages]” [157]. PageRank uses global link infor-
mation and is stated to be the primary link recommendation scheme employed in the
Google search engine [93] and search appliance [96]. PageRank is designed to simu-
late the behaviour of a “random web surfer” [157] who navigates a web by randomly
following links. If a page with no outgoing links is reached, the surfer jumps to a
randomly chosen bookmark. In addition to this normal surfing behaviour, the surfer
occasionally spontaneously jumps to a bookmark instead of following a link. The
PageRank of a page is the probability that the web surfer will be visiting that page at
any given moment.
PageRank is similar to bibliometric prestige, but differs by down-weighting documents that have many outgoing links; the fewer outgoing links a node has, the larger the portion of prestige it bestows on each of them. The PageRank distribution matrix (EPR) is then:
EPR_{i,j} = \frac{E_{i,j}}{\sum_{n=1}^{\dim(E)} E_{n,j}} \quad (2.20)
for the link adjacency matrix E.
The PageRank distribution matrix (EPR) is a non-negative stochastic30 matrix that
is aperiodic and irreducible.31 The PageRank calculation is a Markov process, where
PageRank is an n-state system and the distribution matrix (EPR) contains the inde-
pendent transition probabilities EPRi,j of jumping from state i to j. If the random
surfer, on leaving a node i, is equally likely to move to any state, then EPR_{1..n,j} = (1/n, ..., 1/n).
The basic formulation of a single iteration of PageRank is then:
p = p \times EPR^T \quad (2.21)
where p is initialised according to the bookmark vector (by default a unit vector) and holds the updated PageRank scores after each iteration.
Page and Brin observed that unlike scientific citation graphs, it is quite common to
find sections of the web graph that act as “rank sinks”. To address this difficulty Page
and Brin introduced a random jump (or teleport) component where, with a constant
probability d, the surfer jumps to a random bookmarked node in b. That is:
p = ((1 - d) \times b) + d \times b \times p \times EPR^T \quad (2.22)
If d = 0 or b is not broad enough, the PageRank calculation may not converge [102].
Another complexity in the PageRank calculation is nodes that act as “rank leaks”. These occur when the surfer encounters a page with no outgoing links, or a link to a page that is outside the crawl (a dangling link). One approach to resolving this issue is to
jump with certainty (a probability of one) when a dangling link is encountered [154].
This approach, and several others, are covered in more detail in Section 3.3.1. If ap-
plying the “jump with certainty” method, and using a unit b bookmark vector (such
that the random surfer has every page bookmarked), the final PageRank scores are
equivalent to the principal eigenvector of the transition matrix EPR, where EPR is
updated to include the random jump factor:
EPR_{i,j} = \frac{(1 - d)}{\dim(E)} + d \times \frac{E_{i,j}}{\sum_{n=1}^{\dim(E)} E_{n,j}} \quad (2.23)
Expressed algorithmically, the PageRank algorithm (when using “jump with cer-
tainty”) is:
R0 ← S
loop :
r ← dang(Ri)
Ri+1 ← rE + ARi
Ri+1 ← (1 − d)E + d(Ri+1)
δ ← Ri+1 − Ri 1
while δ >
30. Every node can reach any other node at any time-step (implies irreducibility).
31. Every node can reach every other node.
where R_i is the PageRank vector at iteration i, A is the link adjacency matrix (where A_{i,j} = 1 if a link exists, and is 0 otherwise), S is the initial PageRank vector (the probability that a surfer starts at a node), E is the vector of bookmarked pages (the probability that the surfer jumps to a certain node at random), dang() is a function that returns the PageRank of all nodes that have no outgoing links, r is the amount of PageRank lost due to dangling links which is distributed amongst bookmarks (after [43, 154]), d is a constant which controls the proportion of random noise (spontaneous jumping) introduced into the system to ensure stability (0 < d < 1), and ε is the convergence constant. The double bar (‖·‖_1) notation indicates an l1 norm, the sum of the absolute values of a vector's elements.
In this formulation, for a given link graph, PageRank varies according to the values
of the d constant and the set of bookmark pages E. The PageRank variants investi-
gated in this thesis are described in more detail in Section 3.3.
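A compact sketch of the random-surfer computation with the “jump with certainty” treatment of dangling links is given below. It is not the exact variant evaluated in this thesis (those are detailed in Section 3.3); in this sketch d is the probability of following a link and 1 − d the probability of jumping to a bookmark, mirroring the structure of Equation 2.22, and all names are illustrative.

import numpy as np

def pagerank(A, bookmarks=None, d=0.85, tol=1e-8, max_iter=100):
    """A: 0/1 adjacency matrix with A[i, j] = 1 if page i links to page j."""
    n = A.shape[0]
    b = np.ones(n) / n if bookmarks is None else bookmarks / bookmarks.sum()
    out_degree = A.sum(axis=1)
    safe_out = np.maximum(out_degree, 1)           # avoid division by zero
    W = A / safe_out[:, None]                      # each page splits its score over its out-links
    p = b.copy()
    for _ in range(max_iter):
        dangling = p[out_degree == 0].sum()        # "jump with certainty" for dangling pages
        p_next = (1 - d) * b + d * (p @ W + dangling * b)
        delta = np.abs(p_next - p).sum()           # l1 convergence test
        p = p_next
        if delta < tol:
            break
    return p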
2.4.3.3 Topic-specific PageRank
Further PageRank formulations seek to personalise the calculation according to user
preferences. Haveliwala [103, 104] proposed Personalised PageRank, and demon-
strated how user topic preferences may be introduced by modifying the bookmark
vector and changing the random jump targets, and thereby altering PageRank scores.
Haveliwala proposed that a bookmark vector be built for each top-level DMOZ [69]
category by including all URLs within the tree as bookmarks.
During query processing, each incoming query is classified into these categories
(represented by the influence vector v) and a new “dynamic” PageRank score is com-
puted from a weighted sum of the category-specific PageRanks (ppr), and the Page-
Rank calculation is modified to explicitly include a bookmark vector (e.g. PR(E, b) is
the PageRank calculation for the adjacency matrix E using bookmarks b). So:
ppr = PR(E, v) (2.24)
Category preferences can also be mixed. To compute a set of personalisation vec-
tors (vi) with weights (wi) for a mixture of categories:
ppr = PR\left(E, \sum_i [w_i \cdot v_i]\right) \quad (2.25)
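Assuming the pagerank() sketch given after Section 2.4.3.2, a score in the spirit of Equation 2.25 can be obtained by swapping the uniform bookmark vector for a weighted mixture of category vectors (for Haveliwala, indicator vectors over the DMOZ top-level categories). The names below are illustrative only.

import numpy as np

def topic_specific_pagerank(A, category_vectors, weights, d=0.85):
    # Mix the category-specific personalisation vectors (Equation 2.25).
    v = sum(w * vec for w, vec in zip(weights, category_vectors))
    return pagerank(A, bookmarks=np.asarray(v, dtype=float), d=d)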
2.4.4 Other hyperlink analysis methods
2.4.4.1 HITS
Hyperlink Induced Topic Search (HITS) is a method used to identify two sets of pages
that may be important: Hub pages and Authority pages [132]. Hub and Authority
pages have a mutually reinforcing relationship – a good Hub page links to many Au-
thority pages (thereby indicating high Authority co-citation), and a good Authority
page is linked-to by many Hubs (thereby indicating high Hub bibliometric coupling).
Each page in the web graph is assigned two measures of quality; an Authority score
Au[u] and a Hub score H[u]. Sometimes the act of generating HITS results sets is
termed “Topic Distillation”, but in this thesis the phrase is associated with its use in
the TREC web track experiments (described in Section 2.6.3.1).
HITS-based scores may be computed either using local or global link information.
Local HITS has two major steps: collection sampling and weight propagation. Global
HITS is computed for the entire web graph at once so there is no collection sampling
step.
When calculating local HITS a small focused web subgraph, often based around a
search engine result list, is retrieved for a particular query.32 This root set of pages is
then expanded to make a base set by downloading all pages that link to, or are linked-
to by, pages within the root set. The assumption is that, although the base set may not
be a fully connected graph, it should include a large connected component (otherwise
the computation is ineffective).
The Hub and Authority score computation is a recursive process where Au and
H are updated until convergence (initialised with all pages having the same value).
For a graph containing edges E and links between q and p the weight is distributed
according to:
Au_p = \sum_{(q,p) \in E} H_q \quad (2.26)

H_p = \sum_{(p,q) \in E} Au_q \quad (2.27)
Like PageRank, these equations can be solved using the power method [91]. Au
will converge to the principal eigenvector of ET E, and H will converge to the princi-
pal eigenvector of EET [154]. The non-principal eigenvectors can also be calculated,
and may represent different term clusters [132]. For example, three term clusters (and corresponding meanings) occur for the query ‘Jaguar’: one on the large cat, one on the Atari game console, and one on the prestige car [132].
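A compact sketch of the Hub/Authority iteration of Equations 2.26 and 2.27 is shown below, for an adjacency matrix E with E[q, p] = 1 when page q links to page p; the l2 renormalisation each round and the fixed iteration count are assumptions of this illustration.

import numpy as np

def hits(E, iterations=50):
    n = E.shape[0]
    authority = np.ones(n)
    hub = np.ones(n)
    for _ in range(iterations):
        authority = E.T @ hub              # Au_p: sum of Hub scores linking to p
        hub = E @ authority                # H_p: sum of Authority scores p links to
        authority /= np.linalg.norm(authority) or 1.0
        hub /= np.linalg.norm(hub) or 1.0
    return authority, hub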
Revisiting HITS
Several limitations of the HITS model, as presented by Kleinberg [132], were observed
and addressed by Bharat and Henzinger [26]. These are:
• Mutually reinforcing relationships between hosts. This occurs when a set of
documents on one host point to a single document on a second host.
• Automatically generated links. This occurs when web documents are generated
by tools and are not authored (recommendation) links.
• Non-relevant nodes. This arises through what Bharat and Henzinger termed
topic drift. Topic drift occurs when the local subgraph is expanded to include
32. This was originally performed using result sets from the AltaVista WWW search engine [7].
surrounding links, and as a result, pages not relevant to the initial query are
included in the graph, and therefore in the HITS calculation.
Bharat and Henzinger [26] addressed the first and second issues by assigning
a weight to identical multiple links “inversely proportional to their multiplicity”.33
To address the third problem, topic drift, they removed content outliers. This was
achieved by computing a document vector centroid and removing pages that were
dissimilar to the vector from the base set.
Lempel and Moran [140, 141] proposed a more “robust” stochastic version of HITS
called SALSA (Stochastic Algorithm for Link Structure Analysis). This algorithm aims
to address concerns that Tightly Knit Communities (TKC) affect HITS calculations. A
TKC occurs when a small collection of pages is connected so that every Hub links
to every Authority. The pages in a TKC can be ranked very highly by HITS, and
therefore achieve the principal eigenvector, even when there is a larger collection of
pages in which Hubs link to Authorities (but are not completely connected). The TKC
effect could be used by spammers to increase Hub and Authority rankings for their
pages, using techniques such as link farming.34
Calado et al. [40] observed significant improvement through the use of local and
global HITS over a document full-text-only baseline. The experiments examined a set
of 50 informational-type queries (see Section 2.6.1) extracted from a Brazilian WWW
search engine log. The queries were observed to be 1.78 terms long on average, signif-
icantly shorter than those observed in previous WWW log studies (2.35 terms [184]).
Further, it was observed that 28 of the queries were very general, and consisted of
terms such as “tabs”, “movies” and “mp3s”. The information needs behind these
queries were estimated for relevance assessment following the method proposed by
Hawking and Craswell [110]. Through the addition of local HITS to the baseline vec-
tor space ranking Calado et al. observed an improvement of 8% in precision35 at ten
documents retrieved. Through the incorporation of global HITS evidence they ob-
served an improvement of 24% in precision at ten documents retrieved. The improve-
ments were reported to be significant for local link analysis after thirty results, and for
global link analysis after ten results. Similar improvements were observed through
the use of PageRank.
2.4.5 Discussion
This thesis only considers in-degree and variants of PageRank, and not other hyper-
link recommendation techniques. In-degree is included because it is the simplest hy-
perlink recommendation measure and is cheap to compute. PageRank was chosen as
a representative of other more expensive methods because:
33. Thereby lessening the effects of nepotistic and navigational links, described in Section 3.1.5.
34. Link farms are artificial web graphs created by spammers through the generation of link spam. They are designed to funnel hyperlink evidence to a set of pages for which they desire high web rankings.
35. The precision measure is described in Section 2.6.6.1.
The precision measure is described in Section 2.6.6.1.
§2.5 Combining document evidence 33
• Google [93], one of the world’s most popular search engines, state that PageRank
is an important part of their ranking function [31, 157].
• In recent years there have been many studies of how PageRank might be im-
proved [1, 39, 105, 197], optimised [11, 102] and personalised [103, 104, 127, 130],
but there have not been any detailed evaluations of its potential benefit to re-
trieval effectiveness [8, 78].
• PageRank has been observed to be more resilient to small changes in the web
graph than HITS [154]. This may be an important property when dealing with
WWW-based search as it is difficult to construct an accurate and complete web
graph (see Chapter 3), and the web graph is likely to be impacted by web server
down-time [52].
• PageRank has previously been observed to exhibit similar performance to non-
query-dependent HITS (global HITS) [40].
• While locally computed HITS may perform quite differently to global HITS, the
cost of computing HITS at query-time is prohibitive in most production web and
WWW search systems [132].
2.5 Combining document evidence
There are many ways in which the different types of evidence examined in the previ-
ous two sections could be combined into a single ranking function. It is important that
the combination method is effective, as a poor combination could lead to spurious re-
sults. This section describes several methods that can be used to combine document
evidence.
The discussion of combination methods is split into two sub-sections. The first
sub-section reviews score and rank-based fusion methods. In fusion methods the out-
put from ranking function components is combined without prior knowledge of the
underlying retrieval model (how documents were ranked and scored). The second
sub-section reviews modifications to full-text retrieval models such that they include
more than one form of document evidence.
2.5.1 Score/rank fusion methods
Score or rank-based fusion techniques attempt to merge document rankings based
either on document ranks, or document scores, without prior knowledge of the un-
derlying retrieval model.
The combination of multiple forms of document evidence into a single ranking is
similar to the results merging problem in meta-search, where the ranked output from
several systems are consolidated to a single ranking. A comprehensive discussion of
meta-search data fusion techniques is provided by Montague in [151].
2.5.1.1 Linear combination of scores
The simplest method for combining evidence is with a linear combination of docu-
ment scores. A linear combination of scores is referred to as combSUM in distributed
information retrieval research [83]. In a linear combination the total score S for a doc-
ument D and query Q, using document scoring functions F1..N is:
S(D, Q) = F1(D, Q) + ... + FN (D, Q) (2.28)
For a linear combination of scores to be effective, scores need to be normalised to a
common scale, and exhibit compatible distributions. As the forms of evidence consid-
ered in this thesis display different distributions, a simple linear combination of scores
may not be effective. In-degree and PageRank values are distributed according to a
power law [35, 159]. By contrast, Okapi BM25 scores are not distributed according
to a power law. Examples of two Okapi BM25 distributions, for the top 1000 doc-
uments retrieved for 100 queries used in experiments in Chapter 7, are included in
Appendix F.
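A sketch of a combSUM-style combination is shown below. Because raw scores from different evidence types sit on incompatible scales, a simple min-max normalisation is applied first; that normalisation choice is an assumption of this illustration rather than the method used in later chapters, and all names are illustrative.

def min_max_normalise(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def comb_sum(runs):
    """runs: list of {doc_id: score} dictionaries, one per form of evidence."""
    combined = {}
    for run in runs:
        for doc, score in min_max_normalise(run).items():
            combined[doc] = combined.get(doc, 0.0) + score     # Equation 2.28
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)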
2.5.1.2 Re-ranking
Another method for combining document rankings “post hoc” is to re-rank docu-
ments above some cutoff using another form of document evidence [178]. The re-
ranking cutoffs can be tuned using a training set.
Re-ranking based combinations have the advantage of not requiring a full under-
standing of the distribution of scores underlying each type of evidence, as only the
ordering of lists can be considered. However, this type of re-ranking may be insensi-
tive to the magnitude of difference between scores.36 A further disadvantage of this
method is that it is relatively expensive to re-rank long result lists.
2.5.1.3 Meta-search fusion techniques
Further methods proposed for the fusion of meta-search results include:
• combMNZ: a document’s normalised scores are summed and the sum is then multiplied by the number of runs that assign the document a non-zero score [83].
• combSUM: a linear combination of scores [83] (described above).
• combMAX, combMIN, combMED: In combMAX the maximum score of all runs
is considered. In combMIN the minimum score of all runs is considered. In
combMED the median score of all runs is considered. These methods have pre-
viously been observed to be inferior to combMNZ and combSUM [83]. Fur-
ther, these types of combinations do not make sense when used with query-independent evidence, as such evidence provides an overall ranking of documents, and needs to be used in conjunction with some form of query-dependent evidence for query processing.

36. If re-ranking using the relative ranks of documents only, the magnitude of score differences in all forms of evidence is lost. By contrast, if re-ranking based on some score-based measure, only the magnitude of score differences in the evidence used to re-rank documents is lost.
Other techniques include Condorcet fuse, Borda fuse and the reciprocal rank func-
tion [151]. Recent empirical evidence suggests that when combining document rank-
ings these methods are inferior to those outlined above [155].
2.5.1.4 Rank aggregation
A further method proposed for combining the ranked results lists of meta-search sys-
tems [72] and document rankings [78], is rank aggregation [79]. In rank aggregation
the union of several ranked lists is taken, and the lists are merged into a single rank-
ing with the least disturbance to any of the underlying rankings. This may reduce the promotion of documents that score highly in only one or a small number of runs but poorly in the remaining runs.
The rank aggregation process can make it difficult to measure and control the con-
tribution of each form of evidence. For this reason, rank aggregation techniques are
not considered in this thesis.
2.5.1.5 Using minimum query-independent evidence thresholds
Implementing a threshold involves setting a minimum query-independent score that
a document must exceed to be considered by the ranking function. That is, for some
threshold τ, if QIE(D) < τ then P(R|D, Q) is estimated to be zero.37 The use of
a static threshold means that some documents may never be retrieved. A more ef-
fective technique might exploit query match statistics to dynamically determine the
minimum threshold.
The potential benefits of using thresholds are two-fold: as an effective method by
which to remove uninteresting pages (such as spam or less frequently visited pages),
and the improvement of computational performance (by reducing the number of doc-
uments to be scored, see Section 6.2.2).
2.5.2 Revising retrieval models to address combination of evidence
Rather than combining document evidence post hoc, the underlying retrieval models
can be modified to include further document evidence. The approaches outlined be-
low combine several forms of document evidence in a single unified retrieval model,
through modifications to the full-text ranking algorithms discussed in Section 2.3.1.
37. This is similar to a rank-based re-ranking of query-independent evidence (described in Section 2.5.1.2) as documents above the cutoff are re-ranked. In comparison, the use of a cutoff does not require a full ranking of query-independent evidence, but means that some documents may never be retrieved.
2.5.2.1 Field-weighted Okapi BM25
Field-weighted Okapi BM25, proposed by Robertson et al. [173], is a modification of
Okapi BM25 that combines multiple sources of evidence in a single document rank-
ing function. Conceptually the field weighting model involves the creation of a new
composite document that includes evidence from multiple document fields.38 The im-
portance of fields in the ranking function can be modified by re-weighting their con-
tribution. For example, a two-fold weighting of title compared to document full-text
would see the title repeated twice in the composite document.
If used with Okapi BM25 the score and rank fusion techniques outlined in Sec-
tion 2.5.1 invalidate the non-linear term saturation component and may thereby lessen
retrieval effectiveness [173]. The use of such post hoc score combination means that a
document matching a single query term over multiple fields may outperform a docu-
ment that matches several query terms in a single document field.
In Okapi BM25, the score of a document is equal to the sum of the BM25 scores of
each of its terms:
S(D, Q) = \sum_{t \in Q} \mathrm{BM25w}_{t,D} \qquad (2.29)
The score for each term is calculated using a term weighting function and a measure
of term rarity within the corpus (idf ):
\mathrm{BM25w}_{t,D} = f(\mathit{tf}_{t,D}) \times \mathit{idf}_t \qquad (2.30)
The term weighting function consists of term saturation and document length nor-
malisation components:
f(\mathit{tf}_{t,D}) = \frac{\mathit{tf}_{t,D}}{k_1 + \mathit{tf}_{t,D}}, \qquad f(\mathit{tf}_{t,D}) = \frac{\mathit{tf}_{t,D}}{\beta}, \quad \text{where } \beta = k_1\left((1 - b) + b\,\frac{dl}{avdl}\right) \qquad (2.31)
where dl is the current document length, and avdl is the average length of a document
in the corpus. These components are combined to form:
\mathrm{BM25w}_{t,D} = \frac{\mathit{tf}_{t,D}}{k_1\left((1 - b) + b\,\frac{dl}{avdl}\right) + \mathit{tf}_{t,D}} \times \mathit{idf}_t \qquad (2.32)
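Equations 2.29–2.32 translate directly into code. The Python sketch below is illustrative only: the idf variant and the parameter values k1 and b are assumptions made here, not necessarily those used in the experiments of this thesis.

import math

def bm25_term_weight(tf, dl, avdl, df, N, k1=2.0, b=0.75):
    """BM25 contribution of a single term (Equation 2.32).

    tf: term frequency in the document; dl: document length;
    avdl: average document length; df: document frequency of the term;
    N: number of documents in the corpus."""
    idf = math.log(N / df)                      # one common idf variant (an assumption)
    beta = k1 * ((1.0 - b) + b * dl / avdl)     # length-normalised saturation constant
    return tf / (beta + tf) * idf

def bm25_score(query_terms, doc_tf, dl, avdl, df, N):
    """Document score as the sum of per-term weights (Equation 2.29)."""
    return sum(bm25_term_weight(doc_tf.get(t, 0), dl, avdl, df[t], N)
               for t in query_terms if t in df)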
In the Field-weighted Okapi BM25 model, documents are seen to contain fields F1..FN
each holding a different form of (query-dependent) document evidence:
F = (F_1, \ldots, F_N) \qquad (2.33)
38 Document fields are some form of query-dependent document evidence, such as document title, full-text or anchor-text.
and each field is assigned a weight:
w = (w_1, \ldots, w_N), \qquad \mathit{wtf}_{t,D} := \sum_{F=1}^{N} \mathit{tf}_{t,F} \times w_F \qquad (2.34)
where w is a vector of field weights, and wtf is the weighted term frequency. The
contribution of terms is then:
f_w(\mathit{wtf}_{t,D}) = \frac{\mathit{wtf}_{t,D}}{k_1 + \mathit{wtf}_{t,D}}, \qquad f_w(\mathit{wtf}_{t,D}) = \frac{\mathit{wtf}_{t,D}}{\beta} \qquad (2.35)
and the document length is updated to reflect the new composite document length:
\mathit{wdl} := \sum_{f=1}^{N} dl_f \times w_f, \qquad \mathit{wavdl} := \sum_{f=1}^{N} avdl_f \times w_f \qquad (2.36)
The final formulation for Field-weighted Okapi BM25 is then:
\mathrm{BM25w}^{FW}_{t,D} = \frac{\mathit{wtf}_{t,D}}{k_1\left((1 - b) + b\,\frac{\mathit{wdl}}{\mathit{wavdl}}\right) + \mathit{wtf}_{t,D}} \times \mathit{idf}_t \qquad (2.37)
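Equation 2.37 can likewise be sketched in code; the per-field data structures and default parameter values below are illustrative assumptions.

def bm25fw_term_weight(tf_by_field, dl_by_field, avdl_by_field, field_weights,
                       idf, k1=2.0, b=0.75):
    """Field-weighted BM25 contribution of a single term (Equation 2.37).

    tf_by_field: term frequency of the term in each field;
    dl_by_field / avdl_by_field: per-field document and average lengths;
    field_weights: weight w_F assigned to each field."""
    wtf = sum(tf_by_field.get(f, 0) * w for f, w in field_weights.items())
    wdl = sum(dl_by_field.get(f, 0) * w for f, w in field_weights.items())
    wavdl = sum(avdl_by_field[f] * w for f, w in field_weights.items())
    beta = k1 * ((1.0 - b) + b * wdl / wavdl)
    return wtf / (beta + wtf) * idf

A two-fold weighting of title relative to full-text, as in the example above, would correspond to field_weights such as {'title': 2.0, 'fulltext': 1.0}.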
2.5.2.2 Language mixture models
In the same way that document models are combined with collection models in order
to smooth the ranking in language models, document models may also be combined
with other language models for the same documents [135, 155]. For example, to com-
bine the language models for anchor-text and content document evidence:
P(D|Q) = P(D) \prod_{t \in Q} \left[ (1 - \lambda - \gamma)\,P(t|C) + \lambda\,P_{\mathrm{anchor}}(t|D) + \gamma\,P_{\mathrm{content}}(t|D) \right] \qquad (2.38)
Language mixture models have been used to good effect when combining mul-
tiple modalities for multimedia retrieval in video [211]. Indeed, combining multiple
modalities for multimedia retrieval is a similar problem to that of combining multiple
forms of text-based document evidence.
Kraaij et al. [135] incorporate query-independent evidence into a language mixture
model by computing and including prior probabilities of document relevance. Here
P(D) is set according to the prior probability that a document will be relevant, given
the document and corpus properties. The prior relevance probabilities are estimated
by evaluating how a particular feature affects relevance judgements using training
data.
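A sketch of scoring under Equation 2.38, with a document prior in the manner of Kraaij et al., is given below. It is computed in log space for numerical stability; the mixture weights, smoothing floor and data structures are illustrative assumptions rather than the models used in this thesis.

import math

def mixture_log_score(query_terms, p_content, p_anchor, p_collection, prior,
                      lam=0.3, gamma=0.5):
    """Log of P(D|Q) under Equation 2.38 for a single document.

    p_content / p_anchor: per-term P(t|D) under the content and anchor-text models;
    p_collection: background model P(t|C); prior: P(D)."""
    score = math.log(prior)
    for t in query_terms:
        p = ((1.0 - lam - gamma) * p_collection.get(t, 1e-9)
             + lam * p_anchor.get(t, 0.0)
             + gamma * p_content.get(t, 0.0))
        score += math.log(max(p, 1e-12))      # floor avoids log(0)
    return score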
2.6 Evaluation
Search system performance may be measured over many different dimensions, such
as economy in the use of computational resources, speed of query processing, or user
satisfaction with search results [209]. It is unlikely that a single system will outperform
all others on each of these dimensions, and accordingly it is important to understand
the tradeoffs involved ([191], pp.167).
This thesis is primarily concerned with retrieval effectiveness, that is, how well a
given system or algorithm can match and retrieve documents that are most useful or
relevant to the user’s information need [150]. This is difficult to quantify precisely as it
involves assigning some measure to the value of information retrieved ([191], pp.167).
Judgements of information value are expensive39 and difficult to collect in a way that
is representative of needs and judgements of the intended search system users [209].
In addition, the effectiveness of a system depends on a number of system compo-
nents, and identifying those responsible for a particular outcome in an uncontrolled
environment can be difficult (typical web search system components are described in
Section 2.1).
The use of a test collection is a robust method for the evaluation of retrieval effec-
tiveness and avoids some of the cost involved in performing user studies. A test col-
lection consists of a snapshot of a user task and the document corpus ([191], pp.168).
This encompasses a set of documents, queries, and complete judgements for the doc-
uments according to those queries [48, 209]. Test collections allow for standard perfor-
mance baselines, reproducible results and the potential for collaborative experiments.
However, if proper care is not taken, heavily training ranking function parameters us-
ing a test collection can lead to over-tuning, particularly when training and testing on
the same test collection. In this case, observed performance gains may be unrealistic
and may not apply in general. It is therefore important to train algorithms on one test
collection, and evaluate the algorithms on another.
2.6.1 Web information needs and search taxonomy
Traditional information retrieval evaluations and early TREC web experiments evalu-
ated retrieval effectiveness according to how well methods fulfilled informational-type
search requests (i.e. finding documents that contain relevant text) [48, 176, 191, 205].
An early evaluation of WWW search engines examined their performance on an infor-
mational search task and found it to be below that of the then state-of-the-art TREC
systems [112]. Recent research suggests, however, that the task evaluated was not
typical of WWW search tasks [26, 33, 56, 75, 185]. Broder [33] argues that WWW user
information needs are often not of an informational nature and nominates three key
WWW-based retrieval tasks:
Navigational: a need to locate a particular page or site given its name. An example of
such a query is “CSIRO” where the correct answer would be the CSIRO WWW
site home page.
Informational: a need to acquire or learn some information that will be present in
one or more web pages. An example of such a query is “Thesis formatting advice” where correct (relevant) answers would contain advice relating to how a thesis should be formatted.
39 In that employing judges to rate documents may be a financially expensive operation.
Transactional: a need to perform an activity on the WWW. An example of such a
query is “apply for a Californian driver’s licence” where the correct answer
would be a page from which a user could apply for a Californian driver’s li-
cence.
2.6.2 Navigational search tasks
Navigational search, particularly home page finding, is the focus of experiments
within this thesis. Navigational search is an important WWW and web search task
which has been shown to be inadequately fulfilled using full-text-based ranking meth-
ods [56, 185]. Evidence derived from query logs suggests that navigational search
makes up a significant proportion of the total WWW search requests [75]. Naviga-
tional search also provides an important cornerstone in the support of search-and-
browse based interaction. Two prominent navigational search tasks, home page find-
ing and named page finding, are described in more detail below.
2.6.2.1 Home page finding
The home page finding task is: given the name of an entity, retrieve that entity’s
home page. An example of a home page finding search is when a user wants to visit
http://trec.nist.gov and submits the query “Text REtrieval
Conference”. The task is similar to Bharat and Mihaila’s organisation search [27],
where users provided web site naming queries, and Singhal and Kaszkiel’s site-finding
experiment [185], where queries were taken from an Excite log and judged as home
page finding queries [77].
Home page finding queries typically specify entities such as people, companies,
departments and products.40 A searcher who submits an entity name as a query is
likely to be pleased to find a home page for that entity at the top of the list of search
results, even if they were looking for information. It is in this way that home pages
may also provide primary-source information in response to informational and trans-
actional queries [33, 198].
2.6.2.2 Named page finding
The named page finding task can be seen as a superset of the home page finding task,
and includes queries naming both non-home page and home page documents [53].
Accordingly, the objective of the named page finding task is to find a particular web
page given a page naming query.
40 For example: ‘Trystan Upstill’, ‘CSIRO’, ‘Computer Science’ or ‘Panoptic’.
2.6.3 Informational search tasks
Two prominent informational search tasks evaluated in previous web-based experi-
ments [111] are: the search for pages relevant to an informational need (evaluated in
TREC ad-hoc [119]), and Topic Distillation [53]. Experiments within this thesis con-
sider the Topic Distillation task, but not the traditional ad-hoc informational search
task. The ad-hoc informational search task is described in detail in [111].
2.6.3.1 Topic Distillation
The Topic Distillation task asks systems to construct a list of key resources on some
broad topic, similar to those compiled by human editors of WWW directories [111].
More precisely, in TREC experiments the task is defined as: given a search topic, find
the key resources for that topic [111]. An example of a Topic Distillation query might
be “cotton industry” where the information need modelled might be “give me all
sites in the corpus about the cotton industry, by listing their home pages” [60]. A
good resource is deemed to be an entry point for a site which is “principally devoted
to the topic, provides credible information on the topic, and is not part of larger site
also principally devoted to the topic” [60].
While Topic Distillation is primarily an informational search task, it is somewhat
similar to navigational search tasks. The goal in both cases is to retrieve good en-
try points to relevant information units [58]. Indeed, experiments within this thesis
demonstrate that methods useful in home page finding are also effective for Topic
Distillation tasks (in Chapter 9).
2.6.4 Transactional search tasks
To date there has been little direct evaluation of transactional search tasks [111] and at
the time of writing there are currently no reusable transactional test collections. While
transactional search tasks are not the focus of this thesis, a case study that examines
WWW search engine transactional search performance is presented in Chapter 4.
2.6.5 Evaluation strategies / judging relevance
This section describes methods used to collect snapshots of queries and document
relevance judgements with which retrieval effectiveness can be evaluated.
2.6.5.1 Human relevance judging
The most accurate method for judging user satisfaction with results retrieved by a
search system is to record human judgements. However, care needs to be taken when
collecting human judgements that:
• Judges are representative of the general population using the search tool. In
particular if information needs behind given queries are to be modelled (as
in [106, 108]), the user demographic responsible for the query should be taken
into account in order to estimate the underlying need.
• Relevance judgements are correlated with the retrieval task modelled. This may
be difficult as judging instructions for the same query can be interpreted in sev-
eral ways [58].
Judging informational type queries
The scale of large corpora makes the generation of complete relevance judgements
(i.e. judging every document for every query) impossible. In the TREC conference,
judgement pools are created, which comprise the union of the top 100 retrieved docu-
ments per run submitted. These document pools are judged so that complete top 100
relevance judgements are collected for all runs submitted to TREC.41 All non-judged
documents are assumed to be non-relevant. Therefore when these judgements are
used in post hoc experiments, the judgements are incomplete and so relevant docu-
ments will likely be marked non-relevant [209]. The judgement pooling process was
used when judging runs submitted to the TREC Topic Distillation task.
These measures have been used when judging informational queries for several
decades [48, 119, 207]. Some relevant observations about such judging are:
• Agreement between human relevance judges is less than perfect. Voorhees and
Harman [210] reported that 71.7% of judgements made by three assessors for
14 968 documents were unanimous. However, Voorhees [208] later found that
while substituting relevance judgements made by different human assessors
changed score magnitude, it had a negligible effect on the rank order of sys-
tems [119, 208].
• When dealing with pooled relevance judgements un-judged documents are as-
sumed to be non-relevant. This may result in bias against systems that retrieve
documents not typically retrieved by the evaluated systems. Two investigations of this phenomenon have reported that while incomplete judging of documents may affect individual system scores, it is not likely to affect the rank-
ing of systems [206, 220].
• The order of search results affects relevance judgements [76]. However, in later
work it was found that this was not the case when judging less than fifteen
documents [160].
Judging known item or named item queries
In comparison to informational type queries, the cost of judging named item queries
(such as home page finding and named page finding queries) is much lower, and the judging is less contentious. Named item queries are developed by judges navigating to a page in the collection and generating a query designed to retrieve that page. The judging consists of checking retrieved documents to determine whether they are duplicates of the correct page (which can be performed semi-automatically [114]).
41 Although every group is asked to nominate an order of importance for its submitted runs, in case full pooled judgements cannot be completed in time.
2.6.5.2 Implicit human judgements
Implicit human judgements can be collected by examining how a user navigates
through a set of search results [129]. Evaluations based on this data may be attrac-
tive for WWW search engines as such data are easy and inexpensive to collect.
One way to collect implicit relevance judgements is through monitoring
click-through of search results [129]. However, the quality of judging obtained based
on this method may depend on how informative the document summaries are, as
the summaries must allow the user to make a satisfactory relevance-based “click-
through” decision. Also, given the implicit user preference for clicking on the first
result retrieved (as it has been most highly recommended by the search system), ob-
served effectiveness scores are likely to be unrealistically high.
If directly comparing algorithms using “click-through”-based evaluation, care must
be taken to ensure competing systems are compared meaningfully. Joachims [129]
proposed that the ranked output from each search algorithm under consideration be
interleaved in the results list, and the order of the algorithms be reversed following
each query so as not to preference one algorithm over the other (thereby removing the
effect of user bias towards the first correct document).
2.6.5.3 Judgements based on authoritative links
A set of navigational queries can be constructed cheaply by sourcing queries and
judgements automatically from human-generated lists of important web resources.
The anchor-text of links within such lists can be used as the queries, and the corre-
sponding target documents as the query answers.
Two recent studies use online WWW directories as authoritative sources for sam-
ple query-document pairs [8, 55]. An extension proposed by Hawking et al. [114] is
the use of site maps found on many web sites as a source of query-document pairs.
In all these methods it is important to remove the query/result source from the cor-
pus prior to query processing, as otherwise anchor-text measures will have an unfair
advantage.
2.6.6 Evaluation measures
2.6.6.1 Precision and recall
Precision and recall are the standard measures for evaluating information retrieval for
informational tasks [14]. Precision is the proportion of retrieved documents that are
relevant to a query at a particular rank cutoff, i.e.:
\mathrm{precision}(k) = \frac{1}{k} \sum_{1 \le i \le k} r_i \qquad (2.39)
where k is the rank cutoff, Rk is the set of documents from D that are relevant to the query Q at cutoff k, (D1 . . . Dn) is the ranked list of documents returned by the system, and ri = 1 if Di ∈ Rk and ri = 0 otherwise.
Recall is the total proportion of all relevant documents that have been retrieved
within a particular cut-off for a query, i.e.:
\mathrm{recall}(k) = \frac{1}{|R_Q|} \sum_{1 \le i \le k} r_i \qquad (2.40)
where RQ is the set of all documents relevant to the query Q.
In large-scale test collections, recall cannot be measured as it is too difficult to obtain
relevance judgements for all documents (as it is too expensive to judge a very large
document pool). Therefore recall is not often considered in web search evaluations.
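Equations 2.39 and 2.40 can be computed directly from a ranked list and a set of judged-relevant documents; a minimal Python sketch (the names are illustrative):

def precision_at(k, ranking, relevant):
    """Proportion of the top-k retrieved documents that are relevant (Equation 2.39)."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def recall_at(k, ranking, relevant):
    """Proportion of all relevant documents retrieved within the top k (Equation 2.40)."""
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)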
The measures of precision and recall are intrinsically tied together, as an increase
in recall almost always results in a decrease in precision. In fact precision can be ex-
plicitly traded off for recall by increasing k; for very large k every document in the
corpus is retrieved so perfect recall is assured. Given the expected drop-off in preci-
sion when increasing recall it can be informative to plot a graph of precision against
recall [14]. Precision-recall graphs allow for a closer examination of the distribution of
relevant and irrelevant documents retrieved by the system.
Both precision and recall are unstable at very early cutoffs, and it is therefore more
difficult to achieve statistical significance when comparing runs [36]. However, as
WWW search users tend to evaluate only the first few answers retrieved [99, 184],
precision at early cutoffs may be an important measure for WWW search systems.
Counter-intuitively, rather than precision decreasing when a large collection of
documents is searched, empirical evidence suggests that precision is increased [118].42
This phenomenon was examined in detail by Hawking and Robertson [118] who ex-
plained it in terms of signal detection theory.
A further measure of system performance is average precision:
\mathrm{average\ precision} = \frac{1}{|R_Q|} \sum_{1 \le k \le |D|} r_k \times \mathrm{precision}(k) \qquad (2.41)
where the sum accumulates precision(k) at each rank k at which a relevant document is observed (i.e. rk = 1). Average preci-
sion gives an indication of how many irrelevant documents must be examined before
all relevant documents are found. The average precision is 1 if the system retrieves all
relevant documents without retrieving any irrelevant documents. Average precision
figures are obtained after each new relevant document is observed.
R-precision is a computation of precision at the R-th position in the ranking (i.e. precision(R)), where R is the total number of relevant documents for that query. R-precision is a useful parameter for averaging algorithm behaviour across several queries [14].
42 These gains were tested at early precision with a cutoff that did not grow with collection size. Also the collection grew homogeneously, such that the content did not degrade during the crawl (as might be observed by crawling more content and thereby retrieving more spam on the WWW).
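Average precision (Equation 2.41) and R-precision follow the same pattern; again a minimal, illustrative sketch:

def average_precision(ranking, relevant):
    """Mean of precision(k) over the ranks k holding relevant documents (Equation 2.41)."""
    hits, total = 0, 0.0
    for k, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / k          # precision at rank k
    return total / len(relevant)

def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    R = len(relevant)
    return sum(1 for d in ranking[:R] if d in relevant) / R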
2.6.6.2 Mean Reciprocal Rank and success rates
Both the Mean Reciprocal Rank and success rate measures give an indication of how
many low value results a user would have to skip before reaching the correct an-
swer [110], or the first relevant answer [180].
The Mean Reciprocal Rank (MRR) measure is commonly used when there is only
one correct answer. For each query examined, the rank of the first correct document
is recorded. The score for that query is then the reciprocal of the rank at which the
document was retrieved. If there is only one relevant answer retrieved by the system,
then the MRR score corresponds exactly to average precision. The score for a system
as a whole is taken by averaging across all queries.
The success rate measure is often used when measuring effectiveness for exact
match queries, such as home page finding and named page finding tasks. Success rate
is indicated by S@k, where k is the cutoff rank and indicates the percentage of queries
for which the correct answer was retrieved in the top k ranks [56]. The “I’m feeling
lucky” button on Google [93] takes a user to the first retrieved result; accordingly, S@1 is the rate at which clicking such a button would take the user to a correct answer.
The success rate at 5, or S@5 is sometimes measured as it represents how often the
correct answer might be visible in the first results page without scrolling (“above the
fold”) [184]. The S@10 measures how often the correct page is returned within the
first 10 results.
These measures may provide important insight as to the utility of a document
ranking function. Silverstein et al. observed from a series of WWW logs that 85% of
query sessions never proceed past the first page of results [184]. Further, it has recently
been demonstrated that more time is spent by users examining results ranked highly,
with less attention paid to results beyond rank 5.43 All results beyond rank 5 were
observed to, on average, be examined for 15% of the time that was spent examining
the top result. These findings illustrate the importance of precision at high cutoffs and
success rates for WWW search systems.
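The MRR and S@k measures can be sketched as follows, assuming one ranked list and one set of correct answers per query (the names are illustrative):

def reciprocal_rank(ranking, correct):
    """1/rank of the first correct answer, or 0 if no correct answer is retrieved."""
    for k, d in enumerate(ranking, start=1):
        if d in correct:
            return 1.0 / k
    return 0.0

def mean_reciprocal_rank(rankings, correct_answers):
    """MRR averaged over all queries."""
    return sum(reciprocal_rank(r, c)
               for r, c in zip(rankings, correct_answers)) / len(rankings)

def success_at(k, rankings, correct_answers):
    """S@k: proportion of queries with a correct answer in the top k ranks."""
    hits = sum(1 for r, c in zip(rankings, correct_answers)
               if any(d in c for d in r[:k]))
    return hits / len(rankings)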
2.6.7 The Text REtrieval Conference
The Text REtrieval Conference (TREC) was established in 1992 by the National Insti-
tute of Standards and Technology (NIST) and the Defence Advanced Research Projects
Agency (DARPA). The conference was initiated to promote the understanding of In-
formation Retrieval algorithms by allowing research groups to compare effectiveness
on common test collections. Voorhees and Harman present a comprehensive history of the TREC conference and the TREC web track development in [111]. As outlined by Hawking et al. [117], the benefits of TREC evaluations include: the potential for reproducible results, the blind testing of relevance judgements, the sharing of these judgements, the potential for collaborative experiments, and the extensive training sets created for TREC.
43 In this experiment 75% of the users reported that Google was their primary search engine. These users’ prior experience with Google may be that the top ranked answer is often the correct document and that effectiveness drops off quickly, which could affect these results.
2.6.7.1 TREC corpora used in this thesis
Several TREC web track corpora are used and evaluated within this thesis – namely
the TREC VLC2, WT10g and .GOV TREC corpora. Some experiments also use query
sets from the TREC web track of 2001 and 2002 [53], and from the non-interactive web
track of 2003 [60]. These query sets include home page finding sets (2001 and 2003),
named page finding sets (2002 and 2003) and Topic Distillation sets (2002 and 2003).
These query sets and corresponding task descriptions are discussed in Section 2.6.7.2.
• TREC VLC2: is a 100GB corpus containing 18.5 million web documents. This
corpus is one third of an Internet Archive general WWW crawl gathered in
1997 [119]. The size of this corpus is comparable to the size of Google’s index at
the time of its launch (of around 24 million pages [31]). Current search engines
index two orders of magnitude more data [93].
• TREC WT10g: is a 10GB corpus containing a 1.7 million document subset of
the VLC2 corpus [15]. The corpus was designed to be representative of a small
highly connected web crawl. When building the corpus, duplicates, non-English
and binary documents were removed.
• TREC .GOV: is an 18.5GB corpus containing 1.25 million documents crawled
from the US .GOV domain in 2001 [53]. Redirect and duplicate document infor-
mation is available for this corpus (but not WT10g or VLC2).
There is debate as to whether the TREC web track corpora are representative of
larger WWW-based crawls, in particular whether the linkage patterns and density
are comparable (and therefore whether methods useful in WWW-based search would
be applicable to smaller scale web search) [100]. Recent work by Soboroff [188] has
reported that the WT10g and .GOV TREC web corpora do exhibit important charac-
teristics present in the WWW.
2.6.7.2 TREC web track evaluations
TREC 2001 web track
The TREC 2001 web track evaluated two search tasks over the WT10g web corpus
(described in Section 2.6.7.1): home page finding and relevance-based (ad-hoc) infor-
mational search. The objective of the home page finding task was to find a home page
given some query created to name the page (as described in Section 2.6.2.1). The objec-
tive of the relevance-based informational search task was to find documents relevant
to some topic, given a short summary query. Experiments in Chapter 7 of this thesis
make use of data from the TREC 2001 web track home page finding task. The ad-hoc
informational search task is not considered.
For the 2001 home page finding task, 145 queries were created by NIST assessors
by navigating to a home page within the WT10g corpus and composing a query de-
signed to locate that home page [110]. A training set of 100 home page finding queries
and correct answers, created in the same way, was provided before the TREC evalu-
ation to allow participants to train their systems for home page finding [56]. Systems
were compared officially on the basis of the rank of the first answer (the correct home
page, or an equivalent duplicate page). Search system performance was compared
using the Mean Reciprocal Rank of the first correct answer and success rate (both de-
fined in Section 2.6.6.2).
TREC 2002 web track
The TREC 2002 web track44 evaluated two search tasks over the .GOV web corpus
(described in Section 2.6.7.1): named page finding and Topic Distillation. The objec-
tive of the named page finding task was to find a particular web page given a page
naming query (as described in Section 2.6.2.2). The objective of the Topic Distillation
task was to retrieve entry points to relevant resources rather than relevant documents
(as described in Section 2.6.3.1).
This thesis includes experiments that use data from both of these web track tasks.
Data from the 2002 TREC Topic Distillation task are used sparingly as the task is con-
sidered to be closer to a traditional ad-hoc informational task than to a Topic Distillation task [53, 111].
For the 2002 named page finding task, 150 queries were created by NIST assessors
by accessing a random page within the .GOV corpus and then composing a query
designed to locate that page [53]. Systems were compared officially on the basis of
the rank of the first answer (the correct page, or an equivalent duplicate page), using
Mean Reciprocal Rank and success rates (both measures defined in Section 2.6.6.2).
The 2002 Topic Distillation task consisted of 50 queries created by NIST to be repre-
sentative of broad topics in the .GOV corpus (however, the topics chosen are believed
to have not been sufficiently broad [53]). System effectiveness was compared using
the precision @ 10 measure.
TREC 2003 web track
The TREC 2003 web track evaluated two further search tasks over the .GOV web cor-
pus: a combined home page / named page finding task, and a revised Topic Distilla-
tion task. The objective of the combined task was to evaluate whether systems could
fulfil both types of query without prior knowledge of whether queries named home pages or other pages. The objective of the Topic Distillation task was to find entry points to relevant resources given a broad query (as described in Section 2.6.3.1).
44 The report for the official submissions to the 2002 TREC web track (csiro02–) is included in Appendix D, but the results from these experiments are not discussed further.
The instructions given to the relevance judges in the 2003 Topic Distillation task
differed from those given in 2002. In 2003 the judges were asked to emphasise
“home page-ness” more than in the 2002 Topic Distillation task, and broader queries
were used to ensure that some sites devoted to the topic existed [60].
The TREC 2003 combined home page / named page task consisted of a total of 300
queries, with an equal mix of home page and named page queries. The query set was
selected using the methods previously used for generating the query/result sets for
the 2001 home page finding task and the 2002 named page finding task. Systems were
compared officially on the basis of the rank of the first retrieved answer, using Mean
Reciprocal Rank and success rate measures.
The TREC 2003 Topic Distillation task consisted of 50 queries created by NIST to
be representative of broad topics in the .GOV corpus. Judges ensured that queries
were “broad” by submitting candidate topics to a search system in order to determine
whether there were sufficient matches for the proposed topics.
Systems were compared officially on the basis of R-precision as many of the topics
did not have 10 correct results (and thus precision @ 10 was not a viable measure).
Later work by Soboroff challenged the use of these measures, and demonstrated that
precision @ 10 would have been a superior evaluation measure [189].
Chapter 3
Hyperlink methods -
implementation issues
The value of hyperlink evidence may be seriously degraded if the algorithms that
exploit it are not well implemented. Thus hyperlink-based evidence is intrinsically
dependent on the accuracy and completeness of the web graph from which it is calcu-
lated.
This chapter documents and justifies implementation decisions taken during em-
pirical work and details limitations of the corpora available for use.
3.1 Building the web graph
An ideally accurate web graph would be one where all hyperlinks in the corpus rep-
resented the intentions of the document author when the hyperlinks were created.
Such accuracy would require hyperlink authors to be consulted during web graph
construction to confirm that their hyperlinks were pointing to the web content they
intended to link to. In most cases this process would not be feasible. Therefore the
discussion of graph accuracy within this chapter relates to how likely it is that the web
graph is an accurate representation of web authors’ link intentions. The discussion of
web graph completeness refers to the amount of hyperlink evidence directed at docu-
ments within the corpus that has been successfully assigned to the target document
(and not lost). To ensure web graph accuracy and completeness:
• Document URLs need to be resolved;
• Duplicate documents may need to be removed;
• Hyperlink redirects may need to be followed;
• Dynamic page content may need to be detected; and
• Links created for reasons other than recommendation may need to be removed.
The following sections discuss, in turn, how each of these requirements has been ad-
dressed when building a representation of the web graph.
3.1.1 URL address resolution
Hyperlink targets can be expressed as fully qualified absolute addresses (such as http://cs.anu.edu.au/index.html) or provided as an address relative to the hyperlink source (such as ../index.html from http://cs.anu.edu.au/~Trystan.Upstill/index.html). Whether addressed using relative or absolute URLs, hy-
perlinks need to be mapped to a single target document either within or external to
the corpus. Non-standard address resolution could lead to phantom pages (and sub-
graphs) being introduced into the web graph. In experiments within this thesis all
relative URLs are decoded to their associated absolute URL (if present) following the
conventions outlined in RFC 2396 [21] and additional rules detailed in Appendix B.
Some examples of address resolution are:
• A relative link to /../foo.html from http://cs.anu.edu.au/ is resolved to http://cs.anu.edu.au/foo.html;
• Links to any of http://cs.anu.edu.au/~Trystan.Upstill/index.html, http://cs.anu.edu.au//////~Trystan.Upstill//, or http://cs.anu.edu.au:80/~Trystan.Upstill/ are resolved to http://cs.anu.edu.au/~Trystan.Upstill/;
• Links to http://cs.anu.edu.au/foo.html#Trystan are resolved to http://cs.anu.edu.au/foo.html;
• Links to panopticsearch.com/ are resolved to http://www.panopticsearch.com/.
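As a rough approximation of this kind of resolution (the authoritative rules are those of RFC 2396 and Appendix B, not this fragment), the Python standard library can be used as follows:

import re
from urllib.parse import urljoin, urlsplit, urlunsplit

def resolve(base, href):
    """Resolve a (possibly relative) link against its source URL and normalise it."""
    url = urljoin(base, href)                                # relative -> absolute
    scheme, netloc, path, query, _fragment = urlsplit(url)   # drop any #fragment
    netloc = netloc.lower()
    if scheme == "http" and netloc.endswith(":80"):          # default port is redundant
        netloc = netloc[:-3]
    path = re.sub(r"/{2,}", "/", path) or "/"                # collapse repeated slashes
    return urlunsplit((scheme, netloc, path, query, ""))

# For example, resolve("http://cs.anu.edu.au/~Trystan.Upstill/index.html", "../index.html")
# yields http://cs.anu.edu.au/index.html.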
3.1.2 Duplicate documents
Duplicate and near-duplicate1 documents are prevalent in most web crawls [34, 78,
123, 183]. In a 30 million page corpus collected by AltaVista [7] from the WWW
in 1996, 20% of documents were found to be effective duplicates (either exact duplicates
or near duplicates) of other documents within the collection [34]. In a 26 million page
WWW crawl collected by Google [93] in 1997, 24% of the documents were observed to be exact duplicates [183]. In a further crawl of 80 million documents from the WWW in May 1999 [123], 8.5% of all documents downloaded were exact duplicates. In a 2003 crawl of the IBM intranet, over 75% of URLs were effective duplicates [78].
1 Near-duplicate documents share the same core content but a small part of the page is changed, such as a generation date or a navigation pane.
The presence of duplicate pages in a web graph can lead to inconsistent assign-
ment of hyperlink evidence to target documents. For example, if two documents con-
tain duplicate content, other web authors may split hyperlink evidence between the
two documents. These duplicate documents should be identified and collapsed down
to a single URL. However, if unrelated pages are mistakenly identified as duplicates
and collapsed, distortion will be introduced into the web graph and the effective-
ness of both hyperlink recommendation and anchor-text evidence may be reduced.
For example, if Microsoft and Toyota’s home pages were tagged as duplicates, all
link information for Microsoft.com might be re-assigned to Toyota’s home page, lead-
ing to http://www.toyota.com possibly being retrieved for the query ‘Microsoft’.
Therefore it is important to ensure exact (or very close) duplicate matching when as-
signing hyperlink recommendation scores and anchor-text evidence to consolidated
documents.
Common causes of duplicate documents in the corpus are:
• Host name aliasing. Host name aliasing is a technique used to assign multiple
host names to a single IP address. In some cases several host names may serve
the same set of documents under each host name. This may result in identical
sets of documents being stored for each web server alias [15, 123].
• Symbolic links between files. Symbolic links are often employed to map multi-
ple file names to the same document [123], resulting in the same content being
retrieved for several URLs. If there is no consensus amongst web authors as to
the correct URL, incoming links may be divided amongst all symbolically linked
URLs.
• Web server redirects. In many web server configurations the root of a directory
is configured to redirect to a default page (e.g. http://cs.anu.edu.au/ to
http://cs.anu.edu.au/index.html). Once again if there is no consensus
amongst web authors as to the correct URL, incoming links may be divided
amongst the URLs.
• File path equivalence. On web servers running on case-insensitive operating
systems (such as the Microsoft Windows Internet Information Server [148]) the
case of characters in the path is ignored and all case variants will map to the
same file (so Foo/, foo/ and FoO/ are all equivalent). By contrast, for web
servers running on case-sensitive operating systems (such as Apache [10] with
default settings on Linux), folder case is meaningful (so Foo/, foo/ and FoO/
may all map to different directories).
• Mirrors. A mirror is a copy of a set of web pages, served with little or no modi-
fication on another host [23, 122]. In a crawl of 179 million URLs in 1998, 10% of
the URLs were observed to contain mirrored content [23].
Duplicates created as a result of host name aliasing may be resolved through map-
ping domain names down to their canonical domain name (using “canonical name”
(CNAME) and “address” (A) requests to a domain name server, as detailed in [123]).
This process has several drawbacks, including that some of these virtual hosts may be
incorrectly collapsed down to a single server [123]. To accurately detect duplicates the
process of domain name collapsing should be performed at the time of crawling [123].
This is because the canonical domain name mappings may have changed prior to du-
plicate checking and may incorrectly identify duplicate servers.
In experiments within this thesis host name alias information was collected when
available. Host name alias information was not available for the (externally collected)
VLC2 and WT10g TREC web track collections [15, 62].
Other types of duplicates may be detected using heuristics [24], but page content
examination needs to be performed to resolve these duplicates reliably [123]. Scalable
document full-text-based duplicate detection can be achieved through the calculation
of a signature (typically an MD5 checksum [166]) for each crawled page. However,
such checksums may map two nearly identical pages to very different checksum val-
ues [34]. Therefore document-based checksums cannot be used to detect near du-
plicate documents. Near-duplicate documents can be detected using methods such
as Shingling [32, 34], which detects duplicates using random hash functions, and I-
Match [47], a more efficient method which uses collection statistics. Full site mirrors
may be more easily detected by considering documents not in isolation, but in the
context of all documents on a particular host. Bharat et al. [24] investigated several
methods for detecting mirrors in the web graph using site heuristics such as network
(IP) address, URL structure and host graph connectivity.
In corpora built for this thesis, exact duplicates on the same host were detected
using MD5 checksums [166] during corpus collection. Duplicate host aliases were
also consolidated. Mirror detection and near duplicate detection techniques were not
employed due to link graph distortion that may be introduced through false positive
duplicate matching.
During the construction of the VLC2 test collection no duplicate detection was
employed; however, for WT10g (a corpus constructed from VLC2) duplicates present
on the same web server were detected (using checksums) and eliminated [15]. This
was reported to remove around 10% of VLC2 URLs from consideration. Host aliasing
was not checked for either collection [15]. During the .GOV corpus crawl duplicate
documents were detected and eliminated using MD5 checksums.
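Grouping exact duplicates by content checksum can be sketched as follows (the data structures are illustrative; near-duplicate and mirror detection would require the shingling or heuristic methods cited above):

import hashlib
from collections import defaultdict

def group_exact_duplicates(pages):
    """pages: iterable of (url, raw_bytes) pairs. Returns checksum -> URLs sharing it."""
    groups = defaultdict(list)
    for url, content in pages:
        digest = hashlib.md5(content).hexdigest()   # content signature
        groups[digest].append(url)
    return {h: urls for h, urls in groups.items() if len(urls) > 1}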
3.1.3 Hyperlink redirects
Three methods frequently employed by web authors to redirect page requests are:
• Using an HTTP redirect configured through the web server [81].2 The redirect
information is then transferred from the web server to the client in the HTTP response header code. HTTP redirects return a redirection HTTP status code (301 – Moved Permanently or 302 – Moved Temporarily) [81]. In a crawl of 80 million documents in May 1999 [123], 4.5% of all HTTP requests received a redirection response.
2 This method is recommended by the W3C for encoding redirects [81].
• Using HTML redirects [164].3 HTML redirects are often accompanied by a tex-
tual explanation of the redirect with some arbitrary timeout value for page for-
warding. HTML redirects return an “OK” (200) HTTP status code [81].
• Using Javascript [152]. The detection of Javascript redirects requires the crawler
(or web page parser4) to have a full Javascript interpreter and run Javascript
code to determine the target page.
Ensuring hyperlink evidence is assigned to the correct page when dealing with
hyperlink redirects is no simple matter. A link pointing to a page containing a redirect
can either be left to point at the placeholder page (the page used to direct users to the
new document) or re-mapped to the new target page. The web author who created
the link is unlikely to have deliberately directed evidence to the placeholder page.
By contrast, if the link is re-mapped to the final target, the document may not be
representative of the initial document for which the link was created.
HTML and Javascript redirect information was logged and stored when building
the VLC2 and WT10g test collections. For the .GOV collection all three types of redi-
rects were stored and logged.
If possible, for experiments within this thesis, redirect information was used to
reassign the link to the end of the redirect chain. Due to the complexity of dealing
with Javascript redirects, experiments in this thesis do not resolve these redirects.
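Re-mapping a link to the end of a known redirect chain might be sketched as follows, assuming a redirect table extracted at crawl time (the table format and loop guard are illustrative assumptions):

def resolve_redirects(url, redirect_map, max_hops=10):
    """Follow a chain of known redirects (source URL -> target URL) to its end."""
    seen = {url}
    for _ in range(max_hops):
        target = redirect_map.get(url)
        if target is None or target in seen:   # end of the chain, or a loop
            return url
        seen.add(target)
        url = target
    return url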
3.1.4 Dynamic content
Unbounded crawling of dynamic content can lead to crawlers being caught in “crawler
traps” [123] and the creation of phantom link structures in the web graph. This may
lead to “sinks” being introduced into the web graph, and a reduction of the effective-
ness of hyperlink analysis techniques.
Dynamic content on the WWW is bounded only by the space of all potential
URLs on live host names. A study in 1997 estimated that 80% of useful WWW docu-
ments are dynamically generated [139]; moreover this has been observed to be a lower
bound [165].
During the creation of the VLC2 test collection, dynamic content was crawled
when linked-to [15]. For the WT10g corpus all identifiably dynamic documents5 were
removed [15]. This meant removing around 20% of the documents present in the VLC2 corpus. This is surprising given the estimate that 80% of all useful WWW content is dynamic. The large disagreement indicates that either the crawler used to gather the VLC2 corpus did not effectively crawl dynamic content, the estimate of dynamic content was incorrect, or static content was crawled first during the Internet Archive crawl.6 It is unclear why dynamic content was removed from the WT10g corpus, given that dynamic web content is likely to contain useful information.
3 This method for encoding redirects is not recommended in the latest HTML specification [164].
4 The system component that processes web documents and extracts document data prior to indexing.
5 i.e. not having a static URL extension, e.g. a “?” or common dynamic extensions such as “.php”, “.cgi” or “.shtml”.
3.1.5 Links created for reasons other than recommendation
Hyperlink recommendation algorithms assume that links between documents imply
some degree of recommendation [157]. Therefore links created for reasons other than
recommendation may adversely affect hyperlink recommendation scores [63]. Links
are often created for site navigation purposes or for nepotistic reasons [63]. Nepotistic
linking is link generation that is the result of some relationship between the source and
target, rather than the merit of the target [63, 137]. Kleinberg [132] proposed that all
internal site links be removed to lessen the influence of local nepotistic hyperlinks and
navigational hyperlinks. This was further refined by Bharat and Henzinger [26] who
observed that nepotistic links may exist not only within a single site but between sites
as well. To remove these nepotistic links they suggested that all sites be considered as
units, and proposed that only a single link between hosts be counted. However, the
removal of all internal host link structure may discard useful site information. Amitay
et al. [9] studied the relationship between site structure and site content and through
an examination of internal and external hyperlink structure were able to distinguish
between university sites, online directories, virtual hosting services, and link farms.
The link structures in each of these sites were observed to be quite different, indicating
that reducing the effects of nepotistic and navigational links according to the type of
site may be more effective than simply removing all internal links.
Fundamental changes in the use of hyperlinks on the web may also challenge the
recommendation assumption by affecting the quality or quantity of mined hyperlink
information. For example, the use of web logging tools (blogs) [92] may alter the
dynamics of hyperlinks on the WWW. Such pages are often stored together on a single
host, are very frequently updated, and the cost of generating a link to other content
in a blog is small. As such, the applicability of hyperlink recommendation algorithms
in this environment has been challenged [86]. It is also possible that as WWW search
engine effectiveness improves, authors are less likely to link to documents that they
find useful, as such documents can be easily found using a popular WWW search
engine. An analysis of how such trends affect hyperlink quality is outside the scope
of this thesis and is left for future work.
In experiments in this thesis internal site links are preserved and weighted equally.
This is important, as some of the evidence useful in navigational search may be en-
coded into internal site structure or nepotistic links, such as links to site home pages and entry points. For example, almost all external links to the Australia Post web site [12] are directed to the post-code lookup, with the home page identified by evidence present in the anchor-text of internal links [114]. Also, within some of the collections studied (such as WT10g [15]), inter-server linking is relatively infrequent.
6 The VLC2 collection consists of the first one-third of the documents stored during an all-of-WWW crawl performed by the Internet Archive in 1997.
3.2 Extracting hyperlink evidence from WWW search engines
Some of the experiments performed in this thesis rely on hyperlink evidence extracted
from WWW search engines via their publicly available interfaces. The WWW search
engines used are well engineered and provide effective and robust all-of-WWW search.
However, there are disadvantages in using WWW search engines for link informa-
tion. Such experiments are not reproducible as search engine algorithms and indexes
are not known and may well change over time. Additionally, some of the sourced
information is incomplete (such as the top 1000 results lists) or estimated (such as
document linkage information7).
3.3 Implementing PageRank
PageRank implementations outlined in the literature differ in the ways they deal with
dangling links, in the bookmarks used for random jumping, and in the conditions that
must be satisfied for convergence [154, 157, 158, 201]. Section 2.4.3.2 gave an overview
of the PageRank calculation. The current section outlines the process that has been
followed when calculating PageRank values for use in this thesis.
3.3.1 Dangling links
A hyperlink in the web graph that refers to a document outside of the corpus, or links
to a document which has no outgoing links, is termed a dangling link [157].
In Page and Brin’s [157] PageRank formulation, dangling links are removed prior
to indexing and then re-introduced after the PageRank calculation has converged.
The removal of dangling links using this method increases the weight of PageRank
distributed through other links on pages that point to dangling links. This is because
dangling links are not considered when dividing PageRank amongst document out-
links.
An alternative PageRank calculation sees the random surfer jump with certainty
(probability 1, rather than (1 − d)) when they reach a dangling link. This implies
that the random surfer jumps to a bookmark when they reach a dead-end [43, 154].
This implementation has desirable stability properties when used with a bookmark set
that evenly distributes “jump” PageRank amongst all pages (as described in Section
2.4.3.2).
7 Sourced using methods outlined in Section 5.1.3.
A further PageRank variant sees the random surfer jump back to the page they
came from when they reach a dangling link [158]. This variant is problematic as it
may lead to rank sinks if a page has many dangling links. This may result in inflated
scores for sections of the graph.
The PageRanks used for web corpora in this thesis are calculated using the dan-
gling link “jump with certainty” method. This method has been shown to have desir-
able stability and convergence properties [154].
3.3.2 Bookmark vectors
In experiments within this thesis PageRank values are calculated for two different
bookmark vectors (E). The first vector produces a “Democratic” or unbiased Page-
Rank in which all pages are a priori considered equal. The second bookmark vector
“personalises” [157] the PageRank calculation to favour known authoritative pages.
The bookmark vector is created using links from a hand-picked source and is termed
“Aristocratic” PageRank.
In Democratic PageRank (DPR) every page in the corpus is considered to be a book-
mark and therefore every page has a non-zero PageRank. Every link is important and
thus in-degree might be expected to be a good predictor of DPR. Because it is easy for
web page authors to create links and pages, it is easy to manipulate DPR with link
spam.
In Aristocratic PageRank (APR) a set of authoritative pages is used as bookmarks to
systematically bias scores. In practice the authoritative pages might be taken from a
reputable web directory or corpus site-map. For example, for WWW-based corpora,
bookmarks might be sourced from a WWW directory service such as Yahoo! [217],
Looksmart [144] or the Open Directory [69]. APR may be harder to spam than DPR
because newly created pages are not, by default, included in the bookmarks.
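A minimal power-iteration sketch of the PageRank variant described above, in which dangling pages jump to the bookmark vector with certainty, is given below. The data structures, convergence test and parameter defaults are illustrative assumptions; a page whose out-links all fall outside the corpus is treated as dangling.

def pagerank(out_links, bookmarks, d=0.85, eps=1e-4, max_iter=1000):
    """out_links: {page: [link targets]}; bookmarks: {page: jump probability}, summing to 1."""
    pages = list(out_links)
    pr = {p: 1.0 / len(pages) for p in pages}          # PageRank mass sums to 1 throughout
    for _ in range(max_iter):
        new = dict.fromkeys(pages, 0.0)
        jump_mass = 1.0 - d                            # (1 - d) of all mass goes to the bookmarks
        for p, score in pr.items():
            targets = [t for t in out_links.get(p, []) if t in new]
            if targets:                                # distribute d * score over in-corpus out-links
                share = d * score / len(targets)
                for t in targets:
                    new[t] += share
            else:                                      # dangling page: jump with certainty
                jump_mass += d * score
        for p in pages:
            new[p] += jump_mass * bookmarks.get(p, 0.0)
        if sum(abs(new[p] - pr[p]) for p in pages) < eps:
            break
        pr = new
    return new

For Democratic PageRank the bookmark vector assigns equal probability to every page; for Aristocratic PageRank it is concentrated on a hand-picked authoritative set. The default eps mirrors the 0.0001 convergence threshold used in the experiments of the next section.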
3.3.3 PageRank convergence
This section presents a small experiment to determine how the performance of Page-
Rank is affected by changes to the PageRank d value. These experiments examine re-
trieval effectiveness on the WT10gC home page finding test collection, for Optimal re-
rankings (described in Section 7.1.4) of two query-dependent baselines (document
full-text and anchor-text). This collection was provided to participants in TREC 2001
so that they could train systems for home page search (described in Section 2.6.7.2, the
test collection is used in experiments in Chapter 7).
Figures 3.1 and 3.2 illustrate how the PageRank on the WT10gC collection is af-
fected by changes to the d value. Figure 3.3 shows how the choice of d affects conver-
gence. In practice the d value is typically chosen to be between 0.8 and 0.9 [16, 157].
Results from these experiments reveal that the performance of PageRank can be
remarkably stable even with large changes in the d value. When d was set to 0.02
the performance of the Optimal re-ranking (see Section 7.3) was similar to the per-
formance at d = 0.85. Without the introduction of any random noise (at d = 1.0)
the PageRank calculation did not converge. However, the PageRank calculation did
converge with only a small amount of weight distributed through random jumping (d = 0.99).
Unless the score is to be directly incorporated in a ranking function, only the rel-
ative ordering of pages is important. Haveliwala [102] noted this as a possible Page-
Rank optimisation method since a final ordering of pages might be achieved before fi-
nal convergence. Haveliwala observed that the ordering of pages by PageRank values
did not change significantly after a few PageRank iterations. When moving from 25 to
100 iterations of the PageRank calculation, on corpora of over 100 000 documents, no
significant difference in document ranking order was observed [102]. In experiments
in this thesis the PageRank calculation was run until convergence. This allowed for
flexibility when combining PageRank values with other ranking components.
Since little improvement in performance was observed when increasing d, the em-
pirical evidence suggests d should be set to a very small value (around 0.10) for cor-
pora of this size, thereby reducing the number of iterations required and minimising
computational cost. However, to maintain consistency with previous evaluations, in
experiments within this thesis, d was set at 0.85, as suggested by Brin and Page [31].
[Figure 3.1: plot of success rate (S@1, S@5, S@10) for the content baseline against the Democratic PageRank d value; caption below.]
Figure 3.1: Effect of d value (random jump probability) on success rate for Democratic Page-
Rank calculations for the WT10gC test collection. As d approaches 0 the bookmarks become
more influential. As d approaches 1 the calculation approaches “pure” PageRank (i.e. a Page-
Rank calculation with no random jumps). The convergence threshold (ε) is set to 0.0001. The
WT10gC test collection is described in Section 7.1.3. The PageRank scores are combined with a
document full-text (content) baseline ranking using the Optimal re-ranking method described
in Section 7.1.4.
[Figure 3.2: plot of success rate (S@1, S@5, S@10) for the anchor-text baseline against the Aristocratic PageRank d value; caption below.]
Figure 3.2: Effect of d value (random jump probability) on success rate for Aristocratic
PageRank calculations for the WT10gC collection. As d approaches 0 the bookmarks be-
come more influential. As d approaches 1 the calculation approaches “pure” PageRank (i.e. a
PageRank calculation with no random jumps). The convergence threshold (ε) is set to 0.0001.
The WT10gC test collection is described in Section 7.1.3. The PageRank scores are combined
with an aggregate anchor-text (anchor) baseline ranking using the Optimal re-ranking method
described in Section 7.1.4.
[Figure 3.3: plot of the number of iterations to convergence against the d value, for Aristocratic and Democratic PageRank; caption below.]
Figure 3.3: Effect of PageRank d value on the rate of Democratic PageRank convergence on
WT10g, by number of iterations. PageRank did not converge at d = 1 (no random jumps).
The WT10g collection contains 1.7 million documents and is described in Section 2.6.7.1.
3.3.4 PageRank applied to small-to-medium webs
It is sometimes claimed that PageRanks are not useful unless the web graph is very
large (tens or hundreds of millions of nodes), but this claim has not been substan-
tiated. PageRanks can be calculated for a web graph of any size. PageRank scores
are therefore usable within any web crawl, including single organisations (enterprise)
and portals. The Google organisational search appliance incorporates PageRank for
crawls of below 150 000 pages [96].
3.4 Expected correlation of hyperlink recommendation measures
As DPR depends to some degree on the number of incoming links a page receives, one
might expect DPR to be correlated with in-degree. Ding et al. [68] previously observed
that for this reason, in-degree is a useful first order approximation to DPR. Moreover,
when DPR is calculated with a low convergence threshold it might be expected to
be more highly correlated with in-degree, as little weight is transferred through the
graph. Similarly it might be expected that corpora with large numbers of dangling
links would be more highly correlated. APR is likely to be far less correlated, with
many documents potentially having an APR score of zero.8
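One simple way to quantify such a relationship is a rank correlation between the two score vectors. The sketch below computes Spearman’s ρ directly (ignoring tie correction); this is only one of several possible correlation measures and is not necessarily the statistic used in the later chapters.

def spearman_rho(xs, ys):
    """Spearman rank correlation between two equal-length score lists (no tie correction)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))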
In this thesis the following correlations between hyperlink recommendation scores
are tested:
• Between WWW-based PageRank scores (from Google [93]) and WWW-based
in-degree scores (from AllTheWeb [80]), in Section 5.3;
• Between small-to-medium web based scores for DPR, APR and in-degree, in
Section 7.6.3; and
• Between small-to-medium web based scores for DPR, APR and in-degree, and
WWW-based PageRank scores (from Google), also in Section 7.6.3.
8 For example, if not bookmarked, no APR score will be achieved by pages in the so-termed WWW “Tendrils” (unless linked to by other Tendrils) [35] or pages in the IN component.
Chapter 4
Web search and site searchability
The potential for hyperlink evidence to improve retrieval effectiveness may depend
upon the authorship of web sites. Some web documents are authored in such a way
as to prevent or discourage direct linking. This may make it difficult for web search
engines to retrieve a document. Decisions made when authoring documents can af-
fect the evidence collected by web crawlers, and thereby reduce or increase the quality
of end-user search results. This chapter investigates how the “searchability” of sites
influences retrieval effectiveness. It also provides a whole-of-WWW context for the
experimental work based on smaller web corpora. In particular, the case study pre-
sented in this chapter illustrates:
• The importance of web hyperlink evidence in the ranking algorithms of promi-
nent WWW search engines, by investigating whether well-linked content is more
likely to be retrieved by WWW search engines.
• Difficulties faced by prominent WWW search engines when resolving author
intentions through web graph processing (and how successfully resolving issues
discussed in Chapter 3 can improve retrieval effectiveness).
• The effect of web authorship conventions on the likelihood of hyperlink evi-
dence generation.
This case study examines both search effectiveness and searchability with respect to a particular type of commodity which is frequently sold over the WWW: books. The task examined is that of finding web pages from which a book may be purchased, specifying the book’s title as the query.
Online book buying is a type of transactional search task (see Section 2.6.1) [33].
Transactional search is an important web search task [124], as it drives e-commerce
and directs first-time buyers to particular merchant web sites. However, despite the
prevalence of such tasks, information retrieval research has largely ignored product
purchasing and transactional search tasks [106, 116].
The product purchasing search task is characterised by multiple correct answers.
For example, in this case study, any of the investigated bookstores may provide the
service which has been requested (the purchase of a particular book). Employing a
task with many equivalent answers spread over a number of sites makes it possible to
study which sites are most easily searchable by search engines, and conversely which
search engines provide the best coverage.
The study of searchability is primarily concerned with site crawlability and the
prevalence of link information, that is, how easy it is to retrieve pages and link struc-
ture from a web site. A site with good searchability is one whose pages can be matched
and ranked well by search engines, and whose URLs are simple and consistent, such
that other authors may be more likely to create hyperlinks to them.
Previous studies of transactional search have evaluated the service finding ability
of TREC search systems [106] and WWW search engines [116] on a set of apparently
transactional queries extracted from natural language WWW logs. The aim of these
studies was to compare search engines on early precision; no information was avail-
able (or needed) about what resources could be found, and there was no comparison
of the searchability of online vendor sites.
4.1 Method
The initial step in the experiment was the selection of candidate books, the titles of
which formed the query set. This query set was then submitted to four popular WWW
search engines and ranked lists of documents were retrieved. Links to candidate
bookstore-based pages within these ranked lists were extracted, examined, and (if
required) downloaded. These documents were then examined to determine whether
they fulfilled the requirements of transactional search; that is, that the document did
not only match the book specified in the query, but also allowed for the book to be
purchased directly. The search engines were then compared based on how often they
successfully retrieved a transactional document for the requested books. Similarly, a
comparison of bookstores was performed based on how often each bookstore had a
transactional document for the desired book retrieved by any of the WWW search en-
gines. To examine the effect that hyperlink and document coverage had on bookstore
and search engine retrieval effectiveness, further site-based information was extracted
from the search engines and analysed. The experimental data used were collected in
the fourth quarter of 2002.
The following sections describe these steps in greater detail. The methods used for
extracting evidence relating to search engine coverage of bookstore URLs and hyper-
links are described in Appendix C.
4.1.1 Query selection
The book query set was identified from the New York Times bestseller lists, by sourc-
ing the titles of the best-sellers for September 2002 [153]. A total of 206 distinct book
titles were retrieved from nine categories.1 Book titles were listed on the best-seller
lists fully capitalised, and were later converted to lower case and revised such that all
terms, apart from join terms (such as “the”, “and” and “or”), began with a capital.
1 The book/category breakdown is included in Appendix C.
The query selection presumes that users search for a book using its exact title. In
fact users may seek books using author names, topics, or even partial and/or incorrect
titles. However, it is likely that a significant proportion of book searches are made
using the exact listed title.
The ISBNs of correct books were identified for page judging. Both hardcover and
paperback editions of books were considered to be correct answers.2 A list of the
queries and the ISBNs of the books judged as correct answers is available in Appen-
dix C.
4.1.2 Search engine selection
Four search engines were identified from the Nielsen/NetRatings Search Engine Ratings for September 2002 (as outlined in Table 4.1). At the time, the four engines provided the core search technology for the four most popular search services, and for eight of the top ten search services [194].
S. Engine            Abbr.   Used by [195]   Rank
AltaVista [7]        AV      AltaVista       8
AllTheWeb [80]       FA      AllTheWeb       -
Google [93]          GO      Google          3
                             AOL             4
                             Netscape        9
                             Yahoo           1
MSN Search [149]     MS      MSN Search      2
(based on Inktomi)           Looksmart       10
                             HotBot          -
                             Overture        6
Table 4.1: Search engine properties. The column labelled “Abbr.” contains abbreviations used in the study. “Used by” indicates search services that used the search engine. “Rank” indicates the search service’s position in the Nielsen/NetRatings Search Engine Ratings of September 2002 [194].
4.1.3 Bookstore selection
The bookstore set was derived from the Google DMOZ “Shopping > Publications >
Books > General” [94] and Yahoo! “Business and Economy > Shopping and Services >
Books > Booksellers” [216] directories. Bookstores were considered if they sold the top
bestseller in at least three of the nine categories. The process of bookstore candidate
identification was performed manually using internal search engines to search for
both the title and the author of each book (both title and author were used to uniquely identify books).
2 Large print and audio editions were deemed to be incorrect answers.
Bookstore Core URL De. Dy. URL Cat.
1BookStreet 1bookstreet.com N Y ISBN 9
A1Books a1books.com N Y ISBN 9
AllDirect alldirect.com N Y ISBN 9
Amazon amazon.com N P ISBN 9
Americana Books americanabooks.com N Y - 7
Arthurs Books arthursbooks.com N Y ISBN 4
Barnes and Noble barnesandnoble.com N Y ISBN 9
BookWorks bookworksaptos.com Y* Y ISBN 9
BookSite booksite.com Y+ Y ISBN 9
Changing Hands changinghands.com Y* Y ISBN 9
ecampus ecampus.com N Y ISBN 9
NetstoreUSA netstoreusa.com N P ISBN 9
Planet Gold planetgold.com N Y - 9
TextbookX.com textbookx.com N Y ISBN 9
Sam Weller’s Books samwellers.com N Y ISBN 9
All Textbooks 4 Less alltextbooks4less.com N Y ISBN 9
The Book Shop bookshopmorris.com Y* Y ISBN 9
Cornwall Discount Books cornwalldiscountbooks.com N Y - 8
A Lot of Books alotofbooks.com N Y - 3
HearthFire Books hearthfirebooks.com Y* Y ISBN 9
Walmart walmart.com N Y - 9
Wordsworth.com wordsworth.com N Y ISBN 9
Powells powells.com N Y - 9
BiggerBooks.com biggerbooks.com N Y ISBN 9
That Bookstore in Blytheville tbib.com Y* Y ISBN 9
StrandBooks.com strandbooks.com N Y ISBN 7
St. Marks Bookshop stmarksbookshop.com Y* Y ISBN 9
RJ Julia rjjulia.com N Y ISBN 9
Paulina Springs Book Company paulinasprings.com Y* Y ISBN 9
Books-A-Million booksamillion.com N Y ISBN 9
CodysBooks.com codysbooks.com Y* Y ISBN 9
The Concord Bookshop concordbookshop.com Y* Y ISBN 9
Dartmouth Bookshop dartbook.com Y* Y ISBN 9
GoodEnough Books goodenoughbooks.com Y* Y ISBN 9
MediaPlay.com mediaplay.com N Y - 9
Table 4.2: Bookstores included in the evaluation. This table reports whether the bookstore contained ISBNs in its internal URLs (“URL”), whether the sites were generated through a series of dynamic scripts (“Dy.”), whether they were a derivative of another site (“De.”) and how many of the nine book categories they matched (“Cat.”). A “*” in the “De.” column indicates that the site was a booksense.com derivative, while a “+” indicates that the bookstore was a booksite.com derivative. A “P” in the “Dy.” column indicates that the site was dynamic but did not “look” dynamic (it did not have a “?” with parameters following the URL).
Bookstores were only judged on the categories for which they stocked
(or listed) the bestseller. The justification for this approach was that there may be some
specialised (e.g. fiction only) bookstores that should be included in the study, but not
considered for all book categories. A full listing of all 35 eligible bookstores and their
salient properties is presented in Table 4.2.
4.1.4 Submitting queries and collecting results
The queries were made up of book titles submitted to search engines as phrases (i.e.
inside double quotes or marked as phrases in advanced searches). The exact query
syntax submitted to each search engine is reported in Appendix C. The top 1000
results for each query from each search engine were retrieved and recorded.
4.1.5 Judging
The candidate documents were required to fulfil two criteria in order to be considered as a correct answer: 1) the page must have been for the book whose title was given as the query; and 2) the retrieved page must have been transactional in nature.
A transactional page was considered to be a bookstore page from which a user could
buy a book. Browse pages (documents that list multiple books, for example, a list of
books in a particular category, by series or by author) or bookstore search results were
not judged as correct results. For many bookstores the correct answers were observed
to have the hardcover or paperback ISBN in the URL (in many cases there were numerous duplicate correct URLs, all of which contained the ISBN). To cut down on
manual judging for these bookstores, automatic judging was performed based on the
presence or absence of the ISBN in the URL. For other bookstores the unique product
identifiers for each book were manually collected and recorded, and URLs checked
for their presence.
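The automatic judging step can be illustrated with the following minimal sketch, which accepts a retrieved URL if it contains one of the ISBNs recorded as correct for the query. It is a reconstruction of the procedure described above rather than the actual judging scripts; the function name and the ISBN values are placeholders introduced here.

def is_correct_answer(url, correct_isbns):
    """Automatic judging sketch: a URL is accepted if it contains one of the
    hardcover or paperback ISBNs recorded as correct for the query."""
    return any(isbn in url for isbn in correct_isbns)

# Placeholder ISBNs for two hypothetical editions of the same title.
correct_isbns = {"0123456789", "0987654321"}
print(is_correct_answer(
    "http://bookstore.example.com/books/0123456789.html", correct_isbns))   # True
print(is_correct_answer(
    "http://bookstore.example.com/search?q=some+title", correct_isbns))     # False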
4.2 Comparing bookstores
The book finding success rates were measured at several cutoffs (S@1, S@5, S@10,
S@100 and S@1000). Table 4.3 contains the results for this experiment. The following
observations may be made:
• Of the 35 bookstores evaluated, only 14 returned any correct answers within the
top 1000 results of any of the search engines.
• Only four bookstores contributed answers within the top ten results in any search engine: Amazon, Barnes and Noble, BookSite and Walmart.
• Amazon was the most searchable bookstore in the evaluation, achieving the high-
est success rates.
• Only Amazon had correct results returned by every search engine.
Bookstore               S@1 / S@5 / S@10 / S@100 / S@1000     S@1000 break. (AV:FA:GO:MS)     Host Res.
Amazon 0.124 / 0.325 / 0.402 / 0.492 / 0.584 104:83:162:132 3903
Barnes and Noble 0.028 / 0.096 / 0.140 / 0.225 / 0.316 0:87:170:3 3603
Walmart 0.010 / 0.030 / 0.045 / 0.070 / 0.075 2:0:0:60 277
BookSite 0.000 / 0.004 / 0.005 / 0.013 / 0.013 0:0:0:11 52
ecampus 0.000 / 0.000 / 0.000 / 0.005 / 0.012 0:7:0:3 290
AllDirect 0.000 / 0.000 / 0.000 / 0.002 / 0.005 0:4:0:0 52
NetstoreUSA 0.000 / 0.000 / 0.000 / 0.001 / 0.010 0:8:0:0 261
Sam Weller’s Books 0.000 / 0.000 / 0.000 / 0.001 / 0.006 0:5:0:0 22
Books-A-Million 0.000 / 0.000 / 0.000 / 0.000 / 0.008 0:4:0:3 775
1BookStreet 0.000 / 0.000 / 0.000 / 0.000 / 0.006 0:5:0:0 17
Wordsworth.com 0.000 / 0.000 / 0.000 / 0.000 / 0.004 1:0:1:1 92
TextbookX.com 0.000 / 0.000 / 0.000 / 0.000 / 0.002 0:2:0:0 22
CodysBooks.com 0.000 / 0.000 / 0.000 / 0.000 / 0.002 0:2:0:0 78
Arthurs Books 0.000 / 0.000 / 0.000 / 0.000 / 0.003 0:1:0:0 3
Powells Bookstore 0.000 / 0.000 / 0.000 / 0.000 / 0.000 0:0:0:0 1031
Table 4.3: Bookstore comparison. This table includes all bookstores which had at least one
success at 1000 (S@1000) in a search engine. Powells is included in the table for comparison due
to the high number of results retrieved by the search engines from Powells’ host name. The
“S@1000 break.” column shows the number of correct books retrieved from each bookstore
within the top 1000 search results for each search engine. The “Host Res.” column reports the
number of pages found for each bookstore’s host name by all search engines.
• Barnes and Noble performed well on Google (GO) and AllTheWeb (FA).
• Walmart performed well on MSN Search (MS).
• The only search engine which returned results for many of the smaller book-
stores was AllTheWeb (FA).
4.3 Comparing search engines
Search engine effectiveness was also compared: the results are presented in Table 4.4
and Table 4.5. From data in these tables the following observations were made:
• AltaVista’s (AV) performance was inferior to that of both Google (GO) and MSN
Search (MS) at all cutoffs. AltaVista demonstrated around half the precision of
MSN Search.
• AllTheWeb (FA) trailed well behind all other search engines, but provided a
large number of correct answers between the 100th and 1000th position (success
rate jumps from 0.18 to 0.52). The precision for AllTheWeb was low.
• Google (GO) trailed MSN Search at S@1, but exceeded MSN Search’s performance from S@10 onwards. Google returned more correct answers in its top 5, 10 and 100 results than MSN Search.
• MSN Search (MS) produced the strongest results at S@1 and S@5, but as the cutoff was extended it fell progressively further behind Google.
Search Success Rates
Engine @1 @5 @10 @100 @1000
AV 0.14 0.39 0.45 0.50 0.52
FA 0.00 0.02 0.05 0.18 0.52
GO 0.15 0.56 0.67 0.83 0.89
MS 0.36 0.57 0.65 0.72 0.73
Table 4.4: Search engine success rates. The best result at each cutoff is highlighted.
4.3.1 Search engine bookstore coverage
The search engine bookstore coverage was measured by sourcing counts from WWW
search engines for the number of URLs indexed per bookstore (site document cover-
age), and the number of hyperlinks that were directed at each bookstore (site hyperlink
coverage).
Search Precision
Engine @1 @5 @10 @100
AV 0.14 0.08 0.05 0.01
FA 0.00 0.00 0.01 0.00
GO 0.15 0.20 0.15 0.03
MS 0.36 0.13 0.08 0.01
Table 4.5: Search engine precision. Note that precision at 1 is equivalent to the success rate
at 1. The precision at cutoffs greater than 100 is less than 1/100 in all cases. The best result for
each measure is highlighted.
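For reference, the success-rate and precision figures reported in Tables 4.3–4.5 can be computed from judged ranked lists as in the following sketch; this is a generic reconstruction of the standard measures rather than the evaluation code actually used for the case study.

def success_at(ranked_relevance, k):
    """Success for one query: 1 if any of the top-k results is correct, else 0."""
    return 1.0 if any(ranked_relevance[:k]) else 0.0

def precision_at(ranked_relevance, k):
    """Precision for one query: fraction of the top-k result slots holding a correct answer."""
    return sum(1 for rel in ranked_relevance[:k] if rel) / k

def mean_over_queries(metric, judged_runs, k):
    """Average a per-query measure over all queries (here, book titles)."""
    return sum(metric(run, k) for run in judged_runs) / len(judged_runs)

# Toy example: three queries, each a list of per-rank correctness flags.
runs = [[False, True, False], [False] * 10, [True, True, False]]
print(mean_over_queries(success_at, runs, 5))    # S@5
print(mean_over_queries(precision_at, runs, 5))  # P@5

With these definitions, precision at 1 coincides with success at 1, as noted in the caption of Table 4.5.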
Site document coverage
The transactional pages for some bookstores may not have been returned because they
have never been crawled by a search engine. Table 4.6 lists the number of pages from
each bookstore reported to be contained within each search engine’s index.
From these results it was observed that:
• Amazon had a consistently large search engine coverage – around three million
documents on three-out-of-four search engines. AllTheWeb covered an order of magnitude fewer Amazon-based documents than any of the other search engines. However, AllTheWeb crawled more pages for Amazon than it did for any
other bookstore. This may indicate that AllTheWeb incorrectly eliminated many
of Amazon’s pages as duplicates, applied more stringent limits on crawling dy-
namic content, or its coverage was estimated in a different way compared to the
other search engines.
• The coverage of Barnes and Noble varied widely across engines. While the MSN
Search coverage of Barnes and Noble was small, it appeared to contain many
product pages, with three correct answers retrieved. Only around 500 Barnes and Noble pages were covered by AltaVista, while over a million pages were covered by Google.
• A large number of Walmart pages were covered by MSN Search, whereas
AllTheWeb and Google covered a relatively small number of pages. This may
indicate that MSN handled dynamic pages in a different manner to the other
search engines, or that there was some special relationship between MSN Search
and Walmart.
• AllTheWeb did not have large coverage of any one bookstore (their maximum
crawl of a bookstore was around 360 000 pages). Instead they tended to have a larger breadth of coverage, with larger crawls of lesser-known bookstores. As many
bookstores served content through dynamic pages, this may further indicate
that AllTheWeb applied more stringent limits on dynamic content.
Bookstore AV FA GO MS TOTAL
amazon.com 3 675 723 358 376 3 620 000 2 838 819 10 492 918
barnesandnoble.com 521 192 792 1 240 000 2822 1 436 135
walmart.com 89 243 1076 10 500 916 162 1 016 981
netstoreusa.com 1171 315 002 93 000 42 052 451 225
powells.com 39 397 111 977 65 900 6204 223 478
textbookx.com 18 23 157 38 600 150 61 925
alldirect.com 24 26 278 7 27 26 336
ecampus.com 300 7763 2010 240 10 313
planetgold.com 18 8361 774 18 9171
booksamillion.com 22 5860 54 865 6801
cornwalldiscountbooks.com 1 5423 2 1 5427
wordsworth.com 735 228 2290 1271 4524
booksite.com 93 169 1190 290 1742
codysbooks.com 74 1308 238 57 1677
arthursbooks.com 7 1221 8 384 1620
samwellers.com 7 278 5 8 298
tbib.com 1 2701 3 0 2705
stmarksbookshop.com 1 2414 4 0 2419
1bookstreet.com 5 1009 779 172 1965
a1books.com 15 1311 29 173 1528
biggerbooks.com 0 1 395 1 397
americanabooks.com 3 309 15 14 341
alltextbooks4less.com 3 208 22 31 264
dartbook.com 31 74 17 1 123
mediaplay.com 19 0 32 7 58
paulinasprings.com 1 40 1 0 42
rjjulia.com 7 27 3 4 41
concordbookshop.com 5 2 3 0 10
goodenoughbooks.com 1 4 2 0 7
bookworksaptos.com 1 3 2 0 6
alotofbooks.com 1 1 2 1 5
bookshopmorris.com 1 2 2 0 5
changinghands.com 1 2 2 0 5
hearthfirebooks.com 1 2 2 0 5
Total 3 809 257 1 087 000 5 081 161 3 810 094 13 787 512
Table 4.6: Search engine document coverage. Note that the totals in the right-hand column may count the same URL more than once (this occurs when the same URL is indexed by more than one search engine). These values were collected using methods outlined in Appendix C. The column labelled “AV” contains data from AltaVista, “FA” contains data from AllTheWeb, “GO” contains data from Google, and “MS” contains data from MSN Search.
• AltaVista had large coverage only of Amazon, Walmart and Powells. It seems unlikely that book results could be found in its small (sub-1000-page) crawls of other bookstores. The searchability of all three bookstores was improved by
having simple URL structures.
• Powells had large coverage (with three-out-of-four search engines indexing close to 40 000 pages or more), but did not have any product pages returned in the top 1000
results for these search engines. This may indicate that hyperlink evidence di-
rected at the Powells bookstore was either not present, not directed at book buy-
ing pages, or was resolved incorrectly by the WWW search engines.
Hyperlink graph completeness
Only two of the evaluated search engines supported domain name hyperlink counts:
AltaVista and AllTheWeb. Domain name hyperlink counts retrieve the number of
links to an entire domain name rather than just to a single page. This information was
used to determine the hyperlink coverage of an entire bookstore. Table 4.7 contains
the results for this study. Some observations are that:
• AllTheWeb discovered a large number of links to Amazon, but did not crawl
documents from Amazon as comprehensively as other search engines.
• Powells bookstore had a large number of incoming links, but still performed
poorly. This further indicates that incoming links may not have been success-
fully resolved by the WWW search engines (due to anomalies in the search en-
gine representations of Powells’ document set or link graph), or that links were
not directed to transactional pages.
• AllTheWeb discovered more links to diverse hosts than AltaVista. This could be attributed to the fact that AllTheWeb performed deeper crawls of lesser-known sites and so encountered a larger number of internal links.
4.4 Findings
This section discusses the bookstore findings. It includes an analysis of the URL and
hyperlink coverage, of bookstore ranking performance, and finally of the relative re-
trieval effectiveness of the evaluated search engines.
4.4.1 Bookstore searchability: coverage
The results in Tables 4.6 and 4.7 reveal that the top three bookstores by URL coverage
were also the top three bookstores by success rate. The bookstore coverage appears to
have had a significant impact on how often books from the bookstore were retrieved
early in the document ranking. Amazon achieved high coverage in the indexes of all
evaluated search engines.
Bookstore AV FA TOTAL
amazon.com 12 408 441 25 955 858 38 364 299
powells.com 5 197 526 316 989 5 514 515
textbookx.com 3 456 068 28 453 3 484 521
barnesandnoble.com 234 137 784 088 1 018 225
walmart.com 14 783 267 008 281 791
booksite.com 4927 113 729 118 656
booksamillion.com 34 137 79 351 113 488
ecampus.com 2170 102 047 104 217
netstoreusa.com 10 548 91 867 102 415
1bookstreet.com 25 229 50 064 75 293
wordsworth.com 2750 21 694 24 444
a1books.com 4545 16 270 20 815
codysbooks.com 1062 9512 10 574
alldirect.com 614 6508 7122
arthursbooks.com 109 1700 1809
samwellers.com 106 208 314
americanabooks.com 114 2163 2277
alltextbooks4less.com 52 945 997
rjjulia.com 174 337 511
concordbookshop.com 185 118 303
planetgold.com 31 200 231
dartbook.com 95 117 212
changinghands.com 68 96 164
bookshopmorris.com 46 99 145
cornwalldiscountbooks.com 12 123 135
alotofbooks.com 11 108 119
stmarksbookshop.com 42 62 104
hearthfirebooks.com 8 73 81
tbib.com 29 40 69
bookworksaptos.com 15 53 68
paulinasprings.com 31 37 68
biggerbooks.com 2 63 65
goodenoughbooks.com 12 15 27
Total 21 463 332 28 160 545 49 623 877
Table 4.7: Search engine link coverage. The column labelled “AV” contains data from Al-
taVista and “FA” contains data from AllTheWeb. Note that because of overlap between AV
and FA the totals in the right-hand column may contain several links to the same URL.
It is important for a bookstore to have deep crawls indexed in as many search
engines as possible. Three potential reasons why bookstores included in this study
were not crawled deeply may be offered:
1. Despite many incoming links to the bookstore domain, few pages were crawled.
This may have been because the crawler was trapped when building the book-
stores’ link graph and only crawled a few books many times over. Alternatively,
book pages could have been identified as near-duplicates and eliminated from
the document index.
2. The bookstores did not receive sufficient links directly to product pages from
external sites (i.e. most links were directed to the bookstore home page).
3. The search engines appeared to treat the bookstores as containing uninteresting dynamically generated content. WWW search engines may decline to index apparently dynamically generated content due to concerns about polluting their representation of the web graph (see Section 3.1.2). Some dynamic content was
observed to be in the form of parameterised URLs (with question marks) gener-
ated by a single script. Given the poor performance of bookstores which gener-
ated content using a single script, it appears that WWW search engine crawlers
might have either simply ignored some of these documents (according to some
URLs-to-crawl rule, for example, stripping all URL parameters), or have been
unable to retrieve any meaningful information from them.
Many bookstores with a high link count were unable to achieve wide URL coverage. This is most apparent for Powells, which has a large number of incoming links but fewer indexed pages than other well-linked bookstores. Further investigation revealed that Powells encodes book ISBNs as query parameters to a .cgi script, in contrast to Amazon, where ISBNs are encoded in the URL path rather than as parameters.
The site which managed to best convert incoming links to crawled pages was Net-
storeUSA. In contrast to all other evaluated bookstores, NetstoreUSA had more pages
indexed by the search engines than incoming links. NetstoreUSA improved its search-
ability by using static-looking documents organised in simple hierarchies of shtml
pages.
To encourage a deep crawl that will cover all site content it is necessary for web
authors to ensure they have both internal and external links directly to their hierar-
chically deep, but important, content. This increases the chance that a WWW search
engine will encounter a link to the page, and adds valuable hyperlink evidence. To
encourage user linking it is important to use meaningful and consistent URL strings.
While one can envisage a web developer linking to a URL which has the form foo.com/ISBN/, it may be less likely that they link directly to foo.com/prod/prod.asp?prod=9283&source=09834. There is also a higher likelihood that such a link
would be discarded during the crawl or the creation of the web graph. Deep linking
may be encouraged further through the use of incentive or partnership programs. If
such a program is in place, it is important to ensure partners are able to point directly
to products and that all partners point to the same consistent URL for each product
(e.g. Amazon provides an incentive program so that web authors link directly to their
product pages).
To ensure database-generated content is not rejected by WWW search engines, it is important that the content is provided through individual, static-looking URLs.
Duplicate pages should also be removed from the site. However, if duplicate pages
are to be retained, it is important that web authors know what URL they should link
to, and that crawls of duplicate pages be minimised (potentially through the use of
page crawl exclusion measures in “robots.txt” files [133]).
4.4.2 Bookstore searchability: matching/ranking performance
Transactional documents for the requested books were most frequently matched (and
retrieved) from the Amazon and Barnes and Noble bookstores. Many of the documents
retrieved by WWW search engines from other bookstores were observed to be browse
and search pages, and not transactional documents. The Powells bookstore is a case in point. Despite having many links, reasonable coverage in search engine indexes, and results matched frequently, Powells’ transactional pages were never returned.
This may indicate poor page full-text content, poor site organisation and/or a lack of
encouragement to link directly to products (as their referral program appears to be
processed through their front page).
These identified problems could also be alleviated somewhat by employing robot
exclusion directives to inform crawlers to ignore search and browse pages, and index
only product pages (through the use of “robots.txt” files [133], as outlined above).
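As a hedged illustration of such exclusion directives, the following sketch uses Python’s standard urllib.robotparser to check URLs against a hypothetical robots.txt that blocks search and browse pages while leaving product pages crawlable; the path names are invented for the example and are not taken from any of the evaluated bookstores.

from urllib import robotparser

# Hypothetical robots.txt for a bookstore wishing to expose only product pages
# to crawlers (the paths are illustrative, not from any real site).
robots_txt = """User-agent: *
Disallow: /search
Disallow: /browse
Allow: /""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(robots_txt)

print(rp.can_fetch("*", "http://bookstore.example.com/product/0123456789"))   # True
print(rp.can_fetch("*", "http://bookstore.example.com/search?q=bestsellers"))  # False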
4.4.3 Search engine retrieval effectiveness
The best book finding search engines were Google and MSN Search and the most
successful bookstore was Amazon. MSN Search provided the most correct answers
at the first rank. However, Google provided more correct answers in the top five
positions, potentially giving users more book buying options.
The empirical findings indicate that, in order to maximise the book-finding ability of a WWW search engine, deep crawls of dynamic content need to be performed. All of the examined bookstores bury product pages deep within their URL directory tree
(generally as leaf nodes). While AllTheWeb appeared to index a much larger selection of bookstores, it appeared not to crawl as much of the Amazon bookstore as the other search engines did. Given that the majority of correct hits for all search engines came
from the Amazon bookstore, this could be one of the main reasons for the observed
low effectiveness of AllTheWeb on this task.
Some WWW search engines appear to favour certain bookstores over others. For
example Google and AllTheWeb have large indexes of Barnes and Noble while the oth-
ers do not. A further example of this is the good performance of the Walmart book-
store in MSN Search. The results suggest that MSN Search may have access to extra
information for Walmart that is not available to the other search engines.
For WWW search engines to provide good coverage of popular bookstores it is
necessary for them to crawl dynamic URLs, even when there are many pages gen-
erated from a single script with different parameters. On the Walmart and Powells
bookstores, all product pages are created from a single script, with the book’s ISBN
as a parameter. Also, as many slightly different URLs frequently contain information
about exactly the same ISBN it may be necessary to perform advanced equivalence
(duplicate) URL or content detection. This is the case with duplicate product pages
on the Amazon bookstore, as the same document is retrieved no matter what referral
identifier is included in the URL. Without effective duplicate detection, and consolidation of duplicate documents in the hyperlink graph, the effectiveness of link evidence will be decreased.
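A minimal sketch of the kind of URL equivalence detection described above is given below. It canonicalises URLs by dropping parameters that look like referral or tracking identifiers, so that variants of the same product page collapse to a single node in the link graph. The parameter names treated as referral identifiers here are assumptions made for illustration, not an inventory of what any particular search engine or bookstore actually uses.

from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Parameter names assumed (for illustration only) to carry referral/tracking
# information rather than identifying distinct content.
REFERRAL_PARAMS = {"source", "ref", "referrer", "affiliate", "partner"}

def canonicalise(url):
    """Collapse URL variants that differ only in referral parameters."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in REFERRAL_PARAMS]
    kept.sort()  # parameter order should not create distinct URLs
    return urlunparse((parts.scheme, parts.netloc.lower(), parts.path,
                       parts.params, urlencode(kept), ""))

# Two hypothetical variants of the same product page map to one canonical URL.
print(canonicalise("http://Bookstore.example.com/prod.asp?prod=9283&source=09834"))
print(canonicalise("http://bookstore.example.com/prod.asp?source=123&prod=9283"))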
4.5 Discussion
The coverage results from leading WWW search engines indicate that all of the eval-
uated engines dealt with web graph anomalies in a different manner (some more ef-
fectively than others). The most effective search engines retrieved book buying pages
from dynamic sites for which they had crawled between 0.9 and 3.6 million docu-
ments. This demonstrates the importance of using robust methods when sourcing and
building the web graph (such as those outlined in Chapter 3) for effective retrieval.
From a web site author’s point of view the design of a web site directly affects how
well search engines can crawl, match and rank its pages. For this reason, searchabil-
ity should be an important concern in site design. Observations from this case study
indicate that there are large discrepancies in the relative searchability of bookselling
web sites. Many of the bookstore sites incorporated dynamic URLs that may be difficult for some WWW search engines to crawl, and unattractive targets for web
authors to direct hyperlinks to. Many bookstore sites were also marred by duplicate
content and confusing link graphs. Of the 35 evaluated bookstores, 24 did not appear in the top 1000 results of any of the evaluated search engines for any of the evaluated books.
These results illustrate the importance of a combined approach to improving trans-
actional search. To improve effectiveness WWW search engines should endeavour to
discover more product pages, by performing deep crawls of provider sites and of
dynamic pages (especially those that are linked to directly). It is equally important
for bookstores to build a suitable site structure that allows search engines to perform
thorough crawls. To improve searchability, bookstores should use short non-changing
URLs (like NetstoreUSA) and encourage deep linking directly to their product pages
(like Amazon). It is submitted that these findings are likely to hold for other WWW
search tasks.
The amount of link evidence available for a bookstore, as observed in the link
coverage study, proved to be particularly important for achieving high rankings in
some search engines (such as Google [93]). The apparent heavy use of web evidence in
the document ranking algorithms of WWW search engines provides further support
for the investigations of web evidence within this thesis.
Chapter 5
Analysis of hyperlink recommendation evidence
It is commonly stated that hyperlink recommendation measures help modern WWW search engines rank “important, high quality” pages ahead of relevant but less valuable pages, and reject “spam” [97]. However, what exactly constitutes an “important” or “high quality” page remains unclear [8]. Google has previously been shown
to perform well on a home page finding task [116] and the PageRank hyperlink rec-
ommendation algorithm may be a factor in this success.
This chapter presents an analysis of the potential for hyperlink recommendation
evidence to improve retrieval effectiveness in navigational search tasks, and to favour
documents that possess some “real-world quality” or “importance”. The analysis con-
siders PageRank and in-degree scores extracted from leading WWW search engines.
These scores are tested for bias and their usefulness is compared over corpora of home
page, non-home page and spam page documents. The hyperlink recommendation
scores are tested to determine the weight assigned to the home pages of companies
that exhibit “real-world” measures of quality. The measures of “real-world” qual-
ity investigated include whether favoured companies are highly profitable or well-
known. Less beneficial biases are also tested to examine whether hyperlink recom-
mendation scores favour companies based on their industry or location.
5.1 Method
An analysis of score biases requires a set of candidate documents, the hyperlink rec-
ommendation scores for those documents, and, in order to test for bias, attributes by
which the candidate documents may be distinguished. In this experiment three sets
of candidate pages are identified from data relating to publicly listed companies and
links to known spam content. These form useful sets for analysis for reasons outlined
in the following sections. Hyperlink recommendation scores are sourced for each of
these pages using WWW search engines and tools. The attributes used to test rec-
ommendation score bias are gathered from listed company information and publicly
available company attributes. The data used in this experiment were extracted during
September 2003.
The following subsections detail methods used to amass the data for this exper-
iment. This includes a description of how candidate pages were selected, how the
salient company properties (used when evaluating bias) were sourced, and the meth-
ods used to extract hyperlink recommendation scores for each document.
5.1.1 Sourcing candidate pages
The home page set includes the home pages of public companies listed on the three
largest US stock exchanges: the New York Stock Exchange (NYSE), NASDAQ and the
American Stock Exchange (AMEX) (a total of 8329 companies were retrieved). The
home pages of publicly listed companies form a useful corpus as there is publicly
available information relating to company popularity, revenue, and other properties,
such as which industry the company belongs to. Furthermore, publicly listed compa-
nies are plausible targets for home page finding queries.
Company information was obtained from the stock exchange web sites, and in-
cluded the official company name, symbol and description. Then, using the company
information service at http://quote.fool.com/, 5370 unique company home
page URLs were identified. These URLs were almost always the root page of a host
(e.g. http://hostname.com/) without any file path (only fourteen URLs had some
path). These are considered to be the company home pages, even though in some
cases the root page is a Flash animation or another form of redirect. The company
information service also provided an industry for each stock, e.g. “Real Estate”.
For comparison with these home pages, two further sets of pages were collected:
a non-home page set and a spam page set. Non-home pages were collected by sorting
company home pages by PageRank (extracted using methods outlined in the next
section) and selecting twenty home pages at a uniform interval. From these home
pages crawls of up to 100 pages were commenced (restricted to the company domain).
The overall PageRank distribution for the pages in the twenty crawls is shown in
Figure 5.1.
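The sketch below reconstructs the sampling and bounded crawling procedure just described: twenty home pages are chosen at uniform intervals from the PageRank-sorted list, and a breadth-first crawl of up to 100 pages restricted to each company’s domain is started from each. The fetch_outlinks helper is a placeholder for whatever page-fetching and link-extraction machinery is available; it is not part of the original experimental setup.

from collections import deque
from urllib.parse import urlparse

def uniform_sample(home_pages_by_pagerank, n=20):
    """Pick n home pages at a uniform interval from a PageRank-sorted list."""
    step = max(1, len(home_pages_by_pagerank) // n)
    return home_pages_by_pagerank[::step][:n]

def bounded_crawl(start_url, fetch_outlinks, limit=100):
    """Breadth-first crawl of up to `limit` pages, restricted to the start domain.

    fetch_outlinks(url) is assumed to return the URLs linked from `url`.
    """
    domain = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    crawled = []
    while queue and len(crawled) < limit:
        url = queue.popleft()
        crawled.append(url)
        for link in fetch_outlinks(url):
            if link not in seen and urlparse(link).netloc == domain:
                seen.add(link)
                queue.append(link)
    return crawled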
The spam page set was collected by sourcing 399 links pointing to a search engine
optimiser company (using Google’s link: operator). The spam pages were largely
content-free, having been created to direct traffic and PageRank towards the search
engine optimiser’s customers. After sourcing in-degrees, all pages with an in-degree
of zero were eliminated leaving 280 pages for consideration.
5.1.2 Company attributes
The set of company home pages was grouped into subsets according to their member-
ships and attributes, such as the Fortune 500 list [82] and the Wired 40 list of compa-
nies judged to be best prepared for the new economy [147]. The goal was to observe
how well PageRank and in-degree could predict inclusion in such lists.
Salient company properties were collected from the following web resources:
• The company information service at http://quote.fool.com provided com-
pany industry and location information.
[Figure 5.1 plot: number of pages crawled versus PageRank for the non-home page document set.]
Figure 5.1: Combined PageRank distribution for the non-home page document set. The non-home page document set was constructed by crawling up to 100 pages from a selection of company webs. The observed PageRank distribution is not a power-law distribution as might be expected in PageRank distributions (see Section 2.4). These pages are more representative of the general WWW page population than the home-page-only set. The zero PageRanks are most likely caused by pages not present in Google crawls, by lost redirects, or by small PageRanks being rounded to 0.
• The Fortune magazine provided the list of Fortune 500 largest companies (by rev-
enue) and Fortune Most Admired companies. Fortune 500 companies are those
with the highest revenue, based on publicly available data, listed by Fortune
Magazine (http://www.fortune.com/). The Fortune Most Admired com-
pany list is generated through peer review by Fortune Magazine.
• The Business Week magazine Top 100 Global Brands list was sourced from http://www.businessweek.com/magazine/content/03_31/b3844020_mz046.htm. This lists the most valuable brands from around the world, based on publicly available marketing and financial data.
• The Wired 40 list of technology-ready companies was taken from Wired Magazine and is available online at http://www.wired.com/wired/archive/11.07/40main.html. The list contains the companies that Wired Magazine believes are best prepared for the new economy.
In all cases the 2003 editions of the lists were used.
5.1.3 Extracting hyperlink recommendation scores
For each URL, PageRanks and in-degrees were extracted from search engines
Google [93] and AllTheWeb [80].
Unfortunately there is no way for researchers external to Google to access PageR-
anks used in Google document ranking. The only publicly available PageRank values
are provided in the Google toolbar [98] and through the Google directory [95]. When
a page is visited, the Toolbar lists its PageRank on a scale of 0 to 10, indicating “the im-
portance Google assigns to a page”.1 When a directory category is viewed, the pages
are listed in descending PageRank order with a PageRank indicator next to each page,
to “tell you at a glance whether other people on the web consider a page to be a high-
quality site worth checking out”.2 With PageRank provided directly in these ways,
it can be analysed as a direct indicator of quality, without needing to know whether
or how it is used in Google ranking. The PageRank from the Google Toolbar is inter-
esting as toolbar users may use it directly as a measure of document quality, and the
quality of this measure is unknown. Further, as it is sometimes claimed that PageRank behaves differently on a large-scale web graph, it may allow for some insight
into properties of WWW-based PageRank (to accompany results presented in Chap-
ter 7).
PageRanks were extracted from the Microsoft Internet Explorer Google Toolbar [98]
by visiting pages and noting the interaction between the Toolbar and Google servers.
To ensure consistency a single Google network (IP) address was used to gather Toolbar
data.3 When the requested URL resulted in a redirect, the PageRank was retrieved for
the final destination page (types of redirects are discussed in Section 3.1.3). During the
extraction process it was noted that PageRank values had been heavily transformed. Actual PageRanks are power-law distributed, so low PageRank values should be represented far more frequently than higher values. By contrast, the Toolbar reports
values in the range of 0 to 10, with all values frequently reported (see Figure 5.1). It
is likely that one reason for this transformation is to provide a more meaningful mea-
sure of page quality to toolbar users. Without such a transformation most documents
would achieve a Toolbar PageRank value of 0.
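The exact Toolbar transformation is not public. A common conjecture, consistent with the power-law argument above, is that raw scores are mapped onto the 0–10 scale roughly logarithmically; the sketch below illustrates that idea only, and the scaling used is an arbitrary assumption rather than Google’s actual mapping.

import math

def toolbar_style_bucket(raw_score, max_raw):
    """Map a raw power-law-distributed score onto a 0-10 scale (illustrative only).

    A logarithmic bucketing: each additional point corresponds to a roughly
    constant multiplicative increase in the underlying score.
    """
    if raw_score <= 1:
        return 0
    scaled = 10 * math.log(raw_score) / math.log(max_raw)
    return max(0, min(10, int(round(scaled))))

# Toy illustration: raw scores spanning several orders of magnitude.
max_raw = 10 ** 7
for raw in (1, 10, 10 ** 3, 10 ** 5, 10 ** 7):
    print(raw, toolbar_style_bucket(raw, max_raw))

Without some such compression, almost all pages would indeed report the lowest Toolbar values, as the text above notes.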
Several problems were faced when obtaining in-degree values. These could only
be reliably extracted for site home pages. Problems that have been identified in meth-
ods used by WWW search engines to estimate linkage include:
1. counting pages which simply mention a URL rather than linking to it,
2. not anchoring the link match, so that the count for http://www.apple.com
includes pages with http://www.apple.com.au and http://www.apple.
com/quicktime/, and
3. under reporting the in-degree, for example by systematically ignoring links from
pages with PageRanks less than four.4
Three methods for accessing in-degree estimates for a URL were evaluated (estimates are summarised in Table 5.1):
1 From: http://toolbar.google.com/button_help.html
2 From: http://www.google.com/dirhelp.html.
3 The Google Toolbar sources PageRank scores from one of several servers. During experiments it was noted that the PageRank scores for the same page could differ according to which server was queried. This effect is believed to be caused by out-of-date indexes being used on some servers.
4 This is believed to be the case in Google’s link counts, see http://www.webmasterworld.com/forum80/254.htm
         Extracted from Google
         link: in-degree   contains ‘in-degree’   PageRank   AllTheWeb in-degree
Min      0                 0                      0          0
Max      857 000           1 250 000              10         14 324 793
Mean     958               1910                   5.3        17 889
Median   82                112                    5          319
Apple    87 500            237 000                10         2 985 141
Table 5.1: Values extracted from Google [93] and AllTheWeb [80] for 5370 company home
pages in September 2003. Listed are range, mean, median and an example value (for Apple
Computer http://www.apple.com/).
• The first method used the Google query link:URL, which reportedly has prob-
lem 1.
• The second method used Google to find pages which contained the URL. This
solution was suggested by the Google Team,5 but it exhibits problems 1 and 2,
and also seems to only return pages which contain the URL in visible text.
• The third method used the AllTheWeb query link:URL -site:URL to re-
trieve in-degree values. The operator -site:URL was included because the
method has problem 2, and adding a -site:URL excludes all intra-site links,
and so eliminates many of the non-home page links and a few home page links.
All three types of in-degree estimates were found to be correlated with each other
(Pearson r > 0.7).
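As a worked illustration of how such correlations can be computed, the sketch below calculates Pearson’s r between two score lists, applying a log transform to the in-degree values as is done for the comparisons reported later in this chapter; adding 1 before taking the log (to accommodate zero in-degrees) is an assumption made here for the illustration, and the example values are invented.

import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: Toolbar-style PageRanks against log-transformed in-degrees.
pageranks = [3, 5, 6, 7, 9, 10]
in_degrees = [40, 300, 1200, 9000, 150000, 3000000]
log_in_degrees = [math.log(d + 1) for d in in_degrees]  # +1 guards against zero in-degree
print(pearson_r(pageranks, log_in_degrees))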
AllTheWeb in-degrees were chosen for comparison with Google PageRanks to
eliminate any potential search engine preference, and to ensure that in-degree sourc-
ing issue 3 did not impact correlations between in-degree and PageRank values. Both
search engines had independent crawls of a similar size (AllTheWeb crawled 3.1 billion documents, compared to Google’s 3.3 billion).6
Table 5.1 displays some pertinent properties of the extracted values, namely the
minimum, maximum, mean and median values of all extracted hyperlink recommen-
dation evidence.
5.2 Hyperlink recommendation bias
This section presents the results of an analysis of potential bias in hyperlink recommendation scores. Biases considered include a preference for home pages, large famous companies, a particular country of origin, or the industry in which the company operates.
5 As discussed in: http://slashdot.org/comments.pl?sid=75934&cid=6779776
6 The collection size was estimated to be 3.3 billion on http://www.google.com at September 2003; as of November 2004 it is estimated to be around 8 billion documents on http://www.google.com.
5.2.1 Home page preference
Figure 5.2 shows the PageRank distributions for eight of the twenty crawls (distri-
butions for the other twelve crawls are included in Appendix E). The distributions
reveal that in almost every case, the company home page has the highest PageRank.
In every case at least some pages received lower PageRank than the home page. This
is not surprising, as links from one server to another usually target the root page of
the target server. In fact targeting deeper pages has even led to lawsuits [192].
5.2.2 Hyperlink recommendation as a page quality recommendation
Having examined intra-site hyperlink recommendation effects, inter-site comparisons are now considered.
5.2.2.1 Large, famous company preference
The Fortune 500 (F500), Fortune Most Admired and Business Week Top 100 Global
Brands lists provide good examples of large, famous companies, relative to the general
population of companies. Figure 5.3 shows that companies from these lists tended to
have higher PageRanks than other companies. However, there are examples of non-F500 companies with PageRank 10, such as http://www.adobe.com. At the other end of the spectrum, the Zanett group (http://www.zanett.com) has an F500 rank of 363, but a PageRank of 3. This puts it in the bottom 6% of the 5370 companies,
based on Toolbar advice.
The home pages of Fortune 500 and Most Admired companies receive, on av-
erage, one extra PageRank point. Business Week Top Brand companies receive, on
average, two extra PageRank points. Similar findings were observed for in-degree.
These findings support Google’s claim that PageRank indicates importance and qual-
ity. In-degree was observed to be an equally good indicator of popularity on all three
counts.
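The “extra PageRank point” comparisons above amount to comparing mean Toolbar PageRank between list members and non-members. The sketch below shows this computation over a hypothetical scores table; the URLs and values are placeholders, not the thesis data.

def mean_pagerank_by_membership(pageranks, members):
    """Compare mean Toolbar PageRank of list members against non-members.

    pageranks: dict mapping home-page URL -> Toolbar PageRank (0-10).
    members: set of URLs belonging to the list (e.g. Fortune 500).
    """
    in_list = [pr for url, pr in pageranks.items() if url in members]
    out_list = [pr for url, pr in pageranks.items() if url not in members]
    return sum(in_list) / len(in_list), sum(out_list) / len(out_list)

# Placeholder example (values invented for illustration).
pageranks = {"a.com": 7, "b.com": 5, "c.com": 8, "d.com": 4, "e.com": 5}
f500 = {"a.com", "c.com"}
member_mean, other_mean = mean_pagerank_by_membership(pageranks, f500)
print(member_mean - other_mean)  # the "extra PageRank points" for list members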
5.2.2.2 Country and technology preference
Given the diversity of WWW search users, a preference in hyperlink recommendation
evidence for a particular company, industry or geographical location may be undesir-
able. This section investigates biases towards technically-oriented and US companies.
As shown in Figure 5.4, a bias towards US companies was not observed. However, it should be noted that all companies studied are listed on US stock exchanges. Further, as a smaller regional stock exchange was included (AMEX), there may be a bias towards non-US companies by virtue of comparing large international (globally listed) companies with smaller (regionally listed) US companies. Perhaps if local Australian Stock Exchange (ASX) companies were compared to similarly sized companies from the American Stock Exchange the results would differ. This is left for future work.
[Figure 5.2 plots: within-site Toolbar PageRank distributions (number of pages crawled versus PageRank) for www.microsoft.com (HP PR=10), www.apple.com (HP PR=10), www.qwest.com (HP PR=8), www.captaris.com (HP PR=7), www.credence.com (HP PR=6), www.cummins.com (HP PR=6), www.unitedauto.com (HP PR=5) and www.acmeunited.com (HP PR=4).]
Figure 5.2: Toolbar PageRank distributions within sites. The PageRank advice to users is usually that the home page is the most important or highest quality page, and other pages are less important or of lower quality. The PageRank of the home page of the site is shown as “HP PR=”. Distributions for the twelve other companies are provided in Appendix E.
[Figure 5.3 plots: proportion of group versus Toolbar PageRank (left) and versus in-degree (right) for companies in and not in the Fortune 500, Fortune Most Admired and Business Week Top 100 Global Brands lists.]
Figure 5.3: Bias in hyperlink recommendation evidence towards large, admired and popular companies. Companies in Fortune 500, Fortune Most Admired and Business Week Top 100 Global Brands lists tend to have higher PageRank. The effect is strongest for companies with well known brands. On the right similar effects are present in in-degree.
[Figure 5.4 plots: proportion of group versus Toolbar PageRank (left) and versus in-degree (right) for US versus non-US companies, technology versus non-technology companies, and Wired 40 versus other companies.]
Figure 5.4: Bias in hyperlink recommendation evidence towards technology-oriented or US companies. A strong PageRank bias towards US companies was not observed. However, companies in the “Internet Services”, “Software” and “Computers” industries had higher PageRank, as did those in the Wired 40. The strong bias towards technology companies is most useful if users are interested in technology, however given the increasing global reach of the WWW, and the increasing ease of access for non-technical users, such biases are helping a smaller and smaller proportion of the WWW user population. On the right are similar plots for in-degree.
                                          PageRank
Industry                      Companies   Range   Mean
Internet Services 29 3–9 6.66
Publishing 58 4–9 6.66
Airlines 25 3–8 6.48
Office Equipment 7 5–8 6.43
Entertainment 14 4–8 6.36
Software 306 3–10 6.35
Computers 86 4–10 6.29
Consumer Electronics 18 5–8 6.17
Automobile Manufacturers 7 4–8 6.14
Diversified Technology Services 46 4–8 6.02
...
Steel 34 3–7 4.68
Coal 6 4–5 4.67
Clothing & Fabrics 54 2–7 4.63
Oil Companies 132 1–8 4.60
Pipelines 25 3–6 4.56
Banks 433 0–8 4.55
Real Estate 174 2–7 4.55
Precious Metals 38 0–6 4.47
Marine Transport 12 3–6 4.42
Savings & Loans 146 0–6 4.08
Table 5.2: PageRanks by industry. The “Internet Services” and “Publishing” industries,
with 29 and 58 companies respectively, had the highest mean PageRank.
Two measures of technology bias were investigated: bias towards companies which produce technology, and bias towards heavy users of it. First, using company information from http://quote.fool.com/, companies in industries involving computer software, computer hardware, or the Internet were identified. The industry and
PageRank breakdown is shown in Table 5.2. Results in Figure 5.4 illustrate a bias to-
wards technology-oriented companies. These companies received an extra PageRank
point on average. The second test of technology bias used the 2003 Wired 40 list of
technology-ready companies. This demonstrated an even greater pro-technology bias
(Figure 5.4), with companies present in the Wired 40 receiving two extra PageRank
points on average.
A strong bias towards technology-oriented companies is useful if users are interested in technology; however, given the increasing global reach of the WWW and the increasing ease of access for non-technical users, such biases assist a smaller and smaller proportion of the WWW user population.
5.3 Correlation between hyperlink recommendation measures
This section presents results from an investigation of the extent to which the advice given by PageRank and by in-degree on the WWW is correlated. This investigation was conducted over the set of company home pages and the set of known spam pages.
5.3.1 For company home pages
The strong correlation between Toolbar-reported PageRank and log of in-degree for
company home pages is depicted in Figure 5.5. To better understand the differences
between in-degree and PageRank, an analysis of “winners” and “losers” from the
PageRank calculation was performed. Winners in the PageRank calculation have
high PageRanks even though they have low in-degree (the bottom right quadrant
in Figure 5.5), whilst losers have high in-degree but receive a low PageRank (top
left quadrant). Some anomalies were observed due to errors in in-degree calcula-
tions (e.g. www.safeway.com had PageRank of 6 with in-degree 0). However, these
cases were rare and uninteresting, as they appeared to be due to anomalies within the
search engines rather than the link graph. Nonetheless, after discounting cases where
AllTheWeb scores disagreed with the other two in-degree estimates, there were some
extreme cases where in-degree and PageRank were at odds. These cases are shown in
Table 5.3.
In some cases the discrepancies shown in Table 5.3 are very large. For example, ESS Technology (http://www.esstech.com) was demoted, achieving only PageRank 3 despite having an in-degree of 22 357. On the other hand, Akamai (http://www.akamai.com) achieved a PageRank of 9 with only 17 359 links. The promotions and demotions of sites relative to their in-degree ranking by PageRank do not appear to indicate any systematic additional preference for higher “real-world quality”.
[Figure 5.5 plot: AllTheWeb in-degree (log scale) versus Toolbar PageRank for company home pages, with the median in-degree marked at each PageRank value.]
Figure 5.5: Toolbar PageRank versus in-degree for company home pages. For 5370 company home pages, Toolbar PageRank and log of AllTheWeb [80] in-degree have a correlation of 0.767 (Pearson r). This high degree of correlation is achieved despite the relatively large spread of PageRank zero pages. Such pages may have been missed by the Google crawler or indexer, or might have been penalised by Google policy.
Stock URL Industry PageRank In-degree
AAPL http://www.apple.com Computers 10 2985141
YHOO http://www.yahoo.com Internet Services 9 5620063
AKAM http://www.akamai.com Internet Services 9 17359
EBAY http://www.ebay.com Consumer Services 8 737792
BDAL http://www.bdal.com Advanced Medical Supplies 8 199
GTW http://www.gateway.com Computers 7 170888
JAGI http://www.janushotels.com Lodging 7 64
FLWS http://www.1800flowers.com Retailers 6 38254
KB http://www.kookminbank.co.kr Banks 6 5
IO http://www.i-o.com Oil Drilling 5 235
FFFL http://www.fidelityfederal.com Savings & Loans 5 34
USNA http://www.usanahealthsciences.com Food Products 4 13353
RSC http://www.rextv.com Retailers 4 6
ESST http://www.esstech.com Semiconductors 3 22347
CAFE http://www.selectforce.net Restaurants 3 3
MCBF http://www.monarchcommunitybank.com Savings & Loans 2 6
WEFC http://www.wellsfinancialcorp.com Savings & Loans 2 1
PTNR http://investors.orange.co.il Wireless Communications 1 176
HMP http://www.horizonvascular.com Medical Supplies 1 5
VCLK http://www.valueclick.com Advertising 0 46659
Table 5.3: Extreme cases where PageRank and in-degree disagree. Even after eliminating cases where the AllTheWeb in-degree disagreed with the two Google in-degree estimates, large disparities in scores were observed. The promotions and demotions of sites relative to their in-degree ranking do not seem to indicate a more accurate assessment by PageRank.
Figure 5.6: Toolbar PageRank versus in-degree for links to a spam company. The 280 spam
pages achieve good PageRank without needing massive numbers of in-links. In some cases,
they achieve good PageRank with few links. Pages with PageRank 6 had a median in-degree
of 1168 for companies and 44 for spam pages.
appear to indicate any systematic additional preference for higher “real-world qual-
ity”.
5.3.2 For spam pages
One claimed benefit of PageRank over in-degree is that it is less susceptible to link
spam [103]. To test this claim the in-degree and PageRank scores for 280 spam pages
were compared. The relationship is plotted in Figure 5.6.
If PageRank were spam-resistant one might expect high in-degree spam pages to
have low PageRank. Such a case would be placed in the top left quadrant of the scat-
ter plots. However, for the 280 spam pages the effect is minimal, and in some cases
the opposite. For example, the median in-degree values for a PageRank score of 6
were 1168 for company home pages and 44 for spam pages. Spam pages tended to
achieve a PageRank of 6 seemingly with fewer incoming links than legitimate compa-
nies.
It is possible that any pages which did fall in the top left quadrant had already been
excluded from Google. However, this still shows that Google cannot rely entirely on
PageRank for eliminating spam. This is not surprising when considering the extreme
case: a legitimate page such as an academic’s home page might have an in-degree
of 10, while a search engine optimiser has massive resources to generate link spam
from thousands or millions of pages.
5.4 Discussion
5.4.1 Home page bias
The analysis showed that home pages tended to have higher PageRank. Within all
evaluated sites the home page usually had the highest or equal highest score. These
results lend support to the use of hyperlink recommendation evidence for home page
finding tasks. A detailed evaluation of potential gains from using hyperlink recommendation measures in home page finding is presented in Chapter 7.
While the home page bias may be useful in web ranking, in the context of the
Google Toolbar it could have a potentially confusing effect. For example, from a Tool-
bar user’s point of view it might seem mystifying that the “Apple Computer” home
page is rated 10, but its “PowerBook G4 15-inch” page is rated 7. Is the Toolbar im-
plying that the product is less important or of lower quality? Is it useful to give such
advice about deeper pages in general? In fact, it may be preferable to display a con-
stant indicator in the Toolbar when navigating within a web site. An investigation of
whether WWW users understand hyperlink-recommendation scores reported by the
Google Toolbar remains for future work.
5.4.2 Other systematic biases
The experimental results for company home pages show that Toolbar PageRank favours the following, by an average of two PageRank points:
1. Companies with famous brands (by Business Week Top Brands)
2. Companies considered to be prepared for the new economy (by Wired 40 listing)
Furthermore, PageRank scores are an average of one point higher for:
1. Companies with large revenue (by Fortune 500 membership)
2. Admired companies (by Fortune Most Admired membership)
3. Technology-oriented companies (by Industry type)
Similar patterns were observed for in-degree (with correspondingly larger gaps in in-degree values).
The bias towards high-revenue, admired and famous companies is consistent with the stated goal of hyperlink recommendation algorithms. The fact that hyperlink measures more strongly recommend sites operated by companies with highly recognised brands suggests that recognition is a key factor. This is intuitively
obvious, as a web site can only be linked to by authors who know of its existence.
Favouring high-recognition sites in search results or directory listings helps searchers
by bringing to bear their existing knowledge.
A list which gives prominence to relevant web sites already known to the searcher
may also inspire confidence in the value of the list. Consider the Google Directory
category for Australian health insurance.7 Viewed alphabetically the top two entries
are the relatively little known web sites “Ask Ted” and “Australian Health Manage-
ment Group”. Viewed in PageRank order, the top two entries are the arguably better
known (in Australia), “Medibank Private” and “MBF Health Insurance”. Even if the
user does not agree that these are the best results in some contexts, it may be better to
list results which the user will immediately recognise.
An important, but less beneficial side-effect of using hyperlink-recommendation
algorithms is the inherent bias towards technology-oriented companies. There are
a number of query terms whose common interpretation may be lost through heavy
use of hyperlink-recommendation algorithms.8 For example, using Google there are
a number of general queries where technology interpretations are ranked higher than
their non-technology interpretations: “opera”, “album”, “java”, “Jakarta”, “png”,
“putty”, “blackberry”, “orange” and “latex”. The strong technology bias may be an
artefact of the fact that people building web pages are from a largely technology-
oriented demographic. Many web authors are technically-oriented and may primar-
ily think of Jakarta as a Java programming project. On the other hand, many WWW
users may predominantly think of Jakarta as the capital of Indonesia! As the demo-
graphics of WWW users change, returning an obscure technology-related result will
become less desirable. This effect highlights the need for recommendation methods
which more closely match user expectations. Such methods, which might take into ac-
count individual differences, or simply estimate the demographics of typical WWW
users, remain for future work. Measures other than link recommendation may be bet-
ter indicators of quality. Such measures may include whether companies are listed on
the stock exchange, present in online directories and/or are highly recommended by
peer review.9
The precise effect of these biases on navigational search is difficult to quantify. It
may be that the observed bias will be more problematic for informational tasks rather
than navigational tasks.
5.4.3 PageRank or in-degree?
PageRank and in-degree measures performed equally well when identifying home
pages and membership to Fortune 500, Most Admired and Global Brand lists. In
cases where the measures did not agree, such as for those listed in Table 5.3, there is
no evidence to demonstrate that PageRank was superior to in-degree.
A high level of correlation was observed between Toolbar PageRank and log in-
degree scores, even for a collection of spam pages. Given the extra cost involved in
computing PageRank, this correlation raises serious questions about the benefit of using PageRank over in-degree. Subsequent chapters investigate this further, examining whether there is anything to be gained by using PageRank or in-degree in navigational search situations.
7. Available at: http://directory.google.com/Top/Regional/Oceania/Australia/Business and Economy/Financial Services/Insurance/Health/
8. It is likely that anchor-text is also biased in this way, although it may affect results less, as the bias would be narrower, i.e. only for terms that are commonly used in the anchor-text pointing to a particular page.
9. For example, by using scores from a service such as http://www.alexa.com.
Chapter 6
Combining query-independent web
evidence with query-dependent
evidence
Query-independent measures, such as PageRank and in-degree, provide an overall
ranking of corpus documents. Such measures need to be combined with some form
of query-dependent evidence for query processing, otherwise the same list of doc-
uments would be retrieved for every query. There are many ways in which query-
independent and query-dependent evidence can be combined, and few combination
methods have been evaluated explicitly for this purpose (see Section 2.5). This chap-
ter presents an analysis of three methods for combining query-independent evidence,
in the form of WWW PageRanks, with query-dependent baselines.
6.1 Method
This chapter examines a home page finding task where, given the name of a public
company, the ranking algorithm has to retrieve that company’s home page from a
corpus containing the home pages of publicly listed US companies.
The query and document set used in this experiment were sourced from company
data used throughout experiments in the previous chapter. The document corpus
consisted of the downloaded full-text content of each company’s home page, and
the anchor-text of links directed to those home pages. The query set consisted of
the official names of all companies. The query and document set were used to build
three query-dependent baselines; a full-text-only baseline, an aggregate anchor-text-
only baseline, and a baseline using both forms of evidence. The PageRank scores for
these pages were extracted from Google. Three methods for combining PageRank
and query-dependent evidence were examined: the first used PageRank as a mini-
mum score threshold, and the second and third methods used PageRank to re-rank
the query-dependent baseline rankings.
The following sections outline the query and document set, the scoring methods
used to generate the query-dependent baselines, how hyperlink recommendation ev-
idence was gathered, and methods for combining query-dependent baselines with
query-independent web evidence.
6.1.1 Query and document set
The document corpus consisted of the home pages of the publicly listed companies used in experiments in Chapter 5: 5370 home page documents – one for each company listed on a prominent US stock exchange (such as NYSE or NASDAQ) for which a home page URL was found (see Section 5.1.1).
As little useful anchor-text information was contained in the set of downloaded
documents (because companies rarely link to their competitors' home pages), the
anchor-text evidence was gathered from the Google WWW search engine [93]. This
WWW-based anchor-text evidence was sourced for a 1000 page sample selected at
random from the set of company home pages. For each of these pages 100 back-links1
were retrieved using Google’s “link:” operator (as described in Appendix C). Each
back-link identified by Google was parsed and anchor-text snippets whose target was
the company home page were added to the aggregate anchor-text for that page.
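A hedged sketch of this aggregation step follows, assuming the HTML of each back-link page has already been downloaded; the class and function names are illustrative and URL canonicalisation is greatly simplified compared to what a real harvester would need.

    from html.parser import HTMLParser


    class AnchorTextCollector(HTMLParser):
        """Collect the anchor-text of links whose href points at `target`.

        A simplified sketch: real anchor-text harvesting would also need proper
        URL canonicalisation (scheme, host aliases, relative links and so on).
        """

        def __init__(self, target):
            super().__init__()
            self.target = target.rstrip("/").lower()
            self.snippets = []
            self._in_target_link = False
            self._buffer = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href") or ""
                if href.rstrip("/").lower() == self.target:
                    self._in_target_link = True
                    self._buffer = []

        def handle_data(self, data):
            if self._in_target_link:
                self._buffer.append(data)

        def handle_endtag(self, tag):
            if tag == "a" and self._in_target_link:
                self.snippets.append(" ".join("".join(self._buffer).split()))
                self._in_target_link = False


    def aggregate_anchor_text(backlink_pages, home_page_url):
        """Concatenate anchor-text snippets from a set of back-link pages into one
        aggregate anchor-text document for `home_page_url`."""
        collector = AnchorTextCollector(home_page_url)
        for html in backlink_pages:
            collector.feed(html)
        return " ".join(collector.snippets)


    # Toy usage with two hypothetical back-link pages.
    backlinks = [
        '<p>See <a href="http://www.microsoft.com/">Microsoft Corporation</a>.</p>',
        '<p><a href="http://www.microsoft.com">Microsoft home page</a></p>',
    ]
    print(aggregate_anchor_text(backlinks, "http://www.microsoft.com"))
    # -> "Microsoft Corporation Microsoft home page"
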
The query set consisted of the official names for all 5370 companies, and the correct
results were the named company’s home page. For example, for the query
“MICROSOFT CORP” the correct answer was the document downloaded from http:
//www.microsoft.com.
The retrieval effectiveness for both the anchor-text and full-text baselines is likely
to be higher than would be expected for a complete document corpus. In the full-
text baseline, the inclusion of only the home pages of candidate companies excludes many pages that might also match company-name queries. In particular, in a more complete document corpus, a non-home page document on a company's website (such as a contact information page) might achieve a higher match score than that company's home page. The anchor-text baseline is also likely to achieve unrealistically
high retrieval effectiveness even given the incomplete aggregate anchor-text evidence
examined (only 100 snippets of anchor-text are retrieved per home page). This is be-
cause the aggregate anchor-text corpus only contains text that is used to link to one
of the evaluated companies, and so will be unlikely to contain much misleading or
ill-targeted anchor-text.
6.1.2 Query-dependent baselines
Three query-dependent baselines were evaluated: content, anchor-text and
content+anchor-text.
• The content baseline was built by scoring the full-text of the downloaded home
pages using Okapi BM25 with untrained parameters (k1 = 2 and b = 0.75) [172] (described in Section 2.3.1.3).
1. A back-link is a document that has a hyperlink directed to the page under consideration.
• The anchor-text baseline was built by scoring aggregate anchor-text documents
using Okapi BM25 with the same parameters as used for content (described in
Section 2.4.1).
• The content+anchor-text baseline was built by scoring document full-text and ag-
gregate anchor-text concurrently using Field-weighted Okapi BM25 [173] (de-
scribed in Section 2.5.2.1). The field-weights for document full-text (content) and
aggregate anchor-text were set to 1, and k1 and b were set to the same values
used in the content and anchor-text baselines [173]. The content+anchor baseline
was computed for the set of pages for which anchor-text was retrieved.
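The three baselines above might be computed along the following lines. This is a minimal sketch: the idf variant and the stand-in for the field-weighted combination are assumptions, and may differ in detail from the Okapi BM25 and Field-weighted Okapi BM25 formulations cited above.

    import math
    from collections import Counter


    def bm25_scores(query_terms, docs, k1=2.0, b=0.75):
        """Okapi BM25 scores for each doc (a list of terms) against the query.

        Uses a Robertson-Sparck Jones style idf; the exact Okapi formulation used
        in the thesis (Section 2.3.1.3) may differ in such details.
        """
        n_docs = len(docs)
        avgdl = sum(len(d) for d in docs) / n_docs
        df = Counter(t for d in docs for t in set(d))
        scores = []
        for d in docs:
            tf = Counter(d)
            score = 0.0
            for t in query_terms:
                if tf[t] == 0:
                    continue
                idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
                norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
                score += idf * tf[t] * (k1 + 1) / norm
            scores.append(score)
        return scores


    def fielded_docs(content_docs, anchor_docs, w_content=1, w_anchor=1):
        """Crude stand-in for the field-weighted combination: repeat each field's
        terms in proportion to its (integer) weight before scoring with BM25.
        BM25F proper applies the weights inside the saturation function
        (Section 2.5.2.1); with both weights set to 1 the two coincide."""
        return [c * w_content + a * w_anchor
                for c, a in zip(content_docs, anchor_docs)]


    # Toy usage: the content, anchor-text and content+anchor-text baselines.
    query = ["microsoft", "corp"]
    content = [["welcome", "to", "microsoft"], ["acme", "corp", "home", "page"]]
    anchors = [["microsoft", "corp", "microsoft", "corporation"], ["acme", "inc"]]
    print(bm25_scores(query, content))                         # content baseline
    print(bm25_scores(query, anchors))                         # anchor-text baseline
    print(bm25_scores(query, fielded_docs(content, anchors)))  # content+anchor-text
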
6.1.3 Extracting PageRank
Google's PageRank scores were extracted from the Google Toolbar for Microsoft Internet Explorer using the method described in Section 5.1.3. These scores were calculated by
Google [93] for a 3.3 billion page crawl.2
6.1.4 Combining query-dependent baselines with query-independent web
evidence
Many different schemes have been proposed for combining query-independent and
query-dependent evidence. Kraaij et al. [135] suggest measuring the query-
independent evidence as the probability of document relevance and treating it as a
prior in a language model (see Section 2.5.2.2). However, because Okapi BM25 scores
are weights rather than probabilities, prior document relevance cannot be directly
incorporated into the model. Westerveld et al. [212] also make use of linear combi-
nations of normalised scores, but for this to be useful with PageRank, a non-linear
transformation of the scores would almost certainly be needed:3 the distribution of
Google’s PageRanks is unknown, and those provided via the Toolbar have been ob-
served not to follow a power law (see Section 5.1.3). Savoy and Rasolofo [178] combine
query-dependent URL length evidence with Okapi BM25 scores by re-ranking the top
n documents on the basis of the URL scores (described in Section 2.5.1.2). The benefit
of this type of combination is that it does not require knowledge of the underlying
data distribution.
The three combination methods examined in this experiment are: retrieving only
those documents that exceed a PageRank threshold (see Section 2.5.1.5), using PageRank in a rank-based (quota) re-ranking of query-dependent baselines, and using PageRank in a score-based re-ranking of query-dependent baselines. The re-ranking
approaches are variations on those proposed by Savoy and Rasolofo, and are used
because they do not require any knowledge of the global distribution of Google's PageRank values [178].
2. The collection size was estimated to be 3.3 billion on http://www.google.com in September 2003; as of November 2004 it is estimated to be around 8 billion documents on http://www.google.com.
3. This is because while most PageRanks are very low, a few are orders of magnitude larger, as PageRank values are believed to follow a power-law distribution (see Section 2.4).
The use of a minimum PageRank threshold that pages need to exceed prior to
inclusion is equivalent to ranking results by PageRank evidence and then re-ranking
above a score-based threshold using query-dependent evidence. The use of a static4
minimum query-independent threshold value means that some pages will never be
retrieved, and so could be removed from the corpus. To enable the retrieval of pages
that do not exceed the static threshold value, a dynamic threshold function could be
used. Such a function could reduce the minimum threshold if some condition is not
met (for example if less than ten pages are matched). Such a scheme is discussed
further in Section 10.3.
The re-ranking experiments explore two important scenarios. In the first, PageRank plays a large role in ranking documents, through a quota-based combination. In
the quota-based combination all documents retrieved within the top n ranks in the
query-dependent baseline are re-ranked by PageRank. In the second scenario Page-
Rank has a smaller contribution and is used to re-order documents that achieve query-
dependent scores within n% of the highest baseline score (per query). This is termed
a score-based combination. In both cases if the re-ranking cutoffs are sufficiently large,
then all baseline documents will be re-ranked by PageRank order.
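A small sketch of the two re-ranking schemes is given below, assuming each query's baseline results arrive as (document, score) pairs in descending score order and that a PageRank lookup is available; the function names and cutoff values are illustrative.

    def quota_rerank(results, pagerank, n):
        """Re-order the top-n ranked documents by PageRank; leave the tail as is."""
        head = sorted(results[:n], key=lambda d: pagerank.get(d[0], 0), reverse=True)
        return head + results[n:]


    def score_rerank(results, pagerank, pct):
        """Re-order by PageRank only those documents scoring within pct% of the
        top baseline score for this query; leave the tail as is."""
        if not results:
            return results
        cutoff = results[0][1] * (1 - pct / 100.0)
        head = [d for d in results if d[1] >= cutoff]
        tail = results[len(head):]
        head = sorted(head, key=lambda d: pagerank.get(d[0], 0), reverse=True)
        return head + tail


    # Toy example of a query where the top result scores far above the rest.
    results = [("lycos-home", 100.0), ("lycos-press", 60.0), ("other", 20.0)]
    pagerank = {"lycos-home": 6, "lycos-press": 8, "other": 3}
    print(quota_rerank(results, pagerank, n=2))     # the top two are shuffled
    print(score_rerank(results, pagerank, pct=10))  # top result left in place

With a sufficiently large n or pct both functions degenerate to ranking the query-matched documents by PageRank alone, as noted above.
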
6.2 Results
This section reports the effectiveness of the baselines and the three evaluated combi-
nation methods.
6.2.1 Baseline performance
The effectiveness of the three baselines varied considerably:
• The content baseline retrieved the named home page at the first rank for only
two-out-of-five queries, and within the first ten results for a little over half the
queries (S@1 = 0.42, S@10 = 0.55).
• The anchor baseline performed well, retrieving three-out-of-four companies at
the first rank (S@1 = 0.725, S@10 = 0.79).
• The content+anchor baseline performed well, also retrieving three-out-of-four
companies at the first rank (S@1 = 0.729, S@10 = 0.82).
The performance of the full-text (content) baseline was poor given the small size of
the corpus from which the home pages were retrieved. A small benefit was observed
when adding full-text evidence to the anchor-text baseline.
4. A threshold value that does not change between queries.
6.2.2 Using a threshold
Figure 6.1: The percentage of home pages and other pages that exceed each PageRank value.
Implementing a PageRank threshold minimum value of 1 would lead to the inclusion of 99.7%
of the home pages, while reducing the number of other pages retrieved by 16.1%.
Figure 6.1 illustrates the percentage of home pages and non-home pages5 that ex-
ceed each PageRank value. Implementing a PageRank threshold value of 1 leads to
the inclusion of 99.7% of the home pages, while significantly reducing the number of
other pages retrieved (by 16.1%, to 83.9% of pages). The non-home page PageRanks
examined here may be somewhat inflated relative to those on the general WWW, as
they were retrieved using a (breadth-first) crawl halted after 100 pages. It has been reported that WWW in-links are distributed according to a power law [35]. Thus,
assuming the distribution of PageRank is similar to that of in-degree,6 setting a thresh-
old at some small PageRank is likely to eliminate many pages from ranking consid-
eration. In a home page finding system this may provide substantial computational
performance gains and little (if any) degradation in home page finding effectiveness.
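The threshold analysis behind Figure 6.1 reduces to counting, per candidate threshold, the fraction of home pages and of other pages at or above that PageRank. A toy sketch, with illustrative page records:

    def pct_exceeding(pages, threshold):
        """Percentage of `pages` whose Toolbar PageRank is at least `threshold`."""
        if not pages:
            return 0.0
        return 100.0 * sum(p["pagerank"] >= threshold for p in pages) / len(pages)


    # Toy corpus: a few home pages and other (non-home) pages.
    corpus = [
        {"url": "http://www.apple.com", "pagerank": 10, "home": True},
        {"url": "http://www.apple.com/support", "pagerank": 7, "home": False},
        {"url": "http://www.example.com", "pagerank": 1, "home": True},
        {"url": "http://www.example.com/contact.html", "pagerank": 0, "home": False},
    ]
    home = [p for p in corpus if p["home"]]
    other = [p for p in corpus if not p["home"]]
    for t in range(0, 11):
        print(t, pct_exceeding(home, t), pct_exceeding(other, t))
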
6.2.3 Re-ranking using PageRank
Results for the quota-based combination are presented in Figure 6.2. Re-ranking by
quota severely degrades performance, with a re-ranking of the top two results in the
full-text (content) baseline decreasing the percentage of home pages retrieved at the
5. The hyperlink recommendation values extracted for the set of “non-home page” documents, described in Section 5.1.1.
6. The distribution of Google’s PageRanks for company home pages was observed not to follow a power-law distribution (Figure 5.1), although the Google PageRanks are likely to have been normalised and transformed for use in the Toolbar. The PageRanks calculated for use in experiments in Chapter 7 do exhibit a power-law distribution (see Section 7.6.1).
Figure 6.2: Quota-based re-ranking. Re-ranking the top n documents in the query-dependent baselines by PageRank. Re-ranking by quota severely degrades performance, with a re-ranking of the top 2 results in the full-text baseline decreasing the percentage of home pages retrieved at the first position from 42% to 29%. Note that the re-ranking of all results by PageRank (at n = 50) is equivalent to ranking query-matched documents by PageRank.
Figure 6.3: Score-based re-ranking. Re-ranking documents that are within x% of the top
query-dependent baseline score. Re-ranking using score produces a much slower decline in
performance than re-ranking based on rank only (Figure 6.2). Note that the re-ranking of all
results by PageRank (at 100% of score) is equivalent to ranking query-matched documents by
PageRank.
Figure 6.4: Example of two queries using different re-ranking techniques. For the query
“Lycos” the correct answer is located at position one of the full-text (content) baseline. Given
that the second match scores far less than the first, a shuffling of the first two results would
favour the document with a much smaller query-dependent score. For the second query “Ya-
hoo” the correct answer is located at position two and achieves a comparable score to the first
result: in this case a shuffle would improve retrieval effectiveness.
first rank from 42% to 29%. Results for the score-based combination are presented in
Figure 6.3. Compared to the quota-based combination, re-ranking using score produces
a much slower decline in performance.
An example illustrating the comparative effectiveness of quota-based and score-
based combinations for two queries is presented in Figure 6.4. For the query “Lycos”
the correct answer is located at position one of the full-text (content) baseline. The
second document in the baseline scores far less than the first. Using a quota-based re-ranking with a cutoff of two, the first two results would be reversed. By comparison,
using score-based re-ranking, the cutoff would have to be set to 35% (or larger) of
the top score for a reversal by re-ranking. For the second query “Yahoo” the correct
answer is located at position two and achieves a comparable score to the first result.
In this case, re-ranking by PageRank using either a quota-based or a score-based re-ranking with n = 2 would reverse the ranking, improving retrieval effectiveness.
6.3 Discussion
Results from experiments in this Chapter support the use of PageRank and other hy-
perlink recommendation evidence as a minimum threshold for document retrieval
or in a score-based re-ranking of query-dependent evidence. The use of a minimum
PageRank threshold in a home page finding task may improve computational per-
formance by eliminating non-home pages from the ranking. Another method by
which computational efficiency could be improved is by ranking documents using
aggregate anchor-text evidence only (which also implicitly imposes a threshold of in-
degree ≥ 1). An anchor-text index would be much smaller than a full-text index and
therefore likely to be more efficient.
Quota-based re-ranking was observed to be inferior to score-based re-ranking. This
illustrates the negative effects of not considering relative query-dependent scores when
combining baselines with query-independent evidence. Further, this suggests that
query-independent evidence should not be relied upon to identify the most relevant
documents. Pages that achieve high query-independent scores are likely to be im-
portant pages in the corpus (such as the home pages of popular, large or technology-
oriented companies, as reported in Chapter 5), but may not necessarily be more rele-
vant (and indeed, in this experiment, might be the “wrong” home pages).
The results from experiments in this Chapter also reinforce the previously observed importance of aggregate anchor-text for effective home page finding [56]. The
correct home page was retrieved at the first rank in the anchor-text baseline for three-
out-of-four queries, compared to being retrieved at the first rank for only two-out-of-
five queries in the full-text baseline. While the baseline retrieval effectiveness in this
experiment may be unrealistically high, these findings show that there is generally
adequate anchor-text evidence, even when using only 100 snippets, to find the home
pages of publicly listed companies. Combining the full-text and aggregate anchor-text
evidence in a field-weighted combination resulted in a slight improvement in home
page finding effectiveness.
The next chapter investigates whether query-independent evidence can be used
to improve home page finding effectiveness for small-to-medium web corpora. The
experiments include evaluations of the effectiveness of minimum query-independent
evidence thresholds, score-based re-ranking of query-dependent baselines by query-
independent evidence, and of aggregate anchor-text-only indexes.
Chapter 7
Home page finding using
query-independent web evidence
Providing effective home page search is important for both web and WWW search
systems (see Section 2.6.2.1). The empirical results reported in Chapter 5 showed hy-
perlink recommendation evidence to be biased towards home pages. This chapter
presents a series of detailed experiments to determine whether this bias can be ex-
ploited to improve home page finding performance on small-to-medium sized web
corpora. Experiments in this Chapter evaluate the effectiveness of hyperlink recom-
mendation evidence and URL length for document full-text and anchor-text baselines
on three such corpora.
The potential contribution of query-independent evidence to home page finding
is evaluated in three ways:
• By measuring the potential for query-independent evidence to exclude non-
home pages, through the use of minimum query-independent threshold scores
that documents must achieve for retrieval (following from experiments in Chap-
ter 6). The use of thresholds is investigated as a measure by which both the
retrieval effectiveness and efficiency of a home page finding system could be
improved;
• By gauging the maximum improvements offered through query-independent
evidence when combined with query-dependent baselines using some linear
combination of scores; and
• By empirically investigating a combination method that could be used to incor-
porate query-independent evidence in a production web search system, namely
a score-based re-ranking of query-dependent baselines by query-independent
evidence (following from experiments in Chapter 6).
7.1 Method
The initial step in this experiment was to identify the set of candidate test corpora. The
corpora were then crawled (if required) and indexed. Four types of query-independent
evidence (in-degree, two PageRank variants and URL-type, described below) were
computed during indexing. Following indexing, the top 1000 documents for each
query-dependent baseline were retrieved. Three query-dependent baselines were
studied; one based solely on document full-text, one based solely on document ag-
gregate anchor-text, and one consisting of both forms of evidence. The baselines were
then combined with query-independent scores using three combination methods. The
first method used query-independent evidence as a threshold, such that documents
that did not exceed the threshold were not retrieved (shown to be a promising ap-
proach in Chapter 6). The second method explored the optimal improvement that
could be gained when combining query-independent evidence with query-dependent
baselines using a linear combination of scores. The final combination method was a
score-based re-ranking of query-dependent baselines by query-independent evidence
(also shown to be a promising approach in Chapter 6). The improvements in effec-
tiveness achieved through these combination methods were then measured and com-
pared.
Throughout the experiments the Wilcoxon matched-pairs signed ranks test was
performed to determine whether improvements afforded were significant. This test
compares the algorithms according to the (best) ranks achieved by correct answers,
rather than the success rate measure. A confidence criterion of 95% (α = 0.05) is used.
Success rates (described in Section 2.6.6.2) were used to evaluate retrieval effec-
tiveness. The success rate measure is indicated by S@n where n is the cutoff rank.
S@n results were computed for n = 1, 5, 10.
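Both evaluation steps can be sketched briefly, assuming each run is reduced to the rank of the correct answer for every query (1001 when it is missing from the top 1000); the toy ranks are illustrative and scipy's wilcoxon is used as a stand-in for whatever test implementation was actually used.

    from scipy.stats import wilcoxon


    def success_at(ranks, n):
        """S@n: fraction of queries whose correct answer appears at rank <= n."""
        return sum(r <= n for r in ranks) / len(ranks)


    # Ranks of the correct answer under a baseline and a re-ranked run (toy data).
    baseline_ranks = [2, 3, 12, 1001, 4, 8, 40, 6, 5, 17]
    reranked_ranks = [1, 1, 4, 200, 2, 3, 15, 1, 2, 9]

    for n in (1, 5, 10):
        print(f"S@{n}: baseline={success_at(baseline_ranks, n):.2f} "
              f"re-ranked={success_at(reranked_ranks, n):.2f}")

    # Wilcoxon matched-pairs signed-ranks test on the per-query ranks.
    stat, p = wilcoxon(baseline_ranks, reranked_ranks)
    print(f"Wilcoxon statistic={stat}, p={p:.3f}, significant at alpha=0.05: {p < 0.05}")
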
The following sections give a description of the query-independent and query-
dependent baselines, outline the test collections used in the experiments and their
salient properties, and discuss the methods used to combine query-independent and
query-dependent evidence.
7.1.1 Query-independent evidence
Four types of query-independent evidence were considered:
IDG the document’s in-degree score (described in Section 2.4.3.1);
DPR the document’s Democratic PageRank score (described in Section 3.3.2);
APR the document’s Aristocratic PageRank score, using bookmarks from the Yahoo!
directory [217] or other web directory listings, which might be available to a
production search system (described in Section 3.3.2);
URL the document’s URL-type score, through a re-ranking by the UTwente/TNO
URL-type [135] (described in Section 2.3.3.2). The URL-types were scored ac-
cording to Root > Subroot > Directory > File.
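A minimal sketch of a URL-type classifier in the Root > Subroot > Directory > File ordering follows; the treatment of index pages, the example URLs and the numeric scores are assumptions and may differ from the UTwente/TNO definition used in the thesis.

    from urllib.parse import urlsplit

    # Higher score = more likely to be an entry page under the URL-type ordering.
    URL_TYPE_SCORE = {"root": 4, "subroot": 3, "directory": 2, "file": 1}


    def url_type(url):
        """Classify a URL as root, subroot, directory or file."""
        path = urlsplit(url).path
        # Strip a trailing index page, which is usually equivalent to its directory.
        for index in ("index.html", "index.htm", "default.htm"):
            if path.lower().endswith(index):
                path = path[: -len(index)]
        if path in ("", "/"):
            return "root"
        if path.endswith("/"):
            # One directory below the root is a "subroot"; deeper is "directory".
            depth = path.strip("/").count("/")
            return "subroot" if depth == 0 else "directory"
        return "file"


    for u in ("http://www.anu.edu.au/",
              "http://www.anu.edu.au/physics/",
              "http://www.anu.edu.au/physics/staff/",
              "http://www.anu.edu.au/physics/staff/page.html"):
        t = url_type(u)
        print(u, t, URL_TYPE_SCORE[t])
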
7.1.2 Query-dependent baselines
The relative improvements achieved over three query dependent baselines were ex-
amined. The baselines were:
• content baselines built by scoring document full-text using Okapi BM25 with
default parameters (k1 = 2 and b = 0.75) (see Section 2.3.1.3) [172].
• anchor-text baselines built using the methods outlined previously (i.e. by record-
ing all anchor-text pointing to each document and building a new aggregate
document containing all source anchor-text). The aggregate anchor-text docu-
ments were scored using Okapi BM25 using the same parameters as content.
• content+anchor-text baselines built by using Field-weighted Okapi BM25 [173]
to build and score composite documents containing document full-text and ag-
gregate anchor-text evidence. The baseline was scored with document full-text
and anchor-text field-weights set to 1, and k1 and b as above (see Section 2.5.2.1)
[173].
7.1.3 Test collections
Effectiveness improvements were evaluated using five test collections that spanned
three small-to-medium sized web corpora. The test corpora used in the evaluation in-
cluded a 2001 crawl of a university web (the ANU), and the TREC corpora VLC2 [106]
and WT10g [15]. Detailed collection information is reported in Table 7.1 and a further
discussion of the TREC collection properties appears in Section 2.6.7. Note that since
experiments published in Upstill et al. [201] the link tables have been re-visited and
further duplicates and equivalences removed (using methods described in Chapter 3).
This has resulted in some non-statistically significant changes in retrieval effective-
ness.
Test Pages Links Dead Content Anchor No. of Book-
Collection Size (million) (million) links queries queries marks (APR)
ANU 4.7GB 0.40 6.92 0.646 97/100 99/100 439
WT10gC 10GB 1.69 8.06 0.306 93/100 84/100 25 487
WT10gT 10GB 1.69 8.06 0.306 136/145 119/145 25 487
VLC2P 100GB 18.57 96.37 3.343 95/100 93/100 77 150
VLC2R 100GB 18.57 96.37 3.343 88/100 77/100 77 150
Table 7.1: Test collection information. The experiments were performed for five test collec-
tions spanning three small-to-medium sized web corpora. Two sets of queries were submitted
over the VLC2 collection - a popular set (VLC2P) and a random set (VLC2R) (see text for expla-
nation). The two sets computed for WT10g were the set used by Craswell et al. [56] (WT10gC)
and the official queries used in the TREC 2001 home page finding task (WT10gT). The values
in the “Content” and “Anchor” queries columns show the number of home pages found by
the baseline out of the number of queries submitted (this is equivalent to S@1000, as the top
1000 results for each search are considered).
Although there are many spam pages on the WWW, little spam was found in the
three corpora. Any spam-like effect observed seemed unintentional. For example, the
pages of a large bibliographic database all linked to the same page, thereby artificially
inflating its in-degree and PageRank.
In each run, sets of 100 or more queries were processed over the applicable corpus
using the chosen baseline algorithm. The first 1000 results for each were recorded.
While all queries have only one correct answer, that answer may have multiple cor-
rect URLs, e.g. a host with two aliases. If multiple correct URLs were retrieved the
minimum baseline rank was used (i.e. earliest in the ranked list of documents) and
was assigned the best query-independent score of all the equivalent URLs. This
approach introduces a slight bias in favour of the re-ranking algorithms, ensuring
that any beneficial effect will be detected. If a correct document did not appear in the
top 1000 positions a rank of 1001 was assigned.
These experiments investigated two home page finding scenarios: queries for pop-
ular and random home pages.1 Popular queries allow the study of which forms of
evidence achieve high effectiveness when ranking for queries targeting high profile
sites. Random queries allow the study of effective ranking for any home page, even if
it is not well known.
The ANU web includes a number of official directories of internal sites. These site
directories can be used as PageRank bookmarks. This allows for the evaluation of
APR in a single-organisation environment. Test home pages were picked randomly
from these directories and then queries were generated manually by navigating to a
home page, and formulating a query based on the home page name2. Consequently
APR might be expected to perform well on this collection.
The query set labelled WT10gC [56] was created by randomly selecting pages
within the WT10g corpus, navigating to the corresponding home page, and formu-
lating a query based on the home page’s name. The WT10gC set was used as training
data in the TREC-2001 web track. The query set labelled WT10gT was developed
by the NIST assessors for the TREC-2001 web track using the same method. Wester-
veld et al. [212] have previously found that the URL-type method improved retrieval
performance on the WT10gT collection. Using the method outlined in Section 3.3.2,
every Yahoo-listed page in the WT10g collection is bookmarked in the APR calcula-
tion. These are lower quality bookmarks than the ANU set as the bookmarks played
no part in the selection of either query set.
Two sets of queries were evaluated over the VLC2 collection, popular (VLC2P) and
random (VLC2R). The popular series was derived from the Yahoo! directory. The ran-
dom series was selected using the method described above for WT10g. For the APR
calculation every Yahoo-listed page in the collection was bookmarked. As such, the
bookmarks were well matched to the VLC2P queries (also from Yahoo!), but less so
for the VLC2R set.
1. Note that the labels popular and random were chosen for simplicity and are derived from the method used to choose the target answer, not from the nature of the queries. Information about query volumes is obviously unavailable for the TREC test collections and was not used in the case of ANU.
2. This set was generated by Nick Craswell in 2001.
The home page results for the ANU and VLC2P query sets are considered popular
because they are derived from directory listings. Directory listings have been chosen
by a human editor as important, possibly because they are pages of interest to many
people. Such pages also tend to have above average in-degree. This means that more
web page editors have chosen to link to the page, directing web surfers (and search
engine crawlers) to it.
On all these corpora anchor-text ranking has been shown to improve home page
finding effectiveness (relative to full-text-only) [15, 56].
7.1.4 Combining query-dependent baselines with query-independent evi-
dence
Throughout these experiments there is a risk that a poor choice of combining function
could lead to a spurious conclusion. The combination of evidence experiments in the
previous chapter outlined two methods for combining query-independent and query-
dependent evidence which may be effective: the use of minimum threshold values
and score-based re-ranking. This chapter includes a further combination scheme – an
Optimal re-ranking.
The Optimal re-ranking is an unrealistic re-ranking, and is termed “Optimal” to
distinguish it from a re-ranking that could be used in a production web search sys-
tem.3 In the Optimal combination experiments, the maximum possible improvement
when combining query-independent evidence with query-dependent evidence using
a linear combination is gauged. This is done by locating the right answer in the base-
line (obviously not possible in a practical system) and re-ranking it and the docu-
ments above it, on the basis of the query-independent score alone (as illustrated in
Figure 7.1). This is an unrealistic combination: if this information were known in practice, perfection could easily be achieved by swapping the document at that position
with the document at rank one. Indeed, no linear combination or product of query-
independent and query-dependent scores (assuming positive coefficients) could im-
prove upon the Optimal combination. This is because documents above the correct
answer score as well or better on both query-independent and query-dependent com-
ponents (see Figure 7.1). In Optimal experiments a control condition random was intro-
duced in which the correct document and all those above it were arbitrarily shuffled.
Throughout re-ranking experiments if the query-independent scores are equal,
then the original baseline ordering is preserved.
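The Optimal re-ranking and its random control can be sketched as follows, assuming the baseline is a ranked list of document identifiers and the correct answer is known; the names and toy PageRank values are illustrative, and the toy example mirrors Figure 7.1 below.

    import random


    def optimal_rerank(ranked_docs, qi_score, correct_id):
        """Re-rank the correct answer and every document above it by the
        query-independent score alone, preserving baseline order on ties."""
        if correct_id not in ranked_docs:
            return list(ranked_docs)
        cut = ranked_docs.index(correct_id) + 1
        head = sorted(range(cut), key=lambda i: (-qi_score[ranked_docs[i]], i))
        return [ranked_docs[i] for i in head] + list(ranked_docs[cut:])


    def random_control(ranked_docs, correct_id, rng=random):
        """Control condition: shuffle the correct answer and all documents above it."""
        if correct_id not in ranked_docs:
            return list(ranked_docs)
        cut = ranked_docs.index(correct_id) + 1
        head = list(ranked_docs[:cut])
        rng.shuffle(head)
        return head + list(ranked_docs[cut:])


    # Toy example: the correct answer ("d6") sits at baseline rank 6.
    baseline = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
    pagerank = {"d1": 0.3, "d2": 0.9, "d3": 0.2, "d4": 0.25, "d5": 0.1,
                "d6": 0.8, "d7": 0.05, "d8": 0.4}
    print(optimal_rerank(baseline, pagerank, "d6"))   # d6 moves up to rank 2
    print(random_control(baseline, "d6"))
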
The following sections report and discuss the results for each combination method.
The use of minimum query-independent evidence thresholds is investigated first,
followed by re-ranking using the (unrealistic) Optimal combination, and finally re-
ranking using the (realistic) score-based re-ranking.
3. The Optimal re-ranking relies on knowledge of the correct answer within the baseline ranking.
Figure 7.1: Example of Optimal re-ranking and calculation of random control success rate.
In the baseline, the correct answer is document 6 at rank 6. Re-ranking by PageRank puts it
at rank 2. This is optimal because any document ranked more highly must score as well or
better on both baseline and PageRank (i.e. “document 2” scored better on the baseline, and
PageRank). In this case, S@5 fails on the baseline and succeeds on re-ranking. However, a
random resorting of the top 6 would have succeeded in 5 of 6 cases, so expected S@5 for the
random control is 5/6.
7.2 Minimum threshold experiments
These experiments investigate whether the use of a static minimum threshold require-
ment for page inclusion can improve retrieval effectiveness and system efficiency.
Retrieval effectiveness may be improved through the removal of unimportant doc-
uments from the corpus. Additionally, retrieval efficiency may be improved by reducing the number of documents that must be ranked when processing a query.
The evaluation of the performance of threshold techniques requires a set of candi-
date cutoff values. Up to nine cutoffs were generated for each form of evidence, and
an attempt was made to pick intervals that would divide the corpus into 10% increments. These cutoffs were possible for DPR evidence because its scores spanned many distinct values. Even
spacing was not possible for in-degree or URL-type evidence because early cutoffs
eliminated many of the pages from consideration. For example, picking an in-degree
minimum of 2 removed up to 60% of the ANU corpus. Discounting URL-type “File”
URLs removed over 95% of the ANU collection.
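A small sketch of how evenly spaced cutoffs might be derived from a continuous score such as DPR; the helper name and toy scores are illustrative, not the values used in the experiments.

    def decile_cutoffs(scores, n_cuts=9):
        """Return candidate cutoff values that split the score distribution into
        roughly 10% slices (highest-scoring ~10%, ~20%, ... of documents)."""
        ranked = sorted(scores, reverse=True)
        n = len(ranked)
        return [ranked[int(n * k / (n_cuts + 1))] for k in range(1, n_cuts + 1)]


    dpr_scores = [1.2e-6, 5.1e-6, 3.3e-6, 9.8e-6, 2.2e-6,
                  7.5e-6, 4.4e-6, 6.1e-6, 8.0e-6, 1.9e-6]
    print(decile_cutoffs(dpr_scores))
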
An evaluation of the use of minimum thresholds was performed for three of the five test collections, namely ANU, WT10gC and WT10gT.4
Content Anchor Both
Type Cut Prop. S@1 S@5 S@10 S. S@1 S@5 S@10 S. S@1 S@5 S@10 S.
BASE 100% 0.29 0.50 0.58 0.72 0.96 0.97 0.63 0.81 0.86
IDG 2 51% 0.34 0.57 0.66 *+ 0.72 0.96 0.97 = 0.63 0.81 0.86 *+
IDG 3 45% 0.36 0.58 0.68 *+ 0.73 0.96 0.97 = 0.64 0.82 0.85 *+
IDG 4 37% 0.38 0.60 0.68 *+ 0.73 0.96 0.97 = 0.66 0.82 0.85 *+
IDG 6 33% 0.39 0.61 0.68 *+ 0.72 0.95 0.96 = 0.65 0.81 0.84 =
IDG 8 28% 0.40 0.60 0.70 *+ 0.72 0.95 0.96 = 0.65 0.82 0.86 =
IDG 10 8% 0.41 0.64 0.69 *+ 0.70 0.91 0.92 = 0.65 0.81 0.85 =
IDG 25 2% 0.33 0.42 0.47 *- 0.49 0.62 0.63 *- 0.44 0.55 0.58 *-
IDG 50 1% 0.21 0.30 0.36 *- 0.36 0.42 0.42 *- 0.28 0.38 0.39 *-
IDG 100 0.5% 0.11 0.19 0.20 *- 0.20 0.24 0.24 *- 0.17 0.22 0.22 *-
DPR 5.02 90% 0.30 0.50 0.59 *+ 0.72 0.97 0.98 = 0.63 0.81 0.86 *+
DPR 5.06 80% 0.30 0.50 0.59 *+ 0.72 0.97 0.98 = 0.64 0.81 0.86 *+
DPR 5.10 70% 0.30 0.50 0.59 *+ 0.72 0.97 0.98 = 0.64 0.81 0.86 *+
DPR 5.22 60% 0.31 0.51 0.62 *+ 0.72 0.97 0.98 = 0.64 0.81 0.87 *+
DPR 5.28 55% 0.31 0.52 0.62 *+ 0.72 0.97 0.98 = 0.64 0.81 0.87 *+
DPR 5.61 40% 0.33 0.54 0.63 *+ 0.72 0.97 0.98 = 0.64 0.82 0.87 *+
DPR 6.15 30% 0.34 0.55 0.65 *+ 0.71 0.95 0.97 = 0.65 0.82 0.87 *+
DPR 8.04 20% 0.36 0.57 0.63 *+ 0.64 0.86 0.88 *- 0.61 0.78 0.81 =
DPR 14.9 10% 0.35 0.54 0.60 = 0.62 0.78 0.80 *- 0.58 0.74 0.76 =
URL >F 5% 0.48 0.64 0.76 *+ 0.73 0.88 0.88 = 0.64 0.79 0.82 =
URL >D 2% 0.33 0.48 0.50 *- 0.47 0.55 0.55 *- 0.41 0.53 0.53 *-
URL >SR 0.1% 0.17 0.22 0.23 *- 0.25 0.26 0.26 *- 0.21 0.24 0.24 *-
Table 7.2: Using query-independent thresholds on the ANU collection. Bold values indi-
cate the highest effectiveness achieved for each type of query-independent evidence on each
query-dependent baseline. Underlined bold values indicate the highest effectiveness achieved
for each query-dependent baseline. The cutoff value is indicated by “Cut”. The percentage of the collection that is included within the cutoff is indicated by “Prop.”. “S.” reports whether observed changes are significantly better (“*+”), equivalent (“=”) or worse (“*-”). The cutoff values given for Democratic PageRank are of the order ×10−6. For URL-type cutoffs: >F indicates that URLs are more important than “File” URLs (i.e. either “Directory”, “Subroot” or “Root”), >D that URLs are more important than “Directory” (i.e. either “Subroot” or “Root”), and >SR that URLs are more important than “Subroot” (i.e. “Root”).
7.2.1 Results
ANU
The performance of the ANU collection when using minimum query-independent
thresholds is presented in Table 7.2. Observations from these results are:
• Removing the bottom 80% of pages according to Democratic PageRank, in-degree
or URL-type improves the effectiveness of the content baseline. In the case of
URL-type, the improvement is dramatic.
• Using the least restrictive URL-type as a minimum threshold (i.e. removing
“File” pages) removes around 95% of pages from consideration without a sig-
nificant decrease in retrieval effectiveness for any baseline.
• Using appropriate in-degree and Democratic PageRank threshold values, around
80% of pages can be removed before observing a significant decrease in retrieval
effectiveness for any baseline.
• The highest retrieval effectiveness is achieved using an anchor-text baseline with
no thresholds, although this is not significantly better than that of anchor-text
with the base URL-type threshold.
In the ANU collection there was a group of documents with identical Democratic
PageRank values of 5.28×10−6. This made it impossible to choose a cutoff of 60% and
so a cutoff of 55% was used. The large number of documents that achieved the same
PageRank value was found to be caused by a crawler trap on an ANU web server.
WT10gC
The performance of the WT10gC collection using minimum thresholds is presented in
Table 7.3. Observations from these results are:
• Excluding pages using the “> File” and “> File or Directory” URL-type thresh-
olds provided significant gains on all three baselines while reducing the size of
the collection by 97%. Excluding pages using the “> Subroot” URL-type thresh-
old resulted in the removal of 99% of pages without significantly affecting the
effectiveness of any baseline.
• Excluding pages with in-degree < 2 removed 58% of pages from consideration
without significantly reducing effectiveness for any baseline (and improved ef-
fectiveness for the content baseline).
• Excluding pages with a DPR of < 1.73 × 10−6 removed 40% of pages from con-
sideration without significantly reducing effectiveness for any baseline.
4. An evaluation of performance on the VLC2P and VLC2R test collections was not possible due to time constraints.
Content Anchor Both
Type Cut Prop. S@1 S@5 S@10 S. S@1 S@5 S@10 S. S@1 S@5 S@10 S.
BASE 100% 0.23 0.45 0.55 0.47 0.69 0.72 0.45 0.71 0.83
IDG 2 42% 0.23 0.47 0.55 *+ 0.45 0.65 0.69 = 0.41 0.66 0.75 =
IDG 3 26% 0.23 0.50 0.59 *+ 0.45 0.64 0.67 *- 0.40 0.64 0.72 *-
IDG 4 19% 0.23 0.48 0.54 = 0.43 0.62 0.64 *- 0.39 0.60 0.68 *-
IDG 6 12% 0.24 0.44 0.53 *- 0.41 0.59 0.60 *- 0.38 0.60 0.62 *-
IDG 8 7.5% 0.25 0.45 0.53 *- 0.41 0.56 0.57 *- 0.38 0.59 0.60 *-
IDG 10 5% 0.21 0.43 0.45 *- 0.40 0.49 0.50 *- 0.37 0.52 0.52 *-
IDG 25 2% 0.20 0.36 0.39 *- 0.34 0.41 0.41 *- 0.31 0.43 0.43 *-
IDG 50 1% 0.19 0.28 0.29 *- 0.28 0.30 0.30 *- 0.25 0.31 0.31 *-
IDG 100 0.5% 0.15 0.22 0.23 *- 0.22 0.24 0.24 *- 0.21 0.24 0.24 *-
DPR 1.33 99% 0.23 0.45 0.55 = 0.47 0.69 0.72 = 0.45 0.71 0.83 =
DPR 1.38 80% 0.21 0.42 0.53 = 0.45 0.67 0.61 = 0.44 0.67 0.77 =
DPR 1.51 70% 0.20 0.42 0.53 = 0.45 0.66 0.70 = 0.43 0.68 0.77 =
DPR 1.73 60% 0.19 0.41 0.52 = 0.45 0.65 0.70 = 0.41 0.67 0.75 =
DPR 2.11 50% 0.19 0.39 0.52 = 0.44 0.64 0.68 = 0.39 0.64 0.73 *-
DPR 2.72 40% 0.18 0.39 0.51 = 0.42 0.62 0.66 *- 0.37 0.61 0.69 *-
DPR 3.77 30% 0.20 0.41 0.46 = 0.41 0.59 0.62 *- 0.35 0.58 0.63 *-
DPR 5.45 20% 0.20 0.37 0.44 *- 0.37 0.57 0.59 *- 0.35 0.54 0.58 *-
DPR 8.65 10% 0.19 0.38 0.47 *- 0.39 0.55 0.56 *- 0.37 0.54 0.55 *-
URL > F 7% 0.56 0.83 0.87 *+ 0.68 0.76 0.78 *+ 0.75 0.93 0.95 *+
URL > D 3% 0.63 0.81 0.87 *+ 0.67 0.73 0.75 *+ 0.76 0.89 0.90 *+
URL > SR 1% 0.65 0.75 0.76 = 0.59 0.65 0.65 = 0.75 0.77 0.77 =
Table 7.3: Using query-independent thresholds on the WT10gC collection. Bold values in-
dicate the highest effectiveness achieved for each type of query-independent evidence on each
query-dependent baseline. Underlined bold values indicate the highest effectiveness achieved
for each query-dependent baseline. The cutoff value is indicated by “Cut”. The percentage of
the collection that is included within the cutoff is indicated by “Prop.”. “S.” reports whether
observed changes are significantly better (“*+”), equivalent (“=”) or worse (“*-”). The specified
cutoffs for Democratic PageRank are of the order ×10−6. For URL-type cutoffs: >F indicates that URLs are more important than “File” URLs (i.e. either “Directory”, “Subroot” or “Root”), >D that URLs are more important than “Directory” (i.e. either “Subroot” or “Root”), and >SR that URLs are more important than “Subroot” (i.e. “Root”).
• The highest effectiveness is achieved with a content+anchor-text baseline and
URL-type “> File” threshold. Using the URL-type threshold gives gains of 7%
to 20% over the best baseline score and removes 93% of pages from considera-
tion.
WT10gT
The performance of the WT10gT collection using minimum thresholds is presented in
Table 7.4. Observations from these results are:
• Excluding documents based on a “> File” URL-type threshold, provides signifi-
cant gains on all three baselines while reducing the size of the collection by 93%.
Excluding documents using a “> Subroot” URL-type threshold reduces collec-
tion size by 99% while only negatively affecting anchor-text retrieval effective-
ness.
• Excluding documents which achieve in-degree < 2 removes 58% of pages from
consideration without significantly reducing effectiveness for any baseline.
• Excluding documents which achieve a DPR in the top 90% of values resulted in
a significant decrease in effectiveness for the anchor-text baseline.
• The highest effectiveness is achieved with a content+anchor-text baseline and a
“> File” URL-type threshold. Using this threshold gives gains of 7-15% over the
baseline while removing 93% of pages from consideration.
7.2.2 Training cutoffs
While several cutoffs were considered for each collection, a sensible approach for
future experiments would be to train a threshold cutoff value on a single collection
and then apply that as a threshold on other collections. The trained cutoff, if calcu-
lated for the S@5 measure on the WT10gC collection (as with other realistic combina-
tion experiments detailed below), would have been a “> File” URL-type cutoff (with
an associated effectiveness gain of 24% along with a reduction of collection size by
around 93%). Applied to the WT10gT collection, this cutoff would have resulted in a
significant improvement in retrieval effectiveness of 12% at S@5 (along with the same
reduction of collection size of 93%). Applied to the ANU collection, the collection size
would be reduced by 95%, with an associated non-significant decrease in retrieval
effectiveness of 9% at S@5.
The exact efficiency gains achieved through using a minimum query-independent
value for inclusion are difficult to quantify as they depend on the indexing and query
processing methods used. However, one would expect that indexing an order of magnitude fewer documents would result in significant efficiency gains.
Content Anchor Both
Type Cut Prop. S@1 S@5 S@10 S. S@1 S@5 S@10 S. S@1 S@5 S@10 S.
BASE 100% 0.22 0.48 0.59 0.53 0.68 0.72 0.48 0.71 0.75
IDG 2 42% 0.22 0.47 0.55 = 0.53 0.67 0.72 = 0.50 0.61 0.67 =
IDG 3 26% 0.26 0.44 0.52 = 0.50 0.59 0.61 *- 0.48 0.61 0.64 =
IDG 4 19% 0.23 0.43 0.51 = 0.46 0.54 0.56 *- 0.43 0.57 0.60 =
IDG 6 12% 0.26 0.43 0.49 *- 0.43 0.51 0.52 *- 0.43 0.52 0.56 *-
IDG 8 7.5% 0.24 0.41 0.45 *- 0.38 0.48 0.49 *- 0.39 0.50 0.51 *-
IDG 10 5% 0.24 0.39 0.42 *- 0.37 0.44 0.46 *- 0.37 0.46 0.48 *-
IDG 25 2% 0.23 0.31 0.34 *- 0.28 0.34 0.35 *- 0.28 0.35 0.37 *-
IDG 50 1% 0.21 0.28 0.30 *- 0.25 0.28 0.28 *- 0.24 0.30 0.30 *-
IDG 100 0.5% 0.16 0.22 0.23 *- 0.19 0.21 0.21 *- 0.20 0.22 0.23 *-
DPR 1.33 99% 0.22 0.48 0.59 = 0.53 0.68 0.72 = 0.48 0.71 0.75 *+
DPR 1.38 80% 0.20 0.41 0.50 = 0.50 0.62 0.66 *- 0.45 0.61 0.65 =
DPR 1.51 70% 0.20 0.39 0.49 = 0.51 0.62 0.64 *- 0.46 0.60 0.63 =
DPR 1.73 60% 0.20 0.37 0.48 = 0.50 0.60 0.63 *- 0.46 0.59 0.61 =
DPR 2.11 50% 0.23 0.41 0.50 = 0.50 0.59 0.63 *- 0.46 0.59 0.63 =
DPR 2.72 40% 0.19 0.36 0.46 = 0.48 0.57 0.59 *- 0.43 0.54 0.59 *-
DPR 3.77 30% 0.18 0.37 0.46 *- 0.47 0.55 0.56 *- 0.43 0.53 0.58 *-
DPR 5.45 20% 0.18 0.37 0.44 *- 0.44 0.52 0.54 *- 0.41 0.50 0.55 *-
DPR 8.65 10% 0.15 0.35 0.42 *- 0.39 0.48 0.48 *- 0.37 0.46 0.49 *-
URL > F 7% 0.53 0.71 0.80 *+ 0.61 0.73 0.74 *+ 0.62 0.80 0.83 *+
URL > D 3% 0.57 0.76 0.78 *+ 0.62 0.70 0.71 = 0.66 0.79 0.81 *+
URL > SR 1% 0.60 0.62 0.63 = 0.53 0.57 0.58 *- 0.61 0.64 0.65 =
Table 7.4: Using query-independent thresholds on the WT10gT collection. Bold values indi-
cate the highest effectiveness achieved for each type of query-independent evidence on each
query-dependent baseline. Underlined bold values indicate the highest effectiveness achieved
for each query-dependent baseline. The cutoff value is indicated by “Cut”. The percentage of
the collection that is included within the cutoff is indicated by “Prop.”. “S.” reports whether
observed changes are significantly better (“*+”), equivalent (“=”) or worse (“*-”). The specified
cutoffs for Democratic PageRank are of the order ×10−6. For URL-type cutoffs: >F indicates that URLs are more important than “File” URLs (i.e. either “Directory”, “Subroot” or “Root”), >D that URLs are more important than “Directory” (i.e. either “Subroot” or “Root”), and >SR that URLs are more important than “Subroot” (i.e. “Root”).
7.3 Optimal combination experiments
These experiments investigate the effectiveness improvements offered through the
use of query-independent evidence in an Optimal re-ranking. The Optimal re-ranking
is unrealistic, and is used to gauge the potential contribution of query-independent
evidence when combined with query-dependent evidence.
7.3.1 Results
Full re-ranking and significance test results are shown in Tables 7.5, 7.6, 7.7 and 7.8,
and a summary of optimal results is presented in Table 7.9. Observations based on
these results are:
1. All re-rankings of the content baseline significantly outperform the random con-
trol.
2. The only re-ranking method which shows significant benefit over the anchor-
text baseline is URL. This benefit is shown only for the random query sets. The
benefits of re-ranking by URL are greatly diminished for anchor-text compared
with content and content+anchor-text baselines.
3. All re-rankings of the content+anchor-text baseline significantly outperform the
random control on ANU, WT10gT and VLC2R. Only the URL-type re-ranking
on WT10gC and VLC2P outperforms the random control.
4. With no re-ranking, the content+anchor-text baselines perform worse than their
anchor-text counterparts. However, the content+anchor-text re-rankings equal (on ANU) or exceed (on WT10gC, WT10gT, VLC2P and VLC2R) their counterpart anchor-text re-rankings.
5. URL performs at a consistently high level for all baselines. The URL anchor-
text re-ranking is only outperformed by APR on the ANU and VLC2P. These are
cases where the query set and bookmarks were both derived from the same list
of authoritative sources.
6. For the popular home page queries (ANU and VLC2P), all anchor-text re-rankings
outperform their content counterparts.
7. For random home page queries (WT10gT, WT10gC and VLC2R), the content+
anchor-text and content-only re-rankings perform better than their anchor-text
counterparts.
8. Improvements due to APR were only observed when using high quality book-
marks, i.e. when the query answers were to be found among the bookmarks.
9. Improvements due to IDG and DPR are almost identical.
Coll. Meas. Base Rand IDG DPR APR URL
ANU S@1 0.29 0.37 0.73 0.71 0.75 0.68
ANU S@5 0.50 0.61 0.88 0.90 0.91 0.87
ANU S@10 0.58 0.69 0.93 0.93 0.96 0.91
ANU Sig. n/a n/a ** ** ** **
WT10gC S@1 0.23 0.34 0.61 0.59 0.55 0.75
WT10gC S@5 0.45 0.58 0.86 0.82 0.84 0.89
WT10gC S@10 0.55 0.68 0.86 0.87 0.88 0.93
WT10gC Sig. n/a n/a ** ** ** **
WT10gT S@1 0.22 0.34 0.64 0.62 0.55 0.84
WT10gT S@5 0.48 0.61 0.81 0.83 0.80 0.90
WT10gT S@10 0.59 0.69 0.86 0.87 0.84 0.92
WT10gT Sig. n/a n/a ** ** ** **
VLC2P S@1 0.27 0.38 0.66 0.62 0.67 0.71
VLC2P S@5 0.51 0.65 0.79 0.79 0.82 0.87
VLC2P S@10 0.61 0.76 0.88 0.87 0.90 0.89
VLC2P Sig. n/a n/a ** ** ** **
VLC2R S@1 0.16 0.25 0.50 0.48 0.46 0.72
VLC2R S@5 0.36 0.48 0.72 0.69 0.69 0.87
VLC2R S@10 0.44 0.58 0.73 0.72 0.72 0.88
VLC2R Sig. n/a n/a ** ** ** **
Table 7.5: Optimal re-ranking results for content. The Optimal combination experiment
is described in Section 7.3. “Sig.” reports the statistical significance of the improvements.
Significance is tested using the Wilcoxon matched-pairs signed ranks test. The Wilcoxon test
compares the full document ranking, and so only a single significance value is reported per
type of evidence, per collection. A “**” indicates improvements were significant at p < 0.01,
and a “*” indicates improvements were significant at p < 0.05. Relative to the random control,
all Optimal re-rankings of the content baseline were significant. The highest effectiveness
achieved for each measure on each collection is highlighted in bold.
Coll. Meas. Base Rand IDG DPR APR URL
ANU S@1 0.72 0.82 0.87 0.87 0.89 0.88
ANU S@5 0.96 0.97 0.98 0.98 0.98 0.98
ANU S@10 0.97 0.97 0.98 0.98 0.99 0.98
ANU Sig. n/a n/a - - - -
WT10gC S@1 0.47 0.58 0.60 0.59 0.63 0.73
WT10gC S@5 0.69 0.73 0.71 0.72 0.73 0.82
WT10gC S@10 0.72 0.76 0.74 0.75 0.75 0.83
WT10gC Sig. n/a n/a - - - *
WT10gT S@1 0.53 0.60 0.63 0.61 0.64 0.74
WT10gT S@5 0.68 0.73 0.72 0.71 0.75 0.78
WT10gT S@10 0.72 0.76 0.76 0.76 0.75 0.79
WT10gT Sig. n/a n/a - - - *
VLC2P S@1 0.70 0.77 0.78 0.79 0.85 0.81
VLC2P S@5 0.86 0.88 0.88 0.89 0.92 0.90
VLC2P S@10 0.87 0.89 0.90 0.89 0.92 0.92
VLC2P Sig. n/a n/a - - - -
VLC2R S@1 0.48 0.55 0.63 0.60 0.61 0.68
VLC2R S@5 0.67 0.71 0.75 0.75 0.73 0.74
VLC2R S@10 0.72 0.73 0.75 0.75 0.75 0.76
VLC2R Sig. n/a n/a - - - *
Table 7.6: Optimal re-ranking results for anchor-text. The Optimal combination experiment
is described in Section 7.3. “Sig.” reports the statistical significance of the improvements.
Significance is tested using the Wilcoxon matched-pairs signed ranks test. The Wilcoxon test
compares the full document ranking, and so only a single significance value is reported per
type of evidence, per collection. A “**” indicates improvements were significant at p < 0.01,
and a “*” indicates improvements were significant at p < 0.05. The highest effectiveness
achieved for each measure on each collection is highlighted in bold.
Coll. Meas. Base Rand IDG DPR APR URL
ANU S@1 0.63 0.70 0.85 0.85 0.84 0.88
ANU S@5 0.81 0.86 0.96 0.98 0.96 0.98
ANU S@10 0.86 0.90 0.98 0.99 0.98 0.98
ANU Sig. n/a n/a * * * *
WT10gC S@1 0.45 0.58 0.65 0.67 0.68 0.94
WT10gC S@5 0.71 0.81 0.90 0.88 0.89 0.97
WT10gC S@10 0.83 0.86 0.92 0.91 0.90 0.97
WT10gC Sig. n/a n/a - - - **
WT10gT S@1 0.48 0.58 0.70 0.69 0.68 0.84
WT10gT S@5 0.71 0.77 0.88 0.86 0.85 0.94
WT10gT S@10 0.75 0.80 0.88 0.90 0.88 0.95
WT10gT Sig. n/a n/a * * * **
VLC2P S@1 0.67 0.75 0.84 0.86 0.89 0.90
VLC2P S@5 0.85 0.88 0.93 0.94 0.94 0.97
VLC2P S@10 0.88 0.91 0.94 0.94 0.95 0.98
VLC2P Sig. n/a n/a - - * **
VLC2R S@1 0.40 0.50 0.63 0.60 0.58 0.84
VLC2R S@5 0.62 0.69 0.78 0.76 0.74 0.93
VLC2R S@10 0.66 0.75 0.79 0.78 0.77 0.93
VLC2R Sig. n/a n/a - - - **
Table 7.7: Optimal re-ranking results for content+anchor-text. The Optimal combination
experiment is described in Section 7.3. Significance is tested using the Wilcoxon matched-
pairs signed ranks test. The Wilcoxon test compares the full document ranking, and so only
a single significance value is reported per type of evidence, per collection. A “**” indicates
improvements were significant at p < 0.01, and a “*” indicates improvements were signifi-
cant at p < 0.05. The highest effectiveness achieved for each measure on each collection is
highlighted in bold.
ANU (Popular): Content: APR > DPR, URL. Anchor-text: -. Content+Anchor-text: -.
WT10gC (Random): Content: DPR > IDG; APR > IDG; URL > IDG, DPR, APR. Anchor-text: URL > IDG, DPR, APR. Content+Anchor-text: URL > IDG, DPR, APR.
WT10gT (Random): Content: IDG > APR; DPR > APR; URL > IDG, DPR, APR. Anchor-text: APR > IDG, DPR; URL > IDG, DPR, APR. Content+Anchor-text: URL > IDG, DPR, APR.
VLC2P (Popular): Content: -. Anchor-text: APR > IDG, DPR. Content+Anchor-text: URL > IDG.
VLC2R (Random): Content: IDG > APR; URL > IDG, DPR, APR. Anchor-text: DPR > IDG; URL > IDG, DPR, APR. Content+Anchor-text: IDG > APR; URL > IDG, DPR, APR.
Table 7.8: Significant differences between methods when using Optimal re-rankings. Each
(non-random) method was compared against each of the others in turn and differences were
tested for significance using the Wilcoxon test. Each significant difference found is shown
with the direction of the difference.
7.4 Score-based re-ranking
These experiments investigate the effectiveness of a score-based re-ranking of base-
lines using query-independent evidence.
7.4.1 Setting score cutoffs
For the realistic score-based re-rankings the same cutoff was applied to all queries.
Suitable score cutoffs were determined for WT10gC by plotting S@5 effectiveness
against potential cutoff values (see Figures 7.2 and 7.3) and recording the optimal
cutoff for each form of query-independent evidence. The other collections were then
re-ranked using this cutoff. Optimal cutoffs were calculated at S@5 due to the instability of S@1 (see footnote 5) and the smaller effectiveness gains observed at S@10.
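As a concrete illustration of the procedure, the sketch below re-orders only those documents whose baseline score is at least the chosen percentage of the top baseline score, sorting them by query-independent evidence while documents with equal query-independent scores retain their baseline order (as noted in Section 7.5.1); the remainder of the ranking is left untouched. This is a minimal reading of the method, not the exact implementation used in the experiments.

    # Sketch: score-based re-ranking above a fixed cutoff (illustrative only).
    def score_based_rerank(baseline, qie_score, cutoff_fraction):
        """baseline: list of (doc_id, baseline_score), best first.
        qie_score: dict mapping doc_id -> query-independent evidence score.
        cutoff_fraction: e.g. 0.337 for the 33.7% URL cutoff trained on WT10gC."""
        if not baseline:
            return []
        threshold = cutoff_fraction * baseline[0][1]   # % of maximum baseline score
        head = [d for d in baseline if d[1] >= threshold]
        tail = [d for d in baseline if d[1] < threshold]
        # Stable sort: documents with equal query-independent scores keep baseline order.
        head.sort(key=lambda d: qie_score.get(d[0], 0.0), reverse=True)
        return head + tail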
7.4.2 Results
Tables 7.10, 7.11 and 7.12 show the results of the score-based re-ranking of content and
anchor-text baselines. From these results it can be observed that:
1. URL re-ranking provided significant improvements over all three baselines for
WT10gT, VLC2P and VLC2R.
2. URL re-ranking performance is only surpassed by APR on the ANU collection
(at S@1) where APR used very high quality bookmarks.
3. None of the hyperlink-recommendation based schemes provided a significant
improvement over the anchor-text baseline.
5 S@1 is equivalent to P@1; the instability of Precision at 1 is discussed in Section 2.6.6.1.
Collection Measure Best Cont. Best Anch Best Cont+Anch
ANU S@1 0.76 0.90 0.88
S@5 0.92 0.98 0.98
S@10 0.94 0.98 0.98
QIE URL ALL URL,DPR
WT10gC S@1 0.82 0.73 0.94
S@5 0.93 0.84 0.97
S@10 0.93 0.84 0.97
QIE URL URL URL
WT10gT S@1 0.84 0.74 0.84
S@5 0.90 0.78 0.94
S@10 0.92 0.79 0.95
QIE URL URL URL
VLC2P S@1 0.71 0.84 0.90
S@5 0.87 0.91 0.97
S@10 0.89 0.91 0.98
QIE URL APR URL
VLC2R S@1 0.73 0.68 0.84
S@5 0.87 0.74 0.93
S@10 0.88 0.76 0.93
QIE URL URL URL
Table 7.9: Summary of Optimal re-ranking results. The highest effectiveness achieved by
each method is highlighted in bold. The “QIE” row indicates the query-independent evidence
that performed best.
[Two plots: success rate at 5 (%) against the cutoff, expressed as a percentage of the maximum baseline score, with curves for URL, APR, Indeg and DPR.]
Figure 7.2: Setting score-based re-ranking cutoffs for the content (top) and anchor-text (bot-
tom) baselines using the WT10gC collection. The vertical lines represent the chosen cutoff
values, which were then used in all score-based re-ranking experiments. If the optimal cutoff
spanned multiple values then the mean of those values was used. Numerical cutoff scores are
provided in Tables 7.10 and 7.11.
[Plot: success rate at 5 (%) against the cutoff, expressed as a percentage of the maximum baseline score, with curves for APR, URL, Indeg and DPR.]
Figure 7.3: Setting score-based re-ranking cutoffs for the content+anchor-text baseline us-
ing the WT10gC collection. The vertical lines represent the chosen cutoff values, which were
then used in all score-based re-ranking experiments. If the optimal cutoff spanned multiple
values then the mean of those values was used. Numerical cutoff scores are provided in Ta-
ble 7.12.
4. For the popular query sets (ANU and VLC2P) the anchor-text baseline with URL re-ranking produced the best performance, although it only narrowly outperformed the corresponding content+anchor-text re-ranking.
5. For the random query sets (WT10gT and VLC2R) the content+anchor-text base-
line with URL re-ranking produced the best performance, with the URL re-
ranking of the content baseline performing better than the anchor-text re-ranking.
6. In the absence of very high quality bookmarks (i.e. on every collection except
for the ANU), APR performance was very similar to that of the other hyperlink
recommendation techniques.
Coll. Meas. Base IDG DPR APR URL
(at 20.6%) (at 17.4%) (at 14.1%) (at 33.7%)
ANU S@1 0.29 0.36 0.29 0.48 0.39
ANU S@5 0.50 0.60 0.52 0.67 0.73
ANU S@10 0.58 0.73 0.60 0.72 0.83
ANU Sig. - - - ** **
WT10gC S@1 0.23 0.36 0.38 0.33 0.71
WT10gC S@5 0.45 0.67 0.58 0.59 0.88
WT10gC S@10 0.55 0.73 0.67 0.65 0.90
WT10gT S@1 0.22 0.46 0.41 0.32 0.70
WT10gT S@5 0.48 0.64 0.59 0.62 0.83
WT10gT S@10 0.59 0.71 0.69 0.65 0.88
WT10gT Sig. - - - - **
VLC2P S@1 0.27 0.38 0.42 0.41 0.56
VLC2P S@5 0.51 0.61 0.61 0.63 0.68
VLC2P S@10 0.61 0.70 0.70 0.76 0.76
VLC2P Sig. - - - ** **
VLC2R S@1 0.16 0.26 0.20 0.22 0.62
VLC2R S@5 0.36 0.47 0.44 0.45 0.82
VLC2R S@10 0.44 0.56 0.52 0.53 0.83
VLC2R Sig. - - - - **
Table 7.10: Score-based re-ranking results for content. Cutoffs (shown as “(at ?)”) were
obtained by training on WT10gC at S@5. “Sig.” reports the statistical significance of the im-
provements. Significance is tested using the Wilcoxon matched-pairs signed ranks test. The
Wilcoxon test compares the full document ranking, and so only a single significance value
is reported per type of evidence, per collection. A “**” indicates improvements were signifi-
cant at p < 0.01, and a “*” indicates improvements were significant at p < 0.05. The highest
effectiveness achieved at each measure for each collection is highlighted in bold.
Coll. Meas. Base IDG DPR APR URL
(at 15.5%) (at 11.1%) (at 15.6%) (at 20.4%)
ANU S@1 0.72 0.77 0.74 0.83 0.78
ANU S@5 0.96 0.95 0.94 0.96 0.98
ANU S@10 0.97 0.98 0.98 0.98 0.98
ANU Sig. - - - - -
WT10gC S@1 0.47 0.50 0.51 0.51 0.67
WT10gC S@5 0.69 0.71 0.71 0.71 0.76
WT10gC S@10 0.72 0.72 0.72 0.72 0.76
WT10gT S@1 0.53 0.51 0.52 0.47 0.65
WT10gT S@5 0.68 0.70 0.68 0.70 0.73
WT10gT S@10 0.72 0.72 0.72 0.73 0.74
WT10gT Sig. - - - - **
VLC2P S@1 0.70 0.69 0.70 0.73 0.81
VLC2P S@5 0.86 0.84 0.84 0.85 0.89
VLC2P S@10 0.87 0.86 0.88 0.86 0.91
VLC2P Sig. - - - - **
VLC2R S@1 0.48 0.48 0.46 0.41 0.66
VLC2R S@5 0.67 0.70 0.71 0.69 0.73
VLC2R S@10 0.72 0.73 0.72 0.70 0.76
VLC2R Sig. - - - - **
Table 7.11: Score-based re-ranking results for anchor-text. Cutoffs (shown as “(at ?)”) were
obtained by training on WT10gC at S@5. “Sig.” reports the statistical significance of the im-
provements. Significance is tested using the Wilcoxon matched-pairs signed ranks test. The
Wilcoxon test compares the full document ranking, and so only a single significance value
is reported per type of evidence, per collection. A “**” indicates improvements were signifi-
cant at p < 0.01, and a “*” indicates improvements were significant at p < 0.05. The highest
effectiveness achieved at each measure for each collection is highlighted in bold.
Coll. Meas. Base IDG DPR APR URL
(at 10.3%) (at 6.9%) (at 10%) (at 31.7%)
ANU S@1 0.63 0.71 0.64 0.70 0.69
ANU S@5 0.81 0.84 0.82 0.86 0.88
ANU S@10 0.86 0.89 0.86 0.89 0.91
ANU Sig. - * - * *
WT10gC S@1 0.45 0.51 0.49 0.53 0.79
WT10gC S@5 0.71 0.77 0.73 0.75 0.92
WT10gC S@10 0.83 0.83 0.83 0.82 0.94
WT10gT S@1 0.48 0.51 0.52 0.41 0.72
WT10gT S@5 0.71 0.68 0.70 0.67 0.86
WT10gT S@10 0.75 0.77 0.78 0.75 0.89
WT10gT Sig. - - - - **
VLC2P S@1 0.67 0.65 0.68 0.68 0.68
VLC2P S@5 0.85 0.87 0.86 0.86 0.88
VLC2P S@10 0.88 0.91 0.90 0.91 0.93
VLC2P Sig. - - - - *
VLC2R S@1 0.40 0.42 0.42 0.34 0.75
VLC2R S@5 0.62 0.61 0.59 0.61 0.87
VLC2R S@10 0.66 0.70 0.67 0.69 0.89
VLC2R Sig. - - - - **
Table 7.12: Score-based re-ranking results for content+anchor-text. Cutoffs (shown as “(at
?)”) were obtained by training on WT10gC at S@5. “Sig.” reports the statistical significance
of the improvements. Significance is tested using the Wilcoxon matched-pairs signed ranks
test. The Wilcoxon test compares the full document ranking, and so only a single significance
value is reported per type of evidence, per collection. A “**” indicates improvements were
significant at p < 0.01, and a “*” indicates improvements were significant at p < 0.05. The
highest effectiveness achieved at each measure for each collection is highlighted in bold.
7.5 Interpretation of results
Coll. Type B'mark quality Optimal best S@5 Score-based best S@5 S@1 improve S@5 improve S@10 improve Sig.
ANU Pop. v.High 0.98 (AT+*) 0.98 (AT+URL) 7.7% (0.72→0.78) 2.0% (0.96→0.98) 0% (0.98→0.98) -
WT10gT Rand. Low 0.88 (C+AT+URL) 0.85 (C+AT+URL) 29% (0.48→0.68) 16% (0.71→0.85) 14% (0.75→0.87) **
VLC2P Pop. High 0.97 (C+AT+URL) 0.88 (AT+URL) 14% (0.69→0.79) 3% (0.85→0.88) 4% (0.86→0.90) **
VLC2R Rand. Low 0.93 (C+AT+URL) 0.87 (C+AT+URL) 47% (0.40→0.75) 29% (0.62→0.87) 26% (0.66→0.89) **
Table 7.13: Numerical summary of re-ranking improvements. “Sig.” reports the statistical
significance of the improvements. Significance is tested using the Wilcoxon matched-pairs
signed ranks test. The Wilcoxon test compares the full document ranking, and so only a
single significance value is reported per type of evidence, per collection. A “**” indicates
improvements were significant at p < 0.01, and a “*” indicates improvements were signif-
icant at p < 0.05. The percentile realistic improvements are calculated as a percentage im-
provement over the best baseline. “AT+*” denotes a combination of anchor-text with any
of the query independent evidence examined here. “AT+URL” denotes a combination of
anchor-text with URL-type query-independent evidence. “AT+APR” denotes a combination
of anchor-text with APR query-independent evidence. “C+URL” denotes a combination of
content with URL-type query-independent evidence. “C+AT+URL” denotes a combination of
content+anchor-text with URL-type query-independent evidence.
7.5.1 What query-independent evidence should be used in re-ranking?
The Optimal combination results show that re-rankings by all of the query-
independent methods considered significantly improve upon the random control for
the content baseline. For all random query sets, URL re-ranking of the anchor-text base-
line significantly improves upon the random control. Further, many of content+anchor-
text baseline re-rankings are significant. Results are quite stable across collections de-
spite differences in their scale.
Naturally, the benefits of the realistic score-based re-rankings are smaller, but the
URL method in particular achieves substantial gains over all baselines, as shown in
Table 7.13. It is clear that classification of URL-type is of considerable value in a home
page finding system. Section 7.6.2 examines whether the URL-type classifications em-
ployed in this experiment are optimal.
It is of interest that URL re-ranking results for the ANU collection are poorer than
for the other collections. Although investigation confirmed UTwente/TNO’s order-
ing, i.e. “Root” (36/137) > “Subroot” (50/862) > “Directory” (72/1059) >
"File" (40/382 274) (see footnote 6), the ratio for the URL "Subroot" class was higher than for the other collections.
It should be noted that URL re-ranking would be of little use in webs in which
URLs exhibit no hierarchical structure. For example, some organisations publish
URLs of the form xyz.org/getdoc.cgi?docid=9999999. Such URLs include no
potential “Subroot” or “Directory” URL break-downs.
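The four-tier classification itself can be sketched as follows. The rules below are a plausible reading of the UTwente/TNO scheme used in this chapter (a bare host is "Root", a single trailing directory is "Subroot", deeper trailing directories are "Directory", and a URL ending in a filename, such as the getdoc.cgi example above, is "File"); the exact rules applied in the experiments are those defined earlier in the thesis.

    # Sketch: four-tier URL-type classification (Root > Subroot > Directory > File).
    from urllib.parse import urlparse

    def url_type(url):
        path = urlparse(url).path
        # Treat a default index page as equivalent to its enclosing directory
        # (an assumption; the exact normalisation used in the experiments may differ).
        for index in ("index.html", "index.htm", "default.htm"):
            if path.endswith(index):
                path = path[: -len(index)]
        segments = [s for s in path.split("/") if s]
        if not segments:
            return "Root"
        if not path.endswith("/"):
            return "File"                      # ends in a filename (or a CGI script)
        return "Subroot" if len(segments) == 1 else "Directory"

    # Example: url_type("http://xyz.org/getdoc.cgi?docid=9999999") returns "File".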
In the experiments within this chapter the baseline ordering was preserved if the re-ranking scores were equal. Such equality occurred more often for URL-type scores, which could take only one of four distinct values. To confirm that the superiority of URL-type re-ranking was not an artifact of this quantisation, hyperlink recommendation scores were also quantised into four groups (see footnote 7) and the effectiveness of the quantised scores was evaluated. The quantisation of hyperlink recommendation scores decreased retrieval effectiveness, indicating that it is unlikely that URL-type has an unfair advantage due to quantisation.
Hyperlink recommendation results indicate these schemes may have relatively lit-
tle role to play in home page finding tasks using re-ranking based combination meth-
ods for corpora within the range of sizes studied here (400 000 to 18.5 million pages).
The full-text (content) baseline improvements when using hyperlink recommendation
scores as a minimum threshold for document retrieval, or in an Optimal re-ranking
of the query-dependent baselines, were encouraging. By contrast, the performance
improvements over the anchor-text baseline were minimal. This suggests that most
of the potential improvement offered by hyperlink recommendation methods is al-
ready exploited by the anchor-text baseline. In most of the score-based re-rankings it
is almost impossible to differentiate between the re-ranking of the anchor-text base-
line and the baseline itself. The extent to which hyperlink recommendation evidence
is implicit in anchor-text evidence is considered in the next chapter.
Throughout the experiments in-degree appeared to provide more consistent per-
formance improvements than APR or DPR. APR performed well when using high-
quality bookmark sets, but did not improve performance when using lower qual-
ity bookmark sets on random (WT10gT and VLC2R) query sets. The improvement
achieved by these methods relative to the anchor-text baselines was not significant.
The difference in effectiveness of the two PageRank variants shows that PageRank's
contribution to home page finding on corpora of this size is highly dependent upon
the choice of bookmark pages. However, even for popular queries (ANU and VLC2P),
APR results are generally inferior to those of URL re-rankings. Of the three hyperlink
recommendation methods in-degree may be the best choice, as the PageRank variants
offer little or no advantage and are more computationally expensive.
In conclusion, the results of these experiments show the best query-independent
evidence to be URL-type.
6 Note that in these figures all URLs (including equivalent URLs) were considered.
7 I.e. similar scores were grouped to reduce the number of possible values.
7.5.2 Which query-dependent baseline should be used?
In the experiments, prior to re-ranking, the anchor-text baseline generally outper-
formed the content and content+anchor-text baselines. However, on two collections,8
URL-type re-rankings of full-text (content) outperformed similar re-rankings of
anchor-text. In these two cases the target home pages were randomly chosen. This
effect was not observed for the popular targets, although the content+anchor-text per-
formance was comparable to that of anchor-text only.
Figure 7.4 illustrates the difference between the random and popular sets by plotting
S@n against n for the content and anchor-text baselines. For the popular query set, the
two baselines converge at about n = 500, but for the random set the content baseline is
clearly superior for n > 150. The plot for VLC2R is similar to that observed in a pre-
vious study of content and anchor-text performance on the WT10gT collection [135].
An explanation for the observed increase in effectiveness of the content baseline
for n > 150 is that while anchor-text rankings are better able to discriminate be-
tween home pages and other relevant pages, full anchor-text rankings are shorter9
than those for content. Some home pages have no useful incoming anchor-text and
therefore do not appear anywhere in the anchor-text ranking. By contrast, most home
pages do contain some form of site name within their content and will eventually
appear in the content ranking.
Selecting queries from a directory within the collection guarantees that the anchor
document for the target home page will not be empty, but there is no such guarantee
for randomly chosen home pages. Selection of home pages for listing in a directory
is undoubtedly biased toward useful, important or well-known sites which are also
more likely to be linked to from other pages (Experiments in Chapter 5 observed that
PageRank does favour popular pages). It should be noted that incoming home page
queries would probably also be biased toward this type of site.
In conclusion, the results of the experiments show the content+anchor-text base-
line to be the most consistent performer across all tasks, and to perform particularly
well when combined with URL-type evidence.
7.6 Further experiments
Having established the principal results above, a series of follow-up experiments was
conducted. In particular these investigated:
• to what extent results can be understood in terms of rank and score distributions;
• whether other classifications of URL-type provide similar, or superior, gains in
retrieval effectiveness;
8 Of the four evaluated. The WT10gC test collection is not included as it was used to train the re-ranking cutoffs.
9 Ignoring documents that achieve a score of zero.
[Two plots: success rate @ n (%) against the number of documents n (log scale), with curves for the content (Base C) and anchor-text (Base AT) baselines.]
Figure 7.4: Baseline success rates across different cutoffs. The top plot is for VLC2P, the
VLC2 crawl with a popular home page query set. The bottom plot is for VLC2R, the same
crawl, but with a random home page query set. The anchor-text baseline performs well for n up to about 150 for both collections. In VLC2P, at around S@150 the anchor-text baseline performance approaches the content baseline performance. In VLC2R the anchor-text performance is surpassed by the content performance at around S@150. These plots are consistent with the S@1000 values reported in Table 7.1.
• to what extent the PageRanks and in-degrees are correlated with those reported
by Google; and
• whether the use of anchor-text and link graph information external to the corpus
could improve retrieval effectiveness.
7.6.1 Rank and score distributions
This section analyses the distribution of correct answers for each type of evidence over
the WT10gC collection.
The content and anchor-text baseline rankings of the correct answers are plotted
in Figure 7.5. For over 50% of queries, both the content and anchor-text baselines contain the correct answer within the top ten results. Anchor-text provides the better scoring of the two baselines, with the correct home page ranked as the top result for almost 50% of the queries. This confirms the effectiveness of anchor-text for home page finding [15, 56].
The PageRank distributions are plotted in Figure 7.6. The distribution of the De-
mocratic PageRank scores for all pages follows a power law. In contrast, the PageRank
distribution for correct answers is much more even, with the proportion of pages that
are correct answers increasing at higher PageRanks. There are many pages which
do not achieve an APR score. Merely having an APR score > 0 gives some indica-
tion that a page is a correct answer in the WT10gC collection. These plots indicate
that both forms of PageRank provide some sort of home page evidence (as observed
in Chapter 5), even though these computed PageRank values differ markedly from
those mined from the Google toolbar in Chapter 5. This large difference re-affirms the
belief that PageRanks reported by the Google toolbar have been heavily transformed.
The in-degree distribution is plotted at the top of Figure 7.7 and is similar to the
Democratic PageRank distribution. However, the graph is slightly shifted to the left,
indicating that there are more pages with low in-degrees than there are pages with
low PageRanks. The distribution of correct answers is spread across in-degree scores,
with the proportion of pages that are correct answers increasing at higher in-degrees.
This shows that in-degree also provides some sort of home page evidence.
The URL-type distribution is plotted at the bottom of Figure 7.7. URL-type is a
particularly useful home page indicator for this collection, with a large proportion of
the correct answers located in the “Root” class and few correct answers located within
the “File” class.
7.6.2 Can the four-tier URL-type classification be improved?
This section evaluates how combining the four URL-type classes and introducing
length and directory depth based scores impacts retrieval effectiveness. The results
for this series of experiments are presented in Table 7.14.
None of the new URL-type methods significantly improved upon the performance
of the original URL-type classes (“Root” > “Subroot” > “Directory” > “File”). How-
ever, combining the “Subroot” and “Directory” classes did not adversely affect URL-
[Two histograms: number of documents against the rank of the correct answer (ranks 1 to 10, and >10).]
Figure 7.5: Baseline rankings of the correct answers for WT10gC (content top, anchor-text
bottom). The correct answer is retrieved within the top ten results for over 50% of queries on
both baselines. The anchor-text baseline has the correct answer ranked as the top result on
almost 50% of the queries.
[Two log-log plots: number of documents against normalised Democratic (top) and Aristocratic (bottom) PageRank score (quantised to 40 steps), for All pages and Correct answers.]
Figure 7.6: PageRank distributions for WT10gC (DPR top, APR bottom). These plots con-
tain the distribution of all pages in the collection (All) and the distribution of the 100 correct
answers (Correct). The distribution of the DPR scores for all pages follows a power law. In contrast, the correct answers are spread more evenly across DPR scores. Therefore the proportion of pages which are correct answers increases at higher PageRanks. Approximately 17% of pages do not achieve an APR score, thus merely having an APR score > 0 is some indication that a page is more likely to be a correct answer.
[Top: log-log plot of the number of documents against normalised in-degree score (quantised to 40 steps), for All pages and Correct answers. Bottom: percentage of documents in each URL type (file, path, subroot, root), for All pages and Correct answers.]
Figure 7.7: Other distributions for WT10gC (in-degree top, URL-type bottom). The top plot
contains the in-degree distribution for all pages (All) and the 100 correct answers (Correct).
The distribution of the in-degree scores for all pages follows a power law. In contrast, the correct answers are spread more evenly across in-degree scores. The proportion of pages which are correct answers increases at higher in-degree scores. The bottom plot contains the URL-type distribution (in percentages) of all pages (All) and the correct answers (Correct). The "Root" tier contains only 1% of the pages in the collection, but 80% of the correct answers. In contrast, the "File" tier contains 92% of the collection's pages, but only 5% of the correct answers.
Dataset Baseline R>S>D>F Length Dir Depth R>S+D+F R>S+D>F R>S>D(l)>F
ANU content 87 88 68 62 77 87
ANU anchor-text 98 98 98 97 98 98
WT10gC content 89 90 72 83 89 89
WT10gC anchor-text 82 83 75 78 82 82
WT10gT content 88 88 74 80 85 88
WT10gT anchor-text 77 79 74 75 77 77
VLC2P content 87 86 68 81 84 87
VLC2P anchor-text 89 92 87 89 89 90
VLC2R content 87 86 62 82 85 87
VLC2R anchor-text 74 76 73 73 74 74
Table 7.14: S@5 for URL-type category combinations, length (how long a URL is, favouring
short directories) and directory depth (how many directories the URL contains, favouring
URLs with shallow directories). R represents the “Root” tier, S represents the “Subroot” tier,
D is for the "Directory" tier and F is for the "File" tier. D(l) indicates that directories were ranked according to length (where shorter directories are preferred). In all cases an Optimal
re-ranking of baselines by query-independent evidence was performed.
type effectiveness. A high level of effectiveness was also obtained using a simple URL
length measure. This measure ranked pages according to the length of their URLs (in
characters, and favouring short URLs). “File” URLs contain filenames and are thereby
longer than their “Root” and “Directory” counterparts, which may explain the good
performance of the URL length measure. Re-ranking baselines using only the URL
directory depth (number of slashes in the URL) performed relatively poorly.
In conclusion, when using URL-type scores for home page finding tasks it is im-
portant to distinguish between “Root”, “Directory” and “File” pages. This can be
done either explicitly through a categorisation of URL-types or by measuring the
length of the URL.
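The two simpler measures from Table 7.14 are easily stated in code; a minimal sketch is given below, in which smaller raw values are better and scores are therefore negated so that sorting in decreasing score order favours short, shallow URLs. This mirrors the descriptions above rather than the exact implementation used in the experiments.

    # Sketch: URL length and directory depth measures (smaller is better, so negate).
    from urllib.parse import urlparse

    def url_length_score(url):
        """Favour short URLs: score by negative character length."""
        return -len(url)

    def directory_depth_score(url):
        """Favour shallow URLs: score by negative number of slashes in the path."""
        return -urlparse(url).path.count("/")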
7.6.3 PageRank and in-degree correlation
The results in Table 7.15 show that DPR and in-degree are highly correlated, but that
the correlation tends to weaken as the size of the corpus increases. This weaker as-
sociation as corpus size increases suggests that PageRank might have quite different
properties when calculated for very large crawls. Google’s PageRank, based on 50
to 100 times more documents than are in VLC2, is likely to be different and possi-
bly superior to the PageRanks studied here. In addition, Google may use a different
PageRank variant and different bookmarks.
To understand the relationship between the PageRank values calculated in ex-
periments, and the PageRank employed by the Google WWW search engine, scores
were compared with the Google PageRanks reported for all 201 ANU pages listed
in the Google Directory.10 For those pages, PageRanks were extracted from Google’s
10 A version of the manually constructed DMOZ open WWW directory which reports Google PageRanks. The Google DMOZ Directory is available at http://directory.google.com
DPR APR No. of pages (millions)
ANU 0.836 0.448 0.40
WT10g 0.71 0.555 1.69
VLC2 0.666 0.164 18.57
Table 7.15: Correlation of PageRank variants with in-degree. The correlation was tested
using the Pearson r significance test.
DMOZ directory and in-degrees were extracted using the Google link: query op-
erator. Google PageRank and in-degree were correlated (r=0.358), as they were for
ANU, WT10g and VLC2. Also, the correlation between Google in-degree and ANU
in-degree was very strong (r=0.933). Google’s in-degrees, based on a much larger
crawl, were only three times larger than those from the ANU crawl (during link count
extraction the difficulties outlined in Section 5.1.3 were encountered).
While Google PageRank and ANU PageRank were correlated over the 201 obser-
vations, the correlation was less strong than for in-degree (DPR r=0.26, APR r=0.31).
This indicates that Google PageRank is different from the PageRanks studied here
(as observed in Section 5.1.1). Note that only five different values of PageRank were
reported by Google for the 201 pages (11, 16, 22, 27 and 32 out of 40). The directory-
based PageRanks are on a different scale to those extracted using the Google Toolbar
in Chapter 5, and both have been transformed and quantised from Google’s internal
PageRank values.
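The correlations reported here and in Table 7.15 are Pearson product-moment correlations over paired per-page scores; a minimal sketch, assuming scipy is available, is:

    # Sketch: Pearson correlation between two query-independent measures.
    from scipy.stats import pearsonr

    def correlate(score_a, score_b):
        """score_a, score_b: dicts mapping page identifier -> score (e.g. DPR, in-degree)."""
        common = sorted(set(score_a) & set(score_b))
        r, p_value = pearsonr([score_a[u] for u in common],
                              [score_b[u] for u in common])
        return r, p_value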
Although this study may not be directly applicable to very large crawls, its re-
sults are quite stable for a range of smaller multi-server crawls. The range of sizes of
corpora examined here (400 000 to 18.5 million pages) is typical of many enterprise webs and thus interesting both scientifically and commercially.11
7.6.4 Use of external link information
To explore the effects of increasing corpus size, a series of hybrid WT10g/VLC2 runs
was performed. This is potentially revealing because the WT10g corpus is a subset
of the VLC2 corpus. The runs, shown in Table 7.16, used combinations of WT10g
corpus data and VLC2 link information. The hypothesis was that by using link tables
from the larger corpus it would be possible to obtain a more complete link graph
and thereby improve the performance of the hyperlink recommendation and anchor-
text measures (due to a potential increase in the hyperlink votes, and the amount of
11 The rated capacities of the two Google search appliances are in fact very similar to these sizes (150 000 and 15 million pages), see http://www.google.com/appliance/products.html.
available anchor-text). During these hybrid runs all VLC2 anchor-text that pointed to
pages outside the WT10g corpus was removed.
WT10g anchor-text (DPR: none / WT10g / VLC2) VLC2 anchor-text (DPR: none / WT10g / VLC2)
WT10gC 0.69 0.72 0.69 0.78 0.79 0.78
WT10gT 0.68 0.71 0.71 0.72 0.72 0.73
Table 7.16: Using VLC2 links in WT10g. Note that the WT10g collection is a subset of the
VLC2 collection. The WT10g anchor-text scores are the baselines used throughout all other ex-
periments in this chapter. The VLC2 anchor scores are new rankings that use external anchor-
text from the VLC2 collection. WT10g DPR is a Democratic PageRank re-ranking using the
link table from the WT10g collection. VLC2 DPR is a Democratic PageRank re-ranking using
the link table from the VLC2 collection. The use of the (larger) VLC2 link table DPR scores did
not significantly improve the performance of DPR re-ranking. The use of external anchor text,
taken from the VLC2 collection, provided significant performance gains.
Surprisingly, the use of the (larger) VLC2 link table DPR scores did not noticeably
improve the performance of DPR re-ranking. However, the use of external anchor-
text, taken from the VLC2 corpus, provided significant performance gains. This would
suggest that in situations where an enterprise or small web has link information for
a larger web, benefits will be seen if the anchor-text from the external link graph is
recorded and used for the smaller corpus.12
The WT10g collection is not a uniform sample of VLC2, but was engineered to
maximise the interconnectivity of the documents selected [15]. Hence the effects of
scaling up may be smaller than would be expected in other web corpora.
7.7 Discussion
Using query-independent evidence scores as a minimum threshold for page inclusion
appears to be a useful method by which system efficiency can be improved without
significantly harming home page finding effectiveness. The use of hyperlink recom-
mendation evidence as a threshold resulted in a reduction of 10% of the corpus without any change in retrieval effectiveness. By comparison, using a URL-type threshold of "> File", corpus size was reduced by over 90%, and retrieval effectiveness was significantly improved for two out of three collections.
12 This was later investigated further by Hawking et al. [115] who found that the use of external anchor-text did not improve retrieval effectiveness.
Re-ranking query-dependent baselines (both content and anchor-text) on the basis
of URL-type produced consistent benefit. This heuristic would be a valuable compo-
nent of a home page finding system for web corpora with explicit hierarchical struc-
ture.
By contrast, in these experiments, unless Optimal re-ranking is used, hyperlink-
based recommendation schemes do not achieve significant effectiveness gains. Even
on the WT10gC collection, on which the re-ranking cutoffs were trained, the recom-
mendation results were poor. For corpora of up to twenty million pages, the hyper-
link recommendation methods do not appear to provide benefits in document rank-
ing for a home page finding task. Similarly, little benefit has previously been found
for relevance-based retrieval in the TREC web track [121]. An alternative means of biasing ranking towards pages that are heavily linked to, re-weighting the anchor-text ranking formula to favour large volumes of anchor-text, is investigated in Chapter 8.
An ideal home page finding system would exploit both anchor-text (for superior
performance when targeting popular sites) and document full-text information (to
ensure that home pages with inadequate anchor-text are not missed). While the pre-
liminary content+anchor-text baseline presented here goes some way to investigat-
ing combined performance, further work is needed to better understand whether this
combination is optimal. Further examination is required to determine how to provide
the best all-round search effectiveness when home page queries are interspersed with
other query types. Additional work is also required to determine whether evidence
useful in home page finding is useful for other web retrieval tasks (such as Topic
Distillation). These issues are investigated in Chapter 9, through the description and
evaluation of a first-cut general-purpose document ranking function that incorporates
web evidence.
Chapter 8
Anchor-text in web search
Full-text ranking algorithms have been used to score aggregate anchor-text evidence
with some success, both in experiments within this thesis (see Chapters 6 and 7), and
in experiments reported elsewhere [56]. When comparing the textual contents of doc-
ument full-text and aggregate anchor-text it is clear that, in many cases, they differ
markedly. For example, aggregate anchor-text sometimes contains extremely high
rates of term repetition. Excessive term repetition may make a negligible (or even neg-
ative1) contribution to full-text evidence, but may be a useful indicator in anchor-text
evidence. This is because each term occurrence could indicate an independent “vote”
from an external author that the document is a worthwhile target for that term.
This chapter examines whether the Okapi BM25 and Field-weighted Okapi BM25
ranking algorithms, previously used with success in scoring both document full-text
and aggregate anchor-text [56, 173], can be revised to better match anchor-text evi-
dence. The investigation is split into three sections. The first section presents an inves-
tigation of how the Okapi BM25 full-text ranking algorithm is applied when scoring
aggregate anchor-text. This includes an analysis of how the document and collection
statistics used in BM25 (and commonly used in other full-text ranking algorithms)
might be modified to better score aggregate anchor-text evidence. The second section
examines four different methods for combining the aggregate anchor-text evidence
with other document evidence. The third and final section provides an empirical in-
vestigation of the effectiveness of the revised scoring methods, for both combined
anchor-text and full-text evidence, and anchor-text alone.
8.1 Document statistics in anchor-text
This section examines how document statistics used in the Okapi BM25 ranking func-
tion, and other full-text ranking methods (see Section 2.3.2), apply to aggregate anchor-
text evidence.
1 As it could indicate a spam document, which was designed explicitly to be retrieved by the search system in response to that query term.
8.1.1 Term frequency
In full-text document retrieval, term frequency (tf ) is used to give some measure of
the “aboutness” of a document (see Section 2.3.1.2). The underlying assumption is
that if a document repeats a term many times, it is likely to be about that term.
The distribution of tf s in aggregate anchor-text appears to be quite different from
that in document full-text. For example, an analysis of the term distribution in anchor-
text and full-text for the "World Bank projects" home page (see footnote 2) illustrates how tf s can
differ markedly. In the aggregate anchor-text for this document the term “projects”
has a tf of 6798 (and makes up approximately 80% of all incoming anchor-text). By
comparison, in the document full-text the term “projects” has a tf of only 5 (and makes
up approximately 4% of the total document full-text).
As shown in Figure 8.1, when using the default term saturation parameter (k1 = 2) Okapi BM25 scores are almost flat beyond a tf of 10. This may not be a desirable property when scoring aggregate anchor-text, as each occurrence of a query term may be a separate vote that the term relates to the contents of the document. The early saturation of term contribution can be particularly problematic when combining document scores in a linear combination (see Section 2.5.1.1). Taking the "World Bank projects" home page again as an example: suppose another corpus document (of average length) has only 60 occurrences of the term "projects" in its incoming links (6738 fewer occurrences than in the "World Bank projects" home page anchor-text), but its full-text contains "projects" ten times (four more occurrences than in the full-text of the "World Bank projects" home page). When measures are combined using a linear combination of Okapi BM25 scores (with default k1 and b parameters), that page will outperform the home page.
Changing the rate of saturation for anchor-text, through modification of the Okapi
BM25 k1 value, is one method by which the impact of high aggregate anchor-text term
frequencies might be changed. For example, Figure 8.1 illustrates that given a higher
k1 value, the function saturates more slowly, thereby allowing for higher term counts
before complete function saturation. However, if this evidence is to be combined with
other document evidence (computed using different Okapi BM25 parameters) using
a linear combination, then scores have to be renormalised.
This analysis suggests that when scoring aggregate anchor-text evidence the use
of a much higher value of k1 may be effective.3 A change in saturation rate is ex-
plored below, through length normalising aggregate anchor-text contribution using
the length of document full-text.
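The saturation behaviour in Figure 8.1 can be reproduced with the term-weighting core of Okapi BM25. The sketch below assumes the common form in which a term contributes idf(t) · tf/(K + tf) with K = k1((1 − b) + b · dl/avdl), so the constants may differ slightly from the exact formulation in Section 2.3.1.3.

    # Sketch: BM25 term contribution as tf grows, for several values of k1.
    import math

    def bm25_term_weight(tf, dl, avdl, N, n_t, k1=2.0, b=0.75):
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5))
        K = k1 * ((1.0 - b) + b * dl / avdl)
        return idf * tf / (K + tf)

    # Reproduce the shape of Figure 8.1: a document of average length, N = 100000, n_t = 10.
    for k1 in (0, 1, 2, 10):
        scores = [bm25_term_weight(tf, dl=100, avdl=100, N=100000, n_t=10, k1=k1)
                  for tf in range(1, 21)]
        print(k1, [round(s, 2) for s in scores])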
8.1.2 Inverse document frequency
Inverse document frequency (idf ) is used in full-text ranking to provide a measure of
the frequency of term occurrence in documents within a corpus, and thereby a mea-
sure of the importance of observing a term in a document or query (see Section 2.3.1.2).
2 Located at: http://www.worldbank.org/projects/
3 Time did not permit confirmation of the benefits of this.
[Plot: BM25 document score against tf, with curves for k1 = 0, 1, 2 and 10.]
Figure 8.1: Document scores achieved by BM25 using several values of k1 with increasing tf, assuming a document of average length, and N = 100 000, nt = 10.
The idf measure is likely to be useful when scoring aggregate anchor-text (i.e.
assigning more weight to query terms that occur in fewer documents). However, it
is unclear whether idf values should be calculated across all document fields (see footnote 4) at once
(i.e. one idf value per document), or individually for each document field (i.e. one idf
value per field, per document). Accordingly two possible idf measures are proposed:
• Global inverse document frequency (gidf ): A single idf value is computed per
term.
• Field-based inverse document frequency (fidf ): Multiple idf values are computed per term, one per field (i.e. per type of query-dependent evidence).
There are situations in which gidf and fidf scores vary considerably. For example
while the term “Microsoft” occurs in 16 330 documents in the TREC WT10g corpus
(see Section 2.6.7.1), it occurs in the aggregate anchor-text for only 532 documents.
“Microsoft” would have a low gidf in WT10g because many documents in the corpus
mention it, but a relatively high fidf as few documents are the targets of anchor-text
containing that term. A comprehensive comparison of the effectiveness of gidf and
fidf measures was not performed, although a limited examination was performed as
part of revised anchor-text formulations.
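To make the distinction concrete, a minimal sketch of the two variants is given below; documents are assumed to be represented as dictionaries of per-field texts, and the idf form follows the Okapi formulation used elsewhere in this chapter.

    # Sketch: global (gidf) versus field-based (fidf) inverse document frequency.
    import math

    def gidf(term, docs):
        """docs: list of dicts mapping field name -> text, e.g.
        {"content": "...", "title": "...", "anchor": "..."}."""
        n_t = sum(1 for d in docs
                  if any(term in text.split() for text in d.values()))
        return math.log((len(docs) - n_t + 0.5) / (n_t + 0.5))

    def fidf(term, docs, field):
        """One idf value per field: count only documents whose given field contains the term."""
        n_t = sum(1 for d in docs if term in d.get(field, "").split())
        return math.log((len(docs) - n_t + 0.5) / (n_t + 0.5))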
A summary of the evaluated idf measures is presented in Table 8.1.
4 A field is a form of query-dependent evidence, for example document full-text, title or anchor-text.
Abbreviation Description Described in
BM25 Default Okapi BM25 Section 2.3.1.3
(calculates field-based idf values)
BM25gidf Okapi BM25 with global idf statistics, such Section 8.1.2
that idf is calculated only once using
all document fields
BM25FW Default Field-weighted Okapi BM25 Section 2.5.2.1
(a single global idf value is calculated
across all document fields)
BM25FWfidf Field-weighted Okapi BM25 with field-based Section 8.1.2
idf values
Table 8.1: Summary of idf variants used in ranking functions under examination.
8.1.3 Document length normalisation
Document length normalisation is used in full-text ranking algorithms to reduce bias
towards long documents. This bias occurs because the longer a piece of text, the
greater the likelihood that a particular query term will occur in it (see Section 2.3.1.3).
In Okapi BM25 the length normalisation function is controlled by b, with b = 1 en-
forcing strict length normalisation, and b = 0 removing length normalisation. Using
Okapi BM25 with the default length normalisation parameter (b = 0.75) [56, 186],
slightly longer documents are favoured. This was shown to be effective when scoring
document full-text in TREC ad-hoc tasks, as slightly longer (full-text) documents were
found to be more likely to be judged relevant [186] (described in Section 2.3.1.3).
The length of aggregate anchor-text is usually dependent on the number of in-
coming links. Therefore, applying length normalisation to aggregate anchor-text, and
thereby reducing the contribution of terms that occur in long aggregate anchor-text, is
in direct contrast to the use of hyperlink recommendation algorithms.
Aggregate anchor-text length is also much more variable than document full-text
length, with many documents having little or no anchor-text, and some having a very
large amount of incoming anchor-text (attributable to the power law distribution of
links amongst pages, see Section 2.4). In the TREC .GOV corpus (see Section 2.6.7.1)
the average full-text document length is around 870 terms.5 By comparison, the aver-
age aggregate anchor-text length is only 25 words.
An example of the negative effects of aggregate anchor-text length normalisation
can be studied for the query “USGS” on the .GOV corpus. Figure 8.2 contains the
aggregate anchor-text distribution for the home page of the United States Geolog-
5 Not including binary documents.
[Pie chart of the aggregate anchor-text term distribution: USGS 23%, SURVEY 10%, GEOLOGICAL 10%, US 10%, HOME 8%, Other 39%.]
Figure 8.2: Aggregate anchor-text term distribution for the USGS home page
(http://www.usgs.gov) from the .GOV corpus. This page has the highest in-degree of
all .GOV pages (around 88 000 links) and an aggregate anchor-text length of around 170 000
terms.
[Pie chart of the aggregate anchor-text term distribution: INFORMATION 50%, USGS 50%.]
Figure 8.3: Aggregate anchor-text term distribution for
‘‘http://nh.water.usgs.gov/USGSInfo’’ from the .GOV corpus. This page
has 243 incoming links, and an aggregate anchor-text length of around 486 terms.
ical Survey (USGS), the most highly linked-to document in the .GOV corpus. For
comparison, Figure 8.3 contains the aggregate anchor-text distribution for a “USGS
info” page (http://nh.water.usgs.gov/USGSInfo). The USGS home page has
around 170 000 terms in its aggregate anchor-text (from around 88 000 incoming
links), 34 000 of which (23%) are “USGS”. By contrast, http://nh.water.usgs.
gov/USGSInfo has 486 terms in its aggregate anchor-text (from 243 incoming links),
of which half (243) are “USGS”. Considering only aggregate anchor-text evidence
and using the default Okapi BM25 length normalisation parameter (b = 0.75), the
http://nh.water.usgs.gov/USGSInfo page outperforms the USGS home page
for the query “USGS”!
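The effect can be checked numerically using the same common BM25 term-weight form assumed above. With the figures quoted in the text (avdl = 25, b = 0.75, k1 = 2), the shorter aggregate anchor-text of the "USGS info" page does indeed outscore the USGS home page, while with b = 0 the ordering is reversed. The collection statistics N and n_t below are illustrative, since the idf factor is common to both pages and does not affect the comparison.

    # Sketch: anchor-text-only BM25 weights for the query "USGS" under b = 0.75 and b = 0.
    import math

    def bm25_term_weight(tf, dl, avdl, N, n_t, k1=2.0, b=0.75):
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5))
        K = k1 * ((1.0 - b) + b * dl / avdl)
        return idf * tf / (K + tf)

    # Figures taken from the text; N and n_t are illustrative (idf cancels in the comparison).
    pages = {"www.usgs.gov": {"tf": 34000, "dl": 170000},
             "nh.water.usgs.gov/USGSInfo": {"tf": 243, "dl": 486}}
    for b in (0.75, 0.0):
        for name, d in pages.items():
            w = bm25_term_weight(d["tf"], d["dl"], avdl=25, N=1000000, n_t=10000, b=b)
            print(f"b={b}: {name} scores {w:.3f}")
    # With b = 0.75 the USGSInfo page wins; with b = 0 the USGS home page wins.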
An illustration of the effects of Okapi BM25 length normalisation of aggregate
anchor-text (and other document fields) for a one term query, is presented in Fig-
ure 8.4. This Figure contains plots for both length normalised and unnormalised
Okapi BM25 scores for documents with three different proportions of matching terms.
The average document length (avdl) value is set to the average aggregate anchor-text
length in the .GOV corpus (25 terms). The idf value is set such that the probability of
encountering a term in a document is one-in-one-thousand (nt = 1000, N = 100 000).
The Okapi BM25 k1 parameter is set to 2. In the top plot document scores are length
normalised, and in the bottom they are not. The top plot shows that, when using a
default length normalisation (b = 0.75) value, it is impossible for a document with
only 25% of terms matching the query to be ranked above a document where 50% of
the terms match the query, even when comparing a tf of 5 to a tf of 2 500 000.
When using the Field-weighted Okapi BM25 method (BM25FW , described in
Section 2.5.2.1) the negative effects associated with aggregate anchor-text length nor-
malisation can be even more severe. The field-weighting method combines all doc-
ument evidence (including aggregate anchor-text) into a single composite document
and then uses the combined composite document length to normalise term contribu-
tion. Due to document length normalisation, it is unlikely that a document with a
large number of incoming links will be retrieved for any query. This is even the case
if the document full-text as well as the aggregate anchor-text mention the query term
more than any other document in the corpus.
A summary of the evaluated length normalisation techniques is presented in Ta-
ble 8.2.
8.1.3.1 Removing aggregate anchor-text length normalisation
One approach to dealing with the length normalisation issue outlined above is to
eliminate aggregate anchor-text length from consideration. In the Okapi BM25 for-
mulation length normalisation is controlled by the b constant, so length normalisation
can be removed by setting b = 0. The bottom plot of Figure 8.4 represents the Okapi
BM25 scores for documents with three different proportions of matching terms, with
no length normalisation (b = 0) and other parameters as specified in the section above
(avdl = 25, N = 100 000, nt = 1000, and k1 = 2). Without length normalisation the
proportion of terms that match is ignored and the sheer volume of matching anchor-
[Two plots: BM25 score (k1 = 2) against document length (log scale), with b = 0.75 (top) and b = 0 (bottom), for documents in which 75%, 50% and 25% of terms match the query.]
Figure 8.4: The effect of document length normalisation on BM25 scores for a single term
query. Each line represents a document containing some proportion of terms that match a
query term (i.e. 25% of terms match is a document where one-out-of-four document terms
match the query term). The graph illustrates the change in scores when the number of doc-
ument terms increases. For example, if 75% of terms match in a document that contains 1000
terms, a total of 750 term matches have been observed. BM25 scores are calculated assuming
an avdl of average aggregate anchor-text length in the .GOV corpus (25 terms), idf values are
calculated using N = 100 000 and nt = 1000, and k1 is set to 2. The top plot shows the Okapi
BM25 document scores when using the default length normalisation parameter (b = 0.75).
The bottom plot gives the Okapi BM25 scores without length normalisation (b = 0). For the
length normalised documents (top), even as the number of term matches are increased, the
“proportion” of terms that match the query is still the most important factor. By comparison
without length normalisation (bottom), only the raw frequency of term matches is important.
Abbreviation Description Described in
BM25 Default Okapi BM25 formulation Section 2.3.1.3
(length normalisation using the field length)
BM25nodln Okapi BM25 using no length normalisation. Section 8.1.3.1
BM25contdln Okapi BM25 using full-text length Section 8.1.3.2
to normalise score
BM25FW Default Field-weighted Okapi BM25 Section 2.5.2.1 &
(length normalised using the composite
document length, which is the
sum of all field lengths). Section 8.2.2
BM25FWnoanchdln Field-weighted Okapi BM25 length Section 8.1.3.1
normalised using lengths of every
field except for anchor-text
Table 8.2: Summary of document length normalisation variants in ranking functions under
examination.
text terms is considered. This favours documents that have a large number of incom-
ing links and that may therefore be expected to achieve high hyperlink recommenda-
tion scores. The revised formulation of Okapi BM25 (with k1 = 2) for a document D,
and a query Q, containing terms t is:
\[ BM25_{nodln}(D, Q) = \sum_{t \in Q} \frac{tf_{t,D} \times \log\left(\frac{N - n_t + 0.5}{n_t + 0.5}\right)}{2 + tf_{t,D}} \tag{8.1} \]
In the BM25FW formulation, the aggregate anchor-text length may be omitted
when computing the composite document length. In these experiments the removal of
aggregate anchor-text length in the BM25FW formulation is referred to as
BM25FWnoanchdln.
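A direct implementation of equation (8.1), as reconstructed above, is sketched below; the document is assumed to be represented by the term frequencies of its aggregate anchor-text.

    # Sketch: BM25 with no document length normalisation (equation 8.1, k1 = 2, b = 0).
    import math

    def bm25_nodln(query_terms, anchor_tf, N, n):
        """anchor_tf: term -> tf in the document's aggregate anchor-text.
        n: term -> number of documents containing the term (idf statistics)."""
        score = 0.0
        for t in query_terms:
            tf = anchor_tf.get(t, 0)
            if tf:
                idf = math.log((N - n[t] + 0.5) / (n[t] + 0.5))
                score += tf * idf / (2.0 + tf)
        return score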
8.1.3.2 Anchor-text length normalisation by other document fields
Rather than using the length of aggregate anchor-text to normalise anchor-text scores,
it might be more effective to normalise aggregate anchor-text using the length of an-
other document field. For example, the length of document full-text could be used to
normalise aggregate anchor-text term contribution. Document length is known to be
useful query-independent evidence for some tasks (see Section 2.3.1.2) [186].
In experiments within this chapter, the use of document full-text length when scor-
ing aggregate anchor-text in the Okapi BM25 formulation is referred to as BM25contdln.
This approach may be more efficient than using individual document lengths as only
the full-text document lengths needs to be recorded.
8.2 Combining anchor-text with other document evidence
Four methods for combining anchor-text baselines with other document evidence are
investigated:
• BM25LC : a linear combination of Okapi BM25 scores;
• BM25FW : a combination using Field-weighted Okapi BM25 (described in Sec-
tion 2.5.2.1);
• BM25HYLC : a linear combination of Okapi BM25 and Field-weighted Okapi
BM25 scores; and
• BM25FWSn, BM25FWSnI : a combination of the best scoring anchor-text snip-
pet (repeated according to in-degree in BM25FWSnI ) with other document evi-
dence using the Field-weighted Okapi BM25 method.
In all cases the 2002 Topic Distillation task (TD 2002) was used to train combination
and ranking function parameters. It was later observed (both in experiments in Chap-
ter 9, and for reasons outlined in Section 2.6.7.2) that due to the informational nature
of the TD 2002 task it may not have been the most appropriate training set for these
navigational tasks [53]. Further gains may be achieved by re-training parameters for
a navigational-based search task (as used in Section 9.1.5).
8.2.1 Linear combination
In experiments within this chapter a linear combination of Okapi BM25 scores for
full-text and aggregate anchor-text is explored. Document title is not considered sep-
arately, and is scored as part of the document full-text baseline. A document score D
for a query Q is then:
BM25LC(D, Q) = BM25(C + T, Q) + αBM25(A, Q) (8.2)
where C + T is the full-text and title of document D, A is the aggregate anchor-text
for document D, and α is tuned according to the expected contribution of anchor-text
evidence.
Conceptually the linear combination assigns separate scores to document full-text
and aggregate anchor-text, considering them as independent descriptions of docu-
ment content. The BM25 linear combination constant was trained on the TD 2002
task, leading to α = 3.
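A minimal sketch of this combination follows. The bm25 helper below uses the common tf/(K + tf) form of the Okapi weight (the exact formulation is given in Section 2.3.1.3), the shape of the stats dictionary is an assumption made for illustration, and α = 3 as trained on TD 2002.

    # Sketch: linear combination of Okapi BM25 scores (equation 8.2), alpha = 3.
    import math

    def bm25(query_terms, field_tf, field_len, avdl, N, n, k1=2.0, b=0.75):
        """Okapi BM25 over one field treated as a pseudo-document (common form)."""
        K = k1 * ((1.0 - b) + b * field_len / avdl)
        score = 0.0
        for t in query_terms:
            tf = field_tf.get(t, 0)
            if tf:
                idf = math.log((N - n[t] + 0.5) / (n[t] + 0.5))
                score += idf * tf / (K + tf)
        return score

    def bm25_lc(query_terms, content_title, anchor, stats, alpha=3.0):
        """content_title / anchor: {"tf": term -> count, "len": field length} for one document.
        stats: per-field collection statistics, e.g. {"content": {"avdl": ..., "N": ..., "n": {...}},
        "anchor": {...}} (an assumed shape, for illustration)."""
        c = bm25(query_terms, content_title["tf"], content_title["len"], **stats["content"])
        a = bm25(query_terms, anchor["tf"], anchor["len"], **stats["anchor"])
        return c + alpha * a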
8.2.2 Field-weighted Okapi BM25
The BM25FW formulation (see Section 2.5.2.1) includes three document fields: doc-
ument full-text (content), aggregate anchor-text, and title. The weights for each of
these fields were derived by Robertson et al. [173] for the TD 2002 task (content:1,
anchor-text:20, title:50, k1 = 3.4, b = 0.85). In this chapter, the document fields scored
144 Anchor-text in web search
are represented in brackets after BM25FW , with default fields of full-text, anchor-text
and title indicated by BM25FW ((C, A, T), Q) for query Q.
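A sketch of the field-weighted combination is given below: per-field term frequencies are scaled by the field weights and merged into a single composite document, which is then scored once using the composite document length. The weights and parameters are those quoted above (content:1, anchor-text:20, title:50, k1 = 3.4, b = 0.85); the exact BM25FW formulation is the one in Section 2.5.2.1, so this should be read as an approximation.

    # Sketch: Field-weighted Okapi BM25 (BM25FW) via a weighted composite document.
    import math

    FIELD_WEIGHTS = {"content": 1.0, "anchor": 20.0, "title": 50.0}

    def bm25_fw(query_terms, fields, avdl, N, n, k1=3.4, b=0.85):
        """fields: field name -> (term -> tf) for one document; n: global document frequencies."""
        composite_tf = {}
        for field, tfs in fields.items():
            weight = FIELD_WEIGHTS.get(field, 1.0)
            for t, tf in tfs.items():
                composite_tf[t] = composite_tf.get(t, 0.0) + weight * tf
        dl = sum(composite_tf.values())            # composite document length
        K = k1 * ((1.0 - b) + b * dl / avdl)
        score = 0.0
        for t in query_terms:
            tf = composite_tf.get(t, 0.0)
            if tf:
                idf = math.log((N - n[t] + 0.5) / (n[t] + 0.5))   # single global idf
                score += idf * tf / (K + tf)
        return score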
8.2.3 Fusion of linear combination and field-weighted evidence
A hybrid combination can be performed by grouping and scoring like document evi-
dence with Field-weighted Okapi BM25, and combining independent document evi-
dence using a linear combination of scores. The split examined in experiments in this
chapter is between document-level and web-based evidence: the document full-text
and title are scored independently from the externally related aggregate anchor-text.
This approach is referred to as (BM25HYLC).
BM25HYLC(D, Q) = BM25FW ((C, T), Q) + αBM25gidf (A, Q) (8.3)
8.2.4 Snippet-based anchor-text scoring
An alternative to scoring documents based on their aggregate anchor-text is to score
documents according to their best matching anchor-text snippet.6 Overlap between
different forms of document evidence may be reduced through snippet-based rank-
ing. When using full-text ranking algorithms to score aggregate anchor-text evidence,
there may be overlap in the document features used to score documents. For exam-
ple, a document that has a large number of in-links, is also likely to have a high tf for
a particular term (see “USGS” example in Section 8.1.3). Additionally, the aggregate
anchor-text for a document with a large number of incoming links is likely to be long,
and so will be impacted by document length normalisation.
Snippet-based scores are collected by scoring every snippet of anchor-text pointing
to each document, and using the highest scoring snippet per document.7 These snip-
pets are then combined with other document evidence using Field-weighted Okapi
BM25 with snippet-based collection and document statistics.8 Whilst these may not
be the best formulations of snippet statistics, they are consistent with the derivations
used in Okapi BM25.
Two snippet-based scoring functions were considered: BM25FWSn and
BM25FWSnI. BM25FWSn combines a single occurrence of the best scoring snippet with
other document evidence using Field-weighted Okapi BM25. BM25FWSnI combines
the best scoring snippet repeated according to document in-degree with other docu-
ment evidence using Field-weighted Okapi BM25.9
The evaluated snippet based runs are reported in Table 8.3.
6 An anchor-text snippet is the anchor-text of a single link pointing to a document.
7 This is a computationally-expensive operation, as all non-duplicate snippets require individual scoring at query time.
8 The statistics were adapted as follows: term frequency was set to within-snippet term frequency, inverse document frequency to the frequency of terms within snippets, and document length as the length of a particular snippet.
9 Time did not allow for the investigation of further snippet ranking combinations.
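To make the snippet selection step concrete, a small sketch is given below, reusing the bm25 helper from Section 8.2.1. Snippet-level statistics follow the adaptation described in footnote 8, but the data layout and the way the selected snippet is handed to the field-weighted combination are assumptions.

def select_best_snippet(snippets, query_terms, snippet_df, num_snippets,
                        avg_snippet_len, indegree=1, repeat_by_indegree=False):
    # Score each anchor-text snippet pointing at a document with BM25 using
    # snippet-level collection statistics, and keep the best scoring snippet.
    # If repeat_by_indegree is set, the best snippet is repeated in-degree
    # times (as in BM25FWSnI) before it enters the field-weighted combination.
    best_terms, best_score = [], 0.0
    for snippet_terms in snippets:            # one entry per incoming link
        s = bm25(snippet_terms, query_terms, snippet_df,
                 num_snippets, avg_snippet_len)
        if s > best_score:
            best_terms, best_score = snippet_terms, s
    return best_terms * indegree if repeat_by_indegree else best_terms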
Abbreviation Description Described in
BM25FWSn Field-weighted Okapi BM25 using Section 8.2.4
the best matching anchor-text snippet
as the anchor-text component.
BM25FWSnI Field-weighted Okapi BM25 using Section 8.2.4
the best matching anchor-text snippet
repeated according to document in-degree
as the anchor-text component.
Table 8.3: Summary of snippet-based document ranking algorithms under examination.
8.3 Results
This section provides an empirical investigation of the effectiveness of the revised
scoring methods. Effectiveness was evaluated using an automatic site-map based
experiment on a university web, and using test collections from the 2002 and 2003
TREC web tracks. The TREC tasks studied were a named page finding task (NP2002),
the 2003 combined home page finding / named page finding task (HP/NP2003), and
the 2003 Topic Distillation task (TD2003). TREC web track corpus and task details are
outlined in Section 2.6.7.
8.3.1 Anchor-text baseline effectiveness
The effectiveness of the aggregate anchor-text scoring techniques was evaluated using
a set of 332 navigational queries over a corpus of 80 000 web pages gathered from a
university web. The navigational queries were sourced using the automatic site map
method (described in Section 2.6.5.3).
Ranking function Score Rank
BM25 61 62
BM25contdln 100 1
BM25nodln 100 1
Table 8.4: Okapi BM25 aggregate anchor-text scores and ranks for length normalisation
variants. The “Score” and “Rank” are the normalised scores and ranks achieved for the correct
answer to the query ‘library’ on the university corpus.
Table 8.4 shows the ranks and normalised scores achieved by the best answer in
response to the query “library” when using only aggregate anchor-text. When in-
corporating aggregate anchor-text length normalisation in Okapi BM25, the correct
answer was severely penalised, as the aggregate anchor-text length was 13 484 words
(262 times the average length in the collection). This was despite the document having
very high term frequency (tf ) for the query term (1664). In contrast, both BM25contdln
and BM25nodln placed the best answer at rank one, but scored it only slightly above
many other candidate documents. In fact, the score was only 1% higher than the home
page of a minor library whose tf was a factor of 7.5 lower. Due to the small difference
in scores assigned for anchor-text evidence, if these scores were combined with other
document scores in a linear combination, the ranking of documents might change.
To increase the contribution of strong anchor-text matches, the weight of anchor-text
evidence must be increased and/or the saturation rate of anchor-text changed. An
anchor-text ranking function that does not saturate anchor-text term contribution is
presented in the following Chapter (AF1, in Section 9.1.3.1).
Table 8.5 shows results for the full set of 332 navigational queries processed over
the university corpus. Wilcoxon tests show that using full-text document length
(BM25contdln) to length normalise aggregate anchor-text significantly (p < 0.02) im-
proved effectiveness relative to the case of no length normalisation (BM25nodln). Fur-
ther, both BM25contdln and BM25nodln were superior to the default Okapi BM25 for-
mulation (p < 10−5).
Ranking function MRR P@1
BM25 0.61 0.47
BM25contdln 0.72 0.63
BM25nodln 0.70 0.61
Table 8.5: Effectiveness of Okapi BM25 aggregate anchor-text length normalisation tech-
niques on the university corpus. MRR depicts the Mean Reciprocal Rank of the first correct
answer; P@1 is precision at 1, the proportion of queries for which the best answer was returned
at rank one.
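To clarify the distinction between the three variants compared in Tables 8.4 and 8.5, the sketch below parameterises a single BM25 routine by the length used for normalisation. This parameterisation is one plausible reading of the definitions in Section 8.1, not the exact implementation used in these experiments.

import math
from collections import Counter

def bm25_anchor(anchor_terms, query_terms, df, num_docs, avg_len,
                k1=2.0, b=0.75, norm_len=None):
    # norm_len=None with b>0    -> BM25        (aggregate anchor-text length)
    # norm_len=full-text length -> BM25contdln (full-text document length)
    # b=0                       -> BM25nodln   (no length normalisation)
    # avg_len must correspond to whichever length statistic is used.
    tf = Counter(anchor_terms)
    length = len(anchor_terms) if norm_len is None else norm_len
    score = 0.0
    for t in query_terms:
        if tf[t] == 0 or t not in df:
            continue
        idf = math.log((num_docs - df[t] + 0.5) / (df[t] + 0.5))
        score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * length / avg_len))
    return score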
8.3.2 Anchor-text and full-text document evidence
This section examines the results from experiments that combine the new anchor-text
scoring methods with document full-text evidence. Combined runs are evaluated
using TREC web track test collections from 2002 and 2003 (discussed in Section 2.6.7).
8.3.2.1 Field-weighted Okapi BM25 combination
Table 8.6 shows the results from the Field-weighted Okapi BM25-based (BM25FW )
experiments.
Task Ranking Function C A T P@1 P@10 MRR Sig.
NP2002 BM25FW 1 50 20 0.59 0.82 0.68 -
NP2002 BM25FWnoanchdln 1 50 20 0.59 0.87 0.68 -
NP2002 BM25FW 1 500 20 0.49 0.78 0.60 -
NP2002 BM25FWnoanchdln 1 500 20 0.52 0.85 0.63 -
TD2003 BM25FW 1 50 20 0.10 0.09 0.10 -
TD2003 BM25FWnoanchdln 1 50 20 0.18 0.09 0.13 *+
TD2003 BM25FW 1 500 20 0.17 0.08 0.09 -
TD2003 BM25FWnoanchdln 1 500 20 0.20 0.09 0.13 *+
HP&NP2003 BM25FW 1 50 20 0.48 0.76 0.58 -
HP&NP2003 BM25FWnoanchdln 1 50 20 0.63 0.85 0.71 *+
HP&NP2003 BM25FW 1 500 20 0.36 0.67 0.46 -
HP&NP2003 BM25FWnoanchdln 1 500 20 0.59 0.84 0.68 *+
Table 8.6: Effectiveness of Field-weighted Okapi BM25. Three TREC web track tasks were
evaluated; “NP2002” is the 2002 TREC web track named page finding task; “TD2003” is the
2003 TREC web track Topic Distillation task; and “HP&NP2003” is the 2003 TREC web track
combined home page / named page finding task. “C” is the content weight (1 by default), “A”
is the aggregate anchor-text weight (50 by default) and “T” is the title weight (20 by default).
“Sig.” indicates whether improvements were significant (“*+”) over the BM25FW (C, A, T)
baseline. Improvements for no length normalisation were only significant for TD2003 and
the HP&NP2003 task. Performance decreased dramatically when up-weighting the aggregate
anchor-text field while including aggregate anchor-text in composite document length.
The removal of aggregate anchor-text length from composite document lengths
in the Field-weighted Okapi BM25 model (BM25FWnoanchdln) significantly improved
performance in two-out-of-three tasks, and did not affect performance in the other.
The results show that increasing the weight of aggregate anchor-text by an order
of magnitude in BM25FW exacerbates the negative effects of including aggregate
anchor-text length in the composite document length. Combining BM25FW scores
with hyperlink recommendation evidence might go some way to re-balancing the re-
trieval of highly linked pages. The investigation of this potential is left for future
work.
Function parameters were optimised for composite document lengths that included
aggregate anchor-text. It is likely that improvements achieved through the removal of
aggregate anchor-text length might be increased through re-tuning Okapi BM25FW ’s
document length (b) and saturation (k1) parameters. This is also left for future work.
Anchor-text snippets in Field-weighted Okapi BM25
The performance of the anchor-text snippet-based ranking functions is presented in
Table 8.7. Both snippet-based runs performed poorly by comparison to the BM25FW
runs. The snippet-based runs were also far less efficient than aggregate anchor-text
runs, as statistics were calculated and stored for each link rather than each document
(and there is an order of magnitude more links than documents in the .GOV corpus).
Further investigation would be required to determine whether a snippet-based rank-
ing could be effective. For example, effectiveness might be improved by re-optimising
the Okapi BM25 parameters, or re-weighting snippets according to their origin (e.g.
according to whether they are within-site or cross-site links) or according to some
notion of source authority.
Ranking function P@1 P@10 MRR Sig.
BM25FW 0.10 0.09 0.10 -
BM25FWnoanchdln 0.18 0.09 0.13 *+
BM25FWSn 0.06 0.04 0.04 *-
BM25FWSnI 0.12 0.06 0.06 *-
Table 8.7: Effectiveness of anchor-text snippet-based ranking functions. The snippet runs
performed poorly by comparison to the BM25FW runs for the 2003 TREC web track Topic
Distillation task. “Sig.” indicates whether improvements (“*+”) or losses (“*-”) were signifi-
cant compared to the BM25FW (C, A, T) baseline.
8.3.2.2 Linear combination
Table 8.8 shows the performance of the combinations of aggregate anchor-text and
full-text evidence for the Topic Distillation 2003 task. The following observations were
made from these results:
• Excluding aggregate anchor-text length from document length improves the
performance of the BM25FW method by around 25%. Likewise, removing ag-
gregate anchor-text length normalisation when combining content and anchor-
text BM25 scores results in significant performance gains, with an MRR increase
of around 30%.
• A further small effectiveness gain is achieved through a hybrid combination,
where the document title and full-text are scored using the field-weighting
method, and are then combined with aggregate anchor-text evidence in a lin-
ear combination.
• The “pure” linear combination performs poorly, most likely due to the use of
aggregate anchor-text length normalisation.
Ranking function Comb P@1 P@10 MRR Sig
BM25FW FW 0.10 0.09 0.10 -
BM25FWnoanchdln FW 0.18 0.09 0.13 *+
BM25gidf (C) + BM25gidf (A) LC 0.24 0.10 0.12 *+
BM25gidf (C) + BM25gidf ,contdln(A) LC 0.18 0.13 0.16 *+
BM25(C) + BM25contdln(A) LC 0.18 0.12 0.16 *+
BM25FW (C, T) + BM25contdln(A) HYLC 0.22 0.14 0.17 *+
Table 8.8: Effectiveness of the evaluated combination methods for TD2003. “TD2003” is
the 2003 TREC web track Topic Distillation task. C is document full-text, A is aggregate
anchor-text, and T is title. FW uses a Field-weighted Okapi BM25 combination, LC is a linear
combination, and HYLC is a fusion of Field-weighted Okapi BM25 and linear combination.
“Sig.” indicates whether improvements (“*+”) or losses (“*-”) were significant compared to
the BM25FW (C, A, T) baseline.
Table 8.9 contains the results for experiments on further TREC tasks. In all cases
the linear combination methods are outperformed by the field-weighting method.
This demonstrates potential differences between the tasks studied, and suggests that
no one method considered here will achieve high effectiveness on all search tasks.
8.4 Discussion
The results for the Okapi BM25 modifications show that effectiveness was improved
when length normalisation was not performed on aggregate anchor-text. Additional
gains were achieved when aggregate anchor-text was normalised using document
full-text length. The reason for this may be that full-text document length provides
useful query-independent document evidence.
The removal of aggregate anchor-text from composite document lengths in the
Okapi BM25FW formula improved or maintained retrieval effectiveness for all eval-
uated tasks. A re-tuning of the field-weighting weights without aggregate anchor-
text in composite document length is required to determine whether further improve-
ments can be attained. The removal of aggregate anchor-text length from composite
document length normalisation favours documents with long aggregate anchor-text,
as it is more likely that a link to a document containing the query term will be found.
This preference for long aggregate anchor-text is similar to biasing towards heavily
linked-to pages (except that a term match is assured). This may be a method by
which query-independent hyperlink recommendation evidence can be more easily
combined with query-dependent evidence.
Task Ranking function Comb. P@1 P@10 MRR
NP&HP2003 BM25FWnoanchdln FW 0.60 0.85 0.69
NP&HP2003 BM25(C) + BM25(A) LC 0.26 0.57 0.36
NP&HP2003 BM25gidf (C) + BM25gidf ,nodln(A) LC 0.47 0.71 0.56
NP&HP2003 BM25FW (C, T) + BM25contdln(A) HYLC 0.51 0.76 0.60
NP2002 BM25FWnoanchdln FW 0.56 0.87 0.67
NP2002 BM25(C) + BM25(A) LC 0.33 0.65 0.44
NP2002 BM25gidf (C) + BM25gidf ,nodln(A) LC 0.26 0.51 0.35
NP2002 BM25FW (C, T) + BM25contdln(A) HYLC 0.31 0.61 0.30
Table 8.9: Effectiveness of the evaluated combination methods for NP2002 and
NP&HP2003. “NP2002” is the 2002 TREC web track named page finding task; and
“HP&NP2003” is the 2003 TREC web track combined home page / named page finding task.
“C” is document full-text, “A” is aggregate anchor-text, and “T” is title. FW uses a Field-
weighted Okapi BM25 combination, LC is a linear combination, and HYLC is a fusion of
Field-weighted Okapi BM25 and linear combination.
Results for the hybrid combination strategy illustrate the benefits of treating
document-level and web-based evidence as separate document descriptions. The hy-
brid combination approach significantly outperformed other methods, equalling the
best run submitted to TREC 2003 (discussed in Chapter 9). Computing a gidf and
using document full-text to normalise all document fields was also an effective ap-
proach, improving retrieval effectiveness as well as allowing for potential gains in
efficiency by reducing the number of statistics per term. A “pure” linear combination
of document evidence was significantly less effective and more costly (as document
statistics were required for each form of evidence).
In general, the results in this chapter illustrate an interesting trade-off when deal-
ing with aggregate anchor-text. The trade-off is whether to favour documents which
contain the most occurrences of a particular term in anchor-text (by employing no
anchor-text aggregate length normalisation), or to favour documents whose aggregate
anchor-text contains the greatest percentage of anchor-text that matches the query
term (by employing full aggregate anchor-text length normalisation). The choice
is akin to trading off the quantity of anchor-text for the “purity” of the aggregate
anchor-text. If aggregate anchor-text is heavily length normalised, thereby encourag-
ing anchor-text purity, hyperlink recommendation evidence could be used to counter
the preference for short aggregate anchor-text by up-weighting pages with high link
popularity. How best to address these issues is left for future work.
Chapter 9
A first-cut document ranking
function using web evidence
The first-cut ranking function explored in this chapter combines document and web-
based evidence found effective in previous experiments within this thesis. A weighted
linear combination was used to combine this evidence. The weights for evidence and
combination parameters were tuned for three sets of navigational queries using a hill-
climbing algorithm. The tuned ranking function was evaluated through submissions
to the TREC 2003 web track, and on data spanning several small-to-medium sized
corporate web collections.
9.1 Method
The following sections outline:
• How the effectiveness of the ranking function was tested;
• The document-level and web evidence used in the ranking function;
• How document evidence was combined in the ranking function;
• The training data, and how the data were used to tune the ranking function; and
• The methods used to address the combined home page / named page finding
task.
9.1.1 Evaluating performance
The first-cut ranking function was used to generate runs for participations in both
the Topic Distillation (TD2003), and the combined home page / named page finding
(HP/NP2003) tasks of the 2003 TREC web track (described in Section 2.6.7.2).
The goal of the TD2003 task was to study how well systems could find entry points
to relevant sites given a broad query. The Topic Distillation task is nominally an in-
formational task (see Section 2.6.3). However, the focus in Topic Distillation is quite
different from previous informational tasks studied in TREC. Topic Distillation stud-
ies the retrieval of relevant resources, rather than relevant documents. The TD2003
submission studied in this chapter sought to determine whether the first-cut ranking
function trained for navigational search (especially home page finding queries) would
perform well for Topic Distillation. This training set was chosen in an effort to favour
the retrieval of relevant resources rather than documents.
The goal of the HP/NP2003 task was to study how well systems could retrieve
both home page documents and other documents specified by their name, without
prior knowledge of which queries were for named pages, and which were for home
pages. The HP/NP2003 submission studied in this chapter examined different meth-
ods for combining home page and named page based tunings into a single run. This
included an investigation of whether best performance was achieved by tuning for
both tasks at once, using a training set containing both types of queries, or through
“post hoc” fusion of home page and named page tuned document rankings.
A series of follow-up experiments used corpora gathered from several small cor-
porate webs to provide a preliminary study of how the ranking function performed
on diverse corporate-sized webs. In each case the effectiveness of the ranking function
studied was compared to that of the incumbent search system.
9.1.2 Document evidence
The ranking function included three important forms of document evidence: full-text,
title and URL length.
The query-dependent evidence (full-text and title) was scored using Okapi BM25
with tuned k1 and b parameters. The k1 and b parameters were tuned once per run
rather than individually per field. The application of term stemming was also eval-
uated (using the Porter stemmer [163], described in Section 2.3). Strict term coordi-
nation was applied for all query-dependent evidence, with documents containing the
most query terms ranked first. If combining Okapi BM25 scores computed for a mul-
tiple term query in a linear combination without term co-ordination, a document that
matches a single query term in multiple document fields can outperform a document
that contains all query terms in a single field. The use of strict term co-ordination
ensures that the first ranked document contains the maximum number of matched
query terms in a document field.
9.1.2.1 Full-text evidence
Okapi BM25 was used to score document full-text evidence (BM25(C)). Prior to
scoring full-text evidence all HTML tags and comments were removed. For efficiency
reasons global document length and global inverse document frequency (gidf ) values
were used (described in Section 8.1.2).
9.1.2.2 Title evidence
Title text was scored independently of other document evidence using BM25
(BM25(T)). For efficiency reasons the BM25 title formulation used global docu-
ment length and global inverse document frequency (gidf ) values (described in Sec-
tion 8.1.2).
9.1.2.3 URL length
URL lengths (URLlen) were capped at 127 for efficiency reasons. URLs longer than 127
characters were recorded as being 127 characters long.
9.1.3 Web evidence
Anchor-text and two forms of in-degree were included in the ranking function. Page-
Rank and (simple) in-degree were not considered because of the relatively poor per-
formance observed in previous experiments. Instead, two important sub-types of in-
degree were examined: off-site and on-site in-degree [55].
9.1.3.1 Anchor-text
The Anchor Formula 1 (AF1) proposed here is an alternative to the revised anchor-
text models presented in the previous chapter. In AF1, term frequency (tf ) values are
not saturated (as described in Section 8.1.1) and document length normalisation is re-
moved (as described in Section 8.1.3). When AF1 values are multiplied by 1.7 (using the
KWT parameter, see Section 9.1.4), the resulting curve is similar to the BM25 saturation
curve for an average-length document over the first three term occurrences (with default
Okapi parameters, see Figure 9.1).
The score for a document D, for query Q, over terms t, with aggregate anchor-
text A according to AF1 is:
AF1(D, Q) = Σ_{t∈Q} log(tf_{t,D} + 1) × gidf_t (9.1)
As term frequency scores in AF1 never saturate, term coordination must be enforced.
Without term coordination a single query term may dominate. For example, for the
query “Microsoft Research Cambridge” the term “Microsoft” may dominate, so that a
page which matches “Microsoft” strongly in its aggregate anchor-text but never matches
“Research” or “Cambridge” (such as the Microsoft home page) could be retrieved ahead
of the intended page.
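A minimal sketch of AF1, together with one reading of strict term coordination as a primary sort key, is given below. The gidf values are assumed to be precomputed global inverse document frequencies, and the per-document term set used for coordination is an assumed representation rather than a detail taken from the implementation.

import math
from collections import Counter

def af1(anchor_terms, query_terms, gidf):
    # AF1: unsaturated, length-unnormalised anchor-text score. Each query
    # term contributes log(tf + 1), weighted by its global inverse document
    # frequency.
    tf = Counter(anchor_terms)
    return sum(math.log(tf[t] + 1) * gidf.get(t, 0.0) for t in query_terms)

def rank_with_coordination(docs, query_terms, score_fn):
    # Strict term coordination: rank primarily by the number of distinct
    # query terms matched, and only then by score, so a strong match on a
    # single term cannot outrank a document matching all query terms.
    def key(doc):
        matched = sum(1 for t in set(query_terms) if t in doc['terms'])
        return (matched, score_fn(doc))
    return sorted(docs, key=key, reverse=True)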
9.1.3.2 In-degree
The log values of on-site (IDGon) and off-site (IDGoff) in-degrees were normalised
(according to the highest in-degree value for the collection) and quantised to 127 val-
ues (for efficiency reasons). This may have reduced ranking effectiveness, although
[Figure 9.1 plots the document score achieved against tf for AF1 and for BM25 with k1 = 0, 1, 2 and 10.]
Figure 9.1: Document scores achieved by AF1 and BM25 for values of tf. A document of average length is assumed, with the likelihood of encountering a term in the corpora one-in-one-thousand (using idf values of N = 100 000 and nt = 100).
experience with the retrieval system in practical use suggests that there are minimal
adverse effects associated with this normalisation.
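A sketch of this normalisation and quantisation step is shown below; the log(1 + x) form is an assumption made to handle documents with zero in-degree, and the same routine would be applied separately to on-site and off-site counts.

import math

def quantise_indegree(indegree, max_indegree, levels=127):
    # Log-normalise an in-degree count against the collection maximum and
    # quantise it to a fixed number of levels for compact storage.
    if indegree <= 0 or max_indegree <= 0:
        return 0
    normalised = math.log(1 + indegree) / math.log(1 + max_indegree)
    return int(round(normalised * levels))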
9.1.4 Combining document evidence
The ranking formulation includes four key components: a query dependent score and
three query-independent scores.
• Query-dependent evidence: this component is a linear combination of docu-
ment full-text, title, and AF1 anchor-text scores. The relative contribution of
AF1 is controlled through the KWT parameter. The relative contribution of
query-dependent evidence is controlled using the QD parameter. Full-text, ti-
tle and anchor-text are combined using a linear combination with gidf values, a
method previously demonstrated to be effective for home page and Topic Dis-
tillation tasks in Chapter 8. Term stemming was also evaluated (Stem).
• On-site in-degree: this component is the log normalised number of incoming
on-site links (quantised to 127 values). The contribution of this component is
controlled using the ON parameter.
• Off-site in-degree: this component is the log normalised number of incoming
off-site links (quantised to 127 values). The contribution of this component is
controlled using the OFF parameter.
• URL length: this component is the length, in characters, of the URL (for up
to 127 characters). The contribution of this component is controlled using the
URL parameter.
Accordingly the score for a document D is computed by:
S(D, Q) = QD × (BM25gidf(C, Q) + BM25gidf(T, Q) + KWT × AF1(A, Q))
               / max(BM25gidf(C, Q) + BM25gidf(T, Q) + KWT × AF1(A, Q))
         + ON × IDGon_D / max(IDGon)
         + OFF × IDGoff_D / max(IDGoff)
         + URL × (max(URLlen) − URLlen_D) / max(URLlen)
Documents must also fulfill the constraints imposed through term coordination.
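The formula above can be sketched as follows. The quantised in-degree and URL-length values are assumed to be precomputed per document, the collection maxima are assumed to be computed at indexing time, and the maximum of the query-dependent component is taken over the candidate documents for the query; these are interpretations of the formula rather than details stated in the text. The default weights are the HPF tuning reported later in Table 9.1.

def combined_score(qd_score, max_qd_score, idg_on, max_idg_on,
                   idg_off, max_idg_off, url_len, max_url_len=127,
                   QD=15, ON=1, OFF=6, URL=38):
    # qd_score is BM25gidf(content) + BM25gidf(title) + KWT * AF1(anchor).
    # Each component is normalised to [0, 1] and scaled by its tuned weight.
    score = QD * (qd_score / max_qd_score) if max_qd_score else 0.0
    score += ON * (idg_on / max_idg_on) if max_idg_on else 0.0
    score += OFF * (idg_off / max_idg_off) if max_idg_off else 0.0
    score += URL * (max_url_len - url_len) / max_url_len
    return score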
9.1.5 Test sets and tuning
Eight parameters (k1, b , KWT, QD, ON , OFF, URL and Stem) were tuned for each
test set. The values explored for each parameter are as follows:
• k1 in steps of 0.25 between 0 and 4;
• b in steps of 0.25 between 0 and 1;
• KWT in steps of 1.7 between 0 and 17;
• QD, ON , OFF in steps of 2 between 0 and 20;
• URL in steps of 4 between 0 and 40; and
• Stem on or off.
The parameters were tuned using three test sets:
• Home page set (HPF): this training set was based on the http://first.gov
government home page list. Queries and results were extracted from this docu-
ment using the automatic site map method (see Section 2.6.5.3). The set consists
of 241 queries whose results were home pages. The full query and result set is
included as Appendix G.
• Named page set (NPF): this training set consists of the queries and relevance
judgements (qrels) used in the TREC 2002 named page finding task (described in
Section 2.6.7.2). The set consists of 150 queries whose results are named pages.1
• Both sets of queries (BOTH): this consists of all queries and relevance judge-
ments used in HPF and NPF.
There are inherent limitations in the training sets employed. The set of home pages
was taken from a .GOV portal, which may inadvertently have favoured prestigious, or
larger and more popular home pages. Further, the named page tuning includes some
home pages that were included in the 2002 NP task. This may have biased training
towards home page queries. The BOTH set of queries included a disproportionate
number of home page queries due to the presence of home pages in the NPF set, and
because the HPF set was larger than the NPF set.
1 The results for some of the named page queries were home pages.
9.1.6 Addressing the combined HP/NP task
Three approaches for applying the ranking function to the combined HP/NP task
were evaluated.
The first method was a tuning of parameters for both tasks simultaneously (i.e.
using the BOTH tuning to generate a run).
This second method summed document scores achieved for each tuning. This is
equivalent, in rank fusion and distributed IR terminology, to performing a combSUM
of document HPF and NPF scores.
The third and final method interleaved the ranked results for each run by taking
a document from the top of each ranking in turn, and removing any already seen
(duplicate) documents. For example, the first result in an HP/NP interleaving is the
first ranked document for the HPF tuning, and the second result is the first ranked
document for the NPF tuning.2 In an attempt to improve early precision, the inter-
leaving order was swapped if a keyword indicative of a named page finding query
was observed.3
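The second and third methods can be sketched as follows, assuming each ranking is a list of (document id, score) pairs sorted by decreasing score. Swapping the argument order of the interleaving reproduces the reversed (named page first) ordering used when a query appears to be a named page query.

from itertools import zip_longest

def comb_sum(ranking_a, ranking_b):
    # combSUM: sum the scores a document receives under each tuning and
    # re-sort by the combined score.
    totals = {}
    for doc_id, score in list(ranking_a) + list(ranking_b):
        totals[doc_id] = totals.get(doc_id, 0.0) + score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def interleave(ranking_a, ranking_b):
    # Take documents alternately from the top of each ranking, skipping any
    # document that has already been emitted.
    seen, merged = set(), []
    for a, b in zip_longest(ranking_a, ranking_b):
        for item in (a, b):
            if item is None:
                continue
            doc_id, _ = item
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged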
9.2 Tuning
Parameters were tuned using a hill climbing algorithm with a complete exploration
of two parameters at a time (at each step the parameters which achieved the highest
retrieval effectiveness were stored and used for other tunings). The tuning stopped
when a full tuning cycle completed without change in tuned values. Plots of the tun-
ing process are provided in Figures 9.2 and 9.3. Figure 9.2 provides an example of the
concurrent tuning of two function parameters (in this case the Okapi BM25 k1 and b
values). Figure 9.3 shows plots for the rest of the tuning cycle.
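A rough sketch of the tuning loop is given below, assuming an evaluate(params) function that returns MRR@10 on the training set for a given parameter setting, a grid of candidate values per parameter, and a fixed list of parameter pairs to explore together; these interfaces are assumptions used only to illustrate the procedure.

def tune(params, grids, pairs, evaluate):
    # Hill climb over parameter pairs: for each pair, exhaustively evaluate
    # the grid of joint values (holding all other parameters fixed), keep
    # the best setting, and repeat until a full cycle produces no change.
    best_score = evaluate(params)
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            for va in grids[a]:
                for vb in grids[b]:
                    trial = dict(params, **{a: va, b: vb})
                    score = evaluate(trial)
                    if score > best_score:
                        best_score, params, changed = score, trial, True
    return params, best_score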
The tuned values and effectiveness of the ranking function on the three training
sets (HPF, NPF and BOTH) are reported in Table 9.1.
The optimal tunings derived for each task differed significantly. The only consis-
tent result was that the query-dependent component was important in all tunings.
The following observations can be made from the Home Page Finding (HPF) pa-
rameter tunings:
• The tuned Okapi BM25 term saturation parameter (k1 = 3.6) is higher than the
default parameter of k1 = 2. This indicates that home pages may contain page
naming text several times and that matching their name more than once is a
good indicator of a home page match.
• The tuned Okapi BM25 length normalisation parameter (b = 1) is higher than
the default parameter of b = 0.75. The tuning favoured a strict length normal-
2 So long as that document was not the same document retrieved at the first rank by the HPF tuning, in which case the next document in the NPF ranking is taken.
3 Query terms were selected from last year’s query set and included terms such as “page”, “form” and “2000”.
[Figure 9.2 shows a surface plot of mean reciprocal rank at 10 over the Okapi BM25 k1 (0 to 4) and b (0 to 1) grid, with the remaining parameters held fixed; the best MRR@10 in this sweep was approximately 0.737.]
Figure 9.2: A plot illustrating the concurrent exploration of Okapi BM25 k1 and b values using the hill-climbing function. The values at which the best performance is achieved are stored (the highest point in the plot, represented by a “+”) and used when tuning other values. The tuning stops when a full iteration of the tuning cycle completes without change in tuned values.
[Figure 9.3 shows three surface plots of mean reciprocal rank at 10 from the remaining hill-climbing steps: a sweep over the anchor-text and content weights (best MRR@10 approximately 0.745), a sweep over the on-site and off-site in-degree weights (best approximately 0.748), and a sweep over the URL weight and the final parameter (best approximately 0.751), with all other parameters held at their current best values.]
Figure 9.3: A full iteration of the hill-climbing function. The first step in this iteration is illustrated in Figure 9.2. The tuning of parameters was performed using a hill climbing algorithm with complete exploration of two parameters at a time. The highest point (best performance) is represented by a “+”, and the parameter values at that point are stored and used when tuning other values. The tuning stops when a full iteration completes without change in tuned values.
Test Set MRR k1 b KWT QD, ON , OFF, URL Stem
HPF 0.846 3.6 1 11.9 15,1,6,38 Y
NPF 0.522 4 0.2 1.7 18,2,0,1 N
BOTH 0.715 0.8 0.4 8.5 20,0,6,20 Y
Table 9.1: Tuned parameters and retrieval effectiveness. Parameters are as described in Sec-
tion 9.1.5. “MRR” is the Mean Reciprocal Rank, as described in Section 2.6.6.2. “HPF” is the
home page training set. “NPF” is the named page training set. “BOTH” contains both HPF
and NPF.
isation of document full-text.4 This suggests that longer full-text content is no
more likely to be a relevant home page.
• Anchor-text in the form of AF1 was important, with the KWT parameter per-
forming best at 11.9.
• The contribution of off-site and on-site links was small, with off-site links more
useful than on-site links.
• URL length once again proved to be an important contributor in home page
finding.
• Stemming improved retrieval effectiveness.
From the Named Page Finding (NPF) parameter tunings:
• Like for the HPF tunings, a higher than normal Okapi BM25 k1 value was effec-
tive.
• Unlike in the HPF tunings, length normalisation did not improve effectiveness,
with a low b value found to perform best.
• The contribution of URL, on-site in-degree, and off-site in-degree was small.
• Anchor-text was useful for the NPF task, although its contribution was far less
than in the HPF task.
• Stemming adversely affected retrieval effectiveness.
In general the BOTH tuning was similar to the HPF tuning (indicating that home
pages dominated in the tuning). The only large differences between BOTH and HPF
were in the form of a much smaller tuned term saturation value (k1 = 0.8), and less
length normalisation (b = 0.4).
4 Note that length normalisation is not present in the AF1 measure.
9.2.1 Combining HP and NP runs for the combined task
Results for the HP/NP combination methods tested on the combined test set (BOTH)
are presented in Table 9.2.
Combination Method MRR on training set
Tuned for BOTH 0.758
HPF and NPF combSUM 0.489
HPF and NPF interleaved (HP,NP) 0.734
Table 9.2: Results for combined HP/NP runs on the BOTH training set.
The interleaving of runs performed similarly to tuning using BOTH types of query.
This may be an effective method for combining the runs without prior tuning infor-
mation. The performance of the combSUM score combination was relatively poor.
9.3 Results
This section investigates results from the empirical studies of the first-cut ranking
function. The ranking function was evaluated, using the parameter tunings described
above, for the two TREC 2003 web track tasks, and for navigational search on several
corporate webs.
9.3.1 TREC 2003
This section sets out the results from the official TREC 2003 web track submissions.
The Topic Distillation runs (csiro03td–) are presented first, followed by the combined
HP/NP finding task runs (csiro03ki–).
9.3.1.1 Topic Distillation 2003 (TD2003) results
Results for the TD2003 web track task are presented in Table 9.3. The best of these
runs (csiro03td03) achieved the highest performance of any system submission. This
run used the HPF tuning and incorporated stemming. Further observations based on
the Topic Distillation results are:
• The tuned k1 and b values offered some improvement (csiro03td01 versus
csiro03td05). The effectiveness of the length normalisation parameter used (b =
1) suggests that longer pages are no more likely to be relevant in Topic Distilla-
tion.
• The new anchor-text ranking function AF1 was particularly effective, achieving
gains of up to 60% (csiro03td03 versus not sub 05).
Description Average R-Prec Run Id
HPF (Stem = ON , ON = 0, OFF = 0) 0.170 not sub 01
HPF (Stem = ON ) 0.164 csiro03td03
HPF (Stem = ON , ON = 0, OFF = 0, URL = 0) 0.149 not sub 02
HPF 0.144 csiro03td01
HPF (ON = 0, OFF = 0) 0.143 not sub 03
HPF (k1 = 2, b = 0.75) 0.127 csiro03td05
NPF 0.117 not sub 04
HPF (ON = 0, OFF = 0, URL = 0) 0.116 csiro03td02
HPF (Stem = ON , KWT = 0) 0.108 not sub 05
HPF (KWT = 0) 0.099 csiro03td04
HPF (Stem = ON , No Red./Dup.) 0.147 not sub 06
HPF (No Red./Dup.) 0.138 not sub 07
HPF (KWT = 0, No Red./Dup.) 0.116 not sub 08
HPF (Stem = ON , KWT = 0, No Red./Dup.) 0.106 not sub 09
Table 9.3: Topic Distillation submission summary. “HPF” indicates that the home page
finding tunings were used (tunings in Table 9.1). “NPF” indicates that the named page finding
tunings were used (tunings also in Table 9.1). Other description notes indicate variations from
the tuned parameters. “Run Id” reports the run identifier used in TREC experiments. “No
Red./Dup.” indicates that redirect and duplicate URL information was not used. Further
runs were computed post hoc (not sub –).
• Hyperlink recommendation evidence was not effective. A post hoc run achieved
slightly better performance (4%) when hyperlink recommendation evidence was
removed (not sub 01).
• URL length evidence appeared to slightly improve retrieval effectiveness
(not sub 01 versus not sub 02).
• The NPF tuning performed worse than the HPF tuning (not sub 04), with an
associated drop in average R-precision of around 20%.
• A linear combination of query-dependent scores from document-level and web-
based evidence, where both scores were computed using gidf values, was effec-
tive.
• The redirect and duplicate information (collected using methods outlined in
Chapter 3) was important when scoring anchor-text using AF1. Without redi-
rect and duplicate information, retrieval effectiveness was reduced by 15%
(csiro03td03 versus not sub 06).
The results from the Topic Distillation task support the notion that the home page
training set favoured prominent resources (an advantage for Topic Distillation). The
results also illustrate the benefits of the new anchor-text ranking component AF1, es-
pecially when used with stemming, and with redirect and duplicate URL information.
9.3.1.2 Combined HP/NP 2003 (HP/NP2003) results
The official run results for the HP/NP2003 task are presented in Table 9.4. The best of
these runs achieved the second highest performance of any submitted system
(csiro03ki04). The results show that tuning specifically for the home page finding task
significantly harmed named page retrieval effectiveness (csiro03ki02 versus
csiro03ki03). The highest MRR was achieved using the NPF-only tuning, whilst the
best S@10 used interleaved lists from HPF and NPF tunings. The results show that an
overemphasis on home page finding harmed the named page searches.
The run with the highest S@10 (csiro03ki04) interleaved the csiro03ki02 and
csiro03ki03 runs (i.e. top HP, top NP, second HP, second NP etc.). From subsequent
evaluations (not sub 01) it was apparent that leading with the top NP result rather
than the top HP result would have further improved precision (achieving an MRR
of 0.717). Tuning for both named page and home page training queries concurrently
(csiro03ki01) performed well for home page finding, but poorly for named page find-
ing. This confirms that the BOTH training set was biased towards home page finding
due to the larger sample of home page queries considered, and the presence of home
page queries in the named page training set (see Section 9.1.5).
In summary, interleaving HP then NP without query classification achieves an
MRR of 0.646. Interleaving HP then NP and reversing the interleaving if the query
appears to be a named page query achieves an MRR of 0.667. Finally, interleaving NP
then HP without query classification achieves 0.717.
Description MRR S@10 (%) MRR (HP) MRR (NP) Run Id
HPF and NPF interleaved (NPF,HPF) 0.717 87.0 0.781 0.651 not sub 01
NPF 0.702 84.0 0.755 0.649 csiro03ki03
HPF and NPF combSUM 0.699 81.0 0.812 0.586 csiro03ki05
BOTH 0.692 83.7 0.815 0.569 csiro03ki01
HPF and NPF interleaved (HPF,NPF) 0.667 86.3 0.801 0.532 csiro03ki04
HPF 0.603 77.7 0.774 0.432 csiro03ki02
Table 9.4: Combined home page/named page finding task submission summary. To aid
understanding of retrieval performance, MRR was also computed for home pages only
(“MRR (HP)”) and for named pages only (“MRR (NP)”). “HPF”, “NPF”, and “BOTH” indicate
the tunings used (home page finding, named page finding and both sets respectively, parameters
reported in Table 9.1). Other description notes indicate variations from the tuned parameters.
“Run Id” reports the run identifier used in TREC experiments. Post hoc, a further run
was computed using NPF tunings.
9.3.2 Evaluating the ranking function on further corporate web collections
The ranking function was evaluated for eight further collections built from the pub-
licly available corporate webs of eight large Australian organisations: five public com-
panies, two government departments and an educational institution.
The query and result sets were generated using the automated site map method
described in Section 2.6.5.3. In each case the new ranking function was compared to
the performance of the incumbent search system. The anchor-text component was
calculated using a BM25 anchor-text formulation that used full-text document length
for normalisation (BM25contdln).
Table 9.5 presents the results from this experiment. The first-cut ranking function
performed significantly better than seven out of eight evaluated search systems, and
comparably to the other search system (University). The use of query-independent
evidence (off-site links, on-site links and URL length) did not significantly improve
retrieval effectiveness on any collection.
9.4 Discussion
The first-cut ranking function performed well over a variety of tasks and corpora. The
runs submitted to the 2003 TREC web track achieved the highest Topic Distillation
score [60], and the second highest combined HP/NP score [60]. The ranking function
also outperformed the incumbent search engines of seven-of-the-eight corporate webs
studied (and performed comparably to the other).
The tuning of ranking function parameters using the NPF training set achieved
better retrieval effectiveness than tuning using the HPF set in the HP/NP2003 task.
This indicates that the HPF-tuned ranking function may have been over-trained to-
wards prominent home pages (as would be listed on first.gov).
Arguably, the most important component of the ranking function was the anchor-
text evidence in the form of AF1. This finding re-iterates the importance of anchor-
text evidence in web document retrieval. The AF1 ranking function provided an ef-
fective alternative to scoring anchor-text using full-text ranking methods. However,
the methods used to score aggregate anchor-text evidence merit further investigation.
In particular, more work is required to determine whether the use of global inverse
document frequency values (gidf ) is preferable to the use of field-based anchor-text
(fidf) values.
The results show little performance gain through the use of query-independent
evidence, for both the web track tasks and small corporate web collections. URL
length evidence produced small gains for the home page finding and Topic Distilla-
tion tasks. By contrast hyperlink recommendation evidence never improved retrieval
effectiveness. The poor performance of query-independent evidence could indicate
that the method used to combine it with query-dependent evidence was ineffective.
More effective combination strategies might incorporate query-independent evidence
as some prior probability of document relevance [155], re-rank baselines (as in Sec-
tion 6.1.4), or may use the query-independent score to normalise or transform term
Institution Search Engine Queries S@1 S@5 S@10 Docs
Telecomm. Unknown 266 75 113 126 72 337
New (no QIE) 266 166 208 212
New (w/QIE) 266 166 208 219
Large Bank 1 Lotus Notes 228 15 46 63 6690
New (no QIE) 151 206 209
New (w/QIE) 150 206 210
Large Bank 2 Unknown 64 17 26 28 1805
New (no QIE) 41 59 60
New (w/QIE) 42 59 60
Large Bank 3 Unknown 143 4 21 39 5113
New (no QIE) 116 132 135
New (w/QIE) 100 132 134
Large Bank 4 Unknown 295 96 165 170 7827
New (no QIE) 170 232 243
New (w/QIE) 160 228 241
University Ultraseek 360 179 235 253 50 203
New (no QIE) 218 293 315
New (w/QIE) 204 304 324
Gov Dept 1 ht:// dig 160 38 98 119 8414
New (no QIE) 128 140 146
New (w/QIE) 128 147 148
Gov Dept 2 Verity & MS 154 1 8 12 42 981
New (no QIE) 79 108 111
New (w/QIE) 86 110 111
Table 9.5: Ranking function retrieval effectiveness on the public corporate webs of several
large Australian organisations. “New” is the first-cut ranking function described within this
chapter. A “no QIE” indicates that the run was performed with query-independent evidence
removed (ON = 0, OFF = 0, URL = 0). The BM25 parameters were set to k1 = 2, b = 0.75.
When used, query-independent evidence parameters were specified as QD = 17, ON = 2,
OFF = 6 and URL = 19. The evaluation was performed between February and March 2003.
contribution. For example, the in-degree of a document might be a more useful satu-
ration value than length when scoring aggregate anchor-text. The exploration of new
approaches to term normalisation and transformation may be particularly interesting
in the context of further anchor-text evidence scoring functions.
Hyperlink recommendation evidence, evaluated in the form of off-site (IDGoff)
and on-site (IDGon) in-degree, was once again found to be a relatively poor form of
document evidence. It is possible that this negative result may be attributed to the
relatively small size of the collection (in comparison to the web), and accordingly a
limited amount of cross site linking in the collection. That said, the demonstration of a
search situation in which the use of hyperlink recommendation evidence significantly
improves retrieval effectiveness remains an elusive goal.
URL length evidence, while found to be important in the training set and in pre-
vious home page finding experiments, was found to be relatively ineffective for the
tasks examined here. Incorporating URL length moderately improved effectiveness
for Topic Distillation, but reduced effectiveness on the combined NP/HP finding
tasks. These results indicate that while URL length is an important component for
effective home page search, its contribution to other tasks may be limited.
Chapter 10
Discussion
The findings presented in this thesis raise a number of issues. This chapter discusses:
• the extent to which experimental findings are likely to hold for enterprise and
intranet web search systems and WWW search engines;
• the search tasks that are the most appropriate to model when evaluating web
search performance;
• how web evidence could be used to build a more efficient ranking algorithm
while maintaining retrieval effectiveness; and
• whether the set of document features used by the ranking function could be
tuned on a per corpus basis.
10.1 Web search system applicability
This thesis has evaluated the effectiveness of web evidence over a large selection of
corporate webs and corporate-sized webs, with corpora ranging from 5000 to 18.5 mil-
lion pages. This range of sizes covers almost all enterprise corpora. The web evidence
inclusive ranking function achieved consistent gains over eight diverse enterprise cor-
pora (in Section 9.3.2), indicating that findings are likely to hold for many small-to-
medium sized web corpora. However, it should be noted that the improvements af-
forded by web evidence are dependent on the quality of hyperlink information in
the corpus, and are subject to the publishing procedures employed by organisations.
These procedures can reduce the effectiveness of web evidence (as studied in Chap-
ter 4). For example, the effectiveness of web evidence is likely to be decreased if the
corpus contains URLs that are unlikely to be linked-to, or the corpus contains a lot of
duplicate content.
Findings from experiments in this thesis may be less applicable to WWW search
engines than to enterprise web search engines. WWW search engines are subject to
substantial efficiency constraints, due to the scale of the document corpus and query
processing demands. The indexes of current state-of-the-art WWW search engines
contain two orders of magnitude more documents than the largest corpus considered
in this thesis. These systems also process thousands of concurrent queries with sub-
second response time. These efficiency requirements are likely to limit the document
features examined and scored during query processing. One benefit of a larger corpus
size is that there is likely to be more link evidence, and so differentiation between links
(e.g. on-site, off-site or nepotistic) might lead to larger gains. However, the hyperlink
recommendation scores calculated throughout experiments in this thesis were found
to be correlated with the scores for corresponding documents extracted from WWW
search engines (see Section 7.6.3). Further, a recent experiment reported that the use
of anchor-text evidence external to a web corpus (but linking to documents inside
the corpus) did not improve retrieval effectiveness [114]. Consequently, it is possible
that further link evidence may not be useful. The correlations between hyperlink
recommendation scores, and the small observed benefit achieved by using external
link evidence, indicate that hyperlink evidence used in WWW search systems is likely
to be comparable to that studied here.
WWW search engines also operate in an adversarial information retrieval envi-
ronment, where web authors may seek to bias ranking functions in their favour by
creating spam content [122]. Given the relative ease and low cost of link construction
on the WWW, one might expect hyperlink recommendation scores to be susceptible to
link spamming. Some spam-like properties were observed in thesis experiments, but
these appeared unsystematic and were deemed to have been created unintentionally.
While some experiments in this thesis cast doubts on the use of hyperlink recommen-
dation methods for spam reduction, these results are not conclusive.
Therefore, results presented in this thesis are likely to apply to ranking in enter-
prise web search, subject to publishing practices, but are less directly applicable to
ranking in WWW search systems.
10.2 Which tasks should be modelled and evaluated in web
search experiments?
It is important that the tasks evaluated and modelled for a web search system be repre-
sentative of the tasks that will be performed by the users of the system. Without access
to studies relating to the user populations, intended system usage and/or large scale
query-logs, it is difficult to determine which tasks are most frequently performed.
Document ranking functions need to be evaluated over more than one type of
search task. It is apparent, both in results from experiments presented in this thesis
and in previous TREC web track evaluations, that performance gains in a single re-
trieval task often do not carry benefits to other tasks. For example, URL length based
measures are particularly useful when seeking home pages (Chapter 7), but appear
to reduce retrieval effectiveness on other tasks (Chapter 9). Therefore a mixed query
set should be used when evaluating a general purpose ranking function. In the 2004
TREC web track, one of the tasks examined was a mixed task that included an equal
mix of named page, home page and Topic Distillation queries [54]. Alternatively a
mixed query set might be balanced in anticipation of the types of queries a system
might receive.1 The query set might also include queries for which the answers are
important resources (either popular, or key corpus documents) for each type of search
task.
The evaluation concerns for WWW search engines are likely to be quite different
from those of corporate webs. WWW search engines need to provide results for a
diverse document corpus and user group. By comparison, search engines on corpo-
rate webs are likely to have a smaller target user audience, and a more homogeneous
document corpus. A prime concern for WWW search engines may be known item
searches, where the pages are important and well known to the user. If the search
system fails for these types of queries, the user is likely to lose some degree of trust in
the system. Therefore, a useful basic effectiveness test may be to observe how well the
search engine can find pages listed in WWW directories, using listing descriptions as
queries (similar to the automated site map method).
In an enterprise search context, the automatic site map method appears to be an
effective way of evaluating retrieval effectiveness for navigational search tasks (when
a site map is available). Site maps often contain organisation specific terminology and
include links to documents that are frequently accessed. For enterprise web search
engines, pages that are contained in site maps may be representative of potential nav-
igational queries, and could thus be an excellent source of queries.
A WWW search engine is likely to be required to process much broader queries
than enterprise web search systems, and so should be evaluated for varied tasks.
Known item search is likely to be particularly important, as a user may be disap-
pointed in a WWW search system if they cannot use it to find a page they know exists,
especially well known entities. For known item search, online WWW directories may
be a good source of query/answer sets of well known and/or useful WWW pages.
10.3 Building a more efficient ranking system
The web evidence and combination methods considered within this thesis may be
used to improve query-processing performance and reduce the size of document in-
dexes.
The high level of effectiveness achieved by anchor-text over all the search tasks
considered in this thesis indicates that a high level of retrieval effectiveness could
be achieved over many search tasks using an anchor-text only index. Such an index
would be far smaller than a full-text index. For the .GOV corpus, aggregate docu-
ments have an average length of 25 terms, as opposed to the 870 terms for document
full-text evidence. Further, there is far more repetition in anchor-text evidence, mean-
ing indexes containing aggregate anchor-text might be expected to achieve higher
compression than indexes of document full-text.
An alternative method for improving query processing efficiency is to exclude
documents that do not meet a minimum query-independent score prior to (or during)
1 For example, if home page finding is an important task, ensure there are many home page finding queries in the test set.
indexing. Results from experiments in this thesis indicate that restricting document
inclusion by imposing a minimum URL-type value can reduce the number of doc-
uments indexed by an order of magnitude, without significantly affecting retrieval
effectiveness for home page finding tasks (see Section 7.2.1).
The use of an anchor-text only index or minimum document threshold may result
in a decrease in retrieval effectiveness for some tasks (such as ad-hoc informational,
or named page finding tasks), as some crawled documents are not indexed and so
would never be retrieved. An extension to this model would be to use two indexes;
one primary index, consisting of aggregate anchor-text only or documents that exceed
the minimum threshold value, and a second index containing the full document cor-
pus. During query processing, if some criteria are not met by documents retrieved
from the primary, faster index (e.g. less than ten matching documents are found, no
documents match all terms, or some minimum score is not achieved), the secondary
index could be consulted. Further work is required to investigate whether such multi-
level indexes would provide large efficiency gains while maintaining (or improving)
retrieval effectiveness, and to explore distributed techniques for dealing with several
indexes.
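One way such a two-level arrangement could behave at query time is sketched below. The fallback criteria (a minimum number of results and at least one document matching all query terms) are those suggested above, while the index interface and result attributes are assumptions.

def two_tier_search(query_terms, primary_index, secondary_index, min_results=10):
    # Query the small, fast primary index (for example anchor-text only, or
    # documents above a query-independent threshold) first, and fall back to
    # the full secondary index only when the primary results look inadequate.
    results = primary_index.search(query_terms)
    enough = len(results) >= min_results
    full_match = any(r.matched_terms == len(query_terms) for r in results)
    if enough and full_match:
        return results
    return secondary_index.search(query_terms)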
The size of a combined document index can be reduced through the use of a single
set of document and corpus statistics when scoring query-dependent features. This
requires only one set of statistics to be stored per document/term combination, rather
than a set for each query-dependent feature. In fact, the use of full-text length when
normalising term contribution in aggregate anchor-text improved retrieval effective-
ness (see Section 8.3.1). Further work is required to determine whether inverse docu-
ment frequency should be scored per document field.
10.4 Tuning on a per corpus basis
The results from experiments in this thesis indicate that document ranking effective-
ness not only depends on the search task evaluated, but also on the document corpus.
For example, if the ranking function is to be used on a corporate web in which all
documents are published through a Content Management System (CMS) that uses
long parameterised URLs, URL length-based measures are not likely to be effective.
This effect was observed for one of the corpora studied in Section 9.3.2: Large Bank 1.
This bank publishes all its content using the Lotus Domino system, which (at least con-
figured as it was in this case) serves content using long URLs. Similarly, hyperlink
evidence is not likely to be effective for a corpus which has few hyperlinks.
An attractive avenue for future work may be the tuning of document feature con-
tribution according to the expected utility of that evidence. For example, if a web
site’s hyperlink graph is sufficiently small, hyperlink evidence could be disabled. This
could be generalised further through the creation of profiles for common CMS con-
figurations that indicate what forms of document evidence are likely to be useful.
Alternatively, ranking parameters could be tuned using an automated approach us-
ing judgements such as those collected from a web site map. This remains for future
work.
If corpus-based tuning is not employed it is important that the web authors are
aware of evidence commonly used to match and rank documents. This is especially
the case in an enterprise web context.
Chapter 11
Summary and conclusions
The experiments in this thesis demonstrate how web evidence can be used to improve
retrieval effectiveness for navigational search tasks.
The first set of experiments, presented in Chapter 4, studied the relationship be-
tween site searchability and the likelihood of a site’s documents being retrieved by
prominent WWW search engines. This study provided one of the first empirical in-
vestigations of transactional search. The performance of WWW search engines was
shown to differ markedly, with two-out-of-four search engines never retrieving books
within the top ten results, and one search engine favouring a particular bookstore
(perhaps indicating a partnership). A large variation in bookstore searchability was
also observed.
An investigation of potential biases in hyperlink evidence was then presented in
Chapter 5, using data collected from WWW search engines. Biases were observed
in hyperlink recommendation evidence towards the home pages of popular and/or
technology-oriented companies. These results indicate that the use of hyperlink evi-
dence may not only improve home page finding effectiveness (important in naviga-
tional search), but also bias search results towards this user demographic (i.e. users
who are interested in popular, technology-oriented information). The two types of
hyperlink recommendation evidence (Google PageRank and AllTheWeb in-degree)
were virtually indistinguishable, providing similar recommendations towards popu-
lar companies. Both measures were also correlated for a set of company home pages,
and a set of known spam pages. The similarity between the two measures raised ques-
tions as to the usefulness of PageRank over in-degree. Both measures gave preference
to home page documents, supporting the investigation of hyperlink recommendation
evidence for home page finding tasks in later chapters.
Methods for combining hyperlink recommendation evidence (and other query-
independent measures) with query-dependent evidence were investigated in Chap-
ter 6. Results from this experiment demonstrated how assigning a large weight to
hyperlink recommendation evidence in a ranking function may trade document rele-
vance for link popularity. It was submitted that hyperlink recommendation evidence
should be included either as a small component in the ranking function, or in the form
of a minimum threshold value enforced prior to document ranking.
Chapter 7 presented a detailed evaluation of home page finding on five small-to-
medium web test collections using three query-dependent baselines and four forms
of query-independent evidence (in-degree, Democratic PageRank, Aristocratic Page-
Rank, and URL length). The results from these experiments demonstrated the impor-
tance of both anchor-text and URL length measures in home page finding tasks. The
most consistent improvements in retrieval effectiveness were achieved using a base-
line containing document full-text and anchor-text, with a score-based re-ranking by
URL-type. Improvements were observed in both efficiency and effectiveness when
using minimum query-independent value thresholds for page inclusion, with the
gains for URL length thresholds being particularly large. Little benefit was observed
through the use of hyperlink recommendation methods. Small gains were achieved
when hyperlink recommendation scores were used as minimum thresholds for page
inclusion. However, a score-based re-ranking of query-dependent baselines by hyper-
link recommendation evidence performed poorly.
Both PageRank and in-degree performed similarly and were found to be highly
correlated. This correlation, and the almost identical performance of both PageRank
and in-degree in the home page finding tasks, indicated no reason to choose Demo-
cratic PageRank over in-degree for home page finding on corpora of under 18.5 mil-
lion pages. When considered with the correlations previously observed in WWW-
based hyperlink recommendation scores, these results also cast doubt as to whether
PageRank and in-degree values would show more divergence on the complete WWW
graph. The PageRank values computed for these experiments were also found to be
correlated with Google WWW PageRanks for pages present in the Open Directory.
A series of follow-up experiments (using the same data) found that the use of
URL length, when measured in characters, is as effective as using URL-types. A fur-
ther finding was that using hyperlink recommendation evidence calculated for a web
graph that included link evidence external to the corpus did not improve retrieval
effectiveness. By contrast, the use of external anchor-text information significantly
improved retrieval effectiveness.
Chapter 8 presented an analysis of the application of Okapi BM25 based measures
in scoring anchor-text evidence. This analysis led to several proposed modifications
to Okapi BM25 that, it was hypothesised, might improve the scoring of anchor-text
evidence. Proposed modifications included an increase of the saturation point for
document term frequencies, the calculation of separate anchor-text-only inverse doc-
ument frequency values, and the use of document full-text length to normalise aggre-
gate anchor-text. An empirical investigation was carried out to determine whether
the proposed changes to anchor-text scoring improved retrieval effectiveness. This
showed that the revised scoring functions achieved significant improvements in re-
trieval effectiveness, for both Topic Distillation and navigational tasks.
Experiments within Chapter 8 also analysed and evaluated strategies for com-
bining query-dependent baselines. Results for these combinations demonstrated the
importance of treating document-level and web-based evidence as separate entities.
Additionally the results showed that computing a single set of (global) document
and corpus statistics for all query-dependent fields improved system efficiency and
provided small gains in retrieval effectiveness. Surprisingly, the effectiveness of the
anchor-text baseline improved when full-text length was used to normalise aggregate
anchor-text document length.
Chapter 9 presented a first-cut document ranking function that included web ev-
idence found useful in earlier experiments within this thesis (anchor-text and URL-
length measures in particular). The ranking function was evaluated through ten runs
submitted to the two TREC web track tasks in 2003. The best of the runs submit-
ted for the Topic Distillation task achieved the highest performance of any system
submission. The best of the runs submitted for the combined home page / named
page finding task achieved the second highest performance of any system submis-
sion. To further validate the ranking function, a series of follow-up experiments was
performed using corporate web collections. Results from these experiments showed
that the ranking function outperformed seven out of eight incumbent search systems
(while performing comparably to the other).
11.1 Findings
Experimental findings suggest that the most important form of web evidence is anchor-
text. Using anchor-text evidence to rank documents, rather than document full-text,
provides significant effectiveness gains in home page finding and Topic Distillation
tasks. The methods commonly used for length normalising anchor-text aggregate
documents were found to be deficient. Removing aggregate anchor-text length nor-
malisation altogether, or normalising according to full-text document length were
both found to improve retrieval effectiveness. The removal of length normalisation
from the anchor-text scoring function favours large volumes of incoming anchor-text,
and according to prestige and recommendation assumptions, may favour prominent
pages.
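As a point of reference (a sketch using the standard Okapi BM25 formulation, not a verbatim reproduction of the scoring functions evaluated in Chapter 8), the BM25 contribution of a term t to the score of document d can be written as
\[
  w(t,d) \;=\; \mathit{idf}(t)\,
  \frac{(k_1+1)\,\mathit{tf}_{t,d}}
       {k_1\left((1-b)+b\,\frac{\mathit{dl}_d}{\mathit{avdl}}\right)+\mathit{tf}_{t,d}}.
\]
In these terms, removing aggregate anchor-text length normalisation corresponds to setting b = 0 when scoring the anchor-text field, while normalising by full-text length corresponds to replacing dl_d, the length of the anchor-text aggregate document, with the document's full-text length and leaving b unchanged.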
The use of URL-length based measures, either through grouping URLs into classes
(as in URL-type) or simply by counting the number of characters, brought consistent
gains for home page finding tasks. However, the use of this evidence reduced ef-
fectiveness for other tasks, and would be ineffective for corpora which do not exhibit
any URL hierarchy. Further work is needed to understand how to best use URL-based
measures in a general purpose web search system.
Hyperlink recommendation evidence was far less effective than URL-based mea-
sures. The use of hyperlink recommendation evidence provided minimal gains, even
when an Optimal re-ranking was used. The most effective use of hyperlink rec-
ommendation scores was in reducing the size of corpora without reducing home
page search performance. However, these gains were small by comparison to those
achieved using URL-type thresholds. Democratic PageRank was not observed to sig-
nificantly out-perform simple in-degree. Given the extra cost involved in comput-
ing Democratic PageRank, this thesis presents no evidence to support the use of De-
mocratic PageRank over in-degree. A PageRank biased towards authoritative sites
improved effectiveness somewhat; however, the scores were based on bookmarks
known to match the best answers for the queries used. Further work is required to
investigate and compare this PageRank formulation to other authority-biased mea-
sures.
The combination method for query-dependent evidence which achieved the high-
est retrieval effectiveness on navigational and Topic Distillation tasks was the hybrid
combination of scores. The hybrid combination considers document-level and web
based evidence as separate document components, and uses a linear combination to
sum scores. The separation of document-level and web-based information means that
two scores are assigned per document, one for the document content (or the author’s
description), and one for the wider web community view of the document. If both
measures agree (and the document is scored highly on both measures for a particular
query) this is likely to be a strong indication that the page is what it claims to be. Com-
puting global document and corpus statistics for all query-dependent fields improved
system efficiency and provided small gains in retrieval effectiveness.
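A minimal Python sketch of this hybrid combination; the field names, weights, and example scores are illustrative assumptions, not the tuned values used in the experiments.

def hybrid_score(document_scores, web_scores, w_doc=1.0, w_web=1.0):
    # Document-level evidence (full-text, title, ...) and web-based evidence
    # (aggregate anchor-text and other external descriptions) are scored as
    # two separate components and combined with a linear sum.
    return (w_doc * sum(document_scores.values())
            + w_web * sum(web_scores.values()))

# A page whose own content and whose incoming anchor-text both match the
# query scores highly on both components, and so ranks highly overall.
example = hybrid_score({"fulltext": 7.2, "title": 1.4}, {"anchor_text": 9.1})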
The best methods for combining query-independent evidence with query-
dependent baselines involved the application of minimum thresholds for page in-
clusion, or re-ranking all pages within some percentage of the top score. Both combi-
nations proved effective when combining URL-type evidence with query-dependent
baselines.
Bias towards the home pages of popular and/or technology-oriented companies
was observed in hyperlink-based evidence. Some biases, such as the technology bias,
could negatively affect document ranking if ignored, as search results will cater to a
small demographic of web users. These findings indicate that care should be taken
when using such evidence in document ranking, or in a direct Toolbar indicator. The
observed bias may be especially confusing when recommendation scores are used
directly as a measure of a page’s quality, as in the Google Toolbar.
11.2 Document ranking recommendations
Experimental results indicate that an effective web-based document ranking algo-
rithm for navigational tasks should exploit both document-level evidence and web-
based evidence. These two types of document evidence are best combined using a
hybrid combination with globally computed document and term statistics. Document
evidence should include full-text evidence and other useful document-level evidence.
Web-based evidence should make use of incoming anchor-text, and other useful ex-
ternal document descriptions. Anchor-text aggregate document length should not be
used to normalise anchor-text term contribution.
For home page search, a URL depth component either measured by characters
or classified by type, should be included. The measure may be included either by
re-ranking documents that achieve within n% of the top score by URL length, or by
adding a normalised URL length score to the query-dependent score.
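A sketch of the score-based re-ranking option, in Python; the n = 20% cutoff and the (doc_id, query score, URL length) document representation are assumptions made for the example.

def rerank_by_url_length(ranked, n_percent=20.0):
    # `ranked`: list of (doc_id, query_score, url_length), best score first.
    if not ranked:
        return ranked
    cutoff = ranked[0][1] * (1.0 - n_percent / 100.0)
    head = [d for d in ranked if d[1] >= cutoff]
    tail = [d for d in ranked if d[1] < cutoff]
    # Within the group of near-top documents, prefer shorter (shallower) URLs.
    head.sort(key=lambda d: d[2])
    return head + tail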
The best choice of hyperlink recommendation algorithm for use in home page
finding within corporate-scale corpora is in-degree, as the PageRank variants appear
to offer little or no advantage and are more computationally expensive.
11.3 Future work
The findings within this thesis raise several issues that merit further investigation.
Future work for web-based document ranking might include:
• A study of whether web evidence can improve retrieval effectiveness for other
web-based user search tasks, such as informational and transactional search.
• A study of further anchor-text ranking functions. The modifications to Okapi
BM25 improved retrieval effectiveness; however, further work is needed to
determine whether the document and collection statistics applied to scoring
anchor-text were optimal.
• Further study of how document and web-based evidence should be combined.
This thesis has explored many different ways of combining document evidence,
but it is not clear that the optimal method has been found.
Further studies might also look at the nature of hyperlink recommendation on the
WWW. This could include:
• A study of the changing nature of hyperlink evidence on the WWW. For exam-
ple, is the proportion of dynamic vs. static hyperlinks on the WWW constant?
Is the proportion of links which are dead (have no target) constant over time?
Also worthy of further examination is how new trends on the WWW, such as
web logging, might affect the quality and quantity of hyperlink evidence.
• A study of how an increase in the effectiveness of WWW-based search engines
might affect the quality of hyperlink evidence on the WWW. Does high quality
search mean authors are less likely to link to useful documents?
• A further study of how document quality metrics, such as PageRank and in-
degree, relate to users' (or industry professionals') satisfaction with document
quality. This investigation could focus on the use of tools like the Google Toolbar.
Appendix A
Glossary
All terms within this thesis, unless defined below, are as used in the (Australian) Mac-
quarie Dictionary, searchable on the WWW at http://www.dict.mq.edu.au.
Aggregate anchor-text: all anchor-text snippets pointing to a page.
Anchor-text: words contained within anchor-tags which are “clicked on” when a
link is followed.
Anchor-text snippet: a piece of anchor-text that annotates a single link.
Anchor-text aggregate document: a surrogate document containing all anchor-text
snippets pointing to a page.
Aristocratic PageRank (APR): a formulation of PageRank that favours a manually
specified set of (authoritative) pages. The PageRank calculation is biased towards
these pages by using the set of pages in the PageRank bookmark vector.
Collection: see Test collection.
Corpus: a set of documents.
Crawler: the web search system component that gathers documents from a web.
Democratic PageRank (DPR): the default PageRank formulation in which all pages
are treated a priori as equal.
Entry point: a document within a site hierarchy from which web users can begin to
explore a particular topic.
Evidence: a document attribute, feature, or group of attributes and features that may
be useful in determining whether the document should be retrieved (or not) for
a particular query.
Feature: information extracted from a document and used during query processing.
Field: a query-dependent document component, for example document full-text,
document title or document aggregate anchor-text.
Home page: the key entry point for a particular web site.
Home page finding: a navigational search task in which the goal is to find home pages.
Hyperlink recommendation: an algorithm which is based on the number or “qual-
ity” of web recommendations for a particular document.
In-degree: the simplest hyperlink recommendation algorithm in which a document’s
value is measured by the number of incoming hyperlinks.
Indexer: the web search system component that indexes documents gathered by the
crawler into a format which is amenable to quick access by the query processor.
Informational search task: a user task in which the user need is to acquire or learn
some information that may be present in one or more web pages.
Link farms: an “artificial” web graph created by spammers through generating link
spam to funnel hyperlink evidence to a set of pages for which they desire high
web rankings.
Link spam: spam content introduced into hyperlink evidence by generating spam
documents that link to other documents with false or misleading information.
Mean Reciprocal Rank (MRR): a measure used in evaluating web search system
performance, computed by averaging over a set of queries the reciprocal of the
rank at which the first relevant (useful) document is retrieved.
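Expressed as a formula (a standard statement of the measure, assuming rank_i is the rank of the first relevant document retrieved for query i, and taking 1/rank_i = 0 when no relevant document is retrieved):
\[
  \mathrm{MRR} \;=\; \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\mathit{rank}_i}
\]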
Named page finding: a navigational search task in which the goal of the search system
is to find a particular page given its name.
Navigational search task: a user task where the user needs to locate a particular
entity given its name.
PageRank: a hyperlink recommendation algorithm that estimates the probability that a
“random” web surfer would be on a particular page on a web at any particular
time.
Precision: a measure used in evaluating web search system performance. Precision
is the proportion of retrieved documents that are relevant to a query at a partic-
ular rank cut-off.
Query-dependent evidence: evidence that depends on the user query and is calcu-
lated by the query processor during query processing.
Query-independent evidence: evidence that does not depend on the user query,
generally calculated during the document indexing phase (prior to query process-
ing).
Query processor: a typical component of a web search system that consults the index
to retrieve documents in response to a user query.
R-Precision (R-Prec): a measure used to evaluate web search system performance.
R-Precision is the precision of a system at rank R, where R is the number of
documents relevant to the query (averaged across multiple queries).
Recall: a measure used in evaluating web search system performance. Recall is the
total proportion of all relevant documents that have been retrieved within a par-
ticular cut-off for a query.
Search Engine Optimisation: optimising document and web structure such that
search engines may better match a document’s content (without generating spam
content).
Spam: content generated by web publishers to artificially boost the rank of their
pages. Spam techniques include the addition of otherwise unneeded keywords
and hyperlinks.
Stemming: stripping term suffixes or prefixes to collapse a term down to its canon-
ical form (or stem). The Porter suffix stemmer [163] is used for this purpose in
some thesis experiments.
Test collection: a snapshot of a user task and document corpus used to evaluate
system effectiveness. A test collection includes a set of documents (corpus), a set
of queries, and relevance judgements for documents in the corpus according to
the queries.
Topic Distillation: a user task in which the goal is to find entry points to relevant sites
given a broad query.
Text REtrieval Conference (TREC): an annual conference run by the US National
Institute of Standards and Technology (NIST) and the US Defense Advanced Re-
search Projects Agency (DARPA) since 1992. The goal of the conference is to
promote the understanding of information retrieval algorithms by allowing re-
search groups to compare system effectiveness on common test collections.
Traditional information retrieval: information retrieval performed over a flat corpus
using full-text fields.
Transactional search task: a user search task where the user needs to perform some
activity on, or using, the WWW.
URL-type: a URL class breakdown, proposed by Westerveld et al. [212], in which
some URLs are deemed more important than others on the basis of structure
and depth (outlined in Section 2.3.3.2).
web: a corpus containing linked documents.
web evidence: evidence derived from some web property or context.
web graph: a graph built from the hyperlink structure of a web, where web pages
are nodes, and hyperlinks are edges.
WWW: the World-Wide Web is a huge repository of linked documents distributed on
millions of servers world-wide. The WWW contains at least ten billion publicly
visible web documents.
Appendix B
The canonicalisation of URLs
When canonicalising URLs, the following rules were followed (a code sketch covering several of these rules appears after the list):
• If the relative URI steps outside the root of the server, the link is resolved to the
server root directory. For example:
– A link to /../foo.html from http://cs.anu.edu.au/ will be resolved
to http://cs.anu.edu.au/foo.html;
– A link to ../../foo.html from http://cs.anu.edu.au/~Trystan.Upstill/ will be
resolved to http://cs.anu.edu.au/foo.html; and
– A link to /../foo.html from http://cs.anu.edu.au/~Trystan.Upstill/pubs/ will
be resolved to http://cs.anu.edu.au/foo.html.
• Hyperlinks and documents with common default root page names (e.g.
index.htm(l), default.htm(l), welcome.htm(l), and home.htm(l))
are stemmed to the directory path. For example:
– A link to http://cs.anu.edu.au/default.html is resolved to
http://cs.anu.edu.au/; and
– A link to http://cs.anu.edu.au/~Trystan.Upstill/index.html is resolved to
http://cs.anu.edu.au/~Trystan.Upstill/.
• Multiple directory slashes are resolved to a single slash. For example:
– A link to http://cs.anu.edu.au///// is resolved to http://cs.anu.edu.au/; and
– A link to http://cs.anu.edu.au//////~Trystan.Upstill// is resolved to
http://cs.anu.edu.au/~Trystan.Upstill/.
• URLs pointing to targets inside documents are treated as links to the full docu-
ment. For example:
– A link to http://cs.anu.edu.au/foo.html#Trystan is resolved to
http://cs.anu.edu.au/foo.html; and
– A link to http://cs.anu.edu.au/#foo is resolved to http://cs.anu.edu.au/.
• Hyperlinks are not followed from framesets (as they are not crawled). Hyperlink
extraction from frameset sites requires that links directly to navigational panes
be observed (and not links to framesets).
• If the port to which an HTTP request is made is the default port (e.g. 80), it is
removed. For example:
– A link to http://cs.anu.edu.au:80 is resolved to http://cs.anu.edu.au; and
– A link to http://cs.anu.edu.au:80/~Trystan.Upstill/ is resolved to
http://cs.anu.edu.au/~Trystan.Upstill/.
• URLs whose host name lacks a leading “www” label have “www.” prepended. For example, a link to
http://sony.com/ is resolved to http://www.sony.com.
• If no protocol is provided http:// is assumed. For example a link to sony.com
is resolved to http://www.sony.com.
• Host names are converted into lower case (as host names are case-insensitive).
• Default web server directory listing pages are removed.
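The sketch below illustrates several of these rules in Python (default root page names, multiple slashes, fragments, default ports, missing protocols, and lower-casing of host names); it is an illustration of the rules above, not the code used to build the collections, and the remaining rules (relative-URI resolution, framesets, “www” prefixing, and directory listing pages) are omitted.

import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PAGES = re.compile(r'/(index|default|welcome|home)\.html?$')

def canonicalise(url: str) -> str:
    if '://' not in url:                    # no protocol given: assume http
        url = 'http://' + url
    scheme, netloc, path, query, _ = urlsplit(url)   # discard any #fragment
    netloc = netloc.lower()                 # host names are case-insensitive
    if netloc.endswith(':80'):              # drop the default HTTP port
        netloc = netloc[:-3]
    path = re.sub(r'/{2,}', '/', path)      # collapse multiple slashes
    path = DEFAULT_PAGES.sub('/', path)     # stem default root page names
    return urlunsplit((scheme, netloc, path, query, ''))

# For example, canonicalise("http://CS.ANU.EDU.AU:80///default.html#foo")
# returns "http://cs.anu.edu.au/".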
Appendix C
Bookstore search and searchability:
case study data
C.1 Book categories
• (27) Children’s
• (15) Hardcover Advice
• (11) Hardcover Business
• (35) Hardcover Fiction
• (29) Hardcover Non-Fiction
• (15) Paperback Advice
• (07) Paperback Business
• (35) Paperback Fiction
• (32) Paperback Non-Fiction
• (206) Total
Duplicate books were removed from the query set. For example the book titled
“Stupid White Men” was in both the Hardcover Business and Hardcover Non-Fiction
sections, and so was only considered in the Hardcover Business category.
C.2 Web search engine querying
• AltaVista
– General Queries: Book title surrounded by quotation (“) marks.
– URL Coverage: canonical domain name with “url:” parameter.
– Link Coverage: canonical domain name with “link:” parameter.
– Timeframe: General and Domain Restricted Queries were submitted between
20/09/02 and 02/10/02; Link Coverage and URL Coverage experiments were
performed on 09/10/02.
• AllTheWeb (Fast)
– General Queries: Book title with exact phrase box ticked.
– URL Coverage: Advanced search restricting to domain using “domain”
textbox with canonical domain name.
– Link Coverage: Advanced search using Word Filter with “Must Include” in
the preceding drop down box, canonical domain name in middle text box
and “in the link to URL” in the final drop down box.
– Timeframe: General and Domain Restricted Queries were submitted between
20/09/02 and 02/10/02; Link Coverage and URL Coverage experiments were
performed on 09/10/02.
• Google
– General Queries: Book title surrounded by quotation (“) marks.
– URL Coverage: Search for the non-presence of a non-existing word
(e.g.: -adsljflkjlkjdflkjasdlfj0982739547asdhkas) and using canonical domain
name with “host:” parameter.
– Link Coverage: Not available.
– Timeframe: General and Domain Restricted Queries were submitted between
20/09/02 and 02/10/02; Link Coverage and URL Coverage experiments were
performed on 09/10/02.
• MSN Search (Inktomi)
– General Queries: Advanced search with the book title entered in the “exact phrase” box.
– URL Coverage: Advanced search using the domain name as the query, and
restricting domain using “domain” text box with canonical domain name.
– Link Coverage: Not available.
– Timeframe: General and Domain Restricted Queries were submitted between
20/09/02 and 02/10/02; Link Coverage and URL Coverage experiments were
performed on 09/10/02.
C.3 Correct book answers in bookstore case study
Category Book Title ISBN
Childrens America 0689851928
Childrens Artemis Fowl 0786808012
0786817070
Childrens Artemis Fowl: the Arctic Incident 0786808551
Childrens Can You See What I See? 0439163919
Childrens Daisy Comes Home 039923618X
Childrens Disney’s Lilo and Stitch 0736413219
Childrens Giggle, Giggle, Quack 0689845065
Childrens Good Morning, Gorillas 0375806148
0375906142
Childrens Harry Potter and the Chamber of Secrets 0439064872
0439064864
0613287142
Childrens Harry Potter and the Goblet of Fire 0439139600
0439139597
Childrens Harry Potter and the Prisoner of Azkaban 0439136369
0439139597
0613371062
Childrens Harry Potter and the Sorcerer’s Stone 059035342X
0590353403
0613206339
Childrens Holes 0440414806
0374332657
044022859X
0613236696
Childrens If You Take a Mouse to School 0060283289
Childrens Junie B., First Grader (at Last!) 0375802932
0375815163
0375902937
Childrens Junie B., First Grader: Boss of Lunch 0375815171
Childrens Lemony Snicket: the Unauthorized Autobiography. 0060007192
Childrens Oh, the Places You’ll Go! 0679805273
Childrens Olivia 0689829531
Childrens Olivia Saves the Circus 068982954X
Childrens Princess in the Spotlight 0060294655
0064472795
0060294663
Childrens Stargirl 037582233X
0679886370
0679986375
B00005TZX9
B00005TPDD
Childrens The All New Captain Underpants Extracrunchy Book O’fun 2 0439376084
Childrens The Bad Beginning 0064407667
0060283122
Childrens The Reptile Room 0064407675
0060283130
Childrens The Three Pigs 0618007016
Childrens The Wide Window 0064407683
0060283149
Hardcover Advice 10 Secrets for Success and Inner Peace 1561708755
Hardcover Advice Body for Life 0060193395
Hardcover Advice Conquer the Crash 0470849827
Hardcover Advice Execution 0609610570
Hardcover Advice Fish! 0786866020
Hardcover Advice Get With the Program! 0743225996
Hardcover Advice I Hope You Dance 1558538445
Hardcover Advice Self Matters 074322423X
Hardcover Advice Sylvia Browne’s Book of Dreams 0525946586
Hardcover Advice The Fat Flush Plan 0071383832
Hardcover Advice The Perricone Prescription 0060188790
Hardcover Advice The Prayer of Jabez 1576737330
1576738108
Hardcover Advice The Prayer of Jabez for Women 1576739627
1590520491
Hardcover Advice The Wisdom of Menopause 055380121X
Hardcover Advice Who Moved My Cheese? 0399144463
Hardcover Business Conquer the Crash (duplicate)
Hardcover Business Execution (duplicate)
Hardcover Business Fish (duplicate)
Hardcover Business Fish! Tales 0786868686
Hardcover Business Good to Great 0066620996
Hardcover Business How to Lose Friends and Alienate People 030681188X
Hardcover Business Martha Inc. 0471123005
Hardcover Business Oh, the Things I Know! 052594673X
Hardcover Business Snobbery: the American Version 0395944171
Hardcover Business Stupid White Men 0060392452
Hardcover Business Ten Things I Learned From Bill Porter 1577312031
Hardcover Business The Pact 157322216X
Hardcover Business Tuxedo Park 0684872870
0684872889
Hardcover Business Wealth and Democracy 0767905334
Hardcover Business Who Moved My Cheese? (duplicate)
Hardcover Fiction A Love of My Own 0385492707
Hardcover Fiction A Thousand Country Roads 0971766711
Hardcover Fiction Absolute Rage 0743403444
Hardcover Fiction An Accidental Woman 0743204700
Hardcover Fiction Ash Wednesday 037541326X
Hardcover Fiction Atonement 0385503954
Hardcover Fiction Charleston 0525946500
Hardcover Fiction Eleventh Hour 0399148779
Hardcover Fiction Enemy Women 0066214440
Hardcover Fiction Fire Ice 0399148728
Hardcover Fiction Hard Eight 0312265859
Hardcover Fiction Her Father’s House 0385334729
Hardcover Fiction Hot Ice 0553802747
Hardcover Fiction In This Mountain 0670031046
Hardcover Fiction Lawrence Sanders: Mcnally’s Alibi 0399148795
Hardcover Fiction Leslie 0743228669
Hardcover Fiction Partner in Crime 0380977303
Hardcover Fiction Pasadena 0375504567
Hardcover Fiction Prague 0375507876
Hardcover Fiction Red Rabbit 0399148701
Hardcover Fiction Standing in the Rainbow 0679426159
Hardcover Fiction Stone Kiss 0446530387
Hardcover Fiction Sunset in St. Tropez 0385335466
Hardcover Fiction The Art of Deception 0786867248
Hardcover Fiction The Beach House 0316969680
Hardcover Fiction The Dive From Clausen’s Pier 0375412824
Hardcover Fiction The Emperor of Ocean Park 0375413634
Hardcover Fiction The Lovely Bones 0316666343
Hardcover Fiction The Nanny Diaries 0312278586
Hardcover Fiction The Remnant 0842332278
Hardcover Fiction The Shelters of Stone 0609610597
Hardcover Fiction The Summons 0385503822
Hardcover Fiction Unfit to Practice 0385334842
Hardcover Fiction Whispers and Lies 0743446259
Hardcover Fiction You Are Not a Stranger Here 0385509529
Hardcover Non-Fiction A Long Strange Trip 0767911857
Hardcover Non-Fiction A Mind at a Time 0743202228
Hardcover Non-Fiction A Nation Challenged 0935112766
Hardcover Non-Fiction Among the Heroes 0060099089
Hardcover Non-Fiction Cicero 0375507469
Hardcover Non-Fiction Crossroads of Freedom: Antietam 0195135210
Hardcover Non-Fiction Firehouse 1401300057
Hardcover Non-Fiction General Patton 0060009829
Hardcover Non-Fiction Gettysburg 0060193638
Hardcover Non-Fiction Good to Great (duplicate)
Hardcover Non-Fiction John Adams 0743223136
Hardcover Non-Fiction Lucky Man 0786867647
Hardcover Non-Fiction Martha Inc. (duplicate)
Hardcover Non-Fiction Odd Girl Out 0151006040
Hardcover Non-Fiction Once Upon a Town 0060081961
Hardcover Non-Fiction Profiles in Courage for Our Time 0786867930
Hardcover Non-Fiction Running With Scissors 0312283709
Hardcover Non-Fiction Sacred Contracts 0517703920
Hardcover Non-Fiction Sex, Lies, and Headlocks 0609606905
Hardcover Non-Fiction Six Days of War 0195151747
Hardcover Non-Fiction Slander 1400046610
Hardcover Non-Fiction Small Wonder 0060504072
Hardcover Non-Fiction Snobbery (duplicate)
Hardcover Non-Fiction Strong of Heart 006050949X
Hardcover Non-Fiction Stupid White Men (duplicate)
Hardcover Non-Fiction The Art of Travel 0375420827
Hardcover Non-Fiction The Cell 0786869003
Hardcover Non-Fiction The Lobster Chronicles 0786866772
Hardcover Non-Fiction The Right Words at the Right Time 0743446496
Hardcover Non-Fiction The Sexual Life of Catherine M. 0802117163
Hardcover Non-Fiction The Universe in a Nutshell 055380202X
Hardcover Non-Fiction Tuxedo Park (duplicate)
Hardcover Non-Fiction Wealth and Democracy (duplicate)
Hardcover Non-Fiction Why I Am a Catholic 0618134298
Hardcover Non-Fiction You Cannot Be Serious 0399148582
Paperback Advice A Week in the Zone 006103083X
Paperback Advice Chicken Soup for the Teacher’s Soul 1558749780
1558749799
Paperback Advice Crucial Conversations 0071401946
Paperback Advice Dr. Atkins’ New Diet Revolution 006001203X
1590770021
Paperback Advice Fix-it and Forget-it Cookbook 1561483397
1561483389
1561483176
Paperback Advice Guinness World Records 2002 0553583786
Paperback Advice Leonard Maltin’s 2003 Movie and Video Guide 0451206495
Paperback Advice Life Strategies 0786884592
0786865482
Paperback Advice Relationship Rescue 0786866314
078688598X
Paperback Advice Rich Dad, Poor Dad 0446677450
Paperback Advice The Four Agreements 1878424319
1878424505
Paperback Advice The Pill Book: New and Revised 10th Edition. 0553584782
0553050133
Paperback Advice The Unauthorized Osbournes 1572435208
Paperback Advice The Wrinkle Cure 0446677760
1579542379
Paperback Advice What to Expect When You’re Expecting 0761121323
0761125493
Paperback Business Crucial Conversations (duplicate)
Paperback Business Fast Food Nation 0060938455
0395977894
Paperback Business How to Make Money in Stocks 0071373616
Paperback Business Life Strategies (duplicate)
Paperback Business Nickel and Dimed 0805063897
0805063889
Paperback Business Rich Dad, Poor Dad (duplicate)
Paperback Business The Tipping Point 0316316962
0316346624
Paperback Business Two Bad Years and Up We Go! 1892008726
Paperback Business What Color Is Your Parachute 2002 1580083420
1580083412
Paperback Business What Went Wrong at Enron 0471265748
Paperback Fiction A Bend in the Road 0446611867
0446527785
Paperback Fiction A Painted House 044023722X
038550120X
Paperback Fiction A Walk to Remember 0446608955
0613281292
Paperback Fiction Always in My Heart 0451206665
Paperback Fiction Bel Canto 0060934417
Paperback Fiction Blood Work 0446602620
0613236882
Paperback Fiction Cordina’s Royal Family 0373484836
Paperback Fiction Divine Secrets of the Ya-ya Sisterhood 0060928336
0060173289
Paperback Fiction Empire Falls 0375726403
0679432477
Paperback Fiction Enemy Within 0743403436
0743403428
Paperback Fiction Envy 0446611808
0446527130
Paperback Fiction Face the Fire 051513287X
Paperback Fiction Fanning the Flame 0743419162
Paperback Fiction For Better, for Worse 0380820447
Paperback Fiction Four Blondes 080213825X
0871138190
Paperback Fiction Good in Bed 0743418174
0743418166
Paperback Fiction Hemlock Bay 0399147381
0515133302
Paperback Fiction Honest Illusions 0399137610
0515110973
Paperback Fiction Little Altars Everywhere 0060976845
006019362X
Paperback Fiction Mercy 0671034022
0671034014
Paperback Fiction Paradise Lost 0140424261
Paperback Fiction Stonebrook Cottage 1551669234
Paperback Fiction Summer Pleasures 0373218397
Paperback Fiction Suzanne’s Diary for Nicholas 0446679593
0316969443
Paperback Fiction The Associate 0061030643
0060196254
Paperback Fiction The Bachelor 0446610542
Paperback Fiction The Last Time They Met 0316781266
0316781142
Paperback Fiction The New Jedi Order: Traitor 034542865X
0553713175
Paperback Fiction The Smoke Jumper 0385334036
0440235162
Paperback Fiction The Straw Men 0515134279
Paperback Fiction The Surgeon 0345447840
0345447832
Paperback Fiction True Blue 0553583980
Paperback Fiction Valhalla Rising 039914787X
0425185710
Paperback Fiction When Strangers Marry 0060507365
Paperback Fiction Whisper of Evil 0553583468
Paperback Non-Fiction A Beautiful Mind 0743224574
0684819066
Paperback Non-Fiction A Child Called ”It” 1558743669
0613171373
Paperback Non-Fiction A Man Named Dave 0452281903
0525945210
Paperback Non-Fiction An Italian Affair 0375724850
0375420657
Paperback Non-Fiction April 1865 0060930888
0060187239
Paperback Non-Fiction Ava’s Man 0375724443
0375410627
Paperback Non-Fiction Black Hawk Down 0871137380
0140288503
Paperback Non-Fiction Brunelleschi’s Dome 0142000159
0802713661
Paperback Non-Fiction Comfort Me With Apples 0375758739
0375501959
Paperback Non-Fiction Fast Food Nation (duplicate)
Paperback Non-Fiction Founding Brothers 0375405445
0375705244
Paperback Non-Fiction French Lessons 0375705619
0375405909
Paperback Non-Fiction From Beirut to Jerusalem 0385413726
0374158959
Paperback Non-Fiction Ghost Soldiers 038549565X
0385495641
Paperback Non-Fiction It’s Not About the Bike 0399146113
0425179613
Paperback Non-Fiction Justice 0609608738
0609809636
Paperback Non-Fiction Me Talk Pretty One Day 0316776963
0316777722
Paperback Non-Fiction Napalm and Silly Putty 0786887583
0786864133
Paperback Non-Fiction Nickel and Dimed (duplicate)
Paperback Non-Fiction On Writing 0743455967
0684853523
Paperback Non-Fiction Paris to the Moon 0679444920
0375758232
Paperback Non-Fiction Perpetual War for Perpetual Peace 156025405X
Paperback Non-Fiction Personal History 0375701044
0394585852
Paperback Non-Fiction Seabiscuit 0375502912
0449005615
Paperback Non-Fiction The Botany of Desire 0375501290
0375760393
Paperback Non-Fiction The Darwin Awards 0525945725
0452283442
Paperback Non-Fiction The First American 0385495404
0385493282
Paperback Non-Fiction The Idiot Girls’ Action-adventure Club 0375760911
Paperback Non-Fiction The Lost Boy 1558745157
0613173538
Paperback Non-Fiction The Map That Changed the World 0060931809
0060193611
Paperback Non-Fiction The Metaphysical Club 0374199639
0374528497
Paperback Non-Fiction The Piano Shop on the Left Bank 0375758623
0375503048
Paperback Non-Fiction The Tipping Point (duplicate)
Paperback Non-Fiction The Wild Blue 0743203399
0743223098
Paperback Non-Fiction Washington 1586481185
0783895909
Table C.1: Correct book answers in bookstore case study.
Appendix D
TREC participation in 2002
This appendix is included for reference only and is drawn directly from [57].
TREC2002 included a named page finding task and a Topic Distillation task. A
preliminary exploration of forms of evidence which might be useful for named page
finding and topic distillation was performed. For this reason there was heavy use of
evidence other than page content.
D.1 Topic Distillation
In Topic Distillation the following forms of evidence were used:
• BM25 on full-text (content). Pages returned should be “relevant”. The .GOV
corpus was indexed and BM25 applied, sometimes with stemming and sometimes
without.
• BM25 on content and referring anchor-text. An alternative to content-only BM25
is to include referring anchor-text words in the BM25 calculation (content and
anchors).
• In-link counting and filtering. This tested whether pages with more in-links are po-
tentially better answers, differentiating between on-host and off-host links.
Many results were eliminated on the grounds that they had insufficient in-links.
• URL length. Short URLs are expected to be better answers than long URLs.
• BM25 score aggregation. Sites with many BM25-matching pages are expected
to be better than those with few.
In the 2002 Topic Distillation (TD2002) task, the focus on local page content rele-
vance (BM25 content only) was probably too high for the non-content and aggrega-
tion methods to succeed. Most correct answers were expected to be shallow URLs of
sites containing much useful content. In fact, correct answers were deeper, and the ag-
gregation method for finding sites rich with relevant information was quite harmful
(csiro02td3 and csiro02td4). The focus on page content is borne out by the improve-
ment in effectiveness achieved when simple BM25 was applied in an unofficial run
(csiro02unoff). To perform better in the TD2002 task, less (or no) emphasis should have
been put on distillation evidence and far more emphasis on relevance. However, in
some Web search situations, it is likely that the distillation evidence would be more
important than it was in this TD2002 task.
Run P@10 BM25 cont. BM25 cont. & anch. In-link counting & filtering URL length BM25 aggr.
csiro02td1 0.1000 y y y
csiro02td2 0.0714 y y
csiro02td3 0.0184 y y y y
csiro02td4 0.0184 y y y
csiro02td5 0.0939 y (stem) y y
csiro02unoff 0.1959 y
Table D.1: Official results for submissions to the 2002 TREC web track Topic Distillation task.
D.2 Named page finding
In the named page finding experiments the following forms of evidence were used:
• Okapi BM25 on document full-text (content) and/or anchor text. Okapi BM25
was used to score document content and anchor-text aggregate documents.
• Stemming of query terms.
• Extra Title Weighting. To bias the results towards “page naming text” further
emphasis was placed on document titles.
• PageRank. To see whether link recommendation could be used to improve re-
sults [31].
Prior to submission twenty named page training queries were generated. This
training found that content with extra title weighting performed best. Therefore page
titles were expected to be important evidence in the official named page finding task.
However, this appeared not to be the case; in fact, extra title weighting for the TREC
queries appeared to reduce effectiveness (csiro02np01 vs csiro02np03). While there
was some anchor text evidence present for the query set (csiro02np02), when this ev-
idence was combined with content (csiro02np04 and csiro02np16) results were notice-
ably worse than for the content-only run (csiro02np01). PageRank harmed retrieval
effectiveness (run csiro02np16 versus csiro02np04).
Run MRR S@10 BM25 Stemming Extra title weighting PageRank
csiro02np01 0.573 0.77 Content
csiro02np02 0.241 0.34 Anchor text
csiro02np03 0.416 0.59 Content y
csiro02np04 0.318 0.51 Content and anchor text y y
csiro02np16 0.307 0.49 Content and anchor text y y y
Table D.2: Official results for submissions to the 2002 TREC web track named page finding task.
Appendix E
Analysis of hyperlink
recommendation evidence
additional results
This appendix contains further graphs from the experiment series examined in Chap-
ter 5, Section 5.2.1. Figures E.1 and E.2 contain PageRank distributions for several
company websites. These figures support the results presented in Chapter 5, but do
not show any further interesting trends.
[Figure E.1 panels: Toolbar PageRank distributions (pages in our crawl vs. PageRank 0–10) for www.harman.com (HP PR=7), www.introgen.com (HP PR=5), www.pnc.com (HP PR=6), www.progressenergy.com (HP PR=5), www.csx.com (HP PR=6), and www.southtrust.com (HP PR=7).]
Figure E.1: Toolbar PageRank distributions within sites. (Additional to those presented
in Section 5.2.1) The PageRank distributions for other sites are included in Figure 5.2, and in
Figure E.2. The PageRank advice to users is usually that the home page is the most important
or highest quality page, and other pages are less important or of lower quality. PageRank of
the home page of the site is shown as “HP PR=”.
[Figure E.2 panels: Toolbar PageRank distributions (pages in our crawl vs. PageRank 0–10) for www.tenneco-automotive.com (HP PR=6), www.novavax.com (HP PR=5), www.valero.com (HP PR=6), www.synergybrands.com (HP PR=5), www.teletouch.com (HP PR=5), and www.tofc.net (HP PR=3).]
Figure E.2: Toolbar PageRank distributions within sites (Additional to those presented in
Section 5.2.1)
Appendix F
Okapi BM25 distributions
This appendix contains the distributions of Okapi BM25 scores for query-dependent
evidence for the WT10gC collection (see Section 7.1.2) used throughout experiments
in Chapter 7. Figure F.1 contains the distribution of scores for the document full-text.
Figure F.2 contains the distribution of scores for the anchor-text baseline. The BM25
distributions are calculated using the top 1000 results for each of the 100 queries. Un-
like query-independent evidence, BM25 scores are not comparable between query re-
sults. To build these distributions the BM25 scores for all queries were independently
normalised (the top answer for each query receives a 1). Due to the cutoff at 1000, a
truncated curve is expected. Additionally, because the query score distributions are
not centred at the same point, the plot exhibits a flatter curve than would be observed
for a single query score distribution.
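A small Python sketch of the per-query normalisation used to build these distributions; the data layout (a mapping from query to its top-1000 BM25 scores) is an assumption made for the example.

def normalise_per_query(scores_by_query):
    # BM25 scores are not comparable between queries, so each query's scores
    # are divided by that query's maximum (the top answer receives 1.0).
    return {q: [s / max(scores) for s in scores]
            for q, scores in scores_by_query.items() if scores}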
[Figure F.1 plot: percentage of documents vs. normalised BM25 content score (top 1000 documents per query), with one curve for each of the ANU, WT10gC and VLC2R collections.]
Figure F.1: Distribution of normalised Okapi BM25 scores for document full-text for the
WT10gC collection. The BM25 distributions are calculated using the top 1000 results for
each of the 100 queries. Unlike query-independent evidence BM25 scores are not comparable
between query results. To build this distribution the BM25 scores for all queries were inde-
pendently normalised (the top answer for each query receives a 1). Due to the cutoff at 1000,
a truncated curve is expected. Additionally, because the query score distributions are not cen-
tred at the same point, the plot exhibits a flatter curve than would be observed for a single
query.
[Figure F.2 plot: percentage of documents vs. normalised BM25 anchor score (top 1000 documents per query), with one curve for each of the ANU, WT10gC and VLC2R collections.]
Figure F.2: Distribution of normalised Okapi BM25 scores for aggregate anchor-text for
the WT10gC collection. The BM25 distributions are calculated using the top 1000 results for
each of the 100 queries. Unlike query-independent evidence BM25 scores are not comparable
between query results. To build this distribution the BM25 scores for all queries were inde-
pendently normalised (the top answer for each query receives a 1). Due to the cutoff at 1000,
a truncated curve is expected. Additionally, because the query score distributions are not cen-
tred at the same point, the plot exhibits a flatter curve than would be observed for a single
query.
Appendix G
Query sets
G.1 .GOV home page set
Query .GOV Doc ID
White House G03-16-2396677
Office of Homeland Security G25-97-0219687
Office of Management and Budget G01-47-2257273
OMB G01-47-2257273
United States Trade Representative G00-02-0599362
USTR G00-02-0599362
Department of Agriculture G42-03-3102230
USDA G42-03-3102230
Agricultural Research Service G00-03-3996998
Animal Plant Health Inspection Service G00-06-2853218
Cooperative State Research Education and Extension Service G00-11-0223618
Economic Research Service G00-03-2081400
Farm Service Agency G01-58-2364809
National Agricultural Library G00-00-2308409
Natural Resources Conservation Service G00-04-2280100
Research Economics Education G01-91-2827118
Rural Development G00-09-0025460
Bureau of the Census G02-93-4116586
STATUSA Database G00-10-3137809
Bureau of Export Administration G00-03-1901246
FEDWorld G00-06-4174747
International Trade Administration G00-00-3667859
ITA G00-00-3667859
National Institute of Standards Technology G40-04-1519418
NIST G40-04-1519418
National Marine Fisheries Service G46-01-2225985
NMFS G46-01-2225985
National Oceanic Atmospheric Administration G21-42-3486883
NOAA G21-42-3486883
National Ocean Service G00-03-1496820
National Technical Information Service G01-03-0674427
NTIS G01-03-0674427
National Telecommunications Information Administration G00-05-1550998
National Weather Service G00-10-2171731
Department of Education G00-03-2042174
Educational Resources Information Center G08-78-1802103
ERIC G08-78-1802103
National Library of Education G04-56-3588687
NLE G04-56-3588687
Department of Energy G00-06-1479477
Office of Economic Impact and Diversity G05-02-2264248
Southwestern Power Administration G00-11-0259770
Department of Health and Human Services G00-00-3031135
HHS G00-00-3031135
Administration for Children and Families G29-19-2177375
Agency for Health Care Research and Quality G00-01-0960846
AHCRQ G00-01-0960846
Centers for Disease Control and Prevention G08-82-2708305
CDC G08-82-2708305
Food and Drug Administration G00-01-3511414
FDA G00-01-3511414
Health Care Financing Administration G00-03-3635966
National Institutes of Health G00-01-3774693
NIH G00-01-3774693
National Library of Medicine G00-06-1119476
NLM G00-06-1119476
Department of Housing and Urban Development G19-73-3432233
HUD G19-73-3432233
Government National Mortgage Association G37-23-0000000
Ginnie Mae G37-23-0000000
Housing and Urban Development Reading Room G12-73-4081497
Office of Healthy Homes and Lead Hazard Control G10-39-2062297
Public and Indian Housing Agencies G12-36-3618097
Department of the Interior G00-09-2318516
DOI G00-09-2318516
Bureau of Land Management G00-00-2056373
BLM G00-00-2056373
Geological Survey G01-26-3878517
National Park Service G00-03-0029179
Office of Surface Mining G00-44-0995015
Department of Justice G00-04-3171772
DOJ G00-04-3171772
Drug Enforcement Agency G00-72-4001908
DEA G00-72-4001908
Federal Bureau of Investigation G01-84-2237979
FBI G01-84-2237979
Federal Bureau of Prisons G00-03-2244949
Immigration and Naturalization Service G04-47-1027920
INS G04-47-1027920
Office of Justice Programs G00-52-2562368
OJP G00-52-2562368
United States Marshals Service G04-91-1779147
USMS G04-91-1779147
Department of Labor G19-13-1577185
DOL G19-13-1577185
Bureau of Labor Statistics G39-37-3612440
G00-01-0682299
BLS G39-37-3612440
G00-01-0682299
Mine Safety and Health Administration G00-10-3730888
Occupational Safety Health Administration G00-09-2693851
OSHA G00-09-2693851
Department of State G00-58-0058694
DOS G00-58-0058694
Department of State Library G00-18-1147964
Department of Transportation G01-50-1226182
DOT G01-50-1226182
Bureau of Transportation Statistics G00-01-3065065
Federal Aviation Administration G00-06-2330537
FAA G00-06-2330537
National Transportation Library G00-03-1771651
Department of the Treasury G00-03-3649117
Bureau of Alcohol Tobacco Firearms G04-24-1874467
ATF G04-24-1874467
Bureau of Engraving and Printing G00-01-0534347
Bureau of Public Debt G00-04-1219947
Executive Office for Asset Forfeiture G04-75-2804241
Financial Crimes Enforcement Network G03-33-2329825
Financial Management Service G00-10-2794731
FMS G00-10-2794731
Internal Revenue Service IRS G01-42-2236557
G27-81-0697864
Office of Thrift Supervision G00-10-2917540
OTS G00-10-2917540
Secret Service G03-62-1819147
US Customs Service G26-69-3739619
US Mint G01-38-0907787
Department of Veterans Affairs G07-29-0536719
Advisory Council on Historic Preservation G00-08-1007258
ACHP G00-08-1007258
American Battle Monuments Commission G08-41-4046345
Central Intelligence Agency G06-34-0212798
G00-04-0693582
CIA G06-34-0212798
G00-04-0693582
Commodity Futures Trading Commission G00-16-3850519
CFTC G00-16-3850519
Consumer Product Safety Commission G00-03-1848726
CPSC G00-03-1848726
Corporation for National Service G00-08-4188069
Environmental Protection Agency G00-00-0029827
EPA G00-00-0029827
Equal Employment Opportunity Commission G00-79-1517391
EEOC G00-79-1517391
Farm Credit Administration G00-07-3398062
FCA G00-07-3398062
Federal Communications Commission G36-78-0130889
FCC G36-78-0130889
Federal Deposit Insurance Corporation G01-51-0988286
FDIC G01-51-0988286
Federal Election Commission G00-06-3072823
FEC G00-06-3072823
Federal Emergency Management Agency G00-03-2245885
FEMA G00-03-2245885
Federal Energy Regulatory Commission G00-05-0212361
FERC G00-05-0212361
Federal Labor Relations Authority G00-07-2059058
FLRA G00-07-2059058
Federal Maritime Commission G00-00-2164772
Federal Retirement Thrift Investment Board G00-06-0905797
FRTIB G00-06-0905797
Federal Trade Commission G03-32-2819928
FTC G03-32-2819928
General Services Administration G00-05-1904668
GSA G00-05-1904668
Federal Consumer Information Center Pueblo CO G22-50-0922418
Institute of Museum and Library Services G00-11-0472793
IMLS G00-11-0472793
International Broadcasting Bureau G00-06-1636322
IBB G00-06-1636322
Merit Systems Protection Board G01-60-1363045
MSPB G01-60-1363045
National Archives and Records Administration G00-02-1372443
NARA G00-02-1372443
National Capital Planning Commission G00-08-1222422
NCPC G00-08-1222422
National Commission on Libraries and Information Science NCLIS G00-05-0712949
NCLIS G00-05-0712949
National Council on Disability G00-08-0435196
National Credit Union Administration G42-74-1917577
NCUA G42-74-1917577
National Endowment for the Arts G00-00-3681135
NEA G00-00-3681135
National Mediation Board G00-06-2661322
NMB G00-06-2661322
National Science Foundation NSF G00-07-1120880
NSF G00-07-1120880
National Transportation Safety Board G00-02-1479121
NTSB G00-02-1479121
Nuclear Regulatory Commission G00-11-0770745
NRC G00-11-0770745
Nuclear Waste Technical Review Board G00-05-1894408
NWTRB G00-05-1894408
Occupational Safety and Health Administration G00-09-2693851
OSHA G00-09-2693851
Office of Federal Housing Enterprise Oversight G00-07-2732685
OFHEO G00-07-2732685
Office of Personnel Management G01-78-1330378
OPM G01-78-1330378
Office of Special Counsel G12-71-1037814
G00-09-3815798
OSC G12-71-1037814
G00-09-3815798
Overseas Private Investment Corporation G00-03-1048747
OPIC G00-03-1048747
Peace Corps G12-14-0612098
Pension Benefit Guaranty Corporation G00-08-2596456
Postal Rate Commission G00-10-2861072
Railroad Retirement Board G00-00-2016453
RRB G00-00-2016453
Securities and Exchange Commission G00-05-3121512
SEC G00-05-3121512
Selective Service System G00-08-4021223
SSS G00-08-4021223
Social Security Administration G03-24-2061352
SSA G03-24-2061352
Tennessee Valley Authority G00-07-2267029
TVA G00-07-2267029
Thrift Savings Plan G00-04-2615580
TSP G00-04-2615580
United States Arms Control and Disarmament Agency G00-50-1769358
ACDA G00-50-1769358
United States International Trade Commission G00-00-0300859
USITC G00-00-0300859
Dataweb G00-00-1961652
United States Office of Government Ethics G01-28-2830345
United States Postal Service G00-07-4137777
USPS G00-07-4137777
United States Trade and Development Agency G00-02-0555602
Voice of America G00-22-0758032
Broadcasting Bureau of Governors G01-30-3859822
Task Force on Agricultural Air Quality Research G01-51-3170401
White House Commission on Aviation Safety and Security G12-57-0619425
Radio and TV Marti G01-88-3234145
Judicial Branch G00-03-1342151
Legislative Branch G02-36-2411536
G02-32-2010279
Library of Congress G00-03-097897
Table G.1: .GOV home page finding training set. Generated us-
ing the automated sitemap method (described in Section 2.6.5.3)
on the first.gov listing of government departments.
Bibliography
1. ABITEBOUL, S., PREDA, M., AND COBENA, G. Adaptive On-Line Page Impor-
tance Computation. In Proceedings of WWW2003 (Budapest, Hungary, May 2003).
2. ADAMIC, L. A. The small World Wide Web. In Proceedings of ECDL’99 (Paris,
France, 1999), pp. 443–452.
3. ADAMIC, L. A. Zipf, Power-laws, and Pareto - a ranking tutorial. Tech.
rep., Information Dynamics Lab, HP Labs, 2000. http://www.hpl.hp.com/
research/idl/papers/ranking/ranking.html.
4. ADAMIC, L. A., AND HUBERMAN, B. A. The Nature of Markets in the World
Wide Web. Quarterly Journal of Economic Commerce 1 (2000), 5–12.
5. ADAMIC, L. A., AND HUBERMAN, B. A. The Web’s Hidden Order. Communica-
tions of the ACM 44, 9 (September 2001).
6. ALBERT, R., BARABASI, A., AND JEONG, H. Diameter of the World Wide Web.
Nature 401, 9 (September 1999), 103–131.
7. ALTAVISTA. AltaVista. http://www.altavista.com, accessed 10/12/2003.
8. AMENTO, B., TERVEEN, L. G., AND HILL, W. C. Does “authority” mean qual-
ity? Predicting expert quality ratings of Web documents. In Proceedings of ACM
SIGIR’00 (Athens, Greece, July 2000), pp. 296–303.
9. AMITAY, E., CARMEL, D., DARLOW, A., LEMPEL, R., AND SOFFER, A. The
Connectivity Sonar: Detecting Site Functionality by Structural Patterns. In Pro-
ceedings of ACM HT’03 (Nottingham, United Kingdom, August 2003).
10. APACHE. Welcome! - The Apache HTTP Server Project, 2004. http://httpd.
apache.org, accessed 12/11/2004.
11. ARASU, A., NOVAK, J., TOMKINS, A., AND TOMLIN, J. PageRank Computation
and the Structure of the Web: Experiments and Algorithms. In Proceedings of
WWW2002 (Hawaii, USA, May 2002).
12. AUSTRALIA POST. Australia post, 2004. http://www.australiapost.com.
au, accessed 12/11/2004.
13. AYAN, N. F., LI, W.-S., AND KOLAK, O. Automating extraction of logical do-
mains in a web site. Data and Knowledge Engineering 43, 2 (November 2002), 179–
205.
14. BAEZA-YATES, R., AND RIBEIRO-NETO, B. Modern Information Retrieval. Addi-
son Wesley, 1999.
15. BAILEY, P., CRASWELL, N., AND HAWKING, D. Engineering a multi-purpose
test collection for Web retrieval experiments. Information Processing and Man-
agement 39, 6 (2003), 853–871. http://es.cmis.csiro.au/pubs/bailey
ipm03.pdf.
16. BALDI, P., FRASCONI, P., AND SMYTH, P. Modeling the Internet and the Web:
Probabilistic Methods and Algorithms. Wiley, 2003.
17. BARABASI, A.-L., AND ALBERT, R. Emergence of Scaling in Random Networks.
Science 286 (October 1999).
18. BARABASI, A.-L., ALBERT, R., AND JEONG, H. Scale-free characteristics of ran-
dom networks: the topology of the World-Wide Web. Physica A 281 (2000), 69–
77.
19. BERGER, A., AND LAFFERTY, J. D. Information Retrieval as Statistical Transla-
tion. In Proceedings of ACM SIGIR’99 (Berkeley, CA, USA, 1999), pp. 222–229.
20. BERNERS-LEE, T. Weaving the Web. The Original Design and Ultimate Destiny of the
World Wide Web by its Inventor. Harper Collins, San Francisco, 1999.
21. BERNERS-LEE, T., FIELDING, R., AND MASINTER, L. RFC2396 – Uniform Re-
source Identifiers. Request for Comments, August 1998.
22. BERRY, M. W., DUMAIS, S. T., AND O’BRIEN, G. W. Using Linear Algebra for
Intelligent Information Retrieval. Tech. rep., University of Tennessee, Depart-
ment of Computer Science, December 1994.
23. BHARAT, K., AND BRODER, A. Mirror, Mirror on the Web: A Study of Host Pairs
with Replicated Content. In Proceedings of WWW8 (Toronto, Canada, May 1999).
http://www8.org/w8-papers/4c-server/mirror/mirror.html.
24. BHARAT, K., BRODER, A., DEAN, J., AND HENZINGER, M. A Comparison of
Techniques to Find Mirrored Hosts on the WWW. In WOWS’99 (Berkeley, USA,
August 1999). http://www.henzinger.com/monika/.
25. BHARAT, K., CHANG, B., HENZINGER, M., AND RUHL, M. Who links to whom:
Mining linkage between Web sites. In Proceedings of ICDM’01 (San Jose, USA,
November 2001).
26. BHARAT, K., AND HENZINGER, M. Improved Algorithms for Topic Distilla-
tion in a Hyperlinked Environment. In Proceedings of ACM SIGIR’98 (Melbourne,
Australia, 1998).
27. BHARAT, K., AND MIHAILA, G. A. When Experts Agree: Using Non-Affiliated
Experts to Rank Popular Topics. In Proceedings of WWW2001 (Hong Kong, 2001).
http://www10.org/cdrom/papers/474/.
28. BOOKSTEIN, A. Implications of Boolean Structures for Probabilistic Retrieval. In
Proceedings of ACM SIGIR’85 (New York, USA, 1985), pp. 11–17.
29. BOTAFOGO, R., RIVLIN, E., AND SHNEIDERMAN, B. Structural Analysis of Hy-
pertexts: Identifying Hierarchies and Useful Metrics. ACM Transactions on Infor-
mation Systems 10, 2 (1992), 142–180.
30. BRAY, T. Measuring the Web. In Proceedings of WWW5 (Paris, France, May 1996).
31. BRIN, S., AND PAGE, L. The anatomy of a large-scale hypertextual web search
engine. In Proceedings of WWW7 (Brisbane, Australia, May 1998). http:
//www7.scu.edu.au/programme/fullpapers/1921/com1921.htm.
32. BRODER, A. On the Resemblance and Containment of Documents. In Proceed-
ings of SEQS’97 (1997).
33. BRODER, A. A taxonomy of web search. ACM SIGIR Forum 36, 2 (Fall 2002),
3–10.
34. BRODER, A., GLASSMAN, S., MANASSE, M., AND ZWEIG, G. Syntactic
Clustering of the Web. In Proceedings of WWW6 (Santa Clara, USA, April
1997). http://www.scope.gmd.de/info/www6/technical/paper205/
paper205.html.
35. BRODER, A., KUMAR, R., MAGHOUL, F., RAGHAVAN, P., RAJAGOPALAN, S.,
STATA, R., TOMKINS, A., AND WIENER, J. Graph structure in the Web: ex-
periments and models. In Proceedings of WWW9 (Amsterdam, 2000). http:
//www9.org/w9cdrom/index.html.
36. BUCKLEY, C., AND VOORHEES, E. Evaluating evaluation measure stability. In
Proceedings of ACM SIGIR’00 (Athens, Greece, July 2000), pp. 33–40.
37. CAI, D., YU, S., WEN, J.-R., AND MA, W.-Y. VIPS: a Vision-based Page Segmen-
tation Algorithm. Tech. rep., Microsoft Research Asia, 2003. MSR-TR-2003-79.
38. CAI, D., YU, S., WEN, J.-R., AND MA, W.-Y. Block-based web search. In Pro-
ceedings of ACM SIGIR’04 (Sheffield, UK, July 2004), pp. 456–463.
39. CAI, D., YU, S., WEN, J.-R., AND MA, W.-Y. Block-level Link Analysis. In
Proceedings of ACM SIGIR’04 (Sheffield, UK, July 2004), pp. 440–447.
40. CALADO, P., RIBEIRO-NETO, B., ZIVIANI, N., MOURA, E., AND SILVA, I. Local
Versus Global Link Information in the Web. ACM Transactions on Information
Systems 21, 1 (January 2003), 42–63.
41. CARRIÈRE, S. J., AND KAZMAN, R. WebQuery: Searching and visualizing
the Web through connectivity. In Proceedings of WWW6 (Santa Clara, USA,
1997), pp. 701–711. http://www.scope.gmd.de/info/www6/technical/
paper096/paper96.html.
42. CHAKRABARTI, S. Integrating the Document Object Model with Hyperlinks
for Enhanced Topic Distillation and Information Extraction. In Proceedings of
WWW2001 (Hong Kong, 2001), pp. 211–220.
43. CHAKRABARTI, S. Mining the Web: Discovering knowledge from hypertext data.
Morgan Kaufmann, San Francisco, 2003.
44. CHAKRABARTI, S., DOM, B., RAGHAVAN, P., RAJAGOPALAN, S., AND KLEIN-
BERG, J. Automatic resource compilation by analyzing hyperlink structure and
associated text. In Proceedings of WWW7 (Melbourne, Australia, 1998), pp. 65–74.
45. CHAKRABARTI, S., JOSHI, M., AND TAWDE, V. Enhanced Topic Distillation us-
ing Text, Markup Tags, and Hyperlinks. In Proceedings of ACM SIGIR’01 (New
Orleans, USA, 2001), pp. 208–216.
46. CHO, J., GARCÍA-MOLINA, H., AND PAGE, L. Efficient crawling through URL
ordering. Computer Networks and ISDN Systems 30, 1–7 (1998), 161–172.
47. CHOWDHURY, A., FRIEDER, O., GROSSMAN, D., AND MCCABE, M. Collection
Statistics for Fast Duplicate Document Detection. ACM Transactions on Informa-
tion Systems 20, 2 (April 2002), 171–191.
48. CLEVERDON, C., MILLS, J., AND KEEN, M. Factors determining the perfor-
mance of indexing systems. In ASLib Cranfield Project. Cranfield, 1966.
49. CLEVERDON, C. W. Optimizing convenient online access to bibliographic data-
bases. Information Services and Use 4 (1984), 37–47.
50. COLLINS-THOMPSON, K., OGILVIE, P., ZHANG, Y., AND CALLAN, J. Informa-
tion Filtering, Novelty Detection, and Named-Page Finding. In TREC-11 Note-
book Proceedings (Gaithersburg, Maryland USA, November 2002), NIST.
51. COOPER, W. S. Getting beyond Boole. Information Processing and Management:
An International Journal 24 (May 1988), 243–248.
52. CRASWELL, N., CRIMMINS, F., HAWKING, D., AND MOFFAT, A. Performance
and cost tradeoffs in web search. In ADC’04 (Dunedin, New Zealand, January
2004), pp. 161–170. http://es.csiro.au/pubs/craswell adc04.pdf.
53. CRASWELL, N., AND HAWKING, D. Overview of the TREC-2002 Web Track. In
TREC-11 Notebook Proceedings (Gaithersburg, MD, USA, November 2002).
54. CRASWELL, N., AND HAWKING, D. TREC-2004 Web Track Guidelines, July
2004. http://es.csiro.au/TRECWeb/guidelines 2004.html, accessed
10/11/2004.
55. CRASWELL, N., AND HAWKING, D. Characteristics of human-generated re-
source lists. Unpublished (In submission).
56. CRASWELL, N., HAWKING, D., AND ROBERTSON, S. Effective site finding us-
ing link anchor information. In Proceedings of ACM SIGIR’01 (New Orleans,
USA, 2001), pp. 250–257. http://es.cmis.csiro.au/pubs/craswell
sigir01.pdf.
57. CRASWELL, N., HAWKING, D., THOM, J., UPSTILL, T., WILKINSON, R., AND
WU, M. TREC11 Web and Interactive Tracks at CSIRO. In TREC-11 Notebook
Proceedings (Gaithersburg, MD, USA, November 2002).
58. CRASWELL, N., HAWKING, D., THOM, J., UPSTILL, T., WILKINSON, R., AND
WU, M. TREC12 Web Track at CSIRO. In TREC-12 Notebook Proceedings
(Gaithersburg, MD, USA, November 2003).
59. CRASWELL, N., HAWKING, D., WILKINSON, R., AND WU, M. TREC10 Web
and Interactive Tracks at CSIRO. In TREC-10 Notebook Proceedings (Gaithersburg,
MD, USA, November 2001). http://es.cmis.csiro.au/pubs/craswell
trec01.pdf.
60. CRASWELL, N., HAWKING, D., WILKINSON, R., AND WU, M. Overview of
the TREC-2003 Web Track. In TREC-12 Notebook Proceedings (Gaithersburg, MD,
USA, November 2003).
61. CROFT, W. B., AND HARPER, D. J. Using probabilistic models of document
retrieval without relevance information. Journal of Documentation 35 (1979), 285–
295.
62. CSIRO. TREC Web Corpus: WT10g, 2003. http://es.csiro.au/TRECWeb/
wt10g.html, accessed 12/11/2004.
63. DAVISON, B. D. Recognizing Nepotistic Links on the Web. In Proceedings of
AAAI’00 (Workshop on Artificial Intelligence for Web Search) (Austin, Texas USA,
2000), pp. 23–28.
64. DAVISON, B. D. Topical Locality in the Web. In Proceedings of ACM SIGIR’00
(Athens, Greece, July 2000), pp. 272–279.
65. DAVISON, B. D. Topical Locality in the Web: Experiments and Observations.
Tech. rep., Department of Computer Science, Rutgers, New Jersey, July 2000.
66. DEERWESTER, S., DUMAIS, S. T., FURNAS, G. W., LANDAUER, T. K., AND
HARSHMAN, R. Indexing by Latent Semantic Analysis. JASIS 41, 6 (1990), 391–
407.
67. DILL, S., KUMAR, R., MCCURLEY, K. S., RAJAGOPALAN, S., SIVAKUMAR, D.,
AND TOMKINS, A. Self-Similarity in the Web. ACM Transactions On Internet
Technologies 2, 3 (August 2002), 205–223.
68. DING, C., HE, X., HUSBANDS, P., ZHA, H., AND SIMON, H. PageRank, HITS
and a unified framework for link analysis. Tech. Rep. 49372, LBNL, 2002. http:
//citeseer.nj.nec.com/546720.html.
69. DMOZ. Open Directory Project. http://www.dmoz.org, accessed
12/11/2004.
70. DUBLIN CORE METADATA INITIATIVE. Dublin Core Metadata Element
Set, Version 1.1: Reference Description, 2003. http://dublincore.org/
documents/dces/, accessed 14/11/2004.
71. DUBLIN CORE METADATA INITIATIVE. DCMI Frequently Asked Ques-
tions (FAQ) – What search engines support the Dublin Core Metadata
Element Set?, 2004. http://www.dublincore.org/resources/faq/
#whatsearchenginessupport, accessed 14/11/2004.
72. DWORK, C., KUMAR, R., NAOR, M., AND SIVAKUMAR, D. Rank aggregation
methods for the Web. In Proceedings of WWW2001 (Hong Kong, 2001), pp. 613–
622. http://doi.acm.org/10.1145/371920.372165.
73. EIRON, N., AND MCCURLEY, K. S. Analysis of Anchor Text for Web Search.
Tech. rep., IBM, 2003.
74. EIRON, N., AND MCCURLEY, K. S. Analysis of Anchor Text for Web Search
(Extended Abstract). In Proceedings of ACM SIGIR’03 (Toronto, Canada, 2003),
pp. 450–460.
75. EIRON, N., AND MCCURLEY, K. S. Untangling Compound Documents on the
Web. Tech. rep., IBM, 2003.
76. EISENBERG, M., AND BARRY, C. Order effects: A study of the possible influence
of presentation order on user judgments of document relevance. JASIS 39, 5
(1988), 293–300.
77. EXCITE. Excite, 2004. http://www.excite.com, accessed 12/11/2004.
78. FAGIN, R., KUMAR, R., MCCURLEY, K. S., NOVAK, J., SIVAKUMAR, D., TOM-
LIN, J. A., AND WILLIAMSON, D. P. Searching the Workplace Web. In Proceed-
ings of WWW2003 (Budapest, Hungary, May 2003), pp. 366–375.
79. FAGIN, R., KUMAR, R., AND SIVAKUMAR, D. Comparing top k lists. In ACM
SIAM (Baltimore, MD, USA, 2003), pp. 28–36.
80. FAST SEARCH AND TRANSFER, ASA. Personal communication, 2004. http:
//www.alltheweb.com, accessed 12/11/2003.
81. FIELDING, R. RFC2616 - HTTP/1.1: Status Code Definitions, 1999. http://www.
w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3, accessed
12/11/2004.
82. FORTUNE. Fortune 500, 2003. http://www.fortune.com/fortune/
fortune500, accessed 06/09/2003.
83. FOX, E., AND SHAW, J. Combination of multiple searches. In TREC-3 Notebook
Proceedings (Gaithersburg, MD, USA, 1994), pp. 243–252.
84. FRAKES, W., AND BAEZA-YATES, R., Eds. Information Retrieval: Data Structures
and Algorithms. Prentice Hall, 1992.
85. FUHR, N., LALMAS, M., KAZAI, G., AND GÖVERT, N. Proceedings of the INitia-
tive for the Evaluation of XML Retrieval (INEX). In ERCIM workshop proceedings
(Dagstuhl, 2003).
86. FUJIMURA, K., INOUE, T., AND SUGISAKI, M. The EigenRumor Algorithm for
Ranking Blogs. In 2nd Annual Workshop on the Weblogging Ecosystem - Aggregation,
Analysis and Dynamics (Chiba, Japan, 2005).
87. GARFIELD, E. Citation Indexes for Science: A New Dimension in Documentation
through Association of Ideas. Science 122, 3159 (1955), 108–111.
88. GARFIELD, E. Citation analysis as a tool in journal evaluation. Science 178, 4060
(1972), 471–479.
89. GARNER, R. A Computer Oriented, Graph Theoretic Analysis of Citation Index Struc-
tures. Drexel University Press, Philadelphia, 1967.
90. GLOVER, E. J., TSIOUTSIOULIKLIS, K., LAWRENCE, S., PENNOCK, D. M., AND
FLAKE, G. W. Using Web Structure for Classifying and Describing Web Pages.
In Proceedings of WWW2002 (Honolulu, Hawaii, USA, May 2002).
91. GOLUB, G. H., AND VAN LOAN, C. F. Matrix Computations. The Johns Hopkins
University Press, Baltimore, USA, 1996.
92. GOOGLE. Blogger. http://www.blogger.com, accessed 06/11/2005.
93. GOOGLE. Google search engine. http://www.google.com, accessed
12/11/2004.
94. GOOGLE. Google Directory > Shopping Publications > Books > Gen-
eral, September 2002. http://directory.google.com/Top/Shopping/
Publications/Books/General, accessed 09/09/2002.
95. GOOGLE. Google Directory, 2004. http://directory.google.com/, ac-
cessed 12/11/2004.
96. GOOGLE. Google Search Appliance Frequently Asked Questions, 2004. http:
//www.google.com/appliance/faq.html, accessed 12/11/2004.
97. GOOGLE. Google Technology, 2004. http://www.google.com/
technology/, accessed 10/11/2004.
98. GOOGLE. Google Toolbar, 2004. http://toolbar.google.com/, accessed
12/11/2004.
99. GRANKA, L., JOACHIMS, T., AND GAY, G. Eye-Tracking Analysis of User Behav-
ior in WWW Search. In Proceedings of ACM SIGIR’04 (Sheffield, United Kingdom,
August 2004).
100. GURRIN, C., AND SMEATON, A. F. Replicating Web Structure in Small-Scale Test
Collections. Information Retrieval 7 (2004), 239–263.
101. HARMAN, D. How effective is suffixing? JASIS 42, 1 (1991), 7–15.
102. HAVELIWALA, T. H. Efficient computation of PageRank. Tech. Rep. 1999-31,
Stanford University Database Group, 1999. http://dbpubs.stanford.edu:
8090/pub/1999-31.
103. HAVELIWALA, T. H. Topic-sensitive pagerank. In Proceedings of WWW2002
(Honolulu, Hawaii, USA, 2002), ACM Press, pp. 517–526.
104. HAVELIWALA, T. H. Topic-Sensitive PageRank: A Context-Sensitive Ranking
Algorithm for Web Search. In IEEE Transactions on Knowledge and Data Engineer-
ing (July 2003).
105. HAVELIWALA, T. H., AND KAMVAR, S. D. The Second Eigenvalue of the Google
Matrix. Tech. rep., Stanford University, 2003.
106. HAWKING, D. Overview of the TREC-9 Web Track. In TREC-9 Notebook Pro-
ceedings (Gaithersburg, MD, USA, 2000). http://trec.nist.gov/pubs/
trec9/.
107. HAWKING, D. Challenges in enterprise search. In Proceedings of the Australasian
Database Conference ADC2004 (Dunedin, New Zealand, January 2004), pp. 15–26.
Invited paper: http://es.csiro.au/pubs/hawking adc04keynote.pdf.
108. HAWKING, D., BAILEY, P., AND CRASWELL, N. An intranet reality check for
TREC ad hoc. Tech. rep., CSIRO Mathematical and Information Sciences, 2000.
http://es.cmis.csiro.au/pubs/hawking tr00.pdf.
109. HAWKING, D., BAILEY, P., AND CRASWELL, N. Efficient and flexible search
using text and metadata. Tech. rep., CSIRO Mathematical and Information Sci-
ences, 2000. http://es.csiro.au/pubs/hawking tr00b.pdf.
110. HAWKING, D., AND CRASWELL, N. Overview of the TREC-2001 Web Track. In
TREC-10 Notebook Proceedings (Gaithersburg, MD, USA, 2001). http://trec.
nist.gov/pubs/.
111. HAWKING, D., AND CRASWELL, N. Very large scale retrieval and web search. In
TREC: Experiment and Evaluation in Information Retrieval, E. Voorhees and D. Har-
man, Eds. MIT Press, 2005. http://es.csiro.au/pubs/trecbook for
website.pdf.
112. HAWKING, D., CRASWELL, N., BAILEY, P., AND GRIFFITHS, K. Measuring
search engine quality. Information Retrieval 4, 1 (2001), 33–59. http://es.
cmis.csiro.au/pubs/hawking ir01.pdf.
113. HAWKING, D., CRASWELL, N., CRIMMINS, F., AND UPSTILL, T. Enterprise
search: What works and what doesn’t. In Proceedings of the Infonortics Search
Engines Meeting (San Francisco, April 2002). http://es.csiro.au/pubs/
hawking se02talk.pdf.
114. HAWKING, D., CRASWELL, N., CRIMMINS, F., AND UPSTILL, T. How valu-
able is external link evidence when searching enterprise webs? In Proceedings
of ADC’04 (Dunedin, New Zealand, January 2004). http://es.cmis.csiro.
au/pubs/hawking adc04.pdf.
115. HAWKING, D., CRASWELL, N., CRIMMINS, F., AND UPSTILL, T. How Valuable
is External Link Evidence when Searching Enterprise Webs? In Proceedings of
ADC’04 (Dunedin, New Zealand, January 2004). http://es.cmis.csiro.
au/pubs/hawking adc04.pdf.
116. HAWKING, D., CRASWELL, N., AND GRIFFITHS, K. Which search engine is best
at finding online services? In Proceedings of WWW10 (Hong Kong, 2001). http:
//www10.org/cdrom/posters/1089.pdf.
117. HAWKING, D., CRASWELL, N., THISTLEWAITE, P., AND HARMAN, D. Results
and challenges in Web search evaluation. In Proceedings of WWW8 (Toronto,
Canada, 1999), vol. 31, pp. 1321–1330. http://es.cmis.csiro.au/pubs/
hawking www99.pdf.
118. HAWKING, D., AND ROBERTSON, S. On Collection Size and Retrieval Effective-
ness. Information Retrieval 6, 1 (2003), 99–150.
119. HAWKING, D., AND THISTLEWAITE, P. Overview of TREC-6 Very Large Collec-
tion Track. In TREC-6 Notebook Proceedings (Gaithersburg, MD, USA, 1997), E. M.
Voorhees and D. K. Harman, Eds., pp. 93–105.
120. HAWKING, D., UPSTILL, T., AND CRASWELL, N. Towards better weighting of
anchors. In Proceedings of SIGIR’04 (Sheffield, England, July 2004), pp. 512–513.
http://es.csiro.au/pubs/hawking sigirposter04.pdf.
121. HAWKING, D., VOORHEES, E., BAILEY, P., AND CRASWELL, N. Overview of
TREC-8 Web Track. In TREC-8 Notebook Proceedings (Gaithersburg, MD, USA,
1999), pp. 131–150. http://trec.nist.gov/pubs/trec-8.
122. HENZINGER, M., MOTWANI, R., AND SILVERSTEIN, C. Challenges in Web
Search Engines. ACM SIGIR Forum 36, 2 (Fall 2002).
123. HEYDON, A., AND NAJORK, M. Mercator: A Scalable, Extensible Web Crawler.
World Wide Web Journal (December 1999), 219–229. http://www.research.
digital.com/SRC/mercator/.
124. HORRIGAN, J. B., AND RAINIE, L. PEW Internet & American life project:
Getting serious online, March 2002. http://www.pewinternet.org/
reports/reports.asp?Report=55&Section=ReportLevel1&Field=
Level1ID&ID=241, accessed 12/11/2004.
125. HUBBELL, C. H. An Input-Output Approach to Clique Identification. Sociometry
28 (1965), 377–399.
126. HULL, D. Stemming algorithms – a case study for detailed evaluation. JASIS 47,
1 (1996), 70–84.
127. JEH, G., AND WIDOM, J. Scaling personalized web search. In Proceedings of
WWW2003 (Budapest, Hungary, 2003), pp. 271–279.
128. JING, Y., AND CROFT, W. B. An association thesaurus for information retrieval.
In Proceedings of RIAO’94 (New York, USA, 1994), pp. 146–160.
129. JOACHIMS, T. Evaluating Retrieval Performance Using Clickthrough Data. In
Proceedings of ACM SIGIR’02 Workshop on Mathematical/Formal Methods in Infor-
mation Retrieval (Tampere, Finland, 2002).
130. KAMVAR, S. D., HAVELIWALA, T. H., MANNING, C. D., AND GOLUB, G. H.
Exploiting the block structure of the web for computing PageRank. Tech. rep.,
Stanford University, 2003.
131. KATZ, L. A new status index derived from sociometric analysis. Psychometrika
18, 1 (March 1953), 39–43.
132. KLEINBERG, J. M. Authoritative Sources in a Hyperlinked Environment. Journal
of the ACM 46, 5 (1999), 604–632.
133. KOSTER, M. robotstxt.org, 2003. http://www.robotstxt.org/, accessed
12/11/2003.
134. KRAAIJ, W., AND POHLMANN, R. Viewing Stemming as Recall Enhancement.
In Proceedings of ACM SIGIR’96 (Zurich, Switzerland, 1996), pp. 40–48.
135. KRAAIJ, W., WESTERVELD, T., AND HIEMSTRA, D. The Importance of Prior
Probabilities for Entry Page Search. In Proceedings of ACM SIGIR’02 (Tampere,
Finland, 2002), pp. 27–34.
136. KUMAR, S. R., RAGHAVAN, P., RAJAGOPALAN, S., SIVAKUMAR, D., TOMKINS,
A., AND UPFAL, E. The Web as a Graph. In Symposium on Principles of Database
Systems (Dallas, Texas USA, 2000), pp. 1–10.
137. KUMAR, S. R., RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. Trawling
the Web for emerging cyber-communities. In Proceedings of WWW8 (Toronto,
Canada, 1999), pp. 403–415.
138. LARSON, R. R. Bibliometrics of the World Wide Web: An exploratory analysis
of the intellectual architecture of cyberspace. Tech. rep., Computer Science De-
partment, University of California, Santa Barbara, 1996. http://sherlock.
berkeley.edu/asis96/asis96.html.
139. LAWRENCE, S., AND GILES, C. L. Searching the World Wide Web. Science 280,
5360 (1998).
140. LEMPEL, R., AND MORAN, S. The stochastic approach for link-structure analysis
(SALSA) and the TKC effect. Computer Networks 33, 1–6 (2000), 387–401.
141. LEMPEL, R., AND MORAN, S. SALSA: The stochastic approach for link-structure
analysis. ACM Transactions on Information Systems (2001).
142. LI, W.-S., KOLAK, O., AND VU, Q. Defining Logical Domains in a Web Site. In
Proceedings of HT’00 (San Antonio, Texas USA, 2000).
143. LI, Y., AND RAFSKY, L. Beyond Relevance Ranking: Hyperlink Vector Vot-
ing. In Proceedings of ACM SIGIR’97 Workshop on Networked Information Retrieval
(Philadelphia, USA, 1997).
144. LOOKSMART. Looksmart, 2003. http://www.looksmart.com, accessed
12/11/2004.
145. MARCHIORI, M. The Quest for Correct Information on the Web: Hyper Search
Engines. In Proceedings of WWW6 (Santa Clara, USA, 1997), pp. 265–276.
146. MARON, M., AND KUHNS, J. On Relevance, Probabilistic Indexing and Infor-
mation Retrieval. Journal of the ACM 7, 3 (1960), 216–244.
147. MCKELLEHER, K. The Wired 40, July 2003. http://www.wired.com/
wired/archive/11.07/40main.html, accessed 06/09/2003.
148. MICROSOFT. Internet Information Services, 2004. http://www.microsoft.
com/windowsserver2003/iis/default.mspx, accessed 11/12/2004.
149. MICROSOFT. MSN Search Engine, 2004. http://search.msn.com, accessed
11/12/2004.
150. MIZZARO, S. Relevance: The Whole History. JASIS 48, 9 (1997), 810–832.
151. MONTAGUE, M. Metasearch: Data fusion for Document Retrieval. PhD thesis, Dart-
mouth College, Hanover, New Hampshire, 2002.
152. NETSCAPE. Core JavaScript Guide 1.5, 2000. http://devedge.netscape.
com/library/manuals/2000/javascript/1.5/guide/.
153. NEW YORK TIMES. Bestsellers. Web Site, September 2002. http://www.
nytimes.com/2002/09/01/books/bestseller/, accessed 09/09/2002.
154. NG, A. Y., ZHENG, A. X., AND JORDAN, M. I. Link analysis, eigenvectors, and
stability. In Proceedings of IJCAI’01 (Seattle, USA, 2001), ACM Press.
155. OGILVIE, P., AND CALLAN, J. Combining document representations for known-
item search. In Proceedings of ACM SIGIR’03 (Toronto, Canada, August 2003),
pp. 143–150.
156. OGILVIE, P., AND CALLAN, J. Combining structural information and the use
of priors in mixed named-page and homepage finding. In TREC-12 Notebook
Proceedings (Gaithersburg, MD, USA, November 2003), NIST.
157. PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. The PageRank Cita-
tion Ranking: Bringing Order to the Web. Tech. Rep. 1999-66, Stanford Uni-
versity Database Group, 1998. http://dbpubs.stanford.edu:8090/pub/
1999-66.
158. PANDURANGAN, G., RAGHAVAN, P., AND UPFAL, E. Using PageRank to Char-
acterize Web Structure. Tech. rep., Purdue University, 2002.
159. PANT, G. Deriving Link-context from HTML. In ACM DMKD (San Diego, Cali-
fornia, USA, June 2003).
160. PARKER, L. M. P., AND JOHNSON, R. E. Does order of presentation affect users’
judgment of documents? JASIS 41, 7 (1990), 493–494.
161. PINSKI, G., AND NARIN, F. Citation influence for journal aggregates of scientific
publications: Theory, with application to the literature of physics. Information
Processing and Management 12 (1976).
162. PONTE, J. M., AND CROFT, W. B. A Language Modeling Approach to Informa-
tion Retrieval. In Proceedings of ACM SIGIR’98 (Melbourne, Australia, August
1998).
163. PORTER, M. An algorithm for suffix stripping. Program 14, 3 (1980), 130–137.
http://www.tartarus.org/~martin/PorterStemmer/.
164. RAGGETT, D., HORS, A. L., AND JACOBS, I. HTML 4.01 Specification: The
global structure of an HTML document, 1999. http://www.w3.org/TR/
html4/struct/global.html#didx-meta data, accessed 12/11/2004.
165. RAGHAVAN, S., AND GARCIA-MOLINA, H. Crawling the Hidden Web. In Pro-
ceedings of VLDB’01 (2001), pp. 129–138. http://citeseer.ist.psu.edu/
article/raghavan01crawling.html.
166. RIVEST, R. The MD5 message-digest algorithm. Request for Comments, April
1992.
167. ROBERTSON, S. The probability ranking principle in IR. Journal of Documentation
33 (1977), 294–304. As appears in Sparck-Jones and Willet, 1997.
168. ROBERTSON, S., AND JONES, K. S. Simple, proven approaches to text
retrieval. Tech. Rep. UCAM-CL-TR-356, University of Cambridge, May
1997. http://www.cl.cam.ac.uk/ftp/papers/reports/abstract.
html#TR356-ksj-approaches-to-text-retrieval.html.
169. ROBERTSON, S., AND SPARCK-JONES, K. Relevance weighting of search terms.
JASIS 27 (1976), 129–146.
170. ROBERTSON, S., AND WALKER, S. Some simple effective approximations to the
2-Poisson model for probabilistic weighted retrieval. In Proceedings of ACM SI-
GIR’94 (Dublin, Ireland, 1994), pp. 232–241.
171. ROBERTSON, S., WALKER, S., HANCOCK-BEAULIEU, M., GULL, A., AND LAU,
M. Okapi at TREC-1. In TREC-1 Notebook Proceedings (Gaithersburg, MD, USA,
1992), pp. 21–30. http://trec.nist.gov/pubs/trec1/.
172. ROBERTSON, S., WALKER, S., JONES, S., HANCOCK-BEAULIEU, M., AND GAT-
FORD, M. Okapi at TREC-3. In TREC-3 Notebook Proceedings (Gaithersburg, MD,
USA, 1994), pp. 109–126. http://trec.nist.gov/pubs/trec3/.
173. ROBERTSON, S., ZARAGOZA, H., AND TAYLOR, M. Simple BM25 extension to
multiple weighted fields. In Proceedings of CIKM’04 (2004), pp. 42–49. http:
//research.microsoft.com/%7Ehugoz/bm25wf.pdf.
174. ROCCHIO, J. Document Retrieval Systems–Optimization and Evaluation. PhD the-
sis, Harvard Computational Laboratory, 1966.
175. ROCCHIO, J. Relevance Feedback in Information Retrieval. Prentice-Hall, Inc., 1971.
176. SALTON, G. Automatic Information Organization. McGraw-Hill, New York, 1968.
177. SALTON, G., Ed. The SMART retrieval system - experiments in automatic document
processing. McGraw-Hill, New York, 1971.
178. SAVOY, J., AND RASOLOFO, Y. Report on the TREC-10 experiment: Distributed
collections and entrypage searching. In TREC-10 Notebook Proceedings (Gaithers-
burg, MD, USA, 2001). http://trec.nist.gov/pubs/.
179. SEELEY, J. R. The net of reciprocal influence: A problem in treating sociometric
data. Canadian Journal of Psychology 3 (1949), 234–240.
180. SHAH, C., AND CROFT, W. B. Evaluating High Accuracy Retrieval Techniques.
In Proceedings of ACM SIGIR’04 (Sheffield, United Kingdom, 2004), pp. 2–9.
181. SHAKES, J., LANGHEINRICH, M., AND ETZIONI, O. Dynamic reference sifting:
a case study in the homepage domain. Computer Networks and ISDN Systems 29
(1997), 1193–1204.
182. SHANNON, C. E. Prediction and entropy of printed English. Bell System Techni-
cal Journal 30 (1951), 51–64.
183. SHIVAKUMAR, N., AND GARCIA-MOLINA, H. Finding Near-Replicas of Docu-
ments on the Web. In Proceedings of WDB’98 (1998).
184. SILVERSTEIN, C., HENZINGER, M., MARAIS, H., AND MORICZ, M. Analysis of
a Very Large AltaVista Query Log. Tech. rep., Digital Systems Research Center,
1998.
185. SINGHAL, A., AND KASZKIEL, M. A Case Study in Web Search using TREC
Algorithms. In Proceedings of WWW10 (Hong Kong, 2001), pp. 708–716. http:
//www10.org/cdrom/papers/317/.
186. SINGHAL, A., SALTON, G., MITRA, M., AND BUCKLEY, C. Document Length
Normalization. Information Processing and Management 32, 5 (1996).
187. SMALL, H. Co-citation in the scientific literature: A new measure of the relation-
ship between two documents. JASIS 24, 4 (1973), 265–269.
188. SOBOROFF, I. Do TREC Web Collections Look Like the Web? ACM SIGIR Forum
36, 2 (2002), 23–31.
189. SOBOROFF, I. On evaluating web search with very few relevant documents. In
Proceedings of ACM SIGIR’04 (Sheffield, United Kingdom, 2004), pp. 530–531.
190. SPARCK-JONES, K. A statistical interpretation of term specificity and its applica-
tion in retrieval. Journal of Documentation 28, 1 (1972), 11–20.
191. SPARCK-JONES, K., AND WILLET, P., Eds. Readings in Information Retrieval. Mor-
gan Kaufmann, 1997.
192. SPINELLO, R. A. An ethical evaluation of web site linking. ACM SIGCAS Com-
puters and Society 30, 4 (2000), 25–32.
193. SULLIVAN, D. How To Use HTML Meta Tags, December 2002. http:
//searchenginewatch.com/webmasters/article.php/2167931,
accessed 08/11/04.
194. SULLIVAN, D. Nielsen/NetRatings Search Engine Ratings. Web Site, September
2002. http://www.searchenginewatch.com/reports/netratings.
html, accessed 06/11/2002.
195. SULLIVAN, D. Who Powers Whom? Search Providers Chart. Web Site, Septem-
ber 2002. http://www.searchenginewatch.com/reports/alliances.
html, accessed 06/11/2002.
196. TERVEEN, L., HILL, W., AND AMENTO, B. Constructing, Organizing, and Visu-
alizing Collections of Topically Related Web Resources. ACM Transactions on
Computer-Human Interaction 6, 1 (March 1999), 67–94.
197. TOMLIN, J. A. A New Paradigm for Ranking Pages on the World Wide Web. In
Proceedings of WWW2003 (Budapest, Hungary, May 2003). http://www2003.
org/cdrom/papers/refereed/p042/paper42 html/p42-tomlin.htm.
198. TRAVIS, B., AND BRODER, A. Web search quality vs. informational relevance. In
Proceedings of the Infonortics Search Engines Meeting (Boston, 2001). http://www.
infonortics.com/searchengines/sh01/slides-01/travis.html.
199. UPSTILL, T., CRASWELL, N., AND HAWKING, D. Buying Bestsellers On-
line: A Case Study in Search and Searchability. In Proceedings of ADCS2002
(Sydney, Australia, 2002). http://es.cmis.csiro.au/pubs/upstill
adcs02.pdf.
200. UPSTILL, T., CRASWELL, N., AND HAWKING, D. Predicting fame and fortune:
Pagerank or indegree? In Proceedings of ADCS2003 (Canberra, Australia, Decem-
ber 2003). http://es.cmis.csiro.au/pubs/upstill adcs03.pdf.
201. UPSTILL, T., CRASWELL, N., AND HAWKING, D. Query-independent evidence
in home page finding. ACM Transactions on Information Systems 21, 3 (2003), 286–
313.
202. UPSTILL, T., AND ROBERTSON, S. Exploiting Hyperlink Recommendation Ev-
idence in Navigational Web Search. In Proceedings of ACM SIGIR’04 (Sheffield,
United Kingdom, July 2004), pp. 576–577.
203. VAUGHAN, L., AND SHAW, D. Bibliographic and Web Citations: What Is The
Difference? JASIS 54, 14 (2003), 1313–1322.
204. VAN RIJSBERGEN, C. J. Information Retrieval, 2nd edition. Dept. of Computer
Science, University of Glasgow, 1979.
205. VAN RIJSBERGEN, K. Information Retrieval. Butterworths, 1979. http://www.
dcs.gla.ac.uk/Keith/Preface.html.
206. VOORHEES, E. Evaluation by highly relevant documents. In Proceedings of ACM
SIGIR’01 (New Orleans, USA, 2001), pp. 74–82.
207. VOORHEES, E. M. Overview of the first Text REtrieval Conference (TREC-1). In
TREC-1 Notebook Proceedings (Gaithersburg, MD, USA, 1991).
208. VOORHEES, E. M. Variations in relevance judgments and the measurement of
retrieval effectiveness. In Proceedings of ACM SIGIR’98 (Melbourne, 1998).
209. VOORHEES, E. M. The Philosophy of Information Retrieval Evaluation. In
Springer’s Lecture Notes. Springer, January 2002.
210. VOORHEES, E. M., AND HARMAN, D. K. Overview of the fifth Text REtrieval
Conference (TREC-5). In TREC-5 Notebook Proceedings (Gaithersburg, MD, USA,
1996).
211. WESTERVELD, T. Using generative probabilistic models for multimedia retrieval. PhD
thesis, Centrum voor Wiskunde en Informatica, Amsterdam, Netherlands, 2004.
212. WESTERVELD, T., KRAAIJ, W., AND HIEMSTRA, D. Retrieving Web pages using
content, links, URLs and anchors. In TREC-10 Notebook Proceedings (Gaithers-
burg, MD, USA, 2001). http://trec.nist.gov/pubs/.
213. WILLIAMS, H. E., ZOBEL, J., AND BAHLE, D. Fast phrase querying with com-
bined indexes. ACM Transactions on Information Systems 22, 4 (October 2004),
573–594.
214. WITTEN, I. H., BELL, T. C., AND MOFFAT, A. Managing Gigabytes: Compressing
and Indexing Documents and Images. John Wiley & Sons, Inc., 1999.
215. XU, J., AND CROFT, W. B. Query expansion using local and global document
analysis. In Proceedings of ACM SIGIR’96 (Zurich, Switzerland, 1996), pp. 4–11.
216. YAHOO! Yahoo! Business and Economy > Shopping and Services >
Books > Booksellers, September 2002. http://www.yahoo.com/Business
and Economy/Shopping and Services/Books/Booksellers/, accessed
09/09/2002.
217. YAHOO! Yahoo! Directory Service, 2004. http://www.yahoo.com, accessed
12/11/2004.
218. ZHAI, C., AND LAFFERTY, J. A study of smoothing methods for language mod-
els applied to information retrieval. ACM Transactions on Information Systems 22, 2
(April 2004).
219. ZHU, X., AND GAUCH, S. Incorporating Quality Metrics in Central-
ized/Distributed Information Retrieval on the World Wide Web. Tech. rep., De-
partment of Electrical Engineering and Computer Science, University of Kansas,
2000.
220. ZOBEL, J. How reliable are the results of large-scale information retrieval exper-
iments? In Proceedings of ACM SIGIR’98 (Melbourne, Australia, August 1998),
pp. 307–314.

Upstill_Thesis_Revised_17Aug05

  • 1.
    Document ranking using webevidence Trystan Garrett Upstill A thesis submitted for the degree of Doctor of Philosophy at The Australian National University August 2005
  • 2.
    c Trystan GarrettUpstill Typeset in Palatino by TEX and LATEX2ε.
  • 3.
    This thesis includesexperiments published in: • Upstill T., Craswell N., and Hawking D. “Buying Bestsellers Online: A Case Study in Search and Searchability”, which appeared in the Proceedings of ADCS2002, December 2002 [199]. • Upstill T., Craswell N., and Hawking D. “Query-independent evidence in home page finding”, which appeared in the ACM TOIS volume 21:3, July 2003 [201]. • Craswell N., Hawking D., Thom J., Upstill T., Wilkinson R., and Wu M. “TREC12 Web Track at CSIRO”, which appeared in the TREC-12 Notebook Proceedings, November 2003 [58]. • Upstill T., Craswell N., and Hawking D. “Predicting Fame and Fortune: Page- Rank or Indegree?”, which appeared in the Proceedings of ADCS2003, Decem- ber 2003 [200]. • Upstill T., and Robertson S. “Exploiting Hyperlink Recommendation Evidence in Navigational Web Search”, which appeared in the Proceedings of SIGIR’04, August 2004 [202]. • Hawking D., Upstill T., and Craswell N. “Towards Better Weighting of An- chors”, which appeared in the Proceedings of SIGIR’04, August 2004 [120]. Chapter 9 contains results submitted as “csiro” runs in TREC 2003. The Topic Distilla- tion runs submitted to TREC 2003 were generated in collaboration with Nick Craswell and David Hawking. The framework used to tune parameters in Chapter 9 was de- veloped by Nick Craswell. The first-cut ranking algorithm presented in Chapter 9 was formulated by David Hawking for use in the Panoptic search system. Except where indicated above, this thesis is my own original work. Trystan Garrett Upstill 13 August 2005
  • 5.
    Abstract Evidence based onweb graph structure is reportedly used by the current generation of World-Wide Web (WWW) search engines to identify “high-quality”, “important” pages and to reject “spam” content. However, despite the apparent wide use of this evidence its application in web-based document retrieval is controversial. Confusion exists as to how to incorporate web evidence in document ranking, and whether such evidence is in fact useful. This thesis demonstrates how web evidence can be used to improve retrieval effec- tiveness for navigational search tasks. Fundamental questions investigated include: which forms of web evidence are useful, how web evidence should be combined with other document evidence, and what biases are present in web evidence. Through investigating these questions, this thesis presents a number of findings regarding how web evidence may be effectively used in a general-purpose web-based document ranking algorithm. The results of experimentation with well-known forms of web evidence on several small-to-medium collections of web data are surprising. Aggregate anchor-text mea- sures perform well, but well-studied hyperlink recommendation algorithms are far less useful. Further gains in retrieval effectiveness are achieved for anchor-text mea- sures by revising traditional full-text ranking methods to favour aggregate anchor-text documents containing large volumes of anchor-text. For home page finding tasks ad- ditional gains are achieved by including a simple URL depth measure which favours short URLs over long ones. The most effective combination of evidence treats document-level and web-based evidence as separate document components, and uses a linear combination to sum scores. It is submitted that the document-level evidence contains the author’s de- scription of document contents, and that the web-based evidence gives the wider web community view of the document. Consequently if both measures agree, and the doc- ument is scored highly in both cases, this is a strong indication that the page is what it claims to be. A linear combination of the two types of evidence is found to be partic- ularly effective, achieving the highest retrieval effectiveness of any query-dependent evidence on navigational and Topic Distillation tasks. However, care should be taken when using hyperlink-based evidence as a direct measure of document quality. Thesis experiments show the existence of bias towards the home pages of large, popular and technology-oriented companies. Further empir- ical evidence is presented to demonstrate how the authorship of web documents and sites directly affects the quantity and quality of available web evidence. These factors demonstrate the need for robust methods for mining and interpreting data from the web graph. v
  • 6.
  • 7.
    Contents Abstract v 1 Introduction3 1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Background 5 2.1 A web search system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 The document gatherer . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.2 The indexer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.3 The query processor . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.4 The results presentation interface . . . . . . . . . . . . . . . . . . . 7 2.2 Ranking in web search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Document-level evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3.1 Text-based document evidence . . . . . . . . . . . . . . . . . . . . 9 2.3.1.1 Boolean matching . . . . . . . . . . . . . . . . . . . . . . 9 2.3.1.2 Vector space model . . . . . . . . . . . . . . . . . . . . . 10 2.3.1.3 Probabilistic ranking . . . . . . . . . . . . . . . . . . . . 12 2.3.1.4 Statistical language model ranking . . . . . . . . . . . . 14 2.3.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.3 Other evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.3.1 Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.3.2 URL information . . . . . . . . . . . . . . . . . . . . . . . 17 2.3.3.3 Document structure and tag information . . . . . . . . . 19 2.3.3.4 Quality metrics . . . . . . . . . . . . . . . . . . . . . . . . 19 2.3.3.5 Units of retrieval . . . . . . . . . . . . . . . . . . . . . . . 20 2.4 Web-based evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.4.1 Anchor-text evidence . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.4.2 Bibliometric measures . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.4.2.1 Bibliographic methods applied to a web . . . . . . . . . 27 2.4.3 Hyperlink recommendation . . . . . . . . . . . . . . . . . . . . . . 28 2.4.3.1 Link counting / in-degree . . . . . . . . . . . . . . . . . 28 2.4.3.2 PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.4.3.3 Topic-specific PageRank . . . . . . . . . . . . . . . . . . 30 2.4.4 Other hyperlink analysis methods . . . . . . . . . . . . . . . . . . 30 2.4.4.1 HITS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.5 Combining document evidence . . . . . . . . . . . . . . . . . . . . . . . . 33 vii
  • 8.
    viii Contents 2.5.1 Score/rankfusion methods . . . . . . . . . . . . . . . . . . . . . . 33 2.5.1.1 Linear combination of scores . . . . . . . . . . . . . . . . 34 2.5.1.2 Re-ranking . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.5.1.3 Meta-search fusion techniques . . . . . . . . . . . . . . . 34 2.5.1.4 Rank aggregation . . . . . . . . . . . . . . . . . . . . . . 35 2.5.1.5 Using minimum query-independent evidence thresh- olds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.5.2 Revising retrieval models to address combination of evidence . . 35 2.5.2.1 Field-weighted Okapi BM25 . . . . . . . . . . . . . . . . 36 2.5.2.2 Language mixture models . . . . . . . . . . . . . . . . . 37 2.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.6.1 Web information needs and search taxonomy . . . . . . . . . . . . 38 2.6.2 Navigational search tasks . . . . . . . . . . . . . . . . . . . . . . . 39 2.6.2.1 Home page finding . . . . . . . . . . . . . . . . . . . . . 39 2.6.2.2 Named page finding . . . . . . . . . . . . . . . . . . . . 39 2.6.3 Informational search tasks . . . . . . . . . . . . . . . . . . . . . . . 40 2.6.3.1 Topic Distillation . . . . . . . . . . . . . . . . . . . . . . . 40 2.6.4 Transactional search tasks . . . . . . . . . . . . . . . . . . . . . . . 40 2.6.5 Evaluation strategies / judging relevance . . . . . . . . . . . . . . 40 2.6.5.1 Human relevance judging . . . . . . . . . . . . . . . . . 40 2.6.5.2 Implicit human judgements . . . . . . . . . . . . . . . . 42 2.6.5.3 Judgements based on authoritative links . . . . . . . . . 42 2.6.6 Evaluation measures . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.6.6.1 Precision and recall . . . . . . . . . . . . . . . . . . . . . 42 2.6.6.2 Mean Reciprocal Rank and success rates . . . . . . . . . 44 2.6.7 The Text REtrieval Conference . . . . . . . . . . . . . . . . . . . . 44 2.6.7.1 TREC corpora used in this thesis . . . . . . . . . . . . . 45 2.6.7.2 TREC web track evaluations . . . . . . . . . . . . . . . . 45 3 Hyperlink methods - implementation issues 49 3.1 Building the web graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.1.1 URL address resolution . . . . . . . . . . . . . . . . . . . . . . . . 50 3.1.2 Duplicate documents . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.1.3 Hyperlink redirects . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.1.4 Dynamic content . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.1.5 Links created for reasons other than recommendation . . . . . . . 54 3.2 Extracting hyperlink evidence from WWW search engines . . . . . . . . 55 3.3 Implementing PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.3.1 Dangling links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.3.2 Bookmark vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.3.3 PageRank convergence . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.3.4 PageRank applied to small-to-medium webs . . . . . . . . . . . . 59 3.4 Expected correlation of hyperlink recommendation measures . . . . . . 59
  • 9.
    Contents ix 4 Websearch and site searchability 61 4.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.1.1 Query selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.1.2 Search engine selection . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.1.3 Bookstore selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.1.4 Submitting queries and collecting results . . . . . . . . . . . . . . 65 4.1.5 Judging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.2 Comparing bookstores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.3 Comparing search engines . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.3.1 Search engine bookstore coverage . . . . . . . . . . . . . . . . . . 67 4.4 Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.4.1 Bookstore searchability: coverage . . . . . . . . . . . . . . . . . . 70 4.4.2 Bookstore searchability: matching/ranking performance . . . . . 73 4.4.3 Search engine retrieval effectiveness . . . . . . . . . . . . . . . . . 73 4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5 Analysis of hyperlink recommendation evidence 77 5.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.1.1 Sourcing candidate pages . . . . . . . . . . . . . . . . . . . . . . . 78 5.1.2 Company attributes . . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.1.3 Extracting hyperlink recommendation scores . . . . . . . . . . . . 79 5.2 Hyperlink recommendation bias . . . . . . . . . . . . . . . . . . . . . . . 81 5.2.1 Home page preference . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.2.2 Hyperlink recommendation as a page quality recommendation . 82 5.2.2.1 Large, famous company preference . . . . . . . . . . . . 82 5.2.2.2 Country and technology preference . . . . . . . . . . . . 82 5.3 Correlation between hyperlink recommendation measures . . . . . . . . 87 5.3.1 For company home pages . . . . . . . . . . . . . . . . . . . . . . . 87 5.3.2 For spam pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.4.1 Home page bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.4.2 Other systematic biases . . . . . . . . . . . . . . . . . . . . . . . . 90 5.4.3 PageRank or in-degree? . . . . . . . . . . . . . . . . . . . . . . . . 91 6 Combining query-independent web evidence with query-dependent evidence 93 6.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 6.1.1 Query and document set . . . . . . . . . . . . . . . . . . . . . . . . 94 6.1.2 Query-dependent baselines . . . . . . . . . . . . . . . . . . . . . . 94 6.1.3 Extracting PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . 95 6.1.4 Combining query-dependent baselines with query-independent web evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 6.2.1 Baseline performance . . . . . . . . . . . . . . . . . . . . . . . . . 96 6.2.2 Using a threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
  • 10.
    x Contents 6.2.3 Re-rankingusing PageRank . . . . . . . . . . . . . . . . . . . . . . 97 6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 7 Home page finding using query-independent web evidence 101 7.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 7.1.1 Query-independent evidence . . . . . . . . . . . . . . . . . . . . . 102 7.1.2 Query-dependent baselines . . . . . . . . . . . . . . . . . . . . . . 102 7.1.3 Test collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 7.1.4 Combining query-dependent baselines with query-independent evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 7.2 Minimum threshold experiments . . . . . . . . . . . . . . . . . . . . . . . 106 7.2.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 7.2.2 Training cutoffs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 7.3 Optimal combination experiments . . . . . . . . . . . . . . . . . . . . . . 112 7.3.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 7.4 Score-based re-ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 7.4.1 Setting score cutoffs . . . . . . . . . . . . . . . . . . . . . . . . . . 116 7.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 7.5 Interpretation of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 7.5.1 What query-independent evidence should be used in re-ranking? 123 7.5.2 Which query-dependent baseline should be used? . . . . . . . . . 125 7.6 Further experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 7.6.1 Rank and score distributions . . . . . . . . . . . . . . . . . . . . . 127 7.6.2 Can the four-tier URL-type classification be improved? . . . . . . 127 7.6.3 PageRank and in-degree correlation . . . . . . . . . . . . . . . . . 131 7.6.4 Use of external link information . . . . . . . . . . . . . . . . . . . 132 7.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 8 Anchor-text in web search 135 8.1 Document statistics in anchor-text . . . . . . . . . . . . . . . . . . . . . . 135 8.1.1 Term frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 8.1.2 Inverse document frequency . . . . . . . . . . . . . . . . . . . . . 136 8.1.3 Document length normalisation . . . . . . . . . . . . . . . . . . . 138 8.1.3.1 Removing aggregate anchor-text length normalisation . 140 8.1.3.2 Anchor-text length normalisation by other document fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 8.2 Combining anchor-text with other document evidence . . . . . . . . . . 143 8.2.1 Linear combination . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 8.2.2 Field-weighted Okapi BM25 . . . . . . . . . . . . . . . . . . . . . 143 8.2.3 Fusion of linear combination and field-weighted evidence . . . . 144 8.2.4 Snippet-based anchor-text scoring . . . . . . . . . . . . . . . . . . 144 8.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 8.3.1 Anchor-text baseline effectiveness . . . . . . . . . . . . . . . . . . 145 8.3.2 Anchor-text and full-text document evidence . . . . . . . . . . . . 146
  • 11.
    Contents xi 8.3.2.1 Field-weightedOkapi BM25 combination . . . . . . . . 147 8.3.2.2 Linear combination . . . . . . . . . . . . . . . . . . . . . 148 8.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 9 A first-cut document ranking function using web evidence 151 9.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 9.1.1 Evaluating performance . . . . . . . . . . . . . . . . . . . . . . . . 151 9.1.2 Document evidence . . . . . . . . . . . . . . . . . . . . . . . . . . 152 9.1.2.1 Full-text evidence . . . . . . . . . . . . . . . . . . . . . . 152 9.1.2.2 Title evidence . . . . . . . . . . . . . . . . . . . . . . . . . 153 9.1.2.3 URL length . . . . . . . . . . . . . . . . . . . . . . . . . . 153 9.1.3 Web evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 9.1.3.1 Anchor-text . . . . . . . . . . . . . . . . . . . . . . . . . . 153 9.1.3.2 In-degree . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 9.1.4 Combining document evidence . . . . . . . . . . . . . . . . . . . . 154 9.1.5 Test sets and tuning . . . . . . . . . . . . . . . . . . . . . . . . . . 155 9.1.6 Addressing the combined HP/NP task . . . . . . . . . . . . . . . 156 9.2 Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 9.2.1 Combining HP and NP runs for the combined task . . . . . . . . 160 9.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 9.3.1 TREC 2003 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 9.3.1.1 Topic Distillation 2003 (TD2003) results . . . . . . . . . . 160 9.3.1.2 Combined HP/NP 2003 (HP/NP2003) results . . . . . . 162 9.3.2 Evaluating the ranking function on further corporate web col- lections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 9.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 10 Discussion 167 10.1 Web search system applicability . . . . . . . . . . . . . . . . . . . . . . . . 167 10.2 Which tasks should be modelled and evaluated in web search experi- ments? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 10.3 Building a more efficient ranking system . . . . . . . . . . . . . . . . . . . 169 10.4 Tuning on a per corpus basis . . . . . . . . . . . . . . . . . . . . . . . . . . 170 11 Summary and conclusions 173 11.1 Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 11.2 Document ranking recommendations . . . . . . . . . . . . . . . . . . . . 176 11.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 A Glossary 179 B The canonicalisation of URLs 183
  • 12.
    xii Contents C Bookstoresearch and searchability: case study data 185 C.1 Book categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 C.2 Web search engine querying . . . . . . . . . . . . . . . . . . . . . . . . . . 185 C.3 Correct book answers in bookstore case study . . . . . . . . . . . . . . . 187 D TREC participation in 2002 195 D.1 Topic Distillation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 D.2 Named page finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 E Analysis of hyperlink recommendation evidence additional results 199 F Okapi BM25 distributions 203 G Query sets 205 G.1 .GOV home page set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205 Bibliography 213
  • 13.
    List of Tables 2.1Proximity of the the term “Yahoo” to links to http://www.yahoo.com/ 24 4.1 Search engine properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.2 Bookstores included in the evaluation . . . . . . . . . . . . . . . . . . . . 64 4.3 Bookstore comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.4 Search engine success rates . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.5 Search engine precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.6 Search engine document coverage . . . . . . . . . . . . . . . . . . . . . . 69 4.7 Search engine link coverage . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.1 Values extracted from Google . . . . . . . . . . . . . . . . . . . . . . . . . 81 5.2 PageRanks by industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.3 Extreme cases where PageRank and in-degree scores disagree. . . . . . . 88 7.1 Test collection information . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 7.2 Using query-independent thresholds on the ANU collection . . . . . . . 107 7.3 Using query-independent thresholds on the WT10gC collection . . . . . 109 7.4 Using query-independent thresholds on the WT10gT collection. . . . . . 111 7.5 Optimal re-ranking results for content . . . . . . . . . . . . . . . . . . . . 113 7.6 Optimal re-ranking results for anchor-text . . . . . . . . . . . . . . . . . . 114 7.7 Optimal re-ranking results for content+anchor-text . . . . . . . . . . . . . 115 7.8 Significant differences between methods when using Optimal re-rankings116 7.9 Summary of Optimal re-ranking results . . . . . . . . . . . . . . . . . . . 117 7.10 Score-based re-ranking results for content . . . . . . . . . . . . . . . . . . 120 7.11 Score-based re-ranking results for anchor-text . . . . . . . . . . . . . . . . 121 7.12 Score-based re-ranking results for content+anchor-text . . . . . . . . . . 122 7.13 Numerical summary of re-ranking improvements . . . . . . . . . . . . . 123 7.14 S@5 for URL-type category combinations, length and directory depth . . 131 7.15 Correlation of PageRank variants with in-degree . . . . . . . . . . . . . . 132 7.16 Using VLC2 links in WT10g . . . . . . . . . . . . . . . . . . . . . . . . . . 133 8.1 Summary of idf variants used in ranking functions under examination . 138 8.2 Summary of document length normalisation variants in ranking func- tions under examination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 8.3 Summary of snippet-based document ranking algorithms under exam- ination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 8.4 Okapi BM25 aggregate anchor-text scores and ranks for length normal- isation variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 xiii
  • 14.
    xiv LIST OFTABLES 8.5 Effectiveness of Okapi BM25 aggregate anchor-text length normalisa- tion techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 8.6 Length normalisation in Field-weighted Okapi BM25 . . . . . . . . . . . 147 8.7 Effectiveness of anchor-text snippet-based ranking functions . . . . . . . 148 8.8 Effectiveness of the evaluated combination methods for TD2003 . . . . . 149 8.9 Effectiveness of the evaluated combination methods for NP2002 and NP&HP2003 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 9.1 Tuned parameters and retrieval effectiveness . . . . . . . . . . . . . . . . 159 9.2 Results for combined HP/NP runs on the training set . . . . . . . . . . . 160 9.3 Topic Distillation submission summary . . . . . . . . . . . . . . . . . . . 161 9.4 Combined home page/named page finding task submission summary . 162 9.5 Ranking function retrieval effectiveness on the public corporate webs of several large Australian organisations . . . . . . . . . . . . . . . . . . . 164 C.1 Correct book answers in bookstore case study . . . . . . . . . . . . . . . 194 D.1 Official results for submissions to the 2002 TREC web track Topic Dis- tillation task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 D.2 Official results for submissions to the 2002 TREC web track named page finding task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 G.1 .GOV home page finding training set . . . . . . . . . . . . . . . . . . . . . 211
List of Figures

2.1 A sample network of relationships ... 25
3.1 Effect of PageRank d value (random jump probability) on success rate for Democratic PageRank calculations for the WT10gC test collection ... 57
3.2 Effect of PageRank d value (random jump probability) on success rate for Aristocratic PageRank calculations for the WT10gC test collection ... 58
3.3 Effect of PageRank d value on the rate of Democratic PageRank convergence on WT10g, by number of iterations ... 58
5.1 Combined PageRank distribution for the non-home page document set ... 79
5.2 Toolbar PageRank distributions within sites ... 83
5.3 Bias in hyperlink recommendation evidence towards large, admired and popular companies ... 84
5.4 Bias in hyperlink recommendation evidence towards technology-oriented or US companies ... 85
5.5 Toolbar PageRank versus in-degree for company home pages ... 88
5.6 Toolbar PageRank versus in-degree for links to a spam company ... 89
6.1 The percentage of home pages and non-home pages that exceed each Google PageRank value ... 97
6.2 Quota-based re-ranking ... 98
6.3 Score-based re-ranking ... 98
6.4 Example of two queries using different re-ranking techniques ... 99
7.1 Example of an Optimal re-ranking and calculation of random control success rate ... 106
7.2 Setting score-based re-ranking cutoffs for the content and anchor-text baselines using the WT10gC collection ... 118
7.3 Setting score-based re-ranking cutoffs for the content+anchor-text baseline using the WT10gC collection ... 119
7.4 Baseline success rates across different cutoffs ... 126
7.5 Baseline rankings of the correct answers for WT10gC ... 128
7.6 PageRank distributions for WT10gC ... 129
7.7 In-degree and URL-type distributions for WT10gC ... 130
8.1 Document scores achieved by BM25 using several values of k1 with increasing tf ... 137
8.2 Aggregate anchor-text term distribution for the USGS home page ... 139
8.3 Aggregate anchor-text term distribution for a USGS info page ... 139
8.4 The effect of document length normalisation on BM25 scores for a single term query ... 141
9.1 Document scores achieved by AF1 and BM25 for values of tf ... 154
9.2 A plot illustrating the concurrent exploration of Okapi BM25 k1 and b values using the hill-climbing function ... 157
9.3 A full iteration of the hill-climbing function ... 158
E.1 Google Toolbar PageRank distributions within sites (Additional to those in Chapter 5) ... 200
E.2 Google Toolbar PageRank distributions within sites (Additional to those in Chapter 5) ... 201
F.1 Distribution of normalised Okapi BM25 scores for document full-text ... 204
F.2 Distribution of normalised Okapi BM25 scores for aggregate anchor-text ... 204
“In an extreme view, the world can be seen as only connections, nothing else. We think of a dictionary as the repository of meaning, but it defines words only in terms of other words. I liked the idea that a piece of information is really defined only by what it’s related to, and how it’s related. There really is little else to meaning. The structure is everything. There are billions of neurons in our brains, but what are neurons? Just cells. The brain has no knowledge until connections are made between neurons. All that we know, all that we are, comes from the way our neurons are connected.”

— Tim Berners-Lee [20]
    Chapter 1 Introduction Document retrievalon the World-Wide Web (WWW), arguably the world’s largest col- lection of documents, is a challenging and important task. The scale of the WWW is immense, consisting of at least ten billion publicly visible web documents1 distributed on millions of servers world-wide. Web authors follow few formal protocols, often re- main anonymous and publish in a wide variety of formats. There is no central registry or repository of the WWW’s contents and documents are often in a constant state of flux. The WWW is also an environment where documents often misrepresent their content as some web authors seek to unbalance ranking algorithms in their favour for personal gain [122]. To compound these factors, WWW search engine users typically provide short queries (averaging around two terms [184]) and expect a sub-second response time from the system. Given these significant challenges, there is potentially much to be learnt from the search systems which manage to retrieve relevant docu- ments in such an environment. The current generation of WWW search engines reportedly makes extensive use of evidence derived from the structure of the WWW to better match relevant doc- uments and identify potentially authoritative pages [31]. However, despite this re- ported use, to date there has been little analysis which supports the inclusion of web evidence in document ranking, or which examines precisely what its effect on search results might be. The success of document ranking in the current generation of WWW search engines is attributed to a number of web analysis techniques. How these tech- niques are used and incorporated remains a trade secret. It also remains unclear as to whether such techniques can be employed to improve retrieval effectiveness in smaller, corporate-sized web collections. This thesis investigates how web evidence can be used to improve retrieval ef- fectiveness for navigational search tasks. Three important forms of web evidence are considered: anchor-text, hyperlink recommendation measures (PageRank vari- ants and in-degree), and URL hierarchy-based measures. These forms of web evi- dence are believed to be used by prominent WWW search engines [31]. Other forms of web evidence reviewed, but not examined, include HITS [132], HTML document structure [42] and page segmentation [37], information unit measures [196], and click- through evidence [129]. 1 This is necessarily a crude estimate of the WWW’s static size. See Section 2.4 for details. 3
    4 Introduction To exploitweb evidence effectively in a document ranking algorithm, several ques- tions must be addressed: • Which forms of web evidence are useful? • How should web evidence be combined with other document evidence? • What biases are inherent in web evidence? Through addressing these and other related problems, this thesis demonstrates how web evidence may be used effectively in a general-purpose web-based document ranking algorithm. 1.1 Overview Chapters 2 and 3 review background literature and implementation issues. Chap- ter 2 surveys the web search domain, and presents an overview of document and web evidence often used in web-based document ranking, methods for combining this evidence, and a review of strategies for evaluating the effectiveness of ranking algorithms. To justify the formulations of hyperlink evidence used, and to ensure ex- periments can be reproduced, Chapter 3 describes methods used to process the web graph and implement recommendation evidence. Chapters 4 to 8 present a series of detailed experiments. Chapter 4 reports results from an investigation of how the searchability of web sites affects hyperlink evidence, and thereby retrieval effectiveness in WWW search engines. Chapter 5 presents a set of experiments that analyse the extent to which hyperlink evidence is correlated with “real-world” measures of authority or quality. It includes an analysis of how the use of web evidence may bias search results, and whether hyperlink recommen- dation evidence is useful in identifying site entry points. Chapters 6 and 7 follow with an evaluation of retrieval effectiveness improvements afforded by hyperlink ev- idence. Chapter 6 investigates how query-independent evidence might be combined with query-dependent baselines. Chapter 7 investigates the home page finding task on small-to-medium web collections. Chapter 8 presents a set of experiments that in- vestigates further possibilities for improving the effectiveness of measures based on anchor-text evidence. The experiments culminate in a proposal for, and evaluation of, a ranking function that incorporates evidence explored in this thesis. The effectiveness of this ranking function is evaluated through submissions to the TREC 2003 web track, presented in Chapter 9. Chapters 10 and 11 present and discuss findings, draw conclusions and outline future research directions. A glossary is included as Appendix A.
    Chapter 2 Background To providea foundation and context for thesis experiments, this chapter outlines the web-based document ranking domain. The chapter includes: • An overview of a generic web search system, outlining the role of document ranking in web search; • A detailed analysis of document and web-level evidence commonly used for document ranking in research and (believed to be used in) commercial web search systems; • An exploration of methods for combining evidence into a single ranking func- tion; and • A review of common user web search tasks and methods used to evaluate the effectiveness of document ranking for such tasks. Where applicable, reference is made throughout this chapter to the related scope of the thesis and the rationale for experiments undertaken. 2.1 A web search system A web search engine typically consists of a document gatherer (usually a crawler), a document indexer, a query processor and a results presentation interface [31]. The document gatherer and document indexer need only be run when the underlying set of web documents has changed (which is likely to be continuous on the WWW, but perhaps intermittent for other web corpora). How each element which makes up a generic web search system is understood in the context of this thesis is discussed below. 5
    6 Background 2.1.1 Thedocument gatherer Web-based documents are normally1 gathered using a crawler [123]. Crawlers traverse a web graph by recursively following hyperlinks, storing each document encountered, and parsing stored documents for URLs to crawl. Crawlers typically maintain a fron- tier, the queue of pages which remain to be downloaded. The frontier may be a FIFO2 queue, or sorted by some other attribute, such as perceived authority or frequency of change [46]. Crawlers also typically maintain a list of all downloaded or detected du- plicate pages (so pages are not fetched more than once), and a scope of pages to crawl (for example, a maximum depth, specified domain, or timeout value), both of which are checked prior to adding pages to the frontier. The crawler frontier is initialised with a set of seed pages from which the crawl starts (these are specified manually). Crawling ceases when the frontier is empty, or some time or resource limit is reached. Once crawling is complete,3 the downloaded documents are indexed. 2.1.2 The indexer The indexer distills information contained within corpus documents into a format which is amenable to quick access by the query processor. Typically this involves ex- tracting document features by breaking-down documents into their constituent terms, extracting statistics relating to term presence within the documents and corpus, and calculating any query-independent evidence.4 After the index is built, the system is ready to process queries. 2.1.3 The query processor The query processor serves user queries by matching and ranking documents from the index according to user input. As the query processor interacts directly with the doc- ument index created by the indexer, they are often considered in tandem. This thesis is concerned with a non-iterative retrieval process, i.e. one without query refinement or relevance feedback [169, 174, 175, 177]. This is the level of in- teraction supported by current popular WWW search systems and many web search systems, most of whom incorporate little relevance feedback beyond “find more like this” [93] or lists containing suggested supplementary query terms [217]. Although particularly important in WWW search systems, this thesis is not pri- marily concerned with the efficiency of query processing. A comprehensive overview of efficient document query processing and indexing methods is provided in [214]. 1 In some cases alternative document accessing methods may be available, for example if the docu- ments being indexed are stored locally. 2 A queue ordered such that the first item in is the first item out. 3 If crawling is continuous, and an incremental index structure is used, documents might be indexed continuously. 4 Query-independent evidence is evidence that does not depend on the user query. For efficiency reasons such evidence is generally collected and calculated during the document indexing phase (prior to query processing).
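To make the crawl loop just described concrete, the following sketch shows a minimal frontier-based crawler. It is an illustrative assumption rather than a description of any particular system: fetch_page and extract_urls are hypothetical helpers standing in for a real HTTP client and link parser, and a production crawler would also need politeness delays, robots.txt handling, duplicate-content detection and error recovery.

from collections import deque

def crawl(seed_pages, in_scope, fetch_page, extract_urls, max_pages=10000):
    # The frontier is a FIFO queue here; it could instead be ordered by
    # perceived authority or expected frequency of change, as discussed above.
    frontier = deque(seed_pages)
    seen = set(seed_pages)          # pages already queued or downloaded
    stored = {}                     # URL -> document text, handed to the indexer

    while frontier and len(stored) < max_pages:
        url = frontier.popleft()
        html = fetch_page(url)      # hypothetical helper: returns None on failure
        if html is None:
            continue
        stored[url] = html
        for link in extract_urls(html, url):   # hypothetical helper
            # Only unseen, in-scope pages are queued, so no page is fetched twice.
            if link not in seen and in_scope(link):
                seen.add(link)
                frontier.append(link)
    return stored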
    §2.2 Ranking inweb search 7 2.1.4 The results presentation interface The results presentation interface displays and links to the documents matched by the query processor in response to the user query. Current popular WWW and web search systems present a linear list of ranked results, sometimes with the degree of match and/or summaries and abstracts for the matching documents. This type of interface is modelled in experiments within this thesis. 2.2 Ranking in web search The principal component of the query processor is the document ranking function. The ranking functions of modern search systems frequently incorporate many forms of document evidence [31]. Some of this evidence, such as textual information, is collected locally for each document in the corpus (described in Section 2.3). Other evidence, such as external document descriptions or recommendations, is amassed through an examination of the context of a document within the web graph (described in Section 2.4). 2.3 Document-level evidence Text-based ranking algorithms typically assign scores to documents based on the dis- tribution of query terms within both the document and the corpus. Therefore the choice of what should constitute a term is an important concern. While terms are often simply defined as document words (treated individually) [170] they may also take further forms. For example, terms may consist of the canonical string compo- nents of words (stems) [163], include (n-)tuples of words [214], consist of a word and associated synonyms [128], or may include a combination of some or many of these properties. Unless otherwise noted, the ranking functions examined within this thesis use single words as terms. In some experiments ranking functions instead make use of canonical word stems, conflated using the Porter stemmer [163], as terms. These and alternative term representations are discussed below. The conflation of terms may increase the overlap between documents and queries, finding term matches which may otherwise have been missed. For example if the query term “cat” is processed and a document in the corpus mentions “cats” it is likely that the document will be relevant to the user’s request. Stemming methods are frequently employed to reduce words to their canonical forms and thereby allow such matches. An empirically validated method for reducing terms to their canon- ical forms is the Porter stemmer [163]. The Porter stemmer has been demonstrated to perform as well as other suffix-stemming algorithms and to perform comparably to other significantly more expensive, linguistic-based stemming algorithms [126].5 5 These algorithms are expensive with regard to training and computational cost.
    8 Background The Porterstemmer removes suffixes, for example “shipping” and “shipped” would become “ship”. In this way suffix-stemming attempts to remove pluralisation from terms and to generalise words [126],6 sometimes leading to an improvement in re- trieval system recall [134]. However, reducing the exactness of term matches can result in the retrieval of less relevant documents [84, 101], thereby reducing search precision.7 Furthermore, if a retrieved document does not contain any occurrences of a query term, as all term matches are stems, it may be difficult for a user to understand why that document was retrieved [108]. In many popular ranking functions documents are considered to be “bags-of- words” [162, 170, 176], where term occurrence is assumed to be independent and unordered. For example, given a term such as “Computer” there is no prior probabil- ity of encountering the word “Science” afterwards. Accordingly no extra evidence is recorded if the words “Computer Science” are encountered together in a docu- ment rather than separately. While there is arguably more meaning conveyed through higher order terms (terms containing multiple words) than in single-word term mod- els, there is little empirical evidence to support the use of higher-order terms [128]. Even when using manually created word association thesauri, retrieval effectiveness has not been observed to be significantly improved [128]. “Bags-of-words” algorithms are also generally less expensive when indexing and querying English language doc- uments8 [214]. Terms may have multiple meanings (polysemy) and many concepts are repre- sented by multiple words (synonyms). Several methods attempt to explore relation- ships between terms to compress the document and query space. The association of words to concepts can be performed manually through the use of dictionaries or ontologies, or automatically using techniques such as Latent Semantic Analysis (LSA) [22, 66]. LSA involves the extraction of contextual meaning of words through examinations of the distribution of terms within a corpus using the vector space model (see Section 2.3.1.2). Terms are broken down into co-occurrence tables and then a Sin- gular Value Decomposition (SVD) is performed to determine term relationships [66]. The SVD projects the initial term meanings onto a subspace spanned by only the “im- portant” singular term vectors. The potential benefits of LSA techniques are two-fold: firstly they may reduce user confusion through the compression of similar (synony- mous or polysemous) terms, and secondly they may reduce the size of term space, and thereby improve system efficiency [66]. Indeed, LSA techniques have been shown to improve the efficiency of retrieval systems considerably while maintaining (but not exceeding) the effectiveness of non-decomposed vector space-based retrieval sys- tems [22, 43]. However, the use of LSA-based algorithms is likely to negatively affect navigational search (an important search task, described in Section 2.6.2) as the mean- ing conveyed by entity naming terms may be lost. 6 Employing stemming prior to indexing reduces the size of the corpus index however the discarded term information is then lost. As an alternative stemming can be applied during query processing [214]. 7 The measures of system precision and recall are defined in Section 2.6.6.1 8 The use of phrase-optimised indexes can improve the efficiency of phrase-based retrieval [213].
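For illustration, the snippet below applies a Porter stemmer to a few of the examples used in this section. It relies on the PorterStemmer class from the NLTK library purely as a convenient implementation; the thesis does not prescribe any particular stemming package, and the exact outputs shown are approximate.

from nltk.stem import PorterStemmer   # third-party package: pip install nltk

stemmer = PorterStemmer()

# Suffix stripping conflates inflected forms onto a shared stem, so a query
# containing "cat" can match a document that only mentions "cats".
for word in ["shipping", "shipped", "ships", "cats", "computer"]:
    print(word, "->", stemmer.stem(word))

# Expected output (approximately):
#   shipping -> ship, shipped -> ship, ships -> ship, cats -> cat, computer -> comput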
    §2.3 Document-level evidence9 Some terms occur so frequently in a corpus that their presence or absence within a document may have negligible effect. The most frequent terms arguably convey the least document relevance information and have the smallest discrimination value (see inverse document frequency measure in Section 2.3.1.2). Additionally, because of the high frequency of occurrence, such terms are likely to generate the highest overhead during indexing and querying.9 Extremely frequent terms (commonly referred to as “stop words”) are often removed from documents prior to indexing.10 However, it has been suggested that such terms might be useful when matching documents [214], particularly in phrase-based searches [14]. Nevertheless, in experiments within this thesis, stop words are removed prior to indexing. 2.3.1 Text-based document evidence To build a retrieval model, an operational definition of what constitutes a relevant document is required. While each of the ranking models discussed below shares sim- ilar document statistics, they were all derived through different relevance matching assumptions. Experiments within this thesis employ the Okapi BM25 probabilistic algorithm (for reasons outlined in Section 2.3.2). Other full-text ranking methods are discussed for completeness. The notation used during the model discussions below is as follows: D denotes a document, Q denotes a query, t is a term, wt indicates a weight or score for a single term, and S(D, Q) is the score assigned to the query to document match. 2.3.1.1 Boolean matching In the Boolean model, retrieved documents are matched to queries formed with logic operators. There are no degrees of match; a document either satisfies the query or does not. Thus Boolean models are often referred to as “exact match” techniques [14]. While Boolean matching makes it clear why documents were retrieved, its syntax is largely unfamiliar to ordinary users [28, 49, 51]. Nevertheless empirical evidence sug- gests that trained search users prefer Boolean search as it provides an exact specifica- tion for retrieving documents [49]. However, without any ranking by degree of match, the navigation of the set of matching documents is difficult, particularly on large cor- pora with unstructured content [14]. Empirical evidence also suggests that the use of term weights in the retrieval model (described in the next sub-section) brings large gains [14]. To employ Boolean matching techniques on corpora of the scale considered in this thesis, it would have to be supplemented by some other document statistic in order to provide a ranked list of results [14]. 9 However, given the high amount of expected repetition they could potentially be more efficiently compressed [214]. 10 This list often contains common function words or connectives such as “the”, “and” and “a”.
The Boolean scoring function is:

\[
S(D, Q) =
\begin{cases}
0 & Q \notin D \\
1 & Q \in D
\end{cases}
\tag{2.1}
\]

where Q is the query condition expressed in Boolean logic operators.

2.3.1.2 Vector space model

The vector space model is based on the implicit assumption that the relevance of a document with respect to some query is correlated with the distance between the query and document. In the vector space model each document (and query) is represented in an n-dimensional Euclidean space with an orthogonal dimension for each term in the corpus.11 The degree of relevance between a query and document is measured using a distance function [176].

The most basic term vector representation simply flags term presence using vectors of binary {0, 1}. This is known as the binary vector model [176]. The document representation can be extended by including term and document statistics in the document and query vector representations [176]. An empirically validated document statistic is the number of term occurrences within a document (term frequency or tf) [176]. The intuitive justification for this statistic is that a document that mentions a term more often is more likely to be relevant for, or about, that term. Another important statistic is the potential for a term to discriminate between candidate documents [190]. The potential of a term to discriminate between documents has been observed to be inversely proportional to the frequency of its occurrence in a corpus [190], with terms that are common in a corpus less likely to convey useful relevance information. A frequently used measure of term discrimination based on this observation is inverse document frequency (or idf) [190].

Using the tf and idf measures, the weight of a term present in a document can be defined as:

\[
w_{t,D} = tf_{t,D} \times idf_t
\tag{2.2}
\]

where idf is:

\[
idf_t = \log \frac{N}{n_t}
\tag{2.3}
\]

where n_t is the number of documents in the corpus that contain term t, and N is the total number of documents in the corpus.

11 So all dimensions are linearly independent.
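As an illustration of Equations 2.2 and 2.3, the sketch below computes tf × idf weights over a toy corpus of pre-tokenised documents. The corpus, tokenisation and function names are illustrative assumptions, not part of the thesis.

import math
from collections import Counter

def idf(term, corpus):
    # Inverse document frequency (Equation 2.3): log(N / n_t).
    n_t = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_t) if n_t else 0.0

def tf_idf(term, doc, corpus):
    # Term weight w_{t,D} = tf_{t,D} * idf_t (Equation 2.2).
    return Counter(doc)[term] * idf(term, corpus)

corpus = [["web", "search", "ranking"],
          ["web", "crawler", "frontier"],
          ["anchor", "text", "ranking", "ranking"]]
print(tf_idf("ranking", corpus[2], corpus))   # 2 * log(3/2), approximately 0.81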
There are many functions that can be used to score the distance between document and query vectors [176]. A commonly used distance function is the cosine measure of similarity [14]:

\[
S(D, Q) = \frac{D \cdot Q}{|D| \times |Q|}
\tag{2.4}
\]

or:

\[
S(D, Q) = \frac{\sum_{t \in Q} w_{t,D} \times w_{t,Q}}{\sqrt{\sum_{t \in Q} w^2_{t,D}} \times \sqrt{\sum_{t \in Q} w^2_{t,Q}}}
\tag{2.5}
\]

Because the longer a document is, the more likely it is that a term will be encountered in it, an unnormalised tf component is more likely to assign higher scores to longer documents. To compensate for this effect the term weighting function in the vector space model is often length normalised, such that a term that occurs in a short document is assigned more weight than a term that occurs in a long document. This is termed document length normalisation. For example, a simple form of length normalisation is [14]:

\[
w_{t,D} = \frac{tf_{t,D} + 1}{maxtf_D + 1} \times idf_t
\tag{2.6}
\]

where maxtf_D is the maximum term frequency observed for a term in document D.

After observing relatively poor performance for the vector space model in a set of TREC experiments, Singhal et al. [186] hypothesised that the form of document length normalisation used within the model was inferior to that used in other models. To investigate this effect they compared the length of known relevant documents with the length of documents otherwise retrieved by the retrieval system. Their results indicated that long documents were more likely to be relevant for the task studied,12 but no more likely to be retrieved after length normalisation in the vector space model. Accordingly, Singhal et al. [186] proposed that the (cosine) length normalisation component be pivoted to favour documents that were more frequently relevant (in this case, longer documents).

12 The task studied was the TREC-3 ad-hoc retrieval task. The ad-hoc retrieval task is an informational task (see Section 2.6.1) where the user needs to acquire or learn some information that may be present in a document.
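The cosine measure of Equations 2.4 and 2.5 can be sketched over sparse term-weight vectors represented as dictionaries. As in Equation 2.4, the norms below are taken over the full weight vectors; the weights themselves would come from a scheme such as the tf × idf weighting above, and the example values are illustrative only.

import math

def cosine(query_weights, doc_weights):
    # Dot product over query terms, divided by the product of the vector norms.
    dot = sum(w_q * doc_weights.get(t, 0.0) for t, w_q in query_weights.items())
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0
    return dot / (q_norm * d_norm)

print(cosine({"web": 1.2, "ranking": 0.8},
             {"web": 0.9, "ranking": 1.1, "crawler": 0.4}))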
    12 Background 2.3.1.3 Probabilisticranking Probabilistic ranking algorithms provide an intuitive justification for the relevance of matched documents by attempting to model and thereby rank the statistical proba- bility that a document is relevant given the matching terms found [146, 169]. The Probability Ranking Principle was described by Cooper [167] as: “If a reference retrieval system’s response to each request is a ranking of the documents in the collections in order of decreasing probability of use- fulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.” The probabilistic model for information retrieval was originally proposed by Maron and Kuhn [146] and updated in an influential paper by Robertson and Sparck-Jones [169]. Probabilistic ranking techniques have a strong theoretical basis and should, at least in principle and given all available information, provide the best predictions of document relevance. The formal specification of the Probabilistic Rank- ing Principle can be described as an optimisation problem, where documents should only be retrieved in response to a query if the cost of retrieving the document is less than the cost of not retrieving the document [169]. A prominent probabilistic ranking formulation is the Binary Independence Model used in the Okapi BM25 algorithm [171]. The Binary Independence Model is con- ditioned by several important assumptions in order to decrease complexity. These assumptions include: • Independence of documents, i.e. that the relevance of one document is indepen- dent of the relevance of all other documents;13 • Independence of terms, i.e. that the occurrence or absence of one term is not related to the presence or absence of any other term;14 and • That the distribution of terms within a document can be used to estimate the document’s probability of relevance.15 13 This is brought into question when one document’s relevance may be affected by another document ranked above it (as is the case with duplicate documents). This independence assumption was removed in several probabilistic formulations without significant improvement in retrieval effectiveness [204]. 14 This assumption was also removed from probabilistic formulations without significant effectiveness improvements [204]. 15 This assumption is made according to the cluster hypothesis which states that “closely associated documents tend to be relevant to the same requests”, therefore “documents relevant to a request are separate from those that are not” [204].
In most early probabilistic models, the term probabilities were estimated from a sample set of documents and queries with corresponding relevance judgements. However, this information is not always available. Croft and Harper [61] have revisited the initial formulation of relevance and proposed a probabilistic model that did not include a prior estimate of relevance.

Okapi BM25

The Okapi BM25 formula was proposed by Robertson et al. [172]. In Okapi BM25, documents are ordered by decreasing probability of their relevance to the query, P(R|Q, D). The formulation takes into account the number of times a query term occurs in a document (tf), the proportion of other documents which contain the query term (idf), and the relative length of the document. A score for each document is calculated by summing the match weights for each query term. The document score indicates the Bayesian inference weight that the document will be relevant to the user query.

Robertson and Walker [170] derived the document length normalisation used in the Okapi BM25 formula as an approximation to the 2-Poisson model. The form of length normalisation employed when using Okapi BM25 with default parameters (k1 = 2, b = 0.75) is justified because long documents contain more information than shorter documents, and are thus more likely to be relevant [186].

The base Okapi BM25 formulation [172] is:

\[
\mathrm{BM25}\,w_t = idf_t \times \frac{(k_1 + 1)\,tf_{t,D}}{k_1\left((1 - b) + \frac{b \times dl}{avdl}\right) + tf_{t,D}} \times \frac{(k_3 + 1) \times Qw_t}{k_3 + Qw_t} + k_2 \times nq\,\frac{(avdl - dl)}{avdl + dl}
\tag{2.7}
\]

where w_t is the relevance weight assigned to a document due to query term t, Qw_t is the weight attached to the term by the query, nq is the number of query terms, tf_{t,D} is the number of times t occurs in the document, N is the total number of documents, n_t is the number of documents containing t, dl is the length of the document and avdl is the average document length (both measured in bytes).

Here k1 controls the influence of tf_{t,D} and b adjusts the document length normalisation. A k1 approaching 0 reduces the influence of the term frequency, while a larger k1 increases the influence. A b approaching 1 assumes that the documents are longer due to repetition (full length normalisation), whilst b = 0 assumes that documents are long because they cover multiple topics (no length normalisation) [168].

Setting k1 = 2, k2 = 0, k3 = ∞ and b = 0.75 (verified experimentally in TREC tasks and on large corpora [168, 186]):

\[
\mathrm{BM25}\,w_{t,D} = \frac{Qw_t \times tf_{t,D} \times \log\left(\frac{N - n_t + 0.5}{n_t + 0.5}\right)}{2 \times \left(0.25 + 0.75 \times \frac{dl}{avdl}\right) + tf_{t,D}}
\tag{2.8}
\]
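A minimal sketch of the simplified weighting in Equation 2.8, summed over query terms as in Equation 2.9 below, is given here. Corpus statistics (N, n_t and the average document length) are assumed to have been gathered at indexing time; query-term weights Qw_t are fixed at 1, and document length is measured in tokens rather than bytes purely for simplicity.

import math
from collections import Counter

def bm25(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=2.0, b=0.75):
    # doc_freqs maps term -> n_t (the number of documents containing the term).
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        n_t = doc_freqs.get(t, 0)
        if n_t == 0 or tf[t] == 0:
            continue
        idf = math.log((num_docs - n_t + 0.5) / (n_t + 0.5))
        norm = k1 * ((1.0 - b) + b * dl / avg_doc_len)
        score += idf * tf[t] / (norm + tf[t])   # per-term weight (Equation 2.8)
    return score                                 # summed over query terms (Equation 2.9)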
The final document score is the sum of term weights:

\[
\mathrm{BM25}(D, Q) = \sum_{t \in Q} w_{t,D}
\tag{2.9}
\]

2.3.1.4 Statistical language model ranking

Statistical language modelling is based on Shannon's communication theory [182]16 and examines the distribution of language in a document to estimate the probability that a query was generated in an attempt to retrieve that document. Statistical language models have long been used in language generation, speech recognition and machine translation tasks, but have only recently been applied to document retrieval [162]. Language models calculate the probability of encountering a particular string (s) in a language (modelled by M) by estimating P(s|M).

The application of language modelling to information retrieval conceptually reverses the document ranking process. Unlike probabilistic ranking functions which model the relevance of documents to a query, language modelling approaches model the probability that a query was generated from a document. In this way, language models replace the notion of relevance with one of sampling, where the probability that the query was picked from a document is modelled. The motivation for this approach is that users have some prototype document in mind when an information need is formed, and they choose query terms to that effect. Further, it is asserted that when a user seeks a document they are thinking about what it is that makes the document they are seeking “different”. The statistical language model ranks documents using the maximum likelihood estimation (Pmle) that the query was generated with that document in mind (P(Q|MD)), otherwise considered to be the probability of generating the query according to each document language model.

Language modelling was initially applied to document retrieval by Ponte and Croft [162] who proposed a simple unigram-based document model.17 The simple unigram model assigns:

\[
P(D|Q) = \prod_{t \in Q} P(t|M_D)
\tag{2.10}
\]

The model presented above may not be effective in general document retrieval as it requires a document to contain all query terms. Any document that is missing one or more query terms will be assigned a probability of query generation of zero.

16 This is primarily known for its application to text sequencing and estimation of message noise.
17 A unigram language model models the probability of each term occurring independently, whilst higher order (n-gram) language models model the probability that consecutive terms appear near each other (described in Section 2.3). In the unigram model the occurrence of a term is independent of the presence or absence of any other term (similar to the term independence assumption in the Okapi model).
Smoothing is often used to counter this effect (by adjusting the maximum likelihood estimation of the language model). Smoothing methods discount the probabilities of the terms seen in the text, to assign extra probability mass to the unseen terms according to a fallback model [218]. In information retrieval it is common to exploit corpus properties for this purpose. Thereby:

\[
P(D|Q) = \prod_{t \in Q}
\begin{cases}
P(t|M_D) & \text{if } t \in M_D \\
\alpha P(t|M_C) & \text{otherwise}
\end{cases}
\tag{2.11}
\]

where P(t|M_D) is the smoothed probability of a term seen in the document D, P(t|M_C) is the collection language model (over C), and α is the co-efficient controlling probability mass assigned to unseen terms (so that all probabilities sum to one).

Models for smoothing the document model include Dirichlet smoothing [155], geometric smoothing [162], linear interpolation [19] and 2-state Hidden Markov Models. Dirichlet smoothing has been shown to be particularly effective when dealing with short queries, as it provides an effective normalisation using document length [155, 218].18 Language models with Dirichlet smoothing have been used to good effect in recent TREC web tracks by Ogilvie and Callan [155]. A document language model is built for all query terms [155]:

\[
P(Q|M_D) = \prod_{t \in Q} P(t|M_D)
\tag{2.12}
\]

Adding smoothing to the document model using the collection model:

\[
P(t|M_D) = \beta_1 P_{mle}(t|D) + \beta_2 P_{mle}(t|C)
\tag{2.13}
\]

The β1 and β2 collection and document linear interpolation parameters are then estimated using Dirichlet smoothing:

\[
\beta_1 = \frac{|D|}{|D| + \gamma}, \qquad \beta_2 = \frac{\gamma}{|D| + \gamma}
\tag{2.14}
\]

where |D| is the document length and γ is often set near the average document length in the corpus [155]. The mle for a document is defined as:

\[
P_{mle}(w|D) = \frac{tf_{t,D}}{|D|}
\tag{2.15}
\]

Similarly, for the corpus:

\[
P_{mle}(w|C) = \frac{tf_{t,C}}{|C|}
\tag{2.16}
\]

The document score is then:

\[
S(D, Q) = \prod_{t \in Q} \left( \beta_1 \times \frac{count(t; D)}{|D|} + \beta_2 \times \frac{count(t; C)}{|C|} \right)
\tag{2.17}
\]

18 Document length has been exploited with success in the Okapi BM25 model and in the vector space model.
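As an illustration of Equations 2.13 to 2.17, the sketch below computes a Dirichlet-smoothed query-likelihood score. It works in log space to avoid numerical underflow, which is a common implementation choice rather than something the equations prescribe, and the collection statistics are assumed to be precomputed totals.

import math
from collections import Counter

def dirichlet_lm_score(query_terms, doc_terms, coll_tf, coll_len, gamma):
    # coll_tf maps term -> tf_{t,C}; coll_len is |C|; gamma is the Dirichlet
    # parameter, typically set near the average document length (Equation 2.14).
    tf = Counter(doc_terms)
    dl = len(doc_terms)                         # |D|
    beta1 = dl / (dl + gamma)
    beta2 = gamma / (dl + gamma)
    log_score = 0.0
    for t in query_terms:
        p_doc = tf[t] / dl if dl else 0.0       # P_mle(t|D), Equation 2.15
        p_coll = coll_tf.get(t, 0) / coll_len   # P_mle(t|C), Equation 2.16
        p = beta1 * p_doc + beta2 * p_coll      # smoothed P(t|M_D), Equation 2.13
        if p == 0.0:
            return float("-inf")                # term unseen everywhere
        log_score += math.log(p)                # log of the product in Equation 2.17
    return log_score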
Statistical language models have several beneficial properties. If users are assumed to provide query terms that are likely to occur in documents of interest, and that distinguish those documents from other documents in the corpus, language models provide a degree of confidence that a particular document should be retrieved [162]. Further, while the vector space and probabilistic models use a crude approximation to document corpus statistics (such as document frequency, discrimination value and document length), language models are sometimes seen to provide a more integrated and natural use of corpus statistics [162].

2.3.2 Discussion

The most effective implementations of each of the retrieval models discussed above have been empirically shown to be very similar [53, 60, 106, 110, 119, 121]. Discrepancies previously observed in the effectiveness of the different models have been found to be due to differences in the underlying statistics used in the model implementation, and not the model formalisation [186]. All models employ a tf × idf approach to some degree, and normalise term contribution using document length. This is explicit in probabilistic [170] and vector space models [186], and is often included within the smoothing function in language models [155, 218]. The use of these document statistics in information retrieval systems has been empirically validated over the past ten years [155, 168].

When dealing with free-text elements, experiments within this thesis use the probabilistic ranking function Okapi BM25 without prior relevance information [170]. This function has been empirically validated to perform as well as current state-of-the-art ranking functions [53, 57, 58, 59, 60, 168, 170]. Further discussion and comparison of full-text ranking functions is outside the scope of this thesis; interested readers should consult [14, 176, 191, 204].

2.3.3 Other evidence

To build a baseline that achieves similar performance to that of popular web and WWW search engines several further types of document-level evidence may need to be considered [31, 109, 113].

2.3.3.1 Metadata

Metadata is data used to describe data. An example of real-world metadata is a library catalogue card, which contains data that describes a book within the library (although metadata is not always stored separately from the document it describes). In web documents metadata may be stored within HTML metadata tags (<META>), or in a
    §2.3 Document-level evidence17 separate XML/RDF resource descriptors. As metadata tags are intended to describe document contents, the content of metadata tags is not rendered by web browsers. Several standards exist for metadata creation, one of the least restricted forms of which is simple Dublin Core [70]. Dublin Core provides a small set of core elements (all of which are optional) that are used to describe resources. These elements include: document author, title, subject, description, and language. An example of HTML metadata usage, taken from http://cs.anu.edu.au/∼Trystan.Upstill/19 is: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> <meta name="keywords" content="Upstill, Web, Information, Retrieval" /> <meta name="description" content="Trystan Upstill’s Homepage, Web IR" /> <meta name="revised" content="Trystan Upstill, 6/27/01" /> <meta name="author" content="Trystan Upstill" /> The utility of metadata depends on the observance of document authorship stan- dards. Inconsistencies between document content and purpose, and associated meta- data tags, may severely reduce system retrieval effectiveness. Such inconsistencies may occur either unintentionally through outdated metadata information, or through deliberate attribute “stuffing” in an attempt by the document author to have the doc- ument retrieved for a particular search term [71]. When a document is retrieved due to misleading metadata information, search system users may have no idea why the document has been retrieved, with no visible text justifying the document match. The use of HTML metadata tags is not considered within this thesis due to the relatively low adherence to metadata standards in documents across the WWW, and the inconsistency of adherence in other web corpora [107]. This policy is followed by many WWW search systems [71, 193]. 2.3.3.2 URL information Uniform Resource Locators, or URLs, provide web addresses for documents. The URL of a document may contain document evidence, either though term presence in the URL or implicitly through some other URL characteristic (such as depth in the site hierarchy). The URL string may contain useful query-dependent evidence by including a po- tential search term (e.g: http://cs.anu.edu.au/∼Joe.Blogs/ contains the po- tentially useful terms of “Joe” and “Blogs”). URLs can be matched using simple string matching techniques (e.g. checking if the text is present or not) or using full-text 19 META tags have been formatted according to X-HTML 1.0.
    18 Background ranking algorithms(although a binary term presence vector would probably suffice). Ogilvie and Callan [50, 155, 156] proposed a novel method for matching URL strings within a language modelling framework. In their method the probability that a URL was generated for a particular term, given the URLs of all corpus documents, is cal- culated. Query terms and URLs are treated as character sequences and a character- based trigram generative probability is computed for each URL. The numerator and denominator probabilities in the trigram expansion are then estimated using a linear interpolation with the collection model [50, 155, 156]. Ogilvie and Callan then com- bined this URL-based language model with the language models of other document components. The actual contribution of this type of URL matching is unclear.20 Further query-independent evidence relating to URLs might also be gained through examining common formatting practices. For example some features could be correlated with the length of a URL (by characters or directory depth), the match- ing of a particular character in the URL (e.g. looking for ‘∼’ when matching personal home pages [181]), or a more advanced metric. Westerveld et al. [135, 212] proposed a URL-type indicator for estimating the likelihood that a page is a home page. In this measure URLs are grouped into four categories, Root, Subroot, Path and File, using the following rules: Root a domain name, e.g. www.cyborg.com/. Subroot a domain name followed by a single directory, e.g. www.glasgow.ac.uk/staff/. Path a domain name followed by two or more directories, e.g. trec.nist.gov/pubs/trec9/. File any URL ending in a filename rather than a directory, e.g. trec.nist.gov/contact.html. Westerveld et al. [135, 212] calculated probabilities for encountering a home page in each of these URL-types using training data on the WT10g collection (described in Section 2.6.7.2). They then used these probabilities to assign scores to documents based on the likelihood that their document URL would be a home page. In experiments reported within this thesis, URL-type and URL length informa- tion are considered. While the textual elements in a URL may be useful in doc- ument matching, consistent benefits arising from their use are yet to be substanti- ated [107, 155]. As such they are not considered within this work. 20 Ranking functions which included this URL measure performed well, but the contribution of the URL measure was unclear.
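The Root/Subroot/Path/File categorisation described above can be sketched as a small classifier. The treatment of bare host names and trailing slashes below is a guess at reasonable behaviour rather than a reconstruction of Westerveld et al.'s exact rules, and the example URLs are those listed above.

from urllib.parse import urlparse

def url_type(url):
    # Classify a URL as 'root', 'subroot', 'path' or 'file' in the style of
    # the URL-type indicator of Westerveld et al. described above.
    if "://" not in url:
        url = "http://" + url                 # allow bare host names
    path = urlparse(url).path
    if path == "" or path.endswith("/"):
        depth = len([segment for segment in path.split("/") if segment])
        if depth == 0:
            return "root"
        if depth == 1:
            return "subroot"
        return "path"
    return "file"                             # URL ends in a filename

for u in ["www.cyborg.com/", "www.glasgow.ac.uk/staff/",
          "trec.nist.gov/pubs/trec9/", "trec.nist.gov/contact.html"]:
    print(u, "->", url_type(u))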
    §2.3 Document-level evidence19 2.3.3.3 Document structure and tag information Important information might be marked up within a web document to indicate to a document viewer that a particular segment of the document, or full document, is important. For example useful evidence could be collected from: • Titles / Heading tags: encoded in <H?> or <TITLE> tags. • Marked-up text: For example bold (B), emphasised (E) or italic (I) text may contain important information. • Internal tag structure: The structural makeup of a document may give insight into what a document contains. For example, if a document contains a very long table, list or form, this may give some indication as to the utility of that document. • Descriptive text tags: Images often include descriptions of their content for users viewing web pages without graphics capabilities. These are included as an at- tribute in the IMG tag (ALT=). Ogilvie and Callan [50, 155, 156] achieved small effectiveness gains through an up- weighting of TITLE, Image ALT text and FONT tag text for both named page finding and home page finding tasks. However, the effectiveness gains through the use of these additional forms of evidence were small compared to those achieved through the use of document full-text, referring anchor-text and URL length priors.21 The only document structure used in experiments within this thesis is document TITLE. While there is some evidence to suggest that up-weighting marked-up text might provide some gains, experiments have shown that the associated improvement is relatively small [155]. 2.3.3.4 Quality metrics Zhu and Gauch [219] considered whether the effectiveness of full-text-based docu- ment ranking22 could be improved through the inclusion of quality metrics. They evaluated six measures of document quality: • Currency: how recently a document was last modified (using document time stamps). • Availability: how many links leaving a document were available (calculated as the number of broken links from a page divided by the total number of links). • Information-to-noise: a measurement of how much text in the document was noise (such as HTML tags or whitespace) as opposed to how much was useful content. 21 Using the length of a URL to estimate a prior probability of document relevance. 22 Calculated using a tf × idf vector space model (see Section 2.3.1.2).
    20 Background • Authority:a score sourced from Yahoo Internet Life reviews and ZDNet ratings in 1999. According to these reviews each site was assigned an authority score. Sites not reviewed were assigned an authority score of zero. • Popularity: how many documents link to the site (in-degree). This information was sourced from AltaVista [7]. The in-degree measure is discussed in detail in Section 2.4.3.1. • Cohesiveness: how closely related the elements of a web page are, determined by classifying elements using a vector space model into a 4385 node ontology and measuring the distance between competing classifications. A small distance between classifications indicates that the document was cohesive. A large dis- tance indicates the opposite. Zhu and Gauch [219] evaluated performance using a small corpus with 40 queries taken from a query log file.23 They observed some improvement in mean precision based on all the quality metrics, although not all improvements were significant.24 The smallest individual improvements were for “Popularity” and “Authority” (both non-significant). The improvements obtained through the use of all other metrics was significant. The largest individual improvement was observed for the “Information- to-noise” ratio. Using all quality metrics apart from “Popularity” and “Authority” resulted in a (significant) 24% increase in performance over the baseline document ranking [219]. These quality metrics, apart from in-degree, are not included in experiments within this thesis because sourced information may be incomplete [219] or inaccu- rate [113]. 2.3.3.5 Units of retrieval Identifying the URL which contains the information unit most relevant to the user may be a difficult task. There are many ways in which a unit of information may be defined on a web and so the granularity of information units retrieved by web search engines may vary considerably. If the granularity is too fine (e.g. the retrieval of a single document URL when a whole web site is relevant), the user may not be able to fulfil their information need. In particular the user may not be able to tell whether the system has retrieved an ad- equate answer, or the retrieved document list may contain many parts of a composite document from a single web site. If the unit of retrieval is too large (e.g. the retrieval of a home page URL when only a deep page is relevant), the information may be buried such that it is difficult for users to retrieve. The obvious unit for WWW-based document retrieval is the web page. However, there are many situations in which a user may be looking for a smaller element of 23 It is unclear how the underlying search task [106, 108] was modelled in this experiment. 24 Significance was tested using a paired-samples t-test [219].
    §2.3 Document-level evidence21 information, such as when seeking an answer to a specific question. Alternatively, a unit of information may be considered to be a set of web pages. It is common for web documents to be made up of multiple web pages, or at least be related to other co- located documents [75]. An example of a composite document is the WWW site for the Keith van Rijsbergen book ‘Information Retrieval’ which consists of many pages, each containing small sections from the book [205]. In a study of the IBM intranet Eiron and McCurley [75] reported that approximately 25% of all URLs encountered on the IBM corpus were members of some larger “compound” document that spanned several pages. The problem of determining the “most useful” level for an information unit was considered in the 2003 TREC Topic Distillation task (TD2003 – described in Sec- tion 2.6.7). The TD2003 task judged systems according to whether they retrieved im- portant resources, and did not mark subsidiary documents as being relevant [60]. The TD2003 task is similar to the “component coverage” assessment used in the INEX XML task [85], where XML retrieval systems are rewarded for retrieving the correct unit of information. In the XML task the optimal system would return the unit of information that contains the relevant information and nothing else. Some methods analyse the web graph and site structure in an attempt to identify logical information units. Terveen et al. build site units by graphing co-located pages, using a method entitled “clan graphs” [196]. Further methods attempt to determine the appropriate information unit by applying a set of heuristics based on site hierarchy and linkage [13, 142]. This thesis adopts the view that finding the correct information unit is analogous to finding the optimal entry point for the correct information unit. As such, none of the heuristics outlined above are used to detect information units. Instead, hyperlink recommendation and other document evidence is evaluated according to whether it can be used to find information unit entry points. Document segmentation Document segmentation methods break-down HTML documents into document com- ponents that can be analysed individually. A commonly used segmentation method is to break-down HTML documents into their Document Object Model (DOM), accord- ing to the document tag hierarchy [42, 45]. Visual Information Processing System (VIPS) [37, 38, 39] is a recently proposed extension of DOM-based break-down and dissects HTML documents using visual elements in addition to their DOM. Document segmentation techniques are not considered in this thesis. While finer document breakdown might be useful for finding short answers to particular ques- tions, there is little evidence of improvements in ranking at the web page level [39].
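Purely as an illustration of the DOM-based break-down idea described above (it is not used in the thesis experiments), the sketch below collects text fragments keyed by their position in the tag hierarchy, using Python's standard html.parser module. The sample document is invented.

from html.parser import HTMLParser

class DomSegmenter(HTMLParser):
    """Collect text fragments keyed by their path in the tag hierarchy,
    a crude stand-in for DOM-based document segmentation."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags, e.g. ['html', 'body', 'p']
        self.segments = []     # (path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag; tolerates sloppy HTML.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.segments.append(("/".join(self.stack), text))

segmenter = DomSegmenter()
segmenter.feed("<html><body><h1>Example site</h1>"
               "<p>A short page used only for illustration.</p></body></html>")
print(segmenter.segments)
# [('html/body/h1', 'Example site'), ('html/body/p', 'A short page used only for illustration.')]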
    22 Background 2.4 Web-basedevidence Many early WWW search engines conceptualised the document corpus as a flat struc- ture and relied solely on the document-level evidence outlined above, ignoring hy- perlinks between documents [33]. This section outlines techniques for exploiting the web graph that is created when considering documents within a web as nodes and hyperlinks between documents as directed edges. This thesis does not consider measures based on user interaction with the web search system, such as click-through evidence [129]. While click-through evidence may be useful when ranking web pages, assumptions made about user behaviour may be questionable. In many cases it may be difficult to determine whether users have judged a document relevant from a sequence of queries and clicks. Collecting such evidence also requires access to user interaction logs for a large scale search system. Work within this thesis relating to the combination of query-dependent evidence with other query-independent evidence is applicable to this domain. The WWW graph was initially hypothesised to be a small world network [18], that is, a network that has a finite diameter,25 where each node has a path to every other node by a relatively small number of steps. Small world networks have been shown to exist in other natural phenomena, such as relationships between research scientists or between actors [2, 5, 6, 18]. Barabasi hypothesised that the diameter of the WWW graph was 18.59 links (estimated for 8 × 108 documents) [18]. However, this work was challenged by WWW graph analysis performed by Broder et al. [35]. Using a 200 million page crawl from AltaVista, which contained 1.5 billion links [7], Broder et al. observed that the WWW graph’s maximal and average diameter was infinite. The study revealed that the WWW graph resembles a bow-tie with a Strongly Connected Component (SCC), an upstream component (IN), a downstream component (OUT), links between IN and OUT (Tendrils), and disconnected components. Each of these components was observed to be roughly the same size (around 50 million nodes). The SCC is a highly connected graph that exhibits the small-world property. The IN component consists of nodes that link into the SCC, but cannot be accessed from the SCC. The OUT component consists of nodes that are linked to from the SCC, but do not link back to the SCC. Tendrils link IN nodes directly to OUT nodes, bypassing the SCC. Disconnected components are pages to which no-one linked, and which linked- to no-one. The minimal diameter26 for the bow-tie was 28 for the SCC and 500 for the entire graph. The probability of a directed path existing between two nodes was observed to be 24%, and the average length of such a path was observed to be 16 links. The shortest directed path between two random nodes in the SCC was, on average, 16 to 20 links. Further work by Dill et al. [67] has reported that WWW subgraphs, when restricted by domain or keyword occurrence, also form bow-tie-like structures. This phenomenon has been termed the fractal nature of the WWW, and is exhibited by 25 Average distance between two nodes in a graph 26 The minimum number of steps by which the graph could be crossed
    §2.4 Web-based evidence23 other scale-free networks [67]. Many WWW distributions have been observed to follow a power law [3]. That is, the distributions take some form k = 1/ix for i > 1, where k is the probability that a node has the value i according to some exponent x. Important WWW distributions that have been observed to follow the power law include: • WWW site in-links (in-degrees). The fraction of pages with an in-degree i was first approximated by Kumar et al. [136, 137] to be distributed according to power law with exponent x = 2 on a 1997 crawl of around 40 million pages gathered by Alexa.27 Later Barabasi et al. estimated the exponent at x = 2.1 over a graph computed for a corpus containing 325 000 documents from the nd.edu domain [17, 18]. Broder et al. [35] have since confirmed the estimate of x = 2.1. • WWW site out-links (out-degrees). Barabasi and Albert [17] estimated a power law distribution with exponent x = 2.45. Broder et al. [35] reported a x = 2.75 exponent for out-degree on a 200 million page crawl from AltaVista. • Local WWW site in-degrees and out-degrees [25]. • WWW site accesses [4]. 2.4.1 Anchor-text evidence Web authors often supply textual snippets when marking-up links between web doc- uments, encoded within anchor “<A HREF=’’></A>” tags. The average length of an anchor-text snippet has been observed to be 2.85 terms [159]. This is similar to the average query length submitted to WWW search engines [184] and suggests there might be some similarity between a document’s anchor-text and the queries typically submitted to search engines to find that document [73, 74]. A common method for exploiting anchor-text is to combine all anchor-text snip- pets pointing to a single document into a single aggregate anchor-text document, and then to use the aggregate document to score the target document [56]. In terms of document evidence, this aggregate anchor-text document may give some indication of what other web authors view as the content, or purpose, of a document. It has been observed that anchor-text frequently includes information associated with a page that is not included in the page itself [90]. To increase the anchor-text information collected for hyperlinks, anchor-text evi- dence can be expanded to include text outside (but in close proximity to) anchor-tags. However, there is disagreement regarding whether such text should be included. Chakrabarti [44] investigated the potential utility of text surrounding anchor tags by measuring the proximity of the term “Yahoo” to the anchor tags of links to http://www.yahoo.com in 5000 web documents. Chakrabarti found that includ- ing 50 words around the anchor tags performed best as most occurrences of Yahoo 27 http://www.alexa.com
Distance:  -100   -75   -50   -25     0    25    50    75   100
Density:      1     6    11    31   880    73   112    21     7

Table 2.1: Proximity of the term “Yahoo” to links to http://www.yahoo.com/ for 5000 WWW documents (from [44]). Distance is measured in bytes. A distance of 0 indicates that “Yahoo” appeared within the anchor tag. A negative distance indicates it occurred before the anchor-tag, and a positive distance indicates that it occurred after the tag.

Chakrabarti found that using this extra text improved recall, but at the cost of precision (precision and recall are described in Section 2.6.6.1). In later research Davison [64, 65] reported that extra text surrounding the anchor-text did not describe the target document any more accurately than the text within anchor-tags. However, Glover et al. [90] reported that using up to 25 terms around anchor-text tags improved page-content classification performance. Pant et al. [159] proposed a further method for expanding anchor-text evidence using a DOM break-down (DOM described in Section 2.3.3.5). They suggested that if an anchor-text snippet contains under 20 terms then the anchor-text evidence should be extended to consider all text up to the next set of HTML tags. They found that expanding to between two and four HTML tag levels improved classification of the target documents when compared to only using text that occurred within anchor-tags.

Experiments within this thesis only consider text within the anchor tags, as there is little conclusive evidence to support the use of text surrounding anchor tags.

Anchor-text ranking

Approaches to ranking anchor-text evidence include:

• Vector space. Hyperlink Vector Voting, proposed by Li and Rafsky [143], ranks anchor-text evidence using a vector space containing all anchor-text pointing to a document. The final score is the sum of all the dot products between the query vector and anchor-text vectors. Li and Rafsky did not formally evaluate this method.

• Okapi BM25. Craswell, Hawking and Robertson [56] built surrogate documents from all the anchor-text snippets pointing to a page and ranked the documents as if they contained document full-text. This application of anchor-text provided dramatic improvements in navigational search performance.^28

• Language Modelling. Ogilvie and Callan [155] modelled anchor-text separately from other document evidence using a unigram language model with Dirichlet smoothing. The anchor-text language model was then combined with their models for other sections of the document using a mixture model (see Section 2.5.2.2). This type of anchor-text scoring has been empirically evaluated and shown to be effective [155, 156].

^28 Navigational search is described in Section 2.6.2.
Unless otherwise noted, the anchor-text baselines used in this thesis are scored from anchor-text aggregate documents using the Okapi BM25 ranking algorithm. This method is used because it has previously been reported to perform well [56].

2.4.2 Bibliometric measures

[Figure 2.1: A sample network of relationships]

Social networks researchers [125, 131] are concerned with the general study of links in nature for diverse applications, including communication (to detect espionage or optimise transmission) and modelling disease outbreak [89]. Bibliometrics researchers are similarly interested in the citation patterns between research papers [87, 89], and study these citations in an attempt to identify relationships. This can be seen as a specialisation of social network analysis [89]. In many social network models, there is an implicit assumption that the occurrence of a link (citation) indicates a relationship or some attribution of prestige. However, in the context of some areas (such as research) it may be difficult to determine whether a citation is an indication of praise or retort [203].

Social networks and citations may be modelled using link adjacency matrices. A directed social network of size n can be represented as an n × n matrix, where links between nodes are encoded in the matrix (e.g. if a node i links to j, then E_{i,j} = 1). For example, the relationship network shown in Figure 2.1 may be represented as:

E = \begin{pmatrix}
0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0
\end{pmatrix}
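For concreteness, the matrix above can be generated from an edge list, and its row and column sums give out-degrees and in-degrees (the direct prestige measure discussed below). The edge list used in this sketch is one reading of the Figure 2.1 network, chosen so that it reproduces the matrix E shown; it is illustrative only.

import numpy as np

def adjacency_matrix(n, edges):
    """E[i, j] = 1 if node i links to node j (nodes numbered from 1)."""
    E = np.zeros((n, n), dtype=int)
    for i, j in edges:
        E[i - 1, j - 1] = 1
    return E

# Edges read off the Figure 2.1 example network.
edges = [(1, 3), (1, 4), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5), (5, 3)]
E = adjacency_matrix(5, edges)
print(E)                              # reproduces the matrix shown above
print("out-degree:", E.sum(axis=1))   # [2 3 1 1 1]
print("in-degree: ", E.sum(axis=0))   # [0 0 3 3 2]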
Prestige

The number of incoming links to a node is a basic measure of its prestige [131]. This gives a measure of the direct endorsement the node has received. However, examining direct endorsements alone may not give an accurate representation of node prestige. It may be more interesting to know if a node is recognised by other important nodes, thus transitive citation becomes important. A transitive endorsement is an endorsement through an intermediate node (i.e. if A links to B, and B links to C, then A weakly endorses C).

An early measure of prestige in a social network analysis was proposed by Seeley [179] and later revised by Hubbell [125]. In this model, every document has an initial prestige associated with it (represented as a row in p), which is transferred to its adjacent nodes (through the adjacency matrix E). Thus the direct prestige of any (a priori equal) node can be calculated by setting p = (1, ..., 1)^T and calculating p = pE^T. By performing a power iteration over p ← pE^T the prestige measure p converges to the principal eigenvector of the matrix E^T and provides a measure of transitive prestige.^29 The power iteration method multiplies p by increasing powers of E^T until the calculation converges (tested using some convergence constant).

To measure prestige for academic journals Garfield [88] proposed the “impact factor”. The impact factor score for a journal j is the average number of citations to papers within that journal received during the previous two years. Pinski and Narin [161] proposed a variation to the “impact factor”, termed the influence weight, based on the observation that all journal citations may not be equally important. They hypothesised that a journal is influential if its papers are cited by papers in other influential journals, and thus incorporate a measure of transitive endorsement. This notion of transitive endorsement is similar to that modelled in PageRank and HITS (described in Sections 2.4.3.2 and 2.4.4.1).

Co-citation and bibliographic coupling

Co-citation is used to measure subject similarity between two documents. If a document A cites documents B and C, documents B and C are co-cited by A. If many documents cite both documents B and C, this indicates that B and C may be related [187]. The more documents that cite both B and C, the closer their relationship. The co-citation matrix (CC) is calculated as:

CC = E^T E   (2.18)

where CC_{i,j} is the number of papers which jointly cite papers i and j, and the diagonal is node in-degree.

Bibliographic coupling is the inverse of co-citation, and infers that if two documents include the same references then they are likely to be related, i.e. if documents A and B both cite document C this gives some indication that they are related.

^29 See Golub and Van Loan [91] for more information about principal eigenvectors and the power method [pp. 330–332].
The more documents that documents A and B both cite, the stronger their relationship. The bibliographic coupling (BC) matrix is calculated as:

BC = EE^T   (2.19)

where BC_{i,j} is the number of papers jointly cited by i and j and the diagonal is node out-degree.

Citation graph measures

Important information may be conveyed by the distance between two nodes in a citation graph, the radius of a node (the maximum distance from a node to the graph edge), the cut of the graph (the edges of the graph that, when removed, will disconnect large sections of the graph), and the centre of the graph (the node that has the smallest radius). For example, when examining a field of research, interesting papers can be identified by their small radius, as this indicates that most papers in the area have a short path to the paper. The cut of the graph typically indicates communication between cliques, and can be used to identify important nodes, whose omission would lead to the loss of the relationship between the groups [196].

2.4.2.1 Bibliographic methods applied to a web

Hyperlink-based scoring assumes that web hyperlinks provide some vote for the importance of their targets. However, due to the relatively small cost of web publishing, the discretion used when creating links between web pages may be less than is employed by researchers in scientific literature [203]. Indeed it has been observed that not all web links are created for recommendation purposes [63] (discussed in Section 3.1.5).

An early use of hyperlink-based evidence was in a WWW site visualisation, where a site's visibility represented its direct prestige, and the out-degree of a site was the node's luminosity [30]. Larson [138] presented one of the first applications of bibliometrics on the WWW by using co-citation to cluster related web pages and to explore topical themes.

Marchiori [145] provided an early examination of the use of hyperlink evidence in a document ranking scheme, by proposing that a document's score should be relative to that document's full-text score and “hyper” (hypertext-based) score. Marchiori's model was based on the idea that a document's quality is enriched through the provision of links to other important resources. In this model, the “hyper-information” score was a measure based on a document's subsidiary links, rather than its parent links. The page score was dependent not only on its full-text content, but the content of its subsidiaries as well. A decay factor was implemented such that the farther a subsidiary was from the initial document, the less its contribution would be.

Xu and Croft [215] outline two broad domains for web-based hyperlink information: global link information and local link information. Global link information is computed from a full web graph, based on links between all documents in a corpus [40, 215].
In comparison, local link information is built for some subset of the graph currently under examination, such as the set of documents retrieved in response to a particular query. In many cases the additional cost involved in calculating local link information might be unacceptable for web or WWW search systems [40].

2.4.3 Hyperlink recommendation

The hyperlink recommendation techniques examined here are similar to the bibliometric measures of prestige, and may be able to provide some measure of the “importance”, “quality” or “authority” of a web document [31]. This hypothesis is tested through experiments presented in Chapter 5.

2.4.3.1 Link counting / in-degree

A page's in-degree score is a measure of its direct prestige, and is obtained through a count of its incoming links [29, 41]. It is widely believed that a web page's in-degree may give some indication of its importance or popularity [219].

In an analysis of link targets Bharat et al. [25] found that the US commercial domain .com had higher in-degree on average than all other domains. Sites within the .org and .net domains also had higher in-degree (on average) than sites in other countries.

2.4.3.2 PageRank

PageRank is a more sophisticated query-independent link citation measure developed by Page and Brin [31, 157] to “objectively and mechanically [measure] the human interest and attention devoted [to web pages]” [157]. PageRank uses global link information and is stated to be the primary link recommendation scheme employed in the Google search engine [93] and search appliance [96]. PageRank is designed to simulate the behaviour of a “random web surfer” [157] who navigates a web by randomly following links. If a page with no outgoing links is reached, the surfer jumps to a randomly chosen bookmark. In addition to this normal surfing behaviour, the surfer occasionally spontaneously jumps to a bookmark instead of following a link. The PageRank of a page is the probability that the web surfer will be visiting that page at any given moment.

PageRank is similar to bibliometric prestige, but differs by down-weighting documents that have many outgoing links; the fewer links a node has, the larger the portion of prestige it will bestow on its outgoing links. The PageRank distribution matrix (E^{PR}) is then:

E^{PR}_{i,j} = \frac{E_{i,j}}{\sum_{n=1}^{\dim(E)} E_{n,j}}   (2.20)

for the link adjacency matrix E.
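Before continuing with the PageRank formulation, it is worth noting how cheaply the in-degree measure of Section 2.4.3.1 can be computed: a single pass over the resolved link pairs suffices. The pairs in the sketch below are hypothetical, and self-links are discarded, a common though not universal choice.

from collections import Counter

def in_degrees(edges):
    """Count incoming links per target; a page's in-degree is its direct
    prestige under the simplest hyperlink recommendation measure."""
    return Counter(target for source, target in edges if source != target)

# Hypothetical resolved link pairs (source URL, target URL).
edges = [("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "d")]
print(in_degrees(edges))   # c and d each have in-degree 2; the self-link d->d is ignored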
The PageRank distribution matrix (E^{PR}) is a non-negative stochastic matrix that is aperiodic^30 and irreducible.^31 The PageRank calculation is a Markov process, where PageRank is an n-state system and the distribution matrix (E^{PR}) contains the independent transition probabilities E^{PR}_{i,j} of jumping from state i to j. If the random surfer is in all states with equal probability when leaving from a node i, then E^{PR}_{1..n,j} = (1/n, ..., 1/n).

^30 Every node can reach any other node at any time-step (implies irreducibility).
^31 Every node can reach every other node.

The basic formulation of a single iteration of PageRank is then:

p = p × (E^{PR})^T   (2.21)

where p is initialised according to the bookmark vector (by default a unit vector), and is the updated PageRank score after each iteration.

Page and Brin observed that unlike scientific citation graphs, it is quite common to find sections of the web graph that act as “rank sinks”. To address this difficulty Page and Brin introduced a random jump (or teleport) component where, with a constant probability d, the surfer jumps to a random bookmarked node in b. That is:

p = ((1 − d) × b) + d × (p × (E^{PR})^T)   (2.22)

If d = 0 or b is not broad enough, the PageRank calculation may not converge [102].

Another complexity in the PageRank calculation is presented by nodes that act as “rank leaks”; this occurs if the surfer encounters a page with no outgoing links, or a link to a page that is outside the crawl (a dangling link). One approach to resolving this issue is to jump with certainty (a probability of one) when a dangling link is encountered [154]. This approach, and several others, are covered in more detail in Section 3.3.1. If applying the “jump with certainty” method, and using a unit b bookmark vector (such that the random surfer has every page bookmarked), the final PageRank scores are equivalent to the principal eigenvector of the transition matrix E^{PR}, where E^{PR} is updated to include the random jump factor:

E^{PR}_{i,j} = \frac{1 − d}{\dim(E)} + d × \frac{E_{i,j}}{\sum_{n=1}^{\dim(E)} E_{n,j}}   (2.23)

Expressed algorithmically, the PageRank algorithm (when using “jump with certainty”) is:

R_0 ← S
loop:
    r ← dang(R_i)
    R_{i+1} ← rE + A R_i
    R_{i+1} ← (1 − d)E + d(R_{i+1})
    δ ← ||R_{i+1} − R_i||_1
while δ > ε
where R_i is the PageRank vector at iteration i, A is the link adjacency matrix (where A_{i,j} = 1 if a link exists, and is 0 otherwise), S is the initial PageRank vector (the probability that a surfer starts at a node), E is the vector of bookmarked pages (the probability that the surfer jumps to a certain node at random), dang() is a function that returns the PageRank of all nodes that have no outgoing links, r is the amount of PageRank lost due to dangling links, which is distributed amongst bookmarks (after [43, 154]), d is a constant which controls the proportion of random noise (spontaneous jumping) introduced into the system to ensure stability (0 < d < 1), and ε is the convergence constant. The double bar (|| · ||_1) notation indicates an l1 norm, the sum of the absolute values of a vector's elements.

In this formulation, for a given link graph, PageRank varies according to the values of the d constant and the set of bookmark pages E. The PageRank variants investigated in this thesis are described in more detail in Section 3.3.

2.4.3.3 Topic-specific PageRank

Further PageRank formulations seek to personalise the calculation according to user preferences. Haveliwala [103, 104] proposed Personalised PageRank, and demonstrated how user topic preferences may be introduced by modifying the bookmark vector and changing the random jump targets, thereby altering PageRank scores. Haveliwala proposed that a bookmark vector be built for each top-level DMOZ [69] category by including all URLs within the tree as bookmarks.

During query processing, each incoming query is classified into these categories (represented by the influence vector v) and a new “dynamic” PageRank score is computed from a weighted sum of the category-specific PageRanks (ppr), and the PageRank calculation is modified to explicitly include a bookmark vector (e.g. PR(E, b) is the PageRank calculation for the adjacency matrix E using bookmarks b). So:

ppr = PR(E, v)   (2.24)

Category preferences can also be mixed. To compute a set of personalisation vectors (v_i) with weights (w_i) for a mixture of categories:

ppr = PR(E, \sum_i [w_i \cdot v_i])   (2.25)
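The following sketch is a generic power-iteration implementation of the calculation just described: dangling-page mass is returned to the bookmark vector (the “jump with certainty” treatment) and, with weight 1 − d, the surfer teleports to a bookmarked page, mirroring the update R_{i+1} ← (1 − d)E + d(R_{i+1}) above. It is illustrative rather than the implementation used in later chapters, and the parameter values are arbitrary.

import numpy as np

def pagerank(E, d=0.85, bookmarks=None, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a link adjacency matrix E, where
    E[i, j] = 1 if page i links to page j.  Dangling-page mass is given
    back to the bookmark vector ("jump with certainty"), and with weight
    (1 - d) the surfer teleports to a bookmarked page."""
    n = E.shape[0]
    b = np.full(n, 1.0 / n) if bookmarks is None else bookmarks / bookmarks.sum()
    out_deg = E.sum(axis=1)
    # Row-stochastic transition: follow one of the page's out-links uniformly.
    T = np.divide(E, out_deg[:, None], out=np.zeros_like(E, dtype=float),
                  where=out_deg[:, None] > 0)
    p = b.copy()
    for _ in range(max_iter):
        dangling = p[out_deg == 0].sum()        # rank held by pages with no out-links
        new_p = d * (p @ T + dangling * b) + (1 - d) * b
        if np.abs(new_p - p).sum() < tol:       # l1-norm convergence test
            return new_p
        p = new_p
    return p

# The Figure 2.1 adjacency matrix used earlier in this section.
E = np.array([[0, 0, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0]], dtype=float)
print(pagerank(E).round(4))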
2.4.4 Other hyperlink analysis methods

2.4.4.1 HITS

Hyperlink Induced Topic Search (HITS) is a method used to identify two sets of pages that may be important: Hub pages and Authority pages [132]. Hub and Authority pages have a mutually reinforcing relationship – a good Hub page links to many Authority pages (thereby indicating high Authority co-citation), and a good Authority page is linked-to by many Hubs (thereby indicating high Hub bibliographic coupling).

Each page in the web graph is assigned two measures of quality; an Authority score Au[u] and a Hub score H[u]. Sometimes the act of generating HITS results sets is termed “Topic Distillation”, but in this thesis the phrase is associated with its use in the TREC web track experiments (described in Section 2.6.3.1).

HITS-based scores may be computed using either local or global link information. Local HITS has two major steps: collection sampling and weight propagation. Global HITS is computed for the entire web graph at once so there is no collection sampling step.

When calculating local HITS a small focused web subgraph, often based around a search engine result list, is retrieved for a particular query.^32 This root set of pages is then expanded to make a base set by downloading all pages that link to, or are linked-to by, pages within the root set. The assumption is that, although the base set may not be a fully connected graph, it should include a large connected component (otherwise the computation is ineffective).

The Hub and Authority score computation is a recursive process where Au and H are updated until convergence (initialised with all pages having the same value). For a graph containing edges E, where (q, p) denotes a link from q to p, the weight is distributed according to:

Au_p = \sum_{(q,p) \in E} H_q   (2.26)

H_p = \sum_{(p,q) \in E} Au_q   (2.27)

Like PageRank, these equations can be solved using the power method [91]. Au will converge to the principal eigenvector of E^T E, and H will converge to the principal eigenvector of EE^T [154]. The non-principal eigenvectors can also be calculated, and may represent different term clusters [132]. For example, three term clusters (and corresponding meanings) occur for the query ‘Jaguar’: one on the large cat, one on the Atari hand-held game console, and one on the prestige car [132].

Revisiting HITS

Several limitations of the HITS model, as presented by Kleinberg [132], were observed and addressed by Bharat and Henzinger [26]. These are:

• Mutually reinforcing relationships between hosts. This occurs when a set of documents on one host point to a single document on a second host.

• Automatically generated links. This occurs when web documents are generated by tools and are not authored (recommendation) links.

• Non-relevant nodes. This arises through what Bharat and Henzinger termed topic drift. Topic drift occurs when the local subgraph is expanded to include surrounding links, and as a result, pages not relevant to the initial query are included in the graph, and therefore in the HITS calculation.

^32 This was originally performed using result sets from the AltaVista WWW search engine [7].
Bharat and Henzinger [26] addressed the first and second issues by assigning a weight to identical multiple links “inversely proportional to their multiplicity”.^33 To address the third problem, topic drift, they removed content outliers. This was achieved by computing a document vector centroid and removing pages that were dissimilar to the vector from the base set.

Lempel and Moran [140, 141] proposed a more “robust” stochastic version of HITS called SALSA (Stochastic Algorithm for Link Structure Analysis). This algorithm aims to address concerns that Tightly Knit Communities (TKC) affect HITS calculations. A TKC occurs when a small collection of pages is connected so that every Hub links to every Authority. The pages in a TKC can be ranked very highly by HITS, and therefore achieve the principal eigenvector, even when there is a larger collection of pages in which Hubs link to Authorities (but are not completely connected). The TKC effect could be used by spammers to increase Hub and Authority rankings for their pages, using techniques such as link farming.^34

Calado et al. [40] observed significant improvement through the use of local and global HITS over a document full-text-only baseline. The experiments examined a set of 50 informational-type queries (see Section 2.6.1) extracted from a Brazilian WWW search engine log. The queries were observed to be 1.78 terms long on average, significantly shorter than those observed in previous WWW log studies (2.35 terms [184]). Further, it was observed that 28 of the queries were very general, and consisted of terms such as “tabs”, “movies” and “mp3s”. The information needs behind these queries were estimated for relevance assessment following the method proposed by Hawking and Craswell [110]. Through the addition of local HITS to the baseline vector space ranking Calado et al. observed an improvement of 8% in precision^35 at ten documents retrieved. Through the incorporation of global HITS evidence they observed an improvement of 24% in precision at ten documents retrieved. The improvements were reported to be significant for local link analysis after thirty results, and for global link analysis after ten results. Similar improvements were observed through the use of PageRank.

^33 Thereby lessening the effects of nepotistic and navigational links, described in Section 3.1.5.
^34 Link farms are artificial web graphs created by spammers through the generation of link spam. They are designed to funnel hyperlink evidence to a set of pages for which they desire high web rankings.
^35 The precision measure is described in Section 2.6.6.1.
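For reference, the Hub and Authority updates of equations (2.26) and (2.27) can be iterated directly, as in the sketch below. The example reuses the small Figure 2.1 adjacency matrix and normalises after each step; it illustrates the basic HITS computation only, without the sampling step or the refinements of Bharat and Henzinger or SALSA.

import numpy as np

def hits(E, tol=1e-10, max_iter=200):
    """Iterate the Hub/Authority updates on a link adjacency matrix E
    (E[q, p] = 1 if q links to p), normalising each step; Au tends towards
    the principal eigenvector of E^T E and H towards that of E E^T."""
    n = E.shape[0]
    au = np.ones(n) / n
    h = np.ones(n) / n
    for _ in range(max_iter):
        new_au = E.T @ h        # Au_p = sum of Hub scores of pages linking to p
        new_h = E @ new_au      # H_p  = sum of Authority scores of pages p links to
        new_au /= np.linalg.norm(new_au)
        new_h /= np.linalg.norm(new_h)
        diff = np.abs(new_au - au).sum() + np.abs(new_h - h).sum()
        au, h = new_au, new_h
        if diff < tol:
            break
    return au, h

# The same small example graph as before (Figure 2.1).
E = np.array([[0, 0, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0]], dtype=float)
au, h = hits(E)
print("authorities:", au.round(3))
print("hubs:       ", h.round(3))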
2.4.5 Discussion

This thesis only considers in-degree and variants of PageRank, and not other hyperlink recommendation techniques. In-degree is included because it is the simplest hyperlink recommendation measure and is cheap to compute. PageRank was chosen as a representative of other more expensive methods because:

• Google [93], one of the world's most popular search engines, states that PageRank is an important part of their ranking function [31, 157].

• In recent years there have been many studies of how PageRank might be improved [1, 39, 105, 197], optimised [11, 102] and personalised [103, 104, 127, 130], but there have not been any detailed evaluations of its potential benefit to retrieval effectiveness [8, 78].

• PageRank has been observed to be more resilient to small changes in the web graph than HITS [154]. This may be an important property when dealing with WWW-based search as it is difficult to construct an accurate and complete web graph (see Chapter 3), and the web graph is likely to be impacted by web server down-time [52].

• PageRank has previously been observed to exhibit similar performance to non-query-dependent HITS (global HITS) [40].

• While locally computed HITS may perform quite differently to global HITS, the cost of computing HITS at query-time is prohibitive in most production web and WWW search systems [132].

2.5 Combining document evidence

There are many ways in which the different types of evidence examined in the previous two sections could be combined into a single ranking function. It is important that the combination method is effective, as a poor combination could lead to spurious results. This section describes several methods that can be used to combine document evidence.

The discussion of combination methods is split into two sub-sections. The first sub-section reviews score and rank-based fusion methods. In fusion methods the output from ranking function components is combined without prior knowledge of the underlying retrieval model (how documents were ranked and scored). The second sub-section reviews modifications to full-text retrieval models such that they include more than one form of document evidence.

2.5.1 Score/rank fusion methods

Score or rank-based fusion techniques attempt to merge document rankings based either on document ranks, or document scores, without prior knowledge of the underlying retrieval model. The combination of multiple forms of document evidence into a single ranking is similar to the results merging problem in meta-search, where the ranked output from several systems is consolidated into a single ranking. A comprehensive discussion of meta-search data fusion techniques is provided by Montague in [151].
2.5.1.1 Linear combination of scores

The simplest method for combining evidence is with a linear combination of document scores. A linear combination of scores is referred to as combSUM in distributed information retrieval research [83]. In a linear combination the total score S for a document D and query Q, using document scoring functions F_1..F_N, is:

S(D, Q) = F_1(D, Q) + ... + F_N(D, Q)   (2.28)

For a linear combination of scores to be effective, scores need to be normalised to a common scale, and exhibit compatible distributions. As the forms of evidence considered in this thesis display different distributions, a simple linear combination of scores may not be effective. In-degree and PageRank values are distributed according to a power law [35, 159]. By contrast, Okapi BM25 scores are not distributed according to a power law. Examples of two Okapi BM25 distributions, for the top 1000 documents retrieved for 100 queries used in experiments in Chapter 7, are included in Appendix F.

2.5.1.2 Re-ranking

Another method for combining document rankings “post hoc” is to re-rank documents above some cutoff using another form of document evidence [178]. The re-ranking cutoffs can be tuned using a training set.

Re-ranking based combinations have the advantage of not requiring a full understanding of the distribution of scores underlying each type of evidence, as only the ordering of lists is considered. However, this type of re-ranking may be insensitive to the magnitude of difference between scores.^36 A further disadvantage of this method is that it is relatively expensive to re-rank long result lists.

2.5.1.3 Meta-search fusion techniques

Further methods proposed for the fusion of meta-search results include:

• combMNZ: document scores are normalised and summed, and the sum is multiplied by the number of runs that assigned the document a non-zero score [83].

• combSUM: a linear combination of scores [83] (described above).

• combMAX, combMIN, combMED: In combMAX the maximum score of all runs is considered. In combMIN the minimum score of all runs is considered. In combMED the median score of all runs is considered. These methods have previously been observed to be inferior to combMNZ and combSUM [83]. Further, these types of combinations do not make sense when used with query-independent evidence, as such evidence provides an overall ranking of documents, and needs to be used in conjunction with some form of query-dependent evidence for query processing.

^36 If re-ranking using the relative ranks of documents only, the magnitude of score differences in all forms of evidence is lost. By contrast, if re-ranking based on some score-based measure, only the magnitude of score differences in the evidence used to re-rank documents is lost.
Other techniques include Condorcet fuse, Borda fuse and the reciprocal rank function [151]. Recent empirical evidence suggests that when combining document rankings these methods are inferior to those outlined above [155].

2.5.1.4 Rank aggregation

A further method proposed for combining the ranked results lists of meta-search systems [72] and document rankings [78] is rank aggregation [79]. In rank aggregation the union of several ranked lists is taken, and the lists are merged into a single ranking with the least disturbance to any of the underlying rankings. This may reduce the promotion of documents that perform well in only one (or a small number) of the runs and poorly in the others.

The rank aggregation process can make it difficult to measure and control the contribution of each form of evidence. For this reason, rank aggregation techniques are not considered in this thesis.

2.5.1.5 Using minimum query-independent evidence thresholds

Implementing a threshold involves setting a minimum query-independent score that a document must exceed to be considered by the ranking function. That is, for some threshold τ, if QIE(D) < τ then P(R|D, Q) is estimated to be zero.^37 The use of a static threshold means that some documents may never be retrieved. A more effective technique might exploit query match statistics to dynamically determine the minimum threshold.

The potential benefits of using thresholds are two-fold: they provide an effective method by which to remove uninteresting pages (such as spam or less frequently visited pages), and they improve computational performance (by reducing the number of documents to be scored, see Section 6.2.2).

2.5.2 Revising retrieval models to address combination of evidence

Rather than combining document evidence post hoc, the underlying retrieval models can be modified to include further document evidence. The approaches outlined below combine several forms of document evidence in a single unified retrieval model, through modifications to the full-text ranking algorithms discussed in Section 2.3.1.

^37 This is similar to a rank-based re-ranking of query-independent evidence (described in Section 2.5.1.2) as documents above the cutoff are re-ranked. In comparison, the use of a cutoff does not require a full ranking of query-independent evidence, but means that some documents may never be retrieved.
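To make the score-fusion methods of Section 2.5.1 concrete, the sketch below normalises each run with a simple min-max rescaling and then forms a weighted combSUM combination. The runs and weights are hypothetical, and min-max normalisation is only one possible choice; the distributional caveats noted in Section 2.5.1.1 still apply.

def min_max_normalise(scores):
    """Rescale a {doc: score} run to [0, 1] so runs are on a common scale."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def comb_sum(runs, weights=None):
    """combSUM: an (optionally weighted) linear combination of normalised
    document scores from several runs; missing documents contribute zero."""
    weights = weights or [1.0] * len(runs)
    combined = {}
    for run, w in zip(runs, weights):
        for doc, s in min_max_normalise(run).items():
            combined[doc] = combined.get(doc, 0.0) + w * s
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Hypothetical runs: full-text BM25 scores and a query-independent PageRank score.
fulltext = {"d1": 12.3, "d2": 8.1, "d3": 2.2}
pagerank = {"d1": 0.0004, "d2": 0.0100, "d4": 0.0007}
print(comb_sum([fulltext, pagerank], weights=[1.0, 0.5]))
# d2 ranks first here: it scores well on both runs.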
2.5.2.1 Field-weighted Okapi BM25

Field-weighted Okapi BM25, proposed by Robertson et al. [173], is a modification of Okapi BM25 that combines multiple sources of evidence in a single document ranking function. Conceptually the field weighting model involves the creation of a new composite document that includes evidence from multiple document fields.^38 The importance of fields in the ranking function can be modified by re-weighting their contribution. For example, a two-fold weighting of title compared to document full-text would see the title repeated twice in the composite document.

^38 Document fields are some form of query-dependent document evidence such as document title, full-text or anchor-text.

If used with Okapi BM25, the score and rank fusion techniques outlined in Section 2.5.1 invalidate the non-linear term saturation component and may thereby lessen retrieval effectiveness [173]. The use of such post hoc score combination means that a document matching a single query term over multiple fields may outperform a document that matches several query terms in a single document field.

In Okapi BM25, the score of a document is equal to the sum of the BM25 scores of each of its terms:

S(D, Q) = \sum_{t \in Q} BM25w_{t,D}   (2.29)

The score for each term is calculated using a term weighting function and a measure of term rarity within the corpus (idf):

BM25w_{t,D} = f(tf_{t,D}) × idf_t   (2.30)

The term weighting function consists of term saturation and document length normalisation components:

f(tf_{t,D}) = \frac{tf_{t,D}}{k_1 + tf_{t,D}}, \quad f(tf_{t,D}) = \frac{tf_{t,D}}{\beta}, \quad \text{where } \beta = k_1\left((1 − b) + b\,\frac{dl}{avdl}\right)   (2.31)

where dl is the current document length, and avdl is the average length of a document in the corpus. These components are combined to form:

BM25w_{t,D} = \frac{tf_{t,D}}{k_1\left((1 − b) + b\,\frac{dl}{avdl}\right) + tf_{t,D}} × idf_t   (2.32)

In the Field-weighted Okapi BM25 model, documents are seen to contain fields F_1..F_N, each holding a different form of (query-dependent) document evidence:

F = (F_1, ..., F_N)   (2.33)
and each field is assigned a weight:

w = (w_1, ..., w_N), \quad wtf_{t,D} := \sum_{F=1}^{N} tf_{t,F} × w_F   (2.34)

where w is a vector of field weights, and wtf is the weighted term frequency. The contribution of terms is then:

f_w(wtf_{t,D}) = \frac{wtf_{t,D}}{k_1 + wtf_{t,D}}, \quad f_w(wtf_{t,D}) = \frac{wtf_{t,D}}{\beta}   (2.35)

and the document length is updated to reflect the new composite document length:

wdl := \sum_{f=1}^{N} dl_f × w_f, \quad wavdl := \sum_{f=1}^{N} avdl_f × w_f   (2.36)

The final formulation for Field-weighted Okapi BM25 is then:

BM25FW_{t,D} = \frac{wtf_{t,D}}{k_1\left((1 − b) + b\,\frac{wdl}{wavdl}\right) + wtf_{t,D}} × idf_t   (2.37)

2.5.2.2 Language mixture models

In the same way that document models are combined with collection models in order to smooth the ranking in language models, document models may also be combined with other language models for the same documents [135, 155]. For example, to combine the language models for anchor-text and content document evidence:

P(D|Q) = P(D) \prod_{t \in Q} \left[ (1 − λ − γ)P(t|C) + λP_{anchor}(t|D) + γP_{content}(t|D) \right]   (2.38)

Language mixture models have been used to good effect when combining multiple modalities for multimedia retrieval in video [211]. Indeed, combining multiple modalities for multimedia retrieval is a similar problem to that of combining multiple forms of text-based document evidence.

Kraaij et al. [135] incorporate query-independent evidence into a language mixture model by computing and including prior probabilities of document relevance. Here P(D) is set according to the prior probability that a document will be relevant, given the document and corpus properties. The prior relevance probabilities are estimated by evaluating how a particular feature affects relevance judgements using training data.
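A minimal sketch of the field-weighted scoring of equations (2.34)–(2.37) is given below. All statistics (field weights, term frequencies, field lengths and idf) are hypothetical; summing the returned quantity over the query terms gives S(D, Q) as in equation (2.29).

import math

def bm25f_term_score(tf_per_field, field_weights, doc_len_per_field,
                     avg_len_per_field, idf, k1=1.2, b=0.75):
    """Field-weighted Okapi BM25 contribution of one term in one document:
    term frequencies and lengths are first combined into a weighted composite
    document, which is then scored with the usual BM25 saturation formula."""
    wtf = sum(tf_per_field.get(f, 0) * w for f, w in field_weights.items())
    wdl = sum(doc_len_per_field.get(f, 0) * w for f, w in field_weights.items())
    wavdl = sum(avg_len_per_field[f] * w for f, w in field_weights.items())
    denom = k1 * ((1 - b) + b * wdl / wavdl) + wtf
    return (wtf / denom) * idf if wtf > 0 else 0.0

# Hypothetical statistics for one term in one document with three fields.
weights = {"title": 5.0, "anchor": 3.0, "body": 1.0}
tf = {"title": 1, "anchor": 4, "body": 7}
doc_len = {"title": 6, "anchor": 40, "body": 900}
avg_len = {"title": 7, "anchor": 25, "body": 1100}
idf = math.log(1_000_000 / 500)        # illustrative corpus statistics
print(round(bm25f_term_score(tf, weights, doc_len, avg_len, idf), 3))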
2.6 Evaluation

Search system performance may be measured over many different dimensions, such as economy in the use of computational resources, speed of query processing, or user satisfaction with search results [209]. It is unlikely that a single system will outperform all others on each of these dimensions, and accordingly it is important to understand the tradeoffs involved ([191], pp. 167).

This thesis is primarily concerned with retrieval effectiveness, that is, how well a given system or algorithm can match and retrieve documents that are most useful or relevant to the user's information need [150]. This is difficult to quantify precisely as it involves assigning some measure to the value of information retrieved ([191], pp. 167). Judgements of information value are expensive^39 and difficult to collect in a way that is representative of the needs and judgements of the intended search system users [209].

^39 In that employing judges to rate documents may be a financially expensive operation.

In addition, the effectiveness of a system depends on a number of system components, and identifying those responsible for a particular outcome in an uncontrolled environment can be difficult (typical web search system components are described in Section 2.1).

The use of a test collection is a robust method for the evaluation of retrieval effectiveness and avoids some of the cost involved in performing user studies. A test collection consists of a snapshot of a user task and the document corpus ([191], pp. 168). This encompasses a set of documents, queries, and complete judgements for the documents according to those queries [48, 209]. Test collections allow for standard performance baselines, reproducible results and the potential for collaborative experiments. However, if proper care is not taken, heavily training ranking function parameters using a test collection can lead to over-tuning, particularly when training and testing on the same test collection. In this case, observed performance gains may be unrealistic and may not apply in general. It is therefore important to train algorithms on one test collection, and evaluate the algorithms on another.

2.6.1 Web information needs and search taxonomy

Traditional information retrieval evaluations and early TREC web experiments evaluated retrieval effectiveness according to how well methods fulfilled informational-type search requests (i.e. finding documents that contain relevant text) [48, 176, 191, 205]. An early evaluation of WWW search engines examined their performance on an informational search task and found it to be below that of the then state-of-the-art TREC systems [112]. Recent research suggests, however, that the task evaluated was not typical of WWW search tasks [26, 33, 56, 75, 185]. Broder [33] argues that WWW user information needs are often not of an informational nature and nominates three key WWW-based retrieval tasks:

Navigational: a need to locate a particular page or site given its name. An example of such a query is “CSIRO” where the correct answer would be the CSIRO WWW site home page.

Informational: a need to acquire or learn some information that will be present in one or more web pages. An example of such a query is “Thesis formatting advice” where correct (relevant) answers would contain advice relating to how a thesis should be formatted.
Transactional: a need to perform an activity on the WWW. An example of such a query is “apply for a Californian driver's licence” where the correct answer would be a page from which a user could apply for a Californian driver's licence.

2.6.2 Navigational search tasks

Navigational search, particularly home page finding, is the focus of experiments within this thesis. Navigational search is an important WWW and web search task which has been shown to be inadequately fulfilled using full-text-based ranking methods [56, 185]. Evidence derived from query logs suggests that navigational search makes up a significant proportion of the total WWW search requests [75]. Navigational search also provides an important cornerstone in the support of search-and-browse based interaction. Two prominent navigational search tasks, home page finding and named page finding, are described in more detail below.

2.6.2.1 Home page finding

The home page finding task is: given the name of an entity, retrieve that entity's home page. An example of a home page finding search is when a user wants to visit http://trec.nist.gov and submits the query “Text REtrieval Conference”. The task is similar to Bharat and Mihaila's organisation search [27], where users provided web site naming queries, and Singhal and Kaszkiel's site-finding experiment [185], where queries were taken from an Excite log and judged as home page finding queries [77].

Home page finding queries typically specify entities such as people, companies, departments and products.^40 A searcher who submits an entity name as a query is likely to be pleased to find a home page for that entity at the top of the list of search results, even if they were looking for information. It is in this way that home pages may also provide primary-source information in response to informational and transactional queries [33, 198].

2.6.2.2 Named page finding

The named page finding task can be seen as a superset of the home page finding task, and includes queries naming both non-home page and home page documents [53]. Accordingly, the objective of the named page finding task is to find a particular web page given a page naming query.

^40 For example: ‘Trystan Upstill’, ‘CSIRO’, ‘Computer Science’ or ‘Panoptic’.
2.6.3 Informational search tasks

Two prominent informational search tasks evaluated in previous web-based experiments [111] are: the search for pages relevant to an informational need (evaluated in TREC ad-hoc [119]), and Topic Distillation [53]. Experiments within this thesis consider the Topic Distillation task, but not the traditional ad-hoc informational search task. The ad-hoc informational search task is described in detail in [111].

2.6.3.1 Topic Distillation

The Topic Distillation task asks systems to construct a list of key resources on some broad topic, similar to those compiled by human editors of WWW directories [111]. More precisely, in TREC experiments the task is defined as: given a search topic, find the key resources for that topic [111]. An example of a Topic Distillation query might be “cotton industry” where the information need modelled might be “give me all sites in the corpus about the cotton industry, by listing their home pages” [60]. A good resource is deemed to be an entry point for a site which is “principally devoted to the topic, provides credible information on the topic, and is not part of larger site also principally devoted to the topic” [60].

While Topic Distillation is primarily an informational search task, it is somewhat similar to navigational search tasks. The goal in both cases is to retrieve good entry points to relevant information units [58]. Indeed, experiments within this thesis demonstrate that methods useful in home page finding are also effective for Topic Distillation tasks (in Chapter 9).

2.6.4 Transactional search tasks

To date there has been little direct evaluation of transactional search tasks [111] and at the time of writing there are currently no reusable transactional test collections. While transactional search tasks are not the focus of this thesis, a case study that examines WWW search engine transactional search performance is presented in Chapter 4.

2.6.5 Evaluation strategies / judging relevance

This section describes methods used to collect snapshots of queries and document relevance judgements with which retrieval effectiveness can be evaluated.

2.6.5.1 Human relevance judging

The most accurate method for judging user satisfaction with results retrieved by a search system is to record human judgements. However, care needs to be taken when collecting human judgements that:

• Judges are representative of the general population using the search tool. In particular, if information needs behind given queries are to be modelled (as in [106, 108]), the user demographic responsible for the query should be taken into account in order to estimate the underlying need.
• Relevance judgements are correlated with the retrieval task modelled. This may be difficult as judging instructions for the same query can be interpreted in several ways [58].

Judging informational type queries

The scale of large corpora makes the generation of complete relevance judgements (i.e. judging every document for every query) impossible. In the TREC conference, judgement pools are created, which comprise the union of the top 100 retrieved documents per run submitted. These document pools are judged so that complete top 100 relevance judgements are collected for all runs submitted to TREC.^41 All non-judged documents are assumed to be non-relevant. Therefore when these judgements are used in post hoc experiments, the judgements are incomplete and so relevant documents will likely be marked non-relevant [209]. The judgement pooling process was used when judging runs submitted to the TREC Topic Distillation task.

^41 Although every group is asked to nominate an order of importance for its submitted runs, in case full pooled judgements cannot be completed in time.

These measures have been used when judging informational queries for several decades [48, 119, 207]. Some relevant observations about such judging are:

• Agreement between human relevance judges is less than perfect. Voorhees and Harman [210] reported that 71.7% of judgements made by three assessors for 14 968 documents were unanimous. However, Voorhees [208] later found that while substituting relevance judgements made by different human assessors changed score magnitude, it had a negligible effect on the rank order of systems [119, 208].

• When dealing with pooled relevance judgements, un-judged documents are assumed to be non-relevant. This may result in bias against systems that retrieve documents not typically retrieved by the evaluated systems. Two investigations of this phenomenon have reported that while the non-complete judging of documents may affect individual system scores, it is not likely to affect the ranking of systems [206, 220].

• The order of search results affects relevance judgements [76]. However, in later work it was found that this was not the case when judging fewer than fifteen documents [160].

Judging known item or named item queries

In comparison to informational type queries, the cost of judging named item queries (such as home page finding and named page finding queries) is much lower, and the judging is less contentious.
Named item queries are developed by judges navigating to a page in the collection and generating a query designed to retrieve that page. The judging consists of checking retrieved documents to determine whether they are duplicates of the correct page (which can be performed semi-automatically [114]).

2.6.5.2 Implicit human judgements

Implicit human judgements can be collected by examining how a user navigates through a set of search results [129]. Evaluations based on this data may be attractive for WWW search engines as such data are easy and inexpensive to collect. One way to collect implicit relevance judgements is through monitoring click-through of search results [129].

However, the quality of judging obtained based on this method may depend on how informative the document summaries are, as the summaries must allow the user to make a satisfactory relevance-based “click-through” decision. Also, given the implicit user preference for clicking on the first result retrieved (as it has been most highly recommended by the search system), observed effectiveness scores are likely to be unrealistically high.

If directly comparing algorithms using “click-through”-based evaluation, care must be taken to ensure competing systems are compared meaningfully. Joachims [129] proposed that the ranked output from each search algorithm under consideration be interleaved in the results list, and the order of the algorithms be reversed following each query so as not to preference one algorithm over the other (thereby removing the effect of user bias towards the first correct document).

2.6.5.3 Judgements based on authoritative links

A set of navigational queries can be constructed cheaply by sourcing queries and judgements automatically from human-generated lists of important web resources. The anchor-text of links within such lists can be used as the queries, and the corresponding target documents as the query answers.

Two recent studies use online WWW directories as authoritative sources for sample query-document pairs [8, 55]. An extension proposed by Hawking et al. [114] is the use of site maps found on many web sites as a source of query-document pairs. In all these methods it is important to remove the query/result source from the corpus prior to query processing, as otherwise anchor-text measures will have an unfair advantage.

2.6.6 Evaluation measures

2.6.6.1 Precision and recall

Precision and recall are the standard measures for evaluating information retrieval for informational tasks [14]. Precision is the proportion of retrieved documents that are relevant to a query at a particular rank cutoff, i.e.:
precision(k) = \frac{1}{k} \sum_{1 \le i \le k} r_i   (2.39)

where k is the rank cutoff, R_k is the set of documents from D that are relevant to the query Q at cutoff k, (D_1 ... D_n) is a ranked list of documents returned by the system, and r_i = 1 if D_i ∈ R_k or r_i = 0 otherwise.

Recall is the total proportion of all relevant documents that have been retrieved within a particular cutoff for a query, i.e.:

recall(k) = \frac{1}{|R_Q|} \sum_{1 \le i \le k} r_i   (2.40)

In large-scale test collections, recall cannot be measured as it is too difficult to obtain relevance judgements for all documents (as it is too expensive to judge a very large document pool). Therefore recall is not often considered in web search evaluations.

The measures of precision and recall are intrinsically tied together, as an increase in recall almost always results in a decrease in precision. In fact precision can be explicitly traded off for recall by increasing k; for very large k every document in the corpus is retrieved so perfect recall is assured. Given the expected drop-off in precision when increasing recall it can be informative to plot a graph of precision against recall [14]. Precision-recall graphs allow for a closer examination of the distribution of relevant and irrelevant documents retrieved by the system.

Both precision and recall are unstable at very early cutoffs, and it is therefore more difficult to achieve statistical significance when comparing runs [36]. However, as WWW search users tend to evaluate only the first few answers retrieved [99, 184], precision at early cutoffs may be an important measure for WWW search systems.

Counter-intuitively, rather than precision decreasing when a large collection of documents is searched, empirical evidence suggests that precision is increased [118].^42 This phenomenon was examined in detail by Hawking and Robertson [118] who explained it in terms of signal detection theory.

A further measure of system performance is average precision:

average precision = \frac{1}{|R_Q|} \sum_{1 \le k \le |D|} r_k \times precision(k)   (2.41)

where k ranges over the ranks at which relevant documents are observed. Average precision gives an indication of how many irrelevant documents must be examined before all relevant documents are found. The average precision is 1 if the system retrieves all relevant documents without retrieving any irrelevant documents. Average precision figures are obtained after each new relevant document is observed.

R-precision is a computation of precision at the R-th position in the ranking (i.e. precision(R)), where R is the total number of relevant documents for that query. R-precision is a useful parameter for averaging algorithm behaviour across several queries [14].

^42 These gains were tested at early precision with a cutoff that did not grow with collection size. Also the collection grew homogeneously, such that the content did not degrade during the crawl (as might be observed by crawling more content and thereby retrieving more spam on the WWW).
2.6.6.2 Mean Reciprocal Rank and success rates

Both the Mean Reciprocal Rank and success rate measures give an indication of how many low value results a user would have to skip before reaching the correct answer [110], or the first relevant answer [180].

The Mean Reciprocal Rank (MRR) measure is commonly used when there is only one correct answer. For each query examined, the rank of the first correct document is recorded. The score for that query is then the reciprocal of the rank at which the document was retrieved. If there is only one relevant answer retrieved by the system, then the MRR score corresponds exactly to average precision. The score for a system as a whole is taken by averaging across all queries.

The success rate measure is often used when measuring effectiveness for exact match queries, such as home page finding and named page finding tasks. Success rate is indicated by S@k, where k is the cutoff rank; it indicates the percentage of queries for which the correct answer was retrieved in the top k ranks [56]. The “I'm feeling lucky” button on Google [93] takes a user to the first retrieved result; accordingly S@1 is the rate at which clicking on such a button would take the user to a right answer. The success rate at 5, or S@5, is sometimes measured as it represents how often the correct answer might be visible in the first results page without scrolling (“above the fold”) [184]. S@10 measures how often the correct page is returned within the first 10 results.

These measures may provide important insight as to the utility of a document ranking function. Silverstein et al. observed from a series of WWW logs that 85% of query sessions never proceed past the first page of results [184]. Further, it has recently been demonstrated that more time is spent by users examining results ranked highly, with less attention paid to results beyond rank 5.^43 All results beyond rank 5 were observed to, on average, be examined for 15% of the time that was spent examining the top result. These findings illustrate the importance of precision at high cutoffs and success rates for WWW search systems.

^43 In this experiment 75% of the users reported that Google was their primary search engine. These users' prior experience with Google may be that the top ranked answer is often the correct document and that effectiveness drops off quickly, which could affect these results.

2.6.7 The Text REtrieval Conference

The Text REtrieval Conference (TREC) was established in 1992 by the National Institute of Standards and Technology (NIST) and the Defence Advanced Research Projects Agency (DARPA). The conference was initiated to promote the understanding of Information Retrieval algorithms by allowing research groups to compare effectiveness on common test collections. Voorhees and Harman present a comprehensive history of the TREC conference and the TREC web track development in [111].
As outlined by Hawking et al. [117] the benefits of TREC evaluations include: the potential for reproducible results, the blind testing of relevance judgements, the sharing of these judgements, the potential for collaborative experiments, and the extensive training sets created for TREC.

2.6.7.1 TREC corpora used in this thesis

Several TREC web track corpora are used and evaluated within this thesis – namely the TREC VLC2, WT10g and .GOV TREC corpora. Some experiments also use query sets from the TREC web track of 2001 and 2002 [53], and from the non-interactive web track of 2003 [60]. These query sets include home page finding sets (2001 and 2003), named page finding sets (2002 and 2003) and Topic Distillation sets (2002 and 2003). These query sets and corresponding task descriptions are discussed in Section 2.6.7.2.

• TREC VLC2: a 100GB corpus containing 18.5 million web documents. This corpus is one third of an Internet Archive general WWW crawl gathered in 1997 [119]. The size of this corpus is comparable to the size of Google's index at the time of its launch (around 24 million pages [31]). Current search engines index two orders of magnitude more data [93].

• TREC WT10g: a 10GB corpus containing a 1.7 million document subset of the VLC2 corpus [15]. The corpus was designed to be representative of a small highly connected web crawl. When building the corpus, duplicates, non-English and binary documents were removed.

• TREC .GOV: an 18.5GB corpus containing 1.25 million documents crawled from the US .GOV domain in 2001 [53]. Redirect and duplicate document information is available for this corpus (but not WT10g or VLC2).

There is debate as to whether the TREC web track corpora are representative of larger WWW-based crawls, in particular whether the linkage patterns and density are comparable (and therefore whether methods useful in WWW-based search would be applicable to smaller scale web search) [100]. Recent work by Soboroff [188] has reported that the WT10g and .GOV TREC web corpora do exhibit important characteristics present in the WWW.

2.6.7.2 TREC web track evaluations

TREC 2001 web track

The TREC 2001 web track evaluated two search tasks over the WT10g web corpus (described in Section 2.6.7.1): home page finding and relevance-based (ad-hoc) informational search. The objective of the home page finding task was to find a home page given some query created to name the page (as described in Section 2.6.2.2). The objective of the relevance-based informational search task was to find documents relevant to some topic, given a short summary query.
Experiments in Chapter 7 of this thesis make use of data from the TREC 2001 web track home page finding task. The ad-hoc informational search task is not considered.

For the 2001 home page finding task, 145 queries were created by NIST assessors by navigating to a home page within the WT10g corpus and composing a query designed to locate that home page [110]. A training set of 100 home page finding queries and correct answers, created in the same way, was provided before the TREC evaluation to allow participants to train their systems for home page finding [56]. Systems were compared officially on the basis of the rank of the first answer (the correct home page, or an equivalent duplicate page). Search system performance was compared using the Mean Reciprocal Rank of the first correct answer and success rate (both defined in Section 2.6.6.2).

TREC 2002 web track

The TREC 2002 web track^44 evaluated two search tasks over the .GOV web corpus (described in Section 2.6.7.1): named page finding and Topic Distillation. The objective of the named page finding task was to find a particular web page given a page naming query (as described in Section 2.6.2.2). The objective of the Topic Distillation task was to retrieve entry points to relevant resources rather than relevant documents (as described in Section 2.6.3.1).

^44 The report for the official submissions to the 2002 TREC web track (csiro02–) is included in Appendix D, but the results from these experiments are not discussed further.

This thesis includes experiments that use data from both of these web track tasks. Data from the 2002 TREC Topic Distillation task are used sparingly as the task is considered to be closer to a traditional ad-hoc informational task than to a Topic Distillation task [53, 111].

For the 2002 named page finding task, 150 queries were created by NIST assessors by accessing a random page within the .GOV corpus and then composing a query designed to locate that page [53]. Systems were compared officially on the basis of the rank of the first answer (the correct page, or an equivalent duplicate page), using Mean Reciprocal Rank and success rates (both measures defined in Section 2.6.6.2).

The 2002 Topic Distillation task consisted of 50 queries created by NIST to be representative of broad topics in the .GOV corpus (however, the topics chosen are believed to have not been sufficiently broad [53]). System effectiveness was compared using the precision @ 10 measure.

TREC 2003 web track

The TREC 2003 web track evaluated two further search tasks over the .GOV web corpus: a combined home page / named page finding task, and a revised Topic Distillation task. The objective of the combined task was to evaluate whether systems could fulfil both types of query without prior knowledge of whether queries named home pages or other pages.
pages or other pages. The objective of the Topic Distillation task was to find entry points to relevant resources given a broad query (as described in Section 2.6.3.1). The instructions given to the relevance judges in the 2003 Topic Distillation task differed from those given in 2002. In 2003 the judges were asked to emphasise “home page-ness” more than in the 2002 Topic Distillation task, and broader queries were used to ensure that some sites devoted to the topic existed [60].

The TREC 2003 combined home page / named page task consisted of a total of 300 queries, with an equal mix of home page and named page queries. The query set was selected using the methods previously used for generating the query/result sets for the 2001 home page finding task and the 2002 named page finding task. Systems were compared officially on the basis of the rank of the first retrieved answer, using Mean Reciprocal Rank and success rate measures.

The TREC 2003 Topic Distillation task consisted of 50 queries created by NIST to be representative of broad topics in the .GOV corpus. Judges ensured that queries were “broad” by submitting candidate topics to a search system in order to determine whether there were sufficient matches for the proposed topics. Systems were compared officially on the basis of R-precision as many of the topics did not have 10 correct results (and thus precision @ 10 was not a viable measure). Later work by Soboroff challenged the use of these measures, and demonstrated that precision @ 10 would have been a superior evaluation measure [189].
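The two navigational measures used throughout these evaluations, Mean Reciprocal Rank and the success rate at a cutoff (defined in Section 2.6.6.2), can be illustrated with a short sketch. The Python fragment below is a minimal illustration rather than the evaluation code used for the official runs; it assumes that the 1-based rank of the first correct answer is already known for each query (None when no correct answer was retrieved).

    def mrr_and_success(first_correct_ranks, cutoffs=(1, 5, 10)):
        """first_correct_ranks: one entry per query; the 1-based rank of the
        first correct answer, or None if no correct answer was returned."""
        n = len(first_correct_ranks)
        # Reciprocal rank is 1/rank for answered queries and 0 otherwise.
        mrr = sum(1.0 / r for r in first_correct_ranks if r is not None) / n
        # Success@k: fraction of queries answered at or before rank k.
        success = {
            k: sum(1 for r in first_correct_ranks if r is not None and r <= k) / n
            for k in cutoffs
        }
        return mrr, success

    # Example: three queries answered at ranks 1, 4 and not at all.
    print(mrr_and_success([1, 4, None]))   # MRR = (1 + 0.25 + 0) / 3 ~= 0.417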
Chapter 3

Hyperlink methods – implementation issues

The value of hyperlink evidence may be seriously degraded if the algorithms that exploit it are not well implemented. Hyperlink-based evidence is intrinsically dependent on the accuracy and completeness of the web graph from which it is calculated. This chapter documents and justifies implementation decisions taken during the empirical work and details limitations of the corpora available for use.

3.1 Building the web graph

An ideally accurate web graph would be one where all hyperlinks in the corpus represented the intentions of the document author when the hyperlinks were created. Such accuracy would require hyperlink authors to be consulted during web graph construction to confirm that their hyperlinks were pointing to the web content they intended to link to. In most cases this process would not be feasible. Therefore the discussion of graph accuracy within this chapter relates to how likely it is that the web graph is an accurate representation of web authors’ link intentions. The discussion of web graph completeness refers to the amount of hyperlink evidence directed at documents within the corpus that has been successfully assigned to the target document (and not lost).

To ensure web graph accuracy and completeness:

• Document URLs need to be resolved;
• Duplicate documents may need to be removed;
• Hyperlink redirects may need to be followed;
• Dynamic page content may need to be detected; and
• Links created for reasons other than recommendation may need to be removed.

The following sections discuss, in turn, how each of these requirements has been addressed when building a representation of the web graph.
3.1.1 URL address resolution

Hyperlink targets can be expressed as fully qualified absolute addresses (such as http://cs.anu.edu.au/index.html) or provided as an address relative to the hyperlink source (such as ../index.html from http://cs.anu.edu.au/~Trystan.Upstill/index.html). Whether addressed using relative or absolute URLs, hyperlinks need to be mapped to a single target document either within or external to the corpus. Non-standard address resolution could lead to phantom pages (and subgraphs) being introduced into the web graph. In experiments within this thesis all relative URLs are decoded to their associated absolute URL (if present) following the conventions outlined in RFC 2396 [21] and additional rules detailed in Appendix B. Some examples of address resolution are:

• A relative link to /../foo.html from http://cs.anu.edu.au/ is resolved to http://cs.anu.edu.au/foo.html;

• Links to one of http://cs.anu.edu.au/~Trystan.Upstill/index.html, http://cs.anu.edu.au//////~Trystan.Upstill//, or http://cs.anu.edu.au:80/~Trystan.Upstill/ are resolved to http://cs.anu.edu.au/~Trystan.Upstill/;

• Links to http://cs.anu.edu.au/foo.html#Trystan are resolved to http://cs.anu.edu.au/foo.html;

• Links to panopticsearch.com/ are resolved to http://www.panopticsearch.com/.

3.1.2 Duplicate documents

Duplicate and near-duplicate1 documents are prevalent in most web crawls [34, 78, 123, 183].

1 Near-duplicate documents share the same core content but a small part of the page is changed, such as a generation date or a navigation pane.

In a 30 million page corpus collected by AltaVista [7] from the WWW in 1996, 20% of documents were found to be effective duplicates (either exact duplicates or near duplicates) of other documents within the collection [34]. In a 26 million page WWW crawl collected by Google [93] in 1997, 24% of the documents were observed to be exact duplicates [183]. In a further crawl of 80 million documents from the WWW
in May 1999 [123], 8.5% of all documents downloaded were exact duplicates. In a 2003 crawl of the IBM intranet, over 75% of URLs were effective duplicates [78].

The presence of duplicate pages in a web graph can lead to inconsistent assignment of hyperlink evidence to target documents. For example, if two documents contain duplicate content, other web authors may split hyperlink evidence between the two documents. These duplicate documents should be identified and collapsed down to a single URL. However, if unrelated pages are mistakenly identified as duplicates and collapsed, distortion will be introduced into the web graph and the effectiveness of both hyperlink recommendation and anchor-text evidence may be reduced. For example, if Microsoft and Toyota’s home pages were tagged as duplicates, all link information for Microsoft.com might be re-assigned to Toyota’s home page, leading to http://www.toyota.com possibly being retrieved for the query ‘Microsoft’. Therefore it is important to ensure exact (or very close) duplicate matching when assigning hyperlink recommendation scores and anchor-text evidence to consolidated documents.

Common causes of duplicate documents in the corpus are:

• Host name aliasing. Host name aliasing is a technique used to assign multiple host names to a single IP address. In some cases several host names may serve the same set of documents under each host name. This may result in identical sets of documents being stored for each web server alias [15, 123].

• Symbolic links between files. Symbolic links are often employed to map multiple file names to the same document [123], resulting in the same content being retrieved for several URLs. If there is no consensus amongst web authors as to the correct URL, incoming links may be divided amongst all symbolically linked URLs.

• Web server redirects. In many web server configurations the root of a directory is configured to redirect to a default page (e.g. http://cs.anu.edu.au/ to http://cs.anu.edu.au/index.html). Once again, if there is no consensus amongst web authors as to the correct URL, incoming links may be divided amongst the URLs.

• File path equivalence. On web servers running on case-insensitive operating systems (such as the Microsoft Windows Internet Information Server [148]) the case of characters in the path is ignored and all case variants will map to the same file (so Foo/, foo/ and FoO/ are all equivalent). By contrast, for web servers running on case-sensitive operating systems (such as Apache [10] with default settings on Linux), folder case is meaningful (so Foo/, foo/ and FoO/ may all map to different directories).

• Mirrors. A mirror is a copy of a set of web pages, served with little or no modification on another host [23, 122]. In a crawl of 179 million URLs in 1998, 10% of the URLs were observed to contain mirrored content [23].
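Some of the spurious URL-level variation behind these duplicates (default ports, repeated slashes, fragments, mixed-case host names) can be removed by canonicalising URLs when the web graph is built, following the address resolution conventions of Section 3.1.1; content-level duplicates still require the detection methods discussed below. The sketch that follows is only a rough approximation of the rules actually used (RFC 2396 plus the additional rules in Appendix B), built on Python's standard urllib.parse module.

    import re
    from urllib.parse import urljoin, urlsplit, urlunsplit

    def canonicalise(source_url, href):
        """Resolve href against source_url and normalise the result.
        A rough approximation of the conventions in Section 3.1.1 / Appendix B."""
        absolute = urljoin(source_url, href)              # resolve relative references
        parts = urlsplit(absolute)
        scheme = (parts.scheme or "http").lower()
        host = (parts.hostname or "").lower()
        if parts.port and not (scheme == "http" and parts.port == 80):
            host = "%s:%d" % (host, parts.port)           # keep only non-default ports
        path = re.sub(r"/{2,}", "/", parts.path) or "/"   # collapse duplicate slashes
        # Fragments (#...) never change the target document, so they are discarded.
        return urlunsplit((scheme, host, path, parts.query, ""))

    print(canonicalise("http://cs.anu.edu.au/", "/../foo.html"))
    # -> http://cs.anu.edu.au/foo.html
    print(canonicalise("http://cs.anu.edu.au/",
                       "http://cs.anu.edu.au:80//////~Trystan.Upstill//"))
    # -> http://cs.anu.edu.au/~Trystan.Upstill/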
    52 Hyperlink methods- implementation issues Duplicates created as a result of host name aliasing may be resolved through map- ping domain names down to their canonical domain name (using “canonical name” (CNAME) and “address” (A) requests to a domain name server, as detailed in [123]). This process has several drawbacks, including that some of these virtual hosts may be incorrectly collapsed down to a single server [123]. To accurately detect duplicates the process of domain name collapsing should be performed at the time of crawling [123]. This is because the canonical domain name mappings may have changed prior to du- plicate checking and may incorrectly identify duplicate servers. In experiments within this thesis host name alias information was collected when available. Host name alias information was not available for the (externally collected) VLC2 and WT10g TREC web track collections [15, 62]. Other types of duplicates may be detected using heuristics [24], but page content examination needs to be performed to resolve these duplicates reliably [123]. Scalable document full-text-based duplicate detection can be achieved through the calculation of a signature (typically an MD5 checksum [166]) for each crawled page. However, such checksums may map two nearly identical pages to very different checksum val- ues [34]. Therefore document-based checksums cannot be used to detect near du- plicate documents. Near-duplicate documents can be detected using methods such as Shingling [32, 34], which detects duplicates using random hash functions, and I- Match [47], a more efficient method which uses collection statistics. Full site mirrors may be more easily detected by considering documents not in isolation, but in the context of all documents on a particular host. Bharat et al. [24] investigated several methods for detecting mirrors in the web graph using site heuristics such as network (IP) address, URL structure and host graph connectivity. In corpora built for this thesis, exact duplicates on the same host were detected using MD5 checksums [166] during corpus collection. Duplicate host aliases were also consolidated. Mirror detection and near duplicate detection techniques were not employed due to link graph distortion that may be introduced through false positive duplicate matching. During the construction of the VLC2 test collection no duplicate detection was employed, however for WT10g (a corpus constructed from VLC2) duplicates present on the same web server were detected (using checksums) and eliminated [15]. This was reported to remove around 10% of VLC2 URLs from consideration. Host aliasing was not checked for either collection [15]. During the .GOV corpus crawl duplicate documents were detected and eliminated using MD5 checksums. 3.1.3 Hyperlink redirects Three methods frequently employed by web authors to redirect page requests are: • Using an HTTP redirect configured through the web server [81].2 The redirect information is then transferred from the web server to the client in the HTTP 2 This method is recommended by the W3C for encoding redirects [81].
    §3.1 Building theweb graph 53 response header code. HTTP redirects return a redirection HTTP status code (301 – Moved Permanently or 302 – Moved Temporarily) [81]. In a crawl of 80 million documents in May 1999 [123] 4.5% of all HTTP requests received a redirection response. • Using HTML redirects [164].3 HTML redirects are often accompanied by a tex- tual explanation of the redirect with some arbitrary timeout value for page for- warding. HTML redirects return an “OK” (200) HTTP status code [81]. • Using Javascript [152]. The detection of Javascript redirects requires the crawler (or web page parser4) to have a full Javascript interpreter and run Javascript code to determine the target page. Ensuring hyperlink evidence is assigned to the correct page when dealing with hyperlink redirects is no simple matter. A link pointing to a page containing a redirect can either be left to point at the placeholder page (the page used to direct users to the new document) or re-mapped to the new target page. The web author who created the link is unlikely to have deliberately directed evidence to the placeholder page. By contrast, if the link is re-mapped to the final target, the document may not be representative of the initial document for which the link was created. HTML and Javascript redirect information was logged and stored when building the VLC2 and WT10g test collections. For the .GOV collection all three types of redi- rects were stored and logged. If possible, for experiments within this thesis, redirect information was used to reassign the link to the end of the redirect chain. Due to the complexity of dealing with Javascript redirects, experiments in this thesis do not resolve these redirects. 3.1.4 Dynamic content Unbounded crawling of dynamic content can lead to crawlers being caught in “crawler traps” [123] and the creation of phantom link structures in the web graph. This may lead to “sinks” being introduced into the web graph, and a reduction of the effective- ness of hyperlink analysis techniques. Dynamic content on the WWW is bounded only by the space of all potential URLs on live host names. A study in 1997 estimated that 80% of useful WWW docu- ments are dynamically generated [139]; moreover this has been observed to be a lower bound [165]. During the creation of the VLC2 test collection, dynamic content was crawled when linked-to [15]. For the WT10g corpus all identifiably dynamic documents5 were removed [15]. This meant removing around 20% of the documents present in the 3 This method for encoding redirects is not recommended in the latest HTML specification [164]. 4 The system component that processes web documents and extracts document data prior to indexing. 5 i.e. not having a static URL extension, e.g. a “?” or common dynamic extensions such as “.php”, “.cgi” or “.shtml”.
    54 Hyperlink methods- implementation issues VLC2 corpus. This is surprising given the estimate that 80% of all useful WWW con- tent is dynamic. The large disagreement indicates that either: the crawler used to gather the VLC2 corpus did not effectively crawl dynamic content, or the estimate of dynamic content was incorrect, or static content was crawled first during the Inter- net Archive crawl.6 It is unclear why dynamic content was removed from the WT10g corpus, given that dynamic web content is likely to contain useful information. 3.1.5 Links created for reasons other than recommendation Hyperlink recommendation algorithms assume that links between documents imply some degree of recommendation [157]. Therefore links created for reasons other than recommendation may adversely affect hyperlink recommendation scores [63]. Links are often created for site navigation purposes or for nepotistic reasons [63]. Nepotistic linking is link generation that is the result of some relationship between the source and target, rather than the merit of the target [63, 137]. Kleinberg [132] proposed that all internal site links be removed to lessen the influence of local nepotistic hyperlinks and navigational hyperlinks. This was further refined by Bharat and Henzinger [26] who observed that nepotistic links may exist not only within a single site but between sites as well. To remove these nepotistic links they suggested that all sites be considered as units, and proposed that only a single link between hosts be counted. However, the removal of all internal host link structure may discard useful site information. Amitay et al. [9] studied the relationship between site structure and site content and through an examination of internal and external hyperlink structure were able to distinguish between university sites, online directories, virtual hosting services, and link farms. The link structures in each of these sites were observed to be quite different, indicating that reducing the effects of nepotistic and navigational links according to the type of site may be more effective than simply removing all internal links. Fundamental changes in the use of hyperlinks on the web may also challenge the recommendation assumption by affecting the quality or quantity of mined hyperlink information. For example, the use of web logging tools (blogs) [92] may alter the dynamics of hyperlinks on the WWW. Such pages are often stored together on a single host, are very frequently updated, and the cost of generating a link to other content in a blog is small. As such, the applicability of hyperlink recommendation algorithms in this environment has been challenged [86]. It is also possible that as WWW search engine effectiveness improves, authors are less likely to link to documents that they find useful, as such documents can be easily found using a popular WWW search engine. An analysis how such trends affect hyperlink quality is outside of the scope of this thesis and is left for future work. In experiments in this thesis internal site links are preserved and weighted equally. This is important, as some of the evidence useful in navigational search may be en- coded into internal site structure or nepotistic links, such as links to site home pages 6 The VLC2 collection consists of the first one-third of the documents stored during an all-of-WWW crawl performed by the Internet Archive in 1997.
and entry points. For example, almost all external links to the Australia Post web site [12] are directed to the post-code lookup, with the home page identified by evidence present in the anchor-text of internal links [114]. Also, within some of the collections studied (such as WT10g [15]), inter-server linking is relatively infrequent.

3.2 Extracting hyperlink evidence from WWW search engines

Some of the experiments performed in this thesis rely on hyperlink evidence extracted from WWW search engines via their publicly available interfaces. The WWW search engines used are well engineered and provide effective and robust all-of-WWW search. However, there are disadvantages in using WWW search engines for link information. Such experiments are not reproducible, as search engine algorithms and indexes are not known and may well change over time. Additionally, some of the sourced information is incomplete (such as the top 1000 results lists) or estimated (such as document linkage information7).

7 Sourced using methods outlined in Section 5.1.3.

3.3 Implementing PageRank

PageRank implementations outlined in the literature differ in the ways they deal with dangling links, in the bookmarks used for random jumping, and in the conditions that must be satisfied for convergence [154, 157, 158, 201]. Section 2.4.3.2 gave an overview of the PageRank calculation. The current section outlines the process that has been followed when calculating PageRank values for use in this thesis.

3.3.1 Dangling links

A hyperlink in the web graph that refers to a document outside of the corpus, or links to a document which has no outgoing links, is termed a dangling link [157]. In Page and Brin’s [157] PageRank formulation, dangling links are removed prior to indexing and then re-introduced after the PageRank calculation has converged. The removal of dangling links using this method increases the weight of PageRank distributed through other links on pages that point to dangling links. This is because dangling links are not considered when dividing PageRank amongst document outlinks.

An alternative PageRank calculation sees the random surfer jump with certainty (probability 1, rather than (1 − d)) when they reach a dangling link. This implies that the random surfer jumps to a bookmark when they reach a dead-end [43, 154]. This implementation has desirable stability properties when used with a bookmark set that evenly distributes “jump” PageRank amongst all pages (as described in Section 2.4.3.2).
    56 Hyperlink methods- implementation issues A further PageRank variant sees the random surfer jump back to the page they came from when they reach a dangling link [158]. This variant is problematic as it may lead to rank sinks if a page has many dangling links. This may result in inflated scores for sections of the graph. The PageRanks used for web corpora in this thesis are calculated using the dan- gling link “jump with certainty” method. This method has been shown to have desir- able stability and convergence properties [154]. 3.3.2 Bookmark vectors In experiments within this thesis PageRank values are calculated for two different bookmark vectors (E). The first vector produces a “Democratic” or unbiased Page- Rank in which all pages are a priori considered equal. The second bookmark vector “personalises” [157] the PageRank calculation to favour known authoritative pages. The bookmark vector is created using links from a hand-picked source and is termed “Aristocratic” PageRank. In Democratic PageRank (DPR) every page in the corpus is considered to be a book- mark and therefore every page has a non-zero PageRank. Every link is important and thus in-degree might be expected to be a good predictor of DPR. Because it is easy for web page authors to create links and pages, it is easy to manipulate DPR with link spam. In Aristocratic PageRank (APR) a set of authoritative pages is used as bookmarks to systematically bias scores. In practice the authoritative pages might be taken from a reputable web directory or corpus site-map. For example, for WWW-based corpora, bookmarks might be sourced from a WWW directory service such as Yahoo! [217], Looksmart [144] or the Open Directory [69]. APR may be harder to spam than DPR because newly created pages are not, by default, included in the bookmarks. 3.3.3 PageRank convergence This section presents a small experiment to determine how the performance of Page- Rank is affected by changes to the PageRank d value. These experiments examine re- trieval effectiveness on the WT10gC home page finding test collection, for Optimal re- rankings (described in Section 2.6.7.1) of two query-dependent baselines (document full-text and anchor-text). This collection was provided to participants in TREC 2001 so that they could train systems for home page search (described in Section 2.6.7.2, the test collection is used in experiments in Chapter 7). Figures 3.1 and 3.2 illustrate how the PageRank on the WT10gC collection is af- fected by changes to the d value. Figure 3.3 shows how the choice of d affects conver- gence. In practice the d value is typically chosen to be between 0.8 and 0.9 [16, 157]. Results from these experiments reveal that the performance of PageRank can be remarkably stable even with large changes in the d value. When d was set to 0.02 the performance of the Optimal re-ranking (see Section 7.3) was similar to the per- formance at d = 0.85. Without the introduction of any random noise (at d = 1.0)
the PageRank calculation did not converge. However, the PageRank calculation did converge with only a small amount of weight distributed in random jumping (d = 0.99).

Unless the score is to be directly incorporated in a ranking function, only the relative ordering of pages is important. Haveliwala [102] noted this as a possible PageRank optimisation method, since a final ordering of pages might be achieved before final convergence. Haveliwala observed that the ordering of pages by PageRank values did not change significantly after a few PageRank iterations. When moving from 25 to 100 iterations of the PageRank calculation, on corpora of over 100 000 documents, no significant difference in document ranking order was observed [102]. In experiments in this thesis the PageRank calculation was run until convergence. This allowed for flexibility when combining PageRank values with other ranking components.

Since little improvement in performance was observed when increasing d, the empirical evidence suggests d should be set to a very small value (around 0.10) for corpora of this size, thereby reducing the number of iterations required and minimising computational cost. However, to maintain consistency with previous evaluations, in experiments within this thesis d was set at 0.85, as suggested by Brin and Page [31].

[Figure 3.1 plot omitted: success rate (S@1, S@5, S@10, content baseline) versus the Democratic PageRank d value.]

Figure 3.1: Effect of d value (random jump probability) on success rate for Democratic PageRank calculations for the WT10gC test collection. As d approaches 0 the bookmarks become more influential. As d approaches 1 the calculation approaches “pure” PageRank (i.e. a PageRank calculation with no random jumps). The convergence threshold is set to 0.0001. The WT10gC test collection is described in Section 7.1.3. The PageRank scores are combined with a document full-text (content) baseline ranking using the Optimal re-ranking method described in Section 7.1.4.
[Figure 3.2 plot omitted: success rate (S@1, S@5, S@10, anchor-text baseline) versus the Aristocratic PageRank d value.]

Figure 3.2: Effect of d value (random jump probability) on success rate for Aristocratic PageRank calculations for the WT10gC collection. As d approaches 0 the bookmarks become more influential. As d approaches 1 the calculation approaches “pure” PageRank (i.e. a PageRank calculation with no random jumps). The convergence threshold is set to 0.0001. The WT10gC test collection is described in Section 7.1.3. The PageRank scores are combined with an aggregate anchor-text (anchor) baseline ranking using the Optimal re-ranking method described in Section 7.1.4.

[Figure 3.3 plot omitted: number of iterations to convergence versus d value, for Aristocratic and Democratic PageRank.]

Figure 3.3: Effect of PageRank d value on the rate of Democratic PageRank convergence on WT10g, by number of iterations. PageRank did not converge at d = 1 (no random jumps). The WT10g collection contains 1.7 million documents and is described in Section 2.6.7.1.
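The “jump with certainty” treatment of dangling links (Section 3.3.1), the bookmark vector E (Section 3.3.2) and the convergence threshold used in the figures above can be combined in a compact iterative sketch. The Python fragment below is illustrative only, not the implementation used for the experiments: it assumes the web graph fits in memory as a dictionary of out-links, uses a uniform bookmark vector when none is supplied (Democratic PageRank), and stops when the L1 change between iterations falls below the threshold.

    def pagerank(outlinks, d=0.85, bookmarks=None, tol=1e-4, max_iter=400):
        """outlinks: dict mapping each page to the list of pages it links to.
        bookmarks: optional dict of bookmark weights (Aristocratic PageRank);
        a uniform vector (Democratic PageRank) is used when omitted."""
        pages = list(outlinks)
        n = len(pages)
        if bookmarks:
            total = float(sum(bookmarks.values()))
            e = {p: bookmarks.get(p, 0.0) / total for p in pages}
        else:
            e = {p: 1.0 / n for p in pages}
        pr = {p: 1.0 / n for p in pages}
        for _ in range(max_iter):
            new = {p: 0.0 for p in pages}
            jump = 0.0
            for p in pages:
                targets = [t for t in outlinks[p] if t in new]  # drop links leaving the corpus
                if targets:
                    share = d * pr[p] / len(targets)
                    for t in targets:
                        new[t] += share
                    jump += (1.0 - d) * pr[p]   # usual random jump
                else:
                    jump += pr[p]               # dangling page: jump with certainty
            for p in pages:
                new[p] += jump * e[p]           # jump mass spread via the bookmark vector
            delta = sum(abs(new[p] - pr[p]) for p in pages)
            pr = new
            if delta < tol:
                break
        return pr

    # Tiny example: B is dangling, so its rank mass is spread via the bookmark vector.
    print(pagerank({"A": ["B"], "B": [], "C": ["A"]}))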
3.3.4 PageRank applied to small-to-medium webs

It is sometimes claimed that PageRanks are not useful unless the web graph is very large (tens or hundreds of millions of nodes), but this claim has not been substantiated. PageRanks can be calculated for a web graph of any size, and PageRank scores are therefore usable within any web crawl, including crawls of single organisations (enterprise search) and portals. The Google organisational search appliance incorporates PageRank for crawls of below 150 000 pages [96].

3.4 Expected correlation of hyperlink recommendation measures

As DPR depends to some degree on the number of incoming links a page receives, one might expect DPR to be correlated with in-degree. Ding et al. [68] previously observed that, for this reason, in-degree is a useful first-order approximation to DPR. Moreover, when DPR is calculated with a loose convergence threshold (so that the calculation stops after few iterations) it might be expected to be more highly correlated with in-degree, as little weight is transferred through the graph. Similarly, corpora with large numbers of dangling links might be expected to show a higher correlation between DPR and in-degree. APR is likely to be far less correlated, with many documents potentially having an APR score of zero.8

In this thesis the following correlations between hyperlink recommendation scores are tested:

• Between WWW-based PageRank scores (from Google [93]) and WWW-based in-degree scores (from AllTheWeb [80]), in Section 5.3;

• Between small-to-medium web based scores for DPR, APR and in-degree, in Section 7.6.3; and

• Between small-to-medium web based scores for DPR, APR and in-degree, and WWW-based PageRank scores (from Google), also in Section 7.6.3.

8 For example, if not bookmarked, no APR score will be achieved by pages in the so-termed WWW “Tendrils” (unless linked to by other Tendrils) [35] or by pages in the IN component.
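A hedged sketch of how such correlations can be computed: the coefficient is not specified at this point in the thesis, so Spearman's rank correlation (via SciPy) is used here purely for illustration, and the score dictionaries keyed by URL are assumed inputs rather than data structures defined in the text.

    from scipy.stats import spearmanr

    def rank_correlation(scores_a, scores_b):
        """Spearman rank correlation over the URLs scored by both measures."""
        urls = sorted(set(scores_a) & set(scores_b))
        a = [scores_a[u] for u in urls]
        b = [scores_b[u] for u in urls]
        rho, p_value = spearmanr(a, b)
        return rho, p_value

    # Hypothetical usage: dpr, apr and indegree are dicts of scores keyed by URL.
    # print(rank_correlation(dpr, indegree))
    # print(rank_correlation(dpr, apr))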
    Chapter 4 Web searchand site searchability The potential for hyperlink evidence to improve retrieval effectiveness may depend upon the authorship of web sites. Some web documents are authored in such a way as to prevent or discourage direct linking. This may make it difficult for web search engines to retrieve a document. Decisions made when authoring documents can af- fect the evidence collected by web crawlers, and thereby reduce or increase the quality of end-user search results. This chapter investigates how the “searchability” of sites influences retrieval effectiveness. It also provides a whole-of-WWW context for the experimental work based on smaller web corpora. In particular, the case study pre- sented in this chapter illustrates: • The importance of web hyperlink evidence in the ranking algorithms of promi- nent WWW search engines, by investigating whether well-linked content is more likely to be retrieved by WWW search engines. • Difficulties faced by prominent WWW search engines when resolving author intentions through web graph processing (and how successfully resolving issues discussed in Chapter 3 can improve retrieval effectiveness). • The effect of web authorship conventions on the likelihood of hyperlink evi- dence generation. This case study examines both search effectiveness and searchability with respect to a particular type of commodity which is frequently sold over the WWW; namely books. The task examined is that of finding web pages from which a book may be purchased, specifying the book’s title as the query. Online book buying is a type of transactional search task (see Section 2.6.1) [33]. Transactional search is an important web search task [124], as it drives e-commerce and directs first-time buyers to particular merchant web sites. However, despite the prevalence of such tasks, information retrieval research has largely ignored product purchasing and transactional search tasks [106, 116]. The product purchasing search task is characterised by multiple correct answers. For example, in this case study, any of the investigated bookstores may provide the service which has been requested (the purchase of a particular book). Employing a task with many equivalent answers spread over a number of sites makes it possible to 61
    62 Web searchand site searchability study which sites are most easily searchable by search engines, and conversely which search engines provide the best coverage. The study of searchability is primarily concerned with site crawlability and the prevalence of link information, that is, how easy it is to retrieve pages and link struc- ture from a web site. A site with good searchability is one whose pages can be matched and ranked well by search engines, and whose URLs are simple and consistent, such that other authors may be more likely to create hyperlinks to them. Previous studies of transactional search have evaluated the service finding ability of TREC search systems [106] and WWW search engines [116] on a set of apparently transactional queries extracted from natural language WWW logs. The aim of these studies was to compare search engines on early precision; no information was avail- able (or needed) about what resources could be found, and there was no comparison of the searchability of online vendor sites. 4.1 Method The initial step in the experiment was the selection of candidate books, the titles of which formed the query set. This query set was then submitted to four popular WWW search engines and ranked lists of documents were retrieved. Links to candidate bookstore-based pages within these ranked lists were extracted, examined, and (if required) downloaded. These documents were then examined to determine whether they fulfilled the requirements of transactional search; that is, that the document did not only match the book specified in the query, but also allowed for the book to be purchased directly. The search engines were then compared based on how often they successfully retrieved a transactional document for the requested books. Similarly, a comparison of bookstores was performed based on how often each bookstore had a transactional document for the desired book retrieved by any of the WWW search en- gines. To examine the effect that hyperlink and document coverage had on bookstore and search engine retrieval effectiveness, further site-based information was extracted from the search engines and analysed. The experimental data used were collected in the fourth quarter of 2002. The following sections describe these steps in greater detail. The methods used for extracting evidence relating to search engine coverage of bookstore URLs and hyper- links are described in Appendix C. 4.1.1 Query selection The book query set was identified from the New York Times bestseller lists, by sourc- ing the titles of the best-sellers for September 2002 [153]. A total of 206 distinct book titles were retrieved from nine categories.1 Book titles were listed on the best-seller lists fully capitalised, and were later converted to lower case and revised such that all terms, apart from join terms (such as “the”, “and” and “or”), began with a capital. 1 The book/category breakdown is included in Appendix C.
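A hedged sketch of the title normalisation just described: best-seller titles were listed fully capitalised and were converted so that every term except join terms begins with a capital. The exact join-term list is not given in the text; the short list below ("the", "and", "or" plus a few similar words), the capitalisation of the first word, and the example title are assumptions for illustration only.

    JOIN_TERMS = {"the", "and", "or", "of", "a", "an", "in", "to"}  # assumed list

    def normalise_title(raw_title):
        """Convert an ALL-CAPS best-seller title to the query form used in the study."""
        words = raw_title.lower().split()
        out = []
        for i, w in enumerate(words):
            # The first word is always capitalised; join terms elsewhere stay lower case.
            out.append(w if (i > 0 and w in JOIN_TERMS) else w.capitalize())
        return " ".join(out)

    # Illustrative title only, not necessarily one from the 2002 query set.
    print(normalise_title("THE LORD OF THE RINGS"))   # -> "The Lord of the Rings"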
    §4.1 Method 63 Thequery selection presumes that users search for a book using its exact title. In fact users may seek books using author names, topics, or even partial and/or incorrect titles. However, it is likely that a significant proportion of book searches are made using the exact listed title. The ISBNs of correct books were identified for page judging. Both hardcover and paperback editions of books were considered to be correct answers.2 A list of the queries and the ISBNs of the books judged as correct answers is available in Appen- dix C. 4.1.2 Search engine selection Four search engines were identified from the Nielsen/NetRatings Search Engine Rat- ings for September 2002 (as outlined in Table 4.1). At the time, the four engines pro- vided the core search services for the four most popular search services, and for eight of the top ten search services [194]. S.Engine Abbr. Used by [195] Rank AltaVista [7] AV AltaVista 8 AllTheWeb [80] FA AllTheWeb - Google [93] GO Google 3 AOL 4 Netscape 9 Yahoo 1 MSN Search [149] MS MSN Search 2 (based on Looksmart 10 Inktomi) HotBot - Overture 6 Table 4.1: Search engine properties. The column labelled “Abbr.” contains abbreviations used in the study. “Used by” indicates search services that used the search engine. “Rank” indicates the search services position in the Nielsen/NetRatings Search Engine Ratings of Sep- tember 2002 [194]. 4.1.3 Bookstore selection The bookstore set was derived from the Google DMOZ “Shopping > Publications > Books > General” [94] and Yahoo! “Business and Economy > Shopping and Services > Books > Booksellers” [216] directories. Bookstores were considered if they sold the top bestseller in at least three of the nine categories. The process of bookstore candidate identification was performed manually using internal search engines to search for both the title and the author of each book (both title and author were used to uniquely 2 Large print and audio editions were deemed to be incorrect answers.
    64 Web searchand site searchability Bookstore Core URL De. Dy. URL Cat. 1BookStreet 1bookstreet.com N Y ISBN 9 A1Books a1books.com N Y ISBN 9 AllDirect alldirect.com N Y ISBN 9 Amazon amazon.com N P ISBN 9 Americana Books americanabooks.com N Y - 7 Arthurs Books arthursbooks.com N Y ISBN 4 Barnes and Noble barnesandnoble.com N Y ISBN 9 BookWorks bookworksaptos.com Y* Y ISBN 9 BookSite booksite.com Y+ Y ISBN 9 Changing Hands changinghands.com Y* Y ISBN 9 ecampus ecampus.com N Y ISBN 9 NetstoreUSA netstoreusa.com N P ISBN 9 Planet Gold planetgold.com N Y - 9 TextbookX.com textbookx.com N Y ISBN 9 Sam Weller’s Books samwellers.com N Y ISBN 9 All Textbooks 4 Less alltextbooks4less.com N Y ISBN 9 The Book Shop bookshopmorris.com Y* Y ISBN 9 Cornwall Discount Books cornwalldiscountbooks.com N Y - 8 A Lot of Books alotofbooks.com N Y - 3 HearthFire Books hearthfirebooks.com Y* Y ISBN 9 Walmart walmart.com N Y - 9 Wordsworth.com wordsworth.com N Y ISBN 9 Powells powells.com N Y - 9 BiggerBooks.com biggerbooks.com N Y ISBN 9 That Bookstore in Blytheville tbib.com Y* Y ISBN 9 StrandBooks.com strandbooks.com N Y ISBN 7 St. Marks Bookshop stmarksbookshop.com Y* Y ISBN 9 RJ Julia rjjulia.com N Y ISBN 9 Paulina Springs Book Company paulinasprings.com Y* Y ISBN 9 Books-A-Million booksamillion.com N Y ISBN 9 CodysBooks.com codysbooks.com Y* Y ISBN 9 The Concord Bookshop concordbookshop.com Y* Y ISBN 9 Dartmouth Bookshop dartbook.com Y* Y ISBN 9 GoodEnough Books goodenoughbooks.com Y* Y ISBN 9 MediaPlay.com mediaplay.com N Y - 9 Table 4.2: Bookstores included in the evaluation. This table reports whether the bookstore contained ISBNs in its internal URLs (“URL”), whether the sites were generated through a series of dynamic scripts (“Dy.”), whether they were a derivative of another site (“De.”) and how many of the nine book categories they matched (“Cat.”). A “*” next to the “De.” column indicates that the site was a booksense.com derivative, while a “+” indicates that the bookstore was a booksite.com derivative. A “P” in the “Dyn” column indicates that the site was dynamic but did not “look” dynamic (it did not have a “?” with parameters following the URL).
    §4.2 Comparing bookstores65 identify books). Bookstores were only judged on the categories for which they stocked (or listed) the bestseller. The justification for this approach was that there may be some specialised (e.g. fiction only) bookstores that should be included in the study, but not considered for all book categories. A full listing of all 35 eligible bookstores and their salient properties is presented in Table 4.2. 4.1.4 Submitting queries and collecting results The queries were made up of book titles submitted to search engines as phrases (i.e. inside double quotes or marked as phrases in advanced searches). The exact query syntax submitted to each search engine is reported in Appendix C. The top 1000 results for each query from each search engine were retrieved and recorded. 4.1.5 Judging The candidate documents were required to fulfil two criteria in order to be considered as a correct answer; 1) the page must have been for the book whose title is given as the query, and 2) the retrieved page must have been transactional in nature. A transactional page was considered to be a bookstore page from which a user could buy a book. Browse pages (documents that list multiple books, for example, a list of books in a particular category, by series or by author) or bookstore search results were not judged as correct results. For many bookstores the correct answers were observed to have the hardcover or paperback ISBN in the URL (in many cases there were many correct duplicate URLs which were all observed to contain the ISBN). To cut down on manual judging for these bookstores, automatic judging was performed based on the presence or absence of the ISBN in the URL. For other bookstores the unique product identifiers for each book were manually collected and recorded, and URLs checked for their presence. 4.2 Comparing bookstores The book finding success rates were measured at several cutoffs (S@1, S@5, S@10, S@100 and S@1000). Table 4.3 contains the results for this experiment. The following observations may be made: • Of the 35 bookstores evaluated, only 14 returned any correct answers within the top 1000 results of any of the search engines. • Only four bookstores contributed answers within the top ten results in any search engine: Amazon, Barnes and Noble, Booksite and Walmart • Amazon was the most searchable bookstore in the evaluation, achieving the high- est success rates. • Only Amazon had correct results returned by every search engine.
    66 Web searchand site searchability S@1000 break. Host Bookstore S@1 / S@5 / S@10 / S@100 / S@1000 (AV:FA:GO:MS) Res. Amazon 0.124 / 0.325 / 0.402 / 0.492 / 0.584 104:83:162:132 3903 Barnes and Noble 0.028 / 0.096 / 0.140 / 0.225 / 0.316 0:87:170:3 3603 Walmart 0.010 / 0.030 / 0.045 / 0.070 / 0.075 2:0:0:60 277 BookSite 0.000 / 0.004 / 0.005 / 0.013 / 0.013 0:0:0:11 52 ecampus 0.000 / 0.000 / 0.000 / 0.005 / 0.012 0:7:0:3 290 AllDirect 0.000 / 0.000 / 0.000 / 0.002 / 0.005 0:4:0:0 52 NetstoreUSA 0.000 / 0.000 / 0.000 / 0.001 / 0.010 0:8:0:0 261 Sam Weller’s Books 0.000 / 0.000 / 0.000 / 0.001 / 0.006 0:5:0:0 22 Books-A-Million 0.000 / 0.000 / 0.000 / 0.000 / 0.008 0:4:0:3 775 1BookStreet 0.000 / 0.000 / 0.000 / 0.000 / 0.006 0:5:0:0 17 Wordsworth.com 0.000 / 0.000 / 0.000 / 0.000 / 0.004 1:0:1:1 92 TextbookX.com 0.000 / 0.000 / 0.000 / 0.000 / 0.002 0:2:0:0 22 CodysBooks.com 0.000 / 0.000 / 0.000 / 0.000 / 0.002 0:2:0:0 78 Arthurs Books 0.000 / 0.000 / 0.000 / 0.000 / 0.003 0:1:0:0 3 Powells Bookstore 0.000 / 0.000 / 0.000 / 0.000 / 0.000 0:0:0:0 1031 Table 4.3: Bookstore comparison. This table includes all bookstores which had at least one success at 1000 (S@1000) in a search engine. Powells is included in the table for comparison due to the high number of results retrieved by the search engines from Powells’ host name. The “S@1000 break.” column shows the number of correct books retrieved from each bookstore within the top 1000 search results for each search engine. The “Host Res.” column reports the number of pages found for each bookstore’s host name by all search engines.
• Barnes and Noble performed well on Google (GO) and AllTheWeb (FA).

• Walmart performed well on MSN Search (MS).

• The only search engine which returned results for many of the smaller bookstores was AllTheWeb (FA).

4.3 Comparing search engines

Search engine effectiveness was also compared: the results are presented in Table 4.4 and Table 4.5. From the data in these tables the following observations were made:

• AltaVista’s (AV) performance was inferior to that of both Google (GO) and MSN Search (MS) at all cutoffs. AltaVista demonstrated around half the precision of MSN Search.

• AllTheWeb (FA) trailed well behind all other search engines, but provided a large number of correct answers between the 100th and 1000th positions (its success rate jumps from 0.18 to 0.52). The precision for AllTheWeb was low.

• Google (GO) trailed MSN Search at S@1, but exceeded MSN Search’s performance from S@10 onwards. Google returned more correct answers in its top 5, 10 and 100 results than MSN Search.

• MSN Search (MS) produced the strongest results at S@1 and S@5, but when the cutoffs were extended, retrieval effectiveness decreased dramatically.

Search engine    S@1     S@5     S@10    S@100   S@1000
AV               0.14    0.39    0.45    0.50    0.52
FA               0.00    0.02    0.05    0.18    0.52
GO               0.15    0.56    0.67*   0.83*   0.89*
MS               0.36*   0.57*   0.65    0.72    0.73

Table 4.4: Search engine success rates. The best result at each cutoff is marked with an asterisk.

4.3.1 Search engine bookstore coverage

The search engine bookstore coverage was measured by sourcing counts from WWW search engines for the number of URLs indexed per bookstore (site document coverage), and the number of hyperlinks that were directed at each bookstore (site hyperlink coverage).
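The automatic judging step of Section 4.1.5, together with the success-rate tallies reported in Table 4.4, can be sketched as follows. The fragment is illustrative rather than the scripts actually used: it assumes each engine's run is a dict mapping a query to its ranked list of result URLs and that the correct ISBNs per query are known, so a result is judged correct when one of those ISBNs appears in its URL (bookstores that required manual judging are ignored here).

    def first_correct_rank(result_urls, correct_isbns):
        """1-based rank of the first URL containing a correct ISBN, else None."""
        for rank, url in enumerate(result_urls, start=1):
            if any(isbn in url for isbn in correct_isbns):
                return rank
        return None

    def success_rates(run, isbns_per_query, cutoffs=(1, 5, 10, 100, 1000)):
        """run: query -> ranked list of URLs; isbns_per_query: query -> set of ISBNs."""
        ranks = [first_correct_rank(run[q], isbns_per_query[q]) for q in run]
        n = len(ranks)
        return {k: sum(1 for r in ranks if r is not None and r <= k) / n
                for k in cutoffs}

    # Hypothetical usage for one engine:
    # print(success_rates(google_run, isbns_per_query))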
    68 Web searchand site searchability Search Precision Engine @1 @5 @10 @ 100 AV 0.14 0.08 0.05 0.01 FA 0.00 0.00 0.01 0.00 GO 0.15 0.20 0.15 0.03 MS 0.36 0.13 0.08 0.01 Table 4.5: Search engine precision. Note that precision at 1 is equivalent to the success rate at 1. The precision at cutoffs greater than 100 is less than 1/100 in all cases. The best result for each measure is highlighted. Site document coverage The transactional pages for some bookstores may not have been returned because they have never been crawled by a search engine. Table 4.6 lists the number of pages from each bookstore reported to be contained within each search engines’ index. From these results it was observed that: • Amazon had a consistently large search engine coverage – around three million documents on three-out-of-four search engines. AllTheWeb covered an order of magnitude less Amazon-based documents than did any of the other search engines. However, AllTheWeb crawled more pages for Amazon than it did any other bookstore. This may indicate that AllTheWeb incorrectly eliminated many of Amazon’s pages as duplicates, applied more stringent limits on crawling dy- namic content, or its coverage was estimated in a different way compared to the other search engines. • The coverage of Barnes and Noble varied widely across engines. While the MSN Search coverage of Barnes and Noble was small, it appeared to contain many product pages, with three correct answers retrieved. Only 500 Barnes and Noble pages were covered by AltaVista. Over a million pages were covered by Google. • A large number of Walmart pages were covered by MSN Search, whereas AllTheWeb and Google covered a relatively small number of pages. This may indicate that MSN handled dynamic pages in a different manner to the other search engines, or that there was some special relationship between MSN Search and Walmart. • AllTheWeb did not have large coverage of any one bookstore (their maximum crawl of a bookstore was 360 000 pages). Instead they tended to have a larger breadth of results, with larger crawls of lesser known bookstores. As many bookstores served content through dynamic pages, this may further indicate that AllTheWeb applied more stringent limits on dynamic content.
    §4.3 Comparing searchengines 69 Bookstore AV FA GO MS TOTAL amazon.com 3 675 723 358 376 3 620 000 2 838 819 10 492 918 barnesandnoble.com 521 192 792 1 240 000 2822 1 436 135 walmart.com 89 243 1076 10 500 916 162 1 016 981 netstoreusa.com 1171 315 002 93 000 42 052 451 225 powells.com 39 397 111 977 65 900 6204 223 478 textbookx.com 18 23 157 38 600 150 61 925 alldirect.com 24 26 278 7 27 26 336 ecampus.com 300 7763 2010 240 10 313 planetgold.com 18 8361 774 18 9171 booksamillion.com 22 5860 54 865 6801 cornwalldiscountbooks.com 1 5423 2 1 5427 wordsworth.com 735 228 2290 1271 4524 booksite.com 93 169 1190 290 1742 codysbooks.com 74 1308 238 57 1677 arthursbooks.com 7 1221 8 384 1620 samwellers.com 7 278 5 8 298 tbib.com 1 2701 3 0 2705 stmarksbookshop.com 1 2414 4 0 2419 1bookstreet.com 5 1009 779 172 1965 a1books.com 15 1311 29 173 1528 biggerbooks.com 0 1 395 1 397 americanabooks.com 3 309 15 14 341 alltextbooks4less.com 3 208 22 31 264 dartbook.com 31 74 17 1 123 mediaplay.com 19 0 32 7 58 paulinasprings.com 1 40 1 0 42 rjjulia.com 7 27 3 4 41 concordbookshop.com 5 2 3 0 10 goodenoughbooks.com 1 4 2 0 7 bookworksaptos.com 1 3 2 0 6 alotofbooks.com 1 1 2 1 5 bookshopmorris.com 1 2 2 0 5 changinghands.com 1 2 2 0 5 hearthfirebooks.com 1 2 2 0 5 Total 3 809 257 1 087 000 5 081 161 3 810 094 13 787 512 Table 4.6: Search engine document coverage. Note that the totals in the right-hand side col- umn may contain duplicate links (this occurs when the same URL is found by different search engines). These values were collected using methods outlined in Appendix C. The column la- belled “AV” contains data from AltaVista, “FA” contains data from AllTheWeb, “GO” contains data from Google, and “MS” contains data from MSN Search.
    70 Web searchand site searchability • AltaVista had large coverage only of Amazon, Walmart and Powells. It seems unlikely that book results could be found in their small (sub 1000 page) crawls of other bookstores. The searchability of all three bookstores was improved by having simple URL structures. • Powells had large coverage (with three-out-of-four search engines indexing over 40 000 pages), but did not have any product pages returned in the top 1000 results for these search engines. This may indicate that hyperlink evidence di- rected at the Powells bookstore was either not present, not directed at book buy- ing pages, or was resolved incorrectly by the WWW search engines. Hyperlink graph completeness Only two of the evaluated search engines supported domain name hyperlink counts: AltaVista and AllTheWeb. Domain name hyperlink counts retrieve the number of links to an entire domain name rather than just to a single page. This information was used to determine the hyperlink coverage of an entire bookstore. Table 4.7 contains the results for this study. Some observations are that: • AllTheWeb discovered a large number of links to Amazon, but did not crawl documents from Amazon as comprehensively as other search engines. • Powells bookstore had a large number of incoming links, but still performed poorly. This further indicates that incoming links may not have been success- fully resolved by the WWW search engines (due to anomalies in the search en- gine representations of Powells’ document set or link graph), or that links were not directed to transactional pages. • AllTheWeb discovered more links to diverse hosts than AltaVista. This could be attributed to the fact that AllTheWeb performed a deeper crawl of lesser sites and encountered a larger number of internal links. 4.4 Findings This section discusses the bookstore findings. It includes an analysis of the URL and hyperlink coverage, of bookstore ranking performance, and finally of the relative re- trieval effectiveness of the evaluated search engines. 4.4.1 Bookstore searchability: coverage The results in Tables 4.6 and 4.7 reveal that the top three bookstores by URL coverage were also the top three bookstores by success rate. The bookstore coverage appears to have had a significant impact on how often books from the bookstore were retrieved early in the document ranking. Amazon achieved high coverage in the indexes of all evaluated search engines.
    §4.4 Findings 71 BookstoreAV FA TOTAL amazon.com 12 408 441 25 955 858 38 364 299 powells.com 5 197 526 316 989 5 514 515 textbookx.com 3 456 068 28 453 3 484 521 barnesandnoble.com 234 137 784 088 1 018 225 walmart.com 14 783 267 008 281 791 booksite.com 4927 113 729 118 656 booksamillion.com 34 137 79 351 113 488 ecampus.com 2170 102 047 104 217 netstoreusa.com 10 548 91 867 102 415 1bookstreet.com 25 229 50 064 75 293 wordsworth.com 2750 21 694 24 444 a1books.com 4545 16 270 20 815 codysbooks.com 1062 9512 10 574 alldirect.com 614 6508 7122 arthursbooks.com 109 1700 1809 samwellers.com 106 208 314 americanabooks.com 114 2163 2277 alltextbooks4less.com 52 945 997 rjjulia.com 174 337 511 concordbookshop.com 185 118 303 planetgold.com 31 200 231 dartbook.com 95 117 212 changinghands.com 68 96 164 bookshopmorris.com 46 99 145 cornwalldiscountbooks.com 12 123 135 alotofbooks.com 11 108 119 stmarksbookshop.com 42 62 104 hearthfirebooks.com 8 73 81 tbib.com 29 40 69 bookworksaptos.com 15 53 68 paulinasprings.com 31 37 68 biggerbooks.com 2 63 65 goodenoughbooks.com 12 15 27 Total 21 463 332 28 160 545 49 623 877 Table 4.7: Search engine link coverage. The column labelled “AV” contains data from Al- taVista and “FA” contains data from AllTheWeb. Note that because of overlap between AV and FA the totals in the right-hand column may contain several links to the same URL.
    72 Web searchand site searchability It is important for a bookstore to have deep crawls indexed in as many search engines as possible. Three potential reasons why bookstores included in this study were not crawled deeply may be offered: 1. Despite many incoming links to the bookstore domain, few pages were crawled. This may have been because the crawler was trapped when building the book- stores’ link graph and only crawled a few books many times over. Alternatively, book pages could have been identified as near-duplicates and eliminated from the document index. 2. The bookstores did not receive sufficient links directly to product pages from external sites (i.e. most links were directed to the bookstore home page). 3. The search engines appeared to label bookstores as containing uninteresting dynamically generated content. The WWW search engines may not consider apparent dynamically generated content due to concerns about polluting their representation of the web graph (see Section 3.1.2). Some dynamic content was observed to be in the form of parameterised URLs (with question marks) gener- ated by a single script. Given the poor performance of bookstores which gener- ated content using a single script, it appears that WWW search engine crawlers might have either simply ignored some of these documents (according to some URLs-to-crawl rule, for example, stripping all URL parameters), or have been unable to retrieve any meaningful information from them. Many bookstores that have a high link-count were unable to achieve wide URL coverage. This is most apparent on Powells, which has a large number of incoming links, but less indexed pages than other well linked bookstores. Further investigation uncovered that Powells encodes book ISBN codes as a query to a .cgi script. This is in contrast to the Amazon method, where ISBN codes are encoded in the URL and not as parameters. The site which managed to best convert incoming links to crawled pages was Net- storeUSA. In contrast to all other evaluated bookstores, NetstoreUSA had more pages indexed by the search engines than incoming links. NetstoreUSA improved its search- ability by using static-looking documents organised in simple hierarchies of shtml pages. To encourage a deep crawl that will cover all site content it is necessary for web authors to ensure they have both internal and external links directly to their hierar- chically deep, but important, content. This increases the chance that a WWW search engine will encounter a link to the page, and adds valuable hyperlink evidence. To encourage user linking it is important to use meaningful and consistent URL strings. While one can envisage a web developer linking to a URL which has the form foo. com/ISBN/ it may be less likely that they link directly to foo.com/prod/prod. asp?prod=9283&source=09834. There is also a higher likelihood that such a link would be discarded during the crawl or the creation of the web graph. Deep linking may be encouraged further through the use of incentive or partnership programs. If
    §4.4 Findings 73 sucha program is in place, it is important to ensure partners are able to point directly to products and that all partners point to the same consistent URL for each product (e.g. Amazon provides an incentive program so that web authors link directly to their product pages). To ensure database generated content is not rejected by WWW search engines, it is important that the content is provided through individual, static looking URLs. Duplicate pages should also be removed from the site. However, if duplicate pages are to be retained, it is important that web authors know what URL they should link to, and that crawls of duplicate pages be minimised (potentially through the use of page crawl exclusion measures in “robots.txt” files [133]). 4.4.2 Bookstore searchability: matching/ranking performance Transactional documents for the requested books were most frequently matched (and retrieved) from the Amazon and Barnes and Noble bookstores. Many of the documents retrieved by WWW search engines from other bookstores were observed to be browse and search pages, and not transactional documents. The Powells bookstore is a case in point. Despite having many links, reasonable coverage in search engine indexes and having results matched frequently, Powells transactional pages were never returned. This may indicate poor page full-text content, poor site organisation and/or a lack of encouragement to link directly to products (as their referral program appears to be processed through their front page). These identified problems could also be alleviated somewhat by employing robot exclusion directives to inform crawlers to ignore search and browse pages, and index only product pages (through the use of “robots.txt” files [133], as outlined above). 4.4.3 Search engine retrieval effectiveness The best book finding search engines were Google and MSN Search and the most successful bookstore was Amazon. MSN Search provided the most correct answers at the first rank. However, Google provided more correct answers in the top five positions, potentially giving users more book buying options. In order to maximise the book finding ability of a WWW search engine empirical findings indicate that deep crawls of dynamic content needs to be performed. All of the examined bookstores bury product pages deep within their URL directory tree (generally as leaf nodes). While AllTheWeb appeared to index a much larger selection of bookstores, they appeared to not crawl as much of the Amazon bookstore as other search engines. Given that the majority of correct hits for all search engines came from the Amazon bookstore, this could be one of the main reasons for the observed low effectiveness of AllTheWeb on this task. Some WWW search engines appear to favour certain bookstores over others. For example Google and AllTheWeb have large indexes of Barnes and Noble while the oth- ers do not. A further example of this is the good performance of the Walmart book- store in MSN Search. The results suggests that MSN Search may have access to extra
information for Walmart that is not available to the other search engines.

For WWW search engines to provide good coverage of popular bookstores it is necessary for them to crawl dynamic URLs, even when there are many pages generated from a single script with different parameters. On the Walmart and Powells bookstores, all product pages are created from a single script, with the book's ISBN as a parameter. Also, as many slightly different URLs frequently contain information about exactly the same ISBN, it may be necessary to perform advanced URL equivalence or content (duplicate) detection. This is the case with duplicate product pages on the Amazon bookstore, as the same document is retrieved no matter what referral identifier is included in the URL. Without effective duplicate detection and consolidation of duplicate documents in the hyperlink graph, the effectiveness of link evidence will be decreased.

4.5 Discussion

The coverage results from leading WWW search engines indicate that all of the evaluated engines dealt with web graph anomalies in a different manner (some more effectively than others). The most effective search engines retrieved book-buying pages from dynamic sites for which they had crawled between 0.9 and 3.6 million documents. This demonstrates the importance of using robust methods when sourcing and building the web graph (such as those outlined in Chapter 3) for effective retrieval.

From a web site author's point of view, the design of a web site directly affects how well search engines can crawl, match and rank its pages. For this reason, searchability should be an important concern in site design. Observations from this case study indicate that there are large discrepancies in the relative searchability of bookselling web sites. Many of the bookstore sites incorporated dynamic URLs that may be difficult for some WWW search engines to crawl, and unattractive targets for web authors to direct hyperlinks to. Many bookstore sites were also marred by duplicate content and confusing link graphs. Of the 35 evaluated bookstores, 24 did not appear in the top 1000 results in any of the evaluated search engines for any of the evaluated books.

These results illustrate the importance of a combined approach to improving transactional search. To improve effectiveness WWW search engines should endeavour to discover more product pages, by performing deep crawls of provider sites and of dynamic pages (especially those that are linked to directly). It is equally important for bookstores to build a suitable site structure that allows search engines to perform thorough crawls. To improve searchability, bookstores should use short non-changing URLs (like NetstoreUSA) and encourage deep linking directly to their product pages (like Amazon).

It is submitted that these findings are likely to hold for other WWW search tasks. The amount of link evidence available for a bookstore, as observed in the link coverage study, proved to be particularly important for achieving high rankings in some search engines (such as Google [93]). The apparent heavy use of web evidence in
the document ranking algorithms of WWW search engines provides further support for the investigations of web evidence within this thesis.
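The "advanced URL equivalence or content (duplicate) detection" called for in the discussion above can be sketched with a simple shingle-and-Jaccard check. This is only an illustration of one well-known technique, not the method used by any of the evaluated engines; the shingle length, similarity threshold and example pages are illustrative assumptions.

    # Minimal sketch of shingle-based near-duplicate detection for crawled pages.
    # The shingle length (4 words) and the 0.9 Jaccard threshold are illustrative
    # assumptions, not values used by any engine evaluated in this chapter.
    def shingles(text, k=4):
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 1.0

    def near_duplicates(pages, threshold=0.9):
        """pages: dict mapping URL -> page text. Returns pairs judged near-duplicate."""
        sigs = {url: shingles(text) for url, text in pages.items()}
        urls = sorted(sigs)
        return [(u, v) for i, u in enumerate(urls) for v in urls[i + 1:]
                if jaccard(sigs[u], sigs[v]) >= threshold]

    if __name__ == "__main__":
        pages = {
            "http://store.example/book.asp?isbn=0201896834&ref=a":
                "The Art of Computer Programming volume one by Donald Knuth in stock",
            "http://store.example/book.asp?isbn=0201896834&ref=b":
                "The Art of Computer Programming volume one by Donald Knuth in stock",
            "http://store.example/contact":
                "Contact our customer service team by phone or email",
        }
        print(near_duplicates(pages))

Consolidating the detected duplicates to a single canonical URL before the link graph is built is what allows the link evidence for a product to accumulate on one document rather than being split across referral variants.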
Chapter 5

Analysis of hyperlink recommendation evidence

It is commonly stated that hyperlink recommendation measures help modern WWW search engines rank "important, high quality" pages ahead of relevant, but less valuable pages, and to reject "spam" [97]. However, what exactly constitutes an "important" or "high quality" page remains unclear [8]. Google has previously been shown to perform well on a home page finding task [116] and the PageRank hyperlink recommendation algorithm may be a factor in this success.

This chapter presents an analysis of the potential for hyperlink recommendation evidence to improve retrieval effectiveness in navigational search tasks, and to favour documents that possess some "real-world quality" or "importance". The analysis considers PageRank and in-degree scores extracted from leading WWW search engines. These scores are tested for bias and their usefulness is compared over corpora of home page, non-home page and spam page documents. The hyperlink recommendation scores are tested to determine the weight assigned to the home pages of companies that exhibit "real-world" measures of quality. The measures of "real-world" quality investigated include whether favoured companies are highly profitable or well-known. Less beneficial biases are also tested to examine whether hyperlink recommendation scores favour companies based on their base industry or location.

5.1 Method

An analysis of score biases requires a set of candidate documents, the hyperlink recommendation scores for those documents, and, in order to test for bias, attributes by which the candidate documents may be distinguished. In this experiment three sets of candidate pages are identified from data relating to publicly listed companies and links to known spam content. These form useful sets for analysis for reasons outlined in the following sections. Hyperlink recommendation scores are sourced for each of these pages using WWW search engines and tools. The attributes used to test recommendation score bias are gathered from listed company information and publicly available company attributes. The data used in this experiment were extracted during September 2003.
The following subsections detail methods used to amass the data for this experiment. This includes a description of how candidate pages were selected, how the salient company properties (used when evaluating bias) were sourced, and the methods used to extract hyperlink recommendation scores for each document.

5.1.1 Sourcing candidate pages

The home page set includes the home pages of public companies listed on the three largest US stock exchanges: the New York Stock Exchange (NYSE), NASDAQ and the American Stock Exchange (AMEX) (a total of 8329 companies were retrieved). The home pages of publicly listed companies form a useful corpus as there is publicly available information relating to company popularity, revenue, and other properties, such as which industry the company belongs to. Furthermore, publicly listed companies are plausible targets for home page finding queries.

Company information was obtained from the stock exchange web sites, and included the official company name, symbol and description. Then, using the company information service at http://quote.fool.com/, 5370 unique company home page URLs were identified. These URLs were almost always the root page of a host (e.g. http://hostname.com/) without any file path (only fourteen URLs had some path). These are considered to be the company home pages, even though in some cases the root page is a Flash animation or another form of redirect. The company information service also provided an industry for each stock, e.g. "Real Estate".

For comparison with these home pages, two further sets of pages were collected: a non-home page set and a spam page set. Non-home pages were collected by sorting company home pages by PageRank (extracted using methods outlined in the next section) and selecting twenty home pages at a uniform interval. From these home pages crawls of up to 100 pages were commenced (restricted to the company domain). The overall PageRank distribution for the pages in the twenty crawls is shown in Figure 5.1. The spam page set was collected by sourcing 399 links pointing to a search engine optimiser company (using Google's link: operator). The spam pages were largely content-free, having been created to direct traffic and PageRank towards the search engine optimiser's customers. After sourcing in-degrees, all pages with an in-degree of zero were eliminated leaving 280 pages for consideration.

5.1.2 Company attributes

The set of company home pages was grouped into subsets according to their memberships and attributes, such as the Fortune 500 list [82] and the Wired 40 list of companies judged to be best prepared for the new economy [147]. The goal was to observe how well PageRank and in-degree could predict inclusion in such lists.

Salient company properties were collected from the following web resources:

• The company information service at http://quote.fool.com provided company industry and location information.
[Figure 5.1: Combined PageRank distribution for the non-home page document set (x-axis: PageRank 0–10; y-axis: number of pages crawled). The non-home page document set was constructed by crawling up to 100 pages from a selection of company webs. The observed PageRank distribution is not a power law distribution as might be expected in PageRank distributions (see Section 2.4). These pages are more representative of the general WWW page population than the home page only set. The zero PageRanks are most likely caused by pages not present in Google crawls, or through lost redirects, or through small PageRanks being rounded to 0.]

• The Fortune magazine provided the list of Fortune 500 largest companies (by revenue) and Fortune Most Admired companies. Fortune 500 companies are those with the highest revenue, based on publicly available data, listed by Fortune Magazine (http://www.fortune.com/). The Fortune Most Admired company list is generated through peer review by Fortune Magazine.

• The Business Week magazine Top 100 Global Brands was sourced from http://www.businessweek.com/magazine/content/03 31/b3844020 mz046.htm. This lists the most valuable brands from around the world, based on publicly available marketing and financial data.

• The Wired 40 list of technology-ready companies was taken from Wired Magazine and is available online at http://www.wired.com/wired/archive/11.07/40main.html. The list contains the companies that Wired Magazine believe are best prepared for the new economy.

In all cases the 2003 editions of the lists were used.

5.1.3 Extracting hyperlink recommendation scores

For each URL, PageRanks and in-degrees were extracted from the search engines Google [93] and AllTheWeb [80].

Unfortunately there is no way for researchers external to Google to access PageRanks used in Google document ranking. The only publicly available PageRank values are provided in the Google toolbar [98] and through the Google directory [95]. When
a page is visited, the Toolbar lists its PageRank on a scale of 0 to 10, indicating "the importance Google assigns to a page".1 When a directory category is viewed, the pages are listed in descending PageRank order with a PageRank indicator next to each page, to "tell you at a glance whether other people on the web consider a page to be a high-quality site worth checking out".2 With PageRank provided directly in these ways, it can be analysed as a direct indicator of quality, without needing to know whether or how it is used in Google ranking. The PageRank from the Google Toolbar is interesting as toolbar users may use it directly as a measure of document quality, and the quality of this measure is unknown. Further, as it is sometimes claimed that PageRank behaves differently on a large-scale web graph, it may allow for some insight into properties of WWW-based PageRank (to accompany results presented in Chapter 7).

PageRanks were extracted from the Microsoft Internet Explorer Google Toolbar [98] by visiting pages and noting the interaction between the Toolbar and Google servers. To ensure consistency a single Google network (IP) address was used to gather Toolbar data.3 When the requested URL resulted in a redirect, the PageRank was retrieved for the final destination page (types of redirects are discussed in Section 3.1.3). During the extraction process it was noted that PageRank values had been heavily transformed. Actual PageRanks are power law distributed, so low PageRank values should be represented far more frequently than higher values. By contrast, the Toolbar reports values in the range of 0 to 10, with all values frequently reported (see Figure 5.1). It is likely that one reason for this transformation is to provide a more meaningful measure of page quality to toolbar users. Without such a transformation most documents would achieve a Toolbar PageRank value of 0.

Several problems were faced when obtaining in-degree values. These could only be reliably extracted for site home pages. Problems that have been identified in methods used by WWW search engines to estimate linkage include:

1. counting pages which simply mention a URL rather than linking to it,

2. not anchoring the link match, so that the count for http://www.apple.com includes pages with http://www.apple.com.au and http://www.apple.com/quicktime/, and

3. under-reporting the in-degree, for example by systematically ignoring links from pages with PageRanks less than four.4

Three methods for accessing in-degree estimates for a URL were evaluated (estimates are reported in Table 5.1):

1 From: http://toolbar.google.com/button help.html
2 From: http://www.google.com/dirhelp.html.
3 The Google Toolbar sources PageRank scores from one of several servers. During experiments it was noted that the PageRank scores for the same page could differ according to which server was queried. This effect is believed to be caused by out-of-date indexes being used on some servers.
4 This is believed to be the case in Google's link counts, see http://www.webmasterworld.com/forum80/254.htm
                Google link:   contains       PageRank   AllTheWeb
                in-degree      'in-degree'               in-degree
  Min                   0              0            0             0
  Max             857 000      1 250 000           10    14 324 793
  Mean                958           1910          5.3        17 889
  Median               82            112            5           319
  Apple            87 500        237 000           10     2 985 141

Table 5.1: Values extracted from Google [93] and AllTheWeb [80] for 5370 company home pages in September 2003. Listed are range, mean, median and an example value (for Apple Computer, http://www.apple.com/).

• The first method used the Google query link:URL, which reportedly has problem 1.

• The second method used Google to find pages which contained the URL. This solution was suggested by the Google Team,5 but it exhibits problems 1 and 2, and also seems to only return pages which contain the URL in visible text.

• The third method used the AllTheWeb query link:URL -site:URL to retrieve in-degree values. The operator -site:URL was included because the method has problem 2, and adding a -site:URL excludes all intra-site links, and so eliminates many of the non-home page links and a few home page links.

All three types of in-degree estimates were found to be correlated with each other (Pearson r > 0.7).

AllTheWeb in-degrees were chosen for comparison with Google PageRanks to eliminate any potential search engine preference, and to ensure that in-degree sourcing issue 3 did not impact correlations between in-degree and PageRank values. Both search engines had independent crawls of a similar size (AllTheWeb crawled 3.1 billion documents, compared to Google's 3.3 billion).6

Table 5.1 displays some pertinent properties of the extracted values, namely the minimum, maximum, mean and median values of all extracted hyperlink recommendation evidence.

5.2 Hyperlink recommendation bias

This section presents the results of an analysis of potential bias in hyperlink recommendation scores. Biases considered include a preference for home pages, large

5 As discussed in: http://slashdot.org/comments.pl?sid=75934&cid=6779776
6 The collection size was estimated to be 3.3 billion on http://www.google.com at September 2003; as of November 2004 it is estimated to be around 8 billion documents on http://www.google.com.
famous companies, a particular country of origin, or the industry in which the company operates.

5.2.1 Home page preference

Figure 5.2 shows the PageRank distributions for eight of the twenty crawls (distributions for the other twelve crawls are included in Appendix E). The distributions reveal that in almost every case, the company home page has the highest PageRank. In every case at least some pages received lower PageRank than the home page. This is not surprising, as links from one server to another usually target the root page of the target server. In fact targeting deeper pages has even led to lawsuits [192].

5.2.2 Hyperlink recommendation as a page quality recommendation

Having considered intra-site hyperlink recommendation effects, inter-site comparisons are now considered.

5.2.2.1 Large, famous company preference

The Fortune 500 (F500), Fortune Most Admired and Business Week Top 100 Global Brands lists provide good examples of large, famous companies, relative to the general population of companies. Figure 5.3 shows that companies from these lists tended to have higher PageRanks than other companies. However, there are examples of non-F500 companies with PageRank 10 such as http://www.adobe.com. At the other end of the spectrum, the Zanett group http://www.zanett.com has a F500 rank of 363, but a PageRank of 3. This puts them in the bottom 6% of 5370 companies, based on Toolbar advice.

The home pages of Fortune 500 and Most Admired companies receive, on average, one extra PageRank point. Business Week Top Brand companies receive, on average, two extra PageRank points. Similar findings were observed for in-degree. These findings support Google's claim that PageRank indicates importance and quality. In-degree was observed to be an equally good indicator of popularity on all three counts.

5.2.2.2 Country and technology preference

Given the diversity of WWW search users, a preference in hyperlink recommendation evidence for a particular company, industry or geographical location may be undesirable. This section investigates biases towards technically-oriented and US companies.

As shown in Figure 5.4 a bias towards US companies was not observed. However, it should be noted that all companies studied are listed on US stock exchanges. Further, as a smaller regional stock exchange was included (AMEX) there may be a bias towards non-US companies by virtue of comparing large international (globally listed) companies with smaller (regionally listed) US companies. Perhaps if local Australian Stock Exchange (ASX) companies were compared to similarly sized companies from the American Stock Exchange the results would differ. This is left for future work.
[Figure 5.2: Toolbar PageRank distributions within sites (panels: www.microsoft.com, HP PR=10; www.apple.com, HP PR=10; www.qwest.com, HP PR=8; www.captaris.com, HP PR=7; www.credence.com, HP PR=6; www.cummins.com, HP PR=6; www.unitedauto.com, HP PR=5; www.acmeunited.com, HP PR=4; x-axis: PageRank 0–10, y-axis: number of pages crawled). The PageRank advice to users is usually that the home page is the most important or highest quality page, and other pages are less important or of lower quality. The PageRank of the home page of the site is shown as "HP PR=". Distributions for the twelve other companies are provided in Appendix E.]
[Figure 5.3: Bias in hyperlink recommendation evidence towards large, admired and popular companies (left panels: proportion of group by PageRank; right panels: proportion of group by in-degree; groups: F500 vs not F500, Most Admired vs not, Global Brands vs not). Companies in the Fortune 500, Fortune Most Admired and Business Week Top 100 Global Brands lists tend to have higher PageRank. The effect is strongest for companies with well known brands. On the right, similar effects are present in in-degree.]
[Figure 5.4: Bias in hyperlink recommendation evidence towards technology-oriented or US companies (left panels: proportion of group by PageRank; right panels: proportion of group by in-degree; groups: US vs not US, Technology vs not, Wired 40 vs not). A strong PageRank bias towards US companies was not observed. However, companies in the "Internet Services", "Software" and "Computers" industries had higher PageRank, as did those in the Wired 40. The strong bias towards technology companies is most useful if users are interested in technology; however, given the increasing global reach of the WWW, and the increasing ease of access for non-technical users, such biases are helping a smaller and smaller proportion of the WWW user population. On the right are similar plots for in-degree.]
                                                PageRank
  Industry                          Companies   Range   Mean
  Internet Services                        29     3–9   6.66
  Publishing                               58     4–9   6.66
  Airlines                                 25     3–8   6.48
  Office Equipment                          7     5–8   6.43
  Entertainment                            14     4–8   6.36
  Software                                306    3–10   6.35
  Computers                                86    4–10   6.29
  Consumer Electronics                     18     5–8   6.17
  Automobile Manufacturers                  7     4–8   6.14
  Diversified Technology Services          46     4–8   6.02
  ...
  Steel                                    34     3–7   4.68
  Coal                                      6     4–5   4.67
  Clothing & Fabrics                       54     2–7   4.63
  Oil Companies                           132     1–8   4.60
  Pipelines                                25     3–6   4.56
  Banks                                   433     0–8   4.55
  Real Estate                             174     2–7   4.55
  Precious Metals                          38     0–6   4.47
  Marine Transport                         12     3–6   4.42
  Savings & Loans                         146     0–6   4.08

Table 5.2: PageRanks by industry. The "Internet Services" and "Publishing" industries, with 29 and 58 companies respectively, had the highest mean PageRank.
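Table 5.2 summarises the Toolbar PageRank range and mean per industry. A minimal sketch of that aggregation is shown below; the example records and field names are illustrative assumptions, not the thesis data, which was the 5370 company home pages with industries sourced from http://quote.fool.com/.

    # Minimal sketch of the per-industry aggregation behind Table 5.2.
    from collections import defaultdict

    def pagerank_by_industry(companies):
        """companies: iterable of (industry, toolbar_pagerank) pairs."""
        groups = defaultdict(list)
        for industry, pr in companies:
            groups[industry].append(pr)
        summary = {}
        for industry, prs in groups.items():
            summary[industry] = (len(prs), min(prs), max(prs), sum(prs) / len(prs))
        # sort by mean PageRank, highest first, as in Table 5.2
        return sorted(summary.items(), key=lambda kv: kv[1][3], reverse=True)

    if __name__ == "__main__":
        sample = [("Internet Services", 9), ("Internet Services", 5),
                  ("Banks", 4), ("Banks", 6), ("Software", 7)]
        for industry, (n, lo, hi, mean) in pagerank_by_industry(sample):
            print(f"{industry:20s} n={n} range={lo}-{hi} mean={mean:.2f}")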
Two measures of technology bias were investigated: bias towards companies which produce technology and bias towards heavy users of it. First, using company information from http://quote.fool.com/, companies in industries involving computer software, computer hardware, or the Internet were identified. The industry and PageRank breakdown is shown in Table 5.2. Results in Figure 5.4 illustrate a bias towards technology-oriented companies. These companies received an extra PageRank point on average.

The second test of technology bias used the 2003 Wired 40 list of technology-ready companies. This demonstrated an even greater pro-technology bias (Figure 5.4), with companies present in the Wired 40 receiving two extra PageRank points on average.

A strong bias towards technology-oriented companies is useful if users are interested in technology; however, given the increasing global reach of the WWW, and the increasing ease of access for non-technical users, such biases are assisting a smaller and smaller proportion of the WWW user population.

5.3 Correlation between hyperlink recommendation measures

This section presents results from an investigation of the extent of the correlation of advice given by PageRank and in-degree on the WWW. This investigation was conducted over the set of company home pages and the set of known spam pages.

5.3.1 For company home pages

The strong correlation between Toolbar-reported PageRank and the log of in-degree for company home pages is depicted in Figure 5.5. To better understand the differences between in-degree and PageRank, an analysis of "winners" and "losers" from the PageRank calculation was performed. Winners in the PageRank calculation have high PageRanks even though they have low in-degree (the bottom right quadrant in Figure 5.5), whilst losers have high in-degree but receive a low PageRank (top left quadrant). Some anomalies were observed due to errors in in-degree calculations (e.g. www.safeway.com had a PageRank of 6 with in-degree 0). However, these cases were rare and uninteresting, as they appeared to be due to anomalies within the search engines rather than the link graph. Nonetheless, after discounting cases where AllTheWeb scores disagreed with the other two in-degree estimates, there were some extreme cases where in-degree and PageRank were at odds. These cases are shown in Table 5.3.

In some cases the discrepancies shown in Table 5.3 are very large. For example, ESS Technology (http://www.esstech.com) was demoted, achieving only PageRank 3 despite having an in-degree of 22 357. On the other hand, Akamai (http://www.akamai.com) achieved a PageRank of 9 with only 17 359 links. The promotions and demotions of sites relative to their in-degree ranking by PageRank do not appear to indicate any systematic additional preference for higher "real-world quality".
[Figure 5.5: Toolbar PageRank versus in-degree for company home pages (x-axis: PageRank 0–10; y-axis: AllTheWeb in-degree, log scale; the median in-degree at each PageRank value is marked). For 5370 company home pages, Toolbar PageRank and the log of AllTheWeb [80] in-degree have a correlation of 0.767 (Pearson r). This high degree of correlation is achieved despite the relatively large spread of PageRank zero pages. Such pages may have been missed by the Google crawler or indexer, or might have been penalised by Google policy.]

  Stock  URL                                   Industry                   PageRank  In-degree
  AAPL   http://www.apple.com                  Computers                        10    2985141
  YHOO   http://www.yahoo.com                  Internet Services                 9    5620063
  AKAM   http://www.akamai.com                 Internet Services                 9      17359
  EBAY   http://www.ebay.com                   Consumer Services                 8     737792
  BDAL   http://www.bdal.com                   Advanced Medical Supplies         8        199
  GTW    http://www.gateway.com                Computers                         7     170888
  JAGI   http://www.janushotels.com            Lodging                           7         64
  FLWS   http://www.1800flowers.com            Retailers                         6      38254
  KB     http://www.kookminbank.co.kr          Banks                             6          5
  IO     http://www.i-o.com                    Oil Drilling                      5        235
  FFFL   http://www.fidelityfederal.com        Savings & Loans                   5         34
  USNA   http://www.usanahealthsciences.com    Food Products                     4      13353
  RSC    http://www.rextv.com                  Retailers                         4          6
  ESST   http://www.esstech.com                Semiconductors                    3      22347
  CAFE   http://www.selectforce.net            Restaurants                       3          3
  MCBF   http://www.monarchcommunitybank.com   Savings & Loans                   2          6
  WEFC   http://www.wellsfinancialcorp.com     Savings & Loans                   2          1
  PTNR   http://investors.orange.co.il         Wireless Communications           1        176
  HMP    http://www.horizonvascular.com        Medical Supplies                  1          5
  VCLK   http://www.valueclick.com             Advertising                       0      46659

Table 5.3: Extreme cases where PageRank and in-degree disagree. Even after cases where AllTheWeb in-degrees were in disagreement with the two Google in-degrees have been eliminated, large disparities in scores were observed. The promotions and demotions of sites relative to their in-degree ranking do not seem to indicate a more accurate assessment by PageRank.
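Figure 5.5 reports a Pearson correlation of 0.767 between Toolbar PageRank and the log of AllTheWeb in-degree. A minimal sketch of that calculation follows; the sample pairs, the base-10 logarithm and the log(in-degree + 1) handling of zero in-degrees are illustrative assumptions, since the thesis does not state those details.

    # Minimal sketch of the PageRank / log(in-degree) correlation behind Figure 5.5.
    import math

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    def toolbar_vs_indegree(pages):
        """pages: iterable of (toolbar_pagerank, alltheweb_indegree) pairs."""
        prs = [float(pr) for pr, _ in pages]
        logd = [math.log10(deg + 1) for _, deg in pages]  # +1 smooths zero in-degrees
        return pearson(prs, logd)

    if __name__ == "__main__":
        sample = [(10, 2985141), (9, 17359), (6, 38254), (5, 235), (3, 3), (0, 46659)]
        print(round(toolbar_vs_indegree(sample), 3))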
[Figure 5.6: Toolbar PageRank versus in-degree for links to a spam company (x-axis: PageRank 0–10; y-axis: AllTheWeb in-degree, log scale; the median in-degree, ignoring zeroes, is marked for the spam pages and for the company home pages). The 280 spam pages achieve good PageRank without needing massive numbers of in-links. In some cases, they achieve good PageRank with few links. Pages with PageRank 6 had a median in-degree of 1168 for companies and 44 for spam pages.]

5.3.2 For spam pages

One claimed benefit of PageRank over in-degree is that it is less susceptible to link spam [103]. To test this claim the in-degree and PageRank scores for 280 spam pages were compared. The relationship is plotted in Figure 5.6.

If PageRank were spam-resistant one might expect high in-degree spam pages to have low PageRank. Such a case would be placed in the top left quadrant of the scatter plots. However, for the 280 spam pages the effect is minimal, and in some cases the opposite. For example, the median in-degree values for a PageRank score of 6 were 1168 for company home pages and 44 for spam pages. Spam pages tended to achieve a PageRank of 6 seemingly with fewer incoming links than legitimate companies.

It is possible that any pages which did fall in the top left quadrant had already been excluded from Google. However, this still shows that Google cannot rely entirely on PageRank for eliminating spam. This is not surprising when considering the extreme case: a legitimate page such as an academic's home page might have an in-degree of 10, while a search engine optimiser has massive resources to generate link spam from thousands or millions of pages.
5.4 Discussion

5.4.1 Home page bias

The analysis showed that home pages tended to have higher PageRank. Within all evaluated sites the home page usually had the highest or equal highest score. These results lend support to the use of hyperlink recommendation evidence for home page finding tasks. A detailed evaluation of potential gains using hyperlink recommendation measures in home page finding is presented in Chapter 7.

While the home page bias may be useful in web ranking, in the context of the Google Toolbar it could have a potentially confusing effect. For example, from a Toolbar user's point of view it might seem mystifying that the "Apple Computer" home page is rated 10, but its "PowerBook G4 15-inch" page is rated 7. Is the Toolbar implying that the product is less important or of lower quality? Is it useful to give such advice about deeper pages in general? In fact, it may be preferable to display a constant indicator in the Toolbar when navigating within a web site. An investigation of whether WWW users understand hyperlink-recommendation scores reported by the Google Toolbar remains for future work.

5.4.2 Other systematic biases

The experimental results for company home pages show that Toolbar PageRank favours, by an average of two PageRank points:

1. Companies with famous brands (by Business Week Top Brands)

2. Companies considered to be prepared for the new economy (by Wired 40 listing)

Furthermore, PageRank scores are an average of one point higher for:

1. Companies with large revenue (by Fortune 500 membership)

2. Admired companies (by Fortune Most Admired membership)

3. Technology-oriented companies (by industry type)

Similar patterns were observed for in-degree (with correspondingly larger gaps in in-degree values).

The bias towards high-revenue, admired and famous companies can be seen to be consistent with the stated goal of hyperlink recommendation algorithms. The fact that hyperlink measures more strongly recommend sites operated by companies with highly recognised brands suggests that recognition is a key factor. This is intuitively obvious, as a web site can only be linked to by authors who know of its existence. Favouring high-recognition sites in search results or directory listings helps searchers by bringing to bear their existing knowledge.

A list which gives prominence to relevant web sites already known to the searcher may also inspire confidence in the value of the list. Consider the Google Directory
category for Australian health insurance.7 Viewed alphabetically the top two entries are the relatively little known web sites "Ask Ted" and "Australian Health Management Group". Viewed in PageRank order, the top two entries are the arguably better known (in Australia) "Medibank Private" and "MBF Health Insurance". Even if the user does not agree that these are the best results, in some contexts it may be better to list results which the user will immediately recognise.

An important, but less beneficial, side-effect of using hyperlink-recommendation algorithms is the inherent bias towards technology-oriented companies. There are a number of query terms whose common interpretation may be lost through heavy use of hyperlink-recommendation algorithms.8 For example, using Google there are a number of general queries where technology interpretations are ranked higher than their non-technology interpretations: "opera", "album", "java", "Jakarta", "png", "putty", "blackberry", "orange" and "latex". The strong technology bias may be an artefact of the fact that people building web pages are from a largely technology-oriented demographic. Many web authors are technically-oriented and may primarily think of Jakarta as a Java programming project. On the other hand, many WWW users may predominantly think of Jakarta as the capital of Indonesia! As the demographics of WWW users change, returning an obscure technology-related result will become less desirable. This effect highlights the need for recommendation methods which more closely match user expectations. Such methods, which might take into account individual differences, or simply estimate the demographics of typical WWW users, remain for future work. Measures other than link recommendation may be better indicators of quality. Such measures may include whether companies are listed on the stock exchange, present in online directories and/or are highly recommended by peer review.9

The precise effect of these biases on navigational search is difficult to quantify. It may be that the observed bias will be more problematic for informational tasks than for navigational tasks.

5.4.3 PageRank or in-degree?

PageRank and in-degree measures performed equally well when identifying home pages and membership of the Fortune 500, Most Admired and Global Brand lists. In cases where the measures did not agree, such as for those listed in Table 5.3, there is no evidence to demonstrate that PageRank was superior to in-degree.

A high level of correlation was observed between Toolbar PageRank and log in-degree scores, even for a collection of spam pages. Given the extra cost involved in computing PageRank, this correlation raises serious questions about the benefit of using

7 Available at: http://directory.google.com/Top/Regional/Oceania/Australia/Business and Economy/Financial Services/Insurance/Health/
8 It is likely that anchor-text is also biased in this way, although it may affect results less as the bias would be narrower, i.e. only for terms that are commonly used in the anchor-text pointing to a particular page.
9 For example, by using scores from a service such as http://www.alexa.com.
PageRank over in-degree. Subsequent chapters investigate this further, examining whether there is anything to be gained by using PageRank or in-degree in navigational search situations.
Chapter 6

Combining query-independent web evidence with query-dependent evidence

Query-independent measures, such as PageRank and in-degree, provide an overall ranking of corpus documents. Such measures need to be combined with some form of query-dependent evidence for query processing, otherwise the same list of documents would be retrieved for every query. There are many ways in which query-independent and query-dependent evidence can be combined, and few combination methods have been evaluated explicitly for this purpose (see Section 2.5). This chapter presents an analysis of three methods for combining query-independent evidence, in the form of WWW PageRanks, with query-dependent baselines.

6.1 Method

This chapter examines a home page finding task where, given the name of a public company, the ranking algorithm has to retrieve that company's home page from a corpus containing the home pages of publicly listed US companies.

The query and document set used in this experiment were sourced from company data used throughout experiments in the previous chapter. The document corpus consisted of the downloaded full-text content of each company's home page, and the anchor-text of links directed to those home pages. The query set consisted of the official names of all companies. The query and document set were used to build three query-dependent baselines: a full-text-only baseline, an aggregate anchor-text-only baseline, and a baseline using both forms of evidence. The PageRank scores for these pages were extracted from Google. Three methods for combining PageRank and query-dependent evidence were examined: the first used PageRank as a minimum score threshold, and the second and third methods used PageRank to re-rank the query-dependent baseline rankings.
The following sections outline the query and document set, the scoring methods used to generate the query-dependent baselines, how hyperlink recommendation evidence was gathered, and methods for combining query-dependent baselines with query-independent web evidence.

6.1.1 Query and document set

The document corpus consisted of the home pages of the publicly listed companies used in experiments in Chapter 5. The corpus consisted of 5370 home page documents – one for each company on a prominent US stock exchange (NYSE, NASDAQ and AMEX) for which a home page URL was found (see Section 5.1.1).

As little useful anchor-text information was contained in the set of downloaded documents (because companies rarely link to their competitors' home pages), the anchor-text evidence was gathered from the Google WWW search engine [93]. This WWW-based anchor-text evidence was sourced for a 1000 page sample selected at random from the set of company home pages. For each of these pages 100 back-links1 were retrieved using Google's "link:" operator (as described in Appendix C). Each back-link identified by Google was parsed and anchor-text snippets whose target was the company home page were added to the aggregate anchor-text for that page.

The query set consisted of the official names for all 5370 companies, and the correct result was the named company's home page. For example, for the query "MICROSOFT CORP" the correct answer was the document downloaded from http://www.microsoft.com.

The retrieval effectiveness for both the anchor-text and full-text baselines is likely to be higher than would be expected for a complete document corpus. In the full-text baseline the inclusion of only the home pages of candidate companies discounts many pages that may also match company naming queries. In particular, in a more complete document corpus, non-homepage documents on a company's website might achieve higher match scores than that company's home page (such as a company contact information page). The anchor-text baseline is also likely to achieve unrealistically high retrieval effectiveness even given the incomplete aggregate anchor-text evidence examined (only 100 snippets of anchor-text are retrieved per home page). This is because the aggregate anchor-text corpus only contains text that is used to link to one of the evaluated companies, and so will be unlikely to contain much misleading or ill-targeted anchor-text.

6.1.2 Query-dependent baselines

Three query-dependent baselines were evaluated: content, anchor-text and content+anchor-text.

• The content baseline was built by scoring the full-text of the downloaded home pages using Okapi BM25 with untrained parameters (k1 = 2 and b = 0.75) [172] (described in Section 2.3.1.3).

1 A back-link is a document that has a hyperlink directed to the page under consideration.
• The anchor-text baseline was built by scoring aggregate anchor-text documents using Okapi BM25 with the same parameters as used for content (described in Section 2.4.1).

• The content+anchor-text baseline was built by scoring document full-text and aggregate anchor-text concurrently using Field-weighted Okapi BM25 [173] (described in Section 2.5.2.1). The field-weights for document full-text (content) and aggregate anchor-text were set to 1, and k1 and b were set to the same values used in the content and anchor-text baselines [173]. The content+anchor baseline was computed for the set of pages for which anchor-text was retrieved.

6.1.3 Extracting PageRank

Google's PageRank scores were extracted from the Google Microsoft Internet Explorer Toolbar using the method described in Section 5.1.3. These scores were calculated by Google [93] for a 3.3 billion page crawl.2

6.1.4 Combining query-dependent baselines with query-independent web evidence

Many different schemes have been proposed for combining query-independent and query-dependent evidence. Kraaij et al. [135] suggest measuring the query-independent evidence as the probability of document relevance and treating it as a prior in a language model (see Section 2.5.2.2). However, because Okapi BM25 scores are weights rather than probabilities, prior document relevance cannot be directly incorporated into the model. Westerveld et al. [212] also make use of linear combinations of normalised scores, but for this to be useful with PageRank, a non-linear transformation of the scores would almost certainly be needed:3 the distribution of Google's PageRanks is unknown, and those provided via the Toolbar have been observed not to follow a power law (see Section 5.1.3). Savoy and Rasolofo [178] combine query-independent URL length evidence with Okapi BM25 scores by re-ranking the top n documents on the basis of the URL scores (described in Section 2.5.1.2). The benefit of this type of combination is that it does not require knowledge of the underlying data distribution.

The three combination methods examined in this experiment are: retrieving only those documents that exceed a PageRank threshold (see Section 2.5.1.5), using PageRank in a rank-based (quota) re-ranking of query-dependent baselines, and using PageRank in a score-sensitive re-ranking of query-dependent baselines. The re-ranking approaches are variations on those proposed by Savoy and Rasolofo, and are used because they do not require any knowledge of the global distribution of Google's PageRank values [178].

2 The collection size was estimated to be 3.3 billion on http://www.google.com at September 2003; as of November 2004 it is estimated to be around 8 billion documents on http://www.google.com.
3 This is because while most PageRanks are very low a few are orders of magnitude larger, as PageRank values are believed to follow a power law distribution (see Section 2.4).
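The baselines in Section 6.1.2 are scored with Okapi BM25 and a field-weighted variant. The sketch below is a minimal illustration of that style of scoring, not the exact formulation or parameter handling used in the thesis; the idf form, tokenisation and example documents are illustrative assumptions.

    # Minimal sketch of Okapi BM25 and a field-weighted (BM25F-style) variant of
    # the kind used for the content, anchor-text and content+anchor-text baselines.
    import math
    from collections import Counter

    K1, B = 2.0, 0.75          # the untrained parameters quoted in Section 6.1.2

    def idf(term, corpus):
        n = sum(1 for doc in corpus if term in doc)
        return math.log((len(corpus) - n + 0.5) / (n + 0.5) + 1.0)

    def bm25(query, doc, corpus, avgdl):
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if tf[term]:
                denom = tf[term] + K1 * (1 - B + B * len(doc) / avgdl)
                score += idf(term, corpus) * tf[term] * (K1 + 1) / denom
        return score

    def bm25f(query, fields, corpus, weights, avg_lens):
        """fields: {'content': tokens, 'anchor': tokens}; weighted, length-normalised
        term frequencies are summed across fields before saturation."""
        score = 0.0
        for term in query:
            pseudo_tf = 0.0
            for name, tokens in fields.items():
                tf = tokens.count(term)
                if tf:
                    pseudo_tf += weights[name] * tf / (1 - B + B * len(tokens) / avg_lens[name])
            if pseudo_tf:
                score += idf(term, corpus) * pseudo_tf * (K1 + 1) / (pseudo_tf + K1)
        return score

    if __name__ == "__main__":
        corpus = [["microsoft", "corporation", "home"], ["apple", "computer", "store"]]
        doc = corpus[0]
        fields = {"content": doc, "anchor": ["microsoft", "microsoft", "corp"]}
        print(bm25(["microsoft", "corp"], doc, corpus, avgdl=3))
        print(bm25f(["microsoft", "corp"], fields, corpus,
                    weights={"content": 1.0, "anchor": 1.0},
                    avg_lens={"content": 3.0, "anchor": 3.0}))

In the field-weighted variant the field frequencies are combined before saturation, which is what allows a large volume of matching anchor-text to contribute more than a simple sum of per-field scores would.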
The use of a minimum PageRank threshold that pages need to exceed prior to inclusion is equivalent to ranking results by PageRank evidence and then re-ranking above a score-based threshold using query-dependent evidence. The use of a static4 minimum query-independent threshold value means that some pages will never be retrieved, and so could be removed from the corpus. To enable the retrieval of pages that do not exceed the static threshold value, a dynamic threshold function could be used. Such a function could reduce the minimum threshold if some condition is not met (for example if fewer than ten pages are matched). Such a scheme is discussed further in Section 10.3.

The re-ranking experiments explore two important scenarios. In the first, PageRank plays a large role in ranking documents, through a quota-based combination. In the quota-based combination all documents retrieved within the top n ranks in the query-dependent baseline are re-ranked by PageRank. In the second scenario PageRank has a smaller contribution and is used to re-order documents that achieve query-dependent scores within n% of the highest baseline score (per query). This is termed a score-based combination. In both cases, if the re-ranking cutoffs are sufficiently large, then all baseline documents will be re-ranked by PageRank order.

6.2 Results

This section reports the effectiveness of the baselines and the three evaluated combination methods.

6.2.1 Baseline performance

The effectiveness of the three baselines varied considerably:

• The content baseline retrieved the named home page at the first rank for only two out of five queries, and within the first ten results for a little over half the queries (S@1 = 0.42, S@10 = 0.55).

• The anchor baseline performed well, retrieving three out of four companies at the first rank (S@1 = 0.725, S@10 = 0.79).

• The content+anchor baseline also performed well, retrieving three out of four companies at the first rank (S@1 = 0.729, S@10 = 0.82).

The performance of the full-text (content) baseline was poor given the small size of the corpus from which the home pages were retrieved. A small benefit was observed when adding full-text evidence to the anchor-text baseline.

4 A threshold value that does not change between queries.
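Before the detailed results, a minimal sketch of the three combination schemes of Section 6.1.4 may be helpful: a static PageRank threshold, quota-based re-ranking of the top n baseline ranks, and score-based re-ranking of documents within n% of the top baseline score. The data structures and example values are illustrative assumptions, not the code used for the experiments.

    # Minimal sketch of the three combination methods in Section 6.1.4.
    # 'results' is assumed to be a baseline ranking: a list of (doc_id, bm25_score)
    # pairs sorted by descending query-dependent score; 'pagerank' maps doc_id to
    # its query-independent score.

    def apply_threshold(results, pagerank, min_pr=1):
        """Method 1: drop documents that do not exceed a static PageRank threshold."""
        return [(d, s) for d, s in results if pagerank.get(d, 0) > min_pr]

    def quota_rerank(results, pagerank, n=10):
        """Method 2: re-rank the top n baseline ranks purely by PageRank."""
        head = sorted(results[:n], key=lambda ds: pagerank.get(ds[0], 0), reverse=True)
        return head + results[n:]

    def score_rerank(results, pagerank, pct=10.0):
        """Method 3: re-rank by PageRank only those documents whose baseline score
        is within pct% of the top query-dependent score for this query."""
        if not results:
            return results
        cutoff = results[0][1] * (1 - pct / 100.0)
        head = [ds for ds in results if ds[1] >= cutoff]
        tail = results[len(head):]
        head = sorted(head, key=lambda ds: pagerank.get(ds[0], 0), reverse=True)
        return head + tail

    if __name__ == "__main__":
        baseline = [("lycos.com", 98.0), ("tripod.com", 60.0), ("lycos.co.uk", 55.0)]
        pr = {"lycos.com": 8, "tripod.com": 9, "lycos.co.uk": 6}
        print(quota_rerank(baseline, pr, n=2))     # PageRank shuffles the top two
        print(score_rerank(baseline, pr, pct=10))  # nothing within 10% of top score; unchanged

The usage example anticipates the "Lycos" case discussed with Figure 6.4: a quota of two reverses a top result that the query-dependent evidence strongly preferred, whereas the score-based cutoff leaves it alone.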
6.2.2 Using a threshold

[Figure 6.1: The percentage of home pages and other pages that exceed each PageRank value (x-axis: PageRank 0–10; y-axis: percentage of pages that exceed the PageRank value). Implementing a PageRank threshold minimum value of 1 would lead to the inclusion of 99.7% of the home pages, while reducing the number of other pages retrieved by 16.1%.]

Figure 6.1 illustrates the percentage of home pages and non-home pages5 that exceed each PageRank value. Implementing a PageRank threshold value of 1 leads to the inclusion of 99.7% of the home pages, while significantly reducing the number of other pages retrieved (by 16.1%, to 83.9% of pages). The non-home page PageRanks examined here may be somewhat inflated relative to those on the general WWW, as they were retrieved using a (breadth-first) crawl halted after 100 pages. It has been reported that WWW in-links are distributed according to a power law [35]. Thus, assuming the distribution of PageRank is similar to that of in-degree,6 setting a threshold at some small PageRank is likely to eliminate many pages from ranking consideration. In a home page finding system this may provide substantial computational performance gains and little (if any) degradation in home page finding effectiveness.

6.2.3 Re-ranking using PageRank

Results for the quota-based combination are presented in Figure 6.2. Re-ranking by quota severely degrades performance, with a re-ranking of the top two results in the full-text (content) baseline decreasing the percentage of home pages retrieved at the first rank from 42% to 29%.

5 The hyperlink recommendation values extracted for the set of "non-home page" documents, described in Section 5.1.1.
6 The distribution of Google's PageRanks for company home pages was observed not to follow a power-law distribution (Figure 5.1), although the Google PageRanks are likely to have been normalised and transformed for use in the Toolbar. The PageRanks calculated for use in experiments in Chapter 7 do exhibit a power law distribution (see Section 7.6.1).
[Figure 6.2: Quota-based re-ranking (x-axis: re-rank top n results by PageRank; y-axis: Mean Reciprocal Rank (MRR); curves: anchor-text, content, content+anchor). Re-ranking the top n documents in the query-dependent baselines by PageRank. Re-ranking by quota severely degrades performance, with a re-ranking of the top 2 results in the full-text baseline decreasing the percentage of home pages retrieved at the first position from 42% to 29%. Note that the re-ranking of all results by PageRank (at 50) is equivalent to ranking query-matched documents by PageRank.]

[Figure 6.3: Score-based re-ranking (x-axis: re-rank by PageRank URLs that score in the top n%; y-axis: Mean Reciprocal Rank (MRR); curves: anchor-text, content, content+anchor). Re-ranking documents that are within n% of the top query-dependent baseline score. Re-ranking using score produces a much slower decline in performance than re-ranking based on rank only (Figure 6.2). Note that the re-ranking of all results by PageRank (at 100% of score) is equivalent to ranking query-matched documents by PageRank.]
[Figure 6.4: Example of two queries using different re-ranking techniques (x-axis: rank by Okapi BM25 content score, ranks 1–10; y-axis: normalised Okapi BM25 content score; curves: Yahoo, Lycos). For the query "Lycos" the correct answer is located at position one of the full-text (content) baseline. Given that the second match scores far less than the first, a shuffling of the first two results would favour the document with a much smaller query-dependent score. For the second query "Yahoo" the correct answer is located at position two and achieves a comparable score to the first result: in this case a shuffle would improve retrieval effectiveness.]

Results for the score-based combination are presented in Figure 6.3. Compared to the quota-based combination, re-ranking using score produces a much slower decline in performance.

An example illustrating the comparative effectiveness of quota-based and score-based combinations for two queries is presented in Figure 6.4. For the query "Lycos" the correct answer is located at position one of the full-text (content) baseline. The second document in the baseline scores far less than the first. Using a quota-based re-ranking with a cutoff of two, the first two results would be reversed. By comparison, using score-based re-ranking, the cutoff would have to be set to 35% (or larger) of the top score for a reversal by re-ranking. For the second query "Yahoo" the correct answer is located at position two and achieves a comparable score to the first result. In this case re-ranking by PageRank using either a quota- or score-based re-ranking with n = 2 would reverse the ranking (in this case improving retrieval effectiveness).

6.3 Discussion

Results from experiments in this chapter support the use of PageRank and other hyperlink recommendation evidence as a minimum threshold for document retrieval or in a score-based re-ranking of query-dependent evidence. The use of a minimum PageRank threshold in a home page finding task may improve computational performance by eliminating non-home pages from the ranking. Another method by
which computational efficiency could be improved is by ranking documents using aggregate anchor-text evidence only (which also implicitly imposes a threshold of in-degree ≥ 1). An anchor-text index would be much smaller than a full-text index and therefore likely to be more efficient.

Quota-based re-ranking was observed to be inferior to score-based re-ranking. This illustrates the negative effects of not considering relative query-dependent scores when combining baselines with query-independent evidence. Further, this suggests that query-independent evidence should not be relied upon to identify the most relevant documents. Pages that achieve high query-independent scores are likely to be important pages in the corpus (such as the home pages of popular, large or technology-oriented companies, as reported in Chapter 5), but may not necessarily be more relevant (and indeed, in this experiment, might be the "wrong" home pages).

The results from experiments in this chapter also reinforce the previously observed importance of aggregate anchor-text for effective home page finding [56]. The correct home page was retrieved at the first rank in the anchor-text baseline for three out of four queries, compared to being retrieved at the first rank for only two out of five queries in the full-text baseline. While the baseline retrieval effectiveness in this experiment may be unrealistically high, these findings show that there is generally adequate anchor-text evidence, even when using only 100 snippets, to find the home pages of publicly listed companies. Combining the full-text and aggregate anchor-text evidence in a field-weighted combination resulted in a slight improvement in home page finding effectiveness.

The next chapter investigates whether query-independent evidence can be used to improve home page finding effectiveness for small-to-medium web corpora. The experiments include evaluations of the effectiveness of minimum query-independent evidence thresholds, score-based re-ranking of query-dependent baselines by query-independent evidence, and of aggregate anchor-text-only indexes.
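Ranking from an aggregate anchor-text-only index, as suggested above, relies on building one aggregate document per target URL from the anchor-text of its in-links. The following is a minimal sketch of that aggregation using Python's standard-library HTML parser; the example source pages are illustrative assumptions, whereas the thesis sourced back-links via Google's "link:" operator.

    # Minimal sketch of building aggregate anchor-text documents: for each source
    # page, collect the text of <a> elements and append each snippet to the
    # aggregate document of the URL it points at.
    from collections import defaultdict
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class AnchorCollector(HTMLParser):
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.current_target = None
            self.anchors = []                 # list of (target_url, anchor_text)

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.current_target = urljoin(self.base_url, href)

        def handle_data(self, data):
            if self.current_target and data.strip():
                self.anchors.append((self.current_target, data.strip()))

        def handle_endtag(self, tag):
            if tag == "a":
                self.current_target = None

    def aggregate_anchor_text(source_pages):
        """source_pages: iterable of (source_url, html). Returns target -> anchor text."""
        aggregate = defaultdict(list)
        for source_url, html in source_pages:
            parser = AnchorCollector(source_url)
            parser.feed(html)
            for target, text in parser.anchors:
                aggregate[target.rstrip("/")].append(text)
        return {target: " ".join(snippets) for target, snippets in aggregate.items()}

    if __name__ == "__main__":
        pages = [("http://blog.example/links.html",
                  '<p>Buy it at <a href="http://www.amazon.com/">Amazon books</a>.</p>')]
        print(aggregate_anchor_text(pages))

Only the aggregate documents need to be indexed, which is why such an index is much smaller than a full-text index: pages with no in-links contribute nothing, giving the implicit in-degree ≥ 1 threshold noted above.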
Chapter 7

Home page finding using query-independent web evidence

Providing effective home page search is important for both web and WWW search systems (see Section 2.6.2.1). The empirical results reported in Chapter 5 showed hyperlink recommendation evidence to be biased towards home pages. This chapter presents a series of detailed experiments to determine whether this bias can be exploited to improve home page finding performance on small-to-medium sized web corpora. Experiments in this chapter evaluate the effectiveness of hyperlink recommendation evidence and URL length for document full-text and anchor-text baselines on three such corpora. The potential contribution of query-independent evidence to home page finding is evaluated in three ways:

• By measuring the potential for query-independent evidence to exclude non-home pages, through the use of minimum query-independent threshold scores that documents must achieve for retrieval (following from experiments in Chapter 6). The use of thresholds is investigated as a measure by which both the retrieval effectiveness and efficiency of a home page finding system could be improved;

• By gauging the maximum improvements offered through query-independent evidence when combined with query-dependent baselines using some linear combination of scores; and

• By empirically investigating a combination method that could be used to incorporate query-independent evidence in a production web search system, namely a score-based re-ranking of query-dependent baselines by query-independent evidence (following from experiments in Chapter 6).

7.1 Method

The initial step in this experiment was to identify the set of candidate test corpora. The corpora were then crawled (if required) and indexed. Four types of query-independent evidence (in-degree, two PageRank variants and URL-type, described below) were computed during indexing.
Following indexing, the top 1000 documents for each query-dependent baseline were retrieved. Three query-dependent baselines were studied: one based solely on document full-text, one based solely on document aggregate anchor-text, and one consisting of both forms of evidence. The baselines were then combined with query-independent scores using three combination methods. The first method used query-independent evidence as a threshold, such that documents that did not exceed the threshold were not retrieved (shown to be a promising approach in Chapter 6). The second method explored the optimal improvement that could be gained when combining query-independent evidence with query-dependent baselines using a linear combination of scores. The final combination method was a score-based re-ranking of query-dependent baselines by query-independent evidence (also shown to be a promising approach in Chapter 6). The improvements in effectiveness achieved through these combination methods were then measured and compared.

Throughout the experiments the Wilcoxon matched-pairs signed ranks test was performed to determine whether the improvements afforded were significant. This test compares the algorithms according to the (best) ranks achieved by correct answers, rather than the success rate measure. A confidence criterion of 95% (α = 0.05) is used. Success rates (described in Section 2.6.6.2) were used to evaluate retrieval effectiveness. The success rate measure is indicated by S@n where n is the cutoff rank. S@n results were computed for n = 1, 5, 10.

The following sections give a description of the query-independent evidence and query-dependent baselines, outline the test collections used in the experiments and their salient properties, and discuss the methods used to combine query-independent and query-dependent evidence.

7.1.1 Query-independent evidence

Four types of query-independent evidence were considered:

IDG the document's in-degree score (described in Section 2.4.3.1);

DPR the document's Democratic PageRank score (described in Section 3.3.2);

APR the document's Aristocratic PageRank score, using bookmarks from the Yahoo! directory [217] or other web directory listings, which might be available to a production search system (described in Section 3.3.2);

URL the document's URL-type score, through a re-ranking by the UTwente/TNO URL-type [135] (described in Section 2.3.3.2). The URL-types were scored according to Root > Subroot > Directory > File (a sketch of this URL-type classification is given after Table 7.1).

7.1.2 Query-dependent baselines

The relative improvements achieved over three query-dependent baselines were examined. The baselines were:
    §7.1 Method 103 •content baselines built by scoring document full-text using Okapi BM25 with default parameters (k1 = 2 and b = 0.75) (see Section 2.3.1.3) [172]. • anchor-text baselines built using the methods outlined previously (i.e. by record- ing all anchor-text pointing to each document and building a new aggregate document containing all source anchor-text). The aggregate anchor-text docu- ments were scored using Okapi BM25 using the same parameters as content. • content+anchor-text baselines built by using Field-weighted Okapi BM25 [173] to build and score composite documents containing document full-text and ag- gregate anchor-text evidence. The baseline was scored with document full-text and anchor-text field-weights set to 1, and k1 and b as above (see Section 2.5.2.1) [173]. 7.1.3 Test collections Effectiveness improvements were evaluated using five test collections that spanned three small-to-medium sized web corpora. The test corpora used in the evaluation in- cluded a 2001 crawl of a university web (the ANU), and the TREC corpora VLC2 [106] and WT10g [15]. Detailed collection information is reported in Table 7.1 and a further discussion of the TREC collection properties appears in Section 2.6.7. Note that since experiments published in Upstill et al. [201] the link tables have been re-visited and further duplicates and equivalences removed (using methods described in Chapter 3). This has resulted in some non-statistically significant changes in retrieval effective- ness. Test Pages Links Dead Content Anchor No. of Book- Collection Size (million) (million) links queries queries marks (APR) ANU 4.7GB 0.40 6.92 0.646 97/100 99/100 439 WT10gC 10GB 1.69 8.06 0.306 93/100 84/100 25 487 WT10gT 10GB 1.69 8.06 0.306 136/145 119/145 25 487 VLC2P 100GB 18.57 96.37 3.343 95/100 93/100 77 150 VLC2R 100GB 18.57 96.37 3.343 88/100 77/100 77 150 Table 7.1: Test collection information. The experiments were performed for five test collec- tions spanning three small-to-medium sized web corpora. Two sets of queries were submitted over the VLC2 collection - a popular set (VLC2P) and a random set (VLC2R) (see text for expla- nation). The two sets computed for WT10g were the set used by Craswell et al. [56] (WT10gC) and the official queries used in the TREC 2001 home page finding task (WT10gT). The values in the “Content” and “Anchor” queries columns show the number of home pages found by the baseline out of the number of queries submitted (this is equivalent to S@1000, as the top 1000 results for each search are considered).
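To make the content+anchor-text baseline concrete, the sketch below shows one common formulation of field-weighted BM25: per-field term frequencies are length-normalised, combined using the field weights, and then passed through the usual BM25 saturation and idf. It is illustrative only and is not claimed to reproduce the exact Okapi variant of [173] used in these experiments; the structure names (doc_fields, stats, weights) are hypothetical.

    import math

    def bm25f_score(query_terms, doc_fields, stats, weights, k1=2.0, b=0.75):
        """Simplified field-weighted BM25 for a single document.

        doc_fields : {"content": {term: tf}, "anchor": {term: tf}}
        stats      : {"N": corpus size,
                      "df": {term: document frequency},
                      "avg_len": {"content": ..., "anchor": ...},
                      "len": {"content": ..., "anchor": ...}}   # this document's field lengths
        weights    : field weights, e.g. {"content": 1.0, "anchor": 1.0}
        """
        score = 0.0
        for term in query_terms:
            # Weighted, length-normalised pseudo-frequency across fields.
            pseudo_tf = 0.0
            for field, w in weights.items():
                tf = doc_fields.get(field, {}).get(term, 0)
                if tf == 0:
                    continue
                norm = (1.0 - b) + b * stats["len"][field] / stats["avg_len"][field]
                pseudo_tf += w * tf / norm
            if pseudo_tf == 0.0:
                continue
            df = stats["df"].get(term, 0)
            idf = math.log((stats["N"] - df + 0.5) / (df + 0.5) + 1.0)
            score += idf * pseudo_tf / (k1 + pseudo_tf)
        return score

With both field weights set to 1 and k1 = 2, b = 0.75, this matches the parameter settings described above; setting the anchor weight to zero recovers a content-only BM25 scorer of the same general form.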
    104 Home pagefinding using query-independent web evidence Although there are many spam pages on the WWW, little spam was found in the three corpora. Any spam-like effect observed seemed unintentional. For example, the pages of a large bibliographic database all linked to the same page, thereby artificially inflating its in-degree and PageRank. In each run, sets of 100 or more queries were processed over the applicable corpus using the chosen baseline algorithm. The first 1000 results for each were recorded. While all queries have only one correct answer, that answer may have multiple cor- rect URLs, e.g. a host with two aliases. If multiple correct URLs were retrieved the minimum baseline rank was used (i.e. earliest in the ranked list of documents) and had assigned to it the best query-independent score of all the equivalent URLs. This approach introduces a slight bias in favour of the re-ranking algorithms, ensuring that any beneficial effect will be detected. If a correct document did not appear in the top 1000 positions a rank of 1001 was assigned. These experiments investigated two home page finding scenarios: queries for pop- ular and random home pages.1 Popular queries allow the study of which forms of evidence achieve high effectiveness when ranking for queries targeting high profile sites. Random queries allow the study of effective ranking for any home page, even if it is not well known. The ANU web includes a number of official directories of internal sites. These site directories can be used as PageRank bookmarks. This allows for the evaluation of APR in a single-organisation environment. Test home pages were picked randomly from these directories and then queries were generated manually by navigating to a home page, and formulating a query based on the home page name2. Consequently APR might be expected to perform well on this collection. The query set labelled WT10gC [56] was created by randomly selecting pages within the WT10g corpus, navigating to the corresponding home page, and formu- lating a query based on the home page’s name. The WT10gC set was used as training data in the TREC-2001 web track. The query set labelled WT10gT was developed by the NIST assessors for the TREC-2001 web track using the same method. Wester- veld et al. [212] have previously found that the URL-type method improved retrieval performance on the WT10gT collection. Using the method outlined in Section 3.3.2, every Yahoo-listed page in the WT10g collection is bookmarked in the APR calcula- tion. These are lower quality bookmarks than the ANU set as the bookmarks played no part in the selection of either query set. Two sets of queries were evaluated over the VLC2 collection, popular (VLC2P) and random (VLC2R). The popular series was derived from the Yahoo! directory. The ran- dom series was selected using the method described above for WT10g. For the APR calculation every Yahoo-listed page in the collection was bookmarked. As such, the bookmarks were well matched to the VLC2P queries (also from Yahoo!), but less so for the VLC2R set. 1 Note that the labels popular and random were chosen for simplicity and are derived from the method used to choose the target answer, not from the nature of the queries. Information about query volumes is obviously unavailable for the TREC test collections and were not used in the case of ANU. 2 This set was generated by Nick Craswell in 2001.
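The two PageRank variants used here differ only in their teleport ("bookmark") vector, as described in Section 3.3.2. The power-iteration sketch below is a generic personalised PageRank, included purely to illustrate that difference; it is not the implementation used to build the link tables, and names such as out_links and bookmarks are placeholders.

    def pagerank(out_links, teleport, d=0.85, iters=50):
        """Power-iteration PageRank over a dict {page: [linked pages]}.

        A uniform teleport vector gives a 'Democratic' PageRank; concentrating
        the teleport mass on a bookmark set (e.g. directory-listed home pages)
        gives an 'Aristocratic' PageRank.
        """
        pages = list(out_links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iters):
            new = {p: (1.0 - d) * teleport.get(p, 0.0) for p in pages}
            for p, targets in out_links.items():
                if targets:
                    share = d * rank[p] / len(targets)
                    for t in targets:
                        if t in new:          # links pointing outside the graph are ignored
                            new[t] += share
                else:
                    # Dangling pages redistribute their mass via the teleport vector.
                    for t in pages:
                        new[t] += d * rank[p] * teleport.get(t, 0.0)
            rank = new
        return rank

    # Democratic teleport: uniform over all pages.
    # dpr_teleport = {p: 1.0 / len(graph) for p in graph}
    # Aristocratic teleport: uniform over the bookmark set only.
    # apr_teleport = {p: (1.0 / len(bookmarks) if p in bookmarks else 0.0) for p in graph}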
The home page results for the ANU and VLC2P query sets are considered popular because they are derived from directory listings. Directory listings have been chosen by a human editor as important, possibly because they are pages of interest to many people. Such pages also tend to have above-average in-degree; that is, more web page editors have chosen to link to the page, directing web surfers (and search engine crawlers) to it. On all these corpora anchor-text ranking has been shown to improve home page finding effectiveness (relative to full-text-only) [15, 56].

7.1.4 Combining query-dependent baselines with query-independent evidence

Throughout these experiments there is a risk that a poor choice of combining function could lead to a spurious conclusion. The combination-of-evidence experiments in the previous chapter outlined two methods for combining query-independent and query-dependent evidence which may be effective: the use of minimum threshold values and score-based re-ranking. This chapter includes a further combination scheme: an Optimal re-ranking.

The Optimal re-ranking is an unrealistic re-ranking, and is termed "Optimal" to distinguish it from a re-ranking that could be used in a production web search system.3 The Optimal combination experiments gauge the maximum possible improvement available when combining query-independent evidence with query-dependent evidence using a linear combination. This is done by locating the right answer in the baseline (obviously not possible in a practical system) and re-ranking it, together with the documents above it, on the basis of the query-independent score alone (as illustrated in Figure 7.1). This is an unrealistic combination: if this information were known in practice, perfection could easily be achieved by swapping the document at that position with the document at rank one. Indeed, no linear combination or product of query-independent and query-dependent scores (assuming positive coefficients) could improve upon the Optimal combination, because every document ranked above the correct answer scores as well as or better than it on both the query-independent and query-dependent components (see Figure 7.1). In the Optimal experiments a control condition, random, was introduced in which the correct document and all those above it were arbitrarily shuffled. Throughout the re-ranking experiments, if two query-independent scores are equal then the original baseline ordering is preserved.

The following sections report and discuss the results for each combination method. The use of minimum query-independent evidence thresholds is investigated first, followed by re-ranking using the (unrealistic) Optimal combination, and finally re-ranking using the (realistic) score-based re-ranking.

3 The Optimal re-ranking relies on knowledge of the correct answer within the baseline ranking.
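The Optimal combination and its random control can be stated compactly. The sketch below is a hypothetical illustration (the procedure, not the code, comes from the text): each query's baseline is assumed to be a list of document identifiers in rank order, with query-independent scores held in a dictionary.

    import random

    def optimal_rerank(ranked_docs, qi_score, correct_id):
        """Re-rank the correct answer and everything above it by the
        query-independent score alone (the unrealistic 'Optimal' combination)."""
        if correct_id not in ranked_docs:
            return ranked_docs                      # answer not retrieved: nothing to do
        cut = ranked_docs.index(correct_id) + 1     # correct answer plus all docs above it
        head, tail = ranked_docs[:cut], ranked_docs[cut:]
        # A stable sort preserves the baseline order when scores are tied.
        head = sorted(head, key=lambda d: qi_score.get(d, 0.0), reverse=True)
        return head + tail

    def random_control(ranked_docs, correct_id, seed=0):
        """Control condition: arbitrarily shuffle the correct answer and the
        documents above it, leaving the rest of the ranking untouched."""
        if correct_id not in ranked_docs:
            return ranked_docs
        cut = ranked_docs.index(correct_id) + 1
        head, tail = ranked_docs[:cut], ranked_docs[cut:]
        random.Random(seed).shuffle(head)
        return head + tail

The stable sort preserves the baseline ordering when query-independent scores are tied, matching the tie-handling rule stated above.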
[Figure 7.1 shows a baseline ranking of eight documents alongside the top six of those documents resorted by PageRank.]

Figure 7.1: Example of Optimal re-ranking and calculation of the random control success rate. In the baseline, the correct answer is document 6 at rank 6. Re-ranking by PageRank puts it at rank 2. This is optimal because any document ranked more highly must score as well or better on both the baseline and PageRank (i.e. "document 2" scored better on both the baseline and PageRank). In this case, S@5 fails on the baseline and succeeds on re-ranking. However, a random resorting of the top 6 would have succeeded in 5 of 6 cases, so the expected S@5 for the random control is 5/6.

7.2 Minimum threshold experiments

These experiments investigate whether the use of a static minimum threshold requirement for page inclusion can improve retrieval effectiveness and system efficiency. Retrieval effectiveness may be improved through the removal of unimportant documents from the corpus. Additionally, retrieval efficiency may be improved by reducing the number of documents requiring ranking when processing a query.

The evaluation of threshold techniques requires a set of candidate cutoff values. Up to nine cutoffs were generated for each form of evidence, and an attempt was made to pick intervals that would cut the corpus into 10% gaps. Such cutoffs were possible for DPR evidence because its scores spanned many values. Even spacing was not possible for in-degree or URL-type evidence because early cutoffs eliminated many of the pages from consideration. For example, picking an in-degree minimum of 2 removed up to 60% of the ANU corpus, and discounting URL-type "File" URLs removed over 95% of the ANU collection.

An evaluation of the use of minimum thresholds was performed for three of the five test collections, namely ANU, WT10gC and WT10gT.4
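The threshold mechanism itself is straightforward to express. The sketch below is illustrative rather than a description of the Panoptic implementation used for these runs; it assumes per-document query-independent scores are available in a dictionary and shows how a candidate cutoff both shrinks the candidate document set (the "Prop." column in the tables that follow) and filters a query-dependent ranking.

    def apply_min_threshold(corpus_scores, cutoff):
        """Return the set of documents whose query-independent score meets the
        cutoff, plus the proportion of the corpus retained."""
        kept = {doc for doc, score in corpus_scores.items() if score >= cutoff}
        return kept, len(kept) / len(corpus_scores)

    def filter_ranking(ranked_docs, kept):
        """Remove below-threshold documents from a query-dependent ranking;
        the remaining documents keep their baseline order."""
        return [doc for doc in ranked_docs if doc in kept]

    # Example (hypothetical data structures): an in-degree cutoff of 2.
    # indegree = {...}                       # doc id -> in-degree, computed at indexing time
    # kept, prop = apply_min_threshold(indegree, 2)
    # filtered = filter_ranking(baseline_run["some query"], kept)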
    §7.2 Minimum thresholdexperiments 107 Content Anchor Both Type Cut Prop. S@1 S@5 S@10 S. S@1 S@5 S@10 S. S@1 S@5 S@10 S. BASE 100% 0.29 0.50 0.58 0.72 0.96 0.97 0.63 0.81 0.86 IDG 2 51% 0.34 0.57 0.66 *+ 0.72 0.96 0.97 = 0.63 0.81 0.86 *+ IDG 3 45% 0.36 0.58 0.68 *+ 0.73 0.96 0.97 = 0.64 0.82 0.85 *+ IDG 4 37% 0.38 0.60 0.68 *+ 0.73 0.96 0.97 = 0.66 0.82 0.85 *+ IDG 6 33% 0.39 0.61 0.68 *+ 0.72 0.95 0.96 = 0.65 0.81 0.84 = IDG 8 28% 0.40 0.60 0.70 *+ 0.72 0.95 0.96 = 0.65 0.82 0.86 = IDG 10 8% 0.41 0.64 0.69 *+ 0.70 0.91 0.92 = 0.65 0.81 0.85 = IDG 25 2% 0.33 0.42 0.47 *- 0.49 0.62 0.63 *- 0.44 0.55 0.58 *- IDG 50 1% 0.21 0.30 0.36 *- 0.36 0.42 0.42 *- 0.28 0.38 0.39 *- IDG 100 0.5% 0.11 0.19 0.20 *- 0.20 0.24 0.24 *- 0.17 0.22 0.22 *- DPR 5.02 90% 0.30 0.50 0.59 *+ 0.72 0.97 0.98 = 0.63 0.81 0.86 *+ DPR 5.06 80% 0.30 0.50 0.59 *+ 0.72 0.97 0.98 = 0.64 0.81 0.86 *+ DPR 5.10 70% 0.30 0.50 0.59 *+ 0.72 0.97 0.98 = 0.64 0.81 0.86 *+ DPR 5.22 60% 0.31 0.51 0.62 *+ 0.72 0.97 0.98 = 0.64 0.81 0.87 *+ DPR 5.28 55% 0.31 0.52 0.62 *+ 0.72 0.97 0.98 = 0.64 0.81 0.87 *+ DPR 5.61 40% 0.33 0.54 0.63 *+ 0.72 0.97 0.98 = 0.64 0.82 0.87 *+ DPR 6.15 30% 0.34 0.55 0.65 *+ 0.71 0.95 0.97 = 0.65 0.82 0.87 *+ DPR 8.04 20% 0.36 0.57 0.63 *+ 0.64 0.86 0.88 *- 0.61 0.78 0.81 = DPR 14.9 10% 0.35 0.54 0.60 = 0.62 0.78 0.80 *- 0.58 0.74 0.76 = URL >F 5% 0.48 0.64 0.76 *+ 0.73 0.88 0.88 = 0.64 0.79 0.82 = URL >D 2% 0.33 0.48 0.50 *- 0.47 0.55 0.55 *- 0.41 0.53 0.53 *- URL >SR 0.1% 0.17 0.22 0.23 *- 0.25 0.26 0.26 *- 0.21 0.24 0.24 *- Table 7.2: Using query-independent thresholds on the ANU collection. Bold values indi- cate the highest effectiveness achieved for each type of query-independent evidence on each query-dependent baseline. Underlined bold values indicate the highest effectiveness achieved for each query-dependent baseline. The cutoff value is indicated by “Cut”. The percentage amount of the collection that is included within the cutoff is indicated by “Prop.”. “S.” reports whether observed changes are significantly better (“*+”), equivalent (“=”) or worse (“*-”). The cutoff values for Democratic PageRank values given are of the order ×10−6 . For URL-type cutoffs; >F indicates that URLs are more important than “File” URLs (i.e. either “Directory, “Subroot” or “Root”), >D that URLs are more important than “Directory” (i.e. either “Sub- root” or “Root”), and >SR that URLs are more important than “Subroot” (i.e. “Root”).
    108 Home pagefinding using query-independent web evidence 7.2.1 Results ANU The performance of the ANU collection when using minimum query-independent thresholds is presented in Table 7.2. Observations from these results are: • Removing the bottom 80% of pages according to Democratic PageRank, in-degree or URL-type improves the effectiveness of the content baseline. In the case of URL-type, the improvement is dramatic. • Using the least restrictive URL-type as a minimum threshold (i.e. removing “File” pages) removes around 95% of pages from consideration without a sig- nificant decrease in retrieval effectiveness for any baseline. • Using appropriate in-degree and Democratic PageRank threshold values, around 80% of pages can be removed before observing a significant decrease in retrieval effectiveness for any baseline. • The highest retrieval effectiveness is achieved using an anchor-text baseline with no thresholds, although this is not significantly better than that of anchor-text with the base URL-type threshold. In the ANU collection there was a group of documents with identical Democratic PageRank values of 5.28×10−6. This made it impossible to choose a cutoff of 60% and so a cutoff of 55% was used. The large number of documents that achieved the same PageRank value was found to be caused by a crawler trap on an ANU web server. WT10gC The performance of the WT10gC collection using minimum thresholds is presented in Table 7.3. Observations from these results are: • Excluding pages using the “> File” and “> File or Directory” URL-type thresh- olds provided significant gains on all three baselines while reducing the size of the collection by 97%. Excluding pages using the “> Subroot” URL-type thresh- old resulted in the removal of 99% of pages without significantly affecting the effectiveness of any baseline. • Excluding pages with in-degree < 2 removed 58% of pages from consideration without significantly reducing effectiveness for any baseline (and improved ef- fectiveness for the content baseline). • Excluding pages with a DPR of < 1.73 × 10−6 removed 40% of pages from con- sideration without significantly reducing effectiveness for any baseline. 4 An evaluation of performance on the VLC2P and VLC2R test collections was not possible due to time constraints
    §7.2 Minimum thresholdexperiments 109 Content Anchor Both Type Cut Prop. S@1 S@5 S@10 S. S@1 S@5 S@10 S. S@1 S@5 S@10 S. BASE 100% 0.23 0.45 0.55 0.47 0.69 0.72 0.45 0.71 0.83 IDG 2 42% 0.23 0.47 0.55 *+ 0.45 0.65 0.69 = 0.41 0.66 0.75 = IDG 3 26% 0.23 0.50 0.59 *+ 0.45 0.64 0.67 *- 0.40 0.64 0.72 *- IDG 4 19% 0.23 0.48 0.54 = 0.43 0.62 0.64 *- 0.39 0.60 0.68 *- IDG 6 12% 0.24 0.44 0.53 *- 0.41 0.59 0.60 *- 0.38 0.60 0.62 *- IDG 8 7.5% 0.25 0.45 0.53 *- 0.41 0.56 0.57 *- 0.38 0.59 0.60 *- IDG 10 5% 0.21 0.43 0.45 *- 0.40 0.49 0.50 *- 0.37 0.52 0.52 *- IDG 25 2% 0.20 0.36 0.39 *- 0.34 0.41 0.41 *- 0.31 0.43 0.43 *- IDG 50 1% 0.19 0.28 0.29 *- 0.28 0.30 0.30 *- 0.25 0.31 0.31 *- IDG 100 0.5% 0.15 0.22 0.23 *- 0.22 0.24 0.24 *- 0.21 0.24 0.24 *- DPR 1.33 99% 0.23 0.45 0.55 = 0.47 0.69 0.72 = 0.45 0.71 0.83 = DPR 1.38 80% 0.21 0.42 0.53 = 0.45 0.67 0.61 = 0.44 0.67 0.77 = DPR 1.51 70% 0.20 0.42 0.53 = 0.45 0.66 0.70 = 0.43 0.68 0.77 = DPR 1.73 60% 0.19 0.41 0.52 = 0.45 0.65 0.70 = 0.41 0.67 0.75 = DPR 2.11 50% 0.19 0.39 0.52 = 0.44 0.64 0.68 = 0.39 0.64 0.73 *- DPR 2.72 40% 0.18 0.39 0.51 = 0.42 0.62 0.66 *- 0.37 0.61 0.69 *- DPR 3.77 30% 0.20 0.41 0.46 = 0.41 0.59 0.62 *- 0.35 0.58 0.63 *- DPR 5.45 20% 0.20 0.37 0.44 *- 0.37 0.57 0.59 *- 0.35 0.54 0.58 *- DPR 8.65 10% 0.19 0.38 0.47 *- 0.39 0.55 0.56 *- 0.37 0.54 0.55 *- URL > F 7% 0.56 0.83 0.87 *+ 0.68 0.76 0.78 *+ 0.75 0.93 0.95 *+ URL > D 3% 0.63 0.81 0.87 *+ 0.67 0.73 0.75 *+ 0.76 0.89 0.90 *+ URL > SR 1% 0.65 0.75 0.76 = 0.59 0.65 0.65 = 0.75 0.77 0.77 = Table 7.3: Using query-independent thresholds on the WT10gC collection. Bold values in- dicate the highest effectiveness achieved for each type of query-independent evidence on each query-dependent baseline. Underlined bold values indicate the highest effectiveness achieved for each query-dependent baseline. The cutoff value is indicated by “Cut”. The percentage of the collection that is included within the cutoff is indicated by “Prop.”. “S.” reports whether observed changes are significantly better (“*+”), equivalent (“=”) or worse (“*-”). The specified cutoffs for Democratic PageRank are of the order ×10−6 . For URL-type cutoffs; >F indicates that URLs are more important than “File” URLs (i.e. either “Directory”, “Subroot” or “Root”), >D that URLs are more important than “Directory” (i.e. either “Subroot” or “Root”), and >SR that URLs are more important than “Subroot” (i.e. “Root”).
    110 Home pagefinding using query-independent web evidence • The highest effectiveness is achieved with a content+anchor-text baseline and URL-type “> File” threshold. Using the URL-type threshold gives gains of 7% to 20% over the best baseline score and removes 93% of pages from considera- tion. WT10gT The performance of the WT10gT collection using minimum thresholds is presented in Table 7.4. Observations from these results are: • Excluding documents based on a “> File” URL-type threshold, provides signifi- cant gains on all three baselines while reducing the size of the collection by 93%. Excluding documents using a “> Subroot” URL-type threshold reduces collec- tion size by 99% while only negatively affecting anchor-text retrieval effective- ness. • Excluding documents which achieve in-degree < 2 removes 58% of pages from consideration without significantly reducing effectiveness for any baseline. • Excluding documents which achieve a DPR in the top 90% of values resulted in a significant decrease in effectiveness for the anchor-text baseline. • The highest effectiveness is achieved with a content+anchor-text baseline and a “> File” URL-type threshold. Using this threshold gives gains of 7-15% over the baseline while removing 93% of pages from consideration. 7.2.2 Training cutoffs While several cutoffs were considered for each collection, a sensible approach for future experiments would be to train a threshold cutoff value on a single collection and then apply that as a threshold on other collections. The trained cutoff, if calcu- lated for the S@5 measure on the WT10gC collection (as with other realistic combina- tion experiments detailed below), would have been a “> File” URL-type cutoff (with an associated effectiveness gain of 24% along with a reduction of collection size by around 93%). Applied to the WT10gT collection, this cutoff would have resulted in a significant improvement in retrieval effectiveness of 12% at S@5 (along with the same reduction of collection size of 93%). Applied to the ANU collection, the collection size would be reduced by 95%, with an associated non-significant decrease in retrieval effectiveness of 9% at S@5. The exact efficiency gains achieved through using a minimum query-independent value for inclusion are difficult to quantify as they depend on the indexing and query processing methods used. However, one would expect that indexing an order of mag- nitude less documents would result in significant efficiency gains.
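The training procedure described above amounts to a one-dimensional sweep over candidate cutoffs. A minimal sketch follows; rank_after_cutoff stands in for a re-run of the retrieval pipeline under a given threshold and is a hypothetical helper, not part of the thesis experiments.

    def success_at(ranks, n):
        """S@n: the fraction of queries whose correct answer ranks in the top n."""
        return sum(1 for r in ranks if r <= n) / len(ranks)

    def train_cutoff(candidate_cutoffs, rank_after_cutoff, n=5):
        """Pick the candidate threshold that maximises S@n on a training collection.

        rank_after_cutoff(cutoff) should return, for every training query, the rank
        of the correct answer once below-threshold documents are excluded (1001 if
        the answer itself is excluded or falls outside the top 1000).
        """
        return max(candidate_cutoffs,
                   key=lambda c: success_at(rank_after_cutoff(c), n))

    # A cutoff trained on WT10gC in this way would then be applied unchanged to
    # the WT10gT and ANU collections, as described above.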
    §7.2 Minimum thresholdexperiments 111 Content Anchor Both Type Cut Prop. S@1 S@5 S@10 S. S@1 S@5 S@10 S. S@1 S@5 S@10 S. BASE 100% 0.22 0.48 0.59 0.53 0.68 0.72 0.48 0.71 0.75 IDG 2 42% 0.22 0.47 0.55 = 0.53 0.67 0.72 = 0.50 0.61 0.67 = IDG 3 26% 0.26 0.44 0.52 = 0.50 0.59 0.61 *- 0.48 0.61 0.64 = IDG 4 19% 0.23 0.43 0.51 = 0.46 0.54 0.56 *- 0.43 0.57 0.60 = IDG 6 12% 0.26 0.43 0.49 *- 0.43 0.51 0.52 *- 0.43 0.52 0.56 *- IDG 8 7.5% 0.24 0.41 0.45 *- 0.38 0.48 0.49 *- 0.39 0.50 0.51 *- IDG 10 5% 0.24 0.39 0.42 *- 0.37 0.44 0.46 *- 0.37 0.46 0.48 *- IDG 25 2% 0.23 0.31 0.34 *- 0.28 0.34 0.35 *- 0.28 0.35 0.37 *- IDG 50 1% 0.21 0.28 0.30 *- 0.25 0.28 0.28 *- 0.24 0.30 0.30 *- IDG 100 0.5% 0.16 0.22 0.23 *- 0.19 0.21 0.21 *- 0.20 0.22 0.23 *- DPR 1.33 99% 0.22 0.48 0.59 = 0.53 0.68 0.72 = 0.48 0.71 0.75 *+ DPR 1.38 80% 0.20 0.41 0.50 = 0.50 0.62 0.66 *- 0.45 0.61 0.65 = DPR 1.51 70% 0.20 0.39 0.49 = 0.51 0.62 0.64 *- 0.46 0.60 0.63 = DPR 1.73 60% 0.20 0.37 0.48 = 0.50 0.60 0.63 *- 0.46 0.59 0.61 = DPR 2.11 50% 0.23 0.41 0.50 = 0.50 0.59 0.63 *- 0.46 0.59 0.63 = DPR 2.72 40% 0.19 0.36 0.46 = 0.48 0.57 0.59 *- 0.43 0.54 0.59 *- DPR 3.77 30% 0.18 0.37 0.46 *- 0.47 0.55 0.56 *- 0.43 0.53 0.58 *- DPR 5.45 20% 0.18 0.37 0.44 *- 0.44 0.52 0.54 *- 0.41 0.50 0.55 *- DPR 8.65 10% 0.15 0.35 0.42 *- 0.39 0.48 0.48 *- 0.37 0.46 0.49 *- URL > F 7% 0.53 0.71 0.80 *+ 0.61 0.73 0.74 *+ 0.62 0.80 0.83 *+ URL > D 3% 0.57 0.76 0.78 *+ 0.62 0.70 0.71 = 0.66 0.79 0.81 *+ URL > SR 1% 0.60 0.62 0.63 = 0.53 0.57 0.58 *- 0.61 0.64 0.65 = Table 7.4: Using query-independent thresholds on the WT10gT collection. Bold values indi- cate the highest effectiveness achieved for each type of query-independent evidence on each query-dependent baseline. Underlined bold values indicate the highest effectiveness achieved for each query-dependent baseline. The cutoff value is indicated by “Cut”. The percentage of the collection that is included within the cutoff is indicated by “Prop.”. “S.” reports whether observed changes are significantly better (“*+”), equivalent (“=”) or worse (“*-”). The specified cutoffs for Democratic PageRank are of the order ×10−6 . For URL-type cutoffs; >F indicates that URLs are more important than “File” URLs (i.e. either “Directory”, “Subroot” or “Root”), >D that URLs are more important than “Directory” (i.e. either “Subroot” or “Root”), and >SR that URLs are more important than “Subroot” (i.e. “Root”).
    112 Home pagefinding using query-independent web evidence 7.3 Optimal combination experiments These experiments investigate the effectiveness improvements offered through the use of query-independent evidence in an Optimal re-ranking. The Optimal re-ranking is unrealistic, and is used to gauge the potential contribution of query-independent evidence when combined with query-dependent evidence. 7.3.1 Results Full re-ranking and significance test results are shown in Tables 7.5, 7.6, 7.7 and 7.8, and a summary of optimal results is presented in Table 7.9. Observations based on these results are: 1. All re-rankings of the content baseline significantly outperform the random con- trol. 2. The only re-ranking method which shows significant benefit over the anchor- text baseline is URL. This benefit is shown only for the random query sets. The benefits of re-ranking by URL are greatly diminished for anchor-text compared with content and content+anchor-text baselines. 3. All re-rankings of the content+anchor-text baseline significantly outperform the random control on ANU, WT10gT and VLC2R. Only the URL-type re-ranking on WT10gC and VLC2P outperforms the random control. 4. With no re-ranking, the content+anchor-text baselines perform worse than their anchor-text counterparts. However, the content+anchor-text based re-rankings are equal to (in ANU), or exceed their counterpart anchor-text re-rankings (in WT10gC, WT10gT, VLC2P, VLC2R). 5. URL performs at a consistently high level for all baselines. The URL anchor- text re-ranking is only outperformed by APR on the ANU and VLC2P. These are cases where the query set and bookmarks were both derived from the same list of authoritative sources. 6. For the popular home page queries (ANU and VLC2P), all anchor-text re-rankings outperform their content counterparts. 7. For random home page queries (WT10gT, WT10gC and VLC2R), the content+ anchor-text and content-only re-rankings perform better than their anchor-text counterparts. 8. Improvements due to APR were only observed when using high quality book- marks, i.e. when the query answers were to be found among the bookmarks. 9. Improvements due to IDG and DPR are almost identical.
    §7.3 Optimal combinationexperiments 113 Coll. Meas. Base Rand IDG DPR APR URL ANU S@1 0.29 0.37 0.73 0.71 0.75 0.68 ANU S@5 0.50 0.61 0.88 0.90 0.91 0.87 ANU S@10 0.58 0.69 0.93 0.93 0.96 0.91 ANU Sig. n/a n/a ** ** ** ** WT10gC S@1 0.23 0.34 0.61 0.59 0.55 0.75 WT10gC S@5 0.45 0.58 0.86 0.82 0.84 0.89 WT10gC S@10 0.55 0.68 0.86 0.87 0.88 0.93 WT10gC Sig. n/a n/a ** ** ** ** WT10gT S@1 0.22 0.34 0.64 0.62 0.55 0.84 WT10gT S@5 0.48 0.61 0.81 0.83 0.80 0.90 WT10gT S@10 0.59 0.69 0.86 0.87 0.84 0.92 WT10gT Sig. n/a n/a ** ** ** ** VLC2P S@1 0.27 0.38 0.66 0.62 0.67 0.71 VLC2P S@5 0.51 0.65 0.79 0.79 0.82 0.87 VLC2P S@10 0.61 0.76 0.88 0.87 0.90 0.89 VLC2P Sig. n/a n/a ** ** ** ** VLC2R S@1 0.16 0.25 0.50 0.48 0.46 0.72 VLC2R S@5 0.36 0.48 0.72 0.69 0.69 0.87 VLC2R S@10 0.44 0.58 0.73 0.72 0.72 0.88 VLC2R Sig. n/a n/a ** ** ** ** Table 7.5: Optimal re-ranking results for content. The Optimal combination experiment is described in Section 7.3. “Sig.” reports the statistical significance of the improvements. Significance is tested using the Wilcoxon matched-pairs signed ranks test. The Wilcoxon test compares the full document ranking, and so only a single significance value is reported per type of evidence, per collection. A “**” indicates improvements were significant at p < 0.01, and a “*” indicates improvements were significant at p < 0.05. Relative to the random control, all Optimal re-rankings of the content baseline were significant. The highest effectiveness achieved for each measure on each collection is highlighted in bold.
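For reference, the significance test reported in these tables can be reproduced with any standard Wilcoxon signed-rank implementation. The sketch below uses SciPy and is illustrative only (the thesis does not specify its statistical software); it takes, for each query, the best rank achieved by the correct answer under each method, and assumes the two runs differ on at least some queries.

    from scipy.stats import wilcoxon

    def compare_runs(ranks_a, ranks_b, alpha=0.05):
        """Wilcoxon matched-pairs signed-rank test on the ranks achieved by the
        correct answer under two methods (one paired observation per query).

        ranks_a, ranks_b : query-aligned lists of best correct-answer ranks,
                           with 1001 standing in for 'not in the top 1000'.
        """
        stat, p = wilcoxon(ranks_a, ranks_b)
        better_a = sum(a < b for a, b in zip(ranks_a, ranks_b))
        better_b = sum(b < a for a, b in zip(ranks_a, ranks_b))
        return {"p_value": p,
                "significant": p < alpha,
                "favours": "A" if better_a > better_b else "B"}

The 95% confidence criterion used throughout the chapter corresponds to alpha = 0.05 here.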
    114 Home pagefinding using query-independent web evidence Coll. Meas. Base Rand IDG DPR APR URL ANU S@1 0.72 0.82 0.87 0.87 0.89 0.88 ANU S@5 0.96 0.97 0.98 0.98 0.98 0.98 ANU S@10 0.97 0.97 0.98 0.98 0.99 0.98 ANU Sig. n/a n/a - - - - WT10gC S@1 0.47 0.58 0.60 0.59 0.63 0.73 WT10gC S@5 0.69 0.73 0.71 0.72 0.73 0.82 WT10gC S@10 0.72 0.76 0.74 0.75 0.75 0.83 WT10gC Sig. n/a n/a - - - * WT10gT S@1 0.53 0.60 0.63 0.61 0.64 0.74 WT10gT S@5 0.68 0.73 0.72 0.71 0.75 0.78 WT10gT S@10 0.72 0.76 0.76 0.76 0.75 0.79 WT10gT Sig. n/a n/a - - - * VLC2P S@1 0.70 0.77 0.78 0.79 0.85 0.81 VLC2P S@5 0.86 0.88 0.88 0.89 0.92 0.90 VLC2P S@10 0.87 0.89 0.90 0.89 0.92 0.92 VLC2P Sig. n/a n/a - - - - VLC2R S@1 0.48 0.55 0.63 0.60 0.61 0.68 VLC2R S@5 0.67 0.71 0.75 0.75 0.73 0.74 VLC2R S@10 0.72 0.73 0.75 0.75 0.75 0.76 VLC2R Sig. n/a n/a - - - * Table 7.6: Optimal re-ranking results for anchor-text. The Optimal combination experiment is described in Section 7.3. “Sig.” reports the statistical significance of the improvements. Significance is tested using the Wilcoxon matched-pairs signed ranks test. The Wilcoxon test compares the full document ranking, and so only a single significance value is reported per type of evidence, per collection. A “**” indicates improvements were significant at p < 0.01, and a “*” indicates improvements were significant at p < 0.05. The highest effectiveness achieved for each measure on each collection is highlighted in bold.
    §7.3 Optimal combinationexperiments 115 Coll. Meas. Base Rand IDG DPR APR URL ANU S@1 0.63 0.70 0.85 0.85 0.84 0.88 ANU S@5 0.81 0.86 0.96 0.98 0.96 0.98 ANU S@10 0.86 0.90 0.98 0.99 0.98 0.98 ANU Sig. n/a n/a * * * * WT10gC S@1 0.45 0.58 0.65 0.67 0.68 0.94 WT10gC S@5 0.71 0.81 0.90 0.88 0.89 0.97 WT10gC S@10 0.83 0.86 0.92 0.91 0.90 0.97 WT10gC Sig. n/a n/a - - - ** WT10gT S@1 0.48 0.58 0.70 0.69 0.68 0.84 WT10gT S@5 0.71 0.77 0.88 0.86 0.85 0.94 WT10gT S@10 0.75 0.80 0.88 0.90 0.88 0.95 WT10gT Sig. n/a n/a * * * ** VLC2P S@1 0.67 0.75 0.84 0.86 0.89 0.90 VLC2P S@5 0.85 0.88 0.93 0.94 0.94 0.97 VLC2P S@10 0.88 0.91 0.94 0.94 0.95 0.98 VLC2P Sig. n/a n/a - - * ** VLC2R S@1 0.40 0.50 0.63 0.60 0.58 0.84 VLC2R S@5 0.62 0.69 0.78 0.76 0.74 0.93 VLC2R S@10 0.66 0.75 0.79 0.78 0.77 0.93 VLC2R Sig. n/a n/a - - - ** Table 7.7: Optimal re-ranking results for content+anchor-text. The Optimal combination experiment is described in Section 7.3. Significance is tested using the Wilcoxon matched- pairs signed ranks test. The Wilcoxon test compares the full document ranking, and so only a single significance value is reported per type of evidence, per collection. A “**” indicates improvements were significant at p < 0.01, and a “*” indicates improvements were signifi- cant at p < 0.05. The highest effectiveness achieved for each measure on each collection is highlighted in bold.
Collection Type Content Anchor-text Content + Anchor-text ANU Popular APR > DPR, URL - - WT10gC Random DPR > IDG URL > IDG, DPR, APR URL > IDG, DPR, APR APR > IDG URL > IDG, DPR, APR WT10gT Random IDG > APR APR > IDG, DPR URL > IDG, DPR, APR DPR > APR URL > IDG, DPR, APR URL > IDG, DPR, APR VLC2P Popular - APR > IDG, DPR URL > IDG VLC2R Random IDG > APR DPR > IDG IDG > APR URL > IDG, DPR, APR URL > IDG, DPR, APR URL > IDG, DPR, APR

Table 7.8: Significant differences between methods when using Optimal re-rankings. Each (non-random) method was compared against each of the others in turn and differences were tested for significance using the Wilcoxon test. Each significant difference found is shown with the direction of the difference.

7.4 Score-based re-ranking

These experiments investigate the effectiveness of a score-based re-ranking of baselines using query-independent evidence.

7.4.1 Setting score cutoffs

For the realistic score-based re-rankings the same cutoff was applied to all queries. Suitable score cutoffs were determined for WT10gC by plotting S@5 effectiveness against potential cutoff values (see Figures 7.2 and 7.3) and recording the optimal cutoff for each form of query-independent evidence. The other collections were then re-ranked using this cutoff. Optimal cutoffs were calculated at S@5 due to the instability of S@1 (see footnote 5) and the smaller effectiveness gains observed at S@10.

7.4.2 Results

Tables 7.10, 7.11 and 7.12 show the results of the score-based re-ranking of the content, anchor-text and content+anchor-text baselines. From these results it can be observed that:

1. URL re-ranking provided significant improvements over all three baselines for WT10gT, VLC2P and VLC2R.

2. URL re-ranking performance is only surpassed by APR on the ANU collection (at S@1), where APR used very high quality bookmarks.

3. None of the hyperlink-recommendation-based schemes provided a significant improvement over the anchor-text baseline.

5 S@1 is equivalent to P@1; the instability of Precision at 1 is discussed in Section 2.6.6.1.
Collection   Measure   Best Cont.   Best Anch   Best Cont+Anch
ANU          S@1       0.76         0.90        0.88
             S@5       0.92         0.98        0.98
             S@10      0.94         0.98        0.98
             QIE       URL          ALL         URL, DPR
WT10gC       S@1       0.82         0.73        0.94
             S@5       0.93         0.84        0.97
             S@10      0.93         0.84        0.97
             QIE       URL          URL         URL
WT10gT       S@1       0.84         0.74        0.84
             S@5       0.90         0.78        0.94
             S@10      0.92         0.79        0.95
             QIE       URL          URL         URL
VLC2P        S@1       0.71         0.84        0.90
             S@5       0.87         0.91        0.97
             S@10      0.89         0.91        0.98
             QIE       URL          APR         URL
VLC2R        S@1       0.73         0.68        0.84
             S@5       0.87         0.74        0.93
             S@10      0.88         0.76        0.93
             QIE       URL          URL         URL

Table 7.9: Summary of Optimal re-ranking results. The highest effectiveness achieved by each method is highlighted in bold. The "QIE" row indicates the query-independent evidence that performed best.
[Figure 7.2: two plots of success rate at 5 against the cutoff, expressed as a percentage of the maximum baseline score, for the URL, APR, in-degree and DPR re-rankings.]

Figure 7.2: Setting score-based re-ranking cutoffs for the content (top) and anchor-text (bottom) baselines using the WT10gC collection. The vertical lines represent the chosen cutoff values, which were then used in all score-based re-ranking experiments. If the optimal cutoff spanned multiple values then the mean of those values was used. Numerical cutoff scores are provided in Tables 7.10 and 7.11.
[Figure 7.3: success rate at 5 against the cutoff, expressed as a percentage of the maximum baseline score, for the APR, URL, in-degree and DPR re-rankings.]

Figure 7.3: Setting score-based re-ranking cutoffs for the content+anchor-text baseline using the WT10gC collection. The vertical lines represent the chosen cutoff values, which were then used in all score-based re-ranking experiments. If the optimal cutoff spanned multiple values then the mean of those values was used. Numerical cutoff scores are provided in Table 7.12.
    120 Home pagefinding using query-independent web evidence 4. For the popular query sets (ANU and VLC2P) the anchor text baseline with URL re-ranking produced the best performance, although the baseline only narrowly outperformed content+anchor-text. 5. For the random query sets (WT10gT and VLC2R) the content+anchor-text base- line with URL re-ranking produced the best performance, with the URL re- ranking of the content baseline performing better than the anchor-text re-ranking. 6. In the absence of very high quality bookmarks (i.e. on every collection except for the ANU), APR performance was very similar to that of the other hyperlink recommendation techniques. Coll. Meas. Base IDG DPR APR URL (at 20.6%) (at 17.4%) (at 14.1%) (at 33.7%) ANU S@1 0.29 0.36 0.29 0.48 0.39 ANU S@5 0.50 0.60 0.52 0.67 0.73 ANU S@10 0.58 0.73 0.6 0.72 0.83 ANU Sig. - - - ** ** WT10gC S@1 0.23 0.36 0.38 0.33 0.71 WT10gC S@5 0.45 0.67 0.58 0.59 0.88 WT10gC S@10 0.55 0.73 0.67 0.65 0.90 WT10gT S@1 0.22 0.46 0.41 0.32 0.70 WT10gT S@5 0.48 0.64 0.59 0.62 0.83 WT10gT S@10 0.59 0.71 0.69 0.65 0.88 WT10gT Sig. - - - - ** VLC2P S@1 0.27 0.38 0.42 0.41 0.56 VLC2P S@5 0.51 0.61 0.61 0.63 0.68 VLC2P S@10 0.61 0.70 0.70 0.76 0.76 VLC2P Sig. - - - ** ** VLC2R S@1 0.16 0.26 0.20 0.22 0.62 VLC2R S@5 0.36 0.47 0.44 0.45 0.82 VLC2R S@10 0.44 0.56 0.52 0.53 0.83 VLC2R Sig. - - - - ** Table 7.10: Score-based re-ranking results for content. Cutoffs (shown as “(at ?)”) were obtained by training on WT10gC at S@5. “Sig.” reports the statistical significance of the im- provements. Significance is tested using the Wilcoxon matched-pairs signed ranks test. The Wilcoxon test compares the full document ranking, and so only a single significance value is reported per type of evidence, per collection. A “**” indicates improvements were signifi- cant at p < 0.01, and a “*” indicates improvements were significant at p < 0.05. The highest effectiveness achieved at each measure for each collection is highlighted in bold.
    §7.4 Score-based re-ranking121 Coll. Meas. Base IDG DPR APR URL (at 15.5%) (at 11.1%) (at 15.6%) (at 20.4%) ANU S@1 0.72 0.77 0.74 0.83 0.78 ANU S@5 0.96 0.95 0.94 0.96 0.98 ANU S@10 0.97 0.98 0.98 0.98 0.98 ANU Sig. - - - - - WT10gC S@1 0.47 0.5 0.51 0.51 0.67 WT10gC S@5 0.69 0.71 0.71 0.71 0.76 WT10gC S@10 0.72 0.72 0.72 0.72 0.76 WT10gT S@1 0.53 0.51 0.52 0.47 0.65 WT10gT S@5 0.68 0.70 0.68 0.70 0.73 WT10gT S@10 0.72 0.72 0.72 0.73 0.74 WT10gT Sig. - - - - ** VLC2P S@1 0.70 0.69 0.70 0.73 0.81 VLC2P S@5 0.86 0.84 0.84 0.85 0.89 VLC2P S@10 0.87 0.86 0.88 0.86 0.91 VLC2P Sig. - - - - ** VLC2R S@1 0.48 0.48 0.46 0.41 0.66 VLC2R S@5 0.67 0.70 0.71 0.69 0.73 VLC2R S@10 0.72 0.73 0.72 0.70 0.76 VLC2R Sig. - - - - ** Table 7.11: Score-based re-ranking results for anchor-text. Cutoffs (shown as “(at ?)”) were obtained by training on WT10gC at S@5. “Sig.” reports the statistical significance of the im- provements. Significance is tested using the Wilcoxon matched-pairs signed ranks test. The Wilcoxon test compares the full document ranking, and so only a single significance value is reported per type of evidence, per collection. A “**” indicates improvements were signifi- cant at p < 0.01, and a “*” indicates improvements were significant at p < 0.05. The highest effectiveness achieved at each measure for each collection is highlighted in bold.
    122 Home pagefinding using query-independent web evidence Coll. Meas. Base IDG DPR APR URL (at 10.3%) (at 6.9%) (at 10%) (at 31.7%) ANU S@1 0.63 0.71 0.64 0.70 0.69 ANU S@5 0.81 0.84 0.82 0.86 0.88 ANU S@10 0.86 0.89 0.86 0.89 0.91 ANU Sig. - * - * * WT10gC S@1 0.45 0.51 0.49 0.53 0.79 WT10gC S@5 0.71 0.77 0.73 0.75 0.92 WT10gC S@10 0.83 0.83 0.83 0.82 0.94 WT10gT S@1 0.48 0.51 0.52 0.41 0.72 WT10gT S@5 0.71 0.68 0.70 0.67 0.86 WT10gT S@10 0.75 0.77 0.78 0.75 0.89 WT10gT Sig. - - - - ** VLC2P S@1 0.67 0.65 0.68 0.68 0.68 VLC2P S@5 0.85 0.87 0.86 0.86 0.88 VLC2P S@10 0.88 0.91 0.90 0.91 0.93 VLC2P Sig. - - - - * VLC2R S@1 0.40 0.42 0.42 0.34 0.75 VLC2R S@5 0.62 0.61 0.59 0.61 0.87 VLC2R S@10 0.66 0.70 0.67 0.69 0.89 VLC2R Sig. - - - - ** Table 7.12: Score-based re-ranking results for content+anchor-text. Cutoffs (shown as “(at ?)”) were obtained by training on WT10gC at S@5. “Sig.” reports the statistical significance of the improvements. Significance is tested using the Wilcoxon matched-pairs signed ranks test. The Wilcoxon test compares the full document ranking, and so only a single significance value is reported per type of evidence, per collection. A “**” indicates improvements were significant at p < 0.01, and a “*” indicates improvements were significant at p < 0.05. The highest effectiveness achieved at each measure for each collection is highlighted in bold.
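The realistic score-based re-ranking evaluated in Tables 7.10 to 7.12 can be sketched as follows. This is one plausible reading of the procedure (cutoffs expressed as a percentage of the maximum baseline score, following Chapter 6 and Figures 7.2 and 7.3) rather than the exact implementation; variable names are illustrative.

    def score_based_rerank(results, qi_score, cutoff_fraction):
        """Re-rank the head of a query-dependent result list by query-independent
        evidence.  Documents whose baseline score is at least `cutoff_fraction`
        of the top baseline score are resorted by qi_score (ties keep baseline
        order); the remaining documents follow in their original order.

        results : list of (doc_id, baseline_score), best first.
        """
        if not results:
            return []
        threshold = cutoff_fraction * results[0][1]
        head = [(d, s) for d, s in results if s >= threshold]
        tail = [(d, s) for d, s in results if s < threshold]
        head = sorted(head, key=lambda ds: qi_score.get(ds[0], 0.0), reverse=True)
        return [d for d, _ in head + tail]

    # e.g. a URL-type re-ranking of the content baseline with the trained cutoff of
    # 33.7% of the maximum baseline score (Table 7.10), on hypothetical run data:
    # reranked = score_based_rerank(content_run["some query"], url_type_score, 0.337)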
    §7.5 Interpretation ofresults 123 7.5 Interpretation of results Collection Info Optimal Score-based B’mark Best Best S@1 S@5 S@10 Coll. Type Quality S@5 S@5 Improve Improve Improve Sig. ANU Pop. v.High 0.98 0.98 7.7% 2.0% 0% - AT+* AT+URL 0.72→0.78 0.96→0.98 0.98→0.98 WT10gT Rand. Low 0.88 0.85 29% 16% 14% ** C+AT+URL C+AT+URL 0.48→0.68 0.71→0.85 0.75→0.87 VLC2P Pop. High 0.97 0.88 14% 3% 4% ** C+AT+URL AT+URL 0.69→0.79 0.85→0.88 0.86→0.90 VLC2R Rand. Low 0.93 0.87 47% 29% 26% ** C+AT+URL C+AT+URL 0.40→0.75 0.62→0.87 0.66→0.89 Table 7.13: Numerical summary of re-ranking improvements. “Sig.” reports the statistical significance of the improvements. Significance is tested using the Wilcoxon matched-pairs signed ranks test. The Wilcoxon test compares the full document ranking, and so only a single significance value is reported per type of evidence, per collection. A “**” indicates improvements were significant at p < 0.01, and a “*” indicates improvements were signif- icant at p < 0.05. The percentile realistic improvements are calculated as a percentage im- provement over the best baseline. “AT+*” denotes a combination of anchor-text with any of the query independent evidence examined here. “AT+URL” denotes a combination of anchor-text with URL-type query-independent evidence. “AT+APR” denotes a combination of anchor-text with APR query-independent evidence. “C+URL” denotes a combination of content with URL-type query-independent evidence. “C+AT+URL” denotes a combination of content+anchor-text with URL-type query-independent evidence. 7.5.1 What query-independent evidence should be used in re-ranking? The Optimal combination results show that re-rankings by all of the query- independent methods considered significantly improve upon the random control for the content baseline. For all random query sets, URL re-ranking of the anchor-text base- line significantly improves upon the random control. Further, many of content+anchor- text baseline re-rankings are significant. Results are quite stable across collections de- spite differences in their scale. Naturally, the benefits of the realistic score-based re-rankings are smaller, but the URL method in particular achieves substantial gains over all baselines, as shown in Table 7.13. It is clear that classification of URL-type is of considerable value in a home page finding system. Section 7.6.2 examines whether the URL-type classifications em- ployed in this experiment are optimal. It is of interest that URL re-ranking results for the ANU collection are poorer than for the other collections. Although investigation confirmed UTwente/TNO’s order- ing, i.e. “Root” (36/137) > “Subroot” (50/862) > “Directory” (72/1059) >
    124 Home pagefinding using query-independent web evidence “File” (40/382 274),6 the ratio for the URL “Subroot” class was higher than for other collections. It should be noted that URL re-ranking would be of little use in webs in which URLs exhibit no hierarchical structure. For example, some organisations publish URLs of the form xyz.org/getdoc.cgi?docid=9999999. Such URLs include no potential “Subroot” or “Directory” URL break-downs. In experiments within this chapter the baseline ordering was preserved if the re- ranking scores were equal. Such equality occurred more often in URL-type scores, which could take only one of four distinct values. To confirm that the superiority of URL-type re-ranking was not an artifact of their quantisation, hyperlink recommen- dation scores were quantised7 into four groups, and the effectiveness of the quantised scores was also evaluated. The quantisation of hyperlink recommendation scores de- creased retrieval effectiveness. This indicates that it is unlikely that URL-type has an unfair advantage due to quantisation. Hyperlink recommendation results indicate these schemes may have relatively lit- tle role to play in home page finding tasks using re-ranking based combination meth- ods for corpora within the range of sizes studied here (400 000 to 18.5 million pages). The full-text (content) baseline improvements when using hyperlink recommendation scores as a minimum threshold for document retrieval, or in an Optimal re-ranking of the query-dependent baselines, were encouraging. By contrast, the performance improvements over the anchor-text baseline were minimal. This suggests that most of the potential improvement offered by hyperlink recommendation methods is al- ready exploited by the anchor-text baseline. In most of the score-based re-rankings it is almost impossible to differentiate between the re-ranking of the anchor-text base- line and the baseline itself. The extent to which hyperlink recommendation evidence is implicit in anchor-text evidence is considered in the next chapter. Throughout the experiments in-degree appeared to provide more consistent per- formance improvements than APR or DPR. APR performed well when using high- quality bookmark sets, but did not improve performance when using lower qual- ity bookmark sets on random (WT10gT and VLC2R) query sets. The improvement achieved by these methods relative to the anchor-text baselines was not significant. The difference in effectiveness of the two PageRank variants show that PageRank’s contribution to home page finding on corpora of this size is highly dependent upon the choice of bookmark pages. However, even for popular queries (ANU and VLC2P), APR results are generally inferior to those of URL re-rankings. Of the three hyperlink recommendation methods in-degree may be the best choice, as the PageRank variants offer little or no advantage and are more computationally expensive. In conclusion, the results of these experiments show the best query-independent evidence to be URL-type. 6 Note that in these figures all URLs (including equivalent URLs) were considered. 7 I.e. similar scores were grouped to reduce the number of possible values.
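The quantisation check mentioned above (see footnote 7) can be illustrated as follows. The thesis does not state how scores were grouped; equal-frequency (quartile) binning is assumed here purely for illustration.

    def quantise_scores(scores, bins=4):
        """Group similar query-independent scores into a small number of bins
        (four here, to mirror the four URL-type classes).  Each document is
        mapped to the index of the quantile bin its score falls into."""
        ordered = sorted(scores.values())
        # Bin boundaries at the 25th, 50th and 75th percentiles when bins=4.
        # With heavily skewed scores (e.g. in-degree) boundaries may coincide,
        # collapsing some bins -- which only strengthens the quantisation.
        bounds = [ordered[int(len(ordered) * i / bins)] for i in range(1, bins)]
        return {doc: sum(value >= b for b in bounds) for doc, value in scores.items()}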
    §7.6 Further experiments125 7.5.2 Which query-dependent baseline should be used? In the experiments, prior to re-ranking, the anchor-text baseline generally outper- formed the content and content+anchor-text baselines. However, on two collections,8 URL-type re-rankings of full-text (content) outperformed similar re-rankings of anchor-text. In these two cases the target home pages were randomly chosen. This effect was not observed for the popular targets, although the content+anchor-text per- formance was comparable to that of anchor-text only. Figure 7.4 illustrates the difference between the random and popular sets by plotting S@n against n for the content and anchor-text baselines. For the popular query set, the two baselines converge at about n = 500, but for the random set the content baseline is clearly superior for n > 150. The plot for VLC2R is similar to that observed in a pre- vious study of content and anchor-text performance on the WT10gT collection [135]. An explanation for the observed increase in effectiveness of the content baseline above n > 150 is that while anchor-text rankings are better able to discriminate be- tween home pages and other relevant pages, full anchor-text rankings are shorter9 than those for content. Some home pages have no useful incoming anchor-text and therefore do not appear anywhere in the anchor-text ranking. By contrast, most home pages do contain some form of site name within their content and will eventually appear in the content ranking. Selecting queries from a directory within the collection guarantees that the anchor document for the target home page will not be empty, but there is no such guarantee for randomly chosen home pages. Selection of home pages for listing in a directory is undoubtedly biased toward useful, important or well-known sites which are also more likely to be linked to from other pages (Experiments in Chapter 5 observed that PageRank does favour popular pages). It should be noted that incoming home page queries would probably also be biased toward this type of site. In conclusion, the results of the experiments show the content+anchor-text base- line to be the most consistent performer across all tasks, and to perform particularly well when combined with URL-type evidence. 7.6 Further experiments Having established the principal results above, a series of follow-up experiments was conducted. In particular these investigated: • to what extent results can be understood in terms of rank and score distributions; • whether other classifications of URL-type provide similar, or superior, gains in retrieval effectiveness; 8 Of the four evaluated. The WT10gC test collection is not included as it was used to train the re- ranking cutoffs. 9 Ignoring documents that achieve a score of zero.
[Figure 7.4: two plots of success rate at n against the number of documents n (log scale, 1 to 1000) for the content and anchor-text baselines.]

Figure 7.4: Baseline success rates across different cutoffs. The top plot is for VLC2P, the VLC2 crawl with a popular home page query set. The bottom plot is for VLC2R, the same crawl, but with a random home page query set. The anchor-text baseline performs well between n = 0 and 150 for both collections. In VLC2P, at around S@150 the anchor-text baseline performance approaches the content baseline performance. In VLC2R the anchor-text performance is surpassed by the content performance at around S@150. These plots are consistent with the S@1000 values reported in Table 7.1.
    §7.6 Further experiments127 • to what extent the PageRanks and in-degrees are correlated with those reported by Google; and • whether the use of anchor-text and link graph information external to the corpus could improve retrieval effectiveness. 7.6.1 Rank and score distributions This section analyses the distribution of correct answers for each type of evidence over the WT10gC collection. The content and anchor-text baseline rankings of the correct answers are plotted in Figure 7.5. In over 50% of occasions both the content and anchor-text baselines contain the correct answer within the top ten results. Anchor-text provides the better scoring of the two baselines, with the correct home page ranked as the top result for almost 50% of the queries. This confirms the effectiveness of anchor-text for home page finding [15, 56]). The PageRank distributions are plotted in Figure 7.6. The distribution of the De- mocratic PageRank scores for all pages follow a power law. In contrast, the PageRank distribution for correct answers is much more even, with the proportion of pages that are correct answers increasing at higher PageRanks. There are many pages which do not achieve an APR score. Merely having an APR score > 0 gives some indica- tion that a page is a correct answer in the WT10gC collection. These plots indicate that both forms of PageRank provide some sort of home page evidence (as observed in Chapter 5), even though these computed PageRank values differ markedly from those mined from the Google toolbar in Chapter 5. This large difference re-affirms the belief that PageRanks reported by the Google toolbar have been heavily transformed. The in-degree distribution is plotted at the top of Figure 7.7 and is similar to the Democratic PageRank distribution. However, the graph is slightly shifted to the left, indicating that there are more pages with low in-degrees than there are pages with low PageRanks. The distribution of correct answers is spread across in-degree scores, with the proportion of pages that are correct answers increasing at higher in-degrees. This shows that in-degree also provides some sort of home page evidence. The URL-type distribution is plotted on the right in Figure 7.7. URL-type is a particularly useful home page indicator for this collection, with a large proportion of the correct answers located in the “Root” class and few correct answers located within the “File” class. 7.6.2 Can the four-tier URL-type classification be improved? This section evaluates how combining the four URL-type classes and introducing length and directory depth based scores impacts retrieval effectiveness. The results for this series of experiments are presented in Table 7.14. None of the new URL-type methods significantly improved upon the performance of the original URL-type classes (“Root” > “Subroot” > “Directory” > “File”). How- ever, combining the “Subroot” and “Directory” classes did not adversely affect URL-
[Figure 7.5: two histograms of the number of documents against the rank (1 to 10, and >10) of the correct answer.]

Figure 7.5: Baseline rankings of the correct answers for WT10gC (content top, anchor-text bottom). The correct answer is retrieved within the top ten results for over 50% of queries on both baselines. The anchor-text baseline has the correct answer ranked as the top result for almost 50% of the queries.
[Figure 7.6: two log-log plots of the number of documents against the normalised Democratic and Aristocratic PageRank scores (quantised to 40 steps), shown for all pages (All) and for the correct answers (Correct).]

Figure 7.6: PageRank distributions for WT10gC (DPR top, APR bottom). These plots contain the distribution of all pages in the collection (All) and the distribution of the 100 correct answers (Correct). The distribution of the DPR scores for all pages follows a power law. In contrast, the correct answers are spread more evenly across DPR scores, so the proportion of pages which are correct answers increases at higher PageRanks. Approximately 17% of pages do not achieve an APR score, thus merely having an APR score > 0 is some indication that a page is more likely to be a correct answer.
[Figure 7.7: a log-log plot of the number of documents against the normalised in-degree score (quantised to 40 steps), and a bar chart of the percentage of documents in each URL type (file, path, subroot, root), each shown for all pages (All) and for the correct answers (Correct).]

Figure 7.7: Other distributions for WT10gC (in-degree top, URL-type bottom). The top plot contains the in-degree distribution for all pages (All) and the 100 correct answers (Correct). The distribution of the in-degree scores for all pages follows a power law. In contrast, the correct answers are spread more evenly across in-degree scores, and the proportion of pages which are correct answers increases at higher in-degree scores. The bottom plot contains the URL-type distribution (in percentages) of all pages (All) and the correct answers (Correct). The "Root" tier contains only 1% of the pages in the collection, but 80% of the correct answers. In contrast, the "File" tier contains 92% of the collection's pages, but only 5% of the correct answers.
Dataset   Baseline      R>S>D>F   Length   Dir Depth   R>S+D+F   R>S+D>F   R>S>D(l)>F
ANU       content       87        88       68          62        77        87
ANU       anchor-text   98        98       98          97        98        98
WT10gC    content       89        90       72          83        89        89
WT10gC    anchor-text   82        83       75          78        82        82
WT10gT    content       88        88       74          80        85        88
WT10gT    anchor-text   77        79       74          75        77        77
VLC2P     content       87        86       68          81        84        87
VLC2P     anchor-text   89        92       87          89        89        90
VLC2R     content       87        86       62          82        85        87
VLC2R     anchor-text   74        76       73          73        74        74

Table 7.14: S@5 (as a percentage) for URL-type category combinations, length (how long a URL is, favouring short URLs) and directory depth (how many directories the URL contains, favouring URLs with shallow directories). R represents the "Root" tier, S the "Subroot" tier, D the "Directory" tier and F the "File" tier. D(l) indicates that directories were ranked according to length (where shorter directories are preferred). In all cases an Optimal re-ranking of baselines by query-independent evidence was performed.

type effectiveness. A high level of effectiveness was also obtained using a simple URL length measure. This measure ranked pages according to the length of their URLs (in characters, favouring short URLs). "File" URLs contain filenames and are thereby longer than their "Root" and "Directory" counterparts, which may explain the good performance of the URL length measure. Re-ranking baselines using only the URL directory depth (the number of slashes in the URL) performed relatively poorly.

In conclusion, when using URL-type scores for home page finding tasks it is important to distinguish between "Root", "Directory" and "File" pages. This can be done either explicitly through a categorisation of URL-types or by measuring the length of the URL.

7.6.3 PageRank and in-degree correlation

The results in Table 7.15 show that DPR and in-degree are highly correlated, but that the correlation tends to weaken as the size of the corpus increases. This weaker association as corpus size increases suggests that PageRank might have quite different properties when calculated for very large crawls. Google's PageRank, based on 50 to 100 times more documents than are in VLC2, is likely to be different and possibly superior to the PageRanks studied here. In addition, Google may use a different PageRank variant and different bookmarks.

To understand the relationship between the PageRank values calculated in these experiments and the PageRank employed by the Google WWW search engine, scores were compared with the Google PageRanks reported for all 201 ANU pages listed in the Google Directory.10 For those pages, PageRanks were extracted from Google's

10 A version of the manually constructed DMOZ open WWW directory which reports Google PageR-
    132 Home pagefinding using query-independent web evidence DPR APR No. of pages (millions) ANU 0.836 0.448 0.40 WT10g 0.71 0.555 1.69 VLC2 0.666 0.164 18.57 Table 7.15: Correlation of PageRank variants with in-degree. The correlation was tested using the Pearson r significance test. DMOZ directory and in-degrees were extracted using the Google link: query op- erator. Google PageRank and in-degree were correlated (r=0.358), as they were for ANU, WT10g and VLC2. Also, the correlation between Google in-degree and ANU in-degree was very strong (r=0.933). Google’s in-degrees, based on a much larger crawl, were only three times larger than those from the ANU crawl (during link count extraction the difficulties outlined in Section 5.1.3 were encountered). While Google PageRank and ANU PageRank were correlated over the 201 obser- vations, the correlation was less strong than for in-degree (DPR r=0.26, APR r=0.31). This indicates that Google PageRank is different from the PageRanks studied here (as observed in Section 5.1.1). Note that only five different values of PageRank were reported by Google for the 201 pages (11, 16, 22, 27 and 32 out of 40). The directory- based PageRanks are on a different scale to those extracted using the Google Toolbar in Chapter 5, and both have been transformed and quantised from Google’s internal PageRank values. Although this study may not be directly applicable to very large crawls, its re- sults are quite stable for a range of smaller multi-server crawls. The range of sizes of corpora examined here (400 000 to 18.5 million pages) are typical of many enterprise webs and thus interesting both scientifically and commercially.11 7.6.4 Use of external link information To explore the effects of increasing corpus size, a series of hybrid WT10g/VLC2 runs was performed. This is potentially revealing because the WT10g corpus is a subset of the VLC2 corpus. The runs, shown in Table 7.16, used combinations of WT10g corpus data and VLC2 link information. The hypothesis was that by using link tables from the larger corpus it would be possible to obtain a more complete link graph and thereby improve the performance of the hyperlink recommendation and anchor- text measures (due to a potential increase in the hyperlink votes, and the amount of anks. The Google DMOZ Directory is available at http://directory.google.com 11 The rated capacities of the two Google search appliances are in fact very similar to these sizes (150 000 and 15 million pages), see http://www.google.com/appliance/products.html.
7.6.4 Use of external link information

To explore the effects of increasing corpus size, a series of hybrid WT10g/VLC2 runs was performed. This is potentially revealing because the WT10g corpus is a subset of the VLC2 corpus. The runs, shown in Table 7.16, used combinations of WT10g corpus data and VLC2 link information. The hypothesis was that by using link tables from the larger corpus it would be possible to obtain a more complete link graph and thereby improve the performance of the hyperlink recommendation and anchor-text measures (due to a potential increase in the hyperlink votes, and the amount of available anchor-text). During these hybrid runs all VLC2 anchor-text that pointed to pages outside the WT10g corpus was removed.

                  WT10g anchor-text         VLC2 anchor-text
DPR re-ranking    —      WT10g   VLC2       —      WT10g   VLC2
WT10gC            0.69   0.72    0.69       0.78   0.79    0.78
WT10gT            0.68   0.71    0.71       0.72   0.72    0.73

Table 7.16: Using VLC2 links in WT10g. Note that the WT10g collection is a subset of the VLC2 collection. The WT10g anchor-text scores are the baselines used throughout all other experiments in this chapter. The VLC2 anchor-text scores are new rankings that use external anchor-text from the VLC2 collection. WT10g DPR is a Democratic PageRank re-ranking using the link table from the WT10g collection. VLC2 DPR is a Democratic PageRank re-ranking using the link table from the VLC2 collection. The use of the (larger) VLC2 link table DPR scores did not significantly improve the performance of DPR re-ranking. The use of external anchor-text, taken from the VLC2 collection, provided significant performance gains.

Surprisingly, the use of the (larger) VLC2 link table DPR scores did not noticeably improve the performance of DPR re-ranking. However, the use of external anchor-text, taken from the VLC2 corpus, provided significant performance gains. This would suggest that in situations where an enterprise or small web has link information for a larger web, benefits will be seen if the anchor-text from the external link graph is recorded and used for the smaller corpus.12

The WT10g collection is not a uniform sample of VLC2, but was engineered to maximise the interconnectivity of the documents selected [15]. Hence the effects of scaling up may be smaller than would be expected in other web corpora.

7.7 Discussion

Using query-independent evidence scores as a minimum threshold for page inclusion appears to be a useful method by which system efficiency can be improved without significantly harming home page finding effectiveness. The use of hyperlink recommendation evidence as a threshold resulted in a reduction of 10% of the corpus without any change in retrieval effectiveness. By comparison, using a URL-type threshold of "> File", corpus size was reduced by over 90%, and retrieval effectiveness was significantly improved for two-out-of-three collections.

12 This was later investigated further by Hawking et al. [115] who found that the use of external anchor-text did not improve retrieval effectiveness.
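The URL-based measures compared above (the Root/Subroot/Directory/File categorisation, URL length in characters, and directory depth as the number of slashes) can all be computed from the URL string alone. The sketch below is an illustrative approximation only: the tier rules shown are my reading of the four-tier scheme, not the exact definitions used in the thesis experiments, and the helper names are invented.

```python
# Sketch of the URL-based query-independent measures discussed above.
# The tier rules here approximate the Root/Subroot/Directory/File scheme;
# the thesis' exact definitions may differ in detail.
from urllib.parse import urlparse

def url_tier(url: str) -> str:
    path = urlparse(url).path
    for index in ("index.html", "index.htm", "default.htm"):  # treat /index.html like /
        if path.lower().endswith(index):
            path = path[: -len(index)]
    if path in ("", "/"):
        return "Root"
    if path.endswith("/"):
        return "Subroot" if path.strip("/").count("/") == 0 else "Directory"
    return "File"                      # path ends in a filename

def url_length(url: str) -> int:
    return len(url)                    # shorter URLs are favoured

def directory_depth(url: str) -> int:
    return urlparse(url).path.count("/")   # shallower URLs are favoured

for u in ("http://www.example.edu/",
          "http://www.example.edu/library/",
          "http://www.example.edu/library/branches/",
          "http://www.example.edu/library/branches/hours.html"):
    print(f"{url_tier(u):9s} len={url_length(u):2d} depth={directory_depth(u)}  {u}")
```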
Re-ranking query-dependent baselines (both content and anchor-text) on the basis of URL-type produced consistent benefit. This heuristic would be a valuable component of a home page finding system for web corpora with explicit hierarchical structure.

By contrast, in these experiments, unless Optimal re-ranking is used, hyperlink-based recommendation schemes do not achieve significant effectiveness gains. Even on the WT10gC collection, on which the re-ranking cutoffs were trained, the recommendation results were poor. For corpora of up to twenty million pages, the hyperlink recommendation methods do not appear to provide benefits in document ranking for a home page finding task. Similarly, little benefit has previously been found for relevance-based retrieval in the TREC web track [121]. An alternative means of biasing towards pages that are heavily linked-to, by re-weighting the anchor-text ranking formula to favour large volumes of anchor-text, is investigated in Chapter 8.

An ideal home page finding system would exploit both anchor-text (for superior performance when targeting popular sites) and document full-text information (to ensure that home pages with inadequate anchor-text are not missed). While the preliminary content+anchor-text baseline presented here goes some way to investigating combined performance, further work is needed to better understand whether this combination is optimal.

Further examination is required to determine how to provide the best all-round search effectiveness when home page queries are interspersed with other query types. Additional work is also required to determine whether evidence useful in home page finding is useful for other web retrieval tasks (such as Topic Distillation). These issues are investigated in Chapter 9, through the description and evaluation of a first-cut general-purpose document ranking function that incorporates web evidence.
Chapter 8

Anchor-text in web search

Full-text ranking algorithms have been used to score aggregate anchor-text evidence with some success, both in experiments within this thesis (see Chapters 6 and 7), and in experiments reported elsewhere [56]. When comparing the textual contents of document full-text and aggregate anchor-text it is clear that, in many cases, they differ markedly. For example, aggregate anchor-text sometimes contains extremely high rates of term repetition. Excessive term repetition may make a negligible (or even negative1) contribution to full-text evidence, but may be a useful indicator in anchor-text evidence. This is because each term occurrence could indicate an independent "vote" from an external author that the document is a worthwhile target for that term.

This chapter examines whether the Okapi BM25 and Field-weighted Okapi BM25 ranking algorithms, previously used with success in scoring both document full-text and aggregate anchor-text [56, 173], can be revised to better match anchor-text evidence.

The investigation is split into three sections. The first section presents an investigation of how the Okapi BM25 full-text ranking algorithm is applied when scoring aggregate anchor-text. This includes an analysis of how the document and collection statistics used in BM25 (and commonly used in other full-text ranking algorithms) might be modified to better score aggregate anchor-text evidence. The second section examines four different methods for combining the aggregate anchor-text evidence with other document evidence. The third and final section provides an empirical investigation of the effectiveness of the revised scoring methods, for both combined anchor-text and full-text evidence, and anchor-text alone.

8.1 Document statistics in anchor-text

This section examines how document statistics used in the Okapi BM25 ranking function, and other full-text ranking methods (see Section 2.3.2), apply to aggregate anchor-text evidence.

1 As it could indicate a spam document, which was designed explicitly to be retrieved by the search system in response to that query term.
8.1.1 Term frequency

In full-text document retrieval, term frequency (tf) is used to give some measure of the "aboutness" of a document (see Section 2.3.1.2). The underlying assumption is that if a document repeats a term many times, it is likely to be about that term.

The distribution of tfs in aggregate anchor-text appears to be quite different from that in document full-text. For example, an analysis of the term distribution in anchor-text and full-text for the "World Bank projects" home page2 illustrates how tfs can differ markedly. In the aggregate anchor-text for this document the term "projects" has a tf of 6798 (and makes up approximately 80% of all incoming anchor-text). By comparison, in the document full-text the term "projects" has a tf of only 5 (and makes up approximately 4% of the total document full-text).

As shown in Figure 8.1, when using the default term saturation parameter (k1 = 2) Okapi BM25 scores are almost flat beyond a tf of 10. This may not be a desirable property when scoring aggregate anchor-text, as each occurrence of a query term may be a separate vote that the term relates to the contents of the document. The early saturation of term contribution can be particularly problematic when combining document scores in a linear combination (see Section 2.5.1.1). For example, taking the "World Bank projects" home page again, if another corpus document (of average length) has only 60 occurrences of the term "projects" in incoming links (6 738 fewer occurrences than in the "World Bank projects" home page anchor-text), but the document full-text contains "projects" ten times (four more occurrences than in the full-text of the "World Bank projects" home page), that page will outperform the home page when measures are combined using a linear combination of Okapi BM25 scores (using default k1 and b parameters).

Changing the rate of saturation for anchor-text, through modification of the Okapi BM25 k1 value, is one method by which the impact of high aggregate anchor-text term frequencies might be changed. For example, Figure 8.1 illustrates that given a higher k1 value, the function saturates more slowly, thereby allowing for higher term counts before complete function saturation. However, if this evidence is to be combined with other document evidence (computed using different Okapi BM25 parameters) using a linear combination, then scores have to be renormalised.

This analysis suggests that when scoring aggregate anchor-text evidence the use of a much higher value of k1 may be effective.3 A change in saturation rate is explored below, through length normalising aggregate anchor-text contribution using the length of document full-text.

8.1.2 Inverse document frequency

Inverse document frequency (idf) is used in full-text ranking to provide a measure of the frequency of term occurrence in documents within a corpus, and thereby a measure of the importance of observing a term in a document or query (see Section 2.3.1.2).

2 Located at: http://www.worldbank.org/projects/
3 Time did not permit confirmation of the benefits of this.
[Figure 8.1: plot of document score against tf for BM25 with k1 = 0, 1, 2 and 10.]

Figure 8.1: Document scores achieved by BM25 using several values of k1 with increasing tf, assuming a document of average length, and N = 100 000, nt = 10.

The idf measure is likely to be useful when scoring aggregate anchor-text (i.e. assigning more weight to query terms that occur in fewer documents). However, it is unclear whether idf values should be calculated across all document fields4 at once (i.e. one idf value per document), or individually for each document field (i.e. one idf value per field, per document). Accordingly, two possible idf measures are proposed:

• Global inverse document frequency (gidf): A single idf value is computed per term.

• Field-based inverse document frequency (fidf): Multiple idf values are computed per term, one per field (i.e. per type of query-dependent evidence).

There are situations in which gidf and fidf scores vary considerably. For example, while the term "Microsoft" occurs in 16 330 documents in the TREC WT10g corpus (see Section 2.6.7.1), it occurs in the aggregate anchor-text for only 532 documents. "Microsoft" would have a low gidf in WT10g because many documents in the corpus mention it, but a relatively high fidf as few documents are the targets of anchor-text containing that term.

A comprehensive comparison of the effectiveness of gidf and fidf measures was not performed, although a limited examination was performed as part of the revised anchor-text formulations. A summary of the evaluated idf measures is presented in Table 8.1.

4 A field is a form of query-dependent evidence, for example document full-text, title or anchor-text.
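A small sketch may make the gidf/fidf distinction concrete. It assumes the BM25-style idf form log((N − nt + 0.5)/(nt + 0.5)) used elsewhere in this chapter; the toy documents, field names and counts below are invented, not corpus statistics.

```python
# Toy illustration of gidf vs fidf. Each "document" has two fields; idf
# uses the BM25-style form log((N - n_t + 0.5) / (n_t + 0.5)).
import math

docs = [
    {"fulltext": "microsoft research lab".split(),  "anchor": "microsoft home".split()},
    {"fulltext": "microsoft windows faq".split(),   "anchor": "windows faq".split()},
    {"fulltext": "university research".split(),     "anchor": "research home".split()},
    {"fulltext": "library opening hours".split(),   "anchor": "library".split()},
    {"fulltext": "campus maps".split(),             "anchor": "maps".split()},
]
N = len(docs)

def idf(n_t: int) -> float:
    return math.log((N - n_t + 0.5) / (n_t + 0.5))

def gidf(term: str) -> float:
    # One idf per term: a document counts if the term occurs in ANY field.
    return idf(sum(any(term in field for field in d.values()) for d in docs))

def fidf(term: str, field: str) -> float:
    # One idf per (term, field): only the named field is inspected.
    return idf(sum(term in d[field] for d in docs))

print(f"gidf('microsoft')           = {gidf('microsoft'):.2f}")
print(f"fidf('microsoft', 'anchor') = {fidf('microsoft', 'anchor'):.2f}")
```

As in the "Microsoft"/WT10g example above, the field-based value for the anchor-text field comes out higher than the global value, because fewer documents carry the term in their incoming anchor-text than mention it anywhere.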
    138 Anchor-text inweb search Abbreviation Description Described in BM25 Default Okapi BM25 Section 2.3.1.3 (calculates field-based idf values) BM25gidf Okapi BM25 with global idf statistics, such Section 8.1.2 that idf is calculated only once using all document fields BM25FW Default Field-weighted Okapi BM25 Section 2.5.2.1 (a single global idf value is calculated across all document fields) BM25FWfidf Field-weighted Okapi BM25 with field-based Section 8.1.2 idf values Table 8.1: Summary of idf variants used in ranking functions under examination. 8.1.3 Document length normalisation Document length normalisation is used in full-text ranking algorithms to reduce bias towards long documents. This bias occurs because the longer a piece of text, the greater the likelihood that a particular query term will occur in it (see Section 2.3.1.3). In Okapi BM25 the length normalisation function is controlled by b, with b = 1 en- forcing strict length normalisation, and b = 0 removing length normalisation. Using Okapi BM25 with the default length normalisation parameter (b = 0.75) [56, 186], slightly longer documents are favoured. This was shown to be effective when scoring document full-text in TREC ad-hoc tasks, as slightly longer (full-text) documents were found to be more likely to be judged relevant [186] (described in Section 2.3.1.3). The length of aggregate anchor-text is usually dependent on the number of in- coming links. Therefore, applying length normalisation to aggregate anchor-text, and thereby reducing the contribution of terms that occur in long aggregate anchor-text, is in direct contrast to the use of hyperlink recommendation algorithms. Aggregate anchor-text length is also much more variable than document full-text length, with many documents having little or no anchor-text, and some having a very large amount of incoming anchor-text (attributable to the power law distribution of links amongst pages, see Section 2.4). In the TREC .GOV corpus (see Section 2.6.7.1) the average full-text document length is around 870 terms.5 By comparison, the aver- age aggregate anchor-text length is only 25 words. An example of the negative effects of aggregate anchor-text length normalisation can be studied for the query “USGS” on the .GOV corpus. Figure 8.2 contains the aggregate anchor-text distribution for the home page of the United States Geolog- 5 Not including binary documents.
[Figure 8.2 (pie chart): USGS 23%, SURVEY 10%, GEOLOGICAL 10%, US 10%, HOME 8%, Other 39%.]

Figure 8.2: Aggregate anchor-text term distribution for the USGS home page (http://www.usgs.gov) from the .GOV corpus. This page has the highest in-degree of all .GOV pages (around 88 000 links) and an aggregate anchor-text length of around 170 000 terms.

[Figure 8.3 (pie chart): INFORMATION 50%, USGS 50%.]

Figure 8.3: Aggregate anchor-text term distribution for "http://nh.water.usgs.gov/USGSInfo" from the .GOV corpus. This page has 243 incoming links, and an aggregate anchor-text length of around 486 terms.
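Distributions like those in Figures 8.2 and 8.3 can be produced directly from a link table. The sketch below aggregates anchor-text per target URL and reports per-term proportions; the link records shown are invented and stand in for the .GOV link data.

```python
# Build aggregate anchor-text per target page from (source, target, text)
# link records and print each page's term distribution, as in Figures 8.2
# and 8.3. The link records below are invented.
from collections import Counter, defaultdict

links = [
    ("http://a.gov/1", "http://www.usgs.gov/",              "USGS home"),
    ("http://b.gov/2", "http://www.usgs.gov/",              "US Geological Survey"),
    ("http://c.gov/3", "http://www.usgs.gov/",              "USGS"),
    ("http://d.gov/4", "http://nh.water.usgs.gov/USGSInfo", "USGS information"),
]

aggregate = defaultdict(list)            # target URL -> list of anchor terms
for _source, target, text in links:
    aggregate[target].extend(text.lower().split())

for target, terms in aggregate.items():
    shares = {t: round(100 * c / len(terms))
              for t, c in Counter(terms).most_common()}
    print(target, shares)
```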
    140 Anchor-text inweb search ical Survey (USGS), the most highly linked-to document in the .GOV corpus. For comparison, Figure 8.3 contains the aggregate anchor-text distribution for a “USGS info” page (http://nh.water.usgs.gov/USGSInfo). The USGS home page has around 170 000 terms in its aggregate anchor-text (from around 88 000 incoming links), 34 000 of which (23%) are “USGS”. By contrast, http://nh.water.usgs. gov/USGSInfo has 486 terms in its aggregate anchor-text (from 243 incoming links), of which half (243) are “USGS”. Considering only aggregate anchor-text evidence and using the default Okapi BM25 length normalisation parameter (b = 0.75), the http://nh.water.usgs.gov/USGSInfo page outperforms the USGS home page for the query “USGS”! An illustration of the effects of Okapi BM25 length normalisation of aggregate anchor-text (and other document fields) for a one term query, is presented in Fig- ure 8.4. This Figure contains plots for both length normalised and unnormalised Okapi BM25 scores for documents with three different proportions of matching terms. The average document length (avdl) value is set to the average aggregate anchor-text length in the .GOV corpus (25 terms). The idf value is set such that the probability of encountering a term in a document is one-in-one-thousand (nt = 1000, N = 100 000). The Okapi BM25 k1 parameter is set to 2. In the top plot document scores are length normalised, and in the bottom they are not. The top plot shows that, when using a default length normalisation (b = 0.75) value, it is impossible for a document with only 25% of terms matching the query to be ranked above a document where 50% of the terms match the query, even when comparing a tf of 5 to a tf of 2 500 000. When using the Field-weighted Okapi BM25 method (BM25FW , described in Section 2.5.2.1) the negative effects associated with aggregate anchor-text length nor- malisation can be even more severe. The field-weighting method combines all doc- ument evidence (including aggregate anchor-text) into a single composite document and then uses the combined composite document length to normalise term contribu- tion. Due to document length normalisation, it is unlikely that a document with a large number of incoming links will be retrieved for any query. This is even the case if the document full-text as well as the aggregate anchor-text mention the query term more than any other document in the corpus. A summary of the evaluated length normalisation techniques is presented in Ta- ble 8.2. 8.1.3.1 Removing aggregate anchor-text length normalisation One approach to dealing with the length normalisation issue outlined above is to eliminate aggregate anchor-text length from consideration. In the Okapi BM25 for- mulation length normalisation is controlled by the b constant, so length normalisation can be removed by setting b = 0. The bottom plot of Figure 8.4 represents the Okapi BM25 scores for documents with three different proportions of matching terms, with no length normalisation (b = 0) and other parameters as specified in the section above (avdl = 25, N = 100 000, nt = 1000, and k1 = 2). Without length normalisation the proportion of terms that match is ignored and the sheer volume of matching anchor-
[Figure 8.4: two plots of BM25 score against document length (10 to 10^7 terms) for documents with 75%, 50% and 25% of terms matching the query; top plot k1 = 2, b = 0.75, bottom plot k1 = 2, b = 0.]

Figure 8.4: The effect of document length normalisation on BM25 scores for a single term query. Each line represents a document containing some proportion of terms that match a query term (i.e. 25% of terms match is a document where one-out-of-four document terms match the query term). The graph illustrates the change in scores when the number of document terms increases. For example, if 75% of terms match in a document that contains 1000 terms, a total of 750 term matches have been observed. BM25 scores are calculated assuming an avdl equal to the average aggregate anchor-text length in the .GOV corpus (25 terms), idf values are calculated using N = 100 000 and nt = 1000, and k1 is set to 2. The top plot shows the Okapi BM25 document scores when using the default length normalisation parameter (b = 0.75). The bottom plot gives the Okapi BM25 scores without length normalisation (b = 0). For the length normalised documents (top), even as the number of term matches is increased, the "proportion" of terms that match the query is still the most important factor. By comparison, without length normalisation (bottom), only the raw frequency of term matches is important.
text terms is considered. This favours documents that have a large number of incoming links and that may therefore be expected to achieve high hyperlink recommendation scores. The revised formulation of Okapi BM25 (with k1 = 2) for a document D, and a query Q, containing terms t is:

BM25nodln(D, Q) = \sum_{t \in Q} \frac{tf_{t,D} \times \log\left(\frac{N - n_t + 0.5}{n_t + 0.5}\right)}{2 + tf_{t,D}}    (8.1)

In the BM25FW formulation, the aggregate anchor-text length may be omitted when computing the composite document length. In these experiments the removal of aggregate anchor-text length in the BM25FW formulation is referred to as BM25FWnoanchdln.

Abbreviation      Description                                                 Described in
BM25              Default Okapi BM25 formulation (length normalisation       Section 2.3.1.3
                  using the field length)
BM25nodln         Okapi BM25 using no length normalisation                   Section 8.1.3.1
BM25contdln       Okapi BM25 using full-text length to normalise score       Section 8.1.3.2
BM25FW            Default Field-weighted Okapi BM25 (length normalised       Sections 2.5.2.1 & 8.2.2
                  using the composite document length, which is the sum
                  of all field lengths)
BM25FWnoanchdln   Field-weighted Okapi BM25 length normalised using the      Section 8.1.3.1
                  lengths of every field except for anchor-text

Table 8.2: Summary of document length normalisation variants in ranking functions under examination.

8.1.3.2 Anchor-text length normalisation by other document fields

Rather than using the length of aggregate anchor-text to normalise anchor-text scores, it might be more effective to normalise aggregate anchor-text using the length of another document field. For example, the length of document full-text could be used to normalise aggregate anchor-text term contribution. Document length is known to be useful query-independent evidence for some tasks (see Section 2.3.1.2) [186].

In experiments within this chapter, the use of document full-text length when scoring aggregate anchor-text in the Okapi BM25 formulation is referred to as BM25contdln. This approach may be more efficient than using individual field lengths, as only the full-text document lengths need to be recorded.
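A sketch of the variants in Table 8.2 is given below. The term weight follows the shape of Equation 8.1 (no (k1 + 1) numerator factor, which does not affect ranking within a single field); the thesis' exact BM25 formulation in Section 2.3.1.3 may differ in detail, and the example reuses the approximate "USGS" counts quoted earlier purely as an illustration.

```python
# Sketch of the anchor-text length-normalisation variants in Table 8.2.
# Term weights follow Equation 8.1's shape: idf * tf / (K + tf), where K
# collapses to k1 when b = 0. Counts in the example are approximate.
import math

def idf(n_t, N):
    return math.log((N - n_t + 0.5) / (n_t + 0.5))

def bm25_term(tf, dl, avdl, n_t, N, k1=2.0, b=0.75):
    K = k1 * ((1 - b) + b * dl / avdl)       # length-normalised saturation point
    return idf(n_t, N) * tf / (K + tf)

def bm25_nodln_term(tf, n_t, N, k1=2.0):
    # Equation 8.1: no length normalisation (b = 0).
    return bm25_term(tf, dl=0, avdl=1, n_t=n_t, N=N, k1=k1, b=0.0)

def bm25_contdln_term(tf, fulltext_len, avdl, n_t, N, k1=2.0, b=0.75):
    # BM25contdln: anchor-text tf normalised by the FULL-TEXT length.
    return bm25_term(tf, fulltext_len, avdl, n_t, N, k1, b)

# "USGS" example: ~34 000 matches in ~170 000 anchor terms (USGS home page)
# versus 243 matches in 486 anchor terms (USGSInfo page); anchor avdl ~= 25.
home, info = (34_000, 170_000), (243, 486)
print("default (b=0.75):",
      round(bm25_term(*home, 25, 1000, 100_000), 2), "vs",
      round(bm25_term(*info, 25, 1000, 100_000), 2))
print("no dln  (b=0)   :",
      round(bm25_nodln_term(home[0], 1000, 100_000), 2), "vs",
      round(bm25_nodln_term(info[0], 1000, 100_000), 2))
```

With default length normalisation the smaller, "purer" USGSInfo page scores above the heavily linked-to home page; with normalisation removed the ordering reverses, reflecting the trade-off discussed in this section.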
    §8.2 Combining anchor-textwith other document evidence 143 8.2 Combining anchor-text with other document evidence Four methods for combining anchor-text baselines with other document evidence are investigated: • BM25LC : a linear combination of Okapi BM25 scores; • BM25FW : a combination using Field-weighted Okapi BM25 (described in Sec- tion 2.5.2.1); • BM25HYLC : a linear combination of Okapi BM25 and Field-weighted Okapi BM25 scores; and • BM25FWSn, BM25FWSnI : a combination of the best scoring anchor-text snip- pet (repeated according to in-degree in BM25FWSnI ) with other document evi- dence using the Field-weighted Okapi BM25 method. In all cases the 2002 Topic Distillation task (TD 2002) was used to train combination and ranking function parameters. It was later observed (both in experiments in Chap- ter 9, and for reasons outlined in Section 2.6.7.2) that due to the informational nature of the TD 2002 task it may not have been the most appropriate training set for these navigational tasks [53]. Further gains may be achieved by re-training parameters for a navigational-based search task (as used in Section 9.1.5). 8.2.1 Linear combination In experiments within this chapter a linear combination of Okapi BM25 scores for full-text and aggregate anchor-text is explored. Document title is not considered sep- arately, and is scored as part of the document full-text baseline. A document score D for a query Q is then: BM25LC(D, Q) = BM25(C + T, Q) + αBM25(A, Q) (8.2) where C + T is the full-text and title of document D, A is the aggregate anchor-text for document D, and α is tuned according to the expected contribution of anchor-text evidence. Conceptually the linear combination assigns separate scores to document full-text and aggregate anchor-text, considering them as independent descriptions of docu- ment content. The BM25 linear combination constant was trained on the TD 2002 task, leading to α = 3. 8.2.2 Field-weighted Okapi BM25 The BM25FW formulation (see Section 2.5.2.1) includes three document fields: doc- ument full-text (content), aggregate anchor-text, and title. The weights for each of these fields were derived by Robertson et al. [173] for the TD 2002 task (content:1, anchor-text:20, title:50, k1 = 3.4, b = 0.85). In this chapter, the document fields scored
    144 Anchor-text inweb search are represented in brackets after BM25FW , with default fields of full-text, anchor-text and title indicated by BM25FW ((C, A, T), Q) for query Q. 8.2.3 Fusion of linear combination and field-weighted evidence A hybrid combination can be performed by grouping and scoring like document evi- dence with Field-weighted Okapi BM25, and combining independent document evi- dence using a linear combination of scores. The split examined in experiments in this chapter is between document-level and web-based evidence: the document full-text and title are scored independently from the externally related aggregate anchor-text. This approach is referred to as (BM25HYLC). BM25HYLC(D, Q) = BM25FW ((C, T), Q) + αBM25gidf (A, Q) (8.3) 8.2.4 Snippet-based anchor-text scoring An alternative to scoring documents based on their aggregate anchor-text is to score documents according to their best matching anchor-text snippet.6 Overlap between different forms of document evidence may be reduced through snippet-based rank- ing. When using full-text ranking algorithms to score aggregate anchor-text evidence, there may be overlap in the document features used to score documents. For exam- ple, a document that has a large number of in-links, is also likely to have a high tf for a particular term (see “USGS” example in Section 8.1.3). Additionally, the aggregate anchor-text for a document with a large number of incoming links is likely to be long, and so will be impacted by document length normalisation. Snippet-based scores are collected by scoring every snippet of anchor-text pointing to each document, and using the highest scoring snippet per document.7 These snip- pets are then combined with other document evidence using Field-weighted Okapi BM25 with snippet-based collection and document statistics.8 Whilst these may not be the best formulations of snippet statistics, they are consistent with the derivations used in Okapi BM25. Two snippet-based scoring functions were considered: BM25FWSn and BM25FWSnI . BM25Sn combines a single occurrence of the best scoring snippet with other document evidence using Field-weighted Okapi BM25. BM25FWSnI combines the best scoring snippet repeated according to document in-degree with other docu- ment evidence using Field-weighted Okapi BM25.9 The evaluated snippet based runs are reported in Table 8.3. 6 An anchor-text snippet is the anchor-text of a single link pointing to a document. 7 This is a computationally-expensive operation, as all non-duplicate snippets require individual scor- ing at query time. 8 The statistics were adapted as follows: term frequency was set to within snippet term frequency, in- verse document frequency to the frequency of terms within snippets and document length as the length of a particular snippet. 9 Time did not allow for the investigation of further snippet ranking combinations.
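Before the results, a small sketch may help distinguish the combination styles above. The scorers below are deliberately simplified stand-ins (constant idf, no length normalisation) so that the example is self-contained; they are not the thesis' ranking code, and the content/title weights are simply those quoted for BM25FW in Section 8.2.2.

```python
# Simplified stand-ins for the combination styles of Section 8.2.

def toy_bm25(field_terms, query_terms, k1=2.0):
    # Saturated tf with a constant idf of 1; illustrative only.
    return sum(field_terms.count(q) / (k1 + field_terms.count(q)) for q in query_terms)

def toy_bm25_fw(fields_with_weights, query_terms, k1=2.0):
    # Field-weighted style: pool weighted tfs into one composite document,
    # then saturate once (cf. BM25FW, Section 2.5.2.1).
    score = 0.0
    for q in query_terms:
        tf = sum(w * field.count(q) for field, w in fields_with_weights)
        score += tf / (k1 + tf)
    return score

def bm25_lc(doc, q, alpha=3.0):        # Equation 8.2: content+title vs anchor-text
    return toy_bm25(doc["fulltext"] + doc["title"], q) + alpha * toy_bm25(doc["anchor"], q)

def bm25_hylc(doc, q, alpha=3.0):      # Equation 8.3: field-weighted (C, T) plus anchor-text
    return (toy_bm25_fw([(doc["fulltext"], 1), (doc["title"], 50)], q)
            + alpha * toy_bm25(doc["anchor"], q))

doc = {"fulltext": "world bank projects database".split(),
       "title": "projects".split(),
       "anchor": "projects projects world bank projects".split()}
print(bm25_lc(doc, ["projects"]), bm25_hylc(doc, ["projects"]))
```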
Abbreviation   Description                                             Described in
BM25FWSn       Field-weighted Okapi BM25 using the best matching       Section 8.2.4
               anchor-text snippet as the anchor-text component.
BM25FWSnI      Field-weighted Okapi BM25 using the best matching       Section 8.2.4
               anchor-text snippet repeated according to document
               in-degree as the anchor-text component.

Table 8.3: Summary of snippet-based document ranking algorithms under examination.

8.3 Results

This section provides an empirical investigation of the effectiveness of the revised scoring methods. Effectiveness was evaluated using an automatic site-map based experiment on a university web, and using test collections from the 2002 and 2003 TREC web tracks. The TREC tasks studied were a named page finding task (NP2002), the 2003 combined home page finding / named page finding task (HP/NP2003), and the 2003 Topic Distillation task (TD2003). TREC web track corpus and task details are outlined in Section 2.6.7.

8.3.1 Anchor-text baseline effectiveness

The effectiveness of the aggregate anchor-text scoring techniques was evaluated using a set of 332 navigational queries over a corpus of 80 000 web pages gathered from a university web. The navigational queries were sourced using the automatic site map method (described in Section 2.6.5.3).

Ranking function   Score   Rank
BM25               61      62
BM25contdln        100     1
BM25nodln          100     1

Table 8.4: Okapi BM25 aggregate anchor-text scores and ranks for length normalisation variants. The "Score" and "Rank" are the normalised scores and ranks achieved for the correct answer to the query 'library' on the university corpus.
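The measures reported for the full query set in Table 8.5 below (MRR and P@1) are computed from the rank of the single correct answer for each navigational query. A minimal sketch, with invented rankings rather than the university-corpus runs:

```python
# MRR and P@1 over navigational queries: each query has one correct answer,
# MRR averages 1/rank of that answer, P@1 counts answers found at rank one.
# The rankings below are invented.

def evaluate(runs):
    reciprocal_ranks, at_one = [], 0
    for ranking, answer in runs:
        rank = ranking.index(answer) + 1 if answer in ranking else None
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
        at_one += int(rank == 1)
    return sum(reciprocal_ranks) / len(runs), at_one / len(runs)

runs = [
    (["u/library", "u/about", "u/maps"], "u/library"),   # correct answer at rank 1
    (["u/about", "u/library"],           "u/library"),   # rank 2
    (["u/maps", "u/news"],               "u/library"),   # not retrieved
]
mrr, p_at_1 = evaluate(runs)
print(f"MRR = {mrr:.2f}, P@1 = {p_at_1:.2f}")            # MRR = 0.50, P@1 = 0.33
```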
Table 8.4 shows the ranks and normalised scores achieved by the best answer in response to the query "library" when using only aggregate anchor-text. When incorporating aggregate anchor-text length normalisation in Okapi BM25, the correct answer was severely penalised, as the aggregate anchor-text length was 13 484 words (262 times the average length in the collection). This was despite the document having very high term frequency (tf) for the query term (1664). In contrast, both BM25contdln and BM25nodln placed the best answer at rank one, but scored it only slightly above many other candidate documents. In fact, the score was only 1% higher than the home page of a minor library whose tf was a factor of 7.5 lower. Due to the small difference in scores assigned for anchor-text evidence, if these scores were combined with other document scores in a linear combination, the ranking of documents might change. To increase the contribution of strong anchor-text matches, the weight of anchor-text evidence must be increased and/or the saturation rate of anchor-text changed. An anchor-text ranking function that does not saturate anchor-text term contribution is presented in the following chapter (AF1, in Section 9.1.3.1).

Table 8.5 shows results for the full set of 332 navigational queries processed over the university corpus. Wilcoxon tests show that using full-text document length (BM25contdln) to length normalise aggregate anchor-text significantly (p < 0.02) improved effectiveness relative to the case of no length normalisation (BM25nodln). Further, both BM25contdln and BM25nodln were superior to the default Okapi BM25 formulation (p < 10^-5).

Ranking function   MRR    P@1
BM25               0.61   0.47
BM25contdln        0.72   0.63
BM25nodln          0.70   0.61

Table 8.5: Effectiveness of Okapi BM25 aggregate anchor-text length normalisation techniques on the university corpus. MRR depicts the Mean Reciprocal Rank of the first correct answer; P@1 is precision at 1, the proportion of queries for which the best answer was returned at rank one.

8.3.2 Anchor-text and full-text document evidence

This section examines the results from experiments that combine the new anchor-text scoring methods with document full-text evidence. Combined runs are evaluated using TREC web track test collections from 2002 and 2003 (discussed in Section 2.6.7).
    §8.3 Results 147 8.3.2.1Field-weighted Okapi BM25 combination Table 8.6 shows the results from the Field-weighted Okapi BM25-based (BM25FW ) experiments. Task Ranking Function C A T P@1 P@10 MRR Sig. NP2002 BM25FW 1 50 20 0.59 0.82 0.68 - NP2002 BM25FWnoanchdln 1 50 20 0.59 0.87 0.68 - NP2002 BM25FW 1 500 20 0.49 0.78 0.60 - NP2002 BM25FWnoanchdln 1 500 20 0.52 0.85 0.63 - TD2003 BM25FW 1 50 20 0.10 0.09 0.10 - TD2003 BM25FWnoanchdln 1 50 20 0.18 0.09 0.13 *+ TD2003 BM25FW 1 500 20 0.17 0.08 0.09 - TD2003 BM25FWnoanchdln 1 500 20 0.20 0.09 0.13 *+ HP&NP2003 BM25FW 1 50 20 0.48 0.76 0.58 - HP&NP2003 BM25FWnoanchdln 1 50 20 0.63 0.85 0.71 *+ HP&NP2003 BM25FW 1 500 20 0.36 0.67 0.46 - HP&NP2003 BM25FWnoanchdln 1 500 20 0.59 0.84 0.68 *+ Table 8.6: Effectiveness of Field-weighted Okapi BM25. Three TREC web track tasks were evaluated; “NP2002” is the 2002 TREC web track named page finding task; “TD2003” is the 2003 TREC web track Topic Distillation task; and “HP&NP2003” is the 2003 TREC web track combined home page / name page finding task. “C” is the content weight (1 by default), “A” is the aggregate anchor-text weight (50 by default) and “T” is the title weight (20 by default). “Sig.” indicates whether improvements were significant (“*+”) over the BM25FW (C, A, T) baseline. Improvements for no length normalisation were only significant for TD2003 and the HP&NP2003 task. Performance decreased dramatically when up-weighting the aggregate anchor-text field while including aggregate anchor-text in composite document length. The removal of aggregate anchor-text length from composite document lengths in the Field-weighted Okapi BM25 model (BM25FWnoanchdln) significantly improved performance in two-out-of-three tasks, and did not affect performance in the other. The results show that increasing the weight of aggregate anchor-text by an order of magnitude in BM25FW exacerbates the negative effects of including aggregate anchor-text length in the composite document length. Combining BM25FW scores with hyperlink recommendation evidence might go some way to re-balancing the re- trieval of highly linked pages. The investigation of this potential is left for future work. Function parameters were optimised for composite document lengths that included aggregate anchor-text. It is likely that improvements achieved through the removal of aggregate anchor-text length might be increased through re-tuning Okapi BM25FW ’s document length (b) and saturation (k1) parameters. This is also left for future work.
Anchor-text snippets in Field-weighted Okapi BM25

The performance of the anchor-text snippet-based ranking functions is presented in Table 8.7. Both snippet-based runs performed poorly by comparison to the BM25FW runs. The snippet-based runs were also far less efficient than aggregate anchor-text runs, as statistics were calculated and stored for each link rather than each document (and there is an order of magnitude more links than documents in the .GOV corpus). Further investigation would be required to determine whether a snippet-based ranking could be effective. For example, effectiveness might be improved by re-optimising the Okapi BM25 parameters, or re-weighting snippets according to their origin (e.g. according to whether they are within-site or cross-site links) or according to some notion of source authority.

Ranking function   P@1    P@10   MRR    Sig.
BM25FW             0.10   0.09   0.10   -
BM25FWnoanchdln    0.18   0.09   0.13   *+
BM25FWSnip         0.06   0.04   0.04   *-
BM25FWSnipIDG      0.12   0.06   0.06   *-

Table 8.7: Effectiveness of anchor-text snippet-based ranking functions. The snippet runs performed poorly by comparison to the BM25FW runs for the 2003 TREC web track Topic Distillation task. "Sig." indicates whether improvements ("*+") or losses ("*-") were significant compared to the BM25FW(C, A, T) baseline.
• The "pure" linear combination performs poorly, most likely due to the use of aggregate anchor-text length normalisation.

Ranking function                       Comb   P@1    P@10   MRR    Sig.
BM25FW                                 FW     0.10   0.09   0.10   -
BM25FWnoanchdln                        FW     0.18   0.09   0.13   *+
BM25gidf(C) + BM25gidf(A)              LC     0.24   0.10   0.12   *+
BM25gidf(C) + BM25gidf,contdln(A)      LC     0.18   0.13   0.16   *+
BM25(C) + BM25contdln(A)               LC     0.18   0.12   0.16   *+
BM25FW(C, T) + BM25contdln(A)          HYLC   0.22   0.14   0.17   *+

Table 8.8: Effectiveness of the evaluated combination methods for TD2003. "TD2003" is the 2003 TREC web track Topic Distillation task. C is document full-text, A is aggregate anchor-text, and T is title. FW uses a Field-weighted Okapi BM25 combination, LC is a linear combination, and HYLC is a fusion of Field-weighted Okapi BM25 and linear combination. "Sig." indicates whether improvements ("*+") or losses ("*-") were significant compared to the BM25FW(C, A, T) baseline.

Table 8.9 contains the results for experiments on further TREC tasks. In all cases the linear combination methods are outperformed by the field-weighting method. This demonstrates potential differences between the tasks studied, and suggests that no one method considered here will achieve high effectiveness on all search tasks.

8.4 Discussion

The results for the Okapi BM25 modifications show that effectiveness was improved when length normalisation was not performed on aggregate anchor-text. Additional gains were achieved when aggregate anchor-text was normalised using document full-text length. The reason for this may be that full-text document length provides useful query-independent document evidence.

The removal of aggregate anchor-text from composite document lengths in the Okapi BM25FW formula improved or maintained retrieval effectiveness for all evaluated tasks. A re-tuning of the field-weighting weights without aggregate anchor-text in composite document length is required to determine whether further improvements can be attained. The removal of aggregate anchor-text length from composite document length normalisation favours documents with long aggregate anchor-text, as it is more likely that a link to a document containing the query term will be found. This preference for long aggregate anchor-text is similar to biasing towards heavily linked-to pages (except that a term match is assured). This may be a method by which query-independent hyperlink recommendation evidence can be more easily combined with query-dependent evidence.
    150 Anchor-text inweb search Task Ranking function Comb. P@1 P@10 MRR NP&HP2003 BM25FWnoanchdln FW 0.60 0.85 0.69 NP&HP2003 BM25(C) + BM25(A) LC 0.26 0.57 0.36 NP&HP2003 BM25gidf (C) + BM25gidf ,nodln(A) LC 0.47 0.71 0.56 NP&HP2003 BM25FW (C, T) + BM25contdln(A) HYLC 0.51 0.76 0.60 NP2002 BM25FWnoanchdln FW 0.56 0.87 0.67 NP2002 BM25(C) + BM25(A) LC 0.33 0.65 0.44 NP2002 BM25gidf (C) + BM25gidf ,nodln(A) LC 0.26 0.51 0.35 NP2002 BM25FW (C, T) + BM25contdln(A) HYLC 0.31 0.61 0.30 Table 8.9: Effectiveness of the evaluated combination methods for NP2002 and NP&HP2003. “NP2002” is the 2002 TREC web track named page finding task; and “HP&NP2003” is the 2003 TREC web track combined home page / name page finding task. “C” is document full-text, “A” is aggregate anchor-text, and “T” is title. FW uses a Field- weighted Okapi BM25 combination, LC is a linear combination, and HYLC is a fusion of Field-weighted Okapi BM25 and linear combination. Results for the hybrid combination strategy illustrate the benefits of treating document-level and web-based evidence as separate document descriptions. The hy- brid combination approach significantly outperformed other methods, equalling the best run submitted to TREC 2003 (discussed in Chapter 9). Computing a gidf and using document full-text to normalise all document fields was also an effective ap- proach, improving retrieval effectiveness as well as allowing for potential gains in efficiency by reducing the number of statistics per term. A “pure” linear combination of document evidence was significantly less effective and more costly (as document statistics were required for each form of evidence). In general, the results in this chapter illustrate an interesting trade-off when deal- ing with aggregate anchor-text. The trade-off is whether to favour documents which contain the most occurrences of a particular term in anchor-text (by employing no anchor-text aggregate length normalisation), or to favour documents whose aggregate anchor-text contains the greatest percentage of anchor-text that matches the query term (by employing full aggregate anchor-text length normalisation). The choice is akin to trading off the quantity of anchor-text for the “purity” of the aggregate anchor-text. If aggregate anchor-text is heavily length normalised, thereby encourag- ing anchor-text purity, hyperlink recommendation evidence could be used to counter the preference for short aggregate anchor-text by up-weighting pages with high link popularity. How best to address these issues is left for future work.
Chapter 9

A first-cut document ranking function using web evidence

The first-cut ranking function explored in this chapter combines document and web-based evidence found effective in previous experiments within this thesis. A weighted linear combination was used to combine this evidence. The weights for evidence and combination parameters were tuned for three sets of navigational queries using a hill-climbing algorithm. The tuned ranking function was evaluated through submissions to the TREC 2003 web track, and on data spanning several small-to-medium sized corporate web collections.

9.1 Method

The following sections outline:

• How the effectiveness of the ranking function was tested;
• The document-level and web evidence used in the ranking function;
• How document evidence was combined in the ranking function;
• The training data, and how the data were used to tune the ranking function; and
• The methods used to address the combined home page / named page finding task.

9.1.1 Evaluating performance

The first-cut ranking function was used to generate runs for participations in both the Topic Distillation (TD2003), and the combined home page / named page finding (HP/NP2003) tasks of the 2003 TREC web track (described in Section 2.6.7.2).

The goal of the TD2003 task was to study how well systems could find entry points to relevant sites given a broad query. The Topic Distillation task is nominally an informational task (see Section 2.6.3). However, the focus in Topic Distillation is quite
    152 A first-cutdocument ranking function using web evidence different from previous informational tasks studied in TREC. Topic Distillation stud- ies the retrieval of relevant resources, rather than relevant documents. The TD2003 submission studied in this chapter sought to determine whether the first-cut ranking function trained for navigational search (especially home page finding queries) would perform well for Topic Distillation. This training set was chosen in an effort to favour the retrieval of relevant resources rather than documents. The goal of the HP/NP2003 task was to study how well systems could retrieve both home page documents and other documents specified by their name, without prior knowledge of which queries were for named pages, and which were for home pages. The HP/NP2003 submission studied in this chapter examined different meth- ods for combining home page and named page based tunings into a single run. This included an investigation of whether best performance was achieved by tuning for both tasks at once, using a training set containing both types of queries, or through “post hoc” fusion of home page and named page tuned document rankings. A series of follow-up experiments used corpora gathered from several small cor- porate webs to provide a preliminary study of how the ranking function performed on diverse corporate-sized webs. In each case the effectiveness of the ranking function studied was compared to that of the incumbent search system. 9.1.2 Document evidence The ranking function included three important forms of document evidence: full-text, title and URL length. The query-dependent evidence (full-text and title) was scored using Okapi BM25 with tuned k1 and b parameters. The k1 and b parameters were tuned once per run rather than individually per field. The application of term stemming was also eval- uated (using the Porter stemmer [163], described in Section 2.3). Strict term coordi- nation was applied for all query-dependent evidence, with documents containing the most query terms ranked first. If combining Okapi BM25 scores computed for a mul- tiple term query in a linear combination without term co-ordination, a document that matches a single query term in multiple document fields can outperform a document that contains all query terms in a single field. The use of strict term co-ordination ensures that the first ranked document contains the maximum number of matched query terms in a document field. 9.1.2.1 Full-text evidence Okapi BM25 was used to score document full-text evidence (BM25(C)). Prior to scoring full-text evidence all HTML tags and comments were removed. For efficiency reasons global document length and global inverse document frequency (gidf ) values were used (described in Section 8.1.2).
9.1.2.2 Title evidence

Title text was scored independently of other document evidence using BM25 (BM25(T)). For efficiency reasons the BM25 title formulation used global document length and global inverse document frequency (gidf) values (described in Section 8.1.2).

9.1.2.3 URL length

URL lengths (URLlen) were capped at 127 for efficiency reasons. URLs longer than 127 characters were recorded as being 127 characters long.

9.1.3 Web evidence

Anchor-text and two forms of in-degree were included in the ranking function. PageRank and (simple) in-degree were not considered because of the relatively poor performance observed in previous experiments. Instead, two important sub-types of in-degree were examined: off-site and on-site in-degree [55].

9.1.3.1 Anchor-text

The Anchor Formula 1 (AF1) proposed here is an alternative to the revised anchor-text models presented in the previous chapter. In AF1, term frequency (tf) values are not saturated (as described in Section 8.1.1) and document length normalisation is removed (as described in Section 8.1.3). When AF1 values are multiplied by 1.7 (using the KWT parameter, see Section 9.1.4), the curve is similar to the BM25 saturation of an average length document for the first three term occurrences (with default Okapi parameters, see Figure 9.1).

The score for a document D, for query Q, over terms t, with aggregate anchor-text A according to AF1 is:

AF1(D, Q) = \sum_{t \in Q} \log(tf_{t,D} + 1) \times gidf_t    (9.1)

As term frequency scores in AF1 never saturate, term coordination must be enforced. Without term coordination a single term in a query may dominate. For example, if seeking "Microsoft Research Cambridge", the term "Microsoft" may dominate, potentially leading to the retrieval of a page (such as the Microsoft home page) that matches "Microsoft" strongly in its aggregate anchor-text but never matches "Research" or "Cambridge".

9.1.3.2 In-degree

The log values of on-site (IDGon) and off-site (IDGoff) in-degrees were normalised (according to the highest in-degree value for the collection) and quantised to 127 values (for efficiency reasons). This may have reduced ranking effectiveness, although
experience with the retrieval system in practical use suggests that there are minimal adverse effects associated with this normalisation.

[Figure 9.1: plot of document score against tf for AF1 and for BM25 with k1 = 0, 1, 2 and 10.]

Figure 9.1: Document scores achieved by AF1 and BM25 for values of tf. A document of average length is assumed, with the likelihood of encountering a term in the corpora one-in-one-thousand (using idf values of N = 100 000 and nt = 100).

9.1.4 Combining document evidence

The ranking formulation includes four key components: a query-dependent score and three query-independent scores.

• Query-dependent evidence: this component is a linear combination of document full-text, title, and AF1 anchor-text scores. The relative contribution of AF1 is controlled through the KWT parameter. The relative contribution of query-dependent evidence is controlled using the QD parameter. Full-text, title and anchor-text are combined using a linear combination with gidf values, a method previously demonstrated to be effective for home page and Topic Distillation tasks in Chapter 8. Term stemming was also evaluated (Stem).

• On-site in-degree: this component is the log normalised number of incoming on-site links (quantised to 127 values). The contribution of this component is controlled using the ON parameter.

• Off-site in-degree: this component is the log normalised number of incoming off-site links (quantised to 127 values). The contribution of this component is controlled using the OFF parameter.

• URL length: this component is the length, in characters, of the URL (for up to 127 characters). The contribution of this component is controlled using the URL parameter.
Accordingly, the score for a document D is computed by:

S(D, Q) = QD × (BM25gidf(C, Q) + BM25gidf(T, Q) + KWT × AF1(A, Q)) / max(BM25gidf(C, Q) + BM25gidf(T, Q) + KWT × AF1(A, Q))
        + ON × (IDGon_D / max(IDGon))
        + OFF × (IDGoff_D / max(IDGoff))
        + URL × ((max(URLlen) − URLlen_D) / max(URLlen))

Documents must also fulfil the constraints imposed through term coordination.

9.1.5 Test sets and tuning

Eight parameters (k1, b, KWT, QD, ON, OFF, URL and Stem) were tuned for each test set. The values explored for each parameter are as follows:

• k1 in steps of 0.25 between 0 and 4;
• b in steps of 0.25 between 0 and 1;
• KWT in steps of 1.7 between 0 and 17;
• QD, ON, OFF in steps of 2 between 0 and 20;
• URL in steps of 4 between 0 and 40; and
• Stem on or off.

The parameters were tuned using three test sets:

• Home page set (HPF): this training set was based on the http://first.gov government home page list. Queries and results were extracted from this document using the automatic site map method (see Section 2.6.5.3). The set consists of 241 queries whose results were home pages. The full query and result set is included as Appendix G.

• Named page set (NPF): this training set consists of the queries and relevance judgements (qrels) used in the TREC 2002 named page finding task (described in Section 2.6.7.2). The set consists of 150 queries whose results are named pages.1

• Both sets of queries (BOTH): this consists of all queries and relevance judgements used in HPF and NPF.

There are inherent limitations in the training sets employed. The set of home pages was taken from a .GOV portal, which may inadvertently have favoured prestigious, or larger and more popular home pages. Further, the named page tuning includes some home pages that were included in the 2002 NP task. This may have biased training towards home page queries. The BOTH set of queries included a disproportionate number of home page queries due to the presence of home pages in the NPF set, and because the HPF set was larger than the NPF set.

1 The results for some of the named page queries were home pages.
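The scoring formula above can be read as the following sketch. Per-document statistics (the BM25 field scores, quantised in-degrees, URL length and the set of matched query terms) are assumed to be precomputed, the dictionary keys are mine rather than the Panoptic implementation's, and the parameter settings in the usage example simply borrow the HPF-tuned values reported later in Table 9.1.

```python
# Sketch of the first-cut score S(D, Q) defined above. Per-document BM25
# scores, quantised in-degrees, URL lengths and matched-term sets are
# assumed to be precomputed; field names here are illustrative.
import math

def af1(anchor_tf, query_terms, gidf):
    # Equation 9.1: unsaturated, length-unnormalised anchor-text evidence.
    return sum(math.log(anchor_tf.get(t, 0) + 1) * gidf[t] for t in query_terms)

def score(doc, query_terms, gidf, maxima, QD, KWT, ON, OFF, URL):
    qd = doc["bm25_content"] + doc["bm25_title"] + KWT * af1(doc["anchor_tf"], query_terms, gidf)
    s = (QD  * qd / maxima["qd"]                     # maxima["qd"]: best query-dependent score among candidates
         + ON  * doc["idg_on"]  / maxima["idg_on"]
         + OFF * doc["idg_off"] / maxima["idg_off"]
         + URL * (maxima["urllen"] - doc["urllen"]) / maxima["urllen"])
    # Strict term coordination: documents matching more query terms always
    # rank first; the combined score only orders documents within a level.
    return (len(query_terms & doc["matched"]), s)

query  = {"library"}
gidf   = {"library": 2.3}
maxima = {"qd": 250.0, "idg_on": 127, "idg_off": 127, "urllen": 127}
doc = {"bm25_content": 4.1, "bm25_title": 2.0, "anchor_tf": {"library": 1664},
       "idg_on": 90, "idg_off": 60, "urllen": 28, "matched": {"library"}}
print(score(doc, query, gidf, maxima, QD=15, KWT=11.9, ON=1, OFF=6, URL=38))
```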
9.1.6 Addressing the combined HP/NP task

Three approaches for applying the ranking function to the combined HP/NP task were evaluated.

The first method was a tuning of parameters for both tasks simultaneously (i.e. using the BOTH tuning to generate a run). The second method summed the document scores achieved for each tuning. This is equivalent, in rank fusion and distributed IR terminology, to performing a combSUM of document HPF and NPF scores. The third and final method interleaved the ranked results for each run by taking a document from the top of each ranking in turn, and removing any already seen (duplicate) documents. For example, the first result in an HP/NP interleaving is the first ranked document for the HPF tuning, and the second result is the first ranked document for the NPF tuning.2 In an attempt to improve early precision, the interleaving order was swapped if a keyword indicative of a named page finding query was observed.3

9.2 Tuning

Parameters were tuned using a hill climbing algorithm with a complete exploration of two parameters at a time (at each step the parameters which achieved the highest retrieval effectiveness were stored and used for other tunings). The tuning stopped when a full tuning cycle completed without change in tuned values. Plots of the tuning process are provided in Figures 9.2 and 9.3. Figure 9.2 provides an example of the concurrent tuning of two function parameters (in this case the Okapi BM25 k1 and b values). Figure 9.3 shows plots for the rest of the tuning cycle.

The tuned values and effectiveness of the ranking function on the three training sets (HPF, NPF and BOTH) are reported in Table 9.1. The optimal tunings derived for each task differed significantly. The only consistent result was that the query-dependent component was important in all tunings.

The following observations can be made from the Home Page Finding (HPF) parameter tunings:

• The tuned Okapi BM25 term saturation parameter (k1 = 3.6) is higher than the default parameter of k1 = 2. This indicates that home pages may contain page naming text several times and that matching their name more than once is a good indicator of a home page match.

• The tuned Okapi BM25 length normalisation parameter (b = 1) is higher than the default parameter of b = 0.75. The tuning favoured a strict length normalisation of document full-text.4 This suggests that longer full-text content is no more likely to be a relevant home page.

2 So long as that document was not the same document retrieved at the first rank by the HPF tuning, in which case the next document in the NPF ranking is taken.
3 Query terms were selected from last year's query set and included terms such as "page", "form" and "2000".
4 Note that length normalisation is not present in the AF1 measure.
[Figure 9.2: surface plot of MRR@10 over the k_1 (0–4) and b (0–1) grid, with the remaining parameters held at their current values; the best MRR@10 found in this step is approximately 0.737.]

Figure 9.2: A plot illustrating the concurrent exploration of Okapi BM25 k1 and b values using the hill-climbing function. The values at which the best performance is achieved are stored (the highest point in the plot, represented by a "+") and used when tuning other values. The tuning stops when a full iteration of the tuning cycle completes without change in tuned values.
[Figure 9.3: three further MRR@10 surface plots from the same tuning iteration, exploring the anchor-text and content weights, the on-site and off-site in-degree weights, and the URL-length weight in turn, with the remaining parameters held at their current best values; the best MRR@10 rises from approximately 0.745 to 0.751 across these steps.]

Figure 9.3: A full iteration of the hill-climbing function. The first step in this iteration is illustrated in Figure 9.2. The tuning of parameters was performed using a hill climbing algorithm with complete exploration of two parameters at a time. The highest point (best performance) is represented by a "+", and the parameter values at that point are stored and used when tuning other values. The tuning stops when a full iteration completes without change in tuned values.
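The procedure shown in Figures 9.2 and 9.3 amounts to pairwise coordinate ascent over the parameter grids of Section 9.1.5. The sketch below is illustrative only: it uses a stand-in objective in place of MRR@10 on the training queries, and only a subset of the parameters.

```python
# Pairwise hill-climbing over parameter grids, as in Figures 9.2 and 9.3:
# exhaustively explore two parameters at a time, keep the best setting,
# and stop when a full cycle produces no change. The objective below is a
# stand-in for MRR@10 on a training set, and only four parameters are shown.
from itertools import combinations, product

GRID = {
    "k1":  [i * 0.25 for i in range(17)],   # 0 .. 4 in steps of 0.25
    "b":   [i * 0.25 for i in range(5)],    # 0 .. 1 in steps of 0.25
    "KWT": [i * 1.7  for i in range(11)],   # 0 .. 17 in steps of 1.7
    "QD":  list(range(0, 21, 2)),           # 0 .. 20 in steps of 2
}

def tune(objective, start):
    params, improved = dict(start), True
    while improved:                          # one full tuning cycle per iteration
        improved = False
        for p1, p2 in combinations(GRID, 2):
            best = objective(params)
            for v1, v2 in product(GRID[p1], GRID[p2]):
                trial = {**params, p1: v1, p2: v2}
                if objective(trial) > best:
                    best, params, improved = objective(trial), trial, True
    return params

# Stand-in objective with a known optimum, just to exercise the procedure.
target = {"k1": 3.5, "b": 1.0, "KWT": 11.9, "QD": 14}
objective = lambda p: -sum(abs(p[k] - target[k]) for k in target)
print(tune(objective, {name: values[0] for name, values in GRID.items()}))
```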
Test Set   MRR     k1    b     KWT    QD, ON, OFF, URL   Stem
HPF        0.846   3.6   1     11.9   15, 1, 6, 38       Y
NPF        0.522   4     0.2   1.7    18, 2, 0, 1        N
BOTH       0.715   0.8   0.4   8.5    20, 0, 6, 20       Y

Table 9.1: Tuned parameters and retrieval effectiveness. Parameters are as described in Section 9.1.5. "MRR" is the Mean Reciprocal Rank, as described in Section 2.6.6.2. "HPF" is the home page training set. "NPF" is the named page training set. "BOTH" contains both HPF and NPF.

• Anchor-text in the form of AF1 was important, with the KWT parameter performing best at 11.9.

• The contribution of off-site and on-site links was small, with off-site links more useful than on-site links.

• URL length once again proved to be an important contributor in home page finding.

• Stemming improved retrieval effectiveness.

From the Named Page Finding (NPF) parameter tunings:

• Like for the HPF tunings, a higher than normal Okapi BM25 k1 value was effective.

• Unlike in the HPF tunings, length normalisation did not improve effectiveness, with a low b value found to perform best.

• The contribution of URL, on-site in-degree, and off-site in-degree was small.

• Anchor-text was useful for the NPF task, although its contribution was far less than in the HPF task.

• Stemming adversely affected retrieval effectiveness.

In general the BOTH tuning was similar to the HPF tuning (indicating that home pages dominated in the tuning). The only large differences between BOTH and HPF were in the form of a much smaller tuned term saturation value (k1 = 0.8), and less length normalisation (b = 0.4).
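Before turning to the combined-task results in Section 9.2.1, the two fusion approaches described in Section 9.1.6 (combSUM of the HPF- and NPF-tuned scores, and rank interleaving with duplicate removal) can be sketched as follows. The keyword trigger that swaps the interleaving order is omitted, and the inputs are invented.

```python
# combSUM and interleaving of two tuned runs (Section 9.1.6). The keyword
# trigger that swaps the interleaving order is omitted; data are invented.

def comb_sum(hp_scores, np_scores):
    fused = {u: hp_scores.get(u, 0.0) + np_scores.get(u, 0.0)
             for u in set(hp_scores) | set(np_scores)}
    return sorted(fused, key=fused.get, reverse=True)

def interleave(hp_ranking, np_ranking):
    seen, merged = set(), []
    for pair in zip(hp_ranking, np_ranking):   # HP first, then NP, at each depth
        for doc in pair:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

hp = ["gov/home", "gov/about", "gov/report-2000"]
np = ["gov/report-2000", "gov/home", "gov/form"]
print(interleave(hp, np))   # ['gov/home', 'gov/report-2000', 'gov/about', 'gov/form']
print(comb_sum({"gov/home": 2.0, "gov/about": 1.0}, {"gov/home": 0.5, "gov/form": 1.2}))
```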
9.2.1 Combining HP and NP runs for the combined task

Results for the HP/NP combination methods tested on the combined test set (BOTH) are presented in Table 9.2.

Combination Method                  MRR on training set
Tuned for BOTH                      0.758
HPF and NPF combSUM                 0.489
HPF and NPF interleaved (HP,NP)     0.734

Table 9.2: Results for combined HP/NP runs on the BOTH training set.

The interleaving of runs performed similarly to tuning using BOTH types of query. This may be an effective method for combining the runs without prior tuning information. The performance of the linear combination (combSUM) was relatively poor.

9.3 Results

This section presents results from the empirical studies of the first-cut ranking function. The ranking function was evaluated, using the parameter tunings described above, for the two TREC 2003 web track tasks, and for navigational search on several corporate webs.

9.3.1 TREC 2003

This section sets out the results from the official TREC 2003 web track submissions. The Topic Distillation runs (csiro03td–) are presented first, followed by the combined HP/NP finding task runs (csiro03ki–).

9.3.1.1 Topic Distillation 2003 (TD2003) results

Results for the TD2003 web track task are presented in Table 9.3. The best of these runs (csiro03td03) achieved the highest performance of any system submission. This run used the HPF tuning and incorporated stemming. Further observations based on the Topic Distillation results are:

• The tuned k1 and b values offered some improvement (csiro03td01 versus csiro03td05). The effectiveness of the length normalisation parameter used (b = 1) suggests that longer pages are no more likely to be relevant in Topic Distillation.

• The new anchor-text ranking function AF1 was particularly effective, achieving gains of up to 60% (csiro03td03 versus not sub 05).
Description                                     Average R-Prec    Run Id
HPF (Stem = ON, ON = 0, OFF = 0)                0.170             not sub 01
HPF (Stem = ON)                                 0.164             csiro03td03
HPF (Stem = ON, ON = 0, OFF = 0, URL = 0)       0.149             not sub 02
HPF                                             0.144             csiro03td01
HPF (ON = 0, OFF = 0)                           0.143             not sub 03
HPF (k1 = 2, b = 0.75)                          0.127             csiro03td05
NPF                                             0.117             not sub 04
HPF (ON = 0, OFF = 0, URL = 0)                  0.116             csiro03td02
HPF (Stem = ON, KWT = 0)                        0.108             not sub 05
HPF (KWT = 0)                                   0.099             csiro03td04
HPF (Stem = ON, No Red./Dup.)                   0.147             not sub 06
HPF (No Red./Dup.)                              0.138             not sub 07
HPF (KWT = 0, No Red./Dup.)                     0.116             not sub 08
HPF (Stem = ON, KWT = 0, No Red./Dup.)          0.106             not sub 09

Table 9.3: Topic Distillation submission summary. "HPF" indicates that the home page finding tunings were used (tunings in Table 9.1). "NPF" indicates that the named page finding tunings were used (tunings also in Table 9.1). Other description notes indicate variations from the tuned parameters. "Run Id" reports the run identifier used in TREC experiments. "No Red./Dup." indicates that redirect and duplicate URL information was not used. Further runs were computed post hoc (not sub –).

• Hyperlink recommendation evidence was not effective. A post hoc run achieved slightly better performance (4%) when hyperlink recommendation evidence was removed (not sub 01).

• URL length evidence appeared to slightly improve retrieval effectiveness (not sub 01 versus not sub 02).

• The NPF tuning performed worse than the HPF tuning (not sub 04), with an associated drop in MRR of around 20%.

• A linear combination of query-dependent scores from document-level and web-based evidence, where both scores were computed using gidf values, was effective.

• The redirect and duplicate information (collected using methods outlined in Chapter 3) was important when scoring anchor-text using AF1. Without redirect and duplicate information, retrieval effectiveness was reduced by 15% (csiro03td03 versus not sub 06).

The results from the Topic Distillation task support the notion that the home page training set favoured prominent resources (an advantage for Topic Distillation).
The results also illustrate the benefits of the new anchor-text ranking component AF1, especially when used with stemming, and with redirect and duplicate URL information.

9.3.1.2 Combined HP/NP 2003 (HP/NP2003) results

The official run results for the HP/NP2003 task are presented in Table 9.4. The best of these runs achieved the second highest performance of any submitted system (csiro03ki04). The results show that tuning specifically for the home page finding task significantly harmed named page retrieval effectiveness (csiro03ki02 versus csiro03ki03). The highest MRR was achieved using the NPF-only tuning, whilst the best S@10 used interleaved lists from the HPF and NPF tunings. The results show that an overemphasis on home page finding harmed the named page searches.

The run with the highest S@10 (csiro03ki04) interleaved the csiro03ki02 and csiro03ki03 runs (i.e. top HP, top NP, second HP, second NP, etc.). From subsequent evaluations (not sub 01) it was apparent that leading with the top NP result rather than the top HP result would have further improved precision (achieving an MRR of 0.717).

Tuning for both named page and home page training queries concurrently (csiro03ki01) performed well for home page finding, but poorly for named page finding. This confirms that the BOTH training set was biased towards home page finding due to the larger sample of home page queries considered, and the presence of home page queries in the named page training set (see Section 9.1.5).

In summary, interleaving HP then NP without query classification achieves an MRR of 0.646; interleaving HP then NP and reversing the interleaving if the query appears to be a named page query achieves an MRR of 0.667; and interleaving NP then HP without query classification achieves 0.717.

Description                              MRR     S@10 (%)   MRR (HP)   MRR (NP)   Run Id
HPF and NPF interleaved (NPF,HPF)        0.717   87.0       0.781      0.651      not sub 01
NPF                                      0.702   84.0       0.755      0.649      csiro03ki03
HPF and NPF combSUM                      0.699   81.0       0.812      0.586      csiro03ki05
BOTH                                     0.692   83.7       0.815      0.569      csiro03ki01
HPF and NPF interleaved (HPF,NPF)        0.667   86.3       0.801      0.532      csiro03ki04
HPF                                      0.603   77.7       0.774      0.432      csiro03ki02

Table 9.4: Combined home page/named page finding task submission summary. To aid understanding of retrieval performance, MRR was also computed for home pages only ("MRR (HP)") and for named pages only ("MRR (NP)"). "HPF", "NPF", and "BOTH" indicate the tunings used (home page finding, named page finding, and both sets respectively; parameters reported in Table 9.1). Other description notes indicate variations from the tuned parameters. "Run Id" reports the run identifier used in TREC experiments. Post hoc, a further run was computed using NPF tunings.
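For concreteness, the two run-combination methods compared above (the combSUM of HPF- and NPF-tuned scores, and the interleaving of the two ranked lists with optional swapping for apparent named page queries, Section 9.1.6) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used to produce the submitted runs; the keyword set is taken from the examples in the footnote to Section 9.1.6.

from itertools import zip_longest

def comb_sum(hp_scores, np_scores):
    """combSUM: sum the scores a document receives under the HPF and NPF
    tunings, then rank by the combined score (documents missing from one
    run contribute zero there)."""
    docs = set(hp_scores) | set(np_scores)
    return sorted(docs,
                  key=lambda d: hp_scores.get(d, 0.0) + np_scores.get(d, 0.0),
                  reverse=True)

def interleave(first_run, second_run):
    """Take one document from the top of each ranking in turn, skipping
    documents that have already been seen."""
    merged, seen = [], set()
    for a, b in zip_longest(first_run, second_run):
        for doc in (a, b):
            if doc is not None and doc not in seen:
                merged.append(doc)
                seen.add(doc)
    return merged

NP_KEYWORDS = {"page", "form", "2000"}   # illustrative, from the Section 9.1.6 footnote

def combine_runs(query, hp_run, np_run):
    """Interleave HP-first by default, but lead with the NP run when the
    query contains a keyword indicative of a named page finding query."""
    if NP_KEYWORDS & set(query.lower().split()):
        return interleave(np_run, hp_run)
    return interleave(hp_run, np_run)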
9.3.2 Evaluating the ranking function on further corporate web collections

The ranking function was evaluated for eight further collections built from the publicly available corporate webs of eight large Australian organisations: five public companies, two government departments and an educational institution. The query and result sets were generated using the automated site map method described in Section 2.6.5.3. In each case the new ranking function was compared to the performance of the incumbent search system. The anchor-text component was calculated using a BM25 anchor-text formulation that used full-text document length for normalisation (BM25contdln).

Table 9.5 presents the results from this experiment. The first-cut ranking function performed significantly better than seven out of the eight evaluated search systems, and comparably to the other search system (University). The use of query-independent evidence (off-site links, on-site links and URL length) did not significantly improve retrieval effectiveness on any collection.

9.4 Discussion

The first-cut ranking function performed well over a variety of tasks and corpora. The runs submitted to the 2003 TREC web track achieved the highest Topic Distillation score [60], and the second highest combined HP/NP score [60]. The ranking function also outperformed the incumbent search engines of seven of the eight corporate webs studied (and performed comparably to the other).

The tuning of ranking function parameters using the NPF training set achieved better retrieval effectiveness than tuning using the HPF set in the HP/NP2003 task. This indicates that the HPF-tuned ranking function may have been over-trained towards prominent home pages (such as would be listed on first.gov).

Arguably, the most important component of the ranking function was the anchor-text evidence in the form of AF1. This finding re-iterates the importance of anchor-text evidence in web document retrieval. The AF1 ranking function provided an effective alternative to scoring anchor-text using full-text ranking methods. However, the methods used to score aggregate anchor-text evidence merit further investigation. In particular, more work is required to determine whether the use of global inverse document frequency values (gidf) is preferable to the use of field-based anchor-text (fidf) values.

The results show little performance gain through the use of query-independent evidence, for both the web track tasks and the small corporate web collections. URL length evidence produced small gains for the home page finding and Topic Distillation tasks. By contrast, hyperlink recommendation evidence never improved retrieval effectiveness. The poor performance of query-independent evidence could indicate that the method used to combine it with query-dependent evidence was ineffective. More effective combination strategies might incorporate query-independent evidence as a prior probability of document relevance [155], re-rank baselines (as in Section 6.1.4), or use the query-independent score to normalise or transform term contribution.
Institution     Search Engine    Queries   S@1   S@5   S@10   Docs
Telecomm.       Unknown          266       75    113   126    72 337
                New (no QIE)     266       166   208   212
                New (w/QIE)      266       166   208   219
Large Bank 1    Lotus Notes      228       15    46    63     6690
                New (no QIE)               151   206   209
                New (w/QIE)                150   206   210
Large Bank 2    Unknown          64        17    26    28     1805
                New (no QIE)               41    59    60
                New (w/QIE)                42    59    60
Large Bank 3    Unknown          143       4     21    39     5113
                New (no QIE)               116   132   135
                New (w/QIE)                100   132   134
Large Bank 4    Unknown          295       96    165   170    7827
                New (no QIE)               170   232   243
                New (w/QIE)                160   228   241
University      Ultraseek        360       179   235   253    50 203
                New (no QIE)               218   293   315
                New (w/QIE)                204   304   324
Gov Dept 1      ht://dig         160       38    98    119    8414
                New (no QIE)               128   140   146
                New (w/QIE)                128   147   148
Gov Dept 2      Verity & MS      154       1     8     12     42 981
                New (no QIE)               79    108   111
                New (w/QIE)                86    110   111

Table 9.5: Ranking function retrieval effectiveness on the public corporate webs of several large Australian organisations. "New" is the first-cut ranking function described within this chapter. "No QIE" indicates that the run was performed with query-independent evidence removed (ON = 0, OFF = 0, URL = 0). The BM25 parameters were set to k1 = 2, b = 0.75. When used, query-independent evidence parameters were specified as QD = 17, ON = 2, OFF = 6 and URL = 19. The evaluation was performed between February and March 2003.
For example, the in-degree of a document might be a more useful saturation value than length when scoring aggregate anchor-text. The exploration of new approaches to term normalisation and transformation may be particularly interesting in the context of further anchor-text evidence scoring functions.

Hyperlink recommendation evidence, evaluated in the form of off-site (IDGoff) and on-site (IDGon) in-degree, was once again found to be a relatively poor form of document evidence. It is possible that this negative result may be attributed to the relatively small size of the collection (in comparison to the web), and accordingly a limited amount of cross-site linking in the collection. That said, the demonstration of a search situation in which the use of hyperlink recommendation evidence significantly improves retrieval effectiveness remains an elusive goal.

URL length evidence, while found to be important in the training set and in previous home page finding experiments, was found to be relatively ineffective for the tasks examined here. Incorporating URL length moderately improved effectiveness for Topic Distillation, but reduced effectiveness on the combined NP/HP finding tasks. These results indicate that while URL length is an important component for effective home page search, its contribution to other tasks may be limited.
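As an illustration only of the speculative idea raised above — that a document's in-degree, rather than aggregate anchor-text length, might govern term saturation — a BM25-style anchor-text term weight could be sketched as below. This formulation was not evaluated in this thesis; the parameter names, the default values, and the direction in which in-degree alters saturation are all assumptions.

import math

def anchor_term_score(tf, in_degree, df, num_docs, k1=2.0, avg_in_degree=10.0):
    """BM25-style weight for one query term in a document's aggregate
    anchor-text, with the saturation component driven by the document's
    in-degree rather than by aggregate anchor-text length.
    tf: term frequency in the aggregate anchor-text; df: number of
    aggregate documents containing the term; num_docs: collection size."""
    if tf <= 0:
        return 0.0
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    # in_degree/avg_in_degree plays the role that dl/avdl plays in BM25
    # (an assumption; the reverse relationship is equally plausible).
    K = k1 * (in_degree / avg_in_degree)
    return idf * (tf * (k1 + 1.0)) / (tf + K)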
Chapter 10

Discussion

The findings presented in this thesis raise a number of issues. This chapter discusses:

• the extent to which experimental findings are likely to hold for enterprise and intranet web search systems and WWW search engines;

• the search tasks that are the most appropriate to model when evaluating web search performance;

• how web evidence could be used to build a more efficient ranking algorithm while maintaining retrieval effectiveness; and

• whether the set of document features used by the ranking function could be tuned on a per-corpus basis.

10.1 Web search system applicability

This thesis has evaluated the effectiveness of web evidence over a large selection of corporate webs and corporate-sized webs, with corpora ranging from 5000 to 18.5 million pages. This range of sizes covers almost all enterprise corpora. The web evidence inclusive ranking function achieved consistent gains over eight diverse enterprise corpora (in Section 9.3.2), indicating that findings are likely to hold for many small-to-medium sized web corpora. However, it should be noted that the improvements afforded by web evidence are dependent on the quality of hyperlink information in the corpus, and are subject to the publishing procedures employed by organisations. These procedures can reduce the effectiveness of web evidence (as studied in Chapter 4). For example, the effectiveness of web evidence is likely to be decreased if the corpus contains URLs that are unlikely to be linked-to, or the corpus contains a lot of duplicate content.

Findings from experiments in this thesis may be less applicable to WWW search engines than to enterprise web search engines. WWW search engines are subject to substantial efficiency constraints, due to the scale of the document corpus and query processing demands. The indexes of current state-of-the-art WWW search engines contain two orders of magnitude more documents than the largest corpus considered in this thesis.
These systems also process thousands of concurrent queries with sub-second response times. These efficiency requirements are likely to limit the document features examined and scored during query processing. One benefit of a larger corpus size is that there is likely to be more link evidence, and so differentiation between links (e.g. on-site, off-site or nepotistic) might lead to larger gains. However, the hyperlink recommendation scores calculated throughout experiments in this thesis were found to be correlated with the scores for corresponding documents extracted from WWW search engines (see Section 7.6.3). Further, a recent experiment reported that the use of anchor-text evidence external to a web corpus (but linking to documents inside the corpus) did not improve retrieval effectiveness [114]. Consequently, it is possible that further link evidence may not be useful. The correlations between hyperlink recommendation scores, and the small observed benefit achieved by using external link evidence, indicate that hyperlink evidence used in WWW search systems is likely to be comparable to that studied here.

WWW search engines also operate in an adversarial information retrieval environment, where web authors may seek to bias ranking functions in their favour by creating spam content [122]. Given the relative ease and low cost of link construction on the WWW, one might expect hyperlink recommendation scores to be susceptible to link spamming. Some spam-like properties were observed in thesis experiments, but these appeared unsystematic and were deemed to have been created unintentionally. While some experiments in this thesis cast doubt on the use of hyperlink recommendation methods for spam reduction, these results are not conclusive.

Therefore, results presented in this thesis are likely to apply to ranking in enterprise web search, subject to publishing practices, but are less directly applicable to ranking in WWW search systems.

10.2 Which tasks should be modelled and evaluated in web search experiments?

It is important that the tasks evaluated and modelled for a web search system be representative of the tasks that will be performed by the users of the system. Without access to studies relating to the user populations, intended system usage and/or large-scale query-logs, it is difficult to determine which tasks are most frequently performed.

Document ranking functions need to be evaluated over more than one type of search task. It is apparent, both in results from experiments presented in this thesis and in previous TREC web track evaluations, that performance gains in a single retrieval task often do not carry benefits to other tasks. For example, URL length based measures are particularly useful when seeking home pages (Chapter 7), but appear to reduce retrieval effectiveness on other tasks (Chapter 9). Therefore a mixed query set should be used when evaluating a general purpose ranking function. In the 2004 TREC web track, one of the tasks examined was a mixed task that included an equal mix of named page, home page and Topic Distillation queries [54]. Alternatively a mixed query set might be balanced in anticipation of the types of queries a system might receive.1
The query set might also include queries for which the answers are important resources (either popular, or key corpus documents) for each type of search task.

The evaluation concerns for WWW search engines are likely to be quite different from those of corporate webs. WWW search engines need to provide results for a diverse document corpus and user group. By comparison, search engines on corporate webs are likely to have a smaller target user audience, and a more homogeneous document corpus. A prime concern for WWW search engines may be known item searches, where the pages are important and well known to the user. If the search system fails for these types of queries, the user is likely to lose some degree of trust in the system. Therefore, a useful basic effectiveness test may be to observe how well the search engine can find pages listed in WWW directories, using listing descriptions as queries (similar to the automated site map method).

In an enterprise search context, the automated site map method appears to be an effective way of evaluating retrieval effectiveness for navigational search tasks (when a site map is available). Site maps often contain organisation-specific terminology and include links to documents that are frequently accessed. For enterprise web search engines, pages that are contained in site maps may be representative of potential navigational queries, and could thus be an excellent source of queries.

A WWW search engine is likely to be required to process much broader queries than enterprise web search systems, and so should be evaluated for varied tasks. Known item search is likely to be particularly important, as a user may be disappointed in a WWW search system if they cannot use it to find a page they know exists, especially for well known entities. For known item search, online WWW directories may be a good source of query/answer sets of well known and/or useful WWW pages.

10.3 Building a more efficient ranking system

The web evidence and combination methods considered within this thesis may be used to improve query-processing performance and reduce the size of document indexes. The high level of effectiveness achieved by anchor-text over all the search tasks considered in this thesis indicates that a high level of retrieval effectiveness could be achieved over many search tasks using an anchor-text only index. Such an index would be far smaller than a full-text index. For the .GOV corpus, aggregate documents have an average length of 25 terms, as opposed to the 870 terms for document full-text evidence. Further, there is far more repetition in anchor-text evidence, meaning indexes containing aggregate anchor-text might be expected to achieve higher compression than indexes of document full-text.

An alternative method for improving query processing efficiency is to exclude documents that do not meet a minimum query-independent score prior to (or during) indexing.

1 For example, if home page finding is an important task, ensure there are many home page finding queries in the test set.
Results from experiments in this thesis indicate that restricting document inclusion by imposing a minimum URL-type value can reduce the number of documents indexed by an order of magnitude, without significantly affecting retrieval effectiveness for home page finding tasks (see Section 7.2.1).

The use of an anchor-text only index or minimum document threshold may result in a decrease in retrieval effectiveness for some tasks (such as ad-hoc informational, or named page finding tasks), as some crawled documents are not indexed and so would never be retrieved. An extension to this model would be to use two indexes: one primary index, consisting of aggregate anchor-text only or of documents that exceed the minimum threshold value, and a second index containing the full document corpus. During query processing, if some criteria are not met by documents retrieved from the primary, faster index (e.g. fewer than ten matching documents are found, no documents match all terms, or some minimum score is not achieved), the secondary index could be consulted (a sketch of this two-level approach appears at the end of this chapter). Further work is required to investigate whether such multi-level indexes would provide large efficiency gains while maintaining (or improving) retrieval effectiveness, and to explore distributed techniques for dealing with several indexes.

The size of a combined document index can be reduced through the use of a single set of document and corpus statistics when scoring query-dependent features. This requires only one set of statistics to be stored per document/term combination, rather than a set for each query-dependent feature. In fact, the use of full-text length when normalising term contribution in aggregate anchor-text improved retrieval effectiveness (see Section 8.3.1). Further work is required to determine whether inverse document frequency should be scored per document field.

10.4 Tuning on a per corpus basis

The results from experiments in this thesis indicate that document ranking effectiveness not only depends on the search task evaluated, but also on the document corpus. For example, if the ranking function is to be used on a corporate web in which all documents are published through a Content Management System (CMS) that uses long parameterised URLs, URL length-based measures are not likely to be effective. This effect was observed for one of the corpora studied in Section 9.3.2: Large Bank 1. This bank publishes all its content using the Lotus Domino system, which (at least configured as it was in this case) serves content using long URLs. Similarly, hyperlink evidence is not likely to be effective for a corpus which has few hyperlinks.

An attractive avenue for future work may be the tuning of document feature contribution according to the expected utility of that evidence. For example, if a web site's hyperlink graph is sufficiently small, hyperlink evidence could be disabled. This could be generalised further through the creation of profiles for common CMS configurations that indicate what forms of document evidence are likely to be useful. Alternatively, ranking parameters could be tuned using an automated approach using judgements such as those collected from a web site map. This remains for future work.
If corpus-based tuning is not employed, it is important that web authors are aware of the evidence commonly used to match and rank documents. This is especially the case in an enterprise web context.
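To make the two-level index idea from Section 10.3 concrete, the query-time fallback logic might look like the sketch below. This is a minimal illustration under assumed interfaces (a search(query) method on each index returning scored results, and hypothetical fallback criteria), not a description of an implemented system.

MIN_RESULTS = 10       # assumed fallback criteria
MIN_TOP_SCORE = 1.0

def two_level_search(query, primary_index, secondary_index, top_k=10):
    """Query the small, fast primary index (e.g. aggregate anchor-text
    only, or documents above a minimum query-independent threshold) and
    fall back to the full secondary index only when the primary results
    look inadequate."""
    results = primary_index.search(query)        # [(doc_id, score), ...], best first
    inadequate = (not results
                  or len(results) < MIN_RESULTS
                  or results[0][1] < MIN_TOP_SCORE)
    if inadequate:
        results = secondary_index.search(query)  # slower, complete corpus
    return results[:top_k]

Which criteria trigger the fallback, and how often they do so, would determine how much of the potential efficiency gain is realised in practice.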
Chapter 11

Summary and conclusions

The experiments in this thesis demonstrate how web evidence can be used to improve retrieval effectiveness for navigational search tasks.

The first set of experiments, presented in Chapter 4, studied the relationship between site searchability and the likelihood of a site's documents being retrieved by prominent WWW search engines. This study provided one of the first empirical investigations of transactional search. The performance of WWW search engines was shown to differ markedly, with two-out-of-four search engines never retrieving books within the top ten results, and one search engine favouring a particular bookstore (perhaps indicating a partnership). A large variation in bookstore searchability was also observed.

An investigation of potential biases in hyperlink evidence was then presented in Chapter 5, using data collected from WWW search engines. Biases were observed in hyperlink recommendation evidence towards the home pages of popular and/or technology-oriented companies. These results indicate that the use of hyperlink evidence may not only improve home page finding effectiveness (important in navigational search), but also bias search results towards this user demographic (i.e. users who are interested in popular, technology-oriented information). The two types of hyperlink recommendation evidence (Google PageRank and AllTheWeb in-degree) were virtually indistinguishable, providing similar recommendations towards popular companies. Both measures were also correlated for a set of company home pages, and a set of known spam pages. The similarity between the two measures raised questions as to the usefulness of PageRank over in-degree. Both measures gave preference to home page documents, supporting the investigation of hyperlink recommendation evidence for home page finding tasks in later chapters.

Methods for combining hyperlink recommendation evidence (and other query-independent measures) with query-dependent evidence were investigated in Chapter 6. Results from this experiment demonstrated how assigning a large weight to hyperlink recommendation evidence in a ranking function may trade document relevance for link popularity. It was submitted that hyperlink recommendation evidence should be included either as a small component in the ranking function, or in the form of a minimum threshold value enforced prior to document ranking.

Chapter 7 presented a detailed evaluation of home page finding on five small-to-medium web test collections
using three query-dependent baselines and four forms of query-independent evidence (in-degree, Democratic PageRank, Aristocratic PageRank, and URL length). The results from these experiments demonstrated the importance of both anchor-text and URL length measures in home page finding tasks. The most consistent improvements in retrieval effectiveness were achieved using a baseline containing document full-text and anchor-text, with a score-based re-ranking by URL-type. Improvements were observed in both efficiency and effectiveness when using minimum query-independent value thresholds for page inclusion, with the gains for URL length thresholds being particularly large. Little benefit was observed through the use of hyperlink recommendation methods. Small gains were achieved when hyperlink recommendation scores were used as minimum thresholds for page inclusion. However, a score-based re-ranking of query-dependent baselines by hyperlink recommendation evidence performed poorly.

Both PageRank and in-degree performed similarly and were found to be highly correlated. This correlation, and the almost identical performance of both PageRank and in-degree in the home page finding tasks, indicated no reason to choose Democratic PageRank over in-degree for home page finding on corpora of under 18.5 million pages. When considered with the correlations previously observed in WWW-based hyperlink recommendation scores, these results also cast doubt as to whether PageRank and in-degree values would show more divergence on the complete WWW graph. The PageRank values computed for these experiments were also found to be correlated with Google WWW PageRanks for pages present in the Open Directory.

A series of follow-up experiments (using the same data) found that the use of URL length, when measured in characters, is as effective as using URL-types. A further finding was that using hyperlink recommendation evidence calculated for a web graph that included link evidence external to the corpus did not improve retrieval effectiveness. By contrast, the use of external anchor-text information significantly improved retrieval effectiveness.

Chapter 8 presented an analysis of the application of Okapi BM25 based measures in scoring anchor-text evidence. This analysis led to several proposed modifications to Okapi BM25 that, it was hypothesised, might improve the scoring of anchor-text evidence. Proposed modifications included an increase of the saturation point for document term frequencies, the calculation of separate anchor-text-only inverse document frequency values, and the use of document full-text length to normalise aggregate anchor-text. An empirical investigation was carried out to determine whether the proposed changes to anchor-text scoring improved retrieval effectiveness. This showed that the revised scoring functions achieved significant improvements in retrieval effectiveness, for both Topic Distillation and navigational tasks.

Experiments within Chapter 8 also analysed and evaluated strategies for combining query-dependent baselines. Results for these combinations demonstrated the importance of treating document-level and web-based evidence as separate entities. Additionally the results showed that computing a single set of (global) document and corpus statistics for all query-dependent fields improved system efficiency and provided small gains in retrieval effectiveness. Surprisingly, the effectiveness of the anchor-text baseline improved when full-text length was used to normalise aggregate anchor-text document length.
Chapter 9 presented a first-cut document ranking function that included web evidence found useful in earlier experiments within this thesis (anchor-text and URL-length measures in particular). The ranking function was evaluated through ten runs submitted to the two TREC web track tasks in 2003. The best of the runs submitted for the Topic Distillation task achieved the highest performance of any system submission. The best of the runs submitted for the combined home page / named page finding task achieved the second highest performance of any system submission. To further validate the ranking function, a series of follow-up experiments were performed using corporate web collections. Results from these experiments showed that the ranking function outperformed seven out of eight incumbent search systems (while performing comparably to the other).

11.1 Findings

Experimental findings suggest that the most important form of web evidence is anchor-text. Using anchor-text evidence to rank documents, rather than document full-text, provides significant effectiveness gains in home page finding and Topic Distillation tasks. The methods commonly used for length normalising anchor-text aggregate documents were found to be deficient. Removing aggregate anchor-text length normalisation altogether, or normalising according to full-text document length, were both found to improve retrieval effectiveness. The removal of length normalisation from the anchor-text scoring function favours large volumes of incoming anchor-text and, according to prestige and recommendation assumptions, may favour prominent pages.

The use of URL-length based measures, either through grouping URLs into classes (as in URL-type) or simply by counting the number of characters, brought consistent gains for home page finding tasks. However, the use of this evidence reduced effectiveness for other tasks, and would be ineffective for corpora which do not exhibit any URL hierarchy. Further work is needed to understand how best to use URL-based measures in a general purpose web search system.

Hyperlink recommendation evidence was far less effective than URL-based measures. The use of hyperlink recommendation evidence provided minimal gains, even when an Optimal re-ranking was used. The most effective use of hyperlink recommendation scores was in reducing the size of corpora without reducing home page search performance. However, these gains were small by comparison to those achieved using URL-type thresholds. Democratic PageRank was not observed to significantly out-perform simple in-degree. Given the extra cost involved in computing Democratic PageRank, this thesis presents no evidence to support the use of Democratic PageRank over in-degree. A PageRank biased towards authoritative sites improved effectiveness somewhat; however, the scores were based on bookmarks known to match the best answers for the queries used. Further work is required to investigate and compare this PageRank formulation to other authority-biased measures.
The combination method for query-dependent evidence which achieved the highest retrieval effectiveness on navigational and Topic Distillation tasks was the hybrid combination of scores. The hybrid combination considers document-level and web-based evidence as separate document components, and uses a linear combination to sum scores. The separation of document-level and web-based information means that two scores are assigned per document: one for the document content (or the author's description), and one for the wider web community view of the document. If both measures agree (and the document is scored highly on both measures for a particular query) this is likely to be a strong indication that the page is what it claims to be. Computing global document and corpus statistics for all query-dependent fields improved system efficiency and provided small gains in retrieval effectiveness.

The best methods for combining query-independent evidence with query-dependent baselines involved the application of minimum thresholds for page inclusion, or re-ranking all pages within some percentage of the top score. Both combinations proved effective when combining URL-type evidence with query-dependent baselines.

Bias towards the home pages of popular and/or technology-oriented companies was observed in hyperlink-based evidence. Some biases, such as the technology bias, could negatively affect document ranking if ignored, as search results will cater to a small demographic of web users. These findings indicate that care should be taken when using such evidence in document ranking, or in a direct Toolbar indicator. The observed bias may be especially confusing when recommendation scores are used directly as a measure of a page's quality, as in the Google Toolbar.

11.2 Document ranking recommendations

Experimental results indicate that an effective web-based document ranking algorithm for navigational tasks should exploit both document-level evidence and web-based evidence. These two types of document evidence are best combined using a hybrid combination with globally computed document and term statistics. Document evidence should include full-text evidence and other useful document-level evidence. Web-based evidence should make use of incoming anchor-text, and other useful external document descriptions. Anchor-text aggregate document length should not be used to normalise anchor-text term contribution. For home page search, a URL depth component, either measured by characters or classified by type, should be included. The measure may be included either by re-ranking documents that achieve within n% of the top score by URL length, or by adding a normalised URL length score to the query-dependent score. The best choice of hyperlink recommendation algorithm for use in home page finding within corporate-scale corpora is in-degree, as the PageRank variants appear to offer little or no advantage and are more computationally expensive.
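The recommendations above can be summarised in a short sketch. The weights, field names and the n% re-ranking window below are illustrative assumptions; the intent is only to show the shape of the hybrid combination (separate document-level and web-based query-dependent scores summed linearly) and the URL-length re-ranking step for home page search, not the exact formulation used in the experiments.

def hybrid_score(doc_score, web_score, w_doc=1.0, w_web=1.0):
    """Linear (hybrid) combination of the document-level query-dependent
    score (full-text etc.) and the web-based query-dependent score
    (aggregate anchor-text etc.), each computed with global document and
    term statistics."""
    return w_doc * doc_score + w_web * web_score

def rerank_by_url_length(ranked, url_length, window=0.10):
    """For home page search: among documents scoring within `window`
    (e.g. 10%) of the top score, prefer shorter URLs; the remainder keep
    their score order.  `ranked` is a list of (doc_id, score), best first;
    `url_length` maps doc_id to URL length in characters."""
    if not ranked:
        return ranked
    top = ranked[0][1]
    head = [d for d in ranked if d[1] >= (1 - window) * top]
    tail = [d for d in ranked if d[1] < (1 - window) * top]
    head.sort(key=lambda d: url_length[d[0]])   # shortest URL first
    return head + tail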
11.3 Future work

The findings within this thesis raise several issues that merit further investigation. Future work for web-based document ranking might include:

• A study of whether web evidence can improve retrieval effectiveness for other web-based user search tasks, such as informational and transactional search.

• A study of further anchor-text ranking functions. The modifications to Okapi BM25 improved retrieval effectiveness; however, further work is needed to determine whether the document and collection statistics applied to scoring anchor-text were optimal.

• Further study of how document and web-based evidence should be combined. This thesis has explored many different ways of combining document evidence, but it is not clear that the optimal method has been found.

Further studies might also look at the nature of hyperlink recommendation on the WWW. This could include:

• A study of the changing nature of hyperlink evidence on the WWW. For example, is the proportion of dynamic vs. static hyperlinks on the WWW constant? Is the proportion of links which are dead (have no target) constant over time? Also worthy of further examination is how new trends on the WWW, such as web logging, might affect the quality and quantity of hyperlink evidence.

• A study of how an increase in the effectiveness of WWW-based search engines might affect the quality of hyperlink evidence on the WWW. Does high quality search mean authors are less likely to link to useful documents?

• A further study of how document quality metrics, such as PageRank and in-degree, relate to user satisfaction with document quality, or industry-professional satisfaction with document quality. This investigation could focus on the use of tools like the Google Toolbar.
Appendix A

Glossary

All terms within this thesis, unless defined below, are as used in the (Australian) Macquarie Dictionary, searchable on the WWW at http://www.dict.mq.edu.au.

Aggregate anchor-text: all anchor-text snippets pointing to a page.

Anchor-text: words contained within anchor-tags which are "clicked on" when a link is followed.

Anchor-text snippet: a piece of anchor-text that annotates a single link.

Anchor-text aggregate document: a surrogate document containing all anchor-text snippets pointing to a page.

Aristocratic PageRank (APR): a formulation of PageRank that favours a manually specified set of (authoritative) pages. The PageRank calculation is biased towards these pages by using the set of pages in the PageRank bookmark vector.

Collection: see Test collection.

Corpus: a set of documents.

Crawler: the web search system component that gathers documents from a web.

Democratic PageRank (DPR): the default PageRank formulation in which all pages are treated a priori as equal.

Entry point: a document within a site hierarchy from which web users can begin to explore a particular topic.

Evidence: a document attribute, feature, or group of attributes and features that may be useful in determining whether the document should be retrieved (or not) for a particular query.

Feature: information extracted from a document and used during query processing.

Field: a query-dependent document component, for example document full-text, document title or document aggregate anchor-text.
Home page: the key entry point for a particular web site.

Home page finding: a navigational search task in which the goal is to find home pages.

Hyperlink recommendation: an algorithm which is based on the number or "quality" of web recommendations for a particular document.

In-degree: the simplest hyperlink recommendation algorithm in which a document's value is measured by the number of incoming hyperlinks.

Indexer: the web search system component that indexes documents gathered by the crawler into a format which is amenable to quick access by the query processor.

Informational search task: a user task in which the user need is to acquire or learn some information that may be present in one or more web pages.

Link farms: an "artificial" web graph created by spammers through generating link spam to funnel hyperlink evidence to a set of pages for which they desire high web rankings.

Link spam: spam content introduced into hyperlink evidence by generating spam documents that link to other documents with false or misleading information.

Mean Reciprocal Rank (MRR): a measure used in evaluating web search system performance, computed by averaging, over queries, the reciprocal of the rank at which the system retrieves the first useful (relevant) document.

Named page finding: a navigational search task in which the goal of the search system is to find a particular page given its name.

Navigational search task: a user task where the user needs to locate a particular entity given its name.

PageRank: a hyperlink recommendation algorithm that estimates the probability that a "random" web surfer would be on a particular page on a web at any particular time.

Precision: a measure used in evaluating web search system performance. Precision is the proportion of retrieved documents that are relevant to a query at a particular rank cut-off.

Query-dependent evidence: evidence that depends on the user query and is calculated by the query processor during query processing.

Query-independent evidence: evidence that does not depend on the user query, generally calculated during the document indexing phase (prior to query processing).

Query processor: a typical component of a web search system that consults the index to retrieve documents in response to a user query.
R-Precision (R-Prec): a measure used to evaluate web search system performance. R-Precision is the precision of a system at the Rth retrieved document, where R is the number of relevant documents for the query (averaged across multiple queries).

Recall: a measure used in evaluating web search system performance. Recall is the proportion of all relevant documents that have been retrieved within a particular cut-off for a query.

Search Engine Optimisation: optimising document and web structure such that search engines may better match a document's content (without generating spam content).

Spam: the name applied to content generated by web publishers to artificially boost the rank of their pages. Spam techniques include the addition of otherwise unneeded keywords and hyperlinks.

Stemming: stripping term suffixes or prefixes to collapse a term down to its canonical form (or stem). The Porter suffix stemmer [163] is used for this purpose in some thesis experiments.

Test collection: a snapshot of a user task and document corpus used to evaluate system effectiveness. A test collection includes a set of documents (corpus), a set of queries, and relevance judgements for documents in the corpus according to the queries.

Topic Distillation: a user task in which the goal is to find entry points to relevant sites given a broad query.

Text REtrieval Conference (TREC): an annual conference run by the US National Institute of Standards and Technology (NIST) and the US Defense Advanced Research Projects Agency (DARPA) since 1992. The goal of the conference is to promote the understanding of information retrieval algorithms by allowing research groups to compare system effectiveness on common test collections.

Traditional information retrieval: information retrieval performed over a flat corpus using full-text fields.

Transactional search task: a user search task where the user needs to perform some activity on, or using, the WWW.

URL-type: a URL class breakdown, proposed by Westerveld et al. [212], in which some URLs are deemed more important than others on the basis of structure and depth (outlined in Section 2.3.3.2).

web: a corpus containing linked documents.

web evidence: evidence derived from some web property or context.
web graph: a graph built from the hyperlink structure of a web, where web pages are nodes, and hyperlinks are edges.

WWW: the World-Wide Web is a huge repository of linked documents distributed on millions of servers world-wide. The WWW contains at least ten billion publicly visible web documents.
Appendix B

The canonicalisation of URLs

When canonicalising URLs the following rules were followed:

• If the relative URI steps below the root of the server the link is resolved to the server root directory. For example:
  – A link to /../foo.html from http://cs.anu.edu.au/ will be resolved to http://cs.anu.edu.au/foo.html;
  – A link to ../../foo.html from http://cs.anu.edu.au/∼Trystan.Upstill/ will be resolved to http://cs.anu.edu.au/foo.html; and
  – A link to /../foo.html from http://cs.anu.edu.au/∼Trystan.Upstill/pubs/ will be resolved to http://cs.anu.edu.au/foo.html.

• Hyperlinks and documents with common default root page names (e.g. index.htm(l), default.htm(l), welcome.htm(l), and home.htm(l)) are stemmed to the directory path. For example:
  – A link to http://cs.anu.edu.au/default.html is resolved to http://cs.anu.edu.au/; and
  – A link to http://cs.anu.edu.au/∼Trystan.Upstill/index.html is resolved to http://cs.anu.edu.au/∼Trystan.Upstill/.

• Multiple directory slashes are resolved to a single slash. For example:
  – A link to http://cs.anu.edu.au///// is resolved to http://cs.anu.edu.au/; and
  – A link to http://cs.anu.edu.au//////∼Trystan.Upstill// is resolved to http://cs.anu.edu.au/∼Trystan.Upstill/.

• URLs pointing to targets inside documents are treated as links to the full document. For example:
  – A link to http://cs.anu.edu.au/foo.html#Trystan is resolved to http://cs.anu.edu.au/foo.html; and
  – A link to http://cs.anu.edu.au/#foo is resolved to http://cs.anu.edu.au/.

• Hyperlinks are not followed from framesets (as they are not crawled). Hyperlink extraction from frameset sites requires that links directly to navigational panes be observed (and not links to framesets).

• If the port to which an HTTP request is made is the default port (e.g. 80), it is removed. For example:
  – A link to http://cs.anu.edu.au:80 is resolved to http://cs.anu.edu.au; and
  – A link to http://cs.anu.edu.au:80/∼Trystan.Upstill/ is resolved to http://cs.anu.edu.au/∼Trystan.Upstill/.

• URLs without a leading host are prefixed with "www". For example, a link to http://sony.com/ is resolved to http://www.sony.com.

• If no protocol is provided, http:// is assumed. For example, a link to sony.com is resolved to http://www.sony.com.

• Host names are converted into lower case (as host names are case-insensitive).

• Default web server directory listing pages are removed.
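A minimal sketch of how several of these rules might be applied in code is given below. It is illustrative only (the crawler used in the thesis is not reproduced here) and covers only a subset of the rules: assuming http:// when no protocol is given, dropping fragments, lower-casing host names, removing the default port, collapsing repeated slashes, and stripping common default page names.

from urllib.parse import urlsplit, urlunsplit
import re

DEFAULT_PAGES = re.compile(r"/(index|default|welcome|home)\.html?$", re.IGNORECASE)

def canonicalise(url):
    """Apply a subset of the canonicalisation rules of Appendix B."""
    if "://" not in url:                 # no protocol given: assume http://
        url = "http://" + url
    scheme, netloc, path, query, _frag = urlsplit(url)   # drop any #fragment
    netloc = netloc.lower()              # host names are case-insensitive
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]             # remove the default port
    path = re.sub(r"/{2,}", "/", path)   # collapse multiple directory slashes
    path = DEFAULT_PAGES.sub("/", path)  # stem default page names to the directory
    if not path:
        path = "/"
    return urlunsplit((scheme, netloc, path, query, ""))

For example, canonicalise("http://cs.anu.edu.au:80//foo.html#bar") returns http://cs.anu.edu.au/foo.html under these rules.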
Appendix C

Bookstore search and searchability: case study data

C.1 Book categories

• (27) Children's
• (15) Hardcover Advice
• (11) Hardcover Business
• (35) Hardcover Fiction
• (29) Hardcover Non-Fiction
• (15) Paperback Advice
• (07) Paperback Business
• (35) Paperback Fiction
• (32) Paperback Non-Fiction
• (206) Total

Duplicate books were removed from the query set. For example, the book titled "Stupid White Men" was in both the Hardcover Business and Hardcover Non-Fiction sections, and so was only considered in the Hardcover Business category.

C.2 Web search engine querying

• AltaVista
  – General Queries: Book title surrounded by quotation (") marks.
  – URL Coverage: canonical domain name with "url:" parameter.
  – Link Coverage: canonical domain name with "link:" parameter.
  – Timeframe: General and Domain Restricted Queries submitted between 20/09/02 and 02/10/02; Link Coverage and URL Coverage experiments performed on 09/10/02.

• AllTheWeb (Fast)
  – General Queries: Book title with exact phrase box ticked.
  – URL Coverage: Advanced search restricting to domain using "domain" textbox with canonical domain name.
  – Link Coverage: Advanced search using Word Filter with "Must Include" in the preceding drop-down box, canonical domain name in the middle text box and "in the link to URL" in the final drop-down box.
  – Timeframe: General and Domain Restricted Queries submitted between 20/09/02 and 02/10/02; Link Coverage and URL Coverage experiments performed on 09/10/02.

• Google
  – General Queries: Book title surrounded by quotation (") marks.
  – URL Coverage: Search for the non-presence of a non-existing word (e.g. -adsljflkjlkjdflkjasdlfj0982739547asdhkas) and using canonical domain name with "host:" parameter.
  – Link Coverage: Not available.
  – Timeframe: General and Domain Restricted Queries submitted between 20/09/02 and 02/10/02; Link Coverage and URL Coverage experiments performed on 09/10/02.

• MSN Search (Inktomi)
  – General Queries: Advanced search with book title in the "exact phrase" box.
  – URL Coverage: Advanced search using the domain name as the query, and restricting domain using "domain" text box with canonical domain name.
  – Link Coverage: Not available.
  – Timeframe: General and Domain Restricted Queries submitted between 20/09/02 and 02/10/02; Link Coverage and URL Coverage experiments performed on 09/10/02.
    §C.3 Correct bookanswers in bookstore case study 187 C.3 Correct book answers in bookstore case study Category Book Title ISBN Childrens America 0689851928 Childrens Artemis Fowl 0786808012 0786817070 Childrens Artemis Fowl: the Arctic Incident 0786808551 Childrens Can You See What I See? 0439163919 Childrens Daisy Comes Home 039923618X Childrens Disney’s Lilo and Stitch 0736413219 Childrens Giggle, Giggle, Quack 0689845065 Childrens Good Morning, Gorillas 0375806148 0375906142 Childrens Harry Potter and the Chamber of Secrets 0439064872 0439064864 0613287142 Childrens Harry Potter and the Goblet of Fire 0439139600 0439139597 Childrens Harry Potter and the Prisoner of Azkaban 0439136369 0439139597 0613371062 Childrens Harry Potter and the Sorcerer’s Stone 059035342X 0590353403 0613206339 Childrens Holes 0440414806 0374332657 044022859X 0613236696 Childrens If You Take a Mouse to School 0060283289 Childrens Junie B., First Grader (at Last!) 0375802932 0375815163 0375902937 Childrens Junie B., First Grader: Boss of Lunch 0375815171 Childrens Lemony Snicket: the Unauthorized Autobiography. 0060007192 Childrens Oh, the Places You’ll Go! 0679805273 Childrens Olivia 0689829531 Childrens Olivia Saves the Circus 068982954X Childrens Princess in the Spotlight 0060294655 0064472795 0060294663 Childrens Stargirl 037582233X 0679886370 0679986375 B00005TZX9 B00005TPDD Childrens The All New Captain Underpants Extracrunchy Book O’fun 2 0439376084 Childrens The Bad Beginning 0064407667 0060283122
    188 Bookstore searchand searchability: case study data Childrens The Reptile Room 0064407675 0060283130 Childrens The Three Pigs 0618007016 Childrens The Wide Window 0064407683 0060283149 Hardcover Advice 10 Secrets for Success and Inner Peace 1561708755 Hardcover Advice Body for Life 0060193395 Hardcover Advice Conquer the Crash 0470849827 Hardcover Advice Execution 0609610570 Hardcover Advice Fish! 0786866020 Hardcover Advice Get With the Program! 0743225996 Hardcover Advice I Hope You Dance 1558538445 Hardcover Advice Self Matters 074322423X Hardcover Advice Sylvia Browne’s Book of Dreams 0525946586 Hardcover Advice The Fat Flush Plan 0071383832 Hardcover Advice The Perricone Prescription 0060188790 Hardcover Advice The Prayer of Jabez 1576737330 1576738108 Hardcover Advice The Prayer of Jabez for Women 1576739627 1590520491 Hardcover Advice The Wisdom of Menopause 055380121X Hardcover Advice Who Moved My Cheese? 0399144463 Hardcover Business Conquer the Crash (duplicate) Hardcover Business Execution (duplicate) Hardcover Business Fish (duplicate) Hardcover Business Fish! Tales 0786868686 Hardcover Business Good to Great 0066620996 Hardcover Business How to Lose Friends and Alienate People 030681188X Hardcover Business Martha Inc. 0471123005 Hardcover Business Oh, the Things I Know! 052594673X Hardcover Business Snobbery: the American Version 0395944171 Hardcover Business Stupid White Men 0060392452 Hardcover Business Ten Things I Learned From Bill Porter 1577312031 Hardcover Business The Pact 157322216X Hardcover Business Tuxedo Park 0684872870 0684872889 Hardcover Business Wealth and Democracy 0767905334 Hardcover Business Who Moved My Cheese? (duplicate) Hardcover Fiction A Love of My Own 0385492707 Hardcover Fiction A Thousand Country Roads 0971766711 Hardcover Fiction Absolute Rage 0743403444 Hardcover Fiction An Accidental Woman 0743204700
    §C.3 Correct bookanswers in bookstore case study 189 Hardcover Fiction Ash Wednesday 037541326X Hardcover Fiction Atonement 0385503954 Hardcover Fiction Charleston 0525946500 Hardcover Fiction Eleventh Hour 0399148779 Hardcover Fiction Enemy Women 0066214440 Hardcover Fiction Fire Ice 0399148728 Hardcover Fiction Hard Eight 0312265859 Hardcover Fiction Her Father’s House 0385334729 Hardcover Fiction Hot Ice 0553802747 Hardcover Fiction In This Mountain 0670031046 Hardcover Fiction Lawrence Sanders: Mcnally’s Alibi 0399148795 Hardcover Fiction Leslie 0743228669 Hardcover Fiction Partner in Crime 0380977303 Hardcover Fiction Pasadena 0375504567 Hardcover Fiction Prague 0375507876 Hardcover Fiction Red Rabbit 0399148701 Hardcover Fiction Standing in the Rainbow 0679426159 Hardcover Fiction Stone Kiss 0446530387 Hardcover Fiction Sunset in St. Tropez 0385335466 Hardcover Fiction The Art of Deception 0786867248 Hardcover Fiction The Beach House 0316969680 Hardcover Fiction The Dive From Clausen’s Pier 0375412824 Hardcover Fiction The Emperor of Ocean Park 0375413634 Hardcover Fiction The Lovely Bones 0316666343 Hardcover Fiction The Nanny Diaries 0312278586 Hardcover Fiction The Remnant 0842332278 Hardcover Fiction The Shelters of Stone 0609610597 Hardcover Fiction The Summons 0385503822 Hardcover Fiction Unfit to Practice 0385334842 Hardcover Fiction Whispers and Lies 0743446259 Hardcover Fiction You Are Not a Stranger Here 0385509529 Hardcover Non-Fiction A Long Strange Trip 0767911857 Hardcover Non-Fiction A Mind at a Time 0743202228 Hardcover Non-Fiction A Nation Challenged 0935112766 Hardcover Non-Fiction Among the Heroes 0060099089 Hardcover Non-Fiction Cicero 0375507469 Hardcover Non-Fiction Crossroads of Freedom: Antietam 0195135210 Hardcover Non-Fiction Firehouse 1401300057 Hardcover Non-Fiction General Patton 0060009829 Hardcover Non-Fiction Gettysburg 0060193638
    190 Bookstore searchand searchability: case study data Hardcover Non-Fiction Good to Great (duplicate) Hardcover Non-Fiction John Adams 0743223136 Hardcover Non-Fiction Lucky Man 0786867647 Hardcover Non-Fiction Martha Inc. (duplicate) Hardcover Non-Fiction Odd Girl Out 0151006040 Hardcover Non-Fiction Once Upon a Town 0060081961 Hardcover Non-Fiction Profiles in Courage for Our Time 0786867930 Hardcover Non-Fiction Running With Scissors 0312283709 Hardcover Non-Fiction Sacred Contracts 0517703920 Hardcover Non-Fiction Sex, Lies, and Headlocks 0609606905 Hardcover Non-Fiction Six Days of War 0195151747 Hardcover Non-Fiction Slander 1400046610 Hardcover Non-Fiction Small Wonder 0060504072 Hardcover Non-Fiction Snobbery (duplicate) Hardcover Non-Fiction Strong of Heart 006050949X Hardcover Non-Fiction Stupid White Men (duplicate) Hardcover Non-Fiction The Art of Travel 0375420827 Hardcover Non-Fiction The Cell 0786869003 Hardcover Non-Fiction The Lobster Chronicles 0786866772 Hardcover Non-Fiction The Right Words at the Right Time 0743446496 Hardcover Non-Fiction The Sexual Life of Catherine M. 0802117163 Hardcover Non-Fiction The Universe in a Nutshell 055380202X Hardcover Non-Fiction Tuxedo Park (duplicate) Hardcover Non-Fiction Wealth and Democracy (duplicate) Hardcover Non-Fiction Why I Am a Catholic 0618134298 Hardcover Non-Fiction You Cannot Be Serious 0399148582 Paperback Advice A Week in the Zone 006103083X Paperback Advice Chicken Soup for the Teacher’s Soul 1558749780 1558749799 Paperback Advice Crucial Conversations 0071401946 Paperback Advice Dr. Atkins’ New Diet Revolution 006001203X 1590770021 Paperback Advice Fix-it and Forget-it Cookbook 1561483397 1561483389 1561483176 Paperback Advice Guinness World Records 2002 0553583786 Paperback Advice Leonard Maltin’s 2003 Movie and Video Guide 0451206495 Paperback Advice Life Strategies 0786884592 0786865482 Paperback Advice Relationship Rescue 0786866314 078688598X Paperback Advice Rich Dad, Poor Dad 0446677450
  • 207.
    §C.3 Correct bookanswers in bookstore case study 191 Paperback Advice The Four Agreements 1878424319 1878424505 Paperback Advice The Pill Book: New and Revised 10th Edition. 0553584782 0553050133 Paperback Advice The Unauthorized Osbournes 1572435208 Paperback Advice The Wrinkle Cure 0446677760 1579542379 Paperback Advice What to Expect When You’re Expecting 0761121323 0761125493 Paperback Business Crucial Conversations (duplicate) Paperback Business Fast Food Nation 0060938455 0395977894 Paperback Business How to Make Money in Stocks 0071373616 Paperback Business Life Strategies (duplicate) Paperback Business Nickel and Dimed 0805063897 0805063889 Paperback Business Rich Dad, Poor Dad (duplicate) Paperback Business The Tipping Point 0316316962 0316346624 Paperback Business Two Bad Years and Up We Go! 1892008726 Paperback Business What Color Is Your Parachute 2002 1580083420 1580083412 Paperback Business What Went Wrong at Enron 0471265748 Paperback Fiction A Bend in the Road 0446611867 0446527785 Paperback Fiction A Painted House 044023722X 038550120X Paperback Fiction A Walk to Remember 0446608955 0613281292 Paperback Fiction Always in My Heart 0451206665 Paperback Fiction Bel Canto 0060934417 Paperback Fiction Blood Work 0446602620 0613236882 Paperback Fiction Cordina’s Royal Family 0373484836 Paperback Fiction Divine Secrets of the Ya-ya Sisterhood 0060928336 0060173289 Paperback Fiction Empire Falls 0375726403 0679432477 Paperback Fiction Enemy Within 0743403436 0743403428 Paperback Fiction Envy 0446611808 0446527130 Paperback Fiction Face the Fire 051513287X Paperback Fiction Fanning the Flame 0743419162 Paperback Fiction For Better, for Worse 0380820447 Paperback Fiction Four Blondes 080213825X
  • 208.
    192 Bookstore searchand searchability: case study data 0871138190 Paperback Fiction Good in Bed 0743418174 0743418166 Paperback Fiction Hemlock Bay 0399147381 0515133302 Paperback Fiction Honest Illusions 0399137610 0515110973 Paperback Fiction Little Altars Everywhere 0060976845 006019362X Paperback Fiction Mercy 0671034022 0671034014 Paperback Fiction Paradise Lost 0140424261 Paperback Fiction Stonebrook Cottage 1551669234 Paperback Fiction Summer Pleasures 0373218397 Paperback Fiction Suzanne’s Diary for Nicholas 0446679593 0316969443 Paperback Fiction The Associate 0061030643 0060196254 Paperback Fiction The Bachelor 0446610542 Paperback Fiction The Last Time They Met 0316781266 0316781142 Paperback Fiction The New Jedi Order: Traitor 034542865X 0553713175 Paperback Fiction The Smoke Jumper 0385334036 0440235162 Paperback Fiction The Straw Men 0515134279 Paperback Fiction The Surgeon 0345447840 0345447832 Paperback Fiction True Blue 0553583980 Paperback Fiction Valhalla Rising 039914787X 0425185710 Paperback Fiction When Strangers Marry 0060507365 Paperback Fiction Whisper of Evil 0553583468 Paperback Non-Fiction A Beautiful Mind 0743224574 0684819066 Paperback Non-Fiction A Child Called ”It” 1558743669 0613171373 Paperback Non-Fiction A Man Named Dave 0452281903 0525945210 Paperback Non-Fiction An Italian Affair 0375724850 0375420657 Paperback Non-Fiction April 1865 0060930888 0060187239 Paperback Non-Fiction Ava’s Man 0375724443 0375410627 Paperback Non-Fiction Black Hawk Down 0871137380 0140288503
  • 209.
    §C.3 Correct bookanswers in bookstore case study 193 Paperback Non-Fiction Brunelleschi’s Dome 0142000159 0802713661 Paperback Non-Fiction Comfort Me With Apples 0375758739 0375501959 Paperback Non-Fiction Fast Food Nation (duplicate) Paperback Non-Fiction Founding Brothers 0375405445 0375705244 Paperback Non-Fiction French Lessons 0375705619 0375405909 Paperback Non-Fiction From Beirut to Jerusalem 0385413726 0374158959 Paperback Non-Fiction Ghost Soldiers 038549565X 0385495641 Paperback Non-Fiction It’s Not About the Bike 0399146113 0425179613 Paperback Non-Fiction Justice 0609608738 0609809636 Paperback Non-Fiction Me Talk Pretty One Day 0316776963 0316777722 Paperback Non-Fiction Napalm and Silly Putty 0786887583 0786864133 Paperback Non-Fiction Nickel and Dimed (duplicate) Paperback Non-Fiction On Writing 0743455967 0684853523 Paperback Non-Fiction Paris to the Moon 0679444920 0375758232 Paperback Non-Fiction Perpetual War for Perpetual Peace 156025405X Paperback Non-Fiction Personal History 0375701044 0394585852 Paperback Non-Fiction Seabiscuit 0375502912 0449005615 Paperback Non-Fiction The Botany of Desire 0375501290 0375760393 Paperback Non-Fiction The Darwin Awards 0525945725 0452283442 Paperback Non-Fiction The First American 0385495404 0385493282 Paperback Non-Fiction The Idiot Girls’ Action-adventure Club 0375760911 Paperback Non-Fiction The Lost Boy 1558745157 0613173538 Paperback Non-Fiction The Map That Changed the World 0060931809 0060193611 Paperback Non-Fiction The Metaphysical Club 0374199639 0374528497 Paperback Non-Fiction The Piano Shop on the Left Bank 0375758623 0375503048 Paperback Non-Fiction The Tipping Point (duplicate) Paperback Non-Fiction The Wild Blue 0743203399
  • 210.
    194 Bookstore searchand searchability: case study data 0743223098 Paperback Non-Fiction Washington 1586481185 0783895909 Table C.1: Correct book answers in bookstore case study.
  • 211.
Appendix D

TREC participation in 2002

This appendix is included for reference only and is drawn directly from [57]. TREC 2002 included a named page finding task and a Topic Distillation task. A preliminary exploration of forms of evidence which might be useful for named page finding and Topic Distillation was performed. For this reason there was heavy use of evidence other than page content.

D.1 Topic Distillation

In Topic Distillation the following forms of evidence were used:
• BM25 on full-text (content). Pages returned should be “relevant”. The .GOV corpus was indexed and BM25 applied, sometimes with stemming and sometimes without.
• BM25 on content and referring anchor-text. An alternative to content-only BM25 is to include referring anchor-text words in the BM25 calculation (content and anchors).
• In-link counting and filtering. To test whether pages with more in-links are potentially better answers, with differentiation between on-host and off-host links. Many results were eliminated on the grounds that they had insufficient in-links.
• URL length. Short URLs are expected to be better answers than long URLs.
• BM25 score aggregation. Sites with many BM25-matching pages are expected to be better than those with few.

Run           P@10     Evidence used (BM25 cont.; BM25 cont. & anch.; in-link counting & filtering; URL length; BM25 aggr.)
csiro02td1    0.1000   y y y
csiro02td2    0.0714   y y
csiro02td3    0.0184   y y y y
csiro02td4    0.0184   y y y
csiro02td5    0.0939   y (stem) y y
csiro02unoff  0.1959   y

Table D.1: Official results for submissions to the 2002 TREC web track Topic Distillation task.

In the 2002 Topic Distillation (TD2002) task, the focus on local page content relevance (BM25 content only) was probably too high for the non-content and aggregation methods to succeed. Most correct answers were expected to be shallow URLs of sites containing much useful content. In fact, correct answers were deeper, and the aggregation method for finding sites rich with relevant information was quite harmful (csiro02td3 and csiro02td4). The focus on page content is borne out by the improvement in effectiveness achieved when simple BM25 was applied in an unofficial run (csiro02unoff). To perform better in the TD2002 task, less (or no) emphasis should have been put on distillation evidence and far more emphasis on relevance. However, in some Web search situations, it is likely that the distillation evidence would be more important than it was in this TD2002 task.
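The exact thresholds and weights used in the csiro02td runs are not reproduced in this appendix. Purely as an illustration of how the non-content evidence listed above can be layered on top of a BM25 content score, the following sketch filters on off-host in-link counts, rewards short URLs and aggregates scores by site; all function names, input structures and parameter values are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

def rerank_with_web_evidence(bm25_scores, in_links, min_offhost=1, url_weight=0.5):
    """Illustrative re-ranking: drop pages with too few off-host in-links and
    add a bonus favouring short URLs.  `bm25_scores` maps URL -> BM25 score
    and `in_links` maps URL -> (on_host, off_host) counts; both inputs, and
    the threshold and weight values, are assumptions for illustration."""
    rescored = {}
    for url, score in bm25_scores.items():
        _on_host, off_host = in_links.get(url, (0, 0))
        if off_host < min_offhost:
            continue  # in-link filtering: eliminate poorly linked pages
        depth = urlparse(url).path.rstrip("/").count("/")
        rescored[url] = score + url_weight / (1.0 + depth)  # short URLs score higher
    return sorted(rescored.items(), key=lambda item: -item[1])

def aggregate_by_site(bm25_scores):
    """BM25 score aggregation: sum the scores of matching pages per host, so
    sites with many BM25-matching pages rank ahead of sites with few."""
    site_scores = defaultdict(float)
    for url, score in bm25_scores.items():
        site_scores[urlparse(url).netloc] += score
    return sorted(site_scores.items(), key=lambda item: -item[1])
```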
  • 212.
D.2 Named page finding

In the named page finding experiments the following forms of evidence were used:
• Okapi BM25 on document full-text (content) and/or anchor-text. Okapi BM25 was used to score document content and aggregate anchor-text documents.
• Stemming of query terms.
• Extra title weighting. To bias the results towards “page naming text”, further emphasis was placed on document titles.
• PageRank. To see whether link recommendation could be used to improve results [31].

Prior to submission, twenty named page training queries were generated. This training found that content with extra title weighting performed best, so page titles were expected to be important evidence in the official named page finding task. However, this appeared not to be the case; in fact, extra title weighting for the TREC queries appeared to reduce effectiveness (csiro02np01 vs csiro02np03). While there was some anchor-text evidence present for the query set (csiro02np02), when this evidence was combined with content (csiro02np04 and csiro02np16), results were noticeably worse than for the content-only run (csiro02np01). PageRank harmed retrieval effectiveness (run csiro02np16 versus csiro02np04). The official results are given in Table D.2.
  • 213.
Run          MRR    S@10   BM25                      Stemming   Extra title weighting   PageRank
csiro02np01  0.573  0.77   Content
csiro02np02  0.241  0.34   Anchor text
csiro02np03  0.416  0.59   Content                              y
csiro02np04  0.318  0.51   Content and anchor text   y          y
csiro02np16  0.307  0.49   Content and anchor text   y          y                       y

Table D.2: Official results for submissions to the 2002 TREC web track named page finding task.
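The exact weighting given to titles and anchor-text in these runs is not specified here. As an illustration only, the sketch below shows one common way that “extra title weighting” can be realised: a weighted linear combination of per-field Okapi BM25 scores, with an optional query-independent PageRank term. The bm25 implementation, field names and weights are assumptions, not the parameters of the official csiro02np runs.

```python
import math
from collections import Counter

def bm25(query_terms, doc_terms, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Minimal Okapi BM25 over one tokenised field (a sketch only; not the
    exact formulation or parameter settings used in the official runs)."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        if term not in tf:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score

def named_page_score(query_terms, fields, stats, pagerank=0.0,
                     w_content=1.0, w_title=3.0, w_anchor=1.0, w_pr=0.0):
    """Hypothetical combination: weighted sum of per-field BM25 scores plus an
    optional query-independent PageRank term.  `fields` maps a field name
    ("content", "title", "anchor") to its token list; `stats` maps the same
    names to (doc_freq, n_docs, avg_len) tuples for that field."""
    score = 0.0
    for field, weight in (("content", w_content),
                          ("title", w_title),
                          ("anchor", w_anchor)):
        doc_freq, n_docs, avg_len = stats[field]
        score += weight * bm25(query_terms, fields.get(field, []),
                               doc_freq, n_docs, avg_len)
    return score + w_pr * pagerank
```

In this sketch, setting w_title above 1.0 plays the role of extra title weighting, and setting w_pr to zero removes the PageRank contribution.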
  • 214.
  • 215.
Appendix E

Analysis of hyperlink recommendation evidence: additional results

This appendix contains further graphs from the experiment series examined in Chapter 5, Section 5.2.1. Figures E.1 and E.2 contain PageRank distributions for several company websites. These figures support the results presented in Chapter 5, but do not show any further interesting trends.
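For reference, each panel in Figures E.1 and E.2 is simply a tally of Toolbar PageRank values (integers from 0 to 10) over the pages crawled from one site. A minimal sketch of that tally is given below; the input mapping from page URL to Toolbar PageRank is a hypothetical stand-in for the crawl data used in Chapter 5.

```python
from collections import Counter
from urllib.parse import urlparse

def toolbar_pagerank_histograms(toolbar_pr):
    """Bucket crawled pages by host and count how many pages of each host
    carry each Toolbar PageRank value (0-10).  `toolbar_pr` maps page URL ->
    the integer Toolbar PageRank reported for that page."""
    per_site = {}
    for url, pr in toolbar_pr.items():
        host = urlparse(url).netloc
        per_site.setdefault(host, Counter())[pr] += 1
    return {host: [counts.get(value, 0) for value in range(11)]
            for host, counts in per_site.items()}
```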
[Figure E.1: six histograms of Toolbar PageRank (x-axis: PageRank 0-10; y-axis: pages in our crawl), one per site: www.harman.com (HP PR=7), www.introgen.com (HP PR=5), www.pnc.com (HP PR=6), www.progressenergy.com (HP PR=5), www.csx.com (HP PR=6), www.southtrust.com (HP PR=7).]

Figure E.1: Toolbar PageRank distributions within sites (additional to those presented in Section 5.2.1). The PageRank distributions for other sites are included in Figure 5.2 and in Figure E.2. The PageRank advice to users is usually that the home page is the most important or highest quality page, and other pages are less important or of lower quality. The PageRank of the home page of each site is shown as “HP PR=”.
[Figure E.2: six histograms of Toolbar PageRank (x-axis: PageRank 0-10; y-axis: pages in our crawl), one per site: www.tenneco-automotive.com (HP PR=6), www.novavax.com (HP PR=5), www.valero.com (HP PR=6), www.synergybrands.com (HP PR=5), www.teletouch.com (HP PR=5), www.tofc.net (HP PR=3).]

Figure E.2: Toolbar PageRank distributions within sites (additional to those presented in Section 5.2.1).
Appendix F

Okapi BM25 distributions

This appendix contains the distributions of Okapi BM25 scores for query-dependent evidence for the WT10gC collection (see Section 7.1.2) used throughout the experiments in Chapter 7. Figure F.1 contains the distribution of scores for document full-text. Figure F.2 contains the distribution of scores for the anchor-text baseline. The BM25 distributions are calculated using the top 1000 results for each of the 100 queries. Unlike query-independent evidence, BM25 scores are not comparable between query results. To build these distributions the BM25 scores for all queries were independently normalised (the top answer for each query receives a 1). Due to the cutoff at 1000, a truncated curve is expected. Additionally, because the query score distributions are not centred at the same point, the plots exhibit a flatter curve than would be observed for a single query score distribution.
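The per-query normalisation described above amounts to dividing each query's scores by that query's top score. A minimal sketch, assuming a run is represented as a mapping from query identifier to scored results (an input format chosen only for illustration), is:

```python
def normalise_per_query(run):
    """Rescale scores so the top-ranked answer for each query receives 1.0.
    `run` maps query_id -> list of (doc_id, bm25_score) pairs; this input
    representation is an assumption made for illustration."""
    normalised = {}
    for query_id, results in run.items():
        top = max(score for _, score in results) or 1.0  # guard against all-zero scores
        normalised[query_id] = [(doc_id, score / top) for doc_id, score in results]
    return normalised
```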
[Figure F.1: plot of percentage of documents (y-axis) against normalised BM25 content score (x-axis; top 1000 documents per query) for the ANU, WT10gC and VLC2R collections.]

Figure F.1: Distribution of normalised Okapi BM25 scores for document full-text for the WT10gC collection. The BM25 distributions are calculated using the top 1000 results for each of the 100 queries. Unlike query-independent evidence, BM25 scores are not comparable between query results. To build this distribution the BM25 scores for all queries were independently normalised (the top answer for each query receives a 1). Due to the cutoff at 1000, a truncated curve is expected. Additionally, because the query score distributions are not centred at the same point, the plot exhibits a flatter curve than would be observed for a single query.

[Figure F.2: plot of percentage of documents (y-axis) against normalised BM25 anchor score (x-axis; top 1000 documents per query) for the ANU, WT10gC and VLC2R collections.]

Figure F.2: Distribution of normalised Okapi BM25 scores for aggregate anchor-text for the WT10gC collection. The BM25 distributions are calculated using the top 1000 results for each of the 100 queries. Unlike query-independent evidence, BM25 scores are not comparable between query results. To build this distribution the BM25 scores for all queries were independently normalised (the top answer for each query receives a 1). Due to the cutoff at 1000, a truncated curve is expected. Additionally, because the query score distributions are not centred at the same point, the plot exhibits a flatter curve than would be observed for a single query.
Appendix G

Query sets

G.1 .GOV home page set

Query .GOV Doc ID
White House G03-16-2396677
Office of Homeland Security G25-97-0219687
Office of Management and Budget G01-47-2257273
OMB G01-47-2257273
United States Trade Representative G00-02-0599362
USTR G00-02-0599362
Department of Agriculture G42-03-3102230
USDA G42-03-3102230
Agricultural Research Service G00-03-3996998
Animal Plant Health Inspection Service G00-06-2853218
Cooperative State Research Education and Extension Service G00-11-0223618
Economic Research Service G00-03-2081400
Farm Service Agency G01-58-2364809
National Agricultural Library G00-00-2308409
Natural Resources Conservation Service G00-04-2280100
Research Economics Education G01-91-2827118
Rural Development G00-09-0025460
Bureau of the Census G02-93-4116586
STATUSA Database G00-10-3137809
Bureau of Export Administration G00-03-1901246
FEDWorld G00-06-4174747
International Trade Administration G00-00-3667859
ITA G00-00-3667859
National Institute of Standards Technology G40-04-1519418
NIST G40-04-1519418
National Marine Fisheries Service G46-01-2225985
NMFS G46-01-2225985
National Oceanic Atmospheric Administration G21-42-3486883
NOAA G21-42-3486883
National Ocean Service G00-03-1496820
National Technical Information Service G01-03-0674427
NTIS G01-03-0674427
National Telecommunications Information Administration G00-05-1550998
National Weather Service G00-10-2171731
Department of Education G00-03-2042174
Educational Resources Information Center G08-78-1802103
ERIC G08-78-1802103
National Library of Education G04-56-3588687
NLE G04-56-3588687
Department of Energy G00-06-1479477
Office of Economic Impact and Diversity G05-02-2264248
Southwestern Power Administration G00-11-0259770
Department of Health and Human Services G00-00-3031135
HHS G00-00-3031135
Administration for Children and Families G29-19-2177375
Agency for Health Care Research and Quality G00-01-0960846
AHCRQ G00-01-0960846
Centers for Disease Control and Prevention G08-82-2708305
CDC G08-82-2708305
Food and Drug Administration G00-01-3511414
FDA G00-01-3511414
Health Care Financing Administration G00-03-3635966
National Institutes of Health G00-01-3774693
NIH G00-01-3774693
National Library of Medicine G00-06-1119476
NLM G00-06-1119476
Department of Housing and Urban Development G19-73-3432233
HUD G19-73-3432233
Government National Mortgage Association G37-23-0000000
Ginnie Mae G37-23-0000000
Housing and Urban Development Reading Room G12-73-4081497
Office of Healthy Homes and Lead Hazard Control G10-39-2062297
Public and Indian Housing Agencies G12-36-3618097
Department of the Interior G00-09-2318516
DOI G00-09-2318516
Bureau of Land Management G00-00-2056373
BLM G00-00-2056373
Geological Survey G01-26-3878517
National Park Service G00-03-0029179
Office of Surface Mining G00-44-0995015
Department of Justice G00-04-3171772
DOJ G00-04-3171772
Drug Enforcement Agency G00-72-4001908
DEA G00-72-4001908
Federal Bureau of Investigation G01-84-2237979
FBI G01-84-2237979
Federal Bureau of Prisons G00-03-2244949
Immigration and Naturalization Service G04-47-1027920
INS G04-47-1027920
Office of Justice Programs G00-52-2562368
OJP G00-52-2562368
United States Marshals Service G04-91-1779147
USMS G04-91-1779147
Department of Labor G19-13-1577185
DOL G19-13-1577185
Bureau of Labor Statistics G39-37-3612440 G00-01-0682299
BLS G39-37-3612440 G00-01-0682299
Mine Safety and Health Administration G00-10-3730888
Occupational Safety Health Administration G00-09-2693851
OSHA G00-09-2693851
Department of State G00-58-0058694
DOS G00-58-0058694
Department of State Library G00-18-1147964
Department of Transportation G01-50-1226182
DOT G01-50-1226182
Bureau of Transportation Statistics G00-01-3065065
Federal Aviation Administration G00-06-2330537
FAA G00-06-2330537
National Transportation Library G00-03-1771651
Department of the Treasury G00-03-3649117
Bureau of Alcohol Tobacco Firearms G04-24-1874467
ATF G04-24-1874467
Bureau of Engraving and Printing G00-01-0534347
Bureau of Public Debt G00-04-1219947
Executive Office for Asset Forfeiture G04-75-2804241
Financial Crimes Enforcement Network G03-33-2329825
Financial Management Service G00-10-2794731
FMS G00-10-2794731
Internal Revenue Service IRS G01-42-2236557 G27-81-0697864
Office of Thrift Supervision G00-10-2917540
OTS G00-10-2917540
Secret Service G03-62-1819147
US Customs Service G26-69-3739619
US Mint G01-38-0907787
Department of Veterans Affairs G07-29-0536719
Advisory Council on Historic Preservation G00-08-1007258
ACHP G00-08-1007258
American Battle Monuments Commission G08-41-4046345
Central Intelligence Agency G06-34-0212798 G00-04-0693582
CIA G06-34-0212798 G00-04-0693582
Commodity Futures Trading Commission G00-16-3850519
CFTC G00-16-3850519
Consumer Product Safety Commission G00-03-1848726
CPSC G00-03-1848726
Corporation for National Service G00-08-4188069
Environmental Protection Agency G00-00-0029827
EPA G00-00-0029827
Equal Employment Opportunity Commission G00-79-1517391
EEOC G00-79-1517391
Farm Credit Administration G00-07-3398062
FCA G00-07-3398062
Federal Communications Commission G36-78-0130889
FCC G36-78-0130889
Federal Deposit Insurance Corporation G01-51-0988286
FDIC G01-51-0988286
Federal Election Commission G00-06-3072823
FEC G00-06-3072823
Federal Emergency Management Agency G00-03-2245885
FEMA G00-03-2245885
Federal Energy Regulatory Commission G00-05-0212361
FERC G00-05-0212361
Federal Labor Relations Authority G00-07-2059058
FLRA G00-07-2059058
Federal Maritime Commission G00-00-2164772
Federal Retirement Thrift Investment Board G00-06-0905797
FRTIB G00-06-0905797
Federal Trade Commission G03-32-2819928
FTC G03-32-2819928
General Services Administration G00-05-1904668
GSA G00-05-1904668
Federal Consumer Information Center Pueblo CO G22-50-0922418
Institute of Museum and Library Services G00-11-0472793
IMLS G00-11-0472793
International Broadcasting Bureau G00-06-1636322
IBB G00-06-1636322
Merit Systems Protection Board G01-60-1363045
MSPB G01-60-1363045
National Archives and Records Administration G00-02-1372443
NARA G00-02-1372443
National Capital Planning Commission G00-08-1222422
NCPC G00-08-1222422
National Commission on Libraries and Information Science NCLIS G00-05-0712949
NCLIS G00-05-0712949
National Council on Disability G00-08-0435196
National Credit Union Administration G42-74-1917577
NCUA G42-74-1917577
National Endowment for the Arts G00-00-3681135
NEA G00-00-3681135
National Mediation Board G00-06-2661322
NMB G00-06-2661322
National Science Foundation NSF G00-07-1120880
NSF G00-07-1120880
National Transportation Safety Board G00-02-1479121
NTSB G00-02-1479121
Nuclear Regulatory Commission G00-11-0770745
NRC G00-11-0770745
Nuclear Waste Technical Review Board G00-05-1894408
NWTRB G00-05-1894408
Occupational Safety and Health Administration G00-09-2693851
OSHA G00-09-2693851
Office of Federal Housing Enterprise Oversight G00-07-2732685
OFHEO G00-07-2732685
Office of Personnel Management G01-78-1330378
OPM G01-78-1330378
Office of Special Counsel G12-71-1037814 G00-09-3815798
OSC G12-71-1037814 G00-09-3815798
Overseas Private Investment Corporation G00-03-1048747
OPIC G00-03-1048747
Peace Corps G12-14-0612098
Pension Benefit Guaranty Corporation G00-08-2596456
Postal Rate Commission G00-10-2861072
Railroad Retirement Board G00-00-2016453
RRB G00-00-2016453
Securities and Exchange Commission G00-05-3121512
SEC G00-05-3121512
Selective Service System G00-08-4021223
SSS G00-08-4021223
Social Security Administration G03-24-2061352
SSA G03-24-2061352
Tennessee Valley Authority G00-07-2267029
TVA G00-07-2267029
Thrift Savings Plan G00-04-2615580
TSP G00-04-2615580
United States Arms Control and Disarmament Agency G00-50-1769358
ACDA G00-50-1769358
United States International Trade Commission G00-00-0300859
USITC G00-00-0300859
Dataweb G00-00-1961652
United States Office of Government Ethics G01-28-2830345
United States Postal Service G00-07-4137777
USPS G00-07-4137777
United States Trade and Development Agency G00-02-0555602
Voice of America G00-22-0758032
Broadcasting Bureau of Governors G01-30-3859822
Task Force on Agricultural Air Quality Research G01-51-3170401
White House Commission on Aviation Safety and Security G12-57-0619425
Radio and TV Marti G01-88-3234145
Judicial Branch G00-03-1342151
Legislative Branch G02-36-2411536 G02-32-2010279
Library of Congress G00-03-097897

Table G.1: .GOV home page finding training set. Generated using the automated sitemap method (described in Section 2.6.5.3) on the first.gov listing of government departments.
    Bibliography 1. ABITEBOUL, S.,PREDA, M., AND COBENA, G. Adaptive On-Line Page Impor- tance Computation. In Proceedings of WWW2003 (Budapest, Hungary, May 2003). 2. ADAMIC, L. A. The small World Wide Web. In Proceedings of ECDL’99 (Paris, France, 1999), pp. 443–452. 3. ADAMIC, L. A. Zipf, Power-laws, and Pareto - a ranking tutorial. Tech. rep., Information Dynamics Lab, HP Labs, 2000. http://www.hpl.hp.com/ research/idl/papers/ranking/ranking.html. 4. ADAMIC, L. A., AND HUBERMAN, B. A. The Nature of Markets in the World Wide Web. Quarterly Journal of Economic Commerce 1 (2000), 5–12. 5. ADAMIC, L. A., AND HUBERMAN, B. A. The Web’s Hidden Order. Communica- tions of the ACM 44, 9 (September 2001). 6. ALBERT, R., BARABASI, A., AND JEONG, H. Diameter of the World Wide Web. Nature 401, 9 (September 1999), 103–131. 7. ALTAVISTA. AltaVista. http://www.altavista.com, accessed 10/12/2003. 8. AMENTO, B., TERVEEN, L. G., AND HILL, W. C. Does “authority” mean qual- ity? Predicting expert quality ratings of Web documents. In Proceedings of ACM SIGIR’00 (Athens, Greece, July 2000), pp. 296–303. 9. AMITAY, E., CARMEL, D., DARLOW, A., LEMPEL, R., AND SOFFER, A. The Connectivity Sonar: Detecting Site Functionality by Structural Patterns. In Pro- ceedings of ACM HT’03 (Nottingham, United Kingdom, August 2003). 10. APACHE. Welcome! - The Apache HTTP Server Project, 2004. http://httpd. apache.org, accessed 12/11/2004. 11. ARASU, A., NOVAK, J., TOMKINS, A., AND TOMLIN, J. PageRank Computation and the Structure of the Web: Experiments and Algorithms. In Proceedings of WWW2002 (Hawaii, USA, May 2002). 12. AUSTRALIA POST. Australia post, 2004. http://www.australiapost.com. au, accessed 12/11/2004. 13. AYAN, N. F., LI, W.-S., AND KOLAK, O. Automating extraction of logical do- mains in a web site. Data and Knowledge Engineering 43, 2 (November 2002), 179– 205. 213
    214 Bibliography 14. BAEZA-YATES,R., AND RIBEIRO-NETO, B. Modern Information Retrieval. Addi- son Wesley, 1999. 15. BAILEY, P., CRASWELL, N., AND HAWKING, D. Engineering a multi-purpose test collection for Web retrieval experiments. Information Processing and Man- agement 39, 6 (2003), 853–871. http://es.cmis.csiro.au/pubs/bailey ipm03.pdf. 16. BALDI, P., FRASCONI, P., AND SMYTH, P. Modeling the Internet and the Web: Probabilistic Methods and Algorithms. Wiley, 2003. 17. BARABASI, A.-L., AND ALBERT, R. Emergence of Scaling in Random Networks. Science 286 (October 1999). 18. BARABASI, A.-L., ALBERT, R., AND JEONG, H. Scale-free characteristics of ran- dom networks: the topology of the World-Wide Web. Physica A 281 (2000), 69– 77. 19. BERGER, A., AND LAFFERTY, J. D. Information Retrieval as Statistical Transla- tion. In Proceedings of ACM SIGIR’99 (Berkeley, CA, USA, 1999), pp. 222–229. 20. BERNERS-LEE, T. Weaving the Web. The Original Design and Ultimate Destiny of the World Wide Web by its Inventor. Harper Collins, San Francisco, 1999. 21. BERNERS-LEE, T., FIELDING, R., AND MASINTER, L. RFC2396 – Uniform Re- source Identifiers. Request for Comments, August 1998. 22. BERRY, M. W., DUMAIS, S. T., AND O’BRIEN, G. W. Using Linear Algebra for Intelligent Information Retrieval. Tech. rep., University of Tennessee, Depart- ment of Computer Science, December 1994. 23. BHARAT, K., AND BRODER, A. Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content. In Proceedings of WWW8 (Toronto, Canada, May 1999). http://www8.org/w8-papers/4c-server/mirror/mirror.html. 24. BHARAT, K., BRODER, A., DEAN, J., AND HENZINGER, M. A Comparison of Techniques to Find Mirrored Hosts on the WWW. In WOWS’99 (Berkeley, USA, August 1999). http://www.henzinger.com/monika/. 25. BHARAT, K., CHANG, B., HENZINGER, M., AND RUHL, M. Who links to whom: Mining linkage between Web sites. In Proceedings of ICDM’01 (San Jose, USA, November 2001). 26. BHARAT, K., AND HENZINGER, M. Improved Algorithms for Topic Distilla- tion in a Hyperlinked Environment. In Proceedings of ACM SIGIR’98 (Melbourne, Australia, 1998). 27. BHARAT, K., AND MIHAILA, G. A. When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics. In Proceedings of WWW2001 (Hong Kong, 2001). http://www10.org/cdrom/papers/474/.
    Bibliography 215 28. BOOKSTEIN,A. Implications of Boolean Structures for Probabilistic Retrieval. In Proceedings of ACM SIGIR’85 (New York, USA, 1985), pp. 11–17. 29. BOTAFOGO, R., RIVLIN, E., AND SHNEIDERMAN, B. Structural Analysis of Hy- pertexts: Identifying Hierarchies and Useful Metrics. ACM Transactions on Infor- mation Systems 10, 2 (1992), 142–180. 30. BRAY, T. Measuring the Web. In Proceedings of WWW5 (Paris, France, May 1996). 31. BRIN, S., AND PAGE, L. The anatomy of a large-scale hypertextual web search engine. In Proceedings of WWW7 (Brisbane, Australia, May 1998). http: //www7.scu.edu.au/programme/fullpapers/1921/com1921.htm. 32. BRODER, A. On the Resemblance and Containment of Documents. In Proceed- ings of SEQS’97 (1997). 33. BRODER, A. A taxonomy of web search. ACM SIGIR Forum 36, 2 (Fall 2002), 3–10. 34. BRODER, A., GLASSMAN, S., MANASSE, M., AND ZWEIG, G. Syntactic Clustering of the Web. In Proceedings of WWW6 (Santa Clara, USA, April 1997). http://www.scope.gmd.de/info/www6/technical/paper205/ paper205.html. 35. BRODER, A., KUMAR, R., MAGHOUL, F., RAGHAVAN, P., RAJAGOPALAN, S., STATA, R., TOMKINS, A., AND WIENER, J. Graph structure in the Web: ex- periments and models. In Proceedings of WWW9 (Amsterdam, 2000). http: //www9.org/w9cdrom/index.html. 36. BUCKLEY, C., AND VOORHEES, E. Evaluating evaluation measure stability. In Proceedings of ACM SIGIR’00 (Athens, Greece, July 2000), pp. 33–40. 37. CAI, D., YU, S., WEN, J.-R., AND MA, W.-Y. VIPS: a Vision-based Page Segmen- tation Algorithm. Tech. rep., Microsoft Research Asia, 2003. MSR-TR-2003-79. 38. CAI, D., YU, S., WEN, J.-R., AND MA, W.-Y. Block-based web search. In Pro- ceedings of ACM SIGIR’04 (Sheffield, UK, July 2004), pp. 456–463. 39. CAI, D., YU, S., WEN, J.-R., AND MA, W.-Y. Block-level Link Analysis. In Proceedings of ACM SIGIR’04 (Sheffield, UK, July 2004), pp. 440–447. 40. CALADO, P., RIBEIRO-NETO, B., ZIVIANI, N., MOURA, E., AND SILVA, I. Local Versus Global Link Information in the Web. ACM Transactions on Information Systems 21, 1 (January 2003), 42–63. 41. CARRI `ERE, S. J., AND KAZMAN, R. WebQuery: Searching and visualizing the Web through connectivity. In Proceedings of WWW6 (Santa Clara, USA, 1997), pp. 701–711. http://www.scope.gmd.de/info/www6/technical/ paper096/paper96.html.
    216 Bibliography 42. CHAKRABARTI,S. Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction. In Proceedings of WWW2001 (Hong Kong, 2001), pp. 211–220. 43. CHAKRABARTI, S. Mining the Web: Discovering knowledge from hypertext data. Morgan Kaufmann, San Francisco, 2003. 44. CHAKRABARTI, S., DOM, B., RAGHAVAN, P., RAJOGOPALAN, S., AND KLEIN- BERG, J. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of WWW7 (Melbourne, Australia, 1998), pp. 65–74. 45. CHAKRABARTI, S., JOSHI, M., AND TAWDE, V. Enhanced Topic Distillation us- ing Text, Markup Tags, and Hyperlinks. In Proceedings of ACM SIGIR’01 (New Orleans, USA, 2001), pp. 208–216. 46. CHO, J., GARC´IA-MOLINA, H., AND PAGE, L. Efficient crawling through URL ordering. Computer Networks and ISDN Systems 30, 1–7 (1998), 161–172. 47. CHOWDHURY, A., FRIEDER, O., GROSSMAN, D., AND MCCABE, M. Collection Statistics for Fast Duplicate Document Detection. ACM Transactions on Informa- tion Systems 20, 2 (April 2002), 171–191. 48. CLEVERDON, C., MILLS, J., AND KEEN, M. Factors determining the perfor- mance of indexing systems. In ASLib Cranfield Project. Cranfield, 1966. 49. CLEVERDON, C. W. Optimizing convenient online access to bibliographic data- bases. Information Services and Use 4 (1984), 37–47. 50. COLLINS-THOMPSON, K., OGILVIE, P., ZHANG, Y., AND CALLAN, J. Informa- tion Filtering, Novelty Detection, and Named-Page Finding. In TREC-11 Note- book Proceedings (Gaithersburg, Maryland USA, November 2002), NIST. 51. COOPER, W. S. Getting beyond Boole. Information Processing and Management: An International Journal 24 (May 1988), 243–248. 52. CRASWELL, N., CRIMMINS, F., HAWKING, D., AND MOFFAT, A. Performance and cost tradeoffs in web search. In ADC’04 (Dunedin, New Zealand, January 2004), pp. 161–170. http://es.csiro.au/pubs/craswell adc04.pdf. 53. CRASWELL, N., AND HAWKING, D. Overview of the TREC-2002 Web Track. In TREC-11 Notebook Proceedings (Gaithersburg, MD, USA, November 2002). 54. CRASWELL, N., AND HAWKING, D. TREC-2004 Web Track Guidelines, July 2004. http://es.csiro.au/TRECWeb/guidelines 2004.html, accessed 10/11/2004. 55. CRASWELL, N., AND HAWKING, D. Characteristics of human-generated re- source lists. Unpublished (In submission).
    Bibliography 217 56. CRASWELL,N., HAWKING, D., AND ROBERTSON, S. Effective site finding us- ing link anchor information. In Proceedings of ACM SIGIR’01 (New Orleans, USA, 2001), pp. 250–257. http://es.cmis.csiro.au/pubs/craswell sigir01.pdf. 57. CRASWELL, N., HAWKING, D., THOM, J., UPSTILL, T., WILKINSON, R., AND WU, M. TREC11 Web and Interactive Tracks at CSIRO. In TREC-11 Notebook Proceedings (Gaithersburg, MD, USA, November 2002). 58. CRASWELL, N., HAWKING, D., THOM, J., UPSTILL, T., WILKINSON, R., AND WU, M. TREC12 Web Track at CSIRO. In TREC-12 Notebook Proceedings (Gaithersburg, MD, USA, November 2003). 59. CRASWELL, N., HAWKING, D., WILKINSON, R., AND WU, M. TREC10 Web and Interactive Tracks at CSIRO. In TREC-10 Notebook Proceedings (Gaithersburg, MD, USA, November 2001). http://es.cmis.csiro.au/pubs/craswell trec01.pdf. 60. CRASWELL, N., HAWKING, D., WILKINSON, R., AND WU, M. Overview of the TREC-2003 Web Track. In TREC-12 Notebook Proceedings (Gaithersburg, MD, USA, November 2003). 61. CROFT, W. B., AND HARPER, D. J. Using probabilistic models of document retrieval without relevance information. Journal of Documentation 35 (1979), 285– 295. 62. CSIRO. TREC Web Corpus: WT10g, 2003. http://es.csiro.au/TRECWeb/ wt10g.html, accessed 12/11/2004. 63. DAVISON, B. D. Recognizing Nepotistic Links on the Web. In Proceedings of AAAI’00 (Workship on Artificial Intelligence for Web Search) (Austin, Texas USA, 2000), pp. 23–28. 64. DAVISON, B. D. Topical Locality in the Web. In Proceedings of ACM SIGIR’00 (Athens, Greece, July 2000), pp. 272–279. 65. DAVISON, B. D. Topical Locality in the Web: Experiments and Observations. Tech. rep., Department of Computer Science, Rutgers, New Jersey, July 2000. 66. DEERWESTER, S., DUMAIS, S. T., FURNAS, G. W., LANDAUER, T. K., AND HARSHMAN, R. Indexing by Latent Semantic Analysis. JASIS 41, 6 (1990), 391– 407. 67. DILL, S., KUMAR, R., MCCURLEY, K. S., RAJAGOPALAN, S., SIVAKUMAR, D., AND TOMKINS, A. Self-Similarity in the Web. ACM Transactions On Internet Technologies 2, 3 (August 2002), 205–223.
    218 Bibliography 68. DING,C., HE, X., HUSBANDS, P., ZHA, H., AND SIMON, H. PageRank, HITS and a unified framework for link analysis. Tech. Rep. 49372, LBNL, 2002. http: //citeseer.nj.nec.com/546720.html. 69. DMOZ. Open Directory Project. http://www.dmoz.org, accessed 12/11/2004. 70. DUBLIN CORE METADATA INITIATIVE. Dublin Core Metadata Element Set, Version 1.1: Reference Description, 2003. http://dublincore.org/ documents/dces/, accessed 14/11/2004. 71. DUBLIN CORE METADATA INITIATIVE. DCMI Frequently Asked Ques- tions (FAQ) – What search-enginges support the Dublin Core Metadata Element Set?, 2004. http://www.dublincore.org/resources/faq/ #whatsearchenginessupport, accessed 14/11/2004. 72. DWORK, C., KUMAR, R., NAOR, M., AND SIVAKUMAR, D. Rank aggregation methods for the Web. In Proceedings of WWW2001 (Hong Kong, 2001), pp. 613– 622. http://doi.acm.org/10.1145/371920.372165. 73. EIRON, N., AND MCCURLEY, K. S. Analysis of Anchor Text for Web Search. Tech. rep., IBM, 2003. 74. EIRON, N., AND MCCURLEY, K. S. Analysis of Anchor Text for Web Search (Extended Abstract). In Proceedings of ACM SIGIR’03 (Toronto, Canada, 2003), pp. 450–460. 75. EIRON, N., AND MCCURLEY, K. S. Untangling Compound Documents on the Web. Tech. rep., IBM, 2003. 76. EISENBERG, M., AND BARRY, C. Order effects: A study of the possible influence of presentation order on user judgments of document relevance. JASIS 39, 5 (1988), 293–300. 77. EXCITE. Excite, 2004. http://www.excite.com, accessed 12/11/2004. 78. FAGIN, R., KUMAR, R., MCCURLEY, K. S., NOVAK, J., SIVAKUMAR, D., TOM- LIN, J. A., AND WILLIAMSON, D. P. Searching the Workplace Web. In Proceed- ings of WWW2003 (Budapest, Hungary, May 2003), pp. 366–375. 79. FAGIN, R., KUMAR, R., AND SIVAKUMAR, D. Comparing top k lists. In ACM SIAM (Baltimore, MD, USA, 2003), pp. 28–36. 80. FAST SEARCH AND TRANSFER, ASA. Personal communication, 2004. http: //www.alltheweb.com, accessed 12/11/2003. 81. FIELDING. RFC2616 - HTTP/1.1: Status Code Definitions, 1999. http://www. w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3, accessed 12/11/2004.
    Bibliography 219 82. FORTUNE.Fortune 500, 2003. http://www.fortune.com/fortune/ fortune500, accessed 06/09/2003. 83. FOX, E., AND SHAW, J. Combination of multiple searches. In TREC-3 Notebook Proceedings (Gaithersburg, MD, USA, 1994), pp. 243–252. 84. FRAKES, W., AND BAEZA-YATES, R., Eds. Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992. 85. FUHR, N., LALMAS, M., KAZAI, G., AND VERT, N. G. Proceedings of the INitia- tive for the Evaluation of XML Retrieval (INEX). In ERCIM workshop proceedings (Dagstuhl, 2003). 86. FUJIMURA, K., INOUE, T., AND SUGISAKI, M. The EigenRumor Algorithm for Ranking Blogs. In 2nd Annual Workshop on the Weblogging Ecosystem - Aggregation, Analysis and Dynamics (Chiba, Japan, 2005). 87. GARFIELD, E. Citation Indexes for Science: A New Dimension in Documentation through Association of Ideas. Science 122, 3159 (1955), 108–111. 88. GARFIELD, E. Citation analysis as a tool in journal evaluation. Science 178, 4060 (1972), 471–479. 89. GARNER, R. A Computer Oriented, Graph Theoretic Analysis of Citation Index Struc- tures. Drexel University Press, Philadelphia, 1967. 90. GLOVER, E. J., TSIOUTSIOULIKLIS, K., LAWRENCE, S., PENNOCK, D. M., AND FLAKE, G. W. Using Web Structure for Classifying and Describing Web Pages. In Proceedings of WWW2002 (Honolulu, Hawaii, USA, May 2002). 91. GOLUB, G. H., AND LOAN, C. F. V. Matrix Computations. The Johns Hopkins University Press, Baltimore, USA, 1996. 92. GOOGLE. Blogger. http://www.blogger.com, accessed 06/11/2005. 93. GOOGLE. Google search engine. http://www.google.com, accessed 12/11/2004. 94. GOOGLE. Google Directory > Shopping Publications > Books > Gen- eral, September 2002. http://directory.google.com/Top/Shopping/ Publications/Books/General, accessed 09/09/2002. 95. GOOGLE. Google Directory, 2004. http://directory.google.com/, ac- cessed 12/11/2004. 96. GOOGLE. Google Search Appliance Frequently Asked Questions, 2004. http: //www.google.com/appliance/faq.html, accessed 12/11/2004. 97. GOOGLE. Google Technology, 2004. http://www.google.com/ technology/, accessed 10/11/2004.
    220 Bibliography 98. GOOGLE.Google Toolbar, 2004. http://toolbar.google.com/, accessed 12/11/2004. 99. GRANKA, L., JOACHIMS, T., AND GAY, G. Eye-Tracking Analysis of User Behav- ior in WWW Search. In Proceedings of ACM SIGIR’04 (Sheffield, United Kingdom, August 2004). 100. GURRIN, C., AND SMEATON, A. F. Replicating Web Structure in Small-Scale Test Collections. Information Retrieval 7 (2004), 239–263. 101. HARMAN, D. How effective is suffixing? JASIS 42, 1 (1991), 7–15. 102. HAVELIWALA, T. H. Efficient computation of PageRank. Tech. Rep. 1999-31, Stanford University Database Group, 1999. http://dbpubs.stanford.edu: 8090/pub/1999-31. 103. HAVELIWALA, T. H. Topic-sensitive pagerank. In Proceedings of WWW2002 (Honolulu, Hawaii, USA, 2002), ACM Press, pp. 517–526. 104. HAVELIWALA, T. H. Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search. In IEEE Transactions on Knowledge and Data Engineer- ing (July 2003). 105. HAVELIWALA, T. H., AND KAMVAR, S. D. The Second Eigenvalue of the Google Matrix. Tech. rep., Stanford University, 2003. 106. HAWKING, D. Overview of the TREC-9 Web Track. In TREC-9 Notebook Pro- ceedings (Gaithersburg, MD, USA, 2000). http://trec.nist.gov/pubs/ trec9/. 107. HAWKING, D. Challenges in enterprise search. In Proceedings of the Australasian Database Conference ADC2004 (Dunedin, New Zealand, January 2004), pp. 15–26. Invited paper: http://es.csiro.au/pubs/hawking adc04keynote.pdf. 108. HAWKING, D., BAILEY, P., AND CRASWELL, N. An intranet reality check for TREC ad hoc. Tech. rep., CSIRO Mathematical and Information Sciences, 2000. http://es.cmis.csiro.au/pubs/hawking tr00.pdf. 109. HAWKING, D., BAILEY, P., AND CRASWELL, N. Efficient and flexible search using text and metadata. Tech. rep., CSIRO Mathematical and Information Sci- ences, 2000. http://es.csiro.au/pubs/hawking tr00b.pdf. 110. HAWKING, D., AND CRASWELL, N. Overview of the TREC-2001 Web Track. In TREC-10 Notebook Proceedings (Gaithersburg, MD, USA, 2001). http://trec. nist.gov/pubs/. 111. HAWKING, D., AND CRASWELL, N. Very large scale retrieval and web search. In TREC: Experiment and Evaluation in Information Retrieval, E. Voorhees and D. Har- man, Eds. MIT Press, 2005. http://es.csiro.au/pubs/trecbook for website.pdf.
    Bibliography 221 112. HAWKING,D., CRASWELL, N., BAILEY, P., AND GRIFFITHS, K. Measuring search engine quality. Information Retrieval 4, 1 (2001), 33–59. http://es. cmis.csiro.au/pubs/hawking ir01.pdf. 113. HAWKING, D., CRASWELL, N., CRIMMINS, F., AND UPSTILL, T. Enterprise search: What works and what doesn’t. In Proceedings of the Infonortics Search Engines Meeting (San Francisco, April 2002). http://es.csiro.au/pubs/ hawking se02talk.pdf. 114. HAWKING, D., CRASWELL, N., CRIMMINS, F., AND UPSTILL, T. How valu- able is external link evidence when searching enterprise webs? In Proceedings of ADC’04 (Dunedin, New Zealand, January 2004). http://es.cmis.csiro. au/pubs/hawking adc04.pdf. 115. HAWKING, D., CRASWELL, N., CRIMMINS, F., AND UPSTILL, T. How Valuable is External Link Evidence when Searching Enterprise Webs? In Proceedings of ADC’04 (Dunedin, New Zealand, January 2004). http://es.cmis.csiro. au/pubs/hawking adc04.pdf. 116. HAWKING, D., CRASWELL, N., AND GRIFFITHS, K. Which search engine is best at finding online services? In Proceedings of WWW10 (Hong Kong, 2001). http: //www10.org/cdrom/posters/1089.pdf. 117. HAWKING, D., CRASWELL, N., THISTLEWAITE, P., AND HARMAN, D. Results and challenges in Web search evaluation. In Proceedings of WWW8 (Toronto, Canada, 1999), vol. 31, pp. 1321–1330. http://es.cmis.csiro.au/pubs/ hawking www99.pdf. 118. HAWKING, D., AND ROBERTSON, S. On Collection Size and Retrieval Effective- ness. Information Retrieval 6, 1 (2003), 99–150. 119. HAWKING, D., AND THISTLEWAITE, P. Overview of TREC-6 Very Large Collec- tion Track. In TREC-6 Notebook Proceedings (Gaithersburg, MD, USA, 1997), E. M. Voorhees and D. K. Harman, Eds., pp. 93–105. 120. HAWKING, D., UPSTILL, T., AND CRASWELL, N. Towards better weighting of anchors. In Proceedings of SIGIR’04 (Sheffield, England, July 2004), pp. 512–513. http://es.csiro.au/pubs/hawking sigirposter04.pdf. 121. HAWKING, D., VOORHEES, E., BAILEY, P., AND CRASWELL, N. Overview of TREC-8 Web Track. In TREC-8 Notebook Proceedings (Gaithersburg, MD, USA, 1999), pp. 131–150. http://trec.nist.gov/pubs/trec-8. 122. HENZINGER, M., MOTWANI, R., AND SILVERSTEIN, C. Challenges in Web Search Engines. ACM SIGIR Forum 36, 2 (Fall 2002). 123. HEYDON, A., AND NAJORK, M. Mercator: A Scalable, Extensible Web Crawler. World Wide Web Journal (December 1999), 219 – 229. http://www.research. digital.com/SRC/mercator/.
    222 Bibliography 124. HORRIGAN,J. B., AND RAINIE, L. PEW Internet & American life project: Getting serious online, March 2002. http://www.pewinternet.org/ reports/reports.asp?Report=55&Section=ReportLevel1&Field= Level1ID&ID=241, accessed 12/11/2004. 125. HUBBELL, C. H. An Input-Output Approach to Clique Identification. Sociometry 28 (1965), 377–399. 126. HULL, D. Stemming algorithms – a case study for detailed evaluation. JASIS 47, 1 (1996), 70–84. 127. JEH, G., AND WIDOM, J. Scaling personalized web search. In Proceedings of WWW2003 (Budapest, Hungry, 2003), pp. 271–279. 128. JING, Y., AND CROFT, W. B. An association thesaurus for information retrieval. In Proceedings of RIAO’94 (New York, USA, 1994), pp. 146–160. 129. JOACHIMS, T. Evaluating Retrieval Performance Using Clickthrough Data. In Proceedings of ACM SIGIR’02 Workshop on Mathematical/Formal Methods in Infor- mation Retrieval (Tampere, Finland, 2002). 130. KAMVAR, S. D., HAVELIWALA, T. H., MANNING, C. D., AND GOLUB, G. H. Exploiting the block structure of the web for computing PageRank. Tech. rep., Stanford University, 2003. 131. KATZ, L. A new status index derived from sociometric analysis. Psychometrika 18, 1 (March 1953), 39–43. 132. KLEINBERG, J. M. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM 46, 5 (1999), 604–632. 133. KOSTER, M. robotstxt.org, 2003. http://www.robotstxt.org/, accessed 12/11/2003. 134. KRAAIJ, W., AND POHLMANN, R. Viewing Stemming as Recall Enhancement. In Proceedings of ACM SIGIR’96 (Zurich, Switzerland, 1996), pp. 40–48. 135. KRAAIJ, W., WESTERVELD, T., AND HIEMSTRA, D. The Importance of Prior Probabilities for Entry Page Search. In Proceedings of ACM SIGIR’02 (Tampere, Finland, 2002), pp. 27–34. 136. KUMAR, S. R., RAGHAVAN, P., RAJAGOPALAN, S., SIVAKUMAR, D., TOMKINS, A., AND UPFAL, E. The Web as a Graph. In Symposium on Principles of Database Systems (Dallas, Texas USA, 2000), pp. 1–10. 137. KUMAR, S. R., RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. Trawling the Web for emerging cyber-communities. In Proceedings of WWW8 (Toronto, Canada, 1999), pp. 403–415.
    Bibliography 223 138. LARSON,R. R. Bibliometrics of the World Wide Web: An exploratory analysis of the intellection architecture of cyberspace. Tech. rep., Computer Science De- partment, University of California, Santa Barbara, 1996. http://sherlock. berkeley.edu/asis96/asis96.html. 139. LAWRENCE, S., AND GILES, C. L. Searching the World Wide Web. Science 280, 5360 (1998). 140. LEMPEL, R., AND MORAN, S. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks 33, 1–6 (2000), 387–401. 141. LEMPEL, R., AND MORAN, S. (SALSA) the stochastic approach for link-structure analysis. ACM Transactions on Information Systems (2001). 142. LI, W.-S., KOLAK, O., AND VU, Q. Defining Logical Domains in a Web Site. In Proceedings of HT’00 (San Antonio, Texas USA, 2000). 143. LI, Y., AND RAFSKY, L. Beyond Relevance Ranking: Hyperlink Vector Vot- ing. In Proceedings of ACM SIGIR’97 Workshop on Networked Information Retrieval (Philadelphia, USA, 1997). 144. LOOKSMART. Looksmart, 2003. http://www.looksmart.com, accessed 12/11/2004. 145. MARCHIORI, M. The Quest for Correct Information on the Web: Hyper Search Engines. In Proceedings of WWW6 (Santa Clara, USA, 1997), pp. 265–276. 146. MARON, M., AND KUHNS, J. On Relevance, Probabilistic Indexing and Infor- mation Retrieval. Journal of the ACM 7, 3 (1960), 216–244. 147. MCKELLEHER, K. The Wired 40, July 2003. http://www.wired.com/ wired/archive/11.07/40main.html, accessed 06/09/2003. 148. MICROSOFT. Internet Information Services, 2004. http://www.microsoft. com/windowsserver2003/iis/default.mspx, accessed 11/12/2004. 149. MICROSOFT. MSN Search Engine, 2004. http://search.msn.com, accessed 11/12/2004. 150. MIZZARO, S. Relevance: The Whole History. JASIS 48, 9 (1997), 810–832. 151. MONTAGUE, M. Metasearch: Data fusion for Document Retrieval. PhD thesis, Dart- mouth College, Hannover, New Hampshire, 2002. 152. NETSCAPE. Core JavaScript Guide 1.5, 2000. http://devedge.netscape. com/library/manuals/2000/javascript/1.5/guide/. 153. NEW YORK TIMES. Bestsellers. Web Site, September 2002. http://www. nytimes.com/2002/09/01/books/bestseller/, accessed 09/09/2002.
    224 Bibliography 154. NG,A. Y., ZHENG, A. X., AND JORDAN, M. I. Link analysis, eigenvectors, and stability. In Proceedings of IJCAI’01 (Seattle, USA, 2001), ACM Press. 155. OGILVIE, P., AND CALLAM, J. Combining document representations for known- item search. In Proceedings of ACM SIGIR’03 (Toronto, Canada, August 2003), pp. 143–150. 156. OGILVIE, P., AND CALLAM, J. Combining structural information and the use of priors in mixed named-page and homepage finding. In TREC-12 Notebook Proceedings (Gaithersburg, MD, USA, November 2003), NIST. 157. PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. The PageRank Cita- tion Ranking: Bringing Order to the Web. Tech. Rep. 1999-66, Stanford Uni- versity Database Group, 1998. http://dbpubs.stanford.edu:8090/pub/ 1999-66. 158. PANDURANGAN, G., RAGHAVAN, P., AND UPFAL, E. Using PageRank to Char- acterize Web Structure. Tech. rep., Purdue University, 2002. 159. PANT, G. Deriving Link-context from HTML. In ACM DMKD (San Diego, Cali- fornia, USA, June 2003). 160. PARKER, L. M. P., AND JOHNSON, R. E. Does order of presentation affect users’ judgment of documents? JASIS 41, 7 (1990), 493–494. 161. PINSKI, G., AND NARIN, F. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Information Processing and Management 12 (1976). 162. PONTE, J. M., AND CROFT, W. B. A Language Modeling Approach to Informa- tion Retrieval. In Proceedings of ACM SIGIR’98 (Melbourne, Australia, August 1998). 163. PORTER, M. An algorithm for suffix stripping. Program 14, 3 (1980), 130–137. http://www.tartarus.org/∼martin/PorterStemmer/. 164. RAGGETT, D., HORS, A. L., AND JACOBS, I. HTML 4.01 Specification: The global structure of an HTML document, 1999. http://www.w3.org/TR/ html4/struct/global.html#didx-meta data, accessed 12/11/2004. 165. RAGHAVAN, S., AND GARCIA-MOLINA, H. Crawling the Hidden Web. In Pro- ceedings of VLDB’01 (2001), pp. 129–138. http://citeseer.ist.psu.edu/ article/raghavan01crawling.html. 166. RIVEST, R. The MD5 message-digest algorithm. Request for Comments, April 1992. 167. ROBERTSON, S. The probability ranking principle in IR. Journal of Documentation 33 (1977), 294–304. As appears in Spark-Jones and Willet, 1997.
    Bibliography 225 168. ROBERTSON,S., AND JONES, K. S. Simple, proven approaches to text retrieval. Tech. Rep. UCAM-CL-TR-356, University of Cambridge, May 1997. http://www.cl.cam.ac.uk/ftp/papers/reports/abstract. html#TR356-ksj-approaches-to-text-retrieval.html. 169. ROBERTSON, S., AND SPARCK-JONES, K. Relevance weighting of search terms. JASIS 27 (1976), 129–146. 170. ROBERTSON, S., AND WALKER, S. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of ACM SI- GIR’94 (Dublin, Ireland, 1994), pp. 232–241. 171. ROBERTSON, S., WALKER, S., HANCOCK-BEAULIEU, M., GULL, A., AND LAU, M. Okapi at TREC-1. In TREC-1 Notebook Proceedings (Gaithersburg, MD, USA, 1992), pp. 21–30. http://trec.nist.gov/pubs/trec1/. 172. ROBERTSON, S., WALKER, S., JONES, S., HANCOCK-BEAULIEU, M., AND GAT- FORD, M. Okapi at TREC-3. In TREC-3 Notebook Proceedings (Gaithersburg, MD, USA, 1994), pp. 109–126. http://trec.nist.gov/pubs/trec3/. 173. ROBERTSON, S., ZARAGOZA, H., AND TAYLOR, M. Simple BM25 extension to multiple weighted fields. In Proceedings of CIKM’04 (2004), pp. 42–49. http: //research.microsoft.com/%7Ehugoz/bm25wf.pdf. 174. ROCCHIO, J. Document Retrieval Systems–Optimization and Evaluation. PhD the- sis, Harvard Computational Laboratory, 1966. 175. ROCCHIO, J. Relevance Feedback in Information Retrieval. Prentice-Hall, Inc., 1971. 176. SALTON, G. Automatic Information Organization. McGraw-Hill, New York, 1968. 177. SALTON, G., Ed. The SMART retrieval system - experiments in automatic documment processing. McGraw-Hill, New York, 1971. 178. SAVOY, J., AND RASOLOFO, Y. Report on the TREC-10 experiment: Distributed collections and entrypage searching. In TREC-10 Notebook Proceedings (Gaithers- burg, MD, USA, 2001). http://trec.nist.gov/pubs/. 179. SEELEY, J. R. The net of reciprocal influence: A problem in treating sociometric data. Canadian Journal of Psychology 3 (1949), 234–240. 180. SHAH, C., AND CROFT, W. B. Evaluating High Accuracy Retrieval Techniques. In Proceedings of ACM SIGIR’04 (Sheffield, United Kingdom, 2004), pp. 2–9. 181. SHAKES, J., LANGHEINRICH, M., AND ETZIONI, O. Dynamic reference sifting: a case study in the homepage domain. Computer Networks and ISDN Systems 29 (1997), 1193–1204. 182. SHANNON, C. E. Prediction and entropy of printed English. Bell Systems Techni- cal Journal, 30 (1951), 51–64.
    226 Bibliography 183. SHIVAKUMAR,N., AND GARCIA-MOLINA, H. Finding Near-Replicas of Docu- ments on the Web. In Proceedings of WDB’98 (1998). 184. SILVERSTEIN, C., HENZINGER, M., MARAIS, H., AND MORICZ, M. Analysis of a Very Large AltaVista Query Log. Tech. rep., Digital Systems Research Center, 1998. 185. SINGHAL, A., AND KASZKIEL, M. A Case Study in Web Search using TREC Algorithms. In Proceedings of WWW10 (Hong Kong, 2001), pp. 708–716. http: //www10.org/cdrom/papers/317/. 186. SINGHAL, A., SALTON, G., MITRA, M., AND BUCKLEY, C. Document Length Normalization. Information Processing and Management 32, 5 (1996). 187. SMALL, H. Co-citation in the scientific literature: A new measure of the relation- ship between two documents. JASIS 24, 4 (1973), 265–269. 188. SOBOROFF, I. Do TREC Web Collections Look Like the Web? ACM SIGIR Forum 36, 2 (2002), 23–31. 189. SOBOROFF, I. On evaluating web search with very few relevant documents. In Proceedings of ACM SIGIR’04 (Sheffield, United Kingdom, 2004), pp. 530–531. 190. SPARCK-JONES, K. A statistical interpretation of term specificity and its applica- tion in retrieval. Journal of Documentation 28, 1 (1972), 11–20. 191. SPARCK-JONES, K., AND WILLET, P., Eds. Readings in Information Retrieval. Mor- gan Kaufmann, 1997. 192. SPINELLO, R. A. An ethical evaluation of web site linking. ACM SIGCAS Com- puters and Society 30, 4 (2000), 25–32. 193. SULLIVAN, D. How To Use HTML Meta Tags, December 2002. http: //searchenginewatch.com/webmasters/article.php/2167931, accessed 08/11/04. 194. SULLIVAN, D. Nielsen/NetRatings Search Engine Ratings. Web Site, September 2002. http://www.searchenginewatch.com/reports/netratings. html, accessed 06/11/2002. 195. SULLIVAN, D. Who Powers Whom? Search Providers Chart. Web Site, Septem- ber 2002. http://www.searchenginewatch.com/reports/alliances. html, accessed 06/11/2002. 196. TERVEEN, L., HILL, W., AND AMENTO, B. Constructing, Organizing, and Col- lections of Topically Related Web Resources. ACM Transactions of Computer- Human Interation 6, 1 (March 1999), 67–94.
    Bibliography 227 197. TOMLIN,J. A. A New Paradigm for Ranking Pages on the World Wide Web. In Proceedings of WWW2003 (Budapest, Hungary, May 2003). http://www2003. org/cdrom/papers/refereed/p042/paper42 html/p42-tomlin.htm. 198. TRAVIS, B., AND BRODER, A. Web search quality vs. informational relevance. In Proceedings of the Infonortics Search Engines Meeting (Boston, 2001). http://www. infonortics.com/searchengines/sh01/slides-01/travis.html. 199. UPSTILL, T., CRASWELL, N., AND HAWKING, D. Buying Bestsellers On- line: A Case Study in Search and Searchability. In Proceedings of ADCS2002 (Sydney, Australia, 2002). http://es.cmis.csiro.au/pubs/upstill adcs02.pdf. 200. UPSTILL, T., CRASWELL, N., AND HAWKING, D. Predicting fame and fortune: Pagerank or indegree? In Proceedings of ADCS2003 (Canberra, Australia, Decem- ber 2003). http://es.cmis.csiro.au/pubs/upstill adcs03.pdf. 201. UPSTILL, T., CRASWELL, N., AND HAWKING, D. Query-independent evidence in home page finding. ACM Transactions on Information Systems 21, 3 (2003), 286– 313. 202. UPSTILL, T., AND ROBERTSON, S. Exploiting Hyperlink Recommendation Ev- idence in Navigational Web Search. In Proceedings of ACM SIGIR’04 (Sheffield, United Kingdom, July 2004), pp. 576–577. 203. VAGHAN, L., AND SHAW, D. Bibliographic and Web Citations: What Is The Difference? JASIS 54, 14 (2003), 1313–1322. 204. VAN RIJSBERGEN, C. J. Information Retrieval, 2nd edition. Dept. of Computer Science, University of Glasgow, 1979. 205. VAN RIJSBERGEN, K. Information Retrieval. Butterworths, 1979. http://www. dcs.gla.ac.uk/Keith/Preface.html. 206. VOORHEES, E. Evaluation by highly relevant documents. In Proceedings of ACM SIGIR’01 (New Orleans, USA, 2001), pp. 74–82. 207. VOORHEES, E. M. Overview of the first Text REtrieval Conference (TREC-1). In TREC-1 Notebook Proceedings (Gaithersburg, MD, USA, 1991). 208. VOORHEES, E. M. Variations in relevance judgments and the measurement of retrieval effectiveness. In Proceedings of ACM SIGIR’98 (Melbourne, 1998). 209. VOORHEES, E. M. The Philosophy of Information Retrieval Evaluation. In Springer’s Lecture Notes. Springer, January 2002. 210. VOORHEES, E. M., AND HARMAN, D. K. Overview of the fifth Text REtrieval Conference (TREC-5). In TREC-5 Notebook Proceedings (Gaithersburg, MD, USA, 1996).
    228 Bibliography 211. WESTERVELD,T. Using generative probabilistic models for multimedia retrieval. PhD thesis, Centrum voor Wiskunde en Informatica, Amsterdam, Netherlands, 2004. 212. WESTERVELD, T., KRAAIJ, W., AND HIEMSTRA, D. Retrieving Web pages using content, links, URLs and anchors. In TREC-10 Notebook Proceedings (Gaithers- burg, MD, USA, 2001). http://trec.nist.gov/pubs/. 213. WILLIAMS, H. E., ZOBEL, J., AND BAHLE, D. Fast phrase querying with com- bined indexes. ACM Transactions on Information Systems 22, 4 (October 2004), 573–572. 214. WITTEN, I. H., BELL, T. C., AND MOFFAT, A. Managing Gigabytes: Compressing and Indexing Documents and Images. John Wiley & Sons, Inc., 1999. 215. XU, J., AND CROFT, W. B. Query expansion using local and global document analysis. In Proceedings of ACM SIGIR’96 (Zurich, Switzerland, 1996), pp. 4–11. 216. YAHOO! Yahoo! Business and Economy > Shopping and Services > Books > Booksellers, September 2002. http://www.yahoo.com/Business and Economy/Shopping and Services/Books/Booksellers/, accessed 09/09/2002. 217. YAHOO! Yahoo! Directory Service, 2004. http://www.yahoo.com, accessed 12/11/2004. 218. ZHAI, C., AND LAFFERTY, J. A study of smoothing methods for language mod- els applied to information retrieval. ACM Transactions on Information Systems 2, 2 (April 2004). 219. ZHU, X., AND GAUCH, S. Incorporating Quality Metrics in Central- ized/Distributed Information Retrieval on the World Wide Web. Tech. rep., De- partment of Electrical Engineering and Computer Science, University of Kansas, 2000. 220. ZOBEL, J. How reliable are the results of large-scale information retrieval exper- iments? In Proceedings of ACM SIGIR’98 (Melbourne, Australia, August 1998), pp. 307–314.