Hyperlink-based search algorithms: PageRank and HITS

Shatakirti
MT2011096


Contents

1 Link Analysis and Web Search

2 Hyperlink-Induced Topic Search (HITS)
  2.1 Introduction
  2.2 Motivation behind developing the HITS algorithm
  2.3 Authorities and Hubs
  2.4 HITS Algorithm
  2.5 HITS Implementation
  2.6 Advantages and Disadvantages of HITS

3 PageRank
  3.1 Introduction
  3.2 PageRank algorithm
  3.3 PageRank Implementation
  3.4 Problems with the algorithm and their modifications
      3.4.1 Rank Sink
      3.4.2 Dangling Links
  3.5 Advantages and Disadvantages of PageRank

References


List of Figures

  1  Hubs and Authorities
  2  PageRank Example
  3  Rank Sink/Page Cycles
  4  Rank Sink/Page Cycles
  5  Link-Farms






1     Link Analysis and Web Search
Back in the 1990s, web search was based purely on the number of occurrences
of a word in a document, i.e. only on the relevance of a document to the
query. Over time, however, the number of webpages grew at an enormous rate,
and simply retrieving the relevant documents was no longer sufficient: the
relevant documents for a query could number in the millions, so the results
had to be ranked in descending order of importance (most important document
first). Content similarity was another major issue: it was easily spammed,
since a page owner could repeat certain words to boost the page's ranking and
make it appear relevant to a large number of queries. These problems were
addressed by analyzing the link structure of the web. Hyperlinks in a document
provide a valuable source of information for information retrieval. Link
analysis has been successful at deciding which webpages are important, and it
has also been used for categorizing webpages, finding pages related to a given
page, and detecting duplicated websites. During 1997-1998, the two most famous
and influential link analysis algorithms were proposed: PageRank and HITS.
Both exploit the hyperlinks of the web to rank pages. We will look at both
algorithms in detail here.


2     Hyperlink-Induced Topic Search (HITS)
2.1    Introduction
Around the same time as the PageRank algorithm was being developed by
Sergey Brin and Larry Page, Jon Kleinberg, a professor in the Department
of Computer Science at Cornell, came up with his own solution to the web
search problem. He developed an algorithm that uses the link structure of
the web in order to discover and rank the pages relevant to a particular
topic. HITS (Hyperlink-Induced Topic Search) is now part of the Ask search
engine (www.Ask.com).

2.2    Motivation behind developing the HITS algorithm
The HITS (Hyperlink-Induced Topic Search) algorithm was developed by looking
at how humans reason about a search, rather than at how a machine answers a
query by scanning a collection of documents and returning the matches. For
example, if we want to buy a car, we might type in "top automobile makers in
the world", with the intention of getting a list of the top-ranked car makers
and their official websites. When we ask a person this question, we probably
expect them to understand that by "automobile" we actually mean cars, even
though "automobile" in general covers other vehicles too. A computer answering
the same query behaves quite differently: it simply counts the occurrences of
the given words in a set of documents and applies no such intelligence. Hence
the search results match what we typed, but not necessarily what we expected.
The conclusion is that, even though finding the pages that contain the query
words is the right starting point, a different ranking system is needed in
order to find the pages that are authoritative for a given query.

2.3    Authorities and Hubs
A page i is called an authority for the query "automobile makers" if it
contains valuable information on the subject. Official websites of car
manufacturers, such as www.bmw.com, HyundaiUSA.com and www.mercedes-benz.com,
would be authorities for this search. Commercial websites selling cars might
be authorities on the subject as well. These are the pages truly relevant to
the given query, and the ones the user expects back from the search engine.
However, there is a second category of pages relevant to the process of
finding the authoritative pages, called hubs. Their role is to advertise the
authoritative pages: they contain useful links towards them. In other words,
hubs point the search engine in the "right direction". In real life, when you
buy a car, you are more inclined to purchase it from a dealer that your friend
recommends. Following the analogy, the authority in this case would be the car
dealer, and the hub would be your friend. You trust your friend; therefore you
trust what your friend recommends. On the web, hubs for our query about
automobiles might be pages that contain rankings of cars, blogs where people
discuss the cars they purchased, and so on.








                      Figure 1: Hubs and Authorities



2.4    HITS Algorithm
Assume that a webpage i has an authority score a_i and a hub score h_i. Let ξ
denote the set of all directed edges in the web graph, and let e_ij represent
the directed edge from webpage i to webpage j. Initially, every authority
score a_i and every hub score h_i is set to 1. The HITS algorithm then updates
the scores with the summations:

        a_i^(k) = Σ_{j : e_ji ∈ ξ} h_j^(k−1)   and   h_i^(k) = Σ_{j : e_ij ∈ ξ} a_j^(k)          (1)

The above equations can be expressed with the adjacency matrix L of the graph,
defined by:

        L_ij = 1 if e_ij ∈ ξ,   and   L_ij = 0 otherwise.

The authority and hub updates can then be written as:

        a^(k) = L^T h^(k−1)   and   h^(k) = L a^(k)          (2)

where a^(k) and h^(k) are n × 1 vectors holding the authority and hub scores,
respectively, for each of the n nodes (webpages) in the graph. These equations
are applied repeatedly until a^(k) and h^(k) converge for some k, with both
vectors normalized after every update. Substituting each equation of (2) into
the other gives:

        a^(k) = L^T L a^(k−1)   and   h^(k) = L L^T h^(k−1)          (3)

The iteration in (3) is simply the power method for finding the dominant
eigenvectors of L^T L and L L^T. The matrix L^T L is called the authority
matrix, since it is used to find the authority scores; similarly, L L^T is the
hub matrix, since its dominant eigenvector gives the final hub scores.
Computing a and h therefore amounts to solving:

        L^T L a = λ_max a   and   L L^T h = λ_max h          (4)

where λ_max is the largest eigenvalue of L^T L (which is also the largest
eigenvalue of L L^T).
Each power iteration involving a^(k) or h^(k) must be normalized. One way to
normalize is:

        a^(k) ← a^(k) / m(a^(k))   and   h^(k) ← h^(k) / m(h^(k))          (5)

where m(x) denotes the signed element of x having the maximum magnitude.
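
To make the iteration concrete, here is a minimal sketch of the HITS power
iteration in Python with NumPy. It is an illustrative implementation of
equations (2)-(5), assuming the adjacency matrix L of the neighborhood graph
has already been built; it is not tied to any particular search engine's code.

    import numpy as np

    def hits(L, tol=1e-8, max_iter=100):
        """Power iteration for HITS. L[i, j] = 1 iff page i links to page j."""
        n = L.shape[0]
        a = np.ones(n)   # authority scores, initialized to 1
        h = np.ones(n)   # hub scores, initialized to 1
        for _ in range(max_iter):
            a_new = L.T @ h                    # a^(k) = L^T h^(k-1)
            h_new = L @ a_new                  # h^(k) = L a^(k)
            # normalize by the signed element of maximum magnitude, equation (5)
            a_new = a_new / a_new[np.argmax(np.abs(a_new))]
            h_new = h_new / h_new[np.argmax(np.abs(h_new))]
            if np.allclose(a_new, a, atol=tol) and np.allclose(h_new, h, atol=tol):
                a, h = a_new, h_new
                break
            a, h = a_new, h_new
        return a, h

    # Tiny example: page 0 links to pages 1 and 2, and page 3 links to page 1.
    L = np.array([[0, 1, 1, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    authority, hub = hits(L)
    print(authority)   # page 1 has the highest authority score
    print(hub)         # page 0 has the highest hub score

In practice, the matrix L comes from the neighborhood graph described in the
next subsection, not from the entire web graph.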

2.5    HITS Implementation
Once the user enters a query, the algorithm first constructs a neighborhood
graph N associated with the terms present in the query; the authority and hub
scores for the documents in N are then computed as described in equation (3)
and returned to the user. N can be constructed using the inverted index: a
root set of pages containing the query terms is retrieved, and the graph is
then expanded by adding the nodes that either point to nodes in N or are
pointed to by nodes in N. The expansion also brings in related documents
containing synonyms of the query terms. Sometimes, however, a node may have a
very large in-degree or out-degree; in such cases one can restrict the number
of in- or out-neighbors added for a node containing the query terms or their
synonyms. Once the graph N is constructed for the given query, the adjacency
matrix L is built, and the dominant eigenvectors of L^T L and L L^T give the
authority and hub scores. The most relevant webpages are then shown first to
the user. For ranking, HITS does not use the entire web; it works on the much
smaller neighborhood graph, which reduces the computational cost. A further
reduction is achieved by computing only one dominant eigenvector, of either
L^T L or L L^T, and obtaining the other score vector by a single
multiplication with L or L^T; for example, the authority vector a can be
calculated from the hub vector h by a = L^T h.
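
The sketch below illustrates one plausible way to carry out this construction
step. The inverted_index, links_from and links_to structures are hypothetical
inputs assumed for the example (a term-to-pages map and out-/in-link lists);
they are not part of any specific system.

    import numpy as np

    def build_neighborhood(query_terms, inverted_index, links_from, links_to, cap=50):
        """Build the neighborhood graph N and its adjacency matrix L for a query."""
        # Root set: pages that contain at least one of the query terms.
        root = set()
        for term in query_terms:
            root.update(inverted_index.get(term, []))
        # Expand with pages pointing to, or pointed to by, the root set,
        # capping the expansion for nodes with very large in- or out-degree.
        nodes = set(root)
        for page in root:
            nodes.update(links_from.get(page, [])[:cap])
            nodes.update(links_to.get(page, [])[:cap])
        # Adjacency matrix restricted to the nodes of N.
        index = {page: i for i, page in enumerate(sorted(nodes))}
        L = np.zeros((len(nodes), len(nodes)))
        for page in nodes:
            for target in links_from.get(page, []):
                if target in index:
                    L[index[page], index[target]] = 1.0
        return L, index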



2.6     Advantages and Disadvantages of HITS
Advantages

The main strength of HITS is its ability to rank pages according to the query
topic, which may provide more relevant authority and hub pages. Building the
adjacency matrix from the relatively small neighborhood graph and applying
power iterations does not present a significant computational burden.

Disadvantages

    1. The main disadvantage of HITS is that the neighborhood graph must be
       built "on the fly", i.e. the authority and hub rankings are query
       dependent. Minor changes to the web can significantly change the
       authority and hub scores.

    2. Another major disadvantage is that it suffers from topic drift, i.e.
       the neighborhood graph N may contain nodes that have high authority
       scores for a topic unrelated to the original query.

    3. HITS cannot detect advertisements. Many sites have commercial
       advertising sponsors, and friendly exchanges of links also reduce the
       accuracy of the HITS algorithm.

    4. The algorithm can easily be spammed, since it is quite easy to add
       out-links to one's own page.

    5. Query-time evaluation is slow: collecting the root set, expanding it
       and performing the eigenvector computation are all expensive tasks.


3     PageRank
3.1     Introduction
PageRank, the second link analysis algorithm from 1998, is at the heart of
Google. Both PageRank and Google were conceived by Sergey Brin and Larry Page
while they were computer science graduate students at Stanford University.
Brin and Page use a recursive scheme similar to Kleinberg's HITS algorithm,
but the PageRank algorithm produces a ranking that is independent of a user's
query. Their original idea was that a page is important if it is pointed to by
other important pages; that is, the importance of your page (its PageRank
score) is determined by summing the PageRanks of all pages that point to it.
So, basically, PageRank is an algorithm that reviews the link structure and
assigns a weight to each webpage: what matters is how many pages link to you
and how important those links are.

3.2    PageRank algorithm
Let the rank of a webpage p_i be denoted PR(p_i). Let M(p_i) be the set of
pages linking to p_i (its citations), let L(p_j) be the number of outbound
links on page p_j, and let N be the total number of pages. In its simplest
form, the PageRank of page p_i is given by:

        PR(p_i) = Σ_{p_j ∈ M(p_i)} PR(p_j) / L(p_j)

That is, the equation simply sums up the (shared-out) PageRanks of all pages
pointing to our page.
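
As a small, purely hypothetical example of this formula: suppose p1 is linked
to only by p2 and p3, where PR(p2) = 0.4 and p2 has 2 outbound links, while
PR(p3) = 0.2 and p3 has a single outbound link. Then

        PR(p1) = 0.4/2 + 0.2/1 = 0.4.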

3.3    PageRank Implementation
In the mathematical model of the PageRank, if an important page points to
some pages, the PageRank of this important page is distributed to all the
pages it is pointing to, proportionally. In other words, if YAHOO! points
to your Web page, that’s good, but you shouldn’t receive the full weight of
YAHOO! because they point to many other places. If YAHOO! points to 999
other pages in addition to yours, then you should only get credit for 1/1000
of YAHOO!’s PageRank. And hence intuitively, we can say that a page can
have a high PageRank if there are many pages that point to it, or if there
are some pages that point to it and have a high PageRank.








           Figure 2: PageRank Example (Courtesy: Wikipedia)


    The figure above shows the mathematical PageRanks for a simple network,
expressed as percentages. (Google uses a logarithmic scale.) Page C has a
higher PageRank than Page E, even though there are fewer links to C; the
one link to C comes from an important page and hence is of high value. If
web surfers who start on a random page have an 85% likelihood of choosing a
random link from the page they are currently visiting, and a 15% likelihood
of jumping to a page chosen at random from the entire web, they will reach
Page E 8.1% of the time. (The 15% likelihood of jumping to an arbitrary
page corresponds to a damping factor of 85%.) Without damping, all web
surfers would eventually end up on Pages A, B, or C, and all other pages
would have PageRank zero. In the presence of damping, Page A effectively
links to all pages in the web, even though it has no outgoing links of its own.


   The PageRank is updated about once every month and does not require any
analysis of the actual (semantic) content of the web or of a user's queries.
Google first finds the semantic matches (webpages) for the user's query and
then orders the resulting set of pages by their PageRank. The computation of
PageRank is itself a hugely challenging task. Computing PageRank with the
power iteration method may involve the following:

  1. Parallelization of the sparse matrix-vector multiplications (a small
     single-machine sketch follows at the end of this subsection).

  2. Partitioning of the iteration matrix into a block of webpages with
     outlinks and another block of those without outlinks.

  3. Speeding up the convergence of the distribution vector.

Updating the PageRank is a major implementation concern, since the web is not
static: the PageRank calculated now may not be valid a few days later.
Extensive research has gone into using the old scores of a page to compute its
new PageRank without reconstructing everything. The changing link structure
and the addition and deletion of webpages must also be taken care of.
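
As a rough, single-machine illustration of the first point in the list above,
the hyperlink matrix can be stored in a sparse format so that one iteration
reduces to a sparse matrix-vector product. The link lists below are made-up
data, and the SciPy-based layout is just one possible choice, not how Google
actually implements it.

    import numpy as np
    from scipy.sparse import csr_matrix

    # Hypothetical link lists: page index -> pages it links to.
    links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
    n = 4

    # Row-stochastic hyperlink matrix H: H[i, j] = 1/L(p_i) if page i links to page j.
    rows, cols, vals = [], [], []
    for i, outs in links.items():
        for j in outs:
            rows.append(i)
            cols.append(j)
            vals.append(1.0 / len(outs))
    H = csr_matrix((vals, (rows, cols)), shape=(n, n))

    # One (undamped) power-iteration step: the rank vector is multiplied by H^T.
    # The damping refinement is introduced in section 3.4.
    pr = np.full(n, 1.0 / n)
    pr = H.T @ pr
    print(pr)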

3.4     Problems with the algorithm and their modifica-
        tions
3.4.1   Rank Sink
Problem Description:

    Take the following link structure, for example, where pages 2, 3, 4 and 5
form a path and an external site (page1) contributes a PageRank of 1 to the
first page of the path (page2).




                     Figure 3: Rank Sink/Page Cycles


   In the above case, all the pages in the path would have a PageRank of 1.
Now, what if we connect the last page of the path (page5) to the first page
(page2) and construct a cycle as shown in the figure below?





                     Figure 4: Rank Sink/Page Cycles


    In this case, PageRank accumulates in the cycle but is never distributed
out of it. Because of this, the pages in the cycle end up with a PageRank of ∞.

Solution to Rank Sink - Random Surfer Model

    The PageRank algorithm assumes that there is a "random surfer" who starts
on a web page chosen at random and keeps clicking links from one page to
another, never clicking "back". Eventually, however, the surfer gets bored and
starts again on another random page. The probability that this random surfer
visits a page is its PageRank. The probability that the surfer keeps following
links is called the damping factor, denoted by d; with probability 1 − d the
surfer gets bored and requests another random page.

The damping factor is added to the equation, which now becomes:

        PR(p_i) = (1 − d)/N + d · Σ_{p_j ∈ M(p_i)} PR(p_j) / L(p_j)

where N is the total number of pages. The damping factor d is usually set
to 0.85.
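
A minimal sketch of this damped update in Python with NumPy is shown below.
It is an illustrative implementation of the formula above on a dense
adjacency matrix, assuming the whole graph fits in memory; a web-scale
implementation would use the sparse, distributed techniques mentioned in
section 3.3.

    import numpy as np

    def pagerank(A, d=0.85, tol=1e-10, max_iter=100):
        """Damped PageRank. A[i, j] = 1 iff page i links to page j."""
        n = A.shape[0]
        out_degree = A.sum(axis=1)
        pr = np.full(n, 1.0 / n)                 # start from a uniform distribution
        for _ in range(max_iter):
            new_pr = np.full(n, (1.0 - d) / n)   # teleportation term (1 - d)/N
            for j in range(n):
                if out_degree[j] > 0:
                    # page j shares d * PR(p_j)/L(p_j) with each page it links to
                    new_pr += d * A[j] * pr[j] / out_degree[j]
            if np.abs(new_pr - pr).sum() < tol:
                return new_pr
            pr = new_pr
        return pr

    # The cycle of Figure 4 (pages 2..5, indices 1..4) plus the external page 1
    # (index 0). With damping, the ranks in the cycle stay finite.
    A = np.array([[0, 1, 0, 0, 0],
                  [0, 0, 1, 0, 0],
                  [0, 0, 0, 1, 0],
                  [0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 0]], dtype=float)
    print(pagerank(A))

Note that in this sketch a page without outlinks simply does not pass its rank
on; handling such pages properly is the subject of the next subsection.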

3.4.2   Dangling Links
Problem Description

    Some pages have no outlinks at all; we call them dangling nodes. In the
random surfer model above, a surfer who reaches a dangling page gets stuck
there, and the importance of that page cannot be passed on to any other page.
Similarly, if the web graph has several disconnected components, a random
surfer who starts in one component can never reach the others, and all pages
in the other components would receive an importance of 0.

Solution to Dangling Links

    The damping factor d takes care of much of this situation, but the
algorithm is modified a little further to handle dangling nodes explicitly:

        PR(p_i) = (1 − d)/N + d · ( Σ_{p_j ∈ M(p_i)} PR(p_j) / L(p_j) + Σ_{p_j ∈ D} PR(p_j) / N )

where D denotes the set of dangling pages (pages with no outlinks). The term
Σ_{p_j ∈ D} PR(p_j)/N spreads the rank held by pages without outlinks evenly
over all N pages. The factor 1 − d is the probability that the surfer quits
the current page and "teleports" to a new one; since every page can be reached
by teleportation with equal probability, each page has a 1/N chance of being
chosen, and hence no page ends up with a PageRank of 0. Dangling links do not
directly affect the ranking of any other page, so in practice they can be
removed while the PageRanks are calculated and added back afterwards.
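
Extending the sketch from the previous subsection, the dangling-page term can
be implemented by spreading the rank held by dangling pages uniformly over all
N pages at each iteration; again, this is an illustrative sketch of the
modified formula, not a reference implementation.

    import numpy as np

    def pagerank_with_dangling(A, d=0.85, tol=1e-10, max_iter=100):
        """Damped PageRank that redistributes the rank of dangling pages uniformly."""
        n = A.shape[0]
        out_degree = A.sum(axis=1)
        dangling = out_degree == 0
        pr = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            # rank currently held by pages with no outlinks, spread over all N pages
            dangling_mass = pr[dangling].sum() / n
            new_pr = np.full(n, (1.0 - d) / n) + d * dangling_mass
            for j in np.nonzero(~dangling)[0]:
                new_pr += d * A[j] * pr[j] / out_degree[j]
            if np.abs(new_pr - pr).sum() < tol:
                return new_pr
            pr = new_pr
        return pr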

3.5    Advantages and Disadvantages of PageRank
Advantages of PageRank

  1. The algorithm is fairly robust against spam, since it is not easy for a
     webpage owner to add inlinks to his/her page from other important pages.

  2. PageRank is a global measure and is query independent.

Disadvantages of PageRank

  1. The major disadvantage of PageRank is that it favors older pages: a new
     page, even a very good one, will not have many links unless it is part
     of an existing site.

  2. PageRank can be artificially increased through "link-farms", as shown
     below. However, while indexing, the search engine actively tries to
     detect such schemes.

     Link-farms: 99 vertices point to vertex 1. Every page is guaranteed a
     minimum PageRank by the teleportation term (at least 0.15 when d = 0.85,
     in the un-normalized form of the formula). In the following two
     scenarios, the PageRank of the main page (page1) is very high even
     though the average quality of the linking pages is very poor.





                           Figure 5: Link-Farms



  3. Another effective way to raise one's own PageRank is to 'buy' a link on
     a page with a high PageRank. However, Google has publicly warned
     webmasters that if they are discovered doing either of the above, their
     links might be ignored in the future, or they might even be removed from
     Google's index.


References
[1] Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual
    Web Search Engine".

[2] David Degen, "Google PageRank - Algorithms for Data Base Systems
    (Fachseminar)", May 12, 2007.

[3] Monika Henzinger, "Link Analysis in Web Information Retrieval", Google
    Incorporated, Mountain View, California.

[4] Michael W. Berry and Murray Browne, "Understanding Search Engines:
    Mathematical Modeling and Text Retrieval".

[5] en.wikipedia.org/wiki/PageRank

[6] en.wikipedia.org/wiki/HITS_algorithm




