Hyperlink-based search algorithms: PageRank and HITS

Shatakirti
MT2011096


Contents

1 Link Analysis and Web Search

2 Hyperlink-Induced Topic Search (HITS)
  2.1 Introduction
  2.2 Motivation behind developing the HITS algorithm
  2.3 Authorities and Hubs
  2.4 HITS Algorithm
  2.5 HITS Implementation
  2.6 Advantages and Disadvantages of HITS

3 PageRank
  3.1 Introduction
  3.2 PageRank algorithm
  3.3 PageRank Implementation
  3.4 Problems with the algorithm and their modifications
      3.4.1 Rank Sink
      3.4.2 Dangling Links
  3.5 Advantages and Disadvantages of PageRank

References


List of Figures

  1  Hubs and Authorities
  2  PageRank Example
  3  Rank Sink/Page Cycles
  4  Rank Sink/Page Cycles
  5  Link-Farms






1     Link Analysis and Web Search
Back in the 1990s, web search was based purely on the number of occurrences
of a word in a document, i.e. only on the relevance of a document to the
query. Over time, however, the number of webpages grew at an enormous rate,
and simply retrieving the relevant documents was no longer sufficient: the
relevant documents for a query could number in the millions, so the results
had to be ranked in descending order of importance (most important document
first). Content similarity was another major issue: it was easily spammed,
since a page owner could repeat certain words to boost the page's ranking and
make it appear relevant to a large number of queries. These problems were
addressed by analyzing the link structure of the web. Hyperlinks in a document
provide a valuable source of information for information retrieval. Link
analysis has been successful at deciding which webpages are important, and it
has also been used for categorizing webpages, finding pages related to a given
page, and detecting duplicated websites. During 1997-1998, the two most famous
and influential link analysis algorithms were proposed: PageRank and HITS.
Both exploit the hyperlinks of the web to rank pages. We will look at both
algorithms in detail here.


2     Hyperlink-Induced Topic Search (HITS)
2.1    Introduction
Around the same time as the PageRank algorithm was being developed by
Sergey Brin and Larry Page, Jon Kleinberg, a professor in the Department
of Computer Science at Cornell, came up with his own solution to the web
search problem. He developed an algorithm that uses the link structure of
the web in order to discover and rank the pages relevant to a particular
topic. HITS (Hyperlink-Induced Topic Search) is now part of the Ask search
engine (www.Ask.com).

2.2    Motivation behind developing the HITS algorithm
The HITS (Hyperlink-Induced Topic Search) algorithm was developed by looking
at how humans reason about a search, rather than at how a machine answers a
query by scanning a collection of documents and returning the matches. For
example, if we want to buy a car, we might type in "top automobile makers in
the world", with the intention of getting a list of the top-ranked car makers
and their official websites. When we ask a person this question, we probably
expect them to understand that by "automobile" we actually mean cars, even
though "automobile" in general covers other vehicles too. A computer answering
the same query behaves quite differently: it simply counts the occurrences of
the given words in a set of documents and applies no such intelligence. Hence
the search results match what we typed, but not necessarily what we expected.
The conclusion is that, even though finding the pages that contain the query
words is the right starting point, a different ranking system is needed in
order to find the pages that are authoritative for a given query.

2.3    Authorities and Hubs
A page i is called an authority for the query "automobile makers" if it
contains valuable information on the subject. Official websites of car
manufacturers, such as www.bmw.com, HyundaiUSA.com and www.mercedes-benz.com,
would be authorities for this search. Commercial websites selling cars might
be authorities on the subject as well. These are the pages truly relevant to
the given query, and the ones the user expects back from the search engine.
However, there is a second category of pages relevant to the process of
finding the authoritative pages, called hubs. Their role is to advertise the
authoritative pages: they contain useful links towards them. In other words,
hubs point the search engine in the "right direction". In real life, when you
buy a car, you are more inclined to purchase it from a dealer that your friend
recommends. Following the analogy, the authority in this case would be the car
dealer, and the hub would be your friend. You trust your friend; therefore you
trust what your friend recommends. On the web, hubs for our query about
automobiles might be pages that contain rankings of cars, blogs where people
discuss the cars they purchased, and so on.








                      Figure 1: Hubs and Authorities



2.4    HITS Algorithm
Assume that a webpage i has an authority score a_i and a hub score h_i. Let ξ
denote the set of all directed edges in the web graph, and let e_ij represent
the directed edge from webpage i to webpage j. Initially, every authority
score a_i and every hub score h_i is set to 1. The HITS algorithm then updates
the scores with the summations:

        a_i^(k) = Σ_{j : e_ji ∈ ξ} h_j^(k−1)   and   h_i^(k) = Σ_{j : e_ij ∈ ξ} a_j^(k)          (1)

The above equations can be expressed with the adjacency matrix L of the graph,
defined by:

        L_ij = 1 if e_ij ∈ ξ,   and   L_ij = 0 otherwise.

The authority and hub updates can then be written as:

        a^(k) = L^T h^(k−1)   and   h^(k) = L a^(k)          (2)

where a^(k) and h^(k) are n × 1 vectors holding the authority and hub scores,
respectively, for each of the n nodes (webpages) in the graph. These equations
are applied repeatedly until a^(k) and h^(k) converge for some k, with both
vectors normalized after every update. Substituting each equation of (2) into
the other gives:

        a^(k) = L^T L a^(k−1)   and   h^(k) = L L^T h^(k−1)          (3)

The iteration in (3) is simply the power method for finding the dominant
eigenvectors of L^T L and L L^T. The matrix L^T L is called the authority
matrix, since it is used to find the authority scores; similarly, L L^T is the
hub matrix, since its dominant eigenvector gives the final hub scores.
Computing a and h therefore amounts to solving:

        L^T L a = λ_max a   and   L L^T h = λ_max h          (4)

where λ_max is the largest eigenvalue of L^T L (which is also the largest
eigenvalue of L L^T).
Each power iteration involving a^(k) or h^(k) must be normalized. One way to
normalize is:

        a^(k) ← a^(k) / m(a^(k))   and   h^(k) ← h^(k) / m(h^(k))          (5)

where m(x) denotes the signed element of x having the maximum magnitude.
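
To make the iteration concrete, here is a minimal sketch of the HITS power
iteration in Python with NumPy. It is an illustrative implementation of
equations (2)-(5), assuming the adjacency matrix L of the neighborhood graph
has already been built; it is not tied to any particular search engine's code.

    import numpy as np

    def hits(L, tol=1e-8, max_iter=100):
        """Power iteration for HITS. L[i, j] = 1 iff page i links to page j."""
        n = L.shape[0]
        a = np.ones(n)   # authority scores, initialized to 1
        h = np.ones(n)   # hub scores, initialized to 1
        for _ in range(max_iter):
            a_new = L.T @ h                    # a^(k) = L^T h^(k-1)
            h_new = L @ a_new                  # h^(k) = L a^(k)
            # normalize by the signed element of maximum magnitude, equation (5)
            a_new = a_new / a_new[np.argmax(np.abs(a_new))]
            h_new = h_new / h_new[np.argmax(np.abs(h_new))]
            if np.allclose(a_new, a, atol=tol) and np.allclose(h_new, h, atol=tol):
                a, h = a_new, h_new
                break
            a, h = a_new, h_new
        return a, h

    # Tiny example: page 0 links to pages 1 and 2, and page 3 links to page 1.
    L = np.array([[0, 1, 1, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    authority, hub = hits(L)
    print(authority)   # page 1 has the highest authority score
    print(hub)         # page 0 has the highest hub score

In practice, the matrix L comes from the neighborhood graph described in the
next subsection, not from the entire web graph.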

2.5    HITS Implementation
Once the user enters a query, the algorithm first constructs a neighborhood
graph N associated with the terms present in the query; the authority and hub
scores for the documents in N are then computed as described in equation (3)
and returned to the user. N can be constructed using the inverted index: a
root set of pages containing the query terms is retrieved, and the graph is
then expanded by adding the nodes that either point to nodes in N or are
pointed to by nodes in N. The expansion also brings in related documents
containing synonyms of the query terms. Sometimes, however, a node may have a
very large in-degree or out-degree; in such cases one can restrict the number
of in- or out-neighbors added for a node containing the query terms or their
synonyms. Once the graph N is constructed for the given query, the adjacency
matrix L is built, and the dominant eigenvectors of L^T L and L L^T give the
authority and hub scores. The most relevant webpages are then shown first to
the user. For ranking, HITS does not use the entire web; it works on the much
smaller neighborhood graph, which reduces the computational cost. A further
reduction is achieved by computing only one dominant eigenvector, of either
L^T L or L L^T, and obtaining the other score vector by a single
multiplication with L or L^T; for example, the authority vector a can be
calculated from the hub vector h by a = L^T h.
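
The sketch below illustrates one plausible way to carry out this construction
step. The inverted_index, links_from and links_to structures are hypothetical
inputs assumed for the example (a term-to-pages map and out-/in-link lists);
they are not part of any specific system.

    import numpy as np

    def build_neighborhood(query_terms, inverted_index, links_from, links_to, cap=50):
        """Build the neighborhood graph N and its adjacency matrix L for a query."""
        # Root set: pages that contain at least one of the query terms.
        root = set()
        for term in query_terms:
            root.update(inverted_index.get(term, []))
        # Expand with pages pointing to, or pointed to by, the root set,
        # capping the expansion for nodes with very large in- or out-degree.
        nodes = set(root)
        for page in root:
            nodes.update(links_from.get(page, [])[:cap])
            nodes.update(links_to.get(page, [])[:cap])
        # Adjacency matrix restricted to the nodes of N.
        index = {page: i for i, page in enumerate(sorted(nodes))}
        L = np.zeros((len(nodes), len(nodes)))
        for page in nodes:
            for target in links_from.get(page, []):
                if target in index:
                    L[index[page], index[target]] = 1.0
        return L, index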



2.6     Advantages and Disadvantages of HITS
Advantages

The main strength of HITS is its ability to rank pages according to the query
topic, which may provide more relevant authority and hub pages. Building the
adjacency matrix from the relatively small neighborhood graph and applying
power iterations does not present a significant computational burden.

Disadvantages

    1. The main disadvantage of HITS is that the neighborhood graph must be
       built "on the fly", i.e. the authority and hub rankings are query
       dependent. Minor changes to the web can significantly change the
       authority and hub scores.

    2. Another major disadvantage is that it suffers from topic drift, i.e.
       the neighborhood graph N may contain nodes that have high authority
       scores for a topic unrelated to the original query.

    3. HITS cannot detect advertisements. Many sites have commercial
       advertising sponsors, and friendly exchanges of links also reduce the
       accuracy of the HITS algorithm.

    4. The algorithm can easily be spammed, since it is quite easy to add
       out-links to one's own page.

    5. Query-time evaluation is slow: collecting the root set, expanding it
       and performing the eigenvector computation are all expensive tasks.


3     PageRank
3.1     Introduction
PageRank, the second link analysis algorithm from 1998, is at the heart of
Google. Both PageRank and Google were conceived by Sergey Brin and Larry Page
while they were computer science graduate students at Stanford University.
Brin and Page use a recursive scheme similar to Kleinberg's HITS algorithm,
but the PageRank algorithm produces a ranking that is independent of a user's
query. Their original idea was that a page is important if it is pointed to by
other important pages; that is, the importance of your page (its PageRank
score) is determined by summing the PageRanks of all pages that point to it.
So, basically, PageRank is an algorithm that reviews the link structure and
assigns a weight to each webpage: what matters is how many pages link to you
and how important those links are.

3.2    PageRank algorithm
Let the rank of a webpage p_i be denoted PR(p_i). Let M(p_i) be the set of
pages linking to p_i (its citations), let L(p_j) be the number of outbound
links on page p_j, and let N be the total number of pages. In its simplest
form, the PageRank of page p_i is given by:

        PR(p_i) = Σ_{p_j ∈ M(p_i)} PR(p_j) / L(p_j)

That is, the equation simply sums up the (shared-out) PageRanks of all pages
pointing to our page.
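
As a small, purely hypothetical example of this formula: suppose p1 is linked
to only by p2 and p3, where PR(p2) = 0.4 and p2 has 2 outbound links, while
PR(p3) = 0.2 and p3 has a single outbound link. Then

        PR(p1) = 0.4/2 + 0.2/1 = 0.4.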

3.3    PageRank Implementation
In the mathematical model of the PageRank, if an important page points to
some pages, the PageRank of this important page is distributed to all the
pages it is pointing to, proportionally. In other words, if YAHOO! points
to your Web page, that’s good, but you shouldn’t receive the full weight of
YAHOO! because they point to many other places. If YAHOO! points to 999
other pages in addition to yours, then you should only get credit for 1/1000
of YAHOO!’s PageRank. And hence intuitively, we can say that a page can
have a high PageRank if there are many pages that point to it, or if there
are some pages that point to it and have a high PageRank.








           Figure 2: PageRank Example (Courtesy: Wikipedia)


    The figure above shows the mathematical PageRanks for a simple network,
expressed as percentages. (Google uses a logarithmic scale.) Page C has a
higher PageRank than Page E, even though there are fewer links to C; the
one link to C comes from an important page and hence is of high value. If
web surfers who start on a random page have an 85% likelihood of choosing a
random link from the page they are currently visiting, and a 15% likelihood
of jumping to a page chosen at random from the entire web, they will reach
Page E 8.1% of the time. (The 15% likelihood of jumping to an arbitrary
page corresponds to a damping factor of 85%.) Without damping, all web
surfers would eventually end up on Pages A, B, or C, and all other pages
would have PageRank zero. In the presence of damping, Page A effectively
links to all pages in the web, even though it has no outgoing links of its own.


   The PageRank is updated about once every month and does not require any
analysis of the actual (semantic) content of the web or of a user's queries.
Google first finds the semantic matches (webpages) for the user's query and
then orders the resulting set of pages by their PageRank. The computation of
PageRank is itself a hugely challenging task. Computing PageRank with the
power iteration method may involve the following:

  1. Parallelization of the sparse matrix-vector multiplications (a small
     single-machine sketch follows at the end of this subsection).

  2. Partitioning of the iteration matrix into a block of webpages with
     outlinks and another block of those without outlinks.

  3. Speeding up the convergence of the distribution vector.

Updating the PageRank is a major implementation concern, since the web is not
static: the PageRank calculated now may not be valid a few days later.
Extensive research has gone into using the old scores of a page to compute its
new PageRank without reconstructing everything. The changing link structure
and the addition and deletion of webpages must also be taken care of.
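
As a rough, single-machine illustration of the first point in the list above,
the hyperlink matrix can be stored in a sparse format so that one iteration
reduces to a sparse matrix-vector product. The link lists below are made-up
data, and the SciPy-based layout is just one possible choice, not how Google
actually implements it.

    import numpy as np
    from scipy.sparse import csr_matrix

    # Hypothetical link lists: page index -> pages it links to.
    links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
    n = 4

    # Row-stochastic hyperlink matrix H: H[i, j] = 1/L(p_i) if page i links to page j.
    rows, cols, vals = [], [], []
    for i, outs in links.items():
        for j in outs:
            rows.append(i)
            cols.append(j)
            vals.append(1.0 / len(outs))
    H = csr_matrix((vals, (rows, cols)), shape=(n, n))

    # One (undamped) power-iteration step: the rank vector is multiplied by H^T.
    # The damping refinement is introduced in section 3.4.
    pr = np.full(n, 1.0 / n)
    pr = H.T @ pr
    print(pr)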

3.4     Problems with the algorithm and their modifica-
        tions
3.4.1   Rank Sink
Problem Description:

    Take the following link structure, for example, where pages 2, 3, 4 and 5
form a path and an external site (page1) contributes a PageRank of 1 to the
first page of the path (page2).




                     Figure 3: Rank Sink/Page Cycles


   In the above case, all the pages in the path would have a PageRank of 1.
Now, what if we connect the last page of the path (page5) to the first page
(page2) and construct a cycle as shown in the figure below?





                     Figure 4: Rank Sink/Page Cycles


    In this case, PageRank accumulates in the cycle but is never distributed
out of it. Because of this, the pages in the cycle end up with a PageRank of ∞.

Solution to Rank Sink - Random Surfer Model

    The PageRank algorithm assumes that there is a "random surfer" who starts
on a web page chosen at random and keeps clicking links from one page to
another, never clicking "back". Eventually, however, the surfer gets bored and
starts again on another random page. The probability that this random surfer
visits a page is its PageRank. The probability that the surfer keeps following
links is called the damping factor, denoted by d; with probability 1 − d the
surfer gets bored and requests another random page.

The damping factor is added to the equation, which now becomes:

        PR(p_i) = (1 − d)/N + d · Σ_{p_j ∈ M(p_i)} PR(p_j) / L(p_j)

where N is the total number of pages. The damping factor d is usually set
to 0.85.
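
A minimal sketch of this damped update in Python with NumPy is shown below.
It is an illustrative implementation of the formula above on a dense
adjacency matrix, assuming the whole graph fits in memory; a web-scale
implementation would use the sparse, distributed techniques mentioned in
section 3.3.

    import numpy as np

    def pagerank(A, d=0.85, tol=1e-10, max_iter=100):
        """Damped PageRank. A[i, j] = 1 iff page i links to page j."""
        n = A.shape[0]
        out_degree = A.sum(axis=1)
        pr = np.full(n, 1.0 / n)                 # start from a uniform distribution
        for _ in range(max_iter):
            new_pr = np.full(n, (1.0 - d) / n)   # teleportation term (1 - d)/N
            for j in range(n):
                if out_degree[j] > 0:
                    # page j shares d * PR(p_j)/L(p_j) with each page it links to
                    new_pr += d * A[j] * pr[j] / out_degree[j]
            if np.abs(new_pr - pr).sum() < tol:
                return new_pr
            pr = new_pr
        return pr

    # The cycle of Figure 4 (pages 2..5, indices 1..4) plus the external page 1
    # (index 0). With damping, the ranks in the cycle stay finite.
    A = np.array([[0, 1, 0, 0, 0],
                  [0, 0, 1, 0, 0],
                  [0, 0, 0, 1, 0],
                  [0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 0]], dtype=float)
    print(pagerank(A))

Note that in this sketch a page without outlinks simply does not pass its rank
on; handling such pages properly is the subject of the next subsection.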

3.4.2   Dangling Links
Problem Description

    Some pages have no outlinks at all; we call them dangling nodes. In the
random surfer model above, a surfer who reaches a dangling page gets stuck
there, and the importance of that page cannot be passed on to any other page.
Similarly, if the web graph has several disconnected components, a random
surfer who starts in one component can never reach the others, and all pages
in the other components would receive an importance of 0.

Solution to Dangling Links

    The damping factor d takes care of much of this situation, but the
algorithm is modified a little further to handle dangling nodes explicitly:

        PR(p_i) = (1 − d)/N + d · ( Σ_{p_j ∈ M(p_i)} PR(p_j) / L(p_j) + Σ_{p_j ∈ D} PR(p_j) / N )

where D denotes the set of dangling pages (pages with no outlinks). The term
Σ_{p_j ∈ D} PR(p_j)/N spreads the rank held by pages without outlinks evenly
over all N pages. The factor 1 − d is the probability that the surfer quits
the current page and "teleports" to a new one; since every page can be reached
by teleportation with equal probability, each page has a 1/N chance of being
chosen, and hence no page ends up with a PageRank of 0. Dangling links do not
directly affect the ranking of any other page, so in practice they can be
removed while the PageRanks are calculated and added back afterwards.
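
Extending the sketch from the previous subsection, the dangling-page term can
be implemented by spreading the rank held by dangling pages uniformly over all
N pages at each iteration; again, this is an illustrative sketch of the
modified formula, not a reference implementation.

    import numpy as np

    def pagerank_with_dangling(A, d=0.85, tol=1e-10, max_iter=100):
        """Damped PageRank that redistributes the rank of dangling pages uniformly."""
        n = A.shape[0]
        out_degree = A.sum(axis=1)
        dangling = out_degree == 0
        pr = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            # rank currently held by pages with no outlinks, spread over all N pages
            dangling_mass = pr[dangling].sum() / n
            new_pr = np.full(n, (1.0 - d) / n) + d * dangling_mass
            for j in np.nonzero(~dangling)[0]:
                new_pr += d * A[j] * pr[j] / out_degree[j]
            if np.abs(new_pr - pr).sum() < tol:
                return new_pr
            pr = new_pr
        return pr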

3.5    Advantages and Disadvantages of PageRank
Advantages of PageRank

  1. The algorithm is fairly robust against spam, since it is not easy for a
     webpage owner to add inlinks to his/her page from other important pages.

  2. PageRank is a global measure and is query independent.

Disadvantages of PageRank

  1. The major disadvantage of PageRank is that it favors older pages: a new
     page, even a very good one, will not have many links unless it is part
     of an existing site.

  2. PageRank can be artificially increased through "link-farms", as shown
     below. However, while indexing, the search engine actively tries to
     detect such schemes.

     Link-farms: 99 vertices point to vertex 1. Every page is guaranteed a
     minimum PageRank by the teleportation term (at least 0.15 when d = 0.85,
     in the un-normalized form of the formula). In the following two
     scenarios, the PageRank of the main page (page1) is very high even
     though the average quality of the linking pages is very poor.





                           Figure 5: Link-Farms



  3. Another effective way to raise one's own PageRank is to 'buy' a link on
     a page with a high PageRank. However, Google has publicly warned
     webmasters that if they are discovered doing either of the above, their
     links might be ignored in the future, or they might even be removed from
     Google's index.


References
[1] Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual
    Web Search Engine".

[2] David Degen, "Google PageRank - Algorithms for Data Base Systems
    (Fachseminar)", May 12, 2007.

[3] Monika Henzinger, "Link Analysis in Web Information Retrieval", Google
    Incorporated, Mountain View, California.

[4] Michael W. Berry and Murray Browne, "Understanding Search Engines:
    Mathematical Modeling and Text Retrieval".

[5] en.wikipedia.org/wiki/PageRank

[6] en.wikipedia.org/wiki/HITS_algorithm




