Link Analysis

  Rajendra Akerkar
Vestlandsforsking, Norway
Objectives
       To review common approaches to l k analysis
                                h      link l

       To calculate the popularity of a site based on link
        analysis

       To model human judgments indirectly




2            R. Akerkar
Outline
        1.
        1    Early Approaches to Link Analysis
        2.   Hubs and Authorities: HITS
        3.   Page Rank
        4.   Stability
        5.   Probabilistic Link Analysis
        6.   Limitation of Link Analysis




3            R. Akerkar
Early Approaches
    Basic Assumptions
       • Hyperlinks contain information about the human
         judgment of a site
       • The more incoming links to a site the more it is
                                      site,
         judged important
    Bray 1996
       y
       • The visibility of a site is measured by the number
         of other sites pointing to it
       • The luminosity of a site is measured by the number
         of other sites to which it points
        Limitation: failure to capture the relative
         importance of different parents (children) sites
4            R. Akerkar
Early Approaches
    Mark 1988
    • To calculate the score S of a document at vertex v
                               1
            S(v) = s(v) +             Σ           S(w)
                            | ch[v] | w Є |ch(v)|
                 v: a vertex i the h
                             in h hypertext graph G = (V E)
                                                  h      (V,
                 S(v): the global score
                 s(v): the score if the document is isolated
                 ch(v): children of the document at vertex v
    • Limitation:
       - Require G to be a directed acyclic g p (
             q                            y   graph (DAG)
                                                        )
       - If v has a single link to w, S(v) > S(w)
       - If v has a long path to w and s(v) < s(w),
            then S(v) > S (w)
        unreasonable
5              R. Akerkar
Early Approaches
    Marchiori (1997)
        • Hyper information should complement textual
          information to obtain the overall information
           S(v) = s(v) + h(v)
            ( )    ( )    ( )
              - S(v): overall information
              - s(v): textual information
              - h(v): hyper information
                                        r(v, w)
        • h(v) =      Σ             F             S(w)
                      w Є |ch[v]|
              - F a fading constant F Є (0 1)
                F:           constant,      (0,
              - r(v, w): the rank of w after sorting the children
                of v by S(w)
        a remedy of the previous approach (Mark 1988)
6              R. Akerkar
HITS - Kleinberg’s Algorithm
     • HITS – Hypertext Induced Topic Selection
               yp                 p

     • For each vertex v Є V in a subgraph of interest:
                 a(v) - the authority of v
                 h(v) - the hubness of v

     • A site is very a thoritati e if it recei es man
                  er authoritative        receives many
       citations. Citation from important sites weight
       more than citations from less-important sites

     • Hubness shows the importance of a site. A good
       hub is a site that links to many authoritative sites
                                      y

7             R. Akerkar
Authority and Hubness

    2                                                       5


    3                            1     1                    6


4
                                                            7

        a(1) = h(2) + h(3) + h(4)    h(1) = a(5) + a(6) + a(7)

8                   R. Akerkar
Authority d H b
    A th it and Hubness Convergence
                        C     g

       • R
         Recursive d
               i dependency:
                      d

              a(v)  Σ               h(w)
                         w Є pa[v]

              h(v)  Σ w Є ch[v] a(w)

        • Using Linear Algebra, we can prove:

               a(v) and h(v) converge



9          R. Akerkar
HITS Example
  Find a base subgraph:
• Start with a root set R {1, 2, 3, 4}

• {1, 2, 3, 4} - nodes relevant to
               the topic

• Expand the root set R to include
all the children and a fixed
number of parents of nodes in R

 A new set S (base subgraph) 


10                  R. Akerkar
HITS Example
     BaseSubgraph( R d)
                   R,
     1. S  r
     2.   for each v in R
     3.   do S  S U ch[v]
     4.       P  pa[v]
     5.       if |P| > d
               f
     6.       then P  arbitrary subset of P having size d
     7.            SSUP
     8.
     8    return S




11                R. Akerkar
HITS Example
      Hubs and authorities: two n-dimensional a and h
              HubsAuthorities(G)
                                              |V|
               1     1  [1,…,1] Є R
               2     a0  h 0  1
               3     t 1
               4     repeat
               5            for each v in V
               6            do at (v)  Σ                      h      (w)
                                                    w Є pa[v] t -1
               7                    ht (v)  Σ w Є pa[v] a              (w)
               8              a t  at / || at ||                  t -1
               9              h t  ht / || ht ||
               10             t  t+1
               11     until || a t – at -1 || + || h t – ht -1 || < ε
                                          1                  1
               12     return (a t , h t )
12            R. Akerkar
HITS Example Results
             y
     Authority
     Hubness




1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Authority and hubness weights
13               R. Akerkar
HITS Improvements
     Brarat and Henzinger (1998)
       • HITS problems
          1) The document can contain many identical links to
             the same document in another host
          2) Links are generated automatically (e.g. messages
             posted on newsgroups)

       • Solutions
          1) Assign weight to identical multiple edges, which
             are inversely proportional to their multiplicity
          2) Prune irrelevant nodes or regulating the influence
             of a node with a relevance weight

14             R. Akerkar
PageRank
 Introduced by Page et al (
                y g            (1998))
   The weight is assigned by the rank of parents




 Difference with HITS
   HITS takes Hubness & Authority weights
   The page rank is proportional to its parents’ rank, but inversely
    proportional to its parents’ outdegree


                                                                15
Matrix Notation




                                     Adjacent M i
                                     Adj      Matrix



                                     A=


  * http://www.kusatro.kyoto-u.com

                                                       16
Matrix Notation
 Matrix Notation
  r=αBr=Mr
    α : eigenvalue
    r : eigenvector of B


       Ax=λx
                              B=
       | A - λI | x = 0

 Finding Pagerank
  to find eigenvector of B with an associated eigenvalue α

                                                  17
Matrix Notation
     PageRank: eigenvector of P relative to max eigenvalue

       B = P D P-1
           D: diagonal matrix of eigenvalues {λ1, … λn}
           P: regular matrix that consists of eigenvectors




        PageRank   r1 =
                                      normalized


18                 R. Akerkar
Matrix Notation




                        • Confirm the result
                         # of inlinks from high ranked page
                         hard to explain about 5&2 6&7
                                               5&2,

                        • Interesting Topic
                          How do you create your
                                homepage highly ranked?
19         R. Akerkar
Markov Chain Notation
      Random surfer model
        Description of a random walk through the Web graph
        Interpreted as a transition matrix with asymptotic probability that a
         surfer is currently browsing that page




                 rt = M rtt-1
                            1
                 M: transition matrix for a first-order Markov chain (stochastic)

        Does it converge to some sensible solution ( too)
                       g                           (as   )
        regardless of the initial ranks ?
20                    R. Akerkar
Problem
      “Rank Sink” Problem
        In general, many Web pages have no inlinks/outlinks
        It results in dangling edges in the graph

       E.g.
         no parent  rank 0
             MT converges to a matrix
              whose last column is all zero

         no children  no solution
             MT converges to zero matrix




21                      R. Akerkar
Modification
      Surfer will restart browsing by p
                                  g y picking a new Web p g at
                                            g           page
       random
         M=(B+E)
           E : escape matrix
           M : stochastic matrix

      S ll problem?
       Still bl
        It is not guaranteed that M is primitive

        If M is stochastic and primitive, PageRank converges to
         corresponding stationary distribution of M

22                      R. Akerkar
PageRank Algorithm




                        * Page et al 1998
                                  al,
23         R. Akerkar
Distribution of the Mixture Model
      The probability distribution that results from combining the
           p         y                                        g
       Markovian random walk distribution & the static rank source
       distribution
             r = εe + (1- ε)x
                       1
                     ε: probability of selecting non-linked page

             PageRank


 Now, transition matrix [εH + (1- ε)M] is primitive and stochastic
 rt converges to the dominant eigenvector
24                    R. Akerkar
Stability
      Whether the link analysis algorithms based on eigenvectors
       are stable in the sense that results don’t change significantly?

      Th connectivity of a portion of the graph is changed
       The              f            f h       h h        d
       arbitrary
        How will it affect the results of algorithms?




25                 R. Akerkar
Stability of HITS
     Ng et al (2001)
     • A bound on the number of hyperlinks k that can added or deleted
     from one page without affecting the authority or hubness weights
     • It is possible to perturb a symmetric matrix by a quantity that grows
     as δ that produces a constant perturbation of the dominant
     eigenvector




                                   δ: eigengap λ1 – λ2
                                        g g p
                                   d: maximum outdegree of G
26                    R. Akerkar
Stability of PageRank

                                         Ng et al (2001)

               V: the set of vertices touched by the perturbation
 The parameter ε of the mixture model has a stabilization role
 If the set of pages affected by the p
                pg              y     perturbation have a small rank, the overall
                                                                    ,
     change will also be small


                                         tighter bound by
                                         Bianchini et al (2001)

              δ(j) >= 2 depends on the edges incident on j
27                    R. Akerkar
SALSA
      SALSA (Lempel, Moran 2001)
        Probabilistic extension of the HITS algorithm
        Random walk is carried out by following hyperlinks both in the
         forward and in the backward direction
      Two separate random walks
        Hub walk
        Authority walk




28               R. Akerkar
Forming a Bipartite Graph in SALSA




29         R. Akerkar
Random Walks
      Hub walk
        Follow a Web link from a page uh to a page wa (a forward link)
         and then
        Immediately traverse a backlink going from wa to vh, where (u w)
                                                                     (u,w)
         Є E and (v,w) Є E
      Authority Walk
                 y
        Follow a Web link from a page w(a) to a page u(h) (a backward
         link) and then
        Immediately traverse a f
                d l             forward link going b k f
                                      dl k         back from vh to wa
         where (u,w) Є E and (v,w) Є E


30                R. Akerkar
Computing Weights
      Hub weight computed from the sum of the product of the
       inverse degree of the in-links and the out-links




31               R. Akerkar
Why We Care
      Lempel and Moran (2001) showed theoretically that SALSA
       weights are more robust that HITS weights in the presence of the
       Tightly Knit Community (TKC) Effect.
        This effect occurs when a small collection of pages (related to a given topic)
         is connected so that every hub links to every authority and includes as a special
         case the mutual reinforcement effect
      The pages in a community connected in this way can be ranked
       highly by HITS, higher than pages in a much larger collection
                 HITS
       where only some hubs link to some authorities
      TKC could be exploited by spammers hoping to increase their
       page weight ( li k f
               i h (e.g. link farms)
                                   )



32                 R. Akerkar
A Similar Approach
      Rafiei and Mendelzon (2000) and Ng et al. (2001)
       propose similar approaches using reset as in
       PageRank
      Unlike PageRank, in this model the surfer will
       follow a forward link on odd steps but a
       backward li k
       b k d link on even steps t
      The stability properties of these ranking
       distributions are similar to those of PageRank (Ng
       et al. 2001)

33             R. Akerkar
Overcoming TKC
      Similarity downweight sequencing and sequential
                y        g            g
      clustering (Roberts and Rosenthal 2003)
       Consider the underlying structure of clusters
                             y g
       Suggest downweight sequencing to avoid the
        Tight Knit Community problem
          g                     yp
       Results indicate approach is effective for few
        tested queries, but still untested on a large scale


34            R. Akerkar
PHITS and More
      PHITS: Cohn and Chang (2000)
        Only the principal eigenvector is extracted using SALSA, so the
         authority along the remaining eigenvectors is completely
         neglected
           Account for more eigenvectors of the co-citation matrix

      See also Lempel, Moran (2003)




35                 R. Akerkar
Limits of Link Analysis
      META tags/ invisible text
        Search engines relying on meta tags in documents are often
         misled (intentionally) by web developers
     P f
      Pay-for-place
                 l
       Search engine bias : organizations pay search engines and page
        rank
       Advertisements: organizations pay high ranking pages for
        advertising space
           W h a primary effect of increased visibility to end users and a secondary
            With           ff     f         d bl              d         d        d
            effect of increased respectability due to relevance to high ranking page



36                 R. Akerkar
Limits of Link Analysis
      Stability
        Adding even a small number of nodes/edges to the graph has a
         significant impact
      T i drift – similar t TKC
       Topic d ift i il to
        A top authority may be a hub of pages on a different topic
         resulting in increased rank of the authority p g
                 g                                  y page
      Content evolution
        Adding/removing links/content can affect the intuitive
         authority rank of a page requiring recalculation of page ranks



37                R. Akerkar
Further Reading
        R. Akerkar d P. Lingras, Building an I ll
         R Ak k and P L             B ld      Intelligent W b Th
                                                          Web: Theory
         and Practice, Jones & Bartlett, 2008
        http://www.jbpub.com/catalog/9780763741372/
            p          j p                g




38               R. Akerkar

Link analysis

  • 1.
    Link Analysis Rajendra Akerkar Vestlandsforsking, Norway
  • 2.
    Objectives  To review common approaches to l k analysis h link l  To calculate the popularity of a site based on link analysis  To model human judgments indirectly 2 R. Akerkar
  • 3.
    Outline 1. 1 Early Approaches to Link Analysis 2. Hubs and Authorities: HITS 3. Page Rank 4. Stability 5. Probabilistic Link Analysis 6. Limitation of Link Analysis 3 R. Akerkar
  • 4.
    Early Approaches Basic Assumptions • Hyperlinks contain information about the human judgment of a site • The more incoming links to a site the more it is site, judged important Bray 1996 y • The visibility of a site is measured by the number of other sites pointing to it • The luminosity of a site is measured by the number of other sites to which it points  Limitation: failure to capture the relative importance of different parents (children) sites 4 R. Akerkar
  • 5.
    Early Approaches Mark 1988 • To calculate the score S of a document at vertex v 1 S(v) = s(v) + Σ S(w) | ch[v] | w Є |ch(v)| v: a vertex i the h in h hypertext graph G = (V E) h (V, S(v): the global score s(v): the score if the document is isolated ch(v): children of the document at vertex v • Limitation: - Require G to be a directed acyclic g p ( q y graph (DAG) ) - If v has a single link to w, S(v) > S(w) - If v has a long path to w and s(v) < s(w), then S(v) > S (w)  unreasonable 5 R. Akerkar
  • 6.
    Early Approaches Marchiori (1997) • Hyper information should complement textual information to obtain the overall information S(v) = s(v) + h(v) ( ) ( ) ( ) - S(v): overall information - s(v): textual information - h(v): hyper information r(v, w) • h(v) = Σ F S(w) w Є |ch[v]| - F a fading constant F Є (0 1) F: constant, (0, - r(v, w): the rank of w after sorting the children of v by S(w)  a remedy of the previous approach (Mark 1988) 6 R. Akerkar
  • 7.
    HITS - Kleinberg’sAlgorithm • HITS – Hypertext Induced Topic Selection yp p • For each vertex v Є V in a subgraph of interest: a(v) - the authority of v h(v) - the hubness of v • A site is very a thoritati e if it recei es man er authoritative receives many citations. Citation from important sites weight more than citations from less-important sites • Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites y 7 R. Akerkar
  • 8.
    Authority and Hubness 2 5 3 1 1 6 4 7 a(1) = h(2) + h(3) + h(4) h(1) = a(5) + a(6) + a(7) 8 R. Akerkar
  • 9.
    Authority d Hb A th it and Hubness Convergence C g • R Recursive d i dependency: d a(v)  Σ h(w) w Є pa[v] h(v)  Σ w Є ch[v] a(w) • Using Linear Algebra, we can prove: a(v) and h(v) converge 9 R. Akerkar
  • 10.
    HITS Example Find a base subgraph: • Start with a root set R {1, 2, 3, 4} • {1, 2, 3, 4} - nodes relevant to the topic • Expand the root set R to include all the children and a fixed number of parents of nodes in R  A new set S (base subgraph)  10 R. Akerkar
  • 11.
    HITS Example BaseSubgraph( R d) R, 1. S  r 2. for each v in R 3. do S  S U ch[v] 4. P  pa[v] 5. if |P| > d f 6. then P  arbitrary subset of P having size d 7. SSUP 8. 8 return S 11 R. Akerkar
  • 12.
    HITS Example Hubs and authorities: two n-dimensional a and h HubsAuthorities(G) |V| 1 1  [1,…,1] Є R 2 a0  h 0  1 3 t 1 4 repeat 5 for each v in V 6 do at (v)  Σ h (w) w Є pa[v] t -1 7 ht (v)  Σ w Є pa[v] a (w) 8 a t  at / || at || t -1 9 h t  ht / || ht || 10 t  t+1 11 until || a t – at -1 || + || h t – ht -1 || < ε 1 1 12 return (a t , h t ) 12 R. Akerkar
  • 13.
    HITS Example Results y Authority Hubness 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Authority and hubness weights 13 R. Akerkar
  • 14.
    HITS Improvements Brarat and Henzinger (1998) • HITS problems 1) The document can contain many identical links to the same document in another host 2) Links are generated automatically (e.g. messages posted on newsgroups) • Solutions 1) Assign weight to identical multiple edges, which are inversely proportional to their multiplicity 2) Prune irrelevant nodes or regulating the influence of a node with a relevance weight 14 R. Akerkar
  • 15.
    PageRank  Introduced byPage et al ( y g (1998))  The weight is assigned by the rank of parents  Difference with HITS  HITS takes Hubness & Authority weights  The page rank is proportional to its parents’ rank, but inversely proportional to its parents’ outdegree 15
  • 16.
    Matrix Notation Adjacent M i Adj Matrix A= * http://www.kusatro.kyoto-u.com 16
  • 17.
    Matrix Notation  MatrixNotation r=αBr=Mr α : eigenvalue r : eigenvector of B Ax=λx B= | A - λI | x = 0 Finding Pagerank  to find eigenvector of B with an associated eigenvalue α 17
  • 18.
    Matrix Notation PageRank: eigenvector of P relative to max eigenvalue B = P D P-1 D: diagonal matrix of eigenvalues {λ1, … λn} P: regular matrix that consists of eigenvectors PageRank r1 = normalized 18 R. Akerkar
  • 19.
    Matrix Notation • Confirm the result # of inlinks from high ranked page hard to explain about 5&2 6&7 5&2, • Interesting Topic How do you create your homepage highly ranked? 19 R. Akerkar
  • 20.
    Markov Chain Notation  Random surfer model  Description of a random walk through the Web graph  Interpreted as a transition matrix with asymptotic probability that a surfer is currently browsing that page rt = M rtt-1 1 M: transition matrix for a first-order Markov chain (stochastic) Does it converge to some sensible solution ( too) g (as ) regardless of the initial ranks ? 20 R. Akerkar
  • 21.
    Problem  “Rank Sink” Problem  In general, many Web pages have no inlinks/outlinks  It results in dangling edges in the graph E.g. no parent  rank 0 MT converges to a matrix whose last column is all zero no children  no solution MT converges to zero matrix 21 R. Akerkar
  • 22.
    Modification  Surfer will restart browsing by p g y picking a new Web p g at g page random M=(B+E) E : escape matrix M : stochastic matrix  S ll problem? Still bl  It is not guaranteed that M is primitive  If M is stochastic and primitive, PageRank converges to corresponding stationary distribution of M 22 R. Akerkar
  • 23.
    PageRank Algorithm * Page et al 1998 al, 23 R. Akerkar
  • 24.
    Distribution of theMixture Model  The probability distribution that results from combining the p y g Markovian random walk distribution & the static rank source distribution r = εe + (1- ε)x 1 ε: probability of selecting non-linked page PageRank Now, transition matrix [εH + (1- ε)M] is primitive and stochastic rt converges to the dominant eigenvector 24 R. Akerkar
  • 25.
    Stability  Whether the link analysis algorithms based on eigenvectors are stable in the sense that results don’t change significantly?  Th connectivity of a portion of the graph is changed The f f h h h d arbitrary  How will it affect the results of algorithms? 25 R. Akerkar
  • 26.
    Stability of HITS Ng et al (2001) • A bound on the number of hyperlinks k that can added or deleted from one page without affecting the authority or hubness weights • It is possible to perturb a symmetric matrix by a quantity that grows as δ that produces a constant perturbation of the dominant eigenvector δ: eigengap λ1 – λ2 g g p d: maximum outdegree of G 26 R. Akerkar
  • 27.
    Stability of PageRank Ng et al (2001) V: the set of vertices touched by the perturbation  The parameter ε of the mixture model has a stabilization role  If the set of pages affected by the p pg y perturbation have a small rank, the overall , change will also be small tighter bound by Bianchini et al (2001) δ(j) >= 2 depends on the edges incident on j 27 R. Akerkar
  • 28.
    SALSA  SALSA (Lempel, Moran 2001)  Probabilistic extension of the HITS algorithm  Random walk is carried out by following hyperlinks both in the forward and in the backward direction  Two separate random walks  Hub walk  Authority walk 28 R. Akerkar
  • 29.
    Forming a BipartiteGraph in SALSA 29 R. Akerkar
  • 30.
    Random Walks  Hub walk  Follow a Web link from a page uh to a page wa (a forward link) and then  Immediately traverse a backlink going from wa to vh, where (u w) (u,w) Є E and (v,w) Є E  Authority Walk y  Follow a Web link from a page w(a) to a page u(h) (a backward link) and then  Immediately traverse a f d l forward link going b k f dl k back from vh to wa where (u,w) Є E and (v,w) Є E 30 R. Akerkar
  • 31.
    Computing Weights  Hub weight computed from the sum of the product of the inverse degree of the in-links and the out-links 31 R. Akerkar
  • 32.
    Why We Care  Lempel and Moran (2001) showed theoretically that SALSA weights are more robust that HITS weights in the presence of the Tightly Knit Community (TKC) Effect.  This effect occurs when a small collection of pages (related to a given topic) is connected so that every hub links to every authority and includes as a special case the mutual reinforcement effect  The pages in a community connected in this way can be ranked highly by HITS, higher than pages in a much larger collection HITS where only some hubs link to some authorities  TKC could be exploited by spammers hoping to increase their page weight ( li k f i h (e.g. link farms) ) 32 R. Akerkar
  • 33.
    A Similar Approach  Rafiei and Mendelzon (2000) and Ng et al. (2001) propose similar approaches using reset as in PageRank  Unlike PageRank, in this model the surfer will follow a forward link on odd steps but a backward li k b k d link on even steps t  The stability properties of these ranking distributions are similar to those of PageRank (Ng et al. 2001) 33 R. Akerkar
  • 34.
    Overcoming TKC  Similarity downweight sequencing and sequential y g g clustering (Roberts and Rosenthal 2003)  Consider the underlying structure of clusters y g  Suggest downweight sequencing to avoid the Tight Knit Community problem g yp  Results indicate approach is effective for few tested queries, but still untested on a large scale 34 R. Akerkar
  • 35.
    PHITS and More  PHITS: Cohn and Chang (2000)  Only the principal eigenvector is extracted using SALSA, so the authority along the remaining eigenvectors is completely neglected  Account for more eigenvectors of the co-citation matrix  See also Lempel, Moran (2003) 35 R. Akerkar
  • 36.
    Limits of LinkAnalysis  META tags/ invisible text  Search engines relying on meta tags in documents are often misled (intentionally) by web developers P f Pay-for-place l  Search engine bias : organizations pay search engines and page rank  Advertisements: organizations pay high ranking pages for advertising space  W h a primary effect of increased visibility to end users and a secondary With ff f d bl d d d effect of increased respectability due to relevance to high ranking page 36 R. Akerkar
  • 37.
    Limits of LinkAnalysis  Stability  Adding even a small number of nodes/edges to the graph has a significant impact  T i drift – similar t TKC Topic d ift i il to  A top authority may be a hub of pages on a different topic resulting in increased rank of the authority p g g y page  Content evolution  Adding/removing links/content can affect the intuitive authority rank of a page requiring recalculation of page ranks 37 R. Akerkar
  • 38.
    Further Reading  R. Akerkar d P. Lingras, Building an I ll R Ak k and P L B ld Intelligent W b Th Web: Theory and Practice, Jones & Bartlett, 2008  http://www.jbpub.com/catalog/9780763741372/ p j p g 38 R. Akerkar