1.
Link Analysis
Rajendra Akerkar
Vestlandsforsking, Norway
2.
Objectives
• To review common approaches to link analysis
• To calculate the popularity of a site based on link analysis
• To model human judgments indirectly
3.
Outline
1. Early Approaches to Link Analysis
2. Hubs and Authorities: HITS
3. PageRank
4. Stability
5. Probabilistic Link Analysis
6. Limitations of Link Analysis
4.
Early Approaches
Basic Assumptions
• Hyperlinks contain information about the human judgment of a site
• The more incoming links a site has, the more important it is judged to be
Bray (1996)
• The visibility of a site is measured by the number of other sites pointing to it
• The luminosity of a site is measured by the number of other sites to which it points
Limitation: failure to capture the relative importance of different parent (child) sites
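As a minimal sketch of the two measures above, visibility and luminosity can be read as counts of distinct in-linking and out-linked sites. The function name, the edge-list input, and the example sites are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

def visibility_and_luminosity(edges):
    """Return (visibility, luminosity) dicts for a list of (source, target) links."""
    pointing_in = defaultdict(set)   # site -> distinct other sites linking to it
    pointing_out = defaultdict(set)  # site -> distinct other sites it links to
    sites = set()
    for src, dst in edges:
        sites.update((src, dst))
        if src != dst:  # only links between different sites are counted
            pointing_in[dst].add(src)
            pointing_out[src].add(dst)
    visibility = {s: len(pointing_in[s]) for s in sites}
    luminosity = {s: len(pointing_out[s]) for s in sites}
    return visibility, luminosity

# Hypothetical four-site graph
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d")]
visibility, luminosity = visibility_and_luminosity(edges)
```

Sites c and d both have visibility 2 here, regardless of how important their parents are, which is exactly the limitation noted on the slide.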
5.
Early Approaches
Mark (1988)
• To calculate the score S of a document at vertex v:
$S(v) = s(v) + \frac{1}{|ch[v]|} \sum_{w \in ch[v]} S(w)$
- v: a vertex in the hypertext graph G = (V, E)
- S(v): the global score
- s(v): the score if the document is isolated
- ch[v]: children of the document at vertex v
• Limitations:
- Requires G to be a directed acyclic graph (DAG)
- If v has a single link to w, S(v) > S(w)
- If v has a long path to w and s(v) < s(w), then S(v) > S(w), which is unreasonable
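The recursive score can be sketched as a memoized recursion over the children of each vertex. This is only an illustration under the slide's own assumption that G is a DAG (a cycle would make the recursion loop forever); the function name, the dictionary inputs, and the example scores are hypothetical.

```python
from functools import lru_cache

def global_scores(children, local_score):
    """S(v) = s(v) + (1/|ch[v]|) * sum of S(w) over children w; requires a DAG."""
    @lru_cache(maxsize=None)
    def S(v):
        kids = children.get(v, [])
        if not kids:
            return local_score[v]  # a leaf keeps its isolated score
        return local_score[v] + sum(S(w) for w in kids) / len(kids)
    return {v: S(v) for v in local_score}

# Hypothetical example: v has a single link to w, and s(v) < s(w)
children = {"v": ["w"], "w": []}
local_score = {"v": 0.2, "w": 1.0}
scores = global_scores(children, local_score)
```

Even though s(v) is much lower than s(w), the global score of v ends up higher than that of w, which illustrates the limitation listed above.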
6.
Early Approaches
Marchiori (1997)
• Hyper information should complement textual information to obtain the overall information:
$S(v) = s(v) + h(v)$
- S(v): overall information
- s(v): textual information
- h(v): hyper information
• $h(v) = \sum_{w \in ch[v]} F^{r(v, w)} S(w)$
- F: a fading constant, F ∈ (0, 1)
- r(v, w): the rank of w after sorting the children of v by S(w)
• A remedy for the previous approach (Mark 1988)
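A rough sketch of the hyper-information formula follows. The slide does not say how the children are ordered or where ranks start, so this example assumes children are sorted by decreasing S(w) with ranks starting at 1, and that the graph is acyclic. The function name and the example values are hypothetical.

```python
def overall_information(children, textual, F=0.5):
    """S(v) = s(v) + sum over children w of F**r(v, w) * S(w), on a DAG."""
    memo = {}

    def S(v):
        if v not in memo:
            # Assumption: best child first, so it receives the smallest fading exponent
            kid_scores = sorted((S(w) for w in children.get(v, [])), reverse=True)
            hyper = sum(F ** rank * s for rank, s in enumerate(kid_scores, start=1))
            memo[v] = textual[v] + hyper
        return memo[v]

    return {v: S(v) for v in textual}

# Hypothetical example: v links to a (strong) and b (weak)
children = {"v": ["a", "b"], "a": [], "b": []}
textual = {"v": 0.2, "a": 1.0, "b": 0.4}
info = overall_information(children, textual, F=0.5)
```

With F = 0.5 the hyper information of v is 0.5 * 1.0 + 0.25 * 0.4 = 0.6, so each additional child contributes less than the previous one instead of being averaged in.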
7.
HITS - Kleinberg’s Algorithm
• HITS: Hypertext Induced Topic Selection
• For each vertex v ∈ V in a subgraph of interest:
- a(v): the authority of v
- h(v): the hubness of v
• A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites
• Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites
9.
Authority and Hubness Convergence
• Recursive dependency:
$a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
$h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
• Using linear algebra, we can prove that a(v) and h(v) converge
10.
HITS Example
Find a base subgraph:
• Start with a root set R = {1, 2, 3, 4}
• {1, 2, 3, 4}: nodes relevant to the topic
• Expand the root set R to include all the children and a fixed number of parents of nodes in R
• The result is a new set S (the base subgraph)
11.
HITS Example
BaseSubgraph(R, d)
  S ← R
  for each v in R
    do S ← S ∪ ch[v]
       P ← pa[v]
       if |P| > d
         then P ← arbitrary subset of P having size d
       S ← S ∪ P
  return S
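The BaseSubgraph procedure translates fairly directly into Python. This is a sketch, assuming the graph is supplied as two dictionaries (children and parents per node) and that "arbitrary subset" may be realised by random sampling; the parameter names and the example graph are hypothetical.

```python
import random

def base_subgraph(root, children, parents, d, rng=random):
    """Expand root set R with all children and at most d parents per node."""
    S = set(root)
    for v in root:
        S |= set(children.get(v, ()))
        P = list(parents.get(v, ()))
        if len(P) > d:
            P = rng.sample(P, d)  # one way to pick an arbitrary subset of size d
        S |= set(P)
    return S

# Hypothetical graph: node 1 has three parents, so only d = 2 of them are kept
children = {1: [5], 2: [5, 6]}
parents = {1: [7, 8, 9], 2: [7]}
S = base_subgraph({1, 2}, children, parents, d=2)
```

Capping the number of parents keeps a very popular page from pulling thousands of in-linking pages into the base subgraph.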
12.
HITS Example
Hubs and authorities: two n-dimensional vectors a and h
HubsAuthorities(G)
  1 ← [1, …, 1] ∈ R^{|V|}
  a_0 ← h_0 ← 1
  t ← 1
  repeat
    for each v in V
      do a_t(v) ← Σ_{w ∈ pa[v]} h_{t-1}(w)
         h_t(v) ← Σ_{w ∈ ch[v]} a_{t-1}(w)
    a_t ← a_t / ||a_t||
    h_t ← h_t / ||h_t||
    t ← t + 1
  until ||a_t − a_{t-1}|| + ||h_t − h_{t-1}|| < ε
  return (a_t, h_t)
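The iteration can be sketched in plain Python as follows. The slide does not name the norms, so this example assumes Euclidean normalisation and an L1 change test; `max_iter`, the function name, and the example graph are also assumptions. Authorities are summed over parents and hubs over children.

```python
import math

def hubs_authorities(nodes, edges, eps=1e-10, max_iter=1000):
    """Iterate a(v) <- sum of h over parents, h(v) <- sum of a over children."""
    parents = {v: [] for v in nodes}
    children = {v: [] for v in nodes}
    for u, w in edges:
        children[u].append(w)
        parents[w].append(u)

    def normalize(x):
        norm = math.sqrt(sum(val * val for val in x.values()))
        return {v: (val / norm if norm else 0.0) for v, val in x.items()}

    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        # Both updates use the weights from the previous step, as in the pseudocode
        new_a = normalize({v: sum(h[w] for w in parents[v]) for v in nodes})
        new_h = normalize({v: sum(a[w] for w in children[v]) for v in nodes})
        delta = sum(abs(new_a[v] - a[v]) + abs(new_h[v] - h[v]) for v in nodes)
        a, h = new_a, new_h
        if delta < eps:
            break
    return a, h

# Hypothetical graph: 1 -> 3, 2 -> 3, 1 -> 4
a, h = hubs_authorities([1, 2, 3, 4], [(1, 3), (2, 3), (1, 4)])
```

In this small graph node 3 is the strongest authority because both hubs cite it, and node 1 is the strongest hub because it points to both authorities.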
13.
HITS Example
Results
[Figure: authority and hubness weights for nodes 1 to 15]
14.
HITS Improvements
Bharat and Henzinger (1998)
• HITS problems
1) A document can contain many identical links to the same document on another host
2) Links are generated automatically (e.g. messages posted on newsgroups)
• Solutions
1) Assign weights to identical multiple edges that are inversely proportional to their multiplicity
2) Prune irrelevant nodes, or regulate the influence of a node with a relevance weight
15.
PageRank
• Introduced by Page et al. (1998)
• The weight is assigned by the rank of parents
• Difference from HITS: HITS takes hubness and authority weights
• The PageRank of a page is proportional to its parents’ rank, but inversely proportional to its parents’ outdegree
16.
Matrix Notation
Adjacency matrix A = [Figure: example adjacency matrix]
* http://www.kusatro.kyoto-u.com
17.
Matrix Notation Matrix Notation r=αBr=Mr α : eigenvalue r : eigenvector of B Ax=λx B= | A - λI | x = 0 Finding Pagerank to find eigenvector of B with an associated eigenvalue α 17
18.
Matrix Notation
PageRank: the eigenvector in P corresponding to the maximum eigenvalue
$B = P D P^{-1}$
- D: diagonal matrix of eigenvalues {λ_1, …, λ_n}
- P: regular matrix whose columns are the eigenvectors
PageRank: r_1, normalized
19.
Matrix Notation
• Confirm the result
- Number of inlinks from highly ranked pages
- Hard to explain for 5 & 2 and 6 & 7
• Interesting topic
- How do you get your homepage highly ranked?
20.
Markov Chain Notation
• Random surfer model
- Description of a random walk through the Web graph
- Interpreted as a transition matrix with the asymptotic probability that a surfer is currently browsing that page
$r_t = M r_{t-1}$
- M: transition matrix for a first-order Markov chain (stochastic)
• Does it converge to some sensible solution (as t → ∞) regardless of the initial ranks?
21.
Problem
“Rank Sink” Problem
• In general, many Web pages have no inlinks/outlinks
• This results in dangling edges in the graph
- E.g. no parent → rank 0: M^T converges to a matrix whose last column is all zero
- No children → no solution: M^T converges to the zero matrix
22.
Modification
• The surfer will restart browsing by picking a new Web page at random
$M = (B + E)$
- E: escape matrix
- M: stochastic matrix
• Still a problem?
- It is not guaranteed that M is primitive
- If M is stochastic and primitive, PageRank converges to the corresponding stationary distribution of M
23.
PageRank Algorithm
[Figure: PageRank algorithm]
* Page et al., 1998
24.
Distribution of the Mixture Model
• The probability distribution that results from combining the Markovian random walk distribution and the static rank source distribution:
$r = \varepsilon e + (1 - \varepsilon) x$
- ε: probability of selecting a non-linked page
• Now the transition matrix [εH + (1 − ε)M] is primitive and stochastic
• r_t converges to the dominant eigenvector
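The mixture model can be illustrated with a small power iteration. This is a sketch under stated assumptions rather than the algorithm from Page et al.: H is taken to be the uniform matrix (every entry 1/n), pages without outlinks are assumed to spread their rank uniformly, and the function name, `tol`, `max_iter`, and the example graph are hypothetical.

```python
def pagerank(nodes, edges, eps=0.15, tol=1e-12, max_iter=1000):
    """Power iteration on eps*H + (1 - eps)*M, with H assumed uniform."""
    n = len(nodes)
    out_links = {v: [] for v in nodes}
    for u, w in edges:
        out_links[u].append(w)

    r = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        new_r = {v: eps / n for v in nodes}  # share from the random restart
        for u in nodes:
            # Assumption: a page with no outlinks jumps to any page uniformly
            targets = out_links[u] or nodes
            share = (1.0 - eps) * r[u] / len(targets)
            for w in targets:
                new_r[w] += share
        delta = sum(abs(new_r[v] - r[v]) for v in nodes)
        r = new_r
        if delta < tol:
            break
    return r

# Hypothetical graph: a -> b, a -> c, b -> c, c -> a, d -> c
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
r = pagerank(nodes, edges)
```

Page d has no inlinks, so its rank comes entirely from the restart term ε/n, while page c collects rank from three parents and ends up on top.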
25.
Stability
• Are the link analysis algorithms based on eigenvectors stable, in the sense that the results don’t change significantly?
• The connectivity of a portion of the graph is changed arbitrarily
• How will it affect the results of the algorithms?
26.
Stability of HITS
Ng et al. (2001)
• A bound on the number of hyperlinks k that can be added to or deleted from one page without affecting the authority or hubness weights
• It is possible to perturb a symmetric matrix by a quantity that grows as δ that produces a constant perturbation of the dominant eigenvector
- δ: eigengap λ_1 − λ_2
- d: maximum outdegree of G
27.
Stability of PageRank
Ng et al. (2001)
- V: the set of vertices touched by the perturbation
• The parameter ε of the mixture model has a stabilization role
• If the set of pages affected by the perturbation have a small rank, the overall change will also be small
• Tighter bound by Bianchini et al. (2001)
- δ(j) ≥ 2 depends on the edges incident on j
28.
SALSA
• SALSA (Lempel, Moran 2001)
- Probabilistic extension of the HITS algorithm
- Random walk is carried out by following hyperlinks both in the forward and in the backward direction
• Two separate random walks
- Hub walk
- Authority walk
29.
Forming a Bipartite Graph in SALSA
30.
Random Walks
• Hub walk
- Follow a Web link from a page u_h to a page w_a (a forward link), and then
- Immediately traverse a backlink going from w_a to v_h, where (u, w) ∈ E and (v, w) ∈ E
• Authority walk
- Follow a Web link from a page w_a to a page u_h (a backward link), and then
- Immediately traverse a forward link going from u_h to a page z_a, where (u, w) ∈ E and (u, z) ∈ E
31.
Computing Weights
Hub weight computed from the sum of the product of the inverse degree of the in-links and the out-links
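One way to read this is as the transition probability of the hub walk: moving from hub u to hub v sums, over every page w that both link to, the product 1/outdeg(u) · 1/indeg(w). The sketch below builds that matrix under the assumption that the edge list has no duplicate links; the function name and the example graph are hypothetical.

```python
def salsa_hub_matrix(edges):
    """Hub-walk transition probabilities H[u][v] for a list of (hub, authority) links."""
    hubs = sorted({u for u, _ in edges})
    out_deg, in_deg = {}, {}
    for u, w in edges:
        out_deg[u] = out_deg.get(u, 0) + 1
        in_deg[w] = in_deg.get(w, 0) + 1

    H = {u: {v: 0.0 for v in hubs} for u in hubs}
    for u, w in edges:          # forward step: u -> w, probability 1/outdeg(u)
        for v, w2 in edges:     # backward step: w -> v, probability 1/indeg(w)
            if w2 == w:
                H[u][v] += 1.0 / (out_deg[u] * in_deg[w])
    return H

# Hypothetical graph: 1 -> 3, 2 -> 3, 1 -> 4
H = salsa_hub_matrix([(1, 3), (2, 3), (1, 4)])
```

Each row of H sums to 1, and in this connected example the stationary distribution (2/3, 1/3) is proportional to the hubs' out-degrees.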
32.
Why We Care
• Lempel and Moran (2001) showed theoretically that SALSA weights are more robust than HITS weights in the presence of the Tightly Knit Community (TKC) effect
• This effect occurs when a small collection of pages (related to a given topic) is connected so that every hub links to every authority; it includes the mutual reinforcement effect as a special case
• The pages in a community connected in this way can be ranked highly by HITS, higher than pages in a much larger collection where only some hubs link to some authorities
• TKC could be exploited by spammers hoping to increase their page weight (e.g. link farms)
33.
A Similar Approach
• Rafiei and Mendelzon (2000) and Ng et al. (2001) propose similar approaches using reset as in PageRank
• Unlike PageRank, in this model the surfer will follow a forward link on odd steps but a backward link on even steps
• The stability properties of these ranking distributions are similar to those of PageRank (Ng et al. 2001)
34.
Overcoming TKC
• Similarity downweight sequencing and sequential clustering (Roberts and Rosenthal 2003)
• Consider the underlying structure of clusters
• Suggest downweight sequencing to avoid the Tightly Knit Community problem
• Results indicate the approach is effective for the few tested queries, but it is still untested on a large scale
35.
PHITS and More
• PHITS: Cohn and Chang (2000)
• Only the principal eigenvector is extracted using SALSA, so the authority along the remaining eigenvectors is completely neglected
• Account for more eigenvectors of the co-citation matrix
• See also Lempel, Moran (2003)
36.
Limits of Link Analysis
• META tags / invisible text
- Search engines relying on meta tags in documents are often misled (intentionally) by web developers
• Pay-for-place
- Search engine bias: organizations pay search engines and page rank
- Advertisements: organizations pay high ranking pages for advertising space
- With a primary effect of increased visibility to end users and a secondary effect of increased respectability due to relevance to a high ranking page
37.
Limits of Link Analysis
• Stability
- Adding even a small number of nodes/edges to the graph has a significant impact
• Topic drift (similar to TKC)
- A top authority may be a hub of pages on a different topic, resulting in increased rank of the authority page
• Content evolution
- Adding/removing links/content can affect the intuitive authority rank of a page, requiring recalculation of page ranks
38.
Further Reading
R. Akerkar and P. Lingras, Building an Intelligent Web: Theory and Practice, Jones & Bartlett, 2008
http://www.jbpub.com/catalog/9780763741372/