C. de Kerchove et al. / Linear Algebra and its Applications 429 (2008) 1254–1276

Fig. 1. Every optimal linkage strategy for a set $I$ of five pages must have this structure.

Theorem 1. Let $E_I$, $E_{\mathrm{in}(I)}$ and $E_{\bar I}$ be given. Suppose that the subgraph $(I, E_I)$ is strongly connected and $E_I \neq \emptyset$. Then every optimal outlink structure $E_{\mathrm{out}(I)}$ is to have only one outlink to a particular page outside of $I$.

We are also interested in the optimal internal link structure for a website. In the case where there is a unique leaking node in the website, that is, only one node linking to the rest of the web, Theorem 11 can be particularized as follows.

Theorem 2. Let $E_{\mathrm{out}(I)}$, $E_{\mathrm{in}(I)}$ and $E_{\bar I}$ be given. Suppose that there is only one leaking node in $I$. Then every optimal internal link structure $E_I$ is composed of a forward chain of links together with every possible backward link.

Putting together Theorems 10 and 11, we get in Theorem 12 the optimal link structure for a website. This optimal structure is illustrated in Fig. 1.

Theorem 3. Let $E_{\mathrm{in}(I)}$ and $E_{\bar I}$ be given. Then, for every optimal link structure, $E_I$ is composed of a forward chain of links together with every possible backward link, and $E_{\mathrm{out}(I)}$ consists of a unique outlink, starting from the last node of the chain.

This paper is organized as follows. In the following preliminary section, we recall some graph concepts as well as the definition of the PageRank, and we introduce some notations. In Section 3, we develop tools for analysing the PageRank of a set of pages $I$. Then we come to the main part of this paper: in Section 4 we provide the optimal linkage strategy for a set of nodes. In Section 5, we give some extensions and variants of the main theorems. We end this paper with some concluding remarks.

2. Graphs and PageRank

Let $G = (N,E)$ be a directed graph representing the Web. The webpages are represented by the set of nodes $N = \{1,\dots,n\}$ and the hyperlinks are represented by the set of directed links $E \subseteq N \times N$.
That means that $(i,j) \in E$ if and only if there exists a hyperlink linking page $i$ to page $j$.

Let us first briefly recall some usual concepts about directed graphs (see for instance [4]). A link $(i,j)$ is said to be an outlink for node $i$ and an inlink for node $j$. If $(i,j) \in E$, node $i$ is called a parent of node $j$. By $j \leftarrow i$, we mean that $j$ belongs to the set of children of $i$, that is $j \in \{k \in N : (i,k) \in E\}$. The outdegree $d_i$ of a node $i$ is its number of children, that is $d_i = |\{j \in N : (i,j) \in E\}|$.

A path from $i_0$ to $i_s$ is a sequence of nodes $\langle i_0, i_1, \dots, i_s \rangle$ such that $(i_k, i_{k+1}) \in E$ for every $k = 0, 1, \dots, s-1$. A node $i$ has an access to a node $j$ if there exists a path from $i$ to $j$. In this paper, we will also say that a node $i$ has an access to a set $J$ if $i$ has an access to at least one node $j \in J$. The graph $G$ is strongly connected if every node of $N$ has an access to every other node of $N$. A set of nodes $F \subseteq N$ is a final class of the graph $G = (N,E)$ if the subgraph $(F, E_F)$ is strongly connected and moreover $E_{\mathrm{out}(F)} = \emptyset$ (i.e. nodes of $F$ do not have an access to $N \setminus F$).

Let us now briefly introduce the PageRank score (see [5,6,12,13,15] for background). Without loss of generality (please refer to the book of Langville and Meyer [13] or the survey of Bianchini et al. [5] for details), we can make the assumption that each node has at least one outlink, i.e. $d_i \neq 0$ for every $i \in N$. Therefore the $n \times n$ stochastic matrix $P = [P_{ij}]_{i,j \in N}$ given by

$$P_{ij} = \begin{cases} d_i^{-1} & \text{if } (i,j) \in E, \\ 0 & \text{otherwise} \end{cases}$$

is well defined and is a scaling of the adjacency matrix of $G$. Let also $0 < c < 1$ be a damping factor and $z$ be a positive stochastic personalization vector, i.e. $z_i > 0$ for all $i = 1, \dots, n$ and $z^T \mathbf{1} = 1$, where $\mathbf{1}$ denotes the vector of all ones. The Google matrix is then defined as

$$G = cP + (1 - c)\mathbf{1} z^T.$$

Since $z > 0$ and $c < 1$, this stochastic matrix is positive, i.e. $G_{ij} > 0$ for all $i,j$.
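The construction of $P$ and of the Google matrix can be illustrated on a toy example. The following is only a minimal sketch in plain Python: the 4-node graph, the helper names `scaled_adjacency` and `google_matrix`, and the 0-based node labels are assumptions made for illustration and are not taken from the paper.

```python
def scaled_adjacency(n, links):
    """Return P with P[i][j] = 1/d_i if (i, j) is a link and 0 otherwise."""
    children = [sorted({j for (k, j) in links if k == i}) for i in range(n)]
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if not children[i]:
            raise ValueError(f"node {i} has no outlink")
        for j in children[i]:
            P[i][j] = 1.0 / len(children[i])
    return P


def google_matrix(P, c, z):
    """Return G = c P + (1 - c) 1 z^T as a list of rows."""
    n = len(P)
    return [[c * P[i][j] + (1.0 - c) * z[j] for j in range(n)] for i in range(n)]


# Hypothetical 4-node web graph (nodes labelled 0..3 instead of 1..n).
links = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2), (3, 0)]
n, c = 4, 0.85
z = [1.0 / n] * n  # uniform personalization vector (an assumption)

P = scaled_adjacency(n, links)
G = google_matrix(P, c, z)
row_sums = [sum(row) for row in G]  # each row of a stochastic matrix sums to 1
print(row_sums)
```

Every row of `G` sums to one and every entry is positive, as stated above for $z > 0$ and $c < 1$.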
The PageRank vector $\pi$ is then defined as the unique invariant measure of the matrix $G$, that is the unique left Perron vector of $G$,

$$\pi^T = \pi^T G, \qquad \pi^T \mathbf{1} = 1. \tag{1}$$

The PageRank of a node $i$ is the $i$th entry $\pi_i = \pi^T e_i$ of the PageRank vector.

The PageRank vector is usually interpreted as the stationary distribution of the following Markov chain (see for instance [13]): a random surfer moves on the webgraph, using hyperlinks between pages with a probability $c$ and zapping to some new page according to the personalization vector with a probability $(1 - c)$. The Google matrix $G$ is the probability transition matrix of this random walk. In this stochastic interpretation, the PageRank of a node is equal to the inverse of its mean return time, that is, $\pi_i^{-1}$ is the mean number of steps a random surfer starting in node $i$ will take for coming back to $i$ (see [7,10]).

3. PageRank of a website

We are interested in characterizing the PageRank of a set $I$. We define this as the sum

$$\pi^T e_I = \sum_{i \in I} \pi_i,$$

where $e_I$ denotes the vector with a 1 in the entries of $I$ and 0 elsewhere. Note that the PageRank of a set corresponds to the notion of energy of a community in [5].

Let $I \subseteq N$ be a subset of the nodes of the graph. The PageRank of $I$ can be expressed as $\pi^T e_I = (1 - c) z^T (I - cP)^{-1} e_I$ from the PageRank equations (1). Let us then define the vector

$$v = (I - cP)^{-1} e_I. \tag{2}$$

With this, we have the following expression for the PageRank of the set $I$:

$$\pi^T e_I = (1 - c) z^T v. \tag{3}$$

The vector $v$ will play a crucial role throughout this paper. In this section, we will first present a probabilistic interpretation for this vector and prove some of its properties. We will then show how it can be used in order to analyze the influence of some page $i \in I$ on the PageRank of the set $I$. We will end this section by briefly introducing the concept of basic absorbing graph, which will be useful in order to analyze optimal linkage strategies under some assumptions.

3.1. Mean number of visits before zapping

Let us first see how the entries of the vector $v = (I - cP)^{-1} e_I$ can be interpreted. Let us consider a random surfer on the webgraph $G$ that, as described in Section 2, follows the hyperlinks of the webgraph with a probability $c$. But, instead of zapping to some page of $G$ with a probability $(1 - c)$, he stops his walk with probability $(1 - c)$ at each step of time. This is equivalent to considering a random walk on the extended graph $G_e = (N \cup \{n+1\}, E \cup \{(i, n+1) : i \in N\})$ with a transition probability matrix

$$P_e = \begin{pmatrix} cP & (1 - c)\mathbf{1} \\ 0 & 1 \end{pmatrix}.$$

At each step of time, with probability $1 - c$, the random surfer can disappear from the original graph, that is, he can reach the absorbing node $n + 1$.
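Identity (3) can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' procedure: it uses a hypothetical 4-node graph with 0-based labels, computes $v$ by iterating $v \leftarrow cPv + e_I$ (which converges because $c < 1$), computes $\pi$ by iterating $\pi^T \leftarrow c\,\pi^T P + (1-c) z^T$, and compares both sides of (3). The helper names are made up for this example.

```python
def children_of(n, links):
    """children[i] lists the nodes j with (i, j) in the link set."""
    return [sorted({j for (k, j) in links if k == i}) for i in range(n)]


def visits_before_zapping(children, c, targets, iters=500):
    """Fixed-point iteration for v = c P v + e_I, i.e. v = (I - cP)^{-1} e_I."""
    n = len(children)
    e = [1.0 if i in targets else 0.0 for i in range(n)]
    v = e[:]
    for _ in range(iters):
        v = [e[i] + c * sum(v[j] for j in children[i]) / len(children[i])
             for i in range(n)]
    return v


def pagerank(children, c, z, iters=500):
    """Iterate pi^T <- c pi^T P + (1 - c) z^T, whose fixed point satisfies (1)."""
    n = len(children)
    pi = z[:]
    for _ in range(iters):
        nxt = [(1.0 - c) * zj for zj in z]
        for i in range(n):
            share = c * pi[i] / len(children[i])
            for j in children[i]:
                nxt[j] += share
        pi = nxt
    return pi


# Hypothetical example: 4 nodes, I = {0, 1}, non-uniform personalization.
links = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2), (3, 0)]
n, c, I = 4, 0.85, {0, 1}
z = [0.1, 0.2, 0.3, 0.4]

children = children_of(n, links)
pi = pagerank(children, c, z)
v = visits_before_zapping(children, c, I)

lhs = sum(pi[i] for i in I)                          # pi^T e_I
rhs = (1.0 - c) * sum(z[i] * v[i] for i in range(n)) # (1 - c) z^T v
print(lhs, rhs)
```

The two printed values agree up to rounding, which is what (3) predicts for any positive stochastic $z$.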
The nonnegative matrix $(I - cP)^{-1}$ is commonly called the fundamental matrix of the absorbing Markov chain defined by $P_e$ (see for instance [10,16]). In the extended graph $G_e$, the entry $[(I - cP)^{-1}]_{ij}$ is the expected number of visits to node $j$ before reaching the absorbing node $n + 1$ when starting from node $i$. From the point of view of the standard random surfer described in Section 2, the entry $[(I - cP)^{-1}]_{ij}$ is the expected number of visits to node $j$ before zapping for the first time when starting from node $i$.

Therefore, the vector $v$ defined in Eq. (2) has the following probabilistic interpretation. The entry $v_i$ is the expected number of visits to the set $I$ before zapping for the first time when the random surfer starts his walk in node $i$.

Now, let us first prove some simple properties about this vector.

Lemma 1. Let $v \in \mathbb{R}^n_{\geq 0}$ be defined by $v = cPv + e_I$. Then,
(a) $\max_{i \notin I} v_i \leq c \max_{i \in I} v_i$;
(b) $v_i \leq 1 + c v_i$ for all $i \in N$, with equality if and only if the node $i$ does not have an access to $\bar I$;
(c) $v_i \geq \min_{j \leftarrow i} v_j$ for all $i \in I$, with equality if and only if the node $i$ does not have an access to $\bar I$.

Proof. (a) For all $i \notin I$ we have $v_i = c \sum_{j \leftarrow i} v_j / d_i$, so

$$\max_{i \notin I} v_i = \max_{i \notin I} \Big( c \sum_{j \leftarrow i} \frac{v_j}{d_i} \Big) \leq c \max_j v_j.$$

Since $c < 1$, it then follows that $\max_j v_j = \max_{i \in I} v_i$.

(b) The inequality $v_i \leq \frac{1}{1-c}$ follows directly from

$$\max_i v_i \leq \max_i \Big( 1 + c \sum_{j \leftarrow i} \frac{v_j}{d_i} \Big) \leq 1 + c \max_j v_j.$$

From (a) it then also follows that $v_i \leq \frac{c}{1-c}$ for all $i \notin I$. Now, let $i \in N$ be such that $v_i = \frac{1}{1-c}$. Then $i \in I$. Moreover,

$$1 + c v_i = v_i = 1 + c \sum_{j \leftarrow i} \frac{v_j}{d_i},$$

that is, $v_j = \frac{1}{1-c}$ for every $j \leftarrow i$. Hence node $j$ must also belong to $I$. By induction, every node $k$ such that $i$ has an access to $k$ must belong to $I$.

(c) Let $i \in I$. Then, by (b),

$$1 + c v_i \geq v_i = 1 + c \sum_{j \leftarrow i} \frac{v_j}{d_i} \geq 1 + c \min_{j \leftarrow i} v_j,$$

so $v_i \geq \min_{j \leftarrow i} v_j$ for all $i \in I$. If $v_i = \min_{j \leftarrow i} v_j$ then also $1 + c v_i = v_i$ and hence, by (b), the node $i$ does not have an access to $\bar I$. □
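The three statements of Lemma 1 can be sanity-checked on small examples. The following sketch is illustrative only: the two graphs, the 0-based labels and the helper names are assumptions, and $v$ is obtained by the fixed-point iteration $v \leftarrow cPv + e_I$ rather than by a matrix inverse. The second graph is chosen so that $I = \{0, 1\}$ has no access to $\bar I$, which is the equality case of (b).

```python
def children_of(n, links):
    """children[i] lists the nodes j with (i, j) in the link set."""
    return [sorted({j for (k, j) in links if k == i}) for i in range(n)]


def visits_before_zapping(children, c, targets, iters=500):
    """Fixed-point iteration for v = c P v + e_I."""
    n = len(children)
    e = [1.0 if i in targets else 0.0 for i in range(n)]
    v = e[:]
    for _ in range(iters):
        v = [e[i] + c * sum(v[j] for j in children[i]) / len(children[i])
             for i in range(n)]
    return v


c = 0.85
bound = 1.0 / (1.0 - c)  # the value 1/(1-c) appearing in the proof of (b)
tol = 1e-9

# Hypothetical graph 1: I = {0, 1}; node 1 links to node 2, which is outside I.
I = {0, 1}
children = children_of(5, [(0, 1), (1, 0), (1, 2), (2, 3), (3, 0), (3, 4), (4, 2)])
v = visits_before_zapping(children, c, I)

part_a = max(v[i] for i in range(5) if i not in I) <= c * max(v[i] for i in I) + tol
part_b = all(v[i] <= bound + tol for i in range(5))
part_c = all(v[i] >= min(v[j] for j in children[i]) - tol for i in I)

# Hypothetical graph 2: nodes 0 and 1 only link to each other, so I is closed.
closed = children_of(4, [(0, 1), (1, 0), (2, 0), (2, 3), (3, 2)])
w = visits_before_zapping(closed, c, {0, 1})

print(part_a, part_b, part_c)
print(w[0], w[1], bound)
```

In the second graph the entries `w[0]` and `w[1]` reach the bound $1/(1-c)$, while the node outside $I$ stays strictly below it.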
Let us denote the set of nodes of $\bar I$ which on average give the most visits to $I$ before zapping by

$$V = \operatorname{argmax}_{j \in \bar I} v_j.$$

Then the following lemma is quite intuitive. It says that, among the nodes of $\bar I$, those which provide the highest mean number of visits to $I$ are parents of $I$, i.e. parents of some node of $I$.

Lemma 2 (Parents of $I$). If $E_{\mathrm{in}(I)} \neq \emptyset$, then

$$V \subseteq \{ j \in \bar I : \text{there exists } l \in I \text{ such that } (j,l) \in E_{\mathrm{in}(I)} \}.$$

If $E_{\mathrm{in}(I)} = \emptyset$, then $v_j = 0$ for every $j \in \bar I$.

Proof. Suppose first that $E_{\mathrm{in}(I)} \neq \emptyset$. Let $k \in V$ with $v = (I - cP)^{-1} e_I$. If we supposed that there does not exist $l \in I$ such that $(k,l) \in E_{\mathrm{in}(I)}$, then we would have, since $v_k > 0$,

$$v_k = c \sum_{j \leftarrow k} \frac{v_j}{d_k} \leq c \max_{j \notin I} v_j = c v_k < v_k,$$

which is a contradiction. Now, if $E_{\mathrm{in}(I)} = \emptyset$, then there is no access to $I$ from $\bar I$, so clearly $v_j = 0$ for every $j \in \bar I$. □

Lemma 2 shows that the nodes $j \in \bar I$ which provide the highest value of $v_j$ must belong to the set of parents of $I$. The converse is not true, as we will see in the following example: some parents of $I$ can provide a lower mean number of visits to $I$ than other nodes which are not parents of $I$. In other words, Lemma 2 gives a necessary but not sufficient condition in order to maximize the entry $v_j$ for some $j \in \bar I$.

Example 1. Let us see on an example that having $(j,i) \in E_{\mathrm{in}(I)}$ for some $i \in I$ is not sufficient to have $j \in V$. Consider the graph in Fig. 2. Let $I = \{1\}$ and take a damping factor $c = 0.85$. For $v = (I - cP)^{-1} e_1$, we have

$$v_2 = v_3 = v_4 = 4.359 > v_5 = 3.521 > v_6 = 3.492 > v_7 > \cdots > v_{11},$$

so $V = \{2,3,4\}$. As ensured by Lemma 2, every node of the set $V$ is a parent of node 1. But here, $V$ does not contain all parents of node 1. Indeed, the node $6 \notin V$ while it is a parent of 1 and is moreover its parent with the lowest outdegree.
Moreover, we see in this example that node 5, which is not a parent of node 1 but a parent of node 6, gives a higher value of the expected number of visits to $I$ before zapping than node 6, a parent of 1. Let us try to get some intuition about that. When starting from node 6, a random surfer has probability one half to reach node 1 in only one step. But he also has a probability one half to move to node 11 and to be sent far away from node 1. On the other hand, when starting from node 5, the random surfer cannot reach node 1 in only one step. But with probability 3/4 he will reach one of the nodes 2, 3 or 4 in one step. And from these nodes, the websurfer stays very near to node 1 and cannot be sent far away from it.

Fig. 2. The node $6 \notin V$ and yet it is a parent of $I = \{1\}$ (see Example 1).

In the next lemma, we show that from some node $i \in I$ which has an access to $\bar I$, there always exists what we call a decreasing path to $\bar I$. That is, we can find a path such that the mean number of visits to $I$ is higher when starting from some node of the path than when starting from the successor of this node in the path.

Lemma 3 (Decreasing paths to $\bar I$). For every $i_0 \in I$ which has an access to $\bar I$, there exists a path $\langle i_0, i_1, \dots, i_s \rangle$ with $i_1, \dots, i_{s-1} \in I$ and $i_s \in \bar I$ such that

$$v_{i_0} > v_{i_1} > \cdots > v_{i_s}.$$

Proof. Let us simply construct a decreasing path recursively by

$$i_{k+1} \in \operatorname{argmin}_{j \leftarrow i_k} v_j,$$

as long as $i_k \in I$. If $i_k$ has an access to $\bar I$, then $v_{i_{k+1}} < v_{i_k} < \frac{1}{1-c}$ by Lemma 1(b) and (c), so the node $i_{k+1}$ also has an access to $\bar I$. By assumption, $i_0$ has an access to $\bar I$. Moreover, the set $I$ has a finite number of elements, so there must exist an $s$ such that $i_s \in \bar I$. □

3.2. Influence of the outlinks of a node

We will now see how a modification of the outlinks of some node $i \in N$ can change the PageRank of a subset of nodes $I \subseteq N$. So we will compare two graphs on $N$ defined by their sets of links, $E$ and $\tilde E$, respectively.

Every item corresponding to the graph defined by the set of links $\tilde E$ will be written with a tilde symbol. So $\tilde P$ denotes its scaled adjacency matrix, $\tilde\pi$ the corresponding PageRank vector, $\tilde d_i = |\{j : (i,j) \in \tilde E\}|$ the outdegree of some node $i$ in this graph, $\tilde v = (I - c\tilde P)^{-1} e_I$ and $\tilde V = \operatorname{argmax}_{j \in \bar I} \tilde v_j$. Finally, by $j \mathrel{\tilde\leftarrow} i$ we mean $j \in \{k : (i,k) \in \tilde E\}$.

So, let us consider two graphs defined, respectively, by their sets of links $E$ and $\tilde E$. Suppose that they differ only in the links starting from some given node $i$, that is, $\{j : (k,j) \in E\} = \{j : (k,j) \in \tilde E\}$ for all $k \neq i$. Then their scaled adjacency matrices $P$ and $\tilde P$ are linked by a rank one correction. Let us then define the vector

$$\delta = \sum_{j \mathrel{\tilde\leftarrow} i} \frac{e_j}{\tilde d_i} - \sum_{j \leftarrow i} \frac{e_j}{d_i},$$

which gives the correction to apply to the line $i$ of the matrix $P$ in order to get $\tilde P$.

Now let us first express the difference between the PageRank of $I$ for two configurations differing only in the links starting from some node $i$. Note that in the following lemma the personalization vector $z$ does not appear explicitly in the expression of $\tilde\pi$.

Lemma 4. Let two graphs be defined respectively by $E$ and $\tilde E$ and let $i \in N$ be such that for all $k \neq i$, $\{j : (k,j) \in E\} = \{j : (k,j) \in \tilde E\}$. Then

$$\tilde\pi^T e_I = \pi^T e_I + c\,\pi_i\, \frac{\delta^T v}{1 - c\,\delta^T (I - cP)^{-1} e_i}.$$

Proof. Clearly, the scaled adjacency matrices are linked by $\tilde P = P + e_i \delta^T$. Since $c < 1$, the matrix $(I - cP)^{-1}$ exists and the PageRank vectors can be expressed as

$$\pi^T = (1 - c) z^T (I - cP)^{-1}, \qquad \tilde\pi^T = (1 - c) z^T \big(I - c(P + e_i \delta^T)\big)^{-1}.$$

Applying the Sherman–Morrison formula to $((I - cP) - c\,e_i \delta^T)^{-1}$, we get

$$\tilde\pi^T = (1 - c) z^T (I - cP)^{-1} + (1 - c) z^T (I - cP)^{-1} e_i\, \frac{c\,\delta^T (I - cP)^{-1}}{1 - c\,\delta^T (I - cP)^{-1} e_i},$$

and the result follows immediately, since $(1 - c) z^T (I - cP)^{-1} e_i = \pi_i$ and $(I - cP)^{-1} e_I = v$. □

Let us now give an equivalent condition in order to increase the PageRank of $I$ by changing outlinks of some node $i$. The PageRank of $I$ increases essentially when the new set of links favors nodes giving a higher mean number of visits to $I$ before zapping.

Theorem 5 (PageRank and mean number of visits before zapping). Let two graphs be defined respectively by $E$ and $\tilde E$ and let $i \in N$ be such that for all $k \neq i$, $\{j : (k,j) \in E\} = \{j : (k,j) \in \tilde E\}$. Then

$$\tilde\pi^T e_I > \pi^T e_I \quad \text{if and only if} \quad \delta^T v > 0$$

and $\tilde\pi^T e_I = \pi^T e_I$ if and only if $\delta^T v = 0$.

Proof. Let us first show that $\delta^T (I - cP)^{-1} e_i \leq 1$ is always verified. Let $u = (I - cP)^{-1} e_i$. Then $u - cPu = e_i$ and, by Lemma 1(a), $u_j \leq u_i$ for all $j$. So

$$\delta^T u = \sum_{j \mathrel{\tilde\leftarrow} i} \frac{u_j}{\tilde d_i} - \sum_{j \leftarrow i} \frac{u_j}{d_i} \leq u_i - \sum_{j \leftarrow i} \frac{u_j}{d_i} \leq u_i - c \sum_{j \leftarrow i} \frac{u_j}{d_i} = 1.$$

Now, since $c < 1$ and $\pi > 0$, the conclusion follows by Lemma 4. □
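The update formula of Lemma 4 and the sign criterion of Theorem 5 can be checked numerically. The sketch below is an assumption-laden illustration: the 5-node graph, the choice of rewiring node 0, the uniform $z$ and the helper names are all hypothetical, and nodes are labelled from 0. It recomputes the PageRank of $I$ from scratch after the rewiring and compares it with the value predicted by Lemma 4.

```python
def children_of(n, links):
    """children[i] lists the nodes j with (i, j) in the link set."""
    return [sorted({j for (k, j) in links if k == i}) for i in range(n)]


def visits_before_zapping(children, c, targets, iters=500):
    """Fixed-point iteration for x = c P x + e_targets."""
    n = len(children)
    e = [1.0 if i in targets else 0.0 for i in range(n)]
    x = e[:]
    for _ in range(iters):
        x = [e[i] + c * sum(x[j] for j in children[i]) / len(children[i])
             for i in range(n)]
    return x


def pagerank(children, c, z, iters=500):
    """Iterate pi^T <- c pi^T P + (1 - c) z^T."""
    n = len(children)
    pi = z[:]
    for _ in range(iters):
        nxt = [(1.0 - c) * zj for zj in z]
        for i in range(n):
            share = c * pi[i] / len(children[i])
            for j in children[i]:
                nxt[j] += share
        pi = nxt
    return pi


# Hypothetical graph; node i = 0 swaps its outlink (0, 3) for (0, 2).
links = [(0, 1), (0, 3), (1, 0), (1, 2), (2, 0), (3, 4), (4, 0), (4, 3)]
n, c, I, i = 5, 0.85, {0, 1, 2}, 0
z = [1.0 / n] * n

old = children_of(n, links)
new = [ch[:] for ch in old]
new[i] = [1, 2]

# delta = (row i of P~) - (row i of P)
delta = [(j in new[i]) / len(new[i]) - (j in old[i]) / len(old[i]) for j in range(n)]

v = visits_before_zapping(old, c, I)    # v = (I - cP)^{-1} e_I
u = visits_before_zapping(old, c, {i})  # u = (I - cP)^{-1} e_i
pi = pagerank(old, c, z)

delta_v = sum(d * x for d, x in zip(delta, v))
delta_u = sum(d * x for d, x in zip(delta, u))

before = sum(pi[k] for k in I)
predicted = before + c * pi[i] * delta_v / (1.0 - c * delta_u)  # Lemma 4
after = sum(pagerank(new, c, z)[k] for k in I)                  # recomputed

print(before, predicted, after, delta_v)
```

Here $\delta^T v > 0$, so Theorem 5 predicts an increase, and the recomputed value `after` matches `predicted` up to rounding.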
The following Proposition 6 shows how to add a new link $(i,j)$ starting from a given node $i$ in order to increase the PageRank of the set $I$. The PageRank of $I$ increases as soon as a node $i \in I$ adds a link to a node $j$ with a larger or equal expected number of visits to $I$ before zapping.

Proposition 6 (Adding a link). Let $i \in I$ and let $j \in N$ be such that $(i,j) \notin E$ and $v_i \leq v_j$. Let $\tilde E = E \cup \{(i,j)\}$. Then

$$\tilde\pi^T e_I \geq \pi^T e_I$$

with equality if and only if the node $i$ does not have an access to $\bar I$.

Proof. Let $i \in I$ and let $j \in N$ be such that $(i,j) \notin E$ and $v_i \leq v_j$. Then

$$1 + c \sum_{k \leftarrow i} \frac{v_k}{d_i} = v_i \leq 1 + c v_i \leq 1 + c v_j$$

with equality if and only if $i$ does not have an access to $\bar I$ by Lemma 1(b). Let $\tilde E = E \cup \{(i,j)\}$. Then

$$\delta^T v = \frac{1}{d_i + 1} \Big( v_j - \sum_{k \leftarrow i} \frac{v_k}{d_i} \Big) \geq 0$$

with equality if and only if $i$ does not have an access to $\bar I$. The conclusion follows from Theorem 5. □

Fig. 3. For $I = \{1,2,3\}$, removing link $(1,2)$ gives $\tilde\pi^T e_I < \pi^T e_I$, even if $v_1 > v_2$ (see Example 2).

Now let us see how to remove a link $(i,j)$ starting from a given node $i$ in order to increase the PageRank of the set $I$. If a node $i \in N$ removes a link to its worst child from the point of view of the expected number of visits to $I$ before zapping, then the PageRank of $I$ increases.

Proposition 7 (Removing a link). Let $i \in N$ and let $j \in \operatorname{argmin}_{k \leftarrow i} v_k$. Let $\tilde E = E \setminus \{(i,j)\}$. Then

$$\tilde\pi^T e_I \geq \pi^T e_I$$

with equality if and only if $v_k = v_j$ for every $k$ such that $(i,k) \in E$.

Proof. Let $i \in N$ and let $j \in \operatorname{argmin}_{k \leftarrow i} v_k$. Let $\tilde E = E \setminus \{(i,j)\}$. Then

$$\delta^T v = \sum_{k \leftarrow i} \frac{v_k - v_j}{d_i (d_i - 1)} \geq 0$$

with equality if and only if $v_k = v_j$ for all $k \leftarrow i$. The conclusion follows by Theorem 5. □

In order to increase the PageRank of $I$ with a new link $(i,j)$, Proposition 6 only requires that $v_i \leq v_j$. On the other hand, Proposition 7 requires that $v_j = \min_{k \leftarrow i} v_k$ in order to increase the PageRank of $I$ by deleting link $(i,j)$. One could wonder whether or not this condition could be weakened to $v_j < v_i$, so as to have symmetric conditions for the addition or deletion of links. In fact, this cannot be done, as shown in the following example.

Example 2. Let us see by an example that the condition $j \in \operatorname{argmin}_{k \leftarrow i} v_k$ in Proposition 7 cannot be weakened to $v_j < v_i$. Consider the graph in Fig. 3 and take a damping factor $c = 0.85$. Let $I = \{1,2,3\}$. We have

$$v_1 = 2.63 > v_2 = 2.303 > v_3 = 1.533.$$

As ensured by Proposition 7, if we remove the link $(1,3)$, the PageRank of $I$ increases (e.g. from 0.199 to 0.22 with a uniform personalization vector $z = \frac{1}{n}\mathbf{1}$), since $3 \in \operatorname{argmin}_{k \leftarrow 1} v_k$. But, if we remove instead the link $(1,2)$, the PageRank of $I$ decreases (from 0.199 to 0.179 with $z$ uniform) even if $v_2 < v_1$.

Remark 1. Let us note that, if the node $i$ does not have an access to the set $\bar I$, then for every deletion of a link starting from $i$, the PageRank of $I$ will not be modified. Indeed, in this case $\delta^T v = 0$ since, by Lemma 1(b), $v_j = \frac{1}{1-c}$ for every $j \leftarrow i$.
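Propositions 6 and 7 can also be exercised on a toy graph. The following sketch does not reproduce Fig. 3, whose link structure is not available here; it uses a hypothetical 6-node graph with 0-based labels, a uniform $z$, and made-up helper names. For every node with at least two outlinks it removes the link to a worst child, and for every $i \in I$ it adds each missing link $(i,j)$ with $v_i \leq v_j$ (self-links are skipped for simplicity), recording the resulting change in the PageRank of $I$.

```python
def children_of(n, links):
    """children[i] lists the nodes j with (i, j) in the link set."""
    return [sorted({j for (k, j) in links if k == i}) for i in range(n)]


def visits_before_zapping(children, c, targets, iters=500):
    """Fixed-point iteration for v = c P v + e_I."""
    n = len(children)
    e = [1.0 if i in targets else 0.0 for i in range(n)]
    v = e[:]
    for _ in range(iters):
        v = [e[i] + c * sum(v[j] for j in children[i]) / len(children[i])
             for i in range(n)]
    return v


def set_pagerank(children, c, z, targets, iters=500):
    """PageRank of a set: sum of pi_k over k in targets."""
    n = len(children)
    pi = z[:]
    for _ in range(iters):
        nxt = [(1.0 - c) * zj for zj in z]
        for i in range(n):
            share = c * pi[i] / len(children[i])
            for j in children[i]:
                nxt[j] += share
        pi = nxt
    return sum(pi[k] for k in targets)


# Hypothetical graph, not the one of Fig. 3.
links = [(0, 1), (0, 3), (1, 2), (1, 4), (2, 0),
         (3, 4), (3, 0), (4, 5), (5, 3), (5, 1)]
n, c, I = 6, 0.85, {0, 1, 2}
z = [1.0 / n] * n

base = children_of(n, links)
v = visits_before_zapping(base, c, I)
before = set_pagerank(base, c, z, I)

# Proposition 7: remove the link to a worst child (node must keep an outlink).
removal_gain = {}
for i in range(n):
    if len(base[i]) >= 2:
        j = min(base[i], key=lambda k: v[k])
        trial = [ch[:] for ch in base]
        trial[i].remove(j)
        removal_gain[(i, j)] = set_pagerank(trial, c, z, I) - before

# Proposition 6: from i in I, add a missing link to any j with v_i <= v_j.
addition_gain = {}
for i in I:
    for j in range(n):
        if j != i and j not in base[i] and v[i] <= v[j]:
            trial = [ch[:] for ch in base]
            trial[i].append(j)
            addition_gain[(i, j)] = set_pagerank(trial, c, z, I) - before

print(removal_gain)
print(addition_gain)
```

All recorded gains are nonnegative up to rounding, in line with the two propositions; a removal that does not target a worst child carries no such guarantee, as Example 2 shows.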