Done reread maximizingpagerankviaoutlinks(2)



Linear Algebra and its Applications 429 (2008) 1254–1276

Maximizing PageRank via outlinks

Cristobald de Kerchove∗, Laure Ninove, Paul van Dooren

CESAME, Université catholique de Louvain, Avenue Georges Lemaître 4-6, B-1348 Louvain-la-Neuve, Belgium

Received 30 October 2006; accepted 18 January 2008. Available online 10 March 2008. Submitted by A. Ran.

∗ Corresponding author.
0024-3795/$ - see front matter. doi:10.1016/j.laa.2008.01.023

Abstract

We analyze linkage strategies for a set $\mathcal{I}$ of webpages for which the webmaster wants to maximize the sum of Google's PageRank scores. The webmaster can only choose the hyperlinks starting from the webpages of $\mathcal{I}$ and has no control over the hyperlinks from other webpages. We provide an optimal linkage strategy under some reasonable assumptions.

© 2008 Elsevier Inc. All rights reserved.

AMS classification: 15A18; 15A48; 15A51; 60J15; 68U35

Keywords: PageRank; Google matrix; Markov chain; Perron vector; Optimal linkage strategy

1. Introduction

PageRank, a measure of webpages' relevance introduced by Brin and Page, is at the heart of the well-known search engine Google [6,15]. Google classifies the webpages according to the pertinence scores given by PageRank, which are computed from the graph structure of the Web. A page with a high PageRank will appear among the first items in the list of pages corresponding to a particular query.

Given the popularity of Google, it is not surprising that some webmasters want to increase the PageRank of their webpages in order to get more visits from websurfers to their website. Since PageRank is based on the link structure of the Web, it is useful to understand how the addition or deletion of hyperlinks influences it.

Mathematical analysis of PageRank's sensitivity with respect to perturbations of the matrix describing the webgraph is a topical subject of interest (see for instance [2,5,11,12,13,14] and the references therein). Normwise and componentwise conditioning bounds [11] as well as the derivative [12,13] are used to understand the sensitivity of the PageRank vector. It appears that the PageRank vector is relatively insensitive to small changes in the graph structure, at least when these changes concern webpages with a low PageRank score [5,12]. One could therefore think that trying to modify one's PageRank via changes in the link structure of the Web is a waste of time. However, what is important for webmasters is not the values of the PageRank vector but the ranking that ensues from it. Lempel and Moran [14] showed that PageRank is not rank-stable, i.e. small modifications in the link structure of the webgraph may cause dramatic changes in the ranking of the webpages. Therefore, the question of how the PageRank of a particular page or set of pages could be increased, even slightly, by adding or removing links to the webgraph remains of interest.

As is well known [1,9], if a hyperlink from a page $i$ to a page $j$ is added, with no other modification in the Web, then the PageRank of $j$ will increase. But in general, you do not have control over the inlinks of your webpage unless you pay another webmaster to add a hyperlink from his/her page to yours, or you make an alliance with him/her by trading a link for a link [3,8]. It is therefore natural to ask how you could modify your PageRank by yourself. This leads us to analyze how the choice of the outlinks of a page can influence its own PageRank. Sydow [17] showed via numerical simulations that adding well-chosen outlinks to a webpage may significantly increase its PageRank ranking.
Avrachenkov and Litvak [2] analyzed theoretically the possible effect of new outlinks on the PageRank of a page and its neighbors. Supposing that a webpage has control only on its outlinks, they gave the optimal linkage strategy for this single page. Bianchini et al. [5] as well as Avrachenkov and Litvak in [1] consider the impact of links between web communities (websites or sets of related webpages), respectively on the sum of the PageRanks and on the individual PageRank scores of the pages of some community. They give general rules in order to have a PageRank as high as possible, but they do not provide an optimal link structure for a website.

Our aim in this paper is to find a generalization of Avrachenkov–Litvak's optimal linkage strategy [2] to the case of a website with several pages. We consider a given set of pages and suppose we only have control over the outlinks of these pages. We are interested in the problem of maximizing the sum of the PageRanks of these pages.

Let $\mathcal{G} = (\mathcal{N}, \mathcal{E})$ be the webgraph, with a set of nodes $\mathcal{N} = \{1,\dots,n\}$ and a set of links $\mathcal{E} \subseteq \mathcal{N} \times \mathcal{N}$. For a subset of nodes $\mathcal{I} \subseteq \mathcal{N}$, with complement $\bar{\mathcal{I}} = \mathcal{N} \setminus \mathcal{I}$, we define

  • $\mathcal{E}_{\mathcal{I}} = \{(i,j) \in \mathcal{E} : i,j \in \mathcal{I}\}$, the set of internal links,
  • $\mathcal{E}_{out(\mathcal{I})} = \{(i,j) \in \mathcal{E} : i \in \mathcal{I},\ j \notin \mathcal{I}\}$, the set of external outlinks,
  • $\mathcal{E}_{in(\mathcal{I})} = \{(i,j) \in \mathcal{E} : i \notin \mathcal{I},\ j \in \mathcal{I}\}$, the set of external inlinks,
  • $\mathcal{E}_{\bar{\mathcal{I}}} = \{(i,j) \in \mathcal{E} : i,j \notin \mathcal{I}\}$, the set of external links.

If we do not impose any condition on $\mathcal{E}_{\mathcal{I}}$ and $\mathcal{E}_{out(\mathcal{I})}$, the problem of maximizing the sum of the PageRanks of pages of $\mathcal{I}$ is quite trivial and does not have much interest (see the discussion in Section 4). Therefore, when characterizing optimal link structures, we will make the following accessibility assumption: every page of the website must have an access to the rest of the Web.

Our first main result concerns the optimal outlink structure for a given website. In the case where the subgraph corresponding to the website is strongly connected, Theorem 10 can be particularized as follows.
Theorem 1. Let $\mathcal{E}_{\mathcal{I}}$, $\mathcal{E}_{in(\mathcal{I})}$ and $\mathcal{E}_{\bar{\mathcal{I}}}$ be given. Suppose that the subgraph $(\mathcal{I}, \mathcal{E}_{\mathcal{I}})$ is strongly connected and $\mathcal{E}_{\mathcal{I}} \neq \emptyset$. Then every optimal outlink structure $\mathcal{E}_{out(\mathcal{I})}$ is to have only one outlink to a particular page outside of $\mathcal{I}$.

We are also interested in the optimal internal link structure for a website. In the case where there is a unique leaking node in the website, that is only one node linking to the rest of the web, Theorem 11 can be particularized as follows.

Theorem 2. Let $\mathcal{E}_{out(\mathcal{I})}$, $\mathcal{E}_{in(\mathcal{I})}$ and $\mathcal{E}_{\bar{\mathcal{I}}}$ be given. Suppose that there is only one leaking node in $\mathcal{I}$. Then every optimal internal link structure $\mathcal{E}_{\mathcal{I}}$ is composed of a forward chain of links together with every possible backward link.

Putting together Theorems 10 and 11, we get in Theorem 12 the optimal link structure for a website. This optimal structure is illustrated in Fig. 1.

Fig. 1. Every optimal linkage strategy for a set $\mathcal{I}$ of five pages must have this structure.

Theorem 3. Let $\mathcal{E}_{in(\mathcal{I})}$ and $\mathcal{E}_{\bar{\mathcal{I}}}$ be given. Then, for every optimal link structure, $\mathcal{E}_{\mathcal{I}}$ is composed of a forward chain of links together with every possible backward link, and $\mathcal{E}_{out(\mathcal{I})}$ consists of a unique outlink, starting from the last node of the chain.

This paper is organized as follows. In the following preliminary section, we recall some graph concepts as well as the definition of the PageRank, and we introduce some notations. In Section 3, we develop tools for analysing the PageRank of a set of pages $\mathcal{I}$. Then we come to the main part of this paper: in Section 4 we provide the optimal linkage strategy for a set of nodes. In Section 5, we give some extensions and variants of the main theorems. We end this paper with some concluding remarks.

2. Graphs and PageRank

Let $\mathcal{G} = (\mathcal{N}, \mathcal{E})$ be a directed graph representing the Web. The webpages are represented by the set of nodes $\mathcal{N} = \{1,\dots,n\}$ and the hyperlinks are represented by the set of directed links $\mathcal{E} \subseteq \mathcal{N} \times \mathcal{N}$.
That means that $(i,j) \in \mathcal{E}$ if and only if there exists a hyperlink linking page $i$ to page $j$.

Let us first briefly recall some usual concepts about directed graphs (see for instance [4]). A link $(i,j)$ is said to be an outlink for node $i$ and an inlink for node $j$. If $(i,j) \in \mathcal{E}$, node $i$ is called a parent of node $j$. By $j \leftarrow i$, we mean that $j$ belongs to the set of children of $i$, that is $j \in \{k \in \mathcal{N} : (i,k) \in \mathcal{E}\}$. The outdegree $d_i$ of a node $i$ is its number of children, that is

\[ d_i = |\{j \in \mathcal{N} : (i,j) \in \mathcal{E}\}|. \]

A path from $i_0$ to $i_s$ is a sequence of nodes $\langle i_0, i_1, \dots, i_s \rangle$ such that $(i_k, i_{k+1}) \in \mathcal{E}$ for every $k = 0,1,\dots,s-1$. A node $i$ has an access to a node $j$ if there exists a path from $i$ to $j$. In this paper, we will also say that a node $i$ has an access to a set $\mathcal{J}$ if $i$ has an access to at least one node $j \in \mathcal{J}$. The graph $\mathcal{G}$ is strongly connected if every node of $\mathcal{N}$ has an access to every other node of $\mathcal{N}$. A set of nodes $\mathcal{F} \subseteq \mathcal{N}$ is a final class of the graph $\mathcal{G} = (\mathcal{N}, \mathcal{E})$ if the subgraph $(\mathcal{F}, \mathcal{E}_{\mathcal{F}})$ is strongly connected and moreover $\mathcal{E}_{out(\mathcal{F})} = \emptyset$ (i.e. nodes of $\mathcal{F}$ do not have an access to $\mathcal{N} \setminus \mathcal{F}$).

Let us now briefly introduce the PageRank score (see [5,6,12,13,15] for background). Without loss of generality (please refer to the book of Langville and Meyer [13] or the survey of Bianchini et al. [5] for details), we can make the assumption that each node has at least one outlink, i.e. $d_i \neq 0$ for every $i \in \mathcal{N}$. Therefore the $n \times n$ stochastic matrix $P = [P_{ij}]_{i,j \in \mathcal{N}}$ given by

\[ P_{ij} = \begin{cases} d_i^{-1} & \text{if } (i,j) \in \mathcal{E}, \\ 0 & \text{otherwise} \end{cases} \]

is well defined and is a scaling of the adjacency matrix of $\mathcal{G}$. Let also $0 < c < 1$ be a damping factor and $z$ be a positive stochastic personalization vector, i.e. $z_i > 0$ for all $i = 1,\dots,n$ and $z^T \mathbf{1} = 1$, where $\mathbf{1}$ denotes the vector of all ones. The Google matrix is then defined as

\[ G = cP + (1-c)\mathbf{1}z^T. \]

Since $z > 0$ and $c < 1$, this stochastic matrix is positive, i.e. $G_{ij} > 0$ for all $i,j$.
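To make the definitions of $P$ and $G$ concrete, here is a minimal Python sketch. The four-node edge list, the 0-based node numbering, the uniform choice of $z$ and the function names are illustrative assumptions, not taken from the paper.

```python
# Hypothetical 4-node webgraph (0-based node indices); not from the paper.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
n = 4
c = 0.85                 # damping factor, 0 < c < 1
z = [1.0 / n] * n        # uniform personalization vector (an assumption)


def scaled_adjacency(n, edges):
    """Return P with P[i][j] = 1/d_i if (i, j) is a link and 0 otherwise."""
    children = [[] for _ in range(n)]
    for i, j in edges:
        children[i].append(j)
    # The text assumes d_i != 0 for every node, so we enforce it here.
    if any(len(ch) == 0 for ch in children):
        raise ValueError("every node needs at least one outlink")
    P = [[0.0] * n for _ in range(n)]
    for i, ch in enumerate(children):
        for j in ch:
            P[i][j] = 1.0 / len(ch)
    return P


def google_matrix(P, c, z):
    """Return G = c P + (1 - c) 1 z^T, built entry by entry."""
    n = len(P)
    return [[c * P[i][j] + (1.0 - c) * z[j] for j in range(n)]
            for i in range(n)]


P = scaled_adjacency(n, edges)
G = google_matrix(P, c, z)

for row in G:
    # Each row sums to 1 (stochastic) and has only positive entries.
    print([round(x, 4) for x in row], "row sum:", round(sum(row), 12))
```

Because $z > 0$ and $c < 1$, every entry of `G` is strictly positive even where `P` has zeros, which is the positivity property stated above.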
The PageRank vector $\pi$ is then defined as the unique invariant measure of the matrix $G$, that is the unique left Perron vector of $G$,

\[ \pi^T = \pi^T G, \qquad \pi^T \mathbf{1} = 1. \tag{1} \]

The PageRank of a node $i$ is the $i$th entry $\pi_i = \pi^T e_i$ of the PageRank vector.

The PageRank vector is usually interpreted as the stationary distribution of the following Markov chain (see for instance [13]): a random surfer moves on the webgraph, using hyperlinks between pages with a probability $c$ and zapping to some new page according to the personalization vector with a probability $(1-c)$. The Google matrix $G$ is the probability transition matrix of this random walk. In this stochastic interpretation, the PageRank of a node is equal to the inverse of its mean return time, that is $\pi_i^{-1}$ is the mean number of steps a random surfer starting in node $i$ will take for coming back to $i$ (see [7,10]).

3. PageRank of a website

We are interested in characterizing the PageRank of a set $\mathcal{I}$. We define this as the sum

\[ \pi^T e_{\mathcal{I}} = \sum_{i \in \mathcal{I}} \pi_i, \]

where $e_{\mathcal{I}}$ denotes the vector with a 1 in the entries of $\mathcal{I}$ and 0 elsewhere. Note that the PageRank of a set corresponds to the notion of energy of a community in [5].

Let $\mathcal{I} \subseteq \mathcal{N}$ be a subset of the nodes of the graph. The PageRank of $\mathcal{I}$ can be expressed as $\pi^T e_{\mathcal{I}} = (1-c) z^T (I - cP)^{-1} e_{\mathcal{I}}$ from the PageRank equations (1), where $I$ denotes the identity matrix. Let us then define the vector

\[ v = (I - cP)^{-1} e_{\mathcal{I}}. \tag{2} \]

With this, we have the following expression for the PageRank of the set $\mathcal{I}$:

\[ \pi^T e_{\mathcal{I}} = (1-c) z^T v. \tag{3} \]

The vector $v$ will play a crucial role throughout this paper. In this section, we will first present a probabilistic interpretation for this vector and prove some of its properties. We will then show how it can be used in order to analyze the influence of some page $i \in \mathcal{I}$ on the PageRank of the set $\mathcal{I}$. We will end this section by briefly introducing the concept of basic absorbing graph, which will be useful in order to analyze optimal linkage strategies under some assumptions.

3.1. Mean number of visits before zapping

Let us first see how the entries of the vector $v = (I - cP)^{-1} e_{\mathcal{I}}$ can be interpreted. Let us consider a random surfer on the webgraph $\mathcal{G}$ that, as described in Section 2, follows the hyperlinks of the webgraph with a probability $c$. But, instead of zapping to some page of $\mathcal{G}$ with a probability $(1-c)$, he stops his walk with probability $(1-c)$ at each step of time. This is equivalent to considering a random walk on the extended graph $\mathcal{G}_e = (\mathcal{N} \cup \{n+1\},\ \mathcal{E} \cup \{(i,n+1) : i \in \mathcal{N}\})$ with a transition probability matrix

\[ P_e = \begin{pmatrix} cP & (1-c)\mathbf{1} \\ 0 & 1 \end{pmatrix}. \]

At each step of time, with probability $1-c$, the random surfer can disappear from the original graph, that is he can reach the absorbing node $n+1$.
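The relation between Eqs. (1), (2) and (3) can be checked numerically. The Python sketch below is only an illustration under stated assumptions: the four-node matrix `P`, the set `I`, the uniform `z` and the helper names are hypothetical. It approximates $v$ by iterating $v \leftarrow cPv + e_{\mathcal{I}}$, whose $k$th iterate counts the expected visits to $\mathcal{I}$ during the first $k$ steps of the stopped walk, and approximates $\pi$ by power iteration.

```python
c = 0.85                          # damping factor
I = {0, 1}                        # hypothetical set of pages (0-based)
# Hypothetical scaled adjacency matrix: 0->{1,2}, 1->{2}, 2->{0}, 3->{2}.
P = [[0.0, 0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
n = len(P)
z = [1.0 / n] * n                 # uniform personalization (an assumption)


def visits_before_zapping(P, c, I, iters=2000):
    """Approximate v = (I - cP)^{-1} e_I via the iteration v <- c P v + e_I."""
    n = len(P)
    e_I = [1.0 if i in I else 0.0 for i in range(n)]
    v = [0.0] * n
    for _ in range(iters):
        v = [c * sum(P[i][j] * v[j] for j in range(n)) + e_I[i]
             for i in range(n)]
    return v


def pagerank(P, c, z, iters=2000):
    """Power iteration for pi^T = pi^T G, using pi^T 1 = 1 to expand G."""
    n = len(P)
    pi = list(z)
    for _ in range(iters):
        pi = [c * sum(pi[i] * P[i][j] for i in range(n)) + (1.0 - c) * z[j]
              for j in range(n)]
    return pi


v = visits_before_zapping(P, c, I)
pi = pagerank(P, c, z)

lhs = sum(pi[i] for i in I)                          # pi^T e_I
rhs = (1.0 - c) * sum(z[i] * v[i] for i in range(n))  # (1 - c) z^T v
print("PageRank of I:", lhs, "  (1 - c) z^T v:", rhs)
```

Both iterations converge geometrically at rate $c$, so a fixed iteration count is enough for a toy graph; a linear solver would be the natural choice for anything larger.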
The nonnegative matrix $(I - cP)^{-1}$ is commonly called the fundamental matrix of the absorbing Markov chain defined by $P_e$ (see for instance [10,16]). In the extended graph $\mathcal{G}_e$, the entry $[(I - cP)^{-1}]_{ij}$ is the expected number of visits to node $j$ before reaching the absorbing node $n+1$ when starting from node $i$. From the point of view of the standard random surfer described in Section 2, the entry $[(I - cP)^{-1}]_{ij}$ is the expected number of visits to node $j$ before zapping for the first time when starting from node $i$.

Therefore, the vector $v$ defined in Eq. (2) has the following probabilistic interpretation. The entry $v_i$ is the expected number of visits to the set $\mathcal{I}$ before zapping for the first time when the random surfer starts his walk in node $i$.

Now, let us first prove some simple properties about this vector.

Lemma 1. Let $v \in \mathbb{R}^n_{\geq 0}$ be defined by $v = cPv + e_{\mathcal{I}}$. Then,
(a) $\max_{i \notin \mathcal{I}} v_i \leq c \max_{i \in \mathcal{I}} v_i$;
(b) $v_i \leq 1 + c v_i$ for all $i \in \mathcal{N}$, with equality if and only if the node $i$ does not have an access to $\bar{\mathcal{I}}$;
(c) $v_i \geq \min_{j \leftarrow i} v_j$ for all $i \in \mathcal{I}$, with equality if and only if the node $i$ does not have an access to $\bar{\mathcal{I}}$.

Proof. (a) For all $i \notin \mathcal{I}$, we have $v_i = c \sum_{j \leftarrow i} v_j / d_i$, so

\[ \max_{i \notin \mathcal{I}} v_i = \max_{i \notin \mathcal{I}} \Big( c \sum_{j \leftarrow i} \frac{v_j}{d_i} \Big) \leq c \max_j v_j. \]

Since $c < 1$, it then follows that $\max_j v_j = \max_{i \in \mathcal{I}} v_i$.

(b) The inequality $v_i \leq \frac{1}{1-c}$, which is equivalent to $v_i \leq 1 + c v_i$, follows directly from

\[ \max_i v_i \leq \max_i \Big( 1 + c \sum_{j \leftarrow i} \frac{v_j}{d_i} \Big) \leq 1 + c \max_j v_j. \]

From (a) it then also follows that $v_i \leq \frac{c}{1-c}$ for all $i \notin \mathcal{I}$. Now, let $i \in \mathcal{N}$ be such that $v_i = \frac{1}{1-c}$. Then $i \in \mathcal{I}$. Moreover,

\[ 1 + c v_i = v_i = 1 + c \sum_{j \leftarrow i} \frac{v_j}{d_i}, \]

that is $v_j = \frac{1}{1-c}$ for every $j \leftarrow i$. Hence node $j$ must also belong to $\mathcal{I}$. By induction, every node $k$ such that $i$ has an access to $k$ must belong to $\mathcal{I}$.

(c) Let $i \in \mathcal{I}$. Then, by (b),

\[ 1 + c v_i \geq v_i = 1 + c \sum_{j \leftarrow i} \frac{v_j}{d_i} \geq 1 + c \min_{j \leftarrow i} v_j, \]

so $v_i \geq \min_{j \leftarrow i} v_j$ for all $i \in \mathcal{I}$. If $v_i = \min_{j \leftarrow i} v_j$ then also $1 + c v_i = v_i$ and hence, by (b), the node $i$ does not have an access to $\bar{\mathcal{I}}$. □

Let us denote the set of nodes of $\bar{\mathcal{I}}$ which on average give the most visits to $\mathcal{I}$ before zapping by

\[ \mathcal{V} = \operatorname{argmax}_{j \in \bar{\mathcal{I}}} v_j. \]

Then the following lemma is quite intuitive. It says that, among the nodes of $\bar{\mathcal{I}}$, those which provide the highest mean number of visits to $\mathcal{I}$ are parents of $\mathcal{I}$, i.e. parents of some node of $\mathcal{I}$.

Lemma 2 (Parents of $\mathcal{I}$). If $\mathcal{E}_{in(\mathcal{I})} \neq \emptyset$, then

\[ \mathcal{V} \subseteq \{ j \in \bar{\mathcal{I}} : \text{there exists } l \in \mathcal{I} \text{ such that } (j,l) \in \mathcal{E}_{in(\mathcal{I})} \}. \]

If $\mathcal{E}_{in(\mathcal{I})} = \emptyset$, then $v_j = 0$ for every $j \in \bar{\mathcal{I}}$.

Proof. Suppose first that $\mathcal{E}_{in(\mathcal{I})} \neq \emptyset$. Let $k \in \mathcal{V}$ with $v = (I - cP)^{-1} e_{\mathcal{I}}$. If we supposed that there does not exist $l \in \mathcal{I}$ such that $(k,l) \in \mathcal{E}_{in(\mathcal{I})}$, then we would have, since $v_k > 0$,

\[ v_k = c \sum_{j \leftarrow k} \frac{v_j}{d_k} \leq c \max_{j \notin \mathcal{I}} v_j = c v_k < v_k, \]

which is a contradiction. Now, if $\mathcal{E}_{in(\mathcal{I})} = \emptyset$, then there is no access to $\mathcal{I}$ from $\bar{\mathcal{I}}$, so clearly $v_j = 0$ for every $j \in \bar{\mathcal{I}}$. □

Lemma 2 shows that the nodes $j \in \bar{\mathcal{I}}$ which provide the highest value of $v_j$ must belong to the set of parents of $\mathcal{I}$. The converse is not true, as we will see in the following example: some parents of $\mathcal{I}$ can provide a lower mean number of visits to $\mathcal{I}$ than other nodes which are not parents of $\mathcal{I}$. In other words, Lemma 2 gives a necessary but not sufficient condition in order to maximize the entry $v_j$ for some $j \in \bar{\mathcal{I}}$.

Example 1. Let us see on an example that having $(j,i) \in \mathcal{E}_{in(\mathcal{I})}$ for some $i \in \mathcal{I}$ is not sufficient to have $j \in \mathcal{V}$. Consider the graph in Fig. 2. Let $\mathcal{I} = \{1\}$ and take a damping factor $c = 0.85$. For $v = (I - cP)^{-1} e_1$, we have

\[ v_2 = v_3 = v_4 = 4.359 > v_5 = 3.521 > v_6 = 3.492 > v_7 > \cdots > v_{11}, \]

so $\mathcal{V} = \{2,3,4\}$. As ensured by Lemma 2, every node of the set $\mathcal{V}$ is a parent of node 1. But here, $\mathcal{V}$ does not contain all parents of node 1. Indeed, the node $6 \notin \mathcal{V}$ while it is a parent of 1 and is moreover its parent with the lowest outdegree.
Moreover, we see in this example that node 5, which is not a parent of node 1 but a parent of node 6, gives a higher value of the expected number of visits to $\mathcal{I}$ before zapping than node 6, parent of 1. Let us try to get some intuition about that. When starting from node 6, a random surfer has probability one half to reach node 1 in only one step. But he also has probability one half to move to node 11 and to be sent far away from node 1. On the other side, when starting from node 5, the random surfer cannot reach node 1 in only one step. But with probability 3/4 he will reach one of the nodes 2, 3 or 4 in one step. And from these nodes, the websurfer stays very near to node 1 and cannot be sent far away from it.

Fig. 2. The node $6 \notin \mathcal{V}$ and yet it is a parent of $\mathcal{I} = \{1\}$ (see Example 1).

In the next lemma, we show that from some node $i \in \mathcal{I}$ which has an access to $\bar{\mathcal{I}}$, there always exists what we call a decreasing path to $\bar{\mathcal{I}}$. That is, we can find a path such that the mean number of visits to $\mathcal{I}$ is higher when starting from some node of the path than when starting from the successor of this node in the path.

Lemma 3 (Decreasing paths to $\bar{\mathcal{I}}$). For every $i_0 \in \mathcal{I}$ which has an access to $\bar{\mathcal{I}}$, there exists a path $\langle i_0, i_1, \dots, i_s \rangle$ with $i_1, \dots, i_{s-1} \in \mathcal{I}$ and $i_s \in \bar{\mathcal{I}}$ such that

\[ v_{i_0} > v_{i_1} > \cdots > v_{i_s}. \]

Proof. Let us simply construct a decreasing path recursively by $i_{k+1} \in \operatorname{argmin}_{j \leftarrow i_k} v_j$, as long as $i_k \in \mathcal{I}$. If $i_k$ has an access to $\bar{\mathcal{I}}$, then $v_{i_{k+1}} < v_{i_k} < \frac{1}{1-c}$ by Lemma 1(b) and (c), so the node $i_{k+1}$ also has an access to $\bar{\mathcal{I}}$. By assumption, $i_0$ has an access to $\bar{\mathcal{I}}$. Moreover, the set $\mathcal{I}$ has a finite number of elements, so there must exist an $s$ such that $i_s \in \bar{\mathcal{I}}$. □

3.2. Influence of the outlinks of a node

We will now see how a modification of the outlinks of some node $i \in \mathcal{N}$ can change the PageRank of a subset of nodes $\mathcal{I} \subseteq \mathcal{N}$. So we will compare two graphs on $\mathcal{N}$ defined by their set of links, $\mathcal{E}$ and $\tilde{\mathcal{E}}$, respectively.

Every item corresponding to the graph defined by the set of links $\tilde{\mathcal{E}}$ will be written with a tilde symbol. So $\tilde{P}$ denotes its scaled adjacency matrix, $\tilde{\pi}$ the corresponding PageRank vector, $\tilde{d}_i = |\{j : (i,j) \in \tilde{\mathcal{E}}\}|$ the outdegree of some node $i$ in this graph, $\tilde{v} = (I - c\tilde{P})^{-1} e_{\mathcal{I}}$ and $\tilde{\mathcal{V}} = \operatorname{argmax}_{j \in \bar{\mathcal{I}}} \tilde{v}_j$. Finally, by $j \mathrel{\tilde{\leftarrow}} i$ we mean $j \in \{k : (i,k) \in \tilde{\mathcal{E}}\}$.

So, let us consider two graphs defined, respectively, by their set of links $\mathcal{E}$ and $\tilde{\mathcal{E}}$. Suppose that they differ only in the links starting from some given node $i$, that is $\{j : (k,j) \in \mathcal{E}\} = \{j : (k,j) \in \tilde{\mathcal{E}}\}$ for all $k \neq i$. Then their scaled adjacency matrices $P$ and $\tilde{P}$ are linked by a rank one correction. Let us then define the vector

\[ \delta = \sum_{j \mathrel{\tilde{\leftarrow}} i} \frac{e_j}{\tilde{d}_i} - \sum_{j \leftarrow i} \frac{e_j}{d_i}, \]

which gives the correction to apply to the line $i$ of the matrix $P$ in order to get $\tilde{P}$.

Now let us first express the difference between the PageRank of $\mathcal{I}$ for two configurations differing only in the links starting from some node $i$. Note that in the following lemma the personalization vector $z$ does not appear explicitly in the expression of $\tilde{\pi}$.

Lemma 4. Let two graphs be defined respectively by $\mathcal{E}$ and $\tilde{\mathcal{E}}$ and let $i \in \mathcal{N}$ be such that for all $k \neq i$, $\{j : (k,j) \in \mathcal{E}\} = \{j : (k,j) \in \tilde{\mathcal{E}}\}$. Then

\[ \tilde{\pi}^T e_{\mathcal{I}} = \pi^T e_{\mathcal{I}} + c\,\pi_i\, \frac{\delta^T v}{1 - c\,\delta^T (I - cP)^{-1} e_i}. \]

Proof. Clearly, the scaled adjacency matrices are linked by $\tilde{P} = P + e_i \delta^T$. Since $c < 1$, the matrix $(I - cP)^{-1}$ exists and the PageRank vectors can be expressed as

\[ \pi^T = (1-c) z^T (I - cP)^{-1}, \qquad \tilde{\pi}^T = (1-c) z^T (I - c(P + e_i \delta^T))^{-1}. \]

Applying the Sherman–Morrison formula to $((I - cP) - c\,e_i \delta^T)^{-1}$, we get

\[ \tilde{\pi}^T = (1-c) z^T (I - cP)^{-1} + (1-c) z^T (I - cP)^{-1} e_i \, \frac{c\,\delta^T (I - cP)^{-1}}{1 - c\,\delta^T (I - cP)^{-1} e_i}, \]

and the result follows immediately, since $\pi_i = (1-c) z^T (I - cP)^{-1} e_i$ and $v = (I - cP)^{-1} e_{\mathcal{I}}$. □

Let us now give an equivalent condition in order to increase the PageRank of $\mathcal{I}$ by changing outlinks of some node $i$. The PageRank of $\mathcal{I}$ increases essentially when the new set of links favors nodes giving a higher mean number of visits to $\mathcal{I}$ before zapping.

Theorem 5 (PageRank and mean number of visits before zapping). Let two graphs be defined respectively by $\mathcal{E}$ and $\tilde{\mathcal{E}}$ and let $i \in \mathcal{N}$ be such that for all $k \neq i$, $\{j : (k,j) \in \mathcal{E}\} = \{j : (k,j) \in \tilde{\mathcal{E}}\}$. Then

\[ \tilde{\pi}^T e_{\mathcal{I}} > \pi^T e_{\mathcal{I}} \quad \text{if and only if} \quad \delta^T v > 0, \]

and $\tilde{\pi}^T e_{\mathcal{I}} = \pi^T e_{\mathcal{I}}$ if and only if $\delta^T v = 0$.

Proof. Let us first show that $\delta^T (I - cP)^{-1} e_i \leq 1$ is always verified. Let $u = (I - cP)^{-1} e_i$. Then $u - cPu = e_i$ and, by Lemma 1(a), $u_j \leq u_i$ for all $j$. So

\[ \delta^T u = \sum_{j \mathrel{\tilde{\leftarrow}} i} \frac{u_j}{\tilde{d}_i} - \sum_{j \leftarrow i} \frac{u_j}{d_i} \leq u_i - \sum_{j \leftarrow i} \frac{u_j}{d_i} \leq u_i - c \sum_{j \leftarrow i} \frac{u_j}{d_i} = 1. \]

Hence the denominator $1 - c\,\delta^T (I - cP)^{-1} e_i$ in Lemma 4 is positive. Now, since $c < 1$ and $\pi > 0$, the conclusion follows by Lemma 4.
□

The following Proposition 6 shows how to add a new link $(i,j)$ starting from a given node $i$ in order to increase the PageRank of the set $\mathcal{I}$. The PageRank of $\mathcal{I}$ increases as soon as a node $i \in \mathcal{I}$ adds a link to a node $j$ with a larger or equal expected number of visits to $\mathcal{I}$ before zapping.

Proposition 6 (Adding a link). Let $i \in \mathcal{I}$ and let $j \in \mathcal{N}$ be such that $(i,j) \notin \mathcal{E}$ and $v_i \leq v_j$. Let $\tilde{\mathcal{E}} = \mathcal{E} \cup \{(i,j)\}$. Then

\[ \tilde{\pi}^T e_{\mathcal{I}} \geq \pi^T e_{\mathcal{I}} \]

with equality if and only if the node $i$ does not have an access to $\bar{\mathcal{I}}$.

Proof. Let $i \in \mathcal{I}$ and let $j \in \mathcal{N}$ be such that $(i,j) \notin \mathcal{E}$ and $v_i \leq v_j$. Then

\[ 1 + c \sum_{k \leftarrow i} \frac{v_k}{d_i} = v_i \leq 1 + c v_i \leq 1 + c v_j \]

with equality if and only if $i$ does not have an access to $\bar{\mathcal{I}}$ by Lemma 1(b). Let $\tilde{\mathcal{E}} = \mathcal{E} \cup \{(i,j)\}$. Then

\[ \delta^T v = \frac{1}{d_i + 1} \Big( v_j - \sum_{k \leftarrow i} \frac{v_k}{d_i} \Big) \geq 0 \]

with equality if and only if $i$ does not have an access to $\bar{\mathcal{I}}$. The conclusion follows from Theorem 5. □

Fig. 3. For $\mathcal{I} = \{1,2,3\}$, removing link $(1,2)$ gives $\tilde{\pi}^T e_{\mathcal{I}} < \pi^T e_{\mathcal{I}}$, even if $v_1 > v_2$ (see Example 2).

Now let us see how to remove a link $(i,j)$ starting from a given node $i$ in order to increase the PageRank of the set $\mathcal{I}$. If a node $i \in \mathcal{N}$ removes a link to its worst child from the point of view of the expected number of visits to $\mathcal{I}$ before zapping, then the PageRank of $\mathcal{I}$ increases.

Proposition 7 (Removing a link). Let $i \in \mathcal{N}$ and let $j \in \operatorname{argmin}_{k \leftarrow i} v_k$. Let $\tilde{\mathcal{E}} = \mathcal{E} \setminus \{(i,j)\}$. Then

\[ \tilde{\pi}^T e_{\mathcal{I}} \geq \pi^T e_{\mathcal{I}} \]

with equality if and only if $v_k = v_j$ for every $k$ such that $(i,k) \in \mathcal{E}$.

Proof. Let $i \in \mathcal{N}$ and let $j \in \operatorname{argmin}_{k \leftarrow i} v_k$. Let $\tilde{\mathcal{E}} = \mathcal{E} \setminus \{(i,j)\}$. Then

\[ \delta^T v = \sum_{k \leftarrow i} \frac{v_k - v_j}{d_i (d_i - 1)} \geq 0 \]

with equality if and only if $v_k = v_j$ for all $k \leftarrow i$. The conclusion follows by Theorem 5. □

In order to increase the PageRank of $\mathcal{I}$ with a new link $(i,j)$, Proposition 6 only requires that $v_i \leq v_j$. On the other side, Proposition 7 requires that $v_j = \min_{k \leftarrow i} v_k$ in order to increase the PageRank of $\mathcal{I}$ by deleting link $(i,j)$. One could wonder whether or not this condition could be weakened to $v_j < v_i$, so as to have symmetric conditions for the addition or deletion of links. In fact, this cannot be done, as shown in the following example.

Example 2. Let us see by an example that the condition $j \in \operatorname{argmin}_{k \leftarrow i} v_k$ in Proposition 7 cannot be weakened to $v_j < v_i$. Consider the graph in Fig. 3 and take a damping factor $c = 0.85$. Let $\mathcal{I} = \{1,2,3\}$. We have

\[ v_1 = 2.63 > v_2 = 2.303 > v_3 = 1.533. \]

As ensured by Proposition 7, if we remove the link $(1,3)$, the PageRank of $\mathcal{I}$ increases (e.g. from 0.199 to 0.22 with a uniform personalization vector $z = \frac{1}{n}\mathbf{1}$), since $3 \in \operatorname{argmin}_{k \leftarrow 1} v_k$. But, if we remove instead the link $(1,2)$, the PageRank of $\mathcal{I}$ decreases (from 0.199 to 0.179 with $z$ uniform) even if $v_2 < v_1$.

Remark 1. Let us note that, if the node $i$ does not have an access to the set $\bar{\mathcal{I}}$, then for every deletion of a link starting from $i$, the PageRank of $\mathcal{I}$ will not be modified. Indeed, in this case $\delta^T v = 0$ since by Lemma 1(b), $v_j = \frac{1}{1-c}$ for every $j \leftarrow i$.
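The update formula of Lemma 4 and the link-removal rule of Proposition 7 can be checked numerically on a toy graph. In the Python sketch below, the five-node graph, the set `I`, the uniform `z` and all function names are hypothetical choices made for illustration; the graph is not the one of Fig. 3. The script removes the link from node 0 to its child with the smallest $v$, then compares the PageRank of the set predicted by Lemma 4 with a direct recomputation.

```python
c = 0.85
I = {0, 1, 2}     # hypothetical website (0-based node labels)
# Hypothetical outlink lists; every node keeps at least one outlink.
children = {0: [1, 2, 3], 1: [0], 2: [0, 4], 3: [0, 4], 4: [3]}
n = len(children)
z = [1.0 / n] * n  # uniform personalization (an assumption)


def scaled_adjacency(children):
    """P[i][j] = 1/d_i if i links to j, else 0."""
    P = [[0.0] * n for _ in range(n)]
    for i, ch in children.items():
        for j in ch:
            P[i][j] = 1.0 / len(ch)
    return P


def resolvent_times(P, b, iters=2000):
    """Approximate (I - cP)^{-1} b by iterating x <- c P x + b."""
    x = [0.0] * n
    for _ in range(iters):
        x = [c * sum(P[r][s] * x[s] for s in range(n)) + b[r]
             for r in range(n)]
    return x


def pagerank(P, iters=2000):
    """Approximate pi^T = (1 - c) z^T (I - cP)^{-1} by power iteration."""
    pi = list(z)
    for _ in range(iters):
        pi = [c * sum(pi[r] * P[r][s] for r in range(n)) + (1.0 - c) * z[s]
              for s in range(n)]
    return pi


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


P = scaled_adjacency(children)
e_I = [1.0 if k in I else 0.0 for k in range(n)]
v = resolvent_times(P, e_I)
pi = pagerank(P)

i = 0
j = min(children[i], key=lambda k: v[k])      # worst child of i w.r.t. v
new_children = dict(children)
new_children[i] = [k for k in children[i] if k != j]
P_new = scaled_adjacency(new_children)

delta = [P_new[i][k] - P[i][k] for k in range(n)]   # row-i correction
e_i = [1.0 if k == i else 0.0 for k in range(n)]
u = resolvent_times(P, e_i)

before = dot(pi, e_I)
predicted = before + c * pi[i] * dot(delta, v) / (1.0 - c * dot(delta, u))
after = dot(pagerank(P_new), e_I)
print("removed link:", (i, j), " delta^T v =", dot(delta, v))
print("before:", before, " predicted:", predicted, " recomputed:", after)
```

The sign of `dot(delta, v)` alone decides whether the PageRank of the set goes up, as in Theorem 5; the denominator only rescales the change.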