
Linear Algebra in Use: Ranking Web Pages with an Eigenvector

Maia Bittner, Yifei Feng

Abstract—Google PageRank is an algorithm that uses the underlying, hyperlinked structure of the web to determine the theoretical number of times a random web surfer would visit each page. Google converts these numbers into probabilities, and uses these probabilities as each web page's relative importance. Then, the most important pages for a given query can be returned first in search results. In this paper, we focus on the math behind PageRank, which includes eigenvectors and Markov Chains, and on explaining how it is used to rank web pages.

Keywords—Google PageRank, Eigenvector, Eigenvalue, Markov Chain

I. INTRODUCTION

The internet is humanity's largest collection of information, and its democratic nature means that anyone can contribute more information to it. Search engines help users sort through the billions of available web pages to find the information that they are looking for. Most search engines use a two-step process to return web pages based on the user's query. The first step involves finding which of the pages the search engine has indexed are related to the query, either by containing the words in the query or by more advanced means that use semantic models. The second step is to order this list of relevant pages according to some criterion. For example, the first web search engines, launched in the early 90s, used text-based matching systems as their criterion for ordering returned results by relevancy. This ranking method often resulted in returning exact matches on unauthoritative, poorly written pages before results that the user could trust. Even worse, this system was easy for page owners to exploit: they could fill their pages with irrelevant words and phrases in the hope of ranking highly. These problems prompted researchers to investigate more advanced methods of ranking.

Larry Page and Sergey Brin were researching a new kind of search engine at Stanford University when they had the idea that pages could be ranked by link popularity. The underlying social basis of their idea was that more reputable pages are linked to by other pages more often. Page and Brin developed an algorithm that could quantify this idea; in 1998 they christened the algorithm PageRank [2] and published their first paper on it. Shortly afterwards, they founded Google, a web search engine that uses PageRank to help rank the returned results. Google's famously useful search results [3] helped it reach almost $29 billion in revenue in 2010 [1]. The original algorithm organized the indexed web pages such that the links between them are used to construct the probability of a random web surfer navigating from one page to another. This system can be characterized as a Markov Chain, a mathematical model described below, in order to take advantage of its convenient properties.

In this paper, we will explain how the interlinking structure of the web and the properties of Markov Chains can be used to quantify the relative importance of each indexed page. We will examine Markov Chains, eigenvectors, and the power iteration method, as well as some of the problems that arise when using this system to rank web pages.

A. Markov Chains

Markov chains are mathematical models that describe particular types of systems. For example, we can construe the number of students at Olin College who are sick as a Markov Chain if we know how likely it is that a student will become sick. Let us say that if a student is feeling well, she has a 5% chance of becoming sick the next day, and that if she is already sick, she has a 35% chance of feeling better tomorrow. In our example, a student can only be healthy or sick; these two states are called the state space of the system. In addition, we've decided to only ask how the students are feeling in the morning, and their health on any day only depends on how they were feeling the previous morning. This constant, discrete increase in time makes the system time-homogeneous. We can generate a set of linear equations that will describe how many students at Olin College are healthy and sick on any given day. If we let m_k indicate the number of healthy students and n_k indicate the number of sick students on morning k, then we get the following two equations:

    m_{k+1} = 0.95 m_k + 0.35 n_k
    n_{k+1} = 0.05 m_k + 0.65 n_k

Putting this system of linear equations into matrix notation, we get:

    [m_{k+1}]   [0.95  0.35] [m_k]
    [n_{k+1}] = [0.05  0.65] [n_k]                                (1)

We can take this matrix full of probabilities and call it P, the transition matrix.

    P = [0.95  0.35]
        [0.05  0.65]                                              (2)

The columns can be viewed as representing the present state of the system, and the rows can be viewed as representing the future state of the system. The first column accounts for the students who are healthy today, and the second column accounts for the students who are sick today, while the first row indicates the students who will be healthy tomorrow and the
second row indicates the students who will be sick tomorrow. The intersecting elements of the transition matrix represent the probability that a student will transition from the column state to the row state. So we can see that p_{1,1} indicates that 95% of the students who are healthy today will be healthy tomorrow. The total number of students who will be healthy tomorrow is represented by the first row: it is 95% of the students who are healthy today plus 35% of the students who are sick today. Similarly, the second row shows us the number of students who will be sick tomorrow: 5% of the students who are healthy today plus 65% of the students who are sick today.

You can see that each column sums to one, to account for 100% of the students who are healthy and 100% of the students who are sick in the current state. Square matrices like this, which have nonnegative real entries and in which every column sums to one, are called column-stochastic.

We can find the total number of students who will be in each state on day k + 1 by multiplying the transition matrix by a vector containing how many students were in each state on day k.

    P x_k = x_{k+1}                                               (3)

For example, if 270 students at Olin College are healthy today and 30 are sick, we can find the state vector for tomorrow:

    [0.95  0.35] [270]   [267]
    [0.05  0.65] [ 30] = [ 33]                                    (4)

which shows that tomorrow, 267 students will be healthy and 33 students will be sick. To find the next day's values, you can multiply again by the transition matrix:

    [0.95  0.35] [267]   [265.2]
    [0.05  0.65] [ 33] = [ 34.8]                                  (5)

which is the same as

    [0.95  0.35]^2 [270]   [265.2]
    [0.05  0.65]   [ 30] = [ 34.8]                                (6)

So we can see that in order to find m_k and n_k, we can multiply P^k by the state vector containing m_0 and n_0, as in the equation below:

    [m_k]   [0.95  0.35]^k [m_0]
    [n_k] = [0.05  0.65]   [n_0]                                  (7)

If you continue to multiply the state vector by the transition matrix for very high values of k, you will see that it eventually converges upon a steady state, regardless of initial conditions. This is represented by the vector q in Eqn. (8).

    P q = q                                                       (8)

Being able to find this steady-state vector is the main advantage of using a column-stochastic matrix to model a system. Column-stochastic transition matrices are always used to represent the known probabilities of transitioning between states in Markov Chains. To model a system as a Markov Chain, it must be a discrete-time process with a finite state space, and the probability distribution for any state must depend only on the previous state. Every situation that can be classified as a Markov Chain has these steady-state values at which the system will remain constant, regardless of the initial state. This steady-state vector is a specific example of an eigenvector, explained below.

B. Eigenvalues and Eigenvectors

An eigenvector is a nonzero vector x that, when multiplied by a matrix A, only scales in length and does not change direction, except for potentially a reversal of direction.

    A x = λ x                                                     (9)

The corresponding amount that an eigenvector is scaled by, λ, is called its eigenvalue. There are several techniques to find the eigenvalues and eigenvectors of a matrix. We will demonstrate one technique below with matrix A.

    A = [1  2  5]
        [0  3  0]
        [0  0  4]

How to find the eigenvalues for matrix A: we know that λ is an eigenvalue of A if and only if the equation

    A x = λ x                                                     (10)

has a nontrivial solution. This is equivalent to finding λ such that:

    (A − λI) x = 0                                                (11)

The above equation has a nontrivial solution when the determinant of A − λI is zero.

    det(A − λI_3) = | 1−λ   2     5  |
                    |  0   3−λ    0  |
                    |  0    0   4−λ  |
                  = (4 − λ)(1 − λ)(3 − λ) = 0

Solving for λ, we get that the eigenvalues are λ_1 = 1, λ_2 = 3, and λ_3 = 4. If we solve A v_i = λ_i v_i, we will get the corresponding eigenvector for each eigenvalue:

    v_1 = [1]    v_2 = [1]    v_3 = [5]
          [0]          [1]          [0]
          [0]          [0]          [3]

This means that if v_2 is transformed by A, the result will be v_2 scaled by its eigenvalue, 3.

You can see in Eqn. (8) that the steady state of a Markov Chain has an eigenvalue of 1. This is why those steady-state vectors are a special case of eigenvectors. Because they are column-stochastic, all transition matrices of Markov Chains will have an eigenvalue of 1 (we invite the reader to prove this in Exercise 4). A system having an eigenvalue of 1 is the same as it having a steady state.

In some matrices, we may get repeated roots when solving det(A − λI) = 0. We will demonstrate this for the column-stochastic matrix P:

    P = [0  1  0  0   0 ]
        [1  0  0  0   0 ]
        [0  0  0  1  1/2]
        [0  0  0  0  1/2]
        [0  0  1  0   0 ]
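Before working through the characteristic polynomial of this P by hand, the repeated-root behavior can be previewed numerically. The sketch below is our illustration (not from the original paper): it repeatedly applies P to two different starting distributions and settles on two different fixed vectors, which is exactly the ambiguity that a repeated eigenvalue of 1 creates.

```python
# Illustration (not from the paper): the 5-state transition matrix P
# from Section I.B has eigenvalue 1 with multiplicity 2, so repeated
# multiplication can settle on different fixed vectors depending on
# where it starts.
P = [
    [0, 1, 0, 0,   0],
    [1, 0, 0, 0,   0],
    [0, 0, 0, 1, 0.5],
    [0, 0, 0, 0, 0.5],
    [0, 0, 1, 0,   0],
]

def multiply(P, x):
    """Return the matrix-vector product P x."""
    return [sum(P[i][j] * x[j] for j in range(len(x))) for i in range(len(P))]

def iterate(x, steps=200):
    """Apply P to x repeatedly: x_k = P^k x_0."""
    for _ in range(steps):
        x = multiply(P, x)
    return [round(v, 6) for v in x]

# Start split across the first subweb (pages 1 and 2): stays at [0.5, 0.5, 0, 0, 0].
a = iterate([0.5, 0.5, 0, 0, 0])
# Start inside the second subweb (pages 3-5): converges to [0, 0, 0.4, 0.2, 0.4].
b = iterate([0, 0, 1, 0, 0])
print(a)
print(b)
```

Both results satisfy Pq = q, yet they disagree, so the matrix has no single steady state: that is the geometric multiplicity of 2 derived analytically below.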
To find the eigenvalues, solve:

    det(P − λI_5) = | −λ   1    0    0    0  |
                    |  1  −λ    0    0    0  |
                    |  0   0   −λ    1   1/2 |
                    |  0   0    0   −λ   1/2 |
                    |  0   0    1    0   −λ  |
                  = −(1/2)(λ − 1)^2 (λ + 1)(2λ^2 + 2λ + 1) = 0

When we solve this characteristic equation, we find that the five eigenvalues are λ_1 = 1, λ_2 = 1, λ_3 = −1, λ_4 = −1/2 − i/2, and λ_5 = −1/2 + i/2. Since 1 appears twice as an eigenvalue, we say that it has an algebraic multiplicity of 2. The number of linearly independent eigenvectors associated with eigenvalue 1 is called the geometric multiplicity of λ = 1. The reader can confirm that in this case, λ = 1 has a geometric multiplicity of 2, with associated eigenvectors x and y.

    x = [√2/2]    y = [ 0 ]
        [√2/2]        [ 0 ]
        [  0 ]        [2/3]
        [  0 ]        [1/3]
        [  0 ]        [2/3]

We can see that when a transition matrix for a Markov chain has geometric multiplicity greater than 1 for the eigenvalue 1, it's unclear which independent eigenvector should be used to represent the steady state of the system.

Figure 1. A small web of 4 pages, connected by directional links

II. PAGERANK

When the founders of Google created PageRank, they were trying to discern the relative authority of web pages from the underlying structure of links that connects the web. They did this by calculating an importance score for each web page. Given a webpage, call it page k, we can use x_k to denote the importance of this page among the total number of n pages. There are many different ways that one could calculate an importance score. One simple and intuitive way to do page ranking is to count the number of links from other pages to page k, and assign that number as x_k. We can think of each link as being one vote for the page it links to, and of the number of votes a page gets as showing the importance of the page. In the example network of Figure 1, there are four webpages. Page 1 is linked to by pages 2 and 3, so its importance score is x_1 = 2. In the same way, we get x_2 = 3, x_3 = 1, and x_4 = 2. Page 2 has the highest importance score, indicating that page 2 is the most important page in this web.

However, this approach has several drawbacks. First, pages that have more links to other pages would have more votes, which means that a website can easily gain more influence by creating many links to other pages. Second, we would expect that a vote from an important page should weigh more than a vote from an unimportant one, but every page's vote is worth the same amount with this method. A way to fix both of these problems is to give each page an amount of voting power equivalent to its importance score. So webpage k, with an importance score of x_k, has a total voting power of x_k. Then we can equally distribute x_k among all the pages it links to. We can define the importance score of a page as the sum of all the weighted votes it gets from the pages that link to it. So if webpage k has a set of pages S_k linking to it, we have

    x_k = Σ_{j ∈ S_k} x_j / n_j                                   (12)

where n_j is the number of links from page j. If we apply this method to the network of Figure 1, we get a system of linear equations:

    x_1 = x_2/1 + x_4/2
    x_2 = x_1/3 + x_3/2 + x_4/2
    x_3 = x_1/3
    x_4 = x_1/3 + x_3/2

which can be written in the matrix form x = Lx, where x = [x_1, x_2, x_3, x_4]^T and

    L = [ 0   1   0   1/2]
        [1/3  0  1/2  1/2]
        [1/3  0   0    0 ]
        [1/3  0  1/2   0 ]

L is called the link matrix of this network system, since it encapsulates the links between all the pages in the system. Because we've evenly distributed x_k among the pages k links to, the link matrix is always column-stochastic. As we defined earlier, the vector x contains the importance scores of all web pages. To find these scores, we can solve Lx = x. You'll notice that this looks similar to Eqn. (8), and indeed, we can transform this problem into finding the eigenvector with eigenvalue λ = 1 for the matrix L! For matrix L, the eigenvector for λ = 1 is [0.387, 0.290, 0.129, 0.194]^T. So the importance score of each page is x_1 ≈ 0.387, x_2 ≈ 0.290, x_3 ≈ 0.129, x_4 ≈ 0.194. Note that with this more sophisticated method, page 1 has the highest importance score instead of page 2. This is because page 2 only links to page 1, so it casts its entire vote to page 1, boosting its score. Knowing that the link matrix is a
column-stochastic matrix, let us now look at the problem of ranking internet pages in terms of a Markov Chain system.

For a network with n pages, the ith entry of the n × 1 vector x_k denotes the probability of visiting page i after k clicks. The link matrix L is the transition matrix such that the entry l_ij is the probability of clicking on a link to page i when on page j. Finding the importance score of a page is the same as finding its entry in the steady-state vector of the Markov chain that describes the system. For example, say we start from page 1, so that the vector representing our beginning state is x_0 = [1, 0, 0, 0]^T. To find the probability of ending up on each web page after n clicks, we compute:

    x_n = L^n x_0                                                 (13)

where x_n represents the state after n clicks (the probability of being on each page), and L is the link matrix, or transition matrix. So by calculating the powers of L, we can determine the steady-state vector. This process is called the power iteration method, and it converges on an estimate of the eigenvector for the greatest eigenvalue, which is always 1 in the case of a column-stochastic matrix. For example, by raising the link matrix L to the 25th power, we have

    B ≈ [0.387  0.387  0.387  0.387]
        [0.290  0.290  0.290  0.290]
        [0.129  0.129  0.129  0.129]
        [0.194  0.194  0.194  0.194]

If we multiply matrix B by our initial state x_0, we get our steady-state vector

    s = [0.387  0.290  0.129  0.194]^T

which shows the probability of each page being visited. This power iteration process gives us approximately the same result as finding the eigenvector of the link matrix directly, but is often more computationally feasible, especially for matrices with a dimension of around 1 billion, like Google's. These computational savings are why this is the method by which Google actually calculates the PageRank of web pages [5]. By taking powers of the matrix to estimate eigenvectors, Google is doing the reverse of many applications, which diagonalize matrices into their eigenvector components in order to take them to a high power. Few applications actually use the power iteration method, since it is only appropriate under a narrow range of conditions. The sparseness of the web's link matrix, and the need to know only the eigenvector corresponding to the dominant eigenvalue, make PageRank well suited to taking advantage of the power iteration method.

A. Subwebs

We now address a problem that this model has when faced with real-world constraints. We refer to this problem as disconnected subwebs, as shown in Figure 2. If there are two or more groups of linked pages that do not link to each other, it is impossible to rank all pages relative to each other.

Figure 2. Here are two small subwebs, which do not exchange links

The matrix shown below is the link matrix for the web shown in Figure 2:

    A = [0  1  0  0   0 ]
        [1  0  0  0   0 ]
        [0  0  0  1  1/2]
        [0  0  0  0  1/2]
        [0  0  1  0   0 ]

Mathematically, this problem presents itself as a multi-dimensional Markov Chain. Link matrix A has a geometric multiplicity of 2 for the eigenvalue 1, as we showed in Section I.B. It's unclear which of the two associated eigenvectors should be chosen to form the rankings. The two eigenvectors are essentially eigenvectors for each subweb, and each shows rankings that are accurate locally, but not globally.

Google has chosen to solve this problem of subwebs by introducing an element of randomness into the link matrix. Defining a matrix S as an n × n matrix with all entries 1/n, and a value m between 0 and 1 as a relative weight, we can replace link matrix A with:

    M = (1 − m)A + mS                                             (14)

If m > 0, there will be no parts of matrix M which represent entirely disconnected subwebs, as every web surfer has some probability of reaching any other page regardless of the page they're on. In the original PageRank algorithm, an m value of 0.15 was used. Today, it is speculated by those outside of Google that the value currently in use lies between 0.1 and 0.2. The larger the value of m, the more heavily the random matrix is weighted, and the more egalitarian the corresponding PageRank values are. If m is 1, a web surfer has an equal probability of getting to any page on the web from any other page, and all links would be ignored. If m is 0, any subwebs contained in the system will cause the eigenvalue 1 to have a multiplicity greater than 1, and there will be ambiguity in the system.

III. DISCUSSION

We've shown how systems that can be characterized as a Markov Chain will converge to a steady state, and that these steady-state values can be found by using either the characteristic equation or the power iteration method. We then
investigated how the web can be viewed as a Markov Chain when the state is which page a web surfer is on, and the hyperlinks between pages dictate the probability of transitioning from one state to another. With this characterization, we can view the steady-state vector as the proportional amount of time a web surfer would spend on every page, and hence a valuable metric for which pages are more important.

REFERENCES

[1] O'Neill, Nick. "Why Facebook Could Be Worth $200 Billion." AllFacebook. Available at http://www.allfacebook.com/is-facebooks-valuation-hype-a-macro-perspective-2011-01
[2] Page, L., Brin, S., Motwani, R., and Winograd, T. "The PageRank citation ranking: Bringing order to the web." Technical report, Stanford Digital Library Technologies Project, 1998.
[3] Granka, Laura A., Joachims, Thorsten, and Gay, Geri. "Eye-tracking analysis of user behavior in WWW search." SIGIR, 2004. Available at http://www.cs.cornell.edu/People/tj/publications/granketal04a.pdf
[4] Grinstead, Charles M., and Snell, J. Laurie. Introduction to Probability, Chapter 11: Markov Chains. Available at http://www.dartmouth.edu/~chance/teachingaids/booksarticles/probabilitybook/Chapter11.pdf
[5] Ipsen, Ilse, and Wills, Rebecca M. "Analysis and Computation of Google's PageRank." 7th IMACS International Symposium on Iterative Methods in Scientific Computing, Fields Institute, Toronto, Canada, 5 May 2005. Available at http://www4.ncsu.edu/~ipsen/ps/slidesimacs.pdf

Figure 3. A small web of 5 pages

IV. EXERCISES

Exercise 1: Find the eigenvalues and corresponding eigenvectors of the following matrix.

    A = [ 3   0  −1  0]
        [ 4  −3   2  0]
        [ 0   0   1  0]
        [−2   0   3  2]

Exercise 2: Given the column-stochastic matrix P:

    P = [0.6  0.3]
        [0.4  0.7]

find the steady-state vector for P.

Exercise 3: Create a link matrix for the network with 5 internet pages in Figure 3, then rank the pages.

Exercise 4: In Section I.B we claim that all transition matrices for Markov chains have 1 as an eigenvalue. Why is this true for every column-stochastic matrix?
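As a computational aid for these exercises, the power iteration method of Section II can be run directly. The following Python sketch is our addition (not part of the original paper): it recovers the importance scores [0.387, 0.290, 0.129, 0.194] for the 4-page link matrix L from Section II, and can be adapted to Exercise 3 by swapping in the 5-page link matrix.

```python
# Our sketch (not from the paper): power iteration on the 4-page
# link matrix L from Section II. Repeatedly applying L to any
# starting distribution converges to the eigenvector for eigenvalue 1,
# normalized so that its entries sum to 1.
L = [
    [0,     1, 0,   0.5],
    [1 / 3, 0, 0.5, 0.5],
    [1 / 3, 0, 0,   0  ],
    [1 / 3, 0, 0.5, 0  ],
]

def power_iteration(L, x, steps=200):
    """Compute L^steps x: the surfer's distribution after `steps` clicks."""
    for _ in range(steps):
        x = [sum(L[i][j] * x[j] for j in range(len(x))) for i in range(len(L))]
    return x

# Start the random surfer on page 1.
scores = power_iteration(L, [1.0, 0.0, 0.0, 0.0])
print([round(s, 3) for s in scores])  # approximately [0.387, 0.29, 0.129, 0.194]
```

The exact limit is [12/31, 9/31, 4/31, 6/31], which matches the rounded scores quoted in Section II; because L is column-stochastic, each iteration preserves the total probability of 1, so no renormalization step is needed.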
V. SOLUTIONS TO EXERCISES

Solution 1: The eigenvalues are λ_1 = 2, λ_2 = −3, λ_3 = 3, and λ_4 = 1, and the corresponding eigenvectors are

    x_1 = [0]   x_2 = [0]   x_3 = [ 3]   x_4 = [ 1]
          [0]         [1]         [ 2]         [ 2]
          [0]         [0]         [ 0]         [ 2]
          [1]         [0]         [−6]         [−4]

Solution 2: The steady-state vector for P is

    s = [0.4286]
        [0.5714]

Solution 3: The link matrix is

    A = [ 0   1   0   1/2  0]
        [1/3  0  1/3   0   0]
        [ 0   0   0   1/2  1]
        [1/3  0  1/3   0   0]
        [1/3  0  1/3   0   0]

The ranking vector is

    x = [1/4, 1/6, 1/4, 1/6, 1/6]^T

Solution 4: By definition, every column in a column-stochastic matrix P contains no negative numbers and sums to 1. As shown in Section I.B, the eigenvalues of P are the numbers λ for which det(P − λI) = 0. Setting λ = 1 subtracts 1 from each diagonal entry, so every column of P − I sums to 0. This means the rows of P − I add up to the zero row, so the rows are linearly dependent and det(P − I) = 0. Therefore, 1 is always an eigenvalue of a column-stochastic matrix.
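The damping trick of Eqn. (14) can also be checked numerically. The sketch below is our illustration (not from the original paper): it builds M = (1 − m)A + mS for the disconnected-subweb matrix A of Section II.A with m = 0.15, and shows that power iteration now reaches a single, strictly positive ranking vector regardless of the starting page.

```python
# Illustration (not from the paper): damping removes the steady-state
# ambiguity of the disconnected-subweb matrix A from Section II.A.
A = [
    [0, 1, 0, 0,   0],
    [1, 0, 0, 0,   0],
    [0, 0, 0, 1, 0.5],
    [0, 0, 0, 0, 0.5],
    [0, 0, 1, 0,   0],
]
n = len(A)
m = 0.15  # the damping weight used in the original PageRank algorithm

# M = (1 - m) A + m S, where S has every entry equal to 1/n.
M = [[(1 - m) * A[i][j] + m / n for j in range(n)] for i in range(n)]

def steady_state(M, x, steps=500):
    """Power iteration: repeatedly apply M to the distribution x."""
    for _ in range(steps):
        x = [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]
    return x

# Two very different starting pages now yield the same ranking vector.
r1 = steady_state(M, [1.0, 0.0, 0.0, 0.0, 0.0])
r2 = steady_state(M, [0.0, 0.0, 1.0, 0.0, 0.0])
print([round(v, 6) for v in r1])
assert all(abs(a - b) < 1e-9 for a, b in zip(r1, r2))
```

Because every entry of M is positive, the chain has a unique steady state; contrast this with the undamped matrix A, whose two λ = 1 eigenvectors rank each subweb only locally.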