CSC373: Design and Analysis of Algorithms
A Presentation by Marib Alam 0910803042
The Problem
Imagine a library containing 25 billion documents but with no centralized organization and no librarians. In addition, anyone may add a document at any time without telling anyone. You may feel sure that one of the documents contained in the collection has a piece of information that is vitally important to you, and, being impatient like most of us, you'd like to find it in a matter of seconds. How would you go about doing it?
The Solution
PageRank is a link analysis algorithm, named after Larry Page and used by the Google Internet search engine, that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set.
Importance & Implementation
• The basis of Google search, along with 250+ other factors.
• Responsible for serving billions of searches every day.
• Made billions of dollars for Google.
• Implemented by Google.
• The name "PageRank" is a trademark of Google, and the PageRank process has been patented.
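MATLAB implementation code
function [v] = rank(M, d, v_quadratic_error)
% M is expected to be a column-stochastic link matrix, d the damping factor,
% and v_quadratic_error the convergence threshold on the 2-norm of the change in v.
% Note: the name "rank" shadows MATLAB's built-in rank function.
N = size(M, 2); % N is the number of columns of M, i.e. the number of pages
v = rand(N, 1);
v = v ./ norm(v, 2);
last_v = ones(N, 1) * inf;
M_hat = (d .* M) + (((1 - d) / N) .* ones(N, N));
while (norm(v - last_v, 2) > v_quadratic_error)
    last_v = v;
    v = M_hat * v;
    v = v ./ norm(v, 2);
end
end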
The PageRank Algorithm
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where PR(A) is the PageRank of page A, PR(Ti) is the PageRank of the pages Ti which link to page A, C(Ti) is the number of outbound links on page Ti, and d is a damping factor which can be set between 0 and 1.
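The formula above can be sketched as a short iterative routine. This is a minimal illustration, not Google's implementation: the function name `pagerank`, the `links` dictionary format, the starting value of 1 per page and the three-page test graph are all assumptions made for the example, and every page is assumed to have at least one outbound link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over all pages T linking to A.

    `links` maps each page to the list of pages it links to (hypothetical format).
    Assumes every page has at least one outbound link.
    """
    pr = {page: 1.0 for page in links}  # assumed starting value of 1 per page
    out_count = {page: len(targets) for page, targets in links.items()}
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Sum PR(T) / C(T) over every page T that links to `page`.
            inbound = sum(pr[t] / out_count[t] for t in links if page in links[t])
            new_pr[page] = (1 - d) + d * inbound
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
example = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(example, d=0.5)
print(ranks)
```

All values in one pass are computed from the previous pass; other update orders are possible and reach the same fixed point.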
The PageRank Algorithm
A Different Notation of the PageRank Algorithm
PR(A) = (1-d) / N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where PR(A) is the PageRank of page A, PR(Ti) is the PageRank of the pages Ti which link to page A, C(Ti) is the number of outbound links on page Ti, and d is a damping factor which can be set between 0 and 1. N is the total number of pages on the web.
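The two notations differ only in scale. As a short worked check, assume every page has at least one outbound link and d < 1. Each page T then hands out PR(T)/C(T) to exactly C(T) pages, so summing either formula over all N pages gives:

```latex
\begin{aligned}
\text{first notation:}\quad  \sum_{A} PR(A) &= N(1-d) + d \sum_{T} PR(T)
  \;\Longrightarrow\; \sum_{A} PR(A) = N, \\
\text{second notation:}\quad \sum_{A} PR(A) &= (1-d) + d \sum_{T} PR(T)
  \;\Longrightarrow\; \sum_{A} PR(A) = 1.
\end{aligned}
```

Under these assumptions the first notation gives an average PageRank of 1 per page, while the second gives values that sum to 1 and can be read as a probability distribution over pages.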
Key points of PageRank
• Not all links weigh the same when it comes to PR.
• If a web page with a PR of 8 has a single link on it, the linked page receives a fair amount of PR value. If the same page has 100 links, each individual link passes on only a fraction of that value.
• Content is not taken into account when PageRank is calculated.
• PageRank does not rank web sites as a whole, but is determined for each page individually.
• Each inbound link contributes to the overall total, except links from banned sites, which don’t count.
• PageRank values don’t range from 0 to 10. PageRank is a floating-point number.
Key points of PageRank
• Each PageRank level is progressively harder to reach. PageRank is believed to be calculated on a logarithmic scale.
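The point about one link versus 100 links follows directly from the d · PR(T)/C(T) term in the formula. A small sketch of the arithmetic is shown here; the helper name `value_per_link` is hypothetical, and the figure 8 is treated as a raw PageRank value purely for illustration.

```python
def value_per_link(page_rank, outbound_links, d=0.85):
    """Share of a page's PageRank passed through each outbound link: d * PR(T) / C(T)."""
    return d * page_rank / outbound_links


# Illustrative numbers only: the same page with 1 link and with 100 links.
one_link = value_per_link(8.0, 1)
hundred_links = value_per_link(8.0, 100)
print(one_link, hundred_links)
```

With 100 links, each link carries one hundredth of what the single link would carry.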
A Simple Example
PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
These equations describe a small web of three pages in which A links to B and C, B links to C, and C links to A. According to Page and Brin, the damping factor d is usually set to 0.85, but to keep the calculation simple we set it to 0.5. The exact value of d does affect the PageRank values, but it does not change the fundamental principles of PageRank.
A Simple Example
PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615
Because of the size of the actual web, the Google search engine uses an approximate, iterative computation of PageRank values. This means that each page is assigned an initial starting value, and the PageRanks of all pages are then calculated in several computation cycles based on the equations determined by the PageRank algorithm.
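The exact values 14/13, 10/13 and 15/13 can be checked by solving the three example equations directly. The sketch below uses exact rational arithmetic from Python's standard library; the helper `solve_linear_system` is written for this illustration and is not part of the original presentation.

```python
from fractions import Fraction


def solve_linear_system(a, b):
    """Solve a x = b exactly by Gauss-Jordan elimination on Fractions."""
    n = len(b)
    # Augmented matrix [a | b] with exact rational entries.
    m = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick a row with a non-zero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        # Scale the pivot row so the pivot becomes 1.
        pivot_value = m[col][col]
        m[col] = [x / pivot_value for x in m[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[-1] for row in m]


half = Fraction(1, 2)
quarter = Fraction(1, 4)
# Example equations with d = 0.5, rearranged, unknowns ordered (PR(A), PR(B), PR(C)):
#        PR(A)               - 1/2 PR(C) = 1/2
#   -1/4 PR(A) +     PR(B)               = 1/2
#   -1/4 PR(A) - 1/2 PR(B) +     PR(C)   = 1/2
coefficients = [
    [1, 0, -half],
    [-quarter, 1, 0],
    [-quarter, -half, 1],
]
pr_a, pr_b, pr_c = solve_linear_system(coefficients, [half, half, half])
print(pr_a, pr_b, pr_c)  # 14/13 10/13 15/13
```

A direct solve like this is practical only for tiny examples, which is why an iterative approximation is used at web scale.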
A Simple Example
Iteration   PR(A)        PR(B)        PR(C)
0           1            1            1
1           1            0.75         1.125
2           1.0625       0.765625     1.1484375
3           1.07421875   0.76855469   1.15283203
4           1.07641602   0.76910400   1.15365601
5           1.07682800   0.76920700   1.15381050
6           1.07690525   0.76922631   1.15383947
7           1.07691973   0.76922993   1.15384490
8           1.07692245   0.76923061   1.15384592
9           1.07692296   0.76923074   1.15384611
10          1.07692305   0.76923076   1.15384615
11          1.07692307   0.76923077   1.15384615
12          1.07692308   0.76923077   1.15384615
We see that we get a good approximation of the real PageRank values after only a few iterations. According to publications of Page and Brin, about 100 iterations are necessary to get a good approximation of the PageRank values of the whole web.
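The iteration table can be reproduced with a few lines of code. The table's numbers match a scheme in which each newly computed value is used immediately within the same iteration (for instance, PR(C) = 1.125 in iteration 1 requires the updated PR(B) = 0.75). The sketch below follows that scheme; the function name `iterate_example` is hypothetical.

```python
def iterate_example(iterations, d=0.5):
    """Repeat the three example equations, reusing each new value immediately."""
    pr_a = pr_b = pr_c = 1.0  # iteration 0: every page starts at 1
    history = [(pr_a, pr_b, pr_c)]
    for _ in range(iterations):
        pr_a = (1 - d) + d * pr_c
        pr_b = (1 - d) + d * (pr_a / 2)
        pr_c = (1 - d) + d * (pr_a / 2 + pr_b)
        history.append((pr_a, pr_b, pr_c))
    return history


history = iterate_example(12)
for step, (a, b, c) in enumerate(history):
    print(f"{step:2d}  {a:.8f}  {b:.8f}  {c:.8f}")
```

Changing `d` or the number of iterations shows how quickly the values settle for this small graph.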
A Different Example
Suppose that page Pj has lj links. If one of those links is to page Pi, then Pj will pass on 1/lj of its importance to Pi. The importance ranking of Pi is then the sum of all the contributions made by pages linking to it. That is, if we denote the set of pages linking to Pi by Bi, then
$$I(P_i) = \sum_{P_j \in B_i} \frac{I(P_j)}{l_j}$$
Let's first create a matrix, called the hyperlink matrix, in which the entry in the ith row and jth column is
$$H_{ij} = \begin{cases} 1/l_j & \text{if } P_j \in B_i, \\ 0 & \text{otherwise.} \end{cases}$$
A Different Example
We will also form a vector I whose components are PageRanks, that is, the importance rankings of all the pages. The condition above defining the PageRank may be expressed as
I = HI
In other words, the vector I is an eigenvector of the matrix H with eigenvalue 1. We also call this a stationary vector of H.
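As a rough illustration of the hyperlink matrix and its stationary vector, the sketch below builds H for a small hypothetical three-page web and applies H repeatedly to a uniform starting vector. The function names, the `links` format and the test graph are assumptions made for this example. Plain repeated multiplication converges for this particular graph but is not guaranteed to converge for every link structure (for example, when some pages have no outbound links).

```python
def hyperlink_matrix(links, pages):
    """Return H as nested lists, with H[i][j] = 1/l_j if page j links to page i, else 0."""
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    h = [[0.0] * n for _ in range(n)]
    for source, targets in links.items():
        j = index[source]
        for target in targets:
            # Page j passes 1/l_j of its importance to each page it links to.
            h[index[target]][j] = 1.0 / len(targets)
    return h


def multiply(h, vec):
    """Matrix-vector product H * vec."""
    return [sum(h_ij * v_j for h_ij, v_j in zip(row, vec)) for row in h]


def stationary_vector(h, iterations=200):
    """Approximate I with I = H I by repeatedly applying H to a uniform start vector."""
    n = len(h)
    vec = [1.0 / n] * n
    for _ in range(iterations):
        vec = multiply(h, vec)
    return vec


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
pages = ["A", "B", "C"]
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
H = hyperlink_matrix(links, pages)
I = stationary_vector(H)
# Largest component-wise difference between H I and I; near zero if I is stationary.
residual = max(abs(x - y) for x, y in zip(multiply(H, I), I))
print(I, residual)
```

Because no damping factor is involved here, the resulting vector differs from the values in the earlier example with d = 0.5.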