Pagerank

The PageRank Citation Ranking: Bringing Order to the Web Larry Page etc. Stanford University Presented by Guoqiang Su & Wei Li

Contents Motivation Related work Page Rank & Random Surfer Model Implementation Application Conclusion

Motivation Web: heterogeneous and unstructured Free of quality control on the web Commercial interest to manipulate ranking

Related Work Academic citation analysis Link-based analysis Clustering methods of link structure Hubs & Authorities Model

Backlink Link Structure of the Web Approximation of importance / quality

PageRank Pages with lots of backlinks are important Backlinks coming from important pages convey more importance to a page Problem: Rank Sink

Rank Sink Page cycles pointed by some incoming link Problem: this loop will accumulate rank but never distribute any rank outside

Escape Term Solution: Rank Source c is maximized and = 1 E(u) is some vector over the web pages – uniform, favorite page etc.

Matrix Notation R is the dominant eigenvector and c is the dominant eigenvalue of because c is maximized

Computing PageRank - initialize vector over web pages loop: - new ranks sum of normalized backlink ranks - compute normalizing factor - add escape term - control parameter while - stop when converged

Random Surfer Model Page Rank corresponds to the probability distribution of a random walk on the web graphs E(u) can be re-phrased as the random surfer gets bored periodically and jumps to a different page and not kept in a loop forever

Implementation Computing resources — 24 million pages — 75 million URLs Memory and disk storage Weight Vector (4 byte float) Matrix A (linear access)

Implementation (Con't) Unique integer ID for each URL Sort and Remove dangling links Rank initial assignment Iteration until convergence Add back dangling links and Re-compute

Convergence Properties Graph (V, E) is an expander with factor  if for all (not too large) subsets S: |As|   |s| Eigenvalue separation: Largest eigenvalue is sufficiently larger than the second-largest eigenvalue Random walk converges fast to a limiting probability distribution on a set of nodes in the graph.

Convergence Properties (con't) PageRank computation is O(log(|V|)) due to rapidly mixing graph G of the web.

Personalized PageRank Rank Source E can be initialized : – uniformly over all pages: e.g. copyright warnings, disclaimers, mailing lists archives  result in overly high ranking – total weight on a single page, e.g . Netscape, McCarthy  great variation of ranks under different single pages as rank source – and everything in-between, e.g. server root pages  allow manipulation by commercial interests

Applications I Estimate web traffic – Server/page aliases – Link/traffic disparity, e.g. porn sites, free web-mail Backlink predictor – Citation counts have been used to predict future citations – very difficult to map the citation structure of the web completely – avoid the local maxima that citation counts get stuck in and get better performance

Applications II - Ranking Proxy Surfer's Navigation Aid Annotating links by PageRank (bar graph) Not query dependent

Issues Users are no random walkers – Content based methods Starting point distribution – Actual usage data as starting vector Reinforcing effects/bias towards main pages How about traffic to ranking pages? No query specific rank Linkage spam – PageRank favors pages that managed to get other pages to link to them – Linkage not necessarily a sign of relevancy, only of promotion (advertisement…)

Conclusion PageRank is a global ranking based on the web's graph structure PageRank use backlinks information to bring order to the web PageRank can separate out representative pages as cluster center A great variety of applications

Pagerank

More Related Content

Viewers also liked

Similar to Pagerank

Recently uploaded

Pagerank