Published in: Technology

  1. The PageRank Citation Ranking: Bringing Order to the Web
     Larry Page et al.
     Stanford University
     Presented by Guoqiang Su & Wei Li
  2. Contents
     Motivation
     Related work
     PageRank & Random Surfer Model
     Implementation
     Application
     Conclusion
  3. Motivation
     - Web: heterogeneous and unstructured
     - No quality control on the web
     - Commercial interest in manipulating rankings
  4. Related Work
     Academic citation analysis
     Link-based analysis
     Clustering methods of link structure
     Hubs & Authorities Model
  6. Backlink
     Link structure of the web
     Approximation of importance / quality
  7. PageRank
     Pages with many backlinks are important
     Backlinks from important pages convey more importance to a page
     Problem: rank sink
  8. Rank Sink
     A cycle of pages that is pointed to by some incoming link but has no links leading out of the cycle
     Problem: the loop accumulates rank but never distributes any rank outside
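
The rank sink can be reproduced on a toy graph. The sketch below is only an illustration under assumed inputs: the four page names and their links are hypothetical, and the update is the simplified ranking without any escape term. Pages "c" and "d" link only to each other, so rank that enters the loop from "a" never leaves.

```python
def simple_rank_step(ranks, out_links):
    """One step of the simplified ranking: each page splits its
    current rank evenly among the pages it links to."""
    new_ranks = {page: 0.0 for page in ranks}
    for page, targets in out_links.items():
        share = ranks[page] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks


# Hypothetical graph: a <-> b exchange rank, a also feeds the loop c <-> d.
out_links = {"a": ["b", "c"], "b": ["a"], "c": ["d"], "d": ["c"]}
ranks = {page: 0.25 for page in out_links}

for _ in range(100):
    ranks = simple_rank_step(ranks, out_links)

# Share of the total rank held by the loop after 100 steps.
loop_share = ranks["c"] + ranks["d"]
print(f"rank held by the loop: {loop_share:.6f}")
```

After enough steps essentially all of the rank sits in the loop, while "a" and "b" decay toward zero.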
  9. Escape Term
     Solution: rank source
     R'(u) = c \sum_{v \in B_u} R'(v) / N_v + c E(u)
     (B_u: pages linking to u; N_v: number of links out of v)
     c is maximized and ||R'||_1 = 1
     E(u) is some vector over the web pages
       - uniform, favorite page, etc.
  10. Matrix Notation
      R' = c (A + E × 1) R', where 1 is the all-ones vector
      R' is the dominant eigenvector of (A + E × 1), and c is the associated eigenvalue factor, because c is maximized
  11. Computing PageRank
      R_0 <- S: initialize vector over web pages
      loop:
        R_{i+1} <- A R_i: new ranks are the sum of normalized backlink ranks
        d <- ||R_i||_1 - ||R_{i+1}||_1: compute normalizing factor
        R_{i+1} <- R_{i+1} + d E: add escape term
        δ <- ||R_{i+1} - R_i||_1: control parameter
      while δ > ε: stop when converged
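
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. Two things are assumptions that the slides do not specify: a decay factor `c = 0.85` applied in the link step (so that some rank is available to redistribute through E), and the toy graph itself. Rank held by pages without forward links is also treated as lost and redistributed through the escape term.

```python
def pagerank(out_links, source, c=0.85, epsilon=1e-10, max_iter=1000):
    """Iterate the slide's loop on a dict graph.

    out_links: page -> list of pages it links to
    source:    page -> E(page), assumed to sum to 1
    c:         assumed decay factor (not given in the slides)
    """
    pages = list(out_links)
    ranks = {p: 1.0 / len(pages) for p in pages}  # R_0 <- S (uniform here)
    for _ in range(max_iter):
        # New ranks: sum of normalized backlink ranks, scaled by c.
        new_ranks = {p: 0.0 for p in pages}
        for page, targets in out_links.items():
            if not targets:
                continue  # rank of a page with no forward links is lost here
            share = c * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        # Normalizing factor: the rank lost in this step.
        d = sum(ranks.values()) - sum(new_ranks.values())
        # Escape term: redistribute the lost rank according to E.
        for p in pages:
            new_ranks[p] += d * source[p]
        # Control parameter: L1 change between iterations.
        delta = sum(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if delta <= epsilon:
            break
    return ranks


# Hypothetical graph with a loop (c <-> d) that would otherwise be a rank sink.
out_links = {"a": ["b", "c"], "b": ["a"], "c": ["d"], "d": ["c"]}
uniform_source = {p: 1.0 / len(out_links) for p in out_links}
ranks = pagerank(out_links, uniform_source)
```

With the escape term in place, "a" and "b" keep a nonzero rank even though the loop still ranks highest.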
  12. Random Surfer Model
      PageRank corresponds to the standing probability distribution of a random walk on the web graph
      E(u) can be rephrased as: the random surfer periodically gets bored and jumps to a different page, so the surfer is never kept in a loop forever
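
The random surfer can be simulated directly. The sketch below is illustrative only: the graph, the jump probability of 0.15, the uniform choice of jump target, and the fixed seed are all assumptions, not values from the slides. The fraction of time spent on each page approximates its rank.

```python
import random


def random_surfer(out_links, steps=100_000, jump_prob=0.15, seed=0):
    """Estimate page ranks as visit frequencies of a random walk.

    With probability jump_prob (an assumed value) the surfer gets bored
    and jumps to a page chosen uniformly at random; otherwise the surfer
    follows a random forward link of the current page.
    """
    rng = random.Random(seed)
    pages = list(out_links)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        targets = out_links[current]
        if not targets or rng.random() < jump_prob:
            current = rng.choice(pages)    # bored (or stuck): jump anywhere
        else:
            current = rng.choice(targets)  # follow a forward link
    return {p: count / steps for p, count in visits.items()}


# Hypothetical graph: c <-> d form a loop that the surfer can still escape.
out_links = {"a": ["b", "c"], "b": ["a"], "c": ["d"], "d": ["c"]}
frequencies = random_surfer(out_links)
```

Because of the jumps, the surfer keeps returning to "a" and "b" instead of circling between "c" and "d" forever.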
  13. Implementation
      Computing resources
        - 24 million pages
        - 75 million URLs
      Memory and disk storage
        - Weight vector (4-byte float)
        - Matrix A (linear access)
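
As a quick check on the storage figures, the size of the weight vector follows from the two numbers on the slide, assuming one 4-byte float per URL and decimal megabytes:

```python
n_urls = 75_000_000      # URLs, from the slide
bytes_per_weight = 4     # one single-precision float per URL (assumed layout)

weight_vector_bytes = n_urls * bytes_per_weight
weight_vector_mb = weight_vector_bytes / 1_000_000  # decimal megabytes

print(f"{weight_vector_mb:.0f} MB")
```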
  14. Implementation (cont'd)
      Assign a unique integer ID to each URL
      Sort the link structure and remove dangling links
      Assign initial ranks
      Iterate until convergence
      Add back dangling links and recompute
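
The first two preprocessing steps can be sketched as follows. This is a simplified illustration, not the original implementation: the URLs are hypothetical, IDs are assigned in order of first appearance, and a link is treated as dangling when its target has no outgoing links. Removing dangling links can expose new ones, so the removal repeats until nothing changes.

```python
def assign_ids(links):
    """Map each URL to a unique integer ID in order of first appearance."""
    ids = {}
    for src, dst in links:
        for url in (src, dst):
            if url not in ids:
                ids[url] = len(ids)
    return ids


def remove_dangling(edges):
    """Repeatedly drop links whose target page has no outgoing links.

    Returns the remaining links sorted by source ID, plus the removed
    links so they can be added back after the ranks have converged.
    """
    edges = set(edges)
    removed = set()
    while True:
        has_out = {src for src, _ in edges}
        dangling = {(src, dst) for src, dst in edges if dst not in has_out}
        if not dangling:
            return sorted(edges), removed
        edges -= dangling
        removed |= dangling


# Hypothetical link list: c links to d, and d has no forward links.
links = [
    ("http://a.example/", "http://b.example/"),
    ("http://b.example/", "http://a.example/"),
    ("http://b.example/", "http://c.example/"),
    ("http://c.example/", "http://d.example/"),
]
ids = assign_ids(links)
edges = [(ids[src], ids[dst]) for src, dst in links]
core, removed = remove_dangling(edges)
```

Here the link into "d" is removed first, which leaves "c" without forward links, so the link into "c" is removed in the next pass.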
  15. Convergence Properties
      Graph (V, E) is an expander with factor α if for all (not too large) subsets S: |A_S| ≥ α|S|, where A_S is the neighborhood of S
      Eigenvalue separation: the largest eigenvalue is sufficiently larger than the second-largest eigenvalue
      A random walk converges quickly to a limiting probability distribution on the set of nodes in the graph
  16. Convergence Properties (cont'd)
      PageRank converges in roughly O(log |V|) iterations because the web graph G is rapidly mixing
  17. Personalized PageRank
      Rank source E can be initialized:
        - uniformly over all pages: pages such as copyright warnings, disclaimers, and mailing list archives receive overly high rankings
        - with total weight on a single page, e.g. Netscape, McCarthy: ranks vary greatly depending on which single page is the rank source
        - anything in between, e.g. server root pages: allows manipulation by commercial interests
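
The effect of the rank source can be seen by running the same iteration with two different E vectors. The sketch below is an illustration under assumptions: the graph is hypothetical, the decay factor `c = 0.85` is not from the slides, and a fixed number of iterations stands in for a convergence test.

```python
def pagerank(out_links, source, c=0.85, iterations=200):
    """PageRank with rank source E given as `source` (values sum to 1).

    c is an assumed decay factor; the rank not passed along links in
    each step is redistributed according to the rank source.
    """
    pages = list(out_links)
    ranks = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_ranks = {p: 0.0 for p in pages}
        for page, targets in out_links.items():
            for target in targets:
                new_ranks[target] += c * ranks[page] / len(targets)
        lost = 1.0 - sum(new_ranks.values())
        for p in pages:
            new_ranks[p] += lost * source[p]
        ranks = new_ranks
    return ranks


# Hypothetical graph and two rank sources.
out_links = {"a": ["b", "c"], "b": ["a"], "c": ["d"], "d": ["c"]}
uniform = {p: 0.25 for p in out_links}
only_b = {p: (1.0 if p == "b" else 0.0) for p in out_links}

ranks_uniform = pagerank(out_links, uniform)
ranks_only_b = pagerank(out_links, only_b)
```

Putting all of E on "b" raises the rank of "b" and of the pages close to it, which is the personalization effect described above.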
  18. Applications I
      Estimate web traffic
        - Server/page aliases
        - Link/traffic disparity, e.g. porn sites, free web-mail
      Backlink predictor
        - Citation counts have been used to predict future citations
        - It is very difficult to map the citation structure of the web completely
        - PageRank avoids the local maxima that citation counts get stuck in and performs better
  19. Applications II - Ranking Proxy
      Surfer's navigation aid
      Annotates links with their PageRank (bar graph)
      Not query dependent
  20. Issues
      Users are not random walkers
        - Content-based methods
      Starting point distribution
        - Actual usage data as starting vector
      Reinforcing effects / bias towards main pages
      How about traffic to ranking pages?
      No query-specific rank
      Linkage spam
        - PageRank favors pages that managed to get other pages to link to them
        - Linkage is not necessarily a sign of relevancy, only of promotion (advertisement…)
  21. Evaluation I
  22. Evaluation II
  23. Conclusion
      PageRank is a global ranking based on the web's graph structure
      PageRank uses backlink information to bring order to the web
      PageRank can separate out representative pages as cluster centers
      A great variety of applications