Pagerank

Published in: Technology

  1. The PageRank Citation Ranking: Bringing Order to the Web
     Larry Page et al., Stanford University
     Presented by Guoqiang Su & Wei Li
  2. Contents
     – Motivation
     – Related work
     – PageRank & Random Surfer Model
     – Implementation
     – Application
     – Conclusion
  3. Motivation
     – Web: heterogeneous and unstructured
     – No quality control on the web
     – Commercial interest in manipulating rankings
  4. Related Work
     – Academic citation analysis
     – Link-based analysis
     – Clustering methods based on link structure
     – Hubs & Authorities model
  6. Backlink
     – Link structure of the web
     – Backlinks as an approximation of importance / quality
  7. PageRank
     – Pages with many backlinks are important
     – Backlinks from important pages convey more importance to a page
     – Problem: rank sink
  8. Rank Sink
     – A cycle of pages that some incoming link points to, but that has no links leading out of the cycle
     – Problem: the loop accumulates rank but never distributes any rank outside
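To make the rank sink concrete, here is a minimal sketch in Python. It assumes the simplified update in which each page splits its current rank evenly among its out-links, with no escape term. The four-page graph and the names `simple_rank_step`, `out_links`, and `loop_share` are made up for illustration and do not come from the slides.

```python
def simple_rank_step(ranks, out_links):
    """One step of the simplified update: each page passes its whole
    rank, split evenly, to the pages it links to (no escape term)."""
    new_ranks = {page: 0.0 for page in ranks}
    for page, targets in out_links.items():
        share = ranks[page] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks


# Hypothetical graph: a and b link to each other, a also links into
# the loop x <-> y, and nothing links out of that loop.
out_links = {"a": ["b", "x"], "b": ["a"], "x": ["y"], "y": ["x"]}
ranks = {page: 0.25 for page in out_links}

for _ in range(50):
    ranks = simple_rank_step(ranks, out_links)

loop_share = ranks["x"] + ranks["y"]
print("rank held by the loop:", loop_share)
print("rank left outside:", ranks["a"] + ranks["b"])
```

Pages `x` and `y` link only to each other, so any rank that reaches them stays there, and after enough steps the loop holds nearly all of the rank.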
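  9. Escape Term
     – Solution: rank source
       $R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c\,E(u)$
       where $B_u$ is the set of pages linking to u and $N_v$ is the number of links out of page v
     – c is maximized and $\|R\|_1 = 1$
     – E(u) is some vector over the web pages, e.g. uniform, a favorite page, etc.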
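  10. Matrix Notation
      – $R = c(AR + E)$; since $\|R\|_1 = 1$, this can be rewritten as $R = c(A + E \times \mathbf{1})R$, where A is the normalized link matrix and $\mathbf{1}$ is the row vector of all ones
      – R is the dominant eigenvector and c is the dominant eigenvalue of $(A + E \times \mathbf{1})$, because c is maximized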
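  11. Computing PageRank
      – $R_0 \leftarrow S$: initialize vector over web pages
      – loop:
        – $R_{i+1} \leftarrow A R_i$: new ranks are the sum of normalized backlink ranks
        – $d \leftarrow \|R_i\|_1 - \|R_{i+1}\|_1$: compute normalizing factor
        – $R_{i+1} \leftarrow R_{i+1} + dE$: add escape term
        – $\delta \leftarrow \|R_{i+1} - R_i\|_1$: control parameter
      – while $\delta > \epsilon$: stop when converged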
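The loop on this slide can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the presenters' implementation: the graph is a small dictionary, the start vector is set equal to the rank source E, and each page passes on only a fraction `damping` of its rank (0.85 is an assumed value that the slide does not give), so that the normalizing factor is non-zero even when no page lacks out-links. The names `pagerank`, `out_links`, `source`, `damping`, and `epsilon` are hypothetical.

```python
def pagerank(out_links, source, damping=0.85, epsilon=1e-10, max_iter=1000):
    """Iterate rank propagation until the L1 change is at most epsilon.

    out_links: dict mapping each page to the list of pages it links to
               (every link target must also be a key).
    source:    dict giving each page's share of the rank source E
               (assumed non-negative and summing to 1).
    damping:   assumed fraction of rank passed along links.
    """
    pages = list(out_links)
    ranks = {p: source[p] for p in pages}  # initialize; here S = E
    for _ in range(max_iter):
        # New ranks: sum of normalized backlink ranks.
        new_ranks = {p: 0.0 for p in pages}
        for page, targets in out_links.items():
            if not targets:
                continue  # no out-links: this rank is re-injected via d
            share = damping * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        # Normalizing factor: rank lost in this step (ranks are non-negative,
        # so plain sums are the L1 norms).
        d = sum(ranks.values()) - sum(new_ranks.values())
        # Escape term: hand the lost rank back according to E.
        for p in pages:
            new_ranks[p] += d * source[p]
        # Control parameter: L1 distance between successive rank vectors.
        delta = sum(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if delta <= epsilon:
            break
    return ranks


# Hypothetical three-page graph: a and b both link to c, and c links to a.
out_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
uniform = {p: 1.0 / len(out_links) for p in out_links}
ranks = pagerank(out_links, uniform)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph, page `c` has two backlinks and comes out on top, `a` inherits most of `c`'s rank, and `b`, which has no backlinks, receives only its share of the escape term.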
  12. Random Surfer Model
      – PageRank corresponds to the probability distribution of a random walk on the web graph
      – E(u) can be rephrased as follows: the random surfer periodically gets bored and jumps to a different page, so the surfer is never kept in a loop forever
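The random surfer reading can be illustrated with a small simulation. The sketch below assumes a uniform jump distribution and a boredom probability of 0.15 per step; both are assumptions, since the slide gives no numbers. The function name `random_surfer` and the three-page graph are likewise hypothetical.

```python
import random


def random_surfer(out_links, steps, jump_prob=0.15, seed=0):
    """Simulate a surfer who follows a random out-link at each step and,
    with probability jump_prob (or when stuck on a page without links),
    jumps to a page chosen uniformly at random."""
    rng = random.Random(seed)
    pages = list(out_links)
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        targets = out_links[page]
        if not targets or rng.random() < jump_prob:
            page = rng.choice(pages)    # bored: jump anywhere
        else:
            page = rng.choice(targets)  # follow a link
    return {p: visits[p] / steps for p in pages}


# Hypothetical graph: a and b both link to c, and c links to a.
out_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
freq = random_surfer(out_links, steps=200_000)
for page in sorted(freq, key=freq.get, reverse=True):
    print(page, round(freq[page], 3))
```

The visit frequencies estimate the rank of each page: `b` is reached only through jumps, so it is visited far less often than `a` and `c`.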
  13. Implementation
      – Computing resources
        – 24 million pages
        – 75 million URLs
      – Memory and disk storage
        – Weight vector (4-byte float)
        – Matrix A (linear access)
  14. Implementation (cont'd)
      – Assign a unique integer ID to each URL
      – Sort the links and remove dangling links
      – Assign initial ranks
      – Iterate until convergence
      – Add back the dangling links and recompute
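The first two preprocessing steps on this slide might look like the sketch below. It assumes that a dangling link is a link whose target page has no out-links of its own, and that such links are removed repeatedly until none remain. The helper names `assign_ids` and `remove_dangling` and the example URLs are invented for illustration, and the final step of adding the dangling links back is not shown.

```python
def assign_ids(links):
    """Give every URL that appears in a (source, target) pair a unique
    integer ID, in order of first appearance."""
    ids = {}
    for src, dst in links:
        for url in (src, dst):
            if url not in ids:
                ids[url] = len(ids)
    return ids


def remove_dangling(edges):
    """Sort the links, then repeatedly drop links whose target has no
    out-links. Returns (remaining links, removed links)."""
    edges = sorted(set(edges))
    removed = []
    while True:
        has_out = {src for src, _ in edges}
        dangling = [e for e in edges if e[1] not in has_out]
        if not dangling:
            return edges, removed
        removed.extend(dangling)
        edges = [e for e in edges if e[1] in has_out]


# Hypothetical crawl: a <-> b, b -> c, c -> d, where d has no out-links.
links = [
    ("http://a.example/", "http://b.example/"),
    ("http://b.example/", "http://a.example/"),
    ("http://b.example/", "http://c.example/"),
    ("http://c.example/", "http://d.example/"),
]
ids = assign_ids(links)
edges = [(ids[src], ids[dst]) for src, dst in links]
core, removed = remove_dangling(edges)
print("links kept:", core)
print("links removed:", removed)
```

Removing the link into `d` leaves `c` without out-links, so the link into `c` is removed on the next pass; this is why the removal has to be repeated.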
  15. Convergence Properties
      – A graph (V, E) is an expander with factor α if for all (not too large) subsets S: $|A_S| \ge \alpha |S|$, where $A_S$ is the neighborhood of S
      – Eigenvalue separation: the largest eigenvalue is sufficiently larger than the second-largest eigenvalue
      – A random walk converges quickly to a limiting probability distribution on the set of nodes in the graph
  16. Convergence Properties (cont'd)
      – PageRank converges in roughly O(log |V|) iterations because the web graph G is rapidly mixing
  17. Personalized PageRank
      – The rank source E can be initialized:
        – uniformly over all pages: pages such as copyright warnings, disclaimers, and mailing list archives end up with overly high rankings
        – with total weight on a single page, e.g. Netscape or McCarthy: ranks vary greatly depending on which single page is the rank source
        – and everything in between, e.g. server root pages: allows manipulation by commercial interests
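The effect of the rank source can be seen on a toy example. The sketch below is only an illustration: it uses a fixed number of iterations, an assumed damping value of 0.85, and a hypothetical chain of four pages in which every page has at least one out-link. The names `personalized_rank`, `uniform`, `only_a`, and `only_d` are not from the slides.

```python
def personalized_rank(out_links, source, damping=0.85, iterations=100):
    """Rank propagation with a caller-supplied rank source E.

    Assumes every page has at least one out-link and that the values
    in source are non-negative and sum to 1."""
    ranks = dict(source)
    for _ in range(iterations):
        # Each page starts from its share of the rank source ...
        new_ranks = {p: (1.0 - damping) * source[p] for p in out_links}
        # ... and receives a damped, even split of each backlink's rank.
        for page, targets in out_links.items():
            for target in targets:
                new_ranks[target] += damping * ranks[page] / len(targets)
        ranks = new_ranks
    return ranks


# Hypothetical chain of pages with links in both directions: a - b - c - d.
out_links = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

uniform = {p: 0.25 for p in out_links}
only_a = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0}
only_d = {"a": 0.0, "b": 0.0, "c": 0.0, "d": 1.0}

rank_uniform = personalized_rank(out_links, uniform)
rank_a = personalized_rank(out_links, only_a)
rank_d = personalized_rank(out_links, only_d)

for name, result in [("uniform", rank_uniform), ("only a", rank_a), ("only d", rank_d)]:
    print(name, {p: round(r, 3) for p, r in result.items()})
```

With a uniform source the two ends of the chain rank equally; putting all of the weight on one end pulls rank toward that end, which mirrors the variation the slide describes for single-page rank sources.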
  18. Applications I
      – Estimate web traffic
        – Server/page aliases
        – Link/traffic disparity, e.g. porn sites, free web-mail
      – Backlink predictor
        – Citation counts have been used to predict future citations
        – It is very difficult to map the citation structure of the web completely
        – PageRank avoids the local maxima that citation counts get stuck in and gives better performance
  19. Applications II: Ranking Proxy
      – Surfer's navigation aid
      – Annotates links with their PageRank (bar graph)
      – Not query dependent
  20. Issues
      – Users are not random walkers
        – Content-based methods
      – Starting point distribution
        – Actual usage data as starting vector
      – Reinforcing effects / bias towards main pages
      – How about traffic to ranking pages?
      – No query-specific rank
      – Linkage spam
        – PageRank favors pages that managed to get other pages to link to them
        – Linkage is not necessarily a sign of relevancy, only of promotion (advertisement…)
  21. Evaluation I
  22. Evaluation II
  23. Conclusion
      – PageRank is a global ranking based on the web's graph structure
      – PageRank uses backlink information to bring order to the web
      – PageRank can separate out representative pages as cluster centers
      – A great variety of applications
