Introduction to Finance



  1. 1. The PageRank Citation Ranking: Bringing Order to the Web. Larry Page et al., Stanford University. Presented by Guoqiang Su & Wei Li
  2. 2. Contents <ul><li>Motivation </li></ul><ul><li>Related work </li></ul><ul><li>PageRank & Random Surfer Model </li></ul><ul><li>Implementation </li></ul><ul><li>Application </li></ul><ul><li>Conclusion </li></ul>
  3. 3. Motivation <ul><li>The web is heterogeneous and unstructured </li></ul><ul><li>No quality control on the web </li></ul><ul><li>Commercial interest in manipulating rankings </li></ul>
  4. 4. Related Work <ul><li>Academic citation analysis </li></ul><ul><li>Link-based analysis </li></ul><ul><li>Clustering methods of link structure </li></ul><ul><li>Hubs & Authorities Model </li></ul>
  5. 5. Backlink <ul><li>Link Structure of the Web </li></ul><ul><li>Approximation of importance / quality </li></ul>
  6. 6. PageRank <ul><li>Pages with lots of backlinks are important </li></ul><ul><li>Backlinks coming from important pages convey more importance to a page </li></ul><ul><li>Problem: Rank Sink </li></ul>
  7. 7. Rank Sink <ul><li>A cycle of pages that is pointed to by some incoming link but has no links leaving the cycle </li></ul><ul><li>Problem: this loop will accumulate rank but never distribute any rank outside </li></ul>
  8. 8. Escape Term <ul><li>Solution: Rank Source </li></ul><ul><li>c is maximized and ||R'||_1 = 1 </li></ul><ul><li>E(u) is some vector over the web pages that acts as a source of rank </li></ul><ul><li>– uniform, a single favorite page, etc. </li></ul>
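
The formula this slide annotates did not survive in the transcript. As stated in the PageRank paper, with B_u the set of pages that link to u and N_v the number of outgoing links on page v, the rank with an escape term is:

```latex
R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u),
\qquad \text{such that } c \text{ is maximized and } \lVert R' \rVert_1 = 1
```

The first term is the rank inherited from backlinks; the second term, c E(u), is the rank injected by the rank source, which is what lets rank escape from a sink.
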
  9. 9. Matrix Notation <ul><li>R is the dominant eigenvector, and c the associated dominant eigenvalue, of the link matrix, because c is maximized </li></ul>
  10. 10. Computing PageRank <ul><li>R_0 ← S (initialize vector over web pages) </li></ul><ul><li>loop: </li></ul><ul><li>R_{i+1} ← A R_i (new ranks: sum of normalized backlink ranks) </li></ul><ul><li>d ← ||R_i||_1 − ||R_{i+1}||_1 (compute normalizing factor) </li></ul><ul><li>R_{i+1} ← R_{i+1} + d E (add escape term) </li></ul><ul><li>δ ← ||R_{i+1} − R_i||_1 (control parameter) </li></ul><ul><li>while δ > ε (stop when converged) </li></ul>
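
The loop on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `pagerank`, the dictionary-based graph format, the uniform default for the rank source E, and the `eps` and `max_iter` parameters are all assumptions made for the example.

```python
def pagerank(links, source=None, eps=1e-8, max_iter=1000):
    """Iterate R <- A R + d E until the L1 change drops below eps.

    links:  dict mapping each page to the list of pages it links to.
    source: rank source E as a dict summing to 1 (assumed uniform if omitted).
    """
    pages = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(pages)
    if source is None:
        source = {p: 1.0 / n for p in pages}

    rank = {p: 1.0 / n for p in pages}  # R_0 <- S (uniform start, an assumption)
    for _ in range(max_iter):
        # R_{i+1} <- A R_i: each page splits its rank evenly over its out-links
        new = {p: 0.0 for p in pages}
        for u, outs in links.items():
            if outs:
                share = rank[u] / len(outs)
                for v in outs:
                    new[v] += share
        # d <- ||R_i||_1 - ||R_{i+1}||_1: rank lost at pages without out-links
        d = sum(rank.values()) - sum(new.values())
        # R_{i+1} <- R_{i+1} + d E: hand the lost rank back via the rank source
        for p in pages:
            new[p] += d * source.get(p, 0.0)
        # delta <- ||R_{i+1} - R_i||_1
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta <= eps:
            break
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 4))
```

Note that in this formulation the escape term only redistributes rank that was lost in the multiplication step, so the total rank stays at 1 across iterations.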
  11. 11. Random Surfer Model <ul><li>PageRank corresponds to the standing probability distribution of a random walk on the web graph </li></ul><ul><li>E(u) can be interpreted as the random surfer periodically getting bored and jumping to a different page, so the surfer is not kept in a loop forever </li></ul>
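
The random surfer interpretation can be illustrated with a small simulation that counts how often each page is visited. This is only a sketch: the function name `random_surfer`, the uniform jump target, and the boredom probability of 0.15 are assumptions chosen for the example, not values taken from the slides.

```python
import random


def random_surfer(links, steps=100_000, bored=0.15, seed=0):
    """Estimate visit frequencies of a surfer who follows random links and
    occasionally jumps to a random page.

    links: dict mapping each page to the list of pages it links to.
    bored: probability of jumping instead of following a link (assumed value).
    """
    rng = random.Random(seed)
    pages = sorted(set(links) | {v for outs in links.values() for v in outs})
    visits = {p: 0 for p in pages}

    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        outs = links.get(page, [])
        if not outs or rng.random() < bored:
            # Bored, or stuck on a page with no out-links: jump anywhere
            page = rng.choice(pages)
        else:
            page = rng.choice(outs)
    return {p: visits[p] / steps for p in pages}


if __name__ == "__main__":
    # "b" and "c" form a rank sink that is entered from "a"
    sink_graph = {"a": ["b"], "b": ["c"], "c": ["b"]}
    print(random_surfer(sink_graph))
```

Without the jump, the surfer in this example would enter the b-c loop and never leave it; with the jump, page a still receives a small but nonzero share of visits.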
  12. 12. Implementation <ul><li>Computing resources </li></ul><ul><li>– 24 million pages </li></ul><ul><li>– 75 million URLs </li></ul><ul><li>Memory and disk storage </li></ul><ul><li>– Weight vector (4-byte floats) </li></ul><ul><li>– Matrix A (linear access) </li></ul>
  13. 13. Implementation (cont'd) <ul><li>Assign a unique integer ID to each URL </li></ul><ul><li>Sort the links and remove dangling links </li></ul><ul><li>Assign initial ranks </li></ul><ul><li>Iterate until convergence </li></ul><ul><li>Add the dangling links back and recompute </li></ul>
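
The first two preprocessing steps on this slide can be sketched as follows. The helper names `assign_ids` and `remove_dangling`, the edge-list input format, and the `passes` limit are hypothetical and chosen for illustration; removing links whose target has no outgoing links can expose new dangling links, which is why the removal is repeated.

```python
def assign_ids(edges):
    """Map every URL in a list of (source, target) links to a unique integer ID,
    numbered in order of first appearance."""
    ids = {}
    for src, dst in edges:
        for url in (src, dst):
            ids.setdefault(url, len(ids))
    return ids


def remove_dangling(edges, passes=5):
    """Drop links that point to pages without outgoing links.

    The removal is repeated up to `passes` times (an assumed limit), stopping
    early once a pass removes nothing.
    """
    edges = set(edges)
    for _ in range(passes):
        has_out = {src for src, _ in edges}
        kept = {(src, dst) for src, dst in edges if dst in has_out}
        if kept == edges:
            break
        edges = kept
    return edges


if __name__ == "__main__":
    web = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d")]
    print(assign_ids(web))
    print(sorted(remove_dangling(web)))
```

After ranks have converged on the reduced graph, the removed links would be added back and the ranks recomputed, as the last bullet of the slide describes.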
  14. 14. Convergence Properties <ul><li>Graph (V, E) is an expander with factor α if for all (not too large) subsets S: |A(S)| ≥ α|S|, where A(S) is the neighborhood of S (the nodes reachable via out-edges from S) </li></ul><ul><li>Eigenvalue separation: the largest eigenvalue is sufficiently larger than the second-largest eigenvalue </li></ul><ul><li>A random walk on such a graph converges quickly to a limiting probability distribution over the nodes of the graph </li></ul>
  15. 15. Convergence Properties (cont'd) <ul><li>The number of iterations PageRank needs to converge grows roughly as O(log |V|), because the web graph G is rapidly mixing </li></ul>
  16. 16. Personalized PageRank <ul><li>Rank source E can be initialized: </li></ul><ul><li>– uniformly over all pages: pages such as copyright warnings, disclaimers, and mailing list archives end up ranked too high </li></ul><ul><li>– with all weight on a single page, e.g. Netscape or McCarthy: ranks vary greatly depending on which page is the rank source </li></ul><ul><li>– anything in between, e.g. server root pages: allows manipulation by commercial interests </li></ul>
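
How strongly the choice of rank source shapes the result can be seen in a small experiment. The sketch below uses the common damped formulation, in which a fixed fraction of rank is redistributed through E on every iteration; this is a swapped-in variant rather than the exact loop from the slides, and the name `personalized_pagerank`, the jump fraction 0.15, and the fixed iteration count are assumptions.

```python
def personalized_pagerank(links, source, jump=0.15, iters=100):
    """Damped PageRank in which all redistributed rank goes to the rank source.

    links:  dict mapping each page to the list of pages it links to.
    source: dict of non-negative weights defining the rank source E.
    jump:   fraction of rank sent through E each iteration (assumed value).
    """
    pages = sorted(set(links) | {v for outs in links.values() for v in outs})
    total = sum(source.get(p, 0.0) for p in pages)
    e = {p: source.get(p, 0.0) / total for p in pages}  # normalize E to sum to 1

    rank = dict(e)
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for u in pages:
            outs = links.get(u, [])
            if outs:
                share = (1.0 - jump) * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
        # Rank not passed along links (jump fraction plus dangling pages)
        lost = 1.0 - sum(new.values())
        for p in pages:
            new[p] += lost * e[p]
        rank = new
    return rank


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a"], "c": ["d"], "d": ["c"]}
    print(personalized_pagerank(graph, {"a": 1.0}))  # all weight on page "a"
    print(personalized_pagerank(graph, {"c": 1.0}))  # all weight on page "c"
```

With all weight on a single page, only the pages reachable from that page receive rank, which is a small-scale version of the variation the slide describes.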
  17. 17. Applications I <ul><li>Estimate web traffic </li></ul><ul><li>– Server/page aliases </li></ul><ul><li>– Link/traffic disparity, e.g. porn sites, free web-mail </li></ul><ul><li>Backlink predictor </li></ul><ul><li>– Citation counts have been used to predict future citations </li></ul><ul><li>– It is very difficult to map the citation structure of the web completely </li></ul><ul><li>– PageRank avoids the local maxima that citation counts get stuck in and performs better </li></ul>
  18. 18. Applications II - Ranking Proxy <ul><li>Surfer's Navigation Aid </li></ul><ul><li>Annotating links by PageRank (bar graph) </li></ul><ul><li>Not query dependent </li></ul>
  19. 19. Issues <ul><li>Users are not random walkers </li></ul><ul><li>– Content-based methods </li></ul><ul><li>Starting point distribution </li></ul><ul><li>– Actual usage data as starting vector </li></ul><ul><li>Reinforcing effects/bias towards main pages </li></ul><ul><li>How about traffic to ranking pages? </li></ul><ul><li>No query-specific rank </li></ul><ul><li>Linkage spam </li></ul><ul><li>– PageRank favors pages that managed to get other pages to link to them </li></ul><ul><li>– Linkage is not necessarily a sign of relevancy, only of promotion (advertisement…) </li></ul>
  20. 20. Evaluation I
  21. 21. Evaluation II
  22. 22. Conclusion <ul><li>PageRank is a global ranking based on the web's graph structure </li></ul><ul><li>PageRank uses backlink information to bring order to the web </li></ul><ul><li>PageRank can separate out representative pages as cluster centers </li></ul><ul><li>A great variety of applications </li></ul>