The PageRank Citation Ranking: Bringing Order to the Web Larry Page etc. Stanford University Presented by Guoqiang Su & Wei Li
Contents Motivation Related work Page Rank & Random Surfer Model Implementation Application Conclusion
Motivation Web: heterogeneous and unstructured Free of quality control on the web Commercial interest to manipulate ranking
Related Work Academic citation analysis Link-based analysis Clustering methods of link structure Hubs & Authorities Model
Backlink Link Structure of the Web Approximation of importance / quality
PageRank Pages with lots of backlinks are important Backlinks coming from important pages convey more importance to a page Problem: Rank Sink
Rank Sink Page cycles pointed by some incoming link Problem: this loop will accumulate rank but never distribute any rank outside
Escape Term Solution: Rank Source c is maximized and  = 1 E(u) is some vector over the web pages – uniform, favorite page etc.
Matrix Notation R is the dominant eigenvector and c is the dominant eigenvalue of  because c is maximized
Computing PageRank - initialize vector over web pages loop: - new ranks sum of normalized backlink ranks      - compute normalizing factor   - add escape term   - control parameter while  - stop when converged
Random Surfer Model Page Rank corresponds to the probability distribution of a random walk on the web graphs E(u) can be re-phrased as the random surfer gets bored periodically and jumps to a different page and not kept in a loop forever
Implementation Computing resources —  24 million pages —  75 million URLs Memory and disk storage Weight Vector  (4 byte float)   Matrix  A  (linear access)
Implementation (Con't) Unique integer ID for each URL Sort and Remove dangling links Rank initial assignment Iteration until convergence Add back dangling links and Re-compute
Convergence Properties Graph (V, E) is an  expander  with factor    if for all (not too large) subsets S: |As|     |s| Eigenvalue separation:  Largest eigenvalue is sufficiently larger than the second-largest eigenvalue Random walk  converges fast to a limiting probability distribution on a set of nodes in the graph.
Convergence Properties (con't) PageRank computation is O(log(|V|)) due to rapidly  mixing graph  G of the web.
Personalized PageRank Rank Source E can be initialized : –  uniformly over all pages:  e.g. copyright  warnings, disclaimers, mailing lists archives    result in overly high ranking –  total weight on a single page,  e.g .  Netscape, McCarthy    great variation of ranks under different single pages  as rank source –  and everything in-between,  e.g. server root pages    allow manipulation by commercial interests
Applications I Estimate web traffic –  Server/page aliases –  Link/traffic disparity, e.g. porn sites, free web-mail Backlink predictor –  Citation counts have been used to predict future citations  –  very difficult to map the citation structure of the web completely –  avoid the local maxima that citation counts get stuck in and get better performance
Applications II - Ranking Proxy Surfer's Navigation Aid Annotating links by PageRank (bar graph) Not query dependent
Issues Users are no random walkers –  Content based methods Starting point distribution   –  Actual usage data as starting vector Reinforcing effects/bias towards main pages How about traffic to ranking pages? No query specific rank Linkage spam –  PageRank favors pages that managed to get other pages to link to  them –  Linkage not necessarily a sign of relevancy, only of promotion  (advertisement…)
Evaluation I
Evaluation II
Conclusion PageRank is a global ranking based on the web's graph structure PageRank use backlinks information to bring order to the web PageRank can separate out representative pages as cluster center A great variety of applications

Pagerank

  • 1.
    The PageRank CitationRanking: Bringing Order to the Web Larry Page etc. Stanford University Presented by Guoqiang Su & Wei Li
  • 2.
    Contents Motivation Relatedwork Page Rank & Random Surfer Model Implementation Application Conclusion
  • 3.
    Motivation Web: heterogeneousand unstructured Free of quality control on the web Commercial interest to manipulate ranking
  • 4.
    Related Work Academiccitation analysis Link-based analysis Clustering methods of link structure Hubs & Authorities Model
  • 5.
    Backlink Link Structureof the Web Approximation of importance / quality
  • 6.
    PageRank Pages withlots of backlinks are important Backlinks coming from important pages convey more importance to a page Problem: Rank Sink
  • 7.
    Rank Sink Pagecycles pointed by some incoming link Problem: this loop will accumulate rank but never distribute any rank outside
  • 8.
    Escape Term Solution:Rank Source c is maximized and = 1 E(u) is some vector over the web pages – uniform, favorite page etc.
  • 9.
    Matrix Notation Ris the dominant eigenvector and c is the dominant eigenvalue of because c is maximized
  • 10.
    Computing PageRank -initialize vector over web pages loop: - new ranks sum of normalized backlink ranks - compute normalizing factor - add escape term - control parameter while - stop when converged
  • 11.
    Random Surfer ModelPage Rank corresponds to the probability distribution of a random walk on the web graphs E(u) can be re-phrased as the random surfer gets bored periodically and jumps to a different page and not kept in a loop forever
  • 12.
    Implementation Computing resources— 24 million pages — 75 million URLs Memory and disk storage Weight Vector (4 byte float) Matrix A (linear access)
  • 13.
    Implementation (Con't) Uniqueinteger ID for each URL Sort and Remove dangling links Rank initial assignment Iteration until convergence Add back dangling links and Re-compute
  • 14.
    Convergence Properties Graph(V, E) is an expander with factor  if for all (not too large) subsets S: |As|   |s| Eigenvalue separation: Largest eigenvalue is sufficiently larger than the second-largest eigenvalue Random walk converges fast to a limiting probability distribution on a set of nodes in the graph.
  • 15.
    Convergence Properties (con't)PageRank computation is O(log(|V|)) due to rapidly mixing graph G of the web.
  • 16.
    Personalized PageRank RankSource E can be initialized : – uniformly over all pages: e.g. copyright warnings, disclaimers, mailing lists archives  result in overly high ranking – total weight on a single page, e.g . Netscape, McCarthy  great variation of ranks under different single pages as rank source – and everything in-between, e.g. server root pages  allow manipulation by commercial interests
  • 17.
    Applications I Estimateweb traffic – Server/page aliases – Link/traffic disparity, e.g. porn sites, free web-mail Backlink predictor – Citation counts have been used to predict future citations – very difficult to map the citation structure of the web completely – avoid the local maxima that citation counts get stuck in and get better performance
  • 18.
    Applications II -Ranking Proxy Surfer's Navigation Aid Annotating links by PageRank (bar graph) Not query dependent
  • 19.
    Issues Users areno random walkers – Content based methods Starting point distribution – Actual usage data as starting vector Reinforcing effects/bias towards main pages How about traffic to ranking pages? No query specific rank Linkage spam – PageRank favors pages that managed to get other pages to link to them – Linkage not necessarily a sign of relevancy, only of promotion (advertisement…)
  • 20.
  • 21.
  • 22.
    Conclusion PageRank isa global ranking based on the web's graph structure PageRank use backlinks information to bring order to the web PageRank can separate out representative pages as cluster center A great variety of applications