PageRank and Related Methods

From Cloud, 3 months ago

DERI Reading Group / DERI, NUI Galway / 12th April 2006

Slideshow transcript

Slide 1: PageRank™ and Related Methods John@Bresl.in Reading Group 12th April 2006

Slide 2: Overview 1. Previously 2. What is “PageRank”? (links as votes) 3. Algorithms to calculate PageRank: Simplified algorithm Algorithm with damping factor 4. The different PageRank values (actual, toolbar, ODP) 5. Manipulating PageRank ratings 6. PageRank and the Semantic Web 1. Previously 2. What is it? 3. Algorithms 4. Values 5. Manipulating 6. Semantic Web

Slide 3: Similar work around this time. [Kleinberg, 1998]: Authoritative Sources in a Hyperlinked Environment. “An algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of ‘hub pages’ that join them together in the link structure.” The HITS algorithm is used by Ask.com and Teoma. [Chakrabarti et al., 1998]: Experiments in Topic Distillation. Added extra heuristics to Kleinberg’s algorithms. Part of the CLEVER project from IBM.

Slide 4: Query dependency. Query dependent: rank a small subset of pages related to a specific query; HITS [Kleinberg, 1998] was proposed as query dependent. Query independent: rank the whole Web; PageRank [Brin and Page, 1998] was proposed as query independent.

Slide 5: Brin, Page, PageRank, Google. Developed at Stanford University by Larry Page and Sergey Brin. A research project about a new kind of search engine. Started in 1995 and led to a functional prototype in 1998. Shortly after, Page and Brin founded the search engine company Google Inc. Google still uses PageRank as a key element.

Slide 6: Definition of PageRank. A patented method to assign a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web. Its purpose is to “measure” an element’s relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element E is also called the PageRank of E and denoted by PR(E).

Slide 7: PageRank for dummies. PageRank is basically a numeric value that represents how important a page is on the Web. PageRank is a topic much discussed by Search Engine Optimisation (SEO) experts. At the heart of PageRank is a mathematical formula that may seem scary for non-scientists to look at but is actually fairly simple to understand. Good authorities should be pointed to by good authorities.

Slide 8: Links as votes (1). Google figures that when one page links to another page, it is effectively casting a vote for the other page. The more votes that are cast for a page, the more important that page must be. Also, the importance of the page that is casting the vote determines how important the vote itself is. The importance of each vote is taken into account when a page's PageRank is calculated.

Slide 9: Links as votes (2). Therefore, a link from page p to page q denotes endorsement: page p considers page q an authority on a subject. We can therefore mine this web graph of recommendations. An authority value or PageRank (PR) value is assigned to every page (document).

Slide 10: Basic in-degree ranking. Pages can be ranked according to “in degree”: w_i = number of incoming links to page i. [Figure: a five-page graph with in-degree weights w = 3, 2, 2, 1, 1, giving the ranking 1. Red Page 2. Yellow Page 3. Blue Page 4. Green Page 5. Purple Page]
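
In-degree ranking can be sketched in a few lines. The edge list below is hypothetical (it is not the graph drawn on the slide), and the function name `in_degree_ranking` is an illustrative choice; ties are broken alphabetically, which the slide does not specify.

```python
from collections import Counter

def in_degree_ranking(links):
    """Rank pages by their number of incoming links (ties broken by name)."""
    in_degree = Counter(target for _, target in links)
    pages = {page for edge in links for page in edge}
    # Counter returns 0 for pages that nobody links to.
    return sorted(pages, key=lambda page: (-in_degree[page], page))

# Hypothetical (source, target) pairs, for illustration only.
links = [("blue", "red"), ("green", "red"), ("purple", "red"),
         ("red", "yellow"), ("green", "yellow"), ("red", "blue")]
ranking = in_degree_ranking(links)
```

Every link counts the same here, whatever the importance of the page it comes from; that is the limitation PageRank addresses.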

Slide 11: The PageRank algorithm. A probability distribution (0 to 1) represents the likelihood that a person randomly clicking on links will arrive at a particular page. It works for a collection of documents of any size. The distribution is evenly divided between documents at the beginning, and then PageRank computations require several passes or “iterations” towards the true values. A document with a PageRank of 0.5 means that there is a 50% chance that a person clicking on a random link will arrive at that document.

Slide 12: Simple PageRank algorithm (1). Imagine that there are only four web pages in the Universe (A, B, C, D). The initial approximation here is to divide PR equally (0.25 each) amongst all documents. Let’s say that pages B, C and D all have one outgoing link to A; then they each confer 0.25 PR to A: PR(A) = PR(B) + PR(C) + PR(D), so PR(A) = 0.75. However, this doesn’t take into account the number of other links from B, C, D.

Slide 13: Simple PageRank algorithm (2). So let’s say that B also has a link to C, and D has three outgoing links… Half of B’s PR vote is given to A, half to C. One third of D’s PR vote is given to A. PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3. The PR conferred by an outgoing link is the document’s own PR divided by its number of outgoing links (specific URLs only count once): PR(A) = PR(t1)/C(t1) + … + PR(tn)/C(tn).
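
As a minimal sketch, the simplified formula can be evaluated for the four-page example. The slide does not say where D’s other two links point or whether A links anywhere, so those parts of `out_links` are assumptions; the function name is also illustrative.

```python
def simple_pagerank_of(page, pr, out_links):
    """Sum PR(t)/C(t) over every page t that links to `page`."""
    return sum(pr[t] / len(targets)
               for t, targets in out_links.items()
               if page in targets)

pr = {p: 0.25 for p in "ABCD"}   # initial approximation: equal shares
out_links = {
    "A": set(),                  # assumption: A's links are not given on the slide
    "B": {"A", "C"},
    "C": {"A"},
    "D": {"A", "B", "C"},        # assumption: only "three outgoing links" is stated
}
# Sets are used so that a repeated URL counts only once.
pr_a = simple_pagerank_of("A", pr, out_links)
```

With these inputs `pr_a` is 0.25/2 + 0.25/1 + 0.25/3, roughly 0.458, compared with 0.75 when the other outgoing links were ignored.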

Slide 14: Algorithm with damping factor. PageRank theory holds that a surfer randomly clicking on web links will eventually stop clicking (a good theory!). This is accounted for by a damping factor, d (around 0.85)… PR(A) = (1 - d) + d(PR(t1)/C(t1) + … + PR(tn)/C(tn)). As the number of documents increases, the initial approximation of PR for each document decreases. The model forming the basis of the PageRank algorithm is therefore a random walk through all the pages of the Web. This random walk can be thought of as a Markov chain where pages = states and links = edges / transitions.

Slide 15: Different PR formulas. Brin and Page gave the formula on the previous overhead. But they also say that the total sum of PRs is 1. Apparently this should be the “normalised sum”, or the “average” to you and me. This has led to some confusion, so this alternate formula is often given: $PR(A) = \frac{1 - d}{N} + d \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$, where N is the total number of documents in the collection and n is the total number of documents linking to A.
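
A short derivation may help to see why the two formulas differ only by a factor of N. It assumes that every page has at least one outgoing link (no sinks), which the slide does not state; S denotes the sum of all PR values and the inner sum runs over the pages t that link to A.

```latex
\begin{align*}
\text{Original formula:}\quad
S &= \sum_{A} PR(A)
   = N(1-d) + d \sum_{A} \sum_{t \to A} \frac{PR(t)}{C(t)}
   = N(1-d) + d\,S
   &&\Longrightarrow\; S = N \\
\text{Alternate formula:}\quad
S &= \sum_{A} PR(A)
   = (1-d) + d \sum_{A} \sum_{t \to A} \frac{PR(t)}{C(t)}
   = (1-d) + d\,S
   &&\Longrightarrow\; S = 1
\end{align*}
```

The double sum equals S because each page t passes PR(t)/C(t) along each of its C(t) links. So the original formula gives values whose average is 1, and the alternate formula gives values whose sum is 1.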

Slide 16: Example of PR calculation. [Figure: link graph over Page A, Page B, Page C and Page D]

Slide 17: PR before any iterations. Every page starts with PR = 1. [Figure: the four-page graph with each link labelled by the PR it transfers: A → B: 1*0.85/2; A → C: 1*0.85/2; B → C: 1*0.85; C → A: 1*0.85; D → C: 1*0.85]

Slide 18: During first iteration. Each page retains 0.15 that it does not pass on, so we get: Page A: 0.85 (from Page C) + 0.15 (not transferred) = 1. Page B: 0.425 (from Page A) + 0.15 (not transferred) = 0.575. Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425 (from Page A) + 0.15 (not transferred) = 2.275. Page D: receives none, but has not transferred 0.15 = 0.15.

Slide 19: PR after first iteration. Page A: 1; Page B: 0.575; Page C: 2.275; Page D: 0.15.

Slide 20: During second iteration. Each page still retains 0.15 of PR, so we get: Page A: 2.275*0.85 (from Page C) + 0.15 (not transferred) = 2.08375. Page B: 1*0.85/2 (from Page A) + 0.15 (not transferred) = 0.575. Page C: 0.15*0.85 (from Page D) + 0.575*0.85 (from Page B) + 1*0.85/2 (from Page A) + 0.15 (not transferred) = 1.19125. Page D: receives none, but has not transferred 0.15 = 0.15.

Slide 21: PR after second iteration. Page A: 2.08375; Page B: 0.575; Page C: 1.19125; Page D: 0.15.

Slide 22: PR after 20 iterations. Page A: 1.490; Page B: 0.783; Page C: 1.577; Page D: 0.15.

Slide 23: Example conclusions. Page C has the highest PageRank, and page A has the next highest: page C has the highest importance in this page graph! It has also passed on some of its authority to page A. More iterations lead to convergence of PageRank values.
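
The worked example on Slides 16 to 23 can be reproduced with a short sketch of the damped formula PR(A) = (1 - d) + d(PR(t1)/C(t1) + … + PR(tn)/C(tn)). The link structure is inferred from the transfers listed on Slides 17 and 18, and the function name is an illustrative choice.

```python
def pagerank_unnormalised(out_links, d=0.85, iterations=20):
    """Iterate PR(A) = (1 - d) + d * sum(PR(t) / C(t)), starting from PR = 1."""
    pr = {page: 1.0 for page in out_links}
    for _ in range(iterations):
        # Every page keeps the share (1 - d) that is never transferred.
        new_pr = {page: 1.0 - d for page in out_links}
        for source, targets in out_links.items():
            for target in targets:
                new_pr[target] += d * pr[source] / len(targets)
        pr = new_pr
    return pr

# Inferred links: A -> B, A -> C, B -> C, C -> A, D -> C.
example = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
after_1 = pagerank_unnormalised(example, iterations=1)
after_2 = pagerank_unnormalised(example, iterations=2)
after_20 = pagerank_unnormalised(example, iterations=20)
```

`after_1` and `after_2` match the values worked out for the first and second iterations, and `after_20` agrees with the Slide 22 values to about three decimal places. The four values always sum to 4, the number of pages.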

Slide 24: Matrix view using Markov chains. A Markov chain (MC) describes a discrete time stochastic process over a set of states $S = \{s_1, s_2, \ldots, s_n\}$ according to a transition probability matrix $P = \{P_{ij}\}$. $P_{ij}$ is the probability of moving to state j when at state i, so $\sum_j P_{ij} = 1$. Memorylessness property: the next state of the chain depends only on the current state and not on the past states of the process (first order MC). Higher order MCs are also possible.

Slide 25: Random walks and Markov. Random walks on Web page graphs correspond to Markov chains. The set of states S is the set of nodes of the graph. The transition probability matrix P holds the probabilities of following an edge from one node to another. The state probability vector $q^t = (q^t_1, q^t_2, \ldots, q^t_n)$ stores in its i-th entry the probability of being at state i at time t; $q^0_i$ is the probability of starting (time = 0) from state i, and $q^t = q^{t-1} P$.

Slide 26: Transition probability matrix. For the example graph with nodes v1 to v5, the adjacency matrix is $L = \begin{bmatrix} 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \end{bmatrix}$ and the transition probability matrix is $P = \begin{bmatrix} 0 & 1/2 & 1/2 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 1/3 & 1/3 & 1/3 & 0 & 0 \\ 1/2 & 0 & 0 & 1/2 & 0 \end{bmatrix}$
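
The step from L to P is a row normalisation: each row of the adjacency matrix is divided by that node’s out-degree. A minimal sketch, using exact fractions so that the result can be compared with the slide (the function name is illustrative):

```python
from fractions import Fraction

def transition_matrix(L):
    """Divide each row of the adjacency matrix by its out-degree."""
    P = []
    for row in L:
        out_degree = sum(row)
        # A row of zeros (a page without outgoing links) is left as zeros here.
        P.append([Fraction(x, out_degree) if out_degree else Fraction(0)
                  for x in row])
    return P

L = [[0, 1, 1, 0, 0],
     [0, 0, 0, 0, 1],
     [0, 1, 0, 0, 0],
     [1, 1, 1, 0, 0],
     [1, 0, 0, 1, 0]]
P = transition_matrix(L)
```

Each row of `P` sums to 1, as required of a transition probability matrix.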

Slide 27: Vanilla PageRank random walk. $P = \begin{bmatrix} 0 & 1/2 & 1/2 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 1/3 & 1/3 & 1/3 & 0 & 0 \\ 1/2 & 0 & 0 & 1/2 & 0 \end{bmatrix}$

Slide 28: Random walk encounters sink. $P = \begin{bmatrix} 0 & 1/2 & 1/2 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 1/3 & 1/3 & 1/3 & 0 & 0 \\ 1/2 & 0 & 0 & 1/2 & 0 \end{bmatrix}$

Slide 29: What is a sink node or page? If a page has no links to other pages, it becomes a sink and therefore terminates the random surfing process. However, the solution is quite simple… If the random surfer arrives at a sink page, it picks another URL at random and continues surfing again! When calculating PageRank, pages with no outbound links are assumed to artificially link out to all other pages in the collection (“teleporters”). Their PageRank scores are therefore divided evenly among all other pages (to be fair to pages that aren’t sinks).

Slide 30: Replace sinks with teleporters. $P' = \begin{bmatrix} 0 & 1/2 & 1/2 & 0 & 0 \\ 1/5 & 1/5 & 1/5 & 1/5 & 1/5 \\ 0 & 1 & 0 & 0 & 0 \\ 1/3 & 1/3 & 1/3 & 0 & 0 \\ 1/2 & 0 & 0 & 1/2 & 0 \end{bmatrix}$
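
The replacement of sinks by teleporters can be sketched as follows. This follows the matrix P' above, where the sink row becomes 1/5 in every column (including the sink’s own column); the function name is an illustrative choice.

```python
from fractions import Fraction

def replace_sinks(P):
    """Replace every all-zero row with a uniform row of 1/n."""
    n = len(P)
    uniform = [Fraction(1, n)] * n
    return [list(uniform) if not any(row) else list(row) for row in P]

half, third = Fraction(1, 2), Fraction(1, 3)
# P from the sink example: the second page has no outgoing links.
P = [[0, half, half, 0, 0],
     [0, 0, 0, 0, 0],
     [0, 1, 0, 0, 0],
     [third, third, third, 0, 0],
     [half, 0, 0, half, 0]]
P_prime = replace_sinks(P)
```

After the replacement every row sums to 1 again, so the random walk can always continue.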

Slide 31: State probabilities. For the graph with nodes v1 to v5 and $P = \begin{bmatrix} 0 & 1/2 & 1/2 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 1/3 & 1/3 & 1/3 & 0 & 0 \\ 1/2 & 0 & 0 & 1/2 & 0 \end{bmatrix}$, the state probabilities evolve as: $q^{t+1}_1 = \frac{1}{3} q^t_4 + \frac{1}{2} q^t_5$; $q^{t+1}_2 = \frac{1}{2} q^t_1 + q^t_3 + \frac{1}{3} q^t_4$; $q^{t+1}_3 = \frac{1}{2} q^t_1 + \frac{1}{3} q^t_4$; $q^{t+1}_4 = \frac{1}{2} q^t_5$; $q^{t+1}_5 = q^t_2$.
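
These five equations are just the row-vector product of the state probabilities with P, written out per component. A small sketch makes that concrete; the starting distribution `q_t` is arbitrary and chosen only for illustration.

```python
def step(q, P):
    """Return q P: the state probabilities after one more step of the walk."""
    n = len(P)
    return [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]

P = [[0, 1/2, 1/2, 0, 0],
     [0, 0, 0, 0, 1],
     [0, 1, 0, 0, 0],
     [1/3, 1/3, 1/3, 0, 0],
     [1/2, 0, 0, 1/2, 0]]
q_t = [0.1, 0.2, 0.3, 0.25, 0.15]   # arbitrary distribution (sums to 1)
q_next = step(q_t, P)
```

Column j of P lists the probabilities of arriving at state j, which is why, for example, the first entry of `q_next` only involves the fourth and fifth entries of `q_t`.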

Slide 32: Eigenvectors and power method. The PageRank of all pages can be calculated as the principal eigenvector of the matrix P. The Google matrix P is currently (do a search for “*”) of size 25 × 10^9, and therefore the computation of PageRank values is not trivial. To find an approximation of the principal eigenvector, the power method is used: q0 = initial guess; for t = 1 to 50 { qt = qt-1 P }; return q50. Because the computation involves an extremely large matrix, the matrix-vector multiplications must be implemented in parallel on multi-processor systems.
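
A minimal, single-machine sketch of the power method, applied to the small five-state matrix P from the earlier slides rather than to anything of Web scale. The uniform initial guess and the row-vector convention (q multiplied by P from the left, as in the random-walk slides) are assumptions of this sketch.

```python
def power_method(P, iterations=50):
    """Approximate the stationary row vector q satisfying q = q P."""
    n = len(P)
    q = [1.0 / n] * n                      # initial guess: uniform distribution
    for _ in range(iterations):
        q = [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]
    return q

P = [[0, 1/2, 1/2, 0, 0],
     [0, 0, 0, 0, 1],
     [0, 1, 0, 0, 0],
     [1/3, 1/3, 1/3, 0, 0],
     [1/2, 0, 0, 1/2, 0]]
q = power_method(P)
```

For this matrix the exact stationary vector is (2/11, 3/11, 3/22, 3/22, 3/11), which can be checked by substituting it into the state probability equations; 50 iterations land close to it.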

Slide 33: Other PageRank values. Google Toolbar PR: measured between 1 and 10; the method for deriving this is undisclosed; republished every three months, so may be unreliable for long periods of the year; may be a proxy value of actual PR derived through a logarithmic scale. Google Directory (ODP) PR: eight unit measurement (0 to 7); difficult to see what the value is (uses coloured GIFs).

Slide 34: More than just PageRank (1). There is some evidence to say that Google is using the text in a link’s anchor when deciding the relevance or PR value of a target page, perhaps more so than the page’s PR… From www.google.com/technology: “Google combines PageRank with sophisticated text-matching techniques… Google goes far beyond the number of times a term appears on a page and examines all aspects of the page's content (and the content of the pages linking to it) to determine if it's a good match for your query.”

Slide 35: More than just PageRank (2). Other criteria (may) include: term frequencies; term proximities; term position (title, top of page, etc.); term characteristics (boldface, capitalised, etc.); link analysis information; category information; popularity information.

Slide 36: Manipulating PageRank. People will always find ways to exploit the PR system. There used to be a flaw where a low PR page with no incoming links could be redirected to a high PR page, and would attain the PR of that higher page. For SEO purposes, webmasters often buy links for their sites, and since links from higher PR pages are deemed to be more valuable they are often more expensive… Google frowns on “link farms” buying and selling links, and of course has to regularly devise new methods for identifying such sites and other PR manipulation tools.

Slide 37: rel="nofollow". In early 2005, Google implemented a new attribute, rel="nofollow", for the HTML link element, so that website builders and bloggers can make links that Google will not follow for the purposes of PageRank. They are links that no longer constitute a “vote” in the PageRank system. The nofollow attribute was added in an attempt to help combat blog comment spam.
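
To illustrate the idea, the sketch below collects only those links that would still count as votes, skipping any anchor whose rel attribute contains nofollow. It uses Python’s standard `html.parser`; the class name and the sample HTML are made up for illustration, and this is not a description of how Google processes pages.

```python
from html.parser import HTMLParser

class FollowedLinkParser(HTMLParser):
    """Collect href values of <a> tags that do not carry rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.followed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, or be missing entirely.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if href and "nofollow" not in rel_tokens:
            self.followed.append(href)

# Hypothetical page fragment with one ordinary link and one comment-spam link.
html = ('<a href="http://example.org/article">article</a>'
        '<a rel="nofollow" href="http://example.org/spam">spam</a>')
parser = FollowedLinkParser()
parser.feed(html)
followed = parser.followed
```

Only the links in `followed` would be added as edges when building the link graph for a PageRank computation.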

Slide 38: Filtering of links. Navigational links serve the purpose of moving within a site (or to related sites): www.rte.ie → www.rte.ie/sport; www.yahoo.com → www.yahoo.ie; www.2fm.ie → www.rte.ie. Filter out navigational links: same domain names (www.yahoo.com versus yahoo.com); same IP address, different names. Better to standardise your URLs so that all the PR will go to just one URL.
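
A rough sketch of the simplest of these filters: treat a link as navigational when source and target share the same host name, ignoring a leading "www.". The helper names are hypothetical. This heuristic does not detect related sites under different names (www.yahoo.com → www.yahoo.ie, or www.2fm.ie → www.rte.ie), which would need extra information such as IP addresses.

```python
from urllib.parse import urlsplit

def site_key(url):
    """Reduce a URL to a crude site key: lowercase host without leading 'www.'."""
    # urlsplit only recognises the host when the string contains '//'.
    host = urlsplit(url if "//" in url else "//" + url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def is_navigational(source, target):
    """True when both URLs reduce to the same site key."""
    return site_key(source) == site_key(target)

same_site = is_navigational("www.rte.ie", "www.rte.ie/sport")
same_name = is_navigational("http://www.yahoo.com/", "yahoo.com")
other_name = is_navigational("www.2fm.ie", "www.rte.ie")
```

The same `site_key` idea can be used to standardise URLs before counting links, so that www.yahoo.com and yahoo.com are not scored as two separate pages.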

Slide 39: SemRank for connection search. An approach and framework for ranking complex relationships resulting from a “relationship search”. Semantic associations are based on RDF property sequences (labelled paths in a KB). Different search modes selected using a slider give a variety of result orderings. Can also specify keywords to augment queries, e.g. “enrolls” or “depositsInto”.

Slide 40: Sample domains for SemRank

Slide 41: SemRank conventional mode

Slide 42: SemRank discovery mode

Slide 43: Other research on semantic rank. Semantic searching and ranking of entities on the Semantic Web: Rocha et al., WWW2004; Guha et al., WWW2003; Stojanovic et al., ISWC2003; Zhuge et al., WWW2003. Semantic ranking of relationships: Halaschek, VLDB04 Demo; Aleman-Meza et al., SWDB03.

Slide 44: Other research on PageRank. Specialised PageRank: personalisation [Brin and Page, 1998]: instead of picking a node uniformly at random, favour specific nodes that are related to the user; topic sensitive PageRank [Haveliwala, 2002]: compute many PageRank vectors, one for each topic, estimate the relevance of the query with each topic, and produce the final PageRank as a weighted combination. Updating PageRank [Chien et al., 2002]. Fast computation of PageRank: numerical analysis tricks; node aggregation techniques; dealing with the “Web frontier”. http://citeseer.ist.psu.edu/719287.html
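
The personalisation idea can be sketched by replacing the uniform random jump with a jump drawn from a preference vector. This is a minimal illustration, not the method of any of the cited papers: the function name, the `preference` weights and the reuse of the four-page example graph are all assumptions. It uses the normalised form of the formula, so the scores sum to 1.

```python
def personalised_pagerank(out_links, preference, d=0.85, iterations=50):
    """PageRank whose random jump favours pages according to `preference`."""
    pages = list(out_links)
    total = sum(preference.get(p, 0.0) for p in pages)
    v = {p: preference.get(p, 0.0) / total for p in pages}   # jump distribution
    pr = dict(v)
    for _ in range(iterations):
        new_pr = {p: (1 - d) * v[p] for p in pages}
        for source, targets in out_links.items():
            if targets:
                for target in targets:
                    new_pr[target] += d * pr[source] / len(targets)
            else:
                # Sink pages jump according to the preference vector too.
                for p in pages:
                    new_pr[p] += d * pr[source] * v[p]
        pr = new_pr
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
uniform = personalised_pagerank(graph, {p: 1.0 for p in graph})
biased = personalised_pagerank(graph, {"D": 1.0})   # hypothetical user who favours D
```

Page D has no incoming links, so its score comes entirely from the random jump: it rises from 0.0375 under the uniform jump to 0.15 when the jump always lands on D. A topic-sensitive variant would compute one such vector per topic and combine them with query-dependent weights.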

Slide 45: References (1). Anyanwu et al., “SemRank”, WWW2005 Conference, 2005. Brin and Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, www-db.stanford.edu/~backrub/google.html. Chakrabarti et al., “Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text”, WWW1998 Conference, 1998. Chakrabarti et al., “Experiments in Topic Distillation”, ACM SIGIR WS on Hypertext IR on the Web, 1998. Chien et al., “Link Evolution: Analysis and Algorithms”, Workshop on Algorithms and Models for the Web Graph, 2002. Haveliwala, “Topic-Sensitive PageRank”, WWW2002 Conference, 2002. Kamvar et al., “Extrapolation Methods for Accelerating PageRank Computation”, WWW2003 Conference, 2003.

Slide 46: References (2). Katz, “A New Status Index Derived from Sociometric Analysis”, Psychometrika, 1953. Kleinberg, “Authoritative Sources in a Hyperlinked Environment”, ACM SIAM Symposium on Discrete Algorithms, 1998. Langville, Meyer, “Deeper Inside PageRank”, Internet Mathematics, TR, 2003. Motwani and Raghavan, “Randomized Algorithms”, Cambridge Press, 1995. Phil Craven, “Google's PageRank Explained”, www.webworkshop.net/pagerank.html. Pinski and Narin, “Citation Influence for Journal Aggregates of Scientific Publications”, Information Processing and Management, 1976. Rogers, “The Google Pagerank Algorithm and How It Works”, www.iprcom.com/papers/pagerank

Slide 47: Questions and Answers