Web Search:  Ranking Web Pages
Statistics <ul><li>In 1994, one of the first web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 w...
Search Engines  <ul><li>A search engine is a system which  collects ,  organizes  &  presents  a way to select Web documen...
Search Engines <ul><li>Two categories of search engine </li></ul><ul><ul><li>general-purpose search engine, e.g.  Yahoo !,...
Search Engines <ul><li>Two main components of a search engine: </li></ul><ul><ul><li>web crawler  (spider), which collects...
Search Engine <ul><li>Information Retrieval (IR) is a key to search engine or Web Search. </li></ul><ul><li>Most commonly ...
Search Engine <ul><li>The  Google  is a representative general-purpose search engine (successful, takes many techniques su...
Ranking result pages <ul><li>Based on content </li></ul><ul><ul><li>Based on content </li></ul></ul><ul><ul><li>Number of ...
Problems with content-based ranking <ul><li>Many pages containing search terms may be of poor quality or irrelevant </li><...
Based on link structure <ul><li>Hyperlinks among web pages provide new web search opportunities. </li></ul><ul><li>Our foc...
Backlink <ul><li>A backlink of a page  p  is a link that points to  p </li></ul><ul><li>A page with more backlinks is rank...
PageRank <ul><li>Page et al., &quot;The PageRank Citation Ranking: Brining Order to the Web.&quot;, 1998 </li></ul><ul><li...
PageRank <ul><li>Web can be viewed as a huge directed graph  G(V, E) </li></ul><ul><ul><li>where  V  is the set of web pag...
Basic ideas <ul><li>A page is likely to be important if it is linked by many pages. </li></ul><ul><li>A page is likely to ...
PageRank Definition <ul><li>N( p ): number of outgoing links from page  p </li></ul><ul><li>B( p ): set of pages that poin...
Computing PageRank  <ul><li>Create a stochastic matrix  M  for the link structure </li></ul><ul><ul><li>Each page  i  corr...
Convergence <ul><li>If the ranks converge, i.e., there is a rank vector R such that R   = M    R, R is the eigenvector of...
Example The sum of all PageRank’s is the total number of pages p1 p2 p3 r1 r3 r2 0.5 0 0.5 0   0  0.5  0.5  1 0 = r1 r3 r2
Random surfer model <ul><li>A random surfer </li></ul><ul><ul><li>Starts with a random page </li></ul></ul><ul><ul><li>Ran...
Rank leak  <ul><li>Pages with no outgoing links </li></ul><ul><li>Called a rank leak because eventually all importance   w...
Rank sink <ul><li>A page or a group of pages is a  rank sink  if they can receive rank propagation from its parents but ca...
An example of rank sink p1 p2 p3 r1 r3 r2 0.5 0 0.5 0   1  0.5  0.5  0 0 = r1 r3 r2
Solution <ul><li>d : damping factor </li></ul><ul><li>PageRank( p ) = d  · Σ q ∈ B ( p)  (PageRank( q ) / N ( q )) + (1 - ...
An example p1 p2 p3 the sum of all PageRank’s is  still  the total number of pages.   r1 r3 r2 0.5 0 0.5 0   1  0.5  0.5  ...
Upcoming SlideShare
Loading in …5
×

Ranking Web Pages

1,417 views

Published on

Published in: Technology, Design
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
1,417
On SlideShare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
63
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Ranking Web Pages

  1. 1. Web Search: Ranking Web Pages
  2. 2. Statistics <ul><li>In 1994, one of the first web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 web pages and web accessible documents. </li></ul><ul><li>Up to 09/2005 Google indexes 8,200,000,000 web pages. </li></ul>
  3. 3. Search Engines <ul><li>A search engine is a system which collects , organizes & presents a way to select Web documents based on certain words , phrases , or patterns within documents </li></ul><ul><ul><li>Model the Web as a full-text DB </li></ul></ul><ul><ul><li>Index a portion of the Web docs </li></ul></ul><ul><ul><li>Search Web documents using user-specified words/patterns in a text </li></ul></ul>
  4. 4. Search Engines <ul><li>Two categories of search engine </li></ul><ul><ul><li>general-purpose search engine, e.g. Yahoo !, AltaVista and Google </li></ul></ul><ul><ul><li>special-purpose search engines (or Internet Portals), e.g. LinuxStart ( www.linuxstart.com ) </li></ul></ul>
  5. 5. Search Engines <ul><li>Two main components of a search engine: </li></ul><ul><ul><li>web crawler (spider), which collects massive Web pages. </li></ul></ul><ul><ul><li>large database, which stores and indexes collected Web pages. </li></ul></ul><ul><li>Ranking has to be performed without accessing the text, just the index </li></ul><ul><li>Ranking algorithms : all information is “top secret;” it is almost impossible to measure recall as the number of relevant pages can be quite large for simple queries </li></ul>
  6. 6. Search Engine <ul><li>Information Retrieval (IR) is a key to search engine or Web Search. </li></ul><ul><li>Most commonly – used models: </li></ul><ul><ul><li>Boolean Model </li></ul></ul><ul><ul><li>Vector Space Model (VSM) </li></ul></ul><ul><ul><li>Probability Model </li></ul></ul><ul><ul><li>their variations </li></ul></ul>
  7. 7. Search Engine <ul><li>The Google is a representative general-purpose search engine (successful, takes many techniques such as use of both web page link structure and text content, which fully takes advantage of the web page features). </li></ul><ul><li>The PageRank in Google is defined as follow: </li></ul><ul><li>Assume page A has pages P 1 ...P n which point to it. The parameter d is a damping factor which can be set between 0 and 1. Also C ( P i ) is defined as the number of links going out of page P i . The PageRank of a page A is given as follows: </li></ul><ul><li>PR ( A ) = ( 1 - d ) + d ( PR ( P 1 )/ C ( P 1 ) + ... + PR ( P n )/ C ( P n )) </li></ul><ul><li>Usually the parameter d is set to 0.85. PageRank or PR ( A ) can be calculated using a simple iterative algorithm. </li></ul><ul><li>Other features: anchor text processing, location information management and various data structures, which fully make use of the features of the web. </li></ul>
  8. 8. Ranking result pages <ul><li>Based on content </li></ul><ul><ul><li>Based on content </li></ul></ul><ul><ul><li>Number of occurrences of the search terms </li></ul></ul><ul><li>Based on link structure </li></ul><ul><ul><li>Backlink count </li></ul></ul><ul><ul><li>PageRank </li></ul></ul><ul><li>And more. </li></ul><ul><li>(http://www.cs.duke.edu/~junyang/courses/cps296.1-2002-spring/) </li></ul>
  9. 9. Problems with content-based ranking <ul><li>Many pages containing search terms may be of poor quality or irrelevant </li></ul><ul><ul><li>Example: a page with just a line “search engine”. </li></ul></ul><ul><li>Many high-quality or relevant pages do not even contain the search terms </li></ul><ul><ul><li>Example: Google homepage </li></ul></ul><ul><li>Page containing more occurrences of the search terms are ranked higher; spamming is easy </li></ul><ul><ul><li>Example: a page with line “search engine” Repeated many times </li></ul></ul>
  10. 10. Based on link structure <ul><li>Hyperlinks among web pages provide new web search opportunities. </li></ul><ul><li>Our focus in this lecture </li></ul><ul><ul><li>PageRank </li></ul></ul><ul><ul><li>HITS </li></ul></ul>
  11. 11. Backlink <ul><li>A backlink of a page p is a link that points to p </li></ul><ul><li>A page with more backlinks is ranked higher </li></ul><ul><li>Intuition: Each backlink is a “vote” for the page’s importance </li></ul><ul><li>Based on local link structure; still easy to spam </li></ul><ul><ul><li>Create lots of pages that point to a particular page </li></ul></ul>
  12. 12. PageRank <ul><li>Page et al., &quot;The PageRank Citation Ranking: Brining Order to the Web.&quot;, 1998 </li></ul><ul><li>Main idea: Pages pointed by high-ranking pages are ranked higher </li></ul><ul><li>Definition is recursive by design </li></ul><ul><li>Based on global link structure; hard to spam </li></ul>
  13. 13. PageRank <ul><li>Web can be viewed as a huge directed graph G(V, E) </li></ul><ul><ul><li>where V is the set of web pages (vertices) and E is the set of hyperlinks (directed edges). </li></ul></ul><ul><li>Each page may have a number of outgoing edges (forward links) and a number of incoming links (backlinks). </li></ul><ul><li>Each backlink of a page represents a citation to the page. </li></ul><ul><li>PageRank is a measure of global web page importance based on the backlinks of web pages. </li></ul>
  14. 14. Basic ideas <ul><li>A page is likely to be important if it is linked by many pages. </li></ul><ul><li>A page is likely to be important If it is linked to by important pages </li></ul><ul><ul><li>even though it may not necessarily be linked by a large number of pages. </li></ul></ul><ul><li>The importance of a page is divided evenly and propagated to the pages pointed to by it. </li></ul>8 4 4
  15. 15. PageRank Definition <ul><li>N( p ): number of outgoing links from page p </li></ul><ul><li>B( p ): set of pages that point to p </li></ul><ul><li>PageRank( p ) = Σ q ∈ B ( p) (PageRank( q ) / N ( q )) </li></ul><ul><li>Intuition </li></ul><ul><ul><li>Each page q evenly distributes its importance to all pages that q points to </li></ul></ul><ul><ul><li>Each page p gets a boost of its importance from each page that points to p </li></ul></ul>
  16. 16. Computing PageRank <ul><li>Create a stochastic matrix M for the link structure </li></ul><ul><ul><li>Each page i corresponds to row i and column i </li></ul></ul><ul><ul><li>If page j has n outgoing links, then the M ( i , j ) = 1 Ú n if page j </li></ul></ul><ul><li>Solve by setting all PageRank.s to 1 initially, and thenapplying M repeatedly until the values converge </li></ul>
  17. 17. Convergence <ul><li>If the ranks converge, i.e., there is a rank vector R such that R = M  R, R is the eigenvector of matrix M with eigenvalue being 1. </li></ul><ul><li>Convergence is guaranteed only if </li></ul><ul><ul><li>M is aperiodic (the Web graph is not a big cycle). This is practically guaranteed for Web. </li></ul></ul><ul><ul><li>M is irreducible (the Web graph is strongly connected). This is usually not true. </li></ul></ul>
  18. 18. Example The sum of all PageRank’s is the total number of pages p1 p2 p3 r1 r3 r2 0.5 0 0.5 0 0 0.5 0.5 1 0 = r1 r3 r2
  19. 19. Random surfer model <ul><li>A random surfer </li></ul><ul><ul><li>Starts with a random page </li></ul></ul><ul><ul><li>Randomly selects a link on the page to visit next </li></ul></ul><ul><ul><li>Never uses the .back. button </li></ul></ul><ul><li>PageRank(p) measures the probability that a </li></ul><ul><ul><li>random surfer visits page p </li></ul></ul>
  20. 20. Rank leak <ul><li>Pages with no outgoing links </li></ul><ul><li>Called a rank leak because eventually all importance will leak. out of the Web </li></ul>p1 p2 p3 r1 r3 r2 0.5 0 0.5 0 0 0.5 0.5 0 0 = r1 r3 r2
  21. 21. Rank sink <ul><li>A page or a group of pages is a rank sink if they can receive rank propagation from its parents but cannot propagate rank to other pages. </li></ul><ul><li>Causes the loss of total ranks. </li></ul><ul><li>Eventually accumulate all importance of the Web </li></ul>
  22. 22. An example of rank sink p1 p2 p3 r1 r3 r2 0.5 0 0.5 0 1 0.5 0.5 0 0 = r1 r3 r2
  23. 23. Solution <ul><li>d : damping factor </li></ul><ul><li>PageRank( p ) = d · Σ q ∈ B ( p) (PageRank( q ) / N ( q )) + (1 - d ) </li></ul><ul><li>Intuition in the random surfer model </li></ul><ul><ul><li>A surfer occasionally gets bored and jump to a random page on the Web instead of following a random link </li></ul></ul><ul><li>Another intuition </li></ul><ul><ul><li>Make the graph fully connected </li></ul></ul>
  24. 24. An example p1 p2 p3 the sum of all PageRank’s is still the total number of pages. r1 r3 r2 0.5 0 0.5 0 1 0.5 0.5 0 0 = r1 r3 r2 + 0.2 0.2 0.2

×