Ranking Web Pages
 

    Ranking Web Pages Presentation Transcript

    • Web Search: Ranking Web Pages
    • Statistics
      • In 1994, one of the first web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 web pages and web-accessible documents.
      • As of September 2005, Google indexed 8,200,000,000 web pages.
    • Search Engines
      • A search engine is a system that collects and organizes Web documents and presents a way to select them based on certain words, phrases, or patterns within the documents
        • Model the Web as a full-text DB
        • Index a portion of the Web docs
        • Search Web documents using user-specified words/patterns in a text
    • Search Engines
      • Two categories of search engines
        • general-purpose search engines, e.g. Yahoo!, AltaVista and Google
        • special-purpose search engines (or Internet portals), e.g. LinuxStart (www.linuxstart.com)
    • Search Engines
      • Two main components of a search engine:
        • web crawler (spider), which collects a massive number of Web pages.
        • large database, which stores and indexes the collected Web pages.
      • Ranking has to be performed without accessing the text, just the index
      • Ranking algorithms: all information is “top secret”; it is almost impossible to measure recall, as the number of relevant pages can be quite large for simple queries
    • Search Engine
      • Information Retrieval (IR) is key to search engines and Web search.
      • Most commonly used models:
        • Boolean Model
        • Vector Space Model (VSM)
        • Probabilistic Model
        • and their variations
    • Search Engine
      • Google is a representative general-purpose search engine. It is successful because it combines many techniques, such as using both the link structure and the text content of web pages, to take full advantage of the features of web pages.
      • PageRank in Google is defined as follows:
      • Assume page $A$ has pages $P_1, \dots, P_n$ which point to it. The parameter $d$ is a damping factor which can be set between 0 and 1. Also, $C(P_i)$ is defined as the number of links going out of page $P_i$. The PageRank of page $A$ is given as follows:
      • $PR(A) = (1 - d) + d \left( PR(P_1)/C(P_1) + \dots + PR(P_n)/C(P_n) \right)$
      • Usually the parameter $d$ is set to 0.85. PageRank, or $PR(A)$, can be calculated using a simple iterative algorithm.
      • Other features: anchor text processing, location information management, and various data structures that make full use of the features of the Web.
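The simple iterative algorithm mentioned above can be sketched roughly as follows. This is a minimal illustration, not Google's implementation: the function name `pagerank`, the dictionary input format, the fixed iteration count and the three-page graph are all assumptions made for the example. Only the formula and the value d = 0.85 come from the slide.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(P_i) / C(P_i)) for every page.

    `links` maps each page to the list of pages it links to
    (a hypothetical input format chosen for this sketch).
    """
    pr = {page: 1.0 for page in links}  # start every page at rank 1
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Add up PR(q) / C(q) over every page q that links to `page`.
            incoming = sum(pr[q] / len(links[q]) for q in links if page in links[q])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical graph: A -> B, C; B -> C; C -> A
example_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(example_links)
```

In this made-up graph, C collects rank from both A and B, so it ends up ranked above B, which is only linked to by A.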
    • Ranking result pages
      • Based on content
        • Number of occurrences of the search terms
      • Based on link structure
        • Backlink count
        • PageRank
      • And more.
      • (http://www.cs.duke.edu/~junyang/courses/cps296.1-2002-spring/)
    • Problems with content-based ranking
      • Many pages containing search terms may be of poor quality or irrelevant
        • Example: a page with just a line “search engine”.
      • Many high-quality or relevant pages do not even contain the search terms
        • Example: Google homepage
      • Pages containing more occurrences of the search terms are ranked higher; spamming is easy
        • Example: a page with the line “search engine” repeated many times
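The problems listed above can be illustrated with a small sketch of occurrence-count ranking. The page texts, the page names and the helper `count_occurrences` are made up for illustration and do not describe how any real search engine scores pages.

```python
def count_occurrences(text, query):
    """Count non-overlapping, case-insensitive occurrences of `query` in `text`."""
    return text.lower().count(query.lower())


# Hypothetical pages, invented for this example.
pages = {
    "tutorial": "How a search engine crawls, indexes and ranks Web pages.",
    "one_liner": "search engine",
    "spam": "search engine " * 100,
}

query = "search engine"
scores = {name: count_occurrences(text, query) for name, text in pages.items()}

# Rank pages by score, highest first.
ranking = sorted(scores, key=scores.get, reverse=True)
```

The spam page wins with a score of 100, while the informative tutorial ties with the one-line page at a score of 1.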
    • Based on link structure
      • Hyperlinks among web pages provide new web search opportunities.
      • Our focus in this lecture
        • PageRank
        • HITS
    • Backlink
      • A backlink of a page p is a link that points to p
      • A page with more backlinks is ranked higher
      • Intuition: Each backlink is a “vote” for the page’s importance
      • Based on local link structure; still easy to spam
        • Create lots of pages that point to a particular page
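Backlink counting, and the spamming trick described above, can be sketched as follows. The page names, the dictionary input format and the helper `backlink_counts` are assumptions made for this example.

```python
from collections import Counter


def backlink_counts(links):
    """Return the number of distinct pages linking to each page.

    `links` maps a page to the pages it links to (hypothetical input format).
    """
    return Counter(target for targets in links.values() for target in set(targets))


# Hypothetical web: two honest pages link to "good"; nobody links to "spam".
links = {"a": ["good"], "b": ["good"], "good": [], "spam": []}
before = backlink_counts(links)

# The spammer creates five throwaway pages that all point to "spam".
for i in range(5):
    links[f"farm{i}"] = ["spam"]
after = backlink_counts(links)
```

Before the link farm is added, "good" leads with two backlinks; afterwards "spam" has five and outranks it, even though none of the new pages has any importance of its own.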
    • PageRank
      • Page et al., "The PageRank Citation Ranking: Bringing Order to the Web," 1998
      • Main idea: Pages pointed to by high-ranking pages are ranked higher
      • Definition is recursive by design
      • Based on global link structure; hard to spam
    • PageRank
      • Web can be viewed as a huge directed graph G(V, E)
        • where V is the set of web pages (vertices) and E is the set of hyperlinks (directed edges).
      • Each page may have a number of outgoing edges (forward links) and a number of incoming links (backlinks).
      • Each backlink of a page represents a citation to the page.
      • PageRank is a measure of global web page importance based on the backlinks of web pages.
    • Basic ideas
      • A page is likely to be important if it is linked to by many pages.
      • A page is likely to be important if it is linked to by important pages
        • even though it may not necessarily be linked to by a large number of pages.
      • The importance of a page is divided evenly and propagated to the pages pointed to by it.
      • [Figure: a page with importance 8 divides it evenly into 4 and 4 across its two outgoing links]
    • PageRank Definition
      • $N(p)$: number of outgoing links from page $p$
      • $B(p)$: set of pages that point to $p$
      • $\mathrm{PageRank}(p) = \sum_{q \in B(p)} \mathrm{PageRank}(q) / N(q)$
      • Intuition
        • Each page q evenly distributes its importance to all pages that q points to
        • Each page p gets a boost of its importance from each page that points to p
    • Computing PageRank
      • Create a stochastic matrix M for the link structure
        • Each page i corresponds to row i and column i
        • If page j has n outgoing links, then M(i, j) = 1/n if page j points to page i, and 0 otherwise
      • Solve by setting all PageRanks to 1 initially, and then applying M repeatedly until the values converge
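The two steps above, building M and applying it repeatedly, might look like this in plain Python. The helper names, the convergence tolerance, the step limit and the three-page link structure are assumptions chosen for this sketch. The sketch also assumes every page has at least one outgoing link and that each link is listed once.

```python
def build_matrix(links, pages):
    """Build M with M[i][j] = 1/n if page j (with n outgoing links) points to page i, else 0."""
    index = {page: i for i, page in enumerate(pages)}
    size = len(pages)
    m = [[0.0] * size for _ in range(size)]
    for source, targets in links.items():
        for target in targets:
            m[index[target]][index[source]] = 1.0 / len(targets)
    return m


def multiply(m, r):
    """Return the matrix-vector product M * R."""
    return [sum(m_ij * r_j for m_ij, r_j in zip(row, r)) for row in m]


def iterate(m, tolerance=1e-10, max_steps=1000):
    """Start with all ranks at 1 and apply M until the ranks stop changing."""
    r = [1.0] * len(m)
    for _ in range(max_steps):
        new_r = multiply(m, r)
        if max(abs(a - b) for a, b in zip(new_r, r)) < tolerance:
            return new_r
        r = new_r
    return r


# Assumed link structure: p1 -> p1, p3; p2 -> p3; p3 -> p1, p2
pages = ["p1", "p2", "p3"]
links = {"p1": ["p1", "p3"], "p2": ["p3"], "p3": ["p1", "p2"]}
M = build_matrix(links, pages)
R = iterate(M)
```

For this graph the ranks settle near 1.2, 0.6 and 1.2, and their sum stays at 3, the number of pages.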
    • Convergence
      • If the ranks converge, i.e., there is a rank vector $R$ such that $R = M \times R$, then $R$ is an eigenvector of matrix $M$ with eigenvalue 1.
      • Convergence is guaranteed only if
        • M is aperiodic (the Web graph is not a big cycle). This is practically guaranteed for the Web.
        • M is irreducible (the Web graph is strongly connected). This is usually not true.
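    • Example
      • The sum of all PageRanks is the total number of pages
      • [Figure: link graph over pages p1, p2, p3 with ranks r1, r2, r3]
      • $\begin{pmatrix} 0.5 & 0 & 0.5 \\ 0 & 0 & 0.5 \\ 0.5 & 1 & 0 \end{pmatrix} \begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix} = \begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix}$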
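As a worked check, the example can be solved by hand. This assumes the flattened numbers are read row by row as the matrix with rows (0.5, 0, 0.5), (0, 0, 0.5), (0.5, 1, 0), that the rank vector is ordered (r1, r2, r3), and that the ranks are normalized to sum to 3:

```latex
\[
\begin{aligned}
r_1 &= 0.5\,r_1 + 0.5\,r_3 &&\Rightarrow\quad r_1 = r_3 \\
r_2 &= 0.5\,r_3 \\
r_3 &= 0.5\,r_1 + r_2 &&\text{(consistent with the two lines above)} \\
r_1 + r_2 + r_3 &= 3 &&\Rightarrow\quad 2.5\,r_3 = 3 \\
(r_1,\ r_2,\ r_3) &= (1.2,\ 0.6,\ 1.2)
\end{aligned}
\]
```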
    • Random surfer model
      • A random surfer
        • Starts with a random page
        • Randomly selects a link on the page to visit next
        • Never uses the “back” button
      • PageRank(p) measures the probability that a random surfer visits page p
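The random surfer model can be tried out with a short simulation. This is only a sketch: the function name `simulate_surfer`, the step count, the fixed seed and the three-page link structure are assumptions, and the graph is chosen so that every page has at least one outgoing link.

```python
import random


def simulate_surfer(links, steps, seed=0):
    """Follow random links for `steps` clicks and return each page's share of visits."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    page = rng.choice(sorted(links))  # start with a random page
    visits = {p: 0 for p in links}
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(links[page])  # follow a random link; never go back
    return {p: count / steps for p, count in visits.items()}


# Assumed link structure: p1 -> p1, p3; p2 -> p3; p3 -> p1, p2
links = {"p1": ["p1", "p3"], "p2": ["p3"], "p3": ["p1", "p2"]}
shares = simulate_surfer(links, steps=200_000)
```

For this graph the visit shares come out close to 0.4, 0.2 and 0.4. Multiplied by the number of pages, they correspond to ranks of about 1.2, 0.6 and 1.2 under the convention that all PageRanks sum to the number of pages.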
    • Rank leak
      • Pages with no outgoing links
      • Called a rank leak because eventually all importance will “leak” out of the Web
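      • [Figure: link graph over pages p1, p2, p3 with ranks r1, r2, r3; p2 has no outgoing links]
      • $\begin{pmatrix} 0.5 & 0 & 0.5 \\ 0 & 0 & 0.5 \\ 0.5 & 0 & 0 \end{pmatrix} \begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix} = \begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix}$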
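The leak can be observed numerically by applying the matrix of this example repeatedly and tracking the total rank. The sketch assumes the numbers are read row by row, so that the all-zero second column belongs to the page without outgoing links. The helper name and the step count are arbitrary choices.

```python
def multiply(m, r):
    """Return the matrix-vector product M * R."""
    return [sum(m_ij * r_j for m_ij, r_j in zip(row, r)) for row in m]


# Rank-leak matrix: column 2 is all zeros because p2 has no outgoing links.
M_leak = [
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5],
    [0.5, 0.0, 0.0],
]

r = [1.0, 1.0, 1.0]  # all ranks start at 1
totals = [sum(r)]
for _ in range(50):
    r = multiply(M_leak, r)
    totals.append(sum(r))  # total rank left after each step
```

The total drops from 3 to 2 after the first step and keeps shrinking toward 0, because whatever rank reaches p2 is never passed on.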
    • Rank sink
      • A page or a group of pages is a rank sink if it can receive rank propagation from its parents but cannot propagate rank to other pages.
      • Causes the loss of total ranks.
      • Eventually accumulate all importance of the Web
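    • An example of rank sink
      • [Figure: link graph over pages p1, p2, p3 with ranks r1, r2, r3; p2 links only to itself]
      • $\begin{pmatrix} 0.5 & 0 & 0.5 \\ 0 & 1 & 0.5 \\ 0.5 & 0 & 0 \end{pmatrix} \begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix} = \begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix}$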
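The fixed point of this example can be worked out by hand. This assumes the numbers are read row by row as the matrix with rows (0.5, 0, 0.5), (0, 1, 0.5), (0.5, 0, 0), and that the ranks start at 1 each. Every column sums to 1, so under this reading the total of 3 is preserved by each step:

```latex
\[
\begin{aligned}
r_2 &= r_2 + 0.5\,r_3 &&\Rightarrow\quad r_3 = 0 \\
r_1 &= 0.5\,r_1 + 0.5\,r_3 &&\Rightarrow\quad r_1 = r_3 = 0 \\
r_3 &= 0.5\,r_1 &&\text{(consistent with the two lines above)} \\
r_1 + r_2 + r_3 &= 3 &&\Rightarrow\quad r_2 = 3 \\
(r_1,\ r_2,\ r_3) &= (0,\ 3,\ 0)
\end{aligned}
\]
```

The sink p2 ends up holding all of the rank, while p1 and p3 are left with none.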
    • Solution
      • $d$: damping factor
      • $\mathrm{PageRank}(p) = d \cdot \sum_{q \in B(p)} \mathrm{PageRank}(q) / N(q) + (1 - d)$
      • Intuition in the random surfer model
        • A surfer occasionally gets bored and jumps to a random page on the Web instead of following a random link
      • Another intuition
        • Make the graph fully connected
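    • An example
      • The sum of all PageRanks is still the total number of pages.
      • [Figure: link graph over pages p1, p2, p3 with ranks r1, r2, r3]
      • With $d = 0.8$, so that $1 - d = 0.2$: $0.8 \begin{pmatrix} 0.5 & 0 & 0.5 \\ 0 & 1 & 0.5 \\ 0.5 & 0 & 0 \end{pmatrix} \begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix} + \begin{pmatrix} 0.2 \\ 0.2 \\ 0.2 \end{pmatrix} = \begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix}$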
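The damped update can be checked numerically on the rank-sink matrix. In this sketch d = 0.8 is an assumption, inferred from the 0.2 added to every rank in the example (1 - d = 0.2). The function name `damped_step` and the iteration count are likewise arbitrary choices.

```python
def damped_step(m, r, d):
    """One update R <- d * M * R + (1 - d), applied to every page."""
    return [d * sum(m_ij * r_j for m_ij, r_j in zip(row, r)) + (1 - d) for row in m]


# Rank-sink matrix: p2 links only to itself.
M_sink = [
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.0, 0.0],
]

d = 0.8  # assumed damping factor, so that 1 - d matches the 0.2 in the example
r = [1.0, 1.0, 1.0]  # all ranks start at 1
for _ in range(200):
    r = damped_step(M_sink, r, d)
```

The ranks settle at about 0.636, 1.909 and 0.455 (exactly 7/11, 21/11 and 5/11). The sink p2 still ranks highest, but p1 and p3 keep a nonzero rank and the total remains 3.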