• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Ranking Web Pages

Ranking Web Pages






Total Views
Views on SlideShare
Embed Views



1 Embed 5

http://www.slideshare.net 5



Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
Post Comment
Edit your comment

    Ranking Web Pages Ranking Web Pages Presentation Transcript

    • Web Search: Ranking Web Pages
    • Statistics
      • In 1994, one of the first web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 web pages and web accessible documents.
      • Up to 09/2005 Google indexes 8,200,000,000 web pages.
    • Search Engines
      • A search engine is a system which collects , organizes & presents a way to select Web documents based on certain words , phrases , or patterns within documents
        • Model the Web as a full-text DB
        • Index a portion of the Web docs
        • Search Web documents using user-specified words/patterns in a text
    • Search Engines
      • Two categories of search engine
        • general-purpose search engine, e.g. Yahoo !, AltaVista and Google
        • special-purpose search engines (or Internet Portals), e.g. LinuxStart ( www.linuxstart.com )
    • Search Engines
      • Two main components of a search engine:
        • web crawler (spider), which collects massive Web pages.
        • large database, which stores and indexes collected Web pages.
      • Ranking has to be performed without accessing the text, just the index
      • Ranking algorithms : all information is “top secret;” it is almost impossible to measure recall as the number of relevant pages can be quite large for simple queries
    • Search Engine
      • Information Retrieval (IR) is a key to search engine or Web Search.
      • Most commonly – used models:
        • Boolean Model
        • Vector Space Model (VSM)
        • Probability Model
        • their variations
    • Search Engine
      • The Google is a representative general-purpose search engine (successful, takes many techniques such as use of both web page link structure and text content, which fully takes advantage of the web page features).
      • The PageRank in Google is defined as follow:
      • Assume page A has pages P 1 ...P n which point to it. The parameter d is a damping factor which can be set between 0 and 1. Also C ( P i ) is defined as the number of links going out of page P i . The PageRank of a page A is given as follows:
      • PR ( A ) = ( 1 - d ) + d ( PR ( P 1 )/ C ( P 1 ) + ... + PR ( P n )/ C ( P n ))
      • Usually the parameter d is set to 0.85. PageRank or PR ( A ) can be calculated using a simple iterative algorithm.
      • Other features: anchor text processing, location information management and various data structures, which fully make use of the features of the web.
    • Ranking result pages
      • Based on content
        • Based on content
        • Number of occurrences of the search terms
      • Based on link structure
        • Backlink count
        • PageRank
      • And more.
      • (http://www.cs.duke.edu/~junyang/courses/cps296.1-2002-spring/)
    • Problems with content-based ranking
      • Many pages containing search terms may be of poor quality or irrelevant
        • Example: a page with just a line “search engine”.
      • Many high-quality or relevant pages do not even contain the search terms
        • Example: Google homepage
      • Page containing more occurrences of the search terms are ranked higher; spamming is easy
        • Example: a page with line “search engine” Repeated many times
    • Based on link structure
      • Hyperlinks among web pages provide new web search opportunities.
      • Our focus in this lecture
        • PageRank
        • HITS
    • Backlink
      • A backlink of a page p is a link that points to p
      • A page with more backlinks is ranked higher
      • Intuition: Each backlink is a “vote” for the page’s importance
      • Based on local link structure; still easy to spam
        • Create lots of pages that point to a particular page
    • PageRank
      • Page et al., "The PageRank Citation Ranking: Brining Order to the Web.", 1998
      • Main idea: Pages pointed by high-ranking pages are ranked higher
      • Definition is recursive by design
      • Based on global link structure; hard to spam
    • PageRank
      • Web can be viewed as a huge directed graph G(V, E)
        • where V is the set of web pages (vertices) and E is the set of hyperlinks (directed edges).
      • Each page may have a number of outgoing edges (forward links) and a number of incoming links (backlinks).
      • Each backlink of a page represents a citation to the page.
      • PageRank is a measure of global web page importance based on the backlinks of web pages.
    • Basic ideas
      • A page is likely to be important if it is linked by many pages.
      • A page is likely to be important If it is linked to by important pages
        • even though it may not necessarily be linked by a large number of pages.
      • The importance of a page is divided evenly and propagated to the pages pointed to by it.
      8 4 4
    • PageRank Definition
      • N( p ): number of outgoing links from page p
      • B( p ): set of pages that point to p
      • PageRank( p ) = Σ q ∈ B ( p) (PageRank( q ) / N ( q ))
      • Intuition
        • Each page q evenly distributes its importance to all pages that q points to
        • Each page p gets a boost of its importance from each page that points to p
    • Computing PageRank
      • Create a stochastic matrix M for the link structure
        • Each page i corresponds to row i and column i
        • If page j has n outgoing links, then the M ( i , j ) = 1 Ú n if page j
      • Solve by setting all PageRank.s to 1 initially, and thenapplying M repeatedly until the values converge
    • Convergence
      • If the ranks converge, i.e., there is a rank vector R such that R = M  R, R is the eigenvector of matrix M with eigenvalue being 1.
      • Convergence is guaranteed only if
        • M is aperiodic (the Web graph is not a big cycle). This is practically guaranteed for Web.
        • M is irreducible (the Web graph is strongly connected). This is usually not true.
    • Example The sum of all PageRank’s is the total number of pages p1 p2 p3 r1 r3 r2 0.5 0 0.5 0 0 0.5 0.5 1 0 = r1 r3 r2
    • Random surfer model
      • A random surfer
        • Starts with a random page
        • Randomly selects a link on the page to visit next
        • Never uses the .back. button
      • PageRank(p) measures the probability that a
        • random surfer visits page p
    • Rank leak
      • Pages with no outgoing links
      • Called a rank leak because eventually all importance will leak. out of the Web
      p1 p2 p3 r1 r3 r2 0.5 0 0.5 0 0 0.5 0.5 0 0 = r1 r3 r2
    • Rank sink
      • A page or a group of pages is a rank sink if they can receive rank propagation from its parents but cannot propagate rank to other pages.
      • Causes the loss of total ranks.
      • Eventually accumulate all importance of the Web
    • An example of rank sink p1 p2 p3 r1 r3 r2 0.5 0 0.5 0 1 0.5 0.5 0 0 = r1 r3 r2
    • Solution
      • d : damping factor
      • PageRank( p ) = d · Σ q ∈ B ( p) (PageRank( q ) / N ( q )) + (1 - d )
      • Intuition in the random surfer model
        • A surfer occasionally gets bored and jump to a random page on the Web instead of following a random link
      • Another intuition
        • Make the graph fully connected
    • An example p1 p2 p3 the sum of all PageRank’s is still the total number of pages. r1 r3 r2 0.5 0 0.5 0 1 0.5 0.5 0 0 = r1 r3 r2 + 0.2 0.2 0.2