How Web Search Engines Work?
Apurva Jadhav, apurvajadhav[at]gmail[dot]com

Outline
- Text Search
- Indexing
- Query Processing
- Relevance Ranking
- Vector Space Model
- Performance Measures
- Link Analysis to Rank Web Pages

Text Search
- Given query q = “honda car”, find the documents that contain the terms honda and car
- Text search index structure: the inverted index
- Steps for construction of an inverted index:
  - Document preprocessing: tokenization, stemming, removal of stop words
  - Indexing
- Pipeline: documents → tokenizer → stemmer → indexer → inverted index

Document Preprocessing
Includes the following steps:
- Remove all HTML tags
- Tokenization: break the document into its constituent words, or terms
- Removal of common stop words such as the, a, an, at
- Stemming: find and replace words with their root, e.g. shirts → shirt
- Assign a unique token ID to each token
- Assign a unique document ID to each document

Inverted Index
- For each term t, the inverted index stores a list of the IDs of the documents that contain t
- Example documents:
  - Doc1: The quick brown fox jumps over the lazy dog
  - Doc2: Fox News is the number one cable news channel
- Resulting postings lists include fox → Doc1, Doc2 and dog → Doc1
- Each postings list is sorted by document ID
- Supports advanced query operators such as AND, OR, NOT

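To make the pipeline concrete, here is a minimal in-memory inverted index in Java. It is only a sketch: tokenization is a crude lowercase split on non-letters, and the stemming and stop-word removal described above are omitted for brevity.

    import java.util.*;

    public class InvertedIndex {
        // term -> sorted set of IDs of documents containing the term
        private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

        // Crude tokenizer: lowercase and split on non-letters. A real
        // engine would also stem tokens and drop stop words.
        public void add(int docId, String text) {
            for (String token : text.toLowerCase().split("[^a-z]+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
                }
            }
        }

        public SortedSet<Integer> lookup(String term) {
            return postings.getOrDefault(term, new TreeSet<>());
        }

        public static void main(String[] args) {
            InvertedIndex index = new InvertedIndex();
            index.add(1, "The quick brown fox jumps over the lazy dog");
            index.add(2, "Fox News is the number one cable news channel");
            System.out.println(index.lookup("fox")); // [1, 2]
            System.out.println(index.lookup("dog")); // [1]
        }
    }
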
Query Processing
- Consider the query honda car
- Retrieve the postings list for honda, e.g. 1 → 2 → 4 → 8 → 16
- Retrieve the postings list for car, e.g. 3 → 8 → 9 → 16
- Merge (intersect) the two postings lists; here the result is 8, 16
- Because postings lists are sorted by doc ID, if their lengths are m and n the merge takes O(m + n) time

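The merge is a standard two-pointer intersection of sorted lists; a minimal Java sketch, using the example postings above:

    import java.util.*;

    public class PostingsMerge {
        // Intersect two postings lists sorted by document ID in O(m + n).
        static List<Integer> intersect(int[] a, int[] b) {
            List<Integer> result = new ArrayList<>();
            int i = 0, j = 0;
            while (i < a.length && j < b.length) {
                if (a[i] == b[j]) { result.add(a[i]); i++; j++; }
                else if (a[i] < b[j]) i++; // advance the pointer at the smaller doc ID
                else j++;
            }
            return result;
        }

        public static void main(String[] args) {
            int[] honda = {1, 2, 4, 8, 16};
            int[] car = {3, 8, 9, 16};
            System.out.println(intersect(honda, car)); // [8, 16]
        }
    }
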
Inverted Index Construction
Estimate the size of the index:
- Use a 32-bit (4-byte) integer to represent a document ID
- Let the average number of unique terms in a document be 100
- Then each document ID occurs in 100 postings lists on average
- Index size ≈ 4 × 100 × (number of documents) bytes, i.e. about 400 bytes per document; a billion documents would need roughly 400 GB
- At Web scale, the index runs into hundreds of GB
- Clearly, one cannot hold the index structure in RAM

Inverted Index Construction
- For each document, output (term, documentID) pairs to a file on disk; note that this file has the same size as the index
- Example documents:
  - Doc1: The quick brown fox jumps over the lazy dog
  - Doc2: Fox News is the number one cable news channel
- The pairs file begins: (quick, 1), (brown, 1), (fox, 1), (jumps, 1), (over, 1), … (fox, 2), (news, 2), (number, 2), (one, 2), …
- Sort this file by term, using a disk-based external sort

Inverted Index Construction
- After the sort, pairs with the same term are adjacent: (brown, 1), (fox, 1), (fox, 2), (jumps, 1), (over, 1), … (news, 2), (number, 2), (one, 2), …
- The sorted result is split into a dictionary file and a postings file
- The postings file holds the concatenated postings lists: … 1, 1, 2, 1, 1, …
- The dictionary file maps each term to the offset of its postings list in the postings file: (brown, 0), (fox, …), (jumps, …), …

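A toy in-memory version of the sort-based build in Java. A real indexer streams the pairs to disk, sorts them with an external sort, and writes separate dictionary and postings files, but the emit / sort / group structure is the same:

    import java.util.*;

    public class SortBasedIndexer {
        public static void main(String[] args) {
            String[] docs = {
                "The quick brown fox jumps over the lazy dog",  // doc ID 1
                "Fox News is the number one cable news channel" // doc ID 2
            };
            // Step 1: emit one (term, docID) pair per token occurrence.
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            for (int docId = 1; docId <= docs.length; docId++) {
                for (String t : docs[docId - 1].toLowerCase().split("[^a-z]+")) {
                    if (!t.isEmpty()) pairs.add(Map.entry(t, docId));
                }
            }
            // Step 2: sort by term, then by doc ID. At Web scale this is a
            // disk-based external sort over files of pairs.
            pairs.sort(Map.Entry.<String, Integer>comparingByKey()
                    .thenComparing(Map.Entry.<String, Integer>comparingByValue()));
            // Step 3: group runs of equal terms into postings lists. On disk,
            // the doc IDs would go to the postings file and each (term, offset)
            // pair to the dictionary file.
            Map<String, List<Integer>> index = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> p : pairs) {
                List<Integer> list = index.computeIfAbsent(p.getKey(), k -> new ArrayList<>());
                if (list.isEmpty() || !list.get(list.size() - 1).equals(p.getValue())) {
                    list.add(p.getValue()); // skip duplicate doc IDs within a run
                }
            }
            System.out.println(index.get("fox")); // [1, 2]
        }
    }
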
Relevance Ranking
- The inverted index returns the list of documents which contain the query terms; how do we rank these documents?
- Use the frequency of the query terms
- Use the importance / rareness of the query terms
- Do the query terms occur in the title of the document?

Vector Space Model
- Documents are represented as vectors in a multi-dimensional Euclidean space; each term/word of the vocabulary represents a dimension
- The weight (coordinate) of document d along the dimension represented by term t is the product of the following:
  - Term frequency TF(d, t): the number of times term t occurs in document d
  - Inverse document frequency IDF(t): all terms are not equally important, and IDF captures the importance or rareness of a term. IDF(t) = log(1 + |D| / |Dt|), where |D| is the total number of documents and |Dt| is the number of documents which contain term t
- [Figure: document vector d and query vector q plotted in a two-dimensional space with axes “car” and “computer”; θ is the angle between them]

Vector Space Model
- Queries are also represented as term vectors
- Documents are ranked by their proximity to the query vector
- The cosine of the angle between the document vector and the query vector is used to measure the proximity of the two vectors: cos(θ) = d · q / (|d| |q|)
- The smaller the angle between vectors d and q, the more relevant document d is for query q

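A Java sketch of the model, following the slides' definitions (raw counts for TF, IDF(t) = log(1 + |D|/|Dt|)). The three-document corpus and the query are made up for illustration:

    import java.util.*;

    public class VectorSpaceModel {
        // Raw term frequencies of a text.
        static Map<String, Integer> tf(String text) {
            Map<String, Integer> counts = new HashMap<>();
            for (String t : text.toLowerCase().split("[^a-z]+")) {
                if (!t.isEmpty()) counts.merge(t, 1, Integer::sum);
            }
            return counts;
        }

        // IDF(t) = log(1 + |D| / |Dt|); returns 0 for terms in no document.
        static double idf(String term, List<Map<String, Integer>> docs) {
            long df = docs.stream().filter(d -> d.containsKey(term)).count();
            return df == 0 ? 0.0 : Math.log(1.0 + (double) docs.size() / df);
        }

        // Cosine of the angle between the TF-IDF vectors of a and b.
        static double cosine(Map<String, Integer> a, Map<String, Integer> b,
                             List<Map<String, Integer>> docs) {
            double dot = 0, na = 0, nb = 0;
            Set<String> vocab = new HashSet<>(a.keySet());
            vocab.addAll(b.keySet());
            for (String t : vocab) {
                double wa = a.getOrDefault(t, 0) * idf(t, docs);
                double wb = b.getOrDefault(t, 0) * idf(t, docs);
                dot += wa * wb; na += wa * wa; nb += wb * wb;
            }
            return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        public static void main(String[] args) {
            List<Map<String, Integer>> docs = List.of(
                tf("honda makes a reliable car"),
                tf("a new car and an old car"),
                tf("computer science department"));
            Map<String, Integer> query = tf("honda car");
            for (Map<String, Integer> d : docs) {
                System.out.printf("%.3f%n", cosine(d, query, docs)); // doc 1 scores highest
            }
        }
    }
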
Performance Measures
- Search engines return a ranked list of result documents for a given query
- To measure accuracy, we use a set of queries Q and a manually identified set of relevant documents Dq for each query q, and define two measures
- Let Rel(q, k) be the number of documents relevant to query q returned in the top k positions
- Recall for query q, at position k, is the fraction of all relevant documents Dq that are returned in the top k positions: Recall(k) = Rel(q, k) / |Dq|
- Precision for query q, at position k, is the fraction of the top k results that are relevant: Precision(k) = Rel(q, k) / k

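Both measures are straightforward to compute from a ranked result list and a judged relevant set; a small Java sketch with made-up document IDs:

    import java.util.*;

    public class PrecisionRecall {
        // Rel(q, k): how many of the top k results are in the relevant set Dq.
        static long rel(List<Integer> ranked, Set<Integer> relevant, int k) {
            return ranked.stream().limit(k).filter(relevant::contains).count();
        }

        public static void main(String[] args) {
            List<Integer> ranked = List.of(7, 3, 12, 5, 9); // doc IDs as returned
            Set<Integer> relevant = Set.of(3, 5, 8);        // judged set Dq
            int k = 5;
            long r = rel(ranked, relevant, k);
            System.out.println("Precision@" + k + " = " + (double) r / k);            // 2/5 = 0.4
            System.out.println("Recall@" + k + " = " + (double) r / relevant.size()); // 2/3 ~ 0.67
        }
    }
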
Challenges in Ranking Web Pages
- Spamming: many Web page authors resort to spam, i.e. adding unrelated words, to rank higher in search engines for certain queries
- Finding authoritative sources: there are thousands of documents that contain the given query terms. Example: for the query ‘yahoo’, www.yahoo.com is the most relevant result
- Anchor text gives important information about a document; it is indexed as part of the document it links to

PageRank
- A measure of the authority or prestige of a Web page, based on the link structure of the Web graph
- A Web page which is linked / cited by many other Web pages is popular and has a higher PageRank
- PageRank is a query-independent, static ranking of Web pages
- Roughly, given two pages both of which contain the query terms, the page with the higher PageRank is more relevant

PageRank
- Web pages link to each other through hyperlinks (hrefs in HTML), so the Web can be visualized as a directed graph in which Web pages constitute the set of nodes N and hyperlinks constitute the set of edges E
- Each Web page (node) has a measure of authority or prestige called PageRank
- The PageRank of a page (node) v is proportional to the sum of the PageRanks of the pages that link to it: p[v] = Σ_{(u,v) ∈ E} p[u] / Nu, where Nu is the number of outlinks of node u
- Example: if u1 links only to v1, u2 links to v1 and v2, and u3 links only to v2, then p[v1] = p[u1] + p[u2]/2 and p[v2] = p[u2]/2 + p[u3]

PageRank Computation
- Consider the N × N link matrix L and the PageRank vector p, where L(u, v) = E(u, v) / Nu, E(u, v) = 1 iff there is an edge from u to v, and Nu is the number of outlinks from node u
- Then p = L^T p: the PageRank vector is the principal eigenvector of the link matrix L^T
- PageRank is computed by the power iteration method, as sketched below

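A minimal power-iteration sketch in Java on a made-up three-page graph. It follows the slides' formulation exactly, with no damping factor; production PageRank adds a damping / teleportation term to handle dead ends and guarantee convergence:

    import java.util.*;

    public class PageRank {
        public static void main(String[] args) {
            // Adjacency lists: out[u] = pages that u links to. An illustrative
            // graph: page 0 -> {1, 2}, page 1 -> {2}, page 2 -> {0}.
            int[][] out = {{1, 2}, {2}, {0}};
            int n = out.length;
            double[] p = new double[n];
            Arrays.fill(p, 1.0 / n); // start from the uniform vector

            // Power iteration: repeatedly apply p = L^T p.
            for (int iter = 0; iter < 50; iter++) {
                double[] next = new double[n];
                for (int u = 0; u < n; u++) {
                    for (int v : out[u]) {
                        next[v] += p[u] / out[u].length; // L(u, v) = 1 / Nu
                    }
                }
                p = next;
            }
            System.out.println(Arrays.toString(p)); // converges to ~[0.4, 0.2, 0.4]
        }
    }
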
References
Books and papers:
- S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data.
- C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
- S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7, 1998.
Software:
- Nutch is an open source Java Web crawler. http://lucene.apache.org/nutch/about.html
- Lucene is an open source Java text search engine. http://lucene.apache.org/

Introduction
- Web search is the dominant means of online information retrieval; more than 200 million searches are performed each day in the US alone
- The aim of a search engine is to find documents relevant to a user query
- Most search engines try to find and rank documents which contain the query terms

Web Crawlers
- A crawler fetches Web pages. The basic idea is simple, as illustrated by this pseudocode:

    add a few seed URLs (e.g. www.yahoo.com) to a queue
    while (!queue.isEmpty()) do
        URL u = queue.remove()
        fetch Web page W(u)
        extract all hyperlinks from W(u) and add them to the queue
    done

- To fetch all, or a significant percentage of, Web pages (millions), one needs to engineer a large-scale crawler
- A large-scale crawler has to be distributed and multi-threaded
- Nutch is an open source Web crawler. http://lucene.apache.org/nutch/about.html

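A hedged Java sketch of this loop using the JDK's HttpClient. It is single-threaded and extracts links with a crude regex that only gestures at real HTML parsing; a production crawler such as Nutch adds robots.txt handling, politeness delays, URL normalization, and distribution across machines:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.*;
    import java.util.regex.*;

    public class SimpleCrawler {
        private static final Pattern HREF =
            Pattern.compile("href=[\"'](http[^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) {
            HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL).build();
            Deque<String> queue = new ArrayDeque<>(List.of("http://www.yahoo.com"));
            Set<String> seen = new HashSet<>(queue);
            int fetched = 0;

            while (!queue.isEmpty() && fetched < 10) { // small cap for the demo
                String url = queue.remove();
                try {
                    HttpResponse<String> resp = client.send(
                        HttpRequest.newBuilder(URI.create(url)).GET().build(),
                        HttpResponse.BodyHandlers.ofString());
                    fetched++;
                    System.out.println(fetched + ": " + url);
                    Matcher m = HREF.matcher(resp.body());
                    while (m.find()) {
                        String link = m.group(1);
                        if (seen.add(link)) queue.add(link); // enqueue unseen links only
                    }
                } catch (Exception e) {
                    // skip pages that fail to fetch; a real crawler logs and retries
                }
            }
        }
    }
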