1. How Web Search Engines Work?
   Apurva Jadhav, apurvajadhav[at]gmail[dot]com
2. Outline
   - Text Search
     - Indexing
     - Query Processing
   - Relevance Ranking
     - Vector Space Model
     - Performance Measures
     - Link Analysis to rank Web pages
3. Text Search
   - Given a query q, e.g. "honda car", find documents that contain the terms honda and car
   - The index structure for text search is the inverted index
   - Steps to construct an inverted index:
     - Document preprocessing
       - Tokenization
       - Stemming, removal of stop words
     - Indexing
   - Pipeline: documents → Tokenizer → Stemming → Indexer → Inverted Index
4. Document Preprocessing
   - Remove all HTML tags
   - Tokenization: break the document into its constituent words or terms
   - Remove common stop words such as the, a, an, at
   - Stemming: replace each word with its root, e.g. shirts → shirt
   - Assign a unique token ID to each token
   - Assign a unique document ID to each document
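These preprocessing steps can be sketched in a few lines of Python. This is a minimal illustration only: the stop-word list is a tiny stand-in, and the suffix-stripping "stemmer" is a toy substitute for a real algorithm such as Porter stemming.

```python
import re

STOP_WORDS = {"the", "a", "an", "at", "is", "over"}  # illustrative subset

def stem(word):
    # Toy stemmer: strips a plural "s" only; real systems use e.g. Porter stemming
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def preprocess(html):
    text = re.sub(r"<[^>]+>", " ", html)             # remove all HTML tags
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # tokenization
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("<p>The quick brown shirts</p>"))  # ['quick', 'brown', 'shirt']
```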
5. Inverted Index
   - For each term t, the inverted index stores the list of IDs of the documents that contain t (its postings list)
   - Example documents:
     - Doc1: The quick brown fox jumps over the lazy dog
     - Doc2: Fox News is the number one cable news channel
   - Postings lists: fox → Doc1, Doc2; dog → Doc1
   - Each postings list is sorted by document ID
   - Supports query operators such as AND, OR, NOT
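A minimal in-memory version of this structure is a dictionary mapping each term to its sorted postings list. The two documents are the ones from the slide; preprocessing is reduced to a bare `split()` for brevity.

```python
from collections import defaultdict

docs = {
    1: "the quick brown fox jumps over the lazy dog",
    2: "fox news is the number one cable news channel",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Postings lists sorted by document ID, as the slide requires
postings = {term: sorted(ids) for term, ids in index.items()}
print(postings["fox"])  # [1, 2]
print(postings["dog"])  # [1]
```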
6. Query Processing
   - Consider the query honda car:
     - Retrieve the postings list for honda
     - Retrieve the postings list for car
     - Merge the two postings lists
   - Since postings lists are sorted by document ID, two lists of lengths m and n can be merged in O(m + n) time
   - Example (from the slide's figure): honda → 1, 2, 4, 8, 16; car → 3, 8, 9, 16; the merge returns 8, 16
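The O(m + n) merge walks both sorted lists with two pointers, advancing whichever points at the smaller document ID. A sketch, using the example IDs from this slide:

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by document ID in O(m + n) time."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])   # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect([1, 2, 4, 8, 16], [3, 8, 9, 16]))  # [8, 16]
```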
7. Inverted Index Construction: Estimating Index Size
   - Use a 32-bit integer (4 bytes) to represent a document ID
   - Suppose a document contains 100 unique terms on average
   - Then each document ID occurs in 100 postings lists on average
   - Index size ≈ 4 × 100 × (number of documents) bytes
   - At Web scale, this runs into hundreds of GB
   - Clearly, one cannot hold the index structure in RAM
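Plugging the slide's numbers into the estimate (the one-billion-document corpus size is an assumption for illustration, not a figure from the slides):

```python
BYTES_PER_DOC_ID = 4            # 32-bit integer
AVG_UNIQUE_TERMS = 100          # postings lists each document ID appears in
num_documents = 1_000_000_000   # assumed Web-scale corpus size

index_bytes = BYTES_PER_DOC_ID * AVG_UNIQUE_TERMS * num_documents
print(index_bytes / 10**9, "GB")  # 400.0 GB -- hundreds of GB, as the slide says
```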
8. Inverted Index Construction
   - For each document, write (term, document ID) pairs to a file on disk
   - Note that this file has the same size as the index
   - Example documents:
     - Doc1: The quick brown fox jumps over the lazy dog
     - Doc2: Fox News is the number one cable news channel
   - Pairs: (quick, 1), (brown, 1), (fox, 1), (jumps, 1), (over, 1), ..., (fox, 2), (news, 2), (number, 2), (one, 2), ...
   - Sort this file by term, using a disk-based external sort
9. Inverted Index Construction: After the Sort
   - The sorted result is split into a dictionary file and a postings file
   - Sorted pairs: (brown, 1), (fox, 1), (fox, 2), (jumps, 1), (over, 1), ..., (news, 2), (number, 2), (one, 2), ...
   - Postings file: the document IDs in sorted-pair order: 1, 1, 2, 1, 1, ..., 2, 2, 2, ...
   - Dictionary file: each distinct term (brown, fox, jumps, over, ..., news, number, one, ...) together with the offset of its postings in the postings file, starting at offset 0
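A small in-memory sketch of the sort-and-split step. A real indexer would use a disk-based external sort, as the previous slide notes; here everything fits in memory, and the dictionary stores (offset, length) into a flat postings array standing in for the postings file.

```python
pairs = [("quick", 1), ("brown", 1), ("fox", 1), ("jumps", 1),
         ("fox", 2), ("news", 2), ("number", 2), ("one", 2)]

pairs.sort()  # sort by term, then by document ID

dictionary = {}   # term -> [offset into postings, number of postings]
postings = []     # flat "postings file"
for term, doc_id in pairs:
    if term not in dictionary:
        dictionary[term] = [len(postings), 0]
    postings.append(doc_id)
    dictionary[term][1] += 1

off, length = dictionary["fox"]
print(postings[off:off + length])  # [1, 2]
```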
10. Relevance Ranking
   - The inverted index returns a list of documents that contain the query terms
   - How do we rank these documents? Useful signals:
     - Frequency of the query terms in the document
     - Importance / rareness of the query terms
     - Whether the query terms occur in the title of the document
11. Vector Space Model
   - Documents are represented as vectors in a multi-dimensional Euclidean space
   - Each term of the vocabulary represents one dimension
   - The weight (coordinate) of document d along the dimension for term t is the product of:
     - Term frequency TF(d, t): the number of times term t occurs in document d
     - Inverse document frequency IDF(t): not all terms are equally important; IDF captures the importance or rareness of a term:
       IDF(t) = log(1 + |D| / |D_t|)
       where |D| is the total number of documents and |D_t| is the number of documents that contain term t
   - (Figure: document vector d and query vector q separated by angle θ, in a space whose axes are terms such as car and computer)
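The TF–IDF weight defined above can be computed directly. The two tokenized documents below are an assumed toy corpus, not from the slides.

```python
import math

docs = {
    1: ["quick", "brown", "fox"],
    2: ["fox", "news", "news"],
}

def idf(term):
    df = sum(1 for terms in docs.values() if term in terms)  # |D_t|
    return math.log(1 + len(docs) / df)   # IDF(t) = log(1 + |D| / |D_t|)

def weight(doc_id, term):
    tf = docs[doc_id].count(term)         # TF(d, t)
    return tf * idf(term)

# "news" occurs twice in Doc2 and in only one of the two documents
print(round(weight(2, "news"), 3))  # 2.197, i.e. 2 * log(3)
```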
12. Vector Space Model
   - Queries are also represented as term vectors
   - Documents are ranked by their proximity to the query vector
   - The cosine of the angle between the document vector and the query vector measures their proximity:
     cos(θ) = d · q / (|d| |q|)
   - The smaller the angle between vectors d and q, the more relevant document d is to query q
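Cosine similarity between two term vectors, as a sketch; the vectors are plain dicts from term to weight, and the example weights are illustrative.

```python
import math

def cosine(d, q):
    """cos(θ) = d · q / (|d| |q|) for sparse vectors stored as dicts."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q)

d = {"honda": 2.0, "car": 1.0}   # document vector (illustrative weights)
q = {"honda": 1.0, "car": 1.0}   # query vector
print(round(cosine(d, q), 3))    # 0.949 -- small angle, highly relevant
```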
13. Performance Measures
   - Search engines return a ranked list of result documents for a given query
   - To measure accuracy, we use a set of queries Q and, for each query q, a manually identified set of relevant documents D_q
   - Let Rel(q, k) be the number of documents relevant to query q returned in the top k positions
   - Recall at position k is the fraction of all relevant documents D_q that are returned in the top k positions:
     Recall(k) = Rel(q, k) / |D_q|
   - Precision at position k is the fraction of the top k results that are relevant:
     Precision(k) = Rel(q, k) / k
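Both measures follow directly from Rel(q, k). A short sketch; the ranked list and relevant set are made-up examples.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision(k) = Rel(q,k) / k,  Recall(k) = Rel(q,k) / |D_q|."""
    rel_at_k = sum(1 for doc in ranked[:k] if doc in relevant)  # Rel(q, k)
    return rel_at_k / k, rel_at_k / len(relevant)

ranked = [3, 1, 7, 5, 9]   # ranked result list for query q (illustrative)
relevant = {1, 5, 8}       # manually identified D_q (illustrative)

p, r = precision_recall_at_k(ranked, relevant, 4)
print(p, r)  # precision 0.5 (2 of top 4 relevant), recall 2/3 (2 of 3 found)
```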
14. Challenges in Ranking Web Pages
   - Spamming: many Web page authors resort to spam, i.e. adding unrelated words, to rank higher in search engines for certain queries
   - Finding authoritative sources: thousands of documents may contain the given query terms. Example: for the query yahoo, www.yahoo.com is the most relevant result
   - Anchor text gives important information about a document; it is indexed as part of that document
15. PageRank
   - A measure of the authority or prestige of a Web page
   - Based on the link structure of the Web graph
   - A Web page that is linked to (cited) by many other Web pages is popular and has a higher PageRank
   - It is a query-independent, static ranking of Web pages
   - Roughly, given two pages that both contain the query terms, the page with the higher PageRank is more relevant
16. PageRank
   - Web pages link to each other through hyperlinks (hrefs in HTML)
   - The Web can thus be visualized as a directed graph in which Web pages are the set of nodes N and hyperlinks are the set of edges E
   - Each Web page (node) has a measure of authority or prestige called PageRank
   - The PageRank of a page v is proportional to the sum of the PageRanks of all pages that link to it, each divided by that page's number of outlinks:
     p[v] = Σ_{(u,v) ∈ E} p[u] / N_u
     where N_u is the number of outlinks of node u
   - Example (from the slide's figure): if u1 links only to v1, u2 links to both v1 and v2, and u3 links only to v2, then p[v1] = p[u1] + p[u2]/2 and p[v2] = p[u2]/2 + p[u3]
17. PageRank Computation
   - Consider the N × N link matrix L and the PageRank vector p
   - L(u, v) = E(u, v) / N_u, where E(u, v) = 1 iff there is an edge from u to v, and N_u is the number of outlinks of node u
   - Then p = L^T p
   - The PageRank vector is the principal eigenvector of the matrix L^T
   - PageRank is computed by the power iteration method
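Power iteration on a tiny graph, as a sketch of the computation. This follows the simplified formula on these slides, with no damping factor; real PageRank adds teleportation to handle dangling nodes and guarantee convergence. The three-node graph is an assumed example, chosen to be strongly connected so the iteration settles.

```python
def pagerank(out_links, iters=100):
    """Power iteration p <- L^T p under the simplified (no-damping) formula."""
    n = len(out_links)
    p = [1.0 / n] * n                        # start from the uniform vector
    for _ in range(iters):
        new_p = [0.0] * n
        for u, targets in out_links.items():
            for v in targets:
                new_p[v] += p[u] / len(targets)   # p[v] += p[u] / N_u
        p = new_p
    return p

# Assumed graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
out_links = {0: [1, 2], 1: [2], 2: [0]}
print([round(x, 3) for x in pagerank(out_links)])  # [0.4, 0.2, 0.4]
```

Node 2 collects weight from both 0 and 1 and feeds it all back to 0, so nodes 0 and 2 end up tied ahead of node 1.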
18. References
   Books and Papers
   - S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data.
   - C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
   - S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7, 1998.
   Software
   - Nutch, an open-source Java Web crawler: http://lucene.apache.org/nutch/about.html
   - Lucene, an open-source Java text search engine: http://lucene.apache.org/
19. Introduction
   - Web search is the dominant means of online information retrieval
   - More than 200 million searches are performed each day in the US alone
   - The aim of a search engine is to find documents relevant to a user's query
   - Most search engines try to find and rank documents that contain the query terms
20. Web Crawlers
   - A crawler fetches Web pages
   - The basic idea is simple:
     - Add a few seed URLs (e.g. www.yahoo.com) to a queue
     - While the queue is not empty:
       - Remove a URL u from the queue
       - Fetch the Web page W(u)
       - Extract all hyperlinks from W(u) and add them to the queue
   - To fetch all, or a significant percentage of all, Web pages (millions), one needs to engineer a large-scale crawler
   - A large-scale crawler has to be distributed and multi-threaded
   - Nutch is an open-source Web crawler: http://lucene.apache.org/nutch/about.html
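The loop above, made concrete in Python. To keep the sketch self-contained it crawls an assumed in-memory graph of pages rather than the live Web, and it adds the visited-set that any real crawler needs so the same URL is not fetched twice.

```python
from collections import deque

# Stand-in for the Web: URL -> list of hyperlinks on that page (assumed data)
FAKE_WEB = {
    "www.yahoo.com": ["a.com", "b.com"],
    "a.com": ["b.com", "c.com"],
    "b.com": [],
    "c.com": ["www.yahoo.com"],
}

def crawl(seeds):
    queue = deque(seeds)
    seen = set(seeds)            # avoid refetching the same URL
    fetched = []
    while queue:                 # while the queue is not empty
        u = queue.popleft()      # remove a URL u from the queue
        fetched.append(u)        # "fetch" Web page W(u)
        for link in FAKE_WEB.get(u, []):   # extract hyperlinks from W(u)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched

print(crawl(["www.yahoo.com"]))  # pages in breadth-first order from the seed
```

Using a FIFO queue makes this a breadth-first traversal; a real crawler replaces the dict lookup with an HTTP fetch and runs many such loops concurrently across machines.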