1. How Web Search Engines Work? Apurva Jadhav, apurvajadhav[at]gmail[dot]com
2. Outline
- Text Search
  - Indexing
  - Query Processing
- Relevance Ranking
  - Vector Space Model
  - Performance Measures
  - Link Analysis to Rank Web Pages
3. Text Search
- Given the query q "honda car", find documents that contain the terms honda and car
- The text search index structure is the inverted index
- Steps for construction of an inverted index:
  - Document preprocessing: tokenization, stemming, removal of stop words
  - Indexing
- Pipeline: documents -> Tokenizer -> Stemming -> Indexer -> Inverted Index
4. Document Preprocessing
- Includes the following steps:
- Remove all HTML tags
- Tokenization: break each document into its constituent words (terms)
- Removal of common stop words such as the, a, an, at
- Stemming: replace words with their root form, e.g. shirts -> shirt
- Assign a unique term ID to each term
- Assign a unique document ID to each document
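A minimal sketch of these steps, assuming the HTML tags have already been stripped, using a tiny hard-coded stop-word list and a toy one-rule stemmer in place of a real stemmer such as Porter's; all class and method names here are illustrative, not from the slides.

    import java.util.*;

    public class Preprocessor {
        private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "at", "is", "of"));

        // Tokenize, lowercase, drop stop words, and apply a toy stemmer.
        public static List<String> preprocess(String text) {
            List<String> terms = new ArrayList<>();
            for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
                if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
                terms.add(stem(token));
            }
            return terms;
        }

        // Toy stemmer: strips a trailing 's' ("shirts" -> "shirt").
        private static String stem(String token) {
            return token.endsWith("s") && token.length() > 3
                ? token.substring(0, token.length() - 1) : token;
        }

        public static void main(String[] args) {
            System.out.println(preprocess("The quick brown fox jumps over the lazy dog"));
            // [quick, brown, fox, jump, over, lazy, dog]
        }
    }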
5. Inverted Index
- For each term t, the inverted index stores a list of the IDs of the documents that contain t
- Example documents:
  - Doc1: The quick brown fox jumps over the lazy dog
  - Doc2: Fox News is the number one cable news channel
- Resulting postings lists: fox -> Doc1, Doc2; dog -> Doc1
- Each postings list is sorted by document ID
- Supports Boolean query operators such as AND, OR, NOT
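A minimal in-memory sketch of this structure, assuming documents arrive as already-preprocessed term lists (the term-ID mapping from the previous slide is omitted for brevity); names are illustrative.

    import java.util.*;

    public class InvertedIndex {
        // term -> sorted set of IDs of documents that contain it
        private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

        public void add(int docId, List<String> terms) {
            for (String t : terms) {
                postings.computeIfAbsent(t, k -> new TreeSet<>()).add(docId);
            }
        }

        public SortedSet<Integer> postingsList(String term) {
            return postings.getOrDefault(term, new TreeSet<>());
        }

        public static void main(String[] args) {
            InvertedIndex idx = new InvertedIndex();
            idx.add(1, Arrays.asList("quick", "brown", "fox", "jump", "over", "lazy", "dog"));
            idx.add(2, Arrays.asList("fox", "news", "number", "one", "cable", "news", "channel"));
            System.out.println(idx.postingsList("fox")); // [1, 2]
            System.out.println(idx.postingsList("dog")); // [1]
        }
    }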
6. Query Processing
- Consider the query honda car:
  - Retrieve the postings list for honda
  - Retrieve the postings list for car
  - Merge the two postings lists (see the sketch after this slide)
- Because postings lists are sorted by document ID, merging lists of lengths m and n takes O(m + n) time
- Example: honda -> 1, 2, 4, 8, 9, 16 and car -> 3, 8, 16 merge to the result 8, 16
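A sketch of the linear-time intersection used for this AND query, assuming both postings lists are given as sorted arrays of document IDs; names are illustrative.

    import java.util.*;

    public class PostingsMerge {
        // Intersect two sorted postings lists in O(m + n) time.
        static List<Integer> intersect(int[] a, int[] b) {
            List<Integer> result = new ArrayList<>();
            int i = 0, j = 0;
            while (i < a.length && j < b.length) {
                if (a[i] == b[j]) { result.add(a[i]); i++; j++; }
                else if (a[i] < b[j]) i++;
                else j++;
            }
            return result;
        }

        public static void main(String[] args) {
            int[] honda = {1, 2, 4, 8, 9, 16};
            int[] car   = {3, 8, 16};
            System.out.println(intersect(honda, car)); // [8, 16]
        }
    }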
7. Inverted Index Construction
- Estimate the size of the index:
  - Use a 32-bit integer to represent a document ID
  - Let the average number of unique terms in a document be 100
  - Then each document ID occurs in 100 postings lists on average
  - Index size = 4 bytes * 100 * number of documents
- At Web scale this runs into hundreds of gigabytes
- Clearly, one cannot hold the index structure in RAM
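As a concrete instance of this estimate, assuming a hypothetical collection of one billion pages: 4 bytes per posting * 100 postings per document * 10^9 documents = 4 * 10^11 bytes, roughly 400 GB, which is far more than the RAM of a single machine.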
8. Inverted Index Construction
- For each document, output (term, documentID) pairs to a file on disk
- Note that this file is the same size as the index
- Example documents:
  - Doc1: The quick brown fox jumps over the lazy dog
  - Doc2: Fox News is the number one cable news channel
- Emitted pairs: (quick, 1), (brown, 1), (fox, 1), (jumps, 1), (over, 1), ..., (fox, 2), (news, 2), (number, 2), (one, 2), ...
- Sort this file by term, using a disk-based external sort
9. Inverted Index Construction: Sort
- The sorted (term, document ID) file is split into a dictionary file and a postings file
- The postings file holds the document IDs for every term, one run per term (e.g. brown: 1; fox: 1, 2; jumps: 1; ...; news: 2; number: 2; one: 2; ...)
- The dictionary file lists each term once, together with the offset of its run in the postings file (e.g. brown -> offset 0)
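A compact in-memory sketch of this dictionary/postings split, standing in for the disk-based external sort above and assuming the (term, document ID) pairs fit in RAM; names are illustrative and a recent JDK (records) is assumed.

    import java.util.*;

    public class SortBasedIndexer {
        // A (term, docID) pair emitted during the first pass over the documents.
        record Posting(String term, int docId) {}

        public static void main(String[] args) {
            List<Posting> pairs = new ArrayList<>(List.of(
                new Posting("quick", 1), new Posting("brown", 1), new Posting("fox", 1),
                new Posting("fox", 2), new Posting("news", 2), new Posting("channel", 2)));

            // Sort by term, then by document ID (an external sort in a real system).
            pairs.sort(Comparator.comparing(Posting::term).thenComparingInt(Posting::docId));

            // Split into a postings array and a dictionary of (term -> offset) entries.
            List<Integer> postingsFile = new ArrayList<>();
            Map<String, Integer> dictionary = new LinkedHashMap<>();
            for (Posting p : pairs) {
                dictionary.putIfAbsent(p.term(), postingsFile.size()); // offset where the term's run starts
                postingsFile.add(p.docId());
            }

            System.out.println(dictionary);   // {brown=0, channel=1, fox=2, news=4, quick=5}
            System.out.println(postingsFile); // [1, 2, 1, 2, 2, 1]
        }
    }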
10. Relevance Ranking
- The inverted index returns a list of documents that contain the query terms
- How do we rank these documents?
  - Use the frequency of the query terms
  - Use the importance / rareness of the query terms
  - Do the query terms occur in the title of the document?
11. Vector Space Model
- Documents are represented as vectors in a multi-dimensional Euclidean space
- Each term of the vocabulary represents one dimension
- The weight (coordinate) of document d along the dimension represented by term t is the product of the following:
  - Term frequency TF(d, t): the number of times term t occurs in document d
  - Inverse document frequency IDF(t): not all terms are equally important; IDF captures the importance or rareness of a term
  - IDF(t) = log(1 + |D| / |Dt|), where |D| is the total number of documents and |Dt| is the number of documents that contain term t
- [Figure: document vector d and query vector q at angle θ in a space with axes for the terms car and computer]
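A small sketch of these weights, assuming a toy in-memory corpus of preprocessed term lists and using the IDF variant defined on this slide; class and method names are illustrative.

    import java.util.*;

    public class TfIdf {
        // Weight of term t in document d: TF(d, t) * log(1 + |D| / |Dt|).
        static Map<String, Double> weights(List<String> doc, List<List<String>> corpus) {
            Map<String, Double> w = new HashMap<>();
            for (String t : doc) w.merge(t, 1.0, Double::sum);   // raw term frequencies TF(d, t)
            for (Map.Entry<String, Double> e : w.entrySet()) {
                long dt = corpus.stream().filter(d -> d.contains(e.getKey())).count();
                e.setValue(e.getValue() * Math.log(1.0 + (double) corpus.size() / dt));
            }
            return w;
        }

        public static void main(String[] args) {
            List<List<String>> corpus = List.of(
                List.of("honda", "car", "review"),
                List.of("car", "insurance"),
                List.of("computer", "science"));
            System.out.println(weights(corpus.get(0), corpus));
            // "car" appears in 2 of 3 documents, so it gets a lower weight than "honda".
        }
    }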
12. Vector Space Model
- Queries are also represented as term vectors
- Documents are ranked by their proximity to the query vector
- The cosine of the angle between the document vector and the query vector is used to measure the proximity of the two vectors:
  - cos(θ) = (d · q) / (|d| |q|)
- The smaller the angle between vectors d and q, the more relevant document d is for query q
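A sketch of this cosine computation over sparse term-to-weight maps, such as those produced by the TF-IDF sketch above; the names and example weights are illustrative.

    import java.util.*;

    public class CosineSimilarity {
        // cos(theta) = (d . q) / (|d| * |q|), over sparse term-weight vectors.
        static double cosine(Map<String, Double> d, Map<String, Double> q) {
            double dot = 0.0;
            for (Map.Entry<String, Double> e : q.entrySet()) {
                dot += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
            }
            return dot / (norm(d) * norm(q));
        }

        static double norm(Map<String, Double> v) {
            return Math.sqrt(v.values().stream().mapToDouble(x -> x * x).sum());
        }

        public static void main(String[] args) {
            Map<String, Double> doc = Map.of("honda", 1.4, "car", 0.9, "review", 1.4);
            Map<String, Double> query = Map.of("honda", 1.0, "car", 1.0);
            System.out.printf("%.3f%n", cosine(doc, query)); // closer to 1 means more relevant
        }
    }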
13. Performance Measures
- Search engines return a ranked list of result documents for a given query
- To measure accuracy, we use a set of queries Q and a manually identified set of relevant documents Dq for each query q
- We define two measures to assess the accuracy of a search engine. Let Rel(q, k) be the number of documents relevant to query q returned in the top k positions
- Recall for query q at position k is the fraction of all relevant documents Dq that are returned in the top k positions:
  - Recall(k) = Rel(q, k) / |Dq|
- Precision for query q at position k is the fraction of the top k results that are relevant:
  - Precision(k) = Rel(q, k) / k
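A sketch of these two measures, assuming the ranked result list and the judged relevant set Dq are given as plain collections of document IDs; the example numbers are made up.

    import java.util.*;

    public class RankingMetrics {
        // Rel(q, k): relevant documents returned in the top k positions.
        static long rel(List<Integer> ranked, Set<Integer> relevant, int k) {
            return ranked.stream().limit(k).filter(relevant::contains).count();
        }

        static double precisionAtK(List<Integer> ranked, Set<Integer> relevant, int k) {
            return (double) rel(ranked, relevant, k) / k;
        }

        static double recallAtK(List<Integer> ranked, Set<Integer> relevant, int k) {
            return (double) rel(ranked, relevant, k) / relevant.size();
        }

        public static void main(String[] args) {
            List<Integer> ranked = List.of(7, 3, 12, 5, 9);  // result doc IDs in rank order
            Set<Integer> dq = Set.of(3, 5, 21);              // judged relevant for query q
            System.out.println(precisionAtK(ranked, dq, 5)); // 2/5 = 0.4
            System.out.println(recallAtK(ranked, dq, 5));    // 2/3 ≈ 0.667
        }
    }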
14. Challenges in Ranking Web Pages
- Spamming: many Web page authors resort to spam, i.e. adding unrelated words, to rank higher in search engines for certain queries
- Finding authoritative sources: thousands of documents may contain the given query terms. Example: for the query 'yahoo', www.yahoo.com is the most relevant result
- Anchor text gives important information about a document and is indexed as part of that document
15. PageRank
- A measure of the authority or prestige of a Web page
- Based on the link structure of the Web graph
- A Web page that is linked to (cited) by many other Web pages is popular and has a higher PageRank
- It is a query-independent, static ranking of Web pages
- Roughly, given two pages that both contain the query terms, the page with the higher PageRank is more relevant
16. PageRank
- Web pages link to each other through hyperlinks (hrefs in HTML)
- Thus, the Web can be visualized as a directed graph in which the Web pages constitute the set of nodes N and the hyperlinks constitute the set of edges E
- Each Web page (node) has a measure of authority or prestige called PageRank
- The PageRank of a page (node) v is proportional to the sum of the PageRanks of all Web pages that link to it:
  - p[v] = Σ_{(u,v) ∈ E} p[u] / Nu, where Nu is the number of outlinks of node u
- Example: if u1 links only to v1, u2 links to v1 and v2, and u3 links only to v2, then p[v1] = p[u1] + p[u2]/2 and p[v2] = p[u2]/2 + p[u3]
17. PageRank Computation
- Consider the N x N link matrix L and the PageRank vector p
- L(u, v) = E(u, v) / Nu, where E(u, v) = 1 iff there is an edge from u to v and Nu is the number of outlinks from node u
- p = L^T p
- The PageRank vector is the principal eigenvector of the link matrix L^T
- PageRank is computed by the power iteration method (a sketch follows this slide)
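A minimal sketch of the power iteration, assuming the graph is given as adjacency lists of outlinks and ignoring the damping factor and dangling-node handling that a production implementation (as in the Brin and Page paper) would add; names are illustrative.

    import java.util.*;

    public class PageRank {
        // Power iteration: repeatedly apply p <- L^T p until the vector stabilizes.
        static double[] pageRank(int[][] outlinks, int iterations) {
            int n = outlinks.length;
            double[] p = new double[n];
            Arrays.fill(p, 1.0 / n);                           // uniform starting vector
            for (int it = 0; it < iterations; it++) {
                double[] next = new double[n];
                for (int u = 0; u < n; u++) {
                    for (int v : outlinks[u]) {
                        next[v] += p[u] / outlinks[u].length;  // u spreads its rank over its outlinks
                    }
                }
                p = next;
            }
            return p;
        }

        public static void main(String[] args) {
            // Nodes 0..3; node 0 links to 1 and 2, node 1 links to 2, and so on.
            int[][] outlinks = { {1, 2}, {2}, {0}, {0, 2} };
            System.out.println(Arrays.toString(pageRank(outlinks, 50)));
        }
    }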
18. References
Books and papers:
- S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data.
- C. Manning and P. Raghavan. Introduction to Information Retrieval. http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
- S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7, 1998.
Software:
- Nutch, an open source Java Web crawler: http://lucene.apache.org/nutch/about.html
- Lucene, an open source Java text search engine: http://lucene.apache.org/
19. Introduction
- Web search is the dominant means of online information retrieval
- More than 200 million searches are performed each day in the US alone
- The aim of a search engine is to find documents relevant to a user query
- Most search engines try to find and rank documents that contain the query terms
20. Web Crawlers
- A Web crawler fetches Web pages. The basic idea is simple (see the sketch after this slide):
  - Add a few seed URLs (e.g. www.yahoo.com) to a queue
  - While the queue is not empty:
    - URL u = queue.remove()
    - Fetch Web page W(u)
    - Extract all hyperlinks from W(u) and add them to the queue
- To fetch all, or a significant percentage of, the Web's pages (millions of pages) one needs to engineer a large-scale crawler
- A large-scale crawler has to be distributed and multi-threaded
- Nutch is an open source Web crawler: http://lucene.apache.org/nutch/about.html
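A single-threaded sketch that fleshes out the loop above, assuming Java 11+ (java.net.http), adding the visited-set check a real crawler needs, and using a deliberately crude regular expression for link extraction; a production crawler would use an HTML parser, respect robots.txt, and be distributed and multi-threaded as noted above. Names are illustrative.

    import java.net.URI;
    import java.net.http.*;
    import java.util.*;
    import java.util.regex.*;

    public class SimpleCrawler {
        private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

        public static void crawl(String seed, int maxPages) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            Deque<String> queue = new ArrayDeque<>(List.of(seed));
            Set<String> visited = new HashSet<>();

            while (!queue.isEmpty() && visited.size() < maxPages) {
                String url = queue.remove();
                if (!visited.add(url)) continue;               // skip already-fetched pages
                String page = client.send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString()).body();
                System.out.println("Fetched " + url + " (" + page.length() + " chars)");
                Matcher m = HREF.matcher(page);                // extract outgoing hyperlinks
                while (m.find()) queue.add(m.group(1));
            }
        }

        public static void main(String[] args) throws Exception {
            crawl("https://www.yahoo.com", 10);
        }
    }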
