A Brief Tour of Modern Web Search Engines
  • Google claims to index over 2 billion pages
  • The lexicon can be efficiently stored in a hash table (e.g. using overflow chaining), allowing fast lookups. Note: reported index sizes include the lexicon, mapping file, and inverted lists
  • Cat occurs in 3 documents; 2 times in doc 1 at positions 2 and 6, 1 time in doc 2 at position 8, and 3 times in doc 7 at positions 4, 8 and 11
  • Average docno gap now: 2 1/3 (was 3 1/3); average offset gap now: 4 1/6 (was 6 1/2)
  • Reminder: 20 GB collection; 25,000 queries; no heuristics to reduce query time (e.g. early termination or stopping). Relative speed of the bitwise schemes is as for the small collection. Vbyte coding again gives the fastest processing; the VbyD-VbyF-VbyO index is twice as fast as any index that does not use Vbyte
  • Transcript

    • 1. A Brief Tour of Modern Web Search Engines. Hugh E. Williams, eBay Inc. [email_address]
    • 2. Overview
      - Introduction
      - Web crawling
      - Document stores and indexing
      - Inverted Indexing
      - Query Evaluation
      - Ranking and Relevance Measurement
      - Caching and Web Serving
      - eBay
      - Reading Materials
    • 3. Web Search Basics
      - Web search engines don’t search the web
        - They search a copy of the web
        - They crawl or spider documents from the web
        - They index the documents, and provide a search interface based on that index
        - Document summarization is used to present short snippets that allow users to judge relevance
        - Users click on links to visit the actual, original web document
    • 4. (Simplified) Web Search Architecture: Crawlers, Document Store, Index, File Managers, Result Cache, Web Servers, Aggregators
    • 5. CRAWLERS AND CRAWLING
    • 6. Crawling from Seed Resources
      - The basic seed-based crawling algorithm is as follows:
        1. Create an empty URL queue
        2. Add user-supplied seed URLs to the queue (simplest approach: append to the tail)
        3. If the resource at the head of the queue meets the “crawl criteria” (more later), request it
        4. Process the retrieved resource:
          - Store the headers and resource in the collection store
          - Extract URLs from the resource; for each URL, decide whether it should be added to the URL queue
          - Record the URL in the visited-URL list with the time visited
        5. Repeat from Step 3 until the queue is empty, then stop
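The loop above can be sketched in a few lines of Python. Here `fetch`, `extract_urls`, and `should_crawl` are hypothetical caller-supplied callbacks standing in for the HTTP client, the URL harvester, and the crawl criteria; a real crawler would add politeness delays, refetch scheduling, and so on.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_urls, should_crawl):
    """Seed-based crawl loop sketch; the three callbacks are
    hypothetical stand-ins for HTTP fetching, URL extraction,
    and the crawl criteria."""
    queue = deque(seed_urls)        # steps 1-2: queue seeded at the tail
    visited = set()
    store = {}
    while queue:                    # step 5: repeat until the queue is empty
        url = queue.popleft()
        if url in visited or not should_crawl(url):   # step 3: crawl criteria
            continue
        visited.add(url)            # record the visited URL
        resource = fetch(url)       # step 3: request the resource
        store[url] = resource       # step 4: store in the collection store
        for link in extract_urls(resource):           # step 4: harvest URLs
            if link not in visited:
                queue.append(link)  # decide whether to enqueue the URL
    return store
```

A tiny in-memory "web" is enough to exercise the loop, including the cycle between two pages.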
    • 7. So, it’s that simple?
      - “I'm writing a robot, what do I need to be careful of? Lots. First read through all the stuff on the robot page then read the proceedings of past WWW Conferences, and the complete HTTP and HTML spec. Yes; it's a lot of work.” (from http://www.robotstxt.org/faq/writing.html)
      - Writing a crawler isn’t straightforward. Some examples:
        - Sites can use the robots.txt exclusion standard to limit which pages should be retrieved
        - Crawlers shouldn’t overload or overvisit sites
        - Many URLs exist for the same resource
        - URLs redirect to other resources (more in a moment)
        - Dynamic pages can generate loops, unending lists, and other traps
        - URLs are difficult to harvest: some are embedded in JavaScript, hidden behind forms, and so on
        - …
    • 8. Example: Resolving URLs
      - The following URLs resolve to the same resource:
        - ebay.com/garden
        - pages.ebay.com/garden
        - www.ebay.com/garden
        - www.ebay.com/./garden
        - www.ebay.com//////garden
        - ebay.com/GARDEN
        - ebay.com:80/garden
        - ebay.com/%67%61%72%64%65%6e
        - ebay.com/garden/foo/..
        - garden.ebay.com
        - garden.ebay.com/index.html
        - garden.ebay.com/#test
        - garden.ebay.com/?test=hello
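A partial canonicalizer, sketched with Python's standard `urllib.parse`, handles several of the cases above: percent-escapes, the default port, dot segments, repeated slashes, and fragments. Host aliases (ebay.com vs pages.ebay.com) and case-insensitive paths need site-specific knowledge that no generic rule can supply, so this is only a sketch.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit, unquote

def normalize(url):
    """Partial URL canonicalization sketch: lowercases the host,
    drops the default port and the fragment, decodes percent-escapes,
    and resolves '.', '..', and repeated slashes."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    host = netloc.lower()
    if scheme == "http" and host.endswith(":80"):
        host = host[:-3]                 # :80 is the default for http
    path = unquote(path)                 # %67%61%72%64%65%6e -> garden
    path = posixpath.normpath(path)      # ./ , //// , and /foo/.. segments
    if path == ".":
        path = ""
    return urlunsplit((scheme, host, path, query, ""))
```

With this, several of the slide's variants collapse to the same canonical form.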
    • 9. Crawl Criteria
      - Crawlers actually need to do three fundamental tasks:
        - Fetch new resources from new domains or pages
        - Fetch new resources from existing domains or pages
        - Re-fetch existing resources (that have changed)
      - We can think of a successful crawl action as one that leads to a new resource being indexed and visited (or viewed?) by the user of the search engine
      - A failure is fetching a resource that isn’t used, or refetching a resource that didn’t change
    • 10. Crawl Criteria…
      - Crawl prioritization is essential:
        - There are far more URLs than available fetching bandwidth
        - For large sites, where we’re being polite, it’s impossible to fetch all resources
        - It’s essential to balance refetch and discovery
        - It’s essential to balance new-site exploration with old-site exploration
    • 11. Interesting problems
      - Crawler challenges:
        - HTTP HEAD and GET requests sometimes return different headers
        - Not Found pages often return HTTP 2xx codes
        - A page can redirect to itself, or into a cycle (more in a moment)
        - Pages can look different to end-user browsers and crawlers
        - Pages can require JavaScript processing
        - Pages can require cookies
        - Pages can be built in non-HTML environments
    • 12. DOCUMENT STORES AND INDEXES
    • 13. (Simplified) Web Search Architecture: Crawlers, Document Store, Index, File Managers, Result Cache, Web Servers, Aggregators
    • 14. Indexing Challenges
      - There are hundreds of billions of web pages (as we’ve seen, it’s really infinite)
      - It is neither practical nor desirable to search over all of them:
        - Should remove spam pages
        - Should remove illegal pages
        - Should remove repetitive or duplicate pages
        - Should remove crawler traps
        - Should remove automatically generated pages
        - Should remove pages that no longer exist
        - Should remove pages that have substantially changed
        - Should remove pages that cannot be understood by the target users
        - …
      - Most search engines index somewhere in the range of 20 to 50 billion documents
      - Figuring out how many pages each engine indexes, and how many pages are on the web, are both hard research problems
    • 15. How do we choose the right pages?
      - There are many ways to choose the right pages:
        - Store those that meet future information needs!
        - In practice, this means:
          - Choose pages that users visit
          - Choose pages that are popular in the web link graph
          - Choose pages that match queries
          - Choose pages from popular sites
          - Choose pages that are clicked on in search results
          - Choose pages shown by competitors
          - Choose pages in the language or market of the users
          - Choose pages that are distinct from other pages
          - Choose pages that change at a moderate rate
          - …
      - Whatever choice is made:
        - The head is stable
        - The tail “wags around”: billions of candidate pages have similar or identical scores
    • 16. Choosing Pages in Practice
      - In practice, there are two solutions to choosing pages for the index:
        - In real time, make a yes/no decision about each page, and add it to the index
        - Store the pages, and process them offline to construct an index
      - The former solution is typically based on the well-known AltaVista “chunk” approach:
        - Create a buffer of documents (a “chunk”)
        - Build an index on that buffer
        - Move the index and content to an index serving node
        - (After some time) Mark the chunk’s URLs for refetch
        - (After some time) Expire the chunk
      - The latter approach is likely what’s used at Google:
        - Store multiple copies of the web in a document store
        - Iterate over the document store (potentially multiple times) to choose documents
        - Create an index, and ship it to the index serving nodes
        - Repeat
    • 17. INVERTED INDEXES
    • 18. Supporting Query Based Retrieval
      - We’ll talk about query evaluation in the next section
      - But, for now, believe that queries are evaluated using inverted indexes
      - Compressed inverted indexes are typically 10%-20% of the size of the data being stored
        - In many cases, they are too large to store in memory, so disk storage is a necessity. However:
          - disk size is limited
          - disk access is slow
    • 19. Inverted Index
      - A document-level inverted index for a collection consists of:
        - a lexicon: a searchable in-memory vocabulary containing the unique searchable terms in the collection (t_1, …, t_n)
        - for each term t, a pointer to the inverted list of that term on disk
      - The inverted list contains information about the occurrences of terms:
        - postings <d, f_{d,t}>, where f_{d,t} is the frequency of term t in document d; one posting is stored for each document in which t occurs
        - additional statistics such as f_t (the number of documents that t occurs in) and L_d (the length of document d)
    • 20. Inverted Index (diagram: an in-memory lexicon maps each term, e.g. “cat”, to its inverted list, kept in memory or on disk; the list “cat → 3: 1, 2, 7” records that “cat” occurs in 3 documents, numbers 1, 2, and 7; a mapping file connects document numbers to documents in the collection)
    • 21. Answering Queries
      - Document numbers and frequencies are sufficient to answer ranked (more later) and Boolean queries
        - the position of terms in a document is not important
      - For phrase and proximity queries, we must additionally store term offsets o_i, so postings need to be of the form:
        <d, f_{d,t} [o_1 … o_{f_{d,t}}]>
      - Example: the inverted list for the term “cat”:
        3 <1, 2 [2, 6]> <2, 1 [8]> <7, 3 [4, 8, 11]>
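A toy positional indexer reproduces the slide's "cat" list. The documents and the tokenization (whitespace splitting, 1-based positions) are invented for illustration; they are chosen so that "cat" lands at the positions the slide uses.

```python
from collections import defaultdict

def build_index(docs):
    """Positional inverted index sketch: for each term, a list of
    postings <d, f_{d,t}, [o_1 ... o_{f_{d,t}}]>, sorted by docno."""
    positions = defaultdict(dict)
    for d, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            positions[term].setdefault(d, []).append(pos)
    # reshape into the slide's posting form; len(postings) gives f_t
    return {t: [(d, len(offs), offs) for d, offs in sorted(by_doc.items())]
            for t, by_doc in positions.items()}
```

Feeding it three made-up documents yields exactly the posting list shown on the slide, with f_t = 3.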
    • 22. Index Ordering
      - Postings are usually ordered by increasing d, and offsets within postings are ordered by increasing o
        - this allows the difference between values to be stored
          - This improves compressibility because the values are smaller
          - This improves compressibility because the integer distribution is more skewed
      - The inverted list for “cat”:
        3 <1, 2 [2, 6]> <2, 1 [8]> <7, 3 [4, 8, 11]>
        becomes:
        3 <1, 2 [2, 4]> <1, 1 [8]> <5, 3 [4, 4, 3]>
      - Other orderings are typically used in web search engines:
        - frequency-sorted indexes
        - impact-ordered indexes
        - PageRank-ordered indexes
        - access-ordered indexes
        - Differences can’t then be taken between d values, but other differences can often be taken
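The d-gap and offset-gap transformation on the slide can be written directly: each docno is replaced by its difference from the previous docno, and each offset by its difference from the previous offset in the same posting.

```python
def gap_encode(postings):
    """Replace docnos and offsets with gaps, as on the slide.
    Input/output postings are (d, f_{d,t}, [offsets]) tuples."""
    encoded, prev_d = [], 0
    for d, f, offsets in postings:
        gaps, prev_o = [], 0
        for o in offsets:
            gaps.append(o - prev_o)   # offset gap within the posting
            prev_o = o
        encoded.append((d - prev_d, f, gaps))   # d-gap between postings
        prev_d = d
    return encoded
```

Applied to the "cat" list it produces exactly the gapped list shown above.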
    • 23. Benefits of Compression
      - Compression of indexes has several benefits:
        1. less storage space is needed
        2. better use of disk-to-CPU communication bandwidth (or main-memory-to-CPU)
        3. more data can be cached in memory, so fewer disk accesses are required for a stream of queries
      - To be effective, the total retrieval time and CPU processing costs under a compression scheme should be less than the retrieval time for the uncompressed representation
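One common byte-aligned scheme is variable-byte (Vbyte) coding, which the experiment notes single out as the fastest to process: each integer is stored as 7 data bits per byte, with the high bit set on the terminating byte, so small gaps compress to a single byte. This is a standard formulation, not necessarily the exact variant measured in the experiments.

```python
def vbyte_encode(n):
    """Variable-byte code sketch: emit 7 data bits per byte,
    low-order bits first; the high bit marks the final byte."""
    out = []
    while n >= 128:
        out.append(n % 128)
        n //= 128
    out.append(n + 128)     # stop bit set on the last byte
    return bytes(out)

def vbyte_decode(data):
    """Decode a concatenation of vbyte-coded integers."""
    nums, n, shift = [], 0, 0
    for b in data:
        if b < 128:                  # continuation byte
            n |= b << shift
            shift += 7
        else:                        # terminating byte
            nums.append(n | (b - 128) << shift)
            n, shift = 0, 0
    return nums
```

Gaps below 128 (the common case after d-gap encoding) take one byte each, which is where both the space saving and the decode speed come from.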
    • 24. Compression Experiments
      - Hardware
        - Intel Pentium III, 1.0 GHz
        - 512 MB main memory
        - Linux operating system (kernel 2.4.7)
      - Collections
        - small
          - 500 MB (94,802 documents from TREC-7 VLC)
          - index fits in main memory (703,518 terms)
        - large
          - 20 GB (4,014,894 documents from TREC-7 VLC)
          - index several times larger than main memory (9,574,703 terms)
      - Queries
        - 10,000 / 25,000 queries from a 1997 query log from the Excite search engine
        - filtered to remove profanities
        - evaluated as conjunctive Boolean queries
    • 25. Results: Small Collection
    • 26. Results: Large Collection
    • 27. QUERY PROCESSING
    • 28. (Simplified) Web Search Architecture: Crawlers, Document Store, Index, File Managers, Result Cache, Web Servers, Aggregators
    • 29. Index Serving
      - The document collection is partitioned equally between n machines
      - Each machine evaluates a query on its fraction of the collection, and returns its best m results
      - An aggregator collates the responses, and chooses the overall best l (typically l = 10)
      - The set of n machines is known as a row
        - Rows can be copied to increase throughput
        - Rows can be widened to decrease latency
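The collate step can be sketched as a k-way merge. Here `node_results` stands for the per-node top-m lists of (doc, score) pairs, each assumed already sorted by descending score; names and shapes are illustrative, not an actual serving API.

```python
import heapq

def aggregate(node_results, l=10):
    """Aggregator sketch: merge each node's descending-score top-m
    list and keep the overall best l results."""
    merged = heapq.merge(*node_results, key=lambda pair: -pair[1])
    return list(merged)[:l]
```

Because each node already returns a sorted list, the merge only inspects list heads, which keeps the aggregator cheap even for wide rows.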
    • 30. Querying on the nodes
      - In practice, nodes don’t exhaustively evaluate queries on the inverted index
      - They stop evaluating when:
        - Time runs out
        - Result sets are stable
        - Enough results have been found
        - The system is under too much load
        - …
    • 31. RANKING
    • 32. Querying in Web Search
      - Web search users search for a variety of information needs. Broder (2002) proposed this taxonomy:
        - Informational (want to learn something) [around 80% of queries]
        - Navigational (want to go somewhere else) [around 10%]
        - Transactional (want to do something) [around 10%]
      - Users express their information needs as queries
        - Usually informally expressed as two or three words (we call this a ranked query; more later)
        - A year-2000 study showed the mean query length was 2.4 words with a median of 2; the mean length is getting longer (Why?)
        - Around 48.4% of users submit just one query in a session, 20.8% submit two, and about 31% submit three or more
        - Less than 5% of queries use Boolean operators (AND, OR, and NOT), and around 5% contain quoted phrases
    • 33. What Users Are Searching For. Reproduced from: Bernard J. Jansen, Amanda Spink: How are we searching the World Wide Web? A comparison of nine search engine transaction logs. Inf. Process. Manage. 42(1): 248-263 (2006)
    • 34. Answers
      - What is a good answer to a query? One that is relevant to the user’s information need!
      - Web search engines typically return ten answers per page, where each answer is a short summary of a web document
      - Likely relevance to an information need is approximated by statistical similarity between web documents and the query
      - Users favour search engines that have high precision, that is, those that return relevant answers in the first page of results
        - Around 75% of queries don’t go beyond page one
    • 35. Approximating Relevance
      - Statistical similarity is used to estimate the relevance of a query to an answer
      - Consider the query “Mark Nason Adler Boots”:
        - An interesting document contains all four words
          - Web search engines enforce this Boolean AND requirement
        - The more frequently the words occur in the document, the better; this is called the term frequency (TF)
        - Better documents have more occurrences of the rarer words
          - For example, an answer containing only “Adler” is likely to be better than an answer containing only “Boots”
          - This is the so-called inverse document frequency (IDF)
    • 36. Term Frequency…
      - The notion of term frequency is typically expressed as tf_{t,d}, the term frequency of term t in document d
      - The weight of the term frequency component in the ranking function is usually a logarithm of the raw frequency (and 0 if tf_{t,d} is zero)
        - This “dampens” the effect of high tf values
        - Usually, if tf_{t,d} > 0 then w_{t,d} = 1 + log10(tf_{t,d})
    • 37. Inverse Document Frequency
      - To introduce discrimination between terms, we introduce the notion of IDF
      - The inverse document frequency, idf_t, is based on the inverse of the number of documents in the collection that contain term t
        - Usually, idf_t = log10(N / df_t), where N is the number of documents in the collection
        - The log is again used to dampen the effect of very uncommon terms
        - Note that every term in the collection has one IDF value
          - This is important for index design and query evaluation; it is one reason why inverted indexing works for web search
    • 38. tf.idf
      - Most popular ranking functions bring together TF and IDF to weight terms:
        w_{t,d} = (1 + log10(tf_{t,d})) x log10(N / df_t)
      - When you hear the phrase “tf.idf”, this is the basic formalization being discussed:
        - The more a term occurs in a relevant document, the better
        - The more discriminating a term is across the collection, the better
        - You’ll sometimes see the same concept written as “tf-idf”
        - You’ll often see the elements of the “tf.idf” approach hidden amongst constants and other factors
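The weight above is a one-liner; a small helper makes the dampening behaviour concrete (the zero case follows the earlier slide's convention that w_{t,d} = 0 when tf_{t,d} = 0).

```python
import math

def tfidf_weight(tf_td, df_t, N):
    """w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t), per the slide;
    0 when the term does not occur in the document."""
    if tf_td == 0:
        return 0.0
    return (1 + math.log10(tf_td)) * math.log10(N / df_t)
```

For example, with N = 1000 documents, a term occurring 10 times in a document and appearing in 100 documents gets weight (1 + 1) x 1 = 2.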
    • 39. A ranking function: Okapi BM25
      - The Okapi ranking function is as follows:
        - Q is a query that contains the words T
        - k1, b, and k3 are constant parameters (k1 = 1.2 and b = 0.75 work well; k3 is 7 or 1000)
        - K is k1 x ((1 - b) + b x dl / avdl)
        - tf is the term frequency of the term within a document
        - qtf is the term frequency in the query
        - w is log((N - n + 0.5) / (n + 0.5))
        - N is the number of documents, n is the number containing the term
        - dl and avdl are the document length and average document length
      - Okapi is a well-known ranking function that you’ll often find in the literature and in experimental research work
      - It also contains an IDL component (more later)
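The slide's formula images did not survive the transcript, so this sketch uses the standard Okapi BM25 formulation with the slide's parameter names (k1, b, k3, tf, qtf, N, n, dl, avdl); treat the exact form as an assumption rather than the deck's own rendering.

```python
import math

def bm25_score(query_tf, doc_tf, dl, avdl, df, N, k1=1.2, b=0.75, k3=7):
    """Standard Okapi BM25 sketch. query_tf: term -> qtf in the query;
    doc_tf: term -> tf in this document; df: term -> n (documents
    containing the term); dl/avdl: document and average lengths."""
    score = 0.0
    for t, qtf in query_tf.items():
        if t not in doc_tf:
            continue
        tf = doc_tf[t]
        w = math.log((N - df[t] + 0.5) / (df[t] + 0.5))   # RSJ term weight
        K = k1 * ((1 - b) + b * dl / avdl)                # length normalization
        score += w * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score
```

Note how the tf component saturates: going from tf = 1 to tf = 3 raises the score, but far less than threefold, which is the same dampening idea as the log in tf.idf.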
    • 40. Comments on tf.idf schemes
      - If a query contains only one word, the IDF component has no effect; it’s only useful for discriminating between terms in queries
      - Many formulations also include IDL, which ensures long documents don’t dominate short documents
      - They are only one (important) component of a modern search engine ranking function
    • 41. Query Evaluation in Web Search
      - In practice, web search engines don’t do pure ranking: they perform a Boolean AND to find documents that contain all query terms, and then rank over those documents
        - High precision, lower recall (more later)
        - Less expensive to evaluate than a ranked query
        - In practice, because of query alterations, the AND is often very broad
    • 42. Ranking in Practice
      - Search engine rankers are complex:
        - Machine-learned ranking functions
        - Hundreds of ranking factors
        - Query-independent ranking factors
        - Document segments or streams
        - Two-pass ranking
        - Early query termination
        - Query alterations
    • 43. Query independent ranking factors
      - Documents may have ranking factors that are query independent:
        - Spam score
        - PageRank or page authority score
        - Basic statistics (word counts, inlink and outlink counts, intra- and inter-domain link counts, image counts, …)
        - Impression counts
        - …
    • 44. Streams
      - It is desirable to rank different parts of the document using different factors, weights, and rankers
      - For example, consider:
        - URL text
        - Title text
        - Body text
        - Anchor text (more in a moment)
        - Query text (queries that lead to clicks on the document)
        - …
      - Streams allow logical documents to be represented in the index
    • 45. Anchor text
      - Anchor text is drawn from the HTML <a> tags that “point” to a document:
        <a href="http://ebay.com">eBay home page</a>
      - Anchor text is often a more useful description of a site or page than the page itself contains
      - Anchor text is very useful for navigational querying
      - In practice, we may take just the anchor text, or some fragment of the text surrounding the tag too
      - Anchor text is painted into the destination document
      - In practice, anchor text management is a key challenge of crawler design
    • 46. Query Text. Idea: tag an item with query <x> when the item is clicked for query <x>. Examples:
      - 110009824911: number stencils
      - 110009869781: rivera ceramic tiles
      - 110009873312: rivera ceramic tiles rivera ceramic tiles rivera ceramic tiles rivera ceramic tiles
      - 110009952936: ibm machines
      - 110010165729: sick
      - 110010223000: mudd
      - 110010296589: carson pirie
      - 110010301601: boy scout shoes
      - 110010311498: ferragamo 6
      - 110010377525: spin win
      - 110010581717: invitations sweet 16
      - 110010594270: bonfire of the vanities
      - 110010672814: 1968 vw manual
      - 110010675084: studebaker poster
      - 110010757262: hawaiian shirt
      - 110010785797: silver capris 27 silver jeans capris 27
      - 110010831515: fishing lures deep diving
      - 110010874213: harley davidson boots harley davidson boots harley davidson boots
      - 110011110468: soligen hunting knife
      - 110011350242: amanda lee
      - 110011535343: orphan annie
      - 110011646306: crkt weasel
      - 110011977526: 18k gold ruby earrings
      - 110011979581: the dale earnhardt story
    • 47. Query Alterations
      - Most engines use past user behavior to aid in augmenting queries
        - For example, if many users correct a misspelling, it will be automatically corrected
      - Query alterations fall into several classes:
        - Corrections (active, partial, or suggested)
        - Additional terms
          - Can be added to the AND, or used in reranking later
        - Terms used for highlighting
    • 48. RELEVANCE MEASUREMENT
    • 49. Relevance Judgment
      - Web search engines measure relevance using human judges
      - Each result for each query is judged on a scale. For example:
        - Perfect: the ideal result for this query
        - Relevant: meets the information need
        - Irrelevant: does not meet the information need
        - Detrimental: hurts the impression of the search engine
      - The judgments are used to compute various metrics that measure recall and precision
    • 50. Recall and Precision
      - Recall is the fraction of all relevant documents in the collection that are retrieved
        - If there are 100 documents in the collection, 10 are relevant, 12 are retrieved, and 3 of those are relevant, the recall is 0.3 or 30%
      - Precision is the fraction of retrieved documents that are relevant
        - With the same numbers, the precision is 0.25 or 25%
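Both definitions reduce to one intersection; the worked numbers on the slide (10 relevant, 12 retrieved, 3 overlapping) make a convenient check.

```python
def recall_precision(retrieved, relevant):
    """Recall and precision per the slide's definitions: the
    retrieved-and-relevant overlap divided by, respectively, the
    number of relevant documents and the number retrieved."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(relevant), hits / len(retrieved)
```

Recall uses the relevant set as the denominator, precision the retrieved set; the same 3 hits give 3/10 = 0.3 and 3/12 = 0.25.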
    • 51. Recall and Precision…
      - In practice, recall is hard to measure:
        - Requires a relevance judgment of every document in the collection for each query
        - Impractical for large collections
        - Typically ignored by web search engines
      - Precision is much easier to measure
      - Precision may fluctuate, but typically decreases with the number of results inspected
    • 52. Recall and Precision…
      - The simplest way to compare two retrieval systems is the P@n measure:
        - Take the top n results for a query q
        - Measure the fraction that are relevant
        - Store this as the P@n for q for system x
        - Determine the mean P@n over all queries for system x
        - Repeat for system y
        - Determine whether x is better or worse than y using a statistical measure, such as a two-sided t-test
      - It’s difficult to choose the right value for n, but a typical choice is n = 10
      - In practice, the measurements are more complex, but typically favor high-precision measures
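The per-query and per-system steps above can be sketched directly; `runs` is a hypothetical structure pairing each query's ranked result list with its judged-relevant set.

```python
from statistics import mean

def p_at_n(ranked, relevant, n=10):
    """P@n for one query: the fraction of the top n results judged relevant."""
    top = ranked[:n]
    return sum(1 for doc in top if doc in relevant) / len(top)

def mean_p_at_n(runs, n=10):
    """Mean P@n for a system over all its queries; runs is a list of
    (ranked_results, relevant_set) pairs, one per query."""
    return mean(p_at_n(ranked, relevant, n) for ranked, relevant in runs)
```

Comparing two systems then means computing `mean_p_at_n` for each and applying a significance test (e.g. a two-sided t-test) to the per-query P@n values.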
    • 53. CACHING AND WEB SERVING
    • 54. Caching
      - Around 70% of web search queries have been seen recently
      - Therefore, web search engines are able to serve most results from large, distributed caches
      - In practice, caching is keyed on more than the query:
        - Market, preferences, personalization, …
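A minimal sketch of such a cache: an LRU store keyed on the query plus a market field (a stand-in for the extra key components the slide mentions; the class name and capacity are invented for illustration).

```python
from collections import OrderedDict

class ResultCache:
    """LRU result-cache sketch keyed on (query, market), per the slide's
    note that caching is keyed on more than the query text."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, query, market):
        key = (query, market)
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)          # mark as recently used
        return self.cache[key]

    def put(self, query, market, results):
        key = (query, market)
        self.cache[key] = results
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used
```

The same query string in two markets occupies two entries, which is exactly why hit ratios drop as more context is folded into the key.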
    • 55. Web Serving <ul><li>The web servers host the web resources </li></ul><ul><ul><li>HTML, CSS, JavaScript, and interpreter </li></ul></ul><ul><li>The web servers often host many simple services </li></ul><ul><li>The web servers open, manage, and close connections to the search engine </li></ul><ul><li>The web servers also manage the issues of being connected to the Internet: </li></ul><ul><ul><li>Denial of service attack prevention </li></ul></ul><ul><ul><li>Load balancing </li></ul></ul><ul><li>They’re also a convenient place to do logging </li></ul>
    • 56. EBAY CHALLENGES IN SEARCH
    • 57. Challenges at eBay <ul><li>eBay manages: </li></ul><ul><ul><li>Over 90 million active users worldwide </li></ul></ul><ul><ul><li>Over 200 million items for sale in 50,000 categories </li></ul></ul><ul><ul><li>Over 8 billion URL requests per day </li></ul></ul><ul><ul><li>Over 10,000 queries per second at peak </li></ul></ul><ul><li>… in a dynamic environment </li></ul><ul><ul><li>Hundreds of new features per quarter </li></ul></ul><ul><ul><li>Roughly 10% of items are listed or ended every day </li></ul></ul><ul><li>… worldwide! </li></ul><ul><ul><li>In 39 countries and 10 languages </li></ul></ul><ul><ul><li>24x7x365 </li></ul></ul><ul><li>More than 70 billion read / write operations per day </li></ul>
    • 58. eBay Search: differences <ul><li>The first major real-time search engine: </li></ul><ul><ul><li>Dynamic collection </li></ul></ul><ul><ul><ul><li>New documents </li></ul></ul></ul><ul><ul><ul><li>Time is important in relevance </li></ul></ul></ul><ul><ul><li>Low-latency requirement for publication-to-index </li></ul></ul><ul><ul><li>Index updates </li></ul></ul><ul><ul><ul><li>Changes in documents, new documents, and document deletions </li></ul></ul></ul>
    • 59. eBay Search: differences… <ul><ul><li>Ranking challenges </li></ul></ul><ul><ul><ul><li>Different signals: </li></ul></ul></ul><ul><ul><ul><ul><li>Temporal relevance </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Rapidly changing signals </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Difficult to maintain accurate statistics </li></ul></ul></ul></ul><ul><ul><ul><ul><li>New terms and phrases </li></ul></ul></ul></ul><ul><ul><ul><li>Missing signals: </li></ul></ul></ul><ul><ul><ul><ul><li>Anchor text, link graph, PageRank, … </li></ul></ul></ul></ul><ul><ul><ul><li>One major query type </li></ul></ul></ul><ul><ul><ul><li>Auction cycle makes tuning harder </li></ul></ul></ul><ul><ul><li>Systems challenges </li></ul></ul><ul><ul><ul><li>Cache hit ratio suffers because results must be fresh </li></ul></ul></ul>
    • 60. Q&A <ul><li>Pssst…. eBay is hiring! Mail me if you’re interested, hugh.williams@ebay.com </li></ul>
    • 61. REFERENCE MATERIAL
    • 62. Great Books! <ul><li>Manning, Raghavan and Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. (free online) </li></ul><ul><li>Croft, Metzler, and Strohman, Search Engines: Information Retrieval in Practice, 2010. </li></ul><ul><li>Witten, Moffat, and Bell, Managing Gigabytes, Morgan-Kaufmann, 2nd Edition, 1999. </li></ul><ul><li>Baeza-Yates and Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 1999. </li></ul>
    • 63. References <ul><li>Spink and Xu, “Selected results from a large study of Web searching: the Excite study”, Information Research 6(1), October 2000. </li></ul><ul><li>Scholer, Williams, Yiannis, and Zobel, “Compression of inverted indexes for fast query evaluation”, In Proc. of the ACM-SIGIR International Conference on Research and Development in Information Retrieval, 2002. </li></ul><ul><li>Broder, “A taxonomy of web search”, SIGIR Forum 36(2), 2002. </li></ul><ul><li>Jansen and Spink, “How are we searching the World Wide Web? A comparison of nine search engine transaction logs”, Inf. Process. Manage. 42(1), 2006. </li></ul>
