Yahoo Real Time Search SMX March 2010


Published in: Technology

  1. 1. Yahoo! Search: Real Time Search. Exploring the frontiers in modern information retrieval <ul><li>March 2010 </li></ul>Ivan Davtchev, @ivan_d
  2. 2. What is real time search? <ul><li>Showing the most relevant up-to-date content for a topic of recently increased interest. </li></ul><ul><li>Freshest content is great, but not always best. </li></ul>
  3. 3. Recent Y! real time search launches: Tweets on SRP; Improved Yahoo! News + Twitter
  4. 4. Challenges of real time search <ul><ul><li>Real-time indexing : get new content as it is published </li></ul></ul><ul><ul><ul><li>Crawl really, really fast </li></ul></ul></ul><ul><ul><ul><li>Index news feeds, RSS, Twitter </li></ul></ul></ul><ul><ul><li>Query analysis : discover queries to handle differently </li></ul></ul><ul><ul><ul><li>For most queries, promoting recent content degrades relevance </li></ul></ul></ul><ul><ul><li>Ranking for fresh content : adjust ranking algorithms </li></ul></ul><ul><ul><ul><li>For most newly-discovered content, many traditional ranking signals do not exist or are weak (e.g. anchor text) </li></ul></ul></ul>
  5. 5. How do you know a query is really hot? <ul><li>Find, in real-time, queries about emerging events and news stories </li></ul><ul><ul><li>E.g. natural disasters; sports updates; political breaking stories; etc. </li></ul></ul><ul><li>Standard approach not ideal: </li></ul><ul><ul><li>Maintain temporal model for each query </li></ul></ul><ul><ul><ul><li>Full time series or just statistics </li></ul></ul></ul><ul><ul><li>Identify irregularities in model </li></ul></ul><ul><ul><ul><li>Change in moving average of more than n σ’s </li></ul></ul></ul><ul><ul><li>Works well for head queries, not so for torso/tail </li></ul></ul>Screenshots of Google Trends (©2008 Google) taken at 22:23 PST on 3/1/2010 to illustrate temporal model and tail queries.
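The standard approach on this slide (keep a per-query temporal model and flag a change in the moving average of more than n σ) can be sketched roughly as follows. This is a minimal illustration, not Yahoo!'s implementation: the function name `is_spiking`, the `window` and `n_sigma` defaults, and the sample counts are all assumptions made for the example.

```python
from statistics import mean, stdev

def is_spiking(counts, window=3, n_sigma=3.0):
    """Flag a query whose latest moving average exceeds its history by n_sigma.

    counts: per-interval query counts, oldest first (hypothetical layout).
    """
    if len(counts) < 2 * window + 1:
        return False  # too little history to estimate a standard deviation
    # Moving average over every full window in the series.
    averages = [mean(counts[i:i + window])
                for i in range(len(counts) - window + 1)]
    # Drop the windows that overlap the latest one, so the spike
    # does not contaminate its own baseline.
    history, latest = averages[:-window], averages[-1]
    sigma = stdev(history)
    if sigma == 0:
        return latest > mean(history)
    return (latest - mean(history)) / sigma > n_sigma

# Hypothetical head query: high, stable volume, then a clear jump.
head = [100, 102, 98, 101, 99, 100, 103, 97, 100, 400, 420, 450]
# Hypothetical tail query: counts are so sparse that the noise hides any signal.
tail = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0]

print(is_spiking(head), is_spiking(tail))
```

With counts this sparse, the tail query's moving average stays within the noise of its own history, which is the weakness for torso/tail queries that the slide points out.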
  6. 6. Our Answer: Yahoo! TimeSense
  7. 7. How is this different and better? <ul><li>It uses language modeling </li></ul><ul><ul><li>Language = “collection of words” </li></ul></ul><ul><ul><ul><li>All words in Webster’s Dictionary = English </li></ul></ul></ul><ul><ul><ul><li>All words in Yahoo! query logs, including misspellings = Query Log Language </li></ul></ul></ul><ul><ul><li>Language model = a way to explain a “vocabulary” to a computer </li></ul></ul><table><tr><th>Sentence</th><th>Times seen</th><th>Fraction of all sentences</th></tr><tr><td>ebay</td><td>90000</td><td>1/1000</td></tr><tr><td>apple</td><td>80000</td><td>1/2000</td></tr><tr><td>britney spears</td><td>40000</td><td>1/5000</td></tr><tr><td>ebay apple britney spears</td><td>0</td><td>0</td></tr></table>
  8. 8. Why Language Modeling? <ul><li>A language model allows us to ask questions like </li></ul><ul><ul><li>“Is this word part of the language?” </li></ul></ul><ul><ul><ul><li>Answer: if in table, yes </li></ul></ul></ul><ul><ul><li>“Which word is more likely to appear in the language, A or B?” </li></ul></ul><ul><ul><ul><li>Answer: whatever is higher up in the table </li></ul></ul></ul><ul><ul><li>“Is a sentence more likely to be found in text from language A or language B?” </li></ul></ul><ul><ul><ul><li>Answer: look at the table where the sentence is higher </li></ul></ul></ul><ul><ul><li>We convert the task of classifying buzzing queries to a series of “language model questions” </li></ul></ul><ul><ul><li>And build models to answer these from query logs </li></ul></ul>
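The three "language model questions" can be illustrated with a table of sentence fractions built from a log. The sketch below treats each whole query as one "sentence", as the slides do; the helper names (`build_model`, `in_language`, `more_likely`, `closer_language`) and the tiny sample logs are hypothetical and chosen only for illustration.

```python
from collections import Counter

def build_model(sentences):
    """Map each distinct sentence to its fraction of all sentences seen."""
    counts = Counter(sentences)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def in_language(model, sentence):
    """Question 1: is the sentence in the table at all?"""
    return sentence in model

def more_likely(model, a, b):
    """Question 2: which of a or b sits higher in the table?"""
    return a if model.get(a, 0.0) >= model.get(b, 0.0) else b

def closer_language(sentence, models):
    """Question 3: in which named table does the sentence rank highest?"""
    return max(models, key=lambda name: models[name].get(sentence, 0.0))

# Hypothetical miniature logs, not real Yahoo! data.
query_log = ["ebay"] * 9 + ["apple"] * 8 + ["britney spears"] * 4
other_text = ["apple", "orchard", "apple", "tree"]

query_model = build_model(query_log)
models = {"query log": query_model, "other text": build_model(other_text)}

print(in_language(query_model, "ebay apple britney spears"))
print(more_likely(query_model, "ebay", "apple"))
print(closer_language("ebay", models))
```

An unseen sentence simply gets a fraction of zero here; a production model would likely smooth those zeros, which this sketch leaves out.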
  9. 9. Buzzing / Spiking Queries <ul><li>Q: Is this query much more prominent right now? </li></ul><ul><li>To answer, we build many small language models: </li></ul><ul><ul><li>One for each X minutes of query logs in the past month </li></ul></ul>Source: Y! paper “Towards Recency Ranking in Web Search”, WSDM 2010 [Figure: timeline over Feb. 6, Feb. 7, Feb. 8 marking the current model for the last X minutes and the model for 02/07/2010 1:0Xpm]
  10. 10. Buzzing / Spiking Queries <ul><li>Q: Is this query much more prominent right now? </li></ul><ul><ul><li>Language Model: Is this more likely to belong to the last X minutes than to </li></ul></ul><ul><ul><ul><li>The previous X minutes </li></ul></ul></ul><ul><ul><ul><li>The same X minutes in the previous day </li></ul></ul></ul><ul><ul><ul><li>The same X minutes in the previous week </li></ul></ul></ul><ul><ul><ul><li>Etc. </li></ul></ul></ul>Source: Y! paper “Towards Recency Ranking in Web Search”, WSDM 2010 [Figure: timeline over Feb. 6, Feb. 7, Feb. 8 comparing the current model for the last X minutes with the model for the same X minutes on the previous day]
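The comparison on this slide (is the query more likely under the model for the last X minutes than under the earlier reference models?) could be scored along the following lines. This is a sketch under stated assumptions, not the formula from the cited paper: the add-alpha smoothing, the log-ratio against the best-scoring reference window, and the names `smoothed_prob` and `buzz_score` are all choices made for this example.

```python
import math
from collections import Counter

def smoothed_prob(query, window, vocab_size, alpha=1.0):
    """Add-alpha smoothed fraction of a query within one window of the log."""
    counts = Counter(window)
    return (counts[query] + alpha) / (len(window) + alpha * vocab_size)

def buzz_score(query, current, references, alpha=1.0):
    """Log-likelihood ratio of the current window against the best reference.

    A positive score means the query is more likely under the model for the
    last X minutes than under every reference model.
    """
    vocab = set(current) | {query}
    for ref in references:
        vocab.update(ref)
    v = len(vocab)
    now = math.log(smoothed_prob(query, current, v, alpha))
    past = max(math.log(smoothed_prob(query, ref, v, alpha))
               for ref in references)
    return now - past

# Hypothetical windows of X minutes each, not real query logs.
current = ["breaking story"] * 6 + ["ebay"] * 4
previous_window = ["ebay"] * 5 + ["apple"] * 5
same_window_yesterday = ["ebay"] * 6 + ["apple"] * 3 + ["breaking story"]
references = [previous_window, same_window_yesterday]

print(buzz_score("breaking story", current, references))
print(buzz_score("ebay", current, references))
```

Taking the maximum over the reference windows is a conservative choice: a query that is equally popular at the same time every day or week will not score as buzzing, even if it was absent from the immediately preceding window.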
  11. 11. There is more secret sauce of course… <ul><li>Perhaps building language models for fresh content like Yahoo! News and Twitter… </li></ul><ul><li>And doing this very fast in production… </li></ul><ul><li>And then ranking content in real time – we have an interesting paper coming out soon: “Improving Recency Ranking Using Twitter Data” </li></ul>
  12. 12. Yahoo! Real Time features to look forward to <ul><li>Ranking + indexing even closer to true real time </li></ul><ul><li>Real time results in search verticals beyond news </li></ul><ul><li>Real time relevance algorithms powering experiences in Yahoo! properties beyond Search </li></ul>
  13. 13. Real Time practices to avoid <ul><li>Do not create content with unrelated buzz terms </li></ul><ul><li>Do not abuse shortening services for spam links </li></ul><ul><li>Do not go overboard with Twitter #hashtags </li></ul><ul><li>We aim to completely remove real time spam! </li></ul>Screenshots taken on 3/1/2010 © 2010 Twitter
  14. 14. Learn more <ul><li>Yahoo! Search Blog: </li></ul><ul><li>@YahooSearch </li></ul><ul><li>Yahoo! Search Sciences: </li></ul><ul><li>@YahooLabs </li></ul>