  1. Efficient Computation of Diverse Query Results
     Erik Vee; joint work with Utkarsh Srivastava, Jayavel Shanmugasundaram, Prashant Bhat, and Sihem Amer-Yahia
  2. Motivation
       • Imagine looking for shoes on Yahoo! Shopping, and seeing only Reeboks
  3. Motivation
       • Imagine looking for shoes on Yahoo! Shopping, and seeing only Reeboks
       • … or looking for cars on Yahoo! Autos, and seeing only Hondas
  4. Motivation
       • Imagine looking for shoes on Yahoo! Shopping, and seeing only Reeboks
       • … or looking for cars on Yahoo! Autos, and seeing only Hondas
       • … or looking for jobs on Yahoo! HotJobs, and seeing only jobs from Yahoo!
       • It is not enough to simply give the best response
           • We need a diversity of answers
  5. Diversity Search
       • If we display 30 results across 5 categories, we should show 6 items from each category
           • NB: Our goal is to show a range of choices, not a representative sample
           • Recurse on each subgroup of items
       • Diversity is crucial for users looking for a range of results
           • e.g. shopping, information gathering/research
       • Useful for aiding navigation
           • Users tend to favor search-and-click over hierarchies
       • Likely to give at least one good answer on the first page
  6. Contributions
       • Formally define diversity search
           • Other diversity-like approaches use extensive post-processing or are not query-dependent
       • Proved that traditional IR engines cannot produce guaranteed diverse results
       • Gave novel algorithms to produce diverse results
           • Both one-pass (data-streaming) and probing algorithms
       • Experimentally verified that these algorithms are nearly as fast as normal top-k processing
           • Much faster than post-processing techniques
  7. What about other approaches?
       • If not diverse enough, query again
           • E.g. if all results are from one company, issue another query
           • Bad for latency
       • Issue multiple queries (one for Honda, one for Toyota, ...)
           • Can be prohibitively expensive (kills throughput)
               • Latency is fine
           • Some applications may have dozens of top-level categories
       • Fetch extra results, then find the most diverse set among them
           • Not guaranteed to get good results
           • Requires fetching additional results unnecessarily
       • Fetch all results, then find a diverse set
           • Many times slower
       • Take a random sample of results
           • Misses important results
  8. What about clever scoring?
       • Can we give each item a global “diversity” score, then find the top-k using this?
           • Proved in the paper: there is no global score that gives guaranteed diversity
       • Can we give each item a local “diversity” score, so that it has a different score in each list of the inverted index?
           • Proved in the paper: there is no list-based scoring of the items that gives guaranteed diversity
  9. Outline
       • Definition of diversity
       • Overview of our algorithms
       • Our experimental results
  10. Diversity search
       • Over all possible sets of top-k results that match the query, return the set with the most diversity
       • The paper defines diversity more precisely
           • Focus here on the hierarchy view of diversity (in the next slides)
       • For scored diversity (in which each item has a score)
           • Over all possible sets of top-k results with maximum score, return the set with the highest diversity
           • Note: Diversity is only useful when the score is not too fine-grained
  11. Diversity definition (by picture)
       • Determine a category ordering: Make, Model, Color, Year, Text
       • The ordering implicitly defines a hierarchy
  12. Hierarchy after a query
       • E.g. query: text contains 'Low'
       • Diversity search always returns valid results
  13. Hierarchy after a query
       • E.g. query: text contains 'Low'
       • Diversity search always returns valid results
       • All siblings return the same number of results (or as close as possible)
  14. Returning top-k diverse results
       • E.g. query: text contains 'Low'
       • Diversity search always returns valid results
       • Suppose we return k = 4 results
           • Must return 2 Hondas and 2 Toyotas
           • Will not return 2 green Civics
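The balancing rule on these slides (siblings return the same number of results, or as close as possible, applied recursively) can be illustrated with a small sketch. It assumes each matching item is a tuple of branch numbers, one per level of the hierarchy, and it fetches every match before selecting, so it corresponds to the slow post-processing baseline rather than to the authors' algorithms. The names `split_evenly` and `diverse_top_k` are hypothetical.

```python
from collections import defaultdict

def split_evenly(k, capacities):
    """Share k picks among siblings as evenly as possible,
    never giving a sibling more than it can supply."""
    alloc = [0] * len(capacities)
    remaining = min(k, sum(capacities))
    while remaining > 0:
        # One round-robin pass: every sibling with spare items gets one more.
        for i, cap in enumerate(capacities):
            if remaining > 0 and alloc[i] < cap:
                alloc[i] += 1
                remaining -= 1
    return alloc

def diverse_top_k(items, k, level=0):
    """items: equal-length tuples of branch numbers that match the query."""
    if k <= 0 or not items:
        return []
    if level == len(items[0]):
        return items[:k]
    groups = defaultdict(list)
    for item in items:
        groups[item[level]].append(item)
    keys = sorted(groups)
    alloc = split_evenly(k, [len(groups[key]) for key in keys])
    result = []
    for key, share in zip(keys, alloc):
        result.extend(diverse_top_k(groups[key], share, level + 1))
    return result

# Matches for 'Low', written as (make, model, color, year, text) branch numbers.
LOW = [(0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (0, 0, 1, 0, 0), (0, 0, 2, 0, 0),
       (0, 0, 3, 0, 0), (0, 0, 3, 1, 0), (1, 0, 0, 0, 0), (1, 1, 0, 0, 0),
       (1, 2, 0, 0, 0), (1, 3, 0, 0, 0)]

print(split_evenly(30, [100] * 5))  # [6, 6, 6, 6, 6]
print(diverse_top_k(LOW, 4))
```

With k = 4 the sketch picks two items under make 0 and two under make 1, and the two items under make 0 differ in color, which matches the "2 Hondas and 2 Toyotas, not 2 green Civics" outcome.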
  15. Outline
       • Definition of diversity
       • Overview of our algorithms
       • Our experimental results
  16. Algorithms
       • One Pass
           • Never goes backward (just one pass over the dataset)
           • Maintains a top-k diverse set based on what has been seen so far
           • Jumps ahead when more results from the current branch will not help diversity
           • Optimal one-pass algorithm
       • Probe
           • May jump forward or backward (i.e. probes)
           • Proved: at most 2k probes for a top-k diverse result set
       • Both also work for scored diversity
  17. Dewey IDs
       • Every branch gets a number
       • Every item is then labeled, e.g. 0.2.0.1.0 is Honda Odyssey Green '06 'Good miles'
       • Create an inverted index
           • low → 0.0.0.0.0, 0.0.0.1.0, 0.0.1.0.0, 0.0.2.0.0, 0.0.3.0.0, 0.0.3.1.0, 1.0.0.0.0, 1.1.0.0.0, 1.2.0.0.0, 1.3.0.0.0
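As a rough illustration of the labeling step, the sketch below numbers each branch in order of first appearance and then builds an inverted index over the words of the text field. The numbering policy, the sample listings, and the function names are assumptions made for the example; the slides do not say how branch numbers are assigned. Rows are assumed to be distinct.

```python
from collections import defaultdict

def assign_dewey_ids(rows):
    """rows: tuples ordered by the category ordering (make, model, color, year, text).
    Each distinct value under a given parent gets the next free branch number."""
    child_numbers = defaultdict(dict)  # parent path -> {value: branch number}
    ids = []
    for row in rows:
        path, dewey = (), []
        for value in row:
            numbering = child_numbers[path]
            if value not in numbering:
                numbering[value] = len(numbering)
            dewey.append(numbering[value])
            path += (value,)
        ids.append(tuple(dewey))
    return ids

def build_inverted_index(rows, ids):
    """Map each word of the text field to a sorted list of Dewey IDs."""
    index = defaultdict(list)
    for row, dewey in zip(rows, ids):
        for term in set(row[-1].lower().split()):
            index[term].append(dewey)
    for postings in index.values():
        postings.sort()  # tuples sort lexicographically, i.e. in Dewey ID order
    return index

# Hypothetical listings, not taken from the talk.
rows = [
    ("Honda", "Civic", "Green", 2007, "Low miles"),
    ("Honda", "Civic", "Green", 2007, "Good condition"),
    ("Honda", "Civic", "Blue", 2006, "Low price"),
    ("Toyota", "Prius", "Tan", 2007, "Low miles"),
]
ids = assign_dewey_ids(rows)
index = build_inverted_index(rows, ids)
print(ids)
print(index["low"])
```

Storing each ID as a tuple keeps posting lists in Dewey order with an ordinary sort, which is what the Next and Prev operations rely on.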
  18. Next and Prev
       • The index supports two basic operations: Next and Prev
       • E.g. query: text contains 'Low'
           • Next(0.0.3.2.2) = 1.0.0.0.0
           • Prev(2.0.0.0.0) = 1.3.0.0.0
       • The inverted index for 'Low' lists all items in Dewey ID order
           • low → 0.0.0.0.0, 0.0.0.1.0, 0.0.1.0.0, 0.0.2.0.0, 0.0.3.0.0, 0.0.3.1.0, 1.0.0.0.0, 1.1.0.0.0, 1.2.0.0.0, 1.3.0.0.0
       • In general, must find the intersection of lists (still easy)
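One way to picture Next and Prev is as binary searches over a sorted posting list. The sketch below assumes inclusive semantics (Next returns the smallest ID at or after its argument, Prev the largest ID at or before it), which is consistent with the examples on the slides but is not stated there. The function names and the second posting list are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Posting list for 'Low' from the slide, as tuples in Dewey ID order.
LOW = [(0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (0, 0, 1, 0, 0), (0, 0, 2, 0, 0),
       (0, 0, 3, 0, 0), (0, 0, 3, 1, 0), (1, 0, 0, 0, 0), (1, 1, 0, 0, 0),
       (1, 2, 0, 0, 0), (1, 3, 0, 0, 0)]

def next_id(postings, dewey):
    """Smallest ID in postings that is >= dewey, or None."""
    i = bisect_left(postings, dewey)
    return postings[i] if i < len(postings) else None

def prev_id(postings, dewey):
    """Largest ID in postings that is <= dewey, or None."""
    i = bisect_right(postings, dewey)
    return postings[i - 1] if i > 0 else None

def next_in_all(lists, dewey):
    """Smallest ID >= dewey that appears in every posting list, or None."""
    candidate = dewey
    while True:
        hits = [next_id(postings, candidate) for postings in lists]
        if None in hits:
            return None
        top = max(hits)
        if all(hit == top for hit in hits):
            return top
        candidate = top  # some list skipped ahead; retry from the largest hit

print(next_id(LOW, (0, 0, 3, 2, 2)))  # (1, 0, 0, 0, 0)
print(prev_id(LOW, (2, 0, 0, 0, 0)))  # (1, 3, 0, 0, 0)

# A second, made-up posting list to show the intersection case.
OTHER = [(0, 0, 2, 0, 0), (1, 1, 0, 0, 0), (1, 4, 0, 0, 0)]
print(next_in_all([LOW, OTHER], (0, 0, 0, 0, 0)))  # (0, 0, 2, 0, 0)
```

The intersection helper only ever moves forward, so a conjunctive query can still be answered with the same two primitives.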
  19. One pass (for k = 2)
       • First finds 0.0.0.0.0 and 0.0.0.1.0
       • Now knows Civic Green no longer helps
       • Jumps by calling next(0.0.1.0.0)
  20. One pass (for k = 2)
       • First finds 0.0.0.0.0 and 0.0.0.1.0
       • Now knows Civic Green no longer helps!
       • Jumps by calling next(0.0.1.0.0)
       • Finds 0.0.1.0.0; removes 0.0.0.1.0
       • Now knows Civic no longer helps!
       • Jumps by calling next(0.1.0.0.0)
  21. One pass (for k = 2)
       • First finds 0.0.0.0.0 and 0.0.0.1.0
       • Now knows Civic Green no longer helps!
       • Jumps by calling next(0.0.1.0.0)
       • Finds 0.0.1.0.0; removes 0.0.0.1.0
       • Now knows Civic no longer helps!
       • Jumps by calling next(0.1.0.0.0)
       • Finds 1.0.0.0.0; removes 0.0.1.0.0
       • Knows to stop
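The jump targets in this walkthrough can be computed mechanically: to skip the rest of a branch, increment the last component of its prefix and pad with zeros. The sketch below shows only that jump step, replayed against the 'Low' posting list from the Dewey ID slides. It does not implement the bookkeeping that decides when a branch no longer helps. `skip_branch` and `next_id` are hypothetical names, and inclusive Next semantics are assumed.

```python
from bisect import bisect_left

# Posting list for 'Low' from the Dewey ID slides.
LOW = [(0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (0, 0, 1, 0, 0), (0, 0, 2, 0, 0),
       (0, 0, 3, 0, 0), (0, 0, 3, 1, 0), (1, 0, 0, 0, 0), (1, 1, 0, 0, 0),
       (1, 2, 0, 0, 0), (1, 3, 0, 0, 0)]

def next_id(postings, dewey):
    """Smallest ID in postings that is >= dewey, or None."""
    i = bisect_left(postings, dewey)
    return postings[i] if i < len(postings) else None

def skip_branch(dewey, depth):
    """Smallest possible ID after every item that shares dewey[:depth] (depth >= 1)."""
    head = list(dewey[:depth])
    head[-1] += 1
    return tuple(head) + (0,) * (len(dewey) - depth)

# After 0.0.0.0.0 and 0.0.0.1.0, skip the rest of the depth-3 branch 0.0.0.
first_jump = skip_branch((0, 0, 0, 1, 0), depth=3)
print(first_jump, next_id(LOW, first_jump))    # (0, 0, 1, 0, 0) (0, 0, 1, 0, 0)

# After 0.0.1.0.0, skip the rest of the depth-2 branch 0.0.
second_jump = skip_branch((0, 0, 1, 0, 0), depth=2)
print(second_jump, next_id(LOW, second_jump))  # (0, 1, 0, 0, 0) (1, 0, 0, 0, 0)
```

Against this posting list the second jump lands on 1.0.0.0.0, because no item matching 'Low' falls under the branch 0.1.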
  22. Probe (for k = 4)
       • Calls next(0.0.0.0.0) and prev(∞.∞.∞.∞.∞) to find the first and last items
       • Discovers there are only 2 top-level categories
       • Wants another Honda
       • Calls prev(0.∞.∞.∞.∞)
  23. Probe (for k = 4)
       • Calls next(0.0.0.0.0) and prev(∞.∞.∞.∞.∞) to find the first and last items
       • Wants another Honda
       • Calls prev(0.∞.∞.∞.∞)
       • Why not next(0.1.0.0.0)? If Honda has only one child, then it will return a Toyota!
  24. Probe (for k = 4)
       • Calls next(0.0.0.0.0) and prev(∞.∞.∞.∞.∞) to find the first and last items
       • Wants another Honda
       • Calls prev(0.∞.∞.∞.∞)
       • Finds 0.0.3.1.0
       • Wants another Toyota
       • Calls next(1.0.0.0.0)
  25. Probe (for k = 4)
       • Calls next(0.0.0.0.0) and prev(∞.∞.∞.∞.∞) to find the first and last items
       • Wants another Honda
       • Calls prev(0.∞.∞.∞.∞)
       • Finds 0.0.3.1.0
       • Wants another Toyota
       • Calls next(1.0.0.0.0)
       • Finds 1.0.0.0.0
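The individual probes in this walkthrough can be replayed with two small helpers that find the first or last matching item inside a branch. This is a sketch of the probing primitive only, not of the full Probe algorithm. It assumes inclusive Next and Prev semantics and uses `math.inf` as a stand-in for a largest possible branch number. All function names are hypothetical, and the posting list is the one for 'Low' from the Dewey ID slides.

```python
import math
from bisect import bisect_left, bisect_right

INF = math.inf  # assumption: sorts after every real branch number

LOW = [(0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (0, 0, 1, 0, 0), (0, 0, 2, 0, 0),
       (0, 0, 3, 0, 0), (0, 0, 3, 1, 0), (1, 0, 0, 0, 0), (1, 1, 0, 0, 0),
       (1, 2, 0, 0, 0), (1, 3, 0, 0, 0)]

def next_id(postings, dewey):
    i = bisect_left(postings, dewey)
    return postings[i] if i < len(postings) else None

def prev_id(postings, dewey):
    i = bisect_right(postings, dewey)
    return postings[i - 1] if i > 0 else None

def first_in_branch(postings, prefix, depth=5):
    """First item whose ID starts with prefix, or None if the branch is empty."""
    prefix = tuple(prefix)
    hit = next_id(postings, prefix + (0,) * (depth - len(prefix)))
    return hit if hit is not None and hit[:len(prefix)] == prefix else None

def last_in_branch(postings, prefix, depth=5):
    """Last item whose ID starts with prefix, or None if the branch is empty."""
    prefix = tuple(prefix)
    hit = prev_id(postings, prefix + (INF,) * (depth - len(prefix)))
    return hit if hit is not None and hit[:len(prefix)] == prefix else None

print(first_in_branch(LOW, ()))        # (0, 0, 0, 0, 0), the first item overall
print(last_in_branch(LOW, ()))         # (1, 3, 0, 0, 0), the last item overall
print(last_in_branch(LOW, (0,)))       # (0, 0, 3, 1, 0), another item under make 0
print(first_in_branch(LOW, (1,)))      # (1, 0, 0, 0, 0), another item under make 1
print(next_id(LOW, (0, 1, 0, 0, 0)))   # (1, 0, 0, 0, 0), overshoots into make 1
```

The last call shows why the walkthrough probes backward from the end of the Honda branch: probing forward from 0.1.0.0.0 leaves the branch when it has only one child.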
  26. Outline
       • Definition of diversity
       • Overview of our algorithms
       • Our experimental results
  27. Results
       • Dataset consisted of listings from Yahoo! Autos
       • Queries were synthetic, to test various parameters
           • Selectivity, # predicates, # results
       • Preprocessing time for 100K listings < 5 min
           • Times shown are for 5K queries
       • 4 algorithms
           • Basic: No diversity
           • Naïve: Fetch everything, post-process
           • OnePass: Our algorithm. Takes just one pass over the data
           • Probe: Our algorithm. May make multiple probes into the data
  28. Comparable time for diversity search
       • [Figure: running times of the algorithms; panels: unscored, scored]
       • Basic: No diversity
       • Naïve: Many times slower
       • OnePass: Close to Probe
       • Probe: Within a factor of 2 of no diversity
       • MultiQuery (not shown): Latency close to Basic, but throughput many times worse
  29. Results summary
       • Getting diverse results is not much slower than getting non-diverse results
           • Many times faster than naïve approaches
       • The multi-query approach has even worse throughput than naïve
           • But it keeps latency low
       • How does this compare to getting extra results, then finding a diverse subset?
           • Getting 2k results instead of k is about twice as slow
           • Plus, it does not guarantee diverse results
  30. Conclusions
       • Can get guaranteed diversity, taking time close to a normal top-k query
           • Almost as fast as, or faster than, approaches that give no guarantee
           • Diversity at every level
       • Works even when items have scores
       • Needs a different algorithm than traditional IR engines
           • Proved this in the paper (under standard notions)
       • Are there approximate notions that can use existing IR machinery?
