Improving relevance with log information


  1. Improving relevance with log analysis (Richard Boulton, [email_address], @rboulton)
  2. Sources of ranking information
  3. Document text
     - Term frequency based weights
       - Vector models, cosine similarity
       - BM25, BM25F
     - Purely based on the document and the query
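As a concrete reference point, the BM25 term weight mentioned above can be sketched in a few lines. This is a minimal single-term scorer, assuming in-memory collection statistics; the defaults `k1=1.2`, `b=0.75` are the commonly used ones.

```python
import math

def bm25_score(tf, doc_len, avg_doc_len, n_docs, df, k1=1.2, b=0.75):
    """Okapi BM25 weight for one term in one document.

    tf: term frequency in the document
    doc_len / avg_doc_len: document length and collection average
    n_docs: total documents in the collection; df: documents containing the term
    """
    # Rarer terms get a higher inverse-document-frequency component.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    # Term frequency saturates, normalised by relative document length.
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm
```

A document's score for a query is the sum of this weight over the query's terms; BM25F extends it by combining per-field term frequencies before saturation.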
  6. Link analysis
     - Google: PageRank
     - Citation analysis
  8. Historical behaviour
     - Which results were picked for this query before?
     - How well did results similar to this result perform in the past?
  10. How to analyse historical behaviour
  11. Finding past behaviour
      - Keep logs of searches, together with their results
      - Keep click-through information
      - Keep track of eventual outcomes (sales, ad views, content downloads)
  14. Hadoop
      - Distributed data processing
      - Map-combine-reduce
      - Very good for log analysis!
  17. Dumbo
      - Python interface for writing Hadoop jobs
      - Very simple to use
      - Very poor documentation, sadly
      - Some performance penalty for using Python, but very good for ad-hoc jobs and rapid development
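A Dumbo job is essentially a pair of Python generator functions, which keeps ad-hoc log analysis short. A minimal sketch counting clicks per (query, document) pair; the tab-separated log format and the `run_locally` harness are assumptions for illustration (a real job passes the same two functions to `dumbo.run(mapper, reducer)` and runs on Hadoop).

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    # Assumed log line format: "query<TAB>clicked_doc"
    query, doc = value.split("\t")
    yield (query, doc), 1

def reducer(key, values):
    yield key, sum(values)

def run_locally(lines):
    """Simulate the map/shuffle/reduce phases in-process (no Hadoop needed)."""
    mapped = [pair for i, line in enumerate(lines) for pair in mapper(i, line)]
    mapped.sort(key=itemgetter(0))  # the "shuffle": group identical keys
    result = {}
    for k, group in groupby(mapped, key=itemgetter(0)):
        for out_key, out_val in reducer(k, (v for _, v in group)):
            result[out_key] = out_val
    return result
```

For example, `run_locally(["shoes\tdoc1", "shoes\tdoc1", "shoes\tdoc2"])` groups and counts the repeated click on `doc1`.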
  21. Past results
      - Easy to track results which were picked, but:
        - New results were never picked
        - New queries never had results picked
        - Need massive volume to get anywhere
  25. Past behaviour
      - Use the history better by building models
      - Represent documents in terms of features
      - Use history to produce a score for each result
      - Use machine learning to build a model that predicts the score from a set of features
      - Use the model to produce scores for ranking
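The first half of that pipeline can be sketched as follows. The feature names and the shape of the log records here are hypothetical; the deck's actual features (per-field BM25, reviews, categories, prices) come later.

```python
def features(query, doc):
    # Hypothetical feature vector for a (query, document) pair;
    # real features would include per-field BM25 scores, reviews, price, ...
    return [doc["title_bm25"], doc["body_bm25"], doc["review_score"]]

def training_examples(log):
    """Turn historical log records into (feature vector, target score) pairs.

    log: iterable of (query, doc, clicked) tuples extracted from search logs.
    Here the target is simply 1.0 for a click, 0.0 otherwise; a fuller
    version would use position-corrected, time-decayed scores.
    """
    for query, doc, clicked in log:
        yield features(query, doc), 1.0 if clicked else 0.0
```

Once a model is trained on these pairs, ranking is just sorting candidate documents by the model's predicted score.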
  30. Features
      - BM25 scores for each field
      - Review scores
      - Categories
      - Prices
        - Price within a category (computed with Dumbo)
  34. Scores
      - Account for position bias
        - Model click-throughs for each position
        - "An Experimental Comparison of Click Position-Bias Models", Craswell et al.
      - Account for old data being less relevant
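One common way to read the position-bias point: under the examination hypothesis, users often never look at lower-ranked results, so raw click-through rates under-count them. A sketch of the correction, with illustrative (not measured) examination probabilities:

```python
# Illustrative examination probabilities by result position: the chance a
# user even looks at that slot. Real values would be fitted from click
# logs, as in the Craswell et al. position-bias models.
EXAMINATION = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.2}

def corrected_ctr(clicks, impressions_by_position):
    """Position-corrected click-through rate.

    Each impression is discounted by how often its position is examined,
    so a click earned at position 5 counts for more than one at position 1.
    """
    effective = sum(n * EXAMINATION.get(pos, 0.1)
                    for pos, n in impressions_by_position.items())
    return clicks / effective if effective else 0.0
```

For example, 10 clicks from 100 impressions at position 5 yields a corrected rate of 0.5, against 0.1 for the same clicks at position 1. Down-weighting old data could be layered on the same way, by decaying each impression's weight with age.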
  36. Building a model
      - Logistic regression
        - Liblinear / libsvm
        - Apache Mahout
      - Neural nets
        - libfann
  38. Interesting results
      - BM25 weights for the title should be biased 5 times higher than weights for the body text*
      - You don't need very much data to build a useful model
      - (* for some sample news data)
  41. Summary
      - Keep your logs!
      - Tie searches to results in your logs
      - Dumbo + Hadoop make ad-hoc investigation of behaviour easy
