Improving relevance with log information
Published in: Technology


Transcript

  • 1. Improving relevance with log analysis Richard Boulton [email_address] @rboulton
  • 2. Sources of ranking information
  • 3. Document text
    • Term frequency based weights.
      • Vector models, cosine similarity
      • BM25, BM25F
      • Purely based on document and query
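A minimal sketch of the BM25 weighting named above, assuming pre-tokenised documents and the commonly used parameter defaults `k1=1.2` and `b=0.75`. The function name `bm25_scores` and the sample documents are illustrative only and do not come from the talk.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document in `docs` against `query_terms` with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF variant that never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus.
docs = [
    "hadoop log analysis with python".split(),
    "cooking with python snakes".split(),
    "log log log analysis".split(),
]
scores = bm25_scores(["log", "analysis"], docs)
```

The score depends only on the query and the document text, which is the limitation the later slides address with behavioural data.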
  • 6. Link analysis
    • Google: PageRank
    • Citation analysis
  • 8. Historical behaviour
    • Which results were picked for this query before.
    • How well results similar to this one performed in the past.
  • 10. How to analyse historical behaviour
  • 11. Finding past behaviour
    • Keep logs of searches, together with their results.
    • Keep click-through information.
    • Keep track of eventual outcomes (sales, ad views, content downloads).
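One way the three kinds of information above could be kept together is a single JSON record per search. The sketch below assumes that layout; every field name (`query_id`, `results`, `clicks`, `outcome`) is hypothetical and not taken from the talk.

```python
import json

def make_log_record(query_id, query, result_ids, clicked_ids, outcome=None):
    """Serialise one search, its ranked results, clicks and eventual outcome."""
    return json.dumps({
        "query_id": query_id,    # ties later click and outcome events to this search
        "query": query,
        "results": result_ids,   # ranked list, in the order shown to the user
        "clicks": clicked_ids,   # results the user clicked through to
        "outcome": outcome,      # e.g. "sale", "ad_view", "download", or None
    }, sort_keys=True)

line = make_log_record("q1", "red shoes", ["d7", "d2", "d9"], ["d2"], outcome="sale")

# Because the ranked results are stored with the search, the position of
# each click can be recovered later (1-based rank).
record = json.loads(line)
clicked_positions = [record["results"].index(d) + 1 for d in record["clicks"]]
```

Storing the ranked result list alongside the query is what makes later position-bias correction possible.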
  • 14. Hadoop
    • Distributed data processing
    • Map-combine-reduce
    • Very good for log analysis!
  • 17. Dumbo
    • Python interface for writing Hadoop jobs.
    • Very simple to use.
    • Very poor documentation, sadly.
    • Some performance penalty for using Python, but very good for ad hoc jobs and rapid development.
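The map-combine-reduce pattern can be sketched in plain Python for a click-counting job. The tab-separated log layout is an assumption, and `run_local` is a hypothetical in-process stand-in for the cluster. Dumbo jobs are, as far as I know, written as generator functions of this shape and handed to `dumbo.run`, but that is treated as an assumption here and Dumbo itself is not imported.

```python
from collections import defaultdict

def mapper(key, value):
    """Assumed log line layout: query <TAB> doc_id <TAB> position <TAB> clicked."""
    query, doc_id, _position, clicked = value.split("\t")
    yield (query, doc_id), (1, int(clicked))

def reducer(key, values):
    """Sum (impressions, clicks) pairs; the same function serves as combiner."""
    impressions = clicks = 0
    for shown, clicked in values:
        impressions += shown
        clicks += clicked
    yield key, (impressions, clicks)

def run_local(lines, mapper, combiner, reducer, chunk_size=2):
    """Simulate map-combine-reduce: each chunk plays the role of one map task."""
    shuffled = defaultdict(list)
    for start in range(0, len(lines), chunk_size):
        local = defaultdict(list)
        for offset, line in enumerate(lines[start:start + chunk_size], start):
            for k, v in mapper(offset, line):
                local[k].append(v)
        # Combine step: pre-aggregate per map task before the shuffle.
        for k, vs in local.items():
            for ck, cv in combiner(k, vs):
                shuffled[ck].append(cv)
    result = {}
    for k in sorted(shuffled):
        for rk, rv in reducer(k, shuffled[k]):
            result[rk] = rv
    return result

log_lines = [
    "red shoes\td2\t1\t0",
    "red shoes\td2\t1\t1",
    "red shoes\td7\t2\t0",
    "red shoes\td2\t1\t1",
]
counts = run_local(log_lines, mapper, reducer, reducer)
```

The reducer can double as the combiner only because summing pairs is associative and its output has the same shape as its input.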
  • 21. Past results
    • Easy to track results which were picked, but:
    • New results were never picked
    • New queries never had results picked
    • Need massive volume to get anywhere
  • 25. Past behaviour
    • Use the history better by building models
    • Represent documents in terms of features.
    • Use history to produce a score for each result.
    • Use machine learning to build a model to predict the score for a set of features.
    • Use the model to produce scores for ranking.
  • 30. Features
    • BM25 scores for each field
    • Review scores
    • Categories
    • Prices
      • Price within a category (dumbo)
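A small sketch of turning the items above into a feature dictionary, including price relative to the category mean. The slide suggests the per-category price statistic was computed with Dumbo; this in-memory version, the field names and the sample products are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

def category_mean_prices(products):
    """Mean price per category, used to put prices on a comparable scale."""
    by_category = defaultdict(list)
    for p in products:
        by_category[p["category"]].append(p["price"])
    return {cat: mean(prices) for cat, prices in by_category.items()}

def build_features(product, cat_means):
    """Represent one document (product) as a flat feature dictionary."""
    return {
        "bm25_title": product["bm25_title"],
        "bm25_body": product["bm25_body"],
        "review_score": product["review_score"],
        # 1.0 means "typical price for this category".
        "relative_price": product["price"] / cat_means[product["category"]],
    }

# Hypothetical products; BM25 values would come from the search engine.
products = [
    {"id": "d1", "category": "shoes", "price": 50.0,
     "bm25_title": 2.1, "bm25_body": 0.4, "review_score": 4.5},
    {"id": "d2", "category": "shoes", "price": 150.0,
     "bm25_title": 1.3, "bm25_body": 0.9, "review_score": 3.0},
    {"id": "d3", "category": "books", "price": 10.0,
     "bm25_title": 0.2, "bm25_body": 1.1, "review_score": 4.0},
]
cat_means = category_mean_prices(products)
features = {p["id"]: build_features(p, cat_means) for p in products}
```

Normalising within a category keeps a cheap pair of shoes and a cheap book comparable, which a raw price feature would not.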
  • 34. Scores
    • Account for position bias
      • Model click-throughs for each position
      • “An Experimental Comparison of Click Position-Bias Models”, Craswell et al.
    • Account for old data being less relevant
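The two adjustments above can be sketched with a simple "clicks over expected clicks" score and an exponential time decay. This is a basic baseline chosen for illustration, not necessarily one of the models compared by Craswell et al., and the event fields and the 30-day half-life are assumptions.

```python
from collections import defaultdict

def estimate_position_ctr(events):
    """Average click-through rate observed at each result position."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for e in events:
        shown[e["position"]] += 1
        clicked[e["position"]] += e["clicked"]
    return {pos: clicked[pos] / shown[pos] for pos in shown}

def result_score(events, position_ctr, now, half_life_days=30.0):
    """Decayed clicks divided by decayed expected clicks for one result.

    A score above 1.0 means the result was clicked more often than a
    typical result shown at the same positions.
    """
    actual = expected = 0.0
    for e in events:
        # Older events count for less: the weight halves every half_life_days.
        weight = 0.5 ** ((now - e["day"]) / half_life_days)
        actual += weight * e["clicked"]
        expected += weight * position_ctr[e["position"]]
    return actual / expected if expected > 0 else 0.0

# Hypothetical impression log: one entry per (result shown, position, day).
events = [
    {"doc": "A", "position": 2, "day": 100, "clicked": 1},
    {"doc": "A", "position": 2, "day": 100, "clicked": 0},
    {"doc": "B", "position": 1, "day": 100, "clicked": 1},
    {"doc": "B", "position": 1, "day": 100, "clicked": 0},
    {"doc": "C", "position": 1, "day": 100, "clicked": 1},
    {"doc": "C", "position": 1, "day": 100, "clicked": 0},
    {"doc": "D", "position": 2, "day": 100, "clicked": 0},
    {"doc": "D", "position": 2, "day": 100, "clicked": 0},
]
position_ctr = estimate_position_ctr(events)
score_a = result_score([e for e in events if e["doc"] == "A"], position_ctr, now=100)
score_b = result_score([e for e in events if e["doc"] == "B"], position_ctr, now=100)
```

In this toy data A and B have the same raw click rate, but A earns the higher score because it was shown at position 2, where clicks are rarer.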
  • 36. Building a model
    • Logistic regression
      • Liblinear / libsvm
      • Apache Mahout
    • Neural nets
      • libfann
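To show what the logistic regression step does, here is a dependency-free sketch trained by stochastic gradient ascent. It does not use Liblinear, Mahout or any other library named on the slide; the feature columns, training rows and hyperparameters are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Predicted probability that a result with features `x` is picked."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def train_logistic(rows, labels, epochs=500, lr=0.1):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = y - predict(weights, bias, x)
            bias += lr * err
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias

# Hypothetical training data: [bm25_title, relative_price] -> picked (1) or not (0).
rows = [[2.0, 0.5], [1.8, 1.0], [0.2, 1.5], [0.1, 0.8]]
labels = [1, 1, 0, 0]
weights, bias = train_logistic(rows, labels)

# Rank unseen results by predicted score, highest first.
candidates = {"strong_title": [1.9, 0.75], "weak_title": [0.15, 1.15]}
ranking = sorted(candidates,
                 key=lambda name: predict(weights, bias, candidates[name]),
                 reverse=True)
```

Because the model works on features rather than document identities, it can score new results and new queries that have no click history of their own.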
  • 38. Interesting results
    • BM25 weights for the title should be about 5 times higher than the weights for body text.*
    • Don't need very much data to build a useful model.
    • * For some sample news data.
  • 41. Summary
    • Keep your logs!
    • Tie searches to results in logs.
    • Dumbo + Hadoop makes ad hoc investigation of behaviour easy.
