Beyond TF-IDF: Why, What & How

Presented by Stephen Murtagh, Etsy.com, Inc.

TF-IDF (term frequency, inverse document frequency) is a standard method of weighting query terms when scoring documents, and it is the default in Solr/Lucene. Unfortunately, the IDF component is really only a measure of rarity, not of quality or usefulness, so it gives more weight to a useless but rare term, such as a misspelling, than to a more useful but more common term.

In this presentation, we will discuss our experiences replacing Lucene's TF-IDF-based scoring function with a more useful one based on information gain, a standard machine-learning measure that combines frequency and specificity. Information gain is much more expensive to compute, however, so the term weights must be computed periodically outside of Solr/Lucene and then made accessible within it.
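As a toy illustration of the problem (not Etsy's code), the sketch below computes a simplified Lucene-style IDF over a made-up set of listing titles: the rare misspelling "handmde" receives a larger weight than the common but meaningful "handmade", even though the misspelling is useless for ranking.

    import math
    from collections import Counter

    # Toy corpus of listing titles (made up for illustration).
    listings = [
        "handmade leather bag",
        "handmade silver ring",
        "handmade wool scarf",
        "handmde leather wallet",   # rare misspelling of "handmade"
    ]

    n = len(listings)
    doc_freq = Counter()
    for title in listings:
        doc_freq.update(set(title.split()))

    def idf(term):
        # Simplified IDF (Lucene's exact formula differs slightly):
        # rare terms get large weights regardless of usefulness.
        return 1.0 + math.log(n / doc_freq[term])

    print(idf("handmade"))  # ~1.29: appears in 3 of 4 listings
    print(idf("handmde"))   # ~2.39: appears in only 1 listing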


Transcript

  • 1. Beyond TF-IDF. Stephen Murtagh, etsy.com
  • 2. 20,000,000 items
  • 3. 1,000,000 sellers
  • 4. 15,000,000 daily searches. 80,000,000 daily calls to Solr
  • 5. Etsy Engineering
    • Code as Craft - our engineering blog: http://codeascraft.etsy.com/
    • Continuous Deployment: https://github.com/etsy/deployinator
    • Experiment-driven culture
    • Hybrid engineering roles
    • Dev-Ops
    • Data-Driven Products
  • 6. Etsy Search
    • 2 search clusters: Flip and Flop
    • Master -> 20 slaves
    • Only one cluster takes traffic
    • Thrift (no HTTP endpoint)
    • BitTorrent for index replication
    • Solr 4.1
    • Incremental index every 12 minutes
  • 7. Beyond TF-IDF
    • Why?
    • What?
    • How?
  • 8. Luggage tags: “unique bag”, q = unique+bag
  • 9. q = unique+bag [two example listings, one ranked above the other]
  • 10. Scoring in Lucene
  • 11. Scoring in Lucene: part of the score is constant, fixed for any given query
  • 12. Scoring in Lucene: the score combines f(term, document) and f(term)
  • 13. Scoring in Lucene: f(term, document) depends on user content; f(term) only measures rarity
  • 14. IDF(“unique”) = 4.429547 > IDF(“bag”) = 4.32836
  • 15. q = unique+bag: “unique unique bag” scores higher than “unique bag bag”
  • 16. “unique” tells us nothing...
  • 17. Stop words
    • Add “unique” to stop word list?
    • What about “handmade” or “blue”?
    • Low-information words can still be useful for matching
    • ... but harmful for ranking
  • 18. Why not replace IDF?
  • 19. Beyond TF-IDF
    • Why? IDF ignores term “usefulness”
    • What?
    • How?
  • 20. Beyond TF-IDF
    • Why? IDF ignores term “usefulness”
    • What?
    • How?
  • 21. What do we replace it with?
  • 22. Benefits of IDF
    Term-document count matrix:
    I1 =          doc1  doc2  doc3  ...  doc_n
        art         2     0     1   ...    1
        jewelry     1     3     0   ...    0
        ...
        term_m      1     0     1   ...    0
  • 23. Benefits of IDF
    IDF(jewelry) = 1 + log(n / Σ_d i_{d,jewelry})
  • 24. Sharding
    The index is split into two term-document matrices:
    I1 =          doc1    doc2    doc3    ...  doc_k
        art         2       0       1     ...    1
        jewelry     1       3       0     ...    0
        ...
        term_m      1       0       1     ...    0
    I2 =          doc_k+1 doc_k+2 doc_k+3 ...  doc_n
        art         6       1       0     ...    1
        jewelry     0       1       3     ...    0
        ...
        term_m      0       1       1     ...    0
  • 25. Sharding
    IDF(jewelry) = 1 + log(n / Σ_d i_{d,jewelry}) (computed over all n documents, spanning both shards)
  • 26. Sharding
    IDF1(jewelry) ≠ IDF2(jewelry) ≠ IDF(jewelry)
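The sharding issue on slides 24-26 can be seen in a few lines. This is a minimal sketch with made-up shard statistics, not Etsy's code: computing IDF from per-shard document frequencies gives each shard a different weight for the same term, and neither matches the IDF of the full index.

    import math

    def idf(num_docs, doc_freq):
        # IDF(t) = 1 + log(n / Σ_d i_{d,t}), as on the slides.
        return 1.0 + math.log(num_docs / doc_freq)

    # Hypothetical per-shard statistics for the term "jewelry".
    shard1 = {"num_docs": 10000, "doc_freq": 500}
    shard2 = {"num_docs": 10000, "doc_freq": 50}

    idf1 = idf(shard1["num_docs"], shard1["doc_freq"])
    idf2 = idf(shard2["num_docs"], shard2["doc_freq"])
    idf_global = idf(shard1["num_docs"] + shard2["num_docs"],
                     shard1["doc_freq"] + shard2["doc_freq"])

    print(idf1, idf2, idf_global)  # IDF1(jewelry) != IDF2(jewelry) != IDF(jewelry)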
  • 27. Sharded IDF options
    • Ignore it - shards score differently
    • Shards exchange stats - messy
    • Central source distributes IDF to shards
  • 28. Information Gain
    • P(x) - probability of "x" appearing in a listing
    • P(x|y) - probability of "x" appearing given "y" appears
    info(y) = D(P(X|y) || P(X)) = Σ_{x∈X} P(x|y) * log(P(x|y) / P(x))
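A minimal sketch of the measure defined on slide 28: info(y) is the KL divergence between the term distribution conditioned on y appearing, P(X|y), and the unconditional distribution P(X). The three-term vocabulary and the probabilities are made up; they are not Etsy's numbers.

    import math

    def info_gain(p_x_given_y, p_x):
        # info(y) = D(P(X|y) || P(X)) = Σ_x P(x|y) * log(P(x|y) / P(x))
        return sum(p_xy * math.log(p_xy / p_x[x])
                   for x, p_xy in p_x_given_y.items()
                   if p_xy > 0)

    # Hypothetical distributions over a three-term vocabulary.
    p_x = {"bag": 0.2, "unique": 0.3, "leather": 0.5}

    # "bag" shifts the distribution a lot -> high information gain.
    print(info_gain({"bag": 0.6, "unique": 0.1, "leather": 0.3}, p_x))   # ~0.40
    # "unique" barely changes it -> low information gain.
    print(info_gain({"bag": 0.2, "unique": 0.35, "leather": 0.45}, p_x)) # ~0.007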
  • 29. Terms with similar IDF:
    Term      Info(x)  IDF
    unique    0.26     4.43
    bag       1.24     4.33
    pattern   1.20     4.38
    original  0.85     4.38
    dress     1.31     4.42
    man       0.64     4.41
    photo     0.74     4.37
    stone     0.92     4.35
  • 30. Terms with similar info gain:
    Term      Info(x)  IDF
    unique    0.26     4.39
    black     0.22     3.32
    red       0.22     3.52
    handmade  0.20     3.26
    two       0.32     5.64
    white     0.19     3.32
    three     0.37     6.19
    for       0.21     3.59
  • 31. q = unique+bag
    Using IDF: score(“unique unique bag”) > score(“unique bag bag”)
    Using information gain: score(“unique unique bag”) < score(“unique bag bag”)
  • 32. Beyond TF-IDF
    • Why? IDF ignores term “usefulness”
    • What?
    • How?
  • 33. Beyond TF-IDF
    • Why? IDF ignores term “usefulness”
    • What? Information gain accounts for term quality
    • How?
  • 34. Beyond TF-IDF
    • Why? IDF ignores term “usefulness”
    • What? Information gain accounts for term quality
    • How?
  • 35. Listing Quality
    • Performance relative to rank
    • Hadoop: logs -> hdfs
    • cron: hdfs -> master
    • bash: master -> slave
    • Loaded as external file field
  • 36. Computing info gain
    From the term-document count matrix I1 (as on slide 22):
    info(y) = D(P(X|y) || P(X)) = Σ_{x∈X} P(x|y) * log(P(x|y) / P(x))
  • 37. Hadoop
    • Brute-force
    • Count all terms
    • Count all co-occurring terms
    • Construct distributions
    • Compute info gain for all terms
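The Hadoop job itself is not shown in the deck. Below is a hedged, single-machine sketch of the brute-force steps listed on slide 37 (count terms, count co-occurring terms, construct distributions, compute info gain), with made-up listing titles; how the real job normalizes the counts into P(x) and P(x|y) is an assumption here.

    import math
    from collections import Counter, defaultdict

    # Made-up listing titles standing in for the real index.
    listings = [
        "unique leather bag",
        "leather bag handmade",
        "unique silver ring",
        "handmade silver ring",
    ]

    term_counts = Counter()              # occurrences of each term x
    cooc_counts = defaultdict(Counter)   # occurrences of x alongside y
    for title in listings:
        terms = set(title.split())
        term_counts.update(terms)
        for y in terms:
            for x in terms:
                cooc_counts[y][x] += 1

    total = sum(term_counts.values())
    p_x = {x: c / total for x, c in term_counts.items()}

    def info_gain(y):
        # info(y) = Σ_x P(x|y) * log(P(x|y) / P(x))
        y_total = sum(cooc_counts[y].values())
        return sum((c / y_total) * math.log((c / y_total) / p_x[x])
                   for x, c in cooc_counts[y].items())

    for term in sorted(term_counts, key=info_gain, reverse=True):
        print(term, round(info_gain(term), 3))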
  • 38. File Distribution
    • cron copies score file to master
    • master replicates file to slaves
    infogain=`find /search/data/ -maxdepth 1 -type f -name 'info_gain.*' -print | sort | tail -n 1`
    scp "$infogain" user@$slave:"$infogain"
  • 39. File Distribution
  • 40. schema.xml
  • 41. Beyond TF-IDF
    • Why? IDF ignores term “usefulness”
    • What? Information gain accounts for term quality
    • How? Hadoop + similarity factory = win
  • 42. Fast Deploys, Careful Testing
    • Idea
    • Proof of Concept
    • Side-By-Side
    • A/B test
    • 100% Live
  • 43. Side-by-Side
  • 44. Relevant != High quality
  • 45. A/B Test
    • Users are randomly assigned to A or B
    • A sees IDF-based results
    • B sees info gain-based results
  • 46. A/B Test
    • Users are randomly assigned to A or B
    • A sees IDF-based results
    • B sees info gain-based results
    • Small but significant decrease in clicks, page views, etc.
  • 47. More homogeneous results; lower average quality score
  • 48. Next Steps
  • 49. Parameter Tweaking...Rebalance relevancy and quality signals in score
  • 50. The Future
  • 51. Latent Semantic Indexing in Solr/Lucene
  • 52. Latent Semantic Indexing
    • In TF-IDF, documents are sparse vectors in term space
    • LSI re-maps these to dense vectors in “concept” space
    • Construct transformation matrix T_{r×m}: R^m -> R^r
    • Load file at index and query time
    • Re-map query and documents
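A small sketch of the LSI idea on slide 52 (not the talk's implementation, which is described as future work): take a rank-r truncated SVD of a term-document matrix to get a transformation from m-dimensional term space to r-dimensional concept space, then re-map documents and queries with the same matrix. The data and the choice of r are arbitrary; some variants also scale by the inverse singular values.

    import numpy as np

    # Term-document count matrix: m terms x n documents (toy data).
    terms = ["art", "jewelry", "ring", "print"]
    A = np.array([
        [2, 0, 1, 0],   # art
        [1, 3, 0, 0],   # jewelry
        [0, 2, 1, 0],   # ring
        [1, 0, 0, 2],   # print
    ], dtype=float)

    r = 2  # number of "concepts"
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    T = U[:, :r].T        # transformation matrix T (r x m): R^m -> R^r

    docs_concept = T @ A  # documents as dense r-dimensional concept vectors

    # Re-map a query ("jewelry ring") the same way and rank documents by cosine.
    q = np.array([0, 1, 1, 0], dtype=float)
    q_concept = T @ q
    scores = (docs_concept.T @ q_concept) / (
        np.linalg.norm(docs_concept, axis=0) * np.linalg.norm(q_concept) + 1e-12)
    print(scores)  # higher score = closer in concept space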
  • 53. CONTACT: Stephen Murtagh, smurtagh@etsy.com