Beyond TF-IDF: Why, What & How

Presented by Stephen Murtagh, Etsy.com, Inc.

TF-IDF (term frequency, inverse document frequency) is a standard method of weighting query terms for scoring documents, and is the method that is used by default in Solr/Lucene. Unfortunately, TF-IDF is really only a measure of rarity, not quality or usefulness. This means it would give more weight to a useless, rare term, such as a misspelling, than to a more useful, but more common, term.

In this presentation, we will discuss our experiences replacing Lucene's TF-IDF based scoring function with a more useful one using information gain, a standard machine-learning measure that combines frequency and specificity. Information gain is much more expensive to compute, however, so this requires periodically computing the term weights outside of Solr/Lucene and making the results accessible within Solr/Lucene.
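
As a rough illustration of the rarity problem (this example is not from the talk, and the document counts are invented), the sketch below applies Lucene's classic idf formula, 1 + ln(numDocs / (docFreq + 1)), to a common, useful term and a rare misspelling. The misspelling ends up with several times the weight.

    // Sketch: why IDF rewards rarity rather than usefulness.
    // Formula follows Lucene's DefaultSimilarity; the counts are invented for illustration.
    public class IdfRewardsRarity {
        static double idf(long docFreq, long numDocs) {
            return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
        }

        public static void main(String[] args) {
            long numDocs = 20_000_000L;      // ~20M listings
            long dfHandmade = 2_000_000L;    // common, useful term (hypothetical count)
            long dfHandmaed = 40L;           // rare misspelling (hypothetical count)

            System.out.printf("idf(handmade) = %.2f%n", idf(dfHandmade, numDocs)); // ~3.3
            System.out.printf("idf(handmaed) = %.2f%n", idf(dfHandmaed, numDocs)); // ~14.1
            // The misspelling gets roughly 4x the weight of the useful term,
            // even though it says nothing about what the buyer wants.
        }
    }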


Transcript

  1. Beyond TF-IDF. Stephen Murtagh, etsy.com
  2. 20,000,000 items
  3. 1,000,000 sellers
  4. 15,000,000 daily searches; 80,000,000 daily calls to Solr
  5. Etsy Engineering
     • Code as Craft - our engineering blog
     • http://codeascraft.etsy.com/
     • Continuous Deployment
     • https://github.com/etsy/deployinator
     • Experiment-driven culture
     • Hybrid engineering roles
     • Dev-Ops
     • Data-Driven Products
  6. Etsy Search
     • 2 search clusters: Flip and Flop
     • Master -> 20 slaves
     • Only one cluster takes traffic
     • Thrift (no HTTP endpoint)
     • BitTorrent for index replication
     • Solr 4.1
     • Incremental index every 12 minutes
  7. Beyond TF-IDF: Why? What? How?
  8. Luggage tags: “unique bag”, q = unique+bag
  9. q = unique+bag: one listing ranks above the other
  10. Scoring in Lucene
  11. Scoring in Lucene: part of the formula is a constant, fixed for any given query
  12. Scoring in Lucene: the rest combines f(term, document) and f(term)
  13. Scoring in Lucene: f(term, document) reflects user content; f(term), the IDF, only measures rarity
  14. IDF(“unique”) = 4.429547 > IDF(“bag”) = 4.32836
  15. q = unique+bag: score(“unique unique bag”) > score(“unique bag bag”)
  16. “unique” tells us nothing...
  17. Stop words
     • Add “unique” to the stop word list?
     • What about “handmade” or “blue”?
     • Low-information words can still be useful for matching
     • ... but harmful for ranking
  18. Why not replace IDF?
  19. Beyond TF-IDF
     • Why? IDF ignores term “usefulness”
     • What?
     • How?
  20. Beyond TF-IDF
     • Why? IDF ignores term “usefulness”
     • What?
     • How?
  21. What do we replace it with?
  22. Benefits of IDF: the term-document count matrix
          I1 =        doc1  doc2  doc3  ...  docn
          art           2     0     1   ...    1
          jewelry       1     3     0   ...    0
          ...
          term_m        1     0     1   ...    0
  23. Benefits of IDF: IDF(jewelry) = 1 + log(n_d / d_jewelry), where n_d is the number of documents and d_jewelry is the number of documents containing “jewelry”
  24. Sharding: the documents are split across two shards, I1 (doc1 ... dock) and I2 (dock+1 ... docn), each holding its own counts for art, jewelry, ..., term_m
  25. Sharding: each shard can compute IDF(jewelry) = 1 + log(n_d / d_jewelry) from its own counts
  26. Sharding: for consistent scoring we want IDF1(jewelry) = IDF2(jewelry) = IDF(jewelry)
  27. Sharded IDF options
     • Ignore it - Shards score differently
     • Shards exchange stats - Messy
     • Central source distributes IDF to shards
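
     To make the first option concrete: if each shard computes IDF from its own counts alone, the values drift apart from the global figure. A small sketch, not from the talk, with invented shard sizes and document frequencies:

     // Sketch of the sharding problem: each shard sees only its own document
     // frequencies, so shard-local IDF differs from the global value.
     public class ShardedIdf {
         static double idf(long docFreq, long numDocs) {
             return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
         }

         public static void main(String[] args) {
             // "jewelry": global docFreq = 300k over 20M docs, split unevenly across two shards.
             double global = idf(300_000, 20_000_000);
             double shard1 = idf(220_000, 10_000_000);  // jewelry-heavy shard
             double shard2 = idf( 80_000, 10_000_000);  // jewelry-light shard

             System.out.printf("global IDF = %.3f%n", global);   // ~5.20
             System.out.printf("shard1 IDF = %.3f%n", shard1);   // ~4.82
             System.out.printf("shard2 IDF = %.3f%n", shard2);   // ~5.83
             // Identical documents would score differently depending on which shard
             // they live in, unless stats are exchanged or a central source
             // distributes the term weights.
         }
     }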
  28. Information Gain
     • P(x): probability of x appearing in a listing
     • P(x|y): probability of x appearing given that y appears
     • info(y) = D(P(X|y) || P(X)) = Σ_{x∈X} P(x|y) * log(P(x|y) / P(x))
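
     The following sketch computes info(y) directly from that definition. It is illustrative only: the term names and counts are invented, and the real distributions are built over the whole corpus (see slides 36-37).

     import java.util.HashMap;
     import java.util.Map;

     // Minimal sketch of the information-gain measure:
     // info(y) = D( P(X|y) || P(X) ) = sum over x of P(x|y) * log( P(x|y) / P(x) ).
     public class InfoGain {

         // docFreq.get(x) = number of listings containing term x
         // coFreq.get(x)  = number of listings containing both x and the term y
         static double infoGain(Map<String, Long> docFreq, Map<String, Long> coFreq,
                                long numDocs, long numDocsWithY) {
             double info = 0.0;
             for (Map.Entry<String, Long> e : coFreq.entrySet()) {
                 double pX = docFreq.get(e.getKey()) / (double) numDocs;   // P(x)
                 double pXgivenY = e.getValue() / (double) numDocsWithY;   // P(x|y)
                 if (pXgivenY > 0 && pX > 0) {
                     info += pXgivenY * Math.log(pXgivenY / pX);
                 }
             }
             return info;
         }

         public static void main(String[] args) {
             long numDocs = 1000, numDocsWithBag = 50;   // toy corpus
             Map<String, Long> docFreq = new HashMap<>();
             Map<String, Long> coFreq = new HashMap<>();
             // "bag" co-occurs with a focused vocabulary, so P(x|bag) diverges from P(x).
             docFreq.put("leather", 100L); coFreq.put("leather", 20L);
             docFreq.put("tote",     40L); coFreq.put("tote",    15L);
             docFreq.put("zipper",   60L); coFreq.put("zipper",  10L);
             System.out.printf("info(bag) = %.3f%n",
                     infoGain(docFreq, coFreq, numDocs, numDocsWithBag));
             // A term like "unique" co-occurs with everything in roughly corpus
             // proportions, so P(x|unique) stays close to P(x) and its info gain is
             // close to zero.
         }
     }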
  29. Terms with similar IDF but very different info gain:
          Term       Info(x)   IDF
          unique     0.26      4.43
          bag        1.24      4.33
          pattern    1.20      4.38
          original   0.85      4.38
          dress      1.31      4.42
          man        0.64      4.41
          photo      0.74      4.37
          stone      0.92      4.35
  30. Terms with similar info gain but very different IDF:
          Term       Info(x)   IDF
          unique     0.26      4.39
          black      0.22      3.32
          red        0.22      3.52
          handmade   0.20      3.26
          two        0.32      5.64
          white      0.19      3.32
          three      0.37      6.19
          for        0.21      3.59
  31. q = unique+bag: using IDF, score(“unique unique bag”) > score(“unique bag bag”); using information gain, score(“unique unique bag”) < score(“unique bag bag”)
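
     Plugging the weights from slides 14 and 29-30 into a deliberately simplified score (sum over query terms of tf times term weight; Lucene's real formula adds further factors such as sqrt(tf) and length norms) shows the rank flip:

     // Worked version of slides 15 and 31, using the weights shown on slides 29-30.
     public class UniqueBagComparison {
         static double score(int tfUnique, int tfBag, double wUnique, double wBag) {
             return tfUnique * wUnique + tfBag * wBag;
         }

         public static void main(String[] args) {
             // q = unique + bag; doc A = "unique unique bag", doc B = "unique bag bag"
             double idfUnique = 4.43, idfBag = 4.33;    // slide 29
             double infoUnique = 0.26, infoBag = 1.24;  // slides 29-30

             System.out.printf("IDF:  A=%.2f  B=%.2f%n",
                     score(2, 1, idfUnique, idfBag), score(1, 2, idfUnique, idfBag));
             // IDF: A=13.19, B=13.09 -> the listing that repeats "unique" wins.

             System.out.printf("Info: A=%.2f  B=%.2f%n",
                     score(2, 1, infoUnique, infoBag), score(1, 2, infoUnique, infoBag));
             // Info gain: A=1.76, B=2.74 -> the listing that is actually about bags wins.
         }
     }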
  32. Beyond TF-IDF
     • Why? IDF ignores term “usefulness”
     • What?
     • How?
  33. Beyond TF-IDF
     • Why? IDF ignores term “usefulness”
     • What? Information gain accounts for term quality
     • How?
  34. Beyond TF-IDF
     • Why? IDF ignores term “usefulness”
     • What? Information gain accounts for term quality
     • How?
  35. Listing Quality
     • Performance relative to rank
     • Hadoop: logs -> hdfs
     • cron: hdfs -> master
     • bash: master -> slave
     • Loaded as an external file field
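
     Solr's ExternalFileField reads a plain-text file of uniqueKey=value lines (named external_<fieldname>) from the data directory, so the offline pipeline only has to produce that file. A minimal sketch of producing it; the field name, path, and values are assumptions for illustration, not Etsy's actual pipeline:

     import java.io.PrintWriter;
     import java.util.Map;

     // Writes listing-quality scores in the "uniqueKey=floatValue" format that an
     // ExternalFileField expects. Path and field name are hypothetical.
     public class WriteQualityFile {
         public static void main(String[] args) throws Exception {
             Map<String, Float> quality = Map.of(
                     "listing_101", 0.87f,
                     "listing_102", 0.31f);   // toy data; real scores come from Hadoop

             try (PrintWriter out = new PrintWriter("/search/data/external_listing_quality")) {
                 for (Map.Entry<String, Float> e : quality.entrySet()) {
                     out.println(e.getKey() + "=" + e.getValue());
                 }
             }
         }
     }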
  36. Computing info gain: from the same term-document count matrix I1 (rows art, jewelry, ..., term_m; columns doc1 ... docn), compute
          info(y) = D(P(X|y) || P(X)) = Σ_{x∈X} P(x|y) * log(P(x|y) / P(x))
  37. Hadoop
     • Brute-force
     • Count all terms
     • Count all co-occurring terms
     • Construct distributions
     • Compute info gain for all terms
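
     A minimal sketch of the counting step, assuming one listing's text arrives per input line; the class and key names are hypothetical, and a standard summing reducer would aggregate the emitted counts into the inputs for P(x) and P(x|y):

     import java.io.IOException;
     import java.util.LinkedHashSet;
     import java.util.Set;
     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Mapper;

     // Emits ("term", 1) for each distinct term in a listing and ("termA|termB", 1)
     // for each co-occurring pair of distinct terms.
     public class TermPairCountMapper
             extends Mapper<LongWritable, Text, Text, LongWritable> {

         private static final LongWritable ONE = new LongWritable(1);

         @Override
         protected void map(LongWritable key, Text value, Context context)
                 throws IOException, InterruptedException {
             // Distinct terms in this listing (document-level counts, not raw term counts).
             Set<String> terms = new LinkedHashSet<>();
             for (String t : value.toString().toLowerCase().split("\\s+")) {
                 if (!t.isEmpty()) terms.add(t);
             }
             for (String a : terms) {
                 context.write(new Text(a), ONE);   // count of listings containing a
                 for (String b : terms) {
                     if (!a.equals(b)) {
                         context.write(new Text(a + "|" + b), ONE);  // co-occurrence count
                     }
                 }
             }
         }
     }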
  38. File Distribution
     • cron copies the score file to the master
     • master replicates the file to the slaves
     # pick the newest info_gain.* file and copy it to a slave
     infogain=$(find /search/data/ -maxdepth 1 -type f -name 'info_gain.*' -print | sort | tail -n 1)
     scp "$infogain" "user@$slave:$infogain"
  39. File Distribution
  40. schema.xml
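
     The talk does not show the actual schema.xml or similarity code, but the general shape in Lucene 4.x would be a custom Similarity, wired in through a SimilarityFactory declared in schema.xml, that substitutes the precomputed info-gain weight for IDF. The sketch below is an assumption-laden illustration: the file format, constructor, and fallback behavior are guesses, not Etsy's implementation.

     import java.io.BufferedReader;
     import java.io.FileReader;
     import java.io.IOException;
     import java.util.HashMap;
     import java.util.Map;
     import org.apache.lucene.search.CollectionStatistics;
     import org.apache.lucene.search.Explanation;
     import org.apache.lucene.search.TermStatistics;
     import org.apache.lucene.search.similarities.DefaultSimilarity;

     // Replaces the IDF component of Lucene's scoring with a precomputed per-term
     // info-gain weight loaded from a "term<TAB>weight" file.
     public class InfoGainSimilarity extends DefaultSimilarity {

         private final Map<String, Float> infoGain = new HashMap<>();

         public InfoGainSimilarity(String weightsPath) throws IOException {
             try (BufferedReader in = new BufferedReader(new FileReader(weightsPath))) {
                 String line;
                 while ((line = in.readLine()) != null) {
                     String[] parts = line.split("\t");
                     infoGain.put(parts[0], Float.parseFloat(parts[1]));
                 }
             }
         }

         @Override
         public Explanation idfExplain(CollectionStatistics collectionStats,
                                       TermStatistics termStats) {
             String term = termStats.term().utf8ToString();
             Float gain = infoGain.get(term);
             if (gain != null) {
                 return new Explanation(gain, "info gain(" + term + ")");
             }
             // Terms missing from the offline file fall back to standard IDF.
             return super.idfExplain(collectionStats, termStats);
         }
     }

     With a fallback like this, vocabulary that appears between offline runs still matches and gets a sane weight until the next info_gain file is computed and distributed.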
  41. Beyond TF-IDF
     • Why? IDF ignores term “usefulness”
     • What? Information gain accounts for term quality
     • How? Hadoop + similarity factory = win
  42. Fast Deploys, Careful Testing
     • Idea
     • Proof of Concept
     • Side-By-Side
     • A/B test
     • 100% Live
  43. Side-by-Side
  44. Relevant != High quality
  45. A/B Test
     • Users are randomly assigned to A or B
     • A sees IDF-based results
     • B sees info gain-based results
  46. A/B Test
     • Users are randomly assigned to A or B
     • A sees IDF-based results
     • B sees info gain-based results
     • Small but significant decrease in clicks, page views, etc.
  47. More homogeneous results; lower average quality score
  48. Next Steps
  49. Parameter tweaking... rebalance the relevancy and quality signals in the score
  50. The Future
  51. Latent Semantic Indexing in Solr/Lucene
  52. Latent Semantic Indexing
     • In TF-IDF, documents are sparse vectors in term space
     • LSI re-maps these to dense vectors in “concept” space
     • Construct a transformation matrix T (r × m) mapping R^m to R^r
     • Load the file at index and query time
     • Re-map query and documents
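
     A toy sketch of that re-mapping; the dimensions and matrix values are invented, and in practice T would come from an SVD of the term-document matrix computed offline:

     import java.util.Map;

     // Projects a sparse term-space vector (dimension m) into a dense r-dimensional
     // "concept" space via a matrix T (r x m).
     public class LsiProjection {

         // concept = T * termVector, exploiting sparsity: only columns of T for terms
         // that actually occur in the document contribute.
         static double[] project(double[][] T, Map<Integer, Double> sparseTermVector) {
             double[] concept = new double[T.length];
             for (Map.Entry<Integer, Double> e : sparseTermVector.entrySet()) {
                 int termId = e.getKey();
                 double weight = e.getValue();
                 for (int r = 0; r < T.length; r++) {
                     concept[r] += T[r][termId] * weight;
                 }
             }
             return concept;
         }

         public static void main(String[] args) {
             double[][] T = {          // r = 2 concepts, m = 4 terms (toy numbers)
                 {0.7, 0.1, 0.0, 0.2},
                 {0.0, 0.5, 0.6, 0.1},
             };
             // Document mentions term 0 twice and term 2 once.
             double[] doc = project(T, Map.of(0, 2.0, 2, 1.0));
             System.out.printf("concept vector = [%.2f, %.2f]%n", doc[0], doc[1]); // [1.40, 0.60]
             // The same projection is applied to the query at query time, and scoring
             // happens in the dense concept space instead of the sparse term space.
         }
     }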
  53. Contact: Stephen Murtagh, smurtagh@etsy.com
