Beyond TF-IDF: Why, What & How


Presented by Stephen Murtagh, Etsy.com, Inc.

TF-IDF (term frequency, inverse document frequency) is a standard method of weighting query terms for scoring documents, and is the method that is used by default in Solr/Lucene. Unfortunately, TF-IDF is really only a measure of rarity, not quality or usefulness. This means it would give more weight to a useless, rare term, such as a misspelling, than to a more useful, but more common, term.

In this presentation, we will discuss our experiences replacing Lucene's TF-IDF based scoring function with a more useful one using information gain, a standard machine-learning measure that combines frequency and specificity. Information gain is much more expensive to compute, however, so this requires periodically computing the term weights outside of Solr/Lucene and making the results accessible within Solr/Lucene.



1. Beyond TF-IDF. Stephen Murtagh, etsy.com
2. 20,000,000 items
3. 1,000,000 sellers
4. 15,000,000 daily searches; 80,000,000 daily calls to Solr
5. Etsy Engineering
   • Code as Craft, our engineering blog: http://codeascraft.etsy.com/
   • Continuous deployment: https://github.com/etsy/deployinator
   • Experiment-driven culture
   • Hybrid engineering roles: Dev-Ops
   • Data-driven products
6. Etsy Search
   • 2 search clusters: Flip and Flop
   • Master -> 20 slaves
   • Only one cluster takes traffic
   • Thrift (no HTTP endpoint)
   • BitTorrent for index replication
   • Solr 4.1
   • Incremental index every 12 minutes
7. Beyond TF-IDF
   • Why?
   • What?
   • How?
8. Luggage tags titled “unique bag”; q = unique+bag
9. q = unique+bag (screenshot: one “unique bag” listing ranked above the other)
10. Scoring in Lucene
11. Scoring in Lucene: part of the formula is fixed (constant) for any given query
12. Scoring in Lucene: per-document factors f(term, document) vs. query-wide factors f(term)
13. Scoring in Lucene: TF comes from user content; IDF only measures rarity
14. IDF(“unique”) = 4.429547 > IDF(“bag”) = 4.32836
15. q = unique+bag: score(“unique unique bag”) > score(“unique bag bag”)
16. “unique” tells us nothing...
17. Stop words
   • Add “unique” to the stop word list?
   • What about “handmade” or “blue”?
   • Low-information words can still be useful for matching
   • ... but harmful for ranking
18. Why not replace IDF?
19. Beyond TF-IDF
   • Why? IDF ignores term “usefulness”
   • What?
   • How?
21. What do we replace it with?
22. Benefits of IDF

        I1 =
                  doc1  doc2  doc3  ...  docn
        art         2     0     1   ...    1
        jewelry     1     3     0   ...    0
        ...
        term_m      1     0     1   ...    0

23. Benefits of IDF (same matrix), with:

        IDF(jewelry) = 1 + log(n_docs / df_jewelry)

    where n_docs is the total number of documents and df_jewelry is the number of documents containing “jewelry”.
24. Sharding

        I1 = (docs 1..k)
                  doc1    doc2    doc3    ...  dock
        art         2       0       1     ...    1
        jewelry     1       3       0     ...    0
        ...
        term_m      1       0       1     ...    0

        I2 = (docs k+1..n)
                  dock+1  dock+2  dock+3  ...  docn
        art         6       1       0     ...    1
        jewelry     0       1       3     ...    0
        ...
        term_m      0       1       1     ...    0

25. Sharding (same matrices), with IDF(jewelry) = 1 + log(n_docs / df_jewelry)

26. Sharding: we want IDF1(jewelry) = IDF2(jewelry) = IDF(jewelry), i.e. every shard should use the same IDF value it would have in a single global index.
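The shard-consistency problem above can be sketched in a few lines. This is a toy example with made-up documents (not Etsy's implementation), using a Lucene-style IDF formula; it shows that IDF computed independently per shard disagrees with the global value, so identical documents can score differently depending on which shard they land in.

```python
import math

# Toy corpus split across two shards; each doc is a set of terms.
shard1 = [{"art", "jewelry"}, {"art"}, {"bag"}, {"art", "bag"}]
shard2 = [{"jewelry"}, {"jewelry", "bag"}, {"bag"}]

def idf(term, docs):
    # Lucene-style IDF: 1 + log(n_docs / (df + 1))
    df = sum(1 for d in docs if term in d)
    return 1 + math.log(len(docs) / (df + 1))

# Per-shard IDFs disagree with each other and with the global value.
print(idf("jewelry", shard1))           # df = 1 out of 4 docs
print(idf("jewelry", shard2))           # df = 2 out of 3 docs
print(idf("jewelry", shard1 + shard2))  # global: df = 3 out of 7 docs
```

This is why the deck's third option (a central source distributing term statistics to all shards) is attractive: each shard then scores with the same global weights.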
27. Sharded IDF options
   • Ignore it: shards score differently
   • Shards exchange stats: messy
   • Central source distributes IDF to shards
28. Information Gain
   • P(x): probability of term x appearing in a listing
   • P(x|y): probability of x appearing given that y appears

        info(y) = D(P(X|y) || P(X)) = Σ_{x∈X} P(x|y) * log(P(x|y) / P(x))
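A minimal sketch of this computation, assuming the slide's definitions (P(x) from document frequency, P(x|y) from co-occurrence counts). The listings are made-up toy data; the deck computes these counts at scale on Hadoop rather than in memory like this.

```python
import math
from collections import Counter
from itertools import combinations

# Toy listings; made-up data for illustration only.
listings = [
    {"unique", "bag", "leather"},
    {"unique", "dress"},
    {"bag", "leather"},
    {"unique", "stone"},
    {"bag", "dress"},
]

n = len(listings)
df = Counter()  # document frequency of each term -> P(x) = df[x] / n
co = Counter()  # co-occurrence counts -> P(x|y) = co[(x, y)] / df[y]
for doc in listings:
    df.update(doc)
    for a, b in combinations(sorted(doc), 2):
        co[(a, b)] += 1
        co[(b, a)] += 1

def info_gain(y):
    """info(y) = sum over x of P(x|y) * log(P(x|y) / P(x))."""
    gain = 0.0
    for x in df:
        if x == y or co[(x, y)] == 0:
            continue
        p_x = df[x] / n
        p_x_given_y = co[(x, y)] / df[y]
        gain += p_x_given_y * math.log(p_x_given_y / p_x)
    return gain

# "unique" co-occurs indiscriminately with everything, so it should
# carry less information than the more specific "bag".
print(info_gain("unique"), info_gain("bag"))
```

Intuitively, a term whose co-occurrence distribution P(X|y) looks just like the background distribution P(X) tells us nothing, which is exactly the deck's complaint about “unique”.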
29. Terms with similar IDF:

        Term      Info(x)   IDF
        unique      0.26    4.43
        bag         1.24    4.33
        pattern     1.20    4.38
        original    0.85    4.38
        dress       1.31    4.42
        man         0.64    4.41
        photo       0.74    4.37
        stone       0.92    4.35

30. Terms with similar info gain:

        Term      Info(x)   IDF
        unique      0.26    4.39
        black       0.22    3.32
        red         0.22    3.52
        handmade    0.20    3.26
        two         0.32    5.64
        white       0.19    3.32
        three       0.37    6.19
        for         0.21    3.59
31. q = unique+bag
    Using IDF: score(“unique unique bag”) vs. score(“unique bag bag”)
    Using information gain: score(“unique unique bag”) vs. score(“unique bag bag”)
32. Beyond TF-IDF
   • Why? IDF ignores term “usefulness”
   • What?
   • How?
33. Beyond TF-IDF
   • Why? IDF ignores term “usefulness”
   • What? Information gain accounts for term quality
   • How?
35. Listing Quality
   • Performance relative to rank
   • Hadoop: logs -> HDFS
   • cron: HDFS -> master
   • bash: master -> slave
   • Loaded as an external file field
36. Computing info gain

        I1 =
                  doc1  doc2  doc3  ...  docn
        art         2     0     1   ...    1
        jewelry     1     3     0   ...    0
        ...
        term_m      1     0     1   ...    0

        info(y) = D(P(X|y) || P(X)) = Σ_{x∈X} P(x|y) * log(P(x|y) / P(x))
37. Hadoop
   • Brute force:
   • Count all terms
   • Count all co-occurring terms
   • Construct distributions
   • Compute info gain for all terms
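The brute-force counting step can be sketched as a map/reduce pair. Plain Python stands in for Hadoop here, and the documents are made-up toy data: the mapper emits a count of 1 for each term and each co-occurring term pair in a listing, and the reducer sums them.

```python
from collections import defaultdict
from itertools import combinations

def mapper(doc_terms):
    # Emit unigram counts and co-occurring pair counts for one listing.
    terms = sorted(set(doc_terms))
    for t in terms:
        yield (t,), 1
    for a, b in combinations(terms, 2):
        yield (a, b), 1

def reducer(pairs):
    # Sum counts per key (the shuffle+reduce phase of the Hadoop job).
    counts = defaultdict(int)
    for key, v in pairs:
        counts[key] += v
    return counts

docs = [["unique", "bag"], ["bag", "leather"], ["unique", "bag"]]
emitted = [kv for d in docs for kv in mapper(d)]
counts = reducer(emitted)
print(counts[("bag",)])           # "bag" appears in 3 listings
print(counts[("bag", "unique")])  # "bag" and "unique" co-occur in 2
```

These unigram and pair counts are exactly what is needed to form P(x) and P(x|y) for the info-gain formula on the previous slides.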
38. File Distribution
   • cron copies the score file to the master
   • master replicates the file to the slaves

        infogain=$(find /search/data/ -maxdepth 1 -type f -name 'info_gain.*' -print | sort | tail -n 1)
        scp "$infogain" "user@$slave:$infogain"
39. File Distribution
40. schema.xml
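The schema.xml slide's content is not in the transcript, but the two hooks the deck mentions (a custom similarity factory and an external file field for listing quality) are both wired up in schema.xml in Solr 4.x. A hedged sketch, with the info-gain similarity factory class name being hypothetical:

```xml
<!-- Sketch only; the similarity factory class name is hypothetical. -->

<!-- Per-document listing-quality score loaded from a file on disk -->
<fieldType name="externalQuality" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="float"/>
<field name="listing_quality" type="externalQuality"/>

<!-- Replace the default TF-IDF similarity with one that reads the
     precomputed info-gain weights from the distributed file -->
<similarity class="com.etsy.solr.InfoGainSimilarityFactory"/>
```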
41. Beyond TF-IDF
   • Why? IDF ignores term “usefulness”
   • What? Information gain accounts for term quality
   • How? Hadoop + similarity factory = win
42. Fast Deploys, Careful Testing
   • Idea
   • Proof of concept
   • Side-by-side
   • A/B test
   • 100% live
43. Side-by-Side
44. Relevant != high quality
45. A/B Test
   • Users are randomly assigned to A or B
   • A sees IDF-based results
   • B sees info gain-based results
46. A/B Test
   • Users are randomly assigned to A or B
   • A sees IDF-based results
   • B sees info gain-based results
   • Small but significant decrease in clicks, page views, etc.
47. More homogeneous results; lower average quality score
48. Next Steps
49. Parameter tweaking... rebalance relevancy and quality signals in the score
50. The Future
51. Latent Semantic Indexing in Solr/Lucene
52. Latent Semantic Indexing
   • In TF-IDF, documents are sparse vectors in term space
   • LSI re-maps these to dense vectors in “concept” space
   • Construct a transformation matrix T ∈ R^(r×m)
   • Load the file at index and query time
   • Re-map query and documents from R^m to R^r
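The re-mapping above can be sketched with a truncated SVD. This is a generic LSI illustration on a made-up term-document matrix, not the deck's Solr implementation: the top r left singular vectors form the transformation T (r × m), and both documents and the query are projected into the r-dimensional concept space before comparing.

```python
import numpy as np

# Toy term-document matrix A (m=4 terms x n=4 docs); entries are counts.
A = np.array([
    [2., 0., 1., 0.],
    [1., 3., 0., 0.],
    [0., 1., 1., 2.],
    [0., 0., 2., 1.],
])

r = 2  # number of latent "concepts"
U, s, Vt = np.linalg.svd(A, full_matrices=False)
T = U[:, :r].T        # transformation T (r x m): term space -> concept space

docs_r = T @ A        # documents as dense r-dim vectors (r x n)
q = np.array([1., 1., 0., 0.])  # query vector in term space (R^m)
q_r = T @ q                     # query re-mapped into concept space (R^r)

# Rank documents by cosine similarity in concept space.
sims = (docs_r.T @ q_r) / (np.linalg.norm(docs_r, axis=0) * np.linalg.norm(q_r))
print(sims.argsort()[::-1])
```

The practical wrinkle the slide points at is that T has to be computed offline and loaded into Solr/Lucene at both index and query time, much like the info-gain file.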
53. Contact: Stephen Murtagh, smurtagh@etsy.com
