Beyond TF-IDF: Why, What & How
Presented by Stephen Murtagh, Etsy.com, Inc.

TF-IDF (term frequency / inverse document frequency) is a standard method of weighting query terms when scoring documents, and is the default in Solr/Lucene. Unfortunately, TF-IDF is really only a measure of rarity, not of quality or usefulness. This means it gives more weight to a useless rare term, such as a misspelling, than to a more useful but more common term.
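To make the rarity problem concrete, here is a small illustrative sketch (the toy corpus and the `idf` helper are invented for this example, not Etsy's code): a one-off misspelling receives a higher IDF weight than a common but meaningful term.

```python
import math

# Toy corpus: the misspelling "handmde" appears once; "handmade" is common.
docs = [
    "handmade bag", "handmade necklace", "handmade ring",
    "handmde bag",  # misspelling
    "vintage bag", "vintage print", "silver ring", "wool scarf",
]

def idf(term, docs):
    # Lucene-style IDF: 1 + log(N / (df + 1)), where df is the number
    # of documents containing the term.
    df = sum(term in d.split() for d in docs)
    return 1.0 + math.log(len(docs) / (df + 1.0))

# The rare misspelling outweighs the common, useful term.
assert idf("handmde", docs) > idf("handmade", docs)
```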

In this presentation, we will discuss our experience replacing Lucene's TF-IDF-based scoring function with a more useful one based on information gain, a standard machine-learning measure that combines frequency and specificity. Information gain is much more expensive to compute, however, so the term weights must be computed periodically outside of Solr/Lucene and the results made accessible within it.

Transcript

  • 1. Beyond TF-IDF. Stephen Murtagh, etsy.com
  • 2. 20,000,000 items
  • 3. 1,000,000 sellers
  • 4. 15,000,000 daily searches; 80,000,000 daily calls to Solr
  • 5. Etsy Engineering
    - Code as Craft, our engineering blog: http://codeascraft.etsy.com/
    - Continuous deployment: https://github.com/etsy/deployinator
    - Experiment-driven culture
    - Hybrid engineering roles
    - Dev-Ops
    - Data-driven products
  • 6. Etsy Search
    - 2 search clusters: Flip and Flop
    - Master -> 20 slaves
    - Only one cluster takes traffic
    - Thrift (no HTTP endpoint)
    - BitTorrent for index replication
    - Solr 4.1
    - Incremental index every 12 minutes
  • 7. Beyond TF-IDF: Why? What? How?
  • 8. Luggage tags, “unique bag”; q = unique+bag
  • 9. q = unique+bag (one result set is ranked above the other)
  • 10. Scoring in Lucene
  • 11. Scoring in Lucene: some factors are fixed (constant) for any given query
  • 12. Scoring in Lucene: some factors are f(term, document), others only f(term)
  • 13. Scoring in Lucene: the documents are user content; IDF only measures rarity
  • 14. IDF(“unique”) = 4.429547 > IDF(“bag”) = 4.32836
  • 15. q = unique+bag: score(“unique unique bag”) > score(“unique bag bag”)
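The ordering on this slide can be reproduced with a simplified sketch of Lucene's classic scoring (the real formula also includes coord, query norm, and length norms, all omitted here; the IDF values are taken from slide 14):

```python
import math

# IDF values as shown on slide 14.
idf = {"unique": 4.429547, "bag": 4.32836}

def score(query_terms, doc):
    # Simplified Lucene classic scoring: sum over query terms of
    # sqrt(tf) * idf^2 (idf appears in both query and term weight).
    words = doc.split()
    return sum(math.sqrt(words.count(t)) * idf[t] ** 2 for t in query_terms)

q = ["unique", "bag"]
# The repeated low-information word "unique" wins, because its IDF
# is slightly higher than that of "bag".
assert score(q, "unique unique bag") > score(q, "unique bag bag")
```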
  • 16. “unique” tells us nothing...
  • 17. Stop words
    - Add “unique” to the stop word list?
    - What about “handmade” or “blue”?
    - Low-information words can still be useful for matching
    - ... but harmful for ranking
  • 18. Why not replace IDF?
  • 19. Beyond TF-IDF
    - Why? IDF ignores term “usefulness”
    - What?
    - How?
  • 20. Beyond TF-IDF
    - Why? IDF ignores term “usefulness”
    - What?
    - How?
  • 21. What do we replace it with?
  • 22. Benefits of IDF
    I1 = term-document count matrix:
              doc1 doc2 doc3 ... docn
    art          2    0    1 ...    1
    jewelry      1    3    0 ...    0
    ...
    term_m       1    0    1 ...    0
  • 23. Benefits of IDF
    IDF(jewelry) = 1 + log(n / Σ_d i_{d,jewelry})
  • 24. Sharding
    The matrix is split across shards: I1 covers doc1 ... doc_k, I2 covers doc_{k+1} ... doc_n, each with its own counts
  • 25. Sharding
    Each shard can only compute IDF(jewelry) = 1 + log(n / Σ_d i_{d,jewelry}) from its own columns
  • 26. Sharding
    IDF1(jewelry) ≠ IDF2(jewelry) ≠ IDF(jewelry)
  • 27. Sharded IDF options
    - Ignore it: shards score differently
    - Shards exchange stats: messy
    - Central source distributes IDF to the shards
  • 28. Information Gain
    - P(x): probability of "x" appearing in a listing
    - P(x|y): probability of "x" appearing given that "y" appears
    info(y) = D(P(X|y) || P(X)) = Σ_{x∈X} log(P(x|y) / P(x)) ∗ P(x|y)
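The formula above is a KL divergence between the term distribution conditioned on y and the overall term distribution. A minimal sketch (toy listings invented here; Etsy computed the counts over the full index in Hadoop):

```python
import math
from collections import Counter

# Toy listings, each a set of terms.
listings = [
    {"unique", "bag"}, {"unique", "dress"}, {"unique", "ring"},
    {"bag", "leather"}, {"bag", "tote"}, {"dress", "silk"},
]

def info_gain(y, listings):
    # D(P(X|y) || P(X)): how much the term distribution shifts
    # once we know the listing contains y.
    with_y = [l for l in listings if y in l]
    p_x = Counter(t for l in listings for t in l)       # overall counts
    p_x_y = Counter(t for l in with_y for t in l)       # counts given y
    n_all = sum(p_x.values())
    n_y = sum(p_x_y.values())
    return sum((c / n_y) * math.log((c / n_y) / (p_x[t] / n_all))
               for t, c in p_x_y.items())

# "bag" narrows the distribution more than the near-ubiquitous "unique".
assert info_gain("bag", listings) > info_gain("unique", listings)
```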
  • 29. Terms with similar IDF:
    Term      Info(x)  IDF
    unique    0.26     4.43
    bag       1.24     4.33
    pattern   1.20     4.38
    original  0.85     4.38
    dress     1.31     4.42
    man       0.64     4.41
    photo     0.74     4.37
    stone     0.92     4.35
  • 30. Terms with similar info gain:
    Term      Info(x)  IDF
    unique    0.26     4.39
    black     0.22     3.32
    red       0.22     3.52
    handmade  0.20     3.26
    two       0.32     5.64
    white     0.19     3.32
    three     0.37     6.19
    for       0.21     3.59
  • 31. q = unique+bag
    Using IDF: score(“unique unique bag”) > score(“unique bag bag”)
    Using information gain: score(“unique unique bag”) < score(“unique bag bag”)
  • 32. Beyond TF-IDF
    - Why? IDF ignores term “usefulness”
    - What?
    - How?
  • 33. Beyond TF-IDF
    - Why? IDF ignores term “usefulness”
    - What? Information gain accounts for term quality
    - How?
  • 34. Beyond TF-IDF
    - Why? IDF ignores term “usefulness”
    - What? Information gain accounts for term quality
    - How?
  • 35. Listing Quality
    - Performance relative to rank
    - Hadoop: logs -> HDFS
    - cron: HDFS -> master
    - bash: master -> slaves
    - Loaded as an external file field
  • 36. Computing info gain from the term-document matrix I1:
    info(y) = D(P(X|y) || P(X)) = Σ_{x∈X} log(P(x|y) / P(x)) ∗ P(x|y)
  • 37. Hadoop
    - Brute force
    - Count all terms
    - Count all co-occurring terms
    - Construct distributions
    - Compute info gain for all terms
  • 38. File Distribution
    - cron copies the score file to the master
    - the master replicates the file to each slave:
    infogain=`find /search/data/ -maxdepth 1 -type f -name 'info_gain.*' -print | sort | tail -n 1`
    scp "$infogain" user@$slave:"$infogain"
  • 39. File Distribution
  • 40. schema.xml
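The schema.xml contents of slide 40 did not survive in this transcript. As context for the external file field mentioned on slide 35, a hypothetical Solr 4.x snippet along these lines (field and type names invented here) would expose an externally computed per-document value such as listing quality; Solr reads the values from an `external_<fieldname>` file in the index data directory, keyed by the unique key field:

```xml
<!-- Hypothetical example: per-document quality scores computed offline
     and read from external_quality in the data directory, keyed by id. -->
<fieldType name="externalFloat" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="pfloat"/>
<field name="quality" type="externalFloat" indexed="false" stored="false"/>
```

Because the values live outside the index, the file can be swapped by cron without reindexing, which matches the pipeline on slide 35.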
  • 41. Beyond TF-IDF
    - Why? IDF ignores term “usefulness”
    - What? Information gain accounts for term quality
    - How? Hadoop + similarity factory = win
  • 42. Fast Deploys, Careful Testing
    - Idea
    - Proof of concept
    - Side-by-side
    - A/B test
    - 100% live
  • 43. Side-by-Side
  • 44. Relevant != High quality
  • 45. A/B Test
    - Users are randomly assigned to A or B
    - A sees IDF-based results
    - B sees info gain-based results
  • 46. A/B Test
    - Users are randomly assigned to A or B
    - A sees IDF-based results
    - B sees info gain-based results
    - Small but significant decrease in clicks, page views, etc.
  • 47. More homogeneous results; lower average quality score
  • 48. Next Steps
  • 49. Parameter tweaking... rebalance the relevancy and quality signals in the score
  • 50. The Future
  • 51. Latent Semantic Indexing in Solr/Lucene
  • 52. Latent Semantic Indexing
    - In TF-IDF, documents are sparse vectors in term space (R^m)
    - LSI re-maps them to dense vectors in “concept” space (R^r)
    - Construct a transformation matrix T_{r×m} : R^m -> R^r
    - Load the file at index and query time
    - Re-map queries and documents
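The re-mapping step can be sketched in a few lines (the 2x3 matrix below is made up for illustration; in practice T would be learned offline, e.g. by truncated SVD of the term-document matrix): two documents with no terms in common can still match in concept space.

```python
# Transformation matrix T (r x m): maps sparse term-space vectors
# (m = 3 terms) to dense concept-space vectors (r = 2 concepts).
# Invented for this example; normally learned via truncated SVD.
T = [[0.7, 0.7, 0.0],   # concept 1: terms 0 and 1 tend to co-occur
     [0.0, 0.0, 1.0]]   # concept 2: term 2

def remap(vec):
    # Dense concept vector = T . sparse term vector.
    return [sum(t * v for t, v in zip(row, vec)) for row in T]

query = remap([1, 0, 0])  # query uses only term 0
doc = remap([0, 1, 0])    # document uses only term 1 (a "synonym")

# In term space the overlap is zero; in concept space they match.
dot = sum(a * b for a, b in zip(query, doc))
assert dot > 0
```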
  • 53. Contact: Stephen Murtagh, smurtagh@etsy.com