Beyond TF-IDF: Why, What & How
Presented by Stephen Murtagh, Etsy.com, Inc.

TF-IDF (term frequency, inverse document frequency) is a standard method of weighting query terms for scoring documents, and it is the default in Solr/Lucene. Unfortunately, TF-IDF is really only a measure of rarity, not of quality or usefulness. This means it gives more weight to a useless, rare term, such as a misspelling, than to a more useful, but more common, term.

In this presentation, we will discuss our experiences replacing Lucene's TF-IDF based scoring function with a more useful one using information gain, a standard machine-learning measure that combines frequency and specificity. Information gain is much more expensive to compute, however, so this requires periodically computing the term weights outside of Solr/Lucene and making the results accessible within Solr/Lucene.

    Presentation Transcript

    • Beyond TF-IDF
      Stephen Murtagh
      etsy.com
    • 20,000,000 items
    • 1,000,000 sellers
    • 15,000,000 daily searches; 80,000,000 daily calls to Solr
    • Etsy Engineering
      • Code as Craft - our engineering blog: http://codeascraft.etsy.com/
      • Continuous Deployment: https://github.com/etsy/deployinator
      • Experiment-driven culture
      • Hybrid engineering roles
        • Dev-Ops
        • Data-Driven Products
    • Etsy Search
      • 2 search clusters: Flip and Flop
      • Master -> 20 slaves
      • Only one cluster takes traffic
      • Thrift (no HTTP endpoint)
      • BitTorrent for index replication
      • Solr 4.1
      • Incremental index every 12 minutes
    • Beyond TF-IDF
      • Why?
      • What?
      • How?
    • Luggage tags
      “unique bag” -> q = unique+bag
    • q = unique+bag
      (screenshots: one result ranked above the other)
    • Scoring in Lucene
      score(q, d) = coord(q, d) * queryNorm(q) * Σ_t∈q [ tf(t, d) * idf(t)^2 * boost(t) * norm(t, d) ]
    • Scoring in Lucene
      queryNorm(q) and boost(t) are fixed for any given query - a constant
    • Scoring in Lucene
      tf(t, d) and norm(t, d) are f(term, document); idf(t) is only f(term)
    • Scoring in Lucene
      All are computed from user content, but idf(t) can only measure rarity
    • IDF(“unique”) = 4.429547 > IDF(“bag”) = 4.32836
    • q = unique+bag: score(“unique unique bag”) > score(“unique bag bag”)
    • “unique” tells us nothing...
    • Stop words
      • Add “unique” to the stop word list?
      • What about “handmade” or “blue”?
      • Low-information words can still be useful for matching
      • ... but harmful for ranking
    • Why not replace IDF?
    • Beyond TF-IDF
      • Why? IDF ignores term “usefulness”
      • What?
      • How?
    • What do we replace it with?
    • Benefits of IDF
      The index is a term-document count matrix:
      I1 =
                doc1  doc2  doc3  ...  doc_n
      art         2     0     1   ...    1
      jewelry     1     3     0   ...    0
      ...
      term_m      1     0     1   ...    0
    • Benefits of IDF
      IDF can be computed from a single row of that matrix:
      IDF(jewelry) = 1 + log(n / Σ_d i_d,jewelry)
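    As an illustration with made-up numbers: with n = 1,000,000 documents and Σ_d i_d,jewelry = 50,000 occurrences of “jewelry”, IDF(jewelry) = 1 + log(1,000,000 / 50,000) = 1 + log(20) ≈ 4.0 (natural log), the same range as the IDF values quoted later in the talk.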
    • Sharding
      Each shard holds only part of the matrix:
      I1 =
                doc1     doc2     doc3     ...  doc_k
      art         2        0        1      ...    1
      jewelry     1        3        0      ...    0
      ...
      term_m      1        0        1      ...    0
      I2 =
                doc_k+1  doc_k+2  doc_k+3  ...  doc_n
      art         6        1        0      ...    1
      jewelry     0        1        3      ...    0
      ...
      term_m      0        1        1      ...    0
    • Sharding
      IDF(jewelry) = 1 + log(n / Σ_d i_d,jewelry) needs the whole row, but each shard sees only its own documents...
    • Sharding
      ... so the shards disagree: IDF1(jewelry) ≠ IDF2(jewelry) ≠ IDF(jewelry)
    • Sharded IDF options
      • Ignore it - shards score differently
      • Shards exchange stats - messy
      • Central source distributes IDF to shards
    • Information Gain
      • P(x) - probability of “x” appearing in a listing
      • P(x|y) - probability of “x” appearing given “y” appears
      info(y) = D(P(X|y) || P(X))
      info(y) = Σ_x∈X log(P(x|y) / P(x)) * P(x|y)
      (a code sketch of this formula follows)
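    A minimal sketch of the formula in Java (not the talk's code; the two probability distributions are assumed to have been estimated elsewhere):

    import java.util.Map;

    public class InfoGain {
        // info(y) = D(P(X|y) || P(X)) = Σ_x log(P(x|y) / P(x)) * P(x|y).
        // pXGivenY maps each term x to P(x|y); pX maps each term x to P(x).
        public static double infoGain(Map<String, Double> pXGivenY, Map<String, Double> pX) {
            double info = 0.0;
            for (Map.Entry<String, Double> e : pXGivenY.entrySet()) {
                double pxy = e.getValue();        // P(x|y)
                double px = pX.get(e.getKey());   // P(x)
                if (pxy > 0.0 && px > 0.0) {
                    info += Math.log(pxy / px) * pxy;
                }
            }
            return info;
        }
    }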
    • Similar IDF
      Term      Info(x)  IDF
      unique    0.26     4.43
      bag       1.24     4.33
      pattern   1.20     4.38
      original  0.85     4.38
      dress     1.31     4.42
      man       0.64     4.41
      photo     0.74     4.37
      stone     0.92     4.35
    • Similar Info Gain
      Term      Info(x)  IDF
      unique    0.26     4.39
      black     0.22     3.32
      red       0.22     3.52
      handmade  0.20     3.26
      two       0.32     5.64
      white     0.19     3.32
      three     0.37     6.19
      for       0.21     3.59
    • q = unique+bag
      Using IDF: score(“unique unique bag”) > score(“unique bag bag”)
      Using information gain: score(“unique unique bag”) < score(“unique bag bag”)
    • Beyond TF-IDF
      • Why? IDF ignores term “usefulness”
      • What? Information gain accounts for term quality
      • How?
    • Listing Quality
      • Performance relative to rank
      • Hadoop: logs -> hdfs
      • cron: hdfs -> master
      • bash: master -> slave
      • Loaded as external file field (a schema sketch follows)
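    The external file field declaration is not shown in the talk; a minimal schema.xml sketch might look like this (field and key names are assumptions). Solr reads lines of the form key=value from a file named external_<fieldname> (or external_<fieldname>.*, taking the name that sorts last) in the index data directory:

    <!-- Hypothetical per-listing quality score, kept outside the index -->
    <fieldType name="externalQuality" class="solr.ExternalFileField"
               keyField="listing_id" defVal="0" valType="pfloat"/>
    <field name="quality" type="externalQuality" indexed="false" stored="false"/>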
    • Computing info gain
      From the term-document count matrix:
      I1 =
                doc1  doc2  doc3  ...  doc_n
      art         2     0     1   ...    1
      jewelry     1     3     0   ...    0
      ...
      term_m      1     0     1   ...    0
      compute, for every term y:
      info(y) = D(P(X|y) || P(X))
      info(y) = Σ_x∈X log(P(x|y) / P(x)) * P(x|y)
    • Hadoop
      • Brute-force
      • Count all terms
      • Count all co-occurring terms
      • Construct distributions
      • Compute info gain for all terms
      (a single-machine sketch of this pipeline follows)
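    In outline, the brute-force computation might look like the following (a single-machine Java sketch of the steps above, not Etsy's actual MapReduce jobs; each document is reduced to its set of distinct terms):

    import java.util.*;

    public class BruteForceInfoGain {
        // Returns info(y) for every term y in the corpus.
        public static Map<String, Double> compute(List<Set<String>> docs) {
            Map<String, Integer> termDocs = new HashMap<>();            // # docs containing x
            Map<String, Map<String, Integer>> coDocs = new HashMap<>(); // # docs containing both x and y
            for (Set<String> doc : docs) {
                for (String x : doc) {
                    termDocs.merge(x, 1, Integer::sum);
                }
                // O(|doc|^2) per document: this is the brute-force part.
                for (String y : doc) {
                    Map<String, Integer> co = coDocs.computeIfAbsent(y, k -> new HashMap<>());
                    for (String x : doc) {
                        co.merge(x, 1, Integer::sum);
                    }
                }
            }
            int n = docs.size();
            Map<String, Double> info = new HashMap<>();
            for (Map.Entry<String, Map<String, Integer>> co : coDocs.entrySet()) {
                String y = co.getKey();
                int nY = termDocs.get(y);        // # docs containing y
                double infoY = 0.0;
                for (Map.Entry<String, Integer> e : co.getValue().entrySet()) {
                    double pXGivenY = e.getValue() / (double) nY;       // P(x|y)
                    double pX = termDocs.get(e.getKey()) / (double) n;  // P(x)
                    infoY += Math.log(pXGivenY / pX) * pXGivenY;
                }
                info.put(y, infoY);
            }
            return info;
        }
    }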
    • File Distribution
      • cron copies the score file to the master
      • master replicates the file to the slaves

      # Pick the most recent info_gain file and push it to a slave ($slave is
      # set elsewhere). The -name pattern is quoted so the shell passes the
      # glob to find instead of expanding it.
      infogain=$(find /search/data/ -maxdepth 1 -type f -name 'info_gain.*' -print | sort | tail -n 1)
      scp "$infogain" user@$slave:"$infogain"
    • File Distribution (diagram)
    • schema.xml (wires in the custom similarity; a sketch follows)
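    The actual config is not in the transcript; registering a custom similarity factory in schema.xml might look roughly like this (the class name and parameter are assumptions):

    <!-- Replace the default TF-IDF similarity with an info-gain-based one -->
    <similarity class="com.etsy.solr.InfoGainSimilarityFactory">
      <str name="weightsFile">/search/data/info_gain.latest</str>
    </similarity>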
    • Beyond TF-IDF
      • Why? IDF ignores term “usefulness”
      • What? Information gain accounts for term quality
      • How? Hadoop + similarity factory = win
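    The similarity class itself is also not shown; one way to substitute precomputed information-gain weights for IDF in Lucene 4.x is to override idfExplain (a sketch under that assumption; the class name and the term<TAB>weight file format are made up):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.search.CollectionStatistics;
    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.TermStatistics;
    import org.apache.lucene.search.similarities.DefaultSimilarity;

    // Scores with a precomputed information-gain weight in place of IDF,
    // falling back to standard IDF for terms missing from the weights file.
    public class InfoGainSimilarity extends DefaultSimilarity {
        private final Map<String, Float> weights = new HashMap<String, Float>();

        public InfoGainSimilarity(String weightsFile) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(weightsFile));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split("\t");
                    weights.put(parts[0], Float.valueOf(parts[1]));
                }
            } finally {
                in.close();
            }
        }

        @Override
        public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
            String term = termStats.term().utf8ToString();
            Float infoGain = weights.get(term);
            if (infoGain == null) {
                return super.idfExplain(collectionStats, termStats); // unseen term: standard IDF
            }
            return new Explanation(infoGain.floatValue(), "infoGain(" + term + ")");
        }
    }

    A SimilarityFactory (as registered in the schema.xml sketch above) would construct this class from its weightsFile parameter.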
    • Fast Deploys, Careful Testing
      • Idea
      • Proof of Concept
      • Side-By-Side
      • A/B test
      • 100% Live
    • Side-by-Side
    • Relevant != High quality
    • A/B Test
      • Users are randomly assigned to A or B
      • A sees IDF-based results
      • B sees info gain-based results
      • Small but significant decrease in clicks, page views, etc.
    • More homogeneous results; lower average quality score
    • Next Steps
    • Parameter Tweaking... rebalance relevancy and quality signals in the score
    • The Future
    • Latent Semantic Indexing in Solr/Lucene
    • Latent Semantic Indexing
      • In TF-IDF, documents are sparse vectors in term space (R^m_+)
      • LSI re-maps these to dense vectors in “concept” space (R^r)
      • Construct a transformation matrix T_r×m : R^m_+ -> R^r
      • Load the file at index and query time
      • Re-map query and documents (a sketch follows)
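    As a toy illustration of the re-mapping step (all names and dimensions are made up): a sparse term vector is multiplied by T to obtain its dense concept vector.

    import java.util.Map;

    public class ConceptMapper {
        // T is a dense r x m transformation matrix; doc is a sparse term
        // vector given as termIndex -> weight. Returns T * doc, a vector in R^r.
        public static double[] toConceptSpace(double[][] T, Map<Integer, Double> doc) {
            double[] concept = new double[T.length];
            for (Map.Entry<Integer, Double> e : doc.entrySet()) {
                int termIndex = e.getKey();
                for (int i = 0; i < concept.length; i++) {
                    concept[i] += T[i][termIndex] * e.getValue();
                }
            }
            return concept;
        }
    }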
    • CONTACT
      Stephen Murtagh
      smurtagh@etsy.com