Beyond TF-IDF
Stephen Murtagh
etsy.com
20,000,000 items
1,000,000 sellers
15,000,000 daily searches
80,000,000 daily calls to Solr
Etsy Engineering
• Code as Craft - our engineering blog
• http://codeascraft.etsy.com/
• Continuous Deployment
• https://github.com/etsy/deployinator
• Experiment-driven culture
• Hybrid engineering roles
• Dev-Ops
• Data-Driven Products
Etsy Search
• 2 search clusters: Flip and Flop
• Master -> 20 slaves
• Only one cluster takes traffic
• Thrift (no HTTP endpoint)
• BitTorrent for index replication
• Solr 4.1
• Incremental index every 12 minutes
Beyond TF-IDF
•Why?
•What?
•How?
Luggage tags
“unique bag”
[screenshots: Etsy search results for q = unique+bag]
Scoring in Lucene
• Some factors are fixed for any given query (constant)
• tf = f(term, document)
• idf = f(term)
• Both are computed from user content
• IDF only measures rarity
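For reference, the pieces being annotated come from Lucene's classic practical scoring function (the default TF-IDF similarity in Lucene 4.x):

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
\sum_{t \in q}\Big(\mathrm{tf}(t,d)\cdot \mathrm{idf}(t)^{2}\cdot \mathrm{boost}(t)\cdot \mathrm{norm}(t,d)\Big)
```

where tf(t,d) = √freq and idf(t) = 1 + log(numDocs / (docFreq + 1)). coord and queryNorm are fixed for a given query; tf and idf are the content-driven factors.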
IDF(“unique”) = 4.429547 > IDF(“bag”) = 4.32836
q = unique+bag
score(“unique unique bag”) > score(“unique bag bag”)
“unique” tells us nothing...
Stop words
• Add “unique” to stop word list?
• What about “handmade” or “blue”?
• Low-information words can still be useful for matching
• ... but harmful for ranking
Why not replace IDF?
Beyond TF-IDF
•Why?
• IDF ignores term “usefulness”
•What?
•How?
What do we replace it
with?
Benefits of IDF
I1 =
          doc1  doc2  doc3  ...  docn
art         2     0     1   ...    1
jewelry     1     3     0   ...    0
...
term_m      1     0     1   ...    0
Benefits of IDF
IDF(jewelry) = 1 + log( n / Σd i(d, jewelry) )
where i(d, jewelry) = 1 if “jewelry” appears in doc d, else 0
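As the formula shows, IDF depends only on document frequency. A minimal sketch (toy data, not Etsy's code) of computing it from a term-document count matrix:

```python
import math

# Toy term-document count matrix: rows = terms, columns = documents.
# The numbers are invented for illustration.
counts = {
    "art":     [2, 0, 1, 1],
    "jewelry": [1, 3, 0, 0],
}

def idf(term, counts):
    """IDF as on the slide: 1 + log(n / doc_freq), where doc_freq
    is the number of documents containing the term at least once."""
    n = len(next(iter(counts.values())))
    doc_freq = sum(1 for c in counts[term] if c > 0)
    return 1.0 + math.log(n / doc_freq)

# Only how MANY docs contain the term matters, not how often it occurs
print(idf("jewelry", counts))
print(idf("art", counts))
```
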
Sharding
I1 = (shard 1: doc1 ... dock)
          doc1   doc2   doc3   ...  dock
art         2      0      1    ...    1
jewelry     1      3      0    ...    0
...
term_m      1      0      1    ...    0

I2 = (shard 2: dock+1 ... docn)
          dock+1  dock+2  dock+3  ...  docn
art          6       1       0   ...     1
jewelry      0       1       3   ...     0
...
term_m       0       1       1   ...     0
Sharding
IDF1(jewelry) ≠ IDF2(jewelry): each shard computes IDF from its own documents
Sharded IDF options
• Ignore it - Shards score differently
• Shards exchange stats - Messy
• Central source distributes IDF to shards
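A toy illustration of the problem, and of the central-distribution option, with invented shard statistics:

```python
import math

def idf(n_docs, doc_freq):
    # 1 + log(n / df), as in the slides
    return 1.0 + math.log(n_docs / doc_freq)

# Hypothetical per-shard stats for "jewelry" (numbers invented)
shard1 = {"n_docs": 1000, "doc_freq": 50}
shard2 = {"n_docs": 1000, "doc_freq": 10}

idf1 = idf(shard1["n_docs"], shard1["doc_freq"])
idf2 = idf(shard2["n_docs"], shard2["doc_freq"])

# Option 3: a central source computes IDF from global stats
# and distributes the same value to every shard
global_idf = idf(shard1["n_docs"] + shard2["n_docs"],
                 shard1["doc_freq"] + shard2["doc_freq"])

# idf1 != idf2: identical documents would score differently per shard
print(idf1, idf2, global_idf)
```
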
Information Gain
• P(x) - Probability of x appearing in a listing
• P(x|y) - Probability of x appearing given y appears
info(y) = D( P(X|y) ‖ P(X) ) = Σx∈X P(x|y) · log( P(x|y) / P(x) )
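A minimal sketch of the formula with invented probabilities, using the slide's definitions (P(x) is the probability of x appearing in a listing, P(x|y) the probability given that y appears; note P(y|y) = 1 by definition):

```python
import math

def info_gain(p_x_given_y, p_x):
    """info(y) = D(P(X|y) || P(X)) = sum_x P(x|y) * log(P(x|y) / P(x)).
    Both arguments are dicts over the same vocabulary; zero terms are skipped."""
    return sum(py * math.log(py / p_x[x])
               for x, py in p_x_given_y.items() if py > 0)

# Toy numbers, invented for illustration:
# background probability of each term appearing in a listing
p_x = {"bag": 0.2, "leather": 0.2, "unique": 0.6}
# term probabilities in listings that contain "bag"
p_x_given_bag = {"bag": 1.0, "leather": 0.4, "unique": 0.1}
# "unique" barely shifts the background distribution -> low info gain
p_x_given_unique = {"bag": 0.2, "leather": 0.2, "unique": 1.0}

print(info_gain(p_x_given_bag, p_x))     # high: "bag" is informative
print(info_gain(p_x_given_unique, p_x))  # low: "unique" tells us little
```
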
Term      Info(x)   IDF
unique    0.26      4.43
bag       1.24      4.33
pattern   1.20      4.38
original  0.85      4.38
dress     1.31      4.42
man       0.64      4.41
photo     0.74      4.37
stone     0.92      4.35
Similar IDF
Term      Info(x)   IDF
unique    0.26      4.39
black     0.22      3.32
red       0.22      3.52
handmade  0.20      3.26
two       0.32      5.64
white     0.19      3.32
three     0.37      6.19
for       0.21      3.59
Similar Info Gain
q = unique+bag
Using IDF:
score(“unique unique bag”) > score(“unique bag bag”)
Using information gain:
score(“unique unique bag”) < score(“unique bag bag”)
Beyond TF-IDF
•Why?
• IDF ignores term “usefulness”
•What?
•How?
Beyond TF-IDF
•Why?
• IDF ignores term “usefulness”
•What?
• Information gain accounts for term quality
•How?
Listing Quality
• Performance relative to rank
• Hadoop: logs → HDFS
• cron: HDFS → master
• bash: master → slaves
• Loaded as external file field
Computing info gain
info(y) = D( P(X|y) ‖ P(X) ) = Σx∈X P(x|y) · log( P(x|y) / P(x) )
Hadoop
• Brute-force
• Count all terms
• Count all co-occurring terms
• Construct distributions
• Compute info gain for all terms
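The brute-force steps above can be sketched on one machine (toy listing titles, plain Python standing in for the Hadoop jobs):

```python
import math
from collections import Counter

# Toy "corpus" of listing titles (invented for illustration)
listings = [
    "unique leather bag",
    "unique silver ring",
    "leather bag handmade",
    "unique handmade ring",
]
docs = [set(t.split()) for t in listings]
n = len(docs)

# Steps 1-2: count all terms and all co-occurring term pairs
# (a term co-occurs with itself, so P(y|y) = 1)
term_counts = Counter(t for d in docs for t in d)
pair_counts = Counter((y, x) for d in docs for y in d for x in d)

def info(y):
    """Steps 3-4: build P(X) and P(X|y), then info(y) = D(P(X|y) || P(X))."""
    total_y = term_counts[y]
    gain = 0.0
    for x in term_counts:
        p_x = term_counts[x] / n               # P(x)
        p_x_y = pair_counts[(y, x)] / total_y  # P(x | y)
        if p_x_y > 0:
            gain += p_x_y * math.log(p_x_y / p_x)
    return gain

# "bag" carries far more information than "unique" in this toy corpus
for term in ("unique", "bag"):
    print(term, round(info(term), 3))
```
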
File Distribution
• cron copies score file to master
• master replicates file to slaves
infogain=$(find /search/data/ -maxdepth 1 -type f -name 'info_gain.*' -print | sort | tail -n 1)
scp "$infogain" "user@$slave:$infogain"
File Distribution
schema.xml
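The schema.xml slide itself is not reproduced here; a minimal sketch of the two pieces the talk implies (an external file field for the listing-quality score, plus a custom similarity in place of the default TF-IDF one) might look like this. All names are hypothetical, not Etsy's actual schema:

```xml
<!-- Hypothetical fragment: field and class names are illustrative -->
<fieldType name="externalQuality" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="float"/>
<field name="listing_quality" type="externalQuality"/>

<!-- Swap Lucene's IDF for the info-gain statistic via a similarity factory -->
<similarity class="com.example.InfoGainSimilarityFactory"/>
```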
Beyond TF-IDF
•Why?
• IDF ignores term “usefulness”
•What?
• Information gain accounts for term quality
•How?
• Hadoop + similarity factory = win
Fast Deploys,
Careful Testing
• Idea
• Proof of Concept
• Side-By-Side
• A/B test
• 100% Live
Side-by-Side
Relevant != High quality
A/B Test
• Users are randomly assigned to A or B
• A sees IDF-based results
• B sees info gain-based results
A/B Test
• Users are randomly assigned to A or B
• A sees IDF-based results
• B sees info gain-based results
• Small but significant decrease in clicks, page views, etc.
More homogeneous results
Lower average quality score
Next Steps
Parameter Tweaking...
Rebalance relevancy and quality signals in score
The Future
Latent Semantic Indexing in Solr/Lucene
Latent Semantic Indexing
• In TF-IDF, documents are sparse vectors in term space
• LSI re-maps these to dense vectors in “concept” space
• Construct a transformation matrix Tr×m mapping term space Rm to concept space Rr
• Load file at index and query time
• Re-map query and documents
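One standard way to build such a matrix is a truncated SVD, X ≈ Ur Σr Vrᵀ, which gives T = Σr⁻¹ Urᵀ. A sketch with toy data (numpy here stands in for whatever would run inside Solr/Lucene):

```python
import numpy as np

# Toy term-document matrix X (m = 4 terms x n = 4 docs); counts invented
X = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 3.0, 0.0, 0.0],
    [0.0, 1.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

r = 2  # target "concept" dimensionality

# Truncated SVD: keep the top-r singular vectors/values
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_r, s_r = U[:, :r], s[:r]

# Transformation matrix T (r x m): maps term space R^m to concept space R^r
T = np.diag(1.0 / s_r) @ U_r.T

def to_concepts(v):
    # Re-map a term-space vector (query or document) into concept space
    return T @ v

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([1.0, 0.0, 1.0, 0.0])  # sparse query vector in term space
d = X[:, 2]                          # one document column

print(cosine(to_concepts(q), to_concepts(d)))
```
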
CONTACT
Stephen Murtagh
smurtagh@etsy.com
