Beyond TF-IDF
Stephen Murtagh
etsy.com
20,000,000 items
1,000,000 sellers
15,000,000 daily searches
80,000,000 daily calls to Solr
Etsy Engineering
• Code as Craft - our engineering blog
• http://codeascraft.etsy.com/
• Continuous Deployment
• https://github.com/etsy/deployinator
• Experiment-driven culture
• Hybrid engineering roles
• Dev-Ops
• Data-Driven Products
Etsy Search
• 2 search clusters: Flip and Flop
• Master -> 20 slaves
• Only one cluster takes traffic
• Thrift (no HTTP endpoint)
• BitTorrent for index replication
• Solr 4.1
• Incremental index every 12 minutes
Beyond TF-IDF
•Why?
•What?
•How?
Luggage tags
“unique bag”
[screenshots: Etsy search results for q = unique+bag]
Scoring in Lucene
• Some factors are fixed for any given query (constant)
• tf = f(term, document)
• idf = f(term)
• Both are computed from user content
• IDF only measures rarity
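For reference, the pieces being annotated come from Lucene's classic practical scoring function (the default TF-IDF similarity in Lucene 4.x):

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
\sum_{t \in q}\Big(\mathrm{tf}(t,d)\cdot \mathrm{idf}(t)^{2}\cdot \mathrm{boost}(t)\cdot \mathrm{norm}(t,d)\Big)
```

where tf(t,d) = √freq and idf(t) = 1 + log(numDocs / (docFreq + 1)). coord and queryNorm are fixed for a given query; tf and idf are the content-driven factors.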
IDF(“unique”) = 4.429547 > IDF(“bag”) = 4.32836
q = unique+bag
score(“unique unique bag”) > score(“unique bag bag”)
“unique” tells us nothing...
Stop words
• Add “unique” to stop word list?
• What about “handmade” or “blue”?
• Low-information words can still be useful for matching
• ... but harmful for ranking
Why not replace IDF?
Beyond TF-IDF
•Why?
• IDF ignores term “usefulness”
•What?
•How?
What do we replace it
with?
Benefits of IDF
I1 =
          doc1  doc2  doc3  ...  docn
art         2     0     1   ...    1
jewelry     1     3     0   ...    0
...
term_m      1     0     1   ...    0
Benefits of IDF
IDF(jewelry) = 1 + log( n / Σd i(d, jewelry) )
where i(d, jewelry) = 1 if “jewelry” appears in doc d, else 0
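As the formula shows, IDF depends only on document frequency. A minimal sketch (toy data, not Etsy's code) of computing it from a term-document count matrix:

```python
import math

# Toy term-document count matrix: rows = terms, columns = documents.
# The numbers are invented for illustration.
counts = {
    "art":     [2, 0, 1, 1],
    "jewelry": [1, 3, 0, 0],
}

def idf(term, counts):
    """IDF as on the slide: 1 + log(n / doc_freq), where doc_freq
    is the number of documents containing the term at least once."""
    n = len(next(iter(counts.values())))
    doc_freq = sum(1 for c in counts[term] if c > 0)
    return 1.0 + math.log(n / doc_freq)

# Only how MANY docs contain the term matters, not how often it occurs
print(idf("jewelry", counts))
print(idf("art", counts))
```
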
Sharding
I1 = (shard 1: doc1 ... dock)
          doc1   doc2   doc3   ...  dock
art         2      0      1    ...    1
jewelry     1      3      0    ...    0
...
term_m      1      0      1    ...    0

I2 = (shard 2: dock+1 ... docn)
          dock+1  dock+2  dock+3  ...  docn
art          6       1       0   ...     1
jewelry      0       1       3   ...     0
...
term_m       0       1       1   ...     0
Sharding
IDF1(jewelry) ≠ IDF2(jewelry): each shard computes IDF from its own documents
Sharded IDF options
• Ignore it - Shards score differently
• Shards exchange stats - Messy
• Central source distributes IDF to shards
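A toy illustration of the problem, and of the central-distribution option, with invented shard statistics:

```python
import math

def idf(n_docs, doc_freq):
    # 1 + log(n / df), as in the slides
    return 1.0 + math.log(n_docs / doc_freq)

# Hypothetical per-shard stats for "jewelry" (numbers invented)
shard1 = {"n_docs": 1000, "doc_freq": 50}
shard2 = {"n_docs": 1000, "doc_freq": 10}

idf1 = idf(shard1["n_docs"], shard1["doc_freq"])
idf2 = idf(shard2["n_docs"], shard2["doc_freq"])

# Option 3: a central source computes IDF from global stats
# and distributes the same value to every shard
global_idf = idf(shard1["n_docs"] + shard2["n_docs"],
                 shard1["doc_freq"] + shard2["doc_freq"])

# idf1 != idf2: identical documents would score differently per shard
print(idf1, idf2, global_idf)
```
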
Information Gain
• P(x) - Probability of x appearing in a listing
• P(x|y) - Probability of x appearing given y appears
info(y) = D( P(X|y) ‖ P(X) ) = Σx∈X P(x|y) · log( P(x|y) / P(x) )
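A minimal sketch of the formula with invented probabilities, using the slide's definitions (P(x) is the probability of x appearing in a listing, P(x|y) the probability given that y appears; note P(y|y) = 1 by definition):

```python
import math

def info_gain(p_x_given_y, p_x):
    """info(y) = D(P(X|y) || P(X)) = sum_x P(x|y) * log(P(x|y) / P(x)).
    Both arguments are dicts over the same vocabulary; zero terms are skipped."""
    return sum(py * math.log(py / p_x[x])
               for x, py in p_x_given_y.items() if py > 0)

# Toy numbers, invented for illustration:
# background probability of each term appearing in a listing
p_x = {"bag": 0.2, "leather": 0.2, "unique": 0.6}
# term probabilities in listings that contain "bag"
p_x_given_bag = {"bag": 1.0, "leather": 0.4, "unique": 0.1}
# "unique" barely shifts the background distribution -> low info gain
p_x_given_unique = {"bag": 0.2, "leather": 0.2, "unique": 1.0}

print(info_gain(p_x_given_bag, p_x))     # high: "bag" is informative
print(info_gain(p_x_given_unique, p_x))  # low: "unique" tells us little
```
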
Term      Info(x)   IDF
unique    0.26      4.43
bag       1.24      4.33
pattern   1.20      4.38
original  0.85      4.38
dress     1.31      4.42
man       0.64      4.41
photo     0.74      4.37
stone     0.92      4.35
Similar IDF
Term      Info(x)   IDF
unique    0.26      4.39
black     0.22      3.32
red       0.22      3.52
handmade  0.20      3.26
two       0.32      5.64
white     0.19      3.32
three     0.37      6.19
for       0.21      3.59
Similar Info Gain
q = unique+bag
Using IDF:
score(“unique unique bag”) > score(“unique bag bag”)
Using information gain:
score(“unique unique bag”) < score(“unique bag bag”)
Beyond TF-IDF
•Why?
• IDF ignores term “usefulness”
•What?
•How?
Beyond TF-IDF
•Why?
• IDF ignores term “usefulness”
•What?
• Information gain accounts for term quality
•How?
Listing Quality
• Performance relative to rank
• Hadoop: logs → HDFS
• cron: HDFS → master
• bash: master → slaves
• Loaded as external file field
Computing info gain
info(y) = D( P(X|y) ‖ P(X) ) = Σx∈X P(x|y) · log( P(x|y) / P(x) )
Hadoop
• Brute-force
• Count all terms
• Count all co-occurring terms
• Construct distributions
• Compute info gain for all terms
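The brute-force steps above can be sketched on one machine (toy listing titles, plain Python standing in for the Hadoop jobs):

```python
import math
from collections import Counter

# Toy "corpus" of listing titles (invented for illustration)
listings = [
    "unique leather bag",
    "unique silver ring",
    "leather bag handmade",
    "unique handmade ring",
]
docs = [set(t.split()) for t in listings]
n = len(docs)

# Steps 1-2: count all terms and all co-occurring term pairs
# (a term co-occurs with itself, so P(y|y) = 1)
term_counts = Counter(t for d in docs for t in d)
pair_counts = Counter((y, x) for d in docs for y in d for x in d)

def info(y):
    """Steps 3-4: build P(X) and P(X|y), then info(y) = D(P(X|y) || P(X))."""
    total_y = term_counts[y]
    gain = 0.0
    for x in term_counts:
        p_x = term_counts[x] / n               # P(x)
        p_x_y = pair_counts[(y, x)] / total_y  # P(x | y)
        if p_x_y > 0:
            gain += p_x_y * math.log(p_x_y / p_x)
    return gain

# "bag" carries far more information than "unique" in this toy corpus
for term in ("unique", "bag"):
    print(term, round(info(term), 3))
```
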
File Distribution
• cron copies score file to master
• master replicates file to slaves
infogain=$(find /search/data/ -maxdepth 1 -type f -name 'info_gain.*' -print | sort | tail -n 1)
scp "$infogain" "user@$slave:$infogain"
File Distribution
schema.xml
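The schema.xml slide itself is not reproduced here; a minimal sketch of the two pieces the talk implies (an external file field for the listing-quality score, plus a custom similarity in place of the default TF-IDF one) might look like this. All names are hypothetical, not Etsy's actual schema:

```xml
<!-- Hypothetical fragment: field and class names are illustrative -->
<fieldType name="externalQuality" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="float"/>
<field name="listing_quality" type="externalQuality"/>

<!-- Swap Lucene's IDF for the info-gain statistic via a similarity factory -->
<similarity class="com.example.InfoGainSimilarityFactory"/>
```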
Beyond TF-IDF
•Why?
• IDF ignores term “usefulness”
•What?
• Information gain accounts for term quality
•How?
• Hadoop + similarity factory = win
Fast Deploys,
Careful Testing
• Idea
• Proof of Concept
• Side-By-Side
• A/B test
• 100% Live
Side-by-Side
Relevant != High quality
A/B Test
• Users are randomly assigned to A or B
• A sees IDF-based results
• B sees info gain-based results
A/B Test
• Users are randomly assigned to A or B
• A sees IDF-based results
• B sees info gain-based results
• Small but significant decrease in clicks, page views, etc.
More homogeneous results
Lower average quality score
Next Steps
Parameter Tweaking...
Rebalance relevancy and quality signals in score
The Future
Latent Semantic Indexing in Solr/Lucene
Latent Semantic Indexing
• In TF-IDF, documents are sparse vectors in term space
• LSI re-maps these to dense vectors in “concept” space
• Construct a transformation matrix Tr×m mapping term space Rm to concept space Rr
• Load file at index and query time
• Re-map query and documents
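One standard way to build such a matrix is a truncated SVD, X ≈ Ur Σr Vrᵀ, which gives T = Σr⁻¹ Urᵀ. A sketch with toy data (numpy here stands in for whatever would run inside Solr/Lucene):

```python
import numpy as np

# Toy term-document matrix X (m = 4 terms x n = 4 docs); counts invented
X = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 3.0, 0.0, 0.0],
    [0.0, 1.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

r = 2  # target "concept" dimensionality

# Truncated SVD: keep the top-r singular vectors/values
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_r, s_r = U[:, :r], s[:r]

# Transformation matrix T (r x m): maps term space R^m to concept space R^r
T = np.diag(1.0 / s_r) @ U_r.T

def to_concepts(v):
    # Re-map a term-space vector (query or document) into concept space
    return T @ v

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([1.0, 0.0, 1.0, 0.0])  # sparse query vector in term space
d = X[:, 2]                          # one document column

print(cosine(to_concepts(q), to_concepts(d)))
```
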
CONTACT
Stephen Murtagh
smurtagh@etsy.com
