Beyond TF-IDF: Why, What & How

Uploaded on May 15, 2013

Presented by Stephen Murtagh, Etsy.com, Inc.

TF-IDF (term frequency, inverse document frequency) is a standard method of weighting query terms when scoring documents, and it is the method used by default in Solr/Lucene. Unfortunately, TF-IDF really measures only rarity, not quality or usefulness. As a result, it gives more weight to a rare but useless term, such as a misspelling, than to a more common but more useful term.
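To make the rarity problem concrete, here is a minimal sketch of the inverse-document-frequency part of the weight, using the form found in Lucene's classic TF-IDF similarity, 1 + ln(numDocs / (docFreq + 1)). The corpus size, the two terms, and their document frequencies are invented for illustration and do not come from the presentation.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """IDF in the form used by Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical corpus statistics, made up for this example.
NUM_DOCS = 1_000_000
doc_freqs = {
    "jewelry": 50_000,  # common, useful term
    "jewlery": 40,      # rare misspelling
}

weights = {term: idf(df, NUM_DOCS) for term, df in doc_freqs.items()}

# Print terms from highest to lowest weight.
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} idf={weight:.2f}")
```

Because the weight depends only on how few documents contain the term, the misspelling comes out ahead of the correctly spelled term.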

In this presentation, we will discuss our experiences replacing Lucene's TF-IDF-based scoring function with a more useful one based on information gain, a standard machine-learning measure that combines frequency and specificity. Information gain is much more expensive to compute, however, so the term weights must be computed periodically outside of Solr/Lucene and the results made accessible within Solr/Lucene.
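As a rough illustration of the measure, the sketch below computes information gain in its usual feature-selection form: the entropy of a class label minus the expected entropy after splitting documents on whether they contain the term. The choice of a category label as the target, the toy documents, and the function names are all assumptions made for this example; the abstract does not say which target the presenters used or how the computation was implemented.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, term):
    """Information gain of `term` about the label.

    docs is a list of (set_of_terms, label) pairs (hypothetical format).
    """
    all_labels = Counter(label for _, label in docs)
    with_term = Counter(label for terms, label in docs if term in terms)
    without_term = all_labels - with_term

    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with

    # Expected label entropy after splitting on presence of the term.
    conditional = (
        (n_with / n) * entropy(with_term.values())
        + (n_without / n) * entropy(without_term.values())
    )
    return entropy(all_labels.values()) - conditional


# Toy corpus, invented for illustration.
docs = [
    ({"silver", "ring"}, "jewelry"),
    ({"gold", "ring"}, "jewelry"),
    ({"ring", "jewlery"}, "jewelry"),
    ({"ring", "opal"}, "jewelry"),
    ({"wool", "scarf"}, "clothing"),
    ({"silk", "scarf"}, "clothing"),
    ({"wool", "hat"}, "clothing"),
    ({"felt", "hat"}, "clothing"),
]

for term in ("ring", "jewlery"):
    print(f"{term:10s} IG={information_gain(docs, term):.3f}")
```

In this toy corpus the common term "ring" separates the two labels perfectly, while the misspelling appears in a single document and says little about the rest, so it receives a much smaller weight. That is the opposite of the ordering a rarity-only measure would produce.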

Presentation Transcript