Beyond TF-IDF: Why, What & How
 


Presented by Stephen Murtagh, Etsy.com, Inc.

TF-IDF (term frequency, inverse document frequency) is a standard method of weighting query terms for scoring documents, and it is the default in Solr/Lucene. Unfortunately, TF-IDF is really only a measure of rarity, not quality or usefulness. This means it gives more weight to a useless, rare term, such as a misspelling, than to a more useful, but more common, term.
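
To make the rarity problem concrete, here is a minimal sketch of the IDF part of the weight. It assumes the classic Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)). The index size, the two terms, and their document frequencies are made-up illustration values, not figures from the talk.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Inverse document frequency, assuming the classic Lucene formula."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical index: one common, useful term and one rare misspelling.
NUM_DOCS = 1_000_000
doc_freqs = {
    "jewelry": 50_000,  # common and useful (assumed count)
    "jewlery": 40,      # rare misspelling (assumed count)
}

weights = {term: classic_idf(df, NUM_DOCS) for term, df in doc_freqs.items()}

for term, weight in weights.items():
    print(f"{term}: idf = {weight:.2f}")
```

Because the formula depends only on how few documents contain a term, the misspelling ends up with the larger weight even though it says less about what the searcher wants.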

In this presentation, we will discuss our experiences replacing Lucene's TF-IDF-based scoring function with a more useful one based on information gain, a standard machine-learning measure that combines frequency and specificity. Information gain is much more expensive to compute, however, so the term weights must be computed periodically outside of Solr/Lucene and then made accessible within Solr/Lucene.
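
As a rough illustration of information gain as a term weight, the sketch below measures how much knowing whether a term is present reduces the entropy of a document label. The abstract does not say which label or corpus is used, so the category labels, the toy documents, and the function names here are all assumptions, and this is an offline computation of the kind described above, not Solr/Lucene code.

```python
import math
from collections import Counter


def entropy(counts) -> float:
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, term) -> float:
    """Entropy of the labels minus their entropy after splitting on `term`.

    `docs` is a list of (set_of_terms, label) pairs.
    """
    all_labels = Counter(label for _, label in docs)
    with_term = Counter(label for terms, label in docs if term in terms)
    without_term = all_labels - with_term  # labels of docs lacking the term

    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with

    h_before = entropy(all_labels.values())
    h_after = (n_with / n) * entropy(with_term.values()) + (
        n_without / n
    ) * entropy(without_term.values())
    return h_before - h_after


# Toy corpus (hypothetical): each document is a set of terms plus a category.
docs = [
    ({"silver", "ring"}, "jewelry"),
    ({"silver", "necklace"}, "jewelry"),
    ({"silver", "jewlery"}, "jewelry"),  # contains a rare misspelling
    ({"gold", "ring"}, "jewelry"),
    ({"wool", "scarf"}, "clothing"),
    ({"wool", "hat"}, "clothing"),
    ({"cotton", "scarf"}, "clothing"),
    ({"cotton", "shirt"}, "clothing"),
]

# Term weights computed offline for the whole vocabulary.
vocabulary = set().union(*(terms for terms, _ in docs))
term_weights = {term: information_gain(docs, term) for term in vocabulary}

for term in ("silver", "jewlery"):
    print(f"{term}: information gain = {term_weights[term]:.3f}")
```

In this toy corpus the frequent, category-specific term "silver" scores higher than the rare misspelling, which is the opposite of the ordering that rarity alone would give. Each weight requires a pass over the labeled documents, which is why a computation like this would be run periodically outside the search engine.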

Presentation Transcript