Presented by Stephen Murtagh, Etsy.com, Inc.
TF-IDF (term frequency, inverse document frequency) is a standard method of weighting query terms for scoring documents, and is the method that is used by default in Solr/Lucene. Unfortunately, TF-IDF is really only a measure of rarity, not quality or usefulness. This means it would give more weight to a useless, rare term, such as a misspelling, than to a more useful, but more common, term.
In this presentation, we will discuss our experiences replacing Lucene's TF-IDF based scoring function with a more useful one using information gain, a standard machine-learning measure that combines frequency and specificity. Information gain is much more expensive to compute, however, so this requires periodically computing the term weights outside of Solr/Lucene and making the results accessible within Solr/Lucene.