8. common wisdom
• stop words are everywhere and bloat the index
• remove them to increase performance (smaller index, faster queries) and
relevance of search results
• … but sometimes stop words do add a little meaning to a sentence
• … and sometimes you need them - To be or not to be
• having the best of both worlds? multiple mappings of data: one with
stop words removed and one with stop words
• … but that doubles the data by indexing it in two different ways!
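The cost of indexing the same data twice can be seen in a toy inverted index. This is a minimal sketch, not a real search engine: the stop word list, the whitespace tokenizer, and the helper names (`build_index`, `postings_count`) are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "be", "in", "not", "of", "or", "the", "to"}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()


def build_index(docs, remove_stop_words):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            if remove_stop_words and term in STOP_WORDS:
                continue
            index[term].add(doc_id)
    return index


def postings_count(index):
    """Total number of (term, document) pairs stored in the index."""
    return sum(len(doc_ids) for doc_ids in index.values())


docs = {
    1: "To be or not to be",
    2: "The pearl of the orient",
    3: "Archaeology in literature",
}

stopped = build_index(docs, remove_stop_words=True)   # small and fast
full = build_index(docs, remove_stop_words=False)     # keeps everything

print("postings without stop words:", postings_count(stopped))
print("postings with stop words:   ", postings_count(full))
print("document 1 findable in stopped index:",
      any(1 in ids for ids in stopped.values()))
```

In this toy corpus the stopped index cannot find document 1 at all, while keeping both indexes means storing every non-stop-word posting twice.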
10. Common Terms Query
• analyzes the query and identifies which words are “important” based on
the document frequency of each term
• leverages the benefit of stop word removal (faster searches) without
eliminating the stop words (they can sometimes contribute to the score)
• adapts to your domain: words with high frequency are automatically
recognized as stop words
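The idea behind this can be sketched in a few lines of Python. This is a simplified illustration of the principle, not the Elasticsearch implementation: the `cutoff` ratio, the count-based scoring, and the rule that a query made only of high-frequency terms must match all of them are assumptions of this sketch.

```python
from collections import Counter


def tokenize(text):
    return text.lower().split()


def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for text in docs.values():
        df.update(set(tokenize(text)))
    return df


def split_by_frequency(terms, df, num_docs, cutoff):
    """Terms found in more than `cutoff` of all documents count as 'common'."""
    low = [t for t in terms if df[t] / num_docs <= cutoff]
    high = [t for t in terms if df[t] / num_docs > cutoff]
    return low, high


def common_terms_search(query, docs, cutoff=0.5):
    df = document_frequencies(docs)
    low, high = split_by_frequency(tokenize(query), df, len(docs), cutoff)
    scores = {}
    for doc_id, text in docs.items():
        terms = set(tokenize(text))
        if low:
            # Only the rare ("important") terms decide which documents match.
            matched = any(t in terms for t in low)
        else:
            # Assumption: a query of common terms only must match all of them.
            matched = all(t in terms for t in high)
        if matched:
            # Common terms still contribute to the score of matched documents.
            scores[doc_id] = sum(t in terms for t in low + high)
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {1: "the pearl", 2: "the one", 3: "the who", 4: "pearl diving"}

print(common_terms_search("the pearl", docs))  # "the" only affects ranking
print(common_terms_search("the", docs))        # query of common terms only
```

No stop word list appears anywhere: "the" is treated as common purely because of its document frequency in this corpus.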
12. restoring stop words
possibility of improving
• searches consisting only of stop words (improved recall)
• to be or not to be
• The Who
• short searches that include stop words (improved precision)
• pearl vs. the pearl
• the one
• a zukofsky (author Zukofsky, title "a")
• distinguishing "in" from "and" in some cases
• archaeology in literature != archaeology and literature
possibility of degrading
• long queries (over 6 terms) with a lot of stop words have reduced precision
• Lectures on the Calculus of Variations and Optimal Control Theory
• BUT: the words occurring as a phrase float to the top
• AND: you can modify the minimum match (mm) param
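The effect of a minimum match setting on a long, stop-word-heavy query can be illustrated with a toy matcher. This is a sketch under stated assumptions: `mm_percent` here is a plain percentage of query terms rounded down, which simplifies what a real mm parameter supports, and the sample titles B and C are made up.

```python
def tokenize(text):
    return text.lower().split()


def matching_docs(query, docs, mm_percent):
    """Return ids of documents containing at least mm_percent of the query terms."""
    query_terms = tokenize(query)
    # Integer arithmetic, rounded down, but always require at least one term.
    required = max(1, len(query_terms) * mm_percent // 100)
    hits = []
    for doc_id, text in docs.items():
        doc_terms = set(tokenize(text))
        matched = sum(term in doc_terms for term in query_terms)
        if matched >= required:
            hits.append(doc_id)
    return hits


docs = {
    "A": "Lectures on the Calculus of Variations and Optimal Control Theory",
    "B": "the history of art and design on the web",  # shares only stop words
    "C": "Optimal control theory",
}
query = "lectures on the calculus of variations and optimal control theory"

print(matching_docs(query, docs, mm_percent=30))  # loose: B sneaks in
print(matching_docs(query, docs, mm_percent=50))  # stricter: B is dropped
```

Raising the threshold removes the document that matched on stop words alone, at the price of also dropping the short but relevant title C, so the value needs tuning against real queries.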
13. restoring stop words
how to decide?
• take a look at your business knowledge domain
• measure the percentage of searches that contain stop words
• count the terms per user query
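Both measurements can be taken from a query log with a short script. A minimal sketch, assuming the log is available as a list of query strings; the stop word list and the sample queries are hypothetical.

```python
# Hypothetical stop word list; substitute the list your analyzer uses.
STOP_WORDS = {"a", "and", "be", "in", "not", "of", "or", "the", "to", "who"}


def query_stats(queries):
    """Summarize how often stop words appear in a list of user queries."""
    tokenized = [q.lower().split() for q in queries]
    with_stop = sum(any(t in STOP_WORDS for t in terms) for terms in tokenized)
    only_stop = sum(all(t in STOP_WORDS for t in terms) for terms in tokenized)
    total_terms = sum(len(terms) for terms in tokenized)
    return {
        "percent_with_stop_words": 100 * with_stop / len(queries),
        "percent_only_stop_words": 100 * only_stop / len(queries),
        "average_terms_per_query": total_terms / len(queries),
    }


queries = [
    "to be or not to be",
    "the who",
    "pearl",
    "archaeology in literature",
    "optimal control theory",
]

stats = query_stats(queries)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

A high share of stop-word-only queries argues for restoring stop words, while a high average query length suggests checking the minimum match setting first.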