Stopwords in Search
face of the stopwords
The February 2014 Monthly
Tomek Sobczak
what are stop words?
common wisdom
• they are everywhere and bloat the index
• remove them to increase performance (smaller index and query) and
relevance of search results
• … but sometimes stop words do add some meaning to a sentence
• … and sometimes you need them - To be or not to be
• having the best of both worlds? multiple mappings of the data: one with
stop words removed and one with stop words kept
• … but that doubles the data by indexing it in two different ways!
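
The trade-off above can be sketched with a toy inverted index. This is only an illustration: the stop word list, the documents and the helper names (`build_index`, `search_all_terms`, `posting_count`) are made up for this example and do not come from any particular search engine.

```python
from collections import defaultdict

# Hypothetical stop word list and documents, chosen only for illustration.
STOP_WORDS = frozenset({"to", "be", "or", "not", "the", "a", "in", "and", "of"})
DOCS = {
    1: "to be or not to be",
    2: "the pearl",
    3: "pearl harbor",
}


def build_index(docs, stop_words=frozenset()):
    """Build an inverted index (term -> set of doc ids), skipping stop words."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if term not in stop_words:
                index[term].add(doc_id)
    return index


def search_all_terms(index, query, stop_words=frozenset()):
    """Return ids of documents that contain every query term left after filtering."""
    terms = [t for t in query.lower().split() if t not in stop_words]
    if not terms:
        return set()  # the whole query consisted of stop words
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def posting_count(index):
    """Total number of (term, document) entries, a rough proxy for index size."""
    return sum(len(doc_ids) for doc_ids in index.values())


stopped_index = build_index(DOCS, STOP_WORDS)  # mapping 1: stop words removed
full_index = build_index(DOCS)                 # mapping 2: stop words kept

print(search_all_terms(stopped_index, "to be or not to be", STOP_WORDS))
print(search_all_terms(full_index, "to be or not to be"))
print(posting_count(stopped_index), posting_count(full_index))
```

In this toy setup the stopped index cannot answer "to be or not to be" at all and cannot tell "pearl" from "the pearl", while keeping both indexes means storing every document twice.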
• Common Terms Query analyzes the query and identifies which
words are “important” based on the document frequency
of each term
• Common Terms Query leverages the power of stop word
removal (faster searches) without eliminating stop words (they
can sometimes contribute to the score)
• Common Terms Query adapts to your domain: words
with high frequency are automatically recognized as
stop words
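
A minimal sketch of the idea behind these three points, under stated assumptions: it is a simplified model written for this example, not the engine's actual implementation. The corpus, the `CUTOFF` value, the scoring formula and the function names are all hypothetical. Rare terms select the candidate documents, frequent terms only add to the score, and a query made up solely of frequent terms is treated here as an AND of those terms.

```python
import math

# Hypothetical corpus; "the" is frequent here, so it ends up treated as a stop word.
DOCS = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the fox and the dog",
    4: "to be or not to be",
    5: "the who",
}
CUTOFF = 0.5  # assumption: terms in more than 50% of the documents are "common"

TOKENS = {doc_id: set(text.lower().split()) for doc_id, text in DOCS.items()}


def doc_freq(term):
    """Number of documents that contain the term."""
    return sum(term in tokens for tokens in TOKENS.values())


def split_terms(query):
    """Split unique query terms into (low_frequency, high_frequency) lists."""
    low, high = [], []
    for term in dict.fromkeys(query.lower().split()):
        if doc_freq(term) / len(TOKENS) > CUTOFF:
            high.append(term)
        else:
            low.append(term)
    return low, high


def common_terms_search(query):
    """Return doc ids ranked by score; rare terms decide which docs are candidates."""
    low, high = split_terms(query)
    if low:
        # Candidates must contain at least one "important" (rare) term.
        candidates = [d for d, toks in TOKENS.items() if any(t in toks for t in low)]
    else:
        # Only common terms in the query: require all of them.
        candidates = [d for d, toks in TOKENS.items() if all(t in toks for t in high)]

    def score(doc_id):
        # Every matched term adds to the score, so common terms still count a little.
        matched = [t for t in low + high if t in TOKENS[doc_id]]
        return sum(math.log(1 + len(TOKENS) / doc_freq(t)) for t in matched)

    return sorted(candidates, key=lambda d: (-score(d), d))


print(split_terms("the fox"))
print(common_terms_search("the fox"))
print(common_terms_search("the who"))
```

Because the split depends only on document frequencies in the indexed data, no fixed stop word list is needed: whatever is frequent in a given domain is demoted automatically.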
restoring stop words
possibility of improving
• searches composed only of stopwords (improved recall)
  • to be or not to be
  • The Who
• short searches that include stopwords (improved precision)
  • pearl vs. the pearl
  • the one
  • a zukofsky (author Zukofsky, title "a")
• distinguishing "in" from "and" in some cases
  • archaeology in literature != archaeology and literature
possibility of degrading
• long queries (over 6 terms) with a lot of stopwords have reduced precision
  • Lectures on the Calculus of Variations and Optimal Control Theory
• BUT: the words occurring as a phrase float to the top
• AND: you can modify the minimum match (mm) param
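
The effect of a minimum match setting on long, stopword-heavy queries can be sketched as follows. This is a toy model: `matched_terms`, `matches` and the second document are invented for this example, and `mm` is taken here to be an absolute number of query terms that must appear in a document.

```python
QUERY = "lectures on the calculus of variations and optimal control theory"

# DOC_A is the title from the slide; DOC_B is a made-up title that shares only stop words.
DOC_A = "Lectures on the Calculus of Variations and Optimal Control Theory"
DOC_B = "Notes on the Art of War and Peace"


def matched_terms(query, doc):
    """Unique query terms that also occur in the document."""
    doc_terms = set(doc.lower().split())
    return [t for t in dict.fromkeys(query.lower().split()) if t in doc_terms]


def matches(query, doc, mm):
    """A document matches when at least `mm` distinct query terms occur in it."""
    return len(matched_terms(query, doc)) >= mm


print(matched_terms(QUERY, DOC_B))
print(matches(QUERY, DOC_B, mm=3), matches(QUERY, DOC_B, mm=6))
print(matches(QUERY, DOC_A, mm=6))
```

With a low threshold the unrelated title matches on "on", "the", "of" and "and" alone, which is the precision loss described above. Raising the threshold filters it out while the intended title still matches.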
restoring stop words
how to decide?
• take a look at your business knowledge domain
• count the percentage of searches that contain stop words
• count the number of terms in user queries
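
The two counting steps could be run over a query log along these lines. This is a sketch under assumptions: the stop word list and the sample queries are hypothetical, and a real log would first need to be loaded and normalized.

```python
from collections import Counter

# Hypothetical stop word list and query log, for illustration only.
STOP_WORDS = frozenset({"a", "and", "be", "in", "not", "of", "or", "the", "to", "who"})
QUERY_LOG = [
    "to be or not to be",
    "the who",
    "pearl",
    "archaeology in literature",
    "optimal control theory",
]


def query_stats(queries, stop_words):
    """Summarize how often stop words occur in queries and how long queries are."""
    if not queries:
        raise ValueError("query log is empty")
    tokenized = [q.lower().split() for q in queries]
    with_stop = sum(any(t in stop_words for t in terms) for terms in tokenized)
    only_stop = sum(bool(terms) and all(t in stop_words for t in terms) for terms in tokenized)
    lengths = [len(terms) for terms in tokenized]
    return {
        "pct_with_stop_words": 100 * with_stop / len(queries),
        "pct_only_stop_words": 100 * only_stop / len(queries),
        "avg_terms_per_query": sum(lengths) / len(queries),
        "length_histogram": Counter(lengths),
    }


stats = query_stats(QUERY_LOG, STOP_WORDS)
for name, value in stats.items():
    print(name, value)
```

A high share of stopword-only or short stopword queries argues for keeping stop words. A log dominated by long queries suggests checking the minimum match setting first.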
Thank you!
