Search Concepts and Tools
Steve Davids
What is Search?
Image Credit: https://www.flickr.com/photos/doggybytes/ - CC License
What is Search?
Definition: “try to find something by looking or
otherwise seeking carefully and thoroughly”
Synonyms: hunt, look, quest
Hunt
●  SELECT * FROM book WHERE author LIKE '%name%'
●  publisher:("Wall Street Journal" OR "WSJ.com" OR "Wall Street Journal India"...)
Look
●  Relevant Results
○  Use signals to boost relevance
●  Ability to quickly whittle down results
Quest
Provide recommendations
Hunt Look Quest
What do we have in common?
●  Open Source Search Library (Java API)
●  Free text search
●  Relevancy ranking
●  Faceting and filtering
●  Hit-term highlighting
●  Near real-time indexing/querying
●  Inverted Index
Free text search via…
●  Keyword
●  Wildcard
o  walk*
o  M?ham?d
o  M[ou]hamm?[ae]d
●  Proximity
●  Fuzzy
●  Range
o  [* TO N]
●  Geospatial
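As a rough illustration of how a few of these query types behave, the sketch below matches terms in plain Python. It is not how a search library evaluates queries internally. The helper names (wildcard_match, regex_match, edit_distance, fuzzy_match, in_range) and the sample terms are assumptions made for the example, and M[ou]hamm?[ae]d is treated here as a regular expression.

```python
import re
from fnmatch import fnmatchcase


def wildcard_match(pattern: str, term: str) -> bool:
    """Shell-style wildcard: * is any run of characters, ? is exactly one."""
    return fnmatchcase(term, pattern)


def regex_match(pattern: str, term: str) -> bool:
    """True when the whole term matches the regular expression."""
    return re.fullmatch(pattern, term) is not None


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def fuzzy_match(query: str, term: str, max_edits: int = 2) -> bool:
    """Fuzzy search: accept terms within max_edits single-character edits."""
    return edit_distance(query, term) <= max_edits


def in_range(value, low=None, high=None) -> bool:
    """Inclusive range; None stands for an open end, like * in [* TO N]."""
    return (low is None or value >= low) and (high is None or value <= high)


print(wildcard_match("walk*", "walking"))
print(wildcard_match("M?ham?d", "Mohamed"))
print(regex_match("M[ou]hamm?[ae]d", "Muhammad"))
print(fuzzy_match("solr", "solar", max_edits=1))
print(in_range(5, high=10))
```

Each of the five calls prints True; swapping in a term such as "Mahmoud" for the wildcard or regex patterns returns False.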
Text Analysis
●  Convert text into searchable words
●  CharFilter
o  Mutates single stream of text
●  Tokenizer
o  Splits single stream of text into one or more tokens
●  TokenFilter
o  Mutates token stream
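The three stages above run in a fixed order: character filters first, then exactly one tokenizer, then token filters. A minimal sketch of that chaining, assuming a hypothetical analyze function and toy lambdas in place of real analysis components:

```python
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(
    text: str,
    char_filters: List[CharFilter],
    tokenizer: Tokenizer,
    token_filters: List[TokenFilter],
) -> List[str]:
    """Run text through char filters, one tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)          # still a single stream of text
    tokens = tokenizer(text)              # now a stream of tokens
    for token_filter in token_filters:
        tokens = token_filter(tokens)     # tokens in, tokens out
    return tokens


tokens = analyze(
    "Search & Tools",
    char_filters=[lambda s: s.replace("&", "and")],
    tokenizer=str.split,
    token_filters=[lambda toks: [t.lower() for t in toks]],
)
print(tokens)  # ['search', 'and', 'tools']
```

The same chain is normally applied at index time and at query time so that query terms line up with indexed terms.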
Notable Character Filters
●  HTML Strip
o  “<p>Example <a href='/test'>link</a></p>”
o  “Example link”
●  Pattern Replace
o  pattern="[^a-zA-Z]" replacement=""
o  “Testing123”
o  “Testing”
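The two character filters above can be approximated with the Python standard library. This is a sketch of the behaviour shown on the slide, not Solr's implementation; html_strip and pattern_replace are names chosen for the example.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_strip(text: str) -> str:
    """Drop markup and keep the character data."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


def pattern_replace(text: str, pattern: str, replacement: str) -> str:
    """Replace every match of the regular expression."""
    return re.sub(pattern, replacement, text)


print(html_strip("<p>Example <a href='/test'>link</a></p>"))  # Example link
print(pattern_replace("Testing123", "[^a-zA-Z]", ""))          # Testing
```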
Notable Tokenizers
●  Keyword
o  “Hello World!”
o  {“Hello World!”}
●  Whitespace
o  “Hello World!”
o  {“Hello”, “World!”}
●  Standard
●  Pattern
●  ICU (International Components for Unicode)
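A sketch of how these tokenizers differ on the same input. The function names are assumptions for the example, and standard_like_tokenizer is only a crude stand-in: the real Standard tokenizer follows Unicode text segmentation rules, which a single regular expression does not reproduce.

```python
import re
from typing import List


def keyword_tokenizer(text: str) -> List[str]:
    """The whole input becomes one token."""
    return [text]


def whitespace_tokenizer(text: str) -> List[str]:
    """Split on runs of whitespace; punctuation stays attached."""
    return text.split()


def pattern_tokenizer(text: str, pattern: str) -> List[str]:
    """Split wherever the regular expression matches."""
    return [tok for tok in re.split(pattern, text) if tok]


def standard_like_tokenizer(text: str) -> List[str]:
    """Rough approximation only: keep runs of word characters."""
    return re.findall(r"\w+", text)


print(keyword_tokenizer("Hello World!"))        # ['Hello World!']
print(whitespace_tokenizer("Hello World!"))     # ['Hello', 'World!']
print(pattern_tokenizer("a, b,c", r",\s*"))     # ['a', 'b', 'c']
print(standard_like_tokenizer("Hello World!"))  # ['Hello', 'World']
```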
Notable Token Filters
●  Lower Case
o  {“Hello”, “World!”}
o  {“hello”, “world!”}
●  Synonym
o  synonyms.txt (expand=true): JPN, Japan, JN
§  {“to”, “Japan”}
§  {“to”, {“Japan”, “JPN”, “JN”}}
o  synonyms.txt (expand=false)
§  {“to”, “JPN”}
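A minimal sketch of the two filters above. Each output position is a list so that several synonyms can share one position, mirroring the nested braces on the slide. The synonym_filter signature and the in-memory groups argument are assumptions for the example; the slide's version reads its groups from synonyms.txt.

```python
from typing import Dict, List


def lower_case(tokens: List[str]) -> List[str]:
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def synonym_filter(
    tokens: List[str], groups: List[List[str]], expand: bool = True
) -> List[List[str]]:
    """Return one list of terms per token position.

    expand=True  emits the token plus the rest of its synonym group.
    expand=False emits only the first term listed in the group.
    """
    lookup: Dict[str, List[str]] = {}
    for group in groups:
        for term in group:
            lookup[term] = group

    positions = []
    for tok in tokens:
        group = lookup.get(tok)
        if group is None:
            positions.append([tok])
        elif expand:
            positions.append([tok] + [t for t in group if t != tok])
        else:
            positions.append([group[0]])
    return positions


groups = [["JPN", "Japan", "JN"]]
print(lower_case(["Hello", "World!"]))                          # ['hello', 'world!']
print(synonym_filter(["to", "Japan"], groups, expand=True))     # [['to'], ['Japan', 'JPN', 'JN']]
print(synonym_filter(["to", "Japan"], groups, expand=False))    # [['to'], ['JPN']]
```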
Notable Token Filters
●  Word Delimiter
o  {“F22-Raptor”}
o  {“F22”, “Raptor”}
o  {“F”, “22”, “Raptor”}
o  {“F”, {“22”, “F22”}, {“Raptor”, “F22Raptor”}}
●  Porter Stem
o  {“walked”, “walking”, “walks”}
o  {“walk”, “walk”, “walk”}
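The first two Word Delimiter outputs and the stemming example can be reproduced with a short sketch. Both functions are simplified stand-ins with hypothetical names: word_delimiter omits the catenation options shown in the last Word Delimiter line, and naive_stem is plain suffix stripping, not the Porter algorithm.

```python
import re
from typing import List, Sequence


def word_delimiter(token: str, split_on_numerics: bool = True) -> List[str]:
    """Split on non-alphanumerics, then optionally at letter/digit boundaries."""
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    if not split_on_numerics:
        return parts
    out: List[str] = []
    for part in parts:
        out.extend(re.findall(r"[A-Za-z]+|[0-9]+", part))
    return out


def naive_stem(token: str, suffixes: Sequence[str] = ("ing", "ed", "s")) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


print(word_delimiter("F22-Raptor", split_on_numerics=False))  # ['F22', 'Raptor']
print(word_delimiter("F22-Raptor"))                           # ['F', '22', 'Raptor']
print([naive_stem(t) for t in ["walked", "walking", "walks"]])  # ['walk', 'walk', 'walk']
```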
Inverted Index
T[0] = "It is what it is"
T[1] = "what is it?"
T[2] = "it is a banana"
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 2}
"It": {0}
"it?": {1}
"what": {0, 1}
Inverted Index
T[0] = "It is what it is"
T[1] = "what is it?"
T[2] = "it is a banana"
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
What is Relevant?
●  TF-IDF
o  Term Frequency - Inverse Document Frequency
●  Boosting
o  Important terms
o  Signals
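To tie the inverted index to relevancy, the sketch below indexes the three example texts with term frequencies and ranks them with TF-IDF. The smoothed idf formula is one common textbook variant, not the exact formula of any particular search library, and the analyze, idf and score helpers and the boosts parameter are assumptions for the example.

```python
import math
import re
from collections import Counter, defaultdict

docs = ["It is what it is", "what is it?", "it is a banana"]


def analyze(text):
    """Lowercase and keep alphanumeric runs, so 'It' and 'it?' both become 'it'."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Inverted index: term -> {doc_id: term frequency in that document}
index = defaultdict(dict)
for doc_id, text in enumerate(docs):
    for term, tf in Counter(analyze(text)).items():
        index[term][doc_id] = tf


def idf(term):
    """Rarer terms weigh more; smoothing avoids division by zero."""
    df = len(index.get(term, {}))
    return math.log((1 + len(docs)) / (1 + df)) + 1.0


def score(query, boosts=None):
    """Sum boost * tf * idf per document; best score first, ties by doc id."""
    boosts = boosts or {}
    scores = defaultdict(float)
    for term in analyze(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += boosts.get(term, 1.0) * tf * idf(term)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


print(sorted(index["it"]))                                       # [0, 1, 2]
print([d for d, _ in score("what banana")])                      # [2, 0, 1]
print([d for d, _ in score("what banana", boosts={"what": 2.0})])  # [0, 1, 2]
```

"banana" appears in only one document, so it outweighs "what" until the boost doubles the contribution of "what".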
Relational vs Denormalized
Students
-  S01 | Marina
-  S02 | Ben
Classes
-  C01 | Hadoop
-  C02 | Solr
Enrolled
-  S01 | C01
-  S01 | C02
-  S02 | C01
-  S02 | C02
Students
-  S01 | Marina | [Hadoop, Solr]
-  S02 | Ben | [Hadoop, Solr]
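The flattening above amounts to resolving the join once, before indexing, so that each student becomes one self-contained document. A minimal sketch, with the field names id, name and classes chosen for the example:

```python
students = {"S01": "Marina", "S02": "Ben"}
classes = {"C01": "Hadoop", "C02": "Solr"}
enrolled = [("S01", "C01"), ("S01", "C02"), ("S02", "C01"), ("S02", "C02")]


def denormalize(students, classes, enrolled):
    """Build one document per student with class names embedded as a list."""
    docs = {
        sid: {"id": sid, "name": name, "classes": []}
        for sid, name in students.items()
    }
    for sid, cid in enrolled:
        docs[sid]["classes"].append(classes[cid])
    return list(docs.values())


for doc in denormalize(students, classes, enrolled):
    print(doc)
# {'id': 'S01', 'name': 'Marina', 'classes': ['Hadoop', 'Solr']}
# {'id': 'S02', 'name': 'Ben', 'classes': ['Hadoop', 'Solr']}
```

The trade-off is duplication: renaming a class means reindexing every student document that contains it.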
Wrap it up...
●  Design for the user
●  Know your data
●  Be lazy
Questions?
Solr Hack Session...
Solr Schema
●  Field Definitions
o  Field Type, Indexed, Stored, Multivalued, Doc Values
o  Copy Fields
o  Dynamic Fields
§  <dynamicField name="*_sort" type="lowercase" />
●  Field Types
o  Analysis Chain
Solr Config
●  Request handler definitions
●  Search component definitions
●  Update processor chains
●  Cache settings
●  Index specifications
●  Threshold settings
●  Custom library import locations
