Introduction to Apache Solr
Hassan Nasr Esfahani
Topics
– What we need from a text search engine
– What is Solr?
– Why Solr?
– Concepts And Architecture
– Usage
– Special Features
– Competitors
Text Retrieval vs Database Retrieval
– Information and query
  – Unstructured vs structured
  – Ambiguous vs well-defined
– Answers
  – Relevant documents (ambiguous) vs matched documents
What we want from a text search engine
Basic Search Features:
– Store some documents with some fields
– Query for documents
Text Search Features
– Find most relevant docs
– Handle natural-language complications (stop words, stemming, tokenization, …)
– Highlight text
– …
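The natural-language handling listed above can be sketched as a small analysis chain. This is only a toy illustration: the stop-word list and suffix rules are assumptions made for the example, not what Solr ships with, and real analyzers use language-specific resources.

```python
import re

# Illustrative stop-word list and suffix rules (assumptions for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in"}
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def stem(token):
    """Strip one common English suffix (a crude stand-in for a real stemmer)."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]


print(analyze("The engine is indexing the searched documents"))
# prints ['engine', 'index', 'search', 'document']
```

Applying the same chain to both documents and queries is what lets "indexing" in a document match "index" in a query.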
Problems with Text Search
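Problem: Sample
– Tokenization: می‌روم، کتابش، محمد‌صادق ("I go", "his/her book", the compound name "Mohammad Sadegh")
– Different letter representations: ی and ي (Persian yeh vs Arabic yeh)
– Similar words: میروم، میروی، میرود ("I go", "you go", "he/she goes")
– Synonymous words: معلم and آموزگار (both mean "teacher")
– Word ambiguity: شیر ("milk", "lion", or "tap")
– Stop words: با، است، به، رفت ... ("with", "is", "to", "went", ...)
– Spelling errors: گذارش (a misspelling of گزارش, "report")
– Spoken language: نون (colloquial form of نان, "bread")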
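The letter-representation and tokenization problems above are usually handled by normalizing characters before indexing. A minimal sketch, assuming we only care about the Arabic/Persian yeh and kaf variants and that dropping the zero-width non-joiner is acceptable (both are simplifications chosen for this example):

```python
# Map Arabic code points to their Persian counterparts.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})
ZWNJ = "\u200C"  # zero-width non-joiner, often typed inconsistently


def normalize(text, strip_zwnj=True):
    """Unify letter variants and optionally drop zero-width non-joiners."""
    text = text.translate(ARABIC_TO_PERSIAN)
    if strip_zwnj:
        text = text.replace(ZWNJ, "")
    return text


# The same word typed with an Arabic yeh plus ZWNJ, and with a Persian yeh.
with_arabic_yeh = "\u0645\u064A\u200C\u0631\u0648\u0645"
with_persian_yeh = "\u0645\u06CC\u0631\u0648\u0645"
print(normalize(with_arabic_yeh) == normalize(with_persian_yeh))  # prints True
```

After normalization both spellings produce the same indexed term, so a query in either form finds the document.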
What is Solr?
– An open-source search engine
– Written in Java
– Wrapping Apache Lucene
– With REST API
– Fault tolerant
– Scalable
– Distributable
Solr Simple Architecture
[Diagram: documents and queries each pass through an Analyze queue (steps 1, 2, 3 for one path and 1, 2, 3' for the other), configured in schema.xml, on top of Apache Lucene.]
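The flow in the diagram, where both documents and queries are analyzed before they reach the index, can be sketched with a toy inverted index. The class and function names here are hypothetical, and Lucene's real data structures are far more elaborate.

```python
from collections import defaultdict


def analyze(text):
    """Toy analysis chain: lowercase and split on whitespace."""
    return text.lower().split()


class TinyIndex:
    """A minimal inverted index: term -> ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Analyze a document and record each term's posting."""
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every analyzed query term."""
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = TinyIndex()
index.add(1, "Apache Solr wraps Apache Lucene")
index.add(2, "Lucene is a search library")
print(sorted(index.search("LUCENE apache")))  # prints [1]
```

Because the query goes through the same `analyze` step as the documents, the upper-case "LUCENE" still matches.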
Solr Features
– Advanced Search Method
– Language knowledge
– Scoring/Boosting
– Grouping
– Highlighting
– Nested Documents
– Real-time index updates
How It Works
– A Solr server contains one or more cores (a core is similar to a database in a DBMS)
– Each core is specified by schema.xml + …
– Fields
– Data Types
– Analyzers
Field List
Field Attributes:
Type
Indexed
Stored
Multivalued
…
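As a rough illustration of the field attributes above, the following builds the JSON body of an `add-field` command for Solr's Schema API. This assumes a managed schema; with a hand-edited schema.xml the same attributes appear on a `<field>` element instead. The field name `tags` is hypothetical, and the payload is only printed, not sent.

```python
import json


def add_field_command(name, field_type, indexed=True, stored=True, multi_valued=False):
    """Build the body of a Schema API add-field command."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": indexed,          # searchable
            "stored": stored,            # returned in results
            "multiValued": multi_valued, # may hold several values per document
        }
    }


# Hypothetical field: a list of tags per document.
payload = add_field_command("tags", "string", multi_valued=True)
print(json.dumps(payload, indent=2))
```

A field that is indexed but not stored can be searched but not displayed; the reverse holds for stored-only fields.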
Data Types
– Int, float, long, double
– Date
– String
– Text (configurable)
– Location
Communicating with Solr
– REST API
– Client libraries
  – Java
  – Ruby
  – PHP
  – C#
  – Python
  – …
– Data Import Handlers
  – Direct SQL query
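Talking to Solr over HTTP needs no client library. The sketch below prepares, but does not send, a JSON update request using only the standard library. The host, port, core name `articles`, and document fields are assumptions for illustration.

```python
import json
import urllib.request


def build_update_request(base_url, core, docs):
    """Prepare a POST that indexes a list of documents and commits them."""
    url = f"{base_url}/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical core and document.
request = build_update_request(
    "http://localhost:8983/solr",
    "articles",
    [{"id": "1", "title": "Introduction to Apache Solr"}],
)
print(request.get_method(), request.full_url)
# Sending is left out: urllib.request.urlopen(request) needs a running Solr.
```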
Query Format
– Different query parsers:
  – Standard (Lucene)
  – DisMax
  – eDisMax
  – Block Join Query Parser
  – …
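The parser is normally chosen per request with the `defType` parameter. A small sketch that builds such a request URL follows; the host, core name, and field names are assumptions, and nothing is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base_url, core, q, parser="lucene", **extra):
    """Build a /select URL that names the query parser via defType."""
    params = {"q": q, "defType": parser, "wt": "json"}
    params.update(extra)
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical core and fields; qf weights the title field above the body.
url = build_select_url(
    "http://localhost:8983/solr",
    "articles",
    "apache solr",
    parser="edismax",
    qf="title^2 body",
)
print(url)
print(parse_qs(urlsplit(url).query))
```

Here `qf` (query fields) is a DisMax/eDisMax parameter; the standard parser would instead expect field names inside the query string itself.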
Standard Query Format
– field:value
– Phrase search: field:"word list"
– Wildcard search: wor?d, word*
– Fuzzy search: roam~ matches terms such as foam or foams (maximum edit distance of 2)
– Proximity search (words within a maximum distance): "jakarta apache"~10
– Range search: [52 TO 10000] (inclusive) or {Aida TO Carmen} (exclusive)
– Boosting: jakarta^4 apache
– Boolean operators: AND (&&), OR (||), NOT (!), +, -, ( )
– Filter Query
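To make the fuzzy-search rule concrete, here is a sketch of edit-distance matching. It uses plain Levenshtein distance as an illustration; Lucene's own fuzzy matching is implemented differently and may count edits in a slightly different way, so treat this as an approximation of the idea.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(term, candidates, max_edits=2):
    """Keep candidates within max_edits of the term, like term~."""
    return [c for c in candidates if edit_distance(term, c) <= max_edits]


print(fuzzy_matches("roam", ["foam", "foams", "room", "rooster"]))
# prints ['foam', 'foams', 'room']
```

"foam" is one substitution away from "roam" and "foams" is two edits away, so both fall inside the limit, while "rooster" does not.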
Relevancy
– ∝ Term frequency
– ∝ Inverse document frequency
– Query Expansion
– …
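The two proportionalities above can be sketched as a simple TF-IDF score. The exact formula, tf × log(1 + N/df), is an assumption picked for readability; Solr's actual similarity (BM25 by default in recent versions) adds length normalization and saturation.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs):
    """Score each document as the sum of tf * idf over the query terms."""
    n = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        scores[doc_id] = sum(
            tf[t] * math.log(1 + n / df[t]) for t in query_terms if df[t]
        )
    return scores


docs = {1: "solr solr search", 2: "search engine", 3: "database engine"}
scores = tf_idf_scores(["solr", "search"], docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # prints [1, 2, 3]
```

Document 1 ranks first because it repeats the rare term "solr"; the more common term "search" contributes less to each score.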
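Weaknesses
– No transactions
– No relational-style joins (join support is limited)
– Best used as a secondary data store, not as the primary database
– Limited partial record modification (updating a document generally re-indexes it as a whole)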
Alternative
– Elasticsearch, also based on Lucene
– Oriented mostly toward analytics use cases
– More popular
– Easier to start
– Less Documented