Apache Solr Ratification




     cdevecchi@gmail.com
Solr – What is it?


• Apache Project
• Open-source search engine based on Lucene
• XML/HTTP and JSON APIs
Features


•   Lemmatization

•   Hit Highlighting

•   Dictionaries

•   Geosearch

•   Faceted Search

•   Caching

•   Index Replication and Database Integration
Characteristics


•   Java -> Tomcat / JBoss / Jetty

•   Schema

•   SolrJ client

•   JMX statistics
Query

• Highlighting
   – Enabled per query (hl=true)


• Text Analysis
   – Dictionary and thesaurus support
   – Relevance-ranked searches
   – Spelling suggestions
   – Search by similarity (“More like this”)
   – Fuzzy (Damerau-Levenshtein distance)
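
To illustrate the fuzzy-matching bullet above, here is a minimal Python sketch of the restricted Damerau-Levenshtein distance (the "optimal string alignment" variant), which counts insertions, deletions, substitutions, and swaps of adjacent characters. It is a textbook dynamic-programming version written for illustration, not Solr's or Lucene's own implementation, and the function name `osa_distance` is an assumption.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = edits needed to turn a[:i] into b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as one edit
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


print(osa_distance("solr", "sorl"))  # adjacent swap of "l" and "r"
```

A plain Levenshtein distance would score the swap in "solr" / "sorl" as two edits; counting it as one is what the Damerau extension adds.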
Query


• Querying data
   – Words
   – Words by field
   – Sorted results (sort)


• Faceted Search
   – Categories
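
As a rough sketch of "words by field" and "sort", the Python snippet below builds a select URL using Solr's standard `q`, `rows`, and `sort` request parameters. The helper name `build_select_url` and the field name `name` are assumptions made for illustration; the base URL and the `price` field are taken from the facet example elsewhere in this deck.

```python
from urllib.parse import urlencode


def build_select_url(base, query, sort=None, rows=10):
    """Build a Solr select URL (hypothetical helper, not part of SolrJ)."""
    params = {"q": query, "rows": rows}
    if sort:
        params["sort"] = sort  # e.g. "price asc" or "price desc"
    return f"{base}/select?{urlencode(params)}"


# Search the (assumed) "name" field for "video", cheapest results first
url = build_select_url(
    "http://localhost:8983/solr", "name:video", sort="price asc"
)
print(url)
```

`urlencode` percent-encodes the colon and turns the space in `price asc` into `+`, which is the form Solr expects in a query string.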
Query


• Faceted Search: can the queries become a problem?
   – Example


   http://localhost:8983/solr/select?q=video&rows=0&facet=true&facet.field=inStock&facet.query=price:[*+TO+500]&facet.query=price:[500+TO+*]&facet.prefix=xx&facet.limit=5&facet.mincount=1
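
The facet request above can also be assembled programmatically. The sketch below is one way to do it in Python, assuming nothing beyond the parameters shown on the slide; a list of pairs is used instead of a dict because `facet.query` appears twice. The variable names are illustrative.

```python
from urllib.parse import urlencode

# Parameters copied from the slide; repeated keys require a list of pairs
facet_params = [
    ("q", "video"),
    ("rows", 0),                          # return counts only, no documents
    ("facet", "true"),
    ("facet.field", "inStock"),
    ("facet.query", "price:[* TO 500]"),
    ("facet.query", "price:[500 TO *]"),
    ("facet.prefix", "xx"),
    ("facet.limit", 5),
    ("facet.mincount", 1),
]

facet_url = "http://localhost:8983/solr/select?" + urlencode(facet_params)
print(facet_url)
```

Each `facet.query` is evaluated as an additional query against the result set, so a request with many of them does more work than a single `facet.field`.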
Data indexing




• Native Solr XML
• CSV
• Database (DIH, DataImportHandler)
• Rich Documents
• Crawler
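
For the native XML option, Solr's update format wraps each document in `<add><doc>` with one `<field name="...">` element per value. The Python sketch below generates such a payload; the helper name `to_solr_add_xml` and the sample field names and values are assumptions for illustration, and sending the payload to Solr is left out.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(docs):
    """Serialize a list of dicts as a Solr <add> message (hypothetical helper)."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, <, > on output
    return ET.tostring(add, encoding="unicode")


# Sample document with made-up values
xml_payload = to_solr_add_xml([{"id": "1", "name": "video", "price": 499}])
print(xml_payload)
```

Every field name used here would have to exist in the schema (or match a dynamic field) for Solr to accept the document.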
Index




• Is the index growing larger than you expected?


• It can be tuned through:
   – Index segment size
   – Index segment merging
Collections



• Separate indexes can be created per document type
Data Replication


• Master / Slave
   – Index
   – Config files
Sharding

• ZooKeeper
  – http://hadoop.apache.org/zookeeper/
Solr Caching


• Caches the documents returned by searches
• Two implementations
   – solr.search.LRUCache (LRU = Least Recently Used, in memory)
   – solr.search.FastLRUCache (since version 1.4)
• How to use
   – filterCache
   – queryResultCache
   – documentCache (loads everything into memory)
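
To make the "least recently used" policy concrete, here is a small Python sketch of an LRU cache built on `OrderedDict`. It only illustrates the eviction rule; it is not Solr's LRUCache or FastLRUCache, and the class and method names are assumptions.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # a hit makes the entry "recent" again
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry

    def __contains__(self, key):
        return key in self._items

    def __len__(self):
        return len(self._items)


cache = LRUCache(capacity=2)
cache.put("q=video", ["doc1", "doc2"])
cache.put("q=audio", ["doc3"])
cache.get("q=video")            # touch "q=video" so it becomes most recent
cache.put("q=books", ["doc4"])  # cache is full: "q=audio" is evicted
print("q=audio" in cache, "q=video" in cache)
```

A cache sized this way trades memory for hit rate, which is why the size of each cache is something to tune per workload.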
Cluster – Carrot2



• Search Results Clustering Engine

• Search in many nodes




•   Live Demo

     – http://search.carrot2.org/stable/search
Crawling


• Apache Nutch
  – Search, parsing, and parallel or distributed indexing
  – Many formats
     • e.g. plain text, HTML, XML, ZIP, .doc, JavaScript, RSS, PDF, etc.
  – Cluster
  – MapReduce
  – Distributed Filesystem (via Hadoop)
Backup / Snapshot



• Triggered by scripts (solr-tools)

• Index snapshots

• Differential backups

   – $solr_data/yyyymmdd
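
The dated directory layout above can be sketched in a few lines of Python. This only shows how a `yyyymmdd` directory name might be derived under the data directory; the function name `snapshot_dir` and the example path are assumptions, and the actual snapshot scripts may name directories differently.

```python
from datetime import date
from pathlib import PurePosixPath


def snapshot_dir(solr_data, day):
    """Return <solr_data>/<yyyymmdd> for the given day (illustrative only)."""
    return PurePosixPath(solr_data) / day.strftime("%Y%m%d")


# Example with a made-up data directory and date
print(snapshot_dir("/var/solr/data", date(2010, 5, 17)))
```

Zero-padded `yyyymmdd` names sort chronologically as plain strings, which keeps the newest snapshot easy to find.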
Architecture (Master/Slave)
Architecture (Distributed Index)
Indexing Tests



• Indexing tests
   • XML documents 7k in size, with 111 fields


• 1.2 million docs in the index


• VM -> 2 GB RAM, 2.33 GHz processor
Indexing Tests

[Chart not recovered; values shown: 90, 44]
Search Tests
QPS

[Chart not recovered; values shown: 61, 0, 37, 5, 38, 38]
References




•   http://lucene.apache.org/solr/

•   http://wiki.apache.org/solr/

•   http://project.carrot2.org/

•   http://download.carrot2.org/head/manual/index.html#chapter.introduction

•   http://wiki.apache.org/solr/ZooKeeperIntegration

Editor's Notes

  13. Filter cache (used in three cases): 1. caches the results of “fq” parameters; 2. caches facets; 3. sorting, when the option is set to true