SlideShare a Scribd company logo
1 of 46
Lucene And Solr Introduction By Pascal Dimassimo [email_address]
About me ,[object Object]
Working for OpenText/Nstein on Semantic Navigation application
http://semanticnavigation.opentext.com/
History ,[object Object]
Solr launches in 2006
Lucid Imagination in 2009 ,[object Object]
Offer commercial support ,[object Object]
Buzz ,[object Object]
“Largely responsible for significant decline in commercial OEM revenue” Source http://lucenerevolution.com/sites/default/files/slides/Lucene%20Rev%20Preso%20IDC_MarketTrends_Reynolds.pdf
Lucene? ,[object Object]
NOT an application
Text indexing and searching
Open Source
Mature
Easy to learn API
Typical Search App Taken from Lucene In Action 2 nd  Edition Lucene
Search? ,[object Object]
O(n) -> Slow...
You want to find a word in a book: how do you do it?
Inverted Index
Inverted Index Original Slide from Michael Busch (available at  http://goo.gl/0MQvy  )
Inverted Index Original Slide from Michael Busch (available at  http://goo.gl/0MQvy  )
Lucene Document FSDirectory dir = FSDirectory. open ( new  File( "./index" )); SimpleAnalyzer analyzer =  new  SimpleAnalyzer(); MaxFieldLength len = IndexWriter.MaxFieldLength. UNLIMITED ; IndexWriter writer =  new  IndexWriter(dir, analyzer,  true , len); String content =  "The old night keeper keeps the keep in the town" ; Document doc =  new  Document(); doc.add( new  Field( "content" , content, Field.Store. YES , Field.Index. ANALYZED ));  writer.addDocument(doc); writer.commit();
Lucene Document ,[object Object]
Organized in  fields.  A field must be specified at query time!
Schema-less
Plain text
Fields ,[object Object]
Analyzed: split the content into terms to be added to the inverted index. Normalized terms.
Stored: Keep the original content on disk
Multivalued: Repeat the same field multiple times in the same document with different values
Lucene Document String content =  "The old night keeper keeps the keep in the town" ; String author =  "Peter Smith" ; String category1 =  "Fiction" ; String category2 =  "Canadian" ; String isbn =  "978-1-933988-17-7" ; String id =  "ABY123" ; Document doc =  new  Document(); doc.add( new  Field( "content" , content, Field.Store. YES , Field.Index. ANALYZED )); doc.add( new  Field( "author" , author, Field.Store. YES , Field.Index. ANALYZED )); doc.add( new  Field( "category" , category1, Field.Store. YES , Field.Index. ANALYZED )); doc.add( new  Field( "category" , category2, Field.Store. YES , Field.Index. ANALYZED )); doc.add( new  Field( "isbn" , isbn, Field.Store. YES , Field.Index. NOT_ANALYZED )); doc.add( new  Field( "id" , id, Field.Store. YES , Field.Index. NO )); writer.addDocument(doc); writer.commit();
Lucene Demo ,[object Object]
Relevancy ,[object Object]
Vectorial Model ,[object Object]
Score represents how close the vectors are
Tf-idf (term frequency–inverse document frequency)
Documents with many of the search terms are scored higher
Smaller documents are scored higher
Analyzer Taken from Lucene In Action 2 nd  Edition
Analyzer ,[object Object]
Used when indexing and querying
Tokenizer + Filters
Custom analyzers
Analyzer "The quick brown fox jumped over the lazy dog" WhitespaceAnalyzer [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog] SimpleAnalyzer [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog] StopAnalyzer [quick] [brown] [fox] [jumped] [over] [lazy] [dog] StandardAnalyzer [quick] [brown] [fox] [jumped] [over] [lazy] [dog] Example from Lucene In Action 2 nd  Edition

More Related Content

What's hot

Java 8 Streams and Rx Java Comparison
Java 8 Streams and Rx Java ComparisonJava 8 Streams and Rx Java Comparison
Java 8 Streams and Rx Java ComparisonJosé Paumard
 
GraphQL & Relay - 串起前後端世界的橋樑
GraphQL & Relay - 串起前後端世界的橋樑GraphQL & Relay - 串起前後端世界的橋樑
GraphQL & Relay - 串起前後端世界的橋樑Pokai Chang
 
Implementing Ajax In ColdFusion 7
Implementing Ajax In ColdFusion 7Implementing Ajax In ColdFusion 7
Implementing Ajax In ColdFusion 7Pranav Prakash
 
jQuery : Talk to server with Ajax
jQuery : Talk to server with AjaxjQuery : Talk to server with Ajax
jQuery : Talk to server with AjaxWildan Maulana
 
How To Webinar - Sumo Logic API
How To Webinar - Sumo Logic APIHow To Webinar - Sumo Logic API
How To Webinar - Sumo Logic APISumo Logic
 
course slides -- powerpoint
course slides -- powerpointcourse slides -- powerpoint
course slides -- powerpointwebhostingguy
 
Java SE 8 for Java EE developers
Java SE 8 for Java EE developersJava SE 8 for Java EE developers
Java SE 8 for Java EE developersJosé Paumard
 
Go OO! - Real-life Design Patterns in PHP 5
Go OO! - Real-life Design Patterns in PHP 5Go OO! - Real-life Design Patterns in PHP 5
Go OO! - Real-life Design Patterns in PHP 5Stephan Schmidt
 
Creating APIs over RDF
Creating APIs over RDFCreating APIs over RDF
Creating APIs over RDFLeigh Dodds
 
XML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEARXML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEARStephan Schmidt
 
The Django Book / Chapter 3: Views and URLconfs
The Django Book / Chapter 3: Views and URLconfsThe Django Book / Chapter 3: Views and URLconfs
The Django Book / Chapter 3: Views and URLconfsVincent Chien
 
Lecture 3 - Comm Lab: Web @ ITP
Lecture 3 - Comm Lab: Web @ ITP Lecture 3 - Comm Lab: Web @ ITP
Lecture 3 - Comm Lab: Web @ ITP yucefmerhi
 
1-04: HTML Elements
1-04: HTML Elements1-04: HTML Elements
1-04: HTML Elementsapnwebdev
 
Introduction to Perl - Day 2
Introduction to Perl - Day 2Introduction to Perl - Day 2
Introduction to Perl - Day 2Dave Cross
 
Building Automated REST APIs with Python
Building Automated REST APIs with PythonBuilding Automated REST APIs with Python
Building Automated REST APIs with PythonJeff Knupp
 
Build JSON and XML using RABL gem
Build JSON and XML using RABL gemBuild JSON and XML using RABL gem
Build JSON and XML using RABL gemNascenia IT
 

What's hot (18)

Java 8 Streams and Rx Java Comparison
Java 8 Streams and Rx Java ComparisonJava 8 Streams and Rx Java Comparison
Java 8 Streams and Rx Java Comparison
 
GraphQL & Relay - 串起前後端世界的橋樑
GraphQL & Relay - 串起前後端世界的橋樑GraphQL & Relay - 串起前後端世界的橋樑
GraphQL & Relay - 串起前後端世界的橋樑
 
Implementing Ajax In ColdFusion 7
Implementing Ajax In ColdFusion 7Implementing Ajax In ColdFusion 7
Implementing Ajax In ColdFusion 7
 
jQuery : Talk to server with Ajax
jQuery : Talk to server with AjaxjQuery : Talk to server with Ajax
jQuery : Talk to server with Ajax
 
How To Webinar - Sumo Logic API
How To Webinar - Sumo Logic APIHow To Webinar - Sumo Logic API
How To Webinar - Sumo Logic API
 
Free your lambdas
Free your lambdasFree your lambdas
Free your lambdas
 
course slides -- powerpoint
course slides -- powerpointcourse slides -- powerpoint
course slides -- powerpoint
 
Java SE 8 for Java EE developers
Java SE 8 for Java EE developersJava SE 8 for Java EE developers
Java SE 8 for Java EE developers
 
Go OO! - Real-life Design Patterns in PHP 5
Go OO! - Real-life Design Patterns in PHP 5Go OO! - Real-life Design Patterns in PHP 5
Go OO! - Real-life Design Patterns in PHP 5
 
Creating APIs over RDF
Creating APIs over RDFCreating APIs over RDF
Creating APIs over RDF
 
XML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEARXML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEAR
 
The Django Book / Chapter 3: Views and URLconfs
The Django Book / Chapter 3: Views and URLconfsThe Django Book / Chapter 3: Views and URLconfs
The Django Book / Chapter 3: Views and URLconfs
 
Linq
LinqLinq
Linq
 
Lecture 3 - Comm Lab: Web @ ITP
Lecture 3 - Comm Lab: Web @ ITP Lecture 3 - Comm Lab: Web @ ITP
Lecture 3 - Comm Lab: Web @ ITP
 
1-04: HTML Elements
1-04: HTML Elements1-04: HTML Elements
1-04: HTML Elements
 
Introduction to Perl - Day 2
Introduction to Perl - Day 2Introduction to Perl - Day 2
Introduction to Perl - Day 2
 
Building Automated REST APIs with Python
Building Automated REST APIs with PythonBuilding Automated REST APIs with Python
Building Automated REST APIs with Python
 
Build JSON and XML using RABL gem
Build JSON and XML using RABL gemBuild JSON and XML using RABL gem
Build JSON and XML using RABL gem
 

Viewers also liked

Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialeckilucenerevolution
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1YI-CHING WU
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Luceneotisg
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadooplucenerevolution
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBertrand Delacretaz
 
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer SimonDocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simonlucenerevolution
 
Lucandra
LucandraLucandra
Lucandraotisg
 
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...Lucidworks
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMLucidworks
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Adrien Grand
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisJosiane Gamgo
 
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartNear Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartLucidworks
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introductionotisg
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and SolrTommaso Teofili
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesRahul Jain
 

Viewers also liked (20)

Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialecki
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Lucene
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoop
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
 
Lucene
LuceneLucene
Lucene
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer SimonDocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
 
Lucandra
LucandraLucandra
Lucandra
 
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
 
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartNear Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and Solr
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
Lucene basics
Lucene basicsLucene basics
Lucene basics
 

Similar to Lucene And Solr Intro

Introduction to Search Engines
Introduction to Search EnginesIntroduction to Search Engines
Introduction to Search EnginesNitin Pande
 
An Introduction to Solr
An Introduction to SolrAn Introduction to Solr
An Introduction to Solrtomhill
 
Douglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation JsonsagaDouglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation JsonsagaAjax Experience 2009
 
FluentSelenium Presentation Code Camp09
FluentSelenium Presentation Code Camp09FluentSelenium Presentation Code Camp09
FluentSelenium Presentation Code Camp09Pyxis Technologies
 
Dev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialDev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialSourcesense
 
Jsonsaga
JsonsagaJsonsaga
Jsonsaganohmad
 
Letting In the Light: Using Solr as an External Search Component
Letting In the Light: Using Solr as an External Search ComponentLetting In the Light: Using Solr as an External Search Component
Letting In the Light: Using Solr as an External Search ComponentJay Luker
 
Advanced Perl Techniques
Advanced Perl TechniquesAdvanced Perl Techniques
Advanced Perl TechniquesDave Cross
 
Don't Be Afraid of Abstract Syntax Trees
Don't Be Afraid of Abstract Syntax TreesDon't Be Afraid of Abstract Syntax Trees
Don't Be Afraid of Abstract Syntax TreesJamund Ferguson
 
The JSON Saga
The JSON SagaThe JSON Saga
The JSON Sagakaven yan
 
Lucene Bootcamp -1
Lucene Bootcamp -1 Lucene Bootcamp -1
Lucene Bootcamp -1 GokulD
 
jQuery Presentation - Refresh Events
jQuery Presentation - Refresh EventsjQuery Presentation - Refresh Events
jQuery Presentation - Refresh EventsEugene Andruszczenko
 
Eugene Andruszczenko: jQuery
Eugene Andruszczenko: jQueryEugene Andruszczenko: jQuery
Eugene Andruszczenko: jQueryRefresh Events
 
Cool bonsai cool - an introduction to ElasticSearch
Cool bonsai cool - an introduction to ElasticSearchCool bonsai cool - an introduction to ElasticSearch
Cool bonsai cool - an introduction to ElasticSearchclintongormley
 
Wso2 Scenarios Esb Webinar July 1st
Wso2 Scenarios Esb Webinar July 1stWso2 Scenarios Esb Webinar July 1st
Wso2 Scenarios Esb Webinar July 1stWSO2
 
Spring has got me under it’s SpEL
Spring has got me under it’s SpELSpring has got me under it’s SpEL
Spring has got me under it’s SpELEldad Dor
 

Similar to Lucene And Solr Intro (20)

Introduction to Search Engines
Introduction to Search EnginesIntroduction to Search Engines
Introduction to Search Engines
 
Javascript2839
Javascript2839Javascript2839
Javascript2839
 
Json
JsonJson
Json
 
An Introduction to Solr
An Introduction to SolrAn Introduction to Solr
An Introduction to Solr
 
Douglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation JsonsagaDouglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation Jsonsaga
 
Processing XML with Java
Processing XML with JavaProcessing XML with Java
Processing XML with Java
 
FluentSelenium Presentation Code Camp09
FluentSelenium Presentation Code Camp09FluentSelenium Presentation Code Camp09
FluentSelenium Presentation Code Camp09
 
Dev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialDev8d Apache Solr Tutorial
Dev8d Apache Solr Tutorial
 
Jsonsaga
JsonsagaJsonsaga
Jsonsaga
 
Letting In the Light: Using Solr as an External Search Component
Letting In the Light: Using Solr as an External Search ComponentLetting In the Light: Using Solr as an External Search Component
Letting In the Light: Using Solr as an External Search Component
 
Advanced Perl Techniques
Advanced Perl TechniquesAdvanced Perl Techniques
Advanced Perl Techniques
 
Don't Be Afraid of Abstract Syntax Trees
Don't Be Afraid of Abstract Syntax TreesDon't Be Afraid of Abstract Syntax Trees
Don't Be Afraid of Abstract Syntax Trees
 
The JSON Saga
The JSON SagaThe JSON Saga
The JSON Saga
 
Lucene Bootcamp -1
Lucene Bootcamp -1 Lucene Bootcamp -1
Lucene Bootcamp -1
 
jQuery Presentation - Refresh Events
jQuery Presentation - Refresh EventsjQuery Presentation - Refresh Events
jQuery Presentation - Refresh Events
 
Eugene Andruszczenko: jQuery
Eugene Andruszczenko: jQueryEugene Andruszczenko: jQuery
Eugene Andruszczenko: jQuery
 
Cool bonsai cool - an introduction to ElasticSearch
Cool bonsai cool - an introduction to ElasticSearchCool bonsai cool - an introduction to ElasticSearch
Cool bonsai cool - an introduction to ElasticSearch
 
Solr Presentation
Solr PresentationSolr Presentation
Solr Presentation
 
Wso2 Scenarios Esb Webinar July 1st
Wso2 Scenarios Esb Webinar July 1stWso2 Scenarios Esb Webinar July 1st
Wso2 Scenarios Esb Webinar July 1st
 
Spring has got me under it’s SpEL
Spring has got me under it’s SpELSpring has got me under it’s SpEL
Spring has got me under it’s SpEL
 

Recently uploaded

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 

Recently uploaded (20)

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 

Lucene And Solr Intro

  • 1. Lucene And Solr Introduction By Pascal Dimassimo [email_address]
  • 2.
  • 3. Working for OpenText/Nstein on Semantic Navigation application
  • 5.
  • 7.
  • 8.
  • 9.
  • 10. “Largely responsible for significant decline in commercial OEM revenue” Source http://lucenerevolution.com/sites/default/files/slides/Lucene%20Rev%20Preso%20IDC_MarketTrends_Reynolds.pdf
  • 11.
  • 13. Text indexing and searching
  • 17. Typical Search App Taken from Lucene In Action 2 nd Edition Lucene
  • 18.
  • 20. You want to find a word in a book: how do you do it?
  • 22. Inverted Index Original Slide from Michael Busch (available at http://goo.gl/0MQvy )
  • 23. Inverted Index Original Slide from Michael Busch (available at http://goo.gl/0MQvy )
  • 24. Lucene Document FSDirectory dir = FSDirectory. open ( new File( "./index" )); SimpleAnalyzer analyzer = new SimpleAnalyzer(); MaxFieldLength len = IndexWriter.MaxFieldLength. UNLIMITED ; IndexWriter writer = new IndexWriter(dir, analyzer, true , len); String content = "The old night keeper keeps the keep in the town" ; Document doc = new Document(); doc.add( new Field( "content" , content, Field.Store. YES , Field.Index. ANALYZED )); writer.addDocument(doc); writer.commit();
  • 25.
  • 26. Organized in fields. A field must be specified at query time!
  • 29.
  • 30. Analyzed: split the content into terms to be added to the inverted index. Normalized terms.
  • 31. Stored: Keep the original content on disk
  • 32. Multivalued: Repeat the same field multiple times in the same document with different values
  • 33. Lucene Document String content = "The old night keeper keeps the keep in the town" ; String author = "Peter Smith" ; String category1 = "Fiction" ; String category2 = "Canadian" ; String isbn = "978-1-933988-17-7" ; String id = "ABY123" ; Document doc = new Document(); doc.add( new Field( "content" , content, Field.Store. YES , Field.Index. ANALYZED )); doc.add( new Field( "author" , author, Field.Store. YES , Field.Index. ANALYZED )); doc.add( new Field( "category" , category1, Field.Store. YES , Field.Index. ANALYZED )); doc.add( new Field( "category" , category2, Field.Store. YES , Field.Index. ANALYZED )); doc.add( new Field( "isbn" , isbn, Field.Store. YES , Field.Index. NOT_ANALYZED )); doc.add( new Field( "id" , id, Field.Store. YES , Field.Index. NO )); writer.addDocument(doc); writer.commit();
  • 34.
  • 35.
  • 36.
  • 37. Score represents how close the vectors are
  • 38. Tf-idf (term frequency–inverse document frequency)
  • 39. Documents with many of the search terms are scored higher
  • 40. Smaller documents are scored higher
  • 41. Analyzer Taken from Lucene In Action 2 nd Edition
  • 42.
  • 43. Used when indexing and querying
  • 46. Analyzer "The quick brown fox jumped over the lazy dog" WhitespaceAnalyzer [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog] SimpleAnalyzer [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog] StopAnalyzer [quick] [brown] [fox] [jumped] [over] [lazy] [dog] StandardAnalyzer [quick] [brown] [fox] [jumped] [over] [lazy] [dog] Example from Lucene In Action 2 nd Edition
  • 47. Analyzer "XY&Z Corporation - xyz@example.com" WhitespaceAnalyzer [XY&Z] [Corporation] [-] [xyz@example.com] SimpleAnalyzer [xy] [z] [corporation] [xyz] [example] [com] StopAnalyzer [xy] [z] [corporation] [xyz] [example] [com] StandardAnalyzer [xy&z] [corporation] [xyz@example.com] Example from Lucene In Action 2 nd Edition
  • 48. Custom Analyzers WhitespaceTokenizer Tokenize at white spaces KeywordTokenizer Tokenize input as a single token StandardTokenizer Tokenize at white spaces but keeping high-level entity as token (email, etc TODO) LowerCaseFilter Lowercases token text StopFilter Removes words that exist in a provided set of words PorterStemFilter Stems each token using the Porter stemming algorithm. For example, country and countries both stem to countri . Some descriptions from Lucene In Action 2 nd Edition
  • 49.
  • 50. Lucene applied an Analyzer to each word queried
  • 51. Query can be programmatically build
  • 53. Query code SimpleAnalyzer analyzer = new SimpleAnalyzer(); QueryParser parser = new QueryParser(Version. LUCENE_30 , "content" , analyzer); Query query = parser.parse( "big" ); TopDocs docs = searcher.search(query, 10);
  • 54. Query Syntax: Basic title:montreal text field
  • 55. Query Syntax: Range name:[a TO k] range field
  • 56. Query Syntax: Boolean title:(java AND programming) operator field
  • 57. Query Syntax: Boolean title:java OR name:pascal operator field field
  • 58. Query Syntax: Phrase title:”Lucene in Action” phrase field
  • 59. Query Syntax: Wildcard title:program* Term prefix field
  • 60.
  • 61.
  • 64.
  • 65. HTTP application built around Lucene
  • 66. Makes it easy to develop search solutions
  • 67. Advanced features develop on top of Lucene
  • 68. As of 2010, Lucene and Solr are merged
  • 69.
  • 70. Each index has its own schema
  • 71. Lists all fields allowed for an index
  • 72. Defines the analyzers for each field
  • 73. Solr Schema < field name = &quot;id&quot; type = &quot;string&quot; indexed = &quot;true&quot; stored = &quot;true&quot; required = &quot;true&quot; /> < field name = &quot;title&quot; type = &quot;text&quot; indexed = &quot;true&quot; stored = &quot;true&quot; /> < field name = &quot;presenter&quot; type = &quot;text_ws&quot; indexed = &quot;true&quot; stored = &quot;true&quot; /> < field name = &quot;date&quot; type = &quot;date&quot; indexed = &quot;true&quot; stored = &quot;true&quot; /> < field name = &quot;abstract&quot; type = &quot;text&quot; indexed = &quot;true&quot; stored = &quot;true&quot; />
  • 74. Solr Schema < fieldType name = &quot;text&quot; class = &quot;solr.TextField&quot; positionIncrementGap = &quot;100&quot; > < analyzer type = &quot;index&quot; > < tokenizer class = &quot;solr.WhitespaceTokenizerFactory&quot; /> < filter class = &quot;solr.StopFilterFactory&quot; ignoreCase = &quot;true&quot; words = &quot;stopwords.txt&quot; /> < filter class = &quot;solr.LowerCaseFilterFactory&quot; /> < filter class = &quot;solr.ISOLatin1AccentFilterFactory&quot; /> < filter class = &quot;solr.SnowballPorterFilterFactory&quot; language = &quot;English&quot; protected = &quot;protwords.txt&quot; /> </ analyzer > < analyzer type = &quot;query&quot; > < tokenizer class = &quot;solr.WhitespaceTokenizerFactory&quot; /> < filter class = &quot;solr.StopFilterFactory&quot; ignoreCase = &quot;true&quot; words = &quot;stopwords.txt&quot; /> < filter class = &quot;solr.LowerCaseFilterFactory&quot; /> < filter class = &quot;solr.ISOLatin1AccentFilterFactory&quot; /> < filter class = &quot;solr.SnowballPorterFilterFactory&quot; language = &quot;English&quot; protected = &quot;protwords.txt&quot; /> </ analyzer > </ fieldType >
  • 75. Solr Schema < fieldType name = &quot;text_ws&quot; class = &quot;solr.TextField&quot; positionIncrementGap = &quot;100&quot; > < analyzer type = &quot;index&quot; > < tokenizer class = &quot;solr.WhitespaceTokenizerFactory&quot; /> < filter class = &quot;solr.LowerCaseFilterFactory&quot; /> </ analyzer > < analyzer type = &quot;query&quot; > < tokenizer class = &quot;solr.WhitespaceTokenizerFactory&quot; /> < filter class = &quot;solr.LowerCaseFilterFactory&quot; /> </ analyzer > </ fieldType >
  • 76.
  • 77. XML by default, but also CSV
  • 79. Advanced features: binary document extraction, DB plugin
  • 80. Solr Indexation < add > < doc > < field name = &quot;id&quot; > 002 </ field > < field name = &quot;title&quot; > Lucene And Solr Introduction </ field > < field name = &quot;presenter&quot; > Pascal Dimassimo </ field > < field name = &quot;date&quot; > 2010-11-18T00:00:00Z </ field > < field name = &quot;abstract&quot; > ... </ field > </ doc > <doc>...</doc> </ add > curl http://localhost:8983/solr/update -H &quot;Content-Type: text/xml&quot; --data-binary @add.xml
  • 81.
  • 83. Response in XML by default, but other formats are supported (json, php, ruby)
  • 84. Solr Query curl http://localhost:8983/solr/select?q=title:Lucene < response > < lst name = &quot;responseHeader&quot; > < int name = &quot;status&quot; > 0 </ int > < int name = &quot;QTime&quot; > 269 </ int > < lst name = &quot;params&quot; > < str name = &quot;q&quot; > title:Lucene </ str > </ lst > </ lst > < result name = &quot;response&quot; numFound = &quot;1&quot; start = &quot;0&quot; > < doc > < str name = &quot;id&quot; > 002 </ str > < str name = &quot;title&quot; > Lucene And Solr Introduction </ str > < str name = &quot;presenter&quot; > Pascal Dimassimo </ str > < date name = &quot;date&quot; > 2010-11-18T00:00:00Z </ date > < str name = &quot;abstract&quot; > ... </ str > </ doc > </ result > </ response >
  • 85. Solr Query Parameters q Lucene Query sort Field to sort on. Defaut to score start Offset for the results page to display. Default 0 rows Numbers of results to display per page. Default 10 fq Filter Query. Default to all documents fl List of fields to display per document. Default to all fields wt Format to display result. Default to xml
  • 86.
  • 87. Useful for drilling down in results set
  • 88.
  • 89.
  • 92.

Editor's Notes

  1. Do one thing well Apache Licence 10 years Version 3.0 It is fast!
  2. Analyze documents: split each words Get documents in. Lucene returns a list of documents as search result.
  3. Exemple livre: on recherche du début à chaque fois qu&apos;on recherche un mot Beacoup plus simple d&apos;utiliser un index Inverted index: for a word, list documents that contains it
  4. Analyse: transformer le contenu en termes Un terme pourrait être plus d&apos;un mot: “New York” Position is also stored Binary Search: O(log n) -&gt; logarithmic Boolean Search Wildcard Search
  5. Lucene generates a id for each document Stored = Original content stored “as is” on disk. Can be returned to the user when document is returned When Lucene returns document, it returns id. You can retrieve stored content with the id
  6. Document: email, article, usager Email fields: expéditeur, destinataire, titre, contenu, attachement Article fields: auteur, titre, catégorie, contenu, date de publication Analogie BD: document = rangée, field = colonne On peut stocker des documents avec des champs différents.
  7. Lucene generates a id for each document Stored = Original content stored “as is” on disk. Can be returned to the user when document is returned When Lucene returns document, it returns id. You can retrieve stored content with the id
  8. Lucene can returns results sorted by a field
  9. Terms almost synonym of words
  10. Basic Query instance: TermQuery Use PerFieldAnalyzerWrapper to specify the specific analyzer for each field
  11. Terms stored in alphabetical order. Using String.compareTo. Returns all docs for each terms in range
  12. Supports AND, OR, NOT Supports +, -
  13. Supports AND, OR, NOT Supports +, -
  14. CNET l&apos;a utilisé pour permettre aux utilisateurs de mieux retrouver les produits