Lucene And Solr Introduction By Pascal Dimassimo [email_address]
About me
Working for OpenText/Nstein on Semantic Navigation application
http://semanticnavigation.opentext.com/
History
Solr launches in 2006
Lucid Imagination in 2009
Offers commercial support
Buzz
“Largely responsible for significant decline in commercial OEM revenue” Source http://lucenerevolution.com/sites/default/files/slides/Lucene%20Rev%20Preso%20IDC_MarketTrends_Reynolds.pdf
Lucene?
NOT an application
Text indexing and searching
Open Source
Mature
Easy to learn API
Typical Search App (diagram taken from Lucene in Action, 2nd Edition)
Search?
O(n) -> Slow...
You want to find a word in a book: how do you do it?
Inverted Index (original slides from Michael Busch, available at http://goo.gl/0MQvy)
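To make the inverted index idea concrete, here is a minimal sketch in plain Java, independent of Lucene. The class name `InvertedIndexSketch`, its methods, and the two extra sample sentences are assumptions made for illustration only; a real Lucene index also records positions, frequencies, and more.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Lucene.
class InvertedIndexSketch {
    // term -> ids of the documents that contain it, in insertion order
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Lowercase the text, split on non-letters, and record each term once per document.
    // Assumes each document is added exactly once.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
    }

    // A lookup is a single map access instead of a scan over every document.
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), List.of());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "The old night keeper keeps the keep in the town");
        index.add(1, "The big old house");   // made-up sample document
        index.add(2, "The town had a keep"); // made-up sample document
        System.out.println(index.search("keep")); // prints [0, 2]
    }
}
```

The point of the design is that the cost of a search depends on the number of distinct terms rather than on the total amount of text, which is what avoids the O(n) scan.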
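Lucene Document
    import java.io.File;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriter.MaxFieldLength;
    import org.apache.lucene.store.FSDirectory;

    // Open (and create) an index in ./index
    FSDirectory dir = FSDirectory.open(new File("./index"));
    SimpleAnalyzer analyzer = new SimpleAnalyzer();
    MaxFieldLength len = IndexWriter.MaxFieldLength.UNLIMITED;
    IndexWriter writer = new IndexWriter(dir, analyzer, true, len);
    String content = "The old night keeper keeps the keep in the town";
    // A document with a single stored, analyzed field
    Document doc = new Document();
    doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.commit();
Lucene Document
Organized in fields. A field must be specified at query time!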
Schema-less
Plain text
Fields
Analyzed: the content is split into normalized terms, which are added to the inverted index.
Stored: Keep the original content on disk
Multivalued: Repeat the same field multiple times in the same document with different values
Lucene Document
    String content = "The old night keeper keeps the keep in the town";
    String author = "Peter Smith";
    String category1 = "Fiction";
    String category2 = "Canadian";
    String isbn = "978-1-933988-17-7";
    String id = "ABY123";
    Document doc = new Document();
    // Stored and analyzed: searchable by individual terms
    doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("author", author, Field.Store.YES, Field.Index.ANALYZED));
    // Multivalued: the "category" field is added twice with different values
    doc.add(new Field("category", category1, Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("category", category2, Field.Store.YES, Field.Index.ANALYZED));
    // Indexed as a single term, without analysis
    doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.NOT_ANALYZED));
    // Stored only: retrievable but not searchable
    doc.add(new Field("id", id, Field.Store.YES, Field.Index.NO));
    writer.addDocument(doc);
    writer.commit();
Lucene Demo
Relevancy
Vector Space Model
Score represents how close the vectors are
Tf-idf (term frequency–inverse document frequency)
Documents with many of the search terms are scored higher
Smaller documents are scored higher
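The two scoring tendencies above can be illustrated with a small tf-idf sketch in plain Java. This is a simplified formula chosen for illustration (square root of term frequency, a logarithmic idf, and division by the square root of the document length); it is an assumption, not Lucene's actual scoring code, and `TfIdfSketch` and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not Lucene's Similarity implementation.
class TfIdfSketch {
    // Naive tokenizer: lowercase, then split on white space.
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Number of documents in the corpus that contain the term.
    static int docFreq(String term, List<List<String>> corpus) {
        int count = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // Sum of sqrt(tf) * idf over the query terms, divided by sqrt(document length).
    static double score(List<String> query, List<String> doc, List<List<String>> corpus) {
        double sum = 0.0;
        for (String term : query) {
            long tf = doc.stream().filter(term::equals).count();
            if (tf == 0) {
                continue;
            }
            double idf = 1.0 + Math.log((double) corpus.size() / (docFreq(term, corpus) + 1));
            sum += Math.sqrt(tf) * idf;
        }
        return sum / Math.sqrt(doc.size());
    }

    public static void main(String[] args) {
        // Made-up sample corpus
        List<List<String>> corpus = List.of(
                tokens("old town"),
                tokens("old town hall near the river bank"),
                tokens("old river"));
        List<String> query = tokens("old town");
        for (List<String> doc : corpus) {
            System.out.println(doc + " -> " + score(query, doc, corpus));
        }
    }
}
```

In this sample, the first document matches both query terms and is short, so it outranks both the longer document with the same matches and the equally short document that matches only one term.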
Analyzer (diagram taken from Lucene in Action, 2nd Edition)
Analyzer
Used when indexing and querying
Tokenizer + Filters
Custom analyzers
Analyzer (example from Lucene in Action, 2nd Edition)
"The quick brown fox jumped over the lazy dog"
WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
SimpleAnalyzer: [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
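The "Tokenizer + Filters" structure can be sketched in plain Java as a tokenizer followed by a chain of filters. This does not use Lucene's Analyzer API; `AnalyzerChainSketch`, its method names, and the small stop-word set are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration class; not Lucene's Analyzer API.
class AnalyzerChainSketch {
    // Small, made-up stop-word set
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "in", "of", "the");

    // Tokenizer: split the input at white space.
    static List<String> whitespaceTokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Filter 1: lowercase every token.
    static List<String> lowerCase(List<String> tokens) {
        return tokens.stream()
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    // Filter 2: drop tokens found in the stop-word set.
    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    // The "analyzer" is the tokenizer followed by the filters, in order.
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(whitespaceTokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The quick brown fox jumped over the lazy dog"));
        // prints [quick, brown, fox, jumped, over, lazy, dog]
    }
}
```

The order of the filters matters: lowercasing must run before stop-word removal, otherwise the capitalized "The" would not match the stop-word set. Running the same chain on query text as on indexed text is what lets query terms match indexed terms.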

Lucene And Solr Intro

  • 1. Lucene And Solr Introduction By Pascal Dimassimo [email_address]
  • 2.
  • 3. Working for OpenText/Nstein on Semantic Navigation application
  • 5.
  • 7.
  • 8.
  • 9.
  • 10. “Largely responsible for significant decline in commercial OEM revenue” Source http://lucenerevolution.com/sites/default/files/slides/Lucene%20Rev%20Preso%20IDC_MarketTrends_Reynolds.pdf
  • 11.
  • 13. Text indexing and searching
  • 17. Typical Search App Taken from Lucene In Action 2 nd Edition Lucene
  • 18.
  • 20. You want to find a word in a book: how do you do it?
  • 22. Inverted Index Original Slide from Michael Busch (available at http://goo.gl/0MQvy )
  • 23. Inverted Index Original Slide from Michael Busch (available at http://goo.gl/0MQvy )
  • 24. Lucene Document FSDirectory dir = FSDirectory. open ( new File( "./index" )); SimpleAnalyzer analyzer = new SimpleAnalyzer(); MaxFieldLength len = IndexWriter.MaxFieldLength. UNLIMITED ; IndexWriter writer = new IndexWriter(dir, analyzer, true , len); String content = "The old night keeper keeps the keep in the town" ; Document doc = new Document(); doc.add( new Field( "content" , content, Field.Store. YES , Field.Index. ANALYZED )); writer.addDocument(doc); writer.commit();
  • 25.
  • 26. Organized in fields. A field must be specified at query time!
  • 29.
  • 30. Analyzed: split the content into terms to be added to the inverted index. Normalized terms.
  • 31. Stored: Keep the original content on disk
  • 32. Multivalued: Repeat the same field multiple times in the same document with different values
  • 33. Lucene Document String content = "The old night keeper keeps the keep in the town" ; String author = "Peter Smith" ; String category1 = "Fiction" ; String category2 = "Canadian" ; String isbn = "978-1-933988-17-7" ; String id = "ABY123" ; Document doc = new Document(); doc.add( new Field( "content" , content, Field.Store. YES , Field.Index. ANALYZED )); doc.add( new Field( "author" , author, Field.Store. YES , Field.Index. ANALYZED )); doc.add( new Field( "category" , category1, Field.Store. YES , Field.Index. ANALYZED )); doc.add( new Field( "category" , category2, Field.Store. YES , Field.Index. ANALYZED )); doc.add( new Field( "isbn" , isbn, Field.Store. YES , Field.Index. NOT_ANALYZED )); doc.add( new Field( "id" , id, Field.Store. YES , Field.Index. NO )); writer.addDocument(doc); writer.commit();
  • 37. Score represents how close the vectors are
  • 38. Tf-idf (term frequency–inverse document frequency)
  • 39. Documents with many of the search terms are scored higher
  • 40. Smaller documents are scored higher
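The scoring ideas above can be made concrete with a deliberately simplified tf-idf function. This is a sketch under stated assumptions, not Lucene's exact scoring formula: the class `TfIdfSketch`, the damping with a square root, the form of the idf term, and the length normalization are choices made for this example. It only aims to show the two stated effects: more occurrences of the search terms raise the score, and longer documents are penalized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified tf-idf scoring. Hypothetical formula chosen for illustration only.
class TfIdfSketch {
    // Minimal "analysis": lowercase and split on non-word characters.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Scores one document against a query, using the corpus for document frequencies.
    static double score(String query, String doc, List<String> corpus) {
        List<String> docTerms = tokenize(doc);
        if (docTerms.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (String term : tokenize(query)) {
            // tf: how often the term occurs in this document.
            long tf = docTerms.stream().filter(term::equals).count();
            // df: how many documents of the corpus contain the term.
            long df = corpus.stream().filter(d -> tokenize(d).contains(term)).count();
            // Rare terms weigh more than common ones.
            double idf = 1.0 + Math.log((double) corpus.size() / (df + 1));
            sum += Math.sqrt(tf) * idf;
        }
        // Length normalization: shorter documents score higher.
        return sum / Math.sqrt(docTerms.size());
    }

    public static void main(String[] args) {
        List<String> corpus = List.of(
                "lucene search library",
                "lucene lucene search",
                "a very long document that mentions lucene only once among many other words");
        for (String doc : corpus) {
            System.out.println(score("lucene", doc, corpus) + "  " + doc);
        }
    }
}
```

With this made-up corpus, the document that repeats the term ranks first and the long document ranks last, even though all three contain the query term.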
  • 41. Analyzer (taken from Lucene in Action, 2nd Edition)
  • 43. Used when indexing and querying
  • 46. Analyzer: "The quick brown fox jumped over the lazy dog"
        WhitespaceAnalyzer  [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
        SimpleAnalyzer      [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
        StopAnalyzer        [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
        StandardAnalyzer    [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
        (Example from Lucene in Action, 2nd Edition)
  • 47. Analyzer: "XY&Z Corporation - xyz@example.com"
        WhitespaceAnalyzer  [XY&Z] [Corporation] [-] [xyz@example.com]
        SimpleAnalyzer      [xy] [z] [corporation] [xyz] [example] [com]
        StopAnalyzer        [xy] [z] [corporation] [xyz] [example] [com]
        StandardAnalyzer    [xy&z] [corporation] [xyz@example.com]
        (Example from Lucene in Action, 2nd Edition)
  • 48. Custom Analyzers
        WhitespaceTokenizer  Tokenizes at white spaces
        KeywordTokenizer     Tokenizes the whole input as a single token
        StandardTokenizer    Tokenizes at white spaces but keeps high-level entities (email addresses, etc.) as single tokens
        LowerCaseFilter      Lowercases token text
        StopFilter           Removes words that exist in a provided set of words
        PorterStemFilter     Stems each token using the Porter stemming algorithm. For example, "country" and "countries" both stem to "countri".
        (Some descriptions from Lucene in Action, 2nd Edition)
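An analyzer is essentially a tokenizer followed by a chain of filters. The following plain-Java sketch mimics a whitespace tokenizer, a lowercase filter, and a stop filter to show how the pieces compose. It does not use Lucene's classes; `AnalyzerChainSketch`, its method names, and the tiny stop-word set are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a tokenizer + filter chain; not Lucene's Analyzer API.
class AnalyzerChainSketch {
    // Tokenizer step: split at runs of white space, keep tokens untouched.
    static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Filter step: lowercase every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Filter step: drop tokens found in the provided stop-word set.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // The full chain: tokenize, then lowercase, then remove stop words.
    static List<String> analyze(String text, Set<String> stopWords) {
        return removeStopWords(lowerCase(whitespaceTokenize(text)), stopWords);
    }

    public static void main(String[] args) {
        Set<String> stopWords = Set.of("the", "a", "an"); // made-up stop list
        System.out.println(analyze("The quick brown fox jumped over the lazy dog", stopWords));
    }
}
```

Because the analyzer is used both when indexing and when querying, the same chain should process the query text; otherwise a query for "The" would never match the indexed term "the".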
  • 50. Lucene applies an Analyzer to each queried word
  • 51. Queries can be built programmatically
  • 53. Query code
        SimpleAnalyzer analyzer = new SimpleAnalyzer();
        QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
        Query query = parser.parse("big");
        TopDocs docs = searcher.search(query, 10);
  • 54. Query Syntax: Basic: title:montreal (field: title; text: montreal)
  • 55. Query Syntax: Range: name:[a TO k] (field: name; range: a TO k)
  • 56. Query Syntax: Boolean: title:(java AND programming) (field: title; operator: AND)
  • 57. Query Syntax: Boolean: title:java OR name:pascal (fields: title, name; operator: OR)
  • 58. Query Syntax: Phrase: title:"Lucene in Action" (field: title; phrase: "Lucene in Action")
  • 59. Query Syntax: Wildcard: title:program* (field: title; term prefix: program)
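Range and prefix queries are cheap when the terms are kept in sorted order, because the matching terms form one contiguous slice of the term list. The sketch below illustrates this with a `TreeMap`, which orders keys with `String.compareTo`. It is not Lucene code: `SortedTermsSketch`, its methods, and the sample terms are hypothetical.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sorted term dictionary: term -> ids of documents containing it.
class SortedTermsSketch {
    private final TreeMap<String, TreeSet<Integer>> terms = new TreeMap<>();

    void add(String term, int docId) {
        terms.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }

    // Like name:[lower TO upper]: both bounds inclusive, lower must not exceed upper.
    SortedSet<Integer> range(String lower, String upper) {
        TreeSet<Integer> docs = new TreeSet<>();
        for (TreeSet<Integer> ids : terms.subMap(lower, true, upper, true).values()) {
            docs.addAll(ids);
        }
        return docs;
    }

    // Like title:program*: walk the sorted terms starting at the prefix and
    // stop at the first term that no longer starts with it.
    SortedSet<Integer> prefix(String prefix) {
        TreeSet<Integer> docs = new TreeSet<>();
        for (Map.Entry<String, TreeSet<Integer>> e : terms.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            docs.addAll(e.getValue());
        }
        return docs;
    }

    public static void main(String[] args) {
        SortedTermsSketch index = new SortedTermsSketch();
        index.add("alice", 1);
        index.add("bob", 2);
        index.add("karl", 3);
        index.add("program", 5);
        index.add("programming", 6);
        System.out.println(index.range("a", "k"));
        System.out.println(index.prefix("program"));
    }
}
```

Note that with plain string comparison the term "karl" sorts after the bound "k", so it falls outside the range a TO k in this sketch.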
  • 65. HTTP application built around Lucene
  • 66. Makes it easy to develop search solutions
  • 67. Advanced features developed on top of Lucene
  • 68. As of 2010, Lucene and Solr are merged
  • 70. Each index has its own schema
  • 71. Lists all fields allowed for an index
  • 72. Defines the analyzers for each field
  • 73. Solr Schema
        <field name="id" type="string" indexed="true" stored="true" required="true" />
        <field name="title" type="text" indexed="true" stored="true" />
        <field name="presenter" type="text_ws" indexed="true" stored="true" />
        <field name="date" type="date" indexed="true" stored="true" />
        <field name="abstract" type="text" indexed="true" stored="true" />
  • 74. Solr Schema
        <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory" />
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.ISOLatin1AccentFilterFactory" />
            <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory" />
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.ISOLatin1AccentFilterFactory" />
            <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
          </analyzer>
        </fieldType>
  • 75. Solr Schema
        <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory" />
            <filter class="solr.LowerCaseFilterFactory" />
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory" />
            <filter class="solr.LowerCaseFilterFactory" />
          </analyzer>
        </fieldType>
  • 77. XML by default, but also CSV
  • 79. Advanced features: binary document extraction, DB plugin
  • 80. Solr Indexing
        <add>
          <doc>
            <field name="id">002</field>
            <field name="title">Lucene And Solr Introduction</field>
            <field name="presenter">Pascal Dimassimo</field>
            <field name="date">2010-11-18T00:00:00Z</field>
            <field name="abstract">...</field>
          </doc>
          <doc>...</doc>
        </add>

        curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary @add.xml
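When documents come from application code rather than a hand-written file, the add message has to be generated, and field values must be XML-escaped. Here is a minimal sketch that builds a single-document add payload as a string; the class `SolrAddXmlSketch` and its methods are hypothetical helpers for this example, not part of Solr or of any client library, and sending the payload (for instance with curl as above) is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that builds an <add><doc>...</doc></add> payload.
class SolrAddXmlSketch {
    // Escapes the characters that are special in XML text and attribute values.
    static String escape(String value) {
        return value.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Builds the payload for one document; field order follows the map's iteration order.
    static String toAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(field.getKey())).append("\">")
                    .append(escape(field.getValue()))
                    .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "002");
        fields.put("title", "Lucene And Solr Introduction");
        fields.put("presenter", "Pascal Dimassimo");
        fields.put("date", "2010-11-18T00:00:00Z");
        System.out.println(toAddXml(fields));
    }
}
```

A `LinkedHashMap` is used so that the fields appear in insertion order; one value per field name is a simplification, since a multivalued field would need a list of values.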
  • 83. Response in XML by default, but other formats are supported (json, php, ruby)
  • 84. Solr Query
        curl http://localhost:8983/solr/select?q=title:Lucene

        <response>
          <lst name="responseHeader">
            <int name="status">0</int>
            <int name="QTime">269</int>
            <lst name="params">
              <str name="q">title:Lucene</str>
            </lst>
          </lst>
          <result name="response" numFound="1" start="0">
            <doc>
              <str name="id">002</str>
              <str name="title">Lucene And Solr Introduction</str>
              <str name="presenter">Pascal Dimassimo</str>
              <date name="date">2010-11-18T00:00:00Z</date>
              <str name="abstract">...</str>
            </doc>
          </result>
        </response>
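A client usually needs only a few pieces of such a response: the number of hits and some field values. The sketch below reads them with the JDK's built-in DOM parser. It assumes the response has the XML shape shown in the slide above; the class `SolrResponseSketch` and its methods are made up for the example, and a real application might prefer a Solr client library instead.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical reader for an XML response shaped like the one in the slide.
class SolrResponseSketch {
    private static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parsable XML response", e);
        }
    }

    // Reads the numFound attribute of the first <result> element.
    static int numFound(String xml) {
        Node result = parse(xml).getElementsByTagName("result").item(0);
        if (!(result instanceof Element)) {
            throw new IllegalArgumentException("No <result> element in response");
        }
        return Integer.parseInt(((Element) result).getAttribute("numFound"));
    }

    // Collects, for every <doc>, the text of child elements whose name attribute matches.
    static List<String> fieldValues(String xml, String fieldName) {
        List<String> values = new ArrayList<>();
        NodeList docs = parse(xml).getElementsByTagName("doc");
        for (int i = 0; i < docs.getLength(); i++) {
            NodeList children = docs.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child instanceof Element
                        && fieldName.equals(((Element) child).getAttribute("name"))) {
                    values.add(child.getTextContent().trim());
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String xml = "<response><result name=\"response\" numFound=\"1\" start=\"0\">"
                + "<doc><str name=\"id\">002</str>"
                + "<str name=\"title\">Lucene And Solr Introduction</str></doc>"
                + "</result></response>";
        System.out.println(numFound(xml) + " hit(s): " + fieldValues(xml, "title"));
    }
}
```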
  • 85. Solr Query Parameters
        q      Lucene query
        sort   Field to sort on. Defaults to score
        start  Offset of the results page to display. Defaults to 0
        rows   Number of results to display per page. Defaults to 10
        fq     Filter query. Defaults to all documents
        fl     List of fields to display per document. Defaults to all fields
        wt     Format in which to display the results. Defaults to xml
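Since a Solr query is a plain HTTP request, these parameters end up in the URL's query string and need to be URL-encoded. The following sketch assembles such a URL from a map of parameters; `SolrQueryUrlSketch` and `selectUrl` are hypothetical names, and the base URL is the local one used in the curl examples.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper that turns query parameters into a /select URL.
class SolrQueryUrlSketch {
    // Encodes each name and value, then joins the pairs with '&'.
    static String selectUrl(String baseUrl, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> param : params.entrySet()) {
            query.add(URLEncoder.encode(param.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(param.getValue(), StandardCharsets.UTF_8));
        }
        return baseUrl + "/select?" + query;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "title:Lucene");   // the Lucene query
        params.put("fl", "id,title");      // only return these fields
        params.put("rows", "5");           // five results per page
        params.put("start", "0");          // first page
        params.put("wt", "json");          // response format
        System.out.println(selectUrl("http://localhost:8983/solr", params));
    }
}
```

Parameters that are left out simply fall back to the defaults listed in the slide, so a request with only `q` is already valid.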
  • 87. Useful for drilling down into a result set

Editor's Notes

  1. Does one thing well. Apache License. 10 years old. Version 3.0. It is fast!
  2. Analyze documents: split out each word. Get documents in. Lucene returns a list of documents as the search result.
  3. Book example: you search from the beginning every time you look for a word. It is much simpler to use an index. Inverted index: for a word, list the documents that contain it.
  4. Analysis: transform the content into terms. A term can be more than one word: "New York". Position is also stored. Binary search: O(log n) -> logarithmic. Boolean search. Wildcard search.
  5. Lucene generates an id for each document. Stored = original content stored "as is" on disk; it can be returned to the user when the document is returned. When Lucene returns a document, it returns the id. You can retrieve the stored content with the id.
  6. Document: email, article, user. Email fields: sender, recipient, title, content, attachment. Article fields: author, title, category, content, publication date. Database analogy: document = row, field = column. Documents with different fields can be stored.
  7. Lucene generates an id for each document. Stored = original content stored "as is" on disk; it can be returned to the user when the document is returned. When Lucene returns a document, it returns the id. You can retrieve the stored content with the id.
  8. Lucene can return results sorted by a field.
  9. Terms are almost synonymous with words.
  10. Basic Query instance: TermQuery. Use PerFieldAnalyzerWrapper to specify a specific analyzer for each field.
  11. Terms are stored in alphabetical order, compared using String.compareTo. Returns all docs for each term in the range.
  12. Supports AND, OR, NOT. Supports +, -.
  13. Supports AND, OR, NOT. Supports +, -.
  14. CNET used it to help users find products more easily.