SlideShare a Scribd company logo
The Lucene Search Engine
Kira Radinsky
Based on the material from: Thomas Paul and Steven J. Owens
What is Lucene?
• Doug Cutting’s grandmother’s middle name
• A open source set of Java Classses
– Search Engine/Document Classifier/Indexer
– Developed by Doug Cutting (1996)
• Xerox/Apple/Excite/Nutch/Yahoo/Cloudera
• Hadoop founder, Board of directors of the Apache Software
• Jakarta Apache Product. Strong open source
community support.
• High-performance, full-featured text search
engine library
• Easy to use yet powerful API
Use the Source, Luke
• Document
• Field
– Represents a section of a Document: name for the section + the actual data.
• Analyzer
– Abstract class (to provide interface)
– Document -> tokens (for later indexing)
– StandardAnalyzer class.
• IndexWriter
– Creates and maintains indexes.
• IndexSearcher
– Searches through an index.
• QueryParser
– Builds a parser that can search through an index.
• Query
– Abstract class that contains the search criteria created by the QueryParser.
• Hits
– Contains the Document objects that are returned by running the Query object against the index.
Indexing a Document
Document from an article
private Document createDocument(String article, String author,
String title, String topic,
String url,
Date dateWritten)
{
Document document = new Document();
document.add(Field.Text("author", author));
document.add(Field.Text("title", title));
document.add(Field.Text("topic", topic));
document.add(Field.UnIndexed("url", url));
document.add(Field.Keyword("date", dateWritten));
document.add(Field.UnStored("article", article));
return document;
}
The Field Object
Factory Method Tokenized Indexed Stored Use for
Field.Text(String name,
String value)
Yes Yes Yes
contents you want
stored
Field.Text(String name,
Reader value)
Yes Yes No
contents you don't
want stored
Field.Keyword(String
name, String value)
No Yes Yes
values you don't want
broken down
Field.UnIndexed(String
name, String value)
No No Yes
values you don't want
indexed
Field.UnStored(String
name, String value)
Yes Yes No
values you don't want
stored
Store a Document in the index
String indexDirectory = "lucene-index";
private void indexDocument(Document document)
throws Exception
{
Analyzer analyzer = new StandardAnalyzer();
IndexWriter writer = new IndexWriter(
indexDirectory,
analyzer, false
);
writer.addDocument(document);
writer.optimize();
writer.close();
}
Analyzers and Tokenizers
SimpleAnalyzer SimpleAnalyzer seems to just use a Tokenizer that converts all
of the input to lower case.
StopAnalyzer StopAnalyzer includes the lower-case filter, and also has a filter
that drops out any "stop words", words like articles (a, an, the,
etc) that occur so commonly in english that they might as well
be noise for searching purposes. StopAnalyzer comes with a
set of stop words, but you can instantiate it with your own
array of stop words.
StandardAnalyzer StandardAnalyzer does both lower-case and stop-word
filtering, and in addition tries to do some basic clean-up of
words, for example taking out apostrophes ( ' ) and removing
periods from acronyms (i.e. "T.L.A." becomes "TLA").
Lucene Sandbox Here you can find analyzers in your own language
Adding to an Index
public void indexArticle(
String article,
String author,
String title, String topic,
String url, Date dateWritten)
throws Exception
{
Document document = createDocument
(
article, author,
title, topic, url,
dateWritten
);
indexDocument(document);
}
Searching the Index
Searching
IndexSearcher is = new
IndexSearcher(indexDirectory);
Analyzer analyzer = new StandardAnalyzer();
QueryParser parser = new QueryParser("article",
analyzer);
Query query = parser.parse(searchCriteria);
Hits hits = is.search(query);
Extracting Document objects
for (int i=0; i<hits.length(); i++)
{
Document doc = hits.doc(i);
// display the articles that were
found to the user
}
Search Criteria
Supports several searches: AND OR and NOT,
fuzzy, proximity searches, wildcard searches, and
range searches
– author:Henry relativity AND "quantum physics“
– "string theory" NOT Einstein
– "Galileo Kepler"~5
– author:Johnson date:[01/01/2004 TO 01/31/2004]
Thread Safety
• Indexing and searching are not only thread safe,
but process safe. What this means is that:
– Multiple index searchers can read the lucene index
files at the same time.
– An index writer or reader can edit the lucene index
files while searches are ongoing
– Multiple index writers or readers can try to edit the
lucene index files at the same time (it's important for
the index writer/reader to be closed so it will release
the file lock).
• The query parser is not thread safe,
• The index writer however, is thread safe,

More Related Content

What's hot

Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
dnaber
 
Apache Lucene Basics
Apache Lucene BasicsApache Lucene Basics
Apache Lucene Basics
Anirudh Sharma
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneSwapnil & Patil
 
Azure search
Azure searchAzure search
Azure search
Alexej Sommer
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
Adrien Grand
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processing
abial
 
Building a Search Engine Using Lucene
Building a Search Engine Using LuceneBuilding a Search Engine Using Lucene
Building a Search Engine Using Lucene
Abdelrahman Othman Helal
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
OpenSource Connections
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
Grant Ingersoll
 
Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!
Alex Kursov
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logging
lucenerevolution
 
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Net
gramana
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
Andy Jackson
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache Tika
Paolo Mottadelli
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
lucenerevolution
 
Full text search
Full text searchFull text search
Full text searchdeleteman
 
Content extraction with apache tika
Content extraction with apache tikaContent extraction with apache tika
Content extraction with apache tikaJukka Zitting
 
What's new with Apache Tika?
What's new with Apache Tika?What's new with Apache Tika?
What's new with Apache Tika?
gagravarr
 

What's hot (19)

Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Apache Lucene Basics
Apache Lucene BasicsApache Lucene Basics
Apache Lucene Basics
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
 
Azure search
Azure searchAzure search
Azure search
 
Lucece Indexing
Lucece IndexingLucece Indexing
Lucece Indexing
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processing
 
Building a Search Engine Using Lucene
Building a Search Engine Using LuceneBuilding a Search Engine Using Lucene
Building a Search Engine Using Lucene
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logging
 
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Net
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache Tika
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Full text search
Full text searchFull text search
Full text search
 
Content extraction with apache tika
Content extraction with apache tikaContent extraction with apache tika
Content extraction with apache tika
 
What's new with Apache Tika?
What's new with Apache Tika?What's new with Apache Tika?
What's new with Apache Tika?
 

Similar to Tutorial 5 (lucene)

Fast track to lucene
Fast track to luceneFast track to lucene
Fast track to lucene
Marouane Gazanayi
 
Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenches
Ismail Mayat
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with LuceneWO Community
 
IR with lucene
IR with luceneIR with lucene
IR with lucene
Stelios Gorilas
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
otisg
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
Asad Abbas
 
Full Text Search In PostgreSQL
Full Text Search In PostgreSQLFull Text Search In PostgreSQL
Full Text Search In PostgreSQL
Karwin Software Solutions LLC
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to Elasticsearch
Clifford James
 
Elasticsearch python
Elasticsearch pythonElasticsearch python
Elasticsearch python
valiantval2
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearch
Joey Wen
 
Lucene Bootcamp -1
Lucene Bootcamp -1 Lucene Bootcamp -1
Lucene Bootcamp -1 GokulD
 
ElasticSearch Basics
ElasticSearch BasicsElasticSearch Basics
ElasticSearch Basics
Amresh Singh
 
Philly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya BhamidpatiPhilly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Robert Calcavecchia
 
Apache lucene - full text search
Apache lucene - full text searchApache lucene - full text search
Apache lucene - full text searchMarcelo Cure
 
DIY Percolator
DIY PercolatorDIY Percolator
DIY Percolatorjdhok
 
Anno4j - Idiomatic Persistence and Querying for the W3C Annotation Data Model
Anno4j - Idiomatic Persistence and Querying for the W3C Annotation Data ModelAnno4j - Idiomatic Persistence and Querying for the W3C Annotation Data Model
Anno4j - Idiomatic Persistence and Querying for the W3C Annotation Data Model
Emanuel Berndl
 
Amazon Elasticsearch and Databases
Amazon Elasticsearch and DatabasesAmazon Elasticsearch and Databases
Amazon Elasticsearch and Databases
Amazon Web Services
 
Lucene indexing
Lucene indexingLucene indexing
Lucene indexing
Lucky Sharma
 
Elasticsearch for beginners
Elasticsearch for beginnersElasticsearch for beginners
Elasticsearch for beginners
Neil Baker
 

Similar to Tutorial 5 (lucene) (20)

Fast track to lucene
Fast track to luceneFast track to lucene
Fast track to lucene
 
Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenches
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
 
IR with lucene
IR with luceneIR with lucene
IR with lucene
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
 
Full Text Search In PostgreSQL
Full Text Search In PostgreSQLFull Text Search In PostgreSQL
Full Text Search In PostgreSQL
 
Lucene in Action
Lucene in ActionLucene in Action
Lucene in Action
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to Elasticsearch
 
Elasticsearch python
Elasticsearch pythonElasticsearch python
Elasticsearch python
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearch
 
Lucene Bootcamp -1
Lucene Bootcamp -1 Lucene Bootcamp -1
Lucene Bootcamp -1
 
ElasticSearch Basics
ElasticSearch BasicsElasticSearch Basics
ElasticSearch Basics
 
Philly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya BhamidpatiPhilly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
 
Apache lucene - full text search
Apache lucene - full text searchApache lucene - full text search
Apache lucene - full text search
 
DIY Percolator
DIY PercolatorDIY Percolator
DIY Percolator
 
Anno4j - Idiomatic Persistence and Querying for the W3C Annotation Data Model
Anno4j - Idiomatic Persistence and Querying for the W3C Annotation Data ModelAnno4j - Idiomatic Persistence and Querying for the W3C Annotation Data Model
Anno4j - Idiomatic Persistence and Querying for the W3C Annotation Data Model
 
Amazon Elasticsearch and Databases
Amazon Elasticsearch and DatabasesAmazon Elasticsearch and Databases
Amazon Elasticsearch and Databases
 
Lucene indexing
Lucene indexingLucene indexing
Lucene indexing
 
Elasticsearch for beginners
Elasticsearch for beginnersElasticsearch for beginners
Elasticsearch for beginners
 

More from Kira

Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)
Kira
 
Tutorial 12 (click models)
Tutorial 12 (click models)Tutorial 12 (click models)
Tutorial 12 (click models)
Kira
 
Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)
Kira
 
Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)
Kira
 
Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)
Kira
 
Tutorial 8 (web graph models)
Tutorial 8 (web graph models)Tutorial 8 (web graph models)
Tutorial 8 (web graph models)
Kira
 
Tutorial 7 (link analysis)
Tutorial 7 (link analysis)Tutorial 7 (link analysis)
Tutorial 7 (link analysis)
Kira
 
Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)
Kira
 
Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)
Kira
 
Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)
Kira
 
Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)
Kira
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)
Kira
 
Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)
Kira
 

More from Kira (13)

Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)Tutorial 14 (collaborative filtering)
Tutorial 14 (collaborative filtering)
 
Tutorial 12 (click models)
Tutorial 12 (click models)Tutorial 12 (click models)
Tutorial 12 (click models)
 
Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)Tutorial 11 (computational advertising)
Tutorial 11 (computational advertising)
 
Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)Tutorial 10 (computational advertising)
Tutorial 10 (computational advertising)
 
Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)
 
Tutorial 8 (web graph models)
Tutorial 8 (web graph models)Tutorial 8 (web graph models)
Tutorial 8 (web graph models)
 
Tutorial 7 (link analysis)
Tutorial 7 (link analysis)Tutorial 7 (link analysis)
Tutorial 7 (link analysis)
 
Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)Tutorial 6 (web graph attributes)
Tutorial 6 (web graph attributes)
 
Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)
 
Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)Tutorial 3 (b tree min heap)
Tutorial 3 (b tree min heap)
 
Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)Tutorial 2 (mle + language models)
Tutorial 2 (mle + language models)
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)
 
Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)Tutorial 13 (explicit ugc + sentiment analysis)
Tutorial 13 (explicit ugc + sentiment analysis)
 

Recently uploaded

From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 

Recently uploaded (20)

From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 

Tutorial 5 (lucene)

  • 1. The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens
  • 2. What is Lucene? • Doug Cutting’s grandmother’s middle name • A open source set of Java Classses – Search Engine/Document Classifier/Indexer – Developed by Doug Cutting (1996) • Xerox/Apple/Excite/Nutch/Yahoo/Cloudera • Hadoop founder, Board of directors of the Apache Software • Jakarta Apache Product. Strong open source community support. • High-performance, full-featured text search engine library • Easy to use yet powerful API
  • 3. Use the Source, Luke • Document • Field – Represents a section of a Document: name for the section + the actual data. • Analyzer – Abstract class (to provide interface) – Document -> tokens (for later indexing) – StandardAnalyzer class. • IndexWriter – Creates and maintains indexes. • IndexSearcher – Searches through an index. • QueryParser – Builds a parser that can search through an index. • Query – Abstract class that contains the search criteria created by the QueryParser. • Hits – Contains the Document objects that are returned by running the Query object against the index.
  • 5. Document from an article private Document createDocument(String article, String author, String title, String topic, String url, Date dateWritten) { Document document = new Document(); document.add(Field.Text("author", author)); document.add(Field.Text("title", title)); document.add(Field.Text("topic", topic)); document.add(Field.UnIndexed("url", url)); document.add(Field.Keyword("date", dateWritten)); document.add(Field.UnStored("article", article)); return document; }
  • 6. The Field Object Factory Method Tokenized Indexed Stored Use for Field.Text(String name, String value) Yes Yes Yes contents you want stored Field.Text(String name, Reader value) Yes Yes No contents you don't want stored Field.Keyword(String name, String value) No Yes Yes values you don't want broken down Field.UnIndexed(String name, String value) No No Yes values you don't want indexed Field.UnStored(String name, String value) Yes Yes No values you don't want stored
  • 7. Store a Document in the index String indexDirectory = "lucene-index"; private void indexDocument(Document document) throws Exception { Analyzer analyzer = new StandardAnalyzer(); IndexWriter writer = new IndexWriter( indexDirectory, analyzer, false ); writer.addDocument(document); writer.optimize(); writer.close(); }
  • 8. Analyzers and Tokenizers SimpleAnalyzer SimpleAnalyzer seems to just use a Tokenizer that converts all of the input to lower case. StopAnalyzer StopAnalyzer includes the lower-case filter, and also has a filter that drops out any "stop words", words like articles (a, an, the, etc) that occur so commonly in english that they might as well be noise for searching purposes. StopAnalyzer comes with a set of stop words, but you can instantiate it with your own array of stop words. StandardAnalyzer StandardAnalyzer does both lower-case and stop-word filtering, and in addition tries to do some basic clean-up of words, for example taking out apostrophes ( ' ) and removing periods from acronyms (i.e. "T.L.A." becomes "TLA"). Lucene Sandbox Here you can find analyzers in your own language
  • 9. Adding to an Index public void indexArticle( String article, String author, String title, String topic, String url, Date dateWritten) throws Exception { Document document = createDocument ( article, author, title, topic, url, dateWritten ); indexDocument(document); }
  • 11. Searching IndexSearcher is = new IndexSearcher(indexDirectory); Analyzer analyzer = new StandardAnalyzer(); QueryParser parser = new QueryParser("article", analyzer); Query query = parser.parse(searchCriteria); Hits hits = is.search(query);
  • 12. Extracting Document objects for (int i=0; i<hits.length(); i++) { Document doc = hits.doc(i); // display the articles that were found to the user }
  • 13. Search Criteria Supports several searches: AND OR and NOT, fuzzy, proximity searches, wildcard searches, and range searches – author:Henry relativity AND "quantum physics“ – "string theory" NOT Einstein – "Galileo Kepler"~5 – author:Johnson date:[01/01/2004 TO 01/31/2004]
  • 14. Thread Safety • Indexing and searching are not only thread safe, but process safe. What this means is that: – Multiple index searchers can read the lucene index files at the same time. – An index writer or reader can edit the lucene index files while searches are ongoing – Multiple index writers or readers can try to edit the lucene index files at the same time (it's important for the index writer/reader to be closed so it will release the file lock). • The query parser is not thread safe, • The index writer however, is thread safe,