Lucene


Surinder Kaur

Brief on Lucene: The Search Engine


Lucene
The Search Engine
By Surinder Kaur

Table of Contents
Basics
Index
Segment
Inverted Index
Indexing
Lucene Delete
Lucene Update
Searching
Near Real Time Search
Query Boost
Scoring
References

Basics
Search engine library
Open source
Supports full-text search, sorting, filtering and many other search functionalities
The core of Lucene consists of:
Inverted Index
Relevance Score
Search Algorithms
Tokenization

Index
An index is a collection of documents.
These documents may or may not share a schema.
Fields: A document consists of one or more fields. Each field can be of a different data type.
Each field is represented as a key-value pair.
Terms: When a field is processed by an analyzer, it produces terms.
A term is “the unit of search” in search engines.
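The relationship between documents, fields and terms can be sketched in a few lines of Python. This is a toy illustration, not Lucene's API: the `analyze` function and the field names are assumptions made for the example, and a real analyzer does considerably more (stop words, stemming and so on).

```python
import re

def analyze(text):
    """Toy analyzer: lowercase the text, then split on non-alphanumeric characters."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]

# A document as key-value fields; the field names are made up for this example.
document = {"title": "Lucene: The Search Engine", "author": "Surinder Kaur"}

# Each analyzed field yields terms; a term is tied to the field it came from.
terms = [(field, term) for field, value in document.items() for term in analyze(value)]
print(terms)
```

Each pair in `terms` is what a search would later match against: a field name plus one analyzed token.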

Segment
An index is split into many smaller sections, called segments. Each segment is itself a small, self-contained index.
Lucene searches all the segments in sequence.
Data (documents) once written to a segment can never be modified.
However, Lucene can merge multiple segments to optimize performance.
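A rough sketch of these ideas, assuming a segment can be modelled as an immutable tuple of document strings (a simplification for illustration, not Lucene's on-disk format). The helper names `search` and `merge` are hypothetical.

```python
def search(segments, term):
    """Visit every segment in turn and collect the documents containing the term."""
    hits = []
    for segment in segments:
        hits.extend(doc for doc in segment if term in doc.split())
    return hits

def merge(segments):
    """Build one new segment from several; the input segments are left untouched."""
    return tuple(doc for segment in segments for doc in segment)

# Two small segments (tuples cannot be modified after creation).
segments = [("lucene search",), ("inverted index", "search engine")]

before = search(segments, "search")
merged = [merge(segments)]          # one larger segment replaces the two small ones
after = search(merged, "search")
print(before == after)
```

Merging changes how many segments a search has to visit, not which documents it finds.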

Inverted Index
An inverted index is an index data structure.
In simple words, it inverts the “document-centric” data structure (document -> terms) into a “term-centric” data structure (term -> documents).
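The inversion can be shown with a minimal sketch. It assumes whitespace tokenization and integer document ids, both chosen for brevity; `build_inverted_index` is a hypothetical helper, not a Lucene function.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Turn {doc_id: text} into {term: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Document-centric view: document -> terms
docs = {
    0: "lucene is a search library",
    1: "search engines use an inverted index",
    2: "lucene builds an inverted index",
}

# Term-centric view: term -> documents
index = build_inverted_index(docs)
print(sorted(index["lucene"]))
print(sorted(index["search"]))
```

Looking up a term is now a single dictionary access instead of a scan over every document.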

Lucene: Insert (Indexing)
“Indexing” is the process of inserting documents into Lucene.
Lucene first writes data to an in-memory buffer.
When the buffer reaches a certain size, it is flushed to a segment.
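The buffer-then-flush behaviour might look like the sketch below. `ToyWriter` and its `max_buffered_docs` threshold are assumptions for illustration; a real writer can also flush based on memory use rather than a document count.

```python
class ToyWriter:
    """Toy model of a writer that buffers documents and flushes them into segments."""

    def __init__(self, max_buffered_docs=2):
        self.max_buffered_docs = max_buffered_docs  # hypothetical flush threshold
        self.buffer = []      # in-memory buffer
        self.segments = []    # flushed, immutable segments

    def add_document(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.max_buffered_docs:
            self.flush()

    def flush(self):
        """Turn the buffered documents into a new segment and empty the buffer."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

writer = ToyWriter(max_buffered_docs=2)
for doc in ["a", "b", "c"]:
    writer.add_document(doc)

print(writer.segments)  # the first two documents were flushed together
print(writer.buffer)    # the third is still waiting in memory
```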

Lucene: Delete
A deleted document is not removed from its segment; it is only marked as deleted in a file, so that it cannot be returned by a search.
This can be considered a soft delete.
The space is reclaimed only later, when segments are merged.
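A soft delete can be sketched as follows, assuming the "deleted" marker file is modelled as a plain set of document ids. The names `deleted_ids`, `delete` and `search` are hypothetical.

```python
segment = ("lucene search", "solr search", "lucene index")  # written once, never modified
deleted_ids = set()  # stands in for the file that records deleted documents

def delete(doc_id):
    """Soft delete: record the id; the segment itself is left untouched."""
    deleted_ids.add(doc_id)

def search(term):
    """Return ids of live (not deleted) documents that contain the term."""
    return [doc_id for doc_id, text in enumerate(segment)
            if doc_id not in deleted_ids and term in text.split()]

before = search("lucene")
delete(0)
after = search("lucene")
print(before, after)
```

The segment still holds all three documents after the delete; only the search result changes.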

Lucene: Update
A document is never updated in place.
Instead, an update is a two-step process:
The older version is marked as deleted in its original segment.
The new version is added to the current segment.
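The two steps can be sketched as below. This is a simplified model: `ToyIndex`, the `id` key used to find the old version, and the (segment, position) pairs used as deletion markers are all assumptions for the example.

```python
class ToyIndex:
    """Toy index: a list of segments plus a set of deletion markers."""

    def __init__(self):
        self.segments = [[]]   # the last list plays the role of the "current segment"
        self.deleted = set()   # (segment_no, doc_no) pairs marked as deleted

    def add(self, doc):
        self.segments[-1].append(doc)

    def start_new_segment(self):
        self.segments.append([])

    def update(self, doc_id, new_doc):
        # Step 1: mark every older version as deleted where it lives.
        for seg_no, segment in enumerate(self.segments):
            for doc_no, doc in enumerate(segment):
                if doc["id"] == doc_id:
                    self.deleted.add((seg_no, doc_no))
        # Step 2: add the new version to the current segment.
        self.add(new_doc)

    def live_docs(self):
        return [doc for seg_no, segment in enumerate(self.segments)
                for doc_no, doc in enumerate(segment)
                if (seg_no, doc_no) not in self.deleted]

index = ToyIndex()
index.add({"id": 1, "name": "Jack"})
index.start_new_segment()
index.update(1, {"id": 1, "name": "Jill"})
print(index.live_docs())
```

The old version is still physically present in the first segment; it is simply no longer visible.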

Lucene: Get or Search
Searching, or retrieving results from Lucene, is a multi-step process:
Query Parser: turns the query text into a query object.
Index Searcher: runs the query against the index.
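The split between parsing and searching can be illustrated with a toy version of each step. The `parse_query` and `search` functions and the hand-built index below are hypothetical; they only handle a single `field:"value"` clause, far less than Lucene's query syntax.

```python
def parse_query(text):
    """Parse 'field:"value"' into a (field, term) pair; the term is lowercased."""
    field, _, value = text.partition(":")
    return field.strip(), value.strip().strip('"').lower()

def search(index, query):
    """Look the parsed query up in a {(field, term): doc_ids} index."""
    return sorted(index.get(query, set()))

# A tiny hand-built inverted index keyed by (field, term).
index = {
    ("first_name", "jack"): {0, 2},
    ("last_name", "jack"): {1},
}

query = parse_query('first_name:"Jack"')   # step 1: create the query
hits = search(index, query)                # step 2: run it against the index
print(query, hits)
```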

Near Real Time Search
Lucene provides near-real-time (NRT) search, not real-time search.
The delay is a consequence of the way documents are inserted.
A new document is first added to the in-memory buffer; the buffer is then flushed to become a segment.
Until the document reaches a segment, it is unsearchable.
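The gap between adding a document and being able to find it can be sketched like this. `ToyNrtIndex` is a made-up class for illustration; its `search` method deliberately looks only at flushed segments, which is the simplification the slide describes.

```python
class ToyNrtIndex:
    """Toy index in which only flushed segments are visible to searches."""

    def __init__(self):
        self.buffer = []     # new documents land here first
        self.segments = []   # searchable once flushed

    def add_document(self, doc):
        self.buffer.append(doc)

    def flush(self):
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, term):
        # The in-memory buffer is intentionally ignored here.
        return [doc for segment in self.segments
                for doc in segment if term in doc.split()]

index = ToyNrtIndex()
index.add_document("near real time search")
hits_before_flush = index.search("search")
index.flush()
hits_after_flush = index.search("search")
print(hits_before_flush, hits_after_flush)
```

The shorter the interval between flushes, the closer the behaviour gets to real time, at the cost of creating more small segments.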

Document Scoring
The official documentation says: “Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given Document is to a User's query.”
In simpler terms, this is TF-IDF (Term Frequency-Inverse Document Frequency): the more often a query term appears in a document relative to the number of times it appears across all the documents in the collection, the more relevant that document is to the query.
Since Lucene 6.0 the default similarity is BM25; the TF-IDF model remains available as ClassicSimilarity.
Note: Scoring is a detailed topic; I plan to publish a detailed study of it. For reference, the Similarity formula is described in the Lucene Similarity documentation listed under References.
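To make the TF-IDF idea concrete, here is a sketch using the textbook formula tf * ln(N / df). This is an assumption chosen for simplicity, not Lucene's exact Similarity formula, which adds normalization and smoothing terms.

```python
import math

def tf_idf(term, doc, docs):
    """Textbook TF-IDF: raw term count times ln(total docs / docs containing the term)."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

# Three already-tokenized documents.
docs = [
    ["lucene", "search", "lucene"],
    ["search", "engine"],
    ["search", "index"],
]

rare = tf_idf("lucene", docs[0], docs)    # appears twice here and in no other document
common = tf_idf("search", docs[0], docs)  # appears in every document
print(rare, common)
```

A term that occurs in every document contributes nothing to the score, while a term concentrated in one document scores highly there.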

Boosting Score
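Lucene lets you apply boosts at various levels:
Document-level boost (while indexing)
Field-level boost (while indexing)
Query-level boost (while searching)
Note: index-time boosts belong to older Lucene versions; they were removed in Lucene 7.0, which leaves query-time boosts.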

Query Boost
Query-time boosts allow one to specify which terms or clauses are “more important”.
A query boost takes effect during searching.
The higher the boost factor, the more weight the term carries, and therefore the higher the matching documents score.
E.g., boosting first name over last name by a factor of 2:
(first_name:"Jack")^2 (last_name:"Jack")
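The effect of that boost on ranking can be sketched as follows. The scoring rule here (add up the boost of every matching clause) is a deliberately simplified assumption, and the `score` helper and sample documents are hypothetical; Lucene multiplies the boost into a full similarity score.

```python
def score(doc, clauses):
    """Sum the boost of every (field, term, boost) clause that the document matches."""
    return sum(boost for field, term, boost in clauses
               if doc.get(field, "").lower() == term)

# Equivalent of: (first_name:"Jack")^2 (last_name:"Jack")
clauses = [("first_name", "jack", 2.0), ("last_name", "jack", 1.0)]

docs = [
    {"first_name": "John", "last_name": "Jack"},
    {"first_name": "Jack", "last_name": "Smith"},
]

ranked = sorted(docs, key=lambda d: score(d, clauses), reverse=True)
print([(d["first_name"], d["last_name"], score(d, clauses)) for d in ranked])
```

Both documents match the query, but the one matching on the boosted first-name clause ranks first.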

References
Lucene Documentation
Segment
Inverted index
Lucene tutorial
Lucene Query Syntax
Lucene Similarity


