Lucene
The Search Engine
By Surinder Kaur
Basics
Index
Segment
Inverted Index
Indexing
Lucene Delete
Lucene Update
Searching
Near Real Time Search
Query Boost
Scoring
References
Table of Content
Basics
Search Engine
Open Source
Supports Full Text Search, Sorting, Filtering and many other search functionalities
The core to Lucene is-
Inverted Index
Relevance Score
Search Algorithms
Tokenization
Index
An index is collection of document.
These document may or may not have any schema.
Fields: Document consists of one or more fields. Each field can
be of different data type.
Each Field is represented as key value pair.
Terms: When a field is processed through analyzer, it produces
Terms.
A term is “the unit of search” in search engines.
Segment
Index is split into many smaller
sections, called Segments. Each
segment has its own index.
Lucene searches all the segments in
sequence.
Data (document) once written to
segment can never be modified.
However Lucene can merge multiple
segments to optimize the
performance.
Inverted Index
Inverted index is an index data structure.
In simple words it inverts the “document-centric” data
structure (document -> terms) to “term-centric” data
structure (term -> document).
Lucene: Insert (Indexing)
“Indexing” is process of Document insertion to Lucene.
Lucene writes data to “in-memory buffer”.
When the buffer size reaches certain size, it gets
flushed to a “segment”.
Lucene: Delete
Document is never deleted from segment but only
marked deleted in a file. So that it can not be
accessed during the search.
It can be considered as soft delete.
Lucene: Update
A document never really gets updated.
But the update is actually a two-step process:
“older version” is marked “deleted” in the “original
segment”.
“new version” is “added” to the “current segment”.
Lucene: Get or Search
Searching or retrieving results from Lucene is a multi
step process:
Query Parser : Creates a query.
Index Searcher : Searches the query
Near Real Time Search
Lucene provides “near real time search” but not the
real time search.
NRT search is due to the way documents get inserted.
Since any new document first gets added to in-memory
buffer. Then buffer is flushed to become a segment.
Till the document reaches the segment it is
“unsearchable”.
Document Scoring
The official doc says- “Lucene scoring uses a combination of
the Vector Space Model (VSM) of Information Retrieval and
the Boolean model to determine how relevant a given Document is to
a User's query.”
In simpler term it is called “Tf-Idf” (Term Frequency- Inverse Document
Frequency) i.e. more times a query term appears in a document
relative to the number of times the term appears in all the documents
in the collection, the more relevant that document is to the query.
Note: Scoring is a detailed topic, I would publish a detailed study of
it. For reference Similarity formula is described here.
Boosting Score
Lucene let’s apply boost at various level. These are
namely:
Document Level Boost (while Indexing)
Field Level Boost (while Indexing)
Query Level Boost (while Searching)
Query Boost
Query-time boosts allow one to specify which terms/clauses
are "more important”.
Query boost plays role during searching.
The higher the boost factor, the more relevant the term will
be, and therefore the higher the corresponding document
scores.
Eg: Boosting first name over last name to factor of 2:
(first_name : “Jack”)^ 2 (last_name : “Jack”)
References
Lucene Documentation
Segment
Inverted index
Lucene tutorial
Lucene Query Syntax
Lucene Similarity

Lucene

  • 1.
  • 2.
    Basics Index Segment Inverted Index Indexing Lucene Delete LuceneUpdate Searching Near Real Time Search Query Boost Scoring References Table of Content
  • 3.
    Basics Search Engine Open Source SupportsFull Text Search, Sorting, Filtering and many other search functionalities The core to Lucene is- Inverted Index Relevance Score Search Algorithms Tokenization
  • 4.
    Index An index iscollection of document. These document may or may not have any schema. Fields: Document consists of one or more fields. Each field can be of different data type. Each Field is represented as key value pair. Terms: When a field is processed through analyzer, it produces Terms. A term is “the unit of search” in search engines.
  • 5.
    Segment Index is splitinto many smaller sections, called Segments. Each segment has its own index. Lucene searches all the segments in sequence. Data (document) once written to segment can never be modified. However Lucene can merge multiple segments to optimize the performance.
  • 6.
    Inverted Index Inverted indexis an index data structure. In simple words it inverts the “document-centric” data structure (document -> terms) to “term-centric” data structure (term -> document).
  • 7.
    Lucene: Insert (Indexing) “Indexing”is process of Document insertion to Lucene. Lucene writes data to “in-memory buffer”. When the buffer size reaches certain size, it gets flushed to a “segment”.
  • 8.
    Lucene: Delete Document isnever deleted from segment but only marked deleted in a file. So that it can not be accessed during the search. It can be considered as soft delete.
  • 9.
    Lucene: Update A documentnever really gets updated. But the update is actually a two-step process: “older version” is marked “deleted” in the “original segment”. “new version” is “added” to the “current segment”.
  • 10.
    Lucene: Get orSearch Searching or retrieving results from Lucene is a multi step process: Query Parser : Creates a query. Index Searcher : Searches the query
  • 11.
    Near Real TimeSearch Lucene provides “near real time search” but not the real time search. NRT search is due to the way documents get inserted. Since any new document first gets added to in-memory buffer. Then buffer is flushed to become a segment. Till the document reaches the segment it is “unsearchable”.
  • 12.
    Document Scoring The officialdoc says- “Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given Document is to a User's query.” In simpler term it is called “Tf-Idf” (Term Frequency- Inverse Document Frequency) i.e. more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. Note: Scoring is a detailed topic, I would publish a detailed study of it. For reference Similarity formula is described here.
  • 13.
    Boosting Score Lucene let’sapply boost at various level. These are namely: Document Level Boost (while Indexing) Field Level Boost (while Indexing) Query Level Boost (while Searching)
  • 14.
    Query Boost Query-time boostsallow one to specify which terms/clauses are "more important”. Query boost plays role during searching. The higher the boost factor, the more relevant the term will be, and therefore the higher the corresponding document scores. Eg: Boosting first name over last name to factor of 2: (first_name : “Jack”)^ 2 (last_name : “Jack”)
  • 15.
    References Lucene Documentation Segment Inverted index Lucenetutorial Lucene Query Syntax Lucene Similarity