Lucene indexing

Introduction
 Lucene Index
 Lucene Index data in form of Posting list which are in Inverted Index format.
 How does it look ?
 Lucene index data in files called segments.
 Unlike a database, Lucene has no notion of a fixed global schema
 Lucene’s flexible schema also means a single index can hold documents that
rep- resent different entities.
 Lucene requires you to flatten, or de-normalize, your content when you index it.

 A document is Lucene’s atomic unit of indexing and searching. It’s a
container that holds one or more fields, which in turn contain the “real”
content.
 To index your raw content sources, you must first translate it into Lucene’s
documents and fields. Then, at search time, it’s the field values that are
searched
 Three things Lucene can do with each field:
 The value may be indexed
 If it’s indexed, the field may also optionally store term vectors,
 the field’s value may be stored,

Indexing Process
 Enriching and Creating the Document
 To Index any data, we need to get text of the raw data i.e the form in which Lucene
can ingest the data.
 Build Documents are not always simple, when you are indexing from database or
PDF or Website HTML you need to have to do so much, preprocess so that a proper
Document can be build out of it.
 Analysis
 Method addDocument & addDocuments of IndexWriter Class hand our data off to
Lucene to index.
 As a first step Lucene analyzes the text, create tokens out of it and perform analysis
operations like for instance, tokens could be lowercased before indexing, so that it
will help in making search case insensitive.
 StemFilter, Synonyms and Stopwords are such examples of analysis

 Adding to the index
 After the analyzed part is done, data is ready to be added to index.
 Lucene uses inverted index as the data structure beneath the surface.
 Lets see how it works ?
 Rather than answering question
“What words are contained in this document?”
it is optimized for providing quick answers to
“Which documents contain word X?”
 Lucene index data in the Segments

 INDEX SEGMENTS
 Each segment is a standalone index, holding a subset of all indexed documents.
 Index Time : A new segment is created whenever the writer flushes buffered
documents and pending deletions into the directory.
 Search time: Each segment is visited separately and the results are combined.
 Each segment is consist of various types of files :
 _X.<ext> where X is the segment’s name and ext is extension
 There are separate files to hold the different parts of the index
 You can use compound file format so that most of these index files are collapsed into a
single compound file in extension .cfs
 segements file is the file which contains references of all live segments named
segments_<N>

 Types of Index files and formats:
Name Extension Brief Description
Segments File segments.gen, segments_N Stores information about segments
Lock File write.lock The Write lock prevents multiple IndexWriters from writing to
same file.
Compound File .cfs An optional "virtual" file consisting of all the other index files for
systems that frequently run out of file handles.
Fields .fnm Stores information about the fields
Field Index .fdx Contains pointers to field data
Field Data .fdt The stored fields for documents
Term Infos .tis Part of the term dictionary, stores term info
Term Info Index .tii The index into the Term Infos file
Frequencies .frq Contains the list of docs which contain each term along with
frequency
Positions .prx Stores position information about where a term occurs in the
index
Norms .nrm Encodes length and boost factors for docs and fields
Term Vector Index .tvx Stores offset into the document data file
Term Vector Documents .tvd Contains information about each document that has term
Term Vector Fields .tvf The field level info about term vectors
Deleted Documents .del Info about what files are deleted

Indexing Utils
 Indexing Operations
 Adding documents
 addDocument(Document) Adds the document using the default analyze
 addDocuments(List<Document>) Adds the document using the default analyze in a block
 Deleting documents
 IndexWriter provides various methods to remove documents from an index:
 deleteDocuments(Term)
 deleteDocuments(Term[])
 deleteDocuments(Query)
 deleteDocuments(Query[])
 As with added documents, you must call commit() or close() on your writer to commit the changes to the index.
 hasDeletions() method to check if an index contains any documents marked for deletion.
 After optimize the deleted docs got removed from index

 Indexing Operations
 Updating documents
 updateDocument(Term, Document) first deletes all documents containing the
provided term and then adds the new document using the writer’s default analyzer.
 updateDocument(Term, Document, Analyzer) does the same but uses provided
analyzer instead of the writer’s default analyzer.

 Optimize Index
 When you index documents, especially many documents or using multiple
sessions with IndexWriter, you’ll invariably create an index that has many
separate segments.
 When you search the index, Lucene must search each segment separately
then combine the results.
 This has a tradeoff as the large no of segments the large no of seprate search
and more the merge would be.
 An optimized index also consumes fewer file descriptors during searching.
 Optimizing only improves searching speed, not indexing speed.

 Optimize Index
 IndexWriter exposes four methods to optimize:
 forceMerge(int maxNumSegments): Forces merge policy to merge segments until
there are <= maxNumSegments.
 forceMerge(int maxNumSegments, boolean doWait): Just like forceMerge(int),
except you can specify whether the call should block until all merging completes.
 forceMergeDeletes() : Forces merging of all segments that have deleted

 Index Commits
 A new index commit is created whenever you invoke one of IndexWriter’s
commit methods.
 Commits all pending changes (added and deleted documents, segment
merges, added indexes, etc.) to the index, and syncs all referenced index files,
such that a reader will see the changes and the index updates will survive an
or machine crash or power loss.
 The steps IndexWriter takes during commit:
 Flush any buffered documents and deletions.
 Sync all newly created files, including newly flushed files
 Write and sync the next segments_N file.
 Remove old commits by calling on IndexDeletionPolicy to remove old commits.

 Index Merging
 When an index has too many segments, IndexWriter selects some of the segments
and merges them into a single, large segment
 There are various merge policies like : LogMergePolicy , LogDocMergePolicy etc
 Concurrency, thread safety, and locking issues
 Any number of read-only IndexReaders may be open at once on a single index.
 Only a single writer may be open on an index at once. Lucene uses a write lock
to enforce this
 IndexReaders may be open even while an IndexWriter is making changes to the
index. Each IndexReader will always show the index as of the point in time that it
was opened. It won’t see any changes being done by the IndexWriter until the
commits and the reader is reopened.

 Concurrency, thread safety, and locking issues
 The Lucene index only blocks concurrent write operations on the index.
 Various implementations of Lock are :
 NoLockFactory
 SimpleFSLockFactory
 SingleInstanceLockFactory
 VerifyingLockFactory

 Boosting documents and fields
 Index-time boosts are not supported anymore. As a replacement, index-time
scoring factors should be indexed into a doc value field combined at query
time using eg. FunctionScoreQuery.

Lucene indexing

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Lucene indexing

Similar to Lucene indexing (20)

Recently uploaded

Recently uploaded (20)

Lucene indexing