Lucene
BUILDING INDEX
Introduction
 Lucene Index
 Lucene stores index data as posting lists, organized as an inverted index.
 How does it look?
 Lucene stores index data in files grouped into segments.
 Unlike a database, Lucene has no notion of a fixed global schema.
 Lucene’s flexible schema also means a single index can hold documents that
represent different entities.
 Lucene requires you to flatten, or de-normalize, your content when you index it.
 A document is Lucene’s atomic unit of indexing and searching. It’s a
container that holds one or more fields, which in turn contain the “real”
content.
 To index your raw content sources, you must first translate them into Lucene’s
documents and fields. Then, at search time, it’s the field values that are
searched.
 Three things Lucene can do with each field:
 The value may be indexed.
 If it’s indexed, the field may also optionally store term vectors.
 The field’s value may be stored.
Inverted Index
Indexing Process
 Enriching and Creating the Document
 To index any data, we first need to extract the text of the raw data, i.e. the form in which Lucene
can ingest it.
 Building documents is not always simple. When you index from a database, a
PDF, or website HTML, you need to do substantial preprocessing so that a proper
Document can be built from it.
 Analysis
 The addDocument and addDocuments methods of the IndexWriter class hand our data off to
Lucene to index.
 As a first step, Lucene analyzes the text: it creates tokens from it and performs analysis
operations on them. For instance, tokens could be lowercased before indexing, which
makes search case insensitive.
 Stemming, synonym handling, and stop-word removal are further examples of analysis.
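To make the analysis step concrete, here is a minimal sketch of a toy analysis chain in plain Java. It does not use Lucene's Analyzer API; the class name ToyAnalyzer and its stop-word list are assumptions made up for this illustration. It only shows the idea of tokenizing, lowercasing, and dropping stop words before anything reaches the index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain (NOT Lucene's Analyzer API): tokenize, lowercase, drop stop words.
final class ToyAnalyzer {
    // Hypothetical stop-word list, chosen only for this illustration.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "of", "and");

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // split() can yield an empty leading string
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox, lazy, dog]
        System.out.println(analyze("The Quick Brown Fox and the Lazy Dog"));
    }
}
```

Because the same chain would be applied to query text, a search for "FOX" and a search for "fox" end up looking for the same token.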
 Adding to the index
 After analysis is done, the data is ready to be added to the index.
 Lucene uses an inverted index as its underlying data structure.
 Let’s see how it works:
 Rather than answering the question
“What words are contained in this document?”
it is optimized for providing quick answers to
“Which documents contain word X?”
 Lucene stores index data in segments.
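The inverted-index idea can be sketched in a few lines of plain Java. This is a toy model under simplifying assumptions, not Lucene's on-disk format: the class ToyInvertedIndex is hypothetical, documents are assumed to be already analyzed into terms, and a posting list is just a sorted set of document IDs.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: maps each term to the sorted set of document IDs containing it.
final class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Adds a document whose text has already been analyzed into terms.
    void add(int docId, List<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Answers "which documents contain this term?" with a single map lookup.
    SortedSet<Integer> docsContaining(String term) {
        return postings.getOrDefault(term, new TreeSet<>());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(0, List.of("lucene", "builds", "index"));
        index.add(1, List.of("lucene", "searches", "index"));
        index.add(2, List.of("database", "schema"));
        System.out.println(index.docsContaining("lucene")); // prints [0, 1]
        System.out.println(index.docsContaining("schema")); // prints [2]
    }
}
```

The lookup never scans document contents, which is why the structure answers "Which documents contain word X?" quickly.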
Indexing Process
 INDEX SEGMENTS
 Each segment is a standalone index, holding a subset of all indexed documents.
 Index time: A new segment is created whenever the writer flushes buffered
documents and pending deletions into the directory.
 Search time: Each segment is visited separately and the results are combined.
 Each segment consists of various types of files:
 _X.<ext>, where X is the segment’s name and ext is the extension
 There are separate files to hold the different parts of the index.
 You can use the compound file format so that most of these index files are collapsed into a
single compound file with the extension .cfs.
 The segments file, named segments_<N>, contains references to all live
segments.
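As a rough illustration of search-time behaviour, the sketch below models each segment as its own small term-to-documents map, visits every segment in turn, and combines the hits. The class ToySegmentSearch and the use of plain maps are assumptions for illustration only; real segments are sets of files, and Lucene also combines scores, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model: a segment is a standalone term -> document-ID map.
final class ToySegmentSearch {
    // Visits each segment separately and concatenates the matching document IDs.
    static List<Integer> search(List<Map<String, List<Integer>>> segments, String term) {
        List<Integer> hits = new ArrayList<>();
        for (Map<String, List<Integer>> segment : segments) {
            hits.addAll(segment.getOrDefault(term, List.of()));
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, List<Integer>>> segments = List.of(
                Map.of("lucene", List.of(0, 1)),                     // older segment
                Map.of("lucene", List.of(5), "index", List.of(4)));  // newer segment
        System.out.println(search(segments, "lucene")); // prints [0, 1, 5]
    }
}
```

The per-segment loop is the cost that grows as segments accumulate.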
 Types of Index files and formats:
Name | Extension | Brief Description
Segments File | segments.gen, segments_N | Stores information about segments
Lock File | write.lock | The write lock prevents multiple IndexWriters from writing to the same file
Compound File | .cfs | An optional "virtual" file consisting of all the other index files, for systems that frequently run out of file handles
Fields | .fnm | Stores information about the fields
Field Index | .fdx | Contains pointers to field data
Field Data | .fdt | The stored fields for documents
Term Infos | .tis | Part of the term dictionary, stores term info
Term Info Index | .tii | The index into the Term Infos file
Frequencies | .frq | Contains the list of docs which contain each term, along with frequency
Positions | .prx | Stores position information about where a term occurs in the index
Norms | .nrm | Encodes length and boost factors for docs and fields
Term Vector Index | .tvx | Stores offset into the document data file
Term Vector Documents | .tvd | Contains information about each document that has term vectors
Term Vector Fields | .tvf | The field-level info about term vectors
Deleted Documents | .del | Info about which documents are deleted
Indexing Utils
 Indexing Operations
 Adding documents
 addDocument(Document): Adds the document using the default analyzer.
 addDocuments(List<Document>): Adds the documents as a block using the default analyzer.
 Deleting documents
 IndexWriter provides various methods to remove documents from an index:
 deleteDocuments(Term)
 deleteDocuments(Term[])
 deleteDocuments(Query)
 deleteDocuments(Query[])
 As with added documents, you must call commit() or close() on your writer to commit the changes to the index.
 Use the hasDeletions() method to check whether an index contains any documents marked for deletion.
 After the index is optimized, the deleted documents are removed from it.
 Indexing Operations
 Updating documents
 updateDocument(Term, Document) first deletes all documents containing the
provided term and then adds the new document using the writer’s default analyzer.
 updateDocument(Term, Document, Analyzer) does the same but uses the provided
analyzer instead of the writer’s default analyzer.
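The delete-then-add semantics of an update can be sketched with a toy in-memory store. This is not the IndexWriter API: the class ToyUpdate is hypothetical, documents are modeled as field-to-value maps, and a "term" is reduced to an exact field/value match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy document store showing update = delete all matches, then add.
final class ToyUpdate {
    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(doc);
    }

    // Removes every document whose field holds exactly the given value.
    void deleteDocuments(String field, String value) {
        docs.removeIf(doc -> value.equals(doc.get(field)));
    }

    // Mirrors the described behaviour: first delete by term, then add the new document.
    void updateDocument(String field, String value, Map<String, String> newDoc) {
        deleteDocuments(field, value);
        addDocument(newDoc);
    }

    // Returns an immutable copy of the current documents.
    List<Map<String, String>> docs() {
        return List.copyOf(docs);
    }

    public static void main(String[] args) {
        ToyUpdate index = new ToyUpdate();
        index.addDocument(Map.of("id", "1", "title", "old title"));
        index.addDocument(Map.of("id", "2", "title", "other"));
        index.updateDocument("id", "1", Map.of("id", "1", "title", "new title"));
        System.out.println(index.docs().size()); // prints 2
    }
}
```

Since every document matching the term is deleted, updates are usually keyed on a field that is unique per document, such as an ID.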
 Optimize Index
 When you index documents, especially many documents or using multiple
sessions with IndexWriter, you’ll invariably create an index that has many
separate segments.
 When you search the index, Lucene must search each segment separately,
then combine the results.
 This is a tradeoff: the more segments there are, the more separate searches
are needed and the more result merging is required.
 An optimized index also consumes fewer file descriptors during searching.
 Optimizing only improves searching speed, not indexing speed.
 Optimize Index
 IndexWriter exposes methods to optimize, including:
 forceMerge(int maxNumSegments): Forces the merge policy to merge segments until
there are <= maxNumSegments.
 forceMerge(int maxNumSegments, boolean doWait): Just like forceMerge(int),
except you can specify whether the call should block until all merging completes.
 forceMergeDeletes(): Forces merging of all segments that have deleted documents.
 Index Commits
 A new index commit is created whenever you invoke one of IndexWriter’s
commit methods.
 A commit writes all pending changes (added and deleted documents, segment
merges, added indexes, etc.) to the index and syncs all referenced index files,
so that a reader will see the changes and the index updates will survive an
OS or machine crash or power loss.
 The steps IndexWriter takes during commit:
 Flush any buffered documents and deletions.
 Sync all newly created files, including newly flushed files.
 Write and sync the next segments_N file.
 Remove old commits by calling on IndexDeletionPolicy.
 Index Merging
 When an index has too many segments, IndexWriter selects some of the segments
and merges them into a single, large segment.
 There are various merge policies, such as LogMergePolicy and LogDocMergePolicy.
 Concurrency, thread safety, and locking issues
 Any number of read-only IndexReaders may be open at once on a single index.
 Only a single writer may be open on an index at once. Lucene uses a write lock
to enforce this.
 IndexReaders may be open even while an IndexWriter is making changes to the
index. Each IndexReader will always show the index as of the point in time that it
was opened. It won’t see any changes being made by the IndexWriter until the
writer commits and the reader is reopened.
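The point-in-time behaviour of readers can be sketched with a toy index in plain Java. This is a simplified model, not Lucene's IndexWriter or IndexReader: the class ToyIndex and its method names are assumptions for illustration, and a "reader" is just an immutable snapshot of the last commit.

```java
import java.util.ArrayList;
import java.util.List;

// Toy index where readers see only what was committed when they were opened.
final class ToyIndex {
    private final List<String> pending = new ArrayList<>();
    private List<String> committed = List.of();

    // Buffers a document; it is invisible to readers until commit().
    void addDocument(String doc) {
        pending.add(doc);
    }

    // Publishes all buffered documents as a new immutable snapshot.
    void commit() {
        List<String> next = new ArrayList<>(committed);
        next.addAll(pending);
        pending.clear();
        committed = List.copyOf(next);
    }

    // "Opening a reader" hands out the snapshot of the most recent commit.
    List<String> openReader() {
        return committed;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("doc1");
        index.commit();
        List<String> reader = index.openReader();

        index.addDocument("doc2");
        index.commit();

        System.out.println(reader.size());             // prints 1 (old snapshot)
        System.out.println(index.openReader().size()); // prints 2 (reopened reader)
    }
}
```

The old reader keeps returning its original view even after the second commit; only a reopened reader sees the new document.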
 Concurrency, thread safety, and locking issues
 Lucene only blocks concurrent write operations on the index.
 Various LockFactory implementations are:
 NoLockFactory
 SimpleFSLockFactory
 SingleInstanceLockFactory
 VerifyingLockFactory
 Boosting documents and fields
 Index-time boosts are not supported anymore. As a replacement, index-time
scoring factors should be indexed into a doc value field and combined at query
time using, e.g., FunctionScoreQuery.

Lucene indexing

  • 2. Introduction  Lucene Index  Lucene Index data in form of Posting list which are in Inverted Index format.  How does it look ?  Lucene index data in files called segments.  Unlike a database, Lucene has no notion of a fixed global schema  Lucene’s flexible schema also means a single index can hold documents that rep- resent different entities.  Lucene requires you to flatten, or de-normalize, your content when you index it.
  • 3.  A document is Lucene’s atomic unit of indexing and searching. It’s a container that holds one or more fields, which in turn contain the “real” content.  To index your raw content sources, you must first translate it into Lucene’s documents and fields. Then, at search time, it’s the field values that are searched  Three things Lucene can do with each field:  The value may be indexed  If it’s indexed, the field may also optionally store term vectors,  the field’s value may be stored,
  • 5. Indexing Process  Enriching and Creating the Document  To Index any data, we need to get text of the raw data i.e the form in which Lucene can ingest the data.  Build Documents are not always simple, when you are indexing from database or PDF or Website HTML you need to have to do so much, preprocess so that a proper Document can be build out of it.  Analysis  Method addDocument & addDocuments of IndexWriter Class hand our data off to Lucene to index.  As a first step Lucene analyzes the text, create tokens out of it and perform analysis operations like for instance, tokens could be lowercased before indexing, so that it will help in making search case insensitive.  StemFilter, Synonyms and Stopwords are such examples of analysis
  • 6.  Adding to the index  After the analyzed part is done, data is ready to be added to index.  Lucene uses inverted index as the data structure beneath the surface.  Lets see how it works ?  Rather than answering question “What words are contained in this document?” it is optimized for providing quick answers to “Which documents contain word X?”  Lucene index data in the Segments
  • 8.  INDEX SEGMENTS  Each segment is a standalone index, holding a subset of all indexed documents.  Index Time : A new segment is created whenever the writer flushes buffered documents and pending deletions into the directory.  Search time: Each segment is visited separately and the results are combined.  Each segment is consist of various types of files :  _X.<ext> where X is the segment’s name and ext is extension  There are separate files to hold the different parts of the index  You can use compound file format so that most of these index files are collapsed into a single compound file in extension .cfs  segements file is the file which contains references of all live segments named segments_<N>
  • 9.  Types of Index files and formats: Name Extension Brief Description Segments File segments.gen, segments_N Stores information about segments Lock File write.lock The Write lock prevents multiple IndexWriters from writing to same file. Compound File .cfs An optional "virtual" file consisting of all the other index files for systems that frequently run out of file handles. Fields .fnm Stores information about the fields Field Index .fdx Contains pointers to field data Field Data .fdt The stored fields for documents Term Infos .tis Part of the term dictionary, stores term info Term Info Index .tii The index into the Term Infos file Frequencies .frq Contains the list of docs which contain each term along with frequency Positions .prx Stores position information about where a term occurs in the index Norms .nrm Encodes length and boost factors for docs and fields Term Vector Index .tvx Stores offset into the document data file Term Vector Documents .tvd Contains information about each document that has term Term Vector Fields .tvf The field level info about term vectors Deleted Documents .del Info about what files are deleted
  • 10. Indexing Utils  Indexing Operations  Adding documents  addDocument(Document): Adds the document using the default analyzer.  addDocuments(List<Document>): Adds the documents as a block using the default analyzer.  Deleting documents  IndexWriter provides various methods to remove documents from an index:  deleteDocuments(Term)  deleteDocuments(Term[])  deleteDocuments(Query)  deleteDocuments(Query[])  As with added documents, you must call commit() or close() on your writer to commit the changes to the index.  Use the hasDeletions() method to check whether an index contains any documents marked for deletion.  After an optimize, the deleted documents are removed from the index.
  • 11.  Indexing Operations  Updating documents  updateDocument(Term, Document) first deletes all documents containing the provided term and then adds the new document using the writer’s default analyzer.  updateDocument(Term, Document, Analyzer) does the same but uses provided analyzer instead of the writer’s default analyzer.
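The delete-then-add behaviour of updateDocument, and the idea that deleted documents are first only marked as deleted, can be modelled with a small in-memory sketch. Everything here is hypothetical: `UpdateSketch` is not a Lucene class, documents are plain field maps, and a term is simplified to a (field, value) pair.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

// Toy writer (not Lucene's IndexWriter): documents are appended to a list,
// and deletions are recorded as marks rather than removing entries.
class UpdateSketch {
    private final List<Map<String, String>> docs = new ArrayList<>();
    private final BitSet deleted = new BitSet();

    void addDocument(Map<String, String> doc) {
        docs.add(doc);
    }

    // Marks every document whose field has exactly this value as deleted.
    void deleteDocuments(String field, String value) {
        for (int i = 0; i < docs.size(); i++) {
            if (value.equals(docs.get(i).get(field))) {
                deleted.set(i);
            }
        }
    }

    // Update = delete all documents matching the term, then add the new one.
    void updateDocument(String field, String value, Map<String, String> newDoc) {
        deleteDocuments(field, value);
        addDocument(newDoc);
    }

    boolean hasDeletions() {
        return !deleted.isEmpty();
    }

    int numDocs() { // live documents only
        return docs.size() - deleted.cardinality();
    }

    int maxDoc() { // live plus marked-as-deleted documents
        return docs.size();
    }
}
```

In this model an update leaves the old document in place but marked, so `maxDoc()` grows while `numDocs()` stays the same; reclaiming that space is what a later merge is for.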
  • 12.  Optimize Index  When you index documents, especially many documents or using multiple sessions with IndexWriter, you’ll invariably create an index that has many separate segments.  When you search the index, Lucene must search each segment separately and then combine the results.  This is a tradeoff: the more segments there are, the more separate searches Lucene must run and the more result merging it must do.  An optimized index also consumes fewer file descriptors during searching.  Optimizing only improves searching speed, not indexing speed.
  • 13.  Optimize Index  IndexWriter exposes four methods to optimize:  forceMerge(int maxNumSegments): Forces the merge policy to merge segments until there are <= maxNumSegments.  forceMerge(int maxNumSegments, boolean doWait): Just like forceMerge(int), except you can specify whether the call should block until all merging completes.  forceMergeDeletes(): Forces merging of all segments that have deleted documents.  forceMergeDeletes(boolean doWait): Just like forceMergeDeletes(), except you can specify whether the call should block until the operation completes.
  • 14.  Index Commits  A new index commit is created whenever you invoke one of IndexWriter’s commit methods.  A commit writes all pending changes (added and deleted documents, segment merges, added indexes, etc.) to the index and syncs all referenced index files, so that a reader will see the changes and the index updates will survive an OS or machine crash or power loss.  The steps IndexWriter takes during commit:  Flush any buffered documents and deletions.  Sync all newly created files, including newly flushed files.  Write and sync the next segments_N file.  Remove old commits by calling on IndexDeletionPolicy.
  • 15.  Index Merging  When an index has too many segments, IndexWriter selects some of the segments and merges them into a single, large segment.  There are various merge policies, such as LogMergePolicy, LogDocMergePolicy, etc.  Concurrency, thread safety, and locking issues  Any number of read-only IndexReaders may be open at once on a single index.  Only a single writer may be open on an index at once. Lucene uses a write lock to enforce this.  IndexReaders may be open even while an IndexWriter is making changes to the index. Each IndexReader will always show the index as of the point in time that it was opened. It won’t see any changes being done by the IndexWriter until the writer commits and the reader is reopened.
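The point-in-time behaviour of readers can be illustrated with a toy model in which a reader captures an immutable snapshot of the committed documents when it is opened. This is a sketch under stated assumptions: `SnapshotSketch`, `ToyWriter`, and `ToyReader` are hypothetical names, and real Lucene readers work on segment files rather than in-memory lists.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SnapshotSketch {
    // Toy writer: buffers documents until commit() publishes a new snapshot.
    static final class ToyWriter {
        private final List<String> pending = new ArrayList<>();
        private List<String> committed = Collections.emptyList();

        void addDocument(String doc) {
            pending.add(doc);
        }

        // Publishes buffered documents as a new immutable committed view.
        void commit() {
            List<String> next = new ArrayList<>(committed);
            next.addAll(pending);
            pending.clear();
            committed = Collections.unmodifiableList(next);
        }

        // A reader sees the index as of the moment it is opened.
        ToyReader openReader() {
            return new ToyReader(committed);
        }
    }

    // Toy reader: holds on to the snapshot it was opened with.
    static final class ToyReader {
        private final List<String> view;

        ToyReader(List<String> view) {
            this.view = view;
        }

        int numDocs() {
            return view.size();
        }
    }
}
```

A reader opened before a commit keeps reporting the old document count even after the commit; only a newly opened reader sees the change, which matches the "commit, then reopen" rule above.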
  • 16.  Concurrency, thread safety, and locking issues  The Lucene index lock blocks only concurrent write operations on the index.  Various implementations of LockFactory are:  NoLockFactory  SimpleFSLockFactory  SingleInstanceLockFactory  VerifyingLockFactory
  • 17.  Boosting documents and fields  Index-time boosts are not supported anymore. As a replacement, index-time scoring factors should be indexed into a doc value field and combined at query time using, e.g., FunctionScoreQuery.