Table of Contents
1) Understanding Lucene
2) Lucene Indexing
3) Types of Fields in Lucene Index
4) An example of Lucene Index fields
5) Core Searching classes
6) Types of Queries
7) Incremental Indexing
8) Score Boosting and relevance ranking
9) Scoring Algorithm
10) Sorting search results
11) Handling multiple pages of search results
12) Examples of queries possible with Lucene
13) Abstract storage in Index
14) Security
15) Composition of Segments in Lucene Index
16) Debugging the Lucene indexing process
17) Lucene in Alfresco
18) Alfresco repository architecture
19) Why do we sometimes have redundant data in Index and Database
20) Caching
21) Experience of Lucene implementation
22) Good articles on Lucene
Understanding Lucene

Back to Content page
Lucene doesn't care about the source of the data, its format, or even its language, as long as you can convert it to text. This means you can use Lucene to index and search data stored in many places: web pages on remote web servers, documents in local file systems, records in databases, simple text files, Microsoft Word documents, HTML or PDF files, or any other format from which you can extract textual information.

The quality of a search is typically described using precision and recall metrics. Recall measures how well the search system finds relevant documents, whereas precision measures how well the system filters out the irrelevant ones.

Understanding Lucene

Back to Content page
As you saw in our Indexer class, you need the following classes to perform the simplest indexing procedure:
■ IndexWriter (creates a new index and adds documents to an existing index)
■ Directory (represents the location of a Lucene index. Subclasses: FSDirectory and RAMDirectory)
■ Analyzer (extracts tokens out of the text to be indexed and eliminates the rest)
■ Document (a collection of fields)
■ Field (each field corresponds to a piece of data that is either queried against or retrieved from the index during search)

Lucene Indexing

Back to Content page
Types of Fields in Lucene Index

Back to Content page
An example of Lucene Index fields Back to Content page
Core Searching classes
■ IndexSearcher
■ Term (basic unit for searching; consists of the name of a field and the value of that field)
■ Query (subclasses: TermQuery, BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, RangeQuery, FilteredQuery, and SpanQuery)
■ TermQuery (the most primitive query type)
■ Hits (a simple container of pointers to ranked search results)

Back to Content page
TermQuery: especially useful for retrieving documents by a key. A TermQuery is returned from QueryParser if the expression consists of a single word.

PrefixQuery: matches documents containing terms beginning with a specified string. QueryParser creates a PrefixQuery for a term when it ends with an asterisk (*) in a query expression.

RangeQuery: facilitates searches from a starting term through an ending term:
RangeQuery query = new RangeQuery(begin, end, true);

BooleanQuery: the various query types discussed here can be combined in complex ways using BooleanQuery. BooleanQuery itself is a container of Boolean clauses. A clause is a subquery that can be optional, required, or prohibited; these attributes allow for logical AND, OR, and NOT combinations. You add a clause to a BooleanQuery using this API method:
public void add(Query query, boolean required, boolean prohibited)

PhraseQuery: an index contains positional information of terms. PhraseQuery uses this information to locate documents where terms are within a certain distance of one another.

FuzzyQuery: matches terms similar to a specified term.

Types of Queries

Back to Content page
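To make the matching semantics above concrete, here is a minimal sketch, in plain Java, of how PrefixQuery- and RangeQuery-style term matching can be thought of against a sorted term dictionary. This is an illustration only; the class and method names are hypothetical, and Lucene's real implementation works very differently (against its on-disk term dictionary).

```java
import java.util.ArrayList;
import java.util.List;

public class TermMatching {
    // PrefixQuery semantics: terms beginning with the given string.
    static List<String> prefixMatch(List<String> terms, String prefix) {
        List<String> out = new ArrayList<>();
        for (String t : terms)
            if (t.startsWith(prefix)) out.add(t);
        return out;
    }

    // RangeQuery semantics with inclusive = true: terms between begin and end.
    static List<String> rangeMatch(List<String> terms, String begin, String end) {
        List<String> out = new ArrayList<>();
        for (String t : terms)
            if (t.compareTo(begin) >= 0 && t.compareTo(end) <= 0) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        List<String> dict = List.of("index", "indexer", "lucene", "query", "search");
        System.out.println(prefixMatch(dict, "index"));  // [index, indexer]
        System.out.println(rangeMatch(dict, "l", "q"));  // [lucene]
    }
}
```

The prefix case is what QueryParser produces for an expression such as index*.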
Incremental Indexing

Back to Content page
Incremental Indexing (IndexModifier)

Back to Content page
Score Boosting

Source: http://lucene.apache.org/java/docs/scoring.html
By default, all Documents have no boost; or, rather, they all have the same boost factor of 1.0. By changing a Document's boost factor, you can instruct Lucene to consider it more or less important with respect to other Documents in the index. The API for doing this consists of a single method, setBoost(float), which can be used as follows:

doc.setBoost(1.5f);
writer.addDocument(doc);

When you boost a Document, Lucene internally uses the same boost factor to boost each of its Fields. To boost an individual field:

subjectField.setBoost(1.2f);

The boost factor values you should use depend on what you're trying to achieve; you may need to do a bit of experimentation and tuning to achieve the desired effect. It's worth noting that shorter Fields have an implicit boost associated with them, due to the way Lucene's scoring algorithm works. Boosting is, in general, an advanced feature that many applications can work well without.

Document and Field boosting comes into play at search time. Lucene's search results are ranked according to how closely each Document matches the query, and each matching Document is assigned a score. Lucene's scoring formula consists of a number of factors, and the boost factor is one of them.

Boosting Documents and Fields
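The effect of boosting can be sketched with plain arithmetic: the boost simply multiplies into the match score, so two documents with the same raw match quality end up ranked apart. This is a simplified illustration, not Lucene's actual scoring code, and the numbers are made-up example values.

```java
public class BoostDemo {
    // Boost factors fold multiplicatively into the raw match score.
    static float boostedScore(float rawScore, float docBoost, float fieldBoost) {
        return rawScore * docBoost * fieldBoost;
    }

    public static void main(String[] args) {
        // Same raw score; the boosted document wins the tie.
        float plain  = boostedScore(0.8f, 1.0f, 1.0f);   // default boosts of 1.0
        float pushed = boostedScore(0.8f, 1.5f, 1.2f);   // doc.setBoost(1.5f), field boost 1.2f
        System.out.println(pushed > plain);  // true
    }
}
```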
Relevancy scoring mechanism: the formula used by Lucene to calculate the rank of a document.

Source: http://infotrieve.com/products_services/databases/LSRC_CST.pdf
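As a sketch of the kind of formula referenced above, Lucene's documented DefaultSimilarity-style score has the shape score(q,d) = coord(q,d) * queryNorm(q) * sum over query terms t of ( tf(t,d) * idf(t)^2 * boost(t) * norm(t,d) ), with tf = sqrt(term frequency) and idf = 1 + ln(numDocs / (docFreq + 1)). The Java below works that formula through with made-up inputs (term boost taken as 1.0); it is an illustration, not Lucene's implementation.

```java
public class ScoreSketch {
    static double tf(int freqInDoc)             { return Math.sqrt(freqInDoc); }
    static double idf(int numDocs, int docFreq) { return 1 + Math.log((double) numDocs / (docFreq + 1)); }

    // freqs[i]: occurrences of query term i in the doc; docFreqs[i]: docs containing term i.
    static double score(int[] freqs, int[] docFreqs, int numDocs,
                        double coord, double queryNorm, double fieldNorm) {
        double sum = 0;
        for (int i = 0; i < freqs.length; i++) {
            double idfT = idf(numDocs, docFreqs[i]);
            sum += tf(freqs[i]) * idfT * idfT * fieldNorm;  // term boost assumed 1.0
        }
        return coord * queryNorm * sum;
    }

    public static void main(String[] args) {
        // Two-term query: the rarer term (docFreq 5 vs 500) dominates via idf^2.
        double s = score(new int[]{2, 1}, new int[]{5, 500}, 1000, 1.0, 1.0, 1.0);
        System.out.println(s > 0);  // true
    }
}
```

Note how the squared idf makes rare terms count far more than common ones, which is the tf-idf intuition behind the ranking.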
Query level boosting

Source: http://lucene.apache.org/java/docs/queryparsersyntax.html

Back to Content page
The list of fields to which boost was added, with an explanation as to why. Quoted directly from ServerSide.com: "The date boost has been really important for us. We have data that goes back a long time, and seemed to be returning old reports too often. The date-based booster trick has gotten around this, allowing the newest content to bubble up. The end result is that we now have a nice simple design which allows us to add new sources to our index with minimal development time!"

How ServerSide.com used boost to solve its problem

Source: http://www.theserverside.com/tt/articles/article.tss?l=ILoveLucene

Back to Content page
Scoring Algorithm Back to Content page
Scoring Algorithm

Back to Content page
Now that the Hits object has been initialized, it begins the process of identifying documents that match the query by calling the getMoreDocs method. Assuming we are not sorting (since sorting doesn't affect the raw Lucene score), we call the "expert" search method of the Searcher, passing in our Weight object, Filter, and the number of results we want. This method returns a TopDocs object, which is an internal collection of search results.

The Searcher creates a TopDocCollector and passes it along with the Weight and Filter to another expert search method (for more on the HitCollector mechanism, see Searcher). The TopDocCollector uses a PriorityQueue to collect the top results for the search. If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise, we ask the Weight for a Scorer for the IndexReader of the current searcher and proceed by calling the score method on the Scorer.

At last, we are actually going to score some documents. The score method takes in the HitCollector (most likely the TopDocCollector) and does its business. Of course, here is where things get involved. The Scorer that is returned by the Weight object depends on what type of Query was submitted. In most real-world applications with multiple query terms, the Scorer is going to be a BooleanScorer2.

Assuming a BooleanScorer2 scorer, we first initialize the Coordinator, which is used to apply the coord() factor. We then get an internal Scorer based on the required, optional, and prohibited parts of the query. Using this internal Scorer, the BooleanScorer2 then proceeds into a while loop based on the Scorer#next() method. The next() method advances to the next document matching the query. This is an abstract method in the Scorer class and is thus overridden by all derived implementations.
If you have a simple OR query, your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scorers from the sub-scorers of the OR'd terms.

Scoring Algorithm

Back to Content page
Sorting comes at the expense of resources. More memory is needed to keep the fields used for sorting available. For numeric types, each field being sorted for each document in the index requires that four bytes be cached. For String types, each unique term is also cached for each document. Only the actual fields used for sorting are cached in this manner. We need to plan our system resources accordingly if we want to use the sorting capabilities, knowing that sorting by a String is the most expensive type in terms of resources. Sorting search results
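The four-bytes-per-document figure above makes the cost easy to estimate up front. The back-of-the-envelope calculation below is just that claim turned into code; the method name is made up for the illustration.

```java
public class SortMemory {
    // Per the slide: numeric sorting caches 4 bytes per document, per sorted field.
    static long numericSortCacheBytes(long numDocs, int numSortedFields) {
        return numDocs * 4L * numSortedFields;
    }

    public static void main(String[] args) {
        // 10 million documents sorted on one numeric field -> ~40 MB of cache.
        System.out.println(numericSortCacheBytes(10_000_000L, 1));  // 40000000
    }
}
```

String sorting is harder to bound this way, since every unique term is cached as well, which is why it is the most expensive sort type.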
Handling multiple pages of search results
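One common way to page through a ranked, Hits-style result list is to keep (or re-run) the full search and display only the slice for the requested page. The helper below is an illustration in plain Java, not a Lucene API; the class and method names are hypothetical.

```java
import java.util.List;

public class Pager {
    // Returns the slice of hits for the given 0-based page number.
    static <T> List<T> page(List<T> hits, int pageNumber, int pageSize) {
        int from = pageNumber * pageSize;
        int to = Math.min(from + pageSize, hits.size());
        if (from >= hits.size()) return List.of();  // past the last page
        return hits.subList(from, to);
    }

    public static void main(String[] args) {
        List<String> hits = List.of("d1", "d2", "d3", "d4", "d5");
        System.out.println(page(hits, 1, 2));  // [d3, d4]
        System.out.println(page(hits, 2, 2));  // [d5]
    }
}
```

Since Lucene caches effectively when the same query is re-issued, re-running the search for each page is usually cheaper than it sounds.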
Examples of queries possible with Lucene

Back to Content page
Handling of various types of queries by the QueryParser Back to Content page
Abstract storage in Index

Back to Content page
A security filter is a powerful example, allowing users to see only search results for documents they own, even if their query technically matches other documents that are off limits. Our example assumes documents are associated with an owner, which is known at indexing time. We index two documents; both have the term info in their keywords field, but each document has a different owner:

public class SecurityFilterTest extends TestCase {
  private RAMDirectory directory;

  protected void setUp() throws Exception {
    directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory,
        new WhitespaceAnalyzer(), true);

    // Elwood
    Document document = new Document();
    document.add(Field.Keyword("owner", "elwood"));
    document.add(Field.Text("keywords", "elwoods sensitive info"));
    writer.addDocument(document);

    // Jake
    document = new Document();
    document.add(Field.Keyword("owner", "jake"));
    document.add(Field.Text("keywords", "jakes sensitive info"));
    writer.addDocument(document);

    writer.close();
  }
}

Security

Source: pg. 211, Lucene in Action

Back to Content page
Suppose, though, that Jake is using the search feature in our application, and only documents he owns should be searchable by him. Quite elegantly, we can use a QueryFilter to constrain the search space to only the documents he owns, as shown in listing 5.7.

public void testSecurityFilter() throws Exception {
  directory = new RAMDirectory();
  setUp();

  TermQuery query = new TermQuery(new Term("keywords", "info"));
  IndexSearcher searcher = new IndexSearcher(directory);
  Hits hits = searcher.search(query);
  assertEquals("Both documents match", 2, hits.length());

  QueryFilter jakeFilter = new QueryFilter(
      new TermQuery(new Term("owner", "jake")));
  hits = searcher.search(query, jakeFilter);
  assertEquals(1, hits.length());
  assertEquals("elwood is safe",
      "jakes sensitive info", hits.doc(0).get("keywords"));
}

For this approach we need a field in the index called owner.

Security

Back to Content page
You can constrain a query to a subset of documents another way, by combining the constraining query to the original query as a  required  clause of a BooleanQuery. There are a couple of important differences, despite the fact that the same documents are returned from both. QueryFilter caches the set of documents allowed, probably speeding up successive searches using the same instance. In addition, normalized Hits scores are unlikely to be the same. The score difference makes sense when you’re looking at the scoring formula (see section 3.3, page 78). The IDF factor may be dramatically different. When you’re using BooleanQuery aggregation, all documents containing the terms are factored into the equation, whereas a filter reduces the documents under consideration and impacts the inverse document frequency factor. Security Back to Content page
Composition of Segments in Lucene Index

Each segment index maintains the following:
■ Field names: the set of field names used in the index.
■ Stored field values: for each document, a list of attribute-value pairs, where the attributes are field names. These are used to store auxiliary information about the document, such as its title, URL, or an identifier used to access a database. The set of stored fields is what is returned for each hit when searching. This is keyed by document number.
■ Term dictionary: a dictionary containing all of the terms used in all of the indexed fields of all of the documents. The dictionary also contains the number of documents that contain each term, and pointers to the term's frequency and proximity data.
■ Term frequency data: for each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in each such document.
■ Term proximity data: for each term in the dictionary, the positions at which the term occurs in each document.
■ Normalization factors: for each field in each document, a value that is multiplied into the score for hits on that field.
■ Term vectors: for each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency. To add term vectors to your index, see the Field constructors.
■ Deleted documents: an optional file indicating which documents are deleted.

Back to Content page
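The term dictionary, frequency, and proximity structures listed above can be sketched with a toy in-memory inverted index: each term maps to a postings list holding, per document, the positions where the term occurs (document frequency and term frequency fall out of that). This is a conceptual illustration only, nothing like Lucene's actual segment file format.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TinySegment {
    // term -> (docId -> positions of the term in that doc)
    final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    void addDocument(int docId, String[] tokens) {
        for (int pos = 0; pos < tokens.length; pos++)
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
    }

    // Number of documents containing the term ("term dictionary" entry).
    int docFreq(String term) { return postings.getOrDefault(term, Map.of()).size(); }

    // Frequency of the term within one document ("term frequency data").
    int termFreq(String term, int doc) {
        return postings.getOrDefault(term, Map.of()).getOrDefault(doc, List.of()).size();
    }

    public static void main(String[] args) {
        TinySegment seg = new TinySegment();
        seg.addDocument(0, new String[]{"lucene", "index", "lucene"});
        seg.addDocument(1, new String[]{"index", "search"});
        System.out.println(seg.docFreq("index"));       // 2
        System.out.println(seg.termFreq("lucene", 0));  // 2
    }
}
```

The stored positions are exactly what a PhraseQuery-style search consults to check that terms sit within a given distance of one another.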
We can get Lucene to output information about its indexing operations by setting IndexWriter's public instance variable infoStream to one of the OutputStreams, such as System.out:

IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
writer.infoStream = System.out;

Debugging the Lucene indexing process

Back to Content page
Lucene in Alfresco

There are three possible approaches we can follow:
1) Let Alfresco do the indexing, use its implementation of the search, and load the search results it returns into our page.
2) Let Alfresco do the indexing and directly access its indexes to get query results.
3) Let Alfresco do only the content management, and take care of both the indexing and the searching ourselves.

Back to Content page
Advantages of using Alfresco-created Lucene indexes

Back to Content page
Data dictionary options in Alfresco

Back to Content page
Alfresco Repository Architecture
Lucene Index Structure in Alfresco

The IndexInfo and IndexInfoBackup files are placed outside the node folders.

Source: http://lucene.apache.org/java/docs/fileformats.html
The database can redundantly keep some of the information that can also be found in the Lucene index, for two specific reasons:

■ Failure recovery: if the index somehow becomes corrupted (for example, through disk failure), it can easily and quickly be rebuilt from the data stored in the database without any information loss. This is further leveraged by the fact that the database can reside on a different machine.

■ Access speed: each document is marked with a unique identifier. So, when the application needs to access a certain document by a given identifier, the database can return it more efficiently than Lucene could (the identifier is the primary key of a document in the database). If we employed Lucene here, it would have to search its whole index for the document with the identifier stored in one of the document's fields.

Why do we sometimes have redundant data in Index and Database

Back to Content page
If we are unable to get access to Alfresco's indexing and scoring process, we can possibly add boost to the query itself. It is not yet confirmed whether this will work at all, and if it works, whether it will work fast enough.

Title:Lucene^4 OR Keywords:Lucene^3 OR Contents:Lucene^1

A possible approach to improve hit relevancy in Alfresco

Back to Content page
Lucene has an internal caching mechanism for filters. Lucene comes with a simple cache mechanism if you use Lucene Filters; the classes to look at are CachingWrapperFilter and QueryFilter.

For example, let's say we wanted to let users search JUST the last 30 days' worth of content. We could run the filter ONCE and then cache it with the Term clause used to run the query. Then we could reuse the same filter for every user until we have to optimize() the index again; as long as the document numbers stay the same, we don't have much more to do. But this will probably not be of much use to us, since we will need to optimize the index often.

Caching mechanism

Back to Content page
Caching mechanism

Top searched keywords are obtained from the logs, searched for in the index beforehand, and their results cached. The flow at query time:
■ The Searcher checks whether the query matches one of the top keywords.
■ If the query term matches a cached keyword, results are fetched from the Top Keyword results cache.
■ If the query term doesn't match a cached keyword, the search runs against the Lucene index as usual.
■ A cache expiring and refreshing mechanism keeps the cache current, including regular updating of the top-keywords list.

Back to Content page
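The flow above can be sketched as a small cache that is warmed with the keywords mined from the query logs and falls through to the index for everything else. All names here are illustrative, not from any real API, and the searcher is stubbed out with a lambda.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class TopKeywordCache {
    private final Map<String, List<String>> cache = new HashMap<>();
    private final Function<String, List<String>> indexSearcher;  // the real index search

    TopKeywordCache(Function<String, List<String>> indexSearcher) {
        this.indexSearcher = indexSearcher;
    }

    // Warm the cache for the top keywords obtained from the logs.
    void warm(List<String> topKeywords) {
        for (String k : topKeywords) cache.put(k, indexSearcher.apply(k));
    }

    // Serve from cache when the query matches a top keyword, else hit the index.
    List<String> search(String query) {
        List<String> cached = cache.get(query);
        return cached != null ? cached : indexSearcher.apply(query);
    }

    public static void main(String[] args) {
        TopKeywordCache c = new TopKeywordCache(q -> List.of("result-for-" + q));
        c.warm(List.of("lucene"));
        System.out.println(c.search("lucene"));    // served from the warmed cache
        System.out.println(c.search("alfresco"));  // falls through to the searcher
    }
}
```

A real implementation would also need the expiry and refresh mechanism the slide mentions, since cached results go stale whenever the index changes.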
Question: "I gave ParallelMultiSearcher a try and it was significantly slower than simply iterating through the indexes one at a time. Our new plan is to somehow have only one index per search machine and a larger main index stored on the master. What I'm interested to know is whether having one extremely large index for the master and then splitting the index into several smaller indexes (if this is possible) would be better than having several smaller indexes and merging them on the search machines into one index. I would also be interested to know how others have divided up search work across a cluster."

Answer: "I'm responsible for the webshots.com search index and we've had very good results with Lucene. It currently indexes over 100 million documents and performs 4 million searches per day. We initially tested running multiple small copies with a MultiSearcher and merging results, as compared to running a very large single index. We actually found that the single large instance performed better. To improve load handling we clustered multiple identical copies together, then session-bind a user to a particular server and cache the results, but each server is running a single index. Our index is currently about 40 GB. The advantage of binding a user is that once a search is performed, caching within Lucene and in the application is very effective if subsequent searches go back to the same box. Our initial searches are usually in the sub-100 ms range, while subsequent requests for deeper pages in the search are returned instantly."

Experience of Lucene implementation @ webshots.com

Back to Content page
Example of Mail messages Indexing

Note: Tokenization: the method for indexing is word by word. Certain common patterns, such as phone numbers, email addresses, and domain names, are tokenized as shown in the following figure.

Source: http://wiki.zimbra.com/index.php?title=Zimbra_Server

Back to Content page
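A rough sketch of the tokenization idea above: index word by word, but keep common patterns such as email addresses as single tokens. The regex and behaviour here are illustrative guesses for the concept, not Zimbra's actual tokenization rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MailTokenizer {
    // Try to match an email address first; otherwise fall back to a plain word.
    private static final Pattern TOKEN =
        Pattern.compile("[A-Za-z0-9._-]+@[A-Za-z0-9.-]+|[A-Za-z0-9]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text.toLowerCase());
        while (m.find()) tokens.add(m.group());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Mail from jake@example.com about Lucene"));
        // [mail, from, jake@example.com, about, lucene]
    }
}
```

Keeping the address whole means a search for jake@example.com matches directly, instead of requiring a phrase match over the fragments jake, example, and com.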
Good Articles on Lucene

http://www.theserverside.com/tt/articles/article.tss?l=ILoveLucene
http://www.javaworld.com/javaworld/jw-09-2000/jw-0915-lucene.html?page=1
http://technology.amis.nl/blog/?p=1288
http://powerbuilder.sys-con.com/read/42488.htm
http://www-128.ibm.com/developerworks/library/wa-lucene2/

Spell checking: http://today.java.net/pub/a/today/2005/08/09/didyoumean.html
Lucene integration with Hibernate: http://www.hibernate.org/hib_docs/search/reference/en/html_single/
Lucene with Spring: http://technology.amis.nl/blog/?p=1248 (it talks about Spring Modules)

Back to Content page

Preview of Custom Search Admin Tools
Axiell ALM
 
Getting Started With Elasticsearch In .NET
Getting Started With Elasticsearch In .NETGetting Started With Elasticsearch In .NET
Getting Started With Elasticsearch In .NET
Ahmed Abd Ellatif
 
Getting started with Elasticsearch in .net
Getting started with Elasticsearch in .netGetting started with Elasticsearch in .net
Getting started with Elasticsearch in .net
Ismaeel Enjreny
 
Technologies for Websites
Technologies for WebsitesTechnologies for Websites
Technologies for Websites
Compare Infobase Limited
 
Technical Utilities for your Site
Technical Utilities for your SiteTechnical Utilities for your Site
Technical Utilities for your Site
Compare Infobase Limited
 
Context Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic WebContext Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic Web
IOSR Journals
 
Search engine. Elasticsearch
Search engine. ElasticsearchSearch engine. Elasticsearch
Search engine. Elasticsearch
Selecto
 
A Review of Elastic Search: Performance Metrics and challenges
A Review of Elastic Search: Performance Metrics and challengesA Review of Elastic Search: Performance Metrics and challenges
A Review of Elastic Search: Performance Metrics and challenges
rahulmonikasharma
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
Divij Sehgal
 
MARUTHI_INVERTED_SEARCH_presentation.pptx
MARUTHI_INVERTED_SEARCH_presentation.pptxMARUTHI_INVERTED_SEARCH_presentation.pptx
MARUTHI_INVERTED_SEARCH_presentation.pptx
MaruthiRock
 
Searching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerSearching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal Computer
IOSR Journals
 
Database management system session 6
Database management system session 6Database management system session 6
Database management system session 6
Infinity Tech Solutions
 

Similar to Lucene basics (20)

Apache lucene
Apache luceneApache lucene
Apache lucene
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
 
Lucene
LuceneLucene
Lucene
 
Page 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docxPage 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docx
 
 
unit 4,Indexes in database.docx
unit 4,Indexes in database.docxunit 4,Indexes in database.docx
unit 4,Indexes in database.docx
 
Overview of Indexing In Object Oriented Database
Overview of Indexing In Object Oriented DatabaseOverview of Indexing In Object Oriented Database
Overview of Indexing In Object Oriented Database
 
Query Optimization in MongoDB
Query Optimization in MongoDBQuery Optimization in MongoDB
Query Optimization in MongoDB
 
Preview of Custom Search Admin Tools
Preview of Custom Search Admin ToolsPreview of Custom Search Admin Tools
Preview of Custom Search Admin Tools
 
Getting Started With Elasticsearch In .NET
Getting Started With Elasticsearch In .NETGetting Started With Elasticsearch In .NET
Getting Started With Elasticsearch In .NET
 
Getting started with Elasticsearch in .net
Getting started with Elasticsearch in .netGetting started with Elasticsearch in .net
Getting started with Elasticsearch in .net
 
Technologies for Websites
Technologies for WebsitesTechnologies for Websites
Technologies for Websites
 
Technical Utilities for your Site
Technical Utilities for your SiteTechnical Utilities for your Site
Technical Utilities for your Site
 
Context Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic WebContext Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic Web
 
Search engine. Elasticsearch
Search engine. ElasticsearchSearch engine. Elasticsearch
Search engine. Elasticsearch
 
A Review of Elastic Search: Performance Metrics and challenges
A Review of Elastic Search: Performance Metrics and challengesA Review of Elastic Search: Performance Metrics and challenges
A Review of Elastic Search: Performance Metrics and challenges
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
MARUTHI_INVERTED_SEARCH_presentation.pptx
MARUTHI_INVERTED_SEARCH_presentation.pptxMARUTHI_INVERTED_SEARCH_presentation.pptx
MARUTHI_INVERTED_SEARCH_presentation.pptx
 
Searching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerSearching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal Computer
 
Database management system session 6
Database management system session 6Database management system session 6
Database management system session 6
 

Recently uploaded

leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
alexjohnson7307
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
Alex Pruden
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Precisely
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
Shinana2
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
HarisZaheer8
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
Data Hops
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 

Recently uploaded (20)

leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 

Lucene basics

An example of Lucene Index fields
Back to Content page
Core Searching classes
■ IndexSearcher
■ Term (the basic unit for searching; consists of the name of a field and a value for that field)
■ Query (subclasses: TermQuery, BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, RangeQuery, FilteredQuery, and SpanQuery)
■ TermQuery (the most primitive query type)
■ Hits (a simple container of pointers to ranked search results)
Back to Content page
Types of Queries

TermQuery is especially useful for retrieving documents by a key. QueryParser returns a TermQuery if the expression consists of a single word.

PrefixQuery matches documents containing terms beginning with a specified string. QueryParser creates a PrefixQuery for a term when it ends with an asterisk (*) in a query expression.

RangeQuery facilitates searches from a starting term through an ending term:
RangeQuery query = new RangeQuery(begin, end, true);

BooleanQuery combines the other query types in arbitrarily complex ways. A BooleanQuery is a container of Boolean clauses, where a clause is a subquery that can be optional, required, or prohibited. These attributes allow for logical AND, OR, and NOT combinations. You add a clause to a BooleanQuery using this API method:
public void add(Query query, boolean required, boolean prohibited)

PhraseQuery uses the positional information of terms stored in the index to locate documents where terms occur within a certain distance of one another.

FuzzyQuery matches terms similar to a specified term.

Back to Content page
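The query types above all ultimately resolve to operations on posting lists. As a rough, self-contained sketch (plain Java, not the Lucene API), a term query is an exact posting-list lookup, a prefix query is a union over matching terms, and two required boolean clauses are an intersection:

```java
import java.util.*;

// Conceptual sketch of how term, prefix, and boolean queries resolve
// against an inverted index (term -> set of document ids).
// Illustrative only; this is not the Lucene API.
public class MiniQueries {
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+"))
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }

    // TermQuery: exact lookup of a single term's posting list.
    public Set<Integer> term(String t) {
        return postings.getOrDefault(t, Set.of());
    }

    // PrefixQuery: union of posting lists of all terms with the prefix.
    public Set<Integer> prefix(String p) {
        Set<Integer> out = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> e : postings.entrySet())
            if (e.getKey().startsWith(p)) out.addAll(e.getValue());
        return out;
    }

    // BooleanQuery with two required clauses: intersect posting lists (AND).
    public Set<Integer> and(String a, String b) {
        Set<Integer> out = new TreeSet<>(term(a));
        out.retainAll(term(b));
        return out;
    }
}
```

A real BooleanQuery also handles optional and prohibited clauses, which map to union and set-difference over the same posting lists.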
Boosting Documents and Fields

By default, all Documents have the same boost factor of 1.0. By changing a Document's boost factor, you can instruct Lucene to consider it more or less important with respect to other Documents in the index. The API for doing this is a single method, setBoost(float), used as follows:

doc.setBoost(1.5f);
writer.addDocument(doc);

When you boost a Document, Lucene internally uses the same boost factor to boost each of its Fields. A single Field can also be boosted: subjectField.setBoost(1.2f);

The boost values you should use depend on what you are trying to achieve; expect some experimentation and tuning to get the desired effect. It is worth noting that shorter Fields carry an implicit boost, due to the way Lucene's scoring algorithm works. Boosting is, in general, an advanced feature that many applications work well without.

Document and Field boosting comes into play at search time: Lucene's search results are ranked according to how closely each Document matches the query, and each matching Document is assigned a score. Lucene's scoring formula consists of a number of factors, and the boost factor is one of them.
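As a toy illustration of where the boost enters (not Lucene's actual formula, which has several more factors), the document and field boosts are simply multiplicative factors in the final score:

```java
// Toy illustration of boosting: the raw match score is multiplied by the
// document boost and the field boost. Numbers are illustrative only;
// Lucene's real formula includes tf, idf, norms, and coord as well.
public class BoostSketch {
    public static float score(float rawMatchScore, float docBoost, float fieldBoost) {
        return rawMatchScore * docBoost * fieldBoost;
    }
}
```

With equal raw match scores, a document boosted to 1.5 outranks an unboosted one, which is exactly the "consider it more important" effect described above.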
Relevancy scoring mechanism: the formula used by Lucene to calculate the rank of a document.
Source: http://infotrieve.com/products_services/databases/LSRC_CST.pdf
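The classic Lucene TF-IDF scoring formula can be sketched in code. The tf and idf forms below follow the classic Lucene defaults (tf = sqrt(frequency), idf = 1 + ln(numDocs / (docFreq + 1))); treat this as an illustrative reconstruction rather than the exact implementation:

```java
// Sketch of one term's contribution to the classic Lucene score:
//   coord * tf(t in d) * idf(t)^2 * termBoost * fieldNorm
// tf and idf use the classic default shapes; illustrative only.
public class ClassicScore {
    // idf rewards rare terms: few matching docs -> large idf.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // tf rewards repeated occurrences, with diminishing returns.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // coord is the fraction of query terms the document matched;
    // norm folds in field length and index-time boosts.
    public static double termScore(int freq, int docFreq, int numDocs,
                                   double boost, double norm, double coord) {
        double i = idf(docFreq, numDocs);
        return coord * tf(freq) * i * i * boost * norm;
    }
}
```

The two asserts worth keeping in mind: more occurrences of a term raise the score, and a term found in fewer documents contributes more than a common one.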
How ServerSide.com used boost to solve its problem

The list of the fields to which boost was added, with an explanation as to why. Quoted directly from ServerSide.com:

"The date boost has been really important for us. We have data that goes back for a long time, and seemed to be returning 'old reports' too often. The date-based booster trick has gotten around this, allowing for the newest content to bubble up. The end result is that we now have a nice simple design which allows us to add new sources to our index with minimal development time!"

Source: http://www.theserverside.com/tt/articles/article.tss?l=ILoveLucene
Back to Content page
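The "date booster" quoted above can be sketched as a boost factor that decays with document age, so newer content bubbles up. The exponential shape and the 365-day half-life below are illustrative assumptions, not ServerSide.com's actual values:

```java
// Sketch of a date-based boost: newer documents get a boost near 1.0,
// older ones decay toward 0. The half-life of 365 days is an
// illustrative assumption, not a value from the source article.
public class DateBoost {
    private static final double HALF_LIFE_DAYS = 365.0;

    public static float boostForAgeDays(long ageDays) {
        return (float) Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
    }
}
```

In practice this value would be set at index time via doc.setBoost(...), so the decay is baked into the stored norms.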
Scoring Algorithm
Back to Content page
Scoring Algorithm

Now that the Hits object has been initialized, it begins identifying documents that match the query by calling its getMoreDocs method. Assuming we are not sorting (since sorting doesn't affect the raw Lucene score), we call the "expert" search method of the Searcher, passing in our Weight object, the Filter, and the number of results we want. This method returns a TopDocs object, an internal collection of search results.

The Searcher creates a TopDocCollector and passes it, along with the Weight and Filter, to another expert search method (for more on the HitCollector mechanism, see Searcher). The TopDocCollector uses a PriorityQueue to collect the top results for the search. If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise, we ask the Weight for a Scorer for the IndexReader of the current searcher and proceed by calling the score method on the Scorer.

At last, we actually score some documents. The score method takes in the HitCollector (most likely the TopDocCollector) and does its work. Here is where things get involved: the Scorer returned by the Weight object depends on the type of Query that was submitted. In most real-world applications with multiple query terms, the Scorer is a BooleanScorer2.

Assuming a BooleanScorer2, we first initialize the Coordinator, which is used to apply the coord() factor. We then get an internal Scorer based on the required, optional, and prohibited parts of the query. Using this internal Scorer, BooleanScorer2 proceeds in a while loop driven by the Scorer's next() method, which advances to the next document matching the query. next() is an abstract method in the Scorer class and is thus overridden by all derived implementations. If you have a simple OR query, your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scorers of the OR'd terms.

Back to Content page
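The TopDocCollector's use of a PriorityQueue can be sketched as a size-bounded min-heap over (docId, score) pairs: as the Scorer streams hits, the collector keeps only the best k seen so far. This is a conceptual model, not Lucene's internal code:

```java
import java.util.*;

// Sketch of top-k collection: a min-heap of size k ordered by score.
// Each incoming hit is offered; once the heap exceeds k, the current
// worst hit is evicted. Hits are double[] pairs: {docId, score}.
public class TopK {
    public static List<double[]> collect(double[][] scoredDocs, int k) {
        PriorityQueue<double[]> heap =
            new PriorityQueue<>(Comparator.comparingDouble(e -> e[1]));
        for (double[] hit : scoredDocs) {
            heap.offer(hit);
            if (heap.size() > k) heap.poll(); // evict current worst
        }
        List<double[]> top = new ArrayList<>(heap);
        top.sort((a, b) -> Double.compare(b[1], a[1])); // best first
        return top;
    }
}
```

The point of the heap is that collecting the top k of n hits costs O(n log k) and O(k) memory, regardless of how many documents match.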
Sorting search results

Sorting comes at the expense of resources: more memory is needed to keep the fields used for sorting available. For numeric types, each field being sorted requires four bytes to be cached per document in the index. For String types, each unique term is also cached for each document. Only the fields actually used for sorting are cached in this manner. We need to plan our system resources accordingly if we want to use the sorting capabilities, knowing that sorting by a String is the most expensive type in terms of resources.
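A back-of-the-envelope estimate for the numeric case follows directly from the slide's figure of four bytes cached per document per sort field:

```java
// Rough memory estimate for numeric sort caches, per the figure above:
// 4 bytes per document per numeric sort field. Illustrative arithmetic
// only; String sorting additionally caches each unique term.
public class SortCost {
    public static long numericSortBytes(long numDocs, int numSortFields) {
        return numDocs * 4L * numSortFields;
    }
}
```

For example, sorting a million-document index on two numeric fields costs about 8 MB of cache, which is why sort fields should be planned rather than added casually.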
Handling of various types of queries by the QueryParser
Back to Content page
Security

A security filter is a powerful example, allowing users to see only search results for documents they own, even if their query technically matches other documents that are off limits. This example constrains documents with security in mind: we assume each document is associated with an owner, which is known at indexing time. We index two documents; both have the term info in their keywords field, but each document has a different owner:

public class SecurityFilterTest extends TestCase {
  private RAMDirectory directory;

  protected void setUp() throws Exception {
    directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true);

    // Elwood
    Document document = new Document();
    document.add(Field.Keyword("owner", "elwood"));
    document.add(Field.Text("keywords", "elwoods sensitive info"));
    writer.addDocument(document);

    // Jake
    document = new Document();
    document.add(Field.Keyword("owner", "jake"));
    document.add(Field.Text("keywords", "jakes sensitive info"));
    writer.addDocument(document);

    writer.close();
  }
}

Source: p. 211, Lucene in Action
Back to Content page
Security

Suppose, though, that Jake is using the search feature in our application, and only documents he owns should be searchable by him. Quite elegantly, we can use a QueryFilter to constrain the search space to only the documents he owns, as shown in listing 5.7:

public void testSecurityFilter() throws Exception {
  directory = new RAMDirectory();
  setUp();

  TermQuery query = new TermQuery(new Term("keywords", "info"));
  IndexSearcher searcher = new IndexSearcher(directory);

  Hits hits = searcher.search(query);
  assertEquals("Both documents match", 2, hits.length());

  QueryFilter jakeFilter = new QueryFilter(
      new TermQuery(new Term("owner", "jake")));

  hits = searcher.search(query, jakeFilter);
  assertEquals(1, hits.length());
  assertEquals("elwood is safe",
      "jakes sensitive info", hits.doc(0).get("keywords"));
}

To use this approach, the index must contain a field called owner.

Back to Content page
Security

You can constrain a query to a subset of documents another way: by adding the constraining query to the original query as a required clause of a BooleanQuery. There are a couple of important differences, despite the fact that the same documents are returned by both approaches. QueryFilter caches the set of allowed documents, probably speeding up successive searches that use the same instance. In addition, the normalized Hits scores are unlikely to be the same. The score difference makes sense when you look at the scoring formula (see section 3.3, page 78): the IDF factor may be dramatically different. When you use BooleanQuery aggregation, all documents containing the terms are factored into the equation, whereas a filter reduces the documents under consideration and impacts the inverse document frequency factor.

Back to Content page
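Conceptually, the filter is a precomputed bit set of permitted documents that is intersected with the query's matches, without ever entering the scoring formula. A minimal sketch of that filtering step (plain Java, not the Lucene API):

```java
import java.util.BitSet;

// Sketch of what a QueryFilter does conceptually: the filter is a
// precomputed bit set of allowed documents (bit index = doc number),
// and a query hit survives only if its bit is on. Unlike a required
// BooleanQuery clause, the filter bits never affect scores.
public class FilteredSearch {
    public static BitSet apply(BitSet queryMatches, BitSet filter) {
        BitSet visible = (BitSet) queryMatches.clone();
        visible.and(filter); // set intersection
        return visible;
    }
}
```

Because the filter is just a bit set keyed by document number, it can be computed once and reused across searches, which is exactly what makes caching it worthwhile.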
Composition of Segments in Lucene Index

Each segment index maintains the following:
■ Field names. The set of field names used in the index.
■ Stored Field values. For each document, a list of attribute-value pairs, where the attributes are field names. These are used to store auxiliary information about the document, such as its title, URL, or an identifier used to access a database. The set of stored fields is what is returned for each hit when searching. This is keyed by document number.
■ Term dictionary. A dictionary containing all of the terms used in all of the indexed fields of all of the documents. The dictionary also contains the number of documents which contain each term, and pointers to the term's frequency and proximity data.
■ Term frequency data. For each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in each document.
■ Term proximity data. For each term in the dictionary, the positions at which the term occurs in each document.
■ Normalization factors. For each field in each document, a value that is multiplied into the score for hits on that field.
■ Term vectors. For each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency. To add term vectors to your index, see the Field constructors.
■ Deleted documents. An optional file indicating which documents are deleted.

Back to Content page
Debugging the Lucene indexing process

We can get Lucene to output information about its indexing operations by setting IndexWriter's public instance variable infoStream to a print stream such as System.out:

IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
writer.infoStream = System.out;

Back to Content page
Lucene in Alfresco

There are three possible approaches we can follow:
1) Let Alfresco do the indexing, use its implementation of search, and load the search results it returns into our page.
2) Let Alfresco do the indexing, and directly access its indexes to get query results.
3) Let Alfresco do only the content management, while we take care of both the indexing and the searching.

Back to Content page
Why do we sometimes have redundant data in Index and Database

The database can redundantly keep some of the information that is also found in the Lucene index, for two specific reasons:
■ Failure recovery. If the index somehow becomes corrupted (for example, through disk failure), it can easily and quickly be rebuilt from the data stored in the database without any information loss. This is further leveraged by the fact that the database can reside on a different machine.
■ Access speed. Each document is marked with a unique identifier, which is the primary key of the document in the database. When the application needs to access a certain document by a given identifier, the database can return it more efficiently than Lucene could: Lucene would have to search its whole index for the document whose identifier is stored in one of the document's fields.

Back to Content page
If we are unable to get access to Alfresco's indexing and scoring process, we can instead add boosts to the query itself. It is not yet confirmed whether this will work at all, and if it works, whether it will be fast enough:
Title:Lucene^4 OR Keywords:Lucene^3 OR Contents:Lucene^1
A possible approach to improve hit relevancy in Alfresco Back to Content page
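The same boosted query can be built programmatically. This is a hedged sketch using the classic Lucene API (the Title/Keywords/Contents field names are taken from the query above, and the analysis step is omitted): each field clause gets a boost so that title matches outrank keyword matches, which outrank body matches.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class BoostedQueryExample {
    public static BooleanQuery buildQuery(String text) {
        TermQuery title = new TermQuery(new Term("Title", text));
        title.setBoost(4f);       // title hits count four times as much
        TermQuery keywords = new TermQuery(new Term("Keywords", text));
        keywords.setBoost(3f);
        TermQuery contents = new TermQuery(new Term("Contents", text));
        contents.setBoost(1f);    // baseline weight

        // SHOULD clauses: a match in any field qualifies the document,
        // and matches in several fields add up in the score.
        BooleanQuery query = new BooleanQuery();
        query.add(title, BooleanClause.Occur.SHOULD);
        query.add(keywords, BooleanClause.Occur.SHOULD);
        query.add(contents, BooleanClause.Occur.SHOULD);
        return query;
    }
}
```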
Lucene does come with a simple internal caching mechanism if you use Lucene Filters. The classes to look at are CachingWrapperFilter and QueryFilter. For example, let's say we wanted to let users search just the last 30 days' worth of content. We could build the filter once, cache it together with the term clause used to run the query, and then reuse the same filter for every user until the index has to be optimize()d again. As long as the document numbers stay the same, there is not much more to do. But this will probably not be of much use to us, since we will need to optimize the index often.
Caching mechanism Back to Content page
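As a hedged sketch of that idea, using the classic Lucene API described in the slide (the "modified" field name and date-string bounds are assumptions): a date-range filter is wrapped in CachingWrapperFilter so its bit set is computed once per index reader and reused on later searches.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.RangeQuery;

public class CachedFilterExample {
    // Build once and keep a reference; the wrapper caches the
    // filter's bit set, so repeated searches skip recomputing
    // which documents fall inside the date range.
    static Filter last30Days(String from, String to) {
        Query range = new RangeQuery(
            new Term("modified", from), new Term("modified", to), true);
        return new CachingWrapperFilter(new QueryFilter(range));
    }

    static Hits search(IndexSearcher searcher, Query q, Filter f)
            throws Exception {
        return searcher.search(q, f); // filter narrows the result set
    }
}
```

Note that the cached bit sets are keyed by document number, which is why the cache only stays valid while the index is not optimized or otherwise reorganized.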
Caching mechanism A keyword-results cache can sit between the searcher and the Lucene index:
1) A list of top searched keywords is obtained from the logs.
2) The top keywords are searched for in the index beforehand and their results are cached.
3) The searcher checks whether an incoming query matches one of the top keywords.
4) If the query term matches a cached keyword, the results are fetched from the cache.
5) If the query term does not match a cached keyword, the search goes to the index.
6) A cache expiring and refreshing mechanism (including regular updating of the top-keywords list) keeps the cache current.
Back to Content page
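The flow above can be sketched in plain Java. This is a minimal, hypothetical illustration (class and method names are made up, and the index search is stubbed): results for the top keywords are precomputed into a map, and only queries that miss the cache fall through to the index.

```java
import java.util.HashMap;
import java.util.Map;

public class KeywordCache {
    private final Map<String, String> cache = new HashMap<String, String>();
    int indexCalls = 0; // counts how often the real index is hit

    // Precompute results for the top keywords taken from the logs.
    public void warm(String... topKeywords) {
        for (String k : topKeywords) {
            cache.put(k, searchIndex(k));
        }
    }

    public String search(String query) {
        String hit = cache.get(query);   // check the cache first
        return hit != null ? hit : searchIndex(query);
    }

    // Stand-in for a real Lucene search against the index.
    String searchIndex(String query) {
        indexCalls++;
        return "results-for:" + query;
    }
}
```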
Question: "I gave ParallelMultiSearcher a try and it was significantly slower than simply iterating through the indexes one at a time. Our new plan is to somehow have only one index per search machine and a larger main index stored on the master. What I'm interested to know is whether having one extremely large index for the master and then splitting the index into several smaller indexes (if this is possible) would be better than having several smaller indexes and merging them on the search machines into one index. I would also be interested to know how others have divided up search work across a cluster."
Answer: "I'm responsible for the webshots.com search index and we've had very good results with Lucene. It currently indexes over 100 million documents and performs 4 million searches per day. We initially tested running multiple small copies with a MultiSearcher and merging results, as compared to running a very large single index. We actually found that the single large instance performed better. To improve load handling we clustered multiple identical copies together, then session-bind a user to a particular server and cache the results, but each server is running a single index. Our index is currently about 40 GB. The advantage of binding a user is that once a search is performed, caching within Lucene and in the application is very effective if subsequent searches go back to the same box. Our initial searches are usually in the sub-100 ms range, while subsequent requests for deeper pages in the search are returned instantly."
Experience of Lucene implementation @ webshots.com Back to Content page
Good Articles on Lucene
http://www.theserverside.com/tt/articles/article.tss?l=ILoveLucene
http://www.javaworld.com/javaworld/jw-09-2000/jw-0915-lucene.html?page=1
http://technology.amis.nl/blog/?p=1288
http://powerbuilder.sys-con.com/read/42488.htm
http://www-128.ibm.com/developerworks/library/wa-lucene2/
Spell checking: http://today.java.net/pub/a/today/2005/08/09/didyoumean.html
Lucene integration with Hibernate: http://www.hibernate.org/hib_docs/search/reference/en/html_single/
Lucene with Spring (it talks about Spring Modules): http://technology.amis.nl/blog/?p=1248
Back to Content page