IR with lucene
Presentation at the Greek Java Hellenic group about the open source search engine Lucene

Usage Rights

© All Rights Reserved


Presentation Transcript

  • Introduction to Information Retrieval with Lucene, by Stylianos Gkorilas
  • Introductions
    - Presenter: Architect/Development Team Leader @ Trasys Greece; Java EE projects for European Agencies
    - IR (Information Retrieval): the tracing and recovery of specific information from stored data. IR is interdisciplinary, based on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, and statistics.
    - Lucene
      - Open source, Apache Software License (
      - Founder: Doug Cutting
      - 0.01 release in March 2000 (SourceForge)
      - 1.2 release in June 2002 (first Apache Jakarta release)
      - Became its own top-level Apache project in 2005
      - Current version is 3.1
  • More Lucene Intro…
    - Lucene is a high-performance, scalable IR library (not a ready-to-use application)
    - A number of full-featured search applications are built on top of it (more later…)
    - Lucene ports and bindings exist in many other programming environments, incl. Perl, Python, Ruby, C/C++, PHP and C# (.NET)
    - Lucene 'Powered By' apps (a few of many): LinkedIn, Apple, MySpace, Eclipse IDE, MS Outlook, Atlassian (JIRA). See more @
  • Components of a Search Application (1/4)
    - Acquire Content
      - Gather and scope the content, e.g. from the web with a spider or crawler, from a CMS, a database, or a file system
      - Projects that help:
        - Solr: handles RDBMS and XML feeds, and rich documents through Tika integration
        - Nutch: web crawler, sister project at Apache
        - Grub: open source web crawler
  • Components of a Search Application (2/4)
    - Build Document
      - Define the document: the unit of the search engine; has fields; de-normalization is involved
      - Projects that help (usually the same frameworks cover both this and the previous step):
        - Compass and its evolution, ElasticSearch
        - Hibernate Search
        - DBSight
        - Oracle/Lucene integration
  • Components of a Search Application (3/4)
    - Analyze Document
      - Handled by Analyzers (built-in and contributed)
      - Built with tokenizers and token filters
    - Index Document
      - Through the Lucene API or your framework of choice
    - Search User Interface/Render Results
      - Application specific
  • Components of a Search Application (4/4)
    - Query Builder
      - Lucene provides one
      - Frameworks provide extensions, but so can the application itself, e.g. advanced search
    - Run Query
      - Retrieve documents by running the query built
      - Three common theoretical models: the Boolean model, the vector space model, the probabilistic model
    - Administration
      - e.g. tuning options
    - Analytics
      - Reporting
  • How Lucene models content
    - Documents
    - Fields
    - Denormalization of content
    - Flexible schema
    - Inverted index
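    The inverted index mentioned above can be sketched in a few lines of plain Java. This is only an illustration of the concept, not Lucene's implementation: the real index also stores term frequencies, positions, and norms in segment files, and the class and method names below are made up for the example.

    ```java
    import java.util.*;

    // A toy inverted index: maps each term to the sorted set of document IDs
    // containing it. Illustrative only; not Lucene's actual data structures.
    public class ToyInvertedIndex {
        private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

        // "Analyze" a document very crudely: lowercase and split on whitespace.
        public void add(int docId, String text) {
            for (String term : text.toLowerCase().split("\\s+")) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }

        // Documents containing a single term.
        public SortedSet<Integer> search(String term) {
            return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
        }

        // AND of two terms: intersect the posting lists.
        public SortedSet<Integer> searchAnd(String t1, String t2) {
            SortedSet<Integer> result = new TreeSet<>(search(t1));
            result.retainAll(search(t2));
            return result;
        }

        public static void main(String[] args) {
            ToyInvertedIndex idx = new ToyInvertedIndex();
            idx.add(1, "the JHUG meeting is on Saturday");
            idx.add(2, "the meeting was cancelled");
            System.out.println(idx.search("meeting"));                // [1, 2]
            System.out.println(idx.searchAnd("meeting", "saturday")); // [1]
        }
    }
    ```

    Searching a term is then a map lookup plus set operations over posting lists, which is why term-centric queries are fast regardless of document size.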
  • Basic Lucene Classes
    - Indexing: IndexWriter, Directory, Analyzer, Document, Field
    - Searching: IndexSearcher, Query, TopDocs, Term, QueryParser
  • Basic Indexing
    - Adding documents:
        RAMDirectory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(),
            IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("post", "the JHUG meeting is on this Saturday",
            Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
    - Deleting and updating documents
    - Field options: Store, Analyze, Norms, Term vectors, Boost
  • Scoring – The formula
    - tf(t in d): Term frequency factor for the term (t) in the document (d), i.e. how many times the term t occurs in the document.
    - idf(t): Inverse document frequency of the term: a measure of how "unique" the term is. Very common terms have a low idf; very rare terms have a high idf.
    - boost(t.field in d): Field and document boost, as set during indexing. This may be used to statically boost certain fields and certain documents over others.
    - lengthNorm(t.field in d): Normalization value of a field, given the number of terms within the field. This value is computed during indexing and stored in the index norms. Shorter fields (fewer tokens) get a bigger boost from this factor.
    - coord(q, d): Coordination factor, based on the number of query terms the document contains. The coordination factor gives an AND-like boost to documents that contain more of the search terms than other documents.
    - queryNorm(q): Normalization value for a query, given the sum of the squared weights of each of the query terms.
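    For reference, these factors combine per query term roughly as in the classic Lucene 2.x/3.x practical scoring function (as documented in the Similarity javadoc of that era; details vary slightly by version, and norm(t, d) folds the index-time boosts and lengthNorm into one stored value):

    ```latex
    \mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
    \sum_{t \in q} \mathrm{tf}(t \in d)\cdot \mathrm{idf}(t)^{2}\cdot
    \mathrm{boost}(t)\cdot \mathrm{norm}(t,d)
    ```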
  • Querying – the API
    - Variety of Query class implementations:
      - TermQuery
      - PhraseQuery
      - TermRangeQuery
      - NumericRangeQuery
      - PrefixQuery
      - BooleanQuery
      - WildcardQuery
      - FuzzyQuery
      - MatchAllDocsQuery
      - …
  • Querying – Example
        private void indexSingleFieldDocs(Field[] fields) throws Exception {
          IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(),
              IndexWriter.MaxFieldLength.UNLIMITED);
          for (int i = 0; i < fields.length; i++) {
            Document doc = new Document();
            doc.add(fields[i]);
            writer.addDocument(doc);
          }
          writer.optimize();
          writer.close();
        }

        public void wildcard() throws Exception {
          indexSingleFieldDocs(new Field[] {
            new Field("contents", "wild", Field.Store.YES, Field.Index.ANALYZED),
            new Field("contents", "child", Field.Store.YES, Field.Index.ANALYZED),
            new Field("contents", "mild", Field.Store.YES, Field.Index.ANALYZED),
            new Field("contents", "mildew", Field.Store.YES, Field.Index.ANALYZED)
          });
          IndexSearcher searcher = new IndexSearcher(directory, true);
          Query query = new WildcardQuery(new Term("contents", "?ild*"));
          TopDocs matches =, 10);
        }
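    To see what the "?ild*" pattern above actually selects, here is a self-contained sketch of wildcard semantics ('?' matches exactly one character, '*' any run) over a plain term list, implemented with a regex translation rather than Lucene's real term enumeration; the class and method names are made up for illustration.

    ```java
    import java.util.*;

    // Sketch of Lucene-style wildcard matching over terms, via regex.
    public class WildcardSketch {
        // Convert a wildcard pattern ('?' = one char, '*' = any run) to a regex.
        public static String toRegex(String wildcard) {
            StringBuilder sb = new StringBuilder();
            for (char c : wildcard.toCharArray()) {
                if (c == '?') sb.append('.');
                else if (c == '*') sb.append(".*");
                else sb.append(java.util.regex.Pattern.quote(String.valueOf(c)));
            }
            return sb.toString();
        }

        public static List<String> match(String pattern, List<String> terms) {
            String regex = toRegex(pattern);
            List<String> hits = new ArrayList<>();
            for (String t : terms) if (t.matches(regex)) hits.add(t);
            return hits;
        }

        public static void main(String[] args) {
            List<String> terms = Arrays.asList("wild", "child", "mild", "mildew");
            // "child" does not match: '?' covers exactly one leading character.
            System.out.println(match("?ild*", terms)); // [wild, mild, mildew]
        }
    }
    ```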
  • Querying – QueryParser
        Query query = new QueryParser("subject", analyzer)
            .parse("(clinical OR ethics) AND methodology");
    - Syntax examples:
      - trachea esophagus (the default join condition is OR)
      - trachea AND esophagus
      - cough AND (trachea OR esophagus)
      - trachea NOT esophagus
      - full_title:trachea
      - "trachea disease"
      - "trachea disease"~5
      - is_gender_male:y
      - [2010-01-01 TO 2010-07-01]
      - esophaguz~
      - trachea^5 esophagus
  • Analyzers – Internals
    - Used at indexing and at querying time
    - Inside an analyzer:
      - Operates on a TokenStream
      - A token has a text value and metadata: start/end character offsets, token type, position increment, and optionally application-specific bit flags and a byte[] payload
    - TokenStream is abstract; Tokenizer and TokenFilter are the concrete ones
      - A Tokenizer reads chars and produces tokens
      - A TokenFilter ingests tokens and produces new ones
      - The composite pattern is implemented and they form a chain of one another
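    The chain described above can be sketched without Lucene: a tokenizer produces tokens from characters, and filters are composed on top, each consuming the previous stream. Real Lucene TokenStreams are incremental and carry offsets, types, and position increments; here a token is just a String, and all names are invented for the sketch.

    ```java
    import java.util.*;

    // Sketch of an analysis chain: tokenizer -> lowercase filter -> stop filter.
    public class AnalyzerChainSketch {
        public interface TokenStream { List<String> tokens(); }

        // Tokenizer: reads characters and produces tokens.
        public static TokenStream whitespaceTokenizer(String text) {
            return () -> Arrays.asList(text.trim().split("\\s+"));
        }

        // Token filter: lowercases every token from the upstream stream.
        public static TokenStream lowerCaseFilter(TokenStream in) {
            return () -> {
                List<String> out = new ArrayList<>();
                for (String t : in.tokens()) out.add(t.toLowerCase());
                return out;
            };
        }

        // Token filter: drops stop words.
        public static TokenStream stopFilter(TokenStream in, Set<String> stopWords) {
            return () -> {
                List<String> out = new ArrayList<>();
                for (String t : in.tokens()) if (!stopWords.contains(t)) out.add(t);
                return out;
            };
        }

        public static void main(String[] args) {
            Set<String> stop = new HashSet<>(Arrays.asList("the", "is", "on", "this"));
            TokenStream chain = stopFilter(
                lowerCaseFilter(whitespaceTokenizer("The JHUG meeting is on this Saturday")), stop);
            System.out.println(chain.tokens()); // [jhug, meeting, saturday]
        }
    }
    ```

    Note how the output matches the StopAnalyzer example later in the deck: the same three content-bearing tokens survive the chain.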
  • Analyzers – building blocks
    - Analyzers can be created by combining token streams (order is important)
    - Building blocks provided in core:
      - Tokenizers: CharTokenizer, WhitespaceTokenizer, KeywordTokenizer, LetterTokenizer, LowerCaseTokenizer, SinkTokenizer, StandardTokenizer
      - Filters: LowerCaseFilter, StopFilter, PorterStemFilter, TeeTokenFilter, ASCIIFoldingFilter, CachingTokenFilter, LengthFilter, StandardFilter
  • Analyzers – core
    - WhitespaceAnalyzer: splits tokens at whitespace
    - SimpleAnalyzer: divides text at non-letter characters and lowercases
    - StopAnalyzer: divides text at non-letter characters, lowercases, and removes stop words
    - KeywordAnalyzer: treats the entire text as a single token
    - StandardAnalyzer: tokenizes based on a sophisticated grammar that recognizes email addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more; lowercases and removes stop words
  • Analyzers – Example (1/2)
    Analyzing "The JHUG meeting is on this Saturday":
    - WhitespaceAnalyzer: [The] [JHUG] [meeting] [is] [on] [this] [Saturday]
    - SimpleAnalyzer: [the] [jhug] [meeting] [is] [on] [this] [saturday]
    - StopAnalyzer: [jhug] [meeting] [saturday]
    - StandardAnalyzer: [jhug] [meeting] [saturday]
  • Analyzers – Example (2/2)
    Analyzing "XY&Z Corporation -":
    - WhitespaceAnalyzer: [XY&Z] [Corporation] [-] []
    - SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
    - StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
    - StandardAnalyzer: [xy&z] [corporation] []
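    SimpleAnalyzer's behavior in the two examples above ("divide at non-letters, lowercase") is easy to reproduce in plain Java. This sketch approximates it with an ASCII letter class; the real SimpleAnalyzer (LetterTokenizer + LowerCaseFilter) uses Unicode letter semantics, and the class name here is invented.

    ```java
    import java.util.*;

    // Approximate SimpleAnalyzer: split at non-letter characters, lowercase.
    public class SimpleAnalyzerSketch {
        public static List<String> analyze(String text) {
            List<String> tokens = new ArrayList<>();
            for (String t : text.split("[^A-Za-z]+")) {
                if (!t.isEmpty()) tokens.add(t.toLowerCase());
            }
            return tokens;
        }

        public static void main(String[] args) {
            // Reproduces the slide's SimpleAnalyzer output for example 2/2.
            System.out.println(analyze("XY&Z Corporation -"));
            // [xy, z, corporation, xyz, example, com]
        }
    }
    ```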
  • Analyzers – Beyond the built-in
    - Language-specific analyzers, under contrib/analyzers: language-specific stemming and stop-word removal
    - "Sounds like" analysis, e.g. a MetaphoneReplacementAnalyzer that transforms terms to their phonetic roots
    - SynonymAnalyzer
    - Nutch analysis: bigrams for stop words
    - Stemming analysis
      - The PorterStemFilter stems words using the Porter stemming algorithm created by Dr. Martin Porter, which is best defined in his own words: "The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems."
      - SnowballAnalyzer: stemming for many European languages
  • Filters
    - Narrow the search space
    - Overloaded search methods accept Filter instances
    - Examples: TermRangeFilter, NumericRangeFilter, PrefixFilter, QueryWrapperFilter, SpanQueryFilter, ChainedFilter
  • Example: Filters for Security
    - Constraints known at indexing time
      - Index the constraint as a field
      - Search wrapping a TermQuery on the constraint field with a QueryWrapperFilter
    - Factoring in information at search time
      - Use a custom filter
      - The filter accesses an external privilege store that provides some means of identifying documents in the index, e.g. a unique term with regard to permissions
      - Return a DocIdSet to Lucene. Bit positions match the document numbers. Enabled bits mean the document at that position is available to be searched against the query; unset bits mean the documents won't be considered in the search
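    The search-time variant above can be sketched with a BitSet standing in for Lucene's DocIdSet: the privilege store yields a bit per document number, and only documents whose bit is set survive filtering. All names here are hypothetical illustrations, not Lucene's Filter API.

    ```java
    import java.util.*;

    // Sketch: restrict query hits to documents a user is allowed to see,
    // using a bit set indexed by document number (like a DocIdSet).
    public class SecurityFilterSketch {
        // queryHits: doc IDs that matched the query, in index order.
        // allowed: bit per doc number, as supplied by a privilege store.
        public static List<Integer> applyFilter(List<Integer> queryHits, BitSet allowed) {
            List<Integer> visible = new ArrayList<>();
            for (int doc : queryHits) if (allowed.get(doc)) visible.add(doc);
            return visible;
        }

        public static void main(String[] args) {
            BitSet allowed = new BitSet();
            allowed.set(0); allowed.set(2); allowed.set(5); // from the privilege store
            System.out.println(applyFilter(Arrays.asList(0, 1, 2, 3, 5), allowed)); // [0, 2, 5]
        }
    }
    ```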
  • Internals – Concurrency
    - Any number of IndexReaders may be open (IndexSearchers use underlying IndexReaders)
    - Only one IndexWriter at a time: locking via a write-lock file
    - IndexReaders may be open while the index is being changed by an IndexWriter; a reader sees changes only when the writer commits and the reader is reopened
    - Both are thread-safe/thread-friendly classes
  • Internals – Indexing concepts
    - The index is made up of segment files
    - Deleting documents does not actually delete; it only marks documents for deletion
    - Index writes are buffered and flushed periodically
    - Segments need to be merged: automatically by the IndexWriter, or via explicit calls to optimize
    - There is the notion of a commit (as you would expect), which has 4 steps:
      1. Flush buffered documents and deletions
      2. Sync files; force the OS to write to stable storage of the underlying I/O system
      3. Write and sync the segments_N file
      4. Remove old commits
  • Internals – Transactions
    - Two-phase commit is supported: prepareCommit performs steps 1, 2 and most of 3
    - Lucene implements the ACID transactional model:
      - Atomicity: all-or-nothing commit
      - Consistency: e.g. an update will mean both a delete and an add
      - Isolation: IndexReaders cannot see what has not been committed
      - Durability: the index is not corrupted and persists in storage
  • Architectures
    - Cluster nodes that share a remote file system index
      - Slower than local
      - Possible limitations due to client-side caching (Samba, NFS, AFP) or stale file handles (NFS)
    - Index in a database
      - Much slower
    - Separate write and read indexes (replication)
      - Relies on the IndexDeletionPolicy feature of Lucene
      - Out of the box in Solr and ElasticSearch
    - Autonomous search servers (e.g. Solr, ElasticSearch)
      - Loose coupling through JSON or XML
  • Frameworks – Compass
    Document definition via JPA mapping:
        <compass-core-mapping package="eu.emea.eudract.model.entity">
          <class name="cta.sectiona.CtaIdentification" alias="cta" root="true" support-unmarshall="false">
            <id name="ctaIdentificationId">
              <meta-data>cta_id</meta-data>
            </id>
            <dynamic-meta-data name="ncaName" converter="jexl" store="yes">
            </dynamic-meta-data>
            <property name="fullTitle">
              <meta-data>cta_full_title</meta-data>
            </property>
            <property name="sponsorProtocolVersionDate">
              <meta-data format="yyyy-MM-dd" store="no">cta_sponsor_protocol_version_date</meta-data>
            </property>
            <property name="isResubmission">
              <meta-data converter="shortToYesNoNaConverter" store="no">cta_is_resubmission</meta-data>
            </property>
            <component name="eudractNumber" />
          </class>
          <class name="eudractnumber.EudractNumber" alias="eudract_number" root="false">
            <property name="eudractNumberId">
              <meta-data converter="dashHandlingConverter" store="no">filteredEudractNumberId</meta-data>
              <meta-data>eudract_number</meta-data>
            </property>
            <property name="paediatricClinicalTrial">
              <meta-data converter="shortToYesNoNaConverter" store="no">paediatric_clinical_trial</meta-data>
            </property>
          </class>
          .....
        </compass-core-mapping>
  • Frameworks – Solr
    Document definition via DB mapping:
        <dataConfig>
          <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
          <document name="products">
            <entity name="item" query="select * from item">
              <field column="ID" name="id" />
              <field column="NAME" name="name" />
              <field column="MANU" name="manu" />
              <field column="WEIGHT" name="weight" />
              <field column="PRICE" name="price" />
              <field column="POPULARITY" name="popularity" />
              <field column="INSTOCK" name="inStock" />
              <field column="INCLUDES" name="includes" />
              <entity name="feature" query="select description from feature where item_id='${item.ID}'">
                <field name="features" column="description" />
              </entity>
              <entity name="item_category" query="select CATEGORY_ID from item_category where item_id='${item.ID}'">
                <entity name="category" query="select description from category where id = '${item_category.CATEGORY_ID}'">
                  <field column="description" name="cat" />
                </entity>
              </entity>
            </entity>
          </document>
        </dataConfig>
  • Frameworks – Compass/Lucene Configuration
        <compass name="default">
          <setting name="compass.transaction.managerLookup">org.compass.core.transaction.manager.OC4J</setting>
          <setting name="compass.transaction.factory">org.compass.core.transaction.JTASyncTransactionFactory</setting>
          <setting name="compass.transaction.lockPollInterval">400</setting>
          <setting name="compass.transaction.lockTimeout">90</setting>
          <setting name="compass.engine.connection">file://P:/Tmp/stelinio</setting>
          <!--<setting name="compass.engine.connection">jdbc://jdbc/EudractV8DataSourceSecure</setting>-->
          <!--<setting name="">-->
          <!>
          <!--</setting>-->
          <!--<setting name="compass.engine.ramBufferSize">512</setting>-->
          <!--<setting name="compass.engine.maxBufferedDocs">-1</setting>-->
          <setting name="compass.converter.dashHandlingConverter.type">eu.emea.eudract.compasssearch.DashHandlingConverter</setting>
          <setting name="compass.converter.shortToYesNoNaConverter.type">eu.emea.eudract.compasssearch.ShortToYesNoNaConverter</setting>
          <setting name="compass.converter.shortToPerDayOrTotalConverter.type">eu.emea.eudract.compasssearch.ShortToPerDayOrTotalConverter</setting>
          <setting name=""> </setting>
          <setting name="compass.engine.analyzer.default.type">org.apache.lucene.analysis.standard.StandardAnalyzer</setting>
        </compass>
  • Cool extra features – Spellchecking
    - You will need a dictionary of valid words; you could use the unique terms in your index
    - Given the dictionary you could:
      - Use a sounds-like algorithm such as Soundex or Metaphone
      - Or use n-grams. E.g. squirrel as 3-grams is squ, qui, uir, irr, rre, rel; as 4-grams squi, quir, uirr, irre, rrel. Mistakenly searching for squirel would match 5 grams, with 2 shared between the 3-grams and 4-grams. This would score high!
      - Maybe use the Levenshtein distance
    - To present or not to present (the suggestion)? Other ideas:
      - Use the rest of the terms in the query to bias
      - Maybe combine distance with frequency of the term
      - Check result numbers in initial and corrected searches
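    The n-gram idea above can be sketched directly: decompose both the dictionary word and the misspelling into letter n-grams and count the shared ones. The counts below are recomputed from scratch and may differ slightly from the slide's figures; the class name is invented for the sketch.

    ```java
    import java.util.*;

    // Sketch of n-gram overlap scoring for spell correction.
    public class NgramSketch {
        // All length-n substrings of a word, in order of first appearance.
        public static Set<String> ngrams(String word, int n) {
            Set<String> grams = new LinkedHashSet<>();
            for (int i = 0; i + n <= word.length(); i++) grams.add(word.substring(i, i + n));
            return grams;
        }

        // Number of n-grams the two words share.
        public static int overlap(String a, String b, int n) {
            Set<String> shared = new HashSet<>(ngrams(a, n));
            shared.retainAll(ngrams(b, n));
            return shared.size();
        }

        public static void main(String[] args) {
            System.out.println(ngrams("squirrel", 3)); // [squ, qui, uir, irr, rre, rel]
            System.out.println(overlap("squirrel", "squirel", 3)); // shared 3-grams: 4
            System.out.println(overlap("squirrel", "squirel", 4)); // shared 4-grams: 2
        }
    }
    ```

    A real spellchecker would index the n-grams of every dictionary term and rank candidate corrections by this overlap, possibly combined with edit distance and term frequency as the slide suggests.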
  • Even More features
    - Sorting
      - Use a field for sorting instead of relevance, e.g. when you use the MatchAllDocsQuery
      - Beware: it uses the FieldCache, which resides in RAM!
    - SpanQueries
      - Distance between terms (span)
      - A family of queries like SpanNearQuery, SpanOrQuery and others
    - Synonyms
      - Injection during indexing or during searching? Injection during indexing will allow faster searches; a MultiPhraseQuery is appropriate for search time
      - The key thing is to understand that synonyms must be injected at the same position increments
      - Leverage a synonyms knowledge base; a good strategy is to convert it into an index
    - Spatial searches
      - Answer to the query "Greek Restaurants Near Me"
      - An efficient technique is to use grids: assign non-unique grid numbers to areas (e.g. in a Mercator space) and index documents with a field containing the grid numbers that match their longitude and latitude
    - MoreLikeThis
      - One use of term vectors
    - Function Queries
      - e.g. add boosts for fields at search time
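    The grid technique for spatial search can be sketched as follows: quantize latitude/longitude into coarse cell numbers at indexing time, then match documents indexed with the searcher's cell. The cell size and id scheme here are arbitrary illustrations, not Lucene's spatial contrib.

    ```java
    // Sketch: quantize coordinates into grid cell ids for spatial matching.
    public class GridSketch {
        // One cell per whole degree; the id packs the row and column.
        public static long cellId(double lat, double lon) {
            long row = (long) Math.floor(lat + 90);   // 0..179
            long col = (long) Math.floor(lon + 180);  // 0..359
            return row * 360 + col;
        }

        public static void main(String[] args) {
            long restaurant = cellId(37.98, 23.73);  // a restaurant in Athens
            long searcher   = cellId(37.95, 23.70);  // the searching user nearby
            long faraway    = cellId(51.50, -0.12);  // London
            System.out.println(restaurant == searcher); // true: same cell, candidate hit
            System.out.println(restaurant == faraway);  // false
        }
    }
    ```

    In practice each document would be indexed with its cell id (and usually neighbor cells or multiple grid resolutions) in a dedicated field, so "near me" reduces to a cheap term match before any precise distance check.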
  • Some last things to bear in mind
    - It would be wise to back up your index
      - You can have hot backups (supported through the CommitDeletionPolicy)
    - You can repair a corrupted index (corrupted segments are just lost… D'oh!)
    - Performance has some trade-offs:
      - Search latency
      - Indexing throughput
      - Near-real-time results
      - Index replication
      - Index optimization
    - Resource consumption: disk space, file descriptors, memory
    - 'Luke' is a really handy tool
  • Resources
    - Book: Lucene in Action
    - Solr:
    - Vector Space Model: ector_Space_Model
    - IR Links:
  • That's it. Questions?