The document discusses advancements in Lucene, an open-source search engine library, focusing on improving inexact matching through automaton queries and enhancements in query APIs. It highlights the benefits of expanding queries for better language support, stemming, and spellchecking, while also noting ongoing developments in Lucene-Solr merging and indexing improvements. Additionally, it addresses enhancements in text analysis, relevance benchmarking, and spatial support.
In this document
Powered by AI
Overview of Lucene as an open-source search library. Discusses its capabilities for inexact matching and various use cases.
Explains the Inverted Index, improvements in numeric range queries, and automaton queries for enhanced search performance.
Discussion on advanced functionalities like stemming, query expansion strategies, and experimental results for improved effectiveness.
Addresses spellchecking enhancements, ongoing Lucene developments, community collaboration, indexing flexibility, and improvements in relevance scoring.
Open floor for questions and provision of backup slides for further details.
Agenda
• Introduction toLucene
• Improving inexact matching:
– Background
– Regular Expression, Wildcard, Fuzzy Queries
• Additional use cases:
– Language support: expansion versus
stemming
– Improved spellchecking
• Other ongoing developments in Lucene
3.
Introduction to Lucene
•Open Source Search Engine Library
– Not just Java, ported to other languages too.
– Commercial support via several companies
• Just the library
– Embed for your own uses, e.g. Eclipse
– For a search server, see Solr
– For web search + crawler, see Nutch
• Website: http://lucene.apache.org
4.
Inverted Index
• Likea TreeMap<Terms, Documents>
• Before 2.9, queries only operate on one
“subtree”.
– For example, Regex and Fuzzy exhaustively
evaluate all terms, unless you give them a
“constant prefix”.
• TermQuery is just a special case, looks at
one leaf node.
5.
Lucene 2.9: FastNumeric Ranges
• Indexes at different levels of precision.
• Enumerates multiple subtrees.
– But typically this is a small number: e.g. 15
• Query APIs improved to support this.
4 5 6
42 44 52 63 64
421 423 445 446 448 521 522 632 633 634 641 642 644
6.
Automaton Queries
• Onlyexplore subtrees that can lead to an
accept state of some finite state machine.
• AutomatonQuery traverses the term
dictionary and the state machine in parallel
7.
Another way tothink of it
• Index as a state machine that recognizes Terms
and transduces matching Documents.
• AutomatonQuery represents a user’s search
need as a FSM.
• The intersection of the two emits search results.
8.
Query API improvements
•Automata might need to do many seeks
around the term dictionary.
– Depends on what is in term dictionary
– Depends on state machine structure
• MultiTermQuery API further improved
– Easier and more efficient to skip around.
– Explicitly supports seeking.
9.
Regex, Wildcard, Fuzzy
•Without constant prefix, exhaustive
– Regex: (http|ftp)://foo.com
– Wildcard: ?oo?ar
– Fuzzy: foobar~
• Re-implemented as automata queries
– Just parsers that produce a DFA
– Improved performance and scalability
– (http|ftp)://foo.com examines 2 terms.
Stemming
• Stemmers workat index and query time
– walked, walking -> walk
– Can increase retrieval effectiveness
• Some problems
– Mistakes: international -> intern
– Must determine language of documents
– Multilingual cases can get messy
– Tuning is difficult: must re-index
– Unfriendly: wildcards on stemmed terms…
12.
Expansion instead
• Don’tremove data at index time
– Expand the query instead.
– Single field now works well for all queries:
exact match, wildcard, expanded, etc.
• Simplifies search configuration
– Tuning relevance is easier, no re-indexing.
– No need to worry about language ID for docs.
– Multilingual case is much simpler.
13.
Automata expansion
• Naturalfit for morphology
• Use set intersection operators
– Minus to subtract exact match case
– Union to search multiple languages
• Efficient operation
– Doesn’t explode for languages with complex
morphology
14.
Experimental results
• 125kdocs English test collection
• Results are for TD queries
• Inverted the “S-Stemmer”
• 6 declarative rewrite rules to regex
• Competitive with traditional stemming.
No Porter S-Stem Automaton
Stemming S-Stem
MAP 0.4575 0.5069 0.5029 0.4979
MRR 0.8070 0.7862 0.7587 0.8220
# Terms 336,675 280,061 305,710 336,675
15.
TODO:
• Support expansionmodels, too in Lucene.
• Language-specific resources
– lucene-hunspell could provide these
• Language-independent tokenization
– Unicode rules go a long way.
• Scoring that doesn’t need stopwords
– For now, use CommonGrams!
16.
Spellchecking
• Lucene spellcheckerbuilds a separate
index to find correction candidates
• Perhaps our fuzzy enumeration is now fast
enough for small edit distances (e.g. 1,2)
to just use the index directly.
• Could simplify configurations, especially
distributed ones.
Community
• Merging Luceneand Solr development
– Still two separate released “products”!!!
– Share mailing list and code repository
– Solr dev code in sync with Lucene dev code
• Benefits to both Lucene and Solr users
– Lucene features exposed to Solr faster
– Solr features available to Lucene users
19.
Indexing
• Flexible Indexing
– Customize the format of the index
– Decreased RAM usage
– Faster IndexReader open and faster seeking
• Future
– Serialize custom attributes to the index
– More RAM savings
– Improved index compression
– Faster Near-Real-Time performance
20.
Text Analysis
• ImprovedUnicode Support
– Unicode 4/Supplementary support
– ICU integration (to support Unicode 5.2)
• Improved Language Support
– More languages
– Faster indexing performance
– Easier packaging and ease-of-use
• Improvements to Analysis API
21.
Relevance and Scoring
•Improved relevance benchmarking
– Open Relevance Project
– Quality Benchmarking Package
• Future: More flexible scoring
– How to support BM25, DFR, more vector
space?
– Some packages/patches available, but
• Additional index statistics needed to be simple
22.
Other
• Improvements toSpatial support
– See Chris Male’s talk here:
http://vimeo.com/10204365
– Progress in both Lucene and Solr
• Steps towards a more modular
architecture
– Smaller Lucene core
– Separate modules with more functionality