Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Improved Search with Lucene 4.0            Robert Muir, Lucid Imagination    robert.muir@lucidimagination.com, 10/19/2011
Introductions What: Examine improvements in 4.0 Who am I?  • Lucene Committer & PMC Member  • Lucid Imagination Employee...
Indexing Improvements Codecs  • Flexible index format Index DocValues  • Efficient per-document lookup values Miscellan...
Codecs: Introduction Problem: “one size fits all” index format How to integrate future improvements?   • Example: more e...
Codecs: Example use case Accelerate near-realtime reopen Speed up delete by term on ID field   • “Primary key lookup” I...
Codecs: available formats   Standard: Lucene 4.0 index format   Pulsing: inline low frequency terms postings   Memory: ...
Codecs: ConfigurationLucene: use CodecProvider classSolr: specify in schema.xml                          7
Codecs: Configuration            8
Index DocValues: Introduction Problem: lookup a per-document value  • RAM-resident for things like scoring factors (e.g. ...
Index DocValues: Limitations New feature, recently added to Lucene✔ External scoring factors✔ Memory-resident and disk-ba...
Index DocValues: Example             11
Binary Terms In Lucene 4.0, index terms are byte[]  • Do not need to be unicode text Example: Localized Sort and Range  ...
Localized Sort/Range ExampleLucene: use CollationAnalyzer classSolr: specify in schema.xml                          13
Additional Index Statistics New statistics in Lucene 4.0:  •   totalTermFreq(term): number of occurrences  •   sumTotalTe...
Search Improvements Improved Scoring API  • Additional Scoring algorithms Spellchecking improvements Miscellaneous  • Q...
Scoring: Introduction Problem: “baked-in” vector space model How to integrate additional algorithms?  • Example: Languag...
Scoring: additional algorithms      BM25      Language Models      Divergence from Randomness      Information-based M...
Scoring: ConfigurationLucene: use Similarity classSolr: specify in schema.xml                          18
Spellchecking Improvements New DirectSpellChecker  • No additional index needed Better Suggestions  • Levenshtein vs. n-...
Spellchecking: ConfigurationLucene: use DirectSpellChecker classSolr: specify in solrconfig.xml                          20
Queryparsing Improvements Regular expression queries  • /mycompany.(com|org|net)/ Specify number of edits for fuzzy  • f...
Deep-paging support Problem: users who page deep into results :) Normal paging in lucene:  • Page 1: search(“foo”, 20) ←...
Performance Improvements Concurrent Flushing Fast Fuzzy Query Improved RAM Efficiency                   23
Concurrent Flushinghttp://people.apache.org/~mikemccand/lucenebench/
Fast Fuzzy QueryIn Lucene 3 this thing is < 1 QPS!
Improved RAM Efficiency Lucene 4.0 uses much less memory  • Terms Index, Suggester, Synonyms : finite-state  • Fieldcache...
Future Improvements Block Index Compression  • PFOR-delta, Simple8b, … Positional iterators from Scorers  • Offsets in p...
Conclusion Lucene 4.0 will have many improvements,  architectural, and API changes. A few of these were introduced here:...
More Information Lucene Website  • http://lucene.apache.org Users List  • java-user@lucene.apache.org Lucene Revolution...
Upcoming SlideShare
Loading in …5
×

Improved Search with Lucene 4.0 - Robert Muir

8,647 views

Published on

See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011

This talk describes how you can practically apply some of Lucene 4's new features (such as flexible indexing, scoring improvements, column-stride fields) to improve your search application.

The talk will give a brief description of these new features and some example use-cases, to address practical use cases you can try yourself in and around the new features now available in Lucene 4. We'll cover application of functions where you can configure Solr to:

Set up the schema to use Pulsing or Memory codec for a primary key field
Not use a separate spellcheck index, controlling character-level swaps from the query processor
Sorting with a different locale
Per-field similarity configurations, such as using a non-vector-space algorithm

Published in: Technology
  • Be the first to comment

Improved Search with Lucene 4.0 - Robert Muir

  1. 1. Improved Search with Lucene 4.0 Robert Muir, Lucid Imagination robert.muir@lucidimagination.com, 10/19/2011
  2. 2. Introductions What: Examine improvements in 4.0 Who am I? • Lucene Committer & PMC Member • Lucid Imagination Employee Many big changes coming! Examine three general areas: • Indexing Improvements • Search Improvements • Performance Improvements Future improvements 2
  3. 3. Indexing Improvements Codecs • Flexible index format Index DocValues • Efficient per-document lookup values Miscellaneous Improvements • Binary terms • Additional index statistics 3
  4. 4. Codecs: Introduction Problem: “one size fits all” index format How to integrate future improvements? • Example: more efficient index compression • But need to support old format, too. How to optimize for your data? • Without digging into the guts of indexer! Idea: format should be pluggable 4
  5. 5. Codecs: Example use case Accelerate near-realtime reopen Speed up delete by term on ID field • “Primary key lookup” Idea: optimize ID field for this use case 5
  6. 6. Codecs: available formats Standard: Lucene 4.0 index format Pulsing: inline low frequency terms postings Memory: loads field entirely into RAM Appending: supports append-only filesystem SimpleText: plaintext (not for production!!!!!) PreFlex: supports Lucene 3.x index format 6
  7. 7. Codecs: ConfigurationLucene: use CodecProvider classSolr: specify in schema.xml 7
  8. 8. Codecs: Configuration 8
  9. 9. Index DocValues: Introduction Problem: lookup a per-document value • RAM-resident for things like scoring factors (e.g. pagerank) • Disk-resident for things like document versioning Existing workarounds in Lucene have limitations • Scoring limited to one byte norm • FieldCache requires uninversion to build • Stored fields are slow, geared towards summary results Idea: add performant, flexible per-document lookups. 9
  10. 10. Index DocValues: Limitations New feature, recently added to Lucene✔ External scoring factors✔ Memory-resident and disk-based lookup✗ Docvalues sort-by-term✗ Internal scoring factors (die norms, die)✗ Integration with Solr 10
  11. 11. Index DocValues: Example 11
  12. 12. Binary Terms In Lucene 4.0, index terms are byte[] • Do not need to be unicode text Example: Localized Sort and Range • Previous versions of Lucene: special encoder • Lucene 4.0: 50% space savings 12
  13. 13. Localized Sort/Range ExampleLucene: use CollationAnalyzer classSolr: specify in schema.xml 13
  14. 14. Additional Index Statistics New statistics in Lucene 4.0: • totalTermFreq(term): number of occurrences • sumTotalTermFreq(field): number of tokens • sumDocFreq(field): number of postings • docCount(field): number of docs with value Available from Lucene APIs Available from function queries Supports additional scoring algorithms... 14
  15. 15. Search Improvements Improved Scoring API • Additional Scoring algorithms Spellchecking improvements Miscellaneous • Query parsing improvements • Deep paging support 15
  16. 16. Scoring: Introduction Problem: “baked-in” vector space model How to integrate additional algorithms? • Example: Language Models • Before 4.0: write custom Queries • Before 4.0: track certain statistics yourself How to customize for your data? • Without digging into the guts of postings lists! Idea: separate “matching” from “scoring” 16
  17. 17. Scoring: additional algorithms  BM25  Language Models  Divergence from Randomness  Information-based Models 17
  18. 18. Scoring: ConfigurationLucene: use Similarity classSolr: specify in schema.xml 18
  19. 19. Spellchecking Improvements New DirectSpellChecker • No additional index needed Better Suggestions • Levenshtein vs. n-gram Exposes more configuration options • Tune to your collection 19
  20. 20. Spellchecking: ConfigurationLucene: use DirectSpellChecker classSolr: specify in solrconfig.xml 20
  21. 21. Queryparsing Improvements Regular expression queries • /mycompany.(com|org|net)/ Specify number of edits for fuzzy • foobar~2 Wildcard escaping • crazy?* Range query syntax improvements • Mix inclusive/exclusive bounds • Support open-ended syntax in Lucene 21
  22. 22. Deep-paging support Problem: users who page deep into results :) Normal paging in lucene: • Page 1: search(“foo”, 20) ← look at 1-20 • Page 2: search(“foo”, 40) ← look at 21-40 Deep paging in lucene: • Page 1: searchAfter(null, “foo”, 20) • Page 2: searchAfter(lastResult, “foo”, 20) 22
  23. 23. Performance Improvements Concurrent Flushing Fast Fuzzy Query Improved RAM Efficiency 23
  24. 24. Concurrent Flushinghttp://people.apache.org/~mikemccand/lucenebench/
  25. 25. Fast Fuzzy QueryIn Lucene 3 this thing is < 1 QPS!
  26. 26. Improved RAM Efficiency Lucene 4.0 uses much less memory • Terms Index, Suggester, Synonyms : finite-state • Fieldcache: packed integers / utf-8 Example with Wikipedia collection: “Memory footprint reduction from 389M to 90M after some off-the wall sorting and faceting”
  27. 27. Future Improvements Block Index Compression • PFOR-delta, Simple8b, … Positional iterators from Scorers • Offsets in postings lists (fast highlighting) • Proximity Scoring Structured/Section Scoring (e.g. BM25F) Improved flexibility through Codec • stored fields, term vectors, … Faster filtered search 27
  28. 28. Conclusion Lucene 4.0 will have many improvements, architectural, and API changes. A few of these were introduced here: • Ability to customize the index format • Additional scoring algorithms • Performance Improvements More are currently under development. For more information, look at CHANGES.txt in subversion, JIRA, ... 28
  29. 29. More Information Lucene Website • http://lucene.apache.org Users List • java-user@lucene.apache.org Lucene Revolution presentations • http://www.lucidimagination.com/devzone/events/conferences/revolution/2010 • http://www.lucidimagination.com/devzone/events/conferences/revolution/2011 Contact Info • robert.muir@lucidimagination.com • http://www.lucidimagination.com/blog/author/robert-muir 29

×