SlideShare a Scribd company logo
Improved Search with Lucene 4.0
    NOVA Apache Lucene/Solr Meetup




            Robert Muir, Lucid Imagination
    robert.muir@lucidimagination.com, 6/14/2012
Introductions
 What: Examine improvements in 4.0
 Who am I?
  • Lucene Committer & PMC Member
  • Lucid Imagination Employee
 Many big changes coming!
 Examine three general areas:
  • Indexing Improvements
  • Search Improvements
  • Performance Improvements
 Future improvements

                     2
But first: Lucene 101
 Search engine library
  • Not an application: use Solr
 Open Source
  • Apache Software Foundation
 Documents: collection of Fields
  • What these are is up to you.
  • Not a parser: use Tika
 Inverted Index
  • Terms map to documents
  • Provide search terms: get
    ranked results

                             3
Lucene 4.0

 Not yet released
  • Nearing alpha stage
  • Alpha → Beta → Final
  • Will release in parallel with Solr 4.0
 Modular API
  • Core, Analysis, Autosuggest, ...
 What changed?




                      4
Indexing Improvements


 Codecs
  • Flexible index format
 Index DocValues
  • Efficient per-document lookup values
 Miscellaneous Improvements
  • Offsets in postings
  • Binary terms
  • Additional index statistics


                        5
Lucene: indexing basics
 Incremental indexing
  • Segment is a mini-index
  • Periodically merged
  • Update = delete + add
 IndexWriter
  • Add/update/delete docs
 IndexReader
  • Snapshot-in-time view




                              6
Codecs: Introduction


 Problem: “one size fits all” index format
 How to integrate future improvements?
   • Example: more efficient index compression
   • But need to support old format, too.
 How to optimize for your data?
   • Without digging into the guts of indexer!
 Idea: format should be pluggable



                          7
Codecs: Flex API




         8
Codecs: entire index

 Customize the encoding, data structures, files
 Customize all portions of the index
  • PostingsFormat: terms, docs, positions, …
  • LiveDocsFormat: deleted documents
  • StoredFieldsFormat: stored fields
  • TermVectorsFormat: term vectors
  • ...


                        9
Codecs: Example use case
 Accelerate near-realtime reopen
 Speed up delete by term on ID field
   • “Primary key lookup”
 Idea: optimize ID field for this use case




                            10
Codecs: postings formats
   Lucene40: Lucene 4.0 index format
   Pulsing: inline low frequency terms' postings
   Memory: loads field entirely into RAM
   Appending: supports append-only filesystem
   SimpleText: plaintext (not for production!!!!!)
   Lucene3x: supports Lucene 3.x index format




                          11
Codecs: Configuration
Lucene: use Codec class
Solr: specify in solrconfig.xml/schema.xml




                          12
Codecs: Configuration




           13
Index DocValues: Introduction

 Problem: lookup a per-document value
  • RAM-resident for things like scoring factors (e.g. pagerank)
  • Disk-resident for things like document versioning
 Existing workarounds in Lucene have limitations
  • Scoring limited to one byte norm
  • FieldCache requires uninversion to construct
  • Stored fields are slow, geared towards summary results
 Idea: add performant, flexible per-document lookups.




                               14
Index DocValues: Features

 New feature, recently added to Lucene


✔ External scoring factors
✔ Memory-resident and disk-based lookup
✔ DocValues sort-by-term
✔ Internal scoring factors
✔ Grouping support
✗ FST sort implementation
✗ Independently updatable
✗ Integration with Solr



                              15
Index DocValues: Example




             16
Offsets in Postings

 Problem: search result highlighting
   • Typically a performance bottleneck
 Two strategies today: neither is great
   • Re-analyze
   • Term vectors
 Highlighting is really important
   • How users judge if a result is relevant
   • Who wants a search engine w/o highlighting!
 How can we speed this up?



                           17
Current strategy #1: re-analyze

 Analyze documents on-the-fly
  • Same way as at indexing time
  • Map terms back to locations
 Can be ok if:
  • Analysis is very simple
  • Documents are tiny
 Can be slow:
  • Complex analysis pipelines
  • Documents are large



                              18
Current strategy #2: term vectors

 Inverted index for each doc
  • Avoids re-analysis
 Can be ok if:
  • Analysis is very complex
  • Lots of disk space
 Can be slow:
  • 3 seeks per document (.tvx, .tvd, .tvf)




                             19
New strategy: offsets in postings

 Offsets in the positions file
   • Avoids re-analysis
   • More efficient access/storage
 Space-efficient:
   • ~ 60% less space than term vectors
     (english news text)
   • Best case: one byte per position
 Fast:
   • I/O regions should be in cache from
     scoring
   • Preliminary benchmarks: > 10x         Offse
                                           t

                                     20
Binary Terms


 In Lucene 4.0, index terms are byte[]
  • Do not need to be unicode text
 Example: Localized Sort and Range
  • Previous versions of Lucene: special encoder
  • Lucene 4.0: 50% space savings




                        21
Localized Sort/Range Example
Lucene: use CollationAnalyzer class
Solr: specify in schema.xml




                          22
Additional Index Statistics


 New statistics in Lucene 4.0:
  •   totalTermFreq(term): number of occurrences
  •   sumTotalTermFreq(field): number of tokens
  •   sumDocFreq(field): number of postings
  •   docCount(field): number of docs with value
 Available from Lucene APIs
 Available from function queries
 Supports additional scoring algorithms...



                            23
Search Improvements


 Improved Scoring API
  • Additional Scoring algorithms
  • Distributed Scoring API
 Spellchecking improvements
 Query parsing improvements




                      24
Scoring: Introduction


 Problem: “baked-in” vector space model
 How to integrate additional algorithms?
  • Example: Language Models
  • Before 4.0: write custom Queries
  • Before 4.0: track certain statistics yourself
 How to customize for your data?
  • Without digging into the guts of postings lists!
 Idea: separate “matching” from “scoring”



                             25
Scoring: additional algorithms


      Okapi BM25
      Language Models
      Divergence from Randomness
      Information-based Models




                    26
Scoring: Configuration
Lucene: use Similarity class
Solr: specify in schema.xml




                          27
Scoring: distributed API
 Statistics API on IndexSearcher
 64-bit collection/term statistics
   • For distributed collections > 2B docs
 Compatible with all scoring algorithms
 Solr implementation as a patch
   • Designed for both approximate and
     exact interchange



                      28
Spellchecking Improvements


 New DirectSpellChecker
  • No additional index needed
 Better Suggestions
  • Levenshtein vs. n-gram
 Exposes more configuration options
  • Tune to your collection




                       29
Spellchecking: Configuration
Lucene: use DirectSpellChecker class
Solr: specify in solrconfig.xml




                          30
Queryparsing Improvements

 Regular expression queries
  • /mycompany.(com|org|net)/
 Specify number of edits for fuzzy
  • foobar~2
 Wildcard escaping
  • crazy?*
 Range query syntax improvements
  • Mix inclusive/exclusive bounds
  • Support open-ended syntax in Lucene


                          31
Performance Improvements



   Concurrent Flushing
   BlockTree term dictionary
   Improved RAM Efficiency
   Faster Filtered Search




                       32
Concurrent Flushing




http://people.apache.org/~mikemccand/lucenebench
/
BlockTree term dictionary




    In Lucene 3 this thing is < 1
    QPS!
Improved RAM Efficiency
 Lucene 4.0 uses much less memory
  • Terms Index, Suggester, Synonyms : finite-state
  • Fieldcache: packed integers / utf-8
 Example with Wikipedia collection:
  “Memory footprint reduction from 389M to 90M
  after some off-the wall sorting and faceting”
Faster Filtered Search
 Up to 500% faster execution time for filtered
  search.
 Especially faster for “Hard” queries like Span
  queries / Sloppy Phrase Queries / High
  Frequency Boolean Queries
 Lucene decides between different execution
  strategies:
  • Conjunction of query and filter (via skipTo)
  • Drive only by Query, but with filter handled like
    deleted documents.
Future Improvements


 Block Index Compression
  • PFOR-delta, Simple8b, …
 Position iterators from Scorers
  • Simplify/speed up Highlighting
  • Proximity Scoring
 Structured/Section Scoring (e.g. BM25F)




                         37
Conclusion
 Lucene 4.0 will have many improvements,
  architectural, and API changes.
 A few of these were introduced here:
  • Ability to customize the index format
  • Additional scoring algorithms
  • Performance Improvements
 More are currently under development.
 For more information, look at CHANGES.txt in
  subversion, JIRA, ...


                          38
More Information

 Lucene Website
  • http://lucene.apache.org
 Users List
  • java-user@lucene.apache.org
 Lucene Revolution presentations
  •   http://www.lucidimagination.com/devzone/events/lucene_revolution_conferences

 Contact Info
  •   robert.muir@lucidimagination.com
  •   http://www.lucidimagination.com/blog/author/robert-muir
  •   @rcmuir




                                         39

More Related Content

What's hot

Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
Erik Hatcher
 
Search Lucene
Search LuceneSearch Lucene
Search Lucene
Jeremy Coates
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
Rahul Jain
 

What's hot (20)

Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Lucene indexing
Lucene indexingLucene indexing
Lucene indexing
 
Flexible Indexing in Lucene 4.0
Flexible Indexing in Lucene 4.0Flexible Indexing in Lucene 4.0
Flexible Indexing in Lucene 4.0
 
Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialecki
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Search Lucene
Search LuceneSearch Lucene
Search Lucene
 
Lucece Indexing
Lucece IndexingLucece Indexing
Lucece Indexing
 
Elasticsearch speed is key
Elasticsearch speed is keyElasticsearch speed is key
Elasticsearch speed is key
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processing
 
Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)
 
Solr 4
Solr 4Solr 4
Solr 4
 
Solr Recipes
Solr RecipesSolr Recipes
Solr Recipes
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy Sokolenko
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 

Similar to Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup

Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote   Yonik Seeley & Steve Rowe lucene solr roadmapKeynote   Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
lucenerevolution
 
KEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road mapKEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road map
lucenerevolution
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2
GokulD
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
Lucidworks
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
Erik Hatcher
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
WO Community
 

Similar to Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup (20)

Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote   Yonik Seeley & Steve Rowe lucene solr roadmapKeynote   Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
 
KEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road mapKEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road map
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Presto: Fast SQL on Everything
Presto: Fast SQL on EverythingPresto: Fast SQL on Everything
Presto: Fast SQL on Everything
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
Just the Job: Employing Solr for Recruitment Search -Charlie Hull
Just the Job: Employing Solr for Recruitment Search -Charlie Hull Just the Job: Employing Solr for Recruitment Search -Charlie Hull
Just the Job: Employing Solr for Recruitment Search -Charlie Hull
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
 
MongoDB 3.2 Feature Preview
MongoDB 3.2 Feature PreviewMongoDB 3.2 Feature Preview
MongoDB 3.2 Feature Preview
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Budapest Spring MUG 2016 - MongoDB User Group
Budapest Spring MUG 2016 - MongoDB User GroupBudapest Spring MUG 2016 - MongoDB User Group
Budapest Spring MUG 2016 - MongoDB User Group
 
Cross Site Collection Navigation using SPFx, Powershell PnP & PnP-JS
Cross Site Collection Navigation using SPFx, Powershell PnP & PnP-JSCross Site Collection Navigation using SPFx, Powershell PnP & PnP-JS
Cross Site Collection Navigation using SPFx, Powershell PnP & PnP-JS
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
 
Cross Site Collection Navigation
Cross Site Collection NavigationCross Site Collection Navigation
Cross Site Collection Navigation
 
Conceptos básicos. Seminario web 6: Despliegue de producción
Conceptos básicos. Seminario web 6: Despliegue de producciónConceptos básicos. Seminario web 6: Despliegue de producción
Conceptos básicos. Seminario web 6: Despliegue de producción
 
Webinar : Nouveautés de MongoDB 3.2
Webinar : Nouveautés de MongoDB 3.2Webinar : Nouveautés de MongoDB 3.2
Webinar : Nouveautés de MongoDB 3.2
 

Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup

  • 1. Improved Search with Lucene 4.0 NOVA Apache Lucene/Solr Meetup Robert Muir, Lucid Imagination robert.muir@lucidimagination.com, 6/14/2012
  • 2. Introductions  What: Examine improvements in 4.0  Who am I? • Lucene Committer & PMC Member • Lucid Imagination Employee  Many big changes coming!  Examine three general areas: • Indexing Improvements • Search Improvements • Performance Improvements  Future improvements 2
  • 3. But first: Lucene 101  Search engine library • Not an application: use Solr  Open Source • Apache Software Foundation  Documents: collection of Fields • What these are is up to you. • Not a parser: use Tika  Inverted Index • Terms map to documents • Provide search terms: get ranked results 3
  • 4. Lucene 4.0  Not yet released • Nearing alpha stage • Alpha → Beta → Final • Will release in parallel with Solr 4.0  Modular API • Core, Analysis, Autosuggest, ...  What changed? 4
  • 5. Indexing Improvements  Codecs • Flexible index format  Index DocValues • Efficient per-document lookup values  Miscellaneous Improvements • Offsets in postings • Binary terms • Additional index statistics 5
  • 6. Lucene: indexing basics  Incremental indexing • Segment is a mini-index • Periodically merged • Update = delete + add  IndexWriter • Add/update/delete docs  IndexReader • Snapshot-in-time view 6
  • 7. Codecs: Introduction  Problem: “one size fits all” index format  How to integrate future improvements? • Example: more efficient index compression • But need to support old format, too.  How to optimize for your data? • Without digging into the guts of indexer!  Idea: format should be pluggable 7
  • 9. Codecs: entire index  Customize the encoding, data structures, files  Customize all portions of the index • PostingsFormat: terms, docs, positions, … • LiveDocsFormat: deleted documents • StoredFieldsFormat: stored fields • TermVectorsFormat: term vectors • ... 9
  • 10. Codecs: Example use case  Accelerate near-realtime reopen  Speed up delete by term on ID field • “Primary key lookup”  Idea: optimize ID field for this use case 10
  • 11. Codecs: postings formats  Lucene40: Lucene 4.0 index format  Pulsing: inline low frequency terms' postings  Memory: loads field entirely into RAM  Appending: supports append-only filesystem  SimpleText: plaintext (not for production!!!!!)  Lucene3x: supports Lucene 3.x index format 11
  • 12. Codecs: Configuration Lucene: use Codec class Solr: specify in solrconfig.xml/schema.xml 12
  • 14. Index DocValues: Introduction  Problem: lookup a per-document value • RAM-resident for things like scoring factors (e.g. pagerank) • Disk-resident for things like document versioning  Existing workarounds in Lucene have limitations • Scoring limited to one byte norm • FieldCache requires uninversion to construct • Stored fields are slow, geared towards summary results  Idea: add performant, flexible per-document lookups. 14
  • 15. Index DocValues: Features  New feature, recently added to Lucene ✔ External scoring factors ✔ Memory-resident and disk-based lookup ✔ DocValues sort-by-term ✔ Internal scoring factors ✔ Grouping support ✗ FST sort implementation ✗ Independently updatable ✗ Integration with Solr 15
  • 17. Offsets in Postings  Problem: search result highlighting • Typically a performance bottleneck  Two strategies today: neither is great • Re-analyze • Term vectors  Highlighting is really important • How users judge if a result is relevant • Who wants a search engine w/o highlighting!  How can we speed this up? 17
  • 18. Current strategy #1: re-analyze  Analyze documents on-the-fly • Same way as at indexing time • Map terms back to locations  Can be ok if: • Analysis is very simple • Documents are tiny  Can be slow: • Complex analysis pipelines • Documents are large 18
  • 19. Current strategy #2: term vectors  Inverted index for each doc • Avoids re-analysis  Can be ok if: • Analysis is very complex • Lots of disk space  Can be slow: • 3 seeks per document (.tvx, .tvd, .tvf) 19
  • 20. New strategy: offsets in postings  Offsets in the positions file • Avoids re-analysis • More efficient access/storage  Space-efficient: • ~ 60% less space than term vectors (english news text) • Best case: one byte per position  Fast: • I/O regions should be in cache from scoring • Preliminary benchmarks: > 10x Offse t 20
  • 21. Binary Terms  In Lucene 4.0, index terms are byte[] • Do not need to be unicode text  Example: Localized Sort and Range • Previous versions of Lucene: special encoder • Lucene 4.0: 50% space savings 21
  • 22. Localized Sort/Range Example Lucene: use CollationAnalyzer class Solr: specify in schema.xml 22
  • 23. Additional Index Statistics  New statistics in Lucene 4.0: • totalTermFreq(term): number of occurrences • sumTotalTermFreq(field): number of tokens • sumDocFreq(field): number of postings • docCount(field): number of docs with value  Available from Lucene APIs  Available from function queries  Supports additional scoring algorithms... 23
  • 24. Search Improvements  Improved Scoring API • Additional Scoring algorithms • Distributed Scoring API  Spellchecking improvements  Query parsing improvements 24
  • 25. Scoring: Introduction  Problem: “baked-in” vector space model  How to integrate additional algorithms? • Example: Language Models • Before 4.0: write custom Queries • Before 4.0: track certain statistics yourself  How to customize for your data? • Without digging into the guts of postings lists!  Idea: separate “matching” from “scoring” 25
  • 26. Scoring: additional algorithms  Okapi BM25  Language Models  Divergence from Randomness  Information-based Models 26
  • 27. Scoring: Configuration Lucene: use Similarity class Solr: specify in schema.xml 27
  • 28. Scoring: distributed API  Statistics API on IndexSearcher  64-bit collection/term statistics • For distributed collections > 2B docs  Compatible with all scoring algorithms  Solr implementation as a patch • Designed for both approximate and exact interchange 28
  • 29. Spellchecking Improvements  New DirectSpellChecker • No additional index needed  Better Suggestions • Levenshtein vs. n-gram  Exposes more configuration options • Tune to your collection 29
  • 30. Spellchecking: Configuration Lucene: use DirectSpellChecker class Solr: specify in solrconfig.xml 30
  • 31. Queryparsing Improvements  Regular expression queries • /mycompany.(com|org|net)/  Specify number of edits for fuzzy • foobar~2  Wildcard escaping • crazy?*  Range query syntax improvements • Mix inclusive/exclusive bounds • Support open-ended syntax in Lucene 31
  • 32. Performance Improvements  Concurrent Flushing  BlockTree term dictionary  Improved RAM Efficiency  Faster Filtered Search 32
  • 34. BlockTree term dictionary In Lucene 3 this thing is < 1 QPS!
  • 35. Improved RAM Efficiency  Lucene 4.0 uses much less memory • Terms Index, Suggester, Synonyms : finite-state • Fieldcache: packed integers / utf-8  Example with Wikipedia collection: “Memory footprint reduction from 389M to 90M after some off-the wall sorting and faceting”
  • 36. Faster Filtered Search  Up to 500% faster execution time for filtered search.  Especially faster for “Hard” queries like Span queries / Sloppy Phrase Queries / High Frequency Boolean Queries  Lucene decides between different execution strategies: • Conjunction of query and filter (via skipTo) • Drive only by Query, but with filter handled like deleted documents.
  • 37. Future Improvements  Block Index Compression • PFOR-delta, Simple8b, …  Position iterators from Scorers • Simplify/speed up Highlighting • Proximity Scoring  Structured/Section Scoring (e.g. BM25F) 37
  • 38. Conclusion  Lucene 4.0 will have many improvements, architectural, and API changes.  A few of these were introduced here: • Ability to customize the index format • Additional scoring algorithms • Performance Improvements  More are currently under development.  For more information, look at CHANGES.txt in subversion, JIRA, ... 38
  • 39. More Information  Lucene Website • http://lucene.apache.org  Users List • java-user@lucene.apache.org  Lucene Revolution presentations • http://www.lucidimagination.com/devzone/events/lucene_revolution_conferences  Contact Info • robert.muir@lucidimagination.com • http://www.lucidimagination.com/blog/author/robert-muir • @rcmuir 39