Is Your Index Reader Really Atomic or Maybe Slow?

3,609 views

Published on

Presented by Uwe Schindler | SD DataSolutions GmbH - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

Since the first day, Apache Lucene exposed the two fundamental concepts of reading and writing an index directly through IndexReader & IndexWriter. However, the API did not reflect reality; from the IndexWriter perspective this was desirable but when reading the index this caused several problems in the past. In reality a Lucene index is not a single index while logically treated as a such. This talk will introduce the new API classes AtomicReader and CompositeReader added in Lucene 4.0 as very general interfaces, and DirectoryReader, which most people know as the segment-based “Lucene index on disk”. The talk will also cover more changes and improvements to the search API like reader contexts that allow to convert local document ids to global ones from IndexSearcher. Lucene changed all IndexReaders to be read-only, so it’s no longer possible to modify indexes using those classes. Finally, Uwe Schindler will show migration paths from custom norm values to the various new ranking models that were added to Lucene; this includes using Similarity with Lucene 4.0’s DocValues as replacement for norms.

Published in: Technology
0 Comments
6 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
3,609
On SlideShare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
22
Comments
0
Likes
6
Embeds 0
No embeds

No notes for slide

Is Your Index Reader Really Atomic or Maybe Slow?

  1. 1. Is Your IndexReader Really Atomic or Maybe Slow? Uwe Schindler SD DataSolutions GmbH, uschindler@sd-datasolutions.de 1
  2. 2. My Background• I am committer and PMC member of Apache Lucene and Solr. My main focus is on development of Lucene Java.• Implemented fast numerical search and maintaining the new attribute-based text analysis API. Well known as Generics and Sophisticated Backwards Compatibility Policeman.• Working as consultant and software architect for SD DataSolutions GmbH in Bremen, Germany. The main task is maintaining PANGAEA (Publishing Network for Geoscientific & Environmental Data) where I implemented the portals geo-spatial retrieval functions with Apache Lucene Core.• Talks about Lucene at various international conferences like the previous Lucene Revolution, Lucene Eurocon, ApacheCon EU/US, Berlin Buzzwords, and various local meetups. 2
  3. 3. Agenda• Motivation / History of Lucene• AtomicReader & CompositeReader• Reader contexts• Wrap up 3
  4. 4. Lucene Index Structure• Lucene was the first full text search engine that supported document additions and updates• Snapshot isolation ensures consistency  Segmented index structure  Committing changes creates new segments 4
  5. 5. Segments in Lucene • Each index consists of various segments placed in the index directory. All documents are added to new new segment files, merged with other on-disk files after flushing.Image: Doug Cutting *) The term “optimal” does not mean all indexes must be optimized! 5
  6. 6. Segments in Lucene • Each index consists of various segments placed in the index directory. All documents are added to new new segment files, merged with other on-disk files after flushing. • Lucene writes segments incrementally and can merge them.Image: Doug Cutting *) The term “optimal” does not mean all indexes must be optimized! 6
  7. 7. Segments in Lucene • Each index consists of various segments placed in the index directory. All documents are added to new new segment files, merged with other on-disk files after flushing. • Lucene writes segments incrementally and can merge them.Image: Doug Cutting • Optimized* index consists of one segment. *) The term “optimal” does not mean all indexes must be optimized! 7
  8. 8. Lucene merges while indexing all of English Wikipedia Video by courtesy of: 8 Mike McCandless’ blog, http://goo.gl/kI53f
  9. 9. Indexes in Lucene (up to version 3.6)• Each segment (“atomic index”) is a completely functional index: – SegmentReader implements the IndexReader interface for single segments• Composite indexes – DirectoryReader implements the IndexReader interface on top of a set of SegmentReaders – MultiReader is an abstraction of multiple IndexReaders combined to one virtual index 9
  10. 10. Composite Indexes (up to version 3.6)Atomic “views” on multi-segment index: – Term Dictionary: on-the-fly merged & sorted term index (priority queue for TermEnum,…) – Postings: posting lists for each term appended, convert document IDs to be global – Metadata: doc frequency, term frequency,… – Stored fields, term vectors, deletions: delegate global document IDs -> segment document IDs (binary search) – FieldCache: duplicate instances for single segments and composite view (memory!!!) 10
  11. 11. Merging Term Index and Postings Term 1Term 1 Term 1 Term 2Term 2 Term 3 Term 3Term 4 Term 5 Term 4Term 7 Term 6 Term 5Term 8 Term 8 Term 6Term 9 Term 9 Term 7 Term 8 Term 9 11
  12. 12. Merging Term Index and Postings Term 1Term 1 Term 1 Term 2Term 2 Term 3 Term 3Term 4 Term 5 Term 4Term 7 Term 6 Term 5Term 8 Term 8 Term 6Term 9 Term 9 Term 7 Term 8 Term 9 12
  13. 13. Composite Indexes (up to version 3.6)Atomic “views” on multi-segment index: – Term Dictionary: on-the-fly merged & sorted term index (priority queue for TermEnum,…) – Postings: posting lists for each term appended, convert document IDs to be global – Metadata: doc frequency, term frequency,… – Stored fields, term vectors, deletions: delegate global document IDs -> segment document IDs (binary search) – FieldCache: duplicate instances for single segments and composite view (memory!!!) 13
  14. 14. Searching before Version 2.9• IndexSearcher used the underlying index always as a “single” (atomic) index: – Queries are executed on the atomic view of a composite index – Slowdown for queries that scan term dictionary (MultiTermQuery) or hit lots of documents, facetting Image: © The Walt Disney Company => recommendation to “optimize” index – On every index change, FieldCache used for sorting had to be reloaded completely 14
  15. 15. Lucene 2.9 and later: Per segment searchSearch is executed separately on each index segment: Atomic view no longer used! 15
  16. 16. Per Segment Search: Pros• No live term dictionary merging• Possibility to parallelize – ExecutorService in IndexSearcher since Lucene 3.1+ – Do not optimize to make this work!• Sorting only needs per-segment FieldCache – Cheap reopen after index changes!• Filter cache per segment – Cheap reopen after index changes! 16
  17. 17. Per Segment Search: Cons• Query/Filter API changes: – Scorer / Filter’s DocIdSet no longer use global document IDs• Slower sorting by string terms – Term ords are only comparable inside each segment – String comparisons needed after segment traversal – Use numeric sorting if possible!!! (Lucene supports missing values since version 3.4 [buggy], corrected 3.5+) 17
  18. 18. Agenda• Motivation / History of Lucene• AtomicReader & CompositeReader• Reader contexts• Wrap up 18
  19. 19. Composite Indexes (up to version 3.6)Atomic “views” on multi-segment index: – Term Dictionary: on-the-fly merged & sorted term index (priority queue for TermEnum,…) – Postings: posting lists for each term appended, convert document IDs to be global – Metadata: doc frequency, term frequency,… – Stored fields, term vectors, deletions: delegate global document IDs -> segment document IDs (binary search) – FieldCache: duplicate instances for single segments and composite view (memory!!!) 19
  20. 20. Composite Indexes (version 4.0)Atomic “views” on multi-segment index: – Term Dictionary: on-the-fly merged & sorted term index (priority queue for TermEnum,…) – Postings: posting lists for each term appended, convert document IDs to be global – Metadata: doc frequency, term frequency,… – Stored fields, term vectors, deletions: delegate global document IDs -> segment document IDs (binary search) – FieldCache: duplicate instances for single segments and composite view (memory!!!) 20
  21. 21. Early Lucene 4.0 • Only “historic” IndexReader interface available since Lucene 1.0 • 80% of all methods of composite IndexReaders throwed UnsupportedOperationException – This affected all user-facing APIs (SegmentReader was hidden / marked experimental) • No compile time safety!Image: © The Walt Disney Company – Query Scorers and Filters need term dictionary and postings, throwing UOE when executed on composite reader 21
  22. 22. Image: © The Walt Disney Company Heavy Committing™ !!! 22
  23. 23. IndexReader• Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor) 23
  24. 24. IndexReader• Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor)• Does not know anything about “inverted index” concept – no terms, no postings! 24
  25. 25. IndexReader• Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor)• Does not know anything about “inverted index” concept – no terms, no postings!• Access to stored fields (and term vectors) by document ID – to display / highlight search results only! 25
  26. 26. IndexReader• Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor)• Does not know anything about “inverted index” concept – no terms, no postings!• Access to stored fields (and term vectors) by document ID – to display / highlight search results only!• Some very limited metadata – number of documents,… 26
  27. 27. IndexReader• Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor)• Does not know anything about “inverted index” concept – no terms, no postings!• Access to stored fields (and term vectors) by document ID – to display / highlight search results only!• Some very limited metadata – number of documents,…• Unmodifiable: No more deletions / changing of norms possible! – Applies to all readers in Lucene 4.0 !!! 27
  28. 28. IndexReader• Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor)• Does not know anything about “inverted index” concept – no terms, no postings!• Access to stored fields (and term vectors) by document ID – to display / highlight search results only!• Some very limited metadata – number of documents,…• Unmodifiable: No more deletions / changing of norms possible! – Applies to all readers in Lucene 4.0 !!!• Passed to IndexSearcher – just as before! 28
  29. 29. IndexReader• Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor)• Does not know anything about “inverted index” concept – no terms, no postings!• Access to stored fields (and term vectors) by document ID – to display / highlight search results only!• Some very limited metadata – number of documents,…• Unmodifiable: No more deletions / changing of norms possible! – Applies to all readers in Lucene 4.0 !!!• Passed to IndexSearcher – just as before!• Allows IndexReader.open() for backwards compatibility (deprecated) 29
  30. 30. AtomicReader• Inherits from IndexReader 30
  31. 31. AtomicReader• Inherits from IndexReader• Access to “atomic” indexes (single segments) 31
  32. 32. AtomicReader• Inherits from IndexReader• Access to “atomic” indexes (single segments)• Full term dictionary and postings API 32
  33. 33. AtomicReader• Inherits from IndexReader• Access to “atomic” indexes (single segments)• Full term dictionary and postings API 33
  34. 34. AtomicReader• Inherits from IndexReader• Access to “atomic” indexes (single segments)• Full term dictionary and postings API• Access to DocValues (new in Lucene 4.0) and norms 34
  35. 35. CompositeReader• No additional functionality on top of IndexReader 35
  36. 36. CompositeReader• No additional functionality on top of IndexReader• Provides getSequentialSubReaders() to retrieve all child readers 36
  37. 37. CompositeReader• No additional functionality on top of IndexReader• Provides getSequentialSubReaders() to retrieve all child readers• DirectoryReader and MultiReader implement this class 37
  38. 38. DirectoryReader• Abstract class, defines interface for: 38
  39. 39. DirectoryReader• Abstract class, defines interface for: – access to on-disk indexes (on top of Directory class) – access to commit points, index metadata, index version, isCurrent() for reopen support – defines abstract openIfChanged (for cheap reopening of indexes) – child readers are always AtomicReader instances 39
  40. 40. DirectoryReader• Abstract class, defines interface for: – access to on-disk indexes (on top of Directory class) – access to commit points, index metadata, index version, isCurrent() for reopen support – defines abstract openIfChanged (for cheap reopening of indexes) – child readers are always AtomicReader instances• Provides static factory methods for opening indexes – well-known from IndexReader in Lucene 1 to 3 – factories return internal DirectoryReader implementation (StandardDirectoryReader with SegmentReaders as childs) 40
  41. 41. Basic Search ExampleImage: © The Walt Disney Company Looks familiar, doesn’t it? 41
  42. 42. “Arrgh! I still need terms and postings on myDirectoryReader!!! What should I do? Optimizeto only have one segment? Help me please!!!” (example question on java-user@lucene.apache.org) 42
  43. 43. What to do?• Calm down• Take a break and drink a beer*• Don’t optimize / force merge your index!!! *) Robert’s solution 43
  44. 44. Solutions• Most efficient way: – Retrieve atomic leaves from your composite: reader.getTopReaderContext().leaves() – Iterate over sub-readers, do the work (possibly parallelized) – Merge results 44
  45. 45. Solutions• Most efficient way: – Retrieve atomic leaves from your composite: reader.getTopReaderContext().leaves() – Iterate over sub-readers, do the work (possibly parallelized) – Merge results• Otherwise: wrap your CompositeReader: 45
  46. 46. (Slow) SolutionAtomicReader SlowCompositeReaderWrapper.wrap(IndexReader r)• Wraps IndexReaders of any kind as atomic reader, providing terms, postings, deletions, doc values – Internally uses same algorithms like previous Lucene readers – Segment-merging uses this to merge segments, too 46
  47. 47. (Slow) SolutionAtomicReader SlowCompositeReaderWrapper.wrap(IndexReader r)• Wraps IndexReaders of any kind as atomic reader, providing terms, postings, deletions, doc values – Internally uses same algorithms like previous Lucene readers – Segment-merging uses this to merge segments, too• Solr always provides an AtomicReader for convenience through SolrIndexSearcher. Plugin writers should use:AtomicReader rd = mySolrIndexSearcher.getAtomicReader() 47
  48. 48. Other readers• FilterAtomicReader – Was FilterIndexReader in 3.x (but now solely works on atomic readers) – Allows to filter terms, postings, deletions – Useful for index splitters (e.g., PKIndexSplitter, MultiPassIndexSplitter) (provide own getLiveDocs() method, merge to IndexWriter)• ParallelAtomicReader, -CompositeReader 48
  49. 49. IndexReader Reopening• Reopening solely provided by directory-based DirectoryReader instances 49
  50. 50. IndexReader Reopening• Reopening solely provided by directory-based DirectoryReader instances• No more reopen for: – AtomicReader: they are atomic, no refresh possible – MultiReader: reopen child readers separately, create new MultiReader on top of reopened readers – Parallel*Reader, FilterAtomicReader: reopen wrapped readers, create new wrapper afterwards 50
  51. 51. Agenda• Motivation / History of Lucene• AtomicReader & CompositeReader• Reader contexts• Wrap up 51
  52. 52. IndexReaderContextAtomicReaderContextCompositeReaderContextWTF ?!? 52
  53. 53. The Problem SuperComposite: MultiReader CompositeA: CompositeB: DirectoryReader DirectoryReaderAtomic0 Atomic1 Atomic2 Atomic0 Atomic1 Atomic2Doc0 Doc0 Doc0 Doc0 Doc0 Doc0Doc1 Doc1 Doc1 Doc1 Doc1 Doc1Doc2 Doc2 Doc2 Doc2 Doc2 Doc2Doc3 Doc3 Doc3 Doc3 Doc3 Doc3 53
  54. 54. The Problem SuperComposite: MultiReader CompositeA: CompositeB: DirectoryReader DirectoryReaderAtomic0 Atomic1 Atomic2 Atomic0 Atomic1 Atomic2Doc0 Doc4 Doc8 Doc0 Doc0 Doc0Doc1 Doc5 Doc9 Doc1 Doc1 Doc1Doc2 Doc6 Doc10 Doc2 Doc2 Doc2Doc3 Doc7 Doc11 Doc3 Doc3 Doc3 54
  55. 55. The Problem SuperComposite: MultiReader CompositeA: CompositeB: DirectoryReader DirectoryReaderAtomic0 Atomic1 Atomic2 Atomic0 Atomic1 Atomic2Doc0 Doc4 Doc8 Doc12 Doc16 Doc20Doc1 Doc5 Doc9 Doc13 Doc17 Doc21Doc2 Doc6 Doc10 Doc14 Doc18 Doc22Doc3 Doc7 Doc11 Doc15 Doc19 Doc23 55
  56. 56. Solution• Each IndexReader instance provides: IndexReaderContext getTopReaderContext() 56
  57. 57. Solution• Each IndexReader instance provides: IndexReaderContext getTopReaderContext()• The context provides a “view” on all childs (direct descendants) and leaves (down to lowest level AtomicReaders) 57
  58. 58. Solution• Each IndexReader instance provides: IndexReaderContext getTopReaderContext()• The context provides a “view” on all childs (direct descendants) and leaves (down to lowest level AtomicReaders)• Each atomic leave has a docBase (document ID offset) 58
  59. 59. Solution• Each IndexReader instance provides: IndexReaderContext getTopReaderContext()• The context provides a “view” on all childs (direct descendants) and leaves (down to lowest level AtomicReaders)• Each atomic leave has a docBase (document ID offset)• IndexSearcher passes a context instance relative to its own top-level reader to each Query-Scorer / Filter – allows to access complete reader tree up to the current top- level reader – Allows to get the “global” document ID 59
  60. 60. Agenda• Motivation / History of Lucene• AtomicReader & CompositeReader• Reader contexts• Wrap up 60
  61. 61. Summary• Up to Lucene 3.6 indexes are still accessible on any hierarchy level with same interface• Lucene 4.0 will split the IndexReader class into several abstract interfaces• IndexReaderContexts will support per-segment search preserving top-level document IDs 61
  62. 62. Summary• Lucene moved from global searches to per- segment search in Lucene 2.9• Up to Lucene 3.6 indexes are still accessible on any hierarchy level with same interface• Lucene 4.0 will split the IndexReader class into several abstract interfaces• IndexReaderContexts will support per-segment search preserving top-level document IDs 62
  63. 63. Summary• Up to Lucene 3.6 indexes are still accessible on any hierarchy level with same interface• Lucene 4.0 will split the IndexReader class into several abstract interfaces• IndexReaderContexts will support per-segment search preserving top-level document IDs 63
  64. 64. Summary• Lucene 4.0 will split the IndexReader class into several abstract interfaces• IndexReaderContexts will support per-segment search preserving top-level document IDs 64
  65. 65. Summary• Lucene 4.0 will split the IndexReader class into several abstract interfaces• IndexReaderContexts will support per-segment search preserving top-level document IDs 65
  66. 66. Contact Uwe Schindler uschindler@apache.org http://www.thetaphi.de @thetaph1 SD DataSolutions GmbH Wätjenstr. 49 28213 Bremen, Germany +49 421 40889785-0http://www.sd-datasolutions.de 66

×