Lucene Boot Camp
Grant Ingersoll
Lucid Imagination
Nov. 4, 2008
New Orleans, LA
  • Provide info about Term Dictionary
  • Look at IndexWriter.optimize() options
  • See TopDocsTest.java in src/test
  • Examine FieldSelectorTest code
  • Lucene Bootcamp - 2

    1. 1. Lucene Boot Camp Grant Ingersoll Lucid Imagination Nov. 4, 2008 New Orleans, LA
    2. 2. 2 Schedule • In-depth Indexing/Searching – Performance, Internals – Filters, Sorting • Terms and Term Vectors • Class Project • Q & A
    3. 3. 3 Day I Recap • Indexing – IndexWriter – Document/Field – Analyzer • Searching – IndexSearcher – IndexReader – QueryParser • Analysis • Contrib
    4. 4. 4 Indexing In-Depth • Deletions and Updates • Optimize • Important Internals – File Formats – Segments, Commits, Merging – Compound File System • Performance
    5. 5. 5 Lucene File Formats and Structures • http://lucene.apache.org/java/2_4_0/fileformats.html • A Lucene index is made up of one or more Segments • Lucene tracks Documents internally by an int “id” • This id may change across index operations – You should not rely on it unless you know your index isn’t changing • You can ask for a Document by this id on the IndexReader
    6. 6. 6 Segments • Each Segment is an independent index containing: – Field Names – Stored Field values – Term Dictionary, proximity info and normalization factors – Term Vectors (optional) – Deleted Docs • Compound File System (CFS) stores all of these logical pieces in a single file
    7. 7. How Lucene Indexes • Lucene indexes Documents into memory – At certain trigger points, the in-memory segments are committed/flushed to the Directory • Can be forced by calling commit() – Segments are periodically merged (more in a moment)
    8. 8. 8 Segments and Merging • May be created when new documents are added • Are merged from time to time based on segment size in relation to: – MergePolicy – MergeScheduler – Optimization
    9. 9. 9 Merge Policy • Identifies Segments to be merged • Two Current Implementations – LogDocMergePolicy – LogByteSizeMergePolicy • mergeFactor - Max # of segments allowed before merging
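As a rough illustration of what mergeFactor controls, here is a toy model, not Lucene's merge code. It assumes every flush produces a segment of the same size and that a merge is triggered whenever the newest mergeFactor segments are equal in size. The class name `ToyLogMerge` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: each segment is represented by its document count.
final class ToyLogMerge {
    private final int mergeFactor;
    private final List<Integer> segments = new ArrayList<Integer>();

    ToyLogMerge(int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        this.mergeFactor = mergeFactor;
    }

    // Flush one new segment holding docCount documents, then merge while possible.
    void flush(int docCount) {
        segments.add(docCount);
        maybeMerge();
    }

    // While the newest mergeFactor segments have equal size,
    // replace them with a single merged segment (cascading upward).
    private void maybeMerge() {
        while (segments.size() >= mergeFactor) {
            int n = segments.size();
            int size = segments.get(n - 1);
            boolean sameSize = true;
            for (int i = n - mergeFactor; i < n; i++) {
                if (segments.get(i) != size) {
                    sameSize = false;
                    break;
                }
            }
            if (!sameSize) {
                return;
            }
            for (int i = 0; i < mergeFactor; i++) {
                segments.remove(segments.size() - 1);
            }
            segments.add(size * mergeFactor);
        }
    }

    // Current segment sizes, oldest first.
    List<Integer> segmentSizes() {
        return new ArrayList<Integer>(segments);
    }
}
```

In this model a small mergeFactor keeps the segment count low at the price of more frequent merging, which matches the trade-off the slides describe.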
    10. 10. 10 MergeScheduler • Responsible for performing the merge • Two Implementations: – Serial - blocking – Concurrent - new, background
    11. 11. 11 Optimize • Optimize is the process of merging segments down into a single segment • This process can yield significant speedups in search • Can be slow • Can also do partial optimizes
    12. 12. 12 Final Thoughts On Merging • Usually don’t have to think about it, except when to optimize • In high update, performance critical environments, you may need to dig into it more as it can sometimes cause long pauses • Good to optimize when you can; otherwise, keep a low mergeFactor
    13. 13. Deletion • A deletion only marks the Document as deleted – Doesn’t get physically removed until a merge • Deletions can be a bit confusing – Both IndexReader and IndexWriter have delete methods • By: id, term(s), Query(s)
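The "mark now, remove at merge" behaviour can be sketched with a toy segment. This is an illustration under simplifying assumptions, not Lucene's data structures: `ToySegment` is a hypothetical class, documents are plain strings, and deletion is by id only. The names `maxDoc` and `numDocs` are borrowed as an analogy for counting slots versus live documents.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model only: a segment is a list of documents plus a "deleted" bit per slot.
final class ToySegment {
    private final List<String> docs = new ArrayList<String>();
    private final BitSet deleted = new BitSet();

    // Returns the id (slot number) assigned to the new document.
    int add(String doc) {
        docs.add(doc);
        return docs.size() - 1;
    }

    // Deleting only sets a bit; the document still occupies its slot.
    void delete(int docId) {
        if (docId < 0 || docId >= docs.size()) {
            throw new IndexOutOfBoundsException("no such doc: " + docId);
        }
        deleted.set(docId);
    }

    boolean isDeleted(int docId) {
        return deleted.get(docId);
    }

    // Number of slots, including deleted ones.
    int maxDoc() {
        return docs.size();
    }

    // Number of live documents.
    int numDocs() {
        return docs.size() - deleted.cardinality();
    }

    // A merge copies only live documents into a new segment, so ids are reassigned.
    static ToySegment merge(ToySegment... inputs) {
        ToySegment out = new ToySegment();
        for (ToySegment s : inputs) {
            for (int i = 0; i < s.docs.size(); i++) {
                if (!s.deleted.get(i)) {
                    out.add(s.docs.get(i));
                }
            }
        }
        return out;
    }
}
```

After a delete, `maxDoc()` and `numDocs()` differ until a merge rewrites the segment, which is also why internal ids should not be relied on across index operations.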
    14. 14. 14 Task – Build your index from yesterday and then try some deletes • Id, term, Query – Also try out an optimize on a FSDirectory against the full Reuters sample – 15-20 minutes
    15. 15. 15 Updates • Updates are always a delete and an add • Updates are always a delete and an add – Yes, that is a repeat! – Nature of data structures used in search • See IndexWriter.updateDocument()
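A minimal sketch of "update = delete + add", assuming each document carries a unique key that plays the role of the identifying term. `ToyUpdateIndex` and its method bodies are hypothetical; only the method name `updateDocument` comes from the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: documents are {key, body} pairs with a per-slot "live" flag.
final class ToyUpdateIndex {
    private final List<String[]> docs = new ArrayList<String[]>();
    private final List<Boolean> live = new ArrayList<Boolean>();

    int addDocument(String key, String body) {
        docs.add(new String[] {key, body});
        live.add(Boolean.TRUE);
        return docs.size() - 1;
    }

    // Marks every live document with this key as deleted; returns how many were marked.
    int deleteDocuments(String key) {
        int count = 0;
        for (int i = 0; i < docs.size(); i++) {
            if (live.get(i) && docs.get(i)[0].equals(key)) {
                live.set(i, Boolean.FALSE);
                count++;
            }
        }
        return count;
    }

    // An update deletes everything matching the key, then adds the new version.
    int updateDocument(String key, String body) {
        deleteDocuments(key);
        return addDocument(key, body);
    }

    // Body of a live document, or null if the slot has been deleted.
    String body(int docId) {
        return live.get(docId) ? docs.get(docId)[1] : null;
    }
}
```

Note that the updated document receives a new internal id; nothing is modified in place.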
    16. 16. Performance Factors • setRAMBufferSizeMB – New model for automagically controlling indexing factors based on the amount of memory in use – Obsoletes setMaxBufferedDocs • maxBufferedDocs – Minimum # of buffered docs before a flush occurs and a new segment is created – Usually, Larger == faster, but more RAM
    17. 17. 17 More Factors • mergeFactor – How often segments are merged – Smaller == less RAM, better for incremental updates – Larger == faster, better for batch indexing • maxFieldLength – Limit the number of terms in a Document • Analysis • Reuse – Document, TokenStream, Token
    18. 18. Index Threading • IndexWriter and IndexReader are thread-safe and can be shared between threads without external synchronization • One open IndexWriter per Directory • Parallel Indexing – Index to separate Directory instances – Merge using IndexWriter.addIndexes – Could also distribute and collect
    19. 19. Benchmarking Indexing • contrib/benchmark • Try out different algorithms between Lucene 2.2 and 2.3 – contrib/benchmark/conf: • indexing.alg • indexing-multithreaded.alg • Info: – Mac Pro 2 x 2GHz Dual-Core Xeon – 4 GB RAM – ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M
    20. 20. Benchmarking Results Records/Sec Avg. T Mem 2.2 421 39M Trunk 2,122 52M Trunk-mt (4) 3,680 57M Your results will depend on analysis, etc.
    21. 21. Searching • Earlier we touched on basics of search using the QueryParser • Now look at: – Searcher/IndexReader Lifecycle – Query classes – More details on the QueryParser – Filters – Sorting
    22. 22. Lifecycle • Recall that the IndexReader loads a snapshot of index into memory – This means updates made since loading the index will not be seen • Business rules are needed to define how often to reload the index, if at all – IndexReader.isCurrent() can help • Loading an index is an expensive operation – Do not open a Searcher/IndexReader for every search
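The snapshot behaviour can be sketched as follows. This is a toy model with hypothetical classes (`ToyDirectory`, `ToyReader`), assuming a simple version counter that is bumped on every commit; it is meant only to show why an open reader does not see later changes and how an `isCurrent()`-style check can drive a reload policy.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only: the "index" is a list of documents plus a version counter.
final class ToyDirectory {
    private final List<String> docs = new ArrayList<String>();
    private long version = 0;

    synchronized void commit(String doc) {
        docs.add(doc);
        version++;
    }

    synchronized long version() {
        return version;
    }

    // Opening a reader copies the current state; later commits are invisible to it.
    synchronized ToyReader openReader() {
        return new ToyReader(this, new ArrayList<String>(docs), version);
    }
}

final class ToyReader {
    private final ToyDirectory dir;
    private final List<String> snapshot;
    private final long version;

    ToyReader(ToyDirectory dir, List<String> snapshot, long version) {
        this.dir = dir;
        this.snapshot = Collections.unmodifiableList(snapshot);
        this.version = version;
    }

    int numDocs() {
        return snapshot.size();
    }

    // True while no commit has happened since this reader was opened.
    boolean isCurrent() {
        return version == dir.version();
    }
}
```

An application would typically keep one shared reader and replace it only when its business rules say so, for example on a timer that checks `isCurrent()`.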
    23. 23. 23 Reopen • It is possible to have IndexReader reopen new or changed segments – Save some on the cost of loading a new index • Does not close the old reader, so the application must • See DeletionsUpdatesTest.testReopen()
    24. 24. Query Classes • TermQuery is basis for all non-span queries • BooleanQuery combines multiple Query instances as clauses – should – required • PhraseQuery finds terms occurring near each other, position-wise – “slop” is an edit distance: the number of position moves allowed between the query phrase and a match • Take 2-3 minutes to explore Query implementations
    25. 25. Spans • Spans provide information about where matches took place • Not supported by the QueryParser • Can be used in BooleanQuery clauses • Take 2-3 minutes to explore SpanQuery classes – SpanNearQuery useful for doing phrase matching
    26. 26. QueryParser • MultiFieldQueryParser • Boolean operators cause confusion – Better to think in terms of required (+ operator) and not allowed (- operator) • Check JIRA for QueryParser issues • http://www.gossamer-threads.com/lists/lucene/java-user/40945 • Most applications either modify QP, create their own, or restrict to a subset of the syntax • Your users may not need all the “flexibility” of the QP
    27. 27. Sorting • Lucene default sort is by score • Searcher has several methods that take in a Sort object • Sorting should be addressed during indexing • Sorting is done on Fields containing a single term that can be used for comparison • The SortField defines the different sort types available – AUTO, STRING, INT, FLOAT, CUSTOM, SCORE, DOC
    28. 28. Sorting II • Look at Searcher, Sort and SortField • Custom sorting is done with a SortComparatorSource • Sorting can be very expensive – Terms are cached in the FieldCache
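To make the cost of field sorting concrete, here is a sketch that assumes the sort field has already been loaded into an array with one entry per document in the index, which is the general idea behind a field cache. `ToyFieldSort` is a hypothetical helper, not a Lucene class.

```java
import java.util.Arrays;
import java.util.Comparator;

// Toy model only: cache[docId] holds the single sort term of that document.
final class ToyFieldSort {
    // Orders the hit doc ids by their cached field value, ties broken by doc id.
    static Integer[] sortByField(final String[] cache, int[] hits) {
        Integer[] order = new Integer[hits.length];
        for (int i = 0; i < hits.length; i++) {
            order[i] = hits[i];
        }
        Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                int c = cache[a.intValue()].compareTo(cache[b.intValue()]);
                return c != 0 ? c : a.compareTo(b);
            }
        });
        return order;
    }
}
```

The array is sized by the number of documents in the index rather than by the number of hits, so memory use grows with index size for every field that is sorted on. It also shows why the field needs exactly one comparable term per document.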
    29. 29. Filters • Filters restrict the search space to a subset of Documents • Use Cases – Search within a Search – Restrict by date – Rating – Security – Author
    30. 30. Filter Classes • QueryWrapperFilter (QueryFilter) – Restrict to subset of Documents that match a Query • RangeFilter – Restrict to Documents that fall within a range – Better alternative to RangeQuery • CachingWrapperFilter – Wrap another Filter and provide caching
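The relationship between a filter and a caching wrapper can be sketched like this. The types are hypothetical stand-ins (a "reader" is just an array of field values indexed by doc id), and the weak, per-reader cache is an assumption made for the sketch.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical stand-in: a filter yields one bit per document for a given reader.
interface ToyFilter {
    BitSet bits(String[] reader);
}

// Keeps documents whose field value starts with a prefix (for example a year).
final class ToyPrefixFilter implements ToyFilter {
    private final String prefix;
    int evaluations = 0; // counts how often the filter was actually computed

    ToyPrefixFilter(String prefix) {
        this.prefix = prefix;
    }

    public BitSet bits(String[] reader) {
        evaluations++;
        BitSet result = new BitSet(reader.length);
        for (int doc = 0; doc < reader.length; doc++) {
            if (reader[doc].startsWith(prefix)) {
                result.set(doc);
            }
        }
        return result;
    }
}

// Wraps another filter and remembers its result per reader instance.
final class ToyCachingFilter implements ToyFilter {
    private final ToyFilter inner;
    // Arrays hash by identity, so this is effectively keyed by reader instance;
    // weak keys let an entry disappear once its reader is no longer referenced.
    private final Map<String[], BitSet> cache = new WeakHashMap<String[], BitSet>();

    ToyCachingFilter(ToyFilter inner) {
        this.inner = inner;
    }

    public synchronized BitSet bits(String[] reader) {
        BitSet cached = cache.get(reader);
        if (cached == null) {
            cached = inner.bits(reader);
            cache.put(reader, cached);
        }
        return cached;
    }
}
```

Because the cache is tied to a reader instance, opening a new reader means the filter is recomputed once for it, which is one more reason to reuse readers across searches.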
    31. 31. 31 Task • Modify your program to sort by a field and to filter by a query or some other criteria – ~15 minutes
    32. 32. Searchers • MultiSearcher – Search over multiple Searchables, including remote • MultiReader – Not a Searcher, but can be used with IndexSearcher to achieve same results for local indexes • ParallelMultiSearcher – Like MultiSearcher, but threaded • RemoteSearchable – RMI based remote searching • Look at MultiSearcherTest in example code
    33. 33. Expert Results • Searcher has several “expert” methods • HitCollector allows low-level access to all Documents as they are scored
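A collector is essentially a callback that receives each matching document and its score. The sketch below uses a hypothetical `ToyCollector` interface shaped like that idea and a top-N implementation built on a min-heap; it is not Lucene's collector code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical callback: invoked once per matching document as it is scored.
interface ToyCollector {
    void collect(int doc, float score);
}

// Keeps only the n best-scoring hits while counting every hit seen.
final class ToyTopN implements ToyCollector {
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private final int n;
    private final PriorityQueue<Hit> heap; // min-heap: weakest kept hit at the head
    int totalHits = 0;

    ToyTopN(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        this.n = n;
        this.heap = new PriorityQueue<Hit>(n, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Float.compare(a.score, b.score);
            }
        });
    }

    public void collect(int doc, float score) {
        totalHits++;
        if (heap.size() < n) {
            heap.add(new Hit(doc, score));
        } else if (score > heap.peek().score) {
            heap.poll();
            heap.add(new Hit(doc, score));
        }
    }

    // Doc ids of the kept hits, best score first.
    List<Integer> topDocs() {
        List<Hit> hits = new ArrayList<Hit>(heap);
        Collections.sort(hits, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Float.compare(b.score, a.score);
            }
        });
        List<Integer> docs = new ArrayList<Integer>();
        for (Hit h : hits) {
            docs.add(h.doc);
        }
        return docs;
    }
}
```

Since the callback runs for every matching document, it should stay cheap; loading stored fields inside it would slow the whole search.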
    34. 34. Search Performance • Search speed is based on a number of factors: – Query Type(s) – Query Size – Analysis – Occurrences of Query Terms – Optimize – Index Size – Index type (RAMDirectory, other) – Usual Suspects • CPU • Memory • I/O • Business Needs
    35. 35. Query Types • Be careful with WildcardQuery as it rewrites to a BooleanQuery containing all the terms that match the wildcards • Avoid starting a WildcardQuery with a wildcard • Use ConstantScoreRangeQuery instead of RangeQuery • Be careful with range queries and dates – User mailing list and Wiki have useful tips for optimizing date handling
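Why a leading wildcard hurts can be seen with a toy term dictionary. The sketch assumes the dictionary is a sorted set of strings and handles only the two simplest patterns, a trailing `*` and a leading `*`; `ToyWildcard` is a hypothetical helper, not how Lucene enumerates terms internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Toy model only: expands a wildcard into the list of matching dictionary terms.
final class ToyWildcard {
    // Trailing wildcard ("luc*"): seek to the prefix in the sorted dictionary
    // and stop at the first term that no longer matches.
    static List<String> expandPrefix(SortedSet<String> dict, String prefix) {
        List<String> clauses = new ArrayList<String>();
        for (String term : dict.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            clauses.add(term);
        }
        return clauses;
    }

    // Leading wildcard ("*ene"): no seek is possible, so every term is examined.
    static List<String> expandSuffix(SortedSet<String> dict, String suffix) {
        List<String> clauses = new ArrayList<String>();
        for (String term : dict) {
            if (term.endsWith(suffix)) {
                clauses.add(term);
            }
        }
        return clauses;
    }
}
```

Each returned term would become one clause of the rewritten query, so a broad pattern is expensive both to expand and to execute.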
    36. 36. Query Size • Stopword removal • Search an “all” field instead of many fields with the same terms • Disambiguation – May be useful when doing synonym expansion – Difficult to automate and may be slower – Some applications may allow the user to disambiguate • Relevance Feedback/More Like This – Use most important words – “Important” can be defined in a number of ways
    37. 37. Usual Suspects • CPU – Profile your application • Memory – Examine your heap size, garbage collection approach • I/O – Cache your Searcher • Define business logic for refreshing based on indexing needs – Warm your Searcher before going live -- See Solr • Business Needs – Do you really need to support Wildcards? – What about date range queries down to the millisecond?
    38. 38. FieldSelector • Prior to version 2.1, Lucene always loaded all Fields in a Document • FieldSelector API addition allows Lucene to skip large Fields – Options: Load, Lazy Load, No Load, Load and Break, Load for Merge, Size, Size and Break • Makes storage of original content more viable without large cost of loading it when not used • FieldSelectorTest in example code
    39. 39. 39 Relevance • At some point along your journey, you will get results that you think are “bad” • Is it a big deal? – Content, Content, Content! – Relevance Judgments – Don’t break other queries just to “fix” one • Hardcode it! – A query doesn’t always have to result in a “search”
    40. 40. Scoring and Similarity • Lucene has a sophisticated scoring mechanism designed to meet most needs • Has hooks for modifying scores • Scoring is handled by the Query, Weight and Scorer classes
    41. 41. Explanations • explain(Query, int) method is useful for understanding why a Document scored the way it did • Shows all the pieces that went into scoring the result: – TF, DF, boosts, etc.
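The pieces an explanation reports can be reproduced by hand for a single term. The sketch below uses the tf, idf and length-norm formulas of Lucene's default similarity in the 2.x line, but the way they are combined here is a deliberate simplification: it assumes one query term and leaves out query normalization, coord and norm encoding. `ToySimilarity` is a hypothetical class.

```java
// Simplified single-term scoring sketch; not a replacement for explain().
final class ToySimilarity {
    // Term frequency factor: square root of the raw frequency in the document.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Inverse document frequency: rarer terms get a larger factor.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Shorter fields get a larger norm.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Assumed combination for one term: tf * idf^2 * boost * lengthNorm.
    static float score(float freq, int docFreq, int numDocs, float boost, int numTerms) {
        float idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * boost * lengthNorm(numTerms);
    }

    // Human-readable breakdown of the same factors.
    static String explain(float freq, int docFreq, int numDocs, float boost, int numTerms) {
        return "tf=" + tf(freq)
            + " idf=" + idf(docFreq, numDocs)
            + " boost=" + boost
            + " lengthNorm=" + lengthNorm(numTerms)
            + " score=" + score(freq, docFreq, numDocs, boost, numTerms);
    }
}
```

Comparing such a hand calculation with real explain output for a simple TermQuery is a quick way to see which factor dominates a surprising score.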
    42. 42. Tuning Relevance • FunctionQuery from Solr (variation in Lucene) • Override Similarity • Implement own Query and related classes • Payloads • Boosts
    43. 43. 43 Task • Open Luke and try some queries and then use the “explain” button • Or, write some code to do explains on a query and some documents • See how Query type, boosting, other factors play a role in the score
    44. 44. 44 Terms and Term Vectors • Sometimes you need access to the Term Dictionary: – Auto suggest – Frequency information • Sometimes you need a Document-centric view of terms, frequencies, positions and offsets – Term Vectors
    45. 45. Term Information • TermEnum gives access to terms and how many Documents they occur in – IndexReader.terms() • TermDocs gives access to the frequency of a term in a Document – IndexReader.termDocs() – TermPositions extends TermDocs and provides access to position and payload info – IndexReader.termPositions()
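The term-centric view can be pictured as a sorted map from term to postings. The following toy inverted index is a sketch with hypothetical names (`ToyPostings`), assuming tokens are already analyzed; it shows the kind of information that a term enumeration (terms and document frequency) and a postings iterator (per-document frequency) expose.

```java
import java.util.TreeMap;

// Toy model only: term -> (docId -> frequency of the term in that document).
final class ToyPostings {
    private final TreeMap<String, TreeMap<Integer, Integer>> postings =
        new TreeMap<String, TreeMap<Integer, Integer>>();

    // Indexes the tokens of one document.
    void add(int docId, String... tokens) {
        for (String t : tokens) {
            TreeMap<Integer, Integer> docs = postings.get(t);
            if (docs == null) {
                docs = new TreeMap<Integer, Integer>();
                postings.put(t, docs);
            }
            Integer old = docs.get(docId);
            docs.put(docId, old == null ? 1 : old + 1);
        }
    }

    // Walks the term dictionary in sorted order.
    Iterable<String> terms() {
        return postings.keySet();
    }

    // Number of documents containing the term.
    int docFreq(String term) {
        TreeMap<Integer, Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    // Number of occurrences of the term in one document.
    int freq(String term, int docId) {
        TreeMap<Integer, Integer> docs = postings.get(term);
        if (docs == null) {
            return 0;
        }
        Integer f = docs.get(docId);
        return f == null ? 0 : f;
    }
}
```

A sorted dictionary is also what makes prefix-style features such as auto suggest cheap: all terms sharing a prefix sit next to each other.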
    46. 46. 46 Term Vectors • Term Vectors give access to term frequency information in a given Document – IndexReader.getTermFreqVector • TermVectorMapper provides callbacks for working with Term Vectors
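A term vector is the document-centric counterpart: for one document, its terms with their frequencies and, optionally, positions. The sketch below is a toy with a hypothetical class `ToyTermVector`; the accessor names echo the parallel-array style of a term frequency vector but the implementation is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Toy model only: per-document map of term -> positions where it occurs.
final class ToyTermVector {
    private final TreeMap<String, List<Integer>> positions =
        new TreeMap<String, List<Integer>>();

    // Builds the vector from the already-analyzed tokens of one document.
    ToyTermVector(String... tokens) {
        for (int pos = 0; pos < tokens.length; pos++) {
            List<Integer> p = positions.get(tokens[pos]);
            if (p == null) {
                p = new ArrayList<Integer>();
                positions.put(tokens[pos], p);
            }
            p.add(pos);
        }
    }

    // Distinct terms of the document, in sorted order.
    String[] getTerms() {
        return positions.keySet().toArray(new String[0]);
    }

    // Frequencies parallel to getTerms().
    int[] getTermFrequencies() {
        int[] freqs = new int[positions.size()];
        int i = 0;
        for (List<Integer> p : positions.values()) {
            freqs[i++] = p.size();
        }
        return freqs;
    }

    // Token positions of one term, empty if the term does not occur.
    List<Integer> getPositions(String term) {
        List<Integer> p = positions.get(term);
        return p == null ? Collections.<Integer>emptyList() : new ArrayList<Integer>(p);
    }
}
```

This view is what features such as "more like this" or result highlighting tend to need, since they start from a document rather than from a term.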
    47. 47. 47 TermsTest • Provides samples of working with terms and term vectors
    48. 48. Lunch ? 1-2:30
    49. 49. Recap • Indexing • Searching • Performance • Odds and Ends – Explains – FieldSelector – Relevance – Terms and Term Vectors
    50. 50. 50 Class Project • Your chance to really dig in and get your hands dirty • Ask Questions • Options…
    51. 51. 51 Option I • Start building out your Lucene Application! – Index your Data (or any data) • Threading/Updates/Deletions • Analysis – Search • Caching/Warming • Dealing with Updates • Multi-threaded – Display
    52. 52. 52 Option II • Dig deeper into an area of interest – Performance • How fast can you index? • Search? Queries per Second? – Analysis – Query Parsing – Scoring – Contrib
    53. 53. 53 Option III • Dig into JIRA issues and find something to fix in Lucene • https://issues.apache.org/jira/secure/Dashboard.jspa • http://wiki.apache.org/lucene-java/HowToCon
    54. 54. 54 Option IV • Try out Solr • http://lucene.apache.org/solr
    55. 55. 55 Option V • Other? – Architecture Review/Discussion – Use Case Discussion
    56. 56. Project Post-Mortem • Volunteers to share?
    57. 57. Open Discussion • Multilingual Best Practices – UNICODE – One Index versus many • Advanced Analysis • Distributed Lucene • Crawling • Hadoop • Nutch • Solr
    58. 58. Resources • trainer@lucenebootcamp.com • Lucid Imagination – Support – Training – Value Add – grant@lucidimagination.com
    59. 59. Finally… • Please take the time to fill out a survey to help me improve this training – Located in base directory of source – Email it to me at trainer@lucenebootcamp.com • There are several Lucene related talks on Wednesday