Lucene Bootcamp - 2 Presentation Transcript

  • Lucene Boot Camp
    • Grant Ingersoll
    • Lucid Imagination
    • Nov. 4, 2008
    • New Orleans, LA
  • Schedule
    • In-depth Indexing/Searching
      • Performance, Internals
      • Filters, Sorting
    • Terms and Term Vectors
    • Class Project
    • Q & A
  • Day I Recap
    • Indexing
      • IndexWriter
      • Document / Field
      • Analyzer
    • Searching
      • IndexSearcher
      • IndexReader
      • QueryParser
    • Analysis
    • Contrib
  • Indexing In-Depth
    • Deletions and Updates
    • Optimize
    • Important Internals
      • File Formats
      • Segments, Commits, Merging
      • Compound File System
    • Performance
  • Lucene File Formats and Structures
    • http://lucene.apache.org/java/2_4_0/fileformats.html
    • A Lucene index is made up of one or more Segments
    • Lucene tracks Documents internally by an int “id”
    • This id may change across index operations
      • You should not rely on it unless you know your index isn’t changing
    • You can ask for a Document by this id on the IndexReader
  • Segments
    • Each Segment is an independent index containing:
      • Field Names
      • Stored Field values
      • Term Dictionary, proximity info and normalization factors
      • Term Vectors (optional)
      • Deleted Docs
    • Compound File System (CFS) stores all of these logical pieces in a single file
  • How Lucene Indexes
    • Lucene indexes Documents into memory
      • At certain trigger points, the in-memory segments are committed/flushed to the Directory
        • Can be forced by calling commit()
      • Segments are periodically merged (more in a moment)
  • Segments and Merging
    • May be created when new documents are added
    • Are merged from time to time based on segment size in relation to:
      • MergePolicy
      • MergeScheduler
      • Optimization
  • Merge Policy
    • Identifies Segments to be merged
    • Two Current Implementations
      • LogDocMergePolicy
      • LogByteSizeMergePolicy
    • mergeFactor - Max # of segments allowed before merging
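The effect of mergeFactor can be pictured with a toy simulation. This is only a sketch of the logarithmic merging idea, not Lucene's merge policy code: a segment is reduced to its document count, and the class and method names (ToyLogMerge, simulate) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: a segment is represented by its document count.
class ToyLogMerge {
    // Flushes `flushes` segments of `docsPerFlush` docs each and returns the
    // resulting segment sizes, oldest first.
    static List<Integer> simulate(int flushes, int docsPerFlush, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        List<Integer> segments = new ArrayList<>();
        for (int i = 0; i < flushes; i++) {
            segments.add(docsPerFlush);
            // Cascade: while the newest mergeFactor segments are the same size,
            // replace them with one merged segment.
            while (tailIsMergeable(segments, mergeFactor)) {
                int merged = 0;
                for (int j = 0; j < mergeFactor; j++) {
                    merged += segments.remove(segments.size() - 1);
                }
                segments.add(merged);
            }
        }
        return segments;
    }

    private static boolean tailIsMergeable(List<Integer> segments, int mergeFactor) {
        int n = segments.size();
        if (n < mergeFactor) return false;
        int size = segments.get(n - 1);
        for (int j = n - mergeFactor; j < n; j++) {
            if (segments.get(j) != size) return false;
        }
        return true;
    }
}
```

In this model, 25 one-document flushes with a mergeFactor of 10 leave two 10-document segments and five 1-document segments; a mergeFactor of 2 merges far more often but keeps fewer segments around.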
  • MergeScheduler
    • Responsible for performing the merge
    • Two Implementations:
      • Serial - blocking
      • Concurrent - new, background
  • Optimize
    • Optimize is the process of merging segments down into a single segment
    • This process can yield significant speedups in search
    • Can be slow
    • Can also do partial optimizes
  • Final Thoughts On Merging
    • Usually don’t have to think about it, except when to optimize
    • In high update, performance critical environments, you may need to dig into it more as it can sometimes cause long pauses
    • Optimize when you can; otherwise, keep mergeFactor low
  • Deletion
    • A deletion only marks the Document as deleted
      • Doesn’t get physically removed until a merge
    • Deletions can be a bit confusing
      • Both IndexReader and IndexWriter have delete methods
        • By: id, term(s), Query(s)
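A small model may make "marked, not removed" more concrete. The sketch below is a stand-in, not Lucene code: ToySegment is a hypothetical class whose maxDoc/numDocs/isDeleted names are borrowed from IndexReader for familiarity. It also shows why internal ids cannot be trusted across a merge.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model only: one segment plus a "deleted docs" bit set.
class ToySegment {
    private final List<String> docs = new ArrayList<>();
    private final BitSet deleted = new BitSet();

    // Returns the internal id assigned to the new document.
    int add(String doc) {
        docs.add(doc);
        return docs.size() - 1;
    }

    // A delete only flips a bit; the document still occupies its slot.
    void delete(int id) {
        deleted.set(id);
    }

    boolean isDeleted(int id) { return deleted.get(id); }

    int maxDoc() { return docs.size(); }

    int numDocs() { return docs.size() - deleted.cardinality(); }

    String document(int id) {
        if (deleted.get(id)) {
            throw new IllegalArgumentException("document " + id + " is deleted");
        }
        return docs.get(id);
    }

    // A merge copies only live documents, so surviving docs get new ids.
    ToySegment merge() {
        ToySegment merged = new ToySegment();
        for (int id = 0; id < docs.size(); id++) {
            if (!deleted.get(id)) {
                merged.add(docs.get(id));
            }
        }
        return merged;
    }
}
```

After adding three documents and deleting id 1, maxDoc() is still 3 while numDocs() is 2; once merged, the document that used to be id 2 is reachable as id 1.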
  • Task
    • Build your index from yesterday and then try some deletes
      • Id, term, Query
    • Also try out an optimize on an FSDirectory against the full Reuters sample
    • 15-20 minutes
  • Updates
    • Updates are always a delete and an add
    • Updates are always a delete and an add
      • Yes, that is a repeat!
      • Nature of data structures used in search
    • See IndexWriter.updateDocument()
  • Performance Factors
    • setRAMBufferSizeMB
      • New model for automagically controlling indexing factors based on the amount of memory in use
      • Obsoletes setMaxBufferedDocs
    • maxBufferedDocs
      • Minimum # of docs buffered in memory before they are flushed as a new segment
      • Usually, Larger == faster, but more RAM
  • More Factors
    • mergeFactor
      • How often segments are merged
      • Smaller == less RAM, better for incremental updates
      • Larger == faster, better for batch indexing
    • maxFieldLength
      • Limit the number of terms indexed per Field in a Document
    • Analysis
    • Reuse
      • Document, TokenStream, Token
  • Index Threading
    • IndexWriter and IndexReader are thread-safe and can be shared between threads without external synchronization
    • One open IndexWriter per Directory
    • Parallel Indexing
      • Index to separate Directory instances
      • Merge using IndexWriter.addIndexes
      • Could also distribute and collect
  • Benchmarking Indexing
    • contrib/benchmark
    • Try out different algorithms between Lucene 2.2 and 2.3
      • contrib/benchmark/conf:
        • indexing.alg
        • indexing-multithreaded.alg
    • Info:
      • Mac Pro 2 x 2GHz Dual-Core Xeon
      • 4 GB RAM
      • ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M
  • Benchmarking Results
    • Your results will depend on analysis, etc.

                     Records/Sec   Avg. T Mem
      2.2                    421          39M
      Trunk                2,122          52M
      Trunk-mt (4)         3,680          57M
  • Searching
    • Earlier we touched on basics of search using the QueryParser
    • Now look at:
      • Searcher / IndexReader Lifecycle
      • Query classes
      • More details on the QueryParser
      • Filters
      • Sorting
  • Lifecycle
    • Recall that the IndexReader loads a snapshot of the index into memory
      • This means updates made since loading the index will not be seen
    • Business rules are needed to define how often to reload the index, if at all
      • IndexReader.isCurrent() can help
    • Loading an index is an expensive operation
      • Do not open a Searcher/IndexReader for every search
  • Reopen
    • It is possible to have IndexReader reopen new or changed segments
      • Save some on the cost of loading a new index
    • Does not close the old reader, so the application must
    • See DeletionsUpdatesTest.testReopen()
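The lifecycle advice above (share one reader, refresh by business rule, close the old reader yourself) can be sketched as a small holder. Everything here is a hypothetical stand-in: ToyIndex and ToyReader only mimic the idea of a versioned index and a snapshot reader, and maybeRefresh is an assumed name, not a Lucene method.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Toy model only: shows the sharing/refresh pattern, not Lucene's API.
class ToySearcherHolder {
    // Stand-in for an index: a version number that indexing bumps.
    static final class ToyIndex {
        final AtomicLong version = new AtomicLong();
    }

    // Stand-in for a reader: a snapshot pinned to the version it was opened at.
    static final class ToyReader {
        final long version;
        volatile boolean closed;

        ToyReader(long version) { this.version = version; }

        boolean isCurrent(ToyIndex index) { return version == index.version.get(); }

        void close() { closed = true; }
    }

    private final ToyIndex index;
    private final AtomicReference<ToyReader> current;

    ToySearcherHolder(ToyIndex index) {
        this.index = index;
        this.current = new AtomicReference<>(new ToyReader(index.version.get()));
    }

    // Every search shares the same reader; nothing is opened per query.
    ToyReader get() { return current.get(); }

    // Call this on whatever schedule the business rules dictate (timer,
    // after a commit, ...). Returns true if a new snapshot was swapped in.
    synchronized boolean maybeRefresh() {
        ToyReader old = current.get();
        if (old.isCurrent(index)) {
            return false;
        }
        current.set(new ToyReader(index.version.get()));
        // The old reader is not closed for us. A real application would
        // first wait for in-flight searches (for example by reference counting).
        old.close();
        return true;
    }
}
```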
  • Query Classes
    • TermQuery is the basis for all non-span queries
    • BooleanQuery combines multiple Query instances as clauses
      • should
      • required
    • PhraseQuery finds terms occurring near each other, position-wise
      • “slop” is an edit distance: the number of position moves allowed between the query phrase and the matching terms
    • Take 2-3 minutes to explore Query implementations
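To get a feel for slop, here is a simplified model of how much slop a candidate match needs. It assumes each phrase term occurs exactly once in the document, and it is a sketch rather than Lucene's scorer; ToySlop and its methods are hypothetical names.

```java
// Toy model only: slop needed for one occurrence of each phrase term.
class ToySlop {
    // positions[i] is the position in the document of the i-th phrase term.
    static int requiredSlop(int[] positions) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < positions.length; i++) {
            // Where the phrase would start if this term sat exactly in place.
            int shifted = positions[i] - i;
            min = Math.min(min, shifted);
            max = Math.max(max, shifted);
        }
        // An exact phrase gives identical shifted values, so the spread is 0.
        return max - min;
    }

    static boolean matches(int[] positions, int slop) {
        return requiredSlop(positions) <= slop;
    }
}
```

Under this model, adjacent in-order terms need a slop of 0, one intervening word needs 1, and two adjacent terms in swapped order need 2.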
  • Spans
    • Spans provide information about where matches took place
    • Not supported by the QueryParser
    • Can be used in BooleanQuery clauses
    • Take 2-3 minutes to explore SpanQuery classes
      • SpanNearQuery useful for doing phrase matching
  • QueryParser
    • MultiFieldQueryParser
    • Boolean operators cause confusion
      • Better to think in terms of required (+ operator) and not allowed (- operator)
    • Check JIRA for QueryParser issues
    • http://www.gossamer-threads.com/lists/lucene/java-user/40945
    • Most applications either modify QP, create their own, or restrict to a subset of the syntax
    • Your users may not need all the “flexibility” of the QP
  • Sorting
    • Lucene's default sort is by score
    • Searcher has several methods that take in a Sort object
    • Sorting should be addressed during indexing
    • Sorting is done on Fields containing a single term that can be used for comparison
    • The SortField defines the different sort types available
      • AUTO, STRING, INT, FLOAT, CUSTOM, SCORE, DOC
  • Sorting II
    • Look at Searcher, Sort and SortField
    • Custom sorting is done with a SortComparatorSource
    • Sorting can be very expensive
      • Terms are cached in the FieldCache
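The cost of sorting comes from needing one comparable value per document. A rough sketch of that idea, with a plain array standing in for the FieldCache (ToyFieldSort and sortHits are hypothetical names, and breaking ties by document id is an assumption of this sketch):

```java
import java.util.Arrays;
import java.util.Comparator;

// Toy model only: sort hits by a per-document value held in an array.
class ToyFieldSort {
    // fieldCache[docId] holds the single sort term for that document.
    // The array spans every document in the index, which is where the
    // memory cost comes from.
    static int[] sortHits(int[] hits, String[] fieldCache) {
        Comparator<Integer> byField = (a, b) -> {
            int c = fieldCache[a].compareTo(fieldCache[b]);
            // Ties fall back to document id so the order is stable.
            return c != 0 ? c : Integer.compare(a, b);
        };
        return Arrays.stream(hits)
                .boxed()
                .sorted(byField)
                .mapToInt(Integer::intValue)
                .toArray();
    }
}
```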
  • Filters
    • Filters restrict the search space to a subset of Documents
    • Use Cases
      • Search within a Search
      • Restrict by date
      • Rating
      • Security
      • Author
  • Filter Classes
    • QueryWrapperFilter (QueryFilter)
      • Restrict to the subset of Documents that match a Query
    • RangeFilter
      • Restrict to Documents that fall within a range
      • Better alternative to RangeQuery
    • CachingWrapperFilter
      • Wrap another Filter and provide caching
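Conceptually, a filter is a set of allowed document ids that is intersected with the query's matches, and caching means computing that set only once. The following is a minimal sketch under that assumption; ToyFilter is hypothetical, and unlike a real cache it ignores the fact that cached bits belong to one particular reader.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

// Toy model only: a filter as a lazily computed, cached bit set.
class ToyFilter {
    private final int maxDoc;
    private final IntPredicate accept;
    private BitSet cached;
    int computations = 0;   // exposed so the caching is observable

    ToyFilter(int maxDoc, IntPredicate accept) {
        this.maxDoc = maxDoc;
        this.accept = accept;
    }

    // Walks every document once, then reuses the result on later calls.
    synchronized BitSet bits() {
        if (cached == null) {
            BitSet bits = new BitSet(maxDoc);
            for (int id = 0; id < maxDoc; id++) {
                if (accept.test(id)) {
                    bits.set(id);
                }
            }
            cached = bits;
            computations++;
        }
        return cached;
    }

    // Restricts query matches to the documents the filter allows.
    static BitSet restrict(BitSet queryMatches, BitSet filterBits) {
        BitSet result = (BitSet) queryMatches.clone();
        result.and(filterBits);
        return result;
    }
}
```

A "restrict by date" filter then looks like `new ToyFilter(years.length, id -> years[id] >= 2000)`, where `years` is an assumed per-document array.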
  • Task
    • Modify your program to sort by a field and to filter by a query or some other criteria
      • ~15 minutes
  • Searchers
    • MultiSearcher
      • Search over multiple Searchables, including remote
    • MultiReader
      • Not a Searcher, but can be used with IndexSearcher to achieve the same results for local indexes
    • ParallelMultiSearcher
      • Like MultiSearcher, but threaded
    • RemoteSearchable
      • RMI based remote searching
    • Look at MultiSearcherTest in example code
  • Expert Results
    • Searcher has several “expert” methods
    • HitCollector allows low-level access to all Documents as they are scored
  • Search Performance
    • Search speed is based on a number of factors:
      • Query Type(s)
      • Query Size
      • Analysis
      • Occurrences of Query Terms
      • Optimize
      • Index Size
      • Index type (RAMDirectory, other)
      • Usual Suspects
        • CPU
        • Memory
        • I/O
        • Business Needs
  • Query Types
    • Be careful with WildcardQuery as it rewrites to a BooleanQuery containing all the terms that match the wildcards
    • Avoid starting a WildcardQuery with a wildcard
    • Use ConstantScoreRangeQuery instead of RangeQuery
    • Be careful with range queries and dates
      • User mailing list and Wiki have useful tips for optimizing date handling
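Why a leading wildcard hurts can be seen with a sorted set standing in for the term dictionary. This is a sketch of the expansion idea only; ToyWildcard and its methods are hypothetical and handle just the two simplest patterns, "prefix*" and "*suffix".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model only: wildcard expansion against a sorted term dictionary.
class ToyWildcard {
    // "prefix*": seek to the prefix and stop at the first non-matching term.
    static List<String> expandPrefix(TreeSet<String> dictionary, String prefix) {
        List<String> expanded = new ArrayList<>();
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;   // sorted order: no later term can match
            }
            expanded.add(term);
        }
        return expanded;
    }

    // "*suffix": there is nothing to seek to, so every term is examined.
    static List<String> expandSuffix(TreeSet<String> dictionary, String suffix) {
        List<String> expanded = new ArrayList<>();
        for (String term : dictionary) {
            if (term.endsWith(suffix)) {
                expanded.add(term);
            }
        }
        return expanded;
    }
}
```

Each expanded term becomes one clause of the rewritten BooleanQuery, so a pattern that matches many terms produces a very large query.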
  • Query Size
    • Stopword removal
    • Search an “all” field instead of many fields with the same terms
    • Disambiguation
      • May be useful when doing synonym expansion
      • Difficult to automate and may be slower
      • Some applications may allow the user to disambiguate
    • Relevance Feedback/More Like This
      • Use most important words
      • “Important” can be defined in a number of ways
  • Usual Suspects
    • CPU
      • Profile your application
    • Memory
      • Examine your heap size, garbage collection approach
    • I/O
      • Cache your Searcher
        • Define business logic for refreshing based on indexing needs
      • Warm your Searcher before going live -- See Solr
    • Business Needs
      • Do you really need to support Wildcards?
      • What about date range queries down to the millisecond?
  • FieldSelector
    • Prior to version 2.1, Lucene always loaded all Fields in a Document
    • FieldSelector API addition allows Lucene to skip large Fields
      • Options: Load, Lazy Load, No Load, Load and Break, Load for Merge, Size, Size and Break
    • Makes storage of original content more viable without large cost of loading it when not used
    • FieldSelectorTest in example code
  • Relevance
    • At some point along your journey, you will get results that you think are “bad”
    • Is it a big deal?
      • Content, Content, Content!
      • Relevance Judgments
      • Don’t break other queries just to “fix” one
    • Hardcode it!
      • A query doesn’t always have to result in a “search”
  • Scoring and Similarity
    • Lucene has a sophisticated scoring mechanism designed to meet most needs
    • Has hooks for modifying scores
    • Scoring is handled by the Query, Weight and Scorer classes
  • Explanations
    • explain(Query, int) method is useful for understanding why a Document scored the way it did
    • Shows all the pieces that went into scoring the result:
      • Tf, DF, boosts, etc.
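The main factors that show up in an explanation can be reproduced by hand. The sketch below uses the formulas of Lucene's default Similarity for tf, idf and length normalization; the class name ToySimilarity and the combined termScore are simplifications of this example that leave out boosts, coord and query normalization.

```java
// Toy model only: the main per-term scoring factors, computed by hand.
class ToySimilarity {
    // Term frequency factor: grows with the square root of the raw count.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: rarer terms score higher.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Length normalization: shorter fields score higher.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Simplified single-term score; idf is squared because it contributes
    // on both the query side and the document side.
    static double termScore(int freq, int docFreq, int numDocs, int numTerms) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * lengthNorm(numTerms);
    }
}
```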
  • Tuning Relevance
    • FunctionQuery from Solr (variation in Lucene)
    • Override Similarity
    • Implement own Query and related classes
    • Payloads
    • Boosts
  • Task
    • Open Luke and try some queries and then use the “explain” button
    • Or, write some code to do explains on a query and some documents
    • See how Query type, boosting, other factors play a role in the score
  • Terms and Term Vectors
    • Sometimes you need access to the Term Dictionary:
      • Auto suggest
      • Frequency information
    • Sometimes you need a Document-centric view of terms, frequencies, positions and offsets
      • Term Vectors
  • Term Information
    • TermEnum gives access to terms and how many Documents they occur in
      • IndexReader.terms()
    • TermDocs gives access to the frequency of a term in a Document
      • IndexReader.termDocs()
    • TermPositions extends TermDocs and provides access to position and payload info
      • IndexReader.termPositions()
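The kind of information these enumerations expose can be pictured with a tiny in-memory inverted index. This is an illustrative sketch only: ToyTermIndex is hypothetical, uses whitespace splitting in place of a real Analyzer, and does not track positions or payloads.

```java
import java.util.TreeMap;

// Toy model only: term -> (docId -> frequency of the term in that doc).
class ToyTermIndex {
    // TreeMap keeps terms sorted, like a term dictionary.
    private final TreeMap<String, TreeMap<Integer, Integer>> postings = new TreeMap<>();

    void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Sorted view of all terms (what a term enumeration walks over).
    Iterable<String> terms() {
        return postings.keySet();
    }

    // How many documents contain the term.
    int docFreq(String term) {
        TreeMap<Integer, Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    // How often the term occurs in one document.
    int termFreq(String term, int docId) {
        TreeMap<Integer, Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.getOrDefault(docId, 0);
    }
}
```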
  • Term Vectors
    • Term Vectors give access to term frequency information in a given Document
      • IndexReader.getTermFreqVector
    • TermVectorMapper provides callbacks for working with Term Vectors
  • TermsTest
    • Provides samples of working with terms and term vectors
  • Lunch: 1-2:30
  • Recap
    • Indexing
    • Searching
    • Performance
    • Odds and Ends
      • Explains
      • FieldSelector
      • Relevance
      • Terms and Term Vectors
  • Class Project
    • Your chance to really dig in and get your hands dirty
    • Ask Questions
    • Options…
  • Option I
    • Start building out your Lucene Application!
      • Index your Data (or any data)
        • Threading/Updates/Deletions
        • Analysis
      • Search
        • Caching/Warming
        • Dealing with Updates
        • Multi-threaded
      • Display
  • Option II
    • Dig deeper into an area of interest
      • Performance
        • How fast can you index?
        • Search? Queries per Second?
      • Analysis
      • Query Parsing
      • Scoring
      • Contrib
  • Option III
    • Dig into JIRA issues and find something to fix in Lucene
    • https://issues.apache.org/jira/secure/Dashboard.jspa
    • http://wiki.apache.org/lucene-java/HowToContribute
  • Option IV
    • Try out Solr
    • http://lucene.apache.org/solr
  • Option V
    • Other?
      • Architecture Review/Discussion
      • Use Case Discussion
  • Project Post-Mortem
    • Volunteers to share?
  • Open Discussion
    • Multilingual Best Practices
      • Unicode
      • One Index versus many
    • Advanced Analysis
    • Distributed Lucene
    • Crawling
    • Hadoop
    • Nutch
    • Solr
  • Resources
    • [email_address]
    • Lucid Imagination
      • Support
      • Training
      • Value Add
      • [email_address]
  • Finally…
    • Please take the time to fill out a survey to help me improve this training
      • Located in base directory of source
      • Email it to me at trainer@lucenebootcamp.com
    • There are several Lucene related talks on Wednesday