Lucene BootCamp

  • Take a look at IndexWriter
  • Take a look at Field constructors and parameters
  • Do some searches:
    Case sensitive?
    Dates?
    Stopwords?
  • 5-10 minutes
    Hint: the same one you used to create the index
  • Examine the code for one or two of these
  • See TopDocsTest.java in src/test
  • Examine FieldSelectorTest code
  • Should take most of the afternoon
  • Look through various contributions
  • 10-15 minutes
  • Lucene BootCamp

    1. 1. Lucene Boot Camp Grant Ingersoll Lucid Imagination Nov. 12, 2007 Atlanta, Georgia
    2. 2. Intro • My Background • Your Background • Brief History of Lucene • Goals for Tutorial – Understand Lucene core capabilities – Real examples, real code, real data • Ask Questions!!!!!
    3. 3. Schedule 1. 10-10:10 Introducing Lucene and Search 2. 10:10-12 Indexing, Analysis, Searching, Performance 3. 12-12:05 Break 4. 12-1 More on Indexing, Analysis, Searching, Performance 5. 1-2:30 Lunch 6. 2:30-2:40 Recap, Questions, Content 7. 2:40-4 Class Example 8. 4-4:20 Break 9. 4:20-5 Class Example 10. 5-5:20 Lucene Contributions (time permitting) 11. 5:20-5:25 Open Discussion (time permitting) 12. 5:25-5:30 Resources/Wrap Up
    4. 4. Lucene is… • NOT a crawler – See Nutch • NOT an application – See PoweredBy on the Wiki • NOT a library for doing Google PageRank or other link analysis algorithms – See Nutch • A library for enabling text based search
    5. 5. A Few Words about Solr • HTTP-based Search Server • XML Configuration • XML, JSON, Ruby, PHP, Java support • Caching, Replication • Many, many nice features that Lucene users need • http://lucene.apache.org/solr
    6. 6. Search Basics • Goal: Identify documents that are similar to input query • Lucene uses a modified Vector Space Model (VSM) – Boolean + VSM – TF-IDF – The words in the document and the query each define a Vector in an n-dimensional space – $\mathrm{Sim}(q_1, d_1) = \cos\Theta$, where $\Theta$ is the angle between the query vector $q_1$ and the document vector $d_1$ – In Lucene, boolean approach restricts what documents to score • $d_j = \langle w_{1,j}, w_{2,j}, \ldots, w_{n,j} \rangle$ • $q = \langle w_{1,q}, w_{2,q}, \ldots, w_{n,q} \rangle$ • $w$ = weight assigned to term
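
The cosine measure on the Search Basics slide can be sketched with plain Java collections. This is a toy model, not Lucene's scoring code: the class name `CosineSketch` and its methods are hypothetical, and the weights are raw term frequencies rather than TF-IDF.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy vector space model: cosine similarity between two term-weight vectors. */
final class CosineSketch {

    /** Builds a raw term-frequency vector from whitespace-separated text (assumption: no IDF). */
    static Map<String, Double> termFrequencies(String text) {
        Map<String, Double> weights = new HashMap<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                weights.merge(term, 1.0, Double::sum);
            }
        }
        return weights;
    }

    /** cos(theta) = (q . d) / (|q| * |d|); returns 0 when either vector is empty. */
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
        }
        double norms = norm(q) * norm(d);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    private static double norm(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double w : v.values()) {
            sumOfSquares += w * w;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        Map<String, Double> query = termFrequencies("robin hood");
        // Same terms as the query: similarity close to 1.
        System.out.println(cosine(query, termFrequencies("Robin Hood")));
        // Shares only "hood": similarity between 0 and 1.
        System.out.println(cosine(query, termFrequencies("Little Red Riding Hood")));
        // No shared terms: similarity 0.
        System.out.println(cosine(query, termFrequencies("Little Women")));
    }
}
```

A document sharing no terms with the query scores 0, which is why a boolean pre-selection step can skip such documents without changing the ranking.
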
    7. 7. Indexing • Process of preparing and adding text to Lucene – Optimized for searching • Key Point: Lucene only indexes Strings – What does this mean? • Lucene doesn’t care about XML, Word, PDF, etc. – There are many good open source extractors available • It’s our job to convert whatever file format we have into something Lucene can use
    8. 8. Indexing Classes • Analyzer – Creates tokens using a Tokenizer and filters them through zero or more TokenFilters • IndexWriter – Responsible for converting text into internal Lucene format
    9. 9. Indexing Classes • Directory – Where the Index is stored – RAMDirectory, FSDirectory, others • Document – A collection of Fields – Can be boosted • Field – Free text, keywords, dates, etc. – Defines attributes for storing, indexing – Can be boosted – Field Constructors and parameters • Open up Fieldable and Field in IDE
    10. 10. How to Index • Create IndexWriter • For each input – Create a Document – Add Fields to the Document – Add the Document to the IndexWriter • Close the IndexWriter • Optimize (optional)
    11. 11. Task 1.a • From the Boot Camp Files, use the basic.ReutersIndexer skeleton to start • Index the small Reuters Collection using the IndexWriter, a Directory and StandardAnalyzer – Boost every 10th document by 3 • Questions to Answer: – What Fields should I define? – What attributes should each Field have? • What Fields should OMIT_NORMS? – Pick a field to boost and give a reason why you think it should be boosted
    12. 12. Use Luke
    13. 13. Searching • Key Classes: – Searcher • Provides methods for searching • Take a moment to look at the Searcher class declaration • IndexSearcher, MultiSearcher, ParallelMultiSearcher – IndexReader • Loads a snapshot of the index into memory for searching – Hits • Storage/caching of results from searching – QueryParser • JavaCC grammar for creating Lucene Queries • http://lucene.apache.org/java/docs/queryparsersyntax.html – Query • Logical representation of program’s information need
    14. 14. Query Parsing • Basic syntax: title:hockey +(body:stanley AND body:cup) • OR/AND must be uppercase • Default operator is OR (can be changed) • Supports fairly advanced syntax, see the website – http://lucene.apache.org/java/docs/queryparsersyntax.html • Doesn’t always play nice, so beware – Many applications construct queries programmatically or restrict syntax
    15. 15. Task 1.b • Using the ReutersIndexerTest.java skeleton in the boot camp files – Search your newly created index using queries you develop – Delete a Document by the doc id • Hints: – Use an IndexSearcher – Create a Query using the QueryParser – Display the results from the Hits • Questions: – What is the default field for the QueryParser? – What Analyzer to use?
    16. 16. Task 1 Results • Locks – Lucene maintains locks on files to prevent index corruption – Located in same directory as index • Scores from Hits are normalized – Scores across queries are NOT comparable • Lucene 2.3 has some transactional semantics for indexing, but is not a DB
    17. 17. Deletion and Updates • Deletions can be a bit confusing – Both IndexReader and IndexWriter have delete methods • Updates are always a delete and an add • Updates are always a delete and an add – Yes, that is a repeat! – Nature of data structures used in search
    18. 18. Analysis • Analysis is the process of creating Tokens to be indexed • Analysis is usually done to improve results overall, but it comes with a price • Lucene comes with many different Analyzers, Tokenizers and TokenFilters, each with their own goals – See contrib/analyzers • StandardAnalyzer is included with the core JAR and does a good job for most English and Latin-based tasks • Often you want the same content analyzed in different ways • Consider a catch-all Field in addition to other Fields
    19. 19. Commonly Used Analyzers • StandardAnalyzer • WhitespaceAnalyzer • PerFieldAnalyzerWrapper • SimpleAnalyzer
    20. 20. Indexing in a Nutshell • For each Document – For each Field to be tokenized • Create the tokens using the specified Tokenizer – Tokens consist of a String, position, type and offset information • Pass the tokens through the chained TokenFilters where they can be changed or removed • Add the end result to the inverted index • Position information can be altered – Useful when removing words or to prevent phrases from matching
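
The tokenizer-then-filters flow described above can be sketched without Lucene. The names here (`AnalysisSketch`, `Token`, `analyze`) are hypothetical stand-ins, not Lucene classes, and the sketch assumes that removed stopwords leave a gap in the position numbering; whether a real filter does so depends on the filter and its configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analysis chain: whitespace tokenizer, lowercase filter, stopword filter. */
final class AnalysisSketch {

    /** A token's text plus its 0-based position in the field. */
    static final class Token {
        final String text;
        final int position;

        Token(String text, int position) {
            this.text = text;
            this.position = position;
        }

        @Override
        public String toString() {
            return text + "@" + position;
        }
    }

    /** "Tokenizer": splits on whitespace and numbers the tokens. */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int position = 0;
        for (String word : input.split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(new Token(word, position++));
            }
        }
        return tokens;
    }

    /** "TokenFilter" that changes tokens: lowercases the text, keeps the position. */
    static List<Token> lowerCase(List<Token> in) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            out.add(new Token(t.text.toLowerCase(Locale.ROOT), t.position));
        }
        return out;
    }

    /** "TokenFilter" that removes tokens; survivors keep their original positions. */
    static List<Token> removeStopwords(List<Token> in, Set<String> stopwords) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            if (!stopwords.contains(t.text)) {
                out.add(t);
            }
        }
        return out;
    }

    /** Chains the stages in the order tokenizer, lowercase, stopword removal. */
    static List<Token> analyze(String input, Set<String> stopwords) {
        return removeStopwords(lowerCase(tokenize(input)), stopwords);
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "a", "of"));
        // prints [quick@1, brown@2, fox@3]
        System.out.println(analyze("The Quick brown fox", stopwords));
    }
}
```

Because "quick" stays at position 1 rather than moving to 0, a phrase match that relies on adjacent positions would still see the hole left by the removed word.
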
    21. 21. Inverted Index
            Documents: 0 = Little Red Riding Hood, 1 = Robin Hood, 2 = Little Women
            Term        Doc ids
            aardvark    (none shown)
            hood        0, 1
            little      0, 2
            red         0
            riding      0
            robin       1
            women       2
            zoo         (none shown)
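
A minimal sketch of the term-to-documents mapping on the Inverted Index slide, using the three titles from that slide. `InvertedIndexSketch` is a hypothetical name, and the structure is far simpler than Lucene's on-disk format (no positions, frequencies, or segments).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index: term -> sorted list of ids of the documents containing it. */
final class InvertedIndexSketch {

    /** Document ids are the positions of the documents in the input list. */
    static Map<String, List<Integer>> build(List<String> docs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).toLowerCase(Locale.ROOT).split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                List<Integer> ids = postings.computeIfAbsent(term, t -> new ArrayList<>());
                // Doc ids arrive in increasing order, so checking the tail avoids duplicates.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                    ids.add(docId);
                }
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("Little Red Riding Hood", "Robin Hood", "Little Women");
        // prints {hood=[0, 1], little=[0, 2], red=[0], riding=[0], robin=[1], women=[2]}
        System.out.println(build(docs));
    }
}
```

Looking up a term is a single map access followed by a walk over its posting list, which is what makes the structure "optimized for searching".
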
    22. 22. Tokenization • Split words into Tokens to be processed • Tokenization is fairly straightforward for most languages that use a space for word segmentation – More difficult for some East Asian languages – See the CJK Analyzer
    23. 23. Modifying Tokens • TokenFilters are used to alter the token stream to be indexed • Common tasks: – Remove stopwords – Lower case – Stem/Normalize (e.g., Wi-Fi -> Wi Fi) – Add Synonyms • StandardAnalyzer does things that you may not want
    24. 24. Custom Analyzers • Solution: write your own Analyzer • Better solution: write a configurable Analyzer so you only need one Analyzer that you can easily change for your projects – See Solr • Tokenizers and TokenFilters must be newly constructed for each input
    25. 25. Special Cases • Dates and numbers need special treatment to be searchable – o.a.l.document.DateTools – org.apache.solr.util.NumberUtils • Altering Position Information – Increase Position Gap between sentences to prevent phrases from crossing sentence boundaries – Index synonyms at the same position so query can match regardless of synonym used
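
Why dates and numbers need special treatment: index terms are compared as strings, so unpadded numbers sort in the wrong order. The sketch below shows the general idea only (zero-padding and a fixed-width GMT date string); `SortableTermsSketch` and its methods are hypothetical and are not the implementation of DateTools or NumberUtils.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Toy encoders that make numbers and dates sort correctly as plain strings. */
final class SortableTermsSketch {

    /** Zero-pads a non-negative long to 19 digits so string order equals numeric order. */
    static String padNumber(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("this sketch handles non-negative values only");
        }
        return String.format(Locale.ROOT, "%019d", value);
    }

    /** Encodes a date at day resolution as yyyyMMdd in GMT. */
    static String dayString(Date date) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("GMT"));
        return format.format(date);
    }

    public static void main(String[] args) {
        // Raw strings misorder numbers: "9" sorts after "10".
        System.out.println("9".compareTo("10") > 0);                   // prints true
        // Padded strings keep numeric order.
        System.out.println(padNumber(9).compareTo(padNumber(10)) < 0); // prints true
        System.out.println(dayString(new Date(0L)));                   // prints 19700101
    }
}
```

Choosing a coarser resolution (day rather than millisecond) also keeps the number of distinct terms small, which matters for range queries.
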
    26. 26. 5 minute Break
    27. 27. Indexing Performance • Behind the Scenes – Lucene indexes Documents into memory – At certain trigger points, in-memory segments are flushed to the Directory – Segments are periodically merged • Lucene 2.3 has significant performance improvements
    28. 28. IndexWriter Performance Factors • maxBufferedDocs – Minimum # of docs before merge occurs and a new segment is created – Usually, Larger == faster, but more RAM • mergeFactor – How often segments are merged – Smaller == less RAM, better for incremental updates – Larger == faster, better for batch indexing • maxFieldLength – Limit the number of terms indexed per Field in a Document
    29. 29. Lucene 2.3 IndexWriter Changes • setRAMBufferSizeMB – New model for automagically controlling indexing factors based on the amount of memory in use – Obsoletes setMaxBufferedDocs and setMergeFactor • Takes storage and term vectors out of the merge process • Turn off auto-commit if there are stored fields and term vectors • Provides significant performance increase
    30. 30. Index Threading • IndexWriter and IndexReader are thread-safe and can be shared between threads without external synchronization • One open IndexWriter per Directory • Parallel Indexing – Index to separate Directory instances – Merge using IndexWriter.addIndexes – Could also distribute and collect
    31. 31. Benchmarking Indexing • contrib/benchmark • Try out different algorithms between Lucene 2.2 and trunk (2.3) – contrib/benchmark/conf: • indexing.alg • indexing-multithreaded.alg • Info: – Mac Pro 2 x 2GHz Dual-Core Xeon – 4 GB RAM – ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M
    32. 32. Benchmarking Results
            Version         Records/Sec   Avg. T Mem
            2.2                     421          39M
            Trunk                 2,122          52M
            Trunk-mt (4)          3,680          57M
        Your results will depend on analysis, etc.
    33. 33. Searching • Earlier we touched on basics of search using the QueryParser • Now look at: – Searcher/IndexReader Lifecycle – Query classes – More details on the QueryParser – Filters – Sorting
    34. 34. Lifecycle • Recall that the IndexReader loads a snapshot of index into memory – This means updates made since loading the index will not be seen • Business rules are needed to define how often to reload the index, if at all – IndexReader.isCurrent() can help • Loading an index is an expensive operation – Do not open a Searcher/IndexReader for every search
    35. 35. Query Classes • TermQuery is basis for all non-span queries • BooleanQuery combines multiple Query instances as clauses – should – required • PhraseQuery finds terms occurring near each other, position-wise – “slop” is the edit distance between two terms • Take 2-3 minutes to explore Query implementations
    36. 36. Spans • Spans provide information about where matches took place • Not supported by the QueryParser • Can be used in BooleanQuery clauses • Take 2-3 minutes to explore SpanQuery classes – SpanNearQuery useful for doing phrase matching
    37. 37. QueryParser • MultiFieldQueryParser • Boolean operators cause confusion – Better to think in terms of required (+ operator) and not allowed (- operator) • Check JIRA for QueryParser issues • http://www.gossamer-threads.com/lists/lucene/java-user/40945 • Most applications either modify QP, create their own, or restrict to a subset of the syntax • Your users may not need all the “flexibility” of the QP
    38. 38. Sorting • Lucene default sort is by score • Searcher has several methods that take in a Sort object • Sorting should be addressed during indexing • Sorting is done on Fields containing a single term that can be used for comparison • The SortField defines the different sort types available – AUTO, STRING, INT, FLOAT, CUSTOM, SCORE, DOC
    39. 39. Sorting II • Look at Searcher, Sort and SortField • Custom sorting is done with a SortComparatorSource • Sorting can be very expensive – Terms are cached in the FieldCache • SortFilterTest.java example
    40. 40. Filters • Filters restrict the search space to a subset of Documents • Use Cases – Search within a Search – Restrict by date – Rating – Security – Author
    41. 41. Filter Classes • QueryWrapperFilter (QueryFilter) – Restrict to subset of Documents that match a Query • RangeFilter – Restrict to Documents that fall within a range – Better alternative to RangeQuery • CachingWrapperFilter – Wrap another Filter and provide caching • SortFilterTest.java example
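
One way to picture a Filter is as a bit set with one bit per document id, intersected with the documents a query matched. The sketch below assumes that model; `FilterSketch`, `rangeFilter`, and `apply` are hypothetical names, and the "rating" values are made-up sample data.

```java
import java.util.BitSet;

/** Toy filter: a bit per document id, ANDed with the query's matches. */
final class FilterSketch {

    /** Sets the bit of every document whose value lies in [lower, upper]. */
    static BitSet rangeFilter(int[] valueByDoc, int lower, int upper) {
        BitSet bits = new BitSet(valueByDoc.length);
        for (int doc = 0; doc < valueByDoc.length; doc++) {
            if (valueByDoc[doc] >= lower && valueByDoc[doc] <= upper) {
                bits.set(doc);
            }
        }
        return bits;
    }

    /** Keeps only the query hits that the filter allows; neither input is modified. */
    static BitSet apply(BitSet queryHits, BitSet filter) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(filter);
        return result;
    }

    public static void main(String[] args) {
        int[] ratingByDoc = {3, 9, 7, 1, 8};           // hypothetical "rating" field values
        BitSet highlyRated = rangeFilter(ratingByDoc, 7, 9);

        BitSet queryHits = new BitSet();
        queryHits.set(0);
        queryHits.set(2);
        queryHits.set(3);
        queryHits.set(4);

        System.out.println(highlyRated);                   // prints {1, 2, 4}
        System.out.println(apply(queryHits, highlyRated)); // prints {2, 4}
    }
}
```

Since the bit set depends only on the index contents and not on the query, computing it once and reusing it across searches is the idea behind wrapping a filter in a caching layer.
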
    42. 42. Expert Results • Searcher has several “expert” methods – Hits is not always what you need due to: • Caching • Normalized Scores • Reexecutes Query repeatedly as results are accessed • HitCollector allows low-level access to all Documents as they are scored • TopDocs represents top n docs that match – TopDocsTest in examples
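
Collecting the top n documents as they are scored can be sketched with a bounded min-heap, so that memory stays proportional to n rather than to the number of matches. `TopNSketch` and `ScoredDoc` are hypothetical names; treating a score of 0 or less as "no match" is an assumption of this sketch, not a statement about Lucene's collectors.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Toy top-n collection: a bounded min-heap over (doc id, score) pairs. */
final class TopNSketch {

    static final class ScoredDoc {
        final int doc;
        final float score;

        ScoredDoc(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }

        @Override
        public String toString() {
            return "doc " + doc + " (" + score + ")";
        }
    }

    /** Orders the weakest hit first: lower score, then higher doc id on ties. */
    static final Comparator<ScoredDoc> WEAKEST_FIRST = new Comparator<ScoredDoc>() {
        @Override
        public int compare(ScoredDoc a, ScoredDoc b) {
            int byScore = Float.compare(a.score, b.score);
            return byScore != 0 ? byScore : Integer.compare(b.doc, a.doc);
        }
    };

    /** Returns up to n matching documents, best first. */
    static List<ScoredDoc> topN(float[] scoreByDoc, int n) {
        List<ScoredDoc> best = new ArrayList<>();
        if (n <= 0) {
            return best;
        }
        PriorityQueue<ScoredDoc> heap = new PriorityQueue<>(n, WEAKEST_FIRST);
        for (int doc = 0; doc < scoreByDoc.length; doc++) {
            if (scoreByDoc[doc] <= 0f) {
                continue; // assumption: non-positive score means the document did not match
            }
            ScoredDoc candidate = new ScoredDoc(doc, scoreByDoc[doc]);
            if (heap.size() < n) {
                heap.add(candidate);
            } else if (WEAKEST_FIRST.compare(heap.peek(), candidate) < 0) {
                heap.poll();          // evict the current weakest hit
                heap.add(candidate);
            }
        }
        while (!heap.isEmpty()) {
            best.add(heap.poll());    // drains weakest first
        }
        Collections.reverse(best);    // so the strongest hit comes first
        return best;
    }

    public static void main(String[] args) {
        float[] scores = {0.2f, 0.9f, 0f, 0.5f, 0.9f};
        System.out.println(topN(scores, 3));
    }
}
```

With the sample scores, documents 1 and 4 tie on score and the lower doc id is ranked first, followed by document 3.
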
    43. 43. Searchers • MultiSearcher – Search over multiple Searchables, including remote • MultiReader – Not a Searcher, but can be used with IndexSearcher to achieve same results for local indexes • ParallelMultiSearcher – Like MultiSearcher, but threaded • RemoteSearchable – RMI based remote searching • Look at MultiSearcherTest in example code
    44. 44. Search Performance • Search speed is based on a number of factors: – Query Type(s) – Query Size – Analysis – Occurrences of Query Terms – Optimize – Index Size – Index type (RAMDirectory, other) – Usual Suspects • CPU • Memory • I/O • Business Needs
    45. 45. Query Types • Be careful with WildcardQuery as it rewrites to a BooleanQuery containing all the terms that match the wildcards • Avoid starting a WildcardQuery with wildcard • Use ConstantScoreRangeQuery instead of RangeQuery • Be careful with range queries and dates – User mailing list and Wiki have useful tips for optimizing date handling
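
The cost of wildcards can be illustrated by expanding a pattern against a sorted term dictionary. This is a toy model with hypothetical names (`WildcardSketch`, `Expansion`), not Lucene's rewrite code: with a literal prefix the scan can seek into the sorted terms and stop early, while a leading wildcard forces a look at every term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

/** Toy wildcard expansion over a sorted term dictionary. */
final class WildcardSketch {

    /** The matching terms plus how many dictionary terms were tested against the pattern. */
    static final class Expansion {
        final List<String> matches = new ArrayList<>();
        int termsExamined;
    }

    /** Supports '*' (any run of characters) and '?' (exactly one character). */
    static Expansion expand(SortedSet<String> terms, String wildcard) {
        // The literal prefix is everything before the first wildcard character.
        int firstWild = 0;
        while (firstWild < wildcard.length()
                && wildcard.charAt(firstWild) != '*'
                && wildcard.charAt(firstWild) != '?') {
            firstWild++;
        }
        String prefix = wildcard.substring(0, firstWild);

        // Translate the wildcard into a regular expression, quoting literal characters.
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        Pattern pattern = Pattern.compile(regex.toString());

        Expansion result = new Expansion();
        // Seek to the first term >= prefix; an empty prefix means starting at the first term.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can share the prefix
            }
            result.termsExamined++;
            if (pattern.matcher(term).matches()) {
                result.matches.add(term);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(Arrays.asList(
                "aardvark", "hood", "little", "red", "riding", "robin", "women", "zoo"));
        Expansion prefixed = expand(terms, "ri*");
        Expansion leading = expand(terms, "*d");
        System.out.println(prefixed.matches + " after " + prefixed.termsExamined + " term(s)");
        System.out.println(leading.matches + " after " + leading.termsExamined + " term(s)");
    }
}
```

Every matching term becomes one clause of the rewritten query, so a broad pattern on a large index can produce a very large query as well as a long scan.
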
    46. 46. Query Size • Stopword removal • Search an “all” field instead of many fields with the same terms • Disambiguation – May be useful when doing synonym expansion – Difficult to automate and may be slower – Some applications may allow the user to disambiguate • Relevance Feedback/More Like This – Use most important words – “Important” can be defined in a number of ways
    47. 47. Usual Suspects • CPU – Profile your application • Memory – Examine your heap size, garbage collection approach • I/O – Cache your Searcher • Define business logic for refreshing based on indexing needs – Warm your Searcher before going live -- See Solr • Business Needs – Do you really need to support Wildcards? – What about date range queries down to the millisecond?
    48. 48. Explanations • explain(Query, int) method is useful for understanding why a Document scored the way it did • ExplainsTest in sample code • Open Luke and try some queries and then use the “explain” button
    49. 49. FieldSelector • Prior to version 2.1, Lucene always loaded all Fields in a Document • FieldSelector API addition allows Lucene to skip large Fields – Options: Load, Lazy Load, No Load, Load and Break, Load for Merge, Size, Size and Break • Makes storage of original content more viable without large cost of loading it when not used • FieldSelectorTest in example code
    50. 50. Scoring and Similarity • Lucene has a sophisticated scoring mechanism designed to meet most needs • Has hooks for modifying scores • Scoring is handled by the Query, Weight and Scorer classes
    51. 51. Affecting Relevance • FunctionQuery from Solr (variation in Lucene) • Override Similarity • Implement own Query and related classes • Payloads • HitCollector • Take 5 to examine these
    52. 52. Lunch 1-2:30
    53. 53. Recap • Indexing • Searching • Performance • Odds and Ends – Explains – FieldSelector – Relevance
    54. 54. Next Up • Dealing with Content – File Formats – Extraction • Large Task • Miscellaneous • Wrapping Up
    55. 55. File Formats • Several open source libraries, projects for extracting content to use in Lucene – PDF: PDFBox • http://www.pdfbox.org/ – Word: POI, Open Office, TextMining • http://www.textmining.org/textmining.zip – XML: SAX or Pull parser – HTML: Neko, Jtidy • http://people.apache.org/~andyc/neko/doc/html/ • http://jtidy.sourceforge.net/ • Tika – http://incubator.apache.org/tika/ • Aperture – http://aperture.sourceforge.net
    56. 56. Aperture Basics • Crawlers • Data Connectors • Extraction Wrappers – POI, PDFBox, HTML, XML, etc. • http://aperture.wiki.sourceforge.net/Extractors will give you info on what comes back from Aperture • LuceneApertureCallbackHandler in example code
    57. 57. Large Task • Using the skeleton files in the com.lucenebootcamp.training.full package: – Get some content: • Web, file system • Different file formats – Index it • Plan out your fields, boosts, field properties • Support updates and deletes • Optional: – How fast can you make it go? Divide and conquer? Multithreaded?
    58. 58. Large Task • Search Content – Allow for arbitrary user queries across multiple Fields via command line or simple web interface – How fast can you make it? • Support: – Sort – Filter – Explains • How much slower is it to retrieve an explanation?
    59. 59. Large Task • Document Retrieval – Display/write out the one or more documents – Support FieldSelector
    60. 60. Large Task • Optional Tasks – Hit Highlighting using contrib/Highlighter – Multithreaded indexing and Search – Explore other Field construction options • Binary fields, term vectors – Use Lucene trunk version and try out some of the changes in indexing – Try out Solr or Nutch at http://lucene.apache.org/ • What do they offer that Lucene Java doesn’t that you might need?
    61. 61. Large Task Metadata – Pair up if you want – Ask questions – 2 hours – Use Luke to check your index! – Explore other parts of Lucene that you are interested in – Be prepared to discuss/share with the class
    62. 62. Large Task Post-Mortem • Volunteers to share?
    63. 63. Term Information • TermEnum gives access to terms and how many Documents they occur in – IndexReader.terms() – IndexReader.termPositions() • TermDocs gives access to the frequency of a term in a Document – IndexReader.termDocs() • Term Vectors give access to term frequency information in a given Document – IndexReader.getTermFreqVector • TermsTest in sample code
    64. 64. Lucene Contributions • Many people have generously contributed code to help solve common problems • These are in contrib directory of the source • Popular: – Analyzers – Highlighter – Queries and MoreLikeThis – Snowball Stemmers – Spellchecker
    65. 65. Open Discussion • Multilingual Best Practices – UNICODE – One Index versus many • Advanced Analysis • Distributed Lucene • Crawling • Hadoop • Nutch • Solr
    66. 66. Resources • http://lucene.apache.org/ • http://en.wikipedia.org/wiki/Vector_space_model • Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto • Lucene In Action by Hatcher and Gospodnetić • Wiki • Mailing Lists – java-user@lucene.apache.org • Discussions on how to use Lucene – java-dev@lucene.apache.org • Discussions on how to develop Lucene • Issue Tracking – https://issues.apache.org/jira/secure/Dashboard.jspa • We always welcome patches – Ask on the mailing list before reporting a bug
    67. 67. Resources • trainer@lucenebootcamp.com
    68. 68. Finally… • Please take the time to fill out a survey to help me improve this training – Located in base directory of source – Email it to me at trainer@lucenebootcamp.com • There are several Lucene related talks on Friday
    69. 69. Extras
    70. 70. Task 2 • Take 10-15 minutes, pair up, and write an Analyzer and Unit Test – Examine results in Luke – Run some searches • Ideas: – Combine existing Tokenizers and TokenFilters – Normalize abbreviations – Filter out all words beginning with the letter A – Identify/Mark sentences • Questions: – What would help improve search results?
    71. 71. Task 2 Results • Share what you did and why • Improving Results (in most cases) – Stemming – Ignore Case – Stopword Removal – Synonyms – Pay attention to business needs
    72. 72. Grab Bag • Accessing Term Information – TermEnum – TermDocs – Term Vectors • FieldSelector • Scoring and Similarity • File Formats
    73. 73. Task 6 • Count and print all the unique terms in the index and their frequencies – Notes: • Half of the class write it using TermEnum and TermDocs • Other Half write it using Term Vectors • Time your Task • Only count the title and body content
    74. 74. Task 6 Results • Term Vector approach is faster on smaller collections • TermEnum approach is faster on larger collections
    75. 75. Task 4 • Re-index your collection – Add in a “rating” field that randomly assigns a number between 0 and 9 • Write searches to sort by • Date • Title • Rating, Date, Doc Id • A Custom Sort • Questions – How to sort the title? – How to sort multiple Fields?
    76. 76. Task 4 Results • Add stitle to use for sorting the title
    77. 77. Task 5 • Create and search using Filters to: – Restrict to all docs written on Feb. 26, 1987 – Restrict to all docs with the word “computer” in title • Also: – Create a Filter where the length of the body + title is greater than X
    78. 78. Task 5 Results • Solr has more advanced Filter mechanisms that may be worth using • Cache filters
    79. 79. Task 7 • Pair up if you like and take 30-40 minutes to: – Pick two file formats to work on – Identify content in that format • Can you index contents on your hard drive? • Project Gutenberg, Creative Commons, Wikipedia • Combine w/ Reuters collection – Extract the content and index it using the appropriate library – Store the content as a Field – Search the content – Load Documents with and without FieldSelector and measure performance
    80. 80. Task 7 (cont.) • Include score and explanation in results • Dump results to XML or HTML • Be prepared to share with class what you did – What libraries did you use? – What content did you use? – What is your Document structure? – What issues did you have?
    81. 81. 20 Minute Break
    82. 82. Task 7 Results • Explain what your group did • Build a Content Handler Framework – Or help out with Tika
    83. 83. Task 8 • Building on Task 7 – Incorporate one or more contrib packages into your solution
