Lucene Bootcamp -1

Published in: Technology
  • Transcript of "Lucene Bootcamp -1"

    1. 1. Lucene Boot Camp I <ul><li>Grant Ingersoll </li></ul><ul><li>Lucid Imagination </li></ul><ul><li>Nov. 3, 2008 </li></ul><ul><li>New Orleans, LA </li></ul>
    2. 2. Intro <ul><li>My Background </li></ul><ul><li>Goals for Tutorial </li></ul><ul><ul><li>Understand Lucene core capabilities </li></ul></ul><ul><ul><li>Real examples, real code, real data </li></ul></ul><ul><li>Ask Questions!!!!! </li></ul>
    3. 3. Schedule <ul><li>Day I </li></ul><ul><ul><li>Concepts </li></ul></ul><ul><ul><li>Indexing </li></ul></ul><ul><ul><li>Searching </li></ul></ul><ul><ul><li>Analysis </li></ul></ul><ul><ul><li>Lucene contrib: highlighter, spell checking, etc. </li></ul></ul><ul><li>Day II </li></ul><ul><ul><li>In-depth Indexing/Searching </li></ul></ul><ul><ul><ul><li>Performance, Internals </li></ul></ul></ul><ul><ul><li>Terms and Term Vectors </li></ul></ul><ul><ul><li>Class Project </li></ul></ul><ul><ul><li>Q & A </li></ul></ul>
    4. 4. Resources <ul><li>Slides at </li></ul><ul><ul><li>http://www.lucenebootcamp.com/boot-camp-slides/ </li></ul></ul><ul><li>Lucene Java </li></ul><ul><ul><li>http://lucene.apache.org/java </li></ul></ul><ul><ul><li>http://lucene.apache.org/java/2_4_0/ </li></ul></ul><ul><ul><li>http://lucene.apache.org/java/2_4_0/api/index.html </li></ul></ul><ul><li>Luke: </li></ul><ul><ul><li>http://www.getopt.org/luke </li></ul></ul>
    5. 5. What is Search? <ul><li>Given a user’s information need (query), find documents relevant to the need </li></ul><ul><ul><li>Very Subjective! </li></ul></ul><ul><li>Information Retrieval </li></ul><ul><ul><li>Interdisciplinary </li></ul></ul><ul><ul><li>Comp. Sci, Math/Statistics, Library Sci., Linguistics, AI… </li></ul></ul>
    6. 6. Search Use Cases <ul><li>Web </li></ul><ul><ul><li>Google, Y!, etc. </li></ul></ul><ul><li>Enterprise </li></ul><ul><ul><li>Intranet, Content Repositories, email, etc. </li></ul></ul><ul><li>eCommerce/DB/CMS </li></ul><ul><ul><li>Online Stores, websites, etc. </li></ul></ul><ul><li>Other </li></ul><ul><ul><li>QA, Federated </li></ul></ul><ul><li>Yours? Why do you need Search? </li></ul>
    7. 7. Your Content And You <ul><li>Only you know your content! </li></ul><ul><ul><li>Key Features </li></ul></ul><ul><ul><ul><li>Title, body, price, margin, etc. </li></ul></ul></ul><ul><ul><li>Important Terms </li></ul></ul><ul><ul><li>Synonyms/Jargon </li></ul></ul><ul><ul><li>Structures (tables, lists, etc.) </li></ul></ul><ul><ul><li>Importance </li></ul></ul><ul><ul><li>Priorities </li></ul></ul>
    8. 8. Search Basics <ul><li>Many different Models: </li></ul><ul><ul><li>Boolean, Probabilistic, Inference, Neural Net, and: </li></ul></ul><ul><li>Modified Vector Space Model (VSM) </li></ul><ul><ul><li>Boolean + VSM </li></ul></ul><ul><ul><li>TF-IDF </li></ul></ul><ul><ul><li>The words in the document and the query each define a Vector in an n-dimensional space </li></ul></ul><ul><ul><li>$\mathrm{Sim}(q_1, d_1) = \cos\Theta$ </li></ul></ul><ul><ul><li>$d_j = \langle w_{1,j}, w_{2,j}, \ldots, w_{n,j} \rangle$ </li></ul></ul><ul><ul><li>$q = \langle w_{1,q}, w_{2,q}, \ldots, w_{n,q} \rangle$ </li></ul></ul><ul><ul><li>$w$ = weight assigned to term </li></ul></ul><ul><ul><li>(Figure: vectors $q_1$ and $d_1$ separated by the angle $\Theta$) </li></ul></ul>
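The cosine measure on the Search Basics slide can be sketched in a few lines of plain Java. This is only an illustration of the vector space idea, not Lucene's scoring code: the class name `CosineSketch`, the use of a `Map` as a sparse term-weight vector, and the sample weights are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative cosine similarity between two sparse term-weight vectors. */
final class CosineSketch {

    /** Returns (q . d) / (|q| * |d|), or 0 if either vector has zero length. */
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            Double w = d.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w; // only terms present in both vectors contribute
            }
        }
        double norms = norm(q) * norm(d);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Double> query = new HashMap<>();
        query.put("stanley", 1.0);
        query.put("cup", 1.0);

        Map<String, Double> doc = new HashMap<>();
        doc.put("stanley", 2.0);
        doc.put("cup", 1.0);
        doc.put("hockey", 3.0);

        System.out.println(cosine(query, doc));
    }
}
```

In a TF-IDF setting the weights would come from term and document frequencies rather than being set by hand as they are here.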
    9. 9. Inverted Index From “Taming Text”
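Since the inverted index figure does not survive in this transcript, a minimal sketch of the data structure may help. It maps each term to the list of documents that contain it. The class `InvertedIndexSketch`, its method names, and the naive lowercase/split tokenization are assumptions for illustration; Lucene's on-disk format is far more elaborate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory inverted index: term -> ids of the documents containing it. */
final class InvertedIndexSketch {
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Adds a document and returns the id assigned to it. */
    int add(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Ids only grow, so checking the last entry is enough to avoid duplicates.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    /** Returns the ids of all documents containing the term (empty if none). */
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new ArrayList<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("The Stanley Cup");
        index.add("A cup of tea");
        System.out.println(index.lookup("cup"));
    }
}
```

Looking up a term is a single map access, which is why search time depends on the number of matching documents rather than on the size of the collection.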
    10. 10. Lucene Background <ul><li>Created by Doug Cutting in 1999 </li></ul><ul><li>Donated to ASF in 2001 </li></ul><ul><li>Morphed into a Top Level Project (TLP) with many sub projects </li></ul><ul><ul><li>Java (flagship) a.k.a. “Lucene” </li></ul></ul><ul><ul><li>Solr, Nutch, Mahout, Tika, several Lucene ports </li></ul></ul><ul><li>From here on out, Lucene refers to “Lucene Java” </li></ul>
    11. 11. Lucene is… <ul><li>NOT a crawler </li></ul><ul><ul><li>See Nutch </li></ul></ul><ul><li>NOT an application </li></ul><ul><ul><li>See PoweredBy on the Wiki </li></ul></ul><ul><li>NOT a library for doing Google PageRank or other link analysis algorithms </li></ul><ul><ul><li>See Nutch </li></ul></ul><ul><li>A library for enabling text based search </li></ul>
    12. 12. A Few Words about Solr <ul><li>HTTP-based Search Server </li></ul><ul><li>XML Configuration </li></ul><ul><li>XML, JSON, Ruby, PHP, Java support </li></ul><ul><li>Many, many nice features that Lucene users need </li></ul><ul><ul><li>Faceting, spell checking, highlighting </li></ul></ul><ul><ul><li>Caching, Replication, Distributed </li></ul></ul><ul><li>http://lucene.apache.org/solr </li></ul>
    13. 13. Indexing <ul><li>Process of preparing and adding text to Lucene, which stores it in an inverted index </li></ul><ul><li>Key Point: Lucene only indexes Strings </li></ul><ul><ul><li>What does this mean? </li></ul></ul><ul><ul><ul><li>Lucene doesn’t care about XML, Word, PDF, etc. </li></ul></ul></ul><ul><ul><ul><ul><li>There are many good open source extractors available </li></ul></ul></ul></ul><ul><ul><ul><li>It’s our job to convert whatever file format we have into something Lucene can use </li></ul></ul></ul>
    14. 14. Indexing Classes <ul><li>Analyzer </li></ul><ul><ul><li>Creates tokens using a Tokenizer and filters them through zero or more TokenFilters </li></ul></ul><ul><li>IndexWriter </li></ul><ul><ul><li>Responsible for converting text into internal Lucene format </li></ul></ul><ul><li>Directory </li></ul><ul><ul><li>Where the Index is stored </li></ul></ul><ul><ul><li>RAMDirectory, FSDirectory, others </li></ul></ul>
    15. 15. Indexing Classes <ul><li>Document </li></ul><ul><ul><li>A collection of Fields </li></ul></ul><ul><ul><li>Can be boosted </li></ul></ul><ul><li>Field </li></ul><ul><ul><li>Free text, keywords, dates, etc. </li></ul></ul><ul><ul><li>Defines attributes for storing, indexing </li></ul></ul><ul><ul><li>Can be boosted </li></ul></ul><ul><ul><li>Field Constructors and parameters </li></ul></ul><ul><ul><ul><li>Open up Fieldable and Field in IDE </li></ul></ul></ul>
    16. 16. How to Index <ul><li>Create IndexWriter </li></ul><ul><li>For each input </li></ul><ul><ul><li>Create a Document </li></ul></ul><ul><ul><li>Add Fields to the Document </li></ul></ul><ul><ul><li>Add the Document to the IndexWriter </li></ul></ul><ul><li>Optimize (optional) </li></ul><ul><li>Close the IndexWriter </li></ul>
    17. 17. Indexing in a Nutshell <ul><li>For each Document </li></ul><ul><ul><li>For each Field to be tokenized </li></ul></ul><ul><ul><ul><li>Create the tokens using the specified Tokenizer </li></ul></ul></ul><ul><ul><ul><ul><li>Tokens consist of a String, position, type and offset information </li></ul></ul></ul></ul><ul><ul><ul><li>Pass the tokens through the chained TokenFilters where they can be changed or removed </li></ul></ul></ul><ul><ul><ul><li>Add the end result to the inverted index </li></ul></ul></ul><ul><li>Position information can be altered </li></ul><ul><ul><li>Useful when removing words or to prevent phrases from matching </li></ul></ul>
    18. 18. Task 1.a <ul><li>From the Boot Camp Files, use the basic.ReutersIndexer skeleton to start </li></ul><ul><li>Index the small Reuters Collection using the IndexWriter, a Directory and StandardAnalyzer </li></ul><ul><ul><li>Boost every 10th document by a factor of 3 </li></ul></ul><ul><li>Questions to Answer: </li></ul><ul><ul><li>What Fields should I define? </li></ul></ul><ul><ul><li>What attributes should each Field have? </li></ul></ul><ul><ul><li>Pick a field to boost and give a reason why you think it should be boosted </li></ul></ul><ul><li>~30 minutes </li></ul>
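The tokenize, filter, and position steps above can be sketched without Lucene. The example below is an assumption-laden illustration, not the Lucene analysis API: `AnalysisSketch`, `Tok`, and the four-word stop list are made up for the example. It shows one way position information can be kept when a stop word is removed: the position counter still advances, leaving a gap so that a phrase query would not treat the surrounding words as adjacent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analysis chain: whitespace tokenizer, lowercase filter, stop-word filter. */
final class AnalysisSketch {

    /** A token: its text plus the position it occupies in the field. */
    static final class Tok {
        final String text;
        final int position;

        Tok(String text, int position) {
            this.text = text;
            this.position = position;
        }

        @Override
        public String toString() {
            return text + "@" + position;
        }
    }

    // Hypothetical stop list, chosen only for this example.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    static List<Tok> analyze(String fieldValue) {
        List<Tok> out = new ArrayList<>();
        int position = 0;
        for (String raw : fieldValue.split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // leading whitespace produces one empty chunk
            }
            String term = raw.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(term)) {
                out.add(new Tok(term, position));
            }
            position++; // advance even for removed words, which leaves a gap
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Cup of Stanley")); // prints [cup@1, stanley@3]
    }
}
```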
    19. 19. Use Luke
    20. 20. 5 minute Break
    21. 21. Searching <ul><li>Parse user query </li></ul><ul><li>Lookup matching Documents </li></ul><ul><li>Score Documents </li></ul><ul><li>Return ranked list </li></ul>
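The four steps on the Searching slide can be walked through with a deliberately naive sketch. Everything here is an assumption for illustration: `SearchSketch` scans every document instead of consulting an inverted index, and the score is a raw count of query-term occurrences, which is not Lucene's scoring formula.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy search: parse the query, find matching documents, score them, rank them. */
final class SearchSketch {

    /** Returns the ids (list indexes) of matching documents, best match first. */
    static List<Integer> search(String query, List<String> docs) {
        // 1. Parse the user query into lowercase terms.
        String[] terms = query.toLowerCase(Locale.ROOT).split("\\s+");

        // 2 and 3. Look up matching documents and score them by term occurrences.
        Map<Integer, Integer> scores = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            int score = 0;
            for (String token : docs.get(id).toLowerCase(Locale.ROOT).split("\\W+")) {
                for (String term : terms) {
                    if (!term.isEmpty() && term.equals(token)) {
                        score++;
                    }
                }
            }
            if (score > 0) {
                scores.put(id, score);
            }
        }

        // 4. Return a ranked list: higher score first, lower id breaks ties.
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return ranked;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("hockey hockey cup", "stanley cup", "tea");
        System.out.println(search("hockey cup", docs)); // prints [0, 1]
    }
}
```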
    22. 22. Key Classes: <ul><li>Searcher </li></ul><ul><ul><li>Provides methods for searching </li></ul></ul><ul><ul><li>Take a moment to look at the Searcher class declaration </li></ul></ul><ul><ul><li>IndexSearcher, MultiSearcher, ParallelMultiSearcher </li></ul></ul><ul><li>IndexReader </li></ul><ul><ul><li>Loads a snapshot of the index into memory for searching </li></ul></ul><ul><ul><li>More tomorrow </li></ul></ul><ul><li>TopDocs - The search results </li></ul><ul><li>QueryParser </li></ul><ul><ul><li>http://lucene.apache.org/java/docs/queryparsersyntax.html </li></ul></ul><ul><li>Query </li></ul><ul><ul><li>Logical representation of program’s information need </li></ul></ul>
    23. 23. Query Parsing <ul><li>Basic syntax: </li></ul><ul><ul><li>title:hockey +(body:stanley AND body:cup) </li></ul></ul><ul><li>OR/AND must be uppercase </li></ul><ul><li>Default operator is OR (can be changed) </li></ul><ul><li>Supports fairly advanced syntax, see the website </li></ul><ul><ul><li>http://lucene.apache.org/java/docs/queryparsersyntax.html </li></ul></ul><ul><li>Doesn’t always play nice, so beware </li></ul><ul><ul><li>Many applications construct queries programmatically or restrict syntax </li></ul></ul>
    24. 24. How to Search <ul><li>Create/Get an IndexSearcher </li></ul><ul><li>Create a Query </li></ul><ul><ul><li>Use a QueryParser </li></ul></ul><ul><ul><li>Construct it programmatically </li></ul></ul><ul><li>Display the results from the TopDocs </li></ul><ul><ul><li>Retrieve Field values from Document </li></ul></ul><ul><li>More tomorrow on search lifecycle </li></ul>
    25. 25. Task 1.b <ul><li>Using the ReutersIndexerTest.java skeleton in the boot camp files </li></ul><ul><ul><li>Search your newly created index using queries you develop </li></ul></ul><ul><li>Questions: </li></ul><ul><ul><li>What is the default field for the QueryParser? </li></ul></ul><ul><ul><li>Which Analyzer should you use? </li></ul></ul><ul><li>~20 minutes </li></ul>
    26. 26. Task 1 Results <ul><li>Scores across queries are NOT comparable </li></ul><ul><ul><li>They may not even be comparable for the same query over time (if the index changes) </li></ul></ul><ul><li>Performance </li></ul><ul><ul><li>Caching </li></ul></ul><ul><ul><li>Warming </li></ul></ul><ul><ul><li>More Tomorrow </li></ul></ul>
    27. 27. Lunch 1-2:30
    28. 28. Discussion/Questions <ul><li>So far, we’ve seen the basics of search and indexing </li></ul><ul><li>Next, we are going to look into Analysis and Contrib modules </li></ul>
    29. 29. Analysis <ul><li>Analysis is the process of creating Tokens to be indexed </li></ul><ul><li>Analysis is usually done to improve results overall, but it comes with a price </li></ul><ul><li>Lucene comes with many different Analyzers, Tokenizers and TokenFilters, each with their own goals </li></ul><ul><li>StandardAnalyzer is included with the core JAR and does a good job for most English and Latin-based tasks </li></ul><ul><li>Oftentimes you want the same content analyzed in different ways </li></ul><ul><li>Consider a catch-all Field in addition to other Fields </li></ul>
    30. 30. Solr’s Analysis tool <ul><li>If you use nothing else from Solr, the Admin analysis tool can really help you understand analysis </li></ul><ul><li>Download Solr and unpack it </li></ul><ul><li>cd apache-solr-1.3.0/example </li></ul><ul><li>java -jar start.jar </li></ul><ul><li>http://localhost:8983/solr/admin/analysis.jsp </li></ul>
    31. 31. Analyzers <ul><li>StandardAnalyzer, WhitespaceAnalyzer, SimpleAnalyzer </li></ul><ul><li>Contrib/analysis </li></ul><ul><ul><li>Suite of Analyzers for many common situations </li></ul></ul><ul><ul><ul><li>Languages </li></ul></ul></ul><ul><ul><ul><li>n-grams </li></ul></ul></ul><ul><ul><ul><li>Payloads </li></ul></ul></ul><ul><li>Contrib/snowball </li></ul>
    32. 32. Tokenization <ul><li>Split text into Tokens to be processed </li></ul><ul><li>Tokenization is fairly straightforward for most languages that use a space for word segmentation </li></ul><ul><ul><li>More difficult for some East Asian languages </li></ul></ul><ul><ul><li>See the CJK Analyzer </li></ul></ul>
    33. 33. Modifying Tokens <ul><li>TokenFilters are used to alter the token stream to be indexed </li></ul><ul><li>Common tasks: </li></ul><ul><ul><li>Remove stopwords </li></ul></ul><ul><ul><li>Lower case </li></ul></ul><ul><ul><li>Stem/Normalize, e.g. Wi-Fi -> Wi Fi </li></ul></ul><ul><ul><li>Add Synonyms </li></ul></ul><ul><li>StandardAnalyzer does things that you may not want </li></ul>
    34. 34. Payloads <ul><li>Associate an arbitrary byte array with a term in the index </li></ul><ul><li>Uses </li></ul><ul><ul><li>Part of Speech </li></ul></ul><ul><ul><li>Font weight </li></ul></ul><ul><ul><li>URL </li></ul></ul><ul><li>Currently can search using the BoostingTermQuery </li></ul>
    35. 35. n-grams <ul><li>Combine units of content together into a single token </li></ul><ul><li>Character </li></ul><ul><ul><li>2-grams for the word “Lucene”: </li></ul></ul><ul><ul><ul><li>Lu, uc, ce, en, ne </li></ul></ul></ul><ul><ul><li>Can make search possible when data is noisy or hard to tokenize </li></ul></ul><ul><li>Word (“shingles” in Lucene parlance) </li></ul><ul><ul><li>Pseudo Phrases </li></ul></ul>
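Both kinds of n-gram on this slide are easy to reproduce in plain Java. The sketch below is an illustration only and does not use the contrib n-gram or shingle classes; `NGramSketch` and its method names are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy character n-grams and word n-grams ("shingles"). */
final class NGramSketch {

    /** All contiguous substrings of length n, in order of appearance. */
    static List<String> charNGrams(String word, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    /** All runs of n adjacent words, each joined with a single space. */
    static List<String> shingles(List<String> words, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= words.size(); i++) {
            out.add(String.join(" ", words.subList(i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(charNGrams("Lucene", 2)); // prints [Lu, uc, ce, en, ne]
        System.out.println(shingles(Arrays.asList("stanley", "cup", "final"), 2)); // prints [stanley cup, cup final]
    }
}
```

Indexing each shingle as one token is what makes it behave like a pseudo phrase: "stanley cup" can then be matched as a single term.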
    36. 36. Custom Analyzers <ul><li>Problem: none of the Analyzers cover my problem </li></ul><ul><li>Solution: write your own Analyzer </li></ul><ul><li>Better solution: write a configurable Analyzer so you only need one Analyzer that you can easily change for your projects </li></ul><ul><ul><li>See Solr </li></ul></ul>
    37. 37. Analysis APIs <ul><li>Have a look at the TokenStream and Token APIs </li></ul><ul><li>Tokens and TokenStreams may be reused </li></ul><ul><ul><li>Helps reduce allocations and speeds up indexing </li></ul></ul><ul><ul><li>Not all Analysis can take advantage: caching </li></ul></ul><ul><ul><li>Analyzer.reusableTokenStream() </li></ul></ul><ul><ul><li>TokenStream.next(Token) </li></ul></ul>
    38. 38. Special Cases <ul><li>Dates and numbers need special treatment to be searchable </li></ul><ul><ul><li>o.a.l.document.DateTools </li></ul></ul><ul><ul><li>org.apache.solr.util.NumberUtils </li></ul></ul><ul><li>Altering Position Information </li></ul><ul><ul><li>Increase Position Gap between sentences to prevent phrases from crossing sentence boundaries </li></ul></ul><ul><ul><li>Index synonyms at the same position so query can match regardless of synonym used </li></ul></ul>
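Dates and numbers need special treatment because index terms are compared as strings, so "9" sorts after "10". The sketch below shows the general idea with JDK classes only; it does not call `DateTools` or `NumberUtils`, and the class name `SortableTermSketch`, the ten-digit width, and the `yyyyMMdd` layout are assumptions chosen for the example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Toy encodings that make string order agree with numeric and date order. */
final class SortableTermSketch {

    /** Zero-pads a non-negative int to 10 digits, the width of Integer.MAX_VALUE. */
    static String pad(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("this sketch handles non-negative values only");
        }
        return String.format(Locale.ROOT, "%010d", value);
    }

    /** Encodes a date as yyyyMMdd so that later dates sort later as strings. */
    static String encode(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        System.out.println("9".compareTo("10") > 0);        // prints true: raw strings sort wrongly
        System.out.println(pad(9).compareTo(pad(10)) < 0);  // prints true: padded strings sort correctly
        System.out.println(encode(LocalDate.of(2008, 11, 3))); // prints 20081103
    }
}
```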
    39. 39. Task 2 <ul><li>Take 15-20 minutes and write an Analyzer/Tokenizer/TokenFilter and Unit Test </li></ul><ul><ul><li>Examine results in Luke </li></ul></ul><ul><ul><li>Run some searches </li></ul></ul><ul><li>Ideas: </li></ul><ul><ul><li>Combine existing Tokenizers and TokenFilters </li></ul></ul><ul><ul><li>Normalize abbreviations </li></ul></ul><ul><ul><li>Add payloads </li></ul></ul><ul><ul><li>Filter out all words beginning with the letter A </li></ul></ul><ul><ul><li>Identify/Mark sentences </li></ul></ul>
    40. 40. Discussion <ul><li>What did you implement? </li></ul><ul><li>What issues do you face with your content? </li></ul><ul><li>To Stem or not to Stem? </li></ul><ul><li>Stopwords: good or bad? </li></ul><ul><li>Tradeoffs of different techniques </li></ul>
    41. 41. Lucene Contributions <ul><li>Many people have generously contributed code to help solve common problems </li></ul><ul><li>These are in contrib directory of the source </li></ul><ul><li>Popular: </li></ul><ul><ul><li>Analyzers </li></ul></ul><ul><ul><li>Highlighter </li></ul></ul><ul><ul><li>Queries and MoreLikeThis </li></ul></ul><ul><ul><li>Snowball Stemmers </li></ul></ul><ul><ul><li>Spellchecker </li></ul></ul>
    42. 42. Highlighter <ul><li>Highlight query keywords in context </li></ul><ul><ul><li>Often useful for display purposes </li></ul></ul><ul><li>Important Classes: </li></ul><ul><ul><li>Highlighter - Main entry point, coordinates the work </li></ul></ul><ul><ul><li>Fragmenter - Splits up document for scoring </li></ul></ul><ul><ul><li>Formatter - Marks up the results </li></ul></ul><ul><ul><li>Scorer - Scores the fragments </li></ul></ul><ul><ul><ul><li>SpanScorer - Can score phrases </li></ul></ul></ul><ul><li>Use term vectors for performance </li></ul><ul><li>Look at example usage </li></ul>
    43. 43. Spell Checking <ul><li>Suggest spelling corrections based on spellings of words in the index </li></ul><ul><ul><li>Will/can suggest incorrectly spelled words </li></ul></ul><ul><li>Uses a distance measure to determine suggestions </li></ul><ul><ul><li>Can also factor in document frequency </li></ul></ul><ul><ul><li>Distance Measure is pluggable </li></ul></ul>
    44. 44. Spell Checking <ul><li>Classes: SpellChecker, StringDistance </li></ul><ul><li>See ContribExamplesTest </li></ul><ul><li>Practical aspects: </li></ul><ul><ul><li>It’s not as simple as just turning it on </li></ul></ul><ul><ul><li>Good results require testing and tuning </li></ul></ul><ul><ul><ul><li>Pay attention to accuracy settings </li></ul></ul></ul><ul><ul><ul><li>Mind your Analysis (simple, no stemming) </li></ul></ul></ul><ul><ul><ul><li>Consider alternate StringDistance (JaroWinklerDistance) </li></ul></ul></ul>
    45. 45. More Like This <ul><li>Given a Document, find other Documents that are similar </li></ul><ul><ul><li>Variation on relevance feedback </li></ul></ul><ul><ul><li>“Find Similar” </li></ul></ul><ul><li>Extracts the most important terms from a Document and creates a new query </li></ul><ul><ul><li>Many options available for determining important terms </li></ul></ul><ul><li>Classes: MoreLikeThis </li></ul><ul><ul><li>See ContribExamplesTest </li></ul></ul>
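To make the "distance measure" idea concrete, here is a brute-force sketch that suggests the vocabulary word closest to a misspelling by Levenshtein edit distance. It is an assumption-based illustration and not how the contrib spell checker is implemented; `SpellSketch`, `suggest`, and the `maxDistance` cutoff are hypothetical names for this example.

```java
import java.util.Arrays;
import java.util.Collection;

/** Toy spelling suggestions based on Levenshtein edit distance. */
final class SpellSketch {

    /** Number of single-character inserts, deletes, or substitutions turning a into b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the closest vocabulary word within maxDistance edits, or null if none. */
    static String suggest(String word, Collection<String> vocabulary, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String candidate : vocabulary) {
            int d = levenshtein(word, candidate);
            if (d < bestDistance) {
                best = candidate;
                bestDistance = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(suggest("lucen", Arrays.asList("lucene", "solr", "nutch"), 2)); // prints lucene
    }
}
```

Because the vocabulary comes from the index, a misspelled word that was indexed can itself be suggested, which is the caveat raised on the previous slide.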
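The "extract the most important terms" step can be sketched as a TF-IDF ranking over one document's tokens. This is a guess at the general idea rather than the `MoreLikeThis` implementation: the class `MoreLikeThisSketch`, the `tf * log(numDocs / (1 + df))` weight, and the sample frequencies are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy "find similar" term selection: keep the k terms with the highest TF-IDF weight. */
final class MoreLikeThisSketch {

    static List<String> topTerms(List<String> docTokens, Map<String, Integer> docFreq,
                                 int numDocs, int k) {
        // Term frequency within the source document.
        Map<String, Integer> tf = new HashMap<>();
        for (String token : docTokens) {
            tf.merge(token, 1, Integer::sum);
        }

        // Weight each term: frequent in this document, rare in the collection.
        Map<String, Double> weight = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            weight.put(e.getKey(), e.getValue() * Math.log((double) numDocs / (1 + df)));
        }

        // Highest weight first; alphabetical order breaks ties.
        List<String> terms = new ArrayList<>(weight.keySet());
        terms.sort((a, b) -> {
            int byWeight = Double.compare(weight.get(b), weight.get(a));
            return byWeight != 0 ? byWeight : a.compareTo(b);
        });
        int limit = Math.max(0, Math.min(k, terms.size()));
        return new ArrayList<>(terms.subList(0, limit));
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("the", 100);
        docFreq.put("hockey", 5);
        docFreq.put("cup", 10);
        List<String> doc = Arrays.asList("hockey", "hockey", "the", "the", "the", "cup");
        System.out.println(topTerms(doc, docFreq, 100, 2)); // prints [hockey, cup]
    }
}
```

The selected terms would then be combined into a new query, for example by OR-ing them together, and run against the index.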
    46. 46. Summary <ul><li>Indexing </li></ul><ul><li>Searching </li></ul><ul><li>Analysis </li></ul><ul><li>Contrib </li></ul><ul><li>Questions? </li></ul>
    47. 47. Resources <ul><li>http://lucene.apache.org/ </li></ul><ul><li>http://en.wikipedia.org/wiki/Vector_space_model </li></ul><ul><li>Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto </li></ul><ul><li>Lucene In Action by Hatcher and Gospodnetić </li></ul><ul><li>Wiki </li></ul><ul><li>Mailing Lists </li></ul><ul><ul><li>[email_address] </li></ul></ul><ul><ul><ul><li>Discussions on how to use Lucene </li></ul></ul></ul><ul><ul><li>[email_address] </li></ul></ul><ul><ul><ul><li>Discussions on how to develop Lucene </li></ul></ul></ul><ul><li>Issue Tracking </li></ul><ul><ul><li>https://issues.apache.org/jira/secure/Dashboard.jspa </li></ul></ul><ul><li>We always welcome patches </li></ul><ul><ul><li>Ask on the mailing list before reporting a bug </li></ul></ul>
    48. 48. Resources <ul><li>[email_address] </li></ul><ul><li>Lucid Imagination </li></ul><ul><ul><li>Support </li></ul></ul><ul><ul><li>Training </li></ul></ul><ul><ul><li>Value Add </li></ul></ul><ul><ul><li>[email_address] </li></ul></ul>