Lucene Bootcamp -1


  1. 1. Lucene Boot Camp I <ul><li>Grant Ingersoll </li></ul><ul><li>Lucid Imagination </li></ul><ul><li>Nov. 3, 2008 </li></ul><ul><li>New Orleans, LA </li></ul>
  2. 2. Intro <ul><li>My Background </li></ul><ul><li>Goals for Tutorial </li></ul><ul><ul><li>Understand Lucene core capabilities </li></ul></ul><ul><ul><li>Real examples, real code, real data </li></ul></ul><ul><li>Ask Questions!!!!! </li></ul>
  3. 3. Schedule <ul><li>Day I </li></ul><ul><ul><li>Concepts </li></ul></ul><ul><ul><li>Indexing </li></ul></ul><ul><ul><li>Searching </li></ul></ul><ul><ul><li>Analysis </li></ul></ul><ul><ul><li>Lucene contrib: highlighter, spell checking, etc. </li></ul></ul><ul><li>Day II </li></ul><ul><ul><li>In-depth Indexing/Searching </li></ul></ul><ul><ul><ul><li>Performance, Internals </li></ul></ul></ul><ul><ul><li>Terms and Term Vectors </li></ul></ul><ul><ul><li>Class Project </li></ul></ul><ul><ul><li>Q & A </li></ul></ul>
  4. 4. Resources <ul><li>Slides at </li></ul><ul><ul><li>http://www.lucenebootcamp.com/boot-camp-slides/ </li></ul></ul><ul><li>Lucene Java </li></ul><ul><ul><li>http://lucene.apache.org/java </li></ul></ul><ul><ul><li>http://lucene.apache.org/java/2_4_0/ </li></ul></ul><ul><ul><li>http://lucene.apache.org/java/2_4_0/api/index.html </li></ul></ul><ul><li>Luke: </li></ul><ul><ul><li>http://www.getopt.org/luke </li></ul></ul>
  5. 5. What is Search? <ul><li>Given a user’s information need (query), find documents relevant to the need </li></ul><ul><ul><li>Very Subjective! </li></ul></ul><ul><li>Information Retrieval </li></ul><ul><ul><li>Interdisciplinary </li></ul></ul><ul><ul><li>Comp. Sci, Math/Statistics, Library Sci., Linguistics, AI… </li></ul></ul>
  6. 6. Search Use Cases <ul><li>Web </li></ul><ul><ul><li>Google, Y!, etc. </li></ul></ul><ul><li>Enterprise </li></ul><ul><ul><li>Intranet, Content Repositories, email, etc. </li></ul></ul><ul><li>eCommerce/DB/CMS </li></ul><ul><ul><li>Online Stores, websites, etc. </li></ul></ul><ul><li>Other </li></ul><ul><ul><li>QA, Federated </li></ul></ul><ul><li>Yours? Why do you need Search? </li></ul>
  7. 7. Your Content And You <ul><li>Only you know your content! </li></ul><ul><ul><li>Key Features </li></ul></ul><ul><ul><ul><li>Title, body, price, margin, etc. </li></ul></ul></ul><ul><ul><li>Important Terms </li></ul></ul><ul><ul><li>Synonyms/Jargon </li></ul></ul><ul><ul><li>Structures (tables, lists, etc.) </li></ul></ul><ul><ul><li>Importance </li></ul></ul><ul><ul><li>Priorities </li></ul></ul>
  8. 8. Search Basics <ul><li>Many different Models: </li></ul><ul><ul><li>Boolean, Probabilistic, Inference, Neural Net, and: </li></ul></ul><ul><li>Modified Vector Space Model (VSM) </li></ul><ul><ul><li>Boolean + VSM </li></ul></ul><ul><ul><li>TF-IDF </li></ul></ul><ul><ul><li>The words in the document and the query each define a Vector in an n-dimensional space </li></ul></ul><ul><ul><li>$\mathrm{Sim}(q_1, d_1) = \cos\Theta$ </li></ul></ul><ul><ul><li>$d_j = \langle w_{1,j}, w_{2,j}, \ldots, w_{n,j} \rangle$ </li></ul></ul><ul><ul><li>$q = \langle w_{1,q}, w_{2,q}, \ldots, w_{n,q} \rangle$ </li></ul></ul><ul><ul><li>$w$ = weight assigned to term </li></ul></ul><ul><ul><li>[Figure: vectors $q_1$ and $d_1$ separated by the angle $\Theta$] </li></ul></ul>
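The cosine measure on the slide above can be sketched in a few lines of plain Java. This is only an illustration of the vector space idea: the class name `CosineSketch`, the sparse-map representation and the weights in `main` are assumptions made for the example, and Lucene's actual scoring (a modified VSM) differs in its details.

```java
import java.util.HashMap;
import java.util.Map;

class CosineSketch {
    // Dot product over the terms that both sparse vectors contain.
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                sum += e.getValue() * w;
            }
        }
        return sum;
    }

    // Euclidean length of a sparse term-weight vector.
    static double norm(Map<String, Double> v) {
        return Math.sqrt(dot(v, v));
    }

    // cos(theta) between a query vector and a document vector; 0 if either is empty.
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double denom = norm(q) * norm(d);
        return denom == 0.0 ? 0.0 : dot(q, d) / denom;
    }

    public static void main(String[] args) {
        // Made-up term weights, standing in for TF-IDF values.
        Map<String, Double> q = new HashMap<>();
        q.put("stanley", 1.0);
        q.put("cup", 1.0);
        Map<String, Double> d = new HashMap<>();
        d.put("stanley", 2.0);
        d.put("cup", 1.0);
        d.put("hockey", 3.0);
        System.out.println(cosine(q, d));
    }
}
```

A document that shares no terms with the query scores 0, and a document whose weights are proportional to the query's scores close to 1.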
  9. 9. Inverted Index From “Taming Text”
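Since the figure for this slide is not reproduced here, a minimal sketch of the data structure may help: an inverted index maps each term to the documents that contain it. The class `InvertedIndexSketch` and its naive tokenization are hypothetical and only illustrate the concept; Lucene's on-disk format stores much more (frequencies, positions, and so on).

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Naive analysis for the example: lower-case, split on non-word characters.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, k -> new TreeSet<Integer>()).add(docId);
        }
    }

    // Returns the ids of all documents containing the term (empty if none).
    SortedSet<Integer> lookup(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new TreeSet<Integer>() : hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Hockey: the Stanley Cup");
        index.add(2, "A cup of coffee");
        System.out.println(index.lookup("cup"));
    }
}
```

Lookup cost depends on the number of distinct terms and the length of the postings list, not on the total amount of text, which is why search engines index this way.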
  10. 10. Lucene Background <ul><li>Created by Doug Cutting in 1999 </li></ul><ul><li>Donated to ASF in 2001 </li></ul><ul><li>Morphed into a Top Level Project (TLP) with many sub projects </li></ul><ul><ul><li>Java (flagship) a.k.a. “Lucene” </li></ul></ul><ul><ul><li>Solr, Nutch, Mahout, Tika, several Lucene ports </li></ul></ul><ul><li>From here on out, Lucene refers to “Lucene Java” </li></ul>
  11. 11. Lucene is… <ul><li>NOT a crawler </li></ul><ul><ul><li>See Nutch </li></ul></ul><ul><li>NOT an application </li></ul><ul><ul><li>See PoweredBy on the Wiki </li></ul></ul><ul><li>NOT a library for doing Google PageRank or other link analysis algorithms </li></ul><ul><ul><li>See Nutch </li></ul></ul><ul><li>A library for enabling text based search </li></ul>
  12. 12. A Few Words about Solr <ul><li>HTTP-based Search Server </li></ul><ul><li>XML Configuration </li></ul><ul><li>XML, JSON, Ruby, PHP, Java support </li></ul><ul><li>Many, many nice features that Lucene users need </li></ul><ul><ul><li>Faceting, spell checking, highlighting </li></ul></ul><ul><ul><li>Caching, Replication, Distributed </li></ul></ul><ul><li>http://lucene.apache.org/solr </li></ul>
  13. 13. Indexing <ul><li>Process of preparing and adding text to Lucene, which stores it in an inverted index </li></ul><ul><li>Key Point: Lucene only indexes Strings </li></ul><ul><ul><li>What does this mean? </li></ul></ul><ul><ul><ul><li>Lucene doesn’t care about XML, Word, PDF, etc. </li></ul></ul></ul><ul><ul><ul><ul><li>There are many good open source extractors available </li></ul></ul></ul></ul><ul><ul><ul><li>It’s our job to convert whatever file format we have into something Lucene can use </li></ul></ul></ul>
  14. 14. Indexing Classes <ul><li>Analyzer </li></ul><ul><ul><li>Creates tokens using a Tokenizer and filters them through zero or more TokenFilters </li></ul></ul><ul><li>IndexWriter </li></ul><ul><ul><li>Responsible for converting text into internal Lucene format </li></ul></ul><ul><li>Directory </li></ul><ul><ul><li>Where the Index is stored </li></ul></ul><ul><ul><li>RAMDirectory, FSDirectory, others </li></ul></ul>
  15. 15. Indexing Classes <ul><li>Document </li></ul><ul><ul><li>A collection of Fields </li></ul></ul><ul><ul><li>Can be boosted </li></ul></ul><ul><li>Field </li></ul><ul><ul><li>Free text, keywords, dates, etc. </li></ul></ul><ul><ul><li>Defines attributes for storing, indexing </li></ul></ul><ul><ul><li>Can be boosted </li></ul></ul><ul><ul><li>Field Constructors and parameters </li></ul></ul><ul><ul><ul><li>Open up Fieldable and Field in IDE </li></ul></ul></ul>
  16. 16. How to Index <ul><li>Create IndexWriter </li></ul><ul><li>For each input </li></ul><ul><ul><li>Create a Document </li></ul></ul><ul><ul><li>Add Fields to the Document </li></ul></ul><ul><ul><li>Add the Document to the IndexWriter </li></ul></ul><ul><li>Optimize (optional, before closing) </li></ul><ul><li>Close the IndexWriter </li></ul>
  17. 17. Indexing in a Nutshell <ul><li>For each Document </li></ul><ul><ul><li>For each Field to be tokenized </li></ul></ul><ul><ul><ul><li>Create the tokens using the specified Tokenizer </li></ul></ul></ul><ul><ul><ul><ul><li>Tokens consist of a String, position, type and offset information </li></ul></ul></ul></ul><ul><ul><ul><li>Pass the tokens through the chained TokenFilters where they can be changed or removed </li></ul></ul></ul><ul><ul><ul><li>Add the end result to the inverted index </li></ul></ul></ul><ul><li>Position information can be altered </li></ul><ul><ul><li>Useful when removing words or to prevent phrases from matching </li></ul></ul>
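The tokenize-then-filter flow above, including the effect on position information, can be sketched without any Lucene classes. Everything here is hypothetical (the `AnalysisSketch` class, the `Tok` type, the three-word stopword list); it imitates the shape of a Tokenizer followed by lower-casing and stopword filters rather than reproducing the real API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalysisSketch {
    // Hypothetical stand-in for a token: term text plus its position in the field.
    static final class Tok {
        final String term;
        final int position;

        Tok(String term, int position) {
            this.term = term;
            this.position = position;
        }

        @Override
        public String toString() {
            return term + "@" + position;
        }
    }

    // Tiny stopword list chosen for the example only.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    static List<Tok> analyze(String text) {
        List<Tok> out = new ArrayList<>();
        int position = -1;
        // "Tokenizer": split on whitespace.
        for (String raw : text.split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            position++; // every raw token advances the position, even if dropped later
            // "Filter" 1: lower-case the term.
            String term = raw.toLowerCase(Locale.ROOT);
            // "Filter" 2: drop stopwords but keep the gap they leave behind.
            if (STOPWORDS.contains(term)) {
                continue;
            }
            out.add(new Tok(term, position));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Cup of Stanley"));
    }
}
```

Because the removed words still consume a position, `cup` and `stanley` end up two positions apart, so an exact phrase query for "cup stanley" would not match. That is the kind of control over positions the last bullet refers to.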
  18. 18. Task 1.a <ul><li>From the Boot Camp Files, use the basic.ReutersIndexer skeleton to start </li></ul><ul><li>Index the small Reuters Collection using the IndexWriter, a Directory and StandardAnalyzer </li></ul><ul><ul><li>Boost every 10th document by a factor of 3 </li></ul></ul><ul><li>Questions to Answer: </li></ul><ul><ul><li>What Fields should I define? </li></ul></ul><ul><ul><li>What attributes should each Field have? </li></ul></ul><ul><ul><li>Pick a field to boost and give a reason why you think it should be boosted </li></ul></ul><ul><li>~30 minutes </li></ul>
  19. 19. Use Luke
  20. 20. 5 minute Break
  21. 21. Searching <ul><li>Parse user query </li></ul><ul><li>Lookup matching Documents </li></ul><ul><li>Score Documents </li></ul><ul><li>Return ranked list </li></ul>
  22. 22. Key Classes: <ul><li>Searcher </li></ul><ul><ul><li>Provides methods for searching </li></ul></ul><ul><ul><li>Take a moment to look at the Searcher class declaration </li></ul></ul><ul><ul><li>IndexSearcher, MultiSearcher, ParallelMultiSearcher </li></ul></ul><ul><li>IndexReader </li></ul><ul><ul><li>Loads a snapshot of the index into memory for searching </li></ul></ul><ul><ul><li>More tomorrow </li></ul></ul><ul><li>TopDocs - The search results </li></ul><ul><li>QueryParser </li></ul><ul><ul><li>http://lucene.apache.org/java/docs/queryparsersyntax.html </li></ul></ul><ul><li>Query </li></ul><ul><ul><li>Logical representation of program’s information need </li></ul></ul>
  23. 23. Query Parsing <ul><li>Basic syntax: </li></ul><ul><ul><li>title:hockey +(body:stanley AND body:cup) </li></ul></ul><ul><li>OR/AND must be uppercase </li></ul><ul><li>Default operator is OR (can be changed) </li></ul><ul><li>Supports fairly advanced syntax, see the website </li></ul><ul><ul><li>http://lucene.apache.org/java/docs/queryparsersyntax.html </li></ul></ul><ul><li>Doesn’t always play nice, so beware </li></ul><ul><ul><li>Many applications construct queries programmatically or restrict syntax </li></ul></ul>
  24. 24. How to Search <ul><li>Create/Get an IndexSearcher </li></ul><ul><li>Create a Query </li></ul><ul><ul><li>Use a QueryParser </li></ul></ul><ul><ul><li>Construct it programmatically </li></ul></ul><ul><li>Display the results from the TopDocs </li></ul><ul><ul><li>Retrieve Field values from Document </li></ul></ul><ul><li>More tomorrow on search lifecycle </li></ul>
  25. 25. Task 1.b <ul><li>Using the ReutersIndexerTest.java skeleton in the boot camp files </li></ul><ul><ul><li>Search your newly created index using queries you develop </li></ul></ul><ul><li>Questions: </li></ul><ul><ul><li>What is the default field for the QueryParser? </li></ul></ul><ul><ul><li>What Analyzer to use? </li></ul></ul><ul><li>~20 minutes </li></ul>
  26. 26. Task 1 Results <ul><li>Scores across queries are NOT comparable </li></ul><ul><ul><li>They may not even be comparable for the same query over time (if the index changes) </li></ul></ul><ul><li>Performance </li></ul><ul><ul><li>Caching </li></ul></ul><ul><ul><li>Warming </li></ul></ul><ul><ul><li>More Tomorrow </li></ul></ul>
  27. 27. Lunch 1-2:30
  28. 28. Discussion/Questions <ul><li>So far, we’ve seen the basics of search and indexing </li></ul><ul><li>Next, we are going to look into Analysis and Contrib modules </li></ul>
  29. 29. Analysis <ul><li>Analysis is the process of creating Tokens to be indexed </li></ul><ul><li>Analysis is usually done to improve results overall, but it comes with a price </li></ul><ul><li>Lucene comes with many different Analyzers, Tokenizers and TokenFilters, each with their own goals </li></ul><ul><li>StandardAnalyzer is included with the core JAR and does a good job for most English and Latin-based tasks </li></ul><ul><li>Often you want the same content analyzed in different ways </li></ul><ul><li>Consider a catch-all Field in addition to other Fields </li></ul>
  30. 30. Solr’s Analysis tool <ul><li>If you use nothing else from Solr, the Admin analysis tool can really help you understand analysis </li></ul><ul><li>Download Solr and unpack it </li></ul><ul><li>cd apache-solr-1.3.0/example </li></ul><ul><li>java -jar start.jar </li></ul><ul><li>http://localhost:8983/solr/admin/analysis.jsp </li></ul>
  31. 31. Analyzers <ul><li>StandardAnalyzer, WhitespaceAnalyzer, SimpleAnalyzer </li></ul><ul><li>Contrib/analysis </li></ul><ul><ul><li>Suite of Analyzers for many common situations </li></ul></ul><ul><ul><ul><li>Languages </li></ul></ul></ul><ul><ul><ul><li>n-grams </li></ul></ul></ul><ul><ul><ul><li>Payloads </li></ul></ul></ul><ul><li>Contrib/snowball </li></ul>
  32. 32. Tokenization <ul><li>Split words into Tokens to be processed </li></ul><ul><li>Tokenization is fairly straightforward for most languages that use a space for word segmentation </li></ul><ul><ul><li>More difficult for some East Asian languages </li></ul></ul><ul><ul><li>See the CJK Analyzer </li></ul></ul>
  33. 33. Modifying Tokens <ul><li>TokenFilters are used to alter the token stream to be indexed </li></ul><ul><li>Common tasks: </li></ul><ul><ul><li>Remove stopwords </li></ul></ul><ul><ul><li>Lower case </li></ul></ul><ul><ul><li>Stem/Normalize, e.g. Wi-Fi -> Wi Fi </li></ul></ul><ul><ul><li>Add Synonyms </li></ul></ul><ul><li>StandardAnalyzer does things that you may not want </li></ul>
  34. 34. Payloads <ul><li>Associate an arbitrary byte array with a term in the index </li></ul><ul><li>Uses </li></ul><ul><ul><li>Part of Speech </li></ul></ul><ul><ul><li>Font weight </li></ul></ul><ul><ul><li>URL </li></ul></ul><ul><li>Currently can search using the BoostingTermQuery </li></ul>
  35. 35. n-grams <ul><li>Combine units of content together into a single token </li></ul><ul><li>Character </li></ul><ul><ul><li>2-grams for the word “Lucene”: </li></ul></ul><ul><ul><ul><li>Lu, uc, ce, en, ne </li></ul></ul></ul><ul><ul><li>Can make search possible when data is noisy or hard to tokenize </li></ul></ul><ul><li>Word (“shingles” in Lucene parlance) </li></ul><ul><ul><li>Pseudo Phrases </li></ul></ul>
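Both kinds of n-gram on this slide are sliding windows, over characters in one case and over words in the other. A minimal sketch, with the hypothetical class `NGramSketch` standing in for the contrib n-gram and shingle filters (whose real APIs are not shown here):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NGramSketch {
    // Character n-grams: every substring of length n, left to right.
    static List<String> charNGrams(String word, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Word n-grams ("shingles"): every run of n adjacent words joined by a space.
    static List<String> shingles(List<String> words, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= words.size(); i++) {
            out.add(String.join(" ", words.subList(i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(charNGrams("Lucene", 2)); // [Lu, uc, ce, en, ne]
        System.out.println(shingles(Arrays.asList("stanley", "cup", "final"), 2));
    }
}
```

Indexing shingles as single tokens is what makes them behave like "pseudo phrases": a two-word shingle matches with one term lookup instead of a positional phrase check.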
  36. 36. Custom Analyzers <ul><li>Problem: none of the Analyzers cover my problem </li></ul><ul><li>Solution: write your own Analyzer </li></ul><ul><li>Better solution: write a configurable Analyzer so you only need one Analyzer that you can easily change for your projects </li></ul><ul><ul><li>See Solr </li></ul></ul>
  37. 37. Analysis APIs <ul><li>Have a look at the TokenStream and Token APIs </li></ul><ul><li>Tokens and TokenStreams may be reused </li></ul><ul><ul><li>Helps reduce allocations and speeds up indexing </li></ul></ul><ul><ul><li>Not all Analysis can take advantage of reuse, e.g. when tokens are cached </li></ul></ul><ul><ul><li>Analyzer.reusableTokenStream() </li></ul></ul><ul><ul><li>TokenStream.next(Token) </li></ul></ul>
  38. 38. Special Cases <ul><li>Dates and numbers need special treatment to be searchable </li></ul><ul><ul><li>o.a.l.document.DateTools </li></ul></ul><ul><ul><li>org.apache.solr.util.NumberUtils </li></ul></ul><ul><li>Altering Position Information </li></ul><ul><ul><li>Increase Position Gap between sentences to prevent phrases from crossing sentence boundaries </li></ul></ul><ul><ul><li>Index synonyms at the same position so query can match regardless of synonym used </li></ul></ul>
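Dates and numbers need special treatment because index terms are compared as strings, so "9" sorts after "10". One common workaround, sketched below under the assumption of non-negative `int` values and day-resolution dates, is to encode values so that string order matches natural order. The class `SortableTermsSketch` is hypothetical and is not the encoding that `DateTools` or `NumberUtils` actually use.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class SortableTermsSketch {
    // Zero-pad to 10 digits so that string order equals numeric order.
    // Restricted to non-negative ints to keep the example simple.
    static String padNumber(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("non-negative values only");
        }
        return String.format(Locale.ROOT, "%010d", value);
    }

    // yyyyMMdd: most significant unit first, so string order is chronological.
    static String dateTerm(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        System.out.println("9".compareTo("10") > 0);                 // raw strings sort the wrong way
        System.out.println(padNumber(9).compareTo(padNumber(10)) < 0); // padded strings sort correctly
        System.out.println(dateTerm(LocalDate.of(2008, 11, 3)));
    }
}
```

With terms encoded this way, a range query over the string terms returns the same documents a numeric or chronological range would.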
  39. 39. Task 2 <ul><li>Take 15-20 minutes and write an Analyzer/Tokenizer/TokenFilter and Unit Test </li></ul><ul><ul><li>Examine results in Luke </li></ul></ul><ul><ul><li>Run some searches </li></ul></ul><ul><li>Ideas: </li></ul><ul><ul><li>Combine existing Tokenizers and TokenFilters </li></ul></ul><ul><ul><li>Normalize abbreviations </li></ul></ul><ul><ul><li>Add payloads </li></ul></ul><ul><ul><li>Filter out all words beginning with the letter A </li></ul></ul><ul><ul><li>Identify/Mark sentences </li></ul></ul>
  40. 40. Discussion <ul><li>What did you implement? </li></ul><ul><li>What issues do you face with your content? </li></ul><ul><li>To Stem or not to Stem? </li></ul><ul><li>Stopwords: good or bad? </li></ul><ul><li>Tradeoffs of different techniques </li></ul>
  41. 41. Lucene Contributions <ul><li>Many people have generously contributed code to help solve common problems </li></ul><ul><li>These are in the contrib directory of the source </li></ul><ul><li>Popular: </li></ul><ul><ul><li>Analyzers </li></ul></ul><ul><ul><li>Highlighter </li></ul></ul><ul><ul><li>Queries and MoreLikeThis </li></ul></ul><ul><ul><li>Snowball Stemmers </li></ul></ul><ul><ul><li>Spellchecker </li></ul></ul>
  42. 42. Highlighter <ul><li>Highlight query keywords in context </li></ul><ul><ul><li>Often useful for display purposes </li></ul></ul><ul><li>Important Classes: </li></ul><ul><ul><li>Highlighter - Main entry point, coordinates the work </li></ul></ul><ul><ul><li>Fragmenter - Splits up document for scoring </li></ul></ul><ul><ul><li>Formatter - Marks up the results </li></ul></ul><ul><ul><li>Scorer - Scores the fragments </li></ul></ul><ul><ul><ul><li>SpanScorer - Can score phrases </li></ul></ul></ul><ul><li>Use term vectors for performance </li></ul><ul><li>Look at example usage </li></ul>
  43. 43. Spell Checking <ul><li>Suggest spelling corrections based on spellings of words in the index </li></ul><ul><ul><li>Will/can suggest incorrectly spelled words </li></ul></ul><ul><li>Uses a distance measure to determine suggestions </li></ul><ul><ul><li>Can also factor in document frequency </li></ul></ul><ul><ul><li>Distance Measure is pluggable </li></ul></ul>
  44. 44. Spell Checking <ul><li>Classes: SpellChecker, StringDistance </li></ul><ul><li>See ContribExamplesTest </li></ul><ul><li>Practical aspects: </li></ul><ul><ul><li>It’s not as simple as just turning it on </li></ul></ul><ul><ul><li>Good results require testing and tuning </li></ul></ul><ul><ul><ul><li>Pay attention to accuracy settings </li></ul></ul></ul><ul><ul><ul><li>Mind your Analysis (simple, no stemming) </li></ul></ul></ul><ul><ul><ul><li>Consider alternate StringDistance (JaroWinklerDistance) </li></ul></ul></ul>
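To make the "distance measure" idea concrete, here is a sketch that ranks candidate terms by Levenshtein edit distance. It is an assumption-laden illustration: the class `SpellSketch` and its brute-force scan over all terms are invented for the example, and the contrib spell checker is organized differently (it looks up candidates in a separate index before applying a pluggable distance).

```java
import java.util.Arrays;
import java.util.List;

class SpellSketch {
    // Classic Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building b's prefix from an empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // cost of deleting a's first i characters
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the index term closest to the input, or null if there are no terms.
    static String suggest(String misspelled, List<String> indexTerms) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String term : indexTerms) {
            int d = editDistance(misspelled, term);
            if (d < bestDistance) {
                bestDistance = d;
                best = term;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(suggest("lucnee", Arrays.asList("lucene", "lucid", "nutch")));
    }
}
```

Because suggestions come from whatever terms the index holds, a misspelling that was itself indexed can be suggested, which is the caveat noted on the previous slide. Stemmed terms would make poor suggestions, hence the advice to keep analysis simple.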
  45. 45. More Like This <ul><li>Given a Document, find other Documents that are similar </li></ul><ul><ul><li>Variation on relevance feedback </li></ul></ul><ul><ul><li>“Find Similar” </li></ul></ul><ul><li>Extracts the most important terms from a Document and creates a new query </li></ul><ul><ul><li>Many options available for determining important terms </li></ul></ul><ul><li>Classes: MoreLikeThis </li></ul><ul><ul><li>See ContribExamplesTest </li></ul></ul>
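The "extract the most important terms" step can be sketched as ranking a document's terms by a TF-IDF style weight and keeping the top few as the new query. This is only one plausible reading: the class `MoreLikeThisSketch`, the weighting formula and the document-frequency table are assumptions for the example, and the real `MoreLikeThis` class exposes many more options.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MoreLikeThisSketch {
    // Picks the k terms with the highest tf * log(numDocs / df) weight.
    // docFreq is assumed to hold, per term, the number of documents containing it.
    static List<String> topTerms(List<String> docTokens, Map<String, Integer> docFreq,
                                 int numDocs, int k) {
        // Term frequency within this one document.
        Map<String, Integer> tf = new HashMap<>();
        for (String token : docTokens) {
            tf.merge(token, 1, Integer::sum);
        }
        // Weight each distinct term; unknown terms are treated as appearing in one document.
        final Map<String, Double> weight = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 1);
            weight.put(e.getKey(), e.getValue() * Math.log((double) numDocs / df));
        }
        // Highest weight first; ties broken alphabetically for a stable order.
        List<String> terms = new ArrayList<>(tf.keySet());
        terms.sort((x, y) -> {
            int c = Double.compare(weight.get(y), weight.get(x));
            return c != 0 ? c : x.compareTo(y);
        });
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }
}
```

The returned terms would then be OR-ed together into a query and run against the index, so documents sharing the rare, frequent-in-this-document terms rank highest.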
  46. 46. Summary <ul><li>Indexing </li></ul><ul><li>Searching </li></ul><ul><li>Analysis </li></ul><ul><li>Contrib </li></ul><ul><li>Questions? </li></ul>
  47. 47. Resources <ul><li>http://lucene.apache.org/ </li></ul><ul><li>http://en.wikipedia.org/wiki/Vector_space_model </li></ul><ul><li>Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto </li></ul><ul><li>Lucene In Action by Hatcher and Gospodnetić </li></ul><ul><li>Wiki </li></ul><ul><li>Mailing Lists </li></ul><ul><ul><li>[email_address] </li></ul></ul><ul><ul><ul><li>Discussions on how to use Lucene </li></ul></ul></ul><ul><ul><li>[email_address] </li></ul></ul><ul><ul><ul><li>Discussions on how to develop Lucene </li></ul></ul></ul><ul><li>Issue Tracking </li></ul><ul><ul><li>https://issues.apache.org/jira/secure/Dashboard.jspa </li></ul></ul><ul><li>We always welcome patches </li></ul><ul><ul><li>Ask on the mailing list before reporting a bug </li></ul></ul>
  48. 48. Resources <ul><li>[email_address] </li></ul><ul><li>Lucid Imagination </li></ul><ul><ul><li>Support </li></ul></ul><ul><ul><li>Training </li></ul></ul><ul><ul><li>Value Add </li></ul></ul><ul><ul><li>[email_address] </li></ul></ul>
