Lucene Boot Camp I
Grant Ingersoll, Lucid Imagination
Nov. 3, 2008, New Orleans, LA
Intro
- My background
- Goals for the tutorial:
  - Understand Lucene's core capabilities
  - Real examples, real code, real data
- Ask questions!
Schedule
- Day I: Concepts
  - Indexing
  - Searching
  - Analysis
  - Lucene contrib: highlighter, spell checking, etc.
- Day II
  - In-depth indexing/searching
  - Performance, internals
  - Terms and term vectors
  - Class project
  - Q & A
Resources
- Slides: http://www.lucenebootcamp.com/boot-camp-slides/
- Lucene Java:
  - http://lucene.apache.org/java
  - http://lucene.apache.org/java/2_4_0/
  - http://lucene.apache.org/java/2_4_0/api/index.html
- Luke: http://www.getopt.org/luke
What is Search?
- Given a user's information need (the query), find documents relevant to that need
- Relevance is very subjective!
- Information Retrieval is interdisciplinary: computer science, math/statistics, library science, linguistics, AI…
Search Use Cases
- Web: Google, Y!, etc.
- Enterprise: intranet, content repositories, email, etc.
- eCommerce/DB/CMS: online stores, websites, etc.
- Other: QA, federated search
- Yours? Why do you need search?
Your Content and You
- Only you know your content!
- Key features: title, body, price, margin, etc.
- Important terms
- Synonyms/jargon
- Structures (tables, lists, etc.)
- Importance
- Priorities
Search Basics
- Many different models: Boolean, probabilistic, inference, neural net, and the one Lucene uses:
- Modified Vector Space Model (VSM)
  - Boolean + VSM
  - TF-IDF weighting
- The words in the document and the query each define a vector in an n-dimensional space:
  - $d_j = \langle w_{1,j}, w_{2,j}, \ldots, w_{n,j} \rangle$
  - $q = \langle w_{1,q}, w_{2,q}, \ldots, w_{n,q} \rangle$
  - where $w$ is the weight assigned to each term
- Similarity is the cosine of the angle $\Theta$ between the two vectors: $\mathrm{sim}(q, d_j) = \cos \Theta$
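Written out (this is the standard VSM cosine similarity, not a Lucene-specific formula), the similarity is the normalized dot product of the two weight vectors:

$$
\mathrm{sim}(q, d_j) = \cos\Theta
  = \frac{\sum_{i=1}^{n} w_{i,q}\, w_{i,j}}
         {\sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}}
$$

With TF-IDF weighting, each $w_{i,j}$ grows with how often term $i$ occurs in document $j$ and shrinks with how many documents in the collection contain the term.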
Inverted Index
(Figure from "Taming Text": an inverted index maps each term to the documents that contain it.)
Lucene Background
- Created by Doug Cutting in 1999
- Donated to the ASF in 2001
- Morphed into a Top Level Project (TLP) with many subprojects:
  - Java (the flagship), a.k.a. "Lucene"
  - Solr, Nutch, Mahout, Tika, several Lucene ports
- From here on out, "Lucene" refers to Lucene Java
Lucene is…
- NOT a crawler (see Nutch)
- NOT an application (see PoweredBy on the Wiki)
- NOT a library for doing Google PageRank or other link analysis algorithms (see Nutch)
- A library for enabling text-based search
A Few Words about Solr
- HTTP-based search server
- XML configuration
- XML, JSON, Ruby, PHP, Java support
- Many, many nice features that Lucene users need:
  - Faceting, spell checking, highlighting
  - Caching, replication, distributed search
- http://lucene.apache.org/solr
Indexing
- The process of preparing and adding text to Lucene, which stores it in an inverted index
- Key point: Lucene only indexes Strings
- What does this mean?
  - Lucene doesn't care about XML, Word, PDF, etc.
  - There are many good open source extractors available
  - It's our job to convert whatever file format we have into something Lucene can use
Indexing Classes
- Analyzer
  - Creates tokens using a Tokenizer and filters them through zero or more TokenFilters
- IndexWriter
  - Responsible for converting text into the internal Lucene format
- Directory
  - Where the index is stored
  - RAMDirectory, FSDirectory, others
Indexing Classes
- Document
  - A collection of Fields
  - Can be boosted
- Field
  - Free text, keywords, dates, etc.
  - Defines attributes for storing and indexing
  - Can be boosted
- Field constructors and parameters
  - Open up Fieldable and Field in your IDE
How to Index
1. Create an IndexWriter
2. For each input:
   - Create a Document
   - Add Fields to the Document
   - Add the Document to the IndexWriter
3. Close the IndexWriter
4. Optimize (optional)
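A minimal sketch of those steps against the Lucene 2.4 API; the index path and field values here are made up for illustration:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SimpleIndexer {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.getDirectory(new File("/tmp/index")); // hypothetical location
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
        true /* create a new index */, IndexWriter.MaxFieldLength.UNLIMITED);

    Document doc = new Document();
    // Stored and analyzed: retrievable at search time, and searchable
    doc.add(new Field("title", "Oil prices rise", Field.Store.YES, Field.Index.ANALYZED));
    // Analyzed but not stored: searchable only
    doc.add(new Field("body", "Crude futures climbed again today.",
        Field.Store.NO, Field.Index.ANALYZED));
    doc.setBoost(3.0f); // boost the whole Document (individual Fields can be boosted too)

    writer.addDocument(doc);
    writer.optimize(); // optional
    writer.close();
  }
}
```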
Indexing in a Nutshell
- For each Document:
  - For each Field to be tokenized:
    - Create the tokens using the specified Tokenizer
      - Tokens consist of a String plus position, type and offset information
    - Pass the tokens through the chained TokenFilters, where they can be changed or removed
    - Add the end result to the inverted index
- Position information can be altered
  - Useful when removing words, or to prevent phrases from matching
Task 1.a
- From the Boot Camp files, use the basic.ReutersIndexer skeleton to start
- Index the small Reuters collection using the IndexWriter, a Directory and StandardAnalyzer
- Boost every 10th document by 3
- Questions to answer:
  - What Fields should I define?
  - What attributes should each Field have?
  - Pick a field to boost and give a reason why you think it should be boosted
- ~30 minutes
Use Luke
5 minute Break
Searching
1. Parse the user query
2. Look up matching Documents
3. Score the Documents
4. Return the ranked list
Key Classes
- Searcher
  - Provides methods for searching
  - Take a moment to look at the Searcher class declaration
  - IndexSearcher, MultiSearcher, ParallelMultiSearcher
- IndexReader
  - Loads a snapshot of the index into memory for searching
  - More tomorrow
- TopDocs: the search results
- QueryParser
  - http://lucene.apache.org/java/docs/queryparsersyntax.html
- Query
  - Logical representation of the program's information need
Query Parsing
- Basic syntax: title:hockey +(body:stanley AND body:cup)
- OR/AND must be uppercase
- Default operator is OR (can be changed)
- Supports fairly advanced syntax; see the website:
  - http://lucene.apache.org/java/docs/queryparsersyntax.html
- Doesn't always play nice, so beware
- Many applications construct queries programmatically or restrict the syntax
How to Search
1. Create/get an IndexSearcher
2. Create a Query
   - Use a QueryParser, or
   - Construct it programmatically
3. Display the results from the TopDocs
   - Retrieve Field values from each Document
- More tomorrow on the search lifecycle
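Putting those steps together, a minimal sketch against the Lucene 2.4 API, reusing the hypothetical index from the indexing example:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SimpleSearcher {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.getDirectory(new File("/tmp/index")); // hypothetical location
    IndexSearcher searcher = new IndexSearcher(dir);

    // Use the same Analyzer at query time that was used at indexing time
    QueryParser parser = new QueryParser("body", new StandardAnalyzer());
    Query query = parser.parse("title:hockey +(body:stanley AND body:cup)");

    TopDocs hits = searcher.search(query, null, 10); // no Filter, top 10 results
    for (ScoreDoc sd : hits.scoreDocs) {
      Document doc = searcher.doc(sd.doc); // loads the stored Fields
      System.out.println(sd.score + ": " + doc.get("title"));
    }
    searcher.close();
  }
}
```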
Task 1.b
- Using the ReutersIndexerTest.java skeleton in the Boot Camp files, search your newly created index using queries you develop
- Questions:
  - What is the default field for the QueryParser?
  - What Analyzer should you use?
- ~20 minutes
Task 1 Results
- Scores across queries are NOT comparable
  - They may not even be comparable for the same query over time (if the index changes)
- Performance
  - Caching
  - Warming
  - More tomorrow
Lunch 1-2:30
Discussion/Questions
- So far, we've seen the basics of search and indexing
- Next we'll look into analysis and the contrib modules
Analysis
- Analysis is the process of creating the Tokens to be indexed
- Analysis is usually done to improve results overall, but it comes with a price
- Lucene comes with many different Analyzers, Tokenizers and TokenFilters, each with their own goals
- StandardAnalyzer is included with the core JAR and does a good job for most English and Latin-based tasks
- Often you want the same content analyzed in different ways
  - Consider a catch-all Field in addition to the other Fields
Solr's Analysis Tool
- If you use nothing else from Solr, the Admin analysis tool can really help you understand analysis
- Download Solr and unpack it, then:
  - cd apache-solr-1.3.0/example
  - java -jar start.jar
  - Browse to http://localhost:8983/solr/admin/analysis.jsp
Analyzers
- StandardAnalyzer, WhitespaceAnalyzer, SimpleAnalyzer
- contrib/analyzers
  - A suite of Analyzers for many common situations:
    - Languages
    - n-grams
    - Payloads
- contrib/snowball
Tokenization
- Splitting words into the Tokens to be processed
- Fairly straightforward for most languages that use a space for word segmentation
- More difficult for some East Asian languages
  - See the CJKAnalyzer
Modifying Tokens
- TokenFilters are used to alter the token stream before it is indexed
- Common tasks:
  - Remove stopwords
  - Lowercase
  - Stem/normalize, e.g. Wi-Fi -> Wi Fi
  - Add synonyms
- StandardAnalyzer does things that you may not want
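A toy TokenFilter, sketched against the 2.4-era TokenStream API, where filters override next(Token). The class name and the drop-words-starting-with-'a' behavior are contrived for illustration (it previews one of the Task 2 ideas later on):

```java
import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Contrived example: drop every token that starts with the letter 'a'.
public class DropAWordsFilter extends TokenFilter {

  public DropAWordsFilter(TokenStream input) {
    super(input);
  }

  public Token next(final Token reusableToken) throws IOException {
    for (Token token = input.next(reusableToken); token != null;
         token = input.next(reusableToken)) {
      if (token.termLength() == 0 || token.termBuffer()[0] != 'a') {
        return token; // keep this token
      }
      // Otherwise skip it and pull the next one. A production filter would
      // also bump the next token's position increment to record the hole.
    }
    return null; // end of stream
  }
}
```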
Payloads
- Associate an arbitrary byte array with a term in the index
- Uses:
  - Part of speech
  - Font weight
  - URL
- Currently searchable via the BoostingTermQuery
n-grams
- Combine units of content together into a single token
- Character 2-grams for the word "Lucene": Lu, uc, ce, en, ne
- Can make search possible when data is noisy or hard to tokenize
- Word n-grams ("shingles" in Lucene parlance)
  - Pseudo-phrases
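A quick way to see character n-grams, assuming the contrib NGramTokenizer (in contrib/analyzers); setting min and max gram to 2 reproduces the 2-grams above:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.ngram.NGramTokenizer;

public class NGramDemo {
  public static void main(String[] args) throws Exception {
    // min gram = 2, max gram = 2: emit exactly the character 2-grams
    NGramTokenizer tokenizer = new NGramTokenizer(new StringReader("lucene"), 2, 2);
    final Token reusable = new Token();
    for (Token t = tokenizer.next(reusable); t != null; t = tokenizer.next(reusable)) {
      System.out.println(new String(t.termBuffer(), 0, t.termLength()));
    }
  }
}
```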
Custom Analyzers
- Problem: none of the Analyzers cover my problem
- Solution: write your own Analyzer
- Better solution: write a configurable Analyzer, so you only need one Analyzer that you can easily change between projects
  - See Solr
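Writing your own Analyzer is mostly a matter of chaining a Tokenizer and some TokenFilters. A minimal sketch using classes from core Lucene 2.4:

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Whitespace tokenization, then lowercasing, then English stopword removal.
// Swap in other TokenFilters (stemmers, synonyms, a custom filter) as needed.
public class MyAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new WhitespaceTokenizer(reader);
    stream = new LowerCaseFilter(stream);
    stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
    return stream;
  }
}
```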
Analysis APIs
- Have a look at the TokenStream and Token APIs
- Tokens and TokenStreams may be reused
  - Helps reduce allocations and speeds up indexing
  - Not all analysis can take advantage of this: caching
- Analyzer.reusableTokenStream()
- TokenStream.next(Token)
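Consuming a TokenStream with the reusable-Token API looks like this (a sketch; the field name and text are arbitrary). One Token instance is recycled across calls instead of allocating a new one per token:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class TokenDump {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer();
    TokenStream stream = analyzer.tokenStream("body", new StringReader("The Quick Brown Fox"));
    final Token reusable = new Token();
    for (Token t = stream.next(reusable); t != null; t = stream.next(reusable)) {
      System.out.println(new String(t.termBuffer(), 0, t.termLength())
          + " start=" + t.startOffset() + " end=" + t.endOffset() + " type=" + t.type());
    }
  }
}
```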
Special Cases
- Dates and numbers need special treatment to be searchable
  - o.a.l.document.DateTools
  - org.apache.solr.util.NumberUtils
- Altering position information
  - Increase the position gap between sentences to prevent phrases from crossing sentence boundaries
  - Index synonyms at the same position so the query can match regardless of which synonym is used
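Dates, for example, must become lexicographically ordered Strings before Lucene can range-search them; DateTools does exactly that at a chosen resolution:

```java
import java.util.Date;

import org.apache.lucene.document.DateTools;

public class DateFieldDemo {
  public static void main(String[] args) throws Exception {
    // Day resolution yields e.g. "20081103", which sorts correctly as a String
    String indexed = DateTools.dateToString(new Date(), DateTools.Resolution.DAY);
    System.out.println(indexed);
    // And back again when displaying stored values
    System.out.println(DateTools.stringToDate(indexed));
  }
}
```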
Task 2
- Take 15-20 minutes and write an Analyzer/Tokenizer/TokenFilter and a unit test
- Examine the results in Luke
- Run some searches
- Ideas:
  - Combine existing Tokenizers and TokenFilters
  - Normalize abbreviations
  - Add payloads
  - Filter out all words beginning with the letter A
  - Identify/mark sentences
Discussion
- What did you implement?
- What issues do you face with your content?
- To stem or not to stem?
- Stopwords: good or bad?
- Tradeoffs of different techniques
Lucene Contributions
- Many people have generously contributed code to help solve common problems
- These live in the contrib directory of the source
- Popular:
  - Analyzers
  - Highlighter
  - Queries and MoreLikeThis
  - Snowball stemmers
  - SpellChecker
Highlighter
- Highlights query keywords in context; often useful for display purposes
- Important classes:
  - Highlighter: main entry point, coordinates the work
  - Fragmenter: splits up the document for scoring
  - Formatter: marks up the results
  - Scorer: scores the fragments
    - SpanScorer can score phrases
- Use term vectors for performance
- Look at example usage
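A sketch of typical contrib Highlighter usage against the 2.4 API (the query string, field name and text are stand-ins); the Formatter wraps each hit in <b> tags and the Scorer ranks the fragments:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

public class HighlightDemo {
  public static void main(String[] args) throws Exception {
    Query query = new QueryParser("body", new StandardAnalyzer()).parse("stanley cup");
    String text = "The Stanley Cup is awarded to the NHL playoff champion.";

    Highlighter highlighter = new Highlighter(
        new SimpleHTMLFormatter("<b>", "</b>"), new QueryScorer(query));
    TokenStream tokens = new StandardAnalyzer().tokenStream("body", new StringReader(text));
    // Prints the best-scoring fragment with the matches marked up
    System.out.println(highlighter.getBestFragment(tokens, text));
  }
}
```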
Spell Checking
- Suggests spelling corrections based on the spellings of words in the index
  - Will/can suggest incorrectly spelled words
- Uses a distance measure to determine suggestions
- Can also factor in document frequency
- The distance measure is pluggable
Spell Checking
- Classes: SpellChecker, StringDistance
- See ContribExamplesTest
- Practical aspects:
  - It's not as simple as just turning it on
  - Good results require testing and tuning
  - Pay attention to the accuracy settings
  - Mind your analysis (keep it simple, no stemming)
  - Consider an alternate StringDistance (JaroWinklerDistance)
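A hedged sketch of the contrib SpellChecker: build a spelling index from an existing field's terms, then ask for suggestions. The paths, field name and misspelling below are made up:

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spell.JaroWinklerDistance;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SpellDemo {
  public static void main(String[] args) throws Exception {
    Directory spellDir = FSDirectory.getDirectory(new File("/tmp/spell"));
    SpellChecker spell = new SpellChecker(spellDir);
    spell.setStringDistance(new JaroWinklerDistance()); // swap in an alternate distance

    // Feed it the terms of the "body" field from the main index
    IndexReader reader = IndexReader.open(FSDirectory.getDirectory(new File("/tmp/index")));
    spell.indexDictionary(new LuceneDictionary(reader, "body"));

    String[] suggestions = spell.suggestSimilar("hockye", 5);
    for (String suggestion : suggestions) {
      System.out.println(suggestion);
    }
  }
}
```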
More Like This
- Given a Document, find other Documents that are similar
- A variation on relevance feedback: "find similar"
- Extracts the most important terms from a Document and creates a new query
- Many options available for determining the important terms
- Classes: MoreLikeThis
- See ContribExamplesTest
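A sketch of contrib MoreLikeThis; the field name, thresholds and doc id are illustrative, not prescriptive:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.similar.MoreLikeThis;

public class MoreLikeThisDemo {
  // Build a "find similar" Query from a document already in the index
  public static Query similarTo(IndexReader reader, int docId) throws Exception {
    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(new String[] { "body" }); // fields to mine for important terms
    mlt.setMinTermFreq(2); // ignore terms occurring only once in the source doc
    mlt.setMinDocFreq(5);  // ignore terms too rare in the index
    return mlt.like(docId); // run the result with an IndexSearcher
  }
}
```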
Summary
- Indexing
- Searching
- Analysis
- Contrib
- Questions?
Resources
- http://lucene.apache.org/
- http://en.wikipedia.org/wiki/Vector_space_model
- Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto
- Lucene in Action by Hatcher and Gospodnetić
- Wiki
- Mailing lists:
  - java-user@lucene.apache.org: discussions on how to use Lucene
  - java-dev@lucene.apache.org: discussions on how to develop Lucene
- Issue tracking: https://issues.apache.org/jira/secure/Dashboard.jspa
  - We always welcome patches
  - Ask on the mailing list before reporting a bug
Resources
- [email_address]
- Lucid Imagination
  - Support
  - Training
  - Value add
- [email_address]
