Custom Analyzer in Lucene
Lucene/Solr Meetup
Ganesh.M
http://www.linkedin.com/in/gmurugappan

Custom analyzer using Lucene

Built custom analyzer using Lucene

Published in: Software

  1. Custom Analyzer in Lucene
     Lucene/Solr Meetup
     Ganesh.M
     http://www.linkedin.com/in/gmurugappan
  2. • Introduction to Analyzer
     • Why we require a custom analyzer
     • Use case / scenario
     • Writing a custom analyzer
     • Know your analyzer
  3. • Analyzer: analyzes the given text and returns tokens, using a Tokenizer and one or more TokenFilters.
     • Tokenizer: understands the language and breaks the text into tokens.
       – WhitespaceTokenizer divides text at whitespace
       – LetterTokenizer divides text at non-letters
       – CJKTokenizer is a tokenizer for Chinese, Japanese and Korean text
     • TokenFilter: adds, stems or deletes tokens.
       – StopFilter removes stop words
       – PorterStemFilter transforms each token into its stem
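  4. • Let's take the text “The quick brown fox jumps over lazy dog”.
       Using StandardAnalyzer (with its default English stop words, as in Lucene 3.x and 4.x), it generates the following tokens:
       quick, brown, fox, jumps, over, lazy, dog
       The tokens are lowercased and the stop word “The” is removed.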
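To make the tokenizer and filter roles from the two slides above concrete, here is a minimal sketch in plain Java. It does not use the Lucene API: the class `AnalysisChainSketch`, its methods and the stop word list are hypothetical names chosen for illustration, and real Lucene analyzers stream tokens instead of building lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java model of an analysis chain; NOT the Lucene API.
class AnalysisChainSketch {
    // Hypothetical stop word list for illustration; not Lucene's actual default set.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and"));

    // Tokenizer step: break the text into tokens at whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Filter step 1: transform each token (here: lowercase it).
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Filter step 2: delete tokens that are stop words.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // The "analyzer": a tokenizer followed by a chain of filters.
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(tokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The quick brown fox jumps over lazy dog"));
        // prints [quick, brown, fox, jumps, over, lazy, dog]
    }
}
```

The order of the steps matters: lowercasing runs before the stop word check so that “The” matches the lowercase entry “the”.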
  5. Know your analyzer
     • It is important to choose the best analyzer for each of your fields.
     • If you choose the wrong one, searches may not return the expected results.
     • If you are not getting the results you expect, check your Analyzer and your QueryParser.
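  6. Lucene 3.x: the code below prints the tokens generated by a given analyzer.

         import java.io.StringReader;
         import org.apache.lucene.analysis.Analyzer;
         import org.apache.lucene.analysis.SimpleAnalyzer;
         import org.apache.lucene.analysis.TokenStream;
         import org.apache.lucene.analysis.tokenattributes.TermAttribute;

         Analyzer analyzer = new SimpleAnalyzer();
         TokenStream ts = analyzer.tokenStream("Field", new StringReader("Hello world-2013 "));
         ts.reset();
         while (ts.incrementToken()) {
             // TermAttribute is the Lucene 3.0 API; from 3.1 on, use CharTermAttribute and toString().
             System.out.println("token: " + ts.getAttribute(TermAttribute.class).term());
         }
         ts.close();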
  7. The purpose of a custom analyzer
     • The existing analyzers do not always serve our purpose; sometimes we need to analyze text in a different way.
     • A custom analyzer can reuse the existing built-in filters.
     • It can also be used for parsing queries.
  8. Use case
     • Synonym injection / abbreviation expansion
       – Add synonyms at indexing time.
       – When parsing a resume, add related content for a keyword. If you find the text “lucene/solr”, you could add “information retrieval” and “search engine”.
       – If you are searching medical documents, chat messages, etc., you need to expand the abbreviations / codes at indexing time.
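The synonym injection described above can be sketched as follows. This is plain Java and not the code from the deck: `SynonymInjectionSketch`, its `Token` class and the synonym table are hypothetical. The one Lucene convention it models is that an injected synonym is given a position increment of 0, so it occupies the same position as the original term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of index-time synonym injection; NOT the Lucene API.
class SynonymInjectionSketch {
    // A term plus its distance from the previous token (0 = same position).
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "/+" + positionIncrement;
        }
    }

    // Hypothetical synonym table; each expansion is treated as a single token.
    static final Map<String, List<String>> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("lucene", Arrays.asList("search", "retrieval"));
        SYNONYMS.put("pls", Arrays.asList("please"));
    }

    // Emit every original term, followed by its synonyms at the same position.
    static List<Token> inject(List<String> terms) {
        List<Token> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Token(term, 1));
            List<String> synonyms = SYNONYMS.get(term);
            if (synonyms != null) {
                for (String synonym : synonyms) {
                    out.add(new Token(synonym, 0));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(inject(Arrays.asList("learn", "lucene")));
        // prints [learn/+1, lucene/+1, search/+0, retrieval/+0]
    }
}
```

Keeping the synonyms at the same position means a phrase query such as “learn search” can still match the original text.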
  9. • Strip XML / HTML tags and index only the content

         <Address>
           <Street>123, MG Road</Street>
           <City>Bangalore</City>
           <State>Karnataka</State>
         </Address>
  10. • Break an email ID / URL into multiple tokens
        – Sachin Tendulkar <sachin.tendulkar123@gmail.com>
        – Should be analyzed as
          • sachin
          • tendulkar
          • sachin
          • tendulkar123
          • gmail
          • com
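One way to get the token list above is to split at every character that is neither a letter nor a digit and then lowercase the pieces. The sketch below is plain Java with a hypothetical class name, not a Lucene Tokenizer. Note that Lucene's LetterTokenizer would not give this result, because it divides at non-letters and would drop the digits in “tendulkar123”.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java model of breaking an email ID into tokens; NOT the Lucene API.
class EmailTokenSketch {
    // Split at runs of characters that are not ASCII letters or digits, then lowercase.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Sachin Tendulkar <sachin.tendulkar123@gmail.com>"));
        // prints [sachin, tendulkar, sachin, tendulkar123, gmail, com]
    }
}
```

The same splitting rule applies to URLs, although a real analyzer would probably also want to keep the full address as one extra token so that exact matches still work.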
  11. HTMLAnalyzer in Lucene 4.5

         public class HTMLAnalyzer extends Analyzer {
             // Strip HTML in initReader so the char filter is also applied
             // when the analyzer reuses its TokenStreamComponents.
             @Override
             protected Reader initReader(String fieldName, Reader reader) {
                 return new HTMLStripCharFilter(reader);
             }

             @Override
             protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                 WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_45, reader);
                 TokenStream result = new LowerCaseFilter(Version.LUCENE_45, tokenizer);
                 return new TokenStreamComponents(tokenizer, result);
             }
         }
  12. HTMLAnalyzer in Solr

         <fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
           <analyzer>
             <charFilter class="solr.HTMLStripCharFilterFactory" escapedTags="a, title" />
             <tokenizer class="solr.WhitespaceTokenizerFactory"/>
           </analyzer>
         </fieldType>
  13. SynonymAnalyzer
      • SynonymAnalyzer injects synonyms into the indexed content. It is written against Lucene 3.3.
      • Check out the code: https://github.com/geekganesh/SynonymAnalyzer
  14. PerFieldAnalyzerWrapper
      • IndexWriter / IndexWriterConfig takes only one Analyzer and uses it for all fields.
      • When there are multiple fields and each should be indexed with its own analyzer, use PerFieldAnalyzerWrapper.
      • PerFieldAnalyzerWrapper holds a different analyzer per field. It is itself an Analyzer, so it is passed to IndexWriter in place of a single analyzer.
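The idea behind PerFieldAnalyzerWrapper is a lookup from field name to analyzer, with a default for fields that have no entry. In Lucene the wrapper is built from a default Analyzer and a `Map<String, Analyzer>`. The sketch below only models that dispatch in plain Java: the class `PerFieldSketch`, the field names and the two toy analyzers are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Plain-Java model of per-field analyzer dispatch; NOT the Lucene API.
class PerFieldSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> fieldAnalyzers;

    PerFieldSketch(Function<String, List<String>> defaultAnalyzer,
                   Map<String, Function<String, List<String>>> fieldAnalyzers) {
        this.defaultAnalyzer = defaultAnalyzer;
        this.fieldAnalyzers = fieldAnalyzers;
    }

    // Use the analyzer registered for the field, or fall back to the default.
    List<String> analyze(String field, String text) {
        return fieldAnalyzers.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    // Toy analyzer: lowercase, then split at whitespace.
    static List<String> whitespaceLower(String text) {
        return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // Toy analyzer: keep the whole value as a single token.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    // Example wiring: the "id" field is kept whole, every other field is split.
    static PerFieldSketch example() {
        Map<String, Function<String, List<String>>> perField = new HashMap<>();
        perField.put("id", PerFieldSketch::keyword);
        return new PerFieldSketch(PerFieldSketch::whitespaceLower, perField);
    }

    public static void main(String[] args) {
        PerFieldSketch analyzers = example();
        System.out.println(analyzers.analyze("id", "DOC 42"));        // prints [DOC 42]
        System.out.println(analyzers.analyze("body", "Hello World")); // prints [hello, world]
    }
}
```

The same wrapper should normally be used at query time as well, so that each field's query terms are analyzed the same way as its indexed terms.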
