SlideShare a Scribd company logo
Custom Analyzer in Lucene
Lucene/Solr Meetup
Ganesh.M
http://www.linkedin.com/in/gmurugappan
• Introduction to Analyzer
• Why we require Custom Analyzer
• Use case / Scenario
• Writing custom analyzer
• Know your analyzer
• Analyzer : Analyzes the given text and returns
tokens using Tokenizer and TokenFilter
• Tokenizer : Understands the language and breaks
the text in to tokens.
– WhitespaceTokenizer divides text at whitespace
– LetterTokenizer divides text at non-letter
– CJKTokenizer – Chinese, Japanese, Korean language
tokenizer
• TokenFiler: adds / stem / deletes token
– StopFilter – removes stop words
– PorterStemFilter – Transforms the token
• Lets have the text
“The quick brown fox jumps over lazy dog”
Using Standard Analyzer, it will generate
following tokens
Quick Brown
Fox Jumps
Over Lazy
dog
Know Your analyzer
• It is important to choose best analyzer for
your fields.
• If you choose it wrong then it may not give
expected search result.
• If you ever think you are not expecting the
correct result then check your Analyzer and
Query parser.
Lucene 3.x: Below code will print the tokens
generated from given analyzer
Analyzer analyzer = new SimpleAnalyzer();
TokenStream ts = analyzer.tokenStream(“Field", new
StringReader(“Hello world-2013 "));
ts.reset();
while (ts.incrementToken()) {
System.out.println("token: " +
ts.getAttribute(TermAttribute.class).term());
}
ts.close();
The purpose of Custom Analyzer
• Existing analyzers not always solves our
purpose, some times we need to analyze in a
different way
• Custom Analyzer could use existing inbuilt
filters.
• It could also be used for parsing queries
Use case
• Synonym Injection / Abbreviation Expansion
– Add synonyms at the time of indexing.
– In case of parsing resume, add related content for
a keyword. If you find text “lucene/solr” then you
could add information retrieval, search engine.
– If you are searching medical documents, chat
messages etc you need to expand the
abbreviation / codes at the time of indexing
• Stripping XML / HTML tags and index only the
content
<Address>
<Street>123, MG Road<Street>
<City>Bangalore<Bangalore>
<State>Karnataka<State>
</Address>
• Break Email ID / URL in to multiple tokens
– Sachin Tendulkar
<sachin.tendulkar123@gmail.com>
– Should be analyzed as
• sachin
• tendulkar
• sachin
• tendulkar123
• gmail
• com
HTMLAnalyzer in Lucene 4.5
public class HTMLAnalyzer extends Analyzer {
@Override
protected TokenStreamComponents createComponents(String
arg0, Reader reader) {
HTMLStripCharFilter htmlFilter = new HTMLStripCharFilter(reader);
WhitespaceTokenizer tokenizer = new
WhitespaceTokenizer(Version.LUCENE_45, htmlFilter);
TokenStream result = new LowerCaseFilter(Version.LUCENE_45,
tokenizer);
return new TokenStreamComponents (tokenizer, result);
}
}
HTMLAnalyzer in Solr
<fieldType name="text_html" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<charFilter class="solr.HTMLStripCharFilterFactory"
escapedTags="a, title" /> <tokenizer
class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
SynonymAnalyzer
• SynonymAnalyzer will inject the synonym as
part of the indexed content using Lucene 3.3
• Check out the code..
https://github.com/geekganesh/SynonymAnal
yzer
PerFieldAnalyzerWrapper
• IndexWriter / IndexWriterConfig will take only
one Analyzer and it will use that for all its
fields.
• We may have multiple fields and each field
should be indexed using specific analyzer then
we need to use PerFieldAnalyzerWrapper
• PerFieldAnalyzerWrapper is used to have
different analyzer per field. It will be passed to
IndexWriter

More Related Content

Similar to Custom analyzer using lucene

Semantic & Multilingual Strategies in Lucene/Solr
Semantic & Multilingual Strategies in Lucene/SolrSemantic & Multilingual Strategies in Lucene/Solr
Semantic & Multilingual Strategies in Lucene/Solr
Trey Grainger
 
Configuring Apache Solr for Thai Text Search
Configuring Apache Solr for Thai Text SearchConfiguring Apache Solr for Thai Text Search
Configuring Apache Solr for Thai Text Search
sagarote
 
Autocomplete in elasticsearch
Autocomplete in elasticsearchAutocomplete in elasticsearch
Autocomplete in elasticsearch
Taimur Qureshi
 
Lexical Analysis - Compiler Design
Lexical Analysis - Compiler DesignLexical Analysis - Compiler Design
Lexical Analysis - Compiler Design
Akhil Kaushik
 
Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCamp
GokulD
 
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so coolEnterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
Ecommerce Solution Provider SysIQ
 
Find it, possibly also near you!
Find it, possibly also near you!Find it, possibly also near you!
Find it, possibly also near you!
Paul Borgermans
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
Asad Abbas
 
Introduction to Search Engines
Introduction to Search EnginesIntroduction to Search Engines
Introduction to Search Engines
Nitin Pande
 
(ATS6-PLAT02) Accelrys Catalog and Protocol Validation
(ATS6-PLAT02) Accelrys Catalog and Protocol Validation(ATS6-PLAT02) Accelrys Catalog and Protocol Validation
(ATS6-PLAT02) Accelrys Catalog and Protocol Validation
BIOVIA
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
Erik Hatcher
 
Feedparser
FeedparserFeedparser
Feedparser
Lindsey Smith
 
Cd ch2 - lexical analysis
Cd   ch2 - lexical analysisCd   ch2 - lexical analysis
Cd ch2 - lexical analysis
mengistu23
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2
GokulD
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Sease
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
Trey Grainger
 
Lucene And Solr Intro
Lucene And Solr IntroLucene And Solr Intro
Lucene And Solr Intro
pascaldimassimo
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to Elasticsearch
Clifford James
 
Systematic Searching Strategies.pptx
Systematic Searching Strategies.pptxSystematic Searching Strategies.pptx
Systematic Searching Strategies.pptx
AnPhong9
 
Sumo Logic QuickStart Webinar Oct 2016
Sumo Logic QuickStart Webinar Oct 2016Sumo Logic QuickStart Webinar Oct 2016
Sumo Logic QuickStart Webinar Oct 2016
Sumo Logic
 

Similar to Custom analyzer using lucene (20)

Semantic & Multilingual Strategies in Lucene/Solr
Semantic & Multilingual Strategies in Lucene/SolrSemantic & Multilingual Strategies in Lucene/Solr
Semantic & Multilingual Strategies in Lucene/Solr
 
Configuring Apache Solr for Thai Text Search
Configuring Apache Solr for Thai Text SearchConfiguring Apache Solr for Thai Text Search
Configuring Apache Solr for Thai Text Search
 
Autocomplete in elasticsearch
Autocomplete in elasticsearchAutocomplete in elasticsearch
Autocomplete in elasticsearch
 
Lexical Analysis - Compiler Design
Lexical Analysis - Compiler DesignLexical Analysis - Compiler Design
Lexical Analysis - Compiler Design
 
Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCamp
 
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so coolEnterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
 
Find it, possibly also near you!
Find it, possibly also near you!Find it, possibly also near you!
Find it, possibly also near you!
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
 
Introduction to Search Engines
Introduction to Search EnginesIntroduction to Search Engines
Introduction to Search Engines
 
(ATS6-PLAT02) Accelrys Catalog and Protocol Validation
(ATS6-PLAT02) Accelrys Catalog and Protocol Validation(ATS6-PLAT02) Accelrys Catalog and Protocol Validation
(ATS6-PLAT02) Accelrys Catalog and Protocol Validation
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Feedparser
FeedparserFeedparser
Feedparser
 
Cd ch2 - lexical analysis
Cd   ch2 - lexical analysisCd   ch2 - lexical analysis
Cd ch2 - lexical analysis
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
 
Lucene And Solr Intro
Lucene And Solr IntroLucene And Solr Intro
Lucene And Solr Intro
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to Elasticsearch
 
Systematic Searching Strategies.pptx
Systematic Searching Strategies.pptxSystematic Searching Strategies.pptx
Systematic Searching Strategies.pptx
 
Sumo Logic QuickStart Webinar Oct 2016
Sumo Logic QuickStart Webinar Oct 2016Sumo Logic QuickStart Webinar Oct 2016
Sumo Logic QuickStart Webinar Oct 2016
 

Recently uploaded

Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptxOperational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
sandeepmenon62
 
UI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design SystemUI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design System
Peter Muessig
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
Peter Muessig
 
DevOps Consulting Company | Hire DevOps Services
DevOps Consulting Company | Hire DevOps ServicesDevOps Consulting Company | Hire DevOps Services
DevOps Consulting Company | Hire DevOps Services
seospiralmantra
 
Transforming Product Development using OnePlan To Boost Efficiency and Innova...
Transforming Product Development using OnePlan To Boost Efficiency and Innova...Transforming Product Development using OnePlan To Boost Efficiency and Innova...
Transforming Product Development using OnePlan To Boost Efficiency and Innova...
OnePlan Solutions
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
dakas1
 
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSISDECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
Tier1 app
 
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
Bert Jan Schrijver
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
Grant Fritchey
 
Liberarsi dai framework con i Web Component.pptx
Liberarsi dai framework con i Web Component.pptxLiberarsi dai framework con i Web Component.pptx
Liberarsi dai framework con i Web Component.pptx
Massimo Artizzu
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
rodomar2
 
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
dakas1
 
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLESINTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
anfaltahir1010
 
ppt on the brain chip neuralink.pptx
ppt  on   the brain  chip neuralink.pptxppt  on   the brain  chip neuralink.pptx
ppt on the brain chip neuralink.pptx
Reetu63
 
How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?
ToXSL Technologies
 
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
kgyxske
 
Preparing Non - Technical Founders for Engaging a Tech Agency
Preparing Non - Technical Founders for Engaging  a  Tech AgencyPreparing Non - Technical Founders for Engaging  a  Tech Agency
Preparing Non - Technical Founders for Engaging a Tech Agency
ISH Technologies
 
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
safelyiotech
 
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
kalichargn70th171
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
Remote DBA Services
 

Recently uploaded (20)

Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptxOperational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
 
UI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design SystemUI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design System
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
 
DevOps Consulting Company | Hire DevOps Services
DevOps Consulting Company | Hire DevOps ServicesDevOps Consulting Company | Hire DevOps Services
DevOps Consulting Company | Hire DevOps Services
 
Transforming Product Development using OnePlan To Boost Efficiency and Innova...
Transforming Product Development using OnePlan To Boost Efficiency and Innova...Transforming Product Development using OnePlan To Boost Efficiency and Innova...
Transforming Product Development using OnePlan To Boost Efficiency and Innova...
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
 
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSISDECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
 
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
 
Liberarsi dai framework con i Web Component.pptx
Liberarsi dai framework con i Web Component.pptxLiberarsi dai framework con i Web Component.pptx
Liberarsi dai framework con i Web Component.pptx
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
 
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
 
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLESINTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
 
ppt on the brain chip neuralink.pptx
ppt  on   the brain  chip neuralink.pptxppt  on   the brain  chip neuralink.pptx
ppt on the brain chip neuralink.pptx
 
How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?
 
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
 
Preparing Non - Technical Founders for Engaging a Tech Agency
Preparing Non - Technical Founders for Engaging  a  Tech AgencyPreparing Non - Technical Founders for Engaging  a  Tech Agency
Preparing Non - Technical Founders for Engaging a Tech Agency
 
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
 
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
 

Custom analyzer using lucene

  • 1. Custom Analyzer in Lucene Lucene/Solr Meetup Ganesh.M http://www.linkedin.com/in/gmurugappan
  • 2. • Introduction to Analyzer • Why we require Custom Analyzer • Use case / Scenario • Writing custom analyzer • Know your analyzer
  • 3. • Analyzer : Analyzes the given text and returns tokens using Tokenizer and TokenFilter • Tokenizer : Understands the language and breaks the text in to tokens. – WhitespaceTokenizer divides text at whitespace – LetterTokenizer divides text at non-letter – CJKTokenizer – Chinese, Japanese, Korean language tokenizer • TokenFiler: adds / stem / deletes token – StopFilter – removes stop words – PorterStemFilter – Transforms the token
  • 4. • Lets have the text “The quick brown fox jumps over lazy dog” Using Standard Analyzer, it will generate following tokens Quick Brown Fox Jumps Over Lazy dog
  • 5. Know Your analyzer • It is important to choose best analyzer for your fields. • If you choose it wrong then it may not give expected search result. • If you ever think you are not expecting the correct result then check your Analyzer and Query parser.
  • 6. Lucene 3.x: Below code will print the tokens generated from given analyzer Analyzer analyzer = new SimpleAnalyzer(); TokenStream ts = analyzer.tokenStream(“Field", new StringReader(“Hello world-2013 ")); ts.reset(); while (ts.incrementToken()) { System.out.println("token: " + ts.getAttribute(TermAttribute.class).term()); } ts.close();
  • 7. The purpose of Custom Analyzer • Existing analyzers not always solves our purpose, some times we need to analyze in a different way • Custom Analyzer could use existing inbuilt filters. • It could also be used for parsing queries
  • 8. Use case • Synonym Injection / Abbreviation Expansion – Add synonyms at the time of indexing. – In case of parsing resume, add related content for a keyword. If you find text “lucene/solr” then you could add information retrieval, search engine. – If you are searching medical documents, chat messages etc you need to expand the abbreviation / codes at the time of indexing
  • 9. • Stripping XML / HTML tags and index only the content <Address> <Street>123, MG Road<Street> <City>Bangalore<Bangalore> <State>Karnataka<State> </Address>
  • 10. • Break Email ID / URL in to multiple tokens – Sachin Tendulkar <sachin.tendulkar123@gmail.com> – Should be analyzed as • sachin • tendulkar • sachin • tendulkar123 • gmail • com
  • 11. HTMLAnalyzer in Lucene 4.5 public class HTMLAnalyzer extends Analyzer { @Override protected TokenStreamComponents createComponents(String arg0, Reader reader) { HTMLStripCharFilter htmlFilter = new HTMLStripCharFilter(reader); WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_45, htmlFilter); TokenStream result = new LowerCaseFilter(Version.LUCENE_45, tokenizer); return new TokenStreamComponents (tokenizer, result); } }
  • 12. HTMLAnalyzer in Solr <fieldType name="text_html" class="solr.TextField" positionIncrementGap="100"> <analyzer> <charFilter class="solr.HTMLStripCharFilterFactory" escapedTags="a, title" /> <tokenizer class="solr.WhitespaceTokenizerFactory"/> </analyzer> </fieldType>
  • 13. SynonymAnalyzer • SynonymAnalyzer will inject the synonym as part of the indexed content using Lucene 3.3 • Check out the code.. https://github.com/geekganesh/SynonymAnal yzer
  • 14. PerFieldAnalyzerWrapper • IndexWriter / IndexWriterConfig will take only one Analyzer and it will use that for all its fields. • We may have multiple fields and each field should be indexed using specific analyzer then we need to use PerFieldAnalyzerWrapper • PerFieldAnalyzerWrapper is used to have different analyzer per field. It will be passed to IndexWriter