Introduction to Apache Lucene
by Shrikrishna parab
AGENDA
What is Apache Lucene?
Focus of Apache Lucene
Lucene Architecture
Analyzers
Analysis Example
Demo
WHAT IS APACHE LUCENE?
• Apache Lucene is an open-source, Java-based full-text search engine library.
• Lucene is not a web application; it is a code library and API that can be used to add search capabilities to applications.
• It is also described as an information retrieval library.
• Lucene is independent of file format. Text from PDF, HTML, and Word documents can be indexed as long as their textual content can be extracted.
FOCUS
• Indexing Documents
• Searching Documents
INDEXING DOCUMENTS
• What is Indexing?
1. Conversion to plain text (e.g. from PDF or HTML files)
2. Analysis (convert the text into tokens)
3. Indexing (add the tokens to the index, so that each token maps to the documents containing it)
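The three steps above can be sketched in plain Java. This is a toy inverted index written for illustration, not Lucene's API: the class `ToyIndexer` and its methods are hypothetical names, step 1 is assumed to have already produced plain text, and the analysis step simply splits at non-letter characters and lowercases.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index in plain Java. Hypothetical sketch, NOT Lucene's API.
class ToyIndexer {
    // token -> sorted set of IDs of the documents that contain it
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Step 2 (analysis): split at non-letter characters, then lowercase.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Step 3 (indexing): record that each token occurs in document docId.
    void addDocument(int docId, String plainText) {
        for (String token : analyze(plainText)) {
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Look up the documents that contain a single, already-analyzed token.
    Set<Integer> documentsContaining(String token) {
        return postings.getOrDefault(token, Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        ToyIndexer index = new ToyIndexer();
        // Step 1 (conversion to plain text) is assumed to be done already.
        index.addDocument(1, "Lucene is a search library");
        index.addDocument(2, "Search engines build an index");
        System.out.println(index.documentsContaining("search")); // prints [1, 2]
    }
}
```

A real index also stores term frequencies and positions; the sketch keeps only the token-to-document mapping to show the idea.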
SEARCHING DOCUMENTS
• What is Searching?
1. Take the User Input
2. Create a query
3. Query the index
4. Return the results
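The four search steps can be sketched the same way. Again this is plain Java rather than Lucene's API: `ToySearcher` is a hypothetical class, the "query" is just a list of lowercase terms, and matching every term (AND semantics) is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy searcher over a tiny in-memory index. Hypothetical sketch, NOT Lucene's API.
class ToySearcher {
    // term -> sorted set of IDs of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void addDocument(int docId, String text) {
        for (String term : toTerms(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Step 2: turn raw user input into a query (here, a list of lowercase terms).
    static List<String> toTerms(String input) {
        List<String> terms = new ArrayList<>();
        for (String part : input.split("[^\\p{L}]+")) {
            if (!part.isEmpty()) {
                terms.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Steps 3 and 4: look up every term and return the IDs of documents matching all of them.
    List<Integer> search(String userInput) {
        Set<Integer> hits = null;
        for (String term : toTerms(userInput)) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (hits == null) {
                hits = new TreeSet<>(docs);   // first term: start from its postings
            } else {
                hits.retainAll(docs);         // later terms: keep only common documents
            }
        }
        return hits == null ? new ArrayList<Integer>() : new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        ToySearcher searcher = new ToySearcher();
        searcher.addDocument(1, "Apache Lucene is a search library");
        searcher.addDocument(2, "Apache Tika extracts text");
        searcher.addDocument(3, "Lucene indexes text for search");
        // Step 1: the user input would normally come from a form or the command line.
        System.out.println(searcher.search("lucene search")); // prints [1, 3]
    }
}
```

The sketch returns matches in document-ID order; a real search library would also score and rank them.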
LUCENE ARCHITECTURE
ANALYZER
• Tokenizes the input text
• Common Analyzers
1. WhitespaceAnalyzer
Splits text into tokens at whitespace
2. SimpleAnalyzer
Splits text into tokens at non-letter characters, then lowercases them
3. StopAnalyzer
Same as SimpleAnalyzer, but also removes stop words
4. StandardAnalyzer
The most sophisticated of the four: recognizes certain token types, lowercases, and removes stop words
(older Lucene releases do this by default; recent releases only when a stop-word set is supplied)
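The first three behaviours can be imitated in a few lines of plain Java. This is a rough sketch, not Lucene's implementation: `ToyAnalyzers` and its methods are hypothetical, and the stop-word set is a small illustrative one rather than Lucene's own list. StandardAnalyzer is left out because its tokenization rules are more involved than a simple split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java imitations of three analyzers. Hypothetical sketch, NOT Lucene's API.
class ToyAnalyzers {
    // Illustrative stop-word set; an assumption, not Lucene's actual list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "is", "of", "the"));

    // Like WhitespaceAnalyzer: split at whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Like SimpleAnalyzer: split at non-letter characters, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Like StopAnalyzer: the simple analysis, minus stop words.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The Quick-Brown fox is 2 fast";
        System.out.println(whitespace(text)); // prints [The, Quick-Brown, fox, is, 2, fast]
        System.out.println(simple(text));     // prints [the, quick, brown, fox, is, fast]
        System.out.println(stop(text));       // prints [quick, brown, fox, fast]
    }
}
```

Note how the hyphen and the digit survive whitespace splitting but act as separators once tokens are split at non-letters.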
ANALYSIS EXAMPLES
“Boost is the Secret of our Energy”
• Whitespace Analyzer
[Boost][is][the][Secret][of][our][Energy]
• Simple Analyzer
[boost][is][the][secret][of][our][energy]
• Stop Analyzer (Lucene's default English stop words, which include "is", "the" and "of" but not "our")
[boost][secret][our][energy]
• Standard Analyzer (with the same stop-word set)
[boost][secret][our][energy]
DEMO OF SIMPLE INDEXING AND SEARCHING
USING APACHE LUCENE
Thank You
