Introduction to Apache Lucene
by Shrikrishna Parab

AGENDA
What is Apache Lucene?
Focus of Apache Lucene
Lucene Architecture
Analyzers
Analysis Example
Demo

WHAT IS APACHE LUCENE?
- Apache Lucene is an open-source, Java-based full-text search engine library.
- Lucene is not a web application, but rather a code library and API that can easily be used to add search capabilities to applications.
- It is also described as an information retrieval (IR) library.
- Lucene is independent of the file format. Text from PDFs, HTML pages, and Word documents can be indexed as long as their textual information can be extracted.

FOCUS
- Indexing Documents
- Searching Documents

INDEXING DOCUMENTS
- What is Indexing?
1. Conversion to plain text (for PDF, HTML files, etc.)
2. Analysis (convert the text into tokens)
3. Indexing (map each token to the documents that contain it)
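
The analysis and index steps above can be sketched in plain Java. This is only an illustration of the idea of an inverted index, not Lucene's API or its on-disk format; the class `InvertedIndexSketch` and its methods are hypothetical names, and step 1 (text extraction) is assumed to have already produced plain text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative stand-in for the analysis and index steps; not Lucene's API.
final class InvertedIndexSketch {
    // token -> sorted ids of the documents that contain it
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    // Step 2 (analysis): split on non-letters and lowercase each token.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Step 3 (index): record the document id under every token it contains.
    void addDocument(int docId, String plainText) {
        for (String token : analyze(plainText)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Returns the ids of the documents containing the token (empty if none).
    TreeSet<Integer> documentsFor(String token) {
        return postings.getOrDefault(token, new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, "Lucene is a search library");
        index.addDocument(2, "Search the index with Lucene");
        System.out.println(index.documentsFor("lucene"));
        System.out.println(index.documentsFor("index"));
    }
}
```

Looking up a token is a single map access, which is why the text is tokenized and indexed once up front instead of being scanned at query time.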

SEARCHING DOCUMENTS
- What is Searching?
1. Take the User Input
2. Create a query
3. Query the index
4. Return the results
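
The four search steps can be sketched the same way. This is a minimal sketch under stated assumptions, not Lucene's query API: `SearchSketch` is a hypothetical class, only document titles are indexed, every query term must match (an AND query), and results come back in document-id order with no relevance scoring.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Illustrative AND-search over a tiny in-memory index; not Lucene's API.
final class SearchSketch {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private final Map<Integer, String> titles = new HashMap<>();

    // Splits on non-letters and lowercases; used for documents and queries alike.
    static List<String> terms(String text) {
        List<String> out = new ArrayList<>();
        for (String part : text.split("[^\\p{L}]+")) {
            if (!part.isEmpty()) {
                out.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    // Indexes a document by the terms in its title.
    void add(int docId, String title) {
        titles.put(docId, title);
        for (String term : terms(title)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    List<String> search(String userInput) {
        // Steps 1-2: take the user input and turn it into query terms.
        List<String> query = terms(userInput);
        List<String> results = new ArrayList<>();
        if (query.isEmpty()) {
            return results;
        }
        // Step 3: query the index by intersecting the posting sets of all terms.
        TreeSet<Integer> hits =
                new TreeSet<>(postings.getOrDefault(query.get(0), new TreeSet<>()));
        for (String term : query.subList(1, query.size())) {
            hits.retainAll(postings.getOrDefault(term, new TreeSet<>()));
        }
        // Step 4: return the results (titles of matching documents, in id order).
        for (int docId : hits) {
            results.add(titles.get(docId));
        }
        return results;
    }
}
```

The query goes through the same term-splitting as the indexed text. If the two sides were analyzed differently, a query for "Lucene" would not match the indexed token "lucene".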

ANALYZER
- Tokenizes the input text
- Common Analyzers
1. WhitespaceAnalyzer
Splits tokens on whitespace
2. SimpleAnalyzer
Splits tokens on non-letters, then lowercases them
3. StopAnalyzer
Same as SimpleAnalyzer, but also removes stop words
4. StandardAnalyzer
Most sophisticated of the four: recognizes certain token types, lowercases, and removes stop words (by default in releases before Lucene 8; later releases use an empty stop set unless one is supplied)

ANALYSIS EXAMPLES
“Boost is the Secret of our Energy”
- Whitespace Analyzer
[Boost][is][the][Secret][of][our][Energy]
- Simple Analyzer
[boost][is][the][secret][of][our][energy]
- Stop Analyzer
[boost][secret][our][energy]
- Standard Analyzer
[boost][secret][our][energy]
(Stop-word results assume Lucene's default English stop set, which contains "is", "the", and "of" but not "our".)
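
The first three analyzers can be imitated in a few lines of plain Java to see where their outputs differ. This is a sketch, not Lucene's implementation: `AnalyzerSketch` is a hypothetical class, and `STOP_WORDS` is an assumption modeled on Lucene's default English stop list, which contains "is", "the", and "of" but not "our". StandardAnalyzer is left out because its grammar-based tokenization does not reduce to a simple split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java imitation of three analyzers; not Lucene's implementation.
final class AnalyzerSketch {
    // Assumed stop set, modeled on Lucene's default English stop words.
    static final Set<String> STOP_WORDS = Set.of(
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
            "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
            "that", "the", "their", "then", "there", "these", "they", "this",
            "to", "was", "will", "with");

    // WhitespaceAnalyzer-like: split on whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // SimpleAnalyzer-like: split on non-letters, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // StopAnalyzer-like: the simple tokens minus the stop words.
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>(simple(text));
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Boost is the Secret of our Energy";
        System.out.println(whitespace(text)); // [Boost, is, the, Secret, of, our, Energy]
        System.out.println(simple(text));     // [boost, is, the, secret, of, our, energy]
        System.out.println(stop(text));       // [boost, secret, our, energy]
    }
}
```

With this stop set, "our" survives the stop filter. An application that wants it removed has to supply its own stop-word list.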

DEMO OF SIMPLE INDEXING AND SEARCHING
USING APACHE LUCENE
