Introduction to Apache Lucene

Sumit Luthra
Apache Lucene™ is a free, open-source, high-performance, full-featured text search engine library written entirely in Java. It is well suited to any application that requires full-text search, especially cross-platform applications.

Published in: Technology, Education

Transcript of "Introduction To Apache Lucene"

  1. Introduction to Apache Lucene (Sumit Luthra)
  2. Agenda
     - What is Apache Lucene?
     - Focus of Apache Lucene
     - Lucene Architecture
     - Core Indexing Classes
     - Core Searching Classes
     - Demo
     - Questions & Answers
  3. What is Apache Lucene?
     - Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
     - It is also described as an information retrieval library.
     - Lucene is an API (a library), not an application.
     - It is open source.
  4. Focus
     - Indexing documents
     - Searching documents
     Note: You can use Lucene to provide consistent full-text indexing across both database objects and documents in various formats (Microsoft Office documents, PDF, HTML, text, emails and so on), provided their content is first extracted as text.
  5. Lucene Architecture (diagram)
     - Indexing side: Raw content -> Acquire content -> Build document -> Analyze document -> Index document -> Index
     - Search side: Users -> Search UI -> Build query -> Run query (against the Index) -> Render results
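
As a rough illustration of the two halves of the architecture slide (analyze and index on one side, build and run a query on the other), here is a minimal sketch of an inverted index in plain Java. It is not how Lucene is implemented; the class name `ToyInvertedIndex`, its methods, and the lowercasing rule are all assumptions made up for this example.

```java
import java.util.Locale;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index: NOT Lucene's implementation, only an illustration of
// the "analyze -> index -> run query" flow from the architecture slide.
class ToyInvertedIndex {
    // term -> IDs of the documents that contain it
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Indexing side: analyze the raw text and record every term it yields.
    void addDocument(int docId, String rawText) {
        // Assumed analysis rule: lowercase, then split on non-letters/non-digits.
        for (String term : rawText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (term.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Search side: normalize the query term the same way, then look it up.
    SortedSet<Integer> search(String queryTerm) {
        SortedSet<Integer> hits = postings.get(queryTerm.toLowerCase(Locale.ROOT));
        return hits == null ? new TreeSet<Integer>() : hits;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument(1, "Hello World");
        index.addDocument(2, "Hello Lucene");
        System.out.println(index.search("hello"));  // prints [1, 2]
        System.out.println(index.search("lucene")); // prints [2]
    }
}
```

The point of the sketch is only that the same analysis step has to run at index time and at query time, otherwise the terms in the query will not line up with the terms in the index.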
  6. Indexing Documents
       // 'true' creates a new index; slide 8 lists the preferred IndexWriterConfig constructor
       IndexWriter writer = new IndexWriter(directory, analyzer, true);
       Document doc = new Document();
       // Field.Index.ANALYZED (see slide 14) replaces the older TOKENIZED constant
       doc.add(new Field("content", "Hello World", Field.Store.YES, Field.Index.ANALYZED));
       doc.add(new Field("name", "filename.txt", Field.Store.YES, Field.Index.ANALYZED));
       doc.add(new Field("path", "http://myfile/", Field.Store.YES, Field.Index.ANALYZED));
       // [...]
       writer.addDocument(doc);
       writer.close();
  7. Core indexing classes
     - IndexWriter
     - Directory
     - Analyzer
     - Document
     - Field
  8. IndexWriter construction
       // Deprecated
       IndexWriter(Directory d,
                   Analyzer a,                      // default analyzer
                   IndexWriter.MaxFieldLength mfl);
       // Preferred
       IndexWriter(Directory d, IndexWriterConfig c);
  9. Directory
     Implementations of the Directory storage abstraction include:
     - FSDirectory
     - RAMDirectory
     - DbDirectory
     - FileSwitchDirectory
     - JEDirectory
  10. Analyzers
      An Analyzer tokenizes the input text. Common analyzers:
      - WhitespaceAnalyzer: splits tokens on whitespace
      - SimpleAnalyzer: splits tokens on non-letters, then lowercases
      - StopAnalyzer: same as SimpleAnalyzer, but also removes stop words
      - StandardAnalyzer: the most sophisticated analyzer; it knows about certain token types, lowercases, removes stop words, ...
  11. Analysis examples
      Input: "The quick brown fox jumped over the lazy dog"
      - WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
      - SimpleAnalyzer: [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
      - StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
      - StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
  12. More analysis examples
      Input: "XY&Z Corporation - xyz@example.com"
      - WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com]
      - SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
      - StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
      - StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
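
The token lists in slides 11 and 12 can be approximated with a few lines of plain Java. The sketch below is a stand-in written for illustration, not Lucene's analyzer classes: `ToyAnalyzers`, its method names, and the short stop-word list are all assumptions, and StandardAnalyzer is deliberately left out because its token-type rules do not reduce to a one-line split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Stand-ins written for illustration; these are NOT Lucene's analyzer classes.
class ToyAnalyzers {
    // Assumed stop list: a small sample, not Lucene's actual default set.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Like WhitespaceAnalyzer: split on whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        return nonEmpty(text.split("\\s+"));
    }

    // Like SimpleAnalyzer: lowercase and split on anything that is not a letter.
    static List<String> simple(String text) {
        return nonEmpty(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"));
    }

    // Like StopAnalyzer: the SimpleAnalyzer tokens minus the stop words.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    // split() may return empty strings at the edges; drop them.
    private static List<String> nonEmpty(String[] parts) {
        List<String> tokens = new ArrayList<>();
        for (String part : parts) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "XY&Z Corporation - xyz@example.com";
        System.out.println(whitespace(text)); // [XY&Z, Corporation, -, xyz@example.com]
        System.out.println(simple(text));     // [xy, z, corporation, xyz, example, com]
        System.out.println(stop("The quick brown fox jumped over the lazy dog"));
        // [quick, brown, fox, jumped, over, lazy, dog]
    }
}
```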
  13. Document & Fields
      - A Document is the atomic unit of indexing and searching. It contains Fields.
      - Fields have a name and a value.
        - You have to translate raw content into Fields.
        - Examples: title, author, date, abstract, body, URL, keywords, ...
        - Different documents can have different fields.
  14. Field options
      Field.Store
      - NO: don't store the field value in the index
      - YES: store the field value in the index
      Field.Index
      - ANALYZED: tokenize with an Analyzer
      - NOT_ANALYZED: do not tokenize
      - NO: do not index this field
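
To make the difference between storing and indexing concrete, here is a small sketch of a single toy document that honours the two option sets from slide 14. It is not Lucene's `Field` class: `ToyFieldOptions`, its enums, and the `field:term` key scheme are assumptions invented for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model of the Store / Index options; NOT Lucene's Field implementation.
class ToyFieldOptions {
    enum Store { YES, NO }
    enum Index { ANALYZED, NOT_ANALYZED, NO }

    // "field:term" keys that were made searchable
    private final Set<String> indexedTerms = new HashSet<>();
    // field name -> original value, kept only for Store.YES
    private final Map<String, String> storedValues = new HashMap<>();

    void addField(String name, String value, Store store, Index index) {
        if (store == Store.YES) {
            storedValues.put(name, value); // retrievable later, verbatim
        }
        if (index == Index.ANALYZED) {
            // Tokenized: each lowercased word becomes its own searchable term.
            for (String token : value.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    indexedTerms.add(name + ":" + token);
                }
            }
        } else if (index == Index.NOT_ANALYZED) {
            // Indexed as one single term, exactly as given.
            indexedTerms.add(name + ":" + value);
        }
        // Index.NO: nothing is added, so the field cannot be searched.
    }

    boolean matches(String field, String term) {
        return indexedTerms.contains(field + ":" + term);
    }

    String stored(String field) {
        return storedValues.get(field); // null when the value was not stored
    }
}
```

In this model a field added with `Store.NO` and `Index.ANALYZED` can be found by its words but its original text cannot be read back, while a `Store.YES`, `Index.NOT_ANALYZED` field matches only on its exact value.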
  15. Searching an Index
        IndexSearcher searcher = new IndexSearcher(directory);
        // matchVersion and fieldName are placeholders (see slide 19)
        QueryParser parser = new QueryParser(matchVersion, fieldName, analyzer);
        Query query = parser.parse(WORD_SEARCHED);      // may throw ParseException
        TopDocs hits = searcher.search(query, noOfHits);
        ScoreDoc[] scoreDocs = hits.scoreDocs;
        if (scoreDocs.length > 0) {
            // look at the first match: fetch it by its document ID, not by array position
            Document doc = searcher.doc(scoreDocs[0].doc);
            System.out.println("name=" + doc.get("name"));
        }
        searcher.close();
  16. Core searching classes
      - IndexSearcher
      - Query
      - QueryParser
      - TopDocs
      - ScoreDoc
  17. IndexSearcher
      Constructors:
      - IndexSearcher(Directory d);    // Deprecated
      - IndexSearcher(IndexReader r);
        - Construct an IndexReader with the static method IndexReader.open(dir)
  18. Query
      - TermQuery (constructed from a Term)
      - TermRangeQuery
      - NumericRangeQuery
      - PrefixQuery
      - BooleanQuery
      - PhraseQuery
      - WildcardQuery
      - FuzzyQuery
      - MatchAllDocsQuery
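
The idea that a query is an object which can be combined with other queries can be sketched without Lucene. In the example below a "document" is just the set of terms it contains, and `ToyQuery` with its factory methods is a hypothetical stand-in for three of the query types listed in slide 18 (term, prefix, and boolean combinations); it does no scoring.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy query objects; NOT Lucene's Query classes. A "document" here is
// simply the set of terms it contains.
interface ToyQuery {
    boolean matches(Set<String> docTerms);

    // Like TermQuery: the document must contain exactly this term.
    static ToyQuery term(String t) {
        return docTerms -> docTerms.contains(t);
    }

    // Like PrefixQuery: at least one term must start with the prefix.
    static ToyQuery prefix(String p) {
        return docTerms -> docTerms.stream().anyMatch(t -> t.startsWith(p));
    }

    // Like a BooleanQuery whose clauses are all required.
    static ToyQuery allOf(ToyQuery... clauses) {
        return docTerms -> Arrays.stream(clauses).allMatch(q -> q.matches(docTerms));
    }

    // Like a BooleanQuery whose clauses are all optional: one match is enough.
    static ToyQuery anyOf(ToyQuery... clauses) {
        return docTerms -> Arrays.stream(clauses).anyMatch(q -> q.matches(docTerms));
    }
}

class ToyQueryDemo {
    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("java", "junit", "testing"));
        ToyQuery q = ToyQuery.allOf(ToyQuery.term("java"), ToyQuery.prefix("jun"));
        System.out.println(q.matches(doc)); // prints true
    }
}
```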
  19. QueryParser
      Constructor:
      - QueryParser(Version matchVersion, String defaultField, Analyzer analyzer);
      Parsing methods:
      - Query parse(String query) throws ParseException;
      - ... and many more
  20. QueryParser syntax examples
      Query expression                      Document matches if...
      java                                  Contains the term java in the default field
      java junit                            Contains the term java or junit or both in the default field
        (or: java OR junit)                 (the default operator can be changed to AND)
      +java +junit                          Contains both java and junit in the default field
        (or: java AND junit)
      title:ant                             Contains the term ant in the title field
      title:extreme -subject:sports         Contains extreme in the title and not sports in subject
      (agile OR extreme) AND java           Boolean expression matches
      title:"junit in action"               Phrase matches in title
      title:"junit action"~5                Proximity matches (within 5) in title
      java*                                 Wildcard matches
      java~                                 Fuzzy matches
      lastmodified:[1/1/09 TO 12/31/09]     Range matches
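
To show how the `+`, `-` and `field:` prefixes in the table break down, here is a toy parser for a very small subset of that syntax: whitespace-separated clauses of the form `[+|-][field:]term`. It is not Lucene's QueryParser grammar and handles no phrases, ranges, wildcards, parentheses, or AND/OR keywords; `ToyQuerySyntax` and `Clause` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Toy parser for clauses of the form [+|-][field:]term.
// NOT Lucene's QueryParser grammar, only a subset for illustration.
class ToyQuerySyntax {
    static final class Clause {
        final char occur;   // '+' required, '-' prohibited, ' ' optional
        final String field;
        final String term;

        Clause(char occur, String field, String term) {
            this.occur = occur;
            this.field = field;
            this.term = term;
        }

        @Override
        public String toString() {
            return (occur == ' ' ? "" : String.valueOf(occur)) + field + ":" + term;
        }
    }

    static List<Clause> parse(String expression, String defaultField) {
        List<Clause> clauses = new ArrayList<>();
        for (String raw : expression.trim().split("\\s+")) {
            String body = raw;
            char occur = ' ';
            if (body.startsWith("+") || body.startsWith("-")) {
                occur = body.charAt(0);
                body = body.substring(1);
            }
            // Without an explicit "field:" prefix the default field applies.
            int colon = body.indexOf(':');
            String field = colon < 0 ? defaultField : body.substring(0, colon);
            String term = body.substring(colon + 1);
            if (!term.isEmpty()) {
                clauses.add(new Clause(occur, field, term));
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(parse("title:extreme -subject:sports", "content"));
        // prints [title:extreme, -subject:sports]
        System.out.println(parse("+java +junit", "content"));
        // prints [+content:java, +content:junit]
    }
}
```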
  21. TopDocs and ScoreDoc
      - TopDocs: holds the top N ranked results that match a given query.
      - ScoreDoc: one matching result (a document ID and its score). TopDocs exposes the results as an array of ScoreDoc.
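
The "top N ranked results" idea can be sketched with a bounded min-heap. This is an illustration of the general technique only, not Lucene's collector code; `ToyTopDocs`, `Hit`, and `topN` are hypothetical names, with `Hit` loosely mirroring the document ID and score carried by a ScoreDoc.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Toy top-N selection; NOT Lucene's TopDocs or collector implementation.
class ToyTopDocs {
    static final class Hit {
        final int doc;      // document ID
        final float score;  // relevance score

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Keep only the n best-scoring hits and return them best first.
    static List<Hit> topN(Map<Integer, Float> scoresByDoc, int n) {
        Comparator<Hit> byScore = Comparator.comparingDouble((Hit h) -> h.score);
        // Min-heap on score: the weakest retained hit sits at the head.
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
        for (Map.Entry<Integer, Float> e : scoresByDoc.entrySet()) {
            heap.add(new Hit(e.getKey(), e.getValue()));
            if (heap.size() > n) {
                heap.poll(); // evict the current weakest hit
            }
        }
        List<Hit> best = new ArrayList<>(heap);
        best.sort(byScore.reversed());
        return best;
    }
}
```

Keeping a heap of size N means memory stays bounded by N regardless of how many documents match, which is why callers pass the number of hits they want up front.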
  22. Demo of simple indexing and searching using Apache Lucene
      You will require lucene-core-x.y.jar for this demo.
  23. Any questions? Thank you.