Introduction To Apache Lucene

Apache Lucene™ is a free, open-source, high-performance, full-featured text search engine library written entirely in Java. It is best suited to any application that requires full-text search, especially a cross-platform one.

Published in: Technology, Education

Transcript

  • 1. Introduction to Apache Lucene Sumit Luthra
  • 2. Agenda: What is Apache Lucene?; Focus of Apache Lucene; Lucene Architecture; Core Indexing Classes; Core Searching Classes; Demo; Questions & Answers
  • 3. What is Apache Lucene? Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is also known as an information retrieval library. Lucene is specifically an API, not an application. It is open source.
  • 4. Focus: indexing documents and searching documents. Note: You can use Lucene to provide consistent full-text indexing across both database objects and documents in various formats (Microsoft Office documents, PDF, HTML, text, emails, and so on).
  • 5. Lucene Architecture [diagram]
      – Indexing side: Raw Content → Acquire content → Build document → Analyze document → Index document → Index
      – Searching side: Users → Search UI → Build query → Run query (against the Index) → Render results
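The two halves of the architecture slide can be sketched with a toy inverted index. This is a plain-Java illustration under stated assumptions, not Lucene code: `TinyIndex`, its `add`, `search`, `text`, and `analyze` methods, and the crude lowercase-and-split analysis are all hypothetical names and choices made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy illustration only: a hypothetical class, not part of Lucene.
class TinyIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<String, TreeSet<Integer>>();
    // stored copy of each document's raw text, addressed by document id
    private final List<String> storedText = new ArrayList<String>();

    // Indexing side: build document, analyze it, add its terms to the index.
    int add(String rawContent) {
        int docId = storedText.size();
        storedText.add(rawContent);
        for (String term : analyze(rawContent)) {
            TreeSet<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(term, ids);
            }
            ids.add(docId);
        }
        return docId;
    }

    // Searching side: build a query and run it (only the first term is used).
    List<Integer> search(String queryText) {
        List<String> terms = analyze(queryText);
        if (terms.isEmpty()) {
            return Collections.emptyList();
        }
        TreeSet<Integer> ids = postings.get(terms.get(0));
        if (ids == null) {
            return Collections.emptyList();
        }
        return new ArrayList<Integer>(ids);
    }

    // Returns the stored raw text so results can be rendered.
    String text(int docId) {
        return storedText.get(docId);
    }

    // Assumed analysis for the sketch: lowercase, split on non-alphanumerics.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("Hello World");
        index.add("Hello Lucene");
        for (int id : index.search("lucene")) {
            System.out.println(id + ": " + index.text(id));
        }
    }
}
```

The same analysis step runs on both sides, which is why a query for `LUCENE` finds a document indexed as `Lucene`.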
  • 6. Indexing Documents
      IndexWriter writer = new IndexWriter(directory, analyzer, true);
      Document doc = new Document();
      doc.add(new Field("content", "Hello World", Field.Store.YES, Field.Index.TOKENIZED));
      doc.add(new Field("name", "filename.txt", Field.Store.YES, Field.Index.TOKENIZED));
      doc.add(new Field("path", "http://myfile/", Field.Store.YES, Field.Index.TOKENIZED));
      // [...]
      writer.addDocument(doc);
      writer.close();
  • 7. Core indexing classes: IndexWriter, Directory, Analyzer, Document, Field
  • 8. IndexWriter construction
      // Deprecated
      IndexWriter(Directory d, Analyzer a, // default analyzer
                  IndexWriter.MaxFieldLength mfl);
      // Preferred
      IndexWriter(Directory d, IndexWriterConfig c);
  • 9. Directory implementations: FSDirectory, RAMDirectory, DbDirectory, FileSwitchDirectory, JEDirectory
  • 10. Analyzers: an Analyzer tokenizes the input text. Common analyzers:
      – WhitespaceAnalyzer: splits tokens on whitespace
      – SimpleAnalyzer: splits tokens on non-letters, then lowercases
      – StopAnalyzer: same as SimpleAnalyzer, but also removes stop words
      – StandardAnalyzer: the most sophisticated analyzer; it knows about certain token types, lowercases, removes stop words, ...
  • 11. Analysis examples: "The quick brown fox jumped over the lazy dog"
      – WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
      – SimpleAnalyzer: [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
      – StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
      – StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
  • 12. More analysis examples: "XY&Z Corporation - xyz@example.com"
      – WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com]
      – SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
      – StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
      – StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
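The behaviour of the first three analyzers in the examples above can be imitated in plain Java. This is a rough sketch, not Lucene's implementation: `AnalyzerSketch` and its methods are hypothetical, and the stop-word list is an assumption chosen for the example (Lucene's default English list is longer). StandardAnalyzer is left out because its grammar-based tokenization does not reduce to a one-line split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java imitation of three analyzers; hypothetical helper, not Lucene code.
class AnalyzerSketch {
    // Assumed stop list for this sketch only.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Like WhitespaceAnalyzer: split on whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Like SimpleAnalyzer: split on non-letters, then lowercase.
    static List<String> simple(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    // Like StopAnalyzer: SimpleAnalyzer output minus stop words.
    static List<String> stop(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : simple(text)) {
            if (!STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "XY&Z Corporation - xyz@example.com";
        System.out.println(whitespace(text));
        System.out.println(simple(text));
        System.out.println(stop(text));
    }
}
```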
  • 13. Document & Fields: A Document is the atomic unit of indexing and searching. It contains Fields, and each Field has a name and a value.
      – You have to translate raw content into Fields
      – Examples: title, author, date, abstract, body, URL, keywords, ...
      – Different documents can have different fields
  • 14. Field options
      Field.Store
      – NO: don't store the field value in the index
      – YES: store the field value in the index
      Field.Index
      – ANALYZED: tokenize with an Analyzer (named TOKENIZED before Lucene 2.4)
      – NOT_ANALYZED: index the value as a single term, without tokenizing
      – NO: do not index this field
  • 15. Searching an Index
      IndexSearcher searcher = new IndexSearcher(directory);
      QueryParser parser = new QueryParser(matchVersion, fieldName, analyzer);
      Query query = parser.parse(WORD_SEARCHED);
      TopDocs hits = searcher.search(query, noOfHits);
      ScoreDoc[] scoreDocs = hits.scoreDocs;
      Document doc = searcher.doc(scoreDocs[0].doc); // look at first match
      System.out.println("name=" + doc.get("name"));
      searcher.close();
  • 16. Core searching classes: IndexSearcher, Query, QueryParser, TopDocs, ScoreDoc
  • 17. IndexSearcher constructors:
      – IndexSearcher(Directory d); // Deprecated
      – IndexSearcher(IndexReader r); construct the IndexReader with the static method IndexReader.open(dir)
  • 18. Query types: TermQuery (constructed from a Term), TermRangeQuery, NumericRangeQuery, PrefixQuery, BooleanQuery, PhraseQuery, WildcardQuery, FuzzyQuery, MatchAllDocsQuery
  • 19. QueryParser
      Constructor
      – QueryParser(Version matchVersion, String defaultField, Analyzer analyzer);
      Parsing methods
      – Query parse(String query) throws ParseException;
      – ... and many more
  • 20. QueryParser syntax examples
      Query expression | Document matches if…
      java | Contains the term java in the default field
      java junit, java OR junit | Contains the term java or junit or both in the default field (the default operator can be changed to AND)
      +java +junit, java AND junit | Contains both java and junit in the default field
      title:ant | Contains the term ant in the title field
      title:extreme -subject:sports | Contains extreme in the title and not sports in subject
      (agile OR extreme) AND java | Boolean expression matches
      title:"junit in action" | Phrase matches in title
      title:"junit action"~5 | Proximity matches (within 5) in title
      java* | Wildcard matches
      java~ | Fuzzy matches
      lastmodified:[1/1/09 TO 12/31/09] | Range matches
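The matching rules behind the `+`, `-`, and unprefixed rows of the syntax table can be sketched in plain Java. This is an illustration only, not QueryParser: `ClauseSketch.matches` is a hypothetical helper, a document is modelled as a map from field name to its set of terms, and the sketch handles only whitespace-separated `term` and `field:term` clauses (no AND/OR keywords, parentheses, phrases, wildcards, fuzzy, or range syntax).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper that mimics required/prohibited/optional clause matching.
class ClauseSketch {
    static boolean matches(String expr, String defaultField, Map<String, Set<String>> doc) {
        boolean anyRequired = false;
        boolean anyOptionalHit = false;
        for (String clause : expr.trim().split("\\s+")) {
            if (clause.isEmpty()) {
                continue;
            }
            boolean required = clause.charAt(0) == '+';
            boolean prohibited = clause.charAt(0) == '-';
            String body = (required || prohibited) ? clause.substring(1) : clause;
            int colon = body.indexOf(':');
            String field = colon < 0 ? defaultField : body.substring(0, colon);
            String term = body.substring(colon + 1);
            Set<String> terms = doc.get(field);
            boolean present = terms != null && terms.contains(term);
            if (prohibited) {
                if (present) {
                    return false; // a prohibited term rules the document out
                }
            } else if (required) {
                anyRequired = true;
                if (!present) {
                    return false; // every required term must be present
                }
            } else if (present) {
                anyOptionalHit = true;
            }
        }
        // Without required clauses, at least one optional clause must match.
        return anyRequired || anyOptionalHit;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> doc = new HashMap<String, Set<String>>();
        doc.put("title", new HashSet<String>(Arrays.asList("extreme", "programming")));
        doc.put("subject", new HashSet<String>(Arrays.asList("agile")));
        doc.put("contents", new HashSet<String>(Arrays.asList("java", "junit")));
        System.out.println(matches("title:extreme -subject:sports", "contents", doc));
        System.out.println(matches("+java +ant", "contents", doc));
    }
}
```

A query made only of prohibited clauses matches nothing in this sketch, which mirrors how a boolean query with no positive clause behaves in Lucene.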
  • 21. TopDocs: class containing the top N ranked results that match a given query. ScoreDoc: a single result (its document number and score); TopDocs exposes the matching results as an array of ScoreDoc.
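The idea of keeping only the top N ranked results can be sketched with a bounded min-heap. This is a plain-Java illustration, not Lucene's collector code: `TopNSketch`, `Hit`, and `topN` are hypothetical names, with `Hit` standing in for a ScoreDoc-like pair of document number and score.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of top-N selection; not part of Lucene.
class TopNSketch {
    // Stand-in for a single hit: a document number plus its score.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Returns the n highest-scoring hits, best first.
    static List<Hit> topN(List<Hit> candidates, int n) {
        if (n <= 0) {
            return new ArrayList<Hit>();
        }
        // Min-heap on score: the weakest retained hit sits at the head.
        PriorityQueue<Hit> heap = new PriorityQueue<Hit>(n, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Float.compare(a.score, b.score);
            }
        });
        for (Hit h : candidates) {
            if (heap.size() < n) {
                heap.add(h);
            } else if (h.score > heap.peek().score) {
                heap.poll(); // evict the current weakest hit
                heap.add(h);
            }
        }
        List<Hit> result = new ArrayList<Hit>(heap);
        Collections.sort(result, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Float.compare(b.score, a.score); // descending score
            }
        });
        return result;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<Hit>();
        hits.add(new Hit(0, 0.2f));
        hits.add(new Hit(1, 0.9f));
        hits.add(new Hit(2, 0.5f));
        hits.add(new Hit(3, 0.7f));
        for (Hit h : topN(hits, 2)) {
            System.out.println("doc=" + h.doc + " score=" + h.score);
        }
    }
}
```

The heap never holds more than N entries, so memory stays bounded no matter how many documents match the query.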
  • 22. Demo of simple indexing and searching using Apache Lucene You will require lucene-core-x.y.jar for this demo.
  • 23. Any Questions ? Thank You.