Introduction To Apache Lucene

Apache Lucene™ is a free, open-source, high-performance, full-featured text search engine library written entirely in Java. It is well suited to any application that requires full-text search, especially cross-platform applications.

Published in: Technology, Education

  1. Introduction to Apache Lucene
     Sumit Luthra
  2. Agenda
     • What is Apache Lucene?
     • Focus of Apache Lucene
     • Lucene Architecture
     • Core Indexing Classes
     • Core Searching Classes
     • Demo
     • Questions & Answers
  3. What is Apache Lucene?
     • "Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java."
     • Also known as an information retrieval (IR) library.
     • Lucene is specifically an API, not an application.
     • Open source.
  4. Focus
     • Indexing documents
     • Searching documents
     Note: You can use Lucene to provide consistent full-text indexing across both database objects and documents in various formats (Microsoft Office documents, PDF, HTML, text, emails, and so on).
  5. Lucene Architecture
     [Diagram]
     • Indexing: Raw Content → Acquire content → Build document → Analyze document → Index document → Index
     • Searching: Users → Search UI → Build query → Run query (against the Index) → Render results
  6. Indexing Documents
     // true = create a new index, overwriting any existing one
     IndexWriter writer = new IndexWriter(directory, analyzer, true);
     Document doc = new Document();
     // Field.Index.ANALYZED was called TOKENIZED in older releases
     doc.add(new Field("content", "Hello World", Field.Store.YES, Field.Index.ANALYZED));
     doc.add(new Field("name", "filename.txt", Field.Store.YES, Field.Index.ANALYZED));
     doc.add(new Field("path", "http://myfile/", Field.Store.YES, Field.Index.ANALYZED));
     // [...]
     writer.addDocument(doc);
     writer.close();   // flushes pending changes and releases the index
  7. Core indexing classes
     • IndexWriter
     • Directory
     • Analyzer
     • Document
     • Field
  8. IndexWriter construction
     // Deprecated
     IndexWriter(Directory d,
                 Analyzer a,                      // default analyzer
                 IndexWriter.MaxFieldLength mfl);
     // Preferred
     IndexWriter(Directory d, IndexWriterConfig c);
  9. Directory
     • FSDirectory
     • RAMDirectory
     • DbDirectory
     • FileSwitchDirectory
     • JEDirectory
  10. Analyzers
      An Analyzer tokenizes the input text.
      Common analyzers:
      – WhitespaceAnalyzer: splits tokens on whitespace
      – SimpleAnalyzer: splits tokens on non-letters, then lowercases
      – StopAnalyzer: same as SimpleAnalyzer, but also removes stop words
      – StandardAnalyzer: the most sophisticated analyzer; it knows about certain token types, lowercases, removes stop words, ...
  11. Analysis examples
      "The quick brown fox jumped over the lazy dog"
      • WhitespaceAnalyzer – [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
      • SimpleAnalyzer – [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
      • StopAnalyzer – [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
      • StandardAnalyzer – [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
  12. More analysis examples
      "XY&Z Corporation - xyz@example.com"
      • WhitespaceAnalyzer – [XY&Z] [Corporation] [-] [xyz@example.com]
      • SimpleAnalyzer – [xy] [z] [corporation] [xyz] [example] [com]
      • StopAnalyzer – [xy] [z] [corporation] [xyz] [example] [com]
      • StandardAnalyzer – [xy&z] [corporation] [xyz@example.com]
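The token lists on the two analysis slides can be reproduced with a few lines of plain Java. The sketch below is only an illustration of the splitting rules described for WhitespaceAnalyzer, SimpleAnalyzer, and StopAnalyzer; `ToyAnalyzers` and its stop-word list are hypothetical and are not Lucene classes or Lucene's actual stop list. StandardAnalyzer is left out because its token-type rules are more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy tokenizers for illustration only; these are not Lucene classes.
final class ToyAnalyzers {
    // Assumed sample of stop words, not Lucene's real list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of"));

    // WhitespaceAnalyzer-like: split on runs of whitespace, keep case.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // SimpleAnalyzer-like: split on runs of non-letters, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // StopAnalyzer-like: the simple rule plus stop-word removal.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "XY&Z Corporation - xyz@example.com";
        System.out.println(whitespace(text));
        System.out.println(simple(text));
        System.out.println(stop("The quick brown fox jumped over the lazy dog"));
    }
}
```

The point of the comparison is that the choice of analyzer decides which terms end up in the index, and therefore which query terms can match.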
  13. Document & Fields
      • A Document is the atomic unit of indexing and searching; it contains Fields.
      • Fields have a name and a value.
        – You have to translate raw content into Fields.
        – Examples: title, author, date, abstract, body, URL, keywords, ...
        – Different documents can have different fields.
  14. Field options
      • Field.Store
        – NO: don't store the field value in the index
        – YES: store the field value in the index
      • Field.Index
        – ANALYZED: tokenize with an Analyzer
        – NOT_ANALYZED: do not tokenize
        – NO: do not index this field
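The difference between storing a field and indexing it is easier to see in a small model. The following plain-Java sketch is a toy inverted index, assuming a simple split-and-lowercase analysis step; `ToyIndex`, `newDocument`, `addField`, and the two boolean flags are hypothetical names that stand in for the Field.Store and Field.Index options, not Lucene's API or its real data structures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index for illustration only; not how Lucene is implemented.
final class ToyIndex {
    // "field:term" -> IDs of the documents containing that term in that field
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    // document ID -> stored field values (roughly what Field.Store.YES keeps)
    private final List<Map<String, String>> stored = new ArrayList<>();

    // Starts a new, empty document and returns its ID.
    int newDocument() {
        stored.add(new HashMap<String, String>());
        return stored.size() - 1;
    }

    // store ~ Field.Store.YES/NO, index ~ Field.Index.ANALYZED/NO (assumed mapping).
    void addField(int docId, String name, String value, boolean store, boolean index) {
        if (store) {
            stored.get(docId).put(name, value);
        }
        if (index) {
            for (String token : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(name + ":" + token,
                            k -> new TreeSet<Integer>()).add(docId);
                }
            }
        }
    }

    // Returns the IDs of documents whose indexed field contains the term.
    SortedSet<Integer> search(String field, String term) {
        SortedSet<Integer> hits = postings.get(field + ":" + term.toLowerCase(Locale.ROOT));
        return hits == null ? new TreeSet<Integer>() : hits;
    }

    // Returns the stored value, or null if the field was not stored.
    String storedValue(int docId, String field) {
        return stored.get(docId).get(field);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int doc = index.newDocument();
        index.addField(doc, "content", "Hello World", false, true); // searchable, not stored
        index.addField(doc, "name", "filename.txt", true, false);   // stored, not searchable
        System.out.println(index.search("content", "hello"));
        System.out.println(index.storedValue(doc, "name"));
    }
}
```

In this model an indexed-only field can be found by a search but its original text cannot be read back, while a stored-only field can be read back but never matches a query.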
  15. Searching an Index
      IndexSearcher searcher = new IndexSearcher(directory);
      // matchVersion is a Version constant; fieldName is the default field to search
      QueryParser parser = new QueryParser(matchVersion, fieldName, analyzer);
      Query query = parser.parse(WORD_SEARCHED);
      TopDocs hits = searcher.search(query, noOfHits);
      ScoreDoc[] scoreDocs = hits.scoreDocs;
      // look at the first match (assumes at least one hit); ScoreDoc.doc is the document ID
      Document doc = searcher.doc(scoreDocs[0].doc);
      System.out.println("name=" + doc.get("name"));
      searcher.close();
  16. Core searching classes
      • IndexSearcher
      • Query
      • QueryParser
      • TopDocs
      • ScoreDoc
  17. IndexSearcher
      Constructors:
      – IndexSearcher(Directory d); // Deprecated
      – IndexSearcher(IndexReader r);
        • Construct an IndexReader with the static method IndexReader.open(dir)
  18. Query
      • TermQuery – constructed from a Term
      • TermRangeQuery
      • NumericRangeQuery
      • PrefixQuery
      • BooleanQuery
      • PhraseQuery
      • WildcardQuery
      • FuzzyQuery
      • MatchAllDocsQuery
  19. QueryParser
      • Constructor
        – QueryParser(Version matchVersion, String defaultField, Analyzer analyzer);
      • Parsing methods
        – Query parse(String query) throws ParseException;
        – ... and many more
  20. QueryParser syntax examples
      Query expression                     Document matches if…
      java                                 Contains the term java in the default field
      java junit                           Contains the term java or junit or both in the default field
      java OR junit                        (the default operator can be changed to AND)
      +java +junit                         Contains both java and junit in the default field
      java AND junit
      title:ant                            Contains the term ant in the title field
      title:extreme -subject:sports        Contains extreme in the title and not sports in subject
      (agile OR extreme) AND java          Boolean expression matches
      title:"junit in action"              Phrase matches in title
      title:"junit action"~5               Proximity matches (within 5) in title
      java*                                Wildcard matches
      java~                                Fuzzy matches
      lastmodified:[1/1/09 TO 12/31/09]    Range matches
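The first rows of the syntax table come down to three kinds of clause: optional (`java`), required (`+java`), and prohibited (`-sports`). The plain-Java sketch below shows one way to evaluate such clauses against the set of terms in a document. It assumes the usual behaviour that optional clauses must match only when there is no required clause, and that a query with only prohibited clauses matches nothing; `ToyBooleanMatch` is a hypothetical helper, not part of Lucene, and it ignores fields, phrases, and scoring.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy evaluation of +required / -prohibited / optional clauses; not Lucene code.
final class ToyBooleanMatch {
    static boolean matches(Set<String> docTerms, List<String> clauses) {
        boolean anyRequired = false;
        boolean anyOptionalMatched = false;
        for (String clause : clauses) {
            if (clause.startsWith("+")) {
                anyRequired = true;
                // A missing required term rules the document out.
                if (!docTerms.contains(clause.substring(1))) return false;
            } else if (clause.startsWith("-")) {
                // A present prohibited term rules the document out.
                if (docTerms.contains(clause.substring(1))) return false;
            } else if (docTerms.contains(clause)) {
                anyOptionalMatched = true;
            }
        }
        // With no required clause, at least one optional clause has to match.
        return anyRequired || anyOptionalMatched;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("java", "junit"));
        System.out.println(matches(doc, Arrays.asList("java", "ant")));     // java OR ant
        System.out.println(matches(doc, Arrays.asList("+java", "+ant")));   // java AND ant
        System.out.println(matches(doc, Arrays.asList("+java", "-junit"))); // java, not junit
    }
}
```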
  21. TopDocs and ScoreDoc
      • TopDocs: class containing the top N ranked documents/results that match a given query.
      • ScoreDoc: TopDocs exposes an array of ScoreDoc (scoreDocs), one entry per matching document/result, each holding a document ID and its score.
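Keeping only the top N results does not require sorting every match. A common technique, sketched here in plain Java, is a bounded min-heap that drops the lowest-scoring hit whenever it grows past N. `ToyTopN` and `Hit` are hypothetical stand-ins for the roles TopDocs and ScoreDoc play; this is not Lucene's collector code, and ties between equal scores are returned in no particular order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Toy top-N selection for illustration only; not Lucene code.
final class ToyTopN {
    // Plays the role of a ScoreDoc: a document ID plus its score.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Returns at most n hits, highest score first.
    static List<Hit> topN(Map<Integer, Float> scores, int n) {
        Comparator<Hit> byScore = (a, b) -> Float.compare(a.score, b.score);
        // Min-heap: the weakest of the current candidates sits at the head.
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
        for (Map.Entry<Integer, Float> e : scores.entrySet()) {
            heap.add(new Hit(e.getKey(), e.getValue()));
            if (heap.size() > n) {
                heap.poll(); // discard the lowest-scoring candidate
            }
        }
        List<Hit> best = new ArrayList<>(heap);
        best.sort(byScore.reversed());
        return best;
    }

    public static void main(String[] args) {
        Map<Integer, Float> scores = new TreeMap<>();
        scores.put(0, 0.2f);
        scores.put(1, 0.9f);
        scores.put(2, 0.5f);
        for (Hit hit : topN(scores, 2)) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```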
  22. Demo
      Simple indexing and searching using Apache Lucene. You will need lucene-core-x.y.jar for this demo.
  23. Any Questions?
      Thank you.
