Lucene Introduction

Lucene introduction / overview, also touching on Lucene 2.9/3.0 features

Published in: Technology, Education


  1. Lucene Introduction
     Otis Gospodnetic, Sematext Int’l
     @otisg, [email_address], http://jroller.com/otis, http://sematext.com/
  2. About Otis
     - Lucener since pre-Apache (circa 2000)
     - Committer: Lucene, Solr, Nutch, Mahout, Open Relevance
     - Lucene in Action 1 & 2 co-author
     - Solr in Action author
     - Sematext co-founder
  3. What is Lucene?
     - Free, ASL, Java IR library, Jar
     - Doug Cutting, ASF, 2001
     - Application agnostic: Indexing & Searching
     - High performance, scalable
     - No dependencies
     - Heavily ported
  4. What Lucene Ain’t
     - Turn key “solution”
     - Application, no installer/wizard needed
     - (Web) crawler
     - Insert-doc-format-here parser / filter
  5. The Lucene Family
     - Lucene vs. Apache Lucene vs. Java Lucene: IR library
     - Nutch: Hadoop-loving crawler, indexer, searcher for web-wide scale SE
     - Solr: Search server
     - Droids: Standalone framework for writing crawlers
     - Lucene.Net: C#, Incubator graduate
     - Lucy: C Lucene impl
     - Mahout: Hadoop-loving ML library
     - Open Relevance: Relevance judgments
     - PyLucene: Python port
  6. Integration
     (Diagram. Labels on the indexing side: Data Source, Data Source, Gather, Parse, Make Doc, Index. Labels on the search side: Search UI, Search App (e.g. webapp), Search, Index.)
  7. Integration: Rich Doc Indexing
     (Diagram. Labels: HTML, PDF, MS Word, Gather, Parse with Tika, Make Doc, Index.)
  8. Lucene Strengths
     - Simple API
     - Fast
     - Concurrent indexing and searching
     - Incremental indexing
     - NRT: Near-Real-Time
     - Boolean + Vector space, sorting, etc.
     - Cheap
  9. Query Types
     - Single and multi-term queries
     - Phrase queries (sloppiness allowed)
     - Wildcard and fuzzy
     - Range queries
     - “Boolean”: required, prohibited, “should”
     - Grouping
     - Fields
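Fuzzy queries match terms that are within a small edit distance of the query term. As a rough illustration of that idea only, here is a plain-Java Levenshtein distance; the class and method names are made up for this sketch, and it does not show how Lucene itself enumerates or scores fuzzy matches.

```java
// Toy edit-distance calculation (Levenshtein), the notion of "closeness"
// behind fuzzy term matching. Hypothetical names; not Lucene code.
class EditDistanceSketch {

    // Returns the minimum number of single-character insertions,
    // deletions, or substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("schmidt", "schmitt"));
        System.out.println(distance("kitten", "sitting"));
    }
}
```

Under this measure "schmidt" and "schmitt" are one substitution apart, which is the kind of near-miss a fuzzy query is meant to catch.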
  10. Query Syntax
     - +monkey +banana → monkey AND banana
     - +dog -snoopy → dog AND NOT snoopy
     - "pork flu"
     - "pork flu" -"new york" → "pork flu" NOT "new york"
     - "sweet pork"~3
     - natur*
     - schmidt~
     - createDate:[200901 TO 201001]
     - author:doug
     - author:"doug cutting"
     - author:"doug cutting" AND project:(lucene OR nutch OR hadoop)
     - title:lucene^5.0 body:lucene
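The + (required) and - (prohibited) prefixes, plus unprefixed "should" terms, can be read as a simple matching rule per document. The sketch below states that rule in plain Java under some assumptions: scoring is ignored, a document is reduced to a set of terms, and the names (BooleanMatchSketch, matches) are invented for illustration rather than taken from Lucene.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy matching rule for required / prohibited / "should" clauses.
// Hypothetical names; scoring and analysis are deliberately left out.
class BooleanMatchSketch {

    static boolean matches(Set<String> docTerms,
                           List<String> required,
                           List<String> prohibited,
                           List<String> should) {
        // Any prohibited term rules the document out.
        for (String t : prohibited) {
            if (docTerms.contains(t)) return false;
        }
        // Every required term must be present.
        for (String t : required) {
            if (!docTerms.contains(t)) return false;
        }
        // With required terms satisfied, "should" terms are optional.
        if (!required.isEmpty()) return true;
        // Otherwise at least one "should" term has to match.
        for (String t : should) {
            if (docTerms.contains(t)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("dog", "beagle", "cartoon"));
        // Equivalent of the query: +dog -snoopy
        System.out.println(matches(doc,
                Arrays.asList("dog"),
                Arrays.asList("snoopy"),
                Collections.<String>emptyList()));
    }
}
```

Note that in this sketch a query made only of prohibited terms matches nothing, since there is no positive clause to select documents.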
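  11. Code: FS Indexer

        import java.io.File;
        import java.io.FileFilter;
        import java.io.FileReader;
        import java.io.IOException;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;
        import org.apache.lucene.util.Version;

        public class Indexer {
          private IndexWriter writer;

          public Indexer(String indexDir) throws IOException {
            Directory dir = FSDirectory.open(new File(indexDir));
            // true = create a new index, replacing any existing one
            writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_CURRENT),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
          }

          public void close() throws IOException {
            writer.close();
          }

          public void index(String dataDir, FileFilter filter) throws Exception {
            // Only files accepted by the filter are indexed (null accepts all)
            File[] files = new File(dataDir).listFiles(filter);
            if (files == null) return;  // dataDir is not a readable directory
            for (File f : files) {
              Document doc = new Document();
              // Reader-valued field: analyzed and indexed, not stored
              doc.add(new Field("contents", new FileReader(f)));
              // Stored and indexed as a single, unanalyzed token
              doc.add(new Field("filename", f.getName(),
                  Field.Store.YES, Field.Index.NOT_ANALYZED));
              writer.addDocument(doc);
            }
          }
        }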
  12. Indexing Pipeline
     (Diagram. Labels: Document, add, Document Writer, Tokenizer, TokenFilter, Inverted Index.)
  13. Indexer Pipeline: Analysis
     (Figure. Source: Lucene in Action)
     - 1 Tokenizer
     - N TokenFilters
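The "1 Tokenizer, N TokenFilters" shape can be pictured with a toy chain in plain Java. This is only a sketch of the idea and does not use Lucene's TokenStream API: the class, the method names, and the tiny stop-word set are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain: one tokenizer followed by any number of token filters.
// Hypothetical names; this neither uses nor mirrors Lucene's classes.
class AnalysisChainSketch {

    // Assumed stop-word list, far smaller than any real one.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of"));

    // Tokenizer: splits raw text on runs of whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Filter 1: lower-cases every token.
    static List<String> lowerCase(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String t : in) out.add(t.toLowerCase(Locale.ROOT));
        return out;
    }

    // Filter 2: drops tokens found in the stop-word set.
    static List<String> removeStopWords(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String t : in) {
            if (!STOP_WORDS.contains(t)) out.add(t);
        }
        return out;
    }

    // The chain: the tokenizer runs first, then each filter in order.
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(tokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The quick brown fox jumped over the lazy dogs"));
    }
}
```

The order of the filters matters here: lower-casing runs before stop-word removal so that "The" and "the" are both dropped.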
  14. Analysis in Action

      "The quick brown fox jumped over the lazy dogs"
        WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
        SimpleAnalyzer:     [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
        StopAnalyzer:       [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
        StandardAnalyzer:   [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

      "XY&Z Corporation - xyz@example.com"
        WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com]
        SimpleAnalyzer:     [xy] [z] [corporation] [xyz] [example] [com]
        StopAnalyzer:       [xy] [z] [corporation] [xyz] [example] [com]
        StandardAnalyzer:   [xy&z] [corporation] [xyz@example.com]
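The SimpleAnalyzer rows are what you get from splitting on every non-letter character and lower-casing the pieces. A plain-Java imitation of that rule is sketched below; it is an approximation written for this example (the class and method names are made up), not the analyzer's actual source.

```java
import java.util.ArrayList;
import java.util.List;

// Toy imitation of "split on non-letters, then lower-case".
// Hypothetical names; an approximation, not Lucene's implementation.
class LetterSplitSketch {

    static List<String> simpleAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                // Letters extend the current token, lower-cased as they arrive.
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Any non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the final token
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(simpleAnalyze("XY&Z Corporation - xyz@example.com"));
    }
}
```

This is why "XY&Z" and the e-mail address fall apart into several terms under this rule, while a whitespace-only split keeps them whole.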
  15. Field Options
     - Doc has 1+ Fields. Field has name + value
     - Field.Index: no, analyzed, not analyzed, no norms, not analyzed no norms
     - Field.Store: yes, no
     - Field.TermVector: yes, no, with positions, with offsets, with both
  16. Inverted Index
     (Figure. Source: developer.apple.com)
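An inverted index maps each term to the documents that contain it, which is what makes term lookup fast. The sketch below is a minimal in-memory version in plain Java, assuming a crude lower-case-and-split analysis step; the class and its methods are hypothetical and say nothing about Lucene's on-disk format.

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy in-memory inverted index: term -> sorted set of document ids.
// Hypothetical names; not related to Lucene's index files.
class InvertedIndexSketch {

    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Adds a document and returns the id assigned to it.
    int add(String text) {
        int docId = nextDocId++;
        // Crude analysis: lower-case, then split on non-word characters.
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Returns the ids of documents containing the term (empty if none).
    SortedSet<Integer> lookup(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new TreeSet<Integer>() : hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Lucene is a search library");
        index.add("Solr is a search server");
        System.out.println("search -> " + index.lookup("search"));
        System.out.println("solr   -> " + index.lookup("solr"));
    }
}
```

A search for a term never scans document text; it reads one entry of the map, and multi-term queries combine the resulting id sets.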
  17. Index Directory

        # ls -lh
        total 1.1G
        -rw-r--r-- 1 root root 123M 2009-03-14 10:29 _0.fdt
        -rw-r--r-- 1 root root  44M 2009-03-14 10:29 _0.fdx
        -rw-r--r-- 1 root root   33 2009-03-14 10:31 _9j.fnm
        -rw-r--r-- 1 root root 372M 2009-03-14 10:36 _9j.frq
        -rw-r--r-- 1 root root  11M 2009-03-14 10:36 _9j.nrm
        -rw-r--r-- 1 root root 180M 2009-03-14 10:36 _9j.prx
        -rw-r--r-- 1 root root 5.5M 2009-03-14 10:36 _9j.tii
        -rw-r--r-- 1 root root 308M 2009-03-14 10:36 _9j.tis
        -rw-r--r-- 1 root root   64 2009-03-14 10:36 segments_2
        -rw-r--r-- 1 root root   20 2009-03-14 10:36 segments.gen

      Details: http://lucene.apache.org/java/2_9_0/fileformats.html
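  18. Code: Searcher

        import java.io.File;
        import java.io.IOException;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.queryParser.ParseException;
        import org.apache.lucene.queryParser.QueryParser;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.ScoreDoc;
        import org.apache.lucene.search.TopDocs;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;
        import org.apache.lucene.util.Version;

        public void search(String indexDir, String q)
            throws IOException, ParseException {
          Directory dir = FSDirectory.open(new File(indexDir));
          IndexSearcher is = new IndexSearcher(dir, true);  // true = read-only
          try {
            // "contents" is the default field for terms without a field prefix
            QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
                new StandardAnalyzer(Version.LUCENE_CURRENT));
            Query query = parser.parse(q);
            TopDocs hits = is.search(query, 10);  // keep the 10 best hits
            System.err.println("Found " + hits.totalHits + " document(s)");
            for (ScoreDoc scoreDoc : hits.scoreDocs) {
              Document doc = is.doc(scoreDoc.doc);  // load stored fields
              System.out.println(doc.get("filename"));
            }
          } finally {
            is.close();  // release the searcher even if parsing or search fails
          }
        }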
  19. Code: Doc Deletion
     Via IndexReader
     - void deleteDocument(int docNum): Deletes the document numbered docNum.
     - int deleteDocuments(Term term): Deletes all documents that have a given term indexed.
     Via IndexWriter
     - void deleteAll(): Deletes all documents in the index.
     - void deleteDocuments(Query query): Deletes the document(s) matching the provided query.
     - void deleteDocuments(Query[] queries): Deletes the document(s) matching any of the provided queries.
     - void deleteDocuments(Term term): Deletes the document(s) containing term.
     - void deleteDocuments(Term[] terms): Deletes the document(s) containing any of the terms.
  20. Code: Doc Updates
     Via IndexWriter
     - void updateDocument(Term term, Document doc): Updates a document by first deleting the document(s) containing term and then adding the new document.
     - void updateDocument(Term term, Document doc, Analyzer analyzer): Updates a document by first deleting the document(s) containing term and then adding the new document.
  21. Pitfalls
     - Update = delete + add
     - No partial doc update
     - No joins
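The first two pitfalls follow from each other: because an update is a delete followed by an add, any field left out of the replacement document is gone. The toy store below illustrates this in plain Java; the class, its methods, and the field names are assumptions made for the example and are not Lucene's API.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy document store where update = delete + add.
// Hypothetical names; this only illustrates the consequence of that rule.
class UpdateSketch {

    private final Map<Integer, Map<String, String>> docs = new LinkedHashMap<>();
    private int nextId = 0;

    // Adds a document and returns its newly assigned internal id.
    int add(Map<String, String> doc) {
        int id = nextId++;
        docs.put(id, new HashMap<>(doc));
        return id;
    }

    // Deletes every document whose field equals value; returns the count.
    int delete(String field, String value) {
        int before = docs.size();
        docs.values().removeIf(d -> value.equals(d.get(field)));
        return before - docs.size();
    }

    // Update: delete the matching document(s), then add the replacement.
    int update(String field, String value, Map<String, String> newDoc) {
        delete(field, value);
        return add(newDoc);
    }

    Map<String, String> get(int id) {
        return docs.get(id);
    }

    public static void main(String[] args) {
        UpdateSketch store = new UpdateSketch();
        Map<String, String> v1 = new HashMap<>();
        v1.put("id", "42");
        v1.put("title", "Old title");
        v1.put("body", "Original body text");
        store.add(v1);

        // The replacement carries only id and title.
        Map<String, String> v2 = new HashMap<>();
        v2.put("id", "42");
        v2.put("title", "New title");
        int newId = store.update("id", "42", v2);

        // The body field was not carried over, so it is absent now.
        System.out.println(store.get(newId));
    }
}
```

To change one field, the application has to rebuild the whole document from its source data and pass the complete replacement to the update call.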
  22. Performance Tips
     - Index: -Xmx, setRAMBufferSizeMB, !optimize, !compound, !NFS, multi-thread, analysis, NO_NORMS
     - Search: 1 searcher, !NFS, RAM vs. heap, SSD, optimize, FieldSelector
     - Details:
       http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
       http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
  23. Lucene 2.9 & 3.0
     - Per segment searching and caching (can lead to much faster reopen among other things)
     - Near real-time search (aka NRT)
     - New Query types
     - Smarter, more scalable multi-term queries (wildcard, range, etc.)
     - Freshly optimized Collector/Scorer API
     - Improved Unicode support and the addition of Collation contrib
     - New Attribute based TokenStream API
     - New QueryParser framework in contrib with a core QueryParser replacement impl included
     - Scoring is now optional when sorting by Field, or using a custom Collector, gaining sizable performance when scores are not required
     - New analyzers (PersianAnalyzer, ArabicAnalyzer, SmartChineseAnalyzer)
     - New fast-vector-highlighter for large documents
     - Lucene now includes high-performance handling of numeric fields. Such fields are indexed with a trie structure, enabling simple to use and much faster numeric range searching without having to externally pre-process numeric values into textual values.
  24. Community
     [email_address]
     [email_address]
     "I posted, went to get a sandwich, and came back to see two answers. The change works, and I can get the fix into production today. This list is magic."
  25. Resources
     - http://lucene.apache.org/java
       - Wiki, MLs, javadoc
     - http://manning.com/lucene
       - LIA2 soon, MEAP available
     - @lucene
  26. Contact
     - @otisg
     - [email_address]
     - [email_address]
     - sematext.com
     - jroller.com/otis
     - blog.sematext.com
