Lucene Introduction

Lucene introduction / overview, also touching on Lucene 2.9/3.0 features

Presentation Transcript

  • Lucene Introduction
    • Otis Gospodnetic, Sematext Int’l
    • @otisg
    • [email_address]
    • http://jroller.com/otis
    • http://sematext.com/
  • About Otis
    • Lucener since pre-Apache (circa 2000)
    • Committer: Lucene, Solr, Nutch, Mahout, Open Relevance
    • Lucene in Action 1 & 2 co-author
    • Solr in Action author
    • Sematext co-founder
  • What is Lucene?
    • Free, ASL, Java IR library, Jar
    • Doug Cutting, ASF, 2001
    • Application agnostic: Indexing & Searching
    • High performance, scalable
    • No dependencies
    • Heavily ported
  • What Lucene Ain’t
    • Turn key “solution”
    • Application, no installer/wizard needed
    • (Web) crawler
    • Insert-doc-format-here parser / filter
  • The Lucene Family
    • Lucene vs. Apache Lucene vs. Java Lucene: IR library
    • Nutch: Hadoop-loving crawler, indexer, searcher for web-wide scale SE
    • Solr: Search server
    • Droids: Standalone framework for writing crawlers
    • Lucene.Net: C#, Incubator graduate
    • Lucy: C Lucene impl
    • Mahout: Hadoop-loving ML library
    • Open Relevance: Relevance judgments
    • PyLucene: Python port
  • Integration
    • [Diagram] Indexing side: Data Sources → Gather → Parse → Make Doc → Index
    • [Diagram] Search side: Search UI → Search App (e.g. webapp) → Search → Index
  • Integration: Rich Doc Indexing
    • [Diagram] HTML, PDF, MS Word → Gather → Parse with Tika → Make Doc → Index
  • Lucene Strengths
    • Simple API
    • Fast
    • Concurrent indexing and searching
    • Incremental indexing
    • NRT: Near-Real-Time
    • Boolean + Vector space, sorting, etc.
    • Cheap
  • Query Types
    • Single and multi-term queries
    • Phrase queries (sloppiness allowed)
    • Wildcard and fuzzy
    • Range queries
    • “Boolean”: required, prohibited, “should”
    • Grouping
    • Fields
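The required / prohibited / "should" clauses listed above can be illustrated without Lucene at all. What follows is a minimal plain-Java sketch, not Lucene's BooleanQuery implementation: the class name BooleanMatchSketch and its methods are made up for this example, a document is reduced to a set of terms, and scoring is ignored entirely.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; not part of the Lucene API.
final class BooleanMatchSketch {

    // A document (reduced to its set of terms) matches when:
    //  - it contains none of the prohibited terms,
    //  - it contains all of the required terms, and
    //  - if there are no required terms, it contains at least one "should" term.
    static boolean matches(Set<String> docTerms, Set<String> required,
                           Set<String> prohibited, Set<String> should) {
        for (String term : prohibited) {
            if (docTerms.contains(term)) {
                return false;
            }
        }
        if (!required.isEmpty()) {
            // With required clauses present, "should" terms are optional here.
            return docTerms.containsAll(required);
        }
        for (String term : should) {
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false; // no positive clause matched
    }

    // Convenience helper for building term sets.
    static Set<String> terms(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        // Roughly "+dog -snoopy" against two tiny documents.
        Set<String> doc1 = terms("dog", "bone");
        Set<String> doc2 = terms("dog", "snoopy");
        System.out.println(matches(doc1, terms("dog"), terms("snoopy"), terms())); // true
        System.out.println(matches(doc2, terms("dog"), terms("snoopy"), terms())); // false
    }
}
```

In a real index the "should" clauses would also influence ranking; this sketch only answers the yes/no matching question.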
  • Query Syntax
    • +monkey +banana → monkey AND banana
    • +dog -snoopy → dog AND NOT snoopy
    • "pork flu"
    • "pork flu" -"new york" → "pork flu" NOT "new york"
    • "sweet pork"~3
    • natur*
    • schmidt~
    • createDate:[200901 TO 201001]
    • author:doug
    • author:"doug cutting"
    • author:"doug cutting" AND project:(lucene OR nutch OR hadoop)
    • title:lucene^5.0 body:lucene
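A fuzzy query such as schmidt~ matches terms that are a small number of edits away from the query term; Lucene's FuzzyQuery documentation describes its similarity measure as based on Levenshtein edit distance. The sketch below computes that distance in plain Java. The class name EditDistanceSketch is invented for this example, and how Lucene converts a distance into a similarity threshold is deliberately left out.

```java
// Hypothetical illustration; not Lucene's FuzzyQuery code.
final class EditDistanceSketch {

    // Classic dynamic-programming Levenshtein distance using two rolling rows:
    // the minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("schmidt", "schmitt")); // 1
        System.out.println(levenshtein("kitten", "sitting"));  // 3
    }
}
```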
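  • Code: FS Indexer
      import java.io.File;
      import java.io.FileFilter;
      import java.io.FileReader;
      import java.io.IOException;
      import org.apache.lucene.analysis.standard.StandardAnalyzer;
      import org.apache.lucene.document.Document;
      import org.apache.lucene.document.Field;
      import org.apache.lucene.index.IndexWriter;
      import org.apache.lucene.store.Directory;
      import org.apache.lucene.store.FSDirectory;
      import org.apache.lucene.util.Version;

      public class Indexer {
        private IndexWriter writer;

        public Indexer(String indexDir) throws IOException {
          Directory dir = FSDirectory.open(new File(indexDir));
          // true = create a new index, replacing any existing one
          writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_CURRENT), true,
                                   IndexWriter.MaxFieldLength.UNLIMITED);
        }

        public void close() throws IOException {
          writer.close();
        }

        public void index(String dataDir, FileFilter filter) throws Exception {
          // List only the files accepted by the filter (a null filter accepts all)
          File[] files = new File(dataDir).listFiles(filter);
          for (File f : files) {
            Document doc = new Document();
            // File contents are tokenized and indexed, but not stored
            doc.add(new Field("contents", new FileReader(f)));
            // File name is stored and indexed as a single token
            doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
          }
        }
      }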
  • Indexing Pipeline
    • [Diagram] A Document is added to the Document Writer, passes through a Tokenizer and TokenFilter, and ends up in the Inverted Index
  • Indexer Pipeline: Analysis (Source: Lucene in Action)
    • 1 Tokenizer
    • N TokenFilters
  • Analysis in Action
    • "The quick brown fox jumped over the lazy dogs"
      • WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
      • SimpleAnalyzer: [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
      • StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
      • StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
    • "XY&Z Corporation - xyz@example.com"
      • WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com]
      • SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
      • StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
      • StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
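The "1 Tokenizer, N TokenFilters" shape of an analysis chain can be sketched in plain Java. This is only an illustration of the idea, not Lucene's TokenStream API: the class AnalysisChainSketch, its method names, and the tiny stop-word list are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration; not Lucene's Analyzer/TokenStream classes.
final class AnalysisChainSketch {

    // Illustrative stop-word list, much smaller than any real one.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // The single tokenizer: split on runs of whitespace.
    static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Filter 1: lower-case every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Filter 2: drop stop words.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // Run the tokenizer, then apply each filter in order.
    @SafeVarargs
    static List<String> analyze(String text, Function<List<String>, List<String>>... filters) {
        List<String> tokens = whitespaceTokenize(text);
        for (Function<List<String>, List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = " The quick brown fox jumped over the lazy dogs ";
        // prints [quick, brown, fox, jumped, over, lazy, dogs]
        System.out.println(analyze(text,
                AnalysisChainSketch::lowerCase,
                AnalysisChainSketch::removeStopWords));
    }
}
```

Order matters in such a chain: lower-casing before stop-word removal is what lets "The" be dropped here.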
  • Field Options
    • Doc has 1+ Fields. Field has name+value
    • Field.Index: NO, ANALYZED, NOT_ANALYZED, ANALYZED_NO_NORMS, NOT_ANALYZED_NO_NORMS
    • Field.Store: YES, NO
    • Field.TermVector: YES, NO, WITH_POSITIONS, WITH_OFFSETS, WITH_POSITIONS_OFFSETS
  • Inverted Index (Source: developer.apple.com)
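An inverted index maps each term to the documents that contain it, which is what makes term lookups cheap at search time. The plain-Java sketch below shows only that core idea under simplifying assumptions: the class InvertedIndexSketch is invented for this example, tokenization is a crude split on non-word characters, and positions, frequencies, and on-disk segments are ignored.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration; not Lucene's index format.
final class InvertedIndexSketch {

    // term -> sorted set of ids of the documents containing that term
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Adds a document and returns the id assigned to it.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, key -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Returns the ids of all documents containing the term (possibly empty).
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("Lucene is a Java IR library");
        index.addDocument("Solr is a search server built on Lucene");
        System.out.println(index.search("lucene")); // [0, 1]
        System.out.println(index.search("solr"));   // [1]
    }
}
```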
  • Index Directory
    • # ls -lh
    • total 1.1G
    • -rw-r--r-- 1 root root 123M 2009-03-14 10:29 _0.fdt
    • -rw-r--r-- 1 root root 44M 2009-03-14 10:29 _0.fdx
    • -rw-r--r-- 1 root root 33 2009-03-14 10:31 _9j.fnm
    • -rw-r--r-- 1 root root 372M 2009-03-14 10:36 _9j.frq
    • -rw-r--r-- 1 root root 11M 2009-03-14 10:36 _9j.nrm
    • -rw-r--r-- 1 root root 180M 2009-03-14 10:36 _9j.prx
    • -rw-r--r-- 1 root root 5.5M 2009-03-14 10:36 _9j.tii
    • -rw-r--r-- 1 root root 308M 2009-03-14 10:36 _9j.tis
    • -rw-r--r-- 1 root root 64 2009-03-14 10:36 segments_2
    • -rw-r--r-- 1 root root 20 2009-03-14 10:36 segments.gen
    • Details: http://lucene.apache.org/java/2_9_0/fileformats.html
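  • Code: Searcher
      import java.io.File;
      import java.io.IOException;
      import org.apache.lucene.analysis.standard.StandardAnalyzer;
      import org.apache.lucene.document.Document;
      import org.apache.lucene.queryParser.ParseException;
      import org.apache.lucene.queryParser.QueryParser;
      import org.apache.lucene.search.IndexSearcher;
      import org.apache.lucene.search.Query;
      import org.apache.lucene.search.ScoreDoc;
      import org.apache.lucene.search.TopDocs;
      import org.apache.lucene.store.Directory;
      import org.apache.lucene.store.FSDirectory;
      import org.apache.lucene.util.Version;

      public class Searcher {
        public void search(String indexDir, String q) throws IOException, ParseException {
          Directory dir = FSDirectory.open(new File(indexDir));
          IndexSearcher is = new IndexSearcher(dir, true); // true = read-only
          // Parse the query text against the "contents" field by default
          QueryParser parser = new QueryParser("contents",
              new StandardAnalyzer(Version.LUCENE_CURRENT));
          Query query = parser.parse(q);
          TopDocs hits = is.search(query, 10); // top 10 hits
          System.err.println("Found " + hits.totalHits + " document(s)");
          for (int i = 0; i < hits.scoreDocs.length; i++) {
            ScoreDoc scoreDoc = hits.scoreDocs[i];
            Document doc = is.doc(scoreDoc.doc); // load stored fields
            System.out.println(doc.get("filename"));
          }
          is.close();
        }
      }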
  • Code: Doc Deletion
    • Via IndexReader
      • void deleteDocument(int docNum): deletes the document numbered docNum.
      • int deleteDocuments(Term term): deletes all documents that have a given term indexed.
    • Via IndexWriter
      • void deleteAll(): deletes all documents in the index.
      • void deleteDocuments(Query query): deletes the document(s) matching the provided query.
      • void deleteDocuments(Query[] queries): deletes the document(s) matching any of the provided queries.
      • void deleteDocuments(Term term): deletes the document(s) containing term.
      • void deleteDocuments(Term[] terms): deletes the document(s) containing any of the terms.
  • Code: Doc Updates
    • Via IndexWriter facade
      • void updateDocument(Term term, Document doc): updates a document by first deleting the document(s) containing term and then adding the new document.
      • void updateDocument(Term term, Document doc, Analyzer analyzer): updates a document by first deleting the document(s) containing term and then adding the new document.
  • Pitfalls
    • Update = delete + add
    • No partial doc update
    • No joins
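The first two pitfalls follow from each other: because an update is a delete followed by an add, any field missing from the replacement document is simply gone. The toy in-memory store below makes that visible. It is a sketch under stated assumptions, not Lucene code: UpdateSketch and its methods are hypothetical names, and a document is modeled as a plain map of field name to value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not Lucene's IndexWriter.
final class UpdateSketch {

    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(new HashMap<>(doc)); // defensive copy
    }

    // Deletes every document whose field has exactly this value;
    // returns how many documents were removed.
    int deleteDocuments(String field, String value) {
        int before = docs.size();
        docs.removeIf(doc -> value.equals(doc.get(field)));
        return before - docs.size();
    }

    // Update = delete + add: the old document is not merged with the new one.
    void updateDocument(String field, String value, Map<String, String> newDoc) {
        deleteDocuments(field, value);
        addDocument(newDoc);
    }

    List<Map<String, String>> documents() {
        return docs;
    }

    public static void main(String[] args) {
        UpdateSketch index = new UpdateSketch();
        Map<String, String> v1 = new HashMap<>();
        v1.put("id", "1");
        v1.put("title", "old title");
        v1.put("body", "some text");
        index.addDocument(v1);

        Map<String, String> v2 = new HashMap<>();
        v2.put("id", "1");
        v2.put("title", "new title"); // no "body" supplied
        index.updateDocument("id", "1", v2);

        System.out.println(index.documents().get(0).get("title")); // new title
        System.out.println(index.documents().get(0).get("body"));  // null
    }
}
```

To change one field and keep the rest, the caller has to rebuild the whole document from its original source and re-add it.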
  • Performance Tips
    • Index: -Xmx, setRAMBufferSizeMB, !optimize, !compound, !NFS, multi-thread, analysis, NO_NORMS
    • Search: 1 searcher, !NFS, RAM vs. heap, SSD, optimize, FieldSelector
    • Details:
      • http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
      • http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
  • Lucene 2.9 & 3.0
    • Per segment searching and caching (can lead to much faster reopen among other things)
    • Near real-time search (aka NRT)
    • New Query types
    • Smarter, more scalable multi-term queries (wildcard, range, etc)
    • Freshly optimized Collector/Scorer API
    • Improved Unicode support and the addition of Collation contrib
    • New Attribute based TokenStream API
    • New QueryParser framework in contrib with a core QueryParser replacement impl included
    • Scoring is now optional when sorting by Field, or using a custom Collector, gaining sizable performance when scores are not required
    • New analyzers (PersianAnalyzer, ArabicAnalyzer, SmartChineseAnalyzer)
    • New fast-vector-highlighter for large documents
    • Lucene now includes high-performance handling of numeric fields. Such fields are indexed with a trie structure, enabling simple-to-use and much faster numeric range searching, with no need to pre-process numeric values into textual values externally.
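The trie idea behind the numeric fields can be sketched as follows: each value is indexed several times at decreasing precision, so a wide range can be covered by a few coarse terms instead of one term per distinct value. This is a simplified plain-Java illustration, not Lucene's actual encoding: the class TriePrefixSketch, the "shift:prefix" term format, and the restriction to non-negative ints are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; Lucene's real term encoding differs.
final class TriePrefixSketch {

    // Produces one term per precision level for a non-negative int.
    // Each term records how many low bits were dropped (shift) and
    // the remaining high bits (value >>> shift).
    static List<String> prefixTerms(int value, int precisionStep) {
        if (value < 0 || precisionStep < 1) {
            throw new IllegalArgumentException("value must be >= 0 and precisionStep >= 1");
        }
        List<String> terms = new ArrayList<>();
        for (int shift = 0; shift < 32; shift += precisionStep) {
            terms.add(shift + ":" + (value >>> shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [0:423, 8:1, 16:0, 24:0]
        System.out.println(prefixTerms(423, 8));
        // prints [0:300, 8:1, 16:0, 24:0]
        System.out.println(prefixTerms(300, 8));
    }
}
```

Both values share the coarse term 8:1, so under this scheme a range such as 256 to 511 could be answered by looking up that single term rather than every full-precision value in the range.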
  • Community
    • [email_address]
    • [email_address]
    • "I posted, went to get a sandwich, and came back to see two answers. The change works, and I can get the fix into production today. This list is magic."
  • Resources
    • http://lucene.apache.org/java
      • Wiki, MLs, javadoc
    • http://manning.com/lucene
      • LIA2 soon, MEAP available
    • @lucene
  • Contact
    • @otisg
    • [email_address]
    • [email_address]
    • sematext.com
    • jroller.com/otis
    • blog.sematext.com