Apache Lucene


Lucene: The Text Searching Engine

Published in: Technology

  1. 1. Searching
  2. 2. Agenda Search Engine Lucene Java Features Code Example Scalability Solr Nutch
  3. 3. About Speaker
      • Abhiram Gandhe
      • 9+ years of experience on the Java/J2EE platform
      • Consultant eCommerce Architect with Delivery Cube
      • Pursuing a PhD from VNIT Nagpur on Link Prediction on Graph Databases
      • M.Tech. (Comp. Sci. & Engg.) MNNIT Allahabad, B.E. (Comp. Tech.) YCCE Nagpur …
  4. 4. What is a Search Engine?
      Answer: software that
      • builds an index on text
      • answers queries using the index
      “But we have a database already for that…”
      A search engine offers:
      • Scalability
      • Relevance ranking
      • Integration of different data sources (email, web pages, files, databases, …)
  5. 5. Works on words, not substrings: auto != automatic, automobile
      Indexing process:
      • Convert document
      • Extract text and metadata
      • Normalize text
      • Write (inverted) index
      Example:
      • Document 1: Apache Lucene at JUGNagpur
      • Document 2: JUGNagpur conference
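
The indexing process above can be sketched in a few lines of plain Java. This is a toy illustration only, not Lucene's actual index format: the class name `TinyInvertedIndex` and its methods are hypothetical, and normalization is assumed to mean lower-casing and splitting on non-alphanumeric characters. It indexes the two example documents and shows that lookups match whole words, not substrings.

```java
import java.util.*;

// Toy inverted index: maps each normalized word to the IDs of the documents containing it.
// Hypothetical illustration; this is not how Lucene stores its index internally.
class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // "Normalize text": lower-case, then split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // "Write (inverted) index": record the document ID under every term it contains.
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up a single word; only exact (normalized) terms match.
    Set<Integer> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Apache Lucene at JUGNagpur");
        index.add(2, "JUGNagpur conference");
        System.out.println(index.lookup("JUGNagpur")); // [1, 2]
        System.out.println(index.lookup("Lucene"));    // [1]
        System.out.println(index.lookup("Luc"));       // [] (a substring is not a term)
    }
}
```
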
  6. 6. What is Apache Lucene?
      “Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java” - from http://lucene.apache.org/
  7. 7. What is Apache Lucene?
      • Lucene is specifically an API, not an application.
      • Hard parts have been done, easy programming has been left to you.
      • You can build a search application that is specifically suited to your needs.
      • You can use Lucene to provide consistent full-text indexing across both database objects and documents in various formats (Microsoft Office documents, PDF, HTML, text, emails and so on).
  8. 8. Availability
      • Freely available (no cost)
      • Open source
      • Apache License, version 2.0: http://www.apache.org/licenses/LICENSE-2.0
      • Download from: http://www.apache.org/dyn/closer.cgi/lucene/java/
  9. 9. Apache Lucene Overview
      The Apache Lucene™ project develops open-source search software, including:
      • Lucene Core, our flagship sub-project, provides Java-based indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.
      • Solr™ is a high performance search server built using Lucene Core, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface.
      • Open Relevance Project is a subproject with the aim of collecting and distributing free materials for relevance testing and performance.
      • PyLucene is a Python port of the Core project.
  10. 10. Lucene Java Features
      • Powerful query syntax: create queries from user input or programmatically
      • Ranked search
      • Flexible queries: phrases, wildcards, etc.
      • Field-specific queries, e.g. title, artist, album
      • Fast indexing
      • Fast searching
      • Sorting by relevance or by other fields
      • Large and active community
      • Apache License 2.0
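  11. 11. Lucene Query Example
      • JUGNagpur
      • JUGNagpur AND Lucene, i.e. +JUGNagpur +Lucene
      • JUGNagpur OR Lucene
      • JUGNagpur NOT PHP, i.e. +JUGNagpur -PHP
      • "Java Conference" (phrase query)
      • Title:Lucene (field query)
      • J?GNagpur (single-character wildcard)
      • JUG* (multi-character wildcard)
      • schmidt~ (fuzzy query; matches schmidt, schmit, schmitt)
      • price:[100 TO 500] (range query)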
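
Fuzzy queries such as schmidt~ are based on edit distance between terms. The sketch below is a plain Levenshtein distance in standalone Java, offered only to show why "schmit" and "schmitt" count as close matches for "schmidt"; the class name `EditDistance` is hypothetical and this is not Lucene's own implementation, which uses a more elaborate approach.

```java
// Classic dynamic-programming Levenshtein distance (insert, delete, substitute).
// Hypothetical helper for illustration; not taken from Lucene.
class EditDistance {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b is the prefix length.
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("schmidt", "schmit"));  // 1 (delete 'd')
        System.out.println(levenshtein("schmidt", "schmitt")); // 1 (substitute 'd' with 't')
    }
}
```
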
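  12. 12. Index
      For this demo, we're going to create an in-memory index from some strings.

          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.index.IndexWriter;
          import org.apache.lucene.index.IndexWriterConfig;
          import org.apache.lucene.store.Directory;
          import org.apache.lucene.store.RAMDirectory;
          import org.apache.lucene.util.Version;

          StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
          Directory index = new RAMDirectory();
          IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
          IndexWriter w = new IndexWriter(index, config);
          addDoc(w, "Lucene in Action", "193398817");
          addDoc(w, "Lucene for Dummies", "55320055Z");
          addDoc(w, "Managing Gigabytes", "55063554A");
          addDoc(w, "The Art of Computer Science", "9900333X");
          w.close();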
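  13. 13. Index...
      addDoc() is what actually adds documents to the index.

          import java.io.IOException;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.document.Field;
          import org.apache.lucene.document.StringField;
          import org.apache.lucene.document.TextField;

          private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
              Document doc = new Document();
              doc.add(new TextField("title", title, Field.Store.YES));
              doc.add(new StringField("isbn", isbn, Field.Store.YES));
              w.addDocument(doc);
          }

      Note the use of TextField for content we want tokenized, and StringField for id fields and the like, which we don't want tokenized.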
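  14. 14. Query
      We read the query from the command-line arguments (defaulting to "lucene"), parse it and build a Lucene Query out of it.

          import org.apache.lucene.queryparser.classic.QueryParser;
          import org.apache.lucene.search.Query;

          String querystr = args.length > 0 ? args[0] : "lucene";
          Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);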
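  15. 15. Search
      Using the Query we create a Searcher to search the index. Then a TopScoreDocCollector is instantiated to collect the top 10 scoring hits.

          import org.apache.lucene.index.DirectoryReader;
          import org.apache.lucene.index.IndexReader;
          import org.apache.lucene.search.IndexSearcher;
          import org.apache.lucene.search.ScoreDoc;
          import org.apache.lucene.search.TopScoreDocCollector;

          int hitsPerPage = 10;
          // DirectoryReader.open replaces IndexReader.open, which is deprecated in Lucene 4.0.
          IndexReader reader = DirectoryReader.open(index);
          IndexSearcher searcher = new IndexSearcher(reader);
          TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
          searcher.search(q, collector);
          ScoreDoc[] hits = collector.topDocs().scoreDocs;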
  16. 16. Display
      Now that we have results from our search, we display the results to the user.

          System.out.println("Found " + hits.length + " hits.");
          for (int i = 0; i < hits.length; ++i) {
              int docId = hits[i].doc;
              Document d = searcher.doc(docId);
              System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
          }
  17. 17. Everything is a Document
      A document can represent anything textual:
      • Word document
      • DVD (the textual metadata only)
      • Website
      • Member (name, ID, etc.)
      A Lucene Document need not refer to an actual file on a disk; it could also represent a row in a relational database.
      Each developer is responsible for turning their own data sets into Lucene Documents. Lucene comes with a number of 3rd party contributions, including examples for parsing structured data files such as XML documents and Word files.
  18. 18. Indexes
      • The type of index used in Lucene and other full-text search engines is sometimes also called an “inverted index”.
      • Indexes track term frequencies.
      • Every term maps back to the Documents that contain it.
      • This index is what allows Lucene to quickly locate every document currently associated with a given set of input search terms.
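
As a rough illustration of "terms map back to Documents" and "indexes track term frequencies", the following standalone Java sketch keeps a per-term map of document IDs to occurrence counts and answers a multi-term lookup by intersecting the postings. The class `TermFrequencyIndex` and its methods are hypothetical and use naive whitespace tokenization; Lucene's real data structures are considerably more compact.

```java
import java.util.*;

// Hypothetical sketch: term -> (docId -> occurrences of the term in that document).
class TermFrequencyIndex {
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) continue;
            // Increment the count of this term for this document.
            postings.computeIfAbsent(term, k -> new TreeMap<>()).merge(docId, 1, Integer::sum);
        }
    }

    // How often the term occurs in the given document (0 if it does not occur).
    int frequency(String term, int docId) {
        Map<Integer, Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? 0 : docs.getOrDefault(docId, 0);
    }

    // Documents containing every one of the given terms (an AND-style lookup).
    Set<Integer> docsWithAll(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Map<Integer, Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
            Set<Integer> ids = docs == null ? Collections.<Integer>emptySet() : docs.keySet();
            if (result == null) result = new TreeSet<Integer>(ids);
            else result.retainAll(ids);
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        TermFrequencyIndex index = new TermFrequencyIndex();
        index.add(1, "Lucene in Action");
        index.add(2, "Lucene for Dummies");
        System.out.println(index.docsWithAll("lucene"));           // [1, 2]
        System.out.println(index.docsWithAll("lucene", "action")); // [1]
    }
}
```
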
  19. 19. Basic Indexing
      An index consists of one or more Lucene Documents.
      1. Create a Document
         • A Document consists of one or more Fields; each Field is a name-value pair.
         • Example: a Field commonly found in applications is title. In the case of a title Field, the field name is "title" and the value is the title of that content item.
         • Add one or more Fields to the Document.
      2. Add the Document to an index
         • Indexing involves adding Documents to an IndexWriter.
      3. The indexer will analyze the Document
         • We can provide specialized Analyzers such as StandardAnalyzer.
         • Analyzers control how the text is broken into terms, which are then used to index the document.
         • Analyzers can be used to remove stop words and perform stemming.
      Lucene comes with a default Analyzer which works well for unstructured English text; however, it often performs incorrect normalizations on non-English texts. Lucene makes it easy to build custom Analyzers, and provides a number of helpful building blocks with which to build your own. Lucene even includes a number of “stemming” algorithms for various languages, which can improve document retrieval accuracy when the source language is known at indexing time.
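
To make the role of an Analyzer more concrete, here is a toy analysis chain in standalone Java: lower-casing, stop-word removal, and a deliberately crude suffix-stripping "stemmer". Everything in it (the class `ToyAnalyzer`, the stop-word list, the stemming rules) is an assumption made for illustration; it does not reproduce StandardAnalyzer or any stemmer shipped with Lucene.

```java
import java.util.*;

// Hypothetical analysis chain: tokenize -> lower-case -> drop stop words -> crude stemming.
class ToyAnalyzer {
    // Assumed stop-word list, chosen only for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "in", "for", "and"));

    // Very naive stemmer: strips a trailing "ing" or a plural "s". Real stemmers are far more careful.
    static String stem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        if (term.length() > 3 && term.endsWith("s") && !term.endsWith("ss")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) continue;
            terms.add(stem(raw));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Art of Computer Science")); // [art, computer, science]
        System.out.println(analyze("Managing Gigabytes"));          // [manag, gigabyte]
    }
}
```

The second output shows why stemming is language-specific: rules tuned for English suffixes would mangle terms in other languages, which is the kind of incorrect normalization the slide warns about.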
  20. 20. Basic Searching
      Searching requires an index to have already been built.
      • Create a Query, usually via QueryParser (which parses user input) or programmatically, e.g. with MultiPhraseQuery.
      • Open an index.
      • Search the index, e.g. via IndexSearcher, using the same Analyzer as before.
      • Iterate through the returned Documents:
         • Extract the needed results.
         • Extract the result scores (if needed).
      It is important that Queries use the same (or a very similar) Analyzer as the one used when the index was created. The reason is the way the Analyzer normalizes the input text: in order to find Documents using the same type of text that was used when indexing, the query text must be normalized in the same way that the original data was normalized.
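
The point about using the same Analyzer on both sides can be shown with a small standalone Java sketch. It assumes that normalization is just lower-casing and uses a hypothetical class `AnalyzerMismatchDemo`; the index stores normalized terms, so a query term only matches if it is normalized the same way.

```java
import java.util.*;

// Hypothetical demo: index-time and query-time normalization must agree.
class AnalyzerMismatchDemo {
    // Assumed normalization for this example: lower-casing only.
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Index time: every term is stored in normalized form.
    static Set<String> indexTerms(String text) {
        Set<String> terms = new HashSet<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) terms.add(normalize(t));
        }
        return terms;
    }

    // Query time: optionally apply the same normalization before the lookup.
    static boolean matches(Set<String> indexedTerms, String queryTerm, boolean normalizeQuery) {
        return indexedTerms.contains(normalizeQuery ? normalize(queryTerm) : queryTerm);
    }

    public static void main(String[] args) {
        Set<String> terms = indexTerms("Lucene in Action");
        System.out.println(matches(terms, "Lucene", true));  // true: same normalization on both sides
        System.out.println(matches(terms, "Lucene", false)); // false: raw "Lucene" is not an indexed term
    }
}
```
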
  21. 21. Scalability Limits
      3 main scalability factors:
      • Query rate
      • Index size
      • Update rate
  22. 22. Query Rate Scalability
      • Lucene is already fast.
      • Built-in simple cache mechanism.
      • Easy solution for heavy workloads (gives near-linear scaling): add more query servers behind a load balancer.
      • Can grow as your traffic grows.
  23. 23. Index Size Scalability
      • Can easily handle millions of Documents.
      • Lucene is very commonly deployed into systems with tens of millions of Documents.
      • Although query performance can degrade as more Documents are added to the index, the growth factor is very low. The main limits related to index size that you are likely to run into will be disk capacity and disk I/O limits.
      If you need bigger:
      • Built-in methods allow queries to span multiple remote Lucene indexes.
      • Can merge multiple remote indexes at query time.
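
Spanning several indexes at query time comes down to running the query against each index and merging the per-index hits into one ranked list. The standalone Java sketch below shows only that merge step. The names `ShardMerge`, `Hit`, and `mergeTopN` are hypothetical, and it assumes the scores from different indexes are directly comparable, which in practice depends on how each index computes them.

```java
import java.util.*;

// Hypothetical merge of per-index result lists into a single top-N list.
class ShardMerge {
    static final class Hit {
        final String docId;
        final float score;

        Hit(String docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Collect the hits from every index, sort by descending score, keep the best n.
    // Assumes scores are comparable across indexes.
    static List<Hit> mergeTopN(List<List<Hit>> perIndexHits, int n) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perIndexHits) {
            all.addAll(hits);
        }
        all.sort((a, b) -> Float.compare(b.score, a.score));
        return new ArrayList<>(all.subList(0, Math.min(n, all.size())));
    }

    public static void main(String[] args) {
        List<Hit> indexA = Arrays.asList(new Hit("a1", 0.9f), new Hit("a2", 0.4f));
        List<Hit> indexB = Arrays.asList(new Hit("b1", 0.7f));
        for (Hit h : mergeTopN(Arrays.asList(indexA, indexB), 2)) {
            System.out.println(h.docId + " " + h.score);
        }
    }
}
```
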
  24. 24. Update Rate Scalability
      • Lucene is thread-safe.
      • Can update and query at the same time.
      • I/O is the limiting factor.