Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Apache lucene


Published on

Published in: Technology
  • Be the first to comment

Apache lucene

  1. 1. Apache Lucene Ioan Toma based on slides from Aaron Bannert© Copyright 2012 STI INNSBRUCK
  2. 2. What is Apache Lucene? “Apache Lucene(TM) is a high-performance, full-featured text searchengine library written entirely in Java. It is a technology suitable for nearlyany application that requires full-text search, especially cross-platform.” - from
  3. 3. Features• Scalable, High-Performance Indexing – over 95GB/hour on modern hardware – small RAM requirements -- only 1MB heap – incremental indexing as fast as batch indexing – index size roughly 20-30% the size of text indexed• Powerful, Accurate and Efficient Search Algorithms – ranked searching -- best results returned first – many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more – fielded searching (e.g., title, author, contents) – date-range searching – sorting by any field – multiple-index searching with merged results – allows simultaneous update and
  4. 4. 4 Features • Cross-Platform Solution – Available as Open Source software under the Apache License which lets you use Lucene in both commercial and Open Source programs – 100%-pure Java – Implementations in other programming languages available that are index-compatible • CLucene - Lucene implementation in C++ • Lucene.Net - Lucene implementation in .NET • Zend Search - Lucene implementation in the Zend Framework for PHP 5
  5. 5. Ranked Searching1. Phrase Matching2. Keyword Matching – Prefer more unique terms first – Scoring and ranking takes into account the uniqueness of each term when determining a document’s relevance
  6. 6. Flexible Queries• Phrases“star wars”• Wildcardsstar*• Ranges{star-stun}[2006-2007]• Boolean Operatorsstar AND
  7. 7. Field-specific Queries• Field-specific queries can be used to target specific fields in the Document Index.• For exampletitle:”star wars” ANDdirector:”George Lucas”
  8. 8. Sorting• Can sort any field in a Document – For example, by Price, Release Date, Amazon Sales Rank, etc…• By default, Lucene will sort results by their relevance score. Sorting by any other field in a Document is also
  9. 9. Lucene 9
  10. 10. Everything is a Document• A document can represent anything textual: – Word Document – DVD (the textual metadata only) – Website Member (name, ID, etc…)• A Lucene Document need not refer to an actual file on a disk, it could also resemble a row in a relational database.• Developers are responsible for turning their own data sets into Lucene Documents• A document is seen as a list of fields, where a field has a name an a
  11. 11. Indexes• The unit of indexing in Lucene is a term. A term is often a word.• Indexes track term frequencies• Every term maps back to a Document• Lucene uses inverted index which allows Lucene to quickly locate every document currently associated with a given set up input search
  12. 12. Basic Indexing1. Parse different types of documents (HTML, PDF, Word, text files, etc.)2. Extract tokens and related info (Lucene Analyser)3. Add the Document to an Index Lucene provide a standard analyzer forEnglish and latin based languages
  13. 13. Basic Searching1. Create a Query • (eg. by parsing user input)2. Open an Index3. Search the Index • Use the same Analyzer as before4. Iterate through returned Documents • Extract out needed results • Extract out result scores (if needed)
  14. 14. Lucene as SOA1. Design an HTTP query syntax – GET queries – XML for results2. Wrap Tomcat around core code3. Write a Client Library As it follows SOA principles, basic building blocks suchas load balancers can be deployed to quickly scale up thecapacity of the search
  15. 15. Lucene as SOA DiagramSingle-Machine ArchitectureLucene-based Application includes three components 1. Lucene Custom Client Library 2. Search Service 3. Custom Core Search
  16. 16. Lucene 16
  17. 17. Scalability Limits• 3 main scalability factors: – Query Rate – Index Size – Update
  18. 18. Query Rate Scalability• Lucene is already fast – Built-in caching• Easy solution for heavy workloads:(gives near-linear scaling) – Add more query servers behind a load balancer – Can grow as your traffic
  19. 19. Lucene as SOA DiagramHigh-Scale Multi-Machine
  20. 20. Index Size Scalability• Can easily handle millions of Documents• Lucene is very commonly deployed into systems with 10s of millions of Documents.• Main limits related to Index size that one is likely to run in to will be disk capacity and disk I/O limits.If you need bigger:• Built-in multi-machine capabilities – Can merge multiple remote indexes at
  21. 21. Update Rate• Lucene is threadsafe – Can update and query at the same time• I/O is limiting factorStrategies for achieving even higher update rates: – Vertical Split – for big workloads (Centralized Index Building) 1. Build indexes apart from query service 2. Push updated indexes on intervals – Horizontal Split – for huge workloads 1. Split data into columns 2. Merge columns for queries 3. Columns only receive their own data for