Enterprise Search Using Apache Solr

Apache Solr is an enterprise search engine. It indexes large numbers of documents of any size and provides robust search techniques. This presentation gives a brief introduction to it.

Published in: Technology

  1. Enterprise Search using Apache Solr (Sagar Chaturvedi)
  2. Agenda
     • What is Solr?
     • Features of Solr
     • High-level architecture of Solr
     • How do things work in Solr?
     • What is fuzzy search?
     • How does Solr perform?
  3. Solr - Introduction
     • An open source enterprise search platform from Apache.
     • A full-text search server that runs in web containers such as Tomcat or Jetty.
     • Indexes input files and provides various search facilities over them.
     • Uses the Lucene Java search library at its core.

     Type of Tool: Search and Index
     API Documentation: http://lucene.apache.org/solr/4_3_0/
     License Type: Apache License 2.0
     Last Release Date: 6 May 2013 (4.3.0)
     Release Frequency: approximately 1 month
     Mailing List/Community Support: http://lucene.apache.org/solr/discussion.html
     Major Applications/Users: Instagram, AOL, the Guardian, Shopper.com, SourceForge, eBay
     Stability: Stable version "4.3.0"
  4. Solr - Features
     • Faceted search
     • Accepts input as XML, CSV, and JSON files, and from databases.
     • Using Apache Tika, supports more than 25 input formats such as PDF and MS Word.
     • JSON, XML, PHP, Ruby, Python, and custom Java binary output formats.
     • Scales out through SolrCloud.
     • Supports 32 major languages, including Chinese, Korean, Japanese, and Arabic.
     • Boosting of results
     • Extensible plugin architecture
     • HTML administration interface
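As a rough illustration of the faceted search and JSON output features listed above, the sketch below builds a query URL for Solr's select handler. The parameters `q`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr request parameters; the host, port, core name (`collection1`), field name (`category`), and the helper `build_facet_query` are assumptions made for this example and do not come from the slides.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, core, text, facet_field, rows=10):
    """Return a select URL that asks Solr for JSON results plus facet counts.

    base_url, core and facet_field are hypothetical values supplied by the
    caller; they must match a real Solr installation to be useful.
    """
    params = [
        ("q", text),                   # the user's search text
        ("rows", rows),                # number of documents to return
        ("wt", "json"),                # response writer: JSON output
        ("facet", "true"),             # switch faceting on
        ("facet.field", facet_field),  # field to count values for
    ]
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Assumed local installation and field name, for illustration only.
url = build_facet_query(
    "http://localhost:8983/solr", "collection1", "enterprise search", "category"
)
print(url)
```

Sending this URL to a running Solr instance would return the matching documents together with a count of documents per distinct value of the chosen field, which is what a faceted navigation sidebar is built from.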
  5. Solr - Architecture
  6. Solr – Query Processing
  7. Solr - Indexing
  8. Solr – Fuzzy Search
     • Fuzzy search is the technique of finding strings that match a pattern approximately rather than exactly.
     • It is used to find documents that contain words spelled similarly to the search term. For example, a search for "appple" also returns documents that contain the term "apple".
     • Used in spell checking, spam filtering, and OCR scanning.
     • Solr's standard query parser supports fuzzy searches based on the Levenshtein distance (edit distance) algorithm.
     • The closeness of a match is measured by the edit distance between two words: the number of single-character edits (insertions, deletions, or substitutions) required to convert one word into the other.
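To make the edit distance idea on this slide concrete, here is a minimal sketch of the classic dynamic-programming Levenshtein distance, applied to the "appple" example. It only illustrates the metric; Solr performs fuzzy matching inside Lucene and does not work this way internally. The function names and the sample term list are assumptions made for the example. The last helper shows the `term~N` form that the standard query parser accepts for a fuzzy term.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep when equal)
            ))
        prev = curr
    return prev[-1]


def fuzzy_matches(query, terms, max_edits=2):
    """Return the terms within max_edits of the query (illustrative only)."""
    return [t for t in terms if edit_distance(query, t) <= max_edits]


def fuzzy_query(term, max_edits=2):
    """Format a fuzzy term in standard query parser syntax, e.g. apple~1."""
    return f"{term}~{max_edits}"


# Hypothetical indexed terms, for illustration only.
terms = ["apple", "apply", "maple", "banana"]
print(edit_distance("appple", "apple"))           # 1
print(fuzzy_matches("appple", terms, max_edits=1))  # ['apple']
print(fuzzy_query("appple", 1))                   # appple~1
```

A smaller `max_edits` gives stricter matching: with one edit only "apple" matches "appple", while allowing two edits also admits "apply".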
  9. Solr - Performance
     • Indexing: the time taken depends on:
         - Size and number of fields in each document
         - Number of fields to be indexed
         - Type of fields
         - Machine capabilities (CPU, memory)
       With documents of ~1 KB each, indexing 100 million documents should take a few hours.
     • Query: with 100 million documents spread across 10 Solr nodes (10 million documents each), the average search response time is ~1 second.
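The figures on this slide can be turned into a quick back-of-the-envelope calculation. The sketch below works out the raw data volume, the per-node document count, and the sustained indexing rate implied by a given time budget. The slide only says "a few hours", so the 3-hour figure used here is an assumption for illustration, as are the helper names.

```python
def raw_size_gb(num_docs, doc_size_kb):
    """Raw input volume in decimal gigabytes (1 GB = 1,000,000 KB)."""
    return num_docs * doc_size_kb / 1_000_000


def docs_per_node(num_docs, nodes):
    """Documents held by each node when the corpus is split evenly."""
    return num_docs // nodes


def required_throughput(num_docs, hours):
    """Documents per second needed to finish indexing within the budget."""
    return num_docs / (hours * 3600)


NUM_DOCS = 100_000_000  # from the slide
DOC_SIZE_KB = 1         # from the slide (~1 KB per document)
NODES = 10              # from the slide
ASSUMED_HOURS = 3       # assumption: the slide says only "a few hours"

print(raw_size_gb(NUM_DOCS, DOC_SIZE_KB))                   # 100.0
print(docs_per_node(NUM_DOCS, NODES))                       # 10000000
print(round(required_throughput(NUM_DOCS, ASSUMED_HOURS)))  # 9259
```

Under these assumptions the corpus is about 100 GB of raw input, and finishing in 3 hours would require indexing roughly 9,000 documents per second in aggregate.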
  10. Thank You!
