using Apache Solr
• What is Solr?
• Features of Solr
• High level Architecture of Solr
• How things work in Solr?
• What is Fuzzy Search?
• How does Solr perform?
Solr - Introduction
• An open-source enterprise search platform from Apache.
• A full-text search server that runs in a servlet container such as Tomcat or Jetty.
• Indexes input files and provides various search facilities over them.
• Uses the Lucene Java search library at its core.
Type of Tool: Search and Index API
License Type: Apache License 2.0
Last Release Date: 6 May 2013 (4.3.0)
Release Frequency: approximately once a month
Applications/Users: Instagram, AOL, The Guardian, Shopper.com, SourceForge, eBay
Stability: stable version 4.3.0
Solr - Features
• Faceted Search
• Accepts input as XML, CSV, or JSON files, or directly from a database.
• Supports more than 25 input formats, such as PDF and MS Word, through Apache Tika.
• Returns results in JSON, XML, PHP, Ruby, Python, or a custom Java binary format.
• Scales out through SolrCloud.
• Supports 32 major languages, including Chinese, Korean, Japanese, and Arabic.
• Boosting of results
• Extensible plugin architecture
• HTML Administration Interface
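To make the faceting and output-format features concrete, here is a minimal sketch of how a search request URL could be assembled. The host, port, core name (`collection1`), and the `category` field are assumptions for illustration only; `q`, `wt`, `facet`, and `facet.field` are standard Solr request parameters. The code only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode


def build_select_url(base_url: str, query: str, facet_field: str) -> str:
    """Build a Solr /select URL that asks for JSON output and one facet field."""
    params = [
        ("q", query),                  # the search query
        ("wt", "json"),                # response writer: JSON output
        ("facet", "true"),             # turn faceting on
        ("facet.field", facet_field),  # field to count values for
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr core and field names, chosen for illustration only.
url = build_select_url("http://localhost:8983/solr/collection1", "title:solr", "category")
print(url)
```

Changing `wt` to `xml`, `python`, or `ruby` would request one of the other output formats listed above.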
Solr - Fuzzy Search
• Fuzzy search is the technique of finding strings that match a pattern approximately (rather than exactly).
• It is used to find documents that contain words spelled similarly to the search term. Example: if you search for "appple", the search engine also returns documents containing the term "apple".
• Used in spell checking, spam filtering, and OCR.
• Solr's standard query parser supports fuzzy searches based on the Levenshtein distance (edit distance) algorithm.
• The closeness of a match is measured by the edit distance between two words: the number of single-character edits (insertions, deletions, or substitutions) needed to convert one word into the other.
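The edit distance idea can be sketched in a few lines of Python. This is a plain textbook dynamic-programming version written for illustration, not the implementation Solr or Lucene uses internally; the helper `fuzzy_matches` and its `max_edits` parameter are hypothetical names introduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(term: str, vocabulary: list[str], max_edits: int = 2) -> list[str]:
    """Return the vocabulary words within max_edits of the search term."""
    return [word for word in vocabulary if levenshtein(term, word) <= max_edits]


print(levenshtein("appple", "apple"))  # 1: delete one "p"
print(fuzzy_matches("appple", ["apple", "apply", "orange"], max_edits=1))  # ['apple']
```

In the standard query parser, a fuzzy search is requested by appending a tilde to the term, for example `appple~` or `appple~1` to allow at most one edit.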
Solr - Performance
• Indexing – The time taken to index depends on:
  • Size of each document and the number of fields in it
  • Number of fields to be indexed
  • Type of fields
  • Machine capabilities (CPU, memory)
  With documents of ~1 KB each, indexing 100 million documents should take on the order of a few hours.
• Query – With 100 million documents spread across 10 Solr nodes (10 million documents each), the average search response time is ~1 second.
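The figures above can be turned into a quick back-of-the-envelope calculation. The document count, document size, and node count come from the slide; the 3-hour indexing time is an assumed stand-in for "a few hours", so the resulting throughput is only an illustration, not a measured result.

```python
DOCS = 100_000_000          # total documents (from the slide)
DOC_SIZE_BYTES = 1_000      # ~1 KB per document (from the slide)
NODES = 10                  # Solr nodes (from the slide)
ASSUMED_INDEXING_HOURS = 3  # assumption: one possible reading of "a few hours"


def total_size_gb(docs: int, doc_size_bytes: int) -> float:
    """Raw input size in gigabytes (1 GB = 10**9 bytes)."""
    return docs * doc_size_bytes / 1e9


def docs_per_node(docs: int, nodes: int) -> int:
    """Documents held by each node when they are spread evenly."""
    return docs // nodes


def docs_per_second(docs: int, hours: float) -> float:
    """Average indexing throughput implied by a total indexing time."""
    return docs / (hours * 3600)


print(total_size_gb(DOCS, DOC_SIZE_BYTES))                     # 100.0 GB of raw input
print(docs_per_node(DOCS, NODES))                              # 10000000 documents per node
print(round(docs_per_second(DOCS, ASSUMED_INDEXING_HOURS)))    # about 9259 documents per second
```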