How Solr Search Works

Rajat Jain - 20th Dec, 2016

Agenda
• What do you mean by Search?
• Search Requirements
• Comparison of SOLR with SQL/NoSQL
• SOLR Architecture
• SOLR Usage in Trellis
• How Google Search Works
• Other Search Technologies

Search Requirements
• Text Search – e.g. “Architects”
• Filters – e.g. “In New Delhi”, “iOS”
• Sorting – e.g. “Best Match”, “Highest Rating”, etc.
• And more:
  • Facets
  • Stemming
  • Fuzzy Matching
  • Image Search, etc.
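
The three requirements above (text search, filters, sorting) map onto standard Solr request parameters: `q`, `fq`, `sort`, and `facet` / `facet.field`. Below is a minimal sketch of building such a query string. The field names `city`, `platform` and `rating` are hypothetical and only serve the example; they are not taken from any real schema.

```python
from urllib.parse import urlencode

def build_select_query(text, filters=(), sort=None, facet_fields=()):
    """Build the query string for a Solr /select request."""
    params = [("q", text)]                      # full-text part of the search
    params += [("fq", f) for f in filters]      # one fq parameter per filter
    if sort:
        params.append(("sort", sort))           # e.g. "rating desc"
    if facet_fields:
        params.append(("facet", "true"))
        params += [("facet.field", f) for f in facet_fields]
    return urlencode(params)

# Hypothetical field names, for illustration only.
query = build_select_query(
    "architects",
    filters=['city:"New Delhi"', "platform:iOS"],
    sort="rating desc",
    facet_fields=["city"],
)
print(query)
```

The resulting string would be appended to a `/select?` URL on the Solr core being queried.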

Search Requirements
• Full Text Search
• Fast reads (writes can be slower)
• Various Combinations of Filters
• Various Combinations of Sorting
• Non-features:
  • Real-time – some staleness is usually not a problem
  • Data integrity – the search index is usually not the primary data store, so it can be ‘lossy’

Search Requirements – Faceted Search
• A type of filtering with suggestions
• In most cases, the suggestions are sorted by the number of matching results
• Helps the user narrow down the search without having to ‘guess’ how to narrow it
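
As a rough illustration of what a facet is, the sketch below counts the values of one field across a result set and returns them sorted by count. This is plain Python over an in-memory list, not Solr's implementation, and the documents and the `city` field are made up for the example.

```python
from collections import Counter

def facet_counts(docs, field):
    """Return (value, count) pairs for one field, most frequent first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return counts.most_common()

# Hypothetical search results.
results = [
    {"name": "A", "city": "New Delhi"},
    {"name": "B", "city": "Mumbai"},
    {"name": "C", "city": "New Delhi"},
]
print(facet_counts(results, "city"))  # [('New Delhi', 2), ('Mumbai', 1)]
```

Each pair can be shown to the user as a clickable suggestion, and clicking one adds the corresponding filter to the search.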

Conventional Storage for Search
• SQL (MySQL)
  • Relational tables
  • Normalized data
  • Assumes keys / indexes are used for reads & writes
  • Optimized for reads, writes and transactional data (ACID transactions)
  • Lots of security, etc.
  • Table data stored in the file system
  • Indexing – individual columns or sets of columns
  • Full-text search – a recent addition (full-text index)

Conventional Storage for Search
• NoSQL (think MongoDB)
  • Key-value pairs
  • De-normalized data
  • Unstructured data
  • Optimized for reads – writes can be slightly slower (in case of transactional)
  • Data stored in the file system
  • Indexing – individual fields
  • Full-text search – has built-in support

Advantages of SOLR over MySQL/NoSQL
• Inverted index
• Mind-blowing Text-analysis / stemming / scoring / fuzziness
• Weighting fields / boosting – custom scoring functions
• Single document concept – no relations (in general)
• Faceting support out of the box
• Optimized for search and search alone (at scale without performance drop)

SOLR Architecture – Indexing
• Take a ‘document’ / field, etc.
• For each field, apply a set of filters / tokenizers
• Convert to individual tokens
• Update the ‘inverted’ index based on the tokens
• In general, keep track of stats, etc. in the index for the various terms
• Different indexes per field
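
The indexing steps above can be sketched in a few lines of Python. This is a toy model, not Lucene's data structures: the analyzer here just lowercases, splits on word characters and drops a small made-up stop-word list, and the index is a nested dictionary of field, then token, then document id, then positions.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "in", "of"}  # illustrative list only

def analyze(text):
    """Tokenizer plus filters: lowercase, split into words, drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def new_index():
    """field -> token -> doc_id -> list of token positions."""
    return defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

def index_document(index, doc_id, doc):
    """Analyze every field and update that field's inverted index."""
    for field, text in doc.items():
        # Positions are counted after stop words have been removed.
        for position, token in enumerate(analyze(text)):
            index[field][token][doc_id].append(position)

index = new_index()
index_document(index, 1, {"title": "Architects in New Delhi", "body": "The best architects"})
index_document(index, 2, {"title": "Delhi Designers", "body": "Designers and architects"})
print(sorted(index["title"]["delhi"]))  # [1, 2]
```

Looking up a token is a dictionary access rather than a scan over every document, which is the main reason an inverted index keeps reads fast.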

SOLR Architecture – Indexing
[Diagram: documents reach Solr by HTTP POST through update handlers: XML Update Handler (/update), CSV Update Handler (/update/csv), XML Update with custom processor chain (/update/xml) and Extracting RequestHandler for PDF, Word, … (/update/extract). The Data Import Handler pulls from a SQL DB or an RSS feed and applies simple transforms. Each handler runs an Update Processor Chain (Remove Duplicates, Custom Transform, Logging and Index processors), and text passes through the Lucene Text Index Analyzers into the Lucene Index.]

SOLR Architecture – Searching
• User enters query
• Parse the query, i.e. apply the required filters and tokenizers
• Convert to tokens
• Parallel search across multiple indexes (per field)
• Score all the documents
• Sort in async fashion
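
The search flow above can be sketched as follows. Everything here is an assumption made for illustration: the tiny hand-written index, the field boosts, and the TF-IDF style formula, which is a simplification and not the scoring Lucene actually uses. The fields are also searched one after another rather than in parallel.

```python
import math
from collections import defaultdict

# Hypothetical per-field inverted index: field -> term -> {doc_id: term frequency}
INDEX = {
    "title": {"solr": {1: 1, 2: 1}, "search": {1: 1}, "guide": {2: 1}},
    "body": {"solr": {1: 2, 3: 1}, "search": {1: 1, 3: 2}},
}
NUM_DOCS = 3
FIELD_BOOSTS = {"title": 2.0, "body": 1.0}  # title matches count double

def search(query):
    """Tokenize the query, look up each field's index, score and sort."""
    tokens = query.lower().split()
    scores = defaultdict(float)
    for field, terms in INDEX.items():
        for token in tokens:
            postings = terms.get(token, {})
            if not postings:
                continue
            # Rarer terms get a higher weight (inverse document frequency).
            idf = math.log(1 + NUM_DOCS / len(postings))
            for doc_id, tf in postings.items():
                scores[doc_id] += FIELD_BOOSTS[field] * tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

for doc_id, score in search("Solr search"):
    print(doc_id, round(score, 3))
```

Document 1 comes out first because it matches both terms in both fields, and the boosted title field contributes the most to its score.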

SOLR Architecture – Updating Index
• Types of index updates
  • Instant index
  • Incremental indexing
  • Full indexing
• Index update strategies
  • Instant / incremental indexing cannot happen continuously
  • Too much causes performance degradation
  • Full index periodically to optimize the index

SOLR Architecture – Scalability
• Sharding
  • Splitting collections across servers – search in parallel
• Replication
  • More than one copy of the data for failover
• SolrCloud
  • Using ZooKeeper for managing clusters
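
A rough sketch of the sharding idea: each document is routed to one shard by hashing its id, and a search is sent to every shard, with the per-shard results merged into one ranked list. The MD5-based routing and the shard count are assumptions for the example and are not necessarily how SolrCloud routes documents.

```python
import hashlib
import heapq

NUM_SHARDS = 3  # illustrative value

def shard_for(doc_id):
    """Pick a shard for a document by hashing its id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def merge_top_k(per_shard_results, k):
    """Merge (doc_id, score) hits from all shards and keep the best k."""
    all_hits = (hit for hits in per_shard_results for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])

# Hypothetical hits returned by three shards for the same query.
shard_results = [
    [("doc-a", 0.9), ("doc-b", 0.4)],
    [("doc-c", 0.7)],
    [("doc-d", 0.8), ("doc-e", 0.1)],
]
print(merge_top_k(shard_results, 3))  # [('doc-a', 0.9), ('doc-d', 0.8), ('doc-c', 0.7)]
```

Replication is orthogonal to this: each shard can have several copies, and any healthy copy can answer its part of the query.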

SOLR Architecture – Other Features
• Stemming
  • Identify the root word and variations of the word, e.g. "stems", "stemmer", "stemming", "stemmed" are all based on "stem"
• Fuzzy Matching
  • Similar words / misspellings
  • Edit distance
• NLP
  • Identify entities / nouns in the search query
  • OpenNLP plugin for SOLR
• And much more…
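
Two of these features are easy to illustrate in isolation. The sketch below computes the Levenshtein edit distance, the usual measure behind fuzzy matching, and applies a deliberately naive suffix-stripping stemmer. The stemmer is a toy written for this example, far simpler than the stemmers that ship with Solr, and it will mis-stem many real words.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def toy_stem(word):
    """Toy stemmer: strip one common suffix, then undo a doubled last letter."""
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) >= 2 and word[-1] == word[-2]:
        word = word[:-1]
    return word

print(edit_distance("architect", "architecht"))  # 1
print([toy_stem(w) for w in ["stems", "stemmer", "stemming", "stemmed"]])
```

A fuzzy query can then accept any indexed term within a small edit distance of the query term, typically 1 or 2.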

SOLR Usage in Trellis
• Architecture
• Data-in from MySQL
• Index Update Strategy
• AutoComplete
• Basic Search
• Advanced Search
• Filters / Sorting / Facets & More
• Demo (Incl. Config Files)

How Google Search Works
• Crawling
  • robots.txt
• Indexing
  • Multiple indexes – Instant / Daily / Weekly / Long Tail
• Searching
  • NLP, Stemming, Auto-correct, etc.
• Ranking – PageRank
• Video - https://www.youtube.com/watch?v=BNHR6IQJGZs
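
For the ranking step, the basic PageRank idea can be sketched with a simple power iteration: a page's rank is shared among the pages it links to, with a damping factor. The four-page link graph and the damping value of 0.85 are assumptions for the example, and this says nothing about how Google's production ranking works today.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of linked pages."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph; every linked page must also appear as a key.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # c
```

Page `c` ranks highest because three of the four pages link to it.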

Other Search Technologies
• Elasticsearch
  • Much newer than Solr
  • Built-in scalability
  • Uses the same Lucene as its base
  • JSON instead of XML
  • Good for analytical querying
• Others
  • Splunk
  • Sphinx

That’s All Folks
References
• SOLR Home Page - http://lucene.apache.org/solr/
• Tutorials
  • http://www.solrtutorial.com/index.html
  • https://lucene.apache.org/solr/4_10_0/tutorial.html
• Just Google the rest!!
