The document provides a deep dive into the lifecycle of a Solr search request, from the initial HTTP request to the generation of the response. It describes each stage of processing, including how the request is routed through the Solr core, how the query and filters are parsed and executed against the index, how various caches and plugins can be leveraged, and how the final response is generated. It uses examples of simple and more complex queries to demonstrate how each component interacts throughout the processing pipeline.
The next major release of Solr is right around the corner! Join Solr Committer Cassandra Targett and Lucidworks SVP of Engineering Trey Grainger for a first look into what’s included in the upcoming release.
Ingesting and Manipulating Data with JavaScript - Lucidworks
Data in the wild isn’t always in the right format we need for search or even mere usability. Lucidworks Fusion offers powerful pipelines, parsers, and stages to wrangle your data into the right format to make it more findable and friendly. However, there are some cases where more obscure data will require the power of scripting.
Your data may need a complex transformation, a custom decryption algorithm, or you may already have existing code for handling a piece of data. Even in these more complex cases, Fusion’s JavaScript capabilities have got you covered.
Solr™ is the popular, blazing-fast, open source enterprise search platform from the Apache Lucene™ project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable, and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration, and more. Solr powers the search and navigation features of many of the world's largest internet sites, including AOL, Yahoo, Buy.com, CNET, CitySearch, Netflix, Zappos, StubHub, Digg, E*Trade, Disney, Apple, NASA, and MTV.
Got data? Let's make it searchable! This presentation will demonstrate getting documents into Solr quickly, will provide some tips in adjusting Solr's schema to match your needs better, and finally will discuss how to showcase your data in a flexible search user interface. We'll see how to rapidly leverage faceting, highlighting, spell checking, and debugging. Even after all that, there will be enough time left to outline the next steps in developing your search application and taking it to production.
The talk presents the sfSolrPlugin which transparently integrates the Solr search engine into symfony.
The talk explains:
* the features of the Solr search engine
* how to integrate the search engine into symfony
* complex search: faceted and geolocated search
* usage examples: http://www.menugourmet.com and http://resolutionfinder.org
Lucene powers the search capabilities of practically all library discovery platforms, by way of Solr, etc. The Lucene project evolves rapidly, and it's a full-time job to keep up with the ever improving features and scalability. This talk will distill and showcase the most relevant(!) advancements to date.
You’re Solr-powered and need to customize its capabilities. Apache Solr is flexibly architected, with practically everything pluggable. Under the hood, Solr is driven by the well-known Apache Lucene. Lucene for Solr Developers will guide you through the various ways in which Solr can be extended, customized, and enhanced with a bit of Lucene API know-how. We’ll delve into improving analysis with custom character mapping, tokenizing, and token filtering extensions; show why and how to implement specialized query parsing; and show how to add your own search and update request handling.
Overview of Solr 6.2 examples, including the features they have and the challenges they present. A contrasting demonstration of a minimum viable example. A step-by-step deconstruction of the "films" example to show which parts of the shipped examples are not actually needed.
Faster Data Analytics with Apache Spark using Apache Solr - Chitturi Kiran
Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. Spark SQL allows users to execute relational queries in Spark with distributed in-memory computations. Though Spark gives us faster in-memory computations, Solr is blazing fast for some analytic queries. In this talk, we will take a deep dive into how to optimize SQL queries from Spark to Solr by plugging into the Spark LogicalPlanner using pushdown strategies. The key takeaways from the talk will be:
How to perform Spark SQL queries with Apache Solr?
What happens inside a Spark SQL query?
How to plug into Spark Logical Planner?
What type of push-down strategies are optimal with Solr?
Examples of push-down strategies
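The core idea behind the pushdown strategies listed above is translating SQL-style predicates into Solr filter queries (fq clauses) so that filtering happens inside Solr instead of in Spark. A minimal illustrative sketch of that translation follows; it is not the spark-solr implementation, and the predicate representation is hypothetical:

```python
def to_solr_fq(filters):
    """Translate simple (op, field, value) predicates into Solr fq clauses,
    so filtering can be pushed down to Solr rather than done in Spark."""
    ops = {
        "eq": lambda f, v: f"{f}:{v}",
        "gt": lambda f, v: f"{f}:{{{v} TO *]",   # '{' = exclusive lower bound in Solr range syntax
        "lt": lambda f, v: f"{f}:[* TO {v}}}",   # '}' = exclusive upper bound
    }
    return [ops[op](field, value) for op, field, value in filters]

fqs = to_solr_fq([("eq", "genre", "drama"), ("gt", "year", 2000)])
# fqs == ["genre:drama", "year:{2000 TO *]"]
```

Each fq clause can then be sent to Solr as a cached filter, returning only matching documents to Spark.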
Presented at Lucene Revolution - http://sched.co/BAwV
Building your own search engine with Apache Solr - Biogeeks
Andrew Clegg : Building your own search engine with Apache Solr
Apache Solr (http://lucene.apache.org/solr/) is an open-source search engine based on the popular Lucene library with a huge variety of features. In this talk, Andrew describes how he used it to build a high-performance search tool for protein and domain structures at CATH, and talks about some of the surprisingly cool things you can do with it beyond simple searching.
Webinar: Solr & Spark for Real Time Big Data Analytics - Lucidworks
Lucidworks Senior Engineer and Lucene/Solr Committer Tim Potter presents common use cases for integrating Spark and Solr, access to open source code, and performance metrics to help you develop your own large-scale search and discovery solution with Spark and Solr.
Apache Solr serves search requests at enterprises and some of the largest companies around the world. Built on top of the top-notch Apache Lucene library, Solr makes it straightforward to integrate indexing and search into your applications.
Solr provides faceted navigation, spell checking, highlighting, clustering, grouping, and other search features. Solr also scales query volume with replication and collection size with distributed capabilities. Solr can index rich documents such as PDF, Word, HTML, and other file types.
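The faceted navigation mentioned above is driven by plain query parameters on Solr's select endpoint. As a hedged sketch (the collection and field names here are hypothetical), a facet request can be built like this:

```python
from urllib.parse import urlencode

def facet_query(base_url, q, facet_fields, rows=10):
    """Build a Solr /select URL that requests facet counts for the given fields."""
    params = [("q", q), ("rows", str(rows)), ("facet", "true")]
    # One facet.field parameter per field to count on
    params += [("facet.field", f) for f in facet_fields]
    return base_url + "/select?" + urlencode(params)

url = facet_query("http://localhost:8983/solr/products", "laptop", ["brand", "category"])
```

The response then includes a facet_counts section with per-value document counts for each requested field.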
Come learn how you can get your content into Solr and integrate it into your applications!
Solr 4.0 dramatically improves scalability, performance, and flexibility. An overhauled Lucene underneath sports near real-time (NRT) capabilities allowing indexed documents to be rapidly visible and searchable. Lucene’s improvements also include pluggable scoring, much faster fuzzy and wildcard querying, and vastly improved memory usage. These Lucene improvements automatically make Solr much better, and Solr magnifies these advances with “SolrCloud.” SolrCloud enables highly available and fault tolerant clusters for large scale distributed indexing and searching. There are many other changes that will be surveyed as well. This talk will cover these improvements in detail, comparing and contrasting to previous versions of Solr.
Search Engine Building with Lucene and Solr (SoCal Code Camp San Diego 2014) - Kai Chan
Slides for my presentation at SoCal Code Camp, June 29, 2014
(http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=6337660f-37de-4d6e-a5bc-46ba54478e5e)
Slides to the Hands On Spring Data lab, presented in Paris on Dec 10th, 2012. Code exercises are here: https://github.com/ericbottard/hands-on-spring-data
code4lib 2011 preconference: What's New in Solr (since 1.4.1) - Erik Hatcher
code4lib 2011 preconference, presented by Erik Hatcher of Lucid Imagination.
Abstract: The library world is fired up about Solr. Practically every next-gen catalog is using it (via Blacklight, VuFind, or other technologies). Solr has continued improving in some dramatic ways, including geospatial support, field collapsing/grouping, extended dismax query parsing, pivot/grid/matrix/tree faceting, autosuggest, and more. This session will cover all of these new features, showcasing live examples of them all, including anything new that is implemented prior to the conference.
Solr at Zvents: 6 Years Later & Still Going Strong - lucenerevolution
Presented by Amit Nithianandan, Lead Engineer Search/Analytics New Platforms, Zvents/Stubhub
Zvents has been a user of Apache Solr since 2007, when the project was still in its early days. Since then, the team has made extensive use of its various features and most recently completed an overhaul of the search engine to Solr 4.0. We'll touch on a variety of development and operational topics, including how we manage the build lifecycle of the search application using Maven, release the deployment package using Capistrano, and monitor using New Relic, as well as the extensive use of virtual machines to simplify node management. We’ll also talk about application-level details such as our unique federated search product, and the integration of technologies such as Hypertable, RabbitMQ, and EHCache to power more real-time ranking and filtering based on traffic statistics and ticket inventory.
Introduction to the basics of Information Retrieval (IR) with an emphasis on Apache Solr/Lucene. A lecture I gave during the JOSA Data Science Bootcamp.
Search is the Tip of the Spear for Your B2B eCommerce Strategy - Lucidworks
With ecommerce experiencing explosive growth, it seems intuitive that the B2B segment of that ecosystem is mirroring the same trajectory. That said, B2B has very different needs when it comes to transacting with the same style of experiences that we see in B2C. For instance, B2B ecommerce is about precision findability, whereas B2C customers can convert at higher rates when they’re just browsing online. In order for the B2B buying experience to be successful, search needs to be tuned to meet the unique needs of the segment.
In this webinar with Forrester senior analyst Joe Cicman, you’ll learn:
-Which verticals in B2B will drive the most growth, and how machine-learning powered personalization tactics can be deployed to support those specific verticals
-Why an omnichannel selling approach must be deployed in order to see success in B2B
-How deploying content search capabilities will support a longer sales cycle at scale
-What the next steps are to support a robust B2B commerce strategy supported by new technology
Speakers
Joe Cicman, Senior Analyst, Forrester
Jenny Gomez, VP of Marketing, Lucidworks
Customer loyalty starts with quickly responding to your customer’s needs. When it comes to resolving open support cases, time is of the essence. Time spent searching for answers adds up and creates inefficiencies in resolving cases at scale. Relevant answers need to be a few clicks away and easily accessible for agents directly from their service console.
We will explore how Lucidworks’ Agent Insights application automatically connects agents with the correct answers and resources. You’ll learn how to:
-Configure a proactive widget in an agent’s case view page to access resources across third-party systems (such as Sharepoint, Confluence, JIRA, Zendesk, and ServiceNow).
-Easily set up query pipelines to autonomously route assets and resources that are relevant to the case-at-hand—directly to the right agent.
-Identify subject matter experts within your support data and access tribal knowledge with lightning-fast speed.
How Crate & Barrel Connects Shoppers with Relevant Products - Lucidworks
Lunch and Learn during Retail TouchPoints #RIC21 virtual event.
***
Crate & Barrel’s previous search solution couldn’t provide its shoppers with an online search and browse experience consistent with the customer-centric Crate & Barrel brand. Meanwhile, Crate & Barrel merchandisers spent the bulk of their time manually creating and maintaining search rules. The search experience impacted customer retention, loyalty, and revenue growth.
Join this lunch & learn for an interactive chat on how Crate & Barrel partnered with Lucidworks to:
-Improve search and browse by modernizing the technology stack with ML-based personalization and merchandising solutions
-Enhance the experience for both shoppers and merchandisers
-Explore signals to transform the omnichannel shopping experience
Questions? Visit https://lucidworks.com/contact/
Learn how to guide customers to relevant products using eCommerce search, hyper-personalisation, and recommendations in our ‘Best-In-Class Retail Product Discovery’ webinar.
Nowadays, shoppers want their online experience to be engaging, inspirational, and fulfilling. They want to find what they’re looking for quickly and easily. If the sought-after item isn’t available, they want the next-best product or content surfaced to them. They want a website to understand their goals as though they were talking to a sales assistant in person, in-store.
In this webinar, we explore IMRG industry data insights and a best-in-class example of retail product discovery. You’ll learn:
- How AI can drive increased revenue through hyper-personalised experiences
- How user intent can be easily understood and results displayed immediately
- How merchandisers can be empowered to curate results and product placement – all without having to rely on IT.
Presented by:
Dave Hawkins, Principal Sales Engineer - Lucidworks
Matthew Walsh, Director of Data & Retail - IMRG
Connected Experiences Are Personalized Experiences - Lucidworks
Many companies claim personalization and omnichannel capabilities are top priorities. Few are able to deliver on those experiences.
For a recent Lucidworks-commissioned study, Forrester Consulting surveyed 350+ global business decision-makers to see what gets in the way of achieving these goals. They discovered that inefficient technology, lack of behavioral insights, and failure to tie initiatives to enterprise-wide goals are some of the most frequent blockers to personalization success.
Join guest speaker, Forrester VP and Principal Analyst, Brendan Witcher, and Lucidworks CEO, Will Hayes, to hear the results of the Forrester Consulting study, how to avoid “digital blindness,” and how to apply VoC data in real-time to delight customers with personalized experiences connected across every touchpoint.
In this webinar, you’ll learn:
- Why companies who utilize real-time customer signals report more effective personalization
- How to connect employees and customers in a shared experience through search and browse
- How Lucidworks clients Lenovo, Morgan Stanley and Red Hat fast-tracked improvements in conversion, engagement and customer satisfaction
Featuring
- Will Hayes, CEO, Lucidworks
- Brendan Witcher, VP, Principal Analyst, Forrester
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc... - Lucidworks
Intelligent Policing. Leveraging Data to more effectively Serve Communities.
Policing in the next decade is anticipated to be very different from historical methods: more data driven, more focused on the intricacies of the communities they serve, and more open and collaborative, to make informed recommendations a reality. Whether it's social populations, NIBRS, or organizational improvement that's the driver, the IT requirement is largely the same: provide access to large volumes of siloed data to gain a full 360-degree understanding of existing connections and patterns for improved insight and recommendation.
Join us for a round table discussion of how the Toronto Police Service is better serving their community through deploying a unified intelligent data platform.
Data innovation improves officers' engagement with existing data and streamlines investigation workflows by enhancing collaboration. This improved visibility into existing police data allows for a more intelligent and responsive police force.
In this webinar, we'll cover:
-The technology needs of an intelligent police force.
-How a Global Search improves an officer's interaction with existing data.
Featuring:
-Simon Taylor, VP, Worldwide Channels & Alliances, Lucidworks
-Michael Cizmar, Managing Director, MC+A
-Ian Williams, Manager of Analytics & Innovation, Toronto Police Service
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C... - Lucidworks
Wish your conversion rates were higher? Can’t figure out how to efficiently and effectively serve all the visitors on your site? Embarrassed by the quality of your product discovery experience? The bar is high and the influx of online shopping over recent months has reminded us that the opportunities are real. We’re all deep in holiday prep, but let’s take a few minutes to think about January 2021 and beyond. How can we position ourselves for success with our customers and against our competition?
Grab your lunch and let’s dive into three strategies that need to be part of your 2021 roadmap. You don’t need an army to get there. But you do need to take action and capitalize on the shoppers abandoning the product discovery journey on your site.
In this session, attendees will find out how to:
-Take control of merchandising at scale;
-Implement hands-free search relevancy; and
-Address personalization challenges.
AI-Powered Linguistics and Search with Fusion and Rosette - Lucidworks
For a personalized search experience, search curation requires robust text interpretation, data enrichment, relevancy tuning and recommendations. In order to achieve this, language and entity identification are crucial.
For teams working on search applications, advanced language packages allow them to achieve greater recall without sacrificing precision.
Join us for a guided tour of our new Advanced Linguistics packages, available in Fusion, thanks to the technology partnership between Lucidworks and Basistech.
We’ll explore the application of language identification and entity extraction in the context of search, along with practical examples of personalizing search and enhancing entity extraction.
In this webinar, we’ll cover:
-How Fusion uses the Rosette Basic Linguistics and Entity Extraction packages
-Tips for improving language identification and treatment as well as data enrichment for personalization
-Speech2 demo modeling Active Recommendation
-Use Rosette’s packages with Fusion Pipelines to build custom entities for specific domain use cases
Featuring:
-Radu Miclaus, Director of Product, AI and Cloud, Lucidworks, Lucidworks
-Robert Lucarini, Senior Software Engineer, Lucidworks
-Nick Belanger, Solutions Engineer, Basis Technology
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment - Lucidworks
Before COVID-19, almost 80% of the US workforce worked in service jobs that involve in-person interaction with strangers. Now, leaders of service organizations must reshape their offerings during the pandemic and prepare for whatever the new normal turns out to be. Our three panelists will share ideas for adapting their service businesses, now that closer-than-six-feet isn’t an option.
Join Lucidworks as we talk shop with 3 service business leaders, covering:
-Common impacts of the pandemic on service businesses (and what to do about them),
-How service teams can maintain a human touch across virtual channels, and
-Plans for the future, before and after the pandemic subsides.
Featuring
-Sara Nathan, President & CEO, AMIGOS
-Anthony Carruesco, Founder, AC Fly Fishing
-Sara Bradley, Chef and Proprietor, Freight House
-Justin Sears, VP Product Marketing, Lucidworks
Webinar: Smart Answers for Employee and Customer Support After COVID-19 - Europe - Lucidworks
The COVID-19 pandemic has forced companies to support far more customers and employees through digital channels than ever before. Many are turning to chatbots to help meet increasing demand, but traditional rules-based approaches can’t keep up. Our new Smart Answers add-on to Lucidworks Fusion makes existing chatbots and virtual assistants more intelligent and more valuable to the people you serve.
Smart Answers for Employee and Customer Support After COVID-19 - Lucidworks
Watch our on-demand webinar showcasing Smart Answers on Lucidworks Fusion. This technology makes existing chatbots and virtual assistants more intelligent and more valuable to the people you serve.
In this webinar, we’ll cover:
-How search and deep learning extend conversational frameworks for improved experiences
-How Smart Answers improves customer care, call deflection, and employee self-service
-A live demo of Smart Answers for multi-channel self-service support
Applying AI & Search in Europe - featuring 451 Research - Lucidworks
In the current climate, it’s now more important than ever to digitally enable your workforce and customers.
Hear from Simon Taylor, VP Global Partners & Alliances, Lucidworks and Matt Aslett, Research Vice President, 451 Research to get the inside scoop on how industry leaders in Europe are developing and executing their digital transformation strategies.
In this webinar, we’ll discuss:
-The top challenges and aspirations European business and technology leaders are solving using AI and search technology
-Which search and AI use cases are making the biggest impact in industries such as finance, healthcare, retail and energy in Europe
-What technology buyers should look for when evaluating AI and search solutions
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy - Lucidworks
In this webinar with 451 Research, you'll understand how retailers are using AI to predict customer intent and learn which key performance metrics are used by more than 120 online retailers in Lucidworks’ 2019 Retail Benchmark Survey.
In this webinar, you’ll learn:
● What trends and opportunities are facing the ecommerce industry in 2020
● Why search is the universal path to understanding customer intent
● How large online retailers apply AI to maximize the effectiveness of their personalization efforts
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ... - Lucidworks
Nordstrom Rack | Hautelook curates and serves customers a wide selection of on-trend apparel, accessories, and shoes at an everyday savings of up to 75 percent off regular prices. With over a million visitors shopping across different platforms every day, and a realization that customers have become accustomed to robust and personalized search interactions, Nordstrom Rack | Hautelook launched an initiative over a year ago to provide data science-driven digital experiences to their customers.
In this session, we’ll discuss Nordstrom Rack | Hautelook’s journey of operationalizing a hefty strategy, optimizing a fickle infrastructure, and rallying troops around a single vision of building an expansible machine-learning driven product discovery engine.
The audience will learn about:
-The key technical challenges and outcomes that come with onboarding a solution
-The lessons learned of creating and executing operational design
-The use of Lucidworks Fusion to plug custom data science models into search and browse applications to understand user intent and deliver personalized experiences
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceLucidworks
Knowledge graphs and machine learning are on the rise as enterprises hunt for more effective ways to connect the dots between the data and the business world. With newer technologies, the digital workplace can dramatically improve employee engagement, data-driven decisions, and actions that serve tangible business objectives.
In this webinar, you will learn
-- Introduction to knowledge graphs and where they fit in the ML landscape
-- How breakthroughs in search affect your business
-- The key features to consider when choosing a data discovery platform
-- Best practices for adopting AI-powered search, with real-world examples
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
UiPath Test Automation using UiPath Test Suite series, part 4
Lifecycle of a Solr Search Request - Chris "Hoss" Hostetter, Lucidworks
1. Lifecycle of a
Solr Search
Request
Chris "Hoss" Hostetter - 2017-09-14
https://home.apache.org/~hossman/rev2017/
https://twitter.com/_hossman
https://www.lucidworks.com/
Abstract:
This intermediate session for existing Solr users will provide a
Deep Dive look into the lifecycle of a Solr Search Request. We
will drill down through each layer of code, discussing what
happens at each stage -- including when & how inter-node
communication takes place in a multi-node SolrCloud cluster.
Along the way, we will also review the various places where
users can configure existing (or custom written) plugins to
override or amend the default behavior.
Lifecycle of a Solr Search Request https://people.apache.org/~hossman/rev2017/
1 of 24 10/4/17, 4:32 PM
2. Agenda
Deep Dive look into the lifecycle of 4 Solr Search Requests...
Single Node: Single SolrCore
1. Simple Query
2. Facet Query
SolrCloud: 2 Shards + 2 Replicas
3. Simple Query
4. Facet Query
...and where various types of Plugins can be used.
3. Simple Query
Single Node: Single SolrCore
bin/solr -e techproducts
http://localhost:8983/solr/techproducts/select
? q = ipod
& sort = inStock desc, score desc
& fl = id, name
& rows = 10
This sample paginated query is based on the techproducts
example configs & data that have been included in every release of Solr
since it was first open sourced.
I have a nostalgic affection for this silly little dataset.
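Such a request can be assembled with any HTTP client. A minimal Python sketch (host, port, and collection match the techproducts example above; `urlencode` handles the spaces and commas in the param values):

```python
from urllib.parse import urlencode

# The same paginated query shown on the slide, built programmatically.
params = {
    "q": "ipod",
    "sort": "inStock desc, score desc",
    "fl": "id, name",
    "rows": 10,
}
url = "http://localhost:8983/solr/techproducts/select?" + urlencode(params)
# `url` can now be fetched with any HTTP client (urllib.request, curl, ...)
```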
4. HTTP (Jetty)
SolrDispatchFilter
Solr Webapp/solr ➔
CoreContainer
/techproducts ➔ SolrCore
/select? ➔ RequestHandler
SolrCore
foo
SolrCore
etc...
wt=json ➔ ResponseWriter
...:8983/solr/techproducts/select?...
UI:HTML,Javascript,
Images,CSS
SolrCore
techproducts
Purple: The HTTP layer, currently implemented by Jetty
Blue: Solr runs as a "webapp" inside the Jetty Servlet container (but
that's just an implementation detail)
Black: The key pieces of the Solr webapp: misc "flat files" that power
the Solr UI, and the SolrDispatchFilter which is responsible
for mapping all HTTP request/responses into their internal Solr
representations and executing them
Red: CoreContainer is the singleton responsible for managing the
lifecycle of SolrCores
Green: each SolrCore encapsulates the configs & data for a single
"index" (which in a SolrCloud configuration would be a replica of
some shard of some collection)
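As a rough illustration of the routing layers above, here is a hypothetical, heavily simplified Python model of the path resolution; the `cores` dict and `dispatch` function are invented for this sketch and do not correspond to Solr's actual Java classes:

```python
# CoreContainer maps the core name in the path to a SolrCore; the SolrCore
# then maps the remaining path to a named RequestHandler.
cores = {
    "techproducts": {
        "/select": "SearchHandler",
        "/update": "UpdateRequestHandler",
    },
}

def dispatch(path):
    # /solr/<core>/<handler>  ->  (core name, handler name)
    _, core_name, handler = path.lstrip("/").split("/", 2)
    return core_name, cores[core_name]["/" + handler]

core, handler = dispatch("/solr/techproducts/select")
```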
5. SolrCore: techproducts
SolrRequestHandlers SearchComponents
QueryComponent: query
- prepare()
- df=text&q=ipod ➔ Query
- etc...
- process()
- etc...
SearchHandler: /select
- initParams
- df = text (default)
- components (implicit)
- query
- etc...
SearchHandler: /etc...
UpdateRequestHandler : /etc...
FacetComponent: facet
etc...
Green: The SolrCore used for this (HTTP) request
Black: Named instances of (plugable) SolrRequestHandlers.
SearchHandler is the most common, and it uses a configurable
list of SearchComponents
Red: Named instances of (plugable) SearchComponents,
QueryComponent is the only one used in this simple request
All SearchComponents implement prepare() & process()
methods, which are called by SearchHandler
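The prepare()/process() contract can be modeled with a small Python sketch (a simplified illustration, not Solr's actual Java API):

```python
class SearchComponent:
    """Simplified model of the SearchComponent contract."""
    def prepare(self, req, rsp): ...
    def process(self, req, rsp): ...

class QueryComponent(SearchComponent):
    def prepare(self, req, rsp):
        # validate & parse request params (q, sort, fl, rows, ...)
        rsp["parsed_q"] = req["q"]
    def process(self, req, rsp):
        # execute the parsed query against the SolrIndexSearcher
        rsp["results"] = "docs matching " + rsp["parsed_q"]

class SearchHandler:
    def __init__(self, components):
        self.components = components
    def handle(self, req):
        rsp = {}
        for c in self.components:  # first pass: every component prepares
            c.prepare(req, rsp)
        for c in self.components:  # second pass: every component processes
            c.process(req, rsp)
        return rsp

rsp = SearchHandler([QueryComponent()]).handle({"q": "ipod"})
```

The two-pass structure is why a later component (e.g. FacetComponent) can set flags in prepare() that influence how QueryComponent's process() executes the search.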
6. SolrIndexSearcher
query
IndexSchema
- SchemaFields ➔ FieldTypes
QueryComponent.prepare()
+ rows=10 ➔ ok?
fl=id,name ➔ ok?
/ q ➔ LuceneQParser
LuceneQParser + (df=text ➔ text) + "ipod" ➔ TermQuery
( "inStock desc" ➔ bool ➔ BoolField.getSortField(inStock,desc)
+ "score desc" ➔ SortField.SCORE ) ➔ Sort
TextField: text
- Analyzer
- Similarity
- etc...
TextField: etc..
- Analyzer
- Similarity
- etc...
BoolField: bool
- Analyzer
- Similarity
- getSortField
- etc...
LuceneQParser
DismaxQParser
etc...
Red: QueryComponent.prepare() and its logic for
validating & parsing the basic request params
Green: Named instances of (pluggable) QParserPlugins for
parsing query strings (q & fq params). Here the (implicit) default
LuceneQParser
Orange: The IndexSchema which contains...
Named SchemaFields (or dynamicFields) which map
to...
Purple: Named instances of (pluggable) FieldTypes which
dictate how the field names mapped to them are parsed,
indexed, sorted, queried, etc...
Blue: The SolrIndexSearcher is ultimately what will be
queried with these parsed queries & sort objects
8. Red: QueryComponent.process() which uses the
SolrIndexSearcher to execute the Query created by its
prepare() method
Blue: the SolrIndexSearcher includes several caches in
addition to the InvertedIndex, and when executing a query, first
evaluates the start/rows requested to fit a configured "window size"
so that "page #2" type requests can result in a cache hit & re-use the
results computed for "page #1"
Orange: The low level InvertedIndex & the
queryResultCache that can be used in its place when
executing basic searches, & the DocList containing a sorted
list of (internal) doc#s and their scores for the requested
start+rows of this query
Purple: The Stored Fields of the documents in the index & the
documentCache used by SolrIndexSearcher to
reduce disk reads when popular documents are frequently
matched by searches
Green: Named instances of (pluggable)
QueryResponseWriters which dictate how the data structures
produced once a request is processed get serialized into bytes (for
the HTTP response returned to the original client by Jetty)
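The "window size" rounding described above can be sketched as follows; `window_size=20` is an assumed `queryResultWindowSize` config value, used here only for illustration:

```python
def rows_to_collect(start, rows, window_size=20):
    """Round the requested doc range up to a multiple of the configured
    queryResultWindowSize so adjacent pages share one cache entry."""
    needed = start + rows
    # round up to the next multiple of window_size
    return ((needed + window_size - 1) // window_size) * window_size

page1 = rows_to_collect(start=0, rows=10)   # collects 20 doc ids
page2 = rows_to_collect(start=10, rows=10)  # also 20: served from the
                                            # queryResultCache entry for page 1
```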
9. More Complex Query
Single Node: Single SolrCore
http://localhost:8983/solr/techproducts/select
? q = ipod
& fq = price:[* TO 1000]
& sort = div(popularity,price) asc,
score desc
& fl = id, name, why:[explain style=nl]
& facet = true
& facet.field = cat
This slightly more interesting query builds off the previous example by:
Adding a "filter query" on the (numeric) price field
Changing the primary sort criteria to be a mathematical function
against 2 fields
Requesting an additional pseudo-field explaining the score of each
document
Faceting on the "cat" (aka: category) field
10. HTTP (Jetty)
SolrDispatchFilter
Solr Webapp/solr ➔
CoreContainer
/techproducts ➔ SolrCore
/select? ➔ RequestHandler
SolrCore
foo
SolrCore
etc...
wt=json ➔ ResponseWriter
...:8983/solr/techproducts/select?...
UI:HTML,Javascript,
Images,CSS
SolrCore
techproducts
The HTTP, Webapp, DispatchFilter, CoreContainer, SolrCore, and
RequestHandler layers all function exactly as in our previous (simpler)
example. It's only once the SearchHandler starts looping over the
components that things get more interesting....
11. query
IndexSchema
- SchemaFields ➔ FieldTypes
QueryComponent.prepare()
etc...
"price:[* TO 1000]" ➔ float
➔ PointRangeQuery(...) ➔ filters[]
div(popularity,price)
➔ ValueSource(IntFieldSource,...)
FloatPointField: float
- ValueSource
- getRangeQuery()
- etc...
IntPointField: int
- ValueSource
- etc...
FacetComponent.prepare()
facet=true ✔
facet.field=cat ➔ ok?
needDocSet = true
SolrIndexSearcher
div()
sum()
etc...
Most items identical to those shown in the "simple" query are omitted for
brevity. Of the new items shown here...
Red: In addition to some extra logic in the
QueryComponent.prepare() method (to parse the filter
query and more complex sort) we now also see the
FacetComponent.prepare() method, which does its own
validation & sets a flag indicating that it needs extra info (the
DocSet) once SolrIndexSearcher is asked to execute the
Query
Green: Named instances of (pluggable) ValueSourceParsers
for parsing function strings -- used here in our sort, but could also be
used in queries
Orange: As before the IndexSchema, now showing that
FieldTypes are also responsible for providing the range query
(filter) and ValueSources (used by the functions)
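How a ValueSource composes per-document values for sorting can be illustrated with a toy Python model; the `field`/`div` helpers are invented for this sketch (Solr's real ValueSources are Java classes resolved via ValueSourceParsers):

```python
def field(name):
    # ValueSource over a field: per-document value lookup
    return lambda doc: doc[name]

def div(a, b):
    # ValueSource composing two others, mirroring sort=div(popularity,price)
    return lambda doc: a(doc) / b(doc)

sort_key = div(field("popularity"), field("price"))

docs = [
    {"id": "A", "popularity": 10, "price": 5.0},   # 10 / 5.0 = 2.0
    {"id": "B", "popularity": 9,  "price": 1.0},   #  9 / 1.0 = 9.0
]
ordered = sorted(docs, key=sort_key)  # ascending, matching "asc" in the sort
```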
12. SolrIndexSearcher
query
QueryComponent.process()
search(...) ➔〈DocList,DocSet〉
etc...
JsonResponseWriter
DocList {
+ searcher.doc(#)
➔ Stored Fields
+ [explain ...]
}
+ Facet Counts
➔ Bytes ➔ HTTP...
ExplainAugmenter
ChildDocTransformer
query
FacetComponent.process()
For Each "cat" Index Terms:
➔ Intersect with DocSet
SubQueryAugmenter
etc...
searcher.explain(#)
documentCache
queryResultCache
filterCache
IndexReader
- InvertedIndex
- Stored Fields
Most items identical to those shown in the "simple" query are omitted for
brevity. Of the new items shown here...
Red: Now when QueryComponent.process() executes the
search, the "needDocSet" flag set by
FacetComponent.prepare() is also used.
FacetComponent.process() can then use the resulting
DocSet (an unordered set of all matching doc# -- regardless of sort)
to compute the facet counts.
Olive: Named instances of (pluggable) DocTransformers (or
Augmenters) which can be used to annotate individual documents
returned in the results. For this query in particular we see the
ExplainAugmenter which uses the SolrIndexSearcher to
get a (debugging) data structure "explaining" how the score of each
document was computed.
Green: the JsonResponseWriter not only returns the Stored
Fields of each document, but also the results of any
DocTransformers. It also serializes the Facet Counts.
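The per-term intersection performed by FacetComponent.process() can be modeled in a few lines of Python (toy data; real Solr uses optimized bitset and doc-values implementations rather than Python sets):

```python
# For each indexed term of the facet field, intersect that term's
# (inverted-index) doc set with the DocSet of all query matches.
inverted_index = {            # term -> doc ids containing it (toy data)
    "electronics": {1, 2, 3, 5},
    "music":       {2, 4},
    "memory":      {5, 6},
}
doc_set = {1, 2, 5}           # unordered set of all docs matching the query

facet_counts = {
    term: len(docs & doc_set)
    for term, docs in inverted_index.items()
}
```

This is also why the DocSet (unordered, sort-independent) is enough for faceting, while the DocList (sorted, paginated) is what drives the returned results.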
13. Simple Query
SolrCloud: 4 Nodes, 2 Shards, 2 Replicas
bin/solr -e cloud
...
http://localhost:8983/solr/techproducts/select
? q = ipod
& sort = inStock desc, score desc
& fl = id, name
& rows = 10
This is the same as our original simple query, still using the
techproducts sample configs & data, but from here on we'll assume
we're using a 4 node SolrCloud cluster, with the techproducts
collection configured to have 2 shards, with a replication factor of 2.
14. SolrDispatchFilter
/techproducts ➔ tech_s1_r2
Jetty: http://host1:8983
SolrDispatchFilter
/techproducts ?➔ host4
Jetty: http://host3:8983
SolrDispatchFilter
/techproducts ?➔ tech_s2_r2
Jetty: http://host2:8983
SolrDispatchFilter
/techproducts ➔ tech_s2_r1
Jetty: http://host4:8983
techproducts
tech_s1_r2
foo
foo_s1_r1
foo
foo_s2_r1
techproducts
tech_s1_r1
techproducts
tech_s2_r1
foo
foo_s1_r2
techproducts
tech_s2_r2
foo
foo_s2_r2
Purple: 4 Jetty instances, running on (the same port 8983 of) 4
different hosts
Black: The 4 SolrDispatchFilters running inside each of
these 4 Jetty instances, and how each of them resolves requests for
the techproducts collection.
Green: the individual SolrCores (which are each a replica of some
shard of a collection) running in each Solr node. Note that for the
purposes of illustrating the different possible ways a Solr request may be
routed, host3 does not contain any SolrCores that are part of the
techproducts collection.
(Other Layers such as the Solr webapp and the CoreContainer have
been omitted to save space)
15. coordinator shard1
QueryComponent:
prepare() + process()
α: q=ipod&fl=id&fsv=true
➔ top ids + sort values
β1: ids=X,Y,Z&fl=name ➔ ...
shard2
QueryComponent:
prepare() + process()
α: q=ipod&fl=id&fsv=true
➔ top ids + sort values
β2: ids=A,..,G&fl=name ➔ ...
SearchHandler: /select
Repeat until done:
query.distributedProcess
➔ ShardRequests (α,β)
Loop: ShardRequests
query.handleResponse
QueryComponent:
distributedProcess()
α: shard top10 + sort values
β: full fl for final top10 ids
FacetComponent
16. Purple: The HTTP Layer showing 3 hosts: an arbitrary 'coordinator'
node, and 2 nodes each hosting a replica of the 2 shards for the
collection
Black: SearchHandler. On the coordinator node,
SearchHandler executes new logic to issue sub-requests
created by its SearchComponents to arbitrarily selected replicas
of each shard. On the replicas handling these sub-requests, the
SearchHandler processes these requests just as if they were
simple (single node) queries.
Red: SearchComponent methods. On the coordinator node
SearchHandler loops over every component calling
SearchComponent.distributedProcess() to
create/modify sub-requests for the individual shards, and then calls
SearchComponent.handleResponse() to merge the
results from each shard and decide if/when/what additional
information may be needed. This process repeats until all calls to
distributedProcess() on all SearchComponents
indicate that they are finished.
Green & Blue: The 2 stages (α & β) of shard sub-requests needed to
process this simple query. Note that the α-requests are identical for
both shards, but the β-requests are slightly different to request the
fl fields for the matches specific to that shard.
17. Shard Request α
q=ipod&fl=id&fsv=true&rows=10
sort=inStock desc, score desc
numFound=42+314=356
Z, Zebra
F, Frog
B, Boat
D, Deer
C, Car
X, X-Ray
G, Gong
A, Apple
Y, Yo-Yo
E, Ear
Merged
Shard 1
numFound=42
F〈true,6〉
B〈true,6〉
D〈true,5〉
C〈true,3〉
G〈true,2〉
A〈true,1〉
E〈false,5〉
Shard 2
numFound=314
Z〈true,6〉
X〈true,3〉
Y〈false,9〉
Shard Request β
q=ipod&ids=...&fl=name
Shard 1
A, Apple
B, Boat
C, Car
D, Deer
E, Ear
F, Frog
G, Gong
Shard 2
X, X-Ray
Y, Yo-Yo
Z, Zebra
Here we see hypothetical α request+responses, hypothetical β
requests+responses, & the final Merged results from both -- showing how
the IDs and sort values from the α request are used to determine which
documents will be in the final results, and in which order. For these specific
documents, the β requests+responses fill in the fl fields for the final
client.
Red & Blue: The responses from shard1 & shard2 for the α request
Green & Purple: The responses from shard1 & shard2 for the β
request
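The α-response merge above can be simulated directly (data taken from the diagram; ties on identical sort values are broken arbitrarily in this sketch, so Z/F/B may merge in a different order than shown on the slide):

```python
# Each shard's α response: (id, inStock, score), already sorted per shard.
shard1 = [("F", True, 6), ("B", True, 6), ("D", True, 5), ("C", True, 3),
          ("G", True, 2), ("A", True, 1), ("E", False, 5)]
shard2 = [("Z", True, 6), ("X", True, 3), ("Y", False, 9)]

# Coordinator merge: sort = inStock desc, score desc; keep rows=10 ids.
merged = sorted(shard1 + shard2,
                key=lambda d: (d[1], d[2]), reverse=True)[:10]
top_ids = [d[0] for d in merged]

# The β requests then ask each shard for fl=name of just its own ids.
shard1_ids = {d[0] for d in shard1}
beta_ids = {
    "shard1": [i for i in top_ids if i in shard1_ids],
    "shard2": [i for i in top_ids if i not in shard1_ids],
}
```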
18. Complex Query*
SolrCloud: 4 Nodes, 2 Shards, 2 Replicas
http://localhost:8983/solr/techproducts/select
? q = ipod
& sort = inStock desc, score desc
& fl = id, name
& facet = true
& facet.field = cat
In the interest of time, this query is not as "Complex" as the "Complex"
Single Core query we looked at before. I've omitted things like fq params,
sorting on functions, and the use of DocTransformers in the fl
because nothing about how those are handled in a Single Core query
changes when they are requested by a coordinator node in a SolrCloud
query.
19. coordinator shard1
QueryComponent:
prepare() + process()
α: q=ipod&fl=id&fsv=true
➔ top ids + sort values
β1: ids=X,Y,Z&fl=name ➔...
FacetComponent:
prepare() + process()
α: facet.limit=N + extra
➔ top terms w/counts
β1: ..._terms=aa,qq,... ➔...
QueryComponent:
distributedProcess()
α: shard top10 + sort values
β: full fl for final top10 ids
shard2
FacetComponent:
distributedProcess()
α: facet.field=cat
w/facet.limit overrequest
β: request missing counts
for final top terms
SearchHandler: /select
➔ ShardRequests (α, β)
20. Purple: The HTTP Layer showing 3 hosts: an arbitrary 'coordinator'
node, and 2 nodes each hosting a replica of the 2 shards for the
collection. To save space, the (largely redundant) details of the
requests to shard2 are not shown.
Black: SearchHandler. To save space, the details (shown in
previous diagrams) regarding how SearchHandler processes
requests when acting as a coordinator have been omitted -- the key
thing to note is that even with the added complexity of the
FacetComponent, there are still only 2 stages of sub-requests to
each shard (α & β)
Red: SearchComponent methods:
QueryComponent behaves exactly as before
Now that FacetComponent is in use, it can modify the sub-
requests created by QueryComponent to "piggy back" on
them and request additional information from each shard.
Green & Blue: The 2 stages (α & β) of shard sub-requests needed to
process this query. Although the details of the requests to shard2 are
omitted for brevity, the α-requests are identical for both shards, and
(as before) the β-requests are slightly different to request both the
fl fields for the document matches specific to that shard, as well
as the facet counts for any "candidate" terms that were not included
in the α response from that shard.
22. Here we see the additional information involved in α & β
requests+responses+merging for our more complex queries compared to
what we looked at before. The information requested & merged by
QueryComponent is omitted for brevity, and we focus solely on how
FacetComponent modifies those requests to "overrequest" the
original facet.limit and what it does with the results.
In the α request, over-request additional terms from each shard beyond
what the user asked for; in the β request, ask each shard for the details
about any terms that are "candidates" for the final results but were NOT
already returned by this shard in the α response.
Each term that is a candidate for the final response is shown in a unique
color. Black/Grey is used to indicate terms where incomplete information
is available to the coordinator, but enough is known to be confident that
they can't possibly be candidates for the final results. Faded terms (in
italics) show at what stage the coordinating FacetComponent knows
that particular term can be eliminated from consideration.
(While the "..." ellipses are used to denote the possibility of many
additional terms depending on the value of facet.limit=N (which
defaults to 100), viewers may find the easiest way to understand how
these results are merged & refined is to assume N=3 and imagine the
ellipses do not exist in the diagram)
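The overrequest + refinement flow can be sketched as follows. The `limit * 1.5 + 10` formula reflects Solr's documented defaults for `facet.overrequest.ratio` and `facet.overrequest.count`; the per-shard counts below are invented toy data:

```python
def shard_facet_limit(limit, ratio=1.5, count=10):
    # Each shard is asked for more terms than the client requested, so terms
    # sitting just below the cutoff on one shard still reach the coordinator.
    return int(limit * ratio) + count

alpha = {  # α responses: per-shard top terms with counts (toy data)
    "shard1": {"aa": 10, "bb": 8, "cc": 5},
    "shard2": {"aa": 4, "dd": 9, "ee": 2},
}

# Every term any shard reported is a candidate for the final result.
candidates = set().union(*alpha.values())

# β refinement: ask each shard for exact counts of candidate terms it
# did NOT already report in its α response.
refine = {shard: sorted(candidates - set(terms))
          for shard, terms in alpha.items()}
```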
23. Q & A