Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)

Rebuilding Solr 6 examples –
layer by layer
Alexandre Rafalovitch
www.solr-start.com

Who am I
• Software developer with 20+ years of experience
– Including 3 years as Senior Tech Support (BEA Weblogic)
• Solr popularizer
• Published book author on Solr Indexing (for Solr 4.3)
• Run http://www.solr-start.com resource site
• Solr committer (since August 2016)
• Past and present Solr focus on onboarding, usability,
tooling, information sharing

Example catch-22
• Search is, surprisingly, a complex area of expertise
• Solr is a complex product
– Wide
– Deep
– History-rich
• And so are its many examples

Fasten the seatbelt
• Review all of the (Solr 6.2) OOTB examples
• Make a small one from scratch
• Deconstruct a real shipped example
• Next learning action...

OOTB Examples – how many?
bin/solr start -e
-e <example> Name of the example to run; available examples:
cloud: SolrCloud example
techproducts: Comprehensive example illustrating many of Solr's core capabilities
dih: Data Import Handler
schemaless: Schema-less example

techproducts example
• Used to be collection1
• solr.home: example/techproducts/solr
– Can restart with
bin/solr start -s example/techproducts/solr
– Actual core at
example/techproducts/solr/techproducts

techproducts example (cont.)
• Source configuration
– server/solr/configsets/sample_techproducts_configs
– Not actually a configset (copy, not share)
• Can be rebuilt
rm -rf example/techproducts
• Has data (14 files of products, money, utf8 tests)
bin/post -c techproducts example/exampledocs/*.xml

schemaless example
• solr.home: example/schemaless/solr
• Actual core: example/schemaless/solr/gettingstarted
• Source configuration:
– server/solr/configsets/data_driven_schema_configs
– The config you get when you do not specify one:
bin/solr create -c newcore
• No data, but can take (nearly) anything:
bin/post -c <name> example/exampledocs/*.xml

schemaless mode?
• “Let us guess what you mean”
– Auto-guess field type based on first content occurrence
– Create explicit field definitions
• booleans, dates, numbers, strings
• Always multivalued (because: who knows?!?)
• Can be configured (URP chain in solrconfig.xml)
– Rewrites managed-schema (comments begone!)
– Makes search work with
<copyField source="*" dest="_text_"/>
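The guessing idea can be sketched in a few lines. This is a simplified illustration, not Solr's actual update processor code: the function name guess_type and the single date pattern are assumptions, and the returned names are the multivalued types mentioned later in this deck.

```python
from datetime import datetime

def guess_type(value):
    """Guess a schemaless-style field type from one raw value (simplified sketch)."""
    text = str(value).strip()
    if text.lower() in ("true", "false"):
        return "booleans"
    try:
        int(text)
        return "tlongs"
    except ValueError:
        pass
    try:
        float(text)
        return "tdoubles"
    except ValueError:
        pass
    try:
        # Assumption: only one ISO-8601 UTC pattern is tried here.
        datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
        return "tdates"
    except ValueError:
        pass
    return "strings"  # fallback: anything that is not recognized

for sample in ("true", "42", "0.0", "2016-10-13T00:00:00Z", "0.0,USD"):
    print(sample, "=>", guess_type(sample))
```

The order matters: the first guess that fits wins, which is why the first document to use a field decides its type.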

techproducts vs schemaless
• Configured techproducts vs auto-detecting schemaless
• Strings
– techproducts: "name":"Test with some GB18030 encoded characters",
– schemaless: "name":["Test with some GB18030 encoded characters"],
• Numbers
– techproducts: "price":0.0, "price_c":"0.0,USD",
– schemaless: "price":[0.0],
• Booleans
– techproducts: "inStock":true,
– schemaless: "inStock":[true],

cloud example
• Highly configurable (unless using -noprompt)
• solr.home: example/cloud/nodeX/solr
• Source configuration is a choice
Please choose a configuration for the gettingstarted collection, available
options are: basic_configs, data_driven_schema_configs, or
sample_techproducts_configs [data_driven_schema_configs]
• Can be rebuilt:
bin/solr stop -all
rm -rf example/cloud
• Demonstrates Config API (configoverlay.json)

dih example(s)
• Data import handler – legacy, but still kicking
• solr.home: example/example-DIH/solr
• Has 5 (five!) different cores
– db - database import (example/example-DIH/hsqldb/ex.*)
– solr - import from another Solr core (configured for db core)
– mail - import from IMAP (needs some configuration)
– tika - import rich-content (example/exampledocs/solr-word.pdf)
– rss - external XML feed (very broken right now)
• Cannot be rebuilt – only emptied
bin/post -c db -type 'application/json' -d '{delete: {query:"*:*"}}'

What about: bin/solr start?
• solr.home: server/solr
• No initial collection/cores, have to create explicitly:
– With script (see bin/solr create_core -h for details):
bin/solr create -c <corename> -d <name or path>
– With Core Admin UI for non-SolrCloud:
http://localhost:8983/solr/admin/cores?action=CREATE&…
– With Collection API for SolrCloud:
http://localhost:8983/solr/admin/collections?action=CREATE&…

basic_configs configuration
• Available for cloud example
and explicit creation
• Schemaless mode is configured, not enabled
• “Minimal Solr configuration” !?!
– managed-schema: 1005 lines
– solrconfig.xml: 1484 lines

files example
• Specifically tuned for file indexing
– Augmented schemaless mode with language,
content-type guessing
– Custom /browse end-point
– Source configuration: example/files/conf
– Setup instructions: example/files/README.txt
– Bring your own data

films example
• Schemaless (Based on data_driven_schema_configs)
– Uses Schema API to add custom fields
– Uses schemaless for rest of fields
• Comes with its own data (1100 film records)
• Uses velocity (/browse), Schema API, Request
Parameters API (params.json)
• Setup instructions: example/films/README.txt

That was the good news
• Many examples
• Easy to get one running
• Some come with data
• Some you can throw your own data into
• Lots of comments

This is the bad news
Config                 Files  Types  Fields  Dynamic fields  managed-schema (lines)  solrconfig.xml (lines)
basic                  46     71     4       73              1005                    1484
data_driven            46     71     4       73              1005                    1482
techproducts           101    66     33      28              1149                    1701
dih db                 62     62     31      28              1129                    1490
dih tika               6      61     3       27              901                     1466
files                  69     73     9       73              517                     1508
films (data_driven+)   46     71     8       73              481                     1482

Tip – getting these numbers
• XML extraction with XMLStarlet (XSLT CLI)
– xml sel -t -m "//fieldType" -v @name -n managed-schema
– xml sel -t -m "//copyField" -c . -n managed-schema |wc -l
– xml sel -t -m "//*[@docValues]" -v "concat(local-name(), ' ', @name, ' docValues:', @docValues)" -n managed-schema
– xml sel -t -m "//requestHandler" -v "@name" -n solrconfig.xml
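If XMLStarlet is not installed, roughly the same counts can be obtained with Python's standard library. A minimal sketch: SAMPLE_SCHEMA is a made-up miniature schema standing in for a real managed-schema, and count_schema_elements is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature schema, standing in for a real managed-schema file.
SAMPLE_SCHEMA = """<schema name="demo" version="1.6">
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="text" type="text_basic" indexed="true" stored="false"/>
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <copyField source="*" dest="text"/>
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text_basic" class="solr.TextField"/>
</schema>"""

def count_schema_elements(schema_xml):
    """Count schema building blocks wherever they appear in the tree."""
    root = ET.fromstring(schema_xml)
    tags = ("fieldType", "field", "dynamicField", "copyField")
    return {tag: len(list(root.iter(tag))) for tag in tags}

print(count_schema_elements(SAMPLE_SCHEMA))
```

For a real config, read the file content first (for example with open("managed-schema").read()) and pass it in.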

Why is it like this?
• Many examples predate Solr Reference Guide
• grep for options, possibilities, defaults
• Each example is a kitchen sink
“Too much of a good thing is also a bad thing”
Source: 1980s Soviet joke about Virtual Reality

Go small – managed-schema
<schema name="demo" version="1.6">
<dynamicField name="*" type="string"
indexed="true" stored="true" multiValued="true"/>
<field name="text" type="text_basic"
indexed="true" stored="false" multiValued="true"/>
<copyField source="*" dest="text"/>
…

Go small – managed-schema(2)
…
<fieldType name="string" class="solr.StrField"/>
<fieldType name="text_basic" class="solr.TextField">
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory" />
</analyzer>
</fieldType>
</schema>

Go small – solrconfig.xml
<config>
<luceneMatchVersion>6.2.0</luceneMatchVersion>
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="df">text</str>
</lst>
</requestHandler>
</config>

Go small – load and test
• bin/solr create -c demo -d .../demo-config/
• bin/post -c demo example/exampledocs/*.xml
• Test it works, using HTTPie (HTTP CLI)
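Any HTTP client will do for the test. A minimal sketch using only the Python standard library, assuming Solr runs on the default localhost:8983 and the core is named demo as above; select_url and search are hypothetical helper names, and the query term ipod is just an example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def select_url(core, query, base="http://localhost:8983/solr", rows=10):
    """Build a /select URL for the given core and query string."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base}/{core}/select?{params}"

def search(core, query):
    """Run the query; needs a live Solr, so it is not called here."""
    with urlopen(select_url(core, query)) as response:
        return json.load(response)

print(select_url("demo", "ipod"))
```

With Solr running, search("demo", "ipod") should return the parsed JSON response; because the handler sets df=text, a bare term is searched in the catch-all text field.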

Go small - review
• Minimal example could be very minimal
• Some things will not work
– No uniqueKey: no way to update documents, no SolrCloud
– No _version_: no SolrCloud
– Everything is multiValued: no sorting
– copyField * => text: no meaningful relevancy, no specialized analyzer chain processing

Deconstructing films example
• bin/solr create -c films
• curl http://localhost:8983/solr/films/schema ... (add name,
initial_release_date)
• Index 1100 records from
– (Solr) XML,
– (generic) JSON (doc), or
– CSV format
• Search for batman
• Use /browse end-point and search for batman
• Enable highlighting in results

Initial stats for films core
Sizes (line counts)
managed-schema* 481
solrconfig.xml 1482
params.json 20
File count in conf
.txt 41
.xml 3
.json 1
managed-schema (xml) 1
* already has no comments

Deconstructing – just straight tags
• managed-schema lost comments during
construction
• Let's remove comments from solrconfig.xml
• xml ed -L -d "//comment()" solrconfig.xml
– -L: edit in place
– -d: delete the nodes matching the XPath
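The same clean-up can be sketched without XMLStarlet. Python's default ElementTree parser discards comments while parsing, so a parse and re-serialize round trip drops them. SAMPLE_CONFIG is a made-up fragment and strip_comments a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for a heavily commented solrconfig.xml.
SAMPLE_CONFIG = """<config>
  <!-- a long explanatory comment -->
  <luceneMatchVersion>6.2.0</luceneMatchVersion>
  <!-- another comment -->
</config>"""

def strip_comments(xml_text):
    """Re-serialize the XML; the default parser has already dropped comments."""
    root = ET.fromstring(xml_text)
    return ET.tostring(root, encoding="unicode")

print(strip_comments(SAMPLE_CONFIG))
```

Whitespace left behind by the removed comments stays in the output, so blank lines still need a separate clean-up before comparing line counts.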

solrconfig.xml without comments
Sizes (line counts)
managed-schema 481
solrconfig.xml 1482 -> 278
params.json 20
File count in conf
.txt 41
.xml 3
.json 1

Deconstructing – what to clean
• Currently
– (explicit) fields: 8
– dynamic fields: 73
• xml sel -t -m "//dynamicField" -v @name -n managed-schema |wc -l
– types: 71
– copyFields: 1
• Let's start from dynamic fields

Deconstructing – dynamic fields
• Used dynamic fields
– do NOT modify schema
– DO show up in Admin UI, if used
– Example from different schema:
• Used/matched fields
• Generic definitions

Deconstructing – in use dynamic fields
• NO dynamic fields are used
– * is a copyField instruction
• Can remove them all
• xml ed -L -d "//dynamicField" managed-schema
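A standard-library equivalent of that deletion, as a sketch. SAMPLE_SCHEMA is a made-up miniature schema and remove_dynamic_fields is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature schema with two dynamic fields to drop.
SAMPLE_SCHEMA = """<schema name="films" version="1.6">
  <field name="id" type="string"/>
  <dynamicField name="*_s" type="string"/>
  <dynamicField name="*_i" type="int"/>
  <copyField source="*" dest="_text_"/>
</schema>"""

def remove_dynamic_fields(schema_xml):
    """Return the parsed schema with every dynamicField element removed."""
    root = ET.fromstring(schema_xml)
    # Copy the lists first: removing children while iterating would skip some.
    for parent in list(root.iter()):
        for child in list(parent.findall("dynamicField")):
            parent.remove(child)
    return root

pruned = remove_dynamic_fields(SAMPLE_SCHEMA)
print(ET.tostring(pruned, encoding="unicode"))
```

The copyField with source="*" survives, which matches the point above: the star there is a copy instruction, not a dynamic field.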

Remove dynamicFields
Sizes (line counts)
managed-schema 481 -> 409
solrconfig.xml 278
params.json 20
File count in conf
.txt 41
.xml 3
.json 1

Deconstructing – field types
• How many types out of 71 do we use?
– xml sel -t -m "//field|//dynamicField" -v "@type" -n conf/managed-schema |sort -u
– long, string, strings, tdate, text_general
• But also some in solrconfig.xml
– booleans, string, strings, tdates, tdoubles, text_general, tlongs
• Combined total: 9 field type definitions
• Delete the rest (by hand)
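The by-hand step can also be scripted. A sketch under stated assumptions: SAMPLE_SCHEMA is a made-up miniature schema with fieldType elements directly under the root, TYPES_USED_IN_SOLRCONFIG stands in for the type names found in solrconfig.xml, and prune_unused_types is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature schema; the real one has 71 types.
SAMPLE_SCHEMA = """<schema name="films" version="1.6">
  <field name="id" type="string"/>
  <field name="name" type="text_general"/>
  <field name="initial_release_date" type="tdate"/>
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text_general" class="solr.TextField"/>
  <fieldType name="tdate" class="solr.TrieDateField"/>
  <fieldType name="tlongs" class="solr.TrieLongField"/>
  <fieldType name="text_ja" class="solr.TextField"/>
</schema>"""

# Assumption: types referenced only from solrconfig.xml (schemaless chain).
TYPES_USED_IN_SOLRCONFIG = {"tlongs"}

def prune_unused_types(schema_xml, extra_used):
    """Remove top-level fieldType definitions that nothing refers to."""
    root = ET.fromstring(schema_xml)
    used = {el.get("type")
            for tag in ("field", "dynamicField")
            for el in root.iter(tag)}
    used |= set(extra_used)
    removed = []
    for field_type in list(root.findall("fieldType")):
        if field_type.get("name") not in used:
            root.remove(field_type)
            removed.append(field_type.get("name"))
    return root, removed

pruned, removed = prune_unused_types(SAMPLE_SCHEMA, TYPES_USED_IN_SOLRCONFIG)
print("removed:", removed)
```

Keeping the solrconfig.xml types in the used set matters: the schemaless chain refers to types by name, so deleting them would break the first auto-created field of that type.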

Remove no-longer used types
Sizes (line counts)
managed-schema 409 -> 34 (!!!)
solrconfig.xml 278
params.json 20
File count in conf
.txt 41
.xml 3
.json 1

Deconstructing – support files
• Inside lang directory (38 files)
– find lang -name 'stopwords_*.txt' | wc -l
• stopwords_*.txt: 30 files
• contractions_*.txt: 4 files
– find lang -type f |egrep -v 'stopwords_|contractions_'
• hyphenations_ga.txt, stemdict_nl.txt, stoptags_ja.txt,
userdict_ja.txt

Support files – still in use?
• Check for usage
– grep -o 'stopwords_.*.txt' managed-schema solrconfig.xml
– grep -o 'contractions_.*.txt' ...
– ...
• NO Matches (we no longer have related types)
– Delete the whole lang directory
• What about files just inside config directory
– Don't need currency.xml, protwords.txt
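The usage check can be sketched as a plain text search. Assumptions: SAMPLE_SCHEMA_TEXT is a made-up fragment, unused_support_files is a hypothetical helper name, and each file is referenced as a whole quoted attribute value (a comma-separated list of files inside one attribute would not be detected).

```python
# Hypothetical fragment of what is left in managed-schema.
SAMPLE_SCHEMA_TEXT = """
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
"""

def unused_support_files(config_texts, candidates):
    """Return candidate paths that never appear as a quoted attribute value."""
    joined = "\n".join(config_texts)
    return [name for name in candidates if f'"{name}"' not in joined]

candidates = ["stopwords.txt", "synonyms.txt",
              "lang/stopwords_en.txt", "protwords.txt"]
print(unused_support_files([SAMPLE_SCHEMA_TEXT], candidates))
```

Anything the function returns is a candidate for deletion; it is still worth a quick manual look before removing files.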

Remove no-longer used support files
Sizes (line counts)
managed-schema 34
solrconfig.xml 278
params.json 20
File count in conf
.txt 41 -> 2
.xml 3 -> 2
.json 1

Deconstructing – actual field usage

The mystery of _root_
• In the original schema – no explanations
• Documentation – used for nested documents:
To support nested documents, the schema must include an indexed/non-stored
field _root_ . The value of that field is populated automatically and is the same for
all documents in the block, regardless of the inheritance depth.
• We are not using nested documents
• And neither is any other shipped example...

Remove _root_
Sizes (line counts)
managed-schema 34 -> 33
solrconfig.xml 278
params.json 20
File count in conf
.txt 2
.xml 2
.json 1

Deconstructing – text_general type
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

text_general support files
• stopwords.txt
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
• synonyms.txt
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
......
#-----------------------------------------------------------------------
#some test synonym mappings unlikely to appear in real input text
aaafoo => aaabar
bbbfoo => bbbfoo bbbbar
cccfoo => cccbar cccbaz
fooaaa,baraaa,bazaaa
# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
#after us won't split it into two words.
# Synonym mappings can be used for spelling correction too
pixima => pixma
text_general's empty stopwords
• No file => default stopwords (English)
• Empty file => stopwords disabled
• Currently: NOT used (the shipped stopwords.txt holds only comments)
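Whether a word list is effectively empty can be checked quickly. A sketch, assuming the plain word-list format where blank lines and lines starting with # are ignored; effective_entries is a hypothetical helper name.

```python
def effective_entries(text):
    """Return the lines that are neither blank nor '#' comments."""
    return [line.strip()
            for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]

license_only = "# Licensed to the Apache Software Foundation (ASF)\n#\n\n"
print(effective_entries(license_only))       # nothing but comments
print(effective_entries("# demo\nthe\n a \n"))
```

An empty result means the file contributes no stopwords, so the filter that loads it can be dropped without changing search behaviour.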

text_general simplified definition
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Remove stopwords and synonyms
Sizes (line counts)
managed-schema 33 -> 26
solrconfig.xml 278
params.json 20
File count in conf
.txt 2 -> 0
.xml 2
.json 1

How far did we get
Sizes (line counts)
managed-schema* 481 -> 26
solrconfig.xml 1482 -> 278
params.json 20
File count in conf
.txt 41 -> 0
.xml 3 -> 2
.json 1
* already has no comments

Deconstructing – solrconfig.xml
• solrconfig.xml is more complex than schema
• Heterogeneous Sections
• Nested definitions
• Alternative implementations (e.g. highlighter)
• Also remember
– configoverlay.json – overrides solrconfig.xml
– params.json – additional configuration parameters

solrconfig.xml – feature counts
11 requestHandler
8 lib
5 searchComponent
3 queryResponseWriter
2 initParams
1 updateRequestProcessorChain
1 updateHandler
1 requestDispatcher
1 query
1 luceneMatchVersion
1 jmx
1 indexConfig
1 directoryFactory
1 dataDir
1 codecFactory
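A count like the one above can be reproduced with the standard library. A sketch: SAMPLE_CONFIG is a made-up fragment and top_level_counts is a hypothetical helper name.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for the comment-free solrconfig.xml.
SAMPLE_CONFIG = """<config>
  <luceneMatchVersion>6.2.0</luceneMatchVersion>
  <lib dir="a"/>
  <lib dir="b"/>
  <requestHandler name="/select" class="solr.SearchHandler"/>
  <requestHandler name="/query" class="solr.SearchHandler"/>
  <requestHandler name="/spell" class="solr.SearchHandler"/>
  <searchComponent name="terms" class="solr.TermsComponent"/>
</config>"""

def top_level_counts(config_xml):
    """Count the direct children of <config> by tag name."""
    root = ET.fromstring(config_xml)
    return Counter(child.tag for child in root)

for tag, count in top_level_counts(SAMPLE_CONFIG).most_common():
    print(count, tag)
```

Only direct children are counted, so nested definitions (for example the pieces inside a highlighter component) do not inflate the totals.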

solrconfig.xml – line counts
55:<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
52:<searchComponent class="solr.HighlightComponent" name="highlight">
18:<query>
17:<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
15:<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
13:<updateHandler class="solr.DirectUpdateHandler2">
9:<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
8:<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
8:<requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy">
7:<requestHandler name="/update/extract" startup="lazy"
class="solr.extraction.ExtractingRequestHandler">
7:<requestHandler name="/query" class="solr.SearchHandler">
6:<requestHandler name="/debug/dump" class="solr.DumpRequestHandler">
......

Remember, this works!
<config>
  <luceneMatchVersion>6.2.0</luceneMatchVersion>
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="df">text</str>
    </lst>
  </requestHandler>
</config>

add-unknown-fields-to-the-schema
• Famous "schemaless" mode
• Generic, but fully configurable
• Far from perfect
– Remember, we had to manually pre-add fields
– Development, not production
– Has normalization side-effects (normalizes dates)
• Cannot remove it in our example

solrconfig.xml - highlighter
<searchComponent class="solr.HighlightComponent" name="highlight">
  <highlighting>
    <fragmenter name="gap" default="true" class="solr.highlight.GapFragmenter">
      <lst name="defaults">
        <int name="hl.fragsize">100</int>
      </lst>
    </fragmenter>
    <fragmenter name="regex" class="solr.highlight.RegexFragmenter">
      <lst name="defaults">
        <int name="hl.fragsize">70</int>
        <float name="hl.regex.slop">0.5</float>
        <str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
      </lst>
    </fragmenter>
    <formatter name="html" default="true" class="solr.highlight.HtmlFormatter">
      <lst name="defaults">
        <str name="hl.simple.pre"><![CDATA[<em>]]></str>
        <str name="hl.simple.post"><![CDATA[</em>]]></str>
      </lst>
    </formatter>
    <encoder name="html" class="solr.highlight.HtmlEncoder"/>
    <fragListBuilder name="simple" class="solr.highlight.SimpleFragListBuilder"/>
    <fragListBuilder name="single" class="solr.highlight.SingleFragListBuilder"/>
.......
• fragmenters
• encoders
• fragListBuilders
• fragmentBuilders
• boundaryScanners
• ....

highlighter – the truth
• Highlighter searchComponent is in default stack
• The params are a mix for the standard highlighter and the alternative FastVector highlighter
• Cannot use the FastVector version, as schema fields are missing termVectors, etc.
• And the standard highlighter params are the same as the implicit values
• Therefore, we can remove the WHOLE definition

Remove highlighter
Sizes (line counts)
managed-schema 26
solrconfig.xml 278 -> 226
params.json 20
File count in conf
.txt 0
.xml 2
.json 1

Other searchComponents
• Not on the default stack
– spellcheck
– terms
– termVector
– elevator
• Have dedicated requestHandlers
• Inception (example within example)
• Can be deleted
– also delete elevate.xml
15:<searchComponent name="spellcheck"
class="solr.SpellCheckComponent">
17:<requestHandler name="/spell"
class="solr.SearchHandler" startup="lazy">
1:<searchComponent name="terms"
class="solr.TermsComponent"/>
9:<requestHandler name="/terms"
1:<searchComponent name="tvComponent"
class="solr.TermVectorComponent"/>
8:<requestHandler name="/tvrh"
4:<searchComponent name="elevator"
class="solr.QueryElevationComponent">
8:<requestHandler name="/elevate"

Remove custom searchComponents
Sizes (line counts)
managed-schema 26
solrconfig.xml 226 -> 163
params.json 20
File count in conf
.txt 0
.xml 2 -> 1
.json 1

solrconfig.xml – more stuff
• There is more that can be taken out
– query section, since you have to tune it anyway
– updateHandler, and revert to basic commits
– jmx
– enableRemoteStreaming – definitely take that out
• But keep velocity, browse, search support

Next action
• Join the (virtual) Solr Example Reading Group
– Starts November 2016
– Register at http://bit.ly/SolrERG
• Join mailing list at http://www.solr-start.com
– Get the link to the presentation source
– Learn about other similar projects
– Get news of Solr articles and projects on the web
