Rebuilding Solr 6 examples –
layer by layer
Alexandre Rafalovitch
www.solr-start.com
Who am I
• Software developer with 20+ years of experience
– Including 3 years as Senior Tech Support (BEA Weblogic)
• Solr popularizer
• Published book author on Solr Indexing (for Solr 4.3)
• Run http://www.solr-start.com resource site
• Solr committer (since August 2016)
• Past and present Solr focus on onboarding, usability,
tooling, information sharing
Example catch-22
• Search is, surprisingly, a complex expertise
• Solr is a complex product
– Wide
– Deep
– History-rich
• And so are its many examples
Fasten the seatbelt
• Review all of the (Solr 6.2) OOTB examples
• Make a small one from scratch
• Deconstruct a real shipped example
• Next learning action...
OOTB Examples – how many?
bin/solr start -e
-e <example> Name of the example to run; available examples:
cloud: SolrCloud example
techproducts: Comprehensive example
illustrating many of Solr's core capabilities
dih: Data Import Handler
schemaless: Schema-less example
techproducts example
• Used to be collection1
• solr.home: example/techproducts/solr
– Can restart with
bin/solr start -s example/techproducts/solr
– Actual core at
example/techproducts/solr/techproducts
techproducts example (cont.)
• Source configuration
– server/solr/configsets/sample_techproducts_configs
– Not actually a configset (copy, not share)
• Can be rebuilt
rm -rf example/techproducts
• Has data (14 files of products, money, utf8 tests)
bin/post -c techproducts example/exampledocs/*.xml
schemaless example
• solr.home: example/schemaless/solr
• Actual core: example/schemaless/solr/gettingstarted
• Source configuration:
– server/solr/configsets/data_driven_schema_configs
– The config you get when you do not specify one:
bin/solr create -c newcore
• No data, but can take (nearly) anything:
bin/post -c <name> example/exampledocs/*.xml
schemaless mode?
• “Let us guess what you mean”
– Auto-guess field type based on first content occurrence
– Create explicit field definitions
• booleans, dates, numbers, strings
• Always multivalued (because: who knows?!?)
• Can be configured (URP chain in solrconfig.xml)
– Rewrites managed-schema (comments begone!)
– Makes search work with
<copyField source="*" dest="_text_"/>
techproducts vs schemaless
• Configured techproducts vs
auto-detecting schemaless
• Strings
"name":"Test with some GB18030 encoded characters",
"name":["Test with some GB18030 encoded characters"],
• Numbers
"price":0.0, "price_c":"0.0,USD",
"price":[0.0],
• Booleans
"inStock":true,
"inStock":[true],
cloud example
• Highly configurable (unless using -noprompt)
• solr.home: example/cloud/nodeX/solr
• Source configuration is a choice
Please choose a configuration for the gettingstarted collection, available
options are: basic_configs, data_driven_schema_configs, or
sample_techproducts_configs [data_driven_schema_configs]
• Can be rebuilt:
bin/solr stop -all
rm -rf example/cloud
• Demonstrates Config API (configoverlay.json)
dih example(s)
• Data import handler – legacy, but still kicking
• solr.home: example/example-DIH/solr
• Has 5 (five!) different cores
– db - database import (example/example-DIH/hsqldb/ex.*)
– solr - import from another Solr core (configured for db core)
– mail - import from IMAP (needs some configuration)
– tika - import rich-content (example/exampledocs/solr-word.pdf)
– rss - external XML feed (very broken right now)
• Cannot be rebuilt – only emptied
bin/post -c db -type 'application/json' -d '{delete: {query:"*:*"}}'
What about: bin/solr start?
• solr.home: server/solr
• No initial collection/cores, have to create explicitly:
– With script (see bin/solr create_core -h for details):
bin/solr create -c <corename> -d <name or path>
– With Core Admin API for non-SolrCloud:
http://localhost:8983/solr/admin/cores?action=CREATE&…
– With Collection API for SolrCloud:
http://localhost:8983/solr/admin/collections?action=CREATE&…
basic_configs configuration
• Available for cloud example
and explicit creation
• Schemaless mode is configured, not enabled
• “Minimal Solr configuration” !?!
– managed-schema: 1005 lines
– solrconfig.xml: 1484 lines
files example
• Specifically tuned for file indexing
– Augmented schemaless mode with language,
content-type guessing
– Custom /browse end-point
– Source configuration: example/files/conf
– Setup instructions: example/files/README.txt
– Bring your own data
films example
• Schemaless (Based on data_driven_schema_configs)
– Uses Schema API to add custom fields
– Uses schemaless for rest of fields
• Comes with its own data (1100 film records)
• Uses velocity (/browse), Schema API, Request
Parameters API (params.json)
• Setup instructions: example/films/README.txt
That was the good news
• Many examples
• Easy to get one running
• Some come with data
• Some you can throw your own data into
• Lots of comments
This is the bad news
                      Files  Types  Fields  Dynamic  managed-schema  solrconfig.xml
                                            Fields   size (lines)    size (lines)
basic                    46     71       4       73            1005            1484
data_driven              46     71       4       73            1005            1482
techproducts            101     66      33       28            1149            1701
dih db                   62     62      31       28            1129            1490
dih tika                  6     61       3       27             901            1466
files                    69     73       9       73             517            1508
films (data_driven+)     46     71       8       73             481            1482
Tip – getting these numbers
• XML extraction with XMLStarlet (XSLT CLI)
– xml sel -t -m "//fieldType" -v @name -n managed-schema
– xml sel -t -m "//copyField" -c . -n managed-schema |wc -l
– xml sel -t -m "//*[@docValues]" -v "concat(local-name(), ' ', @name, ' docValues:', @docValues)" -n managed-schema
– xml sel -t -m "//requestHandler" -v "@name" -n solrconfig.xml
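When XMLStarlet is not installed, similar counts can be approximated with grep. The following is a minimal sketch, not the method used for the table above: the sample schema, the WORK directory and the count_tag helper are all made up for illustration. It counts opening tags textually, so a tag inside an XML comment would be counted too.

```shell
#!/bin/sh
# Sketch: approximate schema counts with grep instead of XMLStarlet.
# The sample schema is hypothetical; point SCHEMA at a real managed-schema
# to measure an actual config.
WORK=$(mktemp -d)
SCHEMA="$WORK/managed-schema"
cat > "$SCHEMA" <<'EOF'
<schema name="sample" version="1.6">
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="name" type="text_general" indexed="true" stored="true"/>
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <copyField source="*" dest="_text_"/>
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text_general" class="solr.TextField"/>
</schema>
EOF

# Count opening tags with the given name (hypothetical helper).
# The trailing space in the pattern keeps "<field " from also
# matching "<fieldType ".
count_tag() {
  grep -o "<$1 " "$2" | wc -l | tr -d ' '
}

types=$(count_tag fieldType "$SCHEMA")
fields=$(count_tag field "$SCHEMA")
dynamic=$(count_tag dynamicField "$SCHEMA")
copies=$(count_tag copyField "$SCHEMA")
lines=$(wc -l < "$SCHEMA" | tr -d ' ')

echo "types=$types fields=$fields dynamicFields=$dynamic copyFields=$copies lines=$lines"
```

The XPath-based commands remain the more reliable option, because they ignore comments and do not depend on how the tags are laid out.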
Why is it like this?
• Many examples predate Solr Reference Guide
• grep for options, possibilities, defaults
• Each example is a kitchen sink
“Too much of a good thing is also a bad thing”
Source: 1980s Soviet joke about Virtual Reality
Go small – managed-schema
<schema name="demo" version="1.6">
<dynamicField name="*" type="string"
indexed="true" stored="true" multiValued="true"/>
<field name="text" type="text_basic"
indexed="true" stored="false" multiValued="true"/>
<copyField source="*" dest="text"/>
…
Go small – managed-schema(2)
…
<fieldType name="string" class="solr.StrField"/>
<fieldType name="text_basic" class="solr.TextField">
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory" />
</analyzer>
</fieldType>
</schema>
Go small – solrconfig.xml
<config>
<luceneMatchVersion>6.2.0</luceneMatchVersion>
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="df">text</str>
</lst>
</requestHandler>
</config>
Go small – load and test
• bin/solr create -c demo -d .../demo-config/
• bin/post -c demo example/exampledocs/*.xml
• Test it works, using HTTPie (HTTP CLI)
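The two files from the previous slides can be written to disk with a short script. This is a sketch under assumptions: the demo-config directory name and the conf/ subdirectory layout are choices made here for illustration, not something the slides specify, and the script only writes the files without starting or contacting Solr.

```shell
#!/bin/sh
# Sketch: write the minimal managed-schema and solrconfig.xml into a
# fresh config directory. The directory name and conf/ layout are assumptions.
CONFIG_DIR="$(mktemp -d)/demo-config"
mkdir -p "$CONFIG_DIR/conf"

# Catch-all schema: every field is a multiValued string, copied into "text".
cat > "$CONFIG_DIR/conf/managed-schema" <<'EOF'
<schema name="demo" version="1.6">
  <dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>
  <field name="text" type="text_basic" indexed="true" stored="false" multiValued="true"/>
  <copyField source="*" dest="text"/>
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text_basic" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    </analyzer>
  </fieldType>
</schema>
EOF

# One search handler that queries the "text" field by default.
cat > "$CONFIG_DIR/conf/solrconfig.xml" <<'EOF'
<config>
  <luceneMatchVersion>6.2.0</luceneMatchVersion>
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="df">text</str>
    </lst>
  </requestHandler>
</config>
EOF

schema_lines=$(wc -l < "$CONFIG_DIR/conf/managed-schema" | tr -d ' ')
config_lines=$(wc -l < "$CONFIG_DIR/conf/solrconfig.xml" | tr -d ' ')
echo "managed-schema: $schema_lines lines, solrconfig.xml: $config_lines lines"
echo "Next, from the Solr install directory: bin/solr create -c demo -d $CONFIG_DIR"
```

Compared with the 1005 and 1484 lines of basic_configs, the whole configuration here fits in under 20 lines.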
Go small - review
• A minimal example can be very minimal
• Some things will not work
– No uniqueKey – no way to update documents, no
SolrCloud
– No _version_ – no SolrCloud
– Everything is multiValued – no sorting
– copyField * => text: no meaningful relevancy, no
specialized analyzer chain processing
Deconstructing films example
• bin/solr create -c films
• curl http://localhost:8983/solr/films/schema ... (add name,
initial_release_date)
• Index 1100 records from
– (Solr) XML,
– (generic) JSON (doc), or
– CSV format
• Search for batman
• Use /browse end-point and search for batman
• Enable highlighting in results
Initial stats for films core
Sizes (line counts)
managed-schema* 481
solrconfig.xml 1482
params.json 20
File count in conf
.txt 41
.xml 3
.json 1
managed-schema (xml) 1
* already has no comments
Deconstructing – just straight tags
• managed-schema lost comments during
construction
• Let's remove comments from solrconfig.xml
• xml ed -L -d "//comment()" solrconfig.xml
– -L: edit in place
– -d: delete the nodes matching the XPath
solrconfig.xml without comments
Sizes (line counts)
managed-schema 481
solrconfig.xml 1482 -> 278
params.json 20
File count in conf
.txt 41
.xml 3
.json 1
managed-schema (xml) 1
Deconstructing – what to clean
• Currently
– (explicit) fields: 8
– dynamic fields: 73
• xml sel -t -m "//dynamicField" -v @name -n managed-schema |wc -l
– types: 71
– copyFields: 1
• Let's start from dynamic fields
Deconstructing – dynamic fields
• Used dynamic fields
– do NOT modify schema
– DO show up in Admin UI, if used
– Example from different schema:
• Used/matched fields
• Generic definitions
Deconstructing – in use dynamic fields
• NO dynamic fields are used
– * is a copyField instruction
• Can remove them all
• xml ed -L -d "//dynamicField" managed-schema
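The "is any dynamic field in use?" check can also be done by matching patterns against field names. The following is a rough sketch: the pattern list and the field list are illustrative assumptions rather than output from a real core, and used_patterns is a hypothetical helper. It relies on dynamicField names being simple globs, so the shell's own case matching can stand in for Solr's.

```shell
#!/bin/sh
# Sketch: report which dynamicField patterns match at least one real field.
# Both lists are illustrative assumptions; in practice the patterns come from
# the schema and the names from the index (e.g. the Admin UI field list).
PATTERNS='*_s *_i *_dt attr_*'
FIELDS='id name directed_by genre initial_release_date _version_ _text_'

# Print each pattern in $1 that matches at least one name in $2.
used_patterns() {
  set -f                      # stop the shell expanding * against local files
  for pattern in $1; do
    for field in $2; do
      case "$field" in
        $pattern) echo "$pattern"; break ;;
      esac
    done
  done
  set +f
}

used_films=$(used_patterns "$PATTERNS" "$FIELDS")
used_extra=$(used_patterns "$PATTERNS" "$FIELDS price_i")

echo "Patterns in use (sample field list): ${used_films:-none}"
echo "Patterns in use after adding price_i: ${used_extra:-none}"
```

With the sample list no pattern matches, which mirrors the conclusion above that all dynamic field definitions can go. Adding a hypothetical price_i field makes `*_i` show up as used.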
Remove dynamicFields
Sizes (line counts)
managed-schema 481 -> 409
solrconfig.xml 278
params.json 20
File count in conf
.txt 41
.xml 3
.json 1
managed-schema (xml) 1
Deconstructing – field types
• How many types out of 71 do we use?
– xml sel -t -m "//field|//dynamicField" -v "@type" -n conf/managed-schema |sort -u
– long, string, strings, tdate, text_general
• But also some in solrconfig.xml
– booleans, string, strings, tdates, tdoubles, text_general,
tlongs
• Combined total: 9 field type definitions
• Delete the rest (by hand)
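The combined total can be checked by merging the two lists and removing duplicates. The sketch below uses only the type names quoted above; the variable names are made up for illustration.

```shell
#!/bin/sh
# Types referenced by fields in managed-schema (from the list above).
SCHEMA_TYPES='long string strings tdate text_general'
# Types referenced in solrconfig.xml (from the list above).
CONFIG_TYPES='booleans string strings tdates tdoubles text_general tlongs'

# One name per line, de-duplicated: the set of definitions to keep.
keep=$(printf '%s\n' $SCHEMA_TYPES $CONFIG_TYPES | sort -u)
keep_count=$(printf '%s\n' "$keep" | wc -l | tr -d ' ')

echo "$keep"
echo "Definitions to keep: $keep_count"
```

The three names that appear in both lists (string, strings, text_general) are why 5 + 7 comes out as 9 rather than 12.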
Remove no-longer used types
Sizes (line counts)
managed-schema 409 -> 34 (!!!)
solrconfig.xml 278
params.json 20
File count in conf
.txt 41
.xml 3
.json 1
managed-schema (xml) 1
Deconstructing – support files
• Inside lang directory (38 files)
– find lang -name 'stopwords_*.txt' | wc -l
• stopwords_*.txt: 30 files
• contractions_*.txt: 4 files
– find lang -type f |egrep -v 'stopwords_|contractions_'
• hyphenations_ga.txt, stemdict_nl.txt, stoptags_ja.txt,
userdict_ja.txt
Support files – still in use?
• Check for usage
– grep -o 'stopwords_.*.txt' managed-schema solrconfig.xml
– grep -o 'contractions_.*.txt' ...
– ...
• NO Matches (we no longer have related types)
– Delete the whole lang directory
• What about files just inside config directory
– Don't need currency.xml, protwords.txt
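The per-prefix grep checks can be generalized into a loop over every support file. This is a sketch under assumptions: the sample conf directory is fabricated for the demonstration, and a file counts as "referenced" simply when its base name appears in managed-schema or solrconfig.xml. Files that Solr loads by convention rather than by name would need a separate check.

```shell
#!/bin/sh
# Sketch: list support files that neither config file mentions by name.
# The sample conf directory is hypothetical; set CONF to a real one.
CONF="$(mktemp -d)/conf"
mkdir -p "$CONF/lang"
cat > "$CONF/managed-schema" <<'EOF'
<schema name="sample" version="1.6">
  <fieldType name="text_general" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    </analyzer>
  </fieldType>
</schema>
EOF
echo '<config/>' > "$CONF/solrconfig.xml"
: > "$CONF/stopwords.txt"
: > "$CONF/protwords.txt"
: > "$CONF/lang/stopwords_en.txt"

# Walk every .txt/.xml support file and keep those whose base name
# does not appear in either config file.
unreferenced=$(
  find "$CONF" -type f \( -name '*.txt' -o -name '*.xml' \) ! -name solrconfig.xml |
  sort |
  while read -r file; do
    name=$(basename "$file")
    if ! grep -qF "$name" "$CONF/managed-schema" "$CONF/solrconfig.xml"; then
      echo "${file#$CONF/}"
    fi
  done
)

echo "Candidates for deletion:"
echo "$unreferenced"
```

Treat the output as a list of candidates only, and re-test the core after deleting anything.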
Remove no-longer used support files
Sizes (line counts)
managed-schema 34
solrconfig.xml 278
params.json 20
File count in conf
.txt 41 -> 2
.xml 3 -> 2
.json 1
managed-schema (xml) 1
Deconstructing – actual field usage
Actual field usage - _root_
The mystery of _root_
• In the original schema – no explanations
• Documentation – used for nested documents:
To support nested documents, the schema must include an indexed/non-stored
field _root_ . The value of that field is populated automatically and is the same for
all documents in the block, regardless of the inheritance depth.
• We are not using nested documents
• And neither does any other shipped example...
Remove _root_
Sizes (line counts)
managed-schema 34 -> 33
solrconfig.xml 278
params.json 20
File count in conf
.txt 2
.xml 2
.json 1
managed-schema (xml) 1
Deconstructing – text_general type
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100"
multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.SynonymFilterFactory" expand="true"
ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
text_general support files
• stopwords.txt
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
• synonyms.txt
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
......
#-----------------------------------------------------------------------
#some test synonym mappings unlikely to appear in real input text
aaafoo => aaabar
bbbfoo => bbbfoo bbbbar
cccfoo => cccbar cccbaz
fooaaa,baraaa,bazaaa
# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
#after us won't split it into two words.
# Synonym mappings can be used for spelling correction too
pixima => pixma
text_general's empty stopwords
• No file
=> default stopwords
=> English
• Empty file
=> disabled stopwords
• Currently – NOT used
text_general simplified definition
<fieldType name="text_general" class="solr.TextField"
positionIncrementGap="100" multiValued="true">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Remove stopwords and synonyms
Sizes (line counts)
managed-schema 33 -> 26
solrconfig.xml 278
params.json 20
File count in conf
.txt 2 -> 0
.xml 2
.json 1
managed-schema (xml) 1
How far did we get
Sizes (line counts)
managed-schema* 481 -> 26
solrconfig.xml 1482 -> 278
params.json 20
File count in conf
.txt 41 -> 0
.xml 3 -> 2
.json 1
managed-schema (xml) 1
* already has no comments
Deconstructing – solrconfig.xml
• solrconfig.xml is more complex than schema
• Heterogeneous Sections
• Nested definitions
• Alternative implementations (e.g. highlighter)
• Also remember
– configoverlay.json – overrides solrconfig.xml
– params.json – additional configuration parameters
solrconfig.xml – feature counts
11 requestHandler
8 lib
5 searchComponent
3 queryResponseWriter
2 initParams
1 updateRequestProcessorChain
1 updateHandler
1 requestDispatcher
1 query
1 luceneMatchVersion
1 jmx
1 indexConfig
1 directoryFactory
1 dataDir
1 codecFactory
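A tally in the same spirit can be produced with standard tools. The sketch below is an approximation, not the command behind the numbers above: it counts every opening tag at any depth (so nested elements such as lst and str appear too), and the sample solrconfig.xml is made up for illustration.

```shell
#!/bin/sh
# Sketch: tally element names in a config file, most frequent first.
# The sample file is hypothetical; point CONFIG at a real solrconfig.xml.
WORK=$(mktemp -d)
CONFIG="$WORK/solrconfig.xml"
cat > "$CONFIG" <<'EOF'
<config>
  <luceneMatchVersion>6.2.0</luceneMatchVersion>
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="df">text</str>
    </lst>
  </requestHandler>
  <requestHandler name="/query" class="solr.SearchHandler"/>
  <searchComponent name="terms" class="solr.TermsComponent"/>
</config>
EOF

# Extract opening tag names (closing tags start with "</" and are skipped),
# then count how often each name occurs.
tally=$(grep -o '<[A-Za-z][A-Za-z0-9]*' "$CONFIG" | tr -d '<' | sort | uniq -c | sort -rn)
echo "$tally"

# Pull a single count back out of the tally.
handlers=$(printf '%s\n' "$tally" | awk '$2 == "requestHandler" {print $1}')
echo "requestHandler definitions: $handlers"
```

Run it on a comment-free file, otherwise tags inside comments are counted as well.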
solrconfig.xml – line counts
55:<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
52:<searchComponent class="solr.HighlightComponent" name="highlight">
18:<query>
17:<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
15:<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
13:<updateHandler class="solr.DirectUpdateHandler2">
9:<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
8:<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
8:<requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy">
7:<requestHandler name="/update/extract" startup="lazy"
class="solr.extraction.ExtractingRequestHandler">
7:<requestHandler name="/query" class="solr.SearchHandler">
6:<requestHandler name="/debug/dump" class="solr.DumpRequestHandler">
......
Remember, this works!
<config>
<luceneMatchVersion>6.2.0</luceneMatchVersion>
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="df">text</str>
</lst>
</requestHandler>
</config>
add-unknown-fields-to-the-schema
• Famous "schemaless" mode
• Generic, but fully configurable
• Far from perfect
– Remember, we had to manually pre-add fields
– Development, not production
– Has normalization side-effects (normalizes dates)
• Cannot remove it in our example
solrconfig.xml - highlighter
<searchComponent class="solr.HighlightComponent" name="highlight">
<highlighting>
<fragmenter name="gap" default="true"
class="solr.highlight.GapFragmenter">
<lst name="defaults">
<int name="hl.fragsize">100</int>
</lst>
</fragmenter>
<fragmenter name="regex" class="solr.highlight.RegexFragmenter">
<lst name="defaults">
<int name="hl.fragsize">70</int>
<float name="hl.regex.slop">0.5</float>
<str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
</lst>
</fragmenter>
<formatter name="html" default="true"
class="solr.highlight.HtmlFormatter">
<lst name="defaults">
<str name="hl.simple.pre"><![CDATA[<em>]]></str>
<str name="hl.simple.post"><![CDATA[</em>]]></str>
</lst>
</formatter>
<encoder name="html" class="solr.highlight.HtmlEncoder"/>
<fragListBuilder name="simple" class="solr.highlight.SimpleFragListBuilder"/>
<fragListBuilder name="single" class="solr.highlight.SingleFragListBuilder"/>
.......
• fragmenters
• encoders
• fragListBuilders
• fragmentBuilders
• boundaryScanners
• ....
highlighter – the truth
• Highlighter searchComponent is in default stack
• The params are a mix of standard highlighter,
alternative FastVector highlighter
• Cannot use FastVector version as schema fields
are missing termVectors, etc
• And standard highlighter params are same as
implicit values
• Therefore, we can remove the WHOLE definition
Remove highlighter
Sizes (line counts)
managed-schema 26
solrconfig.xml 278 -> 226
params.json 20
File count in conf
.txt 0
.xml 2
.json 1
managed-schema (xml) 1
Other searchComponents
• Not on the default stack
– spellcheck
– terms
– termVector
– elevator
• Have dedicated requestHandlers
• Inception (example within example)
• Can be deleted
– also delete elevate.xml
15:<searchComponent name="spellcheck"
class="solr.SpellCheckComponent">
17:<requestHandler name="/spell"
class="solr.SearchHandler" startup="lazy">
1:<searchComponent name="terms"
class="solr.TermsComponent"/>
9:<requestHandler name="/terms"
class="solr.SearchHandler" startup="lazy">
1:<searchComponent name="tvComponent"
class="solr.TermVectorComponent"/>
8:<requestHandler name="/tvrh"
class="solr.SearchHandler" startup="lazy">
4:<searchComponent name="elevator"
class="solr.QueryElevationComponent">
8:<requestHandler name="/elevate"
class="solr.SearchHandler" startup="lazy">
Remove custom searchComponents
Sizes (line counts)
managed-schema 26
solrconfig.xml 226 -> 163
params.json 20
File count in conf
.txt 0
.xml 2 -> 1
.json 1
managed-schema (xml) 1
solrconfig.xml – more stuff
• There is more that can be taken out
– query section, since you have to tune it anyway
– updateHandler, and revert to basic commits
– jmx
– enableRemoteStreaming – definitely take that out
• But keep velocity, browse, search support
Next action
• Join the (virtual) Solr Example Reading Group
– Starts November 2016
– Register at http://bit.ly/SolrERG
• Join mailing list at http://www.solr-start.com
– Get the link to the presentation source
– Learn about other similar projects
– Get news of Solr articles and projects on the web

Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)

  • 1.
    Rebuilding Solr 6examples – layer by layer Alexandre Rafalovitch www.solr-start.com
  • 2.
    Who am I •Software developer with 20+ years of experience – Including 3 years as Senior Tech Support (BEA Weblogic) • Solr popularizer • Published book author on Solr Indexing (for Solr 4.3) • Run http://www.solr-start.com resource site • Solr committer (since August 2016) • Past and present Solr focus on onboarding, usability, tooling, information sharing
  • 3.
    Example catch-22 • Searchis a – surprisingly - complex expertise • Solr is a complex product – Wide – Deep – History-rich • And so are its many examples
  • 4.
    Fasten the seatbelt •Review all of the (Solr 6.2) OOTB examples • Make a small one from scratch • Deconstruct a real shipped example • Next learning action...
  • 5.
    OOTB Examples –how many? bin/solr start –e -e <example> Name of the example to run; available examples: cloud: SolrCloud example techproducts: Comprehensive example illustrating many of Solr's core capabilities dih: Data Import Handler schemaless: Schema-less example
  • 6.
    techproducts example • Usedto be collection1 • solr.home: example/techproducts/solr – Can restart with bin/solr start -s example/techproducts/solr – Actual core at example/techproducts/solr/techproducts
  • 7.
    techproducts example (cont.) •Source configuration – server/solr/configset/sample_techproducts_config – Not actually a configset (copy, not share) • Can be rebuilt rm –rf example/techproducts • Has data (14 files of products, money, utf8 tests) bin/post -c techproducts example/exampledocs/*.xml
  • 8.
    schemaless example • solr.home:example/schemaless/solr • Actual core: example/schemaless/solr/gettingstarted • Source configuration: – server/solr/configset/data_driven_schema_configs – Config you get when you are not using config: bin/solr create -c newcore • No data, but can take (nearly) anything: bin/post -c <name> example/exampledocs/*.xml
  • 9.
    schemaless mode? • “Letus guess what you mean” – Auto-guess field type based on first content occurrence – Create explicit field definitions • booleans, dates, numbers, strings • Always multivalued (because: who knows?!?) • Can be configured (URP chain in solrconfig.xml) – Rewrites managed-schema (coments begone!) – Makes search work with <copyField source="*" dest="_text_"/>
  • 10.
    techproducts vs schemaless •Configured techproducts vs auto-detecting schemaless • Strings "name":"Test with some GB18030 encoded characters", "name":["Test with some GB18030 encoded characters"], • Numbers "price":0.0, "price_c":"0.0,USD", "price":[0.0], • Booleans "inStock":true, "inStock":[true],
  • 11.
    cloud example • Highlyconfigurable (unless using –noprompt) • solr.home: example/cloud/nodeX/solr • Source configuration is a choice Please choose a configuration for the gettingstarted collection, available options are: basic_configs, data_driven_schema_configs, or sample_techproducts_configs [data_driven_schema_configs] • Can be rebuilt: bin/solr stop -all rm -rf example/cloud • Demonstrates Config API (configoverlay.json)
  • 12.
    dih example(s) • Dataimport handler – legacy, but still kicking • solr.home: example/example-DIH/solr • Has 5 (five!) different cores – db - database import (example/example-DIH/hsqldb/ex.*) – solr - import from another Solr core (configured for db core) – mail - import from IMAP (needs some configuration) – tika - import rich-content (example/exampledocs/solr-word.pdf) – rss - external XML feed (very broken right now) • Cannot be rebuilt – only emptied bin/post -c db -type 'application/json' -d '{delete: {query:"*:*"}}'
  • 13.
    What about: bin/solrstart? • solr.home: server/solr • No initial collection/cores, have to create explicitly: – With script (see bin/solr create_core –h for details): bin/solr create –c <corename> -d <name or path> – With Core Admin UI for non-SolrCloud: http://localhost:8983/solr/admin/cores?action=CREATE&… – With Collection API for SolrCloud: http://localhost:8983/admin/collections?action=CREATE&…
  • 14.
    basic_configs configuration • Availablefor cloud example and explicit creation • Schemaless mode is configured, not enabled • “Minimal Solr configuration” !?! – managed-schema: 1005 lines – solrconfig.xml: 1484 lines
  • 15.
    files example • Specificallytuned for file indexing – Augmented schemaless mode with language, content-type guessing – Custom /browse end-point – Source configuration: example/files/conf – Setup instructions: example/files/README.txt – Bring your own data
  • 17.
    films example • Schemaless(Based on data_driven_schema_configs) – Uses Schema API to add custom fields – Uses schemaless for rest of fields • Comes with its own data (1100 film records) • Uses velocity (/browse), Schema API, Request Parameters API (params.json) • Setup instructions: example/films/README.txt
  • 18.
    That was agood news • Many examples • Easy to get one running • Some come with data • Some you can throw your own data into • Lots of comments
  • 19.
    This is thebad news Files Types Fields Dynamic Fields managed-schema size solrconfig. xml size basic 46 71 4 73 1005 1484 data_driven 46 71 4 73 1005 1482 techproducts 101 66 33 28 1149 1701 dih db 62 62 31 28 1129 1490 dih tika 6 61 3 27 901 1466 files 69 73 9 73 517 1508 films (data_driven+) 46 71 8 73 481 1482
  • 20.
    Tip – gettingthese numbers • XML extraction with XMLStarlet (XLST CLI) – xml sel -t -m "//fieldType" -v @name -n managed-schema – xml sel -t -m "//copyField" -c . -n managed-schema |wc -l – xml sel -t -m "//*[@docValues]" -v "concat(local-name(), ' ', @name, ' docValues:', @docValues)" -n managed-schema – xml sel -t -m "//requestHandler" -v "@name" -n solrconfig.xml
  • 21.
    Why is itlike this? • Many examples predate Solr Reference Guide • grep for options, possibilities, defaults • Each example is a kitchen sink “Too much of a good thing is also a bad thing” Source: 1980s Soviet joke about Virtual Reality
  • 22.
    Go small –managed-schema <schema name="demo" version="1.6"> <dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/> <field name="text" type="text_basic" indexed="true" stored="false" multiValued="true"/> <copyField source="*" dest="text"/> …
  • 23.
    Go small –managed-schema(2) … <fieldType name="string" class="solr.StrField"/> <fieldType name="text_basic" class="solr.TextField"> <analyzer> <tokenizer class="solr.LowerCaseTokenizerFactory" /> </analyzer> </fieldType> </schema>
  • 24.
    Go small –solrconfig.xml <config> <luceneMatchVersion>6.2.0</luceneMatchVersion> <requestHandler name="/select” class="solr.SearchHandler”> <lst name="defaults"> <str name="df">text</str> </lst> </requestHandler> </config>
  • 25.
    Go small –load and test • bin/solr create -c demo -d .../demo-config/ • bin/post -c demo example/exampledocs/*.xml • Test it works, using HTTPie (HTTP CLI)
  • 27.
    Go small -review • Minimal example could be very minimal • Some things will not work – No uniqueKey – no way to update documents, no SolrCloud – No _version_ – no SolrCloud – Everything is multiValued – no sorting – copyField * => text, no meaningful relevancy, specialized analyzer chain processing
  • 28.
    Deconstructing films example •bin/solr create –c films • curl http://localhost:8983/solr/films/schema ... (add name, initial_release_date) • Index 1100 records from – (Solr) XML, – (generic) JSON (doc), or – CSV format • Search for batman • Use /browse end-point and search for batman • Enable highlighting in results
  • 30.
    Initial stats forfilms core Sizes (line counts) managed-schema* 481 solrconfig.xml 1482 params.json 20 File count in conf .txt 41 .xml 3 .json 1 managed-schema (xml) 1 * already has no comments
  • 31.
    Deconstructing – juststraight tags • managed-schema lost comments during construction • Let's remove comments from solrconfig.xml • xml ed -L -d "//comment()" solrconfig.xml – Edit in place – Delete XPATH
  • 32.
    solrconfig.xml without comments Sizes(line counts) managed-schema 481 solrconfig.xml 1482 278 params.json 20 File count in conf .txt 41 .xml 3 .json 1 managed-schema (xml) 1
  • 33.
    Deconstructing – whatto clean • Currently – (explicit) fields: 8 – dynamic fields: 73 • xml sel -t -m "//dynamicField" -v @name -n managed- schema |wc -l – types: 71 – copyFields: 1 • Let's start from dynamic fields
  • 34.
    Deconstructing – dynamicfields • Used dynamic fields – do NOT modify schema – DO show up in Admin UI, if used – Example from different schema: • Used/matched fields • Generic definitions
  • 35.
    Deconstructing – inuse dynamic fields
  • 36.
    Deconstructing – inuse dynamic fields • NO dynamic fields are used – * is a copyField instruction • Can remove them all • xml ed -L -d "//dynamicField" managed-schema
  • 37.
    Remove dynamicFields Sizes (linecounts) managed-schema 481 409 solrconfig.xml 278 params.json 20 File count in conf .txt 41 .xml 3 .json 1 managed-schema (xml) 1
  • 38.
    Deconstructing – fieldtypes • How many types out of 71 do we use? – xml sel -t -m "//field|//dynamicField" -v "@type" -n conf/managed-schema |sort –u – long, string, strings, tdate, text_general • But also some in solrconfig.xml – booleans, string, strings, tdates, tdoubles, text_general, tlongs • Combined total: 9 field type definitions • Delete the rest (by hand)
  • 39.
    Remove no-longer usedtypes Sizes (line counts) managed-schema 409 34 (!!!) solrconfig.xml 278 params.json 20 File count in conf .txt 41 .xml 3 .json 1 managed-schema (xml) 1
  • 40.
    Deconstructing – supportfiles • Inside lang directory (38 files) – find lang –name 'stopwords_*.txt' | wc -l • stopwords_*.txt: 30 files • contractions_*.txt: 4 files – find lang -type f |egrep -v 'stopwords_|contractions_' • hyphenations_ga.txt, stemdict_nl.txt, stoptags_ja.txt, userdict_ja.txt
  • 41.
    Support files –still in use? • Check for usage – grep -o 'stopwords_.*.txt' managed-schema solrconfig.xml – grep -o 'contractions_.*.txt' ... – ... • NO Matches (we no longer have related types) – Delete the whole lang directory • What about files just inside config directory – Don't need currency.xml, protwords.txt
  • 42.
    Remove no-longer usedtypes Sizes (line counts) managed-schema 34 solrconfig.xml 278 params.json 20 File count in conf .txt 41 2 .xml 3 2 .json 1 managed-schema (xml) 1
  • 43.
  • 44.
  • 45.
    The mystery of_root_ • In the original schema – no explanations • Documentation – used for nested documents: To support nested documents, the schema must include an indexed/non-stored field _root_ . The value of that field is populated automatically and is the same for all documents in the block, regardless of the inheritance depth. • We are not using nested documents • And neither does any other shipped example...
  • 46.
    Remove _root_ Sizes (linecounts) managed-schema 34 33 solrconfig.xml 278 params.json 20 File count in conf .txt 2 .xml 2 .json 1 managed-schema (xml) 1
  • 47.
    Deconstructing – text_generaltype <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true"> <analyzer type="index"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/> <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType>
  • 48.
    text_general support files stopwords.txt #Licensed to the Apache Software Foundation (ASF) under one or more # contributor license agreements. See the NOTICE file distributed with # this work for additional information regarding copyright ownership. # The ASF licenses this file to You under the Apache License, Version 2.0 # (the "License"); you may not use this file except in compliance with # the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0# # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. • synonyms.txt # The ASF licenses this file to You under the Apache License, Version 2.0 # (the "License"); you may not use this file except in compliance with# the License. You may obtain a copy of the License at#. ...... .#----------------------------------------------------------------------- #some test synonym mappings unlikely to appear in real input textaaafoo => aaabar bbbfoo => bbbfoo bbbbar cccfoo => cccbar cccbaz fooaaa,baraaa,bazaaa # Some synonym groups specific to this example GB,gib,gigabyte,gigabytes MB,mib,megabyte,megabytes Television, Televisions, TV, TVs #notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming #after us won't split it into two words. # Synonym mappings can be used for spelling correction toopixima => pixma
  • 49.
    text_general's empty stopwords •No file => default stopwords => English • Empty file => disabled stopwords • Currently – NOT used
  • 50.
    text_general simplified definition
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  • 51.
    Remove stopwords and synonyms
    Sizes (line counts)
      managed-schema: 33 → 26
      solrconfig.xml: 278
      params.json: 20
    File count in conf
      .txt: 2 → 0
      .xml: 2
      .json: 1
      managed-schema (xml): 1
  • 52.
    How far did we get
    Sizes (line counts)
      managed-schema*: 481 → 26
      solrconfig.xml: 1482 → 278
      params.json: 20
    File count in conf
      .txt: 41 → 0
      .xml: 3 → 2
      .json: 1
      managed-schema (xml): 1
    * already has no comments
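Size tables like the ones in these slides can be regenerated after each round of deletions. The following is a minimal Python sketch, not the tool used for the slides: the function name is hypothetical, and it assumes the counts cover the whole conf directory recursively and that a file without an extension (such as managed-schema) is tallied under its own name.

```python
from collections import Counter
from pathlib import Path


def conf_stats(conf_dir):
    """Return (line count per file, file count per extension) for a conf directory."""
    line_counts = {}
    ext_counts = Counter()
    for path in sorted(Path(conf_dir).rglob("*")):
        if not path.is_file():
            continue
        # Count lines without loading the whole file into memory.
        with path.open(encoding="utf-8", errors="replace") as handle:
            line_counts[str(path.relative_to(conf_dir))] = sum(1 for _ in handle)
        # Files without a suffix are tallied under their file name.
        ext_counts[path.suffix or path.name] += 1
    return line_counts, ext_counts
```

Running it before and after removing a feature gives the two numbers shown per row in the tables.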
  • 53.
    Deconstructing – solrconfig.xml
    • solrconfig.xml is more complex than the schema
    • Heterogeneous sections
    • Nested definitions
    • Alternative implementations (e.g. highlighter)
    • Also remember
      – configoverlay.json – overrides solrconfig.xml
      – params.json – additional configuration parameters
  • 54.
    solrconfig.xml – feature counts
    11 requestHandler
    8 lib
    5 searchComponent
    3 queryResponseWriter
    2 initParams
    1 updateRequestProcessorChain
    1 updateHandler
    1 requestDispatcher
    1 query
    1 luceneMatchVersion
    1 jmx
    1 indexConfig
    1 directoryFactory
    1 dataDir
    1 codecFactory
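A tally like this one can be produced by counting the direct children of the root config element. The sketch below is one possible way to do it in Python, not necessarily how the slide's numbers were obtained. The function names and the embedded sample are hypothetical; the sample's lib directory in particular is made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, heavily cut-down stand-in for a real solrconfig.xml.
SAMPLE = """<config>
  <luceneMatchVersion>6.2.0</luceneMatchVersion>
  <lib dir="hypothetical/lib/dir"/>
  <requestHandler name="/select" class="solr.SearchHandler"/>
  <requestHandler name="/query" class="solr.SearchHandler"/>
</config>"""


def feature_counts(xml_text):
    """Count the top-level elements (direct children of <config>) by tag name."""
    root = ET.fromstring(xml_text)
    return Counter(child.tag for child in root)


def format_counts(counts):
    """Render the counts in the 'N tagName' layout, most frequent first."""
    return [f"{n} {tag}" for tag, n in counts.most_common()]
```

For a file on disk, the same function can be fed the file's contents read as text.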
  • 55.
    solrconfig.xml – line counts
    55:<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
    52:<searchComponent class="solr.HighlightComponent" name="highlight">
    18:<query>
    17:<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
    15:<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    13:<updateHandler class="solr.DirectUpdateHandler2">
    9:<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
    8:<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
    8:<requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy">
    7:<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
    7:<requestHandler name="/query" class="solr.SearchHandler">
    6:<requestHandler name="/debug/dump" class="solr.DumpRequestHandler">
    ......
  • 56.
    Remember, this works!
    <config>
      <luceneMatchVersion>6.2.0</luceneMatchVersion>
      <requestHandler name="/select" class="solr.SearchHandler">
        <lst name="defaults">
          <str name="df">text</str>
        </lst>
      </requestHandler>
    </config>
  • 57.
    add-unknown-fields-to-the-schema
    • Famous "schemaless" mode
    • Generic, but fully configurable
    • Far from perfect
      – Remember, we had to manually pre-add fields
      – Development, not production
      – Has normalization side-effects (normalizes dates)
    • Cannot remove it in our example
  • 58.
    solrconfig.xml – highlighter
    <searchComponent class="solr.HighlightComponent" name="highlight">
      <highlighting>
        <fragmenter name="gap" default="true" class="solr.highlight.GapFragmenter">
          <lst name="defaults">
            <int name="hl.fragsize">100</int>
          </lst>
        </fragmenter>
        <fragmenter name="regex" class="solr.highlight.RegexFragmenter">
          <lst name="defaults">
            <int name="hl.fragsize">70</int>
            <float name="hl.regex.slop">0.5</float>
            <str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
          </lst>
        </fragmenter>
        <formatter name="html" default="true" class="solr.highlight.HtmlFormatter">
          <lst name="defaults">
            <str name="hl.simple.pre"><![CDATA[<em>]]></str>
            <str name="hl.simple.post"><![CDATA[</em>]]></str>
          </lst>
        </formatter>
        <encoder name="html" class="solr.highlight.HtmlEncoder"/>
        <fragListBuilder name="simple" class="solr.highlight.SimpleFragListBuilder"/>
        <fragListBuilder name="single" class="solr.highlight.SingleFragListBuilder"/>
        .......
    • fragmenters
    • encoders
    • fragListBuilders
    • fragmentBuilders
    • boundaryScanners
    • ....
  • 59.
    highlighter – the truth
    • Highlighter searchComponent is in the default stack
    • The params are a mix of those for the standard highlighter and the alternative FastVector highlighter
    • Cannot use the FastVector version, as schema fields are missing termVectors, etc.
    • And the standard highlighter params are the same as the implicit values
    • Therefore, we can remove the WHOLE definition
  • 60.
    Remove highlighter
    Sizes (line counts)
      managed-schema: 26
      solrconfig.xml: 278 → 226
      params.json: 20
    File count in conf
      .txt: 0
      .xml: 2
      .json: 1
      managed-schema (xml): 1
  • 61.
    Other searchComponents
    • Not on the default stack
      – spellcheck
      – term
      – termVector
      – elevator
    • Have dedicated requestHandlers
    • Inception (example within example)
    • Can be deleted – also delete elevate.xml
    15:<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    17:<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
    1:<searchComponent name="terms" class="solr.TermsComponent"/>
    9:<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
    1:<searchComponent name="tvComponent" class="solr.TermVectorComponent"/>
    8:<requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy">
    4:<searchComponent name="elevator" class="solr.QueryElevationComponent">
    8:<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
  • 62.
    Remove custom searchComponents
    Sizes (line counts)
      managed-schema: 26
      solrconfig.xml: 226 → 163
      params.json: 20
    File count in conf
      .txt: 0
      .xml: 2 → 1
      .json: 1
      managed-schema (xml): 1
  • 63.
    solrconfig.xml – more stuff
    • There is more that can be taken out
      – query section, since you have to tune it anyway
      – updateHandler, and revert to basic commits
      – jmx
      – enableRemoteStreaming – definitely take that out
    • But keep velocity, browse, search support
  • 64.
    Next action
    • Join the (virtual) Solr Example Reading Group
      – Starts November 2016
      – Register at http://bit.ly/SolrERG
    • Join mailing list at http://www.solr-start.com
      – Get the link to the presentation source
      – Learn about other similar projects
      – Get news of Solr articles and projects on the web

Editor's Notes

  • #5 I fully expect you to get lost several times along the way and hopefully find your way again. But don't worry, at the end, I will give you a way forward even if you got completely lost.
  • #8 I Don't Think It Means What You Think It Means
  • #13 Small tool makes good
  • #20 Too many files, types, fields, dynamic fields, and copyFields. And every example is “same same but different”
  • #44 Looking at bottom-right corner
  • #46 Actually, I added the description to the documentation – while preparing these slides. It's nice when committers dogfood, isn't it?
  • #48 We have split analyzers, because
  • #51 Both stopwords and synonyms here use text configuration files, but Solr now ships with REST-managed implementations as well, which you may prefer. You can see them in techproducts configuration