Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)

An overview of the Solr 6.2 examples, including the features they have and the challenges they present. A contrasting demonstration of a minimal viable example. A step-by-step deconstruction of the "films" example to show which parts of the shipped examples are not actually needed.

Published in: Software

  1. 1. Rebuilding Solr 6 examples – layer by layer Alexandre Rafalovitch www.solr-start.com
  2. 2. Who am I • Software developer with 20+ years of experience – Including 3 years as Senior Tech Support (BEA Weblogic) • Solr popularizer • Published book author on Solr Indexing (for Solr 4.3) • Run http://www.solr-start.com resource site • Solr committer (since August 2016) • Past and present Solr focus on onboarding, usability, tooling, information sharing
  3. 3. Example catch-22 • Search is a (surprisingly) complex area of expertise • Solr is a complex product – Wide – Deep – History-rich • And so are its many examples
  4. 4. Fasten the seatbelt • Review all of the (Solr 6.2) OOTB examples • Make a small one from scratch • Deconstruct a real shipped example • Next learning action...
  5. 5. OOTB Examples – how many? • bin/solr start -e • -e <example>: Name of the example to run; available examples: – cloud: SolrCloud example – techproducts: Comprehensive example illustrating many of Solr's core capabilities – dih: Data Import Handler – schemaless: Schema-less example
  6. 6. techproducts example • Used to be collection1 • solr.home: example/techproducts/solr – Can restart with bin/solr start -s example/techproducts/solr – Actual core at example/techproducts/solr/techproducts
  7. 7. techproducts example (cont.) • Source configuration – server/solr/configsets/sample_techproducts_configs – Not actually a configset (copy, not share) • Can be rebuilt: rm -rf example/techproducts • Has data (14 files of products, money, utf8 tests): bin/post -c techproducts example/exampledocs/*.xml
  8. 8. schemaless example • solr.home: example/schemaless/solr • Actual core: example/schemaless/solr/gettingstarted • Source configuration: – server/solr/configsets/data_driven_schema_configs – The config you get when you do not specify one: bin/solr create -c newcore • No data, but can take (nearly) anything: bin/post -c <name> example/exampledocs/*.xml
  9. 9. schemaless mode? • “Let us guess what you mean” – Auto-guess field type based on first content occurrence – Create explicit field definitions • booleans, dates, numbers, strings • Always multivalued (because: who knows?!?) • Can be configured (URP chain in solrconfig.xml) – Rewrites managed-schema (comments begone!) – Makes search work with <copyField source="*" dest="_text_"/>
  10. 10. techproducts vs schemaless • Configured techproducts vs auto-detecting schemaless • Strings "name":"Test with some GB18030 encoded characters", "name":["Test with some GB18030 encoded characters"], • Numbers "price":0.0, "price_c":"0.0,USD", "price":[0.0], • Booleans "inStock":true, "inStock":[true],
  11. 11. cloud example • Highly configurable (unless using -noprompt) • solr.home: example/cloud/nodeX/solr • Source configuration is a choice: Please choose a configuration for the gettingstarted collection, available options are: basic_configs, data_driven_schema_configs, or sample_techproducts_configs [data_driven_schema_configs] • Can be rebuilt: bin/solr stop -all; rm -rf example/cloud • Demonstrates Config API (configoverlay.json)
  12. 12. dih example(s) • Data import handler – legacy, but still kicking • solr.home: example/example-DIH/solr • Has 5 (five!) different cores – db - database import (example/example-DIH/hsqldb/ex.*) – solr - import from another Solr core (configured for db core) – mail - import from IMAP (needs some configuration) – tika - import rich-content (example/exampledocs/solr-word.pdf) – rss - external XML feed (very broken right now) • Cannot be rebuilt – only emptied bin/post -c db -type 'application/json' -d '{delete: {query:"*:*"}}'
  13. 13. What about: bin/solr start? • solr.home: server/solr • No initial collection/cores, have to create explicitly: – With script (see bin/solr create_core -h for details): bin/solr create -c <corename> -d <name or path> – With Core Admin UI for non-SolrCloud: http://localhost:8983/solr/admin/cores?action=CREATE&… – With Collection API for SolrCloud: http://localhost:8983/solr/admin/collections?action=CREATE&…
  14. 14. basic_configs configuration • Available for cloud example and explicit creation • Schemaless mode is configured, not enabled • “Minimal Solr configuration” !?! – managed-schema: 1005 lines – solrconfig.xml: 1484 lines
  15. 15. files example • Specifically tuned for file indexing – Augmented schemaless mode with language, content-type guessing – Custom /browse end-point – Source configuration: example/files/conf – Setup instructions: example/files/README.txt – Bring your own data
  16. 16. films example • Schemaless (Based on data_driven_schema_configs) – Uses Schema API to add custom fields – Uses schemaless for rest of fields • Comes with its own data (1100 film records) • Uses velocity (/browse), Schema API, Request Parameters API (params.json) • Setup instructions: example/films/README.txt
  17. 17. That was the good news • Many examples • Easy to get one running • Some come with data • Some you can throw your own data into • Lots of comments
  18. 18. This is the bad news
      Example               Files  Types  Fields  Dynamic fields  managed-schema lines  solrconfig.xml lines
      basic                    46     71       4              73                  1005                  1484
      data_driven              46     71       4              73                  1005                  1482
      techproducts            101     66      33              28                  1149                  1701
      dih db                   62     62      31              28                  1129                  1490
      dih tika                  6     61       3              27                   901                  1466
      files                    69     73       9              73                   517                  1508
      films (data_driven+)     46     71       8              73                   481                  1482
  19. 19. Tip – getting these numbers • XML extraction with XMLStarlet (XSLT CLI) – xml sel -t -m "//fieldType" -v @name -n managed-schema – xml sel -t -m "//copyField" -c . -n managed-schema |wc -l – xml sel -t -m "//*[@docValues]" -v "concat(local-name(), ' ', @name, ' docValues:', @docValues)" -n managed-schema – xml sel -t -m "//requestHandler" -v "@name" -n solrconfig.xml
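Similar counts can be obtained without XMLStarlet. The following is a minimal sketch using only the Python standard library; the inline schema is a made-up stand-in for a real managed-schema, and `count_schema_elements` is a hypothetical helper name, not part of Solr.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up miniature schema standing in for a real managed-schema file.
SAMPLE_SCHEMA = """<schema name="demo" version="1.6">
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="text" type="text_basic" indexed="true" stored="false"/>
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <copyField source="*" dest="text"/>
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text_basic" class="solr.TextField"/>
</schema>"""


def count_schema_elements(schema_xml):
    """Count field, dynamicField, fieldType and copyField elements at any depth."""
    root = ET.fromstring(schema_xml)
    wanted = {"field", "dynamicField", "fieldType", "copyField"}
    return Counter(el.tag for el in root.iter() if el.tag in wanted)


counts = count_schema_elements(SAMPLE_SCHEMA)
for tag in ("fieldType", "field", "dynamicField", "copyField"):
    print(tag, counts[tag])
```

To run it against a real core, the text of conf/managed-schema could be read from disk and passed in place of `SAMPLE_SCHEMA`.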
  20. 20. Why is it like this? • Many examples predate Solr Reference Guide • grep for options, possibilities, defaults • Each example is a kitchen sink “Too much of a good thing is also a bad thing” Source: 1980s Soviet joke about Virtual Reality
  21. 21. Go small – managed-schema <schema name="demo" version="1.6"> <dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/> <field name="text" type="text_basic" indexed="true" stored="false" multiValued="true"/> <copyField source="*" dest="text"/> …
  22. 22. Go small – managed-schema(2) … <fieldType name="string" class="solr.StrField"/> <fieldType name="text_basic" class="solr.TextField"> <analyzer> <tokenizer class="solr.LowerCaseTokenizerFactory" /> </analyzer> </fieldType> </schema>
  23. 23. Go small – solrconfig.xml <config> <luceneMatchVersion>6.2.0</luceneMatchVersion> <requestHandler name="/select" class="solr.SearchHandler"> <lst name="defaults"> <str name="df">text</str> </lst> </requestHandler> </config>
  24. 24. Go small – load and test • bin/solr create -c demo -d .../demo-config/ • bin/post -c demo example/exampledocs/*.xml • Test it works, using HTTPie (HTTP CLI)
  25. 25. Go small - review • Minimal example could be very minimal • Some things will not work – No uniqueKey – no way to update documents, no SolrCloud – No _version_ – no SolrCloud – Everything is multiValued – no sorting – copyField * => text, no meaningful relevancy, specialized analyzer chain processing
  26. 26. Deconstructing films example • bin/solr create -c films • curl http://localhost:8983/solr/films/schema ... (add name, initial_release_date) • Index 1100 records from – (Solr) XML, – (generic) JSON (doc), or – CSV format • Search for batman • Use /browse end-point and search for batman • Enable highlighting in results
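The Schema API step above takes a JSON body. As a rough sketch, the body could be assembled like this; the slide only names the two fields, so the types and attributes below are assumptions chosen to match the types listed later in the deck (text_general, tdate), and `build_add_field_payload` is a hypothetical helper.

```python
import json


def build_add_field_payload(fields):
    """Wrap a list of field definitions in a Schema API 'add-field' command."""
    return json.dumps({"add-field": fields}, indent=2)


# Assumed definitions: the slide names the fields but not their attributes.
FILM_FIELDS = [
    {"name": "name", "type": "text_general", "multiValued": False, "stored": True},
    {"name": "initial_release_date", "type": "tdate", "stored": True},
]

payload = build_add_field_payload(FILM_FIELDS)
print(payload)
```

The printed JSON could then be POSTed to the films core's schema endpoint shown on the slide, with a Content-type of application/json.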
  27. 27. Initial stats for films core • Sizes (line counts): managed-schema* 481; solrconfig.xml 1482; params.json 20 • File count in conf: .txt 41; .xml 3; .json 1; managed-schema (xml) 1 • * already has no comments
  28. 28. Deconstructing – just straight tags • managed-schema lost comments during construction • Let's remove comments from solrconfig.xml • xml ed -L -d "//comment()" solrconfig.xml – Edit in place – Delete XPATH
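For readers without XMLStarlet, comment removal can be approximated in Python. This is a sketch under one assumption worth stating: it relies on ElementTree's default parser discarding comments, so a parse/serialize round trip drops them. The sample config is made up and `strip_comments` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up fragment standing in for a heavily commented solrconfig.xml.
SAMPLE_CONFIG = """<config>
  <!-- Controls which Lucene version behaviour to emulate -->
  <luceneMatchVersion>6.2.0</luceneMatchVersion>
  <!-- A very long explanatory comment
       spanning several lines -->
  <requestHandler name="/select" class="solr.SearchHandler"/>
</config>"""


def strip_comments(xml_text):
    """Re-serialize XML without its comments.

    The default ElementTree parser does not keep comments, so they vanish
    on the round trip. Whitespace around them is kept, so blank lines may
    remain and would need a separate clean-up before counting lines.
    """
    root = ET.fromstring(xml_text)
    return ET.tostring(root, encoding="unicode")


cleaned = strip_comments(SAMPLE_CONFIG)
print(cleaned)
```

Unlike `xml ed -L`, this does not edit in place; the result would have to be written back to the file explicitly.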
  29. 29. solrconfig.xml without comments • Sizes (line counts): managed-schema 481; solrconfig.xml 1482 → 278; params.json 20 • File count in conf: .txt 41; .xml 3; .json 1; managed-schema (xml) 1
  30. 30. Deconstructing – what to clean • Currently – (explicit) fields: 8 – dynamic fields: 73 • xml sel -t -m "//dynamicField" -v @name -n managed-schema |wc -l – types: 71 – copyFields: 1 • Let's start from dynamic fields
  31. 31. Deconstructing – dynamic fields • Used dynamic fields – do NOT modify schema – DO show up in Admin UI, if used – Example from different schema: • Used/matched fields • Generic definitions
  32. 32. Deconstructing – in use dynamic fields
  33. 33. Deconstructing – in use dynamic fields • NO dynamic fields are used – * is a copyField instruction • Can remove them all • xml ed -L -d "//dynamicField" managed-schema
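The "is any dynamic field actually in use?" check can be sketched offline, assuming you already have the list of field names present in the index (for example, copied from the Admin UI). The patterns and field names below are made up, `used_dynamic_patterns` is a hypothetical helper, and glob matching via `fnmatchcase` is only an approximation of Solr's leading/trailing `*` patterns.

```python
from fnmatch import fnmatchcase


def used_dynamic_patterns(patterns, index_fields, explicit_fields):
    """Map each dynamicField pattern to the index fields it would match.

    Fields declared explicitly in the schema are skipped, because an
    explicit definition takes precedence over a dynamic one.
    """
    candidates = [f for f in index_fields if f not in explicit_fields]
    return {p: [f for f in candidates if fnmatchcase(f, p)] for p in patterns}


# Made-up inputs: 'price_i' is the only field without an explicit definition.
patterns = ["*_i", "*_s", "attr_*"]
index_fields = ["id", "name", "genre", "directed_by", "price_i"]
explicit = {"id", "name", "genre", "directed_by"}

matches = used_dynamic_patterns(patterns, index_fields, explicit)
for pattern, fields in matches.items():
    print(pattern, fields)
```

A pattern whose list is empty is a candidate for removal; in the films core every pattern would come back empty, which is what justifies deleting them all.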
  34. 34. Remove dynamicFields • Sizes (line counts): managed-schema 481 → 409; solrconfig.xml 278; params.json 20 • File count in conf: .txt 41; .xml 3; .json 1; managed-schema (xml) 1
  35. 35. Deconstructing – field types • How many types out of 71 do we use? – xml sel -t -m "//field|//dynamicField" -v "@type" -n conf/managed-schema |sort -u – long, string, strings, tdate, text_general • But also some in solrconfig.xml – booleans, string, strings, tdates, tdoubles, text_general, tlongs • Combined total: 9 field type definitions • Delete the rest (by hand)
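Deleting the unused types by hand is tedious, so here is one way the step might be scripted. It is a sketch, not the method used in the talk: the sample schema is made up, `prune_field_types` is a hypothetical helper, and the types referenced from solrconfig.xml have to be supplied manually through `extra_used`.

```python
import xml.etree.ElementTree as ET

# Made-up schema: 'text_ja' is defined but referenced nowhere.
SAMPLE = """<schema name="films" version="1.6">
  <field name="id" type="string"/>
  <field name="name" type="text_general"/>
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text_general" class="solr.TextField"/>
  <fieldType name="tdates" class="solr.TrieDateField" multiValued="true"/>
  <fieldType name="text_ja" class="solr.TextField"/>
</schema>"""


def prune_field_types(schema_xml, extra_used=()):
    """Drop fieldType definitions not referenced by any field or dynamicField.

    extra_used lists type names referenced elsewhere (e.g. in the
    schemaless chain in solrconfig.xml) that must be kept as well.
    """
    root = ET.fromstring(schema_xml)
    used = {el.get("type") for el in root.iter()
            if el.tag in ("field", "dynamicField")}
    used.update(extra_used)
    removed = []
    # Snapshot the tree first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "fieldType" and child.get("name") not in used:
                parent.remove(child)
                removed.append(child.get("name"))
    return ET.tostring(root, encoding="unicode"), sorted(removed)


pruned, removed = prune_field_types(SAMPLE, extra_used={"tdates"})
print("removed:", removed)
```

Passing the seven type names found in solrconfig.xml as `extra_used` would reproduce the combined set of nine types kept on the slide.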
  36. 36. Remove no-longer used types • Sizes (line counts): managed-schema 409 → 34 (!!!); solrconfig.xml 278; params.json 20 • File count in conf: .txt 41; .xml 3; .json 1; managed-schema (xml) 1
  37. 37. Deconstructing – support files • Inside lang directory (38 files) – find lang -name 'stopwords_*.txt' | wc -l • stopwords_*.txt: 30 files • contractions_*.txt: 4 files – find lang -type f |egrep -v 'stopwords_|contractions_' • hyphenations_ga.txt, stemdict_nl.txt, stoptags_ja.txt, userdict_ja.txt
  38. 38. Support files – still in use? • Check for usage – grep -o 'stopwords_.*.txt' managed-schema solrconfig.xml – grep -o 'contractions_.*.txt' ... – ... • NO Matches (we no longer have related types) – Delete the whole lang directory • What about files just inside config directory – Don't need currency.xml, protwords.txt
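The grep-based usage check can also be sketched as a small function. It assumes the file names and the config text have already been collected; the inputs below are made up, `unreferenced_files` is a hypothetical helper, and the plain substring test is deliberately crude (a name that is a suffix of another name would count as referenced).

```python
def unreferenced_files(file_names, config_texts):
    """Return support files whose base name appears in none of the config texts."""
    combined = "\n".join(config_texts)
    return sorted(
        name for name in file_names
        if name.rsplit("/", 1)[-1] not in combined
    )


# Made-up inputs standing in for the conf directory listing and its XML files.
files = ["lang/stopwords_en.txt", "stopwords.txt", "synonyms.txt", "protwords.txt"]
schema_text = '<filter class="solr.StopFilterFactory" words="stopwords.txt"/>'
config_text = '<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>'

unused = unreferenced_files(files, [schema_text, config_text])
print(unused)
```

Anything in the returned list is only a candidate for deletion; it is worth reloading the core afterwards to confirm nothing else depended on the file.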
  39. 39. Remove no-longer used types • Sizes (line counts): managed-schema 34; solrconfig.xml 278; params.json 20 • File count in conf: .txt 41 → 2; .xml 3 → 2; .json 1; managed-schema (xml) 1
  40. 40. Deconstructing – actual field usage
  41. 41. Actual field usage - _root_
  42. 42. The mystery of _root_ • In the original schema – no explanations • Documentation – used for nested documents: To support nested documents, the schema must include an indexed/non-stored field _root_ . The value of that field is populated automatically and is the same for all documents in the block, regardless of the inheritance depth. • We are not using nested documents • And neither does any other shipped example...
  43. 43. Remove _root_ • Sizes (line counts): managed-schema 34 → 33; solrconfig.xml 278; params.json 20 • File count in conf: .txt 2; .xml 2; .json 1; managed-schema (xml) 1
  44. 44. Deconstructing – text_general type <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true"> <analyzer type="index"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/> <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType>
  45. 45. text_general support files • stopwords.txt # Licensed to the Apache Software Foundation (ASF) under one or more # contributor license agreements. See the NOTICE file distributed with # this work for additional information regarding copyright ownership. # The ASF licenses this file to You under the Apache License, Version 2.0 # (the "License"); you may not use this file except in compliance with # the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. • synonyms.txt # The ASF licenses this file to You under the Apache License, Version 2.0 # (the "License"); you may not use this file except in compliance with # the License. You may obtain a copy of the License at # ...... #----------------------------------------------------------------------- # some test synonym mappings unlikely to appear in real input text aaafoo => aaabar bbbfoo => bbbfoo bbbbar cccfoo => cccbar cccbaz fooaaa,baraaa,bazaaa # Some synonym groups specific to this example GB,gib,gigabyte,gigabytes MB,mib,megabyte,megabytes Television, Televisions, TV, TVs # notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming # after us won't split it into two words. # Synonym mappings can be used for spelling correction too pixima => pixma
  46. 46. text_general's empty stopwords • No file => default stopwords => English • Empty file => disabled stopwords • Currently – NOT used
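Whether a word-list file is "empty" in the sense used above can be checked mechanically. A minimal sketch, assuming that lines starting with `#` are comments; the sample texts are made up and `effective_entries` is a hypothetical helper.

```python
def effective_entries(text):
    """Return the lines that are neither blank nor '#' comments."""
    entries = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            entries.append(stripped)
    return entries


# Made-up samples: a header-only file versus one with real stopwords.
LICENSE_ONLY = """# Licensed to the Apache Software Foundation (ASF)
# under one or more contributor license agreements.

"""
WITH_WORDS = """# a tiny stopword list
a
the
"""

print(len(effective_entries(LICENSE_ONLY)))
print(effective_entries(WITH_WORDS))
```

A file with no effective entries behaves like the empty file described on the slide, so the filter that references it does nothing.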
  47. 47. text_general simplified definition <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType>
  48. 48. Remove stopwords and synonyms • Sizes (line counts): managed-schema 33 → 26; solrconfig.xml 278; params.json 20 • File count in conf: .txt 2 → 0; .xml 2; .json 1; managed-schema (xml) 1
  49. 49. How far did we get • Sizes (line counts): managed-schema* 481 → 26; solrconfig.xml 1482 → 278; params.json 20 • File count in conf: .txt 41 → 0; .xml 3 → 2; .json 1; managed-schema (xml) 1 • * already has no comments
  50. 50. Deconstructing – solrconfig.xml • solrconfig.xml is more complex than schema • Heterogeneous Sections • Nested definitions • Alternative implementations (e.g. highlighter) • Also remember – configoverlay.json – overrides solrconfig.xml – params.json – additional configuration parameters
  51. 51. solrconfig.xml – feature counts 11 requestHandler 8 lib 5 searchComponent 3 queryResponseWriter 2 initParams 1 updateRequestProcessorChain 1 updateHandler 1 requestDispatcher 1 query 1 luceneMatchVersion 1 jmx 1 indexConfig 1 directoryFactory 1 dataDir 1 codecFactory
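A tally like the one above can be produced by counting the top-level children of the config element. The sketch below assumes nothing beyond the Python standard library; the sample config is made up and `top_level_counts` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up fragment standing in for a full solrconfig.xml.
SAMPLE_CONFIG = """<config>
  <luceneMatchVersion>6.2.0</luceneMatchVersion>
  <lib dir="contrib/velocity/lib"/>
  <requestHandler name="/select" class="solr.SearchHandler"/>
  <requestHandler name="/query" class="solr.SearchHandler"/>
  <searchComponent name="terms" class="solr.TermsComponent"/>
</config>"""


def top_level_counts(config_xml):
    """Count direct children of the root element by tag, most frequent first."""
    root = ET.fromstring(config_xml)
    return Counter(child.tag for child in root).most_common()


feature_counts = top_level_counts(SAMPLE_CONFIG)
for tag, count in feature_counts:
    print(count, tag)
```

Only direct children are counted, so nested definitions (for example, the pieces inside a highlighter component) do not inflate the numbers.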
  52. 52. solrconfig.xml – line counts 55:<updateRequestProcessorChain name="add-unknown-fields-to-the-schema"> 52:<searchComponent class="solr.HighlightComponent" name="highlight"> 18:<query> 17:<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy"> 15:<searchComponent name="spellcheck" class="solr.SpellCheckComponent"> 13:<updateHandler class="solr.DirectUpdateHandler2"> 9:<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy"> 8:<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy"> 8:<requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy"> 7:<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler"> 7:<requestHandler name="/query" class="solr.SearchHandler"> 6:<requestHandler name="/debug/dump" class="solr.DumpRequestHandler"> ......
  53. 53. Remember, this works! <config> <luceneMatchVersion>6.2.0</luceneMatchVersion> <requestHandler name="/select" class="solr.SearchHandler"> <lst name="defaults"> <str name="df">text</str> </lst> </requestHandler> </config>
  54. 54. add-unknown-fields-to-the-schema • Famous "schemaless" mode • Generic, but fully configurable • Far from perfect – Remember, we had to manually pre-add fields – Development, not production – Has normalization side-effects (normalizes dates) • Cannot remove it in our example
  55. 55. solrconfig.xml - highlighter <searchComponent class="solr.HighlightComponent" name="highlight"> <highlighting> <fragmenter name="gap" default="true" class="solr.highlight.GapFragmenter"> <lst name="defaults"> <int name="hl.fragsize">100</int> </lst> </fragmenter> <fragmenter name="regex" class="solr.highlight.RegexFragmenter"> <lst name="defaults"> <int name="hl.fragsize">70</int> <float name="hl.regex.slop">0.5</float> <str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str> </lst> </fragmenter> <formatter name="html" default="true" class="solr.highlight.HtmlFormatter"> <lst name="defaults"> <str name="hl.simple.pre"><![CDATA[<em>]]></str> <str name="hl.simple.post"><![CDATA[</em>]]></str> </lst> </formatter> <encoder name="html" class="solr.highlight.HtmlEncoder"/> <fragListBuilder name="simple" class="solr.highlight.SimpleFragListBuilder"/> <fragListBuilder name="single" class="solr.highlight.SingleFragListBuilder"/> ....... • fragmenters • encoders • fragListBuilders • fragmentBuilders • boundaryScanners • ....
  56. 56. highlighter – the truth • Highlighter searchComponent is in the default stack • The params are a mix of the standard highlighter and the alternative FastVector highlighter • Cannot use the FastVector version, as the schema fields are missing termVectors, etc. • And the standard highlighter params are the same as the implicit values • Therefore, we can remove the WHOLE definition
  57. 57. Remove highlighter • Sizes (line counts): managed-schema 26; solrconfig.xml 278 → 226; params.json 20 • File count in conf: .txt 0; .xml 2; .json 1; managed-schema (xml) 1
  58. 58. Other searchComponents • Not on the default stack – spellcheck – term – termVector – elevator • Have dedicated requestHandlers • Inception (example within example) • Can be deleted – also delete elevate.xml 15:<searchComponent name="spellcheck" class="solr.SpellCheckComponent"> 17:<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy"> 1:<searchComponent name="terms" class="solr.TermsComponent"/> 9:<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy"> 1:<searchComponent name="tvComponent" class="solr.TermVectorComponent"/> 8:<requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy"> 4:<searchComponent name="elevator" class="solr.QueryElevationComponent"> 8:<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
  59. 59. Remove custom searchComponents • Sizes (line counts): managed-schema 26; solrconfig.xml 226 → 163; params.json 20 • File count in conf: .txt 0; .xml 2 → 1; .json 1; managed-schema (xml) 1
  60. 60. solrconfig.xml – more stuff • There is more that can be taken out – query section, since you have to tune it anyway – updateHandler, and revert to basic commits – jmx – enableRemoteStreaming – definitely take that out • But keep velocity, browse, search support
  61. 61. Next action • Join the (virtual) Solr Example Reading Group – Starts November 2016 – Register at http://bit.ly/SolrERG • Join mailing list at http://www.solr-start.com – Get the link to the presentation source – Learn about other similar projects – Get news of Solr articles and projects on the web