Lucene for Solr Developers
Published in Education, Technology
  • 1. Lucene for Solr Developers uberconf - July 14, 2011 Presented by Erik Lucid Imagination
  • 2. Lucene Core• IndexWriter• Directory• IndexReader, IndexSearcher• analysis: Analyzer, TokenStream, Tokenizer, TokenFilter• Query
  • 3. Solr Architecture
  • 4. Customizing - Don't do it!• Unless you need to.• In other words... ensure you've given the built-in capabilities a try, asked on the e-mail list, and spelunked into at least Solr's code a bit to make some sense of the situation.• But we're here to roll up our sleeves, because we need to...
  • 5. But first...• Look at Lucene and/or Solr source code as appropriate• Carefully read javadocs and wiki pages - lots of tips there• And, hey, search for what you're trying to do... • Google, of course • But try out LucidFind and other Lucene ecosystem specific search systems
  • 6. Extension points• Tokenizer, TokenFilter, CharFilter• SearchComponent• RequestHandler• ResponseWriter• FieldType• Similarity• QParser• DataImportHandler hooks • data sources • entity processors • transformers• several others
  • 7. Factories• FooFactory (most) everywhere. Sometimes there's BarPlugin style• for sake of discussion... let's just skip the "factory" part• In Solr, Factories and Plugins are used by configuration loading to parameterize and construct
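The factory idea on the slide above can be sketched without any Solr classes. This is a minimal, JDK-only illustration, assuming a loader that is handed a class name and a map of arguments; `TokenTransform`, `TokenTransformFactory`, `TruncateTransformFactory`, `PluginLoaderSketch` and the `maxLength` argument are all hypothetical names invented for the example, not Solr APIs.

```java
import java.util.Map;

/** Hypothetical plugin type produced by a factory; not a Solr interface. */
interface TokenTransform {
    String apply(String token);
}

/** Hypothetical factory contract: configured once via init, then asked for instances. */
interface TokenTransformFactory {
    void init(Map<String, String> args);

    TokenTransform create();
}

/** Example factory that reads a made-up "maxLength" argument. */
class TruncateTransformFactory implements TokenTransformFactory {
    private int maxLength = Integer.MAX_VALUE;

    @Override
    public void init(Map<String, String> args) {
        String value = args.get("maxLength");
        if (value != null) {
            maxLength = Integer.parseInt(value);
        }
    }

    @Override
    public TokenTransform create() {
        final int limit = maxLength;
        // Tokens longer than the limit are cut; shorter ones pass through unchanged.
        return token -> token.length() <= limit ? token : token.substring(0, limit);
    }
}

/** Stand-in for configuration loading: class name plus arguments in, configured factory out. */
class PluginLoaderSketch {
    static TokenTransformFactory load(String className, Map<String, String> args) {
        try {
            Object instance = Class.forName(className).getDeclaredConstructor().newInstance();
            TokenTransformFactory factory = (TokenTransformFactory) instance;
            factory.init(args);
            return factory;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot load plugin " + className, e);
        }
    }
}
```

The split matters because the factory is built and parameterized once, at configuration time, while `create()` can be called as often as instances are needed.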
  • 8. "Installing" plugins• Compile .java to .class, JAR it up• Put JAR files in either: • <solr-home>/lib • a shared lib when using multicore • anywhere, and register location in solrconfig.xml• Hook in plugins as appropriate
  • 9. Multicore sharedLib
      <solr sharedLib="/usr/local/solr/customlib" persistent="true">
        <cores adminPath="/admin/cores">
          <core instanceDir="core1" name="core1"/>
          <core instanceDir="core2" name="core2"/>
        </cores>
      </solr>
  • 10. Plugins via solrconfig.xml• <lib dir="/path/to/your/custom/jars" />
  • 11. Analysis• CharFilter• Tokenizer• TokenFilter
  • 12. Primer• Tokens, Terms• Attributes: Type, Payloads, Offsets, Positions, Term Vectors• part of the picture:
  • 13. Version• enum: • Version.LUCENE_31, Version.LUCENE_32, etc• Version.onOrAfter(Version other)
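A version check of the `onOrAfter` kind is typically used to keep old behavior available while newer versions change it. The sketch below is a JDK-only stand-in, not Lucene's `Version` class: `SketchVersion`, its constants, and the lowercasing rule in `VersionedBehaviorSketch` are invented purely for illustration.

```java
import java.util.Locale;

/** Hypothetical stand-in for a version enum; constants are declared oldest first. */
enum SketchVersion {
    V_30, V_31, V_32;

    /** True when this version is the same as, or newer than, the other one. */
    boolean onOrAfter(SketchVersion other) {
        return compareTo(other) >= 0;
    }
}

class VersionedBehaviorSketch {
    /**
     * Made-up behavior change: from V_31 on tokens are lowercased,
     * earlier versions return them untouched.
     */
    static String normalize(SketchVersion matchVersion, String token) {
        if (matchVersion.onOrAfter(SketchVersion.V_31)) {
            return token.toLowerCase(Locale.ROOT);
        }
        return token;
    }
}
```

Because the comparison relies on declaration order, new constants in such an enum have to be appended at the end.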
  • 14. CharFilter• extend BaseCharFilter• enables pre-tokenization filtering/morphing of incoming field value• only affects tokenization, not stored value• Built-in CharFilters: HTMLStripCharFilter, PatternReplaceCharFilter, and MappingCharFilter
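The point that a CharFilter changes what gets tokenized but not what gets stored can be shown with plain strings. This is a rough, JDK-only sketch; `CharFilterSketch` and its methods are hypothetical, and the regex is a deliberately naive substitute for real HTML stripping.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class CharFilterSketch {
    /** Naive tag stripper standing in for a char filter; real HTML needs a real parser. */
    static String stripTags(String raw) {
        return raw.replaceAll("<[^>]*>", " ");
    }

    /** Tokens are produced from the filtered text, split on whitespace. */
    static List<String> indexedTokens(String raw) {
        String filtered = stripTags(raw).trim();
        if (filtered.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(filtered.split("\\s+"));
    }

    /** The stored value is the untouched input. */
    static String storedValue(String raw) {
        return raw;
    }
}
```

For the input `<b>Cool</b> Gizmo`, the tokens come from the stripped text while the stored value still contains the markup.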
  • 15. Tokenizer• common to extend CharTokenizer• implement - • protected abstract boolean isTokenChar(int c);• optionally override - • protected int normalize(int c)• extend Tokenizer directly for finer control• Popular built-in Tokenizers include: WhitespaceTokenizer, StandardTokenizer, PatternTokenizer, KeywordTokenizer, ICUTokenizer
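The `isTokenChar` / `normalize` pair described above amounts to "split on characters that are not token characters, and map each kept character through a hook". A minimal sketch of that contract in plain Java follows; `SimpleCharTokenizer` and `LowerCaseAlnumTokenizer` are hypothetical classes that only mimic the shape of the two methods named on the slide and do not use Lucene's attribute-based API.

```java
import java.util.ArrayList;
import java.util.List;

/** JDK-only stand-in for the CharTokenizer contract; not the Lucene class. */
abstract class SimpleCharTokenizer {
    /** Returns true if the code point belongs inside a token. */
    protected abstract boolean isTokenChar(int c);

    /** Hook for per-character normalization; identity by default. */
    protected int normalize(int c) {
        return c;
    }

    /** Splits the input into maximal runs of token characters. */
    public final List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int c = input.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar(c)) {
                current.appendCodePoint(normalize(c));
            } else if (current.length() > 0) {
                // A non-token character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}

/** Hypothetical subclass: letters and digits form tokens, and letters are lowercased. */
class LowerCaseAlnumTokenizer extends SimpleCharTokenizer {
    @Override
    protected boolean isTokenChar(int c) {
        return Character.isLetterOrDigit(c);
    }

    @Override
    protected int normalize(int c) {
        return Character.toLowerCase(c);
    }
}
```

Working on `int` code points rather than `char` values is what lets the same loop handle characters outside the Basic Multilingual Plane.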
  • 16. TokenFilter• a TokenStream whose input is another TokenStream• Popular TokenFilters include: LowerCaseFilter, CommonGramsFilter, SnowballFilter, StopFilter, WordDelimiterFilter
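"A TokenStream whose input is another TokenStream" is the decorator pattern. The sketch below uses `Iterator<String>` as a simplified stand-in for a token stream so that it runs on the JDK alone; `LowerCaseTokens`, `StopWordTokens`, `TokenFilterChainSketch` and the stop word list are hypothetical and much simpler than the Lucene filters of similar names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.NoSuchElementException;
import java.util.Set;

/** Filter that lowercases each token of the stream it wraps. */
class LowerCaseTokens implements Iterator<String> {
    private final Iterator<String> input;

    LowerCaseTokens(Iterator<String> input) {
        this.input = input;
    }

    @Override
    public boolean hasNext() {
        return input.hasNext();
    }

    @Override
    public String next() {
        return input.next().toLowerCase(Locale.ROOT);
    }
}

/** Filter that drops stop words, buffering one token of lookahead. */
class StopWordTokens implements Iterator<String> {
    private final Iterator<String> input;
    private final Set<String> stopWords;
    private String pending;

    StopWordTokens(Iterator<String> input, Set<String> stopWords) {
        this.input = input;
        this.stopWords = stopWords;
    }

    @Override
    public boolean hasNext() {
        // Pull from the wrapped stream until a non-stop-word is found.
        while (pending == null && input.hasNext()) {
            String candidate = input.next();
            if (!stopWords.contains(candidate)) {
                pending = candidate;
            }
        }
        return pending != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String token = pending;
        pending = null;
        return token;
    }
}

class TokenFilterChainSketch {
    /** Whitespace "tokenizer" followed by lowercase and stop word filters. */
    static List<String> analyze(String text) {
        Iterator<String> source = Arrays.asList(text.trim().split("\\s+")).iterator();
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "of"));
        Iterator<String> chain = new StopWordTokens(new LowerCaseTokens(source), stopWords);
        List<String> tokens = new ArrayList<>();
        while (chain.hasNext()) {
            tokens.add(chain.next());
        }
        return tokens;
    }
}
```

Order in the chain matters here: lowercasing runs first so that "The" matches the lowercase stop word "the".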
  • 17. Lucene's analysis APIs• tricky business, what with Attributes (Source/Factory's), State, characters, code points, Version, etc...• Test!!! • BaseTokenStreamTestCase • Look at Lucene and Solr's test cases
  • 18. Solr's Analysis Tools• Admin analysis tool• Field analysis request handler• DEMO
  • 19. Query Parsing• String ->
  • 20. QParserPlugin
      public abstract class QParserPlugin implements NamedListInitializedPlugin {
        public abstract QParser createParser(
            String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req);
      }
  • 21. QParser
      public abstract class QParser {
        public abstract Query parse() throws ParseException;
      }
  • 22. Built-in QParsers• from Solr's source:
      /** internal use - name to class mappings of builtin parsers */
      public static final Object[] standardPlugins = {
        LuceneQParserPlugin.NAME, LuceneQParserPlugin.class,
        OldLuceneQParserPlugin.NAME, OldLuceneQParserPlugin.class,
        FunctionQParserPlugin.NAME, FunctionQParserPlugin.class,
        PrefixQParserPlugin.NAME, PrefixQParserPlugin.class,
        BoostQParserPlugin.NAME, BoostQParserPlugin.class,
        DisMaxQParserPlugin.NAME, DisMaxQParserPlugin.class,
        ExtendedDismaxQParserPlugin.NAME, ExtendedDismaxQParserPlugin.class,
        FieldQParserPlugin.NAME, FieldQParserPlugin.class,
        RawQParserPlugin.NAME, RawQParserPlugin.class,
        TermQParserPlugin.NAME, TermQParserPlugin.class,
        NestedQParserPlugin.NAME, NestedQParserPlugin.class,
        FunctionRangeQParserPlugin.NAME, FunctionRangeQParserPlugin.class,
        SpatialFilterQParserPlugin.NAME, SpatialFilterQParserPlugin.class,
        SpatialBoxQParserPlugin.NAME, SpatialBoxQParserPlugin.class,
        JoinQParserPlugin.NAME, JoinQParserPlugin.class,
      };
  • 23. Local Parameters• {!qparser_name param=value}expression • or• {!qparser_name param=value v=expression}• Can substitute $references from request parameters
  • 24. Param Substitution
      solrconfig.xml:
      <requestHandler name="/document" class="solr.SearchHandler">
        <lst name="invariants">
          <str name="q">{!term f=id v=$id}</str>
        </lst>
      </requestHandler>
      Solr request:
      http://localhost:8983/solr/document?id=FOO37
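To make the local-params syntax and the `$id` substitution concrete, here is a small JDK-only parser sketch. It is an illustration of the idea rather than Solr's parser: `LocalParamsSketch` is a hypothetical class, it handles no quoting or escaping, and storing the parser name under the key `type` is a choice made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LocalParamsSketch {
    /**
     * Parses "{!name k=v ...}expression" into a map.
     * "type" holds the parser name and "v" the expression to parse.
     * Values of the form $ref are looked up in the request parameters.
     */
    static Map<String, String> parse(String q, Map<String, String> requestParams) {
        Map<String, String> out = new LinkedHashMap<>();
        if (!q.startsWith("{!")) {
            // No local params: the whole string is the expression.
            out.put("v", q);
            return out;
        }
        int end = q.indexOf('}');
        if (end < 0) {
            throw new IllegalArgumentException("unterminated local params: " + q);
        }
        String[] parts = q.substring(2, end).trim().split("\\s+");
        out.put("type", parts[0]);
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("expected key=value: " + parts[i]);
            }
            String key = parts[i].substring(0, eq);
            String value = parts[i].substring(eq + 1);
            if (value.startsWith("$")) {
                // Dereference from the request, e.g. v=$id with ?id=FOO37.
                value = requestParams.get(value.substring(1));
            }
            out.put(key, value);
        }
        String trailing = q.substring(end + 1);
        if (!trailing.isEmpty()) {
            // The text after the closing brace is the expression.
            out.put("v", trailing);
        }
        return out;
    }
}
```

With the request from the slide (`id=FOO37`), the invariant `{!term f=id v=$id}` resolves to parser `term`, field `id`, and value `FOO37`.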
  • 25. Custom QParser• Implement a QParserPlugin that creates your custom QParser• Register in solrconfig.xml • <queryParser name="myparser" class="com.mycompany.MyQParserPlugin"/>
  • 26. Update Processor• Responsible for handling these commands: • add/update • delete • commit • merge indexes
  • 27. Built-in Update Processors• RunUpdateProcessor • Actually performs the operations, such as adding the documents to the index• LogUpdateProcessor • Logs each operation• SignatureUpdateProcessor • duplicate detection and optionally rejection
  • 28. UIMA Update Processor• UIMA - Unstructured Information Management Architecture• Enables UIMA components to augment documents• Entity extraction, automated categorization, language detection, etc• "contrib" plugin
  • 29. Update Processor Chain• UpdateProcessors sequence into a chain• Each processor can abort the entire update or hand processing to next processor in the chain• Chains, of update processor factories, are specified in solrconfig.xml• Update requests can specify an update.processor parameter
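The chain described above is a chain of responsibility: each processor either hands the document to the next one or stops the update. A minimal JDK-only sketch, assuming documents are plain string maps; `SketchUpdateProcessor`, `RequireIdProcessor`, `LogProcessor`, `RunProcessor` and `UpdateChainSketch` are hypothetical stand-ins, not Solr's update processor classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical base class: the default behavior is to pass the document along. */
abstract class SketchUpdateProcessor {
    private final SketchUpdateProcessor next;

    SketchUpdateProcessor(SketchUpdateProcessor next) {
        this.next = next;
    }

    void processAdd(Map<String, String> doc) {
        if (next != null) {
            next.processAdd(doc);
        }
    }
}

/** Aborts the update by throwing when the document has no id. */
class RequireIdProcessor extends SketchUpdateProcessor {
    RequireIdProcessor(SketchUpdateProcessor next) {
        super(next);
    }

    @Override
    void processAdd(Map<String, String> doc) {
        if (!doc.containsKey("id")) {
            throw new IllegalArgumentException("document is missing an id");
        }
        super.processAdd(doc);
    }
}

/** Records the operation, then hands off to the next processor. */
class LogProcessor extends SketchUpdateProcessor {
    private final List<String> log;

    LogProcessor(List<String> log, SketchUpdateProcessor next) {
        super(next);
        this.log = log;
    }

    @Override
    void processAdd(Map<String, String> doc) {
        log.add("add id=" + doc.get("id"));
        super.processAdd(doc);
    }
}

/** End of the chain: pretends to index by writing a log line. */
class RunProcessor extends SketchUpdateProcessor {
    private final List<String> log;

    RunProcessor(List<String> log) {
        super(null);
        this.log = log;
    }

    @Override
    void processAdd(Map<String, String> doc) {
        log.add("indexed id=" + doc.get("id"));
    }
}

class UpdateChainSketch {
    /** Sends each document through RequireId -> Log -> Run and returns the log lines. */
    static List<String> run(List<Map<String, String>> docs) {
        List<String> log = new ArrayList<>();
        SketchUpdateProcessor chain =
                new RequireIdProcessor(new LogProcessor(log, new RunProcessor(log)));
        for (Map<String, String> doc : docs) {
            try {
                chain.processAdd(doc);
            } catch (IllegalArgumentException e) {
                log.add("rejected: " + e.getMessage());
            }
        }
        return log;
    }
}
```

A document without an id never reaches the logging or indexing steps, which is the "abort the entire update" case from the slide.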
  • 30. Default update processor chain• From Solr's source (construct the default chain):
      UpdateRequestProcessorFactory[] factories = new UpdateRequestProcessorFactory[]{
        new RunUpdateProcessorFactory(),
        new LogUpdateProcessorFactory()
      };
      Note: these steps have been swapped on trunk recently
  • 31. Example Update Processor• What are the best facets to show for a particular query? Wouldn't it be nice to see the distribution of document "attributes" represented across a result set?• Learned this trick from the Smithsonian, who were doing it manually - add an indexed field containing the field names of the interesting other fields on the document.• Facet on that field "of field names" initially, then request facets on the top values returned.
  • 32. Config for custom update processor
      <updateRequestProcessorChain name="fields_used" default="true">
        <processor class="solr.processor.FieldsUsedUpdateProcessorFactory">
          <str name="fieldsUsedFieldName">attribute_fields</str>
          <str name="fieldNameRegex">.*_attribute</str>
        </processor>
        <processor class="solr.LogUpdateProcessorFactory" />
        <processor class="solr.RunUpdateProcessorFactory" />
      </updateRequestProcessorChain>
  • 33. FieldsUsedUpdateProcessorFactory
      public class FieldsUsedUpdateProcessorFactory extends UpdateRequestProcessorFactory {
        private String fieldsUsedFieldName;
        private Pattern fieldNamePattern;
        public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                                  UpdateRequestProcessor next) {
          return new FieldsUsedUpdateProcessor(req, rsp, this, next);
        }
        // ... next slide ...
      }
  • 34. FieldsUsedUpdateProcessorFactory
      @Override
      public void init(NamedList args) {
        if (args == null) return;
        SolrParams params = SolrParams.toSolrParams(args);
        fieldsUsedFieldName = params.get("fieldsUsedFieldName");
        if (fieldsUsedFieldName == null) {
          throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
              "fieldsUsedFieldName must be specified");
        }
        // TODO check that fieldsUsedFieldName is a valid field name and multiValued
        String fieldNameRegex = params.get("fieldNameRegex");
        if (fieldNameRegex == null) {
          throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
              "fieldNameRegex must be specified");
        }
        fieldNamePattern = Pattern.compile(fieldNameRegex);
        super.init(args);
      }
  • 35. class FieldsUsedUpdateProcessor extends UpdateRequestProcessor {
        // Assumption: nested inside FieldsUsedUpdateProcessorFactory, which would explain
        // how it reads fieldNamePattern and fieldsUsedFieldName directly.
        public FieldsUsedUpdateProcessor(SolrQueryRequest req, SolrQueryResponse rsp,
                                         FieldsUsedUpdateProcessorFactory factory,
                                         UpdateRequestProcessor next) {
          super(next);
        }
        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
          SolrInputDocument doc = cmd.getSolrInputDocument();
          Collection<String> incomingFieldNames = doc.getFieldNames();
          Iterator<String> iterator = incomingFieldNames.iterator();
          ArrayList<String> usedFields = new ArrayList<String>();
          // collect the names of the fields that match the configured regex
          while (iterator.hasNext()) {
            String f = iterator.next();
            if (fieldNamePattern.matcher(f).matches()) {
              usedFields.add(f);
            }
          }
          // record the matching field names on the document, then continue down the chain
          doc.addField(fieldsUsedFieldName, usedFields.toArray());
          super.processAdd(cmd);
        }
      }
  • 36. FieldsUsedUpdateProcessor in action
      schema.xml:
      <dynamicField name="*_attribute" type="string" indexed="true" stored="true" multiValued="true"/>
      Add some documents:
      solr.add([{:id=>1, :name => "Big Blue Shoes", :size_attribute => "L", :color_attribute => "Blue"},
                {:id=>2, :name => "Cool Gizmo", :memory_attribute => "16GB", :color_attribute => "White"}])
      solr.commit
      Facet on attribute_fields:
      http://localhost:8983/solr/select?q=*:*&facet=on&facet.field=attribute_fields&wt=json&indent=on
      "facet_fields":{
        "attribute_fields":[
          "color_attribute",2,
          "memory_attribute",1,
          "size_attribute",1]}
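The facet counts above can be reproduced outside Solr to check the logic of the processor. The following JDK-only sketch applies the `.*_attribute` regex from the configuration to the two sample documents and counts the matching field names; `FieldsUsedSketch` and its methods are hypothetical helpers, and documents are modeled as plain maps rather than `SolrInputDocument`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

class FieldsUsedSketch {
    /** Field names of one document that match the pattern, in document order. */
    static List<String> fieldsUsed(Map<String, String> doc, Pattern fieldNamePattern) {
        List<String> used = new ArrayList<>();
        for (String name : doc.keySet()) {
            if (fieldNamePattern.matcher(name).matches()) {
                used.add(name);
            }
        }
        return used;
    }

    /** Counts how many documents use each matching field name, sorted by name. */
    static Map<String, Integer> facetOnFieldsUsed(List<Map<String, String>> docs, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> doc : docs) {
            for (String name : fieldsUsed(doc, pattern)) {
                counts.merge(name, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** The two documents from the slide, as plain maps. */
    static List<Map<String, String>> slideDocs() {
        Map<String, String> shoes = new LinkedHashMap<>();
        shoes.put("id", "1");
        shoes.put("name", "Big Blue Shoes");
        shoes.put("size_attribute", "L");
        shoes.put("color_attribute", "Blue");
        Map<String, String> gizmo = new LinkedHashMap<>();
        gizmo.put("id", "2");
        gizmo.put("name", "Cool Gizmo");
        gizmo.put("memory_attribute", "16GB");
        gizmo.put("color_attribute", "White");
        return Arrays.asList(shoes, gizmo);
    }
}
```

Calling `facetOnFieldsUsed(slideDocs(), ".*_attribute")` gives the same three counts as the `attribute_fields` facet in the response above.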
  • 37. Search Components• Built-in: Clustering, Debug, Facet, Highlight, MoreLikeThis, Query, QueryElevation, SpellCheck, Stats, TermVector, Terms• Non-distributed API: • prepare(ResponseBuilder rb) • process(ResponseBuilder rb)
  • 38. Example - auto facet select• It sure would be nice if you could have Solr automatically select field(s) for faceting based dynamically off the profile of the results. For example, you're indexing disparate types of products, all with varying attributes (color, size - like for apparel, memory_size - for electronics, subject - for books, etc), and a user searches for "ipod" where most matching products have color and memory_size attributes... let's automatically facet on those fields.
  • 39. AutoFacetSelection Component• Too much code for a slide, let's take a look in an IDE...• Basically - • process() gets autofacet.field and autofacet.n request params, facets on field, takes top N values, sets those as facet.fields • Gotcha - need to call rb.setNeedDocSet(true) in prepare() as faceting needs it
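Since the component's code is not on the slides, the selection step it describes (facet on one field, take the top N values, use them as `facet.field` names) can only be sketched. The version below is JDK-only and hypothetical: `AutoFacetSketch`, the alphabetical tie-break, and the sample counts are assumptions for illustration, not the presenter's implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AutoFacetSketch {
    /** Names of the n most frequent values; ties are broken alphabetically. */
    static List<String> topValues(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    /** Sample counts for a facet on a "field of field names". */
    static Map<String, Integer> sampleCounts() {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("size_attribute", 1);
        counts.put("memory_attribute", 1);
        counts.put("color_attribute", 2);
        return counts;
    }
}
```

The returned names would then be set as the fields to facet on for the rest of the request.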
  • 40. SearchComponent config
      <searchComponent name="autofacet" class="solr.AutoFacetSelectionComponent"/>
      <requestHandler name="/searchplus" class="solr.SearchHandler">
        <arr name="components">
          <str>query</str>
          <str>autofacet</str>
          <str>facet</str>
          <str>debug</str>
        </arr>
      </requestHandler>
  • 41. autofacet success
      http://localhost:8983/solr/searchplus?q=*:*&facet=on&autofacet.field=attribute_fields&wt=json&indent=on
      {
        "response":{"numFound":2,"start":0,"docs":[
          {
            "size_attribute":["L"],
            "color_attribute":["Blue"],
            "name":"Big Blue Shoes",
            "id":"1",
            "attribute_fields":["size_attribute", "color_attribute"]},
          {
            "color_attribute":["White"],
            "name":"Cool Gizmo",
            "memory_attribute":["16GB"],
            "id":"2",
            "attribute_fields":["color_attribute", "memory_attribute"]}]
        },
        "facet_counts":{
          "facet_queries":{},
          "facet_fields":{
            "color_attribute":[
              "Blue",1,
              "White",1],
            "memory_attribute":[
              "16GB",1]}}}
  • 42. Distributed-aware SearchComponents• SearchComponent has a few distributed mode methods: • distributedProcess(ResponseBuilder) • modifyRequest(ResponseBuilder rb, SearchComponent who, ShardRequest sreq) • handleResponses(ResponseBuilder rb, ShardRequest sreq) • finishStage(ResponseBuilder rb)
  • 43. Testing• AbstractSolrTestCase• SolrTestCaseJ4• SolrMeter
  • 44. For more information...• LucidFind • search Lucene ecosystem: mailing lists, wikis, JIRA, etc• Getting started with LucidWorks Enterprise: • lucidworks-search-platform/enterprise• - wiki, e-mail lists
  • 45. Thank You!