Lucene for Solr Developers

         erik . hatcher
               @




                             1
Abstract
You’re Solr powered and need to customize its
capabilities. Apache Solr is flexibly architected, with
practically everything pluggable. Under the hood, Solr is
driven by the well-known Apache Lucene. Lucene for
Solr Developers will guide you through the various ways
in which Solr can be extended, customized, and enhanced
with a bit of Lucene API know-how. We’ll delve into
improving analysis with custom character mapping,
tokenizing, and token filtering extensions; show why and
how to implement specialized query parsing, and how to
add your own search and update request handling.



                                                            2
About me...

• Co-author, “Lucene in Action”
• Committer, Lucene and Solr
• Lucene PMC and ASF member
• Member of Technical Staff / co-founder,
  Lucid Imagination



                                            3
... works


    search platform




www.lucidimagination.com
                           4
What is Lucene?
•   An open source search library (not an application)

•   100% Java

•   Continuously improved and tuned over more than
    10 years

•   Compact, portable index representation

•   Programmable text analyzers, spell checking and
    highlighting

•   Not a crawler or a text extraction tool


                                                         5
Inverted Index
•   Lucene stores input data in what is known as an
    inverted index

•   In an inverted index each indexed term points to a
    list of documents that contain the term

•   Similar to the index provided at the end of a book

•   In this case "inverted" simply means that the terms
    point to documents

•   It is much faster to find a term in an index than to
    scan all the documents


                                                             6
Inverted Index Example
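
A minimal sketch of the idea in plain Java, not Lucene's actual data structures. It assumes whitespace tokenization and lowercasing; the class name InvertedIndexSketch and its methods are made up for illustration.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy index: each term maps to the sorted ids of documents containing it.
class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Split the text on whitespace and record docId under every lowercased token.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Finding a term is a single map lookup; no document is scanned.
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene is a search library");
        index.add(2, "Solr is built on Lucene");
        index.add(3, "A crawler is not a search library");
        System.out.println("lucene -> " + index.lookup("lucene"));
        System.out.println("search -> " + index.lookup("search"));
    }
}
```

A real index also stores term frequencies and positions per document, which this sketch leaves out.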




                         7
Segments and Merging
•   A Lucene index is a collection of one or more sub-indexes
    called segments

•   Each segment is a fully independent index

•   A multi-way merge algorithm is used to periodically merge
    segments

•   New segments are created when an IndexWriter flushes new
    documents and pending deletes to disk

•   Aims to balance large-scale performance against small-scale
    updates

•   Optimization merges all segments into one


                                                                      8
Segments and Merging




                       9
Segments
• When a document is deleted it still exists
  in an index segment until that segment is
  merged
• At certain trigger points, buffered documents
  are flushed to the Directory as a new segment
• Can be forced by calling commit
• Segments are periodically merged
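
The lifecycle above can be modelled with a toy in plain Java. This is only a sketch of the behaviour described on the slides, not how IndexWriter is implemented; SegmentSketch and all of its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of segments, deletes, commit and merge.
class SegmentSketch {
    static final class Segment {
        final Map<Integer, String> docs = new LinkedHashMap<>();
        final Set<Integer> deleted = new HashSet<>();
    }

    final List<Segment> segments = new ArrayList<>();
    private Segment buffer = new Segment(); // buffered documents, not yet flushed

    void add(int id, String doc) {
        buffer.docs.put(id, doc);
    }

    // A delete only marks the document; it still occupies its segment.
    void delete(int id) {
        for (Segment s : segments) {
            if (s.docs.containsKey(id)) s.deleted.add(id);
        }
        buffer.docs.remove(id);
    }

    // Commit forces a flush: buffered documents become a new segment.
    void commit() {
        if (!buffer.docs.isEmpty()) {
            segments.add(buffer);
            buffer = new Segment();
        }
    }

    // Merging rewrites the live documents into one segment and drops deleted ones.
    void mergeAll() {
        if (segments.isEmpty()) return;
        Segment merged = new Segment();
        for (Segment s : segments) {
            for (Map.Entry<Integer, String> e : s.docs.entrySet()) {
                if (!s.deleted.contains(e.getKey())) merged.docs.put(e.getKey(), e.getValue());
            }
        }
        segments.clear();
        segments.add(merged);
    }

    // Number of documents physically held in segments, deleted or not.
    int storedDocCount() {
        int n = 0;
        for (Segment s : segments) n += s.docs.size();
        return n;
    }

    public static void main(String[] args) {
        SegmentSketch index = new SegmentSketch();
        index.add(1, "a");
        index.add(2, "b");
        index.commit();
        index.add(3, "c");
        index.commit();
        index.delete(2);
        System.out.println(index.segments.size() + " segments, " + index.storedDocCount() + " stored docs");
        index.mergeAll();
        System.out.println(index.segments.size() + " segments, " + index.storedDocCount() + " stored docs");
    }
}
```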
                                               10
IndexSearcher




                11
Adding new documents




                       12
Commit




         13
Committed and
  Warmed




                14
Lucene Scoring

•   Lucene uses a similarity scoring formula to rank results by measuring the
    similarity between a query and the documents that match the query. The
    factors that form the scoring formula are:

    •   Term Frequency: tf (t in d). How often the term occurs in the document.

    •   Inverse Document Frequency: idf (t). A measure of how rare the term is in
        the whole collection. It is inversely related to the number of documents
        in the collection that contain the term.

    •   Terms that are rare throughout the entire collection score higher.



                                                                                    15
Coord and Norms
•   Coord: The coordination factor, coord (q, d).
    Boosts documents that match more of the
    search terms than other documents.
    •   If 4 of 4 terms match coord = 4/4
    •   If 3 of 4 terms match coord = 3/4
•   Length Normalization - Adjust the score based
    on length of fields in the document.
    •   shorter fields that match get a boost
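
A small sketch of both factors, assuming the formulas of Lucene's DefaultSimilarity in the 3.x line: coord is the matched fraction of query terms, and the length norm is 1/sqrt(number of terms in the field). The class CoordNormSketch is made up for illustration; in Lucene the norm is additionally combined with boosts and stored in a lossy single-byte encoding.

```java
// Hypothetical helper mirroring two scoring factors with plain arithmetic.
class CoordNormSketch {
    // coord(q, d): fraction of the query's terms that the document matched.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Length normalization: fields with fewer terms get a larger factor.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        System.out.println("coord 4 of 4 = " + coord(4, 4));
        System.out.println("coord 3 of 4 = " + coord(3, 4));
        System.out.println("norm for 4 terms = " + lengthNorm(4));
        System.out.println("norm for 100 terms = " + lengthNorm(100));
    }
}
```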


                                                    16
Scoring Factors (cont)
• Boost: (t.field in d). A way to boost a field
  or a whole document above others.
• Query Norm: (q). Normalization value
  for a query, given the sum of the squared
  weights of each of the query terms.
• You will often hear the Lucene scoring
  simply referred to as
  TF·IDF.

                                                17
Explanation

      • Lucene has a feature called Explanation
      • Solr uses the debugQuery parameter to
         retrieve scoring explanations

0.2987913 =   (MATCH) fieldWeight(text:lucen in 688), product of:
  1.4142135   = tf(termFreq(text:lucen)=2)
  9.014501    = idf(docFreq=3, maxDocs=12098)
  0.0234375   = fieldNorm(field=text, doc=688)
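
The numbers in this explanation can be reproduced by hand. The sketch below assumes the DefaultSimilarity formulas from Lucene 3.x, tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)); ExplainSketch is a made-up class, and the fieldNorm is simply copied from the output above.

```java
// Hypothetical re-computation of the fieldWeight explanation using plain arithmetic.
class ExplainSketch {
    // tf grows with the square root of the term's frequency in the document.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // idf grows as fewer documents contain the term.
    static float idf(int docFreq, int maxDocs) {
        return (float) (Math.log(maxDocs / (double) (docFreq + 1)) + 1.0);
    }

    // fieldWeight is the product of the three factors listed in the explanation.
    static float fieldWeight(float freq, int docFreq, int maxDocs, float fieldNorm) {
        return tf(freq) * idf(docFreq, maxDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        System.out.println("tf          = " + tf(2f));
        System.out.println("idf         = " + idf(3, 12098));
        System.out.println("fieldWeight = " + fieldWeight(2f, 3, 12098, 0.0234375f));
    }
}
```

With termFreq=2, docFreq=3, maxDocs=12098 and fieldNorm=0.0234375, the product comes out at roughly 0.29879, in line with the score shown.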




                                                                    18
Lucene Core
• IndexWriter
• Directory
• IndexReader, IndexSearcher
• analysis: Analyzer, TokenStream,
  Tokenizer, TokenFilter
• Query
                                     19
Solr Architecture




                    20
Customizing - Don't do it!

•   Unless you need to.
•   In other words... ensure you've given the built-in
    capabilities a try, asked on the e-mail list, and
    spelunked into at least Solr's code a bit to make
    some sense of the situation.
•   But we're here to roll up our sleeves, because we
    need to...

                                                         21
But first...
•   Look at Lucene and/or Solr source code as
    appropriate

•   Carefully read javadocs and wiki pages - lots of tips
    there

•   And, hey, search for what you're trying to do...

    •   Google, of course

    •   But try out LucidFind and other Lucene ecosystem
        specific search systems -
        http://www.lucidimagination.com/search/


                                                            22
Extension points
•   Tokenizer, TokenFilter, CharFilter

•   SearchComponent

•   RequestHandler

•   ResponseWriter

•   FieldType

•   Similarity

•   QParser

•   DataImportHandler hooks

    •   data sources

    •   entity processors

    •   transformers

•   several others



                                                          23
Factories
• FooFactory (almost) everywhere.
  Sometimes there's a BarPlugin style instead

• For the sake of discussion... let's just skip the
  "factory" part
• In Solr, Factories and Plugins are used by
  configuration loading to parameterize and
  construct components
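
A rough sketch of the pattern in plain Java: the factory is initialized once from configuration and then hands out instances. Greeter, GreeterFactory and the "prefix" argument are all hypothetical stand-ins, not Solr classes or settings.

```java
import java.util.HashMap;
import java.util.Map;

class FactorySketch {
    // Hypothetical stand-in for a pluggable component.
    interface Greeter {
        String greet(String name);
    }

    // Hypothetical stand-in for a FooFactory: configured once, asked for instances many times.
    static class GreeterFactory {
        private String prefix = "Hello";

        // Called at configuration-loading time with the arguments from the config file.
        void init(Map<String, String> args) {
            prefix = args.getOrDefault("prefix", prefix);
        }

        // Called whenever an instance is needed; the instance captures the configured state.
        Greeter create() {
            final String p = prefix;
            return name -> p + ", " + name;
        }
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("prefix", "Hi");
        GreeterFactory factory = new GreeterFactory();
        factory.init(config);
        System.out.println(factory.create().greet("Solr"));
    }
}
```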


                                                  24
"Installing" plugins
• Compile .java to .class, JAR it up
• Put JAR files in one of:
 • <solr-home>/lib
 • a shared lib when using multicore
 • anywhere, and register location in
    solrconfig.xml
• Hook in plugins as appropriate
                                        25
Multicore sharedLib

<solr sharedLib="/usr/local/solr/customlib"
       persistent="true">
   <cores adminPath="/admin/cores">
      <core instanceDir="core1" name="core1"/>
      <core instanceDir="core2" name="core2"/>
   </cores>
</solr>




                                                 26
Plugins via
        solrconfig.xml


• <lib dir="/path/to/your/custom/jars" />



                                            27
Analysis

• CharFilter
• Tokenizer
• TokenFilter


                           28
Primer

• Tokens, Terms
• Attributes: Type, Payloads, Offsets,
  Positions, Term Vectors
• part of the picture:


                                         29
Version

• enum:
 • Version.LUCENE_31,
    Version.LUCENE_32, etc
• Version.onOrAfter(Version other)


                                     30
CharFilter
• extend BaseCharFilter
• enables pre-tokenization filtering/morphing
  of incoming field value
• only affects tokenization, not stored value
• Built-in CharFilters: HTMLStripCharFilter,
  PatternReplaceCharFilter, and
  MappingCharFilter
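
To illustrate pre-tokenization character mapping without depending on Lucene, here is a sketch built on java.io.FilterReader. MappingReaderSketch is a made-up class; it only handles one-to-one character replacements, so offsets never shift. A real CharFilter that changes the text length also has to record offset corrections, which is the job BaseCharFilter takes on.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical Reader wrapper that rewrites characters before a tokenizer would see them.
class MappingReaderSketch extends FilterReader {
    private final Map<Character, Character> mapping;

    MappingReaderSketch(Reader in, Map<Character, Character> mapping) {
        super(in);
        this.mapping = mapping;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c == -1) return -1;
        Character mapped = mapping.get((char) c);
        return mapped != null ? mapped : c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 (end of input) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            Character mapped = mapping.get(buf[i]);
            if (mapped != null) buf[i] = mapped;
        }
        return n;
    }

    // Convenience helper: run a whole string through the mapping reader.
    static String filter(String input, Map<Character, Character> mapping) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new MappingReaderSketch(new StringReader(input), mapping)) {
            for (int c; (c = r.read()) != -1; ) out.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, Character> mapping = new HashMap<>();
        mapping.put('\u00e9', 'e'); // e with acute accent -> plain e
        System.out.println(filter("caf\u00e9 r\u00e9sum\u00e9", mapping));
    }
}
```

Only the text handed to the tokenizer is changed this way; the stored field value would remain the original input.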


                                                31
Tokenizer
•   common to extend CharTokenizer

•   implement -

    •   protected abstract boolean isTokenChar(int c);

•   optionally override -

    •   protected int normalize(int c)

•   extend Tokenizer directly for finer control

•   Popular built-in Tokenizers include: WhitespaceTokenizer,
    StandardTokenizer, PatternTokenizer, KeywordTokenizer,
    ICUTokenizer
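
The two CharTokenizer hooks can be pictured with a standalone sketch. It borrows the method names isTokenChar and normalize from the slide but is plain Java, not a Lucene subclass; CharTokenizerSketch and the choice of "letters and digits, lowercased" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in showing the logic a CharTokenizer subclass supplies.
class CharTokenizerSketch {
    // Counterpart of isTokenChar(int c): letters and digits belong to a token.
    static boolean isTokenChar(int c) {
        return Character.isLetterOrDigit(c);
    }

    // Counterpart of normalize(int c): lowercase each code point.
    static int normalize(int c) {
        return Character.toLowerCase(c);
    }

    // Walk the text by code point, cutting a token wherever isTokenChar is false.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar(c)) {
                current.appendCodePoint(normalize(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Lucene-3.2 for Solr!"));
    }
}
```

The int parameters matter: working on code points rather than chars keeps supplementary characters intact.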



                                                                32
TokenFilter

• a TokenStream whose input is another
  TokenStream
• Popular TokenFilters include:
  LowerCaseFilter, CommonGramsFilter,
  SnowballFilter, StopFilter,
  WordDelimiterFilter
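
The "stream wrapping a stream" shape can be sketched in plain Java. The Tokens interface below is a hypothetical, much simplified stand-in for TokenStream (real streams expose attributes, not strings), and the two filters only imitate what LowerCaseFilter and StopFilter do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TokenFilterSketch {
    // Hypothetical stand-in for a TokenStream: returns the next token, or null when exhausted.
    interface Tokens {
        String next();
    }

    // Source stream over a fixed list of tokens.
    static Tokens fromList(List<String> list) {
        Iterator<String> it = list.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    // Filter that lowercases whatever its input produces.
    static Tokens lowerCase(Tokens input) {
        return () -> {
            String t = input.next();
            return t == null ? null : t.toLowerCase(Locale.ROOT);
        };
    }

    // Filter that silently drops stop words coming from its input.
    static Tokens stop(Tokens input, Set<String> stopWords) {
        return () -> {
            String t = input.next();
            while (t != null && stopWords.contains(t)) {
                t = input.next();
            }
            return t;
        };
    }

    // Pull every token out of a stream into a list.
    static List<String> drain(Tokens tokens) {
        List<String> out = new ArrayList<>();
        for (String t = tokens.next(); t != null; t = tokens.next()) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        Tokens chain = stop(lowerCase(fromList(Arrays.asList("The", "Quick", "fox"))),
                Collections.singleton("the"));
        System.out.println(drain(chain));
    }
}
```

Order matters in such a chain: lowercasing runs before the stop filter here, so "The" is caught by a lowercase stop list.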



                                         33
Lucene's analysis APIs
• tricky business, what with Attributes
  (Source/Factory's), State, characters, code
  points, Version, etc...
• Test!!!
 • BaseTokenStreamTestCase
 • Look at Lucene and Solr's test cases

                                                34
Solr's Analysis Tools

• Admin analysis tool
• Field analysis request handler
• DEMO


                                   35
Query Parsing


• String -> org.apache.lucene.search.Query



                                             36
QParserPlugin
public abstract class QParserPlugin
    implements NamedListInitializedPlugin {

    public abstract QParser createParser(
      String qstr,
      SolrParams localParams,
      SolrParams params,
      SolrQueryRequest req);
}



                                              37
QParser
public abstract class QParser {

    public abstract Query parse()
              throws ParseException;

}




                                       38
Built-in QParsers
from QParserPlugin.java
  /** internal use - name to class mappings of builtin parsers */
  public static final Object[] standardPlugins = {
     LuceneQParserPlugin.NAME, LuceneQParserPlugin.class,
     OldLuceneQParserPlugin.NAME, OldLuceneQParserPlugin.class,
     FunctionQParserPlugin.NAME, FunctionQParserPlugin.class,
     PrefixQParserPlugin.NAME, PrefixQParserPlugin.class,
     BoostQParserPlugin.NAME, BoostQParserPlugin.class,
     DisMaxQParserPlugin.NAME, DisMaxQParserPlugin.class,
     ExtendedDismaxQParserPlugin.NAME, ExtendedDismaxQParserPlugin.class,
     FieldQParserPlugin.NAME, FieldQParserPlugin.class,
     RawQParserPlugin.NAME, RawQParserPlugin.class,
     TermQParserPlugin.NAME, TermQParserPlugin.class,
     NestedQParserPlugin.NAME, NestedQParserPlugin.class,
     FunctionRangeQParserPlugin.NAME, FunctionRangeQParserPlugin.class,
     SpatialFilterQParserPlugin.NAME, SpatialFilterQParserPlugin.class,
     SpatialBoxQParserPlugin.NAME, SpatialBoxQParserPlugin.class,
     JoinQParserPlugin.NAME, JoinQParserPlugin.class,
  };


                                                                            39
Local Parameters

• {!qparser_name param=value}expression
 • or
• {!qparser_name param=value v=expression}
• Can substitute $references from request
  parameters



                                             40
Param Substitution
solrconfig.xml
<requestHandler name="/document"
                class="solr.SearchHandler">
  <lst name="invariants">
    <str name="q">{!term f=id v=$id}</str>
  </lst>
</requestHandler>

Solr request
http://localhost:8983/solr/document?id=FOO37
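
How the pieces of a local-params query fit together can be sketched with a tiny parser. This is not Solr's parsing code: LocalParamsSketch is a made-up class, it ignores quoting and escaping, and the map keys "type" and "v" are chosen here for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified parser for "{!name key=value ...}expression".
class LocalParamsSketch {
    static Map<String, String> parse(String q, Map<String, String> requestParams) {
        Map<String, String> local = new LinkedHashMap<>();
        if (!q.startsWith("{!")) {
            local.put("v", q); // no local params: the whole string is the expression
            return local;
        }
        int end = q.indexOf('}');
        if (end < 0) throw new IllegalArgumentException("missing closing brace: " + q);
        String[] parts = q.substring(2, end).trim().split("\\s+");
        local.put("type", parts[0]); // the query parser name
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq < 0) continue; // this sketch skips anything that is not key=value
            String value = parts[i].substring(eq + 1);
            if (value.startsWith("$")) {
                // $reference: take the value from the request parameter of that name
                value = requestParams.get(value.substring(1));
            }
            local.put(parts[i].substring(0, eq), value);
        }
        String rest = q.substring(end + 1);
        if (!rest.isEmpty()) local.put("v", rest); // trailing expression form
        return local;
    }

    public static void main(String[] args) {
        Map<String, String> request = new HashMap<>();
        request.put("id", "FOO37");
        System.out.println(parse("{!term f=id v=$id}", request));
    }
}
```

For the /document handler above, a request with id=FOO37 therefore ends up as a term query on the id field for FOO37.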




                                               41
Custom QParser

• Implement a QParserPlugin that creates your
  custom QParser
• Register in solrconfig.xml
 • <queryParser name="myparser"
    class="com.mycompany.MyQParserPlugin"/>



                                                42
Update Processor

• Responsible for handling these commands:
 • add/update
 • delete
 • commit
 • merge indexes

                                             43
Built-in Update
            Processors
•   RunUpdateProcessor
    •   Actually performs the operations, such as
        adding the documents to the index
•   LogUpdateProcessor
    •   Logs each operation
•   SignatureUpdateProcessor
    •   duplicate detection and, optionally, rejection

                                                       44
UIMA Update
           Processor
•   UIMA - Unstructured Information Management
    Architecture - http://uima.apache.org/

•   Enables UIMA components to augment
    documents

•   Entity extraction, automated categorization,
    language detection, etc

•   "contrib" plugin

•   http://wiki.apache.org/solr/SolrUIMA

                                                   45
Update Processor
         Chain
• UpdateProcessors are sequenced into a chain
• Each processor can abort the entire update
  or hand processing to the next processor in
  the chain
• Chains of update processor factories are
  specified in solrconfig.xml
• Update requests can specify an
  update.processor parameter
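
The chain mechanics can be sketched without Solr on the classpath. Everything below (Processor, RequireId, Log, Run, and documents as plain maps) is a hypothetical stand-in that only mirrors the hand-off and abort behaviour described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class UpdateChainSketch {
    // Hypothetical stand-in for an update processor: do some work, then hand off to next.
    static abstract class Processor {
        private final Processor next;

        Processor(Processor next) {
            this.next = next;
        }

        void processAdd(Map<String, Object> doc) {
            if (next != null) next.processAdd(doc);
        }
    }

    // Aborts the update by throwing, so later processors never see the document.
    static class RequireId extends Processor {
        RequireId(Processor next) {
            super(next);
        }

        @Override
        void processAdd(Map<String, Object> doc) {
            if (!doc.containsKey("id")) {
                throw new IllegalArgumentException("document has no id");
            }
            super.processAdd(doc);
        }
    }

    // Records each operation, then passes it on.
    static class Log extends Processor {
        private final List<String> log;

        Log(List<String> log, Processor next) {
            super(next);
            this.log = log;
        }

        @Override
        void processAdd(Map<String, Object> doc) {
            log.add("add " + doc.get("id"));
            super.processAdd(doc);
        }
    }

    // Last link: actually performs the operation.
    static class Run extends Processor {
        private final List<Map<String, Object>> index;

        Run(List<Map<String, Object>> index) {
            super(null);
            this.index = index;
        }

        @Override
        void processAdd(Map<String, Object> doc) {
            index.add(doc);
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        List<Map<String, Object>> index = new ArrayList<>();
        Processor chain = new RequireId(new Log(log, new Run(index)));
        Map<String, Object> doc = new HashMap<>();
        doc.put("id", "1");
        chain.processAdd(doc);
        System.out.println(log + " / indexed: " + index.size());
    }
}
```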

                                               46
Default update
            processor chain
From SolrCore.java
// construct the default chain
UpdateRequestProcessorFactory[] factories =
  new UpdateRequestProcessorFactory[]{
     new RunUpdateProcessorFactory(),
     new LogUpdateProcessorFactory()
  };

    Note: these steps have been swapped on trunk recently

                                                            47
Example Update
           Processor
•   What are the best facets to show for a particular
    query? Wouldn't it be nice to see the distribution of
    document "attributes" represented across a result
    set?

•   Learned this trick from the Smithsonian, who were
    doing it manually - add an indexed field containing the
    field names of the interesting other fields on the
    document.

•   Facet on that field "of field names" initially, then
    request facets on the top values returned.


                                                             48
Config for custom
           update processor
<updateRequestProcessorChain name="fields_used" default="true">
 <processor class="solr.processor.FieldsUsedUpdateProcessorFactory">
  <str name="fieldsUsedFieldName">attribute_fields</str>
  <str name="fieldNameRegex">.*_attribute</str>
 </processor>
 <processor class="solr.LogUpdateProcessorFactory" />
 <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>




                                                                   49
FieldsUsedUpdateProcessorFactory


public class FieldsUsedUpdateProcessorFactory extends UpdateRequestProcessorFactory {
 private String fieldsUsedFieldName;
 private Pattern fieldNamePattern;

    public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                                                  UpdateRequestProcessor next) {
      return new FieldsUsedUpdateProcessor(req, rsp, this, next);
    }

    // ... next slide ...

}




                                                                                                   50
FieldsUsedUpdateProcessorFactory
 @Override
 public void init(NamedList args) {
  if (args == null) return;

     SolrParams params = SolrParams.toSolrParams(args);

     fieldsUsedFieldName = params.get("fieldsUsedFieldName");
     if (fieldsUsedFieldName == null) {
       throw new SolrException
          (SolrException.ErrorCode.SERVER_ERROR,
             "fieldsUsedFieldName must be specified");
     }

     // TODO check that fieldsUsedFieldName is a valid field name and multiValued

     String fieldNameRegex = params.get("fieldNameRegex");
     if (fieldNameRegex == null) {
       throw new SolrException
          (SolrException.ErrorCode.SERVER_ERROR,
             "fieldNameRegex must be specified");
     }
     fieldNamePattern = Pattern.compile(fieldNameRegex);

     super.init(args);
 }
                                                                                  51
// Assumed to be an inner (non-static) class of FieldsUsedUpdateProcessorFactory,
// which is what lets it read fieldsUsedFieldName and fieldNamePattern directly.
class FieldsUsedUpdateProcessor extends UpdateRequestProcessor {
  public FieldsUsedUpdateProcessor(SolrQueryRequest req,
                                   SolrQueryResponse rsp,
                                   FieldsUsedUpdateProcessorFactory factory,
                                   UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();

    // Collect the names of the incoming fields that match the configured regex
    List<String> usedFields = new ArrayList<String>();
    for (String f : doc.getFieldNames()) {
      if (fieldNamePattern.matcher(f).matches()) {
        usedFields.add(f);
      }
    }

    // Record the matching names in the "fields used" field, then hand the
    // command to the next processor in the chain
    doc.addField(fieldsUsedFieldName, usedFields.toArray());
    super.processAdd(cmd);
  }
}



                                                                               52
FieldsUsedUpdateProcessor
          in action
schema.xml
  <dynamicField name="*_attribute" type="string" indexed="true" stored="true" multiValued="true"/>

Add some documents
solr.add([{:id=>1, :name => "Big Blue Shoes", :size_attribute => 'L', :color_attribute => 'Blue'},
          {:id=>2, :name => "Cool Gizmo", :memory_attribute => "16GB", :color_attribute => 'White'}])
solr.commit

Facet on attribute_fields
 - http://localhost:8983/solr/select?q=*:*&facet=on&facet.field=attribute_fields&wt=json&indent=on
      "facet_fields":{
          "attribute_fields":[
             "color_attribute",2,
             "memory_attribute",1,
             "size_attribute",1]}




                                                                                                        53
Search Components
• Built-in: Clustering, Debug, Facet, Highlight,
  MoreLikeThis, Query, QueryElevation,
  SpellCheck, Stats, TermVector, Terms
• Non-distributed API:
 • prepare(ResponseBuilder rb)
 • process(ResponseBuilder rb)

                                                   54
Example - auto facet
          select
•   It sure would be nice if Solr could automatically
    select the field(s) to facet on, based on the profile
    of the results. For example, you're indexing disparate
    types of products, each with different attributes
    (color and size for apparel, memory_size for
    electronics, subject for books, etc.), and a user searches
    for "ipod", where most matching products have
    color and memory_size attributes... let's automatically
    facet on those fields.

•   https://issues.apache.org/jira/browse/SOLR-2641



                                                                  55
AutoFacetSelection
       Component
•   Too much code for a slide, let's take a look in
    an IDE...

•   Basically -

    •   process() gets the autofacet.field and autofacet.n
        request params, facets on that field, takes the top N
        values, and sets those as facet.field's

    •   Gotcha - need to call rb.setNeedDocSet(true)
        in prepare(), as faceting needs it
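
The selection step itself is small enough to sketch in plain Java. This is not the component's code: AutoFacetSketch is a made-up class, the counts are passed in as a map instead of coming from Solr's faceting, and breaking ties by field name is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: choose which fields to facet on from counts on the "field of field names".
class AutoFacetSketch {
    static List<String> topFields(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; equal counts are ordered by field name.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> fields = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(n, entries.size()))) {
            fields.add(e.getKey());
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("size_attribute", 1);
        counts.put("color_attribute", 2);
        counts.put("memory_attribute", 1);
        // Each selected name would then be set as a facet.field parameter.
        System.out.println(topFields(counts, 2));
    }
}
```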


                                                        56
SearchComponent
              config
<searchComponent name="autofacet"
     class="solr.AutoFacetSelectionComponent"/>
<requestHandler name="/searchplus"
                class="solr.SearchHandler">
  <arr name="components">
    <str>query</str>
    <str>autofacet</str>
    <str>facet</str>
    <str>debug</str>
  </arr>
</requestHandler>


                                                  57
autofacet success
http://localhost:8983/solr/searchplus
?q=*:*&facet=on&autofacet.field=attribute_fields&wt=json&indent=on
{
  "response":{"numFound":2,"start":0,"docs":[
       {
         "size_attribute":["L"],
         "color_attribute":["Blue"],
         "name":"Big Blue Shoes",
         "id":"1",
         "attribute_fields":["size_attribute",
           "color_attribute"]},
       {
         "color_attribute":["White"],
         "name":"Cool Gizmo",
         "memory_attribute":["16GB"],
         "id":"2",
         "attribute_fields":["color_attribute",
           "memory_attribute"]}]
  },
  "facet_counts":{
     "facet_queries":{},
     "facet_fields":{
       "color_attribute":[
         "Blue",1,
         "White",1],
       "memory_attribute":[
         "16GB",1]}}}
                                                                     58
Distributed-aware
    SearchComponents
•   SearchComponent has a few distributed mode
    methods:

    •   distributedProcess(ResponseBuilder)

    •   modifyRequest(ResponseBuilder rb,
        SearchComponent who, ShardRequest sreq)

    •   handleResponses(ResponseBuilder rb,
        ShardRequest sreq)

    •   finishStage(ResponseBuilder rb)


                                                  59
Testing

• AbstractSolrTestCase
• SolrTestCaseJ4
• SolrMeter
 • http://code.google.com/p/solrmeter/

                                         60
For more information...
•   http://www.lucidimagination.com

•   LucidFind

    •   search Lucene ecosystem: mailing lists, wikis, JIRA, etc

    •   http://search.lucidimagination.com

•   Getting started with LucidWorks Enterprise:

    •   http://www.lucidimagination.com/products/
        lucidworks-search-platform/enterprise

•   http://lucene.apache.org/solr - wiki, e-mail lists


                                                                   61
Thank You!




             62

Lucene for Solr Developers

  • 1.
    Lucene for SolrDevelopers erik . hatcher @ 1
  • 2.
    Abstract You’re Solr powered,and needing to customize its capabilities. Apache Solr is flexibly architected, with practically everything pluggable. Under the hood, Solr is driven by the well-known Apache Lucene. Lucene for Solr Developers will guide you through the various ways in which Solr can be extended, customized, and enhanced with a bit of Lucene API know-how. We’ll delve into improving analysis with custom character mapping, tokenizing, and token filtering extensions; show why and how to implement specialized query parsing, and how to add your own search and update request handling. 2
  • 3.
    About me... • Co-author,“Lucene in Action” • Commiter, Lucene and Solr • Lucene PMC and ASF member • Member of Technical Staff / co-founder, Lucid Imagination 3
  • 4.
    ... works search platform www.lucidimagination.com 4
  • 5.
    What is Lucene? • An open source search library (not an application) • 100% Java • Continuously improved and tuned over more than 10 years • Compact, portable index representation • Programmable text analyzers, spell checking and highlighting • Not a crawler or a text extraction tool 5
  • 6.
    Inverted Index • Lucene stores input data in what is known as an inverted index • In an inverted index each indexed term points to a list of documents that contain the term • Similar to the index provided at the end of a book • In this case "inverted" simply means the list of terms point to documents • It is much faster to find a term in an index, than to scan all the documents 6
  • 7.
  • 8.
    Segments and Merging • A Lucene index is a collection of one or more sub-indexes called segments • Each segment is a fully independent index • A multi-way merge algorithm is used to periodically merge segments • New segments are created when an IndexWriter flushes new documents and pending deletes to disk • Trying for a balance between large-scale performance vs. small- scale updates • Optimization merges all segments into one 8
  • 9.
  • 10.
    Segments • When adocument is deleted it still exists in an index segment until that segment is merged • At certain trigger points, these Documents are flushed to the Directory • Can be forced by calling commit • Segments are periodically merged 10
  • 11.
  • 12.
  • 13.
  • 14.
    Committed and Warmed 14
  • 15.
    Lucene Scoring • Lucene uses a similarity scoring formula to rank results by measuring the similarity between a query and the documents that match the query. The factors that form the scoring formula are: • Term Frequency: tf (t in d). How often the term occurs in the document. • Inverse Document Frequency: idf (t). A measure of how rare the term is in the whole collection. One over the number of times the term appears in the collection. • Terms that are rare throughout the entire collection score higher. 15
  • 16.
    Coord and Norms • Coord: The coordination factor, coord (q, d). Boosts documents that match more of the search terms than other documents. • If 4 of 4 terms match coord = 4/4 • If 3 of 4 terms match coord = 3/4 • Length Normalization - Adjust the score based on length of fields in the document. • shorter fields that match get a boost 16
  • 17.
    Scoring Factors (cont) •Boost: (t.field in d). A way to boost a field or a whole document above others. • Query Norm: (q). Normalization value for a query, given the sum of the squared weights of each of the query terms. • You will often hear the Lucene scoring simply referred to as TF·IDF. 17
  • 18.
    Explanation • Lucene has a feature called Explanation • Solr uses the debugQuery parameter to retrieve scoring explanations 0.2987913 = (MATCH) fieldWeight(text:lucen in 688), product of: 1.4142135 = tf(termFreq(text:lucen)=2) 9.014501 = idf(docFreq=3, maxDocs=12098) 0.0234375 = fieldNorm(field=text, doc=688) 18
  • 19.
    Lucene Core • IndexWriter •Directory • IndexReader, IndexSearcher • analysis: Analyzer, TokenStream, Tokenizer,TokenFilter • Query 19
  • 20.
  • 21.
    Customizing - Don'tdo it! • Unless you need to. • In other words... ensure you've given the built-in capabilities a try, asked on the e-mail list, and spelunked into at least Solr's code a bit to make some sense of the situation. • But we're here to roll up our sleeves, because we need to... 21
  • 22.
    But first... • Look at Lucene and/or Solr source code as appropriate • Carefully read javadocs and wiki pages - lots of tips there • And, hey, search for what you're trying to do... • Google, of course • But try out LucidFind and other Lucene ecosystem specific search systems - http://www.lucidimagination.com/search/ 22
  • 23.
    Extension points • Tokenizer, TokenFilter, • QParser CharFilter • DataImportHandler • SearchComponent hooks • RequestHandler • data sources • ResponseWriter • entity processors • FieldType • transformers • Similarity • several others 23
  • 24.
    Factories • FooFactory (most)everywhere. Sometimes there's BarPlugin style • for sake of discussion... let's just skip the "factory" part • In Solr, Factories and Plugins are used by configuration loading to parameterize and construct 24
  • 25.
    "Installing" plugins • Compile.java to .class, JAR it up • Put JAR files in either: • <solr-home>/lib • a shared lib when using multicore • anywhere, and register location in solrconfig.xml • Hook in plugins as appropriate 25
  • 26.
    Multicore sharedLib <solr sharedLib="/usr/local/solr/customlib" persistent="true"> <cores adminPath="/admin/cores"> <core instanceDir="core1" name="core1"/> <core instanceDir="core2" name="core2"/> </cores> </solr> 26
  • 27.
    Plugins via solrconfig.xml • <lib dir="/path/to/your/custom/jars" /> 27
  • 28.
  • 29.
    Primer • Tokens, Terms •Attributes: Type, Payloads, Offsets, Positions, Term Vectors • part of the picture: 29
  • 30.
    Version • enum: •Version.LUCENE_31, Version.LUCENE_32, etc • Version.onOrAfter(Version other) 30
  • 31.
    CharFilter • extend BaseCharFilter •enables pre-tokenization filtering/morphing of incoming field value • only affects tokenization, not stored value • Built-in CharFilters: HTMLStripCharFilter, PatternReplaceCharFilter, and MappingCharFilter 31
  • 32.
    Tokenizer • common to extend CharTokenizer • implement - • protected abstract boolean isTokenChar(int c); • optionally override - • protected int normalize(int c) • extend Tokenizer directly for finer control • Popular built-in Tokenizers include: WhitespaceTokenizer, StandardTokenizer, PatternTokenizer, KeywordTokenizer, ICUTokenizer 32
  • 33.
    TokenFilter • a TokenStreamwhose input is another TokenStream • Popular TokenFilters include: LowerCaseFilter, CommonGramsFilter, SnowballFilter, StopFilter, WordDelimiterFilter 33
  • 34.
    Lucene's analysis APIs •tricky business, what with Attributes (Source/Factory's), State, characters, code points,Version, etc... • Test!!! • BaseTokenStreamTestCase • Look at Lucene and Solr's test cases 34
  • 35.
    Solr's Analysis Tools •Admin analysis tool • Field analysis request handler • DEMO 35
  • 36.
    Query Parsing • String-> org.apache.lucene.search.Query 36
  • 37.
    QParserPlugin public abstract classQParserPlugin implements NamedListInitializedPlugin { public abstract QParser createParser( String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req); } 37
  • 38.
    QParser public abstract classQParser { public abstract Query parse() throws ParseException; } 38
  • 39.
    Built-in QParsers from QParserPlugin.java /** internal use - name to class mappings of builtin parsers */ public static final Object[] standardPlugins = { LuceneQParserPlugin.NAME, LuceneQParserPlugin.class, OldLuceneQParserPlugin.NAME, OldLuceneQParserPlugin.class, FunctionQParserPlugin.NAME, FunctionQParserPlugin.class, PrefixQParserPlugin.NAME, PrefixQParserPlugin.class, BoostQParserPlugin.NAME, BoostQParserPlugin.class, DisMaxQParserPlugin.NAME, DisMaxQParserPlugin.class, ExtendedDismaxQParserPlugin.NAME, ExtendedDismaxQParserPlugin.class, FieldQParserPlugin.NAME, FieldQParserPlugin.class, RawQParserPlugin.NAME, RawQParserPlugin.class, TermQParserPlugin.NAME, TermQParserPlugin.class, NestedQParserPlugin.NAME, NestedQParserPlugin.class, FunctionRangeQParserPlugin.NAME, FunctionRangeQParserPlugin.class, SpatialFilterQParserPlugin.NAME, SpatialFilterQParserPlugin.class, SpatialBoxQParserPlugin.NAME, SpatialBoxQParserPlugin.class, JoinQParserPlugin.NAME, JoinQParserPlugin.class, }; 39
  • 40.
    Local Parameters • {!qparser_nameparam=value}expression • or • {!qparser_name param=value v=expression} • Can substitute $references from request parameters 40
  • 41.
    Param Substitution solrconfig.xml <requestHandler name="/document" class="solr.SearchHandler"> <lst name="invariants"> <str name="q">{!term f=id v=$id}</str> </lst> </requestHandler> Solr request http://localhost:8983/solr/document?id=FOO37 41
  • 42.
    Custom QParser • Implementa QParserPlugin that creates your custom QParser • Register in solrconfig.xml • <queryParser name="myparser" class="com.mycompany.MyQParserPlugin"/> 42
  • 43.
    Update Processor • Responsiblefor handling these commands: • add/update • delete • commit • merge indexes 43
  • 44.
    Built-in Update Processors • RunUpdateProcessor • Actually performs the operations, such as adding the documents to the index • LogUpdateProcessor • Logs each operation • SignatureUpdateProcessor • duplicate detection and optionally rejection 44
  • 45.
    UIMA Update Processor • UIMA - Unstructured Information Management Architecture - http://uima.apache.org/ • Enables UIMA components to augment documents • Entity extraction, automated categorization, language detection, etc • "contrib" plugin • http://wiki.apache.org/solr/SolrUIMA 45
  • 46.
    Update Processor Chain • UpdateProcessor's sequence into a chain • Each processor can abort the entire update or hand processing to next processor in the chain • Chains, of update processor factories, are specified in solrconfig.xml • Update requests can specify an update.processor parameter 46
  • 47.
    Default update processor chain From SolrCore.java // construct the default chain UpdateRequestProcessorFactory[] factories = new UpdateRequestProcessorFactory[]{ new RunUpdateProcessorFactory(), new LogUpdateProcessorFactory() }; Note: these steps have been swapped on trunk recently 47
  • 48.
    Example Update Processor • What are the best facets to show for a particular query? Wouldn't it be nice to see the distribution of document "attributes" represented across a result set? • Learned this trick from the Smithsonian, who were doing it manually - add an indexed field containing the field names of the interesting other fields on the document. • Facet on that field "of field names" initially, then request facets on the top values returned. 48
  • 49.
    Config for custom update processor
        <updateRequestProcessorChain name="fields_used" default="true">
          <processor class="solr.processor.FieldsUsedUpdateProcessorFactory">
            <str name="fieldsUsedFieldName">attribute_fields</str>
            <str name="fieldNameRegex">.*_attribute</str>
          </processor>
          <processor class="solr.LogUpdateProcessorFactory" />
          <processor class="solr.RunUpdateProcessorFactory" />
        </updateRequestProcessorChain>
  • 50.
    FieldsUsedUpdateProcessorFactory
        public class FieldsUsedUpdateProcessorFactory extends UpdateRequestProcessorFactory {
          private String fieldsUsedFieldName;
          private Pattern fieldNamePattern;

          public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                                    SolrQueryResponse rsp,
                                                    UpdateRequestProcessor next) {
            return new FieldsUsedUpdateProcessor(req, rsp, this, next);
          }

          // ... next slide ...
        }
  • 51.
    FieldsUsedUpdateProcessorFactory
        @Override
        public void init(NamedList args) {
          if (args == null) return;

          SolrParams params = SolrParams.toSolrParams(args);

          fieldsUsedFieldName = params.get("fieldsUsedFieldName");
          if (fieldsUsedFieldName == null) {
            throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
                                    "fieldsUsedFieldName must be specified");
          }
          // TODO check that fieldsUsedFieldName is a valid field name and multiValued

          String fieldNameRegex = params.get("fieldNameRegex");
          if (fieldNameRegex == null) {
            throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
                                    "fieldNameRegex must be specified");
          }
          fieldNamePattern = Pattern.compile(fieldNameRegex);

          super.init(args);
        }
  • 52.
        // Reads fieldNamePattern and fieldsUsedFieldName from the factory, so this is
        // presumably a (non-static) inner class of FieldsUsedUpdateProcessorFactory.
        class FieldsUsedUpdateProcessor extends UpdateRequestProcessor {
          public FieldsUsedUpdateProcessor(SolrQueryRequest req,
                                           SolrQueryResponse rsp,
                                           FieldsUsedUpdateProcessorFactory factory,
                                           UpdateRequestProcessor next) {
            super(next);
          }

          @Override
          public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();

            // Collect the names of the incoming fields that match the configured regex
            Collection<String> incomingFieldNames = doc.getFieldNames();
            Iterator<String> iterator = incomingFieldNames.iterator();
            ArrayList<String> usedFields = new ArrayList<String>();
            while (iterator.hasNext()) {
              String f = iterator.next();
              if (fieldNamePattern.matcher(f).matches()) {
                usedFields.add(f);
              }
            }

            // Store the matching field names on the document, then continue down the chain
            doc.addField(fieldsUsedFieldName, usedFields.toArray());
            super.processAdd(cmd);
          }
        }
  • 53.
    FieldsUsedUpdateProcessor in action
    schema.xml:
        <dynamicField name="*_attribute" type="string" indexed="true" stored="true" multiValued="true"/>
    Add some documents:
        solr.add([{:id=>1, :name => "Big Blue Shoes", :size_attribute => 'L', :color_attribute => 'Blue'},
                  {:id=>2, :name => "Cool Gizmo", :memory_attribute => "16GB", :color_attribute => 'White'}])
        solr.commit
    Facet on attribute_fields:
        http://localhost:8983/solr/select?q=*:*&facet=on&facet.field=attribute_fields&wt=json&indent=on
        "facet_fields":{
          "attribute_fields":[
            "color_attribute",2,
            "memory_attribute",1,
            "size_attribute",1]}
  • 54.
    Search Components
    • Built-in: Clustering, Debug, Facet, Highlight, MoreLikeThis, Query, QueryElevation, SpellCheck, Stats, TermVector, Terms
    • Non-distributed API:
      • prepare(ResponseBuilder rb)
      • process(ResponseBuilder rb)
  • 55.
    Example - autofacet select
    • It would be nice if Solr could automatically select the field(s) to facet on, based on the profile of the results. For example, you're indexing disparate types of products with varying attributes (color and size for apparel, memory_size for electronics, subject for books, etc.), and a user searches for "ipod", where most matching products have color and memory_size attributes... let's automatically facet on those fields.
    • https://issues.apache.org/jira/browse/SOLR-2641
  • 56.
    AutoFacetSelection Component
    • Too much code for a slide, let's take a look in an IDE...
    • Basically:
      • process() gets the autofacet.field and autofacet.n request params, facets on that field, takes the top N values, and sets those as the facet.field params
    • Gotcha - need to call rb.setNeedDocSet(true) in prepare(), as faceting needs it
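    The selection step that process() performs can be sketched without any Solr classes. This is only a model of the "take the top N values" logic, not the component's actual code: `AutoFacetSketch` and `topFacetFields` are hypothetical names, and the facet counts are assumed to arrive as a map from field name to document count.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Hypothetical illustration only; the real component works on Solr's ResponseBuilder.
    final class AutoFacetSketch {
        // Returns up to n keys with the highest counts, highest first.
        // The sort is stable, so ties keep the iteration order of the input map.
        static List<String> topFacetFields(Map<String, Integer> counts, int n) {
            List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
            entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());

            List<String> top = new ArrayList<>();
            for (Map.Entry<String, Integer> e : entries) {
                if (top.size() == n) break;
                top.add(e.getKey());
            }
            return top;
        }
    }
    ```

    Given counts of 2 for color_attribute and 1 each for memory_attribute and size_attribute (in that order), asking for the top 2 selects color_attribute and memory_attribute; those names would then be used as the facet.field values.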
  • 57.
    SearchComponent config
        <searchComponent name="autofacet" class="solr.AutoFacetSelectionComponent"/>
        <requestHandler name="/searchplus" class="solr.SearchHandler">
          <arr name="components">
            <str>query</str>
            <str>autofacet</str>
            <str>facet</str>
            <str>debug</str>
          </arr>
        </requestHandler>
  • 58.
    autofacet success
        http://localhost:8983/solr/searchplus?q=*:*&facet=on&autofacet.field=attribute_fields&wt=json&indent=on
        {
          "response":{"numFound":2,"start":0,"docs":[
              {
                "size_attribute":["L"],
                "color_attribute":["Blue"],
                "name":"Big Blue Shoes",
                "id":"1",
                "attribute_fields":["size_attribute", "color_attribute"]},
              {
                "color_attribute":["White"],
                "name":"Cool Gizmo",
                "memory_attribute":["16GB"],
                "id":"2",
                "attribute_fields":["color_attribute", "memory_attribute"]}]
          },
          "facet_counts":{
            "facet_queries":{},
            "facet_fields":{
              "color_attribute":[
                "Blue",1,
                "White",1],
              "memory_attribute":[
                "16GB",1]}}}
  • 59.
    Distributed-aware SearchComponents
    • SearchComponent has a few distributed mode methods:
      • distributedProcess(ResponseBuilder)
      • modifyRequest(ResponseBuilder rb, SearchComponent who, ShardRequest sreq)
      • handleResponses(ResponseBuilder rb, ShardRequest sreq)
      • finishStage(ResponseBuilder rb)
  • 60.
    Testing
    • AbstractSolrTestCase
    • SolrTestCaseJ4
    • SolrMeter
      • http://code.google.com/p/solrmeter/
  • 61.
    For more information...
    • http://www.lucidimagination.com
    • LucidFind
      • search Lucene ecosystem: mailing lists, wikis, JIRA, etc.
      • http://search.lucidimagination.com
    • Getting started with LucidWorks Enterprise:
      • http://www.lucidimagination.com/products/lucidworks-search-platform/enterprise
    • http://lucene.apache.org/solr - wiki, e-mail lists
  • 62.