Improving the Solr Update Chain
           Jan Høydahl
What will I cover?
Who is Jan Høydahl?
Intro to Solr’s (hidden) UpdateChain
How to write your own UpdateProcessors
Example: Web crawl @ Oslo University
A vision for future improvements
Conclusion




Jan Høydahl
          1995: Developer telecom
          1998: Java developer
          2000: Search - FAST
          2006: Lucene
          2007: Cominvent
          2011: Lucene committer


          > 100 projects



Cominvent AS




Consulting & support: Lucene/Solr, FAST
www.solrtraining.com
Why document processing?
Analysis is Field oriented
Filters only see the “local” field




Why document processing?
But what if you want to:
  Add or remove fields?
  Make decisions based on other fields?
We need a way to modify the Document




Why document processing?
[Diagram: Doc1 with the fields name, postcode and cv_pdf_url, next to the search terms "programmer" and "near Barcelona"]
[Diagram: Doc1 with two more fields, latlong and cv_text, next to the same search terms]
[Diagram: Client sending Doc1 with the fields name, postcode, latlong, cv_pdf_url and cv_text]
[Diagram: Client, 3rd party pipeline, and Doc1 with the same five fields]

Solr’s Update Chain




The Update Chain

               Postcode          UrlFetcher        Tika
               ToLatLong         Processor         Extracting
               Processor                           Processor

Doc            Doc               Doc               Doc
name           name              name              name
postcode       postcode          postcode          postcode
cv_pdf_url     latlong           latlong           latlong
               cv_pdf_url        cv_pdf_url        cv_pdf_url
                                 cv_pdf_bin        cv_pdf_bin
                                                   cv_text

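The chain pictured here can be modelled in a few lines of plain Java. This is only a sketch, not Solr's actual API: the Processor interface, the three stand-in processors and their hard-coded values are all hypothetical, and a document is reduced to a Map from field name to value.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of an update chain. Plain Java, not Solr's real classes.
class UpdateChainSketch {

    // One step in the chain: may add, change or remove fields on the document.
    interface Processor {
        void process(Map<String, Object> doc);
    }

    // Hypothetical stand-in: a real processor would look the postcode up.
    static final Processor POSTCODE_TO_LATLONG = doc -> {
        if (doc.containsKey("postcode")) {
            doc.put("latlong", "41.39,2.17"); // hard-coded placeholder value
        }
    };

    // Hypothetical stand-in: a real processor would download cv_pdf_url.
    static final Processor URL_FETCHER = doc -> {
        if (doc.containsKey("cv_pdf_url")) {
            doc.put("cv_pdf_bin", new byte[] {0x25, 0x50, 0x44, 0x46}); // "%PDF"
        }
    };

    // Hypothetical stand-in: a real processor would extract text from the binary.
    static final Processor TEXT_EXTRACTOR = doc -> {
        if (doc.containsKey("cv_pdf_bin")) {
            doc.put("cv_text", "extracted text placeholder");
        }
    };

    // Passes the document through every processor, in order.
    static Map<String, Object> run(List<Processor> chain, Map<String, Object> doc) {
        for (Processor p : chain) {
            p.process(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", "Doc1");
        doc.put("postcode", "08001");
        doc.put("cv_pdf_url", "http://example.com/cv.pdf");

        run(Arrays.asList(POSTCODE_TO_LATLONG, URL_FETCHER, TEXT_EXTRACTOR), doc);
        System.out.println(doc.keySet());
    }
}
```

Each processor only looks at the document it is handed, so steps can be reordered or dropped without touching the others, as long as the fields a step depends on are produced earlier in the list.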
How it’s wired
Chain definition in solrconfig.xml:




Choose chain in your update request:
.../solr/update/xml?..&update.chain=cv-chain



Other examples
Language Identification

Other examples
Entity extraction: tagging Company, Location and Date in a text such as
  "The Apache Software Foundation (ASF) is a non-profit corporation to
  support Apache software projects. The ASF was formed from the
  Apache Group and incorporated in Delaware, U.S., in June 1999."

Writing your own processor




•Make generic processors - parameterized
•Use SchemaAware, SolrCoreAware and
  ResourceLoaderAware interfaces
•Prefix param names to avoid name clash
•Testing and testable methods
•Donate back to Apache & document on Wiki
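
Some of these tips can be illustrated with a small plain-Java sketch: a generic processor driven by parameters, parameter names that share a prefix, and a process method that is easy to test. Nothing here is Solr's real API. The FieldCopySketch class, the "fieldcopy." prefix and the parameter names are hypothetical, and this is not the code attached to any SOLR issue.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of a generic, parameterized processor. Not Solr's real API.
class FieldCopySketch {

    // Hypothetical prefix so these names cannot clash with other parameters.
    static final String PREFIX = "fieldcopy.";

    // Configured once; hands out a fresh processor whenever one is needed.
    static class Factory {
        private final String source;
        private final String dest;

        Factory(Map<String, String> params) {
            this.source = required(params, PREFIX + "source");
            this.dest = required(params, PREFIX + "dest");
        }

        // Fails fast on missing configuration instead of at indexing time.
        private static String required(Map<String, String> params, String name) {
            String value = params.get(name);
            if (value == null || value.isEmpty()) {
                throw new IllegalArgumentException("Missing parameter: " + name);
            }
            return value;
        }

        Processor newProcessor() {
            return new Processor(source, dest);
        }
    }

    static class Processor {
        private final String source;
        private final String dest;

        Processor(String source, String dest) {
            this.source = source;
            this.dest = dest;
        }

        // Works on a plain Map and returns a copy, so it is easy to unit test.
        Map<String, Object> process(Map<String, Object> doc) {
            Map<String, Object> out = new HashMap<>(doc);
            if (doc.containsKey(source)) {
                out.put(dest, doc.get(source));
            }
            return out;
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put(PREFIX + "source", "title");
        params.put(PREFIX + "dest", "title_copy");

        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "Improving the Solr Update Chain");

        System.out.println(new Factory(params).newProcessor().process(doc));
    }
}
```

Because the field names come from configuration rather than from code, the same class can be reused in several chains with different parameters.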
Web crawl with Language Detection @ Oslo University

Solr @ Oslo University
<?xml version="1.0"?>
<updateRequestProcessorChain name="web">
  <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">title,content</str>
      <str name="langid.idField">url</str>
      <str name="langid.fallbackFields"></str>
      <str name="langid.fallback">no</str>
      <str name="langid.whitelist">no,en</str>
      <str name="langid.langField">language</str>
      <str name="langid.langsField">languages</str>
      <bool name="langid.overwrite">true</bool>
      <bool name="langid.map">true</bool>
      <str name="langid.map.fl">title,content</str>
      <double name="langid.threshold">0.5</double>
    </lst>
  </processor>
  <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="fl">content_no content_en</str>
    <str name="pattern">[\s\u00A0]+</str>
    <str name="replacement"> </str>
  </processor>
  <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="title.fl">title_no,title_en</str>
    <str name="content.fl">content_no,content_en</str>
    <str name="contentType.field">tika_Content-Type</str>
    <str name="contentType.regexp">^(text/html|application/xhtml).*</str>
  </processor>
  <processor class="no.uio.update.processor.URLClassifyProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="domainOutputField">host</str>
    <str name="canonicalUrlOutputField">canonicalurl</str>
  </processor>
  <processor class="no.uio.update.processor.RegexpBoostProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="inputField">url</str>
    <str name="boostField">urlboost</str>
    <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str>
  </processor>
  <processor class="no.uio.update.processor.StaticRankProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="inputField">canonicalurl</str>
    <str name="anchorField">anchortext</str>
    <str name="rankField">docrank</str>
    <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- the slide ends here; the rest of the chain is not shown -->
</updateRequestProcessorChain>

Donations back to Apache

SOLR-2599: FieldCopyProcessor
SOLR-2825: RegexReplaceProcessor
SOLR-2826: URLClassifyProcessor
SOLR-2827: RegexpBoostProcessor
SOLR-2828: StaticRankProcessor
Binary Document Dumper (?)

Many thanks for the donations!


Room for
improvement?



Improvements


Processors re-created for every request
Duplication of config between chains
No support for non-linear or sub chains
Does not scale very well
Lack native scripting language support




Improvements
Pain:
  Potentially expensive initialization
  StaticRankProcessor: read & parse 50,000 lines

Proposed cure:
  Keep persistent state object in factory:
    private final Map<Object,Object> sharedObjCache;
    new StaticRankProcessor(params, request, response,
                            nextProcessor, sharedObjCache);
  Processor uses sharedObjCache for state
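
A minimal plain-Java sketch of this idea follows. It is a simplified model and not Solr code: the constructor takes fewer arguments than the one on the slide, loadRanks is a hypothetical stand-in for reading the rank file, and the loads counter exists only to show that the expensive step runs once.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: the factory owns long-lived state, processors are created per request.
class SharedStateSketch {

    static class StaticRankFactory {
        // Lives as long as the factory, so it survives across requests.
        private final Map<Object, Object> sharedObjCache = new ConcurrentHashMap<>();
        // Counts how often the expensive load really runs (demonstration only).
        final AtomicInteger loads = new AtomicInteger();

        // Called once per request; cheap because the parsed data is cached.
        StaticRankProcessor getInstance() {
            return new StaticRankProcessor(this, sharedObjCache);
        }

        // Hypothetical stand-in for reading and parsing the large rank file.
        Map<String, Double> loadRanks() {
            loads.incrementAndGet();
            Map<String, Double> ranks = new HashMap<>();
            ranks.put("http://example.com/", 0.9);
            return ranks;
        }
    }

    static class StaticRankProcessor {
        private final Map<String, Double> ranks;

        @SuppressWarnings("unchecked")
        StaticRankProcessor(StaticRankFactory factory, Map<Object, Object> sharedObjCache) {
            // computeIfAbsent runs the loader only when nothing is cached yet.
            this.ranks = (Map<String, Double>)
                    sharedObjCache.computeIfAbsent("ranks", key -> factory.loadRanks());
        }

        // Unknown URLs get a neutral rank of 0.0.
        double rankFor(String url) {
            return ranks.getOrDefault(url, 0.0);
        }
    }

    public static void main(String[] args) {
        StaticRankFactory factory = new StaticRankFactory();
        for (int request = 0; request < 3; request++) {
            System.out.println(factory.getInstance().rankFor("http://example.com/"));
        }
        System.out.println("loads: " + factory.loads.get());
    }
}
```

A ConcurrentHashMap is used here on the assumption that several requests may create processors at the same time.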



Improvements
Pain:
  Multi chains often need identical Processors
  UiO’s two chains share 80% -> copy/paste

Proposed cure:
  Allow sharing of named instances
  Define:
  <processor name="langid" class="..">
  Refer:
  <processor ref="langid" />
  See SOLR-2823

Improvements
Pain:
  Chains are linear only
  Hard to do branching, sub chains, conditional...

Proposed cure (SOLR-2841):
  New scriptable Update Chain - alternative to XML
  Script chain logic in solr/conf/updateproc.groovy
  Full flexibility:
  chain myChain {
    if (doc.getFieldValue("type").equals("pdf"))
      process(tikaproc)
  }
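
The same branching idea can be tried out in plain Java. This is a sketch under stated assumptions, not the proposed Groovy syntax and not Solr's API: the Processor interface and the when and sequence helpers are hypothetical, and TIKA_PROC is a placeholder for a real extraction step.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Sketch of a non-linear chain: steps that only run when a condition holds.
class ConditionalChainSketch {

    interface Processor {
        void process(Map<String, Object> doc);
    }

    // Wraps a step so that it only runs for documents matching the condition.
    static Processor when(Predicate<Map<String, Object>> condition, Processor step) {
        return doc -> {
            if (condition.test(doc)) {
                step.process(doc);
            }
        };
    }

    // Groups several steps into one, which also makes sub chains possible.
    static Processor sequence(Processor... steps) {
        return doc -> {
            for (Processor step : steps) {
                step.process(doc);
            }
        };
    }

    // Hypothetical stand-in for a Tika-based extraction processor.
    static final Processor TIKA_PROC = doc -> doc.put("cv_text", "extracted text placeholder");

    // Mirrors the slide: only documents of type "pdf" go through TIKA_PROC.
    static final Processor MY_CHAIN =
            sequence(when(doc -> "pdf".equals(doc.get("type")), TIKA_PROC));

    public static void main(String[] args) {
        Map<String, Object> pdf = new HashMap<>();
        pdf.put("type", "pdf");
        MY_CHAIN.process(pdf);

        Map<String, Object> html = new HashMap<>();
        html.put("type", "html");
        MY_CHAIN.process(html);

        System.out.println(pdf.containsKey("cv_text") + " " + html.containsKey("cv_text"));
    }
}
```

Since sequence returns a Processor, a whole sub chain can be passed to when, which covers the branching and sub chain cases named in the pain points.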


Improvements
Pain:
  Single threaded
  Heavy processing not efficient

Proposed cure:
  Local: Use multi threaded update requests
  SolrCloud: Dedicated nodes, role=“processor” ?
  Wrap an external pipeline in UpdateProcessor
    Example: OpenPipelineUpdateProcessor ?
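
The "local" cure can be pictured with a small plain-Java sketch in which each worker thread stands for one concurrent update request. It is a model only: runChain is a hypothetical stand-in for a heavy chain, and no Solr or SolrJ classes are involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: several documents go through the chain at once, one task each.
class ParallelUpdateSketch {

    // Hypothetical stand-in for a chain that does heavy processing.
    static Map<String, Object> runChain(Map<String, Object> doc) {
        Map<String, Object> out = new HashMap<>(doc);
        out.put("processed", Boolean.TRUE);
        return out;
    }

    // Processes all documents on a fixed-size pool; results keep input order.
    static List<Map<String, Object>> processAll(List<Map<String, Object>> docs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Map<String, Object>>> pending = new ArrayList<>();
            for (Map<String, Object> doc : docs) {
                Callable<Map<String, Object>> task = () -> runChain(doc);
                pending.add(pool.submit(task));
            }
            List<Map<String, Object>> results = new ArrayList<>();
            for (Future<Map<String, Object>> future : pending) {
                results.add(future.get()); // waits for this task to finish
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("id", i);
            docs.add(doc);
        }
        System.out.println(processAll(docs, 2));
    }
}
```

This only helps when the processors themselves are safe to run concurrently, which is an assumption of the sketch.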




Improvements
Pain:
  Not really a “problem” :-)
  Nice to write processors in Python, Groovy, JS...

Proposed cure:
  Now: Finish SOLR-1725: Script based Processor
  Later: Make scripts first-class processors
    <processor script="myScript.py" />
    or
    <processor ref="myScript" />




One last thing...


New standalone framework?
•The UpdateChain is Solr specific
•Interest in a pure pipeline framework
  •Search engine independent
  •Scalable
  •Rich pool of processors
  •Several existing candidates
•Some initial thoughts:
  http://wiki.apache.org/solr/DocumentProcessing




Summary


•Document centric vs field centric processing
•UpdateChain is there - use it!
•Works well for most “light” cases
•Scaling issues, but caching config may help
•More processors welcome!




Questions?
Jan Høydahl, Cominvent AS
@cominvent
www.cominvent.com
Extra


Alternative pipelines
•OpenPipeline (Dieselpoint)
•OpenPipe (T-Rank, now on GitHub)
•Pypes (ESR)
•UIMA (Apache)
•Eclipse SMILA
•Apache Commons Pipeline
•Piped (FoundIT, Norway)
•Behemoth (DigitalPebble)
•Findwise and TwigKit also have some technology



Calling out from UpdateChain
This is one way an external pipeline system can be integrated with Solr.

The main benefit of such a method is you can continue to feed content with SolrJ, DIH or other Update Request Handlers.

Scaling with external pipeline
Here is a more advanced, distributed case, where a Solr node is dedicated for processing, and the entry point Solr only dispatches the requests.
