Improving the Solr Update Chain
           Jan Høydahl
What will I cover?
Who is Jan Høydahl?
Intro to Solr’s (hidden) UpdateChain
How to write your own UpdateProcessors
Example: Web crawl @ Oslo University
A vision for future improvements
Conclusion




Jan Høydahl
          1995: Developer telecom
          1998: Java developer
          2000: Search - FAST
          2006: Lucene
          2007: Cominvent
          2011: Lucene committer


          > 100 projects



Cominvent AS




Consulting & support: Lucene/Solr, FAST
www.solrtraining.com
Why document processing?
Analysis is field-oriented
Filters only see the “local” field




Why document processing?
But what if you want to:
  Add or remove fields?
  Make decisions based on other fields?
We need a way to modify the Document




Why document processing?

[Figure: Doc1 with fields name, postcode, cv_pdf_url, shown next to the text "programmer near Barcelona"]
Why document processing?

[Figure: Doc1 with fields name, postcode, latlong, cv_pdf_url, cv_text, shown next to the text "programmer near Barcelona"]
Why document processing?


[Figure: Client holding Doc1 with fields name, postcode, latlong, cv_pdf_url, cv_text]
Why document processing?


[Figure: Client holding Doc1 with fields name, postcode, latlong, cv_pdf_url, cv_text, with a 3rd party pipeline beside it]
Solr’s Update Chain




The Update Chain
Doc in: name, postcode, cv_pdf_url
  -> PostcodeToLatLongProcessor: adds latlong
  -> UrlFetcherProcessor: adds cv_pdf_bin
  -> TikaExtractingProcessor: adds cv_text
Doc out: name, postcode, latlong, cv_pdf_url, cv_pdf_bin, cv_text
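The hand-over from one processor to the next can be sketched in plain Java. This is a minimal model, not Solr's actual classes: the document is assumed to be a simple Map, and the class names, the lookup table, and the placeholder values are all hypothetical. What it shares with the real chain is the pattern that each processor modifies the document and then passes it to the next one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class UpdateChainSketch {
    /** Builds the three-step chain and pushes one document through it. */
    static Map<String, Object> run(Map<String, Object> doc) {
        DocProcessor chain = new PostcodeToLatLong(new UrlFetcher(new TextExtractor(null)));
        chain.processAdd(doc);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", "Doc1");
        doc.put("postcode", "08001");
        doc.put("cv_pdf_url", "http://example.com/cv.pdf");
        System.out.println(run(doc).keySet());
    }
}

/** Plain-Java model of a chained processor (hypothetical, not a Solr class). */
abstract class DocProcessor {
    private final DocProcessor next; // null marks the end of the chain

    DocProcessor(DocProcessor next) {
        this.next = next;
    }

    /** Subclasses change the document, then call super.processAdd(doc) to pass it on. */
    void processAdd(Map<String, Object> doc) {
        if (next != null) {
            next.processAdd(doc);
        }
    }
}

/** Adds "latlong" from "postcode" using a hard-coded, hypothetical lookup table. */
class PostcodeToLatLong extends DocProcessor {
    private static final Map<String, String> LOOKUP = Map.of("08001", "41.38,2.17");

    PostcodeToLatLong(DocProcessor next) {
        super(next);
    }

    @Override
    void processAdd(Map<String, Object> doc) {
        String latlong = LOOKUP.get(String.valueOf(doc.get("postcode")));
        if (latlong != null) {
            doc.put("latlong", latlong);
        }
        super.processAdd(doc);
    }
}

/** Stand-in for the URL fetcher: stores placeholder bytes instead of doing HTTP. */
class UrlFetcher extends DocProcessor {
    UrlFetcher(DocProcessor next) {
        super(next);
    }

    @Override
    void processAdd(Map<String, Object> doc) {
        if (doc.containsKey("cv_pdf_url")) {
            doc.put("cv_pdf_bin", new byte[] {0x25, 0x50, 0x44, 0x46});
        }
        super.processAdd(doc);
    }
}

/** Stand-in for text extraction: a real processor would parse the fetched bytes. */
class TextExtractor extends DocProcessor {
    TextExtractor(DocProcessor next) {
        super(next);
    }

    @Override
    void processAdd(Map<String, Object> doc) {
        if (doc.get("cv_pdf_bin") instanceof byte[]) {
            doc.put("cv_text", "(extracted text placeholder)");
        }
        super.processAdd(doc);
    }
}
```

Because each step only sees the Map, the order of the processors matters: the extractor finds nothing to do unless the fetcher has already run.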
How it’s wired
Chain definition in solrconfig.xml:




Choose chain in your update request:
.../solr/update/xml?..&update.chain=cv-chain



Other examples




Language Identification
Other examples
Entity extraction
  "The Apache Software Foundation (ASF) is a non-profit corporation to support Apache software projects. The ASF was formed from the Apache Group and incorporated in Delaware, U.S., in June 1999."
  Entity types marked in the text: Company, Location, Date
Writing your own processor




•Make generic processors - parameterized
•Use SchemaAware, SolrCoreAware and
  ResourceLoaderAware interfaces
•Prefix param names to avoid name clash
•Testing and testable methods
•Donate back to Apache & document on Wiki
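
The first, third and fourth points above can be illustrated with a small plain-Java sketch. It is a model under stated assumptions, not Solr's API: the Map of params stands in for the factory's init arguments, the "regexreplace." prefix and all class names are hypothetical, and the pattern is the whitespace-collapsing one used in the Oslo University chain.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.regex.Pattern;

class RegexReplaceSketch {
    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("regexreplace.fl", "content_no content_en");
        params.put("regexreplace.pattern", "[\\s\\u00A0]+");
        params.put("regexreplace.replacement", " ");
        RegexReplaceFactory factory = new RegexReplaceFactory(params);

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("content_en", "Hello \n\t world");
        factory.getInstance().processAdd(doc);
        System.out.println(doc.get("content_en"));
    }
}

/** Configured once from parameters; hands out one lightweight processor per request. */
class RegexReplaceFactory {
    final String[] fields;
    final Pattern pattern;
    final String replacement;

    RegexReplaceFactory(Map<String, String> params) {
        // Prefixed parameter names (hypothetical prefix) avoid clashes with other processors.
        fields = Objects.requireNonNull(params.get("regexreplace.fl"), "regexreplace.fl")
                .trim().split("[,\\s]+");
        pattern = Pattern.compile(
                Objects.requireNonNull(params.get("regexreplace.pattern"), "regexreplace.pattern"));
        replacement = params.getOrDefault("regexreplace.replacement", "");
    }

    /** Pure helper with no server dependency, so it can be unit tested directly. */
    String clean(String value) {
        return pattern.matcher(value).replaceAll(replacement);
    }

    RegexReplaceProcessor getInstance() {
        return new RegexReplaceProcessor(this);
    }
}

/** Applies the configured replacement to every configured field holding a String. */
class RegexReplaceProcessor {
    private final RegexReplaceFactory config;

    RegexReplaceProcessor(RegexReplaceFactory config) {
        this.config = config;
    }

    void processAdd(Map<String, Object> doc) {
        for (String field : config.fields) {
            Object value = doc.get(field);
            if (value instanceof String) {
                doc.put(field, config.clean((String) value));
            }
        }
    }
}
```

Keeping the field list, pattern and replacement as parameters means the same class can serve several chains with different settings, and the clean method can be tested without starting a server.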
Web crawl with
Language Detection
 @ Oslo University

Solr @ Oslo University




Solr @ Oslo University
<?xml version="1.0"?>
<updateRequestProcessorChain name="web">
  <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">title,content</str>
      <str name="langid.idField">url</str>
      <str name="langid.fallbackFields"></str>
      <str name="langid.fallback">no</str>
      <str name="langid.whitelist">no,en</str>
      <str name="langid.langField">language</str>
      <str name="langid.langsField">languages</str>
      <bool name="langid.overwrite">true</bool>
      <bool name="langid.map">true</bool>
      <str name="langid.map.fl">title,content</str>
      <double name="langid.threshold">0.5</double>
    </lst>
  </processor>
  <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="fl">content_no content_en</str>
    <str name="pattern">[\s\u00A0]+</str>
    <str name="replacement"> </str>
  </processor>
  <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="title.fl">title_no,title_en</str>
    <str name="content.fl">content_no,content_en</str>
    <str name="contentType.field">tika_Content-Type</str>
    <str name="contentType.regexp">^(text/html|application/xhtml).*</str>
  </processor>
  <processor class="no.uio.update.processor.URLClassifyProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="domainOutputField">host</str>
    <str name="canonicalUrlOutputField">canonicalurl</str>
  </processor>
  <processor class="no.uio.update.processor.RegexpBoostProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="inputField">url</str>
    <str name="boostField">urlboost</str>
    <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str>
  </processor>
  <processor class="no.uio.update.processor.StaticRankProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="inputField">canonicalurl</str>
    <str name="anchorField">anchortext</str>
    <str name="rankField">docrank</str>
    <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- ... (rest of the chain is not shown on the slide) -->
</updateRequestProcessorChain>
Donations back to Apache

SOLR-2599: FieldCopyProcessor
SOLR-2825: RegexReplaceProcessor
SOLR-2826: URLClassifyProcessor
SOLR-2827: RegexpBoostProcessor
SOLR-2828: StaticRankProcessor
Binary Document Dumper (?)

Many thanks for the donations!


Room for
improvement?



Improvements


Processors re-created for every request
Duplication of config between chains
No support for non-linear chains or sub chains
Does not scale very well
Lacks native scripting language support




Improvements
Pain:
  Potentially expensive initialization
  StaticRankProcessor: read and parse 50,000 lines

Proposed cure:
  Keep a persistent state object in the factory:
    private final Map<Object,Object> sharedObjCache;
    new StaticRankProcessor(params, request, response, nextProcessor, sharedObjCache);
  The processor uses sharedObjCache for state
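
A minimal plain-Java sketch of this proposal follows. It is an illustration only, not Solr's classes: the class names, the "ranks" cache key, the load counter and the single sample entry are hypothetical, while the canonicalurl and docrank field names are borrowed from the Oslo University chain. The factory owns the cache, so the expensive load runs once even though a new processor is created for every request.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class SharedCacheSketch {
    public static void main(String[] args) {
        StaticRankFactory factory = new StaticRankFactory();
        for (int request = 0; request < 3; request++) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("canonicalurl", "http://example.com/");
            factory.getInstance().processAdd(doc); // new processor per request
        }
        System.out.println("expensive loads: " + factory.loads.get());
    }
}

/** Long-lived factory that owns the state shared by all of its processors. */
class StaticRankFactory {
    // Lives as long as the factory, so it survives across requests.
    private final Map<Object, Object> sharedObjCache = new ConcurrentHashMap<>();
    // Counts how often the expensive load ran (for demonstration only).
    final AtomicInteger loads = new AtomicInteger();

    StaticRankProc getInstance() {
        return new StaticRankProc(sharedObjCache, this::loadRanks);
    }

    /** Stand-in for reading and parsing a large rank file. */
    private Map<String, Double> loadRanks() {
        loads.incrementAndGet();
        Map<String, Double> ranks = new HashMap<>();
        ranks.put("http://example.com/", 0.9);
        return ranks;
    }
}

/** Short-lived processor that keeps no state of its own. */
class StaticRankProc {
    private final Map<Object, Object> cache;
    private final Supplier<Map<String, Double>> loader;

    StaticRankProc(Map<Object, Object> cache, Supplier<Map<String, Double>> loader) {
        this.cache = cache;
        this.loader = loader;
    }

    void processAdd(Map<String, Object> doc) {
        // computeIfAbsent runs the loader only when the key is not cached yet.
        @SuppressWarnings("unchecked")
        Map<String, Double> ranks =
                (Map<String, Double>) cache.computeIfAbsent("ranks", key -> loader.get());
        Double rank = ranks.get(doc.get("canonicalurl"));
        if (rank != null) {
            doc.put("docrank", rank);
        }
    }
}
```

A ConcurrentHashMap is used here on the assumption that several requests may hit the factory at the same time.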



Improvements
Pain:
  Multiple chains often need identical processors
  UiO’s two chains share 80% -> copy/paste

Proposed cure:
  Allow sharing of named instances
  Define:
  <processor name="langid" class="..">
  Refer:
  <processor ref="langid" />
  See SOLR-2823

Improvements
Pain:
  Chains are linear only
  Hard to do branching, sub chains, conditional...

Proposed cure (SOLR-2841):
  New scriptable Update Chain - alternative to XML
  Script chain logic in solr/conf/updateproc.groovy
  Full flexibility:
  chain myChain {
    if (doc.getFieldValue("type").equals("pdf"))
      process(tikaproc)
  }
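
The same branching idea can be tried out in plain Java while the scriptable chain remains a proposal. This is only a sketch: the document is assumed to be a simple Map, and tikaproc is a hypothetical stand-in passed in as a function. It is not the proposed Groovy syntax and uses no Solr classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

class ConditionalChainSketch {
    /** Runs tikaproc only for documents whose "type" field is "pdf". */
    static void myChain(Map<String, Object> doc, Consumer<Map<String, Object>> tikaproc) {
        // Constant-first equals() stays safe when the "type" field is missing.
        if ("pdf".equals(doc.get("type"))) {
            tikaproc.accept(doc);
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a text extraction processor.
        Consumer<Map<String, Object>> tikaproc =
                d -> d.put("cv_text", "(extracted text placeholder)");

        Map<String, Object> pdf = new HashMap<>();
        pdf.put("type", "pdf");
        Map<String, Object> html = new HashMap<>();
        html.put("type", "html");

        myChain(pdf, tikaproc);
        myChain(html, tikaproc);
        System.out.println(pdf.containsKey("cv_text") + " " + html.containsKey("cv_text"));
    }
}
```

Writing the comparison with the constant first avoids a NullPointerException for documents that have no type field.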


Improvements
Pain:
  Single threaded
  Heavy processing not efficient

Proposed cure:
  Local: Use multi threaded update requests
  SolrCloud: Dedicated nodes, role=“processor” ?
  Wrap an external pipeline in UpdateProcessor
    Example: OpenPipelineUpdateProcessor ?




Improvements
Pain:
  Not really a “problem” :-)
  Nice to write processors in Python, Groovy, JS...

Proposed cure:
  Now: Finish SOLR-1725: Script based Processor
  Later: Make scripts first-class processors
    <processor script="myScript.py" />
    or
    <processor ref="myScript" />




One last thing...


New standalone framework?
•The UpdateChain is Solr specific
•Interest in a pure pipeline framework
  •Search engine independent
  •Scalable
  •Rich pool of processors
  •Several existing candidates
•Some initial thoughts:
  http://wiki.apache.org/solr/DocumentProcessing




Summary
•Document centric vs field centric processing
•UpdateChain is there - use it!
•Works well for most “light” cases
•Scaling issues, but caching config may help
•More processors welcome!




Questions?
Jan Høydahl, Cominvent AS
@cominvent
www.cominvent.com
Extra


Alternative pipelines
•OpenPipeline (Dieselpoint)
•OpenPipe (T-Rank, now on GitHub)
•Pypes (ESR)
•UIMA (Apache)
•Eclipse SMILA
•Apache Commons Pipeline
•Piped (FoundIT, Norway)
•Behemoth (DigitalPebble)
•Findwise and TwigKit also have some technology



Calling out from UpdateChain
This is one way an external pipeline system can be integrated with Solr.

The main benefit of such a method is you can continue to feed content with SolrJ, DIH or other Update Request Handlers.


Scaling with external pipeline
Here is a more advanced, distributed case, where a Solr node is dedicated for processing, and the entry point Solr only dispatches the requests.



More Related Content

What's hot

Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R Packages
Craig Warman
 
Solr 4: Run Solr in SolrCloud Mode on your local file system.
Solr 4: Run Solr in SolrCloud Mode on your local file system.Solr 4: Run Solr in SolrCloud Mode on your local file system.
Solr 4: Run Solr in SolrCloud Mode on your local file system.
gutierrezga00
 
Automating a Vendor File Load Process with Perl and Shell Scripting
Automating a Vendor File Load Process with Perl and Shell ScriptingAutomating a Vendor File Load Process with Perl and Shell Scripting
Automating a Vendor File Load Process with Perl and Shell Scripting
Roy Zimmer
 
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
CloudxLab
 
Introduction to column oriented databases in PHP
Introduction to column oriented databases in PHPIntroduction to column oriented databases in PHP
Introduction to column oriented databases in PHP
Zend by Rogue Wave Software
 
WWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
WWW09 - Triplify Light-Weight Linked Data Publication from Relational DatabasesWWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
WWW09 - Triplify Light-Weight Linked Data Publication from Relational DatabasesSören Auer
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Farzad Nozarian
 
Scylla Summit 2017: SMF: The Fastest RPC in the West
Scylla Summit 2017: SMF: The Fastest RPC in the WestScylla Summit 2017: SMF: The Fastest RPC in the West
Scylla Summit 2017: SMF: The Fastest RPC in the West
ScyllaDB
 
Owl2 rl
Owl2 rlOwl2 rl
Owl2 rl
STIinnsbruck
 
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
DataWorks Summit
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
System Programming and Administration
System Programming and AdministrationSystem Programming and Administration
System Programming and Administration
Krasimir Berov (Красимир Беров)
 
Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of ViewScylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
ScyllaDB
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
Julien Le Dem
 
Ruby on Rails (RoR) as a back-end processor for Apex
Ruby on Rails (RoR) as a back-end processor for Apex Ruby on Rails (RoR) as a back-end processor for Apex
Ruby on Rails (RoR) as a back-end processor for Apex
Espen Brækken
 
NoSQL and Triple Stores
NoSQL and Triple StoresNoSQL and Triple Stores
NoSQL and Triple Stores
andyseaborne
 
A Practical Introduction to Apache Solr
A Practical Introduction to Apache SolrA Practical Introduction to Apache Solr
A Practical Introduction to Apache Solr
Angel Borroy López
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
CloudxLab
 

What's hot (20)

Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R Packages
 
Solr 4: Run Solr in SolrCloud Mode on your local file system.
Solr 4: Run Solr in SolrCloud Mode on your local file system.Solr 4: Run Solr in SolrCloud Mode on your local file system.
Solr 4: Run Solr in SolrCloud Mode on your local file system.
 
Automating a Vendor File Load Process with Perl and Shell Scripting
Automating a Vendor File Load Process with Perl and Shell ScriptingAutomating a Vendor File Load Process with Perl and Shell Scripting
Automating a Vendor File Load Process with Perl and Shell Scripting
 
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
 
Introduction to column oriented databases in PHP
Introduction to column oriented databases in PHPIntroduction to column oriented databases in PHP
Introduction to column oriented databases in PHP
 
WWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
WWW09 - Triplify Light-Weight Linked Data Publication from Relational DatabasesWWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
WWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Scylla Summit 2017: SMF: The Fastest RPC in the West
Scylla Summit 2017: SMF: The Fastest RPC in the WestScylla Summit 2017: SMF: The Fastest RPC in the West
Scylla Summit 2017: SMF: The Fastest RPC in the West
 
Presto overview
Presto overviewPresto overview
Presto overview
 
Owl2 rl
Owl2 rlOwl2 rl
Owl2 rl
 
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
 
System Programming and Administration
System Programming and AdministrationSystem Programming and Administration
System Programming and Administration
 
Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of ViewScylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
Ruby on Rails (RoR) as a back-end processor for Apex
Ruby on Rails (RoR) as a back-end processor for Apex Ruby on Rails (RoR) as a back-end processor for Apex
Ruby on Rails (RoR) as a back-end processor for Apex
 
NoSQL and Triple Stores
NoSQL and Triple StoresNoSQL and Triple Stores
NoSQL and Triple Stores
 
A Practical Introduction to Apache Solr
A Practical Introduction to Apache SolrA Practical Introduction to Apache Solr
A Practical Introduction to Apache Solr
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
 

Viewers also liked

Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
Anshum Gupta
 
The Latest in Spatial & Temporal Search: Presented by David Smiley
The Latest in Spatial & Temporal Search: Presented by David SmileyThe Latest in Spatial & Temporal Search: Presented by David Smiley
The Latest in Spatial & Temporal Search: Presented by David Smiley
Lucidworks
 
Intel
IntelIntel
Cpu spec
Cpu specCpu spec
Cpu spec
Sinu Jose
 
Introduction of cpu
Introduction of cpuIntroduction of cpu
Introduction of cpu
Tharindu Darshana
 
Romain Rogister DSP ppt V2003
Romain  Rogister  DSP  ppt V2003Romain  Rogister  DSP  ppt V2003
Romain Rogister DSP ppt V2003Romain Rogister
 
Central Processing Unit CUP by madridista ujjwal
Central Processing Unit CUP by madridista ujjwalCentral Processing Unit CUP by madridista ujjwal
Central Processing Unit CUP by madridista ujjwal
Ujwal Limbu
 
Data transfer instruction set of 8085 micro processor
Data transfer instruction set of 8085 micro processorData transfer instruction set of 8085 micro processor
Data transfer instruction set of 8085 micro processor
vishalgohel12195
 
Arrandale presentation1
Arrandale presentation1Arrandale presentation1
Arrandale presentation1
Tata Consultancy Services
 
Intel i7
Intel i7Intel i7
Intel i7
Justin k.
 
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Nitin S
 
Solrcloud Leader Election
Solrcloud Leader ElectionSolrcloud Leader Election
Solrcloud Leader Election
ravikgiitk
 
Perl Bag of Tricks - Baltimore Perl mongers
Perl Bag of Tricks  -  Baltimore Perl mongersPerl Bag of Tricks  -  Baltimore Perl mongers
Perl Bag of Tricks - Baltimore Perl mongers
brian d foy
 
Apache SolrCloud
Apache SolrCloudApache SolrCloud
Apache SolrCloud
Michał Warecki
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
Sematext Group, Inc.
 
Microprocessors and controllers
Microprocessors and controllersMicroprocessors and controllers
Microprocessors and controllers
Wendy Hemo
 
How SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded EnvironmentHow SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded Environment
lucenerevolution
 
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale ToolkitDeploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
thelabdude
 
Cpu scheduling(suresh)
Improving the Solr Update Chain

  • 1. Improving the Solr Update Chain Jan Høydahl
  • 2. What will I cover? (1) Who is Jan Høydahl? (2) Intro to Solr’s (hidden) UpdateChain (3) How to write your own UpdateProcessors (4) Example: Web crawl @ Oslo University (5) A vision for future improvements (6) Conclusion
  • 5. Jan Høydahl. 1995: Developer, telecom; 1998: Java developer; 2000: Search - FAST; 2006: Lucene; 2007: Cominvent; 2011: Lucene committer; > 100 projects
  • 6. Cominvent AS. Consulting & support: Lucene/Solr, FAST. www.solrtraining.com
  • 7. Why document processing? Analysis is field-oriented; filters only see the “local” field.
  • 8. Why document processing? But what if you want to add or remove fields, or make decisions based on other fields? We need a way to modify the Document.
  • 9. Why document processing? Doc1 (fields: name, postcode, cv_pdf_url); “programmer near Barcelona”
  • 10. Why document processing? Doc1 (fields: name, postcode, latlong, cv_pdf_url, cv_text); “programmer near Barcelona”
  • 11. Why document processing? Client: Doc1 (fields: name, postcode, latlong, cv_pdf_url, cv_text)
  • 12. Why document processing? Client: Doc1 (fields: name, postcode, latlong, cv_pdf_url, cv_text); 3rd party pipeline
  • 16. The Update Chain: Doc (name, postcode, cv_pdf_url) → PostcodeToLatLongProcessor → Doc (name, postcode, latlong, cv_pdf_url)
  • 17. The Update Chain: Doc (name, postcode, cv_pdf_url) → PostcodeToLatLongProcessor → Doc (name, postcode, latlong, cv_pdf_url) → UrlFetcherProcessor → Doc (name, postcode, latlong, cv_pdf_url, cv_pdf_bin)
  • 18. The Update Chain: Doc (name, postcode, cv_pdf_url) → PostcodeToLatLongProcessor → Doc (name, postcode, latlong, cv_pdf_url) → UrlFetcherProcessor → Doc (name, postcode, latlong, cv_pdf_url, cv_pdf_bin) → TikaExtractingProcessor → Doc (name, postcode, latlong, cv_pdf_url, cv_pdf_bin, cv_text)
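The chain on the slides above can be modelled in a few lines of plain Java. This is only a sketch of the idea, not Solr's API: the document is a plain field map, and the three stage methods are hypothetical stand-ins that insert dummy values where a real processor would do a postcode lookup, an HTTP fetch and a text extraction.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Simplified model of an update chain. NOT Solr's API: a document is a plain
// field map and every stage returns an enriched copy of its input.
class UpdateChainSketch {

    // Hypothetical stand-in for a postcode lookup; adds dummy coordinates.
    static Map<String, Object> postcodeToLatLong(Map<String, Object> doc) {
        Map<String, Object> out = new LinkedHashMap<>(doc);
        if (doc.containsKey("postcode")) {
            out.put("latlong", "41.39,2.17");
        }
        return out;
    }

    // Hypothetical stand-in for fetching the bytes behind cv_pdf_url.
    static Map<String, Object> urlFetcher(Map<String, Object> doc) {
        Map<String, Object> out = new LinkedHashMap<>(doc);
        if (doc.containsKey("cv_pdf_url")) {
            out.put("cv_pdf_bin", new byte[] {0x25, 0x50, 0x44, 0x46});
        }
        return out;
    }

    // Hypothetical stand-in for text extraction from the fetched binary.
    static Map<String, Object> textExtractor(Map<String, Object> doc) {
        Map<String, Object> out = new LinkedHashMap<>(doc);
        if (doc.containsKey("cv_pdf_bin")) {
            out.put("cv_text", "extracted text placeholder");
        }
        return out;
    }

    // Runs the stages in order; each stage sees the fields added before it.
    static Map<String, Object> runChain(Map<String, Object> doc,
                                        List<UnaryOperator<Map<String, Object>>> chain) {
        Map<String, Object> current = doc;
        for (UnaryOperator<Map<String, Object>> stage : chain) {
            current = stage.apply(current);
        }
        return current;
    }

    // Builds the input document from the slides and pushes it through the chain.
    static Map<String, Object> example() {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", "Doc1");
        doc.put("postcode", "08001");
        doc.put("cv_pdf_url", "http://example.org/cv.pdf");

        List<UnaryOperator<Map<String, Object>>> chain = new ArrayList<>();
        chain.add(UpdateChainSketch::postcodeToLatLong);
        chain.add(UpdateChainSketch::urlFetcher);
        chain.add(UpdateChainSketch::textExtractor);
        return runChain(doc, chain);
    }

    public static void main(String[] args) {
        System.out.println(example().keySet());
    }
}
```

The ordering matters: the extraction stage only has something to work on because the fetch stage ran before it.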
  • 21. How it’s wired: chain definition in solrconfig.xml; choose the chain in your update request: .../solr/update/xml?..&update.chain=cv-chain
  • 23. Other examples: entity extraction (Company, Location, Date) on the text “The Apache Software Foundation (ASF) is a non-profit corporation to support Apache software projects. The ASF was formed from the Apache Group and incorporated in Delaware, U.S., in June 1999.”
  • 25. Writing your own processor
  • 29. Writing your own processor: make generic, parameterized processors; use the SchemaAware, SolrCoreAware and ResourceLoaderAware interfaces; prefix param names to avoid name clashes; write testable methods and test them; donate back to Apache and document on the Wiki
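A rough illustration of three of these tips (parameterized, prefixed param names, testable methods), written against a simplified stand-in base class rather than Solr's real `UpdateRequestProcessor`. The class `TruncateFieldProcessor` and its parameter names `truncate.field` and `truncate.maxLength` are hypothetical and exist only for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a processor base class. This is a model for
// illustration, NOT Solr's UpdateRequestProcessor.
abstract class DocProcessor {
    private final DocProcessor next;

    DocProcessor(DocProcessor next) {
        this.next = next;
    }

    // Default behaviour: hand the document to the next processor, if any.
    void process(Map<String, Object> doc) {
        if (next != null) {
            next.process(doc);
        }
    }
}

// Hypothetical generic processor: field name and length come from parameters.
class TruncateFieldProcessor extends DocProcessor {
    // Param names share a "truncate." prefix so they cannot clash with the
    // parameters of other processors.
    static final String PARAM_FIELD = "truncate.field";
    static final String PARAM_MAX_LENGTH = "truncate.maxLength";

    private final String field;
    private final int maxLength;

    TruncateFieldProcessor(Map<String, String> params, DocProcessor next) {
        super(next);
        this.field = params.get(PARAM_FIELD);
        this.maxLength = Integer.parseInt(params.getOrDefault(PARAM_MAX_LENGTH, "100"));
    }

    // Pure helper with no framework dependencies, so it is easy to unit test.
    static String truncate(String value, int maxLength) {
        return value.length() <= maxLength ? value : value.substring(0, maxLength);
    }

    @Override
    void process(Map<String, Object> doc) {
        Object value = doc.get(field);
        if (value instanceof String) {
            doc.put(field, truncate((String) value, maxLength));
        }
        super.process(doc); // continue down the chain
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put(PARAM_FIELD, "title");
        params.put(PARAM_MAX_LENGTH, "5");

        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "Improving the Solr Update Chain");

        new TruncateFieldProcessor(params, null).process(doc);
        System.out.println(doc.get("title"));
    }
}
```

Keeping the actual transformation in a static helper means the logic can be tested without constructing a chain or a request.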
  • 30. Web crawl with Language Detection @ Oslo University
  • 31. Solr @ Oslo University
  • 32. Solr @ Oslo University
        <?xml version="1.0"?>
        <updateRequestProcessorChain name="web">
          <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
            <lst name="defaults">
              <str name="langid.fl">title,content</str>
              <str name="langid.idField">url</str>
              <str name="langid.fallbackFields"></str>
              <str name="langid.fallback">no</str>
              <str name="langid.whitelist">no,en</str>
              <str name="langid.langField">language</str>
              <str name="langid.langsField">languages</str>
              <bool name="langid.overwrite">true</bool>
              <bool name="langid.map">true</bool>
              <str name="langid.map.fl">title,content</str>
              <double name="langid.threshold">0.5</double>
            </lst>
          </processor>
          <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
            <bool name="enabled">true</bool>
            <str name="fl">content_no content_en</str>
            <str name="pattern">[\s\u00A0]+</str>
            <str name="replacement"> </str>
          </processor>
          <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
            <bool name="enabled">true</bool>
            <str name="title.fl">title_no,title_en</str>
            <str name="content.fl">content_no,content_en</str>
  • 35. Solr @ Oslo University
          </processor>
          <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
            <bool name="enabled">true</bool>
            <str name="fl">content_no content_en</str>
            <str name="pattern">[\s\u00A0]+</str>
            <str name="replacement"> </str>
          </processor>
          <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
            <bool name="enabled">true</bool>
            <str name="title.fl">title_no,title_en</str>
            <str name="content.fl">content_no,content_en</str>
            <str name="contentType.field">tika_Content-Type</str>
            <str name="contentType.regexp">^(text/html|application/xhtml).*</str>
          </processor>
          <processor class="no.uio.update.processor.URLClassifyProcessorFactory">
            <bool name="enabled">true</bool>
            <str name="domainOutputField">host</str>
            <str name="canonicalUrlOutputField">canonicalurl</str>
          </processor>
          <processor class="no.uio.update.processor.RegexpBoostProcessorFactory">
            <bool name="enabled">true</bool>
            <str name="inputField">url</str>
            <str name="boostField">urlboost</str>
            <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str>
          </processor>
          <processor class="no.uio.update.processor.StaticRankProcessorFactory">
            <bool name="enabled">true</bool>
            <str name="inputField">canonicalurl</str>
            <str name="anchorField">anchortext</str>
            <str name="rankField">docrank</str>
            <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str>
          </processor>
          <processor class="solr.LogUpdateProcessorFactory"/>
  • 39. Donations back to Apache: SOLR-2599: FieldCopyProcessor; SOLR-2825: RegexReplaceProcessor; SOLR-2826: URLClassifyProcessor; SOLR-2827: RegexpBoostProcessor; SOLR-2828: StaticRankProcessor; Binary Document Dumper (?). Many thanks for the donations!
  • 43. Improvements: Processors are re-created for every request; duplication of config between chains; no support for non-linear chains or sub chains; does not scale very well; lacks native scripting language support.
  • 44. Improvements. Pain: Potentially expensive initialization. StaticRankProcessor: read & parse 50,000 lines. Proposed cure: Keep a persistent state object in the factory: private final Map<Object,Object> sharedObjCache; and pass it to each new processor: new StaticRankProcessor(params, request, response, nextProcessor, sharedObjCache); The processor uses sharedObjCache for state.
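The proposed cure can be sketched in plain Java as follows. This is only an illustration under stated assumptions: RankProcessorFactory, RankProcessor, loadRanks and the example URL are hypothetical stand-ins, not Solr's UpdateRequestProcessor API, and the rank table is hard-coded instead of being parsed from rankinfo.txt. The point is that the factory lives across requests, so the expensive load happens once even though a processor is created per request.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-ins; these are NOT Solr's factory/processor classes.
final class SharedStateSketch {

    /** Long-lived factory: created once, keeps state across requests. */
    static final class RankProcessorFactory {
        private final Map<Object, Object> sharedObjCache = new ConcurrentHashMap<>();
        final AtomicInteger loads = new AtomicInteger(); // counts expensive loads

        /** Called for every request; the new processor gets the shared cache. */
        RankProcessor getInstance() {
            return new RankProcessor(sharedObjCache, this::loadRanks);
        }

        /** Stands in for reading and parsing the large rank file. */
        private Map<String, Double> loadRanks() {
            loads.incrementAndGet();
            return Map.of("http://www.uio.no/", 2.5); // assumed example data
        }
    }

    /** Short-lived processor: one instance per request. */
    static final class RankProcessor {
        private final Map<Object, Object> cache;
        private final Supplier<Map<String, Double>> loader;

        RankProcessor(Map<Object, Object> cache, Supplier<Map<String, Double>> loader) {
            this.cache = cache;
            this.loader = loader;
        }

        @SuppressWarnings("unchecked")
        double rankFor(String url) {
            // Load the table only if no earlier processor has cached it.
            Map<String, Double> ranks =
                (Map<String, Double>) cache.computeIfAbsent("ranks", k -> loader.get());
            return ranks.getOrDefault(url, 0.0);
        }
    }

    public static void main(String[] args) {
        RankProcessorFactory factory = new RankProcessorFactory();
        factory.getInstance().rankFor("http://www.uio.no/");
        factory.getInstance().rankFor("http://www.uio.no/");
        System.out.println("loads: " + factory.loads.get());
    }
}
```

A ConcurrentHashMap is used here because several requests may create processors at the same time; whether the real proposal intended a thread-safe map is not stated on the slide.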
  • 46. Improvements. Pain: Multiple chains often need identical processors. UiO’s two chains share 80% -> copy/paste. Proposed cure: Allow sharing of named instances. Define: <processor name="langid" class=".."> Refer: <processor ref="langid" /> See SOLR-2823.
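The define/refer idea can be modelled in a few lines of Java. This is a hedged sketch of the concept only, not the SOLR-2823 design: NamedProcessorSketch, define, chain and process are hypothetical names, and a processor is reduced to a function that mutates a document map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical model of "define once, refer by name"; not Solr code.
final class NamedProcessorSketch {
    private final Map<String, Consumer<Map<String, Object>>> named = new HashMap<>();
    private final Map<String, List<Consumer<Map<String, Object>>>> chains = new HashMap<>();

    /** Corresponds to <processor name="..." class=".."> on the slide. */
    void define(String name, Consumer<Map<String, Object>> processor) {
        named.put(name, processor);
    }

    /** Builds a chain from references, like <processor ref="..." />. */
    void chain(String chainName, String... refs) {
        List<Consumer<Map<String, Object>>> steps = new ArrayList<>();
        for (String ref : refs) {
            Consumer<Map<String, Object>> p = named.get(ref);
            if (p == null) {
                throw new IllegalArgumentException("Unknown processor ref: " + ref);
            }
            steps.add(p); // the same instance is shared by every chain
        }
        chains.put(chainName, steps);
    }

    /** Runs every step of the named chain on the document, in order. */
    void process(String chainName, Map<String, Object> doc) {
        for (Consumer<Map<String, Object>> step : chains.get(chainName)) {
            step.accept(doc);
        }
    }

    public static void main(String[] args) {
        NamedProcessorSketch config = new NamedProcessorSketch();
        // The fixed language value is a placeholder for real detection.
        config.define("langid", doc -> doc.put("language", "no"));
        config.chain("web", "langid");
        config.chain("files", "langid");
        Map<String, Object> doc = new HashMap<>();
        config.process("files", doc);
        System.out.println(doc);
    }
}
```

Both chains hold the same processor object, so a config change to langid is made in one place instead of being copied between chains.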
  • 48. Improvements. Pain: Chains are linear only. Hard to do branching, sub chains, conditionals... Proposed cure (SOLR-2841): New scriptable Update Chain as an alternative to XML. Script chain logic in solr/conf/updateproc.groovy. Full flexibility: chain myChain { if(doc.getFieldValue("type").equals("pdf")) process(tikaproc) }
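For readers more at home in Java than Groovy, the branching in the myChain snippet can be written out as below. This is a sketch of the control flow only: ConditionalChainSketch is a hypothetical class, the document is modelled as a plain map, and tikaproc is a placeholder that writes a made-up text field instead of calling Tika.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical Java rendering of the conditional chain; not Solr code.
final class ConditionalChainSketch {

    /** Runs the sub-step only for documents whose type field is "pdf". */
    static Map<String, Object> myChain(Map<String, Object> doc) {
        Map<String, Object> out = new HashMap<>(doc); // leave the input untouched
        if ("pdf".equals(out.get("type"))) {
            tikaproc(out);
        }
        return out;
    }

    /** Placeholder for text extraction; the "text" field name is assumed. */
    static void tikaproc(Map<String, Object> doc) {
        doc.put("text", "(extracted text would go here)");
    }

    public static void main(String[] args) {
        Map<String, Object> pdf = new HashMap<>();
        pdf.put("type", "pdf");
        Map<String, Object> html = new HashMap<>();
        html.put("type", "html");
        System.out.println(myChain(pdf).containsKey("text"));
        System.out.println(myChain(html).containsKey("text"));
    }
}
```

A linear XML chain cannot express the if; every document passes through every processor unless each processor implements its own skip logic.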
  • 50. Improvements. Pain: Single threaded. Heavy processing is not efficient. Proposed cure: Local: Use multi-threaded update requests. SolrCloud: Dedicated nodes, role=“processor”? Wrap an external pipeline in an UpdateProcessor. Example: OpenPipelineUpdateProcessor?
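The "local" cure, sending update requests from several threads, can be sketched on the client side with a standard thread pool. This is an assumption-laden illustration: ParallelFeedSketch and feed are hypothetical names, and the sender function stands in for whatever actually posts a batch to Solr (for example a SolrJ call), which is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical client-side feeder; the sender is a stand-in for a real update call.
final class ParallelFeedSketch {

    /** Sends each batch on a pool thread and returns the total documents sent. */
    static int feed(List<List<String>> batches, int threads,
                    Function<List<String>, Integer> sender) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<String> batch : batches) {
                Callable<Integer> task = () -> sender.apply(batch);
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get(); // waits for that batch to finish
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Feeding was interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> batches = List.of(
            List.of("doc1", "doc2"), List.of("doc3", "doc4", "doc5"), List.of("doc6"));
        // The sender here only counts documents; a real one would post them.
        System.out.println(feed(batches, 3, List::size));
    }
}
```

Each request still runs its chain on a single thread; the gain comes from several requests being processed at once.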
  • 52. Improvements. Pain: Not really a “problem” :-) Nice to write processors in Python, Groovy, JS... Proposed cure: Now: Finish SOLR-1725: Script based Processor. Later: Make scripts first-class processors: <processor script="myScript.py" /> or <processor ref="myScript" />
  • 54. New standalone framework? • The UpdateChain is Solr specific • Interest for a pure pipeline framework • Search engine independent • Scalable • Rich pool of processors • Several existing candidates • Some initial thoughts: http://wiki.apache.org/solr/DocumentProcessing
  • 56. Summary • Document centric vs field centric processing • UpdateChain is there - use it! • Works well for most “light” cases • Scaling issues, but caching config may help • More processors welcome!
  • 57. Questions? Jan Høydahl, Cominvent AS @cominvent www.cominvent.com
  • 59. Alternative pipelines • OpenPipeline (Dieselpoint) • OpenPipe (T-Rank, now on GitHub) • Pypes (ESR) • UIMA (Apache) • Eclipse SMILA • Apache commons pipeline • Piped (FoundIT, Norway) • Behemoth (DigitalPebble) • FindWise and TwigKit also have some technology
  • 60. Calling out from UpdateChain This is one way an external pipeline system can be integrated with Solr. The main benefit of this method is that you can continue to feed content with SolrJ, DIH or other Update Request Handlers.
  • 61. Scaling with external pipeline Here is a more advanced, distributed case, where a Solr node is dedicated to processing, and the entry point Solr only dispatches the requests.
