Improving the Solr Update Chain

A talk about the (hidden) document processing capability built right into Apache Solr. We show you what it is, how to use it, and how to write your own plugins, and we suggest some future improvements.

Usage Rights: CC Attribution-ShareAlike License


Improving the Solr Update Chain: Presentation Transcript

  • Improving the Solr Update Chain
    Jan Høydahl
  • What will I cover?
    Who is Jan Høydahl?
    Intro to Solr’s (hidden) UpdateChain
    How to write your own UpdateProcessors
    Example: Web crawl @ Oslo University
    A vision for future improvements
    Conclusion
  • Jan Høydahl
    1995: Developer, telecom
    1998: Java developer
    2000: Search - FAST
    2006: Lucene
    2007: Cominvent
    2011: Lucene committer
    > 100 projects
  • Cominvent AS
    Consulting & support
    www.solrtraining.com
    Lucene/Solr
    FAST
  • Why document processing?
    Analysis is field oriented
    Filters only see the “local” field
  • Why document processing?
    But what if you want to:
      Add or remove fields?
      Make decisions based on other fields?
    We need a way to modify the Document
  • Why document processing?
    [Diagram: Doc1 with fields name, postcode, cv_pdf_url; label “programmer near Barcelona”]
  • Why document processing?
    [Diagram: Doc1 with fields name, postcode, latlong, cv_pdf_url, cv_text; label “programmer near Barcelona”]
  • Why document processing?
    [Diagram: Client; Doc1 with fields name, postcode, latlong, cv_pdf_url, cv_text]
  • Why document processing?
    [Diagram: Client; 3rd party pipeline; Doc1 with fields name, postcode, latlong, cv_pdf_url, cv_text]
  • Solr’s Update Chain
  • The Update Chain
    [Diagram: a document passes through three processors in turn]
    Doc in: name, postcode, cv_pdf_url
    PostcodeToLatLongProcessor: adds latlong
    UrlFetcherProcessor: adds cv_pdf_bin
    TikaExtractingProcessor: adds cv_text
    Doc out: name, postcode, latlong, cv_pdf_url, cv_pdf_bin, cv_text
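The chain in the diagram can be modelled in a few lines of plain Java. This is a minimal sketch and not Solr's actual API: the classes `ChainStep`, `PostcodeToLatLongStep`, `UrlFetcherStep`, `TextExtractStep` and `CvChainDemo` are hypothetical, and the values they write are placeholders. Only the field names come from the slides.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a linear update chain. Plain JDK only; not Solr's API. */
abstract class ChainStep {
    private final ChainStep next; // null marks the end of the chain

    ChainStep(ChainStep next) {
        this.next = next;
    }

    /** Modify the document, then hand it to the next step, if any. */
    final void process(Map<String, Object> doc) {
        apply(doc);
        if (next != null) {
            next.process(doc);
        }
    }

    abstract void apply(Map<String, Object> doc);
}

/** Hypothetical step: derives latlong from postcode (placeholder value). */
class PostcodeToLatLongStep extends ChainStep {
    PostcodeToLatLongStep(ChainStep next) { super(next); }

    @Override
    void apply(Map<String, Object> doc) {
        if (doc.containsKey("postcode")) {
            doc.put("latlong", "0.0,0.0"); // a real step would look this up
        }
    }
}

/** Hypothetical step: would download cv_pdf_url; here it stores empty bytes. */
class UrlFetcherStep extends ChainStep {
    UrlFetcherStep(ChainStep next) { super(next); }

    @Override
    void apply(Map<String, Object> doc) {
        if (doc.containsKey("cv_pdf_url")) {
            doc.put("cv_pdf_bin", new byte[0]);
        }
    }
}

/** Hypothetical step: would extract text from cv_pdf_bin; placeholder text here. */
class TextExtractStep extends ChainStep {
    TextExtractStep(ChainStep next) { super(next); }

    @Override
    void apply(Map<String, Object> doc) {
        if (doc.containsKey("cv_pdf_bin")) {
            doc.put("cv_text", "(extracted text)");
        }
    }
}

class CvChainDemo {
    /** Builds the three-step chain back to front and runs one document through it. */
    static Map<String, Object> run() {
        ChainStep chain =
                new PostcodeToLatLongStep(new UrlFetcherStep(new TextExtractStep(null)));
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", "Example Person");
        doc.put("postcode", "08001");
        doc.put("cv_pdf_url", "http://example.org/cv.pdf");
        chain.process(doc);
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(run().keySet());
    }
}
```

Each step sees the whole document, which is the point of the slides: a step can add fields or decide based on other fields, something a per-field analysis filter cannot do.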
  • How it’s wired
    Chain definition in solrconfig.xml:
    Choose chain in your update request:
    .../solr/update/xml?..&update.chain=cv-chain
  • Other examples: Language Identification
  • Other examples: Entity extraction
    Sample text: “The Apache Software Foundation (ASF) is a non-profit corporation to support Apache software projects. The ASF was formed from the Apache Group and incorporated in Delaware, U.S., in June 1999.”
    Entity types: Company, Location, Date
  • Writing your own processor
  • Writing your own processor
    Make generic processors - parameterized
    Use SchemaAware, SolrCoreAware and ResourceLoaderAware interfaces
    Prefix param names to avoid name clash
    Testing and testable methods
    Donate back to Apache & document on Wiki
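Three of these tips (parameterized, prefixed parameter names, testable methods) can be illustrated without any Solr dependency. The following is a sketch under stated assumptions: the class `RegexReplaceSketch` and the `regexreplace.` prefix are hypothetical, and configuration arrives as a plain `Map` rather than through Solr's own configuration objects.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Generic regex-replace step configured only through prefixed parameters (hypothetical). */
class RegexReplaceSketch {
    /** Hypothetical prefix; keeps these names from clashing with other processors' params. */
    static final String PREFIX = "regexreplace.";

    private final List<String> fields;
    private final Pattern pattern;
    private final String replacement;

    RegexReplaceSketch(Map<String, String> params) {
        fields = Arrays.asList(required(params, "fl").split(","));
        pattern = Pattern.compile(required(params, "pattern"));
        replacement = params.getOrDefault(PREFIX + "replacement", "");
    }

    private static String required(Map<String, String> params, String name) {
        String value = params.get(PREFIX + name);
        if (value == null) {
            throw new IllegalArgumentException("Missing parameter " + PREFIX + name);
        }
        return value;
    }

    /** Pure function: can be unit tested without a running server. */
    String clean(String value) {
        return pattern.matcher(value).replaceAll(replacement);
    }

    /** Applies clean() to every configured field that holds a String. */
    void apply(Map<String, Object> doc) {
        for (String field : fields) {
            String name = field.trim();
            Object value = doc.get(name);
            if (value instanceof String) {
                doc.put(name, clean((String) value));
            }
        }
    }
}
```

Keeping the text transformation in `clean` and the document handling in `apply` means the regular expression logic can be tested on plain strings, while nothing in the class is tied to one particular field or pattern.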
  • Web crawl with Language Detection @ Oslo University
  • Solr @ Oslo University
  • Solr @ Oslo University (update chain configuration)
    <?xml version="1.0"?>
    <updateRequestProcessorChain name="web">
      <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
        <lst name="defaults">
          <str name="langid.fl">title,content</str>
          <str name="langid.idField">url</str>
          <str name="langid.fallbackFields"></str>
          <str name="langid.fallback">no</str>
          <str name="langid.whitelist">no,en</str>
          <str name="langid.langField">language</str>
          <str name="langid.langsField">languages</str>
          <bool name="langid.overwrite">true</bool>
          <bool name="langid.map">true</bool>
          <str name="langid.map.fl">title,content</str>
          <double name="langid.threshold">0.5</double>
        </lst>
      </processor>
      <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="fl">content_no content_en</str>
        <str name="pattern">[\s\u00A0]+</str>
        <str name="replacement"> </str>
      </processor>
      <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="title.fl">title_no,title_en</str>
        <str name="content.fl">content_no,content_en</str>
        <str name="contentType.field">tika_Content-Type</str>
        <str name="contentType.regexp">^(text/html|application/xhtml).*</str>
      </processor>
      <processor class="no.uio.update.processor.URLClassifyProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="domainOutputField">host</str>
        <str name="canonicalUrlOutputField">canonicalurl</str>
      </processor>
      <processor class="no.uio.update.processor.RegexpBoostProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="inputField">url</str>
        <str name="boostField">urlboost</str>
        <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str>
      </processor>
      <processor class="no.uio.update.processor.StaticRankProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="inputField">canonicalurl</str>
        <str name="anchorField">anchortext</str>
        <str name="rankField">docrank</str>
        <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
  • Donations back to Apache
    SOLR-2599: FieldCopyProcessor
    SOLR-2825: RegexReplaceProcessor
    SOLR-2826: URLClassifyProcessor
    SOLR-2827: RegexpBoostProcessor
    SOLR-2828: StaticRankProcessor
    Binary Document Dumper (?)
    Many thanks for the donations!
  • Room for improvement?
  • Improvements
    Processors re-created for every request
    Duplication of config between chains
    No support for non-linear chains or sub chains
    Does not scale very well
    Lacks native scripting language support
  • Improvements
    Pain: Potentially expensive initialization
      StaticRankProcessor: read & parse 50,000 lines
    Proposed cure: Keep persistent state object in factory:
      private final Map<Object,Object> sharedObjCache
      new StaticRankProcessor(params, request, response, nextProcessor, sharedObjCache);
      Processor uses sharedObjCache for state
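The proposed cure can be sketched in plain Java. This is an illustration of the idea only, not Solr code: `StaticRankFactorySketch` and `StaticRankSketch` are hypothetical stand-ins, the rank data is a one-entry placeholder for the 50,000-line file, and the `loads` counter exists only to make the effect visible.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Long-lived factory that owns state shared by all per-request processors (hypothetical). */
class StaticRankFactorySketch {
    /** Survives across requests because the factory itself is created once. */
    private final Map<Object, Object> sharedObjCache = new ConcurrentHashMap<>();

    /** Counts expensive loads; for illustration only. */
    final AtomicInteger loads = new AtomicInteger();

    /** Called for every request; creating the processor is now cheap. */
    StaticRankSketch getInstance() {
        return new StaticRankSketch(sharedObjCache, this::loadRanks);
    }

    /** Stand-in for reading and parsing a large rank file. */
    private Map<String, Double> loadRanks() {
        loads.incrementAndGet();
        Map<String, Double> ranks = new HashMap<>();
        ranks.put("http://example.org/", 1.5);
        return ranks;
    }
}

/** Short-lived processor that keeps its state in the shared cache (hypothetical). */
class StaticRankSketch {
    private final Map<String, Double> ranks;

    @SuppressWarnings("unchecked")
    StaticRankSketch(Map<Object, Object> sharedObjCache,
                     Supplier<Map<String, Double>> loader) {
        // computeIfAbsent invokes the loader only when the key is not cached yet.
        ranks = (Map<String, Double>)
                sharedObjCache.computeIfAbsent("ranks", key -> loader.get());
    }

    double rankFor(String url) {
        return ranks.getOrDefault(url, 0.0);
    }
}
```

With this arrangement, calling `getInstance()` many times triggers a single load. The trade-off is that the cached data no longer refreshes by itself, so a real implementation would need some way to invalidate it when the file changes.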
  • Improvements
    Pain: Multiple chains often need identical Processors
      UiO’s two chains share 80% -> copy/paste
    Proposed cure: Allow sharing of named instances
      Define: <processor name="langid" class="..">
      Refer: <processor ref="langid" />
      See SOLR-2823
  • Improvements
    Pain: Chains are linear only
      Hard to do branching, sub chains, conditional...
    Proposed cure (SOLR-2841): New scriptable Update Chain - alternative to XML
      Script chain logic in solr/conf/updateproc.groovy
      Full flexibility:
        chain myChain {
          if(doc.getFieldValue("type").equals("pdf"))
            process(tikaproc)
        }
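The conditional step from the Groovy snippet can also be expressed in plain Java. This is a sketch of the branching idea only, not the SOLR-2841 proposal itself: `ConditionalChainSketch`, `when`, `TIKA_STANDIN` and `MY_CHAIN` are hypothetical names, and the document is modelled as a plain `Map`.

```java
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Conditional execution of a processing step (hypothetical, plain JDK). */
class ConditionalChainSketch {

    /** Wraps a step so that it runs only when the condition holds for the document. */
    static Consumer<Map<String, Object>> when(Predicate<Map<String, Object>> condition,
                                              Consumer<Map<String, Object>> step) {
        return doc -> {
            if (condition.test(doc)) {
                step.accept(doc);
            }
        };
    }

    /** Stand-in for the slide's tikaproc; writes placeholder text. */
    static final Consumer<Map<String, Object>> TIKA_STANDIN =
            doc -> doc.put("cv_text", "(extracted text)");

    /** Equivalent of the slide's myChain: extract text only for documents of type "pdf". */
    static final Consumer<Map<String, Object>> MY_CHAIN =
            when(doc -> "pdf".equals(doc.get("type")), TIKA_STANDIN);
}
```

Writing the comparison as `"pdf".equals(...)` makes the step skip documents that have no `type` field instead of failing on them.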
  • Improvements
    Pain: Single threaded
      Heavy processing not efficient
    Proposed cure:
      Local: Use multi threaded update requests
      SolrCloud: Dedicated nodes, role=“processor” ?
      Wrap an external pipeline in UpdateProcessor
        Example: OpenPipelineUpdateProcessor ?
  • Improvements
    Pain: Not really a “problem” :-)
      Nice to write processors in Python, Groovy, JS...
    Proposed cure:
      Now: Finish SOLR-1725: Script based Processor
      Later: Make scripts first-class processors
        <processor script="myScript.py" /> or <processor ref="myScript" />
  • One last thing...
  • New standalone framework?
    The UpdateChain is Solr specific
    Interest in a pure pipeline framework
      Search engine independent
      Scalable
      Rich pool of processors
      Several existing candidates
    Some initial thoughts: http://wiki.apache.org/solr/DocumentProcessing
  • Summary
    Document centric vs field centric processing
    UpdateChain is there - use it!
    Works well for most “light” cases
    Scaling issues, but caching config may help
    More processors welcome!
  • Questions?
    Jan Høydahl, Cominvent AS
    @cominvent
    www.cominvent.com
  • Extra
  • Alternative pipelines
    OpenPipeline (Dieselpoint)
    OpenPipe (T-Rank, now on GitHub)
    Pypes (ESR)
    UIMA (Apache)
    Eclipse SMILA
    Apache commons pipeline
    Piped (FoundIT, Norway)
    Behemoth (DigitalPebble)
    FindWise and TwigKit also have some technology
  • Calling out from UpdateChain
    This is one way an external pipeline system can be integrated with Solr. The main benefit of such a method is that you can continue to feed content with SolrJ, DIH or other UpdateRequestHandlers.
  • Scaling with external pipeline
    Here is a more advanced, distributed case, where a Solr node is dedicated to processing, and the entry point Solr only dispatches the requests.