Improving the Solr Update Chain

A talk about the (hidden) document processing capability built right into Apache Solr. We show you what it is, how to use it, and how to write your own plugins, and we suggest some future improvements.

Published in: Technology, Business

  • Improving the Solr Update Chain

    1. Improving the Solr Update Chain, Jan Høydahl
    2. What will I cover?
        • Who is Jan Høydahl?
        • Intro to Solr’s (hidden) UpdateChain
        • How to write your own UpdateProcessors
        • Example: Web crawl @ Oslo University
        • A vision for future improvements
        • Conclusion
    3. Jan Høydahl
        • 1995: Developer, telecom
        • 1998: Java developer
        • 2000: Search - FAST
        • 2006: Lucene
        • 2007: Cominvent
        • 2011: Lucene committer
        • > 100 projects
    4. Cominvent AS
        • Consulting & support
        • www.solrtraining.com
        • Lucene/Solr
        • FAST
    5. Why document processing?
        • Analysis is field oriented
        • Filters only see the “local” field
    6. Why document processing?
        • But what if you want to:
            • Add or remove fields?
            • Make decisions based on other fields?
        • We need a way to modify the Document
    7. Why document processing? [Diagram: Doc1 with fields name, postcode, cv_pdf_url; label “programmer near Barcelona”]
    8. Why document processing? [Diagram: Doc1 with fields name, postcode, latlong, cv_pdf_url, cv_text; label “programmer near Barcelona”]
    9. Why document processing? [Diagram: Client; Doc1 with fields name, postcode, latlong, cv_pdf_url, cv_text]
    10. Why document processing? [Diagram: Client; 3rd party pipeline; Doc1 with fields name, postcode, latlong, cv_pdf_url, cv_text]
    11. Solr’s Update Chain
    12. The Update Chain
    13. The Update Chain [Diagram: Doc (name, postcode, cv_pdf_url)]
    14. The Update Chain [Diagram: Doc (name, postcode, cv_pdf_url) → PostcodeToLatLongProcessor → Doc (name, postcode, latlong, cv_pdf_url)]
    15. The Update Chain [Diagram: as before, then → UrlFetcherProcessor → Doc (name, postcode, latlong, cv_pdf_url, cv_pdf_bin)]
    16. The Update Chain [Diagram: as before, then → TikaExtractingProcessor → Doc (name, postcode, latlong, cv_pdf_url, cv_pdf_bin, cv_text)]
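The chain pictured on these slides can be sketched in a few lines of plain Java. This is only an illustration of the idea, not the Solr API: the document is modelled as a `Map`, and the class name, the three processor methods, the sample postcode, the coordinates and the placeholder values are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Plain-Java stand-in for the update chain idea; not the Solr API. */
class UpdateChainSketch {

    /** Runs each processor in order; every processor sees the whole document. */
    static Map<String, Object> run(List<UnaryOperator<Map<String, Object>>> chain,
                                   Map<String, Object> doc) {
        Map<String, Object> current = new LinkedHashMap<>(doc);
        for (UnaryOperator<Map<String, Object>> processor : chain) {
            current = processor.apply(current);
        }
        return current;
    }

    /** Hypothetical lookup: a real processor would consult a postcode database. */
    static Map<String, Object> postcodeToLatLong(Map<String, Object> doc) {
        Map<String, Object> out = new LinkedHashMap<>(doc);
        if ("08001".equals(doc.get("postcode"))) {
            out.put("latlong", "41.38,2.17"); // made-up sample value
        }
        return out;
    }

    /** Hypothetical fetcher: stores placeholder content instead of doing HTTP. */
    static Map<String, Object> urlFetcher(Map<String, Object> doc) {
        Map<String, Object> out = new LinkedHashMap<>(doc);
        if (doc.containsKey("cv_pdf_url")) {
            out.put("cv_pdf_bin", "%PDF-placeholder");
        }
        return out;
    }

    /** Hypothetical extractor: a real one would hand cv_pdf_bin to Tika. */
    static Map<String, Object> textExtractor(Map<String, Object> doc) {
        Map<String, Object> out = new LinkedHashMap<>(doc);
        if (doc.containsKey("cv_pdf_bin")) {
            out.put("cv_text", "extracted text placeholder");
        }
        return out;
    }

    /** Builds the three-step chain from the slides and runs one document through it. */
    static Map<String, Object> demo() {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", "Doc1");
        doc.put("postcode", "08001");
        doc.put("cv_pdf_url", "http://example.com/cv.pdf");
        List<UnaryOperator<Map<String, Object>>> chain = List.of(
                UpdateChainSketch::postcodeToLatLong,
                UpdateChainSketch::urlFetcher,
                UpdateChainSketch::textExtractor);
        return run(chain, doc);
    }
}
```

Each step can add fields based on other fields, which is exactly what a field-local analysis filter cannot do.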
    17. How it’s wired
        • Chain definition in solrconfig.xml
        • Choose chain in your update request: .../solr/update/xml?..&update.chain=cv-chain
    18. Other examples: Language identification
    19. Other examples: Entity extraction (Company, Location, Date). Sample text: “The Apache Software Foundation (ASF) is a non-profit corporation to support Apache software projects. The ASF was formed from the Apache Group and incorporated in Delaware, U.S., in June 1999.”
    20. Writing your own processor
    24. Writing your own processor
        • Make generic processors - parameterized
        • Use SchemaAware, SolrCoreAware and ResourceLoaderAware interfaces
        • Prefix param names to avoid name clash
        • Testing and testable methods
        • Donate back to Apache & document on Wiki
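A minimal sketch of three of these tips (parameterized, prefixed parameter names, testable helper methods), written in plain Java rather than against the Solr plugin API so that it runs on its own. The class name, the `regexreplace.` prefix and the parameter names are assumptions for illustration; a real processor would extend Solr's update processor classes instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.regex.Pattern;

/** Plain-Java sketch of a parameterized processor; not the Solr plugin API. */
class RegexReplaceSketch {
    // Hypothetical prefix, so these names cannot clash with another processor's params.
    static final String PREFIX = "regexreplace.";

    private final String[] fields;
    private final Pattern pattern;
    private final String replacement;

    /** All behaviour comes from params, so one class serves many configurations. */
    RegexReplaceSketch(Map<String, String> params) {
        this.fields = params.getOrDefault(PREFIX + "fl", "").trim().split("[,\\s]+");
        this.pattern = Pattern.compile(
                Objects.requireNonNull(params.get(PREFIX + "pattern"), "pattern is required"));
        this.replacement = params.getOrDefault(PREFIX + "replacement", "");
    }

    /** Pure helper, kept separate so it can be unit-tested without a running server. */
    static String replaceAll(Pattern pattern, String replacement, String value) {
        return pattern.matcher(value).replaceAll(replacement);
    }

    /** Returns a copy of the document with the pattern replaced in the configured fields. */
    Map<String, Object> process(Map<String, Object> doc) {
        Map<String, Object> out = new LinkedHashMap<>(doc);
        for (String field : fields) {
            Object value = out.get(field);
            if (value instanceof String) {
                out.put(field, replaceAll(pattern, replacement, (String) value));
            }
        }
        return out;
    }
}
```

Fields that are not listed in the `fl` parameter pass through untouched, so the same class can be reused in several chains with different settings.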
    25. Web crawl with Language Detection @ Oslo University
    26. Solr @ Oslo University
    27. Solr @ Oslo University
          <?xml version="1.0"?>
          <updateRequestProcessorChain name="web">
            <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
              <lst name="defaults">
                <str name="langid.fl">title,content</str>
                <str name="langid.idField">url</str>
                <str name="langid.fallbackFields"></str>
                <str name="langid.fallback">no</str>
                <str name="langid.whitelist">no,en</str>
                <str name="langid.langField">language</str>
                <str name="langid.langsField">languages</str>
                <bool name="langid.overwrite">true</bool>
                <bool name="langid.map">true</bool>
                <str name="langid.map.fl">title,content</str>
                <double name="langid.threshold">0.5</double>
              </lst>
            </processor>
            <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
              <bool name="enabled">true</bool>
              <str name="fl">content_no content_en</str>
              <str name="pattern">[\s\u00A0]+</str>
              <str name="replacement"> </str>
            </processor>
            <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
              <bool name="enabled">true</bool>
              <str name="title.fl">title_no,title_en</str>
              <str name="content.fl">content_no,content_en</str>
    30. Solr @ Oslo University
            </processor>
            <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
              <bool name="enabled">true</bool>
              <str name="fl">content_no content_en</str>
              <str name="pattern">[\s\u00A0]+</str>
              <str name="replacement"> </str>
            </processor>
            <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
              <bool name="enabled">true</bool>
              <str name="title.fl">title_no,title_en</str>
              <str name="content.fl">content_no,content_en</str>
              <str name="contentType.field">tika_Content-Type</str>
              <str name="contentType.regexp">^(text/html|application/xhtml).*</str>
            </processor>
            <processor class="no.uio.update.processor.URLClassifyProcessorFactory">
              <bool name="enabled">true</bool>
              <str name="domainOutputField">host</str>
              <str name="canonicalUrlOutputField">canonicalurl</str>
            </processor>
            <processor class="no.uio.update.processor.RegexpBoostProcessorFactory">
              <bool name="enabled">true</bool>
              <str name="inputField">url</str>
              <str name="boostField">urlboost</str>
              <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str>
            </processor>
            <processor class="no.uio.update.processor.StaticRankProcessorFactory">
              <bool name="enabled">true</bool>
              <str name="inputField">canonicalurl</str>
              <str name="anchorField">anchortext</str>
              <str name="rankField">docrank</str>
              <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str>
            </processor>
            <processor class="solr.LogUpdateProcessorFactory"/>
    34. Donations back to Apache
        • SOLR-2599: FieldCopyProcessor
        • SOLR-2825: RegexReplaceProcessor
        • SOLR-2826: URLClassifyProcessor
        • SOLR-2827: RegexpBoostProcessor
        • SOLR-2828: StaticRankProcessor
        • Binary Document Dumper (?)
        • Many thanks for the donations!
    35. Room for improvement?
    36. Improvements
        • Processors re-created for every request
        • Duplication of config between chains
        • No support for non-linear or sub chains
        • Does not scale very well
        • Lacks native scripting language support
    37. Improvements
        • Pain: Potentially expensive initialization
            • StaticRankProcessor: read & parse 50,000 lines
        • Proposed cure: Keep a persistent state object in the factory
            • private final Map<Object,Object> sharedObjCache
            • new StaticRankProcessor(params, request, response, nextProcessor, sharedObjCache);
            • Processor uses sharedObjCache for state
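The proposed cure can be sketched as follows, again in plain Java and not against the Solr API. Only `sharedObjCache` comes from the slide; the class names, the `loads` counter, the `"ranks"` cache key and the sample rank data are assumptions for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Plain-Java sketch of a factory that keeps expensive state across requests. */
class StaticRankFactorySketch {
    // Lives as long as the factory, so it survives per-request processor creation.
    private final Map<Object, Object> sharedObjCache = new ConcurrentHashMap<>();
    // Counts how often the expensive load runs (for illustration only).
    final AtomicInteger loads = new AtomicInteger();

    /** Called once per request; hands the shared cache to the new processor. */
    Processor getInstance() {
        return new Processor(sharedObjCache, this::loadRanks);
    }

    /** Stand-in for reading and parsing the rank file (hypothetical data). */
    private Map<String, Integer> loadRanks() {
        loads.incrementAndGet();
        return Map.of("http://example.com/", 10, "http://example.com/about", 3);
    }

    /** Short-lived, per-request object that reuses the cached state. */
    static class Processor {
        private final Map<String, Integer> ranks;

        @SuppressWarnings("unchecked")
        Processor(Map<Object, Object> cache, Supplier<Map<String, Integer>> loader) {
            // computeIfAbsent runs the loader only when the key is missing.
            this.ranks = (Map<String, Integer>) cache.computeIfAbsent("ranks", k -> loader.get());
        }

        /** Looks up the rank for a URL, defaulting to 0 when it is unknown. */
        int rankOf(String url) {
            return ranks.getOrDefault(url, 0);
        }
    }
}
```

However many processors the factory hands out, the load runs once, which is the point of keeping the state in the factory rather than in the processor.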
    39. Improvements
        • Pain: Multi chains often need identical Processors
            • UiO’s two chains share 80% -> copy/paste
        • Proposed cure: Allow sharing of named instances
            • Define: <processor name="langid" class="..">
            • Refer: <processor ref="langid" />
            • See SOLR-2823
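One way to picture the define/refer proposal is a small registry that chains look processors up in by name. This is a plain-Java sketch under assumptions, not how SOLR-2823 is implemented: the class and method names are hypothetical and documents are modelled as a `Map`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Plain-Java sketch of sharing named processor instances between chains. */
class NamedProcessorRegistrySketch {
    private final Map<String, UnaryOperator<Map<String, Object>>> named = new HashMap<>();

    /** Plays the role of the proposed named definition: configure a processor once. */
    void define(String name, UnaryOperator<Map<String, Object>> processor) {
        named.put(name, processor);
    }

    /** Plays the role of the proposed ref attribute: each reference resolves to the shared instance. */
    List<UnaryOperator<Map<String, Object>>> chain(String... refs) {
        List<UnaryOperator<Map<String, Object>>> chain = new ArrayList<>();
        for (String ref : refs) {
            UnaryOperator<Map<String, Object>> processor = named.get(ref);
            if (processor == null) {
                throw new IllegalArgumentException("Unknown processor ref: " + ref);
            }
            chain.add(processor);
        }
        return chain;
    }
}
```

Two chains built with `chain("langid", ...)` then hold the very same `langid` instance, so its configuration exists in one place only.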
    41. Improvements
        • Pain: Chains are linear only
            • Hard to do branching, sub chains, conditional...
        • Proposed cure (SOLR-2841): New scriptable Update Chain, an alternative to XML
            • Script chain logic in solr/conf/updateproc.groovy
            • Full flexibility:
                chain myChain {
                  if(doc.getFieldValue("type").equals("pdf"))
                    process(tikaproc)
                }
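The conditional step in the Groovy snippet can be approximated in plain Java by wrapping a processor in a condition. This is a sketch of the branching idea only, not the SOLR-2841 design: the `when` helper and class name are hypothetical, and `tikaproc` here is a placeholder that does no real extraction.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Plain-Java sketch of a conditional (non-linear) step in a chain. */
class ConditionalChainSketch {

    /** Wraps a processor so it only runs when the condition holds for the document. */
    static UnaryOperator<Map<String, Object>> when(Predicate<Map<String, Object>> condition,
                                                   UnaryOperator<Map<String, Object>> processor) {
        return doc -> condition.test(doc) ? processor.apply(doc) : doc;
    }

    /** Hypothetical stand-in for the slide's tikaproc; adds a placeholder text field. */
    static Map<String, Object> tikaproc(Map<String, Object> doc) {
        Map<String, Object> out = new LinkedHashMap<>(doc);
        out.put("cv_text", "extracted text placeholder");
        return out;
    }

    /** Same logic as the slide: run tikaproc only for documents of type "pdf". */
    static Map<String, Object> myChain(Map<String, Object> doc) {
        return when(d -> "pdf".equals(d.get("type")), ConditionalChainSketch::tikaproc).apply(doc);
    }
}
```

Documents of any other type pass through `myChain` unchanged.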
    43. Improvements
        • Pain: Single threaded
            • Heavy processing not efficient
        • Proposed cure:
            • Local: Use multi-threaded update requests
            • SolrCloud: Dedicated nodes, role=“processor”?
            • Wrap an external pipeline in an UpdateProcessor
            • Example: OpenPipelineUpdateProcessor?
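The "local" cure amounts to pushing documents through the processing step from several threads at once. A minimal sketch using only the JDK follows; it illustrates the threading pattern and is not SolrJ code. The class and method names are assumptions, and the processor is assumed to be safe to call concurrently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.UnaryOperator;

/** Plain-Java sketch of processing documents on several threads. */
class ParallelFeedSketch {

    /** Runs the processor for every document on a fixed pool; results keep input order. */
    static <D> List<D> processAll(List<D> docs, UnaryOperator<D> processor, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<D>> futures = new ArrayList<>();
            for (D doc : docs) {
                // Each document becomes an independent task on the pool.
                futures.add(CompletableFuture.supplyAsync(() -> processor.apply(doc), pool));
            }
            List<D> results = new ArrayList<>();
            for (CompletableFuture<D> future : futures) {
                results.add(future.join()); // waits for that document to finish
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

This helps only when the per-document work is heavy enough to outweigh the scheduling overhead.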
    45. Improvements
        • Pain: Not really a “problem” :-)
            • Nice to write processors in Python, Groovy, JS...
        • Proposed cure:
            • Now: Finish SOLR-1725: Script based Processor
            • Later: Make scripts first-class processors: <processor script="myScript.py" /> or <processor ref="myScript" />
    46. One last thing...
    47. New standalone framework?
        • The UpdateChain is Solr specific
        • Interest in a pure pipeline framework
            • Search engine independent
            • Scalable
            • Rich pool of processors
            • Several existing candidates
        • Some initial thoughts: http://wiki.apache.org/solr/DocumentProcessing
    48. Summary
    49. Summary
        • Document centric vs field centric processing
        • UpdateChain is there - use it!
        • Works well for most “light” cases
        • Scaling issues, but caching config may help
        • More processors welcome!
    50. Questions?
        • Jan Høydahl, Cominvent AS
        • @cominvent
        • www.cominvent.com
    51. Extra
    52. Alternative pipelines
        • OpenPipeline (Dieselpoint)
        • OpenPipe (T-Rank, now on GitHub)
        • Pypes (ESR)
        • UIMA (Apache)
        • Eclipse SMILA
        • Apache Commons Pipeline
        • Piped (FoundIT, Norway)
        • Behemoth (DigitalPebble)
        • FindWise and TwigKit also have some technology
    53. Calling out from UpdateChain: This is one way an external pipeline system can be integrated with Solr. The main benefit of such a method is that you can continue to feed content with SolrJ, DIH or other UpdateRequestHandlers.
    54. Scaling with external pipeline: Here is a more advanced, distributed case, where a Solr node is dedicated to processing and the entry point Solr only dispatches the requests.