Improving the Solr Update Chain

1. Improving the Solr Update Chain Jan Høydahl

2. What will I cover?
• Who is Jan Høydahl?
• Intro to Solr’s (hidden) UpdateChain
• How to write your own UpdateProcessors
• Example: Web crawl @ Oslo University
• A vision for future improvements
• Conclusion

5. Jan Høydahl
• 1995: Developer, telecom
• 1998: Java developer
• 2000: Search - FAST
• 2006: Lucene
• 2007: Cominvent
• 2011: Lucene committer
• > 100 projects

6. Cominvent AS
• Consulting & support
• www.solrtraining.com
• Lucene/Solr
• FAST

7. Why document processing?
• Analysis is field oriented
• Filters only see the “local” field

8. Why document processing?
But what if you want to:
• Add or remove fields?
• Make decisions based on other fields?
We need a way to modify the Document

9. Why document processing?
Diagram: Doc1 (name, postcode, cv_pdf_url); query “programmer near Barcelona”

10. Why document processing?
Diagram: Doc1 (name, postcode, latlong, cv_pdf_url, cv_text); query “programmer near Barcelona”

11. Why document processing?
Diagram: Client and Doc1 (name, postcode, latlong, cv_pdf_url, cv_text)

12. Why document processing?
Diagram: Client, 3rd party pipeline and Doc1 (name, postcode, latlong, cv_pdf_url, cv_text)

13. Solr’s Update Chain

14. The Update Chain

15. The Update Chain
Doc (name, postcode, cv_pdf_url)

16. The Update Chain
Doc (name, postcode, cv_pdf_url)
-> PostcodeToLatLong Processor -> Doc (name, postcode, latlong, cv_pdf_url)

17. The Update Chain
Doc (name, postcode, cv_pdf_url)
-> PostcodeToLatLong Processor -> Doc (name, postcode, latlong, cv_pdf_url)
-> UrlFetcher Processor -> Doc (name, postcode, latlong, cv_pdf_url, cv_pdf_bin)

18. The Update Chain
Doc (name, postcode, cv_pdf_url)
-> PostcodeToLatLong Processor -> Doc (name, postcode, latlong, cv_pdf_url)
-> UrlFetcher Processor -> Doc (name, postcode, latlong, cv_pdf_url, cv_pdf_bin)
-> TikaExtracting Processor -> Doc (name, postcode, latlong, cv_pdf_url, cv_pdf_bin, cv_text)
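The three-step chain on this slide can be sketched in plain Java. This is a minimal illustration, not Solr code: the document is modelled as a `Map`, and the interface `DocProcessorSketch`, the class `UpdateChainSketch`, the lookup table and the fake download are all assumptions made for the example. The field names come from the slide.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One step in the chain: it sees the whole document and may add or change fields. */
interface DocProcessorSketch {
    void process(Map<String, Object> doc);
}

class UpdateChainSketch {
    // Hypothetical lookup table; a real processor would call a geocoding service.
    static final Map<String, String> GEO_TABLE = Map.of("08001", "41.38,2.17");

    static final DocProcessorSketch POSTCODE_TO_LATLONG = doc -> {
        Object postcode = doc.get("postcode");
        String latlong = postcode == null ? null : GEO_TABLE.get(postcode.toString());
        if (latlong != null) doc.put("latlong", latlong);
    };

    // Stand-in for an HTTP download of the CV (no network access here).
    static final DocProcessorSketch URL_FETCHER = doc -> {
        Object url = doc.get("cv_pdf_url");
        if (url != null) {
            doc.put("cv_pdf_bin", ("fake pdf from " + url).getBytes(StandardCharsets.UTF_8));
        }
    };

    // Stand-in for Tika text extraction: simply decodes the fake bytes.
    static final DocProcessorSketch TEXT_EXTRACTOR = doc -> {
        Object bin = doc.get("cv_pdf_bin");
        if (bin instanceof byte[]) {
            doc.put("cv_text", new String((byte[]) bin, StandardCharsets.UTF_8));
        }
    };

    static final List<DocProcessorSketch> CV_CHAIN =
            List.of(POSTCODE_TO_LATLONG, URL_FETCHER, TEXT_EXTRACTOR);

    /** Runs every processor in order on the same document. */
    static Map<String, Object> run(Map<String, Object> doc, List<DocProcessorSketch> chain) {
        for (DocProcessorSketch p : chain) p.process(doc);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", "Example Person");
        doc.put("postcode", "08001");
        doc.put("cv_pdf_url", "http://example.com/cv.pdf");
        System.out.println(run(doc, CV_CHAIN).keySet());
    }
}
```

Each processor only depends on the fields it reads, so steps can be reordered or dropped as long as their input fields are still produced earlier in the chain.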

21. How it’s wired
• Chain definition in solrconfig.xml
• Choose chain in your update request: .../solr/update/xml?..&update.chain=cv-chain

22. Other examples
Language Identification

23. Other examples
Entity extraction: Company, Location, Date
“The Apache Software Foundation (ASF) is a non-profit corporation to support Apache software projects. The ASF was formed from the Apache Group and incorporated in Delaware, U.S., in June 1999.”

25. Writing your own processor

29. Writing your own processor
• Make generic processors - parameterized
• Use SchemaAware, SolrCoreAware and ResourceLoaderAware interfaces
• Prefix param names to avoid name clash
• Testing and testable methods
• Donate back to Apache & document on Wiki
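A rough sketch of three of these guidelines (parameterized, prefixed param names, testable methods), written without any Solr dependency so it runs standalone. The class `RegexReplaceSketch` and the `regexreplace.` prefix are hypothetical. In Solr itself the same logic would live in an `UpdateRequestProcessor` subclass that overrides `processAdd`, created by an `UpdateRequestProcessorFactory`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.regex.Pattern;

/** Hypothetical generic processor: field list, pattern and replacement all come from config. */
class RegexReplaceSketch {
    // Hypothetical prefix so these names cannot clash with other request parameters.
    static final String PREFIX = "regexreplace.";

    private final String[] fields;
    private final Pattern pattern;
    private final String replacement;

    RegexReplaceSketch(Map<String, String> params) {
        this.fields = params.getOrDefault(PREFIX + "fl", "").split("[,\\s]+");
        this.pattern = Pattern.compile(
                Objects.requireNonNull(params.get(PREFIX + "pattern"), PREFIX + "pattern is required"));
        this.replacement = params.getOrDefault(PREFIX + "replacement", "");
    }

    /** Pure string-in, string-out method: can be unit tested without a running Solr. */
    String clean(String value) {
        return pattern.matcher(value).replaceAll(replacement);
    }

    /** Applies clean() to every configured field that holds a String. */
    void process(Map<String, Object> doc) {
        for (String field : fields) {
            Object value = doc.get(field);
            if (value instanceof String) doc.put(field, clean((String) value));
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put(PREFIX + "fl", "content_no content_en");
        params.put(PREFIX + "pattern", "[\\s\\u00A0]+");
        params.put(PREFIX + "replacement", " ");
        Map<String, Object> doc = new HashMap<>();
        doc.put("content_no", "Hei   verden");
        new RegexReplaceSketch(params).process(doc);
        System.out.println(doc.get("content_no"));
    }
}
```

Keeping the real work in `clean` means the regex behaviour can be tested in isolation, while `process` only deals with field selection.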

30. Web crawl with Language Detection @ Oslo University

31. Solr @ Oslo University

32. Solr @ Oslo University
<?xml version="1.0"?>
<updateRequestProcessorChain name="web">
  <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">title,content</str>
      <str name="langid.idField">url</str>
      <str name="langid.fallbackFields"></str>
      <str name="langid.fallback">no</str>
      <str name="langid.whitelist">no,en</str>
      <str name="langid.langField">language</str>
      <str name="langid.langsField">languages</str>
      <bool name="langid.overwrite">true</bool>
      <bool name="langid.map">true</bool>
      <str name="langid.map.fl">title,content</str>
      <double name="langid.threshold">0.5</double>
    </lst>
  </processor>
  <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="fl">content_no content_en</str>
    <str name="pattern">[\s\u00A0]+</str>
    <str name="replacement"> </str>
  </processor>
  <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="title.fl">title_no,title_en</str>
    <str name="content.fl">content_no,content_en</str>
    <str name="contentType.field">tika_Content-Type</str>
    <str name="contentType.regexp">^(text/html|application/xhtml).*</str>
  </processor>
  <processor class="no.uio.update.processor.URLClassifyProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="domainOutputField">host</str>
    <str name="canonicalUrlOutputField">canonicalurl</str>
  </processor>
  <processor class="no.uio.update.processor.RegexpBoostProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="inputField">url</str>
    <str name="boostField">urlboost</str>
    <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str>
  </processor>
  <processor class="no.uio.update.processor.StaticRankProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="inputField">canonicalurl</str>
    <str name="anchorField">anchortext</str>
    <str name="rankField">docrank</str>
    <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- the remainder of the chain is not shown on the slide -->
</updateRequestProcessorChain>

39. Donations back to Apache
• SOLR-2599: FieldCopyProcessor
• SOLR-2825: RegexReplaceProcessor
• SOLR-2826: URLClassifyProcessor
• SOLR-2827: RegexpBoostProcessor
• SOLR-2828: StaticRankProcessor
• Binary Document Dumper (?)
Many thanks for the donations!

42. Room for improvement?

43. Improvements
• Processors re-created for every request
• Duplication of config between chains
• No support for non-linear chains or sub chains
• Does not scale very well
• Lacks native scripting language support

44. Improvements
Pain: Potentially expensive initialization
• StaticRankProcessor: read & parse 50,000 lines
Proposed cure: Keep persistent state object in factory
• private final Map<Object,Object> sharedObjCache
• new StaticRankProcessor(params, request, response, nextProcessor, sharedObjCache);
• Processor uses sharedObjCache for state
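The proposed cure can be sketched as follows, assuming a simplified model rather than the real Solr classes: `StaticRankFactorySketch` and `StaticRankProcessorSketch` are hypothetical names, the rank table is a one-entry stand-in for the parsed file, and the `loads` counter exists only to show that the expensive step runs once. The field names `canonicalurl` and `docrank` are taken from the UiO config.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical factory: lives across requests and owns the shared cache. */
class StaticRankFactorySketch {
    private final Map<Object, Object> sharedObjCache = new ConcurrentHashMap<>();
    final AtomicInteger loads = new AtomicInteger(); // counts expensive loads, for demonstration

    /** Called once per update request; the new processor reuses the cached state. */
    StaticRankProcessorSketch getInstance() {
        return new StaticRankProcessorSketch(sharedObjCache, this::loadRanks);
    }

    // Stand-in for reading and parsing the large rank file.
    private Map<String, Integer> loadRanks() {
        loads.incrementAndGet();
        Map<String, Integer> ranks = new HashMap<>();
        ranks.put("http://example.com/", 10);
        return ranks;
    }
}

/** Hypothetical per-request processor: cheap to create because state comes from the cache. */
class StaticRankProcessorSketch {
    private final Map<String, Integer> ranks;

    @SuppressWarnings("unchecked")
    StaticRankProcessorSketch(Map<Object, Object> cache, Supplier<Map<String, Integer>> loader) {
        // Only the first processor triggers the load; later ones find the parsed table.
        this.ranks = (Map<String, Integer>) cache.computeIfAbsent("ranks", key -> loader.get());
    }

    void process(Map<String, Object> doc) {
        Integer rank = ranks.get(doc.get("canonicalurl"));
        if (rank != null) doc.put("docrank", rank);
    }
}
```

A concurrent map is used because several update requests may create processors at the same time.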

46. Improvements
Pain: Multiple chains often need identical processors
• UiO’s two chains share 80% -> copy/paste
Proposed cure: Allow sharing of named instances
• Define: <processor name="langid" class="..">
• Refer: <processor ref="langid" />
• See SOLR-2823

48. Improvements
Pain: Chains are linear only
• Hard to do branching, sub chains, conditional...
Proposed cure (SOLR-2841): New scriptable Update Chain - alternative to XML
• Script chain logic in solr/conf/updateproc.groovy
• Full flexibility:
  chain myChain {
    if(doc.getFieldValue("type").equals("pdf"))
      process(tikaproc)
  }
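The slide proposes a Groovy script for this. The same conditional idea can be sketched in plain Java, as an illustration only: `ConditionalChainSketch`, the `when` helper and the fake Tika step are assumptions and are not part of SOLR-2841.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

class ConditionalChainSketch {
    /** Wraps a step so that it only runs when the condition holds for the document. */
    static Consumer<Map<String, Object>> when(Predicate<Map<String, Object>> condition,
                                              Consumer<Map<String, Object>> step) {
        return doc -> {
            if (condition.test(doc)) step.accept(doc);
        };
    }

    // Stand-in for the step called "tikaproc" on the slide.
    static final Consumer<Map<String, Object>> TIKA_PROC =
            doc -> doc.put("cv_text", "extracted text");

    // Equivalent of: if (type == "pdf") process(tikaproc)
    static final Consumer<Map<String, Object>> MY_CHAIN =
            when(doc -> "pdf".equals(doc.get("type")), TIKA_PROC);

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("type", "pdf");
        MY_CHAIN.accept(doc);
        System.out.println(doc.containsKey("cv_text"));
    }
}
```

Writing the comparison as `"pdf".equals(...)` keeps the condition safe for documents that have no `type` field at all.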

50. Improvements
Pain: Single threaded
• Heavy processing not efficient
Proposed cure:
• Local: Use multi threaded update requests
• SolrCloud: Dedicated nodes, role=“processor” ?
• Wrap an external pipeline in UpdateProcessor
• Example: OpenPipelineUpdateProcessor ?
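For the local case, a feeding client could send several batches in parallel so that more than one chain instance is busy at a time. The sketch below shows only the threading pattern: `ParallelFeedSketch` and `sendBatch` are hypothetical, and `sendBatch` merely counts documents where a real client would post the batch to Solr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelFeedSketch {
    // Stand-in for posting one batch to Solr; returns the number of documents sent.
    static int sendBatch(List<String> batch) {
        return batch.size();
    }

    /** Submits every batch to a fixed thread pool and returns the total number of documents sent. */
    static int feed(List<List<String>> batches, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (List<String> batch : batches) {
                Callable<Integer> task = () -> sendBatch(batch);
                pending.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : pending) total += result.get();
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> batches = List.of(List.of("doc1", "doc2"), List.of("doc3"));
        System.out.println(feed(batches, 2));
    }
}
```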

52. Improvements
Pain: Not really a “problem” :-)
• Nice to write processors in Python, Groovy, JS...
Proposed cure:
• Now: Finish SOLR-1725: Script based Processor
• Later: Make scripts first-class processors
  <processor script="myScript.py" /> or <processor ref="myScript" />

53. One last thing...

54. New standalone framework?
• The UpdateChain is Solr specific
• Interest in a pure pipeline framework
• Search engine independent
• Scalable
• Rich pool of processors
• Several existing candidates
• Some initial thoughts: http://wiki.apache.org/solr/DocumentProcessing

55. Summary

56. Summary
• Document centric vs field centric processing
• UpdateChain is there - use it!
• Works well for most “light” cases
• Scaling issues, but caching config may help
• More processors welcome!

57. Questions? Jan Høydahl, Cominvent AS @cominvent www.cominvent.com

58. Extra

59. Alternative pipelines
• OpenPipeline (Dieselpoint)
• OpenPipe (T-Rank, now on GitHub)
• Pypes (ESR)
• UIMA (Apache)
• Eclipse SMILA
• Apache Commons Pipeline
• Piped (FoundIT, Norway)
• Behemoth (DigitalPebble)
• Findwise and TwigKit also have some technology

60. Calling out from UpdateChain
This is one way an external pipeline system can be integrated with Solr. The main benefit of this method is that you can continue to feed content with SolrJ, DIH or other Update Request Handlers.

61. Scaling with external pipeline
Here is a more advanced, distributed case, where one Solr node is dedicated to processing and the entry-point Solr node only dispatches the requests.
