Improving the Solr Update Chain - Jan Høydahl
What will I cover?Who is Jan Høydahl?Intro to Solr’s (hidden) UpdateChainHow to write your own UpdateProcessorsExample...
Jan Høydahl          1995: Developer telecom          1998: Java developer          2000: Search - FAST          2006: Luc...
Cominvent ASConsulting & support       www.solrtraining.com   Lucene/Solr       FAST                       6
Why document processing?Analysis is Field orientedFilters only see the “local” field                      7
Why document processing?But what if you want to:  Add or remove fields?  Make decisions based on other fields?We need a wa...
Why document processing?Doc1namepostcode                         programmercv_pdf_url                        near Barcelon...
Why document processing?Doc1namepostcode                          programmerlatlong                         near Barcelona...
Why document processing?   Client    Doc1     name     postcode     latlong     cv_pdf_url     cv_text                  11
Why document processing?Client Doc1  name  postcode     3rd party  latlong      pipeline  cv_pdf_url  cv_text             ...
Solr’s Update Chain        13
The Update Chain        14
The Update ChainDocnamepostcodecv_pdf_url                     15
The Update Chain              Postcode              ToLatLong              ProcessorDoc            Docname           namep...
The Update Chain              Postcode                             UrlFetcher              ToLatLong                      ...
The Update Chain              Postcode                        Tika                             UrlFetcher              ToL...
How it’s wiredChain definition in solrconfig.xml:Choose chain in your update request:.../solr/update/xml?..&update.chain=c...
Other examplesLanguage Identification          18
Other examplesCompany       The Apache Software Foundation       (ASF) is a non-profit corporation to       support Apache...
Writing your own processor            21
Writing your own processor            21
Writing your own processor             22
Writing your own processor             23
Writing your own processor•Make generic processors - parameterized•Use SchemaAware, SolrCoreAware and  ResourceLoaderAware...
Web crawl withLanguage Detection @ Oslo University        25
Solr @ Oslo University           26
Solr @ Oslo University     <?xml version="1.0"?><updateRequestProcessorChain name="web">  <processor class="solr.update.pr...
Solr @ Oslo University     <?xml version="1.0"?><updateRequestProcessorChain name="web">  <processor class="solr.update.pr...
Solr @ Oslo University     <?xml version="1.0"?><updateRequestProcessorChain name="web">  <processor class="solr.update.pr...
</processor><processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">  <bool name="enabled">true</bool>   ...
</processor><processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">  <bool name="enabled">true</bool>   ...
</processor><processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">  <bool name="enabled">true</bool>   ...
</processor><processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">  <bool name="enabled">true</bool>   ...
Donations back to ApacheSOLR-2599: FieldCopyProcessorSOLR-2825: RegexReplaceProcessorSOLR-2826: URLClassifyProcessorSOLR-2...
Room forimprovement?  32
ImprovementsProcessors re-created for every requestDuplication of config between chainsNo support for non-linear or sub ch...
ImprovementsPain:  Potentially expensive initialization  StaticRankProcessor: read&parse 50.000 linesProposed cure:  Keep ...
ImprovementsProcessors re-created for every requestDuplication of config between chainsNo support for non-linear or sub ch...
ImprovementsPain:  Multi chains often need identical Processors  UiO’s two chains share 80% -> copy/pasteProposed cure:  A...
ImprovementsProcessors re-created for every requestDuplication of config between chainsNo support for non-linear or sub ch...
ImprovementsPain:  Chains are linear only  Hard to do branching, sub chains, conditional...Proposed cure (SOLR-2841):  New...
ImprovementsProcessors re-created for every requestDuplication of config between chainsNo support for non-linear chains or...
ImprovementsPain:  Single threaded  Heavy processing not efficientProposed cure:  Local: Use multi threaded update request...
ImprovementsProcessors re-created for every requestDuplication of config between chainsNo support for non-linear chains or...
ImprovementsPain:  Not really a “problem” :-)  Nice to write processors in Python, Groovy, JS...Proposed cure:  Now: Finis...
One last thing...       44
New standalone framework?•The UpdateChain is Solr specific•Interest for a pure pipeline framework  •Search engine independ...
Summary   46
Summary•Document centric vs field centric processing•UpdateChain is there - use it!•Works well for most “light” cases•Scal...
Questions?Jan Høydahl, Cominvent AS@cominventwww.cominvent.com
Extra 49
Alternative pipelines   OpenPipeline (Dieselpoint)•OpenPipe (T-Rank, now on GitHub)•Pypes (ESR)•UIMA (Apache)•Eclipse SMIL...
Calling out from UpdateChainThis is one way anexternal pipelinesystem can beintegrated with Solr.The main benefit ofsuch a...
Scaling with external pipelineHere is a moreadvanced,distributedcase, where aSolr node isdedicated forprocessing, andthe e...
See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011

Solr features a little-known internal document processing pipeline called the UpdateRequestProcessorChain, or simply the UpdateChain.

In this talk we'll discuss the importance of document processing, when the UpdateChain works well, and what limitations it has. We'll then go on to propose a range of possible improvements.

Topics include:

Examples of use with demo
How to write your own UpdateProcessor, best practices
Example: Tika as an UpdateProcessor
A vision for future improvements

Published in: Technology, Business

Transcript of "Improving the Solr Update Chain - Jan Høydahl"

  1. Improving the Solr Update Chain - Jan Høydahl
  2. What will I cover?
     • Who is Jan Høydahl?
     • Intro to Solr’s (hidden) UpdateChain
     • How to write your own UpdateProcessors
     • Example: Web crawl @ Oslo University
     • A vision for future improvements
     • Conclusion
  3. Jan Høydahl
     • 1995: Developer telecom
     • 1998: Java developer
     • 2000: Search - FAST
     • 2006: Lucene
     • 2007: Cominvent
     • 2011: Lucene committer
     • > 100 projects
  4. Cominvent AS: Consulting & support, www.solrtraining.com, Lucene/Solr, FAST
  5. Why document processing?
     • Analysis is field oriented
     • Filters only see the “local” field
  6. Why document processing? But what if you want to:
     • Add or remove fields?
     • Make decisions based on other fields?
     We need a way to modify the Document
  7. Why document processing? [Diagram: Doc1 with fields name, postcode, cv_pdf_url; text “programmer near Barcelona”]
  8. Why document processing? [Diagram: Doc1 with fields name, postcode, latlong, cv_pdf_url, cv_text; text “programmer near Barcelona”]
  9. Why document processing? [Diagram: Client and Doc1 (name, postcode, latlong, cv_pdf_url, cv_text)]
  10. Why document processing? [Diagram: Client, 3rd party pipeline, and Doc1 (name, postcode, latlong, cv_pdf_url, cv_text)]
  11. Solr’s Update Chain
  12. The Update Chain
  13. The Update Chain [Diagram: Doc (name, postcode, cv_pdf_url)]
  14. The Update Chain [Diagram: Doc (name, postcode, cv_pdf_url) -> PostcodeToLatLongProcessor -> Doc (name, postcode, latlong, cv_pdf_url)]
  15. The Update Chain [Diagram: as on the previous slide, then -> UrlFetcherProcessor -> Doc (name, postcode, latlong, cv_pdf_url, cv_pdf_bin)]
  16. The Update Chain [Diagram: as on the previous slide, then -> TikaExtractingProcessor -> Doc (name, postcode, latlong, cv_pdf_url, cv_pdf_bin, cv_text)]
  17. How it’s wired
      • Chain definition in solrconfig.xml
      • Choose chain in your update request: .../solr/update/xml?..&update.chain=cv-chain
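
The slides above walk a document through a chain in which each processor adds fields and hands the document on, and the chain is chosen by name per request. The following plain-Java sketch only illustrates that flow and has no Solr dependency. `ChainSketch`, the three lambda processors and all field values are hypothetical stand-ins; in Solr itself the chain is declared in solrconfig.xml and the processors extend `UpdateRequestProcessor`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Simplified stand-in for an update chain; these are NOT Solr's actual classes.
class ChainSketch {
    // Chains by name, analogous to picking one with update.chain=<name>.
    private final Map<String, List<Consumer<Map<String, Object>>>> chains = new LinkedHashMap<>();

    void define(String name, List<Consumer<Map<String, Object>>> processors) {
        chains.put(name, processors);
    }

    // Runs the document through every processor of the named chain, in order.
    Map<String, Object> update(String chainName, Map<String, Object> doc) {
        List<Consumer<Map<String, Object>>> chain = chains.get(chainName);
        if (chain == null) {
            throw new IllegalArgumentException("Unknown chain: " + chainName);
        }
        for (Consumer<Map<String, Object>> processor : chain) {
            processor.accept(doc);
        }
        return doc;
    }

    // Builds a "cv-chain" like the one on the slides, with dummy processors.
    static ChainSketch cvExample() {
        List<Consumer<Map<String, Object>>> cv = new ArrayList<>();
        // Stand-in for the postcode-to-latlong step: a real one would do a lookup.
        cv.add(doc -> doc.put("latlong", "lookup(" + doc.get("postcode") + ")"));
        // Stand-in for the URL fetcher: a real one would download the PDF.
        cv.add(doc -> doc.put("cv_pdf_bin", "bytes-of(" + doc.get("cv_pdf_url") + ")"));
        // Stand-in for the Tika step: a real one would parse the binary.
        cv.add(doc -> doc.put("cv_text", "text-of(" + doc.get("cv_pdf_bin") + ")"));
        ChainSketch solr = new ChainSketch();
        solr.define("cv-chain", cv);
        return solr;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", "Doc1");
        doc.put("postcode", "0123"); // dummy value
        doc.put("cv_pdf_url", "http://example.com/cv.pdf"); // dummy value
        System.out.println(cvExample().update("cv-chain", doc));
    }
}
```

The order of the list matters: the Tika stand-in reads `cv_pdf_bin`, which only exists because the fetcher ran before it.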
  18. Other examples: Language Identification
  19. Other examples: Entity extraction (Company, Location, Date). Sample text: “The Apache Software Foundation (ASF) is a non-profit corporation to support Apache software projects. The ASF was formed from the Apache Group and incorporated in Delaware, U.S., in June 1999.”
  20. Writing your own processor
  24. Writing your own processor
      • Make generic processors - parameterized
      • Use SchemaAware, SolrCoreAware and ResourceLoaderAware interfaces
      • Prefix param names to avoid name clash
      • Testing and testable methods
      • Donate back to Apache & document on Wiki
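
Some of the practices on slide 24 (parameterized processors, prefixed parameter names, testable methods) can be illustrated without Solr on the classpath. The sketch below is a plain-Java stand-in, not a real Solr plugin: an actual processor would extend `UpdateRequestProcessor`, override `processAdd(AddUpdateCommand)`, and be created by an `UpdateRequestProcessorFactory`. The class name `RegexReplaceSketch` and the `regexreplace.*` parameter names are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Plain-Java stand-in for a configurable processor; NOT a real Solr plugin.
class RegexReplaceSketch {
    private final String[] fields;
    private final Pattern pattern;
    private final String replacement;

    // Generic and parameterized: behaviour comes from config, not hard-coded fields.
    // Parameter names carry a prefix so they cannot clash with other processors.
    RegexReplaceSketch(Map<String, String> params) {
        this.fields = require(params, "regexreplace.fl").split("[,\\s]+");
        this.pattern = Pattern.compile(require(params, "regexreplace.pattern"));
        this.replacement = params.getOrDefault("regexreplace.replacement", " ");
    }

    private static String require(Map<String, String> params, String key) {
        String value = params.get(key);
        if (value == null) {
            throw new IllegalArgumentException("Missing required parameter: " + key);
        }
        return value;
    }

    // Pure helper with no document or request state, so it is easy to unit test.
    static String replace(Pattern pattern, String replacement, String input) {
        return pattern.matcher(input).replaceAll(replacement);
    }

    // Rough counterpart of processAdd(): rewrites the configured fields of one document.
    void process(Map<String, Object> doc) {
        for (String field : fields) {
            Object value = doc.get(field);
            if (value instanceof String) {
                doc.put(field, replace(pattern, replacement, (String) value));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("regexreplace.fl", "content_no content_en");
        params.put("regexreplace.pattern", "\\s+");
        Map<String, Object> doc = new HashMap<>();
        doc.put("content_en", "hello \t  world");
        new RegexReplaceSketch(params).process(doc);
        System.out.println(doc.get("content_en"));
    }
}
```

Keeping the string manipulation in a static helper means the interesting logic can be tested without constructing a request, a response, or a chain.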
  25. Web crawl with Language Detection @ Oslo University
  26. Solr @ Oslo University
  27. Solr @ Oslo University
          <?xml version="1.0"?>
          <updateRequestProcessorChain name="web">
            <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
              <lst name="defaults">
                <str name="langid.fl">title,content</str>
                <str name="langid.idField">url</str>
                <str name="langid.fallbackFields"></str>
                <str name="langid.fallback">no</str>
                <str name="langid.whitelist">no,en</str>
                <str name="langid.langField">language</str>
                <str name="langid.langsField">languages</str>
                <bool name="langid.overwrite">true</bool>
                <bool name="langid.map">true</bool>
                <str name="langid.map.fl">title,content</str>
                <double name="langid.threshold">0.5</double>
              </lst>
            </processor>
            <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
              <bool name="enabled">true</bool>
              <str name="fl">content_no content_en</str>
              <str name="pattern">[\s\u00A0]+</str>
              <str name="replacement"> </str>
            </processor>
            <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
              <bool name="enabled">true</bool>
              <str name="title.fl">title_no,title_en</str>
              <str name="content.fl">content_no,content_en</str>
  30. Solr @ Oslo University
            </processor>
            <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
              <bool name="enabled">true</bool>
              <str name="fl">content_no content_en</str>
              <str name="pattern">[\s\u00A0]+</str>
              <str name="replacement"> </str>
            </processor>
            <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
              <bool name="enabled">true</bool>
              <str name="title.fl">title_no,title_en</str>
              <str name="content.fl">content_no,content_en</str>
              <str name="contentType.field">tika_Content-Type</str>
              <str name="contentType.regexp">^(text/html|application/xhtml).*</str>
            </processor>
            <processor class="no.uio.update.processor.URLClassifyProcessorFactory">
              <bool name="enabled">true</bool>
              <str name="domainOutputField">host</str>
              <str name="canonicalUrlOutputField">canonicalurl</str>
            </processor>
            <processor class="no.uio.update.processor.RegexpBoostProcessorFactory">
              <bool name="enabled">true</bool>
              <str name="inputField">url</str>
              <str name="boostField">urlboost</str>
              <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str>
            </processor>
            <processor class="no.uio.update.processor.StaticRankProcessorFactory">
              <bool name="enabled">true</bool>
              <str name="inputField">canonicalurl</str>
              <str name="anchorField">anchortext</str>
              <str name="rankField">docrank</str>
              <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str>
            </processor>
            <processor class="solr.LogUpdateProcessorFactory"/>
  34. Donations back to Apache
      • SOLR-2599: FieldCopyProcessor
      • SOLR-2825: RegexReplaceProcessor
      • SOLR-2826: URLClassifyProcessor
      • SOLR-2827: RegexpBoostProcessor
      • SOLR-2828: StaticRankProcessor
      • Binary Document Dumper (?)
      Many thanks for the donations!
  35. Room for improvement?
  36. Improvements
      • Processors re-created for every request
      • Duplication of config between chains
      • No support for non-linear or sub chains
      • Does not scale very well
      • Lack native scripting language support
  37. Improvements
      Pain:
      • Potentially expensive initialization
      • StaticRankProcessor: read & parse 50,000 lines
      Proposed cure:
      • Keep persistent state object in factory:
          private final Map<Object,Object> sharedObjCache
          new StaticRankProcessor(params, request, response, nextProcessor, sharedObjCache);
      • Processor uses sharedObjCache for state
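
The proposed cure on slide 37 keeps expensive state in the long-lived factory and hands it to each short-lived processor. A minimal plain-Java sketch of that idea follows. It is not Solr's factory API: `StaticRankFactorySketch`, `StaticRankProcessorSketch`, the `"ranks"` cache key and the one-entry rank table are hypothetical, and the `loads` counter exists only to make the caching visible.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Plain-Java stand-in for the proposed pattern; NOT Solr's actual factory API.
class StaticRankFactorySketch {
    // Lives as long as the factory, so it survives across requests.
    private final Map<Object, Object> sharedObjCache = new ConcurrentHashMap<>();
    // Counts expensive loads, only to make the effect observable.
    final AtomicInteger loads = new AtomicInteger();

    // Called once per request, much like getInstance() on a processor factory.
    StaticRankProcessorSketch newProcessor() {
        return new StaticRankProcessorSketch(sharedObjCache, this::loadRanks);
    }

    // Stand-in for reading and parsing a large rank file.
    private Map<String, Integer> loadRanks() {
        loads.incrementAndGet();
        Map<String, Integer> ranks = new HashMap<>();
        ranks.put("http://example.com/", 10); // dummy entry
        return ranks;
    }
}

class StaticRankProcessorSketch {
    private final Map<Object, Object> sharedObjCache;
    private final Supplier<Map<String, Integer>> loader;

    StaticRankProcessorSketch(Map<Object, Object> sharedObjCache,
                              Supplier<Map<String, Integer>> loader) {
        this.sharedObjCache = sharedObjCache;
        this.loader = loader;
    }

    @SuppressWarnings("unchecked")
    void process(Map<String, Object> doc) {
        // The first processor to get here loads the table; later ones reuse it.
        Map<String, Integer> ranks = (Map<String, Integer>)
                sharedObjCache.computeIfAbsent("ranks", key -> loader.get());
        Integer rank = ranks.get(doc.get("canonicalurl"));
        doc.put("docrank", rank == null ? 0 : rank);
    }

    public static void main(String[] args) {
        StaticRankFactorySketch factory = new StaticRankFactorySketch();
        for (int request = 0; request < 3; request++) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("canonicalurl", "http://example.com/");
            factory.newProcessor().process(doc);
        }
        System.out.println("loads: " + factory.loads.get());
    }
}
```

A `ConcurrentHashMap` is used here because several requests may create processors at the same time; its `computeIfAbsent` runs the loader at most once per key.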
  39. Improvements
      Pain:
      • Multiple chains often need identical Processors
      • UiO’s two chains share 80% -> copy/paste
      Proposed cure:
      • Allow sharing of named instances
      • Define: <processor name="langid" class="..">
      • Refer: <processor ref="langid" />
      • See SOLR-2823
  41. Improvements
      Pain:
      • Chains are linear only
      • Hard to do branching, sub chains, conditional...
      Proposed cure (SOLR-2841):
      • New scriptable Update Chain - alternative to XML
      • Script chain logic in solr/conf/updateproc.groovy
      • Full flexibility:
          chain myChain {
            if(doc.getFieldValue("type").equals("pdf"))
              process(tikaproc)
          }
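
The Groovy fragment on slide 41 makes one step conditional on a field value. The same branching idea can be sketched in plain Java, shown below purely as an illustration. It is not the SOLR-2841 design: `BranchingChainSketch`, its `when` and `process` methods, and the dummy `tikaproc` lambda are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Plain-Java illustration of a non-linear chain; NOT the SOLR-2841 design itself.
class BranchingChainSketch {
    private final List<Consumer<Map<String, Object>>> steps = new ArrayList<>();

    // Unconditional step, as in a purely linear chain.
    BranchingChainSketch process(Consumer<Map<String, Object>> processor) {
        steps.add(processor);
        return this;
    }

    // Conditional step: the processor only runs when the predicate matches.
    BranchingChainSketch when(Predicate<Map<String, Object>> condition,
                              Consumer<Map<String, Object>> processor) {
        steps.add(doc -> {
            if (condition.test(doc)) {
                processor.accept(doc);
            }
        });
        return this;
    }

    void run(Map<String, Object> doc) {
        for (Consumer<Map<String, Object>> step : steps) {
            step.accept(doc);
        }
    }

    // Mirrors the slide's example: only PDF documents reach the Tika stand-in.
    static BranchingChainSketch myChain() {
        Consumer<Map<String, Object>> tikaproc = doc -> doc.put("cv_text", "extracted text");
        return new BranchingChainSketch()
                .when(doc -> "pdf".equals(doc.get("type")), tikaproc)
                .process(doc -> doc.put("processed", true));
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("type", "pdf");
        myChain().run(doc);
        System.out.println(doc.containsKey("cv_text"));
    }
}
```

Writing the comparison as `"pdf".equals(...)` avoids a null pointer when a document has no `type` field, which the slide's `doc.getFieldValue("type").equals("pdf")` form would not.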
  43. Improvements
      Pain:
      • Single threaded
      • Heavy processing not efficient
      Proposed cure:
      • Local: Use multi threaded update requests
      • SolrCloud: Dedicated nodes, role=“processor”?
      • Wrap an external pipeline in UpdateProcessor
      • Example: OpenPipelineUpdateProcessor?
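
The "local" cure on slide 43 is to send update requests from several threads so that heavy per-document processing overlaps. The sketch below shows one way a feeding client could do that with a fixed thread pool. It is an assumption-laden illustration: `ParallelFeedSketch` is a hypothetical name, and the `chain` argument is a plain function standing in for what would really be an update request sent to Solr.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Client-side illustration only; the "chain" stands in for a real update request.
class ParallelFeedSketch {
    // Submits one task per document to a fixed pool and waits for all of them.
    static void feed(List<Map<String, Object>> docs,
                     Consumer<Map<String, Object>> chain,
                     int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Map<String, Object> doc : docs) {
                pending.add(pool.submit(() -> chain.accept(doc)));
            }
            for (Future<?> result : pending) {
                result.get(); // blocks until this document is done
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while feeding", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A document failed processing", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("id", i);
            docs.add(doc);
        }
        feed(docs, doc -> doc.put("heavy", "done-" + doc.get("id")), 4);
        System.out.println(docs.size() + " documents processed");
    }
}
```

Each document is handled by exactly one task, so the documents themselves need no locking; any state shared between tasks would.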
  45. Improvements
      Pain:
      • Not really a “problem” :-)
      • Nice to write processors in Python, Groovy, JS...
      Proposed cure:
      • Now: Finish SOLR-1725: Script based Processor
      • Later: Make scripts first-class processors: <processor script="myScript.py" /> or <processor ref="myScript" />
  46. One last thing...
  47. New standalone framework?
      • The UpdateChain is Solr specific
      • Interest for a pure pipeline framework
        • Search engine independent
        • Scalable
        • Rich pool of processors
      • Several existing candidates
      • Some initial thoughts: http://wiki.apache.org/solr/DocumentProcessing
  48. Summary
  49. Summary
      • Document centric vs field centric processing
      • UpdateChain is there - use it!
      • Works well for most “light” cases
      • Scaling issues, but caching config may help
      • More processors welcome!
  50. Questions? Jan Høydahl, Cominvent AS, @cominvent, www.cominvent.com
  51. Extra
  52. Alternative pipelines
      • OpenPipeline (Dieselpoint)
      • OpenPipe (T-Rank, now on GitHub)
      • Pypes (ESR)
      • UIMA (Apache)
      • Eclipse SMILA
      • Apache commons pipeline
      • Piped (FoundIT, Norway)
      • Behemoth (DigitalPebble)
      • FindWise and TwigKit also have some technology
  53. Calling out from UpdateChain: This is one way an external pipeline system can be integrated with Solr. The main benefit of such a method is that you can continue to feed content with SolrJ, DIH or other UpdateRequest Handlers.
  54. Scaling with external pipeline: Here is a more advanced, distributed case, where a Solr node is dedicated to processing and the entry point Solr only dispatches the requests.