Lucene for Solr Developers

You’re Solr powered, and needing to customize its capabilities. Apache Solr is flexibly architected, with practically everything pluggable. Under the hood, Solr is driven by the well-known Apache Lucene. Lucene for Solr Developers will guide you through the various ways in which Solr can be extended, customized, and enhanced with a bit of Lucene API know-how. We’ll delve into improving analysis with custom character mapping, tokenizing, and token filtering extensions; show why and how to implement specialized query parsing, and how to add your own search and update request handling.

Published in: Technology

  1. Lucene for Solr Developers, by Erik Hatcher
  2. Abstract: You’re Solr powered, and needing to customize its capabilities. Apache Solr is flexibly architected, with practically everything pluggable. Under the hood, Solr is driven by the well-known Apache Lucene. Lucene for Solr Developers will guide you through the various ways in which Solr can be extended, customized, and enhanced with a bit of Lucene API know-how. We’ll delve into improving analysis with custom character mapping, tokenizing, and token filtering extensions; show why and how to implement specialized query parsing, and how to add your own search and update request handling.
  3. About me...
     • Co-author, “Lucene in Action”
     • Committer, Lucene and Solr
     • Lucene PMC and ASF member
     • Member of Technical Staff / co-founder, Lucid Imagination
  4. ... works search platform (www.lucidimagination.com)
  5. What is Lucene?
     • An open source search library (not an application)
     • 100% Java
     • Continuously improved and tuned over more than 10 years
     • Compact, portable index representation
     • Programmable text analyzers, spell checking and highlighting
     • Not a crawler or a text extraction tool
  6. Inverted Index
     • Lucene stores input data in what is known as an inverted index
     • In an inverted index, each indexed term points to a list of documents that contain the term
     • Similar to the index provided at the end of a book
     • In this case "inverted" simply means that the list of terms points to documents
     • It is much faster to find a term in an index than to scan all the documents
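The term-to-documents idea on the Inverted Index slide can be sketched in a few lines of plain Java. This is a toy, not Lucene's data structure: the class name `ToyInvertedIndex` and its whitespace/lowercase "analysis" are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index: each term maps to the sorted ids of documents containing it. */
class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Splits the text on whitespace, lowercases it, and records docId under every term. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // blank input produces one empty string; skip it
            }
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Looks the term up directly instead of scanning every document. */
    SortedSet<Integer> lookup(String term) {
        SortedSet<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? new TreeSet<Integer>() : new TreeSet<Integer>(docs);
    }
}
```

Adding "Big Blue Shoes" as document 1 and "Cool blue gizmo" as document 2 makes `lookup("blue")` return both ids, while `lookup("gizmo")` returns only document 2.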
  7. Inverted Index Example
  8. Segments and Merging
     • A Lucene index is a collection of one or more sub-indexes called segments
     • Each segment is a fully independent index
     • A multi-way merge algorithm is used to periodically merge segments
     • New segments are created when an IndexWriter flushes new documents and pending deletes to disk
     • Aims to balance large-scale performance against small-scale updates
     • Optimization merges all segments into one
  9. Segments and Merging
  10. Segments
      • When a document is deleted, it still exists in an index segment until that segment is merged
      • At certain trigger points, newly added documents (and pending deletes) are flushed to the Directory
      • A flush can be forced by calling commit
      • Segments are periodically merged
  11. IndexSearcher
  12. Adding new documents
  13. Commit
  14. Committed and Warmed
  15. Lucene Scoring
      • Lucene uses a similarity scoring formula to rank results by measuring the similarity between a query and the documents that match the query. The factors that form the scoring formula are:
        • Term Frequency: tf (t in d). How often the term occurs in the document.
        • Inverse Document Frequency: idf (t). A measure of how rare the term is in the whole collection. One over the number of times the term appears in the collection.
        • Terms that are rare throughout the entire collection score higher.
  16. Coord and Norms
      • Coord: The coordination factor, coord (q, d). Boosts documents that match more of the search terms than other documents.
        • If 4 of 4 terms match, coord = 4/4
        • If 3 of 4 terms match, coord = 3/4
      • Length Normalization - Adjusts the score based on the length of fields in the document.
        • Shorter fields that match get a boost
  17. Scoring Factors (cont)
      • Boost: (t.field in d). A way to boost a field or a whole document above others.
      • Query Norm: (q). Normalization value for a query, given the sum of the squared weights of each of the query terms.
      • You will often hear the Lucene scoring simply referred to as TF·IDF.
  18. Explanation
      • Lucene has a feature called Explanation
      • Solr uses the debugQuery parameter to retrieve scoring explanations

          0.2987913 = (MATCH) fieldWeight(text:lucen in 688), product of:
            1.4142135 = tf(termFreq(text:lucen)=2)
            9.014501 = idf(docFreq=3, maxDocs=12098)
            0.0234375 = fieldNorm(field=text, doc=688)
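The numbers in the explanation above can be reproduced by hand. The sketch below assumes the classic default TF-IDF formulas, tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)); these match the values printed on the slide, but a custom Similarity may compute them differently. The class name `ExplainSketch` is hypothetical.

```java
/** Recomputes the fieldWeight from the explanation using the classic TF-IDF formulas. */
class ExplainSketch {
    /** Term frequency factor: square root of the raw count. */
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    /** Inverse document frequency: rarer terms get a larger value. */
    static double idf(int docFreq, int maxDocs) {
        return Math.log(maxDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** fieldWeight is the product of tf, idf and the stored field norm. */
    static double fieldWeight(int termFreq, int docFreq, int maxDocs, double fieldNorm) {
        return tf(termFreq) * idf(docFreq, maxDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        // Inputs taken from the explanation: termFreq=2, docFreq=3, maxDocs=12098, fieldNorm=0.0234375
        System.out.println(fieldWeight(2, 3, 12098, 0.0234375));
    }
}
```

With the slide's inputs the product comes out at roughly 0.2988, in line with the 0.2987913 that Solr reports (Lucene computes in single precision, so the last digits can differ).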
  19. Lucene Core
      • IndexWriter
      • Directory
      • IndexReader, IndexSearcher
      • analysis: Analyzer, TokenStream, Tokenizer, TokenFilter
      • Query
  20. Solr Architecture
  21. Customizing - Don't do it!
      • Unless you need to.
      • In other words... ensure you've given the built-in capabilities a try, asked on the e-mail list, and spelunked into at least Solr's code a bit to make some sense of the situation.
      • But we're here to roll up our sleeves, because we need to...
  22. But first...
      • Look at Lucene and/or Solr source code as appropriate
      • Carefully read javadocs and wiki pages - lots of tips there
      • And, hey, search for what you're trying to do...
        • Google, of course
        • But try out LucidFind and other Lucene ecosystem specific search systems - http://www.lucidimagination.com/search/
  23. Extension points
      • Tokenizer, TokenFilter, CharFilter
      • SearchComponent
      • RequestHandler
      • ResponseWriter
      • FieldType
      • Similarity
      • QParser
      • DataImportHandler hooks
        • data sources
        • entity processors
        • transformers
      • several others
  24. Factories
      • FooFactory (most) everywhere. Sometimes there's BarPlugin style
      • For the sake of discussion... let's just skip the "factory" part
      • In Solr, Factories and Plugins are used by configuration loading to parameterize and construct components
  25. "Installing" plugins
      • Compile .java to .class, JAR it up
      • Put JAR files in either:
        • <solr-home>/lib
        • a shared lib when using multicore
        • anywhere, and register location in solrconfig.xml
      • Hook in plugins as appropriate
  26. Multicore sharedLib

          <solr sharedLib="/usr/local/solr/customlib" persistent="true">
            <cores adminPath="/admin/cores">
              <core instanceDir="core1" name="core1"/>
              <core instanceDir="core2" name="core2"/>
            </cores>
          </solr>

  27. Plugins via solrconfig.xml
      • <lib dir="/path/to/your/custom/jars" />
  28. Analysis
      • CharFilter
      • Tokenizer
      • TokenFilter
  29. Primer
      • Tokens, Terms
      • Attributes: Type, Payloads, Offsets, Positions, Term Vectors
      • part of the picture:
  30. Version
      • enum:
        • Version.LUCENE_31, Version.LUCENE_32, etc
      • Version.onOrAfter(Version other)
  31. CharFilter
      • extend BaseCharFilter
      • enables pre-tokenization filtering/morphing of incoming field value
      • only affects tokenization, not stored value
      • Built-in CharFilters: HTMLStripCharFilter, PatternReplaceCharFilter, and MappingCharFilter
  32. Tokenizer
      • common to extend CharTokenizer
      • implement -
        • protected abstract boolean isTokenChar(int c);
      • optionally override -
        • protected int normalize(int c)
      • extend Tokenizer directly for finer control
      • Popular built-in Tokenizers include: WhitespaceTokenizer, StandardTokenizer, PatternTokenizer, KeywordTokenizer, ICUTokenizer
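To show what the isTokenChar / normalize contract on the Tokenizer slide amounts to, here is a stand-alone sketch that mimics it without depending on Lucene. `ToyCharTokenizer` and `LowerCaseAlnumTokenizer` are hypothetical names; a real CharTokenizer works on a Reader and emits attributes rather than returning a list of strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the CharTokenizer contract: subclasses decide which code points belong to a token. */
abstract class ToyCharTokenizer {
    /** Returns true if the code point is part of a token; anything else is a separator. */
    protected abstract boolean isTokenChar(int c);

    /** Optional hook to rewrite a code point before it is added to the token. */
    protected int normalize(int c) {
        return c;
    }

    /** Walks the input by code point and cuts a token at every non-token character. */
    List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int c = input.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar(c)) {
                current.appendCodePoint(normalize(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}

/** Example subclass: letters and digits form tokens, and every character is lowercased. */
class LowerCaseAlnumTokenizer extends ToyCharTokenizer {
    @Override
    protected boolean isTokenChar(int c) {
        return Character.isLetterOrDigit(c);
    }

    @Override
    protected int normalize(int c) {
        return Character.toLowerCase(c);
    }
}
```

Feeding "Big-Blue Shoes, 16GB!" to `LowerCaseAlnumTokenizer` yields the tokens big, blue, shoes and 16gb.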
  33. TokenFilter
      • a TokenStream whose input is another TokenStream
      • Popular TokenFilters include: LowerCaseFilter, CommonGramsFilter, SnowballFilter, StopFilter, WordDelimiterFilter
  34. Lucene's analysis APIs
      • tricky business, what with Attributes (Source/Factory's), State, characters, code points, Version, etc...
      • Test!!!
        • BaseTokenStreamTestCase
        • Look at Lucene and Solr's test cases
  35. Solr's Analysis Tools
      • Admin analysis tool
      • Field analysis request handler
      • DEMO
  36. Query Parsing
      • String -> org.apache.lucene.search.Query
  37. QParserPlugin

          public abstract class QParserPlugin implements NamedListInitializedPlugin {
            public abstract QParser createParser(String qstr, SolrParams localParams,
                                                 SolrParams params, SolrQueryRequest req);
          }

  38. QParser

          public abstract class QParser {
            public abstract Query parse() throws ParseException;
          }

  39. Built-in QParsers, from QParserPlugin.java

          /** internal use - name to class mappings of builtin parsers */
          public static final Object[] standardPlugins = {
            LuceneQParserPlugin.NAME, LuceneQParserPlugin.class,
            OldLuceneQParserPlugin.NAME, OldLuceneQParserPlugin.class,
            FunctionQParserPlugin.NAME, FunctionQParserPlugin.class,
            PrefixQParserPlugin.NAME, PrefixQParserPlugin.class,
            BoostQParserPlugin.NAME, BoostQParserPlugin.class,
            DisMaxQParserPlugin.NAME, DisMaxQParserPlugin.class,
            ExtendedDismaxQParserPlugin.NAME, ExtendedDismaxQParserPlugin.class,
            FieldQParserPlugin.NAME, FieldQParserPlugin.class,
            RawQParserPlugin.NAME, RawQParserPlugin.class,
            TermQParserPlugin.NAME, TermQParserPlugin.class,
            NestedQParserPlugin.NAME, NestedQParserPlugin.class,
            FunctionRangeQParserPlugin.NAME, FunctionRangeQParserPlugin.class,
            SpatialFilterQParserPlugin.NAME, SpatialFilterQParserPlugin.class,
            SpatialBoxQParserPlugin.NAME, SpatialBoxQParserPlugin.class,
            JoinQParserPlugin.NAME, JoinQParserPlugin.class,
          };

  40. Local Parameters
      • {!qparser_name param=value}expression
        • or
      • {!qparser_name param=value v=expression}
      • Can substitute $references from request parameters
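As a rough illustration of the local-parameter syntax on the slide above, the sketch below pulls the parser name, the key=value pairs and the expression out of a query string, and resolves $references against a map of request parameters. It is a simplification and not Solr's parser: `LocalParamsSketch` is a hypothetical class, and quoted values, escaping and nested references are deliberately ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified reading of {!qparser_name param=value}expression; no quoting or escaping support. */
class LocalParamsSketch {
    final String parserName;
    final Map<String, String> params = new LinkedHashMap<>();

    LocalParamsSketch(String q, Map<String, String> requestParams) {
        if (!q.startsWith("{!")) {
            throw new IllegalArgumentException("query has no local params prefix");
        }
        int end = q.indexOf('}');
        if (end < 0) {
            throw new IllegalArgumentException("unterminated local params");
        }
        String[] parts = q.substring(2, end).trim().split("\\s+");
        parserName = parts[0];
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq < 0) {
                continue; // this sketch skips anything that is not key=value
            }
            String value = parts[i].substring(eq + 1);
            if (value.startsWith("$")) {
                // $name is replaced by the request parameter of that name (null if absent)
                value = requestParams.get(value.substring(1));
            }
            params.put(parts[i].substring(0, eq), value);
        }
        // Without an explicit v=..., the text after the closing brace is the expression
        if (!params.containsKey("v")) {
            params.put("v", q.substring(end + 1));
        }
    }
}
```

For `{!term f=id v=$id}` with a request parameter id=FOO37, this gives parser name term, f=id and v=FOO37, which is the substitution the Param Substitution slide relies on.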
  41. Param Substitution
      solrconfig.xml:

          <requestHandler name="/document" class="solr.SearchHandler">
            <lst name="invariants">
              <str name="q">{!term f=id v=$id}</str>
            </lst>
          </requestHandler>

      Solr request:

          http://localhost:8983/solr/document?id=FOO37

  42. Custom QParser
      • Implement a QParserPlugin that creates your custom QParser
      • Register in solrconfig.xml
        • <queryParser name="myparser" class="com.mycompany.MyQParserPlugin"/>
  43. Update Processor
      • Responsible for handling these commands:
        • add/update
        • delete
        • commit
        • merge indexes
  44. Built-in Update Processors
      • RunUpdateProcessor
        • Actually performs the operations, such as adding the documents to the index
      • LogUpdateProcessor
        • Logs each operation
      • SignatureUpdateProcessor
        • duplicate detection and optionally rejection
  45. UIMA Update Processor
      • UIMA - Unstructured Information Management Architecture - http://uima.apache.org/
      • Enables UIMA components to augment documents
      • Entity extraction, automated categorization, language detection, etc
      • "contrib" plugin
      • http://wiki.apache.org/solr/SolrUIMA
  46. Update Processor Chain
      • UpdateProcessors sequence into a chain
      • Each processor can abort the entire update or hand processing to the next processor in the chain
      • Chains, of update processor factories, are specified in solrconfig.xml
      • Update requests can specify an update.processor parameter
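The "abort or hand to the next processor" behaviour of a chain can be sketched without Solr on the classpath. All class names below (`ToyProcessor` and its subclasses) are hypothetical stand-ins; the real UpdateRequestProcessor works on AddUpdateCommand objects rather than strings.

```java
import java.util.List;

/** Minimal chain link: the default behaviour is to pass the document to the next processor. */
abstract class ToyProcessor {
    private final ToyProcessor next;

    ToyProcessor(ToyProcessor next) {
        this.next = next;
    }

    void processAdd(List<String> log, String doc) {
        if (next != null) {
            next.processAdd(log, doc);
        }
    }
}

/** Aborts the whole update by throwing, so later processors never see the document. */
class ToyRejectEmptyProcessor extends ToyProcessor {
    ToyRejectEmptyProcessor(ToyProcessor next) {
        super(next);
    }

    @Override
    void processAdd(List<String> log, String doc) {
        if (doc.isEmpty()) {
            throw new IllegalArgumentException("empty document rejected");
        }
        super.processAdd(log, doc);
    }
}

/** Records the operation, then hands it on. */
class ToyLogProcessor extends ToyProcessor {
    ToyLogProcessor(ToyProcessor next) {
        super(next);
    }

    @Override
    void processAdd(List<String> log, String doc) {
        log.add("log:" + doc);
        super.processAdd(log, doc);
    }
}

/** Stands in for the step that actually performs the add. */
class ToyRunProcessor extends ToyProcessor {
    ToyRunProcessor(ToyProcessor next) {
        super(next);
    }

    @Override
    void processAdd(List<String> log, String doc) {
        log.add("index:" + doc);
        super.processAdd(log, doc);
    }
}
```

Wiring them as reject, then log, then run means a normal document produces a log entry followed by an index entry, while an empty document throws before either of the later processors is reached.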
  47. Default update processor chain
      From SolrCore.java:

          // construct the default chain
          UpdateRequestProcessorFactory[] factories = new UpdateRequestProcessorFactory[]{
            new RunUpdateProcessorFactory(),
            new LogUpdateProcessorFactory()
          };

      Note: these steps have been swapped on trunk recently
  48. Example Update Processor
      • What are the best facets to show for a particular query? Wouldn't it be nice to see the distribution of document "attributes" represented across a result set?
      • Learned this trick from the Smithsonian, who were doing it manually - add an indexed field containing the field names of the interesting other fields on the document.
      • Facet on that field "of field names" initially, then request facets on the top values returned.
  49. Config for custom update processor

          <updateRequestProcessorChain name="fields_used" default="true">
            <processor class="solr.processor.FieldsUsedUpdateProcessorFactory">
              <str name="fieldsUsedFieldName">attribute_fields</str>
              <str name="fieldNameRegex">.*_attribute</str>
            </processor>
            <processor class="solr.LogUpdateProcessorFactory" />
            <processor class="solr.RunUpdateProcessorFactory" />
          </updateRequestProcessorChain>

  50. FieldsUsedUpdateProcessorFactory

          public class FieldsUsedUpdateProcessorFactory extends UpdateRequestProcessorFactory {
            private String fieldsUsedFieldName;
            private Pattern fieldNamePattern;

            public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                                      UpdateRequestProcessor next) {
              return new FieldsUsedUpdateProcessor(req, rsp, this, next);
            }

            // ... next slide ...
          }

  51. FieldsUsedUpdateProcessorFactory

          @Override
          public void init(NamedList args) {
            if (args == null) return;

            SolrParams params = SolrParams.toSolrParams(args);

            fieldsUsedFieldName = params.get("fieldsUsedFieldName");
            if (fieldsUsedFieldName == null) {
              throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
                  "fieldsUsedFieldName must be specified");
            }
            // TODO check that fieldsUsedFieldName is a valid field name and multiValued

            String fieldNameRegex = params.get("fieldNameRegex");
            if (fieldNameRegex == null) {
              throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
                  "fieldNameRegex must be specified");
            }
            fieldNamePattern = Pattern.compile(fieldNameRegex);

            super.init(args);
          }

  52. FieldsUsedUpdateProcessor

          class FieldsUsedUpdateProcessor extends UpdateRequestProcessor {
            public FieldsUsedUpdateProcessor(SolrQueryRequest req, SolrQueryResponse rsp,
                                             FieldsUsedUpdateProcessorFactory factory,
                                             UpdateRequestProcessor next) {
              super(next);
            }

            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
              SolrInputDocument doc = cmd.getSolrInputDocument();
              Collection<String> incomingFieldNames = doc.getFieldNames();
              Iterator<String> iterator = incomingFieldNames.iterator();
              ArrayList<String> usedFields = new ArrayList<String>();
              while (iterator.hasNext()) {
                String f = iterator.next();
                if (fieldNamePattern.matcher(f).matches()) {
                  usedFields.add(f);
                }
              }
              doc.addField(fieldsUsedFieldName, usedFields.toArray());
              super.processAdd(cmd);
            }
          }
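The heart of processAdd above is a regex filter over the incoming field names. Isolated from the Solr types, it can be tried out on its own; `FieldsUsedSketch` is a hypothetical helper and is not part of the slide code.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.regex.Pattern;

/** The field-name filter from processAdd, with the Solr document replaced by a plain collection. */
class FieldsUsedSketch {
    /** Returns the field names that fully match the pattern, in their original order. */
    static List<String> usedFields(Collection<String> fieldNames, Pattern pattern) {
        List<String> used = new ArrayList<>();
        for (String f : fieldNames) {
            if (pattern.matcher(f).matches()) {
                used.add(f);
            }
        }
        return used;
    }
}
```

With the configured regex `.*_attribute`, a document holding id, name, size_attribute and color_attribute yields size_attribute and color_attribute, which is what ends up in the attribute_fields field.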
  53. FieldsUsedUpdateProcessor in action
      schema.xml:

          <dynamicField name="*_attribute" type="string" indexed="true" stored="true" multiValued="true"/>

      Add some documents:

          solr.add([{:id=>1, :name => "Big Blue Shoes", :size_attribute => "L", :color_attribute => "Blue"},
                    {:id=>2, :name => "Cool Gizmo", :memory_attribute => "16GB", :color_attribute => "White"}])
          solr.commit

      Facet on attribute_fields - http://localhost:8983/solr/select?q=*:*&facet=on&facet.field=attribute_fields&wt=json&indent=on

          "facet_fields":{
            "attribute_fields":[
              "color_attribute",2,
              "memory_attribute",1,
              "size_attribute",1]}

  54. Search Components
      • Built-in: Clustering, Debug, Facet, Highlight, MoreLikeThis, Query, QueryElevation, SpellCheck, Stats, TermVector, Terms
      • Non-distributed API:
        • prepare(ResponseBuilder rb)
        • process(ResponseBuilder rb)
  55. Example - auto facet select
      • It sure would be nice if you could have Solr automatically select field(s) for faceting, based dynamically on the profile of the results. For example, you're indexing disparate types of products, all with varying attributes (color, size - like for apparel, memory_size - for electronics, subject - for books, etc), and a user searches for "ipod", where most matching products have color and memory_size attributes... let's automatically facet on those fields.
      • https://issues.apache.org/jira/browse/SOLR-2641
  56. AutoFacetSelection Component
      • Too much code for a slide, let's take a look in an IDE...
      • Basically -
        • process() gets autofacet.field and autofacet.n request params, facets on field, takes top N values, sets those as facet.fields
        • Gotcha - need to call rb.setNeedDocSet(true) in prepare() as faceting needs it
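The "take the top N values" step described above can be sketched in isolation. This is not the component's code (the slide defers that to an IDE): `AutoFacetSketch`, its tie-breaking by name and the plain map of counts are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Picks the N most frequent values of the "field of field names" facet. */
class AutoFacetSketch {
    /** Sorts by descending count, breaking ties by name, and returns the first n keys. */
    static List<String> topFields(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

Given the attribute_fields counts from the earlier example (color_attribute 2, memory_attribute 1, size_attribute 1) and n = 2, the selected fields are color_attribute and memory_attribute; the component would then set those as facet.field values.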
  57. SearchComponent config

          <searchComponent name="autofacet" class="solr.AutoFacetSelectionComponent"/>
          <requestHandler name="/searchplus" class="solr.SearchHandler">
            <arr name="components">
              <str>query</str>
              <str>autofacet</str>
              <str>facet</str>
              <str>debug</str>
            </arr>
          </requestHandler>

  58. autofacet success
      http://localhost:8983/solr/searchplus?q=*:*&facet=on&autofacet.field=attribute_fields&wt=json&indent=on

          {
            "response":{"numFound":2,"start":0,"docs":[
                {
                  "size_attribute":["L"],
                  "color_attribute":["Blue"],
                  "name":"Big Blue Shoes",
                  "id":"1",
                  "attribute_fields":["size_attribute", "color_attribute"]},
                {
                  "color_attribute":["White"],
                  "name":"Cool Gizmo",
                  "memory_attribute":["16GB"],
                  "id":"2",
                  "attribute_fields":["color_attribute", "memory_attribute"]}]
            },
            "facet_counts":{
              "facet_queries":{},
              "facet_fields":{
                "color_attribute":[
                  "Blue",1,
                  "White",1],
                "memory_attribute":[
                  "16GB",1]}}}

  59. Distributed-aware SearchComponents
      • SearchComponent has a few distributed mode methods:
        • distributedProcess(ResponseBuilder)
        • modifyRequest(ResponseBuilder rb, SearchComponent who, ShardRequest sreq)
        • handleResponses(ResponseBuilder rb, ShardRequest sreq)
        • finishStage(ResponseBuilder rb)
  60. Testing
      • AbstractSolrTestCase
      • SolrTestCaseJ4
      • SolrMeter
        • http://code.google.com/p/solrmeter/
  61. For more information...
      • http://www.lucidimagination.com
      • LucidFind
        • search Lucene ecosystem: mailing lists, wikis, JIRA, etc
        • http://search.lucidimagination.com
      • Getting started with LucidWorks Enterprise:
        • http://www.lucidimagination.com/products/lucidworks-search-platform/enterprise
      • http://lucene.apache.org/solr - wiki, e-mail lists
  62. Thank You!