Hippo Get Together presentation: Solr integration

By Ard Schrijvers

Published in: Technology


  1. 1. Solr integration. April 20, 2012. Ard Schrijvers • a.schrijvers@onehippo.com / ard@apache.org
  2. 2. About me: Ard Schrijvers
      1. Working at Hippo since 2001
      2. Email: a.schrijvers@onehippo.com, ard@apache.org
      3. Worked primarily on:
          1. HST
          2. Hippo Repository / Jackrabbit
          3. Lucene
          4. Cocoon
          5. Slide
      4. Apache committer of Jackrabbit and Cocoon
  9. 9. Outline
      1. The current search (HST / repo) architecture
      2. The current problems / shortcomings / mismatches
      3. What we are trying to improve, the objectives
      4. Solr integration to rescue
      5. A very fast demo
      6. Wrap up
      7. Questions
  10. 10. Current search architecture
  11. 11. Current search architecture. An HSTQuery is translated to an XPath query, which is delegated to the repository. The repository returns a JCR NodeIterator, which the HST binds back to HippoBeans
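To make the translation step concrete, here is a minimal sketch of turning a scoped free-text search into a JCR XPath statement. It is not the HST's actual translator: the class `XPathSketch`, its method names and the example scope `/content/documents` are assumptions made for illustration. Only the node type `demosite:textdocument` is taken from the later slides.

```java
import java.util.Objects;

final class XPathSketch {

    /** Escapes text for a single-quoted XPath string literal by doubling single quotes. */
    static String escapeLiteral(String text) {
        return text.replace("'", "''");
    }

    /**
     * Builds a JCR XPath statement that searches below scopePath for nodes of the
     * given primary type whose indexed content matches the free-text terms.
     * All three parameters are hypothetical inputs of this sketch.
     */
    static String toXPath(String scopePath, String nodeType, String freeText) {
        Objects.requireNonNull(scopePath, "scopePath");
        Objects.requireNonNull(nodeType, "nodeType");
        Objects.requireNonNull(freeText, "freeText");
        return "/jcr:root" + scopePath
                + "//element(*, " + nodeType + ")"
                + "[jcr:contains(., '" + escapeLiteral(freeText) + "')]";
    }

    public static void main(String[] args) {
        System.out.println(toXPath("/content/documents", "demosite:textdocument", "hippo"));
    }
}
```

The repository then has to evaluate the hierarchical part (`//element(...)` below a path) against an index that is flat by nature, which is where much of the complexity described on the next slides comes from.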
  12. 12. Current search architecture. That sounds doable and not too complex, doesn't it?
  14. 14. Current search architecture. Well, it is ....... very complex
  18. 18. Current search architecture. Reasons:
      1. Back in the days when Jackrabbit 1 started, Lucene was at version 1.4
      2. The first JSR-170 spec imposed some very harsh constraints: a save must result in directly updated search results
      3. Support for XPath / SQL was needed. However, Lucene likes flattened data, whereas JCR with XPath / SQL is all about hierarchical data
      4. JCR Nodes != Documents
  19. 19. Outline
      1. The current search (HST / repo) architecture
      2. The current problems / shortcomings / mismatches
      3. What we are trying to improve, the objectives
      4. Solr integration to rescue
      5. A short HOWTO as developer
      6. A very fast demo
      7. Wrap up
      8. Questions
  28. 28. Current problems / shortcomings / mismatches
      1. JCR Nodes are indexed instead of Documents (#nodes >> #documents)
      2. A search result only returns Nodes (Rows): what if you want something else, like auto-completion?
      3. Very hard to customize, and only to a very limited extent
      4. A single index for an entire workspace
      5. Support for very complex XPath / SQL queries at the price of CPU, memory and complexity
      6. Only JCR Nodes and properties are indexed: no derived field indexes
      7. To index external sources, the sources need to be stored in the repository
      8. Range queries (and others) easily blow up
      9. Getting the number of hits is complex
  29. 29. Current problems / shortcomings / mismatches. Extra problem: JCR Nodes != Documents. For example: a news document contains a link to an author document. The news document should be found through the author name
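The author example can be sketched as a flattening step: when the news document is indexed, the name of the linked author is copied into the news document's own field set, so a search on the author name hits the news item. The types `News` and `Author`, the field names and the naive `matches` method are all assumptions for illustration. A real index such as Solr would do the matching.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FlattenSketch {

    /** Hypothetical author document, stored separately from the news item. */
    record Author(String name) {}

    /** Hypothetical news document that only holds a link to its author. */
    record News(String title, Author author) {}

    /** Copies the linked author's name into the flat field map indexed for the news document. */
    static Map<String, String> toIndexFields(News news) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", news.title());
        if (news.author() != null) {
            fields.put("author", news.author().name());
        }
        return fields;
    }

    /** Naive stand-in for a search engine: true if any indexed field contains the term. */
    static boolean matches(Map<String, String> fields, String term) {
        String needle = term.toLowerCase();
        return fields.values().stream().anyMatch(v -> v.toLowerCase().contains(needle));
    }

    public static void main(String[] args) {
        News news = new News("Solr integration announced", new Author("Ard Schrijvers"));
        System.out.println(matches(toIndexFields(news), "schrijvers"));
    }
}
```

The catch with this kind of denormalization is that the news document must be reindexed when the author document changes, which a node-per-node index does not do on its own.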
  31. 31. Objectives
      1. Fix all the 9+ problems / shortcomings / mismatches from previous slides
      2. Easy to use and customize
      3. Satisfied customers
      4. Satisfied partners
      5. Scalable searches: CPU, memory and large document numbers
      6. Document oriented
      7. Integration with HST ContentBeans (HippoBeans)
      8. Index external sources
      9. Control the SIZE of the index yourself
      10. Don't invent but integrate (with out-of-the-box features supported by a large community)
  33. 33. Objective: Fix all the 9 problems / shortcomings / mismatches from previous slides. Easy: Solr integration to rescue
  35. 35. Objective: Easy to use and customize. YOU will be in the driver's seat
  38. 38. Objective: Easy to use and customize. No more complete dependence on what the sometimes not so smAR&D Hippo team thought was good for YOU
  44. 44. Objective: Easy to use and customize. You decide from where, what, how and when to index
      1. from where: which sources (JCR, web pages, database, NoSQL store, Nuxeo, Alfresco, anything)
      2. what: which parts of a document (not JCR node) or external source
      3. how:
          1. which analyzer
          2. index on document level, property level or both
          3. store the text
      4. when: when do you want to index
  50. 50. Objective: Easy to use and customize. But of course, out-of-the-box support and tooling ready to be used by YOU
      1. Default Hippo repository indexer & observer
      2. ContentBean (HippoBean) annotations for indexing
      3. Binding search results to ContentBeans
      4. Deployment support
      5. Clustering support
  52. 52. Objective: Satisfied customers HOW?
  53. 53. Objective: Satisfied customers EASY
  54. 54. Objective: Satisfied customers Most likely they just will be satisfied
  55. 55. Objective: Satisfied customers. If they are not satisfied enough you can:
      1. Easily customize it (aka tune it until you are blue in the face)
      2. Hire anyone with Solr experience: all our partners have Solr experience
  56. 56. Objective: Satisfied customers. Still not satisfied? Let them pay too much for a Google Search Appliance, Autonomy or any other software that is not worth paying for
  58. 58. Objective: Satisfied partners. Although on thin ice here, I strongly believe in this because:
  64. 64. Objective: Satisfied partners
      1. Our partners frequently have good knowledge about Solr
      2. Our partners depend less on the current search limitations
      3. Our partners can pitch with their Solr knowledge
      4. Our partners can sell more Hippo implementations
      5. Our partners will earn more on Hippo and have happier developers
      6. Hippo will earn more through HES, which will satisfy partners again, because Hippo can spend more on AR&D ==> more features
  68. 68. Objective: Scalable searches
      1. Using Solr to do the searches
      2. Not the complex JCR hierarchical searches
      3. Document oriented instead of JCR Nodes (#docs << #nodes)
  70. 70. Objective: Document oriented What do we want to search for?
  71. 71. Objective: Document oriented Exactly, Documents!!
  72. 72. Objective: Document oriented A Document == A HippoBean != JCR Node
  74. 74. Objective: Document oriented. So let's index HippoBeans (ContentBeans)
  76. 76. Objective: Integration with ContentBeans (HippoBeans). As a developer .... how am I going to index my beans?
  77. 77. Objective: Integration with ContentBeans (HippoBeans). I know how to write HippoBeans, that's all I ever did in my life
  78. 78. Objective: Integration with ContentBeans (HippoBeans). How do you expect me to index my beans?
  79. 79. Objective: Integration with ContentBeans (HippoBeans). Annotate your getters with @IndexField or @IndexField(name="foo"), and account for them in the Solr schema.xml:

          <field name="title" type="text_general" indexed="true" stored="true" />
          <field name="summary" type="text_general" indexed="true" stored="true" />
  80. 80. Objective: Integration with ContentBeans (HippoBeans). An example:

          @Node(jcrType="demosite:textdocument")
          public class TextBean extends BaseDocument {

              // Indexed under the default field name derived from the getter
              @IndexField
              public String getTitle() {
                  return getProperty("demosite:title");
              }

              // Indexed under the explicit field name "samenvatting"
              @IndexField(name="samenvatting")
              public String getSummary() {
                  return getProperty("demosite:summary");
              }
          }
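How an indexer might pick up such annotated getters can be sketched with plain reflection. The `IndexField` annotation below is a local stand-in declared only for this sketch, not the HST one, and the rule "strip `get` and lower-case the first letter" for the default field name is an assumption. The slides do not show how the HST derives names.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

final class IndexFieldSketch {

    /** Hypothetical stand-in annotation; an empty name means "derive it from the getter". */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface IndexField {
        String name() default "";
    }

    /** Plain demo bean with fixed values; it has no JCR node behind it. */
    static class TextBean {
        @IndexField
        public String getTitle() { return "Hello"; }

        @IndexField(name = "samenvatting")
        public String getSummary() { return "A short summary"; }

        // Not annotated, so it is not indexed.
        public String getInternalNote() { return "not indexed"; }
    }

    /** Turns "getTitle" into "title" (assumed naming rule). */
    static String defaultFieldName(Method getter) {
        String n = getter.getName();
        if (n.startsWith("get") && n.length() > 3) {
            return Character.toLowerCase(n.charAt(3)) + n.substring(4);
        }
        return n;
    }

    /** Collects field name to value for every annotated no-argument public method. */
    static Map<String, Object> collectFields(Object bean) {
        Map<String, Object> fields = new TreeMap<>();
        for (Method m : bean.getClass().getMethods()) {
            IndexField ann = m.getAnnotation(IndexField.class);
            if (ann == null || m.getParameterCount() != 0) {
                continue;
            }
            String name = ann.name().isEmpty() ? defaultFieldName(m) : ann.name();
            try {
                fields.put(name, m.invoke(bean));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read " + m.getName(), e);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(collectFields(new TextBean()));
    }
}
```

Every field name produced this way, including an overridden one such as `samenvatting`, would still need a matching `<field>` entry (or a dynamic field) in schema.xml.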
  81. 81. Objective: Integration with ContentBeans (HippoBeans). Another example:

          @Node(jcrType="demosite:textdocument")
          public class TextBean extends BaseDocument {

              @IndexField
              public String getTitle() {
                  return getProperty("demosite:title");
              }

              @IndexField
              public String getSummary() {
                  return getProperty("demosite:summary");
              }

              // Index the name of the linked author document with this document
              @IndexField
              public String getAuthor() {
                  return getLinkedBean("demosite:author", Author.class).getAuthor();
              }
          }
  82. 82. Objective: Integration with ContentBeans (HippoBeans). Another example:

          @Node(jcrType="demosite:textdocument")
          public class TextBean extends BaseDocument {

              @IndexField
              public String getTitle() {
                  return getProperty("demosite:title");
              }

              @IndexField
              public String getSummary() {
                  return getProperty("demosite:summary");
              }

              // Returns the linked Author bean itself, marked with @ReIndexOnChange
              @ReIndexOnChange
              @IndexField
              public Author getAuthor() {
                  return getLinkedBean("demosite:author", Author.class);
              }
          }
  83. 83. Objective: Integration with ContentBeans (HippoBeans). Another example: Setters

          @Node(jcrType="demosite:textdocument")
          public class TextBean extends BaseDocument {

              private String title;
              private String summary;

              // Fall back to the JCR property only when no value was set explicitly
              @IndexField
              public String getTitle() {
                  return title == null ? getProperty("demosite:title") : title;
              }

              public void setTitle(String title) {
                  this.title = title;
              }

              @IndexField
              public String getSummary() {
                  return summary == null ? getProperty("demosite:summary") : summary;
              }

              public void setSummary(String summary) {
                  this.summary = summary;
              }
          }

      Bonus: What can we achieve with the setters?
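The slide leaves the bonus question open. One plausible answer, in line with the earlier "Binding search results to ContentBeans" item, is that setters let a bean be filled from the fields stored in the index, without touching a JCR node. The sketch below illustrates that reading only. The `bind` method and the map of stored fields are assumptions, not the HST binding code.

```java
import java.util.Map;

final class SetterBindingSketch {

    /** Simplified bean with setters; the JCR fallback of the slide example is left out. */
    static class TextBean {
        private String title;
        private String summary;

        public String getTitle() { return title; }
        public void setTitle(String title) { this.title = title; }

        public String getSummary() { return summary; }
        public void setSummary(String summary) { this.summary = summary; }
    }

    /** Fills a bean from the stored fields of a search hit; absent fields stay null. */
    static TextBean bind(Map<String, String> storedFields) {
        TextBean bean = new TextBean();
        bean.setTitle(storedFields.get("title"));
        bean.setSummary(storedFields.get("summary"));
        return bean;
    }

    public static void main(String[] args) {
        TextBean bean = bind(Map.of("title", "Hello", "summary", "A short summary"));
        System.out.println(bean.getTitle() + " / " + bean.getSummary());
    }
}
```

With the getters from the slide, a value set this way takes precedence and the JCR property is only read when the field was not populated.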
  84. 84. Objective: Integration with ContentBeans (HippoBeans). That's all you need to do. And the HST binds some extra indexing fields, like
      1. The path
      2. The canonicalUUID
      3. The name
      4. The localized name
      5. The depth
      6. The class hierarchy (including interfaces)
  88. 88. Objective: Index external sources. You can
      1. Push them directly to Solr
      2. Push them to an HST JAX-RS resource that binds to a ContentBean and commits to Solr
      3. Crawl from the HST, bind to ContentBeans and commit them to Solr
  89. 89. Objective: Index external sources. A ContentBean does *not* need a JCR Node! The ContentBean interface:

          public interface ContentBean {

              // The path doubles as the document id in the index
              @IndexField(name="id")
              String getPath();

              void setPath(String path);
          }
  90. 90. Objective: Index external sources. An example: GoGreenProductBean in the Testsuite

          public class GoGreenProductBean implements ContentBean {

              private String path;
              private String title;
              private String summary;
              private String description;

              public String getPath() { return path; }
              public void setPath(final String path) { this.path = path; }

              @IndexField
              public String getTitle() { return title; }
              public void setTitle(String title) { this.title = title; }

              @IndexField
              public String getSummary() { return summary; }
              public void setSummary(String summary) { this.summary = summary; }

              @IndexField
              public String getDescription() { return description; }
              public void setDescription(String description) { this.description = description; }
          }
  91. 91. Objective: Index external sources. And add the GoGreenProductBean to Solr

          {
              List<GoGreenProductBean> gogreenBeans = new ArrayList<GoGreenProductBean>();
              // FILL THE gogreenBeans LIST

              // NOW ADD TO INDEX
              HippoSolrManager solrManager = HstServices.getComponentManager().getComponent(
                      HippoSolrManager.class.getName(), SOLR_MODULE_NAME);
              try {
                  solrManager.getSolrServer().addBeans(gogreenBeans);
                  UpdateResponse commit = solrManager.getSolrServer().commit();
              } catch (IOException e) {
                  e.printStackTrace();
              } catch (SolrServerException e) {
                  e.printStackTrace();
              }
          }
  93. 93. Objective: Control the SIZE of the index yourself. JCR / Jackrabbit / Hippo Repository has a generic one-fits-all index (or one-fits-none index), which easily grows very large and can hardly be customized
  94. 94. Objective: Control the SIZE of the index yourself. However, search is domain specific. Thus, index just what the customer needs
  96. 96. Objective: Don't invent but integrate
      • Use Solr
      • Use the Solrj client
      • Expose the Solrj SolrQuery
  97. 97. Objective: Don't invent but integrate. For example:

          HippoSolrManager solrManager = ...
          String query = ...
          HippoQuery hippoQuery = solrManager.createQuery(query);
          hippoQuery.setLimit(pageSize);
          hippoQuery.setOffset((page - 1) * pageSize);

          // hippoQuery.getSolrQuery() is the SolrQuery object
          // include scoring
          hippoQuery.getSolrQuery().setIncludeScore(true);
          hippoQuery.getSolrQuery().setHighlight(true);
          hippoQuery.getSolrQuery().setHighlightFragsize(200);
          hippoQuery.getSolrQuery().addHighlightField("title");
          hippoQuery.getSolrQuery().addHighlightField("summary");
          hippoQuery.getSolrQuery().addHighlightField("htmlContent");

          HippoQueryResult result = hippoQuery.execute(true);
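The paging arithmetic in `setOffset((page - 1) * pageSize)` assumes one-based page numbers and a zero-based offset. A small worked sketch, with the helper name `offsetFor` invented for illustration:

```java
final class PagingSketch {

    /** Zero-based offset of the first hit on a one-based page. */
    static int offsetFor(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        // multiplyExact fails loudly on int overflow instead of wrapping around
        return Math.multiplyExact(page - 1, pageSize);
    }

    public static void main(String[] args) {
        System.out.println(offsetFor(1, 10)); // first page starts at hit 0
        System.out.println(offsetFor(3, 10)); // third page starts at hit 20
    }
}
```

Validating `page` before computing the offset matters when the value comes straight from a request parameter: a page of 0 would otherwise produce a negative offset.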
  100. 100. Solr integration to rescue No further comments :-)
  102. 102. A very fast demo. Setup: ~75,000 long Wikipedia docs in the repository ... doing the demo ...
  103. 103. That was : a very fast demo
  111. 111. Wrap up. I think that with the Solr integration
       1. Developers will be happier
       2. Customers will be happier
       3. Partners will be happier
       4. Hippo will be happier
       And finally, last and least:
       5. Infra will be happier because the servers stop sweating
  113. 113. Questions? Check out the example at: http://svn.onehippo.org/repos/hippo/hippo-cms7/testsuite/trunk
